## Load data

In [8]:
# load dataset using read_csv()
from pandas import read_csv
df = read_csv('daily-total-female-births.csv', header=0, index_col=0, parse_dates=True)
# squeeze the single-column DataFrame into a Series
series = df.squeeze()
print(type(series))
series.head()

<class 'pandas.core.series.Series'>


Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Births, dtype: int64

## Data exploration

It is a good idea to take a peek at your loaded data to confirm that the types, dates, and data
loaded as you intended.

In [16]:
series.head(5)

Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
Name: Births, dtype: int64

Number of observations: checking how many observations were loaded can help flush out issues
with column headers not being handled as intended, and gives an idea of how to divide up the
data later for use with supervised learning algorithms.

In [17]:
series.size

365

You can slice, dice, and query your series using the time index. For example, you can access all
observations in January as follows:

In [14]:
series['1959-01']

Date
1959-01-01    35
1959-01-02    32
1959-01-03    30
1959-01-04    31
1959-01-05    44
1959-01-06    29
1959-01-07    45
1959-01-08    43
1959-01-09    38
1959-01-10    27
1959-01-11    38
1959-01-12    33
1959-01-13    55
1959-01-14    47
1959-01-15    45
1959-01-16    37
1959-01-17    50
1959-01-18    43
1959-01-19    41
1959-01-20    52
1959-01-21    34
1959-01-22    53
1959-01-23    39
1959-01-24    32
1959-01-25    37
1959-01-26    43
1959-01-27    39
1959-01-28    35
1959-01-29    44
1959-01-30    38
1959-01-31    24
Name: Births, dtype: int64

Calculating descriptive statistics on your time series can help you get an idea of the distribution and
spread of values. This may suggest ideas for data scaling and even data cleaning that you can
perform later as part of preparing your dataset for modeling.

In [13]:
series.describe()

count    365.000000
mean      41.980822
std        7.348257
min       23.000000
25%       37.000000
50%       42.000000
75%       46.000000
max       73.000000
Name: Births, dtype: float64

## Feature engineering 

Feature engineering reframes the time series as inputs and outputs so that we can train a supervised learning algorithm. Input variables are also called features in the field of machine learning, and the task before us is to create or invent new input features from our time series dataset. Ideally, we only want input features that best help the learning methods model the relationship between the inputs (X) and the outputs (y) that we would like to predict. In this tutorial, we will look at three classes of features that we can create from our time series dataset:
- Date Time Features: these are components of the time step itself for each observation.
- Lag Features: these are values at prior time steps.
- Window Features: these are a summary of values over a fixed window of prior time
steps.

#### Date Time Features

In [31]:
from pandas import read_csv
from pandas import DataFrame
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0,
parse_dates=True).squeeze()
series.head()

Date
1981-01-01    20.7
1981-01-02    17.9
1981-01-03    18.8
1981-01-04    14.6
1981-01-05    15.8
Name: Temp, dtype: float64

In [37]:
# build the feature table directly from the DatetimeIndex attributes
dataframe = DataFrame({
    'month': series.index.month,
    'day': series.index.day,
    'temperature': series.values,
})
dataframe.head(5)

   month  day  temperature
0      1    1         20.7
1      1    2         17.9
2      1    3         18.8
3      1    4         14.6
4      1    5         15.8


**But are these features enough for our prediction?**

Using the month and day information alone to predict temperature is a crude approach
and will likely result in a poor model. Nevertheless, this information coupled with additional
engineered features may ultimately result in a better model. You may enumerate all the
properties of a time-stamp and consider what might be useful for your problem, such as:
- Minutes elapsed for the day.
- Hour of day.
- Business hours or not.

From these examples, you can see that you're not restricted to the raw integer value. You
can use binary flag features as well, like whether or not the observation was recorded on a
public holiday. In the case of the minimum temperature dataset, maybe the season would be more
relevant. It is creating domain-specific features like this that are more likely to add value to
your model. Date-time based features are a good start, but it is often a lot more useful to
include the values at previous time steps. These are called lagged values and we will look at
adding these features in the next section.
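As a rough illustration of the binary flag and season ideas above, the sketch below derives a weekend flag and a season label from the time index. The six values are copied from the outputs shown earlier so that no CSV file is needed; the helper `month_to_season` is hypothetical, and its Southern Hemisphere month-to-season mapping is an assumption that you should adapt to wherever your data was recorded.

```python
import pandas as pd

# Synthetic stand-in for the temperature series (assumption: six daily values
# copied from the outputs above, so no CSV file is needed to run this).
index = pd.date_range("1981-01-01", periods=6, freq="D")
series = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8, 15.8], index=index, name="Temp")


def month_to_season(month):
    """Hypothetical helper: Southern Hemisphere meteorological seasons (assumed)."""
    if month in (12, 1, 2):
        return "summer"
    if month in (3, 4, 5):
        return "autumn"
    if month in (6, 7, 8):
        return "winter"
    return "spring"


features = pd.DataFrame({
    "month": series.index.month,
    "day": series.index.day,
    "day_of_week": series.index.dayofweek,        # Monday=0 ... Sunday=6
    "is_weekend": series.index.dayofweek >= 5,    # binary flag feature
    "season": [month_to_season(m) for m in series.index.month],
    "temperature": series.values,
})
print(features)
```

Whether any of these columns actually helps is something to check against a model; they are candidates, not guaranteed improvements.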

#### Lag features

Lag features are the classical way that time series forecasting problems are transformed into
supervised learning problems. The simplest approach is to predict the value at the next time
step (t+1) given the value at the current time step (t).

In [39]:
temps = DataFrame(series.values)
temps.head()

      0
0  20.7
1  17.9
2  18.8
3  14.6
4  15.8


In [47]:
from pandas import concat
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']
# discard the first row: its lagged value is NaN, so it cannot be used to train a supervised model
dataframe = dataframe.drop(dataframe.index[0])
dataframe.head(5) 

      t   t+1
1  20.7  17.9
2  17.9  18.8
3  18.8  14.6
4  14.6  15.8
5  15.8  15.8


In [44]:
# we can also include, for example, the last 3 observed values
temps = DataFrame(series.values)
dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']
dataframe.head(5)

    t-2   t-1     t   t+1
0   NaN   NaN   NaN  20.7
1   NaN   NaN  20.7  17.9
2   NaN  20.7  17.9  18.8
3  20.7  17.9  18.8  14.6
4  17.9  18.8  14.6  15.8


**How large should the window be for your problem?**

A difficulty with the sliding window approach is **how large to make
the window for your problem**. A good starting point is to perform a sensitivity analysis:
try a suite of different window widths, which in turn creates a suite of different views of your
dataset, and see which results in better performing models. There will be a point of diminishing
returns.

Additionally, why stop with a contiguous window? Perhaps you need a lag value from last week,
last month, and last year. Again, this comes down to the specific domain. In the case of the
temperature dataset, a lag value from the same day in the previous year or previous few years
may be useful. We can do more with a window than include the raw values. In the next section,
we'll look at including features that summarize statistics across the window.
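One way to experiment with different window widths and non-contiguous lags is to wrap the shifting logic in a small function. The sketch below is only an illustration: `make_lag_frame` is a hypothetical helper, not part of pandas, and the series is synthetic (the values 0 to 29) so that the alignment of each lag is easy to read off.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (assumption): values 0..29 make each lag easy to verify.
index = pd.date_range("1981-01-01", periods=30, freq="D")
series = pd.Series(np.arange(30, dtype=float), index=index)


def make_lag_frame(values, lags):
    """Hypothetical helper: one column per lag plus the target, incomplete rows dropped."""
    columns = {f"lag_{lag}": values.shift(lag) for lag in lags}
    columns["target"] = values
    return pd.DataFrame(columns).dropna()


# A contiguous window of the last 3 observations.
contiguous = make_lag_frame(series, lags=[1, 2, 3])

# A non-contiguous view: yesterday plus the same weekday last week.
weekly = make_lag_frame(series, lags=[1, 7])

print(contiguous.head(3))
print(weekly.head(3))
```

Note that the largest lag determines how many leading rows are lost: a lag of 7 discards the first 7 rows, and a lag of 365 would discard the first year of data.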

#### Rolling Window Statistics

A step beyond adding raw lagged values is to add a summary of the values at previous time
steps. We can calculate **summary statistics across the values in the sliding window and include
these as features in our dataset**. Perhaps the most useful is the mean of the previous few values,
also called the rolling mean.

We can calculate the mean of the current and previous values and use that to predict the
next value. For the temperature data, we need two observed values before we can take their
average, so the first prediction we can make from a rolling mean is for the 3rd value.
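A minimal sketch of this idea, assuming a window width of 2 and using the first six temperature values shown earlier instead of reading the CSV file: the series is shifted by one step so that the window only contains values observed before the target, then `rolling(window=2).mean()` averages each pair.

```python
import pandas as pd

# First six temperature values from the outputs above (no CSV needed).
temps = pd.DataFrame([20.7, 17.9, 18.8, 14.6, 15.8, 15.8])

# Shift by one step so the window never includes the value being predicted.
shifted = temps.shift(1)

# Mean of the two most recent observed values.
means = shifted.rolling(window=2).mean()

dataframe = pd.concat([means, temps], axis=1)
dataframe.columns = ['mean(t-1,t)', 't+1']
print(dataframe)
```

The first two rows have no rolling mean (NaN) because fewer than two prior values are available; the first usable row is the third, where the mean of 20.7 and 17.9 is paired with the target 18.8.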