# Basic Feature Engineering

Time series data must be reframed as a supervised learning dataset before machine learning algorithms can be applied to it. A raw time series has no built-in notion of input and output features. Instead, we choose the variable to predict and use feature engineering to construct the inputs that will be used to make predictions for future time steps.

### Minimum Daily Temperature Dataset

In [40]:
from pandas import read_csv
from pandas import DataFrame

# Load the first 3650 rows; the first column holds the dates and becomes the index.
# squeeze('columns') turns the one-column DataFrame into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0).
series = read_csv('daily-minimum-temperatures.csv', header=0, parse_dates=[0],
                  index_col=0, nrows=3650).squeeze('columns')
# Build date-time features from the index alongside the observed temperature.
dataframe = DataFrame()
dataframe['month'] = series.index.month
dataframe['day'] = series.index.day
dataframe['temperature'] = series.values
print(dataframe.head(5))

   month  day temperature
0      1    1        20.7
1      1    2        17.9
2      1    3        18.8
3      1    4        14.6
4      1    5        15.8


## Creating Lag Features
Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time (t+1) given the value at the current time (t).

### Lag=1 features

In [43]:
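from pandas import concat
temps = DataFrame(series.values)
# Column 't' is the series shifted down one step (the previous value);
# column 't+1' is the unshifted series, i.e. the value to predict.
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']
dataframe.head(5)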

      t   t+1
0   NaN  20.7
1  20.7  17.9
2  17.9  18.8
3  18.8  14.6
4  14.6  15.8
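To use this frame for supervised learning, the row containing NaN has to be dealt with, since it has no previous observation. The following is a minimal sketch of one option, dropping that row and splitting the rest into inputs and targets. It hard-codes the first five temperatures shown above, and the names `X` and `y` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# First five temperatures, copied from the output above.
temps = pd.DataFrame([20.7, 17.9, 18.8, 14.6, 15.8])

# Same lag=1 framing: previous value as input, current value as target.
frame = pd.concat([temps.shift(1), temps], axis=1)
frame.columns = ['t', 't+1']

# The first row has no previous observation, so drop it.
frame = frame.dropna()

X = frame[['t']].values   # inputs, shape (n_samples, 1)
y = frame['t+1'].values   # targets, shape (n_samples,)
print(X.ravel(), y)
```

Dropping rows is the simplest choice; with more lags, more leading rows are lost.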


### Lag=3 features

In [47]:
temps = DataFrame(series.values)
dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']
print(dataframe.head(5))

    t-2   t-1     t   t+1
0   NaN   NaN   NaN  20.7
1   NaN   NaN  20.7  17.9
2   NaN  20.7  17.9  18.8
3  20.7  17.9  18.8  14.6
4  17.9  18.8  14.6  15.8
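The same pattern extends to any number of lags. The sketch below wraps it in a small helper; `make_lag_frame` and its `n_lags` parameter are hypothetical names introduced here for illustration, and the input is the first five temperatures from the output above rather than the full CSV.

```python
import pandas as pd

def make_lag_frame(values, n_lags):
    """Hypothetical helper: n_lags lagged input columns plus a 't+1' target column."""
    temps = pd.DataFrame(list(values))
    # shift(n_lags) is the oldest input, shift(1) the most recent one.
    columns = [temps.shift(k) for k in range(n_lags, 0, -1)]
    columns.append(temps)
    frame = pd.concat(columns, axis=1)
    # Name inputs relative to the most recent observation t: ..., 't-2', 't-1', 't'.
    names = ['t' if k == 1 else f't-{k - 1}' for k in range(n_lags, 0, -1)]
    frame.columns = names + ['t+1']
    return frame

frame = make_lag_frame([20.7, 17.9, 18.8, 14.6, 15.8], n_lags=3)
print(frame)
```

With `n_lags=3` this reproduces the column layout of the cell above; the first `n_lags` rows contain at least one NaN.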


## Rolling Window Statistics

A step beyond adding raw lagged values is to add a summary of the values at previous time steps. We can calculate summary statistics across the values in the sliding window and include these as features in our dataset. Perhaps the most useful is the mean of the previous few values, also called the rolling mean.

In [1]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat

# squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0.
series = read_csv('daily-minimum-temperatures.csv', header=0, parse_dates=[0],
                  index_col=0, nrows=3650).squeeze('columns')
# Strip any leading '?' from the raw string values before converting to float.
series = series.map(lambda x: x.lstrip('?'))
series = series.astype(float)
temp = DataFrame(series.values)
# Shift by one step so each window contains only values before the target.
shifted = temp.shift(1)
window = shifted.rolling(window=2)
mean = window.mean()
dataframe = concat([mean, temp], axis=1)
dataframe.columns = ['mean(t-1, t)', 't+1']
dataframe.head(5)

   mean(t-1, t)   t+1
0           NaN  20.7
1           NaN  17.9
2         19.30  18.8
3         18.35  14.6
4         16.70  15.8


In [5]:
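from pandas import read_csv
from pandas import DataFrame
from pandas import concat

# squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0.
series = read_csv('daily-minimum-temperatures.csv', header=0, parse_dates=[0],
                  index_col=0, nrows=3650).squeeze('columns')
# Strip any leading '?' from the raw string values before converting to float.
series = series.map(lambda x: x.lstrip('?'))
series = series.astype(float)
temps = DataFrame(series.values)
width = 3
# Note: shifting by width - 1 (here 2) means the window for row i covers rows
# i-4..i-2, so the observation just before the target (row i-1) is left out.
# Use shift(1) instead if the window should end at the most recent prior value.
shifted = temps.shift(width - 1)
window = shifted.rolling(window=width)
# Minimum, mean and maximum over each window, next to the value to predict.
dataframe = concat([window.min(), window.mean(), window.max(), temps], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
dataframe.head(10)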

    min       mean   max   t+1
0   NaN        NaN   NaN  20.7
1   NaN        NaN   NaN  17.9
2   NaN        NaN   NaN  18.8
3   NaN        NaN   NaN  14.6
4  17.9  19.133333  20.7  15.8
5  14.6  17.100000  18.8  15.8
6  14.6  16.400000  18.8  15.8
7  14.6  15.400000  15.8  17.4
8  15.8  15.800000  15.8  21.8
9  15.8  16.333333  17.4  20.0
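The first non-NaN statistics appear at row 4 because the cell shifts by `width - 1` before rolling. As a rough sketch of how the shift amount changes the alignment, the block below compares that choice with `shift(1)` on the ten temperatures shown in the `t+1` column above. The column names `mean_shift_2` and `mean_shift_1` are made up for this comparison.

```python
import pandas as pd

# First ten temperatures, copied from the 't+1' column above.
temps = pd.DataFrame([20.7, 17.9, 18.8, 14.6, 15.8, 15.8, 15.8, 17.4, 21.8, 20.0])
width = 3

# As in the cell above: shift by width - 1, then roll over `width` values.
gap_mean = temps.shift(width - 1).rolling(window=width).mean()
# Alternative: shift by 1 so the window ends at the most recent prior value.
adjacent_mean = temps.shift(1).rolling(window=width).mean()

comparison = pd.concat([gap_mean, adjacent_mean, temps], axis=1)
comparison.columns = ['mean_shift_2', 'mean_shift_1', 't+1']
print(comparison)
```

Neither variant uses the target itself as an input; they differ only in whether the value immediately before the target is part of the window.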


## Expanding Window Statistics

Another type of window that may be useful includes all previous data in the series. This is called an expanding window, and it can help keep track of the bounds of the observed data. Just as DataFrame provides rolling(), Pandas provides an expanding() function that collects all prior values for each time step.

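In [6]:
# Creating expanding window features
from pandas import read_csv
from pandas import DataFrame
from pandas import concat

# squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0.
series = read_csv('daily-minimum-temperatures.csv', header=0, parse_dates=[0],
                  index_col=0, nrows=3650).squeeze('columns')
# Strip any leading '?' from the raw string values before converting to float.
series = series.map(lambda x: x.lstrip('?'))
series = series.astype(float)
temps = DataFrame(series.values)
window = temps.expanding()
# The expanding statistics at each row include that row's own value, so pair them
# with the next observation (shift(-1)) to keep the target out of its own inputs.
dataframe = concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
dataframe.head(5)

    min       mean   max   t+1
0  20.7  20.700000  20.7  17.9
1  17.9  19.300000  20.7  18.8
2  17.9  19.133333  20.7  14.6
3  14.6  18.000000  20.7  15.8
4  14.6  17.560000  20.7  15.8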
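One way to read the expanding statistics is as running totals over everything seen so far, current row included. The sketch below checks that reading on the first five temperatures of the series by comparing `expanding()` against `cummin()`, `cummax()` and a cumulative sum divided by the row count. The frame names `window_stats` and `running` are illustrative.

```python
import pandas as pd

# First five temperatures of the series.
temps = pd.Series([20.7, 17.9, 18.8, 14.6, 15.8])

# Statistics from the expanding window.
expanding = temps.expanding()
window_stats = pd.DataFrame({
    'min': expanding.min(),
    'mean': expanding.mean(),
    'max': expanding.max(),
})

# The same quantities written as cumulative operations.
counts = pd.Series(range(1, len(temps) + 1))
running = pd.DataFrame({
    'min': temps.cummin(),
    'mean': temps.cumsum() / counts,
    'max': temps.cummax(),
})

# Largest absolute difference between the two frames (expected to be ~0).
print((window_stats - running).abs().max().max())
```

Because each row's statistics already contain that row's value, they should only be paired with a later observation as the prediction target.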
