#  Time Series Feature Generation


This notebook focuses on feature generation for time series data.

Source of this tutorial: https://machinelearningmastery.com/basic-feature-engineering-time-series-data-python/


In this part, we use the daily stock price of Apple Inc. over one year (28 Dec 2020 to 27 Dec 2021), obtained from Yahoo Finance. First, the time series is loaded as a Pandas DataFrame indexed by date. We then create a new Pandas DataFrame for the transformed dataset. Next, each column is added one at a time: the month and day are extracted from the time-stamp of each observation, and the price is copied from the Open column. Below is the Python code to do this.

In [None]:
# create date time features of a dataset
from pandas import read_csv
from pandas import DataFrame
# parse the first column as dates so that the index is a DatetimeIndex
series = read_csv('data/AAPL.csv', header=0, index_col=0, parse_dates=True)
dataframe = DataFrame()
# month and day of the month, taken from the time-stamp of each observation
dataframe['month'] = series.index.month
dataframe['day'] = series.index.day
# daily opening price
dataframe['price'] = series['Open'].values
print(dataframe.head(15))
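
Other calendar attributes can be extracted from the time-stamp in the same way. The sketch below is only an illustration: it builds a short business-day index with `date_range` as a stand-in for the index loaded from the CSV file, and the column names are assumptions chosen for this example.

```python
import pandas as pd

# Illustrative business-day index starting on the first date of the series;
# it is a stand-in for the real index loaded from the CSV file.
index = pd.date_range('2020-12-28', periods=5, freq='B')

features = pd.DataFrame(index=index)
features['dayofweek'] = index.dayofweek        # Monday=0 ... Sunday=6
features['quarter'] = index.quarter            # calendar quarter, 1 to 4
features['is_month_end'] = index.is_month_end  # True on the last calendar day of a month
print(features)
```

Whether any of these columns helps depends on the data, so they are best treated as candidates to evaluate rather than features to add by default.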

### Lag Features

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time (t+1) given the value at the previous time (t-1). The supervised learning problem with shifted values looks as follows:

Value(t-1), Value(t+1)

Value(t-1), Value(t+1)

Value(t-1), Value(t+1)

The Pandas library provides the shift() function to help create these shifted or lag features from a time series dataset. Shifting the dataset by 1 creates the t-1 column, adding a NaN (unknown) value for the first row. The time series dataset without a shift represents the t+1 column.

Below is an example of creating a lag feature for our daily stock price dataset. The values are extracted from the loaded series and a shifted and unshifted list of these values is created. Each column is also named in the DataFrame for clarity.

In [2]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
# use the opening price only, so that the frame has a single column
prices = DataFrame(series['Open'].values)
# shifted copy (t-1) next to the unshifted values (t+1)
dataframe = concat([prices.shift(1), prices], axis=1)
dataframe.columns = ['t-1', 't+1']
print(dataframe.head(5))

          t-1         t+1
0         NaN  133.990005
1  133.990005  138.050003
2  138.050003  135.580002
3  135.580002  134.080002
4  134.080002  133.520004


You can see that we would have to discard the first row to use the dataset to train a supervised learning model, as it does not contain enough data to work with.
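
Discarding the incomplete row can be sketched with `dropna()`. The example below is a minimal illustration that uses the first six opening prices printed in this notebook instead of reading the CSV file; the names `supervised`, `X` and `y` are assumptions made for this example.

```python
import pandas as pd

# First six opening prices, copied from the outputs printed in this notebook.
prices = pd.Series([133.990005, 138.050003, 135.580002,
                    134.080002, 133.520004, 128.889999])

frame = pd.concat([prices.shift(1), prices], axis=1)
frame.columns = ['t-1', 't+1']

# Drop every row that contains a NaN, here only the first one.
supervised = frame.dropna()
X = supervised[['t-1']].values  # input features, one column
y = supervised['t+1'].values    # target values
print(supervised)
```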

The addition of lag features is called the sliding window method, in this case with a window width of 1. It is as though we are sliding our focus along the time series for each observation with an interest in only what is within the window width.

We can expand the window width and include more lagged features. For example, below is the above case modified to include the last 3 observed values to predict the value at the next time step.

In [3]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
# use the opening price only, so that the frame has a single column
prices = DataFrame(series['Open'].values)
# one lag column per step in the window, followed by the unshifted values
dataframe = concat([prices.shift(3), prices.shift(2), prices.shift(1), prices], axis=1)
dataframe.columns = ['t-3', 't-2', 't-1', 't+1']
print(dataframe.head(5))

          t-3         t-2         t-1         t+1
0         NaN         NaN         NaN  133.990005
1         NaN         NaN  133.990005  138.050003
2         NaN  133.990005  138.050003  135.580002
3  133.990005  138.050003  135.580002  134.080002
4  138.050003  135.580002  134.080002  133.520004


Again, you can see that we must discard the first few rows, which do not have enough data to train a supervised model. A difficulty with the sliding window approach is choosing how large to make the window for your problem. A good starting point is a sensitivity analysis: try a suite of different window widths, each of which creates a different “view” of your dataset, and see which results in better-performing models.
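
As a rough sketch of how such a suite of views might be built, the example below wraps the lag construction in a helper. `make_lag_frame` is a hypothetical function written for this illustration, not part of Pandas, and the input is the first six opening prices printed in this notebook.

```python
import pandas as pd

def make_lag_frame(values, width):
    """Hypothetical helper: one lag column per step in the window, plus the target."""
    series = pd.Series(values)
    columns = {f't-{lag}': series.shift(lag) for lag in range(width, 0, -1)}
    columns['t+1'] = series
    # Rows without a full window of history contain NaN and are dropped.
    return pd.DataFrame(columns).dropna()

# First six opening prices, copied from the outputs printed in this notebook.
prices = [133.990005, 138.050003, 135.580002, 134.080002, 133.520004, 128.889999]

# One "view" of the dataset per candidate window width.
views = {width: make_lag_frame(prices, width) for width in (1, 2, 3)}
for width, view in views.items():
    print(width, view.shape)
```

Each view would then be passed to the same model and evaluation procedure. Note that a wider window leaves fewer usable rows, which matters on short series.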

### Rolling Window Statistics

A step beyond adding raw lagged values is to add a summary of the values at previous time steps. We can calculate summary statistics across the values in the sliding window and include these as features in our dataset. Perhaps the most useful is the mean of the previous few values, also called the rolling mean. For example, we can calculate the mean of the previous two values and use that to predict the next value. 

First, the series must be shifted by one time step, so that each window contains only values observed before the one being predicted. Then the rolling dataset can be created and the mean calculated on each window of two values.

In [4]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
# use the opening price only, so that the frame has a single column
prices = DataFrame(series['Open'].values)
# shift by one step so that the window holds only earlier values
shifted = prices.shift(1)
window = shifted.rolling(window=2)
means = window.mean()
dataframe = concat([means, prices], axis=1)
dataframe.columns = ['mean(t-2,t-1)', 't+1']
print(dataframe.head(5))

   mean(t-2,t-1)         t+1
0            NaN  133.990005
1            NaN  138.050003
2     136.020004  135.580002
3     136.815002  134.080002
4     134.830002  133.520004
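
To see why the shift matters, the sketch below compares a rolling mean on the unshifted values with one on the shifted values. It uses the first five opening prices printed above, and the column names are assumptions chosen for this example.

```python
import pandas as pd

# First five opening prices, copied from the outputs printed in this notebook.
prices = pd.Series([133.990005, 138.050003, 135.580002, 134.080002, 133.520004])

frame = pd.DataFrame({
    # window ends at the current row, so it includes the value being predicted
    'unshifted_mean': prices.rolling(window=2).mean(),
    # window ends one row earlier, so it holds only previously observed values
    'shifted_mean': prices.shift(1).rolling(window=2).mean(),
    't+1': prices,
})
print(frame)
```

Without the shift, the mean in each row already contains that row's target value, which leaks the answer into the feature.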


Below is another example that uses a window width of 3 and more summary statistics, specifically the minimum, mean, and maximum value in the window. The code specifies the sliding window width as a named variable, so the same value is used both to calculate the shift of the series and to set the width of the window in the rolling() function. In this case, the window width of 3 means the series is shifted forward by 2 time steps, which makes the first two rows NaN. The window statistics are then calculated with 3 values per window, and any window that contains a NaN yields NaN, so row 4 is the first row with statistics. Note that with this shift the window for row 4 covers rows 0 to 2 and does not include the observation at row 3.

In [5]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
# use the opening price only, so that the frame has a single column
prices = DataFrame(series['Open'].values)
width = 3
# shift by width - 1 steps, then summarize each window of `width` values
shifted = prices.shift(width - 1)
window = shifted.rolling(window=width)
dataframe = concat([window.min(), window.mean(), window.max(), prices], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))

          min        mean         max         t+1
0         NaN         NaN         NaN  133.990005
1         NaN         NaN         NaN  138.050003
2         NaN         NaN         NaN  135.580002
3         NaN         NaN         NaN  134.080002
4  133.990005  135.873337  138.050003  133.520004
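
If the statistics should instead cover the three values immediately before each target, one possible variant shifts by a single step regardless of the window width. This is a sketch of an alternative, not the code used above, and it works on the first five opening prices printed in this notebook.

```python
import pandas as pd

# First five opening prices, copied from the outputs printed in this notebook.
prices = pd.Series([133.990005, 138.050003, 135.580002, 134.080002, 133.520004])

width = 3
# Shift by one step only, so each window ends at the observation just before the target.
window = prices.shift(1).rolling(window=width)
frame = pd.concat([window.min(), window.mean(), window.max(), prices], axis=1)
frame.columns = ['min', 'mean', 'max', 't+1']
print(frame)
```

With this variant the first three rows are NaN, and row 3 summarizes rows 0 to 2.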


### Expanding Window Statistics

Another type of window that may be useful includes all previous data in the series. This is called an expanding window and can help with keeping track of the bounds of observable data. Like the rolling() function on DataFrame, Pandas provides an expanding() function that collects sets of all prior values for each time step.

Below is an example of calculating the minimum, mean, and maximum values of the expanding window on the daily stock price dataset. Running the example prints the first 5 rows of the dataset.

In [6]:
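# create expanding window features
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
series = read_csv('data/AAPL.csv', header=0, index_col=0)
# use the opening price only, so that the frame has a single column
prices = DataFrame(series['Open'].values)
# the window at each row holds every value up to and including that row
window = prices.expanding()
# the target is the next value, so the series is moved back by one step
dataframe = concat([window.min(), window.mean(), window.max(), prices.shift(-1)], axis=1)
dataframe.columns = ['min', 'mean', 'max', 't+1']
print(dataframe.head(5))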

          min        mean         max         t+1
0  133.990005  133.990005  133.990005  138.050003
1  133.990005  136.020004  138.050003  135.580002
2  133.990005  135.873337  138.050003  134.080002
3  133.990005  135.425003  138.050003  133.520004
4  133.520004  135.044003  138.050003  128.889999
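
One way to check what the expanding window computes is to compare it with the cumulative functions in Pandas. The sketch below does this on the first five opening prices printed above; the variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# First five opening prices, copied from the outputs printed in this notebook.
prices = pd.Series([133.990005, 138.050003, 135.580002, 134.080002, 133.520004])

window = prices.expanding()

# Running minimum and maximum match cummin() and cummax().
same_min = np.allclose(window.min(), prices.cummin())
same_max = np.allclose(window.max(), prices.cummax())

# Running mean is the cumulative sum divided by the number of values seen so far.
counts = np.arange(1, len(prices) + 1)
same_mean = np.allclose(window.mean(), prices.cumsum() / counts)

print(same_min, same_max, same_mean)
```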
