## Recursive forecasting | Air passengers Data

#### Goal: Use a time feature to capture the trend


In [50]:
import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


from sklearn.base import clone
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.metrics import mean_absolute_error,mean_absolute_percentage_error,mean_squared_error

from sktime.transformations.series.time_since import TimeSince
from sktime.transformations.series.summarize import WindowSummarizer

### DATA
The air passengers dataset contains monthly totals of international airline passengers from 1949 to 1960, in thousands.

In [2]:
data = pd.read_csv('../../Datasets/example_air_passengers.csv', parse_dates=['ds'], index_col=['ds'])
data.plot(figsize=(15,4))

<img src='./plots/air-passengers-data.png'>

## Prepare our transformers.

In [3]:

polynomial = PolynomialFeatures(degree=2, include_bias=False)
time_since = TimeSince(freq='MS', to_numeric=True, keep_original_columns=False)

# Create features for capturing Trend
time_feature = make_pipeline(time_since, polynomial)

# Creates features using window summary
window_summary = WindowSummarizer(
    lag_feature={
    'lag' : [1, 2, 3, 12],
    'mean' : [[1, 12]]
    },
    target_cols=['y'],
    truncate="bfill"
)


# Feature union: concatenate the outputs of multiple transformers
features = make_union(time_feature, window_summary)

# scale the features
features_scaled = make_pipeline(features, MinMaxScaler()) 
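
To make the transformers above more concrete, here is a minimal pandas-only sketch of the kind of columns they are meant to produce: a time index and its square for the trend, lagged copies of the target, and a 12-month rolling mean lagged by one step. The toy data and the column names (`t`, `y_lag_1`, `y_mean_1_12`, ...) are assumptions made for illustration and are not the names sktime generates. It also assumes that `'mean': [[1, 12]]` means "lag 1, window length 12".

```python
import numpy as np
import pandas as pd

# Toy monthly series (illustrative values, not the air passengers data)
idx = pd.date_range("1949-01-01", periods=15, freq="MS")
toy = pd.DataFrame({"y": np.arange(100.0, 115.0)}, index=idx)

# Trend features: months elapsed since the first timestamp, and its square
t = np.arange(len(toy), dtype=float)
feats = pd.DataFrame({"t": t, "t^2": t ** 2}, index=idx)

# Lag features: the target shifted back by 1, 2, 3 and 12 months
for lag in [1, 2, 3, 12]:
    feats[f"y_lag_{lag}"] = toy["y"].shift(lag)

# Rolling mean over the 12 months that end one step before the current row
feats["y_mean_1_12"] = toy["y"].shift(1).rolling(window=12).mean()

print(feats.tail(3))
```

Unlike the pipeline above, this sketch leaves the leading rows as `NaN` rather than backfilling them, and it does not scale the columns.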


#### Configure the forecast start time, the number of steps to forecast, the forecast horizon, and the model.

In [34]:
# Define time of first forecast, this determines our train / test split
forecast_start = pd.to_datetime("1955-10-01")

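# Define number of steps to forecast. The horizon extends past the end
# of the data (1960), so the later steps are pure out-of-sample forecasts.
num_forecast_steps = 144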

# Model
model = LinearRegression()

# forecast horizon
forecast_horizon = pd.date_range(start=forecast_start, periods=num_forecast_steps, freq='MS')

# How much data in the past is needed to create our features
look_back_window_size = pd.DateOffset(months=12)
# We need the latest 12 time periods
# in our predict dataframe to build our
# window features.


df_train = data.loc[data.index < forecast_start]
df_test = data.loc[data.index >= forecast_start]

Create features and targets

In [35]:
X_train = features_scaled.fit_transform(df_train)
y_train = df_train['y']

Fit a linear regression model

In [36]:
model.fit(X_train, y_train)

Let's prepare the dataframe that we will pass to `features_scaled.transform()` to create `X_test`, which we then pass to `model.predict()`. It includes the tail of the training period so that features requiring historical data (the lags and the window mean) can be computed.

In [37]:
look_back_start_time = forecast_start - look_back_window_size

print('Forecast start time :',forecast_start)
print('Look back start time :', look_back_start_time)

Forecast start time : 1955-10-01 00:00:00
Look back start time : 1954-10-01 00:00:00


In [38]:
df_predict = df_train.loc[look_back_start_time:].copy()
# Extend index into forecast horizon
df_predict = pd.concat([df_predict, pd.DataFrame(index=forecast_horizon)])
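
The shape of the resulting frame is easier to see on a small example. The sketch below uses toy stand-ins for `df_train` and the forecast horizon (the values and the three-step horizon are assumptions for illustration): the look-back slice keeps the 12 most recent training rows, and the appended horizon rows carry `NaN` in `y` until a prediction is written into them.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (values are illustrative)
forecast_start = pd.Timestamp("1955-10-01")
train_idx = pd.date_range("1954-01-01", "1955-09-01", freq="MS")
df_train = pd.DataFrame(
    {"y": [float(i) for i in range(len(train_idx))]}, index=train_idx
)

# Keep only the 12 months before the first forecast
look_back_start = forecast_start - pd.DateOffset(months=12)
df_predict = df_train.loc[look_back_start:].copy()

# Append empty rows for the forecast horizon; their 'y' values become NaN
horizon = pd.date_range(forecast_start, periods=3, freq="MS")
df_predict = pd.concat([df_predict, pd.DataFrame(index=horizon)])

print(df_predict.tail(5))
```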

Let's build `X_test` recursively: at each step we compute the features, make a prediction, and write it back into `df_predict` so that later steps can use it.

In [39]:
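for fh in forecast_horizon:
    # Build features from all rows up to and including the current step
    x = features_scaled.transform(df_predict.loc[:fh])
    # Keep only the feature row for the step being forecast
    x = x[-1]

    y_pred = model.predict([x])

    # Write the prediction back so later steps can use it as a lag
    df_predict.loc[fh, 'y'] = y_pred[0]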
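
The same recursive pattern can be sketched without sktime or scikit-learn. The example below is only an illustration under stated assumptions: the toy series (a linear trend plus an alternating term), the helper `make_row`, and the use of `np.linalg.lstsq` in place of `LinearRegression` are all choices made here, not part of the notebook. It shows the key point of the loop above: each prediction is appended to the history and becomes the lag-1 input of the next step.

```python
import numpy as np

# Toy series: linear trend plus an alternating term (illustrative data only)
y = [2.0 * t + 5.0 + 3.0 * (-1) ** t for t in range(10)]

def make_row(t, history):
    """Feature row for step t: intercept, time index, and lag-1 value."""
    return [1.0, float(t), history[t - 1]]

# Fit ordinary least squares on steps 1..9 (step 0 has no lag-1 value)
X_train = np.array([make_row(t, y) for t in range(1, len(y))])
y_train = np.array(y[1:])
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Recursive forecast: each prediction is appended to the history
# and is then used as the lag-1 feature of the following step
history = list(y)
n_steps = 5
for t in range(len(y), len(y) + n_steps):
    x = np.array(make_row(t, history))
    history.append(float(x @ coef))

forecast = history[len(y):]
print(forecast)
```

Because this toy series is an exact linear function of the time index and the previous value, the recursive forecast continues the pattern without error. On real data, errors in early predictions feed into later ones.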

#### Result: the time feature captures the trend alongside the other features in a recursive forecasting workflow.

In [44]:
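# Plot train and test without overlap (label slicing would include forecast_start in both)
ax = df_train.plot(figsize=(15, 4))
df_test.plot(ax=ax)
# Start at forecast_start so the look-back rows (actual values) are not shown as predictions
df_predict.loc[forecast_start:data.index.max()].plot(ax=ax)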
df_predict.loc[data.index.max():].plot(ax=ax)
plt.legend(['train','test','test-prediction','forecast'])

<img src='./plots/air-passengers-data-train-test-forecast.png'>

#### Performance on test set

In [62]:
ax = data.loc[forecast_start:].plot(marker='.', figsize=(10,5))
df_predict.loc[forecast_start:data.index.max()].plot(ax=ax, marker='.')

y_true = data.loc[forecast_start:]
y_pred = df_predict.loc[forecast_start:data.index.max()]

mse = mean_squared_error(y_true, y_pred)
# Take the square root directly; the `squared` argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred)

plt.title(f'MSE : {mse:.2f} , RMSE : {rmse:.2f}\nMAE : {mae:.2f}  , MAPE : {mape:.4f} ');

<img src='./plots/air-passengers-data-linear-reg-model-test-set-performance-metrics.png'>
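
For reference, the four metrics in the plot title can be written out directly with NumPy. This is a small sketch on made-up numbers (the arrays below are assumptions, not values from the notebook). Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, which is the convention followed here.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 380.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)                        # mean squared error
rmse = np.sqrt(mse)                               # root mean squared error
mae = np.mean(np.abs(errors))                     # mean absolute error
mape = np.mean(np.abs(errors) / np.abs(y_true))   # as a fraction, not a percent

print(mse, rmse, mae, mape)
```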


#### The time feature helps a linear model capture the trend in the data.