## Multi-step forecasting: recursive approach

We want to forecast multiple steps ahead. In the recursive approach we:

* Use the forecasted output as a new input.
* Recursively apply a 1-step-ahead forecast model.

The recursive approach needs only one time series model.
* Each forecasting point is estimated using the previous forecasts.
* It takes more code, because the forecasts have to be fed back as inputs and the features re-created at every step.
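
The recursion itself can be sketched without any library. This is a minimal illustration, not the model used later in this notebook: `one_step_model` is a hypothetical stand-in for any fitted 1-step-ahead forecaster, and its coefficients and the input values are made up.

```python
def one_step_model(last_value, value_24h_ago):
    """Hypothetical 1-step-ahead model: a weighted mix of two lags."""
    return 0.7 * last_value + 0.3 * value_24h_ago


def recursive_forecast(history, horizon):
    """Forecast `horizon` steps ahead by feeding each forecast back as input."""
    values = list(history)  # copy, so the caller's history is not modified
    forecasts = []
    for _ in range(horizon):
        y_hat = one_step_model(values[-1], values[-24])
        forecasts.append(y_hat)
        values.append(y_hat)  # the forecast becomes the newest "observation"
    return forecasts


history = [float(h) for h in range(24)]  # 24 made-up hourly observations
forecasts = recursive_forecast(history, horizon=3)
print(forecasts)
```

Only the first forecast is computed from observed values alone. From the second step onwards, the most recent lag is itself a forecast.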

#### CONS
* Errors propagate: each forecast error is carried into the following steps.
* Code complexity.
* The inputs are themselves forecasts, not observed values.
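
To make error propagation concrete, here is a toy illustration with made-up numbers (not this notebook's data). We assume the true series grows by 1.0 per step, while a hypothetical 1-step model has learned a growth of 0.9. Applied once, the model is off by about 0.1. Applied recursively, the error keeps growing, because each forecast starts from the previous, already biased, forecast.

```python
TRUE_STEP = 1.0    # assumed true hourly increase of the series
MODEL_STEP = 0.9   # assumed (slightly wrong) increase learned by the model

last_observed = 10.0
truth, forecast = last_observed, last_observed
abs_errors = []

for step in range(1, 25):
    truth = truth + TRUE_STEP          # what actually happens
    forecast = forecast + MODEL_STEP   # model applied to its own last forecast
    abs_errors.append(abs(truth - forecast))

print(f"error after 1 step:   {abs_errors[0]:.2f}")
print(f"error after 24 steps: {abs_errors[-1]:.2f}")
```

In this contrived case the absolute error after 24 steps is about 24 times the 1-step error. Real models will not degrade this regularly, but the mechanism is the same.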

#### PROS
* Only one model
* Less computation time


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso, MultiTaskLasso
from sklearn.metrics import mean_squared_error, mean_absolute_error

from feature_engine.creation import CyclicalFeatures
from feature_engine.datetime import DatetimeFeatures
from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures
from feature_engine.imputation import DropMissingData
from feature_engine.selection import DropFeatures

## Load the Data

In [4]:
df = pd.read_csv('../../Datasets/AirQualityUCI_ready.csv', 
parse_dates=['Date_Time'], index_col=['Date_Time'], usecols=['CO_sensor','RH', 'Date_Time'])

df.sort_index(inplace=True)

df=df.loc["2004-04-01":"2005-04-30"]

df = df.loc[df['CO_sensor']>0]

# Add missing timestamps (easier for the demo)
df = df.asfreq('1H')

# Forward-fill the gaps introduced by asfreq
df = df.ffill()

df.head()

| Date_Time | CO_sensor | RH |
|---|---|---|
| 2004-04-04 00:00:00 | 1224.0 | 56.5 |
| 2004-04-04 01:00:00 | 1215.0 | 59.2 |
| 2004-04-04 02:00:00 | 1115.0 | 62.4 |
| 2004-04-04 03:00:00 | 1124.0 | 65.0 |
| 2004-04-04 04:00:00 | 1028.0 | 65.3 |


## Feature Engineering pipeline

In [2]:
date_feat = DatetimeFeatures(variables='index', 
features_to_extract=['hour','month', 'week' ,'weekend','day_of_week','day_of_month'])

cyclic_feat = CyclicalFeatures(variables=['hour','month'])

lag_feat = LagFeatures(variables=['CO_sensor', 'RH'], freq=['1H', '24H'], missing_values='ignore')

window_feat = WindowFeatures(variables=['CO_sensor', 'RH'], window='3H' ,freq='1H', missing_values='ignore')

drop_missing = DropMissingData()

drop_feat = DropFeatures(features_to_drop=['CO_sensor','RH'])

pipe = Pipeline([ 
    ('date', date_feat),
    ('cyclic', cyclic_feat),
    ('lag', lag_feat),
    ('window', window_feat),
    ('drop missing', drop_missing),
    ('drop features', drop_feat)
])
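
To see what the lag and window steps contribute, the sketch below rebuilds the same kind of features by hand for a single timestamp, using a plain Python list and made-up values. It illustrates the idea rather than feature_engine's internals: the helper `features_at` and the feature names are hypothetical, and we assume the window feature is the mean of the three preceding hours.

```python
def features_at(series, t):
    """Build lag and window features for position t from values before t only."""
    if t < 24:
        return None  # not enough history: such a row would be dropped
    return {
        "lag_1H": series[t - 1],
        "lag_24H": series[t - 24],
        "window_3H_mean": sum(series[t - 3:t]) / 3,
    }


co_sensor = [1000.0 + 10 * h for h in range(30)]  # made-up hourly readings
row = features_at(co_sensor, t=24)
print(row)
```

The value at `t` itself is never used as an input, which is why the pipeline can drop the raw `CO_sensor` and `RH` columns in its last step.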

## Split data into train and test

We will leave the last month of data as a hold-out sample to evaluate the performance of the model.

Remember that, to create the input features, we need the pollutant data for at least the 24 hours before the first forecast point in the test set.
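
The arithmetic behind that overlap can be sketched with the standard library alone (the variable names are illustrative). The longest look-back among the features is 24 hours, so the test set has to start 24 hours before the first forecast point.

```python
from datetime import datetime, timedelta

first_forecast_point = datetime(2005, 3, 4)   # first timestamp to forecast
largest_lag = timedelta(hours=24)             # longest look-back among the features

test_start = first_forecast_point - largest_lag
print(test_start)
```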

In [5]:
train = df.loc[df.index < "2005-03-04"]
test = df.loc[pd.to_datetime("2005-03-04")-pd.Timedelta(value='24H'):]

y_train = df.loc[train.index,['CO_sensor','RH']]
y_test = df.loc[test.index,['CO_sensor','RH']]

print('Train Start Date: ',train.index.min(), 'End Date :', train.index.max())
print('Test Start Date: ',test.index.min(), 'End Date :', test.index.max())

Train Start Date:  2004-04-04 00:00:00 End Date : 2005-03-03 23:00:00
Test Start Date:  2005-03-03 00:00:00 End Date : 2005-04-04 14:00:00


## Apply Feature Engineering

In [6]:
X_train = pipe.fit_transform(train)


y_train = y_train.loc[X_train.index]


## Modeling -- Lasso

In [7]:
from sklearn.multioutput import MultiOutputRegressor

lasso = MultiOutputRegressor(Lasso())
lasso.fit(X_train, y_train)

## Recursive multi-step forecasting: test set

We will forecast the next 24 hours from several points in the test set.

We could make a 24-hour forecast from every point, or instead, we could forecast the next 24 hours at certain intervals.

For simplicity, we will make one 24-hour forecast every 24 hours.

In [12]:
# The first hour of forecast.

date_start = pd.Timestamp("2005-03-04")
date_start

Timestamp('2005-03-04 00:00:00')

In [13]:
# The last point from which to start a forecast
# (24 hours before the last timestamp in the test set).

date_end = test.index.max() - pd.Timedelta(value='24H')
date_end

Timestamp('2005-04-03 14:00:00', freq='H')

In [17]:
forecasting_points = pd.date_range(start=date_start, end=date_end, freq='D')
forecasting_points

DatetimeIndex(['2005-03-04', '2005-03-05', '2005-03-06', '2005-03-07',
               '2005-03-08', '2005-03-09', '2005-03-10', '2005-03-11',
               '2005-03-12', '2005-03-13', '2005-03-14', '2005-03-15',
               '2005-03-16', '2005-03-17', '2005-03-18', '2005-03-19',
               '2005-03-20', '2005-03-21', '2005-03-22', '2005-03-23',
               '2005-03-24', '2005-03-25', '2005-03-26', '2005-03-27',
               '2005-03-28', '2005-03-29', '2005-03-30', '2005-03-31',
               '2005-04-01', '2005-04-02', '2005-04-03'],
              dtype='datetime64[ns]', freq='D')

In [45]:
# List to collect the MAE, MSE for each 24 hour forecast examined.

mae, mse = [],[]

for forecasting_point in forecasting_points:

    # Prepare the 24 hours of data before the forecasting point.
    input_data = test.loc[forecasting_point-pd.Timedelta(value='24H'):forecasting_point].copy()
    # Hide the true value at the forecasting point: it is what we want to predict.
    input_data.loc[forecasting_point]=np.nan
    
    # Forecast 24 steps from this point.
    index = pd.date_range(start=forecasting_point, periods=24, freq='H')
    forecast_df = pd.DataFrame(index=index, columns=['CO_sensor','RH'])
    for i in index:
        # 1-step-ahead forecast for timestamp i.
        forecast_df.loc[i] = lasso.predict(pipe.transform(input_data))
        # Feed the prediction back as if it were an observation.
        input_data.loc[i] = forecast_df.loc[i]
        # Slide the window one hour forward: drop the oldest row and
        # open an empty row for the next timestamp.
        input_data = input_data.shift(periods=-1).shift(freq='1H')

    y_truth = y_test.loc[forecast_df.index.min():forecast_df.index.max()]
    mae.append(mean_absolute_error(y_truth['CO_sensor'], forecast_df['CO_sensor']))
    mse.append(mean_squared_error(y_truth['CO_sensor'], forecast_df['CO_sensor']))
    
    
    

In [49]:
# Let's compare the last forecast with the true values.
temp = forecast_df.merge(y_test, left_index=True, right_index=True, suffixes=['_forecast',''])
temp.head()

|  | CO_sensor_forecast | RH_forecast | CO_sensor | RH |
|---|---|---|---|---|
| 2005-04-03 00:00:00 | 981.006719 | 47.000154 | 1213.0 | 80.9 |
| 2005-04-03 01:00:00 | 955.196776 | 47.826304 | 1142.0 | 81.2 |
| 2005-04-03 02:00:00 | 928.877399 | 48.652954 | 1089.0 | 80.9 |
| 2005-04-03 03:00:00 | 917.523255 | 49.673259 | 982.0 | 70.6 |
| 2005-04-03 04:00:00 | 909.272551 | 50.303907 | 888.0 | 65.1 |


In [47]:
mean_absolute_error(temp['CO_sensor'], temp['CO_sensor_forecast'])

186.19939933402554

## Mean performance across all 24-hour forecasts

In [58]:
print( 'MAE :', np.mean(mae), ' STD :',np.std(mae))
print( 'MSE :', np.mean(mse), ' STD :',np.std(mse))
print( 'RMSE :', np.mean(np.sqrt(mse)), ' STD :',np.std(np.sqrt(mse)))

MAE : 104.15529795148264  STD : 46.968252499347244
MSE : 22436.903511870485  STD : 17255.277937491664
RMSE : 135.78580314748748  STD : 63.23858929057675
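
Note that the RMSE line averages the per-window RMSE values, which is not the same as taking the square root of the averaged MSE. A small check with made-up per-window MSE values (not the results above) shows the difference. Because the square root is concave, the mean of the roots can never exceed the root of the mean.

```python
import math

mse_per_window = [100.0, 400.0, 900.0]  # made-up MSE of three forecast windows

mean_of_rmse = sum(math.sqrt(m) for m in mse_per_window) / len(mse_per_window)
rmse_of_mean = math.sqrt(sum(mse_per_window) / len(mse_per_window))

print(mean_of_rmse)   # average of the three per-window RMSE values
print(rmse_of_mean)   # square root of the average MSE
```

Either summary is defensible. What matters is to state which one is reported, so that the number can be compared with other models.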
