## Meteo Bakery - Comparing forecasting models
In this notebook, we compare several forecasting models on our multiple time series: linear regression, RandomForest, LightGBM and CatBoost.

### import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from darts import TimeSeries
from darts.utils import missing_values
from darts.metrics import rmse, mape
from darts.dataprocessing.transformers import Scaler

# import all models from the public darts.models namespace (internal module paths differ between darts versions)
from darts.models import RegressionModel, RandomForest, LightGBMModel, CatBoostModel

from sklearn.linear_model import LinearRegression

### load data

In [None]:
df_stacked = pd.read_csv('../data/data_combined.csv')
df_stacked['date'] = pd.to_datetime(df_stacked['date'])

In [None]:
# select only data before Covid19
df_stacked = df_stacked[df_stacked.year < 2020]
df_stacked.head()

### generate a multiple time series object
Using the darts `TimeSeries` class, we generate a multiple time series object from the stacked (grouped) DataFrame. The result is essentially a list of all individual time series, one per branch/product combination.
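
To make the idea concrete without darts, the following minimal sketch performs an analogous grouping with plain pandas on a tiny made-up frame: each (branch, product) pair becomes its own daily series, and days missing from the input show up as NaN. The toy values and the helper name `split_into_series` are illustrative assumptions and not part of the notebook's data or of the darts API.

```python
import pandas as pd

# toy stacked frame: two groups, one of them with a missing day (values are made up)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03',
                            '2019-01-01', '2019-01-03']),
    'branch': ['A', 'A', 'A', 'B', 'B'],
    'product': ['bread', 'bread', 'bread', 'cake', 'cake'],
    'turnover': [10.0, 12.0, 11.0, 5.0, 7.0],
})

def split_into_series(df, group_cols, value_col, time_col):
    """Return one daily-indexed pandas Series per group, keyed by the group IDs."""
    out = {}
    for key, grp in df.groupby(group_cols):
        # re-index to a daily frequency so that gaps become explicit NaNs
        out[key] = grp.set_index(time_col)[value_col].asfreq('D')
    return out

toy_series = split_into_series(toy, ['branch', 'product'], 'turnover', 'date')
print(len(toy_series))            # number of individual series
print(toy_series[('B', 'cake')])  # has a NaN for the missing day
```

The explicit NaNs are the reason an imputation step is applied to each series later on.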

In [None]:
# convert df into multiple time series object
series_multi = TimeSeries.from_group_dataframe(df=df_stacked, group_cols=['branch', 'product'], 
                                        value_cols=['turnover'], time_col='date', freq='d')

In [None]:
# fill NaNs in each series by interpolation (fill='auto') using the darts imputation utilities
series_multi_nan = [missing_values.fill_missing_values(ts, fill='auto') for ts in series_multi]

### access grouping IDs for each time series using static_covariates attribute
Grouping IDs can be accessed through the `static_covariates` attribute of each series.

In [None]:
# print number of series in multiple series object
print(f"\n{len(series_multi)} series were extracted from the input DataFrame\n")

# iterate over each time series and plot it, labelled with its static_covariates (i.e. group IDs)
for ts in series_multi_nan:
    ts['turnover'].plot(label=f'{ts.static_covariates.iloc[:, 0][0]} | {ts.static_covariates.iloc[:, 1][0]}')

### generate covariate series

In [None]:
df_stacked.columns[-30:]

In [None]:
# generate a multiple covariates time series object
covariates_multi = TimeSeries.from_group_dataframe(df=df_stacked, group_cols=['branch', 'product'], 
                                        value_cols=['month', 'day_of_week', 'school_holiday', 'public_holiday',
                                        'clear_total', 'cloudy_total', 'foggy_total', 'rainy_total', 'snowy_total'], 
                                        time_col='date', freq='d')

In [None]:
# scale covariates using MinMax scaling
scaler = Scaler()

covariates_scaled = scaler.fit_transform(covariates_multi)

In [None]:
covariates_scaled[0].components

In [None]:
# define utility function for plotting scaled covariates
def check_scaling(series, component):
    for i in range(len(series)):
        series[i][component].plot()
        plt.show()

In [None]:
#check_scaling(covariates_scaled, 'temp_mean')

### Train-Validation split

In [None]:
# train-validation split by slicing each series within list using list comprehension
train = [ts[:-364] for ts in series_multi_nan] # all time points except for the last 364 days (52 weeks)
val = [ts[-364:-357] for ts in series_multi_nan] # the first 7 days after the end of the training set

### use historical forecasts to compare models w/ and w/o covariates
To assess how prediction accuracy changes when covariates are included, we define a utility function that iterates over a multiple time series object and, for each time series, performs historical forecasts with one model w/o covariates and one model w/ covariates. Here, we specifically use future covariates, such as month, day_of_week, holidays and weather forecasts.

The forecasting horizon will be set to 7 days by default. Similarly, the stride will be set to 7 days by default, thus the training set for historical forecasts will always increase by 7 days.
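
The expanding-window scheme described above can be sketched in plain Python. This is a simplified illustration under stated assumptions (positions instead of dates, a hypothetical helper `expanding_windows`); it does not reproduce every detail of how darts chooses the start point or handles the end of the series.

```python
def expanding_windows(n_points, split=0.8, forecast_horizon=7, stride=7):
    """Yield (train_end, forecast_end) position pairs.

    Each model is trained on positions [0, train_end) and forecasts
    positions [train_end, forecast_end). After every forecast the
    training set grows by `stride` points.
    """
    train_end = int(n_points * split)
    while train_end + forecast_horizon <= n_points:
        yield train_end, train_end + forecast_horizon
        train_end += stride

# toy series with 100 points: the backtest starts at 80% of the series
windows = list(expanding_windows(n_points=100, split=0.8, forecast_horizon=7, stride=7))
for train_end, forecast_end in windows:
    print(f'train on [0, {train_end}), forecast [{train_end}, {forecast_end})')
```

With a larger stride, fewer models have to be retrained, which shortens the backtest at the cost of fewer evaluation points.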

For comparison, we will compute both RMSE and MAPE from those forecasts.
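
For reference, both metrics can be written in a few lines of NumPy. The function names `rmse_np` and `mape_np` are hypothetical stand-ins used only in this sketch; the notebook itself relies on `darts.metrics.rmse` and `darts.metrics.mape`. MAPE is expressed in percent here and is undefined wherever the actual value is zero.

```python
import numpy as np

def rmse_np(actual, predicted):
    """Root mean squared error, in the units of the series."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape_np(actual, predicted):
    """Mean absolute percentage error in percent (requires non-zero actual values)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - predicted) / actual)))

# toy values, made up for illustration
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(round(rmse_np(actual, predicted), 2))
print(round(mape_np(actual, predicted), 2))
```

RMSE penalizes large absolute errors and depends on the scale of each series, whereas MAPE is scale-free, which makes it easier to average across products with very different turnover levels.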

In [None]:
# define utility function for comparing historical forecasts on multiple time series between models w/ and w/o covariates
def validate_historically(model, model_cov, multiple_series, future_covariates=None, forecast_horizon=7, stride=7, split=0.8, show_output=True):

    # initialize dataframe for evaluation scores
    scores = pd.DataFrame({'group': [], 'RMSE': [], 'RMSE_covariates': [], 'MAPE': [], 'MAPE_covariates': []})

    # backtest the models on the final (1 - split) share of each series (the last 20% by default)
    # iterate over every series in multiple time series object
    for i, series in enumerate(multiple_series):
        # perform historical forecasts on model w/o covariates and save result
        backtest_wo = model.historical_forecasts(series=series, 
                                          past_covariates=None,
                                          future_covariates=None,
                                          start=split, 
                                          stride=stride,
                                          retrain=True,
                                          verbose=show_output, 
                                          forecast_horizon=forecast_horizon)
        # perform historical forecasts on model w/ covariates and save result
        backtest_cov = model_cov.historical_forecasts(series=series, 
                                          past_covariates=None,
                                          future_covariates=future_covariates[i],
                                          start=split, 
                                          stride=stride,
                                          retrain=True,
                                          verbose=show_output, 
                                          forecast_horizon=forecast_horizon)

        
        # calculate RMSE and MAPE for predictions w/ and w/o covariates
        RMSE_wo = rmse(series, backtest_wo).round(2)
        MAPE_wo = mape(series, backtest_wo).round(2)

        RMSE_cov = rmse(series, backtest_cov).round(2)
        MAPE_cov = mape(series, backtest_cov).round(2)

        # append scores
        scores.loc[i, 'group'] = f'{series.static_covariates.iloc[:, 0][0]} | {series.static_covariates.iloc[:, 1][0]}'
        scores.loc[i, 'RMSE'] = RMSE_wo
        scores.loc[i, 'MAPE'] = MAPE_wo
        scores.loc[i, 'RMSE_covariates'] = RMSE_cov
        scores.loc[i, 'MAPE_covariates'] = MAPE_cov

        if show_output:
            #plot actual series values
            series[int(len(series)*split):].plot()
            # plot predicted values from historical model forecasts w/o covariates on top
            backtest_wo.plot(label=f'fh={forecast_horizon} - w/o covariates')
            # plot predicted values from historical model forecasts w/ covariates on top
            backtest_cov.plot(label=f'fh={forecast_horizon} - w/ covariates')
            # give title based on respective time series group IDs
            plt.title(f'{series.static_covariates.iloc[:, 0][0]} | {series.static_covariates.iloc[:, 1][0]}')
            plt.show()

            # print evaluation scores
            print(f'Backtest w/o covariates: RMSE = {RMSE_wo}, MAPE = {MAPE_wo}')
            print(f'Backtest w/ covariates: RMSE = {RMSE_cov}, MAPE = {MAPE_cov}')

    # calculate mean scores
    scores.loc[i+1, 'group'] = 'mean'
    scores.loc[i+1, ['RMSE', 'RMSE_covariates', 'MAPE', 'MAPE_covariates']] = [scores[x].mean().round(2) for x in ['RMSE', 'RMSE_covariates', 'MAPE', 'MAPE_covariates']]
    
    scores.set_index('group', inplace=True)

    return scores


### Linear Regression
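
All models in the following sections use the target lags -7 and -364 and, in the covariate variant, the future covariates at lag 0. As a rough sketch of what this means, the regression table for a target day t contains the target values 7 and 364 days earlier plus the covariate values of day t itself. The helper `build_lagged_table` below is a hypothetical illustration with shorter toy lags, not the internal darts implementation.

```python
import numpy as np

def build_lagged_table(y, lags, cov=None):
    """Build (X, target): row t holds y[t + lag] for each negative lag,
    followed by the covariate row cov[t] if covariates are given."""
    y = np.asarray(y, dtype=float)
    max_lag = max(-lag for lag in lags)   # first position with all lags available
    rows = range(max_lag, len(y))
    X = np.array([[y[t + lag] for lag in lags] for t in rows])
    if cov is not None:
        cov = np.asarray(cov, dtype=float)
        X = np.hstack([X, cov[max_lag:]])  # covariates of the target day itself (lag 0)
    return X, y[max_lag:]

y = np.arange(10, dtype=float)              # toy target: 0, 1, ..., 9
cov = (np.arange(10) % 2).reshape(-1, 1)    # toy covariate known for every day
X, target = build_lagged_table(y, lags=[-2, -5], cov=cov)
print(X.shape)
print(X[0], target[0])
```

Because the covariates enter at lag 0, they must be known for the forecast day in advance, which holds for calendar features and is approximately true for weather forecasts.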

In [None]:
# instantiate Linear Regression models

# w/o covariates
model = RegressionModel(lags=[-7, -364],
                    model=LinearRegression())

# w/ covariates
model_cov = RegressionModel(lags=[-7, -364], 
                    model=LinearRegression(), lags_future_covariates=[0]) # future covariates without lag

In [None]:
scores_lm = validate_historically(model, model_cov, train, future_covariates=covariates_scaled, stride=28, show_output=True)

### Random Forest

In [None]:
# instantiate RandomForest models

# w/o covariates
rf = RandomForest(lags=[-7, -364])

# w/ covariates
rf_cov = RandomForest(lags=[-7, -364], lags_future_covariates=[0]) # future covariates without lag

In [None]:
scores_rf = validate_historically(rf, rf_cov, train, future_covariates=covariates_scaled, stride=28, show_output=True)

### LightGBM

In [None]:
# instantiate LightGBM models

# w/o covariates
lgbm = LightGBMModel(lags=[-7, -364]) 

# w/ covariates
lgbm_cov = LightGBMModel(lags=[-7, -364], lags_future_covariates=[0]) # future covariates without lag

In [None]:
scores_lgbm = validate_historically(lgbm, lgbm_cov, train, future_covariates=covariates_scaled, stride=28, show_output=True)

### CatBoost

In [None]:
# instantiate CatBoost models

# w/o covariates
catb = CatBoostModel(lags=[-7, -364]) 

# w/ covariates
catb_cov = CatBoostModel(lags=[-7, -364], lags_future_covariates=[0]) # future covariates without lag

In [None]:
scores_catb = validate_historically(catb, catb_cov, train, future_covariates=covariates_scaled, stride=28, show_output=True)

### compare evaluation scores

In [None]:
scores_lm

In [None]:
scores_rf

In [None]:
scores_lgbm

In [None]:
scores_catb

In [None]:
def combine_scores(dataframes, model_names, component):
    
    # initialize dataframe
    scores = pd.DataFrame({'model': [], 'RMSE': [], 'RMSE_covariates': [], 'MAPE': [], 'MAPE_covariates': []})
    
    # append data
    scores['model'] = model_names
    scores['RMSE'] = [df.loc[component, 'RMSE'] for df in dataframes]
    scores['RMSE_covariates'] = [df.loc[component, 'RMSE_covariates'] for df in dataframes]
    scores['MAPE'] = [df.loc[component, 'MAPE'] for df in dataframes]
    scores['MAPE_covariates'] = [df.loc[component, 'MAPE_covariates'] for df in dataframes]

    scores.set_index('model', inplace=True)

    return scores

In [None]:
models = [scores_lm, scores_rf, scores_lgbm, scores_catb]

In [None]:
model_scores = combine_scores(models, ['lm', 'rf', 'lgbm', 'catb'], 'mean')

In [None]:
model_scores

### Conclusion
Gradient boosting models such as LightGBM seem to make the best use of covariates, as evidenced by the largest reduction in RMSE and MAPE after covariate inclusion. However, they still perform worse than a simple linear regression model.
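
The reduction after covariate inclusion can be quantified as a relative change per model. The sketch below uses placeholder numbers that are purely illustrative and are not results from this notebook; `relative_reduction` is a hypothetical helper.

```python
import pandas as pd

# placeholder scores for illustration only (NOT results from this notebook)
example = pd.DataFrame(
    {'RMSE': [20.0, 30.0], 'RMSE_covariates': [19.0, 24.0]},
    index=['model_a', 'model_b'],
)

def relative_reduction(scores, base_col, cov_col):
    """Percentage by which the covariate model lowers the error (positive = improvement)."""
    return (100 * (scores[base_col] - scores[cov_col]) / scores[base_col]).round(1)

example['RMSE_reduction_%'] = relative_reduction(example, 'RMSE', 'RMSE_covariates')
print(example)
```

Assuming its columns are numeric, the same helper could be applied to `model_scores`, e.g. `relative_reduction(model_scores, 'RMSE', 'RMSE_covariates')`.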

We will perform hyperparameter tuning on LightGBM using GridSearch to optimize forecasts.

LightGBM default parameters:
* n_estimators: 100
* boosting_type: gbdt
* learning_rate: 0.1
* num_leaves: 31
* max_depth: -1 (i.e. no limit)
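
A grid search evaluates every combination of the candidate values, so its cost grows multiplicatively with each added hyperparameter. The following sketch enumerates the combinations for the numeric candidates used in the search grid of this notebook; it only counts them and does not train any model.

```python
from itertools import product

# same numeric candidate values as in the search grid of this notebook
grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1],
}

# every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))   # 3 * 3 * 2 combinations
print(combinations[0])
```

Since every combination is backtested with retraining on every series, even a small grid can take a long time to run.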

In [None]:
# create hyperparameter dictionary for grid search
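lgbm_params = {
    'lags': [[-7, -364]], # gridsearch instantiates each candidate from this dictionary, so the target lags are listed here as well
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 9],
    'learning_rate': [0.01, 0.1],
    'boosting_type': ['dart'], # supposedly best boosting type
    'lags_future_covariates': [[0]]
}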

In [None]:
# instantiate LightGBM model w/ covariates
lgbm_cov = LightGBMModel(lags=[-7, -364], lags_future_covariates=[0], random_state=42)

In [None]:
# uncomment the following lines to run the grid search (note: this takes rather long)
#grid_results = pd.DataFrame({'group': [], 'best_model': [], 'score': []})
#
#for i, series in enumerate(train):
#
#    best_model = lgbm_cov.gridsearch(series=series, parameters=lgbm_params, future_covariates=covariates_scaled[i],
#                                            forecast_horizon=7, stride=28, start=0.8, verbose=True,
#                                            metric=mape, reduction=np.mean)
#    # print results of best local model
#    print(best_model)
#
#    # append scores
#    grid_results.loc[i, 'group'] = f'{series.static_covariates.iloc[:, 0][0]} | {series.static_covariates.iloc[:, 1][0]}'
#    grid_results.loc[i, 'best_model'] = str(best_model[1])
#    grid_results.loc[i, 'score'] = best_model[2]

In [None]:
#grid_results