## Meteo Bakery - LightGBM feature selection and gridsearch
In this notebook, we will implement a first version of a LightGBM model, test different feature sets, and perform gridsearch for hyperparameter tuning.

### import libraries

In [None]:
# import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import meteo_utils as meteo

from itertools import product
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

from lightgbm import LGBMRegressor

### load data

In [None]:
df = pd.read_csv('../data/data_final.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

### generate train and test df

In [None]:
df_train = df[df.year<2018]
df_test = df[df.year>=2018]

### Time Series Split Cross Validation
The number of splits is set to 52 and the test size to 7 days, so the validation folds together cover 52 weeks (364 days), i.e. roughly one year.

In [None]:
# extract an example time series for illustration purposes and perform TimeSeriesSplit
ts = df_train[(df_train['branch']=='Metro') & (df_train['product']=='Brown Bread')]['turnover']
tss = TimeSeriesSplit(n_splits=52, test_size=7, gap=0)

# plot repeated train-validation folds to get an idea of TimeSeriesSplit functionality
for train_i, val_i in tss.split(ts):
    ts_train = ts.iloc[train_i]
    ts_val = ts.iloc[val_i]

    plt.figure(figsize=(10, 1))
    ts_train[-500:].plot(c='blue', label='training')
    ts_val.plot(c='red', label='validation')

plt.show()
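To make the split geometry concrete, here is a minimal sketch on a synthetic index. The series length of 1000 days is an assumption chosen for illustration and is not the length of the actual bakery series.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# synthetic stand-in for one daily series (the length is an assumption)
n_days = 1000
dummy = np.arange(n_days)

tss = TimeSeriesSplit(n_splits=52, test_size=7, gap=0)
folds = list(tss.split(dummy))

first_train, first_val = folds[0]
last_train, last_val = folds[-1]

# total number of validation days across all folds
n_val_days = sum(len(val_i) for _, val_i in folds)

print(f"folds: {len(folds)}, validation days in total: {n_val_days}")
print(f"training size grows from {len(first_train)} to {len(last_train)} days")
```

Each fold validates on the 7 days directly after its training window, and the training window expands by one week from fold to fold.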

### Naive Seasonal baseline
The Naive Seasonal baseline model uses a seasonal lag of 7 days, i.e. it takes the sales of the preceding 7 days as the forecast for the upcoming 7 days. Whenever one of the preceding days is a closing day due to a holiday, that day is replaced by the sales data from 14 days ago. Whenever a closing day falls within the forecasting window, that day is dropped and the prediction is evaluated on the reduced test set.

In [None]:
naive = meteo.crossval_naive(df_train, grouping_vars=['branch', 'product'], target='turnover')

In [None]:
naive
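The core idea of the baseline can be sketched as follows. This is not the implementation of `meteo.crossval_naive`, whose internals are not shown in this notebook. The helper `seasonal_naive_forecast` and the toy values are hypothetical, and the closing-day handling described above is deliberately left out.

```python
import numpy as np
import pandas as pd


def seasonal_naive_forecast(history: pd.Series, horizon: int = 7, season: int = 7) -> np.ndarray:
    """Repeat the last `season` observed values to cover `horizon` future steps."""
    last_season = history.to_numpy()[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]


# toy daily series with made-up values (not real sales data)
idx = pd.date_range('2017-01-01', periods=14, freq='D')
toy = pd.Series(np.arange(1, 15, dtype=float), index=idx)

# first week serves as history, second week as the forecasting window
history, actual = toy.iloc[:7], toy.iloc[7:]
forecast = seasonal_naive_forecast(history)

# MAPE of the naive forecast on the toy window
mape = np.mean(np.abs((actual.to_numpy() - forecast) / actual.to_numpy()))
print(forecast, round(mape, 3))
```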

### LightGBM
We will now test LightGBM with different combinations of features.

### time features

In [None]:
time_features = ['turnover_lag_7', 'turnover_lag_365', 'month_sin', 'month_cos', 'day_of_week', 'school_holiday', 'public_holiday']


In [None]:
lgbm_time = meteo.crossval_lgbm(df_train, grouping_vars=['branch', 'product'], target='turnover', features=time_features)

In [None]:
lgbm_time
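The time features are loaded ready-made from `data_final.csv`, so their construction is not part of this notebook. As a rough sketch, columns with these names are commonly derived as shown below. The exact definitions used for the dataset are an assumption, and the toy frame holds a single series with made-up values.

```python
import numpy as np
import pandas as pd

# toy daily frame for a single series (values are made up)
idx = pd.date_range('2017-01-01', periods=400, freq='D')
toy = pd.DataFrame({'turnover': np.arange(400, dtype=float)}, index=idx)

# cyclical encoding so that December and January end up close together
month = toy.index.month.to_numpy()
toy['month_sin'] = np.sin(2 * np.pi * month / 12)
toy['month_cos'] = np.cos(2 * np.pi * month / 12)

# Monday=0 ... Sunday=6
toy['day_of_week'] = toy.index.dayofweek

# lagged target: same weekday one week earlier and same date one year earlier
# (row-based shift, which equals a day-based lag only for a gap-free daily index)
toy['turnover_lag_7'] = toy['turnover'].shift(7)
toy['turnover_lag_365'] = toy['turnover'].shift(365)
```

With several series in one frame, the shifts would have to be applied per branch and product (for example via `groupby`), otherwise the lags would leak across series.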

### basic weather statistics (temperature, humidity, rain, snow)
Here, we will add daily weather aggregate features as predictors to assess any add-on effect in addition to the temporal features. We will use mean temperature, humidity, rain, and snow as weather features since they appear most promising based on previous EDA results.

In [None]:
weather_features = ['turnover_lag_7', 'turnover_lag_365', 'month_sin', 'month_cos', 'day_of_week', 'school_holiday', 'public_holiday',
                                    'temp_mean', 'humidity_mean', 'rain_1h_mean', 'snow_1h_mean']

In [None]:
lgbm_weather = meteo.crossval_lgbm(df_train, grouping_vars=['branch', 'product'], target='turnover', features=weather_features)

In [None]:
lgbm_weather

Including basic weather statistics slightly reduced the mean MAPE score.

### basic weather statistics and climatological days
In addition to aggregated features, we calculated features for climatological days according to [DWD](https://www.dwd.de/DE/service/lexikon/Functions/glossar.html;jsessionid=EB2D3A27D634826A0176255436956DA7.live21064?lv2=101334&lv3=101452) based on our weather statistics. We first performed some basic EDA to test which climatological days could serve as potential predictors in the LGBM model.
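For orientation, four of the temperature-based climatological days can be flagged with simple thresholds. The sketch below uses the standard DWD definitions (ice day: maximum below 0 °C, frost day: minimum below 0 °C, summer day: maximum of at least 25 °C, hot day: maximum of at least 30 °C). The column names `temp_min` and `temp_max` and the toy values are assumptions, and the features in the dataset may have been computed differently.

```python
import pandas as pd

# toy daily temperature extremes in degrees Celsius (made-up values)
weather = pd.DataFrame({
    'temp_min': [-5.0, -2.0, 12.0, 18.0],
    'temp_max': [-1.0, 4.0, 26.0, 31.0],
})

# threshold-based flags, encoded as 0/1
weather['day_icy'] = (weather['temp_max'] < 0).astype(int)
weather['day_frosty'] = (weather['temp_min'] < 0).astype(int)
weather['day_summer'] = (weather['temp_max'] >= 25).astype(int)
weather['day_hot'] = (weather['temp_max'] >= 30).astype(int)

print(weather)
```

Under these thresholds every icy day is also a frosty day and every hot day is also a summer day, so the flags within each pair overlap by construction.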

#### visualize relative occurrence of climatological days depending on month

In [None]:
for day in ['day_icy', 'day_frosty', 'day_thunder', 'day_hot', 'day_clear','day_hazy', 'day_rainy', 'day_summer', 'day_murky']:
    plt.figure(figsize=(7,1))
    sns.barplot(data=df, y=day, x='month', color='white', edgecolor='blue')
    plt.yticks(ticks=np.arange(0, 0.81, 0.2));

#### visualize differences in sales for different product categories depending on climatological days

In [None]:
for day in ['day_icy', 'day_frosty', 'day_thunder', 'day_hot', 'day_clear','day_hazy', 'day_rainy', 'day_summer', 'day_murky']:
    plt.figure(figsize=(5, 2))
    sns.barplot(data=df, x='product', y='turnover', edgecolor='blue', hue=day)
    plt.xticks(rotation=45, ha='right')
    plt.legend(title=day, bbox_to_anchor=(1.05, 1.0), loc='upper left')

Day frosty and day icy measure similar weather conditions and have similar effects. The same holds true for day hot and day summer. However, day frosty and day summer occur more often, so we will use these two as predictors and drop day icy and day hot.

Day rainy almost never occurs and shows almost no effect, so it is not used as a predictor. Day murky is also not used since it doesn't have any clear effect and represents the counterpart to day clear.

Day clear and day hazy don't seem to have clear effects either, but are included as predictors at this stage.

In [None]:
weather_climat_features = ['turnover_lag_7', 'turnover_lag_365', 'month_sin', 'month_cos', 'day_of_week', 'school_holiday', 'public_holiday',
                                    'temp_mean', 'humidity_mean', 'rain_1h_mean', 'snow_1h_mean', 
                                    'day_frosty', 'day_thunder', 'day_clear','day_hazy', 'day_summer']

In [None]:
lgbm_climat = meteo.crossval_lgbm(df_train, grouping_vars=['branch', 'product'], target='turnover', features=weather_climat_features)

In [None]:
lgbm_climat

Adding climatological days as predictors slightly decreased the mean MAPE score further.

### basic weather statistics, climatological days, seasonal deviations, daily changes and lead features
Finally, we will also assess a set of weather features measuring seasonal deviations in weather, day-to-day changes, as well as 1-day lead weather features that capture effects of the weather anticipated for the next day.

In [None]:
weather_climat_dev_features = ['turnover_lag_7', 'turnover_lag_365', 'month_sin', 'month_cos', 'day_of_week', 'school_holiday', 'public_holiday',
                               'temp_mean', 'humidity_mean', 'rain_1h_mean', 'snow_1h_mean',
                               'day_frosty', 'day_thunder', 'day_clear', 'day_hazy', 'day_summer',
                               'temp_mean_dev', 'humidity_mean_dev', 'pressure_mean_dev', 'rain_1h_mean_dev', 'snow_1h_mean_dev',
                               'temp_mean_change', 'pressure_mean_change', 'humidity_mean_change',
                               'temp_mean_lead_1', 'rain_1h_mean_lead_1', 'snow_1h_mean_lead_1']

In [None]:
lgbm_dev = meteo.crossval_lgbm(df_train, grouping_vars=['branch', 'product'], target='turnover', features=weather_climat_dev_features)

In [None]:
lgbm_dev

Including seasonal deviations, daily weather changes, and lead features reduced the mean MAPE score more notably, by almost 0.5%.
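The three feature families in this set are precomputed in the dataset. A minimal sketch of how such columns could be derived is shown below. The definitions are assumptions (deviation from the mean of the same calendar month, first difference, and a one-day forward shift), and the toy temperatures are randomly generated.

```python
import numpy as np
import pandas as pd

# toy daily mean temperature over two years (random, for illustration only)
idx = pd.date_range('2016-01-01', '2017-12-31', freq='D')
rng = np.random.default_rng(0)
toy = pd.DataFrame({'temp_mean': rng.normal(10, 5, len(idx))}, index=idx)

# seasonal deviation: value minus the average of the same calendar month (assumed definition)
monthly_mean = toy.groupby(toy.index.month)['temp_mean'].transform('mean')
toy['temp_mean_dev'] = toy['temp_mean'] - monthly_mean

# day-to-day change: difference to the previous day
toy['temp_mean_change'] = toy['temp_mean'].diff()

# 1-day lead: the value of the following day
toy['temp_mean_lead_1'] = toy['temp_mean'].shift(-1)
```

Since a lead feature looks one day ahead, at prediction time it would have to be filled from a weather forecast rather than from observations.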

### GridSearch on LightGBM
After identifying the best combination of features, we will perform hyperparameter tuning using GridSearch to further optimize LGBM forecasting performance. Specifically, we will test different boosting types, numbers of estimators, and different learning rates.

In [None]:
# create hyperparameter dictionary for grid search
lgbm_params = {
    'boosting_type': ['gbdt', 'dart'],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.08, 0.1]
}

In [None]:
# initialize df for storing results from gridsearch
grid_results = pd.DataFrame({'boosting_type': [], 'n_estimators': [], 'learning_rate': [], 'MAPE': []})

for i, params in enumerate(product(lgbm_params['boosting_type'], lgbm_params['n_estimators'], lgbm_params['learning_rate'])):
    print(params)

    local_params = {
        'boosting_type': params[0],
        'n_estimators': int(params[1]),
        'learning_rate': float(params[2])}
    
                
    lgbm = LGBMRegressor(objective='regression', importance_type='gain', random_state=42, **local_params)

    # collect the mean MAPE of each individual time series for this hyperparameter configuration
    mapes_local = []

    # iterate over all individual series and perform cross-validation
    for k, group in enumerate(product(df_train['branch'].unique(), df_train['product'].unique())):
        # subselect time series
        ts = df_train[(df_train['branch']==group[0]) & (df_train['product']==group[1])].copy()

        # perform cross validation
        tss = TimeSeriesSplit(n_splits=52, test_size=7, gap=0)

        scores = []
        for train_i, val_i in tss.split(ts):

            train = ts.iloc[train_i]
            val = ts.iloc[val_i]
                    
            # note: this uses weather_climat_features, i.e. without the deviation, change and lead features
            X_train = train[weather_climat_features]
            X_val = val[weather_climat_features]
            y_train = train['turnover']
            y_val = val['turnover']

            lgbm.fit(X_train, y_train)
            y_pred = pd.Series(lgbm.predict(X_val))
            # correct for holiday effects in validation set if necessary:
            # rows where y_val equals 1 are treated as holidays and dropped from both y_val and y_pred
            is_holiday = (y_val == 1).to_numpy()
            if is_holiday.any():
                y_val = y_val[~is_holiday]
                y_pred = y_pred[~is_holiday]

            mape = mean_absolute_percentage_error(y_val, y_pred)
            scores.append(mape)
        
        # calculate mean MAPE score for individual time series
        mean_score = np.mean(scores)
        mapes_local.append(mean_score)
    
    grid_results.loc[i, 'boosting_type'] = params[0]
    grid_results.loc[i, 'n_estimators'] = params[1]
    grid_results.loc[i, 'learning_rate'] = params[2]
    grid_results.loc[i, 'MAPE'] = np.mean(mapes_local)

In [None]:
grid_results

The lowest mean MAPE score (14.05%) was reached with the following hyperparameters:
* boosting type: dart
* number of estimators: 200
* learning rate: 0.08
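As a compact alternative to reading the best row off the table by eye, the grid can be enumerated and the minimum selected programmatically. The sketch below reuses the hyperparameter grid from above. The scores are placeholders generated for illustration only and are not results from this notebook.

```python
from itertools import product

import pandas as pd

lgbm_params = {
    'boosting_type': ['gbdt', 'dart'],
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.05, 0.08, 0.1],
}

# one dict per hyperparameter combination (2 * 3 * 3 = 18 in total)
grid = [dict(zip(lgbm_params, combo)) for combo in product(*lgbm_params.values())]

# placeholder scores only: substitute the cross-validated mean MAPE per configuration
placeholder_scores = [0.20 - 0.001 * i for i in range(len(grid))]
results = pd.DataFrame(grid).assign(MAPE=placeholder_scores)

# row with the lowest MAPE
best = results.loc[results['MAPE'].idxmin()]
print(best)
```

Applied to the real table, the same selection would be `grid_results.loc[grid_results['MAPE'].idxmin()]`.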

In [None]:
grid_results.to_csv('../models/LGBM_hyperparams.csv', index=False)