## Meteo Bakery - LightGBM optimized
In this notebook, we implement an optimized LightGBM model based on grid search results and assess feature importance.

### import libraries

In [None]:
# import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import meteo_utils as meteo

### load data

In [None]:
df = pd.read_csv('../data/data_final.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

### generate train and test df

In [None]:
df_train = df[df.year<2018]
df_test = df[df.year>=2018]

### Naive Seasonal baseline
The Naive Seasonal baseline model uses a seasonal lag of 7 days, i.e. it takes the sales of the preceding 7 days as the forecast for the upcoming 7 days. Whenever one of the preceding days was a closing day due to holidays, that day is replaced by the sales data from 14 days ago. Whenever a closing day falls within the forecasting window, that day is dropped and the prediction is evaluated on the reduced test set.
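
The rule described above can be illustrated with a minimal sketch. This is not the implementation behind `meteo.crossval_naive`, which is not shown in this notebook. The function name `seasonal_naive_forecast`, the toy series, and the convention that closing days are stored as `NaN` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def seasonal_naive_forecast(sales: pd.Series, forecast_start: str,
                            horizon: int = 7) -> pd.Series:
    """Forecast each day with the sales observed 7 days earlier.

    Assumption (hypothetical convention): closing days are NaN in `sales`.
    """
    forecast_days = pd.date_range(forecast_start, periods=horizon, freq="D")
    predictions = {}
    for day in forecast_days:
        # Closing day inside the forecasting window: drop it (reduced test set).
        if pd.isna(sales.get(day, np.nan)):
            continue
        value = sales.get(day - pd.Timedelta(days=7), np.nan)
        # Reference day was a closing day: fall back to 14 days ago.
        if pd.isna(value):
            value = sales.get(day - pd.Timedelta(days=14), np.nan)
        predictions[day] = value
    return pd.Series(predictions, dtype=float)


# Toy data: three weeks of sales, where the value equals the day of the month.
dates = pd.date_range("2017-01-01", "2017-01-21", freq="D")
sales = pd.Series(np.arange(1, 22, dtype=float), index=dates)
sales.loc["2017-01-10"] = np.nan  # closing day in the reference week
sales.loc["2017-01-16"] = np.nan  # closing day in the forecasting window

forecast = seasonal_naive_forecast(sales, "2017-01-15")
print(forecast)
```

In this toy run, 16 January is missing from the forecast because it is a closing day, and 17 January falls back to the value from 3 January because 10 January was closed.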

In [None]:
naive = meteo.crossval_naive(df_train, grouping_vars=['branch', 'product'], target='turnover')

In [None]:
naive

### LightGBM
Here, we implement an optimized LightGBM model based on grid search results and assess how including weather features improves forecasting performance compared to using temporal features only.

In [None]:
# define hyperparameters based on previous grid search results
params_optimal = {
    'boosting_type': 'dart',
    'n_estimators': 200,
    'learning_rate': 0.1
}

In [None]:
# define feature sets
time_features = ['turnover_lag_7', 'turnover_lag_365', 'month_sin', 'month_cos', 'day_of_week', 'school_holiday', 'public_holiday']

# weather feature set = temporal features plus weather-derived features
weather_features = time_features + [
    'temp_mean', 'humidity_mean', 'rain_1h_mean', 'snow_1h_mean',
    'day_frosty', 'day_thunder', 'day_clear', 'day_hazy', 'day_summer',
    'temp_mean_dev', 'humidity_mean_dev', 'pressure_mean_dev', 'rain_1h_mean_dev', 'snow_1h_mean_dev',
    'temp_mean_change', 'pressure_mean_change', 'humidity_mean_change',
    'temp_mean_lead_1', 'rain_1h_mean_lead_1', 'snow_1h_mean_lead_1',
]
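
The temporal features are loaded ready-made from `data_final.csv`, so their construction is not part of this notebook. As a rough sketch, and assuming the names mean what they suggest (lagged turnover and a cyclical encoding of the month), such columns could be derived as follows. The toy frame and the exact formulas are assumptions, not the project's actual preprocessing.

```python
import numpy as np
import pandas as pd

# Toy daily series for a single branch/product combination (hypothetical data).
dates = pd.date_range("2016-01-01", "2017-12-31", freq="D")
toy = pd.DataFrame({"turnover": np.arange(len(dates), dtype=float)}, index=dates)

# Lag features: turnover 7 and 365 rows earlier.
# Assumption: one row per calendar day without gaps, so rows equal days.
toy["turnover_lag_7"] = toy["turnover"].shift(7)
toy["turnover_lag_365"] = toy["turnover"].shift(365)

# Cyclical month encoding: December and January end up close together.
month = toy.index.month.to_numpy()
toy["month_sin"] = np.sin(2 * np.pi * month / 12)
toy["month_cos"] = np.cos(2 * np.pi * month / 12)

# Day of the week as an integer (Monday = 0).
toy["day_of_week"] = toy.index.dayofweek

print(toy.dropna().head())
```

With several branch/product combinations in one frame, the shifts would have to be applied per group (for example via `groupby(...).shift(...)`) so that lags do not leak across time series.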

In [None]:
lgbm_time = meteo.crossval_lgbm(df_train, grouping_vars=['branch', 'product'], target='turnover', features=time_features,
                            lgbm_kwargs=params_optimal)

In [None]:
lgbm_weather = meteo.crossval_lgbm(df_train, grouping_vars=['branch', 'product'], target='turnover', features=weather_features,
                            lgbm_kwargs=params_optimal)

In [None]:
lgbm_time

In [None]:
lgbm_weather

Including weather features in the LightGBM model further reduced the mean MAPE from 14.3% to 13.9% (0.4 percentage points) compared to using temporal features only.
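
The `MAPE_mean` and `MAPE_std` columns presumably summarize the MAPE over several cross-validation folds. The sketch below shows one way such a summary could be computed. The fold layout (expanding training window, 7-day test windows), the helper names `mape` and `expanding_window_folds`, and the seasonal naive stand-in for a fitted model are assumptions; the internals of `meteo.crossval_lgbm` are not shown in this notebook.

```python
import numpy as np


def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error in percent (y_true must be non-zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


def expanding_window_folds(n_samples: int, n_folds: int, horizon: int = 7):
    """Yield (train_idx, test_idx); each test window lies after its training data."""
    for k in range(n_folds, 0, -1):
        test_end = n_samples - (k - 1) * horizon
        test_start = test_end - horizon
        yield np.arange(test_start), np.arange(test_start, test_end)


# Toy series with a linear trend (hypothetical data).
y = 100.0 + np.arange(70)

fold_scores = []
for train_idx, test_idx in expanding_window_folds(len(y), n_folds=3):
    # Stand-in for a fitted model: predict the value observed 7 days earlier.
    y_pred = y[test_idx - 7]
    fold_scores.append(mape(y[test_idx], y_pred))

mape_mean = np.mean(fold_scores)
mape_std = np.std(fold_scores)
print(f"MAPE per fold: {np.round(fold_scores, 2)}")
print(f"MAPE mean: {mape_mean:.2f}%, MAPE std: {mape_std:.2f}%")
```

Because the test window always follows the training window, no future observations are used for fitting, which is the main reason to prefer this layout over shuffled folds for time series.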

### merge cross-validation results and export to csv

In [None]:
# merge
scores_merged = pd.concat([naive, lgbm_time[['MAPE_mean', 'MAPE_std']], lgbm_weather[['MAPE_mean', 'MAPE_std']]], axis=1)
scores_merged.columns = ['group', 'MAPE_mean_naive', 'MAPE_std_naive', 'MAPE_mean_lgbm_time', 'MAPE_std_lgbm_time',
                            'MAPE_mean_lgbm_weather', 'MAPE_std_lgbm_weather']

In [None]:
# save cross-validation results
scores_merged.to_csv('../models/lgbm_optimized/cross_validation.csv', index=False)

### calculate feature importance for optimized LGBM model

In [None]:
# define directory for saving trained LightGBM models
directory = '../models/lgbm_optimized/trained_models/'

# train and extract feature importance
lgbm_fimportance = meteo.get_lgbm_feature_importance(df_train, grouping_vars=['branch', 'product'], target='turnover', features=weather_features,
                            lgbm_kwargs=params_optimal, filepath=directory)

In [None]:
lgbm_fimportance

### calculate relative feature importance

In [None]:
# calculate feature importance in percent relative to total gain for that time series
lgbm_fimportance_rel = lgbm_fimportance.copy()

# sum gain across all features per time series to get total gain
lgbm_fimportance_rel['sum'] = lgbm_fimportance_rel[lgbm_fimportance_rel.columns[1:]].sum(axis=1)

# divide individual importances by total gain and multiply by 100 to get relative feature importance
lgbm_fimportance_rel[lgbm_fimportance_rel.columns[1:]] = lgbm_fimportance_rel[lgbm_fimportance_rel.columns[1:]].div(lgbm_fimportance_rel['sum'], axis=0) * 100
lgbm_fimportance_rel.drop(columns=['sum'], inplace=True)

# replace the mean row with the relative importance per feature averaged over all branch/product combinations
# (normalizing the mean gains by their sum would weight time series by total gain and does not give consistent results)
lgbm_fimportance_rel.loc[15, lgbm_fimportance_rel.columns[1:]] = [np.mean(lgbm_fimportance_rel[x]) for x in lgbm_fimportance_rel.columns[1:]]

In [None]:
lgbm_fimportance_rel
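
The normalization above, and the reason for averaging the per-series percentages instead of normalizing the mean gains, can be checked on a small toy example. The group names and gain values below are made up for illustration.

```python
import pandas as pd

# Hypothetical gain-based importances for two time series and two features.
gain = pd.DataFrame({
    "group": ["branch_a_bread", "branch_b_bread"],
    "turnover_lag_7": [80.0, 10.0],
    "temp_mean": [20.0, 10.0],
})
features = gain.columns[1:]

# Relative importance per time series: each row sums to 100 %.
rel = gain.copy()
rel[features] = gain[features].div(gain[features].sum(axis=1), axis=0) * 100

# Option 1: average the per-series percentages (every series counts equally).
mean_of_shares = rel[features].mean()

# Option 2: normalize the mean gains (series with large total gain dominate).
mean_gain = gain[features].mean()
share_of_means = mean_gain / mean_gain.sum() * 100

print(rel)
print(mean_of_shares)   # turnover_lag_7: 65 %, temp_mean: 35 %
print(share_of_means)   # turnover_lag_7: 75 %, temp_mean: 25 %
```

The two options disagree because the first time series has a five times larger total gain and therefore dominates the second option.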

Among the 27 included features, temporal features show the highest feature importance when averaged over all branch/product combinations. The turnover 7 days earlier was the most important feature, with a relative importance of nearly 50%, followed by the day of the week, public holidays, and the turnover 365 days earlier.

The mean daily temperature and the seasonal deviation in snowfall ranked fifth and sixth, each with a relative importance above 2%.

### export feature importance to csv

In [None]:
# export feature importance
lgbm_fimportance.to_csv('../models/lgbm_optimized/feature_importance.csv', index=False)
lgbm_fimportance_rel.to_csv('../models/lgbm_optimized/rel_feature_importance.csv', index=False)