# <center> Predictive modelling with timeseries</center>
# <center> Part 4 - Time series forecasting with tree algorithms</center>

![Image](images/timeseries.jpg)

Tree algorithms (random forests, XGBoost, CatBoost, etc.) are rarely the first choice that comes to mind for analysing a time series, but they can be extremely helpful. It is important to understand the pros and cons of using trees for time series problems:

**PROS:** 👍🏼
* they can handle many and varied features
* they can handle small datasets  (keep in mind, one year of daily data is only 365 records)
  
**CONS:** 🥴
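* they cannot extrapolate: each tree predicts a constant per leaf, so predictions stay flat outside the range seen in training. You can't model a series with a trend unless you make some adjustments through preprocessing and feature engineering (for example, detrending or differencing the target).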
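
The extrapolation limit can be seen on a toy series. The sketch below is a hand-rolled, single-feature regression tree written only for illustration (it is not how XGBoost builds its trees, and `fit_tree` / `predict_tree` are hypothetical helper names). It is fitted on a linear trend and then asked for values beyond the training range.

```python
import numpy as np

def fit_tree(x, y, depth):
    """Fit a tiny regression tree on one feature; each leaf stores the mean target."""
    if depth == 0 or len(np.unique(x)) < 2:
        return float(y.mean())
    best_sse, best_split = None, None
    # try every candidate threshold and keep the one with the lowest squared error
    for split in np.unique(x)[1:]:
        left, right = y[x < split], y[x >= split]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best_sse is None or sse < best_sse:
            best_sse, best_split = sse, split
    mask = x < best_split
    return (best_split,
            fit_tree(x[mask], y[mask], depth - 1),
            fit_tree(x[~mask], y[~mask], depth - 1))

def predict_tree(tree, value):
    """Walk down the tree until a leaf (a plain float) is reached."""
    while isinstance(tree, tuple):
        split, left, right = tree
        tree = left if value < split else right
    return tree

# toy series with a pure linear trend (made-up data)
t = np.arange(100.0)
y = 2 * t + 10

tree = fit_tree(t, y, depth=3)

# both future points fall into the same right-most leaf
pred_150 = predict_tree(tree, 150.0)
pred_200 = predict_tree(tree, 200.0)
print(pred_150, pred_200, 'largest training target:', y.max())
```

Both future predictions are identical and never exceed the largest training target, even though the true series keeps growing. Modelling the detrended or differenced series and adding the trend back afterwards is one common workaround.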



In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# jupyter lab configs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
plotly.offline.init_notebook_mode(connected=True)

from utils import print_errors

# Exercise: Sales prediction with the Rossmann dataset

# ETL

### Load the datasets

In [None]:
# training data
rossman_df = pd.read_csv('datasets/rossman_train.csv').reset_index(drop=True)
# parse the time column as datetime
rossman_df.Date = pd.to_datetime(rossman_df.Date)
rossman_df.head(4)

# load store info
stores = pd.read_csv('datasets/rossman_store.csv').reset_index(drop=True)
stores.head(4)

# merge store and sales
rossman_df = pd.merge(rossman_df, stores, how='left', on='Store')

# sanity check: drop rows where the store was open but recorded no sales
rossman_df = rossman_df[~((rossman_df.Sales<1)&(rossman_df.Open==1))]

General check of features

In [None]:
rossman_df.Open.unique()
rossman_df.Promo.unique()
rossman_df.StateHoliday.unique()
rossman_df.SchoolHoliday.unique()

How to deal with missing values?  
You can drop them or fill them in via imputation. If you decide to impute, compute the fill values on the training data and keep them, so you can apply the same values to the test data later.
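
A minimal sketch of that idea, assuming a numeric column such as `CompetitionDistance` contains gaps and that the median is an acceptable fill value (both are assumptions for illustration; the toy frames below stand in for the real splits):

```python
import numpy as np
import pandas as pd

# toy frames standing in for the train and test splits (values are made up)
train = pd.DataFrame({'CompetitionDistance': [100.0, np.nan, 300.0, 2000.0]})
test = pd.DataFrame({'CompetitionDistance': [np.nan, 500.0]})

# learn the fill values on the training data only
fill_values = {'CompetitionDistance': train['CompetitionDistance'].median()}

# apply the same stored values to every split
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Computing the median on the training rows only avoids leaking information from the test period into the model.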

### Add features

In [None]:
def add_time_features(df):
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['DayOfWeek'] = df.Date.dt.dayofweek
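    # Series.dt.weekofyear was removed in pandas 2.0; isocalendar() is the replacement
    df['WeekOfYear'] = df.Date.dt.isocalendar().week.astype(int)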
    return df

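def recode(df, var_list=()):
    # map category codes to integers; cast to str first because a column can mix 0 and '0'
    map_dict = {'0':0, 'a':1, 'b':2, 'c':3, 'd':4}
    for v in var_list:
        # assign back instead of chained inplace replace, which may silently do nothing
        df[v] = df[v].astype(str).map(map_dict)
    return df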
    

### One-hot  encoding

Which features should be transformed to categorical?
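
As a starting point, here is a hedged sketch of one-hot encoding with `pd.get_dummies`. It assumes `StoreType` and `Assortment` are the columns you want to treat as categorical, which is one possible answer rather than the only one; the toy frame is made up.

```python
import pandas as pd

# toy frame with the original letter codes (values are made up)
toy = pd.DataFrame({'Store': [1, 2, 3],
                    'StoreType': ['a', 'c', 'a'],
                    'Assortment': ['a', 'c', 'b']})

# one indicator column per category level; the original columns are dropped
encoded = pd.get_dummies(toy, columns=['StoreType', 'Assortment'], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Encoding the full frame before splitting keeps the indicator columns identical across the train, validation and test sets.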

Any other features you wanna try out? Add them below

In [None]:
# incorporate the new features
rossman_df = add_time_features(rossman_df)
rossman_df = recode(rossman_df, ['StoreType', 'Assortment', 'StateHoliday'])

---

### Split train, validation and test 

In [None]:
rossman_df.Date.min(), rossman_df.Date.max()
# make sure data is sorted by date before splitting
rossman_df = rossman_df.sort_values('Date').reset_index(drop=True)


# let's split the data into train, validation and test
# using two months of sales as test length
val_start = '2015-04-01'
test_start = '2015-06-01'
rossman_df['dataset_type'] = ''
rossman_df.loc[rossman_df.Date < val_start, 'dataset_type'] = 'train'
# Series.between is inclusive on both ends, so use an explicit half-open interval
rossman_df.loc[(rossman_df.Date >= val_start) & (rossman_df.Date < test_start), 'dataset_type'] = 'validation'
rossman_df.loc[rossman_df.Date >= test_start, 'dataset_type'] = 'test'

train_df = rossman_df[rossman_df.dataset_type == 'train']
val_df = rossman_df[rossman_df.dataset_type == 'validation']
test_df = rossman_df[rossman_df.dataset_type == 'test']

In [None]:
fig = px.line(train_df[(train_df.Store<5)&(train_df.Sales>0)], x='Date', y="Sales", color='Store', 
              title="Sales per store - training data",  width=900, height=500,
             hover_data = ['Open','Promo','StateHoliday','SchoolHoliday'])
fig.show()

---

## Create baseline and check model performance
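
One possible baseline, sketched on made-up data: predict the average training sales for each store and day of week, then measure MAE and RMSE on the held-out period. The choice of baseline and of metrics is an assumption for illustration; the notebook's own `print_errors` helper may report different metrics.

```python
import numpy as np
import pandas as pd

# toy data: two stores, three weeks of history plus one week to predict (values are made up)
dates = pd.date_range('2015-01-05', periods=28, freq='D')
toy = pd.DataFrame({'Date': np.tile(dates, 2),
                    'Store': np.repeat([1, 2], len(dates))})
toy['DayOfWeek'] = toy.Date.dt.dayofweek
toy['Sales'] = 1000 * toy.Store + 100 * toy.DayOfWeek

train = toy[toy.Date < '2015-01-26']
holdout = toy[toy.Date >= '2015-01-26']

# baseline: average training sales per store and day of week
profile = (train.groupby(['Store', 'DayOfWeek'], as_index=False)['Sales'].mean()
                .rename(columns={'Sales': 'baseline'}))
holdout = holdout.merge(profile, on=['Store', 'DayOfWeek'], how='left')

# error metrics on the held-out week
mae = (holdout.Sales - holdout.baseline).abs().mean()
rmse = np.sqrt(((holdout.Sales - holdout.baseline) ** 2).mean())
print('MAE:', mae, 'RMSE:', rmse)
```

The toy sales are noise-free, so the baseline is exact here. On real data the baseline error gives the number that any tree model has to beat.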

---

## XGBoost

In [None]:
import xgboost as xgb
from xgboost import plot_importance

In [None]:
target = 'Sales'

Create several feature sets, ranging from a simple model to a more complex one.

In [None]:
feat_dict = {'only_time' : ['Year', 'Month', 'Day', 'WeekOfYear', 'DayOfWeek', 'StateHoliday', 'SchoolHoliday', 'Open'],
             'only_comp' : ['CompetitionDistance', 'Promo2', 'Open', 'Promo'],
             'only_store' : ['StoreType', 'Assortment', 'Open', 'Store'],
             'all_feat' : ['Year', 'Month', 'Day', 'WeekOfYear', 'DayOfWeek','StateHoliday', 'SchoolHoliday', 
                           'CompetitionDistance', 'Promo2', 'StoreType', 'Assortment', 'Open', 'Promo', 'Store']}

In [None]:
# set the minimum parameters necessary to run XGBoost
params = {"objective": "reg:squarederror", 
          "booster" : "gbtree", 
          "seed": 10 }

In [None]:
def xgboost_experiment(vars_list, experiment_name, params, num_boost_round):
    """Train on train_df with early stopping on val_df, print test and train errors, return the model."""
    dtrain = xgb.DMatrix(train_df[vars_list], label=train_df[target])
    deval = xgb.DMatrix(val_df[vars_list], label=val_df[target])
    dtest = xgb.DMatrix(test_df[vars_list], label=test_df[target])
    
    #train
    xgb_model = xgb.train(params, dtrain, num_boost_round=num_boost_round, 
                      early_stopping_rounds=20, evals=[(deval, "Eval")], verbose_eval=False)

    # make prediction
    print('+++++ Results for experiment: ', experiment_name)
    pred = xgb_model.predict(dtest)
    print_errors(test_df[target], pred, 'test dataset')
    pred = xgb_model.predict(dtrain)
    print_errors(train_df[target], pred, 'train dataset') 
    return xgb_model

### Run experiments with different combinations of features

In [None]:
for f in feat_dict.keys():
    xgboost_experiment(feat_dict[f], f, params, 10)
    

## Run experiments with different numbers of trees

In [None]:
num_boost_round_list = [100, 1000, 5000]

for n in num_boost_round_list:
    print('### Experiment with ', str(n), ' boosting rounds')
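    # name the experiment after the number of rounds instead of reusing the stale loop variable f
    xgboost_experiment(feat_dict['all_feat'], 'all_feat, ' + str(n) + ' rounds', params, n)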
    

## Use the best set of features and another set of params

In [None]:
params = {"objective": "reg:squarederror", #since it is a regression problem
          "booster" : "gbtree",     #tree
          "eta": 0.03,              #learning rate; lower values reduce overfitting
          "max_depth": 10,          #maximum depth of each tree
          "subsample": 0.9,         #fraction of rows sampled before growing each tree; reduces overfitting
          "colsample_bytree": 0.7,  #subsampling of columns for each tree
          "seed": 10                
          }

In [None]:
model1 = xgboost_experiment(feat_dict['all_feat'], 'test with different hyperparameters', params, 10)

In [None]:
plot_importance(model1)

## Model evaluation

In [None]:
rossman_df['predicted'] = model1.predict(xgb.DMatrix(rossman_df[feat_dict['all_feat']]))
rossman_df['abs_error'] = np.absolute(rossman_df['predicted']  - rossman_df['Sales']) 

agg_dict = {'abs_error': ['mean', 'std']}
rossman_df.groupby('dataset_type').agg(agg_dict)

### A good idea is to check the model error per store type and assortment

In [None]:
rossman_df[rossman_df.dataset_type=='test'].groupby(['Assortment','StoreType']).agg(agg_dict).reset_index()

### Choose a particular store and plot predictions

In [None]:
store_df = rossman_df[rossman_df.Store==379]
fig = px.line(store_df, x='Date', y="Sales", color='dataset_type', 
              title="Sales and predictions - store 379",  width=900, height=500,
             hover_data = ['Open','Promo','StateHoliday','SchoolHoliday'])
# go.Line is deprecated; use go.Scatter and take x and y from the same store
fig.add_trace(go.Scatter(x=store_df.Date, y=store_df.predicted,
                    mode='lines', name='predictions'))
fig.show()

# There's a lot of room for improvement

### Repeat these experiments after implementing one-hot encoding, grid search...
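
For the grid search, a plain-Python sketch could look like the following. `grid_search` and `toy_validation_error` are hypothetical names introduced here; in the real exercise the scoring function would train a model on `train_df` with the candidate parameters and return its error on `val_df`.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_score, best_params) for the lowest score."""
    names = list(param_grid)
    best_score, best_params = float('inf'), None
    for values in product(*(param_grid[n] for n in names)):
        candidate = dict(zip(names, values))
        score = score_fn(candidate)
        if score < best_score:
            best_score, best_params = score, candidate
    return best_score, best_params

# hypothetical stand-in for "train with these params, return the validation error"
def toy_validation_error(p):
    return abs(p['max_depth'] - 6) + abs(p['eta'] - 0.1)

grid = {'max_depth': [4, 6, 10], 'eta': [0.03, 0.1, 0.3]}
best_score, best_params = grid_search(grid, toy_validation_error)
print(best_score, best_params)
```

Scoring on the validation period rather than the test period keeps the test set untouched for the final evaluation.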

---

<a href='https://www.freepik.com/vectors/business'>Business vector created by freepik - www.freepik.com</a>