# Store Sales Analysis and Forecasting
📋 **Table of Contents:**
1. [EDA](#EDA)
    1. [Sales](#sales)
        - [Seasonality](#seasonality)
    1. [Oil Price](#oil)
    1. [Holidays / Events](#events)
1. [Machine Learning Forecasting](#forecasting)
    1. [Preprocessing](#preprocessing)
        - [Optimizing trainset length](#train-length)
    1. [Models comparison](#models)
        - [CustomRegressor](#customreg)
    1. [Predictions on test set](#testpreds)

---
📉 **Evaluation Metric:**  
The evaluation metric for this competition is Root Mean Squared Logarithmic Error.  
\begin{align}
RMSLE = \sqrt{ \frac{1}{n} \sum_{i=1}^n \left(\log (1 + \hat{y}_i) - \log (1 + y_i)\right)^2}
\end{align}  
- $n$ is the total number of instances
- $\hat{y}_i$ is the predicted value of the target for instance $i$
- $y_i$ is the actual value of the target for instance $i$
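
As a minimal sketch of the formula above, the metric can be computed with NumPy as follows. The `rmsle` helper name and the sample values are illustrative assumptions, not part of the competition material.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x) accurately, including for x close to 0
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# hypothetical daily sales and forecasts, for illustration only
y_true = [0.0, 10.0, 100.0]
y_pred = [0.0, 12.0, 90.0]
print(f"RMSLE: {rmsle(y_true, y_pred):.4f}")
```

Because the error is measured on the log scale, the metric penalizes relative errors rather than absolute ones, which suits product families with very different sales volumes.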
  
---
📚 **Sources:**
- EDA:
    - [Notebook: 📝Store Sales Analysis⏳ Time Serie](https://www.kaggle.com/kashishrastogi/store-sales-analysis-time-serie/notebook)
    - [Nicer seasonal decompose chart](https://gist.github.com/tomron/8798256fcee5438edd58c17654adf443)
- Seasonality
    - [Kaggle Time Series Tutorial - Seasonality](https://www.kaggle.com/ryanholbrook/seasonality)
    - [Tensorflow TimeSeries Tutorial](https://www.tensorflow.org/tutorials/structured_data/time_series)
- Lag Features
    - [Kaggle Time Series Tutorial - Time Series as Features](https://www.kaggle.com/ryanholbrook/time-series-as-features)
    - [Plot_pacf, plot_acf, autocorrelation_plot and lag_plot](https://community.plotly.com/t/plot-pacf-plot-acf-autocorrelation-plot-and-lag-plot/24108)
- Models
    - [Notebook: Store Sales simple XG Boost GPU [LB=0.44579]](https://www.kaggle.com/koheishima/store-sales-simple-xg-boost-gpu-lb-0-44579)
    - [Notebook: TS + Ridge + RF by AS](https://www.kaggle.com/code/dkomyagin/simple-ts-ridge-rf/notebook)
    - [Scikit-learn doc](https://scikit-learn.org/)
    - [XGBoost: A Complete Guide to Fine-Tune and Optimize your Model](https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663)
    - Custom Regressor
        - [Notebook: Store Sales: Ridge+Voting(Bagging(ET)+Bagging(RF))](https://www.kaggle.com/code/hiro5299834/store-sales-ridge-voting-bagging-et-bagging-rf/notebook). The class used in this notebook is an optimized version of the class and model hyperparameters from the notebook [📝Store Sales Analysis⏳ Time Serie](https://www.kaggle.com/kashishrastogi/store-sales-analysis-time-serie/notebook).
        - [Joblib doc](https://joblib.readthedocs.io/en/latest/parallel.html#common-usage)

In [None]:
import os
import time
import itertools
import calendar as cal
import numpy as np 
import pandas as pd
import seaborn as sns
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.signal import periodogram
from statsmodels.tsa.seasonal import seasonal_decompose, DecomposeResult
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from statsmodels.tsa.stattools import pacf, acf



# plotly settings
plotly_base_params = {
    'template': "plotly_white",
    'title_font': dict(size=29, color='#8a8d93', family="Lato, sans-serif"),
    'font': dict(color='#8a8d93'), 
    'hoverlabel': dict(bgcolor="#f2f2f2", font_size=13, font_family="Lato, sans-serif")
}

In [None]:
DATA_DIR = "/kaggle/input/store-sales-time-series-forecasting"

In [None]:
train = pd.read_csv(os.path.join(DATA_DIR, "train.csv"), parse_dates=['date'])
test = pd.read_csv(os.path.join(DATA_DIR, "test.csv"), parse_dates=['date'])
oil = pd.read_csv(os.path.join(DATA_DIR, "oil.csv"), parse_dates=['date'])
holidays_events = pd.read_csv(os.path.join(DATA_DIR, "holidays_events.csv"), parse_dates=['date'])
transactions = pd.read_csv(os.path.join(DATA_DIR, "transactions.csv"), parse_dates=['date'])
stores = pd.read_csv(os.path.join(DATA_DIR, "stores.csv"))

print(f"Training Data: from {train.date.min()} to {train.date.max()} - {train.date.max() - train.date.min()}")
print(f"Testing Data: from {test.date.min()} to {test.date.max()} - {test.date.max() - test.date.min()}")

In [None]:
print(train.info())
train.head()

In [None]:
print(test.info())
test.head()

In [None]:
# calendar dataset covering train + test dates
calendar = pd.DataFrame(index=pd.date_range(train.date.min(), test.date.max()))
# days of week
calendar['weekday'] = calendar.index.dayofweek 

<a id="EDA"></a>
# EDA

In [None]:
# Extend training set for the EDA
train_ext = train.merge(stores, on='store_nbr', how='left')
train_ext = train_ext.merge(transactions, on=['date', 'store_nbr'], how='left')
train_ext = train_ext.rename(columns={"type": "store_type"})

In [None]:
# Parsing dates
train_ext['date'] = train_ext['date'].astype('datetime64[ns]')
train_ext['year'] = train_ext['date'].dt.year
train_ext['month'] = train_ext['date'].dt.month
train_ext['week'] = train_ext['date'].dt.isocalendar().week
train_ext['quarter'] = train_ext['date'].dt.quarter
train_ext['weekday'] = train_ext['date'].dt.dayofweek
train_ext['day_name'] = train_ext['date'].dt.day_name()

In [None]:
print("Extended training data:", train_ext.shape)
train_ext.head()

<a id="sales"></a>
## Sales

In [None]:
def print_summary(data:dict, title):
    fig=go.Figure()
    fig.add_trace(go.Scatter(
        x=np.arange(start=0, stop=len(data)),
        y=np.full(len(data), 1.6),
        mode="text", 
        text=[f"<span style='font-size:33px'><b>{data[x]}</b></span>" for x in data],
        textposition="bottom center",
        hoverinfo='skip'
    ))
    fig.add_trace(go.Scatter(
        x=np.arange(start=0, stop=len(data)),
        y=np.full(len(data), 1.1),
        mode="text", 
        text=[x for x in data],
        textposition="bottom center",
        hoverinfo='skip'
    ))
    fig.add_hline(y=2.2, line_width=5, line_color='gray')
    fig.add_hline(y=0.3, line_width=3, line_color='gray')
    fig.update_yaxes(visible=False)
    fig.update_xaxes(visible=False)
    fig.update_layout(
        showlegend=False, height=300, width=1200,
        title=title, title_x=0.5, title_y=0.9,
        yaxis_range=[-0.2,2.2],
        plot_bgcolor='#fafafa', paper_bgcolor='#fafafa',
        font=dict(size=23, color='#323232'),
        title_font=dict(size=35, color='#222'),
        margin=dict(t=90,l=70,b=0,r=70)
    )
    fig.show(config={'staticPlot': False})

In [None]:
# compute number of months
start_date = train_ext.date.min()
end_date = train_ext.date.max()
nb_months = round((end_date.year - start_date.year) * 12 + (end_date.month - start_date.month) + (end_date.day / 30.5), 1)

summary = {
    "Stores": stores.shape[0],
    "Store types": train_ext.store_type.nunique(),
    "Store clusters": train_ext.cluster.nunique(),
    "Product families": train_ext.family.nunique(),
    "States": train_ext.state.nunique(),
    "Months": nb_months
}
print_summary(summary, "Stores Summary")

In [None]:
# data
df_sales = train.groupby('date').agg({"sales" : "sum"}).reset_index()
df_sales['sales_ma'] = df_sales['sales'].rolling(7).mean()
df_trans = train_ext.groupby('date').agg({"transactions" : "sum"}).reset_index()
df_trans['transactions_ma'] = df_trans['transactions'].rolling(7).mean()

# chart
fig = make_subplots(rows=3, cols=1,
                    subplot_titles=["Sales", "Transactions", "Sales / Transactions"],
                    vertical_spacing=.1)
fig.add_scatter(x=df_sales['date'], y=df_sales['sales'],
                mode='lines', marker=dict(color='#428bca'),
                name='Sales', row=1, col=1)
fig.add_scatter(x=df_sales['date'], y=df_sales['sales_ma'],
                mode='lines', marker=dict(color='#d9534f'),
                name='7d moving average', row=1, col=1)
fig.add_scatter(x=df_trans['date'], y=df_trans['transactions'],
                mode='lines', marker=dict(color='#428bca'),
                name='Transactions', row=2, col=1)
fig.add_scatter(x=df_trans['date'], y=df_trans['transactions_ma'],
                mode='lines', marker=dict(color='#d9534f'),
                name='7d moving average', row=2, col=1)
fig.add_scatter(x=df_sales['sales'], y=df_trans['transactions'],
                mode='markers', marker=dict(color='#428bca', size=2),
                name='Sales/Transactions', row=3, col=1)
# style
fig.update_xaxes(title='Sales', row=3, col=1)
fig.update_yaxes(title='Transactions', row=3, col=1)
fig.update_layout(height=750, width=1200, showlegend=False, **plotly_base_params)
fig.show()

In [None]:
# data c1
df_st_sa = train_ext.groupby('store_type').agg({"sales" : "mean"}).sort_values(by='sales', ascending=False).reset_index()
# data c2
df_fa_sa = train_ext.groupby('family').agg({"sales" : "mean"}).sort_values(by='sales', ascending=False)[:10].reset_index()
df_fa_sa['percent'] = round((df_fa_sa['sales'] / df_fa_sa['sales'].sum()) * 100, 1)
df_fa_sa['percent'] = df_fa_sa['percent'].astype(str) + "%"
df_fa_sa['color'] = '#c6ccd8'
df_fa_sa.loc[df_fa_sa.sales.idxmax(), 'color'] = '#496595' # highest value color
# data c3
df_cl_sa = train_ext.groupby('cluster').agg({"sales" : "mean"}).reset_index()
df_cl_sa['percent'] = round((df_cl_sa['sales'] / df_cl_sa['sales'].sum()) * 100, 1)
df_cl_sa['percent'] = df_cl_sa['percent'].astype(str) + "%"
df_cl_sa['color'] = '#c6ccd8'
df_cl_sa.loc[df_cl_sa.sales.idxmax(), 'color'] = '#496595'

# charts
fig = make_subplots(rows=2, cols=2,
                    column_widths=[0.5, 0.5], vertical_spacing=0, horizontal_spacing=0.02,
                    specs=[[{"type": "bar"}, {"type": "pie"}], [{"colspan": 2}, None]],
                    subplot_titles=("Top 10 Highest Product Sales", "Sales per Store Types", "Sales per Clusters"))
fig.add_trace(go.Bar(x=df_fa_sa['sales'], y=df_fa_sa['family'], text=df_fa_sa['percent'],
                     marker=dict(color=df_fa_sa['color']), name='Family', orientation='h'),
              row=1, col=1)
fig.add_trace(go.Pie(values=df_st_sa['sales'], labels=df_st_sa['store_type'],
                     name='Store type', hole=0.7,
                     marker=dict(colors=['#334668','#496595','#6D83AA','#91A2BF','#C8D0DF']),
                     hoverinfo='label+percent+value', textinfo='label'),
              row=1, col=2)
fig.add_trace(go.Bar(x=df_cl_sa['cluster'], y=df_cl_sa['sales'], text=df_cl_sa['percent'],
                     marker=dict(color=df_cl_sa['color']), name='Cluster'), 
              row=2, col=1)

# styling
fig.update_yaxes(showgrid=False, ticksuffix=' ', categoryorder='total ascending', row=1, col=1)
fig.update_xaxes(visible=False, row=1, col=1)
fig.update_xaxes(tickmode = 'array', tickvals=df_cl_sa.cluster, ticktext=df_cl_sa.cluster, row=2, col=1)
fig.update_yaxes(visible=False, row=2, col=1)
fig.update_layout(height=500, width=1000, bargap=0.2,
                  margin=dict(b=0,r=20,l=20), xaxis=dict(tickmode='linear'),
                  title_text="Average Sales Analysis",
                  showlegend=False, **plotly_base_params)
fig.show()

Groceries and beverages account for more than half of these stores' sales.

In [None]:
# data
df_m_sa = train_ext.groupby('month').agg({"sales" : "mean"}).reset_index()
df_m_sa['sales'] = round(df_m_sa['sales'], 2)
df_m_sa['month_text'] = df_m_sa['month'].apply(lambda x: cal.month_abbr[x])
df_m_sa['text'] = df_m_sa['month_text'] + ' - ' + df_m_sa['sales'].astype(str)
df_m_sa['color'] = '#c6ccd8'
df_m_sa.loc[df_m_sa.sales.idxmax(), 'color'] = '#496595'

df_dw_sa = train_ext.groupby('weekday').agg({"sales" : "mean"}).reset_index()
df_dw_sa.sales = round(df_dw_sa.sales, 2)
df_dw_sa['day_name'] = df_dw_sa['weekday'].apply(lambda x: cal.day_name[x])
df_dw_sa['text'] = df_dw_sa['day_name'] + ' - ' + df_dw_sa['sales'].astype(str)
df_dw_sa['color'] = '#c6ccd8'
df_dw_sa.loc[df_dw_sa.sales.idxmax(), 'color'] = '#496595'

df_w_sa = train_ext.groupby('week').agg({"sales" : "mean"}).reset_index()
df_q_sa = train_ext.groupby('quarter').agg({"sales" : "mean"}).reset_index()
df_w_sa['color'] = '#c6ccd8'

# chart
fig = make_subplots(rows=2, cols=3,
                    vertical_spacing=0.08,
                    row_heights=[0.7, 0.3],
                    specs=[[{"type": "bar"}, {"type": "pie"}, {"type": "bar"}], [{"colspan": 3}, None, None]],
                    subplot_titles=("Month wise Avg Sales Analysis",
                                    "Quarter wise Avg Sales Analysis",
                                    "Day wise Avg Sales Analysis",
                                    "Week wise Avg Sales Analysis"))
# Row 1
fig.add_trace(go.Bar(x=df_m_sa['sales'], y=df_m_sa['month'][::-1],
                     marker=dict(color=df_m_sa['color']),
                     text=df_m_sa['text'], textposition='auto',
                     name='Month', orientation='h'), row=1, col=1)
fig.add_trace(go.Pie(values=df_q_sa['sales'], labels=df_q_sa['quarter'], name='Quarter',
                     marker=dict(colors=['#334668','#496595','#6D83AA','#91A2BF','#C8D0DF']),
                     hole=0.7, hoverinfo='label+percent+value', textinfo='label+percent'), row=1, col=2)
fig.add_trace(go.Bar(x=df_dw_sa['sales'], y=df_dw_sa['weekday'][::-1],
                     marker=dict(color=df_dw_sa['color']),
                     text=df_dw_sa['text'], textposition='auto',
                     name='Day', orientation='h'), row=1, col=3)
# Row 2
fig.add_trace(go.Scatter(x=df_w_sa['week'], y=df_w_sa['sales'],
                         mode='lines+markers', fill='tozeroy', fillcolor='#c6ccd8',
                         marker=dict(color='#496595'), name='Week'), row=2, col=1)

# styling
fig.update_yaxes(visible=False, row=1, col=1)
fig.update_yaxes(visible=False, row=1, col=3)
fig.update_xaxes(tickmode = 'array', tickvals=df_w_sa.week, ticktext=df_w_sa.week,
                 row=2, col=1)
fig.update_layout(height=500, width=1000, bargap=0.15,
                  margin=dict(b=0,r=20,l=20), 
                  title_text="Average Sales Analysis Over Time",
                  showlegend=False, **plotly_base_params)
fig.show()

Sales are higher on Sundays and around the Christmas holidays.

<a id="seasonality"></a>
### Seasonality

In [None]:
# source: https://gist.github.com/tomron/8798256fcee5438edd58c17654adf443
def plot_seasonal_decompose(result: DecomposeResult, title="Seasonal Decomposition"):
    return (
        make_subplots(rows=4, cols=1, subplot_titles=["Observed", "Trend", "Seasonal", "Residuals"])
        .add_trace(go.Scatter(x=result.seasonal.index, y=result.observed, mode="lines", name="Observed"),
                   row=1, col=1)
        .add_trace(go.Scatter(x=result.trend.index, y=result.trend, mode="lines", name="Trend"),
                   row=2, col=1)
        .add_trace(go.Scatter(x=result.seasonal.index, y=result.seasonal, mode="lines", name="Seasonal"),
                   row=3, col=1)
        .add_trace(go.Scatter(x=result.resid.index, y=result.resid, mode="lines", name="Residuals"),
                   row=4, col=1)
        .update_layout(template="plotly_white", height=1250, width=1000, title=title, margin=dict(t=100), title_x=0.5, showlegend=False)
    )

In [None]:
seasonal = seasonal_decompose(df_sales.set_index('date')['sales'], model='multiplicative', period=365)
plot_seasonal_decompose(seasonal, title="Annual Seasonal Decomposition of Sales")

In [None]:
def seasonal_plot(X, y, period, freq, ax=None):
    if ax is None:
        _, ax = plt.subplots()
    palette = sns.color_palette("husl", n_colors=X[period].nunique(),)
    ax = sns.lineplot(
        x=freq, y=y, data=X, hue=period, ci=False,
        ax=ax, palette=palette, legend=False,
    )
    ax.set_title(f"Seasonal Plot ({period}/{freq})")
    for line, name in zip(ax.lines, X[period].unique()):
        y_ = line.get_ydata()[-1]
        ax.annotate(
            name, xy=(1, y_), xytext=(6, 0),
            color=line.get_color(),
            xycoords=ax.get_yaxis_transform(),
            textcoords="offset points", size=14, va="center",
        )
    return ax

def plot_periodogram(ts, detrend='linear', ax=None):
    # fs = (exact value of deprecated '1Y') / '1D'
    fs = pd.Timedelta('365 days 05:49:12') / pd.Timedelta("1D")
    frequencies, spectrum = periodogram(
        ts, fs=fs,
        detrend=detrend,
        window="boxcar",
        scaling='spectrum',
    )
    if ax is None:
        _, ax = plt.subplots()
    ax.step(frequencies, spectrum, color="purple")
    ax.set_xscale("log")
    ax.set_xticks([1, 2, 4, 6, 12, 26, 52, 104])
    ax.set_xticklabels([
        "Annual (1)",
        "Semiannual (2)",
        "Quarterly (4)",
        "Bimonthly (6)",
        "Monthly (12)",
        "Biweekly (26)",
        "Weekly (52)",
        "Semiweekly (104)",
        ], rotation=30)
    ax.ticklabel_format(axis="y", style="sci", scilimits=(0, 0))
    ax.set_ylabel("Variance")
    ax.set_title("Periodogram")
    return ax

In [None]:
df_sales = df_sales.set_index('date').to_period("D")
# days within a week
df_sales['day'] = df_sales.index.dayofweek # the x-axis (freq)
df_sales['week'] = df_sales.index.week # the seasonal period (period)

# days within a year
df_sales['dayofyear'] = df_sales.index.dayofyear
df_sales['year'] = df_sales.index.year
df_sales['month'] = df_sales.index.month

In [None]:
fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(15, 10))
seasonal_plot(df_sales, y="sales", period="week", freq="day", ax=ax0)
seasonal_plot(df_sales, y="sales", period="year", freq="dayofyear", ax=ax1)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax = plot_periodogram(df_sales["sales"], ax=ax)
ax.set_title("Product Sales Frequency Components")
plt.show()

In [None]:
df = train.groupby(['date', 'family']).agg({"sales" : "sum"})
df = df.unstack(level=0).T.droplevel(level=0, axis=0).rolling(7).mean()
fig =  go.Figure()
for col in df.columns:
    fig.add_trace(go.Scatter(x=df.index, y=df[col], name=col, mode='lines'))
fig.update_layout(height=850,
                  title_text="Weekly moving average per product Family",
                  **plotly_base_params)
fig.show()

Click a product family in the legend to show or hide its curve; double-click a family to display only that one.

We can clearly see different trends across the product families.  
Some families keep a stable periodicity but undergo large increases or decreases in level (with the periodicity unchanged). This may reflect the conglomerate's decision to increase or decrease its sales of that type of product, for example DAIRY.

The sales of the following products are much lower than the other product families: 'AUTOMOTIVE', 'BEAUTY', 'CELEBRATION', 'GROCERY II', 'HARDWARE', 'HOME AND KITCHEN I', 'HOME AND KITCHEN II', 'HOME APPLIANCES', 'LADIESWEAR', 'LAWN AND GARDEN', 'LINGERIE', 'LIQUOR,WINE,BEER', 'MAGAZINES', 'PET SUPPLIES', 'PLAYERS AND ELECTRONICS', 'SCHOOL AND OFFICE SUPPLIES', 'SEAFOOD'

<a id="oil"></a>
## Oil Price

In [None]:
# compute 7-days moving average
oil['ma_oil'] = oil['dcoilwtico'].rolling(7).mean()
oil = oil.set_index('date')

# adding oil price to calendar
calendar['ma_oil'] = oil['ma_oil'].loc[train.date.min():test.date.max()]
calendar['ma_oil'] = calendar['ma_oil'].ffill()

In [None]:
# chart
fig = go.Figure()
fig.add_scatter(x=oil.index, y=oil.dcoilwtico,
                mode='lines', name='Oil Price',
                line=dict(color='#428bca', width=2))
fig.add_scatter(x=oil.index, y=oil.ma_oil.ffill(),
                mode='lines', name='7d moving average',
                line=dict(color='purple', width=1))
fig.update_layout(title='Oil Price and 7-Day Moving Average', **plotly_base_params)
fig.show()

In [None]:
train_ext = train_ext.merge(oil, on='date', how='left')

fig = px.imshow(train_ext[['ma_oil', 'sales', 'transactions']].corr(), color_continuous_scale='reds')
fig.update_layout(title='Correlation between Oil / Sales / Transactions', height=400, width=700, **plotly_base_params)

Ecuador is an oil-dependent country and its economic health is highly vulnerable to shocks in oil prices.  
However, the oil price shows almost no correlation with the sales of these stores, which mainly sell food products.

In [None]:
print("Correlation between oil price and sales, by product family:")
for f in train_ext.family.unique():
    df = train_ext[train_ext.family == f][['sales', 'ma_oil']]
    # Pearson correlation between the two columns (NaN pairs are ignored)
    corr = df['sales'].corr(df['ma_oil'])
    print(f" - {f}: {corr:.3f}")

**Oil Lag Features**  

ACF measures the correlation between a time series and a lagged version of itself. For instance, at lag 5, ACF compares the series at times t1…tn with the series at times t1-5…tn-5 (t1-5 and tn being the end points).  
PACF also measures the correlation between a time series and a lagged version of itself, but after removing the variation already explained by the shorter lags. E.g. at lag 5, it measures the correlation that remains once the effects explained by lags 1 to 4 are removed.
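
The two definitions above can be sketched with NumPy on a simulated series. The helpers `autocorr` and `partial_autocorr`, the AR(1) coefficient `phi`, and the random seed are illustrative assumptions; the estimators are simplified and may differ slightly from those used by the `acf` and `pacf` functions of statsmodels.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if lag == 0:
        return 1.0
    # covariance between the series and its lagged copy, scaled by the variance
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x ** 2))

def partial_autocorr(x, lag):
    """PACF at `lag`: coefficient of x[t-lag] in an OLS regression of x[t] on lags 1..lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # column j holds the series shifted by j + 1 steps
    lagged = np.column_stack([x[lag - j - 1 : n - j - 1] for j in range(lag)])
    coefs, *_ = np.linalg.lstsq(lagged, x[lag:], rcond=None)
    return float(coefs[-1])

# simulated AR(1) series: x[t] = phi * x[t-1] + noise (hypothetical data)
rng = np.random.default_rng(0)
phi = 0.8
noise = rng.normal(size=5000)
series = np.zeros(5000)
for t in range(1, 5000):
    series[t] = phi * series[t - 1] + noise[t]

for lag in (1, 2, 3):
    print(f"lag {lag}: ACF={autocorr(series, lag):.3f}  PACF={partial_autocorr(series, lag):.3f}")
```

For such a series the ACF decays gradually with the lag, while the PACF is close to zero beyond lag 1, because lag 1 already explains the dependence on older values. This is the pattern to look for when deciding how many lag features to keep.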

In [None]:
def plot_acf_pacf(series, n_lags=20, plot_pacf=False):
    corr_array = pacf(series.dropna(), alpha=0.05, nlags=n_lags) if plot_pacf else acf(series.dropna(), alpha=0.05, fft=False, nlags=n_lags)
    lower_y = corr_array[1][:,0] - corr_array[0]
    upper_y = corr_array[1][:,1] - corr_array[0]
    fig = go.Figure()
    for x in range(len(corr_array[0])):
        fig.add_scatter(x=(x, x), y=(0, corr_array[0][x]), mode='lines', line_color='#3f3f3f')
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=corr_array[0],
                    mode='markers', marker_color='#1f77b4', marker_size=12)
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=upper_y,
                    mode='lines', line_color='rgba(255,255,255,0)')
    fig.add_scatter(x=np.arange(len(corr_array[0])), y=lower_y,
                    mode='lines',fillcolor='rgba(32, 146, 230,0.3)',
                    fill='tonexty', line_color='rgba(255,255,255,0)',
                    name='No-correlation interval')
    fig.update_traces(showlegend=False)
    fig.update_xaxes(range=[-1,n_lags+1])
    fig.update_yaxes(zerolinecolor='#000000')
    title='Partial Autocorrelation (PACF)' if plot_pacf else 'Autocorrelation (ACF)'
    fig.update_layout(title=title, height=500, width=1000)
    fig.show()

In [None]:
def make_lags(ts, lags):
    return pd.concat(
        {f'{ts.name}_lag_{i}': ts.shift(i) for i in range(1, lags + 1)},
        axis=1
    )

In [None]:
plot_acf_pacf(calendar['ma_oil'].dropna(), plot_pacf=False)
plot_acf_pacf(calendar['ma_oil'].dropna(), plot_pacf=True)

In [None]:
# add lagged oil features to calendar
oil_lags = make_lags(calendar['ma_oil'], 4).ffill()
calendar = calendar.join(oil_lags)

<a id="events"></a>
## Holidays / Events

In [None]:
mask = holidays_events.description=='Viernes Santo'
holidays_events[mask]

There is an error in the events/holidays dataset. According to the [2013 list of holidays in Ecuador](https://www.turismo.gob.ec/wp-content/uploads/downloads/2014/01/Feriados-20131.pdf), Good Friday was March 29th, not April 29th.

In [None]:
# 'Good Friday' mistake correction
holidays_events.loc[mask & (holidays_events['date'] == '2013-04-29'), 'date'] = pd.Timestamp('2013-03-29')
holidays_events = holidays_events.set_index('date').sort_index()
# keep National level only for simplicity
holidays_events = holidays_events[holidays_events.locale=='National']
# keep only one event per day
holidays_events = holidays_events.groupby(holidays_events.index).first()

In [None]:
# data
avg_sales = train.groupby('date').agg({"sales" : "mean"}).reset_index()
df_s_he = avg_sales.merge(holidays_events, on='date', how='left')
df_s_he = df_s_he.rename(columns={"type": "event_type"})


# chart
fig = go.Figure()
fig.add_scatter(x=df_s_he.date, y=df_s_he.sales,
                mode='lines', name='Avg Sales', line=dict(width=.5))
fig.add_scatter(x=df_s_he['date'][df_s_he.event_type=='Holiday'],
                y=df_s_he['sales'][df_s_he.event_type=='Holiday'],
                mode='markers', name='Holidays',
                marker=dict(color='orange', size=4))
fig.add_scatter(x=df_s_he['date'][df_s_he.event_type=='Event'],
                y=df_s_he['sales'][df_s_he.event_type=='Event'],
                mode='markers', name='Events',
                marker=dict(color='purple', size=5))
fig.update_layout(title='Avg Sales with Holidays and Events',
                  height=500, width=1000, **plotly_base_params)
fig.show()

- New Year's Day is the only day when all stores are closed
- The earthquake in 2016 led to an increase in sales

In [None]:
# correction of workdays using holidays/events dates

def compute_workdays(df, dofw_col):
    df['workday'] = True
    # exclude week-ends
    df.loc[df[dofw_col] > 4, 'workday'] = False
    # friday bridges are not working days
    df.loc[df.event_type=='Bridge', 'workday'] = False
    # some bridges are recovered by working at weekends
    df.loc[df.event_type=='Work Day', 'workday'] = True
    # handling Transferred events
    df.loc[df.event_type=='Transfer', 'workday'] = False
    df.loc[(df.event_type=='Holiday')&(df.transferred==False), 'workday'] = False
    df.loc[(df.event_type=='Holiday')&(df.transferred==True ), 'workday'] = True
    return df                 

In [None]:
# holidays / events
calendar = calendar.merge(holidays_events, how='left', left_index=True, right_index=True)
calendar = calendar.rename(columns={"type": "event_type"})
# days of work
calendar = compute_workdays(calendar, 'weekday')
calendar['workday'] = calendar['workday'] * 1
calendar.drop(columns=['locale', 'locale_name', 'description', 'transferred'], inplace=True)

<a id="forecasting"></a>
# Machine Learning Forecasting

In [None]:
import warnings
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import make_scorer, r2_score, mean_squared_error
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
from joblib import Parallel, delayed

<a id="preprocessing"></a>
## Preprocessing

Reload the datasets cleanly and reorder them:
- exclude columns `id` and `onpromotion`
- set `store_nbr` and `family` as categorical columns
- set dates as periods with daily frequency
- set `store_nbr`, `family` and `date` as index

In [None]:
# calendar dates to period
calendar.index = calendar.index.to_period('D')

In [None]:
# clean reload train set
train = pd.read_csv(os.path.join(DATA_DIR, "train.csv"),
                    usecols=['store_nbr', 'family', 'date', 'sales'], 
                    dtype={'store_nbr': 'category', 'family': 'category', 'sales': 'float32'},
                    parse_dates=['date'], infer_datetime_format=True)
train['date'] = train.date.dt.to_period('D')
train = train.set_index(['store_nbr', 'family', 'date']).sort_index()

# clean reload test set
test = pd.read_csv(os.path.join(DATA_DIR, "test.csv"),
                   usecols=['store_nbr', 'family', 'date'],
                   dtype={'store_nbr': 'category', 'family': 'category'},
                   parse_dates=['date'], infer_datetime_format=True)
test['date'] = test.date.dt.to_period('D')
test = test.set_index(['store_nbr', 'family', 'date']).sort_index()

The training set is reduced to speed up the search for the model to use. An optimization of the size of the training set will be done afterwards.

In [None]:
# trainset dates
train_start_dt = train.index.get_level_values('date').min()
train_end_dt = train.index.get_level_values('date').max()

# testset dates
test_start_dt = test.index.get_level_values('date').min()
test_end_dt = test.index.get_level_values('date').max()

print(f"Initial train set: from {train_start_dt} to {train_end_dt}")
print(f"Initial test set: from {test_start_dt} to {test_end_dt}")

In [None]:
def compute_trainset(dates):
    # compute seasonal features
    fourier = CalendarFourier(freq='W', order=4)
    dp = DeterministicProcess(index=dates,
                              constant=False,
                              order=1,
                              seasonal=False,
                              additional_terms=[fourier],
                              drop=True)
    X = dp.in_sample()
    # add calendar features
    X = X.merge(calendar, how='left', left_index=True, right_index=True)
    # encode categorical features
    X = pd.get_dummies(X, columns=['weekday'], drop_first=True)
    X = pd.get_dummies(X, columns=['event_type'], drop_first=False)
    # fill missing lagged oil values
    X = X.bfill()
    return X, dp

# extract y
y = train.unstack(['store_nbr', 'family']).loc[train_start_dt:train_end_dt]
y = np.log1p(y)
X, dp = compute_trainset(y.index)

print('trainset shape:', X.shape)

In [None]:
"""
from sklearn.model_selection import cross_validate     
def compare_models_cv(X_train, y, models:dict, cv:int, metrics:list):
    "Returns the mean score of each metrics applied by cross-validation to passed models"   
    all_scores = []
    for m in models:
        # calculate the model scores for each y
        partial_scores = cross_validate(models[m], 
                                    X_train, 
                                    y, 
                                    cv=cv,
                                    scoring=metrics, 
                                    return_train_score=True)
        # convert to a dataframe
        partial_scores = pd.DataFrame.from_dict(partial_scores)
        # get mean score for each metrics and pivot the dataframe
        partial_scores = partial_scores.mean().to_frame().T
        # add the model name into the df
        partial_scores['model'] = m
        # add scores to the scoreslist
        all_scores.append(partial_scores)
    # concat all scores into a single df
    all_scores = pd.concat(all_scores, ignore_index=True).set_index('model')
    return all_scores
"""

def gs_tuning(X_train, y_train, pipe, params, scoring, cv=None, abs_scores=True, **kwargs):
    # init GridSearchCV
    gs = GridSearchCV(pipe, params, scoring=scoring, cv=cv,
                      refit=kwargs.get('refit', True),
                      verbose=kwargs.get('verbose', 0),
                      return_train_score=True, n_jobs=-1)
    # fit gridsearch without warnings
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        gs.fit(X_train, y_train)
    # get model name
    model_name = gs.best_estimator_['model'].__class__.__name__
    # init results
    res = {
        'model_name': model_name,
        'fit_time': gs.cv_results_['mean_fit_time'][gs.best_index_],
    }
    # add train + test scores for each scoring metric
    if isinstance(scoring, dict):
        metrics = list(scoring.keys())
    elif isinstance(scoring, (list, tuple)):
        metrics = list(scoring)
    else:
        metrics = [scoring]
    for s in metrics:
        res['train_' + s] = gs.cv_results_['mean_train_' + s][gs.best_index_]
        res['test_' + s] = gs.cv_results_['mean_test_' + s][gs.best_index_]
    # print best score
    print(f"{model_name} best score: {gs.best_score_}")
    params_display = [f" - {k}: {v}" for k,v in gs.best_params_.items() if k != 'model']
    print('\n'.join(params_display))
    return gs.best_estimator_, res

def print_results(preds, y):
    # Results of the training stage
    preds = preds.stack(['store_nbr', 'family']).reset_index()
    y_target = y.stack(['store_nbr', 'family']).reset_index().copy()
    y_target['sales_pred'] = preds['sales'].clip(0.) # Sales should be >= 0
    scores = y_target.groupby('family').apply(lambda r: np.sqrt(mean_squared_error(r['sales'], r['sales_pred'])))
    print('Scores by Family on training set:')
    print(scores)
    print("> Average :", scores.mean())

In [None]:
# scoring methods used
scoring = {
    # RMSE is used because y is already log-transformed (log1p)
    'rmsle': make_scorer(mean_squared_error, greater_is_better=False, squared=False),
    # 'r2': make_scorer(r2_score, greater_is_better=True)
}
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', None)
])
cv = 3
SEED=12

The RMSLE is not directly available as a scorer in sklearn. It can be obtained with the `mean_squared_log_error` function, but that function raises an error if the model makes negative predictions during the GridSearch. The solution used here is to apply `log1p` to the target variable beforehand and then evaluate the models with the RMSE, which on the transformed target is equivalent to the RMSLE on the original one.
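The equivalence mentioned above can be checked on a few numbers. This is a minimal sketch with made-up values (`y_true` and `y_hat` are illustrative only, not taken from the dataset); it uses `np.sqrt` rather than the `squared=False` argument so that it does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# illustrative, non-negative values (not from the competition data)
y_true = np.array([0.0, 3.0, 12.0, 150.0])
y_hat = np.array([1.0, 2.5, 10.0, 180.0])

# RMSLE computed directly on the raw values
rmsle_direct = np.sqrt(mean_squared_log_error(y_true, y_hat))

# RMSE computed on the log1p-transformed values
rmse_on_logs = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_hat)))

print(rmsle_direct, rmse_on_logs)  # both values are identical
```

A model trained on the `log1p` target may output any real number, so scoring it with the RMSE never triggers the negative-value error.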

<a id="train-length"></a>
## Optimizing trainset length
A linear regression is trained on training sets of different lengths to determine the best start date.

In [None]:
from dateutil.relativedelta import relativedelta

start_dt = train_start_dt
length_scores = []

while start_dt <= pd.Period('01-08-2018', 'D'):
    try:
        # compute X and y
        y = train.unstack(['store_nbr', 'family']).loc[start_dt:train_end_dt]
        y = np.log1p(y)
        X, dp = compute_trainset(y.index)
        
        print(f"{start_dt}:", end=' ')
        model, res = gs_tuning(X, y, pipeline,
                               dict(model=[LinearRegression(fit_intercept=True)]),
                               scoring, cv, refit='rmsle')
        
        # add the start_dt
        res['start_dt'] = str(start_dt)
        
        # add scores to the scores list
        length_scores.append(res)
    except ValueError as e:
        print(f"{start_dt}: {e}")
    # add 1 month to start_dt
    start_dt = (start_dt.to_timestamp() + relativedelta(months=1)).to_period('D')
# concat all scores into a single df
length_scores = pd.DataFrame.from_dict(length_scores).set_index('start_dt')
length_scores['test_rmsle'] = length_scores['test_rmsle']*-1

In [None]:
length_scores['test_rmsle'].plot(figsize=(10, 5))
plt.xticks(rotation=70)
plt.ylim((0, 2))
plt.show()
print("Best score obtained when the trainset starts from:",
      f"{length_scores['test_rmsle'].idxmin()}")
print(f"- RMSLE: {length_scores['test_rmsle'].min():.5f}")

An error (raised by `DeterministicProcess`) prevents the training set from starting later than September 2017.  
The plot shows that the best performance is obtained when the training set starts in June 2017.
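The search over start dates can be illustrated in isolation. The sketch below is an assumption-laden toy version: it uses a synthetic series (a flat level followed by a trend), a single time-index feature and `cross_val_score` instead of the notebook's `compute_trainset` and `gs_tuning` helpers, so the numbers it produces say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic daily series (hypothetical data): flat level, then a linear trend
dates = pd.date_range('2016-01-01', '2017-08-15', freq='D')
rng = np.random.default_rng(0)
t = np.arange(len(dates))
values = np.where(t < 300, 10.0, 10.0 + 0.05 * (t - 300)) + rng.normal(0, 0.5, len(dates))
series = pd.Series(values, index=dates)

results = {}
# try one candidate start date per month
for start in pd.date_range('2016-01-01', '2017-06-01', freq='MS'):
    sub = series.loc[start:]
    X_sub = np.arange(len(sub)).reshape(-1, 1)  # time index as the only feature
    scores = cross_val_score(LinearRegression(), X_sub, sub.to_numpy(), cv=3,
                             scoring='neg_root_mean_squared_error')
    results[start.strftime('%Y-%m-%d')] = -scores.mean()  # back to a positive RMSE

length_rmse = pd.Series(results, name='cv_rmse')
best_start = length_rmse.idxmin()
print(f"Best start date: {best_start} (RMSE={length_rmse.min():.3f})")
```

Using `pd.date_range(..., freq='MS')` is one way to step month by month without `relativedelta`; either approach yields the same candidate dates.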

In [None]:
# make new trainset
train_start_dt = length_scores['test_rmsle'].idxmin()
y = train.unstack(['store_nbr', 'family']).loc[train_start_dt:train_end_dt]
y = np.log1p(y)
X, dp = compute_trainset(y.index)

print('Trainset shape:', X.shape)

In [None]:
fig = px.imshow(X.corr(), color_continuous_scale='reds')
fig.update_layout(title='Correlation in trainset', height=400, width=700, **plotly_base_params)

<a id="models"></a>
## Models Comparison

In [None]:
%%time

# define models and their params to try
# it may be necessary to comment out some models
# to tune another one faster with more parameters, or to submit predictions
models_grid = [
    {
        # Linear Regression
        'model': [LinearRegression()],
        'model__fit_intercept': [True],
    },
    {
        # Ridge
        'model': [Ridge(random_state=SEED)],
        'model__fit_intercept': [True],
        'model__alpha':  [83.695], # list(10**np.linspace(10,-2,100)*0.5)    83.695
        'model__solver': ['saga'],
    },
    {
        # RandomForest
        'model': [RandomForestRegressor(random_state=SEED)],
        'model__n_estimators': [221],
        'model__criterion': ['absolute_error'],
        'model__max_features': [None], 
        'model__max_depth': [None],  # [2, 5, 10, None]
        'model__min_samples_leaf': [1],  # [1, 3, 5]
        'model__min_samples_split': [5],  # [2, 5, 10]
        'model__oob_score': [True],
        'model__bootstrap': [True],
    },
    {
        # XGB
        'model': [XGBRegressor(random_state=SEED)],
        'model__max_depth': [2],
        'model__learning_rate': [.1],
        'model__n_estimators': [100],
        'model__colsample_bytree': [0.3]
    },
]

# inside a loop to manually compare each model
scores, models = [], []
for m in models_grid:
    model, res = gs_tuning(X, y, pipeline, m, scoring, cv, refit='rmsle')
    scores.append(res)
    models.append(model)
    # predictions on train set
    preds = pd.DataFrame(model.predict(X), index=X.index, columns=y.columns)
    print_results(preds, y)
    print('\n')
    

As described in [sklearn GridSearchCV with Pipeline](https://stackoverflow.com/questions/21050110/sklearn-gridsearchcv-with-pipeline):  
The unified scoring API always maximizes the score, so scores which need to be minimized are negated in order for the unified scoring API to work correctly. The score that is returned is therefore negated when it is a score that should be minimized and left positive if it is a score that should be maximized.  
So the real `gs.best_score_` of each model is the absolute value of the negative value displayed.
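The sign convention is easy to verify on a toy problem. In this minimal sketch (the data `X_toy` and `y_toy` are invented for illustration), a scorer built with `greater_is_better=False` returns the negated error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error

# toy data: a line plus an alternating offset that a linear model cannot fit exactly
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 3 * X_toy.ravel() + np.array([0.5, -0.5] * 5)

reg = LinearRegression().fit(X_toy, y_toy)
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

raw_mse = mean_squared_error(y_toy, reg.predict(X_toy))  # positive error
scorer_value = neg_mse_scorer(reg, X_toy, y_toy)         # same error, negated

print(raw_mse, scorer_value)
```

Multiplying the scorer value by -1 (or taking its absolute value) recovers the usual positive error.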

The model which obtains the best score is not necessarily the best for each product family.
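One way to see this is to compute the error per target column and pick the best model for each. The sketch below uses synthetic data with two hypothetical targets (`A` is linear, `B` is a step function) and a simple even/odd holdout split; none of it comes from the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
X_demo = x.reshape(-1, 1)
y_demo = pd.DataFrame({
    'A': 2 * x + rng.normal(0, 0.05, x.size),           # linear target
    'B': 5.0 * (x >= 5) + rng.normal(0, 0.05, x.size),  # step-shaped target
})
train_idx, test_idx = np.arange(0, 200, 2), np.arange(1, 200, 2)

candidates = {
    'linear': LinearRegression(),
    'tree': DecisionTreeRegressor(max_depth=3, random_state=0),
}
rows = {}
for name, est in candidates.items():
    est.fit(X_demo[train_idx], y_demo.iloc[train_idx])
    pred = est.predict(X_demo[test_idx])
    # RMSE per target column on the held-out rows
    rows[name] = {
        col: np.sqrt(mean_squared_error(y_demo[col].iloc[test_idx], pred[:, j]))
        for j, col in enumerate(y_demo.columns)
    }

rmse_table = pd.DataFrame(rows)              # index: targets, columns: models
best_per_target = rmse_table.idxmin(axis=1)  # best model name for each target
print(rmse_table)
print(best_per_target)
```

Here the linear model wins on `A` and the tree wins on `B`, which is the kind of situation that motivates choosing a model per family.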

<a id="customreg"></a>
### Custom Regressor
https://www.kaggle.com/code/andrej0marinchenko/hyperparamaters#Model-Creation

In [None]:
df = train_ext[['date', 'sales', 'family']].set_index('date')
mask = df.index.to_series().between('2017-06-25', '2017-08-15')
df = df[mask].groupby(['date', 'family']).agg({"sales" : "mean"})
df = df.unstack(level=0).T.droplevel(level=0, axis=0)

fig =  go.Figure()
for col in df.columns:
    fig.add_trace(go.Scatter(x=df.index, y=df[col], name=col, mode='lines'))
fig.update_layout(height=850,
                  title_text="Mean daily sales per product family",
                  **plotly_base_params)
fig.show()

Displaying only the 'SCHOOL AND OFFICE SUPPLIES' family (double-click on it in the chart legend) shows that its sales follow a very different trend from the other families.

BABY CARE
BOOKS
HOME APPLIANCE




AUTOMOTIVE
BEAUTY
CELEBRATION
GROCERY II
HOME AND KITCHEN I
HOME AND KITCHEN II
LADIESWEAR
LAWN AND GARDEN
LINGERIE
MAGAZINE
PET SUPPLIES
PLAYERS AND ELECTRONICS
SEAFOOD

In [None]:
%%time
from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import VotingRegressor


class CustomRegressor():
    def __init__(self, n_jobs=-1, verbose=0):
        self.n_jobs = n_jobs
        self.verbose = verbose
        self.estimators_ = None
        
    def _estimator_(self, X, y):
        warnings.simplefilter(action='ignore', category=FutureWarning)
        if y.name[2] in ['SCHOOL AND OFFICE SUPPLIES']: 
            et = ExtraTreesRegressor(n_estimators=230,  n_jobs=-1, random_state=SEED)            
            rf = RandomForestRegressor(n_estimators=221, criterion='absolute_error',
                                       min_samples_leaf=1, min_samples_split=5, 
                                       oob_score=True, n_jobs=-1, random_state=SEED)
            b1 = BaggingRegressor(base_estimator=et, n_estimators=15, n_jobs=-1, random_state=SEED)
            b2 = BaggingRegressor(base_estimator=rf, n_estimators=15, n_jobs=-1, random_state=SEED)
            # Averaging the result
            model = VotingRegressor([('et', b1), ('rf', b2)])
        else:
            ridge = Ridge(fit_intercept=True, solver='saga', alpha=83.695, random_state=SEED)
            svr = SVR(C=0.2, kernel='rbf')
            # Averaging result
            model = VotingRegressor([('ridge', ridge), ('svr', svr)])
        model.fit(X, y)
        return model

    def fit(self, X, y):
        from tqdm.auto import tqdm
        if self.verbose == 0 :
            idx_list = range(y.shape[1])
        else :
            # using a pretty progress bar
            idx_list = tqdm(range(y.shape[1]))
            print('Fit Progress')
        # fit model with parallel computing
        self.estimators_ = Parallel(n_jobs=self.n_jobs, verbose=0)(delayed(self._estimator_)(X, y.iloc[:, i]) for i in idx_list)
        return self  # scikit-learn convention: fit returns the estimator
    
    def predict(self, X):
        from tqdm.auto import tqdm # pretty progress bar package
        if self.verbose == 0 :
            estimators_list = self.estimators_
        else :
            estimators_list = tqdm(self.estimators_)
            print('Predict Progress')
        # predictions with parallel computing
        y_pred = Parallel(n_jobs=self.n_jobs, verbose=0)(delayed(e.predict)(X) for e in estimators_list)
        return np.stack(y_pred, axis=1)
    

model, res = gs_tuning(X, y, pipeline,
                       dict(model=[CustomRegressor(verbose=0)]),
                       scoring, cv, refit='rmsle')
scores.append(res)
models.append(model)
y_pred = pd.DataFrame(model.predict(X), index=X.index, columns=y.columns)
print_results(y_pred, y)

In [None]:
def model_comparison_chart(scores_df, title, metrics=[], **kwargs):
    train_color = '#FF8019'
    test_color = '#2DB42D'
    time_color = '#5CA3F9'
    n_metrics = len(metrics)+1 # fit_time always displayed
    
    fig = make_subplots(rows=1, cols=n_metrics,
                        horizontal_spacing=0.1,
                        specs=kwargs.get('specs', None),
                        subplot_titles=kwargs.get('subplot_titles', None))
    
    for i, m in enumerate(metrics):
        fig.add_trace(go.Bar(x=scores_df['model_name'], y=scores_df['train_' + m], marker_color=train_color, name='Train'), row=1, col=i+1)
        fig.add_trace(go.Bar(x=scores_df['model_name'], y=scores_df['test_' + m], marker_color=test_color, name='Test'), row=1, col=i+1)
    # fit time
    fig.add_trace(go.Bar(x=scores_df['model_name'], y=scores_df['fit_time'], marker_color=time_color), row=1, col=n_metrics)
    
    # style
    fig.update_traces(texttemplate='%{y:.3f}', textposition='inside')
    fig.for_each_annotation(lambda a: a.update(text=f'<b>{a.text}</b>'))
    fig.update_layout(height=kwargs.get('height', 500), width=kwargs.get('width', n_metrics*500),
                      barmode='group', title=title, showlegend=False, **plotly_base_params)
    fig.show()


scores_df = pd.DataFrame(scores)
# get positive RMSLE
scores_df['train_rmsle'] = scores_df['train_rmsle']*-1
scores_df['test_rmsle'] = scores_df['test_rmsle']*-1
# display chart
model_comparison_chart(scores_df, 'Models Comparison', scoring.keys(),
                       specs=[[{"type": "bar"}, {"type": "bar"}]],
                       subplot_titles=("RMSLE", "Fit Time"))

The `CustomRegressor` model obtains one of the lowest RMSLE values and one of the smallest gaps between training and test scores.
It is therefore the model selected for the final predictions.
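The core idea behind `CustomRegressor` (one estimator per target column, chosen from the column name) can be sketched in a compact form. `PerTargetRegressor` below is a hypothetical, simplified helper and not the notebook's class: it drops the parallelism and the voting ensembles, and it inherits from scikit-learn's base classes so that it can be cloned and returns `self` from `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor


class PerTargetRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical helper: one estimator per target column, chosen by column name."""

    def __init__(self, default=None, overrides=None):
        self.default = default      # estimator used for most columns
        self.overrides = overrides  # dict: column name -> special estimator

    def fit(self, X, y):
        default = Ridge() if self.default is None else self.default
        overrides = self.overrides or {}
        # clone so that every column gets its own unfitted copy
        self.estimators_ = [
            clone(overrides.get(col, default)).fit(X, y[col]) for col in y.columns
        ]
        return self

    def predict(self, X):
        # one prediction vector per column, stacked as (n_samples, n_targets)
        return np.column_stack([est.predict(X) for est in self.estimators_])


# synthetic data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = pd.DataFrame({
    'BOOKS': X_demo @ np.array([1.0, 2.0, 0.5]),
    'SCHOOL AND OFFICE SUPPLIES': np.abs(X_demo[:, 0]) * 3,
})
per_target = PerTargetRegressor(
    overrides={'SCHOOL AND OFFICE SUPPLIES': DecisionTreeRegressor(random_state=0)}
).fit(X_demo, y_demo)
preds = per_target.predict(X_demo)
print(preds.shape)  # (50, 2)
```

Keeping the routing in a dictionary makes it straightforward to give other families their own estimator later.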

<a id="testpreds"></a>
## Predictions on testset


In [None]:
# DeterministicProcess for testset dates
X_test = dp.out_of_sample(steps=16)
# adding other columns
cols = list(oil_lags.columns)
cols.extend(['ma_oil', 'weekday', 'workday'])
for col in cols:
    X_test[col] = calendar.loc[test_start_dt:test_end_dt][col].values
# encoding weekday column
X_test = pd.get_dummies(X_test, columns=['weekday'], drop_first=True)
# adding events columns
X_test[[col for col in X.columns if 'event' in col]] = 0 # no events known for these dates
# reorder testset columns 
X_test = X_test[X.columns]

print("Model used :", model)
print("Testset shape:", X_test.shape)

In [None]:
sales_pred = pd.DataFrame(model.predict(X_test), index=X_test.index, columns=y.columns)
sales_pred = sales_pred.stack(['store_nbr', 'family'])
# reverse log1p
sales_pred = np.expm1(sales_pred.clip(0.)) # Sales should be >= 0

In [None]:
sales_pred

In [None]:
# submission
sub = pd.read_csv(os.path.join(DATA_DIR, 'sample_submission.csv'), index_col='id')
sub['sales'] = sales_pred.values
sub.to_csv('submission.csv', index=True)

In [None]:
sub
