## Meteo Bakery - Baseline Model
As our first baseline model, we use a simple heuristic (a seasonal naive forecast): the sales from each day of the previous week serve as the forecast for the same weekday of the upcoming week. These predictions will be made for:
* Total sales
* Total sales per branch
* Total sales per product
* Sales for each product per branch

We will use RMSE and MAPE as our evaluation metrics. Additionally, we will calculate RMSE and MAPE separately for overestimated and underestimated sales. For now, we will perform the evaluation on the whole time series.


### import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### load data

In [None]:
df = pd.read_csv('../data/data_combined.csv')

# parse 'date' to datetime object
df.date = pd.to_datetime(df.date)

In [None]:
df.columns

### generate unstacked dataframe
Currently, our dataframe represents a grouped time series (grouped by branch and product). For modeling, the dataframe will be unstacked (i.e. ungrouped), such that the individual time series are represented as separate columns in the dataframe. The ungrouping will be done for every single level of the grouped dataframe. Specifically, we will generate 
* a time series for total sales, summing up sales over all branches and products
* summed time series for the three different bakery branches with sales summed up across products per branch
* summed time series for the five different bakery products with sales summed up across branches per product
* individual time series per branch and product

In total, we will end up with 24 different time series.
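
To make the target layout concrete, here is a minimal sketch on a toy grouped frame with two hypothetical branches (`A`, `B`) and two hypothetical products (`bread`, `rolls`); all names and values are made up for illustration. It yields 1 + 2 + 2 + 4 = 9 columns. With three branches and five products the same logic gives 1 + 3 + 5 + 15 = 24.

```python
import pandas as pd

# toy grouped time series: 2 days x 2 branches x 2 products (all values are made up)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01'] * 4 + ['2021-01-02'] * 4),
    'branch': ['A', 'A', 'B', 'B'] * 2,
    'product': ['bread', 'rolls'] * 4,
    'turnover': [10.0, 20.0, 30.0, 40.0, 11.0, 21.0, 31.0, 41.0],
})

# lowest level: one column per branch/product combination
toy_groups = toy.pivot(index='date', columns=['branch', 'product'], values='turnover')
toy_groups.columns = [f'{branch} | {product}' for branch, product in toy_groups.columns]

# aggregated levels: per branch, per product, and overall
toy_branch = toy.groupby(['date', 'branch'])['turnover'].sum().unstack()
toy_product = toy.groupby(['date', 'product'])['turnover'].sum().unstack()
toy_total = toy.groupby('date')['turnover'].sum().to_frame('total')

# join all levels on the shared date index
toy_unstacked = toy_total.join(toy_branch).join(toy_product).join(toy_groups)
n_columns = toy_unstacked.shape[1]
```

The toy data contain no missing values, so a plain `.sum()` is sufficient here.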

In [None]:
def unstack_time_series(df, index, groups, target):
    
    # create the individual combinations df
    df_groups = df.pivot(index=index, columns=groups, values=target)
    df_groups.columns = df_groups.columns.to_flat_index().map('{0[0]} | {0[1]}'.format)

    # create df for first group, use agg(pd.Series.sum) instead of .sum to enable skipna, otherwise NaN rows will add up to 0
    df_01 = df.groupby([index, groups[0]])[target] \
                        .agg(pd.Series.sum, skipna=False) \
                        .reset_index(drop=False) \
                        .pivot(index=index, columns=groups[0], values=target)

    # create df for second group
    df_02 = df.groupby([index, groups[1]])[target] \
                        .agg(pd.Series.sum, skipna=False)\
                        .reset_index(drop=False) \
                        .pivot(index=index, columns=groups[1], values=target)

    # create the total level df
    df_total = df.groupby(index)[target] \
                .agg(pd.Series.sum, skipna=False)\
                .to_frame() \
                .rename(columns={target: 'total'})

    # join the DataFrames
    df_unstacked = df_total.join(df_01) \
                                .join(df_02) \
                                .join(df_groups)
    df_unstacked.index = pd.to_datetime(df_unstacked.index)
    return df_unstacked
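
The reason for `agg(pd.Series.sum, skipna=False)` can be illustrated with a small sketch on made-up data: the default group sum skips missing values, so a day without any data would be reported as 0 turnover instead of as missing.

```python
import numpy as np
import pandas as pd

# made-up example: the second day has no recorded turnover at all
toy = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'turnover': [10.0, 20.0, np.nan, np.nan],
})

# default groupby sum skips NaN, so the missing day sums to 0
default_sum = toy.groupby('date')['turnover'].sum()

# passing skipna=False through agg keeps the missing day as NaN
strict_sum = toy.groupby('date')['turnover'].agg(pd.Series.sum, skipna=False)
```

Note that with `skipna=False` a single missing value in a group is enough to make the whole sum missing.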


In [None]:
df_unstacked = unstack_time_series(df, 'date', ['branch', 'product'], 'turnover')
df_unstacked.head(10)

### Baseline Modeling
For baseline modeling, we simply shift the time series by seven days, thereby using the sales from each day of the previous week as predictions for the respective day of the upcoming week. We calculate the residuals as predicted minus actual sales, i.e. we subtract the actual time series from the shifted one, so positive residuals indicate overestimated sales and negative residuals indicate underestimated sales.
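
As a minimal sketch of this idea on made-up data (the series below is hypothetical and rises by one unit per day), the shifted series leaves the first week without predictions. The sketch also shows that `shift(7)` moves rows rather than calendar days, so it corresponds to a seven-day lag only if the daily index has no gaps.

```python
import numpy as np
import pandas as pd

# three weeks of made-up daily sales on a gapless daily index
dates = pd.date_range('2021-01-04', periods=21, freq='D')
sales = pd.Series(np.arange(21, dtype=float) + 100.0, index=dates, name='sales')

# prediction for day t is the observed value seven rows earlier
pred = sales.shift(7)

# residual = prediction - actual, so positive values mean overestimated sales
resid = pred - sales

# the first week has no predecessor and therefore no prediction
n_missing = int(pred.isna().sum())

# calendar-based alternative: shift the index by seven days, then realign
pred_by_calendar = sales.shift(freq='7D').reindex(sales.index)
```

On a gapless index both variants give the same predictions; if days were missing, only the calendar-based variant would still compare like with like.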

In [None]:
def predict_by_previous_week(df):

    # predict values by imputing sales from the day of the preceding week
    df_pred = df.shift(7)

    # calculate residuals
    df_residual = df_pred - df

    return df_pred, df_residual


In [None]:
df_pred, df_residual = predict_by_previous_week(df_unstacked)

In [None]:
# check missings due to shifting the time series
df_residual.info()

### Merge original and shifted time series 

We will now merge the original time series with the shifted one to check if shifting was done properly.

In [None]:
df_unstacked_full = df_unstacked.merge(df_pred, on='date', how='left')
df_unstacked_full.head()

In [None]:
df_unstacked_full.reset_index(inplace=True)
df_unstacked_full.loc[[0+1000, 7+1000, 14+1000, 21+1000, 28+1000, 35+1000], ['date', 'Center_x', 'Center_y']]

The series seems to be properly shifted by exactly 7 days.
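
Instead of inspecting a handful of rows, the same check could be automated. The following is a minimal sketch; the helper `is_shifted_by_one_week` is hypothetical and not part of the notebook, and it assumes a `DatetimeIndex` on both frames.

```python
import numpy as np
import pandas as pd

def is_shifted_by_one_week(actual: pd.DataFrame, pred: pd.DataFrame) -> bool:
    """Hypothetical helper: True if each prediction equals the actual value 7 calendar days earlier."""
    # move the actual values seven days into the future
    lagged = actual.copy()
    lagged.index = lagged.index + pd.Timedelta(days=7)
    # compare only on the dates that carry a prediction
    pred_valid = pred.dropna(how='all')
    return lagged.reindex(pred_valid.index).equals(pred_valid)

# made-up two-week series for demonstration
dates = pd.date_range('2021-01-04', periods=14, freq='D')
toy = pd.DataFrame({'Center': np.arange(14.0)}, index=dates)

ok = is_shifted_by_one_week(toy, toy.shift(7))    # correct shift
bad = is_shifted_by_one_week(toy, toy.shift(6))   # off by one day
```

Applied to the notebook's data, the call would be `is_shifted_by_one_week(df_unstacked, df_pred)`.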

### evaluate performance
We calculate two main metrics for evaluating model performance: RMSE and MAPE.
Additionally, we calculate both metrics separately for predictions that overestimate sales and for predictions that underestimate sales.
Finally, we also calculate the overall mean and standard deviation of the actual time series for reference.
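
The following sketch illustrates these metrics on five made-up data points, using the convention residual = prediction - actual (positive values are overestimates). Residuals of exactly zero are counted as underestimates here, mirroring the `<= 0` condition used in the function below.

```python
import numpy as np

# made-up actual sales and predictions for five days
actual = np.array([100.0, 200.0, 100.0, 400.0, 200.0])
pred = np.array([110.0, 180.0, 100.0, 440.0, 150.0])

# residual = prediction - actual: [10, -20, 0, 40, -50]
resid = pred - actual

# overall metrics
rmse = np.sqrt(np.mean(resid ** 2))
mape = np.mean(np.abs(resid) / actual)

# the same metrics restricted to overestimated and underestimated days
over = resid > 0
under = ~over
rmse_over = np.sqrt(np.mean(resid[over] ** 2))
rmse_under = np.sqrt(np.mean(resid[under] ** 2))
mape_over = np.mean(np.abs(resid[over]) / actual[over])
mape_under = np.mean(np.abs(resid[under]) / actual[under])
```

MAPE is expressed as a fraction here (0.11 corresponds to 11 %), and it is undefined for days with zero actual sales.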

In [None]:
def calculate_eval_metrics(df_actual, df_residual):
    
    # iterate over all time series and calculate evaluation scores using list comprehension
    mean_actual = [np.mean(df_actual[col]).round(4) for col in df_residual.columns]
    std_actual = [np.std(df_actual[col]).round(4) for col in df_residual.columns]
    rmse_total = [np.sqrt(np.mean(np.square(df_residual[col]))).round(4) for col in df_residual.columns]
    mape_total = [np.mean(np.abs(df_residual[col]) / df_actual[col]).round(4) for col in df_residual.columns]
    rmse_over = [np.sqrt(np.mean(np.square(df_residual[df_residual[col]>0][col]))).round(4) for col in df_residual.columns]
    rmse_under = [np.sqrt(np.mean(np.square(df_residual[df_residual[col]<=0][col]))).round(4) for col in df_residual.columns]
    mape_over = [np.mean(np.abs(df_residual[df_residual[col]>0][col]) / df_actual[col]).round(4) for col in df_residual.columns]
    mape_under = [np.mean(np.abs(df_residual[df_residual[col]<=0][col]) / df_actual[col]).round(4) for col in df_residual.columns]

    # combine to dataframe
    df_eval = pd.DataFrame({'groups': df_residual.columns, 
                            'mean_actual': mean_actual, 'std_actual': std_actual,
                            'rmse_total': rmse_total, 'mape_total': mape_total,
                            'rmse_over': rmse_over, 'rmse_under': rmse_under, 'mape_over': mape_over, 'mape_under': mape_under})
    df_eval.set_index('groups', inplace=True, drop=True)
    
    # return evaluation metrics
    return df_eval

#### evaluation scores for predictions in original metric

In [None]:
df_eval = calculate_eval_metrics(df_unstacked, df_residual)
df_eval

### Error analysis for the three different branches
We perform an error analysis by plotting residual plots for the three different bakery branches.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(10, 4))
fig.suptitle('Residual plots', fontsize=16)

sns.scatterplot(x=df_pred['Metro'], y=df_residual['Metro'], color='red', ax=ax1)
ax1.set_xlabel('predicted sales', fontsize=12)
ax1.set_ylabel('error', fontsize=12)
ax1.set_title('Metro', fontsize=14)

sns.scatterplot(x=df_pred['Center'], y=df_residual['Center'], color='blue', ax=ax2)
ax2.set_xlabel('predicted sales', fontsize=12)
ax2.set_ylabel('error', fontsize=12)
ax2.set_title('Center', fontsize=14)


sns.scatterplot(x=df_pred['Train_Station'], y=df_residual['Train_Station'], color='green', ax=ax3)
ax3.set_xlabel('predicted sales', fontsize=12)
ax3.set_ylabel('error', fontsize=12)
ax3.set_title('Train_Station', fontsize=14)

plt.tight_layout()
plt.show()

The residuals are not randomly distributed across the range of predicted sales, irrespective of branch.

### Restack time series for further analysis
The unstacked time series will be restacked to represent its grouped structure according to branch and product. This stacked time series can then be used to perform additional analyses with the residuals, e.g. their relationship with weather data.
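
As a minimal sketch of the restacking idea, the lowest-level columns can also be melted in one step and the combined `branch | product` label split back into two columns. The toy frame below uses hypothetical names and made-up values.

```python
import pandas as pd

# toy residuals in the unstacked layout (made-up values, hypothetical names)
dates = pd.date_range('2021-01-01', periods=2, freq='D', name='date')
toy_resid = pd.DataFrame({
    'total': [1.0, -2.0],
    'A': [0.5, -1.0],
    'A | bread': [0.2, -0.4],
    'A | rolls': [0.3, -0.6],
    'B | bread': [0.1, -0.3],
}, index=dates)

# keep only the lowest-level columns, identified by the ' | ' separator
combo_cols = [col for col in toy_resid.columns if ' | ' in col]

# long format: one row per date and branch/product combination
stacked = toy_resid[combo_cols].melt(var_name='group', value_name='resid_turnover',
                                     ignore_index=False)

# split the combined label and strip the spaces around the separator
parts = stacked['group'].str.split('|', expand=True)
stacked['branch'] = parts[0].str.strip()
stacked['product'] = parts[1].str.strip()
stacked = stacked.drop(columns='group').reset_index()
```

Selecting columns by label rather than by position keeps the result independent of the column order produced by `pivot`.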

In [None]:
df_residual.columns

In [None]:
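# re-stack product sales for branch 'Metro'; select columns by name prefix so the result does not depend on column order
metro_cols = [col for col in df_residual.columns if col.startswith('Metro | ')]
metro_resid = pd.melt(df_residual, value_vars=metro_cols, var_name='product', value_name='resid_turnover', ignore_index=False)
metro_resid['branch'] = 'Metro'
metro_resid['product'] = metro_resid['product'].str.split('|').str[1].str.strip()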
metro_resid.head()

In [None]:
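# re-stack product sales for branch 'Center'; select columns by name prefix so the result does not depend on column order
center_cols = [col for col in df_residual.columns if col.startswith('Center | ')]
center_resid = pd.melt(df_residual, value_vars=center_cols, var_name='product', value_name='resid_turnover', ignore_index=False)
center_resid['branch'] = 'Center'
center_resid['product'] = center_resid['product'].str.split('|').str[1].str.strip()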
center_resid.head()

In [None]:
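# re-stack product sales for branch 'Train_Station'; select columns by name prefix so the result does not depend on column order
train_cols = [col for col in df_residual.columns if col.startswith('Train_Station | ')]
train_resid = pd.melt(df_residual, value_vars=train_cols, var_name='product', value_name='resid_turnover', ignore_index=False)
train_resid['branch'] = 'Train_Station'
train_resid['product'] = train_resid['product'].str.split('|').str[1].str.strip()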
train_resid.head()


In [None]:
# stack all three time series on top of each other
resid_all = pd.concat([metro_resid, center_resid, train_resid])

In [None]:
resid_all.sort_index(inplace=True)
resid_all.reset_index(inplace=True)
resid_all.head()

In [None]:
# the number of dates per group corresponds with the date range w/o gaps calculated at the beginning using pd.date_range
resid_all.groupby(['branch', 'product'])['date'].count()

In [None]:
df.columns

### merge weather statistics into the restacked time series

In [None]:
weather_stats = pd.read_csv('../data/summary_stats.csv')
weather_stats.date = pd.to_datetime(weather_stats.date)

In [None]:
resid_joined = resid_all.merge(weather_stats, on=['date'], how='left')
resid_joined.head()

In [None]:
resid_joined.shape

In [None]:
# the number of dates per group still corresponds with the date range w/o gaps calculated at the beginning using pd.date_range
resid_joined.groupby(['branch', 'product'])['date'].count()

### Analyse the distribution of residuals depending on weather condition

In [None]:
resid_joined.condition_total.value_counts()

In [None]:
sns.barplot(data = resid_joined[resid_joined.condition_total.isin(['cloudy', 'rainy', 'clear', 'foggy', 'snowy'])], 
            hue='condition_total', y='resid_turnover', x='branch');