## Meteo Bakery - Baseline Model
As our first baseline model, we use a simple heuristic: the sales on each day of the previous week serve as the forecast for the same weekday of the upcoming week. These predictions are made for:
* Total sales
* Total sales per branch
* Total sales per product
* Sales for each product per branch

We will use RMSE and MAE as our evaluation metrics. Additionally, we will calculate RMSE and MAE separately for predictions that overestimate sales and for predictions that underestimate sales. For now, evaluation is based on the whole time series.
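
For reference, the baseline and the two metrics can be written out as follows. The notation is introduced here for illustration and does not appear elsewhere in the notebook: $y_t$ is the actual sales value on day $t$, $\hat{y}_t$ the forecast, $e_t$ the residual, and $T$ the number of evaluated days.

$$
\hat{y}_t = y_{t-7}, \qquad e_t = y_t - \hat{y}_t
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} e_t^{2}}, \qquad \mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T} \lvert e_t \rvert
$$

Under this sign convention, $e_t > 0$ means the forecast was below actual sales (underestimate) and $e_t < 0$ means it was above (overestimate).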


### import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### load data

In [None]:
df = pd.read_csv('../data/data_combined.csv')

# parse 'date' to datetime object
df.date = pd.to_datetime(df.date)

In [None]:
df.head()

### generate unstacked dataframe
Currently, our dataframe represents a grouped time series (grouped by branch and product). For modeling, the dataframe will be unstacked (i.e. ungrouped), such that the individual time series are represented as separate columns in the dataframe. The ungrouping will be done for every single level of the grouped dataframe. Specifically, we will generate 
* a time series for total sales, summing up sales over all branches and products
* summed time series for the three different bakery branches with sales summed up across products per branch
* summed time series for the five different bakery products with sales summed up across branches per product
* individual time series per branch and product

In total, we will end up with 24 different time series.
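
As a quick sanity check on that count, the sketch below builds a toy long-format frame and counts the series at each level. The labels `B1`..`B3` and `P1`..`P5` and the column names `branch`, `product`, and `sales` are hypothetical placeholders; the real names come from the data.

```python
import itertools
import pandas as pd

# Hypothetical labels for illustration only; the real names live in the data.
branches = ['B1', 'B2', 'B3']
products = ['P1', 'P2', 'P3', 'P4', 'P5']
dates = pd.date_range('2021-01-01', periods=2, freq='D')

# Long format: one row per (date, branch, product)
toy = pd.DataFrame(list(itertools.product(dates, branches, products)),
                   columns=['date', 'branch', 'product'])
toy['sales'] = 1.0

n_total = 1                                             # one overall series
n_branch = toy['branch'].nunique()                      # one series per branch
n_product = toy['product'].nunique()                    # one series per product
n_combo = toy.groupby(['branch', 'product']).ngroups    # one per branch/product pair

n_series = n_total + n_branch + n_product + n_combo
print(n_series)
```

With three branches and five products this gives 1 + 3 + 5 + 15 = 24 series.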

In [None]:
def unstack_time_series(df, index, groups, target):
    
    # create the individual combinations df
    df_groups = df.pivot(index=index, columns=groups, values=target)
    df_groups.columns = df_groups.columns.to_flat_index().map('{0[0]} | {0[1]}'.format)

    # create df for first group (select the target column before summing,
    # so that non-numeric columns are not aggregated)
    df_01 = df.groupby([index, groups[0]])[target] \
                        .sum() \
                        .reset_index(drop=False) \
                        .pivot(index=index, columns=groups[0], values=target)

    # create df for second group
    df_02 = df.groupby([index, groups[1]])[target] \
                        .sum() \
                        .reset_index(drop=False) \
                        .pivot(index=index, columns=groups[1], values=target)

    # create the total level df
    df_total = df.groupby(index)[target] \
                .sum() \
                .to_frame() \
                .rename(columns={target: 'total'})

    # join the DataFrames
    df_unstacked = df_total.join(df_01) \
                                .join(df_02) \
                                .join(df_groups)
    df_unstacked.index = pd.to_datetime(df_unstacked.index)
    return df_unstacked


In [None]:
df_unstacked = unstack_time_series(df, 'date', ['Location', 'PredictionGroupName'], 'SoldTurnver')
df_unstacked.head(10)

### Baseline Modeling
For baseline modeling, we simply shift the time series by seven days, thereby using the sales from days of the previous week as predictions for the sales on the respective days of the upcoming week. We calculate the residuals by simply subtracting the shifted time series from the actual time series.
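
A minimal sketch of this shift on a made-up two-week series is shown here. It assumes one row per calendar day with no gaps, because `shift(7)` moves values by seven rows rather than by seven calendar days.

```python
import numpy as np
import pandas as pd

# Toy daily series covering two weeks (values are made up for illustration)
idx = pd.date_range('2021-01-04', periods=14, freq='D')
sales = pd.Series(np.arange(100.0, 114.0), index=idx, name='total')

# Forecast for each day = actual value seven rows (days) earlier
pred = sales.shift(7)

# Residual = actual - forecast; a positive value means the forecast was too low
residual = sales - pred

n_missing = int(residual.isna().sum())    # the first week has no forecast
first_valid = residual.dropna().iloc[0]   # 107 - 100
```

The first seven residuals are `NaN` because no earlier week exists to forecast from, so those rows have to be dropped before computing any metric.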

In [None]:
# min-max scale each series to [0, 1] so that errors are comparable across series
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# keep the datetime index and the column names of the unscaled dataframe
df_unstacked_scaled = pd.DataFrame(scaler.fit_transform(df_unstacked),
                                   index=df_unstacked.index,
                                   columns=df_unstacked.columns)
df_unstacked_scaled.head(10)

In [None]:
def predict_by_previous_week(df):

    # use the sales from the same weekday of the preceding week as the prediction
    df_pred = df.shift(7)

    # calculate residuals (actual - predicted)
    df_residual = df - df_pred

    return df_pred, df_residual


In [None]:
df_pred, df_residual = predict_by_previous_week(df_unstacked)
df_pred_scaled, df_residual_scaled = predict_by_previous_week(df_unstacked_scaled)


In [None]:
# check for missing values caused by shifting the time series
df_residual.info()

Shifting the time series leaves the first seven rows without a prediction, so their residuals are missing. To calculate the evaluation metrics, we first have to drop all rows with missing values.

In [None]:
df_residual.dropna(inplace=True)
df_residual_scaled.dropna(inplace=True)

### evaluate performance
We calculate two main metrics for evaluating model performance: RMSE and MAE.
Additionally, we calculate both metrics separately for predictions that overestimate sales and for predictions that underestimate sales.
Finally, we also calculate the overall mean and standard deviation of the actual time series for reference.
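
The split into over- and underestimates can be sketched on a handful of made-up residuals. This assumes the convention used above, residual = actual minus prediction, so a positive residual means the prediction was too low. The values are illustrative only and are not results from the data.

```python
import numpy as np

# Made-up residuals (actual - predicted) for illustration
residual = np.array([3.0, -4.0, 5.0, -12.0])

rmse = np.sqrt(np.mean(residual ** 2))
mae = np.mean(np.abs(residual))

# Positive residual: prediction below actual sales (underestimate)
under = residual[residual > 0]
# Negative residual: prediction above actual sales (overestimate)
over = residual[residual < 0]

mae_under = np.mean(np.abs(under))          # (3 + 5) / 2
mae_over = np.mean(np.abs(over))            # (4 + 12) / 2
rmse_under = np.sqrt(np.mean(under ** 2))   # sqrt((9 + 25) / 2)
rmse_over = np.sqrt(np.mean(over ** 2))     # sqrt((16 + 144) / 2)
```

Taking the absolute value inside the mean matters for the split MAE: without it, the mean over the negative residuals would come out negative.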

In [None]:
def calculate_eval_metrics(df_actual, df_residual):
    
    # iterate over all time series and calculate evaluation scores using list comprehensions
    mean_actual = [np.mean(df_actual[col]).round(4) for col in df_residual.columns]
    std_actual = [np.std(df_actual[col]).round(4) for col in df_residual.columns]
    rmse_total = [np.sqrt(np.mean(np.square(df_residual[col]))).round(4) for col in df_residual.columns]
    mae_total = [np.mean(np.abs(df_residual[col])).round(4) for col in df_residual.columns]
    # residual = actual - predicted, so residual > 0 means sales were underestimated
    # and residual <= 0 means sales were overestimated (or hit exactly)
    rmse_over = [np.sqrt(np.mean(np.square(df_residual[df_residual[col]<=0][col]))).round(4) for col in df_residual.columns]
    rmse_under = [np.sqrt(np.mean(np.square(df_residual[df_residual[col]>0][col]))).round(4) for col in df_residual.columns]
    # take absolute values so that the MAE of the negative residuals is not negative
    mae_over = [np.mean(np.abs(df_residual[df_residual[col]<=0][col])).round(4) for col in df_residual.columns]
    mae_under = [np.mean(np.abs(df_residual[df_residual[col]>0][col])).round(4) for col in df_residual.columns]

    # combine to dataframe
    df_eval = pd.DataFrame({'groups': df_residual.columns, 
                            'mean_actual': mean_actual, 'std_actual': std_actual,
                            'rmse_total': rmse_total, 'mae_total': mae_total,
                            'rmse_over': rmse_over, 'rmse_under': rmse_under, 'mae_over': mae_over, 'mae_under': mae_under})
    df_eval.set_index('groups', inplace=True, drop=True)
    
    # return evaluation metrics
    return df_eval

#### evaluation scores for predictions on the original scale

In [None]:
df_eval = calculate_eval_metrics(df_unstacked, df_residual)
df_eval

#### evaluation scores for scaled data (min-max scaling)

In [None]:
df_eval_scaled = calculate_eval_metrics(df_unstacked_scaled, df_residual_scaled)
df_eval_scaled