## Meteo Bakery - EDA on the relationship between sales and weather data
In this notebook, we will analyze the relationship between the residuals of our baseline model and different weather data statistics.


### Import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load data

In [None]:
df = pd.read_csv('../data/data_combined.csv')

# parse 'date' to datetime object
df.date = pd.to_datetime(df.date)

In [None]:
df.head(10)

### Plot time course of overall sales per branch

In [None]:
# define utility function for plotting sales data
def plot_sales(df, product_cat, year_range, title):
    """Plot turnover per bakery branch over a specified range of years. Data can be plotted for all products combined or for a single product category.

    Args:
        df (pd.DataFrame): Sales data with the columns 'date', 'branch', 'product' and 'turnover'
        product_cat (str): Product category to plot, or 'All' to sum over all products
        year_range (list): Start year (inclusive) and end year (exclusive) of the plotting time frame
        title (str): Plot title
    """
    palette = {'Metro': 'red', 'Center': 'blue', 'Train_Station': 'green'}
    # boolean mask for rows within the requested years (end year is excluded by range)
    in_range = df.date.dt.year.isin(range(year_range[0], year_range[1]))

    if product_cat == 'All':
        # total turnover across products for each branch and date
        plot_data = df[in_range].groupby(['branch', 'date'], as_index=False)['turnover'].sum()
    else:
        # use bracket access: df.product is the DataFrame.product method, not the column
        plot_data = df[in_range & (df['product'] == product_cat)]

    sns.lineplot(data=plot_data, x='date', y='turnover', hue='branch', palette=palette, alpha=0.8)
    
    plt.ylabel('Turnover', fontsize=12)
    plt.xlabel('Year', fontsize=12)
    plt.xticks(rotation = 45)
    plt.legend(loc='upper right', fontsize=10)
    plt.title(title)

In [None]:
# plot for all sales products as a summary plot using original df
plt.figure(figsize=(6, 4))
plot_sales(df, 'All', [2012, 2022], 'Total Sales 2012-2021')

In [None]:
df.day_of_week.value_counts()

In [None]:
# count dates with missing values per year (each date spans 15 rows, so divide the row count by 15)
missings = df[df.isna().any(axis=1)].copy()
missings['year_temp'] = missings.date.dt.year
missings.year_temp.value_counts() / 15
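
The count above covers dates that are present but contain missing values. Calendar dates that are absent from the data altogether would not show up there, and they matter for the row-based shift used later. A minimal sketch of such a check follows; the helper name `missing_dates_per_year` and the toy dates are hypothetical and only serve as illustration.

```python
import pandas as pd

def missing_dates_per_year(dates):
    """Count calendar days between the first and last date that do not occur in `dates`."""
    # reduce to unique calendar days
    observed = pd.DatetimeIndex(pd.to_datetime(dates)).normalize().unique()
    # complete daily calendar spanning the observed range
    full = pd.date_range(observed.min(), observed.max(), freq='D')
    absent = full.difference(observed)
    return pd.Series(absent.year, name='year').value_counts().sort_index()

# toy example: 2020-12-31 and 2021-01-01 are absent
gaps = missing_dates_per_year(['2020-12-30', '2021-01-02', '2021-01-03'])
print(gaps)
```

On the real data, the call would presumably be `missing_dates_per_year(df.date)`.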

### Shift time series to calculate residuals based on baseline model
The baseline model predicts each day's turnover with the turnover from seven days earlier. We will therefore shift the original stacked time series by seven days and analyze the resulting residuals depending on weather type.

Since the time series is nested by branch and product category, each date is represented 15 times.

In [None]:
df.head(16)

Each date is repeated 15 times, once for each combination of branch and product category, before the next date begins. Thus, in order to shift the sales data by seven days to be used as predictions, the time series needs to be shifted by 7 * 15 = 105 rows.

In [None]:
# create a copy of the original full-length time series and attach a column of predicted values shifted by 7*15 rows (= 7 days)
df_pred = df.copy()
df_pred['pred'] = df.turnover.shift(7*15)
# residual = prediction - actual turnover (positive values mean the baseline overestimates)
df_pred['residual'] = df_pred['pred'] - df_pred['turnover']

In [None]:
# double-check if shifting was successful for selected dates.
df_pred.loc[[0, 105, 210, 315, 420, 525], ['date', 'turnover', 'pred']]

As can be seen, the shifting worked properly.
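
The row-based shift relies on every date contributing exactly 15 rows in a fixed order. As a sanity check, the sketch below builds a toy stacked series and compares the positional shift with a shift applied within each branch and product group. The toy data, the product names, and the columns `pred_rows` and `pred_group` are assumptions made for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# toy stacked series: 3 branches x 5 hypothetical products = 15 rows per date
dates = pd.date_range('2020-01-01', periods=21, freq='D')
branches = ['Metro', 'Center', 'Train_Station']
products = [f'product_{i}' for i in range(5)]
toy = pd.MultiIndex.from_product(
    [dates, branches, products], names=['date', 'branch', 'product']
).to_frame(index=False)

rng = np.random.default_rng(0)
toy['turnover'] = rng.integers(10, 100, size=len(toy)).astype(float)

# number of stacked series per date (15 in this toy example)
n_series = int(toy.groupby('date').size().iloc[0])

# positional shift over the stacked frame, as in the cell above
toy['pred_rows'] = toy['turnover'].shift(7 * n_series)

# shift by 7 rows within each branch/product series
toy['pred_group'] = toy.groupby(['branch', 'product'])['turnover'].shift(7)

# both approaches agree when every date has the full set of rows in the same order
same = toy['pred_rows'].equals(toy['pred_group'])
print(same)
```

Both variants shift by position rather than by calendar date, so they only correspond to a seven-day lag if no dates are missing from the series.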

### Distribution of residuals depending on weather condition

In [None]:
df_pred.condition_total.value_counts()

In [None]:
sns.barplot(data = df_pred[df_pred.condition_total.isin(['cloudy', 'rainy', 'clear', 'foggy', 'snowy'])], 
            hue='condition_total', y='residual', x='branch');

The re-analysis using the directly shifted stacked time series yields the same result as the restacked time series.
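
To read exact values off the bar plot, the same grouping can be summarized as a table. The following is a minimal sketch: the helper `summarize_residuals` and the toy frame are hypothetical, while the column names `branch`, `condition_total` and `residual` follow the data used above.

```python
import pandas as pd

def summarize_residuals(frame, conditions):
    """Mean, standard error and count of residuals per branch and weather condition."""
    subset = frame[frame['condition_total'].isin(conditions)]
    return (subset.groupby(['branch', 'condition_total'])['residual']
                  .agg(['mean', 'sem', 'count'])
                  .reset_index())

# toy data for illustration only
toy = pd.DataFrame({
    'branch': ['Metro', 'Metro', 'Metro', 'Center', 'Center', 'Center'],
    'condition_total': ['clear', 'clear', 'rainy', 'clear', 'rainy', 'rainy'],
    'residual': [1.0, 3.0, -2.0, 0.5, -1.0, -3.0],
})
summary = summarize_residuals(toy, ['clear', 'rainy'])
print(summary)
```

Note that `sns.barplot` draws bootstrapped confidence intervals by default, so its error bars will not match the standard error column exactly.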

### Interaction between month and weather condition on residuals

In [None]:
for month in range(1, 13):
    sns.barplot(data = df_pred[(df_pred['month'] == month) & (df_pred.condition_total.isin(['cloudy', 'rainy', 'clear', 'foggy', 'snowy']))], 
            hue='condition_total', y='residual', x='branch')
    plt.title(f'Month {month}')
    plt.show()

### Distribution of residuals depending on weather condition for each product category

In [None]:
for prod in df_pred['product'].unique():
    sns.barplot(data = df_pred[(df_pred['product'] == prod) & (df_pred.condition_total.isin(['cloudy', 'rainy', 'clear', 'foggy', 'snowy']))], 
            hue='condition_total', y='residual', x='branch')
    plt.title(prod)
    plt.show()
    

### Regression plots between residuals and weather statistics

In [None]:
df_pred.columns

In [None]:
for x in ['temp_mean', 'temp_min', 'temp_max', 'temp_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()

In [None]:
for x in ['pressure_mean', 'pressure_min','pressure_max', 'pressure_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()

In [None]:
for x in ['humidity_mean', 'humidity_min', 'humidity_max','humidity_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()

In [None]:
for x in ['clouds_mean', 'clouds_min', 'clouds_max', 'clouds_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()

In [None]:
for x in ['wind_speed_mean', 'wind_speed_min','wind_speed_max', 'wind_speed_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()

In [None]:
for x in ['wind_dir_y_mean','wind_dir_y_min', 'wind_dir_y_max', 'wind_dir_y_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()

In [None]:
for x in ['wind_dir_x_mean','wind_dir_x_min', 'wind_dir_x_max', 'wind_dir_x_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()

In [None]:
for x in ['rain_1h_mean', 'rain_1h_min', 'rain_1h_max', 'rain_1h_std']:
    sns.lmplot(data=df_pred[df_pred.year<2016], y='residual', x=x, col='product', hue='branch', facet_kws=({'sharey':False}))
    plt.show()
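
The regression plots can be complemented with a compact table of Pearson correlations between the residuals and each weather statistic, computed per branch and product. This is a minimal sketch under the assumption that a linear correlation is an adequate summary; the helper `residual_correlations` and the toy data are hypothetical.

```python
import pandas as pd

def residual_correlations(frame, features, by=('branch', 'product')):
    """Pearson correlation of each feature with 'residual' within each group."""
    rows = {}
    for keys, group in frame.groupby(list(by)):
        # correlate every feature column with the residuals of this group
        rows[keys] = group[features].corrwith(group['residual'])
    out = pd.DataFrame(rows).T
    out.index.names = list(by)
    return out

# toy data: residuals rise with temperature at 'Metro' and fall at 'Center'
toy = pd.DataFrame({
    'branch': ['Metro'] * 4 + ['Center'] * 4,
    'product': ['product_0'] * 8,
    'temp_mean': [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    'temp_std': [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    'residual': [1.0, 3.0, 5.0, 7.0, 0.0, -1.0, -2.0, -3.0],
})
corr = residual_correlations(toy, ['temp_mean', 'temp_std'])
print(corr)
```

On the real data, the call would presumably look like `residual_correlations(df_pred[df_pred.year < 2016].dropna(subset=['residual']), ['temp_mean', 'temp_min', 'temp_max', 'temp_std'])`.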