## Meteo Bakery: Exploratory Data Analysis - Sales

This notebook performs a basic exploratory data analysis of the sales data from the different bakery branches.
There are three bakery branches at different locations. Sales data were recorded daily for five different bakery products from 2012 to 2021.

### import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### load data

In [None]:
sales = pd.read_excel('../data/neueFische_Umsaetze_Baeckerei.xlsx')

### EDA and Feature Engineering

In [None]:
# get basic information on datatypes and missings
sales.info()

In [None]:
# There are three NaN values in the sales data; extract additional information
sales[np.isnan(sales.SoldTurnver)]

There are three missing values in the sales data ('SoldTurnver'). They all belong to a single product category ('Mischbrote') and occur in all three branches on 2021-10-16. Possibly this product could not be produced on that day due to technical issues or other reasons.
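
The missing entries can also be summarized per date and product instead of being read off the filtered table. The following is a minimal sketch on a small made-up frame that only borrows the column names used above (`Date`, `Branch`, `PredictionGroupName`, `SoldTurnver`); the values are illustrative assumptions, not the bakery data.

```python
import numpy as np
import pandas as pd

# toy data (assumption): three branches, two products, one day
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-10-16'] * 6),
    'Branch': [1, 2, 3, 1, 2, 3],
    'PredictionGroupName': ['Mischbrote'] * 3 + ['Weizenbrötchen'] * 3,
    'SoldTurnver': [np.nan, np.nan, np.nan, 410.0, 120.5, 633.0],
})

# count missing turnover values per date and product
missing_counts = (
    toy[toy.SoldTurnver.isna()]
    .groupby(['Date', 'PredictionGroupName'])
    .size()
    .reset_index(name='n_missing')
)
print(missing_counts)
```

A count equal to the number of branches indicates that the product is missing everywhere on that day, rather than in a single branch.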

#### add a column coding for the location of the different bakery branches

In [None]:
# generate location column based on branch
# Branch 1: Metro
# Branch 2: City Center
# Branch 3: Train Station

sales['Location'] = sales.Branch.apply(lambda x: 'Metro' if x==1 else 'Center' if x==2 else 'Train_Station')
sales.head()
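
As an alternative to the nested conditional expression, the branch numbers could be translated with a dictionary. This is only a sketch on a toy frame; `branch_to_location` is a hypothetical name, and the mapping follows the comments in the cell above.

```python
import pandas as pd

# hypothetical lookup table: branch number -> location label
branch_to_location = {1: 'Metro', 2: 'Center', 3: 'Train_Station'}

# toy frame (assumption) with the same 'Branch' column as the sales data
toy = pd.DataFrame({'Branch': [1, 2, 3, 2]})
toy['Location'] = toy.Branch.map(branch_to_location)
print(toy)
```

One difference in behavior: with `map`, an unexpected branch number yields a missing value, whereas the conditional expression labels it 'Train_Station'.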

#### extract additional time features from the Date column

In [None]:
# extract time features from Date column
sales['year'] = sales.Date.dt.year
sales['month'] = sales.Date.dt.month
# Series.dt.week was removed in pandas 2.0; use the ISO calendar week instead
sales['week'] = sales.Date.dt.isocalendar().week
sales['day_of_month'] = sales.Date.dt.day
sales['day_of_week'] = sales.Date.dt.dayofweek

sales.head()
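
The week feature follows the ISO calendar, so the first days of January can belong to the last week of the previous year. A small sketch with three hand-picked dates (assumed for illustration) shows this together with the `dayofweek` convention (Monday=0, Sunday=6):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2021-01-03', '2021-01-04', '2021-10-16']))

features = pd.DataFrame({
    'Date': dates,
    # ISO calendar week; cast to a plain integer dtype for convenience
    'week': dates.dt.isocalendar().week.astype(int),
    # Monday=0 ... Sunday=6
    'day_of_week': dates.dt.dayofweek,
})
print(features)
```

Here 2021-01-03 (a Sunday) still falls into ISO week 53 of 2020, which matters when sales are aggregated by `year` and `week` together.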

### Visualize data

In [None]:
# extract product categories to plot data separately for each product
products = sales.PredictionGroupName.unique().tolist()

# insert category 'all' for all products in case data should be visualized across all categories
products.insert(0, 'All')
products

### Sales across time

In [None]:
# define utility function for plotting sales data
def plot_sales(product, year_range, title):
    """Plot sales over time for each bakery branch within a given range of years, either averaged across all products or for a single product.

    Args:
        product (str): Product name, or 'All' to average across products
        year_range (list): Start year (inclusive) and end year (exclusive) of the plotting time frame
        title (str): Plot title
    """
    if product=='All':
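        # average sales across products for each branch and date
        # (select the numeric column explicitly; averaging the string columns fails in recent pandas versions)
        mean_sales = sales.groupby(['Location', 'Date'])['SoldTurnver'].mean().reset_index()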
        
        sns.lineplot(data=mean_sales[(mean_sales.Date.dt.year.isin(range(year_range[0], year_range[1])))], 
                x='Date', y='SoldTurnver', hue='Location', palette={'Metro': 'red', 'Center': 'blue', 'Train_Station': 'green'}, alpha=0.8)
    else:
        sns.lineplot(data=sales[(sales.PredictionGroupName==product) & (sales.Date.dt.year.isin(range(year_range[0], year_range[1])))], 
                x='Date', y='SoldTurnver', hue='Location', palette={'Metro': 'red', 'Center': 'blue', 'Train_Station': 'green'}, alpha=0.8)
    
    plt.ylabel('Turnover', fontsize=12)
    plt.xlabel('Year', fontsize=12)
    plt.xticks(rotation = 45)
    plt.legend(loc='upper right', fontsize=10)
    plt.title(title)

In [None]:
# plot time series data for all product sales together and for each individual product separately for the different branches
fig = plt.figure(figsize=(10, 10))

j = 1
for i in range(len(products)):
    subplot = fig.add_subplot(3, 2, j)
    plot_sales(products[i], [2012, 2022], f'{products[i]} Sales 2012-2021')
    j += 1
plt.tight_layout()
plt.show()

In [None]:
# plot again for all sales products as a summary plot
plt.figure(figsize=(6, 4))
plot_sales(products[0], [2012, 2022], f'{products[0]} Sales 2012-2021')

The gap in the 2021 sales data corresponds to the first Covid-19 lockdown; this period has already been removed from the data.
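
Such gaps can also be located programmatically by looking for jumps of more than one day between consecutive recorded dates. The sketch below uses an artificial date series (an assumption, not the bakery data) and hypothetical column names for the result.

```python
import pandas as pd

# toy daily dates with an artificial gap (assumption)
dates = pd.Series(
    pd.date_range('2015-03-01', '2015-03-10').tolist()
    + pd.date_range('2015-05-01', '2015-05-05').tolist()
)

# sorted unique recording days and the step to the previous day
unique_days = dates.drop_duplicates().sort_values().reset_index(drop=True)
is_jump = unique_days.diff() > pd.Timedelta(days=1)

gaps = pd.DataFrame({
    'last_day_before': unique_days.shift(1)[is_jump],
    'first_day_after': unique_days[is_jump],
})
print(gaps)
```

Applied to the real data, the same idea would use `sales.Date` in place of the toy series.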

As can be seen, sales decrease over time for the branches located at the Metro and at the Train Station. In particular, there is a sudden drop in sales around 2016 for the branch located at the Train Station, which should be investigated in more detail.

The sales for the branch in the City Center are generally lower than those of the other branches.
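
A sudden drop like this can be dated more precisely with year-over-year changes of the mean turnover per branch. The following is a minimal sketch on invented yearly values (an assumption chosen to mimic a drop in 2016), using the column names from this notebook.

```python
import pandas as pd

# toy data (assumption): one branch with a drop in 2016, one stable branch
toy = pd.DataFrame({
    'Location': ['Train_Station'] * 4 + ['Center'] * 4,
    'year': [2014, 2015, 2016, 2017] * 2,
    'SoldTurnver': [1000.0, 1000.0, 600.0, 600.0, 200.0, 200.0, 200.0, 200.0],
})

# mean turnover per year (rows) and branch (columns)
yearly = toy.groupby(['Location', 'year'])['SoldTurnver'].mean().unstack('Location')

# relative change compared to the previous year
yoy_change = yearly.pct_change()

# year with the most negative change for each branch
largest_drop_year = yoy_change.idxmin()
print(yoy_change)
print(largest_drop_year)
```

For a branch without any change the reported year is not meaningful, so the size of the change in `yoy_change` should be checked alongside it.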

### inspect sales data for branch at Train Station
The sales for the bakery branch located at the Train Station show a sudden drop around 2016. Therefore, the sales data for this branch are investigated in more detail.

In [None]:
# aggregate sales over month per year
monthly_sales = sales.groupby(['Location', 'PredictionGroupName', 'year', 'month'])['SoldTurnver'].mean().reset_index()
monthly_sales.head()

In [None]:
monthly_sales.info()

In [None]:
# plot monthly sales for branch at Train station separately for the different years
fig = plt.figure(figsize=(8, 5))

for i in range(1, len(products)):
    subplot = fig.add_subplot(2, 3, i)
    # draw the year legend only once, next to the last subplot
    show_legend = (i == 5)
    sns.lineplot(data=monthly_sales[(monthly_sales.Location=='Train_Station') & (monthly_sales.PredictionGroupName == products[i])], 
                x='month', y='SoldTurnver', hue='year', alpha=0.8,  palette='Greens', 
                legend='full' if show_legend else False)
    plt.ylabel('Turnover', fontsize=12)
    plt.xlabel('Month', fontsize=12)
    plt.xticks(ticks=np.arange(0, 13, 2))
    plt.title(products[i])
    if show_legend:
        plt.legend(bbox_to_anchor=(1.1, 1.1), loc='upper left', fontsize=9)
plt.tight_layout()
plt.show()

For the bakery branch located at the Train Station, there is a sudden drop in sales in 2016 for all products except Mischbrote. Additionally, the effect of season on sales seems to be less pronounced from 2016 onwards, especially for the following products: klassischer Kuchen, herzhafter Snack.
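
How pronounced the seasonality is in a given year can be expressed as a single number, for example the range of the monthly means relative to the yearly mean. This measure is an assumption made for illustration, as are the toy values below; it is not part of the original analysis.

```python
import pandas as pd

# toy monthly means (assumption): strong seasonality in 2015, weak in 2016
toy = pd.DataFrame({
    'year': [2015] * 4 + [2016] * 4,
    'month': [1, 4, 7, 10] * 2,
    'SoldTurnver': [800.0, 1200.0, 900.0, 1100.0, 590.0, 610.0, 595.0, 605.0],
})

by_year = toy.groupby('year')['SoldTurnver']

# relative seasonal range: (max - min) of the monthly means divided by the yearly mean
seasonal_range = (by_year.max() - by_year.min()) / by_year.mean()
print(seasonal_range)
```

Dividing by the yearly mean keeps years with different overall sales levels comparable, which matters here because the level itself drops in 2016.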

### Overall sales differences between branches and products

In [None]:
sales.head()

In [None]:
# define utility functions for plotting overall sales data by branch or by product
def plot_sales_by_branch(product):
    """Plot sales data by bakery branches for defined product.

    Args:
        product (str): Product name
    """
    sns.boxplot(data=sales[sales.PredictionGroupName==product], 
                        x='Location', y='SoldTurnver', saturation=0.5, 
                        palette={'Metro': 'red', 'Center': 'blue', 'Train_Station': 'green'})
    plt.ylabel('Turnover', fontsize=12)
    plt.xticks(rotation = 45)
    plt.title(product)

# define utility function for plotting overall sales data by product
def plot_sales_by_product(location):
    """Plot sales data by product for a given bakery branch.

    Args:
        location (str): Branch location
    """
    sns.boxplot(data=sales[sales.Location==location], 
                        x='PredictionGroupName', y='SoldTurnver', saturation=0.5, 
                        color='red' if location=='Metro' else 'blue' if location=='Center' else 'green')
    plt.ylabel('Turnover', fontsize=12)
    plt.xlabel('')
    plt.xticks(rotation = 45, ha='right')
    plt.title(location)

### plot overall product sales differences between branches

In [None]:
fig = plt.figure(figsize=(10, 10))

j = 1
for i in range(1, len(products)):
    subplot = fig.add_subplot(3, 2, j)
    plot_sales_by_branch(products[i])
    j += 1
plt.tight_layout()
plt.show()

### plot sales profile for the different branches

In [None]:
fig = plt.figure(figsize=(8, 4))

for i, x in enumerate(sales.Location.unique().tolist()):
    subplot = fig.add_subplot(1, 3, i+1)
    plot_sales_by_product(x)
plt.tight_layout()
plt.show()

The branches at the Metro and at the Train Station have similar sales profiles. They make most of their turnover with handliches Gebäck and herzhafter Snack. By contrast, the branch in the City Center makes most of its turnover with handliches Gebäck and klassischer Kuchen, followed by Mischbrote.
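
Because the branches differ in their overall sales level, the profiles are easier to compare as shares of each branch's total turnover. The following sketch uses invented numbers (an assumption) and the column names from this notebook.

```python
import pandas as pd

# toy data (assumption): two branches, two products
toy = pd.DataFrame({
    'Location': ['Metro', 'Metro', 'Center', 'Center'],
    'PredictionGroupName': ['handliches Gebäck', 'Mischbrote'] * 2,
    'SoldTurnver': [300.0, 100.0, 150.0, 50.0],
})

# total turnover per branch (rows) and product (columns)
profile = toy.pivot_table(index='Location', columns='PredictionGroupName',
                          values='SoldTurnver', aggfunc='sum')

# divide each row by its total so that the shares of a branch sum to 1
shares = profile.div(profile.sum(axis=1), axis=0)
print(shares)
```

In this toy example both branches have the same profile even though the Metro branch sells twice as much in absolute terms.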

### Sales differences between branches by month, day of the month and day of the week

In [None]:
# utility function to plot differences in product sales depending on a defined time period
def plot_sales_by_period(product, period, ylim, step):
    """Plot product sales data as boxplots grouped by a specified time period, with one panel per bakery branch.

    Args:
        product (str): Product name
        period (str): Time feature column to group by (e.g. 'month', 'day_of_month', 'day_of_week')
        ylim (int): Largest y-axis tick value
        step (int): Step size for y-axis ticks
    """

    # branch number -> (panel title, box color)
    branches = {1: ('Metro', 'red'), 2: ('Center', 'blue'), 3: ('Train_Station', 'green')}

    fig, axes = plt.subplots(1, 3, figsize=(10, 4))
    plt.suptitle(f'{product} sales by {period}', fontsize=14)

    # draw one boxplot panel per branch
    for ax, (branch, (name, color)) in zip(axes, branches.items()):
        sns.boxplot(data=sales[(sales.PredictionGroupName==product) & (sales.Branch == branch)], 
                    x=period, y='SoldTurnver', color=color, saturation=0.5, ax=ax)
        ax.set_ylabel('Turnover', fontsize=12)
        ax.set_yticks(ticks=np.arange(0, ylim+1, step))
        ax.set_xlabel(period)
        ax.set_title(name)

    plt.tight_layout()
    plt.show()


### plot by monthly period

In [None]:
plot_sales_by_period(products[1], 'month', 800, 200)

In [None]:
plot_sales_by_period(products[2], 'month', 800, 200)

In [None]:
plot_sales_by_period(products[3], 'month', 1000, 200)

In [None]:
plot_sales_by_period(products[4], 'month', 3000, 500)

In [None]:
plot_sales_by_period(products[5], 'month', 3500, 500)

There are some seasonality effects present in the sales data across the different bakery branches. Sales for Mischbrote and Weizenbrötchen tend to be lower in summer. Sales for klassischer Kuchen and handliches Gebäck are higher in spring and autumn than in the other seasons. There are no clear seasonal differences in the sales for herzhafter Snack.
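
To put numbers on these seasonal statements, the months could be grouped into seasons and averaged. The sketch below assumes meteorological seasons of the northern hemisphere; `month_to_season` is a hypothetical helper and the turnover values are invented.

```python
import pandas as pd

# hypothetical helper (assumption): meteorological seasons, northern hemisphere
month_to_season = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
}

# toy data (assumption) with the 'month' feature created earlier
toy = pd.DataFrame({
    'month': [1, 4, 7, 10, 12],
    'SoldTurnver': [500.0, 450.0, 300.0, 460.0, 520.0],
})
toy['season'] = toy.month.map(month_to_season)

# mean turnover per season
season_means = toy.groupby('season')['SoldTurnver'].mean()
print(season_means)
```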

### plot by day of the month

In [None]:
plot_sales_by_period(products[1], 'day_of_month', 800, 200)

In [None]:
plot_sales_by_period(products[2], 'day_of_month', 800, 200)

In [None]:
plot_sales_by_period(products[3], 'day_of_month', 1200, 300)

In [None]:
plot_sales_by_period(products[4], 'day_of_month', 2500, 500)

In [None]:
plot_sales_by_period(products[5], 'day_of_month', 3000, 500)

The sales do not seem to vary as a function of the day of the month. At least, no clear pattern is distinguishable.

### plot by day of the week

In [None]:
plot_sales_by_period(products[1], 'day_of_week', 800, 200)

In [None]:
plot_sales_by_period(products[2], 'day_of_week', 800, 200)

In [None]:
plot_sales_by_period(products[3], 'day_of_week', 1200, 300)

In [None]:
plot_sales_by_period(products[4], 'day_of_week', 3000, 500)

In [None]:
plot_sales_by_period(products[5], 'day_of_week', 3500, 500)

There are clear differences between branches with respect to daily sales fluctuations across the week. In general, sales decrease towards the weekend for the branch located at the Metro and increase for the branch located in the City Center. The sales for the branch located at the Train Station appear to be largely constant across the days of the week.
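
The weekday pattern can also be summarized in a small table of mean turnover per weekday and branch. This is a sketch on invented values (an assumption), using the `day_of_week` convention from above (Monday=0, Sunday=6).

```python
import pandas as pd

# toy data (assumption): one Monday and one Saturday for two branches
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2021-10-11', '2021-10-16', '2021-10-11', '2021-10-16']),
    'Location': ['Metro', 'Metro', 'Center', 'Center'],
    'SoldTurnver': [900.0, 400.0, 200.0, 350.0],
})
toy['day_of_week'] = toy.Date.dt.dayofweek  # Monday=0 ... Sunday=6

# mean turnover per weekday (rows) and branch (columns)
weekday_means = toy.pivot_table(index='day_of_week', columns='Location',
                                values='SoldTurnver', aggfunc='mean')

# ratio of Saturday (5) to Monday (0) turnover per branch
sat_vs_mon = weekday_means.loc[5] / weekday_means.loc[0]
print(sat_vs_mon)
```

A ratio below 1 corresponds to sales that decrease towards the weekend, a ratio above 1 to sales that increase.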