## Meteo Bakery - Combine datasets
This notebook combines the sales data with the weather summary statistics, holiday dates, and Covid-19 lockdown periods.

### import libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### load data

In [None]:
# load sales data
sales = pd.read_excel('../data/neueFische_Umsaetze_Baeckerei.xlsx')

In [None]:
# load data on engineered weather features
weather_stats = pd.read_csv('../data/summary_stats.csv')

In [None]:
# load holidays data

# school holidays from https://www.schulferien.org/oesterreich/ferien/2012/
school_hols = pd.read_excel("../data/school_holidays.xlsx")

# public holidays from google search "Feiertage Wien 'YEAR'"
public_hols = pd.read_excel("../data/public_holidays.xlsx")
public_hols['date'] = pd.to_datetime(public_hols['date'])

In [None]:
# load Corona data
corona = pd.read_excel("../data/corona-measures-vienna.xlsx")

### Feature Engineering - Sales

In [None]:
# get basic information on datatypes and missings
sales.info()

In [None]:
# replace the numeric branch ID with a location label
# Branch 1: U-Bahn (metro)
# Branch 2: Innenstadt (city center)
# Branch 3: Bahnhof (train station)

sales['Branch'] = sales.Branch.apply(lambda x: 'Metro' if x==1 else 'Center' if x==2 else 'Train_Station')
sales.head()

There are three missing values in the sales data ('SoldTurnver').

In [None]:
sales.columns

In [None]:
# rename columns
sales.rename(columns={'Branch': 'branch', 'PredictionGroupName': 'product', 'SoldTurnver': 'turnover'}, inplace=True)

In [None]:
sales.rename(columns={'Date': 'date'}, inplace=True)
sales.head()

In [None]:
#relabel products
sales['product'] = sales['product'].map({'Mischbrote':'Brown Bread',
                                'Weizenbrötchen':'Wheat Rolls',
                                'klassischer Kuchen':'Cakes',
                                'handliches Gebäck':'Pastries',
                                'herzhafter Snack':'Savoury Snack'})

In [None]:
# count dates per branch and product category
sales.groupby(['branch', 'product'])['date'].count()

As the counts show, the groups do not all cover the same number of dates. The dates are therefore not continuous: some of them are missing. Indeed, the first Covid-19 lockdown has already been removed from the data, which accounts for one gap and possibly not the only one.
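
One way to list the missing dates explicitly is to compare the observed dates with a complete daily range. The following is a minimal sketch on toy data; `toy_sales` and its values are made up for illustration and stand in for the real `sales` table.

```python
import pandas as pd

# Toy stand-in for the sales table (hypothetical values):
# 2 January and 4 January are missing
toy_sales = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-05']),
    'turnover': [10.0, 12.5, 9.0],
})

# every calendar day between the first and last observed date
full_range = pd.date_range(toy_sales['date'].min(), toy_sales['date'].max(), freq='D')

# days of the full range that never occur in the data
missing_dates = full_range.difference(pd.DatetimeIndex(toy_sales['date']))
print(missing_dates)
```

Applied to the real data, the same comparison could be run per branch and product to see whether the gaps differ between groups.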

### Generate a time series of consecutive dates as backbone
To make such gaps explicit, we first generate a datetime column of consecutive dates that starts with the first and ends with the last recorded date. The other data is then merged onto this continuous date column, so that dates without data show up as NaNs in the affected columns. These NaNs can be handled strategically during later analysis and modeling steps.

In [None]:
consec_dates = pd.DataFrame({'date':pd.date_range(sales.date.min(), sales.date.max())})

In [None]:
print(sales.date.nunique())
print(consec_dates.date.nunique())

In [None]:
# expected number of rows in the backbone: 3 branches x 5 products = 15 rows per date
consec_dates.date.nunique() * 15

### repeat the dates for each branch and product category

In [None]:
consec_dates[['Metro', 'Center', 'Train_Station']] = 'Metro', 'Center', 'Train_Station'

In [None]:
consec_dates.set_index('date', inplace=True)
consec_dates.head()

In [None]:
consec_dates = consec_dates.stack().reset_index(name='branch').drop(columns=['level_1'])
consec_dates.head()

In [None]:
products = sales['product'].unique()
consec_dates[products] = products

In [None]:
consec_dates.set_index(['date', 'branch'], inplace=True)
consec_dates.head()

In [None]:
consec_dates = consec_dates.stack().reset_index(name='product').drop(columns=['level_2'])

In [None]:
consec_dates.head()

In [None]:
consec_dates.shape
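
The same date x branch x product grid can also be built in a single step with `pd.MultiIndex.from_product`. This is an alternative sketch rather than the approach used in this notebook; the date range and the shortened product list are toy values chosen for illustration.

```python
import pandas as pd

dates = pd.date_range('2021-01-01', '2021-01-03', freq='D')  # toy date range
branches = ['Metro', 'Center', 'Train_Station']
products = ['Brown Bread', 'Wheat Rolls']                     # shortened list for illustration

# Cartesian product of all three levels, flattened into ordinary columns
backbone = (
    pd.MultiIndex.from_product([dates, branches, products],
                               names=['date', 'branch', 'product'])
    .to_frame(index=False)
)
print(backbone.shape)  # 3 dates * 3 branches * 2 products = 18 rows, 3 columns
```

Because every combination is generated exactly once, the row count can be checked directly against the product of the three list lengths.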

### Merge dataframes

#### merge sales into backbone

In [None]:
df_full = consec_dates.merge(sales, on=['date', 'branch', 'product'], how='left')

In [None]:
df_full.head()

In [None]:
df_full.shape

In [None]:
df_full.groupby(['branch', 'product'])['date'].count()

In [None]:
df_full.date.value_counts()
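
The checks above can be complemented by letting `merge` verify the join itself. The sketch below uses toy frames (hypothetical values) and the `validate` and `indicator` parameters of `DataFrame.merge`: `validate='one_to_one'` raises an error if a key combination occurs more than once on either side, and `indicator=True` adds a `_merge` column that marks backbone rows without a matching sales record.

```python
import pandas as pd

# toy backbone with three consecutive dates for one branch/product
backbone = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03']),
    'branch': 'Metro',
    'product': 'Cakes',
})
# toy sales with 2 January missing
toy_sales = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-03']),
    'branch': 'Metro',
    'product': 'Cakes',
    'turnover': [10.0, 9.0],
})

merged = backbone.merge(
    toy_sales,
    on=['date', 'branch', 'product'],
    how='left',
    validate='one_to_one',  # fail loudly on duplicate keys
    indicator=True,         # adds the '_merge' column
)

# backbone rows that found no sales record
n_unmatched = (merged['_merge'] == 'left_only').sum()
print(n_unmatched)
```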

### append additional time information

In [None]:
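# extract time features from the date column
df_full['year'] = df_full.date.dt.year
df_full['month'] = df_full.date.dt.month
# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_full['week'] = df_full.date.dt.isocalendar().week.astype(int)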
df_full['day_of_month'] = df_full.date.dt.day
df_full['day_of_week'] = df_full.date.dt.dayofweek
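
The week number follows the ISO 8601 calendar, so the first days of January can belong to week 52 or 53 of the previous ISO year. A small sketch with three hand-picked dates illustrates this; it matters when `year` and `week` are later used together as a grouping key.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2020-12-31', '2021-01-01', '2021-01-04']))

# isocalendar() returns a DataFrame with the columns year, week and day
iso = dates.dt.isocalendar()
print(iso)
# 1 January 2021 still belongs to ISO week 53 of ISO year 2020,
# while the calendar year from dates.dt.year is already 2021
```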

#### append holiday and Covid information

In [None]:
# append holidays by creating true/false columns
df_full["school_holiday"] = df_full["date"].isin(school_hols["date"])
df_full["public_holiday"] = df_full["date"].isin(public_hols["date"])
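
The `isin` lookup only matches when both sides hold comparable values. A cautious approach, sketched below on toy data, is to convert both sides to datetimes and drop any time-of-day component first. The frames `toy` and `toy_hols` are hypothetical, and the sketch assumes the holiday table lists one date per row.

```python
import pandas as pd

toy = pd.DataFrame({'date': pd.to_datetime(['2021-12-24', '2021-12-25', '2021-12-26'])})

# holiday dates given as strings, as they might arrive from a spreadsheet
toy_hols = pd.DataFrame({'date': ['2021-12-25', '2021-12-26']})

# parse to datetime and set the time to midnight on both sides
holiday_dates = pd.to_datetime(toy_hols['date']).dt.normalize()
toy['public_holiday'] = toy['date'].dt.normalize().isin(holiday_dates)
print(toy)
```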

In [None]:
# label each date with its lockdown status (start date inclusive, end date exclusive)
df_full["lock"] = 'open'
df_full.loc[(df_full.date >= pd.to_datetime("2020-03-10")) & (df_full.date < pd.to_datetime("2020-04-14")),"lock"] = "lockdown"
df_full.loc[(df_full.date >= pd.to_datetime("2020-11-03")) & (df_full.date < pd.to_datetime("2020-11-17")),"lock"] = "lockdown_light"
df_full.loc[(df_full.date >= pd.to_datetime("2020-11-17")) & (df_full.date < pd.to_datetime("2020-12-06")),"lock"] = "lockdown"
df_full.loc[(df_full.date >= pd.to_datetime("2020-12-26")) & (df_full.date < pd.to_datetime("2021-02-07")),"lock"] = "lockdown"
df_full.loc[(df_full.date >= pd.to_datetime("2021-04-01")) & (df_full.date < pd.to_datetime("2021-05-02")),"lock"] = "lockdown"
df_full.loc[(df_full.date >= pd.to_datetime("2021-11-08")) & (df_full.date < pd.to_datetime("2021-12-31")),"lock"] = "lockdown"
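
The repeated `.loc` assignments can also be written as a small table of periods plus a loop, which keeps the dates in one place. The sketch below reuses the intervals from the cell above; the helper `label_lockdowns` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# (start inclusive, end exclusive, label), copied from the cell above
lock_periods = [
    ('2020-03-10', '2020-04-14', 'lockdown'),
    ('2020-11-03', '2020-11-17', 'lockdown_light'),
    ('2020-11-17', '2020-12-06', 'lockdown'),
    ('2020-12-26', '2021-02-07', 'lockdown'),
    ('2021-04-01', '2021-05-02', 'lockdown'),
    ('2021-11-08', '2021-12-31', 'lockdown'),
]

def label_lockdowns(dates, periods, default='open'):
    """Return one label per date; dates outside all periods get `default`."""
    labels = pd.Series(default, index=dates.index, dtype='object')
    for start, end, label in periods:
        mask = (dates >= pd.Timestamp(start)) & (dates < pd.Timestamp(end))
        labels[mask] = label
    return labels

toy_dates = pd.Series(pd.to_datetime(
    ['2020-03-09', '2020-03-10', '2020-11-10', '2020-11-17', '2021-12-31']
))
toy_labels = label_lockdowns(toy_dates, lock_periods)
print(toy_labels.tolist())
```

On the real data this would be called as `df_full['lock'] = label_lockdowns(df_full['date'], lock_periods)`.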

### merge with weather statistics

In [None]:
weather_stats.info()

In [None]:
# parse date to datetime
weather_stats['date'] = pd.to_datetime(weather_stats['date'])

In [None]:
# merge dataframes
df_joined = df_full.merge(weather_stats, on='date', how='left')

In [None]:
df_joined.head(20)
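
Before exporting, it can be worth confirming that the weather merge neither duplicated rows nor left dates without weather data. The sketch below uses toy frames; the column `temp_mean` is a hypothetical name, not one taken from `summary_stats.csv`. `validate='many_to_one'` raises an error if a date occurs more than once in the weather table.

```python
import pandas as pd

# toy version of the combined table: several rows per date
toy_full = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-03']),
    'branch': ['Metro', 'Center', 'Metro', 'Metro'],
})
# toy weather table: one row per date, 3 January missing
toy_weather = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
    'temp_mean': [1.5, 3.0],  # hypothetical column name
})

joined = toy_full.merge(toy_weather, on='date', how='left', validate='many_to_one')

# dates that received no weather information
dates_without_weather = joined.loc[joined['temp_mean'].isna(), 'date'].unique()
print(len(joined), len(dates_without_weather))
```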

In [None]:
# export combined data to csv file
df_joined.to_csv('../data/data_combined.csv', index=False)