## Meteo Bakery - Final dataframe
In this notebook, we engineer additional features and replace missing values in the original data to generate a final dataframe for the forecasting models.

### import libraries

In [None]:
# import modules
import numpy as np
import pandas as pd
import meteo_utils as meteo
from itertools import product

### load data

In [None]:
df = pd.read_csv('../data/data_combined.csv')
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df.head()

### transform periodic month feature using sine and cosine functions

In [None]:
df = meteo.transform_month(df, 'month')
df.head()
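
The implementation of `meteo.transform_month` is not shown in this notebook. As a rough sketch of what a sine/cosine encoding of the month typically looks like, the hypothetical helper below (`encode_month_cyclical` and the `_sin`/`_cos` column suffixes are assumptions, not the actual `meteo_utils` API) maps months 1 to 12 onto the unit circle so that December and January end up close together:

```python
import numpy as np
import pandas as pd

def encode_month_cyclical(df, col):
    """Hypothetical stand-in for meteo.transform_month (assumed behaviour)."""
    out = df.copy()
    # months 1..12 are spread evenly over one full turn of the unit circle
    angle = 2 * np.pi * (out[col] - 1) / 12
    out[f'{col}_sin'] = np.sin(angle)
    out[f'{col}_cos'] = np.cos(angle)
    return out

demo = encode_month_cyclical(pd.DataFrame({'month': [1, 4, 7, 12]}), 'month')
print(demo)
```

With this encoding, every month lies at distance 1 from the origin, and the distance between two months depends only on how far apart they are in the yearly cycle.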

### select only years before 2020

In [None]:
df = df[df.year<2020]

### generate lag features
We will use sales with lags of 7 and 365 days, since these lags showed peaks in the partial autocorrelation plots.

In [None]:
df = meteo.get_lag_features(df, ['branch', 'product'], 'turnover', [7, 365])
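
Since `meteo.get_lag_features` lives in a separate module, the sketch below only illustrates one plausible way to build such features: a shift within each branch/product group. The helper name `add_lag_features` is hypothetical; the `<column>_lag_<n>` naming is chosen to match the `turnover_lag_7` column used later in this notebook. Note that this version shifts by rows, which equals a shift by days only if every calendar day has a row in each group.

```python
import pandas as pd

def add_lag_features(df, group_cols, col, lags):
    """Hypothetical stand-in for meteo.get_lag_features (row-based lags)."""
    out = df.copy()
    grouped = out.groupby(group_cols)[col]
    for lag in lags:
        # shift within each group so values never leak across branches/products
        out[f'{col}_lag_{lag}'] = grouped.shift(lag)
    return out

toy = pd.DataFrame({
    'date': pd.date_range('2019-01-01', periods=4).tolist() * 2,
    'branch': ['Metro'] * 4 + ['Center'] * 4,
    'product': ['Brown Bread'] * 8,
    'turnover': [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0],
}).set_index('date')

toy = add_lag_features(toy, ['branch', 'product'], 'turnover', [1, 2])
print(toy)
```

The first rows of each group have no predecessor and therefore stay NaN, which is why the 365-day lag leaves the first year of each series without a lag value.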

### generate lead features for weather
We will further generate 1-day lead features for temperature, rain and snow.

In [None]:
df = meteo.get_lead_features(df, ['branch', 'product'], 'temp_mean', [1])
df = meteo.get_lead_features(df, ['branch', 'product'], 'rain_1h_mean', [1])
df = meteo.get_lead_features(df, ['branch', 'product'], 'snow_1h_mean', [1])
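
A lead feature is the mirror image of a lag: the value of the following day is pulled back onto the current row, here standing in for a next-day weather forecast. The sketch below assumes a group-wise negative shift; `add_lead_features` and the `_lead_` column suffix are hypothetical names, not the confirmed `meteo_utils` interface. It also shows how the repeated calls could be written as a loop over the weather columns.

```python
import pandas as pd

def add_lead_features(df, group_cols, col, leads):
    """Hypothetical stand-in for meteo.get_lead_features (row-based leads)."""
    out = df.copy()
    grouped = out.groupby(group_cols)[col]
    for lead in leads:
        # a negative shift moves future values up onto the current row
        out[f'{col}_lead_{lead}'] = grouped.shift(-lead)
    return out

weather = pd.DataFrame({
    'branch': ['Metro'] * 3,
    'product': ['Brown Bread'] * 3,
    'temp_mean': [5.0, 7.0, 9.0],
    'rain_1h_mean': [0.0, 0.4, 0.1],
})

for col in ['temp_mean', 'rain_1h_mean']:
    weather = add_lead_features(weather, ['branch', 'product'], col, [1])
print(weather)
```

The last row of each group has no following day, so its lead value is NaN.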

### check missing values in sales data

In [None]:
df.groupby(['branch', 'product'])[['turnover', 'month']].count()

In [None]:
df[(df['turnover'].isnull()) & (df['branch']=='Metro')]

4 missing days for the Metro branch. Additionally, there are no sales for Mischbrote on 16-10-2018 and 16-10-2019.

In [None]:
df[(df['turnover'].isnull()) & (df['branch']=='Train_Station')]

Train Station has exactly the same missing days as the Metro branch.

In [None]:
df[(df['turnover'].isnull()) & (df['branch']=='Center') & (df['product']=='Brown Bread')].head(25)

69 missing days for the Center branch. These frequently fall on a public holiday, indicating that this branch was probably closed on these days.

### replace missing values
Previous analyses showed that a couple of days are missing from the sales data. For the branches located at the Metro and Train Station, there is a total of 4 missing days each. By contrast, 69 days are missing for the Center branch in the years 2012-2019. They frequently fall on a public holiday, indicating that this branch was probably closed on these days.
We will first replace NaNs at the Center branch with 1 if they occur on a public holiday. Remaining NaNs will be replaced with the turnover of the same weekday in the preceding week; where that value is unavailable, a forward fill will be used.

In [None]:
df_repl = df.copy()

# replace NaN at Center branch with 1 if occurring on a public holiday
center_holiday = (df_repl['branch']=='Center') & (df_repl['public_holiday']==True)
df_repl.loc[center_holiday, 'turnover'] = df_repl.loc[center_holiday, 'turnover'].fillna(1)

# fill NaN with sales from the same weekday of the preceding week
df_repl['turnover'] = df_repl['turnover'].fillna(df_repl['turnover_lag_7'])

# fill remaining NaN in turnover using forward fill within each branch/product group
# (calling ffill(inplace=True) on a filtered selection only changes a temporary copy,
# so the result is assigned back; assumes rows are in chronological order per group)
df_repl['turnover'] = df_repl.groupby(['branch', 'product'])['turnover'].ffill()
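
To see how the three replacement steps interact, the following toy frame (made-up values, not taken from the sales data) applies them in the same order: holiday rule first, then the 7-day lag, then a forward fill restricted to each branch/product group. It is a minimal sketch under the assumption that rows are sorted chronologically within each group.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'branch': ['Center'] * 3 + ['Metro'] * 3,
    'product': ['Brown Bread'] * 6,
    'public_holiday': [False, True, False, False, False, False],
    'turnover': [50.0, np.nan, np.nan, np.nan, 30.0, np.nan],
    'turnover_lag_7': [np.nan, 40.0, np.nan, np.nan, np.nan, 28.0],
})

# Step 1: missing turnover at Center on a public holiday becomes 1
closed = (toy['branch'] == 'Center') & toy['public_holiday'] & toy['turnover'].isna()
toy.loc[closed, 'turnover'] = 1

# Step 2: remaining gaps take the value from the 7-day lag where available
toy['turnover'] = toy['turnover'].fillna(toy['turnover_lag_7'])

# Step 3: forward fill within each group; values never cross group boundaries
toy['turnover'] = toy.groupby(['branch', 'product'])['turnover'].ffill()
print(toy)
```

The first Metro row stays NaN because there is no earlier value within its own group to carry forward, whereas an ungrouped forward fill would have copied the last Center value into it.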

### double-check that replacing missing values worked as expected

In [None]:
df_repl.loc[(df_repl['branch']=='Metro') & (df_repl['product']=='Brown Bread'), ['branch', 'product', 'turnover', 'turnover_lag_7']].head(20)

In [None]:
df_repl.groupby(['branch', 'product'])[['turnover', 'turnover_lag_7', 'month']].count()

In [None]:
df_repl[(df_repl['public_holiday']==True) & (df_repl['branch']=='Center')]

In [None]:
df_repl.to_csv('../data/data_final.csv')