# <b>1 <span style='color:lightseagreen'>|</span> Introduction to Date and Time</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.1 | How to import data?</b></p>
</div>

First, we import all the datasets needed for this kernel. A time series column can be parsed as a datetime column with the **<span style='color:lightseagreen'>parse_dates</span>** parameter and set as the index of the dataframe with the **<span style='color:lightseagreen'>index_col</span>** parameter.
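As a minimal sketch of these two parameters, the example below reads a small in-memory CSV instead of a real file. The column names `date` and `num_sold` mirror the competition data, but the numbers are invented. Note that the loading cell further down takes a different route: it indexes on `row_id` and converts `date` with `pd.to_datetime` afterwards.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a real file (made-up values)
csv_text = """date,num_sold
2015-01-01,329
2015-01-02,318
2015-01-03,360
"""

# parse_dates converts the column to datetime64, index_col makes it the index
ts = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')

print(type(ts.index).__name__)
print(ts.loc[pd.Timestamp('2015-01-02'), 'num_sold'])
```

With a datetime index in place, rows can be selected directly by date, which is what most of the pandas time series tooling expects.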

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.2 | Timestamps and Periods</b></p>
</div>

Timestamps represent a single point in time, while Periods represent an interval of time. A Period can be used to check whether a specific event falls within that interval. The two can also be converted into each other.

📌 Video: [How to use dates and times with pandas](https://campus.datacamp.com/courses/manipulating-time-series-data-in-python/working-with-time-series-in-pandas?ex=1): explains **<span style='color:lightseagreen'>Timestamp and Period</span>** data.
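A short sketch of both objects may help. The dates are arbitrary and chosen only for illustration.

```python
import pandas as pd

ts = pd.Timestamp('2017-01-15 12:00')      # a point in time
period = pd.Period('2017-01', freq='M')    # the whole month of January 2017

# Check whether the timestamp falls inside the period
inside = period.start_time <= ts <= period.end_time

# Convert between the two representations
ts_as_period = ts.to_period('M')       # month containing the timestamp
period_as_ts = period.to_timestamp()   # start of the period by default

print(inside, ts_as_period, period_as_ts)
```

Converting a Timestamp to a Period discards the finer resolution, so the round trip returns the start of the month rather than the original timestamp.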

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.3 | Using date_range</b></p>
</div>

`date_range` is a pandas function that returns a **<span style='color:lightseagreen'>fixed-frequency DatetimeIndex</span>**. It is useful for creating your own time index for existing data, or for arranging the whole dataset around an index you created.
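For instance, a daily index can be attached to existing values, and a weekly index can be generated between two dates. This is only a sketch with made-up sales numbers.

```python
import pandas as pd

# Five consecutive days starting on 2015-01-01
idx = pd.date_range(start='2015-01-01', periods=5, freq='D')
sales = pd.Series([10, 12, 9, 14, 11], index=idx, name='num_sold')

# Weekly frequency between two dates ('W' anchors on Sundays by default)
weekly = pd.date_range(start='2015-01-01', end='2015-01-31', freq='W')

print(sales)
print(weekly)
```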

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>1.4 | Acknowledgements</b></p>
</div>

* Kaggle's [time series course](https://www.kaggle.com/learn/time-series).
* Many of [AmbrosM's](https://www.kaggle.com/ambrosm) great notebooks.
* This [notebook](https://www.kaggle.com/samuelcortinhas/tps-jan-22-quick-eda-hybrid-model) by [Samuel Cortinhas](https://www.kaggle.com/samuelcortinhas)

In [None]:
# Core
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style='darkgrid', font_scale=1.4)
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import combinations
import math
import statistics
import scipy.stats
from scipy.stats import pearsonr
import time
from datetime import datetime
import matplotlib.dates as mdates
import dateutil.easter as easter

# Sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, Ridge

# Models
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

# Tensorflow
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import callbacks

# Plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go

# Save to df
from pathlib import Path
comp_dir = Path('../input/tabular-playground-series-jan-2022')
train_data=pd.read_csv(comp_dir / 'train.csv', index_col='row_id')
test_data=pd.read_csv(comp_dir / 'test.csv', index_col='row_id')

In [None]:
# Shape and preview
print('Training data df shape:',train_data.shape)
print('Test data df shape:',test_data.shape)
train_data.head()

In [None]:
# Convert date to datetime
train_data.date=pd.to_datetime(train_data.date)
test_data.date=pd.to_datetime(test_data.date)

# drop 29th Feb
train_data.drop(train_data[(train_data.date.dt.month==2) & (train_data.date.dt.day==29)].index, axis=0, inplace=True)

# <b>2 <span style='color:lightseagreen'>|</span> Exploratory Data Analysis</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.1 | Sales by store</b></p>
</div>

Here we aggregate the daily sales of each store, together with the total sales per country and per product, and plot them in a single figure.

In [None]:
kr=train_data[train_data.store == 'KaggleRama'].groupby(['date','store']).agg(num_sold=('num_sold','sum')).reset_index()
km=train_data[train_data.store == 'KaggleMart'].groupby(['date','store']).agg(num_sold=('num_sold','sum')).reset_index()
sold_product = train_data.groupby('product').agg(num_sold=('num_sold','sum')).reset_index().sort_values(by='num_sold', ascending=False)
sold_country = train_data.groupby('country').agg(num_sold=('num_sold','sum')).reset_index().sort_values(by='num_sold', ascending=False)

# chart
fig = make_subplots(rows=2, cols=2, 
                    specs=[[{"type": "bar"}, {"type": "pie"}], [{"colspan": 2}, None]],
                    column_widths=[0.75, 0.25], vertical_spacing=0.1, horizontal_spacing=0.02,
                    subplot_titles=("Total Sales per Country", "Product Sales Percentage", "Sales per Store"))

fig.add_trace(go.Bar(x=sold_country['num_sold'], y=sold_country['country'], marker=dict(color=['lightseagreen','tomato','floralwhite']),
                     name='Country', orientation='h'), 
                     row=1, col=1)
fig.add_trace(go.Pie(values=sold_product['num_sold'], labels=sold_product['product'], name='Product',
                     marker=dict(colors=['lightseagreen','tomato','floralwhite']), hole=0, pull=[0, 0, 0],
                     hoverinfo='label+percent+value', textinfo='label'), 
                    row=1, col=2)

fig.update_traces(
marker=dict(
        line=dict(color='#303330',
                  width=2)
        ), 
    row = 1, col=2
)

fig.add_trace(go.Scatter(x=kr['date'],y=kr.num_sold,mode='lines',name='Kaggle Rama',marker=dict(color='lightseagreen')),row = 2, col = 1)
fig.add_trace(go.Scatter(x=km['date'],y=km.num_sold,mode='lines',name='Kaggle Mart', marker=dict(color='tomato')),row = 2, col = 1)
# styling
fig.update_xaxes(showgrid=False,row=1, col=1)
fig.update_yaxes(showgrid=False, ticksuffix=' ', categoryorder='total ascending', row=1, col=1)

fig.update_yaxes(showgrid=True,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 categoryorder='total ascending', showline=True, linewidth=2, linecolor='gray', row=2, col=1)
fig.update_xaxes(showgrid=False,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 categoryorder='total ascending', showline=True, linewidth=2, linecolor='gray', row=2, col=1)

fig.update_xaxes(visible=False, row=1, col=1)
fig.update_layout(height=750, bargap=0.2,
                  margin=dict(b=50,r=30,l=100), xaxis=dict(tickmode='linear'),
                  title_text="Sales Analysis",
                  #template="plotly_dark",
                  paper_bgcolor="#303330",
                  plot_bgcolor = "#303330",
                  title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=True)

fig.update_layout(legend=dict(
    orientation="h",
    yanchor="top",
    y=1.125,
    xanchor="right",
    x=1,
    #bgcolor="lightgray",
    #bordercolor="Black",
    font=dict(
        family="Lato, sans-serif",
        size=12
    #    color="black"
    ),
))

fig.update_layout(shapes=[dict(type="line",
                 xref='paper',
                 yref='paper',
                 x0=0.73,
                 y0=1.045,
                 x1=0.73,
                 y1=0.515,
                 line=dict(color="gray",
                           width=4),
                 ),
                 dict(type="line",
                 xref='paper',
                 yref='paper',
                 x0=0,
                 y0=0.515,
                 x1=1,
                 y1=0.515,
                 line=dict(color="gray",
                           width=4),
                 ),        
            ],
)

fig.add_layout_image(dict(
        source='https://miro.medium.com/max/837/1*tI-TWV--K05xbXUgA4Qm1w.png',
        x=0.225,
        y=1.0725,
        )
)

fig.update_layout_images(dict(
        xref="paper",
        yref="paper",
        sizex=0.075,
        sizey=0.075,
        xanchor="right",
        yanchor="bottom"
))


fig.show()

📌 **Interpret:** Kaggle products sell most in **<span style='color:lightseagreen'>Norway</span>**. **<span style='color:lightseagreen'>Kaggle Hat</span>** is the **<span style='color:lightseagreen'>best-selling</span>** product, while Kaggle Sticker has the fewest sales. In the bottom panel, both stores show a **<span style='color:lightseagreen'>strong yearly seasonality</span>** in sales. There are spikes throughout the recorded period, but most of them occur at the end of each year. **<span style='color:lightseagreen'>Kaggle Rama</span>** also has a **<span style='color:lightseagreen'>higher volume of sales</span>** than Kaggle Mart.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>2.2 | Product Sales Analysis</b></p>
</div>

In [None]:
fig = make_subplots(rows=2, cols=2, 
                    specs=[[{"rowspan": 2}, {"type":"scatter"}], [None, {"type":"scatter"}]],
                    column_widths=[0.35, 0.65], vertical_spacing=0.15, horizontal_spacing=0.1,
                    subplot_titles=("Product Sales per Store", "Product Sales in Kaggle Mart","Product Sales in Kaggle Rama"))

# Left Plot
kr = train_data[train_data.store=='KaggleRama'].groupby(['store','product']).agg(num_sold=('num_sold','sum')).reset_index()
km = train_data[train_data.store=='KaggleMart'].groupby(['store','product']).agg(num_sold=('num_sold','sum')).reset_index()

fig.add_trace(go.Bar(x=kr['product'], y=kr["num_sold"], name='Kaggle Rama',
    marker=dict(color=['tomato','tomato','tomato'])), row = 1, col = 1) 
fig.add_trace(go.Bar(x=km['product'], y=km["num_sold"], name = 'Kaggle Mart',
    marker=dict(color=['floralwhite','floralwhite','floralwhite'])), row = 1, col = 1)

fig.update_xaxes(showgrid=False,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 showline=True, linewidth=2, linecolor='gray', row=1, col=1)
fig.update_yaxes(showgrid=True,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 showline=True, linewidth=2, linecolor='gray', row=1, col=1)

# Right plot, row 2 (Kaggle Rama)
kr_2 = train_data[train_data.store=='KaggleRama'].groupby(['date','store','product']).agg(num_sold=('num_sold','sum')).reset_index()
kr_hat = kr_2[kr_2['product'] == 'Kaggle Hat']
kr_mug = kr_2[kr_2['product'] == 'Kaggle Mug']
kr_sticker = kr_2[kr_2['product'] == 'Kaggle Sticker']

fig.add_trace(go.Scatter(x=kr_hat['date'],y=kr_hat.num_sold,mode='lines',name='Kaggle Hat',marker=dict(color='lightseagreen')),row = 2, col = 2)              
fig.add_trace(go.Scatter(x=kr_mug['date'],y=kr_mug.num_sold,mode='lines',name='Kaggle Mug',marker=dict(color='tomato')),row = 2, col = 2)              
fig.add_trace(go.Scatter(x=kr_sticker['date'],y=kr_sticker.num_sold,mode='lines',name='Kaggle Sticker',marker=dict(color='floralwhite')),row = 2, col = 2)              

fig.update_yaxes(showgrid=True,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 categoryorder='total ascending', showline=True, linewidth=2, linecolor='gray', row=2, col=2)
fig.update_xaxes(showgrid=False,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 categoryorder='total ascending', showline=True, linewidth=2, linecolor='gray', row=2, col=2)

# Right plot, row 1 (Kaggle Mart)
km_2 = train_data[train_data.store=='KaggleMart'].groupby(['date','store','product']).agg(num_sold=('num_sold','sum')).reset_index()
km_hat = km_2[km_2['product'] == 'Kaggle Hat']
km_mug = km_2[km_2['product'] == 'Kaggle Mug']
km_sticker = km_2[km_2['product'] == 'Kaggle Sticker']

fig.add_trace(go.Scatter(x=km_hat['date'],y=km_hat.num_sold,mode='lines',name='Kaggle Hat',marker=dict(color='lightseagreen')),row = 1, col = 2)              
fig.add_trace(go.Scatter(x=km_mug['date'],y=km_mug.num_sold,mode='lines',name='Kaggle Mug',marker=dict(color='tomato')),row = 1, col = 2)              
fig.add_trace(go.Scatter(x=km_sticker['date'],y=km_sticker.num_sold,mode='lines',name='Kaggle Sticker',marker=dict(color='floralwhite')),row = 1, col = 2)              

fig.update_yaxes(showgrid=True,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 categoryorder='total ascending', showline=True, linewidth=2, linecolor='gray', row=1, col=2)
fig.update_xaxes(showgrid=False,gridcolor = 'gray', gridwidth = 0, ticksuffix=' ', 
                 categoryorder='total ascending', showline=True, linewidth=2, linecolor='gray', row=1, col=2)

              
# General Layout
fig.update_layout(height=700, bargap=0.2,
                  margin=dict(b=50,r=30,l=100), xaxis=dict(tickmode='linear'),
                  title_text="Product Sales Analysis",
                  #template="plotly_dark",
                  paper_bgcolor="#303330",
                  plot_bgcolor = "#303330",
                  title_font=dict(size=29, color='floralwhite', family="Lato, sans-serif"),
                  font=dict(color='floralwhite'), 
                  hoverlabel=dict(bgcolor="floralwhite", font_size=13, font_family="Lato, sans-serif"),
                  showlegend=True)

# Legend Layout
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="top",
    y=1.133,
    xanchor="right",
    x=1,
    #bgcolor="lightgray",
    #bordercolor="Black",
    font=dict(
        family="Lato, sans-serif",
        size=10
    #    color="black"
    ))
 )

fig.update_layout(shapes=[dict(type="line",
                 xref='paper',
                 yref='paper',
                 x0=0.36,
                 y0=1.06,
                 x1=0.36,
                 y1=0,
                 line=dict(color="gray",
                           width=4),
                 ),
                 dict(type="line",
                 xref='paper',
                 yref='paper',
                 x0=0.36,
                 y0=0.49,
                 x1=1,
                 y1=0.49,
                 line=dict(color="gray",
                           width=4),
                 ),   
                dict(type="line",
                 xref='paper',
                 yref='paper',
                 x0=-0.035,
                 y0=1.06,
                 x1=1,
                 y1=1.06,
                 line=dict(color="gray",
                           width=4),
                 ),
            ],
)



fig.show()

📌 **Interpret:** On the right, we can observe that both **<span style='color:lightseagreen'>Hat and Mug</span>** show strong yearly seasonal patterns, whereas the Sticker remains fairly constant. We can use **<span style='color:lightseagreen'>Fourier Features</span>** to model these patterns. On the left, we have a product sales comparison between both stores. As seen before, sales of every product are higher in Kaggle Rama.

# <b>3 <span style='color:lightseagreen'>|</span> Feature Engineering</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.1 | Holidays</b></p>
</div>

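We'll create features related to **<span style='color:lightseagreen'>national holidays</span>** in each country. For each country we build a temporary indicator column, then combine them into a single `holiday` column whose value is 1 if the date is a holiday in that row's country and 0 otherwise. Additional indicator columns mark the days around specific holidays (end of year, Easter, and so on).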

In [None]:
y=train_data.num_sold
X=train_data.drop('num_sold', axis=1)

# From https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/298990
def unofficial_hol(df):
    countries = {'Finland': 1, 'Norway': 2, 'Sweden': 3}
    stores = {'KaggleMart': 1, 'KaggleRama': 2}
    products = {'Kaggle Mug': 1,'Kaggle Hat': 2, 'Kaggle Sticker': 3}
    
    # load holiday info.
    hol_path = '../input/public-and-unofficial-holidays-nor-fin-swe-201519/holidays.csv'
    holiday = pd.read_csv(hol_path)
    
    fin_holiday = holiday.loc[holiday.country == 'Finland']
    swe_holiday = holiday.loc[holiday.country == 'Sweden']
    nor_holiday = holiday.loc[holiday.country == 'Norway']
    # flag each date with 0 or 1 depending on whether it is a holiday in each country
    df['fin holiday'] = df.date.isin(fin_holiday.date).astype(int)
    df['swe holiday'] = df.date.isin(swe_holiday.date).astype(int)
    df['nor holiday'] = df.date.isin(nor_holiday.date).astype(int)
    # create the column that says whether a day is a holiday, taking into account the country of that same row
    df['holiday'] = np.zeros(df.shape[0]).astype(int)
    df.loc[df.country == 'Finland', 'holiday'] = df.loc[df.country == 'Finland', 'fin holiday']
    df.loc[df.country == 'Sweden', 'holiday'] = df.loc[df.country == 'Sweden', 'swe holiday']
    df.loc[df.country == 'Norway', 'holiday'] = df.loc[df.country == 'Norway', 'nor holiday']
    df.drop(['fin holiday', 'swe holiday', 'nor holiday'], axis=1, inplace=True)
    
    return df

# Holidays from AmbrosM
def get_holidays(df):
    # End of year
    df = pd.concat([df, pd.DataFrame({f"dec{d}":
                      (df.date.dt.month == 12) & (df.date.dt.day == d)
                      for d in range(24, 32)}),
        pd.DataFrame({f"n-dec{d}":
                      (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in range(24, 32)}),
        pd.DataFrame({f"f-jan{d}":
                      (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Finland')
                      for d in range(1, 14)}),
        pd.DataFrame({f"jan{d}":
                      (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in range(1, 10)}),
        pd.DataFrame({f"s-jan{d}":
                      (df.date.dt.month == 1) & (df.date.dt.day == d) & (df.country == 'Sweden')
                      for d in range(1, 15)})], axis=1)
    
    # May
    df = pd.concat([df, pd.DataFrame({f"may{d}":
                      (df.date.dt.month == 5) & (df.date.dt.day == d) 
                      for d in list(range(1, 10))}),
        pd.DataFrame({f"may{d}":
                      (df.date.dt.month == 5) & (df.date.dt.day == d) & (df.country == 'Norway')
                      for d in list(range(19, 26))})], axis=1)
    
    # June and July
    df = pd.concat([df, pd.DataFrame({f"june{d}":
                   (df.date.dt.month == 6) & (df.date.dt.day == d) & (df.country == 'Sweden')
                   for d in list(range(8, 14))})], axis=1)
    
    #Swedish Rock Concert
    #Jun 3, 2015 – Jun 6, 2015
    #Jun 8, 2016 – Jun 11, 2016
    #Jun 7, 2017 – Jun 10, 2017
    #Jun 6, 2018 – Jun 10, 2018
    #Jun 5, 2019 – Jun 8, 2019
    swed_rock_fest  = df.date.dt.year.map({2015: pd.Timestamp(('2015-06-6')),
                                         2016: pd.Timestamp(('2016-06-11')),
                                         2017: pd.Timestamp(('2017-06-10')),
                                         2018: pd.Timestamp(('2018-06-10')),
                                         2019: pd.Timestamp(('2019-06-8'))})

    df = pd.concat([df, pd.DataFrame({f"swed_rock_fest{d}":
                                      (df.date - swed_rock_fest == np.timedelta64(d, "D")) & (df.country == 'Sweden')
                                      for d in list(range(-3, 3))})], axis=1)
    
    # Last Wednesday of June
    wed_june_date = df.date.dt.year.map({2015: pd.Timestamp(('2015-06-24')),
                                         2016: pd.Timestamp(('2016-06-29')),
                                         2017: pd.Timestamp(('2017-06-28')),
                                         2018: pd.Timestamp(('2018-06-27')),
                                         2019: pd.Timestamp(('2019-06-26'))})
    
    df = pd.concat([df, pd.DataFrame({f"wed_june{d}": 
                   (df.date - wed_june_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
                   for d in list(range(-4, 6))})], axis=1)
    
    # First Sunday of November
    sun_nov_date = df.date.dt.year.map({2015: pd.Timestamp(('2015-11-1')),
                                         2016: pd.Timestamp(('2016-11-6')),
                                         2017: pd.Timestamp(('2017-11-5')),
                                         2018: pd.Timestamp(('2018-11-4')),
                                         2019: pd.Timestamp(('2019-11-3'))})
    
    df = pd.concat([df, pd.DataFrame({f"sun_nov{d}": 
                   (df.date - sun_nov_date == np.timedelta64(d, "D")) & (df.country != 'Norway')
                   for d in list(range(0, 9))})], axis=1)
    
    # First half of December (Independence Day of Finland, 6th of December)
    df = pd.concat([df, pd.DataFrame({f"dec{d}":
                   (df.date.dt.month == 12) & (df.date.dt.day == d) & (df.country == 'Finland')
                   for d in list(range(6, 14))})], axis=1)

    # Easter
    easter_date = df.date.apply(lambda date: pd.Timestamp(easter.easter(date.year)))
    df = pd.concat([df, pd.DataFrame({f"easter{d}":
                   (df.date - easter_date == np.timedelta64(d, "D"))
                   for d in list(range(-2, 11)) + list(range(40, 48)) + list(range(50, 59))})], axis=1)
    
    return df

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.2 | Date Features</b></p>
</div>

We will break down the date into different columns:

* One for the **<span style='color:lightseagreen'>year</span>**
* One for the **<span style='color:lightseagreen'>month</span>**
* One for the **<span style='color:lightseagreen'>week</span>**
* One for the **<span style='color:lightseagreen'>day of week</span>**
* One for the **<span style='color:lightseagreen'>day of month</span>**
* One for the **<span style='color:lightseagreen'>day of year</span>**

In [None]:
def date_feat_eng_X1(df):
    df['year']=df['date'].dt.year                   # 2015 to 2019
    return df

def date_feat_eng_X2(df):
    df['day_of_week']=df['date'].dt.dayofweek       # 0 to 6
    df['day_of_month']=df['date'].dt.day            # 1 to 31
    df['dayofyear'] = df['date'].dt.dayofyear       # 1 to 366
    df.loc[(df.date.dt.year==2016) & (df.dayofyear>60), 'dayofyear'] -= 1   # 1 to 365
    df['week']=df['date'].dt.isocalendar().week     # 1 to 53
    df['week']=df['week'].astype('int')             # int64
    df['month']=df['date'].dt.month                 # 1 to 12
    return df
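To see what these accessors produce, here is a small sketch on four hand-picked dates. The frame `feats` is hypothetical and only for illustration. It also shows one way to shift the day of year in leap years so that 1 March lands on day 60 in every year; the functions above do this for 2016 only.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2015-03-01', '2016-02-28',
                                  '2016-03-01', '2016-12-31']))

feats = pd.DataFrame({
    'year': dates.dt.year,
    'month': dates.dt.month,
    'week': dates.dt.isocalendar().week.astype('int'),  # ISO week number
    'day_of_week': dates.dt.dayofweek,                  # Monday=0 ... Sunday=6
    'day_of_month': dates.dt.day,
    'dayofyear': dates.dt.dayofyear,
})

# In leap years, shift every day after 29 Feb back by one
after_leap_day = dates.dt.is_leap_year & (feats['dayofyear'] > 60)
feats.loc[after_leap_day, 'dayofyear'] -= 1

print(feats)
```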

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.3 | Gross Domestic Product (GDP)</b></p>
</div>

GDP measures the **<span style='color:lightseagreen'>monetary value</span>** of final goods and services — that is, those that are bought by the **<span style='color:lightseagreen'>final user</span>** — produced in a country in a given period of time (say a quarter or a year). It counts all of the output generated within the borders of a country. GDP is composed of **<span style='color:lightseagreen'>goods and services</span>** produced for sale in the market and also includes some nonmarket production, such as defense or education services provided by the government.

In [None]:
def get_GDP(df):
    GDP_data = pd.read_csv("../input/gdp-20152019-finland-norway-and-sweden/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv",index_col="year")
    # Rename the columns in GDP df 
    GDP_data.columns = ['Finland', 'Norway', 'Sweden']
    # Create a dictionary
    GDP_dictionary = GDP_data.unstack().to_dict()
    # Create GDP column
    df['GDP']=df.set_index(['country', 'year']).index.map(GDP_dictionary.get)
    # Log transform (only if the target is log-transformed too)
    df['GDP']=np.log(df['GDP'])
    # Split GDP by country (for linear model)
    df['GDP_Finland']=df['GDP'] * (df['country']=='Finland')
    df['GDP_Norway']=df['GDP'] * (df['country']=='Norway')
    df['GDP_Sweden']=df['GDP'] * (df['country']=='Sweden')
    df=df.drop(['GDP','year'],axis=1)
    return df
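The lookup in `get_GDP` relies on how `unstack().to_dict()` keys the values. The sketch below uses a tiny table with made-up numbers (not real GDP figures) to show that the keys are `(country, year)` tuples, which is why the rows are indexed by `['country', 'year']` before mapping.

```python
import pandas as pd

# Made-up table shaped like the GDP file: index = year, columns = country
gdp = pd.DataFrame({'Finland': [1.0, 2.0], 'Norway': [3.0, 4.0]},
                   index=pd.Index([2015, 2016], name='year'))

# unstack() yields a Series indexed by (country, year)
lookup = gdp.unstack().to_dict()

rows = pd.DataFrame({'country': ['Norway', 'Finland', 'Norway'],
                     'year': [2015, 2016, 2016]})

# Each (country, year) pair of a row is looked up in the dictionary
rows['GDP'] = rows.set_index(['country', 'year']).index.map(lookup.get)

print(rows)
```

Pairs missing from the dictionary come back as missing values, because `dict.get` returns `None` for unknown keys.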

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.4 | Gross Domestic Product Per Capita (GDP per Capita)</b></p>
</div>

Gross Domestic Product (GDP) per capita shows a **<span style='color:lightseagreen'>country's GDP divided by its total population</span>**.

In [None]:
def GDP_PC(df):
    GDP_PC_data = pd.read_csv("../input/gdp-per-capita-finland-norway-sweden-201519/GDP_per_capita_2015_to_2019_Finland_Norway_Sweden.csv",index_col="year")
    # Create a dictionary
    GDP_PC_dictionary = GDP_PC_data.unstack().to_dict()
    # Create new GDP_PC column
    df['GDP_PC'] = df.set_index(['country', 'year']).index.map(GDP_PC_dictionary.get)
    return df

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.5 | GDP vs GDP per capita</b></p>
</div>

In [None]:
def GDP_corr(df):
    GDP_data = pd.read_csv("../input/gdp-20152019-finland-norway-and-sweden/GDP_data_2015_to_2019_Finland_Norway_Sweden.csv",index_col="year")
    GDP_PC_data = pd.read_csv("../input/gdp-per-capita-finland-norway-sweden-201519/GDP_per_capita_2015_to_2019_Finland_Norway_Sweden.csv",index_col="year")
    GDP_data.columns = ['Finland', 'Norway', 'Sweden']
    # Create dictionary
    GDP_dictionary = GDP_data.unstack().to_dict()
    GDP_PC_dictionary = GDP_PC_data.unstack().to_dict()
    df['year']=df.date.dt.year
    # Make new column
    df['GDP']=df.set_index(['country', 'year']).index.map(GDP_dictionary.get)
    df['GDP_PC'] = df.set_index(['country', 'year']).index.map(GDP_PC_dictionary.get)
    # Initialise output
    feat_corr=[]
    # Compute pairwise correlations
    for SS in ['KaggleMart', 'KaggleRama']:
        for CC in ['Finland', 'Norway', 'Sweden']:
            for PP in ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']:
                subset=df[(df.store==SS)&(df.country==CC)&(df['product']==PP)].groupby(['year']).agg(num_sold=('num_sold','sum'), GDP=('GDP','mean'), GDP_PC=('GDP_PC','mean'))
                v1=subset.num_sold
                v2=subset.GDP
                v3=subset.GDP_PC
                
                r1, _ = pearsonr(v1,v2)
                r2, _ = pearsonr(v1,v3)
                
                feat_corr.append([f'{SS}, {CC}, {PP}', r1, r2])

    return pd.DataFrame(feat_corr, columns=['Features', 'GDP_corr', 'GDP_PC_corr'])
    
corr_df=GDP_corr(train_data)
corr_df.head()

📌 **Interpret:** In general, both GDP and GDP_PC are **<span style='color:lightseagreen'>highly correlated</span>** with the yearly aggregate of num_sold. GDP tends to have a **<span style='color:lightseagreen'>slightly higher</span>** correlation than GDP_PC.

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.6 | Fourier Features</b></p>
</div>

The features we look at now are better suited to long seasons spanning many observations, where indicator features would be impractical. Fourier features try to capture the overall shape of the seasonal curve with just a few features. They are pairs of sine and cosine curves, one pair for each potential frequency in the season, starting with the longest. Fourier pairs modeling annual seasonality would have frequencies of once per year, twice per year, three times per year, and so on.
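The idea can be sketched on its own, independently of the product-specific version used in this notebook. The helper name `fourier_pairs` and the fixed period of 365 days are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

def fourier_pairs(dates, order, period=365.0):
    """Return sin/cos pairs for 1..order cycles per period (hypothetical helper)."""
    day = dates.dayofyear.to_numpy()
    feats = {}
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * day / period
        feats[f'sin{k}'] = np.sin(angle)   # k cycles per year
        feats[f'cos{k}'] = np.cos(angle)
    return pd.DataFrame(feats, index=dates)

dates = pd.date_range('2015-01-01', '2015-12-31', freq='D')
ff = fourier_pairs(dates, order=2)
print(ff.head())
```

A linear model fitted on these columns learns one weight per curve, so even a small `order` can approximate a smooth yearly pattern without one indicator per day.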

In [None]:
# From https://www.kaggle.com/ambrosm/tpsjan22-03-linear-model#Simple-feature-engineering-(without-holidays)
def FourierFeatures(df):
    # temporary one hot encoding
    for product in ['Kaggle Mug', 'Kaggle Hat']:
        df[product] = df['product'] == product
    
    # The three products have different seasonal patterns
    dayofyear = df.date.dt.dayofyear
    for k in range(1, 2):
        df[f'sin{k}'] = np.sin(dayofyear / 365 * 2 * math.pi * k)
        df[f'cos{k}'] = np.cos(dayofyear / 365 * 2 * math.pi * k)
        df[f'mug_sin{k}'] = df[f'sin{k}'] * df['Kaggle Mug']
        df[f'mug_cos{k}'] = df[f'cos{k}'] * df['Kaggle Mug']
        df[f'hat_sin{k}'] = df[f'sin{k}'] * df['Kaggle Hat']
        df[f'hat_cos{k}'] = df[f'cos{k}'] * df['Kaggle Hat']
        df=df.drop([f'sin{k}', f'cos{k}'], axis=1)
    
    # drop temporary one hot encoding
    df=df.drop(['Kaggle Mug', 'Kaggle Hat'], axis=1)
    
    return df

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.7 | Interactions</b></p>
</div>

These interaction features help the linear model find the right **<span style='color:lightseagreen'>height</span>** of the trend for each combination of store, country and product.

In [None]:
def get_interactions(df):
    # One indicator column per store/country/product combination,
    # named e.g. 'KR_Sweden_Mug' (KaggleRama) or 'KM_Finland_Sticker' (KaggleMart)
    for prefix, store in [('KR', 'KaggleRama'), ('KM', 'KaggleMart')]:
        for country in ['Sweden', 'Norway', 'Finland']:
            for item in ['Mug', 'Hat', 'Sticker']:
                df[f'{prefix}_{country}_{item}'] = (
                    (df.country == country)
                    & (df['product'] == f'Kaggle {item}')
                    & (df.store == store)
                )
    return df
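
To see why such indicators help, here is a minimal sketch on hypothetical toy data (the level values and the `KR_Hat` column are made up for illustration, and `np.linalg.lstsq` stands in for the linear model). With only additive store and product dummies, a least-squares fit cannot reproduce a level that is specific to one combination; adding a single interaction indicator lets it match every combination.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stores x two products, five repeats each
toy = pd.DataFrame({
    'store':   ['KaggleMart', 'KaggleMart', 'KaggleRama', 'KaggleRama'] * 5,
    'product': ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Mug', 'Kaggle Hat'] * 5,
})

# Made-up levels: KaggleRama hats sit higher than any additive model allows
levels = {('KaggleMart', 'Kaggle Mug'): 1.0, ('KaggleMart', 'Kaggle Hat'): 2.0,
          ('KaggleRama', 'Kaggle Mug'): 3.0, ('KaggleRama', 'Kaggle Hat'): 7.0}
target = np.array([levels[key] for key in zip(toy['store'], toy['product'])])

def max_abs_error(features):
    # Least-squares fit with an intercept, then the worst absolute error
    A = np.column_stack([np.ones(len(features)), features.to_numpy(dtype=float)])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.abs(A @ coef - target).max()

# Additive features only (one-hot store and product)
X_add = pd.get_dummies(toy, columns=['store', 'product'])

# Same features plus one interaction indicator
X_int = X_add.copy()
X_int['KR_Hat'] = (toy['store'] == 'KaggleRama') & (toy['product'] == 'Kaggle Hat')

err_add = max_abs_error(X_add)
err_int = max_abs_error(X_int)
print(f'additive only: {err_add:.3f}, with interaction: {err_int:.3f}')
```

In this toy setup the additive fit is off by 0.75 on every row, while the fit with the interaction column is exact up to floating-point error.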

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>3.8 | Feature Transformation</b></p>
</div>

We drop the **<span style='color:lightseagreen'>date</span>** column and apply one-hot encoding to the categorical columns.

In [None]:
def dropdate(df):
    df=df.drop('date',axis=1)
    return df

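def onehot(df,columns):
    # Pass the list by keyword: the second positional argument of get_dummies is `prefix`, not `columns`
    df=pd.get_dummies(df, columns=columns)
    return df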
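
As a small illustration of these two steps, the sketch below drops the date column from a made-up two-row frame and one-hot encodes two categorical columns. The frame and its values are hypothetical; only `DataFrame.drop` and `pd.get_dummies` come from the code above.

```python
import pandas as pd

# Hypothetical two-row frame with the same kind of columns as the training data
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-01-02']),
    'country': ['Sweden', 'Norway'],
    'store': ['KaggleMart', 'KaggleRama'],
    'year': [2015, 2015],
})

# Drop the raw date, then expand each listed column into one indicator per category
encoded = pd.get_dummies(toy.drop('date', axis=1), columns=['store', 'country'])
print(encoded.columns.tolist())
```

Columns that are not listed in `columns` (here `year`) pass through unchanged, and each new column is named `<column>_<category>`.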

# <b>4 <span style='color:lightseagreen'>|</span> Modeling</b>

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.1 | Hybrid Model Class</b></p>
</div>

In [None]:
# Feature set for trend model
def FeatEng_X1(df):
    df=date_feat_eng_X1(df)
    df=get_GDP(df)
    df=FourierFeatures(df)
    df=get_interactions(df)
    df=dropdate(df)
    df=onehot(df,['store', 'product', 'country'])
    return df

# Feature set for model 2 (fitted on the residuals of the trend model)
def FeatEng_X2(df):
    df=date_feat_eng_X2(df)
    df=unofficial_hol(df)
    df=get_holidays(df)
    df=dropdate(df)
    df=onehot(df,['store', 'product', 'country'])
    return df

# Apply feature engineering
X_train_1=FeatEng_X1(X)
X_train_2=FeatEng_X2(X)
X_test_1=FeatEng_X1(test_data)
X_test_2=FeatEng_X2(test_data)

In [None]:
# A class is a collection of properties and methods (like models from Sklearn)
class HybridModel:
    def __init__(self, model_1, model_2, grid=None):
        self.model_1 = model_1
        self.model_2 = model_2
        self.grid=grid
        
    def fit(self, X_train_1, X_train_2, y):
        # Train model 1
        self.model_1.fit(X_train_1, y)
        
        # Predictions from model 1 (trend)
        y_trend = self.model_1.predict(X_train_1)

        if self.grid:
            # Grid search
            tscv = TimeSeriesSplit(n_splits=3)
            grid_model = GridSearchCV(estimator=self.model_2, cv=tscv, param_grid=self.grid, n_jobs=-1)
        
            # Train model 2 on detrended series
            grid_model.fit(X_train_2, y-y_trend)
            
            # Model 2 predictions (for residual analysis)
            y_resid = grid_model.predict(X_train_2)
            
            # Save model
            self.grid_model=grid_model
        else:
            # Train model 2 on residuals
            self.model_2.fit(X_train_2, y-y_trend)
            
            # Model 2 predictions (for residual analysis)
            y_resid = self.model_2.predict(X_train_2)
        
        # Save data
        self.y_train_trend = y_trend
        self.y_train_resid = y_resid
        
    def predict(self, X_test_1, X_test_2):
        # Predict trend using model 1
        y_trend = self.model_1.predict(X_test_1)
        
        if self.grid:
            # Grid model predictions
            y_resid = self.grid_model.predict(X_test_2)
        else:
            # Model 2 predictions
            y_resid = self.model_2.predict(X_test_2)
        
        # Add predictions together
        y_pred = y_trend + y_resid
        
        # Save data
        self.y_test_trend = y_trend
        self.y_test_resid = y_resid
        
        return y_pred
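
The idea behind the class can be sketched on synthetic data: a linear model captures the trend, and a second model is fitted to what is left over. In this sketch the second stage is a plain per-weekday mean of the residuals, a deliberately simple stand-in for the gradient-boosted models used in this notebook. The data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily series: linear trend + weekly pattern + small noise
t = np.arange(140)
weekday = t % 7
weekly_effect = np.array([0.0, 0.1, 0.0, -0.1, 0.2, 0.6, 0.5])
y = 2.0 + 0.01 * t + weekly_effect[weekday] + rng.normal(0, 0.02, size=t.size)

# Stage 1: linear trend fitted by least squares
A = np.column_stack([np.ones_like(t, dtype=float), t])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_trend = A @ coef

# Stage 2: model the detrended series (here: mean residual per weekday)
resid = y - y_trend
resid_by_weekday = np.array([resid[weekday == d].mean() for d in range(7)])
y_resid = resid_by_weekday[weekday]

# Final prediction is the sum of both stages
y_hat = y_trend + y_resid

rmse_trend = np.sqrt(np.mean((y - y_trend) ** 2))
rmse_hybrid = np.sqrt(np.mean((y - y_hat) ** 2))
print(f'RMSE trend only: {rmse_trend:.3f}, trend + residual model: {rmse_hybrid:.3f}')
```

The second stage only ever sees `y - y_trend`, which mirrors how `fit` trains model 2 on the detrended target.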

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.2 | Hyperparameter Tuning - Ensemble Modeling</b></p>
</div>

In [None]:
# Choose models
model_1=LinearRegression()
models_2=[LGBMRegressor(random_state=0), CatBoostRegressor(random_state=0, verbose=False), XGBRegressor(random_state=0)]

# Parameter grid
param_grid = {'n_estimators': [100, 150, 200, 225, 250, 275, 300],
        'max_depth': [4, 5, 6, 7],
        'learning_rate': [0.1, 0.12, 0.13, 0.14, 0.15]}

# Initialise output vectors
y_pred=np.zeros(len(test_data))
train_preds=np.zeros(len(y))

# Ensemble predictions
for model_2 in models_2:
    # Start timer
    start = time.time()
    
    # Construct hybrid model
    model = HybridModel(model_1, model_2, grid=param_grid)

    # Train model
    model.fit(X_train_1, X_train_2, np.log(y))

    # Save predictions
    y_pred += np.exp(model.predict(X_test_1,X_test_2))
    
    # Training set predictions (for residual analysis)
    train_preds += np.exp(model.y_train_trend+model.y_train_resid)
    
    # Stop timer
    stop = time.time()
    
    print(f'Model_2:{model_2} -- time:{round((stop-start)/60,2)} mins')
    
    if model.grid:
        print('Best parameters:',model.grid_model.best_params_,'\n')
    
# Average over the ensemble members
y_pred = y_pred/len(models_2)
train_preds = train_preds/len(models_2)
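
Because each hybrid model is trained on `np.log(y)`, the loop above applies `np.exp` to every model's output before summing, so the ensemble is an arithmetic mean on the original scale. The sketch below uses made-up log-scale predictions to show how this differs from averaging in log space, which would give a geometric mean instead.

```python
import numpy as np

# Hypothetical log-scale predictions from three models for two test rows
log_preds = np.log(np.array([[100.0, 400.0],
                             [110.0, 360.0],
                             [ 90.0, 500.0]]))

# As in the loop above: back-transform first, then average (arithmetic mean)
arithmetic = np.exp(log_preds).sum(axis=0) / len(log_preds)

# Alternative: average in log space, then back-transform (geometric mean)
geometric = np.exp(log_preds.mean(axis=0))

print(arithmetic)
print(geometric)
```

The geometric mean never exceeds the arithmetic mean, so the two choices can give slightly different ensemble predictions when the models disagree.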

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.3 | Post Processing</b></p>
</div>

In [None]:
# From https://www.kaggle.com/fergusfindley/ensembling-and-rounding-techniques-comparison
def geometric_round(arr):
    result_array = arr
    result_array = np.where(result_array < np.sqrt(np.floor(arr)*np.ceil(arr)), np.floor(arr), result_array)
    result_array = np.where(result_array >= np.sqrt(np.floor(arr)*np.ceil(arr)), np.ceil(arr), result_array)
    return result_array

y_pred=geometric_round(y_pred)

# Save predictions to file
output = pd.DataFrame({'row_id': test_data.index, 'num_sold': y_pred})

# Check format
output.head()
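
To make the rounding rule concrete: a value between two consecutive integers is rounded up only if it reaches their geometric mean, $\sqrt{\lfloor x \rfloor \lceil x \rceil}$, which lies slightly below the midpoint. The sketch below is a compact restatement of `geometric_round` for non-negative inputs, written for this illustration with made-up values.

```python
import numpy as np

def geometric_round_sketch(arr):
    # Round up when the value reaches the geometric mean of its neighbouring integers
    arr = np.asarray(arr, dtype=float)
    lower, upper = np.floor(arr), np.ceil(arr)
    return np.where(arr >= np.sqrt(lower * upper), upper, lower)

values = np.array([2.40, 2.45, 2.50, 99.49, 99.50, 7.00])
rounded = geometric_round_sketch(values)
print(rounded)  # [  2.   3.   3.  99. 100.   7.]
```

For example, the threshold between 2 and 3 is $\sqrt{6} \approx 2.449$, so 2.45 already rounds up to 3, whereas ordinary rounding would give 2.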

In [None]:
output.to_csv('submission.csv', index=False)

<div style="color:white;display:fill;border-radius:8px;
            background-color:#323232;font-size:150%;
            font-family:Nexa;letter-spacing:0.5px">
    <p style="padding: 8px;color:white;"><b>4.4 | Predicted Data Visualization</b></p>
</div>

In [None]:
def plot_predictions(SS, CC, PP, series=output):
    '''
    SS=store
    CC=country
    PP=product
    '''
    
    # uncomment if your dataframes have different names
    #train_data=train_df
    #test_data=test_df
    
    # Training set target
    train_subset=train_data[(train_data.store==SS)&(train_data.country==CC)&(train_data['product']==PP)]
    
    # Predictions
    plot_index=test_data[(test_data.store==SS)&(test_data.country==CC)&(test_data['product']==PP)].index
    pred_subset=series[series.row_id.isin(plot_index)].reset_index(drop=True)
    
    # Plot
    plt.figure(figsize=(12,5))
    n1=len(train_subset['num_sold'])
    n2=len(pred_subset['num_sold'])
    plt.plot(np.arange(n1),train_subset['num_sold'], label='Training')
    plt.plot(np.arange(n1,n1+n2),pred_subset['num_sold'], label='Predictions')
    plt.title('\n'+f'Store:{SS}, Country:{CC}, Product:{PP}')
    plt.legend()
    plt.xlabel('Days since 2015-01-01')
    plt.ylabel('num_sold')

**Plot trends**

In [None]:
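# Put the test-set components of the last fitted hybrid model into dataframes
# (note: this overwrites the ensemble-averaged y_pred array)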
y_trend=pd.DataFrame({'row_id': test_data.index, 'num_sold': np.exp(model.y_test_trend)})
y_resid=pd.DataFrame({'row_id': test_data.index, 'num_sold': np.exp(model.y_test_resid)})
y_pred=pd.DataFrame({'row_id': test_data.index, 'num_sold': np.exp(model.y_test_trend+model.y_test_resid)})

# Choose parameters
SS='KaggleMart'
CC='Norway'

# Plot trends (model 1 predictions)
plot_predictions(SS, CC, 'Kaggle Hat', series=y_trend)
plot_predictions(SS, CC, 'Kaggle Mug', series=y_trend)
plot_predictions(SS, CC, 'Kaggle Sticker', series=y_trend)

**All predictions**

In [None]:
for SS in ['KaggleMart','KaggleRama']:
    for CC in ['Finland', 'Norway', 'Sweden']:
        for PP in ['Kaggle Mug', 'Kaggle Hat', 'Kaggle Sticker']:
            plot_predictions(SS, CC, PP)

# Residual Analysis

**Plot residuals**

In [None]:
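# train_preds already holds the ensemble-averaged training predictions
# (accumulated in the ensemble loop and divided by len(models_2)),
# so it is not recomputed from the last fitted model here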

# Residuals on training set (SMAPE)
residuals = 200 * (train_preds - y) / (train_preds + y)

# Plot residuals
plt.figure(figsize=(12,4))
plt.scatter(np.arange(len(residuals)),residuals, s=1)
plt.hlines([0], 0, residuals.index.max(), color='k')
plt.title('Residuals on training set')
plt.xlabel('Sample')
plt.ylabel('SMAPE')
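
As a worked example of the residual formula, with made-up numbers: each residual is the signed percentage difference $200\,(\hat{y} - y)/(\hat{y} + y)$, and the mean of the absolute values is the SMAPE.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([100.0, 200.0, 50.0])
y_hat = np.array([110.0, 180.0, 50.0])

# Signed percentage residuals: positive means over-prediction
signed = 200 * (y_hat - y_true) / (y_hat + y_true)

# SMAPE is the mean of the absolute residuals
smape = np.abs(signed).mean()
print(signed.round(3), round(smape, 3))
```

Here the residuals are roughly 9.524, -10.526, and 0, which gives a SMAPE of about 6.683.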

**Plot histogram of residuals**

In [None]:
mu, std = scipy.stats.norm.fit(residuals)

plt.figure(figsize=(12,4))
plt.hist(residuals, bins=100, density=True)
x = np.linspace(plt.xlim()[0], plt.xlim()[1], 200)
plt.plot(x, scipy.stats.norm.pdf(x, mu, std), 'r', linewidth=2)
plt.title(f'Histogram of residuals; mean = {residuals.mean():.4f}, '
          rf'$\sigma = {residuals.std():.1f}$, SMAPE = {residuals.abs().mean():.5f}')
plt.xlabel('Residual (percent)')
plt.ylabel('Density')
plt.show()
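
One detail worth noting: `scipy.stats.norm.fit` returns maximum-likelihood estimates, so its scale is the population standard deviation (`ddof=0`), while `Series.std()` in the plot title defaults to the sample standard deviation (`ddof=1`). For a large training set the two are almost identical. A small check on synthetic data (the sample below is made up for illustration):

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the residuals
sample = pd.Series(rng.normal(loc=1.0, scale=5.0, size=1000))

# Maximum-likelihood fit of a normal distribution
mu, std = scipy.stats.norm.fit(sample)

print(np.isclose(mu, sample.mean()))        # fitted mean equals the sample mean
print(np.isclose(std, sample.std(ddof=0)))  # fitted scale uses ddof=0
print(sample.std() > std)                   # pandas default (ddof=1) is slightly larger
```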