# Time Series Playground : 2022 📈 🔰 🚀

This notebook contains an EDA of the Time Series Playground dataset and uses Prophet to build a basic baseline model for predicting sales in this dataset.

![Prophet-logo](https://miro.medium.com/max/1400/1*BVIwEoE5oEmHJU8XbV_mKA.png)






If you find this notebook informative please give it an upvote by pressing on the (▲) button.


## Table of Contents

1. [Introduction to Prophet](#intro)
2. [Advantages of Prophet](#advantages)
3. [Installation of Prophet](#install)
4. [Basic Setup](#bs)
5. [Exploratory Data Analysis](#eda)
6. [Application of Prophet for a single Country and a single store](#prophet1)
7. [Application of fbProphet on our data](#prophet2)
8. [References](#refer)

<a id="intro"></a><br/>
## Introduction to Prophet


As per [Facebook Prophet's page](https://facebook.github.io/prophet/):

*Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.* 

*Prophet is open source software released by Facebook’s Core Data Science team. It is available for download on CRAN and PyPI.*

* Prophet decomposes time series data into **trend**, **seasonality**, and **holiday effects**.
* **Trend**: the non-periodic change in a time series.
* **Seasonality**: a periodic change in a time series, such as a daily, weekly, monthly, or yearly pattern.
* **Holiday effects**: effects that alter the time series only during particular periods, such as public holidays.

The Prophet model can be written as the following equation:
$$
y(t) = g(t) + s(t) + h(t) + \epsilon(t)
$$

where $g(t)$ represents the trend component, $s(t)$ represents the seasonality, $h(t)$ represents the holiday effects, and $\epsilon(t)$ represents the error term, i.e. whatever the other components do not explain.
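
To make the additive structure concrete, here is a minimal sketch on synthetic data. Every component below (the linear trend, the weekly sine wave, the single holiday bump, the noise level) is a made-up assumption for illustration only; it is not how Prophet estimates these terms.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = pd.date_range('2015-01-01', periods=365, freq='D')
n = np.arange(len(t))

g = 100 + 0.1 * n                                          # trend g(t): slow linear growth
s = 10 * np.sin(2 * np.pi * n / 7)                         # seasonality s(t): weekly cycle
h = np.where((t.month == 12) & (t.day == 25), 50.0, 0.0)   # holiday effect h(t): one-day bump
eps = rng.normal(0, 2, size=len(t))                        # error term

# The observed series is simply the sum of the four components
y = pd.Series(g + s + h + eps, index=t, name='y')
print(y.head())
```

Prophet works in the opposite direction: it starts from `y` and estimates the components.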

<a id='advantages'></a><br/>
## Advantages of Prophet

The advantages of Prophet are as follows:

* **Accurate and fast**: Prophet is used in many applications across Facebook for producing reliable forecasts for planning and goal setting.
* **Fully Automatic**: Get a reasonable forecast on messy data with no manual effort.
* **Tunable forecasts**: The Prophet procedure includes many possibilities for users to tweak and adjust forecasts. You can use human-interpretable parameters to improve your forecast by adding your domain knowledge.
* **Available in R or Python**: Prophet has been implemented in both R and Python, and the two share the same underlying Stan code for fitting.
* **Robust to Outliers**
* **Robust to Missing Data**
* Models the individual components of a time series (trend, seasonality, holiday effects) well.
* Lastly, it is developed by Facebook's Core Data Science team.

<a id='install'></a><br/>
## Installation of Prophet

In [None]:
!pip install fbprophet

<a id="bs"></a><br/>
## Basic Setup

In [None]:
# Standard libraries for numerical operations and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.offline as pyo
import os
import sys
import warnings


# Specific plotting libraries
from plotly.subplots import make_subplots
from statsmodels.graphics.tsaplots import plot_acf

# Prophet library
from fbprophet import Prophet


warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pyo.init_notebook_mode()

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-jan-2022/train.csv', index_col='date', parse_dates=True, infer_datetime_format=True)
df_test = pd.read_csv('../input/tabular-playground-series-jan-2022/test.csv', index_col='date', parse_dates=True, infer_datetime_format=True)

In [None]:
df_train.head()

In [None]:
df_test.head()

Note that this data is recorded at a **daily** frequency.

In [None]:
print(f'Shape of the training data : {df_train.shape}')
print(f'Shape of the test data: {df_test.shape}')
print('-*'*20)
print(f'Start Date : {df_train.index[0]}')
print(f'End Date : {df_train.index[-1]}')
print('-*'*20)
print(f'Number of unique countries : {df_train["country"].nunique()}')
print(f'Unique countries : {df_train["country"].unique()}')
print('-*'*20)
print(f'Number of unique stores : {df_train["store"].nunique()}')
print(f'Unique stores : {df_train["store"].unique()}')
print('-*'*20)
print(f'Number of products : {df_train["product"].nunique()}')
print(f'Unique products : {df_train["product"].unique()}')

<a id='eda'></a><br/>
## Exploratory Data Analysis

### Check the percentages of each categorical variable in the dataset

In [None]:
country_sizes = df_train.groupby(['country']).size()
store_sizes = df_train.groupby(['store']).size()
product_sizes = df_train.groupby(['product']).size()


fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(12, 10))

plt.subplots_adjust(wspace=0.85)

ax[0].pie(country_sizes.values, labels=country_sizes.index, explode=[0.2, 0.2, 0.0], 
          shadow=True, autopct='%1.2f%%', colors=['#f08e59', '#f0d759', '#59a4f0'])
ax[1].pie(store_sizes.values, labels=store_sizes.index, explode=[0.2, 0.0], shadow=True, autopct='%1.2f%%')
ax[2].pie(product_sizes.values, labels=product_sizes.index, explode=[0.2, 0.2, 0.0], 
          shadow=True, autopct='%1.2f%%', colors=['#e64747', '#4777e6', '#e6bc47'])

fig.show()
    

### Which country has the highest sales (on average)?

In [None]:
grouped_df1 = df_train.groupby(['country']).aggregate({'num_sold':'mean'}).sort_values(by=['num_sold'], ascending=False)
grouped_df1.plot(kind='bar', figsize=(10, 5), title='Which country has the highest sales (on an average)?')
plt.show()

### Which product has the highest sales in all the stores in all the countries?

In [None]:
grouped_df2 = df_train.groupby(['product']).aggregate({'num_sold':'mean'}).sort_values(by=['num_sold'], ascending=False)
grouped_df2.plot(kind='bar', figsize=(10, 5), title='Which product has the highest sales in all the stores in all the countries?', color='cornflowerblue')
plt.show()

### Which store has the highest sales in all the countries?

In [None]:
grouped_df3 = df_train.groupby(['store']).aggregate({'num_sold':'mean'}).sort_values(by=['num_sold'], ascending=False)
grouped_df3.plot(kind='bar', figsize=(10, 5), title='Which store has the highest sales in all the countries?', color='#ffcb52')
plt.show()

### Comparison of the average sales of different products of different stores in different countries

In [None]:
grouped_df4 = df_train.groupby(['country', 'store', 'product']).aggregate({'num_sold':'mean'})
grouped_df4.unstack().plot(kind='bar', figsize=(10, 5), stacked=True, title='Overview of the data')
plt.show()

### Comparison of sales of different products of different stores in different countries

In [None]:
fig, ax = plt.subplots(nrows = df_train['product'].nunique(), ncols=df_train['store'].nunique(), figsize=(12, 8))
plt.subplots_adjust(top=1.5, wspace=1.0, hspace=0.5)

for i, prod in enumerate(df_train['product'].unique()):
    for j, stores in enumerate(df_train['store'].unique()):
        
        d = df_train.loc[(df_train['product'] == prod) & (df_train['store'] == stores)].reset_index()
        ax[i, j].set_title(f'Product : {prod}, Store : {stores}')
        sns.lineplot(x='date', y='num_sold', data=d, hue='country', ax=ax[i, j])
        # Rotate and shrink the existing date tick labels instead of overwriting them with the first rows of `d`
        ax[i, j].tick_params(axis='x', labelrotation=90, labelsize=5)
        
        
fig.show()

### Observations from the above plots

* On average, **Norway** has the highest sales.
* On average, **Kaggle Hat** is the best-selling product.
* On average, **Kaggle Rama** sells more than **Kaggle Mart**.

### Plot an interactive Time Series

In [None]:
def plot_ts(data, **kwargs):
    """
    Plot an interactive time series of sales per product for one country and store.
    
    Parameters:
    -----------
    data : pandas.DataFrame
        Data indexed by date, with `country`, `store`, `product` and `num_sold` columns.
    **kwargs :
        country (default 'Finland'), store (default 'KaggleMart') and title (default '').

    Returns:
    --------
    None.
    """
    country = kwargs.get('country', 'Finland')
    store = kwargs.get('store', 'KaggleMart')
    title = kwargs.get('title', '')
    plots = list()
    
    data = data.loc[(data['country'] == country) & (data['store'] == store)]
    
    for prod in data['product'].unique():
        # Filter once so that x (dates) and y (sales) have the same length
        prod_data = data.loc[data['product'] == prod]
        pl = go.Scatter(name=prod, x=prod_data.index, y=prod_data['num_sold'], mode='lines', line=dict(width=1.875))
        plots.append(pl)
    
    fig = go.Figure(data=plots)
    fig.update_layout({"title":title, 
                      "xaxis":{
                          "rangeslider":dict(visible=True),
                          "rangeselector":dict(buttons=list([
                              dict(count=1, step='year', label='1y', stepmode='backward'),
                              dict(count=3, step='year', label='3y', stepmode='backward'),
                              dict(count=5, step='year', label='5y', stepmode='backward'),
                              dict(step='all')
                          ]))
                      }})
    
    fig.show()
    

In [None]:
plot_ts(df_train, country='Finland', store='KaggleMart', title='Kaggle Mart product sales in Finland')

### AutoCorrelation Plots

In [None]:
fig2, ax2 = plt.subplots(nrows = df_train['product'].nunique(), ncols=df_train['store'].nunique(), figsize=(12, 8))
plt.subplots_adjust(top=1.5, wspace=1.0, hspace=0.5)

for i, prod in enumerate(df_train['product'].unique()):
    for j, stores in enumerate(df_train['store'].unique()):
        
        d = df_train.loc[(df_train['product'] == prod) & (df_train['store'] == stores)]
        plot_acf(d.loc[d['country'] == 'Norway']['num_sold'], ax=ax2[i, j], use_vlines=False, lags=1400, title=f'Product : {prod}, Store: {stores}', marker='x', label='Norway')
        plot_acf(d.loc[d['country'] == 'Finland']['num_sold'], ax=ax2[i, j], use_vlines=False, lags=1400, title=f'Product : {prod}, Store: {stores}', marker='*', label='Finland')
        plot_acf(d.loc[d['country'] == 'Sweden']['num_sold'], ax=ax2[i, j], use_vlines=False, lags=1400, title=f'Product : {prod}, Store: {stores}', marker='o', label='Sweden')
        
fig2.show()

### Observations from AutoCorrelation Plot

Before diving into the figure above, let's first understand what an **autocorrelation plot** is. An **autocorrelation plot** shows how strongly the data points in a time series are correlated with a lagged version of the same series. It also helps uncover hidden structure, such as seasonality, in the data.
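
As a small illustration of what is being measured, the sketch below computes the autocorrelation at a few lags with `pandas.Series.autocorr`. The series is synthetic (a 7-day sine wave plus noise, both assumptions made for this example), so the numbers say nothing about the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
days = np.arange(200)

# Synthetic daily series: a 7-day cycle plus a little noise
series = pd.Series(10 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 1, size=days.size))

# autocorr(lag=k) is the Pearson correlation between the series and itself shifted by k steps
acf_by_lag = {lag: series.autocorr(lag=lag) for lag in (1, 3, 7, 14)}
for lag, value in acf_by_lag.items():
    print(f'lag {lag:>2}: {value:+.2f}')
```

Lags that are multiples of the cycle length (7 and 14 here) come out strongly positive, which is the kind of repeating peak to look for in an autocorrelation plot.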

What we can observe from the above autocorrelation plots are the following:
> Note: The unit of x-axis is in days.

* The autocorrelation plot of `Kaggle Hat` is similar to the autocorrelation plot of `Kaggle Mug`.
* There is a **periodic pattern** in the autocorrelation plots of all three products, hinting at possible **seasonality** in the data.
* For the three countries (indicated by different markers and colors), the pattern of the autocorrelation plot is the same across all stores and products.

What do I mean by seasonality? 

It's nothing fancy: put simply, it is a periodic behavior in the data, such as people buying far more during Black Friday sales than on a normal day.

In [None]:
df_train['day'] = df_train.index.day_name()

### How do sales differ between weekdays and weekends in different countries?

In [None]:
wgroup1 = df_train.groupby(by=['day', 'country']).aggregate({'num_sold':'mean'}).reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], level='day')

wgroup1.unstack().plot(figsize=(10, 5), title='How is the sale in weekdays and weekends in different countries?')
plt.show()

### How do sales differ between weekdays and weekends for different stores?

In [None]:
wgroup2 = df_train.groupby(by=['day', 'store']).aggregate({'num_sold':'mean'}).reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], level='day')

wgroup2.unstack().plot(figsize=(10, 5), title='How do sales differ between weekdays and weekends for different stores?')
plt.show()

### How do sales differ between weekdays and weekends for different products?

In [None]:
wgroup3 = df_train.groupby(by=['day', 'product']).aggregate({'num_sold':'mean'}).reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], level='day')

wgroup3.unstack().plot(figsize=(10, 5), title='How is the sale in weekdays and weekends of different products?')
plt.show()

### Observations

* On average, sales peak on weekends rather than on weekdays.

In [None]:
df_train['month'] = df_train.index.month_name()

### How do sales vary across months in different countries?

In [None]:
mgroup1 = df_train.groupby(['month', 'country']).aggregate({"num_sold":'mean'}).reindex(['January', 'February', 'March',
                                                                                        'April', 'May', 'June',
                                                                                        'July', 'August', 'September',
                                                                                        'October', 'November', 'December'], level='month')

mgroup1.unstack().plot(figsize=(10, 5), title='How is the sale on different countries in different months?')
plt.show()

### How do sales vary across months for different stores?

In [None]:
mgroup2 = df_train.groupby(['month', 'store']).aggregate({"num_sold":'mean'}).reindex(['January', 'February', 'March',
                                                                                        'April', 'May', 'June',
                                                                                        'July', 'August', 'September',
                                                                                        'October', 'November', 'December'], level='month')

mgroup2.unstack().plot(figsize=(10, 5), title='How is the sale on different stores in different months?')
plt.show()

### How do sales vary across months for different products?

In [None]:
mgroup3 = df_train.groupby(['month', 'product']).aggregate({"num_sold":'mean'}).reindex(['January', 'February', 'March',
                                                                                        'April', 'May', 'June',
                                                                                        'July', 'August', 'September',
                                                                                        'October', 'November', 'December'], level='month')

mgroup3.unstack().plot(figsize=(10, 5), title='How is the sale on different product in different months?')
plt.show()

### Observations:

* A noticeable rise in sales is observed between **March** and **May**, and again between **November** and **December**.
* We don't see any significant increase in sales during these periods for the following products:
    * Kaggle Mug
    * Kaggle Sticker
    

<a id='prophet1'></a><br/>
## Application of Prophet for a single Country and a single store

This part showcases how to use Prophet on a single series. It is a much simpler version of the more elaborate ways one can use Prophet to solve business problems.

In [None]:
sample_df = df_train.loc[(df_train['country'] == 'Norway') & (df_train['store'] == 'KaggleRama') & (df_train['product'] == 'Kaggle Hat')]['num_sold']

print(f'Shape of the Sample Data Frame Training  : {sample_df.shape}')

sample_df.plot(figsize=(10, 5), title='Time Series Plot of the sample dataframe')
plt.show()

### Renaming Columns

Before proceeding with the Prophet forecast, some of the column names need to be changed.
* Prophet requires the target variable to be named `y`
* and the time variable to be named `ds`.


In [None]:
sample_df = sample_df.reset_index()
sample_df.columns = ['ds', 'y']
sample_df.head()


test_size = int(0.2*sample_df.shape[0])
sample_df_train = sample_df.iloc[:-test_size]
sample_df_test = sample_df.iloc[-test_size:]

print(f'Sample Data Frame train size : {sample_df_train.shape}')
print(f'Sample DataFrame test size : {sample_df_test.shape}')
print(f'Train Start Date: {sample_df_train.iloc[0]["ds"]} | Train End Date : {sample_df_train.iloc[-1]["ds"]}')
print(f'Test Start Date: {sample_df_test.iloc[0]["ds"]} | Test End Date : {sample_df_test.iloc[-1]["ds"]}')

### Defining the Prophet Model

In [None]:
m = Prophet()

### Fitting the Prophet Model

In [None]:
m.fit(sample_df_train)

### Forecasting

In [None]:
future = m.make_future_dataframe(periods=sample_df_test.shape[0], freq='D')
future.tail()

In [None]:
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
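
Since the last 20% of the sample was held out, the forecast can also be scored against it. The sketch below uses SMAPE, which is my choice of metric for illustration rather than something this notebook specifies; the `smape` helper and the toy numbers are hypothetical.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 means a perfect forecast)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Treat a term as 0 when both the actual and the forecast are 0, to avoid dividing by zero
    ratio = np.divide(np.abs(y_true - y_pred), denom, out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * ratio.mean()

# Toy check with made-up numbers
print(smape([100, 200, 300], [110, 190, 300]))

# In this notebook, the hold-out rows are the last len(sample_df_test) rows of `forecast`:
# smape(sample_df_test['y'].values, forecast['yhat'].tail(len(sample_df_test)).values)
```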

In [None]:
pd.concat([sample_df.set_index('ds')['y'], forecast.set_index('ds')['yhat']], axis=1).plot(figsize=(12, 8))
plt.show()

In [None]:
# Plot the forecast over the observed values, together with the upper and lower uncertainty bounds.
fig1 = m.plot(forecast)

In [None]:
# Plot the different components of the time series
fig2 = m.plot_components(forecast)

### Visualize the Changepoints

In [None]:
from fbprophet.plot import add_changepoints_to_plot
fig3 = m.plot(forecast)
a = add_changepoints_to_plot(fig3.gca(), m, forecast)

You can also supply your own changepoints to Prophet so that known events are modeled explicitly, which may improve the predictions for your business.
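
As a rough sketch of that idea, Prophet's constructor accepts a `changepoints` argument listing the dates at which the trend is allowed to change. The two dates below are hypothetical placeholders, not dates derived from this data, and the import is guarded so the cell does not fail where `fbprophet` is unavailable.

```python
# Hypothetical dates at which we suspect the trend shifted (placeholders only)
custom_changepoints = ['2016-01-01', '2017-01-01']

try:
    from fbprophet import Prophet
    # Passing changepoints replaces the automatic changepoint selection with the listed dates
    m_custom = Prophet(changepoints=custom_changepoints)
except ImportError:
    m_custom = None  # fbprophet is not installed in this environment

# m_custom.fit(sample_df_train) would then be used exactly like `m` above;
# the dates must fall inside the training period.
```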

### Visualizing the direction of the changepoints

In [None]:
deltas = m.params['delta'].mean(0)

fig = plt.figure(figsize=(10, 5), facecolor='w')
ax = fig.add_subplot(111)
ax.bar(range(len(deltas)), deltas, color='indianred')
ax.grid(True, which='major', c='cornflowerblue', ls='-', lw=1., alpha=.2)
ax.set_xlabel('changepoint')
ax.set_ylabel('Rate change')
fig.tight_layout()
fig.show()


<a id='prophet2'></a><br/>
## Application of fbProphet on our data

In [None]:
# As shown previously, Prophet needs only the datetime and target columns.
# We therefore fit one model per (country, store, product) combination instead of passing the categorical variables.
import itertools

combinations = list(itertools.product(df_train['country'].unique(),
                                      df_train['store'].unique(),
                                      df_train['product'].unique()))

# One independent Prophet model per combination
models = [Prophet() for _ in combinations]
print(len(models))

### Model Fitting and Forecasting

In [None]:
preds = list()

for i, com in enumerate(combinations):
    _df = df_train.loc[(df_train['country'] == com[0]) & (df_train['store'] == com[1]) &(df_train['product'] == com[2])]['num_sold'].reset_index()
    _df.columns = ['ds', 'y']
    models[i].fit(_df)
    future = models[i].make_future_dataframe(periods=365, freq='D')
    forecast = models[i].predict(future)[['ds', 'yhat']].tail(365)
    preds.append(forecast)
    del forecast
    del future
    
    

In [None]:
for i in range(df_train['country'].nunique() * df_train['store'].nunique() * df_train['product'].nunique()):
    preds[i]['country'] = combinations[i][0]
    preds[i]['store'] = combinations[i][1]
    preds[i]['product'] = combinations[i][2]
    
    

In [None]:
df_test_copy = df_test.copy().reset_index()
df_test_copy['num_sold'] = 0.0  # float placeholder, since the forecasts (yhat) are fractional
for i, com in enumerate(combinations):
    df_test_copy.loc[(df_test_copy['country'] == com[0]) & (df_test_copy['store'] == com[1]) & (df_test_copy['product'] == com[2]), 
                     'num_sold'] = preds[i]['yhat'].values
    
df_test_copy

In [None]:
submission_df = df_test_copy[['row_id', 'num_sold']]
submission_df.to_csv('submission.csv', index=False)

<a id='refer'></a><br/>
## References

1. https://www.kaggle.com/jeongbinpark/tps-jan-simple-eda-and-fbprophet
2. https://www.kaggle.com/prashant111/tutorial-time-series-forecasting-with-prophet
3. https://www.youtube.com/watch?v=D8CFPyi4ai4&list=PL3N9eeOlCrP5cK0QRQxeJd6GrQvhAtpBK&index=10
4. https://facebook.github.io/prophet/docs/quick_start.html#python-api
5. https://facebook.github.io/prophet/docs/diagnostics.html#hyperparameter-tuning
