# Initial Thoughts 
In this notebook I go through a few visuals to understand the data better. \
After each set of visuals I note my initial thoughts on how to process and predict the data in another kernel.


**Note:** this is a work in progress.

### To do
1. Decomposition analysis - seasonality, trend, residuals 

# Load libraries and data 

In [None]:
!pip install pycaret --user

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

In [None]:
Date_Augmentation = True
S = 12  #season

In [None]:
train = pd.read_csv("../input/tabular-playground-series-jan-2022/train.csv",index_col = 0)
test = pd.read_csv("../input/tabular-playground-series-jan-2022/test.csv",index_col = 0)

# EDA 

In [None]:
train.head()

In [None]:
train.info()

In [None]:
display(train.isnull().sum())
display(train.duplicated().sum())

In [None]:
print("Max train date:" ,train["date"].max())
print("Min train date:" ,train["date"].min())

print("Max test date:" ,test["date"].max())
print("Min test date:" ,test["date"].min())

In [None]:
train["date"] = pd.to_datetime(train["date"])
test["date"] = pd.to_datetime(test["date"])

## Visualization of data 

In [None]:
sns.set_theme(style ="whitegrid")
palette = {"Sweden":"c","Norway":"red","Finland":"y"}

In [None]:
fig,ax = plt.subplots(3,2, figsize=(20,10),sharey= True)

ax[0,0].set_title("Train Country Distribution")
sns.countplot(ax=ax[0,0], x =train["country"])
ax[0,1].set_title("Test Country Distribution")
sns.countplot(ax=ax[0,1], x= test["country"])

ax[1,0].set_title("Train Store Distribution")
sns.countplot(ax=ax[1,0], x =train["store"])
ax[1,1].set_title("Test Store Distribution")
sns.countplot(ax=ax[1,1], x= test["store"])

ax[2,0].set_title("Train Product Distribution")
sns.countplot(ax=ax[2,0], x =train["product"])
ax[2,1].set_title("Test Product Distribution")
sns.countplot(ax=ax[2,1], x= test["product"])

fig.suptitle("Distributions")
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(25,10))
sns.lineplot(data= train, x="date",y="num_sold")
plt.title("Total Sales")

#X axis as months
locator = mdates.MonthLocator()  # every month
fmt = mdates.DateFormatter('%b')
X = plt.gca().xaxis
X.set_major_locator(locator)
# Specify formatter
X.set_major_formatter(fmt)
plt.show()

In [None]:
plt.figure(figsize=(25,10))
sns.lineplot(data=train[(train["date"] >= "2015-01-01") & (train["date"] < "2016-01-01")], x="date", y="num_sold")
sns.lineplot(data=train[(train["date"] >= "2016-01-01") & (train["date"] < "2017-01-01")], x="date", y="num_sold")
plt.title("2015 / 2016 Sales: a closer look")

#X axis as months
locator = mdates.MonthLocator()  # every month
fmt = mdates.DateFormatter('%b')
X = plt.gca().xaxis
X.set_major_locator(locator)
# Specify formatter
X.set_major_formatter(fmt)
plt.show()

**Notes** Definite seasonality:
* Huge spikes at the end of Dec / beginning of Jan -- Christmas / holiday season
* Spikes in April - Easter?
* Slight spikes in May and June
* Decline in sales from Jul to Oct, followed by a steady rise leading up to Dec

==> Investigate seasonality: **Seasonal ARIMA model**?
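
As a first step towards the decomposition on the to-do list, here is a minimal sketch of a classical additive decomposition (trend + seasonal + residual) using only pandas. The helper `decompose_additive`, the 7-day period and the synthetic series are my own assumptions for illustration, not results from the competition data; the yearly pattern noted above would need a much longer period.

```python
import numpy as np
import pandas as pd


def decompose_additive(series: pd.Series, period: int) -> pd.DataFrame:
    """Classical additive decomposition: series = trend + seasonal + residual.

    Hypothetical helper; assumes an odd period so the moving average is centred.
    """
    # Trend: centred moving average over one full cycle.
    trend = series.rolling(window=period, center=True).mean()
    detrended = series - trend

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so the seasonal effects sum to zero.
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)

    residual = series - trend - seasonal
    return pd.DataFrame({"trend": trend, "seasonal": seasonal, "residual": residual})


# Synthetic daily series: linear growth plus a fixed weekly pattern (made-up numbers).
dates = pd.date_range("2015-01-01", periods=70, freq="D")
weekly = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
demo = pd.Series(100 + 0.5 * np.arange(70) + np.tile(weekly, 10), index=dates)

parts = decompose_additive(demo, period=7)
print(parts.dropna().head(7))
```

On the real data the input would be one value per date, for example `train.groupby("date")["num_sold"].sum()`.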

In [None]:
plt.figure(figsize=(25,10))
sns.lineplot(data= train, x="date",y="num_sold", hue="country", palette = palette,ci=None)
plt.title("Sales by Country")

#X axis as months
locator = mdates.MonthLocator()  # every month
fmt = mdates.DateFormatter('%b')
X = plt.gca().xaxis
X.set_major_locator(locator)
# Specify formatter
X.set_major_formatter(fmt)
plt.show()

**Notes**
* Similar pattern/trend for each country, but at different volumes
* Are separate models required for each country if the patterns are this similar?

=> Use 1 model for all countries?
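
One rough way to test the "same pattern, different volume" idea is to divide each country's daily sales by that country's mean and see whether the rescaled curves line up. The sketch below does this on synthetic data; the per-country scale factors are invented for illustration and are not taken from `train`.

```python
import numpy as np
import pandas as pd

# Synthetic example: one shared weekly pattern, scaled per country (made-up factors).
dates = pd.date_range("2015-01-01", periods=28, freq="D")
pattern = 100 + 20 * np.sin(2 * np.pi * np.arange(28) / 7)
scale = {"Finland": 1.0, "Norway": 1.8, "Sweden": 1.3}
demo = pd.concat(
    [pd.DataFrame({"date": dates, "country": c, "num_sold": pattern * s})
     for c, s in scale.items()],
    ignore_index=True,
)

# One column per country, one row per date.
daily = demo.pivot_table(index="date", columns="country", values="num_sold", aggfunc="sum")

# Rescale each country by its own mean so only the shape remains.
relative = daily / daily.mean()

# Correlations near 1 and a small spread suggest the countries differ only by scale.
corr = relative.corr()
spread = (relative.max(axis=1) - relative.min(axis=1)).max()
print(corr.round(3))
print("Largest gap between rescaled curves:", spread)
```

If the real data behaves like this, a single model with `country` as a feature (or a per-country scaling step) may be enough; if the rescaled curves diverge, separate models look more attractive.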

In [None]:
plt.figure(figsize=(25,10))
sns.lineplot(data= train, x="date",y="num_sold", hue="product",ci=None)
plt.title("Product Sales")

#X axis as months
locator = mdates.MonthLocator()  # every month
fmt = mdates.DateFormatter('%b')
X = plt.gca().xaxis
X.set_major_locator(locator)
# Specify formatter
X.set_major_formatter(fmt)
plt.show()

**Notes** Dissimilar products
* Products don't seem to have any consistent relationship other than following the overall seasonality
* Actual seasons could impact product sales (summer, winter etc.)
 * Hat sales drop after Jun (getting colder?)
 * Mug sales increase after Jun (getting colder - more hot drinks?)
 * Stickers don't follow the seasons

=> Extract season type? Or temperature? \
=> Use 3 separate models for products?
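
A season feature could be derived from the month. This is only a sketch: the mapping uses Northern Hemisphere meteorological seasons (my assumption, reasonable for Sweden, Norway and Finland), and `add_season` is a hypothetical helper name.

```python
import pandas as pd

# Meteorological seasons for the Northern Hemisphere (assumed convention).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def add_season(df: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Return a copy of df with a categorical 'season' column."""
    out = df.copy()
    out["season"] = out[date_col].dt.month.map(SEASON_BY_MONTH)
    return out


demo = pd.DataFrame(
    {"date": pd.to_datetime(["2015-01-15", "2015-04-05", "2015-07-01", "2015-10-31"])}
)
demo = add_season(demo)
print(demo)
```

The same call would work on `train` and `test` once their `date` columns are converted with `pd.to_datetime`.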

In [None]:
if Date_Augmentation:
    print("Running augmentation on date field")
    
    train["day"] = train["date"].dt.day
    train["dayofweek"] = train["date"].dt.dayofweek
    train["month"] = train["date"].dt.month
    train["year"] = train["date"].dt.year
    
    test["day"] = test["date"].dt.day
    test["dayofweek"] = test["date"].dt.dayofweek
    test["month"] = test["date"].dt.month
    test["year"] = test["date"].dt.year

## Day, week , month analysis

In [None]:
fig,ax = plt.subplots(4,1, figsize=(25,20),sharey= True)

ax[0].set_title("Day-of-Month Sales")
sns.lineplot(ax=ax[0], data= train, x="day",y="num_sold")

ax[1].set_title("Day-of-week Sales")
sns.lineplot(ax=ax[1], data= train, x=train["dayofweek"]+1,y="num_sold")
             
ax[2].set_title("Month Sales")
sns.lineplot(ax=ax[2], data= train, x="month",y="num_sold")

ax[3].set_title("Year Sales")
sns.lineplot(ax=ax[3], data= train, x="year",y="num_sold")

plt.tight_layout()
plt.show()

In [None]:
fig,ax = plt.subplots(4,1, figsize=(25,20),sharey= True)

fig.suptitle("Products View",ha = "left")
ax[0].set_title("Day-of-Month Sales")
sns.lineplot(ax=ax[0], data= train, x="day",y="num_sold",ci=None, hue="product")

ax[1].set_title("Day-of-week Sales")
sns.lineplot(ax=ax[1], data= train, x=train["dayofweek"]+1,y="num_sold",ci=None,hue="product")
             
ax[2].set_title("Month Sales")
sns.lineplot(ax=ax[2], data= train, x="month",y="num_sold",ci=None,hue="product")

ax[3].set_title("Year Sales")
sns.lineplot(ax=ax[3], data= train, x="year",y="num_sold",ci=None,hue="product")

plt.tight_layout()
plt.show()

**Notes**
* More sales on Saturday and Sunday
* More Mug sales in April, and higher sales for all products in Dec/Jan
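
The weekend effect could be passed to a model as an explicit flag. A small sketch on a one-week demo frame; the column name `is_weekend` is my own choice. pandas numbers Monday as 0, so Saturday and Sunday are 5 and 6.

```python
import pandas as pd

# One week of dates starting on Thursday 2015-01-01.
demo = pd.DataFrame({"date": pd.date_range("2015-01-01", periods=7, freq="D")})
demo["dayofweek"] = demo["date"].dt.dayofweek

# 1 for Saturday/Sunday, 0 otherwise.
demo["is_weekend"] = (demo["dayofweek"] >= 5).astype(int)
print(demo)
```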

In [None]:
train.head()

In [None]:
#train.index = train["date"]
#train.drop("date",axis=1,inplace=True)
#train.head()

## Augmented Dickey–Fuller ---Stationarity test
Check whether the data is stationary. \
A stationary series has statistical properties (mean, variance, autocorrelation) that do not change over time, so a trend or seasonality makes a series non-stationary.
- Null hypothesis: the series has a unit root, i.e. it is non-stationary
- p < 0.05 ==> reject the null hypothesis; the data is likely stationary

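In [None]:
from statsmodels.tsa.stattools import adfuller
# adfuller needs a single 1-D series, so test total daily sales rather than the whole frame
daily_sales = train.groupby("date")["num_sold"].sum()
adf, pvalue, usedlag_, nobs_, critical_values_, icbest_ = adfuller(daily_sales)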

## AR and MA check 

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [None]:
KaggleHat= train[train["product"] =="Kaggle Hat"]
KaggleMart= train[train["store"] =="KaggleMart"]
Norway= train[train["country"] =="Norway"]
Sweden= train[train["country"] =="Sweden"]
Finland= train[train["country"] =="Finland"]

In [None]:
fig, ax = plt.subplots(2,1, figsize=(25,10))
fig = plot_acf(Norway["num_sold"],  ax=ax[0],title='ACF Norway',lags = 150)
fig = plot_pacf(Norway["num_sold"],  ax=ax[1],title='PACF Norway',lags = 20)
plt.show()

In [None]:
fig, ax = plt.subplots(2,1, figsize=(25,10))
fig = plot_acf(KaggleMart["num_sold"],  ax=ax[0],title='ACF KaggleMart',lags = 150)
fig = plot_pacf(KaggleMart["num_sold"],  ax=ax[1],title='PACF KaggleMart',lags = 20)
plt.show()

In [None]:
fig, ax = plt.subplots(2,1, figsize=(25,10))
fig = plot_acf(KaggleHat["num_sold"],  ax=ax[0],title='ACF KaggleHat',lags = 150)
fig = plot_pacf(KaggleHat["num_sold"],  ax=ax[1],title='PACF KaggleHat',lags = 20)
plt.show()
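
One caveat on the plots above: each subset still holds several rows per date (one per store/product combination), so a lag in the ACF counts rows, not days. Summing to one value per date first should give lags that mean days. The sketch below illustrates the difference with pandas' `Series.autocorr` on synthetic data; the weekly pattern and multipliers are invented.

```python
import numpy as np
import pandas as pd

# Synthetic long-format data: 6 rows per date (2 stores x 3 products), made-up numbers.
dates = pd.date_range("2015-01-01", periods=56, freq="D")
weekly = np.array([10, 10, 10, 10, 12, 18, 16], dtype=float)
rows = []
for i, d in enumerate(dates):
    for store_mult in (1.0, 1.7):
        for product_mult in (1.0, 2.0, 0.5):
            rows.append({"date": d, "num_sold": weekly[i % 7] * store_mult * product_mult})
demo = pd.DataFrame(rows)

# Lag 7 on the raw rows compares rows 7 apart, which mixes stores and products.
row_level_lag7 = demo["num_sold"].autocorr(lag=7)

# Lag 7 on the daily totals compares each day with the same weekday one week earlier.
daily = demo.groupby("date")["num_sold"].sum()
daily_lag7 = daily.autocorr(lag=7)

print("Row-level lag 7:", round(row_level_lag7, 3))
print("Daily lag 7:", round(daily_lag7, 3))
```

The daily series could then be passed to `plot_acf` and `plot_pacf` in place of the raw column.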

## SMAPE code

In [None]:
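def SMAPE(y_true, y_pred):
    # Symmetric mean absolute percentage error, in percent (range 0 to 200).
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    # When both values are 0 the ratio is 0/0; count that as zero error.
    diff[denominator == 0] = 0.0
    return np.mean(diff)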
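
To sanity-check the metric, here is a small worked example using the usual SMAPE definition, 100 * mean(|y - ŷ| / ((|y| + |ŷ|) / 2)). The lowercase `smape` below is a standalone variant written for this check, and the input numbers are made up.

```python
import numpy as np


def smape(y_true, y_pred):
    """SMAPE in percent; a term counts as 0 when both values are 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    ratio = np.divide(
        np.abs(y_true - y_pred),
        denominator,
        out=np.zeros_like(denominator),
        where=denominator != 0,
    )
    return 100.0 * ratio.mean()


# Terms: 10/105, 20/190 and 0 (both values are zero), averaged and scaled by 100.
score = smape([100, 200, 0], [110, 180, 0])
print(round(score, 4))
```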

# Pycaret Baseline Testing

In [None]:
from pycaret.regression import *
reg1 = setup(train, target = 'num_sold', session_id=123, log_experiment=True, experiment_name='tps jan 2022',silent=True)

In [None]:
best_model = compare_models(fold=5,sort = 'MAPE')

Baseline LightGBM model with custom SMAPE scoring

In [None]:
add_metric(id="SMAPE_metric", name="SMAPE", score_func=SMAPE ,greater_is_better=False)

In [None]:
lightgbm = create_model('lightgbm')
tuned_lightgbm = tune_model(lightgbm, n_iter=100, custom_scorer = "SMAPE_metric",optimize="SMAPE_metric")

In [None]:
tuned_lightgbm

## Feature importance

In [None]:
interpret_model(tuned_lightgbm)

In [None]:
plot_model(tuned_lightgbm, plot='feature')

# Predictions and scoring

In [None]:
#training
preds_train =  predict_model(tuned_lightgbm, train)

In [None]:
#Test prediction
preds_test = predict_model(tuned_lightgbm, test)

In [None]:
#preds_test["date"] = test["date"]
preds_train["date"] =train["date"] 

In [None]:
preds_train.head()

### Plot Predictions with Training data

In [None]:
plt.figure(figsize=(25,10))

sns.lineplot(data= train.reset_index(), x="date",y="num_sold", label="Train Actual")
sns.lineplot(data =preds_train.reset_index(),x = "date" , y = "Label", label="Train Prediction" ) 
sns.lineplot(data = preds_test.reset_index(),x = "date" , y = "Label", label ="Test Prediction" )
plt.title("Actual and Predicted Sales")

**Note**:
From the above, the predicted seasonality looks good, but the increase in values is less pronounced than in the actual historical data. \
The model seems to try to keep the data stationary.

# Submissions 
Although this isn't a complete notebook for prediction scoring, we will submit to get a baseline.

In [None]:
sub = pd.read_csv("../input/tabular-playground-series-jan-2022/sample_submission.csv",index_col = 0)

In [None]:
sub["num_sold"] = preds_test["Label"]
sub.to_csv("submission.csv")

In [None]:
sub.head()