Hello everyone! Here's a quick EDA of the Tabular Playground Series (TPS) January 2022 dataset. Feel free to comment below if you have any suggestions for improving this notebook. I'm eager to learn.  
Thank you and happy new year! 🎊

# 1. Imports and setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2022/train.csv', index_col='row_id')
test = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2022/test.csv', index_col='row_id')
sample_sub = pd.read_csv('/kaggle/input/tabular-playground-series-jan-2022/sample_submission.csv', index_col='row_id')

In [None]:
color_c = px.colors.sequential.Teal
color_d = px.colors.qualitative.Set2

In [None]:
for df in train, test:
    df['date'] = pd.to_datetime(df['date'])

In [None]:
train.info()

In [None]:
train.describe()

# 2. Distributions

## Country

In [None]:
fig = px.histogram(train, x='country', color='country', color_discrete_sequence=color_d)
fig.update_layout(height=400, width=700, template='plotly_white', title='Count of country')

## Store

In [None]:
fig = px.histogram(train, x='store', color='store', color_discrete_sequence=color_d)
fig.update_layout(height=400, width=700, template='plotly_white', title='Count of store')

## Product

In [None]:
fig = px.histogram(train, x='product', color='product', color_discrete_sequence=color_d)
fig.update_layout(height=400, width=700, template='plotly_white', title='Count of product')

The counts are identical across countries, stores, and products: we couldn't ask for a better-balanced dataset.

# 3. Sales by country

## Total sales

In [None]:
fig = px.histogram(data_frame=train, x='country', y='num_sold', color='product', barmode='group', color_discrete_sequence=color_d)
fig.update_layout(height=400, width=700, template='plotly_white', title='Number of products sold by country')

In [None]:
temp = train.groupby(['country', 'product'])['num_sold'].sum()
# Share of each product within its country; transform keeps the (country, product) index intact
temp = (temp / temp.groupby(level=0).transform('sum') * 100).round().reset_index()
temp.rename(columns={'num_sold': '% of total sales'})

As observed on the graph and confirmed by the numbers, each product accounts for the same share of total sales in every country.
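
The share computation can be checked by hand on a tiny example. The sketch below uses made-up numbers (not taken from the competition data) with the same column names as `train`, and divides each (country, product) total by its country total using `transform`:

```python
import pandas as pd

# Toy data: invented numbers, same column names as train.csv
toy = pd.DataFrame({
    'country': ['Finland', 'Finland', 'Norway', 'Norway'],
    'product': ['Kaggle Hat', 'Kaggle Mug', 'Kaggle Hat', 'Kaggle Mug'],
    'num_sold': [60, 40, 150, 100],
})

totals = toy.groupby(['country', 'product'])['num_sold'].sum()
# Divide each (country, product) total by the total of its country
share = (totals / totals.groupby(level='country').transform('sum') * 100).round()
share = share.rename('% of total sales').reset_index()
print(share)
```

Norway sells more units than Finland in this toy frame, yet both rows of shares come out the same, which is the pattern described above.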

## Time series

In [None]:
temp = train.groupby(['date', 'country'])['num_sold'].sum().reset_index()
fig = px.line(data_frame=temp, x='date', y='num_sold', color='country', color_discrete_sequence=color_d)
fig.update_layout(height=500, width=1000, template='plotly_white', title='Daily sales by country')

Sales in the different countries seem highly correlated. Let's verify that.

## Correlations

In [None]:
pivot = train.pivot_table(index='date', values='num_sold', columns='country', aggfunc='sum')

fig = px.imshow(pivot.corr(), color_continuous_scale=color_c)
fig.update_layout(height=400, width=700, template='plotly_white')

# 4. Sales by store

## Total sales

In [None]:
fig = px.histogram(data_frame=train, x='store', y='num_sold', color='product', barmode='group', color_discrete_sequence=color_d)
fig.update_layout(height=400, width=700, template='plotly_white', title='Products sold by store')

In [None]:
temp = train.groupby(['store', 'product'])['num_sold'].sum()
# Share of each product within its store; transform keeps the (store, product) index intact
temp = (temp / temp.groupby(level=0).transform('sum') * 100).round().reset_index()
temp.rename(columns={'num_sold': '% of total sales'})

Same observation as above: each product accounts for the same share of total sales in every store.

## Time series

In [None]:
temp = train.groupby(['date', 'store'])['num_sold'].sum().reset_index()

fig = px.line(data_frame=temp, x='date', y='num_sold', color='store', color_discrete_sequence=color_d)
fig.update_layout(height=500, width=1000, template='plotly_white', title='Daily sales by store')

Let's check correlations again.

## Correlations

In [None]:
pivot = train.pivot_table(index='date', values='num_sold', columns='store', aggfunc='sum')

fig = px.imshow(pivot.corr(), color_continuous_scale=color_c)
fig.update_layout(height=400, width=700, template='plotly_white')

# 5. Sales by product

## Time series

In [None]:
temp = train.groupby(['date', 'product'])['num_sold'].sum().reset_index()

fig = px.line(data_frame=temp, x='date', y='num_sold', color='product', color_discrete_sequence=color_d)
fig.update_layout(height=500, width=1000, template='plotly_white', title='Daily sales by product')

All peaks happen at the beginning of the year. It's interesting to notice that as hat sales go down, mug sales increase. Indeed, why would you need a hat during winter, when you could just stay warm next to the fireplace, drinking a hot chocolate in your precious Kaggle mug? ☕️

Again, let's run a correlation check.

## Correlations

In [None]:
pivot = train.pivot_table(index='date', values='num_sold', columns='product', aggfunc='sum')

fig = px.imshow(pivot.corr(), color_continuous_scale=color_c)
fig.update_layout(height=400, width=700, template='plotly_white')

As observed previously, mug and hat sales are the least correlated, though their correlation coefficient is still quite high.
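
Rather than reading the weakest pair off the heatmap colors, it can be pulled out of the correlation matrix directly. Here is a minimal sketch; `least_correlated_pair` is a hypothetical helper written for this illustration, and the toy sales figures are invented:

```python
import numpy as np
import pandas as pd

def least_correlated_pair(corr: pd.DataFrame):
    """Return (name_a, name_b, coefficient) for the lowest off-diagonal entry."""
    # Keep only the strict upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    a, b = pairs.idxmin()
    return a, b, pairs.min()

# Toy daily sales: invented numbers for illustration only
toy = pd.DataFrame({
    'Kaggle Hat':     [10, 12, 14, 13, 11],
    'Kaggle Mug':     [20, 22, 21, 24, 23],
    'Kaggle Sticker': [5, 6, 7, 7, 5],
})
a, b, r = least_correlated_pair(toy.corr())
print(a, b, round(r, 3))
```

Note that "least correlated" here means the smallest coefficient, which matches the reading above only when all coefficients are positive.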

# 6. How do the sales proportions evolve?

In [None]:
temp = train.groupby(['date', 'product'])['num_sold'].sum()
# Daily share of each product; transform keeps the (date, product) index intact
temp = (temp / temp.groupby(level=0).transform('sum') * 100).round().reset_index()

fig = px.line(data_frame=temp, x='date', y='num_sold', color='product', color_discrete_sequence=color_d)
fig.update_layout(height=500, width=1000, template='plotly_white', title='Daily relative proportion of products sold')

Sales proportions seem to be stationary, so let's run an ADF test to be sure.

## Stationarity check

In [None]:
hat = temp[temp['product'] == 'Kaggle Hat']
hat.name = 'Hat'

mug = temp[temp['product'] == 'Kaggle Mug']
mug.name = 'Mug'

sticker = temp[temp['product'] == 'Kaggle Sticker']
sticker.name = 'Sticker'

In [None]:
for df in hat, mug, sticker:
    
    adf = adfuller(df['num_sold'], autolag='AIC')
    
    if adf[1] < 0.05:
        print(f'p-value is below 0.05 at {np.round(adf[1], 4)} so we reject the null hypothesis: {df.name} data is stationary')
    else:
        print(f'p-value is above 0.05 at {np.round(adf[1], 4)} so we fail to reject the null hypothesis: we cannot conclude that {df.name} data is stationary')

# 7. Seasonal decomposition

In [None]:
temp = train.groupby('date')['num_sold'].sum().reset_index()

In [None]:
sd = seasonal_decompose(temp['num_sold'], model='additive', period=365)
sd.plot();

The seasonal component is quite obvious. Over the years, sales tend to increase.

## Moving average

In [None]:
temp['30d MA'] = temp['num_sold'].rolling(window=30).mean()
temp['90d MA'] = temp['num_sold'].rolling(window=90).mean()

In [None]:
# The title is set once in update_layout below
fig = px.line(temp, x='date', y='90d MA', color_discrete_sequence=color_d)
fig.update_layout(height=400, width=1000, template='plotly_white', title='90 days moving average')

According to both the 90-day moving average and the seasonal decomposition, sales are slowly increasing over the years.
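
One detail worth keeping in mind when reading the plot: `rolling(window=90).mean()` yields `NaN` until 90 observations are available, so the curve only starts about three months into the data. A small sketch on a synthetic series (invented data with a linear trend) shows this, along with the `min_periods` parameter as one way to fill in the start:

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a linear trend (invented, not the competition data)
dates = pd.date_range('2015-01-01', periods=200, freq='D')
sales = pd.Series(np.arange(200, dtype=float), index=dates)

ma_strict = sales.rolling(window=90).mean()                  # NaN until 90 points exist
ma_partial = sales.rolling(window=90, min_periods=1).mean()  # averages whatever is available

print(ma_strict.isna().sum())   # number of rows without a full window
print(ma_strict.iloc[89])       # first average computed over a full window
print(ma_partial.iloc[0])       # partial-window average on the first day
```

The partial-window version is noisier at the start because it averages over very few points, so the strict version is arguably the safer one for judging the trend.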