<h1 style='background:#c3c087; border:0; color:black'><center>Store Sales - EDA</center></h1> 

![](https://raw.githubusercontent.com/sachinprabhu007/Kaggle_Covers/main/Kaggle_Store_Sales_EDA.jpg)

<img src="https://i.pinimg.com/originals/2e/e6/99/2ee6998e34c3e2eff7b894c66cfc5267.jpg"  width="1280" height="720">


<h1 style='background:#c3c087; border:0; color:black'><center> Introduction </center></h1> 

The aim of this competition is to forecast store sales using time series methods. Let's go through the data before predicting grocery sales.

Dataset source: Corporación Favorita, a large grocery retailer based in Ecuador


<div class="alert simple-alert">
🌻 This notebook is inspired by storytelling skills / layouts used in notebooks by <b>Karnika Kapoor
</b> <a href="https://www.kaggle.com/karnikakapoor/code"> Notebooks - Karnika Kapoor  </a> and <b> Andrada Olteanu
</b> <a href="https://www.kaggle.com/andradaolteanu/code"> Notebooks - Andrada Olteanu</a>. Please go check their work for amazing visualizations! Always fascinating! 
    
    
Grateful to both for sharing their work with everyone on Kaggle 🌻 🙏
</div>





Since we are dealing with time, let's summon Dr. Strange and go through the notebook, shall we?

<img src="https://c.tenor.com/zInCuCM3WVEAAAAC/doctor-strange-benedict-cumberbatch.gif" width="1280" height="720">


<a id='top'></a>
<div class="list-group" id="list-tab" role="tablist">
<h1 style='background:#c3c087; border:0; color:black'><center> Table of Contents </center></h1> 

[1. Importing Libraries](#1)
    
[2. Loading Data](#2)    

[3. EDA](#3)     
    
[3.1. Data Preparation and Analysis](#3.1) 
   
[3.2. Time Series Forecasting](#3.2) 


        


<a id="1"></a>
<h1 style='background:#c3c087; border:0; color:black'><center> Importing Libraries </center></h1> 


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

import warnings
warnings.filterwarnings("ignore")



In [None]:
# Set Color Palettes for the notebook
my_color_palette = ["#d81159","#8f2d56","#218380","#fbb13c","#73d2de"]
sns.palplot(sns.color_palette(my_color_palette))

# Set Style
#sns.set_style("whitegrid")
sns.despine(left=True, bottom=True)

<a id="2"></a>
<h1 style='background:#c3c087; border:0; color:black'><center> Loading Data </center></h1> 


In [None]:
df_holiday_events = pd.read_csv('../input/store-sales-time-series-forecasting/holidays_events.csv',parse_dates =['date'])
df_oil = pd.read_csv('../input/store-sales-time-series-forecasting/oil.csv',parse_dates =['date'])
df_sample_submission = pd.read_csv('../input/store-sales-time-series-forecasting/sample_submission.csv')
df_stores = pd.read_csv('../input/store-sales-time-series-forecasting/stores.csv')
df_test = pd.read_csv('../input/store-sales-time-series-forecasting/test.csv',parse_dates =['date'])
df_train = pd.read_csv('../input/store-sales-time-series-forecasting/train.csv',parse_dates =['date'])
df_transactions = pd.read_csv('../input/store-sales-time-series-forecasting/transactions.csv',parse_dates =['date'])

<a id="3"></a>
<h1 style='background:#c3c087; border:0; color:black'><center> EDA </center></h1> 


In [None]:
# helper function 

def get_df_name(df):
    name =[x for x in globals() if globals()[x] is df][0]
    return name

def check_df(dataframe):
    print('*'*30+'Name of dataframe'+'*'*30)
    print(get_df_name(dataframe))
    print('*'*30+'Shape of dataframe'+'*'*30)
    print(dataframe.shape)
    print('\n'+'*'*30+'Head of dataframe'+'*'*30)
    print(dataframe.head())
    print('\n'+'*'*30+'Tail of dataframe'+'*'*30)
    print(dataframe.tail())
    print('\n'+'*'*30+'Concise summary of dataframe'+'*'*30)
    dataframe.info()  # info() prints its summary directly and returns None
    print('\n'+'*'*30+'Check for missing values'+'*'*30)
    print(dataframe.isnull().sum())
    print('\n'+'*'*30+'Check dataframe for numeric and categorical variables'+'*'*30)
    
    numeric_variables = dataframe.select_dtypes(include=[np.number])
    categorical_variables = dataframe.select_dtypes(exclude=[np.number])

    print('Numeric variables in the given dataframe : ',numeric_variables.shape[1])
    print('Categorical variables in the given dataframe:',categorical_variables.shape[1])


In [None]:
check_df(df_holiday_events)


In [None]:
# rename the column name for oil dataframe.
df_oil.rename(columns={'dcoilwtico':'oil_price'}, inplace=True)
check_df(df_oil)

In [None]:
check_df(df_sample_submission)

In [None]:
check_df(df_stores)

In [None]:
check_df(df_test)

In [None]:
check_df(df_train)


In [None]:
check_df(df_transactions)


<a id="3.1"></a>
<h1 style='background:#c3c087; border:0; color:black'><center> 🚧Data Preparation and Analysis🚧 </center></h1> 


Most of our tables have no missing values (the oil price column is worth a second look in the checks above), so we will move forward and analyze the data in detail.

<img src="https://www.quirkybyte.com/wp-content/uploads/2018/10/66e3a30f3843.gif" width="1280" height="720">


In [None]:
# Let's check the store data

df_stores.head()

In [None]:
df_stores.type.value_counts()


In [None]:
df_stores.state.value_counts()

In [None]:
df_stores.city.value_counts()

In [None]:
plt.figure(figsize=(20, 15))
sns.countplot(data=df_stores, x='type', order=df_stores.type.value_counts().index,palette=my_color_palette)
plt.title('Number of Stores based on Type',fontweight="bold")
plt.xlabel('Type', fontsize=18)
plt.ylabel('Count', fontsize=16)
plt.show()


In [None]:
plt.figure(figsize=(20, 15))

sns.countplot(data=df_stores, y='city', 
              order=df_stores.city.value_counts().index,
              palette=my_color_palette,
              )
plt.title('Number of Stores based on Cities',fontweight="bold")
plt.ylabel('City', fontsize=18)
plt.xlabel('Count', fontsize=16)


In [None]:
plt.figure(figsize=(20, 15))

sns.countplot(data=df_stores, y='state', 
              order=df_stores.state.value_counts().index,
              palette=my_color_palette,
              )
plt.title('Number of Stores based on State',fontweight="bold")
plt.ylabel('State', fontsize=18)
plt.xlabel('Count', fontsize=16)


In [None]:
# Data preparation
# Let's combine the dataframes

#https://www.kaggle.com/madhuri15/store-sales-time-series-analysis
    
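# Let's merge oil data into the train and test data
# Note: merge() defaults to an inner join, so dates missing from df_oil are dropped
train = df_train.merge(df_oil, on='date')
test = df_test.merge(df_oil, on='date') 
    
# The inner join with holidays keeps only dates listed in holidays_events.csv
train = train.merge(df_holiday_events[['date', 'type', 'transferred']], on='date')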
train = train.merge(df_stores, on='store_nbr')
train.rename(columns={'type_x':'holiday_type', 'type_y':'store_type'}, inplace=True)
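
Because `merge` defaults to an inner join, the merges above keep only the dates that appear in both tables. A minimal sketch of the difference, using made-up toy frames (`sales` and `holidays` are illustrative names, not the competition tables):

```python
import pandas as pd

# Toy data: three sales days, only one of which is a holiday
sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "sales": [10.0, 12.0, 9.0],
})
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01"]),
    "type": ["Holiday"],
})

inner = sales.merge(holidays, on="date")             # default how="inner"
left = sales.merge(holidays, on="date", how="left")  # keeps every sales row

print(len(inner), len(left))  # 1 3
```

If non-holiday dates should stay in `train`, passing `how='left'` is one option; the unmatched rows then carry `NaN` in the holiday columns.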



In [None]:
train['Year'] = train.date.dt.year
train['Year-Month'] = train['date'].dt.strftime('%Y-%m')
train['Month'] = train.date.dt.month
train['Day'] = train.date.dt.day

In [None]:
train

In [None]:
test

In [None]:
check_df(train)

In [None]:
check_df(test)

In [None]:
train.family.value_counts()

In [None]:
train.family.unique()

In [None]:
plt.figure(figsize=(20, 15))

sns.barplot(x='sales',y='family',data=train)
plt.title('Average Sales by Product Family',fontweight="bold")
plt.ylabel('Family', fontsize=18)
plt.xlabel('Sales', fontsize=16)


In [None]:
plt.figure(figsize=(20, 15))

sns.barplot(x='sales',y='state',data=train, ci=None)
plt.title('Average Sales by State',fontweight="bold")
plt.ylabel('State', fontsize=18)
plt.xlabel('Sales', fontsize=16)


In [None]:
plt.figure(figsize=(20, 15))

sns.barplot(x='sales',y='city',data=train, ci=None)
plt.title('Average Sales by City',fontweight="bold")
plt.ylabel('City', fontsize=18)
plt.xlabel('Sales', fontsize=16)


In [None]:
train

<a id="3.2"></a>
<h1 style='background:#c3c087; border:0; color:black'><center> Time Series Forecasting </center></h1> 

<img src="https://i.pinimg.com/originals/31/53/2d/31532d7d378053de3b8bf23c6e7bfae3.gif" width="1280" height="720">

### What is Time Series? 

A time series is a sequence of data points recorded at consistent intervals of time.


### What is the difference between Prediction and Forecasting?


Prediction is concerned with estimating the outcomes for unseen data. For this purpose, we fit a model to a training data set, which results in an estimator $\hat{f}(x)$ that can make predictions for new samples $x$.

Forecasting is a sub-discipline of prediction in which we are making predictions about the future, on the basis of time-series data. Thus, the only difference between prediction and forecasting is that we consider the temporal dimension. 


An estimator for forecasting has the form $\hat{f}(x_1, \dots, x_t)$, where $x_1, \dots, x_t$ are historic measurements at time points $1, \dots, t$, while the estimate relates to time point $t+1$ or some other time in the future. Since the model depends on previous observations $x_i$, this is called an autoregressive model.
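
To make the autoregressive idea concrete, here is a minimal sketch that assumes the simplest possible case: a linear estimator using only the most recent observation, fitted by least squares. The helper name `fit_ar1` and the toy series are illustrative assumptions, not part of the competition data or of any library.

```python
import numpy as np

def fit_ar1(series):
    """Fit x[t+1] ~ a * x[t] + b by ordinary least squares (hypothetical helper)."""
    x = np.asarray(series, dtype=float)
    past, future = x[:-1], x[1:]
    # Design matrix: the lagged value plus an intercept column
    design = np.column_stack([past, np.ones_like(past)])
    (a, b), *_ = np.linalg.lstsq(design, future, rcond=None)
    return a, b

# Toy series that follows x[t+1] = 0.5 * x[t] + 10 exactly
toy = [100.0]
for _ in range(9):
    toy.append(0.5 * toy[-1] + 10)

a, b = fit_ar1(toy)
forecast = a * toy[-1] + b  # one-step-ahead forecast for time t+1
print(round(a, 3), round(b, 3), round(forecast, 3))
```

Real sales data would need more lags and extra features, but the structure (past values in, next value out) stays the same.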



### What is the difference between Time Series Analysis and Time Series Forecasting?

- Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. 

- Time series forecasting is the use of a model to predict future values based on previously observed values.



### Types of Time Series Data

There are four main types of pattern in time-series data:

1. Seasonal - The data repeat a pattern over a specific, fixed period.
2. Trend - The values increase or decrease in a reasonably predictable way.
3. Cyclical - The values rise and fall without a fixed frequency, often due to economic conditions.
4. Random - The data do not fall into any of the three categories above. They are totally irregular.
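
As a rough illustration of how these patterns combine, the sketch below builds a synthetic daily series from a trend, a weekly seasonal pattern and random noise, assuming an additive model. All numbers are made up, and the cyclical component is left out for brevity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=28, freq="D")
t = np.arange(len(dates))

trend = 100 + 2.0 * t                      # steady upward drift
seasonal = 10 * np.sin(2 * np.pi * t / 7)  # pattern repeating every 7 days
noise = rng.normal(0, 1, size=len(dates))  # irregular component

components = pd.DataFrame(
    {"trend": trend, "seasonal": seasonal, "random": noise}, index=dates
)
# Additive model: observed value = trend + seasonal + random
components["observed"] = components[["trend", "seasonal", "random"]].sum(axis=1)
print(components.head())
```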


### What is Seasonality ? 

Seasonality refers to periodic fluctuations that repeat at a fixed interval. For example, electricity consumption is high during the day and low at night, and online sales increase around Christmas before slowing down again.
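
One simple way to look for seasonality is to average the values by position in the suspected cycle. The sketch below assumes a weekly cycle and uses made-up sales in which weekends sell more; `toy_sales` is an illustrative name, not a column from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up daily sales: weekends (Sat/Sun) sell 50 units more than weekdays
dates = pd.date_range("2021-01-04", periods=28, freq="D")  # starts on a Monday
toy_sales = pd.Series(
    np.where(dates.dayofweek >= 5, 150.0, 100.0), index=dates, name="sales"
)

# Average sales per day of week (0 = Monday ... 6 = Sunday)
weekly_profile = toy_sales.groupby(toy_sales.index.dayofweek).mean()
print(weekly_profile)
```

A flat profile would suggest no weekly seasonality; clear differences between days suggest the opposite.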


### What is Stationarity?

A time series is said to be stationary if its statistical properties do not change over time. In other words, it has a constant mean and variance, and the covariance between two observations depends only on the lag between them, not on when they were observed.
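
A small sketch of the contrast, assuming synthetic data: white noise is stationary, while a random walk (the running sum of that noise) is not. The helper `half_means` is a hypothetical rough check, not a formal test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
steps = pd.Series(rng.normal(0, 1, size=1000))

white_noise = steps           # stationary: constant mean and variance
random_walk = steps.cumsum()  # non-stationary: variance grows with time

def half_means(series):
    """Mean of the first and second half (hypothetical helper for a rough check)."""
    mid = len(series) // 2
    return series.iloc[:mid].mean(), series.iloc[mid:].mean()

print("white noise :", half_means(white_noise))
print("random walk :", half_means(random_walk))

# First differencing turns the random walk back into its stationary steps
differenced = random_walk.diff().dropna()
```

Differencing is a common first step when a series looks non-stationary, which is why the last line is included.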



In [None]:
plt.figure(figsize=(20, 15))
sns.boxenplot(x = "Year", y = "sales", 
              data = train,palette=my_color_palette)
plt.title('Distribution of Sales by Year',fontweight="bold")
plt.xlabel('Year', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.show()

In [None]:
plt.figure(figsize=(20, 15)) 
# Pass x, y and hue as keywords; recent seaborn versions reject positional x/y
sns.lineplot(x='Month', y='sales', hue='Year', data=train, palette=my_color_palette)
plt.title('Seasonal plot of Sales', fontsize = 20, loc='center', fontdict=dict(weight='bold'))
plt.xlabel('Month', fontsize = 16, fontdict=dict(weight='bold'))
plt.ylabel('Sales', fontsize = 16, fontdict=dict(weight='bold'))
plt.show()

In [None]:
ts=train.groupby(['date'])["sales"].sum()

### What is Rolling Mean and Rolling Standard Deviation?

Rolling is a very useful operation for time series data. It means creating a window of a specified size and performing a calculation on the data inside it; the window then rolls through the data one step at a time. The figure below explains the concept of rolling.


<img src="https://miro.medium.com/max/820/1*jqix0WWK_zDf5iIICpMVjw.png" >
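
The same idea in a few lines of pandas, as a minimal sketch on a made-up series (`toy` is an illustrative name). With a window of 3, the first two positions have too few values and come out as `NaN`.

```python
import pandas as pd

toy = pd.Series([10, 20, 30, 40, 50], dtype=float)

# Window of 3: each output uses the current value and the two before it
rolling_mean = toy.rolling(window=3).mean()
rolling_std = toy.rolling(window=3).std()

print(rolling_mean.round(2).tolist())  # [nan, nan, 20.0, 30.0, 40.0]
print(rolling_std.round(2).tolist())   # [nan, nan, 10.0, 10.0, 10.0]
```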


### Why do we use Rolling or Moving  Standard Deviation?

Moving standard deviation is a statistical measure of market volatility. It makes no prediction of market direction, but it may serve as a confirming indicator. We specify the number of periods to use, and the standard deviation of prices is computed around the moving average of those prices. Applied to our sales series, it shows how much daily sales fluctuate within each window.

### What is Market volatility?

In statistical terms, volatility is the standard deviation of a market or security's annualised returns over a given period - essentially the rate at which its price increases or decreases. If the price fluctuates rapidly in a short period, hitting new highs and lows, it is said to have high volatility.
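
As a hedged illustration of the definition, the sketch below computes a moving standard deviation of period-over-period returns on made-up prices. It is not annualised, and the window of 3 is an arbitrary assumption.

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 108.0])  # made-up prices

returns = prices.pct_change().dropna()         # period-over-period returns
rolling_vol = returns.rolling(window=3).std()  # moving standard deviation
print(rolling_vol)
```

Larger values of `rolling_vol` mean the returns inside that window swing more widely.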




Documentation :
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

Credits : 

1. https://towardsdatascience.com/time-series-analysis-resampling-shifting-and-rolling-f5664ddef77e
2. https://www.danielstrading.com/education/technical-analysis-learning-center/moving-standard-deviation
3. https://www.fidelity.com.sg/beginners/what-is-volatility/market-volatility#:~:text=In%20statistical%20terms%2C%20volatility%20is,said%20to%20have%20high%20volatility.

In [None]:
# credits : https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts

plt.figure(figsize=(20, 15)) 
plt.plot(ts.rolling(window=12,center=False).mean(),label='Rolling Mean');
plt.plot(ts.rolling(window=12,center=False).std(),label='Rolling Standard Deviation');
plt.legend();

In [None]:
import statsmodels.api as sm
# multiplicative (newer statsmodels releases use `period` instead of `freq`)
res = sm.tsa.seasonal_decompose(ts.values,period=12,model="multiplicative")
#plt.figure(figsize=(16,12))
fig = res.plot()
#fig.show()

In [None]:
# Additive model (newer statsmodels releases use `period` instead of `freq`)
res = sm.tsa.seasonal_decompose(ts.values,period=12,model="additive")
#plt.figure(figsize=(16,12))
fig = res.plot()
#fig.show()

In [None]:
from statsmodels.tsa.stattools import adfuller, acf, pacf,arma_order_select_ic

In [None]:
# Stationarity tests
def test_stationarity(timeseries):
    
    #Perform Dickey-Fuller test:
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)

test_stationarity(ts)
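
To read the output, recall that the null hypothesis of the Dickey-Fuller test is that the series has a unit root, i.e. is non-stationary. The sketch below shows one way to turn the numbers into a verdict; `interpret_adf` is a hypothetical helper (not part of statsmodels) and the example values are made up, not results from this dataset.

```python
def interpret_adf(test_statistic, p_value, critical_values, alpha=0.05):
    """Turn ADF numbers into a verdict (hypothetical helper, not part of statsmodels)."""
    # Null hypothesis of the ADF test: the series has a unit root (non-stationary)
    reject_null = p_value < alpha
    # The statistic must be more negative than a critical value to reject at that level
    passed_levels = [lvl for lvl, cv in critical_values.items() if test_statistic < cv]
    verdict = "likely stationary" if reject_null else "cannot rule out non-stationarity"
    return verdict, passed_levels

# Made-up numbers purely for illustration, not results from this dataset
example_critical = {"1%": -3.43, "5%": -2.86, "10%": -2.57}
print(interpret_adf(-3.10, 0.026, example_critical))
# ('likely stationary', ['5%', '10%'])
```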

<h1 style='background:#c3c087; border:0; color:black'><center>  ✏✒📚 References ✏✒📚 </center></h1> 


1. https://en.wikipedia.org/wiki/Time_series

2. https://datascienceblog.net/post/machine-learning/forecasting_vs_prediction/

3. https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775

4. https://medium.com/vitrox-publication/what-is-a-time-series-forecasting-d020d657f11a


<h1 style='background:#c3c087; border:0; color:black'><center> 🚧Work in Progress🚧 </center></h1> 


**🌻 This notebook is handcrafted with lots of love. If you have learnt something new or found it helpful, please upvote 👍🌻**

<h1 style='background:#c3c087; border:0; color:black'><center> 🌻 Thank you 🌻 </center></h1> 
