# **Problem Statement**
In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

**How much camping gear will one store sell each month in a year?** 

## Evaluation Metrics
This competition uses the Weighted Root Mean Squared Scaled Error (WRMSSE).
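As a rough illustration of the metric, the sketch below computes the scaled error (RMSSE) for a single series and then combines per-series scores into a weighted sum. The function names `rmsse` and `wrmsse` and the toy numbers are made up for this example. How the competition sets the weights and treats leading zero-sales periods is not covered here, so treat it as a simplified sketch rather than the official scoring code.

```python
import math

def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error for one series.

    The forecast MSE is scaled by the mean squared one-step
    difference of the training history (a naive-forecast baseline).
    """
    n = len(train)
    h = len(actual)
    scale = sum((train[t] - train[t - 1]) ** 2 for t in range(1, n)) / (n - 1)
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / h
    return math.sqrt(mse / scale)

def wrmsse(scores, weights):
    """Weighted sum of per-series RMSSE scores (weights assumed to sum to 1)."""
    return sum(w * s for w, s in zip(weights, scores))

# Toy example: two hypothetical series with made-up weights
score_a = rmsse([1, 2, 3, 4], [5, 6], [5, 8])
score_b = rmsse([0, 2, 0, 2], [1, 1], [1, 1])
print(wrmsse([score_a, score_b], [0.5, 0.5]))
```

A score below 1 for a series means the forecast beats the naive one-step baseline on the training history, which is what makes the metric comparable across items with very different sales volumes.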

## Data
* **calendar.csv** - Contains information about the dates on which the products are sold.
* **sales_train_validation.csv** - Contains the historical daily unit sales data per product and store [d_1 - d_1913]
* **sample_submission.csv** - The correct format for submissions. Reference the Evaluation tab for more info.
* **sell_prices.csv** - Contains information about the price of the products sold per store and date.
* **sales_train_evaluation.csv** - Includes sales [d_1 - d_1941] (labels used for the Public leaderboard)

# Content:
1. Loading Libraries
2. Importing Files
3. Summary Statistics
4. Analysis of Selling Prices
5. Events
6. SNAP Days
7. Calendar View of SNAP Days
8. Time Series Analysis
9. Impact of Events and SNAP Days

#  Loading Libraries

In [None]:
!pip install calplot
!pip install chart_studio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
import calplot
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
import chart_studio.plotly as py
import plotly.graph_objs as go

# Importing Files

In [None]:

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
calendar = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/calendar.csv')
sales_train_validation = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv')
sample_submission = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sample_submission.csv')
sell_prices = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sell_prices.csv')
sales_train_evaluation = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_evaluation.csv')

# Summary Statistics

First, I loaded the sales_train_validation dataset, which is our main dataset and contains the historical daily unit sales per product and store for 1,913 days starting 29-Jan-2011.

In [None]:
sales_train_validation.head()

In [None]:
# DataFrame.shape returns (rows, columns)
row_train, column_train = sales_train_validation.shape
row_train, column_train

The sales_train_validation dataset has 30,490 rows and 1,919 columns.

In [None]:
# Count the ids that contain 'validation' (sum of the boolean mask, not its length)
print(sales_train_validation.id.str.contains('validation').sum())
print(len(sales_train_validation.id.unique()))

In [None]:
print(len(sales_train_validation.id.unique()))
print(len(sales_train_validation.item_id.unique()))
print(sales_train_validation.dept_id.unique())
print(sales_train_validation.cat_id.unique())
print(sales_train_validation.store_id.unique())
print(sales_train_validation.state_id.unique())

There are 3,049 unique item_id values, 7 unique dept_id values, 3 unique cat_id values, 10 unique store_id values and 3 unique state_id values.
The data covers three US states: CA (California), TX (Texas) and WI (Wisconsin).

Let's replace the "d_" day labels with the corresponding calendar dates for time series analysis.

In [None]:
df = sales_train_validation
df1 = sales_train_validation.set_index('id').drop(['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], axis=1).transpose()
df1.head()

In [None]:
n = len(calendar) - len(df1)
df2 = calendar[['date', 'd']].set_index('d').iloc[:-n]
df3 = pd.concat([df2,df1], axis=1).set_index('date')
df3
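The same mapping can be sanity-checked without the calendar file, assuming the days are consecutive and that d_1 is 29-Jan-2011 as stated above. The helper `d_to_date` and the constant `FIRST_DAY` are hypothetical names used only in this sketch.

```python
from datetime import date, timedelta

FIRST_DAY = date(2011, 1, 29)  # assumed date of d_1, taken from the text above

def d_to_date(d_label):
    """Convert a label such as 'd_15' to a calendar date (assumes consecutive days)."""
    day_number = int(d_label.split('_')[1])
    return FIRST_DAY + timedelta(days=day_number - 1)

# First and last day of the validation training window
print(d_to_date('d_1'))
print(d_to_date('d_1913'))
```

Comparing a few of these values with the `date` column of `calendar` is a quick way to confirm that the join on `d` lined up as intended.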

Visualization of the time series of 20 random items to look for patterns.

In [None]:
fig = plt.figure(figsize=(16,30))
for i,j in zip(df3.sample(n=20, axis=1), range(20)):
    ax=plt.subplot(10,2,j + 1) 
    df3[[i]].plot(ax=ax)
plt.show()

From the above we can see that some of the items have been selling for the complete period of the dataset, but some were introduced later and some were discontinued.

## Let's visualize the dataset

In [None]:
df4 = pd.concat([df[['id','item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']].set_index('id'),df3.transpose()], axis=1)
df4

In [None]:
unique_units = df4.state_id.value_counts()
print('The count of total number of unique items is:\n', unique_units)
unique_units.plot(title='Distribution of Total items by State', kind='pie', autopct='%1.1f%%', figsize=(10,6))
plt.show()

CA accounts for 40% of the rows (item-store combinations) because it has four stores, while TX and WI, with three stores each, account for 30% each.

In [None]:
Total_Sales = df4.groupby('state_id').sum().sum(axis=1)
Total_Sales.plot(title='Distribution of Total sales by State', kind='pie', autopct='%1.1f%%', figsize=(10,6))
plt.show()

CA, TX and WI account for 43.6%, 28.8% and 27.6% of total sales, respectively.

In [None]:
df4.groupby(['cat_id']).sum().transpose().sum().plot(title='Sales Distribution by category', kind='pie', autopct='%1.1f%%',
        shadow=True, figsize=(10,6))
plt.show()

The categories FOODS, HOUSEHOLD and HOBBIES account for 68.6%, 22.0% and 9.3% of total sales, respectively.

In [None]:
cat_dist = df4.groupby(['cat_id','state_id']).sum().sum(axis=1).unstack('cat_id')
(cat_dist.transpose() / cat_dist.transpose().sum()).transpose().plot(kind='bar')
plt.show()

Each state has a similar pattern of sales by category, with FOODS contributing the major share. WI has the highest contribution from the FOODS category in comparison to the other states.

In [None]:
df4.groupby(['dept_id']).sum().transpose().sum().plot(title='Sales Distribution by sub category', kind='pie', autopct='%1.1f%%',
        shadow=True, figsize=(10,6))
plt.show()

* FOODS_3 has the highest contribution to sales with a 49.3% share, followed by FOODS_2 and FOODS_1
* HOUSEHOLD_1 has the highest contribution to sales in the HOUSEHOLD category with a 17.5% share, followed by HOUSEHOLD_2
* HOBBIES_1 has the highest contribution to sales in the HOBBIES category with an 8.5% share, followed by HOBBIES_2

In [None]:
dept_dist = df4.groupby(['dept_id','state_id']).sum().sum(axis=1).unstack('dept_id')
(dept_dist.transpose() / dept_dist.transpose().sum()).transpose().iplot(kind='bar')

Each state has a similar pattern of sales by sub category, with FOODS_3 contributing the major share. TX has the highest contribution from the FOODS_3 sub category in comparison to the other states.

In [None]:
df4.groupby(['store_id']).sum().transpose().sum().plot(title='Sales Distribution by store', kind='pie', autopct='%1.1f%%',
        shadow=True, figsize=(10,6))
plt.show()

In [None]:
store_state_dist = df4.groupby(['store_id','state_id']).sum().sum(axis=1).unstack('store_id')
(store_state_dist.transpose() / store_state_dist.transpose().sum()).transpose().iplot(
    kind='bar', title='Sales distribution of stores in each state')
plt.show()

* In CA, CA_3 has the highest share of sales with a 39% share, followed by CA_1, CA_2 and CA_4
* In TX, TX_2 has the highest share of sales with a 38.17% share, followed by TX_3 and TX_1
* In WI, WI_2 has the highest share of sales with a 36.11% share, followed by WI_3 and WI_1

In [None]:
dept_store_dist = df4.groupby(['dept_id','store_id']).sum().sum(axis=1).unstack('dept_id')
(dept_store_dist.transpose() / dept_store_dist.transpose().sum()).transpose().iplot(
    kind='bar', title='Sales Distribution of Sub Categories by Store_ID')
plt.show()

* FOODS_3 has the highest contribution to sales in all the stores; the highest is in WI_3, where FOODS_3 contributes 54.76% of the total sales.
* All the stores have a similar pattern of sales by sub category except CA_2, where the contribution of FOODS_1 is greater than that of FOODS_2

# Analysis of Selling Prices

In [None]:
sell_prices['wm_yr_wk'] = sell_prices['wm_yr_wk'].astype(str)

# wm_yr_wk is sliced as: first character, then a two-digit year, then a two-digit week.
# The first character is kept under the original column name 'month', but it appears
# to be a constant prefix rather than a calendar month, so it is not used below.
sell_prices['month'] = sell_prices['wm_yr_wk'].str[0:1]
sell_prices['year'] = '20' + sell_prices['wm_yr_wk'].str[1:3]
sell_prices['week'] = sell_prices['wm_yr_wk'].str[3:5]
# n must be passed by keyword in recent pandas versions
sell_prices['state_id'] = sell_prices['store_id'].str.split('_', n=1).str[0]
sell_prices['cat_id'] = sell_prices['item_id'].str.split('_', n=1).str[0]
sell_prices['dept_id'] = sell_prices['item_id'].str.split('_').str[0] + '_' + sell_prices['item_id'].str.split('_').str[1]

sell_prices.drop('wm_yr_wk', axis=1, inplace=True)
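The slicing can be checked on a single value. This sketch assumes, as the cell above does, that wm_yr_wk is a five-character code whose second and third characters are the year and whose last two are the week. The helper `split_wm_yr_wk` and the sample value 11325 are illustrative only.

```python
def split_wm_yr_wk(code):
    """Split a week code such as 11325 into its assumed parts."""
    code = str(code)
    if len(code) != 5:
        raise ValueError(f"expected a 5-character code, got {code!r}")
    return {
        'prefix': code[0],           # leading character, meaning not documented here
        'year': '20' + code[1:3],    # two-digit year expanded to four digits
        'week': code[3:5],           # two-digit week of the year
    }

print(split_wm_yr_wk(11325))
```

Running the helper on the minimum and maximum of the original column is a cheap way to confirm that the year and week ranges look plausible before grouping on them.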

In [None]:
sell_prices.head()

In [None]:
sell_prices['sell_price'].plot(kind='hist', figsize=(10,5), bins=20)
plt.show()

* Almost all of the items have a selling price in the range of \$0-\$20
* Most of the items have a selling price in the range of \$0-\$10

In [None]:
sell_prices.groupby(['year','state_id']).mean().unstack('state_id').boxplot(
    figsize=(10,3), vert=False)
plt.show()

* The average selling price across all items is highest in WI, followed by CA and TX
* WI has the narrowest range of selling prices, whereas TX has the widest
* The average selling price across all items ranges between \$4.15 and \$4.50

In [None]:
sell_prices.groupby(['year','store_id']).mean().unstack('store_id').boxplot(figsize=(10,7), vert=False)
plt.show()

* Selling prices vary within each store
* Store WI_1 has the highest average selling price, and also the narrowest range of average selling prices
* Store TX_2 has the lowest average selling price and the widest range of average selling prices
* Stores WI_1, WI_2, WI_3, TX_3 and CA_2 sell mostly costly items
* Stores TX_1, TX_2, CA_1, CA_3 and CA_4 sell both costly and cheap items

In [None]:
sell_prices.groupby(['year','cat_id']).mean().unstack('cat_id').boxplot(figsize=(12,3), vert=False)
plt.show()

* HOUSEHOLD is the most expensive category, with the highest average selling price and a narrow range
* FOODS is the cheapest category, with the lowest average selling price and a narrow range

In [None]:
sell_prices.groupby(['year','dept_id']).mean().unstack('dept_id').boxplot(figsize=(12,4), vert=False)
plt.show()

* In the HOBBIES category, the HOBBIES_1 sub category is the more expensive and HOBBIES_2 the cheaper
* In the HOUSEHOLD category, the HOUSEHOLD_2 sub category is the more expensive and HOUSEHOLD_1 the cheaper
* In the FOODS category, the FOODS_2 sub category is the most expensive, followed by FOODS_3 and FOODS_1

## Variation of Selling Prices across different Time Frames 

In [None]:
sell_prices.groupby(['week','year']).mean().unstack('week').boxplot(figsize=(10,12), vert=False)
plt.show()

* The average selling price is higher at the beginning and in the latter part of the year
* In the middle of the year, between weeks 15 and 28, there is a dip in the average selling price
* The range of selling prices is narrowest in weeks 5 to 8 and 25 to 30

In [None]:
sell_prices.groupby(['week','year']).mean().unstack('year').boxplot(figsize=(14,5), vert=False)
plt.show()

Selling prices are consolidating: the average selling price rises from year to year, while the range of selling prices narrows.

# Events

In [None]:
calendar.head()

In [None]:
event_1 = pd.merge(calendar[['date','weekday','month','year','d']], 
                   calendar[['d','event_name_1','event_type_1']].dropna(), on='d')
print(event_1)
print(f"There are {len(event_1)} events in the calendar dataset")

In [None]:
event_1.groupby(['event_type_1','year'])['event_name_1'].size().unstack(
    'event_type_1').iplot(kind='barh', title='Event Type 1')
plt.show()

On average there are 6 cultural, 10 national, 10 religious and 3 sporting events per year.

In [None]:
event_1.groupby(['month','year'])['event_name_1'].size().unstack('year').iplot(
    kind='bar', title='Events 1 by Month')
plt.show()

* Month 2 (February) has the highest number of events, and months 8 and 9 (August and September) have the lowest.

In [None]:
event_1.groupby(['weekday','year'])['event_name_1'].size().unstack('year').iplot(
    kind='bar', title='Event 1 by day of the Week')
plt.show()

* Most of the events fall on Sunday and Monday
* Friday and Saturday have the fewest events

In [None]:
event_2 = pd.merge(calendar[['date','weekday','month','year','d']], 
                   calendar[['d','event_name_2','event_type_2']].dropna(), on='d')
print(event_2)
print("There are 5 Event 2 in the calendar dataset")

In [None]:
event_2.groupby(['event_type_2','year'])['event_name_2'].size().unstack(
    'event_type_2').iplot(kind='barh', title='Event 2')
plt.show()

* 2011, 2013, 2014 & 2016 had one cultural event 2
* 2014 had one religious event 2

In [None]:
event_2.groupby(['month','year'])['event_name_2'].size().unstack('year').iplot(
    kind='barh', title='Event 2 by month of the Year')
plt.show()

In [None]:
event_2.groupby(['weekday','year'])['event_name_2'].size().unstack('year').iplot(
    kind='barh', title='Event 2 by day of the week')
plt.show()

All the Event 2 occurrences fell on a Sunday, in every year.

In [None]:
pd.merge(calendar[['date','weekday','month','year','d']], 
                   calendar[['d','event_name_1','event_type_1','event_name_2','event_type_2']].dropna(), on='d')

In the calendar dataset there are 5 days on which both an event 1 and an event 2 occur.

# SNAP Days

In [None]:
snap = pd.merge(calendar[['date','weekday','month','year','d']], 
                   calendar[['d','snap_CA','snap_TX','snap_WI']].loc[~(calendar[['snap_CA','snap_TX','snap_WI']]==0).all(axis=1)], 
                on='d')
snap

In [None]:
snap_CA = snap.groupby(['snap_CA','year']).size()[1].to_frame().reset_index().rename(columns={0:'CA'})
snap_TX = snap.groupby(['snap_TX','year']).size()[1].to_frame().reset_index().rename(columns={0:'TX'})
snap_WI = snap.groupby(['snap_WI','year']).size()[1].to_frame().reset_index().rename(columns={0:'WI'})

pd.merge(pd.merge(snap_CA,snap_TX,on='year'),snap_WI,on='year').set_index(
    'year').iplot(kind='barh', title='SNAP Days')
plt.show()

In [None]:
snap_CA_1 = snap.groupby(['snap_CA','month']).size()[1].to_frame().reset_index().rename(columns={0:'CA'})
snap_TX_1 = snap.groupby(['snap_TX','month']).size()[1].to_frame().reset_index().rename(columns={0:'TX'})
snap_WI_1 = snap.groupby(['snap_WI','month']).size()[1].to_frame().reset_index().rename(columns={0:'WI'})

pd.merge(pd.merge(snap_CA_1,snap_TX_1,on='month'),snap_WI_1,on='month').set_index(
    'month').iplot(kind='barh', title='SNAP Days by month of the year')
plt.show()

In [None]:
snap.groupby(['snap_CA','month','year']).size()[1].unstack('year').iplot(
    kind='bar', title='CA SNAP Days by month')
plt.show()

In [None]:
snap.groupby(['snap_TX','month','year']).size()[1].unstack('year').iplot(
    kind='bar', title='TX SNAP Days by month')
plt.show()

In [None]:
snap.groupby(['snap_WI','month','year']).size()[1].unstack('year').iplot(
    kind='bar', title='WI SNAP Days by month')
plt.show()

There are 10 SNAP Days every month for all the states. 

In [None]:
snap.groupby(['snap_CA','weekday','year']).size()[1].unstack('year').iplot(
    kind='barh', title='CA SNAP Days by day of the Week')
plt.show()

In [None]:
snap.groupby(['snap_TX','weekday','year']).size()[1].unstack('year').iplot(
    kind='barh', title='TX SNAP Days by day of the Week')
plt.show()

In [None]:
snap.groupby(['snap_WI','weekday','year']).size()[1].unstack('year').iplot(
    kind='barh', title='WI SNAP Days by day of the Week')
plt.show()

In [None]:
pd.merge(calendar[['date','weekday','month','year','d']], 
                   calendar[['d','snap_CA','snap_TX','snap_WI']].loc[
                       (calendar[['snap_CA','snap_TX','snap_WI']]==1).all(axis=1)], 
                on='d')

# Calendar View of SNAP Days in each of the states

## CA

In [None]:
days = list(pd.to_datetime(calendar.date))
events = pd.Series(list(calendar.snap_CA), index=days)

calplot.calplot(events, cmap='RdBu', colorbar=False)
plt.show()

## TX

In [None]:
days = list(pd.to_datetime(calendar.date))
events = pd.Series(list(calendar.snap_TX), index=days)

calplot.calplot(events, cmap='RdBu', colorbar=False)
plt.show()

## WI

In [None]:
days = list(pd.to_datetime(calendar.date))
events = pd.Series(list(calendar.snap_WI), index=days)

calplot.calplot(events, cmap='RdBu', colorbar=False)
plt.show()

* In all the states, SNAP days fall in the first half of the month
* Within each state, SNAP days fall on the same days of the month, every month

# Time Series Analysis

In [None]:
cummulative_sales = df3.transpose().sum().to_frame().rename(columns={0:'cummulative_sales'})
cummulative_sales.head()

In [None]:
cummulative_sales.iplot(title='Time Series Plots - Cummulative Sales')
plt.show()

* The time series of total daily sales (`cummulative_sales`, summed over all items and stores) shows an increasing trend with seasonality.
* There is one day every year on which sales are almost 0: 25th December.

I am assuming that the trend of the time series is linear, and therefore use additive decomposition to split the series into its components (level, trend, seasonality and noise) for further analysis.
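To make the additive model concrete: it assumes each observation is the sum of a trend, a seasonal and a residual term, y_t = T_t + S_t + R_t. The dependency-free sketch below applies the classical moving-average procedure to a synthetic series with a weekly pattern. It only illustrates the idea and is not a reimplementation of `seasonal_decompose`, whose details may differ; the function name `additive_decompose` and the toy data are assumptions made for this example.

```python
def additive_decompose(series, period=7):
    """Classical additive decomposition for an odd seasonal period.

    Returns (trend, seasonal, resid) as lists; trend and resid are None
    at the edges where the centered moving average is undefined.
    """
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2
    n = len(series)

    # Trend: centered moving average over one full period
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period

    # Seasonal: average detrended value per position in the cycle, centered on zero
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period
    seasonal = [raw[t % period] - offset for t in range(n)]

    # Residual: whatever trend and seasonality do not explain
    resid = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
             for t in range(n)]
    return trend, seasonal, resid

# Synthetic data: linear trend plus a repeating weekly pattern
weekly = [10, -5, -5, -5, -5, 0, 10]
demo = [100 + 2 * t + weekly[t % 7] for t in range(28)]
demo_trend, demo_seasonal, demo_resid = additive_decompose(demo, period=7)
print(demo_seasonal[:7])
```

On this noise-free toy series the seasonal component recovers the weekly pattern and the residual is zero; on real sales data the residual is where one-off effects such as holidays tend to show up.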

In [None]:
result = seasonal_decompose(cummulative_sales[
    'cummulative_sales'].values, period=7, model='additive')
plt.rcParams.update({'figure.figsize': (12,8)})
result.plot().suptitle('Additive Decomposition', fontsize=22)
plt.show()

* The time series has an increasing trend
* There is a strong weekly seasonality

In [None]:
df3.index = pd.to_datetime(df3.index)
df5 = pd.concat([df[['id','item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']].set_index('id'),
                 df3.groupby(pd.Grouper(freq='1M')).sum().transpose()], axis=1)

In [None]:
df5.groupby('state_id').sum().transpose().iplot(title='Time Series Plots - Statewise')
plt.show()

* The trend for CA & WI is increasing while that of TX is decreasing
* CA has the highest fluctuation in sales among the three states

In [None]:
result = seasonal_decompose(df4.groupby(
    'state_id').sum().transpose().CA.values, period=7, model='additive')
plt.rcParams.update({'figure.figsize': (12,8)})
result.plot().suptitle('Time Series of CA', fontsize=22)
plt.show()

In [None]:
result = seasonal_decompose(df4.groupby(
    'state_id').sum().transpose().TX.values, period=7, model='additive')
plt.rcParams.update({'figure.figsize': (12,8)})
result.plot().suptitle('Time Series of TX', fontsize=22)
plt.show()

In [None]:
result = seasonal_decompose(df4.groupby(
    'state_id').sum().transpose().WI.values, period=7, model='additive')
plt.rcParams.update({'figure.figsize': (12,8)})
result.plot().suptitle('Time Series of WI', fontsize=22)
plt.show()

* The trend of WI is increasing at a much faster rate than that of CA & TX
* There is strong monthly seasonality in CA & TX

In [None]:
df5.groupby(['state_id','cat_id']).sum().transpose().CA.iplot(title='CA sales by Category')
plt.show()

* There is monthly seasonality in FOODS category
* The trend of FOODS and HOUSEHOLD is increasing while that of HOBBIES is flat

In [None]:
df5.groupby(['state_id','cat_id']).sum().transpose().TX.iplot(title='TX sales by Category')
plt.show()

* Trend of HOUSEHOLD category is increasing
* Trend of FOODS category is decreasing
* Trend of HOBBIES category is flat

In [None]:
df5.groupby(['state_id','cat_id']).sum().transpose().WI.iplot(title='WI sales by Category')
plt.show()

The trends of all three categories are increasing.

Thus, WI is the only state in which all three categories trend upward; in CA and TX at least one category is flat or decreasing.

In [None]:
df5.groupby(['state_id','store_id']).sum().transpose().CA.iplot(title='CA sales by Stores')
plt.show()

* The trends of CA_1, CA_2 & CA_4 are increasing, while the trend of CA_3 is slightly decreasing
* The trend of CA_2 has increased rapidly in the most recent year in comparison to the other stores

In [None]:
df5.groupby(['state_id','store_id']).sum().transpose().TX.iplot(title='TX sales by Store')
plt.show()

* The trend of TX_3 is increasing, while those of TX_1 & TX_2 are decreasing
* The trend of TX_2 is decreasing at a much faster rate than that of TX_1

In [None]:
df5.groupby(['state_id','store_id']).sum().transpose().WI.iplot(title='WI sales by Store')
plt.show()

* The trends of all three stores are increasing
* The trend of WI_2 is increasing at a much faster rate than those of WI_1 & WI_3

In [None]:
df5.groupby(['store_id']).sum().transpose().iplot(title='Sales by Stores')
plt.show()

# Impact of Events

In [None]:
df7 = pd.merge(calendar[['date','weekday','month','year']], 
                   pd.concat([df2,df1], axis=1), on='date').set_index('date')
df7.head()

In [None]:
df7.drop(['month'], axis=1).groupby(['weekday','year']).sum().sum(axis=1).unstack(
    'weekday').iplot(kind='bar', title='Sales by day of the Week')
plt.show()

* Saturday & Sunday have the highest sales in any week
* Tuesday, Wednesday & Thursday have the lowest sales in any week

In [None]:
df7.drop(['weekday'], axis=1).groupby(['month','year']).sum().sum(axis=1).unstack(
    'month').iplot(kind='bar', title='Sales by Month')
plt.show()

* June, July, August & September have the highest sales in a year
* January, February & December have the lowest sales in a year

## Sales by day of the year

In [None]:
days = list(pd.to_datetime(cummulative_sales.index))
events = pd.Series(list(cummulative_sales.cummulative_sales), index=days)

calplot.calplot(events, cmap='CMRmap')
plt.show()

* Sales are highly correlated from week to week
* Within a month, most of the sales happen in the latter part of the month
* There is one day every year when sales are 0: 25th December

In [None]:
cummulative_sales.iplot(kind='hist')
plt.show()

* From the histogram we can see that only 5 days in the dataset are outliers

## Event 1 & Event 2

In [None]:
cummulative_sales_1 = pd.merge(calendar, cummulative_sales.reset_index(), on='date')
cummulative_sales_1.head()

In [None]:
# Select the sales column before averaging so the non-numeric calendar columns are ignored
cummulative_sales_1.groupby(['weekday'])['cummulative_sales'].mean().plot(kind='barh', figsize=(12,6))
plt.axvline(x=cummulative_sales_1.cummulative_sales.mean(), color='k', linestyle='--')
plt.show()

* Most of the sales in a week happen on Saturday & Sunday

In [None]:
cummulative_sales_2 = cummulative_sales_1.groupby(['weekday'])['cummulative_sales'].mean().reset_index()
cummulative_sales_2.loc[cummulative_sales_2.weekday=='Saturday']['cummulative_sales'].values

Average sales on Saturdays are 41,546.894.

In [None]:
cummulative_sales_2.loc[
    (cummulative_sales_2.weekday=='Saturday') | (cummulative_sales_2.weekday=='Sunday')].mean()

Average sales on Saturdays & Sundays are 41,338.458.

* Since most of the sales happen on weekends, to see the effect of an event we map each event day to the weekend (Saturday and Sunday) of the same Walmart week, that is, the weekend preceding the event

In [None]:
event_days_sales = cummulative_sales_1[
    ((cummulative_sales_1.event_name_1.notnull()) | (cummulative_sales_1.event_name_2.notnull()))]
cummulative_sales_1["weekend_precede_event"] = np.nan

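def update_weekend_precede_event(week_e,wday,e1,e2):
    # Label the weekend of the event's Walmart week with the event name(s).
    # e2 is NaN (a float) when there is no second event on that day.
    e2 = '_' + e2 if isinstance(e2, str) else ''
    drift = e1 + e2
    if wday == 1:
        # Event falls on the first day of the week: label only that day
        cummulative_sales_1.loc[
            (cummulative_sales_1['wm_yr_wk']==week_e)&(cummulative_sales_1[
                'wday']==1),"weekend_precede_event"] = drift
    else:
        # Otherwise label both weekend days (wday 1 and 2) of the same week
        cummulative_sales_1.loc[
            (cummulative_sales_1[
                'wm_yr_wk']==week_e)&((cummulative_sales_1['wday']==1)|(cummulative_sales_1[
                'wday']==2)),"weekend_precede_event"] = drift

# Apply the labelling to every day that has at least one event
_ = event_days_sales.apply(lambda row : update_weekend_precede_event(row[
    'wm_yr_wk'],row['wday'],row['event_name_1'], row['event_name_2']),axis = 1)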
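The mapping logic can be illustrated on plain dates. The sketch assumes that the Walmart week runs from Saturday to Friday, so that wday 1 is Saturday and wday 2 is Sunday, which is what the function above relies on. `weekend_for_event` is a hypothetical helper written for this example and is not part of the notebook.

```python
from datetime import date, timedelta

def weekend_for_event(event_day):
    """Return the weekend days that would be labelled for an event on event_day.

    Assumes weeks start on Saturday. An event on Saturday labels only that
    Saturday; any other day labels the Saturday and Sunday that open its week.
    """
    # date.weekday(): Monday is 0, Saturday is 5
    days_since_saturday = (event_day.weekday() - 5) % 7
    saturday = event_day - timedelta(days=days_since_saturday)
    if days_since_saturday == 0:
        return [saturday]
    return [saturday, saturday + timedelta(days=1)]

# An event on Monday 31-Jan-2011 maps back to the weekend that opened its week
print(weekend_for_event(date(2011, 1, 31)))
```

Mapping events back to a weekend keeps the later comparison fair, because event-labelled weekends are then compared against the average weekend rather than against lower-volume weekdays.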

In [None]:
cummulative_sales_1.head()

## Plotting the sales of events

In [None]:
cummulative_sales_1.groupby(['weekend_precede_event','weekday'])[
    'cummulative_sales'].mean().unstack('weekday').mean(axis=1).sort_values(ascending = False).plot(kind='bar', figsize=(16,6))
plt.axhline(y=cummulative_sales_2.loc[
    (cummulative_sales_2.weekday=='Saturday') | (
        cummulative_sales_2.weekday=='Sunday')].mean().values, color='black', linestyle='--')
plt.show()

We can see that for 15 of the events, sales are greater than the average weekend sales. Thus, there is a spike in sales around 15 events.

## Effect of SNAP days

In [None]:
snap_1 = pd.merge(snap, cummulative_sales, on='date')
snap_1

In [None]:
snap_CA_1 = pd.merge(snap[['date','snap_CA']], df4.groupby([
    'state_id']).sum().T['CA'].reset_index().rename(
    columns={'index':'date'}), on='date').groupby(['snap_CA']).mean().reset_index()
snap_CA_1.columns = ['snap', 'CA_sales']
snap_TX_1 = pd.merge(snap[['date','snap_TX']], df4.groupby([
    'state_id']).sum().T['TX'].reset_index().rename(
    columns={'index':'date'}), on='date').groupby(['snap_TX']).mean().reset_index()
snap_TX_1.columns = ['snap', 'TX_sales']
snap_WI_1 = pd.merge(snap[['date','snap_WI']], df4.groupby([
    'state_id']).sum().T['WI'].reset_index().rename(
    columns={'index':'date'}), on='date').groupby(['snap_WI']).mean().reset_index()
snap_WI_1.columns = ['snap', 'WI_sales']

In [None]:
pd.merge(pd.merge(snap_CA_1,snap_TX_1, on='snap'),snap_WI_1, on='snap').set_index('snap').T.plot(
    kind='bar', figsize=(10,8), title='Snap Days effect')
plt.axhline(y=df4.groupby(['state_id']).sum().mean(axis=1).to_frame().T.CA.values, color='red', linestyle='--')
plt.text(0,df4.groupby(['state_id']).sum().mean(axis=1).to_frame().T.CA.values,'Average sales in CA', size=14)
plt.axhline(y=df4.groupby(['state_id']).sum().mean(axis=1).to_frame().T.TX.values, color='k', linestyle='--')
plt.text(0.5,df4.groupby(['state_id']).sum().mean(axis=1).to_frame().T.TX.values,'Average sales in TX', size=14)
plt.axhline(y=df4.groupby(['state_id']).sum().mean(axis=1).to_frame().T.WI.values, color='blue', linestyle='--')
plt.text(1.5,df4.groupby(['state_id']).sum().mean(axis=1).to_frame().T.WI.values,'Average sales in WI', size=14)
plt.show()

* There is a spike in sales on SNAP days in all the states.
* WI has the highest increase in sales on SNAP days compared to CA & TX

Highly influenced by: https://www.kaggle.com/anirbansen3027/m5-forecasting-exhaustive-eda-beginner