> Business Problem
Use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days.
The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, these make a rich dataset that can be used to improve forecasting accuracy.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
#Importing all the required modules
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import calendar
import matplotlib.dates as mdates

In [None]:
#Reading the data
cal_data = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/calendar.csv')
prices = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sell_prices.csv')
sales = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv')

In [None]:
print(cal_data.shape)
print(prices.shape)
print(sales.shape)

#### Calendar data has 1,969 rows and 14 columns, prices has 6,841,121 rows and 4 columns, and sales has 30,490 rows and 1,919 columns

## Starting the EDA with Sales Data

In [None]:
#Viewing the first five rows of sales data
sales.head() 

In [None]:
print('There are {0} items '.format(len(sales['item_id'].unique())))
print('There are {0} depts'.format(len(sales['dept_id'].unique())))
print('There are {0} categories'.format(len(sales['cat_id'].unique())))
print('There are {0} stores'.format(len(sales['store_id'].unique())))
print('There are {0} states'.format(len(sales['state_id'].unique())))

#### Each row in the sales data holds the daily sales quantities of one item in one store across all days<br>
#### For example, the first row corresponds to item 1 of the HOBBIES_1 department in store 1 in California<br>
#### The sales data has 1919 columns: 6 are id columns and the other 1913 are days from d_1 to d_1913. The sales dates run from 29-Jan-2011 to 24-April-2016

In [None]:
#Copying the sales dataframe so that modifications can be made and the original dataframe be kept intact
sales_df = sales.copy()

In [None]:
date_list = [d.strftime('%Y-%m-%d') for d in pd.date_range(start = '2011-01-29', end = '2016-04-24')]

In [None]:
#Renaming days to dates
sales_df.rename(columns=dict(zip(sales_df.columns[6:], date_list)),inplace=True)
sales_df.head()
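
The rename above relies on d_1 being 2011-01-29. As a hedged alternative, the same mapping can be read from the calendar file, which has a `d` column next to its `date` column. The sketch below uses tiny hand-made stand-ins (`toy_cal` and `toy_sales` are hypothetical, not the real tables) so that it runs on its own.

```python
import pandas as pd

# Hypothetical miniature stand-ins for calendar.csv and the sales table
toy_cal = pd.DataFrame({
    'date': ['2011-01-29', '2011-01-30', '2011-01-31'],
    'd': ['d_1', 'd_2', 'd_3'],
})
toy_sales = pd.DataFrame({
    'id': ['A_CA_1', 'B_CA_1'],
    'd_1': [0, 2],
    'd_2': [1, 0],
    'd_3': [3, 1],
})

# Build the d_x -> date lookup from the calendar instead of assuming a start date
d_to_date = dict(zip(toy_cal['d'], toy_cal['date']))

# Columns without an entry in the lookup (such as 'id') keep their names
toy_renamed = toy_sales.rename(columns=d_to_date)
print(toy_renamed.columns.tolist())
```

With the real data, the same lookup could be built from `cal_data['d']` and `cal_data['date']`.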

In [None]:
#Aggregating by mean the sales by department
#numeric_only=True skips the string id columns, which recent pandas versions refuse to average
dept_mean = sales_df.groupby(['dept_id']).mean(numeric_only=True).T
dept_mean.index = pd.to_datetime(dept_mean.index)

#Aggregating by mean the sales by categories
cat_mean = sales_df.groupby(['cat_id']).mean(numeric_only=True).T
cat_mean.index = pd.to_datetime(cat_mean.index)

#Aggregating by mean the sales by stores
store_mean = sales_df.groupby(['store_id']).mean(numeric_only=True).T
store_mean.index = pd.to_datetime(store_mean.index)

#Aggregating by mean the sales by states
state_mean = sales_df.groupby(['state_id']).mean(numeric_only=True).T
state_mean.index = pd.to_datetime(state_mean.index)


In [None]:
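#Function for creating plots: one line per column, resampled to the given frequency
def create_plots(df,freq):
    #Set the figure size once instead of on every loop iteration
    fig, ax = plt.subplots(figsize=(15,7))
    for i in df.columns:
        #Sum the daily values within each resampling period
        df_plot = df[i].resample(freq).sum()
        df_plot.plot(ax=ax)
    plt.grid(True)
    ax.legend(df.columns,loc='best')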

In [None]:
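#Plotting the monthly totals of the daily mean sales
#'MS' is the month-start frequency; the lowercase 'm' alias is deprecated in recent pandas
create_plots(dept_mean,'MS')
create_plots(cat_mean,'MS')
create_plots(store_mean,'MS')
create_plots(state_mean,'MS')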

#### Among departments, FOODS_3 has the highest sales
#### Among categories, FOODS sells the most
#### Among stores, CA_3 sells the most. WI_2 started off low but showed a sudden increase in level at the end of the first quarter of 2012
#### WI_3 declined from the beginning of 2013. CA_2 had a decreasing trend throughout 2014 but an increasing trend in 2015

In [None]:
#To plot data in a particular date range
fig, ax = plt.subplots(figsize=(15,5))
state_mean.plot(xlim=['2012-01-01','2014-01-01'],ax=ax,rot=90)
plt.grid(True)
plt.title('Sales by State')
#set major ticks every month
ax.xaxis.set_major_locator(mdates.MonthLocator())
#set major ticks format (day and abbreviated month)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d %b'))

#### Sales fall to zero just before January because Walmart is closed on Christmas
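
The drop can also be confirmed numerically rather than by eye. The following is a minimal sketch on made-up daily totals (`toy_total` is hypothetical); on the real data the series would be the column-wise sum of the date columns, and a small threshold might be needed instead of an exact zero.

```python
import pandas as pd

# Hypothetical daily totals standing in for the summed sales columns
dates = pd.date_range('2012-12-23', '2012-12-27')
toy_total = pd.Series([120, 150, 0, 90, 110], index=dates)

# Dates on which nothing was sold
closed_days = toy_total[toy_total == 0].index
print(closed_days.strftime('%d %b %Y').tolist())
```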

## Ending the EDA of sales, starting the EDA of <i>calendar</i> data


In [None]:
cal_data.head(31)

#### This data contains the details regarding events on each day and it also shows on which days SNAP purchases are allowed


The United States federal government provides a nutrition assistance benefit called the Supplemental Nutrition Assistance Program (SNAP). SNAP provides low-income families and individuals with an Electronic Benefits Transfer debit card to purchase food products. In many states, the monetary benefits are disbursed across 10 days of the month, and on each of these days 1/10 of the recipients receive the benefit on their card.

#### SNAP follows a distinct pattern in each of the three states (1 = SNAP allowed, 0 = not allowed): in CA, SNAP is allowed on the first ten days of the month, TX follows the pattern 101-011, and WI follows the pattern 011
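
One way to read these patterns off the table instead of scanning rows is to collect the days of the month on which each SNAP flag is set. This is only a sketch: `toy_cal` is a hypothetical six-day calendar whose flag values were made up to mirror the patterns described above, while the column names `date`, `snap_CA`, `snap_TX` and `snap_WI` are those of the calendar file.

```python
import pandas as pd

# Hypothetical miniature calendar; the real file has one row per day
toy_cal = pd.DataFrame({
    'date': pd.date_range('2011-02-01', periods=6).strftime('%Y-%m-%d'),
    'snap_CA': [1, 1, 1, 1, 1, 1],
    'snap_TX': [1, 0, 1, 0, 1, 1],
    'snap_WI': [0, 1, 1, 0, 1, 1],
})

# Day of the month for every calendar row
day_of_month = pd.to_datetime(toy_cal['date']).dt.day

# For each state, the distinct days of the month on which SNAP is allowed
snap_days = {
    col: sorted(day_of_month[toy_cal[col] == 1].unique().tolist())
    for col in ['snap_CA', 'snap_TX', 'snap_WI']
}
print(snap_days)
```

Replacing `toy_cal` with `cal_data` would list the actual SNAP days per state.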

In [None]:
print(cal_data['event_name_1'].notnull().sum())
print(cal_data['event_name_2'].notnull().sum())

#### There are 162 rows where event_name_1 is not null and only 5 rows where event_name_2 is not null

In [None]:
#nunique() ignores NaN, whereas len(unique()) would count NaN as an extra value
print(cal_data['event_name_1'].nunique())
print(cal_data['event_type_1'].nunique())

#### There are 30 unique events belonging to 4 unique types. With 162 non-null rows over roughly 5 years of data, these events recur every year
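
A caveat when counting distinct values in a column with missing entries: `unique()` returns NaN as one of the values, so `len(col.unique())` comes out one higher than the number of real categories, while `nunique()` skips NaN by default. A small illustration on a made-up column (`toy_events` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical event column: most days have no event, hence NaN
toy_events = pd.Series(['SuperBowl', np.nan, 'Easter', np.nan, 'SuperBowl'])

count_with_nan = len(toy_events.unique())   # NaN is counted as a value
count_without_nan = toy_events.nunique()    # NaN is ignored
print(count_with_nan, count_without_nan)
```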

## Calendar data EDA done, starting EDA on prices data

In [None]:
prices.head()

In [None]:
prices['sell_price'].hist(bins=50)
plt.xlim(0,25)

#### Within the plotted range of 0 to 25, most sell prices lie between 0 and 10

In [None]:
#Checking the price range of each department
depts = ['FOODS_1','FOODS_2','FOODS_3','HOUSEHOLD_1','HOUSEHOLD_2','HOBBIES_1','HOBBIES_2']
for dept in depts:
    #item_id starts with the department name, so a prefix match selects the department
    prices[(prices['item_id'].str.startswith(dept))]['sell_price'].hist()
    plt.xlabel(dept)
    plt.show()

#### Here we have viewed the price range of all departments

In [None]:
#Get the average selling price of each item
avg_price = prices.groupby(['item_id'])['sell_price'].mean()
#Merge it with sales data
merged = pd.merge(sales_df,avg_price, right_index=True, left_on='item_id')
#Group the merged data by id, keeping only the numeric columns (daily sales and sell_price)
id_grouped = merged.groupby(['id']).sum(numeric_only=True)
#Sum by days to get total quantity; sell_price is excluded so it does not inflate the total
id_grouped['Total_Qty'] = id_grouped.drop(columns='sell_price').sum(axis=1)
#Get the total amount sold by multiplying the total quantity and the average selling price
id_grouped['Amount_Sold'] = id_grouped['Total_Qty'] * id_grouped['sell_price']
#Keep only the new columns so the date columns are not duplicated in the merge
cols_to_use = id_grouped.columns.difference(sales_df.columns)
#Store the final df in new_sales
new_sales = pd.merge(sales_df,id_grouped[cols_to_use], right_index=True, left_on='id')

In [None]:
new_sales.groupby(['dept_id','store_id'])['Total_Qty'].agg('mean').unstack().plot(kind='bar',figsize=(15,7))
plt.title('Mean Quantity Sold by Department in each store')

In [None]:
new_sales.groupby(['dept_id','store_id'])['Total_Qty'].agg('mean').unstack().T.plot(kind='bar',figsize=(15,7))
plt.title('Mean Quantity Sold by Each Store of each Department')

In [None]:
#Daily sales per department for store WI_2 (numeric_only=True skips the string id columns)
WI_2 = sales_df[(sales_df['store_id'] == 'WI_2')]
dept_WI2 = WI_2.groupby(['dept_id']).sum(numeric_only=True).T
dept_WI2.index = pd.to_datetime(dept_WI2.index)

#Daily sales per department for store CA_2
CA_2 = sales_df[(sales_df['store_id'] == 'CA_2')]
dept_CA2 = CA_2.groupby(['dept_id']).sum(numeric_only=True).T
dept_CA2.index = pd.to_datetime(dept_CA2.index)

fig, ax = plt.subplots(figsize=(15,5))
dept_CA2.plot(xlim=['2015-01-01','2016-01-01'],ax=ax,rot=90)
plt.grid(True)
plt.xlabel('Sales by Department')
#set major ticks every month
ax.xaxis.set_major_locator(mdates.MonthLocator())
#set major ticks format
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
plt.title('CA_2 Plot 2015-16')
plt.show()

fig, ax = plt.subplots(figsize=(15,5))
dept_WI2.plot(xlim=['2012-01-01','2013-01-01'],ax=ax,rot=90)
plt.grid(True)
plt.xlabel('Sales by Department')
#set major ticks every month
ax.xaxis.set_major_locator(mdates.MonthLocator())
#set major ticks format
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
plt.title('WI_2 Plot 2012-13')
plt.show()

#### WI_2 and CA_2 are plotted over different periods because of the anomalies seen earlier. Both show a change in level, and interestingly both changes occur around June, with FOODS_3 and FOODS_2 sales increasing in both stores
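
A change in level like this can be quantified by comparing average sales in equal windows before and after a candidate date. The sketch below is illustrative only: `toy_sales`, the candidate date and the 30-day window are all assumptions, not values taken from the notebook's data.

```python
import pandas as pd

dates = pd.date_range('2012-05-01', '2012-07-31')
# Hypothetical daily sales: 100 per day through May, 160 per day from June onward
toy_sales = pd.Series([100] * 31 + [160] * 61, index=dates)

candidate = pd.Timestamp('2012-06-01')   # assumed date of the level change
window = pd.Timedelta(days=30)
one_day = pd.Timedelta(days=1)

# Mean of the 30 days before and the 30 days starting at the candidate date
before = toy_sales.loc[candidate - window: candidate - one_day].mean()
after = toy_sales.loc[candidate: candidate + window - one_day].mean()

# A ratio well above 1 suggests an upward shift in level
shift_ratio = after / before
print(before, after, shift_ratio)
```

On the real data, `toy_sales` could be replaced by a department column of `dept_WI2` or `dept_CA2`.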