In [36]:
import numpy as np
import pandas as pd

# M5 Forecasting - Accuracy

## Objective

In this competition we want to forecast the sales of Walmart products for 28 consecutive days. The data comes in three files:
  * sales_train_validation.csv - Contains historical number of sales for items.
  * calendar.csv - Contains calendar information such as dates and holidays.
  * sell_prices.csv - Contains prices of items.
  
We'll go through them one by one to understand their contents.

# sales_train_validation.csv

In [37]:
sales = pd.read_csv("../data/raw/sales_train_validation.csv")
sales.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,3,0,1,1,1,3,0,1,1
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,2,1,1,1,0,1,1,1
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,0,5,4,1,0,1,3,7,2
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,1,0,1,1,2,2,2,4


In [38]:
sales.shape

(30490, 1919)

In [39]:
sales.columns

Index(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'd_1',
       'd_2', 'd_3', 'd_4',
       ...
       'd_1904', 'd_1905', 'd_1906', 'd_1907', 'd_1908', 'd_1909', 'd_1910',
       'd_1911', 'd_1912', 'd_1913'],
      dtype='object', length=1919)

## Column Descriptions
The columns in sales data can be divided into two categories:

## Days

The columns d_1 through d_1913 hold the number of sales on each of the 1913 days for which sales data was collected.
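With one column per day, the table is in "wide" format. Many time series tools expect one row per series and day instead. The following is a minimal sketch of that reshaping with `DataFrame.melt`; `toy_sales` is a made-up stand-in for `sales`, and the names `long_sales` and `day_num` are chosen here for illustration only.

```python
import pandas as pd

# Toy stand-in for `sales`; the values are made up for illustration.
toy_sales = pd.DataFrame({
    "id": ["A_validation", "B_validation"],
    "item_id": ["A", "B"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 1],
})

# Day columns are the ones named d_<number>.
day_cols = [c for c in toy_sales.columns if c.startswith("d_")]

# One row per (series, day) instead of one column per day.
long_sales = toy_sales.melt(
    id_vars=["id", "item_id"],
    value_vars=day_cols,
    var_name="d",
    value_name="sales",
)

# Integer day index, convenient for sorting chronologically.
long_sales["day_num"] = long_sales["d"].str[2:].astype(int)
print(long_sales.sort_values(["id", "day_num"]))
```

On the full dataset this would produce 30490 × 1913 rows, so memory use may need some care.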

## IDs

The ID columns describe each item at several levels of a hierarchy, from the individual item all the way up to the state.
* id - Identifier for each row in the dataset.
* item_id - Identifier for each item.  
* dept_id - Identifier for department.  
* cat_id - Identifier for item category.  
* store_id - Identifier for store.  
* state_id - Identifier for state.  

Let's see how many unique values we have on each level.

In [40]:
id_cols = ["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"]
for col in id_cols:
    print(f"#Unique values for level {col}: {sales[col].nunique()}")

#Unique values for level id: 30490
#Unique values for level item_id: 3049
#Unique values for level dept_id: 7
#Unique values for level cat_id: 3
#Unique values for level store_id: 10
#Unique values for level state_id: 3


We note here that we have:
* 3049 different items
* 10 different stores
* 30490 different time series, which is consistent with one series per item per store (3049 × 10)

The hierarchical structure will have to be accounted for in the models in order to extract the maximum amount of information from the dataset.
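One way to look at the hierarchy is to sum the item-level series up to coarser levels. The sketch below does this with `groupby` on a made-up `toy_sales` frame that uses the same ID column names as `sales`. It only illustrates the aggregation, not a modelling approach.

```python
import pandas as pd

# Toy stand-in for `sales`; the values are made up for illustration.
toy_sales = pd.DataFrame({
    "item_id": ["FOODS_1_001", "FOODS_1_001", "HOBBIES_1_001", "HOBBIES_1_001"],
    "cat_id": ["FOODS", "FOODS", "HOBBIES", "HOBBIES"],
    "store_id": ["CA_1", "TX_1", "CA_1", "TX_1"],
    "state_id": ["CA", "TX", "CA", "TX"],
    "d_1": [1, 0, 2, 1],
    "d_2": [3, 1, 0, 0],
})
day_cols = ["d_1", "d_2"]

# Daily sales summed over all items in each state.
by_state = toy_sales.groupby("state_id")[day_cols].sum()

# Daily sales per state and category.
by_state_cat = toy_sales.groupby(["state_id", "cat_id"])[day_cols].sum()

# Top of the hierarchy: total daily sales over all series.
total = toy_sales[day_cols].sum()

print(by_state)
print(by_state_cat)
print(total)
```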

## Null Values

In [41]:
sales.isnull().sum().sum()

0

There are zero null values in the dataset.

# calendar.csv

In [22]:
calendar = pd.read_csv("../data/raw/calendar.csv")
calendar.head()

Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI
0,2011-01-29,11101,Saturday,1,1,2011,d_1,,,,,0,0,0
1,2011-01-30,11101,Sunday,2,1,2011,d_2,,,,,0,0,0
2,2011-01-31,11101,Monday,3,1,2011,d_3,,,,,0,0,0
3,2011-02-01,11101,Tuesday,4,2,2011,d_4,,,,,1,1,0
4,2011-02-02,11101,Wednesday,5,2,2011,d_5,,,,,1,0,1


The calendar table contains information such as the date, weekday and month for each day. It also contains binary snap columns, one per state, which indicate whether purchases with SNAP benefits (Supplemental Nutrition Assistance Program, i.e. food stamps) were allowed in that state on that day. Lastly, we are given information about events such as the Super Bowl and Valentine's Day.
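Since there is one snap column per state, matching the flag to a store requires knowing the store's state. A minimal sketch, assuming it is convenient to have one row per day and state: `toy_calendar` reuses the d_4 and d_5 snap values from the output above, and `snap_long` is a name introduced here for illustration.

```python
import pandas as pd

# Small excerpt shaped like `calendar` (rows d_4 and d_5 from the output above).
toy_calendar = pd.DataFrame({
    "d": ["d_4", "d_5"],
    "snap_CA": [1, 1],
    "snap_TX": [1, 0],
    "snap_WI": [0, 1],
})

# One row per (day, state) so the flag can be joined on state_id.
snap_long = toy_calendar.melt(id_vars="d", var_name="state_id", value_name="snap")

# Turn "snap_CA" into "CA" to match the state_id values in the sales data.
snap_long["state_id"] = snap_long["state_id"].str.replace("snap_", "", regex=False)
print(snap_long)
```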

In [24]:
calendar.shape

(1969, 14)

We take a look at all event names/types that can be found in the dataset.

In [27]:
calendar["event_name_1"].unique()

array([nan, 'SuperBowl', 'ValentinesDay', 'PresidentsDay', 'LentStart',
       'LentWeek2', 'StPatricksDay', 'Purim End', 'OrthodoxEaster',
       'Pesach End', 'Cinco De Mayo', "Mother's day", 'MemorialDay',
       'NBAFinalsStart', 'NBAFinalsEnd', "Father's day",
       'IndependenceDay', 'Ramadan starts', 'Eid al-Fitr', 'LaborDay',
       'ColumbusDay', 'Halloween', 'EidAlAdha', 'VeteransDay',
       'Thanksgiving', 'Christmas', 'Chanukah End', 'NewYear',
       'OrthodoxChristmas', 'MartinLutherKingDay', 'Easter'], dtype=object)

In [28]:
calendar["event_name_2"].unique()

array([nan, 'Easter', 'Cinco De Mayo', 'OrthodoxEaster', "Father's day"],
      dtype=object)

In [29]:
calendar["event_type_1"].unique()

array([nan, 'Sporting', 'Cultural', 'National', 'Religious'], dtype=object)

In [31]:
calendar["event_type_2"].unique()

array([nan, 'Cultural', 'Religious'], dtype=object)

## Null Values

In [32]:
calendar.isnull().sum()

date               0
wm_yr_wk           0
weekday            0
wday               0
month              0
year               0
d                  0
event_name_1    1807
event_type_1    1807
event_name_2    1964
event_type_2    1964
snap_CA            0
snap_TX            0
snap_WI            0
dtype: int64

We have null values only in the event columns since most days have no special events.
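Because a null here simply means "no event", the event columns can be turned into indicators without any imputation. The sketch below shows one possible encoding on a made-up `toy_calendar`; the column names `n_events` and `has_event` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `calendar`; the rows are made up for illustration.
toy_calendar = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3"],
    "event_name_1": [np.nan, "SuperBowl", "Easter"],
    "event_name_2": [np.nan, np.nan, "OrthodoxEaster"],
})

event_cols = ["event_name_1", "event_name_2"]

# Number of events on each day (0, 1 or 2).
toy_calendar["n_events"] = toy_calendar[event_cols].notna().sum(axis=1)

# True if the day has at least one event.
toy_calendar["has_event"] = toy_calendar["n_events"] > 0
print(toy_calendar)
```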

# sell_prices.csv

In [34]:
prices = pd.read_csv("../data/raw/sell_prices.csv")
prices.head()

Unnamed: 0,store_id,item_id,wm_yr_wk,sell_price
0,CA_1,HOBBIES_1_001,11325,9.58
1,CA_1,HOBBIES_1_001,11326,9.58
2,CA_1,HOBBIES_1_001,11327,8.26
3,CA_1,HOBBIES_1_001,11328,8.26
4,CA_1,HOBBIES_1_001,11329,8.26


In [35]:
prices.shape

(6841121, 4)

Each row in this table gives the price of one item in one store for one week, where the week is identified by wm_yr_wk.
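Prices are weekly while sales are daily, and the two share no date column. Both the calendar and the price table do have wm_yr_wk, so the calendar can serve as a bridge. A minimal sketch of that join on made-up toy frames (the price and week values below are illustrative and not taken from the data):

```python
import pandas as pd

# Toy stand-ins; the values are made up for illustration.
toy_calendar = pd.DataFrame({
    "d": ["d_6", "d_7", "d_8"],
    "wm_yr_wk": [11101, 11101, 11102],
})
toy_prices = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["HOBBIES_1_001", "HOBBIES_1_001"],
    "wm_yr_wk": [11101, 11102],
    "sell_price": [9.58, 8.26],
})

# Attach each week's price to every day that falls in that week.
daily_prices = toy_calendar.merge(toy_prices, on="wm_yr_wk", how="left")
print(daily_prices)
```

With `how="left"`, days in a week that has no price row for an item would get a null sell_price instead of being dropped.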