# Introduction

In this notebook, we'll load the data, downcast it using a technique I learned from [this](https://www.kaggle.com/code/vaibhavgupta082/time-series-forecasting-eda-fe-modelling-679c92) notebook, and do some pre-processing before analysis.

In [2]:
import os
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings('ignore')

In [29]:
# GLOBAL VARIABLES
DATA_PATH = "../../m5-data/"

WALMART_COLOURS = ["#0072CE", "#B4B4B3", "#79B8F3", "#FDB927", "#F7941D", "#4CB748", "#2E3192"]
DIVERGENT_COLOUR_GRADIENT = ["#e2f1fc", "#b9dcfa", "#8cc7f7", "#5eb1f3", "#39a0f1", "#0691ef"]
sns.set_palette(WALMART_COLOURS)


In [4]:
# data paths
sales_path = os.path.join(DATA_PATH, "sales_train_evaluation.csv")
calendar_path = os.path.join(DATA_PATH, "calendar.csv")
prices_path = os.path.join(DATA_PATH, "sell_prices.csv")

In [5]:
sales = pd.read_csv(sales_path)
calendar = pd.read_csv(calendar_path)
prices = pd.read_csv(prices_path)

In [6]:
# Add zero-sales placeholder columns for the remaining days, d_1942 to d_1969
for d in range(1942,1970):
    col = 'd_' + str(d)
    sales[col] = 0
    sales[col] = sales[col].astype(np.int16)

In [7]:
def get_df_memory(df):
    """Return the DataFrame's memory usage in MB, rounded to one decimal place."""
    return np.round(df.memory_usage().sum()/(1024*1024),1)

def downcast(df):
    """Cast each column to the smallest dtype that covers its value range (modifies df in place)."""
    cols = df.dtypes.index.tolist()
    types = df.dtypes.values.tolist()
    for i,t in enumerate(types):
        if 'int' in str(t):
            if df[cols[i]].min() > np.iinfo(np.int8).min and df[cols[i]].max() < np.iinfo(np.int8).max:
                df[cols[i]] = df[cols[i]].astype(np.int8)
            elif df[cols[i]].min() > np.iinfo(np.int16).min and df[cols[i]].max() < np.iinfo(np.int16).max:
                df[cols[i]] = df[cols[i]].astype(np.int16)
            elif df[cols[i]].min() > np.iinfo(np.int32).min and df[cols[i]].max() < np.iinfo(np.int32).max:
                df[cols[i]] = df[cols[i]].astype(np.int32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.int64)
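        elif 'float' in str(t):
            # Caveat: float16 keeps only about 3 significant decimal digits, so values such as prices may be rounded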
            if df[cols[i]].min() > np.finfo(np.float16).min and df[cols[i]].max() < np.finfo(np.float16).max:
                df[cols[i]] = df[cols[i]].astype(np.float16)
            elif df[cols[i]].min() > np.finfo(np.float32).min and df[cols[i]].max() < np.finfo(np.float32).max:
                df[cols[i]] = df[cols[i]].astype(np.float32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.float64)
        elif t == object:
            if cols[i] == 'date':
                df[cols[i]] = pd.to_datetime(df[cols[i]], format='%Y-%m-%d')
            else:
                df[cols[i]] = df[cols[i]].astype('category')
    return df 
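
As a small-scale illustration of the idea behind `downcast`, here is a minimal sketch on a toy DataFrame. It swaps in pandas' built-in `pd.to_numeric(..., downcast=...)` rather than the function above, and the `toy` frame and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: small integer counts and prices in steps of 0.25
toy = pd.DataFrame({
    "sold": np.arange(1000, dtype=np.int64) % 100,
    "price": np.arange(1000, dtype=np.float64) * 0.25,
})
bytes_before = toy.memory_usage(index=False).sum()

small = toy.copy()
# Values 0..99 fit into int8; pd.to_numeric never goes below float32 for floats
small["sold"] = pd.to_numeric(small["sold"], downcast="integer")
small["price"] = pd.to_numeric(small["price"], downcast="float")
bytes_after = small.memory_usage(index=False).sum()

print(small.dtypes)
print(bytes_before, bytes_after)
```

Unlike the function above, `pd.to_numeric` stops at `float32`, which avoids the coarser rounding of `float16` at the cost of a smaller saving.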

In [8]:
sales_bd = get_df_memory(sales)
calendar_bd = get_df_memory(calendar)
prices_bd = get_df_memory(prices)

In [9]:
sales = downcast(sales)
prices = downcast(prices)
calendar = downcast(calendar)

In [10]:
sales_ad = get_df_memory(sales)
calendar_ad = get_df_memory(calendar)
prices_ad = get_df_memory(prices)

In [11]:
memory = {'DataFrame':['sales','calendar','prices'],
       'Before downcasting':[sales_bd,calendar_bd,prices_bd],
       'After downcasting':[sales_ad,calendar_ad,prices_ad]}

memory = pd.DataFrame(memory)
memory = pd.melt(memory, id_vars='DataFrame', var_name='Status', value_name='Memory (MB)')
memory.sort_values('Memory (MB)',inplace=True)
fig = px.bar(memory, x='DataFrame', y='Memory (MB)', color='Status', color_discrete_sequence=WALMART_COLOURS, barmode='group', text='Memory (MB)')
fig.update_traces(texttemplate='%{text} MB', textposition='outside')
fig.update_layout(template='seaborn', title='Effect of Downcasting')
fig.show()
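
The memory savings come with a precision trade-off for float columns that end up as `float16`. The sketch below uses a made-up price of 9.98 (not taken from the dataset) to show how far the stored value can drift at each precision.

```python
import numpy as np

price = 9.98  # hypothetical price, chosen only for illustration

# Round-trip the value through each narrower float type
as_f16 = float(np.float16(price))
as_f32 = float(np.float32(price))

err_f16 = abs(as_f16 - price)
err_f32 = abs(as_f32 - price)

print(f"float16: {as_f16!r} (error {err_f16:.6f})")
print(f"float32: {as_f32!r} (error {err_f32:.9f})")
```

If exact cents matter later on (for example when computing revenue), keeping `sell_price` as `float32` may be the safer choice.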

### Understanding The Data

Now that we've downcast the data, we can try to wrap our heads around what's in it. The M5 dataset, generously made available by Walmart, involves the unit sales of various products sold in the USA, organized in the form of **grouped time series**. We have 3,049 products classified into 3 product categories (Hobbies, Foods and Household), which are further divided into 7 product departments.
These products are sold in 10 stores across 3 states: California (CA), Texas (TX) and Wisconsin (WI).

We can visualize this distribution as follows:

In [21]:
sales_group = sales.groupby(['state_id','store_id','cat_id','dept_id'],as_index=False)['item_id'].count().dropna()
sales_group.head()

Unnamed: 0,state_id,store_id,cat_id,dept_id,item_id
0,CA,CA_1,FOODS,FOODS_1,216
1,CA,CA_1,FOODS,FOODS_2,398
2,CA,CA_1,FOODS,FOODS_3,823
3,CA,CA_1,FOODS,HOBBIES_1,0
4,CA,CA_1,FOODS,HOBBIES_2,0
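
The zero-count rows above (for example FOODS paired with HOBBIES_1) are most likely a side effect of downcasting: the ID columns became categoricals, and grouping on categoricals can return every combination of categories, not just the ones present in the data. Here is a minimal sketch on made-up toy data, with `observed` set explicitly since its default differs between pandas versions.

```python
import pandas as pd

# Hypothetical toy frame with categorical grouping columns
toy = pd.DataFrame({
    "cat_id": pd.Categorical(["FOODS", "FOODS", "HOBBIES"]),
    "dept_id": pd.Categorical(["FOODS_1", "FOODS_2", "HOBBIES_1"]),
    "item_id": ["a", "b", "c"],
})

# observed=False: one row per category combination, including impossible ones
all_combos = toy.groupby(["cat_id", "dept_id"], observed=False)["item_id"].count()

# observed=True: only the combinations that actually occur in the data
seen_only = toy.groupby(["cat_id", "dept_id"], observed=True)["item_id"].count()

print(all_combos)
print(seen_only)
```

Passing `observed=True` is an alternative to filtering out the zero-valued rows after the fact.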


In [33]:
group = sales.groupby(['state_id','store_id','cat_id','dept_id'],as_index=False)['item_id'].count().dropna()
group = group[group['item_id'] > 0].reset_index(drop=True) # removing zero-valued rows
group['USA'] = 'United States of America'
group.rename(columns={'state_id':'State','store_id':'Store','cat_id':'Category','dept_id':'Department','item_id':'Count'},inplace=True)
fig = px.treemap(
  group, 
  path=['USA', 'State', 'Store', 'Category', 'Department'], 
  values='Count',
  color='Count',
  color_continuous_scale= DIVERGENT_COLOUR_GRADIENT,
  title="Walmart's Item Distribution"
)
fig.update_layout(template='seaborn')
fig.show()

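We learn from the above graph that items are distributed in a similar manner across states, stores and product departments. What's interesting is that the Texas and Wisconsin data are exactly the same, which made me wonder whether I had done something wrong in the above data manipulations or whether this is by design. A likely explanation is that the sales table has one row per item per store, so every store shows the same item counts, and Texas and Wisconsin each have three stores. That would make this a property of the dataset rather than an error.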

In [12]:
combined_data = pd.melt(sales, id_vars=['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name='d', value_name='sold').dropna()
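
To make the reshaping step concrete, here is a minimal sketch of `pd.melt` on a made-up two-row table. The `toy_wide` frame and its values are assumptions for illustration; only the `d_*` column naming mirrors the real data.

```python
import pandas as pd

# Hypothetical wide table: one row per series, one column per day
toy_wide = pd.DataFrame({
    "id": ["A_CA_1", "B_CA_1"],
    "d_1": [0, 3],
    "d_2": [1, 0],
})

# Long format: one row per (series, day) pair
toy_long = pd.melt(toy_wide, id_vars=["id"], var_name="d", value_name="sold")
print(toy_long)
```

Each day column becomes a block of rows, so the long table has (number of series) x (number of day columns) rows.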

In [13]:
combined_data = pd.merge(combined_data, calendar, on='d', how='left')

In [14]:
combined_data = pd.merge(combined_data, prices, on=['store_id','item_id','wm_yr_wk'], how='left')

In [15]:
combined_data.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d,sold,date,wm_yr_wk,...,month,year,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI,sell_price
0,HOBBIES_1_001_CA_1_evaluation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,2011-01-29,11101,...,1,2011,,,,,0,0,0,
1,HOBBIES_1_002_CA_1_evaluation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,2011-01-29,11101,...,1,2011,,,,,0,0,0,
2,HOBBIES_1_003_CA_1_evaluation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,2011-01-29,11101,...,1,2011,,,,,0,0,0,
3,HOBBIES_1_004_CA_1_evaluation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,2011-01-29,11101,...,1,2011,,,,,0,0,0,
4,HOBBIES_1_005_CA_1_evaluation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,2011-01-29,11101,...,1,2011,,,,,0,0,0,
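
The empty `sell_price` values in this preview are what a left join produces when the prices table has no row for a given store, item and week. A minimal sketch with made-up toy tables (the names `toy_sales` and `toy_prices` and all values are assumptions):

```python
import pandas as pd

# Hypothetical sales rows for one item over two weeks
toy_sales = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["X", "X"],
    "wm_yr_wk": [11101, 11102],
    "sold": [0, 2],
})

# The price is only listed for the second week
toy_prices = pd.DataFrame({
    "store_id": ["CA_1"],
    "item_id": ["X"],
    "wm_yr_wk": [11102],
    "sell_price": [9.58],
})

# how="left" keeps every sales row; unmatched rows get NaN for sell_price
toy_merged = pd.merge(
    toy_sales, toy_prices, on=["store_id", "item_id", "wm_yr_wk"], how="left"
)
print(toy_merged)
```

A left join keeps the row count of the sales table unchanged, which is why missing prices show up as NaN instead of dropped rows.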


In [16]:
combined_data.columns

Index(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'd',
       'sold', 'date', 'wm_yr_wk', 'weekday', 'wday', 'month', 'year',
       'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2',
       'snap_CA', 'snap_TX', 'snap_WI', 'sell_price'],
      dtype='object')

In [17]:
combined_data.shape

(60034810, 22)
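
This row count can be sanity-checked with a little arithmetic, assuming every one of the 3,049 items appears in each of the 10 stores and the wide table has day columns d_1 to d_1969 (the original columns plus the 28 added earlier).

```python
# Assumption: every item is stocked in every store
n_items = 3049
n_stores = 10
n_series = n_items * n_stores

# Day columns d_1 .. d_1969 after adding the placeholder days
n_days = 1969

expected_rows = n_series * n_days
print(n_series, expected_rows)
```

The result matches the first element of `combined_data.shape`, which suggests that neither the melt nor the two left merges added or dropped rows.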