# M5 Walmart Demand Forecasting - Data Exploration

**Author**: Godson Kurishinkal  
**Date**: November 9, 2025  
**Dataset**: M5 Competition - Walmart Sales Data  
**Purpose**: Comprehensive exploratory data analysis of the M5 dataset for demand forecasting

## Project Overview

This notebook explores the M5 Competition dataset from Walmart, which contains hierarchical daily unit sales for 3,049 products across 10 stores in 3 states (CA, TX, WI), giving 30,490 product-store time series over roughly five years. The goal is to identify the trends, seasonality, and data characteristics that will inform our forecasting models.

## Dataset Structure

- **calendar.csv**: Date information, events, and SNAP (food stamps) indicators
- **sales_train_validation.csv**: Historical daily unit sales per product and store
- **sell_prices.csv**: Product prices by store and week
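The three files are meant to be combined: sales are stored wide (one column per day), the calendar maps each day label to a date and a Walmart week, and prices are keyed by store, item, and week. The sketch below shows one way this join could look on toy data. The values are made up, and the column names `item_id`, `store_id`, `sell_price`, and the `d_1`, `d_2` day columns are assumptions about the sales and price files, since only the calendar columns are displayed in this notebook.

```python
import pandas as pd

# Toy stand-ins for the three files (values are invented for illustration).
# Column names for sales and prices are assumptions, not confirmed by this notebook.
sales_toy = pd.DataFrame({
    "item_id": ["FOODS_1_001"],
    "store_id": ["CA_1"],
    "d_1": [3],
    "d_2": [0],
})
calendar_toy = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": ["2011-01-29", "2011-01-30"],
    "wm_yr_wk": [11101, 11101],
})
prices_toy = pd.DataFrame({
    "store_id": ["CA_1"],
    "item_id": ["FOODS_1_001"],
    "wm_yr_wk": [11101],
    "sell_price": [2.0],
})

# Reshape wide daily columns into one row per (item, store, day).
long_sales = sales_toy.melt(
    id_vars=["item_id", "store_id"], var_name="d", value_name="units"
)

# Attach the date and week via the day label, then the weekly price.
merged = (
    long_sales
    .merge(calendar_toy, on="d", how="left")
    .merge(prices_toy, on=["store_id", "item_id", "wm_yr_wk"], how="left")
)
print(merged)
```

On the full dataset the melted table would have tens of millions of rows, so in practice this step is often done on a subset or with downcast dtypes.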

## 1. Import Required Libraries

In [1]:
# Data manipulation
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✓ Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.4


## 2. Load M5 Dataset

The M5 dataset consists of three main files:
- **calendar.csv**: 1,969 days of calendar information with events and SNAP indicators
- **sales_train_validation.csv**: Sales data for 30,490 time series (products × stores)
- **sell_prices.csv**: Weekly prices for products across stores

In [2]:
# Define data paths
DATA_PATH = Path('../../data/raw')

# Load datasets
print("Loading M5 datasets...")
print("-" * 60)

# Load calendar data
calendar = pd.read_csv(DATA_PATH / 'calendar.csv')
print(f"✓ Calendar data loaded: {calendar.shape}")

# Load sales data
sales = pd.read_csv(DATA_PATH / 'sales_train_validation.csv')
print(f"✓ Sales data loaded: {sales.shape}")

# Load price data
prices = pd.read_csv(DATA_PATH / 'sell_prices.csv')
print(f"✓ Price data loaded: {prices.shape}")

print("-" * 60)
print("All datasets loaded successfully!")

Loading M5 datasets...
------------------------------------------------------------
✓ Calendar data loaded: (1969, 14)
✓ Sales data loaded: (30490, 1919)
✓ Price data loaded: (6841121, 4)
------------------------------------------------------------
All datasets loaded successfully!
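The reported shapes can be cross-checked against the figures quoted earlier. The following is a small arithmetic sketch, not part of the original analysis. It assumes the sales file carries 6 identifier columns ahead of the daily columns; that count is an assumption, as the sales columns are not listed in this notebook.

```python
from datetime import date

# Figures taken from the text and the load output above.
n_products, n_stores = 3_049, 10
sales_shape = (30490, 1919)
calendar_days = 1969

# One time series per product-store pair.
n_series = n_products * n_stores

# Assumption: 6 identifier columns precede the daily sales columns.
n_id_columns = 6
n_sales_days = sales_shape[1] - n_id_columns

# Calendar rows beyond the sales history (days with no sales in this file).
extra_calendar_days = calendar_days - n_sales_days

# Inclusive day count for the calendar's printed date range.
span_days = (date(2016, 6, 19) - date(2011, 1, 29)).days + 1

print(n_series, n_sales_days, extra_calendar_days, span_days)
```

Under that assumption, the calendar extends 56 days past the sales history, which is consistent with the calendar also covering the days to be forecast.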


### 2.1 Calendar Data Overview

In [3]:
# Display calendar information
print("CALENDAR DATASET")
print("=" * 80)
print(f"\nShape: {calendar.shape}")
print(f"Columns: {list(calendar.columns)}")
print(f"\nData Types:\n{calendar.dtypes}")
print(f"\nDate Range: {calendar['date'].min()} to {calendar['date'].max()}")
print(f"Number of days: {len(calendar)}")

# Convert date to datetime
calendar['date'] = pd.to_datetime(calendar['date'])

print("\nFirst few rows:")
calendar.head(10)

CALENDAR DATASET
================================================================================

Shape: (1969, 14)
Columns: ['date', 'wm_yr_wk', 'weekday', 'wday', 'month', 'year', 'd', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'snap_CA', 'snap_TX', 'snap_WI']

Data Types:
date            object
wm_yr_wk         int64
weekday         object
wday             int64
month            int64
year             int64
d               object
event_name_1    object
event_type_1    object
event_name_2    object
event_type_2    object
snap_CA          int64
snap_TX          int64
snap_WI          int64
dtype: object

Date Range: 2011-01-29 to 2016-06-19
Number of days: 1969

First few rows:


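| | date | wm_yr_wk | weekday | wday | month | year | d | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-29 | 11101 | Saturday | 1 | 1 | 2011 | d_1 | | | | | 0 | 0 | 0 |
| 1 | 2011-01-30 | 11101 | Sunday | 2 | 1 | 2011 | d_2 | | | | | 0 | 0 | 0 |
| 2 | 2011-01-31 | 11101 | Monday | 3 | 1 | 2011 | d_3 | | | | | 0 | 0 | 0 |
| 3 | 2011-02-01 | 11101 | Tuesday | 4 | 2 | 2011 | d_4 | | | | | 1 | 1 | 0 |
| 4 | 2011-02-02 | 11101 | Wednesday | 5 | 2 | 2011 | d_5 | | | | | 1 | 0 | 1 |
| 5 | 2011-02-03 | 11101 | Thursday | 6 | 2 | 2011 | d_6 | | | | | 1 | 1 | 1 |
| 6 | 2011-02-04 | 11101 | Friday | 7 | 2 | 2011 | d_7 | | | | | 1 | 0 | 0 |
| 7 | 2011-02-05 | 11102 | Saturday | 1 | 2 | 2011 | d_8 | | | | | 1 | 1 | 1 |
| 8 | 2011-02-06 | 11102 | Sunday | 2 | 2 | 2011 | d_9 | SuperBowl | Sporting | | | 1 | 1 | 1 |
| 9 | 2011-02-07 | 11102 | Monday | 3 | 2 | 2011 | d_10 | | | | | 1 | 1 | 0 |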
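The preview suggests two conventions worth making explicit: the `d` label is a 1-based day counter starting at 2011-01-29, and `wday` numbers the week from Saturday (1) to Friday (7), with `wm_yr_wk` changing on Saturdays. The sketch below rebuilds the first 10 rows from scratch to illustrate both mappings. The helper columns `day_num`, `date_from_d`, and `wday_from_date` are hypothetical names introduced here, and the pattern is only verified against the 10 rows shown, not the full calendar.

```python
import pandas as pd

# Rebuild the date and d columns for the 10 preview rows shown above.
preview = pd.DataFrame({
    "date": pd.date_range("2011-01-29", periods=10, freq="D"),
    "d": [f"d_{i}" for i in range(1, 11)],
})

# Hypothetical helper: integer day index parsed from the 'd' label.
preview["day_num"] = preview["d"].str.replace("d_", "", regex=False).astype(int)

# Day label back to a date: d_1 corresponds to the first calendar date.
start = pd.Timestamp("2011-01-29")
preview["date_from_d"] = start + pd.to_timedelta(preview["day_num"] - 1, unit="D")

# pandas dayofweek is 0 (Monday) to 6 (Sunday); shift it so that
# Saturday maps to 1 and Friday maps to 7, matching the wday column.
preview["wday_from_date"] = (preview["date"].dt.dayofweek + 2) % 7 + 1

print(preview)
```

Parsing `d` into an integer like this is convenient later for sorting days numerically, since the string labels sort as `d_1`, `d_10`, `d_100`, and so on.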
