# This kernel is a summary of the descriptions in the M5 Guidelines paper.
- All information is contained in the link below.
- Link : https://mofc.unic.ac.cy/m5-competition/

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotnine 
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Introduction
## Dataset

- calendar.csv - Contains information about the dates on which the products are sold.
- sales_train_validation.csv - Contains the historical daily unit sales data per product and store [d_1 - d_1913]
- sample_submission.csv - The correct format for submissions. Reference the Evaluation tab for more info.
- sell_prices.csv - Contains information about the price of the products sold per store and date.
- sales_train_evaluation.csv - **Available one month before the competition deadline**. Will include sales [d_1 - d_1941]

In [None]:
train = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv")
sell = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/sell_prices.csv")
calendar = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/calendar.csv")
sub = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/sample_submission.csv")

![image.png](https://github.com/choco9966/Kaggle/blob/master/M5%20Forecasting/image/overview.PNG?raw=true)

![image.png](https://github.com/choco9966/Kaggle/blob/master/M5%20Forecasting/image/aggtable.PNG?raw=true)

In [None]:
print("Unit sales of all products, aggregated for each state", train['state_id'].nunique())
print("Unit sales of all products, aggregated for each store", train['store_id'].nunique())
print("Unit sales of all products, aggregated for each category", train['cat_id'].nunique())
print("Unit sales of all products, aggregated for each department", train['dept_id'].nunique())
print("Unit sales of all products, aggregated for each State and category", train['state_id'].nunique() * train['cat_id'].nunique())
print("Unit sales of all products, aggregated for each State and department", train['state_id'].nunique() * train['dept_id'].nunique())
print("Unit sales of all products, aggregated for each store and category", train['store_id'].nunique() * train['cat_id'].nunique())
print("Unit sales of all products, aggregated for each store and department", train['store_id'].nunique() * train['dept_id'].nunique())
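# Departments are nested within categories, so this product is not one of the 12 M5 aggregation levels.
print("Unit sales of all products, aggregated for each department and category", train['dept_id'].nunique() * train['cat_id'].nunique())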
print("Unit sales of product x, aggregated for all stores/states", train['item_id'].nunique())
print("Unit sales of product x, aggregated for each state", train['item_id'].nunique() * train['state_id'].nunique())
print("Unit sales of product x, aggregated for each store", train['item_id'].nunique() * train['store_id'].nunique())
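
The level counts printed above can also be derived from the id columns directly instead of multiplying `nunique()` values. The cell below is a minimal sketch under stated assumptions: `LEVELS`, `count_series_per_level` and the `toy` frame are hypothetical names introduced here for illustration, and the toy frame only mimics the id columns of `train`. Applied to the real `train` frame, the twelve counts should add up to the 42,840 series mentioned in the Evaluation section.

```python
import pandas as pd

# The 12 aggregation levels, written as the id columns that define each one.
# An empty list stands for the single grand-total series (level 1).
LEVELS = [
    [],
    ["state_id"],
    ["store_id"],
    ["cat_id"],
    ["dept_id"],
    ["state_id", "cat_id"],
    ["state_id", "dept_id"],
    ["store_id", "cat_id"],
    ["store_id", "dept_id"],
    ["item_id"],
    ["item_id", "state_id"],
    ["item_id", "store_id"],
]


def count_series_per_level(df):
    """Return the number of distinct series at each aggregation level."""
    counts = []
    for cols in LEVELS:
        if cols:
            # Count the combinations that actually occur in the data.
            counts.append(len(df[cols].drop_duplicates()))
        else:
            counts.append(1)
    return counts


# Toy stand-in for `train`: 2 items sold in 2 stores located in 2 states.
toy = pd.DataFrame({
    "item_id":  ["A", "B", "A", "B"],
    "dept_id":  ["D1", "D2", "D1", "D2"],
    "cat_id":   ["C1", "C1", "C1", "C1"],
    "store_id": ["S1", "S1", "S2", "S2"],
    "state_id": ["CA", "CA", "TX", "TX"],
})
toy_counts = count_series_per_level(toy)
print(toy_counts, sum(toy_counts))
```

Counting observed combinations avoids overcounting when one id is nested inside another, which a plain product of `nunique()` values cannot detect.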

### File 1: "calendar.csv"
Contains information about the dates the products are sold.
- date: The date in a “y-m-d” format.
- wm_yr_wk: The id of the week the date belongs to.
- weekday: The type of the day (Saturday, Sunday, …, Friday).
- wday: The id of the weekday, starting from Saturday.
- month: The month of the date.
- year: The year of the date.
- event_name_1: If the date includes an event, the name of this event.
- event_type_1: If the date includes an event, the type of this event.
- event_name_2: If the date includes a second event, the name of this event.
- event_type_2: If the date includes a second event, the type of this event.
- snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.

In [None]:
calendar.head(8)

Event information 

In [None]:
calendar[calendar['event_name_1'].notnull()].head()

In [None]:
from plotnine import *
import plotnine

In [None]:
agg = calendar.groupby('event_name_1')['event_name_1'].agg(['count']).reset_index()
(ggplot(data = agg) 
  + geom_bar(aes(x='event_name_1', y='count'), fill='#49beb7', color='black', stat='identity')
  + scale_color_hue(l=0.45)
  + theme_light() 
  + theme(
         axis_text_x = element_text(angle=80),
         figure_size=(12,8),
         legend_position="none"))

In [None]:
agg = calendar.groupby('event_type_1')['event_type_1'].agg(['count']).reset_index()
(ggplot(data = agg) 
  + geom_bar(aes(x='event_type_1', y='count'), fill='#49beb7', color='black', stat='identity')
  + scale_color_hue(l=0.45)
  + theme_light() 
  + theme(
         axis_text_x = element_text(angle=80),
         figure_size=(12,8),
         legend_position="none"))

In [None]:
calendar[calendar['event_name_2'].notnull()].head()

In [None]:
print("event_name_2 notnull shape : ", calendar[calendar['event_name_2'].notnull()].shape)
print("event_name_1 and 2 notnull shape : ", calendar[(calendar['event_name_2'].notnull()) & (calendar['event_name_1'].notnull())].shape)

In [None]:
agg = calendar.groupby('event_name_2')['event_name_2'].agg(['count']).reset_index()
(ggplot(data = agg) 
  + geom_bar(aes(x='event_name_2', y='count'), fill='#49beb7', color='black', stat='identity')
  + scale_color_hue(l=0.45)
  + theme_light() 
  + theme(
         axis_text_x = element_text(angle=80),
         figure_size=(12,8),
         legend_position="none"))

In [None]:
agg = calendar.groupby('event_type_2')['event_type_2'].agg(['count']).reset_index()
(ggplot(data = agg) 
  + geom_bar(aes(x='event_type_2', y='count'), fill='#49beb7', color='black', stat='identity')
  + scale_color_hue(l=0.45)
  + theme_light() 
  + theme(
         axis_text_x = element_text(angle=80),
         figure_size=(12,8),
         legend_position="none"))

### File 2: "sell_prices.csv"
Contains information about the price of the products sold per store and date.
- store_id: The id of the store where the product is sold.
- item_id: The id of the product.
- wm_yr_wk: The id of the week.
- sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If a price is not available, the product was not sold during the examined week. Note that although prices are constant within each week, they may change over time (in both the training and the test set).

In [None]:
print(sell.shape)
sell.head()

### File 3: "sales_train.csv"

Contains the historical daily unit sales data per product and store.
- item_id: The id of the product.
- dept_id: The id of the department the product belongs to.
- cat_id: The id of the category the product belongs to.
- store_id: The id of the store where the product is sold.
- state_id: The State where the store is located.
- d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29.

In [None]:
print(train.shape)
train.head()
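
The three files are linked through shared keys: the day columns of the sales file correspond to calendar days, and `wm_yr_wk` connects the calendar to the weekly prices. The cell below is a rough sketch of that join on toy data, not the only way to do it. It assumes that `calendar.csv` also carries a `d` column holding the day label (`d_1`, `d_2`, ...), which is not in the column list above; `to_long` and the `toy_*` frames are hypothetical names used only for illustration.

```python
import pandas as pd


def to_long(sales, calendar, prices):
    """Reshape wide daily sales to one row per series and day, then attach
    the calendar date and the weekly sell price."""
    # Every column that is not a day column (d_1, d_2, ...) identifies the series.
    id_cols = [c for c in sales.columns if not c.startswith("d_")]
    long = sales.melt(id_vars=id_cols, var_name="d", value_name="units")
    # Assumption: the calendar has a `d` column matching the day labels.
    long = long.merge(calendar[["d", "date", "wm_yr_wk"]], on="d", how="left")
    # Prices are weekly, so the join key is store, item and week id.
    long = long.merge(prices, on=["store_id", "item_id", "wm_yr_wk"], how="left")
    return long


# Toy frames that only mimic the structure of the real files.
toy_sales = pd.DataFrame({
    "item_id": ["A"], "dept_id": ["D1"], "cat_id": ["C1"],
    "store_id": ["S1"], "state_id": ["CA"],
    "d_1": [0], "d_2": [3],
})
toy_calendar = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "date": ["2011-01-29", "2011-01-30"],
    "wm_yr_wk": [11101, 11101],
})
toy_prices = pd.DataFrame({
    "store_id": ["S1"], "item_id": ["A"],
    "wm_yr_wk": [11101], "sell_price": [2.5],
})
toy_long = to_long(toy_sales, toy_calendar, toy_prices)
print(toy_long)
```

Left joins keep every sales row, so a missing `sell_price` after the merge marks a week in which the product was not on sale.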

### Submission File

Each row contains an id that is a concatenation of an item_id, a store_id, and the prediction interval, which is either validation (corresponding to the Public leaderboard) or evaluation (corresponding to the Private leaderboard). You are predicting 28 forecast days (F1-F28) of items sold for each row. For the validation rows, this corresponds to d_1914 - d_1941, and for the evaluation rows, this corresponds to d_1942 - d_1969. (Note: a month before the competition closes, the ground truth for the validation rows will be provided.)

In [None]:
sub.head()
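
As an illustration of the format described above, the cell below sketches how a block of forecasts could be turned into submission rows. It is a sketch under assumptions: `make_submission` is a hypothetical helper, the example item and store ids are placeholders, and the underscore separator in the id is assumed from the description rather than taken from the file.

```python
import numpy as np
import pandas as pd


def make_submission(item_store_ids, forecasts, interval="validation"):
    """Build submission rows with an `id` column followed by F1..F28.

    item_store_ids: list of (item_id, store_id) pairs, one per row.
    forecasts: array-like of shape (number of rows, 28).
    interval: "validation" or "evaluation".
    """
    forecasts = np.asarray(forecasts, dtype=float)
    if forecasts.shape != (len(item_store_ids), 28):
        raise ValueError("forecasts must have one row per id and 28 columns")
    f_cols = [f"F{i}" for i in range(1, 29)]
    out = pd.DataFrame(forecasts, columns=f_cols)
    # Assumption: the id parts are joined with underscores.
    ids = [f"{item}_{store}_{interval}" for item, store in item_store_ids]
    out.insert(0, "id", ids)
    return out


# Placeholder ids and an all-zero forecast, for illustration only.
toy_sub = make_submission([("HOBBIES_1_001", "CA_1")], np.zeros((1, 28)))
print(toy_sub.iloc[:, :4])
```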

## Evaluation
This competition uses a Weighted Root Mean Squared Scaled Error (WRMSSE). Extensive details about the metric, scaling, and weighting can be found in the [M5 Participants Guide.](https://mofc.unic.ac.cy/m5-competition/)

1. Forecasting horizon

The number of forecasts required, both for point and probabilistic forecasts, is h=28 days (4 weeks ahead).
The performance measures are first computed for each series separately by averaging their values across
the forecasting horizon and then averaged again across the series in a weighted fashion (see below) to
obtain the final scores. 

2. Point forecasts

The accuracy of the point forecasts will be evaluated using the Root Mean Squared Scaled Error (RMSSE),
which is a variant of the well-known Mean Absolute Scaled Error (MASE) proposed by Hyndman and
Koehler (2006). The measure is calculated for each series as follows:

![image.png](https://github.com/choco9966/Kaggle/blob/master/M5%20Forecasting/image/rmsse.PNG?raw=true)

where 𝑌𝑡 is the actual future value of the examined time series at point t, 𝑌𝑡_hat the generated forecast, n the length of the training sample (number of historical observations), and h the forecasting horizon. 
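
To make the definition concrete, here is a minimal NumPy sketch of the RMSSE for one series. It follows the symbols above: the numerator is the mean squared forecast error over the h future points, and the denominator is the mean squared one-step difference of the n training observations. The function name `rmsse` and the toy numbers are illustrative, and the sketch does not try to reproduce every detail of the organisers' implementation.

```python
import numpy as np


def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error for a single series.

    train: the n historical observations Y_1..Y_n.
    actual: the h actual future values Y_{n+1}..Y_{n+h}.
    forecast: the h forecasts for the same points.
    """
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Scale: mean squared one-step difference over the training sample,
    # i.e. (1 / (n - 1)) * sum over t = 2..n of (Y_t - Y_{t-1})^2.
    scale = np.mean(np.diff(train) ** 2)
    if scale == 0:
        raise ValueError("training series is constant, so the scale is zero")
    # Mean squared forecast error over the horizon, divided by the scale.
    mse = np.mean((actual - forecast) ** 2)
    return float(np.sqrt(mse / scale))


# Toy example: every one-step difference in the history is 1, so the scale is 1.
history = [1, 2, 3, 4, 5]
print(rmsse(history, actual=[6, 7], forecast=[5, 8]))
```

Because the error is divided by the in-sample variability of the series itself, series with very different sales volumes become comparable before they are weighted.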

After estimating the RMSSE for all 42,840 time series of the competition, the participating methods will be ranked using the Weighted RMSSE (WRMSSE), as described later in this Guide, using the following formula:

![image.png](https://github.com/choco9966/Kaggle/blob/master/M5%20Forecasting/image/wrmsse.PNG?raw=true)

where 𝑤𝑖 is the weight of the 𝑖𝑡ℎ series of the competition. A lower WRMSSE score is better.

Note that the weight of each series will be computed based on the last 28 observations of the training sample of the dataset, i.e., the cumulative actual dollar sales that each series displayed in that particular period (sum of units sold multiplied by their respective price). An indicative example for computing the WRMSSE will be available on the [GitHub repository](https://github.com/Mcompetitions) of the competition.
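
The weighting step can be sketched in a few lines. The cell below assumes one flat set of series and normalises their dollar sales over the last 28 training days so that the weights sum to 1; how the official computation distributes weight across the aggregation levels is not covered here. `dollar_sales_weights`, `wrmsse` and the toy arrays are hypothetical names and values for illustration only.

```python
import numpy as np


def dollar_sales_weights(units_last_28, prices_last_28):
    """Weight per series: its share of total dollar sales over the last 28 days.

    Both inputs have shape (number of series, 28).
    """
    units = np.asarray(units_last_28, dtype=float)
    prices = np.asarray(prices_last_28, dtype=float)
    # Dollar sales = units sold multiplied by the respective price, summed over days.
    dollars = (units * prices).sum(axis=1)
    return dollars / dollars.sum()


def wrmsse(weights, rmsse_values):
    """Weighted sum of per-series RMSSE values."""
    return float(np.dot(weights, rmsse_values))


# Toy example with two series.
toy_units = np.vstack([np.full(28, 1.0), np.full(28, 2.0)])
toy_price = np.vstack([np.full(28, 3.0), np.full(28, 0.5)])
toy_weights = dollar_sales_weights(toy_units, toy_price)  # dollar sales 84 and 28
print(toy_weights, wrmsse(toy_weights, [1.0, 2.0]))
```

In the toy example the first series earns three times the revenue of the second, so its error counts three times as much in the final score.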

## Previous competitions
- M4 : https://github.com/Mcompetitions/M4-methods
- Discussion by RDizzl3 : https://www.kaggle.com/c/m5-forecasting-accuracy/discussion/133469

## Benchmark models
code : https://github.com/Mcompetitions/M4-methods

- Naive
- Seasonal Naive
- Simple Exponential Smoothing 
- Moving Averages 
- Croston’s method
- Optimized Croston’s method
- Syntetos-Boylan Approximation
- Teunter-Syntetos-Babai method
- Aggregate-Disaggregate Intermittent Demand Approach
- Intermittent Multiple Aggregation Prediction Algorithm
- Exponential Smoothing
- Exponential Smoothing with eXplanatory variables 
- AutoRegressive Integrated Moving Average
- AutoRegressive Integrated Moving Average with eXplanatory variables 
- Multi-Layer Perceptron
- Random Forest
- Global Multi-Layer Perceptron
- Global Random Forest (GRF)
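
The first two benchmarks in the list are simple enough to sketch directly. The cell below is an illustrative NumPy version, not the organisers' code: the naive forecast repeats the last observed value, and the seasonal naive forecast repeats the last full season. The weekly season length of 7 is an assumption that suits daily data, and both function names are hypothetical.

```python
import numpy as np


def naive_forecast(history, h=28):
    """Naive: repeat the last observed value for all h future days."""
    history = np.asarray(history, dtype=float)
    return np.repeat(history[-1], h)


def seasonal_naive_forecast(history, h=28, season=7):
    """Seasonal naive: repeat the last full season (assumed weekly) over the horizon."""
    history = np.asarray(history, dtype=float)
    if len(history) < season:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season:]
    reps = int(np.ceil(h / season))
    # Tile the last season and cut it to exactly h values.
    return np.tile(last_season, reps)[:h]


# Toy history: two weeks of daily sales, 1, 2, ..., 14.
toy_history = np.arange(1, 15)
fc_naive = naive_forecast(toy_history)
fc_snaive = seasonal_naive_forecast(toy_history)
print(fc_naive[:3], fc_snaive[:8])
```

Either forecast can serve as a quick baseline to compare against before trying the heavier models in the list.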