<a id='0'></a>
# <p style="background-color:skyblue; font-family:newtimeroman; font-size:250%; text-align:center; border-radius: 15px 50px;">M5 Forecasting🖋📝 - EDA📚 & Model Building🎯 </p>


<a id='1'></a>
# <p style="background-color:red; font-family:newtimeroman; font-size:120%; text-align:center; border-radius: 10px 25px;">Table of Contents</p>
* [Overview 🧐](#overview)
* [Importing Necessary Libraries](#libraries)
* [Load the data 📃](#load)
* [Data Exploration 📊](#explore)
* [Data Processing and Analysis 📷](#processing)
* [Model Training  🌒](#mt)
    * [Hyperparameter Tuning](#ht)


<a id="overview"></a>
# Overview 📃

Note: This is one of the two complementary competitions that together comprise the M5 forecasting challenge. Can you estimate, as precisely as possible, the point forecasts of the unit sales of various products sold in the USA by Walmart? If you are interested in estimating the uncertainty distribution of the realized values of the same series, be sure to check out its companion competition.

How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may result in you carrying around an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods you’re also challenged to use machine learning to improve forecast accuracy.

The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate the levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s.

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste and be able to appreciate uncertainty and its risk implications.

Acknowledgements
Additional thanks go to other partner organizations and prize sponsors, National Technical University of Athens (NTUA), INSEAD, Google, Uber and IIF.

<a id='libraries'></a>
# <p style="background-color:skyblue; font-family:newtimeroman; font-size:250%; text-align:center; border-radius: 15px 50px;">1. Importing necessary modules and libraries📚</p>

In [None]:
# Import all necessary libraries
import gc
import os
import random
import time
import warnings
import multiprocessing as mp
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from pylab import rcParams
from tqdm.notebook import tqdm

# Plotly: interactive visualization
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# statsmodels: classical time-series tools
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf  # autocorrelation plots
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt
from statsmodels.tsa.arima.model import ARIMA  # the old statsmodels.tsa.arima_model path was removed in statsmodels 0.13
from statsmodels.tsa.stattools import adfuller  # stationarity test
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Machine-learning libraries
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from xgboost import XGBRegressor

warnings.filterwarnings("ignore")

%matplotlib inline

<a id='load'></a>
# <p style="background-color:skyblue; font-family:newtimeroman; font-size:250%; text-align:center; border-radius: 15px 50px;">Load the data</p>

In [None]:
# Read the datasets
calender = pd.read_csv('../input/m5-forecasting-accuracy/calendar.csv')
train_sales = pd.read_csv('../input/m5-forecasting-accuracy/sales_train_evaluation.csv')
df_val = pd.read_csv('../input/m5-forecasting-accuracy/sales_train_validation.csv')
sell_prices = pd.read_csv('../input/m5-forecasting-accuracy/sell_prices.csv')

### Inspecting Data

In [None]:
# Inspect column types and memory usage
train_sales.info()

In [None]:
calender.info()

In [None]:
sell_prices.info()

In [None]:
train_sales.head()

## Observations
* The "d_x" columns contain many zeros. Each "d_x" column holds the units sold on a given day, so a zero means that the item was either unavailable that day or had no demand.
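
To put a number on this observation, the share of zero entries can be computed directly from the daily columns. The snippet below is a minimal sketch on a made-up two-item table (`toy_sales` and its values are hypothetical, not taken from the dataset); the same expressions should apply to `train_sales` before it is reshaped.

```python
import pandas as pd

# Toy stand-in for train_sales: two item rows, five daily columns (hypothetical values).
toy_sales = pd.DataFrame({
    "id": ["ITEM_A", "ITEM_B"],
    "d_1": [0, 2],
    "d_2": [0, 0],
    "d_3": [1, 0],
    "d_4": [0, 3],
    "d_5": [0, 0],
})

# Select the daily columns by prefix, as in the real table.
day_cols = [c for c in toy_sales.columns if c.startswith("d_")]

# Share of zero entries per item, and across the whole table.
zero_share_per_item = (toy_sales[day_cols] == 0).mean(axis=1)
overall_zero_share = float((toy_sales[day_cols] == 0).to_numpy().mean())

print(zero_share_per_item.tolist())
print(overall_zero_share)
```

A high zero share is typical of intermittent demand, which is one reason per-item series look so noisy later in the notebook.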

In [None]:
# print first 5 rows
calender.head()

In [None]:
sell_prices.head()

<a id='explore'></a>
# <p style="background-color:skyblue; font-family:newtimeroman; font-size:250%; text-align:center; border-radius: 15px 50px;">Data Exploration</p>

## Let's Check Null Values

In [None]:
train_sales.isnull().sum().sort_values(ascending = False)

In [None]:
sell_prices.isnull().sum().sort_values(ascending = False)

In [None]:
calender.isnull().sum().sort_values(ascending = False)

In [None]:
holiday = ['NewYear', 'OrthodoxChristmas', 'MartinLutherKingDay', 'SuperBowl', 'PresidentsDay', 'StPatricksDay', 'Easter', 'Cinco De Mayo', 'IndependenceDay', 'EidAlAdha', 'Thanksgiving', 'Christmas']
weekend = ['Saturday', 'Sunday']

def is_holiday(x):
    # 1 if the event name is in the holiday list, else 0 (NaN counts as 0)
    return int(x in holiday)

def is_weekend(x):
    # 1 for Saturday or Sunday, else 0
    return int(x in weekend)

In [None]:
calender['is_holiday_1'] = calender['event_name_1'].apply(is_holiday)
calender['is_holiday_2'] = calender['event_name_2'].apply(is_holiday)
calender['is_holiday'] = calender[['is_holiday_1','is_holiday_2']].max(axis=1)
calender['is_weekend'] = calender['weekday'].apply(is_weekend)

In [None]:
# print first 5 rows
calender.head()

In [None]:
calender = calender.drop(['weekday', 'wday', 'month', 'year', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2'], axis='columns')

In [None]:
sell_prices.describe()

In [None]:
# Columns d_1 ... d_1851 are dropped below, so only the last 90 days (d_1852 ... d_1941) remain
del_col = ['d_' + str(x + 1) for x in range(1851)]

In [None]:
train_sales =train_sales.drop(del_col, axis='columns')

## Join the evaluation sales with the calendar and prices

In [None]:
train_sales = train_sales.melt(['id','item_id','dept_id','cat_id','store_id','state_id'], var_name='d', value_name='qty')
print(train_sales.shape)
train_sales.head()

In [None]:
train_sales = pd.merge(train_sales, calender, how='left', on='d')
train_sales.head()

In [None]:
train_sales = pd.merge(train_sales, sell_prices, how='left', on=['item_id', 'wm_yr_wk', 'store_id'])
train_sales.head()
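
The reshape-and-join steps above can be easier to follow on a tiny example. The sketch below uses three hypothetical miniature tables (`wide`, `cal`, `prices`, with made-up values) and the same `melt` and `merge` calls, so the effect on shape and columns is visible at a glance.

```python
import pandas as pd

# Hypothetical miniature versions of the sales, calendar and price tables.
wide = pd.DataFrame({
    "id": ["A_CA_1", "B_CA_1"],
    "item_id": ["A", "B"],
    "store_id": ["CA_1", "CA_1"],
    "d_1": [0, 2],
    "d_2": [1, 0],
})
cal = pd.DataFrame({
    "d": ["d_1", "d_2"],
    "wm_yr_wk": [11101, 11101],
    "date": ["2011-01-29", "2011-01-30"],
})
prices = pd.DataFrame({
    "store_id": ["CA_1", "CA_1"],
    "item_id": ["A", "B"],
    "wm_yr_wk": [11101, 11101],
    "sell_price": [1.5, 2.0],
})

# Wide -> long: one row per (item, day) instead of one column per day.
long_df = wide.melt(id_vars=["id", "item_id", "store_id"],
                    var_name="d", value_name="qty")

# Attach the calendar week for each day, then the weekly price for that item and store.
long_df = long_df.merge(cal, how="left", on="d")
long_df = long_df.merge(prices, how="left", on=["item_id", "wm_yr_wk", "store_id"])

print(long_df.shape)
print(long_df)
```

Left joins keep every sales row even when no price exists for that week, in which case `sell_price` is left as NaN.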

In [None]:
#print shape of the dataset 
train_sales.shape

In [None]:
train_sales.tail()

In [None]:
train_sales.head()

In [None]:
train_sales_test = train_sales.query('d == "d_1852"')

In [None]:
train_sales_test.head()

In [None]:
train_sales_test = train_sales_test[['id', 'store_id', 'item_id', 'dept_id', 'cat_id', 'state_id', 'd', 'qty', 'sell_price']]

In [None]:
train_sales_test.head()

In [None]:
train_sales_test.shape

In [None]:
train_sales_test['qty'] = 0  # placeholder quantity for the days to be forecast

In [None]:
tmp_df =train_sales_test

In [None]:
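# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
# As in the original loop, 28 copies are added to the starting frame (29 blocks in total).
train_sales_test = pd.concat([train_sales_test] + [tmp_df] * 28)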

In [None]:
train_sales_test =train_sales_test.reset_index(drop=True)

In [None]:
train_sales_test.tail()

In [None]:

# Label each block of 30,490 rows (one row per item-store series) with consecutive days, starting at d_1942
lst_d = ['d_' + str(idx // 30490 + 1942) for idx in train_sales_test.index]

lst_d[:5]

In [None]:
train_sales_test['d'] = lst_d

In [None]:
train_sales_test.head()

In [None]:
# print last 5 rows in dataset
train_sales_test.tail()

In [None]:
# shape of dataset
train_sales_test.shape

In [None]:
train_sales_test = pd.merge(train_sales_test,calender, how='left', on='d')

In [None]:
train_sales_test = pd.merge(train_sales_test, sell_prices ,how='left', on=['item_id', 'wm_yr_wk', 'store_id'])

In [None]:
train_sales_test.head()

<a id='processing'></a>
# <p style="background-color:skyblue; font-family:newtimeroman; font-size:250%; text-align:center; border-radius: 15px 50px;">Data Processing and Analysis</p>

In [None]:
# Number of item series per category (these are row counts, not units sold)
df_val.groupby('cat_id').count()['id'].sort_values().plot(kind='barh',figsize=(15,2), title='Item Count By Category',width=0.5,color='orange')
plt.show()

In [None]:
# Number of item series per department
df_val.groupby('dept_id').count()['id'].sort_values().plot(kind='barh',figsize=(15,3), title='Item Count By Department',color='red')
plt.show()

In [None]:
# Number of item series per state
df_val.groupby('state_id').count()['id'].sort_values().plot(kind='barh',figsize=(15,2), title='Item Count By State',color='violet')
plt.show()

In [None]:
train_sales_test['state_id'].value_counts().plot(kind = 'bar',cmap = 'BrBG')
plt.rcParams['axes.facecolor'] = 'orange'
plt.title("Row count per state")

In [None]:
train_sales_test['cat_id'].value_counts().plot(kind = 'bar')
plt.rcParams['axes.facecolor'] = 'blue'
plt.title("Row count per category")

In [None]:
ids = sorted(list(set(df_val['id'])))
d_cols = [c for c in df_val.columns if 'd_' in c]
x_1 = df_val.loc[df_val['id'] == ids[0]].set_index('id')[d_cols].values[0][:90]
x_2 = df_val.loc[df_val['id'] == ids[4]].set_index('id')[d_cols].values[0][1300:1400]
x_3 = df_val.loc[df_val['id'] == ids[65]].set_index('id')[d_cols].values[0][350:450]
fig = make_subplots(rows=3, cols=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_1)), y=x_1, showlegend=False,
                    mode='lines+markers', name="First sample",
                         marker=dict(color="orange")),
             row=1, col=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_2)), y=x_2, showlegend=False,
                    mode='lines+markers', name="Second sample",
                         marker=dict(color="red")),
             row=2, col=1)

fig.add_trace(go.Scatter(x=np.arange(len(x_3)), y=x_3, showlegend=False,
                    mode='lines+markers', name="Third sample",
                         marker=dict(color="violet")),
             row=3, col=1)

fig.update_layout(height=1200, width=800, title_text="Sample sales snippets")
fig.show()

> The samples above show that item-level sales are erratic and volatile. Sales can stay at zero for several days in a row and, at other times, hold at a peak for a few days. We therefore need some form of "denoising" to expose the underlying trends before making forecasts.
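
One simple form of denoising is a moving average, which is also what the rolling plots in the following sections rely on. The sketch below applies a centered 3-day window to a made-up series (`sales` is hypothetical); the window length is an illustrative assumption, not a tuned value.

```python
import pandas as pd

# Hypothetical daily unit sales for one item: runs of zeros and short spikes.
sales = pd.Series([0, 0, 3, 0, 0, 0, 6, 0, 0, 3, 0, 0])

# Centered 3-day moving average; min_periods=1 keeps the edge values defined.
smoothed = sales.rolling(window=3, center=True, min_periods=1).mean()

print(smoothed.round(2).tolist())
```

A longer window gives a smoother curve at the cost of blurring short-lived events such as promotions or holidays.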

### Rolling Average Sales vs. Time for every store

In [None]:
past_sales = df_val.set_index('id')[d_cols] \
    .T \
    .merge(calender.set_index('d')['date'],
           left_index=True,
           right_index=True,
            validate='1:1') \
    .set_index('date')

store_list = sell_prices['store_id'].unique()
means = []
fig = go.Figure()
for s in store_list:
    store_items = [c for c in past_sales.columns if s in c]
    data = past_sales[store_items].sum(axis=1).rolling(90).mean()
    means.append(np.mean(past_sales[store_items].sum(axis=1)))
    fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (per store)")

## Average sales vs. Store name

In [None]:
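# Build the frame from a dict so that "Mean sales" stays numeric
# (transposing a mixed list would turn every value into a string)
df = pd.DataFrame({"Mean sales": means, "Store name": store_list})
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Average sales vs. Store name")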

* From the above graph, we can see the same trends: California stores have the highest variance and mean sales among all the stores in the dataset.
* The exception is CA_4, which has the lowest variance and mean sales of any store in the dataset.
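
Claims about mean and variance can be checked numerically rather than read off a chart. The following is a minimal sketch on hypothetical daily totals (`daily_totals` and its numbers are invented for illustration); with the real data, the per-store sums computed in the loop above would take their place.

```python
import pandas as pd

# Hypothetical daily totals for three stores (columns) over six days (rows).
daily_totals = pd.DataFrame({
    "CA_3": [50, 70, 60, 90, 40, 50],
    "CA_4": [20, 22, 19, 21, 20, 18],
    "WI_1": [30, 35, 28, 33, 31, 29],
})

# Mean and variance of daily sales per store, sorted by mean (highest first).
summary = daily_totals.agg(["mean", "var"]).T.sort_values("mean", ascending=False)

print(summary)
```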

## Rolling Average Sales vs. Time (California)

In [None]:
greens = ["mediumaquamarine", "orange", "red", "green"]
store_list = sell_prices['store_id'].unique()
fig = go.Figure()
means = []
stores = []
for i, s in enumerate(store_list):
    if "ca" in s or "CA" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        means.append(np.mean(past_sales[store_items].sum(axis=1)))
        stores.append(s)
        fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s, marker=dict(color=greens[i % len(greens)])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (California)")

* In the above graph, we can see the large disparity in sales among California stores. 
* The sales curves almost never intersect. This may indicate that certain "hubs" of development in California keep their lead over time, while other areas consistently trail them.
* The average sales in descending order are CA_3, CA_1, CA_2, CA_4: store CA_3 has the highest sales and store CA_4 the lowest.

## Mean sales vs. Store name (California)

In [None]:

# Build the frame from a dict so that "Mean sales" stays numeric
df = pd.DataFrame({"Mean sales": means, "Store name": stores})
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Mean sales vs. Store name", color_discrete_sequence=greens)


fig = go.Figure(data=[
    go.Bar(name='', x=stores, y=means, marker={'color' : greens})])

fig.update_layout(title="Mean sales vs. Store name (California)", yaxis=dict(title="Mean sales"), xaxis=dict(title="Store name"))
fig.update_layout(barmode='group')
fig.show()

In the above plots, we can see the same relationship. The store CA_3 has maximum sales while the store CA_4 has minimum sales.

## Rolling Average Sales vs. Time (Wisconsin)

In [None]:
purples = ["red", "violet", "purple", "indigo"]
store_list = sell_prices['store_id'].unique()
fig = go.Figure()
means = []
stores = []
for i, s in enumerate(store_list):
    if "wi" in s or "WI" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        means.append(np.mean(past_sales[store_items].sum(axis=1)))
        stores.append(s)
        fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s, marker=dict(color=purples[i%len(purples)])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (Wisconsin)")

In the above graph, we can see a very low disparity in sales among Wisconsin stores. The sales curves intersect each other very often. This may indicate that most parts of Wisconsin have a similar "development curve" and that there is a greater equity in development across the state. There are no specific "hotspots" or "hubs" of development. The average sales in descending order are WI_2, WI_3, WI_1. The store WI_2 has maximum sales while the store WI_1 has minimum sales.
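
How often two curves intersect can be counted instead of judged by eye. The sketch below counts sign changes in the difference between two series; `count_crossings` is a hypothetical helper written for this illustration, and the two input curves are made-up numbers rather than real rolling means.

```python
import numpy as np

def count_crossings(a, b):
    """Count how many times series a and b swap order (sign changes of a - b)."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    diff = diff[~np.isnan(diff)]      # rolling means start with NaN values
    signs = np.sign(diff)
    signs = signs[signs != 0]         # ignore exact ties
    return int(np.sum(signs[1:] != signs[:-1]))

# Hypothetical rolling-mean curves for two stores.
store_a = [10, 12, 11, 9, 8, 10, 13]
store_b = [11, 11, 12, 10, 7, 9, 14]

print(count_crossings(store_a, store_b))
```

Comparing this count across states would give a rough, quantitative version of the "curves intersect often" observation.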

## Mean sales vs. Store name (Wisconsin)

In [None]:
# Build the frame from a dict so that "Mean sales" stays numeric
df = pd.DataFrame({"Mean sales": means, "Store name": stores})
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Mean sales vs. Store name", color_discrete_sequence=purples)


fig = go.Figure(data=[
    go.Bar(name='', x=stores, y=means, marker={'color' : purples})])

fig.update_layout(title="Mean sales vs. Store name (Wisconsin)", yaxis=dict(title="Mean sales"), xaxis=dict(title="Store name"))
fig.update_layout(barmode='group')
fig.show()

## Rolling Average Sales vs. Time (Texas)

In [None]:
green = ["orange", "yellow", "seagreen"]
store_list = sell_prices['store_id'].unique()
fig = go.Figure()
means = []
stores = []
for i, s in enumerate(store_list):
    if "tx" in s or "TX" in s:
        store_items = [c for c in past_sales.columns if s in c]
        data = past_sales[store_items].sum(axis=1).rolling(90).mean()
        means.append(np.mean(past_sales[store_items].sum(axis=1)))
        stores.append(s)
        fig.add_trace(go.Scatter(x=np.arange(len(data)), y=data, name=s, marker=dict(color=green[i%len(green)])))
    
fig.update_layout(yaxis_title="Sales", xaxis_title="Time", title="Rolling Average Sales vs. Time (Texas)")

### Observations
> In the above graph:
* We can once again see a very low disparity in sales among Texas stores.
* The sales curves intersect each other often, albeit not as often as in Wisconsin. 
* This might once again indicate that most parts of Texas have a similar "development curve" and that there is a greater equity in development across the state. 
* The variance here is higher than in Wisconsin though, so there might be "hubs" of development in Texas as well, but not as pronounced as in California. 
* The average sales in descending order are TX_2, TX_3, TX_1. The store TX_2 has maximum sales while the store TX_1 has minimum sales.

## Mean sales vs. Store name (Texas)

In [None]:
# Build the frame from a dict so that "Mean sales" stays numeric
df = pd.DataFrame({"Mean sales": means, "Store name": stores})
px.bar(df, y="Mean sales", x="Store name", color="Store name", title="Mean sales vs. Store name", color_discrete_sequence=green)


fig = go.Figure(data=[
    go.Bar(name='', x=stores, y=means, marker={'color' : green})])

fig.update_layout(title="Mean sales vs. Store name (Texas)", yaxis=dict(title="Mean sales"), xaxis=dict(title="Store name"))
fig.update_layout(barmode='group')
fig.show()

### Observations
* In the above plots, we can see the same relationship. The store TX_2 has maximum sales while the store TX_1 has minimum sales.

In [None]:
import gc
del tmp_df
gc.collect()

In [None]:
train_sales = pd.get_dummies(data=train_sales, columns=['dept_id', 'cat_id', 'store_id', 'state_id'])
train_sales_test = pd.get_dummies(data=train_sales_test, columns=['dept_id', 'cat_id', 'store_id', 'state_id'])

In [None]:
train_sales.info()

In [None]:
train_sales_test.head(10).T

In [None]:
train_sales_test =train_sales_test.drop(['sell_price_x', 'snap_CA', 'snap_TX', 'snap_WI'], axis='columns')
train_sales_test = train_sales_test.rename(columns={'sell_price_y': 'sell_price'})
train_sales = train_sales.drop(['snap_CA', 'snap_TX', 'snap_WI'], axis='columns')                                   

In [None]:
train_sales.info()

In [None]:
train_sales_test.info()

<a id='mt'></a>
# <p style="background-color:skyblue; font-family:newtimeroman; font-size:250%; text-align:center; border-radius: 15px 50px;">Model Training</p>

In [None]:
from sklearn.model_selection import train_test_split


target_col = 'qty'


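# Identifier and date columns are not features; the target is excluded so it cannot leak into X
exclude_cols = ['id', 'item_id', 'd', 'date', 'wm_yr_wk', target_col]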


feature_cols = [col for col in train_sales.columns if col not in exclude_cols]


y = np.array(train_sales[target_col])
X = np.array(train_sales[feature_cols])

X_train, X_test, y_train, y_test = \
 train_test_split(X, y, test_size=0.3, random_state=1234)


# X_train1, X_train2, y_train1, y_train2 = \
#  train_test_split(X_train, y_train, test_size=0.3, random_state=1234)



## LightGBM

In [None]:
import lightgbm as lgb


lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test)

params = {
    'boosting_type': 'gbdt',
    'metric': 'rmse',
    'objective': 'regression',
    'n_jobs': -1,
    'seed': 236,
    'learning_rate': 0.01,
    'bagging_fraction': 0.75,
    'bagging_freq': 10, 
    'colsample_bytree': 0.75}

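# LightGBM 4.0 removed the early_stopping_rounds and verbose_eval arguments of lgb.train;
# the equivalent callbacks are used instead
model = lgb.train(params, lgb_train, num_boost_round=2500, valid_sets = [lgb_train, lgb_eval],
                  callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=100)])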

## Observations :
* In the recorded run, LightGBM reached a training RMSE of 0.1173 and a validation RMSE of 0.080.
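
For reference, the `rmse` metric set in `params` is the square root of the mean squared difference between actual and predicted quantities. The snippet below is a small sketch of that calculation on made-up numbers (`rmse`, `y_true` and `y_pred` here are illustrative, not outputs of the model above).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted unit sales for four item-days.
y_true = [0, 0, 2, 1]
y_pred = [0, 1, 2, 3]

print(rmse(y_true, y_pred))
```

Because the error is squared before averaging, a few large misses on high-volume days weigh more than many small misses on zero-sales days.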


In [None]:
pred = model.predict(train_sales_test[feature_cols])

In [None]:
len(pred)

In [None]:
train_sales_test['pred_qty'] = pred

In [None]:
train_sales_test

In [None]:
predictions = train_sales_test[['id', 'date', 'pred_qty']]
predictions = pd.pivot(predictions, index = 'id', columns = 'date', values = 'pred_qty').reset_index()
predictions

In [None]:
# Let's describe Predictions
predictions.describe()

In [None]:
predictions = predictions.drop(predictions.columns[1], axis=1)
predictions

In [None]:
predictions.columns = ['id'] + ['F' + str(i + 1) for i in range(28)]
predictions

In [None]:
# Keep the last 28 days: 28 days x 30,490 series = 853,720 rows out of 2,744,100
x = 2744100 - 28 * 30490
df_val = train_sales[x:]

In [None]:
predictions_v = df_val[['id', 'date', 'qty']]
predictions_v = pd.pivot(predictions_v, index = 'id', columns = 'date', values = 'qty').reset_index()
predictions_v

In [None]:
predictions_v['id'] = predictions['id'].apply(lambda x: x.replace('evaluation', 'validation'))
predictions_v.head()

In [None]:
predictions_v.columns = ['id'] + ['F' + str(i + 1) for i in range(28)]
predictions_v.head()

## LSTM model Training

In [None]:
#Feature Scaling
#Scale the features using min-max scaler in range 0-1
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
dt_scaled = sc.fit_transform(train_sales.drop(['id', 'item_id', 'd', 'date', 'wm_yr_wk'],axis=1))

In [None]:
timesteps = 14
startDay = 350
X_train = []
y_train = []
for i in range(timesteps, 1941 - startDay):
    X_train.append(dt_scaled[i-timesteps:i])
    y_train.append(dt_scaled[i][0:30490]) 
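    # Important: if extra features are added (like oneDayBeforeEvent),
    # use only the sales values as targets (we only predict sales);
    # this is why columns 0:30490 are chosen.
    # Note: the slice assumes one column per series (30,490 in total);
    # with fewer columns it simply returns the whole row.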
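
The windowing loop above turns a 2-D table into 3-D input for the LSTM. The sketch below repeats the same pattern on a tiny hypothetical array (`data`, with 10 time steps and 3 features) so that the resulting shapes are easy to verify.

```python
import numpy as np

# Hypothetical scaled table: 10 time steps (rows), 3 features per step (columns).
data = np.arange(30, dtype=float).reshape(10, 3)
timesteps = 4

X, y = [], []
for i in range(timesteps, len(data)):
    X.append(data[i - timesteps:i])   # the previous `timesteps` rows form one input window
    y.append(data[i])                 # the row right after the window is the target
X = np.array(X)
y = np.array(y)

print(X.shape, y.shape)
```

`X` has shape (samples, timesteps, features), which is the layout the `input_shape` argument of a Keras LSTM layer expects. This only models time if consecutive rows are consecutive time steps of the same series.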

In [None]:
del dt_scaled

In [None]:
#Convert to np array to be able to feed the LSTM model
X_train = np.array(X_train)
y_train = np.array(y_train)
print(X_train.shape)
print(y_train.shape)

In [None]:
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

# Initialising the RNN
regressor = Sequential()

# Adding the first LSTM layer and some Dropout regularisation
layer_1_units=50
regressor.add(LSTM(units = layer_1_units, return_sequences = True, input_shape = (X_train.shape[1], X_train.shape[2])))
regressor.add(Dropout(0.2))

# Adding a second LSTM layer and some Dropout regularisation
layer_2_units=400
regressor.add(LSTM(units = layer_2_units, return_sequences = True))
regressor.add(Dropout(0.2))

# Adding a third LSTM layer and some Dropout regularisation
layer_3_units=400
regressor.add(LSTM(units = layer_3_units))
regressor.add(Dropout(0.2))

# Adding the output layer
regressor.add(Dense(units = 29))

# Compiling the RNN
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fitting the RNN to the Training set
epoch_no=32
batch_size_RNN=44
regressor.fit(X_train, y_train, epochs = epoch_no, batch_size = batch_size_RNN)

## Observations 
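* Of the models trained here, the LSTM reports the lower loss: 0.0016 in the recorded run, which at first glance looks better than LightGBM.
* The two figures are not directly comparable, however. The LSTM loss is a mean squared error on min-max scaled values, while the LightGBM metric is an RMSE on raw `qty`.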
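
To compare the two models on the same footing, the LSTM output would need to be mapped back to units sold before computing an error. The snippet below is a sketch on made-up numbers: it applies the min-max formula by hand (the same `(x - min) / (max - min)` mapping that `MinMaxScaler` uses for the range 0 to 1), and `qty` and `pred_scaled` are hypothetical values rather than model output.

```python
import numpy as np

# Hypothetical raw quantities and the min-max scaling that maps them to [0, 1].
qty = np.array([0.0, 2.0, 5.0, 10.0])
q_min, q_max = qty.min(), qty.max()
qty_scaled = (qty - q_min) / (q_max - q_min)

# Hypothetical model output in the scaled space (each value off by 0.04).
pred_scaled = qty_scaled + np.array([0.04, -0.04, 0.04, -0.04])

# MSE in the scaled space, which is what the Keras loss reports.
mse_scaled = float(np.mean((pred_scaled - qty_scaled) ** 2))

# Undo the scaling, then compute RMSE in units sold.
pred_raw = pred_scaled * (q_max - q_min) + q_min
rmse_raw = float(np.sqrt(np.mean((pred_raw - qty) ** 2)))

print(mse_scaled, rmse_raw)
```

In this toy case a scaled MSE of about 0.0016 corresponds to an RMSE of about 0.4 units, which shows how much the choice of scale and metric changes the headline number.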