M5 Forecasting - Accuracy

Note: This is one of the two complementary competitions that together comprise the M5 forecasting challenge. Can you estimate, as precisely as possible, the point forecasts of the unit sales of various products sold in the USA by Walmart?

How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may result in you carrying around an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods you’re also challenged to use machine learning to improve forecast accuracy.

The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate the levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s.

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste and be able to appreciate uncertainty and its risk implications.

Evaluation:
This competition uses a Weighted Root Mean Squared Scaled Error (WRMSSE). Extensive details about the metric, scaling, and weighting can be found in the [M5 Participants Guide](https://mofc.unic.ac.cy/m5-competition/).
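
For quick reference, the per-series error and its weighted aggregate can be written as below. This is a summary of the definition in the guide, not a replacement for it: $n$ is the length of the training sample, $h = 28$ is the forecast horizon, $Y_t$ is the actual demand, $\hat{Y}_t$ is the forecast, and $w_i$ is the weight that the guide assigns to series $i$.

```latex
\mathrm{RMSSE} = \sqrt{\frac{\dfrac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t-\hat{Y}_t\right)^2}{\dfrac{1}{n-1}\sum_{t=2}^{n}\left(Y_t-Y_{t-1}\right)^2}},
\qquad
\mathrm{WRMSSE} = \sum_{i} w_i \,\mathrm{RMSSE}_i
```

The denominator is the in-sample mean squared error of the one-step naive forecast, so a score below 1 means the forecast beats that naive benchmark on the same scale.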

It is important to note that this notebook should be read together with the notebooks "M5 Forecasting - Accuracy Data study.ipynb" and "M5 Forecasting - Accuracy  - Variables". All the necessary data verification, the calculation of the parameters of the ARIMA(p, i, q), and the construction of the variables are done in those two notebooks. This notebook has the correlations between variables, the ARIMA model with exogenous variables, and the forecast.

This notebook is similar to "M5 Forecasting - Accuracy - Model". However, instead of using ARIMA(p, i, q), we will use LightGBM.

In [1]:
import gc
import time
import pandas as pd
import numpy as np
import re
import seaborn as sns

import lightgbm as lgb

In [2]:
sales_train_validation = pd.read_csv('sales_train_validation.csv') #Only used to take the products ids.
products_ids = sales_train_validation['id'].unique()
products_ids_size = len(products_ids)
del sales_train_validation

In [3]:
path = 'C:\\Users\\maxwi\\Python\\Kaggle\\M5 Forecasting - Accuracy\\Modelo 2\\data_by_id\\'
path_holdout = 'C:\\Users\\maxwi\\Python\\Kaggle\\M5 Forecasting - Accuracy\\Modelo 2\\holdout_by_id\\'

In [3]:
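#Root Mean Squared Scaled Error (RMSSE)
def RMSSE(train, actual, forecast, h = 28):
    train = np.asarray(train, dtype = float)
    actual = np.asarray(actual, dtype = float)
    forecast = np.asarray(forecast, dtype = float)

    #Numerator: mean squared forecast error over the horizon h
    mse_forecast = np.sum((actual - forecast)**2) / h
    #Denominator: mean squared error of the one-step naive forecast in the training sample
    mse_naive = np.sum(np.diff(train)**2) / (len(train) - 1)

    rmsse = round(np.sqrt(mse_forecast / mse_naive), 4)
    return rmsse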
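
A small worked example may help to check the metric by hand. The sketch below is a plain-Python version with the denominator averaged over the n - 1 training differences; the series is made up, and the helper name `rmsse` is only for illustration. It ignores any trimming of leading zeros in the training sample.

```python
import math

def rmsse(train, actual, forecast):
    """Forecast MSE divided by the in-sample one-step naive MSE, then square-rooted."""
    h = len(actual)
    n = len(train)
    mse_forecast = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / h
    mse_naive = sum((train[t] - train[t - 1]) ** 2 for t in range(1, n)) / (n - 1)
    return math.sqrt(mse_forecast / mse_naive)

# Toy series (made-up numbers, not M5 data)
train = [0, 2, 1, 3, 2]      # squared differences: 4, 1, 4, 1 -> mean 2.5
actual = [2, 4]
forecast = [3, 3]            # squared errors: 1, 1 -> mean 1.0

score = rmsse(train, actual, forecast)
print(round(score, 4))       # sqrt(1.0 / 2.5)
```

Here the forecast error is smaller than the typical day-to-day change in the training data, so the score falls below 1.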

In [5]:
#Define variables. We have a total of 206 variables (42 continuous + 164 categorical).

#Continuous variables. We have 42 continuous variables.
continous_variables = ['month', 'wday',
    'demand_lag_t1', 'demand_lag_t2', 'demand_lag_t3', 'demand_lag_t4', 'demand_lag_t5',
    'demand_lag_t6', 'demand_lag_t7', 'demand_lag_t8', 'demand_lag_t9', 'demand_lag_t10', 'demand_lag_t11', 'demand_lag_t12',
    'demand_lag_t13', 'demand_lag_t14', 'demand_lag_t15', 'demand_lag_t16', 'demand_lag_t17', 'demand_lag_t18',
    'demand_lag_t19', 'demand_lag_t20', 'demand_lag_t21', 'demand_lag_t22', 'demand_lag_t23', 'demand_lag_t24',
    'demand_lag_t25', 'demand_lag_t26', 'demand_lag_t27', 'demand_lag_t28',
    'demand_rolling_mean_t28_7', 'demand_rolling_mean_t28_30', 'demand_rolling_mean_t28_90', 'demand_rolling_mean_t28_180',
    'demand_rolling_mean_t28_365', 'demand_rolling_std_t28_7', 'demand_rolling_std_t28_30', 'demand_rolling_std_t28_90', 
    'demand_rolling_std_t28_180', 'demand_rolling_std_t28_365', 'demand_rolling_skew_t28_30','demand_rolling_kurt_t28_30']

#Categorical variables. We have 164 categoric variables.
categoric_variables = ['season', 'month_fase', 'week_fase', 'SuperBowl',
     'ValentinesDay', 'PresidentsDay', 'LentStart', 'LentWeek2',
     'StPatricksDay', 'Purim End', 'OrthodoxEaster', 'Pesach End',
     'Cinco De Mayo', "Mother's day", 'MemorialDay', 'NBAFinalsStart',
     'NBAFinalsEnd', "Father's day", 'IndependenceDay', 'Ramadan starts',
     'Eid al-Fitr', 'LaborDay', 'ColumbusDay', 'Halloween',
     'EidAlAdha', 'VeteransDay', 'Thanksgiving', 'Christmas',
     'Chanukah End', 'NewYear', 'OrthodoxChristmas', 'MartinLutherKingDay',
     'Easter', 'Sporting', 'Cultural', 'National',
     'Religious', 'snap', 'snap_other', 'snap_only_other',
     'SuperBowl_near', 'SuperBowl_week', 'SuperBowl_weekend', 'ValentinesDay_near',
     'ValentinesDay_week', 'ValentinesDay_weekend', 'PresidentsDay_near', 'PresidentsDay_week',
     'PresidentsDay_weekend', 'LentStart_near', 'LentStart_week', 'LentStart_weekend',
     'LentWeek2_near', 'LentWeek2_week', 'LentWeek2_weekend', 'StPatricksDay_near',
     'StPatricksDay_week', 'StPatricksDay_weekend', 'Purim End_near', 'Purim End_week',
     'Purim End_weekend', 'OrthodoxEaster_near', 'OrthodoxEaster_week', 'OrthodoxEaster_weekend',
     'Pesach End_near', 'Pesach End_week', 'Pesach End_weekend', 'Cinco De Mayo_near',
     'Cinco De Mayo_week', 'Cinco De Mayo_weekend', "Mother's day_near", "Mother's day_week",
     "Mother's day_weekend", 'MemorialDay_near', 'MemorialDay_week', 'MemorialDay_weekend',
     'NBAFinalsStart_near', 'NBAFinalsStart_week', 'NBAFinalsStart_weekend', 'NBAFinalsEnd_near',
     'NBAFinalsEnd_week', 'NBAFinalsEnd_weekend', "Father's day_near", "Father's day_week",
     "Father's day_weekend", 'IndependenceDay_near', 'IndependenceDay_week', 'IndependenceDay_weekend',
     'Ramadan starts_near', 'Ramadan starts_week', 'Ramadan starts_weekend', 'Eid al-Fitr_near',
     'Eid al-Fitr_week', 'Eid al-Fitr_weekend', 'LaborDay_near', 'LaborDay_week',
     'LaborDay_weekend', 'ColumbusDay_near', 'ColumbusDay_week', 'ColumbusDay_weekend',
     'Halloween_near', 'Halloween_week', 'Halloween_weekend', 'EidAlAdha_near',
     'EidAlAdha_week', 'EidAlAdha_weekend', 'VeteransDay_near', 'VeteransDay_week',
     'VeteransDay_weekend', 'Thanksgiving_near', 'Thanksgiving_week', 'Thanksgiving_weekend',
     'Christmas_near', 'Christmas_week', 'Christmas_weekend', 'Chanukah End_near',
     'Chanukah End_week', 'Chanukah End_weekend', 'NewYear_near', 'NewYear_week',
     'NewYear_weekend', 'OrthodoxChristmas_near', 'OrthodoxChristmas_week', 'OrthodoxChristmas_weekend',
     'MartinLutherKingDay_near', 'MartinLutherKingDay_week', 'MartinLutherKingDay_weekend', 'Easter_near',
     'Easter_week', 'Easter_weekend', 'Sporting_near', 'Sporting_week',
     'Sporting_weekend', 'Cultural_near', 'Cultural_week', 'Cultural_weekend',
     'National_near', 'National_week', 'National_weekend', 'Religious_near',
     'Religious_week', 'Religious_weekend', 'price_change_t1', 'price_change_t1_bin',
     'price_change_max_t4', 'price_change_max_t13', 'price_change_max_t26', 'price_change_max_t52',
     'price_change_max_t4_bin', 'price_change_max_t13_bin', 'price_change_max_t26_bin', 'price_change_max_t52_bin',
     'rolling_price_std_t4', 'rolling_price_std_t13', 'rolling_price_std_t26', 'rolling_price_std_t52',
     'price_change_mean_t4', 'price_change_mean_t13', 'price_change_mean_t26', 'price_change_mean_t52',
     'price_change_mean_t4_bin', 'price_change_mean_t13_bin', 'price_change_mean_t26_bin', 'price_change_mean_t52_bin']

variables_all = continous_variables + categoric_variables
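
The variables themselves are built in the companion notebook, so the following is only a sketch of how the lag and rolling names above could be read. It assumes that `demand_lag_tK` is the demand shifted by K days and that `demand_rolling_*_t28_W` is a W-day window computed on the demand shifted by 28 days, which is consistent with how the forecast loop in this notebook refreshes them. The toy series is made up.

```python
import pandas as pd

# Toy demand series 1, 2, ..., 40 (made-up values, not M5 data)
demand = pd.Series(range(1, 41), dtype=float)

features = pd.DataFrame({"demand": demand})

# demand_lag_tK: demand observed K days earlier
for k in (1, 7, 28):
    features[f"demand_lag_t{k}"] = demand.shift(k)

# demand_rolling_*_t28_W: statistic over a W-day window that ends 28 days back
shifted = demand.shift(28)
features["demand_rolling_mean_t28_7"] = shifted.rolling(7).mean()
features["demand_rolling_std_t28_7"] = shifted.rolling(7).std()

# Show the features of the last day as a column
print(features.tail(1).T)
```

Shifting by 28 days before taking the window means that every rolling statistic only uses demand that is already known when the 28-day forecast starts.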

In [None]:
stoped_at = 'HOBBIES_1_001_CA_1_validation'  #  Used to continue the code from the last product saved. For safety, we will run the last product saved again.
last_product = 'FOODS_3_827_WI_3_validation'
stoped_at_index = np.where(products_ids == stoped_at)[0][0]
last_product_index = np.where(products_ids == last_product)[0][0]
products_ids_size =  last_product_index - stoped_at_index + 1

progress = 0  #Useful to see the progress of the code.
progress_1000 = 1
start = time.time()

#for product in products_ids[stoped_at_index : last_product_index + 1]:
for product in products_ids[stoped_at_index : ]:
    data_variables = pd.read_hdf(path + product + '.h5')
    data_variables.reset_index(inplace = True)
    
    #Set type of variables as categoric
    for e in categoric_variables:
        data_variables[e] = data_variables[e].astype('category')
        
    #Set parameter for the Lightgbm model
    observations = len(data_variables)
    lgb_params = {
        "objective": "poisson",
        "metric": "rmse",
        "force_row_wise": True,
        "bagging_freq": 1,
        "learning_rate": 0.01,
        "lambda_l2": 0.1,
        'verbosity': 1,
        'num_iterations': 2000,
        #'early_stopping_round': 200,
        'num_leaves': 290,
        "min_data_in_leaf": int(observations*0.10),
        'seed': 1
    }
    
    train = data_variables[0:-56][variables_all].copy()
    test = data_variables[-56:][variables_all].copy()
    test.reset_index(drop = True, inplace = True)
    
    train_demand = data_variables[0:-56]['demand'].copy()
    demand_aux = train_demand.copy()  #used to recalculate some variables based on demand
    
    lgb_train = lgb.Dataset(data = train, label = train_demand, feature_name = variables_all)
    
    model = lgb.train(train_set = lgb_train, params = lgb_params)
    
    #Forecast
    predicted_demand = []
    for i in range(0, 56):
        predicted_demand_aux = model.predict(test[i : i + 1])
        predicted_demand.append(predicted_demand_aux[0])
        #Series.append was removed in pandas 2.0, so we use pd.concat instead
        demand_aux = pd.concat([demand_aux, pd.Series(predicted_demand_aux[0])], ignore_index = True)
        
        #Now we need to redefine the variables of the next row that refer to the lagged demand.
        #Only the lags that fall inside the forecast window (at most 28) take predicted values.
        for j in range(1, min(i + 1, 28) + 1):
            test.loc[i + 1: i + 2, 'demand_lag_t' + str(j)] = predicted_demand[-j]

        #The rolling statistics only need to be recalculated once per step.
        #It needs a shift of 27 because we are setting the value for the next period.
        demand_shifted = demand_aux.shift(27)
        for w in [7, 30, 90, 180, 365]:
            test.loc[i + 1: i + 2, 'demand_rolling_mean_t28_' + str(w)] = demand_shifted.rolling(w).mean().iloc[-1]
            test.loc[i + 1: i + 2, 'demand_rolling_std_t28_' + str(w)] = demand_shifted.rolling(w).std().iloc[-1]
        test.loc[i + 1: i + 2, 'demand_rolling_skew_t28_30'] = demand_shifted.rolling(30).skew().iloc[-1]
        test.loc[i + 1: i + 2, 'demand_rolling_kurt_t28_30'] = demand_shifted.rolling(30).kurt().iloc[-1]
            
    #Make holdout
    holdout_1 = {} #Forecast to the public leaderboard
    holdout_1['id'] = product
    for i in range(28):
        holdout_1['F' + str(i + 1)] = predicted_demand[i]
    holdout_1 = pd.DataFrame([holdout_1])
    
    holdout_2 = {} #Forecast to the private leaderboard
    holdout_2['id'] = f"{product[0:-10]}evaluation"
    for i in range(28):
        holdout_2['F' + str(i + 1)] = predicted_demand[i + 28]
    holdout_2 = pd.DataFrame([holdout_2])

    holdout = pd.concat([holdout_1, holdout_2], ignore_index = True)
    holdout.to_hdf(path_holdout + product + '.h5', key = product, mode = 'w')
    
    progress += 1
    if progress == progress_1000 * 1000:

        progress_per = round(progress / products_ids_size, 4)
        print(progress_per)
        progress_1000 +=1
        
        end = time.time()
        elapsed = int(round(end - start, 0))
        total_run_time =  int(round(elapsed / (progress_per), 0))
        time_to_finish = int(round(elapsed / (progress_per), 0)) - elapsed
        print('Elapsed: {:02d}:{:02d}:{:02d}'.format(elapsed // 3600, (elapsed % 3600 // 60), elapsed % 60))
        print('Total run time: {:02d}:{:02d}:{:02d}'.format(total_run_time // 3600, (total_run_time % 3600 // 60), total_run_time % 60))
        print('Time to finish: {:02d}:{:02d}:{:02d}'.format(time_to_finish // 3600, (time_to_finish % 3600 // 60), time_to_finish % 60))
        print()
        
print()
print("OK!")
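
The forecast loop above is recursive: each prediction is fed back as a lagged value for the following day. A stripped-down sketch of that idea is shown below. The function `toy_model` is a hypothetical stand-in for `model.predict` that simply averages the two latest lags, and the history values are made up, so only the feedback mechanism is illustrated.

```python
def toy_model(lag_1, lag_2):
    """Hypothetical stand-in for model.predict: average of the two latest lags."""
    return 0.5 * (lag_1 + lag_2)

def recursive_forecast(history, steps):
    """Forecast `steps` days ahead, feeding each prediction back in as a lag."""
    series = list(history)           # observed demand, later extended by predictions
    predictions = []
    for _ in range(steps):
        y_hat = toy_model(series[-1], series[-2])
        predictions.append(y_hat)
        series.append(y_hat)         # the next step sees this value as lag 1
    return predictions

history = [2.0, 4.0]                 # made-up observed demand
forecast = recursive_forecast(history, steps=3)
print(forecast)
```

Because later steps depend on earlier predictions instead of observed demand, errors can accumulate along the horizon, which is worth keeping in mind when reading the 56-step forecasts.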

In [4]:
#Define holdout
stoped_at = products_ids[0]  #  First product. Its holdout file initializes the combined table.

holdout_accuracy = pd.read_hdf(path_holdout + stoped_at + '.h5')

progress = 0  #Useful to see the progress of the code.
progress_1000 = 1
start = time.time()

for product in products_ids[1:]:
    holdout_accuracy_aux = pd.read_hdf(path_holdout + product + '.h5')
    
    holdout_accuracy = pd.concat([holdout_accuracy, holdout_accuracy_aux], ignore_index=True)

    progress += 1
    if progress == progress_1000 * 1000:

        progress_per = round(progress / products_ids_size, 4)
        print(progress_per)
        progress_1000 +=1
        
        end = time.time()
        elapsed = int(round(end - start, 0))
        total_run_time =  int(round(elapsed / (progress_per), 0))
        time_to_finish = int(round(elapsed / (progress_per), 0)) - elapsed
        print('Elapsed: {:02d}:{:02d}:{:02d}'.format(elapsed // 3600, (elapsed % 3600 // 60), elapsed % 60))
        print('Total run time: {:02d}:{:02d}:{:02d}'.format(total_run_time // 3600, (total_run_time % 3600 // 60), total_run_time % 60))
        print('Time to finish: {:02d}:{:02d}:{:02d}'.format(time_to_finish // 3600, (time_to_finish % 3600 // 60), time_to_finish % 60))
        print()

holdout_accuracy.to_csv("holdout_accuracy.csv", index=False)

0.0328
Elapsed: 00:00:09
Total run time: 00:04:34
Time to finish: 00:04:25

0.0656
Elapsed: 00:00:18
Total run time: 00:04:34
Time to finish: 00:04:16

0.0984
Elapsed: 00:00:26
Total run time: 00:04:24
Time to finish: 00:03:58

0.1312
Elapsed: 00:00:35
Total run time: 00:04:27
Time to finish: 00:03:52

0.164
Elapsed: 00:00:45
Total run time: 00:04:34
Time to finish: 00:03:49

0.1968
Elapsed: 00:00:55
Total run time: 00:04:39
Time to finish: 00:03:44

0.2296
Elapsed: 00:01:05
Total run time: 00:04:43
Time to finish: 00:03:38

0.2624
Elapsed: 00:01:15
Total run time: 00:04:46
Time to finish: 00:03:31

0.2952
Elapsed: 00:01:25
Total run time: 00:04:48
Time to finish: 00:03:23

0.328
Elapsed: 00:01:36
Total run time: 00:04:53
Time to finish: 00:03:17

0.3608
Elapsed: 00:01:47
Total run time: 00:04:57
Time to finish: 00:03:10

0.3936
Elapsed: 00:01:58
Total run time: 00:05:00
Time to finish: 00:03:02

0.4264
Elapsed: 00:02:09
Total run time: 00:05:03
Time to finish: 00:02:54

0.4592
Elapsed
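
The progress lines above follow from a proportional extrapolation of the elapsed time. The sketch below reproduces the first block of output, assuming 30,490 products (an assumption that matches the printed share of 0.0328 after 1,000 files) and 9 elapsed seconds. The helper names are only for illustration.

```python
def format_hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    return '{:02d}:{:02d}:{:02d}'.format(seconds // 3600, seconds % 3600 // 60, seconds % 60)

def estimate_run_time(done, total, elapsed):
    """Extrapolate the total run time from the share of items processed so far."""
    share = round(done / total, 4)
    total_run_time = int(round(elapsed / share, 0))
    time_to_finish = total_run_time - elapsed
    return share, total_run_time, time_to_finish

share, total_run_time, time_to_finish = estimate_run_time(done=1000, total=30490, elapsed=9)
print(share)
print('Elapsed:', format_hms(9))
print('Total run time:', format_hms(total_run_time))
print('Time to finish:', format_hms(time_to_finish))
```

The estimate assumes a constant processing speed, which is why the projected total drifts upward in the log as later files take slightly longer to read.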

Unnamed: 0,id,F1,F2,F3,F4,F5,F6,F7,F8,F9,...,F19,F20,F21,F22,F23,F24,F25,F26,F27,F28
0,HOBBIES_1_001_CA_1_validation,0.812127,0.887927,0.763679,0.842536,0.775772,1.217153,0.687498,1.104953,0.933031,...,0.891986,1.285713,0.961328,1.047848,1.165932,1.037729,0.985752,0.929149,1.582238,1.166432
1,HOBBIES_1_001_CA_1_evaluation,1.181089,1.256411,0.651274,0.985904,0.825309,1.669551,0.993457,0.968098,1.228431,...,1.138005,1.273723,1.032702,0.818059,0.862333,0.827502,0.915643,0.948117,1.235228,0.810124
2,HOBBIES_1_002_CA_1_validation,0.122463,0.104070,0.127284,0.102847,0.129910,0.190784,0.318079,0.252460,0.198508,...,0.249558,0.289718,0.236412,0.199455,0.187493,0.213701,0.221774,0.222247,0.246011,0.163622
3,HOBBIES_1_002_CA_1_evaluation,0.194264,0.190559,0.207146,0.210082,0.271871,0.264602,0.264602,0.259540,0.173117,...,0.206974,0.247998,0.215252,0.195983,0.187281,0.199694,0.242995,0.240775,0.261853,0.221335
4,HOBBIES_1_003_CA_1_validation,0.626666,0.565834,0.588789,0.842830,1.038278,0.912253,0.708338,0.820430,0.530520,...,0.875717,1.536290,1.138988,0.804917,0.620962,0.500605,0.617771,0.441145,0.674908,0.489578
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
60975,FOODS_3_825_WI_3_evaluation,0.869788,0.939809,0.787728,0.670236,0.730686,0.923194,0.911715,0.922494,0.817016,...,1.883331,2.327099,2.430222,1.713387,2.060543,2.415319,1.034594,1.016054,1.124425,1.027011
60976,FOODS_3_826_WI_3_validation,1.167529,1.056566,1.015245,1.628336,0.854889,2.036061,1.507356,1.486703,1.053864,...,0.536528,0.860651,1.481439,0.886010,1.173085,0.850971,0.842337,0.903754,1.503807,2.146272
60977,FOODS_3_826_WI_3_evaluation,1.709731,1.166922,0.771585,0.738075,0.692064,0.779695,0.918879,0.787370,0.977746,...,0.591203,0.806637,0.932806,0.871868,0.799405,0.540149,0.498492,0.462294,0.626811,0.591034
60978,FOODS_3_827_WI_3_validation,0.736140,1.322369,1.288168,1.534828,1.454483,1.697254,1.672233,1.345653,1.653682,...,1.450530,1.920040,1.648219,1.589748,2.157212,1.975829,2.624101,2.770586,2.218369,2.699706
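
Each product contributes two rows to the table above: days 1 to 28 of the forecast go to the `_validation` id and days 29 to 56 go to the `_evaluation` id. A minimal plain-Python sketch of that split is given below. The helper `make_holdout_rows` is hypothetical and the forecast values are made up so that the split is easy to see.

```python
def make_holdout_rows(product_id, predicted_demand, horizon=28):
    """Split a 2*horizon-step forecast into a validation row and an evaluation row."""
    assert product_id.endswith('_validation')
    assert len(predicted_demand) == 2 * horizon
    base = product_id[:-len('validation')]   # keeps the trailing underscore
    rows = []
    for offset, suffix in ((0, 'validation'), (horizon, 'evaluation')):
        row = {'id': base + suffix}
        for i in range(horizon):
            row['F' + str(i + 1)] = predicted_demand[offset + i]
        rows.append(row)
    return rows

# Made-up forecast values 0.0, 1.0, ..., 55.0
rows = make_holdout_rows('HOBBIES_1_001_CA_1_validation', [float(d) for d in range(56)])
print(rows[0]['id'], rows[0]['F1'], rows[0]['F28'])
print(rows[1]['id'], rows[1]['F1'], rows[1]['F28'])
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to obtain the same column layout (`id`, `F1` to `F28`) as the submission file.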



Kaggle public score: 0.63748