M5 Forecasting - Accuracy

Note: This is one of the two complementary competitions that together comprise the M5 forecasting challenge. Can you estimate, as precisely as possible, the point forecasts of the unit sales of various products sold in the USA by Walmart?

How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may result in you carrying around an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods you’re also challenged to use machine learning to improve forecast accuracy.

The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate the levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s.

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US States (California, Texas, and Wisconsin) and includes item level, department, product categories, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste and be able to appreciate uncertainty and its risk implications.

Evaluation:
This competition uses a Weighted Root Mean Squared Scaled Error (WRMSSE). Extensive details about the metric, scaling, and weighting can be found in the [M5 Participants Guide](https://mofc.unic.ac.cy/m5-competition/).
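
As a rough illustration of the metric, the sketch below computes the RMSSE of a single series and a weighted sum over several series. The function names `rmsse` and `wrmsse` are hypothetical, and the weights are simply passed in; the exact rules for weighting and scaling are defined in the Participants Guide and are not reproduced here.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """RMSSE of one series (illustrative helper, not the official scorer)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Scale: mean squared one-step difference of the training history.
    scale = np.mean(np.diff(train) ** 2)
    # Mean squared forecast error over the horizon, divided by the scale.
    return float(np.sqrt(np.mean((actual - forecast) ** 2) / scale))

def wrmsse(scores, weights):
    """Weighted sum of per-series RMSSE values (weights assumed given)."""
    return float(np.sum(np.asarray(weights) * np.asarray(scores)))

score = rmsse(train=[1, 2, 4, 7], actual=[8, 9], forecast=[7, 9])
total = wrmsse([1.0, 2.0], [0.25, 0.75])
print(round(score, 4), total)
```

Because the error is divided by the in-sample one-step differences, series with very different sales volumes end up on a comparable scale before they are weighted.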

It is important to note that this notebook should be read together with the notebook "M5 Forecasting - Accuracy Data study.ipynb".\
This notebook only contains the construction of new variables.

This notebook is the same as "M5 Forecasting - Accuracy - Data study".\
The only difference is that the variables are constructed with the LightGBM model in mind.\
Therefore, for example, we will use categorical variables, but we will not transform the bins into dummies.
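
To make the difference concrete, here is a small sketch that contrasts dummy (one-hot) encoding with a single categorical column. The toy series and its name are made up for illustration; only standard pandas calls are used.

```python
import pandas as pd

# Toy column of price-change bins (values invented for the example).
bins = pd.Series(["0-0.05", "0.25++", "0-0.05", "-0.05-0"],
                 name="price_change_t1_bin")

# Dummy encoding: one 0/1 column per distinct bin.
dummies = pd.get_dummies(bins)

# Categorical encoding: a single column backed by integer codes.
as_category = bins.astype("category")
codes = as_category.cat.codes

print(dummies.shape)   # one column per bin
print(codes.tolist())  # one integer per row
```

With many bins and many holiday flags, keeping a single categorical column per variable avoids multiplying the number of columns.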

In [1]:
import gc
import time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
path = 'C:\\Users\\maxwi\\Python\\Kaggle\\M5 Forecasting - Accuracy\\Modelo 2\\data_by_id\\'

In [3]:
'''
Function to reduce memory usage.
From: https://www.kaggle.com/ragnar123/very-fst-model
'''
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
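
For comparison, a similar downcast can be sketched with `pd.to_numeric`, which is part of pandas. The toy frame below is invented for the example. Note that this route stops at `float32`, while the function above may go down to `float16`, which keeps only about three significant decimal digits.

```python
import numpy as np
import pandas as pd

# Toy frame with the default 64-bit dtypes.
df = pd.DataFrame({
    "units": np.array([0, 3, 12], dtype="int64"),
    "price": np.array([0.97, 2.5, 11.48], dtype="float64"),
})

small = df.copy()
# Pick the smallest integer / float dtype that can hold the values.
small["units"] = pd.to_numeric(small["units"], downcast="integer")
small["price"] = pd.to_numeric(small["price"], downcast="float")

print(small.dtypes)
print(df.memory_usage().sum(), "->", small.memory_usage().sum())
```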

In [4]:
#Load data.
calendar = pd.read_csv('calendar.csv')
calendar = reduce_mem_usage(calendar)
sell_prices = pd.read_csv('sell_prices.csv')
sell_prices = reduce_mem_usage(sell_prices)
sales_train_validation = pd.read_csv('sales_train_validation.csv')
sales_train_validation = reduce_mem_usage(sales_train_validation)

Mem. usage decreased to  0.12 Mb (41.9% reduction)
Mem. usage decreased to 130.48 Mb (37.5% reduction)
Mem. usage decreased to 95.00 Mb (78.7% reduction)


File 1: “calendar.csv”

Contains information about the dates the products are sold.

     date: The date in a “y-m-d” format.

     wm_yr_wk: The id of the week the date belongs to.
    
     weekday: The type of the day (Saturday, Sunday, …, Friday).
    
     wday: The id of the weekday, starting from Saturday.
    
     month: The month of the date.
    
     year: The year of the date.
    
     event_name_1: If the date includes an event, the name of this event.
    
     event_type_1: If the date includes an event, the type of this event.
    
     event_name_2: If the date includes a second event, the name of this event.
    
     event_type_2: If the date includes a second event, the type of this event.
    
     snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.

File 2: “sell_prices.csv”

Contains information about the price of the products sold per store and date.

     store_id: The id of the store where the product is sold.

     item_id: The id of the product.

     wm_yr_wk: The id of the week.

     sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant at weekly basis, they may change through time (both training and test set).
    
    
Considering that we have only 6,841,121 prices, there are 1,757,059 (8,598,180 - 6,841,121) product-week combinations in which the product was not sold.

File 3: “sales_train.csv”

Contains the historical daily unit sales data per product and store.

     item_id: The id of the product.

     dept_id: The id of the department the product belongs to.

     cat_id: The id of the category the product belongs to.

     store_id: The id of the store where the product is sold.

     state_id: The State where the store is located.

     d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29.
    
    
We have 282 weeks and 30,490 different items. Therefore, the ideal would be to have 8,598,180 (282 x 30,490) prices.
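
The counts quoted in this section can be checked with a few lines of arithmetic. The sketch assumes that 282 is the number of weeks and 30,490 the number of items; the variable names are only illustrative.

```python
n_weeks = 282          # weeks covered by the price file (assumed)
n_items = 30_490       # items, as quoted in the text
n_prices = 6_841_121   # rows actually present in sell_prices.csv

ideal = n_weeks * n_items   # one price per item and week
missing = ideal - n_prices  # item-week pairs without a price

print(ideal, missing)
```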

In [5]:
#Variables based on calendar
#Seasons
calendar['date'] = pd.to_datetime(calendar['date'])
calendar['day'] = calendar['date'].dt.day

calendar.loc[np.in1d(calendar['month'], [3, 4, 5]), 'season'] = 1 #'spring'
calendar.loc[np.in1d(calendar['month'], [6, 7, 8]), 'season'] = 2 #'summer'
calendar.loc[np.in1d(calendar['month'], [9, 10, 11]), 'season'] = 3 #'fall'
calendar.loc[np.in1d(calendar['month'], [12, 1, 2]), 'season'] = 4 #'winter'
calendar['season'] = calendar['season'].astype('int8')

calendar.loc[np.in1d(calendar['day'], [1, 2, 3, 4, 5, 6, 7]), 'month_fase'] = 1# 'start'
calendar.loc[np.in1d(calendar['day'], [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]), 'month_fase'] = 2 #'middle'
calendar.loc[np.in1d(calendar['day'], [22, 23, 24, 25, 26, 27, 28, 29, 30, 31]), 'month_fase'] = 3 #'end'
calendar['month_fase'] = calendar['month_fase'].astype('int8')

calendar.loc[np.in1d(calendar['wday'], [1, 2]), 'week_fase'] = 1 #'weekend'
calendar.loc[np.in1d(calendar['wday'], [3, 4, 5]), 'week_fase'] = 2 #'start'
calendar.loc[np.in1d(calendar['wday'], [6, 7]), 'week_fase'] = 3 #'end'
calendar['week_fase'] = calendar['week_fase'].astype('int8')

calendar.drop(['date', 'weekday', 'year'], inplace = True, axis = 1)

#Fix variables event_name and event_type
#Make new holiday variable
calendar['event_name_1_t3'] = calendar['event_name_1'].shift(-3)
calendar['event_name_2_t3'] = calendar['event_name_2'].shift(-3)
calendar['event_name_1_t7'] = calendar['event_name_1'].shift(-7)
calendar['event_name_2_t7'] = calendar['event_name_2'].shift(-7)

calendar['event_type_1_t3'] = calendar['event_type_1'].shift(-3)
calendar['event_type_2_t3'] = calendar['event_type_2'].shift(-3)
calendar['event_type_1_t7'] = calendar['event_type_1'].shift(-7)
calendar['event_type_2_t7'] = calendar['event_type_2'].shift(-7)

#Categorize holidays
def cat_holidays(holiday):
    calendar[holiday] = np.logical_or(np.in1d(calendar['event_name_1'], holiday), np.in1d(calendar['event_name_2'], holiday))
    
    calendar[holiday + '_t3'] = np.logical_or(np.in1d(calendar['event_name_1_t3'], holiday)
                                                               , np.in1d(calendar['event_name_2_t3'], holiday))
    
    calendar[holiday + '_t7'] = np.logical_or(np.in1d(calendar['event_name_1_t7'], holiday)
                                                               , np.in1d(calendar['event_name_2_t7'], holiday))
    
    #Flag the event day and the 3 days before it.
    calendar[holiday + '_near'] = calendar[holiday + '_t3'].rolling(4).sum().fillna(0)
    
    #Flag the event day and the 7 days before it.
    calendar[holiday + '_week'] = calendar[holiday + '_t7'].rolling(8).sum().fillna(0)
    calendar[holiday + '_weekend']  = np.logical_and(calendar[holiday + '_week'], np.in1d(calendar['week_fase'], 1))
    
    calendar[holiday + '_near'] = calendar[holiday + '_near'].astype('bool')
    calendar[holiday + '_week'] = calendar[holiday + '_week'].astype('bool')
    
    calendar.drop([holiday + '_t3', holiday + '_t7'], inplace = True, axis = 1)
        
        
def cat_holiday_type(holiday_type):
    calendar[holiday_type] = np.logical_or(np.in1d(calendar['event_type_1'], holiday_type), np.in1d(calendar['event_type_2'], holiday_type))
      
    calendar[holiday_type + '_t3'] = np.logical_or(np.in1d(calendar['event_type_1_t3'], holiday_type)
                                                               , np.in1d(calendar['event_type_2_t3'], holiday_type))
    
    calendar[holiday_type + '_t7'] = np.logical_or(np.in1d(calendar['event_type_1_t7'], holiday_type)
                                                               , np.in1d(calendar['event_type_2_t7'], holiday_type))
    
    #Flag the event day and the 3 days before it.
    calendar[holiday_type + '_near'] = calendar[holiday_type + '_t3'].rolling(4).sum().fillna(0)
    
    #Flag the event day and the 7 days before it.
    calendar[holiday_type + '_week'] = calendar[holiday_type + '_t7'].rolling(8).sum().fillna(0)
    calendar[holiday_type + '_weekend']  = np.logical_and(calendar[holiday_type + '_week'], np.in1d(calendar['week_fase'], 1))
      
    calendar[holiday_type + '_near'] = calendar[holiday_type + '_near'].astype('bool')
    calendar[holiday_type + '_week'] = calendar[holiday_type + '_week'].astype('bool')
        
    calendar.drop([holiday_type + '_t3', holiday_type + '_t7'], inplace = True, axis = 1)

    
holidays = calendar['event_name_1'].unique()[1:]
holiday_type = calendar['event_type_1'].unique()[1:]

for e in holidays:
    cat_holidays(e)
    
for e in holiday_type:
    cat_holiday_type(e)
    
    
calendar.drop(['event_name_1', 'event_name_2', 'event_type_1', 'event_type_2'
               , 'event_name_1_t3', 'event_name_1_t7'
               , 'event_name_2_t3', 'event_name_2_t7'
               , 'event_type_1_t3', 'event_type_1_t7'
               , 'event_type_2_t3', 'event_type_2_t7'], inplace = True, axis = 1)

calendar = reduce_mem_usage(calendar)  #Mem. usage decreased to  0.29 Mb (4.3% reduction)

Mem. usage decreased to  0.29 Mb (4.3% reduction)
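
The `_near` flags come from combining a negative shift with a rolling sum. The toy calendar below (ten days, one invented event on the seventh day) is a minimal sketch of that idea, not the notebook's exact code.

```python
import pandas as pd

# Ten days, with a single event on position 6.
names = pd.Series([None] * 6 + ["Easter"] + [None] * 3)

# shift(-3) pulls the event 3 rows earlier, so position 3 "sees" it.
t3 = names.shift(-3).isin(["Easter"]).astype(int)

# A 4-row rolling sum then spreads the flag over positions 3 to 6:
# the three days before the event plus the event day itself.
near = t3.rolling(4).sum().fillna(0).astype(bool)

print(near.tolist())
```

The `_week` flags work the same way with a shift of 7 and a window of 8 rows.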


In [6]:
#Reshape sales_train_validation from wide to long so the days go to rows.
sales_train_validation = pd.melt(sales_train_validation, id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name = 'day', value_name = 'demand')
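
On a toy frame (two invented ids and two day columns), the same `pd.melt` call behaves as sketched here: every day column becomes a set of rows.

```python
import pandas as pd

# Wide layout: one row per series, one column per day.
wide = pd.DataFrame({
    "id": ["A_validation", "B_validation"],
    "d_1": [0, 2],
    "d_2": [1, 0],
})

# Long layout: one row per (series, day) pair.
long = pd.melt(wide, id_vars=["id"], var_name="day", value_name="demand")
print(long)
```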

In [7]:
#Define 2 holdouts.
holdout1 = pd.read_csv('sample_submission.csv')
holdout1 = reduce_mem_usage(holdout1)
holdout2 = pd.read_csv('sample_submission.csv')
holdout2 = reduce_mem_usage(holdout2)


#One holdout is related to forecast between day 1914 and 1941.
#The second, between day 1942 and 1969
test1 = holdout1.copy()
test2 = holdout2.copy()
#Rename the 28 forecast columns to the day ids they represent.
test1.columns = ['id'] + ['d_' + str(d) for d in range(1914, 1942)]
test2.columns = ['id'] + ['d_' + str(d) for d in range(1942, 1970)]

Mem. usage decreased to  2.09 Mb (84.5% reduction)
Mem. usage decreased to  2.09 Mb (84.5% reduction)


In [8]:
#Concat our train database with the holdouts. 

#Get information about the product id.
product_id = sales_train_validation[['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']].drop_duplicates()

# merge with product table
validation_bool = test1['id'].str.contains("_validation") 
test1 = test1[validation_bool]
test1 = test1.merge(product_id, how = 'left', on = 'id')
test2 = test2[validation_bool]
test2 = test2.merge(product_id, how = 'left', on = 'id')

#Transpose so the days goes to rows.
test1 = pd.melt(test1, id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name = 'day', value_name = 'demand')
test2 = pd.melt(test2, id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name = 'day', value_name = 'demand')

#Define what is train and what is test.
sales_train_validation['part'] = 'train'
test1['part'] = 'test1'
test2['part'] = 'test2'   

In [9]:
#Concat sales_train_validation and test1. test2 will be concatenated after we develop some demand variables.
data = pd.concat([sales_train_validation, test1], axis = 0) 
del sales_train_validation, test1

In [10]:
#Make new demand-related variables.
#A shift of 28 days is necessary so these variables can be calculated for the forecast.

#LightGBM does not know that our database is a time series.
#Therefore, after we model one day, we will need to recalculate some variables to model the next day.

for i in range(1, 29):
    try:
        data['demand_lag_t' + str(i)] = data.groupby(['id'])['demand'].shift(i)
    except:
        data['demand_lag_t' + str(i)] = 0

data['demand_rolling_mean_t28_7'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(7).mean())
data['demand_rolling_mean_t28_30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).mean())
data['demand_rolling_mean_t28_90'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(90).mean())
data['demand_rolling_mean_t28_180'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(180).mean())
data['demand_rolling_mean_t28_365'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(365).mean())
data['demand_rolling_std_t28_7'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(7).std())
data['demand_rolling_std_t28_30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).std())
data['demand_rolling_std_t28_90'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(90).std())
data['demand_rolling_std_t28_180'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(180).std())
data['demand_rolling_std_t28_365'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(365).std())
data['demand_rolling_skew_t28_30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).skew())
data['demand_rolling_kurt_t28_30'] = data.groupby(['id'])['demand'].transform(lambda x: x.shift(28).rolling(30).kurt())

data = reduce_mem_usage(data) #Mem. usage decreased to 8691.68 Mb (60.9% reduction)

Mem. usage decreased to 8691.68 Mb (60.9% reduction)
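
The lag and rolling features rely on grouping by `id` so that values never leak from one series into another. The sketch below uses two invented series and smaller windows (a shift of 2 and a window of 3 instead of 28 and 7) to keep the output readable.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a"] * 6 + ["b"] * 6,
    "demand": [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60],
})

# Lag of one day, computed separately for each id.
toy["lag_1"] = toy.groupby("id")["demand"].shift(1)

# Rolling mean over 3 days, taken on values shifted by 2 days.
toy["roll_mean_s2_w3"] = toy.groupby("id")["demand"].transform(
    lambda x: x.shift(2).rolling(3).mean()
)

print(toy)
```

The first row of series `b` has no lag value, which shows that the shift restarts for each id. Each series also loses its first rows to the shift plus the window warm-up, which is why long windows such as 365 days leave many missing values at the start.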


In [11]:
#Concat data and test2
data = pd.concat([data, test2], axis = 0, sort = True)
del test2

#Merge with calendar.
data = pd.merge(data, calendar, how = 'left', left_on = ['day'], right_on = ['d'])
del calendar
gc.collect()
data.drop(['d'], inplace = True, axis = 1)

In [12]:
#Merge with sell_prices.
data = pd.merge(data, sell_prices, how = 'left', on = ['store_id', 'item_id', 'wm_yr_wk'])

#Delete rows for products that had not yet had their first sale (no price available).
data = data.dropna(subset=['sell_price']) 

In [13]:
#make new price related variables
def create_bins_price_variation(df, var_name_in):
    cut_points = [-999, -0.25, -0.10, -0.05, -0.001, 0.05, 0.10, 0.25, 999]
    label_names = ["--0.25", "-0.25-0.10", "-0.10-0.05", "-0.05-0", "0-0.05", "0.05-0.10", "0.10-0.25", "0.25++"]
    df[var_name_in + '_bin'] = pd.cut(df[var_name_in], cut_points, labels = label_names)
    return df


data['lag_price_t1'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.shift(1))
data['price_change_t1'] = (data['sell_price'] - data['lag_price_t1']) / (data['lag_price_t1'])
data.drop(['lag_price_t1'], inplace = True, axis = 1)

#Create bins for price change
data = create_bins_price_variation(data, 'price_change_t1')


#t4 = four weeks, one month;
#t13 = 13 weeks, three months;
#t26 = 26 weeks, six months;
#t52 = 52 weeks, one year.
data['rolling_price_max_t4'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.shift(1).rolling(4).max())
data['rolling_price_max_t13'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.shift(1).rolling(13).max())
data['rolling_price_max_t26'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.shift(1).rolling(26).max())
data['rolling_price_max_t52'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.shift(1).rolling(52).max())

data['price_change_max_t4'] = (data['sell_price'] - data['rolling_price_max_t4']) / (data['rolling_price_max_t4'])
data['price_change_max_t13'] = (data['sell_price'] - data['rolling_price_max_t13']) / (data['rolling_price_max_t13'])
data['price_change_max_t26'] = (data['sell_price'] - data['rolling_price_max_t26']) / (data['rolling_price_max_t26'])
data['price_change_max_t52'] = (data['sell_price'] - data['rolling_price_max_t52']) / (data['rolling_price_max_t52'])

data.drop(['rolling_price_max_t4', 'rolling_price_max_t13', 'rolling_price_max_t26', 'rolling_price_max_t52'], inplace = True, axis = 1)

#Create bins for max price change
data = create_bins_price_variation(data, 'price_change_max_t4')
data = create_bins_price_variation(data, 'price_change_max_t13')
data = create_bins_price_variation(data, 'price_change_max_t26')
data = create_bins_price_variation(data, 'price_change_max_t52')


data['rolling_price_std_t4'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(4).std())
data['rolling_price_std_t13'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(13).std())
data['rolling_price_std_t26'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(26).std())
data['rolling_price_std_t52'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(52).std())

data['rolling_price_mean_t4'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(4).mean())
data['rolling_price_mean_t13'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(13).mean())
data['rolling_price_mean_t26'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(26).mean())
data['rolling_price_mean_t52'] = data.groupby(['id'])['sell_price'].transform(lambda x: x.rolling(52).mean())
                                                                                                  
data['price_change_mean_t4'] = (data['sell_price'] - data['rolling_price_mean_t4']) / (data['rolling_price_mean_t4'])
data['price_change_mean_t13'] = (data['sell_price'] - data['rolling_price_mean_t13']) / (data['rolling_price_mean_t13'])
data['price_change_mean_t26'] = (data['sell_price'] - data['rolling_price_mean_t26']) / (data['rolling_price_mean_t26'])                                                                                                   
data['price_change_mean_t52'] = (data['sell_price'] - data['rolling_price_mean_t52']) / (data['rolling_price_mean_t52'])

data.drop(['rolling_price_mean_t4', 'rolling_price_mean_t13', 'rolling_price_mean_t26', 'rolling_price_mean_t52'], inplace = True, axis = 1)


gc.collect()
#Create bins for mean price change                                                                                                  
data = create_bins_price_variation(data, 'price_change_mean_t4')
data = create_bins_price_variation(data, 'price_change_mean_t13')
data = create_bins_price_variation(data, 'price_change_mean_t26')
data = create_bins_price_variation(data, 'price_change_mean_t52')
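
As a quick check of how the binning behaves, the sketch below applies `pd.cut` with the same cut points to a handful of invented price changes. The labels are written out here for the example only.

```python
import pandas as pd

cut_points = [-999, -0.25, -0.10, -0.05, -0.001, 0.05, 0.10, 0.25, 999]
labels = ["--0.25", "-0.25-0.10", "-0.10-0.05", "-0.05-0",
          "0-0.05", "0.05-0.10", "0.10-0.25", "0.25++"]

# Relative price changes: -30%, -7%, no change, +7%, +50%.
changes = pd.Series([-0.30, -0.07, 0.0, 0.07, 0.50])

# Intervals are right-closed by default, e.g. (-0.10, -0.05].
binned = pd.cut(changes, cut_points, labels=labels)
print(binned.astype(str).tolist())
```

An unchanged price (0.0) falls into the "0-0.05" bin because the cut point just below zero is -0.001.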

In [14]:
#Check whether SNAP purchases are allowed in another state.
gc.collect()

data['snap_other_CA'] = 0
data['snap_other_CA_only'] = 0
data.loc[(np.in1d(data['snap_TX'], 1)) | (np.in1d(data['snap_WI'], 1)), 'snap_other_CA'] = 1
data.loc[(np.in1d(data['snap_other_CA'], 1)) & (np.in1d(data['snap_CA'], 0)), 'snap_other_CA_only'] = 1

data['snap_other_TX'] = 0
data['snap_other_TX_only'] = 0
data.loc[(np.in1d(data['snap_CA'], 1)) | (np.in1d(data['snap_WI'], 1)), 'snap_other_TX'] = 1
data.loc[(np.in1d(data['snap_other_TX'], 1)) & (np.in1d(data['snap_TX'], 0)), 'snap_other_TX_only'] = 1

data['snap_other_WI'] = 0
data['snap_other_WI_only'] = 0
data.loc[(np.in1d(data['snap_CA'], 1)) | (np.in1d(data['snap_TX'], 1)), 'snap_other_WI'] = 1
data.loc[(np.in1d(data['snap_other_WI'], 1)) & (np.in1d(data['snap_WI'], 0)), 'snap_other_WI_only'] = 1

data = reduce_mem_usage(data) #Mem. usage decreased to 15660.27 Mb (10.9% reduction)

#Set the right snap per product.
data.loc[np.in1d(data['state_id'], 'CA'), 'snap'] = data['snap_CA']
data.loc[np.in1d(data['state_id'], 'CA'), 'snap_other'] = data['snap_other_CA']
data.loc[np.in1d(data['state_id'], 'CA'), 'snap_only_other'] = data['snap_other_CA_only']

data.loc[np.in1d(data['state_id'], 'TX'), 'snap'] = data['snap_TX']
data.loc[np.in1d(data['state_id'], 'TX'), 'snap_other'] = data['snap_other_TX']
data.loc[np.in1d(data['state_id'], 'TX'), 'snap_only_other'] = data['snap_other_TX_only']

data.loc[np.in1d(data['state_id'], 'WI'), 'snap'] = data['snap_WI']
data.loc[np.in1d(data['state_id'], 'WI'), 'snap_other'] = data['snap_other_WI']
data.loc[np.in1d(data['state_id'], 'WI'), 'snap_only_other'] = data['snap_other_WI_only']

data = reduce_mem_usage(data) #Mem. usage decreased to 17213.41 Mb (4.5% reduction)

data.drop(['snap_CA', 'snap_TX', 'snap_WI'
                , 'snap_other_CA', 'snap_other_TX', 'snap_other_WI'
                , 'snap_other_CA_only', 'snap_other_TX_only', 'snap_other_WI_only'], inplace = True, axis = 1)


    
gc.collect()

#Considering that we will estimate a model by product, there are some more variables we can drop.
data.drop(['cat_id', 'dept_id', 'item_id', 'state_id', 'store_id'], inplace = True, axis = 1)
data = reduce_mem_usage(data) #Mem. usage decreased to 14982.73 Mb (0.0% reduction)

Mem. usage decreased to 15660.27 Mb (10.9% reduction)
Mem. usage decreased to 17213.41 Mb (4.5% reduction)
Mem. usage decreased to 14982.73 Mb (0.0% reduction)
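
The same three SNAP variables can also be sketched in a more compact way with `np.select`. This is an alternative formulation on an invented four-row frame, not the code that produced the outputs above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "state_id": ["CA", "TX", "WI", "CA"],
    "snap_CA": [1, 1, 0, 0],
    "snap_TX": [0, 1, 0, 1],
    "snap_WI": [0, 0, 1, 1],
})
states = ("CA", "TX", "WI")

# snap: the SNAP flag of the row's own state.
conditions = [toy["state_id"] == s for s in states]
own_flags = [toy["snap_" + s] for s in states]
toy["snap"] = np.select(conditions, own_flags, default=0)

# snap_other: SNAP is allowed in at least one of the other two states.
total = toy[["snap_" + s for s in states]].sum(axis=1)
toy["snap_other"] = ((total - toy["snap"]) > 0).astype(int)

# snap_only_other: allowed elsewhere but not in the row's own state.
toy["snap_only_other"] = ((toy["snap_other"] == 1) & (toy["snap"] == 0)).astype(int)

print(toy[["state_id", "snap", "snap_other", "snap_only_other"]])
```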


Created variables.

Continuous:\
    'month', 'wday',\
    'demand_lag_t1', 'demand_lag_t2', 'demand_lag_t3', 'demand_lag_t4', 'demand_lag_t5',\
    'demand_lag_t6', 'demand_lag_t7', 'demand_lag_t8', 'demand_lag_t9', 'demand_lag_t10', 'demand_lag_t11', 'demand_lag_t12',\
    'demand_lag_t13', 'demand_lag_t14', 'demand_lag_t15', 'demand_lag_t16', 'demand_lag_t17', 'demand_lag_t18',\
    'demand_lag_t19', 'demand_lag_t20', 'demand_lag_t21', 'demand_lag_t22', 'demand_lag_t23', 'demand_lag_t24',\
    'demand_lag_t25', 'demand_lag_t26', 'demand_lag_t27', 'demand_lag_t28',\
    'demand_rolling_mean_t28_7', 'demand_rolling_mean_t28_30', 'demand_rolling_mean_t28_90', 'demand_rolling_mean_t28_180',\
    'demand_rolling_mean_t28_365', 'demand_rolling_std_t28_7', 'demand_rolling_std_t28_30', 'demand_rolling_std_t28_90',\
    'demand_rolling_std_t28_180', 'demand_rolling_std_t28_365', 'demand_rolling_skew_t28_30', 'demand_rolling_kurt_t28_30'
    
Categorical:\
    'season', 'month_fase', 'week_fase',\
    "SuperBowl", 'ValentinesDay', 'PresidentsDay', 'LentStart', 'LentWeek2', 'StPatricksDay', 'Purim End', 'OrthodoxEaster',\
    'Pesach End', 'Cinco De Mayo', "Mother's day", 'MemorialDay', 'NBAFinalsStart', 'NBAFinalsEnd', "Father's day",\
    'IndependenceDay', 'Ramadan starts', 'Eid al-Fitr', 'LaborDay', 'ColumbusDay', 'Halloween', 'EidAlAdha', 'VeteransDay',\
    'Thanksgiving', 'Christmas', 'Chanukah End', 'NewYear', 'OrthodoxChristmas', 'MartinLutherKingDay',\
    'Easter', 'Sporting', 'Cultural', 'National', 'Religious',\
    'snap', 'snap_other', 'snap_only_other'

Holiday variables (categorical):
holidays_variables = [col for col in data.columns if col.endswith("_near") or col.endswith("_week") or col.endswith("_weekend")]
Price variables (categorical):
price_variables = [col for col in data.columns if 'price' in col][1:]

In [17]:
#Considering that we will fit one model per product, we will already build a database separated by product.
#The ideal would be to use the key function of HDF files, but this would exceed the recommended number of children (16,384).
#Therefore, we will make one file per product.

products_ids = data['id'].unique()
products_ids_size = len(products_ids)

progress = 0   #Useful to track the progress of the loop
progress_1000 = 1
start = time.time()


for e in products_ids:
    data.loc[np.in1d(data['id'], e)].to_hdf(path + e + '.h5', key = e, format = 't', mode = 'w')
    
    progress += 1
    if progress > progress_1000 * 1000:

        progress_per = round(progress / products_ids_size, 4)
        print(progress_per)
        progress_1000 +=1
        
        end = time.time()
        elapsed = int(round(end - start, 0))
        total_run_time =  int(round(elapsed / (progress_per), 0))
        time_to_finish = int(round(elapsed / (progress_per), 0)) - elapsed
        print('Elapsed: {:02d}:{:02d}:{:02d}'.format(elapsed // 3600, (elapsed % 3600 // 60), elapsed % 60))
        print('Total run time: {:02d}:{:02d}:{:02d}'.format(total_run_time // 3600, (total_run_time % 3600 // 60), total_run_time % 60))
        print('Time to finish: {:02d}:{:02d}:{:02d}'.format(time_to_finish // 3600, (time_to_finish % 3600 // 60), time_to_finish % 60))
        print()

0.0328
Elapsed: 00:12:59
Total run time: 06:35:50
Time to finish: 06:22:51

0.0656
Elapsed: 00:25:50
Total run time: 06:33:48
Time to finish: 06:07:58

0.0984
Elapsed: 00:38:39
Total run time: 06:32:47
Time to finish: 05:54:08

0.1312
Elapsed: 00:51:35
Total run time: 06:33:10
Time to finish: 05:41:35

0.164
Elapsed: 01:04:31
Total run time: 06:33:24
Time to finish: 05:28:53

0.1968
Elapsed: 01:17:32
Total run time: 06:33:58
Time to finish: 05:16:26

0.2296
Elapsed: 01:30:31
Total run time: 06:34:14
Time to finish: 05:03:43

0.2624
Elapsed: 01:43:31
Total run time: 06:34:30
Time to finish: 04:50:59

0.2952
Elapsed: 01:56:23
Total run time: 06:34:15
Time to finish: 04:37:52

0.328
Elapsed: 02:09:16
Total run time: 06:34:06
Time to finish: 04:24:50

0.3608
Elapsed: 02:22:16
Total run time: 06:34:19
Time to finish: 04:12:03

0.3936
Elapsed: 02:35:09
Total run time: 06:34:11
Time to finish: 03:59:02

0.4264
Elapsed: 02:48:07
Total run time: 06:34:16
Time to finish: 03:46:09

0.4592
Elapsed
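
The estimates in the log follow from dividing the elapsed time by the fraction of products already written. The sketch below redoes that arithmetic for the first block of the log; the helper names `hms` and `estimate` are hypothetical.

```python
def hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    seconds = int(round(seconds))
    return "{:02d}:{:02d}:{:02d}".format(
        seconds // 3600, seconds % 3600 // 60, seconds % 60
    )

def estimate(elapsed, fraction_done):
    """Return (total run time, time to finish) in seconds."""
    total = int(round(elapsed / fraction_done))
    return total, total - elapsed

# First log block: 12 min 59 s elapsed with 3.28% of the products done.
elapsed = 12 * 60 + 59
total, remaining = estimate(elapsed, 0.0328)
print(hms(elapsed), hms(total), hms(remaining))
```

The estimate stays close to six and a half hours across the log, which suggests that the time per file is roughly constant.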

In [None]:
'''
#Observation: HDF5 files do not support variables formatted as category.

#Define variables as categorical
categoric_variables = ['season', 'month_fase', 'week_fase', "SuperBowl", 
       'ValentinesDay', 'PresidentsDay', 'LentStart', 'LentWeek2', 'StPatricksDay', 'Purim End', 'OrthodoxEaster', 'Pesach End',
       'Cinco De Mayo', "Mother's day", 'MemorialDay', 'NBAFinalsStart', 'NBAFinalsEnd', "Father's day", 'IndependenceDay', 'Ramadan starts',
       'Eid al-Fitr', 'LaborDay', 'ColumbusDay', 'Halloween', 'EidAlAdha', 'VeteransDay', 'Thanksgiving', 'Christmas', 'Chanukah End', 'NewYear',
       'OrthodoxChristmas', 'MartinLutherKingDay', 'Easter', 'Sporting', 'Cultural', 'National', 'Religious',
        'snap', 'snap_other', 'snap_only_other']

#Holiday variables:
holidays_variables = [col for col in data.columns if col.endswith("_near") or col.endswith("_week") or col.endswith("_weekend")]
for e in holidays_variables:
    categoric_variables.append(e)
    
#Price variables:
price_variables = [col for col in data.columns if 'price' in col][1:]
for e in price_variables:
    categoric_variables.append(e)
    
for e in categoric_variables:
    data[e] = data[e].astype('category')
    
'''