M5 Forecasting - Accuracy

Note: This is one of the two complementary competitions that together comprise the M5 forecasting challenge. Can you estimate, as precisely as possible, the point forecasts of the unit sales of various products sold in the USA by Walmart?

How much camping gear will one store sell each month in a year? To the uninitiated, calculating sales at this level may seem as difficult as predicting the weather. Both types of forecasting rely on science and historical data. While a wrong weather forecast may result in you carrying around an umbrella on a sunny day, inaccurate business forecasts could result in actual or opportunity losses. In this competition, in addition to traditional forecasting methods you’re also challenged to use machine learning to improve forecast accuracy.

The Makridakis Open Forecasting Center (MOFC) at the University of Nicosia conducts cutting-edge forecasting research and provides business forecast training. It helps companies achieve accurate predictions, estimate levels of uncertainty, avoid costly mistakes, and apply best forecasting practices. The MOFC is well known for its Makridakis Competitions, the first of which ran in the 1980s.

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers stores in three US States (California, Texas, and Wisconsin) and includes item-level, department, product category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

If successful, your work will continue to advance the theory and practice of forecasting. The methods used can be applied in various business areas, such as setting up appropriate inventory or service levels. Through its business support and training, the MOFC will help distribute the tools and knowledge so others can achieve more accurate and better calibrated forecasts, reduce waste and be able to appreciate uncertainty and its risk implications.

Evaluation:
This competition uses a Weighted Root Mean Squared Scaled Error (WRMSSE). Extensive details about the metric, scaling, and weighting can be found in the [M5 Participants Guide](https://mofc.unic.ac.cy/m5-competition/).
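
As a rough orientation, the per-series RMSSE divides the mean squared forecast error by the in-sample mean squared error of a one-step naive forecast, and the weighted score is a weighted sum of the per-series values. The sketch below is only an illustration under simplifying assumptions: the function names `rmsse` and `wrmsse` are made up here, the weights are passed in by hand, and the official handling of leading zeros and of sales-based weights described in the guide is not reproduced.

```python
import numpy as np

def rmsse(train, actual, forecast):
    """Scaled error for one series (illustrative helper, not the official scorer)."""
    train = np.asarray(train, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Scale: mean squared one-step naive error on the training history.
    scale = np.mean(np.diff(train) ** 2)
    return float(np.sqrt(np.mean((actual - forecast) ** 2) / scale))

def wrmsse(trains, actuals, forecasts, weights):
    """Weighted sum of per-series RMSSE values; weights are assumed to sum to 1."""
    scores = [rmsse(t, a, f) for t, a, f in zip(trains, actuals, forecasts)]
    return float(np.sum(np.asarray(weights, dtype=float) * np.array(scores)))

# Toy numbers chosen for illustration only.
train = [0, 2, 1, 3, 2]
actual = [2, 4]
forecast = [2, 2]
score = rmsse(train, actual, forecast)
total = wrmsse([train, train], [actual, actual], [forecast, forecast], [0.5, 0.5])
```

Because the error is scaled by each series' own history, slow-moving and fast-moving items can be compared on the same footing.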

In [1]:
import gc
import time
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.stattools import pacf
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

#Note: newer statsmodels releases replace this class with statsmodels.tsa.arima.model.ARIMA.
from statsmodels.tsa.arima_model import ARIMA

In [2]:
'''
Function to reduce memory usage.
From: https://www.kaggle.com/ragnar123/very-fst-model
'''
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
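
One caveat about the float branch of `reduce_mem_usage`: float16 keeps only about three significant decimal digits, so prices stored this way are rounded. That is consistent with values such as 9.578125 in the `sell_prices` preview further down. A small check, assuming an original price of 9.58 (an illustrative value, not read from the file):

```python
import numpy as np

price = 9.58  # assumed original price, for illustration only

as_f16 = float(np.float16(price))  # nearest value representable in float16
as_f32 = float(np.float32(price))  # nearest value representable in float32

err16 = abs(as_f16 - price)
err32 = abs(as_f32 - price)

print(as_f16, err16)
print(as_f32, err32)
```

If exact prices matter for a feature (for example, price endings such as .98), restricting the float downcast to float32 may be the safer choice.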

In [3]:
#Load data.
calendar = pd.read_csv('calendar.csv')
calendar = reduce_mem_usage(calendar)

#Verify how many weeks we have. This will help us analyse the prices.
print('weeks: ', len(calendar['wm_yr_wk'].drop_duplicates()))
print()

#Verify missing.
for e in calendar.columns:
    print('Null', e,':', calendar[e].isnull().sum(), ' --- Percent: ', round(calendar[e].isnull().sum() / len(calendar), 4))


#Verify holidays
event_name_1 = calendar['event_name_1'].value_counts(dropna = False) #30
event_type_1 = calendar['event_type_1'].value_counts(dropna = False) #4
event_name_2 = calendar['event_name_2'].value_counts(dropna = False) #4
event_type_2 = calendar['event_type_2'].value_counts(dropna = False) #2


#Verify snap
snap_CA = calendar['snap_CA'].value_counts(dropna = False) #33.01% are SNAP days (650 of 1,969)
snap_TX = calendar['snap_TX'].value_counts(dropna = False)
snap_WI = calendar['snap_WI'].value_counts(dropna = False)


print('event_name_1:')
display(event_name_1)
print('')
print('event_type_1:')
display(event_type_1)
print('')
print('event_name_2:')
display(event_name_2)
print('')
print('event_type_2:')
display(event_type_2)
print('')
print('snap_CA:')
display(snap_CA)
print('')
print('snap_TX:')
display(snap_TX)
print('')
print('snap_WI:')
display(snap_WI)
print('')
display(calendar.head())

Mem. usage decreased to  0.12 Mb (41.9% reduction)
weeks:  282

Null date : 0  --- Percent:  0.0
Null wm_yr_wk : 0  --- Percent:  0.0
Null weekday : 0  --- Percent:  0.0
Null wday : 0  --- Percent:  0.0
Null month : 0  --- Percent:  0.0
Null year : 0  --- Percent:  0.0
Null d : 0  --- Percent:  0.0
Null event_name_1 : 1807  --- Percent:  0.9177
Null event_type_1 : 1807  --- Percent:  0.9177
Null event_name_2 : 1964  --- Percent:  0.9975
Null event_type_2 : 1964  --- Percent:  0.9975
Null snap_CA : 0  --- Percent:  0.0
Null snap_TX : 0  --- Percent:  0.0
Null snap_WI : 0  --- Percent:  0.0
event_name_1:


NaN                    1807
MemorialDay               6
NBAFinalsStart            6
LentStart                 6
ValentinesDay             6
NBAFinalsEnd              6
Pesach End                6
StPatricksDay             6
SuperBowl                 6
Purim End                 6
Mother's day              6
LentWeek2                 6
Ramadan starts            6
PresidentsDay             6
Halloween                 5
VeteransDay               5
Chanukah End              5
EidAlAdha                 5
IndependenceDay           5
LaborDay                  5
NewYear                   5
Eid al-Fitr               5
ColumbusDay               5
MartinLutherKingDay       5
Thanksgiving              5
Easter                    5
Cinco De Mayo             5
OrthodoxEaster            5
OrthodoxChristmas         5
Christmas                 5
Father's day              4
Name: event_name_1, dtype: int64


event_type_1:


NaN          1807
Religious      55
National       52
Cultural       37
Sporting       18
Name: event_type_1, dtype: int64


event_name_2:


NaN               1964
Father's day         2
Cinco De Mayo        1
Easter               1
OrthodoxEaster       1
Name: event_name_2, dtype: int64


event_type_2:


NaN          1964
Cultural        4
Religious       1
Name: event_type_2, dtype: int64


snap_CA:


0    1319
1     650
Name: snap_CA, dtype: int64


snap_TX:


0    1319
1     650
Name: snap_TX, dtype: int64


snap_WI:


0    1319
1     650
Name: snap_WI, dtype: int64




Unnamed: 0,date,wm_yr_wk,weekday,wday,month,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI
0,2011-01-29,11101,Saturday,1,1,2011,d_1,,,,,0,0,0
1,2011-01-30,11101,Sunday,2,1,2011,d_2,,,,,0,0,0
2,2011-01-31,11101,Monday,3,1,2011,d_3,,,,,0,0,0
3,2011-02-01,11101,Tuesday,4,2,2011,d_4,,,,,1,1,0
4,2011-02-02,11101,Wednesday,5,2,2011,d_5,,,,,1,0,1


File 1: “calendar.csv”

Contains information about the dates the products are sold.

     date: The date in a “y-m-d” format.

     wm_yr_wk: The id of the week the date belongs to.
    
     weekday: The type of the day (Saturday, Sunday, …, Friday).
    
     wday: The id of the weekday, starting from Saturday.
    
     month: The month of the date.
    
     year: The year of the date.
    
     event_name_1: If the date includes an event, the name of this event.
    
     event_type_1: If the date includes an event, the type of this event.
    
     event_name_2: If the date includes a second event, the name of this event.
    
     event_type_2: If the date includes a second event, the type of this event.
    
     snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.

In [4]:
sell_prices = pd.read_csv('sell_prices.csv')
sell_prices = reduce_mem_usage(sell_prices)

print("len: ", len(sell_prices))
print()

#Verify missing.
for e in sell_prices.columns:
    print('Null', e,':', sell_prices[e].isnull().sum(), ' --- Percent: ', round(sell_prices[e].isnull().sum() / len(sell_prices), 4))
    
print()
display(sell_prices.head())

Mem. usage decreased to 130.48 Mb (37.5% reduction)
len:  6841121

Null store_id : 0  --- Percent:  0.0
Null item_id : 0  --- Percent:  0.0
Null wm_yr_wk : 0  --- Percent:  0.0
Null sell_price : 0  --- Percent:  0.0



Unnamed: 0,store_id,item_id,wm_yr_wk,sell_price
0,CA_1,HOBBIES_1_001,11325,9.578125
1,CA_1,HOBBIES_1_001,11326,9.578125
2,CA_1,HOBBIES_1_001,11327,8.257812
3,CA_1,HOBBIES_1_001,11328,8.257812
4,CA_1,HOBBIES_1_001,11329,8.257812


File 2: “sell_prices.csv”

Contains information about the price of the products sold per store and date.

     store_id: The id of the store where the product is sold.

     item_id: The id of the product.

     wm_yr_wk: The id of the week.

     sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If not available, this means that the product was not sold during the examined week. Note that although prices are constant on a weekly basis, they may change over time (in both the training and test sets).
    
    
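A complete price table would hold 8,598,180 prices (282 weeks x 30,490 product-store pairs). Since we have only 6,841,121 prices, there are 1,757,059 (8,598,180 - 6,841,121) product-store-week combinations without a price, i.e. weeks in which the product was not sold in that store.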

In [5]:
#Define two holdout sets.
holdout1 = pd.read_csv('sample_submission.csv')
holdout1 = reduce_mem_usage(holdout1)
holdout2 = pd.read_csv('sample_submission.csv')
holdout2 = reduce_mem_usage(holdout2)


#One holdout is related to forecast between day 1914 and 1941.
#The second, between day 1942 and 1969
test1 = holdout1.copy()
test2 = holdout2.copy()
#Rename the 28 forecast columns to the day ids they stand for.
test1.columns = ['id'] + ['d_' + str(d) for d in range(1914, 1942)]
test2.columns = ['id'] + ['d_' + str(d) for d in range(1942, 1970)]

print('Shape test1: ' + str(test1.shape))  #Observation: each table has all items with "_validation", and after it repeats the items with "_evaluation"
print('Shape test2: ' + str(test2.shape))
display(test1.head())
display(test2.head())

Mem. usage decreased to  2.09 Mb (84.5% reduction)
Mem. usage decreased to  2.09 Mb (84.5% reduction)
Shape test1: (60980, 29)
Shape test2: (60980, 29)


Unnamed: 0,id,d_1914,d_1915,d_1916,d_1917,d_1918,d_1919,d_1920,d_1921,d_1922,...,d_1932,d_1933,d_1934,d_1935,d_1936,d_1937,d_1938,d_1939,d_1940,d_1941
0,HOBBIES_1_001_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HOBBIES_1_002_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HOBBIES_1_004_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HOBBIES_1_005_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,id,d_1942,d_1943,d_1944,d_1945,d_1946,d_1947,d_1948,d_1949,d_1950,...,d_1960,d_1961,d_1962,d_1963,d_1964,d_1965,d_1966,d_1967,d_1968,d_1969
0,HOBBIES_1_001_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,HOBBIES_1_002_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,HOBBIES_1_004_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,HOBBIES_1_005_CA_1_validation,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
sales_train_validation = pd.read_csv('sales_train_validation.csv')
sales_train_validation = reduce_mem_usage(sales_train_validation)

#Verify the quantity of items and store. This will help us to see how many weeks we do not have any prices for a product.
print('Items and stores:', len(sales_train_validation[['item_id', 'store_id']].drop_duplicates()))
print()

#Verify null
print('Verify null:')
print(sales_train_validation.isnull().sum()[sales_train_validation.isnull().sum() > 0]) #there are no null values.
print()
print(sales_train_validation.isnull().sum()[sales_train_validation.isnull().sum() > 0] / len(sales_train_validation))
print()
sales_train_validation.head()

Mem. usage decreased to 95.00 Mb (78.7% reduction)
Items and stores: 30490

Verify null:
Series([], dtype: int64)

Series([], dtype: float64)



Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,3,0,1,1,1,3,0,1,1
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,2,1,1,1,0,1,1,1
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,1,0,5,4,1,0,1,3,7,2
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,0,0,0,0,...,2,1,1,0,1,1,2,2,2,4


File 3: “sales_train.csv”

Contains the historical daily unit sales data per product and store.

     item_id: The id of the product.

     dept_id: The id of the department the product belongs to.

     cat_id: The id of the category the product belongs to.

     store_id: The id of the store where the product is sold.

     state_id: The State where the store is located.

     d_1, d_2, …, d_i, … d_1941: The number of units sold at day i, starting from 2011-01-29.
    
    
We have 282 weeks and 30,490 different product-store combinations. Therefore, a complete price table would have 8,598,180 (282 x 30,490) prices.
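
The arithmetic can be checked directly from the counts printed by the cells above. This is only a sketch: the three constants are copied from the notebook output rather than recomputed from the dataframes, and the variable names are illustrative.

```python
n_weeks = 282        # distinct wm_yr_wk values in calendar.csv
n_pairs = 30490      # distinct (item_id, store_id) pairs in the sales file
n_prices = 6841121   # rows in sell_prices.csv

expected_prices = n_weeks * n_pairs          # one price per pair per week
missing_prices = expected_prices - n_prices  # pair-weeks with no price

print(expected_prices, missing_prices)
```

With the dataframes in memory, the first two constants could instead be taken from `calendar['wm_yr_wk'].nunique()` and from the length of the de-duplicated `['item_id', 'store_id']` pairs.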

In [7]:
#Melt sales_train_validation from wide to long format so that each day becomes a row.
sales_train_validation = pd.melt(sales_train_validation, id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name = 'day', value_name = 'demand')

#Verify missing.
for e in sales_train_validation.columns:
    print('Null', e,':', sales_train_validation[e].isnull().sum(), ' --- Percent: ', round(sales_train_validation[e].isnull().sum() / len(sales_train_validation), 4))
    
display(sales_train_validation.head())

Null id : 0  --- Percent:  0.0
Null item_id : 0  --- Percent:  0.0
Null dept_id : 0  --- Percent:  0.0
Null cat_id : 0  --- Percent:  0.0
Null store_id : 0  --- Percent:  0.0
Null state_id : 0  --- Percent:  0.0
Null day : 0  --- Percent:  0.0
Null demand : 0  --- Percent:  0.0


Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,day,demand
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0
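
To make the reshaping step concrete, here is a toy version of the same `pd.melt` call. The two ids and the demand values are invented for illustration; only the call pattern mirrors the cell above.

```python
import pandas as pd

# Wide format: one row per series, one column per day (toy data).
wide = pd.DataFrame({
    'id': ['A_validation', 'B_validation'],
    'd_1': [0, 2],
    'd_2': [1, 0],
    'd_3': [3, 5],
})

# Long format: one row per (series, day) pair.
long = pd.melt(wide, id_vars=['id'], var_name='day', value_name='demand')

print(long)
```

The long table has one row per series per day, so the real training data grows to 30,490 x 1,913 = 58,327,370 rows, which is why memory usage rises sharply after this step.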


In [8]:
#Concat our train database with the holdouts. 

#Get information about the product id.
product_id = sales_train_validation[['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']].drop_duplicates()

# merge with product table
validation_bool = test1['id'].str.contains("_validation") 
test1 = test1[validation_bool]
test1 = test1.merge(product_id, how = 'left', on = 'id')
test2 = test2[validation_bool]
test2 = test2.merge(product_id, how = 'left', on = 'id')

#Melt from wide to long format so that each day becomes a row.
test1 = pd.melt(test1, id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name = 'day', value_name = 'demand')
test2 = pd.melt(test2, id_vars = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name = 'day', value_name = 'demand')

#Define what is train and what is test.
sales_train_validation['part'] = 'train'
test1['part'] = 'test1'
test2['part'] = 'test2'   #test2 will only be scored after June 1. Before that, we can use it after forecasting test1.

In [9]:
#Concatenate sales_train_validation, test1 and test2 into a single long table.
data = pd.concat([sales_train_validation, test1, test2], axis = 0) 
del sales_train_validation, test1, test2

#Verify missing.
for e in data.columns:
    print('Null', e,':', data[e].isnull().sum(), ' --- Percent: ', round(data[e].isnull().sum() / len(data), 4))
  
data.info()

Null id : 0  --- Percent:  0.0
Null item_id : 0  --- Percent:  0.0
Null dept_id : 0  --- Percent:  0.0
Null cat_id : 0  --- Percent:  0.0
Null store_id : 0  --- Percent:  0.0
Null state_id : 0  --- Percent:  0.0
Null day : 0  --- Percent:  0.0
Null demand : 0  --- Percent:  0.0
Null part : 0  --- Percent:  0.0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60034810 entries, 0 to 853719
Data columns (total 9 columns):
id          object
item_id     object
dept_id     object
cat_id      object
store_id    object
state_id    object
day         object
demand      int16
part        object
dtypes: int16(1), object(8)
memory usage: 4.1+ GB


About the products:
The M5 dataset, generously made available by Walmart, involves the unit sales of various products sold in the USA, organized in the form of grouped time series. More specifically, the dataset involves the unit sales of 3,075 products, classified in 3 product categories (Hobbies, Foods, and Household) and 7 product departments, in which the above-mentioned categories are disaggregated. The products are sold across 10 stores, located in 3 States (CA, TX, and WI). In this respect, the bottom-level of the hierarchy, i.e., product-store unit sales, can be mapped either across product categories or geographical regions. More information can be found in the [M5 Participants Guide](https://mofc.unic.ac.cy/m5-competition/).

In [10]:
#Verify the numbers described above.

products = product_id['item_id'].value_counts(dropna = False) #The description says 3,075, but there are only 3,049 distinct items; each one appears in all 10 stores.
departments = product_id['dept_id'].value_counts(dropna = False) #7 OK!
categories = product_id['cat_id'].value_counts(dropna = False) #3 OK!
stores = product_id['store_id'].value_counts(dropna = False) #10 OK!
States = product_id['state_id'].value_counts(dropna = False) #3 OK!

print('States:')
display(States)
print('')
print('stores:')
display(stores)
print('')
print('categories:')
display(categories)
print('')
print('departments:')
display(departments)
print('')
print('products:')
display(products)

States:


CA    12196
TX     9147
WI     9147
Name: state_id, dtype: int64


stores:


WI_2    3049
CA_3    3049
WI_3    3049
TX_2    3049
CA_2    3049
CA_1    3049
TX_3    3049
TX_1    3049
WI_1    3049
CA_4    3049
Name: store_id, dtype: int64


categories:


FOODS        14370
HOUSEHOLD    10470
HOBBIES       5650
Name: cat_id, dtype: int64


departments:


FOODS_3        8230
HOUSEHOLD_1    5320
HOUSEHOLD_2    5150
HOBBIES_1      4160
FOODS_2        3980
FOODS_1        2160
HOBBIES_2      1490
Name: dept_id, dtype: int64


products:


HOUSEHOLD_1_487    10
FOODS_3_459        10
FOODS_3_286        10
HOBBIES_2_097      10
HOUSEHOLD_1_328    10
                   ..
FOODS_3_520        10
HOBBIES_1_233      10
HOBBIES_1_356      10
HOBBIES_1_006      10
FOODS_2_198        10
Name: item_id, Length: 3049, dtype: int64

In [11]:
#Add information in our dataset.
#Merge with calendar.
data = pd.merge(data, calendar, how = 'left', left_on = ['day'], right_on = ['d'])
data.drop(['d'], inplace = True, axis = 1)

#Verify missing.
for e in data.columns:
    print('Null', e,':', data[e].isnull().sum(), ' --- Percent: ', round(data[e].isnull().sum() / len(data), 4))

Null id : 0  --- Percent:  0.0
Null item_id : 0  --- Percent:  0.0
Null dept_id : 0  --- Percent:  0.0
Null cat_id : 0  --- Percent:  0.0
Null store_id : 0  --- Percent:  0.0
Null state_id : 0  --- Percent:  0.0
Null day : 0  --- Percent:  0.0
Null demand : 0  --- Percent:  0.0
Null part : 0  --- Percent:  0.0
Null date : 0  --- Percent:  0.0
Null wm_yr_wk : 0  --- Percent:  0.0
Null weekday : 0  --- Percent:  0.0
Null wday : 0  --- Percent:  0.0
Null month : 0  --- Percent:  0.0
Null year : 0  --- Percent:  0.0
Null event_name_1 : 55095430  --- Percent:  0.9177
Null event_type_1 : 55095430  --- Percent:  0.9177
Null event_name_2 : 59882360  --- Percent:  0.9975
Null event_type_2 : 59882360  --- Percent:  0.9975
Null snap_CA : 0  --- Percent:  0.0
Null snap_TX : 0  --- Percent:  0.0
Null snap_WI : 0  --- Percent:  0.0


In [12]:
#Merge with sell_prices.
data = pd.merge(data, sell_prices, how = 'left', on = ['store_id', 'item_id', 'wm_yr_wk'])

#Verify missing.
for e in data.columns:
    print('Null', e,':', data[e].isnull().sum(), ' --- Percent: ', round(data[e].isnull().sum() / len(data), 4))

print()
print(data.info())
display(data.head())

Null id : 0  --- Percent:  0.0
Null item_id : 0  --- Percent:  0.0
Null dept_id : 0  --- Percent:  0.0
Null cat_id : 0  --- Percent:  0.0
Null store_id : 0  --- Percent:  0.0
Null state_id : 0  --- Percent:  0.0
Null day : 0  --- Percent:  0.0
Null demand : 0  --- Percent:  0.0
Null part : 0  --- Percent:  0.0
Null date : 0  --- Percent:  0.0
Null wm_yr_wk : 0  --- Percent:  0.0
Null weekday : 0  --- Percent:  0.0
Null wday : 0  --- Percent:  0.0
Null month : 0  --- Percent:  0.0
Null year : 0  --- Percent:  0.0
Null event_name_1 : 55095430  --- Percent:  0.9177
Null event_type_1 : 55095430  --- Percent:  0.9177
Null event_name_2 : 59882360  --- Percent:  0.9975
Null event_type_2 : 59882360  --- Percent:  0.9975
Null snap_CA : 0  --- Percent:  0.0
Null snap_TX : 0  --- Percent:  0.0
Null snap_WI : 0  --- Percent:  0.0
Null sell_price : 12299413  --- Percent:  0.2049

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60034810 entries, 0 to 60034809
Data columns (total 23 columns):
id   

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,day,demand,part,date,...,month,year,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI,sell_price
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,train,2011-01-29,...,1,2011,,,,,0,0,0,
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,train,2011-01-29,...,1,2011,,,,,0,0,0,
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,train,2011-01-29,...,1,2011,,,,,0,0,0,
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,train,2011-01-29,...,1,2011,,,,,0,0,0,
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,train,2011-01-29,...,1,2011,,,,,0,0,0,


As stated above, this number of null values in sell_price is expected.
We expected 1,757,059 product-store weeks without price information, so 1,757,059 x 7 = 12,299,413 daily rows without a price.
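
The mechanism behind these nulls can be reproduced on a toy example: a left merge on the weekly key leaves `sell_price` empty for every day of a week that has no price row. All values below are invented for illustration; only the merge keys match the notebook.

```python
import pandas as pd

# Two weeks of daily rows for a single product-store pair (toy data).
calendar_toy = pd.DataFrame({
    'day': ['d_%d' % i for i in range(1, 15)],
    'wm_yr_wk': [11101] * 7 + [11102] * 7,
})
sales_toy = calendar_toy.assign(store_id='CA_1', item_id='X', demand=0)

# A price exists only for the second week.
prices_toy = pd.DataFrame({
    'store_id': ['CA_1'],
    'item_id': ['X'],
    'wm_yr_wk': [11102],
    'sell_price': [2.5],
})

merged = sales_toy.merge(prices_toy, how='left',
                         on=['store_id', 'item_id', 'wm_yr_wk'])
n_missing_days = int(merged['sell_price'].isnull().sum())

print(n_missing_days)
```

One missing week produces seven null daily rows, which is the factor of 7 used in the calculation above.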

In [13]:
#Verify the correlation between products
def correlation_between_products(df, filter_1, filter_2, name, n = 10):
    """Print the number of products being analysed,
    print the top n + 1 absolute correlations,
    plot the heatmap,
    and save the heatmap as name.png.

    Usage
    ------

    correlation_between_products(df, filter_1, filter_2, name, n)
    filter_1 and filter_2 are boolean conditions, e.g. filter_1 = data['store_id'] == 'CA_4'
                                                       filter_2 = data['dept_id'] == 'HOBBIES_2'
    name is the name of the file with the heatmap that will be created as a .png
    n is an int
    """
    products_analised = len(df[(filter_1) & (filter_2)]['id'].drop_duplicates())
    print('Products:', products_analised)
    print()

    data_filtered = df[(filter_1) & (filter_2)][['date', 'id', 'demand']].copy()
    data_filtered = data_filtered.pivot(index='date', columns='id', values='demand')
    corr_matrix = data_filtered.corr()

    def get_redundant_pairs(df):
        '''Get diagonal and lower triangular pairs of correlation matrix'''
        pairs_to_drop = set()
        cols = df.columns
        for i in range(0, df.shape[1]):
            for j in range(0, i+1):
                pairs_to_drop.add((cols[i], cols[j]))
        return pairs_to_drop

    corr_table = corr_matrix.abs().unstack()
    labels_to_drop = get_redundant_pairs(data_filtered)
    corr_table = corr_table.drop(labels = labels_to_drop).sort_values(ascending = False)

    print("Top Absolute Correlations")
    print(corr_table[:n + 1])
    print()

    size_x = products_analised     #This is a good size to visualise the heatmap saved as .png
    size_y = products_analised
    plt.figure(figsize = (size_x, size_y))
    sns.set(font_scale = 1.5)

    ax = sns.heatmap(corr_matrix, annot = True, linewidth = 0.5, cmap='coolwarm')
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)

    plt.tight_layout()
    plt.savefig(name+ '.png')
    

    #plt.show()                              #This line is commented just to make this notebook lighter.
    plt.close('all')                         #This line is just to make this notebook lighter.

    return
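
The nested loop in `get_redundant_pairs` builds one Python tuple per redundant pair. An upper-triangle mask is a possible alternative that keeps each pair once without the loop. The helper name `top_abs_correlations` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(corr_matrix, n=10):
    """Return the n largest absolute correlations, each pair counted once."""
    # True strictly above the diagonal: drops self-pairs and mirrored pairs.
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    pairs = corr_matrix.abs().where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy demand columns: 'b' is exactly twice 'a', 'c' is unrelated noise.
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
top = top_abs_correlations(toy.corr(), n=2)
all_pairs = top_abs_correlations(toy.corr(), n=10)

print(top)
```

The explicit `dropna()` is there because newer pandas versions may keep masked NaN entries when stacking.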

In [14]:
correlation_between_products(data, data['store_id'] == 'CA_1', data['dept_id'] == 'HOBBIES_2', 'store_CA_1_department_HOBBIES_2', 10)

Products: 149

Top Absolute Correlations
id                             id                           
HOBBIES_2_071_CA_1_validation  HOBBIES_2_147_CA_1_validation    0.426263
HOBBIES_2_058_CA_1_validation  HOBBIES_2_126_CA_1_validation    0.407455
HOBBIES_2_087_CA_1_validation  HOBBIES_2_147_CA_1_validation    0.388825
                               HOBBIES_2_091_CA_1_validation    0.386128
HOBBIES_2_071_CA_1_validation  HOBBIES_2_091_CA_1_validation    0.382073
HOBBIES_2_091_CA_1_validation  HOBBIES_2_147_CA_1_validation    0.325895
HOBBIES_2_064_CA_1_validation  HOBBIES_2_147_CA_1_validation    0.318753
HOBBIES_2_034_CA_1_validation  HOBBIES_2_049_CA_1_validation    0.318606
HOBBIES_2_043_CA_1_validation  HOBBIES_2_109_CA_1_validation    0.310079
                               HOBBIES_2_095_CA_1_validation    0.301492
HOBBIES_2_071_CA_1_validation  HOBBIES_2_087_CA_1_validation    0.295804
dtype: float64



In [15]:
correlation_between_products(data, data['store_id'] == 'CA_2', data['dept_id'] == 'HOBBIES_2', 'store_CA_2_department_HOBBIES_2', 10)

Products: 149

Top Absolute Correlations
id                             id                           
HOBBIES_2_071_CA_2_validation  HOBBIES_2_091_CA_2_validation    0.463365
HOBBIES_2_087_CA_2_validation  HOBBIES_2_091_CA_2_validation    0.446427
HOBBIES_2_064_CA_2_validation  HOBBIES_2_091_CA_2_validation    0.441888
HOBBIES_2_091_CA_2_validation  HOBBIES_2_147_CA_2_validation    0.381800
HOBBIES_2_043_CA_2_validation  HOBBIES_2_121_CA_2_validation    0.379960
HOBBIES_2_071_CA_2_validation  HOBBIES_2_147_CA_2_validation    0.358865
HOBBIES_2_034_CA_2_validation  HOBBIES_2_049_CA_2_validation    0.355294
HOBBIES_2_064_CA_2_validation  HOBBIES_2_071_CA_2_validation    0.335379
HOBBIES_2_024_CA_2_validation  HOBBIES_2_043_CA_2_validation    0.329893
HOBBIES_2_010_CA_2_validation  HOBBIES_2_091_CA_2_validation    0.324645
HOBBIES_2_087_CA_2_validation  HOBBIES_2_147_CA_2_validation    0.323868
dtype: float64



In [16]:
correlation_between_products(data, data['store_id'] == 'CA_3', data['dept_id'] == 'HOBBIES_2', 'store_CA_3_department_HOBBIES_2', 10)

Products: 149

Top Absolute Correlations
id                             id                           
HOBBIES_2_087_CA_3_validation  HOBBIES_2_091_CA_3_validation    0.493718
HOBBIES_2_091_CA_3_validation  HOBBIES_2_147_CA_3_validation    0.482193
HOBBIES_2_071_CA_3_validation  HOBBIES_2_091_CA_3_validation    0.446407
HOBBIES_2_087_CA_3_validation  HOBBIES_2_147_CA_3_validation    0.442257
HOBBIES_2_038_CA_3_validation  HOBBIES_2_043_CA_3_validation    0.421248
HOBBIES_2_058_CA_3_validation  HOBBIES_2_126_CA_3_validation    0.405004
HOBBIES_2_024_CA_3_validation  HOBBIES_2_058_CA_3_validation    0.370106
HOBBIES_2_071_CA_3_validation  HOBBIES_2_087_CA_3_validation    0.358696
HOBBIES_2_126_CA_3_validation  HOBBIES_2_141_CA_3_validation    0.339128
HOBBIES_2_024_CA_3_validation  HOBBIES_2_141_CA_3_validation    0.338040
HOBBIES_2_058_CA_3_validation  HOBBIES_2_141_CA_3_validation    0.324488
dtype: float64



In [17]:
correlation_between_products(data, data['store_id'] == 'CA_4', data['dept_id'] == 'HOBBIES_2', 'store_CA_4_department_HOBBIES_2', 10)

Products: 149

Top Absolute Correlations
id                             id                           
HOBBIES_2_091_CA_4_validation  HOBBIES_2_147_CA_4_validation    0.405509
HOBBIES_2_018_CA_4_validation  HOBBIES_2_141_CA_4_validation    0.332974
HOBBIES_2_087_CA_4_validation  HOBBIES_2_147_CA_4_validation    0.324804
                               HOBBIES_2_091_CA_4_validation    0.316192
HOBBIES_2_071_CA_4_validation  HOBBIES_2_091_CA_4_validation    0.308538
HOBBIES_2_064_CA_4_validation  HOBBIES_2_087_CA_4_validation    0.300614
HOBBIES_2_010_CA_4_validation  HOBBIES_2_091_CA_4_validation    0.286140
HOBBIES_2_101_CA_4_validation  HOBBIES_2_123_CA_4_validation    0.278259
HOBBIES_2_064_CA_4_validation  HOBBIES_2_091_CA_4_validation    0.276275
HOBBIES_2_021_CA_4_validation  HOBBIES_2_091_CA_4_validation    0.272604
HOBBIES_2_106_CA_4_validation  HOBBIES_2_147_CA_4_validation    0.265120
dtype: float64



Ideally, we would compute the correlation matrix for every department and store.
Moreover, it would be interesting to compute the correlation matrix for the same product in different stores.
However, the main purpose of this analysis is to check whether we really need one model per product.

Since the highest absolute correlation between products in the four samples above is only 0.49, we decided to use one type of model, but calibrate it individually for each product id.
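
The cross-store comparison mentioned above could be sketched by pivoting one item with stores as columns. The demand values below are invented and the frame `toy` stands in for the notebook's `data`; only the column names follow the notebook.

```python
import pandas as pd

# Toy long-format data for one item in two stores (values are made up).
toy = pd.DataFrame({
    'date': ['2011-01-29', '2011-01-30', '2011-01-31', '2011-02-01'] * 2,
    'item_id': ['HOBBIES_2_091'] * 8,
    'store_id': ['CA_1'] * 4 + ['CA_2'] * 4,
    'demand': [0, 1, 2, 3, 1, 3, 5, 7],
})

# One column per store, one row per date, then correlate the columns.
one_item = toy[toy['item_id'] == 'HOBBIES_2_091']
by_store = one_item.pivot(index='date', columns='store_id', values='demand')
store_corr = by_store.corr()

print(store_corr)
```

On the real data the same pivot would produce a 10 x 10 matrix per item, one row and column per store.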

We will try to model each series with ARIMA(p, d, q). First, we will test whether each series is stationary and set d = 1 for those that are not. Second, if differencing does not achieve stationarity, we will check whether the series needs to be deseasonalized. Finally, we will compute the autocorrelation and the partial autocorrelation of each series to calibrate the q and p parameters, respectively.

In [18]:
#Set an 'ignore' flag on rows with no price, i.e. products that have not yet had their first sale.
data.loc[data['sell_price'].isnull(), 'part'] = 'ignore'
display(data.head())
display(data.tail())

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,day,demand,part,date,...,month,year,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI,sell_price
0,HOBBIES_1_001_CA_1_validation,HOBBIES_1_001,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,ignore,2011-01-29,...,1,2011,,,,,0,0,0,
1,HOBBIES_1_002_CA_1_validation,HOBBIES_1_002,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,ignore,2011-01-29,...,1,2011,,,,,0,0,0,
2,HOBBIES_1_003_CA_1_validation,HOBBIES_1_003,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,ignore,2011-01-29,...,1,2011,,,,,0,0,0,
3,HOBBIES_1_004_CA_1_validation,HOBBIES_1_004,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,ignore,2011-01-29,...,1,2011,,,,,0,0,0,
4,HOBBIES_1_005_CA_1_validation,HOBBIES_1_005,HOBBIES_1,HOBBIES,CA_1,CA,d_1,0,ignore,2011-01-29,...,1,2011,,,,,0,0,0,


Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,day,demand,part,date,...,month,year,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI,sell_price
60034805,FOODS_3_823_WI_3_validation,FOODS_3_823,FOODS_3,FOODS,WI_3,WI,d_1969,0,test2,2016-06-19,...,6,2016,NBAFinalsEnd,Sporting,Father's day,Cultural,0,0,0,2.980469
60034806,FOODS_3_824_WI_3_validation,FOODS_3_824,FOODS_3,FOODS,WI_3,WI,d_1969,0,test2,2016-06-19,...,6,2016,NBAFinalsEnd,Sporting,Father's day,Cultural,0,0,0,2.480469
60034807,FOODS_3_825_WI_3_validation,FOODS_3_825,FOODS_3,FOODS,WI_3,WI,d_1969,0,test2,2016-06-19,...,6,2016,NBAFinalsEnd,Sporting,Father's day,Cultural,0,0,0,3.980469
60034808,FOODS_3_826_WI_3_validation,FOODS_3_826,FOODS_3,FOODS,WI_3,WI,d_1969,0,test2,2016-06-19,...,6,2016,NBAFinalsEnd,Sporting,Father's day,Cultural,0,0,0,1.280273
60034809,FOODS_3_827_WI_3_validation,FOODS_3_827,FOODS_3,FOODS,WI_3,WI,d_1969,0,test2,2016-06-19,...,6,2016,NBAFinalsEnd,Sporting,Father's day,Cultural,0,0,0,1.0


In [19]:
#Verify that once a product has been sold there is always a price, even if the weekly demand of the product is zero.
#Rows without a price were flagged 'ignore' above, so a missing price would leave a gap in the 'train' days of that id.
data = data.loc[data['part'] == 'train', ['id', 'day', 'demand', 'part']].copy()
data['day'] = data['day'].str.extract(r'(\d+)', expand=False).astype('int')   #'d_1' -> 1
data['start'] = data.groupby('id')['day'].transform('min')   #First day of each id
data['end'] = data.groupby('id')['day'].transform('max')     #Last day of each id
data['count'] = data['end'] - data['start'] + 1              #Days expected if there are no gaps
data['sum'] = data.groupby('id')['day'].transform('sum')     #Actual sum of the day numbers
data['AP_sum'] = ((data['start'] + data['end']) * data['count']) / 2   #Sum of the arithmetic progression start..end
data['different'] = (data['sum'] - data['AP_sum']) != 0      #True if at least one day is missing

display(data.head())
print()
print('Missing existent products prices: ', data['different'].sum())

Unnamed: 0,id,day,demand,part,start,end,count,sum,AP_sum,different
7,HOBBIES_1_008_CA_1_validation,1,12,train,1,1913,1913,1830741,1830741.0,False
8,HOBBIES_1_009_CA_1_validation,1,2,train,1,1913,1913,1830741,1830741.0,False
9,HOBBIES_1_010_CA_1_validation,1,0,train,1,1913,1913,1830741,1830741.0,False
11,HOBBIES_1_012_CA_1_validation,1,0,train,1,1913,1913,1830741,1830741.0,False
14,HOBBIES_1_015_CA_1_validation,1,4,train,1,1913,1913,1830741,1830741.0,False



Missing existent products prices:  0
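
The check above relies on an arithmetic-progression identity: if the day numbers of a series run from start to end without gaps, their sum must equal (start + end) * count / 2. A minimal sketch in plain Python, assuming each day number appears at most once per id (the function name `has_gap` is hypothetical):

```python
def has_gap(days):
    """Return True if the day numbers are not a run of consecutive integers."""
    start, end = min(days), max(days)
    count = end - start + 1
    expected_sum = (start + end) * count // 2   # sum of start, start + 1, ..., end
    return sum(days) != expected_sum

full = list(range(1, 1914))              # d_1 ... d_1913 with no missing day
print(sum(full))                         # 1830741, the 'sum' shown in the table above
print(has_gap(full))                     # False

holed = [d for d in full if d != 500]    # same series with day 500 removed
print(has_gap(holed))                    # True
```

Removing any day between start and end lowers the actual sum while leaving the expected sum unchanged, so a single comparison per id is enough to detect a hole.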


Now that we know there is no problem with the keys used to merge the data nor with missing values, let's drop all unnecessary columns and study the parameters of the ARIMA model.

In [20]:
data = data[['id', 'demand', 'count']]
data.to_csv('data_study_ARIMA.csv', index = False)

In [21]:
data = pd.read_csv("data_study_ARIMA.csv")

In [22]:
display(data.head())

Unnamed: 0,id,demand,count
0,HOBBIES_1_008_CA_1_validation,12,1913
1,HOBBIES_1_009_CA_1_validation,2,1913
2,HOBBIES_1_010_CA_1_validation,0,1913
3,HOBBIES_1_012_CA_1_validation,0,1913
4,HOBBIES_1_015_CA_1_validation,4,1913


In [23]:
#Augmented Dickey-Fuller test.
#If p-value is greater than 0.01, we will consider that the series is non-stationary and repeat the test using the first difference.
products_ids = data['id'].unique()
products_ids_size = len(products_ids)

progress = 0   #Useful to see the progress of the code
progress_1000 = 1
start = time.time()

columns_ADF_test = ['product_id', 'ADF_test_diff_0', 'ADF_test_diff_1', 'diff_10%', 'diff_5%', 'diff_1%']
ADF_test = []
for e in products_ids:
    ADF_test_array = data.loc[np.in1d(data['id'], e)]['demand']
    ADF_test_diff_0 = round(adfuller(ADF_test_array)[1], 4)
    ADF_test_diff_1 = round(adfuller(ADF_test_array.diff(periods = 1)[1:])[1], 4)
    diff_10 = 0    #Consider different significance levels to accept or reject the null hypothesis.
    diff_05 = 0
    diff_01 = 0
    if ADF_test_diff_0 > 0.01:
        diff_01 = 1
        if ADF_test_diff_1 > 0.01: # Check whether the first difference alone is enough to obtain a stationary series.
            diff_01 = 2
    if ADF_test_diff_0 > 0.05:
        diff_05 = 1
        if ADF_test_diff_1 > 0.05: # Check whether the first difference alone is enough to obtain a stationary series.
            diff_05 = 2
    if ADF_test_diff_0 > 0.1:
        diff_10 = 1
        if ADF_test_diff_1 > 0.1: # Check whether the first difference alone is enough to obtain a stationary series.
            diff_10 = 2
    ADF_test.append([e, ADF_test_diff_0, ADF_test_diff_1, diff_10, diff_05, diff_01])
    
    
    progress += 1
    if progress > progress_1000 * 1000:
        df_ADF_test_partial = pd.DataFrame(ADF_test, columns = columns_ADF_test)
        df_ADF_test_partial.to_csv('df_ADF_test_partial.csv', index = False)
        progress_per = round(progress / products_ids_size, 4)
        print(progress_per)
        progress_1000 +=1
        
        end = time.time()
        elapsed = int(round(end - start, 0))
        total_run_time =  int(round(elapsed / (progress_per), 0))
        time_to_finish = int(round(elapsed / (progress_per), 0)) - elapsed
        print('Elapsed: {:02d}:{:02d}:{:02d}'.format(elapsed // 3600, (elapsed % 3600 // 60), elapsed % 60))
        print('Total run time: {:02d}:{:02d}:{:02d}'.format(total_run_time // 3600, (total_run_time % 3600 // 60), total_run_time % 60))
        print('Time to finish: {:02d}:{:02d}:{:02d}'.format(time_to_finish // 3600, (time_to_finish % 3600 // 60), time_to_finish % 60))
        print()

df_ADF_test = pd.DataFrame(ADF_test, columns = columns_ADF_test)
df_ADF_test.to_csv('ADF_test.csv', index = False)
df_ADF_test.head()

0.0328
Elapsed: 00:11:09
Total run time: 05:39:56
Time to finish: 05:28:47

0.0656
Elapsed: 00:22:10
Total run time: 05:37:54
Time to finish: 05:15:44

0.0984
Elapsed: 00:33:20
Total run time: 05:38:45
Time to finish: 05:05:25

0.1312
Elapsed: 00:44:37
Total run time: 05:40:04
Time to finish: 04:55:27

0.164
Elapsed: 00:55:49
Total run time: 05:40:21
Time to finish: 04:44:32

0.1968
Elapsed: 01:07:00
Total run time: 05:40:27
Time to finish: 04:33:27

0.2296
Elapsed: 01:18:17
Total run time: 05:40:57
Time to finish: 04:22:40

0.2624
Elapsed: 01:29:33
Total run time: 05:41:16
Time to finish: 04:11:43

0.2952
Elapsed: 01:40:42
Total run time: 05:41:07
Time to finish: 04:00:25

0.328
Elapsed: 01:51:53
Total run time: 05:41:06
Time to finish: 03:49:13

0.3608
Elapsed: 02:03:12
Total run time: 05:41:28
Time to finish: 03:38:16

0.3936
Elapsed: 02:14:14
Total run time: 05:41:02
Time to finish: 03:26:48

0.4264
Elapsed: 02:25:15
Total run time: 05:40:39
Time to finish: 03:15:24

0.4592
Elapsed

Unnamed: 0,product_id,ADF_test_diff_0,ADF_test_diff_1,diff_10%,diff_5%,diff_1%
0,HOBBIES_1_008_CA_1_validation,0.0,0.0,0,0,0
1,HOBBIES_1_009_CA_1_validation,0.0009,0.0,0,0,0
2,HOBBIES_1_010_CA_1_validation,0.0,0.0,0,0,0
3,HOBBIES_1_012_CA_1_validation,0.0,0.0,0,0,0
4,HOBBIES_1_015_CA_1_validation,0.0,0.0,0,0,0


In [24]:
df_ADF_test = pd.read_csv("ADF_test.csv")
df_ADF_test.head()

Unnamed: 0,product_id,ADF_test_diff_0,ADF_test_diff_1,diff_10%,diff_5%,diff_1%
0,HOBBIES_1_008_CA_1_validation,0.0,0.0,0,0,0
1,HOBBIES_1_009_CA_1_validation,0.0009,0.0,0,0,0
2,HOBBIES_1_010_CA_1_validation,0.0,0.0,0,0,0
3,HOBBIES_1_012_CA_1_validation,0.0,0.0,0,0,0
4,HOBBIES_1_015_CA_1_validation,0.0,0.0,0,0,0


In [25]:
#Verify the ADF test.

#Series that remain non-stationary after the first difference at the 1% level (possible seasonality or trend).
seasonality = df_ADF_test[df_ADF_test['diff_1%'] == 2][['diff_10%', 'diff_5%', 'diff_1%']].sum()
seasonality

diff_10%    2
diff_5%     3
diff_1%     4
dtype: int64

In [26]:
df_ADF_test[df_ADF_test['diff_1%'] == 2]

Unnamed: 0,product_id,ADF_test_diff_0,ADF_test_diff_1,diff_10%,diff_5%,diff_1%
18833,HOUSEHOLD_1_032_CA_3_validation,1.0,0.0254,1,1,2
28045,HOUSEHOLD_1_032_TX_1_validation,1.0,0.0761,1,2,2


As the ADF test shows, for only one product (in just two stores) we cannot reject the hypothesis of a unit root after taking the first difference.
Considering that we will model more than 30,000 product-store series, this is not a problem.
Moreover, it is probably due to some seasonality, which will be handled when we add other variables to our ARIMA model.
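
The rule that turns the two ADF p-values into the `diff_10%`, `diff_5%` and `diff_1%` columns can be written as a small function. This is a sketch in plain Python; the function name `differencing_order` is hypothetical, and the p-values are the ones shown in the table above.

```python
def differencing_order(p_level, p_first_diff, alpha):
    """0 if the level series rejects a unit root at alpha,
    1 if only the first difference does, 2 if neither does."""
    if p_level <= alpha:
        return 0
    if p_first_diff <= alpha:
        return 1
    return 2

# (ADF_test_diff_0, ADF_test_diff_1) for the two series listed above
p_values = {
    "HOUSEHOLD_1_032_CA_3_validation": (1.0, 0.0254),
    "HOUSEHOLD_1_032_TX_1_validation": (1.0, 0.0761),
}

for product_id, (p0, p1) in p_values.items():
    orders = [differencing_order(p0, p1, alpha) for alpha in (0.10, 0.05, 0.01)]
    print(product_id, orders)
# HOUSEHOLD_1_032_CA_3_validation [1, 1, 2]
# HOUSEHOLD_1_032_TX_1_validation [1, 2, 2]
```

The results reproduce the two table rows, and their column sums (2, 3 and 4) match the totals printed by cell 25.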

In [27]:
#Verify the ADF test by significance level.
print('diff_1%: ', df_ADF_test['diff_1%'].sum(), ' = ', round(df_ADF_test['diff_1%'].sum()/len(df_ADF_test['diff_1%']), 4))
print()
print('diff_5%: ', df_ADF_test['diff_5%'].sum(), ' = ', round(df_ADF_test['diff_5%'].sum()/len(df_ADF_test['diff_5%']), 4))
print()
print('diff_10%: ', df_ADF_test['diff_10%'].sum(), ' = ', round(df_ADF_test['diff_10%'].sum()/len(df_ADF_test['diff_10%']), 4))

diff_1%:  3315  =  0.1087

diff_5%:  1447  =  0.0475

diff_10%:  845  =  0.0277


Probably because there are a lot of zeros in our database, most of the series are already stationary.
Considering that only about 11% of our series need d = 1, we will use 1% as the significance level.
However, we will still try to analyze the stationarity by looking at the autocorrelation.
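
One detail when reading the totals printed above: they are sums of the d values, so a series flagged with 2 is counted twice. A short worked calculation, using only the numbers already shown in the outputs (the variable names are hypothetical):

```python
sum_d_1pct = 3315    # total printed above for diff_1%
n_series_d2 = 2      # rows with diff_1% == 2 in the table of cell 26

n_series_d1 = sum_d_1pct - 2 * n_series_d2       # series that need exactly one difference
n_series_differenced = n_series_d1 + n_series_d2 # series that need any differencing

print(n_series_d1, n_series_differenced)         # 3311 3313
```

The correction is two series out of more than 30,000, so the share of series that need differencing at the 1% level stays at roughly 11%.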

In [28]:
'''
Determine the parameters p, with the partial autocorrelation, and q, with the autocorrelation.
'''
products_ids = data['id'].unique()
products_ids_size = len(products_ids)

progress = 0   #Useful to see the progress of the code
progress_1000 = 1
start = time.time()

columns_piq_table = ['product_id', 'p_parameter', 'i_parameter', 'q_parameter']
piq_table = []
for e in products_ids:
    periods_for_diff = df_ADF_test.loc[np.in1d(df_ADF_test['product_id'], e)]['diff_1%'].iloc[0]
    df_product = data.loc[np.in1d(data['id'], e)][['demand', 'count']]
    observations =df_product['count'].iloc[0]
    nlags = np.min([100, observations - 1]) #In case we have a small sample
    demand = df_product['demand']
    if periods_for_diff > 0:
        nlags -= 1
        demand = demand.diff(periods = 1)[1:]  #We will not use the second difference.
        
    #'adjusted' replaces the 'unbiased' argument in statsmodels >= 0.12.
    autocorrelation = acf(demand, adjusted=False, nlags=nlags, qstat=True, fft=False, alpha=0.01, missing='none')
    q_parameter = 0
    aux = 0  #Auxiliary counter that prevents q_parameter from picking up an isolated lag whose autocorrelation is different from zero.
    counter = 0 #Number of significant lags, used to test if the series is stationary
    for i in range(1, nlags + 1):
        aux += 1
        if (0 <  autocorrelation[1][i][0]) or (0 >  autocorrelation[1][i][1]):
            counter += 1
            if aux < 3:
                aux = 0
                q_parameter = i
    
  
    #Make a second test of stationarity.
    #If the q parameter is too large (larger than 20), or if counter is much greater than the q parameter (greater than q + 10), the series is probably not stationary.
    #In that case, we difference the series once more and, if the new q is much smaller, we keep the new q and set d = 1.
    if (counter > q_parameter + 10) or (q_parameter > 20):
        periods_for_diff_temp = 1
        demand_temp = demand.diff(periods = periods_for_diff_temp)[periods_for_diff_temp:]
        autocorrelation = acf(demand_temp, adjusted=False, nlags=nlags, qstat=True, fft=False, alpha=0.01, missing='none')
        q_parameter_temp = 0
        aux = 0  #Auxiliary counter that prevents q_parameter_temp from picking up an isolated lag whose autocorrelation is different from zero.
        for i in range(1, nlags + 1):
            aux += 1
            if (0 <  autocorrelation[1][i][0]) or (0 >  autocorrelation[1][i][1]):
                if aux < 3:
                    aux = 0
                    q_parameter_temp = i
        if (q_parameter_temp < (q_parameter / 2)) or (q_parameter == 0):
            periods_for_diff = periods_for_diff_temp
            q_parameter = q_parameter_temp
            demand = demand_temp
    
    partial_autocorrelation = pacf(demand, nlags=nlags, alpha=0.01, method = 'ywm')
    p_parameter = 0
    aux = 0  #Auxiliary counter that prevents p_parameter from picking up an isolated lag whose partial autocorrelation is different from zero.
    for i in range(1, nlags + 1):
        aux += 1
        if (0 <  partial_autocorrelation[1][i][0]) or (0 >  partial_autocorrelation[1][i][1]):
            if aux < 3:
                aux = 0
                p_parameter = i
    
    
    piq_table.append([e, p_parameter, periods_for_diff, q_parameter])
    
    progress += 1
    if progress > progress_1000 * 1000:
        df_piq_table_partial = pd.DataFrame(piq_table, columns = columns_piq_table)
        df_piq_table_partial.to_csv('df_piq_table_partial.csv', index = False)
        progress_per = round(progress / products_ids_size, 4)
        print(progress_per)
        progress_1000 +=1
        
        end = time.time()
        elapsed = int(round(end - start, 0))
        total_run_time =  int(round(elapsed / (progress_per), 0))
        time_to_finish = int(round(elapsed / (progress_per), 0)) - elapsed
        print('Elapsed: {:02d}:{:02d}:{:02d}'.format(elapsed // 3600, (elapsed % 3600 // 60), elapsed % 60))
        print('Total run time: {:02d}:{:02d}:{:02d}'.format(total_run_time // 3600, (total_run_time % 3600 // 60), total_run_time % 60))
        print('Time to finish: {:02d}:{:02d}:{:02d}'.format(time_to_finish // 3600, (time_to_finish % 3600 // 60), time_to_finish % 60))
        print()
        
        
df_piq_table = pd.DataFrame(piq_table, columns = columns_piq_table)
df_piq_table.to_csv('df_piq_table.csv', index = False)
df_piq_table.head(10)

0.0328
Elapsed: 00:10:20
Total run time: 05:15:02
Time to finish: 05:04:42

0.0656
Elapsed: 00:20:35
Total run time: 05:13:46
Time to finish: 04:53:11

0.0984
Elapsed: 00:30:49
Total run time: 05:13:11
Time to finish: 04:42:22

0.1312
Elapsed: 00:41:06
Total run time: 05:13:16
Time to finish: 04:32:10

0.164
Elapsed: 00:51:22
Total run time: 05:13:13
Time to finish: 04:21:51

0.1968
Elapsed: 01:01:39
Total run time: 05:13:16
Time to finish: 04:11:37

0.2296
Elapsed: 01:11:59
Total run time: 05:13:31
Time to finish: 04:01:32

0.2624
Elapsed: 01:22:20
Total run time: 05:13:46
Time to finish: 03:51:26

0.2952
Elapsed: 01:32:30
Total run time: 05:13:21
Time to finish: 03:40:51

0.328
Elapsed: 01:42:41
Total run time: 05:13:04
Time to finish: 03:30:23

0.3608
Elapsed: 01:52:59
Total run time: 05:13:09
Time to finish: 03:20:10

0.3936
Elapsed: 02:03:05
Total run time: 05:12:43
Time to finish: 03:09:38

0.4264
Elapsed: 02:13:16
Total run time: 05:12:32
Time to finish: 02:59:16

0.4592
Elapsed

Unnamed: 0,product_id,p_parameter,i_parameter,q_parameter
0,HOBBIES_1_008_CA_1_validation,10,1,1
1,HOBBIES_1_009_CA_1_validation,13,1,1
2,HOBBIES_1_010_CA_1_validation,1,0,1
3,HOBBIES_1_012_CA_1_validation,2,0,2
4,HOBBIES_1_015_CA_1_validation,1,0,1
5,HOBBIES_1_016_CA_1_validation,0,0,0
6,HOBBIES_1_022_CA_1_validation,14,1,1
7,HOBBIES_1_023_CA_1_validation,8,1,2
8,HOBBIES_1_028_CA_1_validation,14,1,3
9,HOBBIES_1_029_CA_1_validation,1,0,1
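
The cut-off rule used for q (and reused for p) in the cell above can be isolated in a small function. This is a sketch in plain Python, assuming the input is a list of (lower, upper) confidence bounds for lags 1, 2, ... (that is, the confidence-interval array returned by `acf` or `pacf` with `alpha` set, without its lag-0 row). The function name `last_connected_significant_lag` and the sample intervals are hypothetical.

```python
def last_connected_significant_lag(intervals):
    """Largest lag whose interval excludes zero and that is separated from
    the previous accepted lag (or from lag 0) by at most one other lag."""
    order = 0
    since_last = 0   # plays the role of 'aux' in the notebook
    for lag, (lower, upper) in enumerate(intervals, start=1):
        since_last += 1
        significant = lower > 0 or upper < 0
        if significant and since_last < 3:
            since_last = 0
            order = lag
    return order

sig = (0.10, 0.30)     # interval entirely above zero -> significant
non = (-0.05, 0.05)    # interval containing zero -> not significant

# Lags 1, 2 and 4 are accepted; lag 7 is significant but isolated.
print(last_connected_significant_lag([sig, sig, non, sig, non, non, sig]))  # 4

# The first significant lag is lag 3, which is already too far from lag 0.
print(last_connected_significant_lag([non, non, sig]))                      # 0
```

Once two consecutive lags fail the test, the counter can no longer drop below 3, so every later significant lag is ignored. This is what keeps an isolated spike at a long lag from inflating p or q.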


In [29]:
df_piq_table = pd.read_csv("df_piq_table.csv")
df_piq_table.head(10)

Unnamed: 0,product_id,p_parameter,i_parameter,q_parameter
0,HOBBIES_1_008_CA_1_validation,10,1,1
1,HOBBIES_1_009_CA_1_validation,13,1,1
2,HOBBIES_1_010_CA_1_validation,1,0,1
3,HOBBIES_1_012_CA_1_validation,2,0,2
4,HOBBIES_1_015_CA_1_validation,1,0,1
5,HOBBIES_1_016_CA_1_validation,0,0,0
6,HOBBIES_1_022_CA_1_validation,14,1,1
7,HOBBIES_1_023_CA_1_validation,8,1,2
8,HOBBIES_1_028_CA_1_validation,14,1,3
9,HOBBIES_1_029_CA_1_validation,1,0,1
