In this notebook, we will load our preprocessed data, split it into train, validation and test subsets, then train a few models.

Author: Steven Vuong <br>
Last Updated: 26/04/2020

In [1]:
# Mount google drive
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


Thank you to Google for providing FREE computing resources in Colab! :) Much love for the efforts in democratisation of learning and research.

In [2]:
# cd into correct working directory
%cd '../gdrive/My Drive/self_teach/udacity_ml_eng_nanodegree'

/gdrive/My Drive/self_teach/udacity_ml_eng_nanodegree


In [0]:
# Import Libraries
import pandas as pd
import numpy as np
import warnings
import os


# Styling Preferences
pd.set_option('display.float_format', lambda x: '%.2f' % x)
warnings.filterwarnings("ignore")

In [0]:
# Load csv where we last left off
# Cast to int/float32 to save memory, we now have quite a number of features
data = pd.read_csv('./data/output/processed_data_pt2.csv', dtype={
    'date_block_num':'int32',
    'item_category_type_code':'int32',
    'item_category_subtype_code':'int32',
    'item_name_code':'int32',
    'city_code':'int32',
    'shop_id':'int32',
    'item_category_id':'int32',
    'item_id':'int32',
    'sum_item_price':'float32',
    'mean_item_price':'float32',
    'sum_item_count':'int32',
    'mean_item_count':'float32',
    'transactions':'int32',
    'year':'int32',
    'month':'int32',
    'sum_item_cnt_next_month':'float32',
    'item_price_unit':'float32',
    'hist_min_item_price':'float32',
    'hist_max_item_price':'float32',
    'price_increase':'float32',
    'price_decrease':'float32',
    'item_cnt_min':'float32',
    'item_cnt_max':'float32',
    'item_cnt_mean':'float32',
    'item_cnt_std':'float32',
    'item_cnt_shifted1':'float32',
    'item_cnt_shifted2':'float32',
    'item_cnt_shifted3':'float32',
    'item_trend':'float32'})
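
As a rough illustration of why the explicit dtypes matter, the sketch below builds a small toy frame and compares its memory footprint before and after downcasting. The values are made up; only the column names are borrowed from our data.

```python
import numpy as np
import pandas as pd

# Toy frame with pandas' default 64-bit dtypes (values are made up).
toy = pd.DataFrame({
    "shop_id": np.arange(1000, dtype="int64"),
    "mean_item_price": np.linspace(0.0, 5000.0, 1000),  # float64 by default
})

# Downcast to the 32-bit types used in the read_csv call above.
toy_small = toy.astype({"shop_id": "int32", "mean_item_price": "float32"})

# Bytes used by the column data only (index excluded).
bytes_before = int(toy.memory_usage(index=False).sum())
bytes_after = int(toy_small.memory_usage(index=False).sum())
print(bytes_before, bytes_after)
```

Halving the width of every numeric column roughly halves the memory the frame needs, which adds up quickly with 29 columns and over 7 million rows.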

In [5]:
data.tail()

Unnamed: 0,date_block_num,item_category_type_code,item_category_subtype_code,item_name_code,city_code,shop_id,item_category_id,item_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,sum_item_cnt_next_month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
7282795,33,0,0,0,0,34,0,969,0.0,0.0,0,0.0,0,2015,9,0.0,0.0,0.0,5490.0,0.0,5490.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7282796,33,0,0,0,0,46,0,969,0.0,0.0,0,0.0,0,2015,9,0.0,0.0,0.0,5490.0,0.0,5490.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7282797,33,0,0,0,0,41,0,969,0.0,0.0,0,0.0,0,2015,9,0.0,0.0,0.0,5490.0,0.0,5490.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7282798,33,0,0,0,0,39,0,969,0.0,0.0,0,0.0,0,2015,9,0.0,0.0,0.0,5490.0,0.0,5490.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7282799,33,0,0,0,0,45,0,969,0.0,0.0,0,0.0,0,2015,9,,0.0,0.0,5490.0,0.0,5490.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
# Double check data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7282800 entries, 0 to 7282799
Data columns (total 29 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   date_block_num              int32  
 1   item_category_type_code     int32  
 2   item_category_subtype_code  int32  
 3   item_name_code              int32  
 4   city_code                   int32  
 5   shop_id                     int32  
 6   item_category_id            int32  
 7   item_id                     int32  
 8   sum_item_price              float32
 9   mean_item_price             float32
 10  sum_item_count              int32  
 11  mean_item_count             float32
 12  transactions                int32  
 13  year                        int32  
 14  month                       int32  
 15  sum_item_cnt_next_month     float32
 16  item_price_unit             float32
 17  hist_min_item_price         float32
 18  hist_max_item_price         float32
 19  price_increase       

In [7]:
# Double check all our date blocks are present
np.unique(data.date_block_num.values)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33],
      dtype=int32)

Our test set is one month ahead of our training data: using the features from date_block_num 33, we want to predict block 34 for the matching shop_id and item_id pairs, as required for the Kaggle competition submission. Our training set will therefore be blocks 3-29 (we start at block 3 because our features use a 3 month rolling window, which the first 3 blocks cannot fully populate). Our validation set will be blocks 30-32 and our test set block 33. I wanted the training period to be as long as possible to give the model the best chance of a low RMSE on the test set. This does raise concerns about overfitting, which is something we can come back to and verify.
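
The block boundaries described above can be sketched as one small helper. `split_by_block` and its default ranges are hypothetical names for illustration, and the data is a toy frame with one row per block; the notebook itself uses `data.query(...)`.

```python
import pandas as pd

def split_by_block(df, train_range=(3, 29), val_range=(30, 32), test_block=33):
    """Split on date_block_num; ranges are inclusive on both ends."""
    block = df["date_block_num"]
    train = df[block.between(*train_range)].copy()
    val = df[block.between(*val_range)].copy()
    test = df[block == test_block].copy()
    return train, val, test

# Toy data: one row per block, 0..33 (values are made up).
toy = pd.DataFrame({"date_block_num": range(34), "y": range(34)})
toy_train, toy_val, toy_test = split_by_block(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

Keeping the split strictly ordered in time means the validation and test rows always come after the rows the model is trained on.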

In [8]:
# Create train data subset
train = data.query('date_block_num >= 3 and date_block_num < 30').copy()
train.head()

Unnamed: 0,date_block_num,item_category_type_code,item_category_subtype_code,item_name_code,city_code,shop_id,item_category_id,item_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,sum_item_cnt_next_month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
25805,3,0,9,365,0,2,2,5572,2980.0,1490.0,2,1.0,2,2013,3,2.0,1490.0,0.0,18979.5,2980.0,15999.5,1.0,2.0,1.33,0.58,1.0,1.0,9.0,-2.25
25806,3,0,9,365,1,3,2,5572,4470.0,1490.0,3,1.0,3,2013,3,3.0,1490.0,0.0,18979.5,4470.0,14509.5,1.0,3.0,1.67,1.15,1.0,1.0,3.0,-0.5
25807,3,0,9,365,6,12,2,5572,1490.0,1490.0,1,1.0,1,2013,3,1.0,1490.0,0.0,18979.5,1490.0,17489.5,1.0,3.0,2.0,1.41,3.0,0.0,0.0,-0.5
25808,3,0,9,365,7,14,2,5572,1490.0,1490.0,1,1.0,1,2013,3,1.0,1490.0,0.0,18979.5,1490.0,17489.5,1.0,2.0,1.5,0.71,2.0,0.0,0.0,-0.25
25809,3,0,9,365,9,16,2,5572,1490.0,1490.0,1,1.0,1,2013,3,1.0,1490.0,0.0,18979.5,1490.0,17489.5,1.0,5.0,3.0,2.0,3.0,5.0,3.0,-2.5


In [9]:
# Create validation data subset
validation = data.query('date_block_num >= 30 and date_block_num < 33').copy()
validation.head()

Unnamed: 0,date_block_num,item_category_type_code,item_category_subtype_code,item_name_code,city_code,shop_id,item_category_id,item_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,sum_item_cnt_next_month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
489586,30,0,9,365,0,2,2,5572,1990.0,1990.0,1,1.0,1,2015,6,1.0,1990.0,0.0,18979.5,1990.0,16989.5,1.0,1.0,1.0,0.0,1.0,1.0,1.0,-0.5
489587,30,0,9,365,2,4,2,5572,1990.0,1990.0,1,1.0,1,2015,6,1.0,1990.0,0.0,18979.5,1990.0,16989.5,1.0,1.0,1.0,0.0,1.0,1.0,1.0,-0.5
489588,30,0,9,365,5,11,2,5572,1322.0,1322.0,1,1.0,1,2015,6,1.0,1322.0,0.0,18979.5,1322.0,17657.5,1.0,1.0,1.0,0.0,1.0,1.0,3.0,-1.0
489589,30,0,9,365,7,14,2,5572,1990.0,1990.0,1,1.0,1,2015,6,1.0,1990.0,0.0,18979.5,1990.0,16989.5,1.0,1.0,1.0,0.0,1.0,1.0,1.0,-0.5
489590,30,0,9,365,12,28,2,5572,1590.0,1590.0,1,1.0,1,2015,6,1.0,1590.0,0.0,18979.5,1590.0,17389.5,1.0,2.0,1.33,0.58,2.0,1.0,5.0,-1.75


In [10]:
# Create test data subset
test = data.query('date_block_num == 33').copy()
test.head()

Unnamed: 0,date_block_num,item_category_type_code,item_category_subtype_code,item_name_code,city_code,shop_id,item_category_id,item_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,sum_item_cnt_next_month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
571478,33,0,9,365,4,7,2,5572,1790.0,1790.0,1,1.0,1,2015,9,1.0,1790.0,0.0,18979.5,1790.0,17189.5,1.0,2.0,1.33,0.58,1.0,2.0,1.0,-0.75
571479,33,0,9,365,6,12,2,5572,1300.0,1300.0,1,1.0,1,2015,9,1.0,1300.0,0.0,18979.5,1300.0,17679.5,1.0,1.0,1.0,0.0,1.0,1.0,2.0,-0.75
571480,33,0,9,365,12,24,2,5572,1790.0,1790.0,1,1.0,1,2015,9,1.0,1790.0,0.0,18979.5,1790.0,17189.5,1.0,2.0,1.33,0.58,2.0,1.0,2.0,-1.0
571481,33,0,9,366,2,4,2,5643,3290.0,3290.0,1,1.0,1,2015,9,1.0,3290.0,0.0,35260.0,3290.0,31970.0,1.0,2.0,1.33,0.58,1.0,2.0,1.0,-0.75
571482,33,0,9,366,3,5,2,5637,2798.0,2798.0,1,1.0,1,2015,9,1.0,2798.0,0.0,19920.0,2798.0,17122.0,1.0,2.0,1.33,0.58,1.0,2.0,2.0,-1.0


In [11]:
print('Train set records:', train.shape[0])
print('Validation set records:', validation.shape[0])
print('Test set records:', test.shape[0])

Train set records: 5783400
Validation set records: 642600
Test set records: 214200


In [12]:
test.head()

Unnamed: 0,date_block_num,item_category_type_code,item_category_subtype_code,item_name_code,city_code,shop_id,item_category_id,item_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,sum_item_cnt_next_month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
571478,33,0,9,365,4,7,2,5572,1790.0,1790.0,1,1.0,1,2015,9,1.0,1790.0,0.0,18979.5,1790.0,17189.5,1.0,2.0,1.33,0.58,1.0,2.0,1.0,-0.75
571479,33,0,9,365,6,12,2,5572,1300.0,1300.0,1,1.0,1,2015,9,1.0,1300.0,0.0,18979.5,1300.0,17679.5,1.0,1.0,1.0,0.0,1.0,1.0,2.0,-0.75
571480,33,0,9,365,12,24,2,5572,1790.0,1790.0,1,1.0,1,2015,9,1.0,1790.0,0.0,18979.5,1790.0,17189.5,1.0,2.0,1.33,0.58,2.0,1.0,2.0,-1.0
571481,33,0,9,366,2,4,2,5643,3290.0,3290.0,1,1.0,1,2015,9,1.0,3290.0,0.0,35260.0,3290.0,31970.0,1.0,2.0,1.33,0.58,1.0,2.0,1.0,-0.75
571482,33,0,9,366,3,5,2,5637,2798.0,2798.0,1,1.0,1,2015,9,1.0,2798.0,0.0,19920.0,2798.0,17122.0,1.0,2.0,1.33,0.58,1.0,2.0,2.0,-1.0


In [13]:
# Merge with test competition data to ensure test data is in the correct order.

# Load in the competition test dataset provided
# (ID runs past the int16 maximum of 32767, so it is read as int32)
test_competition = pd.read_csv('./data/competition_files/test.csv',
                    dtype={'ID': 'int32', 'shop_id': 'int16', 'item_id': 'int16'}
                   ).set_index('ID')
test_competition.head()

Unnamed: 0_level_0,shop_id,item_id
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
0,5,5037
1,5,5320
2,5,5233
3,5,5232
4,5,5268


In [14]:
# Merge and check
test_X = pd.merge(test_competition, test, on=['shop_id', 'item_id'], how='left')
print(len(test_X))
test_X.head()

214200


Unnamed: 0,shop_id,item_id,date_block_num,item_category_type_code,item_category_subtype_code,item_name_code,city_code,item_category_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,sum_item_cnt_next_month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
0,5,5037,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,0.0,25990.0,0.0,25990.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5,5320,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5,5233,33.0,3.0,9.0,349.0,3.0,19.0,1199.0,1199.0,1.0,1.0,1.0,2015.0,9.0,1.0,1199.0,0.0,7191.75,1199.0,5992.75,1.0,3.0,1.67,1.15,3.0,1.0,2.0,-1.25
3,5,5232,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,0.0,4796.0,0.0,4796.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,5268,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
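
A left merge keeps the row order of the left frame, which is what lets us line our features up with the competition file. The sketch below checks this on toy data (made-up values, hypothetical variable names). It also uses the `validate` and `indicator` options of `pd.merge` to guard against duplicate keys and to count rows with no matching features.

```python
import pandas as pd

# Toy stand-ins for the competition file and our block-33 features.
competition = pd.DataFrame({"shop_id": [5, 5, 2], "item_id": [30, 10, 30]})
features = pd.DataFrame({
    "shop_id": [2, 5],
    "item_id": [30, 10],
    "item_cnt_mean": [1.5, 0.5],
})

keys = ["shop_id", "item_id"]
merged = pd.merge(
    competition, features,
    on=keys,
    how="left",
    validate="many_to_one",  # raise if features has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Row order should follow the competition file exactly.
same_order = bool((merged[keys].to_numpy() == competition[keys].to_numpy()).all())
# Rows with no features come back as NaN and are flagged 'left_only'.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(same_order, n_unmatched)
```

Any unmatched rows end up as NaN in the feature columns, so they are worth counting before missing values are filled.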


In [0]:
# Create X and Y Subsets for train, val and test
train_X = train.drop(['date_block_num', 'sum_item_cnt_next_month'], axis=1)
train_Y = train['sum_item_cnt_next_month']

validation_X = validation.drop(['date_block_num', 'sum_item_cnt_next_month'], axis=1)
validation_Y = validation['sum_item_cnt_next_month']

test_X = test_X.drop(['date_block_num', 'sum_item_cnt_next_month'], axis=1)

In [0]:
# Filling with the mean was too slow, so we use the median instead
datasets = [train_X, train_Y, validation_X, validation_Y, test_X]

# Replace missing values with the median of the column. 
for dataset in datasets:
    dataset.fillna(dataset.median(), inplace=True)
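
The cell above fills each subset with its own median. A common variation, sketched here on toy data, is to compute the medians on the training subset only and reuse them everywhere, so no validation or test statistics feed into preprocessing. This is an alternative to consider, not what the notebook does; all names and values below are made up.

```python
import numpy as np
import pandas as pd

# Toy feature frames (made-up values).
toy_train = pd.DataFrame({"item_cnt_mean": [1.0, 3.0, np.nan, 5.0]})
toy_val = pd.DataFrame({"item_cnt_mean": [np.nan, 10.0]})

# Medians come from the training subset only...
train_medians = toy_train.median()

# ...and are reused for every subset.
toy_train_filled = toy_train.fillna(train_medians)
toy_val_filled = toy_val.fillna(train_medians)
print(toy_val_filled["item_cnt_mean"].tolist())
```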

In [17]:
# Sanity check that we have no missing values.
print("Train X Null:", train_X.isnull().sum().sum())
print("Validation X Null:", validation_X.isnull().sum().sum())
print("Test X Null:", test_X.isnull().sum().sum())

Train X Null: 0
Validation X Null: 0
Test X Null: 0


In [18]:
# Sanity check that we have no missing values.
print("Train Y Null:", train_Y.isnull().sum())
print("Validation Y Null:", validation_Y.isnull().sum())

Train Y Null: 0
Validation Y Null: 0


In [19]:
print('Train set records:', train.shape[0])
print('Validation set records:', validation.shape[0])
print('Test set records:', test.shape[0])

Train set records: 5783400
Validation set records: 642600
Test set records: 214200


In [20]:
# Sanity check the order
test_X.head()

Unnamed: 0,shop_id,item_id,item_category_type_code,item_category_subtype_code,item_name_code,city_code,item_category_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
0,5,5037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,25990.0,0.0,25990.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5,5320,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,5,5233,3.0,9.0,349.0,3.0,19.0,1199.0,1199.0,1.0,1.0,1.0,2015.0,9.0,1199.0,0.0,7191.75,1199.0,5992.75,1.0,3.0,1.67,1.15,3.0,1.0,2.0,-1.25
3,5,5232,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,4796.0,0.0,4796.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,5268,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
train_X.head()

Unnamed: 0,item_category_type_code,item_category_subtype_code,item_name_code,city_code,shop_id,item_category_id,item_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
25805,0,9,365,0,2,2,5572,2980.0,1490.0,2,1.0,2,2013,3,1490.0,0.0,18979.5,2980.0,15999.5,1.0,2.0,1.33,0.58,1.0,1.0,9.0,-2.25
25806,0,9,365,1,3,2,5572,4470.0,1490.0,3,1.0,3,2013,3,1490.0,0.0,18979.5,4470.0,14509.5,1.0,3.0,1.67,1.15,1.0,1.0,3.0,-0.5
25807,0,9,365,6,12,2,5572,1490.0,1490.0,1,1.0,1,2013,3,1490.0,0.0,18979.5,1490.0,17489.5,1.0,3.0,2.0,1.41,3.0,0.0,0.0,-0.5
25808,0,9,365,7,14,2,5572,1490.0,1490.0,1,1.0,1,2013,3,1490.0,0.0,18979.5,1490.0,17489.5,1.0,2.0,1.5,0.71,2.0,0.0,0.0,-0.25
25809,0,9,365,9,16,2,5572,1490.0,1490.0,1,1.0,1,2013,3,1490.0,0.0,18979.5,1490.0,17489.5,1.0,5.0,3.0,2.0,3.0,5.0,3.0,-2.5


In [22]:
validation_X.head()

Unnamed: 0,item_category_type_code,item_category_subtype_code,item_name_code,city_code,shop_id,item_category_id,item_id,sum_item_price,mean_item_price,sum_item_count,mean_item_count,transactions,year,month,item_price_unit,hist_min_item_price,hist_max_item_price,price_increase,price_decrease,item_cnt_min,item_cnt_max,item_cnt_mean,item_cnt_std,item_cnt_shifted1,item_cnt_shifted2,item_cnt_shifted3,item_trend
489586,0,9,365,0,2,2,5572,1990.0,1990.0,1,1.0,1,2015,6,1990.0,0.0,18979.5,1990.0,16989.5,1.0,1.0,1.0,0.0,1.0,1.0,1.0,-0.5
489587,0,9,365,2,4,2,5572,1990.0,1990.0,1,1.0,1,2015,6,1990.0,0.0,18979.5,1990.0,16989.5,1.0,1.0,1.0,0.0,1.0,1.0,1.0,-0.5
489588,0,9,365,5,11,2,5572,1322.0,1322.0,1,1.0,1,2015,6,1322.0,0.0,18979.5,1322.0,17657.5,1.0,1.0,1.0,0.0,1.0,1.0,3.0,-1.0
489589,0,9,365,7,14,2,5572,1990.0,1990.0,1,1.0,1,2015,6,1990.0,0.0,18979.5,1990.0,16989.5,1.0,1.0,1.0,0.0,1.0,1.0,1.0,-0.5
489590,0,9,365,12,28,2,5572,1590.0,1590.0,1,1.0,1,2015,6,1590.0,0.0,18979.5,1590.0,17389.5,1.0,2.0,1.33,0.58,2.0,1.0,5.0,-1.75


In [0]:
# Build output directory
data_dir = './data/output/'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [0]:
# Save dataframes as csv files 
pd.DataFrame(train_X).to_csv(os.path.join(data_dir, 'train_X.csv'), header=True, index=True)
pd.DataFrame(train_Y).to_csv(os.path.join(data_dir, 'train_Y.csv'), header=True, index=True)

pd.DataFrame(validation_X).to_csv(os.path.join(data_dir, 'validation_X.csv'), header=True, index=True)
pd.DataFrame(validation_Y).to_csv(os.path.join(data_dir, 'validation_Y.csv'), header=True, index=True)

pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test_X.csv'), header=True, index=True)

Other things we might want to consider:
-  Normalisation of the following:
    -  sum_item_price
    -  mean_item_price
    -  sum_item_count
    -  mean_item_count
    -  transactions
-  Mean encode variables instead of label encoding
    -  Could increase probability of overfitting
-  Potentially many more (unconsidered)
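
As a minimal sketch of the mean encoding idea, the example below replaces a label-encoded column with the mean of the target for each code, computed on training rows only. The data is a toy frame and `city_code_mean_enc` is a hypothetical column name. A real version would likely need out-of-fold means or smoothing on the training rows themselves to limit the overfitting risk mentioned above.

```python
import pandas as pd

# Toy training data: a label-encoded category and the target (made up).
toy_train = pd.DataFrame({
    "city_code": [0, 0, 1, 1, 1],
    "sum_item_cnt_next_month": [1.0, 3.0, 0.0, 0.0, 3.0],
})
toy_val = pd.DataFrame({"city_code": [0, 1, 2]})  # code 2 is unseen in train

target = "sum_item_cnt_next_month"
global_mean = toy_train[target].mean()
city_means = toy_train.groupby("city_code")[target].mean()

# Map each code to its training-set target mean; unseen codes fall back
# to the global training mean.
toy_val["city_code_mean_enc"] = (
    toy_val["city_code"].map(city_means).fillna(global_mean)
)
print(toy_val["city_code_mean_enc"].tolist())
```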

In [25]:
train_X.columns, len(train_X.columns)

(Index(['item_category_type_code', 'item_category_subtype_code',
        'item_name_code', 'city_code', 'shop_id', 'item_category_id', 'item_id',
        'sum_item_price', 'mean_item_price', 'sum_item_count',
        'mean_item_count', 'transactions', 'year', 'month', 'item_price_unit',
        'hist_min_item_price', 'hist_max_item_price', 'price_increase',
        'price_decrease', 'item_cnt_min', 'item_cnt_max', 'item_cnt_mean',
        'item_cnt_std', 'item_cnt_shifted1', 'item_cnt_shifted2',
        'item_cnt_shifted3', 'item_trend'],
       dtype='object'), 27)

As we have so many columns, we can only pass a subset of features to each XGBoost model (otherwise our runtime would crash due to memory issues). So instead, we will train 3 separate models, each on a different subset of the features, and ensemble/stack them by fitting a simple linear regression model on top of their outputs to get our final predictions. Each model will train on 9 features, one third of the total. The three subsets sum to 27 features, which matches the number of columns in train_X, as we would expect.
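
The stacking idea can be sketched end to end with NumPy alone. To keep the example light, plain least-squares models stand in for the XGBoost base models, the data is synthetic, and the meta-model is fitted on the validation predictions. All three are assumptions of this sketch, not a description of how the final ensemble in this project is built.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 6 features, target is a noisy linear mix (made up).
X_train = rng.normal(size=(200, 6))
X_val = rng.normal(size=(100, 6))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0, 1.5])
y_train = X_train @ w_true + rng.normal(scale=0.1, size=200)
y_val = X_val @ w_true + rng.normal(scale=0.1, size=100)

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict with coefficients returned by fit_linear."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

# Strided split into three disjoint feature subsets (columns 0,3 / 1,4 / 2,5).
subsets = [np.arange(6)[i::3] for i in range(3)]

# Level 0: one base model per feature subset, fitted on the training data.
base = [fit_linear(X_train[:, cols], y_train) for cols in subsets]

# Level 1: a linear model fitted on the base models' validation predictions.
val_preds = np.column_stack(
    [predict_linear(c, X_val[:, cols]) for c, cols in zip(base, subsets)]
)
meta = fit_linear(val_preds, y_val)
stacked = predict_linear(meta, val_preds)

base_rmse = [rmse(y_val, val_preds[:, j]) for j in range(3)]
stacked_rmse = rmse(y_val, stacked)
print(base_rmse, stacked_rmse)
```

Scoring the meta-model on the same rows it was fitted on is optimistic, so a fair comparison would score the stacked model on data that neither level has seen.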

Some future things to try (limited by computational resources, but others can try):
-  Play around with variations of the features we input to our training model (just keep in mind our runtime might crash), including:
  -  Number of features
  -  Which features we select

-  Play around with hyperparameters and try to optimise them to get better results
  -  For both XGBoost and the linear regression
  -  Bayesian hyperparameter optimisation could be an approach.

-  Ensemble more than just XGBoost models: we could also try Random Forests, CatBoost, even clustering methods and so on, as well as other models to stack them with.



In [26]:
# Split training columns into 3 subsets
train_features = [train_X.columns[i::3] for i in range(3)]
print(train_features)
print(len(train_features))

[Index(['item_category_type_code', 'city_code', 'item_id', 'sum_item_count',
       'year', 'hist_min_item_price', 'price_decrease', 'item_cnt_mean',
       'item_cnt_shifted2'],
      dtype='object'), Index(['item_category_subtype_code', 'shop_id', 'sum_item_price',
       'mean_item_count', 'month', 'hist_max_item_price', 'item_cnt_min',
       'item_cnt_std', 'item_cnt_shifted3'],
      dtype='object'), Index(['item_name_code', 'item_category_id', 'mean_item_price', 'transactions',
       'item_price_unit', 'price_increase', 'item_cnt_max',
       'item_cnt_shifted1', 'item_trend'],
      dtype='object')]
3


In [27]:
# Double check the sum of the columns hits 27
sum([len(tf) for tf in train_features])

27

In [0]:
# Import libraries for training
from xgboost import XGBRegressor
import pickle
import time

XGBoost Docs: https://xgboost.readthedocs.io/en/latest/parameter.html
We increase max_depth to 8 as we want a more complex model. We also set min_child_weight to 300 for a very conservative model: with a squared-error objective, this means a child node needs at least 300 instances (not that large considering the size of our training set).

We will clip predictions to the range 0 to 50 -> see next notebook (this is definitely something to experiment with). We cap at 50 because values far above that are unlikely and may hurt performance on our test set, and we clip at 0 so that no prediction is negative. We set the number of estimators to 500 so that there are more trees to try and get a better overall result; this feels like a good compromise between the computing resources required and model performance. We try to avoid crashing our runtime session, so we keep the workload just below the limit to maximise efficiency. We also reduce colsample_bytree and subsample slightly below 1, as sampling everything takes longer to train (and caused the runtime to crash, not enough memory, welp). The learning rate (eta) is left at the XGBoost default of 0.3.
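
The clipping step and the RMSE metric can be sketched in a few lines. `clipped_rmse` is a hypothetical helper name and the numbers are made up; the bounds 0 and 50 are the ones discussed above.

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lower=0.0, upper=50.0):
    """RMSE after clipping predictions into [lower, upper]."""
    y_pred = np.clip(y_pred, lower, upper)
    return float(np.sqrt(np.mean((np.asarray(y_true) - y_pred) ** 2)))

# Made-up predictions: one negative, one far above the cap.
y_true = np.array([0.0, 2.0, 50.0])
y_pred = np.array([-1.0, 2.0, 80.0])

clipped = np.clip(y_pred, 0.0, 50.0)
score = clipped_rmse(y_true, y_pred)
print(clipped, score)
```

Clipping only helps when the true values respect the same bounds, which is why the cap of 50 is worth experimenting with.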

Helper Ref: https://machinelearningmastery.com/tune-number-size-decision-trees-xgboost-python/

We save our model as pickle files: https://machinelearningmastery.com/save-gradient-boosting-models-xgboost-python/

In [29]:
model_savepath = os.path.join(data_dir, 'model')
if not os.path.exists(model_savepath):
    os.makedirs(model_savepath)

for i in range(len(train_features)):

  print(f"Training {i + 1}/{len(train_features)} subset")

  # Sample subset features list
  xgb_features = train_features[i]

  # Build training and validation data
  xgb_train = train_X[xgb_features]
  xgb_val = validation_X[xgb_features]

  # Record start time
  start_time = time.time()

  # Build XGBoost Model
  model = XGBRegressor(
      max_depth=8,
      n_estimators=500,
      min_child_weight=300, 
      colsample_bytree=0.8, 
      subsample=0.7, 
      eta=0.3,    
      seed=0)
  
  # Train our model
  model.fit(
      xgb_train, 
      train_Y, 
      eval_metric="rmse", 
      eval_set=[(xgb_train, train_Y), (xgb_val, validation_Y)], 
      verbose=True, 
      early_stopping_rounds = 20)
  
  # Report elapsed training time in seconds
  elapsed = time.time() - start_time
  print(elapsed)

  # Save our model (the with-block closes the file handle)
  with open(os.path.join(model_savepath, f"MODEL_{i}.pickle"), "wb") as f:
    pickle.dump(model, f)

Training 1/3 subset
[0]	validation_0-rmse:1.29202	validation_1-rmse:0.951087
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.

Will train until validation_1-rmse hasn't improved in 20 rounds.
[1]	validation_0-rmse:1.21987	validation_1-rmse:0.882595
[2]	validation_0-rmse:1.15938	validation_1-rmse:0.828341
[3]	validation_0-rmse:1.09637	validation_1-rmse:0.766295
[4]	validation_0-rmse:1.04254	validation_1-rmse:0.712469
[5]	validation_0-rmse:0.997107	validation_1-rmse:0.665121
[6]	validation_0-rmse:0.976733	validation_1-rmse:0.641894
[7]	validation_0-rmse:0.940993	validation_1-rmse:0.607766
[8]	validation_0-rmse:0.915831	validation_1-rmse:0.583292
[9]	validation_0-rmse:0.889491	validation_1-rmse:0.558886
[10]	validation_0-rmse:0.867737	validation_1-rmse:0.538399
[11]	validation_0-rmse:0.853205	validation_1-rmse:0.524167
[12]	validation_0-rmse:0.837457	validation_1-rmse:0.509979
[13]	validation_0-rmse:0.824471	validation_1-rmse:0.498215
[14]	valid

For this training instance, we have used XGBRegressor (XGBoost) as it is quite powerful and relatively fast.

Other Possibilities:
-  LightGBM as it runs very quickly and is unlikely to crash our runtime, so could potentially run with more features.
-  RandomForestRegressor: Also quite powerful, so could be interesting to investigate.


To reduce computational requirements, we could split our data into smaller subsets to train on and ensemble. However, this runs the risk of diluting our model accuracy, so we will avoid this option. In addition, we could do some more feature engineering and remove rows where values are too high (previously we had to resolve 'infinity' values). Or, we could reconsider how we engineer features (open to suggestions here).


All in all, we successfully trained XGBoost models covering all of our features! Now we can build a linear regression model on top of them to get an ensembled/stacked model that hopefully performs better than any individual model. First, we will evaluate each of our three models to see what we can learn about the importance of each feature, as well as which models might outperform the others as a result. We can also make predictions on our validation and training data to see how our XGBoost models have performed on them.

For a production-grade system, more tests may be required around feature engineering (preprocessing) to get our data ready for training. We would also need a more rigorous, deterministic method of splitting our data so that training can run in a live pipeline, if the company (and dataset providers) wished to predict future data on a monthly basis. Hence the need for pipelines and the like will grow. For now, we will stick with a Jupyter Notebook for data discovery, one-off training and proof of concept. In doing so, we also tell the story of the journey through each of the mentioned steps.