# Model training

We train the model with XGBoost. We first ran a grid search over the hyperparameters in the hyperopt_xgb notebook, but it was too time-consuming to finish, so we stopped it partway and tuned the hyperparameters manually, starting from the best (still suboptimal) values it had found by then.
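
As a rough illustration of what an exhaustive grid over a few hyperparameters involves, the sketch below enumerates every combination with `itertools.product` and keeps the one with the lowest score. The grid values and the `fake_valid_rmse` function are hypothetical stand-ins, not the contents of the hyperopt_xgb notebook; in a real search the scoring function would train a model and return its validation RMSE.

```python
from itertools import product

# Hypothetical candidate values; not the grid used in the hyperopt_xgb notebook.
grid = {
    "eta": [0.05, 0.08, 0.1],
    "max_depth": [5, 7, 9],
    "gamma": [0, 1],
}

def fake_valid_rmse(params):
    """Stand-in for 'train a model and return its validation RMSE'.

    A made-up function so the sketch runs without any data. Its minimum is
    placed at eta=0.08, max_depth=7, gamma=1 purely to make the output easy
    to check.
    """
    return (
        abs(params["eta"] - 0.08)
        + 0.01 * abs(params["max_depth"] - 7)
        + 0.01 * abs(params["gamma"] - 1)
    )

def grid_search(grid, score_fn):
    """Try every combination in the grid and return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        candidate = dict(zip(names, values))
        score = score_fn(candidate)
        if score < best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

best_params, best_score = grid_search(grid, fake_valid_rmse)
print(best_params, best_score)
```

The number of combinations is the product of the list lengths (3 x 3 x 2 = 18 here), and each one costs a full training run, which is why an exhaustive search quickly becomes expensive on a dataset of this size.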

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
from multiprocessing import Pool
import xgboost as xgb
from itertools import product
import pickle
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

In [4]:
test = pd.read_csv('../input/test.csv')
X_train = pd.read_csv('X_train.csv')
X_cv = pd.read_csv('X_cv.csv')
X_test = pd.read_csv('X_test.csv')

In [5]:
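params = {
    'eta': 0.08,  # learning rate; 0.08 gave the best result so far
    'max_depth': 7,
    'objective': 'reg:linear',  # named 'reg:squarederror' in newer XGBoost releases
    'eval_metric': 'rmse',
    'seed': 3,
    'gamma': 1,
    'silent': True,  # replaced by 'verbosity' in newer XGBoost releases
}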

In [None]:
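# Use every column except the target, the month index, and the raw category name.
cols = [c for c in X_train.columns if c not in ['date_block_num', 'item_cnt_day', 'item_category_name']]

x1 = X_train[cols]
y1 = X_train['item_cnt_day']
x2 = X_cv[cols]
y2 = X_cv['item_cnt_day']

# Build each DMatrix once and reuse it for training and evaluation.
dtrain = xgb.DMatrix(x1, y1)
dvalid = xgb.DMatrix(x2, y2)
# Early stopping uses the last entry in the watchlist, here 'valid'.
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]
model = xgb.train(params, dtrain, 3500, watchlist, maximize=False,
                  verbose_eval=50, early_stopping_rounds=50)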
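
The `early_stopping_rounds=50` argument asks XGBoost to stop once the validation metric has gone 50 rounds without improving, and to remember the best round. The sketch below mimics that rule on a plain list of scores. The function `best_round_with_patience` and the sample numbers are illustrative assumptions, not XGBoost's internal code.

```python
def best_round_with_patience(scores, patience):
    """Return (best_round, stop_round) for a metric that should be minimized.

    Scans the scores in order and stops as soon as `patience` rounds have
    passed since the best score seen so far.
    """
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Ran out of rounds before patience was exhausted.
    return best_round, len(scores) - 1

# Made-up validation RMSE per round, with patience 3 instead of 50.
valid_rmse = [1.38, 1.20, 1.15, 1.13, 1.14, 1.135, 1.131, 1.12]
best, stopped = best_round_with_patience(valid_rmse, patience=3)
print(best, stopped)
```

In this toy run the search stops at round 6 and reports round 3 as best, so the lower score at round 7 is never seen. That is the trade-off of early stopping: a larger patience wastes more rounds but is less likely to stop on a temporary plateau.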

In [7]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5839576 entries, 0 to 5839575
Data columns (total 86 columns):
date_block_num                              int64
item_category_id                            int64
item_category_name                          object
item_cnt_day                                float64
item_id                                     int64
shop_id                                     int64
item_id_avg_item_price_lag_1                float64
item_id_sum_item_cnt_day_lag_1              float64
item_id_avg_item_cnt_day_lag_1              float64
shop_id_avg_item_price_lag_1                float64
shop_id_sum_item_cnt_day_lag_1              float64
shop_id_avg_item_cnt_day_lag_1              float64
item_category_id_avg_item_price_lag_1       float64
item_category_id_sum_item_cnt_day_lag_1     float64
item_category_id_avg_item_cnt_day_lag_1     float64
item_cnt_day_lag_1                          float64
item_id_avg_item_price_lag_2                float64
item_id_sum_

In [None]:
# Save the trained model, then reload it from the same file.
with open("xgb.pickle.dat", "wb") as f:
    pickle.dump(model, f)
with open("xgb.pickle.dat", "rb") as f:
    model = pickle.load(f)

pred = model.predict(xgb.DMatrix(X_test[cols]), ntree_limit=model.best_ntree_limit)

test['item_cnt_month'] = pred.clip(0,20)
test.drop(['shop_id', 'item_id'], axis=1, inplace=True)
test.to_csv('submission.csv', index=False)
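
The `pred.clip(0, 20)` call bounds every prediction to the range [0, 20] before it is written to the submission file. A dependency-free sketch of the same element-wise operation, on made-up predictions (`clip_values` is a hypothetical helper, not a NumPy function):

```python
def clip_values(values, lower, upper):
    """Bound each value to [lower, upper], as ndarray.clip does element-wise."""
    return [min(max(v, lower), upper) for v in values]

raw_pred = [-0.3, 0.0, 1.7, 19.9, 42.5]  # made-up raw model outputs
clipped = clip_values(raw_pred, 0, 20)
print(clipped)
```

Values already inside the range pass through unchanged; only the negative and the oversized predictions are moved to the nearest bound.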

# Result

[0]	train-rmse:1.40592	valid-rmse:1.37868
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 50 rounds.
[50]	train-rmse:1.03644	valid-rmse:1.14389
[100]	train-rmse:1.00372	valid-rmse:1.13304
[150]	train-rmse:0.987922	valid-rmse:1.12918
[200]	train-rmse:0.972436	valid-rmse:1.12584
[250]	train-rmse:0.960087	valid-rmse:1.12552
[300]	train-rmse:0.949489	valid-rmse:1.12458
[350]	train-rmse:0.940702	valid-rmse:1.12259
[400]	train-rmse:0.932332	valid-rmse:1.12257
Stopping. Best iteration:
[352]	train-rmse:0.940543	valid-rmse:1.12231
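
The `train-rmse` and `valid-rmse` columns in the log are root-mean-square errors on the training and validation sets. A minimal sketch of the metric on made-up numbers, where `rmse` is a hypothetical helper written for illustration rather than the function XGBoost calls:

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [0.0, 1.0, 2.0, 0.0]  # made-up targets
y_pred = [0.0, 2.0, 2.0, 2.0]  # made-up predictions
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a single large miss (the last pair here) contributes far more to the score than several small ones.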

Our score is roughly 0.95 on both the public and private leaderboards; the exact value varies with the random seed.
