# Vanilla XGBoost File

We want a vanilla XGBoost file that software engineers can use to run models from the terminal without touching any of the code. What follows is a black-box implementation of an XGBoost model that assumes model-ready data (no preprocessing necessary).
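As a rough idea of what the eventual terminal entry point could look like, here is a minimal `argparse` sketch. The flag names (`--data`, `--label-col`, `--eta`, `--max-depth`) and the `build_parser` helper are hypothetical and not part of this notebook; only the default values are taken from the notebook's own cells.

```python
import argparse


def build_parser():
    """Hypothetical CLI for the model; every flag name here is an assumption."""
    parser = argparse.ArgumentParser(
        description="Run the vanilla XGBoost model on model-ready data."
    )
    parser.add_argument("--data", default="data/training_data.csv",
                        help="Path to the model-ready CSV file.")
    parser.add_argument("--label-col", default="label",
                        help="Name of the target column.")
    parser.add_argument("--eta", type=float, default=0.1,
                        help="Learning rate.")
    parser.add_argument("--max-depth", type=int, default=5,
                        help="Maximum tree depth.")
    return parser


# Parse an explicit argument list so the sketch also runs inside a notebook.
args = build_parser().parse_args(["--eta", "0.05"])
print(args.data, args.label_col, args.eta, args.max_depth)
```

Flags that are not passed keep the notebook's defaults, so an engineer would only need to name the settings they want to change.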

In [1]:
import xgboost as xgb
import numpy as np
import pandas as pd

This notebook does not include the command-line interface. Working in a notebook also makes issues easier to debug than working in the Python console.  
The following class was developed by a Kaggler who needed out-of-fold predictions from a cross-validation run of XGBoost. We need the same thing. In R, this is as simple as passing prediction = TRUE to the xgb.cv call. The Python xgb.cv has no equivalent flag, so this class fills the gap and makes our lives a bit easier.
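To make "out-of-fold predictions" concrete: each row is predicted by a model that never saw that row during training. The sketch below illustrates the idea with NumPy only, using a stand-in "model" that simply predicts the mean training label. The helper name `toy_oof_predictions` and the toy labels are assumptions for illustration, not XGBoost code.

```python
import numpy as np


def toy_oof_predictions(labels, n_folds, seed=0):
    """Return one out-of-fold prediction per row using a mean-label 'model'."""
    labels = np.asarray(labels, dtype=float)
    rng = np.random.default_rng(seed)
    # Shuffle the row indices, then split them into n_folds held-out groups.
    folds = np.array_split(rng.permutation(len(labels)), n_folds)
    oof = np.full(len(labels), np.nan)
    for test_idx in folds:
        train_mask = np.ones(len(labels), dtype=bool)
        train_mask[test_idx] = False
        # "Fit" on the training rows only, then predict the held-out rows.
        oof[test_idx] = labels[train_mask].mean()
    return oof


labels = [1, 1, 1, 0, 0, 0]
oof = toy_oof_predictions(labels, n_folds=3)
print(oof)
```

Every entry of `oof` comes from a fit that excluded that row, which is the property the callback below preserves for the real model.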

In [2]:
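# Note: this relies on XGBoost's legacy function-style callback interface
# (a callable that receives a callback environment); newer releases expect
# subclasses of xgb.callback.TrainingCallback instead.
class OOFCallback:
    def __init__(self, oof_preds_dict, maximize=False):
        """
        :param dict oof_preds_dict: Should be an empty dict which can later be
            retrieved.
        :param bool maximize: If True, higher metric scores are treated as
            better. Defaults to False because this notebook evaluates with
            logloss, where lower is better.
        """
        self.best_eval_metric = None
        self.oof_preds_dict = oof_preds_dict
        self.maximize = maximize

    def __call__(self, cbenv):
        # Entry 1 is the test-set result; its second element is the mean score.
        current_val_score = cbenv.evaluation_result_list[1][1]
        if self.best_eval_metric is None:
            self.best_eval_metric = current_val_score
        # Refresh the stored predictions whenever the score ties or beats
        # the best score seen so far.
        if self.maximize:
            improved = current_val_score >= self.best_eval_metric
        else:
            improved = current_val_score <= self.best_eval_metric
        if improved:
            self.best_eval_metric = current_val_score
            self._compute_oof_preds(cbenv.cvfolds)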

    def _compute_oof_preds(self, cvfolds):
        for i, fold in enumerate(cvfolds):
            self.oof_preds_dict[i] = fold.bst.predict(fold.dtest)


The next function keeps all the parameters in one place, so anyone who needs to tweak them can find them easily.

In [11]:
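def get_params():
    """Return the XGBoost parameters as a list of (name, value) pairs."""
    params = {}
    params["objective"] = "binary:logistic"  # binary classification, outputs probabilities
    params["eta"] = 0.1  # learning rate
    params["subsample"] = 0.7  # fraction of rows sampled per tree
    params["colsample_bytree"] = 0.7  # fraction of columns sampled per tree
    params["silent"] = 1  # suppress log messages; newer XGBoost releases use "verbosity" instead
    params["max_depth"] = 5  # maximum tree depth
    params["eval_metric"] = "logloss"  # lower is better
    plst = list(params.items())

    return plst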
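
Because `get_params` returns a list of `(name, value)` pairs, one way to tweak a setting without editing the function is to convert the list to a dict, apply the overrides, and convert back. This is only a sketch: `with_overrides` is a hypothetical helper, and `default_params` is a stand-in that copies a few of the values above.

```python
def with_overrides(param_list, **overrides):
    """Return a new parameter list with the given settings replaced."""
    merged = dict(param_list)  # copy, so the defaults stay untouched
    merged.update(overrides)
    return list(merged.items())


# Stand-in for get_params(), limited to three of the notebook's settings.
default_params = [("objective", "binary:logistic"), ("eta", 0.1), ("max_depth", 5)]
tuned = with_overrides(default_params, eta=0.05, max_depth=3)
print(tuned)
```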

Now we can read in the data and start training the model. The open problem is that we don't yet know how to convert the out-of-fold predictions into a suitable pandas DataFrame, which is why this notebook was created.

In [4]:
data = pd.read_csv('data/training_data.csv')

In [5]:
y_col = 'label'

In [6]:
data = data.drop("text", axis = 1)

In [7]:
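data.head  # no call parentheses, so this displays the bound method and the full frame repr; data.head() would show the first five rows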

<bound method NDFrame.head of      label      dim0      dim1      dim2      dim3      dim4      dim5  \
0        1 -0.064990 -0.063133 -0.023568 -0.021220 -0.020574  0.048391   
1        1 -0.002903 -0.033901  0.058160  0.030840  0.002759 -0.021980   
2        1  0.032543  0.000181  0.003660  0.003906 -0.078140 -0.007952   
3        1  0.027940  0.059848  0.059421 -0.054741 -0.008796 -0.012643   
4        1  0.023654 -0.000891  0.004695 -0.031286 -0.023597  0.057335   
..     ...       ...       ...       ...       ...       ...       ...   
175      0  0.044127  0.073354  0.066113  0.007997 -0.022069 -0.032995   
176      0 -0.091458 -0.030164  0.026322 -0.043284  0.008432  0.009224   
177      0 -0.070365  0.066311 -0.037964  0.065658  0.027124  0.013010   
178      0  0.013915  0.083097  0.019869  0.078082  0.018011 -0.028788   
179      0  0.003434 -0.001303  0.036101  0.040438 -0.023454 -0.017268   

         dim6      dim7      dim8  ...    dim502    dim503    dim504  \
0   -0.04

In [15]:
y_train = data[[y_col]]
x_train = data.drop(y_col, axis=1)
xg_train = xgb.DMatrix(x_train, label=y_train)
# One fewer fold than rows, which is close to leave-one-out cross-validation
nfolds = data.shape[0] - 1
# Stop once the test metric has not improved for this many rounds
early_stopping = 10
params = get_params()
# Data structure in which to save out-of-fold preds
oof_preds_dict = {}
cv = xgb.cv(params,
            xg_train,
            5000,  # upper bound on the number of boosting rounds
            nfold=nfolds,
            early_stopping_rounds=early_stopping,
            callbacks=[OOFCallback(oof_preds_dict)],
            verbose_eval=1)

[0]	train-logloss:0.628563+0.00174728	test-logloss:0.652106+0.0667614
[1]	train-logloss:0.572512+0.00348226	test-logloss:0.610289+0.0930772
[2]	train-logloss:0.523355+0.00405242	test-logloss:0.570178+0.118668
[3]	train-logloss:0.480521+0.00439587	test-logloss:0.539503+0.140271
[4]	train-logloss:0.442365+0.00469035	test-logloss:0.510782+0.159242
[5]	train-logloss:0.408082+0.00451454	test-logloss:0.48661+0.180551
[6]	train-logloss:0.377511+0.0044712	test-logloss:0.467927+0.193128
[7]	train-logloss:0.34987+0.00419237	test-logloss:0.449664+0.207575
[8]	train-logloss:0.32486+0.00410312	test-logloss:0.434948+0.221508
[9]	train-logloss:0.302377+0.00419458	test-logloss:0.41726+0.230984
[10]	train-logloss:0.282101+0.0039296	test-logloss:0.404492+0.243886
[11]	train-logloss:0.263728+0.00380229	test-logloss:0.387927+0.245578
[12]	train-logloss:0.247112+0.00364664	test-logloss:0.374222+0.249718
[13]	train-logloss:0.231755+0.0035553	test-logloss:0.360291+0.257395
[14]	train-logloss:0.217758+0.00334

In [31]:
best_iteration = cv.shape[0]
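
The cell above uses the row count of the history returned by `xgb.cv` as the number of boosting rounds. As a cross-check, the best round can also be read off the mean test metric directly. The sketch below assumes the history is a pandas DataFrame with a column named `test-logloss-mean` (the naming pattern we would expect for this metric; verify it against `cv.columns`), and it rebuilds a small stand-in frame from the first five rounds logged above instead of using the real `cv` object.

```python
import pandas as pd

# Stand-in for the xgb.cv history, copied from rounds 0-4 of the log above.
cv_history = pd.DataFrame({
    "train-logloss-mean": [0.628563, 0.572512, 0.523355, 0.480521, 0.442365],
    "test-logloss-mean": [0.652106, 0.610289, 0.570178, 0.539503, 0.510782],
})

# Row label of the lowest mean test logloss (rounds are zero-indexed).
best_row = int(cv_history["test-logloss-mean"].idxmin())
# Number of boosting rounds needed to reach that row.
n_rounds = best_row + 1
print(best_row, n_rounds)
```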

In [32]:
oof_preds_nparray = np.asarray(oof_preds_dict)
oof_preds_nparray

array({0: array([ 233.47164917,   52.31227112,  241.68916321, ...,   44.18590164,
        271.86123657,   93.13690186], dtype=float32), 1: array([ 80.23247528,  44.13002396,  32.26177979, ...,  87.49355316,
        47.87636185,  59.62857437], dtype=float32), 2: array([  60.50736618,  102.92528534,  273.14767456, ...,  157.75537109,
         60.35734558,   79.30904388], dtype=float32), 3: array([ 116.85579681,   43.22634506,   85.74855804, ...,  106.74085236,
        116.11756897,  117.32350922], dtype=float32), 4: array([  49.69866943,   83.96920776,   67.47943878, ...,   38.35390854,
         96.20708466,  156.2109375 ], dtype=float32)}, dtype=object)

In [37]:
oof_preds_nparray.shape

()
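
The empty shape `()` arises because `np.asarray` wraps the whole dict in a single zero-dimensional object array instead of stacking the per-fold arrays. One possible way to get a DataFrame, sketched here under assumptions, is to build one small frame per fold and concatenate them in long format. The helper name `oof_dict_to_frame`, the column names, and the toy dict are hypothetical. Mapping predictions back to the original rows would additionally require each fold's test indices, which this sketch does not attempt.

```python
import numpy as np
import pandas as pd

COLUMNS = ["fold", "position_in_fold", "prediction"]


def oof_dict_to_frame(oof_preds_dict):
    """Stack {fold_id: predictions} into one long-format DataFrame."""
    frames = []
    for fold_id, preds in sorted(oof_preds_dict.items()):
        preds = np.asarray(preds).ravel()
        frames.append(pd.DataFrame({
            "fold": fold_id,  # scalar, broadcast to every row of this fold
            "position_in_fold": np.arange(len(preds)),
            "prediction": preds,
        }))
    if not frames:
        # pd.concat rejects an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(frames, ignore_index=True)


# Toy stand-in for oof_preds_dict; folds may hold different numbers of rows.
toy = {0: np.array([0.9, 0.2], dtype=np.float32),
       1: np.array([0.4], dtype=np.float32)}
oof_frame = oof_dict_to_frame(toy)
print(oof_frame)
```

The long format copes with folds of unequal size, which a plain two-dimensional array would not.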