# Vanilla XGBoost File

The goal is a vanilla XGBoost file that software engineers can run from the terminal without touching any of the code. What follows is a black-box implementation of an XGBoost model that assumes model-ready data (no preprocessing necessary).
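This notebook does not implement the terminal interface itself. Purely as an illustration of what the engineer-facing wrapper might look like, here is a minimal `argparse` sketch. Every flag name (`--label-col`, `--early-stopping`, `--max-rounds`) and the `build_parser` helper are assumptions made for this example; only the default values are taken from the cells further down.

```python
import argparse


def build_parser():
    """Hypothetical CLI for the training script; every flag name is an assumption."""
    parser = argparse.ArgumentParser(
        description="Train a vanilla XGBoost model on model-ready data."
    )
    parser.add_argument("train_csv", help="path to a model-ready CSV file")
    parser.add_argument("--label-col", default="label",
                        help="name of the target column")
    parser.add_argument("--early-stopping", type=int, default=10,
                        help="rounds without improvement before CV stops")
    parser.add_argument("--max-rounds", type=int, default=5000,
                        help="upper bound on boosting rounds")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["data/training_data.csv", "--early-stopping", "5"])
print(args.train_csv, args.label_col, args.early_stopping, args.max_rounds)
```

In a real script the parsed values would be handed to the data-loading and training steps shown in the rest of the notebook.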

In [3]:
import xgboost as xgb
import numpy as np
import pandas as pd

This file does not include the command line interface. Working in a notebook also makes issues easier to debug than in the Python console.  
The following class was developed by a Kaggler who needed out-of-fold predictions from a cross-validation run of XGBoost. We need the same thing. In R, this only requires passing `prediction = TRUE` to the CV call. The Python CV call has no equivalent flag, so this class fills the gap.

In [4]:
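class OOFCallback:
    # Written for XGBoost's legacy function-style callbacks, which are called
    # once per boosting round with a `cbenv` environment object.
    def __init__(self, oof_preds_dict, maximize=False):
        """
        :param dict oof_preds_dict: Should be an empty dict; it is filled with
            one array of out-of-fold predictions per fold and can be read
            after the CV run.
        :param bool maximize: If True, higher metric scores are treated as
            better. Defaults to False because this notebook evaluates logloss,
            where lower is better.
        """
        self.best_eval_metric = None
        self.oof_preds_dict = oof_preds_dict
        self.maximize = maximize

    def __call__(self, cbenv):
        # Mean test metric for this round (entry 0 is train, entry 1 is test).
        current_val_score = cbenv.evaluation_result_list[1][1]
        if self.best_eval_metric is None:
            self.best_eval_metric = current_val_score
        # Compare in the direction that matches the metric.
        if self.maximize:
            improved = current_val_score >= self.best_eval_metric
        else:
            improved = current_val_score <= self.best_eval_metric
        # Refresh the stored predictions only when the round ties or beats the best.
        if improved:
            self.best_eval_metric = current_val_score
            self._compute_oof_preds(cbenv.cvfolds)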

    def _compute_oof_preds(self, cvfolds):
        for i, fold in enumerate(cvfolds):
            self.oof_preds_dict[i] = fold.bst.predict(fold.dtest)
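
The rule the callback is meant to apply is "refresh the stored predictions whenever the current round ties or beats the best score seen so far, in the direction that suits the metric". The sketch below isolates that rule without XGBoost. The helper name `is_at_least_as_good` and the score values are made up for illustration and are not part of the original class.

```python
def is_at_least_as_good(current, best, maximize):
    """Return True when `current` ties or beats `best` for the metric direction."""
    if best is None:
        return True
    return current >= best if maximize else current <= best


# Hypothetical per-round test logloss values (lower is better).
scores = [0.62, 0.55, 0.49, 0.51, 0.50]

best, best_round = None, None
for round_idx, score in enumerate(scores):
    if is_at_least_as_good(score, best, maximize=False):
        # In the callback, this is the point where predictions are recomputed.
        best, best_round = score, round_idx

print(best_round, best)
```

With a loss such as logloss the comparison has to run with `maximize=False`; with a score such as AUC it would run with `maximize=True`.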


The next function keeps all the parameters in one place, so they are easy to find if anyone needs to tweak them.

In [5]:
def get_params():

    params = {}
    params["objective"] = "binary:logistic"
    params["eta"] = 0.1
    params["subsample"] = 0.7
    params["colsample_bytree"] = 0.7
    params["silent"] = 1
    params["max_depth"] = 5
    params["eval_metric"] = "logloss"
    plst = list(params.items())

    return plst

Now we can read in the data and start training the model. The open problem is converting the out-of-fold predictions into a suitable pandas DataFrame, which is why this notebook was created.

In [9]:
data = pd.read_csv('data/training_data.csv')

In [10]:
y_col = 'label'

In [11]:
data = data.drop("text", axis = 1)

In [13]:
y_train = data[[y_col]]
x_train = data.drop(y_col, axis=1)
xg_train = xgb.DMatrix(x_train, label=y_train)
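# n_rows - 1 folds is nearly leave-one-out: each test fold holds a single row,
# except one fold that holds two.
nfolds = data.shape[0] - 1
early_stopping = 10
params = get_params()
# Dict the callback fills with out-of-fold predictions, keyed by fold index
oof_preds_dict = {}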
cv = xgb.cv(params,
            xg_train,
            5000,
            nfold=nfolds,
            early_stopping_rounds=early_stopping,
            callbacks=[OOFCallback(oof_preds_dict)],
            verbose_eval=1)

[0]	train-logloss:0.609108+0.0014829	test-logloss:0.6189+0.0465903
[1]	train-logloss:0.538151+0.00183947	test-logloss:0.551094+0.0671244
[2]	train-logloss:0.478305+0.00189866	test-logloss:0.493403+0.0823431
[3]	train-logloss:0.427237+0.00196338	test-logloss:0.446705+0.0961676
[4]	train-logloss:0.383148+0.0018957	test-logloss:0.405104+0.103458
[5]	train-logloss:0.344636+0.00182232	test-logloss:0.367+0.114153
[6]	train-logloss:0.31096+0.00174752	test-logloss:0.332357+0.117699
[7]	train-logloss:0.281377+0.00163893	test-logloss:0.303401+0.119538
[8]	train-logloss:0.25529+0.00164546	test-logloss:0.277693+0.122827
[9]	train-logloss:0.232111+0.00160728	test-logloss:0.256016+0.131852
[10]	train-logloss:0.21161+0.00154167	test-logloss:0.237307+0.140076
[11]	train-logloss:0.193339+0.00149532	test-logloss:0.219064+0.138143
[12]	train-logloss:0.176919+0.00140998	test-logloss:0.203159+0.141306
[13]	train-logloss:0.162183+0.00136531	test-logloss:0.188675+0.142443
[14]	train-logloss:0.148965+0.001335

In [16]:
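best_iteration = cv.shape[0]
# Only the training DMatrix exists here; there is no separate dtest.
evallist = [(xg_train, 'train')]

In [17]:
# Note: the keyword is num_boost_round. Passing num_round raises
# "TypeError: train() got an unexpected keyword argument 'num_round'".
# xgb.train also expects a DMatrix (xg_train), not the raw DataFrame.
# Early stopping is left out: the round count already comes from CV and
# there is no held-out set to monitor.
model = xgb.train(params,
                  xg_train,
                  num_boost_round=best_iteration,
                  evals=evallist,
                  verbose_eval=1)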

In [9]:
oof_preds_nparray = np.asarray(oof_preds_dict)
oof_preds_nparray

array({0: array([0.9573348 , 0.84200346], dtype=float32), 1: array([0.03640796], dtype=float32), 2: array([0.11962262], dtype=float32), 3: array([0.07612281], dtype=float32), 4: array([0.9792746], dtype=float32), 5: array([0.6352277], dtype=float32), 6: array([0.8184215], dtype=float32), 7: array([0.03421957], dtype=float32), 8: array([0.1702321], dtype=float32), 9: array([0.994044], dtype=float32), 10: array([0.15319782], dtype=float32), 11: array([0.09052949], dtype=float32), 12: array([0.72497666], dtype=float32), 13: array([0.98848754], dtype=float32), 14: array([0.97886294], dtype=float32), 15: array([0.47278303], dtype=float32), 16: array([0.01305894], dtype=float32), 17: array([0.82224995], dtype=float32), 18: array([0.857151], dtype=float32), 19: array([0.03602412], dtype=float32), 20: array([0.9856543], dtype=float32), 21: array([0.9770398], dtype=float32), 22: array([0.98046184], dtype=float32), 23: array([0.01544303], dtype=float32), 24: array([0.8434238], dtype=float32), 25

In [10]:
oof_preds_nparray.shape

()
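
The empty shape is expected: `np.asarray` wraps a dict as a single zero-dimensional object array instead of stacking its values. One possible way to get a DataFrame, sketched below under the assumption that the dict maps a fold index to a 1-D array of predictions (as the output above suggests), is to build one small frame per fold and concatenate them. The column names `fold` and `oof_pred` are arbitrary choices for this example, and the stand-in dict only reuses the first three folds printed above.

```python
import numpy as np
import pandas as pd

# Stand-in for the dict filled by the callback: fold index -> array of predictions.
oof_preds_dict = {
    0: np.array([0.9573348, 0.84200346], dtype=np.float32),
    1: np.array([0.03640796], dtype=np.float32),
    2: np.array([0.11962262], dtype=np.float32),
}

# One row per prediction, tagged with the fold that produced it.
oof_df = pd.concat(
    [
        pd.DataFrame({"fold": fold, "oof_pred": preds})
        for fold, preds in sorted(oof_preds_dict.items())
    ],
    ignore_index=True,
)
print(oof_df.shape)
```

The rows come out in fold order, which is not guaranteed to match the row order of the training CSV. Aligning each prediction with its original row would also need the test indices of every fold, which the callback as written does not record.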