## XGBoost Train Model

After running the dataset through cross-validation, we can start training our model.
Using sklearn, we do an 80-20 train/test split of our data. We start by importing all the packages we need.

In [24]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import pickle

The following block defines all the parameters that we'll need to run the model.

In [3]:
def get_params():

    params = {}
    params["objective"] = "binary:logistic"
    params["eta"] = 0.1
    params["subsample"] = 0.7
    params["colsample_bytree"] = 0.7
    params["verbosity"] = 0  # "silent" is deprecated in recent XGBoost releases
    params["max_depth"] = 5
    params["eval_metric"] = "logloss"
    plst = list(params.items())

    return plst

In [4]:
# Read in the training data
data = pd.read_csv('data/training_data.csv')  # PROVIDE: path to the training data

In [20]:
y_col = 'label'  # PROVIDE: name of the label column
data = data.drop("text", axis=1)  # PROVIDE: name of the raw text column, which is dropped before training

y_train = data[[y_col]]             # labels only
x_train = data.drop(y_col, axis=1)  # every remaining column is a feature
early_stopping = 10
params = get_params()
num_round = 106  # PROVIDE: optimum number of training rounds found during cross-validation

Now that we have defined our variables, we can split our data into training and testing sets.

In [21]:
xTrain, xTest, yTrain, yTest = train_test_split(x_train, y_train, test_size = 0.2, random_state = 0)
xg_train = xgb.DMatrix(xTrain, label=yTrain)
xg_test = xgb.DMatrix(xTest, label=yTest)

# Create a watchlist so the evaluation metric is printed for both sets after each round
watchlist = [(xg_test, 'eval'), (xg_train, 'train')]

In [23]:
model = xgb.train(params,
                  xg_train,
                  num_boost_round=num_round,
                  evals=watchlist,  # passed by keyword; recent releases do not accept it positionally
                  verbose_eval=1)

[0]	eval-logloss:0.629622	train-logloss:0.609736
[1]	eval-logloss:0.56519	train-logloss:0.537478
[2]	eval-logloss:0.511697	train-logloss:0.476775
[3]	eval-logloss:0.461869	train-logloss:0.4247
[4]	eval-logloss:0.423426	train-logloss:0.381331
[5]	eval-logloss:0.392358	train-logloss:0.34235
[6]	eval-logloss:0.362308	train-logloss:0.309938
[7]	eval-logloss:0.338575	train-logloss:0.280724
[8]	eval-logloss:0.312857	train-logloss:0.254895
[9]	eval-logloss:0.294335	train-logloss:0.232348
[10]	eval-logloss:0.277842	train-logloss:0.211481
[11]	eval-logloss:0.264802	train-logloss:0.193669
[12]	eval-logloss:0.251696	train-logloss:0.176952
[13]	eval-logloss:0.243955	train-logloss:0.161611
[14]	eval-logloss:0.230009	train-logloss:0.148637
[15]	eval-logloss:0.220642	train-logloss:0.136935
[16]	eval-logloss:0.209052	train-logloss:0.126761
[17]	eval-logloss:0.202986	train-logloss:0.117038
[18]	eval-logloss:0.191429	train-logloss:0.10846
[19]	eval-logloss:0.184888	train-logloss:0.100951
[20]	eval-loglo
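
The `logloss` values in the log are the mean binary cross-entropy between the true labels and the predicted probabilities. A minimal pure-Python sketch of that formula follows; the function name `binary_logloss`, the clipping constant `eps`, and the toy inputs are illustrative and not part of XGBoost.

```python
import math


def binary_logloss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in [0, 1]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Toy labels and predicted probabilities (assumption).
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]
loss = binary_logloss(labels, probs)

# A model that always predicts 0.5 scores ln(2), about 0.693.
baseline = binary_logloss([1, 0], [0.5, 0.5])
```

Lower is better: the `ln(2)` baseline is a handy reference point when reading the per-round numbers, and a widening gap between the `train` and `eval` columns is a common sign of overfitting.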

Once the training is complete, we can save the model.

In [26]:
model.save_model('imperative.model')