## Sample code: using the xgb_cv and xgb_GridSearchCV classes to combine cross-validation and grid-search hyperparameter tuning with early stopping in the Scikit-Learn API for XGBoost

### Load the training dataset and perform some basic, sample wrangling ###

In [1]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
import XGBCV  # the XGBCV module, which provides the xgb_cv and xgb_GridSearchCV classes
import warnings
warnings.filterwarnings('ignore')

train_df  = pd.read_csv('TitanicData/train.csv')

#As an example, keep only the Survived, Pclass, Sex, and Age columns (all other columns are dropped)
train_df = train_df[['Survived','Pclass','Sex','Age']]

#make male = 0, female = 1
train_df['Sex']= (train_df['Sex']=='female').astype(int)

#Create X and y frames
y = train_df['Survived']
X = train_df.drop(labels = 'Survived',axis = 1)

### Instantiate a default XGBoost model as a sample, then instantiate an xgb_cv object and call its run_cv method to perform cross-validation with customized parameters

In [2]:
#instantiate the XGBClassifier object
model = XGBClassifier(verbosity = 0)

#instantiate the xgb_cv object using the XGBClassifier object
xgbcv = XGBCV.xgb_cv(model)

#run cross-validation with 10 folds, 15 early stopping rounds, two evaluation metrics, and stratified as well as shuffled folds
#set the seed to 5 for reproducibility
eval_metrics = ['error','logloss']
xgbcv.run_cv(X, y, folds = 10, early_stopping_rounds = 15, eval_metric = eval_metrics, stratified = True, 
             shuffle = True, seed = 5)
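
To make the `stratified`, `shuffle`, and `seed` options more concrete, here is a minimal pure-Python sketch of how stratified, shuffled folds could be built. It is only an illustration: the helper `stratified_shuffled_folds` and the toy labels are hypothetical, and this is not necessarily how `XGBCV` or scikit-learn splits the data internally.

```python
import random
from collections import defaultdict


def stratified_shuffled_folds(labels, folds, seed):
    """Assign sample indices to `folds` folds with similar class ratios.

    Hypothetical helper for illustration; not part of XGBCV.
    """
    rng = random.Random(seed)  # a fixed seed makes the shuffling reproducible

    # Group the sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_indices = [[] for _ in range(folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within each class
        # Deal the shuffled indices round-robin so every fold receives
        # roughly the same share of this class.
        for position, idx in enumerate(indices):
            fold_indices[position % folds].append(idx)
    return fold_indices


# Toy labels: 60 negatives and 40 positives (made-up data, not the Titanic set).
labels = [0] * 60 + [1] * 40
cv_folds = stratified_shuffled_folds(labels, folds=10, seed=5)
for fold in cv_folds:
    # Each fold holds 10 samples, 4 of them positive, matching the 40% base rate.
    print(len(fold), sum(labels[i] for i in fold))
```

Calling the helper again with the same seed returns the same folds, which is the role the seed plays in the cell above.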

### Display some of the results of the cross-validation

In [21]:
print("Average optimal epoch determined by early stopping %.2f +/- %.2f" % 
      (np.mean(xgbcv.optimal_iter) ,np.std(xgbcv.optimal_iter)))

print("Average final epoch determined by early stopping %.2f +/- %.2f" % 
      (np.mean(xgbcv.final_iter) ,np.std(xgbcv.final_iter)))

print("\n")

for metric in eval_metrics:
    print("Average CV scores for metric: %s, (at the optimal early stopping epoch) %.2f +/- %.2f" % 
          (metric, np.mean(xgbcv.optimal_scores[metric]) ,np.std(xgbcv.optimal_scores[metric])))
    
print("\n")
    
for metric in eval_metrics:
    print("Average training scores for metric: %s, (at the optimal early stopping epoch) %.2f +/- %.2f" % 
          (metric, np.mean(xgbcv.optimal_train_scores[metric]) ,np.std(xgbcv.optimal_train_scores[metric])))

Average optimal epoch determined by early stopping 12.50 +/- 7.62
Average final epoch determined by early stopping 27.50 +/- 7.62


Average CV scores for metric: error, (at the optimal early stopping epoch) 0.19 +/- 0.04
Average CV scores for metric: logloss, (at the optimal early stopping epoch) 0.42 +/- 0.05


Average training scores for metric: error, (at the optimal early stopping epoch) 0.15 +/- 0.00
Average training scores for metric: logloss, (at the optimal early stopping epoch) 0.36 +/- 0.02
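
In the output above, the average final epoch (27.50) exceeds the average optimal epoch (12.50) by exactly 15, the value of `early_stopping_rounds`, and both have the same spread. That is consistent with the usual early stopping rule: training continues until the validation metric has failed to improve for 15 consecutive rounds, and the best round seen so far is reported as optimal. The sketch below illustrates that rule for a lower-is-better metric such as logloss. The function `early_stopping_epochs` and the score list are hypothetical, and the exact bookkeeping inside `XGBCV` may differ.

```python
def early_stopping_epochs(validation_scores, early_stopping_rounds):
    """Return (optimal_epoch, final_epoch) for a lower-is-better metric.

    Hypothetical helper for illustration; not part of XGBCV.
    """
    best_score = float("inf")
    best_epoch = 0
    for epoch, score in enumerate(validation_scores):
        if score < best_score:
            # New best validation score: remember where it occurred.
            best_score = score
            best_epoch = epoch
        elif epoch - best_epoch >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` rounds: stop here.
            return best_epoch, epoch
    # Training ran out of rounds before the stopping rule fired.
    return best_epoch, len(validation_scores) - 1


# Made-up validation logloss per boosting round.
scores = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.49]
print(early_stopping_epochs(scores, early_stopping_rounds=3))  # (2, 5)
```

Here the best score occurs at round 2 and training stops at round 5, three rounds later, mirroring the 15-round gap seen in the results.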


### Perform a grid search with cross-validation to tune hyperparameters

In [23]:
#instantiate the xgb_GridSearchCV object using the XGBClassifier object
XGBGrid = XGBCV.xgb_GridSearchCV(model)

#define a sample hyperparameter grid
param_grid = {"subsample": [0.6, 0.7, 0.8, 0.9],
              "max_depth": list(range(4,10)),
              "learning_rate": [0.001, 0.01, 0.1]}

XGBGrid.run_GridSearchCV(X, y, param_grid, folds = 10, early_stopping_rounds = 15, eval_metric = 'logloss', stratified = True,
                         shuffle = True, seed = 5)
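
To get a feel for the cost of this search, the sketch below enumerates the grid with `itertools.product`. It assumes the search is exhaustive, evaluating every combination on every fold as scikit-learn's `GridSearchCV` does; whether `xgb_GridSearchCV` iterates in exactly this way is an assumption.

```python
from itertools import product

param_grid = {"subsample": [0.6, 0.7, 0.8, 0.9],
              "max_depth": list(range(4, 10)),
              "learning_rate": [0.001, 0.01, 0.1]}

# Build one dict per combination of hyperparameter values.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

folds = 10
print(len(combinations))          # 72 combinations (4 * 6 * 3)
print(len(combinations) * folds)  # 720 cross-validation fits in total
print(combinations[0])            # first candidate in the grid
```

Under that assumption, the grid above costs 720 model fits, each subject to early stopping, so adding values to the grid increases the run time multiplicatively.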

In [24]:
print("The best parameter set is:", XGBGrid.best_params)
print("for which the best choice of n_estimators is: %i, as determined by early stopping." % np.round(np.mean(XGBGrid.best_trees)))
print('\nThe cross-validation logloss value for this set of parameters with early stopping is: %.2f' % XGBGrid.best_metric_value)


The best parameter set is: {'subsample': 0.7, 'max_depth': 7, 'learning_rate': 0.1}
for which the best choice of n_estimators is: 32, as determined by early stopping.

The cross-validation logloss value for this set of parameters with early stopping is: 0.42
