## XGBoost Predictive Model

## Load the dataset

In [2]:
# Import the pandas, numpy packages and dump from joblib
import pandas as pd
import numpy as np
from joblib import dump

In [3]:
# Load the saved sets from data/processed using numpy
X_train = np.load('../data/processed/X_train.npy')
X_val   = np.load('../data/processed/X_val.npy'  )
y_train = np.load('../data/processed/y_train.npy')
y_val   = np.load('../data/processed/y_val.npy'  )

## Train XGBoost Model

In [3]:
# Import the xgboost package as xgb
import xgboost as xgb

In [4]:
# Instantiate the XGBRegressor class into a variable called xgb_default
xgb_default = xgb.XGBRegressor()

In [5]:
# Fit the XGBoost model
xgb_default.fit(X_train, y_train)

In [6]:
# Save the fitted model with joblib (dump was imported in the first cell)
dump(xgb_default, '../models/xgb_default.joblib')

['../models/xgb_default.joblib']

In [7]:
# Calculate the predicted values for the training and validation sets
predicted_values_train = xgb_default.predict(X_train)
predicted_values_val = xgb_default.predict(X_val)


In [8]:
# Import the function print_mse from src.models.performance and display the MSE score
import sys
sys.path.insert(1, '..')
from src.models.performance import print_mse

print_mse(y_actuals=y_train, y_preds=predicted_values_train, set_name='Training')
print_mse(y_actuals=y_val, y_preds=predicted_values_val, set_name='Validation')

MSE Training: 14534.063222646791
MSE Validation: 14687.35487700591
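
The helper print_mse lives in src/models/performance.py and is not shown in this notebook. As a rough guide to what the two calls above compute, here is a minimal sketch of such a helper. The body is an assumption inferred from the call signature and the printed output, and returning the score is an illustrative choice, not necessarily what the project function does.

```python
import numpy as np


def print_mse(y_actuals, y_preds, set_name=None):
    """Hypothetical stand-in for src.models.performance.print_mse."""
    y_actuals = np.asarray(y_actuals, dtype=float)
    y_preds = np.asarray(y_preds, dtype=float)
    # Mean squared error: the average of the squared residuals
    mse = float(np.mean((y_actuals - y_preds) ** 2))
    print(f"MSE {set_name}: {mse}")
    return mse


# Toy usage with made-up numbers (not the project data)
toy_mse = print_mse(y_actuals=[3.0, 5.0, 7.0], y_preds=[2.0, 5.0, 9.0], set_name='Toy')
```

Because MSE is expressed in squared units of the target, taking its square root (RMSE) gives an error in the same units as the target, which is often easier to interpret.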


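Our default XGBoost model performs better than our baseline. The training and validation MSE are close (14,534 versus 14,687), which suggests the model is not overfitting.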

## Hyperparameter Tuning

### Manual Search

In [9]:
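# Instantiate the XGBRegressor class into a variable called xgb_manual
xgb_manual = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.02,    # same setting as the native-API alias eta
    max_depth=3,
    subsample=0.8,
    scale_pos_weight=0.2,  # intended for imbalanced classification; unlikely to be meaningful for regression
    min_child_weight=1.5,
    gamma=5)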

In [10]:
# Fit the XGBoost model
xgb_manual.fit(X_train, y_train)

In [11]:
# Save the fitted model with joblib (dump was imported in the first cell)
dump(xgb_manual, '../models/xgb_manual.joblib')

['../models/xgb_manual.joblib']

In [12]:
# Calculate the predicted values for the training and validation sets
predicted_values_train = xgb_manual.predict(X_train)
predicted_values_val = xgb_manual.predict(X_val)

In [13]:
# Display the MSE scores (print_mse was imported from src.models.performance above)
print_mse(y_actuals=y_train, y_preds=predicted_values_train, set_name='Training')
print_mse(y_actuals=y_val, y_preds=predicted_values_val, set_name='Validation')

MSE Training: 25343.70783152992
MSE Validation: 25462.745881975756


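We get worse results with xgb_manual than with xgb_default (validation MSE of 25,463 versus 14,687). A likely cause is underfitting: a learning rate of 0.02 with only 100 trees of depth 3 gives the model far less capacity than XGBoost's defaults (learning rate 0.3, depth 6). Further manual iterations were attempted with little benefit. Grid search was considered, but given the size of the dataframe, (13519999, 23), it would be too expensive in both compute and time. The business will need to dedicate greater resources to refine our ML predictive capabilities further, but we have a strong foundation to work from. We test Hyperopt next.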

### Hyperopt package

In [14]:
# Import Trials, STATUS_OK, tpe, hp, fmin from hyperopt package
from hyperopt import Trials, STATUS_OK, tpe, hp, fmin

In [15]:
# Define the search space for xgboost hyperparameters - use a smaller space to optimise computational efficiency 
space = {
    'learning_rate': hp.choice('learning_rate', [0.01, 0.02, 0.03]),
    'subsample': hp.choice('subsample', [0.7, 0.8, 0.9]),
    'colsample_bytree': hp.choice('colsample_bytree', [0.5, 0.7, 0.9]),
    'min_child_weight': hp.choice('min_child_weight', [1, 3, 5]),
    'gamma': hp.choice('gamma', [2, 3, 4]),
}

In [16]:
from sklearn.model_selection import cross_val_score

def objective(params):
    """Return the mean 10-fold cross-validated MSE for one hyperparameter sample."""
    model = xgb.XGBRegressor(  # Use XGBRegressor for regression
        max_depth=3,
        learning_rate=params['learning_rate'],
        subsample=params['subsample'],
        colsample_bytree=params['colsample_bytree'],
        min_child_weight=params['min_child_weight'],
        gamma=params['gamma'],
    )

    # cross_val_score returns the negated MSE, so flip the sign to get a loss to minimise
    mse = -cross_val_score(model, X_train, y_train, cv=10, scoring="neg_mean_squared_error").mean()

    return {'loss': mse, 'status': STATUS_OK}

In [17]:
# Launch Hyperopt search and save the result in a variable called best
# (max_evals=2 samples only two configurations, so the result is indicative at best)
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=2
)

  0%|          | 0/2 [00:00<?, ?trial/s, best loss=?]

100%|██████████| 2/2 [11:13<00:00, 336.59s/trial, best loss: 24277.530176353474]
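
The timing in the progress bar gives a rough sense of why an exhaustive grid search was ruled out. The sketch below is back-of-the-envelope arithmetic only: it assumes every trial takes about as long as the two measured here (336.59 s each, including the 10 cross-validation folds) and that the grid covers the same five parameters with three options each.

```python
import math

# Number of options per hyperparameter in the Hyperopt search space
options_per_param = {
    'learning_rate': 3,
    'subsample': 3,
    'colsample_bytree': 3,
    'min_child_weight': 3,
    'gamma': 3,
}

cv_folds = 10               # cv=10 in the objective function
seconds_per_trial = 336.59  # taken from the progress bar; assumed constant per trial

# A full grid evaluates every combination of options
n_combinations = math.prod(options_per_param.values())
total_fits = n_combinations * cv_folds
total_hours = n_combinations * seconds_per_trial / 3600

print(f"Grid combinations: {n_combinations}")
print(f"Model fits with {cv_folds}-fold CV: {total_fits}")
print(f"Estimated wall-clock time: {total_hours:.1f} hours")
```

Under these assumptions a full grid would need 243 trials, against the 2 run here, which is why a sampled search is the only practical option on this hardware.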


In [18]:
# Print out the result for the best trial
# Note: for hp.choice parameters, fmin returns the index of the chosen option, not its value
print("Best:", best)

Best: {'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 2, 'min_child_weight': 0, 'subsample': 0}
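
The values printed above are not hyperparameter values. For parameters defined with hp.choice, fmin reports the position of the selected option in its list. Hyperopt provides space_eval(space, best) to convert them back. The plain-Python sketch below shows the equivalent lookup for this particular search space; the names choices and best_values are illustrative only.

```python
# Option lists copied from the search space defined earlier
choices = {
    'learning_rate': [0.01, 0.02, 0.03],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.5, 0.7, 0.9],
    'min_child_weight': [1, 3, 5],
    'gamma': [2, 3, 4],
}

# Result returned by fmin: an index into each option list
best = {'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 2,
        'min_child_weight': 0, 'subsample': 0}

# Look up the actual value behind each index
best_values = {name: choices[name][index] for name, index in best.items()}
print(best_values)
```

Read this way, the best trial used a learning rate of 0.03, a subsample of 0.7, a colsample_bytree of 0.7, a min_child_weight of 1 and a gamma of 2.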


In [19]:
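# Instantiate the XGBRegressor class into a variable called xgboost_hyperopt
# fmin returns hp.choice indices, so map them back to the actual values first
# Note: the MSE values recorded below came from a run that passed the raw indices
from hyperopt import space_eval

best_params = space_eval(space, best)

xgboost_hyperopt = xgb.XGBRegressor(
    max_depth=3,
    learning_rate=best_params['learning_rate'],
    min_child_weight=best_params['min_child_weight'],
    subsample=best_params['subsample'],
    colsample_bytree=best_params['colsample_bytree'],
    gamma=best_params['gamma']
)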

In [20]:
# Fit the model with the training set
xgboost_hyperopt.fit(X_train, y_train)

In [21]:
# Save the fitted model with joblib (dump was imported in the first cell)
dump(xgboost_hyperopt, '../models/xgboost_hyperopt.joblib')

['../models/xgboost_hyperopt.joblib']

In [22]:
# Calculate the predicted values for the training and validation sets
predicted_values_train = xgboost_hyperopt.predict(X_train)
predicted_values_val = xgboost_hyperopt.predict(X_val)

In [23]:
# Display the MSE scores (print_mse was imported from src.models.performance above)
print_mse(y_actuals=y_train, y_preds=predicted_values_train, set_name='Training')
print_mse(y_actuals=y_val, y_preds=predicted_values_val, set_name='Validation')

MSE Training: 43055.26378892109
MSE Validation: 43163.10096706113


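The Hyperopt model performs worse than both xgb_manual and the default model (validation MSE of 43,163 versus 25,463 and 14,687 respectively). Part of the gap is likely a bug rather than a poor search: for hp.choice parameters, fmin returns the index of the selected option rather than its value, so the entries in best (for example 'learning_rate': 2 and 'subsample': 0) must be converted with space_eval(space, best) before they are passed to XGBRegressor. The scores recorded above come from a model fitted with the raw indices, which would explain why they are so far from the best cross-validated loss of 24,277.53 found during the search. Even that loss is well above the default model's MSE, and with only two evaluations the search is too short to be conclusive. Further tuning is possible but would be computationally expensive given the size of the dataframe. We will use the default model as our predictive model.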
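
To make the comparison explicit, the validation scores recorded in this notebook can be ranked side by side. The sketch below copies the three validation MSE values from the outputs above and adds the RMSE, which is in the same units as the target; the dictionary and variable names are illustrative.

```python
import math

# Validation MSE values copied from the cell outputs in this notebook
validation_mse = {
    'xgb_default': 14687.35487700591,
    'xgb_manual': 25462.745881975756,
    'xgboost_hyperopt': 43163.10096706113,
}

# Rank the models from lowest (best) to highest (worst) validation MSE
ranked = sorted(validation_mse.items(), key=lambda item: item[1])

for name, mse in ranked:
    print(f"{name:<18} MSE={mse:>10.2f}  RMSE={math.sqrt(mse):>7.2f}")

best_model = ranked[0][0]
print("Selected model:", best_model)
```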