# Finding optimal hyperparameters for XGBoost

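To find a good set of hyperparameters, I am going to use scikit-learn's RandomizedSearchCV, which samples a fixed number of candidate combinations from the search space and scores each one with cross-validation. The hyperparameters found in this notebook will later be used to train the model in the script. A more sophisticated approach would be to use Bayesian optimisation, which uses the scores of earlier trials to decide which combination to try next.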

In [4]:
import xgboost as xgb
import pandas as pd
import joblib
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold


In [5]:
df = joblib.load("../auto-insurance-fall-2017/train_folds.pkl")

In [6]:
param_space = {
    # 'eta' is XGBoost's native alias for 'learning_rate',
    # so these two lists tune the same quantity.
    'eta': [0.05, 0.01, 0.1, 0.5, 0.9],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.15, 0.2],
    'min_child_weight': [1, 5, 10, 15, 20],
    'gamma': [0.5, 1, 1.5, 2, 3, 5, 10],
    'subsample': [0.6, 0.8, 0.9, 0.95, 1.0],
    'colsample_bytree': [0.6, 0.8, 0.85, 0.9, 1.0],
    'max_depth': [3, 4, 5, 10, 12, 15, 20],
    'n_estimators': [5, 10, 20, 30, 50, 75, 100, 150],
}
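To get a feel for how much of this search space a random search actually visits, the short sketch below multiplies the number of candidate values per hyperparameter and compares the result with the 1000-iteration budget used in this notebook. The names `grid_sizes`, `total_combinations` and `coverage` are introduced here for illustration only, and the counts are copied by hand from the lists in `param_space`.

```python
from math import prod

# Number of candidate values per hyperparameter, counted from param_space.
grid_sizes = {
    "eta": 5,
    "learning_rate": 6,
    "min_child_weight": 5,
    "gamma": 7,
    "subsample": 5,
    "colsample_bytree": 5,
    "max_depth": 7,      # 3, 4, 5, 10, 12, 15, 20
    "n_estimators": 8,
}

# Size of the full grid if every combination were tried.
total_combinations = prod(grid_sizes.values())

# Sampling budget given to RandomizedSearchCV in this notebook.
n_iter = 1000
coverage = n_iter / total_combinations

print(f"{total_combinations:,} possible combinations")
print(f"{coverage:.3%} of the grid is sampled with n_iter={n_iter}")
```

Only a small fraction of the grid is evaluated, which is why the result should be read as a good configuration rather than a proven optimum.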

In [7]:
bst = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc', nthread=1, use_label_encoder=False)

In [9]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

random_search = RandomizedSearchCV(bst, param_distributions=param_space,
                                   n_iter=1000, scoring='roc_auc', n_jobs=-1,
                                   cv=skf, verbose=1,
                                   random_state=1001)

In [10]:
features = [col for col in df.columns if col not in ["target", "INDEX", "kfold"]]

In [12]:
random_search.fit(df[features], df["target"])

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   41.8s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  5.9min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  8.9min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 11.7min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 15.4min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 19.9min
[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed: 24.4min
[Parallel(n_jobs=-1)]: Done 5000 out of 5000 | elapsed: 24.4min finished


RandomizedSearchCV(cv=StratifiedKFold(n_splits=5, random_state=1001, shuffle=True),
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           eval_metric='auc', gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           mi...
                   param_distributions={'colsample_bytree': [0.6, 0.8, 0.85,
                                                             0.9, 1.0],
                                        'eta': [0.05, 0.01, 0.1, 0.5, 0.9],

In [13]:
params = random_search.best_params_

In [14]:
params

{'subsample': 0.95,
 'n_estimators': 150,
 'min_child_weight': 20,
 'max_depth': 12,
 'learning_rate': 0.05,
 'gamma': 3,
 'eta': 0.05,
 'colsample_bytree': 0.6}

In [15]:
random_search.best_score_

0.824256038928144
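
Since these hyperparameters are meant to be reused by the training script, one simple way to hand them over is to write them to a JSON file. The sketch below is only an assumption about how that hand-off could look: the file name `xgb_best_params.json` and the temporary directory are hypothetical, and the dictionary is copied from the `best_params_` output above. It relies on the values being plain Python numbers, which holds here because the search space was defined with Python lists.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the best_params_ output of the search.
best_params = {
    "subsample": 0.95,
    "n_estimators": 150,
    "min_child_weight": 20,
    "max_depth": 12,
    "learning_rate": 0.05,
    "gamma": 3,
    "eta": 0.05,
    "colsample_bytree": 0.6,
}

# Hypothetical location; a real project would use a path that the
# training script also knows about.
out_path = Path(tempfile.gettempdir()) / "xgb_best_params.json"
out_path.write_text(json.dumps(best_params, indent=2))

# In the training script: read the file back and unpack the dictionary
# into the estimator, e.g. xgb.XGBClassifier(**loaded_params).
loaded_params = json.loads(out_path.read_text())
print(loaded_params)
```

JSON keeps the file readable and easy to diff; `joblib.dump` would work as well and is already imported in this notebook.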