# Finding optimal hyperparameters for XGBoost

To find a good set of hyperparameters, I am going to use scikit-learn's RandomizedSearchCV. The hyperparameters found in this notebook will later be used to train the model in the script. A more sophisticated approach would be to use Bayesian optimisation to search for the best set of hyperparameters.

In [414]:
import xgboost as xgb
import pandas as pd
import joblib
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold


In [415]:
df = joblib.load("../auto-insurance-fall-2017/train_folds.pkl")

In [416]:
param_space = {
        # 'eta' is XGBoost's alias for 'learning_rate', so these two entries tune the same setting
        'eta': [0.05, 0.01, 0.1, 0.5, 0.9],
        'learning_rate': [0.001, 0.005, 0.01, 0.05, 0.1],
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        # 'max_depth' was listed twice; only the last list took effect, so the first one is dropped
        'max_depth': [3, 4, 5, 10, 12, 15, 20],
        'n_estimators': [5, 10, 20, 30, 50, 75, 100, 150, 200, 400]
        }

In [417]:
bst = xgb.XGBClassifier(objective='binary:logistic', eval_metric='auc', nthread=1, use_label_encoder=False)

In [418]:
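# Stratified folds keep the class balance of the target in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

# Sample 5000 combinations from param_space and score each one by cross-validated ROC AUC
random_search = RandomizedSearchCV(bst, param_distributions=param_space,
                                   n_iter=5000, scoring='roc_auc', n_jobs=-1,
                                   cv=skf, verbose=1,
                                   random_state=1001)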
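
To put `n_iter=5000` in perspective, it helps to count how many distinct combinations the search space contains and how many model fits the search will trigger. The sketch below is plain Python and only does the arithmetic; it copies the lists from `param_space` (using the seven-value `max_depth` list, which is the one in effect) and assumes the `n_iter` and `n_splits` values from the cell above.

```python
import math

# Copy of the search space used in this notebook (one list per hyperparameter).
# 'max_depth' uses the seven-value list, which is the one in effect.
param_space = {
    'eta': [0.05, 0.01, 0.1, 0.5, 0.9],
    'learning_rate': [0.001, 0.005, 0.01, 0.05, 0.1],
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5, 10, 12, 15, 20],
    'n_estimators': [5, 10, 20, 30, 50, 75, 100, 150, 200, 400],
}

n_iter = 5000   # candidates requested from RandomizedSearchCV
n_splits = 5    # folds in the StratifiedKFold splitter

# Number of distinct combinations is the product of the list lengths
grid_size = math.prod(len(values) for values in param_space.values())

# Each sampled candidate is fitted once per fold
total_fits = n_iter * n_splits

# Fraction of the full grid that the random search actually visits
coverage = n_iter / grid_size

print(f"distinct combinations: {grid_size}")
print(f"fits to run: {total_fits}")
print(f"share of grid sampled: {coverage:.1%}")
```

Because every entry in `param_space` is a list, RandomizedSearchCV samples without replacement, so the 5000 candidates are distinct combinations drawn from this grid.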

In [419]:
features = [col for col in df.columns if col not in ["target", "INDEX", "kfold"]]

In [420]:
random_search.fit(df[features], df["target"])

Fitting 5 folds for each of 5000 candidates, totalling 25000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    9.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   57.7s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 10.6min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 14.0min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 17.6min
[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed: 22.1min
[Parallel(n_jobs=-1)]: Done 6042 tasks      | elapsed: 26.6min
[Parallel(n_jobs=-1)]: Done 7192 tasks      | elapsed: 31.1min
[Parallel(n_jobs=-1)]: Done 8442 tasks      | elapsed: 36.0min
[Parallel(n_jobs=-1)]: Done 9792 tasks      | elapsed: 41.3min
[Parallel(n_jobs=-1)]: Done 11242 tasks      |

RandomizedSearchCV(cv=StratifiedKFold(n_splits=5, random_state=1001, shuffle=True),
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           eval_metric='auc', gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           mi...
                                           verbosity=None),
                   n_iter=5000, n_jobs=-1,
                   param_distributions={'colsample_bytree': [0.6, 0.8, 1.0],
                                        'eta

In [421]:
params=random_search.best_params_

In [422]:
random_search.best_score_

0.8241846210288306

In [423]:
params

{'subsample': 0.8,
 'n_estimators': 100,
 'min_child_weight': 10,
 'max_depth': 5,
 'learning_rate': 0.1,
 'gamma': 0.5,
 'eta': 0.5,
 'colsample_bytree': 0.8}
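
Since these hyperparameters are meant to be reused by the training script, one option is to write them to a small JSON file that the script can load. This is only a sketch of that hand-off, not the approach used in the project: the file name `best_params.json` and its location are hypothetical, and the dictionary literal simply copies the output shown above (in the notebook it would be `random_search.best_params_`).

```python
import json
import tempfile
from pathlib import Path

# Best parameters copied from the notebook output above
params = {
    'subsample': 0.8,
    'n_estimators': 100,
    'min_child_weight': 10,
    'max_depth': 5,
    'learning_rate': 0.1,
    'gamma': 0.5,
    'eta': 0.5,
    'colsample_bytree': 0.8,
}

# Hypothetical location; swap in a path inside the project so the script can find it
params_path = Path(tempfile.gettempdir()) / "best_params.json"

# Write the parameters as human-readable JSON
params_path.write_text(json.dumps(params, indent=2))

# What the training script would do: read the file back into a dict
loaded = json.loads(params_path.read_text())
print(loaded == params)
```

In the training script, the loaded dictionary could then be unpacked into the classifier constructor, for example `xgb.XGBClassifier(**loaded)`, alongside the fixed settings used in this notebook.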