# Parrot Prediction Courses

## Hyper-parameter tuning with grid search
This notebook shows how to configure Scikit-learn's grid search module to find the best parameters for your XGBoost model.

**What you will learn:** Finding the best hyper-parameters for your dataset

Let's begin by loading all required libraries.

In [5]:
import numpy as np

from xgboost.sklearn import XGBClassifier

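# Note: sklearn.grid_search and sklearn.cross_validation were removed in
# scikit-learn 0.20; newer releases expose GridSearchCV and StratifiedKFold
# from sklearn.model_selection (with a slightly different StratifiedKFold API).
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedKFold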

# reproducibility
seed = 123
np.random.seed(seed)

Generate a test dataset

In [4]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, n_repeated=2, random_state=seed)

Define the cross-validation strategy used for testing

In [6]:
cv = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=seed)
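The cell above uses the older `sklearn.cross_validation` interface. As a rough sketch, assuming scikit-learn 0.18 or newer, the same strategy can be expressed with `sklearn.model_selection.StratifiedKFold`, which takes `n_splits` instead of `n_folds` and receives the labels through `split()` rather than the constructor. The variable names `fold_sizes` and `fold_positive_rates` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

seed = 123
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_redundant=3, n_repeated=2, random_state=seed)

# Newer API: the splitter is configured first, labels are passed to split()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

fold_sizes = []
fold_positive_rates = []
for train_idx, test_idx in cv.split(X, y):
    fold_sizes.append(len(test_idx))
    # share of class 1 in this test fold; stratification keeps it close to y.mean()
    fold_positive_rates.append(y[test_idx].mean())

print(fold_sizes)
print(np.round(fold_positive_rates, 3))
```

Each test fold should hold roughly a tenth of the samples, with a class balance close to that of the full dataset.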

Define a dictionary holding possible parameter values.

In [41]:
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': [0.1, 0.5, 1.0]
}
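To see how such a dictionary expands into individual candidates, one option (assuming scikit-learn 0.18 or newer) is `ParameterGrid` from `sklearn.model_selection`, which yields one dictionary per combination. This is only an illustration of the expansion; the notebook itself passes the dictionary directly to the grid search.

```python
from sklearn.model_selection import ParameterGrid

params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': [0.1, 0.5, 1.0]
}

# Every combination of the listed values becomes one candidate setting
grid = ParameterGrid(params_grid)
print(len(grid))

# Peek at a few candidates
for candidate in list(grid)[:3]:
    print(candidate)
```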

Define a second dictionary with the parameters whose values stay fixed.

In [42]:
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}

In [43]:
bst_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy'
)

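Before running the calculations, notice that $3 \times 4 \times 3 \times 10 = 360$ models will be trained to test all combinations: 3 values of `max_depth`, 4 of `n_estimators`, and 3 of `learning_rate` give 36 candidates, each evaluated on 10 folds.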
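The count can be checked with a few lines of plain Python. The sketch below multiplies the number of values per parameter by the number of folds; it also adds one extra fit, on the assumption that `refit=True` (the `GridSearchCV` default) retrains the best candidate on the full dataset afterwards. The variable names are illustrative.

```python
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': [0.1, 0.5, 1.0]
}
n_folds = 10

# Number of candidates = product of the number of values for each parameter
n_candidates = 1
for values in params_grid.values():
    n_candidates *= len(values)

n_cv_fits = n_candidates * n_folds   # models trained during cross-validation
n_total_fits = n_cv_fits + 1         # plus one final refit of the best candidate

print(n_candidates)   # 36
print(n_cv_fits)      # 360
print(n_total_fits)   # 361
```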

In [44]:
bst_grid.fit(X, y)

GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[1 0 ..., 0 1], n_folds=10, shuffle=True, random_state=123),
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=1, subsample=1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'learning_rate': [0.1, 0.5, 1.0], 'n_estimators': [5, 10, 25, 50], 'max_depth': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

Show all obtained scores

In [45]:
bst_grid.grid_scores_

[mean: 0.74800, std: 0.03682, params: {'learning_rate': 0.1, 'n_estimators': 5, 'max_depth': 1},
 mean: 0.78200, std: 0.02960, params: {'learning_rate': 0.1, 'n_estimators': 10, 'max_depth': 1},
 mean: 0.83300, std: 0.03494, params: {'learning_rate': 0.1, 'n_estimators': 25, 'max_depth': 1},
 mean: 0.86900, std: 0.02982, params: {'learning_rate': 0.1, 'n_estimators': 50, 'max_depth': 1},
 mean: 0.80600, std: 0.02691, params: {'learning_rate': 0.1, 'n_estimators': 5, 'max_depth': 2},
 mean: 0.84300, std: 0.02052, params: {'learning_rate': 0.1, 'n_estimators': 10, 'max_depth': 2},
 mean: 0.88900, std: 0.02587, params: {'learning_rate': 0.1, 'n_estimators': 25, 'max_depth': 2},
 mean: 0.91000, std: 0.03098, params: {'learning_rate': 0.1, 'n_estimators': 50, 'max_depth': 2},
 mean: 0.86000, std: 0.02793, params: {'learning_rate': 0.1, 'n_estimators': 5, 'max_depth': 3},
 mean: 0.88000, std: 0.01265, params: {'learning_rate': 0.1, 'n_estimators': 10, 'max_depth': 3},
 mean: 0.90500, std: 0.
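
The `grid_scores_` attribute was removed in scikit-learn 0.20; newer releases report the same information in `cv_results_`. The sketch below shows one way to print a comparable listing. To keep it dependent on scikit-learn only and quick to run, it makes several assumptions that differ from the notebook: `GradientBoostingClassifier` stands in for `XGBClassifier` (it accepts the same three parameter names), the grid is reduced (`small_grid`), and only 3 folds are used. Its scores are therefore not comparable with the output above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

seed = 123
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_redundant=3, n_repeated=2, random_state=seed)

# Reduced grid (assumption, chosen only to keep the sketch fast)
small_grid = {
    'max_depth': [1, 3],
    'n_estimators': [5, 25],
    'learning_rate': [0.1, 0.5]
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=seed),  # stand-in estimator
    param_grid=small_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=seed),
    scoring='accuracy'
)
search.fit(X, y)

# cv_results_ is a dict of arrays with one entry per candidate
results = search.cv_results_
rows = sorted(
    zip(results['mean_test_score'], results['std_test_score'], results['params']),
    key=lambda row: row[0],
    reverse=True
)
for mean, std, params in rows:
    print("mean: {:.5f}, std: {:.5f}, params: {}".format(mean, std, params))
```

Sorting by mean score puts the best candidate first, so the top row should match `search.best_score_` and `search.best_params_`.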

Show the best combination

In [46]:
print("Best accuracy obtained: {0}".format(bst_grid.best_score_))
print("Parameters:")
for key, value in bst_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Best accuracy obtained: 0.931
Parameters:
	learning_rate: 0.5
	n_estimators: 50
	max_depth: 3
