# Parrot Prediction Courses

## Hyper-parameter tuning

As you know, a model has plenty of tunable parameters, and each setting produces a different result. The question is which combination gives the best one.

The following notebook shows how to configure Scikit-learn modules to find the best parameters for your models.

**What you will learn:**
- finding the best hyper-parameters using grid-search,
- finding the best hyper-parameters using randomized grid-search.

### Prepare data
Let's begin by loading all required libraries in one place and setting a seed for reproducibility.

In [17]:
import numpy as np

from xgboost.sklearn import XGBClassifier

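# Note: these module paths match the scikit-learn version used in this notebook.
# Since scikit-learn 0.18 the same classes are provided by sklearn.model_selection.
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedKFold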

from scipy.stats import randint, uniform

# reproducibility
seed = 123
np.random.seed(seed)

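Generate an artificial dataset with 1000 samples and 20 features: 8 informative, 3 redundant and 2 repeated, leaving the remaining 7 as noise.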

In [2]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, n_repeated=2, random_state=seed)

Define the cross-validation strategy for testing. Let's use `StratifiedKFold`, which keeps the proportion of each target label roughly the same in every fold.

In [3]:
cv = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=seed)
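To make the idea of stratification concrete, here is a minimal hand-rolled sketch. It is not scikit-learn's implementation: the helper `stratified_folds` and the toy labels are made up for illustration. It groups sample indices by class, shuffles each group, and deals them round-robin over the folds.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds, seed=123):
    """Assign each sample index to a fold so class ratios stay roughly equal.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy labels (assumed): 70 samples of class 0 and 30 samples of class 1.
labels = [0] * 70 + [1] * 30
folds = stratified_folds(labels, n_folds=10)

# Every fold keeps the 70/30 ratio of the full label list.
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

With a plain, unstratified split the class counts per fold would vary by chance, which makes the per-fold accuracies harder to compare.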

### Grid-Search
In grid-search we start by defining a dictionary holding the possible parameter values we want to test. All possible combinations will be evaluated.

In [4]:
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}

Add a dictionary for fixed parameters.

In [5]:
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}

Create a `GridSearchCV` estimator. We will be looking for the combination giving the best accuracy.

In [6]:
bst_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy'
)

Before running the calculations, notice that $3 \times 4 \times 3 \times 10 = 360$ models will be fitted to test all combinations: 36 parameter combinations, each evaluated on 10 folds. You should always have a rough estimate of how much work is about to happen.
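The count can also be checked programmatically. The sketch below uses only the standard library and writes out the three learning rates that `np.linspace(1e-16, 1, 3)` produces; the variable names `grid`, `combinations` and `n_fits` are chosen here for illustration.

```python
from itertools import product

# Same candidate values as params_grid; the learning rates are the three
# points produced by np.linspace(1e-16, 1, 3).
grid = {
    "max_depth": [1, 2, 3],
    "n_estimators": [5, 10, 25, 50],
    "learning_rate": [1e-16, 0.5, 1.0],
}
n_folds = 10

# Cartesian product of all candidate values: one tuple per combination.
combinations = list(product(*grid.values()))
n_fits = len(combinations) * n_folds

print(len(combinations))  # 36 parameter combinations
print(n_fits)             # 360 cross-validation fits
```

Since `refit=True` is the default, `GridSearchCV` additionally fits the best combination once more on the full dataset after the search.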

In [7]:
bst_grid.fit(X, y)

GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[1 0 ..., 0 1], n_folds=10, shuffle=True, random_state=123),
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=1, subsample=1),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'learning_rate': array([  1.00000e-16,   5.00000e-01,   1.00000e+00]), 'n_estimators': [5, 10, 25, 50], 'max_depth': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)
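The cells above use the scikit-learn API that was current when this notebook was written. As a rough sketch of the same kind of search in later releases (0.18 onward), where these classes live in `sklearn.model_selection`, something like the following should work. Several things here are assumptions made for illustration: `GradientBoostingClassifier` stands in for `XGBClassifier` so that the block depends only on scikit-learn, and the dataset, the number of folds and the grid are shrunk to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

seed = 123

# Smaller dataset than in the notebook (assumption, to keep the run short).
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_redundant=3, n_repeated=2, random_state=seed)

# In the newer API the splitter does not receive y in its constructor,
# and n_folds is called n_splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

# Reduced grid (assumption): 2 x 2 = 4 combinations.
params_grid = {"max_depth": [1, 2], "n_estimators": [5, 10]}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=seed),  # stand-in estimator
    param_grid=params_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

# cv_results_ takes the place of grid_scores_ in the newer API.
results = search.cv_results_
for params, mean in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean, 3))

print(search.best_params_, search.best_score_)
```

The overall workflow is unchanged: define the folds, define the grid, fit, then inspect the scores.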

Now we can look at all the obtained scores and try to see manually what matters and what does not. At a quick glance, a larger `n_estimators` seems to give higher accuracy.

In [8]:
bst_grid.grid_scores_

[mean: 0.50000, std: 0.00000, params: {'n_estimators': 5, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 10, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 25, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 50, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 5, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 10, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 25, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 50, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.50000, std: 0.00000, params: {'n_estimators': 5, 'learnin

If there are many results, we can sort or filter them manually, or simply ask for the best combination:

In [9]:
print("Best accuracy obtained: {0}".format(bst_grid.best_score_))
print("Parameters:")
for key, value in bst_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Best accuracy obtained: 0.931
Parameters:
	n_estimators: 50
	learning_rate: 0.5
	max_depth: 3


Looking for the best parameters is an iterative process. You should start with a coarse grid and then move to more fine-grained values around the most promising region.
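One possible way to do the second step is sketched below. The helper `refine` is hypothetical and not part of scikit-learn: it builds a finer list of candidates around the best value of the coarse grid, spanning half the distance to each neighbour. The example refines the learning rate around 0.5, the best value found above.

```python
def refine(values, best, num=5):
    """Return `num` evenly spaced candidates around `best` (num >= 2).

    Hypothetical helper: the new range spans half the distance from
    `best` to each of its neighbours in the sorted coarse grid `values`.
    """
    ordered = sorted(values)
    i = ordered.index(best)
    # At the edge of the grid there is no neighbour, so stay at `best`.
    lower = ordered[i - 1] if i > 0 else best
    upper = ordered[i + 1] if i < len(ordered) - 1 else best
    start = best - (best - lower) / 2
    stop = best + (upper - best) / 2
    step = (stop - start) / (num - 1)
    return [start + k * step for k in range(num)]


coarse = [1e-16, 0.5, 1.0]        # learning rates from the first search
fine = refine(coarse, best=0.5)   # roughly 0.25, 0.375, 0.5, 0.625, 0.75
print(fine)
```

The refined list can then replace the `learning_rate` entry of the grid for the next search.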

### Randomized Grid-Search
When the number of parameters and their candidate values grows, the traditional grid-search approach quickly becomes impractical, because the number of combinations is the product of the number of values per parameter. A possible solution is to sample parameter values randomly from specified distributions. While this is not an exhaustive search, it is worth a try.

See the scikit-learn example: http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html#example-model-selection-randomized-search-py

Create a parameter distribution dictionary:

In [56]:
params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001), # discrete uniform distribution over 1..1000
    'learning_rate': uniform(), # continuous uniform distribution on [0, 1]
    'subsample': uniform(), # continuous uniform distribution on [0, 1]
    'colsample_bytree': uniform() # continuous uniform distribution on [0, 1]
}

In [57]:
rs_grid = RandomizedSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_distributions=params_dist_grid,
    n_iter=10,
    cv=cv,
    random_state=seed
)
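Conceptually, the randomized search draws `n_iter` candidates from the distributions and cross-validates each of them. The sketch below mimics the sampling step with the standard library only. It is not scikit-learn's sampler, and the helper `sample_params` is hypothetical.

```python
import random

rng = random.Random(123)


def sample_params(rng):
    """Draw one candidate, mirroring the shape of params_dist_grid."""
    return {
        "max_depth": rng.choice([1, 2, 3, 4]),  # list: pick one value uniformly
        "gamma": rng.choice([0, 0.5, 1]),
        "n_estimators": rng.randint(1, 1000),   # integers 1..1000, both inclusive
        "learning_rate": rng.random(),          # float in [0, 1)
        "subsample": rng.random(),
        "colsample_bytree": rng.random(),
    }


n_iter = 10
n_folds = 10
candidates = [sample_params(rng) for _ in range(n_iter)]
for params in candidates:
    print(params)

# 10 sampled candidates x 10 folds = 100 cross-validation fits,
# regardless of how many values each parameter could take.
n_fits = n_iter * n_folds
print(n_fits)
```

The cost is therefore controlled by `n_iter` rather than by the size of the search space.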

In [58]:
rs_grid.fit(X, y)

RandomizedSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[1 0 ..., 0 1], n_folds=10, shuffle=True, random_state=123),
          error_score='raise',
          estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=1, subsample=1),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f3e42db9ef0>, 'max_depth': [1, 2, 3, 4], 'colsample_bytree': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f3e42d6b6a0>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f3e42db9ba8>, 'gamma': [0, 0.5, 1], 'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f3e42d6b43

In [59]:
rs_grid.grid_scores_

[mean: 0.88100, std: 0.03113, params: {'learning_rate': 0.5567851923942887, 'n_estimators': 87, 'max_depth': 2, 'colsample_bytree': 0.5944318794450425, 'gamma': 1, 'subsample': 0.28079319859348617},
 mean: 0.92500, std: 0.01628, params: {'learning_rate': 0.10854229568516027, 'n_estimators': 434, 'max_depth': 3, 'colsample_bytree': 0.7638366944754833, 'gamma': 1, 'subsample': 0.6919702955318197},
 mean: 0.92200, std: 0.02227, params: {'learning_rate': 0.3889505741231446, 'n_estimators': 927, 'max_depth': 3, 'colsample_bytree': 0.5543832497177721, 'gamma': 0, 'subsample': 0.8549728158240765},
 mean: 0.90300, std: 0.02147, params: {'learning_rate': 0.3929444107680876, 'n_estimators': 767, 'max_depth': 2, 'colsample_bytree': 0.7113918018147167, 'gamma': 1, 'subsample': 0.30476807341109746},
 mean: 0.50000, std: 0.00000, params: {'learning_rate': 0.7049588304513622, 'n_estimators': 89, 'max_depth': 4, 'colsample_bytree': 0.398185681917981, 'gamma': 1, 'subsample': 0.00413464274939046},
 mea
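A list like this is easier to read once it is ranked. The sketch below copies the five rows visible above into plain tuples, with a subset of the parameters and values rounded for readability, and then sorts and filters them. The names `scores`, `ranked` and `good` are chosen here for illustration.

```python
# (mean accuracy, std, parameters) for the five rows visible above;
# parameter values are rounded and only a subset is kept.
scores = [
    (0.881, 0.03113, {"learning_rate": 0.557, "n_estimators": 87, "max_depth": 2, "subsample": 0.281}),
    (0.925, 0.01628, {"learning_rate": 0.109, "n_estimators": 434, "max_depth": 3, "subsample": 0.692}),
    (0.922, 0.02227, {"learning_rate": 0.389, "n_estimators": 927, "max_depth": 3, "subsample": 0.855}),
    (0.903, 0.02147, {"learning_rate": 0.393, "n_estimators": 767, "max_depth": 2, "subsample": 0.305}),
    (0.500, 0.00000, {"learning_rate": 0.705, "n_estimators": 89, "max_depth": 4, "subsample": 0.004}),
]

# Rank by mean accuracy, best first.
ranked = sorted(scores, key=lambda row: row[0], reverse=True)

# Keep only candidates above an accuracy threshold (0.9 is an arbitrary choice).
good = [row for row in ranked if row[0] > 0.9]

for mean, std, params in ranked:
    print("{:.3f} +/- {:.3f}  {}".format(mean, std, params))
```

The worst row is the one whose sampled `subsample` is close to zero, which suggests that drawing this parameter from the full [0, 1] range may be too wide.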

In [60]:
rs_grid.best_estimator_

XGBClassifier(base_score=0.5, colsample_bylevel=1,
       colsample_bytree=0.6380225178789184, gamma=0.5,
       learning_rate=0.0576480149714419, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=933, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=123, silent=1,
       subsample=0.6917017987001771)

In [61]:
rs_grid.best_params_

{'colsample_bytree': 0.6380225178789184,
 'gamma': 0.5,
 'learning_rate': 0.0576480149714419,
 'max_depth': 3,
 'n_estimators': 933,
 'subsample': 0.6917017987001771}

In [62]:
rs_grid.best_score_

0.93200000000000005