<img style="width:100%" src="../images/practical_xgboost_in_python_notebook_header.png" />

# Hyper-parameter tuning

As you know, there are plenty of tunable parameters, and each setting leads to a different model. The question is which combination gives the best results.

This notebook shows how to use Scikit-learn modules to find the best parameters for your models.

**What's included:**
- <a href="#data">data preparation</a>,
- <a href="#grid">finding best hyper-parameters using grid-search</a>,
- <a href="#rgrid">finding best hyper-parameters using randomized grid-search</a>

### Prepare data<a name='data' />
Let's begin by loading all required libraries in one place and setting a seed for reproducibility.

In [None]:
import numpy as np

from xgboost.sklearn import XGBClassifier

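# NOTE: the `grid_search` and `cross_validation` modules were removed in scikit-learn 0.20;
# newer releases expose these classes from `sklearn.model_selection` (with a different `StratifiedKFold` signature).
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedKFold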

from scipy.stats import randint, uniform

# reproducibility
seed = 123
np.random.seed(seed)

Generate an artificial dataset:

In [None]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, n_repeated=2, random_state=seed)

Define the cross-validation strategy for testing. Let's use `StratifiedKFold`, which keeps the proportion of each target class approximately the same in every fold.

In [None]:
cv = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=seed)
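
To make the idea of stratification concrete, here is a minimal pure-Python sketch. It is not how Scikit-learn implements `StratifiedKFold` (there is no shuffling, for instance); the helper `stratified_folds` and the toy labels are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so every fold keeps roughly the same class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy labels (illustrative only): 70 samples of class 0 and 30 of class 1.
toy_labels = [0] * 70 + [1] * 30
toy_folds = stratified_folds(toy_labels, n_folds=10)

for fold_id, fold in enumerate(toy_folds):
    counts = Counter(toy_labels[i] for i in fold)
    print(fold_id, dict(counts))
```

With these toy labels every fold ends up with 7 samples of class 0 and 3 of class 1, so each fold mirrors the 70/30 split of the whole set.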

### Grid-Search<a name='grid' />
In grid-search we start by defining a dictionary holding the parameter values we want to test. All possible combinations will be evaluated.

In [None]:
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}

Add a dictionary for fixed parameters.

In [None]:
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1
}

Create a `GridSearchCV` estimator. We will be looking for the combination giving the best accuracy.

In [None]:
bst_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy'
)

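Before running the calculations, notice that $3 \times 4 \times 3 \times 10 = 360$ models will be created to test all combinations (3 values of `max_depth`, 4 of `n_estimators`, 3 of `learning_rate`, and 10 folds). You should always have a rough estimate of how much work is about to happen.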
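
The count can be double-checked by enumerating the grid by hand. The sketch below is only an illustration of the arithmetic; the names `grid`, `combinations` and `n_fits` are introduced here, and the learning rates are written out as approximate literals instead of calling `np.linspace`.

```python
from itertools import product

# Same grid shape as in the notebook; np.linspace(1e-16, 1, 3) gives
# approximately 1e-16, 0.5 and 1.0, written out by hand here.
grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': [1e-16, 0.5, 1.0],
}
n_folds = 10

# Build every combination of parameter values (the Cartesian product).
names = sorted(grid)
combinations = [dict(zip(names, values))
                for values in product(*(grid[name] for name in names))]

# Each combination is trained once per cross-validation fold.
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)
```

This prints `36 360`. Note that with the default `refit=True`, `GridSearchCV` additionally trains one final model on the whole dataset using the best combination.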

In [None]:
bst_grid.fit(X, y)

Now we can look at all obtained scores and try to see manually what matters and what doesn't. At a quick glance, the larger `n_estimators` is, the higher the accuracy.

In [None]:
bst_grid.grid_scores_

If there are many results, we can sort or filter them manually, or simply get the best combination:

In [None]:
print("Best accuracy obtained: {0}".format(bst_grid.best_score_))
print("Parameters:")
for key, value in bst_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Looking for the best parameters is an iterative process. You should start with a coarse granularity and move to more detailed values.
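
One possible way to go from a coarse grid to a finer one is to zoom in around the best value found so far. The helper `refine_grid` below is a hypothetical sketch, not part of Scikit-learn or XGBoost, and the scenario (a coarse search that picked a learning rate of 0.5) is assumed purely for illustration.

```python
def refine_grid(best_value, coarse_step, n_points=5, lower=None, upper=None):
    """Return n_points (>= 2) evenly spaced values spanning one coarse step on each side of best_value."""
    start = best_value - coarse_step
    stop = best_value + coarse_step
    # Optionally clip the window to the valid range of the parameter.
    if lower is not None:
        start = max(start, lower)
    if upper is not None:
        stop = min(stop, upper)
    step = (stop - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


# Hypothetical scenario: learning rates spaced 0.5 apart were tried and 0.5 did best.
fine_learning_rates = refine_grid(0.5, coarse_step=0.5, n_points=5, lower=0.0, upper=1.0)
print(fine_learning_rates)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

The returned list could then be used as the new `learning_rate` entry of the parameter grid for the next search round.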

### Randomized Grid-Search<a name='rgrid' />
When the number of parameters and their candidate values gets large, the traditional grid-search approach quickly becomes inefficient. A possible solution is to randomly sample parameter values from their distributions. While this is not an exhaustive search, it's worth giving a shot.

Create a parameter distribution dictionary:

In [None]:
params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001), # uniform discrete random distribution over 1..1000
    'learning_rate': uniform(), # continuous uniform distribution on [0, 1]
    'subsample': uniform(), # continuous uniform distribution on [0, 1]
    'colsample_bytree': uniform() # continuous uniform distribution on [0, 1]
}
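
Conceptually, a randomized search draws one value per parameter for each candidate: list entries are picked uniformly at random, and distribution entries are sampled. The sketch below mimics that idea with the standard library only. It is an assumption-laden illustration, not Scikit-learn's implementation: `sample_candidates` and `toy_space` are hypothetical, and plain callables stand in for the scipy distributions.

```python
import random


def sample_candidates(space, n_iter, seed):
    """Draw n_iter parameter combinations: lists are sampled uniformly, callables are called with the RNG."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_iter):
        candidate = {}
        for name in sorted(space):  # fixed order keeps the draws reproducible
            spec = space[name]
            candidate[name] = spec(rng) if callable(spec) else rng.choice(spec)
        candidates.append(candidate)
    return candidates


# Hypothetical stand-ins for the scipy distributions used in the notebook.
toy_space = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': lambda rng: rng.randint(1, 1000),  # integers 1..1000, both ends included
    'learning_rate': lambda rng: rng.random(),         # uniform on [0, 1)
}

toy_candidates = sample_candidates(toy_space, n_iter=10, seed=123)
for candidate in toy_candidates:
    print(candidate)
```

Only `n_iter` candidates are produced no matter how large the search space is, which is what keeps the cost of a randomized search under control.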

Initialize `RandomizedSearchCV` to randomly pick 10 combinations of parameters. With this approach you can easily control the number of tested models.

In [None]:
rs_grid = RandomizedSearchCV(
    estimator=XGBClassifier(**params_fixed, seed=seed),
    param_distributions=params_dist_grid,
    n_iter=10,
    cv=cv,
    scoring='accuracy',
    random_state=seed
)

In [None]:
rs_grid.fit(X, y)

Take a look at the chosen parameters and their accuracy scores:

In [None]:
rs_grid.grid_scores_

There are also some handy attributes for quickly inspecting the best estimator, its parameters, and the obtained score:

In [None]:
rs_grid.best_estimator_

In [None]:
rs_grid.best_params_

In [None]:
rs_grid.best_score_