<center>
<img src="../../img/ml_theme.png">
# Minor "Data Mining"
# Course "Modern Machine Learning Methods"
<img src="../../img/faculty_logo.jpg" height="240" width="240">
## Author: lecturer at the HSE University Faculty of Computer Science <br> Yury Kashnitsky
</center>
This material is distributed under the <a href="http://www.microsoft.com/en-us/openness/default.aspx#Ms-RL">Ms-RL</a> license. It may be used for any purpose, provided that the course author and affiliation are credited.

# <center>Lesson 7. Advanced Classification and Regression Methods</center>
## <center>Part 7. Tuning Xgboost Hyperparameters. The Hyperopt Library</center>

In [33]:
import numpy as np

from xgboost.sklearn import XGBClassifier
import xgboost as xgb
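# Note: sklearn.grid_search and sklearn.cross_validation were removed in
# scikit-learn 0.20; newer versions provide these names in sklearn.model_selection.
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.cross_validation import StratifiedKFold, train_test_split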
from sklearn.metrics import log_loss
from hyperopt import fmin, hp, tpe, STATUS_OK, Trials

from scipy.stats import randint, uniform

**Generate synthetic data.**

In [3]:
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, n_redundant=3, 
                           n_repeated=2, random_state=42)

**We will use 10-fold stratified cross-validation.**

In [4]:
cv = StratifiedKFold(y, n_folds=10, shuffle=True, random_state=42)
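
In scikit-learn 0.20 and later the splitter is imported from `sklearn.model_selection` and receives the labels in `split()` rather than in the constructor. The following is a minimal sketch of the same 10-fold stratified split under that newer API, assuming a recent scikit-learn version; the names `cv_new`, `folds`, and `fold_ratios` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Same synthetic data as above.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_redundant=3, n_repeated=2, random_state=42)

# The newer splitter is configured without labels; they are passed to split().
cv_new = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds = list(cv_new.split(X, y))

# Stratification keeps the class ratio of every test fold close to the overall ratio.
overall_ratio = y.mean()
fold_ratios = [y[test_idx].mean() for _, test_idx in folds]
print("Overall share of class 1: {:.3f}".format(overall_ratio))
print("Per-fold share of class 1:", np.round(fold_ratios, 3))
```

With the newer API, the `cv_new` object itself (not the list of folds) can be passed as the `cv` argument of a search object.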

### Grid-Search

In [5]:
params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}
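
Grid search fits a model for every combination in the grid and for every fold, so its cost grows multiplicatively with each added parameter. A small sketch that counts the work implied by this grid, assuming `ParameterGrid` from `sklearn.model_selection` in a recent scikit-learn version (the variables `n_folds` and `n_fits` are illustrative):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

params_grid = {
    'max_depth': [1, 2, 3],
    'n_estimators': [5, 10, 25, 50],
    'learning_rate': np.linspace(1e-16, 1, 3)
}

# ParameterGrid enumerates the Cartesian product of all listed values.
grid = ParameterGrid(params_grid)
n_combinations = len(grid)  # 3 * 4 * 3

# With 10-fold cross-validation every combination is fitted once per fold.
n_folds = 10
n_fits = n_combinations * n_folds

print("Combinations: {}, model fits: {}".format(n_combinations, n_fits))
```

For this grid that is 36 combinations and 360 cross-validation fits, not counting the final refit on the full data.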

**Define the fixed parameters in a separate dictionary.**

In [6]:
params_fixed = {
    'objective': 'binary:logistic',
    'silent': 1,
    'seed': 42
}

In [11]:
xgb_grid = GridSearchCV(
    estimator=XGBClassifier(**params_fixed),
    param_grid=params_grid,
    cv=cv,
    scoring='accuracy',
    n_jobs=-1
)

In [18]:
%%time
xgb_grid.fit(X, y)

CPU times: user 1.64 s, sys: 171 ms, total: 1.82 s
Wall time: 4.87 s


GridSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[1 0 ..., 1 0], n_folds=10, shuffle=True, random_state=42),
       error_score='raise',
       estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=1, subsample=1),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'n_estimators': [5, 10, 25, 50], 'learning_rate': array([  1.00000e-16,   5.00000e-01,   1.00000e+00]), 'max_depth': [1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, scoring='accuracy', verbose=0)

**The `grid_scores_` attribute can be used to plot validation curves.**

In [13]:
xgb_grid.grid_scores_

[mean: 0.49800, std: 0.00245, params: {'n_estimators': 5, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 10, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 25, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 50, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 1},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 5, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 10, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 25, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 50, 'learning_rate': 9.9999999999999998e-17, 'max_depth': 2},
 mean: 0.49800, std: 0.00245, params: {'n_estimators': 5, 'learnin
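
In recent scikit-learn versions the per-combination scores are exposed through `cv_results_` instead of `grid_scores_`. The sketch below shows one way to extract the points of a validation curve over `n_estimators` from it. It is an illustration under assumptions, not the notebook's own run: scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` so that the example does not depend on xgboost, and 5 folds are used to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_redundant=3, n_repeated=2, random_state=42)

# Stand-in estimator (an assumption): scikit-learn's own gradient boosting.
search = GridSearchCV(
    estimator=GradientBoostingClassifier(max_depth=3, random_state=42),
    param_grid={'n_estimators': [5, 10, 25, 50]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='accuracy',
)
search.fit(X, y)

# cv_results_ holds one entry per parameter combination, in grid order.
results = search.cv_results_
curve = [(int(n), float(mean), float(std))
         for n, mean, std in zip(results['param_n_estimators'],
                                 results['mean_test_score'],
                                 results['std_test_score'])]

for n_estimators, mean, std in curve:
    print("n_estimators={:>3}: accuracy {:.3f} +/- {:.3f}".format(n_estimators, mean, std))
```

The `(n_estimators, mean, std)` triples in `curve` can be passed directly to a plotting library to draw the curve with error bars.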

**Or simply use the best parameter combination.**

In [16]:
print("Best accuracy obtained: {0}".format(xgb_grid.best_score_))
print("Parameters:")
for key, value in xgb_grid.best_params_.items():
    print("\t{}: {}".format(key, value))

Best accuracy obtained: 0.887
Parameters:
	n_estimators: 50
	learning_rate: 0.5
	max_depth: 3


### Randomized Grid-Search
**The randomized version often performs well and, more importantly, runs much faster.
Now we create a dictionary of parameter distributions:**

In [17]:
params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001), # discrete uniform distribution over the integers 1..1000
    'learning_rate': uniform(), # continuous uniform distribution on [0, 1]
    'subsample': uniform(), # continuous uniform distribution on [0, 1]
    'colsample_bytree': uniform() # continuous uniform distribution on [0, 1]
}
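
To see what a randomized search would actually try, the same dictionary can be sampled directly. This is a minimal sketch assuming `ParameterSampler` from `sklearn.model_selection` in a recent scikit-learn version; the variable `sampled` is illustrative.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

params_dist_grid = {
    'max_depth': [1, 2, 3, 4],
    'gamma': [0, 0.5, 1],
    'n_estimators': randint(1, 1001),  # integers 1..1000, upper bound excluded
    'learning_rate': uniform(),        # floats in [0, 1]
    'subsample': uniform(),
    'colsample_bytree': uniform()
}

# Lists are sampled uniformly; frozen scipy distributions are sampled via rvs().
sampled = list(ParameterSampler(params_dist_grid, n_iter=10, random_state=42))

for candidate in sampled[:3]:
    print(candidate)
```

Each element of `sampled` is one complete parameter combination, so the number of model fits is `n_iter` times the number of folds regardless of how many parameters are searched.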

**Initialize `RandomizedSearchCV` to sample 10 random parameter combinations.**

In [19]:
rs_grid = RandomizedSearchCV(
    estimator=XGBClassifier(**params_fixed),
    param_distributions=params_dist_grid,
    n_iter=10,
    cv=cv,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
)

In [20]:
%%time
rs_grid.fit(X, y)

CPU times: user 1.25 s, sys: 99.3 ms, total: 1.35 s
Wall time: 10.2 s


RandomizedSearchCV(cv=sklearn.cross_validation.StratifiedKFold(labels=[1 0 ..., 1 0], n_folds=10, shuffle=True, random_state=42),
          error_score='raise',
          estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=1, subsample=1),
          fit_params={}, iid=True, n_iter=10, n_jobs=-1,
          param_distributions={'colsample_bytree': <scipy.stats._distn_infrastructure.rv_frozen object at 0x113994b90>, 'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x11397ee90>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x113af08d0>, 'subsample': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1139941d0>, 'max_depth': [1, 2, 3, 4], 'gamma': [0, 0.5, 1]},
         

In [21]:
rs_grid.best_estimator_

XGBClassifier(base_score=0.5, colsample_bylevel=1,
       colsample_bytree=0.60528369922942848, gamma=1,
       learning_rate=0.33413908269193104, max_delta_step=0, max_depth=4,
       min_child_weight=1, missing=None, n_estimators=474, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=1, subsample=0.78878346510127606)

In [22]:
rs_grid.best_params_

{'colsample_bytree': 0.60528369922942848,
 'gamma': 1,
 'learning_rate': 0.33413908269193104,
 'max_depth': 4,
 'n_estimators': 474,
 'subsample': 0.78878346510127606}

In [23]:
rs_grid.best_score_

0.88800000000000001

### Hyperopt
**The Hyperopt library implements more advanced hyperparameter search algorithms, such as the Tree-structured Parzen Estimator (`tpe`) used below. As an example, we will minimize log_loss on a validation set.**

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3, random_state=42)

**Define the function to minimize.**

In [40]:
def score(params):
    print("Training with params:")
    print(params)
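    # Work on a copy so that the dictionary supplied by Hyperopt is not mutated.
    params = dict(params)
    # xgb.train takes the number of boosting rounds as a separate argument.
    num_round = int(params.pop('n_estimators'))
    # hp.quniform returns floats; newer xgboost versions expect an integer max_depth.
    params['max_depth'] = int(params['max_depth'])
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_test, label=y_test)
    model = xgb.train(params, dtrain, num_round)
    # With binary:logistic, predict returns the probability of class 1.
    predictions = model.predict(dvalid)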
    score = log_loss(y_test, predictions)
    print("\tScore {0}\n\n".format(score))
    return {'loss': score, 'status': STATUS_OK}
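
For binary labels, log loss is the mean negative log-likelihood of the true labels under the predicted probabilities of class 1. A small sketch that computes it by hand with NumPy and compares the result with `sklearn.metrics.log_loss`; the labels and probabilities are made-up illustrative values, not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import log_loss

# Illustrative values only: true labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.35, 0.1])

# Mean negative log-likelihood of the true labels under the predictions.
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
from_sklearn = log_loss(y_true, p_pred)

print(manual, from_sklearn)
```

Confident wrong predictions are penalized heavily, which is why log loss is stricter than accuracy when comparing parameter settings.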

In [43]:
def optimize(trials):
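    space = {
        # Fixed number of boosting rounds; score() passes it to xgb.train separately.
        'n_estimators': 150,
        # hp.quniform(label, low, high, q) returns round(uniform(low, high) / q) * q.
        'eta': hp.quniform('eta', 0.025, 0.5, 0.025),
        'max_depth': hp.quniform('max_depth', 4, 10, 2),
        'min_child_weight': hp.quniform('min_child_weight', 1, 6, 1),
        'subsample': hp.quniform('subsample', 0.5, 1, 0.25),
        'gamma': 0,
        'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1, 0.25),
        # 'merror' is a multiclass metric; it does not affect the loss minimized
        # here, because score() computes log_loss itself.
        'eval_metric': 'merror',
        'objective': 'binary:logistic',
        'nthread': 4,
        'silent': 1
    }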
    best = fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=20)

    return best
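
The search space relies on `hp.quniform`, which Hyperopt documents as returning a value like `round(uniform(low, high) / q) * q`. The sketch below emulates that formula with plain NumPy to show which values each range can produce. It is an emulation, not Hyperopt's own sampler, and the helper `quniform_sketch` is a hypothetical name introduced here for illustration.

```python
import numpy as np

def quniform_sketch(rng, low, high, q):
    """Emulate the documented formula of hp.quniform: round(uniform(low, high) / q) * q."""
    return np.round(rng.uniform(low, high) / q) * q

rng = np.random.RandomState(42)

# Draw values for two of the ranges used in the search space above.
eta_samples = [quniform_sketch(rng, 0.025, 0.5, 0.025) for _ in range(1000)]
depth_samples = [quniform_sketch(rng, 4, 10, 2) for _ in range(1000)]

print(sorted(set(np.round(eta_samples, 3))))
print(sorted(set(depth_samples)))
```

The values are floats on a grid with step `q`, which is why `max_depth` reaches the objective function as `4.0` rather than `4`. Under this formula the two endpoint values are drawn about half as often as interior ones, since only half of a rounding interval falls inside `[low, high]`.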

In [45]:
trials = Trials()
best_params = optimize(trials)
print(best_params)

Training with params:
{'colsample_bytree': 0.75, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 2.0, 'n_estimators': 150, 'subsample': 1.0, 'eta': 0.125, 'objective': 'binary:logistic', 'max_depth': 4.0, 'gamma': 0}
	Score 0.282255703456


Training with params:
{'colsample_bytree': 0.75, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 4.0, 'n_estimators': 150, 'subsample': 0.5, 'eta': 0.42500000000000004, 'objective': 'binary:logistic', 'max_depth': 6.0, 'gamma': 0}
	Score 0.367044530358


Training with params:
{'colsample_bytree': 0.5, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 6.0, 'n_estimators': 150, 'subsample': 1.0, 'eta': 0.1, 'objective': 'binary:logistic', 'max_depth': 6.0, 'gamma': 0}
	Score 0.294940195541


Training with params:
{'colsample_bytree': 1.0, 'silent': 1, 'eval_metric': 'merror', 'nthread': 4, 'min_child_weight': 2.0, 'n_estimators': 150, 'subsample': 0.5, 'eta': 0.4, 'objective': 'bina