<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Spark-MLlib-Tuning" data-toc-modified-id="Spark-MLlib-Tuning-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><a href="https://spark.apache.org/docs/latest/ml-tuning.html" target="_blank">Spark MLlib Tuning</a></a></span></li><li><span><a href="#Hyperopt" data-toc-modified-id="Hyperopt-2"><span class="toc-item-num">2&nbsp;&nbsp;</span><a href="https://github.com/hyperopt/hyperopt" target="_blank">Hyperopt</a></a></span><ul class="toc-item"><li><span><a href="#XGBoost-Tuning" data-toc-modified-id="XGBoost-Tuning-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><a href="https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/" target="_blank">XGBoost Tuning</a></a></span><ul class="toc-item"><li><span><a href="#Objective-function" data-toc-modified-id="Objective-function-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Objective function</a></span></li><li><span><a href="#Tune-number-of-trees" data-toc-modified-id="Tune-number-of-trees-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Tune number of trees</a></span></li><li><span><a href="#Tune-tree-specific-parameters" data-toc-modified-id="Tune-tree-specific-parameters-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Tune tree-specific parameters</a></span><ul class="toc-item"><li><span><a href="#Tune-max_depth,-min_child_weight" data-toc-modified-id="Tune-max_depth,-min_child_weight-2.1.3.1"><span class="toc-item-num">2.1.3.1&nbsp;&nbsp;</span>Tune max_depth, min_child_weight</a></span></li><li><span><a href="#Tune-gamma" data-toc-modified-id="Tune-gamma-2.1.3.2"><span class="toc-item-num">2.1.3.2&nbsp;&nbsp;</span>Tune gamma</a></span></li><li><span><a href="#Tune-subsample,-colsample_bytree" data-toc-modified-id="Tune-subsample,-colsample_bytree-2.1.3.3"><span class="toc-item-num">2.1.3.3&nbsp;&nbsp;</span>Tune subsample, colsample_bytree</a></span></li></ul></li><li><span><a href="#Tune-regularization-parameters" data-toc-modified-id="Tune-regularization-parameters-2.1.4"><span class="toc-item-num">2.1.4&nbsp;&nbsp;</span>Tune regularization parameters</a></span></li><li><span><a href="#Lower-the-learning-rate-and-decide-the-optimal-parameters" data-toc-modified-id="Lower-the-learning-rate-and-decide-the-optimal-parameters-2.1.5"><span class="toc-item-num">2.1.5&nbsp;&nbsp;</span>Lower the learning rate and decide the optimal parameters</a></span></li></ul></li><li><span><a href="#LogisticRegression-Tuning" data-toc-modified-id="LogisticRegression-Tuning-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>LogisticRegression Tuning</a></span></li><li><span><a href="#Optional-MongoTrials" data-toc-modified-id="Optional-MongoTrials-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Optional <a href="https://hyperopt.github.io/hyperopt/scaleout/mongodb/" target="_blank">MongoTrials</a></a></span><ul class="toc-item"><li><span><a href="#XGBoost-Tuning" data-toc-modified-id="XGBoost-Tuning-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>XGBoost Tuning</a></span></li></ul></li></ul></li><li><span><a href="#Results" data-toc-modified-id="Results-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Results</a></span></li></ul></div>

Продолжаем работать над задачей CTR-prediction с использованием датасета от Criteo.

Описание задачи и данных можно посмотреть в notebook'e предыдущей практики (`sgd_logreg_nn/notebooks/ctr_prediction_mllib.ipynb`).

In [1]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import os
import sys
import glob
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import pyspark
import pyspark.sql.functions as F
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql import Row

COMMON_PATH = '/workspace/common'

sys.path.append(os.path.join(COMMON_PATH, 'utils'))

os.environ['PYSPARK_SUBMIT_ARGS'] = """
--jars {common}/xgboost4j-spark-0.72.jar,{common}/xgboost4j-0.72.jar
--py-files {common}/sparkxgb.zip pyspark-shell
""".format(common=COMMON_PATH).replace('\n', ' ')

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("spark_sql_examples") \
    .config("spark.executor.memory", "6g") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

from metrics import rocauc, logloss, ne, get_ate
from processing import split_by_col

from sparkxgb.xgboost import *

In [67]:
from pyspark.ml.classification import LogisticRegression

In [2]:
DATA_PATH = '/workspace/data/criteo/dac/'

TRAIN_PATH = os.path.join(DATA_PATH, 'train.csv')

In [3]:
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('file:///' + TRAIN_PATH)

**Remark** Необязательно использовать половину датасета и всего две категориальные переменные. Можно использовать больше данных, если вам позволяет ваша конфигурация

In [4]:
df = df.sample(False, 0.5)

In [5]:
num_columns = ['_c{}'.format(i) for i in range(1, 14)]
cat_columns = ['_c{}'.format(i) for i in range(14, 40)][:2]
len(num_columns), len(cat_columns)

(13, 2)

In [6]:
df = df.fillna(0, subset=num_columns)

In [7]:
from pyspark.ml import PipelineModel


pipeline_model = PipelineModel.load(os.path.join(DATA_PATH, 'preprocessing_transformer'))

In [8]:
df = pipeline_model \
    .transform(df) \
    .select(F.col('_c0').alias('label'), 'features', 'id') \
    .cache()

df.count()

1832511

In [9]:
train_df, val_df, test_df = split_by_col(df, 'id', [0.8, 0.1, 0.1])

# [Spark MLlib Tuning](https://spark.apache.org/docs/latest/ml-tuning.html)

У имеющегося в Spark'e метода HPO есть два существенных недостатка, которые делают его мало пригодным в контексте нашей задачи:

1. `ParamGridBuilder` - поиск по сетке
2. `TrainValidationSplit` - делит данные случайнм образом

# [Hyperopt](https://github.com/hyperopt/hyperopt)

Установим `hyperopt`

In [10]:
!pip3.5 install hyperopt



## [XGBoost Tuning](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

> [Notes on Parameter Tuning](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html)

### Objective function

In [11]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import scipy.stats as st


def objective(space):
    estimator = XGBoostEstimator(**space)
    print('SPACE:', estimator._input_kwargs_processed())
    success = False
    attempts = 0
    model = None
    while not success and attempts < 2:
        try:
            model = estimator.fit(train_df)
            success = True
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')
        
    log_loss = logloss(model, val_df, probabilities_col='probabilities')
    roc_auc = rocauc(model, val_df, probabilities_col='probabilities')
    
    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK }

In [12]:
static_params = {
    'featuresCol': "features", 
    'labelCol': "label", 
    'predictionCol': "prediction",
    'eval_metric': 'logloss',
    'objective': 'binary:logistic',
    'nthread': 1,
    'silent': 0,
    'nworkers': 1
}

Fix baseline parameters and train baseline model

In [13]:
CONTROL_NAME = 'xgb baseline'

baseline_params = {
    'colsample_bytree': 0.9,
    'eta': 0.15,
    'gamma': 0.9,
    'max_depth': 6,
    'min_child_weight': 50.0,
    'subsample': 0.9,
    'num_round': 20
}

baseline_model = XGBoostEstimator(**{**static_params, **baseline_params}).fit(train_df)

In [14]:
baseline_rocauc = rocauc(baseline_model, val_df, probabilities_col='probabilities')
baseline_rocauc

0.7251075671753238

In [15]:
all_metrics = {}

In [76]:
def add_metrics_result(name, model, train_df, test_df, all_metrics, probabilities_col='probabilities'):
    metrics_values = {
        'logloss_test': logloss(model, test_df, probabilities_col=probabilities_col),
        'rocauc_test': rocauc(model, test_df, probabilities_col=probabilities_col)
    }
    all_metrics[name] = metrics_values

In [17]:
add_metrics_result(CONTROL_NAME, baseline_model, train_df, test_df, all_metrics)

### Tune number of trees

> Choose a relatively high learning rate. Generally a learning rate of 0.1 works but somewhere between 0.05 to 0.3 should work for different problems. Determine the optimum number of trees for this learning rate.

In [22]:
%%time

num_round_choice = [10, 20, 40, 100]
eta_choice = [0.5, 0.10, 0.15, 0.20, 0.30]

space = {
    # Optimize
    'num_round': hp.choice('num_round', num_round_choice),
    'eta': hp.choice('eta', eta_choice),
    
    # Fixed    
    'max_depth': baseline_params['max_depth'],
    'min_child_weight': baseline_params['min_child_weight'],
    'subsample': baseline_params['subsample'],
    'gamma': baseline_params['gamma'],
    'colsample_bytree': baseline_params['colsample_bytree'],
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.2, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 10, 'max_depth': 6, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.516636396266625, ROC-AUC: 0.7218785845888716
SPACE:                                                                          
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.1, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 10, 'max_depth': 6, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.5442859204790472, ROC-AUC: 0.7189505048332658                       
SPACE:                                                                          
{'objective': 'binary:logisti

LOG-LOSS: 0.5070877157144487, ROC-AUC: 0.72966941246326                           
SPACE:                                                                            
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 40, 'max_depth': 6, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.5043433238278184, ROC-AUC: 0.7334145630575858                         
SPACE:                                                                              
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.15, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 10, 'max_depth': 6, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.5250770221493121, ROC-AUC: 0.72068046

In [23]:
best

{'eta': 4, 'num_round': 3}

In [24]:
params = {k: v for k, v in baseline_params.items()}
params['eta'] = eta_choice[best['eta']]
params['num_round'] = num_round_choice[best['num_round']]

In [25]:
model = XGBoostEstimator(**{**static_params, **params}).fit(train_df)

In [26]:
add_metrics_result("model_1", model, train_df, test_df, all_metrics)

Обратите внимание на то, что в случае с `hp.choice` в переменной `best` хранится не конкретное значение гиперпараметра, а его индекс из списка, например, `num_round_choice`

In [27]:
best_eta = eta_choice[best['eta']]  # change me!
best_num_round = num_round_choice[best['num_round']]  # change me!

### Tune tree-specific parameters

> Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.

#### Tune max_depth, min_child_weight

In [30]:
max_depth_choice = [5, 10, 15]
min_child_weight_choice = [0.0, 10.0, 25.0, 50.0]

space = {
    # Optimize
    'max_depth': hp.choice('max_depth', max_depth_choice),
    'min_child_weight': hp.choice('min_child_weight', min_child_weight_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'subsample': baseline_params['subsample'],
    'gamma': baseline_params['gamma'],
    'colsample_bytree': baseline_params['colsample_bytree'],
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)
best

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 0.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.50310622593908, ROC-AUC: 0.7349396115601177
SPACE:                                                                           
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.5004216511176252, ROC-AUC: 0.7386419365473672                        
SPACE:                                                                             
{'objective': 'binary:

SPACE:                                                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 0.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.50310622593908, ROC-AUC: 0.7349396115601143                             
SPACE:                                                                              
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 25.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 5, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.5021798919604946, ROC-AUC: 0.7364267692605153                           
SPACE:                                  

{'max_depth': 1, 'min_child_weight': 3}

In [31]:
params['max_depth'] = max_depth_choice[best['max_depth']]
params['min_child_weight'] = min_child_weight_choice[best['min_child_weight']]

In [32]:
model = XGBoostEstimator(**{**static_params, **params}).fit(train_df)

In [33]:
add_metrics_result("model_2", model, train_df, test_df, all_metrics)

In [34]:
best_max_depth = max_depth_choice[best['max_depth']]
best_min_child_weight = min_child_weight_choice[best['min_child_weight']]

#### Tune gamma

In [40]:
%%time

gamma_choice = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

space = {
    # Optimize
    'gamma': hp.choice('gamma', gamma_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'max_depth': best_max_depth,
    'min_child_weight': best_min_child_weight,
    'subsample': baseline_params['subsample'],
    'colsample_bytree': baseline_params['colsample_bytree'],
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=10,
            trials=trials)
best

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 2.0, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.500380975948034, ROC-AUC: 0.738841730156651
SPACE:                                                                            
{'objective': 'binary:logistic', 'gamma': 2.0, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.500380975948034, ROC-AUC: 0.7388417301566523                          
SPACE:                                                                            
{'objective': 'binar

{'gamma': 2}

In [41]:
params['gamma'] = gamma_choice[best['gamma']]

In [42]:
model = XGBoostEstimator(**{**static_params, **params}).fit(train_df)

In [43]:
add_metrics_result("model_3", model, train_df, test_df, all_metrics)

In [44]:
best_gamma = gamma_choice[best['gamma']]

#### Tune subsample, colsample_bytree

In [45]:
%%time

subsample_choice = [0.05, 0.25, 0.5, 0.75, 0.9]
colsample_bytree_choice = [0.05, 0.25, 0.5, 0.75, 0.9]

space = {
    # Optimize
    'subsample': hp.choice('subsample', subsample_choice),
    'colsample_bytree': hp.choice('colsample_bytree', colsample_bytree_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'max_depth': best_max_depth,
    'min_child_weight': best_min_child_weight,
    'gamma': best_gamma,
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)
best

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.25}
LOG-LOSS: 0.5036490170439193, ROC-AUC: 0.7342174385406672
SPACE:                                                                             
{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.9, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.9}
LOG-LOSS: 0.500094574574192, ROC-AUC: 0.7391646088915047                           
SPACE:                                                                             
{'objective': 

{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.75, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.5}
LOG-LOSS: 0.5013798823738617, ROC-AUC: 0.7373932089225125                          
SPACE:                                                                             
{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'eta': 0.3, 'colsample_bytree': 0.5, 'min_child_weight': 50.0, 'featuresCol': 'features', 'eval_metric': 'logloss', 'nthread': 1, 'nworkers': 1, 'num_round': 100, 'max_depth': 10, 'predictionCol': 'prediction', 'subsample': 0.5}
LOG-LOSS: 0.5012836367471434, ROC-AUC: 0.7375399312467789                          
SPACE:                                                                             
{'objective': 'binary:logistic', 'gamma': 1

{'colsample_bytree': 4, 'subsample': 4}

In [46]:
params['subsample'] = subsample_choice[best['subsample']]
params['colsample_bytree'] = colsample_bytree_choice[best['colsample_bytree']]

In [47]:
model = XGBoostEstimator(**{**static_params, **params}).fit(train_df)

In [48]:
add_metrics_result("model_4", model, train_df, test_df, all_metrics)

In [49]:
best_subsample = subsample_choice[best['subsample']]
best_colsample_bytree = colsample_bytree_choice[best['colsample_bytree']]

### Tune regularization parameters

> Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.

In [50]:
%%time

alpha_choice = [0.0, 0.25, 0.5, 0.75, 1.0]
reg_lambda_choice = [0.0, 0.5, 1.5, 2.0, 2.5]

space = {
    # Optimize
    'alpha': hp.choice('alpha', alpha_choice),
    'reg_lambda': hp.choice('reg_lambda', reg_lambda_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'max_depth': best_max_depth,
    'min_child_weight': best_min_child_weight,
    'gamma': best_gamma,
    'subsample': best_subsample,
    'colsample_bytree': best_colsample_bytree,
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials)
best

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'alpha': 0.25, 'colsample_bytree': 0.9, 'eta': 0.3, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 100, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 10, 'lambda': 0.0}
LOG-LOSS: 0.5001364014442373, ROC-AUC: 0.7390614828017854
SPACE:                                                                             
{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'alpha': 0.75, 'colsample_bytree': 0.9, 'eta': 0.3, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 100, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 10, 'lambda': 0.0}
LOG-LOSS: 0.5003792306012671, ROC-AUC: 0.7386927012662292                          
SPACE:                                 

SPACE:                                                                                
{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'alpha': 0.0, 'colsample_bytree': 0.9, 'eta': 0.3, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 100, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 10, 'lambda': 0.0}
LOG-LOSS: 0.5005674736325909, ROC-AUC: 0.7383430081316014                           
SPACE:                                                                              
{'objective': 'binary:logistic', 'gamma': 1.5, 'silent': 0, 'labelCol': 'label', 'alpha': 0.25, 'colsample_bytree': 0.9, 'eta': 0.3, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 100, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 10, 'lambda': 0.0}
LOG-LOSS: 0.5001364014442373, ROC-AUC: 0.7390614828017797       

{'alpha': 1, 'reg_lambda': 0}

In [51]:
params['alpha'] = alpha_choice[best['alpha']]
params['reg_lambda'] = reg_lambda_choice[best['reg_lambda']]

In [52]:
model = XGBoostEstimator(**{**static_params, **params}).fit(train_df)

In [53]:
add_metrics_result("model_5", model, train_df, test_df, all_metrics)

In [54]:
best_alpha = alpha_choice[best['alpha']]
best_reg_lambda = reg_lambda_choice[best['reg_lambda']]

### Lower the learning rate and decide the optimal parameters

In [57]:
%%time

num_round_choice =        [best_num_round,        baseline_params['num_round']       ]
max_depth_choice =        [best_max_depth,        baseline_params['max_depth']       ]
min_child_weight_choice = [best_min_child_weight, baseline_params['min_child_weight']]
gamma_choice =            [best_gamma,            baseline_params['gamma']           ]
subsample_choice =        [best_subsample,        baseline_params['subsample']       ]
colsample_bytree_choice = [best_colsample_bytree, baseline_params['colsample_bytree']]
alpha_choice =            [best_alpha,            0.0                                ]
reg_lambda_choice =       [best_reg_lambda,       1.0                                ]


space = {
    # Optimize
    'num_round': hp.choice('num_round', num_round_choice),
    'max_depth': hp.choice('max_depth', max_depth_choice),
    'min_child_weight': hp.choice('min_child_weight', min_child_weight_choice),
    'gamma': hp.choice('gamma', gamma_choice),
    'subsample': hp.choice('subsample', subsample_choice),
    'colsample_bytree': hp.choice('colsample_bytree', colsample_bytree_choice),
    'alpha': hp.choice('alpha', alpha_choice),
    'reg_lambda': hp.choice('reg_lambda', reg_lambda_choice),
    
    # Fixed    
    'eta': 0.05,
    
    **static_params
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=30,
            trials=trials)
best

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'alpha': 0.0, 'colsample_bytree': 0.9, 'eta': 0.05, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 20, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 6, 'lambda': 0.0}
LOG-LOSS: 0.5452747216161833, ROC-AUC: 0.7195028276126487
SPACE:                                                                             
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'alpha': 0.25, 'colsample_bytree': 0.9, 'eta': 0.05, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 100, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 6, 'lambda': 1.0}
LOG-LOSS: 0.5075588815804976, ROC-AUC: 0.7292316824207099                          
SPACE:                                   

SPACE:                                                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'alpha': 0.0, 'colsample_bytree': 0.9, 'eta': 0.05, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 20, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 10, 'lambda': 1.0}
LOG-LOSS: 0.541209975023462, ROC-AUC: 0.7264767312694006                              
SPACE:                                                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'silent': 0, 'labelCol': 'label', 'alpha': 0.0, 'colsample_bytree': 0.9, 'eta': 0.05, 'predictionCol': 'prediction', 'featuresCol': 'features', 'eval_metric': 'logloss', 'num_round': 100, 'nworkers': 1, 'nthread': 1, 'subsample': 0.9, 'min_child_weight': 50.0, 'max_depth': 10, 'lambda': 1.0}
LOG-LOSS: 0.5039782243303803, ROC-AUC: 0.7341978510270835   

{'alpha': 0,
 'colsample_bytree': 1,
 'gamma': 0,
 'max_depth': 0,
 'min_child_weight': 0,
 'num_round': 0,
 'reg_lambda': 1,
 'subsample': 1}

In [63]:
final_params = {
    'num_round':        num_round_choice[best['num_round']],
    'max_depth':        max_depth_choice[best['max_depth']],
    'min_child_weight': min_child_weight_choice[best['min_child_weight']],
    'gamma':            gamma_choice[best['gamma']],
    'subsample':        subsample_choice[best['subsample']],
    'colsample_bytree': colsample_bytree_choice[best['colsample_bytree']],
    'alpha':            alpha_choice[best['alpha']],
    'reg_lambda':       reg_lambda_choice[best['reg_lambda']],
    'eta':              best_eta
}

In [64]:
model = XGBoostEstimator(**{**static_params, **final_params}).fit(train_df)

In [65]:
add_metrics_result("model_final", model, train_df, test_df, all_metrics)

In [61]:
for k, v in final_params.items():
    print(k, "=", v)

num_round = 100
alpha = 0.25
gamma = 1.5
eta = 0.05
max_depth = 10
reg_lambda = 1.0
colsample_bytree = 0.9
subsample = 0.9
min_child_weight = 50.0


---
## LogisticRegression Tuning

Подберем гиперпараметры для логрега из предыдущих практик

In [88]:
def objective_log_reg(space):
    estimator = LogisticRegression(**space)
    print('SPACE:', estimator._input_kwargs)
    success = False
    attempts = 0
    model = None
    while not success and attempts < 2:
        try:
            model = estimator.fit(train_df)
            success = True
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')
        
    log_loss = logloss(model, val_df, probabilities_col='probability')
    roc_auc = rocauc(model, val_df, probabilities_col='probability')
    
    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK }

In [98]:
lr_metrics = {}

In [99]:
CONTROL_NAME_LR = 'log_reg_baseline'

In [108]:
baseline_params_lr = {
    'regParam': 0.0,
    'elasticNetParam': 0.0,
    'maxIter': 30,
    'standardization': True,
    'threshold': 0.5
}

In [109]:
static_params_lr = {
    'featuresCol': "features", 
    'labelCol': "label",
}

In [110]:
baseline_model_lr = LogisticRegression(**{**static_params_lr, **baseline_params_lr}).fit(train_df)

In [111]:
add_metrics_result(CONTROL_NAME_LR, baseline_model_lr, train_df, test_df, lr_metrics, probabilities_col='probability')

In [89]:
%%time

reg_param_choice =               [0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0]
elastic_net_param_choice =       [0.0, 0.15, 0.25, 0.5, 0.75, 0.85, 1.0]
max_iter_choice =                [10, 20, 30, 40, 50]
standardization_choice =         [True, False]
threshold_choice =               [0.15, 0.25, 0.5, 0.75, 0.85]


space = {
    # Optimize
    'regParam':        hp.choice('regParam', reg_param_choice),
    'elasticNetParam': hp.choice('elasticNetParam', elastic_net_param_choice),
    'maxIter':         hp.choice('maxIter', max_iter_choice),
    'standardization': hp.choice('standardization', standardization_choice),
    'threshold':       hp.choice('threshold', threshold_choice),
    
    **static_params_lr
}


trials = Trials()
best_lr = fmin(fn=objective_log_reg,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)
best_lr

SPACE:                                                 
{'featuresCol': 'features', 'maxIter': 50, 'regParam': 1.0, 'labelCol': 'label', 'threshold': 0.25, 'standardization': False, 'elasticNetParam': 0.5}
LOG-LOSS: 0.5623304998829375, ROC-AUC: 0.6467681035602768
SPACE:                                                                             
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 0.5, 'labelCol': 'label', 'threshold': 0.75, 'standardization': True, 'elasticNetParam': 0.25}
LOG-LOSS: 0.5750326435773161, ROC-AUC: 0.5                                         
SPACE:                                                                             
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 2.0, 'labelCol': 'label', 'threshold': 0.85, 'standardization': True, 'elasticNetParam': 0.0}
LOG-LOSS: 0.5641329141738674, ROC-AUC: 0.699438916244874                           
SPACE:                                                                             
{'featuresCol': '

SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 1.5, 'labelCol': 'label', 'threshold': 0.5, 'standardization': True, 'elasticNetParam': 0.15}
LOG-LOSS: 0.5750326435773161, ROC-AUC: 0.5                                        
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 0.75, 'labelCol': 'label', 'threshold': 0.5, 'standardization': True, 'elasticNetParam': 0.15}
LOG-LOSS: 0.5750326435773161, ROC-AUC: 0.5                                        
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 50, 'regParam': 0.0, 'labelCol': 'label', 'threshold': 0.5, 'standardization': True, 'elasticNetParam': 0.15}
LOG-LOSS: 0.5349166396424252, ROC-AUC: 0.7025494885529899                         
SPACE:                                                

LOG-LOSS: 0.5310230324704763, ROC-AUC: 0.7025388782099912                         
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 10, 'regParam': 0.5, 'labelCol': 'label', 'threshold': 0.25, 'standardization': True, 'elasticNetParam': 0.0}
LOG-LOSS: 0.5492606893576685, ROC-AUC: 0.6998126954560002                         
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 1.5, 'labelCol': 'label', 'threshold': 0.75, 'standardization': True, 'elasticNetParam': 0.25}
LOG-LOSS: 0.5750326435773161, ROC-AUC: 0.5                                        
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 0.0, 'labelCol': 'label', 'threshold': 0.75, 'standardization': True, 'elasticNetParam': 0.25}
LOG-LOSS: 0.5282529644822185, ROC-AUC: 0.702538497287

LOG-LOSS: 0.5750326435773161, ROC-AUC: 0.5                                        
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 0.0, 'labelCol': 'label', 'threshold': 0.85, 'standardization': True, 'elasticNetParam': 0.0}
LOG-LOSS: 0.5282529644822185, ROC-AUC: 0.7025384972879511                         
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 30, 'regParam': 0.5, 'labelCol': 'label', 'threshold': 0.5, 'standardization': False, 'elasticNetParam': 0.15}
LOG-LOSS: 0.5491119578481176, ROC-AUC: 0.6780070607505677                         
SPACE:                                                                            
{'featuresCol': 'features', 'maxIter': 20, 'regParam': 0.0, 'labelCol': 'label', 'threshold': 0.75, 'standardization': True, 'elasticNetParam': 0.25}
LOG-LOSS: 0.5282529644822185, ROC-AUC: 0.702538497287

{'elasticNetParam': 0,
 'maxIter': 1,
 'regParam': 0,
 'standardization': 0,
 'threshold': 3}

In [90]:
params_lr = {
    'regParam': reg_param_choice[best_lr['regParam']],
    'elasticNetParam': elastic_net_param_choice[best_lr['elasticNetParam']],
    'maxIter': max_iter_choice[best_lr['maxIter']],
    'standardization': standardization_choice[best_lr['standardization']],
    'threshold': threshold_choice[best_lr['threshold']]
}

In [107]:
params_lr

{'elasticNetParam': 0.0,
 'maxIter': 20,
 'regParam': 0.0,
 'standardization': True,
 'threshold': 0.75}

In [91]:
model_lr = LogisticRegression(**{**static_params_lr, **params_lr}).fit(train_df)

In [105]:
add_metrics_result("model_lr", model_lr, train_df, test_df, lr_metrics, probabilities_col='probability')

In [93]:
add_metrics_result("model_lr", model_lr, train_df, test_df, all_metrics, probabilities_col='probability')

In [113]:
get_ate(lr_metrics, CONTROL_NAME_LR)

Unnamed: 0,metric,model_lr ate %
0,rocauc_test,0.002083
1,logloss_test,-0.224725


---
## Optional [MongoTrials](https://hyperopt.github.io/hyperopt/scaleout/mongodb/)

> For parallel search, hyperopt includes a MongoTrials implementation that supports asynchronous updates.

**TLDR** Преимущества использования `MongoTrials`:
* `MongoTrials` позволяет параллельно запускать несколько вычислений целевой функции
* Динамический уровень параллелизма - можно добавлять/удалять воркеров, которые вычисляют целевую функцию
* Все результаты сохраняются в БД - история запусков никуда не потеряется

*За выполнение данного задания можно получить дополнительно +0.4 к итоговому баллу*

### XGBoost Tuning

In [None]:
######################################
######### YOUR CODE HERE #############
######################################

# Results

Подведем итоги.

Обучите модели с найденными (оптимальными) гиперпараметрами и сделайте справнение на отложенной выборке

Итоговая таблица

In [114]:
get_ate(all_metrics, CONTROL_NAME)

Unnamed: 0,metric,model_1 ate %,model_2 ate %,model_3 ate %,model_4 ate %,model_5 ate %,model_final ate %,model_lr ate %
0,rocauc_test,1.838613,1.951576,1.994639,1.994639,2.003799,2.015592,-3.202494
1,logloss_test,-1.989203,-2.176228,-2.207105,-2.207105,-2.226729,-2.248687,3.314433
