
We continue working on the CTR-prediction task using the Criteo dataset.

A description of the task and the data can be found in the notebook from the previous practice session (`sgd_logreg_nn/notebooks/ctr_prediction_mllib.ipynb`).

In [8]:
!ls /workspace/data/criteo

criteo_pipeline		    results	    xgblr_pipeline
criteo_pipeline_full	    test.csv	    xgblr_xgb.model
criteo_pipeline_meantarget  train.csv	    xgb_meantarget.model
logreg_best.model	    xgblr_lr.model  xgb.model


In [1]:
%matplotlib inline
%config InlineBackend.figure_format ='retina'

import os
import sys
import glob
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

import pyspark
import pyspark.sql.functions as F
from pyspark.conf import SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql import Row

COMMON_PATH = '/workspace/MLBD/common'

sys.path.append(os.path.join(COMMON_PATH, 'utils'))

os.environ['PYSPARK_SUBMIT_ARGS'] = """
--jars {common}/xgboost4j-spark-0.72.jar,{common}/xgboost4j-0.72.jar
--py-files {common}/sparkxgb.zip pyspark-shell
""".format(common=COMMON_PATH).replace('\n', ' ')

spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName("spark_sql_examples") \
    .config("spark.executor.memory", "15g") \
    .config("spark.driver.memory", "15g") \
    .getOrCreate()

sc = spark.sparkContext
sqlContext = SQLContext(sc)

from metrics import rocauc, logloss, ne, get_ate
from processing import split_by_col

from sparkxgb.xgboost import *

In [2]:
DATA_PATH = '/workspace/data/criteo'

TRAIN_PATH = os.path.join(DATA_PATH, 'train.csv')

In [3]:
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("delimiter", ",") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('file:///' + TRAIN_PATH)

**Remark** You don't have to limit yourself to half of the dataset and only two categorical variables. You can use more data if your configuration allows it.

In [4]:
df = df.sample(False, 0.5)

In [5]:
num_columns = ['_c{}'.format(i) for i in range(1, 14)]

In [6]:
df = df.fillna(0, subset=num_columns)

In [7]:
from pyspark.ml import PipelineModel

pipeline_model = PipelineModel.load(os.path.join(DATA_PATH, 'criteo_pipeline'))
pipeline_model.stages

[StringIndexer_770fc081fe99,
 StringIndexer_084702c33ebd,
 OneHotEncoderEstimator_0b8158f2a84a,
 VectorAssembler_34b2aef55ff0]

In [8]:
df = pipeline_model \
    .transform(df) \
    .select(F.col('_c0').alias('label'), 'features', 'id') \
    .cache()

df.count()

1099432

In [9]:
train_df, val_df, test_df = split_by_col(df, 'id', [0.8, 0.1, 0.1])

# [Spark MLlib Tuning](https://spark.apache.org/docs/latest/ml-tuning.html)

The HPO method built into Spark has two significant drawbacks that make it poorly suited to our task:

1. `ParamGridBuilder` - supports grid search only
2. `TrainValidationSplit` - splits the data randomly
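
To illustrate the second point, here is a minimal sketch of a non-random split, assuming the goal of the `split_by_col` helper used above is to cut the data along an ordered column (for example, a chronological `id`) rather than shuffle it. The implementation of `split_by_col` is not shown in this notebook, so `split_by_key` below is a hypothetical pure-Python stand-in, not the course helper.

```python
def split_by_key(rows, key, fractions):
    """Split rows into consecutive parts after sorting them by `key`.

    Hypothetical illustration only: earlier rows go to the first part,
    later rows to the last, so no "future" rows leak into training.
    """
    ordered = sorted(rows, key=key)
    n = len(ordered)
    parts, start, cumulative = [], 0, 0.0
    for i, fraction in enumerate(fractions):
        cumulative += fraction
        # The last part takes everything that is left to avoid rounding gaps.
        end = n if i == len(fractions) - 1 else int(round(cumulative * n))
        parts.append(ordered[start:end])
        start = end
    return parts


rows = [{'id': i} for i in reversed(range(10))]
train_rows, val_rows, test_rows = split_by_key(rows, lambda r: r['id'], [0.8, 0.1, 0.1])
print(len(train_rows), len(val_rows), len(test_rows))
```

With a random split, rows from the same time window end up on both sides of the boundary, which tends to make validation metrics look better than they would on genuinely new traffic.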

# [Hyperopt](https://github.com/hyperopt/hyperopt)

Install `hyperopt`:

In [13]:
!pip3.5 install hyperopt



## [XGBoost Tuning](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)

> [Notes on Parameter Tuning](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html)

### Objective function

In [10]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK, STATUS_FAIL
import scipy.stats as st


def objective(space):
    estimator = XGBoostEstimator(**space)
    print('SPACE:', estimator._input_kwargs_processed())
    # Training may fail occasionally, so allow one retry before giving up.
    model = None
    attempts = 0
    while model is None and attempts < 2:
        try:
            model = estimator.fit(train_df)
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')

    # If both attempts failed, mark the trial as failed
    # instead of crashing on model=None below.
    if model is None:
        return {'status': STATUS_FAIL}

    log_loss = logloss(model, val_df, probabilities_col='probabilities')
    roc_auc = rocauc(model, val_df, probabilities_col='probabilities')

    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    # hyperopt minimizes 'loss'; ROC-AUC is stored only for reference.
    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK}
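
The objective reports the validation log loss as `loss`, which is the quantity `fmin` minimizes. The `logloss` helper comes from the course `metrics` module and its code is not shown here; as a rough sketch of what such a metric computes, binary log loss can be written in plain Python as follows (`binary_log_loss` is a hypothetical name, and the clipping constant `eps` is an assumption).

```python
import math


def binary_log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip to avoid log(0) for over-confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# A model that always predicts 0.5 scores ln(2) regardless of the labels.
print(binary_log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5]))
```

Lower is better, which is why log loss can be passed to `fmin` directly, whereas ROC-AUC would have to be negated first.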

In [11]:
static_params = {
    'featuresCol': "features", 
    'labelCol': "label", 
    'predictionCol': "prediction",
    'eval_metric': 'logloss',
    'objective': 'binary:logistic',
    'nthread': 1,
    'silent': 0,
    'nworkers': 1
}

Fix baseline parameters and train baseline model

In [12]:
CONTROL_NAME = 'xgb baseline'

baseline_params = {
    'colsample_bytree': 0.9,
    'eta': 0.15,
    'gamma': 0.9,
    'max_depth': 6,
    'min_child_weight': 50.0,
    'subsample': 0.9,
    'num_round': 20
}

baseline_model = XGBoostEstimator(**{**static_params, **baseline_params}).fit(train_df)

In [13]:
baseline_rocauc = rocauc(baseline_model, val_df, probabilities_col='probabilities')
baseline_rocauc

0.7254314484043973

In [14]:
all_metrics = {}

In [15]:
baseline_test_metrics = {
    'logloss': logloss(baseline_model, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(baseline_model, test_df, probabilities_col='probabilities')
}

all_metrics[CONTROL_NAME] = baseline_test_metrics

In [16]:
all_metrics

{'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

### Tune number of trees

> Choose a relatively high learning rate. Generally a learning rate of 0.1 works but somewhere between 0.05 to 0.3 should work for different problems. Determine the optimum number of trees for this learning rate.
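
A toy calculation can show why the learning rate and the number of trees have to be tuned together. It assumes an idealized squared-error booster in which every round removes a fraction `eta` of the remaining residual; real XGBoost training on this dataset does not follow the idealization exactly, and `rounds_needed` is a hypothetical helper.

```python
import math


def rounds_needed(eta, tolerance):
    """Smallest n such that (1 - eta) ** n <= tolerance."""
    return math.ceil(math.log(tolerance) / math.log(1.0 - eta))


# Number of rounds needed to shrink the residual to 1% of its initial size.
for eta in (0.3, 0.1, 0.05):
    print(eta, rounds_needed(eta, tolerance=0.01))
```

Under this assumption, halving the learning rate roughly doubles the number of rounds required, which is why a relatively high rate is convenient while the other parameters are being explored.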

In [17]:
%%time

num_round_choice = [10, 20, 40, 100]
eta_choice = [0.10, 0.20, 0.30, 0.50]

space_1 = {
    # Optimize
    'num_round': hp.choice('num_round', num_round_choice),
    'eta': hp.choice('eta', eta_choice),
    
    # Fixed    
    'max_depth': baseline_params['max_depth'],
    'min_child_weight': baseline_params['min_child_weight'],
    'subsample': baseline_params['subsample'],
    'gamma': baseline_params['gamma'],
    'colsample_bytree': baseline_params['colsample_bytree'],
    
    **static_params
}


trials_1 = Trials()
best_1 = fmin(fn=objective,
            space=space_1,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_1)

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 6, 'eta': 0.1, 'min_child_weight': 50.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5039575938791325, ROC-AUC: 0.7335010950207554
SPACE:                                                                           
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 6, 'eta': 0.3, 'min_child_weight': 50.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 20}
LOG-LOSS: 0.5064149195992691, ROC-AUC: 0.7295467783451892                        
SPACE:                                                                           
{'objective': 'binary:lo

LOG-LOSS: 0.5100753139049946, ROC-AUC: 0.724693715689768                         
SPACE:                                                                           
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 6, 'eta': 0.3, 'min_child_weight': 50.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5011672284879687, ROC-AUC: 0.7368496304980969                        
SPACE:                                                                           
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 6, 'eta': 0.5, 'min_child_weight': 50.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 10}
LOG-LOSS: 0.5081996451387485, ROC-AUC: 0.72665764996899

In [18]:
best_1

{'eta': 2, 'num_round': 3}

Note that with `hp.choice`, the `best` variable holds not the hyperparameter value itself but its index in the list of options, e.g. in `num_round_choice`.
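
The index-to-value mapping can be wrapped in a small helper so that it does not have to be repeated for every parameter. The sketch below is plain Python; `decode_choice_indices` is a hypothetical name, and the sample dictionaries simply repeat the values from the cells above. hyperopt also provides `space_eval(space, best)`, which resolves the indices against the full search space.

```python
def decode_choice_indices(best, choices):
    """Map hp.choice indices returned by fmin to the actual option values."""
    return {name: choices[name][index] for name, index in best.items()}


best_example = {'eta': 2, 'num_round': 3}
choices_example = {
    'eta': [0.10, 0.20, 0.30, 0.50],
    'num_round': [10, 20, 40, 100],
}
decoded = decode_choice_indices(best_example, choices_example)
print(decoded)
```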

In [19]:
best_eta = eta_choice[best_1['eta']]  # change me!
best_num_round = num_round_choice[best_1['num_round']]  # change me!
best_eta, best_num_round

(0.3, 100)

In [20]:
params_1 = dict(baseline_params)  # copy, so baseline_params is not mutated
params_1['eta'] = best_eta
params_1['num_round'] = best_num_round
model_1 = XGBoostEstimator(**{**static_params, **params_1}).fit(train_df)

In [21]:
test_metrics_1 = {
    'logloss': logloss(model_1, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(model_1, test_df, probabilities_col='probabilities')
}

all_metrics['tuning_1'] = test_metrics_1

In [22]:
all_metrics

{'tuning_1': {'logloss': 0.5044681152479438, 'rocauc': 0.7372459385355252},
 'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

### Tune tree-specific parameters

> Tune tree-specific parameters ( max_depth, min_child_weight, gamma, subsample, colsample_bytree) for decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I’ll take up an example here.
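
The stage-by-stage procedure used in this notebook can be summarized as: tune one group of parameters, freeze the winners, then move to the next group. Below is a minimal sketch of that idea on a toy loss. It searches each stage exhaustively, whereas the notebook uses TPE via `fmin`; `stagewise_search` and `toy_loss` are hypothetical names introduced only for illustration.

```python
import itertools


def stagewise_search(evaluate, start, stages):
    """Tune parameter groups one stage at a time, keeping earlier winners fixed."""
    best = dict(start)
    best_loss = evaluate(best)
    for grid in stages:  # each stage maps parameter names to candidate values
        names = list(grid)
        for combo in itertools.product(*(grid[name] for name in names)):
            candidate = {**best, **dict(zip(names, combo))}
            loss = evaluate(candidate)
            if loss < best_loss:
                best, best_loss = candidate, loss
    return best, best_loss


def toy_loss(params):
    # Stand-in for "train a model and return validation log loss".
    return (params['max_depth'] - 10) ** 2 + (params['gamma'] - 3.0) ** 2


start = {'max_depth': 6, 'gamma': 0.9}
stages = [
    {'max_depth': [3, 6, 10, 15]},
    {'gamma': [0.1, 0.5, 0.9, 1.5, 3.0]},
]
best_params, best_loss = stagewise_search(toy_loss, start, stages)
print(best_params, best_loss)
```

The trade-off is that each stage ignores interactions with parameters tuned later, which is one motivation for a final joint search over all parameters.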

#### Tune max_depth, min_child_weight

In [25]:
%%time

depth_choice = [3, 6, 10, 15]
weight_choice = [0., 10., 25., 50., 100.]

space_2 = {
    # Optimize
    'max_depth': hp.choice('max_depth', depth_choice),
    'min_child_weight': hp.choice('min_child_weight', weight_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'subsample': baseline_params['subsample'],
    'gamma': baseline_params['gamma'],
    'colsample_bytree': baseline_params['colsample_bytree'],
    
    **static_params
}


trials_2 = Trials()
best_2 = fmin(fn=objective,
            space=space_2,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_2)
best_2

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5007839045727842, ROC-AUC: 0.7373997761667604
SPACE:                                                                             
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 15, 'eta': 0.3, 'min_child_weight': 100.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5021647109305086, ROC-AUC: 0.7355436306062344                          
SPACE:                                                                             
{'objective': 

{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 3, 'eta': 0.3, 'min_child_weight': 10.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5052819780829687, ROC-AUC: 0.731377078159076                            
SPACE:                                                                              
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 6, 'eta': 0.3, 'min_child_weight': 10.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5013273037973872, ROC-AUC: 0.7368454934630884                           
SPACE:                                                                              
{'objective': 'binary:logistic', 'gamma': 

{'max_depth': 2, 'min_child_weight': 2}

In [26]:
best_max_depth = depth_choice[best_2['max_depth']]
best_min_child_weight = weight_choice[best_2['min_child_weight']]
best_max_depth, best_min_child_weight

(10, 25.0)

In [27]:
params_2 = dict(params_1)  # copy, so params_1 is not mutated
params_2['max_depth'] = best_max_depth
params_2['min_child_weight'] = best_min_child_weight
model_2 = XGBoostEstimator(**{**static_params, **params_2}).fit(train_df)

In [28]:
test_metrics_2 = {
    'logloss': logloss(model_2, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(model_2, test_df, probabilities_col='probabilities')
}

all_metrics['tuning_2'] = test_metrics_2

In [29]:
all_metrics

{'tuning_1': {'logloss': 0.5044681152479438, 'rocauc': 0.7372459385355252},
 'tuning_2': {'logloss': 0.5039979025307851, 'rocauc': 0.7375641863487687},
 'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

#### Tune gamma

In [30]:
%%time

gamma_choice = [0.1, 0.5, 0.9, 1.5, 3.]

space_3 = {
    # Optimize
    'gamma': hp.choice('gamma', gamma_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'max_depth': best_max_depth,
    'min_child_weight': best_min_child_weight,
    'subsample': baseline_params['subsample'],
    'colsample_bytree': baseline_params['colsample_bytree'],
    
    **static_params
}


trials_3 = Trials()
best_3 = fmin(fn=objective,
            space=space_3,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_3)
best_3

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 0.1, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5004413346099772, ROC-AUC: 0.7378953490855421
SPACE:                                                                             
{'objective': 'binary:logistic', 'gamma': 0.9, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5007839045727842, ROC-AUC: 0.7373997761667603                          
SPACE:                                                                             
{'objective': '

{'objective': 'binary:logistic', 'gamma': 1.5, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5003646308649412, ROC-AUC: 0.7381636020952937                           
SPACE:                                                                              
{'objective': 'binary:logistic', 'gamma': 0.1, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5004413346099772, ROC-AUC: 0.7378953490855427                           
SPACE:                                                                              
{'objective': 'binary:logistic', 'gamma'

{'gamma': 4}

In [31]:
best_gamma = gamma_choice[best_3['gamma']]
best_gamma

3.0

In [32]:
params_3 = dict(params_2)  # copy, so params_2 is not mutated
params_3['gamma'] = best_gamma
model_3 = XGBoostEstimator(**{**static_params, **params_3}).fit(train_df)

In [33]:
test_metrics_3 = {
    'logloss': logloss(model_3, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(model_3, test_df, probabilities_col='probabilities')
}

all_metrics['tuning_3'] = test_metrics_3
all_metrics

{'tuning_1': {'logloss': 0.5044681152479438, 'rocauc': 0.7372459385355252},
 'tuning_2': {'logloss': 0.5039979025307851, 'rocauc': 0.7375641863487687},
 'tuning_3': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886989},
 'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

#### Tune subsample, colsample_bytree

In [34]:
%%time

subsample_choice = [0.05, 0.2, 0.5, 0.9]
colsample_bytree_choice = [0.05, 0.2, 0.5, 0.9]

space_4 = {
    # Optimize
    'subsample': hp.choice('subsample', subsample_choice),
    'colsample_bytree': hp.choice('colsample_bytree', colsample_bytree_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'max_depth': best_max_depth,
    'min_child_weight': best_min_child_weight,
    'gamma': best_gamma,
    
    **static_params
}


trials_4 = Trials()
best_4 = fmin(fn=objective,
            space=space_4,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_4)
best_4

SPACE:                                                
{'objective': 'binary:logistic', 'gamma': 3.0, 'predictionCol': 'prediction', 'subsample': 0.9, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.05, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.507657021520509, ROC-AUC: 0.7287942119625852
SPACE:                                                                         
{'objective': 'binary:logistic', 'gamma': 3.0, 'predictionCol': 'prediction', 'subsample': 0.05, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.2, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5144846272537228, ROC-AUC: 0.7184469731759479                      
SPACE:                                                                         
{'objective': 'binary:logi

LOG-LOSS: 0.5099170922686153, ROC-AUC: 0.7251827170017763                         
SPACE:                                                                            
{'objective': 'binary:logistic', 'gamma': 3.0, 'predictionCol': 'prediction', 'subsample': 0.2, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.2, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5041368920282794, ROC-AUC: 0.7328886952527355                         
SPACE:                                                                            
{'objective': 'binary:logistic', 'gamma': 3.0, 'predictionCol': 'prediction', 'subsample': 0.2, 'featuresCol': 'features', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'labelCol': 'label', 'colsample_bytree': 0.9, 'nthread': 1, 'eval_metric': 'logloss', 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5083744347261354, ROC-AUC: 0.7276655

{'colsample_bytree': 3, 'subsample': 3}

In [35]:
best_subsample = subsample_choice[best_4['subsample']]
best_colsample_bytree = colsample_bytree_choice[best_4['colsample_bytree']]
best_subsample, best_colsample_bytree

(0.9, 0.9)

In [36]:
params_4 = dict(params_3)  # copy, so params_3 is not mutated
params_4['subsample'] = best_subsample
params_4['colsample_bytree'] = best_colsample_bytree
model_4 = XGBoostEstimator(**{**static_params, **params_4}).fit(train_df)

In [37]:
test_metrics_4 = {
    'logloss': logloss(model_4, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(model_4, test_df, probabilities_col='probabilities')
}

all_metrics['tuning_4'] = test_metrics_4
all_metrics

{'tuning_1': {'logloss': 0.5044681152479438, 'rocauc': 0.7372459385355252},
 'tuning_2': {'logloss': 0.5039979025307851, 'rocauc': 0.7375641863487687},
 'tuning_3': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886989},
 'tuning_4': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886988},
 'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

### Tune regularization parameters

> Tune regularization parameters (lambda, alpha) for xgboost which can help reduce model complexity and enhance performance.
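
For reference, the role of these parameters can be read off the regularized objective that XGBoost optimizes for its tree booster, in the standard formulation (with $T$ the number of leaves of a tree and $w_j$ its leaf weights):

$$
\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

In the cell below, `alpha` corresponds to $\alpha$ (L1 penalty on leaf weights) and `reg_lambda` (logged as `lambda`) to $\lambda$ (L2 penalty); $\gamma$ is the `gamma` parameter tuned in the previous step.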

In [43]:
%%time

reg_alpha_choice = [0., 0.3, 0.7, 1.]
reg_lambda_choice = [0.5, 1., 1.5, 2.]

space_5 = {
    # Optimize
    'alpha': hp.choice('alpha', reg_alpha_choice),
    'reg_lambda': hp.choice('reg_lambda', reg_lambda_choice),
    
    # Fixed    
    'num_round': best_num_round,
    'eta': best_eta,
    'max_depth': best_max_depth,
    'min_child_weight': best_min_child_weight,
    'gamma': best_gamma,
    'subsample': best_subsample,
    'colsample_bytree': best_colsample_bytree,
    
    **static_params
}


trials_5 = Trials()
best_5 = fmin(fn=objective,
            space=space_5,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_5)
best_5

SPACE:                                                
{'objective': 'binary:logistic', 'colsample_bytree': 0.9, 'alpha': 0.0, 'predictionCol': 'prediction', 'subsample': 0.9, 'labelCol': 'label', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'featuresCol': 'features', 'gamma': 3.0, 'nthread': 1, 'eval_metric': 'logloss', 'lambda': 1.0, 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5003067012036071, ROC-AUC: 0.7381433515962914
SPACE:                                                                             
{'objective': 'binary:logistic', 'colsample_bytree': 0.9, 'alpha': 0.7, 'predictionCol': 'prediction', 'subsample': 0.9, 'labelCol': 'label', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'featuresCol': 'features', 'gamma': 3.0, 'nthread': 1, 'eval_metric': 'logloss', 'lambda': 0.5, 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5003262929228383, ROC-AUC: 0.7379924883487377                          
SPACE:                                   

LOG-LOSS: 0.5004136510165058, ROC-AUC: 0.7377636006466887                         
SPACE:                                                                              
{'objective': 'binary:logistic', 'colsample_bytree': 0.9, 'alpha': 1.0, 'predictionCol': 'prediction', 'subsample': 0.9, 'labelCol': 'label', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'featuresCol': 'features', 'gamma': 3.0, 'nthread': 1, 'eval_metric': 'logloss', 'lambda': 0.5, 'nworkers': 1, 'silent': 0, 'num_round': 100}
LOG-LOSS: 0.5001113606720224, ROC-AUC: 0.7383767947117016                           
SPACE:                                                                              
{'objective': 'binary:logistic', 'colsample_bytree': 0.9, 'alpha': 0.0, 'predictionCol': 'prediction', 'subsample': 0.9, 'labelCol': 'label', 'max_depth': 10, 'eta': 0.3, 'min_child_weight': 25.0, 'featuresCol': 'features', 'gamma': 3.0, 'nthread': 1, 'eval_metric': 'logloss', 'lambda': 1.5, 'nworkers': 1, 'silent': 0, 'n

{'alpha': 1, 'reg_lambda': 3}

In [44]:
best_alpha = reg_alpha_choice[best_5['alpha']]
best_lambda = reg_lambda_choice[best_5['reg_lambda']]
best_alpha, best_lambda

(0.3, 2.0)

In [45]:
params_5 = dict(params_4)  # copy, so params_4 is not mutated
params_5['alpha'] = best_alpha
params_5['reg_lambda'] = best_lambda
model_5 = XGBoostEstimator(**{**static_params, **params_5}).fit(train_df)

In [46]:
test_metrics_5 = {
    'logloss': logloss(model_5, test_df, probabilities_col='probabilities'),
    'rocauc': rocauc(model_5, test_df, probabilities_col='probabilities')
}

all_metrics['tuning_5'] = test_metrics_5
all_metrics

{'tuning_1': {'logloss': 0.5044681152479438, 'rocauc': 0.7372459385355252},
 'tuning_2': {'logloss': 0.5039979025307851, 'rocauc': 0.7375641863487687},
 'tuning_3': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886989},
 'tuning_4': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886988},
 'tuning_5': {'logloss': 0.5036142698997369, 'rocauc': 0.7380073382594455},
 'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

### Lower the learning rate and decide the optimal parameters

In [None]:
%%time

reg_alpha_choice_1 = [0., 0.3, 0.7, 1.]
reg_lambda_choice_1 = [0.5, 1., 1.5, 2.]
subsample_choice_1 = [0.2, 0.5, 0.9]
colsample_bytree_choice_1 = [0.2, 0.5, 0.9]
num_round_choice_1 = [100, 200]
depth_choice_1 = [6, 10]
weight_choice_1 = [25., 100.]
gamma_choice_1 = [1.5, 3.]

space_all = {
    # Optimize
    'alpha': hp.choice('alpha', reg_alpha_choice_1),
    'reg_lambda': hp.choice('reg_lambda', reg_lambda_choice_1),
    'num_round': hp.choice('num_round', num_round_choice_1),
    'max_depth': hp.choice('max_depth', depth_choice_1),
    'min_child_weight': hp.choice('min_child_weight', weight_choice_1),
    'gamma': hp.choice('gamma', gamma_choice_1),
    'subsample': hp.choice('subsample', subsample_choice_1),
    'colsample_bytree': hp.choice('colsample_bytree', colsample_bytree_choice_1),
    
    # Fixed
    'eta': 0.1,
    
    **static_params
}


trials_all = Trials()
best_all = fmin(fn=objective,
            space=space_all,
            algo=tpe.suggest,
            max_evals=30,
            trials=trials_all)
best_all

---
## LogisticRegression Tuning

Let's tune the hyperparameters of the logistic regression from the previous practice sessions.
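
As a reminder of what the two tuned parameters control, Spark's `LogisticRegression` minimizes an elastic-net regularized loss, which in the notation of the Spark documentation can be written as

$$
\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} L(w; x_i, y_i) + \lambda\left(\alpha \lVert w \rVert_1 + \frac{1-\alpha}{2}\lVert w \rVert_2^2\right)
$$

where $\lambda$ is `regParam` and $\alpha$ is `elasticNetParam`. When `regParam` is 0 the whole penalty vanishes, so the value of `elasticNetParam` no longer matters.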

In [62]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK, STATUS_FAIL
import scipy.stats as st
from pyspark.ml.classification import LogisticRegression


def objective_logreg(space):
    estimator = LogisticRegression(**space)
    print('SPACE:', estimator._input_kwargs)
    # Training may fail occasionally, so allow one retry before giving up.
    model = None
    attempts = 0
    while model is None and attempts < 2:
        try:
            model = estimator.fit(train_df)
        except Exception as e:
            attempts += 1
            print(e)
            print('Try again')

    # If both attempts failed, mark the trial as failed
    # instead of crashing on model=None below.
    if model is None:
        return {'status': STATUS_FAIL}

    log_loss = logloss(model, val_df, probabilities_col='probability')
    roc_auc = rocauc(model, val_df, probabilities_col='probability')

    print('LOG-LOSS: {}, ROC-AUC: {}'.format(log_loss, roc_auc))

    # hyperopt minimizes 'loss'; ROC-AUC is stored only for reference.
    return {'loss': log_loss, 'rocauc': roc_auc, 'status': STATUS_OK}

In [50]:
CONTROL_NAME_LR = 'lr baseline'

static_params_lr = {
    'featuresCol': "features", 
    'labelCol': "label", 
    'maxIter': 20,
}
baseline_params_lr = {
    'regParam': 0.,
    'elasticNetParam': 0.
}

baseline_model_lr = LogisticRegression(**{**static_params_lr, **baseline_params_lr}).fit(train_df)

In [51]:
baseline_test_metrics_lr = {
    'logloss': logloss(baseline_model_lr, test_df, probabilities_col='probability'),
    'rocauc': rocauc(baseline_model_lr, test_df, probabilities_col='probability')
}

all_metrics[CONTROL_NAME_LR] = baseline_test_metrics_lr
all_metrics

{'lr baseline': {'logloss': 0.5311425009219906, 'rocauc': 0.701220311243331},
 'tuning_1': {'logloss': 0.5044681152479438, 'rocauc': 0.7372459385355252},
 'tuning_2': {'logloss': 0.5039979025307851, 'rocauc': 0.7375641863487687},
 'tuning_3': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886989},
 'tuning_4': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886988},
 'tuning_5': {'logloss': 0.5036142698997369, 'rocauc': 0.7380073382594455},
 'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

In [65]:
%%time

regParam_choice = [0., 0.05, 0.1, 0.2]
elasticNetParam_choice = [0., 0.05, 0.1, 0.2]

space_lr = {
    'regParam': hp.choice('regParam', regParam_choice),
    'elasticNetParam': hp.choice('elasticNetParam', elasticNetParam_choice),
    
    **static_params_lr
}


trials_lr = Trials()
best_lr = fmin(fn=objective_logreg,
            space=space_lr,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials_lr)

SPACE:                                                
{'labelCol': 'label', 'maxIter': 20, 'featuresCol': 'features', 'regParam': 0.2, 'elasticNetParam': 0.05}
LOG-LOSS: 0.5505850480946579, ROC-AUC: 0.6895118862076596
SPACE:                                                                          
{'labelCol': 'label', 'maxIter': 20, 'featuresCol': 'features', 'regParam': 0.0, 'elasticNetParam': 0.0}
LOG-LOSS: 0.5282379136687766, ROC-AUC: 0.7026184672998435                       
SPACE:                                                                          
{'labelCol': 'label', 'maxIter': 20, 'featuresCol': 'features', 'regParam': 0.1, 'elasticNetParam': 0.2}
LOG-LOSS: 0.5530053395355741, ROC-AUC: 0.6818025752975112                       
SPACE:                                                                          
{'labelCol': 'label', 'maxIter': 20, 'featuresCol': 'features', 'regParam': 0.1, 'elasticNetParam': 0.05}
LOG-LOSS: 0.5411187606053796, ROC-AUC: 0.6955832666439783   

In [66]:
best_regParam = regParam_choice[best_lr['regParam']]
best_elasticNetParam = elasticNetParam_choice[best_lr['elasticNetParam']]
best_regParam, best_elasticNetParam

(0.0, 0.0)

In [67]:
params_lr = dict(baseline_params_lr)  # copy, so baseline_params_lr is not mutated
params_lr['regParam'] = best_regParam
params_lr['elasticNetParam'] = best_elasticNetParam
model_lr = LogisticRegression(**{**static_params_lr, **params_lr}).fit(train_df)

In [68]:
test_metrics_lr_1 = {
    'logloss': logloss(model_lr, test_df, probabilities_col='probability'),
    'rocauc': rocauc(model_lr, test_df, probabilities_col='probability')
}

all_metrics['lr tuning'] = test_metrics_lr_1
all_metrics

{'lr baseline': {'logloss': 0.5311425009219906, 'rocauc': 0.701220311243331},
 'lr tuning': {'logloss': 0.5311425009219906, 'rocauc': 0.7012203112433288},
 'tuning_1': {'logloss': 0.5044681152479438, 'rocauc': 0.7372459385355252},
 'tuning_2': {'logloss': 0.5039979025307851, 'rocauc': 0.7375641863487687},
 'tuning_3': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886989},
 'tuning_4': {'logloss': 0.5037899951124003, 'rocauc': 0.7377762262886988},
 'tuning_5': {'logloss': 0.5036142698997369, 'rocauc': 0.7380073382594455},
 'xgb baseline': {'logloss': 0.5136055353642643, 'rocauc': 0.7256245795728983}}

---
## Optional [MongoTrials](https://hyperopt.github.io/hyperopt/scaleout/mongodb/)

> For parallel search, hyperopt includes a MongoTrials implementation that supports asynchronous updates.

**TL;DR** Advantages of using `MongoTrials`:
* `MongoTrials` lets you run several evaluations of the objective function in parallel
* Dynamic level of parallelism: workers that evaluate the objective function can be added or removed on the fly
* All results are stored in the database, so the run history is never lost

*Completing this task earns an extra +0.4 toward the final grade*

### XGBoost Tuning

In [None]:
######################################
######### YOUR CODE HERE #############
######################################

# Results

Let's summarize.

Train the models with the (optimal) hyperparameters you found and compare them on the hold-out set.

Final table

In [23]:
get_ate(all_metrics, CONTROL_NAME)

|   | metric | xgb opt ate % |
|---|---|---|
| 0 | logloss | 0.0 |
| 1 | rocauc | -1.554312e-13 |