#LIBRARY IMPORTS

This notebook gives a brief demonstration of how to use the Hyperopt optimization library with the Tree-structured Parzen Estimator (TPE) algorithm.

The optimization is first run with a single worker, and the number of workers is then increased to 8.

In [0]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
import xgboost as xgb
 
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import warnings
import mlflow

In [0]:
# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

##**DEFINING THE FUNCTION TO MINIMIZE**

This example uses a classifier from the XGBoost library. The hyperparameter space consists of the following hyperparameters:

  * n_estimators
  * max_depth
  * learning_rate
  * gamma
  * min_child_weight
  * subsample
  * colsample_bytree
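
The cell below draws `max_depth` and `n_estimators` with `hp.choice` and the remaining values with `hp.quniform(label, low, high, q)`, which the Hyperopt documentation describes as returning a value like `round(uniform(low, high) / q) * q`. As a rough illustration of that rule only, here is a pure-Python sketch. `quniform_sample` is a hypothetical helper, not a Hyperopt function, and the bounds are the ones used for `learning_rate`.

```python
import random

def quniform_sample(low, high, q, rng):
    """Hypothetical helper mimicking hp.quniform: round(uniform(low, high) / q) * q."""
    return round(rng.uniform(low, high) / q) * q

# Seeded generator so the illustration is repeatable
rng = random.Random(0)

# Same bounds and step as 'learning_rate' in the search space
samples = [quniform_sample(0.01, 0.5, 0.01, rng) for _ in range(5)]
print(samples)
```

Because the result is always a float, a parameter such as `min_child_weight` comes back as `1.0` or `3.0` rather than as an integer.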

In [0]:
NUM_JOBS = 1
# Define the objective function
def objective(space):
    # Silence deprecation warnings raised while building and fitting the model
    warnings.filterwarnings(action='ignore', category=DeprecationWarning)
    # Create a classification model using xgboost
    classifier = xgb.XGBClassifier(n_estimators = space['n_estimators'],
                            max_depth = int(space['max_depth']),
                            learning_rate = space['learning_rate'],
                            gamma = space['gamma'],
                            min_child_weight = space['min_child_weight'],
                            subsample = space['subsample'],
                            colsample_bytree = space['colsample_bytree'],
                            n_jobs=NUM_JOBS
                            )
    
    # Note: cross_val_score below clones and refits the estimator on each fold,
    # so this fit does not influence the reported score
    classifier.fit(X_train, y_train, eval_metric='mlogloss')

    # Use the mean k-fold cross-validation accuracy to compare model performance
    accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 5, n_jobs = NUM_JOBS)
    CrossValMean = accuracies.mean()

    print("CrossValMean:", CrossValMean)
    # Hyperopt minimizes the objective function, while a higher accuracy means a better model,
    # so the loss is defined as the negated accuracy
    return {'loss': -CrossValMean, 'status': STATUS_OK}

# Define the hyperparameter space to search
space = {
    'max_depth' : hp.choice('max_depth', range(5, 30, 1)),
    'learning_rate' : hp.quniform('learning_rate', 0.01, 0.5, 0.01),
    'n_estimators' : hp.choice('n_estimators', range(20, 205, 5)),
    'gamma' : hp.quniform('gamma', 0, 0.50, 0.01),
    'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),
    'subsample' : hp.quniform('subsample', 0.1, 1, 0.01),
    'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1.0, 0.01)}

with mlflow.start_run():
    best = fmin(
           fn=objective,
           space=space,
           algo=tpe.suggest,
           max_evals=50)

  0%|          | 0/50 [00:02<?, ?trial/s, best loss=?]
CrossValMean: 0.32
  2%|▏         | 1/50 [00:03<02:02,  2.51s/trial, best loss: -0.32]
CrossValMean: 0.9400000000000001
  4%|▍         | 2/50 [00:06<01:08,  1.43s/trial, best loss: -0.9400000000000001]
CrossValMean: 0.95
  6%|▌         | 3/50 [00:08<01:39,  2.11s/trial, best loss: -0.95]
CrossValMean: 0.9400000000000001

In [0]:
print("Best: ", best)

Best:  {'colsample_bytree': 0.9, 'gamma': 0.02, 'learning_rate': 0.19, 'max_depth': 17, 'min_child_weight': 1.0, 'n_estimators': 21, 'subsample': 0.38}
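
Note that `n_estimators: 21` is not a value in `range(20, 205, 5)`. For parameters defined with `hp.choice`, `fmin` reports the index of the chosen option rather than the option itself, so `max_depth` and `n_estimators` above are positions in their ranges. Hyperopt provides `space_eval(space, best)` for this conversion. The sketch below does the lookup by hand, with the `best` dictionary copied from the output above, so it runs without Hyperopt installed.

```python
# Result printed by the single-worker run above
best = {'colsample_bytree': 0.9, 'gamma': 0.02, 'learning_rate': 0.19,
        'max_depth': 17, 'min_child_weight': 1.0, 'n_estimators': 21,
        'subsample': 0.38}

# Option lists exactly as declared with hp.choice in the search space
max_depth_options = list(range(5, 30, 1))
n_estimators_options = list(range(20, 205, 5))

# Replace each hp.choice index with the option it points to
decoded = dict(best)
decoded['max_depth'] = max_depth_options[best['max_depth']]
decoded['n_estimators'] = n_estimators_options[best['n_estimators']]

print("Decoded:", decoded)
```

Under that reading, the best configuration uses `max_depth = 22` and `n_estimators = 125`. The values drawn with `hp.quniform` are already actual parameter values and need no conversion.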


#DISTRIBUTED TRAINING

In [0]:
from hyperopt import SparkTrials

In [0]:
spark_trials = SparkTrials(parallelism = 8)
NUM_JOBS = 8
with mlflow.start_run():
    argmin = fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=50,
        trials=spark_trials)

Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs.
To view logs from trials, please check the Spark executor logs. To view executor logs, expand 'Spark Jobs' above until you see the (i) icon next to the stage from the trial job. Click it and find the list of tasks. Click the 'stderr' link for a task to view trial logs.
  0%|          | 0/50 [00:00<?, ?trial/s, best loss=?]
  2%|▏         | 1/50 [00:12<09:58, 12.21s/trial, best loss: -0.95]
 12%|█▏        | 6/50 [00:16<01:36,  2.20s/trial, best loss: -0.95]
 14%|█▍        | 7/50 [00:17<01:24,  1.96s/trial, best loss: -0.95]
 16%|█▌        | 8/50 [00:20<01:32,  2.20s/trial, best loss: -0.95]
 18%|█▊        | 9/50 [00:22<01:28,  2.15s/trial, best loss: -0.95]
 22%|██▏       | 11/50 [00:24<01:05,  1.69s/trial, best loss: -0.95]
 26%|██▌       | 13/50 [00:25<00:46,

In [0]:
print("Best: ", argmin)

Best:  {'colsample_bytree': 0.86, 'gamma': 0.29, 'learning_rate': 0.06, 'max_depth': 22, 'min_child_weight': 3.0, 'n_estimators': 22, 'subsample': 0.81}
