# Automating Model Creation with the DeepHyper Framework

The [DeepHyper](https://deephyper.readthedocs.io/en/latest/index.html) framework (Distributed Neural Architecture and Hyperparameter Optimization for Machine Learning) provides automated tuning of the hyperparameters of machine learning models, and even of the architecture of (simple) neural networks.

Source notebook: https://deephyper.readthedocs.io/en/latest/_sources/tutorials/tutorials/colab/AutoML_with_Sklearn.ipynb

# Automated Machine Learning with Scikit-Learn

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deephyper/tutorials/blob/main/tutorials/colab/AutoML_with_Sklearn.ipynb)

In this tutorial, we show how to automatically search among different machine learning algorithms from [Scikit-Learn](https://scikit-learn.org/stable/). Automated machine learning only requires the user to link the data with a predefined problem and the `run` function that we provide.

Let us start by installing DeepHyper.

In [None]:
!pip install deephyper["popt"]
!pip install ray

Collecting deephyper[popt]
  Downloading deephyper-0.7.0-py2.py3-none-any.whl (353 kB)
Collecting ConfigSpace>=0.4.20 (from deephyper[popt])
  Downloading ConfigSpace-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB)
Collecting Jinja2<3.1 (from deephyper[popt])
  Downloading Jinja2-3.0.3-py3-none-any.whl (133 kB)
Collecting parse (from deephyper[popt])
  Downloading parse-1.20.1-py2.py3-none-any.whl (20 kB)
Collecting pymoo>=0.6.0 (from deephyper[popt])
  Downloading pymoo-0.6.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.1 MB)

## Classification

In this part of the tutorial we focus on the classification case.

Create a `run` function that trains and evaluates the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation accuracy), which the search algorithm will maximize. For *automated machine learning* we use the `run` function provided at `deephyper.sklearn.classifier.run_autosklearn1` and wrap it with our data as follows:

In [None]:
from deephyper.sklearn.classifier import run_autosklearn1


def load_data():
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
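
To make the contract of a `run` function concrete (a configuration goes in, a single score to maximize comes out), here is a minimal sketch in plain Python. It does not use DeepHyper or Scikit-Learn; `load_toy_data`, `knn_predict`, and `toy_run` are hypothetical names, and the tiny nearest-neighbour model only stands in for whatever `run_autosklearn1` actually trains.

```python
# Toy stand-in for a run-function: config in, scalar score out.
# Everything here is hypothetical and independent of DeepHyper.

def load_toy_data():
    # Tiny 1-D dataset: the label is 1 when the feature is above 5.
    X_train = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
    y_train = [0, 0, 0, 0, 1, 1, 1, 1]
    X_valid = [0.5, 4.5, 5.5, 9.5]
    y_valid = [0, 0, 1, 1]
    return (X_train, y_train), (X_valid, y_valid)


def knn_predict(x, X_train, y_train, n_neighbors):
    # Majority vote among the n_neighbors closest training points
    # (ties are resolved in favour of class 0).
    nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
    votes = [y_train[i] for i in nearest[:n_neighbors]]
    return int(sum(votes) * 2 > len(votes))


def toy_run(config):
    (X_train, y_train), (X_valid, y_valid) = load_toy_data()
    k = config["n_neighbors"]
    hits = sum(
        knn_predict(x, X_train, y_train, k) == y
        for x, y in zip(X_valid, y_valid)
    )
    return hits / len(y_valid)  # validation accuracy, higher is better


score_small_k = toy_run({"n_neighbors": 1})
score_large_k = toy_run({"n_neighbors": 8})
print(score_small_k, score_large_k)
```

Two configurations give two different scores, and that difference is the only signal a search algorithm has to work with.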

We are ready to go! But, let us look at the problem provided by DeepHyper in `deephyper.sklearn.classifier.problem_autosklearn1` to understand better what is happening under the hood.

In [None]:
from deephyper.sklearn.classifier import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    classifier, Type: Categorical, Choices: {RandomForest, Logistic, AdaBoost, KNeighbors, MLP, SVC, XGBoost}, Default: RandomForest
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
  Conditions:
    (C | classifier == 'Logistic' || C | classifier == 'SVC')
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | classifier == 'RandomForest' || n_estimators | classifier == 'AdaBoost')
    a
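
The `Conditions` section says that some hyperparameters are only active for particular values of others. The sketch below encodes just the three conditions that are visible in the output above; the listing is cut off, so any further conditions are deliberately left out. `active_hyperparameters` is a hypothetical helper, not part of DeepHyper or ConfigSpace.

```python
# Conditions copied from the printed problem. The printed output is
# truncated, so conditions that are not visible are NOT modelled here.

def active_hyperparameters(config):
    active = {"classifier"}
    # C | classifier == 'Logistic' || C | classifier == 'SVC'
    if config["classifier"] in ("Logistic", "SVC"):
        active.add("C")
    # n_estimators | classifier == 'RandomForest' || ... == 'AdaBoost'
    if config["classifier"] in ("RandomForest", "AdaBoost"):
        active.add("n_estimators")
    # gamma | kernel == 'rbf' || ... == 'poly' || ... == 'sigmoid'
    if config.get("kernel") in ("rbf", "poly", "sigmoid"):
        active.add("gamma")
    return active


svc_linear = active_hyperparameters({"classifier": "SVC", "kernel": "linear"})
svc_rbf = active_hyperparameters({"classifier": "SVC", "kernel": "rbf"})
forest = active_hyperparameters({"classifier": "RandomForest"})
print(sorted(svc_linear), sorted(svc_rbf), sorted(forest))
```

`kernel` itself is not added to the active set because its own condition is not visible in the truncated listing.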

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [None]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(run,
                 method="ray",
                 method_kwargs={
                     "address": None,
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [TqdmCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

2024-05-09 12:21:10,111	INFO worker.py:1749 -- Started a local Ray instance.


Number of workers:  1
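
To make the role of an evaluator concrete, here is a rough standard-library sketch of the submit/gather pattern: configurations are handed to a pool of workers and the scores are collected as they finish, with timestamps measured relative to a common start. It uses threads rather than Ray, and `toy_run` and `evaluate_all` are hypothetical names. This is not how DeepHyper implements `Evaluator`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_run(config):
    # Hypothetical objective: pretend that a smaller `alpha` scores higher.
    time.sleep(0.01)
    return 1.0 - config["alpha"]


def evaluate_all(configs, num_workers=1):
    t0 = time.monotonic()  # reference point for the relative timestamps
    records = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = {}
        for job_id, config in enumerate(configs):
            submitted = time.monotonic() - t0
            futures[pool.submit(toy_run, config)] = (job_id, config, submitted)
        for future in as_completed(futures):
            job_id, config, submitted = futures[future]
            records.append({
                "job_id": job_id,
                "objective": future.result(),
                "timestamp_submit": submitted,
                "timestamp_gather": time.monotonic() - t0,
                **config,
            })
    return sorted(records, key=lambda r: r["job_id"])


records = evaluate_all([{"alpha": 0.1}, {"alpha": 0.5}], num_workers=2)
for record in records:
    print(record)
```

With `num_workers=1`, as in the cell above, the jobs simply run one after another.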




Finally, you can define a Bayesian optimization search called `CBO` (for Centralized Bayesian Optimization) and link to it the defined `problem_autosklearn1` and `evaluator`.

In [None]:
from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator, log_dir="exp-automl-2")

In [None]:
results = search.search(100)

(pid=852) 2024-05-09 12:21:27.390637: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(pid=852) 2024-05-09 12:21:27.390715: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(pid=852) 2024-05-09 12:21:27.392271: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


  0%|          | 0/100 [00:00<?, ?it/s]

Once the search is over, a file named `results.csv` is saved in the search's `log_dir` (here `exp-automl-2`; the current directory when `log_dir` is not set). The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search (columns prefixed with `p:`) and their corresponding `objective` value (i.e., validation accuracy), together with `m:timestamp_submit`, the time when the evaluator submitted the configuration for evaluation, and `m:timestamp_gather`, the time when the evaluator received the evaluated configuration back (both are relative to the creation of the `Evaluator` instance).

In [None]:
results

| | p:classifier | p:C | p:alpha | p:kernel | p:max_depth | p:n_estimators | p:n_neighbors | p:gamma | objective | job_id | m:timestamp_submit | m:timestamp_gather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistic | 0.000986 | 0.000010 | linear | 2 | 1 | 1 | 0.00001 | 0.893617 | 0 | 13.615002 | 18.204368 |
| 1 | KNeighbors | 0.000010 | 0.000010 | linear | 2 | 1 | 41 | 0.00001 | 0.946809 | 1 | 19.699191 | 19.779655 |
| 2 | RandomForest | 0.000010 | 0.000010 | linear | 48 | 51 | 1 | 0.00001 | 0.957447 | 2 | 21.577878 | 21.774964 |
| 3 | Logistic | 0.000341 | 0.000010 | linear | 2 | 1 | 1 | 0.00001 | 0.819149 | 3 | 24.064573 | 24.094181 |
| 4 | SVC | 0.000063 | 0.000010 | linear | 2 | 1 | 1 | 0.00001 | 0.643617 | 4 | 25.543958 | 25.572419 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | MLP | 0.000010 | 2.015998 | linear | 2 | 1 | 1 | 0.00001 | 0.989362 | 95 | 214.286762 | 214.659747 |
| 96 | MLP | 0.000010 | 2.811723 | linear | 2 | 1 | 1 | 0.00001 | 0.989362 | 96 | 216.015169 | 216.391823 |
| 97 | MLP | 0.000010 | 2.028841 | linear | 2 | 1 | 1 | 0.00001 | 0.989362 | 97 | 218.398907 | 218.942001 |
| 98 | MLP | 0.000010 | 4.174028 | linear | 2 | 1 | 1 | 0.00001 | 0.989362 | 98 | 221.376537 | 221.944098 |
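
Since both timestamps are relative to the same origin, their difference gives the wall-clock time spent on each evaluation. The sketch below works this out with the standard library on the first three rows of the table above, keeping only the relevant columns. On the actual dataframe the same quantity would be `results["m:timestamp_gather"] - results["m:timestamp_submit"]`.

```python
import csv
import io

# First three rows copied from the results table (columns trimmed).
csv_text = """job_id,p:classifier,objective,m:timestamp_submit,m:timestamp_gather
0,Logistic,0.893617,13.615002,18.204368
1,KNeighbors,0.946809,19.699191,19.779655
2,RandomForest,0.957447,21.577878,21.774964
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Duration of each evaluation in seconds: gather time minus submit time.
durations = {
    int(r["job_id"]): float(r["m:timestamp_gather"]) - float(r["m:timestamp_submit"])
    for r in rows
}
slowest_job = max(durations, key=durations.get)
print(durations, slowest_job)
```

Here the first job takes about 4.6 seconds while the next two finish in well under a second. Start-up cost is one plausible explanation, though the table alone does not confirm it.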


Now that we have the full list of results we can display the top-3.

In [None]:
results.nlargest(n=3, columns="objective")

| | p:classifier | p:C | p:alpha | p:kernel | p:max_depth | p:n_estimators | p:n_neighbors | p:gamma | objective | job_id | m:timestamp_submit | m:timestamp_gather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | MLP | 1e-05 | 1.866688 | linear | 2 | 1 | 1 | 1e-05 | 0.989362 | 25 | 64.895742 | 65.439903 |
| 27 | MLP | 1e-05 | 1.81897 | linear | 2 | 1 | 1 | 1e-05 | 0.989362 | 27 | 70.329185 | 70.682711 |
| 28 | MLP | 1e-05 | 1.793305 | linear | 2 | 1 | 1 | 1e-05 | 0.989362 | 28 | 72.017057 | 72.39227 |
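
A best row can be turned back into a configuration by keeping only the `p:` columns and dropping the prefix. The sketch below does this for the first row of the top-3 table, written out as a plain dict; `row_to_config` is a hypothetical helper. On the real dataframe, such a row could be obtained with `results.loc[results["objective"].idxmax()].to_dict()`.

```python
# Best row copied from the top-3 table above, as a plain dict.
best_row = {
    "p:classifier": "MLP", "p:C": 1e-05, "p:alpha": 1.866688,
    "p:kernel": "linear", "p:max_depth": 2, "p:n_estimators": 1,
    "p:n_neighbors": 1, "p:gamma": 1e-05,
    "objective": 0.989362, "job_id": 25,
}


def row_to_config(row):
    # Keep only hyperparameter columns and strip the "p:" prefix.
    return {key[2:]: value for key, value in row.items() if key.startswith("p:")}


best_config = row_to_config(best_row)
print(best_config)
```

The resulting keys match the hyperparameter names listed in the problem, which is presumably the shape the `run` function sees in `config`.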


## Regression

In this part of the tutorial we focus on the regression case.

Create a `run` function that trains and evaluates the model corresponding to the configuration generated by the search. This function has to return a scalar value (typically, validation $R^2$), which the search algorithm will maximize. For *automated machine learning* we use the `run` function provided at `deephyper.sklearn.regressor.run_autosklearn1` and wrap it with our data as follows:

In [None]:
from deephyper.sklearn.regressor import run_autosklearn1


def load_data():
    from sklearn.datasets import fetch_california_housing

    X, y = fetch_california_housing(return_X_y=True)
    return X, y


def run(config):
    return run_autosklearn1(config, load_data)
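
For reference, the coefficient of determination is $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations from the mean target. A minimal pure-Python version is sketched below; `r_squared` is a hypothetical helper written for illustration, not the function DeepHyper calls internally. It shows that a perfect model scores 1, a constant mean predictor scores 0, and a badly diverged model can score arbitrarily far below 0.

```python
def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total variance
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)         # predictions equal the targets
baseline = r_squared(y_true, [2.5] * 4)     # always predict the mean
diverged = r_squared(y_true, [100.0] * 4)   # predictions far off the scale
print(perfect, baseline, diverged)
```

This is why very large negative `objective` values can show up in a regression search: they indicate a model whose predictions are far worse than predicting the mean.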

We are ready to go! But, let us look at the problem provided by DeepHyper to understand better what is happening under the hood.

In [None]:
from deephyper.sklearn.regressor import problem_autosklearn1

problem_autosklearn1

Configuration space object:
  Hyperparameters:
    C, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    alpha, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    gamma, Type: UniformFloat, Range: [1e-05, 10.0], Default: 0.01, on log-scale
    kernel, Type: Categorical, Choices: {linear, poly, rbf, sigmoid}, Default: linear
    max_depth, Type: UniformInteger, Range: [2, 100], Default: 14, on log-scale
    n_estimators, Type: UniformInteger, Range: [1, 2000], Default: 45, on log-scale
    n_neighbors, Type: UniformInteger, Range: [1, 100], Default: 50
    regressor, Type: Categorical, Choices: {RandomForest, Linear, AdaBoost, KNeighbors, MLP, SVR, XGBoost}, Default: RandomForest
  Conditions:
    (gamma | kernel == 'rbf' || gamma | kernel == 'poly' || gamma | kernel == 'sigmoid')
    (n_estimators | regressor == 'RandomForest' || n_estimators | regressor == 'AdaBoost')
    C | regressor == 'SVR'
    alpha | regressor == 'MLP'
    kernel | r
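
Several ranges above are marked "on log-scale", which means values are spread uniformly over orders of magnitude rather than over the raw interval. Here is a minimal sketch of log-uniform sampling for the range `[1e-05, 10.0]`. `sample_log_uniform` is a hypothetical helper, and this shows the general idea only, not how the Bayesian search actually proposes points.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    # Draw uniformly in log-space, then map back to the original scale.
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_log_uniform(1e-05, 10.0, rng) for _ in range(10_000)]

# Fraction of samples below 0.01, the geometric midpoint of the range.
below_default = sum(s < 0.01 for s in samples) / len(samples)
print(min(samples), max(samples), below_default)
```

About half of the samples fall below 0.01, the geometric midpoint of the range (since $\sqrt{10^{-5} \cdot 10} = 10^{-2}$), which is also the listed default for `C`, `alpha`, and `gamma`.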

Create an `Evaluator` object using the `ray` backend to distribute the evaluation of the run-function defined previously.

In [None]:
from deephyper.evaluator import Evaluator
from deephyper.evaluator.callback import TqdmCallback

evaluator = Evaluator.create(run,
                 method="ray",
                 method_kwargs={
                     "address": None,
                     "num_cpus": 1,
                     "num_cpus_per_task": 1,
                     "callbacks": [TqdmCallback()]
                 })

print("Number of workers: ", evaluator.num_workers)

Number of workers:  1


Finally, you can define a Bayesian optimization search called `CBO` (for Centralized Bayesian Optimization) and link to it the defined `Problem` and `evaluator`.

In [None]:
from deephyper.search.hps import CBO

search = CBO(problem_autosklearn1, evaluator)

In [None]:
results = search.search(10)

  0%|          | 0/10 [00:00<?, ?it/s]

Once the search is over, a file named `results.csv` is saved in the current directory. The same dataframe is returned by the `search.search(...)` call. It contains the hyperparameter configurations evaluated during the search (columns prefixed with `p:`) and their corresponding `objective` value (i.e., validation $R^2$), together with `m:timestamp_submit`, the time when the evaluator submitted the configuration for evaluation, and `m:timestamp_gather`, the time when the evaluator received the evaluated configuration back (both are relative to the creation of the `Evaluator` instance).

In [None]:
results

| | p:regressor | p:C | p:alpha | p:kernel | p:max_depth | p:n_estimators | p:n_neighbors | p:gamma | objective | job_id | m:timestamp_submit | m:timestamp_gather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Linear | 1e-05 | 1e-05 | linear | 2 | 1 | 1 | 1e-05 | 0.597049 | 0 | 3.279246 | 4.717617 |
| 1 | KNeighbors | 1e-05 | 1e-05 | linear | 2 | 1 | 41 | 1e-05 | 0.666496 | 1 | 6.000432 | 6.899045 |
| 2 | RandomForest | 1e-05 | 1e-05 | linear | 48 | 51 | 1 | 1e-05 | 0.80251 | 2 | 8.364354 | 15.648725 |
| 3 | RandomForest | 1e-05 | 1e-05 | linear | 7 | 245 | 1 | 1e-05 | 0.719056 | 3 | 16.831398 | 30.98126 |
| 4 | SVR | 6.3e-05 | 1e-05 | linear | 2 | 1 | 1 | 1e-05 | 0.322115 | 4 | 32.191929 | 43.975306 |
| 5 | SVR | 1.6e-05 | 1e-05 | sigmoid | 2 | 1 | 1 | 0.00418 | -0.059354 | 5 | 45.459546 | 64.851586 |
| 6 | SVR | 0.422234 | 1e-05 | sigmoid | 2 | 1 | 1 | 2.779419 | -321050.500503 | 6 | 66.335866 | 96.026325 |
| 7 | RandomForest | 1e-05 | 1e-05 | linear | 91 | 15 | 1 | 1e-05 | 0.796552 | 7 | 97.198347 | 98.896851 |
| 8 | MLP | 1e-05 | 1.350762 | linear | 2 | 1 | 1 | 1e-05 | 0.708333 | 8 | 100.079646 | 107.838815 |
| 9 | MLP | 1e-05 | 0.033863 | linear | 2 | 1 | 1 | 1e-05 | 0.771833 | 9 | 109.103457 | 117.608881 |


Now that we have the full list of results we can display the top-3.

In [None]:
results.nlargest(n=3, columns="objective")

| | p:regressor | p:C | p:alpha | p:kernel | p:max_depth | p:n_estimators | p:n_neighbors | p:gamma | objective | job_id | m:timestamp_submit | m:timestamp_gather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | RandomForest | 1e-05 | 1e-05 | linear | 48 | 51 | 1 | 1e-05 | 0.80251 | 2 | 8.364354 | 15.648725 |
| 7 | RandomForest | 1e-05 | 1e-05 | linear | 91 | 15 | 1 | 1e-05 | 0.796552 | 7 | 97.198347 | 98.896851 |
| 9 | MLP | 1e-05 | 0.033863 | linear | 2 | 1 | 1 | 1e-05 | 0.771833 | 9 | 109.103457 | 117.608881 |
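
A top-3 list can be dominated by a single model family, so it may also help to look at the best score per regressor. The sketch below does that in plain Python on the `(regressor, objective)` pairs copied from the ten rows of the full results table. On the dataframe itself, `results.groupby("p:regressor")["objective"].max()` would give the same numbers.

```python
# (regressor, objective) pairs copied from the ten evaluated configurations.
runs = [
    ("Linear", 0.597049), ("KNeighbors", 0.666496), ("RandomForest", 0.80251),
    ("RandomForest", 0.719056), ("SVR", 0.322115), ("SVR", -0.059354),
    ("SVR", -321050.500503), ("RandomForest", 0.796552), ("MLP", 0.708333),
    ("MLP", 0.771833),
]

# Keep the highest objective seen for each regressor.
best_by_regressor = {}
for name, objective in runs:
    if name not in best_by_regressor or objective > best_by_regressor[name]:
        best_by_regressor[name] = objective

# Rank regressors by their best score, highest first.
ranking = sorted(best_by_regressor.items(), key=lambda kv: kv[1], reverse=True)
for name, objective in ranking:
    print(f"{name:>12}: {objective:.6f}")
```

With only ten evaluations this ranking is indicative at best; a longer search would be needed before drawing conclusions about any one regressor.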
