# Introduction to Optuna + RAPIDS

Optuna is a lightweight framework for hyperparameter optimization (HPO). It provides a define-by-run API, which makes it easy to adapt to code we already have: wrapping the training logic in an objective function is enough to run a parallel, distributed HPO search over a search space.

We'll explore how to use Optuna with RAPIDS to run HPO across multiple GPUs.

Notes:

1. Using the default SVM parameters ("rbf" kernel) with the full airline data results in `cudaErrorMemoryAllocation` (out of memory) when `fit` is called.
2. Even with one tenth of the data, the linear kernel hangs for a long time (is this a possible bug or expected behavior?).


In [1]:
import random
import time
from contextlib import contextmanager

import cudf
import cuml
import dask_cudf
import mlflow
import numpy as np
import optuna
import pandas as pd
import sklearn
from cuml.dask.common import utils as dask_utils
from cuml.metrics import accuracy_score
from cuml.preprocessing.model_selection import train_test_split
from dask.distributed import Client, wait, performance_report
from joblib import parallel_backend, Parallel, delayed

from sklearn.datasets import load_iris

from dask_cuda import LocalCUDACluster



In [2]:
# Helper function for timing blocks of code.
@contextmanager
def timed(name):
    t0 = time.time()
    yield
    t1 = time.time()
    print("..%-24s:  %8.4f" % (name, t1 - t0))
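
As written, `timed` prints nothing if the timed block raises, because the code after `yield` is skipped. The sketch below is a hypothetical variant, `timed_safe` (not part of the notebook), that assumes we also want to collect runtimes in a dictionary; it wraps the `yield` in `try`/`finally` and uses `time.perf_counter()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_safe(name, results=None):
    """Hypothetical variant of `timed` that reports even when the block fails."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - t0
        if results is not None:
            # Keep the runtime so it can be tabulated later.
            results[name] = elapsed
        print("..%-24s:  %8.4f" % (name, elapsed))


runtimes = {}
with timed_safe("sleep-demo", runtimes):
    time.sleep(0.05)
```

Collecting the values in `runtimes` makes it easier to build a summary table at the end of the notebook without copying numbers by hand.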

In [3]:
# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1, ip="", dashboard_address="8002")
c = Client(cluster)

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization
c

Client: Scheduler: tcp://172.17.0.2:46683, Dashboard: http://172.17.0.2:8002/status
Cluster: Workers: 2, Cores: 2, Memory: 49.16 GB


## Loading the data

We'll load the airline data from the path specified by `INPUT_FILE`. The goal is to predict whether a flight will be delayed, using the binary target variable `ArrDelayBinary`.

In [4]:
N_TRIALS = 10

INPUT_FILE = "/home/hyperopt/hyperopt/data/air_par.parquet"
df = cudf.read_parquet(INPUT_FILE)
X, y = df.drop(["ArrDelayBinary"], axis=1), df["ArrDelayBinary"].astype('int32')

# Training and Evaluation

Here, we define a `train_and_eval` function that fits a `RandomForestClassifier` (with `max_depth` and `n_estimators`) on the passed `X_param` and `y_param`. This function should look very similar for any ML workflow. We'll use it within the Optuna `objective` function to show how easily an existing workflow fits into Optuna.

In [5]:
def train_and_eval(X_param, y_param, max_depth=16, n_estimators=100):
    """
        Splits the given data into train and validation sets, then trains and
        evaluates the model with the given hyperparameters.
        
        Params
        ______
        
        X_param:  DataFrame. 
                  The data to use for training and testing. 
        y_param:  Series. 
                  The label for training
        max_depth, n_estimators: The values to use for max_depth and n_estimators for RFC.
                                 Defaults to 16 and 100 (the defaults for the classifiers used)
                   
        Returns
        score: Accuracy score of the fitted model
    """

    X_train, X_valid, y_train, y_valid = train_test_split(X_param, y_param, random_state=77)
    classifier = cuml.ensemble.RandomForestClassifier(max_depth=max_depth,
                     n_estimators=n_estimators)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_valid)
    score = accuracy_score(y_valid, y_pred)
    return score
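
For a quick smoke test on a machine without a GPU, the same workflow can be sketched with scikit-learn. This is an assumption-laden stand-in, not the notebook's method: `train_and_eval_cpu` is a hypothetical name, scikit-learn's `RandomForestClassifier` replaces the cuML one, and the small iris dataset replaces the airline data.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_eval_cpu(X_param, y_param, max_depth=16, n_estimators=100):
    """CPU stand-in for train_and_eval (hypothetical helper)."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_param, y_param, random_state=77
    )
    classifier = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_estimators, random_state=77
    )
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_valid)
    return accuracy_score(y_valid, y_pred)


# Iris is tiny, so this runs in well under a second.
X_iris, y_iris = load_iris(return_X_y=True)
cpu_score = train_and_eval_cpu(X_iris, y_iris, max_depth=4, n_estimators=20)
print("CPU smoke-test accuracy:", cpu_score)
```

The function keeps the same signature as `train_and_eval`, so an objective written against one can be pointed at the other while debugging.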

For a baseline number, let's see what the default performance of RFC is. Note the default values `max_depth` = 16 and `n_estimators` = 100; we pass these to the `train_and_eval` function.

In [6]:
print("Score with default parameters : ",train_and_eval(X, y, max_depth=16, n_estimators=100))

Score with default parameters :  0.8379114270210266


## Objective Function

The objective function is what we optimize in an Optuna study. It tries out different values for the parameters we are tuning, and Optuna records the results, which are available through `study.trials_dataframe()`.

Let's define the objective function for this HPO task using `train_and_eval()`. You can see that we simply choose a value for each parameter and call `train_and_eval`, which makes Optuna very easy to use in an existing workflow.

The objective stays the same across samplers. Samplers are the built-in sampling algorithms that Optuna provides; the available ones include GridSampler, RandomSampler, TPESampler, etc. We'll try out different samplers and compare their performance.

In [7]:
def objective(trial, X_param, y_param):
    max_depth = trial.suggest_int("max_depth", 10, 15)
    n_estimators = trial.suggest_int("n_estimators", 200, 700)
    score = train_and_eval(X_param, y_param, max_depth=max_depth,
                           n_estimators=n_estimators)
    return score

## HPO Trials and Study

Optuna uses [study](https://optuna.readthedocs.io/en/stable/reference/study.html) and [trials](https://optuna.readthedocs.io/en/stable/reference/trial.html) to keep track of the HPO experiments. 

We'll make use of a helper function `run_study` to help us run one multi-GPU study with a dask backend.

In [8]:
def run_study(sampler=optuna.samplers.TPESampler(),
              study_name="Optuna-MultiGPU",
              callbacks=None):
    
    with timed(study_name):
        study = optuna.create_study(sampler=sampler,
                                    study_name=study_name,
                                    storage="sqlite:///_"+study_name+".db",
                                    direction="maximize",
                                    load_if_exists=True)
        
        with parallel_backend("dask", n_jobs=n_workers, client=c, scatter=[X,y]):
            study.optimize(lambda trial: objective(trial, X, y),
                           n_trials=N_TRIALS,
                           n_jobs=n_workers,
                           callbacks=callbacks)
    print("Number of finished trials: ", len(study.trials))
    print("Best trial:")
    trial = study.best_trial
    print("  Value: ", trial.value)
    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))
    return study

In [9]:
name = "optuna-joblib-dask-backend"
with performance_report(filename=name+"-dask_report.html"):
    study_tpe = run_study(optuna.samplers.TPESampler(),
                          study_name=name)

[I 2020-07-02 19:17:55,658] A new study created with name: optuna-joblib-dask-backend


..optuna-joblib-dask-backend:  219.5348
Number of finished trials:  10
Best trial:
  Value:  0.8429234027862549
  Params: 
    max_depth: 15
    n_estimators: 662


In [10]:
name = "optuna-joblib-loky-backend"

with timed(name):
    study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                study_name=name,
                                storage="sqlite:///_"+name+".db",
                                direction="maximize",
                                load_if_exists=True)
    with parallel_backend("loky", n_jobs=n_workers):
        study.optimize(lambda trial: objective(trial, X, y),
                       n_trials=N_TRIALS,
                       n_jobs=n_workers)

[I 2020-07-02 19:21:36,133] A new study created with name: optuna-joblib-loky-backend


..optuna-joblib-loky-backend:  333.9446


In [11]:
name = "optuna-simple"
with timed("no-dask-no-joblib"):
    study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                study_name=name,
                                storage="sqlite:///_"+name+".db",
                                direction="maximize",
                                load_if_exists=True)
    study.optimize(lambda trial: objective(trial, X, y), n_trials=N_TRIALS)

[I 2020-07-02 19:27:10,102] A new study created with name: optuna-simple
[I 2020-07-02 19:27:39,800] Finished trial#0 with value: 0.830892026424408 with parameters: {'max_depth': 12, 'n_estimators': 478}. Best is trial#0 with value: 0.830892026424408.
[I 2020-07-02 19:28:05,566] Finished trial#1 with value: 0.8308805823326111 with parameters: {'max_depth': 12, 'n_estimators': 418}. Best is trial#0 with value: 0.830892026424408.
[I 2020-07-02 19:28:22,502] Finished trial#2 with value: 0.8308159708976746 with parameters: {'max_depth': 14, 'n_estimators': 213}. Best is trial#0 with value: 0.830892026424408.
[I 2020-07-02 19:28:41,470] Finished trial#3 with value: 0.8308290243148804 with parameters: {'max_depth': 12, 'n_estimators': 308}. Best is trial#0 with value: 0.830892026424408.
[I 2020-07-02 19:29:18,875] Finished trial#4 with value: 0.8310192227363586 with parameters: {'max_depth': 11, 'n_estimators': 694}. Best is trial#4 with value: 0.8310192227363586.
[I 2020-07-02 19:29:47,318]

..no-dask-no-joblib       :  275.6379


# Sequential calls without Optuna

For comparison, let's try sequential calls without Optuna and its parallel-processing support. We can clearly see that it takes more time to do this. We'll pick the same parameters as Optuna for a fair comparison: these parameters were selected by the sampling algorithm used by Optuna and are available in `study.trials_dataframe()` for us to pick out.

In [12]:
df = study_tpe.trials_dataframe()
params_max_depth, params_n_estimators = df['params_max_depth'], df['params_n_estimators']

### Sequential call function 

For a cleaner look, let's use a function to perform sequential calls. The function trains and evaluates the model with the parameters passed to it and returns the score together with those parameters, which can later be used to find the best-performing model.

In [13]:
def seq_call(X, y, max_depth, n_estimators):
    
    score = train_and_eval(X, y, max_depth=max_depth, n_estimators = n_estimators)
    
    return score, max_depth, n_estimators

In [14]:
name = "joblib-dask-backend"
with timed(name):
    with parallel_backend("dask", n_jobs=n_workers, client=c, scatter=[X,y]):
        results = Parallel()(delayed(seq_call)(X, y, max_depth=params_max_depth[i],
                     n_estimators=params_n_estimators[i]) for i in range(N_TRIALS))
    print(results)

[(0.830841600894928, 12, 460), (0.8307613730430603, 13, 567), (0.8310205936431885, 11, 531), (0.8307623863220215, 13, 362), (0.8307573795318604, 13, 414), (0.8429433703422546, 15, 662), (0.8341590166091919, 10, 588), (0.8307747840881348, 13, 325), (0.8308181762695312, 14, 391), (0.831036388874054, 11, 426)]
..joblib-dask-backend     :  177.4734
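
The `Parallel`/`delayed` pattern used above can be tried without a Dask cluster. The sketch below swaps in joblib's `threading` backend and a hypothetical `fake_seq_call` with an invented score, purely to show that results come back as tuples in the order the calls were submitted.

```python
from joblib import Parallel, delayed, parallel_backend


def fake_seq_call(max_depth, n_estimators):
    # Invented score; stands in for train_and_eval on the GPU.
    score = max_depth / 15 - abs(n_estimators - 500) / 1000
    return score, max_depth, n_estimators


param_pairs = [(12, 460), (13, 567), (11, 531)]

# Parallel() picks up the backend and n_jobs from the enclosing context.
with parallel_backend("threading", n_jobs=2):
    demo_results = Parallel()(
        delayed(fake_seq_call)(depth, n_est) for depth, n_est in param_pairs
    )

# The best run is the tuple with the highest score.
best_run = max(demo_results, key=lambda r: r[0])
print(demo_results)
print("best:", best_run)
```

Because each tuple carries its own parameters, the best configuration can be recovered with a single `max` over the results.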


Note: Running this without a dask backend is actually faster - takes about 65 seconds to finish by just making N_TRIALS sequential calls. Dask backend makes most sense when used with multi-GPU estimators as we see later in the notebook.

In [15]:
name = "sequential-calls"
with timed(name):
    for i in range(N_TRIALS):
        results = seq_call(X, y, max_depth=params_max_depth[i],
                     n_estimators=params_n_estimators[i])
    print(results)

(0.8310825824737549, 11, 426)
..sequential-calls        :  339.9721


# MLflow callback

Optuna integrates with various libraries. One of them is MLflow, a tracking library that we use here to keep track of the different HPO runs. We can add it to a study by passing a callback, as shown.

In [16]:
def mlflow_callback(study, trial):
    trial_value = trial.value if trial.value is not None else float("nan")
    with mlflow.start_run(run_name=study.study_name):
        print(trial.params)
#         mlflow.set_tracking_uri("http://127.0.0.1:5000")
        mlflow.log_params(trial.params)
        mlflow.log_metrics({"accuracy": trial_value})

In [17]:
name = "optuna-joblib-dask-backend-mlflow-callback"
study = run_study(optuna.samplers.TPESampler(),
                  study_name=name,
                  callbacks=[mlflow_callback])

[I 2020-07-02 19:40:23,263] A new study created with name: optuna-joblib-dask-backend-mlflow-callback


..optuna-joblib-dask-backend-mlflow-callback:  199.5456
Number of finished trials:  10
Best trial:
  Value:  0.8341314196586609
  Params: 
    max_depth: 10
    n_estimators: 464


# Multi-GPU estimators

We also have estimators that can run on multiple GPUs. `cuml.dask` provides a set of multi-GPU estimators that can run very fast. Let's try one out. To do this, we need to use `dask_cudf` dataframes, so we redefine the objective function from earlier to do just that.



In [27]:
def objective_mg(trial, X, y):
    max_depth = trial.suggest_int("max_depth", 10, 15)
    n_estimators = trial.suggest_int("n_estimators", 200, 700)
    
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=77)
    
    # Multi-GPU estimator from cuml.dask
    from cuml.dask.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(max_depth=max_depth,
                                        n_estimators=n_estimators)

    # Necessary conversions for cuml.dask.ensemble
    X_train_dask = dask_cudf.from_cudf(X_train, npartitions=2)
    X_valid_dask = dask_cudf.from_cudf(X_valid, npartitions=2)

    y_train_dask = dask_cudf.from_cudf(y_train, npartitions=2)
    y_valid_dask = dask_cudf.from_cudf(y_valid, npartitions=2)

    X_train_dask, X_valid_dask, \
    y_train_dask, y_valid_dask = dask_utils.persist_across_workers(c,[X_train_dask,
                                                                      X_valid_dask,
                                                                      y_train_dask,
                                                                      y_valid_dask],
                                                                      workers=workers)

    classifier.fit(X_train_dask, y_train_dask)
    y_pred = classifier.predict(X_valid_dask)
    score = accuracy_score(y_valid, y_pred.compute())
    return score


In [29]:
name = "optuna-mnmg-joblib"
with performance_report(filename=name+".html"):
    with timed(name):
        study = optuna.create_study(sampler=optuna.samplers.TPESampler(),
                                    study_name=name,
                                    storage="sqlite:///_"+name+".db",
                                    direction="maximize",
                                    load_if_exists=True)
        with parallel_backend("dask", n_jobs=n_workers, client=c, scatter=[X,y]):
            study.optimize(lambda trial: objective_mg(trial, X, y),
                               n_trials=N_TRIALS)

[I 2020-07-02 20:37:41,322] A new study created with name: optuna-mnmg-joblib
[I 2020-07-02 20:38:10,189] Finished trial#0 with value: 0.8307306170463562 with parameters: {'max_depth': 13, 'n_estimators': 652}. Best is trial#0 with value: 0.8307306170463562.
[I 2020-07-02 20:38:32,745] Finished trial#1 with value: 0.8307396173477173 with parameters: {'max_depth': 14, 'n_estimators': 468}. Best is trial#1 with value: 0.8307396173477173.
[I 2020-07-02 20:38:49,159] Finished trial#2 with value: 0.830734372138977 with parameters: {'max_depth': 12, 'n_estimators': 596}. Best is trial#1 with value: 0.8307396173477173.
[I 2020-07-02 20:38:56,884] Finished trial#3 with value: 0.830734372138977 with parameters: {'max_depth': 12, 'n_estimators': 280}. Best is trial#1 with value: 0.8307396173477173.
[I 2020-07-02 20:39:08,417] Finished trial#4 with value: 0.8307392001152039 with parameters: {'max_depth': 14, 'n_estimators': 238}. Best is trial#1 with value: 0.8307396173477173.
[I 2020-07-02 20:39

..optuna-mnmg-joblib      :  150.6013


## Summarizing the timing results

| Study name | Runtime (s) |
|---|---|
| Optuna-Multi-GPU-TPE | 267.5325 |
| Loky-Backend | 270.1898 |
| No-dask-No-Joblib | 289.1285 |
| Dask-no-Optuna | 175.0705 |
| No-Optuna-No-dask-Seq-Call | 315.9573 |
| Multi-GPU-Estimator | 218.8659 |

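We notice that with 2 GPUs, the multi-GPU estimator (218.87 s) ran faster than every other entry in the table except the Dask-backed run without Optuna (175.07 s). Its speedup over the remaining runs is roughly 1.2x to 1.4x, not 2x.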
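
To put numbers on this comparison, the ratios can be computed directly from the table. The sketch below copies the runtimes from the table above and divides each by the multi-GPU estimator's runtime; a ratio above 1 means the multi-GPU estimator was faster than that run.

```python
# Runtimes in seconds, copied from the summary table.
table_runtimes = {
    "Optuna-Multi-GPU-TPE": 267.5325,
    "Loky-Backend": 270.1898,
    "No-dask-No-Joblib": 289.1285,
    "Dask-no-Optuna": 175.0705,
    "No-Optuna-No-dask-Seq-Call": 315.9573,
    "Multi-GPU-Estimator": 218.8659,
}

baseline = table_runtimes["Multi-GPU-Estimator"]

# Ratio of each run's time to the multi-GPU estimator's time.
speedups = {
    name: seconds / baseline
    for name, seconds in table_runtimes.items()
    if name != "Multi-GPU-Estimator"
}

for name, ratio in sorted(speedups.items(), key=lambda item: item[1]):
    print("%-28s %.2fx" % (name, ratio))
```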