# Introduction to Optuna + RAPIDS

Optuna is a lightweight framework for hyperparameter optimization (HPO). It uses a define-by-run API, which makes it easy to adapt to code we already have: wrapping the training logic in an objective function is enough to run a parallel, distributed HPO search over a search space.

We'll explore how to use Optuna with RAPIDS to run HPO across multiple GPUs.

In [1]:
import cudf
import dask.array as da
from cuml.preprocessing.model_selection import train_test_split
from sklearn.datasets import load_iris

import pandas as pd
import optuna
import numpy as np
import mlflow
import cuml
from cuml.ensemble import RandomForestClassifier
import sklearn
from cuml.metrics import accuracy_score

import random
import time

from joblib import parallel_backend

In [2]:
from contextlib import contextmanager
import time

@contextmanager
def timed(name):
    t0 = time.time()
    yield
    t1 = time.time()
    print("..%-24s:  %8.4f" % (name, t1 - t0))

In [3]:
N_TRIALS = 10
INPUT_FILE = "/home/hyperopt/data/air_par.parquet"


In [4]:
import time

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

from cuml.dask.common import utils as dask_utils

# This will use all GPUs on the local host by default
cluster = LocalCUDACluster(threads_per_worker=1)
c = Client(cluster)

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization

In [5]:
df = cudf.read_parquet(INPUT_FILE)
X, y = df.drop(["ArrDelayBinary"], axis=1), df["ArrDelayBinary"].astype('int32')

In [6]:
def print_results(study):
    print("Number of finished trials: ", len(study.trials))

    print("Best trial:")
    trial = study.best_trial

    print("  Value: ", trial.value)

    print("  Params: ")
    for key, value in trial.params.items():
        print("    {}: {}".format(key, value))

# Defining the objective function

We will define an objective function for the RandomForestClassifier that searches over max_depth and n_estimators.

This function will remain constant across the different samplers. Samplers are the sampling algorithms built into Optuna, and one is selected when a study is created. The available ones include GridSampler, RandomSampler, and TPESampler, among others. We'll try out different samplers and compare their performance.

In [7]:
def objective(trial):
    # Search space: tree depth and number of trees in the forest.
    max_depth = trial.suggest_int("max_depth", 5, 7)
    n_estimators = trial.suggest_int("n_estimators", 100, 500)

    classifier = RandomForestClassifier(max_depth=max_depth,
                         n_estimators=n_estimators)


    X_train, X_valid, y_train, y_valid = train_test_split(X, y)

    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_valid)
    score = accuracy_score(y_valid, y_pred)
    
    return score

In [8]:
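def run_study(sampler=None, study_name="Optuna-MultiGPU"):
    # Create the default sampler per call rather than once at definition time.
    if sampler is None:
        sampler = optuna.samplers.TPESampler()
    with timed("multi-gpu"):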
        study = optuna.create_study(sampler=sampler,
                                    study_name=study_name,
                                    storage="sqlite:///optuna_mg_db.db",
                                    direction="maximize",
                                    load_if_exists=True)
        with parallel_backend("dask", n_jobs=n_workers):
            study.optimize(objective, n_trials=N_TRIALS, n_jobs=n_workers)
    print_results(study)
    return study

In [9]:
study_tpe = run_study(optuna.samplers.TPESampler(),study_name="Optuna-MultiGPU-TPE")

[I 2020-06-24 16:11:11,396] A new study created with name: Optuna-MultiGPU-TPE


..multi-gpu               :   91.3055
Number of finished trials:  10
Best trial:
  Value:  0.8312094211578369
  Params: 
    max_depth: 7
    n_estimators: 272


In [10]:
study_cmae = run_study(optuna.samplers.CmaEsSampler(), study_name="Optuna-MultiGPU-CMAE")

[I 2020-06-24 16:12:42,729] A new study created with name: Optuna-MultiGPU-CMAE


..multi-gpu               :   88.3086
Number of finished trials:  10
Best trial:
  Value:  0.831394612789154
  Params: 
    max_depth: 7
    n_estimators: 130


# Sequential calls without Optuna

For comparison, let's make the same calls without Optuna and its parallel-processing support, and time them against the Optuna studies above. To keep the comparison fair, we reuse the parameters that Optuna tried: they were selected by the study's sampling algorithm and are available in `study.trials_dataframe()` for us to pick out.

In [11]:
df = study_tpe.trials_dataframe()
params_max_depth, params_n_estimators = df['params_max_depth'], df['params_n_estimators']

### Sequential call function 

For a cleaner look, let's wrap each call in a function. It builds a classifier with the parameters passed in, trains and evaluates it, and returns the score together with those parameters, so the best-performing model can be picked out later.

In [12]:
def seq_call(max_depth, n_estimators):
    classifier = RandomForestClassifier(max_depth=max_depth, n_estimators = n_estimators)

    X_train, X_valid, y_train, y_valid = train_test_split(X, y)

    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_valid)
    score = accuracy_score(y_valid, y_pred)
    return score, max_depth, n_estimators

In [16]:
from joblib import Parallel, delayed
with timed("no-optuna-call"):
    with parallel_backend("dask", n_jobs=n_workers):
        results = Parallel()(delayed(seq_call)(max_depth=params_max_depth[i],
                     n_estimators=params_n_estimators[i]) for i in range(N_TRIALS))
    print(results)

[(0.8306294083595276, 5, 151), (0.8311144113540649, 7, 272), (0.8308753967285156, 7, 290), (0.8311346173286438, 7, 261), (0.8307803869247437, 6, 309), (0.8307003974914551, 6, 131), (0.8307471871376038, 5, 359), (0.8308507800102234, 6, 141), (0.8308172225952148, 5, 108), (0.8311368227005005, 7, 424)]
..no-optuna-call          :   89.2947


Note: Running this without a Dask backend is actually faster. Making N_TRIALS plain sequential calls takes about 65 seconds to finish. The Dask backend makes the most sense when used with multi-GPU estimators, as we see later in the notebook.

In [34]:
with timed("no-optuna-no-dask"):
    for i in range(N_TRIALS):
        # `results` is overwritten on each iteration, so only the last call is printed.
        results = seq_call(max_depth=params_max_depth[i],
                     n_estimators=params_n_estimators[i])
    print(results)

(0.8312597870826721, 7, 424)
..no-optuna-no-dask       :   65.9812


# MLflow callback

Optuna integrates with a number of other libraries. One of them is MLflow, a tracking library, which we use here to keep track of the different HPO runs. We add it by passing a callback to the study, as shown in the cells that follow.

Before running the cells that log to MLflow, start the MLflow UI by executing `mlflow ui -p 8004` in a terminal. 8004 is the port we are using here; you can use any port of your choice. The port defaults to 5000 if the `-p` option is not specified.

In [14]:
def mlflow_callback(study, trial):
    trial_value = trial.value if trial.value is not None else float("nan")
    # Point MLflow at the tracking server before the run is created.
    mlflow.set_tracking_uri("http://127.0.0.1:8004")
    with mlflow.start_run(run_name=study.study_name):
        mlflow.log_params(trial.params)
        mlflow.log_metrics({"accuracy": trial_value})

In [17]:
with timed("mlflow-callback"):
    study = optuna.create_study(study_name="Optuna-MLflow-callback",
                                storage="sqlite:///optuna_mlflow_db.db",
                                direction="maximize",
                                load_if_exists=True)
    with parallel_backend("dask", n_jobs=n_workers):
        study.optimize(objective, n_trials=N_TRIALS, n_jobs=n_workers, timeout=600, callbacks=[mlflow_callback])

[I 2020-06-24 16:17:55,716] A new study created with name: Optuna-MLflow-callback


..mlflow-callback         :   88.1154


# Multi-GPU estimators

We also have estimators that can run on multiple GPUs. `cuml.dask` has a set of multi-GPU estimators that can run much faster than their single-GPU counterparts. Let's try that out. To do this, we need to use `dask_cudf` dataframes, so we will redefine the objective function from earlier to do just that.

`objective_mg` converts our split data into dask_cudf dataframes and persists them across all available dask workers. By doing this, we can now run the multi-GPU RandomForestClassifier. Notice that we import the `cuml.dask.ensemble.RandomForestClassifier` for this cell.

In [21]:
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as dask_RF

def objective_mg(trial):
    # Same search space as the single-GPU objective.
    max_depth = trial.suggest_int("max_depth", 5, 7)
    n_estimators = trial.suggest_int("n_estimators", 100, 500)
    
    classifier = dask_RF(max_depth=max_depth,
                         n_estimators=n_estimators)

    X_train, X_valid, y_train, y_valid = train_test_split(X, y)

    # Partition every collection the same way: one partition per worker.
    X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_workers)
    X_valid_dask = dask_cudf.from_cudf(X_valid, npartitions=n_workers)
    
    y_train_dask = dask_cudf.from_cudf(y_train, npartitions=n_workers)
    y_valid_dask = dask_cudf.from_cudf(y_valid, npartitions=n_workers)
    
    X_train_dask, X_valid_dask, y_train_dask, y_valid_dask = dask_utils.persist_across_workers(c, [X_train_dask, X_valid_dask,
                                                                      y_train_dask, y_valid_dask], workers=workers)
    
    classifier.fit(X_train_dask, y_train_dask)
    y_pred = classifier.predict(X_valid_dask)
    score = accuracy_score(y_valid, y_pred.compute())
    return score


In [31]:
with timed("multi-GPU-estimators"):
    study = optuna.create_study(sampler= optuna.samplers.TPESampler(),
                                study_name="Multi-GPU-Estimator",
                                direction="maximize",
                                storage="sqlite:///mnmg.db")
    study.optimize(objective_mg, n_trials=N_TRIALS)

[I 2020-06-24 16:24:19,997] A new study created with name: Multi-GPU-Estimator
[I 2020-06-24 16:24:24,672] Finished trial#0 with value: 0.8307098150253296 with parameters: {'max_depth': 6, 'n_estimators': 360}. Best is trial#0 with value: 0.8307098150253296.
[I 2020-06-24 16:24:27,800] Finished trial#1 with value: 0.8308072090148926 with parameters: {'max_depth': 6, 'n_estimators': 142}. Best is trial#1 with value: 0.8308072090148926.
[I 2020-06-24 16:24:31,354] Finished trial#2 with value: 0.8309339880943298 with parameters: {'max_depth': 6, 'n_estimators': 201}. Best is trial#2 with value: 0.8309339880943298.
[I 2020-06-24 16:24:34,491] Finished trial#3 with value: 0.8307362198829651 with parameters: {'max_depth': 7, 'n_estimators': 121}. Best is trial#2 with value: 0.8309339880943298.
[I 2020-06-24 16:24:37,573] Finished trial#4 with value: 0.8308448195457458 with parameters: {'max_depth': 6, 'n_estimators': 139}. Best is trial#2 with value: 0.8309339880943298.
[I 2020-06-24 16:24:4

..multi-GPU-estimators    :   35.9608


In [32]:
print_results(study)

Number of finished trials:  10
Best trial:
  Value:  0.8309419751167297
  Params: 
    max_depth: 5
    n_estimators: 108


## Summarizing the timing results

| Study name | Runtime (s) |
|---|---|
| Optuna-MultiGPU-TPE | 91.3055 |
| Optuna-MultiGPU-CMAE | 88.3086 |
| No-Optuna-Call | 89.2947 |
| No-Optuna-No-Dask | 65.9812 |
| Optuna-MLflow-callback | 88.1154 |
| Multi-GPU-Estimator | 35.9608 |

We notice that with 2 GPUs, the multi-GPU estimator ran more than twice as fast as the Dask-backed single-GPU runs, and about 1.8 times as fast as the plain sequential loop.
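The speedups behind that comparison can be worked out directly from the timings reported in this notebook. The snippet below is plain arithmetic on those numbers. Each timing comes from a single run, so the ratios are indicative rather than a benchmark.

```python
# Wall-clock runtimes in seconds, copied from the timed() output in this notebook.
runtimes = {
    "Optuna-MultiGPU-TPE": 91.3055,
    "Optuna-MultiGPU-CMAE": 88.3086,
    "No-Optuna-Call": 89.2947,
    "No-Optuna-No-Dask": 65.9812,
    "Optuna-MLflow-callback": 88.1154,
}
baseline = 35.9608  # Multi-GPU-Estimator

# Speedup = runtime of the other option / runtime of the multi-GPU estimator.
speedups = {name: seconds / baseline for name, seconds in runtimes.items()}

for name, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24s} {factor:.2f}x")
```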

In [None]:
# # CPU with 750 estimators max does not finish running after hours.
# def objective_cpu(trial):
    
#     max_depth = trial.suggest_int("max_depth", 5, 15)
#     n_estimators = trial.suggest_int("n_estimators", 100, 750)

#     classifier = sklearn.ensemble.RandomForestClassifier(max_depth=max_depth,
#                                        n_estimators=n_estimators)

#     X_train, X_valid, y_train, y_valid = sklearn.model_selection.train_test_split(X_, y_)
    
#     classifier.fit(X_train, y_train)
#     y_pred = classifier.predict(X_valid)
    
#     score = accuracy_score(y_valid, y_pred)
#     return score

In [None]:
# with timed("cpu-etl"):
#     df_pd = pd.read_parquet(INPUT_FILE)
#     X_, y_ = df_pd.drop(["ArrDelayBinary"], axis=1), df_pd["ArrDelayBinary"].astype('int32')
    
# with timed("cpu-hpo"):
#     study = optuna.create_study(direction="maximize") # Equivalent to an experiment, a set of trials
#     study.optimize(objective_cpu, n_trials=N_TRIALS, n_jobs=-1)