# An introduction to Optuna for optimising machine learning models

This Jupyter notebook goes through the basic usage of Optuna (optuna.org). Optuna is a hyperparameter optimisation framework that may be used with any type of machine learning model. It uses heuristic search algorithms to look for the model hyperparameters that give the best value of an objective, such as accuracy.

To install Optuna (if not installed as part of the Titanic environment): `pip install optuna`. The visualisations require Plotly (`pip install plotly`).

In [1]:
import optuna
import pandas as pd

import sklearn.ensemble
import sklearn.model_selection
from sklearn.linear_model import LogisticRegression

from IPython.display import clear_output

# The following are needed only to show Optuna images in a Jupyter Book
from IPython.display import Image, display

# To save images you will also need to `pip install kaleido`

## Optimize a simple problem

Find *x* that minimises *(x-2)^2*

(*x*=2 minimises output)

In [2]:
def objective(trial):
    # Set study parameter to vary
    x = trial.suggest_float('x', -10, 10)
    
    # Return value for optimisation
    return (x - 2) ** 2

# Run optimisation (default is to minimise objective)
study = optuna.create_study()
study.optimize(objective, n_trials=100)

# Remove optimisation output once complete
clear_output(wait=True)

# Show the best parameters found
study.best_params

{'x': 1.9952590976929072}
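
To make the idea of a heuristic search concrete, the sketch below runs a plain random search on the same function using only the standard library. It is an illustration of the general approach, not how Optuna's default sampler works; `random_search` and its arguments are hypothetical names chosen for this example.

```python
import random


def objective(x):
    # Same function as in the cell above: minimised at x = 2
    return (x - 2) ** 2


def random_search(n_trials=100, low=-10.0, high=10.0, seed=0):
    """Plain random search: sample x uniformly and keep the best value seen."""
    rng = random.Random(seed)
    best_x, best_value = None, float('inf')
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        value = objective(x)
        if value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


best_x, best_value = random_search()
print(f'Best x: {best_x:0.3f}, objective: {best_value:0.5f}')
```

Random search ignores the results of earlier trials. Samplers such as Optuna's default use those results to decide where to sample next, which usually needs fewer trials to get close to the optimum.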

## Download Titanic data

In [3]:
download_required = True

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory ='./data/'
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

In [4]:
data = pd.read_csv('data/processed_data.csv')
# Make all data 'float' type
data = data.astype(float)

# Drop PassengerId (axis=1 indicates we are removing a column rather than a row)
data.drop('PassengerId', inplace=True, axis=1)

## Divide into X (features) and y (labels)

In [5]:
# Split data into two DataFrames
X_df = data.drop('Survived',axis=1)
y_df = data['Survived']

# Convert DataFrames to NumPy arrays
X = X_df.values
y = y_df.values

## Optimize Hyperparameters

### Define a simple scikit-learn model

We start with a simple random forest model to classify Titanic survivors vs. non-survivors. We define a function called `objective` that encapsulates the whole training process and outputs the accuracy of the model.

In [6]:
def objective():

    # Define model
    clf = sklearn.ensemble.RandomForestClassifier(    
        n_estimators=5, max_depth=3)
    
    # Assess accuracy with sklearn.model_selection.cross_val_score
    accuracy = sklearn.model_selection.cross_val_score(
        clf, X, y, n_jobs=-1, cv=3).mean()
    
    return accuracy 

clear_output(wait=True)
print(f'Accuracy: {objective():0.3f}')

Accuracy: 0.737
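
The accuracy returned by `objective` is the mean of one score per cross-validation fold. The sketch below shows the individual fold scores, using a synthetic data set from `make_classification` as a stand-in (an assumption made so the example runs on its own; `X_demo` and `y_demo` are not the Titanic data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, random_state=0)

# Fixing random_state makes repeated runs build the same forest
clf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)

# cross_val_score returns one accuracy score per fold
fold_scores = cross_val_score(clf, X_demo, y_demo, cv=3)
print(fold_scores)
print(f'Mean accuracy: {fold_scores.mean():0.3f}')
```

The notebook's `objective` does not set `random_state`, so its accuracy will vary a little from run to run. This matters later, because Optuna is then optimising a noisy objective.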


### Optimize hyperparameters of the model

The hyperparameters set in the above model are `n_estimators` and `max_depth`, for which we can try different values to see if the model accuracy can be improved. The `objective` function is modified to accept a trial object, which has several methods for sampling hyperparameters (such as `suggest_int` and `suggest_float`). We create a study to run the hyperparameter optimization and finally read the best hyperparameters.

In [7]:
def objective(trial):
    
    # Set Optuna trial parameters and ranges
    n_estimators = trial.suggest_int('n_estimators', 2, 20)
    max_depth = int(trial.suggest_float('max_depth', 1, 32, log=True))
    
    # Set up model
    clf = sklearn.ensemble.RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth)
    
    # Assess accuracy with sklearn.model_selection.cross_val_score
    accuracy = sklearn.model_selection.cross_val_score(
        clf, X, y, n_jobs=-1, cv=3).mean()
    
    return accuracy

# Set up Optuna study
study = optuna.create_study(direction='maximize')

# Run optimisation
study.optimize(objective, n_trials=100)

# Get best model run
trial = study.best_trial

# Clear optimisation output and display output
clear_output(wait=True)
print(f'Accuracy: {trial.value:0.3f}')
print(f'Best hyperparameters: {trial.params}')

Accuracy: 0.819
Best hyperparameters: {'n_estimators': 14, 'max_depth': 8.031359561286942}


It is possible to condition hyperparameters using Python `if` statements. We can, for instance, include another classifier, a logistic regression classifier, in our hyperparameter optimisation and define hyperparameters specific to the random forest model and to the logistic regression model.

In [8]:
def objective(trial):
    
    # Allow Optuna to choose classifier (Random Forest or Logistic Regression)
    classifier = trial.suggest_categorical(
        'classifier', ['RandomForest', 'LogisticRegression'])
    
    # If Random Forest set Optuna trial parameters for random forest
    if classifier == 'RandomForest':
        n_estimators = trial.suggest_int('n_estimators', 2, 20)
        max_depth = int(trial.suggest_float('max_depth', 1, 32, log=True))

        # Set up model
        clf = sklearn.ensemble.RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth)
    
    # Otherwise set Optuna trial parameters for logistic regression
    else:
        c = trial.suggest_float('lr_c', 1e-10, 1e10, log=True)
        
        # Set up model
        clf = LogisticRegression(C=c)
        
    # Run selected model        
    accuracy = sklearn.model_selection.cross_val_score(
        clf, X, y, n_jobs=-1, cv=3).mean()

    return accuracy

# Set up Optuna study
study = optuna.create_study(direction='maximize')

# Run optimisation
study.optimize(objective, n_trials=100)

# Get best model run
trial = study.best_trial

# Clear optimisation output and display output
clear_output(wait=True)
print(f'Accuracy: {trial.value:0.3f}')
print(f'Best hyperparameters: {trial.params}')

Accuracy: 0.826
Best hyperparameters: {'classifier': 'RandomForest', 'n_estimators': 18, 'max_depth': 8.291248298136349}
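
The best trial only records the parameters from the branch that was taken, so rebuilding the winning model means repeating the same `if` logic. The sketch below shows one way this might be done; `build_classifier` is a hypothetical helper, and the value of `lr_c` in the second dictionary is made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def build_classifier(params):
    """Recreate an unfitted classifier from a dictionary of trial parameters."""
    if params['classifier'] == 'RandomForest':
        return RandomForestClassifier(
            n_estimators=params['n_estimators'],
            # The objective truncates the sampled float to an integer
            max_depth=int(params['max_depth']))
    # Otherwise the trial used logistic regression
    return LogisticRegression(C=params['lr_c'])


# Parameter dictionaries of the shape returned by trial.params
rf_params = {'classifier': 'RandomForest', 'n_estimators': 18,
             'max_depth': 8.291248298136349}
lr_params = {'classifier': 'LogisticRegression', 'lr_c': 0.5}  # made-up value

print(build_classifier(rf_params))
print(build_classifier(lr_params))
```

In the notebook this could be used as `build_classifier(study.best_params).fit(X, y)` to train a final model on all of the data.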


### Plotting the study

Plotting the optimization history of the study.

In [9]:
fig = optuna.visualization.plot_optimization_history(study)

# To display fig in an ordinary notebook use `fig.show()`

# The following is needed only to display the image in an online Jupyter Book
filename = './images/temp.jpg'
fig.write_image(filename)
display(Image(filename))

<IPython.core.display.Image object>

Plotting the accuracies for each hyperparameter for each trial.

In [10]:
fig = optuna.visualization.plot_slice(study)

# To display fig in an ordinary notebook use `fig.show()`

# The following is needed only to display the image in an online Jupyter Book
filename = './images/temp.jpg'
fig.write_image(filename)
display(Image(filename))

<IPython.core.display.Image object>

Plotting the accuracy surface for the hyperparameters involved in the random forest model.

In [11]:
fig = optuna.visualization.plot_contour(study, params=['n_estimators', 'max_depth'])

# To display fig in an ordinary notebook use `fig.show()`

# The following is needed only to display the image in an online Jupyter Book
filename = './images/temp.jpg'
fig.write_image(filename)
display(Image(filename))

<IPython.core.display.Image object>

## Switching sampling algorithms

Samplers continually narrow down the search space using the records of suggested parameter values and evaluated objective values, so that later suggestions concentrate on regions that give better objective values. A more detailed explanation of how samplers suggest parameters is given in the documentation for `optuna.samplers.BaseSampler`.

Optuna provides the following sampling algorithms:

- Tree-structured Parzen Estimator algorithm implemented in `optuna.samplers.TPESampler`

- CMA-ES (Covariance matrix adaptation evolution strategy) based algorithm implemented in `optuna.samplers.CmaEsSampler`

- Grid Search implemented in `optuna.samplers.GridSampler`

- Random Search implemented in `optuna.samplers.RandomSampler`

The default sampler is `optuna.samplers.TPESampler`.

We can define an alternative sampler when we define the study. Here we use a CMA-ES sampler.

In [12]:
def objective(trial):

    # Set Optuna trial parameters and ranges
    n_estimators = trial.suggest_int('n_estimators', 2, 20)
    max_depth = int(trial.suggest_float('max_depth', 1, 32, log=True))
    
    # Set up model
    clf = sklearn.ensemble.RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth)
    
    # Assess accuracy with sklearn.model_selection.cross_val_score
    accuracy = sklearn.model_selection.cross_val_score(
        clf, X, y, n_jobs=-1, cv=3).mean()
    
    return accuracy

# Set up Optuna study

study = optuna.create_study(direction='maximize', 
                            sampler=optuna.samplers.CmaEsSampler())

# Run optimisation
study.optimize(objective, n_trials=100)

# Get best model run
trial = study.best_trial

# Clear optimisation output and display output
clear_output(wait=True)
print(f'Accuracy: {trial.value:0.3f}')
print(f'Best hyperparameters: {trial.params}')

Accuracy: 0.820
Best hyperparameters: {'n_estimators': 11, 'max_depth': 9.965989480808997}


In [13]:
fig = optuna.visualization.plot_optimization_history(study)

# To display fig in an ordinary notebook use `fig.show()`

# The following is needed only to display the image in an online Jupyter Book
filename = './images/temp.jpg'
fig.write_image(filename)
display(Image(filename))

<IPython.core.display.Image object>

In [14]:
fig = optuna.visualization.plot_contour(study, params=['n_estimators', 'max_depth'])

# To display fig in an ordinary notebook use `fig.show()`

# The following is needed only to display the image in an online Jupyter Book
filename = './images/temp.jpg'
fig.write_image(filename)
display(Image(filename))

<IPython.core.display.Image object>