
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>



# LAB - Hyperparameter Tuning with Optuna

Welcome to the Hyperparameter Tuning with Optuna lab! In this hands-on session, you'll gain practical insights into **optimizing machine learning models using Optuna**. Throughout the lab, we'll cover key steps, from loading the dataset and creating training/test sets to **defining a hyperparameter search space and running optimization trials with Spark**. The primary objective is to equip you with the skills to fine-tune models effectively using Spark, Optuna, and MLflow.

**Lab Outline:**
1. Load the dataset and create training/test sets for a scikit-learn model. 
1. Define the hyperparameter search space for optimization.
1. Define the optimization function to fine-tune the model.
1. Run hyperparameter tuning trials. 
1. Search for runs using the MLflow API and visualize all runs within the MLflow experiment.
1. Identify the best run based on the model's precision value programmatically and visually.
1. Register the model with Unity Catalog.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.
Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.
1. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:
   - In the drop-down, select **More**.
   - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:
1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.
1. Find the triangle icon to the right of your compute cluster name and click it.
1. Wait a few minutes for the cluster to start.
1. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **17.3.x-cpu-ml-scala2.13**


## Classroom Setup

Before starting the lab, run the provided classroom setup script. This script defines the configuration variables necessary for the lab. Execute the following cells in order; the first installs Optuna and restarts the Python process.

In [0]:
%pip install -U -qq optuna
%restart_python

In [0]:
%run ../Includes/Classroom-Setup-2.2

**Other Conventions:**

Throughout this lab, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

## Prepare Dataset

In this lab, you will be using a fictional dataset from a Telecom Company, which includes customer information. This dataset encompasses **customer demographics**, including gender, as well as internet subscription details such as subscription plans and payment methods.

In this lab, we will create and tune a model that will predict customer churn based on the **`Churn`** field. 

A table with all features is already created for you.

**Table name: `customer_churn`**

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
## load the table from Unity Catalog called customer_churn
table_name = <FILL_IN>
## Read into a PySpark DataFrame and convert to Pandas DataFrame
customer_sdf = <FILL_IN>
customer_pd = <FILL_IN>

## split dataset between features and targets. The target variable is Churn
target_col = <FILL_IN>
X_all = <FILL_IN>
y_all = <FILL_IN>

## test / train split using 95% train/5% test
X_train, X_test, y_train, y_test = train_test_split(<FILL_IN>)
print(f"We have {X_train.shape[0]} records in our training dataset")
print(f"We have {X_test.shape[0]} records in our test dataset")

In [0]:
%skip
import pandas as pd
from sklearn.model_selection import train_test_split
## load the table from Unity Catalog called customer_churn
table_name = f"{DA.catalog_name}.{DA.schema_name}.customer_churn"
## Read into a PySpark DataFrame and convert to Pandas DataFrame
customer_sdf = spark.read.table(table_name)
customer_pd = customer_sdf.drop('CustomerID').toPandas()

## split dataset between features and targets. The target variable is Churn
target_col = "Churn"
X_all = customer_pd.drop(labels=target_col, axis=1)
y_all = customer_pd[target_col]

## test / train split using 95% train/5% test
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=0.95, random_state=42)
print(f"We have {X_train.shape[0]} records in our training dataset")
print(f"We have {X_test.shape[0]} records in our test dataset")
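
With only 5% of the rows held out, the share of churned customers in the test set can drift from the share in the training set. The sketch below is an optional variant, not part of the lab solution: it uses synthetic data (the arrays `X_demo` and `y_demo` are made up for illustration) and adds the `stratify` argument of `train_test_split` to keep the class ratio similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 1,000 rows, roughly 20% positive class.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.2).astype(int)

# stratify=y_demo keeps the class ratio nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.95, random_state=42, stratify=y_demo
)

print(f"Train rows: {X_tr.shape[0]}, test rows: {X_te.shape[0]}")
print(f"Positive rate (train): {y_tr.mean():.3f}")
print(f"Positive rate (test):  {y_te.mean():.3f}")
```

In the lab itself, the equivalent change would be passing `stratify=y_all`, assuming the `Churn` column is a categorical label.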

## Step 1: Define the Search Space and Optimization Function

Define the parameter search space for Optuna.

Your objective function should meet the following requirements:

1. Define the search space using the hyperparameters `max_depth`, `max_features`, and `criterion`. Search `max_depth` over integers from 5 to 50 and `max_features` over integers from 5 to 10. For `criterion`, search over `gini`, `entropy`, and `log_loss`.
1. Start each trial as a nested MLflow run.
1. For each run, log the cross-validation results for `accuracy`, `precision`, `recall`, and `f1`.
1. Use **3-fold** cross-validation. Be sure to average the fold results using `.mean()`.
1. The objective will be to _maximize_ **`precision`**.
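
The cross-validation requirements above can be sketched in isolation. This is a minimal example on synthetic data from `make_classification` (the names `X_demo` and `y_demo` and the fixed `max_depth=5` are assumptions for illustration, not the lab dataset or solution). It shows that `cross_validate` returns one score per fold under the key `test_<metric>`, which is why the scores are averaged with `.mean()`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

scoring_metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    X_demo,
    y_demo,
    cv=3,
    scoring=scoring_metrics,
)

# Each metric comes back as an array with one score per fold.
cv_results_avg = {}
for metric in scoring_metrics:
    fold_scores = cv_results[f"test_{metric}"]
    cv_results_avg[metric] = fold_scores.mean()
    print(f"{metric}: folds={fold_scores.round(3)}, mean={cv_results_avg[metric]:.3f}")
```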

In [0]:
import optuna
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from mlflow.models.signature import infer_signature

## Define the objective function
def optuna_objective_function(<FILL_IN>):
    params = {
        'criterion': <FILL_IN>,
        'max_depth': <FILL_IN>,
        'max_features': <FILL_IN>
    }
    
    with mlflow.start_run(nested=True, run_name=f"Optuna Trial {trial.number}"):
        
        ## Train model
        dtc = <FILL_IN>

        ## Perform cross-validation
        scoring_metrics = [<FILL_IN>]
        cv_results = cross_validate(<FILL_IN>)

        ## Create input signature using the first row of X_train
        input_example = X_train.iloc[[0]]
        signature = <FILL_IN>

        ## Compute and log average scores
        cv_results_avg = {metric: cv_results[f'test_{metric}'].mean() for metric in scoring_metrics}
        mlflow.log_metrics(<FILL_IN>)
        mlflow.log_params(<FILL_IN>)
        mlflow.sklearn.log_model(<FILL_IN>)

        ## Return precision to maximize it
        return <FILL_IN>

In [0]:
%skip
import optuna
import mlflow
import mlflow.sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from mlflow.models.signature import infer_signature

## Define the objective function
def optuna_objective_function(trial):
    params = {
        'criterion': trial.suggest_categorical('criterion', ['gini', 'entropy', 'log_loss']),
        'max_depth': trial.suggest_int('max_depth', 5, 50),
        'max_features': trial.suggest_int('max_features', 5, 10)
    }
    
    with mlflow.start_run(nested=True, run_name=f"Optuna Trial {trial.number}"):
        
        ## Train model
        dtc = DecisionTreeClassifier(**params)
        dtc.fit(X_train, y_train)

        ## Perform cross-validation
        scoring_metrics = ['accuracy', 'precision', 'recall', 'f1']
        cv_results = cross_validate(dtc, X_train, y_train, cv=3, scoring=scoring_metrics)

        ## Create input signature using the first row of X_train
        input_example = X_train.iloc[[0]]
        signature = infer_signature(input_example, dtc.predict(input_example))

        ## Compute and log average scores
        cv_results_avg = {metric: cv_results[f'test_{metric}'].mean() for metric in scoring_metrics}
        mlflow.log_metrics(cv_results_avg)
        mlflow.log_params(params)
        mlflow.sklearn.log_model(dtc, "lab_optuna_decision_tree_model", signature=signature, input_example=input_example)

        ## Return precision to maximize it
        return cv_results_avg['precision']

## Step 2: Create an Optuna Study and Log with MLflow

First, we will delete all previous runs to keep our workspace and experiment tidy. Second, you will create an Optuna study and run the experiment with MLflow.

In [0]:
## Set the MLflow experiment name and get the id
experiment_name = f"/Users/{DA.username}/Lab_Optuna_Experiment_{DA.schema_name}"
print(f"Experiment Name: {experiment_name}")
mlflow.set_experiment(experiment_name)
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
print(f"Experiment ID: {experiment_id}")

print("Clearing out old runs (If you want to add more runs, change the n_trials parameter in the next cell) ...")
## Get all runs
runs = mlflow.search_runs(experiment_ids=[experiment_id], output_format="pandas")

if runs.empty:
    print("No runs found in the experiment.")
else:
    ## Iterate and delete each run
    for run_id in runs["run_id"]:
        mlflow.delete_run(run_id)
        print(f"Deleted run: {run_id}")

    print("All runs have been deleted.")

### Create the Study and Log with MLflow

#### Instructions:

1. Create an Optuna study with name `lab_optuna_hpo`.
1. Maximize the objective function. 
1. Give the parent run the name `Lab_Optuna_Hyperparameter_Optimization`.
1. Only run 10 trials with Optuna.

In [0]:
study = <FILL_IN>

with mlflow.start_run(run_name='Lab_Optuna_Hyperparameter_Optimization') as parent_run:
    <FILL_IN>

In [0]:
%skip
study = optuna.create_study(
    study_name="lab_optuna_hpo",
    direction="maximize"
)

with mlflow.start_run(run_name='Lab_Optuna_Hyperparameter_Optimization') as parent_run:
    ## Run optimization
    study.optimize(
        optuna_objective_function, 
        n_trials=10,
        )

## Step 3: Visual Inspection of Precision Values

Here, you can view all 10 trial runs. After completing the code and running the following cell, scroll to the right and locate the column `metrics.precision`. Use the table UI to sort this column in descending order so that the largest precision score appears first. Next, you will create a visual to help understand the distribution of scores by trial.


### Creating a precision score visual

1. **Run the next cell** to generate the table output.  
1. Click on the **plus (+) symbol** in the output cell.  
1. Select **Visualization** from the options.  
1. In the visualization settings, choose **Bar** and ensure the **Horizontal Chart** toggle is **on**.  
1. Configure the **Y-axis**:  
   - Set **Y Column** to `tags.mlflow.runName`.  
1. Configure the **X-axis**:  
   - Set **X Columns** to `metrics.precision`.  
   - Choose **Sum** as the aggregation method.  
1. Click on the **Y-axis tab**:  
   - Ensure **Show Labels** is **on**.  
1. Apply the settings and visualize the data.


After following the above instructions, visually inspect which trial had the best run according to `precision`.
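
The same ranking can be done in pandas as a cross-check on the chart. The sketch below uses a hypothetical stand-in for the DataFrame returned by `mlflow.search_runs` (the precision values are made up; only the column names follow the MLflow output format). The parent run logs no metrics in this lab, so its row is dropped before sorting.

```python
import pandas as pd

# Hypothetical stand-in for the mlflow.search_runs output; values are made up.
demo_runs = pd.DataFrame(
    {
        "tags.mlflow.runName": [
            "Optuna Trial 0",
            "Optuna Trial 1",
            "Optuna Trial 2",
            "Lab_Optuna_Hyperparameter_Optimization",
        ],
        "metrics.precision": [0.61, 0.68, 0.64, None],
    }
)

# Drop rows without a precision value (the parent run), then rank the trials.
trial_runs = demo_runs.dropna(subset=["metrics.precision"])
ranked = trial_runs.sort_values("metrics.precision", ascending=False)
best_name = ranked.iloc[0]["tags.mlflow.runName"]

print(ranked)
print(f"Best trial by precision: {best_name}")
```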

In [0]:
import mlflow
import pandas as pd

## Get the experiment ID from the parent run
experiment_id = parent_run.info.experiment_id

## Fetch all runs from the experiment using the MLflow API
df_runs = mlflow.search_runs(<FILL_IN>)

display(df_runs)

In [0]:
%skip
import mlflow
import pandas as pd

## Get the experiment ID from the parent run
experiment_id = parent_run.info.experiment_id

## Fetch all runs from the experiment
df_runs = mlflow.search_runs(
  experiment_ids=[experiment_id]
  )

display(df_runs)

## Step 4: Find the Best Run Programmatically

In this step, you will use the Optuna study to find the best precision score and the hyperparameter values that produced it. You will then use MLflow to find the same run.

#### Instructions
1. Use the Optuna study to find the best precision score. 
1. Use the Optuna study to find the best hyperparameter values. 
1. Use the MLflow API to find the best run based on precision score.

In [0]:
## Display the best hyperparameters and metric
print(f"Best hyperparameters: {<FILL_IN>}")
print(f"Best precision score: {<FILL_IN>}")

In [0]:
%skip
## Display the best hyperparameters and metric
print(f"Best hyperparameters: {study.best_params}")
print(f"Best precision score: {study.best_value}")

In [0]:
search_runs_pd = (mlflow.search_runs(<FILL_IN>))

## convert search_runs_pd to pyspark dataframe
search_runs_sd = <FILL_IN>
display(search_runs_pd)

In [0]:
%skip
search_runs_pd = (mlflow.search_runs(
    experiment_ids=[experiment_id],
    order_by=["metrics.precision DESC"],
    max_results=1))

## convert search_runs_pd to pyspark dataframe
search_runs_sd = spark.createDataFrame(search_runs_pd)
display(search_runs_pd)

## Load the Best Model and Parameters and Register to Unity Catalog

#### Instructions:
1. Either copy and paste the `run_id` and `experiment_id` from the results above, or retrieve them programmatically using `.collect()` on the `search_runs_sd` PySpark DataFrame.
1. Load the model from MLflow.
1. Display the results for the best model and parameters.
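
As an alternative to `.collect()`, the IDs can be read straight from the pandas result. The sketch below uses a hypothetical one-row DataFrame in place of the `mlflow.search_runs` output (the IDs `abc123` and `42` are placeholders, not real runs) and then builds the `runs:/` model URI from the artifact name used when the model was logged.

```python
import pandas as pd

# Hypothetical one-row result; the IDs below are placeholders, not real runs.
demo_best = pd.DataFrame(
    {
        "run_id": ["abc123"],
        "experiment_id": ["42"],
        "metrics.precision": [0.68],
    }
)

# Read the scalar values from the first (and only) row.
run_id_demo = demo_best.iloc[0]["run_id"]
experiment_id_demo = demo_best.iloc[0]["experiment_id"]

# The artifact name matches the one passed to log_model in Step 1.
model_uri_demo = f"runs:/{run_id_demo}/lab_optuna_decision_tree_model"
print(f"Run ID: {run_id_demo}")
print(f"Experiment ID: {experiment_id_demo}")
print(f"Model URI: {model_uri_demo}")
```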

In [0]:
## Get the string value from run_id and experiment_id from PySpark DataFrame search_runs_sd
run_id = <FILL_IN>
experiment_id = <FILL_IN>

print(f"Run ID: {run_id}")
print(f"Experiment ID: {experiment_id}")

In [0]:
%skip
## Get the string value from run_id and experiment_id from PySpark DataFrame search_runs_sd
run_id = search_runs_sd.select("run_id").collect()[0][0]
experiment_id = search_runs_sd.select("experiment_id").collect()[0][0]

print(f"Run ID: {run_id}")
print(f"Experiment ID: {experiment_id}")

In [0]:
import mlflow
import json

## Grab an input example from the test set (pandas DataFrame)
input_example = X_test.iloc[[0]]

## You logged the model as: mlflow.sklearn.log_model(dtc, "lab_optuna_decision_tree_model", ...)
model_uri = <FILL_IN>

## Load the model
loaded_model = <FILL_IN>

## Retrieve model parameters using the MLflow client and its get_run() method
client = <FILL_IN>
params = <FILL_IN>

## Display model parameters
print("Best Model Parameters:")
print(json.dumps(params, indent=4))

In [0]:
%skip
import mlflow
import json

# Grab an input example from the test set (pandas DataFrame)
input_example = X_test.iloc[[0]]

# You logged the model as: mlflow.sklearn.log_model(dtc, "lab_optuna_decision_tree_model", ...)
model_uri = f"runs:/{run_id}/lab_optuna_decision_tree_model"

# Load the model
loaded_model = mlflow.pyfunc.load_model(model_uri)

# Retrieve model parameters via MLflow client
client = mlflow.tracking.MlflowClient()
params = client.get_run(run_id).data.params

# Display model parameters
print("Best Model Parameters:")
print(json.dumps(params, indent=4))

### Register the Model to Unity Catalog

Register your model to Unity Catalog under the name `lab_optuna_model`. 

> _You can get the catalog name and schema name using `DA.catalog_name` and `DA.schema_name`, respectively._

In [0]:
mlflow.set_registry_uri("databricks-uc")
model_uri = <FILL_IN>
mlflow.register_model(<FILL_IN>)

In [0]:
%skip
mlflow.set_registry_uri("databricks-uc")
model_uri = f'runs:/{run_id}/lab_optuna_decision_tree_model'
mlflow.register_model(model_uri=model_uri, name=f"{DA.catalog_name}.{DA.schema_name}.lab_optuna_model")


## Conclusion

In this lab, you learned about Optuna and how to integrate Optuna trials and studies with MLflow. You also demonstrated the ability to programmatically and visually inspect the best trial. Finally, you showed how to load the MLflow model and register it to Unity Catalog.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>