# MLflow 02: Advanced Hyperparameter Optimization & Model Selection

Welcome back! In our [first notebook](Notebook_1_Getting_Started_with_MLFlow.ipynb), we laid the groundwork for MLflow by exploring basic experiment tracking. Now, it's time to level up! 

This notebook dives into **Advanced Hyperparameter Optimization (HPO)** and how MLflow helps with **Model Selection**. We'll use **Optuna**, an open-source HPO framework, and integrate it with MLflow to track many training trials, identify the best-performing models, and make informed decisions. For a quick introduction, you can also watch the following YouTube video:


<div style="text-align: center;"><a href="https://www.youtube.com/watch?v=2JuzqVbjSKU" target="_blank"><img src="https://i.ytimg.com/vi/TgdEZ6LFj-I/sddefault.jpg" alt="Optuna"></a></div>

Get ready for a more in-depth look at how MLflow manages complex experimental setups and facilitates finding those elusive optimal hyperparameters! We'll be using the California Housing dataset again to see if we can improve upon our previous model's performance through systematic HPO.

---

## Table of Contents

1. [Recap: What We Learned in Notebook 1](#recap)
2. [Introduction to Hyperparameter Optimization (HPO)](#intro-to-hpo)
3. [Setting Up: MLflow, Optuna, and Our Dataset](#setting-up)
4. [Defining the Optuna Objective Function with MLflow Integration](#objective-function)
5. [Running the Hyperparameter Optimization Study](#running-hpo-study)
6. [Analyzing HPO Results with MLflow UI](#analyzing-hpo-mlflow-ui)
7. [Programmatically Retrieving the Best Run and Training the Final Model](#retrieving-best-run)
8. [Key Takeaways and Advanced HPO Concepts](#key-takeaways-advanced)
9. [Engaging Resources and Further Reading](#resources-and-further-reading)

---

## 1. Recap: What We Learned in Notebook 1

In our [previous notebook](https://github.com/Morikashi/MLflow-crash-course/edit/main/Notebook%201%3A%20MLflow%20101%20%E2%80%93%20Experiment%20Tracking%20with%20Modern%20ML%20Pipelines.ipynb), we covered:
- **MLflow Tracking Basics:** Setting up MLflow, creating experiments, and logging runs.
- **Core Components:** Parameters, metrics, artifacts, and tags.
- **Manual Experimentation:** We trained an XGBoost model with a predefined set of hyperparameters and logged its performance.
- **MLflow UI:** Visualizing and comparing individual runs.

While manual experimentation is good for understanding, it's often inefficient for finding the *best* model configuration. This is where HPO comes in.

---

## 2. Introduction to Hyperparameter Optimization (HPO)

**Hyperparameters** are external configurations for a machine learning model that are set *before* the learning process begins. Examples include the learning rate, the number of trees in a random forest, or the depth of a neural network. The choice of hyperparameters can significantly impact model performance.

**Hyperparameter Optimization (HPO)**, also known as tuning, is the process of finding the set of hyperparameters that yields the optimal model performance for a given dataset. Common HPO strategies include:
- **Grid Search:** Exhaustively searches a manually specified subset of the hyperparameter space.
- **Random Search:** Samples hyperparameter combinations randomly from a given distribution.
- **Bayesian Optimization:** Uses a probabilistic model to select the most promising hyperparameters to evaluate next, based on past results (e.g., Tree-structured Parzen Estimator - TPE).
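
To make the difference between the first two strategies concrete, here is a minimal sketch in plain Python. The `toy_loss` function is a hypothetical stand-in for a real validation loss (it is not part of this notebook's pipeline), and both strategies get the same budget of nine evaluations.

```python
import itertools
import random


def toy_loss(learning_rate, max_depth):
    """Hypothetical stand-in for a validation loss; its minimum is at (0.1, 6)."""
    return 100 * (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6) ** 2


# Grid search: evaluate every combination of a hand-picked grid.
grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 6, 9]}
grid_results = [
    ({"learning_rate": lr, "max_depth": depth}, toy_loss(lr, depth))
    for lr, depth in itertools.product(grid["learning_rate"], grid["max_depth"])
]
best_grid = min(grid_results, key=lambda item: item[1])

# Random search: same number of evaluations, but each configuration is sampled.
rng = random.Random(42)
random_results = []
for _ in range(len(grid_results)):
    config = {
        "learning_rate": rng.uniform(0.01, 0.3),
        "max_depth": rng.randint(3, 10),
    }
    random_results.append((config, toy_loss(**config)))
best_random = min(random_results, key=lambda item: item[1])

print("Best grid configuration:  ", best_grid)
print("Best random configuration:", best_random)
```

Grid search can only ever return a point on the grid, while random search explores values in between. Bayesian methods such as TPE go one step further and use earlier results to decide where to sample next.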

### Why Optuna?
We'll be using **Optuna**, an open-source HPO framework designed to automate and accelerate the optimization process. 

![Optuna Logo](https://analyticsindiamag.com/wp-content/uploads/2021/02/Untitled-design23-768x432.png)

Key features of Optuna include:
- **Define-by-Run API:** Allows users to dynamically construct the search space for hyperparameters.
- **State-of-the-art Algorithms:** Implements efficient samplers (such as TPE) and pruners (such as Hyperband).
- **Easy Parallelization:** Scale HPO across multiple threads or machines.
- **Visualization:** Tools to understand the optimization process.
- **Framework Agnostic:** Can be used with any ML/DL framework.

Pairing Optuna with MLflow allows us to track each HPO trial as a separate MLflow run, log its specific hyperparameters and resulting metrics, and then use MLflow's UI and APIs to analyze the entire optimization process and select the best model.

---

## 3. Setting Up: MLflow, Optuna, and Our Dataset

Let's start by installing the necessary libraries and loading the California Housing dataset, just like in Notebook 1. We'll also configure MLflow.

In [None]:
# Install necessary libraries
!pip install --quiet mlflow optuna xgboost scikit-learn datasets pandas

# Import libraries
import mlflow
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from datasets import load_dataset
import pandas as pd

# Configure MLflow
mlflow.set_tracking_uri('mlruns') # Use local 'mlruns' directory
experiment_name = "California_Housing_HPO_Optuna"
mlflow.set_experiment(experiment_name)

print(f"MLflow Version: {mlflow.__version__}")
print(f"Optuna Version: {optuna.__version__}")
print(f"MLflow Experiment set to: {experiment_name}")

# Load and prepare the dataset (same as Notebook 1)
try:
    housing_dataset = load_dataset('gvlassis/california_housing', split='train')
    df = housing_dataset.to_pandas()
    print("\nSuccessfully loaded California Housing dataset.")
except Exception as e:
    print(f"Failed to load dataset: {e}. Ensure internet connectivity.")
    raise  # Re-raise with the original traceback

feature_columns = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
target_column = 'MedHouseVal'
X = df[feature_columns]
y = df[target_column]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Validation set shape: {X_val.shape}")

Our environment is now ready. We have MLflow configured to log to our specified experiment, Optuna imported, and our dataset split into training and validation sets.

---

## 4. Defining the Optuna Objective Function with MLflow Integration

The core of an Optuna optimization is the **objective function**. This function takes an `optuna.trial.Trial` object as input, which is used to sample hyperparameters for the current trial. The function then trains a model using these hyperparameters and returns a performance metric (e.g., validation MSE) that Optuna will try to minimize or maximize.

Crucially, we will integrate MLflow logging *inside* this objective function. For each trial Optuna runs, we will start a **nested MLflow run**. This allows us to log the specific hyperparameters suggested by Optuna and the resulting model's performance for that particular trial.

In [None]:
def objective(trial):
    """
    Objective function for Optuna HPO.
    Each call to this function is one 'trial' in the HPO process.
    It trains an XGBoost model with hyperparameters suggested by Optuna,
    logs them and the results to MLflow, and returns the validation MSE.
    """
    # Start a nested MLflow run for this particular Optuna trial
    # All logs for this trial will be grouped under the parent HPO run.
    with mlflow.start_run(nested=True) as run: # nested=True is key here!
        mlflow.set_tag("Optuna_Trial_ID", str(trial.number)) # Tag with Optuna trial number

        # Define the hyperparameter search space using Optuna's trial object
        params = {
            'objective': 'reg:squarederror',
            'n_estimators': trial.suggest_int('n_estimators', 50, 300, step=50), # Number of trees
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True), # Learning rate
            'max_depth': trial.suggest_int('max_depth', 3, 10), # Max depth of trees
            'subsample': trial.suggest_float('subsample', 0.6, 1.0, step=0.1), # Subsample ratio of training instances
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0, step=0.1), # Subsample ratio of columns
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 10), # Minimum sum of instance weight needed in a child
            'gamma': trial.suggest_float('gamma', 0, 0.5, step=0.1), # Minimum loss reduction required to make a further partition
            'lambda': trial.suggest_float('lambda', 0.5, 2.0, log=True), # L2 regularization term
            'alpha': trial.suggest_float('alpha', 1e-8, 1.0, log=True), # L1 regularization term (if non-zero)
            'random_state': 42
        }
        
        # Log sampled hyperparameters to MLflow
        mlflow.log_params(params)

        # Train the XGBoost model
        model = xgb.XGBRegressor(**params, early_stopping_rounds=10) # Stop early if no improvement
        model.fit(X_train, y_train,
                  eval_set=[(X_val, y_val)],
                  verbose=False) # Suppress XGBoost training output for cleaner HPO logs
        
        # Make predictions and evaluate
        y_pred = model.predict(X_val)
        mse = mean_squared_error(y_val, y_pred)
        r2 = r2_score(y_val, y_pred)
        
        # Log metrics to MLflow
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("r2_score", r2)
        mlflow.log_metric("best_iteration", model.best_iteration) # Boosting round with the best validation score

        # Log the model itself (optional, can make mlruns directory large during HPO)
        # For this example, we'll log it to show it's possible.
        # Consider logging only the best model after the HPO study for practical scenarios.
        mlflow.xgboost.log_model(model, f"xgboost-model-trial-{trial.number}")
        
        # Optuna needs a value to minimize or maximize. We want to minimize MSE.
        return mse # Return the metric Optuna should optimize (lower MSE is better)

print("Objective function defined.")

Key things to note in the `objective` function:
- **`mlflow.start_run(nested=True)`:** When a parent run is active (we create one in Section 5), each Optuna trial is logged as a child run under it. This keeps the MLflow UI organized.
- **`trial.suggest_*` methods:** Optuna provides these methods to sample hyperparameters from specified ranges and distributions (e.g., `suggest_int`, `suggest_float`, `suggest_categorical`).
- **Logging:** We log the `params` suggested by Optuna and the resulting `mse` and `r2_score` to MLflow for each trial.
- **Return Value:** The function returns the `mse`, which Optuna will try to minimize.
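- **Early Stopping:** `early_stopping_rounds=10` stops adding trees once the validation metric has not improved for 10 consecutive rounds, which shortens each trial. This is different from Optuna's pruning, which abandons whole trials. Note also that the same validation set drives both early stopping and the HPO metric, so the reported MSE is slightly optimistic.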
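
Several of the `trial.suggest_float` calls above pass `log=True`, which tells Optuna to treat the range in the log domain. The sketch below shows conceptually what that means for a plain random draw; it uses only the standard library, and `sample_log_uniform` and `share_below` are hypothetical helpers, not Optuna functions.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(values, threshold):
    """Fraction of values that fall below the threshold."""
    return sum(v < threshold for v in values) / len(values)


rng = random.Random(0)
n_samples = 10_000

# Same range as the learning_rate search space in the objective function.
linear_draws = [rng.uniform(0.01, 0.3) for _ in range(n_samples)]
log_draws = [sample_log_uniform(0.01, 0.3, rng) for _ in range(n_samples)]

print(f"Share below 0.05 (linear scale): {share_below(linear_draws, 0.05):.2f}")
print(f"Share below 0.05 (log scale):    {share_below(log_draws, 0.05):.2f}")
```

On a linear scale only a small share of draws lands below 0.05, whereas on a log scale roughly half do. That is why log scaling is a common choice for parameters such as learning rates and regularization strengths, where the order of magnitude matters more than the absolute value.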

---

## 5. Running the Hyperparameter Optimization Study

Now, we'll create an Optuna `study` object and call its `optimize` method, passing our `objective` function. 

We'll also wrap the entire HPO process in a **parent MLflow run**. This parent run can log overall information about the HPO study itself (like the number of trials or the Optuna sampler used). All individual Optuna trials (each with its own nested MLflow run) will then be grouped under this parent run in the MLflow UI.

In [None]:
# Start a parent MLflow run for the entire HPO study
with mlflow.start_run(run_name="XGBoost_Optuna_HPO_Study") as parent_run:
    parent_run_id = parent_run.info.run_id
    print(f"Parent MLflow Run ID for HPO Study: {parent_run_id}")
    mlflow.set_tag("hpo_tool", "Optuna")
    mlflow.set_tag("model_type", "XGBoost_Regressor")
    
    # Create an Optuna study object
    # We specify 'minimize' because our objective function returns MSE (lower is better)
    # Optuna uses TPE (Tree-structured Parzen Estimator) by default, a Bayesian optimization algorithm.
    study = optuna.create_study(direction='minimize', study_name="XGBoost_California_Housing_Optimization")
    
    # Start the optimization process
    # n_trials: Number of HPO trials to run. Increase for more thorough search, but takes longer.
    # For demonstration, we'll keep it relatively small. In practice, you might run hundreds or thousands.
    n_hpo_trials = 25 # You can increase this for a more thorough search
    study.optimize(objective, n_trials=n_hpo_trials)
    
    # Log information about the Optuna study to the parent MLflow run
    mlflow.log_param("optuna_n_trials_requested", n_hpo_trials)
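    # Count only trials that actually finished (failed or pruned trials are excluded)
    n_completed = sum(t.state == optuna.trial.TrialState.COMPLETE for t in study.trials)
    mlflow.log_param("optuna_n_trials_completed", n_completed)
    mlflow.log_param("optuna_sampler", type(study.sampler).__name__)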
    
    # Log the best trial's information to the parent run
    best_trial = study.best_trial
    mlflow.log_metric("best_trial_mse", best_trial.value)
    mlflow.log_params(best_trial.params) # Log the best hyperparameters found
    mlflow.set_tag("best_optuna_trial_id", str(best_trial.number))

    print("\nOptuna HPO study completed.")
    print(f"Best trial number: {best_trial.number}")
    print(f"Best MSE: {best_trial.value:.4f}")
    print("Best hyperparameters:")
    for key, value in best_trial.params.items():
        print(f"  {key}: {value}")

print(f"\nAll HPO trials logged under parent run ID: {parent_run_id} in experiment '{experiment_name}'.")

The HPO study is now complete! Optuna has explored different hyperparameter combinations, and for each one, we've trained a model and logged its details to MLflow as a nested run. The parent run for the HPO study itself also contains summary information like the best MSE found and the corresponding best hyperparameters.

---

## 6. Analyzing HPO Results with MLflow UI

Now, let's head to the MLflow UI to see our HPO results. Run `mlflow ui` in your terminal from the directory that contains `mlruns`, then open the address it prints (by default `http://127.0.0.1:5000`) in your browser.

Navigate to the `California_Housing_HPO_Optuna` experiment. You should see:

1.  **Parent Run (`XGBoost_Optuna_HPO_Study`):** This run will have its own parameters (like `optuna_n_trials_requested`) and metrics (like `best_trial_mse`), and importantly, it will list all the individual Optuna trials as **Child Runs**.

    ![MLflow UI Parent Run with Child Runs](https://miro.medium.com/v2/resize:fit:1400/1*Cz8hSkcf4ZtZD87rz4KfNA.png)


2.  **Child Runs:** Each of these corresponds to one Optuna trial. 
    - You can click on any child run to see the specific hyperparameters Optuna chose for that trial (e.g., `learning_rate`, `max_depth`) and the resulting `mse` and `r2_score`.
    - The artifacts for each child run will contain the `xgboost-model-trial-<number>` if you chose to log them.

3.  **Comparison View:**
    - Select multiple child runs (or all of them from the parent run's child runs view) and click "Compare."
    - This view is incredibly powerful for HPO. You can see a table comparing all parameters and metrics side-by-side.
    - **Sort by metrics:** Sort the table by `mse` (ascending) or `r2_score` (descending) to quickly identify the best-performing trials.
    - **Visualization Plots:** The comparison view includes plots such as the Parallel Coordinates Plot and the Scatter Plot, which help visualize the relationship between hyperparameters and performance. For example, a Parallel Coordinates Plot can show which ranges of `learning_rate` or `max_depth` led to lower MSE.

    ![MLflow UI Compare Runs](https://mlflow.org/docs/latest/assets/images/mlflow_ui_chart_view-b6aac7263c29f4bb1a3a81bd79fa9de0.png)

**Explore the UI:**
- Identify the run with the lowest MSE.
- Check its parameters. Do they match what Optuna reported as the `best_trial.params`?
- Look at the Parallel Coordinates Plot. Are there any visible trends between hyperparameters and the `mse` metric?

The MLflow UI, especially with nested runs, provides an excellent way to dissect and understand the HPO process.

---

## 7. Programmatically Retrieving the Best Run and Training the Final Model

While the UI is great for exploration, we often need to programmatically access the best hyperparameters to train a final model or for further automation.

Optuna's `study.best_trial` object already gives us the best parameters and value. We can also use the MLflow client to search for the best run within our HPO experiment if needed, especially if we want to retrieve artifacts or more details.

Let's use the `best_trial.params` from Optuna to train our final model and log it as a distinct, top-level run (or as a specially tagged child run of the HPO study).

In [None]:
best_hps = study.best_trial.params
print("Best Hyperparameters found by Optuna:")
print(best_hps)

# Now, let's train the final model using these best hyperparameters
# We'll log this as a new, separate run for clarity, or you could log it under the parent HPO run.
with mlflow.start_run(run_name="XGBoost_Final_Optimized_Model") as final_run:
    print(f"\nTraining final model with best HPs. Run ID: {final_run.info.run_id}")
    
    # Log the best hyperparameters that were used for this final model
    mlflow.log_params(best_hps)
    mlflow.log_param("source_hpo_study_run_id", parent_run_id) # Link back to the HPO study
    mlflow.set_tag("model_status", "Optimized_Final")
    
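    # trial.params holds only the sampled values, so re-apply the fixed settings used during HPO
    final_model = xgb.XGBRegressor(**best_hps,
                                   objective='reg:squarederror', 
                                   random_state=42, # Same seed as the HPO trials
                                   early_stopping_rounds=10, # Good practice for final model too
                                  )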
    
    # Fit on the full training data (or train+val if you prefer, then test on a holdout set)
    # For consistency with HPO trials, we fit on X_train and evaluate on X_val.
    final_model.fit(X_train, y_train, 
                    eval_set=[(X_val, y_val)], 
                    verbose=False)
    
    y_pred_final = final_model.predict(X_val)
    final_mse = mean_squared_error(y_val, y_pred_final)
    final_r2 = r2_score(y_val, y_pred_final)
    
    mlflow.log_metric("mse", final_mse)
    mlflow.log_metric("r2_score", final_r2)
    mlflow.log_metric("best_iteration", final_model.best_iteration)
    
    # Log the final, optimized model
    mlflow.xgboost.log_model(final_model, "final-optimized-xgboost-model")
    
    print(f"Final Optimized Model Performance: MSE={final_mse:.4f}, R2={final_r2:.4f}")
    print(f"Final model and its metrics logged to MLflow run ID: {final_run.info.run_id}")

# Compare with results from Notebook 1 (if you ran it and recall the metrics)
# The goal is that final_mse here should be better (lower) than the manually tuned model.
print("\nCompare this final model's MSE with the MSE from the manually tuned model in Notebook 1!")

We now have a final, optimized model trained with the best hyperparameters found by Optuna, and it's neatly logged in MLflow. You can easily find this model in the MLflow UI, examine its parameters and metrics, and access its artifacts (the saved model file). Because the validation set was used both to select hyperparameters and for early stopping, its MSE is an optimistic estimate, so evaluate on a separate held-out test set before relying on it. This final model is now ready for potential deployment or further evaluation, which we'll cover in later notebooks (e.g., MLflow Model Registry).

---

## 8. Key Takeaways and Advanced HPO Concepts

In this notebook, we've significantly expanded our MLflow skills:

- **Hyperparameter Optimization (HPO):** Understood its importance and explored using **Optuna**.
- **Integrating MLflow with HPO Tools:** Crucially, we learned to log each Optuna trial as a **nested MLflow run** under a parent HPO study run. This is key for organized tracking of HPO experiments.
- **Structured HPO Logging:** Logged Optuna-suggested hyperparameters, resulting metrics, and even individual trial models within each nested run.
- **MLflow UI for HPO Analysis:** Leveraged the UI's hierarchical run display and comparison features to analyze HPO results, identify best trials, and visualize hyperparameter impact.
- **Programmatic Model Selection:** Retrieved the best hyperparameters found by Optuna and used them to train and log a final, optimized model.

### Advanced HPO Concepts (Briefly):
- **Pruning:** Optuna supports pruners that can stop unpromising trials early, saving computational resources. Integrating pruning callbacks within the objective function is straightforward.
- **Different Samplers:** Optuna offers various samplers beyond the default TPE, like CMA-ES (Covariance Matrix Adaptation Evolution Strategy) for continuous spaces, or random search. You can specify these when creating the `study`.
- **Distributed HPO:** Optuna studies can be parallelized across multiple processes or machines, often using a shared database (like PostgreSQL or MySQL) to store study results. MLflow can still track runs from distributed HPO setups.
- **Multi-Objective Optimization:** Sometimes you want to optimize for multiple metrics simultaneously (e.g., low latency AND high accuracy). Optuna supports this.

Mastering HPO and effectively tracking its results with tools like MLflow is a vital skill for any machine learning practitioner aiming for top-performing models.

---

## 9. Engaging Resources and Further Reading

Dive deeper into HPO, Optuna, and their integration with MLflow:

- **Optuna Documentation:**
    - [Optuna Official Website](https://optuna.org/)
    - [Optuna GitHub Repository](https://github.com/optuna/optuna)
    - [Optuna Examples (including MLflow integration)](https://github.com/optuna/optuna-examples/tree/main/mlflow)
- **MLflow Documentation:**
    - [MLflow Tracking with Nested Runs](https://mlflow.org/docs/latest/tracking.html#organizing-runs-in-experiments)
    - [Hyperparameter Tuning with Child Runs (Example)](https://www.mlflow.org/docs/latest/traditional-ml/hyperparameter-tuning-with-child-runs/index.html)
- **Articles and Blogs:**
    - [Effective Hyperparameter Tuning with Optuna and MLflow (Blog by Databricks or similar)](https://www.databricks.com/blog/2021/04/15/how-to-use-optuna-for-hyperparameter-tuning.html) (Illustrative link, search for recent articles)
    - Many community blogs showcase Optuna + MLflow workflows.

--- 

Fantastic work on completing this notebook! You've now seen how MLflow can manage much more complex experimentation workflows, like automated hyperparameter optimization. 

**Coming Up Next:** We'll explore the MLflow Model Registry for versioning and managing your production-ready models. This is a critical step in the MLOps lifecycle! Stay tuned!

![Keep Learning](https://memento.epfl.ch/image/23136/1440x810.jpg)