
# Model Tuning - Part 2
## Sequential Searching

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold



In [2]:
!pip install optuna
import optuna

Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.16.4-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.5.0-py3-none-any.whl (400 kB)
Downloading alembic-1.16.4-py3-none-any.whl (247 kB)
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, alembic, optuna
Successfully installed alembic-1.16.4 colorlog-6.9.0 optuna-4.5.0


## Why Move Beyond Grid and Random Search?

Previously, we've used grid search and random search to tune our model hyperparameters. These are useful, but they come with limitations — especially as the number of parameters grows.

### Limitations of Grid Search

- Tests every possible combination in a fixed parameter grid.
- Becomes exponentially expensive as more parameters are added.
- Wastes time evaluating regions of the parameter space that clearly perform poorly.
- Struggles with continuous parameters (e.g. `learning_rate`, `reg_lambda`) — you must manually define values to try.
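
To make the "exponentially expensive" point concrete, here is a small sketch that counts the combinations in a grid. The grid values are made up for illustration and are not a recommendation.

```python
from itertools import product

# Hypothetical grid for a logistic regression (values chosen only for illustration)
grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'max_iter': [100, 500, 1000],
    'class_weight': [None, 'balanced'],
}

# Every extra parameter multiplies the number of combinations
n_combinations = 1
for name, values in grid.items():
    n_combinations *= len(values)
    print(f"after adding {name}: {n_combinations} combinations")

# Grid search evaluates every one of them, once per CV fold
all_combos = list(product(*grid.values()))
n_fits = len(all_combos) * 5  # assuming 5-fold cross-validation
print("model fits needed:", n_fits)
```

Adding one more parameter with four candidate values would multiply the total by four again.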


### Limitations of Random Search

- Samples from the space randomly, which is more efficient than grid search in high dimensions.
- Often finds reasonable regions quickly, but then keeps sampling blindly.
- Doesn't learn from past trials — it doesn’t know which areas are worth exploring further.
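
The sketch below shows what "sampling blindly" looks like for a single parameter. The helper `sample_log_uniform` is a hypothetical name, and the range simply mirrors the range we use for `C` later on.

```python
import math
import random

random.seed(136)  # fixed only so the printout is repeatable

def sample_log_uniform(low, high):
    """Draw one value uniformly on a log10 scale between low and high."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))

# Each draw is independent: nothing about earlier results feeds into later draws
samples = [sample_log_uniform(1e-4, 1e2) for _ in range(10)]

for i, c in enumerate(samples):
    print(f"trial {i}: C = {c:.5f}")
```

Notice that the loop never looks at a score. That is exactly the limitation: a good result in trial 3 has no influence on what trial 4 tries.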

## A Smarter Alternative: Bayesian Optimisation

Bayesian optimisation is an approach that builds a **model of the objective function** — in our case, model performance — and uses it to decide where to search next.

Instead of sampling randomly, it balances:
- Exploration: trying new parts of the space
- Exploitation: focusing on areas that have performed well so far

### What Does Bayesian Optimisation Actually Do?

At a high level, Bayesian optimisation tries to learn the relationship between a model's hyperparameters and its performance.

You can think of it like this:

- Each time you try a set of hyperparameters, you get back a performance score.
- Bayesian optimisation uses this history to build a **predictive model** — an approximation of how different hyperparameter combinations are likely to perform.
- It then uses this model to **choose the next set of parameters** to try — ideally balancing exploration (trying new areas) and exploitation (refining promising regions).

This process repeats, updating the model each time with new results.

The result: rather than blindly searching the space, Bayesian optimisation **focuses its search** where it expects to find better results — all while keeping track of what it's learned so far.

### Key Ideas

- Bayesian optimisation treats tuning like a learning problem: "what value of these parameters will give me the best score?"
- At each step, it:
  1. Models what it thinks the objective function looks like.
  2. Chooses new parameters to try based on that model.
  3. Updates its model with the new results.

This makes it much more efficient, especially when:
- Evaluations are expensive (e.g. slow model training)
- The parameter space is large or continuous
- You have a limited tuning budget (e.g. 50 trials)
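
The three steps above can be sketched as a short loop. This is a toy illustration, not what Optuna does internally: it assumes a one-dimensional parameter, a Gaussian-process surrogate written by hand in NumPy, and an upper-confidence-bound rule for choosing the next point. `toy_score` is a hypothetical stand-in for an expensive cross-validated score.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential similarity between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_seen, y_seen, x_cand, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_cand."""
    k_ss = rbf_kernel(x_seen, x_seen) + noise * np.eye(len(x_seen))
    k_sc = rbf_kernel(x_seen, x_cand)
    solved = np.linalg.solve(k_ss, k_sc)
    mean = solved.T @ y_seen
    var = 1.0 - np.sum(k_sc * solved, axis=0)  # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def toy_score(x):
    """Hypothetical objective with its peak at x = 0.7."""
    return float(np.exp(-((x - 0.7) ** 2) / 0.02))

candidates = np.linspace(0.0, 1.0, 201)
x_seen = np.array([0.1, 0.5, 0.9])               # a few starting trials
y_seen = np.array([toy_score(x) for x in x_seen])

for _ in range(10):
    # 1. Model what the objective looks like, given the history so far
    mean, std = gp_posterior(x_seen, y_seen, candidates)
    # 2. Choose the next point: high predicted score or high uncertainty
    x_next = candidates[np.argmax(mean + 2.0 * std)]
    # 3. Evaluate it and add the result to the history
    x_seen = np.append(x_seen, x_next)
    y_seen = np.append(y_seen, toy_score(x_next))

best_x = x_seen[np.argmax(y_seen)]
print("best x found:", best_x)
```

The `mean + 2.0 * std` line is where exploration and exploitation are traded off: a large `mean` rewards regions that already look good, while a large `std` rewards regions we know little about.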

#### Knowledge check:
- What’s the main advantage of Bayesian optimisation compared to grid or random search?

## Introducing Optuna

[Optuna](https://optuna.org/) is a lightweight, flexible library for hyperparameter optimisation.  
It helps you efficiently explore hyperparameter combinations using a strategy called **sequential searching** — learning from past trials to make smarter decisions in future ones.


### Core Concepts

Optuna is built around a few key components:

| Term | What it means |
|------|---------------|
| `study` | A full optimisation run |
| `trial` | A single suggestion and evaluation of hyperparameters |
| `objective()` | A function you define that builds a model and returns a score |
| `suggest_*()` | Methods to define your hyperparameter search space |

### A Minimal Example

Here’s a complete example using Optuna to tune the regularisation strength of a logistic regression model on the sklearn Breast Cancer dataset:

In [3]:
# Load data
X, y = load_breast_cancer(return_X_y=True)

# Define the objective function
def objective(trial):
    C = trial.suggest_float('C', 1e-4, 1e2, log=True)

    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=C, max_iter=1000, random_state=136))
    ])

    # We're fixing the random state (how the k-folds are generated) for teaching purposes
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=136)

    score = cross_val_score(pipe, X, y, cv=cv, scoring='f1_macro').mean()
    return score

While we’re only tuning the model for now, it would be simple to extend this to the rest of the pipeline. Any parameter — including preprocessing steps like scaling or feature selection — can be added with `trial.suggest_*` just as we’ve done here.

### Running the Study

You create a study and tell it how many trials to run:

In [4]:
optuna.logging.set_verbosity(optuna.logging.WARNING) # Hides all trial info

# Seed the sampler so every run produces the same trials (for teaching); the sampler argument is normally omitted
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=136))
study.optimize(objective, n_trials=30)

print("Best score:", study.best_value)
print("Best params:", study.best_params)

Best score: 0.9754610759659041
Best params: {'C': 9.853616407850957}


> Note: We've locked down randomness in both the objective and the study. This is done for teaching purposes and **should not be used in real workflows**. The objective always sees the same CV splits, and the study tries the same hyperparameters in the same order.

## How Is Optuna Using Bayesian Optimisation?

Optuna uses a form of Bayesian optimisation to decide which hyperparameters to try next.

Each time a trial finishes, Optuna updates a model of the objective function — essentially, a guess at how different parameter combinations are likely to perform.

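This model is built using:
- Your past trials (score + parameters)
- A strategy called **TPE** ([Tree-structured Parzen Estimator](https://hub.optuna.org/samplers/tpe_tutorial/)), which splits those trials into a better-scoring group and a worse-scoring group and estimates:
  - Where the good values tend to be (a density fitted to the better group)
  - Where the poor values tend to be (a density fitted to the worse group)

Then it chooses the next trial based on an acquisition function, something like:
> “Try the value most likely to improve on the current best.”

For TPE, this amounts to favouring values that look far more typical of the better group than of the worse group.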

This lets Optuna:
- Focus more trials in promising regions
- Still occasionally explore less-visited areas
- Adapt its strategy as new results come in

You don’t need to configure any of this — it's handled by Optuna’s `TPESampler` by default.  
But it’s what makes this search Bayesian: we are *modelling belief* about where the good results are likely to be, and updating that belief with each trial.
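
For intuition, here is a heavily simplified, one-dimensional sketch of the idea behind TPE. It is not Optuna's implementation: the 25% split, the fixed bandwidth, the number of candidates and the `toy_score` function are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(x):
    """Hypothetical objective to maximise; the best value is at x = 2.0."""
    return -(x - 2.0) ** 2

def parzen_density(points, x, bandwidth=0.5):
    """Average of Gaussian bumps centred on `points`, evaluated at `x`."""
    z = (x[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

low, high = -5.0, 5.0
xs = list(rng.uniform(low, high, size=10))  # random start-up trials
ys = [toy_score(x) for x in xs]

for _ in range(30):
    # Split the history into a "good" group (top 25%) and a "bad" group
    order = np.argsort(ys)[::-1]
    n_good = max(1, int(0.25 * len(xs)))
    good = np.array(xs)[order[:n_good]]
    bad = np.array(xs)[order[n_good:]]

    # Draw candidates from the density of good points...
    centres = rng.choice(good, size=24)
    cands = np.clip(centres + rng.normal(0, 0.5, size=24), low, high)

    # ...and keep the one that is most "good-like" relative to "bad-like"
    ratio = parzen_density(good, cands) / (parzen_density(bad, cands) + 1e-12)
    x_next = float(cands[np.argmax(ratio)])

    xs.append(x_next)
    ys.append(toy_score(x_next))

best_x = xs[int(np.argmax(ys))]
print("best x found:", best_x)
```

Because candidates are drawn at random around the good points, some of them land in thinly sampled areas, which is where the occasional exploratory trial comes from.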

#### Knowledge check:
- What does Optuna use to decide which trial to run next?

### Suggesting Multiple Parameters

You can tune as many hyperparameters as you want. For example:

In [5]:
def objective2(trial):
    C = trial.suggest_float('C', 1e-4, 1e2, log=True)
    penalty = trial.suggest_categorical('penalty', ['l1', 'l2'])

    # Use correct solver for each penalty
    solver = 'liblinear' if penalty == 'l1' else 'lbfgs'

    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(C=C, penalty=penalty, solver=solver, max_iter=1000, random_state=136))
    ])

    # We're fixing the random state (how the k-folds are generated) for teaching purposes
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=136)

    score = cross_val_score(pipe, X, y, cv=cv, scoring='f1_macro').mean()
    return score


In [6]:
# Seed the sampler so every run produces the same trials (for teaching); the sampler argument is normally omitted
study2 = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=136))
study2.optimize(objective2, n_trials=30)

print("Best score:", study2.best_value)
print("Best params:", study2.best_params)

Best score: 0.9754610759659041
Best params: {'C': 11.683396082101947, 'penalty': 'l2'}


This shows how you can:
- Tune both numeric and categorical parameters
- Use conditional logic inside the function

### What Makes This Better Than Grid or Random Search?

- You don’t need to define a fixed parameter grid.
- Optuna learns which areas of the space perform well and focuses future trials there.
- You can use `study.trials_dataframe()` to inspect all results — or plug it into a tool like [MLflow](https://mlflow.org/) later.

In [7]:
# We can see all the trials, look how quickly L1 was dropped!

study2.trials_dataframe()

| | number | value | datetime_start | datetime_complete | duration | params_C | params_penalty | state |
|---|--------|-------|----------------|-------------------|----------|----------|----------------|-------|
| 0 | 0 | 0.86507 | 2025-08-19 08:24:32.420854 | 2025-08-19 08:24:32.483649 | 0 days 00:00:00.062795 | 0.00083 | l2 | COMPLETE |
| 1 | 1 | 0.86507 | 2025-08-19 08:24:32.483717 | 2025-08-19 08:24:32.541979 | 0 days 00:00:00.058262 | 0.000821 | l2 | COMPLETE |
| 2 | 2 | 0.955244 | 2025-08-19 08:24:32.542052 | 2025-08-19 08:24:32.599708 | 0 days 00:00:00.057656 | 0.0162 | l2 | COMPLETE |
| 3 | 3 | 0.840037 | 2025-08-19 08:24:32.599781 | 2025-08-19 08:24:32.657170 | 0 days 00:00:00.057389 | 0.000599 | l2 | COMPLETE |
| 4 | 4 | 0.27144 | 2025-08-19 08:24:32.657244 | 2025-08-19 08:24:32.704302 | 0 days 00:00:00.047058 | 0.004462 | l1 | COMPLETE |
| 5 | 5 | 0.27144 | 2025-08-19 08:24:32.704373 | 2025-08-19 08:24:32.752631 | 0 days 00:00:00.048258 | 0.000398 | l1 | COMPLETE |
| 6 | 6 | 0.973533 | 2025-08-19 08:24:32.752721 | 2025-08-19 08:24:32.819838 | 0 days 00:00:00.067117 | 1.091385 | l2 | COMPLETE |
| 7 | 7 | 0.92752 | 2025-08-19 08:24:32.819927 | 2025-08-19 08:24:32.881253 | 0 days 00:00:00.061326 | 0.002498 | l2 | COMPLETE |
| 8 | 8 | 0.27144 | 2025-08-19 08:24:32.881326 | 2025-08-19 08:24:32.960644 | 0 days 00:00:00.079318 | 0.00129 | l1 | COMPLETE |
| 9 | 9 | 0.965868 | 2025-08-19 08:24:32.960721 | 2025-08-19 08:24:33.055347 | 0 days 00:00:00.094626 | 61.39114 | l2 | COMPLETE |


## Which Hyperparameters Had the Most Impact?

Optuna can estimate how much each hyperparameter influenced the final score by analysing the full set of completed trials.

This is useful when:
- You want to understand which choices really mattered
- Multiple parameters gave similar scores, and you'd like to simplify
- You’re comparing different modelling strategies


In [8]:
import optuna.visualization as vis

vis.plot_param_importances(study2)

In this case, you can see that `C` explains most of the performance variation, while `penalty` plays a smaller role once regularisation strength is tuned properly.

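Note: by default, the importances come from a separate evaluator (fANOVA, which fits a random forest to the completed trials), not from the sampler's own model. They are estimates rather than a theoretical measure, and they can shift a little between runs, but they are useful for seeing which parameters drove most of the variation in score.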

## Interpreting Optimisation History

You can also visualise the performance of individual trials using `plot_optimization_history`.

This plot shows:

- Each trial (dot) and the F1 score it achieved
- A line showing the best score so far

You should expect to see early improvement, followed by a flattening trend as Optuna narrows in on the best region of the parameter space.
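
The "best score so far" line is just a running maximum over the trial scores. As a small sketch, here it is computed by hand for the first ten trials of `study2`, with the scores copied from the trials table shown earlier.

```python
from itertools import accumulate

# Scores of trials 0-9, copied from the study2 trials table
trial_scores = [0.86507, 0.86507, 0.955244, 0.840037, 0.27144,
                0.27144, 0.973533, 0.92752, 0.27144, 0.965868]

# Running maximum: the best score seen up to and including each trial
best_so_far = list(accumulate(trial_scores, max))

# Trials that set a new best (the first trial always does)
improved = [
    i for i, best in enumerate(best_so_far)
    if i == 0 or best > best_so_far[i - 1]
]

print(best_so_far)
print("trials that improved the best score:", improved)
```

Only three of the ten trials move the line, which is the early-improvement-then-flattening pattern described above.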

In [9]:
vis.plot_optimization_history(study2)

### What’s Going On in This Plot?

Take a look at the third-last trial — it drops well below the others.

This is an example of **exploration**. Even though Optuna has found a good-performing region, it still occasionally tries less-tested areas of the space.

Why?

- To reduce uncertainty in parts of the space that haven’t been sampled much.
- To make sure it’s not missing something better in a surprising location.
- Because its model still assigns some probability of improvement to that region.

> Not every trial is trying to beat the best score. Some are designed to learn more about the search space.

Even late in the optimisation process, exploring the “edges” helps prevent overconfidence in a small region that might only look good by chance.

### But What About Ties?

Several trials achieve the same best score. How should we choose between them?

As `C` is the inverse regularisation strength, a lower value corresponds to stronger regularisation, which in turn leads to smaller model coefficients or a simpler model. We'll want to select the lowest value of `C` that still gives the best performance.

In [10]:
study2.trials_dataframe().sort_values(by='value', ascending=False).head(5)

| | number | value | datetime_start | datetime_complete | duration | params_C | params_penalty | state |
|---|--------|-------|----------------|-------------------|----------|----------|----------------|-------|
| 21 | 21 | 0.975461 | 2025-08-19 08:24:33.810829 | 2025-08-19 08:24:33.894012 | 0 days 00:00:00.083183 | 8.516832 | l2 | COMPLETE |
| 20 | 20 | 0.975461 | 2025-08-19 08:24:33.728494 | 2025-08-19 08:24:33.810758 | 0 days 00:00:00.082264 | 7.740554 | l2 | COMPLETE |
| 22 | 22 | 0.975461 | 2025-08-19 08:24:33.894080 | 2025-08-19 08:24:33.984339 | 0 days 00:00:00.090259 | 8.616304 | l2 | COMPLETE |
| 19 | 19 | 0.975461 | 2025-08-19 08:24:33.643295 | 2025-08-19 08:24:33.728425 | 0 days 00:00:00.085130 | 11.683396 | l2 | COMPLETE |
| 29 | 29 | 0.973547 | 2025-08-19 08:24:34.503523 | 2025-08-19 08:24:34.588109 | 0 days 00:00:00.084586 | 12.07462 | l2 | COMPLETE |


In [11]:
study2.best_value

0.9754610759659041

In [12]:
# Create a DataFrame of trials
trial2_df = study2.trials_dataframe()

# Keep only the trials that tie for the best score, sorted by C
# (np.isclose avoids relying on exact floating-point equality)
tied = np.isclose(trial2_df['value'], study2.best_value)
sorted_trial2 = trial2_df.loc[tied].sort_values(by='params_C')

# Grab the lowest value for C
best_c = sorted_trial2['params_C'].iloc[0]

print(best_c)

7.740553627565581


Now that we know the best values of `C` and `penalty`, we could train a final model using these values.
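
A minimal sketch of that final step is shown below. The value of `C` is the one printed above, and the penalty is the `l2` setting shared by the tied trials. The train/test split is an assumption added here so there is something to score against; because the study was tuned on the full dataset, this test score is not a clean estimate, and in a real workflow you would hold out the test set before tuning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Assumed hold-out split, for illustration only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=136
)

best_c = 7.740553627565581  # lowest C among the tied best trials

final_model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(C=best_c, penalty='l2', solver='lbfgs',
                               max_iter=1000, random_state=136)),
])
final_model.fit(X_train, y_train)

test_accuracy = final_model.score(X_test, y_test)
print("hold-out accuracy:", test_accuracy)
```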

## Summary

Optuna gives us a way to search smarter, not harder.  
We’ve seen how to:

- Define flexible, conditional search spaces
- Use pipelines and cross-validation with Optuna
- Interpret search behaviour through plots
- Pick models based on performance and simplicity

### Now it's time to head over to the second practical where you'll apply this strategy to the steel plates dataset!