<a href="https://colab.research.google.com/github/MelMacLondon/ML/blob/main/tuning_practical1_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Tuning Practical - Part 1


In [6]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

from scipy.stats import randint, uniform

from lightgbm import LGBMClassifier

### The dataset

We will again be looking at the Steel Plates Faults dataset from the UCI repository.  
[Data dictionary here](https://archive.ics.uci.edu/dataset/198/steel+plates+faults)

This time, however, rather than exploring hyperparameters by hand, you'll use more systematic techniques: random search and grid search. You'll also explore how pre-processing steps such as PCA affect model performance.

When tuning by hand, we ended up with an unseen test score (f1 macro) of 0.798. Let's see if we can beat that.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


To do this, you'll work through the following steps:

1. Build a feature engineering pipeline
2. Use random search to find promising hyperparameter values
3. Use grid search to finely explore around the promising values
4. Evaluate the final model on our held out test data

In [23]:
# Load our data

# /content/drive/MyDrive/Colab Notebooks/data/faults_processed.csv

# df = pd.read_csv('data/faults_processed.csv')
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/faults_processed.csv')
df.head()

Unnamed: 0,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,Maximum_of_Luminosity,...,Edges_X_Index,Edges_Y_Index,Outside_Global_Index,LogOfAreas,Log_X_Index,Log_Y_Index,Orientation_Index,Luminosity_Index,SigmoidOfAreas,Fault
0,42,50,270900,270944,267,17,44,24220,76,108,...,0.4706,1.0,1.0,2.4265,0.9031,1.6435,0.8182,-0.2913,0.5822,Pastry
1,645,651,2538079,2538108,108,10,30,11397,84,123,...,0.6,0.9667,1.0,2.0334,0.7782,1.4624,0.7931,-0.1756,0.2984,Pastry
2,829,835,1553913,1553931,71,8,19,7972,99,125,...,0.75,0.9474,1.0,1.8513,0.7782,1.2553,0.6667,-0.1228,0.215,Pastry
3,853,860,369370,369415,176,13,45,18996,99,126,...,0.5385,1.0,1.0,2.2455,0.8451,1.6532,0.8444,-0.1568,0.5212,Pastry
4,1289,1306,498078,498335,2409,60,260,246930,37,126,...,0.2833,0.9885,1.0,3.3818,1.2305,2.4099,0.9338,-0.1992,1.0,Pastry


In [10]:
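# Split the data into a target (y) and features (X)
# Note: 'Unnamed: 0' looks like a row index saved with the CSV. It is kept as a
# feature here to match the outputs below, but because the rows appear to be
# ordered by fault type it may leak the target; consider dropping it.

y = df.Fault
X = df.drop(['Fault'], axis=1)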

In [11]:
# Split the data into train/test using 90/10

# 90% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=136, train_size=0.9)

**1. Build a Pre-processing and Model Pipeline**

To begin, you’ll construct a pipeline that:

- Adds PCA components **alongside** the original features
- Trains a LightGBM classifier

Instead of replacing your features with PCA, you'll augment them using a `FeatureUnion`. This allows the model to access both the original features and the PCA-derived components.

As we’re working on a 7-class classification problem, we'll need to make sure our model is set up to handle multiclass targets (via the `objective` and `num_class` arguments).

Using the `feature_union` defined below, build a pipeline called `pipe` that:

1. Applies the feature union
2. Trains a `LGBMClassifier` for multiclass classification

Remember that a pipeline takes a list of tuples `('name', transformation)`; if in doubt, look at the docstring. You will also want to pass `verbose=-1` to the classifier unless you want to see a great many messages.

For consistency later on, call the `feature_union` 'features' and the model 'clf'.

In [12]:
# Creates new columns with PCA components and sticks them onto our dataframe

feature_union = FeatureUnion([
    ('original', 'passthrough'),
    ('pca', Pipeline([
        ('scaler', StandardScaler()),  # Scaling required before PCA
        ('pca', PCA(n_components=5))
    ]))
])

In [13]:
# Your code here...
# pipe = Pipeline([...])

pipe = Pipeline([
    ('features', feature_union),
    ('clf', LGBMClassifier(objective='multiclass', num_class=7, random_state=136, verbose=-1))
])


**2. Define a Random Search Space**

In this step, you'll define a hyperparameter space to use with `RandomizedSearchCV`.  

> Note: In random search, we define distributions to sample from rather than fixed lists.


Create a dictionary called `param_dist` that defines:

- `'features__pca__pca__n_components'`: an integer between 1 and 10
- `'clf__num_leaves'`: an integer between 15 and 32
- `'clf__learning_rate'`: a float between 0.01 and 1.0
- `'clf__reg_lambda'`: a float between 0 and 1.0
- `'clf__n_estimators'`: an integer between 100 and 500

You'll need to use `randint` and `uniform` to do this. Feel free to search more hyperparameters if you want!

In [14]:
# Your code here...
# param_dist = {...}

param_dist = {
    'features__pca__pca__n_components': randint(1, 11),
    'clf__num_leaves': randint(15, 33),
    'clf__learning_rate': uniform(0.01, 0.99),
    'clf__reg_lambda': uniform(0.0, 1.0),
    'clf__n_estimators': randint(100, 501)
}
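
Two details of the `scipy.stats` distributions are easy to get wrong: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` spans `loc` to `loc + scale` rather than `loc` to `scale`. A quick sketch to check the ranges by sampling (the sample size and seed are arbitrary choices):

```python
from scipy.stats import randint, uniform

int_dist = randint(1, 11)         # integers 1..10; the upper bound is exclusive
float_dist = uniform(0.01, 0.99)  # floats from 0.01 up to 0.01 + 0.99

int_samples = int_dist.rvs(size=1000, random_state=0)
float_samples = float_dist.rvs(size=1000, random_state=0)

print(int_samples.min(), int_samples.max())
print(float_samples.min(), float_samples.max())
```

This is why the search space above uses `randint(15, 33)` for "15 to 32" and `uniform(0.01, 0.99)` for "0.01 to 1.0".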


**3. Run a Randomised Search Using f1 Macro**

Now you’ll run `RandomizedSearchCV` on your pipeline using the parameter distributions you defined.

Because this is a multiclass classification problem with class imbalance, we’ll use f1 macro score as our evaluation metric.

> `f1_macro` gives equal weight to each class, regardless of how many examples it contains, which is useful when some classes are less frequent.
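
As a worked illustration of that equal weighting, the sketch below computes per-class F1 by hand on a made-up two-class example (the labels `common` and `rare` are hypothetical). One mistake on the rare class barely moves accuracy but pulls the macro average down noticeably.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, using F1 = 2*TP / (2*TP + FP + FN) for each class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# 8 common and 2 rare examples; one rare example is misclassified as common
y_true = ['common'] * 8 + ['rare'] * 2
y_pred = ['common'] * 8 + ['common', 'rare']

scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)  # unweighted mean over classes
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(scores)
print(f"macro f1: {macro_f1:.3f}, accuracy: {accuracy:.3f}")
```

Here the rare class scores 2/3 and the common class 16/17, so the macro average is about 0.804 while accuracy is 0.9.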

Create an instance of `RandomizedSearchCV` with the following settings:

- `estimator`: your pipeline from Task 1
- `param_distributions`: from Task 2
- `scoring='f1_macro'`
- `cv=strat_cv`
- `n_iter=10`
- `random_state=136`
- `verbose=0` (set it higher for progress messages, though these may not show up in the notebook when `n_jobs=-1`)
- `n_jobs=-1`  (run the fits in parallel on all available cores)

Then fit it on your `X_train`, `y_train`.

In [15]:
# This lets the searches also use stratified folds, mimicking our train/test/reality split
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=136)
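
To illustrate what stratification buys us, here is a small sketch with hypothetical imbalanced labels (90 of one class, 10 of another). Each validation fold should receive its fair share of the minority class instead of leaving that to chance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 90 of class 0, 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # feature values do not affect how folds are built

demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=136)

minority_counts = []
for train_idx, val_idx in demo_cv.split(X_demo, y_demo):
    # Count how many minority-class rows landed in this validation fold
    minority_counts.append(int(y_demo[val_idx].sum()))

print(minority_counts)  # each fold should hold 2 of the 10 minority rows
```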

In [16]:
# Your code here...
# random_search = ...

random_search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    scoring='f1_macro',
    cv=strat_cv,
    n_iter=10,
    random_state=136,
    verbose=0,  # Set higher for progress messages; they may not display in the notebook with n_jobs=-1
    n_jobs=-1
)

random_search.fit(X_train, y_train)


In [17]:
print(random_search.best_score_)
print(random_search.best_params_)

0.8235194416260778
{'clf__learning_rate': np.float64(0.16168377753079866), 'clf__n_estimators': 233, 'clf__num_leaves': 24, 'clf__reg_lambda': np.float64(0.15241899846790052), 'features__pca__pca__n_components': 5}


- How did your random search do?
- Did it find a good f1 macro?
- Do the "best" hyperparameters look sensible?

**4. Fine Tune with Grid Search**

Now that you’ve run a random search, you’ll use those results to define a smaller, focused grid for more precise tuning.

This is a common and efficient strategy:
1. Use random search to locate promising regions of the hyperparameter space
2. Use grid search to fine-tune specific values around the best result

Create a new `param_grid` using `random_search.best_params_` as a starting point. You should:

- Select 1 or 2 parameters to refine (e.g. `learning_rate`, `num_leaves`)
- Create a small set of nearby values (3–5 options each)
- Use `GridSearchCV` with:
  - Your original pipeline
  - `scoring='f1_macro'`
  - `cv=strat_cv`
  - `n_jobs=-1`

You don’t need to search all hyperparameters again, just fine tune the most impactful ones! Be sure to include the other hyperparameters in your model, but set them to the best values found during the random search.
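
Keeping the grid small matters because the number of fits is the product of the list lengths times the number of folds. A quick way to count candidates before running anything is `ParameterGrid`; the values below are hypothetical and chosen only to illustrate the arithmetic.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: three values for each refined parameter, one for each fixed one
demo_grid = {
    'clf__learning_rate': [0.08, 0.16, 0.24],
    'clf__n_estimators': [200, 230, 260],
    'clf__num_leaves': [14, 24, 34],
    'clf__reg_lambda': [0.15],
    'features__pca__pca__n_components': [5],
}

n_candidates = len(ParameterGrid(demo_grid))  # 3 * 3 * 3 * 1 * 1
n_folds = 5
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
```

Adding a fourth value to each of the three refined parameters would raise this from 27 to 64 candidates, which is why refining only one or two parameters is usually enough.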

In [18]:
# Putting this here again to save scrolling!

random_search.best_params_

{'clf__learning_rate': np.float64(0.16168377753079866),
 'clf__n_estimators': 233,
 'clf__num_leaves': 24,
 'clf__reg_lambda': np.float64(0.15241899846790052),
 'features__pca__pca__n_components': 5}

In [19]:
# Your code here...
# param_grid = {...}

best_lr = random_search.best_params_['clf__learning_rate']
best_n_est = random_search.best_params_['clf__n_estimators']
best_leaves = random_search.best_params_['clf__num_leaves']

param_grid = {
    'clf__learning_rate': [best_lr * 0.5, best_lr, best_lr * 1.5],
    'clf__n_estimators': [max(30, best_n_est - 30), best_n_est, best_n_est + 30],
    'clf__num_leaves': [max(10, best_leaves - 10), best_leaves, best_leaves + 10],
    'clf__reg_lambda': [random_search.best_params_['clf__reg_lambda']],
    'features__pca__pca__n_components': [random_search.best_params_['features__pca__pca__n_components']]
}

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring='f1_macro',
    cv=strat_cv,
    n_jobs=-1
)

grid.fit(X_train, y_train)

In [20]:
print(grid.best_score_)
print(grid.best_params_)

0.8299273766768624
{'clf__learning_rate': np.float64(0.24252566629619798), 'clf__n_estimators': 233, 'clf__num_leaves': 14, 'clf__reg_lambda': np.float64(0.15241899846790052), 'features__pca__pca__n_components': 5}


In [21]:
print(random_search.best_score_)

0.8235194416260778


- How did the best f1 macro score from the grid search compare to the random search?
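
A difference of a few thousandths between two cross-validated scores can be smaller than the fold-to-fold spread. One way to check is `cv_results_`, which records the mean and standard deviation of the score across folds for every candidate. The sketch below uses synthetic data and a decision tree standing in for the pipeline (both are assumptions for illustration); the same columns are available on `grid` and `random_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 3-class dataset standing in for the steel faults data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=136
)

demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=136),
    param_grid={'max_depth': [2, 4, 8]},
    scoring='f1_macro',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=136),
)
demo_search.fit(X_demo, y_demo)

# One row per candidate, with the spread across folds next to the mean
results = pd.DataFrame(demo_search.cv_results_)
summary = results[['param_max_depth', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

If the gap between the top candidates is well inside `std_test_score`, the ranking between them is probably not reliable.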

**5. Final Evaluation on the Test data**

Now that you’ve tuned your model using cross validation, it’s time to evaluate the final selected pipeline on our held out test data.

> This mimics a real-world scenario, where you train and tune your model using training data only, then evaluate it once on new, unseen data.

You need to:
1. Use `grid.best_estimator_` to get the final tuned pipeline.
2. Predict on your `X_test` data.
3. Print the f1 macro score to evaluate performance.

In [22]:
# Your code here...
# test_preds = ...

best_model = grid.best_estimator_
test_preds = best_model.predict(X_test)
test_f1 = f1_score(y_test, test_preds, average='macro')

print(f"Test f1 macro score: {test_f1:.3f}")

Test f1 macro score: 0.795




- How does the test f1 score compare to your best cross validation score?
- If there's a sizable difference, why might that be?
- Did you beat my previous score of 0.798?

#### A small confession

The score of 0.798 didn't come from hand searching the hyperparameter space. Instead, I ran a very large grid search (it took around 3 hours to run locally). If you found that this more targeted search didn't outperform my broad grid search, that's OK!

#### Model tuning isn't just about better results, it's about:

- Reducing guesswork
- Avoiding overfitting to lucky splits
- Making experiments reproducible
- Saving time and compute at scale

### Optional Extension: Why We're Not Using Early Stopping

In many real-world projects, early stopping is used with models like LightGBM to avoid overfitting.  
It allows the model to stop training once performance stops improving on a validation set.

However, early stopping is not used in this practical for a few reasons:

- It requires manual control of validation splits, which doesn't fit neatly into cross-validation.
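- `Pipeline` and `GridSearchCV` come from sklearn, while LightGBM is a separate library, so combining them can be awkward.
- You would need to pass extra arguments to `.fit()` (an `eval_set`, plus `early_stopping_rounds` or, in recent LightGBM versions, an early-stopping callback). A pipeline can route these to the classifier, but the validation data would not go through the pipeline's feature steps first, so the setup gets awkward.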

For now, we've focused on building clean, reproducible tuning workflows using `Pipeline` and `GridSearchCV`.  