# Model Hyperparameter Tuning
In this notebook, hyperparameter tuning is performed and visualized for each model.

In [3]:
import os
import polars as pl
import pandas as pd
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

pollutant_cols = ["MONTHLYNi_concentration","MONTHLYPM10_concentration","MONTHLYPM2.5_concentration"]
datasets = ['dataset_Ni.parquet','dataset_PM10.parquet','dataset_PM25.parquet']

The model employed is **LightGBM**.  
In LightGBM models, the hyper-parameters with the greatest influence are:

- `num_leaves` – upper limit on the number of terminal leaves per tree; controls overall model complexity.  
- `max_depth` – maximum depth of each tree; prevents overly deep, over-fitted trees.  
- `min_data_in_leaf` – minimum number of samples required in every leaf; higher values smooth out noisy splits.  
- `learning_rate` – weight applied to each new tree; lower rates typically generalise better but require more trees.  
- `num_iterations` – total number of boosting rounds (trees) to train.  
- `boosting_type` – boosting variant that determines how new trees are added.  
- `feature_fraction` – fraction of features randomly selected for each tree; reduces correlation among trees.  
- `bagging_fraction` – fraction of rows (samples) randomly selected for each tree when bagging is enabled.  
- `bagging_freq` – frequency, in boosting rounds, at which row bagging is applied.
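
The list above uses LightGBM's native parameter names, while the code cells below use the scikit-learn wrapper (`LGBMRegressor`), where some of these parameters go by different keywords (for example `min_child_samples` and `n_estimators`). The mapping can be sketched as a small lookup table; `NATIVE_TO_SKLEARN` and `to_sklearn_names` are hypothetical helper names introduced here for illustration and are not part of LightGBM.

```python
# Native LightGBM parameter name -> keyword used by the scikit-learn wrapper.
# Helper names are illustrative only; they are not part of the LightGBM API.
NATIVE_TO_SKLEARN = {
    "num_leaves": "num_leaves",
    "max_depth": "max_depth",
    "min_data_in_leaf": "min_child_samples",
    "learning_rate": "learning_rate",
    "num_iterations": "n_estimators",
    "boosting_type": "boosting_type",
    "feature_fraction": "colsample_bytree",
    "bagging_fraction": "subsample",
    "bagging_freq": "subsample_freq",
}


def to_sklearn_names(native_params):
    """Rename a dict of native LightGBM parameters to scikit-learn wrapper keywords."""
    return {NATIVE_TO_SKLEARN.get(name, name): value
            for name, value in native_params.items()}


print(to_sklearn_names({"min_data_in_leaf": 20, "num_iterations": 300}))
# {'min_child_samples': 20, 'n_estimators': 300}
```

Keeping the translation in one place makes it easier to compare the grids below with the LightGBM parameter documentation.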

Because proper tuning is critical, the following three-stage procedure will be applied for each pollutant-specific model:

1. **Stage 1**  
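   Run a **Grid-Search Cross-Validation (GSCV)** over a wide grid covering the tree-structure and boosting hyper-parameters listed above (`num_leaves`, `max_depth`, `min_data_in_leaf`, `learning_rate`, `num_iterations`, `boosting_type`).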

2. **Stage 2**  
   Plot validation RMSE while varying *one* hyper-parameter at a time, keeping the others at their Stage 1 best values, to locate the most promising region for each parameter.

3. **Stage 3**  
   Build a *narrow* grid centred on the preferred values identified in Stage 2 and run a second GSCV to pinpoint the joint optimum.

> **Why two grid searches instead of tuning one parameter at a time?**  
> Although inspecting parameters individually is informative, the optimal value of any hyper-parameter depends on its combination with the rest. The refined grid in Stage 3 re-evaluates combinations, ensuring RMSE is minimised while accounting for all interactions.

Grid-Search CV is chosen over Randomised-Search CV because LightGBM trains quickly on our dataset; exhaustive enumeration is feasible and provides full coverage. For more computationally demanding models, Randomised-Search or Bayesian optimisation would be preferable.
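
To illustrate the alternative mentioned above, here is a plain-Python sketch of what a randomised search does conceptually: instead of enumerating every combination, it evaluates a fixed budget of randomly drawn candidates. `sample_candidates` is a hypothetical helper written for this illustration. It is not how `RandomizedSearchCV` is implemented, and it draws with replacement, so duplicates are possible.

```python
import random


def sample_candidates(param_grid, n_iter, seed=42):
    """Draw `n_iter` random parameter combinations from a grid of lists.

    Illustrative only: each parameter is sampled independently and with
    replacement, so the same combination can appear more than once.
    """
    rng = random.Random(seed)
    names = sorted(param_grid)  # fixed order keeps the draws reproducible
    return [{name: rng.choice(param_grid[name]) for name in names}
            for _ in range(n_iter)]


wide_grid = {
    "num_leaves": [2, 5, 10, 20, 40, 60],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 600],
}

# Evaluate only 10 of the 54 possible combinations
for candidate in sample_candidates(wide_grid, n_iter=10):
    print(candidate)
```

The cost of such a search is set by `n_iter` rather than by the size of the grid, which is why it scales better to slow models.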

## 1st Stage

In [None]:
for dataset_name, target_col in zip(datasets, pollutant_cols):
    dataset_path = os.path.join('..', 'Data', 'Final_Dataset', dataset_name)
    # Keep the file name and the DataFrame in separate variables:
    # the name is needed again below to label the results.
    data = pl.read_parquet(dataset_path).to_pandas()

    feature_cols = [
        "EURO_1", "EURO_2", "EURO_3", "EURO_4", "EURO_5", "EURO_6", "EURO_CLEAN",
        "Previous", "CITY_AREA"]

    # Drop rows with a missing feature or target value
    data = data.dropna(subset=feature_cols + [target_col])
    X = data[feature_cols]
    y = data[target_col]

    # verbosity=-1 already silences LightGBM output
    est = LGBMRegressor(
        objective='regression',
        metric='rmse',
        verbosity=-1
        )

    # Wide Stage 1 grid: 6 * 6 * 4 * 3 * 3 * 2 = 2592 candidates
    param_grid = {
        'num_leaves': [2, 5, 10, 20, 40, 60],
        'max_depth': [-1, 5, 10, 15, 25, 50],
        'min_child_samples': [1, 10, 20, 40],
        'learning_rate': [0.01, 0.05, 0.1],
        'n_estimators': [100, 300, 600],
        'boosting_type': ['gbdt', 'dart'],
    }
    # Smoke-test grid: uncomment to check the loop end to end before the full search
    # param_grid = {
    #     'num_leaves': [2],
    #     'max_depth': [-1],
    #     'min_child_samples': [1, 10],
    #     'learning_rate': [0.01],
    #     'n_estimators': [100],
    #     'boosting_type': ['gbdt']
    # }

    grid = GridSearchCV(
        est, param_grid, scoring='neg_root_mean_squared_error',
        cv=5, n_jobs=-1, verbose=2, return_train_score=True
        )

    grid.fit(X, y)

    df = pd.DataFrame(grid.cv_results_)
    pollutant = dataset_name.split('.')[0][8:]  # 'dataset_Ni.parquet' -> 'Ni'
    print(f'Results of {pollutant}')
    print("Best params:", grid.best_params_)
    # The scorer returns negated RMSE, so flip the sign to report RMSE
    print("Best CV RMSE:", -grid.best_score_)

Fitting 5 folds for each of 2592 candidates, totalling 12960 fits


KeyboardInterrupt: 
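
The candidate and fit counts reported above follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A minimal sketch of that arithmetic, using the Stage 1 grid from this notebook (`count_fits` is a hypothetical helper name):

```python
from math import prod


def count_fits(param_grid, cv=5):
    """Return (number of candidates, total number of fits) for an exhaustive grid search."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv


stage1_grid = {
    'num_leaves': [2, 5, 10, 20, 40, 60],
    'max_depth': [-1, 5, 10, 15, 25, 50],
    'min_child_samples': [1, 10, 20, 40],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 600],
    'boosting_type': ['gbdt', 'dart'],
}

n_candidates, n_fits = count_fits(stage1_grid, cv=5)
print(n_candidates, n_fits)  # 2592 12960
```

Checking this number before calling `grid.fit` gives a quick estimate of how long a search will run.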

## 2nd Stage

In [None]:
from sklearn.preprocessing import StandardScaler

def tune_and_collect(X, y, param_grid, target_param, cv=5, random_state=42):
    """Grid-search `param_grid` and return the tested values of `target_param`
    together with their mean cross-validated RMSE, sorted by parameter value."""

    # Standardise the features, keeping column names and the original index
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_scaled = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)

    # Hold out 20% of the rows; only the training part is used for the search
    X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, test_size=0.2, random_state=random_state)

    estimator = LGBMRegressor(
        objective='regression',
        metric='rmse',
        verbosity=-1,
        random_state=random_state
    )

    gs = GridSearchCV(
        estimator,
        param_grid=param_grid,
        scoring='neg_root_mean_squared_error',
        cv=cv,
        n_jobs=-1,
        return_train_score=False
    )
    gs.fit(X_tr, y_tr)

    # The scorer returns negated RMSE, so flip the sign
    results = pd.DataFrame(gs.cv_results_)
    results['mean_rmse'] = -results['mean_test_score']

    param_col = f'param_{target_param}'
    keep = results[[param_col, 'mean_rmse']].sort_values(by=param_col)

    param_values = keep[param_col].tolist()
    mean_rmse    = keep['mean_rmse'].tolist()

    return param_values, mean_rmse
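
`tune_and_collect` expects a `param_grid` in which only `target_param` takes several values, with every other parameter pinned to its Stage 1 best value. One way to build such a grid is sketched below. `one_at_a_time_grid` is a hypothetical helper, and the `best_params` values are placeholders for illustration, not results from this notebook.

```python
def one_at_a_time_grid(best_params, target_param, candidates):
    """Build a grid that varies only `target_param` and fixes all other parameters."""
    if target_param not in best_params:
        raise KeyError(f"{target_param!r} is not one of the tuned parameters")
    # GridSearchCV expects a list of values for every parameter,
    # so fixed parameters become single-element lists.
    grid = {name: [value] for name, value in best_params.items()}
    grid[target_param] = list(candidates)
    return grid


# Placeholder values for illustration only
best_params = {"num_leaves": 20, "max_depth": -1, "learning_rate": 0.05}

grid = one_at_a_time_grid(best_params, "num_leaves", [2, 5, 10, 20, 40, 60])
print(grid)
# {'num_leaves': [2, 5, 10, 20, 40, 60], 'max_depth': [-1], 'learning_rate': [0.05]}
```

The resulting dictionary can be passed as `param_grid` to `tune_and_collect` with `target_param="num_leaves"`, and the returned lists plotted against each other to obtain the Stage 2 curve for that parameter.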

## 3rd Stage
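
As described in the procedure above, Stage 3 uses a narrow grid centred on the values preferred in Stage 2. A minimal sketch of how such a grid could be generated for integer-valued parameters follows. `narrow_grid` is a hypothetical helper, and the centre values shown are placeholders, not results from this notebook.

```python
def narrow_grid(center, step, n_each_side=1, minimum=None):
    """Return evenly spaced values around `center`, optionally dropping values below `minimum`."""
    values = [center + k * step for k in range(-n_each_side, n_each_side + 1)]
    if minimum is not None:
        values = [v for v in values if v >= minimum]
    return values


# Placeholder centres for illustration only
refined_grid = {
    "num_leaves": narrow_grid(20, 5, minimum=2),
    "min_child_samples": narrow_grid(5, 5, n_each_side=2, minimum=1),
}
print(refined_grid)
# {'num_leaves': [15, 20, 25], 'min_child_samples': [5, 10, 15]}
```

A grid built this way stays small even when several parameters are refined jointly, which keeps the second search much cheaper than the first.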

In [4]:
dataset_path = os.path.join('..','Data','Final_Dataset','dataset_Ni.parquet')
dataset = pl.read_parquet(dataset_path).to_pandas()
dataset["CITY"] = dataset["CITY"].astype("category")
target_col = pollutant_cols[0]
    
feature_cols = [
    "EURO_1", "EURO_2", "EURO_3", "EURO_4", "EURO_5", "EURO_6", "EURO_CLEAN",
    "Previous","CITY_AREA"]

dataset = dataset.dropna(subset=feature_cols + [target_col])
X = dataset[feature_cols]
y = dataset[target_col]

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
X = pd.DataFrame(X, columns=feature_cols)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
# verbosity=-1 already silences LightGBM output
est = LGBMRegressor(
    objective='regression',
    metric='rmse',
    verbosity=-1
    )

param_grid = {
    'num_leaves': [2, 5, 10, 20, 40, 60],
    'max_depth': [-1, 5, 10, 15, 25, 50],
    'min_child_samples': [1, 5, 10, 20, 40],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 600, 1000, 1500],
    'boosting_type': ['gbdt', 'dart'],
}

grid = GridSearchCV(
    est, param_grid,scoring='neg_root_mean_squared_error',
    cv=5,n_jobs=-1,verbose=1,return_train_score=True
    )

grid.fit(X_train, y_train)

df = pd.DataFrame(grid.cv_results_)

print("Best params:", grid.best_params_)
print("Best CV RMSE:", -grid.best_score_)

Fitting 5 folds for each of 5400 candidates, totalling 27000 fits


KeyboardInterrupt: 

In [None]:
dataset_path = os.path.join('..','Data','Final_Dataset','dataset_PM10.parquet')
dataset = pl.read_parquet(dataset_path).to_pandas()
dataset["CITY"] = dataset["CITY"].astype("category")
target_col = pollutant_cols[1]
    
feature_cols = [
    "EURO_1", "EURO_2", "EURO_3", "EURO_4", "EURO_5", "EURO_6", "EURO_CLEAN",
    "Previous","CITY_AREA"]

dataset = dataset.dropna(subset=feature_cols + [target_col])
X = dataset[feature_cols]
y = dataset[target_col]

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
X = pd.DataFrame(X, columns=feature_cols)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
# verbosity=-1 already silences LightGBM output
est = LGBMRegressor(
    objective='regression',
    metric='rmse',
    verbosity=-1
    )

param_grid = {
    'num_leaves': [2, 5, 10, 20, 40, 60],
    'max_depth': [-1, 5, 10, 15, 25, 50],
    'min_child_samples': [1, 5, 10, 20, 40],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300, 600, 1000, 1500],
    'boosting_type': ['gbdt', 'dart'],
}

grid = GridSearchCV(
    est, param_grid,scoring='neg_root_mean_squared_error',
    cv=5,n_jobs=-1,verbose=1,return_train_score=True
    )

grid.fit(X_train, y_train)

df = pd.DataFrame(grid.cv_results_)

print("Best params:", grid.best_params_)
print("Best CV RMSE:", -grid.best_score_)