# Base model training and hyperparameter optimisation

This notebook trains four base regressors and performs hyperparameter optimisation using `RandomizedSearchCV`.
Each model includes a feature engineering step that prunes:
- highly correlated RDKit descriptors
- very rare and very common Morgan fingerprint bits

The best hyperparameters for each model are saved to JSON for reuse in the ensemble stacking notebook.

## Input data

This notebook operates on `data/processed/train_val.csv`, created in `01_data_loading.ipynb`.

The final test set (`final_test.csv`) is **not used** here. Model selection and tuning are performed on the
train/validation pool using cross-validation.

## Note on search settings (quick test)

When `QUICK_TEST_RUN` is `True`, the number of cross-validation folds and search iterations is set to
very small values (`cv=2`, `n_iter=1`) for fast pipeline testing (e.g. validating the stacking workflow).

For final results, set `QUICK_TEST_RUN = False` and rerun the notebook.

In [None]:
QUICK_TEST_RUN = False
CV = 2 if QUICK_TEST_RUN else 3
N_ITER = 1 if QUICK_TEST_RUN else 50

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Load the training/validation pool

from mp.io import get_repo_root

ROOT = get_repo_root()

train_val_data = pd.read_csv(ROOT / "data/processed/train_val.csv", index_col=0)

## Models

The following base regressors are trained and tuned in this notebook:

- **CatBoost Regressor**
- **XGBoost Regressor**
- **k-Nearest Neighbours (KNN) Regressor**
- **Feed-forward Neural Network (TensorFlow/Keras)**

Each model is implemented as a **scikit-learn compatible estimator**, enabling a unified
hyperparameter optimisation workflow using `RandomizedSearchCV`.  
This ensures consistent cross-validation, scoring, and comparison across all model families.
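
As a rough illustration of what "scikit-learn compatible" means in practice, the sketch below wraps a column filter and a regressor behind a single estimator whose preprocessing threshold and model hyperparameter are both plain `__init__` arguments. The class `ThresholdedKNN` and its `min_variance` parameter are hypothetical and are not the `mp.models` implementation; they only show why feature-engineering thresholds can be sampled in the same search as model hyperparameters.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.neighbors import KNeighborsRegressor


class ThresholdedKNN(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper: a column filter plus KNN behind one estimator API."""

    def __init__(self, min_variance=0.0, n_neighbors=5):
        # Store arguments unchanged so get_params / set_params / clone work.
        self.min_variance = min_variance
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # The column mask is learned only from the data passed to fit.
        self.keep_ = X.var(axis=0) > self.min_variance
        self.model_ = KNeighborsRegressor(n_neighbors=self.n_neighbors)
        self.model_.fit(X[:, self.keep_], y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return self.model_.predict(X[:, self.keep_])


rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(size=40), np.ones(40)])  # second column is constant
y_demo = 2.0 * X_demo[:, 0]

est = ThresholdedKNN(min_variance=1e-8, n_neighbors=3).fit(X_demo, y_demo)
params = est.get_params()  # both the threshold and the KNN setting are searchable
```

Because every tunable value is an `__init__` argument, a search can clone the estimator and set `min_variance` and `n_neighbors` by name, and the filter is refit inside each cross-validation fold.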

## Feature engineering pipeline

All models share a common, custom **feature-engineering pipeline**, implemented in the
`FeatureEngineer` class and applied identically across models.

Key responsibilities of the pipeline include:

- generation of chemically motivated ratio and fraction features from RDKit descriptors
- removal of uninformative Morgan fingerprint bits using frequency-based thresholds
- pruning of highly correlated numeric features via a **priority-based, deterministic** rule
- learning NaN-imputation, normalisation, and encoding statistics from **training data only**
- enforcing a consistent feature set and column order across all datasets

The feature-engineering logic is **fit exclusively on training data** and then applied unchanged
to validation and test splits, preventing information leakage during model evaluation and ensembling.
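
The fit-on-train, apply-unchanged pattern can be sketched as follows. This is not the `FeatureEngineer` code: the helper names `fit_fp_filter` and `fit_corr_pruner` are hypothetical, and using column order as the pruning priority is an assumption made for illustration.

```python
import pandas as pd


def fit_fp_filter(train_fp: pd.DataFrame, min_freq: float, max_freq: float) -> list:
    """Return fingerprint columns whose on-bit frequency lies in [min_freq, max_freq]."""
    freq = train_fp.mean(axis=0)  # fraction of training molecules with the bit set
    return freq[(freq >= min_freq) & (freq <= max_freq)].index.tolist()


def fit_corr_pruner(train_num: pd.DataFrame, threshold: float) -> list:
    """Walk columns in order; drop any that correlates above threshold with a kept one."""
    corr = train_num.corr().abs()
    kept = []
    for col in train_num.columns:  # column order acts as the (assumed) priority
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return kept


train_fp = pd.DataFrame({
    "bit_a": [0] * 10,     # never set: too rare
    "bit_b": [1, 0] * 5,   # set in half of the molecules
    "bit_c": [1] * 10,     # always set: too common
})
train_num = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],  # perfectly correlated with x
    "z": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
val_fp = pd.DataFrame({"bit_a": [1, 1], "bit_b": [0, 1], "bit_c": [0, 1]})

# Decisions are made on training data only ...
keep_bits = fit_fp_filter(train_fp, min_freq=0.05, max_freq=0.95)
keep_num = fit_corr_pruner(train_num, threshold=0.95)

# ... and reused as-is on validation data, whatever its own bit frequencies are.
val_fp_filtered = val_fp[keep_bits]
```

Because the loop visits columns in a fixed order and never depends on random state, the same training data always yields the same kept columns.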

In [None]:
from mp.models.catboost_fe import CatBoostFEModel
from mp.models.xgb_fe import XGBFEModel
from mp.models.knn_fe import KNNFEModel
from mp.models.nn_fe import NNFEModel
from mp.models.model_optimization import run_random_search_cv

In [None]:
CB = CatBoostFEModel()
XGB = XGBFEModel()
KNN = KNNFEModel()
NN = NNFEModel()

## Hyperparameter search spaces

The parameter distributions defined below represent **refined search spaces**
based on prior exploratory runs and model-specific experimentation.

Rather than broad, uninformative ranges, these distributions focus on
regions of the hyperparameter space that consistently produced strong
cross-validation performance, so `RandomizedSearchCV` spends its limited
iterations more efficiently.
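
When reading the ranges in this notebook, note that `scipy.stats.uniform(loc, scale)` covers `[loc, loc + scale]` rather than `[low, high]`, and that `randint(low, high)` excludes `high`. For example, `uniform(0.90, 0.07)` spans 0.90 to 0.97, and `randint(1, 3)` yields 1 or 2. The short check below is only an illustration, with sample sizes and seeds chosen arbitrarily.

```python
from scipy.stats import loguniform, randint, uniform

lr = uniform(0.04, 0.02)         # loc=0.04, scale=0.02 -> values in [0.04, 0.06]
depth = randint(6, 9)            # integers 6, 7, 8 (upper bound excluded)
l2 = loguniform(1.5e-5, 6.0e-5)  # log-uniform between the two bounds

lr_support = lr.support()        # (lower, upper) bounds of the frozen distribution
depth_values = sorted(set(depth.rvs(size=500, random_state=0).tolist()))
l2_samples = l2.rvs(size=500, random_state=0)
```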

In [None]:
from scipy.stats import randint, uniform, loguniform

# The search spaces below were narrowed over several earlier tuning rounds
CB_cv_params = {
    
    # ---- CatBoost hyperparameters ----
    "depth": randint(6, 9),                        
    "learning_rate": uniform(0.04, 0.02),          
    "n_estimators": randint(1300, 1600),           
    "l2_leaf_reg": uniform(14, 3),                
    "bagging_temperature": uniform(0.6, 0.3),      
    "random_strength": uniform(0.0, 0.6),         
    "rsm": uniform(0.5, 0.25),                     
    "min_data_in_leaf": randint(7, 12),            
    "leaf_estimation_iterations": randint(1, 3),  
    "grow_policy": ["SymmetricTree", "Depthwise", "Lossguide"], 

    # ---- FeatureEngineer thresholds ----
    "corr_threshold": uniform(0.90, 0.07),        
    "min_fp_freq": uniform(0.01, 0.025),           
    "max_fp_freq": uniform(0.90, 0.08),           
}

XGB_cv_params = {

    # ---- Core XGBoost hyperparameters ----
    "learning_rate": uniform(0.05, 0.03),       
    "n_estimators": randint(1900, 2100),          
    "max_depth": randint(7, 9),                  
    "min_child_weight": randint(8, 12),           

    "subsample": uniform(0.8, 0.15),            
    "colsample_bytree": uniform(0.75, 0.20),     
    "colsample_bylevel": uniform(0.90, 0.10),     

    "reg_alpha": uniform(0.6, 0.3),               
    "reg_lambda": uniform(7.0, 3.0),              
    "gamma": uniform(0.0, 0.25),                 

    # ---- Fixed for speed & stability ----
    "tree_method": ["hist"],
    "max_bin": [256],

    # ---- FeatureEngineer thresholds  ----
    "corr_threshold": uniform(0.90, 0.06),        
    "min_fp_freq": uniform(0.01, 0.02),          
    "max_fp_freq": uniform(0.8, 0.15),          
}

KNN_cv_params = {
  
    # ---- KNN hyperparameters ----
    "n_neighbors": randint(6, 9),        
    "weights": ["distance"],              
    "p": [1],                             
    "leaf_size": randint(55, 71),         
    "algorithm": ["auto"],

    # ---- FeatureEngineer thresholds ----
    "corr_threshold": uniform(0.9, 0.1),  
    "min_fp_freq": uniform(0.01, 0.05),    
    "max_fp_freq": uniform(0.75, 0.25),    
}

NN_cv_params = { 

    # ---- Training dynamics ----
    "learning_rate": loguniform(1e-4, 5e-3),  
    "patience":      randint(40, 55),
    "batch_size":    randint(64, 128),

    # ---- Regularisation ----
    "dropout":       uniform(0.22, 0.14),        
    "l2_strength":   loguniform(1.5e-5, 6.0e-5),
}


## Hyperparameter optimisation

Each base model is tuned using `RandomizedSearchCV` with a common workflow:

- A fixed holdout split (20%) is created for a quick sanity check
- Hyperparameters are optimised using cross-validation on the remaining data
- Mean absolute error (MAE) is used as the optimisation metric
- The best estimator from cross-validation is evaluated once on the holdout set

This approach provides a balance between robust model selection (via cross-validation)
and a lightweight check for overfitting, while reserving a final, untouched test set
for the ensemble evaluation.
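
The workflow above can be sketched roughly as follows. The implementation of `run_random_search_cv` is not shown in this notebook, so `random_search_with_holdout` is a hypothetical stand-in: it takes arrays instead of a DataFrame and target column name, and the `seed` argument and `neg_mean_absolute_error` scoring string are assumptions.

```python
import numpy as np
from scipy.stats import randint
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor


def random_search_with_holdout(X, y, estimator, param_dist, cv, n_iter, n_jobs=1, seed=0):
    """Hypothetical stand-in: tune with CV, then score once on a 20% holdout."""
    X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.2, random_state=seed)
    search = RandomizedSearchCV(
        estimator,
        param_distributions=param_dist,
        n_iter=n_iter,
        cv=cv,
        scoring="neg_mean_absolute_error",  # sklearn maximises scores, so MAE is negated
        n_jobs=n_jobs,
        random_state=seed,
        refit=True,  # refit the best configuration on all non-holdout data
    )
    search.fit(X_tr, y_tr)
    holdout_mae = mean_absolute_error(y_ho, search.best_estimator_.predict(X_ho))
    return search, holdout_mae


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=100)

search, holdout_mae = random_search_with_holdout(
    X_demo, y_demo, KNeighborsRegressor(), {"n_neighbors": randint(2, 6)}, cv=2, n_iter=2
)
cv_mae = -search.best_score_  # flip the sign back to a positive MAE
```

The holdout rows never enter the search, so a large gap between `cv_mae` and `holdout_mae` is a cheap warning sign of overfitting to the cross-validation folds.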

In [None]:
import time

SEARCH_CFG = {
    "cv": CV,       
    "n_iter": N_ITER,    
}

models = {
    "CatBoost": (CB, CB_cv_params, -1),
    "XGB":      (XGB, XGB_cv_params, -1),
    "KNN":      (KNN, KNN_cv_params, 1),
    "NN":       (NN, NN_cv_params, 1),
}

search_results = {}
holdout_scores = {}
timings = {}

for name, (estimator, param_dist, n_jobs) in models.items():
    print(f"\nRunning RandomizedSearchCV for {name}...")

    start = time.time()

    search, holdout_mae = run_random_search_cv(
        train_val_data,
        "mpC",
        SEARCH_CFG["cv"],
        SEARCH_CFG["n_iter"],
        estimator,
        param_dist,
        n_jobs,
    )

    elapsed = time.time() - start

    search_results[name] = search
    holdout_scores[name] = holdout_mae
    timings[name] = elapsed

    print(f"{name} finished in {elapsed:.1f} s | Holdout MAE = {holdout_mae:.2f}")

In [None]:
summary = pd.DataFrame([
    {
        "model": name,
        # Sign flipped on the assumption that the search scores with negated MAE
        "cv_mae": -search_results[name].best_score_,
        "holdout_mae": holdout_scores[name],
        "seconds": timings[name],
    }
    for name in models
]).sort_values("holdout_mae")

summary

In [None]:
summary.to_csv(ROOT / "reports/model_search_summary.csv", index=False)

## Outputs

Best hyperparameters are saved to:
- `reports/best_params/catboost.json`
- `reports/best_params/xgb.json`
- `reports/best_params/knn.json`
- `reports/best_params/nn.json`

These are used directly in the ensemble stacking notebook.
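
One detail to watch when writing search results to JSON: sampled values may be NumPy scalars, and the standard-library `json` module rejects types such as `numpy.int64`. Whether `save_json` already handles this is not shown here, so the helper `to_builtin` below is a hypothetical sketch of one way to normalise a parameter dictionary before saving and to confirm it round-trips.

```python
import json

import numpy as np


def to_builtin(params: dict) -> dict:
    """Convert NumPy scalars to plain Python values so json can serialise them."""
    return {k: (v.item() if isinstance(v, np.generic) else v) for k, v in params.items()}


# Illustrative values only; these are not results from this notebook.
best_params = {
    "depth": np.int64(7),
    "learning_rate": np.float64(0.05),
    "grow_policy": "Depthwise",
}

payload = json.dumps(to_builtin(best_params), indent=2, sort_keys=True)
restored = json.loads(payload)  # what a downstream notebook would read back
```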

In [None]:
from mp.io import save_json

out_dir = ROOT / "reports" / "best_params"

for name, search in search_results.items():
    save_json(search.best_params_, out_dir / f"{name.lower()}.json")