# HYPERPARAMETER OPTIMIZATION AND BENCHMARK TESTING

Leveraging the stability-selected feature subsets identified in the previous module, this notebook focuses on the training and fine-tuning of our three candidate classifiers.

The objective is to find, for each model, the hyperparameter configuration that maximizes the Matthews Correlation Coefficient (MCC) under 5-fold cross-validation on the Development Set. MCC is used because it remains informative under class imbalance. The selected configurations are then evaluated once on the unseen Benchmark Set.
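
To make the choice of MCC concrete, the following minimal sketch computes the multi-class MCC from raw labels and contrasts it with accuracy on a deliberately imbalanced toy example. The function name `mcc_from_labels` and the toy labels are illustrative assumptions, not part of the pipeline; in the notebook itself the metric comes from `sklearn.metrics.matthews_corrcoef`.

```python
from collections import Counter
from math import sqrt


def mcc_from_labels(y_true, y_pred):
    """Multi-class Matthews Correlation Coefficient computed from raw labels."""
    s = len(y_true)                                     # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))     # correctly classified samples
    t_k = Counter(y_true)                               # how often each class truly occurs
    p_k = Counter(y_pred)                               # how often each class is predicted
    classes = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    denominator = sqrt(
        (s ** 2 - sum(p_k[k] ** 2 for k in classes))
        * (s ** 2 - sum(t_k[k] ** 2 for k in classes))
    )
    # Convention: return 0 when the score is undefined (e.g. only one class is predicted)
    return numerator / denominator if denominator else 0.0


# Hypothetical toy data: 90 samples of class 0, 10 samples of class 1
y_true = [0] * 90 + [1] * 10
y_majority = [0] * 100          # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")
print(f"MCC      = {mcc_from_labels(y_true, y_majority):.2f}")
```

In this toy case the majority-class predictor reaches an accuracy of 0.90 while its MCC is 0, which is the behaviour that motivates optimizing MCC rather than accuracy.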

**Workflow for this module:**

1.  **Data Preparation and Validation Strategy**: We load the three feature signatures (tree-specific, SVM-specific, and linear-specific) serialized in the previous step. The data split is strictly respected: the Development Set (Sets 1 to 5) is used exclusively for cross-validation and hyperparameter tuning, while the Benchmark Set remains unseen until the final inference step, where it acts as a proxy for real-world deployment.

2.  **Tree-based Model Optimization (Balanced Random Forest)**: Using the tree-specific feature set, we employ Bayesian optimization to explore the hyperparameter space efficiently. We tune `n_estimators`, `max_depth`, the split constraints `min_samples_split` and `min_samples_leaf`, and `max_features` to control model complexity and limit overfitting while maximizing MCC.

3.  **Kernel-based Model Optimization (SVM)**: For the Support Vector Machine, we use the feature subset identified by the RBF-kernel stability selection. The Bayesian search tunes the regularization parameter `C` and the kernel coefficient `gamma`, looking for a non-linear decision boundary that separates the disease stages in the high-dimensional feature space.

4.  **Linear Model Optimization (Logistic Regression)**: We optimize the Logistic Regression model using its specific feature set. We tune the inverse regularization strength `C`, where smaller values impose a stronger penalty on the coefficients. This serves as our linear baseline to determine whether the added complexity of the non-linear models (RF and SVM) translates into a statistically significant performance gain.

5.  **Final Inference and Artifact Serialization**: Once the optimal hyperparameters are identified, each model is refit on the full Development Set and generates predictions for the Benchmark Set. These predictions, along with the trained model objects, are serialized (`.pkl` files) and passed to the next module (`04_Analysis_Results.ipynb`) for the final comparative analysis, confusion matrix visualization, and clinical interpretation.

In [1]:
import numpy as np
import pandas as pd
# !pip install scikit-optimize
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import matthews_corrcoef, confusion_matrix, classification_report
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.linear_model import LogisticRegression
import pickle

## 1.  Data Preparation and Validation Strategy

In [2]:
import joblib

# Feature signatures selected in the previous module (one list per model family)
features_to_use_tree = joblib.load('../data/best_features_list_tree.pkl')
features_to_use_svm = joblib.load('../data/best_features_list_svm.pkl')
features_to_use_lr = joblib.load('../data/best_features_list_lr.pkl')
df = pd.read_csv("../data/data_refined_stratifkfold.csv")

# Development Set = Sets 1 to 5 (Set values are compared as strings); Benchmark Set is held out
dev_df = df[df['Set'].isin(['1', '2', '3', '4', '5'])]
bench_df = df[df['Set'] == 'Benchmark']
x_trainval, y_trainval = dev_df.drop(columns=['Set', 'Stage']), dev_df['Stage']
x_bench, y_bench = bench_df.drop(columns=['Set', 'Stage']), bench_df['Stage']
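
Note that an integer `cv=5`, as used in the searches below, lets scikit-learn build its own five stratified folds for a classifier rather than reusing the `Set` labels stored in the CSV. If the intent is to cross-validate on those predefined folds, one option is to pass explicit (train, validation) index pairs as `cv`. The sketch below builds such pairs from a list of fold labels; the helper name `folds_from_labels` and the toy labels are assumptions made for illustration.

```python
def folds_from_labels(fold_labels):
    """Return one (train_idx, val_idx) pair of positional indices per distinct fold label."""
    splits = []
    for fold in sorted(set(fold_labels)):
        val_idx = [i for i, lab in enumerate(fold_labels) if lab == fold]
        train_idx = [i for i, lab in enumerate(fold_labels) if lab != fold]
        splits.append((train_idx, val_idx))
    return splits


# Hypothetical fold labels for ten development-set rows, in row order
toy_labels = ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5']
for train_idx, val_idx in folds_from_labels(toy_labels):
    print(f"validate on rows {val_idx}, train on {len(train_idx)} rows")
```

With the notebook's data, the labels could be taken from `df.loc[x_trainval.index, 'Set'].tolist()`, and the resulting list of pairs (converted to arrays if needed) passed as `cv=` in place of `5`.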

## 2. Tree-based Model Optimization (Balanced Random Forest)

In [3]:
x_trainval_tree=x_trainval[features_to_use_tree]
x_bench_tree=x_bench[features_to_use_tree]
pipeline_tree = Pipeline([("rf", BalancedRandomForestClassifier(sampling_strategy='all', replacement=True, random_state=42, n_jobs=-1))])
search_space_tree = {
    "rf__n_estimators": Integer(100, 1000),      # Number of trees in the forest
    "rf__max_depth": Integer(10, 100),           # Maximum tree depth (limits overfitting)
    "rf__min_samples_split": Integer(2, 20),     # Minimum samples required to split an internal node
    "rf__min_samples_leaf": Integer(1, 10),      # Minimum samples required at a leaf node
    "rf__max_features": Categorical(['sqrt', 'log2'])  # Number of features considered at each split
}

bayes_tree = BayesSearchCV(
    estimator=pipeline_tree,
    search_spaces=search_space_tree,
    scoring="matthews_corrcoef",   
    n_jobs=-1,
    refit=False,                 
    random_state=42,
    cv=5,
    n_iter=100 
)

bayes_tree.fit(x_trainval_tree, y_trainval)

print("\n[Best parameters found:] ")
print(bayes_tree.best_params_)
print(f"[Best MCC validation] {bayes_tree.best_score_:.4f}")


[Best parameters found:] 
OrderedDict({'rf__max_depth': 81, 'rf__max_features': 'sqrt', 'rf__min_samples_leaf': 7, 'rf__min_samples_split': 2, 'rf__n_estimators': 677})
[Best MCC validation] 0.2538
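
As an illustration of what `sampling_strategy='all'` with `replacement=True` means conceptually, each tree in a balanced random forest is fitted on a sample in which every class is drawn down to the size of the smallest class. The sketch below mimics that sampling step in plain Python. It is a simplified stand-in, not the imbalanced-learn implementation, and both the helper `balanced_bootstrap` and the stage labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(labels, rng):
    """Draw, with replacement, the same number of row indices from every class."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    n_per_class = min(len(rows) for rows in by_class.values())     # minority class size
    sample = []
    for lab in sorted(by_class):
        sample.extend(rng.choices(by_class[lab], k=n_per_class))   # sampling with replacement
    return sample


# Hypothetical imbalanced labels: 50 / 30 / 8 samples per stage
labels = ['I'] * 50 + ['II'] * 30 + ['III'] * 8
rng = random.Random(42)
sample = balanced_bootstrap(labels, rng)
print(Counter(labels[i] for i in sample))
```

Every class ends up with the same number of draws, so no single tree is dominated by the majority stage.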


In [4]:
# Refit on the full Development Set with the best hyperparameters (the search used refit=False)
pipeline_tree.set_params(**bayes_tree.best_params_).fit(x_trainval_tree, y_trainval)
# Predict the Benchmark Set
bench_pred_tree = pipeline_tree.predict(x_bench_tree)
# Save the fitted pipeline
with open("../models/tree_model.pkl", 'wb') as f:
    pickle.dump(pipeline_tree, f)
# Compute the MCC on the Benchmark Set
mcc_bayes_tree = matthews_corrcoef(y_bench, bench_pred_tree)
print(f"MCC on testing set (bayesian search): {mcc_bayes_tree}")

MCC on testing set (bayesian search): 0.19601429225839503


## 3. Kernel-based Model Optimization (SVM)

In [5]:
x_trainval_svm=x_trainval[features_to_use_svm]
x_bench_svm=x_bench[features_to_use_svm]

pipeline_svm = Pipeline([("svm" , SVC(cache_size=1500,class_weight='balanced'))])
search_space_svm = {
        "svm__kernel": Categorical(["rbf"]),
        "svm__C": Real(0.01, 1000, prior="log-uniform"),                
        "svm__gamma": Real(1e-5, 100, prior="log-uniform"), 
    }
#set up the BayesSearch
bayes_svm = BayesSearchCV(
    estimator=pipeline_svm,
    search_spaces=search_space_svm,
    scoring="matthews_corrcoef",   
    n_jobs=-1,
    refit=False,                 
    random_state=42,
    cv=5,
    n_iter=100
)
bayes_svm.fit(x_trainval_svm, y_trainval)  # here we perform the bayes search

print("\n[Best parameters found:] ")
print(bayes_svm.best_params_)
print(f"[Best MCC validation] {bayes_svm.best_score_:.4f}")


[Best parameters found:] 
OrderedDict({'svm__C': 8.890674932384222, 'svm__gamma': 0.015527855782441872, 'svm__kernel': 'rbf'})
[Best MCC validation] 0.3301
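
The `prior="log-uniform"` setting used for `C` and `gamma` means candidate values are spread evenly across orders of magnitude rather than evenly across the raw interval. A minimal sketch of that idea in plain Python follows; the helper `sample_log_uniform` is illustrative and is not how scikit-optimize draws its points internally.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
draws = [sample_log_uniform(0.01, 1000, rng) for _ in range(10_000)]

# Under a log-uniform prior, each decade receives roughly the same share of draws
for lo in (0.01, 0.1, 1, 10, 100):
    share = sum(lo <= d < lo * 10 for d in draws) / len(draws)
    print(f"[{lo:g}, {lo * 10:g}): {share:.2%}")
```

A plain uniform prior on the same interval would place about 90% of the draws between 100 and 1000, leaving the small-`C` region almost unexplored.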


In [6]:
# Refit on the full Development Set with the best hyperparameters (the search used refit=False)
pipeline_svm.set_params(**bayes_svm.best_params_).fit(x_trainval_svm, y_trainval)
# Predict the Benchmark Set
bench_pred_svm = pipeline_svm.predict(x_bench_svm)
# Save the fitted pipeline
with open("../models/svm_model.pkl", 'wb') as f:
    pickle.dump(pipeline_svm, f)
# Compute the MCC on the Benchmark Set
mcc_bayes_svm = matthews_corrcoef(y_bench, bench_pred_svm)
print(f"MCC on testing set (bayesian search): {mcc_bayes_svm}")

MCC on testing set (bayesian search): 0.27780167190974636
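
For intuition about the tuned `gamma`, the RBF kernel scores the similarity of two samples as exp(-gamma * ||x - z||^2), so larger values make the influence of each training sample more local. The sketch below evaluates this for the two ends of the search range and for the selected `gamma` (rounded); the two feature vectors are made up for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Hypothetical feature vectors (not taken from the dataset)
x = [0.0, 1.0, 2.0]
z = [1.0, 1.0, 0.0]          # squared distance to x is 5

for gamma in (1e-5, 0.0155, 100):    # lower bound, rounded selected value, upper bound
    print(f"gamma={gamma:g}: k(x, z)={rbf_kernel(x, z, gamma):.6f}")
```

At the lower bound the two points look almost identical to the kernel, while at the upper bound their similarity is effectively zero; the selected value sits between these extremes.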


## 4. Linear Model Optimization (Logistic Regression)

In [7]:
x_trainval_lr=x_trainval[features_to_use_lr]
x_bench_lr=x_bench[features_to_use_lr]

pipeline_lr = Pipeline([
    ("lr", LogisticRegression(class_weight='balanced', max_iter=10000))
])
search_space_lr = {
    "lr__C": Real(0.01, 1000, prior="log-uniform"), 
}

bayes_lr = BayesSearchCV(
    estimator=pipeline_lr,
    search_spaces=search_space_lr,
    scoring="matthews_corrcoef",   
    n_jobs=-1,
    refit=False,                 
    random_state=42,
    cv=5,
    n_iter=100
)
bayes_lr.fit(x_trainval_lr, y_trainval)  # here we perform the bayes search

print("\n[Best parameters found:] ")
print(bayes_lr.best_params_)
print(f"[Best MCC validation] {bayes_lr.best_score_:.4f}")


[Best parameters found:] 
OrderedDict({'lr__C': 46.788604247112445})
[Best MCC validation] 0.2868


In [8]:
# Refit on the full Development Set with the best hyperparameters (the search used refit=False)
pipeline_lr.set_params(**bayes_lr.best_params_).fit(x_trainval_lr, y_trainval)
# Predict the Benchmark Set
bench_pred_lr = pipeline_lr.predict(x_bench_lr)
# Save the fitted pipeline
with open("../models/lr_model.pkl", 'wb') as f:
    pickle.dump(pipeline_lr, f)
# Compute the MCC on the Benchmark Set
mcc_bayes_lr = matthews_corrcoef(y_bench, bench_pred_lr)
print(f"MCC on testing set (bayesian search): {mcc_bayes_lr}")

MCC on testing set (bayesian search): 0.3122593945265407
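
The validation and benchmark scores printed in the cells above can be collected into a small table to see how each model's cross-validated MCC compares with its benchmark MCC. The numbers below are copied from those outputs and rounded to four decimals; the dictionary layout is only an illustrative choice.

```python
# Scores copied from the cell outputs above, rounded to four decimals
reported = {
    "Balanced Random Forest": {"cv_mcc": 0.2538, "benchmark_mcc": 0.1960},
    "SVM (RBF)":              {"cv_mcc": 0.3301, "benchmark_mcc": 0.2778},
    "Logistic Regression":    {"cv_mcc": 0.2868, "benchmark_mcc": 0.3123},
}

print(f"{'model':<24}{'CV MCC':>8}{'bench MCC':>11}{'gap':>9}")
for name, s in sorted(reported.items(), key=lambda kv: kv[1]["benchmark_mcc"], reverse=True):
    gap = s["benchmark_mcc"] - s["cv_mcc"]      # negative: benchmark score is below the CV score
    print(f"{name:<24}{s['cv_mcc']:>8.4f}{s['benchmark_mcc']:>11.4f}{gap:>+9.4f}")
```

The ranking on the Benchmark Set (Logistic Regression first) differs from the ranking under cross-validation (SVM first). Whether these differences are meaningful is left to the comparative analysis in the next module.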


## 5. Final Inference and Artifact Serialization

In [12]:
import os  # needed for os.path.join below

output_path = '../data/'
joblib.dump(bench_pred_lr, os.path.join(output_path, 'benchmark_prediction_lr.pkl'))
joblib.dump(bench_pred_svm, os.path.join(output_path, 'benchmark_prediction_svm.pkl'))
joblib.dump(bench_pred_tree, os.path.join(output_path, 'benchmark_prediction_tree.pkl'))
print("Predictions  saved in data/benchmark_prediction_lr.pkl, data/benchmark_prediction_svm.pkl and data/benchmark_prediction_tree.pkl for analysis part")

Predictions  saved in data/benchmark_prediction_lr.pkl, data/benchmark_prediction_svm.pkl and data/benchmark_prediction_tree.pkl for analysis part
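
Before handing artifacts to the next module, it can be worth confirming that a saved file loads back to an equivalent object. Below is a minimal round-trip check with the standard-library `pickle` module, using a temporary directory and a stand-in list of predictions (both hypothetical). The notebook's own prediction files are written with `joblib`, which offers the analogous `joblib.dump` / `joblib.load` pair.

```python
import os
import pickle
import tempfile


def save_and_reload(obj, path):
    """Pickle obj to path, then load it back and return the loaded copy."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a vector of predicted stages (hypothetical values)
predictions = ["I", "II", "II", "III", "I"]

with tempfile.TemporaryDirectory() as tmp_dir:
    reloaded = save_and_reload(predictions, os.path.join(tmp_dir, "predictions.pkl"))

print("round trip OK:", reloaded == predictions)
```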
