# Hyperparameter Tuning

We’ve selected our classification models, but we can't move straight to classification: first we need to tune how each model is built. Because we’re working with a small dataset, the main risk is overfitting. To address this, we tune the hyperparameters with cross-validated **Grid Search**.

In [None]:
import pandas as pd
import warnings

from CogniPredictAD.visualization import Visualizer
from CogniPredictAD.preprocessing import ADNIPreprocessor

from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings("ignore", category=UserWarning)

pd.set_option("display.max_rows", 116)
pd.set_option("display.max_columns", 40)
pd.set_option("display.max_info_columns", 40)

## Loading the Dataset
Open the training dataset with Pandas.

In [2]:
# Open the dataset with pandas
dataset = pd.read_csv("../data/pretrain.csv")
viz = Visualizer(dataset)
display(dataset.shape)
display(dataset)

(1934, 47)

Unnamed: 0,DX,AGE,PTGENDER,PTEDUCAT,APOE4,CDRSB,ADAS11,ADAS13,ADASQ4,MMSE,RAVLT_immediate,RAVLT_learning,RAVLT_forgetting,RAVLT_perc_forgetting,LDELTOTAL,TRABSCOR,FAQ,mPACCdigit,mPACCtrailsB,Ventricles,...,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogPtTotal,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,EcogSPTotal,ABETA,TAU,PTAU,FDG,PTDEMOGROUP,MARRIED
0,AD,80.9,1,14,0.0,6.5,29.33,42.33,10.0,21.0,15.0,1.0,3.0,100.0000,0.0,300.0,19.0,-20.06920,-18.356400,62224.0,...,3.14286,3.00000,3.00000,3.2,2.50000,2.75,2.94595,4.000,3.44444,2.66667,3.00000,3.66667,3.75,3.47222,,,,1.04262,6,0
1,LMCI,82.2,1,20,0.0,1.5,12.33,20.33,5.0,24.0,29.0,0.0,5.0,83.3333,2.0,155.0,4.0,-10.20060,-10.777900,85816.0,...,,,,,,,,,,,,,,,,,,1.08058,6,1
2,LMCI,71.2,1,19,0.0,1.0,6.00,8.00,2.0,26.0,51.0,2.0,-2.0,-18.1818,2.0,106.0,2.0,-5.90200,-6.457590,38223.0,...,2.75000,2.55556,2.28571,3.2,3.83333,3.50,2.92308,,,,,,,,,,,1.41455,6,1
3,CN,75.5,0,20,0.0,0.0,3.00,6.00,3.0,30.0,61.0,7.0,3.0,20.0000,19.0,58.0,0.0,3.19941,3.001880,61111.0,...,1.75000,1.33333,1.00000,1.0,1.16667,1.00,1.25641,1.000,1.00000,1.00000,1.00000,1.00000,1.00,1.00000,762.0,200.6,18.84,1.11882,6,1
4,CN,81.5,0,19,0.0,0.0,3.67,7.67,3.0,29.0,54.0,7.0,4.0,28.5714,11.0,54.0,0.0,-1.16303,-0.101632,44690.2,...,1.87500,1.55556,1.00000,1.0,1.33333,1.75,1.44737,1.375,1.11111,1.00000,1.20000,1.16667,1.75,1.23684,,,,,6,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1929,LMCI,64.6,0,14,2.0,2.0,12.00,22.00,10.0,27.0,31.0,1.0,7.0,100.0000,0.0,62.0,9.0,-10.37820,-9.369810,28016.0,...,,,,,,,,,,,,,,,588.0,417.1,39.86,,6,1
1930,LMCI,82.9,1,18,0.0,1.5,13.33,23.33,9.0,28.0,33.0,1.0,3.0,42.8571,5.0,79.0,1.0,-9.18102,-7.013430,48243.3,...,2.12500,1.12500,1.16667,2.0,2.50000,2.00,1.78378,2.625,1.62500,1.16667,1.40000,1.83333,2.00,1.81081,,,,1.06861,6,1
1931,LMCI,76.8,1,12,0.0,1.0,11.33,16.33,4.0,25.0,27.0,2.0,1.0,16.6667,3.0,300.0,1.0,-9.94141,-10.624000,29502.0,...,,,,,,,,,,,,,,,874.1,153.2,13.45,,6,1
1932,LMCI,74.6,1,19,1.0,2.0,17.00,27.00,10.0,26.0,32.0,1.0,7.0,100.0000,3.0,102.0,8.0,-13.05080,-10.521600,78282.0,...,1.87500,1.33333,1.00000,1.2,1.33333,1.00,1.33333,3.125,2.87500,1.85714,1.33333,2.25000,3.25,2.55882,520.3,350.2,32.49,1.11678,6,1


## Split the Class Label from the Training Dataset

In [3]:
y_train = dataset['DX']
X_train = dataset.drop(columns=['DX'])

## ADNIPreprocessor Class

Based on the *Data Preprocessing notebook*, we developed a class that reproduces all of its data-cleaning operations. Placing it inside each pipeline keeps the preprocessing consistent and lets cross-validation evaluate the models properly.

**`ADNIPreprocessor`** is a scikit-learn–compatible transformer for ADNI data preprocessing. It detects and converts integer-like columns, scales and imputes missing values with KNN, creates safe ratio variables, and normalizes MRI measures by intracranial volume (ICV). It can also perform optional hybrid class balancing with undersampling and SMOTENC oversampling.

During fitting, the class computes means and standard deviations for numeric features and detects integer-like columns. The transform step applies imputation, integer conversion, ratio creation (`TAU/ABETA`, `PTAU/ABETA`), and MRI normalization, then removes redundant features. 

Core methods include `fit`, `transform`, `fit_transform`, and `get_feature_names_out`. 
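
To make the fit/transform split concrete, here is a minimal sketch of a scikit-learn-style transformer. It is not the real `ADNIPreprocessor`: the class name `RatioSketch`, the `eps` parameter, the output column names, and the use of mean imputation in place of KNN imputation are all assumptions made for illustration, and the values are toy data.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RatioSketch(BaseEstimator, TransformerMixin):
    """Hypothetical, simplified stand-in; NOT the real ADNIPreprocessor."""

    def __init__(self, eps=1e-8):
        self.eps = eps  # denominators smaller than this are treated as unsafe

    def fit(self, X, y=None):
        # Statistics are learned only from the data passed to fit (the training fold)
        self.means_ = X.mean(numeric_only=True)
        return self

    def transform(self, X):
        X = X.copy()
        # Mean imputation stands in for the KNN imputation described above
        X = X.fillna(self.means_)
        # "Safe" ratio: a near-zero denominator yields NaN instead of inf
        denom = X["ABETA"].where(X["ABETA"].abs() > self.eps)
        X["TAU_ABETA"] = X["TAU"] / denom
        X["PTAU_ABETA"] = X["PTAU"] / denom
        return X


# Toy data for illustration only
train = pd.DataFrame({"ABETA": [762.0, 588.0], "TAU": [200.6, 417.1], "PTAU": [18.84, 39.86]})
new = pd.DataFrame({"ABETA": [874.1], "TAU": [np.nan], "PTAU": [13.45]})

sketch = RatioSketch().fit(train)
out = sketch.transform(new)
print(out)
```

The missing `TAU` value in `new` is filled with the mean learned from `train`, not from `new` itself. This is the property that matters when the transformer is refitted on every cross-validation fold.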


In [4]:
preprocessing = ADNIPreprocessor()

## Classification Model Choices

We will tune the following classification models:

In [5]:
classifiers = {
    'Decision Tree': Pipeline([
        ('pre', preprocessing), 
        ('clf', DecisionTreeClassifier(random_state=42, class_weight='balanced', max_depth=5))
    ]),
    'Random Forest': Pipeline([
        ('pre', preprocessing),
        ('clf', RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1))
    ]),
    'Extra Trees': Pipeline([
        ('pre', preprocessing),
        ('clf', ExtraTreesClassifier(random_state=42, class_weight='balanced', n_jobs=-1))
    ]),
    'Adaptive Boosting': Pipeline([
        ('pre', preprocessing),
        ('clf', AdaBoostClassifier(random_state=42, estimator=DecisionTreeClassifier(class_weight='balanced')))
    ]),
    'Multinomial Logistic Regression': Pipeline([
        ('pre', preprocessing),
        ('scl', StandardScaler()), 
        ('clf', LogisticRegression(random_state=42, solver='saga', max_iter=2000, class_weight='balanced'))
    ])
}

## Grid Search 


Our goal is to select hyperparameters that maximize cross-validated performance while reducing the risk of overfitting, which is high because the dataset is small. We do this with **Grid Search**.

In [None]:
def run_gridsearch(X_train, y_train, pipelines, param_grids, cv=5, scoring='balanced_accuracy'):
    """
    Runs GridSearchCV on multiple classifiers with their respective parameter grids.
    Ignores classifiers that fail during fitting and continues with the others.

    Parameters
    ----------
    X_train : DataFrame or array
        Training features.
    y_train : Series or array
        Training labels.
    pipelines : dict
        Dictionary with model names as keys and pipeline objects as values.
    param_grids : dict
        Dictionary with model names as keys and parameter grids as values.
    cv : int, default=5
        Number of folds for cross-validation.
    scoring : str, default='balanced_accuracy'
        Scoring metric to optimize.

    Returns
    -------
    best_models : dict
        Dictionary containing best estimator, parameters, and score for each classifier.
    """
    best_models = {}
    errors = {}
    cv_scores = {}
    
    # GridSearch by model
    for name, clf in pipelines.items():
        print(f"\nRunning GridSearch for {name} ...")
        param_grid = param_grids.get(name, {})
        
        grid = GridSearchCV(
            estimator=clf,
            param_grid=param_grid,
            cv=cv,
            scoring=scoring,
            n_jobs=-1,
            verbose=1,
            error_score='raise'
        )
        
        try:
            grid.fit(X_train, y_train)
            best_models[name] = {
                "best_estimator": grid.best_estimator_,
                "best_params": grid.best_params_,
                "best_score": grid.best_score_
            }
            print(f"Best params for {name}: {grid.best_params_}")
            print(f"Best {scoring}: {grid.best_score_:.4f}")
            
            # Save fold-by-fold cross-validation scores of the best model
            best_clf = grid.best_estimator_
            scores = cross_val_score(best_clf, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
            cv_scores[name] = scores
            best_models[name]["cv_scores"] = scores  # keep the fold scores in the returned dict
            print(f"{name} CV scores: {scores}")
        
        except Exception as e:
            print(f"Classifier {name} failed: {e}")
            errors[name] = str(e)
    
    return best_models


`run_gridsearch` takes the *training data*, a dictionary of *pipelines*, and their respective *param_grids*, and applies **GridSearchCV** to each model. For each classifier, it builds a grid search with the chosen metric and number of folds, runs it, and prints the best parameters and score together with the fold-by-fold scores of the best estimator. A classifier that raises an error is reported and skipped. The function returns a dictionary that holds, for each model, the best fitted estimator, its optimal parameters, and the corresponding mean cross-validated score.
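
The core of a single iteration can be sketched on synthetic data. This is only an illustration: the data come from `make_classification`, not from ADNI, and `StandardScaler` is an assumed stand-in for the preprocessing step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; NOT the ADNI dataset
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

pipe = Pipeline([
    ("scl", StandardScaler()),  # assumed stand-in for the preprocessing step
    ("clf", DecisionTreeClassifier(random_state=42)),
])

grid = GridSearchCV(
    estimator=pipe,
    param_grid={"clf__max_depth": [2, 4], "clf__min_samples_leaf": [2, 8]},
    cv=5,
    scoring="f1_macro",
)
grid.fit(X, y)

# The same three pieces of information that run_gridsearch stores per model
result = {
    "best_estimator": grid.best_estimator_,
    "best_params": grid.best_params_,
    "best_score": grid.best_score_,
}
print(result["best_params"], round(result["best_score"], 4))
```

Because the whole pipeline is the estimator, every step is refitted on the training part of each fold, so no statistics leak from the validation part.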

We now define the parameter grid to search for each classifier.

In [7]:
param_grids = {
    'Decision Tree': {
        'clf__criterion': ['gini', 'entropy'],
        'clf__min_samples_split': [2, 4, 8],
        'clf__min_samples_leaf': [2, 4, 8],
        'clf__ccp_alpha': [0.0, 0.001, 0.005, 0.01, 0.05]
    },
    'Random Forest': {
        'clf__n_estimators': [50, 75, 100],
        'clf__max_depth': [None, 6, 4],
        'clf__min_samples_leaf': [2, 4, 8],
        'clf__max_features': [0.5, 0.8, 1.0, 'sqrt'],
        'clf__criterion': ['gini', 'entropy']
    },
    'Extra Trees': {
        'clf__n_estimators': [50, 75, 100],
        'clf__max_depth': [None, 6, 4],
        'clf__min_samples_leaf': [2, 4, 8],
        'clf__max_features': [0.5, 0.8, 1.0, 'sqrt'],
        'clf__criterion': ['gini', 'entropy']
    },
    'Adaptive Boosting': {
        'clf__n_estimators': [50, 75, 100],
        'clf__learning_rate': [0.01, 0.05, 0.1],
        'clf__estimator__max_depth': [None, 6, 4],
        'clf__estimator__min_samples_leaf': [2, 4, 8],
        'clf__estimator__criterion': ['gini', 'entropy']
    },
    'Multinomial Logistic Regression': {
        'clf__C': [0.01, 0.1, 1.0, 10.0],
        'clf__penalty': ['l1', 'l2']
    }
}
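
Each key follows the `step__parameter` convention, so `clf__criterion` sets `criterion` on the pipeline step named `clf`. The size of a grid is the product of the lengths of its value lists, which can be checked with scikit-learn's `ParameterGrid`. The sketch below copies two of the grids and assumes 5 folds; the variable names are ours.

```python
from sklearn.model_selection import ParameterGrid

# Two of the grids from the cell above, copied verbatim
decision_tree_grid = {
    'clf__criterion': ['gini', 'entropy'],
    'clf__min_samples_split': [2, 4, 8],
    'clf__min_samples_leaf': [2, 4, 8],
    'clf__ccp_alpha': [0.0, 0.001, 0.005, 0.01, 0.05],
}
random_forest_grid = {
    'clf__n_estimators': [50, 75, 100],
    'clf__max_depth': [None, 6, 4],
    'clf__min_samples_leaf': [2, 4, 8],
    'clf__max_features': [0.5, 0.8, 1.0, 'sqrt'],
    'clf__criterion': ['gini', 'entropy'],
}

n_folds = 5  # assumed number of cross-validation folds
counts = {}
for name, grid in [("Decision Tree", decision_tree_grid),
                   ("Random Forest", random_forest_grid)]:
    counts[name] = len(ParameterGrid(grid))  # number of parameter combinations
    print(f"{name}: {counts[name]} candidates x {n_folds} folds = {counts[name] * n_folds} fits")
```

This gives 90 candidates (450 fits) for the Decision Tree and 216 candidates (1080 fits) for the Random Forest, which is a quick way to estimate the cost of a grid before launching the search.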


We will use **5-fold cross-validation**.

In [8]:
n_cross_validation = 5

### No Sampling

We use the **F1 macro** score, the unweighted mean of the per-class F1 scores. Because every class counts equally, the grid search is pushed toward hyperparameters that balance precision and recall across all classes, rather than toward a model that scores well only by predicting the majority class.
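
A toy example shows why this matters. The labels below are invented for illustration: a degenerate model that always predicts the majority class reaches a high accuracy but a low F1 macro.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 8 samples of a majority class and 2 of a minority class
y_true = ["CN"] * 8 + ["AD"] * 2
# A degenerate model that always predicts the majority class
y_pred = ["CN"] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
f1_per_class = f1_score(y_true, y_pred, average=None, labels=["CN", "AD"], zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}")   # 0.800
print(f"F1 per class (CN, AD) = {f1_per_class}")
print(f"F1 macro = {f1_macro:.3f}")  # 0.444
```

The minority class contributes an F1 of 0, which halves the macro average even though 80% of the predictions are correct.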

Now let's run the Grid Search (this will take a while).

In [9]:
bmc = run_gridsearch(X_train=X_train, y_train=y_train, pipelines=classifiers, param_grids=param_grids, cv=n_cross_validation, scoring='f1_macro')


Running GridSearch for Decision Tree ...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Best params for Decision Tree: {'clf__ccp_alpha': 0.005, 'clf__criterion': 'entropy', 'clf__min_samples_leaf': 8, 'clf__min_samples_split': 2}
Best f1_macro: 0.9062
Decision Tree CV scores: [0.91334176 0.89529202 0.90040767 0.90857668 0.91316229]

Running GridSearch for Random Forest ...
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best params for Random Forest: {'clf__criterion': 'entropy', 'clf__max_depth': 6, 'clf__max_features': 1.0, 'clf__min_samples_leaf': 4, 'clf__n_estimators': 100}
Best f1_macro: 0.9180
Random Forest CV scores: [0.90835019 0.92199984 0.92498265 0.93386079 0.90096899]

Running GridSearch for Extra Trees ...
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best params for Extra Trees: {'clf__criterion': 'entropy', 'clf__max_depth': None, 'clf__max_features': 1.0, 'clf__min_samples_leaf': 4, 'clf__n_estimators': 75}
Best f1_mac

### Sampling

We again use the **F1 macro** score. Accuracy becomes a poor guide here because balancing alters the class distribution the model is trained on, so it no longer reflects how well each class is actually recognized. The F1 macro score evaluates precision and recall for each class separately and gives equal weight to all classes, regardless of their proportions.

We rewrite the pipelines to add the sampling steps.

In [10]:
categorical_features = [
    X_train.columns.get_loc("PTGENDER"),
    X_train.columns.get_loc("APOE4")
]

# Approximate number of training rows in each cross-validation split
n_total_fold = (len(X_train) // n_cross_validation) * (n_cross_validation - 1)
# Target number of rows per class (4 diagnostic classes)
n_per_class = n_total_fold // 4

# Undersampling strategy: only for classes larger than n_per_class
undersample_dict = {"CN": n_per_class, "LMCI": n_per_class}

# Oversampling strategy: only for classes smaller than n_per_class
oversample_dict = {"EMCI": n_per_class, "AD": n_per_class}

classifiers = {
    'Decision Tree': Pipeline([
        ('pre', preprocessing), 
        ('rus', RandomUnderSampler(sampling_strategy=undersample_dict, random_state=42)),
        ('smotenc', SMOTENC(categorical_features=categorical_features, sampling_strategy=oversample_dict, random_state=42)),
        ('clf', DecisionTreeClassifier(random_state=42, class_weight='balanced', max_depth=5))
    ]),
    'Random Forest': Pipeline([
        ('pre', preprocessing),
        ('rus', RandomUnderSampler(sampling_strategy=undersample_dict, random_state=42)),
        ('smotenc', SMOTENC(categorical_features=categorical_features, sampling_strategy=oversample_dict, random_state=42)),
        ('clf', RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1))
    ]),
    'Extra Trees': Pipeline([
        ('pre', preprocessing),
        ('rus', RandomUnderSampler(sampling_strategy=undersample_dict, random_state=42)),
        ('smotenc', SMOTENC(categorical_features=categorical_features, sampling_strategy=oversample_dict, random_state=42)),
        ('clf', ExtraTreesClassifier(random_state=42, class_weight='balanced', n_jobs=-1))
    ]),
    'Adaptive Boosting': Pipeline([
        ('pre', preprocessing),
        ('rus', RandomUnderSampler(sampling_strategy=undersample_dict, random_state=42)),
        ('smotenc', SMOTENC(categorical_features=categorical_features, sampling_strategy=oversample_dict, random_state=42)),
        ('clf', AdaBoostClassifier(random_state=42, estimator=DecisionTreeClassifier(class_weight='balanced')))
    ]),
    'Multinomial Logistic Regression': Pipeline([
        ('pre', preprocessing),
        ('scl', StandardScaler()),
        ('rus', RandomUnderSampler(sampling_strategy=undersample_dict, random_state=42)),
        ('smotenc', SMOTENC(categorical_features=categorical_features, sampling_strategy=oversample_dict, random_state=42)), 
        ('clf', LogisticRegression(random_state=42, solver='saga', max_iter=2000, class_weight='balanced'))
    ])
}
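
The per-class target used by both samplers follows from simple arithmetic on the dataset size. As a worked sketch, using the 1934 rows reported by `dataset.shape` and the four class labels that appear in the sampling dictionaries:

```python
n_rows = 1934            # rows in the training set, from dataset.shape
n_cross_validation = 5   # number of folds
n_classes = 4            # CN, EMCI, LMCI, AD

# Approximate number of rows used for training in each cross-validation split
n_total_fold = (n_rows // n_cross_validation) * (n_cross_validation - 1)
# Target number of rows per class after resampling
n_per_class = n_total_fold // n_classes

print(n_total_fold, n_per_class)  # 1544 386
```

Each training split is therefore resampled toward roughly 386 rows per class. The integer divisions make this a slight underestimate of the real split size, which leaves a small safety margin for the classes that are undersampled.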


Then we start the Grid Search (this will take a while). 

In [11]:
bmcs = run_gridsearch(X_train=X_train, y_train=y_train, pipelines=classifiers, param_grids=param_grids, cv=n_cross_validation, scoring='f1_macro')


Running GridSearch for Decision Tree ...
Fitting 5 folds for each of 90 candidates, totalling 450 fits
Best params for Decision Tree: {'clf__ccp_alpha': 0.0, 'clf__criterion': 'gini', 'clf__min_samples_leaf': 8, 'clf__min_samples_split': 2}
Best f1_macro: 0.9007
Decision Tree CV scores: [0.88899843 0.90211446 0.91595396 0.90893799 0.88748017]

Running GridSearch for Random Forest ...
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best params for Random Forest: {'clf__criterion': 'entropy', 'clf__max_depth': 6, 'clf__max_features': 1.0, 'clf__min_samples_leaf': 4, 'clf__n_estimators': 50}
Best f1_macro: 0.9150
Random Forest CV scores: [0.90787286 0.91139614 0.92945414 0.92783247 0.89843544]

Running GridSearch for Extra Trees ...
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best params for Extra Trees: {'clf__criterion': 'gini', 'clf__max_depth': None, 'clf__max_features': 1.0, 'clf__min_samples_leaf': 8, 'clf__n_estimators': 75}
Best f1_macro: 0.912
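
The reported best score is the mean of the five fold scores, so the two runs can be compared fold by fold. The sketch below copies the fold scores printed in the two logs for the models whose output is complete; the dictionary layout is our own choice.

```python
import numpy as np

# Fold-by-fold F1 macro scores copied from the two grid-search logs
fold_scores = {
    "Decision Tree": {
        "no sampling": [0.91334176, 0.89529202, 0.90040767, 0.90857668, 0.91316229],
        "sampling": [0.88899843, 0.90211446, 0.91595396, 0.90893799, 0.88748017],
    },
    "Random Forest": {
        "no sampling": [0.90835019, 0.92199984, 0.92498265, 0.93386079, 0.90096899],
        "sampling": [0.90787286, 0.91139614, 0.92945414, 0.92783247, 0.89843544],
    },
}

summary = {}
for model, runs in fold_scores.items():
    for setting, scores in runs.items():
        # Mean reproduces the "Best f1_macro" line; std shows the spread across folds
        summary[(model, setting)] = (np.mean(scores), np.std(scores))
        mean, std = summary[(model, setting)]
        print(f"{model:13s} {setting:11s} mean={mean:.4f} std={std:.4f}")
```

Reading the mean next to the standard deviation helps judge whether the gap between the sampled and unsampled runs is larger than the variation between folds.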

## Final Considerations

We will use these hyperparameters.

In [None]:
# No Sampling
classifiers_1 = {
    'Decision Tree': DecisionTreeClassifier(
        random_state=42,
        class_weight='balanced',
        criterion='entropy',
        max_depth=5,
        min_samples_split=2,
        min_samples_leaf=8,
        ccp_alpha=0.005
    ),

    'Random Forest': RandomForestClassifier(
        random_state=42,
        class_weight='balanced',
        n_jobs=-1,
        criterion='entropy',
        max_depth=6,
        max_features=1.0,
        min_samples_leaf=4,
        n_estimators=100
    ),

    'Extra Trees': ExtraTreesClassifier(
        random_state=42,
        class_weight='balanced',
        n_jobs=-1,
        criterion='entropy',
        max_depth=None,
        max_features=1.0,
        min_samples_leaf=4,
        n_estimators=75
    ),

    'Adaptive Boosting': AdaBoostClassifier(
        random_state=42,
        estimator=DecisionTreeClassifier(
            class_weight='balanced',
            criterion='gini',
            max_depth=4,
            min_samples_leaf=4
        ),
        learning_rate=0.1,
        n_estimators=50
    ),

    'Multinomial Logistic Regression': Pipeline([
        ('scl', StandardScaler()),
        ('clf', LogisticRegression(
            random_state=42,
            solver='saga',
            max_iter=2000,
            class_weight='balanced',
            penalty='l1',
            C=0.1
        ))
    ])
}


# Sampling
classifiers_2 = {
    'Decision Tree Sampled': DecisionTreeClassifier(
        random_state=42,
        class_weight='balanced',
        criterion='gini',
        max_depth=5,  
        min_samples_split=2,
        min_samples_leaf=8,
        ccp_alpha=0.0
    ),

    'Random Forest Sampled': RandomForestClassifier(
        random_state=42,
        class_weight='balanced',
        n_jobs=-1,
        criterion='entropy',
        max_depth=6,
        max_features=1.0,
        min_samples_leaf=4,
        n_estimators=50
    ),

    'Extra Trees Sampled': ExtraTreesClassifier(
        random_state=42,
        class_weight='balanced',
        n_jobs=-1,
        criterion='gini',
        max_depth=None,
        max_features=1.0,
        min_samples_leaf=8,
        n_estimators=75
    ),

    'Adaptive Boosting Sampled': AdaBoostClassifier(
        random_state=42,
        estimator=DecisionTreeClassifier(
            class_weight='balanced',
            criterion='gini',
            max_depth=4,
            min_samples_leaf=8
        ),
        learning_rate=0.1,
        n_estimators=50
    ),

    'Multinomial Logistic Regression Sampled': Pipeline([
        ('scl', StandardScaler()),
        ('clf', LogisticRegression(
            random_state=42,
            solver='saga',
            max_iter=2000,
            class_weight='balanced',
            penalty='l1',
            C=1.0
        ))
    ])
}
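
Copying the values by hand, as above, is easy to get wrong. An alternative is to pass the `best_params` dictionary straight to `set_params`. The sketch below uses the Decision Tree parameters reported in the no-sampling run; `StandardScaler` is an assumed stand-in for the preprocessing step.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Best parameters reported for the Decision Tree in the no-sampling run
best_params = {
    "clf__ccp_alpha": 0.005,
    "clf__criterion": "entropy",
    "clf__min_samples_leaf": 8,
    "clf__min_samples_split": 2,
}

pipe = Pipeline([
    ("scl", StandardScaler()),  # assumed stand-in for the preprocessing step
    ("clf", DecisionTreeClassifier(random_state=42, class_weight="balanced", max_depth=5)),
])
# The step__parameter keys are routed to the matching pipeline step
pipe.set_params(**best_params)

tree = pipe.named_steps["clf"]
print(tree.get_params()["criterion"], tree.get_params()["ccp_alpha"])
```

Parameters that were not part of the grid, such as `max_depth=5`, keep the values given in the constructor.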
