# Hyperparameters Tuning

We’ve selected our classification models, but we can't dive right into classification. The next challenge is to optimize the model construction. Since we’re working with a small dataset, the main risk is overfitting. To address this, we’ll apply hyperparameter tuning using **Grid Search**.

In [1]:
import itertools
import numpy as np
import pandas as pd
import warnings

from CogniPredictAD.visualization import Visualizer
from catboost import CatBoostClassifier
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline as Pipeline_imb
from imblearn.under_sampling import RandomUnderSampler
from lightgbm import LGBMClassifier
from scipy.stats import wilcoxon
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from statsmodels.stats.multitest import multipletests
from xgboost import XGBClassifier

warnings.filterwarnings("ignore", category=UserWarning)

pd.set_option("display.max_rows", 116)
pd.set_option("display.max_columns", 40)
pd.set_option("display.max_info_columns", 40)

## Loading the Dataset
Open the training dataset with Pandas.

In [2]:
# Open the dataset with pandas
dataset = pd.read_csv("../data/train.csv")
viz = Visualizer(dataset)
dataset.shape
display(dataset)

Unnamed: 0,DX,AGE,PTGENDER,PTEDUCAT,APOE4,MMSE,CDRSB,ADAS13,LDELTOTAL,FAQ,MOCA,TRABSCOR,RAVLT_immediate,RAVLT_learning,RAVLT_perc_forgetting,mPACCdigit,EcogPtMem,EcogPtLang,EcogPtVisspat,EcogPtPlan,EcogPtOrgan,EcogPtDivatt,EcogSPMem,EcogSPLang,EcogSPVisspat,EcogSPPlan,EcogSPOrgan,EcogSPDivatt,FDG,PTAU/ABETA,Hippocampus/ICV,Entorhinal/ICV,Fusiform/ICV,MidTemp/ICV,Ventricles/ICV,WholeBrain/ICV
0,2,77,0,16,1,28,2.5,5,1,0,24,108,47,5,63.63640,-4.84005,2.250,2.111110,1.000000,1.00,1.333330,1.00,2.375000,2.111110,2.428570,2.60,2.833330,2.75000,1.222830,0.040838,0.004524,0.001882,0.012107,0.011311,0.016977,0.706210
1,0,59,1,16,1,30,0.0,0,19,0,30,47,71,2,0.00000,5.42702,1.000,1.000000,1.000000,1.00,1.000000,1.00,1.000000,1.000000,1.000000,1.00,1.000000,1.00000,1.161970,0.020445,0.004452,0.002756,0.012935,0.014299,0.025614,0.752850
2,3,77,1,12,2,22,8.0,30,0,25,17,300,19,1,100.00000,-18.90540,2.300,1.844446,1.248572,1.58,1.366668,1.75,3.841666,2.847620,3.033334,2.97,3.166668,3.80000,0.924559,0.047131,0.002825,0.001348,0.010049,0.009701,0.053417,0.522572
3,2,82,1,20,0,26,1.5,21,4,0,24,63,35,1,85.71430,-7.95749,1.925,1.269446,1.166668,1.20,1.466668,1.60,1.891666,1.272222,1.066668,1.16,1.733332,2.10000,1.119130,0.020198,0.003736,0.002083,0.013038,0.013942,0.024176,0.637729
4,0,83,0,17,0,27,0.0,5,13,3,25,98,57,7,7.14286,-1.94841,1.250,1.333330,1.000000,1.00,1.333330,1.00,1.375000,1.111110,1.666670,1.00,1.833330,1.25000,1.279034,0.026879,0.004611,0.002170,0.011387,0.012975,0.052196,0.635279
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1929,0,72,0,18,1,30,0.0,4,11,0,26,52,42,7,18.18180,2.22837,1.500,2.333330,1.285710,1.00,2.500000,1.25,1.250000,1.000000,1.200000,1.00,1.333330,1.50000,1.416100,0.013555,0.005079,0.003304,0.014043,0.013729,0.027992,0.710296
1930,3,72,0,12,1,26,7.0,29,5,18,19,67,34,-1,100.00000,-9.28099,1.500,1.000000,1.142860,1.00,1.000000,1.00,3.250000,2.333330,2.428570,3.20,3.000000,3.50000,1.268520,0.080942,0.004383,0.001691,0.011582,0.011346,0.022499,0.711762
1931,0,70,0,17,0,29,0.0,23,10,0,20,300,31,4,42.85710,-2.30539,1.125,1.111110,1.000000,1.00,1.000000,1.00,1.000000,1.000000,1.000000,1.00,1.000000,1.00000,1.456170,0.007661,0.005042,0.002406,0.013522,0.013008,0.013065,0.711396
1932,0,84,1,12,0,30,0.5,16,13,0,26,65,27,1,80.00000,-1.42719,2.000,2.000000,2.000000,2.00,1.500000,2.00,1.625000,1.222220,1.285710,1.25,1.000000,1.66667,1.318880,0.021033,0.004567,0.002176,0.012360,0.013614,0.026801,0.663416


## Discussion about CDRSB, LDELTOTAL, and mPACCdigit

The features `CDRSB`, `LDELTOTAL`, and `mPACCdigit` have the highest scores in the **SelectKBest** method with **Kruskal–Wallis H-test**, as highlighted in the *Preprocessing Notebook*.
- `CDRSB`: **H-statistic = 1612.9154; p-value < 1 e-308**
- `LDELTOTAL`: **H-statistic = 1479.4388; p-value < 1 e-308**
- `mPACCdigit`: **H-statistic = 1366.8357; p-value = 4.629 e-296**

This raises the potential phenomenon of **sparse features**, where a few variables dominate the model, while many others contribute negligibly to the prediction. This can lead to a risk of **local overfitting**, with models that perform very well on the training dataset (here ADNIMERGE), but whose accuracy may decrease on external data.

These observations are also supported by the literature: [the study Kauppi et al., 2020 identifies `CDRSB`, `LDELTOTAL`, and `mPACCdigit` among the most important features for predicting disease diagnosis](https://www.medrxiv.org/content/10.1101/2020.11.09.20226746v3.full). This features may be highly predictive in selected cohorts such as ADNI, but their performance could tend to decline in more heterogeneous clinical populations, or with other datasets. 

In summary, this is not an intrinsic flaw in cognitive tests, but rather a possible **dataset bias**: the observed strong accuracy could reflect the specific structure of ADNI rather than universal predictive validity. 

**Since we can't determine this, I believe the best course of action is to create a predictive model that includes `CDRSB`, `LDELTOTAL`, and `mPACCdigit`, and a model that ignores them. If these three features prove more efficient at predicting only within this sample, we would still have a predictive model that tends to ignore them and is therefore still useful for prediction.**

## Build Dataset with Hybrid Sampling

In [3]:
print("Original class distribution (count):")
print(dataset['DX'].value_counts())
print("\nOriginal class distribution (percentages):")
print((dataset['DX'].value_counts(normalize=True) * 100).round(2))

Original class distribution (count):
DX
0    717
2    548
1    336
3    333
Name: count, dtype: int64

Original class distribution (percentages):
DX
0    37.07
2    28.34
1    17.37
3    17.22
Name: proportion, dtype: float64


In [4]:
# Oversampling strategy: only for classes smaller than target_count
oversample_dict = {1: 500, 3: 500}

# Undersampling strategy: only for classes larger than target_count
undersample_dict = {0: 500, 2: 500}

categorical_features = [
    dataset.columns.get_loc("PTGENDER"),
    dataset.columns.get_loc("APOE4")
]

print("\nOversample dict (SMOTENC) -> classes to increase:")
print(oversample_dict)
print("\nUndersample dict (RUS) -> classes to reduce:")
print(undersample_dict)


Oversample dict (SMOTENC) -> classes to increase:
{1: 500, 3: 500}

Undersample dict (RUS) -> classes to reduce:
{0: 500, 2: 500}


In [5]:
steps = []

smotenc = SMOTENC(
        categorical_features=categorical_features,
        sampling_strategy=oversample_dict,
        random_state=42
    )
steps.append(('smotenc', smotenc))

rus = RandomUnderSampler(
        sampling_strategy=undersample_dict,
        random_state=42
    )
steps.append(('rus', rus))

X_train = dataset.drop(columns=['DX'])
y_train = dataset['DX']

pipeline = Pipeline_imb(steps=steps)
X_res, y_res = pipeline.fit_resample(X_train, y_train)
sampled = pd.concat([pd.DataFrame(X_res, columns=X_train.columns),
                     pd.DataFrame(y_res, columns=['DX'])],
                    axis=1)

# Distribution after resampling
print("\nClass distribution after resampling (count):")
print(sampled['DX'].value_counts())
print("\nClass distribution after resampling (percentages):")
print((sampled['DX'].value_counts(normalize=True) * 100).round(2))
display(sampled.describe().T)


Class distribution after resampling (count):
DX
0    500
1    500
2    500
3    500
Name: count, dtype: int64

Class distribution after resampling (percentages):
DX
0    25.0
1    25.0
2    25.0
3    25.0
Name: proportion, dtype: float64


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
AGE,2000.0,72.625,7.289646,52.0,68.0,73.0,78.0,90.0
PTGENDER,2000.0,0.4975,0.500119,0.0,0.0,0.0,1.0,1.0
PTEDUCAT,2000.0,15.98,2.755793,4.0,14.0,16.0,18.0,20.0
APOE4,2000.0,0.5555,0.645855,0.0,0.0,0.0,1.0,2.0
MMSE,2000.0,26.977,2.837163,16.0,25.0,28.0,29.0,30.0
CDRSB,2000.0,1.830313,1.862065,0.0,0.5,1.490204,2.997755,10.0
ADAS13,2000.0,17.348,9.914335,0.0,10.0,15.0,24.0,56.0
LDELTOTAL,2000.0,6.8495,5.231686,0.0,2.0,7.0,10.0,22.0
FAQ,2000.0,4.7145,6.493233,0.0,0.0,1.0,8.0,30.0
MOCA,2000.0,22.194,4.341173,4.0,20.0,23.0,25.0,30.0


In [6]:
sampled.to_csv("../data/sampled.csv", index=False)

## Classification Model Choices

Our classification model choices will be: 

In [7]:
classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=42, class_weight='balanced'),
    'Random Forest': RandomForestClassifier(random_state=42, class_weight='balanced', n_jobs=-1),
    'Extra Trees': ExtraTreesClassifier(random_state=42, class_weight='balanced', n_jobs=-1),
    'XGBoost': XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss', verbosity=0),
    'LightGBM': LGBMClassifier(random_state=42, verbose=-1),
    'CatBoost': CatBoostClassifier(random_state=42, verbose=False, loss_function='MultiClass'),
    'Multinomial Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('logreg', LogisticRegression(random_state=42, solver='saga', max_iter=2000, class_weight='balanced'))
    ]),
    'Bagging': BaggingClassifier(random_state=42, n_jobs=-1)
}

## Grid Search 


Our goal is to select parameters that maximize model performance while reducing the risk of overfitting. Since the dataset is small and the risk of overfitting is high, we will carefully select the best hyperparameters using **Grid Search**.

In [None]:
def run_gridsearch(X_train, y_train, classifiers, param_grids, cv=5, scoring='balanced_accuracy'):
    """
    Runs GridSearchCV on multiple classifiers with their respective parameter grids.
    Ignores classifiers that fail during fitting and continues with the others.

    Parameters
    ----------
    train : DataFrame 
        Training set
    classifiers : dict
        Dictionary with model names as keys and classifier objects as values.
    param_grids : dict
        Dictionary with model names as keys and parameter grids as values.
    cv : int, default=5
        Number of folds for cross-validation.
    scoring : str, default='balanced_accuracy'
        Scoring metric to optimize.

    Return
    -------
    best_models : dict
        Dictionary containing best estimator, parameters, and score for each classifier.
    """    
    best_models = {}
    errors = {}
    
    for name, clf in classifiers.items():
        print(f"\nRunning GridSearch for {name} ...")
        param_grid = param_grids.get(name, {})
        
        grid = GridSearchCV(
            estimator=clf,
            param_grid=param_grid,
            cv=cv,
            scoring=scoring,
            n_jobs=-1,
            verbose=1,
            error_score='raise'  # Force the error to catch it
        )
        
        try:
            grid.fit(X_train, y_train)
            best_models[name] = {
                "best_estimator": grid.best_estimator_,
                "best_params": grid.best_params_,
                "best_score": grid.best_score_
            }
            print(f"Best params for {name}: {grid.best_params_}")
            print(f"Best {scoring}: {grid.best_score_:.4f}")
        
        except Exception as e:
            print(f"Classifier {name} failed: {e}")
            errors[name] = str(e)
    
    return best_models


`run_gridsearch` takes a *training dataset*, a set of *classifiers*, and their respective *grid_params* and applies **GridSearchCV** to each model. For each classifier, it constructs a grid search with the chosen metric and cross-validation, executes it, prints the best parameters and score, and returns a dictionary that collects the best trained estimator, the optimal parameters, and the corresponding performance for each model.

We create the parameter grid to compare for the classifiers.

In [9]:
param_grids = {
    'Decision Tree': {
        'criterion': ['gini', 'entropy'],
        'max_depth': [6, 5, 4, 3],
        'min_samples_split': [2, 8, 16],
        'min_samples_leaf': [1, 4, 8],
        'max_features': [0.8, 1.0],
        'ccp_alpha': [0.0, 0.001, 0.005, 0.01, 0.05]
    },
    'Random Forest': {
        'n_estimators': [50, 75, 100],
        'max_depth': [None, 6, 4],
        'min_samples_leaf': [2, 4, 8],
        'max_features': [0.5, 0.8, 1.0, 'sqrt', 'log2'],
        'criterion': ['gini', 'entropy']
    },
    'Extra Trees': {
        'n_estimators': [50, 75, 100],
        'max_depth': [None, 6, 4],
        'min_samples_leaf': [2, 4, 8],
        'max_features': [0.5, 0.8, 1.0, 'sqrt', 'log2'],
        'criterion': ['gini', 'entropy']
    },
    'XGBoost': {
        'n_estimators': [50, 75, 100],
        'learning_rate': [0.01, 0.05, 0.1],
        'max_depth': [8, 6, 3],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.5, 0.7, 1.0],
        'gamma': [0, 0.1, 0.5, 1.0],
        'reg_alpha': [0, 1],
        'reg_lambda': [0, 1]
    },
    'LightGBM': {
        'n_estimators': [50, 75, 100],
        'learning_rate': [0.01, 0.05, 0.1],
        'num_leaves': [31, 15],
        'max_depth': [8, 6, 3],
        'min_child_samples': [5, 10, 20],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.5, 0.7, 1.0],
        'reg_alpha': [0, 1],
        'reg_lambda': [0, 1]
    },
    'CatBoost': {
        'iterations': [50, 75, 100],
        'learning_rate': [0.1, 0.05],
        'depth': [8, 6, 3],
        'l2_leaf_reg': [1, 3, 7],
        'border_count': [32, 64, 128],
        'bagging_temperature': [0.0, 0.2, 0.5, 1.0],
        'random_strength': [0.5, 1, 5]
    },
    'Multinomial Logistic Regression': {
        'logreg__C': [0.01, 0.1, 1.0, 10.0],
        'logreg__penalty': ['l1','l2'] 
    },
    'Bagging': {
        'n_estimators': [50, 75, 100],
        'max_samples': [0.6, 0.8, 1.0],
        'max_features': [0.5, 0.8, 1.0],
        'bootstrap': [True, False]
    }
}


Now let's run the Grid Search (this will take a while).

### Dataset with `CDRSB`, `LDELTOTAL`, and `mPACCdigit`
For the dataset with `CDRSB`, `LDELTOTAL`, and `mPACCdigit`, we use the **f1_macro** score. These three variables dominate the explanatory variance and produce high but potentially misleading accuracy. f1_macro evaluates the unweighted average of the F1s per class, forcing the grid search to look for hyperparameters that balance precision/recall across all classes (it avoids optimizing a model that only “exploits” sparse features to predict the majority class). We might consider using f1_weighted since it is proportional to support, but it tends to approach accuracy and can mask poor performance on smaller classes.

#### No Sampling

In [10]:
bmc = run_gridsearch(X_train=X_train, y_train=y_train, classifiers=classifiers, param_grids=param_grids, scoring='f1_macro')


Running GridSearch for Decision Tree ...
Fitting 5 folds for each of 720 candidates, totalling 3600 fits
Best params for Decision Tree: {'ccp_alpha': 0.005, 'criterion': 'entropy', 'max_depth': 6, 'max_features': 0.8, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best f1_macro: 0.9211

Running GridSearch for Random Forest ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Random Forest: {'criterion': 'entropy', 'max_depth': None, 'max_features': 1.0, 'min_samples_leaf': 2, 'n_estimators': 100}
Best f1_macro: 0.9245

Running GridSearch for Extra Trees ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Extra Trees: {'criterion': 'entropy', 'max_depth': None, 'max_features': 1.0, 'min_samples_leaf': 2, 'n_estimators': 75}
Best f1_macro: 0.9228

Running GridSearch for XGBoost ...
Fitting 5 folds for each of 2592 candidates, totalling 12960 fits
Best params for XGBoost: {'colsample_bytree': 0.7, 'gamma': 0.5, 'learning_rate

#### Sampling

In [12]:
bmcs = run_gridsearch(X_train=X_res, y_train=y_res, classifiers=classifiers, param_grids=param_grids, scoring='f1_macro')


Running GridSearch for Decision Tree ...
Fitting 5 folds for each of 720 candidates, totalling 3600 fits
Best params for Decision Tree: {'ccp_alpha': 0.005, 'criterion': 'entropy', 'max_depth': 6, 'max_features': 0.8, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best f1_macro: 0.9066

Running GridSearch for Random Forest ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Random Forest: {'criterion': 'entropy', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 2, 'n_estimators': 100}
Best f1_macro: 0.9236

Running GridSearch for Extra Trees ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Extra Trees: {'criterion': 'entropy', 'max_depth': None, 'max_features': 1.0, 'min_samples_leaf': 2, 'n_estimators': 100}
Best f1_macro: 0.9212

Running GridSearch for XGBoost ...
Fitting 5 folds for each of 2592 candidates, totalling 12960 fits
Best params for XGBoost: {'colsample_bytree': 0.7, 'gamma': 0.5, 'learning_rat

### Dataset without `CDRSB`, `LDELTOTAL`, and `mPACCdigit`
For the dataset without `CDRSB`, `LDELTOTAL`, and `mPACCdigit`, we use the **balanced_accuracy** score. By removing the most predictive features, the model must exploit weak signals and complex combinations to directly optimize the accuracy of the various classes. This helps find hyperparameters that improve the classifier's overall performance on the residual feature space.

In [14]:
X_train.drop(columns=['CDRSB', 'LDELTOTAL', 'mPACCdigit'], axis=1, inplace=True)
X_res.drop(columns=['CDRSB', 'LDELTOTAL', 'mPACCdigit'], axis=1, inplace=True)

#### No Sampling

In [15]:
bmcc = run_gridsearch(X_train=X_train, y_train=y_train, classifiers=classifiers, param_grids=param_grids, scoring='balanced_accuracy')


Running GridSearch for Decision Tree ...
Fitting 5 folds for each of 720 candidates, totalling 3600 fits
Best params for Decision Tree: {'ccp_alpha': 0.005, 'criterion': 'gini', 'max_depth': 6, 'max_features': 0.8, 'min_samples_leaf': 1, 'min_samples_split': 8}
Best balanced_accuracy: 0.6752

Running GridSearch for Random Forest ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Random Forest: {'criterion': 'entropy', 'max_depth': 6, 'max_features': 0.5, 'min_samples_leaf': 8, 'n_estimators': 50}
Best balanced_accuracy: 0.7083

Running GridSearch for Extra Trees ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Extra Trees: {'criterion': 'gini', 'max_depth': None, 'max_features': 1.0, 'min_samples_leaf': 8, 'n_estimators': 75}
Best balanced_accuracy: 0.7122

Running GridSearch for XGBoost ...
Fitting 5 folds for each of 2592 candidates, totalling 12960 fits
Best params for XGBoost: {'colsample_bytree': 1.0, 'gamma': 0.

#### Sampling

In [17]:
bmccs = run_gridsearch(X_train=X_res, y_train=y_res, classifiers=classifiers, param_grids=param_grids, scoring='balanced_accuracy')


Running GridSearch for Decision Tree ...
Fitting 5 folds for each of 720 candidates, totalling 3600 fits
Best params for Decision Tree: {'ccp_alpha': 0.005, 'criterion': 'entropy', 'max_depth': 5, 'max_features': 0.8, 'min_samples_leaf': 1, 'min_samples_split': 8}
Best balanced_accuracy: 0.6840

Running GridSearch for Random Forest ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Random Forest: {'criterion': 'entropy', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 2, 'n_estimators': 100}
Best balanced_accuracy: 0.7455

Running GridSearch for Extra Trees ...
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
Best params for Extra Trees: {'criterion': 'entropy', 'max_depth': None, 'max_features': 0.5, 'min_samples_leaf': 2, 'n_estimators': 100}
Best balanced_accuracy: 0.7585

Running GridSearch for XGBoost ...
Fitting 5 folds for each of 2592 candidates, totalling 12960 fits
Best params for XGBoost: {'colsample_bytree': 1.0, 

## Final Considerations

For the Dataset with `CDRSB`, `LDELTOTAL`, and `mPACCdigit` we will choose this hyperparameters. 

In [None]:
# No Sampling
classifiers_1 = {
    'Decision Tree': DecisionTreeClassifier(
        random_state=42, class_weight='balanced',
        ccp_alpha=0.005, criterion='entropy', max_depth=6,
        max_features=0.8, min_samples_leaf=4, min_samples_split=2
    ),
    'Random Forest': RandomForestClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='entropy', max_depth=None, max_features=1.0,
        min_samples_leaf=2, n_estimators=100
    ),
    'Extra Trees': ExtraTreesClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='entropy', max_depth=None, max_features=1.0,
        min_samples_leaf=2, n_estimators=75
    ),
    'XGBoost': XGBClassifier(
        random_state=42, use_label_encoder=False, eval_metric='mlogloss', verbosity=0,
        colsample_bytree=0.7, gamma=0.5, learning_rate=0.1, max_depth=8,
        n_estimators=100, reg_alpha=1, reg_lambda=1, subsample=1.0
    ),
    'LightGBM': LGBMClassifier(
        random_state=42, verbose=-1,
        colsample_bytree=1.0, learning_rate=0.01, max_depth=8,
        min_child_samples=20, n_estimators=100, num_leaves=15,
        reg_alpha=0, reg_lambda=0, subsample=0.8
    ),
    'CatBoost': CatBoostClassifier(
        random_state=42, verbose=False, loss_function='MultiClass',
        bagging_temperature=0.5, border_count=128, depth=6,
        iterations=100, l2_leaf_reg=1, learning_rate=0.1, random_strength=0.5
    ),
    'Multinomial Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('logreg', LogisticRegression(
            random_state=42, solver='saga', max_iter=2000, class_weight='balanced',
            C=1.0, penalty='l1'
        ))
    ]),
    'Bagging': BaggingClassifier(
        random_state=42, n_jobs=-1,
        bootstrap=True, max_features=0.8, max_samples=0.8, n_estimators=50
    )
}

# Sampling
classifiers_2 = {
    'Decision Tree': DecisionTreeClassifier(
        random_state=42, class_weight='balanced',
        ccp_alpha=0.005, criterion='entropy', max_depth=6,
        max_features=0.8, min_samples_leaf=4, min_samples_split=2
    ),
    'Random Forest': RandomForestClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='entropy', max_depth=None, max_features=0.5,
        min_samples_leaf=2, n_estimators=100
    ),
    'Extra Trees': ExtraTreesClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='entropy', max_depth=None, max_features=1.0,
        min_samples_leaf=2, n_estimators=100
    ),
    'XGBoost': XGBClassifier(
        random_state=42, use_label_encoder=False, eval_metric='mlogloss', verbosity=0,
        colsample_bytree=0.7, gamma=0.5, learning_rate=0.05, max_depth=6,
        n_estimators=75, reg_alpha=0, reg_lambda=1, subsample=1.0
    ),
    'LightGBM': LGBMClassifier(
        random_state=42, verbose=-1,
        colsample_bytree=0.7, learning_rate=0.05, max_depth=8,
        min_child_samples=20, n_estimators=75, num_leaves=31,
        reg_alpha=0, reg_lambda=0, subsample=0.8
    ),
    'CatBoost': CatBoostClassifier(
        random_state=42, verbose=False, loss_function='MultiClass',
        bagging_temperature=0.2, border_count=128, depth=8,
        iterations=100, l2_leaf_reg=1, learning_rate=0.1, random_strength=0.5
    ),
    'Multinomial Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('logreg', LogisticRegression(
            random_state=42, solver='saga', max_iter=2000, class_weight='balanced',
            C=1.0, penalty='l1'
        ))
    ]),
    'Bagging': BaggingClassifier(
        random_state=42, n_jobs=-1,
        bootstrap=True, max_features=0.8, max_samples=1.0, n_estimators=100
    )
}


For the Dataset without `CDRSB`, `LDELTOTAL`, and `mPACCdigit` we will choose this hyperparameters. 

In [None]:
# No Sampling
classifiers_3 = {
    'Decision Tree': DecisionTreeClassifier(
        random_state=42, class_weight='balanced',
        ccp_alpha=0.005, criterion='gini', max_depth=6,
        max_features=0.8, min_samples_leaf=1, min_samples_split=8
    ),
    'Random Forest': RandomForestClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='entropy', max_depth=6, max_features=0.5,
        min_samples_leaf=8, n_estimators=50
    ),
    'Extra Trees': ExtraTreesClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='gini', max_depth=None, max_features=1.0,
        min_samples_leaf=8, n_estimators=75
    ),
    'XGBoost': XGBClassifier(
        random_state=42, use_label_encoder=False, eval_metric='mlogloss', verbosity=0,
        colsample_bytree=1.0, gamma=0.1, learning_rate=0.1, max_depth=8,
        n_estimators=75, reg_alpha=1, reg_lambda=1, subsample=0.8
    ),
    'LightGBM': LGBMClassifier(
        random_state=42, verbose=-1,
        colsample_bytree=0.7, learning_rate=0.1, max_depth=3,
        min_child_samples=5, n_estimators=100, num_leaves=31,
        reg_alpha=1, reg_lambda=1, subsample=0.8
    ),
    'CatBoost': CatBoostClassifier(
        random_state=42, verbose=False, loss_function='MultiClass',
        bagging_temperature=0.5, border_count=32, depth=6,
        iterations=100, l2_leaf_reg=1, learning_rate=0.1, random_strength=0.5
    ),
    'Multinomial Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('logreg', LogisticRegression(
            random_state=42, solver='saga', max_iter=2000, class_weight='balanced',
            C=0.1, penalty='l1'
        ))
    ]),
    'Bagging': BaggingClassifier(
        random_state=42, n_jobs=-1,
        bootstrap=False, max_features=0.8, max_samples=0.6, n_estimators=100
    )
}

# Sampling
classifiers_4 = {
    'Decision Tree': DecisionTreeClassifier(
        random_state=42, class_weight='balanced',
        ccp_alpha=0.005, criterion='entropy', max_depth=5,
        max_features=0.8, min_samples_leaf=1, min_samples_split=8
    ),
    'Random Forest': RandomForestClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='entropy', max_depth=None, max_features=0.5,
        min_samples_leaf=2, n_estimators=100
    ),
    'Extra Trees': ExtraTreesClassifier(
        random_state=42, class_weight='balanced', n_jobs=-1,
        criterion='entropy', max_depth=None, max_features=0.5,
        min_samples_leaf=2, n_estimators=100
    ),
    'XGBoost': XGBClassifier(
        random_state=42, use_label_encoder=False, eval_metric='mlogloss', verbosity=0,
        colsample_bytree=1.0, gamma=0, learning_rate=0.1, max_depth=8,
        n_estimators=100, reg_alpha=1, reg_lambda=1, subsample=0.8
    ),
    'LightGBM': LGBMClassifier(
        random_state=42, verbose=-1,
        colsample_bytree=1.0, learning_rate=0.1, max_depth=8,
        min_child_samples=10, n_estimators=75, num_leaves=31,
        reg_alpha=1, reg_lambda=1, subsample=0.8
    ),
    'CatBoost': CatBoostClassifier(
        random_state=42, verbose=False, loss_function='MultiClass',
        bagging_temperature=0.2, border_count=32, depth=8,
        iterations=100, l2_leaf_reg=1, learning_rate=0.1, random_strength=0.5
    ),
    'Multinomial Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('logreg', LogisticRegression(
            random_state=42, solver='saga', max_iter=2000, class_weight='balanced',
            C=1.0, penalty='l1'
        ))
    ]),
    'Bagging': BaggingClassifier(
        random_state=42, n_jobs=-1,
        bootstrap=False, max_features=0.5, max_samples=1.0, n_estimators=100
    )
}
