# TP 2
---
---
**Alan Balendran** <br> **Celine Beji** <br> **Etienne Peyrot** <br>**François Grolleau**<br>**Raphaël Porcher**<br> 


[Centre de Recherche en Epidémiologie et Statistiques (CRESS) - Equipe METHODS](https://cress-umr1153.fr/fr/teams/methods/)<br> 

The goals of this practical are to:
- Understand the notion of overfitting.
- Learn how to correctly train and estimate the performance of a prediction model. 

We will first import libraries, some of which we used during the previous practical. 

In [None]:
## basic libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

## preprocessing
from sklearn.preprocessing import LabelEncoder 

## modeling
from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

## evaluation
from sklearn.metrics import roc_auc_score, accuracy_score, \
    confusion_matrix, RocCurveDisplay
from sklearn.calibration import calibration_curve, CalibrationDisplay

## hide warning message
pd.set_option('future.no_silent_downcasting', True) 

In the last practical, we learned how to train various machine learning models and make predictions. <br>Let's recap the different steps (we skip the visualization section covered during the first practical).

First, we start by loading our training dataset `icu_training.csv`.

In [None]:
url_data = 'https://raw.githubusercontent.com/AL0UNE/courses/refs/heads/main/icu_training.csv'

df = pd.read_csv(url_data, index_col=0)

We select a subset of variables to train our models:

In [None]:
keep = [
    "age", "gender", "height", "weight", "CURR_CAREUNIT_transfers" , 
    "hr_score", "sysbp_score", "pao2fio2_score",
    "bun_min", "hemoglobin_max", "lactate_max", "creatinine_max",
    "ptt_max", "first_icu_stay", "hospital_mortality"
]

df = df[keep]

We then apply the different preprocessing steps before training:

We impute missing values, in our case for `height` (using the mean) and `first_icu_stay` (using the most frequent value).

In [None]:
mean_height = df['height'].mean()
most_frequent_first_icu_stay = df['first_icu_stay'].mode()[0]

df = df.fillna({"height": mean_height, "first_icu_stay": most_frequent_first_icu_stay})

We encode categorical features, which can either be done manually:

In [None]:
# we (manually) encode categorical features
df['gender'] = df['gender'].replace({'M': 1, 'F': 0}) 
df['first_icu_stay'] = df['first_icu_stay'].replace({True: 1, False: 0})

Or automatically:

In [None]:
encoder = LabelEncoder()
encoder.fit(df['CURR_CAREUNIT_transfers'])
df['CURR_CAREUNIT_transfers'] = encoder.transform(df['CURR_CAREUNIT_transfers'])

When possible and relevant, we can engineer or create new features:

In [None]:
df['BMI'] = df['weight']/((df['height']/100)**2)

We then define the covariates $X$ (patient characteristics) and the target $y$ (hospital mortality):

In [None]:
X = df.drop('hospital_mortality', axis=1)
y = df['hospital_mortality']

Finally, we standardize our data $X$:

In [None]:
# Data standardization

train_mean = X.mean()
train_std = X.std()
X = (X-train_mean)/train_std 

In [None]:
X

In [None]:
y

### Model training

We learned that to train a machine learning model, we first need to instantiate the model we want to train (in this case a logistic regression). Then, we train it on our training data (here $X$ and $y$) using the `.fit()` method.

In [None]:
clf = LogisticRegression() 
clf.fit(X,y);

Once trained, the model can be used to make predictions.

Predictions can be obtained in two ways:
- probabilities using the `predict_proba()` method:

In [None]:
prediction = clf.predict_proba(X.head(1))
print(prediction)

- binary target using the `predict()` method:

In [None]:
prediction = clf.predict(X.head(1))
print(prediction)

### Evaluation metrics

In [None]:
def plot_cm(clf, X, y, normalized=None, figsize=(7, 7), ax=None):
    """
    Plot confusion matrix with annotations.
    """
    conf_matrix = confusion_matrix(y, clf.predict(X), normalize=normalized)
    group_names = ["True Negative (TN)", "False Positive (FP)", "False Negative (FN)", "True Positive (TP)"]
    group_counts = [f"{value}" for value in conf_matrix.flatten()]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names, group_counts)]
    labels = np.asarray(labels).reshape(2, 2)

    if ax is None:
        fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(conf_matrix, annot=labels, fmt="", linecolor='lightblue',linewidths=2, cmap='Blues', cbar=False, ax=ax)
    ax.set_ylabel('True label', fontsize=15)
    ax.set_xlabel('Predicted label', fontsize=15)
    ax.set_title(f"Confusion Matrix for {type(clf).__name__}", fontsize=18)
    plt.show()

def plot_curves(models, X_train, y_train, X_test=None, y_test=None, curve_type='discrimination', figsize=(8, 8)):
    """
    Plot ROC or calibration curves for training and testing data.
    The figsize is applied per subplot.
    """
    is_test_data_present = X_test is not None and y_test is not None
    num_plots = 2 if is_test_data_present else 1
    
    fig, axes = plt.subplots(1, num_plots, figsize=(figsize[0] * num_plots, figsize[1]))
    
    if not isinstance(axes, np.ndarray):
        axes = [axes]

    data_splits = [(X_train, y_train)]
    if is_test_data_present:
        data_splits.append((X_test, y_test))

    for i, (X_data, y_data) in enumerate(data_splits):
        ax = axes[i]
        if curve_type == 'discrimination':
            for j, (name, model) in enumerate(models.items()):
                plot_chance_level = j == len(models) - 1  # Plot chance level only for the last model
                RocCurveDisplay.from_estimator(model, X_data, y_data, ax=ax, name=name, linewidth=3, plot_chance_level=plot_chance_level)
        elif curve_type == 'calibration':
            if isinstance(models, dict):
                for name, model in models.items():
                    CalibrationDisplay.from_estimator(model, X_data, y_data, ax=ax, name=name)
            else: # handles single model case
                CalibrationDisplay.from_estimator(models, X_data, y_data, ax=ax)

        ax.legend(loc="lower right")
        type_of_plot = "ROC" if curve_type == 'discrimination' else "Calibration"
        data_label = "Training" if i == 0 else "Testing"
        ax.set_title(f'{type_of_plot} curve - {data_label} data', fontsize=14)

    plt.tight_layout()
    plt.show()

Given a set of predictions and known outcomes, we can evaluate the predictive ability of our model using various metrics. <br>For instance, we can calculate the confusion matrix for the logistic regression model's predictions.

In [None]:
plot_cm(clf, X, y)

**Accuracy**: fraction of correctly classified observations.    $\frac{TP+TN}{TP+TN+FN+FP}$ to estimate $\mathbb P(\hat Y = Y)$

**Sensitivity**: proportion of positive observations correctly classified.   $\frac{TP}{TP+FN}$ to estimate $\mathbb P(\hat Y = 1 \mid Y = 1)$

**Specificity**: proportion of negative observations correctly classified.   $\frac{TN}{TN+FP}$ to estimate $\mathbb P(\hat Y = 0 \mid Y = 0)$
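As a quick worked example of the three formulas above, the sketch below plugs in hypothetical counts (the values of `tp`, `tn`, `fp` and `fn` are made up for illustration and are not results from the ICU data). It also illustrates that, with an imbalanced outcome, accuracy can look high while sensitivity stays low.

```python
# Hypothetical confusion-matrix counts (illustrative values, not results from the ICU data)
tp, tn, fp, fn = 30, 850, 50, 70

accuracy = (tp + tn) / (tp + tn + fp + fn)   # estimates P(Y_hat = Y)
sensitivity = tp / (tp + fn)                 # estimates P(Y_hat = 1 | Y = 1)
specificity = tn / (tn + fp)                 # estimates P(Y_hat = 0 | Y = 0)

print(f"Accuracy:    {accuracy:.2f}")
print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
```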

However, these metrics may not always be the most suitable, as they **depend on a specific classification threshold (e.g., 0.5)**. The choice of this threshold often involves a trade-off between sensitivity and specificity, **depending on the clinical context**.

- **High Sensitivity** (= lowering the classification threshold) is crucial when the cost of a false negative (i.e., predicting that a patient will survive when they actually die during their hospital stay) is high. For example, in initial screening tests for serious conditions such as cancer, it is preferable to capture as many true positives as possible, even at the risk of including some false positives. These individuals can then undergo more specific, confirmatory testing.

- **High Specificity** (= increasing the classification threshold) is essential when the cost of a false positive (i.e., predicting that a patient will die when they actually survive their hospital stay) is high. Such a cost could involve invasive follow-up procedures, expensive treatments, or significant psychological distress for the patient. For instance, a confirmatory test for a rare genetic disorder or a test to confirm a cancer diagnosis before starting chemotherapy must be highly specific to avoid treating healthy individuals.
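The effect of moving the threshold can be sketched on a handful of hypothetical predictions. The probabilities, outcomes and the helper `sensitivity_specificity` below are all made up for illustration; the point is only that lowering the threshold raises sensitivity at the expense of specificity, and vice versa.

```python
# Hypothetical predicted probabilities of death and true outcomes (1 = died, 0 = survived)
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.90]

def sensitivity_specificity(y_true, y_prob, threshold):
    """Classify as positive when the probability is >= threshold, then compute both metrics."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

results = {}
for threshold in (0.25, 0.5, 0.75):
    results[threshold] = sensitivity_specificity(y_true, y_prob, threshold)
    sens, spec = results[threshold]
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```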

Given this trade-off, it is important to consider alternative evaluation metrics that are not dependent on a single threshold.

For instance, the **Receiver Operating Characteristic (ROC)** curve provides a comprehensive view of a model's performance by plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity) across all possible threshold values.

In [None]:
perfect_clf = RandomForestClassifier()
perfect_clf.fit(X, y)
good_clf = RandomForestClassifier(n_estimators=100, max_depth=8)
good_clf.fit(X, y)
average_clf = LogisticRegression()
average_clf.fit(X, y)

roc_models = {
    "Model 1": perfect_clf,
    "Model 2": good_clf,
    "Model 3": average_clf
}

In [None]:
plot_curves(roc_models, X, y)

The associated metric, the **ROC - Area Under the Curve (AUC)**, summarizes the discrimination of the model (i.e., its ability to separate patients who survived from patients who died) across all possible probability thresholds. It ranges from 0 to 1, with higher values indicating better discrimination; a value of 0.5 corresponds to random guessing.

In [None]:
y_predictions = clf.predict_proba(X)[:,1]
roc_auc_score(y_true=y, y_score=y_predictions)
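One way to build intuition for the AUC is its ranking interpretation: it equals the proportion of (died, survived) pairs in which the patient who died received the higher predicted probability, with ties counting for one half. The sketch below computes this by hand on hypothetical values; it is an illustration of the definition, not how scikit-learn computes it internally, and it should agree with `roc_auc_score` on the same inputs.

```python
from itertools import product

# Hypothetical outcomes (1 = died) and predicted probabilities of death
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]

pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]

# Score each (positive, negative) pair: 1 if ranked correctly, 0.5 for a tie, 0 otherwise
pair_scores = [1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg)]
auc = sum(pair_scores) / len(pair_scores)
print(f"AUC = {auc:.3f}")
```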

### Model evaluation

We can now select, for each model, the hyperparameter values that maximize the AUC.

Let's recall for each model the main hyperparameters that can be tuned.


- **[Logistic regression (or LASSO)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)** :
    - Parameter `C`, which controls the strength of the penalization imposed on the coefficients of a logistic regression. <br>**A large penalization will force coefficients to 0.** <br>In scikit-learn, `C` is the inverse of the regularization strength, so a smaller value of `C` means a stronger penalization.

- **[K-nearest neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**:
    - Number of neighbors: `n_neighbors`

- **[Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)**:
    - Maximum depth a tree can grow: `max_depth`
     
- **[Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**:
    - Number of trees in the forest: `n_estimators`
    - Number of features considered when creating a split: `max_features`
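To make the role of `C` in the LASSO more concrete, the sketch below counts the non-zero coefficients of an L1-penalized logistic regression for a few values of `C`. It runs on synthetic data from `make_classification` (an assumption, standing in for the ICU data) and uses the `liblinear` solver rather than the `saga` solver used elsewhere in this practical, simply because it converges quickly on such a small example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption: stands in for the ICU data, which is not used here)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

nonzero_counts = {}
for C in (0.001, 0.1, 10):
    # liblinear is one of the scikit-learn solvers that supports the L1 penalty
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X_demo, y_demo)
    nonzero_counts[C] = int(np.sum(model.coef_ != 0))

# Smaller C (stronger penalization) should leave fewer non-zero coefficients
print(nonzero_counts)
```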


#### Hyperparameter tuning for LASSO

In [None]:
C_values = [0.0001, 0.001, 0.01, 0.1, 1, 10] # Define a list of values to test for the penalization

lasso_score = {} ## Dictionary to store the AUC for each value of C

for k in C_values: ## loop over the different values of C to train a model for each value and store the AUC in the dictionary

    clf = LogisticRegression(C=k, penalty='l1', solver='saga', max_iter=500) ## Instantiate a logistic regression model with the current value of C
    clf.fit(X, y) ## Train the model
    lasso_score["C = " +str(k)] = roc_auc_score(y_true=y, y_score=clf.predict_proba(X)[:,1]) ## Store the AUC in the dictionary with a key that indicates the value of C used for that model

In [None]:
pd.Series(lasso_score)

#### Hyperparameter tuning for Decision tree

In [None]:
max_depth_values = [1, 3, 5, 10, 15, 20, None] ## Define a list of max_depth values to test for the Decision Tree model

decision_tree_score = {} ## Dictionary to store AUC scores for different max_depth values


## Loop through each max_depth value
for k in max_depth_values:
    
    clf = DecisionTreeClassifier(max_depth=k) ## Initialize a Decision Tree classifier with the current value of max_depth
    clf.fit(X, y) ## Train the model on the dataset
    decision_tree_score["Max_depth = " + str(k)] = roc_auc_score(y_true=y, y_score=clf.predict_proba(X)[:, 1])  ## Store the AUC in the dictionary with a key that indicates the value of max_depth used for that model


In [None]:
pd.Series(decision_tree_score)

**Exercise**

Using the same method, find the best set of hyperparameters for the **random forest** and the **KNN** method.

#### Hyperparameter tuning for Random Forest

In [None]:
number_of_trees = [...]

random_forest_score = {}

for k in number_of_trees:
    
    clf = RandomForestClassifier(...)
    clf.fit(X,y)
    random_forest_score["n_trees = " +str(k)] = roc_auc_score(y_true=y, y_score=clf.predict_proba(X)[:,1])

In [None]:
pd.Series(random_forest_score)

#### Hyperparameter tuning for KNN

In [None]:
number_of_neighbors = [...]

knn_score = {}

for k in number_of_neighbors:
    
    clf = KNeighborsClassifier(...)
    clf.fit(X, y)
    knn_score["n_neighbors = " +str(k)] = roc_auc_score(y_true=y, y_score=clf.predict_proba(X)[:,1])

In [None]:
pd.Series(knn_score)

After finding the optimal value of each model's hyperparameters, we can train our models one final time and evaluate their performance on the training data.

In [None]:
models = {
    'LASSO': LogisticRegression(penalty='l1', C=..., solver="saga", max_iter=500),
    'Decision Tree': DecisionTreeClassifier(max_depth=...),
    'KNN': KNeighborsClassifier(n_neighbors=...),
    'Random Forest': RandomForestClassifier(n_estimators=...)    
}
for m in models.values():
    m.fit(X, y)

### Discrimination (training)

In [None]:
plot_curves(models, X, y)

### Calibration (training)

Model calibration assesses how well the predicted probabilities of a model align with the actual observed outcomes. In a well-calibrated model, if we consider all the instances where the model predicted a certain probability (e.g., 20% chance of mortality), the actual proportion of those instances that experience the event should be close to that probability (i.e., approximately 20% of them should have died).

**How to interpret a calibration plot:**
-   **Axes:** The x-axis represents the predicted probabilities, and the y-axis shows the actual observed frequency of the positive class for those probabilities.
-   **Perfectly Calibrated (Diagonal Line):** The dashed diagonal line represents perfect calibration, where the predicted probabilities exactly match the observed frequencies.
-   **Model's Curve:** The solid lines show the calibration of each model.
    -   If a model's curve is **below** the diagonal, it is **over-predicting** the risk (the predicted probabilities are higher than the actual outcomes).
    -   If a model's curve is **above** the diagonal, it is **under-predicting** the risk (the predicted probabilities are lower than the actual outcomes).

The closer a model's curve is to the diagonal, the better its calibration.
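The points of a calibration curve can be sketched by hand: group the predictions into bins, then compare the mean predicted probability (x-axis) with the observed event frequency (y-axis) in each bin. The data and the choice of two equal-width bins below are hypothetical and only meant to illustrate the "below" versus "above the diagonal" reading.

```python
# Hypothetical predicted probabilities and observed outcomes (1 = event occurred)
y_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90]
y_true = [0,    0,    0,    0,    0,    0,    1,    0,    1,    1,    1,    1]

n_bins = 2  # assumption: two equal-width bins, [0, 0.5) and [0.5, 1]
bins = [[] for _ in range(n_bins)]
for p, t in zip(y_prob, y_true):
    idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
    bins[idx].append((p, t))

calibration_points = []
for members in bins:
    if not members:  # skip empty bins
        continue
    mean_pred = sum(p for p, _ in members) / len(members)   # x-axis value
    obs_freq = sum(t for _, t in members) / len(members)    # y-axis value
    calibration_points.append((mean_pred, obs_freq))
    position = "below" if obs_freq < mean_pred else "on or above"
    print(f"mean predicted={mean_pred:.2f}, observed={obs_freq:.2f} -> {position} the diagonal")
```

In this made-up example, the first bin falls below the diagonal (risk is over-predicted) and the second falls above it (risk is under-predicted).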

In [None]:
plot_curves(models, X, y, curve_type='calibration')

### Model validation

Our models perform well on the training data, but ultimately, the goal is to ensure strong predictive performance on new, unseen data.

Let’s put that to the test by loading a new set of observations (`icu_testing.csv`) and evaluating our models’ performance.

In [None]:
url_data = 'https://raw.githubusercontent.com/AL0UNE/courses/refs/heads/main/icu_testing.csv'

df_test = pd.read_csv(url_data, index_col=0)

In [None]:
df_test.head()

In [None]:
df_test.shape

Before evaluating the performance of our models on this new dataset, we must apply the same transformations that were applied to the training data.

In [None]:
df_test = df_test[keep]

df_test = df_test.fillna({"height": mean_height, "first_icu_stay": most_frequent_first_icu_stay})

df_test['gender'] = df_test['gender'].replace({'M': 1, 'F': 0})
df_test['first_icu_stay'] = df_test['first_icu_stay'].replace({True: 1, False: 0})
df_test['CURR_CAREUNIT_transfers'] = encoder.transform(df_test['CURR_CAREUNIT_transfers'])

df_test['BMI'] = df_test['weight']/((df_test['height']/100)**2)

In [None]:
X_test = df_test[X.columns]
y_test = df_test['hospital_mortality']
X_test = (X_test-train_mean)/train_std ## Standardize using statistics from the training set.
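The key point in the cells above is that every statistic used for preprocessing (imputation values, mean, standard deviation) comes from the training set and is simply reused on the test set. A minimal sketch with hypothetical heights, using only the standard library:

```python
from statistics import mean, stdev

# Hypothetical heights in cm (illustrative values only)
train_heights = [160.0, 170.0, 180.0]
test_heights = [165.0, 190.0]

# Statistics are computed once, on the training data only
mu = mean(train_heights)
sigma = stdev(train_heights)  # sample standard deviation, like the pandas default for `.std()`

train_scaled = [(h - mu) / sigma for h in train_heights]
test_scaled = [(h - mu) / sigma for h in test_heights]   # reuse mu and sigma, do not recompute

print(train_scaled)
print(test_scaled)
```

Recomputing the mean and standard deviation on the test set would put the two datasets on different scales and would leak information from the test data into the preprocessing.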

Now, we can assess the performance of our models on the test set.

### Discrimination

In [None]:
plot_curves(models, X, y, X_test, y_test, figsize = (8, 7))

### Calibration

In [None]:
plot_curves(models, X, y, X_test, y_test, curve_type='calibration', figsize = (8, 7))

When evaluating on the test set, we observe a decline in performance for most models, both in terms of discrimination and calibration. <br>
The LASSO model is the only one that maintains its predictive quality across both datasets.

But we selected the hyperparameters that maximize performance, so what went wrong?

The issue is that **we tuned the hyperparameters and evaluated models' performance on the same (training) dataset.** This approach leads to an overly optimistic assessment of the model's capabilities, as the model is optimized for the specific data it has already seen.

This phenomenon is known as **overfitting**.

### Underfitting and Overfitting

To understand what **overfitting** and **underfitting** are, let's go back to the toy example from last week.

![](https://github.com/AL0UNE/courses/blob/main/figures/synthetic_data_for_under_over_fitting.png?raw=true)

Now, we will train three different decision tree classifiers, each with a different value of **max depth** (**1, 4 and 10**), and observe their decision boundaries:

![](https://github.com/AL0UNE/courses/blob/main/figures/Underfitting_Overfitting.png?raw=true)

From left to right, we show classifiers of increasing complexity, with increasingly complex decision boundaries (i.e., the decision boundary becomes more and more wiggly). <br> In fact, the classifier on the right perfectly classifies all training data points, as also indicated by its perfect ROC-AUC.

But is choosing the model with `max_depth = 10` the best choice? The model with `max_depth = 4` (middle figure) appears to be more reasonable.

So, how can we systematically choose the best hyperparameter value?

### A question of trade-off?

Let’s plot the AUC score for both the training and testing sets as we increase the `max_depth` for a classification tree:

![](https://github.com/AL0UNE/courses/blob/main/figures/Model_complexity.png?raw=true)

When a model is too simple relative to the information in the training data, it is said to have *high bias*.<br>
The consequence is ***underfitting***.
- The decision tree model with `max_depth=1` shown on the left is an example.
- It has a poor AUC on both the training and the testing set: the model did not learn any useful pattern.

On the other hand, when a model is too complex given the training data, it is said to have *high variance*. <br>
The consequence is ***overfitting***.

- The decision tree model with `max_depth=20` shown on the right is an example.
- It has a perfect AUC on the training set but a poor AUC on the testing set: the model has learned not just the underlying patterns, but also the noise present in the data.
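The same pattern can be reproduced in a few lines. The sketch below is an illustration on synthetic, noisy data generated with `make_classification` (an assumption; it is neither the toy example from the figures nor the ICU data), comparing training and testing AUC for a shallow, a moderate and an unrestricted tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption: a stand-in for the toy example, not the ICU data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=4,
                                     flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

auc_by_depth = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits the training data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    auc_train = roc_auc_score(y_tr, tree.predict_proba(X_tr)[:, 1])
    auc_test = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
    auc_by_depth[depth] = (auc_train, auc_test)
    print(f"max_depth={depth}: train AUC={auc_train:.2f}, test AUC={auc_test:.2f}")
```

The unrestricted tree should reach a training AUC of 1 while its testing AUC stays clearly lower, which is the signature of overfitting described above.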

### Methods for estimating model performance

### Train-Test split

![](https://github.com/AL0UNE/courses/blob/main/figures/initial_data.png?raw=true)

In this approach, we simply split the initial dataset into training and testing sets (e.g., 70% for training and 30% for testing). <br>
Although using the test set to evaluate model performance on new data seems reasonable, reusing it to choose hyperparameters can lead to overfitting on the test set itself. This is undesirable because the test set should only be used for assessing the model's final performance.

### Train-Validation-Test split

![](https://github.com/AL0UNE/courses/blob/main/figures/train_test_split.png?raw=true)

Another approach consists in further splitting the training set to create a validation set. The validation set is used for hyperparameter tuning, and the testing set remains untouched until the final evaluation.

In [None]:
# Importing the function to split the data into training and validation sets
from sklearn.model_selection import train_test_split

# Defining a list of max_depth values to test for the Decision Tree model
max_depth = [1, 3, 5, 10, 15, 20, None] 

# Dictionary to store AUC scores for different max_depth values
sample_split_score = {}

# Splitting the dataset into training (70%) and validation (30%) sets
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.7)

In [None]:
# Loop through each max_depth value to train and evaluate the model
for k in max_depth:

    clf = DecisionTreeClassifier(max_depth=k) # Initializing a Decision Tree classifier with the specified max_depth
    clf.fit(X_train, y_train) # Training the model on the training data
    sample_split_score["max_depth = " + str(k)] = roc_auc_score(y_true=y_val, y_score=clf.predict_proba(X_val)[:, 1]) # Evaluating the model on the validation set and storing the AUC score


In [None]:
pd.Series(sample_split_score)

Using the train-validation-testing splitting strategy, we observe that a **max_depth** of **5** is preferable (the exact value may vary between runs, since the split is random), which is different from the value we obtained with the naive method:

In [None]:
training_score = {}
for k in max_depth_values:
    
    clf = DecisionTreeClassifier(max_depth=k)
    clf.fit(X_train,y_train)
    training_score["max_depth = " + str(k)] = roc_auc_score(y_true=y_train, y_score=clf.predict_proba(X_train)[:,1])

In [None]:
pd.Series(training_score)
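Because `train_test_split` shuffles the data at random, the selected hyperparameter can change from one run to the next. A possible refinement, sketched below on synthetic data (the 10% event rate and the value of `random_state` are arbitrary assumptions), is to fix `random_state` for reproducibility and to pass `stratify` so that the outcome rate is similar in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: an arbitrary 10% of positives)
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# `random_state` makes the split repeatable; `stratify` keeps the outcome rate similar in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=42
)

print(f"Training:   n={len(y_tr)}, event rate={y_tr.mean():.3f}")
print(f"Validation: n={len(y_val)}, event rate={y_val.mean():.3f}")
```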

### Cross-Validation

![](https://github.com/AL0UNE/courses/blob/main/figures/crosval.png?raw=true)

**K-Fold cross-validation** is an extension of the train/validation splitting method, where instead of creating a single validation set, the process is repeated multiple times (K times).

The idea behind cross-validation is to divide the training data into several *folds*. In each iteration, one fold serves as the validation set, while the remaining folds are used for training. This process is repeated until every fold has been used as the validation set.


The final performance is averaged across all K iterations.
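To see what the partitioning looks like, here is a minimal sketch that builds the folds by hand. The helper `k_fold_indices` is hypothetical and written for illustration only; it splits the indices in order, without shuffling or stratification.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample is in exactly one validation fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_folds=3))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"Fold {i}: train={train_idx}, validation={val_idx}")
```

For each fold, a model would be trained on `train_idx`, scored on `val_idx`, and the K scores averaged.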

The good news is that you can use the `cross_val_score` function from the `scikit-learn` library to handle most of the work for you!

In [None]:
# Importing the cross-validation function from scikit-learn
from sklearn.model_selection import cross_val_score

# Defining a list of max_depth values to test for the Decision Tree model
max_depth = [1, 3, 5, 10, 15, 20, None] 

# Dictionary to store the AUC scores from cross-validation
cross_val_decision_tree = {}

# Number of folds for cross-validation
n_folds = 5

In [None]:
# Loop through each max_depth value
for k in max_depth:
    # Initializing a Decision Tree classifier with the specified max_depth
    clf = DecisionTreeClassifier(max_depth=k)
    
    # Performing cross-validation with 5 folds and calculating the ROC-AUC score
    cross_val_decision_tree["max_depth = " + str(k)] = cross_val_score(estimator=clf, X=X, y=y, cv=n_folds, scoring='roc_auc')

In [None]:
pd.DataFrame(cross_val_decision_tree)

We can then average over all folds and take the parameter value that maximizes the average score.

In [None]:
pd.DataFrame(cross_val_decision_tree).mean()

### Leave-one-out: a special case of k-fold cross-validation

![](https://github.com/AL0UNE/courses/blob/main/figures/loo.png?raw=true)

**Leave-One-Out Cross-Validation (LOO-CV)** is a special case of cross-validation where K equals the number of data points. In each iteration, the model is trained on all but one data point, and that single data point is used as the validation set. This method can be computationally expensive, particularly for large datasets, as it requires training the model once for each data point.

LOO-CV is especially useful when the dataset is small, as it allows for maximum utilization of the available data for model evaluation.
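One practical detail: with LOO-CV each validation fold contains a single observation, so an AUC cannot be computed fold by fold. A possible workaround, sketched below on a small synthetic dataset (an assumption, not the ICU data), is to collect the held-out predictions with `cross_val_predict` and compute a single pooled AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Small synthetic dataset (assumption: LOO is only practical here because n is small)
X_demo, y_demo = make_classification(n_samples=60, n_features=5, random_state=0)

# Each of the 60 fits leaves out one observation and predicts its probability
loo_prob = cross_val_predict(
    LogisticRegression(), X_demo, y_demo, cv=LeaveOneOut(), method='predict_proba'
)[:, 1]

# The AUC is computed once, on the pooled out-of-sample predictions
loo_auc = roc_auc_score(y_demo, loo_prob)
print(f"{len(loo_prob)} held-out predictions, pooled AUC = {loo_auc:.2f}")
```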

There are many other ways to partition the data. See the following scikit-learn page for a [non-exhaustive list of methods](https://scikit-learn.org/stable/modules/cross_validation.html) for data-partitioning and how to use them.

We now have different methods to tune our models effectively without overfitting to the training data!

<b>Exercise:</b>

- Using the cross-validation method, tune the hyperparameters for the LASSO (`C`), the KNN (`n_neighbors`), and the Random Forest models (`n_estimators`). <br>You can also experiment with different numbers of folds for cross-validation and observe how it impacts the choice of parameters.
- Compare with the values obtained with the naive method using the training data for training and optimization. How did the values change?  
    

In [None]:
# Number of folds for cross-validation
n_folds = 5

C_values = [...] # Define a list of values to test for the penalization

cross_val_lasso = {} ## Dictionary to store the AUC for each value of C

for k in C_values: 

    clf = LogisticRegression(C=..., penalty='l1', solver='saga', max_iter=500) 
    
    cross_val_lasso["C = " +str(k)] = cross_val_score(clf, X, y, cv=n_folds, scoring='roc_auc') ## Perform cross-validation and store the AUC in the dictionary with a key that indicates the value of C used for that model

In [None]:
pd.DataFrame(cross_val_lasso)

In [None]:
pd.DataFrame(cross_val_lasso).mean()

In [None]:
n_folds = 5

n_neighbors = [...] # Define a list of values to test for the number of neighbors in KNN

cross_val_knn = {} ## Dictionary to store the AUC for each value of neighbors

for k in n_neighbors: 

    clf = ...
    
    cross_val_knn["n_neighbors = " +str(k)] = cross_val_score(clf, X, y, cv=n_folds, scoring='roc_auc')

In [None]:
pd.DataFrame(cross_val_knn)

In [None]:
pd.DataFrame(cross_val_knn).mean()

In [None]:
n_folds = 5

n_trees = [...] # Define a list of values to test for the number of trees in the Random Forest model 

cross_val_random_forest = {} ## Dictionary to store the AUC for each number of trees 

for k in n_trees: 

    clf = ...
    
    cross_val_random_forest["n_estimators = " +str(k)] = cross_val_score(clf, X, y, cv=n_folds, scoring='roc_auc')

In [None]:
pd.DataFrame(cross_val_random_forest)

In [None]:
pd.DataFrame(cross_val_random_forest).mean()

### Gridsearch 

Most machine learning models have multiple hyperparameters that can be tuned to optimize performance. For instance, for a **Random Forest model**, these include:
- The number of trees in the forest (`n_estimators`)
- The number of features considered for each split (`max_features`)
- The maximum depth of each tree (`max_depth`)



With multiple hyperparameters to tune, the number of unique combinations can quickly become large, which makes them tedious to keep track of.

Fortunately, scikit-learn provides a powerful class called `GridSearchCV` to automate this process. `GridSearchCV` performs an exhaustive search over a specified hyperparameter grid. For each combination of hyperparameters, it evaluates the model's performance using cross-validation and selects the set of values that yields the best score on a chosen metric (e.g., ROC-AUC).
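What "exhaustive search" means can be sketched with the standard library alone: every combination of values is one candidate, and each candidate is fitted once per fold. The grid and the number of folds below are hypothetical and deliberately smaller than those used in this practical.

```python
from itertools import product

# Hypothetical grid, deliberately smaller than the one used in this practical
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [2, 4],
}
n_folds = 3

# Every combination of values is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold
n_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```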

In [None]:
# Importing GridSearchCV from scikit-learn for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

# Initializing the RandomForestClassifier model
clf = RandomForestClassifier()

# Defining a grid of hyperparameters to search over for the Random Forest model
random_forest_grid = { 
    'n_estimators': [5, 20, 100],  # Number of trees in the forest
    'max_depth': [1, 3],        # Maximum depth of each tree
    'max_features': [1, 3, 5],     # Number of features to consider when splitting a node
}


# Defining the number of folds for cross-validation during the gridsearch
n_folds = 5

# Setting up the GridSearchCV to perform hyperparameter tuning
gs_result = GridSearchCV(
    estimator=clf,               # The classifier to evaluate
    param_grid=random_forest_grid, # The grid of hyperparameters to search over
    scoring='roc_auc',            # The scoring metric (ROC-AUC)
    cv=n_folds,                   # Number of folds for cross-validation
    n_jobs=-1,                    # Using all CPU cores for parallel processing
    verbose=1                      # Verbosity level for progress display
)

# Fitting the GridSearchCV to the data and finding the best hyperparameters
gs_result.fit(X, y);

Can you explain the number of candidates and the total number of fits reported in the output?

In [None]:
grid_search_results = pd.DataFrame(gs_result.cv_results_)
grid_search_results.head()

For each combination of parameters, we get as many AUCs as there are folds. We can then choose the combination of parameters that has the highest mean AUC across all folds (`mean_test_score` or `rank_test_score` column).

In [None]:
grid_search_results.sort_values('rank_test_score').head(5)

Based on our predefined grid, the best set of hyperparameters is:

In [None]:
grid_search_results.sort_values('rank_test_score').iloc[0]['params']
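The same information is also exposed directly on the fitted search object through the `best_params_`, `best_score_` and `best_estimator_` attributes. The sketch below uses synthetic data and a small hypothetical grid on a decision tree so that it runs quickly; it is not the search performed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small hypothetical grid, to keep the sketch fast
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_grid = {'max_depth': [1, 3, 5]}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid,
                      scoring='roc_auc', cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)            # combination with the highest mean cross-validated AUC
print(round(search.best_score_, 3))   # that mean AUC
best_model = search.best_estimator_   # refitted on all of X_demo, y_demo (default refit=True)
```

Since `refit=True` by default, `best_estimator_` is already trained on the full data passed to `.fit()` and can be used for prediction directly.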

<b>Exercise:</b>

- Use `GridSearchCV` to find the optimal hyperparameters for the Random Forest algorithm, modifying the hyperparameter grid above by changing its values or adding other [hyperparameters](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
- (**Optional**): Apply the same approach to optimize multiple hyperparameters for a [classification tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and for the [KNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) classifier (take a look at the documentation to see a list of the different hyperparameters)

Once the optimal hyperparameters are determined for each model, we train the final models using the full training dataset.

In [None]:
models = {
    'LASSO': LogisticRegression(penalty='l1', C=..., solver="saga", max_iter=500),
    'Decision Tree': DecisionTreeClassifier(max_depth=...),
    'Random Forest': RandomForestClassifier(n_estimators=..., max_depth=..., max_features=...),
    'KNN': KNeighborsClassifier(n_neighbors=...),
}

for m in models.values():
    m.fit(X, y)

## Evaluation

Now that we have trained our different models, we can evaluate their performance on the test set we have set aside.<br>

### Discrimination

In [None]:
plot_curves(models, X, y, X_test, y_test, figsize = (8, 7))

### Calibration

In [None]:
plot_curves(models, X, y, X_test, y_test, curve_type='calibration', figsize = (8, 7))

## Interpretability

After training a model, we can analyze the average importance of each feature for each model. This helps assess whether the model has focused on clinically meaningful patterns.


In [None]:
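def plot_feature_importances(models, X, figsize=(15, 6)):
    """
    Plot feature importances for models exposing `coef_` or `feature_importances_`.
    Models without either attribute (e.g., KNN) are skipped.
    """
    # Collect importances first, so the number of subplots always matches
    # the number of models that can actually be plotted
    importances = {}
    for name, model in models.items():
        if hasattr(model, 'coef_'):
            importances[name] = np.abs(model.coef_[0])
        elif hasattr(model, 'feature_importances_'):
            importances[name] = model.feature_importances_

    fig, axes = plt.subplots(1, len(importances), figsize=figsize, sharey=True, squeeze=False)
    axes = axes[0]  # squeeze=False returns a 2D array, even for a single subplot
    axes[0].tick_params(axis='y', labelsize=12)
    for ax, (name, values) in zip(axes, importances.items()):
        ax.barh(X.columns, values)
        ax.set_title(name, fontsize=15)
        ax.set_xlabel('Feature importance', fontsize=12)
        ax.set_xticks([])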

In [None]:
plot_feature_importances(models, X)

One key advantage of (shallow) decision trees is their interpretability. Scikit-learn provides a function, `plot_tree`, to visualize the tree structure; the resulting figure can then be saved to a file.

In [None]:
plt.figure(figsize=(50,20))
plot_tree(models['Decision Tree'], feature_names=X.columns, class_names=['Survived', 'Died'], fontsize=6, proportion=False, filled=True, rounded=True)
plt.savefig('my_decision_tree', dpi=100)