[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/bads/blob/master/tutorial_notebooks/8_model_evaluation_solutions.ipynb) 

# Tutorial 8 - Machine Learning Model Evaluation

<span style="font-weight: bold; color: red;">This version includes solutions to the exercises. </span>

In this tutorial, we revisit the evaluation of machine learning models, and more specifically of classification models. We will cover the following topics:
- Data organization
    - Train-test split
    - Cross-validation
- Measures of classification performance
    - Confusion matrix and associated indicators
    - ROC Analysis

## Preliminaries

### Standard imports

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### The HMEQ data set
We continue using the "Home Equity" data set (HMEQ), which we can conveniently load and prepare using the helper function `get_HMEQ_credit_data`. The function is available in our course's module `bads_helper_functions.py`.

In [None]:
import bads_helper_functions as bads  # import module with bads helper functions
X, y = bads.get_HMEQ_credit_data()  # load the data 
X  # preview the data   

# Data organization I: Holdout method
The idea of the *holdout method* is to evaluate a model on data it hasn't seen during training, providing a better estimate of the model's real-world performance. To that end, we randomly split the data set into two parts: a training set and a test set. The training set is used to train the model, while the test set is used for model assessment.

<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/holdout_method.png" width="854" height="480" alt="Holdout method">

## Exercise 1: Train-test split
Implement the *holdout method* using the `train_test_split` function from the module `sklearn.model_selection`. More specifically, your tasks are as follows:

- Import the function `train_test_split` from `sklearn.model_selection`.
- Split the data set into a training set and a test set. Use 75% of the data for training and 25% for testing.
- Store the resulting training and test sets in the variables `X_train`, `X_test`, `y_train`, and `y_test`.
- Print the dimensions of the training and test sets.



In [None]:
#------------------------------------------------------------
# Solution to exercise 1
#------------------------------------------------------------
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=888)  # partition the data into 75% training and 25% test

print(f"Training set dimension (observations x features): {X_train.shape[0]} x {X_train.shape[1]}")   
print(f"Test set dimension (observations x features): {X_test.shape[0]} x {X_test.shape[1]}")   

# Model training
Up to this point, the lecture introduced you to two algorithms for classification: logistic regression and decision trees. We will use both algorithms to train models on the training set. To that end, we reuse code from previous tutorials.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# LOGISTIC REGRESSION
logreg = LogisticRegression(random_state=888)  # instantiate the logistic regression model
logreg.fit(X_train, y_train)  # fit the regression model to the training data

# DECISION TREE
dtree = DecisionTreeClassifier(random_state=888, max_depth=5)  # instantiate the decision tree model
dtree.fit(X_train, y_train)  # grow the tree using the training data


# Measures of classifier performance
The `sklearn` library provides several functions to evaluate the performance of a classifier. In general, functionality for model evaluation is available in the module `sklearn.metrics`. The lecture introduced two common approaches for classifier assessment, the confusion matrix and the ROC curve. We will consider both methods to evaluate the performance of the models trained in the previous step.

## ROC Curve
The main difference between ROC analysis and the confusion matrix (and the accuracy indicators derived from it) is that ROC analysis considers all possible thresholds for classifying observations. Specifically, the ROC curve is a graphical representation of the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across classification thresholds. The area under the ROC curve (AUC) is a single-number summary of a classifier's ROC curve and a widely used indicator when comparing alternative classifiers.


<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/ROC-curve.png" width="854" height="480" alt="ROC curve">
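
To make the role of the threshold concrete, the following minimal sketch computes a few points of an ROC curve by hand. The toy labels and scores as well as the helper `tpr_fpr` are hypothetical and only serve illustration; in practice, `sklearn.metrics` does this work for us.

```python
import numpy as np

# Toy data (hypothetical): true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])


def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when predicting class 1 for scores >= threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)


# Every threshold yields one (FPR, TPR) point of the ROC curve
roc_points = {t: tpr_fpr(y_true, y_score, t) for t in [0.0, 0.3, 0.5, 0.75, 1.01]}
for t, (tpr, fpr) in roc_points.items():
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold moves us toward the upper right corner of the ROC space (everything is classified as positive), whereas raising it moves us toward the origin.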


### Exercise 2
Your task is to compare the performance of the logistic regression (i.e., `logreg`) versus the classification tree (i.e., `dtree`) using ROC analysis. Specifically:
- Import the function `roc_auc_score` and the class `RocCurveDisplay` from `sklearn.metrics`.
- For each model, 
    - produce probabilistic predictions on the test set.
    - compute the AUC using the function `roc_auc_score` and print the result.
    - create an ROC curve using the method `from_predictions` of the class `RocCurveDisplay`.

> Following the above steps will lead to creating two ROC curves, one for each classifier. As an **extra challenge**, try to plot both ROC curves in the same plot.


In [None]:
#------------------------------------------------------------
# Solution to exercise 2
#------------------------------------------------------------
from sklearn.metrics import roc_auc_score, RocCurveDisplay

my_models = [logreg, dtree]  # list of models to evaluate

fig, ax = plt.subplots()  # To plot the ROC curves of multiple models in the same plot, we need to create the figure and axes upfront

# Loop over the models
for model in my_models: 
  # compute probability predictions on the test set
  yhat = model.predict_proba(X_test)[:,1]  # we are only interested in the probability of class 1

  # compute the AUC
  auc = roc_auc_score(y_test, yhat)
  print(f"The AUC of {type(model).__name__} is: {auc:.4f}")
  
  # plot the ROC curve
  RocCurveDisplay.from_predictions(y_test, yhat, ax=ax)
  

# Add diagonal line corresponding to a random classifier
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='red')
plt.show()  # display the plot

## Confusion matrix
The confusion matrix of a binary classifier is a 2x2 matrix that contains four values: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The following image illustrates the confusion matrix and some common performance measures derived from it. 

<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/confusion_matrix.PNG" width="854" height="480" alt="Confusion Matrix">
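
As a small illustration of how such indicators follow from the four cells, consider the sketch below. The toy labels and predictions are hypothetical, and the selection of indicators is our own choice of common measures, not necessarily the exact set shown in the image.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy data (hypothetical): true labels and discrete class predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # true positive rate, also called recall
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)    # share of predicted positives that are correct

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy={accuracy:.3f}, Sensitivity={sensitivity:.3f}, "
      f"Specificity={specificity:.3f}, Precision={precision:.3f}")
```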


Producing a confusion matrix follows the same steps as ROC analysis. Given that the previous analysis suggests the decision tree is superior to logistic regression, we will focus on the decision tree model and first demonstrate how to create a confusion matrix using the default classification threshold $\tau=0.5$.



In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Create confusion matrix: note how the method from_estimator 
# bypasses the calculation of predictions. Instead of calling predict(), 
# we can directly give the trained model as argument
ConfusionMatrixDisplay.from_estimator(dtree, X_test, y_test)  
plt.title(f"Confusion matrix of {type(dtree).__name__}")  # refer to dtree explicitly rather than the loop variable of the previous cell
plt.show()
  

### Exercise 3
The lecture made a strong case for *not using the default threshold of $\tau=0.5$*. We will illustrate a better approach later in this tutorial. For now, your task is to consider a custom threshold of $\tau=0.8$. Recreate the confusion matrix using this threshold. To achieve this:
- Compute probabilistic predictions for the test set.
- Convert the probabilistic predictions into binary predictions using the threshold $\tau=0.8$.
- Create a confusion matrix using `ConfusionMatrixDisplay.from_predictions()`.

In [None]:
#------------------------------------------------------------
# Solution to exercise 3
#------------------------------------------------------------

# Step 1: Compute probability predictions for the test set
yhat_proba = dtree.predict_proba(X_test)[:, 1]

# Step 2: Apply custom threshold to convert probabilities to binary predictions
custom_threshold = 0.8
yhat_custom = (yhat_proba >= custom_threshold).astype(int)

# Step 3: Display the confusion matrix using the custom threshold predictions
ConfusionMatrixDisplay.from_predictions(y_test, yhat_custom)
plt.title(f"Confusion matrix with custom threshold {custom_threshold}")
plt.show()

### Optimal threshold
What constitutes an optimal threshold is debatable. For the sake of illustration, let's say we seek the threshold that maximizes the difference between the TPR and the FPR. This difference is known as [*Youden's J*](https://en.wikipedia.org/wiki/Youden%27s_J_statistic). While the threshold that maximizes it is often referred to as the *optimal threshold*, we reiterate that this view is debatable. In a business context, we would rather think of an optimal threshold as a threshold that minimizes the costs arising from wrong classifications and, by extension, wrong decisions.

The following code illustrates how to first compute an optimal threshold and then create the confusion matrix for that threshold. To that end, we introduce the function `roc_curve`, which provides yet another way to create an ROC curve. Importantly, this function does not perform any plotting but returns the values underlying an ROC curve, namely the FPR, the TPR, and the thresholds at which they were computed. We can use these arrays to find the optimal threshold.

In [None]:
from sklearn.metrics import roc_curve

# Step 1: Compute probability predictions for the test set
yhat = dtree.predict_proba(X_test)[:, 1]

# Step 2: Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, yhat)  # the function returns three arrays: FPR, TPR, and the corresponding thresholds

# Step 3: Find the optimal threshold
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal threshold: {optimal_threshold:.4f}")

# Step 4: Apply the optimal threshold to convert probabilities to binary predictions
yhat_optimal = (yhat >= optimal_threshold).astype(int)

# Step 5: Display the confusion matrix using the optimal threshold predictions
ConfusionMatrixDisplay.from_predictions(y_test, yhat_optimal)
plt.title(f"Confusion matrix with optimal threshold {optimal_threshold:.2f}")
plt.show()
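
The cost-oriented view of an optimal threshold mentioned above can be sketched in a few lines. Note that everything specific in this example is an assumption for illustration: the toy labels and scores, the two cost values, and the helper `cost_optimal_threshold`, which is not part of `sklearn` or the course module.

```python
import numpy as np

# Toy data (hypothetical): true labels and predicted probabilities of class 1
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])


def cost_optimal_threshold(y_true, y_score, cost_fp, cost_fn):
    """Return (threshold, total cost) minimizing misclassification costs.

    Candidate thresholds are the distinct predicted scores.
    """
    candidates = np.unique(y_score)  # sorted distinct scores
    costs = []
    for t in candidates:
        y_pred = (y_score >= t).astype(int)
        n_fp = np.sum((y_pred == 1) & (y_true == 0))
        n_fn = np.sum((y_pred == 0) & (y_true == 1))
        costs.append(cost_fp * n_fp + cost_fn * n_fn)
    best = int(np.argmin(costs))  # first minimum in case of ties
    return candidates[best], costs[best]


# Assumed costs: missing a positive is five times as costly as a false alarm
t_a, cost_a = cost_optimal_threshold(y_true, y_score, cost_fp=1.0, cost_fn=5.0)
# Reversed cost ratio for comparison
t_b, cost_b = cost_optimal_threshold(y_true, y_score, cost_fp=5.0, cost_fn=1.0)

print(f"FN costly: threshold={t_a:.2f}, total cost={cost_a:.1f}")
print(f"FP costly: threshold={t_b:.2f}, total cost={cost_b:.1f}")
```

When false negatives are expensive, the cost-minimal threshold is low so that few positives are missed. When false positives are expensive, it moves up.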

### Classification report
While inspecting a confusion matrix is useful, it is often more informative to consider the specific performance indicators derived from it. The `sklearn` library provides a classification report, which includes the classifier's precision, recall, and F1-score. The latter is defined as the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
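
A brief worked example may help to see the formula in action. The toy labels and predictions are hypothetical; the sketch computes the F1-score from precision and recall and compares the result with `f1_score` from `sklearn.metrics`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy data (hypothetical): 3 true positives, 2 false positives, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.60
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1) = 0.75

f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)
arithmetic_mean = (precision + recall) / 2

print(f"Precision: {precision:.4f}, Recall: {recall:.4f}")
print(f"F1 (manual): {f1_manual:.4f}, F1 (sklearn): {f1_sklearn:.4f}")
print(f"Arithmetic mean for comparison: {arithmetic_mean:.4f}")
```

The harmonic mean never exceeds the arithmetic mean and falls below it whenever precision and recall differ, so a classifier cannot compensate a poor recall with a high precision (or vice versa).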

#### Exercise 4
Locate the function `classification_report` in the module `sklearn.metrics` and call it to create a classification report for the logistic regression model and the decision tree, both with the default threshold. As always, base the classification report on the test set.

In [None]:
#------------------------------------------------------------
# Solution to exercise 4
#------------------------------------------------------------
from sklearn.metrics import classification_report

# Classification report for logistic regression (default threshold)
report = classification_report(y_test, logreg.predict(X_test))   
print("LOGISTIC REGRESSION\n")
print(report) 

# Classification report for decision tree (default threshold)
print("\nDECISION TREE\n")
report = classification_report(y_test, dtree.predict(X_test))
print(report)


The output of the `sklearn` classification report includes the following metrics for each class:

- **Precision**: The ratio of true positive predictions to the total predicted positives.
- **Recall**: The ratio of true positive predictions to the total actual positives.
- **F1-score**: The harmonic mean of precision and recall.
- **Support**: The number of actual occurrences of the class in the dataset.

Additionally, it provides overall metrics such as accuracy, macro average, and weighted average for precision, recall, and F1-score, whereby the weights are calculated based on the support of each class.
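
To see how the macro and weighted averages differ, the following sketch recomputes them for the F1-score. The imbalanced toy data is hypothetical and was chosen such that the two averages clearly diverge.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Toy data (hypothetical): 8 observations of class 0 and 2 of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Without an `average` argument, the function returns one value per class
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

macro_f1 = f1.mean()                            # every class counts equally
weighted_f1 = np.average(f1, weights=support)   # classes weighted by their support

print(f"Per-class F1: {f1}, support: {support}")
print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")

# sklearn computes the same averages directly
print(f1_score(y_true, y_pred, average="macro"),
      f1_score(y_true, y_pred, average="weighted"))
```

With imbalanced classes, the weighted average is dominated by the majority class, whereas the macro average reveals weak performance on the minority class more clearly.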

# Data organization II: Cross-validation
The holdout method is a simple and effective way to evaluate a model. However, it has some limitations. For example, the performance of a model can vary significantly depending on the data used for training and testing. To mitigate this issue, we can use cross-validation. The idea is to split the data set into $k$ folds and train the model $k$ times, each time using a different fold for testing and the remaining folds for training. The performance of the model is then averaged over the $k$ iterations.

<img src="https://raw.githubusercontent.com/Humboldt-WI/demopy/main/cross_validation.png" width="854" height="480" alt="Cross validation process">
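
Before turning to library functions, a minimal sketch of the partitioning logic may be useful. It uses plain NumPy and a hypothetical data set of ten observations; the variable names are our own, and no shuffling is applied.

```python
import numpy as np

n_obs, k = 10, 5            # hypothetical: 10 observations, 5 folds
indices = np.arange(n_obs)  # positions of the observations in the data set

# Partition the indices into k disjoint folds of (almost) equal size
folds = np.array_split(indices, k)

splits = []
for i, test_idx in enumerate(folds):
    # All folds except the i-th one form the training set of iteration i
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, test_idx))
    print(f"Iteration {i + 1}: train={train_idx.tolist()}, test={test_idx.tolist()}")
```

Every observation appears in the test fold exactly once and in the training data of the remaining $k-1$ iterations.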

Unsurprisingly, the `sklearn` library provides several functions to perform cross-validation. Arguably the easiest way is to use the function `cross_val_score` from the module `sklearn.model_selection`. It performs cross-validation for one performance metric and returns an array of scores, one for each fold.

## Exercise 5
Your task is to implement a 10-fold cross-validation of the decision tree classifier using the function `cross_val_score()`. More specifically:
- Examine the function's documentation and make sure you understand its key arguments.
- Write code to call the function such that it cross-validates a decision tree using the HMEQ data set 
    - Make sure to perform 10-fold cross-validation.
    - Inspect the argument `scoring`, which allows specifying a performance indicator. Set this argument such that it uses the AUC. 
- Print the average AUC of the tree and its standard deviation

In [None]:
#------------------------------------------------------------
# Solution to exercise 5
#------------------------------------------------------------
from sklearn.model_selection import cross_val_score 

# Cross-validation with 10 folds 
cv_scores = cross_val_score(dtree, X, y, cv=10, scoring='roc_auc')  # compute the AUC scores 
print(f"Decision tree AUC scores: {cv_scores}")  # display the AUC scores
print(f"Decision tree mean AUC (std. AUC): {cv_scores.mean():.4f} ({cv_scores.std():.4f})")  # display the mean and standard deviation of the AUC

## Cross-validation using the Class `KFold`

If `cross_val_score` is the easiest way to do cross-validation, then the other end of the spectrum is provided by the class `KFold`. This class allows for much more flexibility and control over the cross-validation process. However, this comes at the cost of having to write more code. Here is an example that illustrates how to use `KFold` to perform cross-validation. To demonstrate the flexibility of `KFold`, we will use it to compute multiple performance measures (i.e., AUC and F1-score) for each fold and, within each fold, for both the training and the validation data.

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, roc_auc_score    
k = 10  # number of folds
kf = KFold(n_splits=k)  # initialize cross-validation process

# Initialize the classifier
tree = DecisionTreeClassifier(max_depth=5)
i = 1  # counter for the fold number
# Iterating over the k folds
for train_index, test_index in kf.split(X):  # KFold gives us two arrays with the indices of the training and validation data of the current iteration
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]  # we can use these arrays to index our original data set: here we construct the feature matrices
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]  # and here the true targets; .iloc selects by position, which matches the positional indices from KFold

    # Fitting the tree to the training set of THIS ITERATION
    tree.fit(X_train, y_train)

    # Compute predictions on the test set of THIS ITERATION
    yhat_class = tree.predict(X_test)  # discrete predictions
    yhat_proba = tree.predict_proba(X_test)[:,1]  # probabilities of class 1

    # Compute model test set performance for every error measure we are interested in
    auc = roc_auc_score(y_test, yhat_proba)
    f1 = f1_score(y_test, yhat_class)
    print(f"Fold {i}\tTest set AUC: {auc:.4f}")
    print(f"Fold {i}\tTest set F1: {f1:.4f}")
    
    # Repeat the previous steps to also compute training set performance
    yhat_class = tree.predict(X_train)
    yhat_proba = tree.predict_proba(X_train)[:,1]
    auc = roc_auc_score(y_train, yhat_proba)
    f1 = f1_score(y_train, yhat_class)
    print(f"Fold {i}\tTraining set AUC: {auc:.4f}")
    print(f"Fold {i}\tTraining set F1: {f1:.4f}")
    i += 1  # increment the fold counter
        


## Exercise 6
While perhaps hinting at the flexibility of `KFold`, the previous example shows some very bad coding practice. The calculation of test and training set performance across multiple performance metrics is repetitive. More severely, the demo does not store any results. Plotting the development of performance across folds and computing an average performance in the end is, thus, impossible. Draw on your Python skills to improve the above demo. Specifically:   
- Identify a suitable data structure to store the performance metrics per fold for the training and test set 
- Revise the code to store the performance metrics in the data structure you identified.
- Compute the average performance across all folds and print the results. 

In [None]:
#------------------------------------------------------------
# Solution to exercise 6
#------------------------------------------------------------

# Initialize the classifier
tree = DecisionTreeClassifier(max_depth=5)

# Number of folds
k = 10
kf = KFold(n_splits=k)

# Initialize a dictionary to store performance metrics
performance = {
    'train_auc': [],
    'test_auc': [],
    'train_f1': [],
    'test_f1': []
}

# Iterating over the k folds
for i, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Fitting the tree to the training set of THIS ITERATION
    tree.fit(X_train, y_train)

    # Compute predictions on the test set of THIS ITERATION
    yhat_class_test = tree.predict(X_test)
    yhat_proba_test = tree.predict_proba(X_test)[:, 1]

    # Compute predictions on the training set of THIS ITERATION
    yhat_class_train = tree.predict(X_train)
    yhat_proba_train = tree.predict_proba(X_train)[:, 1]

    # Compute model test set performance for every error measure we are interested in
    test_auc = roc_auc_score(y_test, yhat_proba_test)
    test_f1 = f1_score(y_test, yhat_class_test)
    
    # Compute model training set performance for every error measure we are interested in
    train_auc = roc_auc_score(y_train, yhat_proba_train)
    train_f1 = f1_score(y_train, yhat_class_train)

    # Store the performance metrics
    performance['train_auc'].append(train_auc)
    performance['test_auc'].append(test_auc)
    performance['train_f1'].append(train_f1)
    performance['test_f1'].append(test_f1)

# Convert the performance dictionary to a DataFrame
performance_df = pd.DataFrame(performance)

# Compute the average performance across all folds
average_performance = performance_df.mean()
print("Average performance across all folds:")
print(average_performance)

# Display the performance DataFrame
print("\nPerformance metrics for each fold:")
performance_df
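
As a cross-check, `sklearn` also offers the function `cross_validate` in `sklearn.model_selection`, which evaluates several metrics per fold and can report training set scores as well. The sketch below is not the course's solution to the exercise. It uses synthetic data from `make_classification` (an assumption made to keep the example self-contained) instead of the HMEQ data, and the variable names are our own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the HMEQ data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=888)

results = cross_validate(
    DecisionTreeClassifier(max_depth=5, random_state=888),
    X_demo,
    y_demo,
    cv=KFold(n_splits=10),       # same splitting scheme as in the manual loop
    scoring=["roc_auc", "f1"],   # several metrics at once
    return_train_score=True,     # also report training set performance
)

# The result is a dictionary of arrays with one entry per fold
results_df = pd.DataFrame(results).drop(columns=["fit_time", "score_time"])
print(results_df.mean())
```

The manual `KFold` loop remains the more flexible option, for example when each fold requires custom preprocessing or a fold-specific classification threshold.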