# 🤔 Model Selection & Metrics

<img src="https://i.imgur.com/LP6sUuZ.png">

## 📈 Metrics

Check the full list of metrics [here](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics)

- Mainly four categories are available: classification metrics, regression metrics, clustering metrics and distance/pairwise metrics

- The most popular classification metrics are accuracy, F1 score, precision, recall and the confusion matrix (also AUC and balanced accuracy); scikit-learn provides these along with more niche variants

- Consider binary classification over `A` and `B` where `A` is the class you are interested in (i.e., positive class). 

There are four cases when the model classifies a point $x$:

- Model classification is correct (agrees with the true label)
    - The true label is A → True Positive (TP)
    - The true label is B → True Negative (TN)
- Model classification is incorrect (disagrees with the true label)
    - The true label is A → False Negative (FN)
    - The true label is B → False Positive (FP)

- The confusion matrix (predicted vs. actual) shows all four:

<img src="https://upload.wikimedia.org/wikipedia/commons/a/a1/ConfusionMatrixRedBlue.png" width=300>
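As a minimal sketch of the four cases above, the counts can be computed directly from label arrays. The toy labels below are made up for illustration (1 stands for class `A`, 0 for class `B`); the result is then compared against `sklearn.metrics.confusion_matrix`, which lays the counts out as `[[TN, FP], [FN, TP]]` for labels `[0, 1]`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = class A (positive), 0 = class B (negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Count each of the four cases with boolean masks
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true A, predicted A
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true B, predicted B
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # true B, predicted A
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # true A, predicted B
print("TP, TN, FP, FN:", tp, tn, fp, fn)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```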

#### 🧠 Thinking of a New Metric
- Consider a dataset of 1M cells
    - 100 have a disease we are interested in predicting
    - The rest do not
    - A classifier that always predicts "No Disease" will be 99.99% accurate while never detecting a single diseased cell

- Instead of accuracy, divide the true positives by the number of positive labels → Recall
    - Solves the problem above
    - Problem is that a model that always classifies as "Disease" gets a perfect recall of 1

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

- Instead of accuracy, divide the true positives by the number of positive outputs → Precision
    - Solves the problem above
    - Problem is that if the model produces only 1 positive output and it happens to be correct, then precision is 1

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

- Final Solution
    - Combine precision and recall with a harmonic mean, which is pulled toward the worse of the two
    - Call it F1-Score

$$ F_1 = \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}}$$
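To make the three failure modes above concrete, here is a small sketch that plugs the 1M-cell example into the formulas. The helper `metrics_from_counts` is hypothetical, and returning 0 when a denominator is zero is an assumed convention for this illustration.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

n_total, n_pos = 1_000_000, 100
n_neg = n_total - n_pos

# Always predicts "No Disease": high accuracy, zero recall
always_neg = metrics_from_counts(tp=0, fp=0, fn=n_pos, tn=n_neg)
# Always predicts "Disease": perfect recall, tiny precision
always_pos = metrics_from_counts(tp=n_pos, fp=n_neg, fn=0, tn=0)
# Predicts "Disease" exactly once, correctly: perfect precision, tiny recall
one_pos = metrics_from_counts(tp=1, fp=0, fn=n_pos - 1, tn=n_neg)

for name, m in [("always negative", always_neg),
                ("always positive", always_pos),
                ("one positive", one_pos)]:
    print(f"{name:16s} acc={m[0]:.6f} prec={m[1]:.6f} rec={m[2]:.6f} f1={m[3]:.6f}")
```

In all three cases F1 stays close to the worse of precision and recall, which is the behaviour the harmonic mean is chosen for.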

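- In multiclass scenarios, each class will have its own precision and recall (and F1). The final metric can be an unweighted average over classes (macro), an average weighted by the number of samples in each class (weighted), or computed from the pooled counts of all per-class confusion matrices (micro).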
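The averaging options can be sketched with the `average` parameter of `f1_score`. The three-class labels below are made up for illustration. For single-label multiclass data the micro average coincides with accuracy, which the last line checks.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 3-class labels with imbalanced support (6, 3 and 1 samples)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])

per_class = f1_score(y_true, y_pred, average=None)      # one F1 per class
macro = f1_score(y_true, y_pred, average="macro")       # plain mean over classes
weighted = f1_score(y_true, y_pred, average="weighted") # mean weighted by class support
micro = f1_score(y_true, y_pred, average="micro")       # from pooled TP/FP/FN counts

support = np.bincount(y_true)
print("per class:", per_class)
print("macro:", macro, "| manual:", per_class.mean())
print("weighted:", weighted, "| manual:", np.average(per_class, weights=support))
print("micro:", micro, "| accuracy:", accuracy_score(y_true, y_pred))
```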

#### Classification Example

##### 1. Load Some Data

In [13]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the breast cancer dataset
breast_cancer = load_breast_cancer()
x_data = breast_cancer.data
y_data = breast_cancer.target

# Split the data into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

##### 2. Train and Calculate Metrics

In [14]:
# Instantiate the classifier (K-Nearest Neighbors in this example)
clf = KNeighborsClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_val)

# Calculate the metrics (by default the positive class is label 1, which is "benign" in this dataset)
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print(f"Accuracy, Precision, Recall, F1 Score: {accuracy}, {precision}, {recall}, {f1}")

Accuracy, Precision, Recall, F1 Score: 0.956140350877193, 0.9342105263157895, 1.0, 0.9659863945578231


Even better than this is the classification report, which lists precision, recall and F1 for each class along with their averages

In [15]:
from sklearn.metrics import classification_report
report = classification_report(y_val, y_pred, target_names=breast_cancer.target_names)
print(report)

              precision    recall  f1-score   support

   malignant       1.00      0.88      0.94        43
      benign       0.93      1.00      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.94      0.95       114
weighted avg       0.96      0.96      0.96       114



Or a confusion matrix (rows are true labels, columns are predicted labels)

In [16]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_val, y_pred)
print(cm)

[[38  5]
 [ 0 71]]
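As a quick check, the numbers in the classification report can be recovered from this confusion matrix by hand. The sketch below hard-codes the matrix printed above and assumes the scikit-learn layout (rows are true labels, columns are predicted labels, class order malignant then benign).

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[38, 5],
               [0, 71]])

# Recall: correct predictions of a class / number of true samples of that class (row sums)
recall_per_class = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions of a class / number of predictions of that class (column sums)
precision_per_class = np.diag(cm) / cm.sum(axis=0)
# Accuracy: all correct predictions / all samples
accuracy = np.trace(cm) / cm.sum()

print("recall    (malignant, benign):", recall_per_class.round(2))
print("precision (malignant, benign):", precision_per_class.round(2))
print("accuracy:", round(float(accuracy), 2))
```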


`sklearn.metrics` also has a small number of plotting utilities, and one of them (`ConfusionMatrixDisplay`) can plot the confusion matrix, but the result is quite basic.

Meanwhile, common regression metrics include the L2 loss (mean squared error) $L_{2} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and the L1 loss (mean absolute error) $L_{1} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
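A minimal sketch of both formulas in NumPy, checked against `mean_squared_error` and `mean_absolute_error`. The four target values and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# L2 loss: mean of squared residuals
l2 = np.mean((y_true - y_hat) ** 2)
# L1 loss: mean of absolute residuals
l1 = np.mean(np.abs(y_true - y_hat))

print("L2 (manual, sklearn):", l2, mean_squared_error(y_true, y_hat))
print("L1 (manual, sklearn):", l1, mean_absolute_error(y_true, y_hat))
```

Note that L2 is in squared units of the target, so large residuals dominate it, while L1 stays in the target's own units.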

#### Regression Example

##### 1. Load Some Data

In [17]:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target     # disease progression (continuous)


# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### 2. Calculate Some Metrics

In [18]:
# Fit linear regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Make predictions
y_pred = linear_reg.predict(X_test)

# Calculate L1 and L2 loss using sklearn.metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print("Linear Regression - Mean Absolute Error:", mae)
print("Linear Regression - Mean Squared Error:", mse)

Linear Regression - Mean Absolute Error: 42.794094679599944
Linear Regression - Mean Squared Error: 2900.193628493483


A more interpretable metric is the $R^2$ score.
$$
R^2 = 1 - \frac{{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}}{{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

It measures the improvement of your model over the most naive regressor, which always predicts the mean of the target. That is, if your $R^2$ is 0.7, then your model has removed $70\%$ of the naive regressor's squared error. The naive regressor itself scores 0 and an ideal regressor with zero error scores 1.

In [19]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

0.45260276297191915
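The $R^2$ formula can also be sketched by hand. The toy values below are made up for illustration. The last lines check the claim that a regressor which always predicts the mean of the target scores 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Squared error of the model and of the naive mean predictor
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
print("R^2 (manual, sklearn):", r2_manual, r2_score(y_true, y_hat))

# The naive regressor predicts the mean for every sample
naive_pred = np.full_like(y_true, y_true.mean())
r2_naive = r2_score(y_true, naive_pred)
print("R^2 of the naive regressor:", r2_naive)
```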

## 📈 Cross Validation

- Suppose we change the random state of `train_test_split`
    - The validation set will change
    - The score will likely change
    - Our metric estimate (e.g., accuracy) from a single split is hence unreliable

- Intuitively, we can get a stronger estimate by trying multiple train-test splits and then averaging the scores together
    - This should be closer to the true metric (across the whole population) than a single point estimate
    - Implies better model selection decisions
    - This is the idea of cross-validation, where `k` represents the number of folds: each fold (a fraction `1/k` of the data) serves once as the validation set while the remaining `k-1` folds are used for training
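The motivation above can be sketched by re-running a single split with different seeds and watching the score move. This uses repeated random splits rather than k-fold proper, and the choice of the Iris data, a default `KNeighborsClassifier` and ten seeds is an assumption made only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):
    # Same data, same model, only the split changes
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    scores.append(model.score(X_va, y_va))
scores = np.array(scores)

print("Per-split accuracy:", scores.round(3))
print("Mean:", scores.mean(), "Std:", scores.std())
```

Any single entry of `scores` is what one `train_test_split` would have reported; the mean is the steadier estimate.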

<div align="center">
<img width=300 src="https://www.researchgate.net/publication/368622723/figure/fig1/AS:11431281120962096@1676727198009/5-Fold-iteration-cross-validation.png">
</div>

This also gives us a way to gauge our confidence in the scores (e.g., if the variance between the folds is small, then we can trust that an estimate on a new test set will be close).
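To show what happens in each fold, here is a sketch of k-fold cross-validation written as an explicit loop over `KFold.split`, reporting the mean and the spread of the fold scores. It is compared with `cross_val_score` using the same splitter. The linear `SVC` and the Iris data are choices made for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel="linear")
cv = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, val_idx in cv.split(X):
    fold_model = clone(model)                   # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on k-1 folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold
manual_scores = np.array(manual_scores)

print("Fold accuracies:", manual_scores.round(3))
print(f"Mean: {manual_scores.mean():.3f}, std: {manual_scores.std():.3f}")

# The helper does the same loop internally (default scoring is the estimator's accuracy)
auto_scores = cross_val_score(model, X, y, cv=cv)
```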

#### Let's try it out!

In [20]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, RepeatedKFold, RepeatedStratifiedKFold, LeaveOneOut
from sklearn.svm import SVC

# Load the Iris dataset
iris = load_iris()
x_data, y_data = iris.data, iris.target

# SVM model
svm_model = SVC(kernel='linear')

#### That's it. No fit and no split because it's all handled within cross validation!

- Only when you are done with model selection should you train the chosen model on all of `x_data`
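A minimal sketch of that workflow: score a few candidates with cross-validation, pick the best mean score, then fit the winner once on all the data. The two candidates and the dictionary names are hypothetical choices for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical candidate models to choose between
candidates = {
    "svm_linear": SVC(kernel="linear"),
    "knn": KNeighborsClassifier(),
}

# Model selection: compare mean cross-validated accuracy (cv=5 uses stratified folds for classifiers)
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in candidates.items()}
best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, "->", best_name)

# Final training: fit the selected model on all available data
final_model = candidates[best_name].fit(X, y)
```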

##### Simple K-Fold

In [21]:
# 1. Setup the folds
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=42)
# 2. Perform cross-validation (for single-label multiclass data, micro-averaged F1 equals accuracy)
kfold_cv_scores = cross_val_score(svm_model, x_data, y_data, cv=kfold_cv, scoring='f1_micro')
print("\nSimple k-fold cross-validation scores:", kfold_cv_scores)
print("Mean accuracy:", np.mean(kfold_cv_scores))


Simple k-fold cross-validation scores: [1.         1.         0.96666667 0.93333333 0.96666667]
Mean accuracy: 0.9733333333333334


##### Repeated K-Fold

The estimate gets more stable if we repeat the cross-validation multiple times, reshuffling the data before each repeat

In [22]:
repeated_cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
repeated_cv_scores = cross_val_score(svm_model, x_data, y_data, cv=repeated_cv)
print("Repeated cross-validation scores:", repeated_cv_scores)
print("Mean accuracy:", np.mean(repeated_cv_scores))

Repeated cross-validation scores: [1.         1.         0.96666667 0.93333333 0.96666667 1.
 1.         0.96666667 1.         0.96666667]
Mean accuracy: 0.9800000000000001


##### When the data is imbalanced, stratified splits are better (but slower)

In [23]:
stratified_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
stratified_cv_scores = cross_val_score(svm_model, x_data, y_data, cv=stratified_cv)
print("\nStratified cross-validation scores:", stratified_cv_scores)
print("Mean accuracy:", np.mean(stratified_cv_scores))


Stratified cross-validation scores: [1.         1.         0.93333333 1.         1.         0.93333333
 1.         0.96666667 1.         0.96666667]
Mean accuracy: 0.9800000000000001
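Iris is balanced, so the benefit of stratification is hard to see in the scores above. The sketch below uses made-up imbalanced labels (90 negatives, 10 positives) and counts how many positives land in each validation fold with and without stratification. The helper `positives_per_fold` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives, 10 positives (features are irrelevant here)
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))

def positives_per_fold(cv):
    """Count the positive samples in each validation fold of the given splitter."""
    return [int(y[val_idx].sum()) for _, val_idx in cv.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", plain)
print("StratifiedKFold positives per fold:", strat)
```

With stratification every validation fold keeps the 10% positive rate, whereas plain k-fold can leave some folds with very few positives.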


##### Some argue that setting K=M, where M is the number of samples (leave-one-out), can be even more accurate than standard cross validation

But that is debatable, and it takes lots of time because the model is fit M times!

In [24]:
leave_one_out_cv = LeaveOneOut()
leave_one_out_cv_scores = cross_val_score(svm_model, x_data, y_data, cv=leave_one_out_cv)
print("\nLeave-one-out cross-validation scores:", leave_one_out_cv_scores)
print("Mean accuracy:", np.mean(leave_one_out_cv_scores))


Leave-one-out cross-validation scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
Mean accuracy: 0.98
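This explains why every score above is either 0 or 1: each validation set holds exactly one sample. A short sketch checking that `LeaveOneOut` produces as many splits as there are samples, the same number as `KFold` with `n_splits` set to the dataset size:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_loo = loo.get_n_splits(X)                        # one split per sample
n_kfold = KFold(n_splits=len(X)).get_n_splits(X)   # k-fold with K equal to the number of samples

# Inspect the first split: all samples but one for training, one for validation
first_train, first_val = next(iter(loo.split(X)))

print("Number of model fits:", n_loo)
print("Train / validation sizes in one split:", len(first_train), len(first_val))
```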
