# **1. Performance Measures**

Getting a model to work is not enough: we also need to know **how well** it works, especially when the model is intended to be deployed in a real-world scenario. Thus, evaluating the performance of a predictive model is as important as building it. Intuitively, a model works well when its predictions are **close** to the **target values**. But how do we measure this closeness?

That is the purpose of **performance measures**. These are metrics that quantify the performance of a model, allowing us to compare different models and select the best one for our specific problem. Thus, in this class we will look at the most common performance metrics for both **regression** and **classification** problems. 

<div class="alert alert-block alert-info">

## Table of Contents
### [1 - Regression Problems](#regression)
* [1.1. - $R^{2}$ Score](#rsquare)
* [1.2. - Adjusted $R^{2}$ Score](#adjusted)
* [1.3. - MAE](#mae)
* [1.4. - MSE and RMSE](#mse)
* [1.5. - MedAE](#medae)
* [1.6. - MAPE](#mape)
* [1.7. - Comparing Regression Metrics](#comparison)
### [2 - Classification Problems](#classification)
* [2.1. - The Confusion Matrix](#confusion)
* [2.2. - The Accuracy Score](#accuracy)
* [2.3. - The Precision](#precision)
* [2.4. - The Recall](#recall)
* [2.5. - The F1 Score](#f1)
* [2.6. - ROC Curve and AUC Score](#roc)
* [2.7. - Precision-Recall Curve](#pr-curve)
* [2.8. - Comparing Classification Metrics](#classification-comparison)
### [3 - Multiclass Classification (Extra)](#multiclass)
* [3.1. - Multiclass Confusion Matrix](#multiclass-confusion)
* [3.2. - Macro-Averaged Metrics](#macro)
* [3.3. - Weighted-Averaged Metrics](#weighted)
* [3.4. - Micro-Averaged Metrics](#micro)
</div>

<a class="anchor" id="regression">

## 1. Regression Problems
</a>

For regression problems, the target variable is **continuous**. Under this circumstance, the metrics we generally compute are either based on the **proportion of variance explained** or on the **distance between the predicted and actual values**.

To see how to compute these metrics, we first will need to create a regression model and obtain predictions... so let's start by doing that.

__`Step 0`__ Import the needed libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

#sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn.metrics import classification_report

np.random.seed(33) #for reproducibility

__`Step 1`__ Import the dataset __Boston.csv__ stored in the folder `Datasets`, then define the independent variables as **data_boston** and the dependent variable (last column) as **target_boston**.

In [None]:
boston = pd.read_csv(r'./Datasets/Boston.csv')
data_boston = boston.iloc[:,:-1]
target_boston = boston.iloc[:,-1]

__`Step 2`__ Use the method **train_test_split** from sklearn.model_selection to split your dataset between train (80%) and validation (20%).

In [None]:
X_train, X_val, y_train, y_val = train_test_split(data_boston, 
                                                    target_boston, 
                                                    test_size=0.2, 
                                                    random_state=15, 
                                                    shuffle=True, 
                                                   )

__`Step 3`__ Create an instance of LinearRegression named `lr` with the default parameters and fit it to your training data.

In [None]:
lr = LinearRegression().fit(X_train,y_train)

__`Step 4`__ Now that you have created your model, use the method `predict` to assign the predictions to `y_pred_train` and `y_pred_val`.

In [None]:
y_pred_train = lr.predict(X_train)
y_pred_val = lr.predict(X_val)

__`Step 5`__ From __sklearn.metrics__ import r2_score, mean_absolute_error, mean_squared_error, median_absolute_error, root_mean_squared_error, and mean_absolute_percentage_error.

**Note**: In this notebook we are importing libraries along the way, but in practice it is better to import all the needed libraries at the beginning of the notebook/script.

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, median_absolute_error, root_mean_squared_error, mean_absolute_percentage_error

<a class="anchor" id="rsquare">
    
### 1.1. $R^{2}$ Score

</a>

$R^{2}$, also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. It provides an indication of how well the data points fit a statistical model – the **higher the $R^{2}$**, the better the model fits your data, usually ranging from 0 to 1 (though it can be negative if the model is worse than a simple mean predictor). 

$$
R^{2} = 1 - \frac{SS_{res}}{SS_{tot}}
$$
Where:
- $SS_{res}$ is the sum of squares of residuals (the differences between the observed and predicted values): $\sum (y_i - \hat{y}_i)^2$
- $SS_{tot}$ is the total sum of squares (the differences between the observed values and the mean of the observed values): $\sum (y_i - \bar{y})^2$
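To make the formula concrete, here is a minimal sketch that computes $R^2$ directly from $SS_{res}$ and $SS_{tot}$ with NumPy. The arrays and the helper name `r2_manual` are illustrative assumptions (they are not taken from the Boston dataset). The sketch also shows the two reference points mentioned above: always predicting the mean gives 0, and predictions worse than the mean give a negative score.

```python
import numpy as np

# Illustrative values only (not taken from the Boston dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

# Reasonable predictions: close to 1
r2_good = r2_manual(y_true, y_pred)

# Always predicting the mean of y_true: SS_res equals SS_tot, so R^2 is 0
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))

# Predictions in reverse order are worse than the mean predictor: R^2 < 0
r2_bad = r2_manual(y_true, y_true[::-1])

print(r2_good, r2_mean, r2_bad)
```

On the same inputs, this manual computation should agree with `r2_score` up to floating-point precision.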

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score'>sklearn.metrics.r2_score(y_true, y_pred, ... )</a>

__Definition:__ <br>
$R^2$ (coefficient of determination) regression score function.

__Interpretation:__ <br>
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). 

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

__`Step 6`__ Check the $R^2$ score of the model `lr` for both the training and validation sets.

In [None]:
r2_t = r2_score(y_train, y_pred_train)
r2_t

In [None]:
r2_v = r2_score(y_val, y_pred_val)
r2_v

**Advantages of $R^2$**:
- Easy to interpret: For a reasonable model it typically falls between 0 and 1. A value closer to 1 means the model explains a larger proportion of variance.
- Common benchmark: It is widely known and used across various fields for evaluating regression models.

**Disadvantages of $R^2$**:
- Always increases with more predictors: Adding more independent variables to the model will always increase (or at least not decrease) the $R^2$, even if those variables are not statistically significant, which makes it a poor metric to, for example, compare models with different feature sets.
- Does not indicate causation: A high $R^2$ does not imply that changes in the independent variables cause changes in the dependent variable.
- Does not guarantee good predictions: A model with a high $R^2$ may still have poor predictive performance due to e.g. systematic bias (such as having a model that consistently over or under predicts).
- Misleading for non-linear relationships: $R^2$ is based on linear regression assumptions and may not accurately reflect the fit of non-linear models.
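The first disadvantage can be checked numerically. The sketch below is an illustration on synthetic data (all names and values are assumptions): it fits ordinary least squares with `np.linalg.lstsq`, then refits after appending five columns of pure noise. Because the larger model can always reproduce the smaller one by setting the extra coefficients to zero, the training $R^2$ cannot go down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60

# Two informative predictors plus Gaussian noise (synthetic data)
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

def train_r2(X, y):
    """Fit OLS with an intercept and return R^2 on the same (training) data."""
    A = np.column_stack([np.ones(len(y)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
    residuals = y - A @ coef
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Five extra predictors that are unrelated to the target
noise = rng.normal(size=(n, 5))

r2_base = train_r2(X, y)
r2_noise = train_r2(np.column_stack([X, noise]), y)

print(f"R2 with 2 predictors:          {r2_base:.4f}")
print(f"R2 with 2 + 5 noise predictors: {r2_noise:.4f}")
```

Note that this guarantee only holds on the data used for fitting. On a validation set, the noisy model may well score lower.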

__When should $R^2$ be used?__ <br>
- When we want to measure the amount of variance in the target variable that can be explained by our model. <br>
- Comparing models with the same number of predictors (to avoid the pitfall of always increasing $R^2$ with more variables) on a stable, well-defined dataset. <br>
- When we want to communicate the explanatory power of a regression model, starting with a familiar metric such as $R^2$ can be useful. 

<a class="anchor" id="adjusted">
    
### 1.2. Adjusted $R^{2}$ Score ($\bar{R}^2$)

</a>

A limitation of $R^2$ is that it always increases (or at least stays the same) when new predictors are added, even if those predictors do not improve the model substantially.<br>
To address this issue, **Adjusted R-Squared** modifies the $R^2$ value by taking into account the number of predictors relative to the sample size.<br>

If a new predictor improves the model beyond what would be expected by chance, $\bar{R}^2$ increases.<br>
If the predictor does not provide meaningful explanatory power, $\bar{R}^2$ decreases.<br>
Thus, if evaluating explained variance is a goal of your project, you should consider $\bar{R}^2$ as the go-to metric when comparing regression models with different numbers of features.<br>
Scikit-Learn does not have a direct implementation of $\bar{R}^2$. However, we can compute it using the following formula:

$$
\bar{R}^2 = 1 - \left(1 - R^2\right) \cdot \frac{n - 1}{n - p - 1}
$$
Where:

$\bar{R}^2$: Adjusted R-Squared<br>
$R^2$: R-Squared<br>
$n$: number of observations (sample size)<br>
$p$: number of predictors (independent variables)

__`Step 7`__ Calculate the $\bar{R}^2$ score for your model.


In [None]:
r2 = r2_score(y_train, y_pred_train)
n = len(y_train)
p = len(X_train.columns)

def adj_r2 (r2,n,p):
    return 1-(1-r2)*(n-1)/(n-p-1)

ar2_t = adj_r2(r2,n,p)
ar2_t 

In [None]:
# DO IT
r2 = r2_score(y_val, y_pred_val)
n = len(y_val)
p = len(X_val.columns)

ar2_v = adj_r2(r2,n,p)
ar2_v

**Advantages of $\bar{R}^2$**:
- Penalizes unnecessary predictors: Unlike $R^2$, which can be artificially inflated by adding more variables, $\bar{R}^2$ decreases when non-significant predictors are added, helping to prevent overfitting.
- Fairer for comparison between models with different numbers of predictors.
- Retains interpretability: Like $R^2$, it provides insight into the proportion of variance explained by the model, even if computed using a slightly different formula.
- Signals risk of overfitting: A significant drop in $\bar{R}^2$ when adding predictors can indicate that the model is adding *noise* rather than insight.

**Disadvantages of $\bar{R}^2$**:
- More complex to compute: Requires knowledge of the number of predictors and sample size, making it less straightforward than $R^2$.
- Linear model bias: It is primarily designed for linear regression models and may not be as informative for non-linear models or other non-parametric machine learning algorithms.
- Sensitivity to sample size: The penalty depends on the ratio between the number of predictors and the number of observations. It is harsh in small datasets (where weak but useful predictors may be under-valued) and almost vanishes in large datasets (where weak predictors are barely penalized).
- Still is variance-based: Like $R^2$, it does not provide information about the actual prediction error in the units of the target variable.

__When should $\bar{R}^2$ be used?__ <br>
- When we are in the process of selecting features for our regression model and want to avoid overfitting by adding too many predictors. Measuring $\bar{R}^2$ can help identify the point at which adding more features does not significantly improve the model.
- When comparing regression models that have different numbers of independent variables, $\bar{R}^2$ provides a more balanced metric than $R^2$.
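As a small worked example of the penalty, the sketch below holds $R^2$ and the sample size fixed and only varies the number of predictors. The numbers are hypothetical and the function name `adjusted_r2` is chosen for illustration. It applies the same formula given above.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical setting: same R^2 and sample size, different model sizes
r2, n = 0.75, 101

results = {p: adjusted_r2(r2, n, p) for p in (5, 20, 50)}

for p, value in results.items():
    print(f"p = {p:>2}: adjusted R2 = {value:.4f}")
```

With identical $R^2$, the model that needs 50 predictors ends up with a much lower $\bar{R}^2$ (0.5) than the one that needs only 5 (about 0.7368), which is exactly the behaviour that makes $\bar{R}^2$ useful for comparing models of different sizes.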

<a class="anchor" id="mae">
    
### 1.3. MAE (Mean absolute error)

</a>

Although the previous metrics are useful to measure the degree of **explainability of a model**, they often fall short when it comes to measuring **how distant a prediction is from the observed value**. So, going forward, we will focus on metrics for **measuring error**. In the literature, you will find many different ways of doing this computation. The first example we will cover is likely the most intuitive of them all: the **Mean Absolute Error (MAE)**.

$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

__`Step 8`__ Check the MAE of the model you created previously

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error'>sklearn.metrics.mean_absolute_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Mean absolute error regression loss.

__Interpretation:__ <br>
Best possible value is 0.0. MAE is always non-negative.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [None]:
mae_t = mean_absolute_error(y_train, y_pred_train)
mae_t

In [None]:
mae_v = mean_absolute_error(y_val, y_pred_val)
mae_v

**Advantages of MAE**
- Interpretability: The score computed with MAE is an absolute difference, meaning that it tells you directly by how much, on average, the model is missing.
- Simplicity: Easy to compute and does not involve any computationally challenging operations (squares or roots).
- Measures Errors in a Uniform Manner: Useful in problems where the magnitude of the error is not particularly relevant (although this particular point is a disadvantage when magnitude is relevant).

**Disadvantages of MAE**
- Measures Errors in a Uniform Manner: Essentially, all units of error are considered equally. Consider 2 hypothetical models that could have the same MAE:
    1. Makes a very small mistake on every observation,
    2. Gets some predictions just right, but there are a select few that miss by a LOT.
- Scale-Dependent: Since its main advantage is interpretability, it loses it if the target is normalized in some manner. 

__When should MAE be used?__ <br>
When it is relevant to communicate easily with stakeholders: absolute error is very intuitive to understand.<br>
When the magnitude of an error is not particularly relevant to the problem (that is, when larger errors do not carry a disproportionately higher cost).
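The two hypothetical models described in the disadvantages can be sketched with made-up error vectors (the values below are assumptions chosen for illustration). Both have the same MAE, yet a squared-error metric such as RMSE (covered in section 1.4) clearly separates them.

```python
import numpy as np

# Hypothetical errors (y_true - y_pred) for two models on five observations
errors_small = np.array([2.0, 2.0, 2.0, 2.0, 2.0])   # small miss everywhere
errors_spiky = np.array([0.0, 0.0, 0.0, 0.0, 10.0])  # perfect, except one big miss

# MAE: mean of absolute errors, identical for both models
mae_small = np.mean(np.abs(errors_small))
mae_spiky = np.mean(np.abs(errors_spiky))

# RMSE: square root of the mean squared error, penalizes the single large miss
rmse_small = np.sqrt(np.mean(errors_small ** 2))
rmse_spiky = np.sqrt(np.mean(errors_spiky ** 2))

print(f"MAE : {mae_small:.2f} vs {mae_spiky:.2f}")
print(f"RMSE: {rmse_small:.2f} vs {rmse_spiky:.2f}")
```

If a single 10-unit miss is much more costly than five 2-unit misses in your problem, MAE alone will not reveal the difference.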

<a class="anchor" id="mse">
    
### 1.4. MSE (Mean squared error) and RMSE (Root mean squared error)
</a>

These metrics look to address one of the main drawbacks of MAE which is how it addresses the **magnitude of errors**. Under MAE all errors are considered equally important. However, in more critical scenarios, it may be worthwhile to disproportionally **penalize** larger mistakes relative to **smaller mistakes**. MSE and its root form RMSE aim to tackle this issue by **squaring** the difference between target and prediction:

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

And the Root form:

$$
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{MSE}
$$

__`Step 9`__ Check the MSE and RMSE of the model you created previously:

### MSE

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error'>sklearn.metrics.mean_squared_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Mean squared error regression loss.

__Interpretation:__ <br>
Best possible value is 0.0. MSE is always non-negative.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [None]:
mse_t = mean_squared_error(y_train, y_pred_train)
mse_t

In [None]:
mse_v = mean_squared_error(y_val, y_pred_val)
mse_v

### RMSE

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.root_mean_squared_error.html#sklearn.metrics.root_mean_squared_error'>sklearn.metrics.root_mean_squared_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Root mean squared error regression loss.

__Interpretation:__ <br>
Best possible value is 0.0. RMSE is always non-negative.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [None]:
rmse_t = root_mean_squared_error(y_train, y_pred_train)
rmse_t

In [None]:
rmse_v = root_mean_squared_error(y_val, y_pred_val)
rmse_v

**Advantages of MSE and RMSE**:
- Stronger penalization of larger errors
- Mathematical convenience: unlike MAE, MSE is differentiable everywhere (including where the error is zero), which makes it a very common objective for optimization (e.g. OLS in Linear Regression).

**Disadvantages of MSE and RMSE**
- Oversensitive to outliers: the model is punished heavily for making a prediction that is far away from the target. If, for some reason, your model makes one (or a few) predictions that are way off but otherwise performs decently, the RMSE and, to a greater extent, the MSE will likely make the model look worse than it actually is.
- Lack of interpretability: MSE works in Square Units which is not very interpretable to most audiences. RMSE tries to address this, but its interpretability is still a long way from MAE.


__When should we use MSE?__ <br>
MSE is a useful metric to monitor (and optimize for) during model optimization (e.g. hyperparameter tuning) because many **loss functions** are based on squared differences.<br>
It is also useful when penalizing larger mistakes is critical and error interpretability is not particularly relevant: MSE's penalty for larger errors is considerably greater than that of MAE or RMSE.

__MSE vs. RMSE__ <br>
If all you care about is selecting a "best model", you can use either one: the square root is a monotonic transformation, so the ranking of models will be the same under MSE and RMSE.
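A quick way to convince yourself of this is to rank a few models by both metrics. The sketch below uses hypothetical MSE values and model names (assumptions for illustration only) and checks that sorting by MSE and sorting by RMSE give the same order.

```python
import math

# Hypothetical MSE values for three candidate models
mse_scores = {"model_1": 100.0, "model_2": 1666.7, "model_3": 208.3}

# RMSE is just the square root of MSE
rmse_scores = {name: math.sqrt(mse) for name, mse in mse_scores.items()}

# Rank models from best (lowest error) to worst under each metric
rank_by_mse = sorted(mse_scores, key=mse_scores.get)
rank_by_rmse = sorted(rmse_scores, key=rmse_scores.get)

print(rank_by_mse)
print(rank_by_rmse)
```

The rankings match, but the gaps between models do not: differences look much larger on the MSE scale, which matters if you report the values rather than just the ranking.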

<a class="anchor" id="medae">
    
### 1.5. MedAE (Median absolute error)

</a>

While MAE provides the **mean** of absolute errors, there are situations where the **median** might be more informative. The Median Absolute Error (MedAE) calculates the median of the absolute differences between predicted and actual values, making it more **robust to the presence of outlier errors** than MAE.

This is particularly useful when your error distribution is **skewed** or contains **extreme values** that might distort the mean. MedAE gives you a better sense of the "typical" error your model makes, without being influenced by a few very large mistakes.

$$
MedAE = \text{median}(|y_1 - \hat{y}_1|, |y_2 - \hat{y}_2|, ..., |y_n - \hat{y}_n|)
$$
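The effect of a single extreme error on the mean versus the median can be seen in a minimal sketch. The absolute errors below are made up for illustration.

```python
import numpy as np

# Hypothetical absolute errors: four typical misses and one extreme miss
abs_errors = np.array([1.0, 2.0, 2.0, 3.0, 50.0])

mae = np.mean(abs_errors)      # pulled upwards by the extreme value
medae = np.median(abs_errors)  # stays at the "typical" error

print(f"MAE   = {mae:.1f}")
print(f"MedAE = {medae:.1f}")
```

Here the MAE is 11.6 while the MedAE is 2.0: the median describes the typical miss, but it also hides the fact that one prediction was off by 50.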

__`Step 10`__ Check the MedAE score of the model you created previously


<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.median_absolute_error.html#sklearn.metrics.median_absolute_error'>sklearn.metrics.median_absolute_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Median absolute error regression loss

__Interpretation:__ <br>
Best possible value is 0.0. MedAE is always non-negative.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [None]:
medae_t = median_absolute_error(y_train, y_pred_train)
medae_t

In [None]:
medae_v = median_absolute_error(y_val, y_pred_val)
medae_v

**Advantages of MedAE**:
- Robust to outliers: Unlike MAE, which can be influenced by extreme values, MedAE uses the median, making it less sensitive to outliers in the error distribution.
- Intuitive interpretation: Like MAE, MedAE provides a direct measure of error magnitude in the same units as the target variable.
- Better representation of typical error: In datasets with skewed error distributions, MedAE can provide a better sense of the *typical* error than MAE.

**Disadvantages of MedAE**:
- Less sensitive to all errors: While robustness to outliers can be an advantage, it also means that MedAE might not adequately reflect the presence of significant errors in the model.
- Scale-dependent: Like MAE, MedAE loses interpretability when the target variable is normalized or scaled.
- Less commonly used: MedAE is not as widely adopted as other metrics, which can make comparison with other studies or benchmarks more difficult.

__When should MedAE be used?__ <br>
- When your dataset contains outliers or extreme values that you want to de-emphasize in the evaluation.
- When you want to understand the typical error magnitude without being influenced by a few very large errors.
- When the error distribution is highly skewed and you want a more representative measure of central tendency in the errors.
- In combination with other metrics to get a more complete picture of model performance.

<a class="anchor" id="mape">
    
### 1.6. MAPE (Mean absolute percentage error)

</a>

All the metrics we've seen so far (MAE, MSE, RMSE, MedAE) are **scale-dependent**, meaning their values depend on the units and scale of your target variable. This makes it difficult to compare model performance across different datasets or when the target variable has very different ranges.

The Mean Absolute Percentage Error (MAPE) addresses this limitation by expressing errors as **percentages** relative to the actual values. This makes MAPE **scale-independent** and highly interpretable - a MAPE of 10% means your model is off by 10% on average, regardless of whether you're predicting house prices in thousands or stock returns in decimals.

MAPE is particularly valuable in **business contexts** where stakeholders prefer to understand model performance in terms of percentage errors rather than absolute values.

$$
MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|
$$

__`Step 11`__ Check the MAPE score of the model you created previously

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_percentage_error.html#sklearn.metrics.mean_absolute_percentage_error'>sklearn.metrics.mean_absolute_percentage_error(y_true, y_pred, ... )</a>

__Definition:__ <br>
Mean absolute percentage error (MAPE) regression loss.

__Interpretation:__ <br>
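Best possible value is 0.0. MAPE is always non-negative. Note that scikit-learn returns it as a fraction rather than a percentage (e.g. 0.10 means 10%), so multiply by 100 to express it as a percentage.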

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated target values; <br>
...
</div>

In [None]:
mape_t = mean_absolute_percentage_error(y_train, y_pred_train)
mape_t

In [None]:
mape_v = mean_absolute_percentage_error(y_val, y_pred_val)
mape_v

**Advantages of MAPE**:
- Scale-independent: Unlike MAE, MSE, and RMSE, MAPE is expressed as a percentage, making it easy to interpret and compare across different datasets with different scales.
- Intuitive interpretation: The result is directly interpretable as "the model is off by X% on average."
- Useful for business contexts: Stakeholders often find percentage errors more meaningful than absolute errors.

**Disadvantages of MAPE**:
- Division by zero issues: MAPE cannot be calculated when any actual values are zero, as this would result in division by zero.
- Asymmetric penalty: With non-negative predictions, the percentage error of an under-prediction is capped at 100%, while that of an over-prediction has no upper bound. As a result, MAPE tends to favor models that under-predict.
- Sensitive to small actual values: When actual values are close to zero, even small absolute errors can result in very large percentage errors.
- Not suitable for data with negative values: Percentage errors are hard to interpret when actual values can be negative or change sign.
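Two of these pitfalls are easy to reproduce with a hand-written version of the formula. The helper `mape` below is a hypothetical name and returns a fraction (not a percentage); all input values are made up for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.10 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Sensitivity to small actual values: the same absolute error of 1 unit...
mape_small_target = mape([1.0], [2.0])      # ...is a 100% error when y = 1
mape_large_target = mape([100.0], [101.0])  # ...but only a 1% error when y = 100

# Asymmetry: for an actual value of 100 and non-negative predictions
mape_under = mape([100.0], [0.0])    # worst possible under-prediction: 100%
mape_over = mape([100.0], [300.0])   # over-predictions can exceed 100%

print(mape_small_target, mape_large_target)
print(mape_under, mape_over)
```

This is why MAPE is best reserved for targets that are strictly positive and comfortably away from zero.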

__When should MAPE be used?__ <br>
- When you need a scale-independent metric that can be easily communicated to stakeholders.
- When comparing models across different datasets or when the target variable spans different orders of magnitude.
- In business contexts where percentage errors are more meaningful than absolute errors (e.g., sales forecasting, inventory management).
- When the actual values are consistently positive and not close to zero.

## Comparing Differences

In [None]:
#create a dataframe with all the metrics calculated
regression_metrics = pd.DataFrame({
    'Metric': ['R2', 'Adjusted R2', 'MAE', 'MSE', 'RMSE', 'MedAE', 'MAPE'],
    'Train': [r2_t, ar2_t, mae_t, mse_t, rmse_t, medae_t, mape_t],
    'Validation': [r2_v, ar2_v, mae_v, mse_v, rmse_v, medae_v, mape_v],
    })

regression_metrics

<a class="anchor" id="comparison">

### 1.7. How Metrics Can Change Model Selection: A Numerical Example

</a>

So far, we have seen a set of different metrics, each with their own pros and cons. The natural question that follows is: *which one should I choose?*

The answer, as it often is in machine learning, is: *it depends*. It depends on your data, your business context, and what you consider a "good" prediction.

To illustrate this, let's create a small, controlled example with a few hypothetical models. We will see how the "best" model can change depending on the metric we use to evaluate it.

__`Step 12`__ Let's define a set of true values and predictions from three hypothetical models.

To make this concrete, imagine we are predicting house prices (in thousands of dollars).

*   **`y_true`**: The actual prices of 6 houses.
*   **`preds_model_A`**: A model that is off by a small, consistent amount for each prediction.
*   **`preds_model_B`**: A model that gets most predictions almost perfect, but makes one very large error.
*   **`preds_model_C`**: A model that has a mix of perfect predictions and some larger errors, but none as extreme as Model B.

In [None]:
import numpy as np

# Actual values
y_true = np.array([100, 150, 200, 250, 300, 350])

# Model A: Consistent, small errors
preds_model_A = np.array([110, 160, 210, 260, 310, 360])

# Model B: Mostly perfect, one large error
preds_model_B = np.array([100, 150, 200, 250, 300, 450])

# Model C: A mix of perfect and moderate errors
preds_model_C = np.array([100, 150, 225, 275, 300, 350])

__`Step 13`__ First, create a helper function to calculate all our regression metrics for a given set of predictions. This will make our code cleaner and easier to read.

In [None]:
def calculate_regression_metrics(y_true, y_pred):
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = root_mean_squared_error(y_true, y_pred)
    medae = median_absolute_error(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    
    return [r2, mae, mse, rmse, medae, mape]

__`Step 14`__ Use our function to compute the metrics for each of the three models.

In [None]:
metrics_A = calculate_regression_metrics(y_true, preds_model_A)
metrics_B = calculate_regression_metrics(y_true, preds_model_B)
metrics_C = calculate_regression_metrics(y_true, preds_model_C)

__`Step 15`__ Finally, organize these results into a DataFrame to make them easy to compare.

In [None]:
comparison_df = pd.DataFrame({
    'Metric': ['R2', 'MAE', 'MSE', 'RMSE', 'MedAE', 'MAPE'],
    'Model A': metrics_A,
    'Model B': metrics_B,
    'Model C': metrics_C
})

comparison_df

__`Step 16`__ Analyze the results and see what they tell us.

<div class="alert alert-block alert-success">

### Analysis of the Results

Looking at the table above, we can see that the "best" model changes depending on the metric used:

* **Model A (Consistent and Small Errors) is the best** according to $R^2$ (0.986), $MSE$ (100) and $RMSE$ (10).
* **Model B (Mostly Perfect with One Large Error) is the best** according to $MedAE$ (tied with Model C at 0), but it is by far the worst model when it comes to $MSE$ (1666.7) and $RMSE$ (40.8).
* **Model C (A less extreme version of Model B) is the best** on $MAPE$ (0.0375), $MAE$ (8.33) and $MedAE$ (tied with Model B at 0).

### Key Takeaway

The choice of appropriate metric(s) is a crucial aspect of Supervised Learning. Your decision should be based on the business problem.
</div>

<a class="anchor" id="classification">

## 2. Classification Problems
</a>

In classification problems, the target variable is often a **nominal categorical** variable. Thus, when predicting, the output will either match the target variable or not. This means that the metrics we will use to evaluate classification models will mostly be based on **counting** how many predictions were correct and how many were incorrect. Similarly to the regression case, we will first need to create a classification model and obtain predictions.

__`Step 17`__ Import the needed libraries to apply Logistic Regression.

In [None]:
from sklearn.linear_model import LogisticRegression

__`Step 18`__ Import the dataset __final_tugas.csv__ and define the independent variables as __data_tugas__ and the dependent variable ('DepVar') as __target_tugas__.

In [None]:
tugas = pd.read_csv(r'./Datasets/final_tugas.csv')
data_tugas = tugas.drop(['DepVar'], axis=1)
target_tugas = tugas['DepVar']

__`Step 19`__ Use `train_test_split` from `sklearn.model_selection` to split your dataset into train (80%) and validation (20%).

In [None]:
X_train, X_val, y_train, y_val = train_test_split(data_tugas, 
                                                  target_tugas, 
                                                  test_size = 0.2, 
                                                  random_state=5, 
                                                  stratify = target_tugas) #in this case, we use stratify to keep the same proportion of classes in train and validation sets

__`Step 20`__ Create an instance of `LogisticRegression` named `log_model` with the default parameters and fit it to your training data.

In [None]:
log_model = LogisticRegression()

In [None]:
log_model.fit(X_train, y_train)

__`Step 21`__ Once the model is trained, obtain predictions for both the training and validation sets.

In [None]:
y_pred_train = log_model.predict(X_train)
y_pred_val = log_model.predict(X_val)

__`Step 22`__ From __sklearn.metrics__ import `confusion_matrix`, `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `roc_auc_score`, `precision_recall_curve`, and `roc_curve`.

The metrics used for classification differ from the ones used for regression.

Most classification metrics are calculated from the **confusion matrix**. This matrix summarizes how the predictions of a classification model compare to the actual values, and serves as the foundation for computing accuracy, precision, recall, and other performance measures.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, roc_curve

<a class="anchor" id="confusion">
    
### 2.1. The confusion matrix

</a>

The confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") by counting the number of correct and incorrect predictions made by the model compared to the actual outcomes (target values) in the data. Its shape is typically a square matrix, where the rows represent the actual classes and the columns represent the predicted classes. In a binary classification problem, the confusion matrix is a 2x2 table with the following components:

|                      | Predicted Negative | Predicted Positive |
|----------------------|--------------------|--------------------|
| **Actual Negative**   | True Negative (TN) | False Positive (FP)|
| **Actual Positive**   | False Negative (FN)| True Positive (TP) |



<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix'>sklearn.metrics.confusion_matrix(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute confusion matrix to evaluate the accuracy of a classification.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

__`Step 23`__ Obtain the confusion matrix.

In [None]:
cm_train = confusion_matrix(y_train, y_pred_train)
cm_train

In [None]:
cm_val = confusion_matrix(y_val, y_pred_val)
cm_val

**Understanding the Confusion Matrix**

The confusion matrix in sklearn is presented in the following format:

```
[[TN  FP]
 [FN  TP]]
```

Where:
- **TN (True Negatives)**: Cases correctly predicted as negative
- **FP (False Positives)**: Cases incorrectly predicted as positive (Type I Error)
- **FN (False Negatives)**: Cases incorrectly predicted as negative (Type II Error)
- **TP (True Positives)**: Cases correctly predicted as positive
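To see where each cell comes from, the sketch below counts the four outcomes by hand with NumPy and arranges them in the same `[[TN, FP], [FN, TP]]` layout. The labels are made up for illustration (1 is the positive class, 0 the negative class).

```python
import numpy as np

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])

tn = np.sum((y_true == 0) & (y_pred == 0))  # correctly predicted negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives predicted as positive
fn = np.sum((y_true == 1) & (y_pred == 0))  # positives predicted as negative
tp = np.sum((y_true == 1) & (y_pred == 1))  # correctly predicted positives

# Rows are actual classes, columns are predicted classes
cm_manual = np.array([[tn, fp],
                      [fn, tp]])
print(cm_manual)
```

For binary 0/1 labels, this should match the array returned by `confusion_matrix(y_true, y_pred)`.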

**Advantages of Confusion Matrix**:
- Comprehensive view: Shows all possible outcomes of predictions (TP, TN, FP, FN).
- Foundation for other metrics: Most classification metrics are derived from the confusion matrix.
- Error analysis: Helps identify which type of errors (false positives vs false negatives) the model is making.
- Visual interpretation: Easy to visualize and understand model behavior.

**Disadvantages of Confusion Matrix**:
- Not a single metric: Cannot directly compare models with a single number.
- Requires interpretation: Need to understand what each cell means in the context of your problem.
- Scale dependent: Raw counts can be misleading with imbalanced datasets.

**When should the Confusion Matrix be used?**
- When you need a detailed breakdown of model predictions.
- When understanding the types of errors is as important as the overall accuracy.
- As a starting point for computing other classification metrics.
- When communicating model performance to stakeholders who need to understand specific error types.

<a class="anchor" id="accuracy">
    
### 2.2. The Accuracy Score

</a>

**Accuracy** is the most intuitive classification metric. It measures the proportion of correct predictions (both positive and negative) out of all predictions made. While simple to understand, accuracy can be misleading when dealing with imbalanced datasets.

$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$

Where:
- **TP**: True Positives
- **TN**: True Negatives
- **FP**: False Positives
- **FN**: False Negatives


<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score'>sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True,...)</a>

__Definition:__ <br>
Accuracy classification score.

__Interpretation:__ <br>
If normalize is True, then the best performance is 1. When normalize = False, then the best performance is the number of samples.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
_normalize_: If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples. <br>
...
</div>
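
The effect of the `normalize` parameter can be seen in a small sketch. The label lists below are hypothetical and serve only to show the two return values.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions are correct
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

# Default (normalize=True): fraction of correct predictions
fraction_correct = accuracy_score(y_true_demo, y_pred_demo)

# normalize=False: number of correct predictions
count_correct = accuracy_score(y_true_demo, y_pred_demo, normalize=False)

print(fraction_correct, count_correct)
```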

__`Step 24`__ Compute the accuracy score for both the training and validation sets.

In [None]:
accuracy_train = accuracy_score(y_train, y_pred_train)
accuracy_train

In [None]:
accuracy_val = accuracy_score(y_val, y_pred_val)
accuracy_val

**Advantages of Accuracy**:
- Simple and intuitive: Easy to understand and explain to non-technical stakeholders.
- Single metric: Provides one number to compare different models.
- Balanced measure: Considers both positive and negative predictions.
- Widely used: Common benchmark across many classification problems.

**Disadvantages of Accuracy**:
- Misleading with imbalanced data: Can show high values even when the model performs poorly on the minority class.
- Treats all errors equally: Doesn't distinguish between false positives and false negatives, which may have different costs.
- Not informative about error types: Doesn't tell you which class the model struggles with.

**Is accuracy always a good option?**

Let's check with an example:

<img src="images/example_1.png" alt="Drawing" style="width: 400px;"/>

In this case, what is the accuracy?

<img src="images/example_2.png" alt="Drawing" style="width: 300px;"/>

We have an accuracy of 99.1%, which looks very high. That is great, right?

**Well, not really...**

Imagine that we are testing people for COVID. An actual positive is someone who is sick and carrying a virus that can spread very quickly, so the cost of misclassifying an actual positive (a false negative) is very high. Accuracy alone does not tell us how many of these cases were missed. In this scenario, we need to look at other metrics that focus on capturing the positive cases.
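
The same effect can be reproduced with a short sketch. The counts here (9 sick people among 1,000 tested) are hypothetical, chosen so that a model that never predicts the positive class also reaches 99.1% accuracy; the example in the images above may use different numbers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical, highly imbalanced data: 9 positives and 991 negatives
y_true_imb = np.array([1] * 9 + [0] * 991)

# A "model" that always predicts the negative class
y_pred_imb = np.zeros(1000, dtype=int)

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
rec_imb = recall_score(y_true_imb, y_pred_imb, zero_division=0)

print(f"Accuracy: {acc_imb:.3f}")  # 0.991
print(f"Recall:   {rec_imb:.3f}")  # 0.000, every sick person is missed
```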

**When should Accuracy be used?**
- When the dataset is balanced (roughly equal number of samples in each class).
- When false positives and false negatives have similar costs.
- When you need a simple, interpretable metric for initial model evaluation.
- When all classes are equally important to predict correctly.

<a class="anchor" id="precision">
    
### 2.3. The Precision

</a>

**Precision** measures the accuracy of positive predictions. It answers the question: "Of all the instances we predicted as positive, how many were actually positive?" This metric is particularly important when the cost of false positives is high.

$$
Precision = \frac{TP}{TP + FP}
$$

Where:
- **TP**: True Positives (correctly predicted positive cases)
- **FP**: False Positives (incorrectly predicted positive cases)

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score'>sklearn.metrics.precision_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the precision.

__Interpretation:__ <br>
The best value is 1, and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

__`Step 25`__ Compute the precision score for both the training and validation sets.

In [None]:
precision_train = precision_score(y_train, y_pred_train)
precision_train

In [None]:
precision_val = precision_score(y_val, y_pred_val)
precision_val

**Understanding Precision**

Looking at the confusion matrix, we can verify that precision only involves the instances that were predicted as positive:
    
<img src="images/example_3.png" alt="Drawing" style="width: 400px;"/>

So precision tells us how precise/accurate our model is: out of those predicted positive, how many of them are actually positive.
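
To make the link with the confusion matrix concrete, here is a small sketch that computes precision by hand from the predicted-positive column and compares it with `precision_score`. The labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels: 4 instances are predicted positive, 2 of them correctly
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 1, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision only uses the predicted-positive column: TP and FP
precision_manual = tp / (tp + fp)
precision_sklearn = precision_score(y_true_demo, y_pred_demo)

print(precision_manual, precision_sklearn)  # both 0.5
```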

**Advantages of Precision**:
- Focus on positive predictions: Directly measures the reliability of positive predictions.
- Important for specific applications: Critical when false positives are costly.
- Easy to interpret: Straightforward percentage of correct positive predictions.
- Complements recall: Together they provide a complete picture of positive class performance.

**Disadvantages of Precision**:
- Ignores false negatives: Doesn't account for positive cases that were missed.
- Can be manipulated: A model that predicts very few positives can have high precision but miss many actual positives.
- Class imbalance sensitive: Can be misleading in imbalanced datasets.
- Incomplete on its own: Should be used alongside recall for full understanding.

**When should Precision be used?**

`When the cost of False Positives is high.`

**Example**: Email spam detection
- Negative = Not spam
- Positive = Spam

A **false positive** means a legitimate email is classified as spam. If precision is low, important emails (like job offers, client communications, or medical results) might end up in the spam folder, causing the user to miss critical information. High precision ensures that when an email is marked as spam, it really is spam.

**Other examples where precision matters**:
- Medical diagnosis for expensive treatments (don't want to treat healthy patients)
- Fraud detection in banking (don't want to block legitimate transactions)
- Content moderation (don't want to wrongly remove appropriate content)

<a class="anchor" id="recall">
    
### 2.4. The Recall

</a>

**Recall** (also known as **Sensitivity** or **True Positive Rate**) measures the ability of a model to find all positive cases. It answers the question: "Of all the actual positive instances, how many did we correctly identify?" This metric is crucial when the cost of false negatives is high.

$$
Recall = \frac{TP}{TP + FN}
$$

Where:
- **TP**: True Positives (correctly predicted positive cases)
- **FN**: False Negatives (positive cases incorrectly predicted as negative)

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score'>sklearn.metrics.recall_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the recall.

__Interpretation:__ <br>
The best value is 1 and the worst value is 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

__`Step 26`__ Compute the recall score for both the training and validation sets.

In [None]:
recall_train = recall_score(y_train, y_pred_train)
recall_train

In [None]:
recall_val = recall_score(y_val, y_pred_val)
recall_val

**Understanding Recall**

Looking at the confusion matrix:
    
<img src="images/example_4.png" alt="Drawing" style="width: 400px;"/>

Recall calculates how many of the actual positives our model is able to capture by labeling them as positive (True Positive). It measures the model's **completeness** in identifying positive cases.

**Advantages of Recall**:
- Focuses on completeness: Ensures we don't miss positive cases.
- Critical for safety: Important when missing a positive case has serious consequences.
- Complements precision: Together they provide complete evaluation of positive class performance.
- Intuitive interpretation: Directly shows what percentage of actual positives were found.

**Disadvantages of Recall**:
- Ignores false positives: Doesn't consider how many negatives were incorrectly classified as positive.
- Can be manipulated: A model that predicts everything as positive will have perfect recall but be useless.
- Incomplete on its own: Must be balanced with precision to avoid excessive false positives.
- May lead to over-prediction: Optimizing only for recall can result in too many positive predictions.
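
The "predict everything as positive" case can be checked with a short sketch. The labels are hypothetical (2 positives among 10 samples).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 2 positives out of 10 samples
y_true_demo = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# A "model" that labels every sample as positive
y_pred_all_pos = np.ones(10, dtype=int)

recall_all_pos = recall_score(y_true_demo, y_pred_all_pos)
precision_all_pos = precision_score(y_true_demo, y_pred_all_pos)

print(f"Recall:    {recall_all_pos:.2f}")     # 1.00, no positive is missed
print(f"Precision: {precision_all_pos:.2f}")  # 0.20, equal to the share of positives
```

Recall is perfect, yet precision drops to the proportion of positives in the data, which is why the two metrics are usually read together.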

**When should Recall be used?**

`When the cost of False Negatives is high.`

**Example**: COVID-19 Testing
- Negative = Not sick
- Positive = Sick with COVID-19

A **false negative** means a sick patient is told they are healthy. This is extremely dangerous because:
- The patient won't get treatment they need
- They will continue their normal activities and spread the virus to others
- The disease could worsen without medical intervention

High recall ensures we catch as many actual COVID cases as possible, even if it means some healthy people are initially flagged (false positives can be filtered with additional testing).

**Other examples where recall matters**:
- Cancer screening (don't want to miss any cancer cases)
- Fraud detection in credit cards (don't want to miss fraudulent transactions)
- Security threat detection (don't want to miss potential threats)
- Disease outbreak detection (need to catch all potential cases)

<a class="anchor" id="f1">
    
### 2.5. The F1 Score

</a>

The **F1 Score** is the harmonic mean of precision and recall. It provides a single metric that balances both concerns: finding all positive cases (recall) and ensuring predictions are accurate (precision). The F1 score is particularly useful when you need a balance between precision and recall, or when dealing with imbalanced datasets.

$$
F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2 \times TP}{2 \times TP + FP + FN}
$$

Where:
- **TP**: True Positives
- **FP**: False Positives
- **FN**: False Negatives

**Why use harmonic mean instead of arithmetic mean?**
The harmonic mean penalizes extreme values more than the arithmetic mean. If either precision or recall is very low, the F1 score will also be low, even if the other metric is high.
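
A quick numerical sketch illustrates the difference. The precision and recall values are hypothetical, picked to represent a model that is precise but misses most positives.

```python
# Hypothetical scores: very high precision, very low recall
precision_demo = 0.9
recall_demo = 0.1

arithmetic_mean = (precision_demo + recall_demo) / 2
harmonic_mean = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"Arithmetic mean:    {arithmetic_mean:.2f}")  # 0.50
print(f"Harmonic mean (F1): {harmonic_mean:.2f}")    # 0.18
```

The arithmetic mean suggests a mediocre model, while the harmonic mean stays close to the weaker of the two values.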

<div class="alert alert-block alert-info">
<a href = 'https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score'>sklearn.metrics.f1_score(y_true, y_pred, ...)</a>

__Definition:__ <br>
Compute the F1 score, also known as balanced F-score or F-measure.

__Interpretation:__ <br>
F1 score reaches its best value at 1 and worst score at 0.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_pred_: Estimated targets as returned by a classifier; <br>
...
</div>

__`Step 27`__ Compute the F1 score for both the training and validation sets.

In [None]:
f1_train = f1_score(y_train, y_pred_train)
f1_train

In [None]:
f1_val = f1_score(y_val, y_pred_val)
f1_val

**Advantages of F1 Score**:
- Balanced metric: Combines both precision and recall into a single score.
- Handles imbalance well: More informative than accuracy for imbalanced datasets.
- Penalizes extreme cases: The harmonic mean ensures both precision and recall need to be good.
- Single metric for optimization: Useful when you need to optimize or compare models with one number.

**Disadvantages of F1 Score**:
- Equal weighting: Treats precision and recall as equally important, which may not match real-world costs.
- Less interpretable: Not as intuitive as precision or recall individually.
- Ignores true negatives: Doesn't account for correctly predicted negative cases.
- May not reflect business goals: Sometimes precision or recall alone is more aligned with business objectives.

**When should F1 Score be used?**
- When you need to seek a **balance between precision and recall**.
- When there is an **uneven class distribution** (large number of actual negatives).
- When **both false positives and false negatives have significant (and similar) costs**.
- When you need a **single metric** to compare models but accuracy is misleading due to class imbalance.

**Example Use Cases**:
- Information retrieval systems (search engines need both precision and recall)
- Medical diagnosis where both missing cases and false alarms are problematic
- Customer churn prediction (want to identify churners without annoying loyal customers)
- Quality control systems (need to catch defects without rejecting good products)

## Understanding Classification Thresholds

Before we dive into ROC curves and Precision-Recall curves, it's crucial to understand what a **classification threshold** is and why it matters.

### What is a Classification Threshold?

Most classification models (like Logistic Regression) don't directly output class labels (0 or 1). Instead, they output **probabilities** that represent the model's confidence that an instance belongs to the positive class.

For example:
- Probability = 0.85 → Model is 85% confident this is a positive case
- Probability = 0.23 → Model is 23% confident this is a positive case

To convert these probabilities into actual predictions (0 or 1), we need to set a **threshold**:
- If probability ≥ threshold → Predict **Positive (1)**
- If probability < threshold → Predict **Negative (0)**

**The default threshold is usually 0.5**, but this is not always optimal!

### Why Does the Threshold Matter?

Different thresholds lead to different trade-offs between precision and recall, and between true positives and false positives:

- **Lower threshold (e.g., 0.3)**: 
  - More instances classified as positive
  - Higher recall (catch more positives)
  - Lower precision (more false positives)
  - Use when: Missing a positive case is very costly

- **Higher threshold (e.g., 0.7)**:
  - Fewer instances classified as positive
  - Lower recall (miss some positives)
  - Higher precision (fewer false positives)
  - Use when: False alarms are very costly

### Example: Medical Screening

Imagine a COVID-19 screening model that outputs probabilities:

| Patient | Probability | True Status |
|---------|------------|-------------|
| A       | 0.95       | Positive    |
| B       | 0.65       | Positive    |
| C       | 0.45       | Negative    |
| D       | 0.30       | Positive    |
| E       | 0.10       | Negative    |

**With threshold = 0.5:**
- Predictions: A=Positive, B=Positive, C=Negative, D=Negative, E=Negative
- Missed patient D (False Negative)
- Recall = 2/3 = 0.67

**With threshold = 0.3:**
- Predictions: A=Positive, B=Positive, C=Positive, D=Positive, E=Negative
- Caught all sick patients!
- Recall = 3/3 = 1.0
- But patient C is a false positive

The choice depends on your priorities: **Would you rather miss sick patients or test more people?**
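
The numbers in this example can be reproduced with a short sketch. The arrays simply encode the five patients from the table (Positive = 1, Negative = 0); the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Patients A to E from the table above
proba_demo = np.array([0.95, 0.65, 0.45, 0.30, 0.10])
y_true_demo = np.array([1, 1, 0, 1, 0])

results_demo = {}
for threshold in [0.5, 0.3]:
    # Predict positive when the probability reaches the threshold
    y_pred_demo = (proba_demo >= threshold).astype(int)
    prec = precision_score(y_true_demo, y_pred_demo, zero_division=0)
    rec = recall_score(y_true_demo, y_pred_demo, zero_division=0)
    results_demo[threshold] = (prec, rec)
    print(f"threshold={threshold}: precision={prec:.2f}, recall={rec:.2f}")
```

Lowering the threshold from 0.5 to 0.3 raises recall from 0.67 to 1.00, while precision falls from 1.00 to 0.75 because patient C becomes a false positive.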

__`Step 27a`__ Let's demonstrate how different thresholds affect our predictions. First, get the probability predictions.

In [None]:
# Get probability predictions for the positive class
y_pred_proba_val = log_model.predict_proba(X_val)[:, 1]

# Show first 10 probabilities
print("First 10 probability predictions:")
print(y_pred_proba_val[:10])
print("\nActual labels:")
print(y_val.values[:10])

__`Step 27b`__ Now let's compare how different thresholds (0.3, 0.5, and 0.7) affect precision, recall, and F1 score.

In [None]:
# Test different thresholds
thresholds_to_test = [0.3, 0.5, 0.7]
results = []

for threshold in thresholds_to_test:
    # Apply threshold to get predictions
    y_pred_threshold = (y_pred_proba_val >= threshold).astype(int)
    
    # Calculate metrics
    acc = accuracy_score(y_val, y_pred_threshold)
    prec = precision_score(y_val, y_pred_threshold, zero_division=0)
    rec = recall_score(y_val, y_pred_threshold, zero_division=0)
    f1 = f1_score(y_val, y_pred_threshold, zero_division=0)
    
    results.append([threshold, acc, prec, rec, f1])

# Create comparison table
threshold_comparison = pd.DataFrame(results, 
                                    columns=['Threshold', 'Accuracy', 'Precision', 'Recall', 'F1'])
threshold_comparison

**Observations:**
- **Lower threshold (0.3)**: Typically higher recall (catches more positives) but lower precision (more false alarms)
- **Default threshold (0.5)**: Usually sits between the two, although it is not guaranteed to give the best balance for a given dataset
- **Higher threshold (0.7)**: Typically higher precision (fewer false alarms) but lower recall (misses more positives)

The question is: **How do we choose the optimal threshold?** That's where ROC curves and Precision-Recall curves come in!

<a class="anchor" id="roc">
    
### 2.6. ROC Curve and AUC Score

</a>

So far, all the metrics we've discussed have been based on **hard predictions** (0 or 1). However, most classification models actually output **probabilities** (e.g., 0.85 means 85% confident the instance is positive). The **ROC (Receiver Operating Characteristic) Curve** and **AUC (Area Under the Curve) Score** are metrics that evaluate model performance across **all possible classification thresholds**.

#### What is the ROC Curve?

The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It plots:
- **Y-axis**: True Positive Rate (TPR) = Recall = $\frac{TP}{TP + FN}$
- **X-axis**: False Positive Rate (FPR) = $\frac{FP}{FP + TN}$

By varying the classification threshold from 0 to 1, we get different combinations of TPR and FPR, which create the ROC curve.
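
As a rough sketch of how individual points on the curve arise, the snippet below computes (FPR, TPR) pairs by hand for a few thresholds, reusing the five patients from the medical screening table. The helper `roc_point` is a hypothetical function written for this illustration; in practice `roc_curve` does this for every relevant threshold.

```python
import numpy as np

# Five patients from the medical screening example (Positive = 1, Negative = 0)
y_true_demo = np.array([1, 1, 0, 1, 0])
scores_demo = np.array([0.95, 0.65, 0.45, 0.30, 0.10])

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

for threshold in [1.0, 0.5, 0.3, 0.0]:
    fpr_t, tpr_t = roc_point(y_true_demo, scores_demo, threshold)
    print(f"threshold={threshold}: FPR={fpr_t:.2f}, TPR={tpr_t:.2f}")
```

As the threshold decreases, the points move from (0, 0), where nothing is predicted positive, to (1, 1), where everything is.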

#### What is the AUC Score?

The **AUC (Area Under the ROC Curve)** score measures the entire two-dimensional area underneath the entire ROC curve. It provides an aggregate measure of performance across all possible classification thresholds.

**AUC Interpretation:**
- **AUC = 1.0**: Perfect classifier (can perfectly separate classes at some threshold)
- **AUC = 0.5**: Random classifier (no better than flipping a coin)
- **AUC < 0.5**: Worse than random (predictions are systematically wrong)
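- **AUC between 0.7-0.8**: Often considered acceptable performance
- **AUC between 0.8-0.9**: Often considered excellent performance
- **AUC > 0.9**: Often considered outstanding performance (these ranges are rules of thumb; what counts as good depends on the problem)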

<div class="alert alert-block alert-info">
<a href='https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html'>sklearn.metrics.roc_auc_score(y_true, y_score, ...)</a>

__Definition:__ <br>
Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC) from prediction scores.

__Interpretation:__ <br>
Best possible score is 1.0. A score of 0.5 indicates random predictions.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_y_score_: Target scores (probability estimates of the positive class); <br>
...
</div>

__`Step 28`__ First, we need to obtain the **probability scores** instead of hard predictions. Use the `predict_proba` method to get probability estimates.

In [None]:
# Get probability predictions for the positive class (column 1)
y_pred_proba_train = log_model.predict_proba(X_train)[:, 1]
y_pred_proba_val = log_model.predict_proba(X_val)[:, 1]

__`Step 29`__ Compute the AUC score for both training and validation sets.

In [None]:
auc_train = roc_auc_score(y_train, y_pred_proba_train)
auc_train

In [None]:
auc_val = roc_auc_score(y_val, y_pred_proba_val)
auc_val

__`Step 30`__ Now let's plot the ROC curve to visualize the trade-off between True Positive Rate and False Positive Rate.

In [None]:
# Compute ROC curve for validation set
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba_val)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {auc_val:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--', label='Random Classifier (AUC = 0.50)')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('Receiver Operating Characteristic (ROC) Curve', fontsize=14)
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()

**Understanding the ROC Curve:**
- The **diagonal line** (gray dashed line) represents a random classifier
- The **closer the curve is to the top-left corner**, the better the model
- A **perfect classifier** would have a point at (0, 1) - 100% TPR with 0% FPR
- The **area under this curve (AUC)** summarizes the overall performance

**Advantages of ROC-AUC**:
- Threshold-independent: Evaluates model performance across all possible thresholds, not just one.
- Robust to class imbalance: Unlike accuracy, AUC is less affected by imbalanced datasets.
- Single metric: Provides one number to compare models while considering the full range of operating points.
- Probabilistic interpretation: Represents the probability that the model ranks a random positive example higher than a random negative example.
- Visual insight: The ROC curve shows the trade-off between sensitivity and specificity.
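
The probabilistic interpretation can be checked on a toy example. The sketch below compares every positive score with every negative score and counts how often the positive one is higher; the data are the five patients from the medical screening table, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true_demo = np.array([1, 1, 0, 1, 0])
scores_demo = np.array([0.95, 0.65, 0.45, 0.30, 0.10])

pos_scores = scores_demo[y_true_demo == 1]
neg_scores = scores_demo[y_true_demo == 0]

# Score difference for every (positive, negative) pair
diff = pos_scores[:, None] - neg_scores[None, :]

# Fraction of pairs ranked correctly; ties count as half a correct ranking
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

auc_demo = roc_auc_score(y_true_demo, scores_demo)
print(pairwise_auc, auc_demo)  # both 5/6, about 0.833
```

Only one of the six pairs is ranked incorrectly (the positive with score 0.30 falls below the negative with score 0.45), which gives 5/6.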

**Disadvantages of ROC-AUC**:
- Can be overly optimistic with highly imbalanced data: May show good AUC even when precision is poor.
- Doesn't reflect real-world threshold: You still need to choose a threshold for actual predictions.
- Less informative for imbalanced datasets: Precision-Recall curves may be more appropriate in such cases.
- Doesn't show prediction calibration: High AUC doesn't mean probabilities are well-calibrated.

**When should ROC-AUC be used?**
- When you need to evaluate model performance across **all possible thresholds**.
- When you want a **single metric** that's **robust to class imbalance** (though with caveats).
- When **both classes are important** and you want to balance TPR and FPR.
- For **model comparison** when you haven't decided on an operating threshold yet.
- In scenarios like **ranking problems** where relative ordering matters more than absolute predictions.

**Example Use Cases**:
- Medical screening where different thresholds may be used depending on resource availability
- Credit scoring where the threshold can be adjusted based on business needs
- Spam detection where the trade-off between catching spam and blocking legitimate emails can vary
- Fraud detection where the sensitivity can be tuned based on risk tolerance

### Using ROC Curves to Find the Optimal Threshold

While the AUC score tells us about overall model performance, the ROC curve can also help us **select the best threshold** for our specific use case.

__`Step 30a`__ Find the optimal threshold using **Youden's J statistic**, which selects the threshold that maximizes TPR - FPR.

In [None]:
# Calculate Youden's J statistic for each threshold
# J = TPR - FPR = Sensitivity - (1 - Specificity)
youdens_j = tpr - fpr

# Find the optimal threshold (maximum J)
optimal_idx = np.argmax(youdens_j)
optimal_threshold_roc = thresholds[optimal_idx]

print(f"Optimal threshold (ROC - Youden's J): {optimal_threshold_roc:.4f}")
print(f"TPR at optimal threshold: {tpr[optimal_idx]:.4f}")
print(f"FPR at optimal threshold: {fpr[optimal_idx]:.4f}")

__`Step 30b`__ Visualize the optimal threshold on the ROC curve.

In [None]:
# Plot ROC curve with optimal threshold marked
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {auc_val:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--', label='Random Classifier')

# Mark the optimal threshold
plt.scatter(fpr[optimal_idx], tpr[optimal_idx], color='red', s=100, 
            label=f'Optimal Threshold = {optimal_threshold_roc:.2f}', zorder=5)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate (Recall)', fontsize=12)
plt.title('ROC Curve with Optimal Threshold', fontsize=14)
plt.legend(loc="lower right")
plt.grid(alpha=0.3)
plt.show()
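
Once a threshold has been chosen, it still has to be applied to the probabilities to obtain hard predictions, since `predict` effectively uses a 0.5 cut-off for a binary logistic regression. The following is a self-contained sketch of that step under stated assumptions: the synthetic data from `make_classification`, the model `demo_model`, and all variable names are illustrative and are not the notebook's own objects. In practice the threshold should be selected on validation data, not on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (an assumption made for this sketch only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_va = demo_model.predict_proba(X_va)[:, 1]

# Choose the threshold that maximizes Youden's J = TPR - FPR
fpr_demo, tpr_demo, thr_demo = roc_curve(y_va, proba_va)
best_idx = np.argmax(tpr_demo - fpr_demo)
best_threshold = thr_demo[best_idx]

# Hard predictions with the default cut-off and with the chosen threshold
y_pred_default = demo_model.predict(X_va)
y_pred_tuned = (proba_va >= best_threshold).astype(int)

recall_default = recall_score(y_va, y_pred_default)
recall_tuned = recall_score(y_va, y_pred_tuned)
print(f"Chosen threshold: {best_threshold:.3f}")
print(f"Recall at default cut-off: {recall_default:.3f}")
print(f"Recall at chosen threshold: {recall_tuned:.3f}")
```

The recall at the chosen threshold matches the TPR that `roc_curve` reports at the same index, because both apply the rule "score >= threshold".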

<a class="anchor" id="pr-curve">
    
### 2.7. Precision-Recall Curve

</a>

While the ROC curve is useful in many scenarios, it can be **overly optimistic** when dealing with **highly imbalanced datasets**. In such cases, the **Precision-Recall (PR) Curve** provides a more informative picture of model performance.

#### What is the Precision-Recall Curve?

The PR curve plots:
- **Y-axis**: Precision = $\frac{TP}{TP + FP}$
- **X-axis**: Recall = $\frac{TP}{TP + FN}$

By varying the classification threshold from 0 to 1, we get different combinations of precision and recall, which create the PR curve.

#### Why use PR Curves for imbalanced data?

When you have **imbalanced classes** (e.g., 1% positive, 99% negative):
- ROC curves can be misleading because a high number of true negatives (TN) can make the False Positive Rate look artificially low
- PR curves focus only on the positive class, making them more sensitive to model performance on the minority class
- A poor model can still have a good-looking ROC curve with imbalanced data, but will show poor performance on a PR curve
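
A small constructed example shows how far the two summaries can diverge. The scores below are hypothetical and deliberately simple: 10 positives all score higher than 95% of the 1,000 negatives, yet 50 negatives still score above them.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical scores: 1,000 negatives spread evenly over 0.000 ... 0.999
neg_scores = np.arange(1000) / 1000
# 10 positives that outrank 950 of the negatives but fall below the top 50
pos_scores = np.full(10, 0.9495)

y_true_imb = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])
scores_imb = np.concatenate([neg_scores, pos_scores])

auc_imb = roc_auc_score(y_true_imb, scores_imb)
ap_imb = average_precision_score(y_true_imb, scores_imb)

print(f"ROC AUC:           {auc_imb:.3f}")  # 0.950
print(f"Average precision: {ap_imb:.3f}")   # 0.167
```

The ROC AUC looks strong because only 5% of the negatives are ranked too high, but those 50 negatives outnumber the 10 positives, so at the point where the positives are recovered only 10 of 60 flagged cases are correct.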

#### Interpreting the PR Curve

- **Baseline**: A random classifier has a precision close to the proportion of positive samples at every recall level, so its PR curve is roughly a horizontal line at that value (e.g., 0.1 if 10% are positive)
- **Better models**: The curve is closer to the top-right corner (high precision and high recall)
- **Perfect classifier**: Would have a point at (1, 1) - 100% precision and 100% recall
- **Area Under PR Curve**: Can be computed as a summary metric, similar to AUC for ROC; the **average precision (AP)** is a common way to summarize it

<div class="alert alert-block alert-info">
<a href='https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html'>sklearn.metrics.precision_recall_curve(y_true, probas_pred, ...)</a>

__Definition:__ <br>
Compute precision-recall pairs for different probability thresholds.

__Returns:__ <br>
_precision_: Precision values; <br>
_recall_: Recall values; <br>
_thresholds_: Increasing thresholds on the decision function used to compute precision and recall.

__Parameters:__ <br>
_y_true_: Ground truth (correct) target values; <br>
_probas_pred_: Target scores (probability estimates of the positive class); <br>
...
</div>

__`Step 31`__ Compute and plot the Precision-Recall curve for the validation set.

In [None]:
# Compute Precision-Recall curve
precision, recall, thresholds_pr = precision_recall_curve(y_val, y_pred_proba_val)

# Calculate the baseline (proportion of positive samples)
baseline = y_val.sum() / len(y_val)

# Plot Precision-Recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, color='blue', lw=2, label='PR curve')
plt.axhline(y=baseline, color='gray', linestyle='--', lw=2, label=f'Baseline (Random) = {baseline:.2f}')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve', fontsize=14)
plt.legend(loc="best")
plt.grid(alpha=0.3)
plt.show()

**Understanding the Precision-Recall Curve:**
- The **horizontal dashed line** represents the baseline (proportion of positive samples in the dataset)
- **Higher curves** (closer to top-right) indicate better model performance
- The curve typically shows a **trade-off**: as recall increases, precision tends to decrease
- Unlike ROC curves, PR curves often show a **"sawtooth" pattern**, because precision can go up or down as the threshold is lowered, whereas recall can only stay the same or increase

**Advantages of Precision-Recall Curves**:
- Better for imbalanced datasets: More informative than ROC when the positive class is rare.
- Focuses on positive class: Directly shows performance on the class of interest.
- Sensitive to improvements: Better reflects improvements in model performance on minority class.
- No influence from TN: Not affected by the large number of true negatives in imbalanced datasets.
- Practical for many applications: Directly shows the precision-recall trade-off you'll face in deployment.

**Disadvantages of Precision-Recall Curves**:
- Less intuitive: Not as well-known or easily interpreted as ROC curves.
- Baseline changes with data: The random baseline varies with class imbalance, making comparison across datasets harder.
- No information on TN: Completely ignores true negatives, which may be important in some applications.
- Harder to compare: Visual comparison of curves can be more difficult than with ROC curves.

**When should Precision-Recall Curves be used?**
- When dealing with **highly imbalanced datasets** (e.g., fraud detection, rare disease diagnosis).
- When the **positive class is more important** than the negative class.
- When you need to **understand the precision-recall trade-off** for threshold selection.
- When **true negatives are abundant** and not particularly informative.
- In scenarios where both **precision and recall matter** for the positive class.

**Example Use Cases**:
- Fraud detection (1% fraudulent transactions, 99% legitimate)
- Medical diagnosis for rare diseases (1% disease prevalence)
- Information retrieval (finding relevant documents in a large corpus)
- Anomaly detection (rare events in system monitoring)
- Click-through rate prediction (small percentage of users click)

### Using Precision-Recall Curves to Find the Optimal Threshold

The Precision-Recall curve is especially useful for finding the optimal threshold when dealing with imbalanced datasets or when precision and recall have different importance.

__`Step 31a`__ Find the optimal threshold by maximizing the **F1 score** (balances precision and recall).

In [None]:
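# Calculate F1 score for each threshold
# F1 = 2 * (precision * recall) / (precision + recall)
# precision and recall have one more element than thresholds_pr: the final
# (precision=1, recall=0) point has no threshold, so it is dropped here
denominator = precision[:-1] + recall[:-1]
f1_scores = np.zeros_like(denominator)
# Divide only where the denominator is non-zero to avoid a runtime warning
np.divide(2 * precision[:-1] * recall[:-1], denominator, out=f1_scores, where=denominator > 0)

# Find the optimal threshold (maximum F1)
optimal_idx_pr = np.argmax(f1_scores)
optimal_threshold_pr = thresholds_pr[optimal_idx_pr]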

print(f"Optimal threshold (PR - Max F1): {optimal_threshold_pr:.4f}")
print(f"Precision at optimal threshold: {precision[optimal_idx_pr]:.4f}")
print(f"Recall at optimal threshold: {recall[optimal_idx_pr]:.4f}")
print(f"F1 Score at optimal threshold: {f1_scores[optimal_idx_pr]:.4f}")

__`Step 31b`__ Visualize the optimal threshold on the Precision-Recall curve.

In [None]:
# Plot PR curve with optimal threshold marked
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, color='blue', lw=2, label='PR curve')
plt.axhline(y=baseline, color='gray', linestyle='--', lw=2, label=f'Baseline = {baseline:.2f}')

# Mark the optimal threshold
plt.scatter(recall[optimal_idx_pr], precision[optimal_idx_pr], color='red', s=100, 
            label=f'Optimal Threshold = {optimal_threshold_pr:.2f}\n(Max F1 = {f1_scores[optimal_idx_pr]:.2f})', 
            zorder=5)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall', fontsize=12)
plt.ylabel('Precision', fontsize=12)
plt.title('Precision-Recall Curve with Optimal Threshold', fontsize=14)
plt.legend(loc="best")
plt.grid(alpha=0.3)
plt.show()

__`Step 31c`__ Compare the thresholds found by different methods.

In [None]:
# Create a summary table of optimal thresholds
threshold_methods = pd.DataFrame({
    'Method': ['Default', 'ROC (Youden\'s J)', 'PR (Max F1)'],
    'Threshold': [0.5, optimal_threshold_roc, optimal_threshold_pr]
})

print("Optimal Thresholds by Different Methods:")
print(threshold_methods)

## Comparing Metrics

Now that we've covered all the main classification metrics, let's consolidate them in a summary table.

__`Step 32`__ Consolidate all the classification metrics (accuracy, precision, recall, F1, and AUC) for the train and validation sets into a single table.

In [None]:
classification_metrics = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1', 'AUC'],
    'Train': [accuracy_train, precision_train, recall_train, f1_train, auc_train],
    'Validation': [accuracy_val, precision_val, recall_val, f1_val, auc_val]
})

classification_metrics

<a class="anchor" id="classification-comparison">

### 2.8. How Metrics Can Change Model Selection: A Classification Example

</a>

Just like in the regression section, let's build a simple numerical example to see how different classification metrics might lead us to prefer different models.

__`Step 33`__ Define the true labels and three different sets of predictions representing competing models.

In [None]:
y_true_cls = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Model A: balanced performance, few mistakes
preds_model_A_cls = np.array([1, 0, 1, 0, 1, 0, 0, 0])

# Model B: very conservative, rarely predicts positive
preds_model_B_cls = np.array([0, 0, 1, 0, 0, 0, 0, 0])

# Model C: aggressive, predicts many positives
preds_model_C_cls = np.array([1, 1, 1, 0, 1, 1, 1, 0])

__`Step 34`__ Build a helper function that gathers the main classification metrics for any set of predictions.

In [None]:
def classification_metrics_summary(y_true, y_pred):
    return [
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred, zero_division=0),
        f1_score(y_true, y_pred, zero_division=0)
    ]

__`Step 35`__ Use the helper to compare the models across accuracy, precision, recall, and F1, and display the results in a table.

In [None]:
classification_comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1'],
    'Model A': classification_metrics_summary(y_true_cls, preds_model_A_cls),
    'Model B': classification_metrics_summary(y_true_cls, preds_model_B_cls),
    'Model C': classification_metrics_summary(y_true_cls, preds_model_C_cls)
})

classification_comparison

__`Step 36`__ Reflect on the table and discuss which model each metric prefers and why.

<div class="alert alert-block alert-success">

### Analysis of the Results

* **Model A** (balanced behaviour) delivers the **highest accuracy (0.875)** and **F1 score (≈0.86)** while keeping precision at 1.0 and recall at 0.75. It is the most even trade-off when you value overall correctness and a balance between precision and recall.
* **Model B** (conservative) also achieves **perfect precision (1.0)** but sacrifices recall (0.25), F1 (0.40), and overall accuracy (0.625). Because Model A makes no false positives either, precision alone cannot separate the two here; a model like B is preferable only when false positives are extremely costly, missing positives is acceptable, and no alternative offers the same precision with better recall.
* **Model C** (aggressive) reaches **perfect recall (1.0)** and a strong F1 score (0.80) but at the expense of precision (≈0.67) and accuracy (0.75). This is ideal when missing a positive case is much worse than raising some false alarms.

### Key Takeaway

Different classification metrics can point to different "best" models. Always align the metric you optimise with the real-world cost of false positives and false negatives.

</div>
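
One way to make "align the metric with the real-world cost" concrete is to attach a cost to each error type and pick the model with the lowest total. The sketch below reuses the three toy models from Step 33; the cost values and the helpers `total_cost` and `cheapest_model` are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

# Same toy labels and predictions as in Step 33
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
models = {
    "Model A": np.array([1, 0, 1, 0, 1, 0, 0, 0]),
    "Model B": np.array([0, 0, 1, 0, 0, 0, 0, 0]),
    "Model C": np.array([1, 1, 1, 0, 1, 1, 1, 0]),
}

def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Total misclassification cost for assumed per-error costs."""
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed positives
    return int(cost_fp * fp + cost_fn * fn)

def cheapest_model(cost_fp, cost_fn):
    """Return the name of the lowest-cost model and the cost of every model."""
    costs = {name: total_cost(y_true, pred, cost_fp, cost_fn)
             for name, pred in models.items()}
    return min(costs, key=costs.get), costs

# Assumed cost scenarios: equal costs, costly misses, costly false alarms
for cost_fp, cost_fn in [(1, 1), (1, 5), (5, 1)]:
    best, costs = cheapest_model(cost_fp, cost_fn)
    print(f"FP cost={cost_fp}, FN cost={cost_fn}: {costs} -> {best}")
```

With equal costs or costly false alarms Model A is cheapest; when a missed positive costs five times a false alarm, Model C wins.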

<a class="anchor" id="multiclass">

## 3. Multiclass Classification (Extra)

</a>

So far, we've focused on **binary classification** (2 classes: 0 and 1). However, many real-world problems involve **multiclass classification** where there are **3 or more classes**. Examples include:
- Classifying images of animals (cat, dog, bird, fish, etc.)
- Predicting iris species (setosa, versicolor, virginica)
- Recognizing handwritten digits (0-9)
- Categorizing customer feedback (positive, neutral, negative)

When dealing with multiclass problems, we need to adapt our binary metrics (precision, recall, F1) to handle multiple classes. In this section, we'll explore how to compute and interpret performance metrics for multiclass classification problems using a practical example.

To demonstrate these concepts, we'll work with the **Iris dataset** throughout this section, implementing each metric as we learn about it.

__`Step 37`__ Import the Iris dataset and prepare it for multiclass classification.

In [None]:
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

__`Step 38`__ Split the data and train a multiclass logistic regression model.

In [None]:
# Split into train and test sets
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Train a multiclass logistic regression
iris_model = LogisticRegression(max_iter=200, random_state=42)
iris_model.fit(X_train_iris, y_train_iris)

# Make predictions
y_pred_iris = iris_model.predict(X_test_iris)

<a class="anchor" id="multiclass-confusion">

### 3.1. Multiclass Confusion Matrix

</a>

For multiclass problems, the confusion matrix becomes a **K × K matrix** where K is the number of classes. Unlike binary classification where we have a 2×2 matrix, multiclass confusion matrices show all possible combinations of true and predicted classes.

### Structure:

The matrix has **K rows** (actual classes) and **K columns** (predicted classes):

|                  | Predicted Class 0 | Predicted Class 1 | Predicted Class 2 |
|------------------|-------------------|-------------------|-------------------|
| **Actual Class 0** | Correct (TP₀)     | Misclassified     | Misclassified     |
| **Actual Class 1** | Misclassified     | Correct (TP₁)     | Misclassified     |
| **Actual Class 2** | Misclassified     | Misclassified     | Correct (TP₂)     |

- **Diagonal elements**: Correct predictions for each class
- **Off-diagonal elements**: Misclassifications (shows which classes are confused with each other)
- **Row sums**: Total actual instances of each class
- **Column sums**: Total predicted instances of each class
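
To make the structure tangible, here is a minimal NumPy sketch that builds a K × K matrix by hand from made-up labels, using the same convention as above (rows = actual, columns = predicted). The helper `confusion_matrix_np` and the toy arrays are hypothetical; in the notebook itself `confusion_matrix` from scikit-learn does this job.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Count (actual, predicted) pairs: rows = actual class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # add 1 at each (actual, predicted) position
    return cm

# Made-up labels for a 3-class problem
toy_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
toy_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2])

cm_toy = confusion_matrix_np(toy_true, toy_pred, n_classes=3)
print(cm_toy)
print("Row sums (actual counts):      ", cm_toy.sum(axis=1))
print("Column sums (predicted counts):", cm_toy.sum(axis=0))
print("Correct predictions (diagonal):", np.trace(cm_toy))
```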

__`Step 39`__ Create and visualize the multiclass confusion matrix for the Iris predictions.

In [None]:
# Create confusion matrix
cm_iris = confusion_matrix(y_test_iris, y_pred_iris)

# Visualize with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm_iris, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names,
            yticklabels=iris.target_names,
            cbar=True)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)
plt.title('Multiclass Confusion Matrix - Iris Dataset', fontsize=14)
plt.show()

**Advantages of Multiclass Confusion Matrix**:
- Complete picture: Shows all classification errors, not just overall accuracy
- Error patterns: Reveals which classes are confused with each other
- Per-class performance: Can see which classes the model struggles with
- Foundation for metrics: Enables calculation of per-class precision, recall, and F1

**Disadvantages of Multiclass Confusion Matrix**:
- Can be large: With many classes, the matrix becomes difficult to visualize
- No single number: Unlike accuracy, doesn't provide one metric for comparison
- Requires interpretation: Need to analyze the matrix to understand performance

**When should Multiclass Confusion Matrix be used?**
- Always as a first step in multiclass evaluation
- When you need to understand which specific classes are being confused
- When diagnosing model errors for improvement
- When different types of errors have different costs (e.g., medical diagnosis)

<a class="anchor" id="macro">

### 3.2. Macro-Averaged Metrics

</a>

When extending binary metrics (precision, recall, F1) to multiclass problems, we first need to calculate these metrics **for each class individually** using a **One-vs-Rest (OvR)** approach. Then we can average them.

**Macro averaging** calculates the metric **independently for each class**, then takes the **simple (unweighted) mean** across all classes.

### Formula:

For K classes:

$$
\text{Macro Metric} = \frac{1}{K} \sum_{i=1}^{K} \text{Metric}_i
$$

For example:
$$
\text{Macro Precision} = \frac{1}{K} \sum_{i=1}^{K} \text{Precision}_i
$$

### One-vs-Rest (OvR) Approach:

To calculate per-class metrics, we treat each class as a binary problem:
- **Class i** is the positive class
- **All other classes combined** are the negative class

For each class, we can extract from the confusion matrix:
- **TP**: Diagonal element for that class
- **FP**: Sum of the column (excluding diagonal) - wrongly predicted as this class
- **FN**: Sum of the row (excluding diagonal) - actually this class but predicted as others
- **TN**: Everything else
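
The extraction rules above can be sketched directly in NumPy. The confusion matrix below is hypothetical (rows = actual, columns = predicted), and the divisions assume every class is predicted at least once and has at least one true sample, so no denominator is zero.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 8]])

tp = np.diag(cm)                 # diagonal: correctly predicted as class i
fp = cm.sum(axis=0) - tp         # rest of column i: wrongly predicted as class i
fn = cm.sum(axis=1) - tp         # rest of row i: class i predicted as something else
tn = cm.sum() - tp - fp - fn     # everything else

precision_per_class = tp / (tp + fp)
recall_per_class = tp / (tp + fn)

for i in range(cm.shape[0]):
    print(f"Class {i}: TP={tp[i]}, FP={fp[i]}, FN={fn[i]}, TN={tn[i]}, "
          f"precision={precision_per_class[i]:.3f}, recall={recall_per_class[i]:.3f}")

# Macro average: unweighted mean of the per-class scores
print(f"Macro precision: {precision_per_class.mean():.4f}")
print(f"Macro recall:    {recall_per_class.mean():.4f}")
```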

__`Step 40`__ Calculate macro-averaged precision, recall, and F1 score.

In [None]:
# Calculate macro-averaged metrics
macro_precision = precision_score(y_test_iris, y_pred_iris, average='macro')
macro_recall = recall_score(y_test_iris, y_pred_iris, average='macro')
macro_f1 = f1_score(y_test_iris, y_pred_iris, average='macro')

print("Macro-Averaged Metrics:")
print(f"Precision: {macro_precision:.4f}")
print(f"Recall:    {macro_recall:.4f}")
print(f"F1 Score:  {macro_f1:.4f}")

# Also show per-class metrics to understand the macro average
print("\nPer-Class Metrics (used to calculate macro average):")
prec_per_class = precision_score(y_test_iris, y_pred_iris, average=None)
rec_per_class = recall_score(y_test_iris, y_pred_iris, average=None)
f1_per_class = f1_score(y_test_iris, y_pred_iris, average=None)

for i, class_name in enumerate(iris.target_names):
    print(f"{class_name:12} - Precision: {prec_per_class[i]:.4f}, Recall: {rec_per_class[i]:.4f}, F1: {f1_per_class[i]:.4f}")

print(f"\nMacro Precision = ({prec_per_class[0]:.4f} + {prec_per_class[1]:.4f} + {prec_per_class[2]:.4f}) / 3 = {macro_precision:.4f}")

**Advantages of Macro Averaging**:
- Equal treatment: All classes are weighted equally, regardless of their frequency
- Sensitive to minority classes: Poor performance on rare classes is not hidden
- Simple interpretation: Just the average of per-class scores
- Fair comparison: Good when all classes are equally important

**Disadvantages of Macro Averaging**:
- Ignores class imbalance: A class with 10 samples has the same weight as one with 1000 samples
- Can be misleading: High macro score doesn't mean good overall performance if classes are imbalanced
- May not reflect real-world importance: Some classes might be more important than others

**When should Macro Averaging be used?**
- When all classes are **equally important** to your problem
- When you want to ensure the model performs well on **all classes**, including rare ones
- In scenarios where **minority class performance matters** as much as majority class performance
- When evaluating models on **balanced datasets**

**Example Use Cases**:
- Medical diagnosis where missing any disease (even rare ones) is critical
- Quality control where all defect types need equal attention
- Multiclass text classification where all categories are equally important

<a class="anchor" id="weighted">

### 3.3. Weighted-Averaged Metrics

</a>

**Weighted averaging** is similar to macro averaging, but it **weights each class by its frequency** (number of true samples). This accounts for class imbalance by giving more importance to classes with more samples.

### Formula:

$$
\text{Weighted Metric} = \frac{1}{N} \sum_{i=1}^{K} n_i \times \text{Metric}_i
$$

Where:
- $N$ = total number of samples
- $n_i$ = number of true samples in class $i$
- $\text{Metric}_i$ = metric (precision, recall, or F1) for class $i$

This is equivalent to:
$$
\text{Weighted Metric} = \sum_{i=1}^{K} w_i \times \text{Metric}_i
$$

Where $w_i = \frac{n_i}{N}$ is the proportion of samples in class $i$.
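
A small NumPy sketch of the weighted formula, using a hypothetical confusion matrix (rows = actual, columns = predicted). It also illustrates a consequence of the formula for recall specifically: since $\text{Recall}_i = TP_i / n_i$, the $n_i$ cancels and weighted recall reduces to the share of correct predictions, that is, accuracy. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 8]])

support = cm.sum(axis=1)             # n_i: number of true samples per class
weights = support / support.sum()    # w_i = n_i / N
recall_per_class = np.diag(cm) / support

macro_recall_toy = recall_per_class.mean()                  # equal weight per class
weighted_recall_toy = np.sum(weights * recall_per_class)    # weight by class frequency
accuracy_toy = np.trace(cm) / cm.sum()

print(f"Macro recall:    {macro_recall_toy:.4f}")
print(f"Weighted recall: {weighted_recall_toy:.4f}")
print(f"Accuracy:        {accuracy_toy:.4f}")
```

This identity holds for recall only; weighted precision and weighted F1 generally differ from accuracy.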

__`Step 41`__ Calculate weighted-averaged precision, recall, and F1 score.

In [None]:
# Calculate weighted-averaged metrics
weighted_precision = precision_score(y_test_iris, y_pred_iris, average='weighted')
weighted_recall = recall_score(y_test_iris, y_pred_iris, average='weighted')
weighted_f1 = f1_score(y_test_iris, y_pred_iris, average='weighted')

print("Weighted-Averaged Metrics:")
print(f"Precision: {weighted_precision:.4f}")
print(f"Recall:    {weighted_recall:.4f}")
print(f"F1 Score:  {weighted_f1:.4f}")

# Show the calculation
print("\nHow weighted average is calculated:")
class_counts = np.bincount(y_test_iris)
total_samples = len(y_test_iris)

weighted_prec_manual = 0
for i, class_name in enumerate(iris.target_names):
    weight = class_counts[i] / total_samples
    contribution = weight * prec_per_class[i]
    weighted_prec_manual += contribution
    print(f"{class_name:12} - {class_counts[i]} samples ({weight:.3f} weight) × {prec_per_class[i]:.4f} precision = {contribution:.4f}")

print(f"\nWeighted Precision = {weighted_prec_manual:.4f}")

# Compare with macro
print(f"\nComparison:")
print(f"Macro Precision:    {macro_precision:.4f} (equal weight to all classes)")
print(f"Weighted Precision: {weighted_precision:.4f} (weighted by class frequency)")

**Advantages of Weighted Averaging**:
- Accounts for class imbalance: More common classes have more influence on the score
- More representative: Better reflects overall performance when classes have different sizes
- Keeps per-class detail: Still built from per-class scores (like macro), but each class counts in proportion to its support, so rare classes contribute a little rather than equally
- Intuitive for imbalanced data: Matches real-world scenarios where some classes are naturally more common

**Disadvantages of Weighted Averaging**:
- Can hide minority class problems: Poor performance on rare classes has less impact
- Not always aligned with goals: Sometimes rare classes are more important despite low frequency
- Less interpretable: Not as straightforward as macro (simple average) or micro (overall accuracy)

**When should Weighted Averaging be used?**
- When you have **class imbalance** and want to account for it
- When **class frequency reflects importance** (common classes are more important)
- When you want **per-class scores** (as in macro) summarised in a way that reflects how often each class occurs
- In real-world scenarios where class distribution in test data matches production data

**Example Use Cases**:
- Customer churn prediction (most customers don't churn, but both groups matter)
- Fraud detection with natural imbalance (most transactions are legitimate)
- Product categorization where some categories are naturally more common

<a class="anchor" id="micro">

### 3.4. Micro-Averaged Metrics

</a>

**Micro averaging** aggregates the contributions of **all classes** globally by first summing up the individual true positives, false positives, and false negatives across all classes, then calculating the metric.

### Formula:

$$
\text{Micro Precision} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} (TP_i + FP_i)}
$$

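$$
\text{Micro Recall} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} (TP_i + FN_i)}
$$

$$
\text{Micro F1} = \frac{2 \times \text{Micro Precision} \times \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}}
$$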

**Important Property**: In single-label multiclass classification (every sample has exactly one true class and receives exactly one predicted class), **Micro Precision = Micro Recall = Micro F1 = Accuracy**

This happens because:
- The sum of all TP across classes = total correct predictions
- The sum of all (TP + FP) across classes = total predictions = total samples
- The sum of all (TP + FN) across classes = total actual samples = total samples
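
The same argument can be checked by hand on a hypothetical confusion matrix (rows = actual, columns = predicted). Each misclassified sample is counted once as a false positive (for the predicted class) and once as a false negative (for the true class), so the two totals match and all micro scores collapse to accuracy. This is a minimal NumPy sketch with illustrative variable names.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted)
cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 8]])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp    # per-class false positives
fn = cm.sum(axis=1) - tp    # per-class false negatives

# Pool the counts over all classes first, then compute the metric once
micro_precision_toy = tp.sum() / (tp.sum() + fp.sum())
micro_recall_toy = tp.sum() / (tp.sum() + fn.sum())
micro_f1_toy = (2 * micro_precision_toy * micro_recall_toy
                / (micro_precision_toy + micro_recall_toy))
accuracy_toy = tp.sum() / cm.sum()

print(f"Total FP = {fp.sum()}, total FN = {fn.sum()}")
print(f"Micro precision: {micro_precision_toy:.4f}")
print(f"Micro recall:    {micro_recall_toy:.4f}")
print(f"Micro F1:        {micro_f1_toy:.4f}")
print(f"Accuracy:        {accuracy_toy:.4f}")
```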

__`Step 42`__ Calculate micro-averaged metrics and verify they equal accuracy.

In [None]:
# Calculate micro-averaged metrics
micro_precision = precision_score(y_test_iris, y_pred_iris, average='micro')
micro_recall = recall_score(y_test_iris, y_pred_iris, average='micro')
micro_f1 = f1_score(y_test_iris, y_pred_iris, average='micro')

# Calculate accuracy for comparison
accuracy = accuracy_score(y_test_iris, y_pred_iris)

print("Micro-Averaged Metrics:")
print(f"Precision: {micro_precision:.4f}")
print(f"Recall:    {micro_recall:.4f}")
print(f"F1 Score:  {micro_f1:.4f}")
print(f"\nAccuracy:  {accuracy:.4f}")

**Advantages of Micro Averaging**:
- Naturally weighted by class size: Larger classes have more influence
- Equals accuracy: In multiclass, provides the same information as overall accuracy
- Good for overall performance: Reflects the model's performance across all predictions
- Simple interpretation: Just the proportion of correct predictions

**Disadvantages of Micro Averaging**:
- Dominated by majority classes: Performance on minority classes has minimal impact
- Can be misleading with imbalance: Good micro score can hide poor minority class performance
- Same as accuracy: Doesn't provide additional information beyond what accuracy already tells you
- Ignores per-class importance: Treats all predictions equally regardless of class

**When should Micro Averaging be used?**
- When you care about **overall prediction accuracy** across all samples
- When **larger classes are more important** and should dominate the metric
- When you want a metric that **naturally weights by frequency**
- In practice, **use accuracy instead** - it's the same and more interpretable!

**Example Use Cases**:
- General classification where overall accuracy is the primary concern
- When class imbalance reflects real-world importance
- Diagnosing class imbalance effects: comparing the micro score with the macro and weighted scores shows how much the majority classes drive the result
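
To see how far the three averages can drift apart, the sketch below uses a hypothetical, heavily imbalanced confusion matrix (rows = actual, columns = predicted) in which the model almost always predicts the majority class. The numbers are made up for illustration.

```python
import numpy as np

# Hypothetical imbalanced confusion matrix: class 0 holds 90 of 100 samples
cm = np.array([[90, 0, 0],
               [ 4, 1, 0],
               [ 5, 0, 0]])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
support = cm.sum(axis=1)

# Per-class F1 = 2TP / (2TP + FP + FN), set to 0 when the denominator is 0
denom = 2 * tp + fp + fn
f1_per_class = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

macro_f1_toy = f1_per_class.mean()                                   # equal class weights
weighted_f1_toy = np.sum(support / support.sum() * f1_per_class)     # weights = class frequency
micro_f1_toy = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())   # pooled counts

print("Per-class F1:", np.round(f1_per_class, 4))
print(f"Macro F1:    {macro_f1_toy:.4f}")
print(f"Weighted F1: {weighted_f1_toy:.4f}")
print(f"Micro F1:    {micro_f1_toy:.4f}")
```

Here micro F1 is 0.91 while macro F1 is roughly 0.43: the gap signals that the two rare classes are handled poorly even though most predictions are correct.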

## Comparing All Averaging Methods

Now that we've seen all three averaging methods, let's compare them side-by-side to understand when each is most appropriate.

__`Step 43`__ Create a comprehensive comparison table of all averaging methods.

In [None]:
# Create a comparison table (one row per averaging method)
comparison_df = pd.DataFrame({
    'Averaging Method': ['Macro', 'Weighted', 'Micro'],
    'Precision': [macro_precision, weighted_precision, micro_precision],
    'Recall': [macro_recall, weighted_recall, micro_recall],
    'F1 Score': [macro_f1, weighted_f1, micro_f1]
})

print("Comparison of All Averaging Methods:")
print(comparison_df.round(4).to_string(index=False))

# Accuracy is a single number, so it is reported separately instead of as a row of placeholders
print(f"\nAccuracy: {accuracy:.4f}")

__`Step 44`__ Use `classification_report` to see all metrics in one comprehensive view.

In [None]:
# The classification_report provides everything in one place
print(classification_report(y_test_iris, y_pred_iris, target_names=iris.target_names))

### Key Takeaways for Multiclass Classification

<div class="alert alert-block alert-success">

**Summary of Multiclass Metrics:**

1. **Multiclass Confusion Matrix (3.1)**:
   - K × K matrix showing all classification outcomes
   - Diagonal = correct predictions, off-diagonal = confusions
   - Essential first step to understand error patterns

2. **Macro Averaging (3.2)**:
   - Simple average of per-class metrics
   - Treats all classes equally
   - **Use when**: All classes are equally important

3. **Weighted Averaging (3.3)**:
   - Weighted average by class frequency
   - Accounts for class imbalance
   - **Use when**: Class frequency reflects importance

4. **Micro Averaging (3.4)**:
   - Global aggregation of TP, FP, FN
   - Equals accuracy in single-label multiclass
   - **Use when**: Overall accuracy is the priority

**Decision Tree for Choosing Averaging Method:**

```
Do all classes have equal importance?
├─ YES → Use MACRO averaging
└─ NO → Is class frequency aligned with importance?
    ├─ YES → Use WEIGHTED averaging
    └─ NO → Consider MACRO or define custom weights
    
Want overall accuracy?
└─ Use MICRO averaging (or just accuracy)
```

**Practical Workflow:**
1. Start with **confusion matrix** to see where errors occur
2. Check **per-class metrics** (average=None) to identify problem classes
3. Choose averaging method based on your business requirements:
   - Equal class importance → **macro**
   - Natural class imbalance → **weighted**
   - Overall performance → **micro** (accuracy)
4. Use **classification_report** for comprehensive overview

**Remember**: The choice of averaging method can significantly affect which model appears "best"!

</div>
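
The "define custom weights" branch of the decision tree can be sketched as a weighted mean of per-class scores with hand-picked importance weights. Everything below is hypothetical: the per-class F1 values are made up, and the `importance` vector encodes an assumed business priority rather than anything scikit-learn computes for you.

```python
import numpy as np

# Hypothetical per-class F1 scores (e.g. obtained with average=None)
f1_per_class = np.array([0.95, 0.40, 0.10])

# Assumption: the two rare classes matter three times as much as the common one
importance = np.array([1.0, 3.0, 3.0])

macro_f1_toy = f1_per_class.mean()
# np.average normalises the weights, so they do not need to sum to 1
custom_f1_toy = np.average(f1_per_class, weights=importance)

print(f"Macro F1 (equal weights):     {macro_f1_toy:.4f}")
print(f"Custom F1 (importance-based): {custom_f1_toy:.4f}")
```

Because the poorly handled classes carry more weight, the custom score (0.35) falls below the macro score (about 0.48).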

Sources: <br>
https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d <br>
https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9