# Classification Metrics

In this notebook, we will explore various classification metrics using simple examples. We will cover the following metrics:

1. Accuracy
2. Confusion Matrix
3. Precision, Recall, F1-score
4. Macro vs. Micro F1-score

Let's begin by importing the necessary libraries.

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    classification_report
)

import ipywidgets as widgets
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline

We can start with some definitions:

- **True Positives (TP)**: Instances where the model correctly predicts the positive class. That is, cases where both the actual class and the predicted class are positive.

- **True Negatives (TN)**: Instances where the model correctly predicts the negative class. That is, cases where both the actual class and the predicted class are negative.

Similarly, we have:

- **False Positives (FP)**: Instances where the model incorrectly predicts the positive class. These are cases where the actual class is negative, but the model predicts it as positive.

- **False Negatives (FN)**: Instances where the model incorrectly predicts the negative class. These are cases where the actual class is positive, but the model predicts it as negative.

These concepts are crucial because they help us understand not just how often the model is correct, but the types of errors it makes.
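The four counts can be obtained directly from boolean masks. The following is a minimal sketch: the arrays `actual` and `predicted` are made-up illustration data, separate from the examples used later in this notebook, and 1 is assumed to denote the positive class.

```python
import numpy as np

# Hypothetical labels for illustration only (1 = positive, 0 = negative)
actual = np.array([1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 1, 0])

tp = int(np.sum((actual == 1) & (predicted == 1)))  # actual positive, predicted positive
tn = int(np.sum((actual == 0) & (predicted == 0)))  # actual negative, predicted negative
fp = int(np.sum((actual == 0) & (predicted == 1)))  # actual negative, predicted positive
fn = int(np.sum((actual == 1) & (predicted == 0)))  # actual positive, predicted negative

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Every sample falls into exactly one of the four groups, so the counts always sum to the number of samples.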

## 1. Accuracy

**Accuracy** is one of the simplest and most intuitive evaluation metrics for classification models. It is defined as the proportion of correct predictions made by the model out of all predictions made.

$$
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Let's consider a simple binary classification example.

In [None]:
# True labels
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])

# Predicted labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Calculate accuracy manually
correct_predictions = np.sum(y_true == y_pred)
total_predictions = len(y_true)
accuracy = correct_predictions / total_predictions
print(f"Accuracy (manual calculation): {accuracy:.2f}")

# Calculate accuracy using sklearn
accuracy_sk = accuracy_score(y_true, y_pred)
print(f"Accuracy (sklearn): {accuracy_sk:.2f}")
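Because the formula counts true negatives and true positives alike, accuracy can look strong even when a model never finds the positive class. A minimal sketch, assuming a made-up dataset with 95 negatives and 5 positives and a hypothetical "model" that always predicts 0:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 negatives followed by 5 positives
y_true_imb = np.array([0] * 95 + [1] * 5)

# A stand-in "model" that always predicts the majority class
y_pred_imb = np.zeros_like(y_true_imb)

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
positives_found = int(np.sum((y_true_imb == 1) & (y_pred_imb == 1)))

print(f"Accuracy: {acc_imb:.2f}")
print(f"Positives found: {positives_found} of {int(np.sum(y_true_imb == 1))}")
```

The accuracy here comes entirely from true negatives, which is why the metrics in the following sections look at each error type separately.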

## 2. Confusion Matrix

The **Confusion Matrix** is a table used to describe the performance of a classification model on a set of test data for which the true values are known. It allows easy identification of confusion between classes.

### Confusion Matrix for Binary Classification

The confusion matrix for binary classification is a 2x2 matrix:

|               | Predicted Positive | Predicted Negative |
|---------------|--------------------|--------------------|
| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |


Continuing with the previous example:

In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)

# Display confusion matrix in a DataFrame
cm_df = pd.DataFrame(cm,
                     index=['Actual Negative', 'Actual Positive'],
                     columns=['Predicted Negative', 'Predicted Positive'])
print(cm_df)
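Note that scikit-learn orders the rows and columns of the matrix by sorted label value, so for labels 0 and 1 the output is laid out as `[[TN, FP], [FN, TP]]`, the reverse of the table above. The sketch below shows two ways of dealing with this: unpacking the four counts with `ravel()`, and passing `labels=[1, 0]` to get the positive-first layout. The arrays repeat the example from the accuracy section so the block runs on its own.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Default ordering (labels sorted ascending): [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Positive class first, matching the table layout: [[TP, FN], [FP, TN]]
cm_pos_first = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_pos_first)
```

In this particular example both layouts print the same numbers, because TP happens to equal TN and FP happens to equal FN. With other data the two layouts would differ.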

## 3. Precision, Recall, F1-score

These are more detailed metrics that consider the types of errors the model makes.

### Precision

**Precision** is the ratio of correctly predicted positive observations to the total predicted positive observations.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

### Recall

**Recall** (also known as Sensitivity or True Positive Rate) is the ratio of correctly predicted positive observations to all actual positives.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

### F1-score

**F1-score** is the harmonic mean of Precision and Recall. It is high only when both are high.

$$
\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$
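The formula above penalizes a gap between precision and recall more heavily than a plain arithmetic mean would. A small sketch with made-up values (precision 1.0, recall 0.1) illustrates this. `f1_from` is a hypothetical helper name, not a scikit-learn function.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 from precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up values: perfect precision, very low recall
p, r = 1.0, 0.1

arithmetic_mean = (p + r) / 2
f1_value = f1_from(p, r)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1-score:        {f1_value:.2f}")
```

The F1-score stays close to the smaller of the two inputs, so a model cannot score well by maximizing only one of them.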


In [None]:
# Calculate precision
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.2f}")

# Calculate recall
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.2f}")

# Calculate F1-score
f1 = f1_score(y_true, y_pred)
print(f"F1-score: {f1:.2f}")
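As a sanity check, the three formulas can be evaluated by hand from the confusion-matrix counts and compared with scikit-learn's results. This sketch repeats the example arrays so that it is self-contained.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Unpack the 2x2 matrix; scikit-learn's default layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"Precision: manual={precision_manual:.2f}, sklearn={precision_score(y_true, y_pred):.2f}")
print(f"Recall:    manual={recall_manual:.2f}, sklearn={recall_score(y_true, y_pred):.2f}")
print(f"F1-score:  manual={f1_manual:.2f}, sklearn={f1_score(y_true, y_pred):.2f}")
```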

## 4. Macro vs. Micro F1-score

In multiclass classification, the F1-score can be averaged in different ways. The two most common methods are **macro** and **micro** averaging.

### Macro Averaging

**Macro F1-score** calculates the F1-score independently for each class and then takes the average.

$$
\text{Macro F1-score} = \frac{1}{C} \sum_{i=1}^{C} \text{F1-score}_i
$$

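Where $C$ is the number of classes. Because every class contributes equally regardless of its size, the macro F1-score is often preferred for imbalanced datasets when performance on minority classes matters.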
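To see how macro averaging reacts to neglected minority classes, consider a hypothetical model that always predicts the majority class. The data below is made up for illustration (90 samples of class 0 and 5 each of classes 1 and 2), and accuracy is shown alongside for comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced 3-class labels
y_true_3c = np.array([0] * 90 + [1] * 5 + [2] * 5)

# A stand-in "model" that always predicts class 0
y_pred_3c = np.zeros_like(y_true_3c)

acc_3c = accuracy_score(y_true_3c, y_pred_3c)

# average=None returns one F1-score per class instead of a single number
per_class_f1_3c = f1_score(y_true_3c, y_pred_3c, average=None, zero_division=0)
macro_f1_3c = f1_score(y_true_3c, y_pred_3c, average='macro', zero_division=0)

print(f"Accuracy:     {acc_3c:.2f}")
print(f"Per-class F1: {np.round(per_class_f1_3c, 2)}")
print(f"Macro F1:     {macro_f1_3c:.2f}")
```

The two classes that are never predicted each contribute an F1-score of 0 to the average, so the macro score drops sharply even though most labels are correct.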

### Micro Averaging

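**Micro F1-score** calculates metrics globally by counting the total true positives, false negatives, and false positives across all classes. In single-label multiclass problems, micro-averaged precision, recall, and F1-score are all equal to accuracy.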

$$
\text{Micro F1-score} = \frac{2 \times \text{Precision}_{\text{micro}} \times \text{Recall}_{\text{micro}}}{\text{Precision}_{\text{micro}} + \text{Recall}_{\text{micro}}}
$$

Let's create a simple multiclass example.

In [None]:
# True labels
y_true_multi = np.array([0, 1, 2, 0, 1, 2])

# Predicted labels
y_pred_multi = np.array([0, 2, 1, 0, 0, 1])

# Print classification report
print("Classification Report:")
print(classification_report(y_true_multi, y_pred_multi))
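When the numbers are needed programmatically rather than as printed text, `classification_report` can also return a dictionary via `output_dict=True`. A brief sketch, repeating the example arrays so that it runs standalone:

```python
import numpy as np
from sklearn.metrics import classification_report

y_true_multi = np.array([0, 1, 2, 0, 1, 2])
y_pred_multi = np.array([0, 2, 1, 0, 0, 1])

report = classification_report(
    y_true_multi, y_pred_multi, output_dict=True, zero_division=0
)

# Per-class entries are keyed by the label converted to a string
print(f"Class 0 F1: {report['0']['f1-score']:.2f}")
print(f"Macro F1:   {report['macro avg']['f1-score']:.2f}")
print(f"Accuracy:   {report['accuracy']:.2f}")
```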

In [None]:
# Macro F1-score
f1_macro = f1_score(y_true_multi, y_pred_multi, average='macro')
print(f"Macro F1-score: {f1_macro:.2f}")

# Micro F1-score
f1_micro = f1_score(y_true_multi, y_pred_multi, average='micro')
print(f"Micro F1-score: {f1_micro:.2f}")

- **Macro F1-score** treats all classes equally by averaging the F1-scores of each class.
- **Micro F1-score** gives equal weight to each instance by considering the total true positives, false negatives, and false positives.

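In this example, the micro F1-score (2 correct predictions out of 6, about 0.33) is higher than the macro F1-score (about 0.27). All correct predictions fall in class 0, whose F1-score is 0.80, while classes 1 and 2 each score 0. The macro average gives those two zeros the same weight as class 0, which pulls it below the micro score.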
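The numbers behind this comparison can be reproduced step by step. The sketch below repeats the example arrays so that it runs on its own. It computes the per-class F1-scores with `average=None`, then pools TP, FP, and FN over all classes to obtain the micro-averaged score.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true_multi = np.array([0, 1, 2, 0, 1, 2])
y_pred_multi = np.array([0, 2, 1, 0, 0, 1])

# Macro: one F1-score per class, then an unweighted mean
per_class_f1 = f1_score(y_true_multi, y_pred_multi, average=None, zero_division=0)
macro_manual = per_class_f1.mean()

# Micro: pool the counts over all classes first (rows = actual, columns = predicted)
cm_multi = confusion_matrix(y_true_multi, y_pred_multi)
tp = np.diag(cm_multi)
fp = cm_multi.sum(axis=0) - tp  # predicted as class c, but actually another class
fn = cm_multi.sum(axis=1) - tp  # actually class c, but predicted as another class

micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())
micro_manual = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(f"Per-class F1: {np.round(per_class_f1, 2)}")
print(f"Macro F1 (manual): {macro_manual:.2f}")
print(f"Micro F1 (manual): {micro_manual:.2f}")
```

Each misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the actual class), so the pooled FP and FN totals are equal here.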

## 5. Interactive Multi-Class Classification Example

In this section, we'll create an interactive example to explore how micro and macro F1-scores change in a multi-class classification scenario. 

First, let's set up the interactive widgets.

In [None]:
# Define the number of classes
num_classes = 3
classes = [0, 1, 2]

# Create sliders for each cell in the confusion matrix
sliders = {}
for i in classes:
    for j in classes:
        sliders[f"C{i}{j}"] = widgets.IntSlider(value=0, min=0, max=20, description=f"C{i}{j}")

# Arrange the sliders in a grid
conf_matrix_ui = widgets.GridBox(
    children=[sliders[f"C{i}{j}"] for i in classes for j in classes],
    layout=widgets.Layout(
        width='100%',
        grid_template_columns='repeat(3, 200px)',
        grid_template_rows='repeat(3, 60px)',
        grid_gap='17px'
    )
)

# Initialize confusion matrix with some values
sliders["C00"].value = 5  # True class 0 predicted as class 0
sliders["C11"].value = 4  # True class 1 predicted as class 1
sliders["C22"].value = 6  # True class 2 predicted as class 2
sliders["C01"].value = 2  # True class 0 predicted as class 1
sliders["C12"].value = 3  # True class 1 predicted as class 2
sliders["C20"].value = 1  # True class 2 predicted as class 0

Now, let's define a function that will update the metrics based on the confusion matrix values.

In [None]:
def update_metrics(**kwargs):
    # Build the confusion matrix from the sliders
    cm = np.array([
        [kwargs["C00"], kwargs["C01"], kwargs["C02"]],
        [kwargs["C10"], kwargs["C11"], kwargs["C12"]],
        [kwargs["C20"], kwargs["C21"], kwargs["C22"]],
    ])
    
    # Display the confusion matrix
    cm_df = pd.DataFrame(cm, index=[f"Actual {i}" for i in classes], columns=[f"Predicted {i}" for i in classes])
    print("Confusion Matrix:")
    display(cm_df)
    
    # Flatten the confusion matrix to get true labels and predicted labels
    y_true = []
    y_pred = []
    for i in classes:
        for j in classes:
            count = cm[i][j]
            y_true.extend([i]*count)
            y_pred.extend([j]*count)
    
    if len(y_true) == 0:
        print("No samples to evaluate.")
        return
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision_macro = precision_score(y_true, y_pred, average='macro', zero_division=0)
    recall_macro = recall_score(y_true, y_pred, average='macro', zero_division=0)
    f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
    
    precision_micro = precision_score(y_true, y_pred, average='micro', zero_division=0)
    recall_micro = recall_score(y_true, y_pred, average='micro', zero_division=0)
    f1_micro = f1_score(y_true, y_pred, average='micro', zero_division=0)
    
    # Display metrics
    print(f"Accuracy: {accuracy:.2f}")
    print("\nMacro-Averaged Metrics:")
    print(f"Precision (Macro): {precision_macro:.2f}")
    print(f"Recall (Macro): {recall_macro:.2f}")
    print(f"F1-score (Macro): {f1_macro:.2f}")
    
    print("\nMicro-Averaged Metrics:")
    print(f"Precision (Micro): {precision_micro:.2f}")
    print(f"Recall (Micro): {recall_micro:.2f}")
    print(f"F1-score (Micro): {f1_micro:.2f}")
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_true, y_pred, zero_division=0))

Now, we can set up the interactive display.

In [None]:
out = widgets.interactive_output(update_metrics, sliders)
display(conf_matrix_ui, out)

- Each slider represents the count of instances where a true class is predicted as a certain class.
- For example, `C01` represents the number of instances where the true class is 0 but predicted as 1.
- Adjust the sliders to change the confusion matrix and observe how the macro and micro F1-scores change.

- **Macro-Averaged Metrics**: Calculate the metric independently for each class and then take the average (treat all classes equally).
- **Micro-Averaged Metrics**: Aggregate the contributions of all classes to compute the average metric (treat all instances equally).
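Expanding the matrix back into label lists, as `update_metrics` does, is convenient for reusing scikit-learn's functions, but the same metrics can also be read directly off the matrix. The sketch below assumes that rows are actual classes and columns are predicted classes, and uses the initial slider values from above. `metrics_from_cm` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def metrics_from_cm(cm) -> dict:
    """Accuracy, macro F1 and micro F1 from a square confusion matrix
    (rows = actual class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # column total minus the diagonal
    fn = cm.sum(axis=1) - tp  # row total minus the diagonal

    # Per-class F1 = 2TP / (2TP + FP + FN); set to 0 where the denominator is 0
    denom = 2 * tp + fp + fn
    per_class_f1 = np.divide(2 * tp, denom, out=np.zeros_like(denom), where=denom > 0)

    total = cm.sum()
    accuracy = tp.sum() / total if total > 0 else 0.0
    micro_denom = 2 * tp.sum() + fp.sum() + fn.sum()
    micro_f1 = 2 * tp.sum() / micro_denom if micro_denom > 0 else 0.0

    return {
        "accuracy": float(accuracy),
        "macro_f1": float(per_class_f1.mean()),
        "micro_f1": float(micro_f1),
    }

# Initial slider values: C00=5, C01=2, C11=4, C12=3, C20=1, C22=6
cm_init = np.array([[5, 2, 0],
                    [0, 4, 3],
                    [1, 0, 6]])
result = metrics_from_cm(cm_init)
print({name: round(value, 2) for name, value in result.items()})
```

One design difference to keep in mind: a class with no actual and no predicted samples counts as an F1-score of 0 in this helper, whereas scikit-learn, given only the expanded label lists, would not see that class at all and would average over the remaining classes.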



## Conclusion

In this notebook, we explored various classification metrics using simple examples. Understanding these metrics is crucial for evaluating and improving classification models.

- **Accuracy** is simple but can be misleading in imbalanced datasets.
- **Confusion Matrix** provides detailed insight into the types of errors made.
- **Precision, Recall, F1-score** separate the two kinds of error: precision penalizes false positives, recall penalizes false negatives, and the F1-score combines both into a single number.
- **Macro and Micro F1-scores** matter in multiclass classification and with imbalanced datasets, because they weight classes and instances differently.
- **True/False Positives and Negatives** are the building blocks of all of these metrics, separating correct predictions from each type of error.
- **Interactive Multi-Class Example** allows us to visualize how changes in the confusion matrix affect macro and micro-averaged metrics.