# Confusion Matrix Basics

In this notebook, you'll learn:
- What a confusion matrix is and why it's important
- How to calculate a confusion matrix manually and with sklearn
- How to visualize confusion matrices
- How to extract True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN)

## What is a Confusion Matrix?

A **confusion matrix** is a table that summarizes the performance of a classification model. It shows the counts of:
- **True Positives (TP)**: Correctly predicted positive cases
- **True Negatives (TN)**: Correctly predicted negative cases
- **False Positives (FP)**: Incorrectly predicted as positive (Type I error)
- **False Negatives (FN)**: Incorrectly predicted as negative (Type II error)

```
                    Predicted
                 Negative  Positive
Actual Negative     TN       FP
       Positive     FN       TP
```
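
To make the four cells concrete, here is a minimal sketch that tallies them by hand for a tiny example. The `actual` and `predicted` lists are made-up illustration values, not rows from the notebook's dataset.

```python
# Hypothetical toy labels (1 = positive, 0 = negative); not from the dataset.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
for a, p in zip(actual, predicted):
    if a == 1 and p == 1:
        counts["TP"] += 1      # positive, predicted positive
    elif a == 0 and p == 0:
        counts["TN"] += 1      # negative, predicted negative
    elif a == 0 and p == 1:
        counts["FP"] += 1      # negative, predicted positive
    else:
        counts["FN"] += 1      # positive, predicted negative

print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}
```

Every sample lands in exactly one cell, so the four counts always sum to the number of samples (8 here).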

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('default')
sns.set_palette('colorblind')

## Load the Data

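We'll use our binary classification dataset with 1,000 samples. This dataset is imbalanced (80% class 0, 20% class 1), which is common in real-world scenarios. The cells below rely on its `true_label` and `predicted_label` columns; Exercise 3 also uses `predicted_probability`.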

In [None]:
# Load the classification data
df = pd.read_csv('../../fixtures/input/classification_data.csv')

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()

In [None]:
# Examine the data distribution
print("Class distribution (true labels):")
print(df['true_label'].value_counts())
print(f"\nClass 0: {(df['true_label'] == 0).sum() / len(df) * 100:.1f}%")
print(f"Class 1: {(df['true_label'] == 1).sum() / len(df) * 100:.1f}%")

# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

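# sort_index() orders the bars by class (0, 1) so they match the tick labels set below;
# value_counts() alone orders by frequency
df['true_label'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['skyblue', 'coral'])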
axes[0].set_title('True Label Distribution')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['Class 0', 'Class 1'], rotation=0)

# sort_index() keeps class 0 first even if class 1 is predicted more often
df['predicted_label'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color=['skyblue', 'coral'])
axes[1].set_title('Predicted Label Distribution')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(['Class 0', 'Class 1'], rotation=0)

plt.tight_layout()
plt.show()

## Calculate Confusion Matrix Manually

Let's calculate the confusion matrix components manually to understand what's happening:

In [None]:
# Extract true and predicted labels
y_true = df['true_label'].values
y_pred = df['predicted_label'].values

# Calculate confusion matrix components manually
tp = ((y_true == 1) & (y_pred == 1)).sum()
tn = ((y_true == 0) & (y_pred == 0)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()
fn = ((y_true == 1) & (y_pred == 0)).sum()

print("Confusion Matrix Components:")
print(f"True Positives (TP):  {tp}")
print(f"True Negatives (TN):  {tn}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"\nTotal: {tp + tn + fp + fn}")

### Interpretation:

- **TP (True Positives)**: Number of class 1 samples correctly classified as class 1
- **TN (True Negatives)**: Number of class 0 samples correctly classified as class 0
- **FP (False Positives)**: Number of class 0 samples incorrectly classified as class 1 (Type I error)
- **FN (False Negatives)**: Number of class 1 samples incorrectly classified as class 0 (Type II error)

**Clinical Example**: In disease detection:
- TP: Sick patients correctly diagnosed as sick
- TN: Healthy patients correctly diagnosed as healthy
- FP: Healthy patients incorrectly diagnosed as sick (false alarm)
- FN: Sick patients incorrectly diagnosed as healthy (missed diagnosis)

## Calculate Confusion Matrix with sklearn

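scikit-learn's `confusion_matrix` function computes the same counts in one call. Rows are the actual classes and columns are the predicted classes, so for labels 0 and 1 the result is laid out as `[[TN, FP], [FN, TP]]`: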

In [None]:
# Calculate confusion matrix using sklearn
# labels=[0, 1] fixes the row/column order and guarantees a 2x2 result
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print("Confusion Matrix:")
print(cm)
print("\nLayout:")
print("[[TN, FP],")
print(" [FN, TP]]")

# Verify that each sklearn cell matches our manual calculation
print("\nVerification:")
print(f"TN from sklearn: {cm[0, 0]} vs manual: {tn}")
print(f"FP from sklearn: {cm[0, 1]} vs manual: {fp}")
print(f"FN from sklearn: {cm[1, 0]} vs manual: {fn}")
print(f"TP from sklearn: {cm[1, 1]} vs manual: {tp}")
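
Indexing each cell works, but for a binary problem the four counts can also be unpacked in one step with `ravel()`. The sketch below uses small made-up arrays (`y_true_demo`, `y_pred_demo`) so that it runs on its own; the names are hypothetical and not part of the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels for illustration only.
y_true_demo = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 0, 1, 0])

# labels=[0, 1] pins the row/column order and guarantees a 2x2 matrix
# even if one class is missing from the inputs.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# Row-major flattening of [[TN, FP], [FN, TP]] yields this order.
tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()
print(f"TN={tn_demo}, FP={fp_demo}, FN={fn_demo}, TP={tp_demo}")
```

The unpacking order (TN, FP, FN, TP) only holds when class 0 is the negative class and comes first in `labels`.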

## Visualize the Confusion Matrix

Let's create several visualizations to understand the confusion matrix better:

In [None]:
# Method 1: Using sklearn's ConfusionMatrixDisplay
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Display with counts
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot(ax=axes[0], cmap='Blues', values_format='d')
axes[0].set_title('Confusion Matrix (Counts)', fontsize=14)

# Display row-normalized values: each row sums to 100%, so the diagonal shows per-class recall
cm_normalized = confusion_matrix(y_true, y_pred, labels=[0, 1], normalize='true')
disp_norm = ConfusionMatrixDisplay(confusion_matrix=cm_normalized, display_labels=['Class 0', 'Class 1'])
disp_norm.plot(ax=axes[1], cmap='Blues', values_format='.2%')
axes[1].set_title('Confusion Matrix (Row-Normalized)', fontsize=14)

plt.tight_layout()
plt.show()

In [None]:
# Method 2: Using seaborn heatmap (more customizable)
plt.figure(figsize=(8, 6))

# Create annotations with the count and each cell's share of ALL samples
# (unlike the row-normalized plot above, these four percentages sum to 100%)
group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
group_counts = [f"{value:d}" for value in cm.flatten()]
group_percentages = [f"{value:.2%}" for value in cm.flatten() / cm.sum()]
labels = [f"{name}\n{count}\n{pct}" for name, count, pct in zip(group_names, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)

sns.heatmap(cm, annot=labels, fmt='', cmap='Blues', square=True, linewidths=2,
            xticklabels=['Predicted 0', 'Predicted 1'],
            yticklabels=['Actual 0', 'Actual 1'],
            cbar_kws={'label': 'Count'})
plt.title('Detailed Confusion Matrix', fontsize=16, pad=20)
plt.ylabel('True Label', fontsize=12)
plt.xlabel('Predicted Label', fontsize=12)
plt.tight_layout()
plt.show()

## Understanding the Results

Let's analyze what the confusion matrix tells us:

In [None]:
# Calculate basic statistics from confusion matrix
total_samples = tp + tn + fp + fn
correct_predictions = tp + tn
incorrect_predictions = fp + fn

print("Overall Statistics:")
print(f"Total samples: {total_samples}")
print(f"Correct predictions: {correct_predictions} ({correct_predictions/total_samples*100:.2f}%)")
print(f"Incorrect predictions: {incorrect_predictions} ({incorrect_predictions/total_samples*100:.2f}%)")
print()

# Performance by class
class_0_correct = tn
class_0_total = tn + fp
class_1_correct = tp
class_1_total = tp + fn

print("Per-Class Performance:")
print(f"Class 0 - Correct: {class_0_correct}/{class_0_total} ({class_0_correct/class_0_total*100:.2f}%)")
print(f"Class 1 - Correct: {class_1_correct}/{class_1_total} ({class_1_correct/class_1_total*100:.2f}%)")
print()

# Error analysis
print("Error Analysis:")
print(f"False Positive Rate: {fp}/{tn+fp} = {fp/(tn+fp)*100:.2f}%")
print(f"False Negative Rate: {fn}/{tp+fn} = {fn/(tp+fn)*100:.2f}%")
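
The per-class percentages and error rates above all come from the same four counts, so they can be wrapped in one small helper. This is a minimal sketch: the function name `error_rates` and the example counts are hypothetical, chosen only to mirror an 80/20 split of 1,000 samples.

```python
import math

def error_rates(tp, tn, fp, fn):
    """Return per-class rates computed from the four confusion-matrix counts."""
    actual_pos = tp + fn   # all samples whose true label is 1
    actual_neg = tn + fp   # all samples whose true label is 0
    nan = float("nan")     # returned when a class has no samples
    return {
        "true_positive_rate": tp / actual_pos if actual_pos else nan,
        "false_negative_rate": fn / actual_pos if actual_pos else nan,
        "true_negative_rate": tn / actual_neg if actual_neg else nan,
        "false_positive_rate": fp / actual_neg if actual_neg else nan,
    }

# Hypothetical counts, not the notebook's actual results.
rates = error_rates(tp=150, tn=760, fp=40, fn=50)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Within each actual class the two rates are complements: the true positive and false negative rates sum to 1, as do the true negative and false positive rates.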

## Key Insights from Imbalanced Data

Notice how the confusion matrix reveals important information that overall accuracy might hide:

In [None]:
# Calculate overall accuracy
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Overall Accuracy: {accuracy*100:.2f}%")
print()

# What if we always predicted class 0?
naive_accuracy = (df['true_label'] == 0).sum() / len(df)
print(f"Naive baseline (always predict class 0): {naive_accuracy*100:.2f}%")
print()

print("Key Observation:")
print(f"Our model achieves {accuracy*100:.2f}% accuracy")
print(f"A naive model (always predict 0) would get {naive_accuracy*100:.2f}%")
print(f"The difference is {(accuracy - naive_accuracy)*100:.2f} percentage points")
print()
print("This is why we need more informative metrics like precision, recall, and F1-score!")
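
To see why the naive baseline is misleading, the sketch below builds a synthetic 80/20 label array and scores a classifier that always predicts class 0. The arrays are generated for illustration and only mirror the imbalance described earlier; they are not the notebook's data.

```python
import numpy as np

# Hypothetical labels: 800 negatives and 200 positives.
y_true_toy = np.array([0] * 800 + [1] * 200)
y_pred_naive = np.zeros_like(y_true_toy)   # always predict class 0

accuracy_naive = (y_pred_naive == y_true_toy).mean()
tp_naive = ((y_true_toy == 1) & (y_pred_naive == 1)).sum()
fn_naive = ((y_true_toy == 1) & (y_pred_naive == 0)).sum()

print(f"Accuracy: {accuracy_naive:.2f}")                         # Accuracy: 0.80
print(f"Positives found: {tp_naive} of {tp_naive + fn_naive}")   # Positives found: 0 of 200
```

The accuracy looks respectable, yet the confusion matrix shows that every positive case is missed. That failure is exactly what a single accuracy figure hides.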

## Exercise 1: Calculate Confusion Matrix for a Subset

Select the first 100 samples and calculate the confusion matrix:

In [None]:
# YOUR CODE HERE
# 1. Extract first 100 samples
# 2. Calculate confusion matrix
# 3. Extract TP, TN, FP, FN
# 4. Print the results



## Exercise 2: Visualize Confusion Matrix

Create a custom heatmap visualization for the confusion matrix:

In [None]:
# YOUR CODE HERE
# Create a seaborn heatmap with custom colors and annotations
# Include percentages in the annotations



## Exercise 3: Compare Multiple Thresholds

Create confusion matrices for different probability thresholds:

In [None]:
# YOUR CODE HERE
# 1. Try thresholds: 0.3, 0.5, 0.7
# 2. For each threshold, create new predictions from predicted_probability
# 3. Calculate and visualize confusion matrix for each
# 4. Compare the results



## Summary

In this notebook, you learned:

1. **What a confusion matrix is**: A table showing TP, TN, FP, FN counts
2. **How to calculate it**: Both manually and using sklearn
3. **How to visualize it**: Using sklearn's built-in tools and seaborn
4. **Why it's important**: It reveals model performance on each class, especially important for imbalanced datasets

### Key Takeaways:

- The confusion matrix is the foundation for threshold-based classification metrics such as accuracy, precision, recall, and F1-score
- On imbalanced datasets, overall accuracy can be misleading
- Different types of errors (FP vs FN) may have different costs in real applications
- Visualizing the confusion matrix helps quickly identify model weaknesses

### Next Steps:

In the next notebook, we'll learn how to derive important metrics from the confusion matrix:
- Precision: How many predicted positives are actually positive?
- Recall: How many actual positives did we catch?
- F1-score: The harmonic mean of precision and recall
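
As a brief preview, all three metrics can be written in terms of the counts extracted in this notebook. This is a minimal sketch with hypothetical counts; the helper name `precision_recall_f1` is an illustration, not a library function.

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Each metric is reported as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts, not the notebook's actual results.
precision, recall, f1 = precision_recall_f1(tp=150, fp=40, fn=50)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
```

None of the three uses TN, which is why they stay informative when the negative class dominates.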