# Task 3: Evaluating Models on Imbalanced Data

In this task, you'll practice evaluating classification models on imbalanced datasets and understanding why certain metrics are more appropriate than others.

## Instructions

Complete the code cells below to properly evaluate models on imbalanced data.

**Key Concepts:**
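- On imbalanced data, accuracy can be misleading: a model that always predicts the majority class scores about 80% here without finding a single positive
- Precision, recall, and F1-score describe minority-class performance directly, so they are more informative
- PR-AUC is usually more informative than ROC-AUC when positives are rare, because precision reacts to every false positive while the false positive rate is diluted by the large number of true negatives
- Always compare against simple baselines (majority class, minority class, random)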
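
As a reminder of why these metrics behave differently, here are the standard definitions in terms of confusion-matrix counts (TP, FP, FN, TN), followed by a worked case. The 100-sample split is an illustrative assumption that matches the 80/20 ratio, not the actual dataset size.

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN}
$$

For a predictor that always outputs class 0 on 80 negatives and 20 positives, $TP = 0$, $FP = 0$, $FN = 20$, $TN = 80$:

$$
\text{Accuracy} = \frac{0 + 80}{100} = 0.80, \qquad
\text{Recall} = \frac{0}{0 + 20} = 0, \qquad
F_1 = \frac{0}{0 + 0 + 20} = 0
$$

Precision is $0/0$ in this case, which is why the scikit-learn metric functions accept a `zero_division` argument.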

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    confusion_matrix, classification_report,
    roc_curve, precision_recall_curve
)

## Load Data

In [None]:
# Load the imbalanced classification data (80% class 0, 20% class 1)
df = pd.read_csv('../../fixtures/input/classification_data.csv')

y_true = df['true_label'].values
y_pred = df['predicted_label'].values
y_prob = df['predicted_probability'].values

print(f"Dataset size: {len(y_true)}")
print(f"Class distribution: {np.bincount(y_true)}")
print(f"Class 0: {(y_true == 0).sum() / len(y_true) * 100:.1f}%")
print(f"Class 1: {(y_true == 1).sum() / len(y_true) * 100:.1f}%")

## Task 3.1: Calculate Baseline Metrics

Create three baseline predictors:
1. Always predict class 0 (majority class)
2. Always predict class 1 (minority class)
3. Random predictions (based on class distribution)

Calculate accuracy, precision, recall, and F1 for each.
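
The three baselines can be sketched on a small synthetic label vector, as below. This is only an illustration under stated assumptions: `toy_true`, `toy_baselines`, `summarize`, and the seed are hypothetical names and values, not part of the task data, and the random baseline here draws each prediction independently with the observed class frequencies.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility

# Hypothetical toy labels: 800 negatives, 200 positives (not the task dataset)
toy_true = np.array([0] * 800 + [1] * 200)
pos_rate = toy_true.mean()

toy_baselines = {
    "all_0": np.zeros_like(toy_true),
    "all_1": np.ones_like(toy_true),
    # Each prediction is 1 with probability equal to the positive rate
    "random": rng.choice([0, 1], size=len(toy_true), p=[1 - pos_rate, pos_rate]),
}


def summarize(y_true, y_pred):
    """Return the four headline metrics; undefined ratios are reported as 0."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# One row per baseline, one column per metric
toy_baseline_df = pd.DataFrame.from_dict(
    {name: summarize(toy_true, pred) for name, pred in toy_baselines.items()},
    orient="index",
)
print(toy_baseline_df.round(3))
```

The all-0 row should show high accuracy with zero recall, while the all-1 row shows perfect recall with precision equal to the positive rate.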

In [None]:
# YOUR CODE HERE
# Create three baseline predictors
y_pred_all_0 = None
y_pred_all_1 = None
y_pred_random = None  # Random based on 80/20 distribution

# Calculate metrics for each baseline
# Store results in a dictionary or DataFrame
baseline_results = None

# TEST - Do not modify
assert y_pred_all_0 is not None, "Create all-0 baseline"
assert y_pred_all_1 is not None, "Create all-1 baseline"
assert y_pred_random is not None, "Create random baseline"

assert all(y_pred_all_0 == 0), "All-0 baseline should predict all 0s"
assert all(y_pred_all_1 == 1), "All-1 baseline should predict all 1s"
assert len(np.unique(y_pred_random)) > 1, "Random baseline should have both classes"

assert baseline_results is not None, "Calculate baseline metrics"

# Check that all-0 baseline has high accuracy but 0 recall
all_0_accuracy = accuracy_score(y_true, y_pred_all_0)
all_0_recall = recall_score(y_true, y_pred_all_0, zero_division=0)
assert all_0_accuracy > 0.7, "All-0 should have high accuracy on imbalanced data"
assert all_0_recall == 0, "All-0 should have 0 recall"

print("✓ Baseline results calculated")
print(baseline_results)

## Task 3.2: Compare Model with Baselines

Calculate metrics for your actual model and compare with baselines.

In [None]:
# YOUR CODE HERE
# Calculate model metrics
model_accuracy = None
model_precision = None
model_recall = None
model_f1 = None

# Create comparison DataFrame
comparison_df = None

# TEST - Do not modify
assert model_accuracy is not None, "Calculate model accuracy"
assert model_precision is not None, "Calculate model precision"
assert model_recall is not None, "Calculate model recall"
assert model_f1 is not None, "Calculate model F1"

assert comparison_df is not None, "Create comparison DataFrame"
assert isinstance(comparison_df, pd.DataFrame), "Should be a DataFrame"

# Sanity check: the model should at least achieve a positive F1-score
assert model_f1 > 0, "Model should have positive F1-score"

print("✓ Model vs Baselines:")
print(comparison_df)

## Task 3.3: Why Accuracy is Misleading

Calculate how much your model improves on the naive all-0 baseline in accuracy (model accuracy minus baseline accuracy).
Then calculate the same difference for F1.
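
A small worked example may help make the contrast concrete. The sketch below uses hand-built toy predictions (74 TN, 6 FP, 8 FN, 12 TP), and it assumes "improvement" means the absolute difference between the model's metric and the all-0 baseline's metric. All names are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: 80 negatives followed by 20 positives
toy_true = np.array([0] * 80 + [1] * 20)

# Naive baseline: always predict the majority class
toy_pred_majority = np.zeros_like(toy_true)

# Hand-built "model" predictions: 74 TN, 6 FP, then 8 FN, 12 TP
toy_pred_model = np.array([0] * 74 + [1] * 6 + [0] * 8 + [1] * 12)

# Improvement taken as (model metric) - (baseline metric)
acc_gain = (accuracy_score(toy_true, toy_pred_model)
            - accuracy_score(toy_true, toy_pred_majority))
f1_gain = (f1_score(toy_true, toy_pred_model)
           - f1_score(toy_true, toy_pred_majority, zero_division=0))

print(f"accuracy gain: {acc_gain:.4f}")
print(f"F1 gain:       {f1_gain:.4f}")
```

Here accuracy moves from 0.80 to 0.86, while F1 moves from 0 to 24/38 (about 0.63), so the same model looks like a marginal or a substantial improvement depending on the metric.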

In [None]:
# YOUR CODE HERE
accuracy_improvement = None
f1_improvement = None

# TEST - Do not modify
assert accuracy_improvement is not None, "Calculate accuracy improvement"
assert f1_improvement is not None, "Calculate F1 improvement"

# F1 improvement should be larger than accuracy improvement
assert f1_improvement > accuracy_improvement, \
    "F1 improvement should be larger than accuracy improvement"

print(f"✓ Accuracy improvement: {accuracy_improvement:.4f}")
print(f"✓ F1 improvement: {f1_improvement:.4f}")
print(f"\nInsight: On imbalanced data, accuracy improvement ({accuracy_improvement:.4f}) ")
print(f"is much smaller than F1 improvement ({f1_improvement:.4f}).")
print("This shows why F1 is more informative for imbalanced datasets.")

## Task 3.4: ROC-AUC vs PR-AUC

Calculate both ROC-AUC and PR-AUC for:
1. Your model
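2. A poor model (degrade the probabilities, e.g. by blending them with random noise; multiplying them by a constant such as 0.5 preserves their ranking, so it changes neither ROC-AUC nor PR-AUC)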

Compare how much each metric changes.
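
Both ROC-AUC and average precision depend only on how the scores rank the samples. The sketch below illustrates this on synthetic data: rescaling leaves both metrics unchanged, while mixing in noise lowers them. The data generator, the 0.3/0.7 blend weights, and all names are assumptions chosen for illustration, and `average_precision_score` is used as the PR-AUC estimate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)  # arbitrary fixed seed
n = 2000

# Hypothetical synthetic data: about 20% positives, positives score higher on average
toy_true = (rng.random(n) < 0.2).astype(int)
toy_scores = np.clip(0.3 * toy_true + rng.normal(0.35, 0.15, size=n), 0, 1)

# Monotonic rescaling: the ranking of samples is unchanged
toy_scores_scaled = 0.5 * toy_scores

# Noise blending: the ranking is partly scrambled
toy_scores_noisy = 0.3 * toy_scores + 0.7 * rng.random(n)


def auc_pair(y_true, scores):
    """Return (ROC-AUC, PR-AUC estimated by average precision)."""
    return roc_auc_score(y_true, scores), average_precision_score(y_true, scores)


base_roc, base_pr = auc_pair(toy_true, toy_scores)
scaled_roc, scaled_pr = auc_pair(toy_true, toy_scores_scaled)
noisy_roc, noisy_pr = auc_pair(toy_true, toy_scores_noisy)

print(f"original: ROC-AUC={base_roc:.4f}  PR-AUC={base_pr:.4f}")
print(f"scaled:   ROC-AUC={scaled_roc:.4f}  PR-AUC={scaled_pr:.4f}")
print(f"noisy:    ROC-AUC={noisy_roc:.4f}  PR-AUC={noisy_pr:.4f}")
```

Comparing `base_roc - noisy_roc` with `base_pr - noisy_pr` shows how differently the two metrics respond to the same degradation on this particular sample.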

In [None]:
# YOUR CODE HERE
# Calculate ROC-AUC and PR-AUC for the model
model_roc_auc = None
model_pr_auc = None

# Create a poor model by degrading the probabilities
# (e.g. blend in random noise; rescaling alone does not change either AUC)
y_prob_poor = None

# Calculate metrics for poor model
poor_roc_auc = None
poor_pr_auc = None

# Calculate relative drops
roc_drop = None
pr_drop = None

# TEST - Do not modify
assert model_roc_auc is not None, "Calculate model ROC-AUC"
assert model_pr_auc is not None, "Calculate model PR-AUC"
assert y_prob_poor is not None, "Create poor model probabilities"
assert poor_roc_auc is not None, "Calculate poor ROC-AUC"
assert poor_pr_auc is not None, "Calculate poor PR-AUC"

assert 0 <= model_roc_auc <= 1, "ROC-AUC should be between 0 and 1"
assert 0 <= model_pr_auc <= 1, "PR-AUC should be between 0 and 1"

# ROC-AUC should stay relatively high even for poor model
# PR-AUC should drop more significantly
assert roc_drop is not None and pr_drop is not None, "Calculate drops"
assert pr_drop > roc_drop, "PR-AUC should drop more than ROC-AUC for poor model"

print("Model Performance:")
print(f"  ROC-AUC: {model_roc_auc:.4f}")
print(f"  PR-AUC:  {model_pr_auc:.4f}")
print()
print("Poor Model Performance:")
print(f"  ROC-AUC: {poor_roc_auc:.4f} (drop: {roc_drop:.4f})")
print(f"  PR-AUC:  {poor_pr_auc:.4f} (drop: {pr_drop:.4f})")
print()
print("✓ PR-AUC is more sensitive to poor performance on imbalanced data")

## Task 3.5: Plot ROC and PR Curves

Plot both ROC and PR curves for the model and poor model side by side.
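
One possible layout is sketched below on synthetic scores. The helper name `plot_roc_and_pr`, the toy data, and the styling are assumptions for illustration. The chance-level baselines drawn here are the diagonal for ROC and a horizontal line at the positive rate for PR.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve


def plot_roc_and_pr(y_true, scores_by_name):
    """Draw ROC curves (left) and PR curves (right) for several score arrays."""
    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))

    for name, scores in scores_by_name.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        precision, recall, _ = precision_recall_curve(y_true, scores)
        ax_roc.plot(fpr, tpr, label=name)
        ax_pr.plot(recall, precision, label=name)

    # Chance-level baselines: the diagonal for ROC, the positive rate for PR
    ax_roc.plot([0, 1], [0, 1], linestyle="--", color="gray", label="chance")
    ax_pr.axhline(np.mean(y_true), linestyle="--", color="gray", label="positive rate")

    ax_roc.set(xlabel="False positive rate", ylabel="True positive rate",
               title="ROC curves")
    ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-recall curves")
    ax_roc.legend()
    ax_pr.legend()
    fig.tight_layout()
    return fig


# Hypothetical synthetic data (not the task dataset)
rng = np.random.default_rng(0)
toy_true = (rng.random(1000) < 0.2).astype(int)
toy_scores = np.clip(0.3 * toy_true + rng.normal(0.35, 0.15, size=1000), 0, 1)
toy_scores_noisy = 0.3 * toy_scores + 0.7 * rng.random(1000)

toy_fig = plot_roc_and_pr(
    toy_true, {"model": toy_scores, "noisy model": toy_scores_noisy}
)
```

Returning the figure from the helper keeps the plotting logic reusable, and in a notebook the figure is displayed when the cell finishes.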

In [None]:
# YOUR CODE HERE
# Create 1x2 subplot
# Left: ROC curves for both models
# Right: PR curves for both models
# Include baseline lines

# Your plotting code here

# TEST - Do not modify
assert len(plt.gcf().axes) >= 2, "Create 2 subplots (ROC and PR)"
print("✓ Curves plotted successfully")

## Task 3.6: Calculate Class-wise Metrics

For the minority class (class 1), calculate:
- How many samples exist
- How many were correctly predicted
- How many were missed (FN)
- The recall for this class
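
These four quantities can all be read off a binary confusion matrix. The sketch below uses hand-built toy predictions (hypothetical names, not the task data) and relies on `confusion_matrix(...).ravel()` returning counts in the order TN, FP, FN, TP for labels 0 and 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 80 negatives followed by 20 positives
toy_true = np.array([0] * 80 + [1] * 20)
# Hand-built predictions: 74 TN, 6 FP, then 8 FN, 12 TP
toy_pred = np.array([0] * 74 + [1] * 6 + [0] * 8 + [1] * 12)

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

toy_minority_total = tp + fn          # all true class-1 samples
toy_minority_recall = tp / toy_minority_total

print(f"total={toy_minority_total}, found={tp}, missed={fn}, "
      f"recall={toy_minority_recall:.2f}")
# prints: total=20, found=12, missed=8, recall=0.60
```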

In [None]:
# YOUR CODE HERE
minority_total = None
minority_correct = None
minority_missed = None
minority_recall = None

# TEST - Do not modify
assert minority_total is not None, "Count minority class samples"
assert minority_correct is not None, "Count correctly predicted minority samples"
assert minority_missed is not None, "Count missed minority samples"
assert minority_recall is not None, "Calculate minority recall"

assert minority_total == (y_true == 1).sum(), "Incorrect minority total"
assert minority_correct + minority_missed == minority_total, "Counts don't add up"
assert abs(minority_recall - (minority_correct / minority_total)) < 0.001, \
    "Recall calculation incorrect"

print("✓ Minority Class (Class 1) Analysis:")
print(f"  Total samples: {minority_total}")
print(f"  Correctly predicted: {minority_correct} ({minority_correct/minority_total*100:.1f}%)")
print(f"  Missed (FN): {minority_missed} ({minority_missed/minority_total*100:.1f}%)")
print(f"  Recall: {minority_recall:.4f}")

## Task 3.7: Confusion Matrix Analysis

Create a confusion matrix normalized by true labels (each row sums to 1, so the diagonal holds the fraction of each true class that is classified correctly) and identify:
- What percentage of class 0 is correctly classified
- What percentage of class 1 is correctly classified
- Which class is harder to predict
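
In scikit-learn, row normalization is available through the `normalize="true"` argument of `confusion_matrix`. The sketch below applies it to hand-built toy predictions; the names and data are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 80 negatives followed by 20 positives
toy_true = np.array([0] * 80 + [1] * 20)
# Hand-built predictions: 74 TN, 6 FP, then 8 FN, 12 TP
toy_pred = np.array([0] * 74 + [1] * 6 + [0] * 8 + [1] * 12)

# normalize="true" divides each row by the number of samples in that true class
cm_rows = confusion_matrix(toy_true, toy_pred, normalize="true")

# The diagonal is the fraction of each true class predicted correctly
per_class_correct = np.diag(cm_rows)

# The class with the lowest diagonal entry is the harder one
toy_harder = int(np.argmin(per_class_correct))

print(cm_rows)
print(f"class 0: {per_class_correct[0]:.3f}, class 1: {per_class_correct[1]:.3f}, "
      f"harder class: {toy_harder}")
```

On this toy data the rows are [0.925, 0.075] and [0.4, 0.6], so class 1 is the harder class even though overall accuracy is 0.86.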

In [None]:
# YOUR CODE HERE
# Calculate normalized confusion matrix
cm_normalized = None

# Extract per-class accuracy
class_0_accuracy = None
class_1_accuracy = None

# Determine harder class
harder_class = None  # Should be 0 or 1

# TEST - Do not modify
assert cm_normalized is not None, "Calculate normalized confusion matrix"
assert cm_normalized.shape == (2, 2), "Confusion matrix should be 2x2"

assert class_0_accuracy is not None, "Calculate class 0 accuracy"
assert class_1_accuracy is not None, "Calculate class 1 accuracy"

assert 0 <= class_0_accuracy <= 1, "Class 0 accuracy should be between 0 and 1"
assert 0 <= class_1_accuracy <= 1, "Class 1 accuracy should be between 0 and 1"

assert harder_class is not None, "Identify harder class"
assert harder_class in [0, 1], "Harder class should be 0 or 1"

if class_0_accuracy < class_1_accuracy:
    assert harder_class == 0, "Class 0 is harder (lower accuracy)"
else:
    assert harder_class == 1, "Class 1 is harder (lower accuracy)"

print("✓ Per-Class Performance:")
print(f"  Class 0 accuracy: {class_0_accuracy:.4f}")
print(f"  Class 1 accuracy: {class_1_accuracy:.4f}")
print(f"  Harder class: {harder_class}")

## Task 3.8: Stratified Performance

Split the predictions into two groups according to how far the predicted probability is from 0.5:
- High confidence predictions (prob > 0.7 or prob < 0.3)
- Low confidence predictions (0.3 <= prob <= 0.7)

Calculate accuracy for each group.
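
The split can be expressed with boolean masks, as in the sketch below. The ten toy samples, the variable names, and the 0.5 decision threshold used to turn probabilities into labels are all assumptions for illustration. Defining the low-confidence mask as the complement of the high-confidence mask guarantees that the two groups cover every sample exactly once.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy data: true labels and predicted probabilities for class 1
toy_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
toy_prob = np.array([0.05, 0.10, 0.20, 0.90, 0.85, 0.95, 0.40, 0.45, 0.60, 0.65])

# Assumed decision rule: predict class 1 when the probability is at least 0.5
toy_pred = (toy_prob >= 0.5).astype(int)

# High confidence: probability far from 0.5; low confidence: everything else
toy_high_mask = (toy_prob > 0.7) | (toy_prob < 0.3)
toy_low_mask = ~toy_high_mask

toy_high_acc = accuracy_score(toy_true[toy_high_mask], toy_pred[toy_high_mask])
toy_low_acc = accuracy_score(toy_true[toy_low_mask], toy_pred[toy_low_mask])

print(f"high confidence: n={toy_high_mask.sum()}, accuracy={toy_high_acc:.3f}")
print(f"low confidence:  n={toy_low_mask.sum()}, accuracy={toy_low_acc:.3f}")
```

On this toy data, 5 of the 6 high-confidence predictions are correct, against 2 of the 4 low-confidence ones.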

In [None]:
# YOUR CODE HERE
# Create masks for high and low confidence
high_confidence_mask = None
low_confidence_mask = None

# Calculate accuracy for each group
high_confidence_accuracy = None
low_confidence_accuracy = None

# Count samples in each group
n_high_confidence = None
n_low_confidence = None

# TEST - Do not modify
assert high_confidence_mask is not None, "Create high confidence mask"
assert low_confidence_mask is not None, "Create low confidence mask"

assert high_confidence_accuracy is not None, "Calculate high confidence accuracy"
assert low_confidence_accuracy is not None, "Calculate low confidence accuracy"

assert n_high_confidence is not None, "Count high confidence samples"
assert n_low_confidence is not None, "Count low confidence samples"

assert n_high_confidence + n_low_confidence == len(y_true), "Masks should cover all samples"

# High confidence should have higher accuracy
assert high_confidence_accuracy > low_confidence_accuracy, \
    "High confidence predictions should be more accurate"

print("✓ Stratified Performance:")
print(f"  High confidence ({n_high_confidence} samples): {high_confidence_accuracy:.4f}")
print(f"  Low confidence ({n_low_confidence} samples): {low_confidence_accuracy:.4f}")
print("\nInsight: Model is more accurate when confident in its predictions")

## Task 3.9: Recommend Best Metric

Based on the imbalanced nature of the data, create a summary recommending which metrics to use and why.

In [None]:
# YOUR CODE HERE
# Create a dictionary with recommendations
# Keys: metric names
# Values: "Recommended" or "Not Recommended" with brief reason

recommendations = None

# TEST - Do not modify
assert recommendations is not None, "Create recommendations dictionary"
assert isinstance(recommendations, dict), "Should be a dictionary"
assert 'Accuracy' in recommendations, "Include Accuracy"
assert 'F1-Score' in recommendations, "Include F1-Score"
assert 'PR-AUC' in recommendations, "Include PR-AUC"

print("✓ Metric Recommendations for Imbalanced Data:")
for metric, recommendation in recommendations.items():
    print(f"  {metric}: {recommendation}")

## Summary

Congratulations! You've completed Task 3. You've learned:
- How to properly evaluate models on imbalanced datasets
- Why accuracy is misleading on imbalanced data
- Why PR-AUC is more informative than ROC-AUC for imbalanced data
- How to analyze class-wise performance
- How to stratify performance by confidence

Key insights:
- A naive baseline (always predict majority) can achieve high accuracy
- F1-score captures minority-class performance better than accuracy does
- PR-AUC is more sensitive than ROC-AUC to poor minority-class performance
- Always report multiple metrics, not just accuracy
- Consider per-class metrics to understand model behavior

### Best Practices for Imbalanced Data:
1. **Always compare with baselines** (majority class, random)
2. **Use F1-score or PR-AUC** as the headline metric instead of accuracy
3. **Report precision and recall separately** to understand trade-offs
4. **Analyze per-class performance** to identify weaknesses
5. **Consider cost-sensitive metrics** if misclassification costs differ