# Precision, Recall, and F1-Score

In this notebook, you'll learn:
- How to calculate precision, recall, and F1-score from a confusion matrix
- The trade-off between precision and recall
- When to optimize for precision vs recall
- How to find the optimal threshold for your use case
- Using sklearn's classification_report

## The Problem with Accuracy

On imbalanced datasets, accuracy can be misleading. We need metrics that tell us:
1. **Precision**: When we predict positive, how often are we correct?
2. **Recall**: Of all actual positives, how many did we catch?
3. **F1-Score**: A balanced metric combining precision and recall
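
To see how far accuracy and recall can diverge, here is a minimal sketch on made-up labels (95 negatives, 5 positives) with a hypothetical model that catches only one positive. The arrays `y_true_toy` and `y_pred_toy` are illustrative assumptions, not the notebook's dataset.

```python
import numpy as np

# Hypothetical labels: 95 negatives followed by 5 positives
y_true_toy = np.array([0] * 95 + [1] * 5)

# Hypothetical model: predicts negative everywhere except for one true positive
y_pred_toy = np.zeros(100, dtype=int)
y_pred_toy[95] = 1

# Accuracy: share of all samples labelled correctly
toy_accuracy = float(np.mean(y_true_toy == y_pred_toy))

# Recall: share of actual positives that were found
toy_recall = float(
    np.sum((y_true_toy == 1) & (y_pred_toy == 1)) / np.sum(y_true_toy == 1)
)

print(f"Accuracy: {toy_accuracy:.2f}")
print(f"Recall:   {toy_recall:.2f}")
```

Accuracy comes out at 0.96 while recall is only 0.20, which is the gap the metrics in this notebook are designed to expose.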

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    confusion_matrix, precision_score, recall_score, f1_score,
    accuracy_score, classification_report
)

# Set random seed
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('default')
sns.set_palette('colorblind')

## Load Data and Calculate Confusion Matrix

In [None]:
# Load the classification data
df = pd.read_csv('../../fixtures/input/classification_data.csv')

# Extract labels
y_true = df['true_label'].values
y_pred = df['predicted_label'].values

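# Calculate confusion matrix (rows = true labels, columns = predicted labels).
# labels=[0, 1] keeps the 2x2 shape even if one class is missing from the data.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
# For binary labels, ravel() flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()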

print("Confusion Matrix:")
print(f"TN: {tn}, FP: {fp}")
print(f"FN: {fn}, TP: {tp}")

## Precision: Quality of Positive Predictions

**Precision** measures: "Of all samples we predicted as positive, how many are actually positive?"

$$\text{Precision} = \frac{TP}{TP + FP}$$

**Use cases where precision is critical:**
- Spam detection (don't want to mark legitimate emails as spam)
- Medical diagnosis when false alarms are costly
- Recommendation systems (don't want to recommend irrelevant items)
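
The formula is undefined when the model makes no positive predictions (TP + FP = 0). Below is a small sketch of one way to guard against that case. `safe_precision` and its `zero_value` argument are hypothetical names, loosely mirroring the `zero_division` argument of sklearn's `precision_score`.

```python
def safe_precision(tp: int, fp: int, zero_value: float = 0.0) -> float:
    """Return TP / (TP + FP), or zero_value when nothing was predicted positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return zero_value
    return tp / predicted_positive


# 8 of 10 positive predictions are correct
print(safe_precision(tp=8, fp=2))

# No positive predictions at all: falls back to zero_value
print(safe_precision(tp=0, fp=0))
```

Which fallback value is appropriate depends on the use case; returning 0 is a conservative assumption here.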

In [None]:
# Calculate precision manually
precision_manual = tp / (tp + fp)
print(f"Precision (manual): {precision_manual:.4f}")

# Calculate precision with sklearn
precision_sklearn = precision_score(y_true, y_pred)
print(f"Precision (sklearn): {precision_sklearn:.4f}")

print("\nInterpretation:")
print(f"When our model predicts class 1, it's correct {precision_manual*100:.2f}% of the time")
print(f"Out of {tp + fp} positive predictions, {tp} were correct and {fp} were wrong")

## Recall (Sensitivity): Coverage of Actual Positives

**Recall** measures: "Of all actual positive samples, how many did we correctly identify?"

$$\text{Recall} = \frac{TP}{TP + FN}$$

**Use cases where recall is critical:**
- Disease detection (don't want to miss sick patients)
- Fraud detection (want to catch all fraud cases)
- Search engines (want to find all relevant documents)
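
Recall on its own is easy to inflate: a model that flags every sample as positive never misses a positive. The sketch below shows this on made-up labels; all names and values are illustrative assumptions.

```python
# Hypothetical labels: 2 positives among 10 samples
y_true_toy = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

# Degenerate "model" that flags every sample as positive
y_pred_all_pos = [1] * len(y_true_toy)

pairs = list(zip(y_true_toy, y_pred_all_pos))
tp_toy = sum(1 for t, p in pairs if t == 1 and p == 1)
fn_toy = sum(1 for t, p in pairs if t == 1 and p == 0)
fp_toy = sum(1 for t, p in pairs if t == 0 and p == 1)

recall_toy = tp_toy / (tp_toy + fn_toy)
precision_toy = tp_toy / (tp_toy + fp_toy)

print(f"Recall:    {recall_toy:.2f}")
print(f"Precision: {precision_toy:.2f}")
```

Recall is perfect, but precision collapses to the share of positives in the data, which is why the two metrics are read together.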

In [None]:
# Calculate recall manually
recall_manual = tp / (tp + fn)
print(f"Recall (manual): {recall_manual:.4f}")

# Calculate recall with sklearn
recall_sklearn = recall_score(y_true, y_pred)
print(f"Recall (sklearn): {recall_sklearn:.4f}")

print("\nInterpretation:")
print(f"Our model found {recall_manual*100:.2f}% of all actual class 1 samples")
print(f"Out of {tp + fn} actual positives, we correctly identified {tp} and missed {fn}")

## F1-Score: Harmonic Mean of Precision and Recall

**F1-Score** is the harmonic mean of precision and recall, giving a single score that balances both:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$

**Why harmonic mean?** The harmonic mean is pulled toward the smaller of the two values, so if either precision or recall is low, F1 will be low too, no matter how high the other one is.
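
A quick numeric check of that claim, comparing the arithmetic mean with F1 on a few hypothetical precision/recall pairs (`f1_from` is an illustrative helper, not part of sklearn):

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs
for p, r in [(0.9, 0.9), (0.5, 0.5), (1.0, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic:.3f}  F1={f1_from(p, r):.3f}")
```

For the lopsided pair (1.0, 0.1) the arithmetic mean is 0.55, while F1 is roughly 0.18.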

**When to use F1:**
- When you need a balance between precision and recall
- When classes are imbalanced and the positive class is the one you care about (F1 ignores true negatives)
- When false positives and false negatives have similar costs

In [None]:
# Calculate F1 manually (from precision and recall)
f1_from_pr = 2 * (precision_manual * recall_manual) / (precision_manual + recall_manual)
print(f"F1 (from P and R): {f1_from_pr:.4f}")

# Calculate F1 manually (from confusion matrix)
f1_from_cm = (2 * tp) / (2 * tp + fp + fn)
print(f"F1 (from confusion matrix): {f1_from_cm:.4f}")

# Calculate F1 with sklearn
f1_sklearn = f1_score(y_true, y_pred)
print(f"F1 (sklearn): {f1_sklearn:.4f}")

## Visualizing All Metrics Together

In [None]:
# Calculate all metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Create bar plot
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
values = [accuracy, precision, recall, f1]

plt.figure(figsize=(10, 6))
bars = plt.bar(metrics, values, color=['skyblue', 'lightgreen', 'lightcoral', 'gold'])
plt.ylim(0, 1.0)
plt.ylabel('Score', fontsize=12)
plt.title('Model Performance Metrics', fontsize=14, pad=20)

# Add value labels on bars
for bar, value in zip(bars, values):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{value:.3f}',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nMetric Values:")
for metric, value in zip(metrics, values):
    print(f"{metric:12s}: {value:.4f}")

## The Precision-Recall Trade-off

There's usually a trade-off between precision and recall:
- **Increase threshold** → Usually higher precision, lower recall (fewer but more confident positive predictions)
- **Decrease threshold** → Usually lower precision, higher recall (more positive predictions, but less confident ones)
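
A toy illustration of the two bullets above, using six made-up scores and labels (`toy_prob`, `toy_true`, and `precision_recall_at` are hypothetical names, not part of the notebook's data):

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
toy_prob = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.20])
toy_true = np.array([1, 1, 0, 1, 0, 0])


def precision_recall_at(threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    pred = toy_prob >= threshold
    tp = np.sum(pred & (toy_true == 1))
    fp = np.sum(pred & (toy_true == 0))
    fn = np.sum(~pred & (toy_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return float(precision), float(recall)


for t in (0.3, 0.6, 0.9):
    p, r = precision_recall_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 moves precision from 0.60 to 1.00 while recall drops from 1.00 to 0.33.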

Let's demonstrate this:

In [None]:
# Try different thresholds
thresholds = np.arange(0.1, 1.0, 0.05)
y_prob = df['predicted_probability'].values

precisions = []
recalls = []
f1_scores = []

for threshold in thresholds:
    y_pred_thresh = (y_prob >= threshold).astype(int)
    
    # Calculate metrics; zero_division=0 returns 0 when a metric is undefined
    # (e.g. precision when no sample is predicted positive), so no try/except is needed
    prec = precision_score(y_true, y_pred_thresh, zero_division=0)
    rec = recall_score(y_true, y_pred_thresh, zero_division=0)
    # Use a separate name so the overall `f1` computed earlier is not overwritten
    f1_thresh = f1_score(y_true, y_pred_thresh, zero_division=0)

    precisions.append(prec)
    recalls.append(rec)
    f1_scores.append(f1_thresh)

# Plot precision-recall trade-off
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Precision and Recall vs Threshold
axes[0].plot(thresholds, precisions, 'g-', label='Precision', linewidth=2)
axes[0].plot(thresholds, recalls, 'r-', label='Recall', linewidth=2)
axes[0].plot(thresholds, f1_scores, 'b--', label='F1-Score', linewidth=2)
axes[0].set_xlabel('Threshold', fontsize=12)
axes[0].set_ylabel('Score', fontsize=12)
axes[0].set_title('Precision-Recall Trade-off', fontsize=14)
axes[0].legend(loc='best')
axes[0].grid(True, alpha=0.3)

# Plot 2: Precision vs Recall curve
axes[1].plot(recalls, precisions, 'b-', linewidth=2)
axes[1].set_xlabel('Recall', fontsize=12)
axes[1].set_ylabel('Precision', fontsize=12)
axes[1].set_title('Precision vs Recall', fontsize=14)
axes[1].grid(True, alpha=0.3)

# Mark some key points
for thresh in [0.3, 0.5, 0.7]:
    idx = np.argmin(np.abs(thresholds - thresh))
    axes[1].plot(recalls[idx], precisions[idx], 'ro', markersize=8)
    axes[1].annotate(f'T={thresh}', (recalls[idx], precisions[idx]),
                    xytext=(10, 10), textcoords='offset points')

plt.tight_layout()
plt.show()

## Finding the Optimal Threshold

The "optimal" threshold depends on your use case:
- **Maximize F1**: Balance between precision and recall
- **Maximize Precision**: When false positives are costly
- **Maximize Recall**: When false negatives are costly
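
When the costs of the two error types can be estimated, one option is to pick the threshold that minimizes total cost directly. A minimal sketch, assuming made-up scores and an assumed cost of 5 per false negative versus 1 per false positive (`COST_FP`, `COST_FN`, and `total_cost` are hypothetical names and values):

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
toy_prob = np.array([0.95, 0.80, 0.65, 0.55, 0.40, 0.20])
toy_true = np.array([1, 1, 0, 1, 0, 0])

COST_FP = 1.0  # assumed cost of one false alarm
COST_FN = 5.0  # assumed cost of one missed positive


def total_cost(threshold):
    """Total misclassification cost when predicting positive for scores >= threshold."""
    pred = toy_prob >= threshold
    fp = np.sum(pred & (toy_true == 0))
    fn = np.sum(~pred & (toy_true == 1))
    return float(COST_FP * fp + COST_FN * fn)


candidates = [0.3, 0.5, 0.7, 0.9]
costs = [total_cost(t) for t in candidates]
best_threshold = candidates[int(np.argmin(costs))]

for t, c in zip(candidates, costs):
    print(f"threshold={t:.1f}  cost={c:.1f}")
print(f"Lowest-cost threshold: {best_threshold}")
```

With false negatives weighted five times as heavily, the lowest-cost candidate on this toy data is 0.5; different cost assumptions would shift that choice.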

In [None]:
# Find threshold that maximizes F1-score
best_f1_idx = np.argmax(f1_scores)
best_f1_threshold = thresholds[best_f1_idx]
best_f1 = f1_scores[best_f1_idx]

print(f"Threshold that maximizes F1: {best_f1_threshold:.3f}")
print(f"  Precision: {precisions[best_f1_idx]:.4f}")
print(f"  Recall: {recalls[best_f1_idx]:.4f}")
print(f"  F1-Score: {best_f1:.4f}")
print()

# Find threshold for high precision (>= 0.90), keeping the one with the highest recall
high_prec_indices = [i for i, p in enumerate(precisions) if p >= 0.90]
if high_prec_indices:
    best_high_prec_idx = high_prec_indices[np.argmax([recalls[i] for i in high_prec_indices])]
    print("Threshold for precision >= 0.90 (with max recall):")
    print(f"  Threshold: {thresholds[best_high_prec_idx]:.3f}")
    print(f"  Precision: {precisions[best_high_prec_idx]:.4f}")
    print(f"  Recall: {recalls[best_high_prec_idx]:.4f}")
    print()

# Find threshold for high recall (>= 0.90), keeping the one with the highest precision
high_rec_indices = [i for i, r in enumerate(recalls) if r >= 0.90]
if high_rec_indices:
    best_high_rec_idx = high_rec_indices[np.argmax([precisions[i] for i in high_rec_indices])]
    print("Threshold for recall >= 0.90 (with max precision):")
    print(f"  Threshold: {thresholds[best_high_rec_idx]:.3f}")
    print(f"  Precision: {precisions[best_high_rec_idx]:.4f}")
    print(f"  Recall: {recalls[best_high_rec_idx]:.4f}")

## Using sklearn's classification_report

sklearn's `classification_report` computes precision, recall, F1, and support for every class in one call:

In [None]:
# Generate classification report
report = classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1'])
print("Classification Report:")
print(report)

# Get report as dictionary for easier access
report_dict = classification_report(y_true, y_pred, target_names=['Class 0', 'Class 1'], output_dict=True)

# Visualize per-class metrics
classes = ['Class 0', 'Class 1']
metrics_names = ['precision', 'recall', 'f1-score']

class_metrics = np.array([
    [report_dict['Class 0']['precision'], report_dict['Class 0']['recall'], report_dict['Class 0']['f1-score']],
    [report_dict['Class 1']['precision'], report_dict['Class 1']['recall'], report_dict['Class 1']['f1-score']]
])

fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(metrics_names))
width = 0.35

bars1 = ax.bar(x - width/2, class_metrics[0], width, label='Class 0', color='skyblue')
bars2 = ax.bar(x + width/2, class_metrics[1], width, label='Class 1', color='lightcoral')

ax.set_ylabel('Score', fontsize=12)
ax.set_title('Per-Class Performance Metrics', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels(metrics_names)
ax.legend()
ax.set_ylim(0, 1.0)
ax.grid(axis='y', alpha=0.3)

# Add value labels
def autolabel(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.3f}',
                   xy=(bar.get_x() + bar.get_width() / 2, height),
                   xytext=(0, 3),
                   textcoords="offset points",
                   ha='center', va='bottom', fontsize=9)

autolabel(bars1)
autolabel(bars2)

plt.tight_layout()
plt.show()

## Understanding Support and Weighted Averages

The classification report includes:
- **Support**: Number of samples in each class
- **Macro avg**: Simple average across classes (treats all classes equally)
- **Weighted avg**: Average weighted by each class's support, so larger classes contribute more to the result
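
The difference between the two averages can be sketched by hand. The per-class recall values and supports below are made-up numbers for an imbalanced two-class problem, not output from the notebook's data.

```python
import numpy as np

# Hypothetical per-class recall and support (class 0, class 1)
per_class_recall = np.array([0.98, 0.40])
support = np.array([950, 50])

# Macro average: every class counts the same
macro_recall = float(per_class_recall.mean())

# Weighted average: each class counts in proportion to its support
weighted_recall = float(np.average(per_class_recall, weights=support))

print(f"Macro recall:    {macro_recall:.3f}")
print(f"Weighted recall: {weighted_recall:.3f}")
```

Here the macro average (0.690) exposes the weak minority class, while the weighted average (0.951) is dominated by the majority class.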

In [None]:
# Extract averaging methods
macro_avg = report_dict['macro avg']
weighted_avg = report_dict['weighted avg']

print("Averaging Methods:")
print("\nMacro Average (treats all classes equally):")
print(f"  Precision: {macro_avg['precision']:.4f}")
print(f"  Recall: {macro_avg['recall']:.4f}")
print(f"  F1-Score: {macro_avg['f1-score']:.4f}")

print("\nWeighted Average (each class weighted by its support):")
print(f"  Precision: {weighted_avg['precision']:.4f}")
print(f"  Recall: {weighted_avg['recall']:.4f}")
print(f"  F1-Score: {weighted_avg['f1-score']:.4f}")

print("\nClass supports:")
print(f"  Class 0: {report_dict['Class 0']['support']} samples")
print(f"  Class 1: {report_dict['Class 1']['support']} samples")

## Exercise 1: Calculate Metrics for Different Thresholds

Calculate precision, recall, and F1 for threshold=0.6:

In [None]:
# YOUR CODE HERE
# 1. Create predictions with threshold 0.6
# 2. Calculate confusion matrix
# 3. Calculate precision, recall, F1 manually
# 4. Compare with sklearn results



## Exercise 2: Optimize for Your Use Case

Imagine this is a medical test. Which threshold would you choose and why?

In [None]:
# YOUR CODE HERE
# Consider: What's worse - missing a sick patient (FN) or false alarm (FP)?
# Find an appropriate threshold and justify your choice



## Exercise 3: Compare with Baseline

Calculate precision, recall, and F1 for a naive baseline (always predict class 0):

In [None]:
# YOUR CODE HERE
# 1. Create a naive predictor (all 0s)
# 2. Calculate its metrics
# 3. Compare with your model
# What happens to recall? Why?



## Summary

In this notebook, you learned:

1. **Precision**: Quality of positive predictions (TP / (TP + FP))
2. **Recall**: Coverage of actual positives (TP / (TP + FN))
3. **F1-Score**: Harmonic mean of precision and recall
4. **Precision-Recall Trade-off**: Adjusting threshold affects both metrics
5. **Threshold Optimization**: Choose based on your use case

### Key Takeaways:

- On imbalanced data, precision/recall/F1 are more informative than accuracy
- There's usually a trade-off between precision and recall
- The "optimal" threshold depends on the relative costs of FP vs FN
- F1-score provides a single metric that balances both
- Use classification_report for a comprehensive view

### Decision Guide:

- **Optimize for Precision**: When false positives are costly (spam detection, recommendations)
- **Optimize for Recall**: When false negatives are costly (disease detection, fraud detection)
- **Optimize for F1**: When you need balance or costs are similar

### Next Steps:

In the next notebook, we'll explore:
- ROC curves and ROC-AUC
- Precision-Recall curves and PR-AUC
- When to use which curve