# 📊 Notebook 08: Test Inference & Classification Metrics

**Purpose:** Run inference on the test set and generate a comprehensive classification report.

**What you'll learn:** How to convert probabilities to class labels, collect predictions, and evaluate with precision/recall/F1.


## 🎯 Concept Primer: From Probabilities to Metrics

### Inference Pipeline
```python
model.eval()
with torch.no_grad():
    for images, labels in test_dataloader:
        outputs = model(images)  # Probabilities in [0, 1] (assumes a sigmoid output layer)
        predictions = torch.round(outputs)  # 0 or 1
```

### Probability → Class Label
- **Threshold:** 0.5 (default)
- `output >= 0.5` → Class 1 (Tumor)
- `output < 0.5` → Class 0 (Normal)
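- **torch.round():** Applies this threshold in one call, except that an output of exactly 0.5 rounds to 0 (ties round to the nearest even integer); use `outputs >= 0.5` if that edge case matters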
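
The behavior at exactly 0.5 is easy to check. The sketch below uses plain Python floats rather than tensors; Python's built-in `round`, like `torch.round` and `np.round`, breaks ties by rounding to the nearest even integer, so an explicit `>=` comparison is the form that matches the rule above at the boundary. The `probs` values are made up for illustration.

```python
# Hypothetical sigmoid outputs, including the exact boundary value 0.5
probs = [0.12, 0.49, 0.5, 0.51, 0.97]

# Explicit threshold: matches the rule "output >= 0.5 -> Class 1 (Tumor)"
threshold = 0.5
labels_by_threshold = [int(p >= threshold) for p in probs]

# Rounding: round-half-to-even sends exactly 0.5 to 0, not 1
labels_by_round = [round(p) for p in probs]

print(labels_by_threshold)  # [0, 0, 1, 1, 1]
print(labels_by_round)      # [0, 0, 0, 1, 1]
```

In practice a sigmoid output of exactly 0.5 is rare, so the two approaches usually agree; the explicit comparison also makes it easier to try thresholds other than 0.5.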

### Classification Metrics

**Precision:** Of predictions labeled Tumor, how many were actually Tumor?
- Formula: `TP / (TP + FP)`

**Recall:** Of actual Tumor samples, how many did we catch?
- Formula: `TP / (TP + FN)`

**F1-Score:** Harmonic mean of precision and recall
- Formula: `2 × (Precision × Recall) / (Precision + Recall)`
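
A small worked example can make the formulas concrete. The confusion counts below are invented for illustration (they are not results from this project), with Tumor treated as the positive class.

```python
# Hypothetical confusion counts for the Tumor (positive) class
tp = 90   # tumor patches correctly flagged
fp = 10   # normal patches wrongly flagged as tumor
fn = 30   # tumor patches the model missed

precision = tp / (tp + fp)   # 90 / 100 = 0.90
recall = tp / (tp + fn)      # 90 / 120 = 0.75
f1 = 2 * (precision * recall) / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.900 recall=0.750 f1=0.818
```

F1 (0.818) lands slightly below the arithmetic mean of the two values (0.825) because the harmonic mean is pulled toward the smaller one.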

### Clinical Context
- **False Negative (FN):** Miss a tumor → Delayed treatment (VERY BAD)
- **False Positive (FP):** Flag normal as tumor → Unnecessary follow-up testing and patient anxiety (BAD)
- For medical AI: **High recall is often prioritized** (catch as many tumors as possible, even at the cost of more false positives)


## 📚 Learning Objectives

1. ✅ Run inference on test_dataloader (eval mode + no_grad)
2. ✅ Collect probabilities and convert to class labels
3. ✅ Gather true labels from test set
4. ✅ Generate classification_report with class names
5. ✅ Interpret precision, recall, F1 for Normal vs Tumor


## ✅ Acceptance Criteria

- [ ] Test inference completes without errors
- [ ] `test_pred_labels` and `test_true_labels` are NumPy arrays
- [ ] Both have same length (number of test samples)
- [ ] `classification_report` prints with class names ['Normal', 'Tumor']
- [ ] Report shows precision, recall, F1 for each class


---

## 💻 TODO 1: Import & Rebuild Model + Test Loader


In [None]:
# TODO 1: Import libraries and rebuild trained model + test_dataloader
# Hint: import torch
# Hint: from sklearn.metrics import classification_report
# Hint: import numpy as np

# YOUR CODE HERE

print("✅ Model and test loader ready for inference")


---

## 💻 TODO 2: Run Inference & Collect Predictions


In [None]:
# TODO 2: Run inference on test set
# Hint: test_pred_probs = []
# Hint: test_true_labels = []
# Hint: cnn_model.eval()
# Hint: with torch.no_grad():
# Hint:     for images, labels in test_dataloader:
# Hint:         outputs = cnn_model(images.to(device))
# Hint:         test_pred_probs.extend(outputs.cpu().numpy())
# Hint:         test_true_labels.extend(labels.numpy())

# YOUR CODE HERE
test_pred_probs = []
test_pred_labels = []
test_true_labels = []

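# After loop: convert to flat NumPy arrays, then threshold the probabilities at 0.5
# test_true_labels = np.array(test_true_labels).ravel().astype(int)
# test_pred_labels = (np.array(test_pred_probs).ravel() >= 0.5).astype(int)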

print(f"✅ Collected {len(test_pred_labels)} predictions")
print(f"   Predicted classes: {np.unique(test_pred_labels)}")
print(f"   True classes: {np.unique(test_true_labels)}")
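
One shape detail is worth checking after the loop. If the model emits one value per image with shape `(batch, 1)`, extending a list with each batch yields an `(N, 1)` array rather than a flat `(N,)` one. The sketch below imitates that with hand-written batches standing in for model outputs (`fake_batches`, its values, and its batch sizes are made up), and flattens before thresholding.

```python
import numpy as np

# Stand-ins for two batches of model outputs, each of shape (batch, 1)
fake_batches = [
    np.array([[0.91], [0.08], [0.66]], dtype=np.float32),
    np.array([[0.35], [0.50]], dtype=np.float32),
]

test_pred_probs = []
for batch in fake_batches:
    test_pred_probs.extend(batch)  # appends arrays of shape (1,)

stacked = np.array(test_pred_probs)
print(stacked.shape)  # (5, 1)

# Flatten to (N,) and apply an explicit 0.5 threshold
test_pred_labels = (stacked.ravel() >= 0.5).astype(int)
print(test_pred_labels)  # [1 0 1 0 1]
```

Flat integer arrays of equal length are the form the acceptance criteria ask for, and they avoid shape warnings when passed to metric functions.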


---

## 💻 TODO 3: Generate Classification Report


In [None]:
# TODO 3: Print classification report
# Hint: pcam_classes = ['Normal', 'Tumor']
# Hint: print(classification_report(test_true_labels, test_pred_labels, target_names=pcam_classes))

# YOUR CODE HERE
pcam_classes = ['Normal', 'Tumor']

print("\n📊 Test Set Classification Report:")
print("="*60)
# Print classification report here


---

## 🤔 Reflection Prompts

### Question 1: Clinical Consequences
Imagine your model achieves:
- **Normal:** Precision=0.92, Recall=0.85
- **Tumor:** Precision=0.88, Recall=0.94

**Questions:**
- Which class has the higher false-negative rate?
- Which error is worse clinically?
- Would you adjust the threshold (0.5) for medical use?

**Your analysis:**

---

### Question 2: Threshold Tuning
Default threshold is 0.5. What if you changed it?

| Threshold | Effect on Predictions |
|-----------|----------------------|
| 0.3 (lower) | ? |
| 0.5 (current) | ? |
| 0.7 (higher) | ? |

**Your predictions:**

---

### Question 3: Perfect Metrics?
If a model achieves 100% accuracy, precision, and recall on test data:

**Question:** Is this always good? What might it indicate?

**Your answer:**

---


## 🚀 Next Steps

**Project Complete!** Move to Notebook 99 to document your learning journey.

**Optional Next Steps:**
- Experiment with deeper architectures (ResNet)
- Try different learning rates
- Implement early stopping
- Visualize misclassified samples
- Explore Grad-CAM for interpretability

**Key Takeaway:** Metrics must be interpreted in clinical context!
