# 📋 Notebook 00: Pipeline Overview — Map of the Journey

**Purpose:** Understand the full ML pipeline from raw images to predictions.

**Learning Style:** This notebook has no code to write — it's your roadmap. Use it to visualize the entire workflow before diving into implementation.


## 🎯 Concept Primer: The Complete Pipeline

Medical image classification follows this flow:

```
Raw Images (96×96 RGB)
    ↓
Transforms (Resize, Augment, Normalize)
    ↓
PCamDataset (Custom Dataset class)
    ↓
DataLoader (Batching & Shuffling)
    ↓
SimpleCNN Architecture:
    Conv2D → ReLU → MaxPool
    Conv2D → ReLU → MaxPool
    Conv2D → ReLU → MaxPool
    Flatten → FC → ReLU → FC → Sigmoid
    ↓
Binary Cross-Entropy Loss
    ↓
Backpropagation → Adam Optimizer (weight updates)
    ↓
Predictions & Evaluation Metrics
```
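The Dataset → DataLoader steps of the flow above can be sketched roughly as follows. `ToyPatchDataset` is a hypothetical stand-in that serves random tensors; it is not the provided `PCamDataset`, whose implementation is not shown here. The sample count and batch size are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyPatchDataset(Dataset):
    """Hypothetical stand-in for PCamDataset: random 96x96 RGB patches."""

    def __init__(self, num_samples=10, transform=None):
        generator = torch.Generator().manual_seed(0)
        # Fake images in [0, 1] and fake binary labels (0 = Normal, 1 = Tumor).
        self.images = torch.rand(num_samples, 3, 96, 96, generator=generator)
        self.labels = torch.randint(0, 2, (num_samples,), generator=generator).float()
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]


train_dataset = ToyPatchDataset(num_samples=10)
# shuffle=True reorders samples every epoch; batch_size=4 is an assumed value.
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)

batch_shapes = [(images.shape, labels.shape) for images, labels in train_dataloader]
for image_shape, label_shape in batch_shapes:
    print(tuple(image_shape), tuple(label_shape))
```

With 10 samples and a batch size of 4, the loader yields two full batches and one smaller final batch, which is worth keeping in mind when averaging losses per epoch.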

### Key Architectural Choices

**Why augment training but not val/test?**
- Training: Random flips/rotations create synthetic variety → better generalization
- Val/Test: Need consistent, repeatable evaluation → no randomness

**Why normalize with mean=std=0.5?**
- Converts pixel values from [0,1] → [-1,1]
- Centers data around zero → tends to give more stable gradients during training
- Must be **identical** across train/val/test
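The [0,1] → [-1,1] mapping follows directly from the normalization rule `(pixel - mean) / std`. A tiny plain-Python check, using a hypothetical helper named `normalize`:

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Apply (pixel - mean) / std to a single pixel value in [0, 1]."""
    return (pixel - mean) / std


darkest = normalize(0.0)    # black pixel
mid_gray = normalize(0.5)   # mid-gray pixel lands exactly on zero
brightest = normalize(1.0)  # white pixel

print(darkest, mid_gray, brightest)  # -1.0 0.0 1.0
```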

**Why Sigmoid for binary classification?**
- Outputs a probability in [0,1]
- Pairs with `BCELoss` (Binary Cross-Entropy Loss)
- Alternative: `BCEWithLogitsLoss` + no Sigmoid (more numerically stable, but we use Sigmoid here for interpretability)
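To see that the two options compute the same quantity, the sketch below feeds the same made-up logits through both routes. The logit and target values are arbitrary illustration data.

```python
import torch
from torch import nn

# Made-up raw model outputs (logits) and ground-truth labels.
logits = torch.tensor([-2.0, 0.0, 3.0])
targets = torch.tensor([0.0, 1.0, 1.0])

# Route 1: Sigmoid first, then BCELoss on probabilities (the approach used here).
probabilities = torch.sigmoid(logits)
loss_from_probs = nn.BCELoss()(probabilities, targets)

# Route 2: BCEWithLogitsLoss applies the sigmoid internally.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

losses_match = torch.allclose(loss_from_probs, loss_from_logits, atol=1e-6)
print(probabilities)
print(losses_match)
```

For moderate logits like these the two losses agree to within floating-point tolerance; the stability advantage of `BCEWithLogitsLoss` shows up for very large positive or negative logits, where the sigmoid saturates.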


## 📚 Learning Objectives

By the end of this project, you will:

1. ✅ Understand why data augmentation improves generalization
2. ✅ Understand and use a custom PyTorch `Dataset` class (provided as `PCamDataset`)
3. ✅ Implement `DataLoader` with proper batching and shuffling
4. ✅ Design a simple CNN with 3 convolutional blocks
5. ✅ Train a model with train/validation splits
6. ✅ Evaluate performance with precision, recall, and F1-score
7. ✅ Reflect on clinical implications of false positives vs false negatives
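For objectives 6 and 7, it may help to see how precision, recall, and F1 fall out of the false positive and false negative counts. The sketch below uses a hypothetical helper and made-up labels, not results from this project.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall, and F1 for the positive class (1 = Tumor)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed tumors

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up labels for illustration: 1 = Tumor, 0 = Normal.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Here one missed tumor lowers recall and one false alarm lowers precision, which maps directly onto the clinical trade-off in objective 7.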


## ✅ Acceptance Criteria

You understand the pipeline when you can:

- [ ] Narrate the flow from images → predictions in 60–90 seconds
- [ ] Explain why train transforms differ from val/test transforms
- [ ] Describe what each layer in the CNN does (Conv, Pool, FC)
- [ ] Justify why we use Sigmoid + BCELoss vs alternatives
- [ ] Articulate the difference between training loss and validation loss


## 🏗️ Architecture Deep Dive

### SimpleCNN Architecture

```
Input: [Batch, 3, 96, 96]
    ↓
Conv1: [Batch, 32, 96, 96]  (3→32 channels, kernel=3, padding=1)
ReLU + MaxPool(2×2)
    ↓
[Batch, 32, 48, 48]
    ↓
Conv2: [Batch, 64, 48, 48]  (32→64 channels)
ReLU + MaxPool(2×2)
    ↓
[Batch, 64, 24, 24]
    ↓
Conv3: [Batch, 128, 24, 24]  (64→128 channels)
ReLU + MaxPool(2×2)
    ↓
[Batch, 128, 12, 12]
    ↓
Flatten: [Batch, 18432]  (128 × 12 × 12 = 18,432)
    ↓
FC1: [Batch, 256]
ReLU
    ↓
FC2: [Batch, 1]
Sigmoid
    ↓
Output: [Batch]  (after squeezing the last dimension; probabilities in [0,1])
```
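One way the shape diagram above could translate into PyTorch is sketched below. This is an illustration built from the stated shapes, not necessarily the exact `SimpleCNN` used in later notebooks. The diagram gives `kernel=3, padding=1` only for Conv1; using the same settings for Conv2 and Conv3 is an assumption that is consistent with the listed shapes.

```python
import torch
from torch import nn


class SimpleCNN(nn.Module):
    """Sketch of a 3-block CNN matching the shapes in the diagram."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # kernel_size/padding for the next two convs are assumed to match Conv1.
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                   # [Batch, 128, 12, 12] -> [Batch, 18432]
            nn.Linear(128 * 12 * 12, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),                   # probabilities in [0, 1]
        )

    def forward(self, x):
        # squeeze(1) turns [Batch, 1] into [Batch] so outputs line up with labels.
        return self.classifier(self.features(x)).squeeze(1)


cnn_model = SimpleCNN()
dummy_batch = torch.randn(4, 3, 96, 96)  # random stand-in for a batch of patches

with torch.no_grad():
    feature_maps = cnn_model.features(dummy_batch)
    probabilities = cnn_model(dummy_batch)

print(tuple(feature_maps.shape))   # (4, 128, 12, 12)
print(tuple(probabilities.shape))  # (4,)
```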

### Spatial Dimension Math

- **Start:** 96×96
- **After 1st pool:** 96 ÷ 2 = 48×48
- **After 2nd pool:** 48 ÷ 2 = 24×24
- **After 3rd pool:** 24 ÷ 2 = 12×12

**Why three pooling layers?**
- Each pool halves the height and width, while the conv layers increase the channel count (32 → 64 → 128)
- The final 12×12 map keeps FC1 at a manageable 18,432 inputs while still retaining coarse spatial layout
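The same arithmetic can be checked with the standard convolution output-size formula, `(size + 2·padding − kernel) // stride + 1`. The helper names below are hypothetical, and the sketch assumes every conv layer uses `kernel=3, padding=1, stride=1`, so only the pools change the spatial size.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution (same formula for height and width)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, kernel=2):
    """Spatial size after a max pool whose stride equals its kernel size."""
    return size // kernel


size = 96
sizes_after_pool = []
for _ in range(3):
    size = conv_output_size(size)   # padding=1 with kernel=3 keeps the size unchanged
    size = pool_output_size(size)   # each 2x2 pool halves it
    sizes_after_pool.append(size)

flattened_features = 128 * size * size
print(sizes_after_pool, flattened_features)  # [48, 24, 12] 18432
```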


## 🤔 Reflection Prompt

**Write a one-paragraph "model card" for this CNN:**

Consider:
- **Intended Use:** What is this model designed to do? (e.g., "Educational tool for binary tumor classification on PCam patches")
- **Limitations:** What can it NOT do? (e.g., "Not validated for clinical use; trained on limited subset")
- **Potential Misuse:** How could someone misuse it? (e.g., "Using predictions for patient diagnosis without physician oversight")
- **Ethical Considerations:** What are the clinical consequences of errors?

**Write your model card here (in your own words):**

---

*(Example structure:)*
> This model is an educational CNN trained on PatchCamelyon data to classify 96×96 histopathology patches as Normal (0) or Tumor (1). It is intended solely for learning PyTorch and medical image classification concepts. **Limitations:** The model is trained on a small subset, lacks external validation, and achieves [X]% accuracy on the test set. It should never be used for clinical decisions. **Ethical risks:** False negatives (missed tumors) could delay cancer treatment, while false positives cause unnecessary biopsies and patient anxiety. This model serves as a starting point for understanding medical AI, not as a deployable solution.

---


## 🚀 Next Steps

Move to **Notebook 01: Train Transforms** to start building your data pipeline!

---

**Quick Reference: Variable Names Used Throughout**

| Variable | Purpose |
|----------|---------|
| `train_transform` | Augmentation pipeline for training |
| `val_test_transform` | Deterministic pipeline for val/test |
| `train_dataset`, `val_dataset`, `test_dataset` | PCamDataset instances |
| `train_dataloader`, `val_dataloader`, `test_dataloader` | DataLoader instances |
| `cnn_model` | SimpleCNN instance |
| `device` | torch.device (cuda/cpu) |
| `criterion` | BCELoss instance |
| `optimizer` | Adam optimizer |
| `train_losses`, `val_losses` | Lists tracking loss per epoch |
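
The names in the table fit together roughly as in the sketch below. To stay self-contained it uses random tensors instead of `PCamDataset` and a tiny placeholder network instead of `SimpleCNN`; the learning rate, batch size, and epoch count are assumed values for illustration only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def make_loader(num_samples, shuffle):
    """Hypothetical helper: random patches and labels instead of real PCam data."""
    images = torch.randn(num_samples, 3, 96, 96)
    labels = torch.randint(0, 2, (num_samples,)).float()
    return DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=shuffle)


train_dataloader = make_loader(16, shuffle=True)
val_dataloader = make_loader(8, shuffle=False)

# Tiny placeholder model (NOT SimpleCNN) that still ends in a Sigmoid.
cnn_model = nn.Sequential(
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1), nn.Sigmoid()
).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(cnn_model.parameters(), lr=1e-3)  # lr is an assumption

train_losses, val_losses = [], []
for epoch in range(2):
    cnn_model.train()
    running_loss = 0.0
    for images, labels in train_dataloader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(cnn_model(images).squeeze(1), labels)
        loss.backward()    # backpropagation computes gradients
        optimizer.step()   # Adam updates the weights
        running_loss += loss.item() * images.size(0)
    train_losses.append(running_loss / len(train_dataloader.dataset))

    cnn_model.eval()
    running_loss = 0.0
    with torch.no_grad():  # no gradients or weight updates during validation
        for images, labels in val_dataloader:
            images, labels = images.to(device), labels.to(device)
            loss = criterion(cnn_model(images).squeeze(1), labels)
            running_loss += loss.item() * images.size(0)
    val_losses.append(running_loss / len(val_dataloader.dataset))

print(train_losses, val_losses)
```

Training loss is measured on batches the optimizer has just learned from, while validation loss is measured on held-out data with gradients disabled, which is why the two lists are tracked separately.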
