# 📦 Notebook 02: Train Dataset & DataLoader

**Purpose:** Instantiate the training dataset and DataLoader to batch and shuffle data efficiently.

**What you'll learn:** How PyTorch's `Dataset` and `DataLoader` work together, why we shuffle training data, and how batching improves training efficiency.


## 🎯 Concept Primer: Dataset & DataLoader

### The Dataset Class
- **Custom `Dataset`** reads image paths from a CSV and applies transforms
- Must implement:
  - `__len__()` → returns total number of samples
  - `__getitem__(idx)` → returns (image, label) for a given index
- **Provided:** `PCamDataset` is already implemented in `src/datasets/pcam_dataset.py`
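
To make the two required methods concrete, here is a minimal sketch of a map-style dataset. `ToyPatchDataset` is a hypothetical stand-in that keeps random tensors in memory. It is not the provided `PCamDataset`, which reads image paths from a CSV and whose internals are not shown in this notebook.

```python
import torch
from torch.utils.data import Dataset


class ToyPatchDataset(Dataset):
    """Hypothetical in-memory dataset used only to illustrate the interface."""

    def __init__(self, num_samples=10, transform=None):
        generator = torch.Generator().manual_seed(0)
        # Random tensors stand in for decoded 96x96 RGB images.
        self.images = torch.rand(num_samples, 3, 96, 96, generator=generator)
        # Labels alternate 0, 1, 0, 1, ...
        self.labels = torch.arange(num_samples) % 2
        self.transform = transform

    def __len__(self):
        # Total number of samples.
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (image, label) pair, applying the transform if present.
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]


toy_dataset = ToyPatchDataset(num_samples=10)
image, label = toy_dataset[3]
print(len(toy_dataset), image.shape, int(label))
```

A `DataLoader` calls `__len__` to plan the batches and `__getitem__` to fetch each sample, so any class with these two methods can be batched.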

### The DataLoader
- **Batches** data: Instead of processing 1 image at a time, process N images together
- **Shuffles** training data: Randomizes order each epoch → the model cannot pick up on sample order
- **Parallelizes** loading: `num_workers` loads data in background worker processes (optional)

### Why Shuffle Training Data?
- **Without shuffle:** Model might learn spurious patterns from data order
- **With shuffle:** Each epoch sees data in different order → better generalization
- **Val/Test:** NO shuffle (need consistent, repeatable evaluation)
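
The effect of shuffling is easy to see on a deliberately sorted toy dataset. The sketch below assumes 500 samples of class 0 followed by 500 of class 1, with placeholder features. The names `sorted_dataset`, `ordered_loader`, and `shuffled_loader` are illustrative, and the seed is there only to make the demo repeatable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical sorted dataset: 500 "Normal" (0) samples, then 500 "Tumor" (1).
labels = torch.cat([
    torch.zeros(500, dtype=torch.long),
    torch.ones(500, dtype=torch.long),
])
features = torch.zeros(1000, 1)  # placeholder features; only the labels matter here
sorted_dataset = TensorDataset(features, labels)

ordered_loader = DataLoader(sorted_dataset, batch_size=8, shuffle=False)
shuffled_loader = DataLoader(
    sorted_dataset,
    batch_size=8,
    shuffle=True,
    generator=torch.Generator().manual_seed(0),  # seeded only for repeatability
)

_, first_ordered = next(iter(ordered_loader))
_, first_shuffled = next(iter(shuffled_loader))
print("shuffle=False:", first_ordered.tolist())
print("shuffle=True: ", first_shuffled.tolist())

# Count how many leading batches contain only class 0 when order is preserved.
single_class_batches = 0
for _, batch_labels in ordered_loader:
    if batch_labels.sum() > 0:
        break
    single_class_batches += 1
print("Leading single-class batches:", single_class_batches)
```

Without shuffling, the model would take dozens of gradient steps on a single class before seeing the other one.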

### Batch Size Trade-offs
| Batch Size | GPU Memory | Training Speed | Gradient Noise |
|------------|------------|----------------|----------------|
| Small (8) | Low | Slower | High (more updates) |
| Large (32+) | High | Faster | Low (fewer updates) |

- **For training:** Small batches (8-16) often generalize better
- **For evaluation:** Large batches (32-64) for speed (no backprop needed)


## 📚 Learning Objectives

By the end of this notebook, you will:

1. ✅ Import the provided `PCamDataset` class
2. ✅ Instantiate `train_dataset` with the training CSV and `train_transform`
3. ✅ Create `train_dataloader` with `batch_size=8` and `shuffle=True`
4. ✅ Iterate one batch and verify shapes: `images=[8,3,96,96]`, `labels=[8]`
5. ✅ Understand why we shuffle training data but not validation/test data


## ✅ Acceptance Criteria

Your dataset and dataloader are correct when:

- [ ] `train_dataset` is an instance of `PCamDataset`
- [ ] `len(train_dataset)` returns the number of training samples
- [ ] `train_dataloader` is a `DataLoader` with `shuffle=True`
- [ ] Iterating one batch produces:
  - `images.shape = torch.Size([8, 3, 96, 96])`
  - `labels.shape = torch.Size([8])`
  - `labels` contain only 0s and 1s (Normal/Tumor)
- [ ] You can explain why shuffling matters for training
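
The shape and label criteria above can also be checked programmatically. `check_batch` below is a hypothetical helper, not part of the course code. The example runs on stand-in tensors, and in the notebook you would pass a real batch from `train_dataloader` instead.

```python
import torch


def check_batch(images, labels, batch_size=8):
    """Raise AssertionError if a batch does not match the expected shapes and labels."""
    assert images.shape == (batch_size, 3, 96, 96), f"unexpected images shape: {images.shape}"
    assert labels.shape == (batch_size,), f"unexpected labels shape: {labels.shape}"
    # Every label must be 0 (Normal) or 1 (Tumor).
    assert set(labels.tolist()) <= {0, 1}, f"unexpected label values: {labels.unique().tolist()}"
    return True


# Stand-in tensors with the expected shapes.
fake_images = torch.zeros(8, 3, 96, 96)
fake_labels = torch.tensor([0, 1, 1, 0, 0, 1, 0, 1])
print(check_batch(fake_images, fake_labels))
```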


---

## 💻 TODO 1: Import Required Libraries & Load train_transform

**What you need:**
- `torch` and `torch.utils.data.DataLoader`
- `PCamDataset` from `src.datasets.pcam_dataset`
- The `train_transform` you built in Notebook 01 (rebuild it here)

**Expected behavior:** Imports run without errors.

**⚠️ IMPORTANT:** If you get a ValueError about 'image_id' vs 'filename', the code was updated but Python hasn't reloaded the module. Do this:
1. **Restart the kernel** (Kernel → Restart) to reload all modules
2. Or run this in a code cell before the imports:
   ```python
   import importlib
   import sys
   if 'src.datasets.pcam_dataset' in sys.modules:
       importlib.reload(sys.modules['src.datasets.pcam_dataset'])
   ```


In [1]:
# TODO 1: Import libraries and rebuild train_transform
# Hint: import torch
# Hint: from torch.utils.data import DataLoader
# Hint: from torchvision import transforms
# Hint: from src.datasets.pcam_dataset import PCamDataset

# YOUR CODE HERE
import sys

import torch
from torch.utils.data import DataLoader
from torchvision import transforms

# Add the parent directory to the path so that `src` can be imported
sys.path.append('..')
from src.datasets.pcam_dataset import PCamDataset

# Rebuild train_transform (copy from Notebook 01)
train_transform = transforms.Compose([
    transforms.Resize((96, 96)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])


print("✅ Imports successful and train_transform ready")


✅ Imports successful and train_transform ready


---

## 💻 TODO 2: Instantiate the Training Dataset

**What you need to do:**
```python
train_dataset = PCamDataset(
    csv_file='../data/train_labels.csv',
    transform=train_transform
)
```

**Expected output:** 
- `train_dataset` is a `PCamDataset` object
- `len(train_dataset)` prints the number of training samples


In [2]:
# TODO 2: Create train_dataset
# Hint: train_dataset = PCamDataset(csv_file='...', transform=train_transform)

# YOUR CODE HERE
train_dataset = PCamDataset(
    csv_file="../data/train_labels.csv",
    transform=train_transform
)

print("✅ Training dataset created")
print(f"   Total training samples: {len(train_dataset)}")


✅ Training dataset created
   Total training samples: 600


---

## 💻 TODO 3: Create the Training DataLoader

**What you need to do:**
```python
train_dataloader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=0  # Use 0 for compatibility, or 2-4 for faster loading
)
```

**Key parameters:**
- `batch_size=8` → Process 8 images per batch
- `shuffle=True` → Randomize order each epoch
- `num_workers=0` → No parallel loading (safe default)

**Expected output:** `train_dataloader` is a `DataLoader` object


In [3]:
# TODO 3: Create train_dataloader
# Hint: train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# YOUR CODE HERE
train_dataloader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=0
)

print("✅ Training DataLoader created")
# Read the batch size from the loader instead of hard-coding it
print(f"   Batch size: {train_dataloader.batch_size}")
print(f"   Number of batches: {len(train_dataloader)}")


✅ Training DataLoader created
   Batch size: 8
   Number of batches: 75


---

## 💻 TODO 4: Test the DataLoader — Iterate One Batch

**What you need to do:**
1. Get one batch from `train_dataloader` using a `for` loop (break after first batch)
2. Print the shapes of `images` and `labels`
3. Print the unique label values to verify 0s and 1s

**Expected output:**
```
✅ First batch loaded
   Images shape: torch.Size([8, 3, 96, 96])
   Labels shape: torch.Size([8])
   Unique labels: tensor([0, 1])  (or just [0] or [1] depending on batch)
```


In [5]:
# TODO 4: Iterate one batch and verify shapes
# Hint: for images, labels in train_dataloader:
#           print(images.shape)
#           break

# YOUR CODE HERE
import torch

# Unpack the first batch, print its shapes and label values, then stop
for images, labels in train_dataloader:
    print("✅ First batch loaded")
    print(f"   Images shape: {images.shape}")
    print(f"   Labels shape: {labels.shape}")
    print(f"   Unique labels: {torch.unique(labels)}")
    break


✅ First batch loaded
   Images shape: torch.Size([8, 3, 96, 96])
   Labels shape: torch.Size([8])
   Unique labels: tensor([0, 1])


---

## 🤔 Reflection Prompts

### Question 1: Why Shuffle Training Data?
Imagine your training CSV is sorted: first 500 samples are Normal (0), next 500 are Tumor (1).

**Scenario A:** `shuffle=False`
- What pattern might the model learn in the first few batches?
- Why is this a problem?

**Scenario B:** `shuffle=True`
- How does this fix the problem?

**Your explanation:**
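> With `shuffle=False` on a sorted CSV, every early batch contains only Normal samples (with `batch_size=8`, the first 62 batches), so each gradient step pushes the model toward predicting class 0. When the Tumor block arrives, the updates swing the other way. No batch contrasts the two classes, so training is unstable and the model can end up biased toward whichever class it saw last.
>
> With `shuffle=True`, each batch is a random mix of both classes, so every gradient step approximates the gradient over the whole dataset. Reshuffling each epoch also means the model cannot rely on sample order, which pushes it toward learning image features and generalizing better.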

---

### Question 2: Batch Size Impact
Compare these two setups:

| Setup | Batch Size | Batches per Epoch | Updates per Epoch |
|-------|------------|-------------------|-------------------|
| A | 8 | 125 (1000/8) | 125 |
| B | 32 | 32 (1000/32 = 31.25, rounded up) | 32 |

Assuming 1000 training samples:
- Which setup makes **more gradient updates** per epoch?
- Which setup is **faster** (fewer iterations)?
- Which might **generalize better** (more noisy gradients)?

**Your analysis:**
> - **Setup A (batch_size=8)** makes more gradient updates per epoch (125 vs 32).
> - **Setup B (batch_size=32)** is usually faster per epoch because it runs fewer iterations and uses the hardware more efficiently.
> - **Setup A** may generalize better: its gradients are noisier and it updates more often, which can act as a mild regularizer. This is a tendency, not a guarantee.
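
The batch counts in the table come from ceiling division, because by default a `DataLoader` keeps the final partial batch. A small sketch of that arithmetic follows. `batches_per_epoch` is a hypothetical helper, and `drop_last` mirrors the `DataLoader` argument of the same name.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches per epoch for a dataset with a known length."""
    if drop_last:
        # Discard the final partial batch.
        return num_samples // batch_size
    # Keep the final partial batch.
    return math.ceil(num_samples / batch_size)


for batch_size in (8, 32):
    print(batch_size, batches_per_epoch(1000, batch_size))

# With batch_size=32, the last batch holds only the leftover samples.
last_batch_size = 1000 - 31 * 32
print("last batch size:", last_batch_size)
```

The same formula gives 75 batches for the 600-sample training set at `batch_size=8`, matching the `len(train_dataloader)` output earlier in this notebook.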

---

### Question 3: num_workers Mystery
The `num_workers` parameter controls parallel data loading.

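- `num_workers=0`: Data loads in the main process (simplest to debug, but loading blocks training)
- `num_workers=4`: Data loads in 4 worker processes (usually faster, but adds startup overhead, extra memory use, and platform-specific multiprocessing pitfalls)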

When might you set `num_workers > 0`?  
When might you stick with `num_workers=0`?

**Your answer:**
> **num_workers=0**: Use for small datasets, quick experiments, or debugging, where loading is already fast and a single process gives the clearest error messages.
>
> **num_workers>0**: Use when data loading is the bottleneck, for example large datasets or expensive transforms that leave the GPU waiting. Workers do not corrupt data, but they add process startup cost and memory use, and they can fail on some platforms (notably Windows and notebooks) if the dataset cannot be pickled.
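
Whether extra workers help is best settled by measuring. The sketch below times one pass over a loader. `time_one_epoch` is a hypothetical helper, and the toy tensors stand in for `train_dataset`. Only `num_workers=0` is run here, so the example behaves the same on every platform.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_epoch(loader):
    """Return (seconds, number of batches) for one full pass over the loader."""
    start = time.perf_counter()
    num_batches = 0
    for _ in loader:
        num_batches += 1
    return time.perf_counter() - start, num_batches


# Stand-in data; in the notebook, build the loader from train_dataset instead.
toy_dataset = TensorDataset(
    torch.zeros(64, 3, 96, 96),
    torch.zeros(64, dtype=torch.long),
)
loader = DataLoader(toy_dataset, batch_size=8, shuffle=True, num_workers=0)

seconds, num_batches = time_one_epoch(loader)
print(f"num_workers=0: {num_batches} batches in {seconds:.4f}s")
```

To compare, build a second loader with `num_workers=2` and time it the same way. In a standalone script on Windows or macOS, that code should sit under `if __name__ == "__main__":`, because worker processes re-import the main module.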

---

## 🚀 Next Steps

Excellent! You've loaded training data with batching and shuffling.

**Move to Notebook 03:** Val/Test Transforms (No Augmentation)

**Key Takeaway:** Training DataLoader shuffles; validation/test do NOT!
