# 🎨 Notebook 01: Train Transforms — Data Augmentation Pipeline

**Purpose:** Build a robust augmentation pipeline for training data to improve model generalization.

**What you'll learn:** How to compose transforms in the correct order, why augmentation matters, and how normalization stabilizes training.


## 🎯 Concept Primer: Why Data Augmentation?

### The Problem: Limited Data
- Deep learning models need **thousands** of examples to generalize well
- Medical imaging datasets are expensive to label (expert pathologists needed)
- Small datasets → **overfitting** (model memorizes training examples)

### The Solution: Data Augmentation
- Create **synthetic variety** by applying realistic transformations
- Horizontal flips, rotations, color jitter → model learns invariant features
- **Only applied to training data**: validation and test images stay un-augmented so metrics are comparable from run to run
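
As a rough illustration of what "synthetic variety" means, the sketch below flips a tiny image left to right with probability `p`. The function name `random_hflip` and the nested-list `toy_image` are made up for this example; in the real pipeline torchvision's `RandomHorizontalFlip` does this work on PIL images or tensors.

```python
import random


def random_hflip(image, p=0.5, rng=random):
    """Flip a 2D image (a list of pixel rows) left-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


# Toy 2x3 "image": each inner list is one row of pixel values.
toy_image = [[1, 2, 3],
             [4, 5, 6]]

rng = random.Random(0)  # seeded so repeated runs give the same sequence
for _ in range(4):
    # Each call may or may not flip, so one source image yields several variants.
    print(random_hflip(toy_image, p=0.5, rng=rng))
```

Because the flip is sampled anew on every call, the model rarely sees exactly the same input twice across epochs, even though the underlying dataset has not grown.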

### Transform Order Matters!
```
✅ CORRECT:
Resize → Augmentation (Flip, Rotate, ColorJitter) → ToTensor → Normalize

❌ WRONG:
ToTensor → ColorJitter  (older torchvision versions only accept PIL images in ColorJitter)
Normalize → Resize  (Normalize needs a tensor, so it must come after ToTensor, not first)
```
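
To see why order changes the result, here is a minimal sketch of how a composed pipeline applies its steps from left to right. `MiniCompose`, `to_unit`, and `normalize` are simplified stand-ins written for this example (they work on a single number rather than an image) and are not torchvision's implementation.

```python
class MiniCompose:
    """Simplified stand-in for transforms.Compose: run each step in list order."""

    def __init__(self, steps):
        self.steps = steps

    def __call__(self, x):
        for step in self.steps:
            x = step(x)
        return x


def to_unit(pixel):
    # Mimics only the scaling part of ToTensor: [0, 255] -> [0, 1].
    return pixel / 255.0


def normalize(value, mean=0.5, std=0.5):
    # Same formula as Normalize: (input - mean) / std.
    return (value - mean) / std


correct = MiniCompose([to_unit, normalize])
wrong = MiniCompose([normalize, to_unit])

print(correct(255))  # 1.0, the top of the intended [-1, 1] range
print(wrong(255))    # well above 1, because normalize saw a raw 0-255 value
```

The same two functions give very different outputs depending on their position in the list, which is the behavior the ordering rule protects against.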

### Normalization Deep Dive
- `ToTensor()` converts a PIL image (H×W×C, integers in [0,255]) → float tensor (C×H×W, values in [0,1])
- `Normalize(mean=[0.5,0.5,0.5], std=[0.5,0.5,0.5])` converts [0,1] → [-1,1]
- **Formula:** `output = (input - mean) / std`
- **Why?** Centered data → stable gradients → faster convergence
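
The formula can be checked by hand on a single RGB pixel. The sketch below applies it per channel in plain Python and then inverts it, which is handy when you want to display a normalized image. The helper names `normalize_pixel` and `denormalize_pixel` are illustrative and not part of torchvision.

```python
MEAN = [0.5, 0.5, 0.5]
STD = [0.5, 0.5, 0.5]


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (input - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return [(c - m) / s for c, m, s in zip(rgb, mean, std)]


def denormalize_pixel(rgb, mean=MEAN, std=STD):
    """Invert the normalization: input * std + mean."""
    return [c * s + m for c, m, s in zip(rgb, mean, std)]


pixel = [0.0, 0.25, 1.0]              # values as they would look after ToTensor()
normalized = normalize_pixel(pixel)
print(normalized)                     # [-1.0, -0.5, 1.0]
print(denormalize_pixel(normalized))  # [0.0, 0.25, 1.0]
```

The endpoints 0 and 1 map to -1 and 1, and the midpoint 0.5 maps to 0, so the data ends up centered around zero.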


## 📚 Learning Objectives

By the end of this notebook, you will:

1. ✅ Build `train_transform` with `transforms.Compose()`
2. ✅ Apply augmentations in the correct order
3. ✅ Understand which augmentations are realistic for histopathology
4. ✅ Normalize images to stabilize training
5. ✅ Verify transform output shape: `[3, 96, 96]`


## ✅ Acceptance Criteria

Your transform pipeline is correct when:

- [ ] `train_transform` is a `transforms.Compose` object
- [ ] Transforms are in order: Resize → RandomHorizontalFlip → RandomRotation → ColorJitter → ToTensor → Normalize
- [ ] Applying transform to a sample image produces a tensor of shape `[3, 96, 96]`
- [ ] Tensor values are in range `[-1, 1]` (after normalization)
- [ ] You can explain why ColorJitter comes **before** ToTensor
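
One possible way to self-check the ordering criterion is to compare the class names of the steps stored in the pipeline's `transforms` attribute against the expected list. The helper names below are hypothetical, and the demo uses stand-in objects so the sketch runs without torchvision; with a real pipeline you would call `order_is_correct(train_transform)`.

```python
EXPECTED_ORDER = [
    "Resize", "RandomHorizontalFlip", "RandomRotation",
    "ColorJitter", "ToTensor", "Normalize",
]


def transform_names(pipeline):
    """Return the class names of the steps held in pipeline.transforms."""
    return [type(step).__name__ for step in pipeline.transforms]


def order_is_correct(pipeline, expected=EXPECTED_ORDER):
    """True when the pipeline's steps match the expected names, in order."""
    return transform_names(pipeline) == expected


# Stand-in demo: fake pipeline and fake step objects named like the real classes.
class FakePipeline:
    def __init__(self, transforms):
        self.transforms = transforms


demo_steps = [type(name, (), {})() for name in EXPECTED_ORDER]
print(order_is_correct(FakePipeline(demo_steps)))        # True
print(order_is_correct(FakePipeline(demo_steps[::-1])))  # False
```

This only checks names and order; the shape and value-range criteria still need a real image passed through the transform.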


---

## 💻 TODO 1: Import Required Libraries

**What you need:**
- `torchvision.transforms` for transform classes
- `PIL.Image` to test loading a sample image

**Expected behavior:** Imports run without errors.


In [None]:
# TODO 1: Import transforms and Image
# Hint: from torchvision import transforms
# Hint: from PIL import Image

# YOUR CODE HERE


---

## 💻 TODO 2: Build the Training Transform Pipeline

**What you need to compose (in this order):**

1. **`transforms.Resize((96, 96))`** — Ensure all images are 96×96
2. **`transforms.RandomHorizontalFlip(p=0.5)`** — Flip left-right 50% of the time
3. **`transforms.RandomRotation(degrees=15)`** — Rotate ±15° randomly
4. **`transforms.ColorJitter(brightness=0.2, contrast=0.2)`** — Vary brightness/contrast
5. **`transforms.ToTensor()`** — Convert PIL image → tensor [0,1]
6. **`transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])`** — Normalize to [-1,1]

**Expected output:** A `transforms.Compose` object stored in `train_transform`.
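
To build intuition for the `brightness=0.2` argument, the sketch below mimics the documented sampling rule: a factor is drawn uniformly from `[max(0, 1 - brightness), 1 + brightness]` and the pixel values are scaled by it. This is a simplification on a flat list of values in [0, 1]; `jitter_brightness` is a made-up helper, and torchvision's own code operates on whole images.

```python
import random


def jitter_brightness(pixels, brightness=0.2, rng=random):
    """Scale pixel values in [0, 1] by a random factor and clamp the result."""
    low = max(0.0, 1.0 - brightness)
    high = 1.0 + brightness
    factor = rng.uniform(low, high)
    # Clamp so brightened pixels never leave the valid [0, 1] range.
    jittered = [min(1.0, max(0.0, p * factor)) for p in pixels]
    return jittered, factor


rng = random.Random(42)  # seeded so the demo is repeatable
row = [0.1, 0.5, 0.9]
for _ in range(3):
    jittered, factor = jitter_brightness(row, brightness=0.2, rng=rng)
    print(round(factor, 3), [round(p, 3) for p in jittered])
```

With `brightness=0.2` the image gets at most about 20% darker or brighter, a mild change compared with the variation between differently stained slides.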


In [None]:
# TODO 2: Create train_transform using transforms.Compose()
# Hint: train_transform = transforms.Compose([...])
# Hint: List the 6 transforms above in the correct order

# YOUR CODE HERE
train_transform = None  # Replace this line

print("✅ train_transform created:")
print(train_transform)


---

## 💻 TODO 3: Test the Transform on a Sample Image

**What you need to do:**
1. Load a sample image from `../data/pcam_images/` (pick any `.png` file)
2. Apply `train_transform` to the image
3. Print the shape of the resulting tensor

**Expected output:**
```
Original image: PIL Image object
Transformed tensor shape: torch.Size([3, 96, 96])
Tensor value range: approximately [-1, 1]
```


In [None]:
# TODO 3: Test transform on a sample image
# Hint: sample_img = Image.open('../data/pcam_images/SOME_FILE.png')
# Hint: transformed = train_transform(sample_img)
# Hint: print(transformed.shape)

import os

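# Get a sample image path (PNG files only, sorted so the choice is reproducible)
image_dir = '../data/pcam_images/'
sample_images = sorted(f for f in os.listdir(image_dir) if f.endswith('.png'))
sample_path = os.path.join(image_dir, sample_images[0])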

# YOUR CODE HERE
# Load the image
# Apply train_transform
# Print the shape and value range

print(f"✅ Sample image path: {sample_path}")
# YOUR PRINTS HERE


---

## 🤔 Reflection Prompts

### Question 1: Realistic Augmentations for Histopathology
Which of the following augmentations are **realistic** for H&E-stained pathology slides, and which might **distort clinically relevant information**?

| Augmentation | Realistic? | Reasoning |
|--------------|------------|-----------|
| RandomHorizontalFlip | ✅ / ❌ | ? |
| RandomVerticalFlip | ✅ / ❌ | ? |
| RandomRotation(±15°) | ✅ / ❌ | ? |
| RandomRotation(±180°) | ✅ / ❌ | ? |
| ColorJitter(brightness=0.2) | ✅ / ❌ | ? |
| ColorJitter(hue=0.5) | ✅ / ❌ | ? |
| RandomGrayscale | ✅ / ❌ | ? |

**Your analysis:**

---

### Question 2: Why Not Augment Validation/Test Data?
Explain in your own words:
- Why do we apply augmentation to training data?
- Why would augmentation **break** validation/test evaluation?

**Your explanation:**

---

### Question 3: Normalization Intuition
Given:
- `ToTensor()` converts RGB [0,255] → [0,1]
- `Normalize(mean=[0.5]*3, std=[0.5]*3)` converts [0,1] → [-1,1]

Calculate:
- If a pixel value is `0.8` after `ToTensor()`, what is it after `Normalize`?
- **Formula:** `(input - mean) / std`

**Your calculation:**

---






# .cursorrules
# Francisco Barbosa – AI Mentor Mode
# Purpose: enforce discovery of higher-order ML/engineering practices in every Jupyter notebook.

rules:
  # 1. Detect CNN or image projects missing transfer learning
  - trigger: on_notebook_open
    condition: notebook_contains("torch") and not notebook_contains("resnet") and not notebook_contains("efficientnet")
    action: >
      Insert_after_first_imports:
        - markdown: |
            ### 🧩 Learning Checkpoint: Transfer Learning with ResNet
            You’re using CNNs but not leveraging pretrained backbones.
            **Task:** integrate a pretrained ResNet (or EfficientNet) and compare its validation accuracy vs. your scratch model.
            *Why:* Transfer learning dramatically reduces training time and improves generalization.
        - code: |
            # TODO: Implement transfer learning
            from torchvision import models
            resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            for param in resnet.parameters():
                param.requires_grad = False  # Freeze layers
            # Replace final layer for your num_classes and fine-tune
            # Compare performance with your baseline model.

  # 2. Detect missing explainability
  - trigger: on_notebook_save
    condition: model_trained == true and not notebook_contains("shap") and not notebook_contains("lime")
    action: >
      append_to_notebook_end:
        - markdown: |
            ### 🔍 Learning Checkpoint: Model Explainability
            You’ve trained a model—now interpret it.
            **Task:** apply SHAP or LIME to visualize which features drive predictions.
            *Why:* Understanding model reasoning prevents overfitting and bias.
        - code: |
            # TODO: Add model explainability
            import shap
            explainer = shap.Explainer(model, X_valid)
            shap_values = explainer(X_valid)
            shap.summary_plot(shap_values, X_valid)

  # 3. Enforce testing and CI mindset
  - trigger: on_notebook_save
    condition: not repo_contains("tests/") and (notebook_contains("model.fit") or notebook_contains("train"))
    action: >
      create_file("tests/test_model.py", """
      import pytest

      def test_training_runs():
          assert True, "Add a small unit test for your training function"
      """)
      insert_markdown_at_end: |
          ### 🧪 Learning Checkpoint: Testing & Automation
          **Task:** create minimal Pytest tests for data loading and model training.
          Later, set up a GitHub Action to run `pytest` on every push.
          *Why:* This builds reproducibility and early error detection.

  # 4. Encourage experiment tracking
  - trigger: on_notebook_open
    condition: not notebook_contains("wandb") and model_training_detected == true
    action: >
      insert_markdown_after_heading("Training", """
      ### 📈 Learning Checkpoint: Experiment Tracking
      Integrate Weights & Biases (wandb) or MLflow to log metrics and hyperparameters.
      *Why:* Versioning experiments saves you from 'mystery improvements.'
      """)
      insert_code_after_heading("Training", """
      # TODO: Add experiment tracking
      import wandb
      wandb.init(project="periospot_learning")
      wandb.log({"loss": loss, "accuracy": acc})
      """)

  # 5. Reinforce data ethics and bias reflection
  - trigger: on_notebook_close
    condition: dataset_detected == true and not notebook_contains("bias") 
    action: >
      append_to_notebook_end:
        - markdown: |
            ### ⚖️ Reflection Checkpoint: Bias and Fairness
            **Task:** Examine potential biases in your dataset (class imbalance, demographic skew, etc.).
            *Why:* Ethical evaluation prevents spurious conclusions and builds trust in models.
        - code: |
            # TODO: Evaluate bias
            import pandas as pd
            df['target'].value_counts(normalize=True).plot(kind='bar')

  # 6. Missing RAG or data pipeline for text projects
  - trigger: on_notebook_open
    condition: (notebook_contains("text") or notebook_contains("nlp")) and not notebook_contains("rag") and not notebook_contains("langchain")
    action: >
      append_to_notebook_end:
        - markdown: |
            ### 🧮 Learning Checkpoint: Retrieval-Augmented Generation
            You’re processing text but not retrieving context.
            **Task:** integrate a simple RAG pipeline using LangChain or FAISS.
            *Why:* RAG connects your models to domain knowledge, critical for Periospot AI.
        - code: |
            # TODO: Build minimal RAG prototype
            from langchain.chains import RetrievalQA
            from langchain.vectorstores import FAISS
            # Build embeddings, create retriever, and connect to an LLM

  # 7. Add README scaffolding if absent
  - trigger: on_repo_open
    condition: not repo_contains("README.md")
    action: >
      create_file("README.md", """
      # Project Title
      ## Overview
      Describe dataset, objective, metrics, and key learnings.
      ## To-Dos
      - [ ] Add explainability (SHAP/LIME)
      - [ ] Integrate ResNet transfer learning
      - [ ] Set up pytest and CI
      - [ ] Reflect on bias
      """)



## 🚀 Next Steps

Great work! You've built a training transform pipeline with augmentation.

**Move to Notebook 02:** Train Dataset & DataLoader

**Key Takeaway:** Transform order matters — Resize/Aug → ToTensor → Normalize!
