In [None]:
# DS776 Auto-Update (runs in ~2 seconds, only updates when needed)
# If this cell fails, see Lessons/Course_Tools/AUTO_UPDATE_SYSTEM.md for help
%run ../../Lessons/Course_Tools/auto_update_introdl.py

# Homework 06 Assignment
**Name:** [Student Name Here]  
**Total Points:** 50

## Submission Checklist
- [ ] All code cells executed with output saved
- [ ] All questions answered
- [ ] Notebook converted to HTML (use the Homework_06_Utilities notebook)
- [ ] Canvas notebook filename includes `_GRADE_THIS_ONE`
- [ ] Files uploaded to Canvas

---

# Computer Vision: Segmentation and Object Detection

This assignment has two primary tasks that explore advanced computer vision applications:

1. **UNet and UNet++ for Nuclei Segmentation**: Explore semantic segmentation using UNet architectures on a biomedical imaging task described in the textbook.
2. **YOLO v11 for Pedestrian Detection**: Fine-tune a state-of-the-art YOLO model for object detection and compare results to the Faster R-CNN model from the lesson.

Both tasks will help you understand the differences between segmentation (pixel-level classification) and detection (bounding box prediction) approaches in computer vision.
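
To make the contrast concrete, here is a minimal sketch in plain Python (no PyTorch) that takes a small binary mask, the kind of output a segmentation model produces, and derives the tightest bounding box around it, the kind of output a detector produces. The function name `mask_to_box` and the toy mask are hypothetical and are not part of the course helpers.

```python
def mask_to_box(mask):
    """Return the tightest (xmin, ymin, xmax, ymax) box around the nonzero pixels.

    `mask` is a list of rows, each row a list of 0/1 values.
    Returns None when the mask has no foreground pixels.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# A toy 4x5 mask with a plus-shaped object (hypothetical data)
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

box = mask_to_box(mask)
foreground_pixels = sum(sum(row) for row in mask)
box_pixels = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

print(box)                            # (1, 1, 3, 3)
print(foreground_pixels, box_pixels)  # 5 9
```

The mask labels exactly 5 pixels as foreground, while the box covers 9 pixels, so the box keeps the object's location and extent but not its shape.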

In [None]:
# === YOUR IMPORTS HERE ===
# Add any additional imports you need below this line

import torch
import torchvision.transforms.v2 as transforms
from torch.utils.data import Dataset, DataLoader
from torchvision.io import read_image
from torchvision import tv_tensors
from pathlib import Path

# Import local modules
from Lesson_06_Helpers import display_yolo_predictions, prepare_penn_fudan_yolo

from introdl.utils import config_paths_keys

# Configure paths
paths = config_paths_keys()
DATA_PATH = paths['DATA_PATH']
MODELS_PATH = paths['MODELS_PATH']
# === END YOUR IMPORTS ===

## [20 pts] UNet and UNet++ Segmentation

You'll use the `segmentation-models-pytorch` package, as we did in the lesson, to fine-tune and evaluate UNet and UNet++ models on the nuclei segmentation task shown in the textbook.

The data download is already set up for you. The following cells contain most of a custom dataset class and transforms to get you started. Complete the code sections marked with `# === YOUR CODE HERE ===` to read the images and masks, add appropriate augmentation transforms, and implement the model training.

In [None]:
# Run this cell once to download the Nuclei Segmentation dataset

from Lesson_06_Helpers import download_and_extract_nuclei_data

# Call the function
download_and_extract_nuclei_data(DATA_PATH)

In [None]:
# === YOUR CODE HERE ===
# TODO: Complete the NucleiDataset class and data loading setup
# - Complete the dataset class by filling in the marked sections:
#   * Read image from img_path, convert to float and scale to [0, 1]
#   * Read mask from mask_path, map values > 0 to 1 and the rest to 0, convert to float
# - Add appropriate augmentation transforms for training:
#   * Consider: RandomHorizontalFlip, RandomVerticalFlip, RandomRotation
#   * Use transforms that work with both images and masks simultaneously
# - Create training and validation datasets
# - Create DataLoaders with batch_size=8

class NucleiDataset(Dataset):
    def __init__(self, root, transform=None):
        """
        Args:
            root (str or Path): Path to the dataset (train or val folder).
            transform (callable, optional): Optional transforms to apply to both image and mask.
        """
        self.root = Path(root)  # Convert to pathlib Path object
        self.transform = transform
        self.data = []  # List to store (image_tensor, mask_tensor) tuples

        # Load all image and mask files
        all_imgs = sorted((self.root / "images").iterdir())
        all_masks = sorted((self.root / "masks").iterdir())

        # Ensure that the number of images and masks are the same
        assert len(all_imgs) == len(all_masks), "The number of images and masks must be the same"        

        # Read and store images and masks as tensors in memory
        for img_path, mask_path in zip(all_imgs, all_masks):
            # === YOUR CODE: Read images and masks as tensors ===
            # Read image from img_path, convert to float and scale to [0, 1]
            image = # TODO: Complete this line
            
            # Read mask from mask_path, any entries bigger than 0 map to 1, rest to 0, convert to float
            mask = # TODO: Complete this line

            # Store as tuple
            self.data.append((tv_tensors.Image(image), tv_tensors.Mask(mask)))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        image, mask = self.data[idx]

        # Apply transforms if provided
        if self.transform:
            image, mask = self.transform(image, mask)

        return image, mask

# === YOUR CODE: Define transforms and create datasets ===
# Add your augmentation transforms here for training
train_transforms = transforms.Compose([
    # TODO: Add appropriate augmentation transforms
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Define transforms for validation (without augmentation)
val_transforms = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# TODO: Load datasets and create dataloaders
# train_dataset = ...
# val_dataset = ...
# train_loader = ...
# val_loader = ...

# === END YOUR CODE ===

Now set up and train UNet and UNet++ models with a pretrained ResNet50 backbone, as we did in the lesson. Model your code on the code in the "Better Training" part of the lesson notebook. Set different learning rates for the encoder and decoder and use OneCycleLR, as we did there. We found that 12 epochs of fine-tuning worked reasonably well.
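
As a reminder of the mechanics, the sketch below shows one way to give two parts of a model different learning rates and drive both with `OneCycleLR`, which accepts a list of `max_lr` values, one per parameter group. It is only an illustration under stated assumptions: `TinySegNet` is a hypothetical stand-in for an encoder/decoder model, the data is random, and the learning rates and step counts are placeholders rather than the values used in the lesson.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinySegNet(nn.Module):
    """Hypothetical stand-in with separate encoder and decoder parts."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.decoder = nn.Conv2d(8, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return self.decoder(torch.relu(self.encoder(x)))


model = TinySegNet()

# One parameter group per part: lower rate for the (pretrained) encoder,
# higher rate for the decoder. The values are placeholders.
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-4},
    {"params": model.decoder.parameters(), "lr": 1e-3},
])

epochs, steps_per_epoch = 12, 10  # placeholder sizes
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=[1e-4, 1e-3],  # peak rate per parameter group, same order as above
    epochs=epochs,
    steps_per_epoch=steps_per_epoch,
)

loss_fn = nn.BCEWithLogitsLoss()
images = torch.rand(2, 3, 16, 16)                   # random stand-in batch
masks = (torch.rand(2, 1, 16, 16) > 0.5).float()    # random binary masks

encoder_lrs, decoder_lrs = [], []
for _ in range(epochs * steps_per_epoch):
    enc_lr, dec_lr = scheduler.get_last_lr()
    encoder_lrs.append(enc_lr)
    decoder_lrs.append(dec_lr)

    optimizer.zero_grad()
    loss = loss_fn(model(images), masks)
    loss.backward()
    optimizer.step()
    scheduler.step()  # OneCycleLR is stepped once per batch, not once per epoch

print(f"peak encoder lr: {max(encoder_lrs):.1e}")
print(f"peak decoder lr: {max(decoder_lrs):.1e}")
```

The scheduler overrides the `lr` values given to the optimizer, so the `max_lr` list is what actually controls the peak rate of each group.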

For each model display convergence graphs of the loss and IoU and sample images along with the ground truth and predicted masks.
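
For reference, IoU for a single binary mask is the number of pixels that are foreground in both the prediction and the ground truth, divided by the number of pixels that are foreground in either. The plain-Python sketch below computes it on a toy pair of masks. The function name `binary_iou` and the masks are hypothetical, and the metric you track during training may differ in details such as thresholding and averaging over a batch.

```python
def binary_iou(pred_mask, true_mask):
    """Intersection over Union for two binary masks given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    # Convention assumed here: two empty masks count as a perfect match
    return intersection / union if union else 1.0


# Toy masks: the prediction is shifted one pixel to the right of the truth
true_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred_mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

iou = binary_iou(pred_mask, true_mask)
print(f"IoU = {iou:.3f}")  # IoU = 0.333 (2 shared pixels out of 6 in the union)
```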

In [None]:
# === YOUR CODE HERE ===
# TODO: Set up and train UNet and UNet++ models
# - Install required packages: !pip install segmentation-models-pytorch
# - Import segmentation_models_pytorch as smp
# - Create UNet model with ResNet50 encoder, pretrained weights
# - Create UNet++ model with ResNet50 encoder, pretrained weights
# - Set different learning rates for encoder (lower) and decoder (higher)
# - Use OneCycleLR scheduler
# - Train each model for 12 epochs
# - Track loss and IoU metrics
# - Save model checkpoints


# === END YOUR CODE ===

📝 **Answer the following followup questions:**

1. Which model performs better? Support your answer with specific metrics.

📝 **YOUR ANSWER HERE:**

2. Use AI to write a short summary of the difference between UNet and UNet++.

📝 **YOUR ANSWER HERE:**

3. Report the highest value of the IoU metric on the validation set. Interpret that value in the context of this problem. What is it telling you about the predicted masks for the cell nuclei?

📝 **YOUR ANSWER HERE:**

## [20 pts] YOLO v11 Pedestrian Detection

YOLO (You Only Look Once) models are a family of object detection models known for their speed and accuracy. Unlike traditional object detection methods that use a sliding window approach, YOLO models frame object detection as a single regression problem, directly predicting bounding boxes and class probabilities from the full image in one evaluation. This single-pass design makes YOLO models fast enough for real-time applications.

YOLO models consist of a single convolutional network that simultaneously predicts multiple bounding boxes and class probabilities for those boxes. The architecture is divided into several key components:

1. **Backbone**: This is typically a convolutional neural network (CNN) that extracts essential features from the input image.
2. **Neck**: This part of the network aggregates and combines features from different stages of the backbone. It often includes components like Feature Pyramid Networks (FPN) or Path Aggregation Networks (PAN).
3. **Head**: The final part of the network, which predicts the bounding boxes, objectness scores, and class probabilities. It usually consists of convolutional layers that output the final detection results.

YOLO models are quite easy to load and train because they provide pre-trained weights and a straightforward API for customization and fine-tuning. The hardest part may be preparing the data in the format that the API expects, but we've done that for you.
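
For context on what that format looks like: Ultralytics detection labels are plain-text files with one row per object, `class x_center y_center width height`, where the box values are normalized by the image width and height. The sketch below converts a pixel-coordinate box to such a row. The function name `to_yolo_label` and the example numbers are hypothetical, and this is not necessarily how `prepare_penn_fudan_yolo` is implemented.

```python
def to_yolo_label(class_id, box_xyxy, image_width, image_height):
    """Convert a pixel box (xmin, ymin, xmax, ymax) to a YOLO label row."""
    xmin, ymin, xmax, ymax = box_xyxy
    x_center = (xmin + xmax) / 2 / image_width
    y_center = (ymin + ymax) / 2 / image_height
    width = (xmax - xmin) / image_width
    height = (ymax - ymin) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical pedestrian box in a 400x500 (width x height) image
label_row = to_yolo_label(0, (100, 50, 300, 450), image_width=400, image_height=500)
print(label_row)  # 0 0.500000 0.500000 0.500000 0.800000
```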

**Installation Note:** You'll need to install required packages. Add this to a code cell and run it once on each server you use:
```python
!pip install ultralytics torchmetrics
```

Run the cell below once to prepare the Penn Fudan Pedestrian dataset in YOLO format. This dataset uses the same splits we used in the lesson to allow you to compare the results to the Faster R-CNN model we trained there.

In [None]:
# only need to run this once per platform, but it's safe to run multiple times
prepare_penn_fudan_yolo(DATA_PATH)

# the dataset will be here:
dataset_path = DATA_PATH / "PennFudanPedYOLO"

# you may wish to set an output path for the model
output_path = MODELS_PATH / "PennFudanPedYOLO"

# the YAML file for the dataset is here:
yaml_path = dataset_path / "dataset.yaml"

Visit the Ultralytics website to learn about YOLO11; it includes a short introductory video. Below, implement code to load and train a YOLO11 model using the 'yolo11s.pt' pretrained weights. Pass `project=output_path` to `model.train()` to store the output in your models directory. After training, you may want to look at some of the images created in that directory.

In [None]:
# === YOUR CODE HERE ===
# TODO: Install required packages and train YOLO11 model
# - Install ultralytics and torchmetrics: !pip install ultralytics torchmetrics
# - Import YOLO from ultralytics
# - Load YOLO11 model with 'yolo11s.pt' pretrained weights
# - Train the model on the Penn Fudan dataset
# - Use project=output_path to save results in your models directory
# - Train for appropriate number of epochs (experiment with 10-20)
# - Monitor training progress and metrics


# === END YOUR CODE ===

You can run the following cell to show selected images and boxes from the validation set. You can replace `indices=selected_indices` with `num_samples=3` to display 3 randomly selected images. The selected images we chose should align with the images we showed in the lesson.

In [None]:
selected_indices = [28,29,33]
display_yolo_predictions(yaml_path, model, indices=selected_indices, show_confidence=True, conf=0.5)

📝 **Answer the following followup questions:**

1. Find and plot an image with a false positive box in the validation data.

📝 **YOUR ANSWER HERE:**

2. How does the process of fine-tuning the YOLO model differ from fine-tuning the Faster R-CNN model in the lesson? Is it easier or harder? Why?

📝 **YOUR ANSWER HERE:**

3. What did you get for mAP50 and mAP50:95 on the validation data with your YOLO model?

📝 **YOUR ANSWER HERE:**

4. How do those values compare to values in the lesson?

📝 **YOUR ANSWER HERE:**

5. How do the predicted boxes compare qualitatively to the boxes predicted by Faster R-CNN in the lesson? Do they align better or worse with the ground truth boxes?

📝 **YOUR ANSWER HERE:**

6. Thoroughly explain what your mAP50 value tells you about the performance of your YOLO model at detecting pedestrians in the validation data.

📝 **YOUR ANSWER HERE:**

## [8 pts] Questions from Chapter 13 and Computer Vision Concepts

**Question 1 (3 pts):** Based on your experience with both segmentation (UNet/UNet++) and object detection (YOLO) in this homework:
- Explain the fundamental difference between semantic segmentation and object detection in terms of their outputs and applications.
- Which approach would be more suitable for autonomous driving applications, and why?
- How do the computational requirements typically compare between these two approaches?

📝 **YOUR ANSWER HERE:**

**Question 2 (3 pts):** Transfer learning concepts from Chapter 13 apply to both tasks in this homework:
- How did transfer learning help in both the segmentation and detection tasks? What pretrained weights did you use?
- Why is transfer learning particularly effective for computer vision tasks compared to training from scratch?
- Based on the concepts in Figure 13.1, explain why using different learning rates for the encoder and the decoder (in segmentation models) is a good transfer learning strategy.

📝 **YOUR ANSWER HERE:**

**Question 3 (2 pts):** Model evaluation metrics:
- Explain what IoU (Intersection over Union) measures in segmentation tasks and why it's more informative than simple pixel accuracy.
- What does mAP50 measure in object detection, and why is it considered a comprehensive evaluation metric?

📝 **YOUR ANSWER HERE:**

## [2 pts] Reflection

1. What, if anything, did you find difficult to understand for this lesson? Why?

📝 **YOUR ANSWER HERE:**

2. What resources did you find supported your learning most and least for this lesson? (Be honest - I use your input to shape the course.)

📝 **YOUR ANSWER HERE:**

### Cleanup Note

**Note:** YOLO downloads pretrained model files (like `yolo11s.pt` or `yolo11n.pt`) to your current directory on first use. These files are safe to delete after training is complete if you want to save space - they will be re-downloaded if needed in the future.