# Deep Learning: Assignment #2
## Submission date: 24/12/2025, 23:59.
### Topics:
- Regularization
- Batch Normalization
- Convolutional Neural Networks
- Semantic Segmentation


**Submitted by:**

- **Student 1 (Name, ID)**
- **Student 2 (Name, ID)**  


**Assignment Instructions:**

· Submissions are in **pairs only**. Write both names + IDs at the top of the notebook.

· Keep your code **clean, concise, and readable**.

· You may work in your IDE, but you **must** paste the final code back into the **matching notebook cells** and run it there.  


· <font color='red'>Write your textual answers in red.</font>  
(e.g., `<span style="color:red">your answer here</span>`)

· All figures, printed results, and outputs should remain visible in the notebook.  
Run **all cells** before submitting and **do not clear outputs**.

· Use relative paths — **no absolute file paths** pointing to local machines.

· **Important:** Your submission must be entirely your own.  
Any form of plagiarism (including uncredited use of ChatGPT or AI tools) will result in **grade 0** and disciplinary action.


In [None]:
# Suggested uploading script
! pip install -q kaggle
! mkdir -p ~/.kaggle
! kaggle datasets download jcoral02/camvid
! unzip -q camvid.zip -d data

In [None]:
# --- Global Setup ---

# Import Libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import itertools
import random
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader, random_split, Dataset
from torchvision import datasets, transforms
import torch.optim as optim
import torch.nn.init as init
from tqdm import tqdm
import os
from glob import glob
import pandas as pd
from PIL import Image
from google.colab import files
import zipfile


# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

## Question 1: Convolutional Digit Classification on SVHN (25 Points)

In this question, our goal is to implement a Convolutional Neural Network (CNN) for image classification on the Street View House Numbers (SVHN) dataset. The dataset consists of real-world house number images.

**source:** http://ufldl.stanford.edu/housenumbers/

### Data Loading and Preprocessing

In this section we will load, explore and preprocess the dataset for training.

You are given the **SVHN** (Street View House Numbers) dataset: a collection of real-world images of digits (0–9) captured from house numbers in Google Street View. Each image is 32×32 pixels and contains three color channels (RGB). The goal is to classify each image into one of the 10 digit classes (0 – 9).

The dataset will be downloaded automatically to the local environment using the `torchvision.datasets.SVHN` class.

For this section, implement the preprocessing procedure and explain your choice, then create the loaders for the train and test sets.
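
As a hedged illustration of one possible preprocessing choice, the sketch below assumes each channel is normalized with mean 0.5 and standard deviation 0.5 (illustrative values, not measured SVHN statistics). The helper names `normalize` and `denormalize` are hypothetical and only demonstrate the arithmetic: pixel values in [0, 1] are mapped to [-1, 1], and the inverse mapping is needed before displaying an image.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] to (pixel - mean) / std."""
    return (pixel - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Invert normalize(); needed before showing an image with imshow."""
    return value * std + mean


# With mean = std = 0.5 (assumed), [0, 1] is mapped onto [-1, 1].
for p in (0.0, 0.5, 1.0):
    z = normalize(p)
    print(f"{p:.1f} -> {z:+.1f} -> {denormalize(z):.1f}")
```

Centering the inputs around zero in this way is a common default; using per-channel statistics computed from the training set would be an equally reasonable alternative.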



In [None]:

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load the SVHN digits dataset
train_dataset = datasets.SVHN(root='./data', split='train', download=True, transform=transform)
test_dataset  = datasets.SVHN(root='./data', split='test', download=True, transform=transform)

# Inspect the dataset
print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of testing samples: {len(test_dataset)}")

# Get one image and label
image, label = train_dataset[0]
print(f"Shape of one image: {image.shape} (C x H x W)")
print(f"Label of first image: {label}")

batch_size = 128

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


Finally, run the cell below to take a look at a few sample images to better understand the dataset we're working with.

In [None]:
images, labels = next(iter(train_loader))

# Show first 4 images
fig, axes = plt.subplots(1, 4, figsize=(10, 4))
for i in range(4):
  img = images[i].permute(1, 2, 0).numpy()
  img = np.clip(img * 0.5 + 0.5, 0, 1)  # undo Normalize(mean=0.5, std=0.5) for display
  axes[i].imshow(img)
  axes[i].set_title(f"Class {labels[i].item()}")
  axes[i].axis('off')
plt.show()

**Answer the following Questions:**

<font color="red">1. How is SVHN fundamentally harder than MNIST?</font>  
<span style="color:red">SVHN is fundamentally harder than MNIST because it consists of real-world images from Google Street View, meaning the digits are embedded in natural scenes with cluttered backgrounds, variable lighting conditions, and different fonts/styles. Unlike MNIST, which has centered, clean, handwritten digits on a black background, SVHN images are RGB (3 color channels) and often contain parts of neighboring digits.</span>

<font color="red">2. Which preprocessing or architectural choices become necessary because of this difference?</font>
<span style="color:red">Because of the color complexity and background clutter, we need deeper architectures with convolutional layers to extract hierarchical features (edges, textures, shapes) rather than simple fully connected networks. We also need to process 3 input channels (RGB) instead of 1. Preprocessing steps like Normalization (mean subtraction and scaling) are crucial to help the model converge faster given the varying pixel intensity distributions in real-world images.</span>



### CNN Architecture Design

We will design a convolutional neural network (inspired by AlexNet) for digit classification.
The network consists of **three convolutional feature-extraction stages**, followed by a **two-layer fully connected classifier**.

Your architecture must follow the structure below:

- Convolutional Layer with 32 output channels, kernel size = 3×3, stride = 1, padding = 1
- ReLU activation function
- MaxPooling Layer with a kernel size of 3×3 and a stride of 2

- Convolutional Layer with 64 output channels, kernel size = 3×3, stride = 1, padding = 1
- ReLU activation function.
- Convolutional Layer with 128 output channels, kernel size = 3×3, stride = 1, padding = 1

- ReLU activation function.

- MaxPooling Layer with a kernel size of 3×3 and a stride of 2.

- Dropout Layer  with a dropout probability of 0.5.

- Fully Connected Layer with output size of 128.

- ReLU activation function

- Dropout Layer  with a dropout probability of 0.5.

- Fully Connected Layer with output size of 10 (for the 10 digit classes, 0–9).
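
The spatial sizes implied by the layer list above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a 32×32 input, no padding in the pooling layers, and floor rounding (PyTorch's default when `ceil_mode=False`); `out_size` is a hypothetical helper name.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                             # SVHN images are 32x32
size = out_size(size, kernel=3, stride=1, padding=1)  # conv, 32 channels -> 32
size = out_size(size, kernel=3, stride=2)             # max pool          -> 15
size = out_size(size, kernel=3, stride=1, padding=1)  # conv, 64 channels -> 15
size = out_size(size, kernel=3, stride=1, padding=1)  # conv, 128 channels -> 15
size = out_size(size, kernel=3, stride=2)             # max pool          -> 7

flatten_dim = 128 * size * size  # input features of the first FC layer
print(size, flatten_dim)         # 7 6272
```

The resulting `flatten_dim` is the input size that the first fully connected layer must accept.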



In [None]:

class SVHN_CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SVHN_CNN, self).__init__()
        
        # Block 1: Conv -> ReLU -> MaxPool
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=3, stride=2)
        
        # Block 2: Conv -> ReLU
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
        
        # Block 3: Conv -> ReLU -> MaxPool
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=3, stride=2)
        
        # Dropout
        self.dropout1 = nn.Dropout(p=0.5)
        
        # FC layers
        # Calculation for input size:
        # Input: 32x32
        # After pool1 (k=3, s=2): floor((32-3)/2 + 1) = 15 -> 15x15
        # After pool3 (k=3, s=2): floor((15-3)/2 + 1) = 7 -> 7x7
        self.flatten_dim = 128 * 7 * 7
        
        self.fc1 = nn.Linear(self.flatten_dim, 128)
        self.dropout2 = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        # Block 1
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        
        # Block 2
        x = F.relu(self.conv2(x))
        
        # Block 3
        x = F.relu(self.conv3(x))
        x = self.pool3(x)
        
        # Flatten
        x = x.view(x.size(0), -1)
        
        # FC layers
        x = self.dropout1(x)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        
        return x


**Answer the following Questions:**

<font color="red">1. Which part(s) of your CNN most strongly influence receptive field size?</font>  
<span style="color:red">The convolutional kernel sizes, strides, and especially the pooling layers (which increase the stride effectively) most strongly influence the receptive field size. Stacking multiple convolutional layers also increases the receptive field linearly, but pooling/strided convolutions increase it multiplicatively (relative to the input).</span>

<font color="red">2. Why does receptive field matter for recognizing digits embedded in cluttered scenes?</font>
<span style="color:red">A larger receptive field allows the neuron to "see" more of the input image at once. This is critical for context. In cluttered scenes, the network needs to distinguish the digit from the background noise. If the receptive field is too small, the network might only see a curve or a line and confuse it with background texture. A sufficiently large receptive field ensures the network captures the entire digit and its immediate context to make a correct classification.</span>
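
To make the receptive-field discussion concrete, here is a small sketch that applies the standard recurrence (the receptive field grows by `(kernel - 1) * jump` per layer, and the jump is multiplied by the stride). It assumes the conv/pool sequence of the architecture described above; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output.

    Returns the receptive field size, in input pixels, of one output unit.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth scales with the accumulated stride
        jump *= stride
    return rf


# conv3x3 -> pool3x3/2 -> conv3x3 -> conv3x3 -> pool3x3/2
with_pooling = receptive_field([(3, 1), (3, 2), (3, 1), (3, 1), (3, 2)])
# The same three convolutions without any pooling
convs_only = receptive_field([(3, 1), (3, 1), (3, 1)])

print(with_pooling, convs_only)  # 17 7
```

Under these assumptions the pooled network sees a 17×17 input window per output unit, compared with 7×7 for the three convolutions alone, which illustrates why stride and pooling dominate receptive-field growth.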



Now we will setup all training parameters and train the model.

Your tasks in this section are:
1. Create an instance of the model.
2. Choose an optimizer and a loss function, and explain your choices.
3. Fill in the missing code in the training function.
4. Train the model for ~6 epochs on the training set (on a Colab CPU this should take ~30 minutes).
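
One common loss for 10-class classification is cross-entropy computed on raw logits (this is what `nn.CrossEntropyLoss` expects, since it applies log-softmax internally). The following is a plain-Python sketch of that computation for a single sample, not the library implementation; the function name `cross_entropy` and the example logits are illustrative assumptions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)  # subtract the max to avoid overflow in exp()
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# An untrained model with uniform logits gives a loss of log(10) ~ 2.30.
uniform = cross_entropy([0.0] * 10, target=3)

# A confident, correct prediction gives a loss close to zero.
confident = cross_entropy([0.0] * 3 + [10.0] + [0.0] * 6, target=3)

print(f"{uniform:.4f} {confident:.4f}")
```

A first-epoch training loss near log(10) is therefore a useful sanity check that the model starts from roughly uniform predictions.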

In [None]:

# Create model instance
model = SVHN_CNN(num_classes=10)

# Loss function: CrossEntropyLoss is standard for multi-class classification
criterion = nn.CrossEntropyLoss()

# Optimizer: Adam is a good default choice for faster convergence compared to SGD
optimizer = optim.Adam(model.parameters(), lr=0.001)

print(model)


Now, fill in the missing code for the training function and train the model for 10 epochs.

In [None]:

def train_model(model, train_loader, criterion, optimizer, device, num_epochs=10):
  model.to(device)

  for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    # Progress bar for better visualization
    loop = tqdm(train_loader, leave=True)
    loop.set_description(f"Epoch [{epoch+1}/{num_epochs}]")

    for images, labels in loop:
      images, labels = images.to(device), labels.to(device) 

      # Zero gradients
      optimizer.zero_grad()

      # Forward pass
      outputs = model(images)
      loss = criterion(outputs, labels)

      # Backward pass and optimize
      loss.backward()
      optimizer.step()

      running_loss += loss.item()
      _, predicted = torch.max(outputs, 1)
      total += labels.size(0)
      correct += (predicted == labels).sum().item()
      
      # Update progress bar
      loop.set_postfix(loss=loss.item())

    train_acc = 100 * correct / total
    train_loss = running_loss / len(train_loader)

    print(f"Epoch [{epoch+1}/{num_epochs}] "
          f"Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")


In [None]:

# Train for 10 epochs
train_model(model, train_loader, criterion, optimizer, device, num_epochs=10)


The following function evaluates a given model on the loaded test set. Use it to evaluate your trained model on the test set loader.

In [None]:
def evaluate_model(model, loader):
  model.eval()
  correct, total = 0, 0

  with torch.no_grad():
    for images, labels in loader:
      images, labels = images.to(device), labels.to(device)
      outputs = model(images)
      _, predicted = outputs.max(1)
      total += labels.size(0)
      correct += (predicted == labels).sum().item()

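  accuracy = 100 * correct / total
  return accuracy

# Evaluate the trained baseline model on the test set
print(f"Test Accuracy: {evaluate_model(model, test_loader):.2f}%")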

**Answer the following Question:**

<font color="red">1. If training loss drops very quickly in the early epochs, is that always a good sign — or could it signal a potential problem?</font>
<span style="color:red">It is generally a good sign that learning is happening, but it's not *always* good. If it drops too precipitously and then plateaus immediately, it might indicate that the learning rate is too high (potentially causing instability later) or that the model is overfitting to the easy examples very quickly. If the validation loss doesn't follow the training loss (i.e., val loss stays high or goes up), then a sharp drop in training loss signals overfitting. However, in the early stages of training a CNN on a dataset like SVHN, a quick initial drop is expected as the model learns basic features.</span>



### Visualizing Feature Maps

To deepen our understanding of what the CNN learns, we will visualize **feature maps** (activations) produced inside the network when passing a single image forward.

Feature maps show *where* the network detects edges, curves, textures, and higher-level structures.  

In this section, select one test image, pass it through the CNN and finally visualize activation maps from different convolution layers.


In [None]:

# Get a single image
model.eval()
image, label = test_dataset[0]
input_tensor = image.unsqueeze(0).to(device)

# Function to get activations
activations = {}
def get_activation(name):
    def hook(model, input, output):
        activations[name] = output.detach()
    return hook

# Register hooks
model.conv1.register_forward_hook(get_activation('conv1'))
model.conv2.register_forward_hook(get_activation('conv2'))
model.conv3.register_forward_hook(get_activation('conv3'))

# Forward pass
output = model(input_tensor)

# Visualize
def plot_feature_maps(layer_name, num_maps=8):
    act = activations[layer_name].cpu().squeeze()
    fig, axes = plt.subplots(1, num_maps, figsize=(15, 3))
    for i in range(num_maps):
        if i < act.size(0):
            axes[i].imshow(act[i], cmap='viridis')
            axes[i].axis('off')
    plt.suptitle(f"Feature Maps from {layer_name}")
    plt.show()

# Show input image
plt.imshow(image.permute(1, 2, 0) * 0.5 + 0.5) # unnormalize
plt.title(f"Input Image (Class {label})")
plt.axis('off')
plt.show()

plot_feature_maps('conv1')
plot_feature_maps('conv2')
plot_feature_maps('conv3')


### Architecture Modification Experiment

Modify your `SVHN_CNN` model by removing or relocating different kinds of layers.

1. Propose two significant architectural changes.
2. Implement your modified models as
  - `SVHN_CNN_v2`
  - `SVHN_CNN_v3`
3. Train and evaluate both models using the same setup as the original.
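
When comparing architectural variants, it can help to know how many learnable parameters each change adds or removes. The sketch below counts parameters by hand, assuming the baseline layer sizes described earlier and, as a hypothetical variant, a BatchNorm layer after each convolution; the helper names are illustrative.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


baseline = (
    conv_params(3, 32)
    + conv_params(32, 64)
    + conv_params(64, 128)
    + linear_params(128 * 7 * 7, 128)
    + linear_params(128, 10)
)

# BatchNorm2d learns one scale and one shift per channel.
batchnorm_extra = 2 * (32 + 64 + 128)

print(baseline, batchnorm_extra)  # 897482 448
```

Dropout and pooling layers have no learnable parameters, so removing or relocating them changes regularization and spatial resolution without changing the parameter count, unless the flattened feature size changes.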

In [None]:

# --- Model V2: Remove Dropout ---
class SVHN_CNN_v2(nn.Module):
    def __init__(self, num_classes=10):
        super(SVHN_CNN_v2, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.pool1 = nn.MaxPool2d(3, 2)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.pool3 = nn.MaxPool2d(3, 2)
        
        self.flatten_dim = 128 * 7 * 7
        self.fc1 = nn.Linear(self.flatten_dim, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = F.relu(self.conv2(x))
        x = self.pool3(F.relu(self.conv3(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x) # No dropout
        return x

# --- Model V3: Add Batch Normalization ---
class SVHN_CNN_v3(nn.Module):
    def __init__(self, num_classes=10):
        super(SVHN_CNN_v3, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(3, 2)
        
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool3 = nn.MaxPool2d(3, 2)
        
        self.flatten_dim = 128 * 7 * 7
        self.dropout1 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(self.flatten_dim, 128)
        self.dropout2 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool1(F.relu(self.bn1(self.conv1(x))))
        x = F.relu(self.bn2(self.conv2(x)))
        x = self.pool3(F.relu(self.bn3(self.conv3(x))))
        x = x.view(x.size(0), -1)
        x = self.dropout1(x)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        return x

print("Training SVHN_CNN_v2 (No Dropout)...")
model_v2 = SVHN_CNN_v2(num_classes=10)
optimizer_v2 = optim.Adam(model_v2.parameters(), lr=0.001)
train_model(model_v2, train_loader, criterion, optimizer_v2, device, num_epochs=6)
print(f"Test Accuracy v2: {evaluate_model(model_v2, test_loader):.2f}%")

print("\nTraining SVHN_CNN_v3 (With BatchNorm)...")
model_v3 = SVHN_CNN_v3(num_classes=10)
optimizer_v3 = optim.Adam(model_v3.parameters(), lr=0.001)
train_model(model_v3, train_loader, criterion, optimizer_v3, device, num_epochs=6)
print(f"Test Accuracy v3: {evaluate_model(model_v3, test_loader):.2f}%")


## Question 2: The One Hundred Layers Tiramisu (45 Points)


In this question we explore the problem of **semantic segmentation**: assigning a class label to **every pixel** in an image.

We base our work on the paper:

> Jégou, S., Drozdzal, M., Vázquez, D., Romero, A., & Bengio, Y. (2017).  
> **The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation.**  
> *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)*.  
> [[PDF link](https://arxiv.org/pdf/1611.09326.pdf)]


For those interested, I highly recommend reading the paper to deepen your understanding of DenseNet-based architectures and to gain familiarity with the deep learning literature in general. That said, reading it is **not required**: the tools and concepts needed for this assignment are introduced gradually throughout the steps.

Our goal is to replicate the FC-DenseNet architecture described in the paper, aiming for comparable behaviour while using a **smaller variant** (e.g., FC-DenseNet67 instead of FC-DenseNet103) to ensure runtime feasibility on your GPUs.

We work with the **CamVid** dataset, which consists of urban driving scenes captured from a moving vehicle. Each image is paired with a pixel-wise annotation map indicating semantic classes such as road, sidewalk, building, sky, tree, fence, poles, traffic signs or lights, vehicles, pedestrians, and bicyclists.

Conceptually, semantic segmentation transforms an image into a **grid of classification tasks** — one small prediction problem per pixel — requiring the network to recognize objects and localize them throughout the scene.


### Data Loading & Preprocessing

Before building the model, we must ensure that the dataset is represented in a form a neural network can learn from.

Let:
- $X$ denote the RGB input images from CamVid.
- $Y$ denote the corresponding color-coded annotation masks, where each pixel encodes a semantic class via an RGB value.

The raw CamVid annotations contain **over 30 distinct colors**, including rare and fine-grained categories.  
To make learning tractable and consistent with common practice, we collapse these into a compact set of **11 semantic classes**, and assign all remaining labels to a single **void class**, which is ignored during training.

In this section we will:

1. Define or load the RGB-to-label mapping.
2. Convert each colored mask into a 2D array of integer class IDs.
3. Visualize sample inputs and their mapped labels to verify correctness.

With this mapping in place, segmentation becomes a **pixel-wise classification task** over the label space $\{0, \dots, C-1\}$, rather than operating directly on raw RGB annotation images.
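
To illustrate what "pixel-wise classification with an ignored void class" means for the loss, here is a minimal plain-Python sketch. It assumes void pixels are marked with the label 255 and averages the negative log-likelihood over the remaining pixels only; `pixelwise_nll` and the tiny 2×2 example are hypothetical.

```python
import math

VOID = 255  # assumed label for void pixels, which are excluded from the loss


def pixelwise_nll(probs, labels, ignore=VOID):
    """Mean negative log-likelihood over all non-void pixels.

    probs:  H x W grid, each entry a list of per-class probabilities
    labels: H x W grid of integer class IDs (or `ignore`)
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y in zip(prob_row, label_row):
            if y == ignore:
                continue  # void pixels contribute nothing
            total -= math.log(p[y])
            count += 1
    return total / count if count else 0.0


# A 2x2 "image" with two classes; the bottom-right pixel is void.
probs = [[[0.5, 0.5], [0.9, 0.1]],
         [[0.2, 0.8], [0.5, 0.5]]]
labels = [[0, 0],
          [1, VOID]]

loss = pixelwise_nll(probs, labels)
print(f"{loss:.4f}")
```

In PyTorch, the same effect is typically obtained by passing `ignore_index` to `nn.CrossEntropyLoss`.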


We will use the **CamVid dataset**, which contains street-scene RGB images and their corresponding pixel-wise annotations.  
Please upload the provided `CamVid.zip` dataset to your own Google Drive. Then run the following cells to mount your drive and unzip the data.
The archive will be extracted into `/content/drive/MyDrive/CamVid/`.

> `CamVid.zip` is provided to you in `DL-HW2.zip`


In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
import zipfile
import os

zip_path = "/content/drive/MyDrive/CamVid.zip"  # wherever you uploaded it
extract_root = "/content/drive/MyDrive/CamVid"

os.makedirs(extract_root, exist_ok=True)

with zipfile.ZipFile(zip_path, 'r') as z:
    z.extractall(extract_root)

print("Extracted to Drive:", extract_root)
!ls "/content/drive/MyDrive/CamVid"

CamVid annotations are stored as **RGB color masks**, where each distinct color corresponds to a semantic category.  
To train a segmentation model, we must convert these colors into **integer class IDs**.

The mapping used here collapses ~32 original colors into **11 trainable categories** (Sky, Building, Road, etc.), with a separate **Void class** assigned label 255 and excluded from the loss.

Below we define:
- an RGB-to-label mapping,
- a PyTorch dataset class that:
  - reads images and masks,
  - applies cropping and flipping,
  - converts masks into numeric class IDs,
  - normalizes images.

In [None]:
# Data Loading & Preprocessing

# Convert 32 -> 11 CamVid mapping: RGB -> label name
RGBLabel2LabelName = {
    (128, 128, 128): "Sky",

    (0,   128,  64): "Building",
    (128,   0,   0): "Building",
    (64,  192,   0): "Building",
    (64,    0,  64): "Building",
    (192,   0, 128): "Building",

    (192, 192, 128): "Pole",
    (0,     0,  64): "Pole",

    (128,  64, 128): "Road",
    (128,   0, 192): "Road",
    (192,   0,  64): "Road",

    (0,     0, 192): "Sidewalk",
    (64,  192, 128): "Sidewalk",
    (128, 128, 192): "Sidewalk",

    (128, 128,   0): "Tree",
    (192, 192,   0): "Tree",

    (192, 128, 128): "SignSymbol",
    (128, 128,  64): "SignSymbol",
    (0,    64,  64): "SignSymbol",

    (64,   64, 128): "Fence",

    (64,    0, 128): "Car",
    (64,  128, 192): "Car",
    (192, 128, 192): "Car",
    (192,  64, 128): "Car",
    (128,  64,  64): "Car",

    (64,   64,   0): "Pedestrian",
    (192, 128,  64): "Pedestrian",
    (64,    0, 192): "Pedestrian",
    (64,  128,  64): "Pedestrian",

    (0,   128, 192): "Bicyclist",
    (192,   0, 192): "Bicyclist",

    (0,     0,   0): "Void"
}

# Define the 11 train classes and the void index
TRAIN_CLASSES = [
    "Sky",
    "Building",
    "Pole",
    "Road",
    "Sidewalk",
    "Tree",
    "SignSymbol",
    "Fence",
    "Car",
    "Pedestrian",
    "Bicyclist"
]

LABEL_NAME_TO_ID = {name: i for i, name in enumerate(TRAIN_CLASSES)}
VOID_LABEL_NAME = "Void"
VOID_INDEX = 255


class CamVidDataset(Dataset):
    """
    CamVid dataset loader that:
    - reads RGB images from e.g. CamVid/train
    - reads RGB masks from e.g. CamVid/train_labels
    - uses the RGBLabel2LabelName mapping to:
        32+ RGB colors -> 11 train classes (0..10) + Void (255)
    - applies normalization and simple augmentations

    Output:
      image: float tensor (3, H, W), normalized (ImageNet stats)
      mask:  long tensor (H, W) with values in {0..10, 255}
             where 255 is the ignore_index for the loss.
    """

    def __init__(self,
                 image_dir,
                 mask_dir,
                 crop_size=(224, 224),
                 is_train=True):
        self.image_dir = image_dir
        self.mask_dir = mask_dir
        self.is_train = is_train
        self.crop_h, self.crop_w = crop_size

        # Collect image paths
        self.image_paths = sorted(
            glob(os.path.join(image_dir, "*.png")) +
            glob(os.path.join(image_dir, "*.jpg")) +
            glob(os.path.join(image_dir, "*.jpeg"))
        )
        if len(self.image_paths) == 0:
            raise RuntimeError(f"No images found in {image_dir}")

        # Build corresponding mask paths (same filename, different folder)
        self.mask_paths = []
        for p in self.image_paths:
            base = os.path.basename(p)
            name, ext = os.path.splitext(base)


            candidate = os.path.join(mask_dir, name + "_L" + ext)
            if os.path.exists(candidate):
                self.mask_paths.append(candidate)
            else:
                candidate2 = os.path.join(mask_dir, base)
                if not os.path.exists(candidate2):
                    raise FileNotFoundError(
                        f"Could not find mask for image {p}. "
                        f"Tried: {candidate} and {candidate2}"
                    )
                self.mask_paths.append(candidate2)

        # Build color -> train_id mapping from RGBLabel2LabelName
        self.num_classes = len(TRAIN_CLASSES)
        self.train_id_to_name = TRAIN_CLASSES
        self.void_index = VOID_INDEX

        self.color_to_train_id = {}
        for (r, g, b), label_name in RGBLabel2LabelName.items():
            if label_name == VOID_LABEL_NAME:
                # Void will be handled by default (everything starts as VOID_INDEX)
                continue
            train_id = LABEL_NAME_TO_ID[label_name]
            self.color_to_train_id[(r, g, b)] = train_id

        # Normalization
        self.mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
        self.std  = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Load image & mask as numpy arrays
        img_path = self.image_paths[idx]
        mask_path = self.mask_paths[idx]

        img = Image.open(img_path).convert("RGB")
        mask = Image.open(mask_path).convert("RGB")  # color-coded mask

        img = np.array(img, dtype=np.uint8)   # (H,W,3)
        mask = np.array(mask, dtype=np.uint8) # (H,W,3)

        # random data augmentation
        if self.is_train:
            img, mask = self.random_crop(img, mask, self.crop_h, self.crop_w)
            img, mask = self.random_horizontal_flip(img, mask)

        # Convert color mask -> class index mask (0..10, 255)
        class_mask = self.rgb_to_class_indices(mask)  # (H,W), int64

        # Convert image to tensor and normalize
        img = torch.from_numpy(img).float().permute(2, 0, 1) / 255.0  # (3,H,W)
        img = (img - self.mean) / self.std

        class_mask = torch.from_numpy(class_mask).long()  # (H,W)

        return img, class_mask

    def rgb_to_class_indices(self, mask_rgb):
        """
        mask_rgb: (H,W,3) uint8
        returns: (H,W) int64 with values in {0..num_classes-1, void_index}

        Any pixel whose color is not in RGBLabel2LabelName or not in the 11
        train classes is assigned void_index (255), same idea as CamVidGray.
        """
        h, w, _ = mask_rgb.shape
        class_mask = np.full((h, w), fill_value=self.void_index, dtype=np.int64)

        # iterate over all known label colors
        for (r, g, b), train_id in self.color_to_train_id.items():
            matches = (
                (mask_rgb[:, :, 0] == r) &
                (mask_rgb[:, :, 1] == g) &
                (mask_rgb[:, :, 2] == b)
            )
            class_mask[matches] = train_id

        # Any remaining colors (including weird mislabels) stay as void_index
        return class_mask

    @staticmethod
    def random_crop(img, mask, crop_h, crop_w):
        """Randomly crop the same region from image and mask."""
        H, W, _ = img.shape
        if (H <= crop_h) or (W <= crop_w):
            # Fallback: center crop if image is smaller than the crop
            top = max(0, (H - crop_h) // 2)
            left = max(0, (W - crop_w) // 2)
        else:
            top = np.random.randint(0, H - crop_h + 1)
            left = np.random.randint(0, W - crop_w + 1)

        img_crop = img[top:top + crop_h, left:left + crop_w, :]
        mask_crop = mask[top:top + crop_h, left:left + crop_w, :]

        return img_crop, mask_crop

    @staticmethod
    def random_horizontal_flip(img, mask, p=0.5):
        """Randomly flip image and mask horizontally with probability p."""
        if np.random.rand() < p:
            img = np.ascontiguousarray(img[:, ::-1, :])   # flip width
            mask = np.ascontiguousarray(mask[:, ::-1, :])
        return img, mask

We now instantiate our dataset class over the train/validation splits and wrap them with PyTorch `DataLoader`s for batching.

We verify:
- tensor shapes,
- expected number of classes,
- that label values fall within `{0..10, 255}`.

In [None]:
# Base directory
base_dir = "/content/drive/MyDrive/CamVid"

train_images = f"{base_dir}/train"
train_masks  = f"{base_dir}/train_labels"

val_images   = f"{base_dir}/val"
val_masks    = f"{base_dir}/val_labels"

test_images  = f"{base_dir}/test"
test_masks   = f"{base_dir}/test_labels"

# Create datasets
train_dataset = CamVidDataset(
    image_dir=train_images,
    mask_dir=train_masks,
    crop_size=(224, 224),
    is_train=True
)

val_dataset = CamVidDataset(
    image_dir=val_images,
    mask_dir=val_masks,
    crop_size=(224, 224),
    is_train=False
)

# Dataloaders
train_loader = DataLoader(train_dataset, batch_size=3, shuffle=True, num_workers=2)
val_loader   = DataLoader(val_dataset,   batch_size=3, shuffle=False, num_workers=2)

# Quick sanity check: shapes + labels
imgs, masks = next(iter(train_loader))
print("Images:", imgs.shape)   # (B,3,H,W)
print("Masks:", masks.shape)   # (B,H,W)
print("Num classes:", train_dataset.num_classes)
print("Unique labels in this batch:", torch.unique(masks))

To ensure that preprocessing worked as expected, we decode the class IDs back into colors and visualize:
- the input RGB image,
- the processed 11-class segmentation mask,
- an overlay.

This provides a quick visual confirmation that label mapping and crops are applied correctly.

In [None]:
import matplotlib.patches as mpatches  # needed for the legend patches below

# Build a display color for each of the 11 train classes:
# take the first RGB that maps to that label.
TRAIN_ID_TO_COLOR = {}
for (r, g, b), label_name in RGBLabel2LabelName.items():
    if label_name == VOID_LABEL_NAME:
        continue
    train_id = LABEL_NAME_TO_ID[label_name]
    if train_id not in TRAIN_ID_TO_COLOR:
        TRAIN_ID_TO_COLOR[train_id] = (r, g, b)


def decode_class_mask(class_mask, train_id_to_color, void_index=VOID_INDEX):
    """
    class_mask: (H,W) int64 in {0..C-1, void_index}
    returns: (H,W,3) uint8 color mask for visualization
    """
    h, w = class_mask.shape
    color_mask = np.zeros((h, w, 3), dtype=np.uint8)

    for train_id, color in train_id_to_color.items():
        color_mask[class_mask == train_id] = np.array(color, dtype=np.uint8)

    # void pixels stay black (0,0,0);
    return color_mask


def visualize_camvid_sample(dataset, idx=0):
    """
    Show:
      - input image
      - merged 11-class mask
      - overlay + legend (class key)
    """
    img, mask = dataset[idx]  # img: normalized tensor, mask: (H,W) long

    # Denormalize for display
    img_np = img.clone()
    img_np = (img_np * dataset.std + dataset.mean).clamp(0, 1)
    img_np = img_np.numpy().transpose(1, 2, 0)  # (H,W,3), [0,1]

    mask_np = mask.numpy()
    color_mask = decode_class_mask(mask_np, TRAIN_ID_TO_COLOR, VOID_INDEX)

    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    axes[0].imshow(img_np)
    axes[0].set_title("Input image")
    axes[0].axis("off")

    axes[1].imshow(color_mask)
    axes[1].set_title("GT mask (11 classes)")
    axes[1].axis("off")

    axes[2].imshow(img_np)
    axes[2].imshow(color_mask, alpha=0.5)
    axes[2].set_title("Overlay")
    axes[2].axis("off")

    # Legend / key
    patches = [
        mpatches.Patch(
            color=np.array(TRAIN_ID_TO_COLOR[i]) / 255.0,
            label=f"{i}: {TRAIN_CLASSES[i]}"
        )
        for i in range(len(TRAIN_CLASSES))
    ]
    axes[2].legend(handles=patches, bbox_to_anchor=(1.05, 1.0),
                   loc="upper left", borderaxespad=0.)

    plt.tight_layout()
    plt.show()


# Try a couple of samples
visualize_camvid_sample(train_dataset, idx=0)
visualize_camvid_sample(train_dataset, idx=10)

### Network Architecture Overview

We now design the architecture of our DenseNet model — a fully convolutional encoder–decoder network tailored for semantic segmentation.

<p align="center">
  <img src="https://raw.githubusercontent.com/SimJeg/FC-DenseNet/cf2375bf9f6ed20ba029a5ee540261aad89732d5/DenseNet.jpg" width="650"/>
</p>

Conceptually, the network processes the image through a **downsampling path** (encoder), reaches a compressed representation (bottleneck), and then reconstructs a dense prediction map through an **upsampling path** (decoder). Lateral skip connections link encoder features to their corresponding decoder levels, ensuring fine spatial detail is preserved.

The architecture is built from the following components:

**(a) Dense Layer**

**(b) Dense Block**

**(c) Transition Down**

**(d) Transition Up**

**(e) Bottleneck Block**

**(f) Final Classifier**

We next implement each component in modular form and assemble them into a complete FC-DenseNet for CamVid segmentation.


#### **(a) Dense Layer**

The basic computational unit in DenseNet is the **dense layer**.  
Each layer receives as input **all previous feature maps** in the block, applies:

$$
\text{BN} \rightarrow \text{ReLU} \rightarrow 3 \times 3 \text{ Conv} \rightarrow \text{Dropout}
$$

and produces `k` new feature maps (the **growth rate**).  
These outputs are concatenated with the input along the channel axis.


In [None]:

class DenseLayer(nn.Module):
    """
    BN -> ReLU -> 3x3 Conv -> Dropout,
    then concatenate input and output feature maps.
    """
    def __init__(self, in_channels, growth_rate, drop_prob=0.2):
        super(DenseLayer, self).__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)
        self.dropout = nn.Dropout(drop_prob)

    def forward(self, x):
        # Apply BN -> ReLU -> Conv -> Dropout
        out = self.bn(x)
        out = F.relu(out)
        out = self.conv(out)
        out = self.dropout(out)
        
        # Concatenate input and output
        return torch.cat([x, out], 1)


#### **(b) Dense Block**  


A dense block stacks several dense layers sequentially.  
Each layer receives **all feature maps from previous layers** and contributes new ones:

$$
C_{\text{out}} = C_{\text{in}} + L \cdot k
$$

<p align="center">
  <img src="https://raw.githubusercontent.com/SimJeg/FC-DenseNet/master/DenseBlock.jpg" width="300" height="600"/>
</p>

This connectivity pattern promotes feature reuse, stabilizes gradients, and forms the core building unit of our network.


In [None]:

class DenseBlock(nn.Module):
    """
    A sequence of DenseLayer modules.
    Input channels grow by `growth_rate` at each layer.
    """
    def __init__(self, in_channels, num_layers, growth_rate, drop_prob=0.2):
        super(DenseBlock, self).__init__()
        layers = []
        for i in range(num_layers):
            # Input channels for layer i is in_channels + i * growth_rate
            layers.append(DenseLayer(in_channels + i * growth_rate, growth_rate, drop_prob))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


#### **(c) Transition Down**


At the end of each encoder stage, we reduce spatial resolution:

$$
\text{BN} \rightarrow \text{ReLU} \rightarrow 1\times1\text{ Conv} \rightarrow \text{Dropout} \rightarrow \text{MaxPool}(2)
$$

This halves width and height while keeping channels unchanged.


In [None]:

class TransitionDown(nn.Module):
    """
    BN + ReLU + 1x1 Conv + Dropout + MaxPool(2x2)
    """
    def __init__(self, in_channels, drop_prob=0.2):
        super(TransitionDown, self).__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=1, bias=False)
        self.dropout = nn.Dropout(drop_prob)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        x = self.bn(x)
        x = F.relu(x)
        x = self.conv(x)
        x = self.dropout(x)
        x = self.pool(x)
        return x


#### **(d) Transition Up**  


In the decoder, we restore resolution using learned upsampling via transposed convolution:

$$
\text{ConvTranspose}(3 \times 3, \text{stride}=2)
$$

which doubles spatial resolution before concatenation with skip features.
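
As a quick check of the "doubles spatial resolution" claim, the sketch below evaluates the standard output-size formula for a transposed convolution with dilation 1. The formula in the text fixes only the kernel size and stride; the `padding=1, output_padding=1` setting is an assumption chosen so that the output is exactly twice the input, and the helper name `conv_transpose_out_size` is hypothetical.

```python
def conv_transpose_out_size(size_in, kernel_size=3, stride=2, padding=0, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation = 1)."""
    return (size_in - 1) * stride - 2 * padding + kernel_size + output_padding


for h in (14, 28, 45):
    plain = conv_transpose_out_size(h)                                # 3x3, stride 2, no padding
    exact = conv_transpose_out_size(h, padding=1, output_padding=1)   # assumed setting
    print(f"in={h:3d}  no padding -> {plain:3d}   padding=1, output_padding=1 -> {exact:3d}")
```

Without padding the output has size $2H + 1$, so the result must either be cropped or the padding chosen so that it equals $2H$. Even then, an encoder feature map with odd size (e.g. 45, pooled down to 22) comes back as 44 after upsampling, so the upsampled map may still need to be resized or cropped to match its skip connection.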


In [None]:

class TransitionUp(nn.Module):
    """
    Transposed convolution for upsampling by a factor of 2.
    """
    def __init__(self, in_channels, out_channels):
        super(TransitionUp, self).__init__()
        self.transposed_conv = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1, output_padding=1, bias=False)

    def forward(self, x):
        return self.transposed_conv(x)


#### **(e) Bottleneck Block**


At the deepest level, the network operates at the lowest spatial resolution but highest channel width.

A final dense block processes this information before reconstruction begins in the decoder.


### Our FC-DenseNet-lite Architecture

We implement a compact FC-DenseNet architecture inspired by *The One Hundred Layers Tiramisu*.
All convolutional **dense layers** use growth rate $k = 16$. At each dense block:

- Input has $m$ feature maps.
- The block has $n$ layers.
- Each layer adds $k$ new feature maps.
- The output of the block therefore has $m + n \cdot k$ feature maps (because we concatenate all newly created features with the input).

Transition Down (TD) blocks keep the number of channels $m$ but downsample in space via 2×2 max pooling.
Transition Up (TU) blocks use a 3×3 transposed convolution with stride 2 to upsample.

Our **homework architecture ("FC-DenseNet-lite")** is defined as follows:

- Input: RGB image, $m = 3$
- Initial 3×3 convolution: $m = 48$

</br>

**Downsampling path (encoder)**

- Dense Block 1: 4 layers  
  $m = 48 + 4 \cdot 16 = 112$  
  + Transition Down → spatial size /2, channels stay 112
- Dense Block 2: 5 layers  
  $m = 112 + 5 \cdot 16 = 192$  
  + Transition Down → channels 192
- Dense Block 3: 7 layers  
  $m = 192 + 7 \cdot 16 = 304$  
  + Transition Down → channels 304
- Dense Block 4: 10 layers  
  $m = 304 + 10 \cdot 16 = 464$  
  + Transition Down → channels 464

</br>


**Bottleneck**

- Dense Block (bottleneck): 15 layers  
  $m = 464 + 15 \cdot 16 = 704$

</br>


**Upsampling path (decoder)**

At each level, we:
1. Apply a Transition Up (TU) to upsample the current feature maps.
2. Concatenate with the skip connection from the corresponding encoder level.
3. Apply a dense block at that resolution.

We use symmetric numbers of layers in the decoder:

- TU from bottleneck + skip from Dense Block 4 $\to$ Dense Block (10 layers)
- TU + skip from Dense Block 3 $\to$ Dense Block (7 layers)
- TU + skip from Dense Block 2 $\to$ Dense Block (5 layers)
- TU + skip from Dense Block 1 $\to$ Dense Block (4 layers)


</br>

**Final classifier**

- 1×1 convolution maps the decoder output to $C = 11$ class logits per pixel.

> We call this network `FC-DenseNet67`; it is a smaller version of `FC-DenseNet103`, the architecture used in the paper.
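
The encoder and bottleneck channel counts listed above follow directly from $m + n \cdot k$. A minimal sketch that recomputes them is given here; the names `dense_block_out`, `ENCODER_LAYERS`, and `BOTTLENECK_LAYERS` are illustrative and not part of the assignment code.

```python
GROWTH_RATE = 16               # k
ENCODER_LAYERS = [4, 5, 7, 10]  # layers in Dense Blocks 1-4
BOTTLENECK_LAYERS = 15


def dense_block_out(m, n, k=GROWTH_RATE):
    """Channels after a dense block: m input maps plus n layers adding k maps each."""
    return m + n * k


channels = 48  # after the initial 3x3 convolution
encoder_channels = []
for n in ENCODER_LAYERS:
    channels = dense_block_out(channels, n)
    encoder_channels.append(channels)  # Transition Down keeps this channel count

bottleneck_channels = dense_block_out(channels, BOTTLENECK_LAYERS)

print("encoder:", encoder_channels)
print("bottleneck:", bottleneck_channels)
```

The same bookkeeping is useful when wiring the decoder, where each dense block's input is the Transition Up output concatenated with the skip connection at that level.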

In [None]:

class FCDenseNet(nn.Module):
    """
    Fully Convolutional DenseNet for semantic segmentation.
    """
    def __init__(self, in_channels=3, n_classes=11, growth_rate=16):
        super(FCDenseNet, self).__init__()
        
        # Initial Conv
        self.conv1 = nn.Conv2d(in_channels, 48, kernel_size=3, padding=1, bias=False)
        
        cur_channels = 48
        
        # --- Encoder ---
        
        # Block 1 (4 layers)
        self.db1 = DenseBlock(cur_channels, 4, growth_rate)
        cur_channels += 4 * growth_rate # 48 + 64 = 112
        self.skip1_channels = cur_channels
        self.td1 = TransitionDown(cur_channels)
        
        # Block 2 (5 layers)
        self.db2 = DenseBlock(cur_channels, 5, growth_rate)
        cur_channels += 5 * growth_rate # 112 + 80 = 192
        self.skip2_channels = cur_channels
        self.td2 = TransitionDown(cur_channels)
        
        # Block 3 (7 layers)
        self.db3 = DenseBlock(cur_channels, 7, growth_rate)
        cur_channels += 7 * growth_rate # 192 + 112 = 304
        self.skip3_channels = cur_channels
        self.td3 = TransitionDown(cur_channels)
        
        # Block 4 (10 layers)
        self.db4 = DenseBlock(cur_channels, 10, growth_rate)
        cur_channels += 10 * growth_rate # 304 + 160 = 464
        self.skip4_channels = cur_channels
        self.td4 = TransitionDown(cur_channels)
        
        # --- Bottleneck ---
        # Block Bottleneck (15 layers)
        self.bottleneck = DenseBlock(cur_channels, 15, growth_rate)
        cur_channels += 15 * growth_rate # 464 + 240 = 704
        
        # --- Decoder ---
        
        # Up 1
        self.tu1 = TransitionUp(cur_channels, 240) 
        self.db_up1 = DenseBlock(240 + self.skip4_channels, 10, growth_rate)
        cur_channels = (240 + self.skip4_channels) + 10 * growth_rate # 704 + 160 = 864
        
        # Up 2
        self.tu2 = TransitionUp(cur_channels, 160)
        self.db_up2 = DenseBlock(160 + self.skip3_channels, 7, growth_rate)
        cur_channels = (160 + self.skip3_channels) + 7 * growth_rate # 464 + 112 = 576
        
        # Up 3
        self.tu3 = TransitionUp(cur_channels, 112)
        self.db_up3 = DenseBlock(112 + self.skip2_channels, 5, growth_rate)
        cur_channels = (112 + self.skip2_channels) + 5 * growth_rate # 304 + 80 = 384
        
        # Up 4
        self.tu4 = TransitionUp(cur_channels, 80)
        self.db_up4 = DenseBlock(80 + self.skip1_channels, 4, growth_rate)
        cur_channels = (80 + self.skip1_channels) + 4 * growth_rate # 192 + 64 = 256
        
        # Final Conv
        self.final_conv = nn.Conv2d(cur_channels, n_classes, kernel_size=1)
        
        # Apply Kaiming He Initialization
        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
                nn.init.kaiming_uniform_(m.weight, mode='fan_in', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # Initial
        x = self.conv1(x)
        
        # Encoder
        x1 = self.db1(x)
        skip1 = x1
        x = self.td1(x1)
        
        x2 = self.db2(x)
        skip2 = x2
        x = self.td2(x2)
        
        x3 = self.db3(x)
        skip3 = x3
        x = self.td3(x3)
        
        x4 = self.db4(x)
        skip4 = x4
        x = self.td4(x4)
        
        # Bottleneck
        x = self.bottleneck(x)
        
        # Decoder
        
        # Up 1
        x = self.tu1(x)
        if x.size(2) != skip4.size(2) or x.size(3) != skip4.size(3):
            x = F.interpolate(x, size=(skip4.size(2), skip4.size(3)), mode='bilinear', align_corners=True)
        x = torch.cat([x, skip4], 1)
        x = self.db_up1(x)
        
        # Up 2
        x = self.tu2(x)
        if x.size(2) != skip3.size(2) or x.size(3) != skip3.size(3):
            x = F.interpolate(x, size=(skip3.size(2), skip3.size(3)), mode='bilinear', align_corners=True)
        x = torch.cat([x, skip3], 1)
        x = self.db_up2(x)
        
        # Up 3
        x = self.tu3(x)
        if x.size(2) != skip2.size(2) or x.size(3) != skip2.size(3):
            x = F.interpolate(x, size=(skip2.size(2), skip2.size(3)), mode='bilinear', align_corners=True)
        x = torch.cat([x, skip2], 1)
        x = self.db_up3(x)
        
        # Up 4
        x = self.tu4(x)
        if x.size(2) != skip1.size(2) or x.size(3) != skip1.size(3):
            x = F.interpolate(x, size=(skip1.size(2), skip1.size(3)), mode='bilinear', align_corners=True)
        x = torch.cat([x, skip1], 1)
        x = self.db_up4(x)
        
        # Final
        x = self.final_conv(x)
        
        return x


**Sanity Check!**

Before training, we verify that the model:

- accepts a tensor of shape `(B, 3, H, W)`,
- returns logits of shape `(B, C, H, W)` matching the number of classes.

This ensures that channel propagation, skip concatenation, and upsampling were implemented correctly.


In [None]:
model = FCDenseNet(in_channels=3, n_classes=len(TRAIN_CLASSES), growth_rate=16)
x = torch.randn(1, 3, 224, 224)
y = model(x)
print("Output shape:", y.shape)  # expected: (1, 11, 224, 224)


**Answer the following Questions:**

<font color="red">1. Why do segmentation networks need spatial priors that classification networks can ignore?</font>  
<span style="color:red">Segmentation requires dense prediction—assigning a label to every single pixel. This means the network needs to know *where* objects are, not just *what* they are. Spatial priors (like relative positions of road vs sky, or sidewalk vs building) help the network resolve ambiguities and maintain geometric consistency. Classification networks only need to determine the presence of an object anywhere in the image, so they can discard spatial information (e.g., via global pooling) to gain translation invariance.</span>

<font color="red">2. What changes architecturally when we move from "what is in the image?" to "where is it?"</font>
<span style="color:red">We move from encoders that aggressively downsample to lose spatial resolution (capturing "what") to encoder-decoder architectures. We add upsampling layers (like transposed convolutions) to recover the spatial resolution lost during pooling. We also use skip connections to re-inject fine-grained spatial details from early layers into the decoder, which is crucial for precise localization ("where").</span>

<font color="red">3. What failure mode would you expect if skip connections were removed from Tiramisu?</font>  
<span style="color:red">Without skip connections, the decoder would have to reconstruct the high-resolution output solely from the low-resolution, highly abstract bottleneck representation. This would likely result in "blobby" or coarse segmentations with poor boundaries. Small objects might disappear entirely, and edges would be blurry because the fine spatial details were lost in the encoder.</span>

<font color="red">4. How do skip connections influence gradient flow and spatial detail recovery?</font>
<span style="color:red">Skip connections provide a direct path for gradients to flow from the loss function back to the early layers of the encoder, mitigating the vanishing gradient problem in deep networks. For spatial detail, they essentially "copy-paste" high-resolution feature maps from the encoder to the decoder, allowing the decoder to use these sharp details to refine the segmentation boundaries.</span>



### Evaluation & Training

We now train our network on CamVid and assess its performance. Before launching training, we first define **evaluation metrics** suited for semantic segmentation, followed by the standard training procedure we are used to.

#### Evaluation Metrics



Semantic segmentation predictions assign a class to **every pixel**.  
Therefore, our evaluation must measure how well the network labels individual pixels and how well it segments regions belonging to different semantic categories.

**1. Pixel-wise Accuracy**

Pixel accuracy measures the fraction of correctly classified pixels:

$$
\text{PixelAcc} =
\frac{\sum_{(i,j)} \mathbf{1}\left[ \hat{Y}_{ij} = Y_{ij} \right]}
     {\sum_{(i,j)} 1},
$$

where $\hat{Y}_{ij}$ is the predicted label and $Y_{ij}$ is the ground truth at pixel $(i,j)$.

This metric is intuitive and easy to interpret, but can be misleading in imbalanced datasets:
large regions like “road” or “sky” dominate, masking poor performance on rare classes (e.g., pedestrians or signs).

</br>
</br>

**2. Intersection over Union (IoU)**

IoU evaluates segmentation quality by comparing overlap between prediction and ground truth.

For a given class $c$, IoU is:

$$
\text{IoU}_c =
\frac{
|\{\hat{Y} = c\} \cap \{Y = c\}|
}{
|\{\hat{Y} = c\} \cup \{Y = c\}|
}.
$$

IoU penalizes:

- over-segmentation (predicting class $c$ where it does not exist), and  
- under-segmentation (missing regions belonging to class $c$).

To evaluate the entire model, we compute **mean IoU (mIoU)**:

$$
\text{mIoU} = \frac{1}{C}\sum_{c=1}^C \text{IoU}_c,
$$

where $C$ is the number of semantic classes.  
mIoU treats **all classes equally**, even rare ones, making it a standard research metric for segmentation benchmarks including CamVid.
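
To make the contrast between the two metrics concrete, here is a toy example under made-up numbers: 100 pixels, 90 of a frequent class and 10 of a rare one, and a degenerate model that always predicts the frequent class. The class names in the comments are illustrative only.

```python
import numpy as np

# Toy ground truth: 90 pixels of class 0 ("road") and 10 of class 1 ("pedestrian").
target = np.array([0] * 90 + [1] * 10)
# A degenerate model that predicts the frequent class everywhere.
pred = np.zeros_like(target)

pixel_acc = float((pred == target).mean())

ious = []
for c in (0, 1):
    inter = np.logical_and(pred == c, target == c).sum()
    union = np.logical_or(pred == c, target == c).sum()
    ious.append(inter / union)
miou = float(np.mean(ious))

print(f"pixel accuracy = {pixel_acc:.2f}")
print(f"IoU per class  = {[round(float(v), 2) for v in ious]}")
print(f"mIoU           = {miou:.2f}")
```

Pixel accuracy is 0.90 even though the rare class is never detected, while its IoU of 0 pulls mIoU down to 0.45.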

In [None]:

def pixel_accuracy(pred, target, ignore_index=255):
    # pred: (B, C, H, W)
    # target: (B, H, W)
    pred_labels = torch.argmax(pred, dim=1)
    
    mask = (target != ignore_index)
    correct = (pred_labels[mask] == target[mask]).sum().item()
    total = mask.sum().item()
    
    if total == 0:
        return 0.0
    return correct / total

def intersection_and_union(pred, target, num_classes, ignore_index=255):
    # pred: (B, C, H, W) logits
    # target: (B, H, W) labels
    # Returns a list with one IoU per class (NaN for classes absent from both pred and target).
    pred_labels = torch.argmax(pred, dim=1)

    # Exclude void pixels so that predictions made there do not inflate the union
    valid = (target != ignore_index)

    iou_per_class = []

    for c in range(num_classes):
        pred_c = (pred_labels == c) & valid
        target_c = (target == c) & valid

        intersection = (pred_c & target_c).sum().item()
        union = (pred_c | target_c).sum().item()

        if union == 0:
            # Class c appears in neither the prediction nor the ground truth:
            # mark it as NaN so callers can skip it when averaging.
            iou_per_class.append(float('nan'))
        else:
            iou_per_class.append(intersection / union)

    return iou_per_class
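
Averaging IoU values batch by batch is only an approximation of the dataset-level metric, because a class that is rare in one batch gets the same weight as in any other. A common alternative, sketched here as an assumption rather than as part of the assignment, is to accumulate a confusion matrix over all batches and compute IoU once at the end. The function names `update_confusion` and `miou_from_confusion` are hypothetical.

```python
import torch


def update_confusion(conf, pred_labels, target, num_classes, ignore_index=255):
    """Add one batch of label maps to a (num_classes x num_classes) confusion matrix.

    Rows index the ground-truth class, columns the predicted class.
    """
    valid = target != ignore_index
    # Encode each (target, prediction) pair as a single index, then count occurrences.
    idx = target[valid] * num_classes + pred_labels[valid]
    conf += torch.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf


def miou_from_confusion(conf):
    """Per-class IoU = TP / (TP + FP + FN); classes that never occur give NaN."""
    tp = conf.diag().float()
    union = conf.sum(0).float() + conf.sum(1).float() - tp
    iou = tp / union  # 0/0 yields NaN for absent classes
    return iou, iou[~iou.isnan()].mean()


num_classes = 3
conf = torch.zeros(num_classes, num_classes, dtype=torch.long)
target = torch.tensor([[0, 0, 1, 255]])  # last pixel is void
pred = torch.tensor([[0, 1, 1, 2]])      # already argmax-ed label map
conf = update_confusion(conf, pred, target, num_classes)
iou, miou = miou_from_confusion(conf)
print(iou, miou)
```

In a validation loop, `update_confusion` would be called once per batch with `torch.argmax(outputs, dim=1)`, and `miou_from_confusion` once after the loop.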


#### Training

We train the network using a **pixel-wise cross-entropy loss**, treating segmentation as per-pixel classification.  
Pixels belonging to the “void” class (255) are ignored:

$$
\mathcal{L} = -\frac{1}{N} \sum_{(i,j)\;|\;Y_{ij}\neq 255}
\log p\left(\, Y_{ij} \mid X \, \right).
$$
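
Here $N$ is the number of non-void pixels. The small sketch below checks, on random toy logits, that `F.cross_entropy` with `ignore_index` matches a manual evaluation of this formula. The tensor sizes and the three toy classes are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOID = 255  # void label, as in the text

logits = torch.randn(1, 3, 2, 2)               # (B, C, H, W) with C = 3 toy classes
target = torch.tensor([[[0, 2], [VOID, 1]]])   # (B, H, W); one pixel is void

# Built-in: void pixels are skipped and the mean is taken over the remaining N pixels.
loss_builtin = F.cross_entropy(logits, target, ignore_index=VOID)

# Manual version of the formula: average -log p(Y_ij | X) over non-void pixels.
log_probs = F.log_softmax(logits, dim=1)
valid = target != VOID
safe_target = target.masked_fill(~valid, 0)    # placeholder class index for void pixels
picked = log_probs.gather(1, safe_target.unsqueeze(1)).squeeze(1)
loss_manual = -picked[valid].mean()

print(loss_builtin.item(), loss_manual.item())
```

When per-class weights are passed to the loss, PyTorch's `mean` reduction becomes a weighted mean (normalized by the sum of the weights of the target pixels), so the plain average above no longer applies as written.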

Following the original FC-DenseNet paper:

- We use **RMSProp** as the optimizer.
- We include **weight decay (L2 regularization)** to encourage small parameter norms and improve stability.

The RMSProp update maintains a moving average of squared gradients $v$ and performs:

$$
\theta \leftarrow \theta - \alpha \cdot \frac{
\nabla_\theta \mathcal{L}
}{
\sqrt{v + \epsilon}
},
$$

which adaptively scales learning rates per parameter — particularly useful in deep architectures like DenseNets.
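
A scalar sketch of this update may help. The decay rate `rho` of the moving average is an assumption (the text does not name it), and the helper `rmsprop_step` is hypothetical. The code follows the formula as written above, with $\epsilon$ inside the square root; `torch.optim.RMSprop` instead adds `eps` after taking the square root and calls the decay rate `alpha`.

```python
import math


def rmsprop_step(theta, grad, v, lr=1e-3, rho=0.99, eps=1e-8):
    """One RMSProp step for a scalar parameter, following the update in the text."""
    v = rho * v + (1.0 - rho) * grad ** 2          # moving average of squared gradients
    theta = theta - lr * grad / math.sqrt(v + eps)  # step scaled by the gradient magnitude
    return theta, v


# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, v = 1.0, 0.0
for step in range(200):
    theta, v = rmsprop_step(theta, 2.0 * theta, v, lr=0.01)
print(f"theta after 200 steps: {theta:.4f}")
```

Because the gradient is divided by a running estimate of its own magnitude, the step size depends mostly on `lr` and only weakly on the raw gradient scale.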



**Training Procedure:**

We train the network end-to-end over multiple epochs:

1. Read a mini-batch of input images and ground-truth masks.
2. Forward pass through `FCDenseNet`.
3. Compute loss using cross-entropy (ignoring void pixels).
4. Backpropagate gradients.
5. Update weights using RMSProp.
6. Accumulate accuracy and IoU statistics.
7. Validate periodically to observe generalization.

We repeat this process for 85 epochs, monitoring loss, pixel accuracy, and mIoU to evaluate convergence.

In [None]:

# Model Setup
model = FCDenseNet(in_channels=3, n_classes=len(TRAIN_CLASSES), growth_rate=16).to(device)

# Class weighting for CamVid to handle imbalance
# Approximate weights based on CamVid class frequencies (Source: common literature/Kaggle implementations)
# Sky, Road, Building are frequent -> low weight
# Pole, SignSymbol, Pedestrian are rare -> high weight
class_weights = torch.tensor([
    0.5,  # Sky
    0.5,  # Building
    3.0,  # Pole
    0.3,  # Road (Very frequent)
    0.5,  # Sidewalk
    0.5,  # Tree
    2.0,  # SignSymbol
    1.0,  # Fence
    1.0,  # Car
    3.0,  # Pedestrian (Important!)
    2.5   # Bicyclist
]).float().to(device)

criterion = nn.CrossEntropyLoss(weight=class_weights, ignore_index=VOID_INDEX)
optimizer = optim.RMSprop(model.parameters(), lr=1e-3, weight_decay=1e-4)

def train_segmentation(model, train_loader, val_loader, criterion, optimizer, num_epochs=30):
    train_losses = []
    val_losses = []
    val_ious = []
    
    # Scheduler to decay LR
    # verbose=True removed as it is deprecated/removed in newer PyTorch versions
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)

    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        
        loop = tqdm(train_loader, leave=True)
        loop.set_description(f"Epoch [{epoch+1}/{num_epochs}]")
        
        for images, masks in loop:
            images, masks = images.to(device), masks.to(device)
            
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, masks)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            loop.set_postfix(loss=loss.item())
            
        train_losses.append(running_loss / len(train_loader))
        
        # Validation
        model.eval()
        val_loss = 0.0
        total_iou = 0.0
        total_acc = 0.0
        num_iou_batches = 0
        
        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)
                outputs = model(images)
                loss = criterion(outputs, masks)
                val_loss += loss.item()
                
                # Metrics (void pixels are ignored)
                total_acc += pixel_accuracy(outputs, masks, ignore_index=VOID_INDEX)
                ious = intersection_and_union(outputs, masks, num_classes=len(TRAIN_CLASSES), ignore_index=VOID_INDEX)
                # Drop NaNs (classes absent from this batch), then average within the batch
                valid_ious = [x for x in ious if not np.isnan(x)]
                if valid_ious:
                    total_iou += sum(valid_ious) / len(valid_ious)
                    num_iou_batches += 1

        avg_val_loss = val_loss / len(val_loader)
        avg_val_acc = total_acc / len(val_loader)
        # Mean of per-batch mIoU values: an approximation of the dataset-level mIoU
        avg_mIoU = total_iou / num_iou_batches if num_iou_batches > 0 else 0
        val_losses.append(avg_val_loss)
        val_ious.append(avg_mIoU)
        
        scheduler.step(avg_val_loss)
        
        print(f"Epoch [{epoch+1}/{num_epochs}] Train Loss: {train_losses[-1]:.4f}, Val Loss: {avg_val_loss:.4f}, Pixel Acc: {avg_val_acc:.4f}, mIoU: {avg_mIoU:.4f}")

    # Return the histories so they can be plotted after training
    return train_losses, val_losses, val_ious

# The procedure above specifies 85 epochs; this run uses 30
train_losses, val_losses, val_ious = train_segmentation(model, train_loader, val_loader, criterion, optimizer, num_epochs=30)


Beyond numerical metrics, it is important to **visually inspect** the model’s predictions. In semantic segmentation, this usually means comparing:

1. the input RGB image,
2. the **ground-truth** segmentation mask,
3. the **predicted** segmentation mask.

By looking at these side by side, you can quickly see which classes the model recognizes well, where it struggles, and what its typical failure modes are.

Sample a few images from the test set and visualize each one alongside its ground truth and your model's prediction.


In [None]:

# Visualize Predictions
model.eval()
test_images, test_masks = next(iter(val_loader)) # Use val loader as test for now
test_images, test_masks = test_images.to(device), test_masks.to(device)

with torch.no_grad():
    outputs = model(test_images)
    preds = torch.argmax(outputs, dim=1)

def visualize_prediction(img_tensor, mask_tensor, pred_tensor, idx=0):
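    img = img_tensor[idx].cpu().permute(1, 2, 0).numpy()
    # Undo normalization per channel (assumes the ImageNet mean/std were used for normalization)
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    img = (img * std + mean).clip(0, 1)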
    
    mask = mask_tensor[idx].cpu().numpy()
    pred = pred_tensor[idx].cpu().numpy()
    
    color_mask = decode_class_mask(mask, TRAIN_ID_TO_COLOR, VOID_INDEX)
    color_pred = decode_class_mask(pred, TRAIN_ID_TO_COLOR, VOID_INDEX)
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    axes[0].imshow(img)
    axes[0].set_title("Input Image")
    axes[0].axis('off')
    
    axes[1].imshow(color_mask)
    axes[1].set_title("Ground Truth")
    axes[1].axis('off')
    
    axes[2].imshow(color_pred)
    axes[2].set_title("Prediction")
    axes[2].axis('off')
    
    plt.show()

# Show first in batch
visualize_prediction(test_images, test_masks, preds, idx=0)


**Answer the following Questions:**

<font color="red">1. Do the errors your model makes seem semantic (wrong class) or spatial (wrong localization)?</font>  
<span style="color:red">Typically, in the early stages, errors are often semantic (confusing road with sidewalk). As training progresses, semantic errors decrease, and we mostly see spatial errors at the boundaries (e.g., the edge between a building and the sky is slightly off by a few pixels). Thin objects like poles are often the hardest and might be missed or broken (both semantic and spatial).</span>

<font color="red">2. Which component of the architecture most likely causes that type of error?</font>  
<span style="color:red">Spatial errors at boundaries are often due to the downsampling operations (pooling) in the encoder, which lose exact spatial information. Although skip connections help recover this, the recovery isn't perfect, especially if the decoder isn't deep enough or if the skip connections aren't used effectively. Semantic errors (confusing classes) are usually due to the capacity of the encoder or insufficient context (receptive field).</span>

<font color="red">3. If you could change one design choice to address it, what would you alter?</font>
<span style="color:red">To address spatial errors and boundary precision, I would consider using dilated convolutions (atrous convolutions) in the bottleneck or later encoder blocks instead of downsampling. This expands the receptive field without losing spatial resolution, allowing the network to maintain high-resolution feature maps throughout. Alternatively, using a more powerful backbone or adding an attention mechanism on the skip connections (as in Attention U-Net) could help refine which encoder features are passed to the decoder.</span>

