![BridgingAI Logo](../bridgingai_logo.png)

# Deep Learning - Exercise 3: Convolutional Neural Networks and Semantic Segmentation

---
1. [Convolutional Neural Networks](#chp1-fundamentals)
<br/> &#9; 1.1 [Residual Blocks](#chp1-1-residual-block)
<br/> &#9; 1.2 [ResNet34](#chp1-2-resnet34)

2. [Semantic Segmentation](#chp2-segmentation-fundamentals)
<br/> &#9; 2.1 [Dataset Preparation and Visualization](#chp2-1-dataset)
<br/> &#9; 2.2 [Metrics and Loss Functions](#chp2-2-metrics-loss)

3. [Semantic Segmentation Models](#chp3-segmentation-models)
<br/> &#9; 3.1 [Segmentation Model Architecture](#chp3-1-segmentation-model)
<br/> &#9; 3.2 [Lite R-ASPP Classifier](#chp3-2-lite-raspp)
<br/> &#9; 3.3 [Atrous Convolution](#chp3-3-atrous-convolution)
<br/> &#9; 3.4 [Atrous Spatial Pyramid Pooling (ASPP)](#chp3-4-aspp)
<br/> &#9; 3.5 [DeepLabV3 Classifier](#chp3-5-deeplabv3)
<br/> &#9; 3.6 [DeepLabV3 with ResNet50 Backbone (Optional)](#chp3-6-deeplabv3-resnet50)

4. [Appendix: Performance Benchmarks](#appendix)

5. [References](#references)

---

Convolutional Neural Networks (CNNs) are a class of neural networks commonly used for image processing tasks, such as image classification, object detection, or image compression. This exercise will guide you through implementing and experimenting with CNNs, focusing on ResNet architecture and semantic segmentation tasks.

### Exercise Overview:

Part 1: Implement Residual Networks

Part 2: Semantic Segmentation (PASCAL VOC 2012)
- Prepare and visualize the dataset
- Implement the evaluation metrics and loss function
- Compare different architectures for segmentation (Lite R-ASPP, DeepLabV3)
- Experiment with different backbones (MobileNetV3, ResNet50)

### Compute Requirements
The baseline configuration (Lite R-ASPP, 225px resolution, 3000 steps) trains in ~30 minutes on CPU. For more extensive experiments, GPU access is recommended.

Performance can be adjusted by:

- Reducing computational load: lower resolution, smaller architectures, fewer validation steps
- Enhancing accuracy (with GPU): 513px resolution, ResNet50/101 backbone, backbone fine-tuning, extended training

In [None]:
import torch
import torch.nn as nn
import torchvision

import tests
from visualization import visualize_segmentation
from config import ExperimentConfig
from training import create_dataloaders, CNNTrainer as Trainer
from utils import print_model_info

<a id="chp1-fundamentals"></a>
# 1. Convolutional Neural Networks

In this section, you will implement Residual Networks (ResNets), a very popular network architecture that was originally introduced for image classification. ResNets facilitate the training of very deep networks by introducing _skip connections_ for improved gradient flow.
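As a short sketch of why skip connections help gradient flow: write a residual block as $y = \mathcal{F}(x) + x$, as in the ResNet paper, and ignore the final ReLU for simplicity. The loss symbol $\mathcal{L}$ and the identity matrix $I$ are notation introduced here for illustration.

$$
y = \mathcal{F}(x) + x
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial \mathcal{F}}{\partial x} + I\right)
$$

The identity term $I$ passes the upstream gradient on unchanged, so the gradient reaching earlier layers does not vanish even when the gradient through $\mathcal{F}$ is small.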

<a id="chp1-1-residual-block"></a>
## 1.1 ResidualBlock Implementation

Your first task is to complete the `ResidualBlock` class in the cell below.

**TODO 1**: Implement the `forward` method in the `ResidualBlock` class.

1. Use the `self.conv1`, `self.bn1`, `self.relu`, `self.conv2`, and `self.bn2` layers defined in the `__init__` method.
2. Refer to Figure 2 in the [original ResNet paper](https://arxiv.org/abs/1512.03385) for the residual block structure.
3. Apply batch normalization immediately after each convolutional layer.
4. Remember to add the identity (input) to the output before the final activation.

**TODO 2**: Complete the `self.downsample` part in the `__init__` method of the `ResidualBlock` class. <br>
This block is used when the input dimensions need to be adjusted to match the output dimensions (when `use_downsample` is True).

1. Create a `nn.Sequential` object for `self.downsample`
2. The downsample block should contain:
   - A convolutional layer with kernel size 1, stride equal to `in_stride`, and output channels equal to `out_channels`
   - A batch normalization layer

**Hint**: Remember to set `bias=False` for the convolutional layer as per the ResNet architecture.
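To see why the shortcut needs its own strided convolution, it can help to check the spatial sizes by hand. The sketch below uses the standard convolution output-size formula. The helper `conv_out_size` is a hypothetical name introduced here and is not part of the exercise code.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int, dilation: int = 1) -> int:
    """Output size of a convolution along one spatial dimension (floor formula)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


for h in (56, 57):
    # Main path: the first 3x3 convolution with stride 2 and padding 1.
    main_path = conv_out_size(h, kernel=3, stride=2, padding=1)
    # Shortcut: the 1x1 convolution with stride 2 and no padding.
    shortcut = conv_out_size(h, kernel=1, stride=2, padding=0)
    print(h, main_path, shortcut)
```

Both paths produce the same spatial size for even and odd inputs, which is what allows the element-wise addition of the identity and the main-path output.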

After implementing these components, run the provided test cells to verify your implementation.

In [None]:
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, use_downsample=False):
        super().__init__()

        in_stride = 1
        self.downsample = None
        if use_downsample:
            in_stride = 2
            # TODO 2: Implement the downsample block
            # (modify self.downsample to be a nn.Sequential object)
            # YOUR CODE HERE
            raise NotImplementedError()

        self.conv1 = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=3,
            stride=in_stride,
            padding=1,
            bias=False,
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        """
        Args:
        - x: (N, in_channels, H, W)

        Returns:
        - out:  (N, out_channels, H, W) if no downsample
                (N, out_channels, H//2, W//2) if downsample
        """
        if self.downsample is not None:
            identity = self.downsample(x)  # (N, out_channels, H//2, W//2)
        else:
            identity = x  # (N, in_channels, H, W)

        out = None
        # TODO 1: Implement the forward pass of the residual block
        # YOUR CODE HERE
        raise NotImplementedError()
        return out

In [None]:
tests.TestResidualBlock.test_no_downsampling(ResidualBlock)

In [None]:
tests.TestResidualBlock.test_downsampling(ResidualBlock)

In [None]:
tests.TestResidualBlock.test_structure(ResidualBlock)

<a id="chp1-2-resnet34"></a>
## 1.2 ResNet34 Implementation

Now that you've implemented the ResidualBlock, you'll use it to build a ResNet34. ResNet34 is a 34-layer deep convolutional neural network that utilizes residual connections to facilitate the training of very deep networks.

Your task is to implement the ResNet34 class in the cell below. You can refer to Table 1 in the [ResNet paper](https://arxiv.org/abs/1512.03385) for the architecture details.

**TODO**: Complete the `ResNet34` class by implementing the following:

1. In the `__init__` method, use the `_make_layer` method to create `self.layer2`, `self.layer3`, and `self.layer4`.
2. Implement the `forward` method.

**Hints**:
- Pay attention to the number of blocks and the input/output channels for each layer.
- `self.layer2`, `self.layer3`, and `self.layer4` should use downsampling.

After implementing the ResNet34 class, run the provided test cells to verify your implementation. These tests will check the forward pass, layer shapes, and overall structure of your ResNet34 model.
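As a sanity check before writing any PyTorch code, the sketch below traces the spatial resolution of a 224x224 input through the stages and counts the weighted layers. It is plain arithmetic under the layer settings shown in this notebook. The names `out_size` and `stages` are introduced here for illustration only.

```python
def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) of the layer that sets each stage's resolution.
stages = [
    ("conv1", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("layer1", 3, 1, 1),  # no downsampling
    ("layer2", 3, 2, 1),  # first block downsamples
    ("layer3", 3, 2, 1),
    ("layer4", 3, 2, 1),
]

size = 224
trace = {}
for name, kernel, stride, padding in stages:
    size = out_size(size, kernel, stride, padding)
    trace[name] = size
print(trace)

# Layer count: stem conv + two 3x3 convs per residual block + final fc layer.
# The 1x1 convolutions in the downsample shortcuts are not counted by convention.
blocks_per_stage = (3, 4, 6, 3)
num_weighted_layers = 1 + 2 * sum(blocks_per_stage) + 1
print(num_weighted_layers)
```

The final 7x7 feature map is what `self.avgpool` reduces to 1x1 before the fully connected layer.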

In [None]:
class ResNet34(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(ResidualBlock, 64, 64, 3, use_downsample=False)

        # TODO: Implement self.layer2, self.layer3, self.layer4
        # ResNet34 has 3, 4, 6, 3 residual blocks of 64, 128, 256, 512 channels
        # YOUR CODE HERE
        raise NotImplementedError()

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, block, in_channels, out_channels, num_blocks, use_downsample):
        """
        Returns:
            nn.Sequential(
                block(in_channels, out_channels, use_downsample),
                block(out_channels, out_channels),
                block(out_channels, out_channels),
                ...
            )
        """
        layers = []
        layers.append(
            block(in_channels, out_channels, use_downsample)
        )  # first block with possible downsample
        for _ in range(1, num_blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        """
        Args:
        - x: (N, 3, 224, 224)

        Returns:
        - x: (N, num_classes)
        """
        out = None

        # TODO: Implement the forward pass of ResNet34
        # YOUR CODE HERE
        raise NotImplementedError()

        return out

In [None]:
tests.TestResNet34.test_forward(ResNet34)
tests.TestResNet34.test_layer_shapes(ResNet34)
tests.TestResNet34.test_structure(ResNet34)

### Load Pre-trained Weights

After passing the sanity checks, we can try loading the pre-trained weights to see if our implementation is correct.

In [None]:
def make_resnet34(pretrained=False, progress=True, **kwargs):
    model = ResNet34(**kwargs)
    if pretrained:
        # Load the torchvision ImageNet weights; this fails if parameter names or shapes differ.
        weights = torchvision.models.ResNet34_Weights.DEFAULT
        model.load_state_dict(weights.get_state_dict(progress=progress))
        print("Successfully loaded pretrained weights.")
    return model


my_resnet34 = make_resnet34(pretrained=True)

<a id="chp2-segmentation-fundamentals"></a>
# 2. Semantic Segmentation

The goal of semantic segmentation is to assign a class label to each pixel in an image, for a predefined set of semantic classes. This section will introduce you to the key concepts of semantic segmentation without requiring any coding. Semantic segmentation is used in various applications, including autonomous driving, medical image analysis, and satellite imagery interpretation. Unlike image classification, which assigns a single label to an entire image, semantic segmentation provides a dense pixel-wise classification, allowing for a more detailed understanding of the image content.

Key points to understand:
- Each pixel in the output should be assigned a class label
- The number of classes depends on the specific problem and dataset
- The output is typically a mask with the same spatial dimensions as the input image 

Common metrics for evaluating semantic segmentation models include:
1. **Mean Intersection over Union (mIoU)**: The average IoU across all classes, where IoU is the overlap between the predicted segmentation and the ground truth divided by their union.
2. **Pixel Accuracy**: The percentage of correctly classified pixels.
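To make the mIoU definition concrete, here is a minimal pure-Python sketch on a toy example with six pixels. It assumes a void label of 255 and skips classes that appear in neither the prediction nor the ground truth. The `calculate_miou` function provided with the exercise may handle these cases differently, and `mean_iou` is a hypothetical name used only here.

```python
def mean_iou(pred, target, num_classes, void_label):
    """mIoU over flat label lists, ignoring void pixels and absent classes."""
    valid = [(p, t) for p, t in zip(pred, target) if t != void_label]
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in valid if p == c and t == c)
        union = sum(1 for p, t in valid if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


target = [0, 0, 1, 1, 2, 255]  # last pixel is void
pred = [0, 1, 1, 1, 2, 2]

# Per-class IoU: class 0 -> 1/2, class 1 -> 2/3, class 2 -> 1/1
print(mean_iou(pred, target, num_classes=3, void_label=255))
```

Note that the prediction on the void pixel has no effect on the result, because that pixel is removed before any counting.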

<a id="chp2-1-dataset"></a>
## 2.1 Dataset Preparation and Visualization

In this section, we'll explore the dataset used for our semantic segmentation task and how it's prepared and visualized.

### Dataset Overview

We're using the PASCAL VOC 2012 dataset, a classic benchmark for semantic segmentation. It contains 1464 training and 1449 validation samples; each image is annotated with a pixel-precise map distinguishing between 20 foreground object classes and one background class. Additionally, some pixels may be labelled as 'void', meaning they do not have a clear label and should therefore be excluded from metric and loss computations.

### Dataset Preparation

We've implemented a custom `VOCDataset` class in `voc_dataset.py` to handle the loading and preprocessing of the dataset. 

### Visualization

To aid in understanding our data and model predictions, we've implemented visualization functions in `visualization.py`. These functions will be used when logging images to TensorBoard during training.

In [None]:
# Create a sample configuration
config = ExperimentConfig("", "", "", "", img_size=255, batch_size=6)

# Create dataset and dataloader
train_loader, _ = create_dataloaders(config)

# Get a batch of images and masks
images, masks = next(iter(train_loader))

# Create dummy predictions (just for visualization purposes)
predictions = torch.randint(0, config.num_classes, masks.shape)

# Visualize
fig = visualize_segmentation(images, masks, predictions, config)

<a id="chp2-2-metrics-loss"></a>
## 2.2 Metrics and Loss Functions

In this section, you will implement functions for training and evaluating semantic segmentation models.

**TODO**: Implement the `calculate_pixel_accuracy` function below.
This function should:
1. Calculate the pixel-wise accuracy of the segmentation predictions.
2. Ignore the void label in the accuracy calculation.
3. Return the accuracy as a scalar tensor.

In [None]:
def calculate_pixel_accuracy(
    pred: torch.Tensor, target: torch.Tensor, void_label: int
) -> torch.Tensor:
    """
    Calculate the pixel accuracy for semantic segmentation.

    Args:
        pred: int tensor (of arbitrary shape) containing the predicted labels (from 0 to num_classes - 1)
        target: int tensor containing the ground truth labels (from 0 to num_classes - 1)
        void_label: label to ignore in accuracy calculation

    Returns:
        accuracy: scalar tensor of the pixel accuracy
    """
    assert pred.size() == target.size(), "pred and target must have the same shape"
    # YOUR CODE HERE
    raise NotImplementedError()
    return accuracy

**TODO**: Complete the `step_fn` method below. After implementing this function, run the cell below to run the sanity checks.

The `step_fn` function should:
1. Calculate the cross-entropy loss between the model outputs and the target labels.
2. Compute the Mean Intersection over Union (mIoU) using the `calculate_miou` function.
3. Calculate the pixel accuracy using the `calculate_pixel_accuracy` function you just implemented.
4. Return a dictionary containing the loss, mIoU, and pixel accuracy.

**Hints**:
- Use `F.cross_entropy` to calculate the loss; it supports an `ignore_index` parameter to handle the void label.
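The effect of `ignore_index` can be checked on a toy tensor. The sketch below assumes a void label of 255 and random logits. It compares the built-in behaviour against dropping the void pixel by hand, and it is not the `step_fn` solution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
void_label = 255  # assumed void value for this toy example

logits = torch.randn(1, 3, 2, 2)                      # (N, K, H, W)
targets = torch.tensor([[[0, 2], [void_label, 1]]])   # (N, H, W), one void pixel

# Built-in handling: void pixels contribute neither to the sum nor to the mean.
loss_ignore = F.cross_entropy(logits, targets, ignore_index=void_label)

# Reference: remove the void pixel manually, then average over the rest.
flat_logits = logits.permute(0, 2, 3, 1).reshape(-1, 3)  # (N*H*W, K)
flat_targets = targets.reshape(-1)                        # (N*H*W,)
keep = flat_targets != void_label
loss_manual = F.cross_entropy(flat_logits[keep], flat_targets[keep])

print(torch.allclose(loss_ignore, loss_manual))
```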

In [None]:
from metrics import calculate_miou
from torch.nn import functional as F


class CNNTrainer(Trainer):
    @staticmethod
    def step_fn(
        outputs: torch.Tensor, targets: torch.Tensor, void_label: int
    ) -> dict[str, torch.Tensor]:
        """
        Calculate the loss and mIoU given the model outputs and labels for a batch of images.

        Args:
            outputs: Predicted unnormalized logits. Shape (N, K, H, W), N is batch size, K is number of classes
            targets: (N, H, W) where each value is between 0 and K-1
            void_label: Label to ignore in loss calculation and mIoU

        Returns: dict[str, torch.Tensor]
            loss: scalar tensor
            mIoU: scalar tensor
            accuracy: scalar tensor
        """
        # YOUR CODE HERE
        raise NotImplementedError()
        return {
            "loss": loss,
            "mIoU": miou,
            "accuracy": accuracy,
        }

In [None]:
tests.TestStepFn.test_step(CNNTrainer.step_fn)

<a id="chp3-segmentation-models"></a>
# 3. Semantic Segmentation Models

Semantic segmentation models typically use an encoder-decoder architecture. The encoder, also called the _backbone_, is responsible for extracting features from the input image, while the decoder uses these features to produce the final segmentation map. Since semantic segmentation can be seen as pixel-wise classification, the decoder is sometimes also referred to as a classifier. The backbone is usually initialized with the parameters of a pretrained image classification model.

<a id="chp3-1-segmentation-model"></a>
## 3.1 Segmentation Model Architecture

In this section, you will implement the `SemanticSegmentationModel` class, which combines a backbone and a decoder to create a complete semantic segmentation model.

**TODO**: Complete the `forward` method in the `SemanticSegmentationModel` class below. After implementing this method, run the provided test cells for a sanity check.

The `forward` method should:
1. Pass the input through the backbone to extract features, which will be returned as a `FeatureDict`.
2. Feed these features into the classifier to produce the segmentation output.

**Hints**:
- The `backbone` attribute is an instance of `BaseBackbone`, which returns a `FeatureDict` containing feature maps at different scales.
- The `FeatureDict` includes keys like 'out2', 'out4', 'out8', and 'out16', representing feature maps at 1/2, 1/4, 1/8, and 1/16 of the input resolution respectively.
- The `classifier` attribute is an instance of `BaseClassifier`, which takes the `FeatureDict` and produces the final output.
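To get a feeling for the feature map sizes behind these keys, the sketch below computes the expected spatial size at each stride. It assumes that every stride-2 layer rounds up, which holds for padded 3x3 convolutions, but the actual backbones may differ. `feature_sizes` is a hypothetical helper and not part of the exercise code.

```python
import math


def feature_sizes(img_size: int, strides=(2, 4, 8, 16)) -> dict:
    """Expected spatial size per FeatureDict key, assuming ceil rounding."""
    return {f"out{s}": math.ceil(img_size / s) for s in strides}


for img_size in (225, 513):
    print(img_size, feature_sizes(img_size))
```

Under this assumption, a 513x513 input yields a 33x33 map at `out16`, which the classifier must bring back to the input resolution.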

In [None]:
from semantic_segmentation.backbones import BaseBackbone
from semantic_segmentation.classifiers import BaseClassifier
import torch.optim as optim


class SemanticSegmentationModel(nn.Module):
    """
    A model for semantic segmentation tasks.

    Attributes:
        backbone (BaseBackbone): The backbone network for feature extraction.
        classifier (BaseClassifier): The classifier for segmentation.
        config (ExperimentConfig): Configuration parameters.
    """

    def __init__(
        self,
        backbone: BaseBackbone,
        classifier: BaseClassifier,
        config,
    ):
        super().__init__()
        self.backbone = backbone
        self.classifier = classifier
        self.config = config

        if not self.config.train_backbone:
            for param in self.backbone.parameters():
                param.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the model.

        Args:
            x: input tensor of shape (N, C, H, W)

        Returns:
            output segmentation map of shape (N, num_classes, H, W)
        """
        # Passing the input through the backbone and classifier
        # YOUR CODE HERE
        raise NotImplementedError()

    def configure_optimizers(self) -> optim.Optimizer:
        if self.config.train_backbone:
            return optim.Adam(
                params=[
                    {"params": self.backbone.parameters(), "lr": self.config.lr * 0.1},
                    {"params": self.classifier.parameters(), "lr": self.config.lr},
                ],
                lr=self.config.lr,
                weight_decay=self.config.weight_decay,
            )
        else:
            return optim.Adam(
                params=self.classifier.parameters(),
                lr=self.config.lr,
                weight_decay=self.config.weight_decay,
            )

In [None]:
tests.TestSemanticSegmentationModel.test_forward(SemanticSegmentationModel)

<a id="chp3-2-lite-raspp"></a>
## 3.2 Lite R-ASPP Classifier

In this section, you will implement the Lite R-ASPP (Lite Reduced Atrous Spatial Pyramid Pooling) classifier, a lightweight version of the ASPP module covered in Section 3.4.

The Lite R-ASPP module is designed to capture multi-scale context information while keeping the computational cost low, making it suitable for mobile and edge devices.

**TODO**: Complete the `forward` method in the `LiteRASPPClassifier` class below. After implementing this method, run the provided test cells for a sanity check.

Please refer to Figure 10 in the [MobileNetV3 paper](https://arxiv.org/abs/1905.02244) for the detailed architecture of the Lite R-ASPP module.

**Hints**:
- Refer to the paper for the detailed architecture of the Lite R-ASPP module.
- We have already defined the necessary components for you. Your task is to use these components to implement the forward pass.
- Use `x = F.interpolate(x, size=self.img_shape, mode="bilinear", align_corners=False)` for any required upsampling operations.
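The upsampling call from the hint can be tried in isolation. This minimal sketch uses random values, 21 classes, and a 15x15 map as it might come out of a stride-16 backbone at `img_size=225`. These shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

img_shape = (225, 225)                       # (H, W) of the input image
coarse_logits = torch.randn(2, 21, 15, 15)   # (N, num_classes, h, w), toy values

# Bilinear interpolation resizes only the two spatial dimensions.
upsampled = F.interpolate(
    coarse_logits, size=img_shape, mode="bilinear", align_corners=False
)
print(upsampled.shape)
```

Passing `size` instead of `scale_factor` guarantees that the output matches the image resolution exactly, even when the image size is not a multiple of the output stride.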

In [None]:
import torch.nn.functional as F
from semantic_segmentation.backbones import ChannelDict, FeatureDict


class LiteRASPPClassifier(BaseClassifier):
    """
    Classifier using Lite R-ASPP module from MobileNetV3 paper
    The paper uses hidden_size=128 and output_stride=16
    """

    def __init__(self, backbone_channels: ChannelDict, config):
        super().__init__()
        self.img_shape = (config.img_size, config.img_size)
        self.output_stride = config.output_stride

        num_classes = config.num_classes
        hidden_size = (
            128
            if config.classifier_hidden_size is None
            else config.classifier_hidden_size
        )

        if self.output_stride == 16:
            self.high_feature = "out16"
            self.low_feature = "out8"
        elif self.output_stride == 8:
            self.high_feature = "out8"
            self.low_feature = "out4"

        high_channels = backbone_channels[self.high_feature]
        low_channels = backbone_channels[self.low_feature]

        # route 1 (high level feature map)
        self.conv_module = nn.Sequential(
            nn.Conv2d(high_channels, hidden_size, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden_size),
            nn.ReLU(inplace=True),
        )
        self.scale_module = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_channels, hidden_size, kernel_size=1, bias=False),
            nn.Sigmoid(),
        )
        self.high_classifier = nn.Conv2d(hidden_size, num_classes, kernel_size=1)

        # route 2
        self.low_classifier = nn.Conv2d(low_channels, num_classes, kernel_size=1)

        self._init_weights()

    def forward(self, features: FeatureDict) -> torch.Tensor:
        """
        Args:
            features: a dictionary containing the feature maps at different scales

        Returns:
            output segmentation map of shape (N, num_classes, self.img_shape[0], self.img_shape[1])
        """
        high = features[self.high_feature]
        low = features[self.low_feature]
        # YOUR CODE HERE
        raise NotImplementedError()
        return x

In [None]:
tests.TestClassifiers.test_lite_raspp_classifier(LiteRASPPClassifier)

### Semantic Segmentation with MobileNetV3 and Lite R-ASPP

Now that you've implemented the Lite R-ASPP Classifier, let's put it to the test by training a semantic segmentation model using MobileNetV3 as the backbone and Lite R-ASPP as the classifier.

**Experiment Setup:**
- Backbone: MobileNetV3-Large (pre-trained, frozen)
- Classifier: Lite R-ASPP (your implementation)
- Max Steps: 3000

**TODO**: Run the following cell to run the experiment.

**Note**: 
- If implemented correctly, the final **mIoU** for `img_size=225` should be around **0.4** and for `img_size=513` around **0.55**.
- Running this experiment on CPU with `img_size=225` should take around 30 minutes.
- You can refer to Appendix A for performance benchmarks of off-the-shelf models to compare your results.


To view the training progress, run the following command in your terminal:

```bash
tensorboard --logdir=runs
```

In [None]:
from semantic_segmentation.backbones import MobileNetV3Backbone, ResNetBackbone
# Set up the experiment configuration
config = ExperimentConfig(
    exp_name="VOC12",
    backbone_name="mobilenetv3_large",
    classifier_name="lite_raspp",
    img_size=513,
    # img_size=225, # For CPU
    # device="cpu", # For CPU
)

# Create and run the experiment
backbone = MobileNetV3Backbone(config)
decoder = LiteRASPPClassifier(backbone.get_channels(), config)
model = SemanticSegmentationModel(backbone, decoder, config)
trainer = CNNTrainer(model, config)

# Print model information
print_model_info(model)

# Run the experiment
trainer.run_experiment()

<a id="chp3-4-aspp"></a>
## 3.4 Atrous Spatial Pyramid Pooling (ASPP)

Atrous convolution, also known as dilated convolution, is a method to enlarge the receptive field of convolutional layers without increasing the number of parameters or the amount of computation.

In a standard convolution, the dilation rate $p$ is 1. Increasing the dilation rate is equivalent to inserting zeros between the values in the convolutional kernel, effectively increasing the filter size without adding parameters. For example:

- A 3x3 convolution with p=1 (standard convolution) has a 3x3 receptive field
- The same 3x3 convolution with p=2 yields an effective filter size of 5x5
- With p=3, the effective filter size becomes 7x7

This allows the network to capture wider context without losing resolution or increasing the number of parameters.
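The effective filter sizes listed above follow from a simple formula, $k_\text{eff} = k + (k - 1)(p - 1)$. The sketch below evaluates it for a 3x3 kernel, together with the padding that keeps the spatial size unchanged at stride 1. The rates 6, 12, and 18 are the ones the DeepLabV3 paper uses at output stride 16. The function names are introduced here for illustration.

```python
def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Side length of the region covered by a dilated kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding(kernel: int, dilation: int) -> int:
    """Padding that preserves the spatial size at stride 1 (odd kernels)."""
    return dilation * (kernel - 1) // 2


for p in (1, 2, 3, 6, 12, 18):
    print(p, effective_kernel_size(3, p), same_padding(3, p))
```

For a 3x3 kernel, the required padding equals the dilation rate.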

For a more detailed explanation and visualizations of atrous convolution, refer to Chapter 5.1 of this [guide](https://arxiv.org/abs/1603.07285).

In PyTorch, you can create dilated convolutions easily by specifying the `dilation` parameter.
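As a small illustration, the sketch below builds a standard and a dilated 3x3 convolution with plain `nn.Conv2d` and confirms that both have the same number of parameters and preserve the spatial size. The channel counts and input size are arbitrary, and the provided `AtrousConv` helper may be built differently.

```python
import torch
import torch.nn as nn

# Same kernel size and channels; only dilation (and the matching padding) differ.
standard = nn.Conv2d(8, 16, kernel_size=3, padding=1, dilation=1, bias=False)
dilated = nn.Conv2d(8, 16, kernel_size=3, padding=2, dilation=2, bias=False)

x = torch.randn(1, 8, 33, 33)
out_standard = standard(x)
out_dilated = dilated(x)

n_standard = sum(p.numel() for p in standard.parameters())
n_dilated = sum(p.numel() for p in dilated.parameters())

print(out_standard.shape, out_dilated.shape)
print(n_standard, n_dilated)
```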

For this section, you'll implement the Atrous Spatial Pyramid Pooling (ASPP) module, a module that captures multi-scale context information for improved segmentation performance.

**TODO**: Implement the `forward` method in the `ASPP` class below. After implementing this method, run the provided test cells for a sanity check.

The `ASPP` class structure is already provided for you. Your task is to complete the `forward` method. <br/>
Please refer to Section 3.3 of the [DeepLabV3 paper](https://arxiv.org/abs/1706.05587) for the detailed architecture of the ASPP module.

**Hints**:
- The output feature maps produced by different atrous convolutions *don't* need to be resized. 
- However, the global pooling feature map should be upsampled before being concatenated with the atrous convolution outputs.
- Use `F.interpolate` for upsampling operations.

In [None]:
from semantic_segmentation.classifiers import AtrousConv
from typing import Tuple


class ASPP(nn.Module):
    """
    Atrous Spatial Pyramid Pooling module (and image pooling) from DeepLabV3

    Args:
        in_channels: number of input channels
        out_channels: number of output channels
        atrous_rates: a tuple of atrous rates for the atrous convolutions

    As mentioned in the DeepLabV3 paper:
        When output_stride = 16 -> atrous_rates = (6, 12, 18)
        When output_stride = 8 -> atrous_rates = (12, 24, 36)
        out_channels is 256 for both cases in the paper, though it's quite large for MobileNetV3.
    """

    def __init__(self, in_channels: int, atrous_rates: Tuple[int, ...], out_channels: int):
        super(ASPP, self).__init__()

        self.pyramid_convs = nn.ModuleList()

        one_by_one_conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.pyramid_convs.append(one_by_one_conv)

        for rate in atrous_rates:
            self.pyramid_convs.append(AtrousConv(in_channels, out_channels, rate))

        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

        # +2 for the image pool and the 1x1 conv
        self.out_conv = nn.Sequential(
            nn.Conv2d(
                out_channels * (len(atrous_rates) + 2),
                out_channels,
                kernel_size=1,
                bias=False,
            ),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (N, in_channels, H, W)

        Returns:
            (N, out_channels, H, W)
        """
        # YOUR CODE HERE
        raise NotImplementedError()

In [None]:
tests.TestClassifiers.test_aspp(ASPP)

<a id="chp3-5-deeplabv3"></a>
## 3.5 DeepLabV3 Classifier

In this section, you'll implement the DeepLabV3 classifier, which incorporates the ASPP module you created in the previous section.
Please refer to the [DeepLabV3 paper](https://arxiv.org/abs/1706.05587) for the architecture of the DeepLabV3 model.

**TODO**: Complete the `forward` method in the `DeepLabV3Classifier` class below. After implementing this method, run the provided test cells to run a sanity check.

**Hints**:
- The output needs to be bilinearly upsampled to match the input resolution using `F.interpolate`.

In [None]:
from semantic_segmentation.classifiers import determine_atrous_rates


class DeepLabV3Classifier(BaseClassifier):
    """
    Classifier with an architecture similar to DeepLabV3
    When using MobileNetV3 as the backbone, output_stride should be 16 for better performance.
    """

    def __init__(self, backbone_channels: ChannelDict, config):
        super().__init__()

        self.img_shape = (config.img_size, config.img_size)
        self.output_stride = config.output_stride

        num_classes = config.num_classes
        hidden_size = (
            64
            if config.classifier_hidden_size is None
            else config.classifier_hidden_size
        )

        atrous_rates = determine_atrous_rates(self.output_stride)

        if self.output_stride == 16:
            in_channels = backbone_channels["out16"]
        elif self.output_stride == 8:
            in_channels = backbone_channels["out8"]

        self.aspp = ASPP(in_channels, atrous_rates, hidden_size)
        self.classifier = nn.Sequential(
            nn.Conv2d(hidden_size, hidden_size, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden_size),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_size, num_classes, 1),
        )

        self._init_weights()

    def forward(self, features: FeatureDict) -> torch.Tensor:
        if self.output_stride == 16:
            feature = features["out16"]
        elif self.output_stride == 8:
            feature = features["out8"]
        else:
            raise NotImplementedError(
                f"Output stride of {self.output_stride} is not supported"
            )
        # YOUR CODE HERE
        raise NotImplementedError()
        return x

In [None]:
tests.TestClassifiers.test_deeplabv3_classifier(DeepLabV3Classifier)

### Semantic Segmentation with DeepLabV3

Now that you've implemented the DeepLabV3 classifier, let's train it for semantic segmentation. This experiment will use a MobileNetV3 backbone with the DeepLabV3 classifier you just implemented.

**Experiment Setup:**
- Backbone: MobileNetV3-Large (pre-trained, frozen)
- Classifier: DeepLabV3 (your implementation)
- Max Steps: 3000

**TODO**: Run the following cell to run the experiment.

**Note**: 
- If implemented correctly, the final **mIoU** should be around **0.44** for `img_size=225` and around **0.6** for `img_size=513`.
- You can compare these results with the benchmarks provided in Appendix A to see how your implementation performs relative to off-the-shelf models.

In [None]:
# Set up the experiment configuration
config = ExperimentConfig(
    exp_name="VOC12",
    backbone_name="mobilenetv3_large",
    classifier_name="deeplabv3",
    img_size=513,
    output_stride=16,
    classifier_hidden_size=64,
)

# Create and run the experiment
backbone = MobileNetV3Backbone(config)
decoder = DeepLabV3Classifier(backbone.get_channels(), config)
model = SemanticSegmentationModel(backbone, decoder, config)
trainer = CNNTrainer(model, config)

# Print model information
print_model_info(model)

# Run the experiment
trainer.run_experiment()

<a id="chp3-6-deeplabv3-resnet50"></a>
## 3.6 DeepLabV3 with ResNet50 Backbone (Optional)

In this experiment, we'll combine a powerful ResNet50 backbone with the DeepLabV3 architecture. This setup closely mirrors the original DeepLabV3 model (ResNet50 version), with the only difference being the exclusion of an auxiliary classifier.

**Experiment Setup**:
- Backbone: ResNet50 (pre-trained, frozen)
- Classifier: DeepLabV3 (your implementation)
- Max Steps: 3000
- Image size = 513

**TODO**: Run the following cell to run the experiment.

**Note**: 
- The final **mIoU** should be around **0.67**.
- Running this experiment takes roughly 2 hours on a GPU.
- You can refer to Appendix A to compare this result with the performance of similar off-the-shelf models.

In [None]:
# Set up the experiment configuration
config = ExperimentConfig(
    exp_name="VOC12",
    backbone_name="resnet50",
    classifier_name="deeplabv3",
    img_size=513,
    output_stride=8,
    classifier_hidden_size=256,
)

# Create and run the experiment
backbone = ResNetBackbone(config)
decoder = DeepLabV3Classifier(backbone.get_channels(), config)
model = SemanticSegmentationModel(backbone, decoder, config)
trainer = CNNTrainer(model, config)

# Print model information
model = trainer.model
print_model_info(model)

# Run the experiment
trainer.run_experiment()

<a id="appendix"></a>
# A. Appendix: Performance Benchmarks

To give you an idea of what to expect when training semantic segmentation models, here are benchmarks for some off-the-shelf models on the PASCAL VOC 2012 validation set. These results are from pre-trained models available in torchvision, tested at various input resolutions using our own evaluation pipeline. It's important to note that these numbers may differ from those reported in papers due to differences in evaluation methodologies - in particular, we employ much shorter training schedules, less data, and no augmentations.

You can reproduce these benchmarks using `utils.eval_official()`.


### Mean Intersection over Union (mIoU)

| Backbone          | Classifier | Model Size | 65x65  | 129x129 | 225x225 | 513x513 |
|-------------------|------------|------------|--------|---------|---------|---------|
| MobileNetV3_Large | LRASPP     | 3.22M      | 0.0507 | 0.1440  | 0.4077  | 0.5735  |
| ResNet50          | DeepLabV3  | 42.00M     | 0.1753 | 0.4360  | 0.5966  | 0.6929  |
| ResNet101         | DeepLabV3  | 61.00M     | 0.2052 | 0.4514  | 0.6195  | 0.7064  |

### Validation Loss

| Backbone          | Classifier | Model Size | 65x65  | 129x129 | 225x225 | 513x513 |
|-------------------|------------|------------|--------|---------|---------|---------|
| MobileNetV3_Large | LRASPP     | 3.22M      | 2.2028 | 1.1351  | 0.4390  | 0.2519  |
| ResNet50          | DeepLabV3  | 42.00M     | 1.2792 | 0.4682  | 0.2564  | 0.1815  |
| ResNet101         | DeepLabV3  | 61.00M     | 1.1571 | 0.4336  | 0.2342  | 0.1689  |

### Pixel Accuracy

| Backbone          | Classifier | Model Size | 65x65  | 129x129 | 225x225 | 513x513 |
|-------------------|------------|------------|--------|---------|---------|---------|
| MobileNetV3_Large | LRASPP     | 3.22M      | 0.6553 | 0.7263  | 0.8565  | 0.9128  |
| ResNet50          | DeepLabV3  | 42.00M     | 0.7249 | 0.8557  | 0.9125  | 0.9378  |
| ResNet101         | DeepLabV3  | 61.00M     | 0.7392 | 0.8635  | 0.9195  | 0.9435  |


You can refer to these tables throughout the assignment to compare your implementation's performance against these benchmarks. Keep in mind that your results may differ due to our simplified training protocol (frozen backbone, no auxiliary classifier, limited hyperparameter tuning, etc.). However, the tables provide a good reference point for the performance you can expect from models with similar architectures and sizes.

In [None]:
import utils
utils.eval_official([513])

<a id="references"></a>
# B. References

The following papers are either directly used in this assignment or closely related to the concepts we've covered:

1. [ResNet](https://arxiv.org/abs/1512.03385)
2. [DeepLabV1](https://arxiv.org/abs/1412.7062)
3. [DeepLabV2](https://arxiv.org/abs/1606.00915)
4. [DeepLabV3](https://arxiv.org/abs/1706.05587)
5. [MobileNet](https://arxiv.org/abs/1704.04861)
6. [MobileNetV2](https://arxiv.org/abs/1801.04381)
7. [MobileNetV3](https://arxiv.org/abs/1905.02244)

