# Lab 1: Fully Convolutional Networks

In this lab, we explore fully convolutional networks for semantic segmentation.

## Outline

* Run 1: Classification with a ResNet 50 trained on ImageNet
* Run 2: Segmentation with a Fully Convolutional ResNet 50 trained on COCO with VOC labels
* Run 3: Segmentation with a Deeplab v3 trained on COCO with VOC labels

## Setup

Let's install and import the required dependencies.

In [1]:
!pip install -q torch torchvision pandas Pillow matplotlib

import urllib.request  # import the submodule explicitly so urllib.request.urlretrieve is available
from PIL import Image
from torchvision import transforms
import torch
import pandas as pd

### Test Image

We use this image of a dog as the test image.

**Task**: Take a picture with your phone and load it here. You can use the dog image as a reference.

In [22]:
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
urllib.request.urlretrieve(url, filename)
img = Image.open(filename).convert('RGB')
img

## Run 1: ResNet50 for Classification

Let's first predict a class of the image with a ResNet 50 architecture.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/98/ResNet50.png/1024px-ResNet50.png" width="100%">

### ResNet 50

ResNet was first proposed in the paper "Deep Residual Learning for Image Recognition" by [He et al., 2015](https://arxiv.org/abs/1512.03385).

ResNet 50 is a deep convolutional neural network architecture for image classification. It introduces skip connections, or residual connections, which add the input of a block to its output and thereby make very deep networks trainable by mitigating the vanishing gradient problem. With 50 layers, it achieved state-of-the-art results on image classification when it was introduced and remains a common backbone for other vision tasks.
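
The skip connection can be illustrated with a minimal sketch. The class name `TinyResidualBlock` and its layer sizes are made up for illustration. Note that ResNet 50 itself uses three-layer bottleneck blocks, not this simpler two-layer variant.

```python
import torch
from torch import nn


class TinyResidualBlock(nn.Module):
    """Illustrative residual block: output = relu(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        # F(x): two 3x3 convolutions that keep the spatial size and channel count
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the skip connection adds the unchanged input to the learned residual
        return torch.relu(self.body(x) + x)


block = TinyResidualBlock(channels=8)
x = torch.randn(1, 8, 16, 16)
y = block(x)
print(x.shape, y.shape)  # input and output shapes match
```

Because the input is added back unchanged, gradients can flow through the addition directly to earlier layers, even when the convolutional path contributes little.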

### ImageNet

The ImageNet dataset is a vast collection of labeled images designed for training and evaluating computer vision algorithms. It consists of over 14 million images covering a wide range of categories. Initially, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) focused on 1,000 classes, each representing various objects, animals, scenes, and concepts. These classes include everyday items such as "car," "dog," and "tree," as well as more specific entities like "golden retriever," "convertible," and "oak tree." The diversity of classes in ImageNet allows for comprehensive training and testing of machine learning models for image classification, object detection, and other visual recognition tasks.

**Task**:

Initialize a torchvision ResNet 50 with the ImageNet weights (`ResNet50_Weights.IMAGENET1K_V2`) following the [torchvision documentation](https://pytorch.org/vision/main/models/generated/torchvision.models.resnet50.html).

In [25]:
from torchvision.models import resnet50, ResNet50_Weights

#TODO: Check the documentation and create a resnet50 with ImageNet (IMAGENET1K_V2) weights
# weights = ...
# resnet50 = ...
#SOLUTIONSTART
weights = ResNet50_Weights.IMAGENET1K_V2
resnet50 = resnet50(weights=weights)
#SOLUTIONEND

# setting the ResNet to eval() disables dropout and switches BatchNorm layers to inference mode (running statistics)
resnet50.eval()

# print the architecture
resnet50
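
The effect of `eval()` can be observed on small standalone layers. The following is a sketch with toy inputs, independent of the ResNet above. All variable names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Dropout: random zeroing in train mode, identity in eval mode
dropout = nn.Dropout(p=0.5)
ones = torch.ones(1, 1000)
dropout.train()
dropped = dropout(ones)
dropout.eval()
kept = dropout(ones)
print("zeroed fraction in train mode:", (dropped == 0).float().mean().item())
print("identity in eval mode:", torch.equal(kept, ones))

# BatchNorm: batch statistics in train mode, running statistics in eval mode
bn = nn.BatchNorm1d(4)
batch = torch.randn(32, 4) * 5 + 3
bn.train()
out_train = bn(batch)  # normalized with the mean/variance of this batch
bn.eval()
out_eval = bn(batch)   # normalized with the stored running mean/variance
print("train-mode output mean:", out_train.mean().item())
print("eval-mode output mean:", out_eval.mean().item())
```

In eval mode the output for one image no longer depends on the other images in the batch, which is what we want for inference.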

## Data preprocessing according to pretraining data

Since the model was trained on ImageNet, we need to make sure that the size and pixel-value distribution of new images match those of the ImageNet training images.

Here, we resize the image from above and normalize it with the ImageNet statistics.

In [26]:
transform = weights.transforms()
print(transform)

In [27]:
tensor = transform(img).unsqueeze(0) # transform and add batch dimension

print(f"min: {tensor.min():.2f} max:{tensor.max():.2f}, mean:{tensor.mean():.2f}, std:{tensor.std():.2f}")

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.imshow(tensor.numpy()[0].transpose(1,2,0))
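
Normalized tensors contain values outside `[0, 1]`, so `imshow` clips them and the colors look distorted. A minimal sketch of normalizing and undoing the normalization for display follows, assuming the commonly used ImageNet channel means and standard deviations. The helper names `normalize` and `denormalize` and the random stand-in image are illustrative.

```python
import torch

# ImageNet channel statistics (assumed to match the ones printed by the transform above)
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def normalize(image: torch.Tensor) -> torch.Tensor:
    """Standardize each RGB channel of a (3, H, W) image with values in [0, 1]."""
    return (image - mean) / std


def denormalize(normalized: torch.Tensor) -> torch.Tensor:
    """Invert normalize() and clamp to the displayable range [0, 1]."""
    return (normalized * std + mean).clamp(0, 1)


image = torch.rand(3, 224, 224)  # stand-in for an RGB image scaled to [0, 1]
normalized = normalize(image)
restored = denormalize(normalized)
print(f"normalized range: {normalized.min():.2f} to {normalized.max():.2f}")
print("round trip recovers the image:", torch.allclose(restored, image, atol=1e-5))
```

Passing `denormalize(tensor[0]).permute(1, 2, 0)` to `imshow` should then show the image in its original colors.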

List the ImageNet categories:

In [28]:
categories = weights.meta["categories"]
# classes
print(", ".join(categories))

**Task**: Predict an ImageNet class with the `resnet50` instance using the image `tensor`.

In [8]:
with torch.no_grad(): # no gradients are tracked: saves memory and lets us call .numpy() on the result later
    # TODO: predict a class probability using the image tensor and resnet50    
    # logits = ...
    # y_scores = <todo: softmax(logits)>
    
    #SOLUTIONSTART
    logits = resnet50(tensor)
    y_scores = torch.softmax(logits, dim=1)[0]
    #SOLUTIONEND

In [30]:
# plots probabilities for each of the 1000 imagenet classes
fig, ax = plt.subplots()
ax.bar(x=torch.arange(len(categories)).numpy(), height=y_scores.numpy())
ax.set_yscale('log')
ax.set_xlabel("class ID")
ax.set_ylabel("probability")

# lists and sorts classes by probability
df = pd.DataFrame({"proba": y_scores.numpy(), "name": categories})
df.sort_values(by="proba", ascending=False)
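
As an alternative to sorting a full table, the most likely classes can be read off with `torch.topk`. This is a sketch on toy data. The labels and logits below are made up and stand in for the 1000 ImageNet categories and the model output.

```python
import torch

torch.manual_seed(0)
toy_categories = ["cat", "dog", "fox", "owl", "bee"]  # hypothetical labels
toy_logits = torch.randn(1, len(toy_categories))      # stand-in for model output

# softmax turns logits into probabilities that sum to one
toy_scores = torch.softmax(toy_logits, dim=1)[0]

# topk returns the k largest probabilities and their class indices, sorted descending
top_probs, top_ids = torch.topk(toy_scores, k=3)
for prob, idx in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{toy_categories[idx]}: {prob:.3f}")
```

With the real model, replace the toy tensors with `y_scores` and `categories`.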

## Run 2: Segmentation - FCN ResNet50

Let's now move to segmentation with a modified, fully convolutional ResNet 50.

### Fully Convolutional ResNet 50

The paper "Fully Convolutional Networks for Semantic Segmentation" by [Long et al., 2015](https://arxiv.org/abs/1411.4038) shows how to turn a classification network (here ResNet 50) into a fully convolutional network for semantic segmentation. Replacing the fully connected layers with convolutional ones lets the network accept inputs of arbitrary size and produce a dense, pixel-wise prediction map that can be trained end-to-end, which makes it suitable for tasks such as scene parsing.
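
The core idea, that a fully connected layer is equivalent to a 1x1 convolution applied at every spatial location, can be checked with a small sketch. Layer sizes are arbitrary, and this is not how torchvision constructs `fcn_resnet50`. It only illustrates the equivalence.

```python
import torch
from torch import nn

torch.manual_seed(0)

# a "classifier" that maps a 16-dimensional feature vector to 5 class scores
fc = nn.Linear(in_features=16, out_features=5)

# the same weights, reshaped into a 1x1 convolution
conv = nn.Conv2d(in_channels=16, out_channels=5, kernel_size=1)
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(5, 16, 1, 1))
    conv.bias.copy_(fc.bias)

features = torch.randn(1, 16, 7, 9)  # feature map: (batch, channels, height, width)
with torch.no_grad():
    dense_scores = conv(features)  # one score vector per location
    # apply the linear layer to the feature vector at each location for comparison
    per_pixel = fc(features.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(dense_scores.shape)
print("same result:", torch.allclose(dense_scores, per_pixel, atol=1e-5))
```

The convolutional version yields a score map instead of a single vector, which is what makes dense prediction possible.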

### Dataset Microsoft COCO with VOC Labels

The COCO (Common Objects in Context) dataset is a widely used benchmark for object detection, segmentation, and captioning. It contains over 200,000 labeled images across 80 common object categories such as person, car, and dog, with annotations for each object instance. COCO itself does not use the Pascal VOC (Visual Object Classes) label set. The `COCO_WITH_VOC_LABELS_V1` weights used below were trained on a subset of COCO restricted to the 20 categories that also appear in Pascal VOC, so the model predicts 21 classes: background plus those 20 categories.

**Task**: Load a fully convolutional ResNet 50 (`fcn_resnet50`) pretrained on the COCO dataset (`COCO_WITH_VOC_LABELS_V1`). Hint: Check the [torchvision documentation](https://pytorch.org/vision/stable/models.html).

In [31]:
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

# Step 1: Initialize model with the best available weights
# weights = ...
# fcn_resnet50 = ...
#SOLUTIONSTART
weights = FCN_ResNet50_Weights.COCO_WITH_VOC_LABELS_V1
fcn_resnet50 = fcn_resnet50(weights=weights)
fcn_resnet50.eval()
#SOLUTIONEND

# Step 2: Initialize the inference transforms (i.e., a preprocess function)
# preprocess = ...
#SOLUTIONSTART
preprocess = weights.transforms()
#SOLUTIONEND

# Step 3: Apply inference preprocessing transforms
# tensor = preprocess(...)
#SOLUTIONSTART
tensor = preprocess(img).unsqueeze(0)
#SOLUTIONEND 

# Step 4: Use the model and predict
# logits = ...
# probabilities = ...

#SOLUTIONSTART
logits = fcn_resnet50(tensor)["out"]
probabilities = logits.softmax(dim=1)[0].detach()
#SOLUTIONEND

Plot the predicted probability map for each class:

In [32]:
categories = weights.meta["categories"]
print(", ".join(categories))

H, W = 7,3
fig, axs = plt.subplots(H,W, figsize=(2*W,2*H))
for ax, mask, category in zip(axs.reshape(-1), probabilities, categories):
    ax.imshow(mask.numpy())
    ax.axis("off")
    ax.set_title(category)
plt.tight_layout()
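
The per-class probability maps can be collapsed into a single label map by taking the most likely class at each pixel. A minimal sketch with a random stand-in tensor follows. With the real model, `probabilities` from above would take the place of `toy_probabilities`.

```python
import torch

torch.manual_seed(0)
num_classes, height, width = 21, 4, 6
toy_seg_logits = torch.randn(num_classes, height, width)  # stand-in for the model output

# softmax over the class dimension gives one probability map per class
toy_probabilities = toy_seg_logits.softmax(dim=0)

# argmax over the class dimension picks the most likely class ID per pixel
label_map = toy_probabilities.argmax(dim=0)

print(label_map.shape)
print(torch.bincount(label_map.flatten(), minlength=num_classes))  # pixels per class
```

Displaying `label_map` with `imshow` gives the usual color-coded segmentation view.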

Let's examine the network architecture in comparison to the classification ResNet 50 (Run 1).

**Question 1**

Compare the classification ResNet50 from above with the Segmentation FCN-ResNet50. What are the differences?

In [33]:
fcn_resnet50

## Run 3: Segmentation - Deeplab V3

Let's move on from the plain FCN and explore Deeplab v3, which here also uses a ResNet 50 backbone.

Examine the Deeplab v3 architecture:

<img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-28_at_3.07.48_PM.png" width="100%"/>

"Rethinking Atrous Convolution for Semantic Image Segmentation" by [Chen et al., 2017](https://arxiv.org/pdf/1706.05587.pdf) proposes an innovative approach to semantic segmentation using atrous convolution. The paper introduces atrous spatial pyramid pooling (ASPP), which employs multiple atrous rates to capture multi-scale contextual information efficiently. By integrating ASPP into convolutional neural networks, the method achieves significant improvements in semantic segmentation accuracy, particularly in delineating object boundaries and handling objects of varying sizes.

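An atrous convolution spaces the kernel taps `rate` pixels apart, so a small kernel covers a larger region without extra parameters. The covered extent is `k + (k - 1) * (rate - 1)`. A small sketch of this arithmetic follows. The function name is illustrative, and the rates 6, 12, and 18 are used as example values for the parallel ASPP branches.

```python
def effective_kernel_size(kernel_size: int, rate: int) -> int:
    """Spatial extent covered by a kernel whose taps are spaced `rate` pixels apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


# a standard 3x3 convolution (rate 1) and three atrous variants
for rate in (1, 6, 12, 18):
    extent = effective_kernel_size(3, rate)
    print(f"rate {rate:2d}: 3x3 kernel covers {extent}x{extent} pixels")
```

Each branch still uses only nine weights per channel pair, yet the branches see context at very different scales.
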
**Task**: load a deeplab v3 with resnet50 backbone (i.e., `deeplabv3_resnet50`) pretrained on the COCO dataset (`COCO_WITH_VOC_LABELS_V1`) and predict the image. Hint: Check the four steps from the [torchvision documentation](https://pytorch.org/vision/stable/models.html) and make the visualization below work.

In [34]:
from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

#TODO: implement the four steps as above (also in the torchvision documentation) and make the visualization code below work.

#SOLUTIONSTART
weights = DeepLabV3_ResNet50_Weights.COCO_WITH_VOC_LABELS_V1
deeplabv3 = deeplabv3_resnet50(weights=weights)
deeplabv3.eval()

# Step 2: Initialize the inference transforms
transform = weights.transforms()
print(transform)

# Step 3: Apply inference preprocessing transforms
tensor = transform(img).unsqueeze(0)

# Step 4: Use the model and visualize the prediction
prediction = deeplabv3(tensor)["out"][0]
normalized_masks = prediction.softmax(dim=0).detach()
#SOLUTIONEND

In [35]:
H, W = 7,3
fig, axs = plt.subplots(H,W, figsize=(2*W,2*H))
for ax, mask, category in zip(axs.reshape(-1), normalized_masks, categories):
    ax.imshow(mask.numpy())
    ax.axis("off")
    ax.set_title(category)
plt.tight_layout()

**Question 2**

Investigate the deeplabv3 architecture. How are atrous convolutions implemented? What’s the name of the Conv2D hyperparameter? Check the [torch documentation](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) if necessary.

In [36]:
deeplabv3