# Homework 3.1: Dense Prediction
---
In this part, you will study the problem of semantic segmentation. The goal of this assignment is to study, implement, and compare different components of dense prediction models, including **data augmentation**, **backbones**, **classifiers** and **losses**.

This assignment requires training multiple neural networks, so using a **GPU** accelerator is advised.

In [1]:
# Uncomment and run if in Colab
# !mkdir datasets
# !gdown --id 139GsP9CqFCW1LA1Mf3e1gZpWz2uXmfHf -O datasets/tiny-floodnet-challenge.tar.gz
# !tar -xzf datasets/tiny-floodnet-challenge.tar.gz -C datasets
# !rm datasets/tiny-floodnet-challenge.tar.gz
# !gdown --id 1Td3RKkTsBEn1lBULddEmXKHxKhXqz_LC
# !tar -xzf part1_semantic_segmentation.tar.gz
# !rm part1_semantic_segmentation.tar.gz

!pip install pytorch_lightning



## Dataset

We will use a simplified version of the [FloodNet Challenge](http://www.classic.grss-ieee.org/earthvision2021/challenge.html).

Compared to the original challenge, our version doesn't have the difficult (and rare) "flooded" labels, and the images are downsampled.

<img src="https://i.imgur.com/RZuVuVp.png" />

## Assignments and grading


- **Part 1. Code**: fill in the empty gaps (marked with `#TODO`) in the code of the assignment (34 points):
    - `dataset.py` -- 4 points
    - `model.py` -- 22 points
    - `loss.py` -- 6 points
    - `train.py` -- 2 points
- **Part 2. Train and benchmark** the performance of the required models (6 points):
    - All 6 checkpoints are provided -- 3 points
    - Checkpoints have > 0.5 accuracy -- 3 points
- **Part 3. Report** your findings (10 points)
    - Each task -- 2.5 points

- **Total score**: 50 points.

For detailed grading of each coding assignment, please refer to the comments inside the files. Please use the materials provided during the seminar and the lecture to complete the coding part, as this will help you further familiarize yourself with PyTorch. Copy-pasting code from Google Search will be penalized.

In part 2, you should upload all your pre-trained checkpoints to your personal Google Drive, grant public access, and provide a file ID, following the instructions in the notebook.

Note that for each task in part 3 to count towards your final grade, you should complete the corresponding tasks in part 2.

For example, if you are asked to compare Model X and Model Y, you should provide the checkpoints for these models in your submission, and their accuracies should be above the minimum threshold.

## Part 1. Code


### `dataset.py`
**TODO: implement and apply data augmentations**

You'll need to study a popular augmentation library, [Albumentations](https://albumentations.ai/), and implement the requested augmentations. Remember that geometric augmentations need to be applied to both images and masks at the same time; Albumentations has [native support](https://albumentations.ai/docs/getting_started/mask_augmentation/) for that.
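To see why the image and the mask must share the same random parameters, here is a minimal NumPy sketch of a paired horizontal flip. This is not Albumentations itself, and `paired_random_flip` is a hypothetical helper used only for illustration. Albumentations handles the pairing for you when the image and mask are passed to the same transform call as `image=` and `mask=`.

```python
import numpy as np

def paired_random_flip(image, mask, rng, p=0.5):
    """Horizontally flip an (H, W, C) image and its (H, W) mask with one shared coin toss."""
    if rng.random() < p:
        # The same decision is applied to both arrays, so pixels stay aligned with labels.
        image = image[:, ::-1].copy()
        mask = mask[:, ::-1].copy()
    return image, mask

rng = np.random.default_rng(0)
mask = np.array([[0, 1, 2],
                 [2, 1, 0]])
# Toy image whose pixel values encode the label, so alignment is easy to check.
image = np.stack([mask * 10] * 3, axis=-1)

aug_image, aug_mask = paired_random_flip(image, mask, rng, p=1.0)
print(aug_mask.tolist())  # [[2, 1, 0], [0, 1, 2]]
```

Flipping the image and the mask with two independent random draws would leave about half of the training pairs misaligned, which is why the decision is made once and applied to both.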

### `model.py`
**TODO: Implement the required models.**

Typically, all segmentation networks consist of an encoder and decoder. Below is a scheme for a popular DeepLab v3 architecture:

<img src="https://i.imgur.com/cdlkxvp.png" />

The encoder consists of a convolutional backbone, typically with extensive use of dilated (atrous) convolutions, and a head, which helps to further enlarge the receptive field. As you can see, the general idea for the encoders is to have as large a receptive field as possible.
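The effect of dilation on the receptive field can be checked with a little arithmetic. The sketch below assumes a stack of stride-1 convolutions, where each layer adds `(kernel_size - 1) * dilation` pixels to the receptive field; `receptive_field` is a hypothetical helper, not part of the assignment code.

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convs given (kernel_size, dilation) pairs."""
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

# Three ordinary 3x3 convolutions versus three 3x3 convolutions with growing dilation.
plain = receptive_field([(3, 1), (3, 1), (3, 1)])
atrous = receptive_field([(3, 1), (3, 2), (3, 4)])
print(plain, atrous)  # 7 15
```

Both stacks have the same number of parameters, yet the dilated one sees a region roughly twice as wide.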

The decoder either upsamples with convolutions (similarly to the scheme above, or to UNets) or simply interpolates the outputs of the encoder.
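As a rough illustration of the simplest decoder, the sketch below upsamples coarse class scores back to the input resolution with bilinear interpolation. The number of classes and the tensor sizes are placeholder values chosen for the example, not values prescribed by the assignment.

```python
import torch
import torch.nn.functional as F

num_classes = 8  # hypothetical value, for illustration only
coarse_logits = torch.randn(1, num_classes, 32, 32)  # stand-in for the encoder/head output

# Interpolate the class scores up to the (assumed) 256x256 input resolution.
logits = F.interpolate(coarse_logits, size=(256, 256), mode="bilinear", align_corners=False)
pred = logits.argmax(dim=1)  # per-pixel class indices

print(logits.shape, pred.shape)
```

Interpolating the logits (rather than the argmax labels) keeps the class boundaries smooth after upsampling.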

In this assignment, you will need to implement **UNet** and **DeepLab** models. Example UNet looks like this:

<img src="https://i.imgur.com/RJyO1rV.png" />

For the **DeepLab** model, we will use three backbone variants: **ResNet18**, **VGG11 (with BatchNorm)**, and **MobileNet v3 (small).** Use `torchvision.models` to obtain pre-trained versions of these backbones and simply extract their convolutional parts. To familiarize yourself with the **MobileNet v3** model, follow this [link](https://paperswithcode.com/paper/searching-for-mobilenetv3).

We will also use an **Atrous Spatial Pyramid Pooling (ASPP)** head. Its scheme can be seen in the DeepLab v3 architecture above. ASPP is one of the blocks that greatly increases the receptive field of the model, and hence boosts the model's performance. For more details, you can refer to this [link](https://paperswithcode.com/method/aspp).
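The core idea of ASPP, parallel convolutions with different dilation rates whose outputs are concatenated and projected, can be sketched as follows. This is a deliberately reduced illustration, not the graded implementation: `TinyASPP` and its dilation rates are assumptions, and it omits the normalization, activations, 1x1 branch and image-level pooling branch shown in the DeepLab v3 scheme.

```python
import torch
import torch.nn as nn

class TinyASPP(nn.Module):
    """Reduced ASPP sketch: parallel dilated 3x3 convs, concatenated and projected."""

    def __init__(self, in_channels, out_channels, dilations=(1, 6, 12)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.project = nn.Conv2d(out_channels * len(dilations), out_channels, kernel_size=1)

    def forward(self, x):
        # Each branch sees the same input at a different effective scale.
        x = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.project(x)

aspp = TinyASPP(in_channels=512, out_channels=64)
out = aspp(torch.randn(1, 512, 16, 16))
print(out.shape)  # torch.Size([1, 64, 16, 16])
```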

### `loss.py`
**TODO: implement test losses.**

For validation, we will use three metrics. 
- Mean intersection over union: **mIoU**,
- Mean class recall: **mRecall**,
- **Accuracy**.

To calculate **IoU**, use this formula for binary segmentation masks for each class, and then average w.r.t. all classes:

$$ \text{IoU} = \frac{ \text{area of intersection} }{ \text{area of union} } = \frac{ \| \hat{m} \cap m  \| }{ \| \hat{m} \cup m \| }, \quad \text{$\hat{m}$ — predicted binary mask},\ \text{$m$ — target binary mask}.$$

For **mRecall**, compute the recall of each class with the following formula and then average over all classes:

$$
    \text{Recall} = \frac{ \| \hat{m} \cap m \| }{ \| m \| }
$$

And **accuracy** is the fraction of correctly classified pixels in the image.

Generally, we want our models to optimize accuracy, since high accuracy implies that the model makes few mistakes. However, most segmentation problems have imbalanced classes, so models tend to underfit the rare classes. Therefore, we also need to measure the mean performance of the model across all classes (mean IoU or mean class recall). In practice, these metrics (rather than accuracy) are the go-to benchmarks for segmentation models.
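A toy example makes the effect of class imbalance concrete. The NumPy sketch below computes the three metrics for a prediction that ignores a rare class entirely. The function name and the handling of absent classes (they are skipped) are assumptions made for this illustration; the required behavior is specified in the comments of `loss.py`.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    """Return (mIoU, mRecall, accuracy) for integer label maps of equal shape."""
    ious, recalls = [], []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        intersection = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        if union > 0:           # assumption: skip classes absent from both masks
            ious.append(intersection / union)
        if target_c.sum() > 0:  # assumption: skip classes absent from the target
            recalls.append(intersection / target_c.sum())
    accuracy = (pred == target).mean()
    return float(np.mean(ious)), float(np.mean(recalls)), float(accuracy)

# Nine pixels of class 0 and one pixel of the rare class 1; the model predicts class 0 everywhere.
target = np.array([0] * 9 + [1])
pred = np.zeros(10, dtype=int)

miou, mrecall, acc = segmentation_metrics(pred, target, num_classes=2)
print(f"mIoU={miou:.2f}, mRecall={mrecall:.2f}, accuracy={acc:.2f}")
# mIoU=0.45, mRecall=0.50, accuracy=0.90
```

Accuracy looks high even though the rare class is never detected, while mIoU and mRecall drop sharply.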

### `train.py`
**TODO: define optimizer and learning rate scheduler.**

You need to experiment with different optimizers and schedulers and pick the one of each that works best. Since the grading will be partially based on the validation performance of your models, we strongly advise doing some preliminary experiments and picking the configuration with the best results.
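As a starting point for such experiments, here is a minimal plain-PyTorch sketch of one optimizer and scheduler pair. Adam with cosine annealing is only an example choice, not a recommendation from the assignment, and the linear layer is a stand-in for the segmentation network. In PyTorch Lightning, the same two objects are typically returned from the module's `configure_optimizers` method.

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

lrs = []
for epoch in range(10):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # a real loop would compute a loss and call backward() first
    scheduler.step()   # decay the learning rate once per epoch

print([f"{lr:.2e}" for lr in lrs])
```

Logging the learning rate like this is a cheap way to confirm that the schedule behaves as intended before launching a long run.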

## Part 2. Train and benchmark

In this part of the assignment, you need to train the following models and measure their training time:
- **UNet** (with and without data augmentation),
- **DeepLab** with **ResNet18** backbone (with **ASPP** = True and False),
- **DeepLab** with the remaining backbones you implemented (with **ASPP** = True).

To get the full mark for this assignment, all the required models should be trained (and their checkpoints provided) and reach an accuracy of at least 0.5.

After the models are trained, evaluate their inference time on both GPU and CPU.

Example training and evaluation code is below.

In [1]:
%load_ext autoreload
%autoreload 2

In [6]:
import pytorch_lightning as pl
from part1_semantic_segmentation.train import SegModel
from pytorch_lightning.loggers import WandbLogger
import time
import torch



def define_model(model_name: str, 
                 backbone: str, 
                 aspp: bool, 
                 augment_data: bool, 
                 optimizer: str, 
                 scheduler: str, 
                 lr: float, 
                 checkpoint_name: str = '', 
                 batch_size: int = 16):
    assignment_dir = 'part1_semantic_segmentation'
    experiment_name = f'{model_name}_{backbone}_augment={augment_data}_aspp={aspp}'
    model_name = model_name.lower()
    backbone = backbone.lower() if backbone is not None else backbone
    
    model = SegModel(
        model_name, 
        backbone, 
        aspp, 
        augment_data,
        optimizer,
        scheduler,
        lr,
        batch_size, 
        data_path='datasets/tiny-floodnet-challenge', 
        image_size=256)

    if checkpoint_name:
        model.load_state_dict(torch.load(f'{assignment_dir}/logs/{experiment_name}/{checkpoint_name}')['state_dict'])
    
    return model, experiment_name

def train(model, experiment_name, use_gpu):
    assignment_dir = 'part1_semantic_segmentation'

    logger = WandbLogger(project='FloodNet', entity='albly')#pl.loggers.TensorBoardLogger(save_dir=f'{assignment_dir}/logs', name=experiment_name)

    checkpoint_callback = pl.callbacks.ModelCheckpoint(
        monitor='mean_iou',
        dirpath=f'{assignment_dir}/logs/{experiment_name}',
        filename='{epoch:02d}-{mean_iou:.3f}',
        mode='max')
    
    trainer = pl.Trainer(
        max_epochs=100, 
        gpus=1 if use_gpu else None, 
        benchmark=True, 
        check_val_every_n_epoch=5, 
        logger=logger, 
        callbacks=[checkpoint_callback])

    time_start = time.time()
    
    trainer.fit(model)
    
    if use_gpu:
        # wait for pending CUDA kernels; skipped on CPU, where no CUDA context may exist
        torch.cuda.synchronize()
    time_end = time.time()
    
    training_time = (time_end - time_start) / 60
    
    return training_time

In [7]:
model, experiment_name = define_model(
    model_name='UNet',
    backbone=None,
    aspp=None,
    augment_data=False,
    optimizer='', # use these options to experiment
    scheduler='', # with optimizers and schedulers
    lr=1.) # experiment to find the best LR
#wandb.watch(model)
training_time = train(model, experiment_name, use_gpu=False)

print(f'Training time: {training_time:.3f} minutes')

GPU available: True, used: False
TPU available: False, using: 0 TPU cores

  | Name | Type | Params
------------------------------
0 | net  | UNet | 16.5 M
------------------------------
16.5 M    Trainable params
0         Non-trainable params
16.5 M    Total params
65.928    Total estimated model params size (MB)
Epoch 4:  36%|███▋      | 20/55 [04:27<07:48, 13.38s/it, loss=0.935, v_num=4bdi, mean_iou=0.0162, mean_class_rec=0.125, mean_acc=0.129, train_loss=0.939]

After training, the loss curves and validation images with their segmentation masks can be viewed using the TensorBoard extension:

In [8]:
%load_ext tensorboard
%tensorboard --logdir part1_semantic_segmentation/logs


In [9]:
import wandb
#wandb.init(project='FloodNet', entity='albly')


Inference time can be measured via the following function:

In [11]:
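def calc_inference_time(model, device, input_shape=(1000, 750), num_iters=100):
    timings = []

    with torch.no_grad():  # gradients are not needed when timing inference
        for i in range(num_iters):
            x = torch.randn(1, 3, *input_shape).to(device)
            time_start = time.time()

            model(x)

            # CUDA kernels run asynchronously; wait for them before stopping the clock
            if torch.device(device).type == 'cuda':
                torch.cuda.synchronize()
            time_end = time.time()

            timings.append(time_end - time_start)

    return sum(timings) / len(timings) * 1e3  # mean time per frame, in ms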


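# define_model returns (model, experiment_name); only the model is needed here
model, _ = define_model(
    model_name='UNet',  # same spelling as in training, since it is part of the checkpoint path
    backbone=None,
    aspp=None,
    augment_data=False,
    optimizer='',  # required by define_model; use the same values as for training
    scheduler='',
    lr=1.,
    checkpoint_name=<TODO>)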

inference_time = calc_inference_time(model.eval().cpu(), 'cpu')
# inference_time = calc_inference_time(model.eval().cuda(), 'cuda')

print(f'Inference time (per frame): {inference_time:.3f} ms')


Your trained weights are available in the `part1_semantic_segmentation/logs` folder. Inside, each experiment directory contains a checkpoint file whose name follows the pattern `{epoch:02d}-{mean_iou:.3f}.ckpt`. Make sure that your models satisfy the accuracy requirements, then upload them to your personal Google Drive. Provide the file IDs and checksums below. Use `!md5sum <PATH>` to compute the checksums.

To make sure that provided ids are correct, try running `!gdown --id <ID>` command from this notebook.
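If `md5sum` is not available in your environment, the same checksum can be computed with Python's standard library. The helper below is a hypothetical convenience function, not part of the assignment code; it reads the file in chunks so that large checkpoints do not have to fit in memory.

```python
import hashlib

def md5_checksum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_checksum(<PATH>)` on a checkpoint should produce the same hex string that `!md5sum <PATH>` prints.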

In [None]:
checkpoint_ids = {
    'UNet_None_augment=False_aspp=None': (<ID>, <CHECKSUM>), # TODO
    'UNet_None_augment=True_aspp=None': None, # TODO
    'DeepLab_ResNet18_augment=True_aspp=False': None, # TODO
    'DeepLab_ResNet18_augment=True_aspp=True': None, # TODO
    'DeepLab_VGG11_bn_augment=True_aspp=True': None, # TODO
    'DeepLab_MobileNet_v3_small_augment=True_aspp=True': None, # TODO
}

## Part 3. Report

You should have obtained 6 different models, which we will use for the comparison and evaluation. When asked to visualize specific loss curves, simply configure these plots in TensorBoard, screenshot, store them in the `report` folder, and load into Jupyter markdown:

`<img src="./part1_semantic_segmentation/report/<screenshot_filename>"/>`

If you have problems loading these images, try uploading them [here](https://imgur.com) and using the link as `src`. Do not forget to include the raw files in the `report` folder anyway.

You should make sure that your plots satisfy the following requirements:
- Each plot has a title,
- If there are multiple curves on one plot (or dots on the scatter plot), the plot legend should also be present,
- If the plot is not obtained using TensorBoard (Task 3), the axes should have labels and ticks.

#### Task 1.
Visualize training loss and validation loss curves for UNet trained with and without data augmentation. What are the differences in the behavior of these curves between these experiments, and what are the reasons?

TODO

#### Task 2.
Visualize training and validation loss curves for ResNet18 trained with and without ASPP. Which model performs better?

TODO

#### Task 3.
Compare **UNet** with augmentations and **DeepLab** with all backbones (only experiments with **ASPP**). To do that, put these models on three scatter plots. For the first plot, the x-axis is **training time** (in minutes), for the second plot, the x-axis is **inference time** (in milliseconds), and for the third plot, the x-axis is **model size** (in megabytes). The size of each model is printed by PyTorch Lightning. For all plots, the y-axis is the best **mIoU**. To clarify, each of the **4** requested models should be a single dot on each of these plots.

Which models are the most efficient with respect to each metric on the x-axes? For each of the evaluated models, rate its performance using its validation metrics, training and inference time, and model size. Also, for each model, explain what its advantages are and how its performance could be improved.

TODO

#### Task 4.

Pick the best model according to **mIoU** and look at the visualized predictions on the validation set in TensorBoard. For each segmentation class, find good examples (if they are available) and failure cases. Provide the zoomed-in examples and their analysis below. Please do not attach full validation images, only the areas of interest, which you should crop manually.

TODO