# Lab 9 Dense Tasks: Segmentation + Object Detection

## Part 1: Semantic Segmentation

This first part covers image segmentation, in particular semantic segmentation, where we want to classify each pixel in the image into one of a set of predefined classes. Other types of segmentation are instance segmentation, where we want to distinguish individual objects, and panoptic segmentation, where we want to do both.

You may want to copy this notebook to Google Colab to use a GPU for training. You should then select the GPU/TPU runtime within Google Colab.

To measure the quality of our outputs, we need special metrics and therefore will use `torchmetrics`.

In [None]:
!pip install torchmetrics torchmetrics[detection] ultralytics

In [None]:
import torch
from torchvision.transforms import v2 as transforms
from torch.nn import functional as F
import torchvision
import torchmetrics
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import os, random, re

### Data Preparation

Let's start by downloading the data. We will be using the [Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/).

In [None]:
!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz
!wget https://thor.robots.ox.ac.uk/~vgg/data/pets/annotations.tar.gz
!tar -xf images.tar.gz
!tar -xf annotations.tar.gz

The data is downloaded into the notebook's working directory. You can see it in Files (to the left of the notebook) in Google Colab. The images are in the folder ```images``` and the masks in ```annotations/trimaps```.

In the annotations you will notice a file named ```trainval.txt``` which contains the names of images typically used for training and validation, and ```test.txt``` which contains the images typically used for testing. Let's separate the data into training, validation and testing based on these files.
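To make the file format concrete, here is a minimal sketch of how one line of these list files could be parsed. It assumes the layout described in the dataset's README (`Image CLASS-ID SPECIES BREED-ID`, with species 1 for cats and 2 for dogs). The sample lines and the helper name `parse_split_line` are illustrative and not part of the lab code.

```python
def parse_split_line(line):
    """Parse one line of trainval.txt / test.txt into its four fields."""
    name, class_id, species, breed_id = line.split()
    return {
        'name': name,                # image file name without extension
        'class_id': int(class_id),   # breed index over all breeds
        'species': 'cat' if species == '1' else 'dog',  # assumed: 1 = cat, 2 = dog
        'breed_id': int(breed_id),   # breed index within the species
    }

# illustrative lines in the assumed format
sample_lines = ['Abyssinian_100 1 1 1', 'american_bulldog_100 2 2 1']
records = [parse_split_line(line) for line in sample_lines]
cats_only = [r['name'] for r in records if r['species'] == 'cat']
print(cats_only)
```

The dataset class below applies the same idea: it keeps the first field as the file name and filters on the third field to select cats only.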

### Define Dataset

In [None]:
class OxfordPetDataset:
    def __init__(self, root, fold, transform=None):
        assert fold in ['train', 'val', 'test']
        self.root = root
        fname = 'trainval.txt' if fold in ('train', 'val') else 'test.txt'
        self.files = [
            # get the filenames
            line.split()[0] for line in open(os.path.join(root, 'annotations', fname)).readlines()
            # use only cats to keep the dataset smaller and faster to train
            if line.split()[2] == '1'
        ]
        # filter images without labels
        self.files = [f for f in self.files if os.path.exists(os.path.join(root, 'annotations', 'xmls', f + '.xml'))]
        if fold in ['train', 'val']:
            random.seed(42)
            random.shuffle(self.files)
            i = int(0.8*len(self.files))
            self.files = self.files[:i] if fold == 'train' else self.files[i:]
        self.transform = transform

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        fname = self.files[i]
        image = torchvision.tv_tensors.Image(torchvision.io.decode_image(os.path.join(self.root, 'images', fname + '.jpg')))
        # the mask has 3 labels [pet (1), background (2), and border (3)], we merge pet and border.
        mask = torchvision.io.decode_image(os.path.join(self.root, 'annotations', 'trimaps', fname + '.png'), 'GRAY')
        mask = (mask == 1) | (mask == 3)
        mask = torchvision.tv_tensors.Mask(mask)
        xml = open(os.path.join(self.root, 'annotations', 'xmls', fname + '.xml')).read()
        bboxes = [{key: int(value) for key, value in re.findall(r'<(\w+)>(\d+)</\1>', bbox)}
            for bbox in re.findall(r'<bndbox>(.*?)</bndbox>', xml)]
        bboxes = [(bbox['xmin'], bbox['ymin'], bbox['xmax'], bbox['ymax']) for bbox in bboxes]
        bboxes = torchvision.tv_tensors.BoundingBoxes(bboxes, format='XYXY', canvas_size=image.shape[1:])
        if self.transform:
            image, mask, bboxes = self.transform(image, mask, bboxes)
        return image, mask, bboxes

In [None]:
train_transform = transforms.Compose([
    transforms.ToImage(),
    transforms.Resize((256, 256)),
    transforms.RandomCrop((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.RandomAffine(30, (0.2, 0.2), (0.8, 1.2))], 0.5),
    transforms.ToDtype(torch.float32, True),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

train_dataset = OxfordPetDataset('.', 'train', train_transform)

Visualize examples from the data:

In [None]:
image, mask, bboxes = train_dataset[0]
plt.subplot(1, 2, 1)
for x1, y1, x2, y2 in bboxes:
    plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1, fill=False, edgecolor='r'))
# we previously normalized images (x-μ)/σ and so we need to un-normalize them x*σ+μ for display
image = image.permute(1, 2, 0)*torch.tensor([[[0.229, 0.224, 0.225]]]) + torch.tensor([[[0.485, 0.456, 0.406]]])
plt.imshow(image)
plt.subplot(1, 2, 2)
plt.imshow(mask[0], cmap='gray')
plt.show()

Define transformations to be applied to the images and data loaders

In [None]:
# Define transformations to be applied to validation data
val_transform = transforms.Compose([
    transforms.ToImage(),
    transforms.Resize((256, 256)),
    transforms.CenterCrop((224, 224)),
    transforms.ToDtype(torch.float32, True),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

# Define dataset for validation data
val_dataset = OxfordPetDataset('.', 'val', val_transform)

In [None]:
# hyperparameters
batch_size = 8
num_workers = 4

In [None]:
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size, shuffle=True, collate_fn=lambda x: x, num_workers=num_workers)
val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size, collate_fn=lambda x: x, num_workers=num_workers)

### Segmentation Model

```
┌────────┐                            ┌────────┐
│        │  ┌────┐            ┌────┐  │        │
│        │  │    │  ┌─┐  ┌─┐  │    │  │        │
│        │  │    │  └─┘  └─┘  │    │  │        │
│        │  └────┘            └────┘  │        │
└────────┘                            └────────┘
|______________________||______________________|
         Encoder              Decoder
        (resnet18)            (ours)
```

We are going to use `resnet18` as the encoder (so training is faster), and we will then build a decoder.

To build the decoder, you must build 5 blocks, each composed of a convolution, a 2x upsample and a ReLU. The convolutions of the five blocks should have 512/256/128/64/32 output channels, respectively.
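As a hint, a single block of this kind might look like the sketch below. The helper name `decoder_block`, the 3x3 kernel with `padding=1`, and the nearest-neighbour `Upsample` are assumptions made for illustration; the lab only fixes the order convolution, upsample, ReLU and the channel counts.

```python
import torch

def decoder_block(out_channels):
    # convolution -> 2x upsample -> ReLU, in the order described above
    return torch.nn.Sequential(
        torch.nn.LazyConv2d(out_channels, 3, padding=1),  # input channels are inferred on first call
        torch.nn.Upsample(scale_factor=2),                # doubles height and width
        torch.nn.ReLU(),
    )

block = decoder_block(256)
x = torch.zeros(1, 512, 7, 7)  # stand-in for a 7x7 latent map with 512 channels
y = block(x)
print(y.shape)
```

Each block doubles the spatial size, so five of them take a 7x7 map back to 224x224 (7 · 2⁵ = 224).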

In [None]:
class MySegNet(torch.nn.Module):
    def __init__(self, out_channels):
        super().__init__()
        # truncated resnet18 (without the classifier head)
        self.encoder = ...
        # freeze encoder to make training faster
        for param in self.encoder.parameters():
            param.requires_grad = False
        # five blocks of convolution+upsample+relu with filter sizes 512/256/128/64/32
        # use lazy convolutions
        ...
        ...
        self.out = torch.nn.LazyConv2d(out_channels, 3, padding=1)

    def forward(self, x):
        # apply your encoder-decoder
        ...
        return x

If your network is correct, the following code should output:

```torch.Size([8, 10, 224, 224])```

In [None]:
model = MySegNet(10)
out = model(torch.zeros(8, 3, 224, 224))
print(out.shape)

### U-Net:

To avoid upsampling artifacts, U-Net introduces skip connections between each decoder layer and the corresponding activation map from the encoder.

```
            ┌──────────────────────────────────────────┐             
            │                                          │             
┌─────────┐ │           ┌──────────────────┐           │  ┌─────────┐
│         │ │  ┌──────┐ │                  │  ┌──────┐ │  │         │
│         │ │  │      │ │  ┌───┐     ┌───┐ ▼  │      │ ▼  │         │
│         ├─┴─►│      ├─┴─►│   ├────►│   ├─⊕─►│      ├─⊕─►│         ├
│         │    │      │    └───┘     └───┘    │      │    │         │
│         │    └──────┘                       └──────┘    │         │
└─────────┘                                               └─────────┘
|________________________________||_________________________________|
               Encoder                         Decoder
              (backbone)
```

Let's change your previous model, but now add concatenations between your decoder activations and the respective encoder layers from resnet-18.
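The effect of one such concatenation on tensor shapes can be sketched as follows. The tensors are hypothetical stand-ins for a decoder output and the encoder activation at the same resolution, and the channel counts are chosen only for illustration.

```python
import torch

decoder_act = torch.zeros(1, 256, 14, 14)  # hypothetical decoder output
encoder_act = torch.ones(1, 256, 14, 14)   # hypothetical encoder activation, same height/width

# concatenate along the channel dimension (dim=1); height and width must match
merged = torch.cat((decoder_act, encoder_act), dim=1)

# a lazy convolution infers its input channels (here 512) on the first call
conv = torch.nn.LazyConv2d(128, 3, padding=1)
out = conv(merged)
print(merged.shape, out.shape)
```

Because the decoder uses lazy convolutions, the extra channels introduced by the concatenation do not require changing the layer definitions, only the forward pass.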

In [None]:
class MyUNet(MySegNet):
    def forward(self, x):
        # Copy your previous forward pass, but now add concatenations between the output of each decoder layer
        # and the respective output of the encoder layer.
        # To make it easier, we build a list with the outputs of the encoder layers.
        previous_activation_maps = []
        for layer in self.encoder.children():
            prev_shape = x.shape
            x = layer(x)
            if x.shape[2:] != prev_shape[2:]:
                # width/height changed, we want to use this layer as a skip connection
                previous_activation_maps.append(x)
        # HERE
        ...
        ...
        return x

If your network is correct, the following code should output:

```torch.Size([8, 10, 224, 224])```

In [None]:
model = MyUNet(10)
out = model(torch.zeros(8, 3, 224, 224))
print(out.shape)

For a more complex U-Net implementation, consider using [https://github.com/milesial/Pytorch-UNet](https://github.com/milesial/Pytorch-UNet).

### Train the model

Start by defining the model, the optimizer, loss and metric to evaluate the model.

In this case, we will use the Jaccard Index (also known as intersection over union, IoU) as the metric. Since PyTorch does not have this metric implemented, we will use the ```torchmetrics``` package. [Click here](https://lightning.ai/docs/torchmetrics/stable/) to find out more about the metrics available on torchmetrics.
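To build intuition for what the metric measures, the following sketch computes the Jaccard Index of the foreground class by hand on two small toy masks. It is only an illustration of the definition (intersection divided by union), not a replacement for the `torchmetrics` implementation, which also handles batching and accumulation.

```python
import torch

pred = torch.tensor([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 0, 0]], dtype=torch.bool)
target = torch.tensor([[0, 1, 1, 0],
                       [0, 1, 1, 0],
                       [0, 0, 0, 0]], dtype=torch.bool)

intersection = (pred & target).sum().item()  # pixels that are foreground in both masks
union = (pred | target).sum().item()         # pixels that are foreground in either mask
iou = intersection / union
print(intersection, union, iou)
```

Here the masks share 2 pixels out of 6 covered in total, giving an IoU of 1/3. A perfect prediction gives 1 and disjoint masks give 0.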

In [None]:
# device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('using device:', device)

In [None]:
# Use one of the models that you have built. You may compare both.
model = ...  # TODO
model.to(device)  # put model in GPU

# Define optimizer (e.g., AdamW)
optimizer = ...
epochs = 10

# Define loss (e.g., binary cross entropy)
loss_fn = ...

# Define metric from torchmetrics (e.g. Jaccard Index, Dice Coefficient, Pixel Accuracy)
metric = ...
metric.to(device)

One training/validation epoch:

In [None]:
def one_epoch(model, optimizer, dataloader, is_training):
  model.train() if is_training else model.eval()
  avg_loss = avg_metric = 0
  for batch in dataloader:
    images = torch.stack([b[0] for b in batch]).to(device)
    masks = torch.stack([b[1] for b in batch]).to(device)
    # HERE: do the forward and backward pass
    ...
    ...
    avg_loss += loss.item() / len(dataloader)
    avg_metric += metric(torch.sigmoid(preds), masks).item() / len(dataloader)
  # return only after all batches have been processed
  return avg_loss, avg_metric

Now implement the training cycle:

In [None]:
train_history = {'loss': [], 'metric': []}
val_history = {'loss': [], 'metric': []}
for epoch in range(epochs):
  # compute train
  avg_loss, avg_metric = one_epoch(model, optimizer, train_dataloader, True)
  train_history['loss'].append(avg_loss)
  train_history['metric'].append(avg_metric)
  print(f'Epoch {epoch+1:2d}/{epochs} - Train loss: {avg_loss} - Train metric: {avg_metric}')
  # compute validation statistics
  avg_loss, avg_metric = one_epoch(model, optimizer, val_dataloader, False)
  val_history['loss'].append(avg_loss)
  val_history['metric'].append(avg_metric)
  print(f'Epoch {epoch+1:2d}/{epochs} - Val   loss: {avg_loss} - Val   metric: {avg_metric}')

Plot metrics and loss on training and validation sets obtained during the training process

In [None]:
plt.subplot(2, 1, 1)
plt.title('Cross Entropy Loss')
plt.plot(train_history['loss'], label='train')
plt.plot(val_history['loss'], label='val')
plt.legend(loc='best')

plt.subplot(2, 1, 2)
plt.title('Jaccard Index')
plt.plot(train_history['metric'], label='train')
plt.plot(val_history['metric'], label='val')

plt.tight_layout()
plt.legend(loc='best')
plt.show()

### Visual inspection

Visualize the results on only a few images.

In [None]:
model.eval()

batch = next(iter(val_dataloader))
images = torch.stack([b[0] for b in batch]).to(device)
masks = torch.stack([b[1] for b in batch]).to(device)
with torch.no_grad():
    preds = torch.sigmoid(model(images)) >= 0.5

for i in range(4):
    plt.subplot(3, 4, i+1)
    image = images[i].permute(1, 2, 0).cpu()*torch.tensor([[[0.229, 0.224, 0.225]]]) + torch.tensor([[[0.485, 0.456, 0.406]]])
    plt.imshow(image)
    plt.subplot(3, 4, i+4+1)
    plt.imshow(preds[i].permute(1, 2, 0).cpu(), cmap='gray')
    plt.subplot(3, 4, i+4*2+1)
    plt.imshow(masks[i].permute(1, 2, 0).cpu(), cmap='gray')
plt.tight_layout()
plt.show()

## Part 2: Object Detection

Three families of deep object detectors exist (by chronological order): (1) two-stage (region-based), (2) one-stage (YOLO), (3) set-based (DETR).

We will focus on YOLO. These object detectors predict whether each cell in the latent space corresponds to an object and, if so, its class and bounding box. Notice that the latent space is an activation map smaller than the image; therefore, each cell in that activation map corresponds to a region in the original image.

```
                                                  ┌────────────┐
┌─────────┐                                  ┌───►│P(object)   │
│         │    ┌──────┐                      │    │aka Score   │
│         │    │      │    ┌───┐     ┌───┐   │    └────────────┘
│         ├───►│      ├───►│   ├────►│   ├───┤    ┌────────────┐
│         │    │      │    └───┘     └───┘   │    │Bounding box│
│         │    └──────┘                      └───►│xc,yc,w,h   │
└─────────┘                                       └────────────┘
|__________________________________________|      |____________|
                                                                
                 Encoder                               Heads    
                (backbone)                                      
```

The final activation map is 7x7 (for a 224x224 input), so the network produces 7x7 score predictions. We need to convert the ground-truth into the same format as the output of the network.

```
┌───────────────┐            ┌───┬───┬───┬───┐
│               │            │ 0 │ 0 │ 0 │ 0 │
│┌────┐         │            ├───┼───┼───┼───┤
││    │         │            │ 1 │ 1 │ 0 │ 0 │
││    │         ├──────────► ├───┼───┼───┼───┤
│└────┘         │            │ 1 │ 1 │ 0 │ 0 │
│               │            ├───┼───┼───┼───┤
│               │            │ 0 │ 0 │ 0 │ 0 │
└───────────────┘            └───┴───┴───┴───┘
```

And the same thing must be done for each value of the bounding box.

In [None]:
def ground_truth_to_masks(batch_bboxes, input_shape, output_shape, device):
    scores_masks = torch.zeros(len(batch_bboxes), 1, *output_shape, dtype=bool, device=device)
    # float32 so that the targets match the dtype of the network outputs in the loss
    bboxes_masks = torch.zeros(len(batch_bboxes), 4, *output_shape, dtype=torch.float32, device=device)
    yscale = output_shape[0]/input_shape[0]
    xscale = output_shape[1]/input_shape[1]
    for i, bboxes in enumerate(batch_bboxes):
        for x1, y1, x2, y2 in bboxes:
            i1 = int(torch.floor(x1*xscale))
            j1 = int(torch.floor(y1*yscale))
            i2 = int(torch.ceil(x2*xscale))
            j2 = int(torch.ceil(y2*yscale))
            scores_masks[i, :, j1:j2, i1:i2] = 1
            bboxes_masks[i, 0, j1:j2, i1:i2] = x1
            bboxes_masks[i, 1, j1:j2, i1:i2] = y1
            bboxes_masks[i, 2, j1:j2, i1:i2] = x2
            bboxes_masks[i, 3, j1:j2, i1:i2] = y2
    return scores_masks, bboxes_masks

Test. Since we have 224x224 images and a 7x7 activation map (neck), each cell corresponds to a 32x32-pixel region in the original image (i.e., 224/7 = 32).

In [None]:
bboxes = [(10, 40, 100, 120), (160, 120, 220, 180)]
scores, bboxes = ground_truth_to_masks([torch.tensor(bboxes)], (224, 224), (7, 7), 'cpu')
print('-'*16*7)
for j in range(7):
    for i in range(7):
        if scores[0, 0, j, i] == 1:
            print(f'|{bboxes[0, 0, j, i]:3.0f},{bboxes[0, 1, j, i]:3.0f},{bboxes[0, 2, j, i]:3.0f},{bboxes[0, 3, j, i]:3.0f}', end='')
        else:
            print('|' + ' '*15, end='')
    print('|')
    print('-'*16*7)
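The cells that the test above should fill can be checked by hand. The sketch below repeats the floor/ceil arithmetic of `ground_truth_to_masks` in plain Python for the two test boxes; the helper name `covered_cells` is hypothetical and only used for this illustration.

```python
import math

def covered_cells(box, cell_size=32):
    """Return the column and row indices of the grid cells a box overlaps."""
    x1, y1, x2, y2 = box
    cols = range(math.floor(x1 / cell_size), math.ceil(x2 / cell_size))
    rows = range(math.floor(y1 / cell_size), math.ceil(y2 / cell_size))
    return list(cols), list(rows)

cols_a, rows_a = covered_cells((10, 40, 100, 120))
cols_b, rows_b = covered_cells((160, 120, 220, 180))
print(cols_a, rows_a)  # [0, 1, 2, 3] [1, 2, 3]
print(cols_b, rows_b)  # [5, 6] [3, 4, 5]
```

The first box therefore fills a block of 4 columns by 3 rows in the printed grid, and the second a block of 2 columns by 3 rows.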

In [None]:
class MyYOLO(torch.nn.Module):
    def __init__(self, input_shape, output_shape):
        super().__init__()
        # use your previous code for the encoder (truncated resnet-18)
        ...
        # freeze encoder to make training faster
        for param in self.encoder.parameters():
            param.requires_grad = False
        # use lazy conv2d to build two heads:
        # - scores: output **1** score for each region
        # - bboxes: output **4** coordinates for each region
        # Use padding accordingly so that your output is equal in shape to the latent space.
        self.scores_head = ...
        self.bboxes_head = ...
        self.yscale = output_shape[0]/input_shape[0]
        self.xscale = output_shape[1]/input_shape[1]

    def forward(self, x):
        enc = self.encoder(x)
        scores = torch.sigmoid(self.scores_head(enc))
        rel_bboxes = torch.sigmoid(self.bboxes_head(enc))
        # each convolution predicts the x-offset and y-offset within each cell, we need to
        # convert to absolute positions.
        # also convert from CXCYWH to XYXY like the ground-truth.
        height, width = x.shape[2:]
        xx = torch.arange(0, width, 1/self.xscale, device=x.device)
        yy = torch.arange(0, height, 1/self.yscale, device=x.device)
        xx, yy = torch.meshgrid(xx, yy, indexing='xy')
        abs_bboxes = torch.stack((
            rel_bboxes[:, 0]/self.xscale + xx[None] - rel_bboxes[:, 2]*width/2,
            rel_bboxes[:, 1]/self.yscale + yy[None] - rel_bboxes[:, 3]*height/2,
            rel_bboxes[:, 0]/self.xscale + xx[None] + rel_bboxes[:, 2]*width/2,
            rel_bboxes[:, 1]/self.yscale + yy[None] + rel_bboxes[:, 3]*height/2,
        ), 1)
        return scores, abs_bboxes
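The conversion done in `forward` can be followed for a single cell with plain arithmetic. The sketch below assumes the same setting as the lab (224x224 input, 7x7 grid, so 32-pixel cells); the helper name `decode_cell` and the example values are made up for illustration.

```python
def decode_cell(rel, col, row, cell_size=32, image_size=224):
    """Convert one cell's relative (cx, cy, w, h) prediction into absolute XYXY."""
    rel_cx, rel_cy, rel_w, rel_h = rel
    cx = rel_cx * cell_size + col * cell_size  # offset inside the cell + cell origin
    cy = rel_cy * cell_size + row * cell_size
    w = rel_w * image_size                     # width as a fraction of the image
    h = rel_h * image_size                     # height as a fraction of the image
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

box = decode_cell((0.5, 0.5, 0.25, 0.5), col=3, row=2)
print(box)  # (84.0, 24.0, 140.0, 136.0)
```

The centre lands in the middle of cell (column 3, row 2), at (112, 80), and the box extends half the predicted width and height to each side.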

In [None]:
# We will use our YOLO.
model = MyYOLO((224, 224), (7, 7))
model.to(device)  # put model in GPU

# Define optimizer
optimizer = torch.optim.AdamW(model.parameters())
epochs = 25

# Define metric (mAP)
metric = torchmetrics.detection.mean_ap.MeanAveragePrecision()
metric.to(device)

One training/validation epoch:

In [None]:
def one_epoch(model, optimizer, dataloader, is_training):
  model.train() if is_training else model.eval()
  avg_loss = avg_metric = 0
  for batch in dataloader:
    images = torch.stack([b[0] for b in batch]).to(device)
    bboxes = [b[2] for b in batch]
    # (1) call ground_truth_to_masks() to convert the bboxes to a mask
    # (2) do the forward pass to obtain the predicted scores and bboxes
    # (3) do the loss and backward pass: use BCE for scores + MSE for bboxes
    ...
    ...
    ...
    avg_loss += loss.item() / len(dataloader)
    # convert ground-truth and predictions to format that torchmetrics likes
    preds = [{'boxes': bboxes.flatten(1).T, 'scores': scores[0].flatten(), 'labels': torch.zeros(7*7, dtype=int)} for scores, bboxes in zip(scores_preds, bboxes_preds)]
    true = [{'boxes': boxes, 'labels': torch.zeros(len(boxes), dtype=int)} for boxes in bboxes]
    avg_metric += metric(preds, true)['map_50'].item() / len(dataloader)
  return avg_loss, avg_metric

Now the training cycle: (same as before)

In [None]:
train_history = {'loss': [], 'metric': []}
val_history = {'loss': [], 'metric': []}
for epoch in range(epochs):
  # compute train
  avg_loss, avg_metric = one_epoch(model, optimizer, train_dataloader, True)
  train_history['loss'].append(avg_loss)
  train_history['metric'].append(avg_metric)
  print(f'Epoch {epoch+1:2d}/{epochs} - Train loss: {avg_loss} - Train metric: {avg_metric}')
  # compute validation statistics
  avg_loss, avg_metric = one_epoch(model, optimizer, val_dataloader, False)
  val_history['loss'].append(avg_loss)
  val_history['metric'].append(avg_metric)
  print(f'Epoch {epoch+1:2d}/{epochs} - Val   loss: {avg_loss} - Val   metric: {avg_metric}')

### Visual inspection

(before NMS = non-maximum suppression)

In [None]:
model.eval()

batch = next(iter(val_dataloader))
images = torch.stack([b[0] for b in batch]).to(device)
bboxes = [b[2] for b in batch]
with torch.no_grad():
    scores_preds, bboxes_preds = model(images)
for i in range(4):
    plt.subplot(2, 2, i+1)
    image = images[i].permute(1, 2, 0).cpu()*torch.tensor([[[0.229, 0.224, 0.225]]]) + torch.tensor([[[0.485, 0.456, 0.406]]])
    plt.imshow(image)
    image_scores = scores_preds[i, 0].flatten()
    image_bboxes = bboxes_preds[i].flatten(1).T
    image_bboxes = image_bboxes[image_scores >= 0.5]
    image_scores = image_scores[image_scores >= 0.5]
    for x1, y1, x2, y2 in image_bboxes.cpu():
        plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1, fill=False, edgecolor='r', linestyle='--'))
    for x1, y1, x2, y2 in bboxes[i].cpu():
        plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1, fill=False, edgecolor='b'))
plt.tight_layout()
plt.show()

(After NMS. Use a low IoU threshold, e.g. 0.10.)

You may use the [NMS function](https://pytorch.org/vision/master/generated/torchvision.ops.nms.html) from torchvision.
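To see what NMS does before calling the library function, here is a minimal pure-Python sketch of the greedy procedure: boxes are visited in decreasing score order, and a box is dropped when its IoU with an already kept box exceeds the threshold. The helper names `iou` and `greedy_nms` and the toy boxes are illustrative; this is not the torchvision implementation.

```python
def iou(a, b):
    """Intersection over union of two XYXY boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with an already kept one
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (150, 150, 200, 200)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, iou_threshold=0.10))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.10, while the third box does not overlap anything and is kept.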

In [None]:
model.eval()

batch = next(iter(val_dataloader))
images = torch.stack([b[0] for b in batch]).to(device)
bboxes = [b[2] for b in batch]
with torch.no_grad():
    scores_preds, bboxes_preds = model(images)
for i in range(4):
    plt.subplot(2, 2, i+1)
    image = images[i].permute(1, 2, 0).cpu()*torch.tensor([[[0.229, 0.224, 0.225]]]) + torch.tensor([[[0.485, 0.456, 0.406]]])
    plt.imshow(image)
    image_scores = scores_preds[i, 0].flatten()
    image_bboxes = bboxes_preds[i].flatten(1).T
    image_bboxes = image_bboxes[image_scores >= 0.5]
    image_scores = image_scores[image_scores >= 0.5]
    # HERE: add NMS line
    ix = ...
    image_bboxes = image_bboxes[ix]
    for x1, y1, x2, y2 in image_bboxes.cpu():
        plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1, fill=False, edgecolor='r', linestyle='--'))
    for x1, y1, x2, y2 in bboxes[i].cpu():
        plt.gca().add_patch(Rectangle((x1, y1), x2-x1, y2-y1, fill=False, edgecolor='b'))
plt.tight_layout()
plt.show()

Theoretical question: is there a limit to how many objects a YOLO can detect?

### Object Detection with Ultralytics YOLOv8

You do not need a GPU to run this part, since we will use an existing pretrained model and not train it.

Before you start, load the ```images.zip``` file into Google Colab (by uploading it to the ```files``` section).

In [None]:
!unzip "/content/images.zip" -d "/content/data"

There are many different versions of YOLO available. We will be using YOLOv8.

YOLOv8 was originally proposed and implemented by Ultralytics. To work with the model, we will use the ```ultralytics``` package.

Note that this package contains implementations of various models. [Click here](https://docs.ultralytics.com/models/) to find out which models are available on ```ultralytics```.

In [None]:
import ultralytics
ultralytics.checks()

In [None]:
import os
import matplotlib.pyplot as plt

### Load model

YOLOv8 comes in different versions. [Click here](https://docs.ultralytics.com/models/yolov8/) to see the available versions and their computational efficiency and predictive performance on the datasets they were trained on.

In [None]:
from ultralytics import YOLO

# Load a pretrained model
model = YOLO('yolov8s.pt')

Create folder to save the results

In [None]:
folder_name = "results"
os.makedirs(folder_name, exist_ok=True)  # does not fail if the cell is re-run

### Inference

To apply the model to predict bounding boxes on images, we can use the ```predict()``` function. This function accepts many different inputs, including an image loaded with cv2, the path to an image, and even the path to a folder containing images. [Click here](https://docs.ultralytics.com/modes/predict/) for more information about the inference process using YOLOv8.

We will provide the path to the folder as input to obtain predictions for all the images.

In [None]:
input_image_folder = '/content/data/'

# Predict bounding boxes for the images
results = model.predict(input_image_folder)

for result in results:
    # Get bounding boxes object for bounding box outputs
    boxes = result.boxes
    # Save image with bounding boxes and predictions to folder
    image_name = result.path.split(os.sep)[-1]
    result.save(filename=os.path.join(folder_name, image_name))

# Uncomment to visualize the attributes that a bounding box contains
# print(boxes[0])

Visualize results

In [None]:
img = plt.imread(os.path.join(folder_name, "photo2.jpg"))
plt.axis('off')
plt.imshow(img)

### Altering Parameters: Confidence Threshold and Non Maximum Suppression

**Exercise 1**: Increase/decrease the **confidence threshold** and compare the results.

When you increase/decrease the threshold, how does the number of detected objects change? And why?

In [None]:
folder_name = "results_cnf"
os.makedirs(folder_name, exist_ok=True)  # does not fail if the cell is re-run

In [None]:
# Predict bounding boxes for the images
results = model.predict(input_image_folder, conf=0.5)  # return a list of Results objects

for result in results:
    # Get bounding boxes object for bounding box outputs
    boxes = result.boxes
    # Save image with bounding boxes and predictions to folder
    image_name = result.path.split(os.sep)[-1]
    result.save(filename=os.path.join(folder_name, image_name))

In [None]:
img = plt.imread(os.path.join(folder_name, "photo2.jpg"))
plt.axis('off')
plt.imshow(img)

**Exercise 2**: Increase/decrease the Non Maximum Suppression (NMS) threshold and compare the results.

When you increase/decrease the threshold, how does the number of detected objects change? And why?

In [None]:
folder_name = "results_iou"
os.makedirs(folder_name, exist_ok=True)  # does not fail if the cell is re-run

In [None]:
# Predict bounding boxes for the images
results = model.predict(input_image_folder, iou=0.4)  # return a list of Results objects

for result in results:
    # Get bounding boxes object for bounding box outputs
    boxes = result.boxes
    # Save image with bounding boxes and predictions to folder
    image_name = result.path.split(os.sep)[-1]
    result.save(filename=os.path.join(folder_name, image_name))

In [None]:
img = plt.imread(os.path.join(folder_name, "photo2.jpg"))
plt.axis('off')
plt.imshow(img)

### Extra exercises:

**Semantic segmentation**
1. Replace your U-Net by the more complex U-Net implementation: [https://github.com/milesial/Pytorch-UNet](https://github.com/milesial/Pytorch-UNet).

**Object detection**
1. Add class prediction (cat/dog) to your `MyYOLO` model. Hint: Start by adding that label information to the dataset class.
2. Add multi-scale predictions, i.e., instead of using only the last 7x7 layer of resnet-18, also use the previous layers where the size was 14x14 and 28x28. The 7x7 activation map will work better for larger objects (224/7=32) while the earlier 28x28 activation map will work better for smaller objects (224/28=8).