# TORCHVISION OBJECT DETECTION FINETUNING TUTORIAL

For this tutorial, we will be finetuning a pre-trained Mask R-CNN model on the Penn-Fudan Database for Pedestrian Detection and Segmentation. The dataset contains 170 images with 345 instances of pedestrians, and we will use it to illustrate how to use the new features in torchvision to train an instance segmentation model on a custom dataset.

## Defining the Dataset

The reference scripts for training object detection, instance segmentation and person keypoint detection make it easy to add new custom datasets. The dataset should inherit from the standard ```torch.utils.data.Dataset``` class and implement ```__len__``` and ```__getitem__```.

The only requirement is that the dataset's ```__getitem__``` should return:

 - image: a PIL Image of size ```(H, W)```
 - target: a dict containing the following fields
     - ```boxes (FloatTensor[N, 4])```: the coordinates of the N bounding boxes in ```[x0, y0, x1, y1]``` format, ranging from 0 to W and 0 to H
     - ```labels (Int64Tensor[N])```: the label for each bounding box. ```0``` always represents the background class.
     - ```image_id (Int64Tensor[1])```: an image identifier. It should be unique across all the images in the dataset, and is used during evaluation
     - ```area (Tensor[N])```: The area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
     - ```iscrowd (UInt8Tensor[N])```: instances with iscrowd=True will be ignored during evaluation.
     - (optionally) ```masks (UInt8Tensor[N, H, W])```: The segmentation masks for each one of the objects
     - (optionally) ```keypoints (FloatTensor[N, K, 3])```: For each one of the N objects, it contains the K keypoints in ```[x, y, visibility]``` format, defining the object. ```visibility=0``` means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint depends on the data representation, and you should probably adapt ```references/detection/transforms.py``` for your new keypoint representation
 
If your dataset returns the fields above, it will work for both training and evaluation, and the evaluation will use the scripts from ```pycocotools```.

Note: _You cannot define a class with label 0, because 0 is reserved for the background. If you have two classes, for example no and yes, the labels have to be ```[1,2]``` instead of ```[0,1]```_
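
As a minimal sketch of the note above, class names can be mapped to integer labels that start at 1, so that 0 stays free for the background. The helper `build_label_map` and the class names are hypothetical and are not part of torchvision.

```python
# Hypothetical helper (not part of torchvision): assign labels 1..K to the
# foreground classes so that 0 stays reserved for the background.
def build_label_map(class_names):
    return {name: index for index, name in enumerate(sorted(class_names), start=1)}


label_map = build_label_map(["yes", "no"])
# The num_classes value given to the model has to count the background too
num_classes = len(label_map) + 1

print(label_map)
print(num_classes)
```

With two foreground classes, the model would therefore be built with three classes in total.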

## Writing a custom dataset for PennFudan

In [51]:
import os
import numpy as np
import torch
from PIL import Image

class PennFudanDataset(torch.utils.data.Dataset):
    
    def __init__(self, root, transforms = None):
        
        self.root = root
        self.transforms = transforms
        
        # Get all image files in a sorted manner
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))
        
    def __getitem__(self, idx):
        
        # Load images and add masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        
        mask = Image.open(mask_path)
        mask = np.array(mask)
        
        obj_ids = np.unique(mask)
        
        # Remove 0 id
        obj_ids = obj_ids[1:]
        
        # Split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]
        
        # Get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        # Convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        
        # There is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        
        # Assume that no instance is a crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
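
The mask handling inside ```__getitem__``` can be tried in isolation. The sketch below uses a made-up 5x6 mask instead of a real Penn-Fudan file and only NumPy, so the torch conversion is left out; it follows the same steps of splitting the id-encoded mask into binary masks and deriving one box per instance.

```python
import numpy as np

# Toy 5x6 instance mask: 0 is background, 1 and 2 are two pedestrians.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
])

obj_ids = np.unique(mask)[1:]            # drop the background id 0
# Broadcasting (N, 1, 1) against (H, W) gives one boolean mask per instance
masks = mask == obj_ids[:, None, None]   # shape (N, H, W)

boxes = []
for instance_mask in masks:
    ys, xs = np.where(instance_mask)
    boxes.append([int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())])

boxes = np.array(boxes, dtype=np.float32)
areas = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])

print(masks.shape)
print(boxes.tolist())
print(areas.tolist())
```

Because the box corners are pixel indices, an instance that is only one pixel wide or tall would produce a box with zero width or height. Whether that is a problem may depend on the torchvision version, so it is worth checking for such instances in a custom dataset.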

## Defining your model

### 1. Finetune from a pretrained model

Let’s suppose that you want to start from a model pre-trained on COCO and want to finetune it for your particular classes. Here is a possible way of doing it:

In [52]:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# replace the classifier with a new one, that has
# num_classes which is user-defined
num_classes = 2  # 1 class (person) + background
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

### 2. Modifying the model to add a different backbone

In [53]:
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# load a pre-trained model for classification and return
# only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
# FasterRCNN needs to know the number of
# output channels in a backbone. For mobilenet_v2, it's 1280
# so we need to add it here
backbone.out_channels = 1280

# let's make the RPN generate 5 x 3 anchors per spatial
# location, with 5 different sizes and 3 different aspect
# ratios. We have a Tuple[Tuple[int]] because each feature
# map could potentially have different sizes and
# aspect ratios
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# let's define what are the feature maps that we will
# use to perform the region of interest cropping, as well as
# the size of the crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to
# be [0]. More generally, the backbone should return an
# OrderedDict[Tensor], and in featmap_names you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                                output_size=7,
                                                sampling_ratio=2)

# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
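
To make the "5 x 3 anchors per spatial location" concrete, the sketch below enumerates the anchor shapes in plain Python. It assumes that an aspect ratio means height divided by width and that each anchor keeps an area of about size squared; the function `anchor_shapes` is hypothetical, and the exact convention should be checked against the ```AnchorGenerator``` source.

```python
import math

sizes = (32, 64, 128, 256, 512)
aspect_ratios = (0.5, 1.0, 2.0)


# Assumption: aspect ratio = height / width, and every anchor keeps an
# area of roughly size ** 2.
def anchor_shapes(sizes, aspect_ratios):
    shapes = []
    for ratio in aspect_ratios:
        for size in sizes:
            height = size * math.sqrt(ratio)
            width = size / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


shapes = anchor_shapes(sizes, aspect_ratios)
print(len(shapes))
for width, height in shapes[:3]:
    print(f"{width:.1f} x {height:.1f}")
```

Each spatial location of the feature map therefore gets one anchor per (size, aspect ratio) pair.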

## An Instance segmentation model for PennFudan Dataset

In our case, we want to fine-tune from a pre-trained model, given that our dataset is very small, so we will be following approach number 1 (run the cell from section __1.__)

Here we want to also compute the instance segmentation masks, so we will be using Mask R-CNN:

In [54]:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor


def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # now get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    # and replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                       hidden_layer,
                                                       num_classes)

    return model

In [67]:
import transforms as T

def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
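
Note that these transforms are called with both the image and the target, as in ```self.transforms(img, target)``` in the dataset, because flipping the image must also move the annotations. The sketch below shows the box part of a horizontal flip in plain Python. The two functions are hypothetical stand-ins, not the code in ```references/detection/transforms.py```, and they ignore images, masks and keypoints.

```python
import random


# Hypothetical stand-in for a detection-aware horizontal flip; it only
# handles boxes in [x0, y0, x1, y1] format.
def flip_boxes_horizontally(boxes, image_width):
    flipped = []
    for x0, y0, x1, y1 in boxes:
        # Mirror the x coordinates and swap them so that x0 <= x1 still holds
        flipped.append([image_width - x1, y0, image_width - x0, y1])
    return flipped


def random_horizontal_flip(boxes, image_width, prob=0.5, rng=random):
    # Returns the (possibly flipped) boxes and whether a flip was applied
    if rng.random() < prob:
        return flip_boxes_horizontally(boxes, image_width), True
    return boxes, False


boxes = [[10.0, 20.0, 50.0, 80.0]]
print(flip_boxes_horizontally(boxes, image_width=100))
```

Applying the flip twice gives back the original boxes, which is a quick sanity check for any custom transform of this kind.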

In [73]:
from engine import train_one_epoch, evaluate
import utils


def main():
    # train on the GPU or on the CPU, if a GPU is not available
    device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

    # our dataset has two classes only - background and person
    num_classes = 2
    # use our dataset and defined transformations
    dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
    dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

    # split the dataset in train and test set
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

    # define training and validation data loaders
    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, shuffle=True, num_workers=0,
        collate_fn=utils.collate_fn)

    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=2, shuffle=False, num_workers=0,
        collate_fn=utils.collate_fn)

    # get the model using our helper function
    model = get_model_instance_segmentation(num_classes)

    # move model to the right device
    model.to(device)

    # construct an optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)
    # and a learning rate scheduler
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                   step_size=3,
                                                   gamma=0.1)

    # let's train it for 10 epochs
    num_epochs = 10

    for epoch in range(num_epochs):
        # train for one epoch, printing every 10 iterations
        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
        # update the learning rate
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_test, device=device)

    print("That's it!")
    
    

# Note that this cell uses several files imported from https://github.com/pytorch/vision/blob/master/references/detection
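
The split and batching arithmetic in ```main()``` can be checked without loading any data. The sketch below uses plain Python lists in place of ```Subset``` and ```DataLoader```. Its ```collate_fn``` is an assumption about what a detection collate function typically does (keep samples as tuples instead of stacking them), not a copy of ```utils.collate_fn```.

```python
import random

num_images = 170          # size of the Penn-Fudan dataset
num_test = 50
batch_size = 2

indices = list(range(num_images))
random.Random(0).shuffle(indices)   # fixed seed only to make the sketch repeatable

train_indices = indices[:-num_test]
test_indices = indices[-num_test:]

# Group the training indices into batches, as a data loader would
batches = [train_indices[i:i + batch_size]
           for i in range(0, len(train_indices), batch_size)]


# Assumption: samples are kept as tuples rather than stacked, because
# images and targets can differ in size from one sample to the next.
def collate_fn(batch):
    return tuple(zip(*batch))


samples = [("img_a", {"boxes": 1}), ("img_b", {"boxes": 3})]
images, targets = collate_fn(samples)

print(len(train_indices), len(test_indices), len(batches))
print(images)
```

The 120 training images in batches of 2 give 60 iterations per epoch, which matches the ```[ 0/60]``` counter in the training log.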


In [74]:
# Perform garbage collection (for some reason it's not doing it automatically?)
import gc

gc.collect()

# Execute main
main()

Epoch: [0]  [ 0/60]  eta: 0:01:00  lr: 0.000090  loss: 2.8519 (2.8519)  loss_classifier: 0.7174 (0.7174)  loss_box_reg: 0.2571 (0.2571)  loss_mask: 1.8446 (1.8446)  loss_objectness: 0.0172 (0.0172)  loss_rpn_box_reg: 0.0156 (0.0156)  time: 1.0108  data: 0.0269  max mem: 4442
Epoch: [0]  [10/60]  eta: 0:00:52  lr: 0.000936  loss: 1.4321 (1.7178)  loss_classifier: 0.4754 (0.4779)  loss_box_reg: 0.2571 (0.2360)  loss_mask: 0.6731 (0.9718)  loss_objectness: 0.0223 (0.0204)  loss_rpn_box_reg: 0.0118 (0.0116)  time: 1.0511  data: 0.1359  max mem: 4442
Epoch: [0]  [20/60]  eta: 0:00:39  lr: 0.001783  loss: 0.8401 (1.1858)  loss_classifier: 0.2550 (0.3249)  loss_box_reg: 0.1567 (0.1946)  loss_mask: 0.3286 (0.6355)  loss_objectness: 0.0187 (0.0183)  loss_rpn_box_reg: 0.0107 (0.0125)  time: 0.9982  data: 0.1118  max mem: 4442
Epoch: [0]  [30/60]  eta: 0:00:29  lr: 0.002629  loss: 0.4800 (0.9479)  loss_classifier: 0.0825 (0.2442)  loss_box_reg: 0.1379 (0.1790)  loss_mask: 0.2185 (0.4977)  loss_ob

Epoch: [2]  [ 0/60]  eta: 0:01:00  lr: 0.005000  loss: 0.2576 (0.2576)  loss_classifier: 0.0750 (0.0750)  loss_box_reg: 0.0390 (0.0390)  loss_mask: 0.1276 (0.1276)  loss_objectness: 0.0005 (0.0005)  loss_rpn_box_reg: 0.0155 (0.0155)  time: 1.0133  data: 0.0918  max mem: 4442
Epoch: [2]  [10/60]  eta: 0:00:47  lr: 0.005000  loss: 0.1989 (0.1994)  loss_classifier: 0.0309 (0.0366)  loss_box_reg: 0.0229 (0.0218)  loss_mask: 0.1294 (0.1310)  loss_objectness: 0.0005 (0.0015)  loss_rpn_box_reg: 0.0071 (0.0086)  time: 0.9497  data: 0.0481  max mem: 4442
Epoch: [2]  [20/60]  eta: 0:00:36  lr: 0.005000  loss: 0.1816 (0.1951)  loss_classifier: 0.0284 (0.0334)  loss_box_reg: 0.0145 (0.0196)  loss_mask: 0.1265 (0.1307)  loss_objectness: 0.0006 (0.0014)  loss_rpn_box_reg: 0.0077 (0.0101)  time: 0.8988  data: 0.0417  max mem: 4442
Epoch: [2]  [30/60]  eta: 0:00:27  lr: 0.005000  loss: 0.1694 (0.1866)  loss_classifier: 0.0278 (0.0309)  loss_box_reg: 0.0141 (0.0181)  loss_mask: 0.1176 (0.1267)  loss_ob

Epoch: [4]  [ 0/60]  eta: 0:00:52  lr: 0.000500  loss: 0.1631 (0.1631)  loss_classifier: 0.0199 (0.0199)  loss_box_reg: 0.0108 (0.0108)  loss_mask: 0.1198 (0.1198)  loss_objectness: 0.0002 (0.0002)  loss_rpn_box_reg: 0.0124 (0.0124)  time: 0.8830  data: 0.0379  max mem: 4442
Epoch: [4]  [10/60]  eta: 0:00:46  lr: 0.000500  loss: 0.1676 (0.1826)  loss_classifier: 0.0272 (0.0298)  loss_box_reg: 0.0108 (0.0144)  loss_mask: 0.1198 (0.1267)  loss_objectness: 0.0011 (0.0015)  loss_rpn_box_reg: 0.0104 (0.0103)  time: 0.9347  data: 0.0578  max mem: 4442
Epoch: [4]  [20/60]  eta: 0:00:36  lr: 0.000500  loss: 0.1498 (0.1679)  loss_classifier: 0.0253 (0.0260)  loss_box_reg: 0.0092 (0.0119)  loss_mask: 0.1119 (0.1198)  loss_objectness: 0.0006 (0.0011)  loss_rpn_box_reg: 0.0092 (0.0091)  time: 0.9040  data: 0.0500  max mem: 4442
Epoch: [4]  [30/60]  eta: 0:00:27  lr: 0.000500  loss: 0.1428 (0.1622)  loss_classifier: 0.0233 (0.0257)  loss_box_reg: 0.0060 (0.0107)  loss_mask: 0.1071 (0.1165)  loss_ob

Epoch: [6]  [ 0/60]  eta: 0:00:52  lr: 0.000050  loss: 0.1658 (0.1658)  loss_classifier: 0.0281 (0.0281)  loss_box_reg: 0.0122 (0.0122)  loss_mask: 0.1154 (0.1154)  loss_objectness: 0.0003 (0.0003)  loss_rpn_box_reg: 0.0098 (0.0098)  time: 0.8787  data: 0.0509  max mem: 4442
Epoch: [6]  [10/60]  eta: 0:00:44  lr: 0.000050  loss: 0.1687 (0.1813)  loss_classifier: 0.0281 (0.0311)  loss_box_reg: 0.0108 (0.0143)  loss_mask: 0.1205 (0.1257)  loss_objectness: 0.0003 (0.0006)  loss_rpn_box_reg: 0.0078 (0.0095)  time: 0.8845  data: 0.0413  max mem: 4442
Epoch: [6]  [20/60]  eta: 0:00:35  lr: 0.000050  loss: 0.1612 (0.1667)  loss_classifier: 0.0260 (0.0284)  loss_box_reg: 0.0092 (0.0120)  loss_mask: 0.1147 (0.1177)  loss_objectness: 0.0003 (0.0005)  loss_rpn_box_reg: 0.0049 (0.0080)  time: 0.8990  data: 0.0432  max mem: 4442
Epoch: [6]  [30/60]  eta: 0:00:26  lr: 0.000050  loss: 0.1483 (0.1602)  loss_classifier: 0.0213 (0.0264)  loss_box_reg: 0.0070 (0.0106)  loss_mask: 0.1099 (0.1150)  loss_ob

Epoch: [8]  [ 0/60]  eta: 0:00:47  lr: 0.000050  loss: 0.2024 (0.2024)  loss_classifier: 0.0285 (0.0285)  loss_box_reg: 0.0138 (0.0138)  loss_mask: 0.1558 (0.1558)  loss_objectness: 0.0002 (0.0002)  loss_rpn_box_reg: 0.0042 (0.0042)  time: 0.7899  data: 0.0309  max mem: 4442
Epoch: [8]  [10/60]  eta: 0:00:43  lr: 0.000050  loss: 0.1775 (0.1719)  loss_classifier: 0.0277 (0.0243)  loss_box_reg: 0.0066 (0.0120)  loss_mask: 0.1352 (0.1270)  loss_objectness: 0.0005 (0.0006)  loss_rpn_box_reg: 0.0054 (0.0081)  time: 0.8707  data: 0.0386  max mem: 4442
Epoch: [8]  [20/60]  eta: 0:00:35  lr: 0.000050  loss: 0.1720 (0.1766)  loss_classifier: 0.0277 (0.0287)  loss_box_reg: 0.0122 (0.0135)  loss_mask: 0.1216 (0.1258)  loss_objectness: 0.0005 (0.0008)  loss_rpn_box_reg: 0.0076 (0.0079)  time: 0.8816  data: 0.0398  max mem: 4442
Epoch: [8]  [30/60]  eta: 0:00:26  lr: 0.000050  loss: 0.1599 (0.1700)  loss_classifier: 0.0268 (0.0268)  loss_box_reg: 0.0119 (0.0126)  loss_mask: 0.1175 (0.1223)  loss_ob
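
The learning rates in the log follow the step schedule configured in ```main()``` (```step_size=3```, ```gamma=0.1```, starting from ```lr=0.005```). The sketch below recomputes them in plain Python with a closed-form step decay; the function `step_lr` is hypothetical and only mirrors those settings. The smaller values logged during epoch 0 are presumably due to a warmup applied inside ```train_one_epoch```, which this sketch does not model.

```python
base_lr = 0.005
step_size = 3
gamma = 0.1
num_epochs = 10


# Closed form of a step decay: multiply by gamma once every step_size epochs
def step_lr(epoch):
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(epoch) for epoch in range(num_epochs)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

The values for epochs 2, 4, 6 and 8 agree with the ```lr``` column printed in the log.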