# <font style="color:blue">Faster RCNN Fine-tuning</font>

We have already seen the Faster RCNN flow diagram shown below. We have also used the PyTorch pre-trained object detection model (`torchvision.models.detection.fasterrcnn_resnet50_fpn`) to run inference on data samples.

We know that this model is trained on the [COCO dataset](https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/), which has `91` classes. What if the COCO dataset doesn't include the class we are interested in? Even if our class of interest is available and our number of classes is much lower than in the COCO dataset, it is better to fine-tune the model for better mAP (mean average precision).


If the model inference speed is too slow for our requirements, we might be interested in changing the backbone of the Faster RCNN model.


We will look at the building blocks of the `resnet-50_fpn` implementation and use that understanding to fine-tune the model.


#  <font style="color:blue">1. Faster RCNN with Resnet-50 FPN Backbone</font>

---

![](https://www.researchgate.net/profile/Giang_Son_Tran/publication/324549019/figure/fig1/AS:649929152266241@1531966593689/Faster-R-CNN-Architecture-9.png)

---

**Let's start by mapping the building blocks above to the PyTorch `torchvision.models.detection.fasterrcnn_resnet50_fpn` implementation.**

In [0]:
import torch
import torchvision

from PIL import Image
import torchvision.transforms as T


**Find details of FasterRCNN with Resnet-50 FPN backbone [here](https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.detection.fasterrcnn_resnet50_fpn)**.

Let's load pre-trained Faster RCNN with ResNet-50 FPN backbone detection model.

In [0]:
if torch.cuda.is_available():
    device = "cuda" 
else:
    device = "cpu"

In [0]:
# load fasterrcnn_resnet50_fpn
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model = model.to(device)

## <font style="color:green">1.1. Inputs Samples</font>

**Let's load two images and their targets.**

**[Download image1](https://www.dropbox.com/s/jet087pwhln5b2j/FudanPed00066.png?dl=1)**

**[Download image2](https://www.dropbox.com/s/uv8676diqwrstvl/PennPed00011.png?dl=1)**

In [4]:
image1 = T.ToTensor()(Image.open('FudanPed00066.png'))

bboxes1 = torch.tensor([[248.0, 50.0, 329.0, 351.0]])
labels1 = torch.tensor([1])

image2 = T.ToTensor()(Image.open('PennPed00011.png'))

bboxes2 = torch.tensor([[92.0, 62.0, 236.0, 344.0], [242.0, 52.0, 301.0, 355.0]])
labels2 = torch.tensor([1, 1])

print('Image 1 size: {}'.format(image1.size()))

print('Image 2 size: {}'.format(image2.size()))

Image 1 size: torch.Size([3, 359, 360])
Image 2 size: torch.Size([3, 376, 508])


**We can see that both images (`image1` and `image2`) have different sizes.**

## <font style="color:green">1.2. Model Inference</font>

- The model expects a list of image tensors (not a single batched tensor) for inference.


- Image sizes may differ, so we do not need to resize them to a common size ourselves. The PyTorch Faster RCNN implementation has its own image pre-processing block.

In [5]:
input_image1 = image1.clone()

input_image2 = image2.clone()

# input is a list of images
inputs = [input_image1.to(device), input_image2.to(device)]

model.eval()
output = model(inputs)

print(output)

[{'boxes': tensor([[243.1263,  47.7870, 327.8619, 349.8769]], device='cuda:0',
       grad_fn=<StackBackward>), 'labels': tensor([1], device='cuda:0'), 'scores': tensor([0.9997], device='cuda:0', grad_fn=<IndexBackward>)}, {'boxes': tensor([[ 89.9230,  59.4910, 225.3071, 342.8298],
        [244.2283,  49.8334, 304.4795, 362.8902],
        [245.9230, 127.6201, 276.5670, 197.8546],
        [252.0381,  15.8489, 369.9043, 367.3777],
        [245.7938,  99.7875, 294.3491, 198.7888],
        [243.9077, 121.8000, 276.3352, 198.4012],
        [247.6824,  51.5020, 301.1053, 203.1744],
        [245.5552,  95.3306, 295.3181, 199.0098],
        [274.8139,  96.3892, 301.2039, 188.7242],
        [123.6462,  56.9277, 191.9256, 338.0054],
        [240.7440,  44.0858, 333.5300, 235.6630],
        [267.6390, 100.0401, 299.7915, 187.5079]], device='cuda:0',
       grad_fn=<StackBackward>), 'labels': tensor([ 1,  1, 27,  1, 27, 31,  1, 31, 27,  1,  1, 31], device='cuda:0'), 'scores': tensor([0.9996, 0.993

## <font style="color:green">1.3. Model Training</font>

- In training mode, `targets` are mandatory because the losses are computed against them. Calling `backward()` on the loss then gives the gradients.


- Targets should be a list, and each target should have the following format:

```
    {
        'boxes': bounding boxes tensor,
        'labels': label tensor
    
    } 
```


- Object detection in Faster RCNN is done in two stages. 


- First, it classifies all regions of the image into just two classes: background or object.


- In the second stage, it predicts classes of the object and improves its bounding box predictions.
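
The target format described above can be sanity-checked before training. The sketch below is a minimal, hypothetical helper (`check_target` is not part of torchvision) that works on plain Python lists. It assumes boxes are given as `[x1, y1, x2, y2]` in pixels, as in the samples above, and that label `0` is reserved for the background.

```python
def check_target(boxes, labels, image_width, image_height):
    """Return a list of problems found in one target (empty list means OK)."""
    problems = []
    if len(boxes) != len(labels):
        problems.append('boxes and labels differ in length')
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x2 <= x1 or y2 <= y1:
            problems.append('box {} has non-positive width or height'.format(i))
        if x1 < 0 or y1 < 0 or x2 > image_width or y2 > image_height:
            problems.append('box {} lies outside the image'.format(i))
    for i, label in enumerate(labels):
        # Assumption: label 0 is the background, so object labels start at 1.
        if label < 1:
            problems.append('label {} is not a positive class id'.format(i))
    return problems


# Boxes and labels of image 2 above (tensor size 3 x 376 x 508: height 376, width 508).
bboxes2 = [[92.0, 62.0, 236.0, 344.0], [242.0, 52.0, 301.0, 355.0]]
labels2 = [1, 1]
print(check_target(bboxes2, labels2, image_width=508, image_height=376))
```

With tensors, the same checks could be applied to `target['boxes'].tolist()` and `target['labels'].tolist()`.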

In [6]:
input_image1 = image1.clone()

target1 = {
    'boxes': bboxes1.clone().to(device),
    'labels' : labels1.clone().to(device)
    
} 

input_image2 = image2.clone()


target2 = {
    'boxes': bboxes2.clone().to(device),
    'labels' : labels2.clone().to(device)
    
} 

inputs = [input_image1.to(device), input_image2.to(device)]
targets = [target1, target2]

# change to train mode
model.train()
model(inputs, targets)

{'loss_box_reg': tensor(0.0074, device='cuda:0', grad_fn=<DivBackward0>),
 'loss_classifier': tensor(0.0219, device='cuda:0', grad_fn=<NllLossBackward>),
 'loss_objectness': tensor(0.0010, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
 'loss_rpn_box_reg': tensor(0.0106, device='cuda:0', grad_fn=<DivBackward0>)}

- `loss_objectness` and `loss_rpn_box_reg` are losses of the first stage.


- `loss_classifier` and `loss_box_reg` are losses of the second stage.
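
For a training step, the four losses are usually added into a single scalar before calling `backward()`. The sketch below uses plain floats copied from the output above instead of tensors, so it only illustrates the bookkeeping. The unweighted sum is an assumption, not something the output itself prescribes.

```python
# Loss values copied from the training-mode output above (floats, not tensors).
loss_dict = {
    'loss_box_reg': 0.0074,
    'loss_classifier': 0.0219,
    'loss_objectness': 0.0010,
    'loss_rpn_box_reg': 0.0106,
}

# First stage (RPN) and second stage (RoI heads) contributions.
rpn_loss = loss_dict['loss_objectness'] + loss_dict['loss_rpn_box_reg']
roi_loss = loss_dict['loss_classifier'] + loss_dict['loss_box_reg']

# Total loss: an unweighted sum (assumption; other weightings are possible).
total_loss = sum(loss_dict.values())

print('RPN loss: {:.4f}'.format(rpn_loss))
print('RoI heads loss: {:.4f}'.format(roi_loss))
print('Total loss: {:.4f}'.format(total_loss))
```

With the real loss dictionary, the same `sum(...)` expression yields a tensor on which `backward()` can be called.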

## <font style="color:green">1.4. Model Building Blocks</font>

**We will look at the building blocks of the model and see how to modify them for fine-tuning.**


In [7]:
model.eval()
print(model)

FasterRCNN(
  (transform): GeneralizedRCNNTransform()
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d()
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d()
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d()
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d()
          (relu): ReLU(inplace=True)
          (downsample): Sequential(
            (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): FrozenBatchNorm2d()
          )
  

**We can see the model has the following building blocks of `FasterRCNN`:**

- **`transform`:** This block pre-processes the input image.


- **`backbone`:** This is equivalent to **conv layers** in the above image.


- **`rpn`:** This is equivalent to **Region Proposal Network** in the above image.


- **`roi_heads`:** This is equivalent to **RoI Pooling** in the above image.


- **`box_predictor`:** This is equivalent to **classifier** in the above image. It is a sub-module of `roi_heads`.

### <font style="color:green">transform</font>

- This block pre-processes the input like normalizing, resizing, etc.

Let's have a look at the pre-processed tensor size.

In [8]:
input_image1 = image1.clone()

input_image2 = image2.clone()

inputs = [input_image1.to(device), input_image2.to(device)]

trans_image_list, trans_target_list = model.transform(inputs)

print('Tensor size: {}'.format(trans_image_list.tensors.size()))


Tensor size: torch.Size([2, 3, 800, 1088])


Let's have a look at the transform parameters.

In [9]:
print('transform ( GeneralizedRCNNTransform) parameters:')
print('min_size: {}'.format(model.transform.min_size))
print('max_size: {}'.format(model.transform.max_size))
print('image_mean: {}'.format(model.transform.image_mean))
print('image_std: {}'.format(model.transform.image_std))

transform ( GeneralizedRCNNTransform) parameters:
min_size: (800,)
max_size: 1333
image_mean: [0.485, 0.456, 0.406]
image_std: [0.229, 0.224, 0.225]
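
The `min_size` and `max_size` values explain the `800 x 1088` batch printed earlier. The sketch below reproduces the resize rule as I understand it: scale each image so its shorter side reaches `min_size`, unless that would push the longer side past `max_size`, then pad the batch to a multiple of 32. The helper names are hypothetical, `size_divisible=32` is assumed to be the default, and the exact rounding inside torchvision may differ between versions.

```python
import math


def resized_size(height, width, min_size=800, max_size=1333):
    """Approximate (height, width) after the transform's resize step."""
    scale = min(min_size / min(height, width), max_size / max(height, width))
    return round(height * scale), round(width * scale)


def padded_batch_size(sizes, size_divisible=32):
    """Smallest (height, width) divisible by size_divisible that fits all images."""
    def pad(value):
        return int(math.ceil(value / size_divisible) * size_divisible)

    max_h = max(h for h, _ in sizes)
    max_w = max(w for _, w in sizes)
    return pad(max_h), pad(max_w)


# Image sizes printed in section 1.1, as (height, width).
sizes = [resized_size(359, 360), resized_size(376, 508)]
print('Resized sizes: {}'.format(sizes))
print('Padded batch size: {}'.format(padded_batch_size(sizes)))
```

Under these assumptions the padded size comes out as `(800, 1088)`, matching the tensor size printed above.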


### <font style="color:magenta">Transform params in Faster RCNN Fine-tune model</font>

If we have smaller images for training, then we might be interested in changing the transform parameters. 

In [0]:
ft_min_size = 300
ft_max_size = 500

ft_mean = [0.485, 0.456, 0.406]
ft_std = [0.229, 0.224, 0.225]

### <font style="color:green">backbone (conv layers)</font>

- The model uses `resnet50` (trained on the ImageNet dataset) with `FPN` as the backbone for feature extraction. Don't worry if you don't know what `FPN` is. In short, it extracts features from different layers of `resnet50`. In object detection, features from several layers perform better than features from only the last layer. More details on FPN can be found [here](https://arxiv.org/pdf/1612.03144.pdf).


- Generally, we use a pre-trained model (trained on a large dataset, e.g., ImageNet) as the backbone of an object detection network.



- We can replace the backbone with another one (`resnet-18`, `vgg-16`, etc.) for fine-tuning.

**Let's see the number of output channels of the backbone.**

In [0]:
backbone_out = model.backbone(trans_image_list.tensors)

In [12]:
for key, value in backbone_out.items():
    print('{}: {}'.format(key, value.size()))

0: torch.Size([2, 256, 200, 272])
1: torch.Size([2, 256, 100, 136])
2: torch.Size([2, 256, 50, 68])
3: torch.Size([2, 256, 25, 34])
pool: torch.Size([2, 256, 13, 17])


**The output of the backbone is an `OrderedDict` with five entries, one feature-map tensor per level.**

In [13]:
print('Number of output channel of the backbone: {}'.format(model.backbone.out_channels))

Number of output channel of the backbone: 256
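
The five feature maps differ only in spatial resolution. As a small illustration, the sketch below derives the stride of each level (how many input pixels one feature-map element covers) from the widths printed above. It uses the width because `1088` divides evenly at every level, whereas the height of the `pool` level is rounded up.

```python
# Width of the padded input batch and of each backbone output printed above.
input_width = 1088
feature_widths = {'0': 272, '1': 136, '2': 68, '3': 34, 'pool': 17}

# Stride = number of input pixels covered by one feature-map element (along the width).
strides = {name: input_width // width for name, width in feature_widths.items()}

for name, stride in strides.items():
    print('{}: stride {}'.format(name, stride))
```

Each level halves the resolution of the previous one, while the channel count stays at 256.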


### <font style="color:magenta">Backbone of Faster RCNN Fine-tune model</font>

Let's choose the pre-trained AlexNet from torchvision models.

In [14]:
import torchvision.models as models

alexnet = models.alexnet(pretrained=True)
print(alexnet)

Downloading: "https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth" to /root/.cache/torch/checkpoints/alexnet-owt-4df8aa71.pth



AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)


- For the Faster RCNN backbone, we are only interested in the convolutional feature extractor.


- Faster RCNN also needs to know the number of output channels of the backbone.

In [0]:
ft_backbone = alexnet.features

# number of out-channel in alexnet features is 256
ft_backbone.out_channels = 256
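
To get a feel for what this backbone produces, the spatial size of its output can be worked out by hand from the layer parameters printed above. The sketch below applies the usual output-size formula for convolution and max-pool layers. The `300 x 300` input is an assumed example size, and `conv_out` is a helper written for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a conv or max-pool layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of the conv and max-pool layers in alexnet.features,
# read from the printed model above (ReLU layers do not change the size).
layers = [
    (11, 4, 2),  # conv
    (3, 2, 0),   # maxpool
    (5, 1, 2),   # conv
    (3, 2, 0),   # maxpool
    (3, 1, 1),   # conv
    (3, 1, 1),   # conv
    (3, 1, 1),   # conv
    (3, 2, 0),   # maxpool
]

size = 300  # assumed square input of 300 x 300 pixels
for kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)

print('Feature map: 256 x {} x {}'.format(size, size))
```

So a `300 x 300` image is reduced to a single, fairly coarse `256`-channel feature map, in contrast to the five levels of the FPN backbone.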

### <font style="color:green">rpn (Region Proposal Network)</font>

- It takes features from the backbone and predicts, for each region, an objectness score (whether the region contains an object or background) and the coordinates of the region.


**What is the meaning of the region here?**

Generally, in object detection, we use an anchor (a rectangular box) to denote the region.

---

<img src='https://www.learnopencv.com/wp-content/uploads/2020/03/c3-w8-anchors.png' align='middle'>

---

- The image above shows two feature maps, `b` (an `8 x 8` grid) and `c` (a `4 x 4` grid). Each element of a feature map corresponds to a block of pixels in the original image `a`.



- Each feature map has a set of anchors.


- We can change the number of anchors for each feature map as part of the fine-tuning process.


Let's first see the `rpn` (Region Proposal Network) in `resnet-50-fpn` model.


In [16]:
model.rpn

RegionProposalNetwork(
  (anchor_generator): AnchorGenerator()
  (head): RPNHead(
    (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (cls_logits): Conv2d(256, 3, kernel_size=(1, 1), stride=(1, 1))
    (bbox_pred): Conv2d(256, 12, kernel_size=(1, 1), stride=(1, 1))
  )
)

We can see that it has two parts: (1) `anchor_generator` and (2) `head`.

**`anchor_generator`**

In [17]:
print('Anchor sizes: {}'.format(model.rpn.anchor_generator.sizes))
print('Aspect ratios: {}'.format(model.rpn.anchor_generator.aspect_ratios))

Anchor sizes: ((32,), (64,), (128,), (256,), (512,))
Aspect ratios: ((0.5, 1.0, 2.0), (0.5, 1.0, 2.0), (0.5, 1.0, 2.0), (0.5, 1.0, 2.0), (0.5, 1.0, 2.0))


- Sizes (32, 64, ...) correspond to numbers of pixels in the original image.


- We can see `sizes` is a tuple of five tuples.


- Each tuple corresponds to one feature level of the `resnet50-fpn` backbone (note that the backbone output is an `OrderedDict` with five feature maps).


- `aspect_ratios` is also a tuple of `five` tuples. The first tuple corresponds to the first anchor-size tuple, the second to the second, and so on.


- As each level has one `anchor size` and each anchor size has three `aspect ratios`, the number of anchors per feature-map location will be `three` (`1*3`).


- The backbone has five output levels, and each level is associated with a different `anchor size`, so the total number of distinct anchor shapes will be `fifteen` (`5*3`).


- Changing the anchor sizes and aspect ratios may be important for fine-tuning. For example, let's assume we only have to detect pedestrians. Aspect ratios of `(0.5, 1.0, 2.0)` may be less useful than `(2.0, 2.5, 3.0)`, because pedestrians are taller than they are wide.
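
To make the effect of the aspect ratios concrete, the sketch below computes the width and height of an anchor for a given size and ratio. It assumes that the aspect ratio means height divided by width and that the anchor area stays close to `size * size`, which is how I understand torchvision's `AnchorGenerator`; please verify against your installed version. `anchor_shape` is a helper written for this illustration.

```python
import math


def anchor_shape(size, aspect_ratio):
    """Width and height of one anchor, keeping the area close to size * size."""
    h_ratio = math.sqrt(aspect_ratio)  # assumption: aspect_ratio = height / width
    w_ratio = 1.0 / h_ratio
    return w_ratio * size, h_ratio * size


for ratios in [(0.5, 1.0, 2.0), (2.0, 2.5, 3.0)]:
    print('aspect ratios {}:'.format(ratios))
    for ratio in ratios:
        width, height = anchor_shape(128, ratio)
        print('  ratio {}: width {:.1f}, height {:.1f}'.format(ratio, width, height))
```

Under this assumption, all ratios above `1.0` give tall, narrow anchors, which is the shape we expect for a standing pedestrian.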



### <font style="color:magenta">Anchor of Faster RCNN Fine-tune model</font>

- Since AlexNet has a single-level output, the anchor sizes should be given as a single tuple.

In [0]:
from torchvision.models.detection.rpn import AnchorGenerator

ft_anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256),), 
                                      aspect_ratios=((0.5, 1.0, 2.0),))

**`head`**

**`cls_logits`:** It classifies each anchor at every feature-map location as object or background. It uses logistic regression (after a sigmoid, a value > 0.5 means object, otherwise background). That is why the number of output channels is `3` (one channel per aspect ratio).


**`bbox_pred`:** To represent a bounding box, we need four numbers. So there are 12 output channels, four for each aspect ratio.
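
The channel counts of the head follow directly from the anchor configuration. The sketch below is a small helper written for this illustration (not a torchvision function) that counts the anchors per feature-map location and derives both channel counts from it.

```python
def rpn_head_channels(sizes, aspect_ratios):
    """Return (cls_logits channels, bbox_pred channels) for the first level."""
    anchors_per_location = len(sizes[0]) * len(aspect_ratios[0])
    return anchors_per_location, 4 * anchors_per_location


# resnet-50-fpn: one size and three aspect ratios per level.
print(rpn_head_channels(((32,), (64,), (128,), (256,), (512,)),
                        ((0.5, 1.0, 2.0),) * 5))

# AlexNet fine-tune model: four sizes and three aspect ratios on one level.
print(rpn_head_channels(((32, 64, 128, 256),), ((0.5, 1.0, 2.0),)))
```

The first call reproduces the `3` and `12` channels printed for `model.rpn`. The second shows that the AlexNet anchor generator leads to a head with more output channels.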


### <font style="color:green">roi_heads (RoI Pooling)</font>

In [19]:
model.roi_heads

RoIHeads(
  (box_roi_pool): MultiScaleRoIAlign()
  (box_head): TwoMLPHead(
    (fc6): Linear(in_features=12544, out_features=1024, bias=True)
    (fc7): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (box_predictor): FastRCNNPredictor(
    (cls_score): Linear(in_features=1024, out_features=91, bias=True)
    (bbox_pred): Linear(in_features=1024, out_features=364, bias=True)
  )
)

**`box_roi_pool`**

- It takes the bounding boxes (proposals) predicted by the `RegionProposalNetwork` and the `convolution features` from the `backbone`.


- For bounding boxes whose `objectness score` is high enough to be kept as proposals, it crops the corresponding features from the convolution layers and samples them on a fixed grid (e.g. `14 x 14` sample points), then averages neighbouring samples down to the output size (e.g. if `sampling_ratio` is `2`, the `14 x 14` samples become a `7 x 7` output).


- These pooled features are stacked like a batch and later flattened into `1-d` vectors by `box_head`.

In [20]:
print('Box RoI Pool Parameters:')
print('featmap_names: {}'.format(model.roi_heads.box_roi_pool.featmap_names))
print('output_size: {}'.format(model.roi_heads.box_roi_pool.output_size))
print('sampling_ratio: {}'.format(model.roi_heads.box_roi_pool.sampling_ratio))

Box RoI Pool Parameters:
featmap_names: ['0', '1', '2', '3']
output_size: (7, 7)
sampling_ratio: 2
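
These parameters tie in with the layer sizes printed for `model.roi_heads`. As a rough check, the sketch below computes the number of sample points per RoI side and the length of the flattened feature vector. `roi_pool_stats` is a helper written for this illustration, and the second call uses an arbitrary smaller configuration for comparison.

```python
def roi_pool_stats(out_channels, output_size, sampling_ratio):
    """Sample points per RoI side and flattened feature length after pooling."""
    out_h, out_w = output_size
    samples = (out_h * sampling_ratio, out_w * sampling_ratio)
    flat_features = out_channels * out_h * out_w
    return samples, flat_features


# resnet-50-fpn: 256 channels, 7 x 7 output, sampling_ratio 2.
print(roi_pool_stats(256, (7, 7), 2))

# A smaller configuration for comparison: 4 x 4 output, sampling_ratio 1.
print(roi_pool_stats(256, (4, 4), 1))
```

For the first configuration the flattened length is `256 * 7 * 7 = 12544`, which is the `in_features` of `fc6` in `box_head`.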


### <font style="color:magenta">RoI Pooler of Faster RCNN Fine-tune model</font>

Recall that the backbone output was an ordered dictionary with five entries. Let's print its keys below.

In [21]:
# backbone output keys 
backbone_out.keys()

odict_keys(['0', '1', '2', '3', 'pool'])

`featmap_names = ['0', '1', '2', '3']` means that the current implementation does not use the `'pool'` layer output for region-of-interest pooling.

Do we get an ordered dictionary as output from `ft_backbone` (`alexnet.features`)?

Let's check.

In [22]:
type(ft_backbone(torch.rand((2, 3, 300, 300))))

torch.Tensor

Oh! It is just a **tensor**. In that case the model wraps the tensor in a dictionary under the key `'0'`, so we can use `featmap_names=['0']`.

In [0]:
from torchvision.ops import MultiScaleRoIAlign

ft_roi_pooler = MultiScaleRoIAlign(featmap_names=['0'], output_size=4, sampling_ratio=1)

**`box_head`**

- It has two fully connected layers, which take the output of `box_roi_pool` as input.


**`box_predictor`**

- The `box_head` output goes to two fully connected layers: one for class prediction and the other for bounding-box prediction.
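
The output sizes of these two layers depend only on the number of classes. The sketch below is a helper written for this illustration; it assumes one score and one four-number box per class, with the background counted as a class.

```python
def box_predictor_outputs(num_classes):
    """Output features of cls_score and bbox_pred for num_classes (with background)."""
    return num_classes, 4 * num_classes


print(box_predictor_outputs(91))  # matches cls_score / bbox_pred printed above
print(box_predictor_outputs(2))   # one object class plus background
```

For `91` classes this gives `91` and `364`, the `out_features` printed for `model.roi_heads`.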

#  <font style="color:blue">2. Faster RCNN with AlexNet Backbone</font>

In [0]:
from torchvision.models.detection import FasterRCNN

# number of classes is 2 (one object class + background)

ft_model = FasterRCNN(backbone=ft_backbone,
                      num_classes=2, 
                      min_size=ft_min_size, 
                      max_size=ft_max_size, 
                      image_mean=ft_mean, 
                      image_std=ft_std, 
                      rpn_anchor_generator=ft_anchor_generator, 
                      box_roi_pool=ft_roi_pooler)
ft_model = ft_model.to(device)

## <font style="color:green">2.1. Model Inference</font> 

In [25]:
input_image1 = image1.clone()

input_image2 = image2.clone()

# input is image list
inputs = [input_image1.to(device), input_image2.to(device)]

ft_model.eval()
output = ft_model(inputs)

print(output)

[{'boxes': tensor([[1.7834e+02, 3.2221e+02, 2.3245e+02, 3.5068e+02],
        [1.9000e+02, 2.1708e+02, 3.0626e+02, 3.5900e+02],
        [1.9436e+02, 2.5965e+02, 2.1921e+02, 3.0624e+02],
        [1.8164e+02, 2.8102e+02, 2.3684e+02, 3.0569e+02],
        [2.5561e+02, 3.2282e+02, 3.1316e+02, 3.5218e+02],
        [2.1547e+02, 2.7961e+02, 2.7209e+02, 3.0538e+02],
        [1.4949e+02, 3.0448e+02, 2.5016e+02, 3.5792e+02],
        [1.7703e+02, 2.1721e+02, 2.3312e+02, 3.1368e+02],
        [1.4695e+02, 2.3953e+02, 2.5904e+02, 3.5900e+02],
        [2.1273e+02, 2.1134e+02, 2.7442e+02, 3.1199e+02],
        [2.3421e+02, 2.5772e+02, 2.5892e+02, 3.1400e+02],
        [2.1721e+02, 2.6717e+02, 2.9212e+02, 3.4661e+02],
        [1.9375e+02, 3.0927e+02, 2.1732e+02, 3.5862e+02],
        [1.4121e+02, 2.8185e+02, 1.9467e+02, 3.0654e+02],
        [1.0478e+02, 1.6494e+02, 2.1878e+02, 3.5863e+02],
        [1.4568e+02, 1.1086e+02, 2.7043e+02, 3.5575e+02],
        [2.7622e+01, 2.3406e+02, 2.9426e+02, 3.4048e+02],
   

## <font style="color:green">2.2. Model Training</font>

In [26]:
input_image1 = image1.clone()

target1 = {
    'boxes': bboxes1.clone().to(device),
    'labels' : labels1.clone().to(device)
    
} 

input_image2 = image2.clone()


target2 = {
    'boxes': bboxes2.clone().to(device),
    'labels' : labels2.clone().to(device)
    
} 

inputs = [input_image1.to(device), input_image2.to(device)]
targets = [target1, target2]

# change to train mode
ft_model.train()
ft_model(inputs, targets)

{'loss_box_reg': tensor(0.0387, device='cuda:0', grad_fn=<DivBackward0>),
 'loss_classifier': tensor(0.6748, device='cuda:0', grad_fn=<NllLossBackward>),
 'loss_objectness': tensor(0.6991, device='cuda:0', grad_fn=<BinaryCrossEntropyWithLogitsBackward>),
 'loss_rpn_box_reg': tensor(0.0081, device='cuda:0', grad_fn=<DivBackward0>)}
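
The two classification losses are much higher than those of the pre-trained model in section 1.3. One way to read them, offered as an interpretation rather than a measured result: a two-class classifier that guesses uniformly has a cross-entropy of `ln 2`, and newly initialized heads start close to that value. The sketch below compares the numbers.

```python
import math

# Cross-entropy of a uniform guess over two classes.
chance_loss = math.log(2)

# Classification-type losses copied from the output above.
observed = {'loss_classifier': 0.6748, 'loss_objectness': 0.6991}

print('chance level: {:.4f}'.format(chance_loss))
for name, value in observed.items():
    print('{}: {:.4f} (difference {:+.4f})'.format(name, value, value - chance_loss))
```

Both losses lie within a few hundredths of the chance level, which is consistent with heads that have not been trained yet.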

**In the next section, we will see the training and validation of Faster RCNN model.**