# <font style="color:blue">Faster RCNN</font>

In this notebook, we trace how an image travels through Faster-RCNN, from the input to the final output.
Along the way, we shall explore all the blocks involved in the architecture of Faster-RCNN:

1. The Backbone network
2. Region-Proposal network
3. ROIPooling layer
4. ROI Heads.

In [1]:
# Let's import the necessary libs
import torch
import torchvision
import collections
from torchvision.models.detection import faster_rcnn
import torchvision.models as models
torch.manual_seed(2)

<torch._C.Generator at 0x7fb43d146d50>

In [2]:
# Let's keep the input size of the image as (800, 800, 3)

## <font style="color:blue">1.  The backbone</font>

<img src="https://www.learnopencv.com/wp-content/uploads/2020/09/c3-w8-FasterRcnn_backbone.jpg" width=700 height=700 >

The backbone is simply a Convolutional Neural Network with only convolutional-layers (no fully-connected layers).

So far we have not seen a CNN with only convolutional layers, so how do we get one now?  
Well, we have seen many CNNs such as ResNet, VGG or LeNet. Take a minute to think: what happens if we cut off their fully connected layers?  
We are left with only the convolutional layers, right?  
That is our backbone.

Let's see how this happens in code.


In [3]:
# Let's use a simple model, `AlexNet`, instead of VGG, ResNet or FPN.
# The torchvision library already ships this model, along with pretrained weights.

alexnet = models.alexnet(pretrained=False)
print(alexnet)

# As we see below, the model is broken down into `features`, `avgpool` and `classifier`.
# We shall notice that the `features` is full of Conv-layers and we are interested in only this module.
# So let's go ahead and just keep this `features` module and store it inside `backbone`.

backbone = alexnet.features

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
 

In [4]:
# Now if we print the backbone, it will just consist of the `features`.

# print(backbone) 

# We also need to note the number of output channels that `features` will return.  
# We can read it off the last convolution in the AlexNet features, layer `(10)` - 
# `(10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))`, which outputs `256` channels.


# Let's take a random input of size (800,800) and pass it through the `feature` module.
image = torch.randn(1, 3, 800, 800)
backbone_op = backbone(image)
print("Output shape from backbone is ", backbone_op.shape)


# We notice that the feature map has (24, 24) cells. Each of these cells serves as a location 
# at which we place a set of anchors with different sizes and aspect ratios.

Output shape from backbone is  torch.Size([1, 256, 24, 24])
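The `(24, 24)` size can be checked by hand. The following is a minimal sketch, assuming the standard floor-mode output-size formula for convolution and max-pooling layers; the helper names `conv_out` and `feature_map_size` are made up for this illustration, and the layer parameters are read off the AlexNet printout above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a conv or max-pool layer along one spatial dimension (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) of the spatial layers in `alexnet.features`.
# ReLU layers are skipped because they do not change the spatial size.
ALEXNET_FEATURE_LAYERS = [
    (11, 4, 2),  # (0)  Conv2d
    (3, 2, 0),   # (2)  MaxPool2d
    (5, 1, 2),   # (3)  Conv2d
    (3, 2, 0),   # (5)  MaxPool2d
    (3, 1, 1),   # (6)  Conv2d
    (3, 1, 1),   # (8)  Conv2d
    (3, 1, 1),   # (10) Conv2d
    (3, 2, 0),   # (12) MaxPool2d
]


def feature_map_size(input_size, layers=ALEXNET_FEATURE_LAYERS):
    """Push a single spatial dimension through every layer in turn."""
    size = input_size
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return size


print(feature_map_size(800))  # 24, matching the backbone output above
```

The same helper shows why the feature-map size changes with the input size: a smaller image simply yields fewer cells.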


#### <font style="color:green">At this point, we are done with the backbone, and now we can move to the Region Proposal Network.</font>

## <font style="color:blue">2. Region Proposal Layer</font>


<img src="https://www.learnopencv.com/wp-content/uploads/2020/09/c3-w8-FasterRcnn_rpn_layer.jpg" width=700 height=700>


Like the backbone, the RPN is simply a convolutional model.
But it has a couple of special properties.
1. It takes one input but produces two outputs: one gives the objectness score and the other gives the box locations.
2. The spatial size (height and width) of both outputs is the same as that of its input.


A few things to keep in mind about the RPN Layer:

1. RPN is a neural network which tells whether an anchor contains an object or not (a binary classifier). 
   Note that it does not tell which object it is. 
2. It also refines the anchor boxes associated with objects.
3. Once we have the positive anchors, we pass them to the ROIPooling layer, which is explained in a later section.

Note that Faster RCNN places a very large number of anchor boxes, and classifying every one of them in full would be wasteful since most of them are background anchors. Hence we use the RPN module to discard these background anchors and keep only the positive ones.  
This is what we refer to as the first stage in object detection.

Let's see how we can create this RPN Layer via code.

In [5]:
# The RPNHead is the convolutional model we talked about in the previous block.
# Again, torchvision has already built the `RPNHead` for us. You see, it is a very simple 
# convolutional block parameterized with `in_channels` and `num_anchors`.

# Notice that `in_channels` must be the same as final channels from the `backbone` i.e 256.
# What about the `num_anchors`?
# Well, it depends on how many anchors we want to place over the feature map. 
# For example, if we want to place anchors of sizes-
# [ (128, 64), (128, 128), (128, 256), 
#  (256, 128), (256, 256), (256,512), 
# (512, 256), (512, 512), (512, 1024)]
# we will substitute 9 as the parameter for `num_anchors`.

from torchvision.models.detection import rpn
rpn_layer = rpn.RPNHead(in_channels = 256, num_anchors = 9)
print("The RPN layer \n\n",rpn_layer)


# After we print the `RPNHead()` we notice `cls_logits`, which is a Conv-layer with 9 output channels.
# Each of these output channels corresponds to one anchor box.
# And what about the `36` channels in `bbox_pred`? 
# Well, we can split these 36 channels into 9 groups of 4, such that the first four channels 
# describe the box location for anchor (128, 64),
# the next 4 channels correspond to the box location for the anchor of size (128, 128), and so on.

The RPN layer 

 RPNHead(
  (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (cls_logits): Conv2d(256, 9, kernel_size=(1, 1), stride=(1, 1))
  (bbox_pred): Conv2d(256, 36, kernel_size=(1, 1), stride=(1, 1))
)


In [6]:
# Let's forward the output of the backbone to the `RPNHead()` and look at the output size.
# `RPNHead` expects a list of feature maps (one per feature level), so we wrap our single map in a list
# and read the first (and only) entry of each returned list.

object_score, bbox_locs = rpn_layer([backbone_op])
print("Output shape from RPN-layer is ", object_score[0].shape, bbox_locs[0].shape)

# As we see below, we have a feature map with 9 channels but with the same size as the backbone output.
# Similarly, we have another feature map with 36 channels( 4 channels each for 9 anchors) of same size.


Output shape from RPN-layer is  torch.Size([1, 9, 24, 24]) torch.Size([1, 36, 24, 24])
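To make the channel bookkeeping concrete, here is a small sketch. It assumes the grouping described in the comments above (four consecutive `bbox_pred` channels per anchor); the helper name `bbox_channel_to_anchor` is hypothetical and only serves this illustration.

```python
NUM_ANCHORS = 9          # anchors per feature-map cell
FMAP_H, FMAP_W = 24, 24  # spatial size of the backbone output


def bbox_channel_to_anchor(channel):
    """Map a `bbox_pred` channel index (0..35) to (anchor index, coordinate index)."""
    return divmod(channel, 4)


# Every cell carries one objectness score and one 4-number box per anchor.
total_anchor_positions = NUM_ANCHORS * FMAP_H * FMAP_W
print(total_anchor_positions)

# Channel 0 is the first coordinate of anchor 0, channel 35 the last coordinate of anchor 8.
print(bbox_channel_to_anchor(0), bbox_channel_to_anchor(5), bbox_channel_to_anchor(35))
```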


#### <font style="color:green">At this point, we need to understand something called Receptive Field.</font>

Simply put, the receptive field of a feature-map cell is the region of the input image seen by that cell.  

For example, if our input is of size `(1, 3, 800, 800)` and we pass this image through a ResNet and grab the output 
feature maps at a few layers, say `(1, 64, 200, 200)`, `(1, 32, 80, 80)`, `(1, 128, 50, 50)`, then each cell of these
feature maps lines up with a `(4x4)`, `(10x10)` and `(16x16)` patch of the image respectively (image size divided by feature-map size).  
Strictly speaking, this patch size is the stride of the feature map; the true receptive field is usually larger because neighbouring convolution windows overlap. In this notebook we stick to the simpler patch view.  

Essentially, if we consider the fmap `(1, 128, 50, 50)`, then each of the `(50, 50)` cells will correspond to 
an area in the original image.  
For example, the fmap cell at zero-based location `(49, 49)` will correspond to the bottom-right portion of the image.  
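The cell-to-image correspondence described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: a square image, a square feature map, zero-based cell indices, and a patch size equal to image size divided by feature-map size. The function name `cell_to_image_region` is made up for this example.

```python
def cell_to_image_region(row, col, fmap_size, image_size):
    """Return (x1, y1, x2, y2) of the image patch that a feature-map cell lines up with."""
    stride = image_size / fmap_size  # how many image pixels one cell covers per side
    x1, y1 = col * stride, row * stride
    return (x1, y1, x1 + stride, y1 + stride)


# Last cell of a (50, 50) feature map on an 800x800 image: the bottom-right 16x16 patch.
print(cell_to_image_region(49, 49, fmap_size=50, image_size=800))

# First cell of a (24, 24) feature map: a patch of roughly 33.33 pixels per side.
print(cell_to_image_region(0, 0, fmap_size=24, image_size=800))
```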


#### <font style="color:green">Now that you know what a receptive field is, let's jump back to the previous explanation.</font>

Previously, we saw that we have feature maps of size `(1, 9, 24, 24)` and `(1, 36, 24, 24)` from the RPN output. 

Can you guess what will be the receptive field of this feature map? It will be `(33.33 x 33.33)`, since `800 / 24 = 33.33`.  

It is as though each of the `(24, 24)` cells is looking at a separate `(33.33 x 33.33)` portion of 
the image and hence will be associated with its own set of anchor boxes.  

Since each cell (in the feature map) corresponds to reference anchor boxes (in the image), 
the RPN needs to predict whether each anchor at that cell contains an object or not (a binary classifier).  
Therefore, the RPN is really looking at `(9 * 24 * 24)` candidate regions and trying to figure out
which of these regions might contain an object.

Well, the job of binary prediction is naturally handled by a `Sigmoid()` or a `Softmax()` layer. 
Hence we use a Cross Entropy Loss.

Nevertheless, most objects will not coincide exactly with the reference anchors, right?  
Hence, we need to predict offsets to the anchors, i.e. four numbers per anchor such as `(dx1, dy1, dx2, dy2)` (torchvision encodes them as centre shifts and log-scale size changes, `(dx, dy, dw, dh)`).  
This kind of prediction on offsets is a regression problem, hence we use an L1 or L2 loss. 
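To illustrate how predicted offsets turn an anchor into a refined box, here is a minimal sketch. It assumes the centre/size parameterisation `(dx, dy, dw, dh)` from the Faster-RCNN paper rather than corner deltas, and the names `decode_box` and `l1_loss` are hypothetical helpers written for this example, not torchvision functions.

```python
import math


def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) offsets to an anchor given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = anchor
    dx, dy, dw, dh = deltas
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    # The centre moves by a fraction of the anchor size; width/height scale exponentially.
    cx, cy = acx + dx * aw, acy + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def l1_loss(predicted, target):
    """Mean absolute error between two equally long sequences of offsets."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


anchor = (100.0, 100.0, 200.0, 200.0)
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))  # zero offsets leave the anchor unchanged
print(decode_box(anchor, (0.1, 0.0, 0.0, 0.0)))  # shifts the box right by 10% of its width
print(l1_loss((0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
```

Predicting relative offsets instead of absolute coordinates keeps the regression targets small and comparable across anchors of very different sizes.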

In [7]:
# At this point, we have just created the RPN but we haven't set (or associated) the anchors with the channels yet.
# So we need to create these anchors.

# Fortunately, `torchvision` also has a function which generates the anchor boxes with respective aspect ratios.
anchor_generator = rpn.AnchorGenerator(sizes=((128, 256, 512),), aspect_ratios=(0.5, 1, 2))

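# The above line pairs each of the 3 sizes with each of the 3 aspect ratios, i.e. 9 anchors per location.
# Nominally, we can think of them as
# [ (128, 64), (128, 128), (128, 256), 
#  (256, 128), (256, 256), (256,512), 
# (512, 256), (512, 512), (512, 1024)]
# (torchvision keeps the area of each anchor close to size**2, so the exact widths and heights differ slightly).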
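As a rough check on the anchor shapes, the sketch below computes widths and heights under an area-preserving rule: height is `size * sqrt(ratio)` and width is `size / sqrt(ratio)`, with the aspect ratio read as height over width. This rule is stated here as an assumption about how `AnchorGenerator` behaves, and `anchor_shapes` is a hypothetical helper, so treat the numbers as illustrative.

```python
import math


def anchor_shapes(sizes, aspect_ratios):
    """Return rounded (width, height) pairs for every size/ratio combination.

    Assumption: the aspect ratio is height / width and the area stays close to size**2.
    """
    shapes = []
    for size in sizes:
        for ratio in aspect_ratios:
            height = size * math.sqrt(ratio)
            width = size / math.sqrt(ratio)
            shapes.append((round(width), round(height)))
    return shapes


shapes = anchor_shapes(sizes=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0))
print(len(shapes))  # 9 anchors per location
for width, height in shapes:
    print(width, height, width * height)
```

Under this rule a size-128 anchor with ratio 0.5 is about 181 wide and 91 tall, rather than exactly `(128, 64)`.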


In [8]:
# Now that we have created the `rpn_layer` and the anchors, i.e. `anchor_generator`, 
# let's fuse these two into a single block.

# Again, torchvision provides us a class `RegionProposalNetwork()` whose inputs 
# will be the `anchor_generator`, the `rpn_layer` and a few parameters.

# There are a few technical details about parameters such as `nms_thresh`, `fg_iou_thresh`, 
# `bg_iou_thresh`, etc. which will not be covered here.
# The motivation for these parameters can be found in the Faster-RCNN research paper itself.


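rpn_pre_nms = dict(training=2000, testing=1000)
rpn_post_nms = dict(training=2000, testing=1000)
region_proposer = rpn.RegionProposalNetwork(anchor_generator=anchor_generator,
                                            head=rpn_layer,
                                            fg_iou_thresh=0.5,
                                            bg_iou_thresh=0.5,
                                            batch_size_per_image=512,
                                            positive_fraction=0.5,
                                            pre_nms_top_n=rpn_pre_nms,
                                            post_nms_top_n=rpn_post_nms,
                                            nms_thresh=0.7)
region_proposer.eval()  # inference mode, so the `testing` limits above are used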

# Let's print this `region_proposer`
print("The Region-Proposal-Network is \n\n\n", region_proposer)

# As you see, the whole `Region Proposal Network` consists of the `anchor_generator` i.e the anchors, and the 
# `RPNHead` which is the rpn-layer.

The Region-Proposal-Network is 


 RegionProposalNetwork(
  (anchor_generator): AnchorGenerator()
  (head): RPNHead(
    (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (cls_logits): Conv2d(256, 9, kernel_size=(1, 1), stride=(1, 1))
    (bbox_pred): Conv2d(256, 36, kernel_size=(1, 1), stride=(1, 1))
  )
)


#### <font style="color:green">Let's see how many objects (or locations) the RPN can detect out of the `(9*24*24) = 5184` possible locations.</font>

In [9]:
from torchvision.models.detection import image_list

inputs = image_list.ImageList(image,  [(800,800)])

# The `region_proposer` takes an `ImageList` (a batch of images with their sizes) and a dict of backbone feature maps.
rpn_bbox, _ = region_proposer(inputs, dict(feats = backbone_op))

# Since `inputs` holds a list of images, the output, i.e. `rpn_bbox`, will also be a list of boxes (one entry per image). 
# Hence we need to index the list to access the boxes.

print("Number of boxes with objects detected by RPN " , rpn_bbox[0].shape)

num_boxes = rpn_bbox[0].shape[0]

Number of boxes with objects detected by RPN  torch.Size([312, 4])


#### <font style="color:green">Now that we have identified the regions with potential objects using the RPN, can we go ahead and predict which object it was? </font>
The answer is no.

#### <font style="color:green">At this point, the most important thing to notice is that each box in `rpn_bbox` has a different size, hence we cannot apply a Fully Connected Layer to all of them directly.</font>


These boxes (or regions of feature maps) need to be scaled down or scaled up to a particular size first, and only then can we apply our Fully Connected layers.  

This process of scaling an arbitrary-sized feature map down or up to a fixed size is done by a special layer called the `ROIPooling layer`.   
In PyTorch, the closest general-purpose equivalent is the `nn.AdaptiveMaxPool2d()` layer.  


##  <font style="color:blue">3. ROI Pooling Layer</font>

<img src="https://www.learnopencv.com/wp-content/uploads/2020/09/c3-w8-FasterRcnn_roi_pool.jpg" width=700 height=700>


In [10]:
# Let's try and use the `nn.AdaptiveMaxPool2d()` layer once to verify what it really does.

# Its argument will be the final output size that we need. Let's use the final size of (7,10).
fixed_size_pooler = torch.nn.AdaptiveMaxPool2d((7,10)) 
inp1 = torch.randn(1, 64, 77, 63)
inp2 = torch.randn(3, 44, 67)
inp3 = torch.randn(1, 33, 10, 10)
out1 = fixed_size_pooler(inp1)
out2 = fixed_size_pooler(inp2)
out3 = fixed_size_pooler(inp3)

print("Output from ROI-Pooling with different-sized inputs ", out1.shape, out2.shape, out3.shape )

# Notice how the feature sizes are all (7, 10).

Output from ROI-Pooling with different-sized inputs  torch.Size([1, 64, 7, 10]) torch.Size([3, 7, 10]) torch.Size([1, 33, 7, 10])
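To see why the output size no longer depends on the input size, the following sketch re-implements adaptive max pooling on plain nested lists. The window rule used here (start at `floor(i * in / out)`, end at `ceil((i + 1) * in / out)`) is the commonly described one for adaptive pooling and is stated as an assumption; `adaptive_bins` and `adaptive_max_pool2d` are illustrative helpers, not PyTorch functions.

```python
def adaptive_bins(in_size, out_size):
    """Return the [start, end) index range of each pooling window along one dimension."""
    return [((i * in_size) // out_size,                        # floor(i * in / out)
             ((i + 1) * in_size + out_size - 1) // out_size)   # ceil((i + 1) * in / out)
            for i in range(out_size)]


def adaptive_max_pool2d(grid, out_h, out_w):
    """Max-pool a 2D list of numbers down (or up) to exactly (out_h, out_w)."""
    row_bins = adaptive_bins(len(grid), out_h)
    col_bins = adaptive_bins(len(grid[0]), out_w)
    return [[max(grid[r][c] for r in range(r0, r1) for c in range(c0, c1))
             for (c0, c1) in col_bins]
            for (r0, r1) in row_bins]


small = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4x4 grid holding 0..15
large = [[r * 9 + c for c in range(9)] for r in range(6)]   # 6x9 grid

print(adaptive_max_pool2d(small, 2, 2))
pooled_large = adaptive_max_pool2d(large, 2, 2)
print(len(pooled_large), len(pooled_large[0]))  # both inputs end up as 2x2
```

The windows grow or shrink with the input, which is exactly what lets differently sized regions share the same Fully Connected layers afterwards.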


In [11]:
# Fortunately, again `torchvision` has a class which does the task of ROI Pooling for us.
# It goes by the name `MultiScaleRoIAlign()`.
# Note that it serves the same purpose as `nn.AdaptiveMaxPool2d()` (variable-sized region in, fixed-sized output),
# but it also crops the regions out of the feature map for us and samples them with bilinear interpolation.
# Hence we tend to use this class instead of `nn.AdaptiveMaxPool2d()`.

# It takes the feature-map names (`featmap_names`) as input, the final `output_size` which we want to end up with
# and a `sampling_ratio` which tells how many points to sample in each output bin.

from torchvision.ops import MultiScaleRoIAlign
roi_pooler = MultiScaleRoIAlign(['0'], output_size=(7, 10), sampling_ratio=2)


# Since ROI Pooling pools the variable-sized feature-map regions into fixed-sized regions, 
# it needs the backbone features (from the backbone) and the adjusted boxes (from the RPN-layer).
# It also needs the original image size to calculate the scale.

# We already have the output from the backbone, i.e. `backbone_op`.
# We also have the RPN output, i.e. `rpn_bbox`, but since the RPN is untrained these boxes are essentially random. 
# Hence we will simply create random boxes ourselves. 

# One quick question! Can you guess what will be the size of the output from `roi_pooler`?
# Don't worry if you don't know the answer.


fmaps = collections.OrderedDict()
fmaps["0"]=backbone_op

# Let's assume that we have the 312 boxes from the RPN 
# (the same number of boxes we got previously, i.e. `rpn_bbox[0].shape[0]`).
rpn_bbox_rand = torch.rand(num_boxes, 4) * 400  # random boxes standing in for the RPN output,
                                                # initially as (x1, y1, width, height)
rpn_bbox_rand[:, 2:] += rpn_bbox_rand[:, :2]    # torchvision expects (x1, y1, x2, y2), i.e. (left, top, right, bottom),
                                                # hence we add (x1, y1) to (width, height)

roi_pooling_op = roi_pooler(fmaps, [rpn_bbox_rand], [(800, 800)])

print("Output from ROI-Pooling layer ", roi_pooling_op.shape)



Output from ROI-Pooling layer  torch.Size([312, 256, 7, 10])
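The reason the pooler needs the image size is that the boxes live in image coordinates while the features live on a much smaller grid. The sketch below shows one way to bridge the two; it assumes the scale is the feature-map-to-image ratio rounded to the nearest power of two, which is how I understand `MultiScaleRoIAlign` to infer it, so treat that detail as an assumption. `infer_spatial_scale` and `project_box` are hypothetical helper names.

```python
import math


def infer_spatial_scale(fmap_size, image_size):
    """Round fmap_size / image_size to the nearest power of two (assumed behaviour)."""
    return 2.0 ** round(math.log2(fmap_size / image_size))


def project_box(box, scale):
    """Convert an (x1, y1, x2, y2) box from image coordinates to feature-map coordinates."""
    return tuple(coord * scale for coord in box)


scale = infer_spatial_scale(fmap_size=24, image_size=800)
print(scale)  # 24 / 800 = 0.03, which rounds to 2 ** -5

# A 320x320 box in the image covers a 10x10 region of the feature map at this scale.
print(project_box((160.0, 320.0, 480.0, 640.0), scale))
```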


### <font style="color:green">Story so far</font>

You see, we obtained the backbone feature map of size `(1, 256, 24, 24)` by passing the input through an `AlexNet`.

Using this feature map, the RPN predicted `312 variable-sized regions` in the backbone feature map (or 312 variable-sized boxes) which potentially contain objects `(at object or no-object level only)`.  

But in order to process these variable-sized feature maps further, they need to be scaled to a fixed size, hence we used ROI-Pooling to bring all the `312 boxes` to a fixed size of `(7, 10)`.  

Therefore we have 312 such feature maps, each of size `(256, 7, 10)`, resulting in `(312, 256, 7, 10)`. 

Since we now want to connect the output of ROI-Pooling to Fully Connected layers, we should `flatten-out` the `roi-pooling-output` first and then feed it to the `Linear-layers`.


#### <font style="color:green">Now we can move on to predicting which object each of these 312 regions contains, and to further refining their box locations with the help of Linear layers.</font>

## <font style="color:blue">4. ROI Heads</font>
<img src="https://www.learnopencv.com/wp-content/uploads/2020/09/c3-w8-FasterRcnn_roi_head.png" width=700 height=700>


In [12]:
# 1. The MLP.

# Again, `torchvision` provides us an in-built linear layer called `TwoMLPHead()`. It has two arguments -
# the 'in_channels' which is the flattened version of `256*7*10` and
#`representation_size` which is kept as 1024.
# Let's go ahead and create it.

from torchvision.models.detection import faster_rcnn
mlp_head = faster_rcnn.TwoMLPHead(in_channels = 256*7*10, representation_size=1024)
print("MLPHead is \n", mlp_head)

# As you see below, it is a very simple network with just two linear layers.

# Note that we did not flatten the roi_pooling_op. The reason is that the `TwoMLPHead` does the flattening for us.
mlp_op = mlp_head(roi_pooling_op)

print("Output from the mlp-head is ", mlp_op.shape)

MLPHead is 
 TwoMLPHead(
  (fc6): Linear(in_features=17920, out_features=1024, bias=True)
  (fc7): Linear(in_features=1024, out_features=1024, bias=True)
)
Output from the mlp-head is  torch.Size([312, 1024])
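The `in_features=17920` in the printout is just the flattened ROI-Pooling output. As a quick arithmetic sketch (the helper `linear_params` is made up for this example), we can also count how many parameters the two linear layers hold:

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


flattened = 256 * 7 * 10               # channels * pooled height * pooled width
fc6 = linear_params(flattened, 1024)   # first layer of the MLP head
fc7 = linear_params(1024, 1024)        # second layer of the MLP head

print(flattened)          # 17920
print(fc6, fc7, fc6 + fc7)
```

Almost all of the parameters sit in `fc6`, which is why the choice of ROI-Pooling output size has a large effect on the size of the head.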


By now you must be wondering: have we reached the output layer?  
Well, we are almost there! Just one more layer.  
We call this last layer the FastRCNNPredictor. Again, it is a simple network with one special property.  
It produces two outputs, one for the classification and the other for the bounding-box regression.

In [13]:
# 2. The FastRCNNPredictor.

# Well, torchvision already has it for us, called `FastRCNNPredictor()`.
# It takes `in_channels`, which will be 1024, and `num_classes`, which we shall keep as `17`.

final_layer = faster_rcnn.FastRCNNPredictor(in_channels=1024, num_classes=17)
# Let's see how the final layer looks like.

print("Final layer of Faster-RCNN is ", final_layer)

# As you see, there is a `cls_score` whose output nodes are equal to num_classes
# and a `bbox_pred` whose output nodes are equal to 4 times num_classes.
# Basically, the `68` nodes are split into 17 groups, one per class, where each group holds 4 nodes
# with the box offsets predicted for that class.

Final layer of Faster-RCNN is  FastRCNNPredictor(
  (cls_score): Linear(in_features=1024, out_features=17, bias=True)
  (bbox_pred): Linear(in_features=1024, out_features=68, bias=True)
)


In [14]:
# Now let's pass in the output of `mlp_head` to the `final_layer`.

final_scores, final_bboxes = final_layer(mlp_op)
print("Output from final-layer of Faster-RCNN is ", final_scores.shape, final_bboxes.shape)


Output from final-layer of Faster-RCNN is  torch.Size([312, 17]) torch.Size([312, 68])
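To read one row of these outputs, we turn the 17 scores into probabilities and pick the 4 box offsets that belong to the winning class. The sketch below does this for a single region with made-up numbers; `softmax` and `pick_prediction` are hypothetical helpers, and the layout of 4 consecutive offsets per class is assumed from the description above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_prediction(class_logits, box_deltas):
    """Return (class index, probability, offsets of that class) for one region."""
    probs = softmax(class_logits)
    cls = max(range(len(probs)), key=probs.__getitem__)
    return cls, probs[cls], box_deltas[4 * cls: 4 * cls + 4]


# Made-up outputs for one region: class 3 has the highest score.
class_logits = [0.0] * 17
class_logits[3] = 5.0
box_deltas = list(range(68))  # stand-in values so the selected slice is easy to recognise

cls, prob, deltas = pick_prediction(class_logits, box_deltas)
print(cls, round(prob, 3), deltas)
```

In torchvision's detection models, class index 0 is reserved for the background, so a region whose best class is 0 would normally be discarded.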


We finally got the classes for each of the boxes the RPN detected.  
However, most of these boxes overlap a lot.   
Hence we need to use a technique called `Non-Maximum Suppression` (NMS), which removes most of these overlapping boxes.
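The idea behind this filtering step can be sketched in plain Python. This is a minimal greedy version for illustration, assuming boxes in `(x1, y1, x2, y2)` format and an IoU threshold of 0.5 chosen arbitrarily; in practice one would use `torchvision.ops.nms` instead of the hypothetical `iou` and `nms` helpers below.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first heavily and is dropped
```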

#####  <font style="color:green">Can we combine all the topics covered and all the functions called in just a few lines of code? Yes!</font>

##### <font style="color:green">Let's see how we can do it.</font>

In [15]:
from torchvision.ops import MultiScaleRoIAlign
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.models.detection import FasterRCNN


# Get the backbone of any pretrained network, we'll use AlexNet
alexnet = models.alexnet(pretrained=True)
new_backbone = alexnet.features
new_backbone.out_channels = 256

# Configure the anchors. We shall have 12 different anchors.
new_anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256),), 
                                      aspect_ratios=((0.5, 1.0, 2.0),))

# Configure the output size of ROI-Pooling layer. 
# We shall end up with (num_boxes, num_features, 4, 4) after the ROIPooling layer
new_roi_pooler = MultiScaleRoIAlign(featmap_names=['0'], output_size=4, sampling_ratio=1)


# let's use dummy variables for mean, std, min_size and max_size
min_size = 300
max_size = 500
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

# Instantiate the Faster-rcnn model with the variables declared above.
frcnn_model = FasterRCNN(backbone=new_backbone,
                      num_classes=17, 
                      min_size=min_size, 
                      max_size=max_size, 
                      image_mean=mean, 
                      image_std=std, 
                      rpn_anchor_generator=new_anchor_generator, 
                      box_roi_pool=new_roi_pooler)


In [16]:
# As you see below, the backbone, rpn and roi_heads are joined to form one big network.
print(frcnn_model)


FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(300,), max_size=500, mode='bilinear')
  )
  (backbone): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (

This notebook has mainly focused on how the image travels through the network.  
We have shown the different modules present in the Faster-RCNN model.  
However, ideas such as how an anchor is labelled positive or negative during RPN training,
how NMS is performed, etc. are left untouched.  
We encourage students to consult the research paper for further details.


##### <font style="color:green">References:-</font>

1. The image used throughout this notebook was taken from [here](https://i.insider.com/5d9cd5947fa74b01402f5082?width=1100&format=jpeg&auto=webp)
2. https://github.com/pytorch/vision/tree/master/torchvision/models/detection