# What are Convolutional Neural Networks?

Convolutional Neural Networks (ConvNets or CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, as well as powering vision in robots and self-driving cars.


A Convolutional Neural Network (CNN) consists of one or more convolutional layers (often with a subsampling step) followed by one or more fully connected layers, as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling, which results in translation-invariant features. Another benefit of CNNs is that they are easier to train and have far fewer parameters than fully connected networks with the same number of hidden units. In this article we will discuss the architecture of a CNN and the backpropagation algorithm used to compute the gradient with respect to the model's parameters, so that gradient-based optimization can be used.

## Convolution and Cross-Correlation
![Conv%20Corre.PNG](attachment:Conv%20Corre.PNG)

![Conv%20and%20Corre.PNG](attachment:Conv%20and%20Corre.PNG)
### NOTE : Many machine learning libraries implement cross-correlation but call it convolution
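The note above can be checked numerically. The following is a minimal sketch in plain Python; the function names `cross_correlate2d` and `convolve2d` are illustrative and not taken from any library. It assumes "valid" mode (no padding, stride 1) and shows that true convolution is cross-correlation with the kernel rotated by 180 degrees.

```python
def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel without flipping it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def convolve2d(image, kernel):
    """True convolution: rotate the kernel by 180 degrees, then cross-correlate."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(image, flipped)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))  # [[-4, -4], [-4, -4]]
print(convolve2d(image, kernel))         # [[4, 4], [4, 4]]
```

For a symmetric kernel the two results coincide, which is one reason the naming difference rarely matters when the kernel is learned.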

## Correlation and Convolution
![Conv%20and%20Core%202.PNG](attachment:Conv%20and%20Core%202.PNG)

### Example : Input and Filter Operation
![Input%20and%20Filter.PNG](attachment:Input%20and%20Filter.PNG)

### Parameters
![Parameters.PNG](attachment:Parameters.PNG)

### Relation Between Input and Output : CNN
![Relation%20between%20input%20and%20output%20of%20CNN.PNG](attachment:Relation%20between%20input%20and%20output%20of%20CNN.PNG)
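The relation between input and output size can be sketched as a small helper. This assumes the standard formula `(N - F + 2P) / S + 1` for input size `N`, filter size `F`, padding `P` and stride `S`, which is presumably what the figure above shows; the function name `conv_output_size` is illustrative.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Output width (or height) of a convolution or pooling layer.

    n: input size, f: filter size. Uses floor division, as most
    frameworks do when the filter does not tile the input exactly.
    """
    return (n - f + 2 * padding) // stride + 1


print(conv_output_size(5, 3))               # 3  (5x5 input, 3x3 filter)
print(conv_output_size(5, 3, padding=1))    # 5  (padding of 1 keeps the size)
print(conv_output_size(32, 5))              # 28
print(conv_output_size(227, 11, stride=4))  # 55
```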

# Parameter Sharing
The weights and biases we learn for a given output layer are shared across all patches in a given input layer. Note that as we increase the depth of our filter, the number of weights and biases we have to learn still increases, as the weights aren't shared across the output channels.

There’s an additional benefit to sharing our parameters. If we did not reuse the same weights across all patches, we would have to learn new parameters for every single patch and hidden layer neuron pair. This does not scale well, especially for higher fidelity images. Thus, sharing parameters not only helps us with translation invariance, but also gives us a smaller, more scalable model.
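To make the scaling argument concrete, here is a rough count of learnable parameters with and without sharing. The layer sizes (a 32x32x3 input, 5x5 filters, 6 output channels) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def shared_conv_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel, shared across all patches."""
    return (kernel * kernel * in_ch + 1) * out_ch


def unshared_params(in_size, kernel, in_ch, out_ch):
    """Same layer if every output position learned its own weights and biases."""
    out_size = in_size - kernel + 1  # valid convolution, stride 1
    return out_size * out_size * shared_conv_params(kernel, in_ch, out_ch)


shared = shared_conv_params(5, 3, 6)
unshared = unshared_params(32, 5, 3, 6)
print(shared)    # 456
print(unshared)  # 357504
```

Without sharing, the count grows with the number of output positions (28 x 28 = 784 here), which is why it scales poorly for larger images.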

# Padding
![](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/5837d4d5_screen-shot-2016-11-24-at-10.05.37-pm/screen-shot-2016-11-24-at-10.05.37-pm.png)
<center>A 5x5 grid with a 3x3 filter. Source: Andrej Karpathy.</center>

# Max Pooling
![](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/582aac09_max-pooling/max-pooling.png)
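As a minimal sketch of max pooling in plain Python, the function below keeps the largest value in each window. The 2x2 window with stride 2 and the sample values are assumptions for illustration; `max_pool2d` is a hypothetical name.

```python
def max_pool2d(image, size=2, stride=2):
    """Max pooling over a 2D list: keep the largest value in each window."""
    out_h = (len(image) - size) // stride + 1
    out_w = (len(image[0]) - size) // stride + 1
    return [
        [
            max(image[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 1, 2, 4],
         [5, 6, 7, 8],
         [3, 2, 1, 0],
         [1, 2, 3, 4]]

print(max_pool2d(image))  # [[6, 8], [3, 4]]
```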


![](https://d17h27t6h515a5.cloudfront.net/topher/2016/November/581a58be_convolution-schematic/convolution-schematic.gif)
<center>Convolution with 3×3 Filter. Source: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution</center>

# Classification Architectures :


# 01. LeNet-5
>Paper : Gradient-Based Learning Applied to Document Recognition

### Source of the LeNet-5 Paper : 

<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf" target="_blank">Paper: <u>Gradient-Based Learning Applied to Document Recognition</u></a>


### Structure of the LeNet network


![LeNet%20Architecture.PNG](attachment:LeNet%20Architecture.PNG)

![1.png](attachment:1.png)

### 5 Layers = 2 Conv Layers (i.e. Conv + Pooling) + 2 Fully Connected Layers + 1 Output Layer

* ### Layer 1 : Convolution layer: Conv + downsampling (Activation : Tanh)
* ### Layer 2 : Convolution layer: Conv + downsampling (Activation : Tanh)
* ### Layer 3 : Fully connected layer (Activation : Tanh). The paper calls it C5: a 5x5 convolution over a 5x5 input, so it acts as a fully connected layer
* ### Layer 4 : Fully connected layer (Activation : Tanh)
* ### Layer 5 : Output layer (Activation : Softmax in modern re-implementations, RBF units in the paper) (Loss Function : Maximum Likelihood Estimation or MSE)


Counting the subsampling layers separately, LeNet-5 has seven layers in total, not including the input, and each of them contains trainable parameters. Each layer has multiple feature maps; each feature map extracts one characteristic of its input by means of a convolution filter, and each feature map contains multiple neurons.

![2_1.PNG](attachment:2_1.PNG)

### LeNet5 Calculations
![Lenet.jpg](attachment:Lenet.jpg)
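The feature-map sizes in LeNet-5 can be traced with a few lines of Python. This sketch assumes the layer configuration from the LeNet-5 paper (32x32 input, 5x5 convolutions, 2x2 subsampling with stride 2); `LENET5_LAYERS` and `lenet5_shapes` are illustrative names, and the figure above may present the calculation differently.

```python
def out_size(n, f, stride=1):
    """Spatial output size without padding."""
    return (n - f) // stride + 1


# (name, kernel size, stride, number of feature maps), as in the LeNet-5 paper
LENET5_LAYERS = [
    ("C1", 5, 1, 6),
    ("S2", 2, 2, 6),
    ("C3", 5, 1, 16),
    ("S4", 2, 2, 16),
    ("C5", 5, 1, 120),
]


def lenet5_shapes(input_size=32):
    """Return (layer name, spatial size, feature maps) for each layer."""
    shapes = []
    size = input_size
    for name, kernel, stride, maps in LENET5_LAYERS:
        size = out_size(size, kernel, stride)
        shapes.append((name, size, maps))
    return shapes


for name, size, maps in lenet5_shapes():
    print(f"{name}: {maps} feature maps of {size}x{size}")
```

C5 ends up with 1x1 feature maps, which is why it behaves like a fully connected layer of 120 units.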

![Loss%20Fn%20in%20the%20Paper.PNG](attachment:Loss%20Fn%20in%20the%20Paper.PNG)

### ILSVRC Winners
![ILSVRC%20winners.PNG](attachment:ILSVRC%20winners.PNG)

![ILSVRC.PNG](attachment:ILSVRC.PNG)

# AlexNet :  winner of the ILSVRC 2012 :

### Source of the AlexNet Paper : 
https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

* ### AlexNet = 5Conv + 3 FC = 8 Layer Architecture

* ### Layer 1 : Convolution layer: Convo + Maxpooling (Activation : ReLu )
* ### Layer 2 : Convolution layer: Convo + Maxpooling (Activation : ReLu )
* ### Layer 3 : Convolution layer  (Activation : ReLu )
* ### Layer 4 : Convolution layer  (Activation : ReLu )
* ### Layer 5 : Convolution layer: Convo + Maxpooling (Activation : ReLu )
* ### Layer 6 : Fully Connected layers  (Activation : ReLu )
* ### Layer 7 : Fully Connected layers  (Activation : ReLu )
* ### Layer 8 : Output Layer  (Activation : Softmax ) 



* ### ILSVRC : ImageNet Large Scale Visual Recognition Challenge

* ### AlexNet popularized the ReLU activation function in deep CNNs

* ### AlexNet was trained on two GPUs, with the network split across them

* ### AlexNet introduced Local Response Normalization

* ### AlexNet relied heavily on data augmentation to reduce overfitting

* ### Number of parameters : 27.561 Million as counted in this notebook (the AlexNet paper reports about 60 million)

![AlexNet.PNG](attachment:AlexNet.PNG)

# ZFNet-2013

* ### The number of parameters in AlexNet = 27.561 M
* ### ZFNet reduces the number of parameters:
* ### the filter sizes are reduced and the stride of the convolutions is reduced.
* ### The number of parameters in ZFNet = 26.111 M

* ### Difference in total number of parameters: 1.45 M



# VGG-Net [ Visual Geometry Group ]

### VGG net uses small 3*3 kernels throughout
### VGG net has more layers than AlexNet, but every convolution uses a small kernel

### Source of the Paper :
https://arxiv.org/pdf/1409.1556.pdf

![VGG%201.png](attachment:VGG%201.png)

>VGG16 contains 16 layers and VGG19 contains 19 layers. All VGG variants share exactly the same last three fully connected layers. The overall structure has 5 groups of convolutional layers, each followed by a MaxPool. The variants differ in how many cascaded convolutional layers each of the five groups contains.

![VGG%2016%20Table.jpg](attachment:VGG%2016%20Table.jpg)

### 138 million parameters.

#### Training

**The optimization method** is stochastic gradient descent (SGD) with momentum (0.9).
The batch size is 256.

**Regularization** : L2 regularization is used, and the weight decay is 5e-4. Dropout is after the first two fully connected layers, p = 0.5.
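One update step of SGD with momentum and L2 weight decay can be sketched as follows. The momentum (0.9) and weight decay (5e-4) come from the text above; the learning rate of 0.01 and the function name `sgd_momentum_step` are assumptions for illustration, and real frameworks differ in small details of how they apply weight decay.

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One SGD step with momentum; weight decay is added to the gradient."""
    new_velocity = [
        momentum * v - lr * (g + weight_decay * wi)
        for wi, g, v in zip(w, grad, velocity)
    ]
    new_w = [wi + v for wi, v in zip(w, new_velocity)]
    return new_w, new_velocity


w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, grad=[0.5], velocity=v)
print(w, v)  # weights move against the gradient; velocity is kept for the next step
```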

Although VGGNet is deeper and has more parameters than AlexNet, the authors conjecture that it converges in fewer epochs for two reasons: first, the greater depth and smaller convolutions bring implicit regularization; second, some layers are pre-initialized.

**Parameter initialization** : For the shallow configuration A, parameters are randomly initialized: the weights w are sampled from N (0, 0.01) and the biases are initialized to 0. Then, for deeper networks, the first four convolutional layers and the three fully connected layers are initialized with the parameters of network A. However, it was later discovered that the deeper networks can also be initialized directly, without using pre-trained parameters.

To obtain a 224 * 224 input image, each rescaled image is randomly cropped in each SGD iteration. To augment the data set, the cropped image is also randomly flipped horizontally and its RGB colors are randomly shifted.



#### Summary of VGGNet improvement points
 
1. A smaller 3 * 3 convolution kernel and a deeper network are used. A stack of two 3 * 3 convolution kernels has the same receptive field as one 5 * 5 kernel, and a stack of three 3 * 3 kernels has the same receptive field as one 7 * 7 kernel. This gives fewer parameters (three stacked 3 * 3 layers have only (3 * 3 * 3) / (7 * 7) = 55% of the parameters of one 7 * 7 layer) and, at the same time, more non-linear transformations, which increases the ability of the CNN to learn features.

 
2. In one configuration of VGGNet, a 1 * 1 convolution kernel is introduced. Without changing the spatial dimensions of the input and output, it adds a non-linear transformation that increases the expressive power of the network, at a lower computational cost than a 3 * 3 kernel.

 
3. During training, first train a simple (low-level) VGGNet A-level network, and then use the weights of the A network to initialize the complex models that follow to speed up the convergence of training .


## Some basic questions

### **Q1: Why can three 3x3 convolutions replace one 7x7 convolution?**

***Answer 1***

1. Three 3x3 convolutions use three non-linear activation functions instead of one, which increases the non-linear expressive power and makes the decision function more discriminative.
2. They reduce the number of parameters. For layers with C input and C output channels, one 7x7 layer contains $7^2C^2 = 49C^2$ parameters, while three 3x3 layers contain only $3 \cdot 3^2C^2 = 27C^2$.
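Both parts of this answer can be verified with simple arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel count of 64 and the helper names are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights of one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel * kernel * channels * channels


print(stacked_receptive_field(2))  # 5, same as one 5x5 kernel
print(stacked_receptive_field(3))  # 7, same as one 7x7 kernel

c = 64
three_3x3 = 3 * conv_weights(3, c)
one_7x7 = conv_weights(7, c)
print(three_3x3, one_7x7, round(three_3x3 / one_7x7, 2))  # 110592 200704 0.55
```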


### **Q2: The role of 1x1 convolution kernel**

***Answer 2***
1. Followed by an activation function, it adds an extra non-linearity without enlarging the receptive field, unlike 3 * 3, 5 * 5 or larger filters. This non-linearity helps to extract prominent, distinguishable features.

2. It can reduce the number of output channels (depth) without changing the spatial dimensions of the feature map, thereby reducing the number of computations (FLOPs) in the layers that follow.
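A 1x1 convolution is just a small matrix multiplication applied independently at every pixel. The sketch below, in plain Python with nested lists in height x width x channels layout, reduces 3 input channels to 2 while keeping the 2x2 spatial size. The values, the omitted bias and the name `conv1x1_relu` are assumptions for illustration.

```python
def conv1x1_relu(feature_map, weights):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in matrix (one row per output channel)
    """
    return [
        [
            [max(0, sum(w * v for w, v in zip(row, pixel))) for row in weights]
            for pixel in line
        ]
        for line in feature_map
    ]


feature_map = [[[1, 2, 3], [4, 5, 6]],
               [[7, 8, 9], [1, 0, -1]]]   # 2 x 2 x 3
weights = [[1, 0, -1],
           [1, 1, 1]]                     # 3 channels in, 2 channels out

out = conv1x1_relu(feature_map, weights)
print(out)  # [[[0, 6], [0, 15]], [[0, 24], [2, 0]]]
```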


![one%20one.gif](attachment:one%20one.gif)

### **Q3: The effect of network depth on results (in the same year, Google also independently released GoogLeNet, a network with a depth of 22 layers)**

***Answer 3***

* Both VGG and GoogLeNet are deep models.
* Both rely on small convolutions.
* VGG uses only 3x3 kernels, while GoogLeNet uses 1x1, 3x3 and 5x5 kernels, so its model is more complicated (GoogLeNet starts with a large convolution kernel to reduce the computation of the subsequent layers).


# Inception Net and Google Net
### Source of InceptionNet and GoogLeNet :
https://arxiv.org/pdf/1409.4842.pdf

### Source of InceptionNet v2 and v3 :
https://arxiv.org/pdf/1512.00567.pdf

### Source of InceptionNet v4 :
https://arxiv.org/pdf/1602.07261.pdf


Also known as GoogLeNet, it is a 22-layer network that won the ILSVRC 2014 classification challenge.


Each inception module has four parallel channels (branches), and their outputs are concatenated at the end of the module.

1x1 conv is mainly used to reduce dimensions, in order to avoid computational bottlenecks.
The network also attaches auxiliary softmax losses to some intermediate layers to mitigate the vanishing gradient problem.

**Four parallel channels:**

* 1x1 conv: Borrowed from [ Network in Network ], it can reduce or increase the channel dimension of the input feature map without much loss of the input's spatial information;
* 1x1 conv followed by 3x3 conv: the 3x3 conv increases the receptive field of the feature map, and the 1x1 conv changes the dimension;
* 1x1 conv followed by 5x5 conv: the 5x5 conv further increases the receptive field of the feature map, and the 1x1 conv changes the dimension;
* 3x3 max pooling followed by 1x1 conv: the authors argue that although the pooling layer loses spatial information, it has been applied effectively in many fields, so a parallel pooling channel is added, and a 1x1 conv changes its output dimension.
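The effect of the four channels can be sketched with simple arithmetic. The channel counts below are those of the first inception module (3a) in the GoogLeNet paper, with a 28x28x192 input; they are not given in the text above, so treat them as an assumption, and the helper names are illustrative. The second part estimates how much the 1x1 reduction saves on the 5x5 branch.

```python
def concat_channels(branch_channels):
    """Depth of the module output: branches are concatenated along channels."""
    return sum(branch_channels.values())


def conv_mult_adds(height, width, kernel, in_ch, out_ch):
    """Multiply-adds of a stride-1 'same' convolution (bias ignored)."""
    return height * width * kernel * kernel * in_ch * out_ch


# Output channels of each branch in inception (3a), per the GoogLeNet paper
inception_3a = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}
print(concat_channels(inception_3a))  # 256

# 5x5 branch: direct 192 -> 32, versus 1x1 reduction 192 -> 16, then 5x5 16 -> 32
direct = conv_mult_adds(28, 28, 5, 192, 32)
reduced = conv_mult_adds(28, 28, 1, 192, 16) + conv_mult_adds(28, 28, 5, 16, 32)
print(direct, reduced, round(direct / reduced, 1))
```

Under these assumptions the 1x1 reduction cuts the cost of the 5x5 branch by roughly a factor of ten.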

![ILSVRC%20winners.PNG](attachment:ILSVRC%20winners.PNG)




![inception_module%20output.png](attachment:inception_module%20output.png)

![Inception%20Output1.png](attachment:Inception%20Output1.png)

#### Complete network design : - 
![InceptionNEt%20Architecture.png](attachment:InceptionNEt%20Architecture.png)





![Error%20Rate.png](attachment:Error%20Rate.png)



The details of the GoogLeNet network layers are shown in the following table:

![TAble1.PNG](attachment:TAble1.PNG)

# Inception Net V2 and V3

![Inceptio%20V3%20Block.png](attachment:Inceptio%20V3%20Block.png)

![Inception%20V3%202.PNG](attachment:Inception%20V3%202.PNG)

### Inception-v4-2016
After ResNet appeared, its residual structure was combined with the Inception architecture.

The resulting Inception-ResNet variants are based on Inception-v3 and add the skip connection structure from ResNet. Finally, an ensemble of 3 residual networks and 1 Inception-v4 reached a top-5 error of 3.08% in CLS (ImageNet classification).

Residual connections work well when training very deep networks. Because the Inception network architecture can be very deep, it is reasonable to replace the filter concatenation stage with residual connections.

Compared with v3, Inception-v4 has a more uniform, simplified structure and more inception modules.

![Inception%20V4%203.png](attachment:Inception%20V4%203.png)




## Local Response Normalization always happens across Depth (LRN - D)
![Local%20Response%20Normalization.png](attachment:Local%20Response%20Normalization.png)
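Normalizing "across depth" means that each activation is divided by a term computed from the same spatial position in neighbouring channels. The sketch below assumes the formula and hyperparameters from the AlexNet paper (k = 2, n = 5, alpha = 1e-4, beta = 0.75), which is presumably what the figure above illustrates; `local_response_norm` is an illustrative name.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """LRN at one spatial position.

    a: list of activations, one per channel, at the same (x, y) location.
    Each value is divided by (k + alpha * sum of squares over n
    neighbouring channels) ** beta.
    """
    num_ch = len(a)
    out = []
    for i in range(num_ch):
        lo = max(0, i - n // 2)
        hi = min(num_ch - 1, i + n // 2)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out


activations = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(local_response_norm(activations))
```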

## Batch Normalization normalizes each channel across the batch (BN)
![batchnorm.png](attachment:batchnorm.png)

### Different Types of Normalizations
![Different%20Normalizations.png](attachment:Different%20Normalizations.png)

# Dropout:
![2.gif](attachment:2.gif)

# ResNet :
### Recap : LeNet(5) , AlexNet(8) , ZFNet , VGG(16 and 19) , Inception or GoogLeNet(22)

### Across the above architectures, depth keeps increasing, and the number of parameters generally increases too

### What is the reason behind increasing the depth?
> ### To extract more and more features, since a deeper model has more learnable parameters

### Will there be any problem if we keep increasing the depth?
> ### Deeper neural networks are more difficult to train
> ### Number of Parameters increases
> ### Vanishing Gradient Problem may occur

### Up to a certain depth this is fine, but beyond that the error increases as the depth increases

### What should we do if we have images with a lot of features?
> ### There is a trade-off between feature extraction (depth of the model) and error

### Should I compromise on the number of features extracted (depth of the model) or on the error?
> ### If we compromise on the number of features extracted, the model won't classify correctly, so the error increases and the model won't be considered the best model. If we compromise on the error, the model won't be considered the best model either


### Ques : What if we go on increasing the Number of Layers 
> ## Degradation Problem

> Observe after 20-Layer in the graph
![Increasing%20Layers.PNG](attachment:Increasing%20Layers.PNG)
> If we increase the depth beyond a certain limit, the error increases


![4.PNG](attachment:4.PNG)

![Role%20of%20Residual%20Block.PNG](attachment:Role%20of%20Residual%20Block.PNG)
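The role of the residual block can be sketched on plain vectors: the block outputs `relu(F(x) + x)`, so if the learned part `F` is zero the block simply passes its input through. The two-layer form of `F`, the use of fully connected weights instead of convolutions, and all names here are simplifying assumptions for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x) with F(x) = w2 * relu(w1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With F(x) = 0 the block reduces to the identity mapping (for non-negative x),
# so stacking more such blocks should not make the network worse.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
```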

![5.png](attachment:5.png)

![Comparision%20of%20VGG%20plain%20and%20ResNet.PNG](attachment:Comparision%20of%20VGG%20plain%20and%20ResNet.PNG)

#### Note : FLOPs 
>Floating Point Operations [additions and Multiplications]

> VGG-19 : 19.6 billion FLOPs

> 34 Layer plain Network : 3.6 billion FLOPs

> 34 Layer residual Network : 3.6 billion FLOPs

### With Residual Blocks [Comparison between 18 and 34 Layer]
![Resnet%20graphs.PNG](attachment:Resnet%20graphs.PNG)

This is equivalent to reducing the number of parameters for the same number of layers, so the design can be extended to deeper models. The authors therefore proposed ResNets with 50, 101 and 152 layers, which not only had no degradation problem but also had a much lower error rate, while the computational complexity stayed at a very low level.
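The parameter saving can be illustrated with a quick count. This sketch assumes the bottleneck design from the ResNet paper (a 1x1 convolution reducing 256 channels to 64, a 3x3 convolution at 64 channels, and a 1x1 convolution restoring 256 channels) and compares it with two plain 3x3 convolutions at 256 channels; biases and normalization parameters are ignored, and `conv_weights` is an illustrative helper.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Number of weights in a conv layer (bias ignored)."""
    return kernel * kernel * in_ch * out_ch


bottleneck = (
    conv_weights(1, 256, 64)    # 1x1: reduce 256 -> 64 channels
    + conv_weights(3, 64, 64)   # 3x3 at the reduced width
    + conv_weights(1, 64, 256)  # 1x1: restore 64 -> 256 channels
)
plain = 2 * conv_weights(3, 256, 256)  # two 3x3 layers at full width

print(bottleneck)  # 69632
print(plain)       # 1179648
```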

At this point the error rate of ResNet was already far ahead of other networks, but the authors did not stop there and built an even more extreme 1202-layer network. Such a deep network is still not difficult to optimize, but it shows overfitting, which is to be expected. The authors also said that the 1202-layer model would be further improved in the future.

**Different Variants** : -

![Table1.PNG](attachment:Table1.PNG)

#### Importance of  1 * 1 Filter

>> By adding a 1 * 1 filter, non-linearity increases so that we can extract more features

>> The spatial dimension of the output is the same.


![6.PNG](attachment:6.PNG)

![7.PNG](attachment:7.PNG)

### Problem 1 : Vanishing or exploding gradients
### Sol : Use normalization after every residual layer

### Problem 2 : To learn more distinguishable features we increase the depth, but the error also increases
### Sol : Residual blocks reduce the error

### Problem 3 : Reducing the number of FLOPs
### Sol : Use 1*1 convs in residual blocks


### Problem 4 : Reducing the number of learnable parameters
### Sol : Use bottleneck residual blocks that combine 1 * 1 and 3 * 3 convolutions

### Problem 5 : Can we keep increasing the number of layers, to 200, 300, 1200 or 10000, in order to reduce the error?
### Ans : No


# Object Detection

# RCNN , Fast-RCNN , Faster-RCNN : Object Detection
![Difference%20between.jpg](attachment:Difference%20between.jpg)

![Difference%20between1.png](attachment:Difference%20between1.png)


### Selective Search for Object Recognition Paper :
http://huppelen.nl/publications/selectiveSearchDraft.pdf


### Source of the Faster RCNN Paper :
https://arxiv.org/pdf/1506.01497.pdf


* ### Faster R-CNN solves the problem that Fast RCNN depends on a third-party tool, selective search, to extract region proposals.
* ### It uses an RPN (Region Proposal Network) instead of selective search, which turns the entire object detection pipeline into a unified network and makes the calculation of region proposals more elegant and efficient.
* ### The RPN is a fully convolutional network. Candidate region generation and object detection share convolutional features. This acts like an attention mechanism: the RPN tells the network where to focus.


![fast2.png](attachment:fast2.png)

![fast19.png](attachment:fast19.png)

### Image as input to CNN
![fast21.PNG](attachment:fast21.PNG)

![fast22.PNG](attachment:fast22.PNG)



### Structure


![fast1.png](attachment:fast1.png)

### 1. Conv layers: Mainly composed of basic conv + relu + pooling layers, which extract the feature map from the image. The feature map is shared by the later RPN layers and fully connected layers.
### 2. Region proposal network (RPN): Mainly used to generate region proposals. It uses softmax to classify each candidate box (anchor) as positive or negative, and uses bbox regression to correct the candidate boxes and obtain the proposals.
### 3. RoI pooling: Collects the feature maps and proposals, extracts the proposal feature maps, and sends them to the subsequent fully connected layers to determine the target category.
### 4. Classification: Uses the proposal feature maps to calculate the proposal category, and applies bbox regression again to obtain a more accurate position.

# VGG16-fasterrcnn is shown in Figure 2. The algorithm steps of the model are:
* ### (1). Resize a P × Q image of any size to M × N, and then send it to the network.
* ### (2). Use the VGG16 network to extract the features of the image: the feature map.
* ### (3). The RPN layer applies a 3 × 3 convolution, then generates the positive anchors and bbox regression offsets, and calculates the proposals.
* ### (4). The RoI layer uses the proposals to extract the proposal features from the feature map and sends them to the subsequent fully connected and softmax layers for bbox_pred and classification.

### RPN (Region Proposal network)
![8.PNG](attachment:8.PNG)

### Anchors

Anchors are primarily used to represent the positions of candidate boxes, given as (x1, y1, x2, y2): the coordinates of the upper-left and lower-right corners. There are three aspect ratios: {1:1, 2:1, 1:2}. As shown in Figure 6, anchors effectively introduce the commonly used multi-scale approach, so they can cover essentially all scales and shapes.


![fast5.webp](attachment:fast5.webp)



### In effect, the RPN places many candidate boxes (anchors) on the original image, then uses a CNN to determine which anchors are positive (contain a target) and which are negative (contain no target), so it is only a two-class classification.

### So how many anchors are there? 
* ### Assuming the original image is 800 × 600, VGG down-samples it by a factor of 16, and 9 anchors are set for each point of the feature map, so:

* ### ceil (800/16) × ceil (600/16) × 9 = 50 × 38 × 9 = 17100
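The count above can be reproduced in a few lines, together with one way of generating the 9 anchor shapes at each location. The three aspect ratios come from the text; the three scales (128, 256, 512 pixels) are those used in the Faster R-CNN paper and are an assumption here, as are the function names.

```python
import math


def num_anchors(width, height, stride=16, anchors_per_location=9):
    """Total anchors for an image whose feature map is down-sampled by `stride`."""
    return math.ceil(width / stride) * math.ceil(height / stride) * anchors_per_location


def base_anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """(width, height) pairs with area scale**2 and height/width equal to ratio."""
    shapes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes


print(num_anchors(800, 600))       # 17100
print(len(base_anchor_shapes()))   # 9
```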




![fast7.png](attachment:fast7.png)

### RoI pooling

RoI pooling is responsible for collecting the proposals, calculating the proposal feature maps, and sending them to the subsequent network. From Figure 2 we can see that RoI pooling has two inputs:
1. Original feature maps from VGG 16
2. Proposal boxes output by RPN (not the same size !!!)

### Why do RoI pooling
For a traditional CNN (VGG, ResNet), once the network is trained the input image size must be a fixed value, and the network output is also a fixed-size vector. If the input images do not all have the same dimensions, this becomes troublesome. There are two ways to handle it:

1. Crop part from the image and transfer it to the network 

2. Warp the image to the required size


### Whichever method is used, either cropping destroys the complete structure of the image or warping destroys its original shape information. RoI pooling solves the problem of handling inputs of different sizes.

### RoI pooling principle

RoI works as follows:

1. Since each proposal is expressed at the M * N image scale, it is first mapped back to the scale of the feature map (1/16) using the spatial_scale parameter.
2. The feature map area corresponding to each proposal is divided into a fixed-size grid.
3. Max pooling is performed on each cell of the grid.

After this processing, proposals of different sizes all produce outputs of the same fixed size. Figure 15 shows the implementation of fixed-length output.
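The procedure above can be sketched in plain Python for a single-channel feature map. The 2x2 output grid, the rounding of coordinates and the bin boundaries are assumptions loosely modelled on common implementations, and `roi_max_pool` is an illustrative name rather than a library function.

```python
import math


def roi_max_pool(feature_map, roi, spatial_scale=1 / 16, pooled_h=2, pooled_w=2):
    """Max-pool one RoI (x1, y1, x2, y2 in image pixels) to a fixed-size grid."""
    height, width = len(feature_map), len(feature_map[0])
    # Step 1: map the proposal from image scale to feature-map scale (quantized).
    x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    out = []
    for ph in range(pooled_h):
        # Step 2: split the region into a pooled_h x pooled_w grid of bins.
        h0 = y1 + math.floor(ph * roi_h / pooled_h)
        h1 = min(y1 + math.ceil((ph + 1) * roi_h / pooled_h), height)
        row = []
        for pw in range(pooled_w):
            w0 = x1 + math.floor(pw * roi_w / pooled_w)
            w1 = min(x1 + math.ceil((pw + 1) * roi_w / pooled_w), width)
            # Step 3: max pooling inside each bin.
            row.append(max(feature_map[i][j]
                           for i in range(h0, h1) for j in range(w0, w1)))
        out.append(row)
    return out


feature_map = [[0, 1, 2, 3],
               [4, 5, 6, 7],
               [8, 9, 10, 11],
               [12, 13, 14, 15]]

print(roi_max_pool(feature_map, (0, 0, 48, 48)))    # [[5, 7], [13, 15]]
print(roi_max_pool(feature_map, (16, 16, 48, 32)))  # [[6, 7], [10, 11]]
```

The two proposals cover regions of different sizes, yet both produce a 2x2 output.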

![fast11.webp](attachment:fast11.webp)

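The loss of the entire network is as follows:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

In the above formula, $i$ is the anchor index, $p_i$ is the predicted softmax probability that anchor $i$ is positive, and $p_i^*$ is the corresponding ground-truth (GT) label: when the IoU between the i-th anchor and a GT box is greater than 0.7, the anchor is considered positive and $p_i^* = 1$; when the IoU is below 0.3, it is considered negative and $p_i^* = 0$; anchors with an IoU between 0.3 and 0.7 do not participate in training. $t_i$ represents the predicted bounding box, and $t_i^*$ represents the corresponding GT box. As you can see, the loss is divided into two parts:

1. cls loss: the softmax loss, used to train the network that classifies anchors as positive or negative.
2. reg loss: the smooth L1 loss calculated by the rpn_loss_bbox layer, used to train the bbox regression network. It is multiplied by $p_i^*$ because only the positive anchors matter for regression, not the negative ones.

Because $N_{cls}$ and $N_{reg}$ differ greatly in magnitude, the parameter $\lambda$ is used to balance the two terms (the Faster R-CNN paper uses $N_{cls} = 256$, $N_{reg} \approx 2400$ and $\lambda = 10$). Here, smooth L1 loss is used, and the calculation formula is as follows: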


![fast16.webp](attachment:fast16.webp)
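The two ingredients of the RPN loss can be sketched as small functions. The smooth L1 form below is the standard one with a transition at |x| = 1 (the figure may show a variant with an extra scale parameter), and the labelling rule is simplified to the IoU thresholds of 0.7 and 0.3 only. Both function names are illustrative.

```python
def smooth_l1(x):
    """Quadratic near zero, linear elsewhere; less sensitive to outliers than L2."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def label_anchor(iou, pos_thresh=0.7, neg_thresh=0.3):
    """1 = positive, 0 = negative, -1 = ignored during training."""
    if iou > pos_thresh:
        return 1
    if iou < neg_thresh:
        return 0
    return -1


print(smooth_l1(0.5), smooth_l1(2.0), smooth_l1(-2.0))           # 0.125 1.5 1.5
print(label_anchor(0.8), label_anchor(0.1), label_anchor(0.5))   # 1 0 -1
```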

# YOLO: Object Detection
https://docs.google.com/presentation/d/1aeRvtKG21KHdD5lg6Hgyhx5rPq_ZOsGjG5rJ1HP7BbA/pub?start=false&loop=false&delayms=3000&slide=id.g137784ab86_4_427
![10.PNG](attachment:10.PNG)

# SSD: Single Shot MultiBox Detector : Object Detection

SSD has the following main features:
 
1. It inherits from YOLO the idea of converting detection into regression, and completes network training in a single stage.
2. Based on the anchors in Faster RCNN, it proposes similar prior boxes.
3. It adds a detection method based on the Pyramidal Feature Hierarchy, which is roughly half of the FPN idea (used in Mask RCNN).
 
### Source of the SSD paper :
https://arxiv.org/pdf/1512.02325.pdf
 

 ![ssd.png](attachment:ssd.png)
 
The SSD algorithm proposed in this paper is a multi-target detection algorithm that directly predicts the target category and bounding box. Compared with Faster RCNN, this algorithm has no proposal generation step, which greatly improves the detection speed. To detect targets of different sizes, the traditional approach is to first convert the image into different sizes (an image pyramid), then run detection on each, and finally combine the results.

The SSD algorithm uses feature maps of  different convolutional layers  to achieve the same effect. The main network structure of the algorithm is VGG16. The last two fully connected layers are changed to convolutional layers, and then four convolutional layers are added to construct the network structure.


The outputs (feature maps) of six different layers are each convolved with two different 3 × 3 convolution kernels. One outputs the classification confidence: each default box produces 21 class confidences (20 PASCAL VOC classes plus background). The other outputs the localization for regression: each default box produces 4 coordinate values (x, y, w, h). In addition, these feature maps also generate prior boxes (their coordinates) through the PriorBox layer. The total number of default boxes over these feature maps is 8732. Finally, the three results (confidences, localizations and prior boxes) are combined and passed to the loss layer.
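The figure of 8732 default boxes can be reproduced with simple arithmetic. The feature map sizes and the number of boxes per location below are those of the SSD300 configuration in the SSD paper; they are not stated in the text above, so treat them as an assumption, and the names are illustrative.

```python
# (feature map side length, default boxes per location), SSD300 per the paper
SSD300_FEATURE_MAPS = [
    (38, 4),
    (19, 6),
    (10, 6),
    (5, 6),
    (3, 4),
    (1, 4),
]


def count_default_boxes(feature_maps):
    """Every location of every feature map contributes its own default boxes."""
    return sum(side * side * boxes for side, boxes in feature_maps)


total = count_default_boxes(SSD300_FEATURE_MAPS)
print(total)  # 8732

# Each default box predicts 21 class confidences and 4 box offsets.
outputs_per_box = 21 + 4
print(total * outputs_per_box)  # 218300
```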


### SSD network structure


![ssd1.png](attachment:ssd1.png)

Figure 2 above shows the structure of the SSD 300 network in the original paper. YOLO follows its convolution layers with a fully connected layer, so, like Faster RCNN, it uses only the highest-level feature maps for detection. SSD instead uses a feature pyramid structure for detection: feature maps of different sizes (conv4-3, conv-7 (FC7), conv6-2, conv7-2, conv8_2 and conv9_2) are all used, and softmax classification and position regression are performed on multiple feature maps at the same time, as shown in Figure 3.

![ssd3.png](attachment:ssd3.png)




### Analysis of the advantages and disadvantages of SSD network structure

The advantages of the SSD algorithm should be obvious: the running speed is comparable to YOLO, and the detection accuracy is comparable to Faster RCNN. There are also some minor advantages that are not covered here. The disadvantages are:

* The min_size, max_size and aspect_ratio values of the prior boxes must be set manually. The basic size and shape of the prior boxes cannot be learned directly by the network. The size and shape of the prior boxes also differ from one feature layer to the next, so tuning depends heavily on experience.

* Although the idea of the pyramid feature hierarchy is adopted, the recall of small targets is still mediocre and does not clearly surpass Faster RCNN. The author believes that this is because SSD uses the low-level conv4_3 features to detect small targets, and these features have passed through only a few convolutional layers, so feature extraction is insufficient.


# Segmentation
# Mask RCNN : Segmentation


### Source of Fully Convolutional Networks for Semantic Segmentation paper:
https://arxiv.org/pdf/1411.4038.pdf

### Source of Mask R-CNN paper:
https://arxiv.org/pdf/1703.06870.pdf


![mask.png](attachment:mask.png)


* ### Semantic segmentation: classifies an image pixel by pixel.

* ### Instance segmentation: detects objects in an image and segments each detected object.

* ### Panoptic segmentation: describes all objects in the image.


The following picture shows the difference between these segmentation tasks. As can be seen in the figure, panoptic segmentation is the most difficult:


![mask1.png](attachment:mask1.png)


* Instance segmentation must not only find the objects in the image correctly, but also segment them accurately. So instance segmentation can be seen as a combination of object detection and semantic segmentation.

* Mask RCNN is an extension of Faster RCNN. For each proposal box of Faster RCNN, an FCN is used for semantic segmentation. The segmentation task is performed simultaneously with the localization and classification tasks.

* It introduces RoI Align in place of the RoI Pooling used in Faster RCNN. Because RoI Pooling is not pixel-to-pixel aligned, it may not have a great impact on the bbox, but it has a great impact on the accuracy of the mask. Using RoI Align improves mask accuracy by a relative 10% to 50%, as explained in Section 3.

* A semantic segmentation branch is introduced to decouple mask prediction from class prediction. The mask branch only performs semantic segmentation, and the task of class prediction is assigned to another branch. This is different from the original FCN network: when the original FCN predicts the mask, it also predicts the class to which the mask belongs.

* Without using fancy methods, Mask RCNN surpassed all state-of-the-art models of the time.

* Trained on an 8-GPU server for two days.

#### Mask R-CNN algorithm steps

* First, input the image you want to process and perform the corresponding pre-processing, or start from an already pre-processed image.

* Then, input it into a pre-trained neural network (ResNeXt, etc.) to obtain the corresponding feature map.

* Next, a predetermined number of ROIs are set for each point in this feature map to obtain multiple candidate ROIs.

* Then, these candidate ROIs are sent to the RPN network for binary classification (foreground or background) and BB regression to filter out some candidate ROIs.

* Next, perform the ROIAlign operation on the remaining ROIs (that is, first align the pixels of the original image with the feature map, and then align the feature map with the fixed-size feature).

* Finally, these ROIs go through N-class classification, BB regression, and MASK generation (an FCN is applied within each ROI).

#### Mask R-CNN architecture decomposition

Here, I decompose Mask R-CNN into the following three modules:-

1. **Faster-Rcnn** 

2. **ROIAlign**

3. **FCN.** 


These three modules are the core of the algorithm.




### FCN

The FCN algorithm is a classic semantic segmentation algorithm that can accurately segment objects in a picture. The overall architecture is shown in the figure below. It is an end-to-end network whose main operations are convolution and deconvolution: the image is first convolved and pooled, which reduces the size of the feature map; then deconvolution (an upsampling, interpolation-like operation) is applied to progressively enlarge the feature map; finally each pixel is classified. In this way, accurate segmentation of the input image is achieved.


![mask10.png](attachment:mask10.png)



##  ROIPooling and ROIAlign


**The biggest difference between ROI Pooling and ROIAlign is that the former uses two quantization operations, while the latter uses no quantization and instead uses bilinear interpolation.**
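The difference can be illustrated by sampling a feature map at a fractional location. The sketch below contrasts a quantized lookup (the coordinate is snapped to a grid cell, here by flooring, as a simplification of what RoI Pooling does) with bilinear interpolation as used by RoIAlign. The function names and values are illustrative, and real RoIAlign additionally averages or max-pools several sampled points per bin.

```python
import math


def quantized_sample(fm, y, x):
    """Snap the fractional location to a grid cell and read that value."""
    return fm[int(math.floor(y))][int(math.floor(x))]


def bilinear_sample(fm, y, x):
    """Interpolate between the four neighbouring cells; no rounding of (y, x).

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    h, w = len(fm), len(fm[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fm[y0][x0] * (1 - dx) + fm[y0][x1] * dx
    bottom = fm[y1][x0] * (1 - dx) + fm[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fm = [[0.0, 1.0],
      [2.0, 3.0]]

print(quantized_sample(fm, 0.5, 0.5))  # 0.0
print(bilinear_sample(fm, 0.5, 0.5))   # 1.5
```

The quantized lookup discards the half-pixel offset entirely, which is the misalignment that matters for pixel-accurate masks.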


![mask12.png](attachment:mask12.png)


![mask13.png](attachment:mask13.png)

### Mask R-CNN Network

Mask R-CNN basic structure: It uses the same two-stage procedure as Faster RCNN: first the RPN finds the RoIs, then each RoI found by the RPN is classified, localized, and given a binary mask. This is different from other networks that first find the mask and then classify it.

Mask R-CNN's loss function :

![mask2.png](attachment:mask2.png)

Mask Representation: Because there is no fully connected layer and RoIAlign is used, one-to-one correspondence between output and input pixels can be achieved.

# Detectron
![11.PNG](attachment:11.PNG)