# <div align="center">Mask R-CNN</div>
---------------------------------------------------------------------

You can find me on GitHub:
> ###### [GitHub](https://github.com/lev1khachatryan)

This is a conceptually simple, flexible, and general
framework for object ***instance segmentation***. Our approach
efficiently detects objects in an image while simultaneously
generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster
R-CNN by adding a branch for predicting an object mask in
parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small
overhead to Faster R-CNN, running at 5 fps. Moreover,
Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework.

# <div align="center">1. Introduction</div>
---------------------------------------------------------------------

The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven
by powerful baseline systems, such as the Fast/Faster R-CNN and Fully Convolutional Network (FCN)
frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive
and offer flexibility and robustness, together with fast training and inference time.

Instance segmentation is challenging because it requires
the correct detection of all objects in an image while also
precisely segmenting each instance. It therefore combines
elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into
a fixed set of categories without differentiating object instances.

As mentioned above, Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks
on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied
to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and
train given the Faster R-CNN framework, which facilitates
a wide range of flexible architecture designs. Additionally,
the mask branch only adds a small computational overhead,
enabling a fast system and rapid experimentation.

<img src='asset/7_7/1.png'>
<div align="center">Figure 1. The Mask R-CNN framework for instance segmentation.</div>

In principle, ***Mask R-CNN is an intuitive extension of
Faster R-CNN***, yet constructing the mask branch properly
is critical for good results. First, and most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in
how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization
for feature extraction. To fix the misalignment, Mask R-CNN proposes a simple, quantization-free layer called ***RoIAlign*** that
faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it
improves mask accuracy by a relative 10% to 50%, showing
bigger gains under stricter localization metrics. Second, the model predicts a binary mask for each class independently, without
competition among classes, and relies on the network’s RoI
classification branch to predict the category. In contrast,
FCNs usually perform per-pixel multi-class categorization,
which couples segmentation and classification and, based
on our experiments, works poorly for instance segmentation.

# <div align="center">2. Mask R-CNN</div>
---------------------------------------------------------------------

Faster R-CNN has two outputs for each candidate object, a class label and a
bounding-box offset; to these, Mask R-CNN adds a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from
the class and box outputs, requiring extraction of a much finer
spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment,
which is the main missing piece of Fast/Faster R-CNN.

Mask R-CNN adopts the same two-stage
procedure as Faster R-CNN, with an identical first stage (the Region Proposal Network, RPN). In
the second stage, in parallel to predicting the class and box
offset, Mask R-CNN also outputs a binary mask for each
RoI. This is in contrast to most recent systems, where classification depends on mask predictions.

Formally, during training, we define a multi-task loss on each sampled RoI as $L = L_{cls} + L_{box} + L_{mask}$. The classification loss $L_{cls}$ and bounding-box loss $L_{box}$ are identical to those defined in Faster R-CNN. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary
masks of resolution $m \times m$, one for each of the $K$ classes.
To this we apply a per-pixel sigmoid, and define $L_{mask}$ as
the average binary cross-entropy loss. For an RoI associated
with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th
mask (other mask outputs do not contribute to the loss).
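
To make the definition of $L_{mask}$ concrete, here is a minimal NumPy sketch for a single RoI. The function name `mask_loss`, the array shapes, and the clipping constant `eps` are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average binary cross-entropy on the ground-truth class mask only.

    mask_logits: array of shape (K, m, m), raw mask-branch outputs for one RoI
    gt_mask:     array of shape (m, m) with values in {0, 1}
    gt_class:    integer index k of the ground-truth class
    """
    # Only the k-th mask contributes; the other K - 1 masks are ignored.
    probs = sigmoid(mask_logits[gt_class])
    eps = 1e-7  # assumed constant, keeps log() finite
    probs = np.clip(probs, eps, 1.0 - eps)
    bce = -(gt_mask * np.log(probs) + (1.0 - gt_mask) * np.log(1.0 - probs))
    return bce.mean()

# Toy example with K = 3 classes and a 4 x 4 mask (hypothetical sizes).
rng = np.random.default_rng(0)
K, m = 3, 4
logits = rng.normal(size=(K, m, m))
gt = (rng.random((m, m)) > 0.5).astype(np.float64)
loss = mask_loss(logits, gt, gt_class=1)
```

Because the loss indexes only `mask_logits[gt_class]`, changing the logits of any other class leaves the value unchanged, which is exactly the "no competition among classes" property.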

The definition of $L_{mask}$ allows the network to generate
masks for every class without competition among classes;
we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples
mask and class prediction.
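
At inference time this decoupling could look like the sketch below: the classification branch picks the class, and only that class's mask is kept. The function name `select_mask` and the 0.5 binarization threshold are assumptions for illustration and are not stated in the text above.

```python
import numpy as np

def select_mask(class_scores, mask_logits, threshold=0.5):
    """Pick the mask that belongs to the class chosen by the classifier.

    class_scores: array of shape (K,), scores from the classification branch
    mask_logits:  array of shape (K, m, m), raw mask-branch outputs
    threshold:    assumed cut-off used to binarize the sigmoid output
    """
    k = int(np.argmax(class_scores))                  # class branch decides
    probs = 1.0 / (1.0 + np.exp(-mask_logits[k]))     # per-pixel sigmoid
    return k, probs >= threshold

# Hypothetical outputs for one RoI with K = 3 classes and a 2 x 2 mask.
class_scores = np.array([0.1, 0.7, 0.2])
mask_logits = np.array([
    [[5.0, 5.0], [5.0, 5.0]],      # class 0 mask, ignored
    [[2.0, -2.0], [-1.0, 3.0]],    # class 1 mask, selected
    [[-5.0, -5.0], [-5.0, -5.0]],  # class 2 mask, ignored
])
k, binary_mask = select_mask(class_scores, mask_logits)
```

The mask branch never has to decide which class is present; it only answers "which pixels belong to the object, assuming it is of class k".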

## <div align="center">2.1 Mask Representation: CNN</div>
---------------------------------------------------------------------

A mask encodes an input object’s
spatial layout. Thus, unlike class labels or box offsets
that are inevitably collapsed into short output vectors by
fully-connected (fc) layers, extracting the spatial structure
of masks can be addressed naturally by the pixel-to-pixel
correspondence provided by convolutions.

Specifically, we predict an $m \times m$ mask from each RoI
using an FCN. This allows each layer in the mask
branch to maintain the explicit $m \times m$ object spatial layout without collapsing it into a vector representation that
lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction, our fully
convolutional representation requires fewer parameters and
is more accurate, as demonstrated by experiments.
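
The parameter saving can be illustrated with some back-of-the-envelope arithmetic. All sizes below (a 14 × 14 × 256 RoI feature, K = 80 classes, one 3 × 3 convolution followed by a 1 × 1 convolution) are assumptions chosen for illustration, not figures taken from the text above, and the fc layer is a hypothetical single-layer alternative.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out

# Assumed, illustrative sizes.
channels, size, num_classes = 256, 14, 80

# Fully convolutional head: 3x3 conv (256 -> 256), then 1x1 conv (256 -> K).
# The weights are shared across all spatial positions.
conv_total = conv_params(channels, channels, 3) + conv_params(channels, num_classes, 1)

# Hypothetical fc head mapping the flattened feature to K masks of 14 x 14.
fc_total = fc_params(size * size * channels, num_classes * size * size)

ratio = fc_total / conv_total
```

Under these assumptions the fc layer needs roughly three orders of magnitude more parameters, because it learns a separate weight for every input-output pixel pair instead of sharing one small kernel.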

# <div align="center">Summing it all up</div>
---------------------------------------------------------------------

Mask R-CNN extends Faster R-CNN to pixel-level image segmentation. The key point is to decouple the classification and the pixel-level mask prediction tasks. Building on the Faster R-CNN framework, it adds a third branch for predicting an object mask in parallel with the existing branches for classification and localization. The mask branch is a small fully convolutional network (FCN) applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.

<img src='asset/7_7/2.png'>

Because pixel-level segmentation requires much more fine-grained alignment than bounding boxes, Mask R-CNN replaces the RoI pooling layer with an improved version (the ***RoIAlign*** layer) so that each RoI is mapped more precisely to the corresponding region of the original image.

<img src='asset/7_7/3.png'>

The ***RoIAlign*** layer is designed to fix the location misalignment caused by quantization in RoI pooling. RoIAlign removes the harsh quantization, for example by using x/16 instead of [x/16] (where [·] denotes rounding and 16 is the feature-map stride), so that the extracted features are properly aligned with the input pixels. Bilinear interpolation is used to compute feature values at the resulting floating-point locations.
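
The interpolation step can be sketched as follows. This covers only sampling a single-channel feature map at one real-valued location; the function name `bilinear_sample`, the toy feature map, and the example point are assumptions, and details such as how many points are sampled per bin are left out.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Interpolate a 2-D feature map at a real-valued location (y, x)."""
    h, w = feature.shape
    # Keep the sampling point inside the feature map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Blend the four neighbouring cells by their distance to (y, x).
    top = (1.0 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1.0 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1.0 - dy) * top + dy * bottom

stride = 16
feature = np.arange(16, dtype=np.float64).reshape(4, 4)  # toy feature map
x_img, y_img = 36.0, 20.0                                # hypothetical image point

# RoIAlign-style: keep the fractional coordinate x / 16 and interpolate.
aligned_value = bilinear_sample(feature, y_img / stride, x_img / stride)

# RoIPool-style: round to the nearest cell and read it directly.
quantized_value = feature[round(y_img / stride), round(x_img / stride)]
```

In this toy map the two values differ, which is the kind of small spatial error that matters little for a class label but visibly degrades a pixel-accurate mask.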

<img src='asset/7_7/4.png'>

A region of interest is mapped from the original image onto the feature map without rounding its coordinates to integers.
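
As a small worked example of this mapping, the sketch below projects a box onto a stride-16 feature map with and without rounding. The helper `map_roi` and the box coordinates are hypothetical, and Python's `round` stands in for the rounding operation.

```python
def map_roi(box, stride=16, quantize=False):
    """Project an image-space box (x1, y1, x2, y2) onto the feature map."""
    scaled = [c / stride for c in box]
    if quantize:
        # RoIPool-style behaviour: snap every coordinate to an integer.
        scaled = [float(round(c)) for c in scaled]
    return scaled

stride = 16
box = (100.0, 60.0, 260.0, 196.0)  # hypothetical RoI in image pixels

aligned = map_roi(box, stride)                    # RoIAlign-style, fractional
quantized = map_roi(box, stride, quantize=True)   # RoIPool-style, rounded

# Misalignment introduced by rounding, expressed back in image pixels.
shift_px = [abs(a - q) * stride for a, q in zip(aligned, quantized)]
```

A quarter-cell rounding error on the feature map corresponds to a shift of several pixels in the original image, because every feature-map cell covers a 16 × 16 pixel patch.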

# <div align="center">Loss Function</div>
---------------------------------------------------------------------

<img src='asset/7_7/5.png'>

# <div align="center">Summary of Models in the R-CNN family</div>
---------------------------------------------------------------------

<img src='asset/7_7/6.png'>