Skip to content

Latest commit

 

History

History
41 lines (32 loc) · 2.67 KB

yolact.md

File metadata and controls

41 lines (32 loc) · 2.67 KB

December 2019

tl;dr: First real-time instance segmentation, by linearly combining activation maps and crop with bbox.

Overall impression

This is a well written paper with great idea, and really solid engineering work.

Most previous works (Mask RCNN) depends on feature localization (feature repooling) and repool the features to predict a fixed-resolution (14x14 or 28x28) mask. This is inherently sequential and hard to speed up.

Key ideas

  • YOLACT breaks instance segmentation into two parallel tasks:
      1. Generating a set of prototype masks --> using FCN which are good at producing spatially coherent masks
      1. Predicting per-instance mask coefficients --> using fc to producing semantic vectors
    • The assembly step is a simple linear combination realized by matrix multiplication
  • The prototype masks are independent of categories. It learns a distributed representation and each instance mask is a linear combination of the prototypes. Prototype masks are image dependent and do not depend on any specific instance.
  • The emerging behavior of prototypes is interesting! (Fig. 5): some is position sensitive, and some detect contours.
  • Advantages to Mask RCNN
    • Fast. The entire mask branch takes only ~5 ms!
    • Temporally stable. Single stage. Segment whole image then crop.
    • Better quality for larger object. No fixed size mask prediction
  • Disadvantages to Mask RCNN
    • Worse overall performance. --> mainly in detection quality. In high threshold, it is even better than Mask RCNN.
    • YOLACT may leads to leakage, if bbox is not accurate.

Technical details

  • Predict prototype masks in 1/8 scale P3 (finest scale in FPN) and upscale to 1/4.
  • Use ReLU for unbounded activations.
  • Predict c+4+k coefficients. tanh after k coefficients to enable subtraction.
  • Mask assembly: $M = \sigma(PC^T)$, P is hxwxk and C is nxk. n is number of instances. The masks are probabilistic, thus using a sigmoid.
  • ResNet is not exactly translation variant because of padding.
  • k = 32 is the best. Adding more prototypes usually adds duplicates, and makes learning coefficients harder.
  • Fast NMS performs NMS in parallel, allowing those otherwise be removed bbox to suppress lower bbox scores as well. This hurts 0.3 mAP.
  • The performance gap between YOLACt and mask RCNN is due to bbox prediction but not mask quality (same AP gap between mask and bbox performance.)
  • Auxiliary task of predicting semantic segmentation map on top of prototype masks. This auxiliary task is only enabled during training and thus do not have speed penalty.

Notes