# Fully Convolutional Networks for Semantic Segmentation

## Abstract  
- Convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state of the art in semantic segmentation.  
- Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output.  
- We adapt classification networks (AlexNet, VGG, GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task.  
- We then define a skip architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer.

## Introduction  
<img src = 'https://appsilondatascience.com/assets/uploads/2018/08/types.png' width=1000>  

- Convolutional networks are driving advances in recognition.  
- Convnets are improving not only image classification, but also object detection.  
- The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel.  
  
  
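- We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state of the art without further machinery. This is the first work to train FCNs end-to-end:  
    - 1. for pixelwise prediction  
    - 2. from supervised pre-training  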
      
      
- Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs.  
- Because the output must be the same size as the input, we use in-network upsampling to recover pixelwise predictions from the subsampled feature maps.  


- Semantic segmentation faces an inherent tension between semantics and locations: global information resolves what while local information resolves where.  
- We define a skip architecture that takes advantage of this feature spectrum by combining deep, coarse, semantic information with shallow, fine, appearance information.

## Receptive Fields and Loss  
- Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their **receptive fields**.  
<img src = "https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnUreZ%2Fbtq2SNdVdQh%2FewTMikTKKKpnSJG87YNkH0%2Fimg.png">  

#### Loss  
$$\mathscr{l}\left(\mathbf{x};\theta\right)=\sum_{ij}\mathscr{l}'\left(\mathbf{x}_{ij};\theta\right)$$  
- If the loss function is a sum over the spatial dimensions of the final layer, its gradient will be a sum over the gradients of each of its spatial components.  
- Thus SGD on $\mathscr{l}$ computed on whole images will be the same as SGD on $\mathscr{l}'$, taking all of the final layer receptive fields as a minibatch.  
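
The equivalence above can be checked numerically. The following is a minimal NumPy sketch, assuming a per-pixel softmax cross-entropy as $\mathscr{l}'$ and a single shared weight matrix `W` as $\theta$; the function names and the toy shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    # Softmax over the class axis (axis 0), shifted for numerical stability.
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def pixel_loss_and_grad(W, x_ij, y_ij):
    """Cross-entropy l' for one pixel and its gradient with respect to W."""
    p = softmax(W @ x_ij)                      # class probabilities, shape (C,)
    onehot = np.zeros_like(p)
    onehot[y_ij] = 1.0
    return -np.log(p[y_ij]), np.outer(p - onehot, x_ij)

def image_loss_and_grad(W, x, y):
    """Summed loss l over the whole image, computed in one dense pass."""
    C = W.shape[0]
    p = softmax(np.einsum("cd,dhw->chw", W, x))
    onehot = np.eye(C)[y].transpose(2, 0, 1)   # shape (C, H, W)
    loss = -np.log((p * onehot).sum(axis=0)).sum()
    grad = np.einsum("chw,dhw->cd", p - onehot, x)
    return loss, grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                    # 3 classes, 4 feature channels
x = rng.normal(size=(4, 5, 6))                 # features on a 5 x 6 grid
y = rng.integers(0, 3, size=(5, 6))            # per-pixel labels

whole_loss, whole_grad = image_loss_and_grad(W, x, y)
per_pixel = [pixel_loss_and_grad(W, x[:, i, j], y[i, j])
             for i in range(5) for j in range(6)]
sum_loss = sum(l for l, _ in per_pixel)
sum_grad = sum(g for _, g in per_pixel)
```

Here `whole_loss` and `whole_grad` agree with `sum_loss` and `sum_grad` up to floating-point error, which is the sense in which whole-image training treats every final-layer location as one element of a minibatch.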

#### Computing efficiency  
- When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.   

## Adapting classifiers for dense prediction  
- Typical recognition nets, including LeNet, AlexNet, and its deeper successors, ostensibly take fixed-sized inputs and produce non-spatial outputs.  
- The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates.  
- However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions.  
- Doing so casts them into fully convolutional networks that take input of any size and output classification maps.


<img src = "https://mblogthumb-phinf.pstatic.net/MjAxNzAzMTRfMTg5/MDAxNDg5NDkwNjAxNzI1.ePM0OvxwEyG7lIBciOLyF75YZ0z5Mq8SDwcNlI6pOUEg.MqEmYMEAQhwyCnt2iszdO0XLnDgAeiHPSZc4DzUmjFog.PNG.laonple/%EC%9D%B4%EB%AF%B8%EC%A7%80_15.png?type=w2">
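
As a rough illustration of this view, the sketch below treats the weights of a fully connected layer as a $k\times k$ convolution kernel and slides it over a feature map. It assumes plain NumPy, no bias and no padding; `fc_as_conv` and the toy sizes are hypothetical names chosen for the example.

```python
import numpy as np

def fc_as_conv(feature_map, fc_weight, k):
    """Apply a fully connected layer as a k x k convolution.

    feature_map: (D, H, W) array with H, W >= k
    fc_weight:   (n_out, D * k * k) weights of the original FC layer
    returns:     (n_out, H - k + 1, W - k + 1) map of FC outputs
    """
    D, H, W = feature_map.shape
    kernel = fc_weight.reshape(-1, D, k, k)    # same numbers, viewed as a kernel
    out = np.empty((kernel.shape[0], H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            window = feature_map[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(kernel, window, axes=3)
    return out

rng = np.random.default_rng(0)
fc_weight = rng.normal(size=(10, 3 * 2 * 2))   # FC layer expecting a 3 x 2 x 2 input

small = rng.normal(size=(3, 2, 2))             # the fixed size the FC layer was built for
large = rng.normal(size=(3, 5, 7))             # an arbitrary larger input

single = fc_as_conv(small, fc_weight, k=2)     # shape (10, 1, 1): one classification
score_map = fc_as_conv(large, fc_weight, k=2)  # shape (10, 4, 6): a classification map
```

On the fixed-size input the result equals `fc_weight @ small.ravel()`, and on the larger input each output location holds what the original layer would have produced for the corresponding window.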

## Shift-and-stitch is filter rarefaction  
- Dense predictions can be obtained from coarse outputs by stitching together output from shifted versions of the input.  
- In semantic segmentation, the coarse output has to be upsampled because the prediction must be the same size as the input.  
- If the coarse output is simply enlarged, the result has a lower effective resolution than the input image.  
- Shift-and-stitch is one trick for recovering that resolution.  

<img src = "https://image.slidesharecdn.com/semanticsegmentationslides-181218144057/95/semantic-segmentation-fully-convolutional-networks-for-semantic-segmentation-12-638.jpg?cb=1545144119">  

- First, the red box shows an ordinary convolutional operation on the input.  
- The yellow box performs the same operation as the red box, but on a shifted copy of the input matrix.  
- In its result, the gray cells on the right are unnecessary information: they appear because the input image was moved by one pixel to the left.  
- Repeating this process for every shift and interlacing the outputs gives the final result shown in the last figure.  
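
The procedure can be sketched in a few lines of NumPy. This example assumes the "network" is just a non-overlapping $2\times 2$ max pooling (stride 2), shifts the input up and to the left with zero fill, and interlaces the four coarse outputs; all function names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling (stride 2); stands in for a coarse net."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def shift_and_stitch(x):
    """Run the coarse net on all 2 x 2 shifts of x and interlace the outputs."""
    h, w = x.shape                                # both assumed even
    padded = np.pad(x, ((0, 1), (0, 1)))          # zero fill for shifted-in cells
    dense = np.empty((h, w))
    for dy in range(2):
        for dx in range(2):
            shifted = padded[dy:dy + h, dx:dx + w]        # input moved up/left
            dense[dy::2, dx::2] = max_pool_2x2(shifted)   # interlace the result
    return dense

def dense_max_pool(x):
    """Reference: the same 2 x 2 max pooling evaluated at every pixel (stride 1)."""
    h, w = x.shape
    padded = np.pad(x, ((0, 1), (0, 1)))
    return np.array([[padded[i:i + 2, j:j + 2].max() for j in range(w)]
                     for i in range(h)])

x = np.arange(24, dtype=float).reshape(4, 6)
stitched = shift_and_stitch(x)
reference = dense_max_pool(x)
```

`stitched` matches `reference` exactly, at the price of running the coarse net four times. For an output stride of $f$ the same idea needs $f^2$ passes.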

**problem**  
- This method has the disadvantage of high computational cost.  
- We find learning through upsampling to be more effective and efficient, especially when combined with skip layer fusion.

## Upsampling is backwards strided convolution  
- We consider two methods for obtaining a dense spatial output.  
     1. **Interpolation**  
     <img src = "https://media.vlpt.us/images/kimkj38/post/cd214401-867e-4538-b5a7-586d2223c0e7/image.png">  
     <img src = "https://media.vlpt.us/images/kimkj38/post/75c0a268-3201-4867-8ee9-aa3541034191/image.png">  
     
         - Interpolation infers and fills in the missing values when a low-resolution image is expanded to a high-resolution one.  
     2. **Deconvolution**  
     <img src = "https://blog.kakaocdn.net/dn/5EvJx/btqSxBHlTCL/ElN9OMvxt2WGuhlY0vFk60/img.png">  
     - This operation is trivial to implement, since it simply reverses the forward and backward passes of convolution.  
     - In the above figure, an upsampled output can be obtained by adding overlapping regions.   
     
- In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction.   
- However, information loss is still large, because both methods have to estimate the output from a feature map that is much smaller than the image.
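
The two methods meet when a deconvolution (transposed convolution) layer is given a bilinear kernel. The sketch below is a single-channel NumPy illustration under that assumption: each input value is multiplied by the kernel and the overlapping regions are added. The helper names are hypothetical, and in a real network the kernel could be learned rather than fixed.

```python
import numpy as np

def bilinear_kernel(factor):
    """2-D bilinear interpolation kernel for upsampling by an integer factor."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    w1d = 1 - np.abs(np.arange(size) - center) / factor
    return np.outer(w1d, w1d)

def conv_transpose2d(x, kernel, stride):
    """Transposed convolution: paste a scaled kernel per input pixel and sum overlaps."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

def bilinear_upsample(x, factor):
    """Upsample x by `factor`, cropping the border so the output is factor * input size."""
    kernel = bilinear_kernel(factor)
    full = conv_transpose2d(x, kernel, stride=factor)
    pad = (kernel.shape[0] - factor) // 2
    h, w = x.shape
    return full[pad:pad + h * factor, pad:pad + w * factor]

coarse = np.ones((3, 4))
up = bilinear_upsample(coarse, 2)   # shape (6, 8)
```

For a constant input the interior of `up` stays constant, because the overlapping kernel weights sum to one there. Values on the border are smaller, since fewer kernels overlap at the edge.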

## Combining what and where  
**Add skip layers**
- We define a new fully convolutional net (FCN) for segmentation.  
- While fully convolutionalized classifiers can be fine-tuned to segmentation, and even score highly on the standard metric, their output is dissatisfyingly coarse.  

<img src = "https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FVceIj%2Fbtq3ssThmLV%2FykrKdZFpn8F59Ik8sAeqFK%2Fimg.png">  
<img src = "https://media.vlpt.us/images/sanha9999/post/81948460-c24f-49db-abd2-096340febb99/FCN.png">  

- The 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output.  
- We address this by adding skips that combine the final prediction layer with lower layers with finer strides.  
- As features pass through repeated conv+pool layers, their fine details disappear.  
- So, we fill in the details using feature maps from layers before the last one.  
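
A shape-level sketch of one such fusion, in the style of FCN-16s, is given below. It assumes class scores have already been computed at stride 32 (on top of conv7) and at stride 16 (on top of pool4), and it uses nearest-neighbor repetition as a stand-in for the learned upsampling layers. The function names and toy sizes are hypothetical.

```python
import numpy as np

def upsample_nearest(scores, factor):
    """Stand-in for a learned upsampling layer: repeat each cell factor x factor times."""
    return scores.repeat(factor, axis=1).repeat(factor, axis=2)

def fcn16s_fuse(score_conv7, score_pool4):
    """Fuse stride-32 and stride-16 class scores, then return to image resolution.

    score_conv7: (n_classes, H/32, W/32) coarse, semantic predictions
    score_pool4: (n_classes, H/16, W/16) finer predictions from a shallower layer
    """
    fused = upsample_nearest(score_conv7, 2) + score_pool4   # sum at stride 16
    return upsample_nearest(fused, 16)                       # back to H x W

rng = np.random.default_rng(0)
score_conv7 = rng.normal(size=(21, 2, 3))   # e.g. for a 64 x 96 image
score_pool4 = rng.normal(size=(21, 4, 6))
dense_scores = fcn16s_fuse(score_conv7, score_pool4)   # shape (21, 64, 96)
```

Repeating the same step with a stride-8 layer would give the FCN-8s variant. The sum keeps the coarse layer's decision about what is present while letting the finer layer adjust where it is.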


## Patchwise training is loss sampling  
**patchwise learning**  
- step 1. Crop a patch of a fixed size and feed it to the CNN.  
- step 2. The CNN classifies the patch.  
- step 3. The pixel at the center of the patch is assigned the predicted class.  
- step 4. Repeat this process over the image with a sliding window.
<img src = "https://media.vlpt.us/images/leejaejun/post/bd363d37-7a14-4e7e-8bcf-ab49e8e03e06/image.png" width=800>  

- When training patchwise, neighboring patches may overlap.  
- So, the overlapping regions are computed redundantly.   
- We do not find that patchwise sampling yields faster or better convergence for dense prediction.    
- Whole image training is effective and efficient.  

**Patch Sampling**  
- We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch.  
- We therefore choose unsampled, whole image training in our other experiments.

<img src = "https://media.vlpt.us/images/qsdcfd/post/f0b1b648-e9cd-4235-96aa-7f6e0913ecb5/image.png">
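
One way to read "loss sampling" is as a random mask over the per-pixel loss terms of a whole image. The sketch below assumes each spatial location is kept independently with probability `keep_prob`; that parameter name, the function name, and the toy loss values are hypothetical.

```python
import numpy as np

def sampled_loss(per_pixel_loss, keep_prob, rng):
    """Sum the loss over a random subset of spatial locations.

    per_pixel_loss: (H, W) array of loss terms from one whole-image forward pass
    keep_prob:      probability of keeping each location in the loss
    """
    mask = rng.random(per_pixel_loss.shape) < keep_prob
    return (per_pixel_loss * mask).sum(), mask

rng = np.random.default_rng(0)
per_pixel_loss = rng.random((8, 8))            # toy non-negative loss terms

full, _ = sampled_loss(per_pixel_loss, 1.0, rng)    # whole-image training
half, mask = sampled_loss(per_pixel_loss, 0.5, rng) # roughly half the terms kept
```

With `keep_prob = 1.0` this reduces to unsampled, whole-image training. Lower values keep fewer terms per image, so more images are needed per batch to see the same number of loss terms.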

## From classifier to dense FCN   
**fully connected layer $\rightarrow$ fully convolutional layer**
- Among AlexNet, GoogLeNet, and the VGG nets, we pick the VGG 16-layer net.  
- We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions.  
- We append a $1\times 1$ convolution with channel dimension 21 to predict scores for each of the PASCAL classes at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs.  
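
The appended $1\times 1$ convolution is a linear classifier applied independently at every coarse location. A minimal sketch, assuming a small feature depth of 8 in place of the real network's channels and random weights, with hypothetical names throughout:

```python
import numpy as np

def score_layer(features, weight, bias):
    """1 x 1 convolution: map (D, h, w) features to (n_classes, h, w) scores."""
    return np.einsum("cd,dhw->chw", weight, features) + bias[:, None, None]

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 2, 3))   # coarse feature map (depth 8 is a toy value)
weight = rng.normal(size=(21, 8))       # 21 output channels, one per PASCAL class
bias = rng.normal(size=21)

scores = score_layer(features, weight, bias)   # shape (21, 2, 3)
coarse_labels = scores.argmax(axis=0)          # most likely class per coarse location
```

The coarse score map is then upsampled to the image size before the per-pixel argmax is taken.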

<img src = "https://khyeyoon.github.io/assets/img/FCN/T1.PNG">  

- VGG 16 performed the best of the three networks.  
- Despite similar classification accuracy, our implementation of GoogLeNet did not match the VGG 16 segmentation result.

## Experimental framework  
**Optimization**  
- We train by SGD with momentum.  
- We use a minibatch size of 20 images and fixed learning rates of $10^{-3}, 10^{-4},$ and $5^{-5}$ for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively.  
- We use momentum 0.9, weight decay of $5^{-4}$ or $2^{-4}$, and doubled learning rate for biases, although we found training to be sensitive to the learning rate alone. 
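
For reference, one update step with these ingredients might look like the sketch below. It assumes the common formulation in which weight decay is added to the gradient and the velocity is accumulated with momentum. The function name and the `lr_mult` parameter are hypothetical, and the exact update rule of the original implementation is not stated in these notes.

```python
import numpy as np

def sgd_momentum_step(param, grad, velocity, lr, momentum=0.9,
                      weight_decay=0.0, lr_mult=1.0):
    """One SGD step with momentum and L2 weight decay.

    lr_mult scales the learning rate for this parameter
    (for example 2.0 for biases, 1.0 for weights).
    """
    velocity = momentum * velocity - lr * lr_mult * (grad + weight_decay * param)
    return param + velocity, velocity

# Toy scalar example with an exaggerated learning rate so the numbers are easy to follow.
w, v = np.array(1.0), np.array(0.0)
w, v = sgd_momentum_step(w, np.array(0.5), v, lr=0.1)   # v = -0.05,  w = 0.95
w, v = sgd_momentum_step(w, np.array(0.5), v, lr=0.1)   # v = -0.095, w = 0.855

# The same first step for a bias, with the learning rate doubled.
b, vb = sgd_momentum_step(np.array(1.0), np.array(0.5), np.array(0.0),
                          lr=0.1, lr_mult=2.0)          # vb = -0.1, b = 0.9
```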

**Fine-tuning**  
- We fine-tune all layers by backpropagation through the whole net.  
- Training from scratch is not feasible considering the time required to learn the base classification nets.  
- So, we use transfer learning with a pre-trained CNN.  

**Class Balancing**  
- Fully convolutional training can balance classes by weighting or sampling the loss.  
- We find class balancing unnecessary.

## Results  
- We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow.  

**Metrics**  
- pixel accuracy : $\sum_{i}n_{ii}/ \sum_{i}t_{i}$  
- mean accuracy : $\left(1/n_{cl}\right)\sum_{i}n_{ii}/t_{i}$  
- mean IU : $\left(1/n_{cl}\right)\sum_{i}n_{ii}/\left(t_{i}+\sum_{j}n_{ji}-n_{ii}\right)$  
- frequency weighted IU : $\left(\sum_{k}t_{k}\right)^{-1}\sum_{i}t_{i}n_{ii}/\left(t_{i}+\sum_{j}n_{ji}-n_{ii}\right)$  
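
All four metrics can be read off a confusion matrix. The sketch below follows the paper's notation, where $n_{ij}$ is the number of pixels of class $i$ predicted as class $j$, $t_i=\sum_j n_{ij}$ is the number of pixels of class $i$, and $n_{cl}$ is the number of classes. It assumes every class occurs at least once, so no denominator is zero. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def segmentation_metrics(n):
    """Compute the four metrics from a confusion matrix n[i, j]
    (rows: true class i, columns: predicted class j)."""
    n = np.asarray(n, dtype=float)
    t = n.sum(axis=1)                 # t_i: pixels whose true class is i
    predicted = n.sum(axis=0)         # sum_j n_ji: pixels predicted as class i
    correct = np.diag(n)              # n_ii
    iu = correct / (t + predicted - correct)
    return {
        "pixel_acc": correct.sum() / t.sum(),
        "mean_acc": np.mean(correct / t),
        "mean_iu": np.mean(iu),
        "fw_iu": (t * iu).sum() / t.sum(),
    }

# Toy 2-class example: 10 pixels, 7 of them classified correctly.
metrics = segmentation_metrics([[3, 1],
                                [2, 4]])
```

For this matrix the pixel accuracy is $7/10$, and the per-class IU values are $3/6$ and $4/7$, so the mean IU is about $0.536$.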

**Compare to FCN** $\qquad\qquad\qquad\qquad\qquad$ **NYUDv2** $\qquad\qquad\qquad\qquad\qquad$ **SIFT Flow**
<img src = "https://miro.medium.com/max/2000/1*2obgSShyzzBKuds_XxPCoA.png">  

**$\qquad\qquad\qquad\qquad\qquad\qquad\qquad$PASCAL VOC**
<img src = "https://t1.daumcdn.net/cfile/tistory/99268446605C18E311?original">
 
<img src = "https://cdn-images-1.medium.com/max/1600/1*fU9El2B2qELtKsD9FsHB-w.png" width=700>