# Deep Residual Learning for Image Recognition

Link: https://arxiv.org/abs/1512.03385

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Institution: Microsoft Research

Publication: arXiv

Date: 2015



## Background Materials




## What is this paper about?


ResNet is a very deep neural network architecture based on residual learning. It is substantially deeper than the networks previously used for image recognition.

## What is the motivation of this research?

Network depth is of crucial importance.

However, deeper neural networks are more difficult to train.

As network depth increases, accuracy gets saturated and then degrades rapidly. This is called the degradation problem. It is not caused by overfitting, because the training error of the deeper network also increases.

Consider a shallower architecture and a deeper counterpart built by adding identity mapping layers on top of it. By construction, the deeper model should produce no higher training error than the shallower one. But experiments show that current solvers are unable to find solutions for the deeper model that are comparably good or better than this constructed solution.
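The construction argument above can be sketched in a few lines. This is only a toy illustration, not the paper's experiment: the two "layers" and the helper names (`scale`, `forward`, `identity`) are hypothetical, chosen to show that appending identity layers leaves the output of the shallower model unchanged.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def scale(v, w):
    """Toy 'linear layer': multiply every component by the scalar w."""
    return [w * a for a in v]

def identity(v):
    return list(v)

def forward(layers, v):
    """Apply the layers one after another."""
    for layer in layers:
        v = layer(v)
    return v

# Hypothetical shallow model made of two toy layers.
shallow_layers = [
    lambda v: relu(scale(v, 2.0)),
    lambda v: relu(scale(v, 0.5)),
]

# Deeper counterpart: the same layers followed by extra identity layers.
deeper_layers = shallow_layers + [identity, identity, identity]

x = [1.0, -3.0, 2.5]
shallow_out = forward(shallow_layers, x)
deeper_out = forward(deeper_layers, x)
```

Because both models compute the same function, the deeper one has, by construction, a solution with the same training error as the shallower one. The degradation problem is that solvers fail to find it.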


Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks.
<img src="img/Deep_Residual_Learning_for_Image_Recognition_Figure1.png" width="300">

## What makes this paper different from previous research?

- uses residual learning
- very deep: up to 152 layers, which is 8 times deeper than VGG
- lower complexity than VGG even though the depth is significantly increased
- low error: winner of the ILSVRC 2015 classification task


## How does this paper achieve it?


### Residual representations

Denoting the desired underlying mapping as $\mathcal{H}(x)$, let the stacked nonlinear layers fit another mapping, the residual function:

$\mathcal{F}(x) := \mathcal{H}(x) - x$

If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, it is equivalent to hypothesize that they can asymptotically approximate the residual function $\mathcal{H}(x) - x$.

The original function becomes 

$\mathcal{F}(x) + x$

Both forms should be able to asymptotically approximate the desired functions, although the ease of learning might be different.

The formulation $\mathcal{F}(x) + x$ can be realized by feedforward neural networks with identity mapping "shortcut connections".

<img src="img/Deep_Residual_Learning_for_Image_Recognition_Figure2.png" width="300">

Identity shortcut connections add neither extra parameters nor computational complexity.

The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With residual learning, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
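The effect of zero weights can be checked with a small sketch. It assumes a two-layer residual function $\mathcal{F}(x) = W_2\,\mathrm{ReLU}(W_1 x)$ without biases or batch normalization, and it omits the ReLU that the paper applies after the addition. The function names are illustrative, not taken from any reference implementation.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def plain_block(x, W1, W2):
    """Stacked layers without a shortcut: the output is F(x) only."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_block(x, W1, W2):
    """Stacked layers with an identity shortcut: the output is F(x) + x."""
    fx = plain_block(x, W1, W2)
    return [f + v for f, v in zip(fx, x)]

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero, F(x) = 0, so the residual block returns x
# unchanged, while the plain block collapses to the zero vector.
identity_out = residual_block(x, zeros, zeros)
plain_out = plain_block(x, zeros, zeros)
```

In this sketch the residual block reaches the identity mapping at the trivial point where all weights are zero, whereas the plain block would need its nonlinear layers to fit the identity.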

#### Related representation

The Vector of Locally Aggregated Descriptors (VLAD) is a compact representation of local visual descriptors that still retains a high level of accuracy (Jégou et al., 2010). As with Bag of Words, a visual vocabulary, called a codebook, {$\boldsymbol{\mu}_1, ...,\boldsymbol{\mu}_K$} is first learned using a clustering algorithm such as k-means. Each local descriptor $\boldsymbol{x}_t$ is then associated with its nearest visual word, or codeword, with index $\mathit{NN}(\boldsymbol{x}_t)$ in the codebook. For each codeword $\boldsymbol{\mu}_i$, the differences $\boldsymbol{x}_t - \boldsymbol{\mu}_i$ of the descriptors $\boldsymbol{x}_t$ assigned to it are accumulated:

$\boldsymbol{v}_i = \sum_{\boldsymbol{x}_t:\mathit{NN}(\boldsymbol{x}_t)=i} (\boldsymbol{x}_t - \boldsymbol{\mu}_i)$

The VLAD is the concatenation of the accumulated sub-vectors, $\boldsymbol{V} = (\boldsymbol{v}_1, ..., \boldsymbol{v}_K)$.
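A minimal sketch of this aggregation is given here, assuming the codebook has already been learned (it is hand-picked in the example) and leaving out the normalization step that usually follows. The function names `nearest` and `vlad` are illustrative.

```python
def nearest(x, codebook):
    """Index of the codeword closest to x in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, mu)) for mu in codebook]
    return dists.index(min(dists))

def vlad(descriptors, codebook):
    """Accumulate the residuals x_t - mu_i per codeword, then concatenate."""
    dim = len(codebook[0])
    v = [[0.0] * dim for _ in codebook]
    for x in descriptors:
        i = nearest(x, codebook)
        for d in range(dim):
            v[i][d] += x[d] - codebook[i][d]
    # V = (v_1, ..., v_K) flattened into one vector of length K * dim.
    return [value for sub in v for value in sub]

# Hand-picked codebook with K = 2 codewords in 2 dimensions (an assumption;
# in practice it would come from k-means).
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]]

V = vlad(descriptors, codebook)
# The first two descriptors fall on codeword 0, the third on codeword 1,
# so V == [1.0, 2.0, -1.0, 0.0].
```

The representation stores only residuals relative to the codewords, which is the connection to residual learning.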

The Fisher Vector can be formulated as a probabilistic version of VLAD.

For vector quantization, encoding residual vectors, as done with product quantization (Jégou et al., 2011), is shown to be more effective than encoding the original vectors. http://mglab.blogspot.jp/2011/11/product-quantization.html
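As a rough sketch of the idea, product quantization splits a vector into sub-vectors and quantizes each one with its own small codebook. The example applies it to the residual between a vector and a coarse centroid. The sub-codebooks and the coarse centroid are hand-picked assumptions, and `pq_encode` / `pq_decode` are hypothetical names, not a library API.

```python
def nearest(x, codebook):
    """Index of the codeword closest to x in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
    return dists.index(min(dists))

def pq_encode(vector, sub_codebooks):
    """Split the vector into equal parts and quantize each part separately."""
    m = len(sub_codebooks)
    size = len(vector) // m
    return [
        nearest(vector[j * size:(j + 1) * size], sub_codebooks[j])
        for j in range(m)
    ]

def pq_decode(code, sub_codebooks):
    """Rebuild an approximation by concatenating the selected codewords."""
    out = []
    for j, idx in enumerate(code):
        out.extend(sub_codebooks[j][idx])
    return out

# Assumed coarse centroid; the residual is what gets encoded.
coarse_centroid = [4.0, 4.0, 4.0, 4.0]
x = [5.1, 4.9, 3.0, 4.2]
residual = [a - b for a, b in zip(x, coarse_centroid)]

# Two hand-picked sub-codebooks, one per 2-dimensional sub-vector.
sub_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [-1.0, 0.0]],
]

code = pq_encode(residual, sub_codebooks)   # one index per sub-vector
approx = pq_decode(code, sub_codebooks)     # approximate residual
```

The code stores one small index per sub-vector instead of the full residual, and adding `approx` back to the coarse centroid gives an approximation of `x`.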

### Architecture

<img src="img/Deep_Residual_Learning_for_Image_Recognition_Figure3r.png" height="200">

### Experimental results

#### ImageNet Classification

Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops.
<img src="img/Deep_Residual_Learning_for_Image_Recognition_Figure4.png" height="200">

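The 34-layer plain network has higher error than the 18-layer plain network, while the 34-layer ResNet is better than the 18-layer ResNet. In addition, 50/101/152-layer ResNets are evaluated and are more accurate than the 34-layer one.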

Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 (15.3/19.6 billion FLOPs).

The single 152-layer model has 4.49% top-5 validation error. An ensemble of residual nets achieves 3.57% top-5 error on the test set and won 1st place in the ILSVRC 2015 classification task.


#### CIFAR-10 analysis

The 1202-layer network shows no optimization difficulty and achieves training error < 0.1% and test error 7.93%. This result is worse than that of the 110-layer network, which is suspected to be due to overfitting.

## Datasets used in this study

- ImageNet
- CIFAR-10
- PASCAL VOC 2007 and 2012
- MS COCO


## Implementations

- [tensorflow/models/resnet](https://github.com/tensorflow/models/tree/master/resnet)



## Further Readings