# Theory

This page describes the deep residual learning framework introduced in the ResNet paper and explains how it works.

### Deep Residual Framework

```Work in Progress```

![image](../images/Residual_Module.png)
```put ref on image```
<!--![Picture](../images/Residual_Module.png){width="800" height="600" style="display: block; margin: 0 auto"}-->

 
<!--<div>
<img src="../images/Residual_Module.png" width="300" align="right"/>
</div>-->

Let us define $\mathcal{H}(x)$ as the underlying mapping to be fit by a stack of layers, where $x$ is the input to the first of these layers (assuming the input and output have the same dimensions).

If one hypothesizes that any complicated function can be asymptotically approximated by multiple nonlinear layers, the same should hold true for the residual functions, i.e., $\mathcal{H}(x) - x$.

The authors of the paper (add reference here) explicitly let these layers approximate a residual function:

$$\mathcal{F}(x) := \mathcal{H}(x) - x$$

while the original mapping becomes:

$$\mathcal{H}(x) = \mathcal{F}(x) + x.$$


Although both forms should asymptotically approximate the desired functions (as hypothesized), the ease of learning might differ.

This reformulation is motivated by the counterintuitive degradation problem: as network depth increases, accuracy saturates and then degrades, and the deeper model shows a higher training error, so the effect is not caused by overfitting.

Intuitively, a deeper model constructed from a shallower one by adding layers (which act as identity maps) should not experience a greater training error.

The degradation problem, however, suggests that solvers might have difficulty approximating identity mappings with multiple nonlinear layers. The residual reformulation mitigates this: if identity mappings are optimal, the solver can simply drive the weights of the nonlinear layers toward zero, so that $\mathcal{F}(x) \approx 0$ and the block approaches an identity mapping.

If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find perturbations relative to an identity mapping than to learn the function from scratch.
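
To make this concrete, here is a minimal pure-Python sketch (not the authors' code) of a two-layer block with and without a shortcut. The helper names `linear`, `residual_branch`, `plain_block`, and `residual_block` are illustrative assumptions, biases are omitted, and the ReLU that the paper applies after the addition is left out for clarity. With all weights at zero, the residual block returns its input, while the plain block collapses to the zero mapping.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_branch(x, w1, w2):
    """F(x) = W2 * relu(W1 * x): two stacked layers, biases omitted."""
    return linear(w2, relu(linear(w1, x)))


def plain_block(x, w1, w2):
    """Without a shortcut, the stacked layers must produce H(x) directly."""
    return residual_branch(x, w1, w2)


def residual_block(x, w1, w2):
    """With a shortcut, the layers only produce F(x); x is added back."""
    return [f + a for f, a in zip(residual_branch(x, w1, w2), x)]


x = [1.5, -2.0, 0.5]
zeros = [[0.0] * 3 for _ in range(3)]

print(residual_block(x, zeros, zeros))  # same values as x
print(plain_block(x, zeros, zeros))     # all zeros
```

In this toy setting, representing an identity mapping with the residual block only requires zero weights, whereas the plain block would need a specific non-trivial set of weights.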


### The ResNet structure

The formulation $\mathcal{F}(x) + x$ can be implemented in feedforward neural networks using "shortcut connections" (as illustrated in the previous figure).
These connections skip one or more layers, performing a simple identity mapping, and their outputs are added to the outputs of the stacked layers.

An important detail is that identity shortcut connections do not introduce additional parameters or computational complexity.

The plain (non-residual) baseline network consists mostly of convolutional layers with 3×3 filters and follows two simple design rules:
1. For the same output feature map size, all layers use the same number of filters.
2. When the feature map size is halved, the number of filters is doubled to preserve the time complexity per layer.
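
The second rule can be checked with a quick count of multiply-add operations per layer. The sketch below is only an illustration: the function name `conv_multiply_adds` is hypothetical, and the sizes (a 56×56 map with 64 filters versus a 28×28 map with 128 filters) are example values chosen to show the halving and doubling.

```python
def conv_multiply_adds(height, width, c_in, c_out, kernel=3):
    """Multiply-adds of one conv layer producing a height x width output map."""
    return height * width * c_in * c_out * kernel * kernel


# Before downsampling: 56x56 feature map, 64 input and output channels.
before = conv_multiply_adds(56, 56, 64, 64)

# After downsampling: the map side is halved, the filter count is doubled.
after = conv_multiply_adds(28, 28, 128, 128)

print(before == after)
```

Halving each spatial side divides the number of output positions by 4, while doubling both input and output channels multiplies the per-position cost by 4, so the two effects cancel.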

Downsampling is performed directly by convolutional layers with a stride of 2. The network concludes with a global average pooling layer and a 1000-way fully connected layer with a softmax activation function (used to solve the ILSVRC 2015 classification task).
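
As a rough illustration of the final stage, the sketch below applies global average pooling, a fully connected layer, and a softmax to a tiny feature map. It is an assumption-laden toy: it uses 2 channels and 3 classes instead of the real channel count and 1000 classes, and the names `global_average_pool`, `fully_connected`, and `softmax` are illustrative helpers rather than any library's API.

```python
import math


def global_average_pool(feature_map):
    """Collapse each channel (an H x W grid) to its mean value."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fully_connected(weights, bias, v):
    """One output per weight row: dot(row, v) + bias."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities; the max is subtracted for stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0
    [[0.0, 2.0], [2.0, 0.0]],  # channel 1
]
weights = [[0.5, 0.0], [0.0, 1.0], [0.25, 0.25]]  # 3 classes x 2 channels
bias = [0.0, 0.0, 0.0]

pooled = global_average_pool(feature_map)
probs = softmax(fully_connected(weights, bias, pooled))
print(pooled, probs)
```

Because the pooling averages over all spatial positions, the number of inputs to the fully connected layer depends only on the channel count, not on the spatial size of the last feature map.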

To this structure, the authors introduce shortcut connections, transforming the network into its residual counterpart. When the dimensions between input and output differ, two options are proposed:
- The shortcut still performs identity mapping, with extra zero entries padded to match the increased dimensions; this option introduces no extra parameters:
$$y = \mathcal{F}(x, \lbrace W_i \rbrace) + x$$
- A projection shortcut $W_s$ is used to match dimensions via 1×1 convolutions:
$$y = \mathcal{F}(x, \lbrace W_i \rbrace) + W_s x$$

For both options, when the shortcuts span feature maps of different sizes, they are performed with a stride of 2.
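
The zero-padding and projection shortcuts can be sketched in pure Python on a feature map stored as nested lists indexed `[channel][row][column]`. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the projection weights `w_s` are arbitrary example values, and the stride-2 1×1 convolution is written as spatial subsampling followed by a linear map over channels.

```python
def subsample(feature_map, stride=2):
    """Keep every `stride`-th row and column of each channel."""
    return [[row[::stride] for row in channel[::stride]]
            for channel in feature_map]


def zero_pad_shortcut(feature_map, out_channels, stride=2):
    """Strided identity; the extra output channels are filled with zeros."""
    small = subsample(feature_map, stride)
    height, width = len(small[0]), len(small[0][0])
    extra = out_channels - len(small)
    padding = [[[0.0] * width for _ in range(height)] for _ in range(extra)]
    return small + padding


def projection_shortcut(feature_map, w_s, stride=2):
    """Strided 1x1 convolution: a linear map over channels at each position."""
    small = subsample(feature_map, stride)
    height, width = len(small[0]), len(small[0][0])
    return [
        [[sum(w * small[c][i][j] for c, w in enumerate(row))
          for j in range(width)]
         for i in range(height)]
        for row in w_s
    ]


# Example input: 2 channels of size 4x4, to be matched to 4 channels of 2x2.
x = [[[float(c * 16 + i * 4 + j) for j in range(4)] for i in range(4)]
     for c in range(2)]
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # 4 x 2, illustrative

padded = zero_pad_shortcut(x, out_channels=4)
projected = projection_shortcut(x, w_s)
print(len(padded), len(padded[0]), len(padded[0][0]))
print(len(projected), len(projected[0]), len(projected[0][0]))
```

Both results have the same shape, so either can be added to the output of the stacked layers. The zero-padding variant has no weights, while the projection variant adds one weight per input-output channel pair (8 in this toy example).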

#### Full Training setup:
- The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation.
- A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted. 
- The standard color augmentation is used. 
- **Batch normalization (BN) is applied right after each convolution and before the activation.**
- The weights are initialized and all networks are trained from scratch.
- SGD is used with a **mini-batch size of 256**.
- The learning rate starts from 0.1 and is divided by 10 when the error plateaus.
- The models are trained for up to **$60 \times 10^4$ iterations**.
- A weight decay of 0.0001 and a momentum of 0.9 are used.
- Dropout is not used.
- At test time, the standard 10-crop testing is adopted for comparison studies.
- For best results, the fully convolutional form is adopted and the scores are averaged over multiple scales (images are resized so that the shorter side is in {224, 256, 384, 480, 640}).
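
The optimizer settings listed above can be sketched as follows. This is a hedged illustration, not the authors' training code: the update rule is one common formulation of SGD with momentum and L2 weight decay, and the `patience` parameter of the hypothetical `PlateauSchedule` class is an assumption, since the text only states that the rate is divided by 10 when the error plateaus.

```python
LEARNING_RATE = 0.1
MOMENTUM = 0.9
WEIGHT_DECAY = 0.0001
BATCH_SIZE = 256
MAX_ITERATIONS = 60 * 10**4


def sgd_step(weights, grads, velocity, lr):
    """One SGD update with momentum and L2 weight decay (assumed form)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + WEIGHT_DECAY * w  # weight decay as an L2 penalty gradient
        v = MOMENTUM * v + g      # accumulate the momentum buffer
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


class PlateauSchedule:
    """Divide the learning rate by 10 once the error stops improving.

    `patience` (evaluations without improvement) is an assumed knob.
    """

    def __init__(self, lr=LEARNING_RATE, patience=2):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def update(self, error):
        if error < self.best:
            self.best = error
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr /= 10
                self.stale = 0
        return self.lr


schedule = PlateauSchedule()
rates = [schedule.update(e) for e in [0.9, 0.7, 0.7, 0.7, 0.6]]
weights, velocity = sgd_step([1.0], [0.5], [0.0], lr=rates[0])
print(rates, weights, velocity)
```

In this toy run, the error fails to improve for two consecutive evaluations, so the rate drops from 0.1 to roughly 0.01 at the fourth evaluation and stays there.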