# Residual Networks
- In a standard neural network, each layer consists of a linear transformation followed by an activation function
- In a convolutional network, each layer consists of a set of convolutions followed by an activation function
- Limitations of sequential processing
  - Image classification performance eventually decreases as more layers are added
  - The degradation shows up in training as well as test performance, so the problem lies in training deeper networks rather than in an inability of deeper networks to generalize
  - Shattered gradient phenomenon
    - Nearby gradients are correlated for shallow networks, but this correlation drops to zero for deeper ones
## Residual connections
- Residual (skip) connections are branches in the computational path, where the input to each network layer is added back to its output $$\begin{aligned} h_1 &= x + f_1[x,\phi_1] \\ h_2 &= h_1 + f_2[h_1,\phi_2] \\ h_3 &= h_2 + f_3[h_2,\phi_3] \end{aligned}$$
- Each additive combination of the input and the processed output is known as a *residual layer*
- Substituting the expressions for $h_1$ and $h_2$ (and dropping the parameters $\phi$), the output $y = h_3$ unrolls to $$\begin{aligned} y = x &+ f_1[x] \\ &+ f_2[x + f_1[x]] \\ &+ f_3[x + f_1[x] + f_2[x+f_1[x]]] \end{aligned}$$
- It's a sum of the input and three smaller networks
- Can be seen as an ensemble of these smaller networks whose outputs are summed to compute the result
- Typical to start the network with a linear transformation instead of a residual block
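
The equivalence between the layer-by-layer form and the unrolled sum can be checked numerically. The following is a minimal sketch using scalar inputs; the functions `f1`, `f2`, `f3` are arbitrary stand-ins chosen for illustration, not learned layers, and all names are hypothetical.

```python
import math


# Arbitrary stand-ins for the three processing functions (assumptions for illustration).
def f1(h):
    return math.tanh(h)


def f2(h):
    return 0.5 * h * h


def f3(h):
    return math.sin(h)


def residual_forward(x, blocks):
    """Apply each block and add its output back to its input: h <- h + f[h]."""
    h = x
    for f in blocks:
        h = h + f(h)
    return h


def unrolled_forward(x):
    """The same computation written as the input plus three smaller networks."""
    return x + f1(x) + f2(x + f1(x)) + f3(x + f1(x) + f2(x + f1(x)))


x = 0.7
y_sequential = residual_forward(x, [f1, f2, f3])
y_unrolled = unrolled_forward(x)
print(abs(y_sequential - y_unrolled) < 1e-12)  # True
```

Both forms compute the same value; the unrolled form just makes explicit that every `f` contributes to the output through a direct additive path.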
## Exploding gradients in residual networks
- Do not need to worry about vanishing gradients, because each layer contributes directly to the output
- Residual networks still suffer from exploding gradients
- Adding the processed output back to the input increases the variance at every block, so the variance of the activations grows exponentially with depth during forward propagation
- One way to stabilize this is **Batch normalization**
- **Batch normalization**
  - Shifts and rescales each activation $h$ so that its mean and variance across the batch $\mathbb{B}$ become values that are learned through training
  - Process
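    - The mean $m_h$ and standard deviation $s_h$ of $h$ are computed across the batch $$m_h = \frac{1}{|\mathbb{B}|}\sum_{i \in \mathbb{B}} h_i, \qquad s_h = \sqrt{\frac{1}{|\mathbb{B}|}\sum_{i \in \mathbb{B}} (h_i - m_h)^2}$$
    - These statistics are used to standardize the batch $$h_i \leftarrow \frac{h_i - m_h}{s_h + \epsilon}, \forall i \in \mathbb{B}$$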
    - Then, normalized value is scaled by $\gamma$ and shifted by $\delta$ $$h_i \leftarrow \gamma h_i + \delta$$
    - Both these quantities are learned through training
  - Is applied independently to each hidden unit
  - In convolutional networks, the statistics are computed over both the batch and the spatial positions
    - With $K$ layers of $C$ channels each, there are $KC$ learned offsets and $KC$ learned scales
  - At test time, we do not have a batch to gather statistics from
    - $m_h$ and $s_h$ are computed across the whole training set and frozen in the final network
  - Benefits
    - Loss surface is smoother
    - Makes the network invariant to rescaling the weights and biases that feed into each normalized activation
    - Stable forward propagation
    - Higher learning rates
      - Can use higher learning rates, which improves test performance
    - Regularization
      - Adding noise helps in generalization
      - Batch normalization injects noise because of the batch statistics
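
The variance growth and the batch normalization steps above can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not a reference implementation: each residual block computes `h + f[h]` with `f` being (optional batch norm) then ReLU then a random linear map, the weights use He initialization (variance `2 / width`, an assumption not stated in the notes), and $\gamma = 1$, $\delta = 0$ as they would be at initialization. All function names are hypothetical, and the exact numbers printed depend on the seed.

```python
import math
import random


def batch_norm(batch, gamma=1.0, delta=0.0, eps=1e-5):
    """Batch-normalize one hidden unit; `batch` holds its activation for each example."""
    m_h = sum(batch) / len(batch)                                     # batch mean
    s_h = math.sqrt(sum((h - m_h) ** 2 for h in batch) / len(batch))  # batch std
    standardized = [(h - m_h) / (s_h + eps) for h in batch]
    return [gamma * h + delta for h in standardized]                  # scale, then shift


def batch_norm_units(acts):
    """Apply batch_norm independently to each hidden unit (column) of a batch of vectors."""
    columns = [batch_norm(list(col)) for col in zip(*acts)]
    return [list(row) for row in zip(*columns)]


def residual_block(acts, weights, use_batch_norm):
    """Compute h + f[h], where f is (optional batch norm) -> ReLU -> linear map."""
    inner = batch_norm_units(acts) if use_batch_norm else acts
    out = []
    for h, z in zip(acts, inner):
        relu = [max(0.0, v) for v in z]
        f = [sum(w * r for w, r in zip(row, relu)) for row in weights]
        out.append([a + b for a, b in zip(h, f)])
    return out


def variance(acts):
    """Variance over every activation in the batch."""
    flat = [v for row in acts for v in row]
    mean = sum(flat) / len(flat)
    return sum((v - mean) ** 2 for v in flat) / len(flat)


def forward_variance(depth, use_batch_norm, width=16, batch_size=32, seed=0):
    """Push unit-variance random inputs through `depth` residual blocks."""
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch_size)]
    for _ in range(depth):
        # He initialization (assumed): weight standard deviation sqrt(2 / width)
        std = math.sqrt(2.0 / width)
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        acts = residual_block(acts, weights, use_batch_norm)
    return variance(acts)


var_plain = forward_variance(depth=8, use_batch_norm=False)
var_bn = forward_variance(depth=8, use_batch_norm=True)
print(f"variance without batch norm: {var_plain:.1f}")
print(f"variance with batch norm:    {var_bn:.1f}")
```

Without normalization, each block roughly doubles the variance, so it compounds with depth. With batch norm inside the block, each block adds a roughly constant amount instead, so the growth is much slower. The frozen test-time statistics are not modeled here.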
## Common residual architectures
- ResNet
  - Each residual block contains
    - BatchNorm
    - ReLU
    - Convolutional layer
  - This sequence is applied a second time, and the result is then added back to the block's input
- DenseNet
  - Concatenate modified and original signals
  - The input to a layer comprises the concatenated outputs from **all** the previous layers
- U-Nets
  - Earlier representations are concatenated to later ones
  - **Completely convolutional**
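
The difference between adding (ResNet) and concatenating (DenseNet) can be seen by tracking feature sizes. The following is a minimal sketch on plain Python lists; `toy_layer` is a hypothetical stand-in for a learned layer and does not reproduce the BatchNorm, ReLU, convolution sequence of a real block.

```python
def toy_layer(num_outputs):
    """Hypothetical stand-in for a learned layer: maps any feature list to `num_outputs` values."""
    def layer(features):
        mean = sum(features) / len(features)
        return [max(0.0, mean + i) for i in range(num_outputs)]  # ReLU of a shifted mean
    return layer


def residual_forward(x, layers):
    """ResNet-style: each layer's output is added to its input, so the size never changes."""
    h = list(x)
    for layer in layers:
        h = [a + b for a, b in zip(h, layer(h))]
    return h


def dense_forward(x, layers):
    """DenseNet-style: each layer's output is concatenated to everything computed so far."""
    features = list(x)
    input_sizes = []
    for layer in layers:
        input_sizes.append(len(features))      # size of what this layer sees
        features = features + layer(features)  # list concatenation, not addition
    return features, input_sizes


x = [1.0, -2.0, 0.5, 3.0]
res_out = residual_forward(x, [toy_layer(4) for _ in range(3)])
dense_out, dense_input_sizes = dense_forward(x, [toy_layer(2) for _ in range(3)])
print(len(res_out))        # 4
print(dense_input_sizes)   # [4, 6, 8]
print(len(dense_out))      # 10
```

With addition, every layer must produce an output of the same size as its input. With concatenation, each layer can emit a small number of new features, but the input seen by later layers keeps growing.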
## Why do nets with residual connections perform so well?
- Allow much deeper networks to be trained
- Residual connections also add value on their own, beyond making deeper networks trainable