# Residual Connections

Residual connections, also known as skip connections, are a key component of Residual Networks (ResNets), a neural network architecture designed to address the degradation problem in deep networks. The degradation problem is the observation that adding more layers to an already deep network can lead to higher training error, and therefore worse accuracy, even though the deeper network could in principle match the shallower one by setting its extra layers to the identity. Because the training error itself rises, the effect is not explained by overfitting. Residual connections mitigate this issue by letting the network learn residual functions with reference to the layer inputs, rather than learning unreferenced functions.

### Mathematical Formulation

1. **Basic Concept**:
   
   In a typical neural network layer, an input $ \mathbf{x} $ is transformed into an output $ \mathbf{y} $ by a function $ \mathcal{H} $:
   
   $$ \mathbf{y} = \mathcal{H}(\mathbf{x}) $$

   In ResNets, instead of learning the desired mapping $ \mathcal{H}(\mathbf{x}) $ directly, the stacked layers learn the residual mapping $ \mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x} $. Adding the input back recovers the desired mapping, which leads to the following formulation:
   
   $$ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} $$

   Here, $ \mathbf{x} $ is the input to the residual block, $ \mathcal{F}(\mathbf{x}) $ is the learned residual function, and the addition of $ \mathbf{x} $ is the residual connection.

2. **Layer Formulation**:

   In practice, an activation is often applied after the addition. For a given input $ \mathbf{x} $ to a residual block, the output $ \mathbf{y} $ is then given by:
   
   $$ \mathbf{y} = \sigma(\mathcal{F}(\mathbf{x}) + \mathbf{x}) $$

   where $ \sigma $ is an activation function (e.g., ReLU), and $ \mathcal{F}(\mathbf{x}) $ is typically composed of one or more convolutional layers, batch normalization, and activation functions.

3. **Deep Residual Learning**:
   
   For a deep residual network with $ L $ residual blocks, let the input to the $ l $-th block be $ \mathbf{x}_l $. When no activation is applied after the addition, the residual connection for block $ l $ can be represented as:
   
   $$ \mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \{W_l\}) $$

   Here, $ \{W_l\} $ denotes the weights of the $ l $-th block. If there are multiple layers within a residual block, $ \{W_l\} $ collects all of their weights and $ \mathcal{F} $ represents the composition of their transformations.

4. **Benefits**:

   - **Gradient Flow**: Residual connections improve the flow of gradients through the network, making it easier to train very deep networks. During backpropagation, gradients can flow directly through the identity connections, mitigating the vanishing gradient problem.
   
   - **Ease of Optimization**: The residual connections let each block learn a perturbation of the identity mapping, which is often easier to optimize than learning the complete mapping from scratch.
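
To make the gradient-flow argument concrete, unrolling the recurrence $ \mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \{W_l\}) $ from block $ l $ to a later block $ L $ gives

$$ \mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \{W_i\}) $$

so the derivative of $ \mathbf{x}_L $ with respect to $ \mathbf{x}_l $ always contains an identity term. The NumPy sketch below illustrates this under simplifying assumptions that are not part of the text above: every $ \mathcal{F} $ is linear, $ \mathcal{F}(\mathbf{x}) = W\mathbf{x} $, with small random weights, and the function name `end_to_end_jacobians` is hypothetical. It compares the end-to-end Jacobian of a plain stack with that of a residual stack.

```python
import numpy as np

def end_to_end_jacobians(depth=20, dim=8, weight_std=0.05, seed=0):
    """Return (plain, residual) Jacobians of the last state w.r.t. the first.

    Assumption for illustration: each F is linear, F(x) = W @ x, so the
    per-layer Jacobian is W for a plain layer and I + W for a residual one.
    """
    rng = np.random.default_rng(seed)
    identity = np.eye(dim)
    plain = np.eye(dim)
    residual = np.eye(dim)
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(dim, dim))
        plain = W @ plain                     # plain layer: x_{l+1} = W x_l
        residual = (identity + W) @ residual  # residual: x_{l+1} = x_l + W x_l
    return plain, residual

plain_jac, residual_jac = end_to_end_jacobians()
plain_norm = np.linalg.norm(plain_jac)        # Frobenius norm
residual_norm = np.linalg.norm(residual_jac)
print(f"plain stack Jacobian norm:    {plain_norm:.3e}")
print(f"residual stack Jacobian norm: {residual_norm:.3e}")
```

In this toy setting the plain product collapses toward zero as depth grows, while the residual product keeps a norm far from zero because every factor contains the identity. Real networks have nonlinear $ \mathcal{F} $ and trained weights, so this is only an illustration of the mechanism.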

### Detailed Example

Consider a simple residual block with two layers. Let the input to the block be $ \mathbf{x} $, and let the block consist of two convolutional layers with a ReLU activation between them. Writing each convolution as a matrix product for simplicity, the block can be represented as follows:

1. **First Layer**:
   
   $$ \mathbf{h}_1 = \sigma(W_1 \mathbf{x} + b_1) $$

2. **Second Layer**:
   
   $$ \mathbf{h}_2 = W_2 \mathbf{h}_1 + b_2 $$

3. **Residual Connection**:
   
   $$ \mathbf{y} = \mathbf{h}_2 + \mathbf{x} $$

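   where $ W_1 $ and $ W_2 $ are the weights of the convolutional layers, $ b_1 $ and $ b_2 $ are the biases, and $ \sigma $ is the ReLU activation function. For the addition to be defined, $ \mathbf{h}_2 $ must have the same shape as $ \mathbf{x} $. This example omits the activation after the addition.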

Putting it together:

$$ \mathbf{y} = W_2 \, \sigma(W_1 \mathbf{x} + b_1) + b_2 + \mathbf{x} $$
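
The combined formula can be checked numerically. The following is a minimal NumPy sketch, assuming dense matrices stand in for the convolutions; the helper names `relu` and `residual_block` and the concrete numbers are illustrative choices, not part of the example above.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU, playing the role of sigma."""
    return np.maximum(z, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """Compute y = W2 relu(W1 x + b1) + b2 + x with dense matrices."""
    h1 = relu(W1 @ x + b1)   # first layer
    h2 = W2 @ h1 + b2        # second layer, no activation
    return h2 + x            # residual connection

x = np.array([1.0, -2.0])
W1 = np.array([[1.0, 0.0], [0.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[0.5, 0.0], [0.0, 0.5]])
b2 = np.array([0.1, 0.1])

# By hand: h1 = [1, 0], h2 = [0.6, 0.1], so y = [1.6, -1.9].
y = residual_block(x, W1, b1, W2, b2)
print(y)
```

Note that `W2` must map back to the shape of `x`; otherwise the final addition is not defined.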

### Intuition

- **Identity Mapping**: If the learned residual function $ \mathcal{F}(\mathbf{x}) $ is zero, the output $ \mathbf{y} $ will be equal to the input $ \mathbf{x} $. This implies that the network can easily retain the input through the identity mapping if necessary.
  
- **Residual Learning**: The network is encouraged to learn small adjustments (residuals) to the identity mapping, which can be more efficient and effective, especially in very deep networks.
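
Both points can be illustrated with a small NumPy sketch. It assumes blocks of the form $ \mathbf{x} + W_2 \, \sigma(W_1 \mathbf{x}) $ with biases omitted and random weights; the function name `residual_stack` and the sizes are hypothetical choices made for this illustration.

```python
import numpy as np

def residual_stack(x, blocks):
    """Apply x <- x + W2 @ relu(W1 @ x) for each (W1, W2) pair; biases omitted."""
    for W1, W2 in blocks:
        x = x + W2 @ np.maximum(W1 @ x, 0.0)
    return x

rng = np.random.default_rng(0)
dim, depth = 4, 10
x = rng.normal(size=dim)

# Identity mapping: the last weight matrix of every block is zero, so F(x) = 0.
zero_blocks = [(rng.normal(size=(dim, dim)), np.zeros((dim, dim)))
               for _ in range(depth)]
y_identity = residual_stack(x, zero_blocks)

# Residual learning: small weights on the residual branch give a small adjustment.
small_blocks = [(rng.normal(size=(dim, dim)), 1e-3 * rng.normal(size=(dim, dim)))
                for _ in range(depth)]
y_small = residual_stack(x, small_blocks)
relative_change = np.linalg.norm(y_small - x) / np.linalg.norm(x)

print("identity preserved:", np.array_equal(y_identity, x))
print(f"relative change with small residuals: {relative_change:.3e}")
```

With the residual branch zeroed, ten stacked blocks return the input unchanged. With small residual weights, the output stays close to the input, which is the sense in which each block learns an adjustment to the identity.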

### Visualization

In a typical residual block, the input $ \mathbf{x} $ is passed through a series of transformations to produce $ \mathcal{F}(\mathbf{x}) $. The original input $ \mathbf{x} $ is then added to this output:

<center><img src="fig/residual.png"/></center>


This skip connection directly adds the input $ \mathbf{x} $ to the output of the transformations, facilitating the learning of residual functions.

### Conclusion

Residual connections have proven to be a powerful architectural feature for deep neural networks, enabling the training of significantly deeper models by mitigating the degradation problem and improving gradient flow. Their mathematical foundation lies in learning residual functions relative to the identity mapping, which makes deep networks easier to optimize and helps them achieve better performance.
