## Skip Connections

Learn about skip connections and the problem they solve.

We will be covering:

- The update rule and the vanishing gradient problem
- Skip connections for the win
- ResNet: skip connections via addition
- DenseNet: skip connections via concatenation

If you were training a deep neural network back in 2014, you would almost certainly have run into the so-called **vanishing gradient problem**. In simple terms: you sit in front of the screen watching your network train, and the training loss stops decreasing while it is still far from the desired value. You spend all night checking every line of code for a mistake and find no clue.

### The update rule and the vanishing gradient problem

Let’s revisit the update rule of gradient descent without momentum, where $L$ is the loss function and $\lambda$ is the learning rate:

> $w_{i}' = w_{i} + \Delta w_{i}$

where $\Delta w_{i} = - \lambda \frac{\partial L}{\partial w_{i}}$

You update each parameter by a small amount $\Delta w_{i}$ that is computed from the gradient. For instance, suppose that for an early layer the average gradient $\frac{\partial L}{\partial w_{i}}$ is 1e-15. Given a learning rate $\lambda$ of 1e-4, you change the layer parameters by the product of the two quantities, which has a magnitude of 1e-19 (this is $\Delta w_{i}$). As a result, **you don’t actually observe any change in the model while training your network**. This is how the vanishing gradient problem shows up in practice.
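The arithmetic can be checked in a few lines. This is a minimal sketch: the weight value of 0.5 is hypothetical and chosen only for illustration, while the gradient and learning rate are the ones quoted in the text.

```python
import math

w = 0.5        # hypothetical weight value, chosen only for illustration
grad = 1e-15   # average gradient of an early layer (value from the text)
lr = 1e-4      # learning rate, lambda in the update rule

delta_w = -lr * grad   # delta w_i = -lambda * dL/dw_i
w_new = w + delta_w    # w_i' = w_i + delta w_i

print(f"delta_w = {delta_w:.1e}")        # delta_w = -1.0e-19
print(f"weight changed: {w_new != w}")   # weight changed: False
```

Even in 64-bit floating point, the update is smaller than the spacing between representable numbers near 0.5, so the stored weight does not change at all.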

### Skip connections for the win

Today, the skip connection is a standard building block in many CNN architectures. By using a skip connection, we provide an alternative path for the gradient.

Experiments have shown that these additional paths often help the model converge.

> Skip connections, as the name suggests, **skip some layers in the neural network and feed the output of one layer as input to later layers**, not just to the one that immediately follows.

As explained in the previous chapter, backpropagation applies the chain rule, so we keep multiplying the error gradient by further terms as we move backwards. In a long chain of multiplications, if many of the factors are smaller than one in magnitude, the resulting gradient will be very small.

Thus, **the gradient becomes very small** as we approach the earlier layers in a deep network. In some cases, the gradient becomes zero, meaning that **we do not update the early layers at all**.
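To get a feel for how quickly such a product shrinks, here is a small sketch. It assumes, purely for illustration, a chain of 30 layers in which every layer contributes a local derivative of 0.25, which is the maximum slope of the sigmoid function. Real networks also multiply by weight values, so actual numbers will differ.

```python
local_derivative = 0.25   # assumed per-layer factor (maximum slope of a sigmoid)
num_layers = 30           # assumed depth, for illustration only

gradient = 1.0            # gradient arriving at the output layer
for _ in range(num_layers):
    gradient *= local_derivative   # chain rule: one more factor per layer

print(f"gradient at the first layer: {gradient:.2e}")   # 8.67e-19
```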

In general, there are two fundamental ways to combine the features of non-adjacent layers through a skip connection:

a) **Addition**, as in residual architectures

b) **Concatenation**, as in densely connected architectures
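The difference between the two options is easiest to see in the tensor shapes. The sketch below uses random tensors as stand-ins for real feature maps; the names `x` and `fx` and the shape `(1, 3, 8, 8)` are assumptions made for illustration.

```python
import torch

x = torch.randn(1, 3, 8, 8)    # input to the skipped layers (batch, channels, H, W)
fx = torch.randn(1, 3, 8, 8)   # stand-in for the output of the skipped layers

added = fx + x                             # addition: shapes must match
concatenated = torch.cat([fx, x], dim=1)   # concatenation along the channel dimension

print(added.shape)          # torch.Size([1, 3, 8, 8])
print(concatenated.shape)   # torch.Size([1, 6, 8, 8])
```

Addition keeps the number of channels fixed, while concatenation grows it with every connection.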

### ResNet: skip connections via addition

The core idea is to backpropagate through the identity function by just using vector addition. Along the identity path, the gradient is simply multiplied by one, so its value is preserved as it reaches the earlier layers. This is the main idea behind Residual Networks (ResNets): they stack such residual blocks on top of one another. We use an identity function to **preserve the gradient**.

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig28.PNG)
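The effect of the identity path can be checked numerically with autograd. In this sketch, `weak_branch` is a hypothetical stand-in for a stack of layers whose local gradient has almost vanished; it is not part of any real architecture.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)


def weak_branch(t: torch.Tensor) -> torch.Tensor:
    # Stand-in for layers whose local gradient is close to zero
    return 1e-6 * torch.tanh(t)


# Without a skip connection, the gradient with respect to x is tiny
(grad_plain,) = torch.autograd.grad(weak_branch(x), x)

# With a skip connection, the identity path contributes a gradient of 1
(grad_skip,) = torch.autograd.grad(weak_branch(x) + x, x)

print(f"without skip: {grad_plain.item():.2e}")
print(f"with skip:    {grad_skip.item():.6f}")
```

The gradient through the plain branch is many orders of magnitude below one, whereas the version with the skip connection stays close to one.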

Mathematically, we can represent the residual block and calculate its partial derivative (gradient), given the loss function $L$ like this:

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig29.PNG)
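Written out in symbols, the relation can be sketched as follows. The notation is an assumption: $x$ is the block input, $F(x)$ is the output of the skipped layers, and $H(x)$ is the block output.

> $H(x) = F(x) + x$

> $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H} \frac{\partial H}{\partial x} = \frac{\partial L}{\partial H} \left( \frac{\partial F}{\partial x} + 1 \right) = \frac{\partial L}{\partial H} \frac{\partial F}{\partial x} + \frac{\partial L}{\partial H}$

The last term does not pass through $F$, so even when $\frac{\partial F}{\partial x}$ is close to zero, the gradient $\frac{\partial L}{\partial H}$ still reaches the earlier layers unchanged.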

Apart from vanishing gradients, there is another reason why skip connections are so common. For a plethora of tasks (such as semantic segmentation, optical flow estimation, etc.), some of the information captured in the initial layers is still useful later on, and we would like the later layers to have access to it.

It has been observed that in **earlier layers, the learned features correspond to low-level semantic information** that is extracted from the input. Without the skip connection, that information would become too abstract by the time it reaches the later layers.

Ready for a coding exercise? Let’s try to implement a skip connection in PyTorch. In the code below, you will find a small network of two convolutional layers. Your goal is to write the `forward` function so that the input is added to the output of the two layers, forming a residual connection. In essence, you will express the residual block from the figure above in code.

```python
import torch
import torch.nn as nn

# Fix the random seed so that the layer weights are reproducible
seed = 172
torch.manual_seed(seed)


class SkipConnection(nn.Module):

    def __init__(self):
        super(SkipConnection, self).__init__()
        # Two convolutional layers, each followed by a ReLU
        self.conv_layer1 = nn.Conv2d(3, 6, 2, stride=2, padding=2)
        self.relu = nn.ReLU(inplace=True)
        self.conv_layer2 = nn.Conv2d(6, 3, 2, stride=2, padding=2)
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, input: torch.FloatTensor) -> torch.FloatTensor:
        # WRITE YOUR CODE HERE
        pass
```

### Solution

```python
import torch
import torch.nn as nn

# Fix the random seed so that the layer weights are reproducible
seed = 172
torch.manual_seed(seed)


class SkipConnection(nn.Module):

    def __init__(self):
        super(SkipConnection, self).__init__()
        # Two convolutional layers, each followed by a ReLU
        self.conv_layer1 = nn.Conv2d(3, 6, 2, stride=2, padding=2)
        self.relu = nn.ReLU(inplace=True)
        self.conv_layer2 = nn.Conv2d(6, 3, 2, stride=2, padding=2)
        self.relu2 = nn.ReLU(inplace=True)

    def forward(self, input: torch.FloatTensor) -> torch.FloatTensor:
        # Main path: conv -> ReLU -> conv -> ReLU
        x = self.conv_layer1(input)
        x = self.relu(x)
        x = self.conv_layer2(x)
        x = self.relu2(x)
        # Skip connection: add the block input to the output of the main path
        return x + input
```
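One detail worth checking is that the element-wise addition `x + input` is meant to combine tensors of the same shape. The sketch below uses the standard output-size formula for a convolution with dilation 1 to see which square input sizes the two layers above preserve. The helper names are hypothetical and not part of the exercise.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


def block_output_size(size: int) -> int:
    # Both layers in the exercise use kernel 2, stride 2, padding 2
    after_first = conv_output_size(size, kernel=2, stride=2, padding=2)
    return conv_output_size(after_first, kernel=2, stride=2, padding=2)


for size in (3, 4, 8):
    print(size, "->", block_output_size(size))
# 3 -> 3
# 4 -> 4
# 8 -> 5
```

With these layer settings, only 3×3 and 4×4 square inputs come out at their original size. Other input sizes would need different kernel, stride, or padding values for the addition to line up.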

### DenseNet: skip connections via concatenation

For some prediction problems, **there is low-level information shared between the input and output, and it would be desirable to pass this information directly across the net**. The alternative way to build skip connections is to concatenate previous feature maps. The best-known architecture that takes this approach is DenseNet.

This architecture relies heavily on feature concatenation to ensure maximum information flow between layers in the network. This is achieved by **connecting all layers directly with each other via concatenation**, rather than by addition as in ResNets. In practice, you concatenate the feature maps along the channel dimension. This leads to:

a) a **very large number of feature channels in the last layers** of the network;

b) more **compact** models;

c) extreme **feature reusability**.

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig30.PNG)
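The channel growth caused by concatenation can be sketched with a tiny block. This is a simplified illustration, not the original DenseNet implementation: the class name `TinyDenseBlock`, the growth rate of 4, and the two-layer depth are assumptions, and details such as batch normalization and transition layers are left out.

```python
import torch
import torch.nn as nn


class TinyDenseBlock(nn.Module):
    """Hypothetical dense block: each layer sees all previous feature maps."""

    def __init__(self, in_channels: int = 3, growth_rate: int = 4, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            # kernel 3 with padding 1 keeps the spatial size unchanged
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
                nn.ReLU(),
            ))
            channels += growth_rate   # the next layer also receives the new maps
        self.out_channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = x
        for layer in self.layers:
            new_features = layer(features)
            # Skip connection via concatenation along the channel dimension
            features = torch.cat([features, new_features], dim=1)
        return features


block = TinyDenseBlock()
x = torch.randn(1, 3, 8, 8)
out = block(x)
print(out.shape)   # torch.Size([1, 11, 8, 8])
```

The output holds the 3 input channels plus 4 new channels from each of the two layers. The input itself is carried through untouched in the first 3 channels, which is the feature reuse described above.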