
# Shortcut connections

Shortcut connections (also called skip or residual connections) were introduced to mitigate the vanishing gradient problem, in which gradients become progressively smaller as they propagate backward through the network, making the earlier layers difficult to train.

These connections create an alternative, shorter path for the gradient by skipping one or more layers. This is achieved by adding the output of one layer to the output of a later layer.
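
To make the effect concrete, here is a minimal sketch on a toy case that is not part of the example in this notebook. It assumes every "layer" is the scalar map `f(h) = 0.1 * h`, chosen only so that each layer's local derivative is small. The helper name `stacked_gradient` and the `scale` parameter are hypothetical. Without a shortcut, five layers multiply the gradient by `0.1` five times (`1e-5`); with `h + f(h)`, each layer contributes a factor of `1 + 0.1` instead, so the product is `1.1 ** 5`, roughly 1.61.

```python
import torch

def stacked_gradient(num_layers, use_shortcut, scale=0.1):
    """Return d(output)/d(input) for a stack of toy layers f(h) = scale * h."""
    x = torch.tensor(1.0, requires_grad=True)
    h = x
    for _ in range(num_layers):
        out = scale * h                       # toy layer with a small local derivative
        h = h + out if use_shortcut else out  # shortcut adds the layer input back
    h.backward()
    return x.grad.item()

plain = stacked_gradient(5, use_shortcut=False)
skip = stacked_gradient(5, use_shortcut=True)
print(f"without shortcut: {plain:.2e}")
print(f"with shortcut:    {skip:.2e}")
```

The identity term is what keeps the product of per-layer derivatives from collapsing toward zero.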

Let's build a small neural network to see the effect of shortcut connections.

In [1]:
import torch
import torch.nn as nn

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        result = 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
        return result
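
As a sanity check, the formula above can be compared with PyTorch's built-in tanh approximation. This is a sketch that assumes a PyTorch version in which `nn.GELU` accepts the `approximate="tanh"` argument. The function `gelu_tanh` is a hypothetical standalone copy of the formula, used here only so the snippet runs on its own.

```python
import torch
import torch.nn as nn

def gelu_tanh(x):
    # Same tanh-based GELU approximation as the GELU class above
    return 0.5 * x * (1 + torch.tanh(
        torch.sqrt(torch.tensor(2.0 / torch.pi)) *
        (x + 0.044715 * torch.pow(x, 3))
    ))

x = torch.linspace(-3.0, 3.0, steps=13)
builtin = nn.GELU(approximate="tanh")  # assumes this argument is available
max_diff = (gelu_tanh(x) - builtin(x)).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

If the two agree up to floating-point noise, the built-in module could be swapped in for the custom class.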

In [3]:
class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(layer_sizes[i], layer_sizes[i + 1]),
                    GELU())
            for i in range(len(layer_sizes) - 1)
        ])

    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == output.shape:
                # Apply shortcut
                x = x + output
            else:
                x = output
        return x
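
The shape check in `forward` means a shortcut is only added where a layer keeps the feature dimension unchanged. For 2D inputs of shape `(batch, features)`, that is the case exactly when a layer's input and output sizes match. The sketch below uses a hypothetical helper, `shortcut_applies`, to list which layers would receive a shortcut for a given `layer_sizes`.

```python
def shortcut_applies(layer_sizes):
    """For each Linear layer, report whether its input and output sizes match."""
    return [
        layer_sizes[i] == layer_sizes[i + 1]
        for i in range(len(layer_sizes) - 1)
    ]

flags = shortcut_applies([3, 3, 3, 3, 3, 1])
print(flags)  # [True, True, True, True, False]
```

With sizes such as `[3, 3, 3, 3, 3, 1]`, the final 3-to-1 layer changes the shape, so it is applied without a shortcut.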

# Example

In [11]:
# Define the layer sizes. This yields 5 linear layers: the first four map
# 3 features to 3 features, and the last one maps 3 features to 1 output
layer_sizes = [3, 3, 3, 3, 3, 1]
inputs = torch.tensor([[1., 0., -1.]])
torch.manual_seed(123)
network_without_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=False)

In [12]:
# Function to compute and print the mean absolute gradient of each weight matrix
def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate the loss based on how close we are to the target
    loss_fn = nn.MSELoss()
    loss = loss_fn(output, target)

    # Backward pass
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"Mean absolute gradient of {name}: {torch.mean(torch.abs(param.grad))}")

In [13]:
# Run the example
print_gradients(network_without_shortcut, inputs)

Mean absolute gradient of layers.0.0.weight: 0.00020173584925942123
Mean absolute gradient of layers.1.0.weight: 0.00012011159560643137
Mean absolute gradient of layers.2.0.weight: 0.0007152040489017963
Mean absolute gradient of layers.3.0.weight: 0.0013988736318424344
Mean absolute gradient of layers.4.0.weight: 0.005049645435065031


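As the output shows, the gradients shrink as we move backward from the last layer (`layers.4`, about 5e-3) to the first layer (`layers.0`, about 2e-4), a drop of roughly 25 times. The decrease is not strictly monotonic (`layers.1` is slightly smaller than `layers.0`), but the overall trend is the vanishing gradient problem described above.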

Now, let's repeat the same example with shortcut connections.

In [14]:
torch.manual_seed(123)
network_with_shortcut = ExampleDeepNeuralNetwork(layer_sizes, use_shortcut=True)
print_gradients(network_with_shortcut, inputs)

Mean absolute gradient of layers.0.0.weight: 0.22169791162014008
Mean absolute gradient of layers.1.0.weight: 0.20694105327129364
Mean absolute gradient of layers.2.0.weight: 0.32896995544433594
Mean absolute gradient of layers.3.0.weight: 0.2665732204914093
Mean absolute gradient of layers.4.0.weight: 1.3258540630340576


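The difference is clear: with shortcut connections, the gradient at `layers.0` is about 0.22 instead of about 0.0002, roughly three orders of magnitude larger, and the gradients of the first four layers all stay within the same order of magnitude (about 0.2 to 0.33). Note that the last layer maps 3 features to 1, so the shape check in `forward` skips the shortcut there.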
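
To quantify the comparison, the short sketch below computes the per-layer ratio between the two runs. The numbers are copied from the printed outputs above. The variable names are hypothetical.

```python
# Mean absolute weight gradients copied from the two printed outputs above
without_shortcut = [
    0.00020173584925942123,
    0.00012011159560643137,
    0.0007152040489017963,
    0.0013988736318424344,
    0.005049645435065031,
]
with_shortcut = [
    0.22169791162014008,
    0.20694105327129364,
    0.32896995544433594,
    0.2665732204914093,
    1.3258540630340576,
]

# How many times larger each layer's gradient is with shortcut connections
ratios = [w / wo for w, wo in zip(with_shortcut, without_shortcut)]
for i, ratio in enumerate(ratios):
    print(f"layers.{i}.0.weight: {ratio:.0f}x larger with shortcuts")
```

In this run, every layer's gradient is more than 100 times larger with shortcuts, and the first layer's is larger by roughly 1,100 times.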