# **Shortcut (Residual) Connections in Transformers**

A shortcut connection (also called a `residual connection`) is a mechanism where the `input` to a layer is added back to the `output` of that layer. In Transformers, this technique ensures that information from earlier layers can flow directly to later layers without being completely transformed at each step.

It was popularized by `ResNet` for `CNNs` and has been part of the Transformer architecture since its original design, where it stabilizes training and allows deeper stacks of layers.

## **Intuition**

- Deep networks can suffer from `vanishing gradients` and information loss.

- Residual connections act as an `information highway`, ensuring that the original signal (input embeddings, intermediate representations) can `bypass transformations`.

- This allows layers to focus on learning `residual transformations` (the “difference” from the input) rather than relearning the full representation.
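
A minimal sketch of the "residual transformation" idea, assuming a small feed-forward sublayer whose last layer is zero-initialized (that initialization is an illustrative choice, not something the text above prescribes). With the sublayer producing zeros, the block reduces to the identity, so training only has to learn the difference from the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sublayer F; sizes are arbitrary and chosen for illustration
ff = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

# Zero the last layer so that F(x) == 0 before any training
nn.init.zeros_(ff[-1].weight)
nn.init.zeros_(ff[-1].bias)

x = torch.randn(4, 16)
y = x + ff(x)  # residual connection: y = F(x) + x

# With F(x) == 0 the block passes the input through unchanged
print(torch.equal(x, y))  # True
```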


## **Mathematical Representation**

Given a layer transformation $F(x, \theta)$ applied to an input $x$:

$$y = F(x, \theta) + x$$


Where:

- $F(x, \theta)$ = transformation (e.g., self-attention, feed-forward network).

- $x$ = input (shortcut path).

- $y$ = output after residual connection.
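
As a short worked step, differentiating $y = F(x, \theta) + x$ with respect to $x$ shows why the shortcut helps gradients. The loss symbol $\mathcal{L}$ and the identity matrix $I$ are introduced here for illustration only:

$$\frac{\partial y}{\partial x} = \frac{\partial F(x, \theta)}{\partial x} + I$$

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F(x, \theta)}{\partial x} + I\right) = \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F(x, \theta)}{\partial x} + \frac{\partial \mathcal{L}}{\partial y}$$

The second term carries the upstream gradient back to $x$ unchanged, so some gradient signal survives even when $\partial F / \partial x$ is small.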


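In Transformers, the residual connection is combined with layer normalization. The original Transformer normalizes after the addition (post-LN):

$$y = \mathrm{LayerNorm}(F(x, \theta) + x)$$

Many later implementations normalize the sublayer input instead (pre-LN), which is the form the `ResidualConnection` class below implements:

$$y = x + F(\mathrm{LayerNorm}(x), \theta)$$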

## **Key Properties**

- `Stabilizes training` of very deep models.

- `Mitigates vanishing gradients` by giving gradients a direct path back through the shortcut, so they are not forced through every transformation.

- `Improves convergence speed` by simplifying optimization.

- `Encourages feature reuse` from earlier layers.
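
The gradient-flow property can be checked with a small experiment. This is a rough sketch under stated assumptions: a toy stack of `Linear + Tanh` blocks with PyTorch's default initialization, and a hypothetical helper `input_grad_norm` written for this illustration. It compares the gradient norm reaching the input with and without shortcuts:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def input_grad_norm(depth: int, dim: int, use_residual: bool) -> float:
    """Norm of d(loss)/d(input) for a stack of Linear+Tanh blocks."""
    layers = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for layer in layers:
        # With the shortcut, each block computes h + F(h); otherwise just F(h)
        h = h + layer(h) if use_residual else layer(h)
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(depth=50, dim=64, use_residual=False)
skip = input_grad_norm(depth=50, dim=64, use_residual=True)
print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3e}")
```

In this toy setup the plain stack's input gradient should come out many orders of magnitude smaller than the residual stack's. The exact numbers depend on depth, width, and initialization, so treat them as illustrative rather than as a benchmark.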



## **Use Cases**

- `Transformers:` used in both encoder and decoder blocks (around attention and feedforward layers).

- `ResNets (CNNs):` pioneered the use of residuals for deep convolutional nets.

- `Graph Neural Networks (GNNs):` help preserve node features across many message-passing hops and reduce over-smoothing.

- `RNN Variants:` some architectures use residuals to stabilize training over long sequences.

- **Basic Residual Connection in PyTorch**

```python
import torch
import torch.nn as nn

class ResidualConnection(nn.Module):
    def __init__(self, size, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, sublayer):
        """
        x: input tensor
        sublayer: function/layer to apply on x
        """
        # Pre-LN form: normalize, apply the sublayer, then add the input back
        return x + self.dropout(sublayer(self.norm(x)))
```

```python
# Usage: example with a feed-forward sublayer
ff = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512)
)

residual = ResidualConnection(size=512)
x = torch.randn(10, 20, 512)  # (batch, seq_len, hidden_dim)
output = residual(x, ff)
print(output.shape)
```

```
torch.Size([10, 20, 512])
```


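```python
class ResidualConnection2(nn.Module):
    """Pre-LN residual block with a built-in feed-forward sublayer."""
    def __init__(self, size, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
        # Build the sublayer from `size` instead of a hardcoded 512,
        # so the block works for any hidden width
        self.ff = nn.Sequential(
            nn.Linear(size, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, size)
        )

    def forward(self, x: torch.Tensor):
        residual = x          # keep the shortcut path
        x = self.norm(x)
        x = self.ff(x)
        x = self.dropout(x)
        return x + residual   # add the input back

res = ResidualConnection2(size=512)
x = torch.randn(10, 20, 512)  # (batch, seq_len, hidden_dim)
output = res(x)
print(output.shape)
```

```
torch.Size([10, 20, 512])
```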
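
Since Transformer blocks place a shortcut around attention as well as the feed-forward layer, the same pattern can be sketched for both sublayers together. The class name `PreNormEncoderBlock` and its default sizes are assumptions made for this illustration. It uses the pre-LN ordering and is not meant to reproduce any particular library's layer:

```python
import torch
import torch.nn as nn

class PreNormEncoderBlock(nn.Module):
    """Hypothetical minimal encoder block with two pre-LN residual sublayers."""
    def __init__(self, d_model=512, nhead=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual 1: around self-attention
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.dropout(attn_out)
        # Residual 2: around the feed-forward network
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

block = PreNormEncoderBlock().eval()  # eval() disables dropout
x = torch.randn(10, 20, 512)          # (batch, seq_len, hidden_dim)
out = block(x)
print(out.shape)  # torch.Size([10, 20, 512])
```

Each sublayer only adds to `x`, so the input representation is carried through the whole block on the shortcut path.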


- **Residual in Transformer Encoder Layer (PyTorch built-in)**

```python
from torch.nn import TransformerEncoderLayer

encoder_layer = TransformerEncoderLayer(d_model=512, nhead=8)
x = torch.randn(10, 32, 512)  # (sequence_len, batch_size, hidden_dim)
output = encoder_layer(x)
print(output.shape)
```

```
torch.Size([10, 32, 512])
```


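- In the preceding code, `TransformerEncoderLayer` applies `residual + layer norm` around both the self-attention and feedforward sublayers. With the default `norm_first=False`, normalization comes after the addition (post-LN); passing `norm_first=True` switches to the pre-LN ordering used by `ResidualConnection` above.

The default (post-LN) flow through each sublayer looks like this:

```
Input x
   │
   ├──────────────(Residual)──────────────┐
   │                                      │
[Self-Attention / FFN]                    │
   │                                      │
  (+) ◄───────────────────────────────────┘
   │
[LayerNorm]
   │
Output = LayerNorm(x + Sublayer(x))
```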
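
To compare the two normalization orderings in the built-in layer, the sketch below constructs one layer with the default settings and one with `norm_first=True` (a constructor flag in recent PyTorch releases). The statistic printed is an illustrative check chosen here: with post-LN, the last operation is a LayerNorm, so each token's output should have a mean close to zero at initialization.

```python
import torch
from torch.nn import TransformerEncoderLayer

torch.manual_seed(0)
x = torch.randn(10, 32, 512)  # (sequence_len, batch_size, hidden_dim)

# Default: post-LN, LayerNorm is applied after each residual addition
post_ln = TransformerEncoderLayer(d_model=512, nhead=8).eval()
# norm_first=True: pre-LN, LayerNorm is applied to each sublayer's input
pre_ln = TransformerEncoderLayer(d_model=512, nhead=8, norm_first=True).eval()

with torch.no_grad():
    out_post = post_ln(x)
    out_pre = pre_ln(x)

# Largest absolute per-token mean over the hidden dimension
post_mean = out_post.mean(dim=-1).abs().max().item()
pre_mean = out_pre.mean(dim=-1).abs().max().item()
print(f"post-LN max |token mean|: {post_mean:.2e}")
print(f"pre-LN  max |token mean|: {pre_mean:.2e}")
```

In the pre-LN layer the raw input stays on the shortcut path, so its output is not normalized. This is why pre-LN stacks usually add one more LayerNorm after the last layer.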

`Shortcut (Residual) connections` are the backbone of `Transformers’` ability to scale deeply while remaining trainable. They allow layers to refine information without destroying it.