# **Positional Embedding in Transformers**

`Positional embeddings` are crucial components in Transformer architectures that provide information about the position of tokens in a sequence. Since Transformers process all tokens in parallel without recurrence or convolution, they lack inherent positional awareness, making positional embeddings essential for understanding sequence order.
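
The lack of positional awareness can be checked numerically. The sketch below is a toy illustration in plain Python rather than PyTorch: it assumes a single attention head with identity query/key/value projections, and the helper names `softmax` and `self_attention` are made up for this example. Reordering the input tokens only reorders the outputs, so without a positional signal the layer cannot tell one ordering from another.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity projections: softmax(X X^T / sqrt(d)) X."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token embeddings
perm = [2, 0, 1]                                  # an arbitrary reordering
shuffled = [tokens[p] for p in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each token receives the same output vector regardless of where it sits.
same = all(
    math.isclose(out_shuffled[i][c], out[perm[i]][c], abs_tol=1e-12)
    for i in range(3) for c in range(2)
)
print(same)
```

Adding any position-dependent vector to `tokens` before the call breaks this symmetry, which is the job of every scheme listed in this document.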


Below are the most widely used positional-encoding families:

- `Sinusoidal (fixed) Positional Embedding`
  
- `Learned (Absolute) Positional Embedding`
  
- `Relative (Shaw / Transformer-XL / T5 style) Positional Embedding`
  
- `Rotary (RoPE) Positional Embedding`
  
- `ALiBi (linear attention bias) Positional Embedding`
  
- `Convolutional / Local Positional Embedding`

## **1. Sinusoidal (Fixed) Positional Embedding**

**Definition**

Sinusoidal positional embeddings are a fixed, non-learned form of absolute positional embedding. Each position index is mapped to a vector of sine and cosine values at geometrically spaced frequencies, which is added to the token embedding. This gives every position a unique, continuous, deterministic signal.


**Mathematical Representation**

For a sequence of length $L$ and embedding dimension $D$, with position $pos \in \{0, \dots, L-1\}$ and pair index $i \in \{0, \dots, D/2 - 1\}$:

`Even dimensions`
$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{\tfrac{2i}{D}}}\right)
$$  

`Odd dimensions`
$$
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{\tfrac{2i}{D}}}\right)
$$ 


**Key Properties**

- Parameter-free, deterministic.

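- Can be computed for any position, including sequence lengths unseen in training, although model quality at those lengths is not guaranteed.

- Encodes relative offsets linearly: for a fixed offset $k$, $PE(pos + k)$ is a linear function (a per-frequency rotation) of $PE(pos)$.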
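
The linear relationship between encodings at different positions follows from the angle-addition identities. As a minimal sketch in plain Python, assuming an even `d_model` (the helper names `sinusoidal_pe` and `shift` are hypothetical), the encoding at `pos + k` can be recovered from the encoding at `pos` by rotating each (sin, cos) pair, without recomputing it from the position:

```python
import math

def sinusoidal_pe(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of one position."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) by rotating each (sin, cos) pair by k * w_i."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))   # frequency of pair i
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(k * w) + c * math.sin(k * w))   # sin(a + b)
        out.append(c * math.cos(k * w) - s * math.sin(k * w))   # cos(a + b)
    return out

d_model, pos, k = 8, 5, 3
direct = sinusoidal_pe(pos + k, d_model)
rotated = shift(sinusoidal_pe(pos, d_model), k, d_model)
max_err = max(abs(a - b) for a, b in zip(direct, rotated))
print(max_err)   # floating-point noise only
```

The rotation in `shift` depends on `k` but not on `pos`, so a linear layer can in principle learn to attend by offset.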


**Use Cases**

- Early Transformers, tasks needing generalization to longer inputs.

In [1]:
import torch

def sinusoidal_positional_encoding(max_len, d_model, device=None):
    """Return a [max_len, d_model] tensor of fixed sinusoidal encodings."""
    device = device or torch.device("cpu")
    pos = torch.arange(max_len, device=device).unsqueeze(1)   # [max_len, 1]
    i = torch.arange(d_model, device=device).unsqueeze(0)     # [1, d_model]
    # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
    angles = pos / (10000 ** ((2 * (i // 2)) / d_model))
    pe = torch.zeros(max_len, d_model, device=device)
    pe[:, 0::2] = torch.sin(angles[:, 0::2])  # even dimensions
    pe[:, 1::2] = torch.cos(angles[:, 1::2])  # odd dimensions
    return pe

## **2. Learned (absolute) Positional Embedding**

Each absolute position gets a trainable embedding vector, like word embeddings. Common in models like BERT, it adapts position signals directly from data.

**Mathematical Representation**

$PE_{pos} \in \mathbb{R}^{D}$ is learned; input becomes $x_{pos} + PE_{pos}$.


**Key Properties**

- Flexible, adapts to data distribution.

- No natural extrapolation beyond training length.

- Simple and widely used.
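
The extrapolation limit is a direct consequence of the lookup-table design. The following is a dependency-free sketch, not a trainable module: `LearnedPositionTable` is a hypothetical stand-in whose rows are random numbers in place of trained weights. Positions inside the table are added to the token embeddings, while a longer sequence has no row to look up.

```python
import random

class LearnedPositionTable:
    """A max_len x d_model table of position vectors (randomly initialised here)."""

    def __init__(self, max_len, d_model, seed=0):
        rng = random.Random(seed)
        self.max_len = max_len
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                      for _ in range(max_len)]

    def add_to(self, token_embeddings):
        """Return x_pos + PE_pos for every position of one sequence."""
        if len(token_embeddings) > self.max_len:
            raise ValueError("sequence is longer than the table; no row exists for these positions")
        return [[x + p for x, p in zip(tok, self.table[pos])]
                for pos, tok in enumerate(token_embeddings)]

table = LearnedPositionTable(max_len=4, d_model=3)
ok = table.add_to([[0.0, 0.0, 0.0]] * 4)      # fits: one row per position
try:
    table.add_to([[0.0, 0.0, 0.0]] * 5)       # position 4 has no learned vector
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
print(too_long_rejected)
```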


**Use Cases**

- Large-scale pretraining (e.g., BERT).

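In [2]:
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Trainable lookup table with one d_model-dimensional vector per position."""

    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: [batch, seq_len, ...]; seq_len must not exceed max_len.
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)
        # Returns [batch, seq_len, d_model]; the caller adds it to the token embeddings.
        return self.pos_emb(positions).expand(x.size(0), -1, -1)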

## **3. Relative Positional Encoding**

Proposed by Shaw et al. (2018) and extended in Transformer-XL and T5, relative positional encoding represents the distance between tokens rather than their absolute positions, which tends to generalize better across sequence lengths.


**Mathematical Representation**

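$e_{ij} = q_i^\top k_j + a_{j-i}$

where $a_{j-i}$ is a learned scalar bias (one per attention head, as in T5) indexed by the relative offset $j - i$, which is usually clipped to a maximum distance. Shaw et al. instead add a learned vector to the key before the dot product.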
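
To make the indexing concrete, the sketch below builds the table index for every query/key pair in plain Python. It assumes the offset is measured as `j - i` and clipped to `max_rel_dist`, the same convention as the `RelativeBias` module in this section; the function name `relative_bucket_indices` is hypothetical.

```python
def relative_bucket_indices(qlen, klen, max_rel_dist):
    """Index into a table of 2 * max_rel_dist + 1 learned biases for every (i, j) pair."""
    rows = []
    for i in range(qlen):
        row = []
        for j in range(klen):
            rel = max(-max_rel_dist, min(max_rel_dist, j - i))  # clip the offset
            row.append(rel + max_rel_dist)                       # shift to a 0-based index
        rows.append(row)
    return rows

idx = relative_bucket_indices(qlen=4, klen=4, max_rel_dist=2)
for row in idx:
    print(row)
# [2, 3, 4, 4]
# [1, 2, 3, 4]
# [0, 1, 2, 3]
# [0, 0, 1, 2]
```

Every diagonal holds a single index, so all pairs at the same offset share one learned parameter, and offsets beyond the clip distance reuse the outermost entries.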


**Key Properties**

- Models relative distance directly.
  
- Improves long-range dependency handling.
  
- Slightly more complex to implement.


**Use Cases**

- Machine translation, long-sequence modeling (Transformer-XL, T5).


In [3]:
import torch
import torch.nn as nn

class RelativeBias(nn.Module):
    """Learned per-head bias indexed by the clipped relative offset j - i."""

    def __init__(self, num_heads, max_rel_dist):
        super().__init__()
        # One bias per head for each offset in [-max_rel_dist, max_rel_dist].
        self.relative_bias = nn.Embedding(2 * max_rel_dist + 1, num_heads)
        self.max_rel_dist = max_rel_dist

    def forward(self, qlen, klen, device=None):
        # Default to the embedding's device so the indices match the weights.
        device = device or self.relative_bias.weight.device
        ctx = torch.arange(qlen, device=device)[:, None]
        mem = torch.arange(klen, device=device)[None, :]
        rel_pos = mem - ctx
        clipped = rel_pos.clamp(-self.max_rel_dist, self.max_rel_dist) + self.max_rel_dist
        bias = self.relative_bias(clipped)
        return bias.permute(2, 0, 1)  # [num_heads, qlen, klen]

## **4. Rotary Positional Embedding (RoPE)**

Introduced in RoFormer (Su et al., 2021), RoPE applies a rotation to `query/key` vectors based on position, encoding relative information multiplicatively. Popular in modern LLMs (e.g., LLaMA).


**Mathematical Representation**

For pair $(x_{2k}, x_{2k+1})$:

$\begin{bmatrix} x'_{2k} \\ x'_{2k+1} \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_{2k} \\ x_{2k+1} \end{bmatrix}$,
with $\theta = pos / 10000^{2k/D}$.


**Key Properties**

- Relative distance encoded in dot product.

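- Defined for any position index, though quality beyond the training length usually depends on additional techniques such as position interpolation.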

- Efficient, widely adopted.
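
The relative-distance property can be verified directly. The sketch below is a plain-Python check, not an optimized implementation; it assumes an even vector length and uses the hypothetical helper names `rope` and `dot`. The same query and key are rotated at two different absolute positions with the same offset, and the attention score is compared.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2k], x[2k+1]) by the angle pos / base**(2k/D)."""
    d = len(x)
    out = []
    for k in range(d // 2):
        theta = pos / (base ** (2 * k / d))
        a, b = x[2 * k], x[2 * k + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart), different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
# Different offset, for contrast.
score_c = dot(rope(q, 5), rope(k, 4))

print(math.isclose(score_a, score_b, abs_tol=1e-9))   # scores agree up to rounding
print(math.isclose(score_a, score_c, abs_tol=1e-6))   # scores differ
```

The absolute rotations cancel in the dot product, leaving a rotation by the offset alone, so no separate relative-position table is required.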


**Use Cases**

- Large-scale autoregressive LLMs (e.g., GPT-NeoX, LLaMA).

In [4]:
import torch

def rotary_angles(seq_len, dim, device=None):
    """Return angles of shape [seq_len, dim // 2]; dim must be even."""
    device = device or torch.device("cpu")
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).float() / dim))
    positions = torch.arange(seq_len, device=device).float()
    return positions[:, None] * inv_freq[None, :]

def apply_rope(x, angles):
    """Rotate each pair (x[2k], x[2k+1]) of x: [batch, seq_len, dim] by its angle."""
    b, n, d = x.shape
    x = x.reshape(b, n, d // 2, 2)  # reshape also accepts non-contiguous inputs
    cos = torch.cos(angles)[None, :, :, None]
    sin = torch.sin(angles)[None, :, :, None]
    x1 = x[..., 0:1] * cos - x[..., 1:2] * sin
    x2 = x[..., 0:1] * sin + x[..., 1:2] * cos
    return torch.cat([x1, x2], dim=-1).reshape(b, n, d)

## **5. ALiBi (Attention with Linear Biases)**

Proposed by Press et al. (2021), `ALiBi` adds a simple linear bias to attention scores based on distance. It encodes recency preference without explicit embeddings, making it scalable to arbitrary lengths.


**Mathematical Representation**

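$$bias_{i,j}^{(h)} = -s_h \cdot (i-j), \quad j \le i$$

where $i$ is the query position, $j$ is the key position, and $s_h > 0$ is a fixed slope for head $h$, so keys farther behind the query receive a larger penalty.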


**Key Properties**

- Extremely lightweight, parameter-efficient.

- Encourages attention to nearby tokens.

- Designed to extrapolate to sequences longer than those seen in training.
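
A small worked example may help here. The sketch below, in plain Python, assumes a causal model in which the penalty grows with how far a key lies behind the query, and a head count that is a power of two. The names `alibi_slopes` and `alibi_row` are hypothetical.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ...; assumes n_heads is a power of two."""
    ratio = 2.0 ** (-8.0 / n_heads)
    return [ratio ** (h + 1) for h in range(n_heads)]

def alibi_row(query_pos, slope):
    """Bias added to the scores of query `query_pos` against keys 0..query_pos."""
    return [slope * (j - query_pos) for j in range(query_pos + 1)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])                   # 0.5 0.00390625
print(alibi_row(query_pos=4, slope=slopes[0])) # [-2.0, -1.5, -1.0, -0.5, 0.0]
```

Heads with a large slope focus almost entirely on recent tokens, while heads with a small slope keep a nearly uniform view of the context.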


**Use Cases**

- GPT-like causal LMs with long context windows.

In [5]:
import torch
import math

def get_alibi_slopes(n_heads):
    # Geometric sequence starting at 2^(-8 / n_heads); exact for power-of-two head counts.
    def get_slopes(n):
        start = 2.0 ** (-2.0 ** -(math.log2(n) - 3))
        return [start * (start ** i) for i in range(n)]
    return torch.tensor(get_slopes(n_heads), dtype=torch.float32)

def alibi_bias(qlen, klen, slopes, device=None):
    device = device or torch.device("cpu")
    q_pos = torch.arange(qlen, device=device)[:, None]
    k_pos = torch.arange(klen, device=device)[None, :]
    # Distance from each query back to earlier keys; future keys get 0 and
    # are expected to be removed by the causal mask.
    rel_dist = (q_pos - k_pos).clamp(min=0)
    return -slopes.to(device)[:, None, None] * rel_dist[None, :, :]  # [n_heads, qlen, klen]

## **6. Convolutional / Local Positional Encodings**

Inspired by CNNs, local positional encodings apply convolutional filters over embeddings to introduce local context. This biases the model toward nearby dependencies, useful in `language`, `speech`, and `vision`.


**Mathematical Representation**

$Y = \mathrm{Conv1D}(X)$ over sequence dimension.


**Key Properties**

- Strong local inductive bias.

- Lightweight and efficient.

- Complements global self-attention.
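
The local inductive bias can be seen in a few lines. The sketch below is a plain-Python depthwise convolution with zero padding, written for illustration only; `depthwise_conv1d` and the kernel values are assumptions, not weights from any model. Perturbing one position changes only the outputs within the kernel window.

```python
def depthwise_conv1d(x, kernels):
    """Per-channel 1D convolution with zero padding.

    x:       [seq_len][channels] nested lists
    kernels: [channels][kernel_size] nested lists, odd kernel_size
    """
    seq_len, channels = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * channels for _ in range(seq_len)]
    for t in range(seq_len):
        for c in range(channels):
            for offset, w in enumerate(kernels[c]):
                src = t + offset - half
                if 0 <= src < seq_len:          # zero padding outside the sequence
                    out[t][c] += w * x[src][c]
    return out

x = [[float(t), 1.0] for t in range(6)]         # channel 1 is constant
kernels = [[0.25, 0.5, 0.25], [1.0, 1.0, 1.0]]
base = depthwise_conv1d(x, kernels)

# Perturb position 0 and record which output positions change.
x_perturbed = [row[:] for row in x]
x_perturbed[0][0] += 10.0
perturbed = depthwise_conv1d(x_perturbed, kernels)
changed = [t for t, (a, b) in enumerate(zip(base, perturbed)) if a != b]

print(changed)                       # [0, 1]
print([row[1] for row in base])      # [2.0, 3.0, 3.0, 3.0, 3.0, 2.0]
```

The second printout shows a side effect of zero padding: a constant input channel yields different values at the sequence edges, which gives the model a weak cue about absolute position.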


**Use Cases**

- Speech recognition (Conformer), hybrid models (ConvBERT).
- Vision Transformer variants that use convolutional position encodings (e.g., CPVT).

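In [6]:
import torch.nn as nn

class ConvPositionalEncoding(nn.Module):
    """Depthwise 1D convolution over the sequence, added back to the input."""

    def __init__(self, d_model, kernel_size=3, groups=None):
        super().__init__()
        # An odd kernel_size keeps the output length equal to the input length.
        groups = groups or d_model  # groups=d_model gives a depthwise convolution
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=(kernel_size-1)//2, groups=groups)
        self.act = nn.GELU()

    def forward(self, x):
        # x: [batch, seq_len, d_model]; Conv1d expects [batch, d_model, seq_len].
        y = self.conv(x.transpose(1, 2))
        return x + self.act(y.transpose(1, 2))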

## **Summary (Trade-offs)**

- `Sinusoidal` → No parameters, defined for any sequence length.

- `Learned absolute` → Flexible but no extrapolation.

- `Relative` → Models distances, better long-sequence handling.

- `RoPE` → Encodes relative phase, efficient, widely used in LLMs.

- `ALiBi` → Simple linear bias, scalable to very long contexts.

- `Convolutional` → Local bias, strong for speech/vision.