# Positional Encodings: Mathematical Derivations

This notebook derives the mathematics behind positional encoding schemes used in transformers,
explaining why they work and how to implement them.

## Glossary of Terms

| Term | Definition |
|------|------------|
| **Positional encoding** | Information added to token embeddings to indicate where each token is in the sequence. Without this, transformers can't distinguish word order. |
| **Position** | An integer index (0, 1, 2, ...) indicating where a token appears in the sequence. Also called "pos" in formulas. |
| **Sequence length (T)** | The number of tokens in a sequence. Positional encodings must handle sequences up to some maximum length $T_{max}$. |
| **Model dimension (d)** | The size of token embeddings and hidden states, also called $d_{model}$. Each position gets a $d$-dimensional encoding. |
| **Frequency/Wavelength** | In sinusoidal encodings, different dimensions oscillate at different frequencies. Low dimensions change fast (short wavelength), high dimensions change slowly (long wavelength). |
| **Angular frequency ($\omega_i$)** | $\omega_i = 1/10000^{2i/d}$. Controls how fast dimension pair $i$ oscillates with position. |
| **Rotation matrix** | A matrix $R_\theta$ that rotates 2D vectors by angle $\theta$ while preserving their length. Used in RoPE. |
| **Relative position** | The difference $(m - n)$ between two positions. RoPE encodes this directly; sinusoidal supports it via linear transforms. |
| **Absolute position** | The actual index of a token (0, 1, 2, ...). Learned embeddings encode this directly. |
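| **Extrapolation** | Using the model on sequences longer than those seen during training. Sinusoidal encodings and RoPE are defined for any position, so they can at least be evaluated there (quality is not guaranteed); a learned embedding table has no rows beyond $T_{max}$. |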
| **Embedding matrix** | For learned positional embeddings, a matrix $W \in \mathbb{R}^{T_{max} \times d}$ where row $i$ is the embedding for position $i$. |
| **Permutation equivariance** | The property that reordering inputs just reorders outputs (same values, different positions). Attention is permutation equivariant without positional info. |
| **RoPE (Rotary Position Embedding)** | A method that encodes position by rotating query and key vectors, so that attention scores depend on position only through the relative offset between query and key. |
| **NTK-aware scaling** | A technique to help RoPE extrapolate to longer sequences by adjusting rotation frequencies. |

## Formulas and Theorems

### Sinusoidal Positional Encoding (Vaswani et al., 2017)

| Formula | Description |
|---------|-------------|
| $PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$ | Even dimensions |
| $PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$ | Odd dimensions |
| $\omega_i = \frac{1}{10000^{2i/d}}$ | Angular frequency for dimension pair $i$ |
| $PE(pos + k) = M_k \cdot PE(pos)$ | Linear transformation property (exact; $M_k$ depends on the offset $k$ but not on $pos$) |

### Rotary Position Embedding (RoPE) (Su et al., 2021)

| Formula | Description |
|---------|-------------|
| $\theta_i = 10000^{-2i/d}$ | Rotation frequency for dimension pair $i$ |
| $R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$ | 2D rotation matrix |
| $(q'_0, q'_1) = (q_0\cos(m\theta) - q_1\sin(m\theta), q_0\sin(m\theta) + q_1\cos(m\theta))$ | Rotate query at position $m$ |
| $q'_m \cdot k'_n = f(q, k, m-n)$ | Dot product depends only on relative position |

### Learned Positional Embeddings

| Formula | Description |
|---------|-------------|
| $PE(pos) = W_{pos}$ | Direct lookup from embedding matrix $W \in \mathbb{R}^{T_{max} \times d}$ |
| $\frac{\partial L}{\partial W_{pos}} = \frac{\partial L}{\partial PE(pos)}$ | Gradient is upstream gradient |

### Key Trigonometric Identities

| Identity | Description |
|----------|-------------|
| $\sin(a+b) = \sin a \cos b + \cos a \sin b$ | Sine addition |
| $\cos(a+b) = \cos a \cos b - \sin a \sin b$ | Cosine addition |
| $\cos a \cos b + \sin a \sin b = \cos(a - b)$ | Dot product of rotation vectors |
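The third identity is what makes the dot product of two sinusoidal encodings depend only on the offset between positions: each $(\sin, \cos)$ pair contributes $\cos(\omega_i (p - q))$. The sketch below checks this numerically. The helper `pe_vector` is a hypothetical name introduced for illustration, and it assumes the same frequencies $\omega_i = 1/10000^{2i/d}$ used throughout this notebook.

```python
import numpy as np

def pe_vector(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a single position (illustrative helper)."""
    i = np.arange(d_model // 2)
    omega = 1.0 / (base ** (2 * i / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(omega * pos)  # even dims: sin
    pe[1::2] = np.cos(omega * pos)  # odd dims: cos
    return pe, omega

d_model = 16
p, q = 37, 12
pe_p, omega = pe_vector(p, d_model)
pe_q, _ = pe_vector(q, d_model)

dot = pe_p @ pe_q                            # sum of sin*sin + cos*cos per pair
closed_form = np.cos(omega * (p - q)).sum()  # sum of cos(omega_i * (p - q))
print(dot, closed_form)

# Shifting both positions by the same amount leaves the dot product unchanged
dot_shifted = pe_vector(p + 100, d_model)[0] @ pe_vector(q + 100, d_model)[0]
```

The two printed numbers should agree up to floating-point error, and `dot_shifted` should match them as well.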

## Prerequisites

This notebook assumes familiarity with:

### 1. Trigonometric Functions

The sine and cosine functions:
- $\sin(\theta)$ and $\cos(\theta)$ are periodic with period $2\pi$
- $\sin^2(\theta) + \cos^2(\theta) = 1$
- $\sin(0) = 0$, $\cos(0) = 1$
- Sines and cosines at integer multiples of a base frequency are mutually orthogonal over one period, so they form a basis for representing periodic signals

### 2. Rotation Matrices

A 2D rotation by angle $\theta$ is represented by:
$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

Properties:
- $R_\theta R_\phi = R_{\theta + \phi}$ (composition = addition of angles)
- $R_\theta^T = R_{-\theta} = R_\theta^{-1}$ (orthogonal matrix)
- $\det(R_\theta) = 1$ (preserves orientation and area; lengths are preserved because $R_\theta$ is orthogonal)
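These properties are easy to check numerically. The following is a minimal sketch; the helper name `rot` and the test angles are arbitrary choices made for illustration.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta, phi = 0.7, -1.9
v = np.array([3.0, 4.0])  # length 5

composition_ok = np.allclose(rot(theta) @ rot(phi), rot(theta + phi))
inverse_ok = np.allclose(rot(theta).T @ rot(theta), np.eye(2))
det_ok = np.isclose(np.linalg.det(rot(theta)), 1.0)
norm_after = np.linalg.norm(rot(theta) @ v)

print(composition_ok, inverse_ok, det_ok, norm_after)
```

The composition property is the one RoPE relies on later: rotating by one angle and then undoing another leaves a rotation by the difference.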

### 3. Fourier Series Intuition

Any sufficiently well-behaved periodic function can be represented as a sum of sines and cosines:
$$f(x) = a_0 + \sum_{n=1}^{\infty} \left(a_n \cos(n\omega x) + b_n \sin(n\omega x)\right)$$

Different frequencies capture different scales of variation:
- Low frequencies: slow, global patterns
- High frequencies: fast, local patterns

### 4. Why Position Information Matters

Attention is **permutation equivariant**: swapping input positions just swaps output positions.
Without positional information, the model can't distinguish "dog bites man" from "man bites dog".

We need to inject position information so the model knows where each token is in the sequence.
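To make the equivariance claim concrete, here is a small sketch of unmasked single-head self-attention with no positional information. The function names, weight matrices, and shapes are assumptions chosen for illustration. Permuting the input rows permutes the output rows in exactly the same way, so the layer itself carries no notion of order. (A causal mask would break this symmetry, which is why the sketch uses none.)

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Unmasked single-head self-attention on a (T, d) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Reordering the inputs only reorders the outputs
equivariant = np.allclose(out_perm, out[perm])
print(equivariant)
```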

In [None]:
import numpy as np
import matplotlib.pyplot as plt
np.set_printoptions(precision=4, suppress=True)

---

## Part 1: Sinusoidal Positional Encoding

### The Design

The original transformer (Vaswani et al., 2017) uses fixed sinusoidal encodings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Where:
- $pos$ is the position in the sequence (0, 1, 2, ...)
- $i$ is the dimension index (0, 1, ..., d/2-1)
- $d$ is the model dimension

### Understanding the Frequencies

Define the angular frequency for dimension pair $i$:
$$\omega_i = \frac{1}{10000^{2i/d}}$$

Then:
$$PE_{(pos, 2i)} = \sin(\omega_i \cdot pos)$$
$$PE_{(pos, 2i+1)} = \cos(\omega_i \cdot pos)$$

The wavelengths (periods) $2\pi/\omega_i$ grow geometrically from $2\pi$ (for $i=0$) to $2\pi \cdot 10000^{(d-2)/d}$ (for $i=d/2-1$), which approaches but never reaches $2\pi \cdot 10000$.

**Intuition**: Different dimension pairs encode position at different "resolutions":
- Low $i$ (high frequency): Changes rapidly with position - distinguishes nearby positions
- High $i$ (low frequency): Changes slowly - captures long-range structure
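The range of resolutions can be tabulated directly from the frequency definition. This is a minimal sketch assuming $d = 64$ and base 10000. Note that the longest wavelength stays strictly below $2\pi \cdot 10000$ because the largest exponent is $(d-2)/d$, not 1.

```python
import numpy as np

d_model = 64
i = np.arange(d_model // 2)
omega = 1.0 / (10000 ** (2 * i / d_model))  # angular frequency per pair
wavelength = 2 * np.pi / omega              # positions per full cycle

print(f"shortest wavelength (i=0):      {wavelength[0]:.3f}")
print(f"longest wavelength (i=d/2-1):   {wavelength[-1]:.1f}")
print(f"upper bound 2*pi*10000:         {2 * np.pi * 10000:.1f}")
```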

In [None]:
def sinusoidal_encoding(max_len, d_model):
    """Compute sinusoidal positional encodings.
    
    Args:
        max_len: Maximum sequence length
        d_model: Model dimension (must be even)
    
    Returns:
        PE: Positional encoding matrix (max_len, d_model)
    """
    pos = np.arange(max_len)[:, None]  # (T, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d/2)
    
    # Compute frequencies: omega_i = 1 / 10000^(2i/d)
    omega = 1.0 / (10000 ** (2 * i / d_model))  # (1, d/2)
    
    # Compute angles
    angles = pos * omega  # (T, d/2) - broadcasting
    
    # Interleave sin and cos
    PE = np.zeros((max_len, d_model))
    PE[:, 0::2] = np.sin(angles)  # Even indices: sin
    PE[:, 1::2] = np.cos(angles)  # Odd indices: cos
    
    return PE

# Example
T, d = 100, 64
PE = sinusoidal_encoding(T, d)
print(f"Positional encoding shape: {PE.shape}")
print(f"\nFirst few positions, first 8 dims:")
print(PE[:5, :8])

### Visualizing the Encoding

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Full encoding heatmap
im = axes[0, 0].imshow(PE.T, aspect='auto', cmap='RdBu', vmin=-1, vmax=1)
axes[0, 0].set_xlabel('Position')
axes[0, 0].set_ylabel('Dimension')
axes[0, 0].set_title('Sinusoidal Positional Encoding')
plt.colorbar(im, ax=axes[0, 0])

# Individual dimensions
positions = np.arange(T)
for dim in [0, 1, 10, 11, 30, 31]:
    axes[0, 1].plot(positions, PE[:, dim], label=f'dim {dim}', alpha=0.7)
axes[0, 1].set_xlabel('Position')
axes[0, 1].set_ylabel('Value')
axes[0, 1].set_title('Individual Dimensions (different frequencies)')
axes[0, 1].legend(fontsize=8)
axes[0, 1].set_xlim(0, 50)

# Wavelengths
i_vals = np.arange(d // 2)
wavelengths = 2 * np.pi * (10000 ** (2 * i_vals / d))
axes[1, 0].semilogy(2 * i_vals, wavelengths)
axes[1, 0].set_xlabel('Dimension index')
axes[1, 0].set_ylabel('Wavelength')
axes[1, 0].set_title('Wavelength per dimension pair (log scale)')
axes[1, 0].axhline(y=2*np.pi, color='r', linestyle='--', alpha=0.5, label='2π')
axes[1, 0].legend()

# Dot product similarity
dots = PE @ PE.T  # (T, T)
im = axes[1, 1].imshow(dots[:50, :50], cmap='RdBu')
axes[1, 1].set_xlabel('Position j')
axes[1, 1].set_ylabel('Position i')
axes[1, 1].set_title('PE(i) · PE(j) - Position similarity')
plt.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.show()

### The Relative Position Property

A key insight from the original paper: for any fixed offset $k$, there exists a linear transformation $M_k$ such that:

$$PE(pos + k) = M_k \cdot PE(pos)$$

**Proof**: Consider a single dimension pair $(2i, 2i+1)$ with frequency $\omega$:

$$PE(pos) = \begin{pmatrix} \sin(\omega \cdot pos) \\ \cos(\omega \cdot pos) \end{pmatrix}$$

$$PE(pos+k) = \begin{pmatrix} \sin(\omega(pos+k)) \\ \cos(\omega(pos+k)) \end{pmatrix}$$

Using the angle addition formulas:
$$\sin(a+b) = \sin a \cos b + \cos a \sin b$$
$$\cos(a+b) = \cos a \cos b - \sin a \sin b$$

We get:
$$PE(pos+k) = \begin{pmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{pmatrix} \begin{pmatrix} \sin(\omega \cdot pos) \\ \cos(\omega \cdot pos) \end{pmatrix}$$

This is a rotation matrix! The full transformation $M_k$ is block-diagonal with these rotation blocks.
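Written out row by row, with $a = \omega \cdot pos$ and $b = \omega k$ in the addition formulas, the matrix product is:

$$\sin(\omega \cdot pos + \omega k) = \cos(\omega k)\,\sin(\omega \cdot pos) + \sin(\omega k)\,\cos(\omega \cdot pos)$$
$$\cos(\omega \cdot pos + \omega k) = -\sin(\omega k)\,\sin(\omega \cdot pos) + \cos(\omega k)\,\cos(\omega \cdot pos)$$

The coefficients depend on $k$ and $\omega$ but not on $pos$, which is why a single matrix $M_k$ works for every position.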

In [None]:
def relative_position_matrix(k, d_model):
    """Compute the linear transformation M_k such that PE(pos+k) = M_k @ PE(pos)."""
    M = np.zeros((d_model, d_model))
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        c, s = np.cos(omega * k), np.sin(omega * k)
        # 2x2 rotation block for dimensions 2i, 2i+1
        M[2*i, 2*i] = c
        M[2*i, 2*i+1] = s
        M[2*i+1, 2*i] = -s
        M[2*i+1, 2*i+1] = c
    return M

# Verify the property
pos = 10
k = 5
d = 32

PE = sinusoidal_encoding(50, d)
M_k = relative_position_matrix(k, d)

# PE(pos+k) should equal M_k @ PE(pos)
pe_direct = PE[pos + k]
pe_transformed = M_k @ PE[pos]

print(f"PE(pos+k) directly:    {pe_direct[:8]}")
print(f"M_k @ PE(pos):         {pe_transformed[:8]}")
print(f"Max difference: {np.abs(pe_direct - pe_transformed).max():.2e}")

---

## Part 2: Learned Positional Embeddings

### The Simplest Approach

Instead of using a fixed formula, we can learn position embeddings as parameters:

$$PE(pos) = W_{pos,:}$$

Where $W \in \mathbb{R}^{T_{max} \times d}$ is a learnable embedding matrix.

### Forward Pass

Simply look up the row corresponding to each position:
$$PE(pos) = W[pos]$$

### Backward Pass

The gradient flows directly to the corresponding row:
$$\frac{\partial L}{\partial W[pos]} = \frac{\partial L}{\partial PE(pos)}$$

If a position appears multiple times in a batch, gradients accumulate.
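Accumulation with repeated indices is a common source of bugs in NumPy, because in-place fancy-index addition is buffered and counts each duplicate only once. The sketch below contrasts it with `np.add.at`, which performs unbuffered accumulation. The array shapes and values are arbitrary and chosen for illustration.

```python
import numpy as np

positions = np.array([0, 2, 2, 2])  # position 2 is looked up three times
dPE = np.ones((4, 3))               # upstream gradient, one row per lookup

# Buffered in-place add: duplicate indices overwrite rather than sum
grad_buffered = np.zeros((4, 3))
grad_buffered[positions] += dPE

# Unbuffered add: every occurrence contributes
grad_accum = np.zeros((4, 3))
np.add.at(grad_accum, positions, dPE)

print(grad_buffered[2])  # only one contribution survives
print(grad_accum[2])     # all three contributions are summed
```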

In [None]:
class LearnedPositionalEmbedding:
    """Learned positional embeddings with forward and backward passes."""
    
    def __init__(self, max_len, d_model, seed=42):
        rng = np.random.default_rng(seed)
        # Initialize with small random values
        self.W = rng.normal(0, 0.02, (max_len, d_model)).astype(np.float32)
        self.gradW = np.zeros_like(self.W)
        self._positions = None
    
    def forward(self, positions):
        """Look up embeddings for given positions.
        
        Args:
            positions: Array of position indices
        
        Returns:
            PE: Positional embeddings for each position
        """
        self._positions = positions
        return self.W[positions]
    
    def backward(self, dPE):
        """Accumulate gradients.
        
        Args:
            dPE: Upstream gradient (same shape as forward output)
        """
        # Each position accumulates its gradient
        np.add.at(self.gradW, self._positions, dPE)
    
    def step(self, lr=0.01):
        """SGD update."""
        self.W -= lr * self.gradW
        self.gradW.fill(0)

# Demo
max_len, d = 100, 16
lpe = LearnedPositionalEmbedding(max_len, d)

# Forward: get embeddings for positions 0, 1, 2, 3, 4
positions = np.array([0, 1, 2, 3, 4])
pe = lpe.forward(positions)
print(f"Embeddings shape: {pe.shape}")
print(f"\nLearned embeddings (first 8 dims):")
print(pe[:, :8])

### Comparing Learned vs Sinusoidal

| Aspect | Sinusoidal | Learned |
|--------|------------|----------|
| Parameters | None | $T_{max} \times d$ |
| Extrapolation | Defined for any position (quality on unseen lengths is not guaranteed) | No embedding exists beyond $T_{max}$ |
| Relative positions | Built-in via linear transformation | Must be learned |
| Flexibility | Fixed pattern | Can adapt to task |
| Common usage | Original Transformer | GPT-2, BERT |
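The extrapolation row can be illustrated mechanically. In this sketch, `W` is a random stand-in for a trained table and `sinusoidal_at` is a hypothetical helper, both introduced only for illustration. A lookup past the last row fails outright, while the sinusoidal formula can still be evaluated. Being computable does not by itself mean a trained model handles such positions well.

```python
import numpy as np

T_max, d_model = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (T_max, d_model))  # stand-in for a learned table

def sinusoidal_at(pos, d_model, base=10000.0):
    """Sinusoidal encoding for one position, computed from the formula."""
    i = np.arange(d_model // 2)
    omega = 1.0 / (base ** (2 * i / d_model))
    pe = np.empty(d_model)
    pe[0::2] = np.sin(omega * pos)
    pe[1::2] = np.cos(omega * pos)
    return pe

pos = T_max + 5  # a position the table has no row for

try:
    W[pos]
    lookup_failed = False
except IndexError:
    lookup_failed = True

pe = sinusoidal_at(pos, d_model)  # still well defined
print(lookup_failed, pe)
```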

---

## Part 3: Rotary Position Embeddings (RoPE)

RoPE (Su et al., 2021) takes a fundamentally different approach: instead of **adding** position information, it **rotates** the query and key vectors based on their position.

### Key Insight

The attention score between query at position $m$ and key at position $n$ is:
$$\text{score} = q_m^T k_n$$

If we rotate both $q$ and $k$ based on their positions:
$$\text{score} = (R_m q)^T (R_n k) = q^T R_m^T R_n k = q^T R_{n-m} k$$

The score now depends on position only through the **relative position** $(n - m)$!
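The matrix identity $R_m^T R_n = R_{n-m}$ can be checked directly in 2D. This is a minimal sketch with an arbitrary angle `theta` and hand-picked vectors, all of which are illustrative assumptions.

```python
import numpy as np

def rot(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.3
q = np.array([1.0, 0.5])
k = np.array([0.2, -0.1])

def score(m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return (rot(m * theta) @ q) @ (rot(n * theta) @ k)

identity_ok = np.allclose(rot(5 * theta).T @ rot(3 * theta), rot((3 - 5) * theta))

s_a = score(5, 3)                    # offset n - m = -2
s_b = score(105, 103)                # same offset, shifted by 100
s_rel = q @ rot((3 - 5) * theta) @ k  # closed form using only the offset

print(identity_ok, s_a, s_b, s_rel)
```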

### The Rotation

For a 2D vector $(x_0, x_1)$ at position $m$ with rotation angle $\theta$:

$$\begin{pmatrix} x'_0 \\ x'_1 \end{pmatrix} = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \end{pmatrix}$$

Expanding:
$$x'_0 = x_0 \cos(m\theta) - x_1 \sin(m\theta)$$
$$x'_1 = x_0 \sin(m\theta) + x_1 \cos(m\theta)$$
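An equivalent way to read these two equations is as complex multiplication: treating $(x_0, x_1)$ as $x_0 + i x_1$ and multiplying by $e^{i m\theta}$ gives the same result. The sketch below checks this with arbitrary illustrative values.

```python
import numpy as np

m, theta = 7, 0.25
x0, x1 = 0.6, -1.3

# Matrix form, as in the expanded equations
x0_rot = x0 * np.cos(m * theta) - x1 * np.sin(m * theta)
x1_rot = x0 * np.sin(m * theta) + x1 * np.cos(m * theta)

# Complex form: (x0 + i*x1) * exp(i * m * theta)
z_rot = complex(x0, x1) * np.exp(1j * m * theta)

print(x0_rot, x1_rot)
print(z_rot.real, z_rot.imag)
```

Some implementations use the complex form because it expresses the rotation as a single element-wise multiply.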

### Multi-dimensional Extension

For a $d$-dimensional vector, we apply different rotations to each pair of dimensions:
- Dimensions $(0, 1)$ rotate with frequency $\theta_0$
- Dimensions $(2, 3)$ rotate with frequency $\theta_1$
- ...
- Dimensions $(d-2, d-1)$ rotate with frequency $\theta_{d/2-1}$

The frequencies follow the same pattern as sinusoidal:
$$\theta_i = 10000^{-2i/d}$$
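Conceptually, the per-pair rotations form one block-diagonal $d \times d$ matrix. The sketch below builds that dense matrix and confirms it matches the cheaper element-wise form on (even, odd) pairs. The helper `rope_matrix` is a hypothetical name used for illustration only; the dense matrix is mostly zeros, so building it is wasteful in practice.

```python
import numpy as np

def rope_matrix(m, d, base=10000.0):
    """Dense (d, d) block-diagonal rotation for position m (illustration only)."""
    R = np.zeros((d, d))
    for i in range(d // 2):
        angle = m * base ** (-2 * i / d)
        c, s = np.cos(angle), np.sin(angle)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

d, m = 8, 11
rng = np.random.default_rng(0)
x = rng.normal(size=d)

dense = rope_matrix(m, d) @ x

# Equivalent element-wise form: rotate each (even, odd) pair by its own angle
angles = m * 10000.0 ** (-2 * np.arange(d // 2) / d)
pairwise = np.empty(d)
pairwise[0::2] = x[0::2] * np.cos(angles) - x[1::2] * np.sin(angles)
pairwise[1::2] = x[0::2] * np.sin(angles) + x[1::2] * np.cos(angles)

print(np.allclose(dense, pairwise))
```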

In [None]:
class RotaryPositionalEmbedding:
    """Rotary Position Embeddings (RoPE)."""
    
    def __init__(self, d_head, max_len=1024, base=10000.0):
        assert d_head % 2 == 0, "d_head must be even"
        self.d_head = d_head
        self.base = base
        
        # Compute rotation frequencies: theta_i = base^(-2i/d)
        i = np.arange(0, d_head, 2, dtype=np.float32)
        self.inv_freq = 1.0 / (base ** (i / d_head))  # (d/2,)
        
        # Precompute sin/cos for all positions
        pos = np.arange(max_len, dtype=np.float32)[:, None]  # (T, 1)
        angles = pos * self.inv_freq[None, :]  # (T, d/2)
        self.cos_cache = np.cos(angles)  # (T, d/2)
        self.sin_cache = np.sin(angles)  # (T, d/2)
    
    def apply_rotation(self, x, cos, sin):
        """Apply rotation to x using precomputed cos/sin.
        
        Args:
            x: Input tensor (..., T, d)
            cos: Cosine values (T, d/2)
            sin: Sine values (T, d/2)
        
        Returns:
            Rotated tensor, same shape as x
        """
        # Split into even/odd pairs
        x_even = x[..., 0::2]  # (..., T, d/2)
        x_odd = x[..., 1::2]   # (..., T, d/2)
        
        # Apply 2D rotation to each pair
        x_rot_even = x_even * cos - x_odd * sin
        x_rot_odd = x_even * sin + x_odd * cos
        
        # Interleave back
        x_rot = np.empty_like(x)
        x_rot[..., 0::2] = x_rot_even
        x_rot[..., 1::2] = x_rot_odd
        
        return x_rot
    
    def forward(self, q, k, offset=0):
        """Apply RoPE to query and key tensors.
        
        Args:
            q: Query tensor (..., T, d)
            k: Key tensor (..., T, d)
            offset: Position offset (for KV-cache)
        
        Returns:
            q_rot, k_rot: Rotated tensors
        """
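        T = q.shape[-2]
        # Guard against slicing past the cache, which would silently
        # return fewer rows (and could broadcast a single row to all positions)
        if offset + T > self.cos_cache.shape[0]:
            raise ValueError("offset + T exceeds the precomputed max_len")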
        cos = self.cos_cache[offset:offset+T]  # (T, d/2)
        sin = self.sin_cache[offset:offset+T]  # (T, d/2)
        
        q_rot = self.apply_rotation(q, cos, sin)
        k_rot = self.apply_rotation(k, cos, sin)
        
        return q_rot, k_rot

# Demo
d_head = 8
rope = RotaryPositionalEmbedding(d_head, max_len=100)

# Create sample q, k at different positions
T = 5
q = np.random.randn(T, d_head)
k = np.random.randn(T, d_head)

q_rot, k_rot = rope.forward(q, k)

# Position 0 has rotation angle 0 (identity), so inspect the last position instead
print(f"Original q[4]: {q[4]}")
print(f"Rotated q[4]:  {q_rot[4]}")
print(f"\nNorm preserved: {np.linalg.norm(q[4]):.4f} -> {np.linalg.norm(q_rot[4]):.4f}")

### Proving the Relative Position Property

Let's verify that the rotated score $(q'_m)^T k'_n$ depends on the positions only through the offset $(n - m)$.

For a single dimension pair with frequency $\theta$:

$$q'_m = R(m\theta) q, \quad k'_n = R(n\theta) k$$

$$(q'_m)^T k'_n = q^T R(m\theta)^T R(n\theta) k = q^T R(-m\theta) R(n\theta) k = q^T R((n-m)\theta) k$$

The attention score is a function of $q$, $k$, and $(n-m)$ only!
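Expanding the final expression for a single pair, with $\Delta = (n-m)\theta$, $q = (q_0, q_1)$ and $k = (k_0, k_1)$, makes the dependence explicit:

$$q^T R(\Delta)\, k = (q_0 k_0 + q_1 k_1)\cos\Delta + (q_1 k_0 - q_0 k_1)\sin\Delta$$

The positions $m$ and $n$ appear only inside $\Delta$. At $\Delta = 0$ the expression reduces to the ordinary dot product $q^T k$.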

In [None]:
# Verify relative position property
d_head = 4
rope = RotaryPositionalEmbedding(d_head, max_len=100)

# Same q and k vectors
q_vec = np.array([[1.0, 0.5, -0.3, 0.8]])
k_vec = np.array([[0.2, -0.1, 0.7, 0.4]])

# Test: score(q at pos 5, k at pos 3) should equal score(q at pos 10, k at pos 8)
# Both have relative position = 2

# Scenario 1: q at position 5, k at position 3
q_at_5 = rope.apply_rotation(q_vec, rope.cos_cache[5:6], rope.sin_cache[5:6])
k_at_3 = rope.apply_rotation(k_vec, rope.cos_cache[3:4], rope.sin_cache[3:4])
score_1 = (q_at_5 @ k_at_3.T)[0, 0]

# Scenario 2: q at position 10, k at position 8 (same relative position = 2)
q_at_10 = rope.apply_rotation(q_vec, rope.cos_cache[10:11], rope.sin_cache[10:11])
k_at_8 = rope.apply_rotation(k_vec, rope.cos_cache[8:9], rope.sin_cache[8:9])
score_2 = (q_at_10 @ k_at_8.T)[0, 0]

# Scenario 3: q at position 50, k at position 48 (same relative position = 2)
q_at_50 = rope.apply_rotation(q_vec, rope.cos_cache[50:51], rope.sin_cache[50:51])
k_at_48 = rope.apply_rotation(k_vec, rope.cos_cache[48:49], rope.sin_cache[48:49])
score_3 = (q_at_50 @ k_at_48.T)[0, 0]

print("Attention scores with same relative position (m - n = 2):")
print(f"  q@5, k@3:   {score_1:.6f}")
print(f"  q@10, k@8:  {score_2:.6f}")
print(f"  q@50, k@48: {score_3:.6f}")
print(f"\nAll equal: {np.allclose([score_1, score_2, score_3], score_1)}")

### Visualizing RoPE Rotations

In [None]:
# Visualize how a vector rotates with position
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

d_head = 64
rope = RotaryPositionalEmbedding(d_head, max_len=100)

# Same vector at different positions
v = np.ones((1, d_head)) / np.sqrt(d_head)
positions = [0, 5, 10, 20, 50]

# Plot first two dimensions (fastest rotation)
for pos in positions:
    v_rot = rope.apply_rotation(v, rope.cos_cache[pos:pos+1], rope.sin_cache[pos:pos+1])
    axes[0].scatter(v_rot[0, 0], v_rot[0, 1], s=100, label=f'pos={pos}')
axes[0].set_xlabel('Dim 0')
axes[0].set_ylabel('Dim 1')
axes[0].set_title('Fast rotation (dims 0-1)')
axes[0].legend()
axes[0].axis('equal')

# Plot middle dimensions (medium rotation)
for pos in positions:
    v_rot = rope.apply_rotation(v, rope.cos_cache[pos:pos+1], rope.sin_cache[pos:pos+1])
    axes[1].scatter(v_rot[0, 30], v_rot[0, 31], s=100, label=f'pos={pos}')
axes[1].set_xlabel('Dim 30')
axes[1].set_ylabel('Dim 31')
axes[1].set_title('Medium rotation (dims 30-31)')
axes[1].legend()
axes[1].axis('equal')

# Plot last dimensions (slow rotation)
for pos in positions:
    v_rot = rope.apply_rotation(v, rope.cos_cache[pos:pos+1], rope.sin_cache[pos:pos+1])
    axes[2].scatter(v_rot[0, 62], v_rot[0, 63], s=100, label=f'pos={pos}')
axes[2].set_xlabel('Dim 62')
axes[2].set_ylabel('Dim 63')
axes[2].set_title('Slow rotation (dims 62-63)')
axes[2].legend()
axes[2].axis('equal')

plt.tight_layout()
plt.show()

---

## Part 4: Comparison and Summary

### Comparison Table

| Feature | Sinusoidal | Learned | RoPE |
|---------|------------|---------|------|
| Parameters | 0 | $T_{max} \times d$ | 0 |
| Operation | Add to embeddings | Add to embeddings | Rotate Q, K |
| Relative positions | Via linear transform | Implicit (learned) | Explicit (in dot product) |
| Length extrapolation | Defined for any length | Not possible beyond $T_{max}$ | Defined for any length; often improved with NTK-aware scaling |
| Used in | Original Transformer | GPT-2, BERT | LLaMA, Mistral, GPT-NeoX |
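This notebook does not derive NTK-aware scaling, so the following is only a sketch of one commonly cited form, stated as an assumption: the RoPE base is enlarged to $base' = base \cdot s^{d/(d-2)}$ for a context extension factor $s$. The function names `rope_inv_freq` and `ntk_scaled_base` are hypothetical. Under this rule the fastest-rotating pair is left unchanged while the slowest pair is slowed by exactly $1/s$.

```python
import numpy as np

def rope_inv_freq(d_head, base=10000.0):
    """RoPE rotation frequencies theta_i = base^(-2i/d)."""
    i = np.arange(0, d_head, 2, dtype=np.float64)
    return 1.0 / (base ** (i / d_head))

def ntk_scaled_base(base, scale, d_head):
    """Assumed NTK-aware rule: base' = base * scale^(d / (d - 2))."""
    return base * scale ** (d_head / (d_head - 2))

d_head, scale = 64, 4.0
freq = rope_inv_freq(d_head)
freq_ntk = rope_inv_freq(d_head, ntk_scaled_base(10000.0, scale, d_head))

ratio = freq_ntk / freq
print(f"fastest pair ratio: {ratio[0]:.4f}")
print(f"slowest pair ratio: {ratio[-1]:.4f}  (1/scale = {1 / scale:.4f})")
```

Intermediate pairs are scaled by amounts between these two extremes, so high-frequency (local) information is mostly preserved while long wavelengths are stretched.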

In [None]:
# Final comparison: how position similarity decays with distance
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

T, d = 100, 64

# Sinusoidal
PE_sin = sinusoidal_encoding(T, d)
sim_sin = PE_sin @ PE_sin[0]  # Similarity of each position to position 0
axes[0].plot(sim_sin)
axes[0].set_xlabel('Position')
axes[0].set_ylabel('Similarity to position 0')
axes[0].set_title('Sinusoidal: PE(pos) · PE(0)')
axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)

# Learned (random init)
lpe = LearnedPositionalEmbedding(T, d)
PE_learn = lpe.W
sim_learn = PE_learn @ PE_learn[0]
axes[1].plot(sim_learn)
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Similarity to position 0')
axes[1].set_title('Learned (random init): PE(pos) · PE(0)')
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)

# RoPE: show how attention score varies with relative position
rope = RotaryPositionalEmbedding(d, max_len=T)
q = np.random.randn(1, d)  # Fixed query vector
k = np.random.randn(1, d)  # Fixed key vector

scores = []
for rel_pos in range(T):
    q_rot = rope.apply_rotation(q, rope.cos_cache[rel_pos:rel_pos+1], rope.sin_cache[rel_pos:rel_pos+1])
    k_rot = rope.apply_rotation(k, rope.cos_cache[0:1], rope.sin_cache[0:1])
    scores.append((q_rot @ k_rot.T)[0, 0])

axes[2].plot(scores)
axes[2].set_xlabel('Relative position (m - n)')
axes[2].set_ylabel('Attention score')
axes[2].set_title('RoPE: q_m · k_0 (fixed q, k vectors)')
axes[2].axhline(y=0, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

### Key Takeaways

1. **Sinusoidal encodings** use different frequencies to encode position at multiple scales. They support relative positions through linear transformations.

2. **Learned embeddings** are flexible but cannot extrapolate to unseen sequence lengths.

3. **RoPE** encodes relative position directly in the attention score through rotation, which is one reason it is often preferred when length generalization matters.

4. All methods aim to give the model information about **where** tokens are, but they do so in different ways:
   - Sinusoidal/Learned: Add position information to token embeddings
   - RoPE: Modify how Q and K interact based on positions

5. The choice depends on your use case:
   - Sequences bounded by a known maximum length? Learned works well (GPT-2, BERT)
   - Need length generalization? RoPE is the modern choice (LLaMA, etc.)
   - Want simplicity with no parameters? Sinusoidal is elegant