# Topic 8: Positional Encoding & Embeddings

## Learning Objectives

By the end of this notebook, you will:
- Understand **why** position information is critical in transformers
- Learn different positional encoding strategies and when to use each
- Implement sinusoidal positional encoding (original Transformer)
- Build learned positional embeddings (BERT, GPT)
- Master Rotary Position Embeddings (RoPE) - the modern standard
- Visualize positional encodings to build intuition
- Connect positional encodings to modern LLM architectures

## The Big Picture: Why Do We Need Positional Encoding?

### The Problem: Attention is Permutation Equivariant

**Critical insight**: Self-attention treats input as a **set**, not a **sequence**!

Consider these two sentences:
```
Sentence 1: "The cat chased the mouse"
Sentence 2: "The mouse chased the cat"
```

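**Without positional encoding**: Both sentences contain the same tokens, so every token would receive exactly the same output vector in both sentences. The outputs are merely reordered along with the inputs.

**Why?** Attention computes:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

- This operation is **permutation equivariant**: permuting the input tokens permutes the output rows in the same way and changes nothing else
- Each token's output depends only on *which* tokens are present, not on *where* they are
- The model has no way to tell which token came first, so word order information is lost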
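
This property can be checked numerically. The sketch below assumes a bare single-head attention with no learned projections (Q = K = V = X) and random stand-in embeddings; the helper name `self_attention` is illustrative only.

```python
import math
import torch

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (no projections, no positions)."""
    scores = x @ x.transpose(-2, -1) / math.sqrt(x.size(-1))
    return torch.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(5, 8)                 # 5 tokens, 8-dim embeddings (random stand-ins)
perm = torch.tensor([0, 4, 2, 3, 1])  # swap tokens 1 and 4 ("cat" <-> "mouse")

out = self_attention(x)
out_perm = self_attention(x[perm])

# Each token gets the same output vector; only the row order changes
same_up_to_order = torch.allclose(out[perm], out_perm, atol=1e-5)
print(same_up_to_order)
```

Reordering the input only reorders the output rows, so nothing in the result records where a token was.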

**Real-world impact**:
- "Dog bites man" vs "Man bites dog" → Completely different meanings!
- "Not good" vs "Good not" → Order matters for negation
- "First, second, third" → Sequence is crucial

### The Solution: Inject Position Information

**Key idea**: Add position-dependent information to token embeddings

```python
# Original (position-agnostic)
input_embedding = token_embedding

# With positional encoding
input_embedding = token_embedding + positional_encoding
```

**Why this works**:
- Each position gets a unique signal
- Model can learn to use position information
- Attention patterns become position-aware

**Why it cannot be skipped**: Without any position signal, a transformer with bidirectional attention is essentially a sophisticated bag-of-words model. It cannot distinguish different orderings of the same tokens, which makes it a poor fit for language, code, time series, or any other sequential data.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math

# Set random seed
torch.manual_seed(42)
np.random.seed(42)

print(f"PyTorch version: {torch.__version__}")

# Visualization setup
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 4)

## Strategy 1: Sinusoidal Positional Encoding

### The Original Transformer Approach

**Introduced in**: "Attention is All You Need" (Vaswani et al., 2017)

**Key insight**: Use sine and cosine functions with different frequencies

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Where:
- $pos$: Position in sequence (0, 1, 2, ...)
- $i$: Dimension-pair index ($0$ to $d_{model}/2 - 1$)
- $d_{model}$: Model dimension (e.g., 512)
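
As a small worked example (the choice $d_{model} = 4$ is only for illustration), there are two dimension pairs, $i = 0$ and $i = 1$, with denominators $10000^{0} = 1$ and $10000^{2/4} = 100$. For $pos = 1$:

$$PE_{(1,\,\cdot)} = \left[\sin(1),\ \cos(1),\ \sin(0.01),\ \cos(0.01)\right] \approx \left[0.8415,\ 0.5403,\ 0.0100,\ 0.99995\right]$$

The first pair changes quickly as $pos$ grows, while the second pair barely moves.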

### Why Sinusoidal Functions?

**1. Unique encoding for each position**:
- Different frequencies create unique patterns
- No two positions have the same encoding

**2. Defined for any sequence length**:
- The function can be evaluated beyond the training length without adding parameters
- In practice, model quality often still degrades on sequences much longer than those seen during training

**3. Relative position information**:
- $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$
- Model can learn to attend by relative position

**4. No learnable parameters**:
- Fixed function (not learned during training)
- Reduces overfitting
- Works well out of the box

**Why different frequencies?**:
- **Low frequencies** (slow oscillation): Distinguish distant positions
- **High frequencies** (fast oscillation): Distinguish nearby positions
- Together, they provide multi-scale position information
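
The relative-position property in point 3 can be checked directly. The sketch below assumes the sin/cos layout from the formulas above (sin on even indices, cos on odd indices); the helper `sinusoidal_pe` and the values of `d_model`, `pos`, and `k` are illustrative choices.

```python
import math
import torch

def sinusoidal_pe(pos, d_model):
    """Return the encoding of a single position and the per-pair frequencies."""
    i = torch.arange(0, d_model, 2, dtype=torch.float64)
    freq = torch.exp(-math.log(10000.0) * i / d_model)
    pe = torch.zeros(d_model, dtype=torch.float64)
    pe[0::2] = torch.sin(pos * freq)
    pe[1::2] = torch.cos(pos * freq)
    return pe, freq

d_model, pos, k = 16, 7, 5
pe_pos, freq = sinusoidal_pe(pos, d_model)
pe_shifted, _ = sinusoidal_pe(pos + k, d_model)

# Angle-addition identities: the shift by k is a rotation that depends only on k
s, c = pe_pos[0::2], pe_pos[1::2]
predicted = torch.zeros_like(pe_pos)
predicted[0::2] = s * torch.cos(k * freq) + c * torch.sin(k * freq)
predicted[1::2] = c * torch.cos(k * freq) - s * torch.sin(k * freq)

matches = torch.allclose(predicted, pe_shifted)
print(matches)
```

The coefficients `cos(k * freq)` and `sin(k * freq)` do not depend on `pos`, which is what makes the map from $PE_{pos}$ to $PE_{pos+k}$ linear.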

In [None]:
class SinusoidalPositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding as used in the original Transformer.
    
    This is a fixed (non-learnable) encoding.
    """
    def __init__(self, d_model, max_len=5000):
        """
        Args:
            d_model: Dimension of embeddings
            max_len: Maximum sequence length to precompute
        """
        super(SinusoidalPositionalEncoding, self).__init__()
        
        # Create positional encoding matrix (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        
        # Position indices (0, 1, 2, ..., max_len-1)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Compute the div_term for different frequencies
        # Why 10000? Chosen empirically to span the frequency range well
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                             (-math.log(10000.0) / d_model))
        
        # Apply sin to even indices (0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(position * div_term)
        
        # Apply cos to odd indices (1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension: (1, max_len, d_model)
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter, but part of model state)
        # Why buffer? It's saved with the model but not updated during training
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        Args:
            x: Input embeddings (batch, seq_len, d_model)
        
        Returns:
            x with positional encoding added
        """
        # Add positional encoding to input
        # [:, :x.size(1)] handles variable sequence lengths
        x = x + self.pe[:, :x.size(1)]
        return x

# Create and visualize
d_model = 128
max_len = 100

pos_encoder = SinusoidalPositionalEncoding(d_model, max_len)

# Extract the positional encoding matrix
pe_matrix = pos_encoder.pe.squeeze(0).numpy()

print(f"Positional encoding shape: {pe_matrix.shape}")
print(f"Each position gets a {d_model}-dimensional vector")
print(f"\nPositional encoding at position 0:")
print(pe_matrix[0, :10])  # First 10 dimensions
print(f"\nPositional encoding at position 50:")
print(pe_matrix[50, :10])

### Visualizing Sinusoidal Positional Encoding

Let's visualize the patterns to build intuition.

In [None]:
# Visualize positional encoding heatmap
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Full heatmap
sns.heatmap(pe_matrix, cmap='RdBu', center=0, ax=axes[0], 
            cbar_kws={'label': 'Encoding Value'})
axes[0].set_title('Sinusoidal Positional Encoding\n(Positions × Dimensions)', fontsize=14)
axes[0].set_xlabel('Embedding Dimension')
axes[0].set_ylabel('Position in Sequence')

# Zoom in on first 50 positions and 64 dimensions
sns.heatmap(pe_matrix[:50, :64], cmap='RdBu', center=0, ax=axes[1],
            cbar_kws={'label': 'Encoding Value'})
axes[1].set_title('Zoomed View (First 50 positions, 64 dims)', fontsize=14)
axes[1].set_xlabel('Embedding Dimension')
axes[1].set_ylabel('Position in Sequence')

plt.tight_layout()
plt.show()

print("Notice the wave-like patterns:")
print("- Early dimensions (left): Fast oscillation (high frequency)")
print("- Later dimensions (right): Slow oscillation (low frequency)")
print("- Each position has a unique 'fingerprint' across all dimensions")

In [None]:
# Visualize specific dimension patterns
fig, axes = plt.subplots(2, 2, figsize=(14, 8))
axes = axes.flatten()

positions = np.arange(max_len)
dimensions_to_plot = [0, 10, 50, 120]  # Different frequencies

for idx, dim in enumerate(dimensions_to_plot):
    axes[idx].plot(positions, pe_matrix[:, dim], linewidth=2)
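    # Build the label first: nesting the same quote type inside an f-string
    # is a syntax error before Python 3.12
    kind = 'Even (sin)' if dim % 2 == 0 else 'Odd (cos)'
    axes[idx].set_title(f'Dimension {dim} ({kind})', fontsize=12)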
    axes[idx].set_xlabel('Position')
    axes[idx].set_ylabel('Encoding Value')
    axes[idx].grid(True, alpha=0.3)
    axes[idx].axhline(y=0, color='k', linestyle='--', alpha=0.3)

plt.suptitle('Sinusoidal Patterns at Different Dimensions\n(Notice how frequency decreases as the dimension index grows)', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Key observations:")
print("1. Dimension 0 (highest frequency): Rapid oscillation")
print("2. Dimension 120 (low frequency): Very slow change, nearly flat over 100 positions")
print("3. This multi-frequency approach captures both local and global position info")

## Strategy 2: Learned Positional Embeddings

### The BERT/GPT Approach

**Key difference**: Positional encoding is a **learnable parameter**, not a fixed function

```python
# Instead of computing sin/cos:
position_embeddings = nn.Embedding(max_len, d_model)
```

**Why learned positional embeddings?**

**Advantages**:
1. **Task-specific**: Can adapt to specific patterns in your data
2. **Simpler**: Just an embedding lookup (very fast)
3. **Empirically strong**: Performs comparably to sinusoidal encodings (Vaswani et al. reported nearly identical results) and is the default in BERT and GPT-2
4. **Flexible**: No mathematical constraints

**Disadvantages**:
1. **No extrapolation**: Cannot handle sequences longer than training
2. **More parameters**: Needs to learn position information
3. **Overfitting risk**: Can memorize instead of generalize

**Used in**:
- **BERT**: Learned absolute positions
- **GPT-2**: Learned absolute positions
- **ViT** (Vision Transformer): Learned 1D positional embeddings over the flattened patch sequence

**When to use**:
- Fixed maximum sequence length (e.g., BERT's 512 tokens)
- Sufficient training data to learn good positions
- Don't need to extrapolate beyond training length
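
The "no extrapolation" limitation is easy to see with a tiny lookup table. This is a minimal sketch with illustrative sizes (`max_len = 8`, `d_model = 4`), not a full model.

```python
import torch
import torch.nn as nn

max_len, d_model = 8, 4               # tiny illustrative sizes
pos_table = nn.Embedding(max_len, d_model)

in_range = pos_table(torch.arange(max_len))       # positions 0..7: fine
print(in_range.shape)

try:
    pos_table(torch.arange(max_len + 1))          # position 8 has no row in the table
    lookup_failed = False
except (IndexError, RuntimeError) as err:
    lookup_failed = True
    print(f"Lookup failed: {type(err).__name__}")
```

A position beyond `max_len - 1` has no learned vector at all, so the model cannot even represent it without resizing and retraining the table.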

In [None]:
class LearnedPositionalEmbedding(nn.Module):
    """
    Learned positional embeddings (BERT/GPT-style).
    
    Simple embedding lookup, learned during training.
    """
    def __init__(self, d_model, max_len=512):
        """
        Args:
            d_model: Dimension of embeddings
            max_len: Maximum sequence length
        """
        super(LearnedPositionalEmbedding, self).__init__()
        
        # Create learnable position embeddings
        # Why Embedding? It's just a lookup table that's updated during training
        self.position_embeddings = nn.Embedding(max_len, d_model)
        self.max_len = max_len
    
    def forward(self, x):
        """
        Args:
            x: Input embeddings (batch, seq_len, d_model)
        
        Returns:
            x with learned positional embeddings added
        """
        batch_size, seq_len, d_model = x.size()
        
        # Check sequence length
        assert seq_len <= self.max_len, f"Sequence length {seq_len} exceeds max_len {self.max_len}"
        
        # Create position indices [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
        
        # Lookup positional embeddings
        pos_embeddings = self.position_embeddings(positions)  # (batch, seq_len, d_model)
        
        # Add to input
        return x + pos_embeddings

# Create learned positional embedding
d_model = 128
max_len = 512

learned_pos = LearnedPositionalEmbedding(d_model, max_len)

# Test
x = torch.randn(2, 100, d_model)  # (batch=2, seq_len=100, d_model=128)
x_with_pos = learned_pos(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {x_with_pos.shape}")
print(f"\nNumber of learnable parameters: {sum(p.numel() for p in learned_pos.parameters()):,}")
print(f"  = max_len ({max_len}) × d_model ({d_model})")

# Visualize learned embeddings (randomly initialized)
learned_matrix = learned_pos.position_embeddings.weight.detach().numpy()

plt.figure(figsize=(14, 6))
sns.heatmap(learned_matrix[:100, :64], cmap='viridis', 
            cbar_kws={'label': 'Embedding Value'})
plt.title('Learned Positional Embeddings (Random Initialization)\nFirst 100 positions, 64 dimensions', 
          fontsize=14)
plt.xlabel('Embedding Dimension')
plt.ylabel('Position')
plt.tight_layout()
plt.show()

print("\nNote: These are randomly initialized.")
print("During training, the model learns meaningful position patterns!")
print("After training, nearby positions often have similar embeddings.")

## Strategy 3: Rotary Position Embeddings (RoPE)

### The Modern Standard (2021-2025)

**Introduced in**: "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)

**Used in**: LLaMA, PaLM, GPT-NeoX, Mistral, and most modern LLMs (2023+)

**Why RoPE is revolutionary**: It encodes **relative** position directly into the attention mechanism!

### The Key Insight

**Problem with absolute positional encoding**:
- Token at position 5 gets the same encoding regardless of context
- Attention doesn't naturally capture relative distances

**RoPE's solution**: Rotate query and key vectors by their position

**Mathematical intuition**:
- Represent position as a rotation angle: $\theta = pos \times \omega$
- Rotate Q and K by their respective angles
- When computing $Q \cdot K$, the dot product naturally captures relative position!

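$$Q_{pos_1} \cdot K_{pos_2} = f(q, k, pos_1 - pos_2)$$

Here $Q_{pos_1}$ and $K_{pos_2}$ are the rotated versions of the content vectors $q$ and $k$. The score depends on the content and on the offset $pos_1 - pos_2$, but not on the absolute positions.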

**Why this is important**:
1. **Relative position**: Attention naturally depends on relative distance
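2. **Length flexibility**: There is no learned position table, so any position can be encoded; reliable use beyond the training length usually still needs a scaling method such as position interpolation
3. **Efficiency**: No positional vector is added to the embeddings; only Q and K are rotated
4. **Strong results**: Widely adopted because it performs well in practice at LLM scale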
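
To see where the relative offset comes from, consider a single 2D pair with rotation matrix $R(\cdot)$ and frequency $\omega$ (a sketch for one pair; the full embedding applies the same argument to every pair with its own $\omega$):

$$\langle R(pos_1\,\omega)\,q,\ R(pos_2\,\omega)\,k \rangle = q^T R(pos_1\,\omega)^T R(pos_2\,\omega)\,k = q^T R\big((pos_2 - pos_1)\,\omega\big)\,k$$

The last step uses $R(a)^T R(b) = R(b - a)$, so the absolute positions cancel and only their difference remains.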

**How it works (simplified)**:
```python
# Instead of: x + positional_encoding
# We do: rotate(x, position)

def rotate(x, position):
    angle = position * frequency
    return apply_rotation_matrix(x, angle)
```

**Why RoPE cannot be skipped in modern LLMs**:
- **De facto standard**: Most open-weight LLMs released since 2023 use RoPE
- **Context extension**: The context window can be extended after pretraining with frequency-scaling methods (e.g., position interpolation), usually with some fine-tuning
- **Better relative position**: More natural for language structure
- **Proven at scale**: Used in models with context windows of 100K+ tokens
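
The relative-position property can be verified with a compact standalone function. This is a minimal sketch, assuming the "split halves" pairing convention (dimension `i` is paired with dimension `i + dim/2`); the function name `apply_rope` and the test positions are illustrative.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Rotate x of shape (seq_len, dim) by position-dependent angles (dim must be even)."""
    dim = x.size(-1)
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float64) / dim))
    angles = positions.to(torch.float64)[:, None] * inv_freq[None, :]  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation of each pair (x1[i], x2[i]) by angles[:, i]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

torch.manual_seed(0)
q = torch.randn(1, 8, dtype=torch.float64)
k = torch.randn(1, 8, dtype=torch.float64)

def score(m, n):
    """Dot product of q placed at position m and k placed at position n."""
    q_m = apply_rope(q, torch.tensor([m]))
    k_n = apply_rope(k, torch.tensor([n]))
    return (q_m * k_n).sum().item()

s_a = score(3, 1)     # offset 2
s_b = score(10, 8)    # offset 2, shifted by 7 positions
print(s_a, s_b)

norm_before = q.norm().item()
norm_after = apply_rope(q, torch.tensor([5])).norm().item()
print(norm_before, norm_after)
```

The two scores agree because both pairs of positions have the same offset, and the norm is unchanged because each pair undergoes a pure rotation.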

In [None]:
class RotaryPositionalEmbedding(nn.Module):
    """
    Rotary Position Embedding (RoPE).
    
    Key idea: Rotate queries and keys by their position.
    This encodes relative position into attention!
    """
    def __init__(self, dim, max_len=2048, base=10000):
        """
        Args:
            dim: Dimension per head (must be even)
            max_len: Maximum sequence length
            base: Base for frequency computation (10000 is standard)
        """
        super(RotaryPositionalEmbedding, self).__init__()
        
        assert dim % 2 == 0, "Dimension must be even for RoPE"
        
        self.dim = dim
        self.max_len = max_len
        self.base = base
        
        # Compute frequencies for rotation
        # Why different frequencies? Same reason as sinusoidal PE
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        
        # Precompute rotations for all positions up to max_len
        self._compute_cos_sin_cache(max_len)
    
    def _compute_cos_sin_cache(self, max_len):
        """
        Precompute cos and sin values for all positions.
        
        Why cache? Rotation matrices are fixed, so compute once.
        """
        positions = torch.arange(max_len, dtype=torch.float32)
        
        # Compute angles: pos * freq for each position and frequency
        # Shape: (max_len, dim/2)
        angles = torch.einsum('i,j->ij', positions, self.inv_freq)
        
        # Duplicate so both halves (dims i and i + dim/2) share the same angle: (max_len, dim)
        angles = torch.cat([angles, angles], dim=-1)
        
        # Precompute cos and sin
        self.register_buffer('cos_cached', angles.cos())
        self.register_buffer('sin_cached', angles.sin())
    
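    def rotate_half(self, x):
        """
        Map (x1, x2) -> (-x2, x1), where x1 and x2 are the two halves of the last dim.
        
        Why? This implementation pairs dimension i with dimension i + dim/2,
        which matches the [angles, angles] layout of the cos/sin cache.
        (Splitting with ::2 / 1::2 here would not line up with that cache.)
        
        For a pair (x1, x2), rotation by θ:
          x1' = x1*cos(θ) - x2*sin(θ)
          x2' = x1*sin(θ) + x2*cos(θ)
        """
        half = x.shape[-1] // 2
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([-x2, x1], dim=-1)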
    
    def apply_rotary_pos_emb(self, x, position_ids):
        """
        Apply rotary position embedding to input.
        
        Args:
            x: Input tensor (..., seq_len, dim)
            position_ids: Position indices (seq_len,)
        
        Returns:
            Rotated tensor with same shape as x
        """
        # Get cos and sin for these positions
        cos = self.cos_cached[position_ids].unsqueeze(0)
        sin = self.sin_cached[position_ids].unsqueeze(0)
        
        # Apply rotation: x*cos + rotate_half(x)*sin
        # This implements the 2D rotation formula efficiently
        return (x * cos) + (self.rotate_half(x) * sin)
    
    def forward(self, q, k, position_ids=None):
        """
        Apply RoPE to queries and keys.
        
        Args:
            q: Queries (batch, num_heads, seq_len, dim)
            k: Keys (batch, num_heads, seq_len, dim)
            position_ids: Optional position indices
        
        Returns:
            q_rot, k_rot: Rotated queries and keys
        """
        seq_len = q.size(2)
        
        if position_ids is None:
            position_ids = torch.arange(seq_len, device=q.device)
        
        # Apply rotation to Q and K
        # Why both? The dot product Q*K captures relative position!
        q_rot = self.apply_rotary_pos_emb(q, position_ids)
        k_rot = self.apply_rotary_pos_emb(k, position_ids)
        
        return q_rot, k_rot

# Test RoPE
dim = 64  # Dimension per head
seq_len = 20
batch_size = 2
num_heads = 4

rope = RotaryPositionalEmbedding(dim, max_len=2048)

# Create dummy Q, K
q = torch.randn(batch_size, num_heads, seq_len, dim)
k = torch.randn(batch_size, num_heads, seq_len, dim)

# Apply RoPE
q_rot, k_rot = rope(q, k)

print(f"Query shape: {q.shape}")
print(f"Rotated query shape: {q_rot.shape}")
print(f"\nKey insight: Shape unchanged, but vectors are rotated by their position!")
print(f"\nWhen computing attention scores (Q_rot @ K_rot.T):")
print(f"The dot product naturally captures RELATIVE position!")

# Verify rotation preserves norm (important property)
q_norm = q.norm(dim=-1)
q_rot_norm = q_rot.norm(dim=-1)

print(f"\nOriginal Q norm: {q_norm[0, 0, 0].item():.4f}")
print(f"Rotated Q norm: {q_rot_norm[0, 0, 0].item():.4f}")
print("✓ Norm preserved (rotation doesn't change magnitude)")

### Visualizing RoPE: How Rotation Captures Position

Let's visualize how RoPE encodes position through rotation.

In [None]:
# Visualize rotation angles
dim = 64
max_len = 100

rope = RotaryPositionalEmbedding(dim, max_len)

# Extract rotation matrices
cos_matrix = rope.cos_cached.numpy()
sin_matrix = rope.sin_cached.numpy()

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Cosine components
sns.heatmap(cos_matrix, cmap='RdBu', center=0, ax=axes[0],
            cbar_kws={'label': 'cos(θ)'})
axes[0].set_title('RoPE: Cosine Components\n(Rotation angles for each position)', fontsize=14)
axes[0].set_xlabel('Dimension')
axes[0].set_ylabel('Position')

# Sine components
sns.heatmap(sin_matrix, cmap='RdBu', center=0, ax=axes[1],
            cbar_kws={'label': 'sin(θ)'})
axes[1].set_title('RoPE: Sine Components\n(Rotation angles for each position)', fontsize=14)
axes[1].set_xlabel('Dimension')
axes[1].set_ylabel('Position')

plt.tight_layout()
plt.show()

print("Notice the similarity to sinusoidal positional encoding!")
print("But instead of ADDING these values, RoPE ROTATES by these angles.")
print("\nThis makes relative position emerge naturally in Q·K dot products!")

In [None]:
# Demonstrate relative position property
def compute_attention_scores(q, k):
    """Compute attention scores (before softmax)"""
    return torch.matmul(q, k.transpose(-2, -1)) / np.sqrt(q.size(-1))

# Create simple Q, K
seq_len = 10
dim = 64
q = torch.randn(1, 1, seq_len, dim)
k = torch.randn(1, 1, seq_len, dim)

# Without RoPE
scores_no_rope = compute_attention_scores(q, k)

# With RoPE
rope = RotaryPositionalEmbedding(dim)
q_rot, k_rot = rope(q, k)
scores_with_rope = compute_attention_scores(q_rot, k_rot)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sns.heatmap(scores_no_rope[0, 0].detach().numpy(), annot=True, fmt='.2f',
            cmap='YlOrRd', ax=axes[0])
axes[0].set_title('Attention Scores WITHOUT RoPE\n(Content only)', fontsize=12)
axes[0].set_xlabel('Key position')
axes[0].set_ylabel('Query position')

sns.heatmap(scores_with_rope[0, 0].detach().numpy(), annot=True, fmt='.2f',
            cmap='YlOrRd', ax=axes[1])
axes[1].set_title('Attention Scores WITH RoPE\n(Content + relative position)', fontsize=12)
axes[1].set_xlabel('Key position')
axes[1].set_ylabel('Query position')

plt.tight_layout()
plt.show()

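print("Key observation:")
print("WITH RoPE: Each score depends on the content AND on the offset between query and key positions.")
print("With random Q and K both heatmaps look noisy; a clean diagonal pattern appears only")
print("when the same content vector is placed at every position.")
print("\nThis is why RoPE is so powerful for language modeling!")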

## Comparison: Which Positional Encoding to Use?

### Quick Decision Guide

| Method | Pros | Cons | Best For |
|--------|------|------|----------|
| **Sinusoidal** | • No parameters<br>• Defined for any position<br>• Simple | • Absolute position<br>• Less flexible<br>• Quality can drop beyond training length | Research, prototypes |
| **Learned** | • Task-adaptive<br>• Simple implementation<br>• Fast | • No extrapolation<br>• More parameters | Fixed-length tasks |
| **RoPE** | • Relative position<br>• Context extendable with frequency scaling<br>• Strong empirical performance | • Slightly complex<br>• Requires rotation | **Modern LLMs** (recommended) |

### When to Use Each

**Use Sinusoidal if**:
- Quick prototype or research
- Want deterministic behavior
- Don't want extra parameters

**Use Learned if**:
- Fixed maximum sequence length
- Plenty of training data
- Absolute position matters (rare)

**Use RoPE if** (recommended for most cases):
- Building a modern LLM or transformer
- Need to extrapolate to longer sequences
- Want state-of-the-art performance
- Relative position matters (almost always in NLP)

### Modern LLM Choices (2023-2025)

- **LLaMA 1/2/3**: RoPE
- **PaLM**: RoPE
- **GPT-NeoX**: RoPE
- **Mistral**: RoPE
- **Qwen**: RoPE
- **DeepSeek**: RoPE

**Trend**: Nearly all modern LLMs use RoPE!

In [None]:
# Comparison: Parameter count
d_model = 512
max_len = 2048

# Sinusoidal: No parameters
sin_params = 0

# Learned: max_len * d_model parameters
learned_params = max_len * d_model

# RoPE: No parameters (just precomputed rotations)
rope_params = 0

print("Parameter Comparison:")
print(f"\nSinusoidal PE: {sin_params:,} parameters")
print(f"Learned PE: {learned_params:,} parameters")
print(f"RoPE: {rope_params:,} parameters")

print(f"\nFor a model with:")
print(f"  - d_model = {d_model}")
print(f"  - max_len = {max_len}")
print(f"\nLearned PE adds {learned_params:,} parameters!")
print(f"For a 7B parameter model, that's {100 * learned_params / 7e9:.3f}% overhead.")

print("\nTakeaway: RoPE adds no parameters and is the common choice in modern LLMs")

## Mini Exercises

Test your understanding!

### Exercise 1: Verify Extrapolation Properties

Build sinusoidal and learned positional encodings for sequences up to length 50, then test whether each can handle a sequence of length 60.

In [None]:
# YOUR CODE HERE


# SOLUTION
def show_solution_1():
    # Create sinusoidal PE with max_len=50
    sin_pe = SinusoidalPositionalEncoding(d_model=128, max_len=50)
    
    # Create learned PE with max_len=50
    learned_pe = LearnedPositionalEmbedding(d_model=128, max_len=50)
    
    # Test with length 40 (within training)
    x_short = torch.randn(1, 40, 128)
    
    # Test with length 60 (beyond training)
    x_long = torch.randn(1, 60, 128)
    
    print("Testing extrapolation...\n")
    
    # Sinusoidal: Should work
    try:
        # The table is a fixed function, so it can simply be recomputed for a
        # larger max_len; nothing has to be retrained
        sin_pe_extended = SinusoidalPositionalEncoding(d_model=128, max_len=100)
        out = sin_pe_extended(x_long)
        print("✓ Sinusoidal PE: Handles length 60 (originally built for 50)")
        print(f"  Output shape: {out.shape}")
    except Exception as e:
        print(f"✗ Sinusoidal PE failed: {e}")
    
    # Learned: Should fail
    try:
        out = learned_pe(x_long)
        print("✗ Learned PE: Should have failed but didn't!")
    except Exception as e:
        print(f"✓ Learned PE: Cannot handle length 60 (trained on 50)")
        print(f"  Error: {type(e).__name__}")
    
    print("\nConclusion:")
    print("- Sinusoidal PE can extrapolate (mathematical function continues)")
    print("- Learned PE cannot extrapolate (no embeddings beyond max_len)")
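    print("- RoPE can also be evaluated at new positions (rotation formula extends),")
    print("  though quality beyond the training length usually needs frequency scaling")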

# Uncomment to see solution:
# show_solution_1()

### Exercise 2: Implement ALiBi (Attention with Linear Biases)

ALiBi is another modern approach that adds position bias directly to attention scores. Implement it!
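
As a starting hint, the per-head slopes can be sketched as follows. The snippet assumes the head count is a power of 2 and uses the geometric sequence $2^{-8/n}, 2^{-16/n}, \dots, 2^{-8}$ for $n$ heads; the helper name `alibi_slopes` is illustrative.

```python
import torch

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8); assumes n is a power of 2."""
    return torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

slopes = alibi_slopes(8)
print(slopes)

# Causal bias for one head: query i looking back at key j <= i is penalized by slope * (i - j).
# Future keys (j > i) get 0 here because a causal mask would hide them anyway.
seq_len = 4
pos = torch.arange(seq_len)
distance = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0)  # [i, j] = i - j for j <= i
bias_head0 = -slopes[0] * distance
print(bias_head0)
```

Heads with large slopes focus on nearby tokens, while heads with small slopes can still attend far back.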

In [None]:
# YOUR CODE HERE
class ALiBiPositionalBias(nn.Module):
    def __init__(self, num_heads):
        super(ALiBiPositionalBias, self).__init__()
        # Add your code
        pass
    
    def forward(self, attention_scores):
        # Add your code
        pass


# SOLUTION
def show_solution_2():
    class ALiBiPositionalBias(nn.Module):
        """
        ALiBi: Attention with Linear Biases.
        
        Key idea: Add position-dependent bias to attention scores.
        Bias is proportional to distance: closer positions get smaller penalty.
        
        Used in: BLOOM, MPT models
        """
        def __init__(self, num_heads):
            super(ALiBiPositionalBias, self).__init__()
            
            # Compute slopes for each head (geometric sequence)
            # Why different slopes? Each head has different distance sensitivity
            # ALiBi uses 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads (n a power of 2)
            slopes = torch.tensor([2 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
            self.register_buffer('slopes', slopes.view(1, num_heads, 1, 1))
        
        def forward(self, attention_scores):
            """
            Args:
                attention_scores: (batch, num_heads, seq_len, seq_len)
            
            Returns:
                Scores with ALiBi bias added
            """
            seq_len = attention_scores.size(-1)
            
            # Create distance matrix
            # Element [i,j] = distance from i to j
            positions = torch.arange(seq_len, device=attention_scores.device)
            distance = positions.unsqueeze(0) - positions.unsqueeze(1)  # Broadcasting
            
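            # ALiBi bias: -slope * distance
            # Negative sign: Penalize distant positions
            # abs() makes the penalty symmetric (bidirectional); the original ALiBi
            # is causal, where a mask hides keys with j > i
            alibi_bias = -self.slopes * distance.abs()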
            
            # Add bias to scores
            return attention_scores + alibi_bias
    
    # Test ALiBi
    num_heads = 4
    seq_len = 10
    
    alibi = ALiBiPositionalBias(num_heads)
    
    # Create dummy attention scores
    scores = torch.randn(1, num_heads, seq_len, seq_len)
    scores_with_alibi = alibi(scores)
    
    # Visualize ALiBi bias for one head
    bias = scores_with_alibi[0, 0] - scores[0, 0]
    
    plt.figure(figsize=(8, 6))
    sns.heatmap(bias.detach().numpy(), annot=True, fmt='.2f', cmap='RdYlGn_r',
                center=0, cbar_kws={'label': 'Bias'})
    plt.title('ALiBi Positional Bias (Head 1)\nPenalizes distant positions', fontsize=14)
    plt.xlabel('Key position')
    plt.ylabel('Query position')
    plt.tight_layout()
    plt.show()
    
    print("ALiBi properties:")
    print("✓ No positional encoding added to embeddings")
    print("✓ Bias applied directly to attention scores")
    print("✓ Linear penalty based on distance")
    print("✓ Different slopes for different heads")
    print("\nAdvantages:")
    print("- Simple and effective")
    print("- Excellent extrapolation")
    print("- Used in BLOOM (176B parameters)")

# Uncomment to see solution:
# show_solution_2()

### Exercise 3: 2D Positional Encoding for Images

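Image patches form a 2D grid. The original ViT uses learned 1D positional embeddings over the flattened patches, but a 2D-aware variant makes the spatial structure explicit. Implement it!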
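
A useful first step is mapping a flattened patch index back to its grid coordinates. The sketch below assumes row-major flattening (patches are read left to right, then top to bottom) and a tiny 2×3 grid chosen for illustration.

```python
import torch

height, width = 2, 3                      # a tiny 2x3 patch grid for illustration
idx = torch.arange(height * width)        # flattened patch indices 0..5

rows = torch.div(idx, width, rounding_mode='floor')  # [0, 0, 0, 1, 1, 1]
cols = idx % width                                   # [0, 1, 2, 0, 1, 2]
print(rows.tolist(), cols.tolist())
```

Each patch then gets one row index and one column index, which can be looked up in separate embedding tables.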

In [None]:
# YOUR CODE HERE


# SOLUTION
def show_solution_3():
    class Learned2DPositionalEmbedding(nn.Module):
        """
        2D positional embeddings for Vision Transformers.
        
        Image patches have 2D structure (height, width).
        """
        def __init__(self, d_model, height, width):
            super(Learned2DPositionalEmbedding, self).__init__()
            
            self.height = height
            self.width = width
            
            # Separate embeddings for height and width
            # Why separate? Height and width are independent dimensions
            self.height_embed = nn.Embedding(height, d_model)
            self.width_embed = nn.Embedding(width, d_model)
        
        def forward(self, x):
            """
            Args:
                x: Patch embeddings (batch, height*width, d_model)
            
            Returns:
                x with 2D positional embeddings added
            """
            batch_size, seq_len, d_model = x.size()
            
            # Create 2D position grid
            h_pos = torch.arange(self.height, device=x.device).repeat_interleave(self.width)
            w_pos = torch.arange(self.width, device=x.device).repeat(self.height)
            
            # Get embeddings
            h_embed = self.height_embed(h_pos)  # (height*width, d_model)
            w_embed = self.width_embed(w_pos)  # (height*width, d_model)
            
            # Combine (sum or concatenate)
            pos_embed = h_embed + w_embed  # (height*width, d_model)
            
            # Add to input
            return x + pos_embed.unsqueeze(0)
    
    # Test 2D positional embedding
    d_model = 128
    img_size = 224
    patch_size = 16
    height = width = img_size // patch_size  # 14x14 patches
    
    pos_2d = Learned2DPositionalEmbedding(d_model, height, width)
    
    # Dummy patch embeddings
    patches = torch.randn(2, height * width, d_model)
    patches_with_pos = pos_2d(patches)
    
    print(f"Image size: {img_size}x{img_size}")
    print(f"Patch size: {patch_size}x{patch_size}")
    print(f"Number of patches: {height}x{width} = {height*width}")
    print(f"\nPatch embeddings shape: {patches.shape}")
    print(f"With 2D positional encoding: {patches_with_pos.shape}")
    
    # Visualize 2D positional embeddings
    h_weights = pos_2d.height_embed.weight.detach().numpy()
    w_weights = pos_2d.width_embed.weight.detach().numpy()
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    sns.heatmap(h_weights[:, :32], cmap='viridis', ax=axes[0])
    axes[0].set_title('Height Embeddings (first 32 dims)', fontsize=12)
    axes[0].set_xlabel('Dimension')
    axes[0].set_ylabel('Height position')
    
    sns.heatmap(w_weights[:, :32], cmap='viridis', ax=axes[1])
    axes[1].set_title('Width Embeddings (first 32 dims)', fontsize=12)
    axes[1].set_xlabel('Dimension')
    axes[1].set_ylabel('Width position')
    
    plt.tight_layout()
    plt.show()
    
    print("\n2D positional encoding for images:")
    print("✓ Separate embeddings for height and width")
    print("✓ Preserves 2D spatial structure")
    print("✓ A 2D-aware alternative to the learned 1D embeddings of the original ViT")

# Uncomment to see solution:
# show_solution_3()

## Comprehensive Exercise: Build a Complete Positional Encoding Module

Create a flexible positional encoding module that supports the methods covered above (sinusoidal, learned, RoPE, and ALiBi) and can switch between them.

In [None]:
# YOUR CODE HERE


# SOLUTION
def show_comprehensive_solution():
    class FlexiblePositionalEncoding(nn.Module):
        """
        Flexible positional encoding supporting multiple methods.
        
        Allows easy experimentation with different approaches!
        """
        def __init__(self, d_model, max_len=2048, method='rope', num_heads=8):
            """
            Args:
                d_model: Model dimension
                max_len: Maximum sequence length
                method: 'sinusoidal', 'learned', 'rope', or 'alibi'
                num_heads: Number of attention heads (for RoPE/ALiBi)
            """
            super(FlexiblePositionalEncoding, self).__init__()
            
            self.method = method
            self.d_model = d_model
            
            if method == 'sinusoidal':
                self.pos_encoder = SinusoidalPositionalEncoding(d_model, max_len)
            elif method == 'learned':
                self.pos_encoder = LearnedPositionalEmbedding(d_model, max_len)
            elif method == 'rope':
                self.pos_encoder = RotaryPositionalEmbedding(d_model // num_heads, max_len)
            elif method == 'alibi':
                self.pos_encoder = ALiBiPositionalBias(num_heads)
            else:
                raise ValueError(f"Unknown method: {method}")
        
        def forward(self, x, q=None, k=None, attention_scores=None):
            """
            Apply positional encoding.
            
            Args depend on method:
            - sinusoidal/learned: Only needs x
            - rope: Needs q, k (queries, keys)
            - alibi: Needs attention_scores
            """
            if self.method in ['sinusoidal', 'learned']:
                return self.pos_encoder(x)
            elif self.method == 'rope':
                assert q is not None and k is not None, "RoPE requires q and k"
                return self.pos_encoder(q, k)
            elif self.method == 'alibi':
                assert attention_scores is not None, "ALiBi requires attention_scores"
                return self.pos_encoder(attention_scores)
    
    # Test sinusoidal, learned, and RoPE (ALiBi is not exercised here)
    d_model = 128
    seq_len = 20
    batch_size = 2
    
    print("Testing Flexible Positional Encoding...\n")
    
    # Test sinusoidal
    sin_pe = FlexiblePositionalEncoding(d_model, method='sinusoidal')
    x = torch.randn(batch_size, seq_len, d_model)
    x_sin = sin_pe(x)
    print(f"✓ Sinusoidal: {x.shape} -> {x_sin.shape}")
    
    # Test learned
    learned_pe = FlexiblePositionalEncoding(d_model, method='learned')
    x_learned = learned_pe(x)
    print(f"✓ Learned: {x.shape} -> {x_learned.shape}")
    
    # Test RoPE
    rope_pe = FlexiblePositionalEncoding(d_model, method='rope', num_heads=8)
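    # Use separate tensors so q and k are not aliases of the same object
    q = torch.randn(batch_size, 8, seq_len, d_model // 8)
    k = torch.randn(batch_size, 8, seq_len, d_model // 8)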
    q_rot, k_rot = rope_pe(x, q=q, k=k)
    print(f"✓ RoPE: {q.shape} -> {q_rot.shape}")
    
    print("\nSinusoidal, learned, and RoPE all working! Easy to experiment with different approaches.")
    print("\nCommon default for new LLMs as of 2025: RoPE (relative positions, context extension via scaling)")

# Uncomment to see solution:
# show_comprehensive_solution()

## Key Takeaways

### Core Concepts

**1. Why positional encoding is critical**:
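- Self-attention has no built-in notion of order: it is permutation-equivariant, so reordering the input tokens only reorders the outputs (the input is effectively treated as a set)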
- Without position info, "dog bites man" = "man bites dog"
- Positional encoding injects sequence order into transformers
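
The claim above can be checked directly. The following is a minimal, dependency-free sketch, assuming a single attention head with Q = K = V = x and no learned projections; the toy embedding values are made up for illustration. It shows that each word receives the same output vector in both word orders.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with Q = K = V = x (no projections, no positions)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Toy embeddings (hypothetical values, chosen only for the demo).
emb = {"dog": [1.0, 0.0, 0.5], "bites": [0.0, 1.0, -0.5], "man": [0.5, 0.5, 1.0]}

sent_a = ["dog", "bites", "man"]
sent_b = ["man", "bites", "dog"]

# Map each word to its attention output in each sentence.
out_a = dict(zip(sent_a, self_attention([emb[w] for w in sent_a])))
out_b = dict(zip(sent_b, self_attention([emb[w] for w in sent_b])))

# Compare the per-word outputs across the two orderings.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for w in sent_a
    for a, b in zip(out_a[w], out_b[w])
)
print("Per-token outputs identical across orderings:", same)
```

Because the per-token outputs match, any order-insensitive readout (for example mean pooling) cannot tell the two sentences apart unless position information is added to the inputs.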

**2. Three main approaches**:

**Sinusoidal (Original Transformer)**:
- Fixed mathematical function (sin/cos with multiple frequencies)
- No learnable parameters
- Defined for any position, so longer sequences can be encoded (though quality beyond the training length often degrades in practice)
- Good baseline, but absolute position
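
As a reminder of what "fixed mathematical function" means here, this is a small standard-library sketch of the sinusoidal formula from Vaswani et al. (2017). The function name `sinusoidal_encoding` is chosen for this example and is not the notebook's `SinusoidalPositionalEncoding` class.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Encoding for one position: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for i in range(0, d_model, 2):
        # i plays the role of 2i in the paper's formula pos / base^(2i / d_model)
        angle = pos / (base ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# No lookup table is involved, so any position can be encoded on demand.
pe_start = sinusoidal_encoding(0, 4)
pe_far = sinusoidal_encoding(5000, 8)
print(pe_start)
print(len(pe_far))
```

Position 0 always maps to alternating 0 and 1 values (sin 0 and cos 0), and every component stays in [-1, 1] no matter how large the position is.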

**Learned (BERT/GPT-2)**:
- Learnable embedding lookup table
- Task-adaptive, simple implementation
- Cannot extrapolate beyond training length
- Good for fixed-length tasks

**RoPE (Modern LLMs)**:
- Rotates Q and K by position
- Encodes **relative** position naturally
- No extra parameters; context can be extended after training with scaling techniques such as position interpolation
- **Industry standard for 2023-2025**
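
The relative-position property can be verified numerically. The sketch below assumes the interleaved pairing from the RoFormer paper, where dimensions (0, 1), (2, 3), ... are rotated together; some implementations pair the first and second halves of the vector instead. The helper names and the q/k values are illustrative only.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # per-pair frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (query 3 positions after key) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

# A different offset (query 1 position after key) for comparison.
score_other = dot(rope_rotate(q, 5), rope_rotate(k, 4))

print("Same offset, same score:", math.isclose(score_near, score_far, abs_tol=1e-9))
print("Different offset, different score:", not math.isclose(score_near, score_other, abs_tol=1e-6))
```

The attention score depends only on the offset between the query and key positions, which is why RoPE is described as encoding relative position even though each vector is rotated by its absolute position.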

**3. Why RoPE is the modern standard**:
- Relative position is more natural for language
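- Context can be extended after training, usually via position interpolation or frequency scaling plus a short fine-tune (plain RoPE degrades beyond its training length)
- Used in LLaMA, PaLM, Mistral, and many other modern LLMs
- Proven at scale (100K+ token contexts when combined with scaling techniques)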
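
One common way to extend context is position interpolation (Chen et al., 2023), which compresses positions so they stay inside the range seen during training. The sketch below only illustrates the angle arithmetic; the lengths are hypothetical, `rope_angles` is a name invented for this example, and in practice the model is typically fine-tuned briefly after rescaling.

```python
def rope_angles(pos, d_head, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale < 1 compresses positions (interpolation)."""
    return [(pos * scale) * base ** (-i / d_head) for i in range(0, d_head, 2)]

train_len = 2048    # hypothetical training context length
target_len = 8192   # hypothetical extended context length
scale = train_len / target_len

# Compare the highest-frequency pair (index 0), whose angle equals the position.
trained_limit = rope_angles(train_len, 8)[0]
unscaled = rope_angles(target_len - 1, 8)[0]
interpolated = rope_angles(target_len - 1, 8, scale=scale)[0]

print("Unscaled angle leaves trained range:", unscaled > trained_limit)
print("Interpolated angle stays inside it:", interpolated < trained_limit)
```

The trade-off is resolution: neighbouring positions end up closer together in angle, which is one reason a short fine-tune is usually needed after rescaling.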

**4. Key properties to consider**:
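- **Absolute vs Relative**: Relative position often generalizes better, since meaning usually depends on the distance between tokens rather than their absolute index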
- **Extrapolation**: Can the model handle longer sequences than training?
- **Parameters**: Does it add learnable parameters?
- **Performance**: Empirical results on your task

### Connection to Modern AI

**Transformers = Attention + Position + FFN**:
- Positional encoding is as critical as attention
- Choice affects model's ability to understand sequence

**Modern trends**:
- **RoPE dominance**: Nearly universal in new LLMs
- **Context extension**: ALiBi, extended RoPE for longer contexts
- **2D/3D extensions**: For images, video, 3D data
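
For reference, the ALiBi bias mentioned above can be sketched in a few lines. This assumes the slope schedule from Press et al. (2021) for a power-of-two number of heads, and it folds the causal mask into the same matrix for illustration. The function names are chosen for this example and do not reflect the notebook's `ALiBiPositionalBias` interface.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for n heads (n a power of two)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias matrix: -slope * distance for past keys, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

# The bias is added to the attention scores before the softmax.
for row in bias:
    print(row)
```

The penalty grows linearly with distance and needs no position table, so the same rule applies unchanged to sequences longer than those seen in training.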

**Why you cannot skip this**:
- Every transformer applied to ordered data needs some source of position information
- Wrong choice limits model capability
- Understanding position encoding is key to understanding transformers

### What's Next?

You've mastered positional encoding! Next:
- **Complete Transformer Architecture**: Putting attention + position + FFN together
- **Advanced variants**: Flash Attention, GQA, Mixture of Experts
- **Building LLMs**: From scratch implementations

With attention and positional encoding under your belt, you're ready to build complete transformers!

## Further Reading

### Essential Papers
1. **Vaswani et al. (2017)**: "Attention is All You Need" (Original sinusoidal PE)
2. **Devlin et al. (2018)**: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Learned positional embeddings)
3. **Su et al. (2021)**: "RoFormer: Enhanced Transformer with Rotary Position Embedding" (RoPE)
4. **Press et al. (2021)**: "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (ALiBi)

### Modern Perspectives
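5. **Chen et al. (2023)**: "Extending Context Window of Large Language Models via Positional Interpolation" (RoPE extensions)
6. **Dosovitskiy et al. (2020)**: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ViT; compares 1D and 2D position embeddings)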

### Tutorials and Analysis
7. **Positional Encoding Analysis** (Kazemnejad): https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
8. **Understanding RoPE** (EleutherAI): https://blog.eleuther.ai/rotary-embeddings/

### Implementation Resources
- **LLaMA source code**: Reference RoPE implementation
- **Hugging Face Transformers**: Multiple positional encoding implementations
- **PyTorch examples**: Official transformer tutorials