# 🚀 Interactive Mode Available!

Typical static notebooks are boring. We have a dedicated interactive module for this week.

[👉 **Click here to open the Interactive Visualization**](../../interactive_platform/modules/week2_transformer/interactive.html)

*(Note: Open this link in a new tab to keep the notebook running)*

# Week 2: The Transformer Architecture (Attention is All You Need)

Welcome to Week 2! This week covers the **heart** of modern LLMs. We are going to build the Transformer architecture from scratch, focusing on the **Encoder Block** and the **Self-Attention** mechanism.

## Goals:
1. Understand Scaled Dot-Product Attention.
2. Implement Multi-Head Attention manually.
3. Build a full Transformer Block.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 1. The Math of Self-Attention

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

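Here $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimension of each key vector. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.

Let's implement this step by step.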
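
Before moving to tensors, it may help to trace the formula by hand. The following is a minimal sketch on a two-token example using plain Python lists. The values in `Q`, `K`, and `V` and the helper names `matmul`, `transpose`, and `softmax` are made up for illustration and are not part of this notebook's code.

```python
import math

# Toy inputs (illustrative values only): 2 tokens, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]

def matmul(a, b):
    # Plain nested-loop matrix product of two list-of-lists matrices
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the max before exponentiating for numerical stability
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = len(Q[0])
# Step 1: similarity scores QK^T, scaled by sqrt(d_k)
scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
# Step 2: softmax over each row turns scores into weights that sum to 1
weights = [softmax(row) for row in scores]
# Step 3: each output row is a weighted average of the rows of V
output = matmul(weights, V)

print(weights)
print(output)
```

In this toy setup each query matches its own key best, so each token puts most of its weight on itself, and the first output row lands closer to the first row of `V` than to the second.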

In [None]:
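def scaled_dot_product_attention(query, key, value, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    query, key, value: tensors of shape [..., seq_len, d_k].
    mask: optional tensor broadcastable to the scores; positions where mask == 0 are blocked.
    Returns (output, attention_weights).
    """
    d_k = query.size(-1)
    # scores = Q @ K^T / sqrt(d_k) -> [..., seq_len_q, seq_len_k]
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        # Blocked positions get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)

    # Softmax over the last dimension (the key positions)
    attention_weights = F.softmax(scores, dim=-1)

    # Weighted sum of values
    output = torch.matmul(attention_weights, value)
    return output, attention_weights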
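
The `mask` argument follows the convention that `0` means "blocked". As a rough illustration, the sketch below builds a causal (lower-triangular) mask with `torch.tril` and applies the same `masked_fill` and softmax steps to random scores. The tensor sizes and variable names are chosen for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 4

# Random raw scores standing in for QK^T / sqrt(d_k): [batch=1, seq_len, seq_len]
scores = torch.randn(1, seq_len, seq_len)

# Causal mask: position i may attend to positions 0..i only (1 = keep, 0 = block)
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Same masking step as in scaled_dot_product_attention; the [seq_len, seq_len]
# mask broadcasts over the batch dimension
masked_scores = scores.masked_fill(causal_mask == 0, -1e9)
weights = F.softmax(masked_scores, dim=-1)

print(weights[0])
```

Every entry above the diagonal comes out as zero and each row still sums to 1, so the blocked positions contribute nothing to the weighted sum of values.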

## 2. Multi-Head Attention Module

Now we wrap this in a proper `nn.Module` with linear layers for projections.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear layers for Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        self.W_o = nn.Linear(d_model, d_model)
        
    def split_heads(self, x):
        # x: [batch, seq_len, d_model]
        # -> [batch, seq_len, num_heads, d_k]
        # -> [batch, num_heads, seq_len, d_k]
        batch_size, seq_len, _ = x.size()
        x = x.view(batch_size, seq_len, self.num_heads, self.d_k)
        return x.permute(0, 2, 1, 3)
    
    def combine_heads(self, x):
        # Inverse of split_heads
        batch_size, num_heads, seq_len, d_k = x.size()
        x = x.permute(0, 2, 1, 3).contiguous()
        return x.view(batch_size, seq_len, self.d_model)
    
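    def forward(self, q, k, v, mask=None):
        # q, k, v: [batch, seq_len, d_model]; for self-attention, pass the same tensor three times.
        # mask (optional) must broadcast to [batch, num_heads, seq_len_q, seq_len_k].

        # 1. Project and split into heads -> [batch, num_heads, seq_len, d_k]
        q = self.split_heads(self.W_q(q))
        k = self.split_heads(self.W_k(k))
        v = self.split_heads(self.W_v(v))

        # 2. Attention, applied to all heads in parallel (the per-head weights are not returned)
        attn_out, _ = scaled_dot_product_attention(q, k, v, mask)

        # 3. Combine heads -> [batch, seq_len, d_model], then project output
        out = self.combine_heads(attn_out)
        return self.W_o(out)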
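
The reshaping in `split_heads` and `combine_heads` is easy to get wrong, so it can help to check it in isolation. The sketch below repeats the same `view` and `permute` calls on a small tensor, outside the class, and confirms that combining after splitting recovers the original tensor. The sizes and the names `heads`, `attn_like`, and `restored` are picked only for illustration.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 3, 8, 4  # illustrative sizes
d_k = d_model // num_heads

x = torch.arange(batch_size * seq_len * d_model, dtype=torch.float32)
x = x.view(batch_size, seq_len, d_model)

# split_heads: [batch, seq_len, d_model] -> [batch, num_heads, seq_len, d_k]
heads = x.view(batch_size, seq_len, num_heads, d_k).permute(0, 2, 1, 3)
print(heads.shape)

# Head h holds feature columns h*d_k .. (h+1)*d_k - 1 of every token
assert torch.equal(heads[:, 1], x[:, :, d_k:2 * d_k])

# Stand-in for the attention output, which is stored contiguously
# as [batch, num_heads, seq_len, d_k]
attn_like = heads.contiguous()

# combine_heads: undo the permute, then merge the last two dimensions.
# After the permute the memory is no longer contiguous, so .contiguous()
# is called before view.
restored = attn_like.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_len, d_model)
assert torch.equal(restored, x)
```

Splitting is therefore only a relabeling of the feature dimension: no values are mixed between heads until the final `W_o` projection.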

## 3. Verify Implementation

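Create a random input tensor and check that self-attention preserves its shape: the output should be `[batch_size, seq_len, d_model]`, the same as the input.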

In [None]:
d_model = 512
num_heads = 8
seq_len = 50
batch_size = 32

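# Put both the input and the module on the device selected in the setup cell
x = torch.randn(batch_size, seq_len, d_model, device=device)
mha = MultiHeadAttention(d_model, num_heads).to(device)

# Self-attention: the same tensor serves as query, key, and value
output = mha(x, x, x)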
print("Input shape:", x.shape)
print("Output shape:", output.shape)

assert output.shape == x.shape, "Output shape mismatch!"
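
A shape check passes even if the numbers are wrong, so a stricter test compares the attention function against a reference. The sketch below assumes PyTorch 2.0 or later, where `torch.nn.functional.scaled_dot_product_attention` is available. It repeats a compact unmasked copy of the attention function under the hypothetical name `manual_attention` so that the snippet runs standalone.

```python
import math
import torch
import torch.nn.functional as F

def manual_attention(query, key, value):
    # Compact, unmasked copy of the notebook's scaled dot-product attention
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), value)

torch.manual_seed(0)
# [batch, num_heads, seq_len, d_k], small sizes chosen for illustration
q = torch.randn(2, 4, 10, 16)
k = torch.randn(2, 4, 10, 16)
v = torch.randn(2, 4, 10, 16)

expected = F.scaled_dot_product_attention(q, k, v)  # built-in reference (PyTorch >= 2.0)
actual = manual_attention(q, k, v)

max_diff = (actual - expected).abs().max().item()
print(f"Max abs difference: {max_diff:.2e}")
assert torch.allclose(actual, expected, atol=1e-5)
```

Small floating-point differences are expected because the built-in function may use a different computation order, which is why the comparison uses a tolerance rather than exact equality.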