# Module 06: Transformer Architecture

**Difficulty**: ⭐⭐⭐ Advanced  
**Estimated Time**: 150 minutes  
**Prerequisites**: [Module 05: Attention Mechanism](05_attention_mechanism.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the complete Transformer architecture from "Attention is All You Need"
2. Implement multi-head attention from scratch in PyTorch
3. Understand and implement positional encoding
4. Build encoder and decoder layers with layer normalization
5. Implement a complete Transformer for sequence-to-sequence tasks
6. Understand why Transformers revolutionized NLP

## The Transformer Revolution

**"Attention is All You Need"** (Vaswani et al., 2017) introduced the Transformer, an architecture built entirely on attention, with no recurrence or convolutions.

### Why Transformers?

**Problems with RNNs**:
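- Sequential processing: each step depends on the previous hidden state, so computation can't be parallelized across time steps
- Vanishing gradients for long sequences
- Limited effective context: information from distant tokens has to survive many recurrent steps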

**Transformer advantages**:
- ✅ Parallelizable training (all positions are processed simultaneously; generation at inference time is still one token at a time)
- ✅ Direct connections between any two positions
- ✅ Better at capturing long-range dependencies
- ✅ Faster training on modern hardware (GPUs/TPUs)

### Core Innovation:

**Replace recurrence with self-attention!**

- No more sequential processing
- Attention connects all positions directly
- Positional encoding adds sequence information

## Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

np.random.seed(42)
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')
print('✓ Libraries imported!')

## 1. Multi-Head Attention

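**Idea**: Instead of a single attention function, run several attention "heads" in parallel, each with its own learned projections!

**Benefits**:
- Different heads can attend to different aspects of the input. For example (illustrative only; what each head learns is not fixed in advance):
  - One head might track syntactic relationships
  - Another might track semantic relationships
  - Another might track long-range dependencies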

**Mathematics**:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O$$

Where each head:
$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
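
In code, the split into heads and the final concatenation are usually done with reshapes rather than a loop over heads. The following is a minimal sketch of that reshape only (no attention is computed), assuming illustrative sizes chosen for this example:

```python
import torch

batch, seq_len, d_model, num_heads = 2, 5, 16, 4  # illustrative sizes
d_k = d_model // num_heads

x = torch.randn(batch, seq_len, d_model)

# Split the feature axis into (num_heads, d_k), then move heads ahead of positions.
heads = x.view(batch, seq_len, num_heads, d_k).transpose(1, 2)
print(heads.shape)  # (batch, num_heads, seq_len, d_k)

# Head i holds features [i * d_k, (i + 1) * d_k) of every position.
head_1_matches = torch.equal(heads[:, 1], x[:, :, d_k:2 * d_k])

# Concat(head_1, ..., head_h): undo the transpose and flatten the heads again.
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(head_1_matches, torch.equal(merged, x))
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires contiguous memory.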

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Q, K, V projections for all heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        
        # Output projection
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Q, K, V: (batch, num_heads, seq_len, d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        
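        if mask is not None:
            # Use a large finite negative value rather than -inf: a fully masked
            # row (e.g. a padded query position) would otherwise softmax to NaN.
            scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)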
        
        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        
        return output, attn_weights
    
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        # Linear projections and split into heads
        Q = self.W_q(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        attn_output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        # Final linear layer
        output = self.W_o(attn_output)
        
        return output, attn_weights

print('✓ MultiHeadAttention defined!')

## 2. Positional Encoding

**Problem**: Self-attention has no notion of position/order! Permuting the input tokens simply permutes the outputs.

**Solution**: Add positional information to embeddings.
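
The problem can be checked directly. Below is a small sketch using a bare single-head self-attention with no learned projections (the helper `self_attention` is defined here purely for illustration): shuffling the input rows shuffles the output rows in exactly the same way, so the order of the tokens carries no information.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def self_attention(x):
    # Single-head self-attention with Q = K = V = x (illustrative helper).
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

x = torch.randn(5, 8)                  # 5 tokens, 8 features each
perm = torch.tensor([3, 0, 4, 1, 2])   # an arbitrary reordering of the tokens

out = self_attention(x)
out_perm = self_attention(x[perm])

# Attention over the shuffled input equals the shuffled attention output.
print(torch.allclose(out_perm, out[perm], atol=1e-5))
```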

**Sinusoidal encoding** (Vaswani et al.):

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$

**Why sine/cosine?**
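- For any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, which should make it easy for the model to attend by relative position
- May let the model extrapolate to sequences longer than those seen during training (a hypothesis in the original paper, not a guarantee)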
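
The relative-position property follows from the angle-addition identities: for one (sin, cos) pair, moving from `pos` to `pos + k` is a rotation whose angle depends only on the offset `k`. A quick numerical check, with `d_model`, `i`, `pos`, and `k` picked arbitrarily for illustration:

```python
import math

d_model = 16   # illustrative size
i = 3          # index of the (sin, cos) pair
w = 1.0 / (10000 ** (2 * i / d_model))  # angular frequency of pair i

def pe_pair(pos):
    # (PE[pos, 2i], PE[pos, 2i+1]) from the formulas above
    return math.sin(pos * w), math.cos(pos * w)

pos, k = 7, 5
s, c = pe_pair(pos)

# Rotation by the angle k * w; the coefficients depend on k but not on pos.
shifted = (s * math.cos(k * w) + c * math.sin(k * w),
           c * math.cos(k * w) - s * math.sin(k * w))

target = pe_pair(pos + k)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(shifted, target)))
```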

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1), :]

print('✓ PositionalEncoding defined!')

## 3. Feed-Forward Networks

**Position-wise FFN**: The same two-layer network is applied to each position independently.

$$\text{FFN}(x) = \text{ReLU}(x W_1 + b_1) W_2 + b_2$$

Typically: $d_{ff} = 4 \times d_{model}$
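
As a rough sense of scale, here is the parameter count of one FFN block, assuming $d_{model} = 512$ and the $4\times$ rule above (weights and biases of both linear layers):

```python
d_model = 512
d_ff = 4 * d_model  # 2048

# First linear layer (d_model -> d_ff): weights plus biases
first = d_model * d_ff + d_ff
# Second linear layer (d_ff -> d_model): weights plus biases
second = d_ff * d_model + d_model

ffn_params = first + second
print(ffn_params)  # 2099712
```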

In [None]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

print('✓ PositionwiseFeedForward defined!')

## 4. Encoder Layer

**Each encoder layer has**:
1. Multi-head self-attention
2. Add & Norm (residual + layer norm)
3. Feed-forward network
4. Add & Norm
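
The "Add & Norm" step can be sketched in isolation. In this minimal example, a plain `nn.Linear` stands in for the sub-layer (attention or feed-forward), and the sizes are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8                             # illustrative size
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or the FFN

x = torch.randn(2, 5, d_model)          # (batch, seq_len, d_model)

# Add (residual connection), then Norm (layer normalization over the feature axis).
y = norm(x + sublayer(x))

print(y.shape)                               # same shape as x
print(y.mean(dim=-1).abs().max())            # close to 0 at every position
print(y.var(dim=-1, unbiased=False).mean())  # close to 1
```

Normalizing after the residual addition, as here, is the "post-norm" arrangement of the original paper.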

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Self-attention
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout1(attn_output))
        
        # Feed-forward
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout2(ff_output))
        
        return x

print('✓ EncoderLayer defined!')

## 5. Complete Transformer Encoder

In [None]:
class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, dropout=0.1):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Embed and add positional encoding
        x = self.embedding(x) * math.sqrt(self.embedding.embedding_dim)
        x = self.pos_encoding(x)
        x = self.dropout(x)
        
        # Pass through encoder layers
        for layer in self.layers:
            x = layer(x, mask)
        
        return x

print('✓ TransformerEncoder defined!')

## 6. Decoder Layer

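**Each decoder layer adds**:
- Masked self-attention (each position can attend only to itself and earlier positions)
- Cross-attention (queries come from the decoder; keys and values come from the encoder output)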
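
A minimal sketch of the masking idea, using random numbers as stand-in attention scores and an arbitrary sequence length of 4:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 4  # illustrative size

# Lower-triangular mask: position i may attend to positions 0..i only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal_mask.int())

scores = torch.randn(seq_len, seq_len)  # stand-in attention scores
scores = scores.masked_fill(~causal_mask, float('-inf'))
weights = F.softmax(scores, dim=-1)

# Weights above the diagonal are exactly zero, and each row still sums to 1.
print(weights)
```

Every row keeps at least its diagonal entry, so the softmax is well defined here even with `-inf` as the fill value.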

In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask=None, tgt_mask=None):
        # Masked self-attention
        attn_output, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout1(attn_output))
        
        # Cross-attention
        attn_output, _ = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout2(attn_output))
        
        # Feed-forward
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout3(ff_output))
        
        return x

print('✓ DecoderLayer defined!')

## 7. Complete Transformer

In [None]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, num_heads=8,
                 d_ff=2048, num_encoder_layers=6, num_decoder_layers=6, dropout=0.1):
        super().__init__()
        
        # Encoder
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.encoder_pos = PositionalEncoding(d_model)
        self.encoder_layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_encoder_layers)
        ])
        
        # Decoder
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.decoder_pos = PositionalEncoding(d_model)
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_decoder_layers)
        ])
        
        # Output projection
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)
        
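    def generate_mask(self, src, tgt):
        # Padding masks; token id 0 is assumed to be the padding index
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, src_len)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)  # (batch, 1, tgt_len, 1)
        
        # Causal ("no-peek") mask: position i may attend only to positions <= i
        seq_length = tgt.size(1)
        nopeek_mask = torch.tril(torch.ones(1, seq_length, seq_length, device=tgt.device)).bool()
        tgt_mask = tgt_mask & nopeek_mask  # (batch, 1, tgt_len, tgt_len)
        
        return src_mask, tgt_mask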
    
    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        
        # Encode
        enc_output = self.encoder_embedding(src) * math.sqrt(self.encoder_embedding.embedding_dim)
        enc_output = self.encoder_pos(enc_output)
        enc_output = self.dropout(enc_output)
        
        for layer in self.encoder_layers:
            enc_output = layer(enc_output, src_mask)
        
        # Decode
        dec_output = self.decoder_embedding(tgt) * math.sqrt(self.decoder_embedding.embedding_dim)
        dec_output = self.decoder_pos(dec_output)
        dec_output = self.dropout(dec_output)
        
        for layer in self.decoder_layers:
            dec_output = layer(dec_output, enc_output, src_mask, tgt_mask)
        
        # Project to vocabulary
        output = self.fc_out(dec_output)
        
        return output

print('✓ Complete Transformer defined!')
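
To make the mask shapes concrete, the sketch below rebuilds the two masks outside the class for a tiny hand-written batch. It assumes padding id 0, as `generate_mask` does; the token ids are made up. It also shows one consequence worth knowing: the row of a padded target position is fully masked, so the fill value used before the softmax matters.

```python
import torch
import torch.nn.functional as F

PAD = 0  # padding id assumed by generate_mask
src = torch.tensor([[5, 7, 9, PAD]])  # (batch=1, src_len=4), made-up ids
tgt = torch.tensor([[3, 8, PAD]])     # (batch=1, tgt_len=3), made-up ids

src_mask = (src != PAD).unsqueeze(1).unsqueeze(2)  # (1, 1, 1, 4)
pad_rows = (tgt != PAD).unsqueeze(1).unsqueeze(3)  # (1, 1, 3, 1)
causal = torch.tril(torch.ones(1, 3, 3)).bool()    # (1, 3, 3)
tgt_mask = pad_rows & causal                       # broadcasts to (1, 1, 3, 3)

print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0, 0].int())  # the last row (padded position) is all zeros

scores = torch.zeros(1, 1, 3, 3)  # stand-in attention scores
finite_min = torch.finfo(scores.dtype).min
with_inf = F.softmax(scores.masked_fill(tgt_mask == 0, float('-inf')), dim=-1)
with_min = F.softmax(scores.masked_fill(tgt_mask == 0, finite_min), dim=-1)

print(with_inf[0, 0, 2])  # fully masked row with -inf: NaN
print(with_min[0, 0, 2])  # finite fill: uniform weights, no NaN
```

Both masks rely on broadcasting against attention scores of shape `(batch, num_heads, query_len, key_len)`.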

**Exercise 1**: Test the Transformer

1. Initialize a small Transformer
2. Pass dummy input through it
3. Count total parameters
4. Compare with RNN/LSTM of similar capacity

In [None]:
# YOUR CODE HERE

## Summary

### Key Concepts:

1. **Multi-Head Attention**: Multiple attention mechanisms in parallel
2. **Positional Encoding**: Inject sequence order information
3. **Layer Normalization**: Stabilize training
4. **Residual Connections**: Enable deep networks
5. **Encoder-Decoder**: Process and generate sequences

### Transformer Advantages:

✅ Parallelizable (fast training)  
✅ Long-range dependencies  
✅ State-of-the-art results on machine translation in the original paper  
✅ Transfer learning (BERT, GPT)  

### What's Next?

In **Module 07: BERT**, we'll learn about encoder-only transformers for understanding tasks.

### Resources:

- **Original Paper**: [Attention is All You Need](https://arxiv.org/abs/1706.03762)
- **Illustrated Transformer**: [Jay Alammar's Blog](http://jalammar.github.io/illustrated-transformer/)