# **CHAPTER 14: TRANSFORMERS & MODERN NLP**

*Attention Is All You Need*

## **Chapter Overview**

The Transformer architecture, introduced in 2017, eliminated recurrence and replaced it with self-attention, which lets training parallelize across sequence positions and lets any token attend directly to any other, however distant. This chapter covers the complete modern NLP stack: from the mathematical mechanics of multi-head attention to fine-tuning strategies for BERT and GPT-style models, tokenization algorithms, and the implementation details that separate toy models from production systems like ChatGPT.

**Estimated Time:** 60-70 hours (4-5 weeks)  
**Prerequisites:** Chapters 10-13 (Neural networks, attention mechanisms from Seq2Seq, PyTorch proficiency)

---

## **14.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement multi-head self-attention and positional encodings from scratch
2. Build and train Transformer encoder and decoder blocks with proper layer normalization and residual connections
3. Fine-tune pre-trained models (BERT, RoBERTa, GPT, T5) for downstream tasks using Hugging Face Transformers
4. Implement tokenization algorithms (BPE, WordPiece, SentencePiece) and understand their trade-offs
5. Apply advanced fine-tuning techniques: LoRA, prompt tuning, and instruction tuning
6. Build complete NLP pipelines for classification, NER, question answering, and text generation

---

## **14.1 The Transformer Architecture**

The seminal paper "Attention Is All You Need" (Vaswani et al., 2017) introduced an architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

#### **14.1.1 High-Level Structure**

**Encoder:** Processes input sequence, produces contextualized representations.  
**Decoder:** Generates output sequence autoregressively, attending to encoder outputs and previously generated tokens.

**Key Innovations:**
1. **Self-Attention:** Direct global dependencies between any positions
2. **Multi-Head Attention:** Multiple attention "views" in parallel
3. **Positional Encodings:** Inject sequence order information
4. **Layer Normalization:** Stabilize deep network training
5. **Residual Connections:** Enable gradient flow in deep stacks (12-24+ layers)

#### **14.1.2 Self-Attention Mechanism**

For input $\mathbf{X} \in \mathbb{R}^{n \times d_{model}}$:

1. **Linear Projections:**
   $$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$
   
   Where $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V \in \mathbb{R}^{d_{model} \times d_k}$

2. **Scaled Dot-Product Attention:**
   $$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

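**Why scale by $\sqrt{d_k}$?** If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$. For large $d_k$, dot products therefore grow large in magnitude, pushing softmax into regions with small gradients. Dividing by $\sqrt{d_k}$ brings the variance back to 1.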
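The variance argument can be checked numerically. The sketch below is only an illustration under the assumption that query and key components are i.i.d. standard normal; the helper name `dot_product_variance` is made up for this example and uses plain Python rather than PyTorch so it runs anywhere.

```python
import math
import random

def dot_product_variance(d_k, num_samples=2000, scale=False, seed=0):
    """Empirical variance of q.k for random q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    values = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(qi * ki for qi, ki in zip(q, k))
        # Optionally apply the 1/sqrt(d_k) scaling used in attention
        values.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(values) / num_samples
    return sum((v - mean) ** 2 for v in values) / num_samples

for d_k in (4, 64):
    raw = dot_product_variance(d_k)
    scaled = dot_product_variance(d_k, scale=True)
    print(f"d_k={d_k:3d}  var(q.k)={raw:7.2f}  var(q.k/sqrt(d_k))={scaled:5.2f}")
```

The unscaled variance should come out close to $d_k$, while the scaled variance stays near 1 regardless of dimension.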

**Implementation:**
```python
import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, query, key, value, mask=None):
        d_k = query.size(-1)
        
        # Matmul and scale
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
        
        # Mask (for padding or causal attention)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        
        output = torch.matmul(attn_weights, value)
        return output, attn_weights
```

#### **14.1.3 Multi-Head Attention**

Project into $h$ subspaces, apply attention in parallel, concatenate:

$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\mathbf{W}^O$$

Where $\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
        self.attention = ScaledDotProductAttention(dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        
        # Linear projections and reshape for multi-head
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        attn_output, attn_weights = self.attention(Q, K, V, mask)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )
        
        output = self.W_o(attn_output)
        return output, attn_weights
```
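A quick piece of arithmetic helps build intuition for the class above: because each head works in a $d_{model}/h$-dimensional subspace, the number of heads does not change the parameter count. The sketch below assumes four dense layers with biases, as in the `MultiHeadAttention` code; the helper names are hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one dense layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)

def multi_head_attention_params(d_model, num_heads):
    """Count parameters of four d_model x d_model projections (Q, K, V, output)."""
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    return 4 * linear_params(d_model, d_model)

for heads in (1, 8, 16):
    d_k = 512 // heads
    print(f"heads={heads:2d}  d_k={d_k:3d}  params={multi_head_attention_params(512, heads)}")
```

For `d_model=512` every configuration gives 1,050,624 parameters; more heads means narrower heads, not more weights.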

#### **14.1.4 Positional Encodings**

Since Transformers have no recurrence or convolution, we must inject position information.

**Sinusoidal (Original):**
$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$

**Learned Positional Embeddings:** Trainable vectors (used in BERT, GPT).

**Relative Positional Encodings (RoPE, ALiBi):** Encode relative distances rather than absolute positions (better for long sequences).

```python
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * 
            (-math.log(10000.0) / d_model)
        )
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)  # (1, max_len, d_model): broadcasts over the batch
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # x: (batch, seq_len, d_model), the batch-first layout used by MultiHeadAttention
        x = x + self.pe[:, :x.size(1)]
        return x
```
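The sinusoidal formula can also be evaluated by hand for a single position. The following is a dependency-free sketch (function names are illustrative, and `d_model` is assumed to be even) that also checks one well-known property: the dot product of two encodings depends only on the distance between the positions.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = []
    for even_idx in range(0, d_model, 2):
        # even_idx plays the role of 2i in the formula
        angle = pos / (10000 ** (even_idx / d_model))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(sinusoidal_encoding(0, 4))   # position 0: sin(0) = 0, cos(0) = 1

# Same offset (5 positions apart), different absolute positions
near_start = dot(sinusoidal_encoding(3, 16), sinusoidal_encoding(8, 16))
further_on = dot(sinusoidal_encoding(10, 16), sinusoidal_encoding(15, 16))
print(abs(near_start - further_on) < 1e-9)
```

The equality follows from $\sin a \sin b + \cos a \cos b = \cos(a - b)$ applied to each frequency.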

#### **14.1.5 Transformer Block (Encoder)**

```
Input → Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm → Output
```

**Feed-Forward Network:** Two linear transformations with ReLU (or GELU):
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

**Layer Normalization:** Normalize across feature dimension (unlike BatchNorm which normalizes across batch).
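To make the "across feature dimension" point concrete, here is a minimal plain-Python sketch of layer normalization without the learned scale and shift. It is an illustration, not the `nn.LayerNorm` implementation; it assumes the biased variance and `eps=1e-5`, which match PyTorch's defaults.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Two "tokens" whose features differ only by a factor of 10
batch = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
normalized = [layer_norm(row) for row in batch]
for row in normalized:
    print([round(v, 3) for v in row])
```

Each row is normalized using only its own statistics, so the result does not depend on what else is in the batch, and both rows end up nearly identical.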

```python
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        
        self.self_attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),  # or GELU for modern variants
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        
    def forward(self, src, src_mask=None):
        # Self-attention block with residual and norm
        src2, _ = self.self_attn(src, src, src, src_mask)
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        
        # Feed-forward block with residual and norm
        src2 = self.feed_forward(src)
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        
        return src
```

---

## **14.2 Pre-trained Models and Fine-Tuning**

#### **14.2.1 BERT (Bidirectional Encoder Representations from Transformers)**

**Architecture:** Stack of Transformer encoder layers.  
**Training:** 
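1. **MLM (Masked Language Modeling):** Select 15% of tokens and predict the originals. Of the selected tokens, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. Enables bidirectional context.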
2. **NSP (Next Sentence Prediction):** Predict if sentence B follows A (later found less important, removed in RoBERTa).
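The MLM corruption step can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: it follows the 80/10/10 split described in the BERT paper (replace with `[MASK]`, replace with a random token, keep unchanged), and the function name, arguments, and example IDs are assumptions made for this sketch.

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, special_ids, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling.

    labels hold the original ID at selected positions and -100 elsewhere
    (-100 is the index PyTorch's CrossEntropyLoss ignores by default).
    """
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Never corrupt special tokens; select the rest with probability mask_prob
        if tok in special_ids or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                     # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)   # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

# Example IDs assume the bert-base-uncased vocabulary: 101=[CLS], 102=[SEP], 103=[MASK]
ids = [101] + list(range(1000, 1200)) + [102]
inputs, labels = mask_tokens(ids, vocab_size=30522, mask_id=103, special_ids={101, 102}, seed=0)
print(sum(label != -100 for label in labels), "of", len(ids) - 2, "tokens selected")
```

Keeping some selected tokens unchanged or random means the model cannot assume that every non-`[MASK]` input token is correct.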

**Input Representation:**
```
[CLS] The cat sat [MASK] the mat [SEP] It was comfortable [SEP]
|<----------- Segment A ------------>| |<---- Segment B ----->|
```

```python
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', 
    num_labels=2
)

# Tokenize
inputs = tokenizer(
    "This movie was great!", 
    return_tensors="pt", 
    padding=True, 
    truncation=True,
    max_length=512
)

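# Forward pass (the classification head is randomly initialized until fine-tuned,
# so these logits are not yet meaningful predictions)
outputs = model(**inputs)
logits = outputs.logits  # shape: (batch_size, num_labels)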
```

#### **14.2.2 GPT (Generative Pre-trained Transformer)**

**Architecture:** Decoder-only (autoregressive).  
**Training:** Causal language modeling (predict next token given previous).  
**Key Feature:** Unidirectional (left-to-right) attention with causal masking.

**Causal Mask:** Prevents attending to future positions.
```python
def generate_causal_mask(size):
    """Lower triangular matrix (including diagonal)"""
    mask = torch.tril(torch.ones(size, size)).unsqueeze(0).unsqueeze(0)
    return mask  # 1 = attend, 0 = mask
```
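To see what the mask does to the attention weights, the sketch below applies a causal softmax to a small score matrix in plain Python. Restricting each row to positions `0..i` is equivalent to filling the masked scores with a large negative number before softmax; the function name is illustrative.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only
        visible = scores[i][: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

scores = [[0.0] * 4 for _ in range(4)]   # uniform scores for readability
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

With uniform scores, row `i` spreads its weight evenly over the `i + 1` visible positions and assigns exactly zero to everything in the future.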

#### **14.2.3 T5 (Text-to-Text Transfer Transformer)**

**Architecture:** Encoder-Decoder.  
**Philosophy:** Frame all NLP tasks as text generation.
- Translation: `translate English to German: Hello world`
- Classification: `sentiment: This movie was great`
- Summarization: `summarize: [article text]`

---

## **14.3 Tokenization**

Neural networks process numbers, not text. Tokenization converts text to IDs.

#### **14.3.1 Byte-Pair Encoding (BPE)**

Start with character vocabulary, iteratively merge most frequent pairs.

```
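Corpus:  low, lower, lowest, new, newer
Initial: l o w </w>, l o w e r </w>, l o w e s t </w>, n e w </w>, n e w e r </w>
Merge 'l'+'o' -> 'lo': low, lower, lowest (3 occurrences; ties are broken arbitrarily)
Merge 'lo'+'w' -> 'low': low, lower, lowest
Merge 'e'+'r' -> 'er': lower, newer
Merge 'er'+'</w>' -> 'er</w>': lower, newer
...
Final vocab: low, er, est, new, ...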
```
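The merge loop itself is short. Below is a minimal training sketch in plain Python, not the implementation used by `tokenizers` or `tiktoken`; the function name is hypothetical, every word occurrence counts once, and ties between equally frequent pairs go to whichever pair was counted first (an arbitrary choice).

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words (each occurrence counts once)."""
    # Represent every word as a tuple of symbols ending in the marker </w>
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; on ties, max() returns the pair counted first
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "lower", "lowest", "new", "newer"], num_merges=2)
print(merges)
print(list(vocab))
```

On this five-word corpus the first two merges build `low`. From the third merge onward several pairs tie at two occurrences, so the order depends on the tie-breaking rule and can differ between implementations.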

**Subword units:** BPE handles OOV (out-of-vocabulary) words by breaking them into subwords: `unhappiness` → `un`, `happiness` (or further, e.g. `un`, `happ`, `iness`).

**Libraries:** Hugging Face `tokenizers`, OpenAI `tiktoken`.

#### **14.3.2 WordPiece (BERT)**

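Similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data under the current vocabulary. Continuation pieces are marked with a `##` prefix (e.g. `##ing`).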
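One commonly cited formulation of this criterion (for example in the Hugging Face course) scores a candidate pair by its frequency divided by the product of the frequencies of its two parts. The original WordPiece training code was not published, so treat the formula below as an approximation; the counts are invented for illustration.

```python
def wordpiece_score(pair_count, left_count, right_count):
    """Pair frequency normalized by the frequencies of its two parts."""
    return pair_count / (left_count * right_count)

# Hypothetical counts (not from any real corpus):
# pair A is frequent, but both of its parts are frequent everywhere
score_a = wordpiece_score(pair_count=20, left_count=200, right_count=100)
# pair B is rarer, but its parts almost always occur together
score_b = wordpiece_score(pair_count=5, left_count=5, right_count=10)

print(f"A: {score_a:.4f}  B: {score_b:.4f}")
print("BPE would merge A first (higher raw count); this score prefers B")
```

The normalization favors pairs whose parts rarely appear apart, which is the practical difference from BPE's raw-frequency rule.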

#### **14.3.3 SentencePiece (T5, XLNet)**

Language-agnostic, operates on raw text (including whitespace as tokens). Uses BPE or Unigram algorithm.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Transformers are amazing!"
tokens = tokenizer.tokenize(text)
# ['transformers', 'are', 'amazing', '!']

ids = tokenizer.convert_tokens_to_ids(tokens)
# [19082, 2024, 6429, 999]

# Special tokens
encoded = tokenizer(text, return_tensors="pt")
# input_ids: [101, 19082, 2024, 6429, 999, 102]
# 101 = [CLS], 102 = [SEP]
```

---

## **14.4 Advanced Fine-Tuning Techniques**

#### **14.4.1 LoRA (Low-Rank Adaptation)**

Instead of fine-tuning all parameters (BERT-base: 110M), freeze the pre-trained weights $W_0$ and train a low-rank update $BA$ injected into selected layers, typically the attention projections.

$$W = W_0 + \Delta W = W_0 + BA$$

Where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$ (typical ranks: 4-64).

**Benefits:**
- Trainable params: 0.1-1% of full model
- No added inference latency (after training, $BA$ can be merged into $W_0$)
- Store separate adapters per task
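The savings and the merge property can both be checked with small numbers. The sketch below is plain Python with hypothetical helper names; the 768 x 768 shape is an assumption chosen to match one BERT-base attention projection.

```python
def lora_param_count(d, k, r):
    """Trainable parameters of a rank-r update BA for a d x k weight matrix."""
    return r * (d + k)

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d, k, r = 768, 768, 16
full = d * k
lora = lora_param_count(d, k, r)
print(f"full: {full}, LoRA: {lora} ({100 * lora / full:.2f}% of this matrix)")

# Tiny numeric check that merging is exact: (W0 + BA) x == W0 x + B (A x)
W0 = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]          # d x r with r = 1
A = [[0.5, -1.0]]           # r x k
x = [[1.0], [1.0]]          # column vector
merged = matmul(add(W0, matmul(B, A)), x)
separate = add(matmul(W0, x), matmul(B, matmul(A, x)))
print(merged, separate)
```

The per-matrix fraction here is about 4%; the whole-model fraction is lower because only some weight matrices receive adapters and everything else stays frozen.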

```python
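from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Base model to adapt; "query"/"value" below are BERT's attention module names
base_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

config = LoraConfig(
    r=16,                               # rank of the update matrices
    lora_alpha=32,                      # scaling factor (update is scaled by lora_alpha / r)
    target_modules=["query", "value"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,         # must match the model head (CAUSAL_LM for GPT-style models)
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# Only LoRA parameters (and the classification head) require gradients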
```

#### **14.4.2 Prompt Tuning / Prefix Tuning**

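Add trainable "soft prompts" (continuous vectors) to the input and keep the model frozen. Prompt tuning prepends them to the input embeddings only; prefix tuning prepends trainable vectors to the keys and values of every attention layer.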

```
Original: [CLS] The movie was good [SEP]
Prompt tuned: [P1] [P2] ... [P10] [CLS] The movie was good [SEP]
                (trainable embeddings)
```

---

## **14.5 Modern NLP Tasks**

#### **14.5.1 Named Entity Recognition (NER) with BERT**

Token classification: Each token gets BIO tag.

```python
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained(
    'bert-base-cased',
    num_labels=9  # B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC, O
)

# Handle subword tokenization: WordPiece may split one word into several pieces
# (illustrative: "Wash", "##ington"), so word-level labels must be aligned to
# tokens using the tokenizer output's word_ids()
```
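The alignment step mentioned in the comment can be sketched as follows. This follows a common recipe in which only the first sub-token of each word keeps its label and the rest are ignored by the loss; it is one convention among several, and the function name and example values are hypothetical.

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Expand word-level labels to token-level labels.

    word_ids maps each token to the index of its source word, with None for
    special tokens (the format returned by a fast tokenizer's word_ids()).
    Only the first sub-token of each word keeps the label; the rest are ignored.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned

# Hypothetical example: the second word is split into two sub-tokens
word_labels = [0, 5, 0]                 # e.g. O, B-LOC, O
word_ids = [None, 0, 1, 1, 2, None]     # [CLS] w0 w1a w1b w2 [SEP]
print(align_labels(word_labels, word_ids))   # [-100, 0, 5, -100, 0, -100]
```

An alternative convention labels continuation pieces with the matching `I-` tag instead of ignoring them; whichever is used, evaluation should map predictions back to whole words.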

#### **14.5.2 Question Answering (Extractive)**

Span prediction: Predict start and end indices of answer in context.

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Pass question and context as a sentence pair (text, text_pair)
inputs = tokenizer(
    "What is the capital of France?",
    "The capital of France is Paris.",
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Get best span (the end index is inclusive, so add 1 for slicing)
answer_start = torch.argmax(start_scores)
answer_end = torch.argmax(end_scores) + 1
answer = tokenizer.convert_tokens_to_string(
    tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end])
)
```
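Taking the two argmaxes independently can produce an end index that precedes the start. A common remedy is to search for the best valid pair instead. The sketch below does this on plain lists of logits; the function name and the `max_answer_len` default are assumptions, and a real pipeline would also exclude question tokens from the search.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) indices, end inclusive, maximizing start + end score."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Made-up logits where independent argmaxes disagree: start=2 but end=0
start = [0.1, 0.2, 5.0, 0.3]
end = [4.0, 0.1, 0.2, 3.0]
print(best_span(start, end))   # (2, 3)
```

The search is quadratic in the worst case, but the length limit keeps it cheap in practice.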

#### **14.5.3 Text Generation**

Strategies to improve quality:
- **Temperature:** Divide logits by $T$ before softmax ($T<1$ = more focused, $T>1$ = more random)
- **Top-k Sampling:** Sample from $k$ most likely tokens
- **Top-p (Nucleus) Sampling:** Sample from smallest set whose cumulative probability $\geq p$
- **Beam Search:** Keep $k$ best partial sequences at each step (better for translation, worse for creative writing)
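The first three strategies are simple transformations of the next-token distribution. The sketch below implements them on plain Python lists to show the mechanics; it is not how `generate` is implemented internally, and the function names are made up for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Zero out tokens outside the top-k / nucleus set and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:   # smallest set reaching probability mass p
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])

probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs, top_k=2))
print(filter_top_k_top_p(probs, top_p=0.9))
```

Sampling then draws from the filtered distribution, for example with `random.choices(range(len(p)), weights=p)`.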

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

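inputs = tokenizer("Once upon a time", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=100,                       # total length, including the prompt tokens
    num_return_sequences=3,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    no_repeat_ngram_size=2,               # Prevent repetition
    pad_token_id=tokenizer.eos_token_id   # GPT-2 has no pad token
)

for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))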
```

---

## **14.6 Workbook Labs**

### **Lab 1: Transformer from Scratch**
Implement complete Transformer (Encoder only) for classification:
1. Multi-head attention with masking
2. Positional encodings
3. LayerNorm and residuals
4. Train on IMDB sentiment (achieve >80% test accuracy without pre-training)

**Deliverable:** `transformer_scratch.py` training to convergence.

### **Lab 2: BERT Fine-Tuning for NER**
CoNLL-2003 dataset:
1. Token alignment (handle subwords)
2. Fine-tune `bert-base-cased`
3. Evaluate with entity-level F1 (not token-level)
4. Error analysis: Which entity types confuse the model?

**Deliverable:** NER pipeline with F1 > 0.90.

### **Lab 3: Instruction Tuning (Mini)**
Create small instruction dataset (100 examples):
1. Use LoRA to fine-tune GPT-2/Llama-2-7b
2. Format: `### Instruction: ... ### Input: ... ### Response: ...`
3. Compare zero-shot vs fine-tuned performance on held-out instructions

**Deliverable:** LoRA adapter weights and inference script showing improved instruction following.

### **Lab 4: Efficient Inference Optimization**
Optimize BERT for production:
1. ONNX export and quantization (INT8)
2. Distillation: Train smaller student (6-layer) from teacher (12-layer)
3. Benchmark: Latency vs Accuracy trade-off curve

**Deliverable:** Speed/accuracy report showing 3x speedup with <2% accuracy drop.

---

## **14.7 Common Pitfalls**

1. **Attention Mask Confusion:** Padding tokens must be masked (set to -inf before softmax), but causal masking for GPT is different (lower triangular).

2. **Position IDs Wrong:** When using left-padding for batching, position IDs must reflect actual token positions, not indices in tensor.

3. **Learning Rate Too High for Fine-Tuning:** Pre-trained models need small LR (1e-5 to 3e-5), not default 1e-3, or they catastrophically forget.

4. **Max Length Issues:** BERT limited to 512 tokens. For longer documents, use sliding window, Longformer, or hierarchical approaches.

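5. **Not Freezing Existing Embeddings When Adding Tokens:** When adding new tokens to the vocabulary, resize the embedding matrix (`model.resize_token_embeddings(len(tokenizer))`), then initially train only the new embeddings and keep the others frozen.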
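Pitfall 2 is easy to illustrate. With left-padding, a naive `range(seq_len)` gives the first real token a position equal to the number of pad tokens. One way to avoid this, sketched here in plain Python with a hypothetical helper name, is to derive positions from the attention mask.

```python
def position_ids_from_mask(attention_mask):
    """Positions count real tokens only; padding positions get a dummy 0."""
    position_ids, count = [], 0
    for m in attention_mask:
        if m == 1:
            position_ids.append(count)
            count += 1
        else:
            # Value is arbitrary: padded positions are masked out of attention
            position_ids.append(0)
    return position_ids

left_padded = [0, 0, 1, 1, 1]            # two pad tokens, then three real tokens
print(list(range(len(left_padded))))     # naive positions: [0, 1, 2, 3, 4]
print(position_ids_from_mask(left_padded))   # [0, 0, 0, 1, 2]
```

On tensors the same idea is usually written as a cumulative sum over the attention mask minus one.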

---

## **14.8 Interview Questions**

**Q1:** Why does the Transformer use LayerNorm instead of BatchNorm?
*A: BatchNorm computes statistics across the batch dimension. These statistics are noisy for small batches, are distorted by padding when sequences have different lengths, and differ between training and autoregressive inference. LayerNorm normalizes across the feature dimension for each token independently, so it behaves identically for any batch size or sequence length.*

**Q2:** Explain the difference between encoder-only, decoder-only, and encoder-decoder models. Give examples of each.
*A: Encoder-only (BERT, RoBERTa): Bidirectional attention, good for understanding tasks (classification, NER, embedding). Decoder-only (GPT, LLaMA): Causal (left-to-right) attention, autoregressive generation (text completion, chat). Encoder-decoder (T5, BART): Encoder processes input bidirectionally, decoder generates output autoregressively; good for translation, summarization, where input and output are distinct sequences.*

**Q3:** What is the purpose of scaling the dot product attention by $\sqrt{d_k}$?
*A: For large $d_k$, dot products grow large in magnitude (variance scales with $d_k$). This pushes the softmax function into regions with extremely small gradients (saturation). Scaling by $\sqrt{d_k}$ counteracts this effect, keeping values in a range where softmax gradients remain healthy and training stable.*

**Q4:** How does BPE tokenization handle out-of-vocabulary words?
*A: BPE breaks OOV words into subword units (character n-grams or frequent substrings). For example, "unhappiness" might become ["un", "happiness"] or ["un", "happ", "iness"] depending on vocabulary. This allows the model to understand morphemes and generalize to new words composed of known subwords.*

**Q5:** What are the advantages of LoRA over full fine-tuning?
*A: 1) Memory efficiency: Only train 0.1-1% of parameters. 2) Storage: Save small adapter files per task instead of full model copies. 3) No added inference latency once the adapter is merged into the base weights (adapters can also be kept separate and swapped per task). 4) Less catastrophic forgetting: Original model preserved. 5) Better for low-data regimes by reducing trainable parameters.*

---

## **14.9 Further Reading**

**Papers:**
- "Attention Is All You Need" (Vaswani et al., 2017) - Original Transformer
- "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2019)
- "Language Models are Few-Shot Learners" (Brown et al., 2020) - GPT-3
- "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)

**Resources:**
- Hugging Face Course: https://huggingface.co/course
- "The Illustrated Transformer" (Jay Alammar) - Visual guide

---

## **14.10 Checkpoint Project: Production NLP API**

Build a multi-task NLP service serving BERT-based models.

**Requirements:**

1. **Architecture:**
   - FastAPI backend with three endpoints:
     - `/classify`: Sentiment analysis (BERT fine-tuned)
     - `/ner`: Entity extraction (token classification)
     - `/embed`: Sentence embeddings (mean pooling of last hidden states)

2. **Optimization:**
   - Model quantization (INT8) for 3x speedup
   - Batching: Dynamic batching of concurrent requests
   - Caching: Redis cache for frequent queries

3. **Monitoring:**
   - Track latency percentiles (p50, p95, p99)
   - Log prediction confidence scores for drift detection
   - A/B test endpoint comparing base vs quantized model

4. **Deployment:**
   - Docker container with nginx load balancer
   - GPU support (CUDA) with fallback to CPU
   - Health checks and graceful shutdown
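For the `/embed` endpoint, "mean pooling of last hidden states" should average only over real tokens, not padding. A minimal sketch on plain Python lists follows; the function name is hypothetical, and a production version would operate on batched tensors.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1, ignoring padding."""
    dim = len(hidden_states[0])
    total, count = [0.0] * dim, 0
    for vec, m in zip(hidden_states, attention_mask):
        if m == 1:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # max(count, 1) guards against division by zero for an all-padding input
    return [t / max(count, 1) for t in total]

hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]   # last row is padding
mask = [1, 1, 0]
print(mean_pool(hidden, mask))   # [2.0, 3.0]
```

Without the mask, the padding row would pull the embedding far away from the two real tokens.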

**Deliverables:**
- `nlp_api/` with FastAPI app, model loading, and batching logic
- `docker-compose.yml` with Redis and monitoring (Prometheus/Grafana)
- Load test script (Locust) showing 100+ req/sec on single GPU
- Documentation: "API handles 3 concurrent tasks with <100ms p95 latency"

**Success Criteria:**
- End-to-end latency < 50ms for classification (batch_size=1)
- Support batching up to 32 requests dynamically
- 99.9% uptime over 24-hour stress test

---

**End of Chapter 14**

*You now master the Transformer architecture and modern NLP. Chapter 15 will cover Large Language Models (LLMs) and Generative AI — scaling to billions of parameters, RLHF, and production deployment of systems like ChatGPT.*

---