# PyTorch Tutorial: Sequence Models - RNNs to Transformers

From simple RNNs to modern Transformers, this notebook covers the evolution of sequence modeling that powers ChatGPT, Google Translate, and Siri.

## Learning Objectives
- **Part 1**: Understand RNN, LSTM, GRU (The foundations - still asked in interviews!)
- **Part 2**: Implement Seq2Seq models (Translation, summarization)
- **Part 3**: Deep dive into Self-Attention and Transformers
- **Part 4**: Positional Encodings and why they matter
- **Part 5**: GPT vs BERT architectures

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

torch.manual_seed(42)

---

# PART 2: Transformers and Self-Attention

## From RNN to Transformers: The Paradigm Shift (2017)

**The Problem with RNNs/LSTMs:**
- Sequential processing (can't parallelize across timesteps)
- Still struggle with very long sequences
- Slow to train

**The Solution: Self-Attention (2017 - "Attention Is All You Need")**
- Process entire sequence in parallel
- Every token attends to every other token directly
- **Much faster to train than LSTMs on parallel hardware (GPUs/TPUs)**

---

## 1. Self-Attention: The Core Mechanism

In a sentence like "The animal didn't cross the street because **it** was too tired", what does "it" refer to? The street or the animal?

Self-attention allows the model to look at ALL other words simultaneously to figure this out.

### The Formula
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where:
- **Q (Query)**: What I'm looking for ("Who does 'it' refer to?")
- **K (Key)**: What I have to offer ("I'm the word 'animal'")
- **V (Value)**: What I actually contain (embedding of 'animal')

### The Intuition
1. Each word asks a question (Q): "Which words are relevant to me?"
2. Each word offers information (K): "I'm a noun/verb/etc"
3. Calculate compatibility: Q · K (dot product)
4. Attend to relevant words: softmax(scores) × V

### FAANG Interview Question
**"Implement scaled dot-product attention from scratch"** ← Asked at all FAANG!
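
One way to answer this question is sketched below. It follows the formula above directly; the function name, the optional `mask` argument (True means "may attend"), and the tensor sizes are assumptions made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """
    q: (..., seq_q, d_k), k: (..., seq_k, d_k), v: (..., seq_k, d_v)
    mask: optional boolean tensor broadcastable to (..., seq_q, seq_k);
          True marks positions that may be attended to (an assumed convention).
    Returns (output, attention_weights).
    """
    d_k = q.size(-1)
    # Compatibility of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    # Each row of weights is a probability distribution over the keys
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(2, 5, 16)  # (batch, seq_len, d_k)
k = torch.randn(2, 5, 16)
v = torch.randn(2, 5, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)
```

Dividing by `sqrt(d_k)` keeps the scores from growing with the embedding size, which would otherwise push softmax into regions with very small gradients.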

In [None]:
class SimpleRNNCell(nn.Module):
    """
    A single RNN cell (one timestep).
    This is the #1 interview question for sequence modeling!
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        # Weights for input -> hidden
        self.W_xh = nn.Linear(input_size, hidden_size)
        # Weights for hidden -> hidden (the recurrent part!)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)
        
        self.tanh = nn.Tanh()
    
    def forward(self, x, h_prev):
        """
        x: input at current timestep (batch_size, input_size)
        h_prev: hidden state from previous timestep (batch_size, hidden_size)
        """
        h_t = self.tanh(self.W_xh(x) + self.W_hh(h_prev))
        return h_t

# Test the RNN cell
input_size = 10
hidden_size = 20
batch_size = 4

rnn_cell = SimpleRNNCell(input_size, hidden_size)

# Initialize hidden state
h = torch.zeros(batch_size, hidden_size)

# Process a sequence of 5 timesteps
sequence_length = 5
for t in range(sequence_length):
    x_t = torch.randn(batch_size, input_size)
    h = rnn_cell(x_t, h)
    print(f"Timestep {t}: hidden state shape = {h.shape}")

print("\n✓ RNN processes sequence step-by-step, maintaining hidden state!")

---

## 2. Positional Encodings: Why and How

### The Problem
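Self-attention on its own has NO concept of order!
- Without position information, "Dog bites man" and "Man bites dog" give each word exactly the same representation; only the order of the outputs changes.
- RNNs have built-in order (they process tokens sequentially), but Transformers don't.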
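
The order-blindness described above can be checked numerically. The sketch below uses a bare self-attention with Q = K = V = x and no learned projections, which is a simplification chosen for illustration; the three random vectors stand in for word embeddings.

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    # Bare self-attention: Q = K = V = x, no projections, no positional information
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return F.softmax(scores, dim=-1) @ x

torch.manual_seed(0)
x = torch.randn(3, 8)           # stand-ins for "dog", "bites", "man"
perm = torch.tensor([2, 1, 0])  # reorder to "man", "bites", "dog"

out = self_attention(x)
out_perm = self_attention(x[perm])

# Reordering the input only reorders the output rows; each token's vector is unchanged
same = torch.allclose(out[perm], out_perm, atol=1e-5)
print(same)
```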

### The Solution: Positional Encodings
Add position information to each token embedding.

$$
\begin{align*}
PE_{(pos, 2i)} &= \sin(pos / 10000^{2i/d_{model}}) \\
PE_{(pos, 2i+1)} &= \cos(pos / 10000^{2i/d_{model}})
\end{align*}
$$

Where:
- $pos$: Position in sequence (0, 1, 2, ...)
- $i$: Index of the sine/cosine pair ($0 \le i < d_{model}/2$), so $2i$ and $2i+1$ are the even and odd dimensions
- $d_{model}$: Embedding dimension

### Why Sinusoidal?
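1. **Unique** encoding for each position
2. **Relative** positions have consistent patterns: for a fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$
3. **May generalize** to sequences longer than those seen in training (the original paper's hypothesis)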
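
The "relative" property can be checked with plain Python. Because $\sin a \sin b + \cos a \cos b = \cos(a - b)$, the dot product of two encodings depends only on the offset between the positions. The helper names below are illustrative, and `d_model` is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Encoding for a single position, following the formula above (d_model even)."""
    pe = []
    for two_i in range(0, d_model, 2):
        angle = pos / (10000 ** (two_i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_model = 16
offset = 3
# The same offset gives the same dot product, wherever the pair sits in the sequence
for pos in (0, 10, 50):
    similarity = dot(sinusoidal_pe(pos, d_model), sinusoidal_pe(pos + offset, d_model))
    print(f"pos={pos:2d}, pos+{offset}: dot product = {similarity:.6f}")
```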

### Modern Alternatives (2024+)
- **Learned PE**: Train positional embeddings like word embeddings
- **RoPE** (Rotary): Used in Llama, PaLM (rotates query and key vectors by a position-dependent angle instead of adding a vector to the embedding)
- **ALiBi**: Attention bias based on distance
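
As a rough illustration of the rotary idea, the sketch below rotates each pair of dimensions by an angle proportional to the position. It is a minimal version under stated assumptions: pairs are taken as adjacent dimensions (2i, 2i+1), the base is 10000, and the function name `rope` is hypothetical. Real implementations differ in how they pair dimensions.

```python
import torch

def rope(x, base=10000.0):
    """
    Rotate x of shape (seq_len, d), d even.
    Pair (2i, 2i+1) at position pos is rotated by angle pos * base^(-2i/d).
    """
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq_len, 1)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = pos * inv_freq                                                # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

torch.manual_seed(0)
q_vec, k_vec = torch.randn(8), torch.randn(8)
q = rope(q_vec.repeat(6, 1))  # the same query vector placed at positions 0..5
k = rope(k_vec.repeat(6, 1))  # the same key vector placed at positions 0..5
scores = q @ k.T

# Query/key pairs with the same offset (here 2) get the same score
print(scores[2, 0].item(), scores[5, 3].item())
```

Because the rotation is applied to queries and keys rather than added to embeddings, the attention score ends up depending on the relative offset between positions.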

### FAANG Interview Question
**"Why do Transformers need positional encodings?"** ← Asked at Google, OpenAI

In [None]:
import matplotlib.pyplot as plt
import numpy as np

class PositionalEncoding(nn.Module):
    """
    Sinusoidal positional encoding (from "Attention Is All You Need").
    This is THE standard implementation asked in interviews!
    """
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Create matrix of shape (max_len, d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # div_term = 1 / 10000^(2i/d_model), computed via exp/log (d_model must be even)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                            (-math.log(10000.0) / d_model))
        
        # Apply sin to even indices, cos to odd indices
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        # Add batch dimension: (1, max_len, d_model)
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter, but saved with model)
        self.register_buffer('pe', pe)
    
    def forward(self, x):
        """
        x: (batch, seq_len, d_model)
        """
        # Add positional encoding to input
        x = x + self.pe[:, :x.size(1), :]
        return x

# Visualize positional encodings
d_model = 128
max_len = 100

pos_enc = PositionalEncoding(d_model, max_len)
pe_matrix = pos_enc.pe.squeeze(0).numpy()

# Plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(pe_matrix.T, aspect='auto', cmap='RdBu', interpolation='nearest')
plt.xlabel('Position')
plt.ylabel('Dimension')
plt.title('Positional Encoding Matrix')
plt.colorbar()

plt.subplot(1, 2, 2)
plt.plot(pe_matrix[:, :4])
plt.xlabel('Position')
plt.ylabel('Value')
plt.title('First 4 Dimensions')
plt.legend([f'Dim {i}' for i in range(4)])
plt.grid(True)

plt.tight_layout()
plt.show()

print("✓ Each position gets a unique encoding!")
print("✓ Sinusoidal pattern allows model to learn relative positions.")

## The Problem with RNNs: Vanishing Gradients

When sequences get long (100+ tokens), gradients tend to vanish or explode during backpropagation through time:
- Gradient flows: Output → h_100 → h_99 → ... → h_1
- Each step multiplies by the recurrent weight matrix (and the activation's derivative)
- Repeated multiplication makes gradients shrink toward zero or grow without bound

**Result:** Plain RNNs struggle to learn long-term dependencies.
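
The shrinking gradient can be observed directly with autograd. The sketch below uses a bare recurrence $h_t = \tanh(W h_{t-1})$ with no inputs, and rescales $W$ so its largest singular value is 0.9; both are assumptions chosen so that vanishing is guaranteed rather than a claim about trained RNNs.

```python
import torch

torch.manual_seed(0)
hidden = 16
W = torch.randn(hidden, hidden)
# Assumption for the demo: force the largest singular value of W to 0.9
W = 0.9 * W / torch.linalg.matrix_norm(W, ord=2)

def grad_norm_after(steps):
    """Norm of d(sum(h_T)) / d(h_0) for h_t = tanh(W h_{t-1})."""
    h0 = torch.randn(hidden, requires_grad=True)
    h = h0
    for _ in range(steps):
        h = torch.tanh(W @ h)
    h.sum().backward()
    return h0.grad.norm().item()

for steps in (1, 10, 50, 100):
    print(f"{steps:3d} steps: gradient norm = {grad_norm_after(steps):.2e}")
```

Each backward step multiplies the gradient by a matrix whose norm is at most 0.9, so after T steps the norm is bounded by a constant times 0.9^T.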

---

## LSTM: Long Short-Term Memory (1997, Still Used Today!)

### The Breakthrough
Add explicit **memory cells** with **gates** that control:
1. **Forget gate**: What to remove from memory
2. **Input gate**: What new information to add
3. **Output gate**: What to expose as hidden state

### The Equations (Don't Panic!)
$$
\begin{align*}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad &\text{(Forget gate)} \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad &\text{(Input gate)} \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad &\text{(Candidate values)} \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \quad &\text{(Update cell state)} \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad &\text{(Output gate)} \\
h_t &= o_t * \tanh(C_t) \quad &\text{(Final hidden state)}
\end{align*}
$$

### The Intuition
- **Cell state** ($C_t$): The "highway" where information flows with only elementwise gating (no repeated matrix multiplication)
- **Gates**: Traffic lights that decide what flows through

### FAANG Interview Alert
**"Explain the difference between RNN and LSTM"** ← Top 5 most asked question!

In [None]:
class LSTMCell(nn.Module):
    """
    LSTM Cell from scratch - asked at Google, Meta, Amazon!
    """
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        # All 4 gates can be computed together (efficiency trick!)
        # Input: [x_t; h_{t-1}] concatenated (same order as torch.cat in forward)
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
    
    def forward(self, x, states):
        """
        x: (batch, input_size)
        states: (h_prev, c_prev) where each is (batch, hidden_size)
        """
        h_prev, c_prev = states
        
        # Concatenate input and hidden state
        combined = torch.cat([x, h_prev], dim=1)
        
        # Compute all gates at once, then split
        gates = self.gates(combined)
        i, f, g, o = gates.chunk(4, dim=1)
        
        # Apply activations
        i = torch.sigmoid(i)  # Input gate
        f = torch.sigmoid(f)  # Forget gate
        g = torch.tanh(g)     # Candidate values
        o = torch.sigmoid(o)  # Output gate
        
        # Update cell state (the memory!)
        c_t = f * c_prev + i * g
        
        # Update hidden state
        h_t = o * torch.tanh(c_t)
        
        return h_t, c_t

# Test LSTM
lstm_cell = LSTMCell(input_size=10, hidden_size=20)

h = torch.zeros(4, 20)
c = torch.zeros(4, 20)

print("LSTM processing sequence:")
for t in range(5):
    x_t = torch.randn(4, 10)
    h, c = lstm_cell(x_t, (h, c))
    print(f"Step {t}: h={h.shape}, c={c.shape}")

print("\n✓ LSTM maintains TWO states: hidden (h) and cell (c)!")
print("The cell state (c) is the 'memory highway' that mitigates vanishing gradients.")

---

## Seq2Seq: Sequence-to-Sequence Models (The Bridge to Transformers)

### The Problem
- RNN/LSTM can process one sequence → one output
- But what about sequence → sequence? (Translation, Summarization)
  - Input: "Hello world" → Output: "Hola mundo"

### The Solution: Encoder-Decoder Architecture (2014)

```
Encoder (LSTM)                Decoder (LSTM)
    ↓                              ↓
Input Sequence  →  Context Vector  →  Output Sequence
"Hello world"       (fixed size)      "Hola mundo"
```

**How it works:**
1. **Encoder**: Reads input sequence, compresses to fixed-size vector (context)
2. **Decoder**: Generates output sequence from context vector

### The Problem with Seq2Seq
**Bottleneck**: All information compressed into ONE vector!
- Long sequences lose information
- This is what **Attention** was invented to solve (2015)

### FAANG Interview Question
**"Explain the Seq2Seq architecture and its limitation"** ← Asked at Google, Meta

**Answer:**
1. Encoder processes input → context vector
2. Decoder generates output from context
3. **Limitation**: Fixed-size context is a bottleneck for long sequences
4. **Solution**: Attention mechanism (attend to different encoder states)
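
The last point can be sketched in a few lines: instead of relying on one fixed vector, the decoder builds a fresh context vector at every step as a weighted sum of all encoder states. Dot-product scoring is used here as one common choice, not necessarily the original formulation, and the function name `attend` is illustrative.

```python
import torch
import torch.nn.functional as F

def attend(decoder_hidden, encoder_outputs):
    """
    decoder_hidden:  (batch, hidden)           current decoder state (the query)
    encoder_outputs: (batch, src_len, hidden)  one state per source token
    Returns context (batch, hidden) and weights (batch, src_len).
    """
    # Score each encoder state against the decoder state (dot product)
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)
    # Context = weighted sum of encoder states
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights

torch.manual_seed(0)
encoder_outputs = torch.randn(4, 10, 256)
decoder_hidden = torch.randn(4, 256)
context, weights = attend(decoder_hidden, encoder_outputs)
print(context.shape, weights.shape)
```

To use this with an LSTM encoder, the encoder would need to return its per-token `outputs` in addition to the final hidden and cell states.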

In [None]:
# Seq2Seq Implementation (Simplified)

class Encoder(nn.Module):
    """Encodes input sequence into a context vector"""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
    
    def forward(self, x):
        # x: (batch, seq_len) - token IDs
        embedded = self.embedding(x)  # (batch, seq_len, hidden_size)
        outputs, (hidden, cell) = self.lstm(embedded)
        # Return final hidden state as context
        return hidden, cell

class Decoder(nn.Module):
    """Generates output sequence from context vector"""
    def __init__(self, output_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x, hidden, cell):
        # x: (batch, 1) - one token at a time
        embedded = self.embedding(x)  # (batch, 1, hidden_size)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))  # (batch, output_size)
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    """
    Complete Seq2Seq model for sequence-to-sequence tasks.
    Used in early machine translation systems (pre-Transformer).
    """
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        """
        src: source sequence (batch, src_len)
        trg: target sequence (batch, trg_len)
        teacher_forcing_ratio: probability of using true token vs predicted
        """
        batch_size = src.size(0)
        trg_len = trg.size(1)
        trg_vocab_size = self.decoder.fc.out_features
        
        # Store outputs
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size, device=src.device)
        
        # Encode entire input sequence
        hidden, cell = self.encoder(src)
        
        # First input to decoder is <SOS> token
        input_token = trg[:, 0].unsqueeze(1)
        
        # Generate output sequence one token at a time
        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input_token, hidden, cell)
            outputs[:, t, :] = output
            
            # Teacher forcing: use true token or predicted token?
            use_teacher_forcing = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1).unsqueeze(1)
            input_token = trg[:, t].unsqueeze(1) if use_teacher_forcing else top1
        
        return outputs

# Example
vocab_size = 1000
hidden_size = 256

encoder = Encoder(vocab_size, hidden_size)
decoder = Decoder(vocab_size, hidden_size)
seq2seq = Seq2Seq(encoder, decoder)

# Dummy data
src = torch.randint(0, vocab_size, (4, 10))  # Batch of 4, length 10
trg = torch.randint(0, vocab_size, (4, 12))  # Batch of 4, length 12

outputs = seq2seq(src, trg)
print(f"Input shape: {src.shape}")
print(f"Target shape: {trg.shape}")
print(f"Output shape: {outputs.shape}")
print("\n✓ Seq2Seq translates variable-length sequences!")
print("\nThis was state-of-the-art for translation (2014-2017)")
print("Then Transformers came and changed everything...")

---

# PART 3: GPT vs BERT Architectures (Must Know!)

This is **THE most important interview topic for NLP roles in 2025**.

## The Two Paradigms

### BERT: Bidirectional Encoder (Google, 2018)
- **Architecture**: Encoder-only
- **Training**: Masked Language Modeling (predict masked words)
- **Use case**: Understanding (classification, NER, Q&A)
- **Example**: "The [MASK] is shining" → predict "sun"
- **Key**: Sees ENTIRE sentence (bidirectional context)

### GPT: Autoregressive Decoder (OpenAI, 2018)
- **Architecture**: Decoder-only
- **Training**: Next token prediction (left-to-right)
- **Use case**: Generation (text completion, chat)
- **Example**: "The sun is" → predict "shining"
- **Key**: Sees ONLY previous tokens (causal/autoregressive)
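
The two training objectives above differ mainly in how inputs and labels are built from the same token IDs. The sketch below shows both; the token IDs and `MASK_ID` are made up, and the masking is simplified (BERT's actual recipe also sometimes substitutes a random token or leaves the token unchanged).

```python
import random

token_ids = [11, 42, 7, 99, 23, 5]  # hypothetical IDs for a 6-token sentence
MASK_ID = 0                          # hypothetical ID of the [MASK] token
IGNORE = -100                        # label skipped by the loss (default ignore_index of nn.CrossEntropyLoss)

# GPT-style: at every position, the label is the next token
gpt_inputs = token_ids[:-1]
gpt_targets = token_ids[1:]

# BERT-style: hide some tokens and predict only the hidden ones
def mask_tokens(ids, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # the model sees [MASK] ...
            labels.append(tok)      # ... and must recover the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # no loss at unmasked positions
    return inputs, labels

bert_inputs, bert_labels = mask_tokens(token_ids, mask_prob=0.5)
print("GPT :", gpt_inputs, "->", gpt_targets)
print("BERT:", bert_inputs, "->", bert_labels)
```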

---

## The Critical Difference: Attention Masking

### BERT (Bidirectional)
```
Attention weights for "is":
  The  sun  is shining
  ✓    ✓   ✓    ✓      (can attend to ALL words)
```

### GPT (Causal)  
```
Attention weights for "is":
  The  sun  is shining
  ✓    ✓   ✓    ✗      (can ONLY attend to past, not future!)
```
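
The two patterns above can be reproduced with a small numerical sketch. The random `scores` matrix is a stand-in for $QK^T/\sqrt{d_k}$, so the actual weight values are meaningless; only the pattern of zeros matters.

```python
import torch
import torch.nn.functional as F

tokens = ["The", "sun", "is", "shining"]
seq_len = len(tokens)

torch.manual_seed(0)
scores = torch.randn(seq_len, seq_len)  # stand-in for QK^T / sqrt(d_k)

# True on and below the diagonal: position i may attend to positions <= i
allowed = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

bert_weights = F.softmax(scores, dim=-1)  # bidirectional: no mask
gpt_weights = F.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)

# Row 2 is "is": the causal version puts zero weight on "shining"
print(bert_weights[2])
print(gpt_weights[2])
```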

---

## FAANG Interview Questions

**Q1: "When would you use BERT vs GPT?"**
```
BERT:
- Classification (sentiment, spam detection)
- Named Entity Recognition
- Question Answering
- Sentence embeddings

GPT:
- Text generation
- Chatbots
- Code completion
- Creative writing
```

**Q2: "How do you implement causal masking?"**
```python
# Build a lower-triangular boolean matrix: True where attention is allowed
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# Set scores for future tokens to -inf so softmax gives them zero weight
scores = scores.masked_fill(~mask, float('-inf'))  # Before softmax
```

**Q3: "Why can't BERT generate text?"**
```
Answer: BERT is trained to predict masked words using 
bidirectional context. It's not trained for autoregressive 
generation (predicting next token). You'd need to fine-tune 
it differently or use a decoder.
```

---

## Architecture Comparison Table

| Feature | BERT | GPT |
|---------|------|-----|
| **Type** | Encoder | Decoder |
| **Attention** | Bidirectional | Causal (masked) |
| **Training** | MLM (predict masks) | Next token prediction |
| **Input** | [CLS] text [SEP] | text |
| **Output** | Token embeddings | Next token logits |
| **Best for** | Understanding | Generation |
| **Examples** | RoBERTa, ALBERT | GPT-3, ChatGPT |

---

## What Powers ChatGPT?
- **Architecture**: Decoder-only (GPT)
- **Size**: 175B parameters (GPT-3); GPT-4's size is undisclosed (about 1.76T is rumored)
- **Training**: Causal language modeling + RLHF
- **Context**: 2K tokens (GPT-3) → 8K tokens (GPT-4) → 128K tokens (GPT-4 Turbo)

---

## Key Takeaways

1. **BERT = Encoder = Understanding**
2. **GPT = Decoder = Generation**
3. **Causal mask** is the key difference
4. **Modern trend**: Decoder-only models (Llama, PaLM, GPT) dominate
5. **Why?**: Decoders can do both understanding AND generation

## 2. PyTorch's Transformer Modules

PyTorch provides optimized implementations so you don't have to write everything from scratch.

In [None]:
# Single Multi-Head Attention Layer
multihead_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

# Create dummy input: (Batch, Seq_Len, Embed_Dim)
x = torch.randn(32, 10, 256)

# Self-attention: Q=x, K=x, V=x
attn_output, _ = multihead_attn(x, x, x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {attn_output.shape}")
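
`nn.MultiheadAttention` also accepts an `attn_mask`, which is how causal (GPT-style) attention can be obtained from the same module. A minimal sketch follows; note that for a boolean mask, `True` marks positions that are *not* allowed to be attended to.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(32, 10, 256)  # (batch, seq_len, embed_dim)

# True above the diagonal: each position is blocked from seeing future positions
causal_mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)

attn_output, attn_weights = mha(x, x, x, attn_mask=causal_mask)
print(attn_output.shape)   # (batch, seq_len, embed_dim)
print(attn_weights.shape)  # (batch, seq_len, seq_len), averaged over heads by default
```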

---

# PART 4: Tokenization Deep Dive (Critical for LLMs)

## The Problem
Neural networks work with numbers, not text. How do we convert text → numbers?

### Naive Approaches (Don't Use!)
1. **Character-level**: "Hello" → [H, e, l, l, o] → Too long, loses word meaning
2. **Word-level**: "Hello" → [Hello] → Huge vocabulary (1M+ words)

### Modern Solutions: Subword Tokenization

## BPE (Byte Pair Encoding) - Used in GPT

**Idea:** Start with characters, merge most frequent pairs

```
Step 1: "lower" → ["l", "o", "w", "e", "r"]
Step 2: Merge "e"+"r" → ["l", "o", "w", "er"]  (if most frequent)
Step 3: Merge "l"+"o" → ["lo", "w", "er"]
...
```

**Result:** 
- Common words: single token ("the" → [the])
- Rare words: multiple subtokens ("chatbot" → ["chat", "bot"])
- Unknown words: character-level fallback
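
The merge loop described above can be sketched in plain Python. This is a toy trainer under simplifying assumptions: words are split on whitespace, there are no end-of-word markers or byte-level handling, and ties are broken by first occurrence. The function names and the tiny corpus are illustrative.

```python
from collections import Counter

def count_pairs(words):
    """words maps a tuple of symbols to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from characters: each word becomes a tuple of single-character symbols
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges, words = train_bpe(corpus, num_merges=3)
print("Learned merges:", merges)
print("Segmented words:", list(words))
```

At tokenization time, the learned merges would be replayed in the same order on new text.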

## WordPiece - Used in BERT

Similar to BPE but uses likelihood-based merging.

## SentencePiece - Used in Llama, T5

Language-agnostic, works directly on Unicode.

---

## FAANG Interview Questions

**Q1: "Why use subword tokenization instead of word-level?"**
```
Advantages:
1. Fixed vocabulary size (32K-100K vs 1M+ words)
2. Handles rare/unknown words (rarely needs an <UNK> token)
3. Captures morphology ("play", "playing", "played" share "play")
4. Language-agnostic
```

**Q2: "What's the difference between BPE and WordPiece?"**
```
BPE: Merge most frequent pairs
WordPiece: Merge pairs that maximize likelihood on training data
Both achieve similar results in practice.
```
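
A toy comparison of the two selection rules is sketched below. The WordPiece score used here, pair frequency divided by the product of the two symbol frequencies, is a commonly cited formulation and should be read as an assumption; the counts are made up.

```python
from collections import Counter

# Hypothetical symbol and pair counts from a toy corpus
symbol_freq = Counter({"t": 100, "h": 80, "q": 5, "u": 6})
pair_freq = Counter({("t", "h"): 40, ("q", "u"): 5})

def bpe_score(pair):
    # BPE: raw frequency of the pair
    return pair_freq[pair]

def wordpiece_score(pair):
    # WordPiece (as commonly described): how much more often the pair occurs
    # than expected if its two symbols were independent
    a, b = pair
    return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

bpe_choice = max(pair_freq, key=bpe_score)
wordpiece_choice = max(pair_freq, key=wordpiece_score)
print("BPE merges first:      ", bpe_choice)
print("WordPiece merges first:", wordpiece_choice)
```

With these counts, BPE prefers the frequent pair, while the likelihood-style score prefers the pair whose parts almost always occur together.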

**Q3: "How does GPT tokenize '🤖'?"**
```
Emojis, special characters: Usually single or multi-byte tokens
GPT uses UTF-8 bytes → BPE
Most modern tokenizers handle Unicode natively
```
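
The byte-level view can be inspected with plain Python. The snippet shows only the UTF-8 step; how many tokens the bytes end up as depends on the merges a particular tokenizer has learned.

```python
emoji = "\U0001F916"  # robot face, written as an escape so the source stays ASCII
utf8_bytes = list(emoji.encode("utf-8"))

print(len(emoji))   # 1 (a single Unicode code point)
print(utf8_bytes)   # [240, 159, 164, 150] (four UTF-8 bytes)
```

A byte-level BPE tokenizer starts from these four bytes and then applies whatever merges it has learned.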

---

## Tokenization in Practice

```python
from transformers import AutoTokenizer

# Load GPT-2 tokenizer (BPE)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello, world!"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")

# Decode back
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
```

---

## Key Concepts

### Special Tokens
- `[CLS]`: Classification token (BERT)
- `[SEP]`: Separator (BERT)
- `<BOS>`: Beginning of sequence (GPT)
- `<EOS>`: End of sequence
- `<PAD>`: Padding (for batching)
- `<UNK>`: Unknown word (rare in BPE)

### Vocabulary Size Impact
- **Small vocab (8K)**: More tokens per sentence, longer sequences
- **Large vocab (100K)**: Fewer tokens per sentence, but larger embedding matrix
- **Common range**: 32K-50K for many LLMs (GPT-2 uses about 50K); newer models often go to 100K or more
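
The cost side of this trade-off is simple arithmetic: the embedding matrix has one row per vocabulary entry. The hidden size of 4096 below is a hypothetical value chosen for illustration.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in a (vocab_size x d_model) embedding matrix."""
    return vocab_size * d_model

d_model = 4096  # hypothetical hidden size
for vocab_size in (8_000, 32_000, 100_000):
    millions = embedding_params(vocab_size, d_model) / 1e6
    print(f"vocab {vocab_size:>7,}: {millions:,.1f}M embedding parameters")
```

If the output projection is not tied to the input embedding, the same number of weights is spent again in the final layer.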

---

## Tokenization Challenges

1. **Numbers**: "1234" might be ["1", "23", "4"] - loses semantic meaning
2. **Code**: Indentation, brackets often split poorly
3. **Multilingual**: English-centric tokenizers waste tokens on other languages
4. **Trailing spaces**: "Hello" vs "Hello " are different tokens!

---

## Modern Trends (2024-2025)

- **Larger vocabs**: GPT-4's `cl100k_base` tokenizer has roughly 100K entries
- **Multimodal**: Image patches as "visual tokens" (CLIP, Flamingo)
- **Byte-level**: Direct UTF-8 bytes (more universal)

---

**Tokenization is the "dark art" of NLP - small changes have huge impacts!**

In [None]:
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

output = transformer_encoder(x)
print(f"Encoder Output shape: {output.shape}")

## 4. Tokenization (Concept)

Transformers don't understand text; they understand numbers. Tokenization converts text to numbers.

1. Text: "I love AI"
2. Tokens: ["I", "love", "AI"]
3. IDs: [101, 204, 505]

*(In practice, we use libraries like `huggingface/tokenizers`)*
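
The three steps above can be mimicked with a toy whitespace tokenizer. This is only a sketch of the text-to-IDs mapping; the vocabulary, the special tokens, and the resulting IDs are made up and do not match any real tokenizer.

```python
def build_vocab(texts):
    # Reserve IDs 0 and 1 for padding and unknown words
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    # Words outside the vocabulary fall back to <UNK>
    return [vocab.get(token, vocab["<UNK>"]) for token in text.split()]

vocab = build_vocab(["I love AI", "I love PyTorch"])
print(encode("I love AI", vocab))      # [2, 3, 4]
print(encode("I love robots", vocab))  # [2, 3, 1]
```

The `<UNK>` fallback in the second call is the weakness of word-level vocabularies that subword tokenizers are designed to avoid.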

## Key Takeaways

1. **Self-Attention**: The mechanism that relates different positions of a sequence.
2. **Q, K, V**: The three projections used to compute attention.
3. **Multi-Head Attention**: Running multiple attention mechanisms in parallel.
4. **Transformer**: A stack of attention and feed-forward layers.
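
To tie the first three takeaways together, here is a minimal multi-head self-attention module. It is a sketch for intuition, not a replacement for `nn.MultiheadAttention`: there is no masking or dropout, and the class name and fused `qkv` projection are choices made for this example.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # Q, K, V projections in one matmul
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t):
            # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_head)
            return t.reshape(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Scaled dot-product attention, computed independently for every head
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v  # (batch, heads, seq_len, d_head)
        # Concatenate the heads and mix them with a final linear layer
        merged = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out(merged)

torch.manual_seed(0)
mhsa = MultiHeadSelfAttention(d_model=256, num_heads=8)
x = torch.randn(32, 10, 256)
y = mhsa(x)
print(y.shape)
```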