# **CHAPTER 13: RECURRENT NEURAL NETWORKS & SEQUENCE MODELING**

*Memory and Sequence*

## **Chapter Overview**

Not all data is independent and identically distributed. Time series, natural language, DNA sequences, and user clickstreams all exhibit sequential dependencies. Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state, but vanilla RNNs struggle to retain information over long spans. This chapter explores the LSTM and GRU architectures that mitigate this limitation, the attention mechanisms that transformed machine translation, and the foundations on which Transformers were built.

**Estimated Time:** 50-60 hours (3-4 weeks)  
**Prerequisites:** Chapters 10-11 (Neural networks, backpropagation, PyTorch)

---

## **13.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement vanilla RNNs, LSTMs, and GRUs from scratch and understand their computational graphs
2. Apply Backpropagation Through Time (BPTT) and handle vanishing/exploding gradients in sequences
3. Build sequence-to-sequence models with encoder-decoder architectures for translation and summarization
4. Implement attention mechanisms to overcome bottleneck limitations of fixed-size context vectors
5. Process variable-length sequences with padding, packing, and masking techniques
6. Apply RNNs to time series forecasting, named entity recognition (NER), and text generation

---

## **13.1 The Recurrent Neural Network (RNN)**

#### **13.1.1 Sequential Processing**

Unlike feedforward networks, RNNs share parameters across time steps and maintain hidden state.

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$\hat{y}_t = W_{hy} h_t + b_y$$

Where:
- $x_t$: Input at time $t$
- $h_t$: Hidden state (memory) at time $t$
- $W_{hh}, W_{xh}$: Shared weight matrices across all time steps
- $\hat{y}_t$: Output at time $t$

**Computational Graph:** Unrolled across time, the RNN becomes a deep feedforward network with one layer per time step, and every layer shares the same weights.

```python
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        # Combined weights for efficiency: [W_xh | W_hh] applied to [x_t, h_{t-1}]
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        # Output is read from the new hidden state: y_t = W_hy h_t + b_y
        self.h2o = nn.Linear(hidden_size, output_size)
        self.tanh = nn.Tanh()
        
    def forward(self, x, hidden):
        # x: (batch, input_size), hidden: (batch, hidden_size)
        combined = torch.cat((x, hidden), dim=1)
        hidden = self.tanh(self.i2h(combined))
        output = self.h2o(hidden)
        return output, hidden
    
    def init_hidden(self, batch_size):
        return torch.zeros(batch_size, self.hidden_size)
```
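
To make the weight sharing concrete, the following sketch unrolls one recurrent cell over a short sequence. It uses PyTorch's built-in `nn.RNNCell`, which applies the same tanh update as the equation for $h_t$, rather than the `VanillaRNN` class. The tensor sizes and the `readout` layer name are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, input_size, hidden_size, output_size = 4, 6, 3, 5, 2

cell = nn.RNNCell(input_size, hidden_size)     # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
readout = nn.Linear(hidden_size, output_size)  # y_t = W_hy h_t + b_y

x = torch.randn(batch_size, seq_len, input_size)
h = torch.zeros(batch_size, hidden_size)       # h_0

outputs = []
for t in range(seq_len):
    # The same `cell` and `readout` parameters are reused at every time step.
    h = cell(x[:, t, :], h)
    outputs.append(readout(h))

outputs = torch.stack(outputs, dim=1)          # (batch, seq_len, output_size)
print(outputs.shape, h.shape)
```

Only one set of weights exists no matter how long the sequence is; the loop is what makes the unrolled graph deep.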

#### **13.1.2 Backpropagation Through Time (BPTT)**

To compute gradients, we unroll the network through time and apply backpropagation. However, this creates a very deep computational graph (depth = sequence length).

**The Problem:**
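- **Vanishing Gradients:** The gradient reaching a step $k$ positions back is a product of $k$ Jacobians, each containing $W_{hh}$ and a $\tanh$ derivative (at most 1). When these factors have norm below 1, the gradient shrinks exponentially with $k$ and early time steps stop learning.
- **Exploding Gradients:** When the factors have norm above 1 (large recurrent weights), the gradient grows exponentially with $k$, producing huge updates and eventually NaN values.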
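
A scalar RNN makes the effect easy to see. For $h_t = \tanh(w h_{t-1} + x)$ the chain rule gives $\partial h_T / \partial h_0 = \prod_t w (1 - h_t^2)$. The sketch below evaluates that product; the weight values and the constant input are arbitrary choices for illustration.

```python
import math

def gradient_through_time(w, steps, x=0.1, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)   # local Jacobian: w * tanh'(pre-activation)
    return grad

# Vanishing: every factor is below 0.5, so the product collapses quickly.
for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.5, steps=steps))

# Exploding: without the squashing nonlinearity the factor is just w,
# so any |w| > 1 grows without bound.
print(1.1 ** 200)
```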

**Solutions:**
1. **Gradient Clipping:** Limit gradient norm
   ```python
   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
   ```
2. **Truncated BPTT:** Only backpropagate through last $k$ time steps (approximate gradient)
3. **Architectural Solutions:** LSTM, GRU (gated mechanisms)
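
Truncated BPTT can be sketched by splitting a long sequence into chunks of $k$ steps and detaching the hidden state between chunks. The toy sine-wave task, the model sizes, and the learning rate below are assumptions made only so the example runs end to end.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()

# Toy task (assumed for illustration): predict the next value of a sine wave.
series = torch.sin(torch.linspace(0, 20, 201)).view(1, -1, 1)
inputs, targets = series[:, :-1], series[:, 1:]     # each (1, 200, 1)

k = 25                                               # truncation length
hidden = torch.zeros(1, 1, 8)                        # (num_layers, batch, hidden)
losses = []
for start in range(0, inputs.size(1), k):
    # Keep the hidden values but cut the graph, so backward() only
    # reaches back through the current k steps.
    hidden = hidden.detach()
    out, hidden = rnn(inputs[:, start:start + k], hidden)
    loss = loss_fn(head(out), targets[:, start:start + k])

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=5)
    optimizer.step()
    losses.append(loss.item())

print(len(losses))
```

The forward pass still carries information across the whole sequence through `hidden`; only the backward pass is shortened, which is why the resulting gradient is an approximation.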

---

## **13.2 Long Short-Term Memory (LSTM)**

#### **13.2.1 The Cell State**

LSTMs maintain two vectors:
- **Cell State ($C_t$):** The "conveyor belt" that runs through the entire chain with minimal interactions (preserves long-term memory)
- **Hidden State ($h_t$):** Working memory, output of the cell

#### **13.2.2 The Gates**

Three sigmoid gates (forget, input, output) control information flow. One time step proceeds in four stages:

1. **Forget Gate ($f_t$):** What to discard from cell state
   $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

2. **Input Gate ($i_t$):** What new information to store
   $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
   $$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

3. **Update Cell State:**
   $$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

4. **Output Gate ($o_t$):** What to output based on cell state
   $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
   $$h_t = o_t \odot \tanh(C_t)$$

**Implementation:**
```python
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        
        # Combined linear transformation for all gates
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        
    def forward(self, x, hidden):
        h_prev, c_prev = hidden
        
        combined = torch.cat([x, h_prev], dim=1)
        gates = self.gates(combined)
        
        # Split into forget, input, cell_candidate, output
        f, i, c_tilde, o = gates.chunk(4, dim=1)
        
        f = torch.sigmoid(f)  # Forget gate
        i = torch.sigmoid(i)  # Input gate
        c_tilde = torch.tanh(c_tilde)  # Candidate values
        o = torch.sigmoid(o)  # Output gate
        
        c = f * c_prev + i * c_tilde  # Cell state update
        h = o * torch.tanh(c)  # Hidden state
        
        return h, (h, c)
```

#### **13.2.3 GRU (Gated Recurrent Unit)**

Simplified LSTM with fewer gates (faster, fewer parameters).

- **Update Gate ($z_t$):** Controls how much past information to keep (like forget + input)
- **Reset Gate ($r_t$):** Controls how much past to forget for computing new candidate

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
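
The four GRU equations translate almost line by line into a cell in the style of the `LSTMCell` above. This is a minimal sketch: the class and attribute names are illustrative, and `bias=False` is chosen only to match the bias-free equations (PyTorch's own `nn.GRUCell` includes biases and arranges its weights differently).

```python
import torch
import torch.nn as nn

class GRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # One projection for both gates (z_t, r_t), one for the candidate.
        self.gates = nn.Linear(input_size + hidden_size, 2 * hidden_size, bias=False)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size, bias=False)

    def forward(self, x, h_prev):
        zr = torch.sigmoid(self.gates(torch.cat([h_prev, x], dim=1)))
        z, r = zr.chunk(2, dim=1)                     # update gate, reset gate
        # The reset gate scales h_prev before it enters the candidate.
        h_tilde = torch.tanh(self.candidate(torch.cat([r * h_prev, x], dim=1)))
        # Interpolate between the old state and the candidate.
        return (1 - z) * h_prev + z * h_tilde

torch.manual_seed(0)
cell = GRUCell(input_size=3, hidden_size=4)
h = torch.zeros(2, 4)
for x_t in torch.randn(5, 2, 3):   # 5 time steps, batch of 2
    h = cell(x_t, h)
print(h.shape)
```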

**LSTM vs GRU:**
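- **LSTM:** Separate cell state and three gates; more parameters (four weight blocks per layer) and often preferred when very long-range dependencies matter
- **GRU:** Two gates and no separate cell state; about 25% fewer parameters (three weight blocks), faster per step, and a reasonable default. Empirical comparisons are task-dependent, so try both when accuracy matters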
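
The parameter difference follows directly from the number of weight blocks. The helper below (a hypothetical function, not a library call) counts parameters for one layer using the single-bias formulation of this chapter's cells; PyTorch's built-in modules keep two bias vectors per block, so their totals are slightly higher.

```python
def recurrent_params(input_size, hidden_size, num_gate_blocks, bias=True):
    """Parameters of one recurrent layer built from `num_gate_blocks`
    linear maps of shape (input_size + hidden_size) -> hidden_size."""
    per_block = hidden_size * (input_size + hidden_size)
    if bias:
        per_block += hidden_size
    return num_gate_blocks * per_block

lstm = recurrent_params(128, 256, num_gate_blocks=4)  # f, i, candidate, o
gru = recurrent_params(128, 256, num_gate_blocks=3)   # z, r, candidate
print(lstm, gru, gru / lstm)
```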

---

## **13.3 PyTorch RNN Modules**

#### **13.3.1 Built-in RNN, LSTM, GRU**

```python
lstm = nn.LSTM(
    input_size=128,    # Feature dimension of input
    hidden_size=256,   # Hidden state dimension
    num_layers=2,      # Stacked LSTMs
    batch_first=True,  # Input shape: (batch, seq, feature) vs (seq, batch, feature)
    dropout=0.3,       # Dropout between layers (not applied to last layer)
    bidirectional=True # Process forward and backward, concat outputs
)

# Input: (batch_size, seq_len, input_size)
x = torch.randn(32, 50, 128)  # 32 samples, 50 time steps, 128 features

# Output: (batch_size, seq_len, hidden_size * 2) if bidirectional
output, (hidden, cell) = lstm(x)

# output: all hidden states at all time steps
# hidden: final hidden state (num_layers * num_directions, batch, hidden_size)
# cell: final cell state
```
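
With stacked, bidirectional layers the layout of `hidden` is easy to misread. The sketch below, using small arbitrary sizes, separates the layer and direction axes and checks where the same vectors appear in `output`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_layers, hidden_size, batch_size = 2, 16, 3
lstm = nn.LSTM(input_size=8, hidden_size=hidden_size, num_layers=num_layers,
               batch_first=True, bidirectional=True)

x = torch.randn(batch_size, 10, 8)
output, (hidden, cell) = lstm(x)

# Separate the layer and direction axes: (num_layers, 2, batch, hidden_size)
hidden = hidden.view(num_layers, 2, batch_size, hidden_size)
last_fwd = hidden[-1, 0]   # top layer, forward direction (finished at the last step)
last_bwd = hidden[-1, 1]   # top layer, backward direction (finished at the first step)

# The same vectors appear in `output`: forward half at the last time step,
# backward half at the first time step.
print(torch.allclose(last_fwd, output[:, -1, :hidden_size], atol=1e-6))
print(torch.allclose(last_bwd, output[:, 0, hidden_size:], atol=1e-6))

sentence_vector = torch.cat([last_fwd, last_bwd], dim=1)   # (batch, 2 * hidden_size)
```

Taking `output[:, -1]` alone would pair the finished forward state with a backward state that has seen only one token, which is why the two halves are read from different ends.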

#### **13.3.2 Handling Variable Lengths (Padding and Packing)**

Sequences have different lengths. We use `pack_padded_sequence` to avoid computing on padding tokens.

```python
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Assume sequences sorted by length descending
seq_lengths = [50, 45, 40, 30]  # Actual lengths
padded_seqs = torch.randn(4, 50, 128)  # Batch=4, MaxLen=50

# Pack (removes padding from computation)
packed = pack_padded_sequence(
    padded_seqs, 
    seq_lengths, 
    batch_first=True,
    enforce_sorted=True
)

# Pass through RNN
packed_output, (hidden, cell) = lstm(packed)

# Unpack (restore to padded format)
output, lengths = pad_packed_sequence(packed_output, batch_first=True)
```
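
If sorting the batch is inconvenient, `enforce_sorted=False` lets PyTorch sort and restore the order internally. The following sketch uses small made-up sizes to show that the outputs come back in the original batch order, that padded positions are zero, and that the final hidden state corresponds to each sequence's last real step.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

lengths = torch.tensor([3, 5, 2])        # deliberately unsorted
padded = torch.randn(3, 5, 4)            # (batch, max_len, features)

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Lengths (and rows of `output`) follow the original batch order.
print(out_lengths.tolist())
# Positions past each sequence's length are filled with zeros.
print(output[0, 3:].abs().sum().item(), output[2, 2:].abs().sum().item())
# h_n holds the state at each sequence's last real step, not at max_len.
print(torch.allclose(h_n[0, 0], output[0, 2]))
```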

**Masking:** For attention or loss computation, create masks to ignore padding.
```python
mask = (seq != pad_token_id).unsqueeze(-1).float()  # (batch, seq, 1)
output = output * mask  # Zero out padding positions
```

---

## **13.4 Sequence-to-Sequence Models (Seq2Seq)**

Architecture for translation, summarization, chatbots: Encoder compresses input to context vector, Decoder generates output.

#### **13.4.1 The Encoder**

Processes input sequence, final hidden state becomes context vector.

```python
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_size)
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs, hidden, cell
```

#### **13.4.2 The Decoder**

Generates output token by token, using previous token as input (autoregressive).

```python
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, hidden, cell):
        # x: (batch, 1) single token
        x = x.unsqueeze(1) if x.dim() == 1 else x
        embedded = self.embedding(x)  # (batch, 1, embed_size)
        
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))  # (batch, vocab_size)
        return prediction, hidden, cell
```

#### **13.4.3 Training vs Inference**

**Teacher Forcing (Training):** Feed ground truth previous token as decoder input (faster convergence).

**Inference:** Feed model's own prediction as next input (autoregressive generation).

```python
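def translate_sentence(model, sentence, device, max_len=50):
    model.eval()
    # ... tokenize sentence ...
    # The elided step is expected to define src_tensor with shape (1, src_len)
    # on `device`, plus the integer vocabulary indices sos_token and eos_token.
    
    translated = []
    with torch.no_grad():
        # Encode
        outputs, hidden, cell = model.encoder(src_tensor)
        
        # Decode, starting from the start-of-sequence token
        input_token = torch.tensor([sos_token], device=device)  # shape (1,)
        for _ in range(max_len):
            output, hidden, cell = model.decoder(input_token, hidden, cell)
            pred_token = output.argmax(1)  # shape (1,)
            
            if pred_token.item() == eos_token:
                break
                
            translated.append(pred_token.item())
            input_token = pred_token  # Autoregressive
    
    return translated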
```
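
The training-side counterpart can be sketched as a decoding loop that chooses, at each step, between the ground-truth token and the model's own prediction. The layer sizes, the function name `decode_with_teacher_forcing`, and the zero-initialized states (standing in for the encoder's final states) are assumptions for illustration.

```python
import random
import torch
import torch.nn as nn

torch.manual_seed(0)
random.seed(0)

vocab_size, embed_size, hidden_size = 20, 8, 16
embedding = nn.Embedding(vocab_size, embed_size)
decoder_lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)
criterion = nn.CrossEntropyLoss()

def decode_with_teacher_forcing(trg, hidden, cell, teacher_forcing_ratio=0.5):
    """trg: (batch, trg_len) with <sos> at position 0. Returns the mean step loss."""
    trg_len = trg.shape[1]
    input_token = trg[:, 0]
    loss = 0.0
    for t in range(1, trg_len):
        embedded = embedding(input_token.unsqueeze(1))           # (batch, 1, embed)
        output, (hidden, cell) = decoder_lstm(embedded, (hidden, cell))
        logits = fc(output.squeeze(1))                           # (batch, vocab)
        loss = loss + criterion(logits, trg[:, t])
        # Teacher forcing: feed the true token; otherwise feed the prediction.
        use_teacher = random.random() < teacher_forcing_ratio
        input_token = trg[:, t] if use_teacher else logits.argmax(1)
    return loss / (trg_len - 1)

trg = torch.randint(0, vocab_size, (4, 7))
hidden = torch.zeros(1, 4, hidden_size)   # stand-in for the encoder's final hidden state
cell = torch.zeros(1, 4, hidden_size)     # stand-in for the encoder's final cell state
loss = decode_with_teacher_forcing(trg, hidden, cell)
loss.backward()
print(loss.item())
```

Setting `teacher_forcing_ratio=1.0` gives pure teacher forcing; `0.0` trains entirely on the model's own predictions, as at inference time.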

---

## **13.5 Attention Mechanism**

**The Bottleneck Problem:** Fixed-size context vector (final hidden state) must capture entire input sequence information, especially problematic for long sequences.

**Solution:** Attention allows decoder to "look back" at encoder outputs at each step.

#### **13.5.1 Bahdanau (Additive) Attention**

Let $s_t$ be the decoder hidden state at step $t$ and $h_i$ the encoder output at position $i$:

$$e_{ti} = v^T \tanh(W_s s_t + W_h h_i) \quad \text{(alignment score)}$$
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_j \exp(e_{tj})} \quad \text{(softmax)}$$
$$c_t = \sum_i \alpha_{ti} h_i \quad \text{(context vector)}$$

```python
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Input is the concatenation [s_t ; h_i], i.e. 2 * hidden_size features
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)
        
    def forward(self, hidden, encoder_outputs, mask=None):
        # hidden: (batch, hidden_size)
        # encoder_outputs: (batch, src_len, hidden_size)
        
        batch_size = encoder_outputs.shape[0]
        src_len = encoder_outputs.shape[1]
        
        # Repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)  # (batch, src_len, hidden)
        
        # Calculate energy
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        # energy: (batch, src_len, hidden_size)
        
        attention = self.v(energy).squeeze(2)  # (batch, src_len)
        
        if mask is not None:
            attention = attention.masked_fill(mask == 0, -1e10)
        
        return torch.softmax(attention, dim=1)
```
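
The module above returns only the weights $\alpha_{ti}$. The context vector $c_t = \sum_i \alpha_{ti} h_i$ is then a batched matrix product, as in the sketch below. Random scores stand in for the output of the scoring network, and the mask values are made up for illustration.

```python
import torch

torch.manual_seed(0)
batch, src_len, hidden_size = 2, 5, 8

encoder_outputs = torch.randn(batch, src_len, hidden_size)   # h_i
scores = torch.randn(batch, src_len)                          # e_ti (stand-in values)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])                        # 0 marks padding

scores = scores.masked_fill(mask == 0, -1e10)
weights = torch.softmax(scores, dim=1)                        # alpha_ti, rows sum to 1

# c_t = sum_i alpha_ti * h_i as a batched matrix product:
# (batch, 1, src_len) @ (batch, src_len, hidden) -> (batch, 1, hidden)
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
print(weights.sum(dim=1), context.shape)
```

In a decoder, `context` is typically concatenated with the embedded input token or with the LSTM output before the final projection.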

#### **13.5.2 Luong (Multiplicative) Attention**

A simpler and cheaper score that replaces the additive network with a single bilinear form: $\text{score}(s_t, h_i) = s_t^T W h_i$
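
A minimal sketch of this multiplicative score, with the same interface as the `Attention` module in 13.5.1. The class name `LuongAttention` is illustrative, and the sketch assumes encoder and decoder share the same hidden size.

```python
import torch
import torch.nn as nn

class LuongAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden, encoder_outputs, mask=None):
        # hidden: (batch, hidden_size); encoder_outputs: (batch, src_len, hidden_size)
        projected = self.W(encoder_outputs)                            # W h_i
        scores = torch.bmm(projected, hidden.unsqueeze(2)).squeeze(2)  # s_t^T W h_i
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e10)
        return torch.softmax(scores, dim=1)                            # (batch, src_len)

torch.manual_seed(0)
attn = LuongAttention(hidden_size=8)
weights = attn(torch.randn(2, 8), torch.randn(2, 5, 8))
print(weights.shape)
```

All scores for a decoding step come from one matrix product, with no `tanh` layer and no tiling of the decoder state across source positions.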

**Global vs Local Attention:**
- **Global:** Attends to all source positions (computationally expensive for long sequences)
- **Local:** Attends to window around predicted position (faster)

---

## **13.6 Applications**

#### **13.6.1 Named Entity Recognition (NER)**

Token classification: Identify persons, organizations, locations in text.

```python
class NERModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_tags)  # BIO tagging
        
    def forward(self, x):
        # x: (seq_len, batch) because this nn.LSTM uses the default batch_first=False
        embeds = self.embedding(x)
        lstm_out, _ = self.lstm(embeds)
        logits = self.fc(lstm_out)  # (seq_len, batch, num_tags)
        return logits
```

**BIO Tagging:** B-PER (begin person), I-PER (inside person), O (outside).
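
Predicted tags usually need to be converted back into entity spans. The helper below is one possible decoding, not a standard library function; the example sentence and the choice to ignore stray `I-` tags are illustrative assumptions.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == current
        if not inside:
            if current is not None:
                spans.append((current, start, i))
                start, current = None, None
            if tag.startswith("B-"):
                start, current = i, tag[2:]
            # A stray I- tag with no matching B- is ignored (simplifying assumption).
    return spans

tokens = ["Ada", "Lovelace", "worked", "in", "London", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
for label, start, end in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```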

#### **13.6.2 Time Series Forecasting**

Using LSTM for predicting next values in sequence.

```python
class TimeSeriesPredictor(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        # x: (batch, seq_len, features)
        lstm_out, (hn, _) = self.lstm(x)
        # Use last hidden state
        out = self.fc(hn[-1])  # hn: (num_layers, batch, hidden)
        return out
```
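
Before training such a model, the raw series has to be cut into (input window, target) pairs. A minimal sketch, with a hypothetical helper name and a made-up series:

```python
def make_windows(series, window, horizon=1):
    """Split a 1-D series into (input window, next `horizon` values) pairs."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        x = series[start:start + window]
        y = series[start + window:start + window + horizon]
        pairs.append((x, y))
    return pairs

series = [10, 11, 13, 12, 15, 16, 18]
for x, y in make_windows(series, window=3):
    print(x, "->", y)
```

Each `x` becomes one row of the `(batch, seq_len, features)` input after adding a trailing feature dimension. When splitting into train and test sets, split by time rather than at random so that test windows do not overlap the training period.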

#### **13.6.3 Text Generation (Char-level)**

Train on text, generate character by character.

```python
def generate_text(model, start_str, length=1000):
    model.eval()
    input_seq = [char2idx[c] for c in start_str]
    
    with torch.no_grad():
        hidden = model.init_hidden(1)
        
        # Prime the model with start string
        for char in input_seq[:-1]:
            _, hidden = model(torch.tensor([[char]]), hidden)
        
        input_char = torch.tensor([[input_seq[-1]]])
        
        for _ in range(length):
            output, hidden = model(input_char, hidden)
            prob = torch.softmax(output, dim=2)
            char_idx = torch.multinomial(prob.squeeze(), 1).item()
            
            print(idx2char[char_idx], end='')
            input_char = torch.tensor([[char_idx]])
```

---

## **13.7 Workbook Labs**

### **Lab 1: RNN from Scratch**
Implement vanilla RNN with BPTT (no PyTorch nn.RNN):
1. Forward pass through time
2. Backward pass computing all gradients manually
3. Train on simple sequence (e.g., "hello" → "elloh" character shift)
4. Show vanishing gradients by comparing gradients at t=0 vs t=20

**Deliverable:** `rnn_scratch.py` with gradient flow visualization.

### **Lab 2: Sentiment Analysis with LSTM**
IMDB movie reviews:
1. Embedding layer (pre-trained GloVe or trained from scratch)
2. Bidirectional LSTM with attention
3. Compare: Last hidden state vs Mean pooling vs Attention mechanism
4. Achieve >85% accuracy

**Deliverable:** Model with attention visualization (which words contribute to sentiment).

### **Lab 3: Neural Machine Translation**
English to French (or any language pair):
1. Encoder-Decoder with Luong Attention
2. Teacher forcing with scheduled sampling (gradually reduce teacher forcing ratio)
3. BLEU score evaluation
4. Attention alignment visualization (show which source word aligns to target word)

**Deliverable:** Working translator with attention heatmaps.

### **Lab 4: Anomaly Detection in Time Series**
Sensor data (e.g., ECG or machine vibration):
1. LSTM Autoencoder (sequence → sequence reconstruction)
2. Anomaly = high reconstruction error
3. Compare with Isolation Forest (Chapter 8) on temporal patterns

**Deliverable:** Anomaly detection system with precision/recall on labeled anomalies.

---

## **13.8 Common Pitfalls**

1. **Teacher Forcing at Inference:** Using ground truth tokens during test time (cheating). Must use model's own predictions.

2. **Ignoring End-of-Sequence:** Generated sequences can grow forever. Always check for EOS token or set max length.

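3. **Not Shuffling Batches:** For seq2seq, grouping sequences of similar length into a batch (bucketing) minimizes padding, but the order of the batches should still be shuffled each epoch so training does not always proceed from short to long sequences.

4. **Memory Growth on Long Sequences:** BPTT stores the activations of every time step, so memory grows linearly with sequence length and long sequences can exhaust GPU memory. Use truncated BPTT or gradient checkpointing.

5. **Not Detaching Hidden States:** When hidden state is carried from one batch to the next, forgetting to detach it keeps the previous batch's graph attached. The next `backward()` call then either raises a RuntimeError (backward through a graph that has already been freed) or backpropagates across unrelated sequences.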

   ```python
   # Wrong: hidden carries graph from previous batch
   # Right:
   hidden = tuple(h.detach() for h in hidden)
   ```
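
For pitfall 3, one simple way to get batches of similar length while still shuffling is to sort by length, cut into batches, and shuffle the batch order. The helper name and the example lengths are assumptions for illustration.

```python
import random

def bucketed_batches(lengths, batch_size, seed=0):
    """Return lists of example indices: similar lengths per batch, batches shuffled."""
    rng = random.Random(seed)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    rng.shuffle(batches)          # shuffle the batch order, not the batch contents
    return batches

lengths = [5, 42, 7, 40, 21, 41, 20, 22, 6]
for batch in bucketed_batches(lengths, batch_size=3):
    print([lengths[i] for i in batch])
```

Passing a different `seed` each epoch changes the batch order while keeping padding low. A common refinement is to shuffle within coarse length buckets first, so batch membership also varies between epochs.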

---

## **13.9 Interview Questions**

**Q1:** Why do LSTMs solve the vanishing gradient problem better than vanilla RNNs?
*A: The cell state is updated additively ($C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$), so along the direct cell-state path $\partial C_t / \partial C_{t-1} = f_t$ (element-wise). When the forget gate stays near 1, gradients pass through many steps almost unchanged, instead of being multiplied by $W_{hh}$ and a $\tanh$ derivative at every step as in a vanilla RNN. This mitigates vanishing gradients rather than eliminating them (a forget gate near 0 still blocks the gradient), and exploding gradients still call for clipping.*

**Q2:** Explain the difference between teacher forcing and scheduled sampling.
*A: Teacher forcing feeds ground truth previous token as decoder input during training, leading to fast convergence but exposure bias (model never sees its own errors). Scheduled sampling gradually replaces teacher forcing with model predictions during training (annealing schedule), exposing the model to its own errors and reducing discrepancy between training and inference.*

**Q3:** What is the purpose of the attention mechanism in seq2seq models?
*A: Attention solves the information bottleneck of fixed-size context vectors. Instead of compressing all source information into final encoder hidden state, attention allows decoder to dynamically focus on different source positions at each decoding step, creating a weighted context vector. This improves performance on long sequences and provides interpretability (alignment visualization).*

**Q4:** Why use bidirectional LSTM and when can't you use it?
*A: A bidirectional LSTM processes the sequence both forwards and backwards, so each position sees future as well as past context (e.g., disambiguating a word may require seeing the words that follow it). It cannot be used when future inputs are unavailable: real-time streaming, autoregressive generation (decoding), or causal language modeling, where the next token must be predicted without seeing it.*

**Q5:** How do you handle variable-length sequences in batches?
*A: Pad sequences to the maximum length in the batch, then use `pack_padded_sequence` so the RNN skips computation on padding tokens. Also mask padding positions in the loss and in attention. Packing requires the true lengths; either sort the batch by length in descending order first, or pass `enforce_sorted=False` and let PyTorch handle the sorting internally.*

---

## **13.10 Further Reading**

**Papers:**
- "Long Short-Term Memory" (Hochreiter & Schmidhuber, 1997) - Original LSTM
- "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (Cho et al., 2014) - GRU, seq2seq
- "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2015) - Attention
- "Effective Approaches to Attention-based Neural Machine Translation" (Luong et al., 2015)

**Books:**
- *Deep Learning with PyTorch* (Stevens, Antiga, Viehmann) - Chapter on sequences

---

## **13.11 Checkpoint Project: Conversational AI Bot**

Build a retrieval-based chatbot with attention-based matching.

**Requirements:**

1. **Architecture:**
   - Dual Encoder (one LSTM for context, one for response)
   - Attention matching layer between context and candidate responses
   - Similarity score (cosine) between encoded context and response

2. **Dataset:**
   - Reddit conversations or customer service logs (pairs of context-response)
   - Negative sampling: For each positive pair, sample 5 random responses as negatives

3. **Training:**
   - Contrastive loss: Positive pairs close together, negatives far apart
   - Recall@1, Recall@5 metrics (is correct response in top K?)

4. **Inference:**
   - Index 10,000 candidate responses using FAISS (approximate nearest neighbors)
   - Real-time retrieval (<100ms) for given user query

5. **Evaluation:**
   - Human evaluation on 100 test conversations (appropriateness, fluency)
   - A/B test simulation vs baseline (TF-IDF retrieval)
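
The Recall@K metric from the training requirements can be computed as below. This is a minimal sketch: the function name and the toy rankings of candidate response IDs are hypothetical.

```python
def recall_at_k(ranked_candidates, correct, k):
    """Fraction of queries whose correct response appears in the top-k ranking."""
    hits = sum(1 for ranking, answer in zip(ranked_candidates, correct)
               if answer in ranking[:k])
    return hits / len(correct)

# Hypothetical rankings of candidate response IDs for three queries, best first.
rankings = [[7, 2, 9, 4], [3, 1, 8, 5], [6, 0, 2, 7]]
correct = [7, 8, 4]
print(recall_at_k(rankings, correct, k=1), recall_at_k(rankings, correct, k=3))
```

With one positive and five sampled negatives per query, Recall@5 is measured over only six candidates, so the candidate pool size should be reported alongside the metric.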

**Deliverables:**
- `chatbot/` package with training and inference
- API endpoint: POST /reply {message} → Returns top 3 candidate responses with confidence scores
- Report: "Bot retrieves appropriate response 65% of the time in top-1, 90% in top-3"

**Success Criteria:**
- Recall@5 > 0.85 on test set
- Inference latency < 100ms for 10k candidate index
- Attention visualization shows bot focuses on key entities in user query

---

**End of Chapter 13**

*You have now covered the core tools of sequence modeling. Chapter 14 covers Transformers and Modern NLP: the architecture that replaced RNNs in most NLP systems and powers GPT and BERT.*

