# Module 03: Recurrent Neural Networks

**Difficulty**: ⭐⭐⭐ Advanced  
**Estimated Time**: 120 minutes  
**Prerequisites**: [Module 02: Word Embeddings](02_word_embeddings.ipynb), Deep Learning Fundamentals

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the architecture and mathematics of vanilla RNNs
2. Implement RNN, LSTM, and GRU from scratch in PyTorch
3. Understand the vanishing gradient problem and how gated architectures mitigate it
4. Build bidirectional RNNs for better context understanding
5. Apply RNNs to sequence classification (sentiment analysis)
6. Compare different RNN architectures and their trade-offs

## Why Recurrent Neural Networks?

Traditional feedforward neural networks have limitations when applied to sequential data:
- **Fixed input size**: Can't handle variable-length sequences
- **No memory**: Each input processed independently
- **No temporal dynamics**: Can't model sequential patterns

**RNNs solve this** by:
- Maintaining hidden state (memory)
- Processing sequences one step at a time
- Sharing parameters across time steps

### Applications:
- Language modeling (predict next word)
- Sentiment analysis (classify text)
- Machine translation
- Speech recognition
- Time series forecasting

## Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Data splitting and evaluation metrics (scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("✓ All libraries imported successfully!")

## 1. Vanilla RNN: Architecture and Mathematics

### 1.1 The RNN Cell

At each time step $t$, an RNN:
1. Takes input $x_t$ (current word)
2. Takes previous hidden state $h_{t-1}$ (memory)
3. Produces new hidden state $h_t$
4. Optionally produces output $y_t$

**Mathematics**:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$y_t = W_{hy} h_t + b_y$$

Where:
- $W_{xh}$: Input-to-hidden weights
- $W_{hh}$: Hidden-to-hidden weights (memory)
- $W_{hy}$: Hidden-to-output weights
- $\tanh$: Activation function

### Key Insight: Weight Sharing

The same weights ($W_{xh}$, $W_{hh}$, $W_{hy}$) are used at every time step!
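
The recurrence and the weight-sharing idea can be sketched with plain tensors. This is a minimal illustration, not the notebook's implementation: the sizes (4 and 3), the random initialization, and the helper name `run_rnn` are assumptions chosen for the example. The same three tensors process a sequence of length 3 and a sequence of length 7.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 4, 3  # illustrative sizes (assumption)

# One set of parameters, named after the symbols in the equations above
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b_h = torch.zeros(hidden_size)

def run_rnn(sequence):
    """Apply h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) over a (seq_len, input_size) tensor."""
    h = torch.zeros(hidden_size)
    for x_t in sequence:  # the same W_xh, W_hh, b_h are reused at every step
        h = torch.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

h_short = run_rnn(torch.randn(3, input_size))  # sequence of length 3
h_long = run_rnn(torch.randn(7, input_size))   # sequence of length 7

# The parameter count depends only on the layer sizes, not on sequence length
num_params = W_xh.numel() + W_hh.numel() + b_h.numel()
print(h_short.shape, h_long.shape, num_params)
```

Because the parameters are shared, the model size stays constant regardless of how long the input sequence is.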

### 1.2 Implementing Vanilla RNN from Scratch

In [None]:
class VanillaRNN(nn.Module):
    """
    Vanilla RNN implementation from scratch.
    
    This educational implementation shows the inner workings of an RNN.
    In practice, use nn.RNN for efficiency.
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        """
        Parameters:
        -----------
        input_size : int
            Dimension of input vectors (e.g., embedding size)
        hidden_size : int
            Dimension of hidden state
        output_size : int
            Dimension of output (e.g., number of classes)
        """
        super(VanillaRNN, self).__init__()
        
        self.hidden_size = hidden_size
        
        # Input to hidden
        self.W_xh = nn.Linear(input_size, hidden_size)
        # Hidden to hidden (recurrent connection)
        self.W_hh = nn.Linear(hidden_size, hidden_size)
        # Hidden to output
        self.W_hy = nn.Linear(hidden_size, output_size)
        
    def forward(self, x, h_prev=None):
        """
        Forward pass through RNN.
        
        Parameters:
        -----------
        x : torch.Tensor
            Input sequence of shape (batch_size, seq_len, input_size)
        h_prev : torch.Tensor or None
            Previous hidden state (batch_size, hidden_size)
            If None, initialize to zeros
            
        Returns:
        --------
        output : torch.Tensor
            Output at final time step (batch_size, output_size)
        hidden_states : list
            Hidden states at all time steps
        """
        batch_size, seq_len, _ = x.size()
        
        # Initialize hidden state if not provided
        if h_prev is None:
            h = torch.zeros(batch_size, self.hidden_size).to(x.device)
        else:
            h = h_prev
        
        hidden_states = []
        
        # Process sequence one time step at a time
        for t in range(seq_len):
            x_t = x[:, t, :]  # Current input (batch_size, input_size)
            
            # RNN equation: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
            # (each nn.Linear carries its own bias; their sum plays the role of b_h)
            h = torch.tanh(self.W_xh(x_t) + self.W_hh(h))
            
            hidden_states.append(h)
        
        # Output from final hidden state
        output = self.W_hy(h)
        
        return output, hidden_states

print("✓ VanillaRNN class defined!")

In [None]:
# Test the vanilla RNN
input_size = 10   # e.g., word embedding dimension
hidden_size = 20  # hidden state dimension
output_size = 3   # e.g., 3 classes for classification
batch_size = 2
seq_len = 5

# Create random input
x = torch.randn(batch_size, seq_len, input_size)

# Initialize RNN
rnn = VanillaRNN(input_size, hidden_size, output_size)

# Forward pass
output, hidden_states = rnn(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of hidden states: {len(hidden_states)}")
print(f"Each hidden state shape: {hidden_states[0].shape}")

**Exercise 1**: Analyze RNN computation

1. Count the number of parameters in the vanilla RNN
2. Trace the computation graph for a sequence of length 3
3. Explain why the same weights are applied at each time step
4. What happens if you pass in sequences of different lengths?

In [None]:
# YOUR CODE HERE
# Count parameters
total_params = sum(p.numel() for p in rnn.parameters())
print(f"Total parameters: {total_params}")

# Explain weight sharing
# YOUR EXPLANATION HERE

## 2. The Vanishing Gradient Problem

### Why Vanilla RNNs Struggle with Long Sequences

During backpropagation through time (BPTT), gradients are computed as:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_{t+1}} \cdot \frac{\partial h_{t+1}}{\partial h_t}$$

For long sequences, this becomes:

$$\frac{\partial h_T}{\partial h_0} = \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

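**Problem**: Each factor $\frac{\partial h_t}{\partial h_{t-1}}$ is a Jacobian matrix. If its norm stays below 1 across time steps, the product shrinks exponentially with $T$ and the gradient vanishes (→ 0). If the norm stays above 1, the gradient can explode instead.

**Result**: A vanilla RNN struggles to learn long-range dependencies!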

In [None]:
# Demonstrate vanishing gradient
def demonstrate_vanishing_gradient(seq_length=50):
    """
    Report parameter-gradient norms after backpropagating through a
    sequence of length seq_length in a vanilla RNN.
    """
    rnn = VanillaRNN(input_size=10, hidden_size=20, output_size=1)
    
    # Create dummy input and target
    x = torch.randn(1, seq_length, 10)
    target = torch.randn(1, 1)
    
    # Forward pass
    output, hidden_states = rnn(x)
    
    # Compute loss
    loss = F.mse_loss(output, target)
    
    # Backward pass
    loss.backward()
    
    # Analyze gradient magnitudes
    grad_magnitudes = []
    for name, param in rnn.named_parameters():
        if param.grad is not None:
            grad_mag = param.grad.norm().item()
            grad_magnitudes.append((name, grad_mag))
    
    return grad_magnitudes

# Test with different sequence lengths
for seq_len in [10, 50, 100]:
    grads = demonstrate_vanishing_gradient(seq_len)
    print(f"\nSequence length: {seq_len}")
    for name, mag in grads:
        print(f"  {name:20} gradient magnitude: {mag:.6f}")
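
The parameter-level norms above sum contributions from every time step, so they do not show directly how much gradient reaches the early steps. A more direct view, sketched here under the assumption that PyTorch's built-in `nn.RNNCell` (a tanh RNN cell) is an acceptable stand-in for the from-scratch class, is to record $\partial L / \partial h_t$ at each step with `retain_grad()`. The loss used here (squared norm of the final hidden state) is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, input_size, hidden_size = 50, 10, 20
cell = nn.RNNCell(input_size, hidden_size)  # built-in tanh RNN cell
x = torch.randn(1, seq_len, input_size)

h = torch.zeros(1, hidden_size)
hidden_states = []
for t in range(seq_len):
    h = cell(x[:, t, :], h)
    h.retain_grad()  # keep dL/dh_t for this intermediate (non-leaf) tensor
    hidden_states.append(h)

# Illustrative loss that depends only on the final hidden state
loss = hidden_states[-1].pow(2).sum()
loss.backward()

# Gradient norm reaching each time step
grad_norms = [h_t.grad.norm().item() for h_t in hidden_states]
for t in (0, 10, 25, 40, 49):
    print(f"step {t:2d}: ||dL/dh_t|| = {grad_norms[t]:.2e}")
```

With the default initialization, the norm at the earliest steps is typically many orders of magnitude smaller than at the last step, which is the vanishing gradient effect described in this section.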

## 3. LSTM: Long Short-Term Memory

**LSTM** (Hochreiter & Schmidhuber, 1997) mitigates the vanishing gradient problem with gating mechanisms and an additive cell-state update.

### LSTM Architecture

LSTM has:
1. **Cell state** ($c_t$): Long-term memory highway
2. **Hidden state** ($h_t$): Short-term memory
3. **Three gates**:
   - **Forget gate** ($f_t$): What to forget from cell state
   - **Input gate** ($i_t$): What new information to add
   - **Output gate** ($o_t$): What to output

### LSTM Equations:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$ (Forget gate)

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$ (Input gate)

$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$ (Candidate values)

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$ (Update cell state)

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$ (Output gate)

$$h_t = o_t \odot \tanh(c_t)$$ (New hidden state)

Where $\sigma$ is sigmoid, $\odot$ is element-wise multiplication
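
As a sanity check, the equations above can be evaluated by hand and compared with PyTorch's built-in `nn.LSTMCell`. This is a sketch under two assumptions worth stating: the built-in cell keeps separate input and hidden weight matrices (equivalent to one matrix applied to the concatenation $[h_{t-1}, x_t]$), and it stores the four gate blocks in the order input, forget, candidate, output. The sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, batch_size = 10, 20, 2
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)
c_prev = torch.randn(batch_size, hidden_size)

with torch.no_grad():
    # Sum of input and hidden projections == one matrix applied to [h_{t-1}, x_t]
    pre = x @ cell.weight_ih.T + cell.bias_ih + h_prev @ cell.weight_hh.T + cell.bias_hh

    # PyTorch block order: input, forget, candidate (g), output
    i_pre, f_pre, g_pre, o_pre = pre.chunk(4, dim=1)

    f_t = torch.sigmoid(f_pre)           # forget gate
    i_t = torch.sigmoid(i_pre)           # input gate
    c_tilde = torch.tanh(g_pre)          # candidate values
    c_t = f_t * c_prev + i_t * c_tilde   # cell state update
    o_t = torch.sigmoid(o_pre)           # output gate
    h_t = o_t * torch.tanh(c_t)          # new hidden state

    # Reference result from the built-in cell
    h_ref, c_ref = cell(x, (h_prev, c_prev))

print(torch.allclose(h_t, h_ref, atol=1e-5), torch.allclose(c_t, c_ref, atol=1e-5))
```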

In [None]:
class LSTMCell(nn.Module):
    """
    Single LSTM cell implementation.
    
    Educational implementation to understand LSTM internals.
    """
    
    def __init__(self, input_size, hidden_size):
        super(LSTMCell, self).__init__()
        
        self.hidden_size = hidden_size
        
        # Combined linear layer for all gates (more efficient)
        # Computes forget, input, output gates, and candidate values
        self.linear = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        
    def forward(self, x, states=None):
        """
        Forward pass through LSTM cell.
        
        Parameters:
        -----------
        x : torch.Tensor
            Input (batch_size, input_size)
        states : tuple or None
            Previous (hidden, cell) states
            
        Returns:
        --------
        h_new : torch.Tensor
            New hidden state
        c_new : torch.Tensor
            New cell state
        """
        batch_size = x.size(0)
        
        # Initialize states if not provided
        if states is None:
            h = torch.zeros(batch_size, self.hidden_size).to(x.device)
            c = torch.zeros(batch_size, self.hidden_size).to(x.device)
        else:
            h, c = states
        
        # Concatenate input and hidden state
        combined = torch.cat([x, h], dim=1)
        
        # Compute all gates and candidate values
        gates = self.linear(combined)
        
        # Split into individual gates
        forget_gate, input_gate, output_gate, candidate = gates.chunk(4, dim=1)
        
        # Apply activations
        f_t = torch.sigmoid(forget_gate)
        i_t = torch.sigmoid(input_gate)
        o_t = torch.sigmoid(output_gate)
        c_tilde = torch.tanh(candidate)
        
        # Update cell state (key innovation of LSTM!)
        c_new = f_t * c + i_t * c_tilde
        
        # Compute new hidden state
        h_new = o_t * torch.tanh(c_new)
        
        return h_new, c_new

print("✓ LSTMCell class defined!")

In [None]:
# Build full LSTM using our cell
class SimpleLSTM(nn.Module):
    """
    Complete LSTM that processes sequences.
    """
    
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleLSTM, self).__init__()
        
        self.hidden_size = hidden_size
        self.lstm_cell = LSTMCell(input_size, hidden_size)
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, x, states=None):
        """
        Process sequence through LSTM.
        """
        batch_size, seq_len, _ = x.size()
        
        # Process sequence; LSTMCell creates zero states when `states` is None
        for t in range(seq_len):
            h, c = self.lstm_cell(x[:, t, :], states)
            states = (h, c)
        
        # Output from final hidden state
        output = self.fc(h)
        
        return output

# Test LSTM
lstm = SimpleLSTM(input_size=10, hidden_size=20, output_size=3)
x = torch.randn(2, 5, 10)
output = lstm(x)
print(f"LSTM output shape: {output.shape}")

## 4. GRU: Gated Recurrent Unit

**GRU** (Cho et al., 2014) is a simpler alternative to LSTM.

### Differences from LSTM:
- Only 2 gates (vs 3 in LSTM): Reset and Update
- No separate cell state (combines $h$ and $c$)
- Fewer parameters → faster training

### GRU Equations:

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$ (Reset gate)

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$ (Update gate)

$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$ (Candidate)

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$ (New hidden state)

In [None]:
class GRUCell(nn.Module):
    """
    GRU cell implementation.
    """
    
    def __init__(self, input_size, hidden_size):
        super(GRUCell, self).__init__()
        
        self.hidden_size = hidden_size
        
        # Gates and candidate
        self.linear_gates = nn.Linear(input_size + hidden_size, 2 * hidden_size)
        self.linear_candidate = nn.Linear(input_size + hidden_size, hidden_size)
        
    def forward(self, x, h=None):
        batch_size = x.size(0)
        
        if h is None:
            h = torch.zeros(batch_size, self.hidden_size).to(x.device)
        
        # Compute reset and update gates
        combined = torch.cat([x, h], dim=1)
        gates = self.linear_gates(combined)
        reset_gate, update_gate = gates.chunk(2, dim=1)
        
        r_t = torch.sigmoid(reset_gate)
        z_t = torch.sigmoid(update_gate)
        
        # Compute candidate hidden state
        combined_reset = torch.cat([x, r_t * h], dim=1)
        h_tilde = torch.tanh(self.linear_candidate(combined_reset))
        
        # Update hidden state (interpolation between old and new)
        h_new = (1 - z_t) * h + z_t * h_tilde
        
        return h_new

print("✓ GRUCell class defined!")
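
When switching to PyTorch's built-in `nn.GRUCell`, note that it follows a slightly different convention from the equations above: the update gate $z_t$ weights the *old* state, and the reset gate is applied after the hidden projection rather than before it. The sketch below reproduces the built-in computation by hand, assuming the documented gate order (reset, update, candidate) in the stacked weight matrices. Sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, batch_size = 10, 20, 2
cell = nn.GRUCell(input_size, hidden_size)

x = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)

with torch.no_grad():
    gi = x @ cell.weight_ih.T + cell.bias_ih        # input projections
    gh = h_prev @ cell.weight_hh.T + cell.bias_hh   # hidden projections
    i_r, i_z, i_n = gi.chunk(3, dim=1)              # order: reset, update, candidate
    h_r, h_z, h_n = gh.chunk(3, dim=1)

    r_t = torch.sigmoid(i_r + h_r)                  # reset gate
    z_t = torch.sigmoid(i_z + h_z)                  # update gate
    n_t = torch.tanh(i_n + r_t * h_n)               # reset applied after projection
    h_manual = (1 - z_t) * n_t + z_t * h_prev       # z_t keeps the OLD state here

    # Reference result from the built-in cell
    h_ref = cell(x, h_prev)

print(torch.allclose(h_manual, h_ref, atol=1e-5))
```

Both conventions describe the same family of models; the difference only matters when copying weights between implementations or interpreting gate activations.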

**Exercise 2**: Compare RNN architectures

1. Count parameters in vanilla RNN, LSTM, and GRU with same hidden size
2. Create a table comparing:
   - Number of parameters
   - Number of gates
   - Computational complexity
3. Discuss: When would you use each architecture?

In [None]:
# YOUR CODE HERE
# Compare parameter counts

## 5. Bidirectional RNNs

**Problem**: Standard RNNs only use past context.

**Solution**: Process sequence in both directions!

### Bidirectional RNN:
- **Forward RNN**: Processes left-to-right → $\overrightarrow{h}_t$
- **Backward RNN**: Processes right-to-left ← $\overleftarrow{h}_t$
- **Combine**: $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$

**Use case**: Sentence classification, NER (where future context helps)

In [None]:
class BiLSTM(nn.Module):
    """
    Bidirectional LSTM for sequence classification.
    """
    
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super(BiLSTM, self).__init__()
        
        # Use PyTorch's efficient LSTM implementation
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True  # Key: bidirectional!
        )
        
        # Output layer (hidden_size * 2 because bidirectional)
        self.fc = nn.Linear(hidden_size * 2, output_size)
        
    def forward(self, x):
        # LSTM output: (batch_size, seq_len, hidden_size * 2)
        lstm_out, (h_n, c_n) = self.lstm(x)
        
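        # Final hidden state of each direction in the last layer.
        # The forward state has read the sequence left-to-right; the backward
        # state has read it right-to-left, so it corresponds to time step 0.
        # h_n shape: (num_layers * 2, batch_size, hidden_size)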
        forward_hidden = h_n[-2, :, :]
        backward_hidden = h_n[-1, :, :]
        
        # Concatenate forward and backward
        hidden = torch.cat([forward_hidden, backward_hidden], dim=1)
        
        # Output
        output = self.fc(hidden)
        
        return output

# Test BiLSTM
bilstm = BiLSTM(input_size=10, hidden_size=20, output_size=3)
x = torch.randn(2, 5, 10)
output = bilstm(x)
print(f"BiLSTM output shape: {output.shape}")
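
A detail that is easy to miss: in a bidirectional `nn.LSTM`, the backward direction finishes at the *first* time step, so `lstm_out[:, -1, :]` does not contain the backward direction's final state. The following sketch (single layer, illustrative sizes) checks how `h_n` relates to slices of `lstm_out`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 20
lstm = nn.LSTM(input_size=10, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)
x = torch.randn(2, 5, 10)

with torch.no_grad():
    lstm_out, (h_n, _) = lstm(x)

# Forward direction: its final state sits at the last time step
forward_matches = torch.allclose(lstm_out[:, -1, :hidden_size], h_n[0], atol=1e-6)

# Backward direction: its final state sits at the first time step
backward_matches = torch.allclose(lstm_out[:, 0, hidden_size:], h_n[1], atol=1e-6)

# The backward half at the last time step has only seen one token
last_step_differs = not torch.allclose(lstm_out[:, -1, hidden_size:], h_n[1], atol=1e-6)

print(forward_matches, backward_matches, last_step_differs)
```

This is why the classifier concatenates entries of `h_n` instead of taking the last time step of `lstm_out`.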

## 6. Application: Sentiment Analysis

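Let's apply RNNs to a practical task: classifying movie reviews as positive or negative. To keep the notebook self-contained, we train on a tiny synthetic dataset, so treat the resulting accuracy as a sanity check rather than a benchmark.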

In [None]:
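# Create synthetic sentiment dataset (each sentence is repeated 20 times,
# so the train and test splits will contain identical sentences)
# In practice, use IMDB or SST dataset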
positive_reviews = [
    "This movie is amazing and wonderful",
    "I absolutely loved this film",
    "Best movie I have seen in years",
    "Incredible performance by all actors",
    "Highly recommended, fantastic story",
] * 20

negative_reviews = [
    "This movie is terrible and boring",
    "I hated every minute of it",
    "Worst film I have ever watched",
    "Poor acting and weak plot",
    "Do not waste your time on this",
] * 20

# Combine and create labels
texts = positive_reviews + negative_reviews
labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)

print(f"Dataset size: {len(texts)} reviews")
print(f"Positive: {sum(labels)}, Negative: {len(labels) - sum(labels)}")

In [None]:
# Build vocabulary
def build_vocab(texts, min_freq=1):
    """
    Build vocabulary from texts.
    """
    word_counts = Counter()
    for text in texts:
        words = text.lower().split()
        word_counts.update(words)
    
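    # Filter by frequency first, then assign contiguous indices starting at 2
    # (0 and 1 are reserved for <PAD> and <UNK>), so that len(vocab) always
    # matches the embedding table size even when min_freq > 1
    kept_words = [word for word, count in word_counts.items() if count >= min_freq]
    vocab = {word: idx + 2 for idx, word in enumerate(kept_words)}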
    
    # Add special tokens
    vocab['<PAD>'] = 0
    vocab['<UNK>'] = 1
    
    return vocab

vocab = build_vocab(texts)
print(f"Vocabulary size: {len(vocab)}")
print(f"Sample words: {list(vocab.keys())[:10]}")
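
Whitespace splitting keeps punctuation attached to words, so `"movie,"` and `"movie"` become different tokens and the former falls back to `<UNK>`. A possible alternative, sketched here with a hypothetical helper `simple_tokenize` (not part of the notebook), is a small regular-expression tokenizer. The example sentence is one of the test reviews used later in this notebook.

```python
import re

def simple_tokenize(text):
    """Lowercase and keep runs of letters, digits, and apostrophes (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", text.lower())

review = "Terrible movie, complete waste of time"

whitespace_tokens = review.lower().split()   # punctuation stays attached
regex_tokens = simple_tokenize(review)       # punctuation is dropped

print(whitespace_tokens)
print(regex_tokens)
```

If a tokenizer like this is adopted, it needs to be used consistently in both vocabulary building and text encoding.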

In [None]:
# Text to indices
def text_to_indices(text, vocab, max_len=20):
    """
    Convert text to indices with padding.
    """
    words = text.lower().split()
    indices = [vocab.get(word, vocab['<UNK>']) for word in words]
    
    # Pad or truncate
    if len(indices) < max_len:
        indices += [vocab['<PAD>']] * (max_len - len(indices))
    else:
        indices = indices[:max_len]
    
    return indices

# Convert all texts
X = np.array([text_to_indices(text, vocab) for text in texts])
y = np.array(labels)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"\nSample encoded text: {X[0]}")
print(f"Original text: {texts[0]}")

In [None]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create PyTorch dataset
class SentimentDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = torch.LongTensor(texts)
        self.labels = torch.LongTensor(labels)
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

train_dataset = SentimentDataset(X_train, y_train)
test_dataset = SentimentDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)

print(f"Training batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")

In [None]:
# Build sentiment classifier
class SentimentLSTM(nn.Module):
    """
    LSTM-based sentiment classifier.
    """
    
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(SentimentLSTM, self).__init__()
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        
        # Bidirectional LSTM
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=2,
            bidirectional=True,
            batch_first=True,
            dropout=0.3
        )
        
        # Output layer
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(0.3)
        
    def forward(self, x):
        # Embed: (batch, seq_len) -> (batch, seq_len, embedding_dim)
        embedded = self.embedding(x)
        
        # LSTM: (batch, seq_len, embedding_dim) -> (batch, seq_len, hidden_dim * 2)
        lstm_out, (h_n, c_n) = self.lstm(embedded)
        
        # Final hidden states of the last layer, one per direction
        forward_hidden = h_n[-2, :, :]
        backward_hidden = h_n[-1, :, :]
        hidden = torch.cat([forward_hidden, backward_hidden], dim=1)
        
        # Dropout and output
        hidden = self.dropout(hidden)
        output = self.fc(hidden)
        
        return output

# Initialize model
model = SentimentLSTM(
    vocab_size=len(vocab),
    embedding_dim=50,
    hidden_dim=64,
    output_dim=2  # Binary classification
).to(device)

print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters())}")
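
The classifier above reads `h_n` after the forward LSTM has also stepped through the trailing `<PAD>` positions. One common way to avoid this, sketched here under the assumption that the true sequence lengths are available, is `pack_padded_sequence` (already imported in the setup cell). The example uses random vectors in place of embeddings and zeros as a stand-in for padded positions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

# Two sequences padded to length 5; the second has only 2 real steps
lengths = torch.tensor([5, 2])
batch = torch.randn(2, 5, 4)
batch[1, 2:, :] = 0.0  # stand-in for padded positions

with torch.no_grad():
    # Packed: the LSTM stops at each sequence's true length
    packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)

    # Padded: the LSTM also steps through the padding
    _, (h_padded, _) = lstm(batch)

    # Reference: run the short sequence alone, without any padding
    _, (h_short, _) = lstm(batch[1:2, :2, :])

packed_matches = torch.allclose(h_packed[0, 1], h_short[0, 0], atol=1e-5)
padded_matches = torch.allclose(h_padded[0, 1], h_short[0, 0], atol=1e-5)
print(packed_matches, padded_matches)
```

With packing, the final hidden state of the short sequence equals the state obtained from the unpadded input; without packing, the extra padded steps change it.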

In [None]:
# Training function
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0
    correct = 0
    total = 0
    
    for texts, labels in dataloader:
        texts, labels = texts.to(device), labels.to(device)
        
        # Forward
        optimizer.zero_grad()
        outputs = model(texts)
        loss = criterion(outputs, labels)
        
        # Backward
        loss.backward()
        optimizer.step()
        
        # Statistics
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        correct += predicted.eq(labels).sum().item()
        total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total

def evaluate(model, dataloader, criterion):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for texts, labels in dataloader:
            texts, labels = texts.to(device), labels.to(device)
            outputs = model(texts)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
            total += labels.size(0)
    
    return total_loss / len(dataloader), correct / total

print("✓ Training functions defined!")

In [None]:
# Train model
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10
train_losses, train_accs = [], []
test_losses, test_accs = [], []

print("Training sentiment classifier...\n")

for epoch in range(num_epochs):
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion)
    test_loss, test_acc = evaluate(model, test_loader, criterion)
    
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    test_losses.append(test_loss)
    test_accs.append(test_acc)
    
    print(f"Epoch {epoch+1}/{num_epochs}:")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}")
    print(f"  Test Loss: {test_loss:.4f}, Test Acc: {test_acc:.4f}")

print("\n✓ Training complete!")

In [None]:
# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss
ax1.plot(train_losses, label='Train Loss')
ax1.plot(test_losses, label='Test Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Test Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy
ax2.plot(train_accs, label='Train Accuracy')
ax2.plot(test_accs, label='Test Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training and Test Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Test on new examples
def predict_sentiment(text, model, vocab):
    model.eval()
    indices = text_to_indices(text, vocab)
    tensor = torch.LongTensor([indices]).to(device)
    
    with torch.no_grad():
        output = model(tensor)
        probs = F.softmax(output, dim=1)
        pred = output.argmax(1).item()
    
    sentiment = "Positive" if pred == 1 else "Negative"
    confidence = probs[0][pred].item()
    
    return sentiment, confidence

# Test examples
test_reviews = [
    "This film is absolutely wonderful and amazing",
    "Terrible movie, complete waste of time",
    "Not great but not terrible either",
]

print("Sentiment predictions:\n")
for review in test_reviews:
    sentiment, conf = predict_sentiment(review, model, vocab)
    print(f"Review: {review}")
    print(f"Sentiment: {sentiment} (confidence: {conf:.2%})\n")

**Exercise 3**: Improve the sentiment classifier

Try these improvements:
1. Add more layers or increase hidden size
2. Try GRU instead of LSTM
3. Use pre-trained word embeddings (GloVe)
4. Implement attention mechanism (preview of an upcoming module!)
5. Compare performance of different architectures

In [None]:
# YOUR CODE HERE
# Experiment with different architectures

**Exercise 4**: Visualize attention

Although we haven't covered attention yet:
1. Extract LSTM hidden states at each time step
2. Visualize which words contribute most to the final prediction
3. Use gradient-based attribution or simple averaging
4. Discuss: Which words are most important for sentiment?

In [None]:
# YOUR CODE HERE
# Visualize word importance

## Summary

### Key Concepts Covered:

1. **Vanilla RNN**:
   - Recurrent connections for sequence processing
   - Hidden state as memory
   - Vanishing gradient problem

2. **LSTM**:
   - Gating mechanisms (forget, input, output)
   - Cell state as long-term memory
   - Mitigates vanishing gradients

3. **GRU**:
   - Simplified LSTM with 2 gates
   - Fewer parameters, faster training
   - Often comparable performance to LSTM

4. **Bidirectional RNNs**:
   - Process sequences in both directions
   - Better context understanding
   - Essential for many NLP tasks

5. **Practical Application**:
   - Sentiment analysis pipeline
   - Data preprocessing and batching
   - Training and evaluation

### Architecture Comparison:

| Architecture | Parameters | Speed | Long-term Memory | Use Case |
|--------------|-----------|-------|------------------|----------|
| Vanilla RNN | Low | Fast | Poor | Simple tasks |
| LSTM | High | Slower | Excellent | Complex sequences |
| GRU | Medium | Medium | Very good | Good default choice |
| BiLSTM | High | Slowest | Excellent | Classification, NER |
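
The "Parameters" column can be made concrete by counting parameters in PyTorch's built-in layers. This is a minimal sketch with illustrative sizes (input 10, hidden 20) and a hypothetical helper `count_params`; the from-scratch classes in this notebook have slightly different counts because they use a single bias per gate.

```python
import torch.nn as nn

def count_params(module):
    """Total number of parameters in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 10, 20  # illustrative sizes (assumption)

counts = {
    "RNN": count_params(nn.RNN(input_size, hidden_size)),
    "GRU": count_params(nn.GRU(input_size, hidden_size)),
    "LSTM": count_params(nn.LSTM(input_size, hidden_size)),
    "BiLSTM": count_params(nn.LSTM(input_size, hidden_size, bidirectional=True)),
}

for name, n in counts.items():
    print(f"{name:7s} {n}")
```

For single-layer built-in modules with the same sizes, the counts scale as 1 : 3 : 4 : 8 (RNN : GRU : LSTM : BiLSTM), since a GRU has three gate-sized blocks, an LSTM has four, and the bidirectional variant duplicates the LSTM.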

### What's Next?

In **Module 04: Sequence-to-Sequence Models**, we'll learn:
- Encoder-decoder architecture
- Machine translation with RNNs
- Teacher forcing and beam search
- Setting the stage for attention mechanisms

### Additional Resources:

- **LSTM Paper**: [Long Short-Term Memory](http://www.bioinf.jku.at/publications/older/2604.pdf)
- **GRU Paper**: [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078)
- **Understanding LSTMs**: [Chris Olah's Blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- **PyTorch RNN Tutorial**: [pytorch.org/tutorials](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html)