# Module 04: Sequence-to-Sequence Models

**Difficulty**: ⭐⭐⭐ Advanced  
**Estimated Time**: 120 minutes  
**Prerequisites**: [Module 03: Recurrent Neural Networks](03_recurrent_neural_networks.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the encoder-decoder architecture for sequence-to-sequence tasks
2. Implement seq2seq models from scratch in PyTorch
3. Apply teacher forcing for stable training
4. Implement greedy and beam search decoding
5. Build a translation system using seq2seq
6. Understand limitations that led to attention mechanisms

## What are Sequence-to-Sequence Models?

**Seq2Seq** models map input sequences to output sequences of potentially different lengths.

### Examples:
- Machine translation: "Hello" → "Bonjour"
- Summarization: [Long article] → [Short summary]
- Question answering: "What is NLP?" → "Natural Language Processing..."
- Dialogue: "How are you?" → "I'm doing well, thanks!"

### The Challenge:

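A plain RNN emits one output per input step (or a single output for the whole sequence), so its output length is tied to the input length. How do we handle:
- Variable-length inputs and outputs?
- Outputs whose length differs from the input's?
- One-to-many, many-to-one, and many-to-many mappings?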

**Solution**: Encoder-Decoder Architecture!

## Setup and Imports

```python
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# Visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

# Random seeds
np.random.seed(42)
random.seed(42)
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("✓ All libraries imported successfully!")
```

## 1. Encoder-Decoder Architecture

The seq2seq model consists of two RNNs:

1. **Encoder**: Reads input sequence → produces context vector
2. **Decoder**: Takes context vector → generates output sequence

**Key insight**: Context vector = fixed-size representation of entire input!

### Mathematics:

**Encoder**:
- For each input word $x_t$: $h_t^{enc} = \text{LSTM}(x_t, h_{t-1}^{enc})$
- Final hidden state = context: $c = h_T^{enc}$ (an LSTM also passes along its final cell state)

**Decoder**:
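- Initialize with context: $h_0^{dec} = c$
- Update the state from the previous output token: $h_t^{dec} = \text{LSTM}(y_{t-1}, h_{t-1}^{dec})$, where $y_0$ is a start-of-sequence token
- Predict the next token from the new state: $P(y_t \mid y_{<t}, x) = \text{softmax}(W_o h_t^{dec})$, where $W_o$ is a learned output projection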
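
The encoder and decoder equations can be traced in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the sizes, the shared `embedding`, and the names `encoder_rnn`, `decoder_rnn`, `out_proj`, and `sos_id` are illustrative choices. Because `nn.LSTM` carries both a hidden state and a cell state, the context handed to the decoder here is the pair returned by the encoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions for this sketch)
vocab_size, emb_dim, hidden_dim = 20, 8, 16
batch_size, src_len = 2, 5

embedding = nn.Embedding(vocab_size, emb_dim)
encoder_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
decoder_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
out_proj = nn.Linear(hidden_dim, vocab_size)

# Encoder: read the whole source and keep only the final state as context
src = torch.randint(0, vocab_size, (batch_size, src_len))
_, context = encoder_rnn(embedding(src))       # (final hidden, final cell)

# Decoder: start from the context and take one step from a start token
sos_id = 1                                      # hypothetical <sos> index
y_prev = torch.full((batch_size, 1), sos_id, dtype=torch.long)
dec_out, dec_state = decoder_rnn(embedding(y_prev), context)
logits = out_proj(dec_out.squeeze(1))           # scores over the vocabulary

print(context[0].shape)   # (num_layers, batch, hidden_dim)
print(logits.shape)       # (batch, vocab_size)
```

Note that the context has the same size no matter how long `src` is, which is exactly the fixed-size representation described above.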

## 2. Teacher Forcing

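**Problem**: If the decoder is fed its own predictions during training, one early mistake corrupts every later input, so errors compound and learning is slow and unstable.

**Teacher Forcing**: Feed the ground-truth previous token (not the model's prediction) as the decoder input at each step!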

```python
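# Schematic only: `decoder`, `sos_token`, and `target` are placeholders.

# Without teacher forcing (free running): errors can compound
decoder_input = sos_token
for t in range(len(target)):
    output = decoder(decoder_input)    # Conditions on its own prediction
    decoder_input = output.argmax(-1)

# With teacher forcing (stable training, but causes exposure bias)
decoder_input = sos_token
for t in range(len(target)):
    output = decoder(decoder_input)    # Conditions on the ground truth
    decoder_input = target[t]          # Next input is the true token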
```

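**Trade-off**: Faster, more stable convergence, but a train/test mismatch known as exposure bias: at inference time the decoder must condition on its own predictions, which it never saw during training.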
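
In practice, full teacher forcing amounts to shifting the target sequence by one position: the decoder reads everything except the last token and is trained to predict everything except the first. The sketch below shows that shift on a single made-up sentence. The token ids, sizes, and layer names are assumptions for illustration, and the decoder starts from a zero state here because no encoder is attached.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, emb_dim, hidden_dim = 10, 8, 16   # illustrative sizes
embedding = nn.Embedding(vocab_size, emb_dim)
decoder_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
out_proj = nn.Linear(hidden_dim, vocab_size)

# One target sentence: <sos>=1, three word ids, <eos>=2 (hypothetical ids)
target = torch.tensor([[1, 5, 7, 4, 2]])

decoder_input = target[:, :-1]   # <sos> w1 w2 w3   (what the decoder reads)
labels = target[:, 1:]           # w1 w2 w3 <eos>   (what it must predict)

# With teacher forcing every input token is known in advance, so all
# time steps can be processed in a single call instead of a Python loop.
dec_out, _ = decoder_rnn(embedding(decoder_input))
logits = out_proj(dec_out)       # (batch, tgt_len - 1, vocab_size)

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), labels.reshape(-1)
)
print(f"loss: {loss.item():.3f}")
```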

## 3. Implementing Seq2Seq

Let's build a complete seq2seq model for machine translation.
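
A minimal sketch of such a model follows, assuming single-layer LSTMs, batch-first tensors, and a target whose first token is `<sos>`. The class names, the `teacher_forcing_ratio` argument, and all sizes are illustrative choices rather than a fixed API.

```python
import random
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) -> final (hidden, cell) state as context
        _, state = self.rnn(self.embedding(src))
        return state


class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token, state):
        # token: (batch,) -> logits for the next token: (batch, vocab)
        output, state = self.rnn(self.embedding(token.unsqueeze(1)), state)
        return self.out(output.squeeze(1)), state


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        # tgt: (batch, tgt_len), with tgt[:, 0] holding the <sos> token
        state = self.encoder(src)
        token = tgt[:, 0]
        step_logits = []
        for t in range(1, tgt.size(1)):
            logits, state = self.decoder(token, state)
            step_logits.append(logits)
            # Per step: feed the ground truth or the model's own prediction
            use_teacher = random.random() < teacher_forcing_ratio
            token = tgt[:, t] if use_teacher else logits.argmax(dim=1)
        return torch.stack(step_logits, dim=1)   # (batch, tgt_len - 1, vocab)


# Shape check on random token ids (not real data)
model = Seq2Seq(Encoder(30, 16, 32), Decoder(40, 16, 32))
src = torch.randint(0, 30, (4, 7))
tgt = torch.randint(0, 40, (4, 6))
logits = model(src, tgt)
print(logits.shape)   # (batch, tgt_len - 1, target vocab size)
```

The returned logits line up with `tgt[:, 1:]`, so a training loss can compare them directly against the target shifted by one position. Setting `teacher_forcing_ratio=1.0` gives pure teacher forcing, and `0.0` gives free-running decoding.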

## 4. Decoding Strategies

**How to generate output sequences?**

### 1. Greedy Decoding
- At each step, pick the most probable word
- Fast, but not guaranteed to find the most probable sequence overall
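
Greedy decoding can be sketched without a neural network by standing in a lookup table for the model. Everything named here (`TOY_MODEL`, `next_token_probs`, and the probabilities) is hypothetical and chosen so the weakness of the greedy choice is visible.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix so far
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"cat": 0.9, "dog": 0.1},
}


def next_token_probs(prefix):
    # Any prefix not in the table ends the sentence with certainty
    return TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})


def greedy_decode(next_token_probs, max_len=10, eos="<eos>"):
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)     # locally most probable token
        log_prob += math.log(probs[token])
        if token == eos:
            break
        prefix.append(token)
    return prefix, log_prob


tokens, log_prob = greedy_decode(next_token_probs)
print(tokens, f"{math.exp(log_prob):.2f}")    # ['the', 'cat'] 0.30
```

In this toy table the sequence "a cat" has probability 0.4 × 0.9 = 0.36, higher than the 0.30 greedy result, but greedy never explores it because "the" wins the first step.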

### 2. Beam Search
- Keep the top-k hypotheses at each step
- Usually better quality, but roughly k times the computation of greedy decoding
- Beam size k trades quality against speed (k = 1 is greedy decoding)
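
A simplified beam search over a stand-in model is sketched below. The lookup table `TOY_MODEL` and the helper `next_token_probs` are hypothetical, and this version omits refinements such as length normalization and early stopping that practical implementations often add.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix so far
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"cat": 0.9, "dog": 0.1},
}


def next_token_probs(prefix):
    # Any prefix not in the table ends the sentence with certainty
    return TOY_MODEL.get(tuple(prefix), {"<eos>": 1.0})


def beam_search(next_token_probs, beam_size=2, max_len=10, eos="<eos>"):
    beams = [([], 0.0)]       # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, p in next_token_probs(tokens).items():
                if p <= 0:
                    continue
                new_score = score + math.log(p)
                if token == eos:
                    finished.append((tokens, new_score))
                else:
                    candidates.append((tokens + [token], new_score))
        if not candidates:
            break
        # Keep only the top-k unfinished hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    # Fall back to unfinished hypotheses if none reached <eos>
    pool = finished or beams
    return max(pool, key=lambda c: c[1])


tokens, log_prob = beam_search(next_token_probs, beam_size=2)
print(tokens, f"{math.exp(log_prob):.2f}")    # ['a', 'cat'] 0.36

tokens_1, log_prob_1 = beam_search(next_token_probs, beam_size=1)
print(tokens_1, f"{math.exp(log_prob_1):.2f}")  # ['the', 'cat'] 0.30
```

With a beam of 2 the lower-ranked first word "a" survives long enough for its strong continuation to win, while a beam of 1 reproduces the greedy result.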

## 5. Application: Machine Translation

Build a simple English-to-French translator.
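
Before any model can be trained, sentence pairs have to become padded tensors of token ids. The sketch below shows one way to do that. The three sentence pairs are toy data made up for illustration (not a real corpus), and the helper names `build_vocab`, `encode`, and `pad_batch` are assumptions, as is the simple whitespace tokenization.

```python
from collections import Counter
import torch

# Toy parallel data, invented for this example
pairs = [
    ("hello", "bonjour"),
    ("thank you", "merci"),
    ("good night", "bonne nuit"),
]

SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def build_vocab(sentences, min_freq=1):
    # Map each sufficiently frequent word to an id after the special tokens
    counts = Counter(word for s in sentences for word in s.split())
    words = sorted(w for w, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(SPECIALS + words)}


def encode(sentence, vocab):
    # Wrap the word ids in <sos> ... <eos>; unknown words map to <unk>
    ids = [vocab.get(w, vocab["<unk>"]) for w in sentence.split()]
    return [vocab["<sos>"]] + ids + [vocab["<eos>"]]


def pad_batch(sequences, pad_id):
    # Right-pad every sequence to the length of the longest one
    max_len = max(len(s) for s in sequences)
    return torch.tensor([s + [pad_id] * (max_len - len(s)) for s in sequences])


src_vocab = build_vocab(en for en, _ in pairs)
tgt_vocab = build_vocab(fr for _, fr in pairs)

src_batch = pad_batch([encode(en, src_vocab) for en, _ in pairs], src_vocab["<pad>"])
tgt_batch = pad_batch([encode(fr, tgt_vocab) for _, fr in pairs], tgt_vocab["<pad>"])
print(src_batch.shape, tgt_batch.shape)
```

When training on padded batches like these, passing `ignore_index=tgt_vocab["<pad>"]` to `nn.CrossEntropyLoss` keeps padding positions from contributing to the loss.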

## 6. Summary

### Key Concepts:

1. **Encoder-Decoder**: Two RNNs for sequence transformation
2. **Context Vector**: Fixed-size bottleneck (limitation!)
3. **Teacher Forcing**: Stable training technique, at the cost of exposure bias
4. **Beam Search**: Usually better decoding than greedy, at higher cost
5. **Limitations**: Context bottleneck, long sequences

### What's Next?

In **Module 05: Attention Mechanism**, we solve the context bottleneck!