<a href="https://colab.research.google.com/github/Lcocks/DS6050-DeepLearning/blob/main/7HW_seq2seq.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ##TRY CHANGING SOME THINGS AROUND WITH THE TRANSFORMERS FOR THE ASSIGNMENT AND TALK ABOUT IT

# Homework: The Dawn of Neural Machine Translation with Seq2Seq

## Part 1: Historical Context and Motivation

Before the rise of deep learning, machine translation (MT) was dominated by **Statistical Machine Translation (SMT)**. SMT systems were complex engineering feats, relying on statistical models to translate phrases piece-by-piece and then reassembling them using intricate rules.

In 2014, a seminal paper changed the landscape: **"Sequence to Sequence Learning with Neural Networks"** by Sutskever, Vinyals, and Le. They proposed an elegant, end-to-end neural architecture.

### The Core Idea

The core idea is remarkably simple:

1. **The Encoder**: An RNN reads the input sentence (e.g., English) one word at a time, compressing the entire meaning into a single, fixed-size vector. This is often called the **context vector** or, more poetically, a **"thought vector."**

2. **The Decoder**: Another RNN takes this "thought vector" as its starting point and generates the output sentence (e.g., French) one word at a time.

This architecture marked the beginning of **Neural Machine Translation (NMT)**. In 2016, Google Translate switched from its older SMT system to NMT. The improvement was dramatic.

> **"With this update, Google Translate is improving more in a single leap than we've seen in the last ten years combined."** – [Google Blog, 2016 ](https://blog.google/products/translate/found-translation-more-accurate-fluent-sentences-google-translate/)

The original 2014 paper used LSTMs. However, we will use **Gated Recurrent Units (GRUs)** for this assignment just to mix it up! GRUs are similar to LSTMs in that they use gates to control information flow, but their architecture is simpler (two gates vs. three, and no separate cell state). They often perform similarly to LSTMs but are slightly faster to train and easier to implement.

---

## Part 2: Key Concepts

### 2.1 Backpropagation Through Time (BPTT)

When training RNNs, we must backpropagate gradients through all time steps of the sequence. This is called **Backpropagation Through Time (BPTT)**. The gradients flow backwards through the unrolled RNN, allowing the model to learn long-term dependencies.

### 2.2 BPTT and Truncated BPTT (TBPTT)

If a sequence is very long (e.g., modeling an entire document), full BPTT consumes excessive memory because we must store the activations for every time step.

**Truncated BPTT (TBPTT)** solves this by breaking the sequence into chunks. We process a chunk, backpropagate gradients only within that chunk, and then pass the hidden state forward to the next chunk, stopping the gradient flow at the chunk boundary.

In this assignment, our sentences are short, so we will use standard BPTT.

### 2.3 The "Reversal Trick"

The 2014 paper discovered a surprising trick that significantly boosted performance: **Reverse the source sentence.**

- **Original**: [I, love, AI] → [J'aime, l'IA]
- **Reversed**: [AI, love, I] → [J'aime, l'IA]

By doing this, the first words of the output (J'aime) are very close to the corresponding words in the reversed input (I). This creates short-term dependencies, making it much easier for the optimizer to "establish communication" between the input and the output early in training.

### 2.4 Teacher Forcing

When training the decoder, if we use the model's prediction as the input for the subsequent step, an early mistake can cascade, making training unstable.

**Teacher Forcing** is a strategy where we sometimes use the actual ground truth token from the training data as the input for the next step, rather than the model's own prediction.

---

## Part 3: Setup and Data Preprocessing

We will use a dataset of English-French sentence pairs.

### 3.0 Download the Data

Run this cell in Colab to download and unzip the data:

```bash
!wget https://download.pytorch.org/tutorial/data.zip
!unzip -o data.zip
```

### 3.1 Imports and Utilities (Provided)

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
# Utilities for handling variable length sequences
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

import numpy as np
import random
import math
import time
import unicodedata
import re

# Set random seeds for reproducibility
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# Define special tokens
PAD_IDX = 0
SOS_IDX = 1
EOS_IDX = 2
UNK_IDX = 3
```

### 3.2 Vocabulary and Data Loading (Provided)

We provide the utilities to load, normalize, and filter the data. We limit the dataset size and sentence length for faster training.

```python
class Lang:
    """A class to hold the vocabulary of a language."""
    def __init__(self, name):
        self.name = name
        self.word2index = {"<PAD>": PAD_IDX, "<SOS>": SOS_IDX, "<EOS>": EOS_IDX, "<UNK>": UNK_IDX}
        self.index2word = {PAD_IDX: "<PAD>", SOS_IDX: "<SOS>", EOS_IDX: "<EOS>", UNK_IDX: "<UNK>"}
        self.n_words = 4

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1

def normalizeString(s):
    s = s.lower().strip()
    # Normalize Unicode characters (e.g., remove accents)
    s = ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s.strip()

# We filter for relatively short sentences
MAX_LENGTH = 15
NUM_EXAMPLES = 15000

def prepareData(lang1, lang2):
    print("Reading lines...")
    lines = open(f'data/{lang1}-{lang2}.txt', encoding='utf-8').read().strip().split('\n')
    
    # Limit the number of examples and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in lines[:NUM_EXAMPLES]]

    # Filter pairs by length
    pairs = [p for p in pairs if len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH]
    
    input_lang = Lang(lang1)
    output_lang = Lang(lang2)

    print(f"Trimmed to {len(pairs)} sentence pairs")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print(f"Vocabularies: {input_lang.name} ({input_lang.n_words}), {output_lang.name} ({output_lang.n_words})")
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra')
```

### 3.3 Dataset and DataLoader (Provided)

We implement the PyTorch Dataset. This is where we apply the **Input Reversal Trick**.

We also implement a `collate_fn`. This function handles padding sequences in a batch to the same length. Crucially, it also returns the original lengths of the sequences, which we need for **Packing**.

```python
class TranslationDataset(Dataset):
    def __init__(self, pairs, input_lang, output_lang, reverse_source=True):
        self.pairs = pairs
        self.input_lang = input_lang
        self.output_lang = output_lang
        self.reverse_source = reverse_source

    def __len__(self):
        return len(self.pairs)

    def indexesFromSentence(self, lang, sentence):
        return [lang.word2index.get(word, UNK_IDX) for word in sentence.split(' ')]

    def __getitem__(self, idx):
        pair = self.pairs[idx]
        src_text = pair[0]
        tgt_text = pair[1]

        src_indices = self.indexesFromSentence(self.input_lang, src_text)
        tgt_indices = self.indexesFromSentence(self.output_lang, tgt_text)

        # Apply the Reversal Trick to the source sentence
        if self.reverse_source:
            src_indices.reverse()

        # Add EOS token to both
        src_indices.append(EOS_IDX)
        tgt_indices.append(EOS_IDX)

        return torch.tensor(src_indices, dtype=torch.long), \
               torch.tensor(tgt_indices, dtype=torch.long)

# Collate function to handle padding and return lengths
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_item, tgt_item in batch:
        src_batch.append(src_item)
        tgt_batch.append(tgt_item)
    
    # Get the lengths of the source sequences BEFORE padding
    src_lengths = torch.tensor([len(s) for s in src_batch])
    
    # Pad the sequences
    src_batch = pad_sequence(src_batch, batch_first=True, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, batch_first=True, padding_value=PAD_IDX)

    # We return the lengths as well for packing later
    return src_batch.to(device), src_lengths, tgt_batch.to(device)

# Create Datasets and DataLoaders
BATCH_SIZE = 64
dataset = TranslationDataset(pairs, input_lang, output_lang, reverse_source=True)

# Split into train and validation (90/10)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
```

---

## Part 4: The Seq2Seq Architecture (Implementation Tasks)

Now we implement the core components.

### Task 1: The Encoder (20 Points)

The Encoder processes the input sequence and compresses it into the context vector.

**Important: Packing Padded Sequences.** When training RNNs on batches, we must use `pack_padded_sequence`. This tells the GRU/LSTM to ignore PAD tokens. If we don't pack, the RNN processes the padding, which wastes computation and can negatively affect the final hidden state (the context vector).

**Instructions:**

1. Initialize the `nn.Embedding` and `nn.GRU` layers. Use `batch_first=True`.
2. In the forward pass, embed the input.
3. Pack the embedded sequence using `pack_padded_sequence`.
4. Pass the packed sequence through the GRU.
5. Return the final hidden state.

```python
class EncoderGRU(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        # TODO: 1. Initialize the Embedding layer (input_dim -> emb_dim)
        self.embedding = None # <<< YOUR CODE HERE

        # TODO: 2. Initialize the GRU layer (emb_dim -> hid_dim)
        # Set batch_first=True. Set dropout only if n_layers > 1.
        self.rnn = None # <<< YOUR CODE HERE

        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_lengths):
        # src shape: (batch_size, src_len)
        # src_lengths shape: (batch_size)

        # TODO: 3. Pass the source through the embedding layer and apply dropout
        # embedded shape: (batch_size, src_len, emb_dim)
        embedded = None # <<< YOUR CODE HERE

        # TODO: 4. Pack the embedded sequences.
        # This ensures the RNN ignores the padding.
        # Remember to move src_lengths to CPU and set enforce_sorted=False.
        packed_embedded = None # <<< YOUR CODE HERE

        # TODO: 5. Pass the packed sequence through the RNN
        # hidden shape: (n_layers, batch_size, hid_dim)
        packed_outputs, hidden = None, None # <<< YOUR CODE HERE

        # In vanilla Seq2Seq, we only need the final hidden state (the context vector).
        return hidden
```

### Task 2: The Decoder (20 Points)

The Decoder takes the context vector as its initial hidden state and generates the output sequence one token at a time.

**Instructions:**

1. Initialize the Embedding, GRU, and output Linear (`fc_out`) layers.
2. The forward pass accepts one token (`input`) and the previous hidden state.
3. Embed the input token (remembering to add a sequence dimension).
4. Pass the embedding and hidden state to the GRU.
5. Pass the GRU output through the linear layer to get the prediction logits.
6. Return the prediction and the new hidden state.

```python
class DecoderGRU(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.n_layers = n_layers

        # TODO: 1. Initialize the Embedding layer (output_dim -> emb_dim)
        self.embedding = None # <<< YOUR CODE HERE

        # TODO: 2. Initialize the GRU layer (emb_dim -> hid_dim). Must match encoder's hid_dim.
        # Set batch_first=True.
        self.rnn = None # <<< YOUR CODE HERE

        # TODO: 3. Initialize the output linear layer (hid_dim -> output_dim)
        self.fc_out = None # <<< YOUR CODE HERE

        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden):
        # input shape: (batch_size) -> We are decoding one token at a time!
        # hidden shape: (n_layers, batch_size, hid_dim)

        # We need to add a sequence dimension: (batch_size) -> (batch_size, 1)
        input = input.unsqueeze(1)

        # TODO: 4. Pass the input token through the embedding layer and apply dropout
        # embedded shape: (batch_size, 1, emb_dim)
        embedded = None # <<< YOUR CODE HERE

        # TODO: 5. Pass the embedded input and the hidden state to the RNN
        # output shape: (batch_size, 1, hid_dim)
        # hidden shape: (n_layers, batch_size, hid_dim)
        output, hidden = None, None # <<< YOUR CODE HERE

        # TODO: 6. Generate the prediction logits.
        # Remove the sequence dimension (squeeze) before passing to the linear layer
        # (batch_size, 1, hid_dim) -> (batch_size, hid_dim) -> (batch_size, output_dim)
        prediction = None # <<< YOUR CODE HERE

        return prediction, hidden
```

### Task 3: The Seq2Seq Wrapper (30 Points)

This class combines the Encoder and Decoder and manages the overall process, including the decoding loop and Teacher Forcing.

**Instructions:**

1. Run the encoder on the source sequence and lengths to get the context vector (hidden).
2. Initialize the decoder input with the `<SOS>` token.
3. Iterate over the length of the target sequence:
    - Run the decoder one step.
    - Store the output.
    - Decide whether to use teacher forcing or the model's own prediction as the next input.

```python
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, src_lengths, tgt, teacher_forcing_ratio=0.5):
        # src shape: (batch_size, src_len)
        # tgt shape: (batch_size, tgt_len)

        batch_size = src.shape[0]
        tgt_len = tgt.shape[1]
        tgt_vocab_size = self.decoder.output_dim

        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)

        # TODO: 1. Encode the source sentence (passing src and src_lengths).
        # The final hidden state of the encoder is the initial hidden state of the decoder.
        hidden = None # <<< YOUR CODE HERE

        # TODO: 2. Initialize the first input to the decoder with the <SOS> token.
        # input shape: (batch_size)
        input = torch.full((batch_size,), SOS_IDX, dtype=torch.long, device=self.device)

        # Iterate over the target sequence length
        for t in range(0, tgt_len):
            # TODO: 3. Decode one step (pass input and hidden state to decoder)
            output, hidden = None, None # <<< YOUR CODE HERE

            # 4. Store the output
            outputs[:, t, :] = output

            # 5. Decide whether to use teacher forcing
            teacher_force = random.random() < teacher_forcing_ratio

            # Get the highest predicted token
            top1 = output.argmax(1)

            # TODO: 6. Prepare the next input.
            # If teacher forcing, use the actual next token from the target sequence (tgt[:, t]).
            # Otherwise, use the predicted token (top1).
            if teacher_force:
                 input = tgt[:, t]
            else:
                 input = top1

        return outputs
```

---

## Part 5: Training the Model

### 5.1 Initialization (Provided)

We initialize the model with sensible hyperparameters. We use a relatively small model (2 layers, 512 hidden units) which provides a good balance of capacity and training speed for this dataset.

```python
# Hyperparameters
INPUT_DIM = input_lang.n_words
OUTPUT_DIM = output_lang.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2 # Using 2 layers
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# Initialize models (Ensure Tasks 1-3 are completed first!)
enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT).to(device)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT).to(device)
model = Seq2Seq(enc, dec, device).to(device)

# Initialize weights (common practice for RNNs)
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
model.apply(init_weights)

# Optimizer
optimizer = optim.Adam(model.parameters())

# Loss function: CrossEntropyLoss, ignoring the padding index
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
```

### Task 4: The Training and Evaluation Loops (20 Points)

Implement the training and evaluation functions.

**Instructions:**

1. In `train`, implement the forward pass, loss calculation, backpropagation (BPTT), gradient clipping, and optimizer step.
2. In `evaluate`, implement the forward pass (with `teacher_forcing_ratio=0`).
3. **Crucial:** Reshape the output and tgt tensors correctly for the loss function. CrossEntropyLoss expects predictions of shape `(N, C)` and targets of shape `(N)`, where N is the total number of tokens.

```python
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        # Unpack the batch (including lengths from the collate_fn)
        src, src_lengths, tgt = batch
        
        optimizer.zero_grad()

        # TODO: 1. Forward pass (use default teacher forcing ratio)
        # Remember to pass src_lengths to the model
        output = None # <<< YOUR CODE HERE

        # output shape: (batch_size, tgt_len, output_dim)
        # tgt shape: (batch_size, tgt_len)

        # TODO: 2. Reshape for loss calculation.
        # Flatten the outputs and targets.
        output_dim = output.shape[-1]
        # Reshape output to (batch_size * tgt_len, output_dim)
        output = output.reshape(-1, output_dim)
        # Reshape tgt to (batch_size * tgt_len)
        tgt = tgt.reshape(-1)

        # TODO: 3. Calculate the loss
        loss = None # <<< YOUR CODE HERE

        # TODO: 4. Backward pass (BPTT)
        # <<< YOUR CODE HERE

        # 5. Clip gradients to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        # TODO: 6. Update parameters
        # <<< YOUR CODE HERE

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src, src_lengths, tgt = batch

            # TODO: 1. Forward pass (Set teacher_forcing_ratio=0 for evaluation)
            output = None # <<< YOUR CODE HERE

            # TODO: 2. Reshape for loss calculation (same as in train)
            output_dim = output.shape[-1]
            output = output.reshape(-1, output_dim)
            tgt = tgt.reshape(-1)

            # TODO: 3. Calculate the loss
            loss = None # <<< YOUR CODE HERE

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)
```

### 5.3 Running the Training (Provided)

```python
N_EPOCHS = 30  # Note: 15 epochs may not be sufficient for good translations!
CLIP = 1

best_valid_loss = float('inf')

print("Starting training...")

# NOTE: Uncomment the loop content after completing the tasks above.
for epoch in range(N_EPOCHS):
    start_time = time.time()

    # train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, val_loader, criterion)
    train_loss = 0 # Placeholder
    valid_loss = 0 # Placeholder

    end_time = time.time()

    # if valid_loss < best_valid_loss:
    #     best_valid_loss = valid_loss
    #     torch.save(model.state_dict(), 'seq2seq-gru-model.pt')

    print(f'Epoch: {epoch+1:02} | Time: {int(end_time - start_time)}s')
    # PPL (Perplexity) is exp(loss), a common metric for language models.
    # print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    # print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
```

**Important Note on Training Duration:**

Vanilla Seq2Seq models typically require 30-50+ epochs to produce reasonable translations. If you train for only 15 epochs, you will likely see:
- Decreasing loss (showing the model is learning)
- But poor actual translations (the model hasn't converged yet)

This is normal! The model needs more time to learn the complex mapping between languages.

---

## Part 6: Inference and Analysis (10 Points)

### 6.1 Inference (Greedy Decoding)

During inference, we don't have the target sentence, so teacher forcing is impossible. We use the model's own predictions at each step. The simplest method is **Greedy Decoding**: always choose the word with the highest probability.

```python
def translate_sentence(sentence, src_lang, tgt_lang, model, device, max_len=50):
    model.eval()

    # 1. Preprocess the input sentence (normalize and reverse!)
    normalized_sentence = normalizeString(sentence)
    reversed_sentence = ' '.join(normalized_sentence.split(' ')[::-1])

    # 2. Convert to indices and tensor
    indices = [src_lang.word2index.get(word, UNK_IDX) for word in reversed_sentence.split(' ')] + [EOS_IDX]
    src_tensor = torch.tensor(indices, dtype=torch.long).unsqueeze(0).to(device) # (1, T)
    src_len = torch.tensor([len(indices)])

    # 3. Encode the sentence
    with torch.no_grad():
        hidden = model.encoder(src_tensor, src_len)

    # 4. Start decoding
    trg_indices = [SOS_IDX]
    input_tensor = torch.tensor([SOS_IDX], dtype=torch.long).to(device) # (1)

    for i in range(max_len):
        with torch.no_grad():
            output, hidden = model.decoder(input_tensor, hidden)

        # 5. Greedy Decoding
        pred_token = output.argmax(1).item()
        trg_indices.append(pred_token)

        # Check for <EOS>
        if pred_token == EOS_IDX:
            break

        # Prepare the next input
        input_tensor = torch.tensor([pred_token], dtype=torch.long).to(device)

    # 6. Convert indices back to words
    trg_tokens = [tgt_lang.index2word[i] for i in trg_indices]
    return trg_tokens[1:-1] # Exclude <SOS> and <EOS>

# Qualitative Analysis (Uncomment after training)
# model.load_state_dict(torch.load('seq2seq-gru-model.pt'))
# examples = ["i am cold", "she is happy", "he is running", "we are ready"]
# for example in examples:
#     translation = translate_sentence(example, input_lang, output_lang, model, device)
#     print(f"EN: {example}")
#     print(f"FR: {' '.join(translation)}\n")
```

### 6.2 Understanding Your Results

**Expected Performance:**

After training, you may notice that your translations are not perfect - and that's completely normal! Here's what you should expect:

**What Good Results Look Like:**
- Training loss decreasing from ~5.0 to ~1.0-1.5
- Validation loss around 2.5-3.5
- Some simple phrases translating correctly (e.g., "how are you" → "comment vas tu")
- Shorter sentences working better than longer ones

**Why Translations May Be Poor:**

1. **The Information Bottleneck**: This is the fundamental limitation we've been discussing. The entire English sentence must be compressed into a single fixed-size vector (512 numbers). For complex sentences, critical information gets lost.

2. **Insufficient Training**: 30 epochs on this small dataset is barely enough. Production NMT systems train for much longer on millions of examples.

3. **Overfitting**: If your validation loss is significantly higher than training loss (e.g., 2.8 vs 1.3), the model is memorizing training patterns rather than learning to translate.

4. **Common Phrase Bias**: The model often outputs frequent French phrases (like "je suis...") regardless of the actual input, because these patterns were common in training data.

5. **Greedy Decoding**: We always pick the highest probability word. Beam search (which considers multiple possibilities) would improve results.

**What Your Model Is Actually Learning:**

Look at a translation like:
```
"i am cold" → "je suis serieux"
```

The model correctly learned:
- "i am" → "je suis" ✓
- But outputs a common word "serieux" instead of "froid"

This shows the model IS learning French grammar and common patterns, just not the specific vocabulary mapping yet.

**This Is Why Attention Was Invented!**

The poor performance of vanilla Seq2Seq on longer sentences directly motivated the invention of attention mechanisms (covered in the next module). Attention allows the decoder to "look back" at different parts of the input instead of relying on a single compressed vector.

---

### 6.3 Bonus: Diagnostic Function (Optional)

To better understand what your model has learned, implement this diagnostic function that checks if the model can at least memorize some training examples:

```python
def diagnose_model(model, src_lang, tgt_lang, pairs, device, num_examples=5):
    """
    Check if model can translate training examples (memorization test)
    """
    print("\n" + "=" * 70)
    print("MODEL DIAGNOSIS - Testing on Training Examples")
    print("=" * 70)
    
    for i in range(num_examples):
        en_sentence = pairs[i][0]
        fr_actual = pairs[i][1]
        fr_predicted = translate_sentence(en_sentence, src_lang, tgt_lang, model, device)
        
        print(f"\nExample {i+1}:")
        print(f"  EN (input):     {en_sentence}")
        print(f"  FR (expected):  {fr_actual}")
        print(f"  FR (predicted): {' '.join(fr_predicted)}")
        
        # Calculate word overlap
        expected_words = set(fr_actual.split())
        predicted_words = set(fr_predicted)
        overlap = expected_words.intersection(predicted_words)
        if len(expected_words) > 0:
            accuracy = len(overlap) / len(expected_words) * 100
            print(f"  Word overlap:   {len(overlap)}/{len(expected_words)} ({accuracy:.1f}%)")

# Run after loading best model
diagnose_model(model, input_lang, output_lang, pairs, device)
```

If the model can't even memorize training examples with >50% word overlap, it needs more training epochs or there may be a bug.

---

### 6.4 Conceptual Questions

Answer the following questions in a separate text cell or document:

1. **The Information Bottleneck**: The core limitation of this architecture is that the encoder must compress the entire input sentence into a single fixed-size context vector (hidden). Why is this a significant problem when translating very long or complex sentences?

2. **Input Reversal**: Explain again, in your own words, why reversing the input (the "Reversal Trick") helped the model learn more effectively. Relate your answer to the concept of gradient flow in BPTT.

3. **TBPTT Application**: While we used standard BPTT here, describe a different NLP task where Truncated BPTT (TBPTT) would be essential, and explain why standard BPTT would be unsuitable in that scenario.

4. **Packing**: Why is it important to use `pack_padded_sequence` in the encoder when dealing with batched inputs? What might happen if we didn't use it?


For those that are interested to improve the performance, try to add:[optional]
- Beam Search for better decoding (instead of greedy)
- Better evaluation metrics (BLEU score)

In [1]:
##  Part 3  ## 
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
# Utilities for handling variable length sequences
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

import numpy as np
import random
import math
import time
import unicodedata
import re

# Set random seeds for reproducibility
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# Define special tokens
PAD_IDX = 0
SOS_IDX = 1
EOS_IDX = 2
UNK_IDX = 3

Using device: cuda


In [2]:
class Lang:
    """A class to hold the vocabulary of a language."""
    def __init__(self, name):
        self.name = name
        self.word2index = {"<PAD>": PAD_IDX, "<SOS>": SOS_IDX, "<EOS>": EOS_IDX, "<UNK>": UNK_IDX}
        self.index2word = {PAD_IDX: "<PAD>", SOS_IDX: "<SOS>", EOS_IDX: "<EOS>", UNK_IDX: "<UNK>"}
        self.n_words = 4

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.index2word[self.n_words] = word
            self.n_words += 1

def normalizeString(s):
    s = s.lower().strip()
    # Normalize Unicode characters (e.g., remove accents)
    s = ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s.strip()

# We filter for relatively short sentences
MAX_LENGTH = 15
NUM_EXAMPLES = 15000

def prepareData(lang1, lang2):
    print("Reading lines...")
    lines = open(f'data/{lang1}-{lang2}.txt', encoding='utf-8').read().strip().split('\n')
    
    # Limit the number of examples and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')[:2]] for l in lines[:NUM_EXAMPLES]]

    # Filter pairs by length
    pairs = [p for p in pairs if len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH]
    
    input_lang = Lang(lang1)
    output_lang = Lang(lang2)

    print(f"Trimmed to {len(pairs)} sentence pairs")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print(f"Vocabularies: {input_lang.name} ({input_lang.n_words}), {output_lang.name} ({output_lang.n_words})")
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('eng', 'fra')


class TranslationDataset(Dataset):
    def __init__(self, pairs, input_lang, output_lang, reverse_source=True):
        self.pairs = pairs
        self.input_lang = input_lang
        self.output_lang = output_lang
        self.reverse_source = reverse_source

    def __len__(self):
        return len(self.pairs)

    def indexesFromSentence(self, lang, sentence):
        return [lang.word2index.get(word, UNK_IDX) for word in sentence.split(' ')]

    def __getitem__(self, idx):
        pair = self.pairs[idx]
        src_text = pair[0]
        tgt_text = pair[1]

        src_indices = self.indexesFromSentence(self.input_lang, src_text)
        tgt_indices = self.indexesFromSentence(self.output_lang, tgt_text)

        # Apply the Reversal Trick to the source sentence
        if self.reverse_source:
            src_indices.reverse()

        # Add EOS token to both
        src_indices.append(EOS_IDX)
        tgt_indices.append(EOS_IDX)

        return torch.tensor(src_indices, dtype=torch.long), \
               torch.tensor(tgt_indices, dtype=torch.long)

# Collate function to handle padding and return lengths
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_item, tgt_item in batch:
        src_batch.append(src_item)
        tgt_batch.append(tgt_item)
    
    # Get the lengths of the source sequences BEFORE padding
    src_lengths = torch.tensor([len(s) for s in src_batch])
    
    # Pad the sequences
    src_batch = pad_sequence(src_batch, batch_first=True, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, batch_first=True, padding_value=PAD_IDX)

    # We return the lengths as well for packing later
    return src_batch.to(device), src_lengths, tgt_batch.to(device)

# Create Datasets and DataLoaders
BATCH_SIZE = 64
dataset = TranslationDataset(pairs, input_lang, output_lang, reverse_source=True)

# Split into train and validation (90/10)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

Reading lines...
Trimmed to 15000 sentence pairs
Vocabularies: eng (2830), fra (5098)


## Part 4: The Seq2Seq Architecture (Implementation Tasks)

Now we implement the core components.

### Task 1: The Encoder (20 Points)

The Encoder processes the input sequence and compresses it into the context vector.

**Important: Packing Padded Sequences.** When training RNNs on batches, we must use `pack_padded_sequence`. This tells the GRU/LSTM to ignore PAD tokens. If we don't pack, the RNN processes the padding, which wastes computation and can negatively affect the final hidden state (the context vector).

**Instructions:**

1. Initialize the `nn.Embedding` and `nn.GRU` layers. Use `batch_first=True`.
2. In the forward pass, embed the input.
3. Pack the embedded sequence using `pack_padded_sequence`.
4. Pass the packed sequence through the GRU.
5. Return the final hidden state.

```python
class EncoderGRU(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        # TODO: 1. Initialize the Embedding layer (input_dim -> emb_dim)
        self.embedding = None # <<< YOUR CODE HERE

        # TODO: 2. Initialize the GRU layer (emb_dim -> hid_dim)
        # Set batch_first=True. Set dropout only if n_layers > 1.
        self.rnn = None # <<< YOUR CODE HERE

        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_lengths):
        # src shape: (batch_size, src_len)
        # src_lengths shape: (batch_size)

        # TODO: 3. Pass the source through the embedding layer and apply dropout
        # embedded shape: (batch_size, src_len, emb_dim)
        embedded = None # <<< YOUR CODE HERE

        # TODO: 4. Pack the embedded sequences.
        # This ensures the RNN ignores the padding.
        # Remember to move src_lengths to CPU and set enforce_sorted=False.
        packed_embedded = None # <<< YOUR CODE HERE

        # TODO: 5. Pass the packed sequence through the RNN
        # hidden shape: (n_layers, batch_size, hid_dim)
        packed_outputs, hidden = None, None # <<< YOUR CODE HERE

        # In vanilla Seq2Seq, we only need the final hidden state (the context vector).
        return hidden
```

### Task 2: The Decoder (20 Points)

The Decoder takes the context vector as its initial hidden state and generates the output sequence one token at a time.

**Instructions:**

1. Initialize the Embedding, GRU, and output Linear (`fc_out`) layers.
2. The forward pass accepts one token (`input`) and the previous hidden state.
3. Embed the input token (remembering to add a sequence dimension).
4. Pass the embedding and hidden state to the GRU.
5. Pass the GRU output through the linear layer to get the prediction logits.
6. Return the prediction and the new hidden state.

```python
class DecoderGRU(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.n_layers = n_layers

        # TODO: 1. Initialize the Embedding layer (output_dim -> emb_dim)
        self.embedding = None # <<< YOUR CODE HERE

        # TODO: 2. Initialize the GRU layer (emb_dim -> hid_dim). Must match encoder's hid_dim.
        # Set batch_first=True.
        self.rnn = None # <<< YOUR CODE HERE

        # TODO: 3. Initialize the output linear layer (hid_dim -> output_dim)
        self.fc_out = None # <<< YOUR CODE HERE

        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden):
        # input shape: (batch_size) -> We are decoding one token at a time!
        # hidden shape: (n_layers, batch_size, hid_dim)

        # We need to add a sequence dimension: (batch_size) -> (batch_size, 1)
        input = input.unsqueeze(1)

        # TODO: 4. Pass the input token through the embedding layer and apply dropout
        # embedded shape: (batch_size, 1, emb_dim)
        embedded = None # <<< YOUR CODE HERE

        # TODO: 5. Pass the embedded input and the hidden state to the RNN
        # output shape: (batch_size, 1, hid_dim)
        # hidden shape: (n_layers, batch_size, hid_dim)
        output, hidden = None, None # <<< YOUR CODE HERE

        # TODO: 6. Generate the prediction logits.
        # Remove the sequence dimension (squeeze) before passing to the linear layer
        # (batch_size, 1, hid_dim) -> (batch_size, hid_dim) -> (batch_size, output_dim)
        prediction = None # <<< YOUR CODE HERE

        return prediction, hidden
```

### Task 3: The Seq2Seq Wrapper (30 Points)

This class combines the Encoder and Decoder and manages the overall process, including the decoding loop and Teacher Forcing.

**Instructions:**

1. Run the encoder on the source sequence and lengths to get the context vector (hidden).
2. Initialize the decoder input with the `<SOS>` token.
3. Iterate over the length of the target sequence:
    - Run the decoder one step.
    - Store the output.
    - Decide whether to use teacher forcing or the model's own prediction as the next input.

```python
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, src_lengths, tgt, teacher_forcing_ratio=0.5):
        # src shape: (batch_size, src_len)
        # tgt shape: (batch_size, tgt_len)

        batch_size = src.shape[0]
        tgt_len = tgt.shape[1]
        tgt_vocab_size = self.decoder.output_dim

        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)

        # TODO: 1. Encode the source sentence (passing src and src_lengths).
        # The final hidden state of the encoder is the initial hidden state of the decoder.
        hidden = None # <<< YOUR CODE HERE

        # TODO: 2. Initialize the first input to the decoder with the <SOS> token.
        # input shape: (batch_size)
        input = torch.full((batch_size,), SOS_IDX, dtype=torch.long, device=self.device)

        # Iterate over the target sequence length
        for t in range(0, tgt_len):
            # TODO: 3. Decode one step (pass input and hidden state to decoder)
            output, hidden = None, None # <<< YOUR CODE HERE

            # 4. Store the output
            outputs[:, t, :] = output

            # 5. Decide whether to use teacher forcing
            teacher_force = random.random() < teacher_forcing_ratio

            # Get the highest predicted token
            top1 = output.argmax(1)

            # TODO: 6. Prepare the next input.
            # If teacher forcing, use the actual next token from the target sequence (tgt[:, t]).
            # Otherwise, use the predicted token (top1).
            if teacher_force:
                 input = tgt[:, t]
            else:
                 input = top1

        return outputs
```

In [3]:
class EncoderGRU(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers

        # TODO: 1. Initialize the Embedding layer (input_dim -> emb_dim)
        self.embedding = nn.Embedding(input_dim, emb_dim) # <<< YOUR CODE HERE

        # TODO: 2. Initialize the GRU layer (emb_dim -> hid_dim)
        # Set batch_first=True. Set dropout only if n_layers > 1.
        self.rnn = nn.GRU(emb_dim, self.hid_dim, n_layers, batch_first = True,  dropout=dropout if n_layers > 1 else 0) # <<< YOUR CODE HERE
        
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_lengths):
        # src shape: (batch_size, src_len)
        # src_lengths shape: (batch_size)

        # TODO: 3. Pass the source through the embedding layer and apply dropout
        # embedded shape: (batch_size, src_len, emb_dim)
        embedded = self.embedding(src) # <<< YOUR CODE HERE

        # TODO: 4. Pack the embedded sequences.
        # This ensures the RNN ignores the padding.
        # Remember to move src_lengths to CPU and set enforce_sorted=False.
        packed_embedded = nn.utils.rnn.pack_padded_sequence(
            embedded,
            src_lengths.cpu(),
            batch_first=True,
            enforce_sorted=False
        ) # <<< YOUR CODE HERE

        # TODO: 5. Pass the packed sequence through the RNN
        # hidden shape: (n_layers, batch_size, hid_dim)
        packed_outputs, hidden = self.rnn(packed_embedded) # <<< YOUR CODE HERE

        # In vanilla Seq2Seq, we only need the final hidden state (the context vector).
        return hidden

In [4]:
class DecoderGRU(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.output_dim = output_dim
        self.n_layers = n_layers

        # TODO: 1. Initialize the Embedding layer (output_dim -> emb_dim)
        self.embedding = nn.Embedding(self.output_dim, emb_dim) # <<< YOUR CODE HERE

        # TODO: 2. Initialize the GRU layer (emb_dim -> hid_dim). Must match encoder's hid_dim.
        # Set batch_first=True.
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers, batch_first = True) # <<< YOUR CODE HERE

        # TODO: 3. Initialize the output linear layer (hid_dim -> output_dim)
        self.fc_out = nn.Linear(hid_dim, output_dim) # <<< YOUR CODE HERE

        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden):
        # input shape: (batch_size) -> We are decoding one token at a time!
        # hidden shape: (n_layers, batch_size, hid_dim)

        # We need to add a sequence dimension: (batch_size) -> (batch_size, 1)
        input = input.unsqueeze(1)

        # TODO: 4. Pass the input token through the embedding layer and apply dropout
        # embedded shape: (batch_size, 1, emb_dim)
        embedded = self.dropout(self.embedding(input)) # <<< YOUR CODE HERE

        # TODO: 5. Pass the embedded input and the hidden state to the RNN
        # output shape: (batch_size, 1, hid_dim)
        # hidden shape: (n_layers, batch_size, hid_dim)
        output, hidden = self.rnn(embedded, hidden) # <<< YOUR CODE HERE

        # TODO: 6. Generate the prediction logits.
        # Remove the sequence dimension (squeeze) before passing to the linear layer
        # (batch_size, 1, hid_dim) -> (batch_size, hid_dim) -> (batch_size, output_dim)
        prediction = self.fc_out(output.squeeze(1)) # <<< YOUR CODE HERE

        return prediction, hidden

In [5]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, src_lengths, tgt, teacher_forcing_ratio=0.5):
        # src shape: (batch_size, src_len)
        # tgt shape: (batch_size, tgt_len)

        batch_size = src.shape[0]
        tgt_len = tgt.shape[1]
        tgt_vocab_size = self.decoder.output_dim

        # Tensor to store decoder outputs
        outputs = torch.zeros(batch_size, tgt_len, tgt_vocab_size).to(self.device)

        # TODO: 1. Encode the source sentence (passing src and src_lengths).
        # The final hidden state of the encoder is the initial hidden state of the decoder.
        hidden = self.encoder(src, src_lengths) # <<< YOUR CODE HERE

        # TODO: 2. Initialize the first input to the decoder with the <SOS> token.
        # input shape: (batch_size)
        input = torch.full((batch_size,), SOS_IDX, dtype=torch.long, device=self.device)

        # Iterate over the target sequence length
        for t in range(0, tgt_len):
            # TODO: 3. Decode one step (pass input and hidden state to decoder)
            output, hidden = self.decoder(input, hidden) # <<< YOUR CODE HERE

            # 4. Store the output
            outputs[:, t, :] = output

            # 5. Decide whether to use teacher forcing
            teacher_force = random.random() < teacher_forcing_ratio

            # Get the highest predicted token
            top1 = output.argmax(1)

            # TODO: 6. Prepare the next input.
            # If teacher forcing, use the actual next token from the target sequence (tgt[:, t]).
            # Otherwise, use the predicted token (top1).
            if teacher_force:
                 input = tgt[:, t]
            else:
                 input = top1

        return outputs

## Part 5: Training the Model

### 5.1 Initialization (Provided)

We initialize the model with sensible hyperparameters. We use a relatively small model (2 layers, 512 hidden units) which provides a good balance of capacity and training speed for this dataset.

```python
# Hyperparameters
INPUT_DIM = input_lang.n_words
OUTPUT_DIM = output_lang.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2 # Using 2 layers
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# Initialize models (Ensure Tasks 1-3 are completed first!)
enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT).to(device)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT).to(device)
model = Seq2Seq(enc, dec, device).to(device)

# Initialize weights (common practice for RNNs)
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
model.apply(init_weights)

# Optimizer
optimizer = optim.Adam(model.parameters())

# Loss function: CrossEntropyLoss, ignoring the padding index
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
```

### Task 4: The Training and Evaluation Loops (20 Points)

Implement the training and evaluation functions.

**Instructions:**

1. In `train`, implement the forward pass, loss calculation, backpropagation (BPTT), gradient clipping, and optimizer step.
2. In `evaluate`, implement the forward pass (with `teacher_forcing_ratio=0`).
3. **Crucial:** Reshape the output and tgt tensors correctly for the loss function. CrossEntropyLoss expects predictions of shape `(N, C)` and targets of shape `(N)`, where N is the total number of tokens.

```python
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        # Unpack the batch (including lengths from the collate_fn)
        src, src_lengths, tgt = batch
        
        optimizer.zero_grad()

        # TODO: 1. Forward pass (use default teacher forcing ratio)
        # Remember to pass src_lengths to the model
        output = None # <<< YOUR CODE HERE

        # output shape: (batch_size, tgt_len, output_dim)
        # tgt shape: (batch_size, tgt_len)

        # TODO: 2. Reshape for loss calculation.
        # Flatten the outputs and targets.
        output_dim = output.shape[-1]
        # Reshape output to (batch_size * tgt_len, output_dim)
        output = output.reshape(-1, output_dim)
        # Reshape tgt to (batch_size * tgt_len)
        tgt = tgt.reshape(-1)

        # TODO: 3. Calculate the loss
        loss = None # <<< YOUR CODE HERE

        # TODO: 4. Backward pass (BPTT)
        # <<< YOUR CODE HERE

        # 5. Clip gradients to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        # TODO: 6. Update parameters
        # <<< YOUR CODE HERE

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src, src_lengths, tgt = batch

            # TODO: 1. Forward pass (Set teacher_forcing_ratio=0 for evaluation)
            output = None # <<< YOUR CODE HERE

            # TODO: 2. Reshape for loss calculation (same as in train)
            output_dim = output.shape[-1]
            output = output.reshape(-1, output_dim)
            tgt = tgt.reshape(-1)

            # TODO: 3. Calculate the loss
            loss = None # <<< YOUR CODE HERE

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)
```

### 5.3 Running the Training (Provided)

```python
N_EPOCHS = 30  # Note: 15 epochs may not be sufficient for good translations!
CLIP = 1

best_valid_loss = float('inf')

print("Starting training...")

# NOTE: Uncomment the loop content after completing the tasks above.
for epoch in range(N_EPOCHS):
    start_time = time.time()

    # train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, val_loader, criterion)
    train_loss = 0 # Placeholder
    valid_loss = 0 # Placeholder

    end_time = time.time()

    # if valid_loss < best_valid_loss:
    #     best_valid_loss = valid_loss
    #     torch.save(model.state_dict(), 'seq2seq-gru-model.pt')

    print(f'Epoch: {epoch+1:02} | Time: {int(end_time - start_time)}s')
    # PPL (Perplexity) is exp(loss), a common metric for language models.
    # print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    # print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
```

**Important Note on Training Duration:**

Vanilla Seq2Seq models typically require 30-50+ epochs to produce reasonable translations. If you train for only 15 epochs, you will likely see:
- Decreasing loss (showing the model is learning)
- But poor actual translations (the model hasn't converged yet)

This is normal! The model needs more time to learn the complex mapping between languages.

In [14]:
##  Part 5  ##
# Hyperparameters
INPUT_DIM = input_lang.n_words
OUTPUT_DIM = output_lang.n_words
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 4 # Using 2 layers
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# Initialize models (Ensure Tasks 1-3 are completed first!)
enc = EncoderGRU(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT).to(device)
dec = DecoderGRU(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT).to(device)
model = Seq2Seq(enc, dec, device).to(device)

# Initialize weights (common practice for RNNs)
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
model.apply(init_weights)

# Optimizer
optimizer = optim.Adam(model.parameters())

# Loss function: CrossEntropyLoss, ignoring the padding index
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')


The model has 16,465,898 trainable parameters


In [15]:
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0

    for i, batch in enumerate(iterator):
        # Unpack the batch (including lengths from the collate_fn)
        src, src_lengths, tgt = batch
        
        optimizer.zero_grad()

        # TODO: 1. Forward pass (use default teacher forcing ratio)
        # Remember to pass src_lengths to the model
        output = model(src, src_lengths, tgt) # <<< YOUR CODE HERE

        # output shape: (batch_size, tgt_len, output_dim)
        # tgt shape: (batch_size, tgt_len)

        # TODO: 2. Reshape for loss calculation.
        # Flatten the outputs and targets.
        output_dim = output.shape[-1]
        # Reshape output to (batch_size * tgt_len, output_dim)
        output = output.reshape(-1, output_dim)
        # Reshape tgt to (batch_size * tgt_len)
        tgt = tgt.reshape(-1)

        # TODO: 3. Calculate the loss
        loss = criterion(output, tgt) # <<< YOUR CODE HERE

        # TODO: 4. Backward pass (BPTT)
        loss.backward()# <<< YOUR CODE HERE

        # 5. Clip gradients to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        # TODO: 6. Update parameters
        optimizer.step()# <<< YOUR CODE HERE

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src, src_lengths, tgt = batch

            # TODO: 1. Forward pass (Set teacher_forcing_ratio=0 for evaluation)
            output = model(src, src_lengths, tgt, teacher_forcing_ratio=0)  # <<< YOUR CODE HERE

            # TODO: 2. Reshape for loss calculation (same as in train)
            output_dim = output.shape[-1]
            output = output.reshape(-1, output_dim)
            tgt = tgt.reshape(-1)

            # TODO: 3. Calculate the loss
            loss = criterion(output, tgt) # <<< YOUR CODE HERE

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

In [16]:

N_EPOCHS = 30  # Note: 15 epochs may not be sufficient for good translations!
CLIP = 1

best_valid_loss = float('inf')

print("Starting training...")

# NOTE: Uncomment the loop content after completing the tasks above.
for epoch in range(N_EPOCHS):
    start_time = time.time()

    train_loss = train(model, train_loader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, val_loader, criterion)
    #train_loss = 0 # Placeholder
    #valid_loss = 0 # Placeholder

    end_time = time.time()

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'seq2seq-gru-model.pt')

    print(f'Epoch: {epoch+1:02} | Time: {int(end_time - start_time)}s')
    #PPL (Perplexity) is exp(loss), a common metric for language models.
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Starting training...
Epoch: 01 | Time: 6s
	Train Loss: 5.026 | Train PPL: 152.298
	 Val. Loss: 4.729 |  Val. PPL: 113.202
Epoch: 02 | Time: 6s
	Train Loss: 4.456 | Train PPL:  86.139
	 Val. Loss: 4.461 |  Val. PPL:  86.617
Epoch: 03 | Time: 6s
	Train Loss: 4.082 | Train PPL:  59.246
	 Val. Loss: 4.171 |  Val. PPL:  64.806
Epoch: 04 | Time: 6s
	Train Loss: 3.748 | Train PPL:  42.457
	 Val. Loss: 3.927 |  Val. PPL:  50.731
Epoch: 05 | Time: 6s
	Train Loss: 3.450 | Train PPL:  31.514
	 Val. Loss: 3.730 |  Val. PPL:  41.690
Epoch: 06 | Time: 6s
	Train Loss: 3.226 | Train PPL:  25.184
	 Val. Loss: 3.596 |  Val. PPL:  36.463
Epoch: 07 | Time: 6s
	Train Loss: 3.033 | Train PPL:  20.762
	 Val. Loss: 3.503 |  Val. PPL:  33.226
Epoch: 08 | Time: 6s
	Train Loss: 2.873 | Train PPL:  17.682
	 Val. Loss: 3.390 |  Val. PPL:  29.655
Epoch: 09 | Time: 6s
	Train Loss: 2.727 | Train PPL:  15.280
	 Val. Loss: 3.350 |  Val. PPL:  28.508
Epoch: 10 | Time: 6s
	Train Loss: 2.594 | Train PPL:  13.383
	 Val. Lo

## Part 6: Inference and Analysis (10 Points)

### 6.1 Inference (Greedy Decoding)

During inference, we don't have the target sentence, so teacher forcing is impossible. We use the model's own predictions at each step. The simplest method is **Greedy Decoding**: always choose the word with the highest probability.

```python
def translate_sentence(sentence, src_lang, tgt_lang, model, device, max_len=50):
    model.eval()

    # 1. Preprocess the input sentence (normalize and reverse!)
    normalized_sentence = normalizeString(sentence)
    reversed_sentence = ' '.join(normalized_sentence.split(' ')[::-1])

    # 2. Convert to indices and tensor
    indices = [src_lang.word2index.get(word, UNK_IDX) for word in reversed_sentence.split(' ')] + [EOS_IDX]
    src_tensor = torch.tensor(indices, dtype=torch.long).unsqueeze(0).to(device) # (1, T)
    src_len = torch.tensor([len(indices)])

    # 3. Encode the sentence
    with torch.no_grad():
        hidden = model.encoder(src_tensor, src_len)

    # 4. Start decoding
    trg_indices = [SOS_IDX]
    input_tensor = torch.tensor([SOS_IDX], dtype=torch.long).to(device) # (1)

    for i in range(max_len):
        with torch.no_grad():
            output, hidden = model.decoder(input_tensor, hidden)

        # 5. Greedy Decoding
        pred_token = output.argmax(1).item()
        trg_indices.append(pred_token)

        # Check for <EOS>
        if pred_token == EOS_IDX:
            break

        # Prepare the next input
        input_tensor = torch.tensor([pred_token], dtype=torch.long).to(device)

    # 6. Convert indices back to words
    trg_tokens = [tgt_lang.index2word[i] for i in trg_indices]
    return trg_tokens[1:-1] # Exclude <SOS> and <EOS>

# Qualitative Analysis (Uncomment after training)
# model.load_state_dict(torch.load('seq2seq-gru-model.pt'))
# examples = ["i am cold", "she is happy", "he is running", "we are ready"]
# for example in examples:
#     translation = translate_sentence(example, input_lang, output_lang, model, device)
#     print(f"EN: {example}")
#     print(f"FR: {' '.join(translation)}\n")
```

### 6.2 Understanding Your Results

**Expected Performance:**

After training, you may notice that your translations are not perfect - and that's completely normal! Here's what you should expect:

**What Good Results Look Like:**
- Training loss decreasing from ~5.0 to ~1.0-1.5
- Validation loss around 2.5-3.5
- Some simple phrases translating correctly (e.g., "how are you" → "comment vas tu")
- Shorter sentences working better than longer ones

**Why Translations May Be Poor:**

1. **The Information Bottleneck**: This is the fundamental limitation we've been discussing. The entire English sentence must be compressed into a single fixed-size vector (512 numbers). For complex sentences, critical information gets lost.

2. **Insufficient Training**: 30 epochs on this small dataset is barely enough. Production NMT systems train for much longer on millions of examples.

3. **Overfitting**: If your validation loss is significantly higher than training loss (e.g., 2.8 vs 1.3), the model is memorizing training patterns rather than learning to translate.

4. **Common Phrase Bias**: The model often outputs frequent French phrases (like "je suis...") regardless of the actual input, because these patterns were common in training data.

5. **Greedy Decoding**: We always pick the highest probability word. Beam search (which considers multiple possibilities) would improve results.

**What Your Model Is Actually Learning:**

Look at a translation like:
```
"i am cold" → "je suis serieux"
```

The model correctly learned:
- "i am" → "je suis" ✓
- But outputs a common word "serieux" instead of "froid"

This shows the model IS learning French grammar and common patterns, just not the specific vocabulary mapping yet.

**This Is Why Attention Was Invented!**

The poor performance of vanilla Seq2Seq on longer sentences directly motivated the invention of attention mechanisms (covered in the next module). Attention allows the decoder to "look back" at different parts of the input instead of relying on a single compressed vector.

---

### 6.3 Bonus: Diagnostic Function (Optional)

To better understand what your model has learned, implement this diagnostic function that checks if the model can at least memorize some training examples:

```python
def diagnose_model(model, src_lang, tgt_lang, pairs, device, num_examples=5):
    """
    Check if model can translate training examples (memorization test)
    """
    print("\n" + "=" * 70)
    print("MODEL DIAGNOSIS - Testing on Training Examples")
    print("=" * 70)
    
    for i in range(num_examples):
        en_sentence = pairs[i][0]
        fr_actual = pairs[i][1]
        fr_predicted = translate_sentence(en_sentence, src_lang, tgt_lang, model, device)
        
        print(f"\nExample {i+1}:")
        print(f"  EN (input):     {en_sentence}")
        print(f"  FR (expected):  {fr_actual}")
        print(f"  FR (predicted): {' '.join(fr_predicted)}")
        
        # Calculate word overlap
        expected_words = set(fr_actual.split())
        predicted_words = set(fr_predicted)
        overlap = expected_words.intersection(predicted_words)
        if len(expected_words) > 0:
            accuracy = len(overlap) / len(expected_words) * 100
            print(f"  Word overlap:   {len(overlap)}/{len(expected_words)} ({accuracy:.1f}%)")

# Run after loading best model
diagnose_model(model, input_lang, output_lang, pairs, device)
```

If the model can't even memorize training examples with >50% word overlap, it needs more training epochs or there may be a bug.

---

### 6.4 Conceptual Questions

Answer the following questions in a separate text cell or document:

1. **The Information Bottleneck**: The core limitation of this architecture is that the encoder must compress the entire input sentence into a single fixed-size context vector (hidden). Why is this a significant problem when translating very long or complex sentences?

2. **Input Reversal**: Explain again, in your own words, why reversing the input (the "Reversal Trick") helped the model learn more effectively. Relate your answer to the concept of gradient flow in BPTT.

3. **TBPTT Application**: While we used standard BPTT here, describe a different NLP task where Truncated BPTT (TBPTT) would be essential, and explain why standard BPTT would be unsuitable in that scenario.

4. **Packing**: Why is it important to use `pack_padded_sequence` in the encoder when dealing with batched inputs? What might happen if we didn't use it?

In [18]:
def translate_sentence(sentence, src_lang, tgt_lang, model, device, max_len=50):
    model.eval()

    # 1. Preprocess the input sentence (normalize and reverse!)
    normalized_sentence = normalizeString(sentence)
    reversed_sentence = ' '.join(normalized_sentence.split(' ')[::-1])

    # 2. Convert to indices and tensor
    indices = [src_lang.word2index.get(word, UNK_IDX) for word in reversed_sentence.split(' ')] + [EOS_IDX]
    src_tensor = torch.tensor(indices, dtype=torch.long).unsqueeze(0).to(device) # (1, T)
    src_len = torch.tensor([len(indices)])

    # 3. Encode the sentence
    with torch.no_grad():
        hidden = model.encoder(src_tensor, src_len)

    # 4. Start decoding
    trg_indices = [SOS_IDX]
    input_tensor = torch.tensor([SOS_IDX], dtype=torch.long).to(device) # (1)

    for i in range(max_len):
        with torch.no_grad():
            output, hidden = model.decoder(input_tensor, hidden)

        # 5. Greedy Decoding
        pred_token = output.argmax(1).item()
        trg_indices.append(pred_token)

        # Check for <EOS>
        if pred_token == EOS_IDX:
            break

        # Prepare the next input
        input_tensor = torch.tensor([pred_token], dtype=torch.long).to(device)

    # 6. Convert indices back to words
    trg_tokens = [tgt_lang.index2word[i] for i in trg_indices]
    return trg_tokens[1:-1] # Exclude <SOS> and <EOS>

# Qualitative Analysis (Uncomment after training)
#model.load_state_dict(torch.load('seq2seq-gru-model.pt'))
#examples = ["i am cold", "she is happy", "he is running", "we are ready"]
#for example in examples:
#    translation = translate_sentence(example, input_lang, output_lang, model, device)
#    print(f"EN: {example}")
#    print(f"FR: {' '.join(translation)}\n")

In [19]:
# Qualitative Analysis (Uncomment after training)
model.load_state_dict(torch.load('seq2seq-gru-model.pt'))
examples = ["i am cold", "she is happy", "he is running", "we are ready"]
for example in examples:
    translation = translate_sentence(example, input_lang, output_lang, model, device)
    print(f"EN: {example}")
    print(f"FR: {' '.join(translation)}\n")

EN: i am cold
FR: je suis ans .

EN: she is happy
FR: elle est rougi !

EN: he is running
FR: qui vous a ?

EN: we are ready
FR: nous nous tristes .



In [20]:
def diagnose_model(model, src_lang, tgt_lang, pairs, device, num_examples=5):
    """
    Check if model can translate training examples (memorization test)
    """
    print("\n" + "=" * 70)
    print("MODEL DIAGNOSIS - Testing on Training Examples")
    print("=" * 70)
    
    for i in range(num_examples):
        en_sentence = pairs[i][0]
        fr_actual = pairs[i][1]
        fr_predicted = translate_sentence(en_sentence, src_lang, tgt_lang, model, device)
        
        print(f"\nExample {i+1}:")
        print(f"  EN (input):     {en_sentence}")
        print(f"  FR (expected):  {fr_actual}")
        print(f"  FR (predicted): {' '.join(fr_predicted)}")
        
        # Calculate word overlap
        expected_words = set(fr_actual.split())
        predicted_words = set(fr_predicted)
        overlap = expected_words.intersection(predicted_words)
        if len(expected_words) > 0:
            accuracy = len(overlap) / len(expected_words) * 100
            print(f"  Word overlap:   {len(overlap)}/{len(expected_words)} ({accuracy:.1f}%)")

# Run after loading best model
diagnose_model(model, input_lang, output_lang, pairs, device)


"""
Run 1:
Best model (epoch 6 with val loss 2.732) was saved
Later epochs with worse validation performance were ignored
When loading 'seq2seq-gru-model.pt', loading the epoch 6 model

Run 2:
# Still using 30 epochs but 4 layers. 

"""


MODEL DIAGNOSIS - Testing on Training Examples

Example 1:
  EN (input):     go .
  FR (expected):  va !
  FR (predicted): va !
  Word overlap:   2/2 (100.0%)

Example 2:
  EN (input):     run !
  FR (expected):  cours !
  FR (predicted): courez !
  Word overlap:   1/2 (50.0%)

Example 3:
  EN (input):     run !
  FR (expected):  courez !
  FR (predicted): courez !
  Word overlap:   2/2 (100.0%)

Example 4:
  EN (input):     wow !
  FR (expected):  ca alors !
  FR (predicted): ca alors !
  Word overlap:   3/3 (100.0%)

Example 5:
  EN (input):     fire !
  FR (expected):  au feu !
  FR (predicted): a feu !
  Word overlap:   2/3 (66.7%)


"\nBest model (epoch 6 with val loss 2.732) was saved\nLater epochs with worse validation performance were ignored\nWhen loading 'seq2seq-gru-model.pt', loading the epoch 6 model\n\n# Still using 30 epochs but 4 layers. \n\n"