# 2025-1 Artificial Intelligence (01)
## Homework #3: EN-FR Machine Translation Using LSTM, Attention, and Transformer
---
Copyright (c) Prof. Jaehyeong Sim

Department of Computer Science and Engineering

College of Artificial Intelligence

Ewha Womans University

## Guideline
### Introduction
*   In this homework, we will implement an **EN-FR machine translator** in PyTorch using three models: an **LSTM**, an **LSTM with Attention**, and a **Transformer**.
*   We didn't cover the **NLP pipeline** in class, so the code might look complicated. I tried to explain the code as clearly as possible; if you understand all of it, you will understand the basics of an NLP pipeline and how the models work. So I **highly recommend that you read the code and the explanations carefully and make sure you understand them**.
*   Training each model takes a long time (LSTM: 70 min, LSTM w/ attn: 120 min, Transformer: 50 min), so I suggest you start this homework early.

### Your job
1. Please complete the code. You only have to write the parts marked as **# TODO**.
2. Please answer the discussion topics at the bottom of this notebook in a **separate PDF file**.

### Submission guide
1. Please rename the completed skeleton file to ***STUDENT_ID*.ipynb**, replacing *STUDENT_ID* with your own student ID. For example, if your student ID is 2512345, the file name should be **2512345.ipynb**. Name your PDF file ***STUDENT_ID*.pdf** in the same way.
2. Make sure that your notebook contains the **output of each cell** including the translation results with your own sentence.
3. Submit both files to Ewha CyberCampus.


⚠ If you do not follow the submission guide above, you will get a **5-point deduction** from this homework score.

### Deadline
*   **June 4, 23:59**

### 1. Necessary libraries

In [None]:
%pip install -Uq "datasets>=2.19" "fsspec>=2023.6.0" sentencepiece sacrebleu
%pip install torchinfo torch numpy

In [2]:
import math, random, pathlib, os, time, sentencepiece as spm
from datasets import load_dataset
from torch.utils.data import Dataset
import json
import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import DataLoader
from torchinfo import summary

### 2. Global variables

In [3]:
# Preprocessing
vocab_size = 4000
subset_size = 50000
max_len = 60

In [4]:
# LSTM
lstm_epochs = 15
lstm_layers = 3
lstm_hidden = 1024
lstm_batch_size = 128
lstm_log_interval = 50
lstm_dropout = 0.4

In [5]:
# Transformer
trans_base_lr = 5e-4
d_model      = 512
nhead        = 8
nlayers      = 4
ffn_dim      = 2048
trans_epochs   = 15
trans_log_interval = 50

### 3. Preprocessing
Overall flow:


```
raw text (train.en/fr)
↓
learn vocab (SentencePieceTrainer)
↓
text to IDs (encode())
↓
model (LSTM, Transformer)
```
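
To make the "learn vocab" step in the flow above a bit more concrete, here is a minimal sketch of a single BPE merge on a toy corpus. It is a simplification for intuition only: the word list and counts are made up, the helper names `most_frequent_pair` and `merge_pair` are hypothetical, and SentencePiece's real trainer works on raw text with its own whitespace handling.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a {symbol-tuple: frequency} corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, max() keeps the pair that was counted first.
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word so each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: each word split into characters, with a count.
toy_words = {
    tuple("low"): 5,
    tuple("lower"): 2,
    tuple("newest"): 6,
    tuple("widest"): 3,
}

best = most_frequent_pair(toy_words)
toy_words_after = merge_pair(toy_words, best)
print("merged pair:", best)
for symbols, freq in toy_words_after.items():
    print(freq, symbols)
```

Repeating this merge step until the symbol inventory reaches the requested size is the basic idea behind a BPE vocabulary; frequent fragments become single tokens while rare words stay split into smaller pieces.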

1.   Corpus acquisition
  *   Goal: obtain a parallel English-French corpus.
  *   load_dataset(): Hugging Face Datasets downloads the IWSLT 2017 TED-talk translations and returns a dataset in which each example is a dictionary holding the English and French versions of one sentence.
  *   Having the two languages side by side is what later lets us train a sequence-to-sequence model.

2.   Sampling a subset
  *   The original machine-translation corpus is big. Shuffling with a fixed seed (42) and picking the first subset_size examples makes experiments reproducible and keeps training time reasonable for a Colab GPU (T4).

3. Writing raw text files (train.en, train.fr)
  *   SentencePiece’s trainer expects one sentence per line of plain text. Saving the sample accomplishes two things:
    * Gives SentencePiece its required input format.
    * Lets you open the files in a text editor to see the real sentences the model will see.

4. Learning a sub-word vocabulary with SentencePiece (BPE)
  * What is vocabulary learning?
    * Deep learning models consume numbers, not strings. “Vocabulary learning”  decides which text fragments become tokens and assigns each fragment a unique integer ID.
    * Classic NLP used a fixed word list. Rare or misspelled words were pushed into a single \<unk> bucket → information loss.
      * \<unk>: special token for any character sequence not in the learned vocab.
    * Today we prefer sub-word units (e.g. Byte-Pair Encoding, WordPiece, Unigram). They split unseen words like:
      
      *internationalization* (not in vocab) → *international* ##*ization* (both are in vocab)

      so the model still sees meaningful pieces and you keep vocabulary size manageable.
  * Why sub-words instead of words?
    * Open-vocabulary: the tokenizer can spell out words it has never seen from smaller known pieces.
    * Keeps vocab_size small so embedding matrices fit in memory.
  * The trainer is configured with explicit IDs/pieces for \<pad>, \<unk>, \<s> (BOS), \</s> (EOS) because your downstream model will need to know exactly which integers correspond to padding, beginning-of-sentence, etc. Changing them later would silently corrupt training.

5. Runtime tokenizer setup
  * A SentencePieceProcessor loads the freshly trained bpe.model and exposes:
    * encode(str) -> List[int]
    * special-token IDs (pad_id(), bos_id(), …)
    * total vocabulary size (get_piece_size()).
  * The utility encode() function truncates long sentences to max_len-1 tokens and appends an explicit EOS_ID. (RNNs/Transformers work best when they know where to stop decoding.)

6. TranslationDataset
  * A PyTorch Dataset that lazily keeps the raw strings. We postpone tokenization to the collate step so each mini-batch can be truncated/padded to its own maximum length—this is more memory-efficient than padding everything to a corpus-wide max.

7. Mini-batch collation
  * collate():
    * Tokenizes every (src, tgt) pair with encode().
    * Finds the longest sequence length inside that batch.
    * Right-pads shorter sequences with \<pad> so torch.tensor() can stack them into a rectangular batch_size × seq_len tensor.
  * The result is two LongTensors ready for nn.Embedding → encoder/decoder → loss calculation.

8. DataLoaders
  * Train loader pulls 50 000 sentence pairs, shuffles them each epoch, and applies our custom padding.
  * Validation loader requests a fixed slice of up to 1 000 sentences (the validation split has only 890 examples, as the download log shows) and iterates in deterministic order.
  * Both loaders now stream padded token-ID batches that, once moved to the device, can be fed directly into an LSTM or Transformer model.
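
The truncate, append-EOS, and right-pad logic from steps 5 and 7 can be sketched without any libraries. The token IDs below are made up and stand in for real tokenizer output, `MAX_LEN` is shrunk to 6 so the truncation is visible, and `encode_ids` / `pad_batch` are hypothetical names used only for this illustration.

```python
PAD_ID, EOS_ID = 0, 3   # same IDs the SentencePiece trainer is configured with
MAX_LEN = 6             # hypothetical small limit (the notebook uses max_len = 60)

def encode_ids(ids, max_len=MAX_LEN):
    """Truncate to max_len-1 tokens, then append EOS (mirrors encode())."""
    return ids[:max_len - 1] + [EOS_ID]

def pad_batch(seqs, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length inside this batch."""
    longest = max(len(s) for s in seqs)
    return [s + [pad_id] * (longest - len(s)) for s in seqs]

# Made-up token IDs standing in for tokenizer output.
raw = [[11, 12], [21, 22, 23, 24, 25, 26, 27], [31]]

encoded = [encode_ids(r) for r in raw]
batch = pad_batch(encoded)
for row in batch:
    print(row)
```

Because padding is computed per batch, a batch of short sentences produces a narrow tensor instead of one padded out to the corpus-wide maximum.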


In [6]:
# ---------------------------------------------------------------------------
# 1.  Corpus acquisition
# ---------------------------------------------------------------------------
DATA_DIR = pathlib.Path('data') # Directory where all assets will live
DATA_DIR.mkdir(exist_ok=True) # Safely create it the first time we run

print("Downloading IWSLT 2017 EN-FR dataset …")
# `load_dataset` fetches a sentence-aligned parallel corpus of
# English ("en") and French ("fr") sentences.  The corpus ships with
# predefined splits (train / validation / test).
ds = load_dataset('IWSLT/iwslt2017', # dataset identifier (repo_name/config)
                  'iwslt2017-en-fr', # configuration: language pair
                  split='train', # which split to load
                  cache_dir=DATA_DIR, # store raw data under ./data
                  )

# ---------------------------------------------------------------------------
# 2.  Sampling a manageable subset from the dataset to make training simpler
# ---------------------------------------------------------------------------
# Shuffling with a fixed seed ensures reproducibility: we always pick the
# same sentences each run, making debugging easier.
sampled = ds.shuffle(seed=42).select(range(subset_size))

# Save raw text copies because SentencePiece expects plain‑text files for
# training. These files are also handy for quick inspection with a text
# editor.
src_path = DATA_DIR/'train.en' # English sentences
tgt_path = DATA_DIR/'train.fr' # French  sentences
src_sentences = [ex['translation']['en'] for ex in sampled]
tgt_sentences = [ex['translation']['fr'] for ex in sampled]
src_path.write_text('\n'.join(src_sentences), encoding='utf-8')
tgt_path.write_text('\n'.join(tgt_sentences), encoding='utf-8')

# ---------------------------------------------------------------------------
# 3.  Sub‑word vocabulary learning with SentencePiece (BPE)
# ---------------------------------------------------------------------------
# Why sub‑word? It handles open vocabulary problems (e.g. new place names)
# better than word‑level tokenizers while keeping sequence length reasonable.
print("Training SentencePiece …")
spm.SentencePieceTrainer.Train(
    input=','.join([str(src_path), str(tgt_path)]), # both languages
    model_prefix=str(DATA_DIR/'bpe'), # outputs bpe.model / bpe.vocab
    vocab_size=vocab_size,
    # Special tokens ─ IDs must match downstream model expectations.
    pad_id=0,    pad_piece='<pad>',
    unk_id=1,    unk_piece='<unk>',
    bos_id=2,    bos_piece='<s>',
    eos_id=3,    eos_piece='</s>',
    character_coverage=0.9995, # keep almost every UTF‑8 char seen
    model_type='bpe' # byte‑pair encoding variant
)
print("Done!")

# ---------------------------------------------------------------------------
# 4.  Runtime helpers
# ---------------------------------------------------------------------------
# Choose CPU vs GPU automatically.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the learned sub‑word model to tokenize on‑the‑fly.
sp = spm.SentencePieceProcessor(model_file='data/bpe.model')
PAD_ID = sp.pad_id()
BOS_ID = sp.bos_id()
EOS_ID = sp.eos_id()
VOCAB = sp.get_piece_size()

# ---------------------------------------------------------------------------
# Encoding utility
# ---------------------------------------------------------------------------
def encode(sentence):
    """Convert raw text to a list of integer token IDs.

    • Truncate to `max_len‑1` to leave room for the explicit EOS.
    • Append EOS so the decoder knows where to stop.
    """
    ids = sp.encode(sentence, out_type=int)[:max_len-1]
    return ids + [EOS_ID]

# ---------------------------------------------------------------------------
# 5.  PyTorch Dataset wrapper
# ---------------------------------------------------------------------------
class TranslationDataset(torch.utils.data.Dataset):
    """Lazy wrapper that gives (src_sentence, tgt_sentence) tuples."""
    def __init__(self, split):
        ds = load_dataset('IWSLT/iwslt2017', 'iwslt2017-en-fr', split=split)
        self.src = [ex["translation"]["en"] for ex in ds]
        self.tgt = [ex["translation"]["fr"] for ex in ds]

    def __len__(self):
        return len(self.src)

    def __getitem__(self, idx):
        return self.src[idx], self.tgt[idx]

# ---------------------------------------------------------------------------
# 6.  Batch collation: padding & tensor conversion
# ---------------------------------------------------------------------------
def collate(batch):
    """Custom collation to handle variable‑length sentences.

    Steps
    -----
    1. Tokenize each sentence pair.
    2. Compute max length inside the mini‑batch.
    3. Right‑pad with <pad> so tensors become rectangular (B × T).
    4. Return int64 tensors ready for `nn.Embedding` / `Transformer`.
    """
    src_batch, tgt_batch = zip(*batch)
    src_ids = [encode(s) for s in src_batch]
    tgt_ids = [encode(t) for t in tgt_batch]
    src_max = max(len(x) for x in src_ids)
    tgt_max = max(len(y) for y in tgt_ids)
    src_pad = [x + [PAD_ID]*(src_max-len(x)) for x in src_ids]
    tgt_pad = [y + [PAD_ID]*(tgt_max-len(y)) for y in tgt_ids]

    return torch.tensor(src_pad), torch.tensor(tgt_pad)

# ---------------------------------------------------------------------------
# 7.  DataLoaders
# ---------------------------------------------------------------------------
train_loader = DataLoader(TranslationDataset('train[:50000]'), # subset for simpler experiments
                          batch_size=lstm_batch_size,
                          shuffle=True,
                          collate_fn=collate # our custom padding logic
                          )

val_dataset = TranslationDataset(split="validation[:1000]")
val_loader = DataLoader(val_dataset,
                        batch_size=lstm_batch_size,
                        shuffle=False, # deterministic validation order
                        collate_fn=collate)

# `train_loader` and `val_loader` now stream padded token‑ID tensors
# that can be fed straight into an LSTM and Transformer encoder‑decoder.

Downloading IWSLT 2017 EN-FR dataset …


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/18.5k [00:00<?, ?B/s]

iwslt2017.py:   0%|          | 0.00/8.17k [00:00<?, ?B/s]

The repository for IWSLT/iwslt2017 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/IWSLT/iwslt2017.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


en-fr.zip:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/232825 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/8597 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/890 [00:00<?, ? examples/s]

Training SentencePiece …
Done!



### 4-1. LSTM baseline: Class definition

Here, you need to use two modules:

*   nn.Embedding
  *   Turns a batch of integer token IDs into a batch of dense vectors (embedded vectors).
  *   Input: ints in [0, vocab_size-1]
  *   Output: lookup of a trainable table with shape [vocab, hidden]
*   nn.LSTM
  *   Learns to compress a sequence of those vectors into hidden states that capture context.
  *   Input shape must be [B, T, H] if batch_first=True.
  *   Returns every hidden state plus the last hidden & cell states separately.

Please refer to the official documentation of PyTorch for detailed usage of each module.
You should be careful about tensor shapes. I encourage you to print out tensor shapes the first time you run a batch:


```
print(x.shape, emb.shape, outs.shape, h.shape)   # sanity check
```
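
As a standalone version of that sanity check, the sketch below pushes random token IDs through an embedding and a stacked LSTM and prints every shape. The sizes `B`, `T`, `V`, `H`, `L` are small hypothetical values chosen so it runs instantly on CPU; they are not the homework's hyperparameters.

```python
import torch
import torch.nn as nn

B, T, V, H, L = 4, 7, 100, 16, 3   # hypothetical small sizes for a quick check

embedding = nn.Embedding(V, H)
lstm = nn.LSTM(input_size=H, hidden_size=H, num_layers=L, batch_first=True)

x = torch.randint(0, V, (B, T))    # fake token IDs, shape [B, T]
emb = embedding(x)                 # [B, T, H]
outs, (h, c) = lstm(emb)           # outs: [B, T, H]; h and c: [L, B, H]

print(x.shape, emb.shape, outs.shape, h.shape, c.shape)
```

Note that `batch_first=True` only affects the input and `outs`; the final hidden and cell states keep the layer dimension first.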

In [7]:
class Encoder(nn.Module):
    def __init__(self, vocab, hidden):
        super().__init__()
        # ---- TODO students implement below ---- #
        # Token ID → embedding vector conversion
        self.embedding = nn.Embedding(vocab, hidden)
        # LSTM layer with batch_first=True
        self.lstm = nn.LSTM(input_size=hidden,
                           hidden_size=hidden,
                           num_layers=lstm_layers,
                           dropout=lstm_dropout if lstm_layers > 1 else 0,
                           batch_first=True)
        self.dropout = nn.Dropout(lstm_dropout)

    def forward(self, x):
        emb = self.embedding(x)  # [B, T] -> [B, T, H]
        emb = self.dropout(emb)
        outputs, (h_n, c_n) = self.lstm(emb)
        return outputs, (h_n, c_n)


class Decoder(nn.Module):
    def __init__(self, vocab, hidden):
        super().__init__()
        # ---- TODO students implement below ---- #
        self.embedding = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(input_size=hidden,
                           hidden_size=hidden,
                           num_layers=lstm_layers,
                           dropout=lstm_dropout if lstm_layers > 1 else 0,
                           batch_first=True)
        self.dropout = nn.Dropout(lstm_dropout)
        # Output layer: maps hidden state to vocab-sized logits
        self.fc = nn.Linear(hidden, vocab)

    def forward(self, y, hidden):
        emb = self.embedding(y)  # [B, T] -> [B, T, H]
        emb = self.dropout(emb)
        # Use encoder's hidden state as initial state
        outputs, (h_n, c_n) = self.lstm(emb, hidden)
        outputs = self.dropout(outputs)
        logits = self.fc(outputs)  # [B, T, H] -> [B, T, V]
        return logits, (h_n, c_n)

### 4-2. LSTM baseline: Training loop

**label_smoothed_nll_loss**

In vanilla cross-entropy training the model is rewarded only when it places all probability mass on the single gold token.

Side-effect: the network often becomes over-confident (p ≈ 1), which hurts generalization.

Label smoothing fixes that by distributing a small portion ε of the probability mass over all classes.
For a vocabulary of K tokens:

```
gold token prob      = 1 − ε + ε / K
every other token    = ε / K
```
We therefore minimize:

```
L = (1 − ε)⋅NLL + ε⋅UniformLoss
```
where
* NLL = − log p(gold)
* UniformLoss = − mean_j log p(j)
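
A small numeric check can make the formula less abstract. The sketch below, in plain Python with made-up logits, computes the interpolated loss for one token and compares it with ordinary cross-entropy against a soft target that puts ε/K on every class and the remaining 1 − ε on the gold class. The function names are hypothetical and exist only for this illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def smoothed_loss(logits, gold, eps=0.1):
    """(1 - eps) * NLL + eps * UniformLoss for a single token."""
    lp = log_softmax(logits)
    nll = -lp[gold]
    uniform = -sum(lp) / len(lp)
    return (1 - eps) * nll + eps * uniform

def cross_entropy_with_soft_target(logits, gold, eps=0.1):
    """Cross-entropy against the smoothed target distribution q."""
    lp = log_softmax(logits)
    k = len(lp)
    q = [eps / k] * k          # eps spread uniformly over all K classes
    q[gold] += 1 - eps         # remaining mass goes to the gold class
    return -sum(qj * lpj for qj, lpj in zip(q, lp))

logits = [2.0, 0.5, -1.0, 0.0]   # made-up scores for a 4-token vocabulary
a = smoothed_loss(logits, gold=0)
b = cross_entropy_with_soft_target(logits, gold=0)
print(a, b)
```

The two values agree up to floating-point error, which is why the interpolated form is a convenient way to implement label smoothing.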

In [8]:
def label_smoothed_nll_loss(lprobs, # raw logits  (B, T, V)  OR (any, V)
                            target, # gold token IDs (B, T)
                            epsilon=0.1, # smoothing factor ε
                            ignore_index=PAD_ID # which ID means "padding"?
                            ):
    """
    Cross-entropy with label smoothing, padding aware.
    Returns: scalar mean loss over non-pad tokens.
    """
    n_class = lprobs.size(-1) # V = vocabulary size

    # 1) Convert logits → log-probabilities for numerical stability
    lprobs = F.log_softmax(lprobs, dim=-1)

    # 2) Flatten batch/time dims so every token is an independent row
    lprobs = lprobs.view(-1, n_class)  # (B·T, V)
    target = target.contiguous().view(-1) # (B·T,)

    # 3) Build mask for <pad> tokens and neutralize them
    pad_mask = target.eq(ignore_index) # True where token == PAD_ID
    target   = target.masked_fill(pad_mask, 0)  # dummy index 0 won’t be used

    # 4) Negative-log-likelihood of the gold token
    nll_loss     = F.nll_loss(lprobs, target,
                              reduction='none' # keep per-token loss (B·T,)
                              )

    # 5) “Uniform” loss term: −Σ_j log p(j) / V
    smooth_loss  = -lprobs.sum(dim=-1) / n_class

    # 6) Interpolate:    (1-ε)·NLL  +  ε·Uniform
    loss = (1 - epsilon) * nll_loss + epsilon * smooth_loss

    # 7) Remove padding from both numerator & denominator
    loss.masked_fill_(pad_mask, 0.0) # zero where pad
    return loss.sum() / (~pad_mask).sum() # average over real tokens

# Convenience alias so the usual training loop can read:
criterion = label_smoothed_nll_loss

In [9]:
# ------------------------------
# Model instantiation
# ------------------------------
# `VOCAB`, `lstm_hidden`, and `device` are defined earlier in the notebook.
lstm_enc = Encoder(VOCAB, lstm_hidden).to(device)
lstm_dec = Decoder(VOCAB, lstm_hidden).to(device)

In [None]:
# ================================================================
# Training loop – Encoder–Decoder LSTM with label smoothing
# ================================================================
# This block shows one full experiment script: model instantiation, optimizer,
# learning‑rate scheduler, epoch training, and validation evaluation.

# ------------------------------
# Optimizer
# ------------------------------
# Adam with the beta values (0.9, 0.98) and tiny eps for safety.
optim = torch.optim.Adam(list(lstm_enc.parameters())+list(lstm_dec.parameters()),
                          lr=1e-3, betas=(0.9, 0.98), eps=1e-9)

# ------------------------------
# LR scheduler: halve the LR every 5 epochs
# ------------------------------
scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=5, gamma=0.5)

# ------------------------------------------------------------
# train_epoch() – one full sweep over the training DataLoader
# ------------------------------------------------------------
def train_epoch():
    lstm_enc.train() # train mode = enable dropout (this model has no normalization layers)
    lstm_dec.train()
    total, n = 0, 0
    for step, (src, tgt) in enumerate(train_loader, 1):
        # Move mini‑batch to GPU/CPU device
        src, tgt = src.to(device), tgt.to(device)

        optim.zero_grad() # clear stale gradients

        # Encoder forward pass
        enc_out, hidden = lstm_enc(src) # hidden = (h_n, c_n)

        # Decoder forward – feed gold tokens shifted right
        logits, _ = lstm_dec(tgt[:, :-1], hidden)

        # Flatten (B, T, V) → (B·T, V) and compute label‑smoothed CE
        loss = criterion(logits.reshape(-1, VOCAB), tgt[:,1:].reshape(-1))

        # Back‑prop
        loss.backward()

        # Gradient clipping to keep training stable (max‑norm = 1.0)
        torch.nn.utils.clip_grad_norm_(lstm_enc.parameters(), 1.0)
        torch.nn.utils.clip_grad_norm_(lstm_dec.parameters(), 1.0)

        # Optimizer step (updates parameters)
        optim.step()

        # Accumulate loss for reporting
        total += loss.item(); n += 1

        # Report training loss every `lstm_log_interval` mini‑batches
        if step % lstm_log_interval == 0 or step == 1:
            print(f"[batch {step:4}/{len(train_loader)}] "
            f"loss={loss.item():.3f}")

    return total / n # epoch‑average loss

# ------------------------------------------------------------
# evaluate_loss() – no‑grad validation loop
# ------------------------------------------------------------
@torch.no_grad()
def evaluate_loss(model_enc, model_dec, loader, criterion, pad_id=PAD_ID):
    model_enc.eval() # eval mode = disable dropout
    model_dec.eval()
    total, ntok = 0.0, 0 # token‑level aggregation
    for src, tgt in loader:
        src, tgt = src.to(device), tgt.to(device)
        src_mask = (src != pad_id)

        enc_out, hidden = model_enc(src)
        logits, _ = model_dec(tgt[:, :-1], hidden)

        # loss averaged per token (criterion already ignores PAD)
        loss = criterion(
            logits.reshape(-1, logits.size(-1)),
            tgt[:, 1:].reshape(-1)
        )

        tokens = (tgt[:, 1:] != pad_id).sum().item() # non‑pad count
        total += loss.item() * tokens # scale back to sum
        ntok  += tokens

    avg = total / ntok # token-weighted mean of the criterion (plain NLL only if the criterion has no smoothing term)
    ppl = math.exp(avg) # perplexity = e^(mean loss)
    return avg, ppl

# ------------------------------------------------------------
# Main training loop across epochs
# ------------------------------------------------------------
t0 = time.perf_counter()

for epoch in range(1, lstm_epochs+1):
    loss = train_epoch() # one pass over train set
    val_loss, val_ppl = evaluate_loss(lstm_enc, lstm_dec, val_loader, criterion) # validation metrics
    print(f"Epoch {epoch}: Train loss={loss:.3f}, Val loss={val_loss:.3f}, Val ppl={val_ppl:.3f}")

    # Step the LR scheduler once per epoch
    scheduler.step()

elapsed = time.perf_counter() - t0
print(f"[LSTM] Wall-clock training time : {elapsed/60:6.2f} min")

[batch    1/391] loss=8.298
[batch   50/391] loss=6.644
[batch  100/391] loss=6.173
[batch  150/391] loss=5.843
[batch  200/391] loss=5.685
[batch  250/391] loss=5.529
[batch  300/391] loss=5.457
[batch  350/391] loss=5.308
Epoch 1: Train loss=5.876, Val loss=5.287, Val ppl=197.844
[batch    1/391] loss=5.281
[batch   50/391] loss=5.241
[batch  100/391] loss=5.072
[batch  150/391] loss=4.937
[batch  200/391] loss=5.023
[batch  250/391] loss=4.990
[batch  300/391] loss=4.797
[batch  350/391] loss=4.790
Epoch 2: Train loss=5.006, Val loss=4.825, Val ppl=124.604
[batch    1/391] loss=4.768
[batch   50/391] loss=4.788
[batch  100/391] loss=4.684
[batch  150/391] loss=4.736
[batch  200/391] loss=4.689
[batch  250/391] loss=4.691
[batch  300/391] loss=4.616
[batch  350/391] loss=4.597
Epoch 3: Train loss=4.703, Val loss=4.591, Val ppl=98.611
[batch    1/391] loss=4.517
[batch   50/391] loss=4.518
[batch  100/391] loss=4.557
[batch  150/391] loss=4.566
[batch  200/391] loss=4.542
[batch  250/
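
The validation function above weights each batch's mean loss by its number of non-pad tokens before averaging. A minimal sketch of why that matters, using made-up per-batch numbers and a hypothetical helper name `corpus_loss_and_ppl`:

```python
import math

def corpus_loss_and_ppl(batch_stats):
    """batch_stats: list of (mean_loss_per_token, num_non_pad_tokens)."""
    total, ntok = 0.0, 0
    for mean_loss, tokens in batch_stats:
        total += mean_loss * tokens   # scale the batch mean back to a sum
        ntok += tokens
    avg = total / ntok                # token-weighted mean loss
    return avg, math.exp(avg)         # perplexity = e^(mean loss)

# Hypothetical batches: a short one with high loss, a long one with lower loss.
stats = [(5.0, 100), (4.0, 300)]

avg, ppl = corpus_loss_and_ppl(stats)
naive = sum(loss for loss, _ in stats) / len(stats)   # ignores batch sizes
print(f"token-weighted mean = {avg:.3f}, naive batch mean = {naive:.3f}, ppl = {ppl:.2f}")
```

The naive mean treats both batches equally, so it over-weights the short batch; the token-weighted mean is the per-token quantity that perplexity is defined on.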

### 5-1. Attention on LSTM: Class definition

In [10]:
# ------------------------------------------------------------------
# Dot‑product (multiplicative) attention (single‑head, batched); no 1/sqrt(H) scaling is applied
# ------------------------------------------------------------------
class Attention(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        # Linear layer projects decoder hidden → key/query space.
        # Bias is set to False to keep the operation just a matrix mult.
        self.W = nn.Linear(hidden, hidden, bias=False)

    def forward(self, decoder_hidden, encoder_out):
        """Compute context vectors and attention weights.

        Args
        ----
        decoder_hidden : [B, 1, H]  – current decoder time‑step hidden state
        encoder_out    : [B, T_src, H] – all encoder outputs (keys/values)

        Returns
        -------
        context        : [B, 1, H]
        attn_weights   : [B, 1, T_src]
        """

        # ---- TODO students implement below ---- #
        # 1. Project decoder hidden through self.W  →  [B, 1, H]
        # 2. Dot‑product with encoder_out^T via torch.bmm
        #      scores = Q · K^T  →  [B, 1, T_src]
        # 3. Softmax over T_src dimension to turn scores → probs
        # 4. Weighted sum (context) = probs · V  via torch.bmm again
        # --------------------------------------- #

        # 1. Project decoder hidden through self.W
        query = self.W(decoder_hidden)  # [B, 1, H]

        # 2. Dot‑product with encoder_out^T via torch.bmm
        # scores = Q · K^T
        scores = torch.bmm(query, encoder_out.transpose(1, 2))  # [B, 1, T_src]

        # 3. Softmax over T_src dimension to turn scores → probs
        attn_weights = F.softmax(scores, dim=-1)  # [B, 1, T_src]

        # 4. Weighted sum (context) = probs · V via torch.bmm
        context = torch.bmm(attn_weights, encoder_out)  # [B, 1, H]

        return context, attn_weights


# ------------------------------------------------------------------
# Attention‑augmented Decoder (one token at a time)
# ------------------------------------------------------------------
class AttnDecoder(nn.Module):
    def __init__(self, vocab, hidden):
        super().__init__()
        # ---- TODO students implement below ---- #

        self.embedding = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(input_size=hidden,
                            hidden_size=hidden,
                            num_layers=lstm_layers,
                            dropout=lstm_dropout if lstm_layers > 1 else 0,
                            batch_first=True)
        self.attention = Attention(hidden)
        self.dropout = nn.Dropout(lstm_dropout)
        self.out_dropout = nn.Dropout(lstm_dropout)
        self.fc = nn.Linear(hidden, vocab)

    def forward(self, y, hidden, enc_outputs):
        """Args
        y           : [B, T_dec]   – gold tokens
        hidden      : (h, c) tuple each [L, B, H]
        enc_outputs : [B, T_src, H]
        Returns
        -------
        logits      : [B, T_dec, vocab]
        new_hidden  : (h_n, c_n)
        """
        # ---- TODO students implement below ---- #
        # 1. Embed y and apply dropout → emb  [B, T_dec, H]
        # 2. Loop over each time‑step t because
        #    we want to feed the previous decoder hidden to attention.
        # 3. For every t, do attention and lstm operation
        # 4. Concatenate outputs → [B, T_dec, H]
        # 5. Apply out_dropout then fc → logits
        # 6. Return logits and last hidden state tuple.
        # --------------------------------------- #

        # 1. Embed y and apply dropout
        emb = self.embedding(y)  # [B, T_dec, H]
        emb = self.dropout(emb)

        batch_size = emb.size(0)
        seq_len = emb.size(1)
        hidden_size = emb.size(2)

        # Storage for collecting outputs
        outputs = []

        # 2. & 3. Process each timestep to use previous decoder state
        for t in range(seq_len):
            # Current timestep embedding
            emb_t = emb[:, t:t+1, :]  # [B, 1, H]

            # LSTM step
            lstm_out, hidden = self.lstm(emb_t, hidden)  # [B, 1, H]

            # Apply attention mechanism
            context, _ = self.attention(lstm_out, enc_outputs)

            # Combine context and LSTM output (simple addition)
            combined = lstm_out + context  # [B, 1, H]

            # Store result
            outputs.append(combined)

        # 4. Concatenate all outputs
        outputs = torch.cat(outputs, dim=1)  # [B, T_dec, H]

        # 5. Apply output dropout and projection
        outputs = self.out_dropout(outputs)
        logits = self.fc(outputs)  # [B, T_dec, vocab]

        # 6. Return logits and final hidden state
        return logits, hidden
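
To see the four attention steps with actual numbers, here is a library-free sketch for a single decoder step and a three-token source. The vectors are made up, `W` is set to the identity purely as an assumption for readability (the real layer is learned), and `attend` is a hypothetical name for this illustration.

```python
import math

def matvec(W, v):
    """Matrix-vector product, standing in for the bias-free linear projection."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_hidden, encoder_out, W):
    # 1. Project the decoder state into the query space.
    query = matvec(W, decoder_hidden)
    # 2. One score per source position: query · key.
    scores = [dot(query, key) for key in encoder_out]
    # 3. Softmax over source positions (shifted by the max for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # 4. Context = attention-weighted sum of the encoder outputs.
    dim = len(decoder_hidden)
    context = [sum(w * enc[j] for w, enc in zip(weights, encoder_out))
               for j in range(dim)]
    return context, weights

W = [[1.0, 0.0], [0.0, 1.0]]                       # identity, assumed for illustration
decoder_hidden = [1.0, 0.0]                        # H = 2
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] # T_src = 3

context, weights = attend(decoder_hidden, encoder_out, W)
print("weights:", weights)
print("context:", context)
```

Source positions whose encoder vector points in the same direction as the query receive larger weights, so the context vector leans toward them.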

In [11]:
# ------------------------------
# Model instantiation
# ------------------------------
attn_enc = Encoder(VOCAB, lstm_hidden).to(device)
attn_dec = AttnDecoder(VOCAB, lstm_hidden).to(device)

### 5-2. Attention on LSTM: Training loop

In [12]:
# ================================================================
# Training loop – Attention‑based Sequence‑to‑Sequence Model
# ================================================================
# -------------------------------------------------------------------------------------------------
# Legend
#  • `CrossEntropyLoss`      – vanilla CE (label smoothing was already demonstrated earlier)
#  • `ReduceLROnPlateau`     – scheduler that halves LR when validation loss stagnates
# -------------------------------------------------------------------------------------------------

# ------------------------------
# Optimizer setup
# ------------------------------
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

optimizer = torch.optim.Adam(list(attn_enc.parameters()) + list(attn_dec.parameters()), lr=1e-3)

# LR is multiplied by 0.5 once val‑loss has failed to improve for more than `patience` (= 1) consecutive epochs.
# `mode='min'` because we want the loss to go down.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=1, verbose=True)

# ------------------------------------------------------------
# train_epoch() – with attention
# ------------------------------------------------------------
def train_epoch():
    attn_enc.train()
    attn_dec.train()
    total, ntok = 0, 0
    for step, (src, tgt) in enumerate(train_loader, 1):
        # ---------------- Mini‑batch prep ----------------
        src, tgt = src.to(device), tgt.to(device)
        optimizer.zero_grad()

        # ---------------- Forward pass -------------------
        enc_out, hidden = attn_enc(src)
        logits, _ = attn_dec(tgt[:, :-1], hidden, enc_out)

        # CE expects (N, C) so reshape B×T×V → (B·T, V)
        loss = criterion(logits.reshape(-1, VOCAB),
                         tgt[:, 1:].reshape(-1))

        # ---------------- Back‑prop ----------------------
        loss.backward()
        torch.nn.utils.clip_grad_norm_(attn_enc.parameters(), 1.0)
        torch.nn.utils.clip_grad_norm_(attn_dec.parameters(), 1.0)
        optimizer.step()

        # ---------------- Stats --------------------------
        tokens = (tgt[:,1:] != PAD_ID).sum().item()
        total += loss.item() * tokens # accumulate sum over tokens
        ntok  += tokens

        if step % lstm_log_interval == 0 or step == 1:
            print(f"[batch {step:4}/{len(train_loader)}] "
            f"loss={loss.item():.3f}")

    return total / ntok # token‑average loss per epoch

# ------------------------------------------------------------
# Validation – evaluate_loss_attn (no‑grad)
# ------------------------------------------------------------
@torch.no_grad()
def evaluate_loss_attn(model_enc, model_dec, loader, criterion, pad_id=PAD_ID):
    model_enc.eval()
    model_dec.eval()
    total, ntok = 0.0, 0
    for src, tgt in loader:
        src, tgt = src.to(device), tgt.to(device)
        src_mask = (src != pad_id)

        enc_out, hidden = model_enc(src)
        logits, _ = model_dec(tgt[:, :-1], hidden, enc_out)

        loss = criterion(
            logits.reshape(-1, logits.size(-1)),
            tgt[:, 1:].reshape(-1)
        )
        tokens = (tgt[:, 1:] != pad_id).sum().item()
        total += loss.item() * tokens
        ntok  += tokens
    avg = total / ntok
    ppl = math.exp(avg)
    return avg, ppl


# ------------------------------------------------------------
# Epoch loop with plateau scheduler
# ------------------------------------------------------------
t0 = time.perf_counter()

for epoch in range(1, lstm_epochs+1):
    loss = train_epoch() # one train pass
    val_loss, val_ppl = evaluate_loss_attn(attn_enc, attn_dec, val_loader, criterion) # validation
    print(f"Epoch {epoch}: Train loss={loss:.3f}, Val loss={val_loss:.3f}, Val ppl={val_ppl:.3f}")

    # Reduce LR if no improvement; scheduler looks at val_loss
    scheduler.step(val_loss)

elapsed = time.perf_counter() - t0
print(f"[LSTM with Attn] Wall-clock training time : {elapsed/60:6.2f} min")



[batch    1/391] loss=8.296
[batch   50/391] loss=5.908
[batch  100/391] loss=5.381
[batch  150/391] loss=4.963
[batch  200/391] loss=4.780
[batch  250/391] loss=4.584
[batch  300/391] loss=4.522
[batch  350/391] loss=4.387
Epoch 1: Train loss=5.021, Val loss=4.330, Val ppl=75.951
[batch    1/391] loss=4.247
[batch   50/391] loss=4.206
[batch  100/391] loss=4.169
[batch  150/391] loss=4.082
[batch  200/391] loss=4.128
[batch  250/391] loss=4.078
[batch  300/391] loss=3.922
[batch  350/391] loss=3.980
Epoch 2: Train loss=4.099, Val loss=3.904, Val ppl=49.598
[batch    1/391] loss=3.866
[batch   50/391] loss=3.847
[batch  100/391] loss=3.742
[batch  150/391] loss=3.867
[batch  200/391] loss=3.811
[batch  250/391] loss=3.824
[batch  300/391] loss=3.779
[batch  350/391] loss=3.765
Epoch 3: Train loss=3.787, Val loss=3.702, Val ppl=40.522
[batch    1/391] loss=3.623
[batch   50/391] loss=3.579
[batch  100/391] loss=3.584
[batch  150/391] loss=3.670
[batch  200/391] loss=3.492
[batch  250/39
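
The bookkeeping in `train_epoch()` and `evaluate_loss_attn()` multiplies each batch loss by its number of non-PAD target tokens, sums, and divides by the total token count; perplexity is the exponential of that average. A minimal sketch in plain Python, assuming `criterion` returns a mean over non-PAD tokens (its definition is outside this section). The `batches` values are made up for illustration.

```python
import math

# Hypothetical (mean_loss, non_pad_token_count) pairs, one per batch.
batches = [(4.0, 100), (5.0, 300)]

total, ntok = 0.0, 0
for mean_loss, tokens in batches:
    total += mean_loss * tokens  # undo the per-batch mean
    ntok += tokens

avg = total / ntok               # token-level average over the epoch
ppl = math.exp(avg)              # perplexity = exp(average negative log-likelihood)

# A plain mean of batch losses ignores how many tokens each batch holds.
naive = sum(loss for loss, _ in batches) / len(batches)
print(f"token-avg={avg:.3f}, naive={naive:.3f}, ppl={ppl:.2f}")
```

The same relation links the logged numbers: a validation loss of 4.330 gives exp(4.330) ≈ 75.9, in line with the `Val ppl=75.951` printed for epoch 1 (the small gap comes from rounding the loss to three decimals).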

### 6-1. Transformer: Class definition

In [None]:
# ------------------------------------------------------------------
# PositionalEncoding – sinusoidal schedule explained
# ------------------------------------------------------------------
# Transformers have no recurrence or convolution, so they need an
# explicit signal that tells them “token #3 comes after token #2”.  This
# positional encoding is added to the token embeddings before the
# sequence enters the encoder/decoder.
#
# We use the classic sinusoidal embedding from the original Vaswani et
# al. (2017) paper because:
#   • it is fixed (no extra parameters to learn), and
#   • the formula is defined for every position, so it may let the model
#     extrapolate to sequences longer than those seen during training.

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        """Pre‑computes a [1, max_len, d_model] tensor of sinusoids.

        Args
        ----
        d_model: dimensionality of embeddings fed into the Transformer.
        max_len: longest sequence the model will ever see at inference.
        """
        super().__init__()
        # ------------------------------------------------------------------
        # 1.  Build a lookup table `pe` where row i = position i (0‑indexed)
        # ------------------------------------------------------------------
        pe = torch.zeros(max_len, d_model) # [T, D]

        # Positions: 0, 1, 2, …, T‑1  → shape [T, 1] so broadcasting works
        pos = torch.arange(0, max_len).unsqueeze(1)

        # Inverse denominator 1 / 10000^{2k / d_model}, implemented via exp/log.
        # Computed for even indices 0,2,4,…  (cosine will use the same term)
        div_term = torch.exp(torch.arange(0, d_model, 2) # 0,2,4,…
                            * -(math.log(10000.0) / d_model) # exponent factor
                            ) # shape [D/2]

        # Apply sin to even dims; cos to odd dims. Broadcasting does the
        # heavy lifting so no explicit loops are needed.
        pe[:, 0::2] = torch.sin(pos * div_term) # even indices  (0,2,…)
        pe[:, 1::2] = torch.cos(pos * div_term) # odd  indices  (1,3,…)

        # The layers below use batch_first=True, so unsqueeze(0) → [1,T,D].
        # `register_buffer` marks the tensor as part of the module’s state
        # (saved with .state_dict()) but not a learnable parameter.
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        """Add positional encodings to input embeddings.

        Parameters
        ----------
        x : [B, T, D] – token embeddings coming from `nn.Embedding`.
        Returns
        -------
        out : [B, T, D] – embeddings plus positional signal.
        """
        # Slice the first T positions (x.size(1)) and rely on broadcasting
        # over the batch dimension: [1,T,D] + [B,T,D] → [B,T,D].
        return x + self.pe[:, :x.size(1)]

# ------------------------------------------------------------------
# TransformerModel – step-by-step TODOs
# ------------------------------------------------------------------
class TransformerModel(nn.Module):
    def __init__(self, vocab, d_model=256, nhead=4, nlayers=2):
        super().__init__()
        # -------- TODO students implement below -------- #
        # 1. Token embedding with `padding_idx=PAD_ID` and embed_dim = d_model
        # 2. PositionalEncoding instance (no learnable params)
        # 3. Encoder–decoder stack:
        #       enc_layer = nn.TransformerEncoderLayer(d_model, nhead,
        #                       dim_feedforward=4*d_model, batch_first=True)
        #       self.encoder = nn.TransformerEncoder(enc_layer, nlayers)
        #   Repeat similarly for `nn.TransformerDecoder`.  Remember to use
        #   batch_first=True so tensors stay [B, T, D].
        # 4. Final linear layer maps D → vocab logits.
        # ---------------------------------------------- #

        # 1. Token embedding with `padding_idx=PAD_ID` and embed_dim = d_model
        self.embed = nn.Embedding(vocab, d_model, padding_idx=PAD_ID)

        # 2. PositionalEncoding instance (no learnable params)
        self.pos = PositionalEncoding(d_model)

        # 3. Encoder–decoder stack
        # Encoder layers with batch_first=True
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4*d_model,
            batch_first=True
        )
        # Stack encoder layers together
        self.encoder = nn.TransformerEncoder(enc_layer, nlayers)

        # Decoder layers with batch_first=True
        dec_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=4*d_model,
            batch_first=True
        )
        # Stack decoder layers together
        self.decoder = nn.TransformerDecoder(dec_layer, nlayers)

        # 4. Final linear layer maps D → vocab logits
        self.fc = nn.Linear(d_model, vocab)


    def forward(self, src, tgt,
                src_key_padding_mask=None,
                tgt_key_padding_mask=None,
                tgt_mask=None,
                memory_key_padding_mask=None):

        """Forward pass with flexible masking.

        * `src_key_padding_mask`    : [B, T_src]  – True where PAD in src
        * `tgt_key_padding_mask`    : [B, T_tgt]  – True where PAD in tgt
        * `tgt_mask` (causal)       : [T_tgt, T_tgt] – usually `generate_square_subsequent_mask(T_tgt)`
        * `memory_key_padding_mask` : masks encoder output; defaults to src mask
        """
        # -------- TODO students implement below -------- #
        # 1. Embed src and tgt, then add positional encodings.
        # 2. Encoder produces `memory` [B, T_src, D]
        # 3. Decoder consumes (tgt, memory) and returns hidden states [B, T_tgt, D].
        #    Pass all masking arguments to ensure padding & causality.
        # 4. Project decoder outputs through `self.fc` → logits [B, T_tgt, vocab]
        # 5. Return logits.
        # ---------------------------------------------- #

        # 1. Embed + add positional encodings
        src_emb = self.pos(self.embed(src))  # [B, T_src, D]
        tgt_emb = self.pos(self.embed(tgt))  # [B, T_tgt, D]

        # If memory padding mask not provided, default to src padding mask
        if memory_key_padding_mask is None:
            memory_key_padding_mask = src_key_padding_mask

        # 2. Encoder produces memory [B, T_src, D]
        memory = self.encoder(
            src_emb,
            src_key_padding_mask=src_key_padding_mask
        )

        # 3. Decoder consumes (tgt, memory) and returns hidden states [B, T_tgt, D]
        # Pass all masking arguments to ensure padding & causality
        out = self.decoder(
            tgt_emb,
            memory,
            tgt_mask=tgt_mask,
            tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=memory_key_padding_mask
        )

        # 4. Project decoder outputs through self.fc → logits [B, T_tgt, vocab]
        logits = self.fc(out)

        # 5. Return logits
        return logits
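
The table built in `PositionalEncoding.__init__` follows PE(pos, 2k) = sin(pos / 10000^(2k/d_model)) and PE(pos, 2k+1) = cos(pos / 10000^(2k/d_model)). The sketch below recomputes it with explicit loops in plain Python so that individual entries can be checked by hand. The helper name `sinusoid_table` is hypothetical, and the sketch assumes an even `d_model`.

```python
import math

def sinusoid_table(max_len, d_model):
    """Return a max_len x d_model list of lists of sinusoidal encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same factor as `div_term`: exp(-i * ln(10000) / d_model)
            freq = math.exp(-i * math.log(10000.0) / d_model)
            table[pos][i] = math.sin(pos * freq)      # even dimension
            table[pos][i + 1] = math.cos(pos * freq)  # odd dimension
    return table

pe = sinusoid_table(max_len=4, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always yields the pattern `[0, 1, 0, 1]`, and later dimension pairs use smaller frequencies, so they change more slowly from one position to the next.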


### 6-2. Transformer: Training loop

In [None]:
# ------------------------------
# Model instantiation
# ------------------------------
trans_model = TransformerModel(VOCAB, d_model, nhead, nlayers).to(device)

In [None]:
# ================================================================
# Training loop – Encoder–Decoder Transformer with AdamW
# ================================================================
# This block is the Transformer counterpart to the LSTM and Attn-LSTM
# loops shown earlier.  It introduces two new ingredients:
#   • `torch.optim.AdamW`  – Adam variant with decoupled weight decay.
#   • A causal mask (`gen_square_sub_mask`) so the decoder can’t peek
#     at future tokens during training.

# ------------------------------
# Optimizer
# ------------------------------

# AdamW is preferred for Transformers; betas match the original paper.
optimizer = torch.optim.AdamW(trans_model.parameters(), lr=trans_base_lr, betas=(0.9, 0.98), eps=1e-9)

# ------------------------------
# Utility: generate causal decoder mask
# ------------------------------
def gen_square_sub_mask(sz, device):
    """Upper‑triangular matrix with −inf above the main diagonal.

    When added to query–key scores inside `nn.MultiheadAttention`, these
    −inf values turn into 0 after softmax → effectively masking future
    positions.
    """
    return torch.triu(torch.full((sz, sz), float('-inf'),
                      device=device),
                      diagonal=1) # start one step above main diagonal

# ------------------------------------------------------------
# train_epoch() for Transformer
# ------------------------------------------------------------
def train_epoch():
    trans_model.train()
    total_loss, total_tok = 0.0, 0

    for step, (src, tgt) in enumerate(train_loader, 1):
        src, tgt = src.to(device), tgt.to(device)
        optimizer.zero_grad()

        # Padding masks (True where PAD)
        src_key_padding = src.eq(PAD_ID) # [B, T_src]
        tgt_input = tgt[:, :-1] # decoder inputs (shifted)
        tgt_key_padding = tgt_input.eq(PAD_ID) # [B, T_tgt]

        # Causal mask – ensures tokens attend only to earlier positions.
        tgt_mask = gen_square_sub_mask(tgt_input.size(1), device)

        # Forward pass
        logits = trans_model(src, tgt_input,
                       src_key_padding_mask=src_key_padding,
                       tgt_key_padding_mask=tgt_key_padding,
                       tgt_mask=tgt_mask) # [B, T_tgt, V]

        # Label‑smoothed loss (defined earlier) expects log‑probs
        loss = label_smoothed_nll_loss(F.log_softmax(logits, -1), # convert to log‑p
                                       tgt[:, 1:]) # decoder targets (shifted)

        # Back‑prop and optimization
        loss.backward()
        torch.nn.utils.clip_grad_norm_(trans_model.parameters(), 1.0)
        optimizer.step()

        # Aggregate stats
        ntok = tgt[:, 1:].ne(PAD_ID).sum().item()
        total_loss += loss.item() * ntok
        total_tok  += ntok

        if step % trans_log_interval == 0 or step == 1:
            print(f"[batch {step:4}/{len(train_loader)}] "
            f"loss={loss.item():.3f}")

    return total_loss / total_tok # token‑average loss over epoch

# ------------------------------------------------------------
# Validation loop (no‑grad)
# ------------------------------------------------------------
@torch.no_grad()
def evaluate_loss_transformer(model, loader, criterion, pad_id=PAD_ID):
    model.eval()
    total, ntok = 0.0, 0

    for src, tgt in loader:
        src, tgt = src.to(device), tgt.to(device)
        src_key_padding = src.eq(pad_id)
        tgt_input = tgt[:, :-1]
        tgt_key_padding = tgt_input.eq(pad_id)
        tgt_mask = gen_square_sub_mask(tgt_input.size(1), device)

        logits = model(src, tgt_input,
                     src_key_padding_mask=src_key_padding,
                     tgt_key_padding_mask=tgt_key_padding,
                     tgt_mask=tgt_mask)

        loss = criterion(logits.reshape(-1, logits.size(-1)),
                       tgt[:, 1:].reshape(-1))

        tokens = (tgt[:, 1:] != pad_id).sum().item()
        total += loss.item() * tokens
        ntok += tokens

    avg = total / ntok
    ppl = math.exp(avg)
    return avg, ppl

# ------------------------------------------------------------
# Epoch loop
# ------------------------------------------------------------
t0 = time.perf_counter()

for epoch in range(1, trans_epochs + 1):
    loss = train_epoch()
    val_loss, val_ppl = evaluate_loss_transformer(trans_model, val_loader, criterion)
    print(f"Epoch {epoch}: Train loss={loss:.3f}, Val loss={val_loss:.3f}, Val ppl={val_ppl:.3f}")

elapsed = time.perf_counter() - t0
print(f"[Transformer] Wall-clock training time : {elapsed/60:6.2f} min")

[batch    1/391] loss=6.777
[batch   50/391] loss=6.067
[batch  100/391] loss=5.447
[batch  150/391] loss=5.300
[batch  200/391] loss=5.038
[batch  250/391] loss=4.786
[batch  300/391] loss=4.808
[batch  350/391] loss=4.646




Epoch 1: Train loss=5.215, Val loss=4.653, Val ppl=104.847
[batch    1/391] loss=4.515
[batch   50/391] loss=4.456
[batch  100/391] loss=4.384
[batch  150/391] loss=4.390
[batch  200/391] loss=4.375
[batch  250/391] loss=4.329
[batch  300/391] loss=4.236
[batch  350/391] loss=4.125
Epoch 2: Train loss=4.325, Val loss=4.238, Val ppl=69.257
[batch    1/391] loss=4.039
[batch   50/391] loss=3.919
[batch  100/391] loss=3.894
[batch  150/391] loss=3.871
[batch  200/391] loss=3.800
[batch  250/391] loss=3.679
[batch  300/391] loss=3.704
[batch  350/391] loss=3.681
Epoch 3: Train loss=3.792, Val loss=3.821, Val ppl=45.671
[batch    1/391] loss=3.380
[batch   50/391] loss=3.461
[batch  100/391] loss=3.436
[batch  150/391] loss=3.346
[batch  200/391] loss=3.324
[batch  250/391] loss=3.390
[batch  300/391] loss=3.299
[batch  350/391] loss=3.222
Epoch 4: Train loss=3.332, Val loss=3.557, Val ppl=35.051
[batch    1/391] loss=3.000
[batch   50/391] loss=2.945
[batch  100/391] loss=3.112
[batch  150
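
To see why the additive mask from `gen_square_sub_mask` hides future positions, the sketch below applies a 3×3 causal mask to made-up attention scores and runs a softmax over each row. It uses plain Python instead of PyTorch; `causal_mask` and `softmax` are illustrative helpers, not part of the notebook.

```python
import math

NEG_INF = float("-inf")

def causal_mask(sz):
    """0.0 on and below the diagonal, -inf strictly above it."""
    return [[NEG_INF if j > i else 0.0 for j in range(sz)] for i in range(sz)]

def softmax(row):
    m = max(row)                            # finite: the diagonal is never masked
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    s = sum(exps)
    return [e / s for e in exps]

scores = [[1.0, 2.0, 3.0]] * 3              # made-up query-key scores
mask = causal_mask(3)
weights = [softmax([s + m for s, m in zip(srow, mrow)])
           for srow, mrow in zip(scores, mask)]
for row in weights:
    print([round(w, 3) for w in row])
```

Row `i` of `weights` sums to 1 and has exact zeros in every column `j > i`, so query position `i` can only draw on positions `0..i`.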

### 7. Translation

In [13]:
# Example sentence
en_sentence = "A person is wearing a hat."

# The translations may be poor because the models are small and were
# trained on a small dataset.
# Still, the more advanced models tend to capture more of the context.
# Feel free to change the sentence and try other inputs.

### 7-1. Translation example: LSTM

In [None]:
# ================================================================
# Greedy inference – `translate_lstm`
# ================================================================
# At training time we ran teacher forcing (feeding gold tokens).
# During inference we must generate one token at a time and feed each
# prediction back into the decoder. The helper below performs greedy
# decoding—always picking the highest‐probability token at every step.

@torch.no_grad() # inference only: skip building the autograd graph
def translate_lstm(src_sentence):
    """Translate English→French using the trained LSTM seq2seq model.

    Steps
    -----
    1.  Switch encoder/decoder to `eval()` so dropout is disabled.
    2.  `encode()` the raw source sentence → list[int].  Wrap in a
        batch‐dim `[1, T]` and move to device.
    3.  Run the encoder once to get all hidden states + final (h, c).
    4.  Initialize the decoder input with just `<s>` (BOS).
    5.  Loop up to `max_len` – each iteration:
        a. Feed the last generated token to decoder.
        b. Take `argmax` over vocabulary to get the next token ID.
        c. Append the new token to `ys`.
        d. If the token is `</s>` (EOS) → break early.
    6.  Remove BOS/EOS, convert IDs back to text with `sp.decode()`.
    """
    lstm_enc.eval() # 1. evaluation mode
    lstm_dec.eval()

    src_ids = torch.tensor([encode(src_sentence)], device=device) # 2.

    enc_out, hidden = lstm_enc(src_ids) # 3. encoder forward

    ys = torch.tensor([[BOS_ID]], device=device) # 4. start symbol

    for _ in range(max_len): # 5. generate token by token
        logits, hidden = lstm_dec(ys[:, -1:], hidden) # a. last token only
        next_id = logits[:, -1, :].argmax(-1) # b. greedy pick
        ys = torch.cat([ys, next_id.unsqueeze(1)], dim=1) # c. append
        if next_id.item() == EOS_ID: # d. stop condition
            break

    tgt_tokens = ys[0, 1:].tolist() # strip BOS
    if tgt_tokens and tgt_tokens[-1] == EOS_ID: # strip EOS only if it was generated
        tgt_tokens = tgt_tokens[:-1]
    return sp.decode(tgt_tokens) # 6. detokenize

print("LSTM Translation Result:")
print(f"▶︎ {en_sentence}")
print("   →", translate_lstm(en_sentence))

LSTM Translation Result:
▶︎ A person is wearing a hat.
   → itablement que c'est un peu
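
The control flow of greedy decoding can be tried without a trained model. In the sketch below, `fake_step` is a hypothetical stand-in for one decoder step and the token IDs are invented; only the loop structure (start from BOS, take the argmax, stop at EOS or at `max_len`) mirrors `translate_lstm`.

```python
BOS_ID, EOS_ID = 1, 2  # hypothetical IDs for this sketch only

def fake_step(prefix):
    """Stand-in for a decoder step: returns one score per vocabulary ID."""
    # Scripted so that the greedy path is 5, 7, then EOS.
    script = {1: 5, 2: 7, 3: EOS_ID}
    best = script.get(len(prefix), EOS_ID)
    return [1.0 if tok == best else 0.0 for tok in range(10)]

def greedy_decode(step_fn, max_len=20):
    ys = [BOS_ID]
    for _ in range(max_len):
        scores = step_fn(ys)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ys.append(next_id)
        if next_id == EOS_ID:
            break
    out = ys[1:]                      # drop BOS
    if out and out[-1] == EOS_ID:     # drop EOS only if it was produced
        out = out[:-1]
    return out

print(greedy_decode(fake_step))             # [5, 7], stopped by EOS
print(greedy_decode(fake_step, max_len=2))  # [5, 7], stopped by max_len
```

Stripping EOS only when it is present matters when the loop ends at `max_len`: an unconditional `[1:-1]` slice would then drop a real token.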


### 7-2. Translation example: LSTM with Attention

In [14]:
# ================================================================
# Greedy inference – `translate_attn`
# ================================================================

@torch.no_grad() # inference only: skip building the autograd graph
def translate_attn(src_sentence):
    attn_enc.eval()
    attn_dec.eval()

    src_ids = torch.tensor([encode(src_sentence)], device=device)

    enc_out, hidden = attn_enc(src_ids)

    ys = torch.tensor([[BOS_ID]], device=device)

    for _ in range(max_len):
        logits, hidden = attn_dec(ys[:, -1:], hidden, enc_out)
        next_id = logits[:, -1, :].argmax(-1)
        ys = torch.cat([ys, next_id.unsqueeze(1)], dim=1)
        if next_id.item() == EOS_ID:
            break

    tgt_tokens = ys[0, 1:].tolist() # strip BOS
    if tgt_tokens and tgt_tokens[-1] == EOS_ID: # strip EOS only if it was generated
        tgt_tokens = tgt_tokens[:-1]
    return sp.decode(tgt_tokens)

print("LSTM with Attention Translation Result:")
print(f"▶︎ {en_sentence}")
print("   →", translate_attn(en_sentence))

LSTM with Attention Translation Result:
▶︎ A person is wearing a hat.
   → personne ne s'est mis un pied.


### 7-3. Translation example: Transformer

In [None]:
# ================================================================
# Greedy inference – `translate_trans` for Transformer
# ================================================================
# This routine mirrors `translate_lstm` but uses the Transformer model.
# Main differences:
#   • We precompute the memory (encoder output) once.
#   • Every decoding step requires a causal mask for self‑attention.
#   • Padding masks must be passed to both decoder and encoder–decoder
#     attention so PAD tokens don't influence the context.

@torch.no_grad() # inference only: skip building the autograd graph
def translate_trans(src_sent):
    """Translate a single sentence with the trained Transformer."""
    trans_model.eval() # disable dropout

    # ------------------------------------------------------------
    # 1.  Encode the source sentence ONCE
    # ------------------------------------------------------------
    src_ids = torch.tensor([encode(src_sent)], device=device) # [1, T_src]
    src_key = src_ids.eq(PAD_ID) # [1, T_src] bool

    # token embed + positional encodings
    memory = trans_model.embed(src_ids) # [1, T_src, D]
    memory = trans_model.pos(memory) # add sin/cos positions

    # run through the encoder stack → `memory`
    memory = trans_model.encoder(memory, src_key_padding_mask=src_key)

    # ------------------------------------------------------------
    # 2.  Autoregressive decoder loop (greedy)
    # ------------------------------------------------------------
    ys = torch.tensor([[BOS_ID]], device=device)

    for _ in range(max_len):
        # Causal mask grows with sequence length
        tgt_mask = gen_square_sub_mask(ys.size(1), device) # [T_tgt, T_tgt]
        tgt_key  = ys.eq(PAD_ID) # padding mask

        # Embed + positional
        out = trans_model.embed(ys)
        out = trans_model.pos(out)

        # Decoder: queries = out, keys/values from memory
        out = trans_model.decoder(out, memory,
                                  tgt_mask=tgt_mask,
                                  tgt_key_padding_mask=tgt_key,
                                  memory_key_padding_mask=src_key
                                  ) # [1, T_tgt, D]

        # Project newest time‑step to vocabulary and pick argmax
        next_tok = trans_model.fc(out[:, -1, :]).argmax(-1) # [1]

        ys = torch.cat([ys, next_tok.unsqueeze(1)], dim=1) # append
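        if next_tok.item() == EOS_ID: # stop if </s> generated
            break

    # Strip BOS, and EOS only if it was generated, then detokenize
    out_ids = ys[0, 1:].tolist()
    if out_ids and out_ids[-1] == EOS_ID:
        out_ids = out_ids[:-1]
    return sp.decode(out_ids)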

print("Transformer Translation Result:")
print(f"▶︎ {en_sentence}")
print("   →", translate_trans(en_sentence))

Transformer Translation Result:
▶︎ A person is wearing a hat.
   → mons un chapitre.
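
Unlike the LSTM decoders, which carry `hidden` forward and feed only the last token, `translate_trans` re-embeds and re-decodes the whole prefix `ys` at every step. A rough count of the decoder positions this implies, assuming no key/value caching (none is used in the loop above); `decoder_positions` is a hypothetical helper:

```python
def decoder_positions(steps):
    """Total decoder positions processed when the full prefix is re-run each step."""
    return sum(t for t in range(1, steps + 1))  # prefix length at step t is t

for steps in (10, 30):
    # The sum 1 + 2 + ... + T has the closed form T * (T + 1) / 2.
    print(steps, decoder_positions(steps), steps * (steps + 1) // 2)
```

For short sentences the quadratic growth is negligible, which is why the simple loop is a reasonable choice here.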


### 8. Model summary
Shows the number of parameters and multiply-adds (MACs) of each model.

In [None]:
# Summary of LSTM ENC
print(summary(lstm_enc, input_size=(1, 30), dtypes=[torch.long], col_names=("num_params", "mult_adds")))

# Summary of LSTM DEC
y = torch.randint(0, VOCAB, (1, 29), dtype=torch.long).to(device)
h = torch.randn(lstm_layers, 1, lstm_hidden).to(device)
c = torch.randn(lstm_layers, 1, lstm_hidden).to(device)
hidden = (h, c)
print(summary(lstm_dec, input_data=(y, hidden), dtypes=[torch.long], col_names=("num_params", "mult_adds")))

Layer (type:depth-idx)                   Param #                   Mult-Adds
Encoder                                  --                        --
├─Embedding: 1-1                         4,096,000                 4,096,000
├─Dropout: 1-2                           --                        --
├─LSTM: 1-3                              25,190,400                755,712,000
Total params: 29,286,400
Trainable params: 29,286,400
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 759.81
Input size (MB): 0.00
Forward/backward pass size (MB): 0.49
Params size (MB): 117.15
Estimated Total Size (MB): 117.64
Layer (type:depth-idx)                   Param #                   Mult-Adds
Decoder                                  --                        --
├─Embedding: 1-1                         4,096,000                 4,096,000
├─Dropout: 1-2                           --                        --
├─LSTM: 1-3                              25,190,400                730,521,600
├─Dropout: 1-4 
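
The LSTM rows above can be reproduced by hand. Each `nn.LSTM` layer holds four gates, and each gate has an input weight matrix, a recurrent weight matrix, and two bias vectors. The sizes used below (input width 1024, hidden size 1024, 3 layers) are inferred from the printed totals rather than read from the configuration cell, so treat them as assumptions. `lstm_param_count` is a hypothetical helper.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Parameters of a unidirectional multi-layer LSTM with both bias vectors."""
    total = 0
    for layer in range(num_layers):
        in_dim = input_size if layer == 0 else hidden_size
        # 4 gates x (W_ih + W_hh + b_ih + b_hh)
        total += 4 * (hidden_size * in_dim
                      + hidden_size * hidden_size
                      + 2 * hidden_size)
    return total

params = lstm_param_count(1024, 1024, 3)  # assumed sizes, see the note above
print(params)        # 25190400, the LSTM "Param #" in the summary
print(params * 30)   # 755712000, encoder Mult-Adds for a 30-token input
print(params * 29)   # 730521600, decoder Mult-Adds for a 29-token input
```

The Mult-Adds column therefore scales linearly with sequence length: every token passes once through all LSTM weights.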

In [15]:
# Summary of LSTM w/ ATTN ENC
print(summary(attn_enc, input_size=(1, 30), dtypes=[torch.long], col_names=("num_params", "mult_adds")))

# Summary of LSTM w/ ATTN DEC
y = torch.randint(0, VOCAB, (1, 29), dtype=torch.long).to(device)
h = torch.randn(lstm_layers, 1, lstm_hidden).to(device)
c = torch.randn(lstm_layers, 1, lstm_hidden).to(device)
hidden = (h, c)
enc_outputs = torch.randn(1, 29, lstm_hidden).to(device)
print(summary(attn_dec, input_data=(y, hidden, enc_outputs), dtypes=[torch.long], col_names=("num_params", "mult_adds")))

Layer (type:depth-idx)                   Param #                   Mult-Adds
Encoder                                  --                        --
├─Embedding: 1-1                         4,096,000                 4,096,000
├─Dropout: 1-2                           --                        --
├─LSTM: 1-3                              25,190,400                755,712,000
Total params: 29,286,400
Trainable params: 29,286,400
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 759.81
Input size (MB): 0.00
Forward/backward pass size (MB): 0.49
Params size (MB): 117.15
Estimated Total Size (MB): 117.64
Layer (type:depth-idx)                   Param #                   Mult-Adds
AttnDecoder                              --                        --
├─Embedding: 1-1                         4,096,000                 4,096,000
├─Dropout: 1-2                           --                        --
├─LSTM: 1-3                              25,190,400                25,190,400
├─Attention: 1-4

In [None]:
# Summary of Transformer
print(summary(trans_model, input_size=[(1, 30), (1, 29)], dtypes=[torch.long, torch.long], col_names=("num_params", "mult_adds")))

Layer (type:depth-idx)                        Param #                   Mult-Adds
TransformerModel                              --                        --
├─Embedding: 1-1                              2,048,000                 2,048,000
├─PositionalEncoding: 1-2                     --                        --
├─Embedding: 1-3                              (recursive)               2,048,000
├─PositionalEncoding: 1-4                     --                        --
├─TransformerEncoder: 1-5                     --                        --
│    └─ModuleList: 2-1                        --                        --
│    │    └─TransformerEncoderLayer: 3-1      3,152,384                 2,101,760
│    │    └─TransformerEncoderLayer: 3-2      3,152,384                 2,101,760
│    │    └─TransformerEncoderLayer: 3-3      3,152,384                 2,101,760
│    │    └─TransformerEncoderLayer: 3-4      3,152,384                 2,101,760
├─TransformerDecoder: 1-6                     --   
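
The per-layer figure of 3,152,384 can also be checked by hand. The sketch below counts the parameters of one `nn.TransformerEncoderLayer`: the packed Q/K/V projection, the output projection, the two feed-forward linears, and two LayerNorms. The value `d_model = 512` is inferred from the printed numbers, not read from the configuration cell, so it is an assumption; the class default of 256 would give a much smaller count. `encoder_layer_param_count` is a hypothetical helper.

```python
def encoder_layer_param_count(d_model, dim_feedforward):
    """Parameters of one nn.TransformerEncoderLayer (biases and LayerNorms included)."""
    attn = 3 * (d_model * d_model + d_model)            # packed Q, K, V projection
    attn += d_model * d_model + d_model                 # output projection
    ffn = d_model * dim_feedforward + dim_feedforward   # linear1
    ffn += dim_feedforward * d_model + d_model          # linear2
    norms = 2 * (2 * d_model)                           # two LayerNorms: weight + bias
    return attn + ffn + norms

d_model = 512                                            # assumed, see the note above
print(encoder_layer_param_count(d_model, 4 * d_model))  # 3152384
print(encoder_layer_param_count(256, 4 * 256))          # 789760 with the class default
print(2_048_000 // d_model)                             # implied vocabulary size: 4000
```

Roughly two thirds of each layer's parameters sit in the feed-forward block, which is useful to keep in mind when comparing model sizes in the discussion.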

### Discussion
1.   Compare the final validation losses of the three models and explain the differences.
2.   Discuss which model is the most efficient in terms of computational complexity relative to translation performance, and explain why.
3.   Discuss which model is the most efficient in terms of model size relative to translation performance, and explain why.
4.   Discuss which model is the most efficient in terms of training time relative to translation performance, and explain why.