# Lab Assignment: English-to-German Machine Translation with a Transformer

**Objective:**
The goal of this assignment is to fine-tune a pre-trained **Transformer** model for machine translation. You will load a dataset, fine-tune the model, and evaluate it using the BLEU score.

You will also compare your new model's performance and architecture against a previous **LSTM-based** model, and visualize the **self-attention** and **cross-attention** mechanisms that make the Transformer so powerful.


## Part 1: Setup and Environment

First, install all the necessary libraries.

In [1]:
# Colab cell (code)
!pip install --upgrade pip
!pip install transformers datasets sacrebleu sentencepiece accelerate


Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.3
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.2.0-py3-none-any.whl (22 kB)
Installing collected packages: portalocker, colorama, sacrebleu

In [2]:
# Colab cell (code)
!wget -nc http://www.manythings.org/anki/deu-eng.zip
!unzip -o deu-eng.zip
# The file created is usually 'deu.txt'


--2025-10-29 16:00:28--  http://www.manythings.org/anki/deu-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11638759 (11M) [application/zip]
Saving to: ‘deu-eng.zip’


2025-10-29 16:00:28 (33.8 MB/s) - ‘deu-eng.zip’ saved [11638759/11638759]

Archive:  deu-eng.zip
  inflating: deu.txt                 
  inflating: _about.txt              


In [3]:
# Colab cell (code)
data_path = "deu.txt"


In [4]:
# Colab cell (code)
# show first 10 lines to inspect format
with open(data_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(i, line.strip())
        if i >= 9: break


0 Go.	Geh.	CC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)
1 Hi.	Hallo!	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)
2 Hi.	Grüß Gott!	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)
3 Run!	Lauf!	CC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)
4 Run.	Lauf!	CC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #941078 (Fingerhut)
5 Wow!	Potzdonner!	CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)
6 Wow!	Donnerwetter!	CC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)
7 Duck!	Kopf runter!	CC-BY 2.0 (France) Attribution: tatoeba.org #280158 (CM) & #9968521 (wolfgangth)
8 Fire!	Feuer!	CC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)
9 Help!	Hilfe!	CC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)


## Part 2: Data Loading and Pre-processing

### 1. Load the Dataset
We will use the dataset downloaded in Part 1, which contains parallel sentence pairs for English-to-German translation.

In [5]:
# Colab cell (code)
import random
from datasets import Dataset, DatasetDict
import pandas as pd

# PARAMETERS
MAX_SAMPLES = 50000   # set how many sentence pairs to use (adjust for Colab GPU memory)
SEED = 42
random.seed(SEED)

src_texts = []
tgt_texts = []

with open(data_path, encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i >= MAX_SAMPLES:
            break
        parts = line.strip().split('\t')
        if len(parts) < 2:
            continue
        en = parts[0].strip()
        de = parts[1].strip()
        # basic filtering
        if len(en) == 0 or len(de) == 0:
            continue
        # optional: lower, strip
        src_texts.append(en)
        tgt_texts.append(de)

print("Collected pairs:", len(src_texts))

# Build a pandas DataFrame then HuggingFace dataset
df = pd.DataFrame({"en": src_texts, "de": tgt_texts})
# shuffle
df = df.sample(frac=1, random_state=SEED).reset_index(drop=True)

# Split (80/10/10)
n = len(df)
train_df = df.iloc[: int(0.8 * n)]
val_df = df.iloc[int(0.8 * n): int(0.9 * n)]
test_df = df.iloc[int(0.9 * n): ]

datasets = DatasetDict({
    "train": Dataset.from_pandas(train_df),
    "validation": Dataset.from_pandas(val_df),
    "test": Dataset.from_pandas(test_df),
})

print(datasets)


Collected pairs: 50000
DatasetDict({
    train: Dataset({
        features: ['en', 'de'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['en', 'de'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['en', 'de'],
        num_rows: 5000
    })
})


### 2. Load Tokenizer and Model
We will use a pre-trained model from the Helsinki-NLP group, `opus-mt-en-de`, a compact MarianMT Transformer trained specifically for English-to-German translation.

In [6]:
# Colab cell (code)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-de"   # english -> german
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### 3. Pre-processing Function
We must tokenize both the English (input) and German (target) sentences. For seq2seq models, the tokenized target sentences are assigned to the `labels` key.

In [7]:
# Colab cell (code)
max_input_length = 128
max_target_length = 128

def preprocess_function(batch):
    # batch contains lists of strings under the 'en' and 'de' keys
    inputs = tokenizer(batch["en"], truncation=True, padding="max_length", max_length=max_input_length)
    # text_target tokenizes with the target-language (German) vocabulary;
    # it replaces the deprecated tokenizer.as_target_tokenizer() context manager
    targets = tokenizer(text_target=batch["de"], truncation=True, padding="max_length", max_length=max_target_length)
    # replace tokenizer.pad_token_id with -100 in the labels so the loss ignores padding
    inputs["labels"] = [[(tok if tok != tokenizer.pad_token_id else -100) for tok in labels] for labels in targets["input_ids"]]
    return inputs

# Apply tokenization (this returns a new Dataset)
tokenized_datasets = datasets.map(preprocess_function, batched=True, remove_columns=["en", "de"])
tokenized_datasets = tokenized_datasets.remove_columns([c for c in tokenized_datasets["train"].column_names if c not in ["input_ids","attention_mask","labels"]])
tokenized_datasets.set_format(type="torch")
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 5000
    })
})
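
The `-100` replacement inside `preprocess_function` can be illustrated in isolation. The sketch below uses plain Python and made-up token IDs (the value `0` stands in for the pad token; the real ID comes from `tokenizer.pad_token_id`). It relies on the fact that PyTorch's cross-entropy loss ignores targets equal to `-100` by default.

```python
def mask_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding IDs with ignore_index so they do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]

# Hypothetical token IDs; 0 stands in for the pad token.
padded_labels = [
    [17, 42, 5, 0, 0],
    [8, 3, 0, 0, 0],
]
masked = mask_padding(padded_labels, pad_token_id=0)
print(masked)
```

Only the padded positions change; real target tokens keep their IDs and still contribute to the loss.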

### 4. Apply Pre-processing
The `.map()` call in the cell above (with `batched=True`) applies this tokenization to every split of the dataset.

## Part 3: Fine-Tuning the Model

### 1. Define Evaluation Metric (BLEU)
Machine translation is evaluated using the **BLEU score**. Load the `sacrebleu` metric through the `evaluate` library and define a `compute_metrics` function for the trainer to call during evaluation.

In [11]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


In [12]:
# Colab cell (code)
import numpy as np
import evaluate

bleu_metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    # preds and labels are lists of decoded strings; strip stray whitespace
    preds = [p.strip() for p in preds]
    labels = [l.strip() for l in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # with predict_with_generate=True, preds are generated token IDs
    if isinstance(preds, tuple): preds = preds[0]
    # some transformers versions pad predictions with -100; map those back to the pad token
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # restore the pad token in place of -100 so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # post-process
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    # sacrebleu expects one list of reference strings per prediction
    bleu = bleu_metric.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])
    result = {"bleu": bleu["score"]}
    return result

Downloading builder script: 0.00B [00:00, ?B/s]
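
To make the metric less of a black box, here is a rough sketch of clipped (modified) n-gram precision, the quantity BLEU is built from. It is not a replacement for `sacrebleu`: it assumes whitespace tokenization and a single reference, and it leaves out the brevity penalty and the geometric mean over n-gram orders. The function names `ngrams` and `clipped_precision` are illustrative choices, not part of any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

hyp = "Ich bin sehr sehr müde ."
ref = "Ich bin sehr müde ."
p1 = clipped_precision(hyp, ref, 1)
p2 = clipped_precision(hyp, ref, 2)
print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}")
```

The repeated "sehr" is only credited once because the reference contains it once; this clipping is what keeps a model from inflating its score by repeating common words.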

### 2. Configure Training
We will use the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments` classes to manage the fine-tuning process.

In [18]:
# Colab cell (code)
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

batch_size = 16   # adjust for your GPU memory
logging_steps = 200
output_dir = "./s2s-eng-ger-opusmt"

training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    eval_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,   # required for seq2seq metrics (generation)
    logging_steps=logging_steps,
    save_total_limit=3,
    num_train_epochs=4,           # start small; in the run below, validation loss stops improving after epoch 2
    fp16=True,                     # if your Colab GPU supports mixed precision
    remove_unused_columns=False,
    push_to_hub=False,
    save_strategy="epoch",
    report_to="none",   # 👈 disables W&B and other integrations
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

### 3. Start Training


In [19]:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)



In [20]:
trainer.train()
trainer.save_model(output_dir)


  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)


| Epoch | Training Loss | Validation Loss | Bleu |
|---|---|---|---|
| 1 | 0.5806 | 0.539619 | 54.471953 |
| 2 | 0.4159 | 0.530685 | 55.031796 |
| 3 | 0.3123 | 0.530881 | 54.648946 |
| 4 | 0.2462 | 0.534266 | 55.391738 |




### 4. Evaluate the Model
After training, run the final evaluation on the test set.

In [21]:
# Colab cell (code)
metrics = trainer.evaluate(tokenized_datasets["test"])
print(metrics)


{'eval_loss': 0.5392276644706726, 'eval_bleu': 54.248997066328776, 'eval_runtime': 81.9844, 'eval_samples_per_second': 60.987, 'eval_steps_per_second': 3.818, 'epoch': 4.0}


## Part 4: Inference (Testing the Translator)

Now use your fine-tuned model to translate new sentences using the `pipeline` utility.

In [22]:
# Test with new sentences
text1 = "Hello, how are you doing today?"
text2 = "The transformer model is a powerful architecture for NLP."
text3 = "This assignment is for my machine learning lab."


In [23]:
# Colab cell (code)
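import torch
from transformers import pipeline

# use the first GPU when available, otherwise fall back to CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1
translator = pipeline("translation_en_to_de", model=output_dir, tokenizer=tokenizer, device=device)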

test_sentences = [
    "I am very tired.",
    "You are right.",
    "He is a doctor.",
    "Can you help me?",
    "Let's go home."
]

for s in test_sentences:
    out = translator(s, max_length=128)
    print("EN:", s)
    print("DE:", out[0]['translation_text'])
    print()


Device set to use cuda:0


EN: I am very tired.
DE: Ich bin sehr müde.

EN: You are right.
DE: Du hast recht.

EN: He is a doctor.
DE: Er ist Arzt.

EN: Can you help me?
DE: Kannst du mir helfen?

EN: Let's go home.
DE: Lass uns nach Hause gehen.



## Part 5: Comparative Analysis (Transformer vs. LSTM)

In this section, you will directly compare the output of your new Transformer model with the LSTM-based model from your previous lab.

### 5.1. Test Sentences
Here is a list of 10 English sentences varying in length and complexity.

In [24]:
test_sentences = [
    # 1. Simple sentence
    "The cat sat on the mat.",

    # 2. Longer sentence
    "I am going to the library to read a book.",

    # 3. Question
    "What is your favorite color?",

    # 4. Long-range dependency
    "The man who lives down the street just bought a new car.",

    # 5. Figurative language (often difficult)
    "It's raining cats and dogs outside.",

    # 6. Technical term
    "The machine learning model was trained on a large dataset.",

    # 7. Command
    "Please close the door when you leave.",

    # 8. Classic test sentence
    "The quick brown fox jumps over the lazy dog.",

    # 9. Negation
    "He did not want to go to the party.",

    # 10. Future tense
    "We will travel to Germany next summer."
]


### 5.2. Translation and Comparison

**Your Task:**
1.  **Generate Transformer Translations:** Pass the `test_sentences` defined above through the `translator` pipeline from Part 4 to get the Transformer's translations.
2.  **Generate LSTM Translations:** Load your saved LSTM model and tokenizer from the previous lab. Use it to translate the same 10 `test_sentences`.
3.  **Create Comparison Table:** **Double-click this text cell** to edit it. Fill in the table below with your models' outputs.

| # | Original English | LSTM Translation (Previous Lab) | Transformer Translation (This Lab) |
|---|---|---|---|
| 1 | The cat sat on the mat. | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 2 | I am going to the library... | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 3 | What is your favorite color?| *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 4 | The man who lives down... | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 5 | It's raining cats and dogs... | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 6 | The machine learning model...| *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 7 | Please close the door... | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 8 | The quick brown fox... | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 9 | He did not want to go... | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
| 10| We will travel to Germany... | *... (Your LSTM's output)* | *... (Your Transformer's output)* |
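
If you would rather generate the table rows than type them by hand, a small helper along these lines may save time. This is only a sketch: `format_comparison_table` is a hypothetical helper name, and the placeholder strings below stand in for your real model outputs.

```python
def format_comparison_table(sources, lstm_outputs, transformer_outputs):
    """Build a markdown comparison table from three equally long lists of strings."""
    if not (len(sources) == len(lstm_outputs) == len(transformer_outputs)):
        raise ValueError("All three lists must have the same length.")
    lines = [
        "| # | Original English | LSTM Translation (Previous Lab) | Transformer Translation (This Lab) |",
        "|---|---|---|---|",
    ]
    rows = zip(sources, lstm_outputs, transformer_outputs)
    for i, (src, lstm, trf) in enumerate(rows, start=1):
        # Escape pipes so a translation cannot break the table layout.
        cells = [cell.replace("|", "\\|") for cell in (src, lstm, trf)]
        lines.append(f"| {i} | {cells[0]} | {cells[1]} | {cells[2]} |")
    return "\n".join(lines)

# Placeholder strings; substitute the real model outputs.
demo = format_comparison_table(
    ["The cat sat on the mat."],
    ["<LSTM output>"],
    ["<Transformer output>"],
)
print(demo)
```

With the Part 4 pipeline, the Transformer column could be filled with `[translator(s, max_length=128)[0]["translation_text"] for s in test_sentences]`; paste the printed table into this text cell.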

## Part 6: Visualizing Attention Mechanisms

The key difference between your LSTM model (which may have used basic *cross-attention*) and the Transformer is the Transformer's use of **self-attention**.

* **Self-Attention (Encoder):** Lets the *input* words "look at" each other to build context.
* **Cross-Attention (Decoder):** Lets the *output* words "look at" the input words (the same role attention played in your LSTM model, if it used attention).
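
Both mechanisms in the list above use the same operation, scaled dot-product attention; they differ only in where the queries, keys, and values come from. The following is a minimal plain-Python sketch with toy 2-dimensional vectors. A real Transformer additionally applies learned projections and runs several heads in parallel, which this sketch omits.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (weights, outputs) as nested lists."""
    d = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))])
    return weights, outputs

# Toy "input word" vectors (made up for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: queries, keys, and values all come from the same sequence.
self_weights, _ = attention(tokens, tokens, tokens)

# Cross-attention: queries come from the target side, keys/values from the source.
targets = [[0.0, 2.0]]
cross_weights, _ = attention(targets, tokens, tokens)

print(self_weights)
print(cross_weights)
```

Each row of weights sums to 1, so a row can be read as how strongly one token attends to every token in the other sequence. These rows are what the attention heatmaps display.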

### 1. Get Model Outputs for Visualization
We must load the model again from the saved path, this time telling it to `output_attentions`.

### 2. Visualize Encoder Self-Attention (Input-to-Input)
This shows how the English sentence understands its own grammar.

**Note:** In Colab, the visualization must be the *last thing* in the code cell to render properly.

### 3. Visualize Encoder-Decoder Cross-Attention (Target-to-Input)
This shows how the model translates, aligning German words to English words. This is the part analogous to your LSTM's attention.
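
As a rough illustration of reading an alignment out of cross-attention, the sketch below averages one layer's weights over heads and picks, for each target token, the source token with the highest weight. The tokens and numbers are toy values invented for this example; with the real model, the nested list would come from one layer of the `cross_attentions` output. Attention weights are only a heuristic signal and do not always correspond to true word alignments.

```python
def average_heads(layer_attn):
    """Average attention over heads. layer_attn has shape (num_heads, tgt_len, src_len)."""
    num_heads = len(layer_attn)
    tgt_len, src_len = len(layer_attn[0]), len(layer_attn[0][0])
    return [
        [sum(layer_attn[h][t][s] for h in range(num_heads)) / num_heads for s in range(src_len)]
        for t in range(tgt_len)
    ]

def align(layer_attn, src_tokens, tgt_tokens):
    """Pair each target token with its most-attended source token."""
    averaged = average_heads(layer_attn)
    pairs = []
    for t, row in enumerate(averaged):
        best = max(range(len(row)), key=row.__getitem__)
        pairs.append((tgt_tokens[t], src_tokens[best]))
    return pairs

# Toy example: 2 heads, 3 target tokens, 3 source tokens (values are made up).
src_tokens = ["The", "cat", "sat"]
tgt_tokens = ["Die", "Katze", "saß"]
toy_attn = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
]
alignment = align(toy_attn, src_tokens, tgt_tokens)
print(alignment)
```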

In [1]:
# Colab cell (code)
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# This cell may run in a fresh session, so redefine the path used in Part 3
output_dir = "./s2s-eng-ger-opusmt"

# Reload the fine-tuned model and tokenizer from the saved path.
# Eager attention is requested so that attention weights can be returned.
tokenizer = AutoTokenizer.from_pretrained(output_dir)
model_attn = AutoModelForSeq2SeqLM.from_pretrained(output_dir, attn_implementation="eager")
model_attn.eval()

# Tokenize a sentence
src = "The cat sat on the mat."
inputs = tokenizer(src, return_tensors="pt")

# Generate a translation; with return_dict_in_generate=True the result exposes
# `sequences` plus per-step attention tuples when output_attentions=True
gen = model_attn.generate(
    **inputs,
    num_beams=4,
    max_length=60,
    return_dict_in_generate=True,
    output_attentions=True,
)
print("Generated:", tokenizer.batch_decode(gen.sequences, skip_special_tokens=True)[0])

# For visualization, a simpler route is a single forward pass that requests attentions
with torch.no_grad():
    encoder_outputs = model_attn.get_encoder()(**inputs, output_attentions=True)
    # one tensor per layer, each of shape (batch, num_heads, seq_len, seq_len)
    encoder_self_attns = encoder_outputs.attentions
    print("Number of encoder layers:", len(encoder_self_attns))
    print("Encoder attn shape (layer0):", encoder_self_attns[0].shape)

# For cross-attention (teacher forcing): tokenize a German sentence as the target and
# pass it as `labels`; the model shifts it right internally to build decoder_input_ids
target = tokenizer(text_target="Das Kätzchen saß auf der Matte.", return_tensors="pt")
with torch.no_grad():
    out = model_attn(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        labels=target["input_ids"],
        output_attentions=True,
        return_dict=True,
    )
    # both fields are tuples with one tensor per decoder layer when attentions are returned
    print("Decoder attentions present:", out.decoder_attentions is not None)
    print("Cross attentions present:", out.cross_attentions is not None)