# Implementing Transformer Models
## Practical X
Carel van Niekerk & Hsien-Chin Lin

12-16.01.2026

---

In this practical we will evaluate the performance of the transformer model we trained.

### 1. Autoregressive Generation

To generate a translation we use the autoregressive property of the transformer decoder: each output token is predicted from the source sentence and the tokens generated so far. The procedure is:

1. Encode the source sentence using the encoder.
2. Condition the decoder on the encoder output (through cross-attention).
3. Generate the first token of the translation by passing the start-of-text token through the decoder.
4. Append the generated token to the decoder input, run the decoder again to generate the next token, and repeat until the end-of-text token is generated.
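
The steps above can be sketched as a short loop. This is a minimal, framework-agnostic sketch, not the reference solution: the names `generate`, `encode`, `decode_step` and `select` are hypothetical, and the toy functions stand in for a trained model. In your own code, `encode` and `decode_step` would wrap your transformer's encoder and decoder.

```python
from typing import Callable, List, Sequence


def generate(
    encode: Callable[[Sequence[int]], object],
    decode_step: Callable[[object, List[int]], Sequence[float]],
    select: Callable[[Sequence[float]], int],
    source_ids: Sequence[int],
    bos_id: int,
    eos_id: int,
    max_len: int = 100,
) -> List[int]:
    """Autoregressive generation loop (hypothetical interface, for illustration)."""
    memory = encode(source_ids)  # step 1: run the encoder once
    output = [bos_id]            # step 3: start from the start-of-text token
    # max_len counts the start-of-text token and guarantees termination.
    while len(output) < max_len:
        # steps 2 and 4: the decoder sees the encoder output and the prefix so far
        scores = decode_step(memory, output)
        next_id = select(scores)
        output.append(next_id)
        if next_id == eos_id:
            break
    return output


# Toy stand-ins for a trained model (assumptions, only here to make the sketch runnable).
def toy_encode(source_ids: Sequence[int]) -> List[int]:
    return list(source_ids)


def toy_decode_step(memory: object, prefix: List[int]) -> List[float]:
    vocab_size = 6
    nxt = min(prefix[-1] + 1, vocab_size - 1)  # always predicts "last token + 1"
    return [1.0 if i == nxt else 0.0 for i in range(vocab_size)]


def argmax(scores: Sequence[float]) -> int:
    return max(range(len(scores)), key=lambda i: scores[i])


translation = generate(toy_encode, toy_decode_step, argmax, [7, 8, 9], bos_id=0, eos_id=4)
truncated = generate(toy_encode, toy_decode_step, argmax, [7, 8, 9], bos_id=0, eos_id=4, max_len=3)
```

Passing the selection rule in as `select` keeps the loop independent of the decoding strategy, so the same loop could later be reused with sampling instead of an argmax.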

#### 1.1. Greedy Decoding

The simplest way to generate a translation is to use greedy decoding. In greedy decoding we simply select the token with the highest probability at each step.
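
As a small illustration of the selection step, the sketch below picks the highest-scoring token from a made-up score vector. The function names and the example logits are assumptions for illustration only. Because softmax preserves the ordering of its inputs, taking the argmax of the raw logits gives the same token as taking the argmax of the probabilities.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_select(scores: Sequence[float]) -> int:
    """Return the index (token id) with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical decoder output for one step over a 4-token vocabulary.
logits = [0.1, 2.5, -1.0, 0.7]
probs = softmax(logits)

token_from_logits = greedy_select(logits)
token_from_probs = greedy_select(probs)
```

In practice this means the softmax can be skipped during greedy decoding, since only the position of the maximum matters.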

### 2. Evaluation

To evaluate the performance of the model we will use the BLEU score. BLEU measures how closely a generated translation matches one or more reference translations, based on overlapping n-grams and a penalty for translations that are too short. See the [Hugging Face Evaluate documentation](https://huggingface.co/spaces/evaluate-metric/bleu) for more information on the BLEU score, as well as details on using the metric in Hugging Face Evaluate.
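
To make the idea behind the metric concrete, here is a simplified pure-Python sketch of corpus-level BLEU. It is an illustration under stated assumptions (whitespace tokenization, exactly one reference per prediction, uniform weights over 1- to 4-grams, no smoothing) and not a replacement for the Hugging Face Evaluate implementation, which you should use for the exercises. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions: List[str], references: List[str], max_n: int = 4) -> float:
    """Simplified corpus BLEU in [0, 1] with a single reference per prediction."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.split(), ref.split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            p_counts, r_counts = ngram_counts(p, n), ngram_counts(r, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a score of 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize predictions that are shorter than the references.
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)


perfect = corpus_bleu(["It was a gesture that ended a crisis ."],
                      ["It was a gesture that ended a crisis ."])
partial = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
failed = corpus_bleu(["Double packing"], ["Construction sites coming as a twinpack"])
```

This sketch returns a value between 0 and 1; multiply by 100 for the percentage-style scale that BLEU scores are often reported on.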



# Exercises

1. Implement the autoregressive generation procedure described above for your transformer model, using greedy decoding. Remember to enforce a maximum generation length so that generation terminates even if the end-of-text token is never produced.
2. Generate translations for the test set (or a subset of the test set) of WMT17 German-English.
3. Evaluate the BLEU score of your model on the test set (or a subset of the test set) of WMT17 German-English.
4. Evaluate some of the translations generated by your model. Do they make sense? What are some of the errors made by your model?

## Exercise 4: Evaluation of Model Translations

### Overall Performance
The model achieves a BLEU score of **26.99** (MHA baseline) and **27.05** (GQA variant) on the WMT17 German-English test set, which is competitive for a from-scratch implementation.

### Translation Quality Analysis

**Excellent Translations (100 BLEU):**
Many short, straightforward sentences are translated perfectly:
- "Es war eine Geste, die eine Krise beendete." → "It was a gesture that ended a crisis." ✓
- "Vergessen Sie den Druck." → "Forget the pressure." ✓
- "Vergessen Sie den Hype." → "Forget the hype." ✓

### Common Error Patterns

**1. Repetition/Degeneration on Long Sentences:**
Some complex sentences cause the model to enter repetitive loops:
- Source: "Einheimische betrauerten es als letzten Verlust in einer sich gentrifizierenden Stadt."
- Prediction: "In the early days... to be a city to be a city to be a city..."
- This may indicate attention degradation on longer sequences, possibly related to the limited maximum sequence length (100 tokens). Greedy decoding is also known to be prone to this kind of repetition loop.

**2. Untranslated Words (OOV/Rare Terms):**
Domain-specific or rare German words sometimes appear untranslated:
- "Leichnam" (corpse) left as-is in output
- This suggests the BPE tokenizer may split rare words into suboptimal subwords

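**3. Word Sense Errors:**
- "Koch" is sometimes translated as "kitchen" (wrong word) or as "cook" where "chef" was expected
- "gezogen" is translated as "drawn" instead of "moved": "ziehen" can mean both "to pull/draw" and "to move (house)", and the model picks the wrong sense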

**4. Structural/Syntactic Issues:**
Complex relative clauses can be mangled:
- Source: "der vor kurzem nach San Francisco gezogen ist"
- MHA: "recently drawn to San Francisco" (wrong verb, missing relative pronoun)
- GQA: "who recently moved to San Francisco" (correct!)

**5. Complete Failures (0 BLEU):**
Very short or idiomatic phrases sometimes fail completely:
- "Baugrund im Doppelpack" → "Double packing" (should be: "Construction sites coming as a twinpack")
- "Einer für alle Fälle" → "One case by case" (should be: "Something for every situation")

### Observations
1. **GQA often outperforms MHA** on fluency in the inspected examples, despite having fewer parameters; the difference in BLEU (27.05 vs. 26.99) is small
2. **Short sentences** are handled very well (often perfect)
3. **Long sentences** with complex structure are the main failure mode
4. **Domain-specific terminology** remains challenging
5. The model successfully learns German-to-English word order changes (e.g. moving the clause-final verb of a German subordinate clause into English SVO order)