# HuggingFace Translation Tutorial - Modified for the NLP Course
Made by: Paul Falkenstein, Justin Groh, Sandro Paval, Peter Preinesberger

What we changed from the baseline notebook:
- made it runnable (the HF version was broken because of deprecations)
- added explanations within the notebook itself
- changed the task to English-to-German translation

Make sure to select a runtime with a GPU (e.g. T4) in Colab, or be prepared to wait several years :)

Install the Transformers, Datasets, and Evaluate libraries from HF to run this notebook.

Note: we pin datasets==3.6.0 because the kde4 dataset we use relies on a dataset script, and support for dataset scripts was removed in newer versions of datasets.

In [None]:
!pip install datasets==3.6.0 evaluate transformers[sentencepiece]
!pip install accelerate

Collecting datasets==3.6.0
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: datasets, evaluate
  Attempting uninstall: datasets
    Found existing installation: datasets 4.0.0
    Uninstalling datasets-4.0.0:
      Successfully uninstalled datasets-4.0.0
Successfully installed datasets-3.6.0 evaluate-0.4.6


## The KDE4 Dataset
KDE is a suite of desktop applications (mostly for the *nix family of OSs), and all of its UI strings (tooltips, button texts, application names, help and documentation) have to be available in a wide range of languages.

This dataset consists of exactly these strings in several languages. For this notebook we'll be working with the German-English language pair, translating from English to German.

As we will see, certain words mean different things specifically in the UI context, and the pretrained model we start with fails to produce good translations. By fine-tuning on this dataset, performance for UI translation can be improved.

We load the KDE4 dataset from HF:

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("kde4", lang1="de", lang2="en")



README.md: 0.00B [00:00, ?B/s]

kde4.py: 0.00B [00:00, ?B/s]

The repository for kde4 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/kde4.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/7.62M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]
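
The loader output above mentions that the confirmation prompt can be skipped by passing `trust_remote_code=True`. A minimal sketch of that, assuming the pinned datasets==3.6.0 (the helper name `load_kde4_non_interactive` is ours and purely illustrative):

```python
def load_kde4_non_interactive():
    """Load the German-English KDE4 pairs without the interactive [y/N] prompt.

    Hypothetical helper for illustration. It assumes datasets==3.6.0, where
    dataset scripts and the trust_remote_code argument are still supported.
    """
    # Imported inside the helper so that defining it does not require
    # the library to be installed or trigger any download.
    from datasets import load_dataset

    return load_dataset("kde4", lang1="de", lang2="en", trust_remote_code=True)
```

Calling `raw_datasets = load_kde4_non_interactive()` should then behave like the cell above, minus the prompt. Only do this for dataset repositories you trust, since the repository's loading script is executed without asking.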

The full dataset is too large for the purposes of this tutorial (training on it would take too long):

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 224035
    })
})

In [None]:
raw_datasets["train"][0]

{'id': '0', 'translation': {'de': 'Lauri Watts', 'en': 'Lauri Watts'}}

So we drop strings whose English text is 20 characters or shorter and randomly select 15000 of the remaining longer strings:

In [None]:
def filter_fn(example):
    # Keep only examples whose English string is longer than 20 characters
    return len(example["translation"]["en"]) > 20

filtered_dataset = raw_datasets["train"].filter(filter_fn).shuffle(seed=42).select(range(15000))
filtered_dataset

Filter:   0%|          | 0/224035 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'translation'],
    num_rows: 15000
})
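
To make the filter, shuffle, select chain above more concrete, here is a rough pure-Python sketch of the same idea on a handful of toy rows. The rows and the helper names (`keep_long`, `subsample`) are made up for illustration; the real `datasets` methods work on Arrow tables and will not produce the same shuffle order.

```python
import random

# Toy stand-ins for KDE4 rows (invented for illustration, not taken from the dataset)
rows = [
    {"id": "0", "translation": {"en": "Lauri Watts", "de": "Lauri Watts"}},
    {"id": "1", "translation": {"en": "Cancel", "de": "Abbrechen"}},
    {"id": "2", "translation": {"en": "Open File", "de": "Datei öffnen"}},
    {"id": "3", "translation": {"en": "Run Operation for Current Item",
                                "de": "Operation für aktuellen Eintrag ausführen"}},
    {"id": "4", "translation": {"en": "Save the current document to disk",
                                "de": "Das aktuelle Dokument auf der Festplatte speichern"}},
    {"id": "5", "translation": {"en": "Show the configuration dialog for this application",
                                "de": "Konfigurationsdialog für diese Anwendung anzeigen"}},
]


def keep_long(example, min_chars=20):
    # Same criterion as filter_fn: English string longer than min_chars characters
    return len(example["translation"]["en"]) > min_chars


def subsample(examples, n, seed=42):
    # 1) filter, 2) shuffle with a fixed seed, 3) take the first n examples
    kept = [ex for ex in examples if keep_long(ex)]
    random.Random(seed).shuffle(kept)
    return kept[:n]


sample = subsample(rows, n=2)
print([ex["id"] for ex in sample])
```

Fixing the seed makes the subsample reproducible, which is why the notebook passes `seed=42` to `shuffle`.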

## Dataset Processing
Nothing unusual here: we split the data into a training set and a validation set for later.

In [None]:
split_datasets = filtered_dataset.train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 13500
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 1500
    })
})

In [None]:
split_datasets["validation"] = split_datasets.pop("test")

## Peculiarities of German and English in UI Settings
Here is an example of where the pre-trained Opus-MT (an open-source translation model) falls short.

Load the Opus-MT English-to-German model from HF:

In [None]:
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-de"
translator = pipeline("translation", model=model_checkpoint)

config.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/298M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/298M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/768k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/797k [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

Device set to use cuda:0


Here's the example:
The English "Run Operation for Current Item"
gets translated by the pre-trained model to "Betrieb für aktuelles Objekt ausführen".

"Betrieb" in German does mean "Operation", but more in the sense of the "Economic Operation of a Company", and definitely not the intended "Operation run by a UI widget".

In [None]:
translator(
    "Run Operation for Current Item"
)

[{'translation_text': 'Betrieb für aktuelles Objekt ausführen'}]

## Setting up Tokenization and Preprocessing
We just use the tokenizer that comes with the pre-trained model here, nothing special.

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, return_tensors="pt")

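For translation tasks, there are two different tokenizers: one for the source language and one for the target language.

We pass the (input, target) strings to the tokenizer jointly. With `text_target`, it tokenizes the targets with the target-language tokenizer and stores them as `labels`, so this is taken care of automatically.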

In [None]:
max_length = 128

def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["de"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

Finally, we apply the tokenization over our subsampled dataset:

In [None]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

Map:   0%|          | 0/13500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

## The Training Loop

In this section, we'll be using a bog-standard PyTorch training loop and AdamW to fine-tune our language model on our tokenized dataset. As you will see, HF does a lot of abstraction for us, so this is pretty easy!

First, we load our pretrained model checkpoint as a PyTorch-compatible object.

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

A data collator takes care of the necessary within-batch padding of the tokenized input sequences, as sequences within a batch need to be of the same length.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
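
To illustrate what within-batch padding means, here is a simplified pure-Python sketch. It is not the actual `DataCollatorForSeq2Seq` implementation: the real collator returns tensors and, when given the model, can also prepare decoder inputs. The function name `pad_batch` and the pad token id `0` are assumptions chosen for this example; `-100` is the label value that the loss ignores.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Right-pad every sequence in a batch to the longest one in that batch."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_tokens = len(f["input_ids"])
        n_pad = max_input_len - n_tokens
        # Inputs are padded with the tokenizer's pad token ...
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # ... and the attention mask marks real tokens (1) vs. padding (0).
        batch["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Labels are padded with -100 so padded positions do not count in the loss.
        n_label_pad = max_label_len - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_id] * n_label_pad)
    return batch


# Two toy examples with different lengths (token ids are made up)
features = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10, 11]},
]
batch = pad_batch(features, pad_token_id=0)
print(batch["input_ids"])       # [[5, 6, 7], [5, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[8, 9, -100, -100], [8, 9, 10, 11]]
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.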

The BLEU score is used to evaluate translation quality during the validation phase.

In [None]:
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.2.0-py3-none-any.whl (22 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-3.2.0 sacrebleu-2.5.1


In [None]:
import evaluate

metric = evaluate.load("sacrebleu")

Downloading builder script: 0.00B [00:00, ?B/s]
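
For intuition about what a BLEU score measures, here is a heavily simplified sketch: clipped n-gram precision (up to 4-grams), combined by a geometric mean and multiplied by a brevity penalty. This is not sacrebleu. It assumes whitespace tokenization, a single reference per hypothesis, and no smoothing, so its numbers will not match the metric loaded above. The name `toy_bleu` is ours.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    # Count all contiguous n-grams in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (one reference per hypothesis)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # "Clipped" matches: an n-gram counts at most as often as it occurs in the reference
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        # Without smoothing, a single n-gram order with no matches gives a score of 0
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = "Operation für aktuellen Eintrag ausführen"
pretrained = "Betrieb für aktuelles Objekt ausführen"
print(toy_bleu([reference], [reference]))            # identical strings: 100.0
print(toy_bleu([pretrained], [reference]))           # no matching bigrams: 0.0
print(toy_bleu([pretrained], [reference], max_n=1))  # 2 of 5 unigrams match
```

The second call shows why BLEU is usually reported over a whole corpus: on a single short sentence, one missing n-gram order zeroes the score.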

Now we just set up some standard PyTorch DataLoaders ...

In [None]:
from torch.utils.data import DataLoader

tokenized_datasets.set_format("torch")
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

The Accelerator (from the accelerate library) would help if you ran this on multiple GPUs or a GPU cluster (which we don't, so it isn't strictly necessary here).

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Double check that you are using a GPU here.

In [None]:
accelerator.device

device(type='cuda')

We use a linear LR decay, and 3 training epochs, as in the original version of this notebook.

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
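
As a sanity check on the numbers involved, the following sketch recomputes the total number of training steps and the learning rate that a linear decay without warmup would yield at a few points. It is a plain-Python approximation of the schedule, assuming a single process (so the dataloader is not sharded) and the values used in this notebook: 13500 training examples, batch size 8, 3 epochs, base learning rate 2e-5. The helper names are ours.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    # The last, smaller batch still counts as one step
    return math.ceil(num_examples / batch_size)


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer steps for a linear warmup + linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_training_steps = 3 * steps_per_epoch(13500, 8)
print(num_training_steps)  # 5064

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, base_lr=2e-5, num_training_steps=num_training_steps))
```

The learning rate starts at 2e-5, reaches half of that at the midpoint, and hits zero on the final step.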

Folder to save the model to locally.

In [None]:
output_dir = "kde4-en-to-de-accelerate"

A function that turns network outputs and labels (both still tokenized) into strings that the BLEU metric can compare.

This will be used in the validation loop to track progress on the translation task.

In [None]:
import numpy as np
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
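
The two less obvious steps in `postprocess` are replacing `-100` before decoding and wrapping each reference in its own list. The sketch below mimics both with plain lists and a made-up four-word vocabulary; `toy_decode` is a hypothetical stand-in for `tokenizer.batch_decode`, and the pad id `0` is an assumption for this example.

```python
IGNORE_INDEX = -100  # label value used for padded positions


def unmask_labels(label_rows, pad_token_id):
    # -100 is not a valid token id, so swap it for the pad token before decoding
    return [
        [pad_token_id if token == IGNORE_INDEX else token for token in row]
        for row in label_rows
    ]


def toy_decode(row, vocab, pad_token_id):
    # Stand-in for a tokenizer decode call that skips special (pad) tokens
    return " ".join(vocab[token] for token in row if token != pad_token_id)


vocab = {0: "<pad>", 1: "Datei", 2: "öffnen", 3: "Abbrechen"}
labels = [[1, 2], [3, -100]]

clean_labels = unmask_labels(labels, pad_token_id=0)
# One list of references per prediction, because the metric allows several references
references = [[toy_decode(row, vocab, 0).strip()] for row in clean_labels]
print(clean_labels)  # [[1, 2], [3, 0]]
print(references)    # [['Datei öffnen'], ['Abbrechen']]
```

Predictions stay a flat list of strings, while references are nested one level deeper.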

Your standard PyTorch training loop:

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training, really nothing special here
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        # generate a translated sequence for the current batch
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save locally
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

  0%|          | 0/5064 [00:00<?, ?it/s]

  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)


  0%|          | 0/188 [00:00<?, ?it/s]

epoch 0, BLEU score: 30.96




  0%|          | 0/188 [00:00<?, ?it/s]

epoch 1, BLEU score: 31.05


  0%|          | 0/188 [00:00<?, ?it/s]

epoch 2, BLEU score: 31.27


We now check whether fine-tuning has improved the translation quality, using a slight variation of our earlier example:

In [None]:
from transformers import pipeline

translator = pipeline("translation", model=output_dir)

Device set to use cuda:0


In [None]:
translator(
    "Run Operation for Previous Item"
)

[{'translation_text': 'Operation für den vorherigen Eintrag ausführen'}]

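For comparison, "Operation für aktuellen Eintrag ausführen" is the more correct translation of the original example, "Run Operation for Current Item".

Although the input here differs slightly ("Previous" instead of "Current"), the fine-tuned model now renders "Operation" as "Operation" rather than "Betrieb", and "Item" as "Eintrag" rather than "Objekt". The model has evidently been somewhat adapted to the UI domain.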