# Machine Translation with mBART-50

**Run this notebook on the [GPU Hub](https://gpuhub.labservices.ch/) or [Google Colab](https://colab.research.google.com/) to make use of a GPU for faster inference.**

In this exercise, we use the fine-tuned mBART-50 model to translate sentences from the transcripts of European Parliament debates.

In [1]:
%pip install -q tqdm
%pip install -q torch
%pip install -q protobuf
%pip install -q sentencepiece
%pip install -q transformers
%pip install -q evaluate

Note: you may need to restart the kernel to use updated packages.


In [2]:
import tqdm
import torch
import evaluate

## Data

The European Parliament has a nice corpus of parallel sentences from its proceedings that is [open-sourced](https://statmt.org/europarl/). We could download the German-English transcript pair and extract the files as shown below. Instead, we will use the shortened files on Ilias.

In [3]:
# !wget -N https://statmt.org/europarl/v7/de-en.tgz
# !tar xzf de-en.tgz

The files have one sentence per line. Each line in one file corresponds to the same line in the other file, that is, the files are *parallel*. If a line in one file is empty, there is no corresponding translation in the other file (for example, see line 22).

The corpus website recommends removing pairs in which one of the lines is empty. It also suggests removing lines that contain XML tags (starting with "<").

In [4]:
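def read_parallel_sentences(path1, path2, lines_to_read=200):
    """Reads parallel sentences from two line-aligned files.
    Skips pairs where either line is empty and pairs where either
    line starts with an XML tag, and returns at most `lines_to_read`
    of the remaining pairs."""
    # Europarl v7 is distributed as UTF-8; be explicit so the platform default does not matter.
    with open(path1, 'r', encoding='utf-8') as f:
        lines1 = f.readlines()
    with open(path2, 'r', encoding='utf-8') as f:
        lines2 = f.readlines()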

    sentences1, sentences2 = [], []
    for line1, line2 in zip(lines1, lines2):
        sent1 = line1.strip()
        sent2 = line2.strip()
        if not sent1 or not sent2:
            continue
        if sent1.startswith('<') or sent2.startswith('<'):
            continue
        sentences1.append(sent1)
        sentences2.append(sent2)
        if len(sentences1) >= lines_to_read:
            break
    return sentences1, sentences2

sents_en, sents_de = read_parallel_sentences('europarl-v7.de-en.en.txt', 'europarl-v7.de-en.de.txt')

## Model

We load the `facebook/mbart-large-50-many-to-many-mmt` model from the [Hugging Face model hub](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt).

In [5]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
model.eval()  # put the model into evaluation mode
if torch.cuda.is_available():
    model.to('cuda')  # move the model to GPU
elif torch.backends.mps.is_available():
    model.to('mps')  # move the model to MPS (Mac only)



## Translate

We now translate the sentences in both directions. Adapt the example from the documentation on the model hub. Call the generate function to output translations.

In [6]:
def translate(sentence, from_code, to_code):
    """Translates `sentence` into target language."""
    tokenizer.src_lang = from_code
    # Tokenize and move the inputs to whichever device the model is on (CPU, CUDA or MPS).
    inputs = tokenizer(sentence, return_tensors='pt').to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, forced_bos_token_id=tokenizer.lang_code_to_id[to_code], max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

translated_to_de = [translate(sent_en, 'en_XX', 'de_DE') for sent_en in tqdm.tqdm(sents_en)]
translated_to_en = [translate(sent_de, 'de_DE', 'en_XX') for sent_de in tqdm.tqdm(sents_de)]

100%|██████████| 200/200 [04:43<00:00,  1.42s/it]
100%|██████████| 200/200 [04:14<00:00,  1.27s/it]
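
Translating one sentence per `generate` call took more than four minutes per direction in the run above. One possible way to speed this up is to translate several sentences per call. The following is a minimal sketch under that assumption, not a benchmarked replacement: `chunks` and `translate_batched` are hypothetical helper names, the batch size of 8 is arbitrary, and the tokenizer and model are passed in as arguments rather than read from globals. Because padded batches are processed together, individual translations may differ slightly from the single-sentence results.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batched(sentences, from_code, to_code, tokenizer, model, batch_size=8):
    """Translate a list of sentences, `batch_size` sentences per generate() call."""
    tokenizer.src_lang = from_code
    translations = []
    for batch in chunks(sentences, batch_size):
        # padding=True pads all sentences in the batch to a common length
        inputs = tokenizer(batch, return_tensors='pt', padding=True).to(model.device)
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.lang_code_to_id[to_code],
            max_new_tokens=200,
        )
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        translations.extend(text.strip() for text in decoded)
    return translations
```

With the objects loaded earlier, a call would look like `translate_batched(sents_en, 'en_XX', 'de_DE', tokenizer, model)`.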


In [7]:
print(translated_to_en[:5])

['Resumption of the session', 'I declare resumed the session of the European Parliament adjourned on Friday 17 December, wish you all the best for the new year and hope you have had a good holiday.', "As you can see, the feared 'Millennium bug' has not occurred, but the citizens of some of our Member States have become victims of terrible natural disasters.", 'There is a desire in Parliament to have a debate during this part-session in the coming days.', 'Today, I would like to ask you - and this is also the wish of some of my fellow Members - to remember all the victims of the storms, particularly in the various countries of the European Union, in a moment of silence.']


## Evaluation

We evaluate the translations against the references with the BLEU score, which is standard in machine translation. We use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu) from Hugging Face's [evaluate library](https://github.com/huggingface/evaluate).

In [8]:
references_de = [[ref] for ref in sents_de]
references_en = [[ref] for ref in sents_en]

bleu = evaluate.load('bleu')
results_en_to_de = bleu.compute(predictions=translated_to_de, references=references_de)
results_de_to_en = bleu.compute(predictions=translated_to_en, references=references_en)
print(results_en_to_de.keys())

print(f"Results translating English to German: {results_en_to_de['bleu']:.4f}")
print(f"Results translating German to English: {results_de_to_en['bleu']:.4f}")

dict_keys(['bleu', 'precisions', 'brevity_penalty', 'length_ratio', 'translation_length', 'reference_length'])
Results translating English to German: 0.2335
Results translating German to English: 0.3282
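
For reference, the keys printed above map onto the terms of the standard BLEU definition. In the formula below, $p_n$ are the modified n-gram precisions (`precisions`), $c$ is the `translation_length`, $r$ is the `reference_length`, and $\mathrm{BP}$ is the `brevity_penalty`. The formula assumes the common setting of $N = 4$ with uniform weights $w_n = 1/4$, which we take to be the metric's default here.

$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

The `length_ratio` key is $c / r$, so the brevity penalty only takes effect when the translations are, in total, shorter than the references.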


Interestingly, the model scores higher when translating into English than into German (this holds for other languages as well). A likely factor is that mBART-50 has seen more English data from which to learn the structure of the language. Additionally, since BLEU counts exact n-gram matches between the translation and the reference, German is a harder target than English: because of noun compounds and richer morphology (e.g. conjugation of verbs), a word with a different ending or a differently split compound counts as a complete miss.
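
To make the point about exact matches concrete, here is a minimal sketch of the modified n-gram precision that BLEU is built on. It is a simplification: it splits on whitespace instead of using the metric's tokenizer, handles a single reference only, and the function names and example sentences are made up for illustration rather than taken from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(prediction, reference, n):
    """Share of the prediction's n-grams that also occur in the reference.
    Each n-gram is credited at most as often as it appears in the reference."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return clipped / total


# Illustrative sentence pairs (not from the corpus)
en_ref = "the victims of terrible natural disasters"
en_pred = "the victims of terrible natural catastrophes"
de_ref = "die Opfer schrecklicher Naturkatastrophen"
de_pred = "die Opfer von schrecklichen Naturkatastrophen"

for n in (1, 2):
    print(f"{n}-gram precision EN: {modified_precision(en_pred, en_ref, n):.2f}")
    print(f"{n}-gram precision DE: {modified_precision(de_pred, de_ref, n):.2f}")
```

Both predictions are acceptable paraphrases, but the German one loses more: the inserted preposition and the different adjective ending break three of its four bigrams, whereas the English synonym breaks only one of five.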