# About this notebook

In this notebook, we will explore Neural Machine Translation (NMT) on the English-to-German (en→de) direction of the WMT16 dataset from Hugging Face. We translate test sentences with two pretrained models, Facebook's multilingual `mbart-large-50-many-to-many-mmt` and the bilingual `Helsinki-NLP/opus-mt-en-de` (OPUS-MT), and compare their output with two automatic metrics: Google BLEU (a variant of BLEU) and METEOR.

__Key points:__

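- Facebook mbart-large-50-many-to-many-mmt model: A pre-trained multilingual model designed for machine translation, able to translate directly between any pair of its 50 supported languages.
- Helsinki-NLP opus-mt-en-de model (OPUS-MT): A much smaller MarianMT model trained only for English-to-German translation, used here as a point of comparison.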
- WMT16 dataset from Hugging Face: A benchmark dataset for machine translation, including parallel corpora for various language pairs.
- English-German (en-de) translation task: The primary focus of this notebook. We translate English source sentences into German using the `de-en` configuration of WMT16, which stores both sides of each sentence pair.
- Evaluation metrics:
  - BLEU (Bilingual Evaluation Understudy): A popular machine translation metric that measures the similarity between the model's output and the reference translation using n-gram precision and a brevity penalty. This notebook uses the Google BLEU variant available in Hugging Face `evaluate`.
  - METEOR (Metric for Evaluation of Translation with Explicit Ordering): Another metric for machine translation that considers both the precision and recall of the generated translation, also accounting for synonymy and word order differences.

By the end of this notebook, you will have hands-on experience running pretrained NMT models (mBART-50 and OPUS-MT) on WMT16 test sentences, scoring their translations with Google BLEU and METEOR, and interpreting the results.


# Imports

In [None]:
# Run the shared setup and helper scripts in this notebook's namespace
# (%run executes a file; %load would only paste its contents into the cell)
%run ../utils/setup.py
%run ../utils/utils.py

In [None]:
useGPU()

Have fun with this chapter!🥳


# Allocate enough RAM

Let us try to get a __GPU__ with at least __15 GB of memory__ for this notebook.

In [None]:
# Kill all processes to crash the Colab runtime; after reconnecting you may get a different machine -> uncomment to use
#!kill -9 -1

Run `!free -h` to check how much RAM is available and `!nvidia-smi` to see which GPU was assigned.
If the allocated GPU is too small, uncomment and run the cell above. It kills all processes and crashes the runtime; after reconnecting you may be assigned a better GPU, since Colab does not assign the same hardware every time.


In [None]:
!free -h

              total        used        free      shared  buff/cache   available
Mem:           83Gi       1.2Gi        77Gi       3.0Mi       4.3Gi        81Gi
Swap:            0B          0B          0B


In [None]:
!nvidia-smi

Wed Apr 12 16:42:39 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    46W / 400W |      3MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               

In [None]:
import torch

if torch.cuda.is_available():
    gpu_device = torch.device('cuda')
    gpu_info = torch.cuda.get_device_properties(gpu_device)
    gpu_memory = gpu_info.total_memory / 1e9  # Convert bytes to gigabytes
    print(f"GPU: {gpu_info.name}, Total Memory: {gpu_memory:.2f} GB")
else:
    print("No GPU detected.")


GPU: NVIDIA A100-SXM4-40GB, Total Memory: 42.48 GB


In [None]:

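import os

# Limit how large cached memory blocks may be split, to reduce fragmentation.
# The value uses PyTorch's "key:value" syntax. It likely only takes effect if set
# before PyTorch initializes CUDA (ideally before `import torch`).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"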


In [None]:
from transformers import (
    MarianMTModel, 
    MarianTokenizer, 
    MBartForConditionalGeneration, 
    MBart50TokenizerFast,
    pipeline,
    set_seed
)

import evaluate
from datasets import load_dataset
import sacrebleu
import logging
import pandas as pd
import torch
from textwrap import TextWrapper
import warnings 
warnings.filterwarnings('ignore')
set_seed(42)

## Set number of examples for translations

In [None]:
# Set number of examples
num_examples = 100


## Load dataset

The WMT16 dataset is part of the Workshop on Machine Translation (WMT) series, specifically from the 2016 edition. WMT is an annual conference that focuses on research in the field of machine translation and aims to evaluate and compare the performance of different machine translation systems.

The WMT16 dataset, available through the Hugging Face Datasets library, includes parallel corpora for the language pairs of the WMT 2016 news translation shared task. For German-English, the training split combines corpora such as Europarl, Common Crawl, and News Commentary, while the validation and test splits are the official newstest sets.

Some of the language pairs included in the WMT16 dataset are:

- English-German (en-de)
- English-Russian (en-ru)
- English-Czech (en-cs)
- English-Romanian (en-ro)
- English-Turkish (en-tr)

[Link to dataset on Hugging Face](https://huggingface.co/datasets/wmt16)
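In the `de-en` configuration, every row holds one sentence pair inside a `translation` dict keyed by language code, which is why the translation loop later reads `dataset[i]["translation"]["en"]` as the source and `["de"]` as the reference. The sketch below mimics that layout with plain Python dicts so it runs without downloading anything; the single row is copied from the first test example printed later in this notebook, and `split_pairs` is a hypothetical helper name. Since iterating a Hugging Face `Dataset` also yields dicts, the same helper should work on the loaded `dataset`.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the loaded Dataset: rows with the same
# {"translation": {"de": ..., "en": ...}} layout as the wmt16 "de-en" config.
rows: List[Dict[str, Dict[str, str]]] = [
    {"translation": {"de": "Obama empfängt Netanyahu", "en": "Obama receives Netanyahu"}},
]


def split_pairs(rows, src: str = "en", tgt: str = "de") -> Tuple[List[str], List[str]]:
    """Return parallel lists of source sentences and reference translations."""
    sources = [row["translation"][src] for row in rows]
    references = [row["translation"][tgt] for row in rows]
    return sources, references


sources, references = split_pairs(rows)
print(sources)     # ['Obama receives Netanyahu']
print(references)  # ['Obama empfängt Netanyahu']
```

The config name does not fix the direction: swapping `src` and `tgt` gives German-to-English pairs from the same rows.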

In [None]:
# Load the first `num_examples` sentence pairs from the WMT16 German-English test split
dataset = load_dataset("wmt16", "de-en", split=f"test[:{num_examples}]")


Downloading and preparing dataset wmt16/de-en to /root/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227...



Dataset wmt16 downloaded and prepared to /root/.cache/huggingface/datasets/wmt16/de-en/1.0.0/746749a11d25c02058042da7502d973ff410e73457f3d305fc1177dc0e8c4227. Subsequent calls will reuse this data.


In [None]:
dataset

Dataset({
    features: ['translation'],
    num_rows: 100
})

## Initialize the models

In [None]:
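# Run on the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize the mBART-50 model and tokenizer (source English, target German)
mbart_checkpoint = "facebook/mbart-large-50-many-to-many-mmt"
mbart_model = MBartForConditionalGeneration.from_pretrained(mbart_checkpoint).to(device)
mbart_tokenizer = MBart50TokenizerFast.from_pretrained(mbart_checkpoint, src_lang="en_XX", tgt_lang="de_DE")

# Initialize the MarianMT (OPUS-MT) model and tokenizer
opus_model_name = "Helsinki-NLP/opus-mt-en-de"
opus_tokenizer = MarianTokenizer.from_pretrained(opus_model_name)
opus_model = MarianMTModel.from_pretrained(opus_model_name).to(device)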



In [None]:
# Load the Google BLEU metric
bleu = evaluate.load("google_bleu")


## Loop over translations

In [None]:
# Silence transformers warnings before generating (setting the level after the loop has no effect on it)
logging.getLogger("transformers").setLevel(logging.ERROR)

translations = {'mBART-50': [], 'OPUS-MT': []}

# ID of the German language token that mBART-50 must emit first to translate into German
de_token_id = mbart_tokenizer.convert_tokens_to_ids("de_DE")

# Loop over the chosen examples in the dataset and translate them using both models
for i in range(num_examples):
    input_text = dataset[i]["translation"]["en"]   # English source sentence
    target_text = dataset[i]["translation"]["de"]  # German reference translation

    # Translate using mBART-50 (inputs are moved to the model's device, CPU or GPU)
    encoded = mbart_tokenizer(input_text, return_tensors="pt").to(mbart_model.device)
    generated_tokens = mbart_model.generate(**encoded, forced_bos_token_id=de_token_id)
    output_text_mbart = mbart_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    translations['mBART-50'].append(output_text_mbart)

    # Translate using OPUS-MT
    encoded = opus_tokenizer(input_text, return_tensors="pt").to(opus_model.device)
    generated_tokens = opus_model.generate(**encoded)
    output_text_opus = opus_tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    translations['OPUS-MT'].append(output_text_opus)

    print(f"\033[1m\nExample {i + 1}\033[0m")
    print(f"\033[1mInput:\033[0m {input_text}")
    print(f"\033[1mGround truth:\033[0m {target_text}")
    print("==="*23)
    print("\033[1mModel: OPUS-MT\033[0m")
    print(f"Output: {output_text_opus}")
    print("---"*23)
    print("\033[1mModel: mBART-50\033[0m")
    print(f"Output: {output_text_mbart}\n")
    print("###"*23)

Example 1
Input: Obama receives Netanyahu
Ground truth: Obama empfängt Netanyahu
Model: OPUS-MT
Output: Obama erhält Netanjahu
---------------------------------------------------------------------
Model: mBART-50
Output: Obama erhält Netanyahu

#####################################################################

Example 2
Input: The relationship between Obama and Netanyahu is not
exactly friendly.
Ground truth: Das Verhältnis zwischen Obama und Netanyahu ist
nicht gerade freundschaftlich.
Model: OPUS-MT
Output: Die Beziehung zwischen Obama und Netanjahu ist nicht gerade
freundlich.
---------------------------------------------------------------------
Model: mBART-50
Output: Die Beziehung zwischen Obama und Netanjahu ist nicht gerade
freundlich.

#####################################################################

Example 3
Input: The two wanted to talk about the implementation of the
in
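The loop above calls `generate` once per sentence, which is simple but slow for larger values of `num_examples`. A common alternative is to translate in batches. The sketch below shows only the batching logic in plain Python; `batched`, `translate_all`, `mbart_batch`, and `sources` (a list of English sentences) are hypothetical names, and the commented-out `mbart_batch` assumes the `mbart_model` and `mbart_tokenizer` objects loaded above. Batched decoding with padding can occasionally produce slightly different translations than one-sentence-at-a-time decoding.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_all(
    sources: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Translate `sources` batch by batch, keeping the original order."""
    outputs: List[str] = []
    for batch in batched(sources, batch_size):
        outputs.extend(translate_batch(batch))
    return outputs


# With the models loaded earlier, a batch function for mBART-50 could look like this
# (hypothetical helper, not executed here; padding=True pads shorter sentences in a batch):
#
# def mbart_batch(texts):
#     enc = mbart_tokenizer(texts, return_tensors="pt", padding=True).to(mbart_model.device)
#     out = mbart_model.generate(**enc, forced_bos_token_id=mbart_tokenizer.convert_tokens_to_ids("de_DE"))
#     return mbart_tokenizer.batch_decode(out, skip_special_tokens=True)
#
# translations["mBART-50"] = translate_all(sources, mbart_batch, batch_size=16)

# Toy "translator" that upper-cases each sentence, to show that order is preserved
print(translate_all(["a", "b", "c"], lambda batch: [s.upper() for s in batch], batch_size=2))
# ['A', 'B', 'C']
```

Keeping the model call behind a `translate_batch` callable lets the same loop serve both mBART-50 and OPUS-MT.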

## Compute metrics

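BLEU, or Bilingual Evaluation Understudy, is an evaluation metric used primarily for machine translation. It measures how closely a candidate translation matches one or more reference translations by combining modified (clipped) n-gram precisions with a brevity penalty that punishes overly short output:

$$BLEU = BP \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

Here $p_n$ is the modified $n$-gram precision, $w_n$ are the weights (usually uniform, $w_n = 1/N$ with $N = 4$), $c$ is the candidate length, and $r$ is the reference length.

Note that the code below loads Hugging Face's `google_bleu` metric (Google BLEU, or GLEU), which does not use this formula: it counts all 1- to 4-grams and reports the minimum of n-gram precision and recall, so its scores are not directly comparable to standard BLEU scores.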
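To make the BLEU formula concrete, here is a minimal sentence-level sketch with a single reference, whitespace tokenization, uniform weights $w_n = 1/N$, and no smoothing. These are simplifying assumptions: corpus-level implementations such as sacreBLEU accumulate n-gram counts over all sentences and apply their own tokenization, so the numbers will not match them exactly. The function names are illustrative only.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a single order n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """p_n: candidate n-gram counts, clipped by their counts in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """BLEU = BP * exp(sum_n w_n log p_n) with uniform weights w_n = 1 / max_n."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(log_mean)


hyp = "Obama erhält Netanyahu".split()
ref = "Obama empfängt Netanyahu".split()
print(sentence_bleu(hyp, ref))                     # 0.0 (a 3-word sentence has no 4-grams)
print(round(sentence_bleu(hyp, ref, max_n=1), 4))  # 0.6667
```

The first result shows why unsmoothed BLEU is harsh on short single sentences and is usually reported at corpus level.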

[Google BLEU on Hugging Face](https://huggingface.co/spaces/evaluate-metric/google_bleu)

[Info about Hugging Face evaluate](https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/loading_methods#evaluate.load)
<br> <br>


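METEOR is another automatic evaluation metric for machine translation. It first aligns the unigrams of the candidate and the reference (exact matches, then stems and synonyms), computes unigram precision $P$ and recall $R$ from that alignment, and combines them in a weighted harmonic mean that favors recall. A fragmentation penalty then lowers the score when the matched words are scattered rather than grouped in contiguous chunks:

$$METEOR = (1 - \gamma \times \text{Penalty}) \times \frac{P \times R}{\alpha P + (1 - \alpha) R}, \qquad \text{Penalty} = \left(\frac{ch}{m}\right)^{\beta}$$

where $m$ is the number of matched unigrams and $ch$ is the number of chunks they form. The NLTK implementation behind Hugging Face's `meteor` metric uses $\alpha = 0.9$, $\beta = 3$, and $\gamma = 0.5$ by default. Its stemming and synonym stages rely on English resources (Porter stemmer, WordNet), so for German output the score is driven mostly by exact matches.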
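The final scoring step of METEOR can be sketched directly from alignment statistics. This sketch assumes the alignment (the number of matches $m$ and chunks $ch$) has already been computed, which is the harder part it leaves out, and uses the NLTK default parameters $\alpha = 0.9$, $\beta = 3$, $\gamma = 0.5$. The first call mirrors Example 1 above under exact matching only: "Obama erhält Netanyahu" shares two words with the reference "Obama empfängt Netanyahu", and they form two separate chunks. The name `meteor_from_stats` is illustrative.

```python
def meteor_from_stats(
    matches: int,
    hyp_len: int,
    ref_len: int,
    chunks: int,
    alpha: float = 0.9,
    beta: float = 3.0,
    gamma: float = 0.5,
) -> float:
    """Score one hypothesis from precomputed alignment statistics (NLTK-style defaults)."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len  # P: matched unigrams / hypothesis length
    recall = matches / ref_len     # R: matched unigrams / reference length
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = (chunks / matches) ** beta  # fragmentation: more chunks -> larger penalty
    return (1 - gamma * penalty) * fmean


# "Obama" and "Netanyahu" match but are not adjacent -> 2 matches in 2 chunks
print(round(meteor_from_stats(matches=2, hyp_len=3, ref_len=3, chunks=2), 4))  # 0.3333
# A perfect 3-word match is a single chunk, yet the penalty (1/3)^3 keeps it below 1
print(round(meteor_from_stats(matches=3, hyp_len=3, ref_len=3, chunks=1), 4))  # 0.9815
```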
<br> <br>
[Link to paper](https://www.aclweb.org/anthology/W05-0909)

[METEOR on Hugging Face](https://huggingface.co/spaces/evaluate-metric/meteor)

In [None]:
# Compute the Google BLEU scores
references = [example["translation"]["de"] for example in dataset]
bleu_scores = []
for model_name, outputs in translations.items():
    bleu_score = bleu.compute(predictions=outputs, references=references)
    bleu_scores.append((model_name, round(bleu_score['google_bleu'], 4)))

# Create a DataFrame to store the model names and their Google BLEU scores using from_records
bleu_scores_df = pd.DataFrame.from_records(bleu_scores, columns=["Model", "Google BLEU Score"])

# Display the DataFrame
bleu_scores_df


| | Model | Google BLEU Score |
|---|---|---|
| 0 | mBART-50 | 0.4647 |
| 1 | OPUS-MT | 0.476 |
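The `google_bleu` scores above do not come from the BP formula. Per the Hugging Face metric description, Google BLEU (GLEU) collects all 1- to 4-grams of a prediction and its reference and, for a single sentence, takes the minimum of n-gram precision and recall. At corpus level, the metric relies on NLTK's `corpus_gleu`, which sums the matching n-grams over all sentences and divides by the summed per-sentence maximum of prediction and reference n-gram totals. The sketch below assumes one reference per prediction and plain whitespace tokenization (the real metric uses its own tokenizer), so its numbers are illustrative rather than identical to `bleu.compute`; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def all_ngrams(tokens: List[str], max_n: int = 4) -> Counter:
    """Count every n-gram of order 1..max_n in a token list."""
    return Counter(
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    )


def corpus_gleu(predictions: Sequence[str], references: Sequence[str], max_n: int = 4) -> float:
    """Corpus-level GLEU with one reference per prediction and whitespace tokenization."""
    total_match, total_all = 0, 0
    for pred, ref in zip(predictions, references):
        hyp_counts = all_ngrams(pred.split(), max_n)
        ref_counts = all_ngrams(ref.split(), max_n)
        # Counter & Counter keeps the minimum count of each shared n-gram (clipped matches)
        total_match += sum((hyp_counts & ref_counts).values())
        # Dividing by the larger n-gram total is equivalent to min(precision, recall)
        total_all += max(sum(hyp_counts.values()), sum(ref_counts.values()))
    return total_match / total_all if total_all else 0.0


score = corpus_gleu(["Obama erhält Netanyahu"], ["Obama empfängt Netanyahu"])
print(round(score, 4))  # 0.3333
```

Unlike sentence-level BLEU with $N = 4$, GLEU stays above zero for short sentences because it also counts lower-order n-grams.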


In [None]:
meteor = evaluate.load('meteor')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
meteor_scores = []
for model_name, outputs in translations.items():
    meteor_score = meteor.compute(predictions=outputs, references=references)
    meteor_scores.append((model_name, round(meteor_score['meteor'], 4)))


In [None]:
meteor_scores_df = pd.DataFrame.from_records(meteor_scores, columns=["Model", "METEOR Score"])
scores_df = bleu_scores_df.merge(meteor_scores_df, on="Model")
scores_df

| | Model | Google BLEU Score | METEOR Score |
|---|---|---|---|
| 0 | mBART-50 | 0.4647 | 0.6825 |
| 1 | OPUS-MT | 0.476 | 0.7087 |
