# **Deep Natural Language Processing @ PoliTO**


---
**Teaching Assistant:** Teaching Assistant

**Credits:** Moreno La Quatra

**Practice 6:** Machine Translation

## **Machine Translation**

Machine Translation is a sub-field of Natural Language Processing that aims at translating a text from a source language to a target language. In this practice, we will experiment with a Transformer-based model for Machine Translation. Specifically, we will benchmark the performance of a pre-trained MT model on Italian-English and English-Italian translation tasks.

![](https://www.deepl.com/img/press/desktop_ENIT_2020-01.png)

In this practice we will use a data collection provided by [tatoeba](https://tatoeba.org/). The following cell download a subset of the data collection, containing parallel Italian-English sentences.


In [1]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/train_it_en.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/test_it_en.tsv

### **Question 1: Parsing data**

The first step is to parse the data collection to generate a list of sentence pairs. The data are provided in `tsv` format, where each line contains a sentence pair in the following format:

`<source_language_sentence>\t<target_language_sentence>\n`

You are provided with a training and a test set. For this question you should parse both data splits and store them in your preferred data structure.

**Note:** store train and test set into separate data objects.

In [10]:
# your code here

def parse_tsv(file_path):
    sentence_pairs = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for i, line in enumerate(file):
            if i == 0:
                continue
            striped_line = line.strip().split('\t')
            source = striped_line[2]
            target = striped_line[4]
            sentence_pairs.append((source, target))
    
    return sentence_pairs

train_file_path = 'train_it_en.tsv'
test_file_path = 'test_it_en.tsv'

train_data = parse_tsv(train_file_path)
test_data = parse_tsv(test_file_path)

In [14]:
train_data

[('Li aiuteremo domani.', "We'll help them tomorrow."),
 ('Tu guardi spesso la TV?', 'Do you watch TV often?'),
 ('Rovesciò il suo drink sul mio smoking.',
  'She spilled her drink all over my tuxedo.'),
 ('Rimanete in contatto con me.', 'Keep in touch with me.'),
 ('Quale cane è il vostro?', 'Which dog is yours?'),
 ("C'erano alcuni fiori lì.", 'There were some flowers there.'),
 ("C'è del traffico pesante su questa strada.",
  'There is heavy traffic on this street.'),
 ('Lui ha precisato che il viaggio costerà 300 euro.',
  'He made it clear that the trip will cost 300 euros.'),
 ("Io sono un po' arrabbiato con te.", "I'm a little angry with you."),
 ('Tom passa meno tempo a studiare francese di me.',
  'Tom spends less time studying French than I do.'),
 ('Ho visto un ragazzino correre.', 'I saw a little boy running.'),
 ('È quasi estate.', "It's almost summertime."),
 ('Tom ama i film sugli zombie e li guarda ogni volta che può.',
  'Tom loves zombie movies and watches them whenev

### **Question 2: Pre-trained MT models**

Pre-trained MT models are released to the public to allow researchers to experiment with them. In this question you will load a pre-trained MT model and use it to translate sentences from Italian to English and vice-versa.

[EasyNMT](https://github.com/UKPLab/EasyNMT) is a Python library that provides an easy-to-use interface to pre-trained MT models. It provides a simple wrapper over HuggingFace transformers library for machine translation. In this question you will use EasyNMT to load a pre-trained MT model and translate sentences from Italian to English and vice-versa:

- Load the pre-trained model for a specific direction (e.g., Italian-English or English-Italian)
- Translate all the sentences in the test set from the source language to the target language.


**Note 1**: the choice for the MT model is up to you.

**Note 2**: store the translated sentences in both directions using the data structure of your choice.

In [18]:
# your code here
from easynmt import EasyNMT
model = EasyNMT('opus-mt')

In [43]:
# Italian to English
predictions_en = []
for sentence_pair in train_data[:50]:
    source, target = sentence_pair
    print(f'Input: {source}')
    print(f'Target: {target}')
    print(f'Prediction: {model.translate(source, target_lang="en", source_lang="it")}')
    predictions_en.append(model.translate(source, target_lang="en", source_lang="it"))
print("For FINISHED")

# English to Italian
predictions_it = []
for sentence_pair in train_data[:50]:
    target, source = sentence_pair
    print(f'Input: {target}')
    print(f'Target: {source}')
    print(f'Prediction: {model.translate(target, target_lang="it",source_lang="en")}')
    predictions_it.append(model.translate(target, target_lang="it", source_lang="en"))

Input: Li aiuteremo domani.
Target: We'll help them tomorrow.
Prediction: We'll help them tomorrow.
Input: Tu guardi spesso la TV?
Target: Do you watch TV often?
Prediction: Do you watch TV often?
Input: Rovesciò il suo drink sul mio smoking.
Target: She spilled her drink all over my tuxedo.
Prediction: He spilled his drink in my tuxedo.
Input: Rimanete in contatto con me.
Target: Keep in touch with me.
Prediction: Stay in touch with me.
Input: Quale cane è il vostro?
Target: Which dog is yours?
Prediction: Which dog is yours?
Input: C'erano alcuni fiori lì.
Target: There were some flowers there.
Prediction: There were some flowers there.
Input: C'è del traffico pesante su questa strada.
Target: There is heavy traffic on this street.
Prediction: There's heavy traffic on this road.
Input: Lui ha precisato che il viaggio costerà 300 euro.
Target: He made it clear that the trip will cost 300 euros.
Prediction: He pointed out that the trip will cost 300 euros.
Input: Io sono un po' arrabbi

tokenizer_config.json: 100%|██████████| 42.0/42.0 [00:00<00:00, 36.8kB/s]
source.spm: 100%|██████████| 789k/789k [00:00<00:00, 2.39MB/s]
target.spm: 100%|██████████| 814k/814k [00:00<00:00, 8.37MB/s]
vocab.json: 100%|██████████| 2.35M/2.35M [00:00<00:00, 3.94MB/s]
config.json: 100%|██████████| 1.38k/1.38k [00:00<00:00, 7.15MB/s]
pytorch_model.bin: 100%|██████████| 343M/343M [00:06<00:00, 56.2MB/s] 
generation_config.json: 100%|██████████| 293/293 [00:00<00:00, 891kB/s]


Prediction: Li aiuteremo domani.
Input: Tu guardi spesso la TV?
Target: Do you watch TV often?
Prediction: Tu guarda spesso la TV?
Input: Rovesciò il suo drink sul mio smoking.
Target: She spilled her drink all over my tuxedo.
Prediction: Rovesciò il suo drink sul mio fumo.
Input: Rimanete in contatto con me.
Target: Keep in touch with me.
Prediction: Rimanete in contatto con me.
Input: Quale cane è il vostro?
Target: Which dog is yours?
Prediction: Qual'è il vostro cane?
Input: C'erano alcuni fiori lì.
Target: There were some flowers there.
Prediction: C'erano alcuni fiori là.
Input: C'è del traffico pesante su questa strada.
Target: There is heavy traffic on this street.
Prediction: C'è del traffico pesante su questa strada.
Input: Lui ha precisato che il viaggio costerà 300 euro.
Target: He made it clear that the trip will cost 300 euros.
Prediction: Lui ha previsto che il viaggio costerà 300 euro.
Input: Io sono un po' arrabbiato con te.
Target: I'm a little angry with you.
Predict

### **Question 3: BLEU scores**

One of the most common metrics used to evaluate MT models is the BLEU score. The BLEU score is a metric that compares the predicted translation with the reference translation. The higher the BLEU score, the better the translation. In this question you will compute the BLEU score for the translated sentences.

Evaluate the selected MT model using [BLEU evaluation metric](https://github.com/mjpost/sacrebleu). Report scores for both translation directions (`EN->IT`, `IT->EN`).

The following cell install the `saclrebleu` library that can be used to compute the BLEU score.

In [2]:
!pip install sacrebleu
from sacrebleu.metrics import BLEU

Collecting sacrebleu
  Downloading sacrebleu-2.3.3-py3-none-any.whl.metadata (57 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m57.7/57.7 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting portalocker (from sacrebleu)
  Downloading portalocker-2.8.2-py3-none-any.whl.metadata (8.5 kB)
Collecting tabulate>=0.8.9 (from sacrebleu)
  Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting lxml (from sacrebleu)
  Downloading lxml-4.9.3-cp311-cp311-macosx_11_0_universal2.whl.metadata (3.8 kB)
Downloading sacrebleu-2.3.3-py3-none-any.whl (106 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.4/106.4 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading lxml-4.9.3-cp311-cp311-macosx_11_0_universal2.whl (8.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.6/8.6 MB[0m [31m14.8 MB/s[0m eta [36m0:00:00[0m00:01[

In [44]:
italian_sentences = [sentence_pair[0] for sentence_pair in train_data]
english_sentences = [sentence_pair[1] for sentence_pair in train_data]
    

In [55]:
# Score Italian to English
bleu = BLEU()
score_it_en = bleu.corpus_score(english_sentences[:50],predictions_en)
print(score_it_en)

BLEU = 0.74 3.7/0.7/0.4/0.3 (BP = 1.000 ratio = 7.364 hyp_len = 81 ref_len = 11)


In [56]:
# Score English to Italian
score_en_it = bleu.corpus_score(italian_sentences[:50], predictions_it)
print(score_en_it)

BLEU = 0.76 3.7/0.7/0.4/0.3 (BP = 1.000 ratio = 6.750 hyp_len = 81 ref_len = 12)


### **Question 4: Fine-tuning Seq2Seq model (IT->EN)**

One of the most common paradigm for modern transformer architectures is to use pre-training to learn a generic representation of the language and then fine-tune the model on a specific task. However, you can also fine-tune a model that is already pre-trained on a specific task to another dataset. In this question you will fine-tune a pre-trained MT model on a tatoeba dataset.

Exploit the [Trainer API](https://huggingface.co/transformers/training.html#fine-tuning-in-pytorch-with-the-trainer-api) to finetune and evaluate a [MarianMT](https://arxiv.org/pdf/1804.00344.pdf) sequence to sequence model for machine translation. The documentation for MarianMT is available [here](https://huggingface.co/transformers/model_doc/marian.html).

You can use the code from the previous practices as a starting point (seq2seq model for abstractive summarization).

**Note 1:** select the pre-trained model according to the input-output pair (it-en)

**Note 2:** for the lab practice, please use a sub-set of the training data (e.g., 10000 sentences for training and 1000 sentences for validation)

In [None]:
# your code here

### **Question 5: Model evaluation**

Using the test set provided at the beginning of the practice, evaluate the fine-tuned model and report the BLEU score. How does the BLEU score compare with the one obtained in the previous question?

In [None]:
# your code here