<a href="https://colab.research.google.com/github/MorenoLaQuatra/DeepNLP/blob/main/2022_2023/Practice_6_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Natural Language Processing @ PoliTO**


---


**Teaching Assistant:** Moreno La Quatra

**Practice 6:** Machine Translation

## **Machine Translation**

Machine Translation is a sub-field of Natural Language Processing that aims at translating a text from a source language to a target language. In this practice, we will experiment with a Transformer-based model for Machine Translation. Specifically, we will benchmark the performance of a pre-trained MT model on Italian-English and English-Italian translation tasks.

![](https://www.deepl.com/img/press/desktop_ENIT_2020-01.png)

In this practice we will use a data collection provided by [Tatoeba](https://tatoeba.org/). The following cell downloads a subset of the data collection, containing parallel Italian-English sentences.


In [None]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/train_it_en.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P6/test_it_en.tsv

### **Question 1: parsing data**

The first step is to parse the data collection to generate a list of sentence pairs. The data are provided in `tsv` format, where each line contains a sentence pair in the following format:

`<source_language_sentence>\t<target_language_sentence>\n`

You are provided with a training and a test set. For this question you should parse both data splits and store them in your preferred data structure.

**Note:** store train and test set into separate data objects.
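
As a starting point, the format above can be read with Python's standard `csv` module. This is only a minimal sketch: the helper name `read_pairs` and the in-memory sample are made up for illustration, and it assumes exactly two tab-separated fields per line.

```python
import csv
import io


def read_pairs(handle):
    """Read (source, target) sentence pairs from an open TSV file handle."""
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        source, target = (field.strip() for field in row)
        pairs.append((source, target))
    return pairs


# Tiny in-memory stand-in for a file such as train_it_en.tsv (made-up content).
sample = io.StringIO("Ciao!\tHi!\nCome stai?\tHow are you?\n\n")
pairs = read_pairs(sample)
print(pairs)
```

With the downloaded files, the same helper could be called as `read_pairs(open("train_it_en.tsv", encoding="utf-8", newline=""))`, once for the training split and once for the test split.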

In [None]:
# Your code here

### **Question 2: pretrained MT models**

Pre-trained MT models are released to the public to allow researchers to experiment with them. In this question you will load a pre-trained MT model and use it to translate sentences from Italian to English and vice-versa.

[EasyNMT](https://github.com/UKPLab/EasyNMT) is a Python library that provides an easy-to-use interface to pre-trained MT models. It is a simple wrapper over the Hugging Face `transformers` library for machine translation. Use EasyNMT to complete the following steps:

- Load the pre-trained model for a specific direction (e.g., Italian-English or English-Italian)
- Translate all the sentences in the test set from the source language to the target language.


**Note 1:** the choice of the MT model is up to you.

**Note 2:** store the translated sentences for both directions in the data structure of your choice.
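
The two steps above could be organized as in the following sketch. It assumes that each pair is stored as `(italian, english)` and that the `opus-mt` model family is chosen; both are assumptions, not requirements. The helper names are hypothetical, and `easynmt` is imported lazily so that the pure-Python part can be tried without installing it.

```python
def load_easynmt_translator(model_name="opus-mt"):
    """Return a function that translates a list of sentences with EasyNMT."""
    from easynmt import EasyNMT  # lazy import: requires `pip install easynmt`

    model = EasyNMT(model_name)

    def translate(sentences, source_lang, target_lang):
        return model.translate(sentences, source_lang=source_lang, target_lang=target_lang)

    return translate


def translate_both_directions(pairs, translate):
    """Translate (italian, english) pairs in both directions.

    `translate` is any callable taking (sentences, source_lang, target_lang).
    """
    italian = [it for it, _ in pairs]
    english = [en for _, en in pairs]
    return {
        "it->en": translate(italian, source_lang="it", target_lang="en"),
        "en->it": translate(english, source_lang="en", target_lang="it"),
    }
```

A possible call is `translate_both_directions(test_pairs, load_easynmt_translator())`, where `test_pairs` stands for whatever structure holds the parsed test set.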

In [None]:
# Your code here

### **Question 3: BLEU scores**

One of the most common metrics used to evaluate MT models is the BLEU score. BLEU measures the n-gram overlap between the predicted translation and one or more reference translations, and applies a penalty to predictions that are shorter than the reference. The higher the BLEU score, the closer the prediction is to the reference. In this question you will compute the BLEU score for the translated sentences.
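
To make the metric more concrete, here is a simplified pure-Python sketch of corpus-level BLEU. It is an illustration only, not the `sacrebleu` implementation: it assumes one reference per sentence, whitespace tokenization, and no smoothing, and the function name `simple_corpus_bleu` is made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize outputs shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


print(simple_corpus_bleu(["the cat is on the"], ["the cat is on the mat"]))
```

In practice, `sacrebleu.corpus_bleu(hypotheses, [references])` returns an object whose `.score` attribute holds the BLEU value. It applies its own tokenization, so its numbers will generally differ from this sketch.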

Evaluate the selected MT model using the [sacreBLEU](https://github.com/mjpost/sacrebleu) implementation of the BLEU metric. Report scores for both translation directions (`EN->IT`, `IT->EN`).

The following cell installs the `sacrebleu` library, which can be used to compute the BLEU score.

In [None]:
!pip install sacrebleu

In [None]:
# Your code here

### **Question 4: finetuning Seq2Seq model (IT->EN)**

One of the most common paradigms for modern transformer architectures is to use pre-training to learn a generic representation of the language and then fine-tune the model on a specific task. However, you can also take a model that has already been trained on a specific task and fine-tune it on another dataset for the same task. In this question you will fine-tune a pre-trained MT model on the Tatoeba dataset.

Use the [Trainer API](https://huggingface.co/transformers/training.html#fine-tuning-in-pytorch-with-the-trainer-api) to fine-tune and evaluate a [MarianMT](https://arxiv.org/pdf/1804.00344.pdf) sequence-to-sequence model for machine translation. The documentation for MarianMT is available [here](https://huggingface.co/transformers/model_doc/marian.html).

You can use the code from the previous practices as a starting point (seq2seq model for abstractive summarization).

**Note 1:** select the pre-trained model according to the input-output pair (it-en)

**Note 2:** for the lab practice, please use a subset of the training data (e.g., 10000 sentences for training and 1000 sentences for validation).
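
One possible way to set this up is sketched below, under several assumptions: pairs are stored as `(italian, english)`, the checkpoint `Helsinki-NLP/opus-mt-it-en` is used, and the hyperparameters and output directory are placeholders rather than recommended values. The helper names `make_subsets` and `build_trainer` are hypothetical. The `transformers` import is lazy so that the subsetting helper works on its own. Passing plain lists of feature dictionaries as datasets is also an assumption of this sketch; a `datasets.Dataset` is the more common choice.

```python
import random


def make_subsets(pairs, n_train=10_000, n_val=1_000, seed=42):
    """Shuffle a copy of `pairs` and return disjoint (train, validation) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]


def build_trainer(train_pairs, val_pairs,
                  checkpoint="Helsinki-NLP/opus-mt-it-en",  # assumed it->en checkpoint
                  output_dir="marian-it-en-tatoeba"):  # hypothetical directory name
    """Assemble a Seq2SeqTrainer for (italian, english) sentence pairs."""
    # Imported lazily: requires `transformers`, `torch` and `sentencepiece`.
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def encode(pairs):
        sources = [it for it, _ in pairs]
        targets = [en for _, en in pairs]
        # `text_target` tokenizes the references into the `labels` field.
        batch = tokenizer(sources, text_target=targets, max_length=128, truncation=True)
        # One feature dictionary per example; padding is left to the collator.
        return [{key: values[i] for key, values in batch.items()}
                for i in range(len(pairs))]

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,  # placeholder value
        per_device_train_batch_size=16,  # placeholder value
        predict_with_generate=True,
    )
    return Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=encode(train_pairs),
        eval_dataset=encode(val_pairs),
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
```

With the parsed training pairs at hand, `trainer = build_trainer(*make_subsets(train_pairs))` followed by `trainer.train()` would start fine-tuning, and `trainer.evaluate()` would report the validation loss.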

In [None]:
# Your code here

### **Question 5: Model evaluation**

Using the test set provided at the beginning of the practice, evaluate the fine-tuned model and report the BLEU score. How does it compare with the `IT->EN` BLEU score obtained with the pre-trained model in Question 3?

In [None]:
# Your code here