<a href="https://colab.research.google.com/github/bucuram/machine-translation-labs/blob/main/Lab1_MT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Manual and Automatic Evaluation of Machine Translation




We will use the [Europarl parallel corpus](http://www.statmt.org/europarl/) as distributed for the [WMT16 translation task](http://www.statmt.org/wmt16/translation-task.html#download).



##Manual Evaluation

**Adequacy** - how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation. Annotators must be proficient in both the **source** and **target** languages to judge whether the information is preserved in translation.

**Fluency** - refers to the **target** only, without taking the source into account; criteria are grammar, spelling, choice of words, and style.

![manual_eval](https://www.researchgate.net/profile/Linda-Alkhawaja/publication/340974510/figure/tbl1/AS:885277719535620@1588078064880/Numeric-scale-for-judging-adequacy-and-fluency_W640.jpg)

[Photo source](https://www.researchgate.net/publication/340974510_Neural_Machine_Translation_Fine-Grained_Evaluation_of_Google_Translate_Output_for_English-to-Arabic_Translation/figures?lo=1)

Source Text - English

In [None]:
from pprint import pprint

In [None]:
source_text = """Brazil's Former Presidential Chief-of-Staff to Stand Trial.
    A federal judge on Tuesday accepted the charges filed against Brazil's former presidential chief of staff for his alleged involvement in a massive corruption scheme at state-owned oil company Petrobras.
    The federal prosecutor's office said Jose Dirceu will face trial on the corruption, racketeering and money laundering charges filed earlier this month.
    Fourteen other people will also be tried, including Joao Vaccari Neto, the former treasurer of Brazil's governing Workers' Party and Renato de Souza Duque, Petrobras' former head of corporate services."""
pprint(source_text)

Target Text - Romanian

In [None]:
target_text_system_A = """Brazilia fostul șef prezidențial-of-Staff pentru a stand trial.
    Un judecător federal a acceptat marți acuzațiile formulate împotriva fostului șef de cabinet prezidențial al Braziliei pentru presupusa sa implicare într-o schemă masivă de corupție la compania petrolieră de stat Petrobras.
    Procuratura federală a declarat că Jose Dirceu va fi judecat pentru acuzațiile de corupție, racketeering și spălare de bani depuse la începutul acestei luni.
    Paisprezece alte persoane vor fi, de asemenea, judecate, inclusiv Joao Vaccari Neto, fostul trezorier al Partidului Muncitorilor din Brazilia de guvernământ și Renato de Souza Duque, petrobras fostul șef al serviciilor corporative."""
pprint(target_text_system_A)

In [None]:
target_text_system_B = """Fostul șef de stat major prezidențial al Braziliei va fi supus procesului.
    Un judecător federal a acceptat marți acuzațiile depuse împotriva fostului șef de cabinet prezidențial al Braziliei pentru presupusa sa implicare într-un plan masiv de corupție la compania petrolieră de stat Petrobras.
    Procuratura federală a declarat că Jose Dirceu va fi judecat cu privire la acuzațiile de corupție, racket și spălare de bani depuse la începutul acestei luni.
    Vor fi judecați și alte paisprezece persoane, printre care Joao Vaccari Neto, fostul trezorier al Partidului Muncitorilor din guvernul Braziliei și Renato de Souza Duque, fostul șef al serviciilor corporative al Petrobras."""
pprint(target_text_system_B)

Gold text - Romanian

In [None]:
gold_text = """Fostul șef al cabinetului prezidențial brazilian este adus în fața instanței.
    Marți, un judecător federal a acceptat acuzațiile aduse împotriva fostului șef al cabinetului prezidențial brazilian pentru presupusa implicare a acestuia într-o schemă masivă de corupție privind compania petrolieră de stat Petrobras.
    Biroul procurorului federal a declarat că Jose Dirceu va fi trimis în judecată pentru acuzațiile de corupție, înșelătorie și spălare de bani aduse în această lună.
    Alte paisprezece persoane vor fi judecate, printre acestea numărându-se Joao Vaccari Neto, fostul trezorier al Partidului Muncitorilor, aflat la putere în Brazilia, și Renato de Souza Duque, fostul președinte al serviciilor pentru întreprinderi ale Petrobras."""
pprint(gold_text)

##Automatic evaluation

Preprocessing text

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy

In [None]:
!python -m spacy download ro_core_news_sm

In [None]:
import spacy
import numpy as np
import string
import re

nlp = spacy.load('ro_core_news_sm')


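def tokenize(text):
    """Split text into sentences and return one token list per sentence, without punctuation."""
    doc = nlp(text)
    # Sentence boundaries come from the spaCy pipeline
    sentences = [sent.text.strip() for sent in doc.sents]
    tokenized_text = []
    for sent in sentences:
        # Keep every token that is not an ASCII punctuation mark
        sent = [tok.text for tok in nlp.tokenizer(sent) if tok.text not in string.punctuation]
        tokenized_text.append(sent)
    return tokenized_text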

tokenized_source = tokenize(source_text)
tokenized_system_A = tokenize(target_text_system_A)
tokenized_system_B = tokenize(target_text_system_B)
tokenized_gold = tokenize(gold_text)
print(tokenized_system_B)

###Precision, Recall, F1

![prec1](https://i.imgur.com/OIwtGu2.png)

![prec2](https://i.imgur.com/tfLDjl1.png)

[Photo source](http://www.statmt.org/book/slides/08-evaluation.pdf)
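
Written out in symbols, and assuming the standard word-level definitions, the three measures are as follows. Here "correct" is the number of output words that also appear in the reference; the second form of $F_1$ is just an algebraic rearrangement of the first.

```latex
\begin{align*}
\text{precision} &= \frac{\text{correct}}{\text{output length}} \\
\text{recall} &= \frac{\text{correct}}{\text{reference length}} \\
F_1 &= \frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall}) / 2}
     = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{align*}
```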

In [None]:
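def precision_recall_f1(ref, hypo):
    """Average sentence-level unigram precision, recall and F1.

    ref and hypo are lists of tokenized sentences, aligned one-to-one.
    zip() stops at the shorter list, so both should have the same length.
    """
    prec_sent = []
    recall_sent = []
    f1_sent = []
    for r, h in zip(ref, hypo):
        # Word types shared by reference and hypothesis (repeats are ignored)
        correct = len(set(r) & set(h))
        prec = correct / len(h) if h else 0.0
        recall = correct / len(r) if r else 0.0
        # Harmonic mean of precision and recall; 0 when there is no overlap
        f1 = 2 * prec * recall / (prec + recall) if prec + recall else 0.0

        prec_sent.append(prec)
        recall_sent.append(recall)
        f1_sent.append(f1)

    return np.array(prec_sent).mean(), np.array(recall_sent).mean(), np.array(f1_sent).mean()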

precision, recall, f1 = precision_recall_f1(tokenized_gold, tokenized_system_A)
print('System A')
print('Precision', precision)
print('Recall', recall)
print('F1', f1)


In [None]:
precision, recall, f1 = precision_recall_f1(tokenized_gold, tokenized_system_B)
print('System B')
print('Precision', precision)
print('Recall', recall)
print('F1', f1)
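
`precision_recall_f1` intersects sets, so a word repeated in the hypothesis is counted once in the numerator but every time in the denominator. A minimal sketch of a multiset variant follows, assuming clipped counts in the style of BLEU's modified precision. The name `clipped_overlap_prf` and the toy tokens are hypothetical and not part of the lab code.

```python
from collections import Counter


def clipped_overlap_prf(ref_tokens, hyp_tokens):
    """Unigram precision, recall and F1 for one sentence pair, using clipped counts."""
    ref_counts = Counter(ref_tokens)
    hyp_counts = Counter(hyp_tokens)
    # Counter intersection keeps, per word, the minimum of the two counts
    correct = sum((ref_counts & hyp_counts).values())
    precision = correct / len(hyp_tokens) if hyp_tokens else 0.0
    recall = correct / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (made-up tokens, not taken from the corpus)
example_ref = ["the", "cat", "sat", "on", "the", "mat"]
example_hyp = ["the", "the", "the", "cat"]
example_p, example_r, example_f1 = clipped_overlap_prf(example_ref, example_hyp)
print(example_p, example_r, example_f1)
```

In the toy pair, "the" occurs three times in the hypothesis but only twice in the reference, so only two of the three occurrences count as correct.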

###BLEU score

![bleu](https://i.imgur.com/jNuIb6k.png)

[Photo source](http://www.statmt.org/book/slides/08-evaluation.pdf)

BLEU score from [BLEU: a Method for Automatic Evaluation of Machine Translation](https://aclanthology.org/P02-1040.pdf)
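
For reference, the paper defines the score as a brevity penalty multiplied by the geometric mean of modified n-gram precisions $p_n$. Here $c$ is the total length of the candidate corpus, $r$ is the effective reference length, and $w_n$ are the n-gram weights (uniform, $w_n = 1/N$ with $N = 4$, in the paper's baseline).

```latex
\begin{align*}
\mathrm{BP} &= \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases} \\
\mathrm{BLEU} &= \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
\end{align*}
```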

BLEU score using [nltk.translate.bleu_score](https://www.nltk.org/_modules/nltk/translate/bleu_score.html)


In [None]:
tokenized_gold_bleu = [[sent] for sent in tokenized_gold]

In [None]:
from nltk.translate.bleu_score import corpus_bleu

nltk_bleu_A = corpus_bleu(tokenized_gold_bleu, tokenized_system_A, weights=(0.25, 0.25, 0.25, 0.25))
print('System A')
print('BLEU', nltk_bleu_A)

In [None]:
nltk_bleu_B = corpus_bleu(tokenized_gold_bleu, tokenized_system_B, weights=(0.25, 0.25, 0.25, 0.25))
print('System B')
print('BLEU', nltk_bleu_B)

BLEU score using [torchtext.data.metrics](https://pytorch.org/text/stable/data_metrics.html)

In [None]:
from torchtext.data.metrics import bleu_score

torch_bleu_A = bleu_score(tokenized_system_A, tokenized_gold_bleu)
print('System A')
print('BLEU', torch_bleu_A)

In [None]:
torch_bleu_B = bleu_score(tokenized_system_B, tokenized_gold_bleu)
print('System B')
print('BLEU', torch_bleu_B)
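
To see what the library calls compute, here is a minimal from-scratch sketch of corpus-level BLEU. It assumes exactly one reference per hypothesis, uniform weights, and no smoothing; `ngram_counts` and `simple_corpus_bleu` are hypothetical names, and the sketch is not guaranteed to match the libraries in edge cases.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and uniform weights."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = 0
    hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())
    # Without smoothing, any order with zero matches makes the score 0
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Toy example (made-up tokens): every n-gram matches, but the hypothesis is one token short
toy_ref = [["a", "b", "c", "d", "e"]]
toy_hyp = [["a", "b", "c", "d"]]
toy_bleu = simple_corpus_bleu(toy_ref, toy_hyp)
print(toy_bleu)
```

On the notebook data the call would be `simple_corpus_bleu(tokenized_gold, tokenized_system_A)`. It takes the flat `tokenized_gold` rather than `tokenized_gold_bleu`, because the sketch assumes a single reference per sentence.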

### Other Machine Translation metrics

* [sacreBLEU](https://github.com/mjpost/sacrebleu)
* [chrF](https://github.com/m-popovic/chrF)
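
chrF compares character n-grams instead of words, so partial matches between inflected word forms still earn credit. The following is a simplified sentence-level sketch, not the reference implementation: it assumes whitespace is removed, n-gram orders 1 to 6, and $\beta = 2$, and it will not necessarily reproduce the numbers reported by the tools linked above. `char_ngrams` and `simple_chrf` are hypothetical names.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing whitespace (a simplifying assumption)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(reference, hypothesis, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        ref_ngrams = char_ngrams(reference, n)
        hyp_ngrams = char_ngrams(hypothesis, n)
        if not ref_ngrams or not hyp_ngrams:
            continue  # text too short for this n-gram order
        overlap = sum((ref_ngrams & hyp_ngrams).values())
        precisions.append(overlap / sum(hyp_ngrams.values()))
        recalls.append(overlap / sum(ref_ngrams.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision
    return (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)


# Two strings from the lab texts that differ only in their endings
partial_score = simple_chrf("judecate", "judecați")
print(partial_score)
```

A word-level metric would score the pair above as a complete mismatch, while the character-level score lies strictly between 0 and 1.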



##Assignment

**To be uploaded here**: https://forms.gle/T2f2keKN6SWyw1J5A

###Data

Data from [WMT20](http://www.statmt.org/wmt20/)

Download data from [here](https://drive.google.com/drive/folders/1n_alr6WFQZfw4dcAmyxow4V8FC67XD8p)

WMT20_data > data-generation-scripts > wmt20-submitted-data.tgz > wmt20-news-task-primary-submissions > txt

We have sources, references and system outputs.

###Requirements

* Use the WMT20 data: choose a language pair, compute automatic evaluation metrics for its system outputs, and rank the systems.

* You can use Precision, Recall, F1, BLEU, or another automatic metric.
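
One possible shape for the ranking step is sketched below. It assumes the reference and each system output have already been read into lists of lines with one sentence per line. The system names, the toy sentences, and the helpers `unigram_f1` and `rank_systems` are all hypothetical, and whitespace tokenization is only a placeholder.

```python
from collections import Counter


def unigram_f1(ref_tokens, hyp_tokens):
    """Clipped unigram F1 for one sentence pair."""
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def rank_systems(reference_lines, system_outputs, metric=unigram_f1):
    """Return (system name, mean sentence score) pairs, best first."""
    if not reference_lines:
        raise ValueError("reference is empty")
    scores = {}
    for name, hyp_lines in system_outputs.items():
        if len(hyp_lines) != len(reference_lines):
            raise ValueError(f"{name}: line count differs from the reference")
        # Whitespace tokenization is a placeholder; swap in a proper tokenizer
        pairs = zip(reference_lines, hyp_lines)
        sentence_scores = [metric(r.split(), h.split()) for r, h in pairs]
        scores[name] = sum(sentence_scores) / len(sentence_scores)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data; in the assignment these lists would come from the WMT20 txt files
toy_reference = ["the cat sat on the mat", "it is raining"]
toy_outputs = {
    "system_x": ["the cat sat on a mat", "it is raining"],
    "system_y": ["a dog stood there", "rain"],
}
ranking = rank_systems(toy_reference, toy_outputs)
print(ranking)
```

Passing a different function as `metric` lets the same loop rank systems by any sentence-level score. A corpus-level metric such as BLEU would instead be computed once per system over all lines.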

###Important

* Using Google Colab for this assignment is not mandatory; you can instead send an archive with your code.


[Description of the systems](http://www.statmt.org/wmt20/program.html)

[WMT20 Paper](http://www.statmt.org/wmt20/pdf/2020.wmt-1.1.pdf)

[Official Ranking](http://wmt.ufal.cz/)






###Further reading
* [Continuous Measurement Scales in Human Evaluation of Machine Translation](https://aclanthology.org/W13-2305.pdf)
* [Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics](https://arxiv.org/pdf/2006.06264.pdf)