<a href="https://colab.research.google.com/github/AyumiOsawa/UCREL_NLP_summerschool_2024/blob/main/Machine_Translation_and_Quality_Estimation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Translation Models

Machine Translation (MT) is a subfield of computational linguistics focused on translating text from one language to another. Neural machine translation (NMT) has recently become the dominant paradigm for MT and achieves state-of-the-art performance for many language pairs.

While initial research on NMT started with building translation systems between two languages, researchers discovered that the NMT framework can naturally incorporate multiple languages.

Let's first play with a few open-source/open-access multilingual machine translation systems.

## M2M100

The M2M100 model was proposed in Beyond English-Centric Multilingual Machine Translation by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.

M2M100 is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation tasks.

M2M100 uses the eos_token_id as the decoder_start_token_id for generation with the target language id being forced as the first generated token. To force the target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method.

The following example shows how to translate from Hindi to French.

In [1]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)



['La vie est comme une boîte de chocolat.']

Let's do another example, this time from Chinese to English.

In [2]:
chinese_text = "生活就像一盒巧克力。"

tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

['Life is like a box of chocolate.']

In the following space, translate the sentence "Life is short. Smile while you still have teeth." to your favourite non-English language.

In [4]:
english_text = "Life is short. Smile while you still have teeth."

tokenizer.src_lang = "en"
encoded_en = tokenizer(english_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.get_lang_id("ja"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

['人生は短く、まだ歯がある間に笑顔。']

## NLLB

The NLLB model was presented in No Language Left Behind: Scaling Human-Centered Machine Translation by Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang.

While generating the target text, set `forced_bos_token_id` to the target language id. The following example shows how to translate Hindi to French using the facebook/nllb-200-distilled-600M model.

Note that, unlike the previous model, NLLB uses the BCP-47-style code `fra_Latn` for French. See [here](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) for the list of all language codes in the FLORES-200 dataset.
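
Because the two model families name languages differently, it can help to keep a small lookup table when running the same sentence through both. The sketch below covers only the five languages used in this notebook. The dictionary and the helper name `to_nllb_code` are illustrative assumptions and are not part of either library.

```python
# Illustrative mapping from the two-letter codes used with M2M100 in this
# notebook to the FLORES-200 codes used with NLLB.
M2M100_TO_NLLB = {
    "en": "eng_Latn",
    "fr": "fra_Latn",
    "hi": "hin_Deva",
    "ja": "jpn_Jpan",
    "zh": "zho_Hans",
}


def to_nllb_code(m2m_code: str) -> str:
    """Return the NLLB code for an M2M100 code, or raise a clear error."""
    try:
        return M2M100_TO_NLLB[m2m_code]
    except KeyError:
        raise ValueError(f"No NLLB code registered for {m2m_code!r}") from None


print(to_nllb_code("hi"))
```

Extending the table to other languages means looking up the matching entry in the FLORES-200 list linked above.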

Let's use the NLLB model to translate the same sentences we translated before.

In [5]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="hin_Deva"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer(hi_text, return_tensors="pt")

# Assuming you want to translate to French ('fra_Latn')
# Ensure the tokenizer knows about the special language token
language_code = "fra_Latn"
if language_code not in tokenizer.additional_special_tokens:
    tokenizer.add_special_tokens({'additional_special_tokens': [language_code]})
    model.resize_token_embeddings(len(tokenizer))

# Generate the translation
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(language_code),
    max_length=30
)

# Decode the translated tokens
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translated_text)

La vie est comme une boîte à chocolat.


Let's do another example, this time from Chinese to English.

In [6]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

chinese_text = "生活就像一盒巧克力。"

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="zho_Hans"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer(chinese_text, return_tensors="pt")

language_code = "eng_Latn"
if language_code not in tokenizer.additional_special_tokens:
    tokenizer.add_special_tokens({'additional_special_tokens': [language_code]})
    model.resize_token_embeddings(len(tokenizer))

# Generate the translation
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(language_code),
    max_length=30
)

# Decode the translated tokens
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translated_text)

Life is like a box of chocolates.


In the following space, translate the sentence "Life is short. Smile while you still have teeth." to your favourite non-English language.

In [7]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

english_text = "Life is short. Smile while you still have teeth."

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer(english_text, return_tensors="pt")

language_code = "jpn_Jpan"
if language_code not in tokenizer.additional_special_tokens:
    tokenizer.add_special_tokens({'additional_special_tokens': [language_code]})
    model.resize_token_embeddings(len(tokenizer))

# Generate the translation
translated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(language_code),
    max_length=30
)

# Decode the translated tokens
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translated_text)

人生は短く. 歯がまだある間は笑顔で.


As you can see, different engines produce different outputs for the same input. So which one is best?

## Machine Translation Evaluation

Machine translation evaluation refers to the different processes of measuring the performance of a machine translation system.

There are two different ways to determine how well an MT system performs. Human evaluation is done by human experts doing manual assessment, while automatic evaluation uses metrics specially developed for assessing translations without human intervention.

Human evaluation is considered the gold standard for evaluating the quality of machine translation. However, it is costly in terms of effort and time. This is why researchers in the field have developed automated ways of evaluating MT quality.

### Automatic evaluation

1. **BLEU score**

MT outputs are scored against a set of high-quality reference translations. In the standard formulation, n-gram matches are counted across the whole test set and combined into a single BLEU score for the MT system. This score reflects how closely the system's output matches the human reference translations, which serves as a proxy for quality.

The scores are calculated from n-grams, i.e. sequences of n consecutive words.
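
To make the idea of n-gram matching concrete, here is a minimal sketch in plain Python. It computes only the clipped n-gram precision for a single sentence pair. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, both of which are left out here. The helper names `ngrams` and `clipped_precision` are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, candidate, n):
    """Share of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


reference = "Life is like a box of chocolates."
candidate = "Life is like a box of chocolate."

print(ngrams(candidate.split(), 2)[:2])
print(clipped_precision(reference, candidate, 1))  # 6 of 7 unigrams match
print(clipped_precision(reference, candidate, 2))  # 5 of 6 bigrams match
```

With whitespace tokenisation the final token keeps its full stop, so "chocolate." and "chocolates." count as different words.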





In [8]:
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu


def bleu(ref, gen):
    '''
    Calculate a corpus-level BLEU score with the NLTK implementation.
    Args:
        ref : a list of reference sentences (one reference per candidate)
        gen : a list of candidate (generated) sentences
    Returns:
        bleu score (float)
    '''
    # Tokenise on whitespace; corpus_bleu expects a list of references per candidate.
    ref_bleu = [[sentence.split()] for sentence in ref]
    gen_bleu = [sentence.split() for sentence in gen]
    cc = SmoothingFunction()
    # weights=(0, 1, 0, 0) means that only bigram precision contributes to the score.
    score_bleu = corpus_bleu(ref_bleu, gen_bleu, weights=(0, 1, 0, 0), smoothing_function=cc.method4)
    return score_bleu


Let's look at the BLEU score for the following pair:

Reference =  "Life is like a box of chocolates."

Candidate = "Life is like a box of chocolate."

In [9]:
Reference =  "Life is like a box of chocolates."
Candidate = "Life is like a box of chocolate."

print(bleu([Reference],[Candidate]))

0.8333333333333334


Now, let's look at the BLEU score for:

Reference =  "Life is like a box of chocolates."

Candidate = "Life life chocolate chocolate."

In [10]:
Reference =  "Life is like a box of chocolates."
Candidate = "Life life chocolate chocolate."

print(bleu([Reference],[Candidate]))

0.021827969614883667


**Strengths of the BLEU score**

The BLEU score is popular because it has several strengths:

1.   It is quick to calculate and easy to understand.
2.   It correlates reasonably well with human judgements, particularly over a whole test set.
3.   Importantly, it is language-independent, making it straightforward to apply to your NLP models.
4.   It can be used when you have more than one reference translation per sentence.
5.   It is used very widely, which makes it easier to compare your results with other work.


**Weaknesses of the BLEU score**

In spite of its popularity, the BLEU score has the following weaknesses:

1.   It does not consider the meaning of words.
2.   It looks only for exact word matches. A variant of the same word, e.g. “rain” and “raining”, is counted as an error.
3.   It ignores the importance of words. An incorrect function word such as “to” or “an”, which matters little to the sentence, is penalised just as heavily as a word that contributes significantly to the meaning.
4.   It captures word order only through higher-order n-grams. For example, “The guard arrived late because of the rain” and “The rain arrived late because of the guard” get the same unigram BLEU score even though their meanings are quite different.
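
The word-order weakness can be checked directly. The sketch below is a simplified stand-in for BLEU (clipped n-gram precision only, no brevity penalty or smoothing), and the helper name `ngram_precision` is illustrative. It compares the two "guard" and "rain" sentences at the unigram and bigram level.

```python
from collections import Counter


def ngram_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against one reference."""
    def counts(text):
        tokens = text.split()
        # zip over n shifted copies of the token list yields the n-grams
        return Counter(zip(*(tokens[i:] for i in range(n))))

    ref, cand = counts(reference), counts(candidate)
    overlap = sum((ref & cand).values())  # Counter & keeps the minimum count
    return overlap / max(sum(cand.values()), 1)


reference = "The guard arrived late because of the rain"
candidate = "The rain arrived late because of the guard"

unigram = ngram_precision(reference, candidate, 1)
bigram = ngram_precision(reference, candidate, 2)
print(unigram, bigram)
```

Every word of the candidate appears in the reference, so the unigram precision is perfect, while only four of the seven bigrams survive the swap.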

Let's look at the following examples:


In [11]:
Reference =  "Transformers are fast plus efficient"
Candidate = "Transformers are quick and efficient"

print(bleu([Reference],[Candidate]))

0.25


In [12]:
Reference =  "Transformers are fast plus efficient"
Candidate = "Transformers are Transformers quick quick"

print(bleu([Reference],[Candidate]))

0.25


To address these limitations, researchers have developed learned metrics such as BLEURT.

BLEURT is an evaluation metric for Natural Language Generation. It takes a pair of sentences as input, a reference and a candidate, and it returns a score that indicates to what extent the candidate is fluent and conveys the meaning of the reference. BLEURT is a trained metric, that is, it is a regression model trained on ratings data. The model is based on BERT and RemBERT.

BLEURT runs in Python 3. It relies heavily on TensorFlow (>=1.15) and the library tf-slim (>=1.1). You can install it as follows:

In [13]:
!pip install --upgrade pip  # ensures that pip is current
!git clone https://github.com/google-research/bleurt.git
%cd bleurt
!pip install .



Different BLEURT checkpoints yield different scores. The currently recommended checkpoint BLEURT-20 generates scores which are roughly between 0 and 1 (sometimes less than 0, sometimes more than 1), where 0 indicates a random output and 1 a perfect one.

In [None]:
from bleurt import score

references =  ["Life is like a box of chocolates."]
candidates = ["Life is like a box of chocolate."]

scorer = score.BleurtScorer()
scores = scorer.score(references=references, candidates=candidates)
assert isinstance(scores, list) and len(scores) == 1
print(scores)

[0.7344270944595337]


Let's see how BLEURT handles the previous examples.

In [14]:
from bleurt import score

references =  ["Transformers are fast plus efficient"]
candidates = ["Transformers are quick and efficient"]

scorer = score.BleurtScorer()
scores = scorer.score(references=references, candidates=candidates)
assert isinstance(scores, list) and len(scores) == 1
print(scores)

[0.6676756143569946]


In [15]:
references =  ["Transformers are fast plus efficient"]
candidates = ["Transformers are Transformers quick quick"]

scorer = score.BleurtScorer()
scores = scorer.score(references=references, candidates=candidates)
assert isinstance(scores, list) and len(scores) == 1
print(scores)


[0.029121175408363342]


So far, BLEURT-20 has been tested on 13 languages: Chinese, Czech, English, French, German, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Tamil, and Vietnamese (the languages for which held-out ratings data is available). In theory, it should work for the 100+ languages of multilingual C4, on which RemBERT was trained.

All of the evaluation metrics above require a reference translation, which means they cannot be used to assess translations in real time, when no reference exists.

# Machine Translation Quality Estimation (QE)

The goal of quality estimation (QE) is to evaluate the quality of a translation without having access to a reference translation.

* High-accuracy QE that can be easily deployed for a number of language pairs is the missing piece in many commercial translation workflows, as it has numerous potential uses.

* QE systems can be employed to select the best translation when several translation engines are available, or to inform the end user about the reliability of automatically translated content.

* In addition, QE systems can be used to decide whether a translation can be published as it is in a given context, or whether it requires human post-editing before publishing or translation from scratch by a human.

Quality estimation can be done at different levels: document level, sentence level and word level.
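
The uses listed above can be sketched in a few lines. This is only an illustration: `toy_scorer` is a stand-in that compares lengths and knows nothing about translation quality, and the thresholds in `route` are hypothetical values that would need tuning for a real QE model and score scale.

```python
from typing import Callable, Sequence

# A QE scorer takes (source, translation) and returns a quality score.
Scorer = Callable[[str, str], float]


def pick_best(source: str, candidates: Sequence[str], scorer: Scorer) -> str:
    """Return the candidate translation with the highest estimated quality."""
    return max(candidates, key=lambda candidate: scorer(source, candidate))


def route(score: float, publish_at: float = 0.8, post_edit_at: float = 0.4) -> str:
    """Map a QE score to a workflow decision using hypothetical thresholds."""
    if score >= publish_at:
        return "publish"
    if score >= post_edit_at:
        return "post-edit"
    return "retranslate"


def toy_scorer(source: str, translation: str) -> float:
    """Stand-in for a real QE model: rewards similar length, nothing more."""
    longest = max(len(source), len(translation), 1)
    return 1.0 - abs(len(source) - len(translation)) / longest


source = "Life is like a box of chocolates."
candidates = ["La vie est comme une boîte de chocolat.", "La vie."]

best = pick_best(source, candidates, toy_scorer)
print(best, route(toy_scorer(source, best)))
```

In practice the scorer would wrap a trained QE model, and the rest of the code would stay the same.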

TransQuest (https://tharindu.co.uk/TransQuest/) provides code and pre-trained models to perform QE.

In [1]:
!pip install -U transformers==4.28.0
!pip install transquest
!pip install wandb



Let's use a pre-trained model to measure the quality of a translation.

Source (Ro) - "Reducerea acestor conflicte este importantă pentru conservare."

Target (En) - "Reducing these conflicts is not important for preservation."

In [2]:
import torch
from transquest.algo.sentence_level.monotransquest.run_model import MonoTransQuestModel


model = MonoTransQuestModel("xlmroberta", "TransQuest/monotransquest-da-ro_en-wiki", num_labels=1, use_cuda=torch.cuda.is_available())
predictions, raw_outputs = model.predict([["Reducerea acestor conflicte este importantă pentru conservare.", "Reducing these conflicts is not important for preservation."]])
print(predictions)

0.9052734375


All the pretrained models are available at https://tharindu.co.uk/TransQuest/models/sentence_level_pretrained/

The sentence-level architecture used above, MonoTransQuest, uses a single XLM-R transformer model. Its input is the concatenation of the original sentence and its translation, separated by the *[SEP]* token. The output of the *[CLS]* token is then passed through a softmax layer to predict the quality score.

![MonoTransQuest Architecture](https://github.com/TharinduDR/TransQuest/blob/master/docs/images/MonoTransQuest.png?raw=true)

Let's train a QE system for English-Hindi machine translation. We start by loading a dataset.

In [3]:
import pandas as pd
train = pd.read_csv("https://raw.githubusercontent.com/WMT-QE-Task/wmt-qe-2023-data/main/task_1/en-hi/train.enhi.df.short.tsv", sep="\t")
dev = pd.read_csv("https://raw.githubusercontent.com/WMT-QE-Task/wmt-qe-2023-data/main/task_1/en-hi/dev.enhi.df.short.tsv", sep="\t")

In [4]:
train[:10]

| | index | original | translation | scores | mean | z_scores | z_mean |
|---|---|---|---|---|---|---|---|
| 0 | 0 | [citation needed] Four leaf phases are recogni... | एक यूकेलिप्टस पौधे के विकास में चार पत्ती चरण ... | [75, 75, 77, 82] | 77.25 | [-1.115305097972482, -1.0486684179242434, 0.42... | -0.256664 |
| 1 | 1 | This rule is so strictly enforced that, even w... | यह नियम इतनी सख्ती से लागू किया गया है कि जहां... | [90, 90, 65, 65] | 77.5 | [0.12609030444560904, 0.2069211629237486, -0.8... | -0.341263 |
| 2 | 2 | At the urging of the International Monetary Fu... | अंतर्राष्ट्रीय मुद्रा कोष (आईएमएफ) के आग्रह पर... | [95, 95, 89, 80] | 89.75 | [0.539888771918306, 0.6254510232064127, 1.6680... | 0.839132 |
| 3 | 3 | He quit the movement and turned to Sufism. | उन्होंने आंदोलन छोड़ दिया और सूफीवाद की ओर मुड... | [100, 95, 79, 82] | 89.0 | [0.953687239391003, 0.6254510232064127, 0.6335... | 0.730839 |
| 4 | 4 | He immediately sent a message to the Thakur of... | उन्होंने तुरंत असोटा के ठाकुर को एक संदेश भेजा। | [100, 95, 76, 70] | 85.25 | [0.953687239391003, 0.6254510232064127, 0.3231... | 0.37196 |
| 5 | 5 | Most Islamic jurists hold there is another typ... | अधिकांश इस्लामी न्यायविदों का मानना है कि रिबा... | [95, 95, 68, 85] | 85.75 | [0.539888771918306, 0.6254510232064127, -0.504... | 0.413226 |
| 6 | 6 | He established the RSS network in the Kashmir ... | उन्होंने कश्मीर घाटी में आरएसएस का नेटवर्क स्थ... | [90, 95, 89, 80] | 88.5 | [0.12609030444560904, 0.6254510232064127, 1.66... | 0.735682 |
| 7 | 7 | In his attempts to catch Jerry, Tom often has ... | जेरी को पकड़ने के अपने प्रयासों में, टॉम को अक... | [70, 70, 78, 75] | 73.25 | [-1.529103565445179, -1.4671982782069075, 0.53... | -0.60297 |
| 8 | 8 | He swears to kill Sikandar. | वह सिकन्दर को मारने की कसम खाता है। | [95, 95, 70, 84] | 86.0 | [0.539888771918306, 0.6254510232064127, -0.297... | 0.44151 |
| 9 | 9 | Jaldapara National Park (Pron: ˌʤʌldəˈpɑ:rə) (... | जलदापारा राष्ट्रीय उद्यान (Pron: indialear vil... | [85, 80, 59, 59] | 70.75 | [-0.28770816302708796, -0.6301385576415794, -1... | -0.949799 |


We keep only the columns we need and rename them to the names TransQuest expects.

In [5]:
train = train[['original', 'translation', 'z_mean']]
dev = dev[['original', 'translation', 'z_mean']]
train = train.rename(columns={'original': 'text_a', 'translation': 'text_b', 'z_mean': 'labels'}).dropna()
dev = dev.rename(columns={'original': 'text_a', 'translation': 'text_b', 'z_mean': 'labels'}).dropna()

In [6]:
train[:10]

| | text_a | text_b | labels |
|---|---|---|---|
| 0 | [citation needed] Four leaf phases are recogni... | एक यूकेलिप्टस पौधे के विकास में चार पत्ती चरण ... | -0.256664 |
| 1 | This rule is so strictly enforced that, even w... | यह नियम इतनी सख्ती से लागू किया गया है कि जहां... | -0.341263 |
| 2 | At the urging of the International Monetary Fu... | अंतर्राष्ट्रीय मुद्रा कोष (आईएमएफ) के आग्रह पर... | 0.839132 |
| 3 | He quit the movement and turned to Sufism. | उन्होंने आंदोलन छोड़ दिया और सूफीवाद की ओर मुड... | 0.730839 |
| 4 | He immediately sent a message to the Thakur of... | उन्होंने तुरंत असोटा के ठाकुर को एक संदेश भेजा। | 0.37196 |
| 5 | Most Islamic jurists hold there is another typ... | अधिकांश इस्लामी न्यायविदों का मानना है कि रिबा... | 0.413226 |
| 6 | He established the RSS network in the Kashmir ... | उन्होंने कश्मीर घाटी में आरएसएस का नेटवर्क स्थ... | 0.735682 |
| 7 | In his attempts to catch Jerry, Tom often has ... | जेरी को पकड़ने के अपने प्रयासों में, टॉम को अक... | -0.60297 |
| 8 | He swears to kill Sikandar. | वह सिकन्दर को मारने की कसम खाता है। | 0.44151 |
| 9 | Jaldapara National Park (Pron: ˌʤʌldəˈpɑ:rə) (... | जलदापारा राष्ट्रीय उद्यान (Pron: indialear vil... | -0.949799 |
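
The `labels` column now holds `z_mean`. Judging from the column names and the values in the first preview of the data, each annotator's raw scores appear to have been standardised (z-normalised) and then averaged per sentence. This is an assumption about how the dataset was built, not something the notebook states. A minimal sketch of that computation, with made-up scores from two hypothetical annotators and the population standard deviation chosen for simplicity:

```python
from statistics import mean, pstdev

# Made-up raw scores: one list per annotator, one entry per sentence.
raw_scores = {
    "annotator_a": [90, 70, 80],
    "annotator_b": [60, 40, 50],
}


def z_normalise(scores):
    """Standardise one annotator's scores to mean 0 and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


z_by_annotator = [z_normalise(scores) for scores in raw_scores.values()]

# Average the z-scores of all annotators for each sentence.
z_mean = [mean(sentence_scores) for sentence_scores in zip(*z_by_annotator)]
print(z_mean)
```

The two annotators differ by 30 raw points on every sentence but agree exactly after normalisation, which is the point of standardising: it removes each annotator's personal scale. It also explains why the labels, and therefore the model's predictions, can be negative.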


Now that the data is ready, let's train the model.

In [7]:
from multiprocessing import cpu_count

monotransquest_config = {
    'output_dir': 'temp/outputs/',
    "best_model_dir": "temp/outputs/best_model",
    'cache_dir': 'temp/cache_dir/',

    'fp16': False,
    'fp16_opt_level': 'O1',
    'max_seq_length': 80,
    'train_batch_size': 8,
    'gradient_accumulation_steps': 1,
    'eval_batch_size': 8,
    'num_train_epochs': 1, # Change to 3 for a better model
    'weight_decay': 0,
    'learning_rate': 2e-5,
    'adam_epsilon': 1e-8,
    'warmup_ratio': 0.1,
    'warmup_steps': 0,
    'max_grad_norm': 1.0,
    'do_lower_case': False,

    'logging_steps': 300,
    'save_steps': 300,
    "no_cache": False,
    "no_save": False,
    "save_recent_only": True,
    'save_model_every_epoch': False,
    'n_fold': 3,
    'evaluate_during_training': True,
    "evaluate_during_training_silent": False,
    'evaluate_during_training_steps': 300,
    "evaluate_during_training_verbose": True,
    'use_cached_eval_features': False,
    "save_best_model": True,
    'save_eval_checkpoints': False,
    'tensorboard_dir': None,
    "save_optimizer_and_scheduler": True,

    'regression': True,

    'overwrite_output_dir': True,
    'reprocess_input_data': True,

    'process_count': cpu_count() - 2 if cpu_count() > 2 else 1,
    'n_gpu': 1,
    'use_multiprocessing': True,
    "multiprocessing_chunksize": 500,
    'silent': False,

    'wandb_project': "En-Hi Quality Estimation",
    'wandb_kwargs': {},

    "use_early_stopping": True,
    "early_stopping_patience": 10,
    "early_stopping_delta": 0,
    "early_stopping_metric": "eval_loss",
    "early_stopping_metric_minimize": True,
    "early_stopping_consider_epochs": False,

    "manual_seed": 777,

    "config": {},
    "local_rank": -1,
    "encoding": None,
}

In [8]:
from sklearn.model_selection import train_test_split

train_df, eval_df = train_test_split(train, test_size=0.1, random_state=777)

In [9]:
from transquest.algo.sentence_level.monotransquest.evaluation import pearson_corr, spearman_corr
from sklearn.metrics import mean_absolute_error
from transquest.algo.sentence_level.monotransquest.run_model import MonoTransQuestModel
import torch

model = MonoTransQuestModel("xlmroberta", "xlm-roberta-base", num_labels=1, use_cuda=torch.cuda.is_available(),
                               args=monotransquest_config)
model.train_model(train_df, eval_df=eval_df, pearson_corr=pearson_corr, spearman_corr=spearman_corr,
                              mae=mean_absolute_error)



(788,
 {'global_step': [300, 600, 788],
  'train_loss': [0.4117031693458557, 0.21644280850887299, 0.20307061076164246],
  'eval_loss': [0.42900885438377206, 0.3904460498809137, 0.3746272996068001],
  'pearson_corr': [0.3359181592652052,
   0.3697467232299842,
   0.40036700603732706],
  'spearman_corr': [0.389803811851385,
   0.41244924190841004,
   0.43177953116295614],
  'mae': [0.4623485831816548, 0.43096097048396975, 0.4285148018814756]})
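
The training run reports Pearson correlation, Spearman correlation and mean absolute error (MAE) on the evaluation split at each evaluation step. Higher correlations and lower MAE indicate closer agreement with the human labels. The notebook itself uses the TransQuest and scikit-learn implementations. The from-scratch sketch below only illustrates what the three numbers measure, using made-up values, and its rank helper assumes there are no ties.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank positions starting at 1; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def mean_absolute_error(xs, ys):
    """Average absolute difference between paired values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


gold = [-0.26, -0.34, 0.84, 0.73]       # made-up human labels
predicted = [-0.10, -0.30, 0.50, 0.60]  # made-up model outputs

print(pearson(gold, predicted))
print(spearman(gold, predicted))
print(mean_absolute_error(gold, predicted))
```

Spearman looks only at the ordering of the sentences, so it is the more relevant number when the QE scores are used for ranking translations rather than as absolute quality values.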

Let's predict the quality of some translations from the model that we just built.

source = "In the flood-prone districts of the Netherlands, particularly in the northern provinces of Friesland and Groningen, villages were traditionally built on low man-made hills called terpen before the introduction of regional dyke-systems."

target = "नीदरलैंड के बाढ़ संभावित जिलों में, विशेष रूप से उत्तरी प्रांतों फ्रीसलैंड और ग्रोनिंगेन में, गांवों को पारंपरिक रूप से कम मानव निर्मित पहाड़ियों पर बनाया जाता था जिसे क्षेत्रीय डाइक-सिस्टम की शुरुआत से पहले टेरपेन कहा जाता था।"



In [10]:
from transquest.algo.sentence_level.monotransquest.run_model import MonoTransQuestModel

model = MonoTransQuestModel("xlmroberta", monotransquest_config["best_model_dir"], num_labels=1,
                               use_cuda=torch.cuda.is_available())



In [11]:
source = "In the flood-prone districts of the Netherlands, particularly in the northern provinces of Friesland and Groningen, villages were traditionally built on low man-made hills called terpen before the introduction of regional dyke-systems."

target = "नीदरलैंड के बाढ़ संभावित जिलों में, विशेष रूप से उत्तरी प्रांतों फ्रीसलैंड और ग्रोनिंगेन में, गांवों को पारंपरिक रूप से कम मानव निर्मित पहाड़ियों पर बनाया जाता था जिसे क्षेत्रीय डाइक-सिस्टम की शुरुआत से पहले टेरपेन कहा जाता था।"

predictions, raw_outputs = model.predict([[source, target]])
print(predictions)

-0.3298720717430115


## Word-level Quality Estimation


In [12]:
from transquest.algo.word_level.microtransquest.run_model import MicroTransQuestModel
import torch

model = MicroTransQuestModel("xlmroberta", "TransQuest/microtransquest-en_lv-pharmaceutical-nmt", labels=["OK", "BAD"], use_cuda=torch.cuda.is_available())
source_tags, target_tags = model.predict([["if not , you may not be protected against the diseases . ", "ja tā nav , Jūs varat nepasargāt no slimībām . "]])

print(source_tags)
print(target_tags)



[['OK', 'OK', 'OK', 'OK', 'OK', 'BAD', 'BAD', 'BAD', 'OK', 'OK', 'OK', 'OK']]
[['OK', 'OK', 'OK', 'BAD', 'OK', 'BAD', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'BAD', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK']]


The word-level architecture available in TransQuest is MicroTransQuest.

The input of this model is a concatenation of the original sentence and its translation, separated by the *[SEP]* token. As shown in the figure, the target sentence contains gaps too. The output of each token is then passed through a softmax layer to predict its quality label.


![MicroTransQuest Architecture](https://github.com/TharinduDR/TransQuest/blob/master/docs/images/MicroTransQuest.png?raw=true)
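
The last step of the word-level architecture, turning per-token outputs into `OK`/`BAD` labels with a softmax, can be illustrated in isolation. The sketch below uses made-up logits and plain Python. It is not TransQuest code, and the label order `[OK, BAD]` is an assumption taken from the `labels` argument used earlier.

```python
from math import exp

LABELS = ["OK", "BAD"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def tag_tokens(token_logits):
    """Pick the most probable label for each position."""
    tags = []
    for logits in token_logits:
        probabilities = softmax(logits)
        tags.append(LABELS[probabilities.index(max(probabilities))])
    return tags


# Made-up logits for three positions, ordered as [OK, BAD].
example_logits = [[2.0, -1.0], [0.2, 1.5], [3.1, 0.4]]
print(tag_tokens(example_logits))
```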

Let's train a word-level QE model for English-Latvian. Start by loading the data as before.

In [13]:
import pandas as pd
train = pd.read_csv("https://raw.githubusercontent.com/TharinduDR/NeTTT-2024/main/en_lv_train.tsv", sep="\t")
train[:10]

| | source | target | source_tags | target_tags |
|---|---|---|---|---|
| 0 | Grade 4 ( diffuse or local process causing inf... | 4 . 4 . pakāpe ( difūzs vai lokāls process , k... | OK OK OK OK OK OK OK OK OK OK BAD OK OK BAD BA... | OK BAD OK BAD OK OK OK OK OK OK OK OK OK OK OK... |
| 1 | the studies comparing the chewable tablets wit... | pētījumos salīdzināja košļājamās tabletes ar k... | BAD BAD BAD OK OK OK OK OK OK BAD OK OK BAD OK... | BAD BAD OK OK OK OK OK OK OK OK OK OK OK OK OK... |
| 2 | then start again with a new vial of Fuzeon pow... | pēc tam to sāciet ar jaunu flakonu ar Fuzeon p... | OK BAD BAD OK OK OK BAD OK BAD BAD OK | OK OK OK OK OK BAD OK BAD OK OK OK OK OK BAD O... |
| 3 | this enzyme helps the body control levels of g... | Šis enzīms palīdz kontrolēt glikozīdamīda līme... | OK OK OK OK OK OK OK OK BAD OK | OK OK OK OK OK OK OK OK OK BAD OK OK OK OK OK ... |
| 4 | Ambirix is not recommended for postexposure pr... | Ambirix nav ieteicams lietot profilaksei pēc k... | OK OK OK OK OK BAD BAD OK OK OK OK BAD OK OK | OK OK OK OK OK OK OK OK OK OK OK OK OK BAD OK ... |
| 5 | based on data for other cyp3a 4 inhibitors , p... | pamatojoties uz datiem par citiem cyp3a 4 inhi... | OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK O... | OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK O... |
| 6 | some cases of splenic rupture were fatal ( see... | daži gadījumi par liesas plīsumu bija ar letāl... | OK OK BAD OK OK OK OK OK OK BAD BAD OK OK | OK OK OK OK OK BAD OK OK OK OK OK OK OK OK OK ... |
| 7 | orlistat is a potent , specific and long-actin... | orlistats ir spēcīgs , specifisks un ilgstošas... | OK OK OK OK OK OK OK OK OK OK OK OK OK | OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK O... |
| 8 | sitagliptin with metformin and insulin | sitagliptīns ar metformīnu un insulīnu | OK OK OK OK OK | OK OK OK OK OK OK OK OK OK OK OK |
| 9 | studies have shown that co-administration with... | pētījumi pierāda , ka in - alfa lutropīna ieva... | OK OK OK OK BAD BAD BAD BAD OK OK OK OK OK OK ... | OK OK OK OK OK OK OK OK OK BAD OK BAD OK BAD O... |


Note that the target_tags column also holds word-level quality labels for the gaps in the target. It therefore contains 2N+1 labels, where N is the number of tokens in the target. For more information, see the WMT word-level quality estimation task.
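
The interleaving of gap and word labels can be sketched as follows. This assumes the WMT-style layout in which the tags alternate gap, word, gap, ..., starting and ending with a gap, so even positions are gaps and odd positions are words. `split_target_tags` is a hypothetical helper written for this illustration; the example values are row 8 of the table above.

```python
def split_target_tags(target, target_tags):
    """Split 2N+1 interleaved tags into (token, tag) pairs and gap tags."""
    tokens = target.split()
    tags = target_tags.split()
    expected = 2 * len(tokens) + 1
    if len(tags) != expected:
        raise ValueError(f"expected {expected} tags, got {len(tags)}")
    gap_tags = tags[0::2]   # N+1 gap labels (assumed to sit at even positions)
    word_tags = tags[1::2]  # N word labels (assumed to sit at odd positions)
    return list(zip(tokens, word_tags)), gap_tags


# Row 8 of the training data: 5 target tokens, hence 11 target tags.
target = "sitagliptīns ar metformīnu un insulīnu"
target_tags = " ".join(["OK"] * 11)

word_pairs, gap_tags = split_target_tags(target, target_tags)
print(word_pairs)
print(len(gap_tags), "gap tags")
```

A length check like this is also a cheap way to catch misaligned rows before training.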

Now you can build the QE model with MicroTransQuest. The setup is the same as before.

In [14]:
from multiprocessing import cpu_count

microtransquest_config = {
    'output_dir': 'temp/outputs/',
    "best_model_dir": "temp/outputs/best_model",
    'cache_dir': 'temp/cache_dir/',

    'fp16': False,
    'fp16_opt_level': 'O1',
    'max_seq_length': 200,
    'train_batch_size': 8,
    'gradient_accumulation_steps': 1,
    'eval_batch_size': 8,
    'num_train_epochs': 1,  # change to 3 for best results
    'weight_decay': 0,
    'learning_rate': 2e-5,
    'adam_epsilon': 1e-8,
    'warmup_ratio': 0.1,
    'warmup_steps': 0,
    'max_grad_norm': 1.0,
    'do_lower_case': False,

    'logging_steps': 500,
    'save_steps': 500,
    "no_cache": False,
    "no_save": False,
    "save_recent_only": True,
    'save_model_every_epoch': False,
    'n_fold': 1,
    'evaluate_during_training': True,
    "evaluate_during_training_silent": True,
    'evaluate_during_training_steps': 500,
    "evaluate_during_training_verbose": True,
    'use_cached_eval_features': False,
    "save_best_model": True,
    'save_eval_checkpoints': True,
    'tensorboard_dir': None,
    "save_optimizer_and_scheduler": True,

    'regression': True,

    'overwrite_output_dir': True,
    'reprocess_input_data': True,

    'process_count': cpu_count() - 2 if cpu_count() > 2 else 1,
    'n_gpu': 1,
    'use_multiprocessing': True,
    "multiprocessing_chunksize": 500,
    'silent': False,

    'wandb_project': "En-Lv Word-level QE",
    'wandb_kwargs': {},

    "use_early_stopping": True,
    "early_stopping_patience": 10,
    "early_stopping_delta": 0,
    "early_stopping_metric": "eval_loss",
    "early_stopping_metric_minimize": True,
    "early_stopping_consider_epochs": False,

    "manual_seed": 777,

    "add_tag": False,
    "tag": "_",

    "default_quality": "OK",

    "config": {},
    "local_rank": -1,
    "encoding": None,

    "source_column": "source",
    "target_column": "target",
    "source_tags_column": "source_tags",
    "target_tags_column": "target_tags",
}

In [15]:
from sklearn.model_selection import train_test_split

train_df, eval_df = train_test_split(train, test_size=0.1, random_state=777)
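
As a rough check of what the split and the batch settings imply, the sketch below estimates the number of training steps and evaluation points. It assumes the training file has 12,936 rows, that the 10% evaluation share is rounded up, and that an evaluation also runs at the end of training; all variable names are illustrative.

```python
import math

total_rows = 12936        # assumed number of rows in en_lv_train.tsv
test_size = 0.1           # same value passed to train_test_split
train_batch_size = 8      # 'train_batch_size' in microtransquest_config
eval_batch_size = 8       # 'eval_batch_size'
eval_every = 500          # 'evaluate_during_training_steps'

# The evaluation share is rounded up; the rest is used for training.
eval_rows = math.ceil(test_size * total_rows)
train_rows = total_rows - eval_rows

# With 'gradient_accumulation_steps': 1, one batch is one optimizer step.
steps_per_epoch = math.ceil(train_rows / train_batch_size)
eval_batches = math.ceil(eval_rows / eval_batch_size)

# Evaluations every `eval_every` steps, plus one at the end (assumed).
eval_points = list(range(eval_every, steps_per_epoch, eval_every))
eval_points.append(steps_per_epoch)

print(train_rows, eval_rows, steps_per_epoch, eval_batches, eval_points)
```

Under these assumptions a single epoch takes 1456 steps, which agrees with the step count returned by `train_model` in the next cell.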

In [16]:
from transquest.algo.word_level.microtransquest.run_model import MicroTransQuestModel
import torch

model = MicroTransQuestModel("xlmroberta", "xlm-roberta-base", labels=["OK", "BAD"], use_cuda=torch.cuda.is_available(), args=microtransquest_config)
model.train_model(train_df, eval_df=eval_df)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to us


(1456,
 {'global_step': [500, 1000, 1456],
  'precision': [0.7277305634246144, 0.7039665211062591, 0.7066115702479339],
  'recall': [0.2007118673495963, 0.3358798506814828, 0.3562809271638163],
  'f1_score': [0.31464344039194336, 0.4547751983543932, 0.4737115484503953],
  'train_loss': [0.3364166021347046, 0.25723499059677124, 0.29617074131965637],
  'eval_loss': [0.31962985747758255, 0.2982044253084395, 0.29139049415603097]})
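
The returned metrics can be read more easily with a few lines of Python. The sketch below copies the values from the output above, recomputes F1 as the harmonic mean of precision and recall, and finds the evaluation point with the lowest loss. The helper `f1` is written for this illustration and is not a TransQuest function.

```python
results = {
    "global_step": [500, 1000, 1456],
    "precision": [0.7277305634246144, 0.7039665211062591, 0.7066115702479339],
    "recall": [0.2007118673495963, 0.3358798506814828, 0.3562809271638163],
    "f1_score": [0.31464344039194336, 0.4547751983543932, 0.4737115484503953],
    "eval_loss": [0.31962985747758255, 0.2982044253084395, 0.29139049415603097],
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for step, p, r, reported in zip(results["global_step"], results["precision"],
                                results["recall"], results["f1_score"]):
    print(f"step {step}: recomputed F1 = {f1(p, r):.4f}, reported = {reported:.4f}")

# Index of the evaluation with the lowest eval_loss.
best = min(range(len(results["eval_loss"])), key=results["eval_loss"].__getitem__)
print("lowest eval_loss at step", results["global_step"][best])
```

Precision stays near 0.7 while recall climbs from about 0.20 to 0.36, so the F1 gain over the epoch comes mainly from improved recall.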

Let's test the model on one sentence pair.

In [None]:
from transquest.algo.word_level.microtransquest.run_model import MicroTransQuestModel
import torch

# Load the best checkpoint saved during training
model = MicroTransQuestModel("xlmroberta", microtransquest_config["best_model_dir"],
                             use_cuda=torch.cuda.is_available())

# Each input is a [source, target] pair
source_tags, target_tags = model.predict([["if not , you may not be protected against the diseases . ", "ja tā nav , Jūs varat nepasargāt no slimībām . "]])

print(source_tags)
print(target_tags)


[['OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK']]
[['OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK', 'OK']]
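
This all-OK output differs from what the pre-trained `TransQuest/microtransquest-en_lv-pharmaceutical-nmt` model returned for the same sentence pair earlier. A small helper makes the difference easy to see; `diff_positions` is a hypothetical function written for this comparison, and the tag lists are copied from the two source-side outputs.

```python
def diff_positions(tags_a, tags_b):
    """Return the indices at which two equally long tag lists disagree."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag lists must have the same length")
    return [i for i, (a, b) in enumerate(zip(tags_a, tags_b)) if a != b]


source_tokens = "if not , you may not be protected against the diseases .".split()

# Source-side tags from the pre-trained model and from the model trained above.
pretrained_source = ['OK', 'OK', 'OK', 'OK', 'OK', 'BAD', 'BAD', 'BAD',
                     'OK', 'OK', 'OK', 'OK']
finetuned_source = ['OK'] * 12

changed = diff_positions(pretrained_source, finetuned_source)
print([(i, source_tokens[i]) for i in changed])
```

The pre-trained model flags "not be protected" as BAD, while the model trained here for a single epoch flags nothing. Training for more epochs, as the comment in the configuration suggests, may narrow that gap.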
