<a href="https://colab.research.google.com/github/Dimildizio/system_design/blob/main/NLLB_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Hugging Face Transformers and rouge-score

In [1]:
%%capture
!pip install transformers
!pip install rouge-score

## Imports

In [22]:
import nltk
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from typing import List

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Specify a Hugging Face access token to download the model

In [4]:
access_token = ''  # Put your Hugging Face token here
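
Hard-coding a token in a shared notebook is easy to leak. A minimal sketch of an alternative, assuming the token has been exported as the `HF_TOKEN` environment variable; the helper name `load_hf_token` is hypothetical and not part of the original notebook:

```python
import os


def load_hf_token(var_name: str = "HF_TOKEN"):
    """Return the token stored in the environment, or None if it is unset or empty.

    `load_hf_token` is a hypothetical helper; `HF_TOKEN` is assumed to be
    the variable the token was exported under.
    """
    return os.environ.get(var_name) or None


access_token = load_hf_token()
```

Passing `token=None` to `from_pretrained` simply means that no token is sent, which is enough for public checkpoints.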

## Load tokenizers for the Russian and English corpora

In [5]:
eng_tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", token=access_token)
rus_tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="rus_Cyrl", token=access_token)


# Trying the out-of-the-box model

## Create model

In [6]:
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=access_token)


## Create example data

In [7]:
doc = 'Шустрая бурая лисица прыгает через ленивого пса!'
reference = 'The quick brown fox jumps over the lazy dog!'
g_trans = 'The nimble brown fox jumps over the lazy dog!'

## Tokenize

In [8]:
rus_tok = rus_tokenizer(doc, return_tensors='pt')

## Translate

In [9]:
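# convert_tokens_to_ids also works on transformers versions where lang_code_to_id is no longer available
translated_tokens = model.generate(
    **rus_tok, forced_bos_token_id=rus_tokenizer.convert_tokens_to_ids("eng_Latn"), max_length=30)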

In [10]:
translated = rus_tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]  # batch_decode returns a list; take the first (only) entry

# Metrics

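#### **BLEU** - BiLingual Evaluation Understudy
> Measures n-gram overlap between the translation and the reference.

> Precision is more important: it counts how many generated n-grams appear in the reference.

> Combines the precisions of several n-gram orders (usually 1 to 4) with a geometric mean.

> Accounts for text length through a brevity penalty.

> Typical for machine translation.

> Rewards the model for producing words that match the reference.

> Penalizes translations that are shorter than the reference (brevity penalty); extra words already lower the precision.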
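
The two ingredients of BLEU, clipped n-gram precision and the brevity penalty, can be sketched in plain Python. This is a simplified single-reference illustration without smoothing, not the NLTK implementation used later in the notebook; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, reference, n):
    """Clipped precision: a hypothesis n-gram counts at most as often
    as it occurs in the reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(hyp_len, ref_len):
    """1.0 if the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)


def simple_bleu(hypothesis, reference, max_n=4):
    """Hypothetical helper: geometric mean of 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(hypothesis, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(hypothesis), len(reference)) * math.exp(log_mean)


hyp = "The nimble brown fox jumps over the lazy dog!".split()
ref = "The quick brown fox jumps over the lazy dog!".split()
print(round(simple_bleu(hyp, ref), 4))
```

For this pair the 1- to 4-gram precisions are 8/9, 6/8, 5/7 and 4/6, so the score comes out at roughly 0.75.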


#### **ROUGE** - Recall-Oriented Understudy for Gisting Evaluation

> Focuses on capturing the gist of the reference (Gisting Evaluation).

> Recall is more important (Recall-Oriented).

> ROUGE-N uses n-gram overlap; ROUGE-L uses the longest common subsequence.

> Recall alone doesn't penalize text length.

> Typical for text summarization.

> Rewards the model if the generated text covers the content of the reference overall.

> Longer texts have an advantage for recall.
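
To make the longest-common-subsequence idea concrete, here is a minimal ROUGE-L sketch in plain Python. It assumes whitespace tokens and skips the stemming that `rouge_score` can apply; `lcs_length` and `rouge_l` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(hypothesis, reference):
    """Hypothetical helper: ROUGE-L precision, recall and F1 for token lists."""
    lcs = lcs_length(hypothesis, reference)
    precision = lcs / len(hypothesis) if hypothesis else 0.0
    recall = lcs / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


hyp = "a shrewd brown fox jumps over a lazy dog".split()
ref = "the quick brown fox jumps over the lazy dog".split()
print(rouge_l(hyp, ref))
```

Here the common subsequence is "brown fox jumps over lazy dog" (6 of 9 tokens), so precision, recall and F1 are all 6/9.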


#### **METEOR** - Metric for Evaluation of Translation with Explicit ORdering

> Takes word order into account through a fragmentation penalty.

> Uses stemming and synonym matching to handle paraphrasing.

> Based on a harmonic mean of unigram precision and recall, weighted toward recall.

> Uses unigrams (single words) and matches synonyms through the preloaded WordNet dictionary.

> More robust to variations in wording.

> Has no explicit length penalty.

> Typical for machine translation and text summarization.

> Again: more flexible than plain word overlap because of synonym matching.
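
For orientation, the original METEOR formulation can be written as follows. The symbols are introduced here for illustration: $m$ is the number of matched unigrams, $|h|$ and $|r|$ are the hypothesis and reference lengths, and $ch$ is the number of contiguous matched chunks. NLTK's `meteor_score` exposes the constants as the parameters `alpha`, `beta` and `gamma`, whose defaults should correspond to the values below.

$$
P = \frac{m}{|h|}, \qquad R = \frac{m}{|r|}, \qquad F_{mean} = \frac{10\,P\,R}{R + 9\,P}
$$

$$
\text{Penalty} = 0.5 \left(\frac{ch}{m}\right)^{3}, \qquad \text{METEOR} = F_{mean}\,(1 - \text{Penalty})
$$

Fewer, longer chunks mean the matched words appear in the same order as in the reference, which lowers the penalty.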


#### **TER** - Translation Edit Rate

> Based on the **number of edits** needed to turn the hypothesis into the reference sentence. Lower is better.

> Basically quantifies the dissimilarity between the reference and the translation.

> Possible edits: deletion, substitution, insertion, and shifting of word sequences.

> Used in machine translation and text summarization.

> Not included in NLTK, so this notebook implements a simplified version by hand.

> More complex to compute because of the shift operation.

> Standard TER divides the edit count by the reference length. The version in this notebook reports the raw edit count instead, so it is not a percentage and is harder to interpret. 0 edits is the best.
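
As a point of comparison, a word-level edit rate normalized by the reference length can be sketched as below. This is still a simplification: it leaves out the shift operation of full TER, and the names `word_edit_distance` and `simple_ter` are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Levenshtein distance over tokens (insert, delete, substitute; no shifts)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # token missing from the hypothesis
                            curr[j - 1] + 1,      # extra token in the hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def simple_ter(hypothesis: str, reference: str) -> float:
    """Hypothetical helper: word-level edits divided by the reference length."""
    hyp_tokens, ref_tokens = hypothesis.split(), reference.split()
    if not ref_tokens:
        return 0.0 if not hyp_tokens else float("inf")
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


print(simple_ter("a shrewd brown fox jumps over a lazy dog!",
                 "the quick brown fox jumps over the lazy dog!"))
```

The example needs three substitutions against a nine-word reference, so the rate is 3/9, which is easier to compare across sentences of different lengths than a raw edit count.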




In [11]:
Rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [12]:
smoothing_zero_ngrams =  SmoothingFunction()

In [13]:
def round_perc(num: float) -> float:
  return round(num*100, 2)

def get_sent_bleu(sentence: str, reference: str=reference) -> float:
  '''Sentence-level BLEU over 1- to 3-grams'''
  # weights[i] is the weight of the (i+1)-gram precision, so three weights cover n-grams up to n=3
  score = sentence_bleu([reference.split()], sentence.split(),
                        weights=(0.25, 0.5, 0.25), smoothing_function=smoothing_zero_ngrams.method1)
  return round_perc(score)

def get_bleu(sentence: str, reference: str=reference) -> float:
  score = corpus_bleu([[reference.split()]], [sentence.split()],
                      smoothing_function=smoothing_zero_ngrams.method1)
  return round_perc(score)

def get_meteor(sentence: str, reference: str=reference) -> float:
  score = meteor_score([reference.split()], sentence.split())
  return round_perc(score)

def get_rouge(sentence: str, reference: str=reference) -> dict:
  '''rouge-1 unigrams, individual words
     rouge-2 bigrams, word pairs
     rouge-L longest common subsequence'''
  scores = Rouge.score(reference, sentence)
  rouge_dict = {}
  for key, score in scores.items():
    # Format each value as a percentage string
    rouge_dict[key] = {'precision': str(round_perc(score.precision))+'%',
                       'recall': str(round_perc(score.recall))+'%',
                       'f1': str(round_perc(score.fmeasure))+'%'}
  return rouge_dict

In [14]:
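def ter(hypothesis, reference=reference):
    '''Simplified TER: character-level Levenshtein distance, returned as a raw edit count.
       Unlike standard TER it works on characters rather than words, has no shift
       operation and is not normalized by the reference length.'''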
    n = len(reference)
    m = len(hypothesis)

    # Init matrix for dynamic programming
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    # Init first row and column
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j

    # Fill in the DP matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[n][m]

### Test metrics

In [15]:
print(f'Reference: {reference}\n\n')
for translation in (g_trans, translated, reference):
  score = get_bleu(translation, reference)
  print(f"Translation: {translation}\nScore: {score}\n")

Reference: The quick brown fox jumps over the lazy dog!


Translation: The nimble brown fox jumps over the lazy dog!
Score: 75.06

Translation: A shrewd brown fox jumps over a lazy dog!
Score: 35.49

Translation: The quick brown fox jumps over the lazy dog!
Score: 100.0



## Flow

In [28]:
class MachineTranslation:

  def __init__(self, model, tokenizer, target_lang='eng_Latn', sent_len=300):
    self.model=model
    self.tokenizer = tokenizer
    self.to_lang = target_lang
    self.sent_len = sent_len
    self.metrics_dict = {'TER':ter,
                        'BLEU corpus':get_bleu,
                        'BLEU sentence': get_sent_bleu,
                        'METEOR': get_meteor,
                        'ROUGE': get_rouge
                         }


  def tokenize(self, sent: str):
    '''Tokenize input sentence'''
    return self.tokenizer(sent, return_tensors='pt')


  def translate(self, inputs):
    '''
    Generate translation
    '''
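    # convert_tokens_to_ids also works on transformers versions where lang_code_to_id is no longer available
    return self.model.generate(
      **inputs, forced_bos_token_id=self.tokenizer.convert_tokens_to_ids(self.to_lang),
      max_length=self.sent_len)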


  def get_decoded(self, toks) -> list:
    '''
    Decode generated token ids into sentences
    '''
    return self.tokenizer.batch_decode(toks, skip_special_tokens=True)


  def generate_metrics(self, translation: str, reference: str) -> None:
    '''
    Compute every metric in metrics_dict for the translation against the reference and print the scores
    '''
    print(f'Reference: {reference}\nTranslation: {translation}\n')
    for name, func in self.metrics_dict.items():
      score = func(translation, reference)
      self.print_metrics(translation, name, score)


  def print_metrics(self, translation, metrics_name, score):
    perc_sign = ' edits' if metrics_name == 'TER' else '%'
    print(f"{metrics_name} score: {score}{perc_sign}")


  def process_sentence(self, sent: str):
    '''
    main process for translation
    '''
    tokens = self.tokenize(sent)
    translated_tokens = self.translate(tokens)
    result = self.get_decoded(translated_tokens)
    return result


  def infer(self, sent: str, reference: str) -> None:
    ''' TO BE CHANGED
    Compare first sentence of the doc to the reference
    '''
    translation = self.process_sentence(sent)
    self.generate_metrics(translation[0].lower(), reference.lower())

In [17]:
MT = MachineTranslation(model, rus_tokenizer)

In [18]:
MT.infer(doc, reference)

Reference: the quick brown fox jumps over the lazy dog!
Translation: a shrewd brown fox jumps over a lazy dog!

TER score: 12 edits
BLEU corpus score: 35.49%
BLEU sentence score: 46.71%
METEOR score: 65.43%
ROUGE score: {'rouge1': {'precision': '66.67%', 'recall': '66.67%', 'f1': '66.67%'}, 'rouge2': {'precision': '50.0%', 'recall': '50.0%', 'f1': '50.0%'}, 'rougeL': {'precision': '66.67%', 'recall': '66.67%', 'f1': '66.67%'}}%


#### Sample from dataset

In [19]:
a_few_sentences_rus = [
'НЕЙТРОННАЯ РЕФЛЕКТОМЕТРИЯ В РОССИИ: ТЕКУЩЕЕ СОСТОЯНИЕ И ПЕРСПЕКТИВЫ',
'В обзоре дано описание текущего состояния дел и перспектив развития в области нейтронной рефлектометрии на действующих и будущих нейтронных источниках Российской Федерации.',
'В результате ввода в эксплуатацию новых инструментов на реакторах ИР-8 и ПИК число нейтронных рефлектометров в РФ должно удвоиться.',
'В результате должен появиться набор инструментов, нацеленных на решение широкого круга задач в области физики, химии, биологии слоистых систем в интересах научного сообщества, а также для подготовки специалистов для дальнейшего развития и совершенствования данной методики.'
]

a_few_sentences_eng = [
'Neutron Reflectometry in Russia: Current State and Prospects',
'The review is devoted to the current state of affairs and prospects for development in the field of neutron reflectometry on the existing and future neutron sources in the Russian Federation.',
'Due to the commissioning of new instruments at the IR-8 and PIK reactors, the number of neutron reflectometers in the Russian Federation should double.',
'As a result, there must arise a set of instruments aimed at solving various problems in the fields of physics, chemistry, and biology of layered systems in the interests of the scientific community and to train experts for further development and improvement of this technique.'
]

pairs = list(zip(a_few_sentences_rus, a_few_sentences_eng))

In [30]:
MT = MachineTranslation(model, rus_tokenizer)
for pair in pairs:
  rus, eng = [sent.lower() for sent in pair]
  MT.infer(rus, eng)
  print('\n\n')

Reference: neutron reflectometry in russia: current state and prospects
Translation: neutron reflexometry in russia: current state and prospects

TER score: 2 edits
BLEU corpus score: 70.71%
BLEU sentence score: 73.86%
METEOR score: 86.48%
ROUGE score: {'rouge1': {'precision': '87.5%', 'recall': '87.5%', 'f1': '87.5%'}, 'rouge2': {'precision': '71.43%', 'recall': '71.43%', 'f1': '71.43%'}, 'rougeL': {'precision': '87.5%', 'recall': '87.5%', 'f1': '87.5%'}}%



Reference: the review is devoted to the current state of affairs and prospects for development in the field of neutron reflectometry on the existing and future neutron sources in the russian federation.
Translation: the review describes the current state of affairs and development prospects in the field of neutron reflectorometry at the current and future neutron sources of the russian federation.

TER score: 45 edits
BLEU corpus score: 40.7%
BLEU sentence score: 50.2%
METEOR score: 64.53%
ROUGE score: {'rouge1': {'precision': '8

### Try using the whole text as a single entry

In [34]:
class TextMachineTranslation(MachineTranslation):
  def process_text(self, text: List[str]):
    '''
    main process for multi-sentence translation
    '''
    tokens = [self.tokenize(sent) for sent in text]
    translated_tokens = [self.translate(token) for token in tokens]
    translations = [self.get_decoded(toks)[0] for toks in translated_tokens]
    result = ' '.join(translations)
    return result


  def infer(self, text: List[str], reference: List[str]) -> None:
    ''' TO BE CHANGED
    Compare the whole text to the reference
    '''
    translation = self.process_text(text)
    reference = ' '.join(reference)
    self.generate_metrics(translation.lower(), reference.lower())

In [35]:
TMT = TextMachineTranslation(model, rus_tokenizer)
TMT.infer(a_few_sentences_rus, a_few_sentences_eng)

Reference: neutron reflectometry in russia: current state and prospects the review is devoted to the current state of affairs and prospects for development in the field of neutron reflectometry on the existing and future neutron sources in the russian federation. due to the commissioning of new instruments at the ir-8 and pik reactors, the number of neutron reflectometers in the russian federation should double. as a result, there must arise a set of instruments aimed at solving various problems in the fields of physics, chemistry, and biology of layered systems in the interests of the scientific community and to train experts for further development and improvement of this technique.
Translation: neutron reflectometry in russia: the most constant and persectable the review describes the current state of affairs and development prospects in the field of neutron reflectorometry at the current and future neutron sources of the russian federation. as a result of the commissioning of new i

## Issues

1. Choose proper metrics.

2. Evaluate the metrics for the whole text rather than for each sentence, i.e. use a cumulative (corpus-level) metric or average the sentence scores.
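
The two options in the second point give different numbers, which a toy example can show. The sketch below uses clipped unigram precision as a stand-in metric; `macro_average` (mean of sentence scores) and `micro_average` (counts pooled over the whole text) are hypothetical helpers, not part of the notebook.

```python
from collections import Counter


def unigram_matches(hypothesis: str, reference: str):
    """Return (clipped matches, hypothesis length) for one sentence pair."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    matched = sum((hyp & ref).values())  # Counter & Counter keeps the minimum counts
    return matched, sum(hyp.values())


def macro_average(hypotheses, references):
    """Average of the sentence-level precisions (every sentence weighs the same)."""
    scores = []
    for hyp, ref in zip(hypotheses, references):
        matched, total = unigram_matches(hyp, ref)
        scores.append(matched / total if total else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def micro_average(hypotheses, references):
    """Cumulative precision: counts are pooled first (longer sentences weigh more)."""
    pairs = [unigram_matches(h, r) for h, r in zip(hypotheses, references)]
    matched = sum(m for m, _ in pairs)
    total = sum(t for _, t in pairs)
    return matched / total if total else 0.0


hyps = ["a b", "a b c d"]
refs = ["a b", "x y c d"]
print(macro_average(hyps, refs), micro_average(hyps, refs))
```

The sentence precisions are 1.0 and 0.5, so the average is 0.75, while pooling gives 4 matches over 6 tokens, about 0.67. Corpus-level BLEU pools counts in the same way as the second variant.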
