<a href="https://colab.research.google.com/github/Dimildizio/system_design/blob/main/NLLB_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install huggingface lib

In [1]:
%%capture
!pip install transformers rouge-score sacrebleu sentencepiece

## Imports

In [74]:
import nltk
import pandas as pd
import sentencepiece as sp_module
import sacrebleu
import urllib.request
import io

from google.colab import files
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from sacrebleu.metrics import BLEU, CHRF
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianTokenizer, MarianMTModel
from typing import List

In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### Download data samples

In [4]:
%%capture
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/gtrans.txt
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/orig.txt
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/reference.txt
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/translation.txt

### Download SentencePiece training corpus

In [5]:
%%capture
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt

## Specify huggingface access token to download model

In [54]:
access_token = ''  # Put your Hugging Face token here

## Download tokenizers for the Russian and English corpora

In [7]:
eng_tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", token=access_token)
rus_tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="rus_Cyrl", token=access_token)


# Trying the out-of-the-box model

## Create model

In [8]:
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=access_token)


## Create example data

In [9]:
doc = 'Шустрая бурая лисица прыгает через ленивого пса!'
reference = 'The quick brown fox jumps over the lazy dog!'
g_trans = 'The nimble brown fox jumps over the lazy dog!'

## Tokenize

In [10]:
rus_tok = rus_tokenizer(doc, return_tensors='pt')

### SentencePiece

In [11]:
with open("botchan.txt", "rb") as f:
    text_data = f.read()

In [12]:
model_sp = io.BytesIO()
sp_vocab_size = 1000  #should consider enlarging

sp_module.SentencePieceTrainer.train(sentence_iterator=io.BytesIO(text_data),
                                     model_writer=model_sp,
                                     vocab_size=sp_vocab_size)

sp_tokenizer = sp_module.SentencePieceProcessor(model_proto=model_sp.getvalue())

In [13]:
#with open('out.model', 'wb') as f:
#   f.write(model_sp.getvalue())
#sp_processor = sp_module.SentencePieceProcessor()
#sp_processor.load('out.model')

In [14]:
sp_tokens = sp_tokenizer.encode_as_pieces(doc)

In [15]:
sp_tokens

['▁',
 'Шустрая',
 '▁',
 'бурая',
 '▁',
 'лисица',
 '▁',
 'прыгает',
 '▁',
 'через',
 '▁',
 'ленивого',
 '▁',
 'пса',
 '!']

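The SentencePiece tokenizer exposes a different interface from `AutoTokenizer`. Note also that every Russian word above comes out as a single piece preceded by a bare `▁`, presumably because this model was trained on the English text in `botchan.txt`.

TODO: decide whether to support it in the MT class or to write a new class that makes the SentencePiece tokenizer and the NLLB model work together.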

### Translate

In [16]:
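# convert_tokens_to_ids is used instead of lang_code_to_id, which is deprecated in newer transformers releases
translated_tokens = model.generate(
    **rus_tok, forced_bos_token_id=rus_tokenizer.convert_tokens_to_ids("eng_Latn"), max_length=30)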

In [17]:
translated = rus_tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]  # batch_decode returns one string per input; take the first

# Metrics

#### **BLEU** - BiLingual Evaluation Understudy

> Cares mostly about word overlap.

> Precision is more important.

> Uses n-grams for evaluation.

> Accounts for text length through a brevity penalty.

> Typical for machine translation.

> Rewards the model for producing words and n-grams that match the reference.

> Penalizes translations that are shorter than the reference, not longer ones.
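
To make the two ingredients of BLEU concrete, here is a minimal pure-Python sketch of clipped n-gram precision and the brevity penalty. The helper names (`ngrams`, `clipped_precision`, `brevity_penalty`) are illustrative and not part of nltk or sacrebleu; whitespace tokenization is assumed.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hyp, ref, n):
    """Share of hypothesis n-grams found in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0


def brevity_penalty(hyp_len, ref_len):
    """1.0 when the hypothesis is at least as long as the reference, below 1.0 otherwise."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)


ref_tokens = 'the quick brown fox jumps over the lazy dog'.split()
hyp_tokens = 'a shrewd brown fox jumps over a lazy dog'.split()

p1 = clipped_precision(hyp_tokens, ref_tokens, 1)      # 6 of 9 unigrams match
p2 = clipped_precision(hyp_tokens, ref_tokens, 2)      # 4 of 8 bigrams match
bp = brevity_penalty(len(hyp_tokens), len(ref_tokens))  # equal lengths, no penalty
print(p1, p2, bp)
```

A hypothesis of 5 tokens against the same 9-token reference would get `brevity_penalty(5, 9)`, which is below 1, so it is the short output that gets penalized.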


#### **ROUGE** - Recall-Oriented Understudy for Gisting Evaluation

> Focused on capturing the gist of the reference (Gisting Evaluation).

> Recall is more important (Recall-Oriented).

> ROUGE-1 and ROUGE-2 use unigram and bigram overlap; ROUGE-L uses the longest common subsequence.

> Recall by itself is not normalized for the length of the generated text.

> Typical for text summarization.

> Rewards the model if the generated text covers the content of the reference in general.

> Longer texts have an advantage for recall.
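
The longest-common-subsequence idea behind ROUGE-L can be sketched in a few lines. This is a simplified illustration on whitespace tokens, without the stemming and tokenization that `rouge_score` applies; `lcs_length` and `rouge_l` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(hyp_tokens, ref_tokens):
    """Return (precision, recall, f1) based on the LCS length."""
    lcs = lcs_length(hyp_tokens, ref_tokens)
    precision = lcs / len(hyp_tokens) if hyp_tokens else 0.0
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


ref_tokens = 'the quick brown fox jumps over the lazy dog'.split()
hyp_tokens = 'a shrewd brown fox jumps over a lazy dog'.split()

# The LCS is "brown fox jumps over lazy dog": 6 tokens out of 9 on each side.
rl_precision, rl_recall, rl_f1 = rouge_l(hyp_tokens, ref_tokens)
print(round(rl_precision, 2), round(rl_recall, 2), round(rl_f1, 2))
```

Precision divides the LCS length by the hypothesis length and recall divides it by the reference length, which is why a longer hypothesis can only help recall.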


#### **METEOR** - Metric for Evaluation of Translation with Explicit ORdering

> Takes word order into account (through a fragmentation penalty).

> Uses stemming and synonym matching to handle synonyms and paraphrasing.

> Combines precision and recall into a harmonic mean that weights recall more heavily.

> Matches unigrams (single words), including synonyms from the preloaded WordNet dictionary.

> More robust to variations in wording.

> Has no explicit length penalty; length affects the score only through precision and recall.

> Typical for machine translation and text summarization.

> Overall more flexible than plain word overlap thanks to synonyms, but also more complex.
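
How the recall-weighted mean and the word-order penalty combine can be sketched as follows. This is a stripped-down illustration that only matches identical tokens (no stemming, no WordNet) and uses a greedy alignment, so it is not a replacement for `nltk`'s `meteor_score`. The function names are hypothetical, and the default `alpha`, `beta`, `gamma` values are assumed to mirror the NLTK defaults.

```python
def align_exact(hyp, ref):
    """Greedily pair each hypothesis token with the first unused identical reference token."""
    used = set()
    matches = []
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if j not in used and token == ref_token:
                used.add(j)
                matches.append((i, j))
                break
    return matches


def count_chunks(matches):
    """Count runs of matches that are adjacent in both hypothesis and reference."""
    chunks = 0
    prev = None
    for i, j in matches:
        if prev is None or (i, j) != (prev[0] + 1, prev[1] + 1):
            chunks += 1
        prev = (i, j)
    return chunks


def simple_meteor(hypothesis, reference, alpha=0.9, beta=3.0, gamma=0.5):
    """Exact-match METEOR-style score: recall-weighted F-mean times (1 - penalty)."""
    hyp, ref = hypothesis.split(), reference.split()
    matches = align_exact(hyp, ref)
    if not matches:
        return 0.0
    precision = len(matches) / len(hyp)
    recall = len(matches) / len(ref)
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * (count_chunks(matches) / len(matches)) ** beta
    return (1 - penalty) * fmean


meteor_demo = simple_meteor('a shrewd brown fox jumps over a lazy dog!',
                            'the quick brown fox jumps over the lazy dog!')
print(round(meteor_demo, 4))  # about 0.654: 6 matches in 2 chunks
```

The fewer chunks the matches fall into, the smaller the penalty, which is how word order enters the score.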


#### **TER** - Translation Edit Rate

> Represents the **number of edits** needed to get from the hypothesis to the reference sentence, usually divided by the reference length. Lower is better.

> Basically quantifies the dissimilarity between the reference and the translation.

> Possible edits include deletions, substitutions, insertions and shifts.

> Used in machine translation and text summarization.

> Not included in nltk; sacrebleu provides an implementation (`sacrebleu.metrics.TER`).

> More complex to compute.

> A raw edit count is not a percentage and is difficult to interpret. 0 edits is the best.
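
A rough idea of an edit rate can be obtained from a word-level edit distance divided by the reference length. The sketch below deliberately leaves out the shift operation, so it is only an approximation of TER; `word_edit_distance` and `edit_rate` are hypothetical names, and whitespace tokenization is assumed.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_token in enumerate(hyp, 1):
        curr = [i]
        for j, ref_token in enumerate(ref, 1):
            cost = 0 if hyp_token == ref_token else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def edit_rate(hypothesis, reference):
    """Word edits needed to reach the reference, divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError('reference must contain at least one token')
    return word_edit_distance(hyp, ref) / len(ref)


rate = edit_rate('a shrewd brown fox jumps over a lazy dog',
                 'the quick brown fox jumps over the lazy dog')
print(round(rate, 3))  # 3 substitutions over 9 reference words
```

Dividing by the reference length turns the edit count into a rate that can be compared across sentences of different lengths.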




In [18]:
Rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [19]:
smoothing_zero_ngrams = SmoothingFunction()

In [20]:
def round_perc(num: float) -> float:
  return round(num*100, 2)

def get_sent_bleu(sentence: str, reference: str=reference) -> float:
  '''Sentence-level BLEU over 1- to 3-grams'''
  # One weight per n-gram order: unigrams 0.25, bigrams 0.5, trigrams 0.25
  score = sentence_bleu([reference.split()], sentence.split(),
                        weights=(0.25, 0.5, 0.25), smoothing_function=smoothing_zero_ngrams.method1)
  return round_perc(score)

def get_bleu(sentence: str, reference: str=reference) -> float:
  score = corpus_bleu([[reference.split()]], [sentence.split()],
                      smoothing_function=smoothing_zero_ngrams.method1)
  return round_perc(score)

def get_sacrebleu(sentence, reference):
  #bleu = BLEU()
  result = sacrebleu.corpus_bleu([sentence], [[reference]])
  #print(bleu.get_signature())
  return result

def get_chrf(sentence, reference):
  # CHRF() defaults to word_order=0, which gives chrF2; chrF++ would need word_order=2
  chrf = CHRF()
  return chrf.corpus_score([sentence], [[reference]])


def get_meteor(sentence: str, reference: str=reference) -> float:
  score = meteor_score([reference.split()], sentence.split())
  return round_perc(score)

def get_rouge(sentence: str, reference: str=reference):
  '''rouge-1 unigrams, individual words
     rouge-2 bigrams, word pairs
     rouge-L longest common subsequence'''
  scores = Rouge.score(reference, sentence)
  rouge_dict = {}
  for key, score in scores.items():
    # Format each value as a percentage string, e.g. '66.67%'
    rouge_dict[key] = {'precision': str(round_perc(score.precision))+'%',
                       'recall': str(round_perc(score.recall))+'%',
                       'f1': str(round_perc(score.fmeasure))+'%'}
  return rouge_dict

In [117]:
def eval_rouge(translations, references):
  '''Print ROUGE precision, recall and F-measure averaged over all sentence pairs'''
  rouge_dict = {'rouge1':{'precision':0, 'recall':0, 'f_measure':0},
                'rouge2':{'precision':0, 'recall':0, 'f_measure':0},
                'rougeL':{'precision':0, 'recall':0, 'f_measure':0}
                }
  for translation, ref in zip(translations, references):
    rouges = Rouge.score(ref, translation)
    for key in rouges.keys():
      rouge_dict[key]['precision'] += rouges[key].precision
      rouge_dict[key]['recall'] += rouges[key].recall
      rouge_dict[key]['f_measure'] += rouges[key].fmeasure
  total = max(len(translations), 1)  # avoid relying on the loop variable, and on empty input
  for r in rouge_dict.keys():
    for metric in rouge_dict[r]:
      print(f'{r}: {metric}: {round(rouge_dict[r][metric]/total, 2)}')
eval_rouge([translated], [reference])

rouge1: precision: 0.67
rouge1: recall: 0.67
rouge1: f_measure: 0.67
rouge2: precision: 0.5
rouge2: recall: 0.5
rouge2: f_measure: 0.5
rougeL: precision: 0.67
rougeL: recall: 0.67
rougeL: f_measure: 0.67


In [21]:
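def ter(hypothesis, reference=reference):
    '''Plain Levenshtein distance, not full TER: no shifts and no normalization.
    Given strings it counts character edits; pass token lists for word-level edits.'''
    n = len(reference)
    m = len(hypothesis)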

    # Init matrix for dynamic programming
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    # Init first row and column
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j

    # Fill in the DP matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[n][m]

### Test metrics

In [39]:
def test_metrics(reference, list_of_trans, transname=('google translate', 'machine translation')):
  #print(f'Reference: {reference}\n\n')
  for name, func in zip(['sent_bleu', 'sacrebleu', 'sacre_chrf','corpus_bleu', 'meteor', 'rouge'],   #, 'ter'],
                        [get_sent_bleu, get_sacrebleu, get_chrf, get_bleu, get_meteor, get_rouge]): #, ter]):
    print('Metric:', name)
    symbol = '%'# if name != 'ter' else ' edits'
    for num in range(len(list_of_trans)):
      score = func(list_of_trans[num], reference)
      print(f"Translation: {transname[num]}: {score}{symbol}")
    print()

## Upload files to compare

In [23]:
def onefile(names):
  '''Read each <name>.txt and return its text with newlines and tabs removed'''
  to_compare = []
  for name in names:
    with open(name+'.txt', encoding='utf-8') as f:
      text = f.read()
    to_compare.append(text.replace('\n', '').replace('\t', ''))
  return to_compare

In [24]:
filenames = ['orig', 'gtrans', 'translation', 'reference']
orig, gtrans, trans, ref = onefile(filenames)

In [25]:
test_metrics(ref, [gtrans, trans])

Metric: sent_bleu
Translation: google translate: 44.58%
Translation: machine translation: 31.81%

Metric: sacrebleu
Translation: google translate: BLEU = 46.70 83.0/55.9/37.9/27.1 (BP = 1.000 ratio = 1.008 hyp_len = 3528 ref_len = 3501)%
Translation: machine translation: BLEU = 32.78 79.1/44.8/25.0/15.3 (BP = 0.961 ratio = 0.962 hyp_len = 3368 ref_len = 3501)%

Metric: sacre_chrf
Translation: google translate: chrF2 = 77.19%
Translation: machine translation: chrF2 = 69.53%

Metric: corpus_bleu
Translation: google translate: 35.96%
Translation: machine translation: 23.07%

Metric: meteor
Translation: google translate: 51.87%
Translation: machine translation: 40.99%

Metric: rouge
Translation: google translate: {'rouge1': {'precision': '86.07%', 'recall': '86.93%', 'f1': '86.5%'}, 'rouge2': {'precision': '55.29%', 'recall': '55.84%', 'f1': '55.56%'}, 'rougeL': {'precision': '67.4%', 'recall': '68.07%', 'f1': '67.73%'}}%
Translation: machine translation: {'rouge1': {'precision': '81.98%',

## Flow

In [26]:
class MachineTranslation:

  def __init__(self, model, tokenizer, target_lang='eng_Latn', sent_len=300):
    self.model=model
    self.tokenizer = tokenizer
    self.to_lang = target_lang
    self.sent_len = sent_len
    self.metrics_dict = {'TER':ter,
                        'BLEU corpus':get_bleu,
                        'BLEU sentence': get_sent_bleu,
                         'sacre BLEU':get_sacrebleu,
                         'sacre_CHRF++': get_chrf,  # get_chrf uses CHRF() defaults, so the reported score is chrF2
                        'METEOR': get_meteor,
                        'ROUGE': get_rouge,
                        }


  def tokenize(self, sent: str):
    '''Tokenize input sentence'''
    return self.tokenizer(sent, return_tensors='pt')


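  def translate(self, inputs):
    '''
    Generate translation
    '''
    # convert_tokens_to_ids is used instead of lang_code_to_id, which is deprecated in newer transformers releases
    return self.model.generate(
      **inputs, forced_bos_token_id=self.tokenizer.convert_tokens_to_ids(self.to_lang),
      max_length=self.sent_len)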


  def get_decoded(self, toks) -> list:
    '''
    Convert vect tokens into sentences
    '''
    return self.tokenizer.batch_decode(toks, skip_special_tokens=True)


  def generate_metrics(self, translation: str, reference: str) -> None:
    '''
    Score the translation against the reference with every metric in metrics_dict
    '''
    #print(f'Reference: {reference}\nTranslation: {translation}\n')
    for name, func in self.metrics_dict.items():
      score = func(translation.lower(), reference.lower())
      self.print_metrics(translation, name, score)


  def print_metrics(self, translation, metrics_name, score):
    if metrics_name == 'TER':
      perc_sign = ' edits'
    elif metrics_name in ['sacre BLEU', 'ROUGE']:
      perc_sign = ''
    else:
      perc_sign = '%'

    print(f"{metrics_name} score: {score}{perc_sign}")


  def process_sentence(self, sent: str):
    '''
    main process for translation
    '''
    tokens = self.tokenize(sent)
    translated_tokens = self.translate(tokens)
    result = self.get_decoded(translated_tokens)
    return result


  def infer(self, sent: str, reference: str) -> None:
    ''' TO BE CHANGED
    Compare first sentence of the doc to the reference
    '''
    translation = self.process_sentence(sent)
    print('Translated:', translation[0])
    self.generate_metrics(translation[0].lower(), reference.lower())


In [27]:
MT = MachineTranslation(model, rus_tokenizer)

In [28]:
MT.infer(doc, reference)

Translated: A shrewd brown fox jumps over a lazy dog!
TER score: 12 edits
BLEU corpus score: 35.49%
BLEU sentence score: 46.71%
sacre BLEU score: BLEU = 37.99 70.0/55.6/37.5/14.3 (BP = 1.000 ratio = 1.000 hyp_len = 10 ref_len = 10)
sacre_CHRF++ score: chrF2 = 61.36%
METEOR score: 65.43%
ROUGE score: {'rouge1': {'precision': '66.67%', 'recall': '66.67%', 'f1': '66.67%'}, 'rouge2': {'precision': '50.0%', 'recall': '50.0%', 'f1': '50.0%'}, 'rougeL': {'precision': '66.67%', 'recall': '66.67%', 'f1': '66.67%'}}


In [29]:
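# NB: the reference starts with a Cyrillic 'А' (U+0410), not a Latin 'A', so that token can never match
MT.infer('Ложка дёгтя в бочке меда', 'А fly in the ointment')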

Translated: A spoonful of honey in a barrel of honey
TER score: 29 edits
BLEU corpus score: 2.4%
BLEU sentence score: 2.23%
sacre BLEU score: BLEU = 4.77 11.1/6.2/3.6/2.1 (BP = 1.000 ratio = 1.800 hyp_len = 9 ref_len = 5)
sacre_CHRF++ score: chrF2 = 11.85%
METEOR score: 9.26%
ROUGE score: {'rouge1': {'precision': '11.11%', 'recall': '25.0%', 'f1': '15.38%'}, 'rouge2': {'precision': '0.0%', 'recall': '0.0%', 'f1': '0.0%'}, 'rougeL': {'precision': '11.11%', 'recall': '25.0%', 'f1': '15.38%'}}


#### Sample from dataset

In [30]:
a_few_sentences_rus = [
'НЕЙТРОННАЯ РЕФЛЕКТОМЕТРИЯ В РОССИИ: ТЕКУЩЕЕ СОСТОЯНИЕ И ПЕРСПЕКТИВЫ',
'В обзоре дано описание текущего состояния дел и перспектив развития в области нейтронной рефлектометрии на действующих и будущих нейтронных источниках Российской Федерации.',
'В результате ввода в эксплуатацию новых инструментов на реакторах ИР-8 и ПИК число нейтронных рефлектометров в РФ должно удвоиться.',
'В результате должен появиться набор инструментов, нацеленных на решение широкого круга задач в области физики, химии, биологии слоистых систем в интересах научного сообщества, а также для подготовки специалистов для дальнейшего развития и совершенствования данной методики.'
]

a_few_sentences_eng = [
'Neutron Reflectometry in Russia: Current State and Prospects',
'The review is devoted to the current state of affairs and prospects for development in the field of neutron reflectometry on the existing and future neutron sources in the Russian Federation.',
'Due to the commissioning of new instruments at the IR-8 and PIK reactors, the number of neutron reflectometers in the Russian Federation should double.',
'As a result, there must arise a set of instruments aimed at solving various problems in the fields of physics, chemistry, and biology of layered systems in the interests of the scientific community and to train experts for further development and improvement of this technique.'
]

a_few_sentences_gtrans = ['NEUTRON REFLECTOMETRY IN RUSSIA: CURRENT STATUS AND PROSPECTS',
'The review describes the current state of affairs and prospects for development in the field of neutron reflectometry at existing and future neutron sources in the Russian Federation.',
'As a result of the commissioning of new instruments at the IR-8 and PIK reactors, the number of neutron reflectometers in Russia should double.',
'The result should be a set of tools aimed at solving a wide range of problems in the field of physics, chemistry, biology of layered systems in the interests of the scientific community, as well as training specialists for further development and improvement of this technique.']

pairs = list(zip(a_few_sentences_rus, a_few_sentences_eng))

#### Metrics for each sentence

In [31]:
MT = MachineTranslation(model, rus_tokenizer)
for pair in pairs:
  print('Source:', pair[0])
  print('Reference:', pair[1])
  rus, eng = [sent.lower() for sent in pair]
  MT.infer(rus, eng)
  print('\n\n')

Source: НЕЙТРОННАЯ РЕФЛЕКТОМЕТРИЯ В РОССИИ: ТЕКУЩЕЕ СОСТОЯНИЕ И ПЕРСПЕКТИВЫ
Reference: Neutron Reflectometry in Russia: Current State and Prospects
Translated: Neutron reflexometry in Russia: current state and prospects
TER score: 2 edits
BLEU corpus score: 70.71%
BLEU sentence score: 73.86%
sacre BLEU score: BLEU = 75.06 88.9/75.0/71.4/66.7 (BP = 1.000 ratio = 1.000 hyp_len = 9 ref_len = 9)
sacre_CHRF++ score: chrF2 = 91.33%
METEOR score: 86.48%
ROUGE score: {'rouge1': {'precision': '87.5%', 'recall': '87.5%', 'f1': '87.5%'}, 'rouge2': {'precision': '71.43%', 'recall': '71.43%', 'f1': '71.43%'}, 'rougeL': {'precision': '87.5%', 'recall': '87.5%', 'f1': '87.5%'}}



Source: В обзоре дано описание текущего состояния дел и перспектив развития в области нейтронной рефлектометрии на действующих и будущих нейтронных источниках Российской Федерации.
Reference: The review is devoted to the current state of affairs and prospects for development in the field of neutron reflectometry on the exis

### Try using SentencePiece as tokenizer

In [32]:
#some OOP code here for sp tokenizer and nllb model


### Try using a paragraph as a single entry

In [None]:
class TextMachineTranslation(MachineTranslation):
  def process_text(self, text: List[str]) -> str:
    '''
    main process for multi-sentence translation
    '''
    tokens = [self.tokenize(sent) for sent in text]
    translated_tokens = [self.translate(token) for token in tokens]
    translations = [self.get_decoded(toks)[0] for toks in translated_tokens]
    result = ' '.join(translations)
    return result


  def infer(self, text: List[str], reference: List[str]) -> None:
    ''' TO BE CHANGED
    Compare the whole text to the reference
    '''
    translation = self.process_text(text)
    reference = ' '.join(reference)
    self.generate_metrics(translation.lower(), reference.lower())

In [None]:
TMT = TextMachineTranslation(model, rus_tokenizer)
TMT.infer(a_few_sentences_rus, a_few_sentences_eng)

TER score: 200 edits
BLEU corpus score: 37.95%
BLEU sentence score: 45.9%
METEOR score: 53.49%
ROUGE score: {'rouge1': {'precision': '80.95%', 'recall': '77.98%', 'f1': '79.44%'}, 'rouge2': {'precision': '51.92%', 'recall': '50.0%', 'f1': '50.94%'}, 'rougeL': {'precision': '68.57%', 'recall': '66.06%', 'f1': '67.29%'}}%


### Try it on a whole article

No ETL or database so far, just a plain `pd.read_excel`.

In [83]:
filename = 'Krist_sample_data.xlsx'
sheet_name = 'Krist2202003Borisov'
df = pd.read_excel(filename, sheet_name=sheet_name).dropna(subset=['ru'])

In [84]:
orig_text = df['ru'].tolist()
ref_text = df['en'].tolist()

In [None]:
len(ref_text)

In [None]:
txt_model = TextMachineTranslation(model, rus_tokenizer)
#txt_model.infer(orig_text, ref_text)

In [None]:
%%time
text_translated = txt_model.process_text(orig_text)

CPU times: user 14min 56s, sys: 217 ms, total: 14min 56s
Wall time: 15min 10s
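
`process_text` issues one `generate` call per sentence, which may account for part of the wall time above. One option worth trying is to translate several sentences per call. The sketch below only shows the batching logic: `translate_batch` is a hypothetical callable (list of sentences in, list of translations out) and `fake_translate_batch` is a stand-in so the example runs without a model.

```python
from typing import Callable, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(sentences: List[str],
                         translate_batch: Callable[[List[str]], List[str]],
                         batch_size: int = 8) -> str:
    """Translate sentences batch by batch and join the results with spaces."""
    translations: List[str] = []
    for batch in chunked(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return ' '.join(translations)


def fake_translate_batch(batch: List[str]) -> List[str]:
    """Stand-in for a real model call: upper-cases each sentence."""
    return [sentence.upper() for sentence in batch]


demo = translate_in_batches(['a', 'b', 'c'], fake_translate_batch, batch_size=2)
print(demo)
```

With the objects from this notebook, a real `translate_batch` would presumably tokenize the batch with `padding=True`, call `model.generate` once, and pass the result to `batch_decode`. Whether that actually helps depends on the hardware and on sentence lengths.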


In [None]:
filepath = 'translation.txt'
with open(filepath, 'w') as f:
  f.write(text_translated)

files.download(filepath)


## Try another model

In [71]:
%%capture
m_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
m_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
list_of_tokens = ['</s>', '<pad>']  #tokens to skip other than UNK

In [77]:
def marian_translate(model, tokenizer, sentence, maxlen=300):
  # Tokenize. MarianTokenizer is based on SentencePiece
  input_ids_ru = tokenizer.encode(sentence, return_tensors="pt")
  # Translate with beam search
  translated_ids_en = model.generate(input_ids_ru, max_length=maxlen, num_beams=4, early_stopping=True)
  # Decode translated tokens to text. skip_special_tokens=True drops special tokens;
  # set it to False to check the raw output for unknown tokens.
  result = tokenizer.decode(translated_ids_en[0], skip_special_tokens=True)
  #for tok in list_of_tokens:
  #  if tok in result:
  #    result = result.replace(tok, '')
  return result

### Test metrics

In [78]:
for num in range(len(a_few_sentences_rus)):
  m_translated = marian_translate(m_model, m_tokenizer, a_few_sentences_rus[num])
  print('SENTENCE:', m_translated)
  print('REFERENCE:', a_few_sentences_eng[num])
  test_metrics(a_few_sentences_eng[num].lower(), [m_translated.lower()], ['marian'])

SENTENCE: NETROIN REFLECTOMETRIA IN RUSSIA: CURRENT STATUS AND PROSPECTS
REFERENCE: Neutron Reflectometry in Russia: Current State and Prospects
Metric: sent_bleu
Translation: marian: 37.19%

Metric: sacrebleu
Translation: marian: BLEU = 35.49 66.7/50.0/28.6/16.7 (BP = 1.000 ratio = 1.000 hyp_len = 9 ref_len = 9)%

Metric: sacre_chrf
Translation: marian: chrF2 = 76.57%

Metric: corpus_bleu
Translation: marian: 17.29%

Metric: meteor
Translation: marian: 60.5%

Metric: rouge
Translation: marian: {'rouge1': {'precision': '62.5%', 'recall': '62.5%', 'f1': '62.5%'}, 'rouge2': {'precision': '42.86%', 'recall': '42.86%', 'f1': '42.86%'}, 'rougeL': {'precision': '62.5%', 'recall': '62.5%', 'f1': '62.5%'}}%

SENTENCE: The review describes the current state of affairs and development prospects in the field of neutron reflexometrics from current and future neutron sources in the Russian Federation.
REFERENCE: The review is devoted to the current state of affairs and prospects for development in 

As a result, the out-of-the-box Marian model performs slightly better than the NLLB model.

## Issues

1. Choose proper metrics

2. Evaluate metrics for the whole text rather than for each sentence, i.e. use a cumulative (corpus-level) metric or take the average of the sentence scores.
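
The two options in the second issue generally give different numbers. The sketch below illustrates this with clipped unigram precision as a stand-in metric: averaging per-sentence scores treats every sentence equally, while pooling the counts weights sentences by length. The function name and the toy sentence pairs are made up for illustration.

```python
from collections import Counter


def unigram_counts(hypothesis, reference):
    """Return (matched, total) clipped unigram counts for one sentence pair."""
    hyp_tokens, ref_tokens = hypothesis.split(), reference.split()
    ref_counts = Counter(ref_tokens)
    matched = sum(min(count, ref_counts[token])
                  for token, count in Counter(hyp_tokens).items())
    return matched, len(hyp_tokens)


# Toy (hypothesis, reference) pairs: one short perfect match, one longer poor match.
toy_pairs = [('the cat', 'the cat'),
             ('a dog sat on a mat', 'the dog lay on the rug')]

counts = [unigram_counts(hyp, ref) for hyp, ref in toy_pairs]

# Average of sentence scores: each sentence counts equally.
sentence_average = sum(matched / total for matched, total in counts) / len(counts)

# Cumulative (corpus-level) score: counts are pooled before dividing.
corpus_level = sum(matched for matched, _ in counts) / sum(total for _, total in counts)

print(round(sentence_average, 3), round(corpus_level, 3))
```

Here the short perfect sentence lifts the average to about 0.667, while the pooled score is 0.5, so the choice between the two should be made deliberately and reported with the results.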
