<a href="https://colab.research.google.com/github/Dimildizio/system_design/blob/main/NLLB_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Hugging Face and evaluation libraries

In [1]:
%%capture
!pip install transformers rouge-score sacrebleu sentencepiece

## Imports

In [33]:
import nltk
import pandas as pd
import sentencepiece as sp_module
import urllib.request
import io

from google.colab import files
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from typing import List


In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

### Download data samples

In [48]:
%%capture
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/gtrans.txt
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/orig.txt
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/reference.txt
!wget https://raw.githubusercontent.com/Dimildizio/system_design/main/data/translation.txt

### Download SentencePiece training corpus

In [None]:
%%capture
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt

## Specify huggingface access token to download model

In [7]:
access_token ='' #Put your huggingface token here

## Load tokenizers for the Russian and English corpora

In [8]:
eng_tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", token=access_token)
rus_tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="rus_Cyrl", token=access_token)

# Trying the out-of-the-box model

## Create model

In [9]:
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", token=access_token)


## Create example data

In [10]:
doc = 'Шустрая бурая лисица прыгает через ленивого пса!'
reference = 'The quick brown fox jumps over the lazy dog!'
g_trans = 'The nimble brown fox jumps over the lazy dog!'

## Tokenize

In [11]:
rus_tok = rus_tokenizer(doc, return_tensors='pt')

### Sentence piece

In [76]:
with open("botchan.txt", "rb") as f:
    text_data = f.read()

In [None]:
model_sp = io.BytesIO()
sp_vocab_size = 1000  #should consider enlarging

sp_module.SentencePieceTrainer.train(sentence_iterator=io.BytesIO(text_data),
                                     model_writer=model_sp,
                                     vocab_size=sp_vocab_size)

sp_tokenizer = sp_module.SentencePieceProcessor(model_proto=model_sp.getvalue())

In [None]:
#with open('out.model', 'wb') as f:
#   f.write(model_sp.getvalue())
#sp_processor = sp_module.SentencePieceProcessor()
#sp_processor.load('out.model')

True

In [72]:
sp_tokens = sp_tokenizer.encode_as_pieces(doc)

In [75]:
sp_tokens

['▁',
 'Шустрая',
 '▁',
 'бурая',
 '▁',
 'лисица',
 '▁',
 'прыгает',
 '▁',
 'через',
 '▁',
 'ленивого',
 '▁',
 'пса',
 '!']

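The SentencePiece tokenizer has a different interface from `AutoTokenizer` (for example `encode_as_pieces` instead of a callable that returns tensors). Because this model was trained on the English `botchan.txt` corpus, the Cyrillic words above are not split into subwords.

TODO: work out how to plug it into the MT class, or write a new class that makes the SentencePiece tokenizer and the NLLB model work together. Ids from a separately trained SentencePiece model do not line up with NLLB's own vocabulary, so they cannot be fed to the model as-is.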

### Translate

In [12]:
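# convert_tokens_to_ids replaces lang_code_to_id, which is deprecated in recent transformers releases
translated_tokens = model.generate(
    **rus_tok, forced_bos_token_id=rus_tokenizer.convert_tokens_to_ids("eng_Latn"), max_length=30)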

In [13]:
translated = rus_tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0] # batch_decode returns a list; take the first (only) entry

# Metrics

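#### **BLEU** - BiLingual Evaluation Understudy
> Measures n-gram overlap between the hypothesis and the reference.

> Precision-oriented: counts how many hypothesis n-grams also occur in the reference (clipped to the reference counts).

> Combines the precisions for several n-gram orders with a weighted geometric mean.

> Accounts for text length through a brevity penalty.

> Typical for machine translation.

> Rewards the model for producing words and phrases that match the reference.

> Penalizes hypotheses that are shorter than the reference (brevity penalty); overly long ones are penalized only indirectly, through lower precision.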
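
The BLEU ingredients listed above can be illustrated with a minimal pure-Python sketch. It assumes whitespace tokenization, a single reference, and equal weights over unigrams and bigrams only; it is not the NLTK implementation used later in this notebook, and the helper names are made up for the illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(hyp, ref, n):
    """Clipped n-gram precision: a hypothesis n-gram is credited at most
    as many times as it occurs in the reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(hyp_len, ref_len):
    """1.0 if the hypothesis is at least as long as the reference,
    exp(1 - ref_len / hyp_len) otherwise."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)


ref = 'the quick brown fox jumps over the lazy dog'.split()
hyp = 'the nimble brown fox jumps over the lazy dog'.split()

p1 = modified_precision(hyp, ref, 1)   # 8 of 9 unigrams match
p2 = modified_precision(hyp, ref, 2)   # 6 of 8 bigrams match
bp = brevity_penalty(len(hyp), len(ref))
# Geometric mean of the two precisions, scaled by the brevity penalty
bleu2 = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(round(p1, 3), round(p2, 3), bp, round(bleu2, 3))
```

A single wrong word lowers the unigram precision once but breaks two bigrams, which is why higher-order n-grams make BLEU sensitive to local word order.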


#### **ROUGE** - Recall-Oriented Understudy for Gisting Evaluation

> Focused on whether the gist of the reference is captured (Gisting Evaluation).

> Recall is more important (Recall-Oriented), although implementations such as `rouge_score` also report precision and F1.

> ROUGE-N uses n-gram overlap (ROUGE-1 unigrams, ROUGE-2 bigrams); ROUGE-L uses the longest common subsequence.

> Has no length penalty comparable to BLEU's brevity penalty.

> Typical for text summarization.

> Rewards the model if the generated text covers the content of the reference.

> Longer texts have an advantage for recall.
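
To make the ROUGE-L idea concrete, here is a small sketch that computes it from the longest common subsequence. It assumes plain whitespace tokens and no stemming, so its numbers will not necessarily match those of the `rouge_score` package used below; `lcs_length` and `rouge_l` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(hypothesis, reference):
    """Return (precision, recall, f1) of ROUGE-L on whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    lcs = lcs_length(hyp, ref)
    precision = lcs / len(hyp) if hyp else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


ref_sent = 'the quick brown fox jumps over the lazy dog'
short_res = rouge_l('the fox jumps', ref_sent)
long_res = rouge_l(ref_sent + ' and then it sleeps', ref_sent)
print(short_res)
print(long_res)
```

The short hypothesis gets perfect precision but low recall, while the padded one gets perfect recall, which is the length advantage mentioned in the last bullet.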


#### **METEOR** - Metric for Evaluation of Translation with Explicit ORdering

> Takes word order into account through a fragmentation penalty.

> Matches words by exact form, stem, and synonym, which lets it credit some paraphrases.

> Based on a harmonic mean of unigram precision and recall that weights recall more heavily.

> Works on unigrams (single words); synonyms come from the WordNet dictionary downloaded above.

> More robust to variations in wording.

> Has no brevity penalty; extra words lower precision instead.

> Typical for machine translation and text summarization.

> Again: more flexible due to the use of synonyms (more complex than plain word overlap).
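
The way METEOR combines precision, recall, and word order can be sketched from hand-supplied counts. The function below is illustrative only: it skips the alignment step (exact, stem, and synonym matching) and takes the number of matched unigrams and contiguous chunks as inputs. The defaults `alpha=0.9`, `beta=3.0`, `gamma=0.5` are assumed to mirror the NLTK defaults.

```python
def meteor_from_counts(matches, hyp_len, ref_len, chunks,
                       alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR-style score from precomputed alignment counts.

    matches: number of aligned unigrams
    chunks:  number of contiguous aligned runs (fewer = better word order)
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Harmonic mean weighted towards recall
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation penalty grows as the matches split into more chunks
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


# "the nimble brown fox ..." vs "the quick brown fox ...", exact matches only:
# 8 matched words in 2 chunks ("the" and "brown fox jumps over the lazy dog")
one_wrong_word = meteor_from_counts(matches=8, hyp_len=9, ref_len=9, chunks=2)
identical = meteor_from_counts(matches=9, hyp_len=9, ref_len=9, chunks=1)
print(round(one_wrong_word, 4), round(identical, 4))
```

Under this formula even an identical sentence scores slightly below 1, because a single chunk still contributes a small penalty.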


#### **TER** - Translation Edit Rate

> Represents the **number of edits** needed to turn the hypothesis into the reference. Lower is better.

> Basically quantifies the dissimilarity between the reference and the translation.

> Possible edits: insertion, deletion, and substitution of words, plus shifts of word sequences.

> Used in machine translation and text summarization.

> Not included in nltk; sacrebleu (installed above) ships an implementation.

> More complex to compute because of the shifts.

> Usually reported as edits divided by the reference length, so it is a rate rather than a bounded percentage and can exceed 1. The `ter` function below returns a raw edit count, which is harder to interpret; 0 edits is the best.
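
A rough approximation of TER can be written in a few lines if shifts are ignored. The sketch below counts word-level insertions, deletions, and substitutions and divides by the reference length. It assumes whitespace tokens; `ter_no_shifts` is a hypothetical helper, not a full TER implementation.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over tokens (insert, delete, substitute)."""
    prev = list(range(len(hyp_tokens) + 1))   # distances against an empty reference
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # skip a reference token
                            curr[j - 1] + 1,      # skip a hypothesis token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def ter_no_shifts(hypothesis, reference):
    """Word edits divided by reference length (no shift operation)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else float('inf')
    return word_edit_distance(hyp, ref) / len(ref)


ref_sent = 'the quick brown fox jumps over the lazy dog'
rate = ter_no_shifts('the nimble brown fox jumps over the lazy dog', ref_sent)
print(round(rate, 3))
```

One substituted word out of nine reference words gives a rate of about 0.11, which is easier to compare across sentences of different length than a raw edit count.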




In [14]:
Rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [15]:
smoothing_zero_ngrams = SmoothingFunction()

In [16]:
def round_perc(num: float) -> float:
  return round(num*100, 2)

def get_sent_bleu(sentence: str, reference: str=reference) -> float:
  '''Sentence-level BLEU over 1- to 3-grams'''
  score = sentence_bleu([reference.split()], sentence.split(),
                        weights=(0.25, 0.5, 0.25),  # one weight per n-gram order (1, 2, 3)
                        smoothing_function=smoothing_zero_ngrams.method1)
  return round_perc(score)

def get_bleu(sentence: str, reference: str=reference) -> float:
  score = corpus_bleu([[reference.split()]], [sentence.split()],
                      smoothing_function=smoothing_zero_ngrams.method1)
  return round_perc(score)

def get_meteor(sentence: str, reference: str=reference) -> float:
  score = meteor_score([reference.split()], sentence.split())
  return round_perc(score)

def get_rouge(sentence: str, reference: str=reference) -> dict:
  '''rouge-1: unigrams (individual words)
     rouge-2: bigrams (word pairs)
     rouge-L: longest common subsequence'''
  scores = Rouge.score(reference, sentence)
  rouge_dict = {}
  for key, score in scores.items():
    # Format each component as a percentage string
    rouge_dict[key] = {'precision': str(round_perc(score.precision))+'%',
                       'recall': str(round_perc(score.recall))+'%',
                       'f1': str(round_perc(score.fmeasure))+'%'}
  return rouge_dict

In [17]:
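def ter(hypothesis, reference=reference):
    '''Character-level Levenshtein distance (insertions, deletions, substitutions).
    Note: true TER works on words, allows shifts and divides by the reference length.'''
    n = len(reference)
    m = len(hypothesis)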

    # Init matrix for dynamic programming
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    # Init first row and column
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j

    # Fill in the DP matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[n][m]

### Test metrics

In [18]:
def test_metrics(reference, list_of_trans):
  #print(f'Reference: {reference}\n\n')
  transname = ['google translate', 'machine translation']
  for name, func in zip(['sent_bleu', 'corpus_bleu', 'meteor', 'rouge'],   #, 'ter'],
                        [get_sent_bleu, get_bleu, get_meteor, get_rouge]): #, ter]):
    print('Metrics:', name)
    symbol = '%'# if name != 'ter' else ' edits'
    for num in range(len(list_of_trans)):
      score = func(list_of_trans[num], reference)
      print(f"Translation: {transname[num]}: {score}{symbol}")
    print()

## Upload files to compare

In [19]:
def onefile(files):
  to_compare = []
  for filename in files:
    with open(filename + '.txt', encoding='utf-8') as f:
      new = f.readlines()
    to_compare.append(''.join(new).replace('\n', '').replace('\t', ''))
  return to_compare
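
One thing to watch in `onefile`: newline characters are deleted outright, so if a file wraps text across lines, the last word of one line is glued to the first word of the next. Whether the sample files are affected has not been checked here; the toy lines below are made up to show the effect on whitespace tokens.

```python
# Two hypothetical lines as returned by f.readlines()
lines = ['The quick brown fox\n', 'jumps over the lazy dog.\n']

# What onefile does: concatenate, then drop the newline characters
glued = ''.join(lines).replace('\n', '')

# Alternative: strip each line and join with a single space
spaced = ' '.join(line.strip() for line in lines)

print(glued.split())   # contains the merged token 'foxjumps'
print(spaced.split())  # nine separate tokens
```

Because the metrics in this notebook split on whitespace, a merged token counts as a mismatch for both words it contains.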

In [20]:
filenames = ['orig', 'gtrans', 'translation', 'reference']
orig, gtrans, trans, ref = onefile(filenames)

In [21]:
test_metrics(ref, [gtrans, trans])

Metrics: sent_bleu
Translation: google translate: 44.58%
Translation: machine translation: 31.81%

Metrics: corpus_bleu
Translation: google translate: 35.96%
Translation: machine translation: 23.07%

Metrics: meteor
Translation: google translate: 51.87%
Translation: machine translation: 40.99%

Metrics: rouge
Translation: google translate: {'rouge1': {'precision': '86.07%', 'recall': '86.93%', 'f1': '86.5%'}, 'rouge2': {'precision': '55.29%', 'recall': '55.84%', 'f1': '55.56%'}, 'rougeL': {'precision': '67.4%', 'recall': '68.07%', 'f1': '67.73%'}}%
Translation: machine translation: {'rouge1': {'precision': '81.98%', 'recall': '79.26%', 'f1': '80.6%'}, 'rouge2': {'precision': '44.41%', 'recall': '42.93%', 'f1': '43.66%'}, 'rougeL': {'precision': '58.16%', 'recall': '56.23%', 'f1': '57.18%'}}%



## Flow

In [69]:
class MachineTranslation:

  def __init__(self, model, tokenizer, target_lang='eng_Latn', sent_len=300):
    self.model=model
    self.tokenizer = tokenizer
    self.to_lang = target_lang
    self.sent_len = sent_len
    self.metrics_dict = {'TER':ter,
                        'BLEU corpus':get_bleu,
                        'BLEU sentence': get_sent_bleu,
                        'METEOR': get_meteor,
                        'ROUGE': get_rouge
                         }


  def tokenize(self, sent: str):
    '''Tokenize input sentence'''
    return self.tokenizer(sent, return_tensors='pt')


  def translate(self, inputs):
    '''
    Generate translation
    '''
    # convert_tokens_to_ids replaces lang_code_to_id, deprecated in recent transformers releases
    return self.model.generate(
      **inputs, forced_bos_token_id=self.tokenizer.convert_tokens_to_ids(self.to_lang),
      max_length=self.sent_len)


  def get_decoded(self, toks) -> list:
    '''
    Convert vect tokens into sentences
    '''
    return self.tokenizer.batch_decode(toks, skip_special_tokens=True)


  def generate_metrics(self, translation: str, reference: str) -> None:
    '''
    Score the translation against the reference with every metric in metrics_dict
    '''
    #print(f'Reference: {reference}\nTranslation: {translation}\n')
    for name, func in self.metrics_dict.items():
      score = func(translation, reference)
      self.print_metrics(translation, name, score)


  def print_metrics(self, translation, metrics_name, score):
    perc_sign = ' edits' if metrics_name == 'TER' else '%'
    print(f"{metrics_name} score: {score}{perc_sign}")


  def process_sentence(self, sent: str):
    '''
    main process for translation
    '''
    tokens = self.tokenize(sent)
    translated_tokens = self.translate(tokens)
    result = self.get_decoded(translated_tokens)
    return result


  def infer(self, sent: str, reference: str) -> None:
    ''' TO BE CHANGED
    Compare first sentence of the doc to the reference
    '''
    translation = self.process_sentence(sent)
    self.generate_metrics(translation[0].lower(), reference.lower())


In [70]:
MT = MachineTranslation(model, rus_tokenizer)

In [56]:
MT.infer(doc, reference)

TER score: 12 edits
BLEU corpus score: 35.49%
BLEU sentence score: 46.71%
METEOR score: 65.43%
ROUGE score: {'rouge1': {'precision': '66.67%', 'recall': '66.67%', 'f1': '66.67%'}, 'rouge2': {'precision': '50.0%', 'recall': '50.0%', 'f1': '50.0%'}, 'rougeL': {'precision': '66.67%', 'recall': '66.67%', 'f1': '66.67%'}}%


#### Sample from dataset

In [24]:
a_few_sentences_rus = [
'НЕЙТРОННАЯ РЕФЛЕКТОМЕТРИЯ В РОССИИ: ТЕКУЩЕЕ СОСТОЯНИЕ И ПЕРСПЕКТИВЫ',
'В обзоре дано описание текущего состояния дел и перспектив развития в области нейтронной рефлектометрии на действующих и будущих нейтронных источниках Российской Федерации.',
'В результате ввода в эксплуатацию новых инструментов на реакторах ИР-8 и ПИК число нейтронных рефлектометров в РФ должно удвоиться.',
'В результате должен появиться набор инструментов, нацеленных на решение широкого круга задач в области физики, химии, биологии слоистых систем в интересах научного сообщества, а также для подготовки специалистов для дальнейшего развития и совершенствования данной методики.'
]

a_few_sentences_eng = [
'Neutron Reflectometry in Russia: Current State and Prospects',
'The review is devoted to the current state of affairs and prospects for development in the field of neutron reflectometry on the existing and future neutron sources in the Russian Federation.',
'Due to the commissioning of new instruments at the IR-8 and PIK reactors, the number of neutron reflectometers in the Russian Federation should double.',
'As a result, there must arise a set of instruments aimed at solving various problems in the fields of physics, chemistry, and biology of layered systems in the interests of the scientific community and to train experts for further development and improvement of this technique.'
]

pairs = list(zip(a_few_sentences_rus, a_few_sentences_eng))

In [20]:
MT = MachineTranslation(model, rus_tokenizer)
for pair in pairs:
  rus, eng = [sent.lower() for sent in pair]
  MT.infer(rus, eng)
  print('\n\n')

TER score: 2 edits
BLEU corpus score: 70.71%
BLEU sentence score: 73.86%
METEOR score: 86.48%
ROUGE score: {'rouge1': {'precision': '87.5%', 'recall': '87.5%', 'f1': '87.5%'}, 'rouge2': {'precision': '71.43%', 'recall': '71.43%', 'f1': '71.43%'}, 'rougeL': {'precision': '87.5%', 'recall': '87.5%', 'f1': '87.5%'}}%



TER score: 45 edits
BLEU corpus score: 40.7%
BLEU sentence score: 50.2%
METEOR score: 64.53%
ROUGE score: {'rouge1': {'precision': '82.14%', 'recall': '74.19%', 'f1': '77.97%'}, 'rouge2': {'precision': '55.56%', 'recall': '50.0%', 'f1': '52.63%'}, 'rougeL': {'precision': '78.57%', 'recall': '70.97%', 'f1': '74.58%'}}%



TER score: 38 edits
BLEU corpus score: 38.79%
BLEU sentence score: 46.18%
METEOR score: 59.32%
ROUGE score: {'rouge1': {'precision': '64.0%', 'recall': '64.0%', 'f1': '64.0%'}, 'rouge2': {'precision': '50.0%', 'recall': '50.0%', 'f1': '50.0%'}, 'rougeL': {'precision': '64.0%', 'recall': '64.0%', 'f1': '64.0%'}}%



TER score: 127 edits
BLEU corpus score: 6

### Try using SentencePiece as tokenizer

In [None]:
#some OOP code here for sp tokenizer and nllb model


### Try using a paragraph as a single entry

In [21]:
class TextMachineTranslation(MachineTranslation):
  def process_text(self, text: List[str]) -> str:
    '''
    main process for multi-sentence translation
    '''
    tokens = [self.tokenize(sent) for sent in text]
    translated_tokens = [self.translate(token) for token in tokens]
    translations = [self.get_decoded(toks)[0] for toks in translated_tokens]
    result = ' '.join(translations)
    return result


  def infer(self, text: List[str], reference: List[str]) -> None:
    ''' TO BE CHANGED
    Compare the whole text to the reference
    '''
    translation = self.process_text(text)
    reference = ' '.join(reference)
    self.generate_metrics(translation.lower(), reference.lower())

In [22]:
TMT = TextMachineTranslation(model, rus_tokenizer)
TMT.infer(a_few_sentences_rus, a_few_sentences_eng)

TER score: 200 edits
BLEU corpus score: 37.95%
BLEU sentence score: 45.9%
METEOR score: 53.49%
ROUGE score: {'rouge1': {'precision': '80.95%', 'recall': '77.98%', 'f1': '79.44%'}, 'rouge2': {'precision': '51.92%', 'recall': '50.0%', 'f1': '50.94%'}, 'rougeL': {'precision': '68.57%', 'recall': '66.06%', 'f1': '67.29%'}}%


### Try it on a whole article

No ETL or database so far, just a plain `pd.read_excel`.

In [79]:
filename = 'Krist_sample_data.xlsx'
sheet_name = 'Krist2202003Borisov'
df = pd.read_excel(filename, sheet_name=sheet_name).dropna(subset=['ru'])

In [25]:
orig_text = df['ru'].tolist()
ref_text = df['en'].tolist()

In [26]:
txt_model = TextMachineTranslation(model, rus_tokenizer)
#txt_model.infer(orig_text, ref_text)

In [27]:
%%time
text_translated = txt_model.process_text(orig_text)

CPU times: user 14min 56s, sys: 217 ms, total: 14min 56s
Wall time: 15min 10s
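
The run above calls `generate` once per sentence. One possible way to cut overhead is to translate several sentences per call. The sketch below only shows the batching logic: `translate_batch` is a hypothetical callable that, with the real model, might tokenize the batch with `padding=True`, call `model.generate`, and `batch_decode` the result. Whether this actually speeds things up on CPU would need to be measured.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError('batch_size must be >= 1')
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(sentences: List[str],
                         translate_batch: Callable[[List[str]], List[str]],
                         batch_size: int = 8) -> str:
    """Translate `sentences` batch by batch and join the results with spaces."""
    translations: List[str] = []
    for batch in batched(sentences, batch_size):
        translations.extend(translate_batch(batch))
    return ' '.join(translations)


def fake_translate(batch: List[str]) -> List[str]:
    """Stand-in translator so the sketch runs without loading the model."""
    return [sentence.upper() for sentence in batch]


result = translate_in_batches(['a', 'b', 'c'], fake_translate, batch_size=2)
print(result)
```

Keeping the translator as a parameter means the same loop works with the NLLB model, a different model, or a stub in tests.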


In [28]:
filepath = 'translation.txt'
with open(filepath, 'w') as f:
  f.write(text_translated)

files.download(filepath)


## Issues

1. Choose proper metrics.

2. Evaluate metrics for the whole text rather than for each sentence, i.e. use a cumulative (corpus-level) metric or take the average of the sentence scores.
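
The two options in item 2 generally give different numbers. The sketch below uses clipped unigram precision as a stand-in metric to compare averaging sentence scores (macro) with pooling the counts over the whole text (micro). The sentence pairs are invented for the illustration; NLTK's `corpus_bleu` follows the pooling approach.

```python
from collections import Counter


def unigram_counts(hypothesis, reference):
    """Return (clipped matches, hypothesis length) for one sentence pair."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    matches = sum(min(count, ref[tok]) for tok, count in hyp.items())
    return matches, sum(hyp.values())


def macro_precision(pairs):
    """Average of per-sentence precisions: every sentence weighs the same."""
    counts = [unigram_counts(h, r) for h, r in pairs]
    scores = [m / n for m, n in counts if n]
    return sum(scores) / len(scores) if scores else 0.0


def micro_precision(pairs):
    """Pooled counts over all sentences: longer sentences weigh more."""
    counts = [unigram_counts(h, r) for h, r in pairs]
    total = sum(n for _, n in counts)
    return sum(m for m, _ in counts) / total if total else 0.0


pairs = [
    ('neutron reflectometry', 'neutron reflectometry'),   # 2 of 2 match
    ('a b c d e f g h', 'a b x y z w v u'),               # 2 of 8 match
]
macro = macro_precision(pairs)
micro = micro_precision(pairs)
print(macro, micro)
```

Here the short, perfect title lifts the macro average to 0.625, while the pooled score is 0.4 because the long, poorly translated sentence dominates the counts.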
