

*mBART* is a sequence-to-sequence (Seq2Seq) denoising auto-encoder pretrained on large-scale monolingual corpora in many languages. It is one of the first methods to pretrain a complete Seq2Seq model by denoising full texts in multiple languages.

BART was originally developed by Facebook AI (Oct. 2019) as a pretrained model for text generation and comprehension that combines a bidirectional encoder with an auto-regressive decoder.
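
Denoising pretraining means the model is shown a corrupted text and learns to reconstruct the original. The sketch below illustrates one such corruption, replacing random spans of tokens with a single mask token. It is a toy illustration only: the function name `infill_noise`, the `<mask>` string, and the default parameters are assumptions made for this example, not the actual mBART pretraining code.

```python
import random

MASK = "<mask>"  # placeholder mask string, assumed for this sketch


def infill_noise(tokens, mask_ratio=0.35, max_span=3, seed=0):
    """Replace random spans of tokens with one mask token per span."""
    rng = random.Random(seed)
    budget = int(round(len(tokens) * mask_ratio))  # max tokens to hide
    noisy, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < mask_ratio:
            # Hide a span of 1..max_span tokens, limited by the remaining budget
            span = min(rng.randint(1, max_span), budget, len(tokens) - i)
            noisy.append(MASK)  # the whole span collapses into one mask
            i += span
            budget -= span
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy


original = "the doctor advised him not to smoke".split()
print(infill_noise(original, seed=1))
```

During pretraining the corrupted sequence would be the encoder input and the original sequence the decoder target.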




Installing sacrebleu for computing the BLEU score

In [1]:
!pip install sacrebleu

Collecting sacrebleu
  Downloading https://files.pythonhosted.org/packages/7e/57/0c7ca4e31a126189dab99c19951910bd081dea5bbd25f24b77107750eae7/sacrebleu-1.5.1-py3-none-any.whl (54kB)
Collecting portalocker==2.0.0
  Downloading https://files.pythonhosted.org/packages/89/a6/3814b7107e0788040870e8825eebf214d72166adf656ba7d4bf14759a06a/portalocker-2.0.0-py2.py3-none-any.whl
Installing collected packages: portalocker, sacrebleu
Successfully installed portalocker-2.0.0 sacrebleu-1.5.1


Installing the latest Hugging Face transformers library from GitHub

In [2]:
!pip install git+https://github.com/huggingface/transformers -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Building wheel for transformers (PEP 517) ... done


Installing SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer for neural network-based text generation systems, where the vocabulary size is fixed before the neural model is trained.

In [3]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/ac/aa/1437691b0c7c83086ebb79ce2da16e00bef024f24fec2a5161c35476f499/sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


Importing required Libraries

In [4]:
import string
import re
import pandas as pd
from sacrebleu import sentence_bleu

Importing the fast mBART-50 tokenizer (backed by Hugging Face's tokenizers library)

In [5]:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast


Download the model and tokenizer


In [6]:
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-one-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-one-to-many-mmt", src_lang="en_XX")


Reading the input

In [7]:
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Machine Translation/hin.txt", sep='\t', header=None, names=["english_sentence","hindi_sentence","path"])
df= df.drop(columns=['path'])

In [8]:
df.head()

Unnamed: 0,english_sentence,hindi_sentence
0,Wow!,वाह!
1,Help!,बचाओ!
2,Jump.,उछलो.
3,Jump.,कूदो.
4,Jump.,छलांग.


Translation

In [9]:
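def translate(sentence):
    # Tokenize the English input and return PyTorch tensors
    model_inputs = tokenizer(sentence, return_tensors="pt")
    # Force the decoder to start with the Hindi language code
    generated_tokens = model.generate(
        **model_inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],
    )
    # batch_decode returns a list of strings, one per input sequence
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)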
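
Calling translate once per row runs the model on a single sentence at a time. A minimal sketch of batching is shown below. The helpers `chunked` and `translate_in_batches` are hypothetical and not part of the original notebook, and `translate_fn` stands for any function that maps a list of sentences to a list of translations (for instance, a variant of translate that tokenizes a list with `padding=True`). A stand-in function is used here so the sketch runs without the model.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_in_batches(sentences, translate_fn, batch_size=16):
    """Apply `translate_fn` to fixed-size batches and flatten the results."""
    results = []
    for batch in chunked(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results


# Stand-in for a real batched translation function (assumption for the demo)
def fake_translate(batch):
    return [s.upper() for s in batch]


print(translate_in_batches(["wow", "help", "jump"], fake_translate, batch_size=2))
# ['WOW', 'HELP', 'JUMP']
```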

In [10]:
df.isnull().sum()

english_sentence    0
hindi_sentence      0
dtype: int64

In [11]:
df['english_sentence']=df['english_sentence'].apply(lambda x: x.lower())
df['hindi_sentence']=df['hindi_sentence'].apply(lambda x: x.lower())
df['english_sentence']=df['english_sentence'].apply(lambda x: x.strip())
df['hindi_sentence']=df['hindi_sentence'].apply(lambda x: x.strip())
df['english_sentence']=df['english_sentence'].apply(lambda x: re.sub(" +", " ", x))
df['hindi_sentence']=df['hindi_sentence'].apply(lambda x: re.sub(" +", " ", x))
df['english_sentence']=df['english_sentence'].apply(lambda x: ''.join(ch for ch in x if ch not in string.punctuation))
df['hindi_sentence']=df['hindi_sentence'].apply(lambda x: ''.join(ch for ch in x if ch not in string.punctuation))
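# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
df['english_sentence']=df['english_sentence'].str.replace(r'\d+', '', regex=True)
df['hindi_sentence']=df['hindi_sentence'].str.replace(r'\d+', '', regex=True)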
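
The cleaning steps above can also be collected into one function, which keeps the order of operations explicit. This is a sketch of the same steps using only the standard library; the name `clean_text` is an assumption and does not appear in the original notebook.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, strip, collapse spaces, then drop punctuation and digits."""
    text = text.lower().strip()
    text = re.sub(r" +", " ", text)      # collapse runs of spaces
    text = text.translate(_PUNCT_TABLE)  # remove ASCII punctuation only
    text = re.sub(r"\d+", "", text)      # remove digits
    return text


print(clean_text("  Got   it!  "))  # got it
print(clean_text("नमस्ते।"))          # the danda is not in string.punctuation
```

Each column could then be cleaned with a single call such as `df['english_sentence'].apply(clean_text)`. Because `string.punctuation` covers ASCII characters only, the Devanagari danda (।) survives this step.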

Prediction of the pretrained model

In [12]:
df['predicted_hindi_sentence']=df['english_sentence'].apply(lambda x: translate(x))



In [13]:
df.head(10)

Unnamed: 0,english_sentence,hindi_sentence,predicted_hindi_sentence
0,wow,वाह,[वाउ]
1,help,बचाओ,[सहायता]
2,jump,उछलो,[उछलना]
3,jump,कूदो,[उछलना]
4,jump,छलांग,[उछलना]
5,hello,नमस्ते।,[हेलो]
6,hello,नमस्कार।,[हेलो]
7,cheers,वाहवाह,[अभिवादन]
8,cheers,चियर्स,[अभिवादन]
9,got it,समझे कि नहीं,[पा लिया]


Output

In [14]:
x = df.hindi_sentence.tolist()

In [22]:
k=2001
english =df.english_sentence[k]
actual = df.hindi_sentence[k]
predicted = df.predicted_hindi_sentence[k]
print("The actual english sentence is:",english)
print("The actual hindi sentence is:",actual)
print("The predicted hindi sentence is:",predicted)
# sentence_bleu expects the hypothesis as a string, so unwrap the one-element list
print("The BLEU score is :",sentence_bleu(predicted[0],[actual]).score)

The actual english sentence is: the doctor advised him not to smoke
The actual hindi sentence is: डॉक्टर ने उसे सिगरेट न पीने की सलह दी।
The predicted hindi sentence is: ['डॉक्टर ने उसे सिखाया कि धूम्रपान न करें।']
The BLEU score is : 19.493995755254467


In [21]:
k=1301
english =df.english_sentence[k]
actual = df.hindi_sentence[k]
predicted = df.predicted_hindi_sentence[k]
print("The actual english sentence is:",english)
print("The actual hindi sentence is:",actual)
print("The predicted hindi sentence is:",predicted)
# sentence_bleu expects the hypothesis as a string, so unwrap the one-element list
print("The BLEU score is :",sentence_bleu(predicted[0],[actual]).score)

The actual english sentence is: you are not coming are you
The actual hindi sentence is: तुम नहीं आ रहे हो ना
The predicted hindi sentence is: ['तुम नहीं आ रहे हो क्या तुम']
The BLEU score is : 61.47881529512643
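
To give a feel for what these scores measure, the sketch below computes a simplified sentence-level BLEU: the geometric mean of modified n-gram precisions (1-grams to 4-grams) multiplied by a brevity penalty. It is an unsmoothed illustration with whitespace tokenization; sacrebleu applies its own tokenization and smoothing, so its numbers will generally differ. The names `ngrams` and `simple_bleu` are assumptions for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence BLEU on whitespace tokens, scaled to 0-100."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("a b c d e", "a b c d e"))  # identical sentences score 100.0
```

Short sentences often have no matching 4-gram at all, which is why an unsmoothed score like this one drops to zero where a smoothed metric still reports a small positive value.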
