# Introduction
Training a full-blown machine translation model from scratch on large parallel corpora is often difficult and expensive. Hence, we instead fine-tune a pretrained model (`IndicBART`) on our training data, using the libraries provided by Hugging Face.

### Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
cd "/content/drive/MyDrive/IASNLP"

/content/drive/MyDrive/IASNLP


### Importing Necessary Packages

In [3]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install sacrebleu
!pip install sentencepiece
!pip install indic-nlp-library

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers[sentencepiece]
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

In [4]:
import numpy as np
import pandas as pd

from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
import transformers
import sentencepiece
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset, load_metric

# Load Data

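We saved the training and train-dev data as CSV files earlier, so we load them directly. Note that the `test` split below is read from `train_dev.csv`, i.e. it is the train-dev set used for evaluation during fine-tuning.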

In [5]:
data = load_dataset('csv', data_files={'train': ['train_data.csv'], 'test': ['train_dev.csv']})

Using custom data configuration default-b1b37da2b3a2df75


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-b1b37da2b3a2df75/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-b1b37da2b3a2df75/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58. Subsequent calls will reuse this data.


Below, we can see the number of examples in each split.

In [6]:
data = data.remove_columns('Unnamed: 0')
data

DatasetDict({
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 111020
    })
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 4626
    })
})

The Train Data

In [7]:
pd.DataFrame(data['train'][:10])

,src,tgt
0,But the shoot was a tough one.,তবে শ্যুটটা খুব মুশকিলের ছিল।
1,Road construction started.,রাস্তা নির্মাণ শুরু হয়েছে।
2,Why did he pay so much?,কেন তিনি এত টাকা দিতেন?
3,"""AT ITS worst, this has been Satan's century.","""এই শতাব্দীর প্রচণ্ড ভয়াবহতা এটাকে শয়তানের এক ..."
4,That's our only demand.,সেটাই আমাদের একমাত্র দাবি।
5,He leads his life by teaching.,সে শিক্ষকতা করে জীবন পরিচালনা করে
6,Im not leaving.,আমি এলাকা ছাড়ব না।
7,Under the instructions of the caliph Uthman ib...,খলিফা উসমান ইবনে আফফানের নির্দেশে মুয়াবিয়া এ...
8,The only way of weaning him off the ventilator...,তাকে বাঁচানোর একমাত্র উপায় বায়ুরন্ধ্র বন্ধ করে...
9,Gaibandha death toll rises to 5,"গাইবান্ধায় নিহতের সংখ্যা বেড়ে ৭, গৌরনদীতে ৩"


The Train-Dev Data

In [8]:
pd.DataFrame(data['test'][:10])

,src,tgt
0,We beg our Protestant and Jewish friends to pu...,কোন কোন ক্ষেত্রে কর্তৃপক্ষ এবং ধর্মীয় নেতারা ...
1,"Sa'd advised Muhammad: ""Don't be hard on him. ...","সা'দ মুহাম্মাদকে বলেন: ""তার প্রতি কঠোর হবেন না..."
2,Photo by 'Save Gaza Project',"ছবি ""সেভ গাজা প্রজেক্টের""।'"
3,"So, therefore, we need to test batteries under...","অতএব, আমআদের কিছুটা মান অবস্থাগুলির অধীনে ব্যা..."
4,This party is also contesting in the elections.,নির্বাচনে এই দলের মধ্যেই প্রতিদ্বন্দ্বিতা হবে।
5,Roads and houses collapsed.,"তলিয়ে গেছে ঘরবাড়ি, রাস্তাঘাট।"
6,"When a piece of paper is rolled up, Hitotsuyam...",হিতোসুয়েমা কাগজ দিয়ে ম্যাশে কৌশল অবলম্বন করে...
7,The founder of the modern Catholic movement Op...,"আধুনিক ক্যাথলিক সংঘের প্রতিষ্ঠাতা ওপাস ডেই, হো..."
8,Research has shown that exercise also helps in...,"এছাড়া গবেষণায় দেখা গেছে, শরীরচর্চা উদ্বেগ ও মা..."
9,It will be so much fun.,অনেক মজা হবে তখন।


# Data Preprocessing

We start by preprocessing the data. For that, we first specify the checkpoint of the pretrained model we are going to fine-tune, since both the tokenizer and the model configuration are loaded from it.

In [9]:
model_checkpoint = "ai4bharat/IndicBART"

Let's set the evaluation metric we will be using, SacreBLEU.

In [10]:
metric = load_metric("sacrebleu")

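To make the metric less of a black box, the sketch below computes a bare-bones corpus-level BLEU in pure Python. It is only an illustration of the idea (clipped n-gram precision combined with a brevity penalty): it assumes whitespace tokenization, a single reference per prediction and no smoothing, so its numbers will generally differ from what `sacrebleu` reports. The function names are ours and are not part of any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100); uses only the first reference of each prediction."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, refs in zip(predictions, references):
        p, r = pred.split(), refs[0].split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            p_counts, r_counts = ngrams(p, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in p_counts.items())
            totals[n - 1] += sum(p_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, a missing n-gram order gives a score of zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish predictions that are shorter than the references.
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# References are passed as a list of lists, one inner list per prediction.
print(corpus_bleu(["a b c d e"], [["a b c d e"]]))
print(corpus_bleu(["a b c d"], [["a b c d e"]]))
```

The second call shows the brevity penalty at work: every n-gram of the prediction is correct, yet the score drops below 100 because the prediction is one token shorter than the reference.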

## Tokenization & Normalization

We use the tokenizer that ships with `IndicBART`, so that the vocabulary and segmentation method match those used during pretraining.

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, do_lower_case=False, use_fast=False, keep_accents=True)

In [12]:
bos_id = tokenizer._convert_token_to_id_with_added_voc("<s>")
eos_id = tokenizer._convert_token_to_id_with_added_voc("</s>")
pad_id = tokenizer._convert_token_to_id_with_added_voc("<pad>")
en_id = tokenizer._convert_token_to_id_with_added_voc("<2en>")
bn_id = tokenizer._convert_token_to_id_with_added_voc("<2bn>")

### Transliteration

One key thing to keep in mind is that `IndicBART` is pretrained on 11 Indian languages, and the text of every language other than Hindi and Marathi (which already use Devanagari) is transliterated to the Devanagari script. Hence, to use it for Bengali, we have to transliterate the Bengali text to Devanagari, as shown below. After generation, we also need to convert the Devanagari output back to Bengali script.

In [13]:
ben_dev = UnicodeIndicTransliterator()

In [14]:
beng_sent = "আমি তোমাকে ভালোবাসি।"
print("Bengali: ", beng_sent)
print("Hindi: ", ben_dev.transliterate(beng_sent, "bn", "hi"))

Bengali:  আমি তোমাকে ভালোবাসি।
Hindi:  आमि तोमाके भालोबासि।
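This kind of script conversion is cheap because Unicode lays out the Indic script blocks in parallel: Devanagari starts at U+0900 and Bengali at U+0980, so most characters can be converted by shifting the code point by a fixed offset. The sketch below illustrates the idea in pure Python. It is a simplified illustration and not the Indic NLP Library implementation; the function names, the `0x6F` cut-off and the special handling of the danda characters are our assumptions.

```python
DEVANAGARI_START = 0x0900  # first code point of the Devanagari block
BENGALI_START = 0x0980     # first code point of the Bengali block
COORDINATED_END = 0x6F     # assumption: only offsets 0x00-0x6F are treated as aligned
DANDAS = {"\u0964", "\u0965"}  # sentence-ending punctuation shared by both scripts


def shift_script(text, src_start, tgt_start):
    """Move each character of the source script to the same offset in the target script."""
    out = []
    for ch in text:
        offset = ord(ch) - src_start
        if 0 <= offset <= COORDINATED_END and ch not in DANDAS:
            out.append(chr(tgt_start + offset))
        else:
            out.append(ch)  # spaces, Latin text, punctuation and dandas stay unchanged
    return "".join(out)


def bn_to_devanagari(text):
    return shift_script(text, BENGALI_START, DEVANAGARI_START)


def devanagari_to_bn(text):
    return shift_script(text, DEVANAGARI_START, BENGALI_START)


sentence = "আমি তোমাকে ভালোবাসি।"
converted = bn_to_devanagari(sentence)
print(converted)
# Converting back should recover the original Bengali sentence.
print(devanagari_to_bn(converted) == sentence)
```

Because the mapping is a plain offset, the round trip is lossless for ordinary text, which is what lets us train on Devanagari and still report Bengali output.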


Below, we apply the two steps above (tokenization and transliteration) to `data`.

In [15]:
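prefix = ""  # unused: IndicBART relies on language tags instead of a text prefix
max_input_length = 128
max_target_length = 128

def preprocess_function(examples):
    # Source side: "sentence </s> <2xx>", with <2en> marking English input.
    input_conv = [sent + " </s> <2en>" for sent in examples['src']]
    model_inputs = tokenizer(input_conv, max_length=max_input_length, truncation=True)
    # Target side: "<2yy> sentence </s>". The Bengali text is first transliterated
    # to Devanagari, and this notebook tags the target with <2hi>.
    output_conv = ["<2hi> " + ben_dev.transliterate(sent, "bn", "hi") + " </s>" for sent in examples['tgt']]
    # Tokenize the targets in target mode and store their ids as labels.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(output_conv, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs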
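The string templates used in `preprocess_function` can be looked at in isolation. The sketch below only reproduces the formatting step, without the tokenizer or the transliterator; `format_source` and `format_target` are hypothetical helper names introduced for illustration, and the Devanagari sentence is the decoded label shown later in this notebook.

```python
def format_source(sentence, lang_tag="<2en>"):
    """Source side: sentence, end-of-sentence marker, then the source language tag."""
    return f"{sentence} </s> {lang_tag}"


def format_target(sentence, lang_tag="<2hi>"):
    """Target side: target language tag, sentence, then the end-of-sentence marker."""
    return f"{lang_tag} {sentence} </s>"


src = "But the shoot was a tough one."
tgt_devanagari = "तबे श्युटटा खुब मुशकिलेर छिल।"  # target already transliterated to Devanagari

print(format_source(src))             # But the shoot was a tough one. </s> <2en>
print(format_target(tgt_devanagari))  # <2hi> तबे श्युटटा खुब मुशकिलेर छिल। </s>
```

Keeping the tags in the text itself means the tokenizer maps them to dedicated token ids, which is how the model is told which language it is reading and which one it should produce.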

In [16]:
tokenized_data = data.map(preprocess_function, batched=True)



This is what the tokenized (ready-to-feed) data looks like.

In [17]:
pd.DataFrame(tokenized_data['train'][:10])

,src,tgt,input_ids,token_type_ids,attention_mask,labels
0,But the shoot was a tough one.,তবে শ্যুটটা খুব মুশকিলের ছিল।,"[2, 2485, 22, 32110, 241, 80, 26560, 1052, 6, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 1140, 410, 25252, 252, 1974, 26838,..."
1,Road construction started.,রাস্তা নির্মাণ শুরু হয়েছে।,"[2, 17312, 15948, 7047, 6, 64001, 64004, 3]","[0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 2743, 637, 1119, 443, 8, 64001, 3]"
2,Why did he pay so much?,কেন তিনি এত টাকা দিতেন?,"[2, 28913, 4357, 450, 10006, 1771, 4618, 108, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 4138, 336, 4789, 1380, 2206, 40, 10..."
3,"""AT ITS worst, this has been Satan's century.","""এই শতাব্দীর প্রচণ্ড ভয়াবহতা এটাকে শয়তানের এক ...","[2, 131, 13817, 466, 23869, 37426, 7, 631, 292...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 131, 7763, 50185, 25078, 14062, 86,..."
4,That's our only demand.,সেটাই আমাদের একমাত্র দাবি।,"[2, 9951, 142, 36, 2145, 1916, 10122, 6, 64001...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 31100, 1373, 5867, 1434, 8, 64001, 3]"
5,He leads his life by teaching.,সে শিক্ষকতা করে জীবন পরিচালনা করে,"[2, 1265, 45960, 496, 4140, 271, 45841, 6, 640...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 42, 825, 86, 155, 2297, 11547, 155,..."
6,Im not leaving.,আমি এলাকা ছাড়ব না।,"[2, 40131, 457, 32045, 6, 64001, 64004, 3]","[0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 528, 9173, 9561, 85, 97, 8, 64001, 3]"
7,Under the instructions of the caliph Uthman ib...,খলিফা উসমান ইবনে আফফানের নির্দেশে মুয়াবিয়া এ...,"[2, 23570, 22, 55958, 51, 22, 4927, 10785, 104...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 606, 29947, 10, 288, 682, 40832, 53..."
8,The only way of weaning him off the ventilator...,তাকে বাঁচানোর একমাত্র উপায় বায়ুরন্ধ্র বন্ধ করে...,"[2, 202, 1916, 4052, 51, 1075, 1397, 193, 2173...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 2111, 28317, 9630, 5867, 6092, 1237..."
9,Gaibandha death toll rises to 5,"গাইবান্ধায় নিহতের সংখ্যা বেড়ে ৭, গৌরনদীতে ৩","[2, 8548, 924, 33242, 5092, 9542, 41785, 14833...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 54111, 220, 36126, 1142, 11111, 124..."


In [18]:
pd.DataFrame(tokenized_data['test'][:10])

,src,tgt,input_ids,token_type_ids,attention_mask,labels
0,We beg our Protestant and Jewish friends to pu...,কোন কোন ক্ষেত্রে কর্তৃপক্ষ এবং ধর্মীয় নেতারা ...,"[2, 2855, 281, 1498, 2145, 7373, 35777, 4187, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 1503, 1503, 4321, 8275, 210, 13928,..."
1,"Sa'd advised Muhammad: ""Don't be hard on him. ...","সা'দ মুহাম্মাদকে বলেন: ""তার প্রতি কঠোর হবেন না...","[2, 5336, 142, 343, 49343, 44765, 53, 131, 200...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 750, 142, 64, 3547, 10, 8429, 64, 1..."
2,Photo by 'Save Gaza Project',"ছবি ""সেভ গাজা প্রজেক্টের""।'","[2, 24649, 271, 82, 1326, 19315, 8548, 10552, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 1902, 131, 525, 523, 17188, 10, 456..."
3,"So, therefore, we need to test batteries under...","অতএব, আমআদের কিছুটা মান অবস্থাগুলির অধীনে ব্যা...","[2, 3867, 7, 35894, 7, 1075, 4212, 57, 13496, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 45953, 7, 651, 344, 444, 14372, 802..."
4,This party is also contesting in the elections.,নির্বাচনে এই দলের মধ্যেই প্রতিদ্বন্দ্বিতা হবে।,"[2, 2520, 2397, 158, 657, 53429, 67, 22, 6881,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 7125, 114, 2849, 12807, 28198, 10, ..."
5,Roads and houses collapsed.,"তলিয়ে গেছে ঘরবাড়ি, রাস্তাঘাট।","[2, 17312, 36, 62, 24573, 41590, 343, 6, 64001...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 1630, 1965, 2249, 401, 10784, 7, 51..."
6,"When a piece of paper is rolled up, Hitotsuyam...",হিতোসুয়েমা কাগজ দিয়ে ম্যাশে কৌশল অবলম্বন করে...,"[2, 8181, 80, 31612, 51, 24670, 158, 25156, 19...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 501, 200, 862, 407, 209, 6235, 514,..."
7,The founder of the modern Catholic movement Op...,"আধুনিক ক্যাথলিক সংঘের প্রতিষ্ঠাতা ওপাস ডেই, হো...","[2, 202, 35407, 51, 22, 24081, 1149, 12888, 39...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 1652, 823, 521, 4971, 869, 58, 2712..."
8,Research has shown that exercise also helps in...,"এছাড়া গবেষণায় দেখা গেছে, শরীরচর্চা উদ্বেগ ও মা...","[2, 23075, 292, 29396, 181, 21838, 657, 37287,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[2, 64006, 10019, 44502, 576, 2249, 7, 1218, 4..."
9,It will be so much fun.,অনেক মজা হবে তখন।,"[2, 1280, 424, 281, 1771, 4618, 28639, 6, 6400...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[2, 64006, 300, 7861, 481, 2442, 8, 64001, 3]"


In [19]:
print("Decoded English input_ids: ", tokenizer.decode(tokenized_data['train']['input_ids'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
print("Decoded Devnagri labels: ", tokenizer.decode(tokenized_data['train']['labels'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False))
print("Transliterated Bengali labels: ", ben_dev.transliterate(tokenizer.decode(tokenized_data['train']['labels'][0], skip_special_tokens=True, clean_up_tokenization_spaces=False), "hi", "bn"))

Decoded English input_ids:  But the shoot was a tough one.
Decoded Devnagri labels:  तबे श्युटटा खुब मुशकिलेर छिल।
Transliterated Bengali labels:  তবে শ্যুটটা খুব মুশকিলের ছিল।


In [20]:
# metric.compute(predictions=[tokenizer("<2hi>तोमाके भालोबासि। </s>")], references=[tokenizer("<2hi> आमि तोमाके भालोबासि। </s>")])

# Model Fine-Tuning

Before fine-tuning, we load the pretrained model.

In [21]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

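The data collator takes care of padding: it pads the inputs and labels of each batch to the length of the longest sequence in that batch, and it pads the labels with `-100` so that padded positions are ignored by the loss.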

In [22]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
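To illustrate what dynamic padding means, here is a simplified pure-Python sketch of the idea. It is not the `DataCollatorForSeq2Seq` implementation: `pad_batch` is a hypothetical helper, the token ids are toy values, and `pad_token_id=0` is a placeholder (in the notebook the real value comes from `tokenizer.pad_token_id`).

```python
LABEL_PAD_ID = -100  # positions with this label are skipped by the loss


def pad_batch(features, pad_token_id, label_pad_id=LABEL_PAD_ID):
    """Right-pad every example to the longest input and longest label in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input_pad = max_input - len(f["input_ids"])
        n_label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_input_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_input_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * n_label_pad)
    return batch


# Two toy examples of different lengths.
features = [
    {"input_ids": [2, 11, 12, 3], "labels": [2, 21, 3]},
    {"input_ids": [2, 11, 12, 13, 14, 3], "labels": [2, 21, 22, 23, 3]},
]
batch = pad_batch(features, pad_token_id=0)
print(batch["input_ids"][0])       # [2, 11, 12, 3, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 0, 0]
print(batch["labels"][0])          # [2, 21, 3, -100, -100]
```

Padding per batch rather than to a global maximum length keeps short batches small, which saves memory and compute when most sentences are far shorter than 128 tokens.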

Below are the hyperparameters with which we fine-tune `IndicBART` for the downstream NMT task.

In [23]:
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-en-to-bn",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=10,
    num_train_epochs=10,
    predict_with_generate=True)
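As a quick sanity check on these settings, the sketch below works out how many optimizer steps they imply. It assumes a single device and no gradient accumulation, which matches the trainer log further down (total train batch size of 32); the variable names are ours.

```python
import math

num_train_examples = 111020  # size of the train split
num_eval_examples = 4626     # size of the train-dev split
train_batch_size = 32        # per_device_train_batch_size
eval_batch_size = 16         # per_device_eval_batch_size
num_epochs = 10              # num_train_epochs

# The last, partially filled batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
eval_batches = math.ceil(num_eval_examples / eval_batch_size)

print(steps_per_epoch)  # 3470
print(total_steps)      # 34700
print(eval_batches)     # 290
```

The total agrees with the `Total optimization steps = 34700` line printed by the trainer.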

In [24]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    # sacrebleu expects a list of references for every prediction.
    labels = [[label.strip()] for label in labels]
    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # The data collator pads labels with -100, which the tokenizer cannot decode,
    # so replace it with the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    # Predictions and labels are both still in Devanagari at this point.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    # Average generated length, counted in non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

In [25]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `MBartForConditionalGeneration.forward` and have been ignored: src, token_type_ids, tgt. If src, token_type_ids, tgt are not expected by `MBartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 111020
  Num Epochs = 10
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 34700


Epoch,Training Loss,Validation Loss


Saving model checkpoint to IndicBART-finetuned-en-to-bn/checkpoint-500
Configuration saved in IndicBART-finetuned-en-to-bn/checkpoint-500/config.json
Model weights saved in IndicBART-finetuned-en-to-bn/checkpoint-500/pytorch_model.bin
tokenizer config file saved in IndicBART-finetuned-en-to-bn/checkpoint-500/tokenizer_config.json
Special tokens file saved in IndicBART-finetuned-en-to-bn/checkpoint-500/special_tokens_map.json
added tokens file saved in IndicBART-finetuned-en-to-bn/checkpoint-500/added_tokens.json
Saving model checkpoint to IndicBART-finetuned-en-to-bn/checkpoint-1000
Configuration saved in IndicBART-finetuned-en-to-bn/checkpoint-1000/config.json
Model weights saved in IndicBART-finetuned-en-to-bn/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in IndicBART-finetuned-en-to-bn/checkpoint-1000/tokenizer_config.json
Special tokens file saved in IndicBART-finetuned-en-to-bn/checkpoint-1000/special_tokens_map.json
added tokens file saved in IndicBART-finetuned-e