# Introduction
It is often very difficult and expensive to train a full-blown Machine Translation model from scratch on a large parallel corpora. Hence, here we will try to fine-tune pretrained model(`IndicBART`) with our training data. We will be using the libraries provided by Hugging Face to do so.

### Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
cd "/content/drive/MyDrive/IASNLP"

/content/drive/MyDrive/IASNLP


### Imporrting Necessary Packages

In [None]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install sacrebleu
!pip install sentencepiece
!pip install indic-nlp-library

In [47]:
import numpy as np
import pandas as pd

from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
import transformers
import sentencepiece
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset, load_metric

# Load Data

We have saved the train data and test data previously. We will straight away load it.

In [5]:
data = load_dataset('csv', data_files={'train': ['train_data.csv'], 'test': ['train_dev.csv']})

Using custom data configuration default-b1b37da2b3a2df75
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-b1b37da2b3a2df75/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)


  0%|          | 0/2 [00:00<?, ?it/s]

Below, we can see the distribution of the data.

In [6]:
data = data.remove_columns('Unnamed: 0')
data

DatasetDict({
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 111020
    })
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 4626
    })
})

The Train Data

In [7]:
pd.DataFrame(data['train'][:10])

Unnamed: 0,src,tgt
0,But the shoot was a tough one.,তবে শ্যুটটা খুব মুশকিলের ছিল।
1,Road construction started.,রাস্তা নির্মাণ শুরু হয়েছে।
2,Why did he pay so much?,কেন তিনি এত টাকা দিতেন?
3,"""AT ITS worst, this has been Satan's century.","""এই শতাব্দীর প্রচণ্ড ভয়াবহতা এটাকে শয়তানের এক ..."
4,That's our only demand.,সেটাই আমাদের একমাত্র দাবি।
5,He leads his life by teaching.,সে শিক্ষকতা করে জীবন পরিচালনা করে
6,Im not leaving.,আমি এলাকা ছাড়ব না।
7,Under the instructions of the caliph Uthman ib...,খলিফা উসমান ইবনে আফফানের নির্দেশে মুয়াবিয়া এ...
8,The only way of weaning him off the ventilator...,তাকে বাঁচানোর একমাত্র উপায় বায়ুরন্ধ্র বন্ধ করে...
9,Gaibandha death toll rises to 5,"গাইবান্ধায় নিহতের সংখ্যা বেড়ে ৭, গৌরনদীতে ৩"


The Train-Dev Data

In [8]:
pd.DataFrame(data['test'][:10])

Unnamed: 0,src,tgt
0,We beg our Protestant and Jewish friends to pu...,কোন কোন ক্ষেত্রে কর্তৃপক্ষ এবং ধর্মীয় নেতারা ...
1,"Sa'd advised Muhammad: ""Don't be hard on him. ...","সা'দ মুহাম্মাদকে বলেন: ""তার প্রতি কঠোর হবেন না..."
2,Photo by 'Save Gaza Project',"ছবি ""সেভ গাজা প্রজেক্টের""।'"
3,"So, therefore, we need to test batteries under...","অতএব, আমআদের কিছুটা মান অবস্থাগুলির অধীনে ব্যা..."
4,This party is also contesting in the elections.,নির্বাচনে এই দলের মধ্যেই প্রতিদ্বন্দ্বিতা হবে।
5,Roads and houses collapsed.,"তলিয়ে গেছে ঘরবাড়ি, রাস্তাঘাট।"
6,"When a piece of paper is rolled up, Hitotsuyam...",হিতোসুয়েমা কাগজ দিয়ে ম্যাশে কৌশল অবলম্বন করে...
7,The founder of the modern Catholic movement Op...,"আধুনিক ক্যাথলিক সংঘের প্রতিষ্ঠাতা ওপাস ডেই, হো..."
8,Research has shown that exercise also helps in...,"এছাড়া গবেষণায় দেখা গেছে, শরীরচর্চা উদ্বেগ ও মা..."
9,It will be so much fun.,অনেক মজা হবে তখন।


# Data Preprocessing

We will start by preprocessing data. For that we have to get the model configurations of the pretrained model which we are going to fine-tune.

In [9]:
model_checkpoint = "ai4bharat/IndicBART"

Let's set the metric for evaluation we will be using.

In [10]:
metric = load_metric("sacrebleu")

## Tokenization & Normalization

We use the tokenizer that is consistent with the vocabulary and method which is used in `IndicBART`

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, do_lower_case=False, use_fast=False, keep_accents=True)

### Transliteration

One key thing to keep in mind is `IndicBART` is pretrained on 11 different Indian Languages, where all other languages except Hindi and Marathi are transliterated to Devnagri Script. Hence, to use it for Bengali we had to transliterate the Bengali text to Devnagri script as show below. Later we would also need to convert Devnagri to Bengali.

In [42]:
ben_dev = UnicodeIndicTransliterator()

In [56]:
beng_sent = "আমি তোমাকে ভালোবাসি।"
print("Bengali: ", beng_sent)
print("Hindi: ", ben_dev.transliterate(beng_sent, "bn", "hi"))

Bengali:  <2bn> আমি তোমাকে ভালোবাসি। </s>
Hindi:  <2bn> आमि तोमाके भालोबासि। </s>


Below, we apply the above two steps to `data`.

In [57]:
prefix = ""
max_input_length = 128
max_target_length = 128
def preprocess_function(examples):
    input_conv = [sent + " </s> <2en>" for sent in examples['src']]
    model_inputs = tokenizer(examples['src'], max_length=max_input_length, truncation=True)
    output_conv = ["<2bn> " + ben_dev.transliterate(sent, "bn", "hi") + " </s>" for sent in examples['tgt']]
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(output_conv, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [58]:
tokenized_data = data.map(preprocess_function, batched=True)

  0%|          | 0/112 [00:00<?, ?ba/s]

  0%|          | 0/5 [00:00<?, ?ba/s]

# Model Fine-Tuning

Before going ahead to fine-tune model, we start by loading the pre-trained model.

In [48]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/931M [00:00<?, ?B/s]

In [49]:
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-en-to-bn",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True    
)

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

In [None]:
tokenizer.decode(tokenizer(ben_dev.transliterate(beng_sent, "bn", "hi")))

In [59]:
tokenized_data['train'][:2]

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]],
 'input_ids': [[2, 2485, 22, 32110, 241, 80, 26560, 1052, 6, 3],
  [2, 17312, 15948, 7047, 6, 3]],
 'labels': [[2,
   64003,
   1140,
   410,
   25252,
   252,
   1974,
   26838,
   58,
   1046,
   8,
   64001,
   3],
  [2, 64003, 2743, 637, 1119, 443, 8, 64001, 3]],
 'src': ['But the shoot was a tough one.', 'Road construction started.'],
 'tgt': ['তবে শ্যুটটা খুব মুশকিলের ছিল।', 'রাস্তা নির্মাণ শুরু হয়েছে।'],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]}

In [72]:
# First tokenize the input and outputs. The format below is how IndicBART was trained so the input should be "Sentence </s> <2xx>" where xx is the language code. Similarly, the output should be "<2yy> Sentence </s>". 
inp = tokenizer("I love you. </s> <2en>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids # tensor([[  466,  1981,    80, 25573, 64001, 64004]])

out = tokenizer("<2bn> आमि तोमाके भालोबासि। </s>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids # tensor([[64006,   942,    43, 32720,  8384, 64001]])
# Note that if you use any language other than Hindi or Marathi, you should convert its script to Devanagari using the Indic NLP Library.

model_outputs=model(input_ids=inp, decoder_input_ids=out[:,0:-1], labels=out[:,1:])

# For loss
print(model_outputs.loss) ## This is not label smoothed.

# For logits
print(model_outputs.logits)

# For generation. Pardon the messiness. Note the decoder_start_token_id.

model.eval() # Set dropouts to zero

model_output=model.generate(inp, use_cache=True, num_beams=4, max_length=20, min_length=1, early_stopping=True, pad_token_id=tokenizer._convert_token_to_id_with_added_voc("<pad>"), bos_token_id=tokenizer._convert_token_to_id_with_added_voc("<s>"), eos_token_id=tokenizer._convert_token_to_id_with_added_voc("</s>"), decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2hi>"))
# Decode to get output strings

decoded_output=tokenizer.decode(model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(decoded_output) # I love you in Bengali
# Note that if your output language is not Hindi or Marathi, you should convert its script from Devanagari to the desired language using the Indic NLP Library.

tensor(7.0369, grad_fn=<NllLossBackward0>)
tensor([[[ 0.6202,  0.5011,  0.6202,  ...,  0.8604, -0.0365, -0.5884],
         [ 0.8759,  1.2849,  0.8759,  ..., -0.1371,  0.8773,  1.0417],
         [ 0.2118,  1.4138,  0.2118,  ..., -2.6138, -2.9056, -1.6386],
         [ 0.3714,  3.3580,  0.3714,  ...,  1.3963, -1.4405, -1.7548],
         [ 0.4080,  2.2053,  0.4080,  ..., -1.6285, -1.8012, -0.1449],
         [ 0.6919,  0.7900,  0.6919,  ...,  0.6170,  0.2085,  0.2791]]],
       grad_fn=<AddBackward0>)
I love you.


In [73]:
out[:,0:-1]

tensor([[64006,   528, 25353, 41545,    21,     8]])

In [64]:
inp

tensor([[  466,  8504,  1195,     6, 64001, 64004]])

In [68]:
model_output

tensor([[64003,   466,  8504,  1195,     6, 64001]])

In [87]:
np.argmax(1/(1+np.exp(-1*model_outputs.logits.detach().numpy())) / np.sum(1/(1+np.exp(-1*model_outputs.logits.detach().numpy())), axis = -1, keepdims = True), axis=-1)

array([[  466,  8504,  8504,    21,     6, 64001]])

In [86]:
model_outputs.logits.shape

torch.Size([1, 6, 64014])

In [88]:
model?