# Introduction
Training a full machine translation model from scratch on a large parallel corpus is difficult and expensive. Instead, we fine-tune a pretrained model (`IndicBART`) on our training data, using the libraries provided by Hugging Face.

### Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
cd "/content/drive/MyDrive/IASNLP"

/content/drive/MyDrive/IASNLP


### Importing Necessary Packages

In [None]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install sacrebleu
!pip install sentencepiece
!pip install indic-nlp-library

In [21]:
import numpy as np
import pandas as pd

from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
import transformers
import sentencepiece
from transformers import AutoTokenizer
from datasets import load_dataset, load_metric

# Load Data

The training and test data were saved to CSV files earlier, so we load them directly.

In [5]:
data = load_dataset('csv', data_files={'train': ['train_data.csv'], 'test': ['train_dev.csv']})

Using custom data configuration default-b1b37da2b3a2df75
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-b1b37da2b3a2df75/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)


Below, we drop the leftover index column and inspect the splits and their sizes.

In [6]:
data = data.remove_columns('Unnamed: 0')
data

DatasetDict({
    train: Dataset({
        features: ['src', 'tgt'],
        num_rows: 111020
    })
    test: Dataset({
        features: ['src', 'tgt'],
        num_rows: 4626
    })
})

The training data:

In [7]:
pd.DataFrame(data['train'][:10])

| | src | tgt |
|---|---|---|
| 0 | But the shoot was a tough one. | তবে শ্যুটটা খুব মুশকিলের ছিল। |
| 1 | Road construction started. | রাস্তা নির্মাণ শুরু হয়েছে। |
| 2 | Why did he pay so much? | কেন তিনি এত টাকা দিতেন? |
| 3 | "AT ITS worst, this has been Satan's century. | "এই শতাব্দীর প্রচণ্ড ভয়াবহতা এটাকে শয়তানের এক ... |
| 4 | That's our only demand. | সেটাই আমাদের একমাত্র দাবি। |
| 5 | He leads his life by teaching. | সে শিক্ষকতা করে জীবন পরিচালনা করে |
| 6 | Im not leaving. | আমি এলাকা ছাড়ব না। |
| 7 | Under the instructions of the caliph Uthman ib... | খলিফা উসমান ইবনে আফফানের নির্দেশে মুয়াবিয়া এ... |
| 8 | The only way of weaning him off the ventilator... | তাকে বাঁচানোর একমাত্র উপায় বায়ুরন্ধ্র বন্ধ করে... |
| 9 | Gaibandha death toll rises to 5 | গাইবান্ধায় নিহতের সংখ্যা বেড়ে ৭, গৌরনদীতে ৩ |


The train-dev data (loaded as the `test` split):

In [8]:
pd.DataFrame(data['test'][:10])

| | src | tgt |
|---|---|---|
| 0 | We beg our Protestant and Jewish friends to pu... | কোন কোন ক্ষেত্রে কর্তৃপক্ষ এবং ধর্মীয় নেতারা ... |
| 1 | Sa'd advised Muhammad: "Don't be hard on him. ... | সা'দ মুহাম্মাদকে বলেন: "তার প্রতি কঠোর হবেন না... |
| 2 | Photo by 'Save Gaza Project' | ছবি "সেভ গাজা প্রজেক্টের"।' |
| 3 | So, therefore, we need to test batteries under... | অতএব, আমআদের কিছুটা মান অবস্থাগুলির অধীনে ব্যা... |
| 4 | This party is also contesting in the elections. | নির্বাচনে এই দলের মধ্যেই প্রতিদ্বন্দ্বিতা হবে। |
| 5 | Roads and houses collapsed. | তলিয়ে গেছে ঘরবাড়ি, রাস্তাঘাট। |
| 6 | When a piece of paper is rolled up, Hitotsuyam... | হিতোসুয়েমা কাগজ দিয়ে ম্যাশে কৌশল অবলম্বন করে... |
| 7 | The founder of the modern Catholic movement Op... | আধুনিক ক্যাথলিক সংঘের প্রতিষ্ঠাতা ওপাস ডেই, হো... |
| 8 | Research has shown that exercise also helps in... | এছাড়া গবেষণায় দেখা গেছে, শরীরচর্চা উদ্বেগ ও মা... |
| 9 | It will be so much fun. | অনেক মজা হবে তখন। |


# Data Preprocessing

We start by preprocessing the data. To do so, we first specify the checkpoint of the pretrained model we are going to fine-tune, since the tokenizer has to match it.

In [9]:
model_checkpoint = "ai4bharat/IndicBART"

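Next, we set the evaluation metric. We use SacreBLEU. (Recent releases of `datasets` deprecate `load_metric` in favor of the separate `evaluate` library.)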

In [10]:
metric = load_metric("sacrebleu")
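
SacreBLEU reports BLEU, which is built on clipped n-gram precision between a hypothesis and its reference. As a rough illustration only (this is not the SacreBLEU implementation, which also handles tokenization, several n-gram orders, and a brevity penalty), the core idea can be sketched as follows. The helper names `ngram_counts` and `clipped_precision` are hypothetical, and whitespace splitting is a simplifying assumption.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference ("clipping"), so repeating a correct word does not help.
    Assumption: plain whitespace tokenization, single reference.
    """
    hyp = ngram_counts(hypothesis.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# "the" appears three times in the hypothesis but is credited only once.
print(clipped_precision("the the the cat", "the cat sat", 1))
```

The clipping step is what keeps a degenerate output such as a repeated common word from scoring well.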

## Tokenization & Normalization

We use the tokenizer that matches the vocabulary and tokenization method of `IndicBART`.

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, do_lower_case=False, use_fast=False, keep_accents=True)

In [12]:
tokenizer(["I love you.", "Let's go to Egypt."])

{'input_ids': [[2, 466, 8504, 1195, 6, 3], [2, 39818, 142, 36, 3852, 57, 53107, 6, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1]]}
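
In the call above the tokenizer adds its own default special tokens; the leading `2` and trailing `3` in each sequence appear to be these. The IndicBART model card instead describes inputs of the form `sentence </s> <2xx>` and targets of the form `<2yy> sentence </s>`, where `xx` and `yy` are language codes, tokenized with `add_special_tokens=False`. A minimal sketch of building such strings, assuming that format; the helpers `format_source` and `format_target` are hypothetical and not part of any library:

```python
def format_source(sentence: str, lang: str) -> str:
    """Build an encoder input: the sentence, then </s>, then the language tag."""
    return f"{sentence.strip()} </s> <2{lang}>"


def format_target(sentence: str, lang: str) -> str:
    """Build a decoder target: the language tag, then the sentence, then </s>."""
    return f"<2{lang}> {sentence.strip()} </s>"


# English source and a Bengali target already written in Devanagari script.
print(format_source("It will be so much fun.", "en"))
print(format_target("अनेक मजा हबे तखन।", "bn"))
```

Strings built this way would then be passed to the tokenizer with `add_special_tokens=False`, so that the default special tokens are not added on top of the explicit ones.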

One key point is that `IndicBART` is pretrained on 11 Indian languages, and all of them except Hindi and Marathi, which already use Devanagari, are transliterated to the Devanagari script. Hence, to use it for Bengali, we have to transliterate the Bengali text to Devanagari, as shown below. Note that the `"hi"` code only selects the target script: the output is still Bengali, written in Devanagari, not a Hindi translation.

In [23]:
ben_to_dev = UnicodeIndicTransliterator()

In [25]:
beng_sent = "অনেক মজা হবে তখন।"
print("Bengali: ", beng_sent)
print("Hindi: ", ben_to_dev.transliterate(beng_sent, "bn", "hi"))

Bengali:  অনেক মজা হবে তখন।
Hindi:  अनेक मजा हबे तखन।
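
This kind of script conversion is possible because the Unicode blocks of the major Indic scripts are laid out in parallel, so a Bengali letter and its Devanagari counterpart mostly sit at the same offset within their blocks. The following is a minimal standard-library sketch of that idea, not the `indic-nlp-library` implementation, which handles further special cases. The function name `bengali_to_devanagari` and the `MAX_OFFSET` cutoff are assumptions made for illustration.

```python
BENGALI_START = 0x0980     # first code point of the Bengali block
DEVANAGARI_START = 0x0900  # first code point of the Devanagari block
MAX_OFFSET = 0x6F          # assumption: map only the part of the blocks that lines up


def bengali_to_devanagari(text: str) -> str:
    """Shift each Bengali character to the same offset in the Devanagari block.

    Characters outside the mapped range (spaces, Latin text, the danda
    U+0964, which both scripts share) are passed through unchanged.
    """
    out = []
    for ch in text:
        offset = ord(ch) - BENGALI_START
        if 0 <= offset <= MAX_OFFSET:
            out.append(chr(DEVANAGARI_START + offset))
        else:
            out.append(ch)
    return "".join(out)


print(bengali_to_devanagari("অনেক মজা হবে তখন।"))
```

For the example sentence, this reproduces the library output shown above.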
