### Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd "/content/drive/MyDrive/IASNLP"

### Importing Necessary Libraries

In [3]:
!pip install sentencepiece

In [4]:
import numpy as np
import pandas as pd
import sentencepiece as spm
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt 

from functools import reduce
import os

### Loading Data

We have downloaded the entire Samanantar Parallel Corpus, which is split by the source each part was collected from. For training we only need the English-Bengali parallel corpus, so we have removed the data that pairs English with other Indian languages. We may include it later in our study or training to see whether it improves performance.

Because of limited computational resources, we will take a fraction of the data from each source and pre-process it.

Below, we list the English-Bengali parallel corpora newly created by Samanantar from various Indian channels and programmes, followed by the existing parallel corpora from previous workshops and events.

In [5]:
print("New Data created by Samanantar:")
!ls ./Data/source_wise_splits/created
print("\nExisting Data Sources before Samanantar:")
!ls ./Data/source_wise_splits/existing

New Data created by Samanantar:
anuvaad_dw		asianetnews	  ie_news	ocr
anuvaad-general_corpus	coursera	  ie_sports	oneindia
anuvaad_mykhel		dwnews		  ie_tech	pmi
anuvaad_ocr		ie_business	  indiccorp	sentinel
anuvaad_oneindia	ie_education	  khan_academy	wikipedia
anuvaad_pib		ie_entertainment  Kurzgesagt
anuvaad_pib_archives	ie_general	  mykhel
anuvaad_prothomalo	ie_lifestyle	  nptel

Existing Data Sources before Samanantar:
alt	     cvit-pib	   GNOME  Mozilla-I10n	 Tanzil   tico19-terminologies
banglanmt    ELRC_2922	   JW300  OpenSubtitles  Tatoeba  Ubuntu
bible-uedin  GlobalVoices  KDE4   sipc		 TED2020  wikimatrix_opus


In [6]:
file_names = os.listdir("./Data/source_wise_splits/existing")
data_by_source = list()
for file_name in file_names:
    data_by_source.append(pd.read_csv("./Data/source_wise_splits/existing/"+file_name+"/en-bn/bn_sents.tsv", sep="\t"))

In [7]:
file_names = os.listdir("./Data/source_wise_splits/created")
for file_name in file_names:
    data_by_source.append(pd.read_csv("./Data/source_wise_splits/created/"+file_name+"/en-bn/bn_sents.tsv", sep="\t"))
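The two loading loops above assume that every source folder contains an `en-bn/bn_sents.tsv` file. As a minimal sketch, the paths could be collected first and checked for existence. The helper name `find_pair_files` and its parameters are hypothetical; only the folder layout is taken from the cells above.

```python
from pathlib import Path


def find_pair_files(root, pair="en-bn", filename="bn_sents.tsv"):
    """Return {"group/source": path} for each source folder that has the pair file.

    Hypothetical helper: the layout root/<group>/<source>/<pair>/<filename>
    is assumed from the paths used in the loading cells.
    """
    found = {}
    for group in ("existing", "created"):
        group_dir = Path(root) / group
        if not group_dir.is_dir():
            continue  # skip a missing group folder instead of raising
        for source_dir in sorted(group_dir.iterdir()):
            candidate = source_dir / pair / filename
            if candidate.is_file():
                found[f"{group}/{source_dir.name}"] = candidate
    return found
```

Each returned path can then be passed to `pd.read_csv(path, sep="\t")` as before, and the dictionary keys keep track of which source every frame came from.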

In [8]:
data_by_source = [dat.drop(columns='idx') for dat in data_by_source]


Let's have a look at one of the loaded files. We can see that there are two columns, `src` and `tgt`, standing for source and target. Each English source sentence is paired with its Bengali translation.

In [9]:
data_by_source[0].head()

Unnamed: 0,src,tgt
0,"Okay, I'll be right there.","ওকে, এখনই আসছি"
1,Give one's lessons.,পড়া দেওয়া
2,"Much to the Witnesses' surprise, even the pros...","সাক্ষীরা অত্যন্ত বিস্মিত হন, এমনকি সরকারি উকিল..."
3,I am at your service.,আমি তোমাকে সাহায্য করার জন্যই
4,"Via Facebook Messenger, Global Voices talked w...",গ্লোবাল ভয়েসস বাংলা'র পক্ষ থেকে আমরা যোগাযোগ ক...


Ideally, we will want our training data to be spread across all the sources. Hence, we will merge all the sources together.

In [30]:
data = pd.concat(data_by_source, ignore_index=True)
data.to_csv('full_data.csv')

We have in total 9251703 parallel sentences for English to Bengali.

In [11]:
data

Unnamed: 0,src,tgt
0,"Okay, I'll be right there.","ওকে, এখনই আসছি"
1,Give one's lessons.,পড়া দেওয়া
2,"Much to the Witnesses' surprise, even the pros...","সাক্ষীরা অত্যন্ত বিস্মিত হন, এমনকি সরকারি উকিল..."
3,I am at your service.,আমি তোমাকে সাহায্য করার জন্যই
4,"Via Facebook Messenger, Global Voices talked w...",গ্লোবাল ভয়েসস বাংলা'র পক্ষ থেকে আমরা যোগাযোগ ক...
...,...,...
9251698,A panel of auditors for auditing listed compan...,নিরীক্ষা কার্যক্রমে শৃঙ্খলা আনয়নে তালিকাভুক্ত...
9251699,"On this consideration, and on certain conditio...",সে কারণে কতিপয় শর্ত সাপেক্ষে এসব যন্ত্রাংশের ...
9251700,"Under these rules, fixed price method for IPOs...",এতে অভিহিত মূল্যে আইপিও’র জন্য ফিক্সড প্রাইস প...
9251701,Steps are being taken to introduce rural ratio...,টিআর ও ভিজিএফ এর পরিবর্তে পল্লী রেশনিং কর্মসূচ...


We now shuffle the data and draw two disjoint random samples. The training set (`train_data`) is 10% of the entire dataset. The training-development set (`train_dev_data`) is a further 0.2% of the entire dataset, and the model is never trained on it. The development set and the test set will be presented later.

Reason: The entire dataset is massive and would take large computing resources to train a model on. Hence, we take a subset of the dataset to make training easier.
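As a quick sanity check on the split sizes, the arithmetic can be worked out by hand. This is a sketch under one assumption: that the training size is rounded down and the held-out size is rounded up, which is how we understand `train_test_split` to treat fractional sizes.

```python
import math

n_samples = 9_251_703    # total number of parallel sentences reported above
train_fraction = 0.1     # value passed as train_size
dev_fraction = 0.002     # value passed as test_size

# Assumption: train size is floored, held-out size is ceiled
n_train = math.floor(train_fraction * n_samples)
n_train_dev = math.ceil(dev_fraction * n_samples)

# Rows that end up in neither sample
n_unused = n_samples - n_train - n_train_dev

print("train:", n_train)
print("train-dev:", n_train_dev)
print("unused:", n_unused)
```

Under this assumption roughly 90% of the corpus is left out of both samples, which is the intended effect of training on a subset.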

In [13]:
# Disjoint random samples: 10% for training, 0.2% held out as train-dev
train_data, train_dev_data = train_test_split(data, train_size=0.1, test_size=0.002, random_state=43)

In [14]:
train_data

Unnamed: 0,src,tgt
5785531,Any new songs?,নতুন কোনও অ্যালবাম?
1859281,Displays showed The Watchtower in the language...,"প্রদর্শনগুলো আমেরিকা, ইউরোপ, এশিয়া ও আফ্রিকার ..."
6773766,"As the situation turned tense, the Police rush...",ঘটনাস্থলে পুলিশ যতক্ষণে পৌঁছয় ততক্ষণে পরিস্থিত...
7311860,Many were crying.,তখন অনেকে কান্নায় ভেঙে পড়েন।
1697213,The brain cannot work properly in a diseased b...,রোগাক্রান্ত শরীরে মস্তিস্ক উপযুক্তভাবে কাজ করত...
...,...,...
5392787,One boy & one girl.,মিনার এক ছেলে এক মেয়ে।
6804036,Her last,তার শেষ কথা—
128897,"It also expresses its need for transparency, f...","এছাড়াও এটি এর স্বচ্ছতা, স্বাধীনতা এবং একটি ""সঠ..."
8530341,Khanpur Union (Bengali: ) is a Union Parishad ...,কাঞ্চনপুর ইউনিয়ন হরিরামপুর উপজেলার আওতাধীন এক...


In [40]:
train_dev_data

Unnamed: 0,src,tgt
2163022,We beg our Protestant and Jewish friends to pu...,কোন কোন ক্ষেত্রে কর্তৃপক্ষ এবং ধর্মীয় নেতারা ...
8637212,"Sa'd advised Muhammad: ""Don't be hard on him. ...","সা'দ মুহাম্মাদকে বলেন: ""তার প্রতি কঠোর হবেন না..."
727073,Photo by 'Save Gaza Project',"ছবি ""সেভ গাজা প্রজেক্টের""।'"
8824941,"So, therefore, we need to test batteries under...","অতএব, আমআদের কিছুটা মান অবস্থাগুলির অধীনে ব্যা..."
6217862,This party is also contesting in the elections.,নির্বাচনে এই দলের মধ্যেই প্রতিদ্বন্দ্বিতা হবে।
...,...,...
1709971,"For example, it is dark as you approach your h...","উদাহরণস্বরূপ, কল্পনা করুন আপনি আপনার বাড়ির দিক..."
1854006,"Sometimes, though, hikers in mountainous terra...","কিন্তু কখনও কখনও, পাহাড়ি এলাকার লম্বা, দুরারোহ..."
7912458,Lets run,ছুটেবাড়ি আসি।
5027564,Real pressure is going into work each day whil...,"সত্যিকারের চাপ হলো, নিজের নিরাপত্তাকে ঝুঁকিতে..."


Now, one thing we need to decide is the maximum number of words per sentence. Below, we look at the source sentences of our training data to see what input lengths we encounter.

In [16]:
sentences_filtered_word_num = {}
for sent in train_data['src']:
    sent_num_words = len(sent.split())
    for num_words in range(10, 151, 10):
        if sent_num_words <= num_words:
            sentences_filtered_word_num[num_words] = sentences_filtered_word_num.get(num_words, 0) + 1

In [17]:
for num_words, num_sentences in sentences_filtered_word_num.items():
    print(f"The number of english sentences with <= {num_words} words are: {num_sentences}")

The number of english sentences with <= 10 words are: 580175
The number of english sentences with <= 20 words are: 789604
The number of english sentences with <= 30 words are: 876862
The number of english sentences with <= 40 words are: 907848
The number of english sentences with <= 50 words are: 917824
The number of english sentences with <= 60 words are: 921431
The number of english sentences with <= 70 words are: 922963
The number of english sentences with <= 80 words are: 923674
The number of english sentences with <= 90 words are: 924128
The number of english sentences with <= 100 words are: 924403
The number of english sentences with <= 110 words are: 924603
The number of english sentences with <= 120 words are: 924752
The number of english sentences with <= 130 words are: 924861
The number of english sentences with <= 140 words are: 924938
The number of english sentences with <= 150 words are: 924996


Looking at the above figures, a maximum length of 60 words is a good choice: about 99.6% of the sentences counted above (921431 of the 924996 with at most 150 words) already have at most 60 words.
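One way to make this choice less of a judgment call is to turn the cumulative counts into coverage fractions and pick the smallest threshold that reaches a target. The sketch below copies the counts from the output above. The training-set size of 925170 is an assumption (10% of 9251703, rounded down), and the 99.5% target and the helper `smallest_threshold` are hypothetical.

```python
# Cumulative counts copied from the printed output (threshold -> sentences with <= threshold words)
cumulative = {10: 580175, 20: 789604, 30: 876862, 40: 907848, 50: 917824,
              60: 921431, 70: 922963, 80: 923674, 90: 924128, 100: 924403}

n_train = 925170   # assumed size of train_data (10% of 9251703, rounded down)
target = 0.995     # hypothetical coverage target, not part of the original analysis


def smallest_threshold(cumulative, total, target):
    """Return the smallest threshold whose coverage reaches target, or None."""
    for threshold in sorted(cumulative):
        if cumulative[threshold] / total >= target:
            return threshold
    return None


for threshold in sorted(cumulative):
    print(f"<= {threshold:3d} words: {cumulative[threshold] / n_train:.2%}")
print("chosen:", smallest_threshold(cumulative, n_train, target))
```

A stricter or looser target would move the cut-off, so the target itself is the real design decision here.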

Now, we look at the number of unique words in both the source and target sentences.

In [18]:
vocab_en = set([word for sent in train_data['src'].values for word in sent.split()])

In [19]:
vocab_ben = set([word for sent in train_data['tgt'] for word in sent.split()])

In [23]:
print("Number of unique words in english train data:", len(vocab_en))
print("Number of unique words in bengali train data:", len(vocab_ben))

Number of unique words in english train data: 414402
Number of unique words in bengali train data: 591253
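These counts come from plain whitespace splitting, so `songs?` and `songs` count as different words, as do differently capitalised forms. A small sketch of the same counting with frequencies attached shows how many of the "unique words" occur only once. The helper `word_counts` is hypothetical, and the third toy sentence is made up for illustration.

```python
from collections import Counter


def word_counts(sentences):
    """Count whitespace-separated tokens, mirroring the sent.split() used above."""
    counts = Counter()
    for sent in sentences:
        counts.update(sent.split())
    return counts


# Toy input: two sentences from the samples above plus one invented sentence
toy_src = ["Any new songs?", "Many were crying.", "Any new albums?"]
counts = word_counts(toy_src)

# Tokens seen exactly once; on real data these are often rare forms or typos
singletons = [word for word, count in counts.items() if count == 1]

print("unique tokens:", len(counts))
print("tokens seen once:", len(singletons))
```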


We will create large text files for both English and Bengali (from the entire dataset except `train_dev_data`) to train our Byte-Pair tokenizer, which builds its vocabulary from the most frequently occurring character sequences.
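To illustrate what a Byte-Pair tokenizer learns from such text, here is a toy, pure-Python version of the merge-learning loop. It is a sketch of the general idea only, not the SentencePiece implementation. The function `learn_bpe` and the example words are hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges symbol merges from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties go to the pair seen first
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(["low", "low", "lower", "lowest", "new", "newer"], num_merges=3)
print(merges)
```

Frequent character sequences such as `low` become single symbols after a few merges, while rare words stay split into smaller pieces. This is why a fixed subword vocabulary can cover a corpus with several hundred thousand distinct surface words.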

In [46]:
# Write every English sentence except the held-out train_dev_data rows
with open('eng.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(data['src'].drop(train_dev_data.index)))

In [47]:
# Write every Bengali sentence except the held-out train_dev_data rows
with open('ben.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(data['tgt'].drop(train_dev_data.index)))

0                                 Okay, I'll be right there.
1                                        Give one's lessons.
2          Much to the Witnesses' surprise, even the pros...
3                                      I am at your service.
4          Via Facebook Messenger, Global Voices talked w...
                                 ...                        
9251698    A panel of auditors for auditing listed compan...
9251699    On this consideration, and on certain conditio...
9251700    Under these rules, fixed price method for IPOs...
9251701    Steps are being taken to introduce rural ratio...
9251702    I propose to rationalize rates and to introduc...
Name: src, Length: 9233199, dtype: object