### Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
cd "/content/drive/MyDrive/IASNLP"

### Importing Necessary Libraries

In [15]:
import numpy as np
import pandas as pd

from functools import reduce

import os

### Loading Data

We have downloaded the entire Samanantar parallel corpus, which is split by the source each portion was collected from. For training we only need the English-Bengali parallel corpus, so we have removed the data pairing English with other Indian languages. We might bring those pairs back later in our study or training to see whether they improve performance.

Due to computational resource constraints, we will take a fraction of the data from each source and pre-process it.

Below, we can see the newly created sources of English-Bengali parallel corpora, drawn from various Indian channels and programmes. The existing parallel corpora from previous workshops and events are also included.
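As a rough illustration of taking a fraction from each source, the sketch below samples rows from every per-source DataFrame independently, so that small sources are not crowded out by large ones. The helper name `sample_each_source`, the fraction `0.1`, and the seed are all assumptions made for this example; the notebook does not state which values are actually used.

```python
import pandas as pd

# Hypothetical fraction and seed; the notebook does not specify them.
SAMPLE_FRACTION = 0.1
SEED = 42

def sample_each_source(frames, frac=SAMPLE_FRACTION, seed=SEED):
    """Return a random subset of rows from every per-source DataFrame."""
    return [df.sample(frac=frac, random_state=seed) for df in frames]

# Toy stand-ins for two sources holding 10 and 20 sentence pairs.
toy_sources = [
    pd.DataFrame({"src": [f"en {i}" for i in range(10)],
                  "tgt": [f"bn {i}" for i in range(10)]}),
    pd.DataFrame({"src": [f"en {i}" for i in range(20)],
                  "tgt": [f"bn {i}" for i in range(20)]}),
]

sampled = sample_each_source(toy_sources)
print([len(df) for df in sampled])
```

Sampling per source (rather than once over the merged data) keeps each source's share of the subset proportional to its original size.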

In [7]:
print("New Data created by Samanantar:")
!ls ./Data/source_wise_splits/created
print("\nExisting Data Sources before Samanantar:")
!ls ./Data/source_wise_splits/existing

New Data created by Samanantar:
anuvaad_dw		asianetnews	  ie_news	ocr
anuvaad-general_corpus	coursera	  ie_sports	oneindia
anuvaad_mykhel		dwnews		  ie_tech	pmi
anuvaad_ocr		ie_business	  indiccorp	sentinel
anuvaad_oneindia	ie_education	  khan_academy	wikipedia
anuvaad_pib		ie_entertainment  Kurzgesagt
anuvaad_pib_archives	ie_general	  mykhel
anuvaad_prothomalo	ie_lifestyle	  nptel

Existing Data Sources before Samanantar:
alt	     cvit-pib	   GNOME  Mozilla-I10n	 Tanzil   tico19-terminologies
banglanmt    ELRC_2922	   JW300  OpenSubtitles  Tatoeba  Ubuntu
bible-uedin  GlobalVoices  KDE4   sipc		 TED2020  wikimatrix_opus


In [10]:
# Load every pre-existing (pre-Samanantar) source into its own DataFrame.
existing_dir = "./Data/source_wise_splits/existing"
data_by_source = []
for source_name in os.listdir(existing_dir):
    tsv_path = os.path.join(existing_dir, source_name, "en-bn", "bn_sents.tsv")
    data_by_source.append(pd.read_csv(tsv_path, sep="\t"))

In [11]:
# Append the sources newly created by Samanantar to the same list.
created_dir = "./Data/source_wise_splits/created"
for source_name in os.listdir(created_dir):
    tsv_path = os.path.join(created_dir, source_name, "en-bn", "bn_sents.tsv")
    data_by_source.append(pd.read_csv(tsv_path, sep="\t"))
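The two loading loops above follow the same pattern, so they could be folded into one helper. The sketch below is only one possible way to do it: the function name `load_sources`, the extra `source` column, and the choice to skip directories that lack the file are assumptions for illustration. The directory layout `<root>/<source>/en-bn/bn_sents.tsv` is taken from the cells above, and the toy file contents are made up.

```python
import os
import tempfile

import pandas as pd

def load_sources(root, pair="en-bn", file_name="bn_sents.tsv"):
    """Load <root>/<source>/<pair>/<file_name> for every source under root."""
    frames = []
    for source_name in sorted(os.listdir(root)):
        tsv_path = os.path.join(root, source_name, pair, file_name)
        if not os.path.isfile(tsv_path):
            continue  # assumption: skip sources that lack this language pair
        frame = pd.read_csv(tsv_path, sep="\t")
        frame["source"] = source_name  # hypothetical column, not in the corpus
        frames.append(frame)
    return frames

# Demo on a throwaway directory tree that mimics the notebook's layout.
with tempfile.TemporaryDirectory() as root:
    for name in ("alt", "Tatoeba"):
        os.makedirs(os.path.join(root, name, "en-bn"))
        toy = pd.DataFrame({"idx": [0], "src": ["Hello."], "tgt": ["হ্যালো।"]})
        toy.to_csv(os.path.join(root, name, "en-bn", "bn_sents.tsv"),
                   sep="\t", index=False)
    os.makedirs(os.path.join(root, "empty_source"))  # has no en-bn data
    loaded = load_sources(root)

print([frame["source"].iloc[0] for frame in loaded])
```

With such a helper, the list for the real data would be built as `load_sources(".../existing") + load_sources(".../created")`, and the `source` column makes it possible to trace a sentence pair back to its origin after merging.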

In [None]:
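# Drop the redundant 'idx' column; the positional axis argument (drop('idx', 1))
# is no longer accepted by recent pandas versions, so name the columns explicitly.
data_by_source = [dat.drop(columns='idx') for dat in data_by_source]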

Let's have a look at one of the loaded files. There are two columns, `src` and `tgt`, standing for source and target. Each English source sentence is paired with its Bengali translation.

In [22]:
data_by_source[0].head()

Unnamed: 0,src,tgt
0,"Okay, I'll be right there.","ওকে, এখনই আসছি"
1,Give one's lessons.,পড়া দেওয়া
2,"Much to the Witnesses' surprise, even the pros...","সাক্ষীরা অত্যন্ত বিস্মিত হন, এমনকি সরকারি উকিল..."
3,I am at your service.,আমি তোমাকে সাহায্য করার জন্যই
4,"Via Facebook Messenger, Global Voices talked w...",গ্লোবাল ভয়েসস বাংলা'র পক্ষ থেকে আমরা যোগাযোগ ক...


Ideally, we want our training data to be spread across all the sources. Hence, we will merge all the sources together and shuffle them to form our training data. We will hold out part of this data (on which the model won't be trained) to form the training-dev set. The development set and the test set will be presented later.

In [30]:
data = pd.concat(data_by_source, ignore_index=True)
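The shuffle and hold-out step described above could look roughly like the following. This is a minimal sketch on toy data: the hold-out fraction `TRAIN_DEV_FRACTION`, the seed, and the variable names `train` and `train_dev` are assumptions, not values taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the merged DataFrame built above.
data = pd.DataFrame({"src": [f"en {i}" for i in range(100)],
                     "tgt": [f"bn {i}" for i in range(100)]})

TRAIN_DEV_FRACTION = 0.05  # hypothetical share held out as training-dev
SEED = 42                  # fixed seed so the split is reproducible

# sample(frac=1.0) returns every row in random order, i.e. a shuffle.
shuffled = data.sample(frac=1.0, random_state=SEED).reset_index(drop=True)

# The first slice becomes training-dev; the model trains on the rest.
n_train_dev = int(len(shuffled) * TRAIN_DEV_FRACTION)
train_dev = shuffled.iloc[:n_train_dev]
train = shuffled.iloc[n_train_dev:]

print(len(train), len(train_dev))
```

Shuffling before slicing matters here because the merged DataFrame is ordered source by source, so an unshuffled slice would come from a single source.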

In total, we have 9,251,703 English-Bengali parallel sentence pairs.
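A quick sanity check on such a total is to confirm that the merged row count equals the sum of the per-source counts. The snippet below shows the idea on toy data; the names `toy_sources` and `merged` are placeholders for `data_by_source` and `data`.

```python
import pandas as pd

# Toy stand-ins for the per-source DataFrames.
toy_sources = [
    pd.DataFrame({"src": ["a", "b"], "tgt": ["ক", "খ"]}),
    pd.DataFrame({"src": ["c"], "tgt": ["গ"]}),
]
merged = pd.concat(toy_sources, ignore_index=True)

# Concatenation should neither drop nor duplicate rows.
assert len(merged) == sum(len(df) for df in toy_sources)
# ignore_index=True should yield a fresh, unique 0..n-1 index.
assert merged.index.is_unique

print(f"{len(merged):,} sentence pairs")
```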

In [31]:
data

Unnamed: 0,src,tgt
0,"Okay, I'll be right there.","ওকে, এখনই আসছি"
1,Give one's lessons.,পড়া দেওয়া
2,"Much to the Witnesses' surprise, even the pros...","সাক্ষীরা অত্যন্ত বিস্মিত হন, এমনকি সরকারি উকিল..."
3,I am at your service.,আমি তোমাকে সাহায্য করার জন্যই
4,"Via Facebook Messenger, Global Voices talked w...",গ্লোবাল ভয়েসস বাংলা'র পক্ষ থেকে আমরা যোগাযোগ ক...
...,...,...
9251698,A panel of auditors for auditing listed compan...,নিরীক্ষা কার্যক্রমে শৃঙ্খলা আনয়নে তালিকাভুক্ত...
9251699,"On this consideration, and on certain conditio...",সে কারণে কতিপয় শর্ত সাপেক্ষে এসব যন্ত্রাংশের ...
9251700,"Under these rules, fixed price method for IPOs...",এতে অভিহিত মূল্যে আইপিও’র জন্য ফিক্সড প্রাইস প...
9251701,Steps are being taken to introduce rural ratio...,টিআর ও ভিজিএফ এর পরিবর্তে পল্লী রেশনিং কর্মসূচ...
