### Mounting Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd "/content/drive/MyDrive/IASNLP"

### Importing Necessary Libraries

In [None]:
!pip install sentencepiece

In [4]:
import numpy as np
import pandas as pd

import sentencepiece as spm
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


import matplotlib.pyplot as plt 

from functools import reduce
import os

### Loading Data

We downloaded the entire Samanantar parallel corpus, which is split by the source from which each portion was collected. For training we are concerned only with the English-Bengali parallel corpus, so we removed the data pairing English with other Indian languages. We might include it later in our study to see whether it improves performance.

Because of computational resource constraints, we will take a fraction of the data from each source and pre-process it.

Below are the sources of parallel English-Bengali data newly created by Samanantar from various Indian channels and programmes, followed by the existing parallel corpora from previous workshops and events.

In [9]:
print("New Data created by Samanantar:")
!ls ./Data/source_wise_splits/created
print("\nExisting Data Sources before Samanantar:")
!ls ./Data/source_wise_splits/existing

New Data created by Samanantar:
anuvaad_dw		asianetnews	  ie_news	ocr
anuvaad-general_corpus	coursera	  ie_sports	oneindia
anuvaad_mykhel		dwnews		  ie_tech	pmi
anuvaad_ocr		ie_business	  indiccorp	sentinel
anuvaad_oneindia	ie_education	  khan_academy	wikipedia
anuvaad_pib		ie_entertainment  Kurzgesagt
anuvaad_pib_archives	ie_general	  mykhel
anuvaad_prothomalo	ie_lifestyle	  nptel

Existing Data Sources before Samanantar:
alt	     cvit-pib	   GNOME  Mozilla-I10n	 Tanzil   tico19-terminologies
banglanmt    ELRC_2922	   JW300  OpenSubtitles  Tatoeba  Ubuntu
bible-uedin  GlobalVoices  KDE4   sipc		 TED2020  wikimatrix_opus


In [10]:
file_names = os.listdir("./Data/source_wise_splits/existing")
data_by_source = list()
for file_name in file_names:
    data_by_source.append(pd.read_csv("./Data/source_wise_splits/existing/"+file_name+"/en-bn/bn_sents.tsv", sep="\t"))

In [11]:
file_names = os.listdir("./Data/source_wise_splits/created")
for file_name in file_names:
    data_by_source.append(pd.read_csv("./Data/source_wise_splits/created/"+file_name+"/en-bn/bn_sents.tsv", sep="\t"))

In [12]:
# Drop the redundant 'idx' column (the keyword form avoids the positional-axis FutureWarning)
data_by_source = [dat.drop(columns='idx') for dat in data_by_source]


Let's have a look at one of the loaded files. There are two columns, `src` and `tgt`, standing for source and target; the unnamed leading column in the output is the row index. Each English source sentence is paired with its Bengali translation.

In [13]:
data_by_source[0].head()

Unnamed: 0,src,tgt
0,"Okay, I'll be right there.","ওকে, এখনই আসছি"
1,Give one's lessons.,পড়া দেওয়া
2,"Much to the Witnesses' surprise, even the pros...","সাক্ষীরা অত্যন্ত বিস্মিত হন, এমনকি সরকারি উকিল..."
3,I am at your service.,আমি তোমাকে সাহায্য করার জন্যই
4,"Via Facebook Messenger, Global Voices talked w...",গ্লোবাল ভয়েসস বাংলা'র পক্ষ থেকে আমরা যোগাযোগ ক...


Ideally, we will want our training data to be spread across all the sources. Hence, we will merge all the sources together.

In [14]:
data = pd.concat(data_by_source, ignore_index=True)

In [15]:
data.to_csv('full_data.csv')

In total, we have 9,251,703 English-Bengali parallel sentence pairs.

In [16]:
# Run this cell if you have already run the cells above once and generated the full_data.csv file
data = pd.read_csv("full_data.csv")
data = data[['src', 'tgt']]

In [None]:
data

Unnamed: 0,src,tgt
0,"Okay, I'll be right there.","ওকে, এখনই আসছি"
1,Give one's lessons.,পড়া দেওয়া
2,"Much to the Witnesses' surprise, even the pros...","সাক্ষীরা অত্যন্ত বিস্মিত হন, এমনকি সরকারি উকিল..."
3,I am at your service.,আমি তোমাকে সাহায্য করার জন্যই
4,"Via Facebook Messenger, Global Voices talked w...",গ্লোবাল ভয়েসস বাংলা'র পক্ষ থেকে আমরা যোগাযোগ ক...
...,...,...
9251698,A panel of auditors for auditing listed compan...,নিরীক্ষা কার্যক্রমে শৃঙ্খলা আনয়নে তালিকাভুক্ত...
9251699,"On this consideration, and on certain conditio...",সে কারণে কতিপয় শর্ত সাপেক্ষে এসব যন্ত্রাংশের ...
9251700,"Under these rules, fixed price method for IPOs...",এতে অভিহিত মূল্যে আইপিও’র জন্য ফিক্সড প্রাইস প...
9251701,Steps are being taken to introduce rural ratio...,টিআর ও ভিজিএফ এর পরিবর্তে পল্লী রেশনিং কর্মসূচ...


So far, we have been working with the training data. We now look at the test data.

In [6]:
test_data = pd.read_csv("testdata.csv", header=None)
test_data.columns =['src', 'tgt']

In [7]:
test_data

Unnamed: 0,src,tgt
0,"On the side-lines of this event, I hope the de...","এই অনুষ্ঠানের পাশাপাশি আমি আশা করি, বিদেশ থেকে..."
1,We are proud to be the global host for World E...,বিশ্ব পরিবেশ দিবস ২০১৮’র আয়োজক দেশ হিসাবে আমরা...
2,"We are also committed to ensure, that we do so...",সুস্থায়ী ও প্রকৃতির সঙ্গে সহাবস্থানে এই মানোন্...
3,This has freed rural women from the misery of ...,বিষাক্ত ধোঁয়ার কবল থেকে এই রান্নার গ্যাস সংযোগ...
4,We are engaged in a massive push towards renew...,পুনর্নবীকরণযোগ্য শক্তি উৎপাদনের এক উচ্চাকাঙ্খী...
...,...,...
2385,He will inaugurate the collective e-grihaprave...,প্রধানমন্ত্রী আবাস যোজনার আওতায় নির্মিত ২৫ হাজ...
2386,He will also address a public gathering.,এই উপলক্ষ্যে তিনি এক জনসভাতেও ভাষণ দেবেন।
2387,Prime Minister will then proceed to Odisha.,এরপর প্রধানমন্ত্রী ওড়িশায় যাবেন।
2388,The Agreement will help in the availability of...,এই সহযোগিতা চুক্তি স্বাক্ষরের ফলে কাস্টমস্‌ সং...


### Data Preparation

We now shuffle the data to form our training set (`train_data`), which is 1.2% of the entire dataset. We also hold out a further 0.025% of the dataset, on which the model is not trained, to form the training-development set (`train_dev_data`). The development set and the test set are presented later.

Reason: The entire dataset is massive, and training a model on all of it would require substantial computing resources. Hence, we train on a subset of the dataset.

In [None]:
# train_data, train_dev_data  = train_test_split(data, train_size=0.012, test_size=0.00025, random_state=43)

In [27]:
# train_data.to_csv("train_data.csv")
# train_dev_data.to_csv("train_dev.csv")
# Uncomment and run the above lines if running for first time and there is no train_data.csv and train_dev.csv file in the working directory
train_data = pd.read_csv("train_data.csv")[['src', 'tgt']]
train_dev_data = pd.read_csv("train_dev.csv")[['src', 'tgt']]

In [28]:
train_data

Unnamed: 0,src,tgt
0,But the shoot was a tough one.,তবে শ্যুটটা খুব মুশকিলের ছিল।
1,Road construction started.,রাস্তা নির্মাণ শুরু হয়েছে।
2,Why did he pay so much?,কেন তিনি এত টাকা দিতেন?
3,"""AT ITS worst, this has been Satan's century.","""এই শতাব্দীর প্রচণ্ড ভয়াবহতা এটাকে শয়তানের এক ..."
4,That's our only demand.,সেটাই আমাদের একমাত্র দাবি।
...,...,...
111015,Turmeric powder 1 tea spoon,দারুচিনি-গুঁড়া ১ চা-চামচ।
111016,Photo: Facebook.,ছবি: ফেসবুক থেকে নেয়া।
111017,Strengthening ties with Saudi Arabia,সৌদি আরবের সঙ্গে সম্পর্ক শক্ত করার উদ্যোগ
111018,This is some brief footage Eva took:,এভার তোলা কিছু বিস্তারিত দৃশ্য:


In [29]:
train_dev_data

Unnamed: 0,src,tgt
0,We beg our Protestant and Jewish friends to pu...,কোন কোন ক্ষেত্রে কর্তৃপক্ষ এবং ধর্মীয় নেতারা ...
1,"Sa'd advised Muhammad: ""Don't be hard on him. ...","সা'দ মুহাম্মাদকে বলেন: ""তার প্রতি কঠোর হবেন না..."
2,Photo by 'Save Gaza Project',"ছবি ""সেভ গাজা প্রজেক্টের""।'"
3,"So, therefore, we need to test batteries under...","অতএব, আমআদের কিছুটা মান অবস্থাগুলির অধীনে ব্যা..."
4,This party is also contesting in the elections.,নির্বাচনে এই দলের মধ্যেই প্রতিদ্বন্দ্বিতা হবে।
...,...,...
4621,But so it is.,তবে তা যেমন ঠিক।
4622,But the work was not started.,তবে কাজ শুরু হয়নি।
4623,The sixth volume of the Polyglot contained var...,পলিগ্লটের ষষ্ঠ খণ্ডে বাইবেল অধ্যয়নের জন্য বিভি...
4624,They need answers.,তাঁদের কাছে জবাব চাওয়া হয়েছে।


Since we want to measure and improve our model's performance on the `test_data`, we take 50% of the `test_data` and use it, along with the `train_dev_data`, to compose our validation set. This is done specifically to account for variance and the data-mismatch problem.

In [17]:
# test_data, test_val_data  = train_test_split(test_data, train_size=0.5, random_state=43)

In [23]:
# test_data.to_csv("test_data.csv")
# test_val_data.to_csv("test_val.csv")
# Uncomment and run the above lines if running for the first time and there is no test_data.csv and test_val.csv file in the working directory
test_data = pd.read_csv("test_data.csv")[['src', 'tgt']]
test_val_data = pd.read_csv("test_val.csv")[['src', 'tgt']]

In [24]:
test_data

Unnamed: 0,src,tgt
0,He asked NITI Aayog to follow up with the Stat...,রাজ্যগুলির দেওয়া সুপারিশের প্রেক্ষিতে তিন মাসে...
1,Who is not a fan of Harmanpreet Kaur?,হরমনপ্রীতকাউরের গুণগ্রাহী নয় এমন মানুষ খুঁজে প...
2,"In particular, he called for innovative approa...",দক্ষতা বিকাশ ও পর্যটনের মতো ক্ষেত্রগুলিতে উদ্ভ...
3,"Over the last twenty years, more than eight hu...",গত ২০ বছরে এশিয়া-প্রশান্ত মহাসাগর অঞ্চলে বিপর্...
4,PM inaugurated the first phase of River Front ...,প্রধানমন্ত্রী পাটনায় নদী বিকাশ প্রকল্পের প্রথম...
...,...,...
1190,Starting a new business in India is now easier...,ভারতে নতুনব্যবসা-বাণিজ্যের সূচনা এখন আগের থেকে...
1191,The delegation interacted with the Prime Minis...,মহিলাদের শিল্পোদ্যোগ প্রচেষ্টা এবংনারী ক্ষমতায়...
1192,He said the ITI would help empower the youth o...,শিল্প প্রশিক্ষণ প্রতিষ্ঠান চালু হলে এই দ্বীপের...
1193,He remarked that he has himself visited the No...,"শ্রী মোদী বলেন, গত চার বছরে তিনি ২৫বারেরও বেশি..."


In [25]:
test_val_data

Unnamed: 0,src,tgt
0,"Through her work, she spread the message of th...",নিজের কর্মেরউদাহরণ স্থাপন করে তিনি মানুষকে সেব...
1,Interacting with beneficiaries and store owner...,সারা দেশের ৫ হাজারেরও বেশি স্থান থেকে দোকান মা...
2,These include:,এই প্রকল্পগুলির মধ্যে রয়েছে-
3,It is no surprise that today Japan is India’s ...,জাপান যে বর্তমানে ভারতের চতুর্থ বৃহত্তম প্রত্য...
4,Record growth in last two and a half years,গত আড়াই বছরেরেকর্ড পরিমাণ অগ্রগতি
...,...,...
1190,"The MoU was signed in April, 2017.",বৈঠকে নেতৃত্ব দেন প্রধানমন্ত্রী স্বয়ং।
1191,Top officials of the Ministry of Health and Fa...,পর্যালোচনা বৈঠকে এছাড়াও উপস্থিত ছিলেন স্বাস্থ্...
1192,Sports are considered a waste of time in our s...,"সমাজে এখনও মনে করা হয়, খেলাধুলো করা মানে সময় ন..."
1193,The MoU covers the following areas of cooperat...,"এই মউ-এ সহযোগিতার যেসব ক্ষেত্র রয়েছে, তার মধ্য..."


Now, one thing we need to decide on is the maximum number of words per sentence that we should keep. Below, we look at the sentences in our training data to see what sentence lengths we encounter.

In [None]:
def sent_by_word_count(data, word_counts):
    sentences_filtered_word_num = {}
    for sent in data:
        sent_num_words = len(sent.split())
        for num_words in word_counts:
            if sent_num_words <= num_words:
                sentences_filtered_word_num[num_words] = sentences_filtered_word_num.get(num_words, 0) + 1
    return sentences_filtered_word_num

In [None]:
sentences_filtered_word_num = sent_by_word_count(train_data['src'], list(range(10, 151, 10)))
for num_words, num_sentences in sentences_filtered_word_num.items():
    print(f"The number of english sentences with <= {num_words} words are: {num_sentences}")

The number of english sentences with <= 10 words are: 69625
The number of english sentences with <= 20 words are: 94747
The number of english sentences with <= 30 words are: 105236
The number of english sentences with <= 40 words are: 109011
The number of english sentences with <= 50 words are: 110191
The number of english sentences with <= 60 words are: 110569
The number of english sentences with <= 70 words are: 110761
The number of english sentences with <= 80 words are: 110842
The number of english sentences with <= 90 words are: 110887
The number of english sentences with <= 100 words are: 110925
The number of english sentences with <= 110 words are: 110955
The number of english sentences with <= 120 words are: 110975
The number of english sentences with <= 130 words are: 110989
The number of english sentences with <= 140 words are: 110995
The number of english sentences with <= 150 words are: 110999


In [None]:
sentences_filtered_word_num = sent_by_word_count(train_data['tgt'], list(range(10, 151, 10)))
for num_words, num_sentences in sentences_filtered_word_num.items():
    print(f"The number of bengali sentences with <= {num_words} words are: {num_sentences}")

The number of bengali sentences with <= 10 words are: 76066
The number of bengali sentences with <= 20 words are: 99593
The number of bengali sentences with <= 30 words are: 107265
The number of bengali sentences with <= 40 words are: 109611
The number of bengali sentences with <= 50 words are: 110344
The number of bengali sentences with <= 60 words are: 110647
The number of bengali sentences with <= 70 words are: 110790
The number of bengali sentences with <= 80 words are: 110863
The number of bengali sentences with <= 90 words are: 110920
The number of bengali sentences with <= 100 words are: 110957
The number of bengali sentences with <= 110 words are: 110970
The number of bengali sentences with <= 120 words are: 110982
The number of bengali sentences with <= 130 words are: 110991
The number of bengali sentences with <= 140 words are: 110998
The number of bengali sentences with <= 150 words are: 111004


Looking at the above figures, we see that very few sentences are longer than 60 words: in each language, only a few hundred of the roughly 111,000 training sentences exceed that length.
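To make that observation concrete, the short sketch below recomputes coverage from a few of the English counts printed above. It uses the `<= 150` bucket as the reference total, which is an assumption made for illustration: the few sentences longer than 150 words do not appear in the printed figures.

```python
# Cumulative counts copied from the printed output above (English source side).
english_counts = {10: 69625, 20: 94747, 30: 105236, 40: 109011, 50: 110191,
                  60: 110569, 150: 110999}

# Assumption: use the largest printed bucket (<= 150 words) as the reference total.
reference_total = english_counts[150]

for limit in (20, 40, 60):
    share = english_counts[limit] / reference_total
    print(f"<= {limit} words: {share:.2%} of sentences with <= 150 words")

# Sentences that fall between 61 and 150 words.
longer_than_60 = reference_total - english_counts[60]
print("Sentences with 61 to 150 words:", longer_than_60)  # 430
```

The same arithmetic can be repeated with the Bengali counts.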

Now, we look at the number of unique words in both the source and target sentences.

In [None]:
vocab_en = set([word for sent in train_data['src'].values for word in sent.split()])

In [None]:
vocab_ben = set([word for sent in train_data['tgt'] for word in sent.split()])

In [None]:
print("Number of unique words in english train data:", len(vocab_en))
print("Number of unique words in bengali train data:", len(vocab_ben))

Number of unique words in english train data: 116349
Number of unique words in bengali train data: 148459


### Tokenization and Normalization

We will create large text files for both English and Bengali from the training data (skipping any row whose index also appears in `train_dev_data`) to train our byte-pair encoding (BPE) tokenizer, which builds its vocabulary from the most frequently occurring character sequences.

In [None]:
# with open('eng.txt', 'w') as f:
#     f.write('\n'.join(train_data['src'].iloc[[idx for idx in train_data.index if idx not in train_dev_data.index]]))

In [None]:
# with open('ben.txt', 'w') as f:
#     f.write('\n'.join(train_data['tgt'].iloc[[idx for idx in train_data.index if idx not in train_dev_data.index]]))

We train a byte-pair encoding (BPE) model with a vocabulary size of 16,000 for each of English and Bengali.
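As a rough intuition for what BPE training does, the toy sketch below repeatedly merges the most frequent adjacent pair of symbols. It is not SentencePiece's implementation; the corpus and the helper names `most_frequent_pair` and `merge_pair` are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency, each word split into characters.
toy_words = {tuple("low"): 5, tuple("lower"): 2,
             tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(toy_words)
    if pair is None:
        break
    merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print("Learned merges:", merges)
print("Segmented words:", list(toy_words))
```

In this toy run the frequent ending `est` becomes a single symbol after two merges. A real model keeps merging until it reaches the requested vocabulary size.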

In [None]:
# spm.SentencePieceTrainer.train('--input=eng.txt --model_prefix=eng_bpe --vocab_size=16000 --model_type=bpe --normalization_rule_name=nfkc_cf --bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0')

In [None]:
# spm.SentencePieceTrainer.train('--input=ben.txt --model_prefix=ben_bpe --vocab_size=16000 --model_type=bpe --normalization_rule_name=nfkc_cf --bos_id=-1 --eos_id=1 --unk_id=2 --pad_id=0')

The following two examples show how the trained BPE models split sentences into subword pieces and encode them as ids, which will then be padded to give the format in which we feed data to our model.

In [30]:
# Uncomment and run the training cells above first if eng_bpe.model and ben_bpe.model have not been generated yet
sp_en_bpe = spm.SentencePieceProcessor()
sp_en_bpe.load('eng_bpe.model')

print("*** ENGLISH SENTENCE ***")
print(train_data['src'].iloc[0])
print('*** ENGLISH BPE ***')
print(sp_en_bpe.encode_as_pieces(train_data['src'].iloc[0]))
print(sp_en_bpe.encode_as_ids(train_data['src'].iloc[0]))

*** ENGLISH SENTENCE ***
But the shoot was a tough one.
*** ENGLISH BPE ***
['▁but', '▁the', '▁shoot', '▁was', '▁a', '▁tough', '▁one', '.']
[148, 7, 2619, 86, 5, 5192, 230, 15972]


In [31]:
sp_ben_bpe = spm.SentencePieceProcessor()
sp_ben_bpe.load('ben_bpe.model')

print("*** BENGALI SENTENCE ***")
print(train_data['tgt'].iloc[0])
print('*** BENGALI BPE ***')
print(sp_ben_bpe.encode_as_pieces(train_data['tgt'].iloc[0]))
print(sp_ben_bpe.encode_as_ids(train_data['tgt'].iloc[0]))

*** BENGALI SENTENCE ***
তবে শ্যুটটা খুব মুশকিলের ছিল।
*** BENGALI BPE ***
['▁তবে', '▁শ্যু', 'ট', 'টা', '▁খুব', '▁মু', 'শক', 'িলের', '▁ছিল', '।']
[282, 8825, 15902, 92, 488, 224, 5925, 8969, 155, 15900]


For easy encoding and decoding of sentences, we define two helper functions for tokenization and detokenization.

In [32]:
def tokenize(sentence, sp_model):
    # We add the EOS token at the end of each encoded sentence
    inputs = sp_model.encode_as_ids(sentence) + [sp_model.eos_id()]
    return np.reshape(np.array(inputs), [1, -1])

In [33]:
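def detokenize(tokenized, sp_model):
    integers = np.squeeze(tokenized).tolist()
    # Cut at the first EOS; sequences truncated to maxlen may not contain one
    if sp_model.eos_id() in integers:
        integers = integers[:integers.index(sp_model.eos_id())]
    return sp_model.DecodeIdsWithCheck(integers)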
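
The round trip these two helpers perform can be illustrated without a trained model. The sketch below is a stand-in only: it swaps the BPE model for a hypothetical word-level vocabulary (`toy_vocab`) and reuses the special ids from the trainer flags (`pad_id=0`, `eos_id=1`, `unk_id=2`).

```python
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2  # same special ids as in the trainer flags

# Hypothetical word-level vocabulary standing in for the BPE model.
toy_vocab = {"but": 3, "the": 4, "shoot": 5, "was": 6, "tough": 7}
id_to_word = {idx: word for word, idx in toy_vocab.items()}

def toy_tokenize(sentence):
    """Map words to ids and append EOS, as `tokenize` does with BPE ids."""
    ids = [toy_vocab.get(word, UNK_ID) for word in sentence.lower().split()]
    return ids + [EOS_ID]

def toy_detokenize(ids):
    """Keep everything before the first EOS, then map ids back to words."""
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return " ".join(id_to_word.get(idx, "<unk>") for idx in ids)

encoded = toy_tokenize("But the shoot was tough")
padded = encoded + [PAD_ID] * (10 - len(encoded))  # post-padding to length 10

print(padded)                  # [3, 4, 5, 6, 7, 1, 0, 0, 0, 0]
print(toy_detokenize(padded))  # but the shoot was tough
```

Because decoding stops at the first EOS id, the trailing pad ids never reach the decoder.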

### Build Data Generator

Now we come to the final phase of data pre-processing: building the data generator, which feeds data to the model continually through the epochs.\
The data generator yields tuples of English and Bengali batches, where every sentence is padded and encoded as a list of integers corresponding to indices in the respective vocabulary (generated by BPE).
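
The padding mentioned above can be sketched in plain Python. The helper `pad_or_truncate_post` below is hypothetical. It mimics, for a single sequence, what `pad_sequences(..., padding='post', truncating='post')` is understood to do: cut extra ids from the end, or append zeros until the target length is reached.

```python
def pad_or_truncate_post(sequence, maxlen, pad_value=0):
    """Cut extra ids from the end, or append pad ids until the length is maxlen."""
    clipped = list(sequence[:maxlen])
    return clipped + [pad_value] * (maxlen - len(clipped))

short_ids = [148, 7, 2619, 1]          # ends in EOS (id 1)
long_ids = [5, 6, 7, 8, 9, 10, 11, 1]  # longer than maxlen, EOS at the very end

print(pad_or_truncate_post(short_ids, maxlen=6))  # [148, 7, 2619, 1, 0, 0]
print(pad_or_truncate_post(long_ids, maxlen=6))   # [5, 6, 7, 8, 9, 10]
```

Note that the truncated sequence no longer ends in the EOS id, which is worth keeping in mind when decoding long sentences.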

In [34]:
def data_generator(batch_size, src, tgt, maxlen=60, shuffle=False, verbose=False):
    """Yield (source, target) batches of padded id sequences indefinitely."""
    num_lines = len(src)
    lines_index = list(range(num_lines))

    if shuffle:
        np.random.shuffle(lines_index)

    index = 0
    while True:
        buffer_src = []
        buffer_tgt = []

        for _ in range(batch_size):
            # Start a new pass (optionally reshuffled) once all lines are used
            if index >= num_lines:
                index = 0
                if shuffle:
                    np.random.shuffle(lines_index)

            buffer_src.append(src[lines_index[index]])
            buffer_tgt.append(tgt[lines_index[index]])
            index += 1

        # Pad with 0 (the pad id) or truncate at the end to exactly `maxlen` ids
        batch_src = pad_sequences(buffer_src, maxlen=maxlen, padding='post', truncating='post')
        batch_tgt = pad_sequences(buffer_tgt, maxlen=maxlen, padding='post', truncating='post')

        if verbose:
            print("index=", index)
        yield batch_src, batch_tgt
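
The wrap-around behaviour of the generator can be seen in isolation with a stripped-down version. The sketch below is only an illustration: `toy_batches` is a hypothetical helper that keeps the index logic and drops shuffling and padding.

```python
def toy_batches(items, batch_size):
    """Yield batches forever, restarting from the first item at the end."""
    index = 0
    while True:
        batch = []
        for _ in range(batch_size):
            if index >= len(items):
                index = 0  # wrap around: start the next pass over the data
            batch.append(items[index])
            index += 1
        yield batch

gen = toy_batches(["a", "b", "c"], batch_size=2)
first_three = [next(gen) for _ in range(3)]
print(first_three)  # [['a', 'b'], ['c', 'a'], ['b', 'c']]
```

When the dataset size is not a multiple of the batch size, a single batch can span the end of one pass and the start of the next, as the second batch does here.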

Let's see how our `data_generator` works. First, we must convert the `src` and `tgt` sentences into lists of integers encoded according to the BPE vocabularies.

In [41]:
src_train_data_enc = [np.squeeze(tokenize(train_data['src'].iloc[i], sp_en_bpe)) for i in range(train_data.shape[0])]
tgt_train_data_enc = [np.squeeze(tokenize(train_data['tgt'].iloc[i], sp_ben_bpe)) for i in range(train_data.shape[0])]

In [43]:
train_data_gen = data_generator(32, src_train_data_enc, tgt_train_data_enc, verbose=True)

In [44]:
src_batch_1, tgt_batch_1 = next(train_data_gen)
src_batch_2, tgt_batch_2 = next(train_data_gen)

index= 32
index= 64


In [45]:
print("*** Batch-1 ***")
print("encoded source shape: ", src_batch_1.shape)
print("encoded target shape: ", tgt_batch_1.shape)
print("encoded source example: ", src_batch_1[0])
print("encoded target example: ", tgt_batch_1[0])
print("decoded source example: ", detokenize(src_batch_1[0], sp_en_bpe))
print("decoded target example: ", detokenize(tgt_batch_1[0], sp_ben_bpe))

print("\n*** Batch-2 ***")
print("encoded source shape: ", src_batch_2.shape)
print("encoded target shape: ", tgt_batch_2.shape)
print("encoded source example: ", src_batch_2[0])
print("encoded target example: ", tgt_batch_2[0])
print("decoded source example: ", detokenize(src_batch_2[0], sp_en_bpe))
print("decoded target example: ", detokenize(tgt_batch_2[0], sp_ben_bpe))

*** Batch-1 ***
encoded source shape:  (32, 60)
encoded target shape:  (32, 60)
encoded source example:  [  148     7  2619    86     5  5192   230 15972     1     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]
encoded target example:  [  282  8825 15902    92   488   224  5925  8969   155 15900     1     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]
decoded source example:  but the shoot was a tough one.
decoded target example:  তবে শ্যুটটা খুব মুশকিলের ছিল।

*** Batch-2 ***
encoded so

Similarly, we build generators for `train_dev_data` and `test_val_data` to produce data for evaluating our model.

In [46]:
src_train_dev_data_enc = [np.squeeze(tokenize(train_dev_data['src'].iloc[i], sp_en_bpe)) for i in range(train_dev_data.shape[0])]
tgt_train_dev_data_enc = [np.squeeze(tokenize(train_dev_data['tgt'].iloc[i], sp_ben_bpe)) for i in range(train_dev_data.shape[0])]

In [47]:
train_dev_data_gen = data_generator(8, src_train_dev_data_enc, tgt_train_dev_data_enc, verbose=True)

In [48]:
src_batch_1, tgt_batch_1 = next(train_dev_data_gen)
src_batch_2, tgt_batch_2 = next(train_dev_data_gen)

print("*** Batch-1 ***")
print("encoded source shape: ", src_batch_1.shape)
print("encoded target shape: ", tgt_batch_1.shape)
print("encoded source example: ", src_batch_1[0])
print("encoded target example: ", tgt_batch_1[0])
print("decoded source example: ", detokenize(src_batch_1[0], sp_en_bpe))
print("decoded target example: ", detokenize(tgt_batch_1[0], sp_ben_bpe))

print("\n*** Batch-2 ***")
print("encoded source shape: ", src_batch_2.shape)
print("encoded target shape: ", tgt_batch_2.shape)
print("encoded source example: ", src_batch_2[0])
print("encoded target example: ", tgt_batch_2[0])
print("decoded source example: ", detokenize(src_batch_2[0], sp_en_bpe))
print("decoded target example: ", detokenize(tgt_batch_2[0], sp_ben_bpe))

index= 8
index= 16
*** Batch-1 ***
encoded source shape:  (8, 60)
encoded target shape:  (8, 60)
encoded source example:  [   74  1103   330  1140   167    39  5870  1987    37  1215  1159   511
 11137   346 15972     1     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]
encoded target example:  [  156   156   745  2029    75  2772  5018    94 15370  8568  4330   444
   935   644 15900     1     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]
decoded source example:  we beg our protestant and jewish friends to put away such suspicions.
decoded target example:  ক

In [49]:
src_test_val_data_enc = [np.squeeze(tokenize(test_val_data['src'].iloc[i], sp_en_bpe)) for i in range(test_val_data.shape[0])]
tgt_test_val_data_enc = [np.squeeze(tokenize(test_val_data['tgt'].iloc[i], sp_ben_bpe)) for i in range(test_val_data.shape[0])]

In [50]:
test_val_data_gen = data_generator(8, src_test_val_data_enc, tgt_test_val_data_enc, verbose=True)

In [51]:
src_batch_1, tgt_batch_1 = next(test_val_data_gen)
src_batch_2, tgt_batch_2 = next(test_val_data_gen)

print("*** Batch-1 ***")
print("encoded source shape: ", src_batch_1.shape)
print("encoded target shape: ", tgt_batch_1.shape)
print("encoded source example: ", src_batch_1[0])
print("encoded target example: ", tgt_batch_1[0])
print("decoded source example: ", detokenize(src_batch_1[0], sp_en_bpe))
print("decoded target example: ", detokenize(tgt_batch_1[0], sp_ben_bpe))

print("\n*** Batch-2 ***")
print("encoded source shape: ", src_batch_2.shape)
print("encoded target shape: ", tgt_batch_2.shape)
print("encoded source example: ", src_batch_2[0])
print("encoded target example: ", tgt_batch_2[0])
print("decoded source example: ", detokenize(src_batch_2[0], sp_en_bpe))
print("decoded target example: ", detokenize(tgt_batch_2[0], sp_ben_bpe))

index= 8
index= 16
*** Batch-1 ***
encoded source shape:  (8, 60)
encoded target shape:  (8, 60)
encoded source example:  [  707   321   320 15975   268  2026     7  2547    34     7  4939    34
 12689    39  1296    37  3976 15972     1     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]
encoded target example:  [  937 10901  7096   287   300  2495    55   119  2773  2005    54  5162
   373  2101  8338   202 15900     1     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]
decoded source example:  through her work, she spread the message of the importance of cleanliness and service to mankind

We will now use the above data generator to train and evaluate our models.