
# **Library and Dataset Import**

This example uses TensorFlow 2.x for NLP modelling.

The indic-nlp-library is used for tokenization.

The dataset used is for English-Hindi translation; however, the notebook can easily be adapted to the 10 other major Indian languages available [here](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/) or to any other language pair (adapt as per the TensorFlow tutorial references provided below).

In [None]:
import tensorflow as tf
from sklearn.model_selection import train_test_split
import unicodedata
import re
import numpy as np
import os
import io
import time
!pip3 install indic-nlp-library



In [None]:
!pip install tensorflow-gpu==2.10.0



In [None]:
tf.version  # note: this displays the version module; tf.__version__ gives the version string

<module 'tensorflow._api.v2.version' from '/usr/local/lib/python3.10/dist-packages/tensorflow/_api/v2/version/__init__.py'>

In [None]:
!wget "http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/indic_languages_corpus.tar.gz"

--2024-08-20 05:37:20--  http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/indic_languages_corpus.tar.gz
Resolving lotus.kuee.kyoto-u.ac.jp (lotus.kuee.kyoto-u.ac.jp)... 130.54.208.131
Connecting to lotus.kuee.kyoto-u.ac.jp (lotus.kuee.kyoto-u.ac.jp)|130.54.208.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 132762852 (127M) [application/x-gzip]
Saving to: ‘indic_languages_corpus.tar.gz’


2024-08-20 05:38:04 (2.94 MB/s) - ‘indic_languages_corpus.tar.gz’ saved [132762852/132762852]



In [None]:
import tarfile
with tarfile.open('indic_languages_corpus.tar.gz', 'r:gz') as tar:
    tar.extractall()
print("done!")

done!


In [None]:
#We copy the Hindi to English files for working in this example (dev.en, dev.hi, test.en, test.hi and train.en, train.hi)
%cp indic_languages_corpus/bilingual/hi-en/* .
#Clean up to avoid storing these files in the session
%rm -r indic_languages_corpus indic_languages_corpus.tar.gz

In [None]:
# inspect the training data: number of lines and the first five sentences of each file
with open('train.hi', encoding="utf8") as f:
    w1 = f.readlines()
print(len(w1))
print(w1[0:5])
with open('train.en', encoding="utf8") as g:
    w2 = g.readlines()
print(len(w2))
print(w2[0:5])

84557
['और उनके Sigil क्या है?\n', 'मैं मरना नहीं चाहता.\n', 'यह मुझे लगता है कि एक ही देश है.\n', 'फिर ये नन्हें बच्चों की तरह रोएँगे।\n', 'नहीं, मुझे पावर की जरुरत है !\n']
84557
['And what is their Sigil?\n', 'I do not want to die.\n', "It's the same country I think.\n", "Then they'll be crying like babies.\n", '- No, I need power up!\n']


# **Data Preparation**

Once we have loaded the dataset, we preprocess the data as follows:

Add a start and end token to each sentence.

Clean the sentences by removing special characters.

Create a word index and reverse word index (dictionaries mapping from word → id and id → word).

Pad each sentence to a maximum length.
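
The four steps above can be sketched in plain Python, without TensorFlow. This is only a minimal illustration: the helper names (`add_tokens`, `build_index`, `encode_ids`, `pad_left`) are hypothetical and not part of this notebook, which relies on the Keras `Tokenizer` and `pad_sequences` for the real work.

```python
import re

def add_tokens(sentence):
    # Lowercase, replace anything except letters and basic punctuation with a space,
    # then wrap the sentence in start and end tokens.
    cleaned = re.sub(r"[^a-z?.!,]+", " ", sentence.lower()).strip()
    return "<sos> " + cleaned + " <eos>"

def build_index(sentences):
    # Assign ids from 1 upwards in order of first appearance; 0 is reserved for padding.
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            word_index.setdefault(word, len(word_index) + 1)
    index_word = {i: w for w, i in word_index.items()}
    return word_index, index_word

def encode_ids(sentence, word_index):
    return [word_index[w] for w in sentence.split()]

def pad_left(ids, max_len):
    # Pad with zeroes at the front so that every sequence has the same length.
    return [0] * (max_len - len(ids)) + ids

sentences = [add_tokens(s) for s in ["I do not want to die", "Her face"]]
word_index, index_word = build_index(sentences)
encoded = [encode_ids(s, word_index) for s in sentences]
max_len = max(len(e) for e in encoded)
padded = [pad_left(e, max_len) for e in encoded]
print(padded[1])  # [0, 0, 0, 0, 1, 9, 10, 8]
```

The reverse index (`index_word`) is what later turns predicted ids back into words.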

In [None]:
# Restrict the total number of sentences to 70000
NUM_SENTENCES = 70000

In [None]:
# strip the input and output of extra unnecessary characters
# store all the cleaned input and output sentences in input_sentences[] and output_sentences[]
# tokenize the Hindi (target) sentences using the indicNLP library and add <sos> (start-of-sentence) and <eos> (end-of-sentence)
from indicnlp.tokenize import indic_tokenize

input_sentences = []
output_sentences = []

count = 0
for line in open(r'train.en', encoding="utf-8"):
    count += 1

    if count > NUM_SENTENCES:
        break

    input_sentence = line.rstrip().strip("\n").strip('-') # strip trailing whitespace (including '\n') and leading/trailing '-'
    input_sentences.append(input_sentence) # store all input sentences in the input_sentences list

count = 0

for line in open(r'train.hi', encoding="utf-8"): # read as UTF-8 explicitly, like train.en
    count += 1

    if count > NUM_SENTENCES:
        break
    output_sentence = line.rstrip().strip("\n").strip('-')
    tokens = indic_tokenize.trivial_tokenize(output_sentence) # tokenize the Hindi sentence

    output_sentences.append(['<sos>'] + tokens + ['<eos>']) # add the start and end tags to the tokenized sentence
                                                            # each tokenized sentence is stored as a list in output_sentences
print(type(input_sentences[9]))
print(type(output_sentences[9]))

<class 'str'>
<class 'list'>


In [None]:
output_sentences[9]

['<sos>',
 'तुम्हें',
 'कम',
 'से',
 'कम',
 'मुझे',
 'तो',
 'बताना',
 'चाहिए',
 'था',
 ',',
 'ना',
 '?',
 '<eos>']

In [None]:
print("num samples input:", len(input_sentences))
print("num samples output:", len(output_sentences))

num samples input: 70000
num samples output: 70000


In [None]:
print(input_sentences[-1])
print(output_sentences[-1])

Her face.
['<sos>', 'उसका', 'चेहरा', '.', '<eos>']


In [None]:
# Converts Unicode text to plain ASCII letters by stripping accents.
# Since the model deals with multilingual text, it is important to standardize the input text.
# NFD normalization splits each accented character into a base character plus combining marks;
# the combining marks (category 'Mn') are then dropped.
# https://bit.ly/2TnLffX
def unicode_to_ascii(s):
  return ''.join(c for c in unicodedata.normalize('NFD', s)
      if unicodedata.category(c) != 'Mn')


def preprocess_sentence(w):
  w = unicode_to_ascii(w.lower().strip())

  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."

  w = re.sub(r"([?.!,¿])", r" \1 ", w)
  w = re.sub(r'[" "]+', " ", w)

  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)

  w = w.strip()

  # adding a start and an end token to the sentence
  # so that the model knows when to start and stop predicting.
  w = '<sos> ' + w + ' <eos>'
  return w
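
To see what the normalization step in `unicode_to_ascii` does in isolation, here is a small standalone sketch using only the standard library. The function name `strip_accents` is hypothetical; the logic mirrors the notebook's function.

```python
import unicodedata

def strip_accents(text):
    # NFD decomposes each accented character into a base character followed by
    # combining marks (Unicode category 'Mn'); the marks are then dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(strip_accents("café naïve"))  # cafe naive
```

Some Devanagari vowel signs (for example U+0941) also belong to category `Mn`, so this function would remove them from Hindi text. That is one reason the notebook applies `preprocess_sentence` only to the English input sentences.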

In [None]:
for i in range(len(input_sentences)):
   input_sentences[i] = preprocess_sentence(input_sentences[i])

print(input_sentences[8])
print(output_sentences[8])

<sos> i told her we rest on sundays . <eos>
['<sos>', 'मैं', 'रविवार', 'को', 'उसे', 'हम', 'बाकी', 'बताया', '.', '<eos>']


In [None]:
# function to tokenize, convert the words into integer sequences and pad them with zeroes up to the length of the longest sentence
# takes as input the list of sentences (strings or token lists) of one language
# pad_sequences is called with its defaults, so the zeroes are added at the front ('pre' padding)

# the returned lang_tokenizer is a Tokenizer object that has been fitted with fit_on_texts;
# fit_on_texts of the Tokenizer class updates the internal vocabulary based on a list of texts.
# This method creates the vocabulary index based on word frequency.
# A lower integer means a more frequent word (often the first few are stop words because they appear a lot).

def sample_function(lang):
  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')

  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)

  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor)
  return tensor, lang_tokenizer
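
The two ideas behind `sample_function`, frequency-ranked word ids and front padding, can be mimicked in a few lines of plain Python. This is a rough sketch rather than the Keras implementation; `fit_word_index` and `pad_pre` are hypothetical names, and the tie-breaking rule (first-seen order) is an assumption of this sketch.

```python
from collections import Counter

def fit_word_index(token_lists):
    # Rank words by descending frequency; ties keep first-seen order.
    # Id 0 is left unused so that it can serve as the padding value.
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def pad_pre(seq, max_len):
    # 'pre' padding: zeroes go in front of the sequence.
    return [0] * (max_len - len(seq)) + seq

corpus = [["<sos>", "hello", "<eos>"],
          ["<sos>", "hello", "there", "friend", "<eos>"]]
word_index = fit_word_index(corpus)
sequences = [[word_index[t] for t in tokens] for tokens in corpus]
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]
print(padded)  # [[0, 0, 1, 2, 3], [1, 2, 4, 5, 3]]
```

Front padding is why the tensors printed further down start with runs of zeroes and end with the actual token ids.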

In [None]:
# function to call the tokenize function to perform tokenizing and padding

def load_dataset(inp_lang, targ_lang):
  # creating cleaned input, output pairs
  input_tensor, inp_lang_tokenizer = sample_function(inp_lang)
  target_tensor, targ_lang_tokenizer = sample_function(targ_lang)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

In [None]:
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(input_sentences, output_sentences)

# Calculate max_length of the target tensors
# For our project, the max_length_targ and max_length_inp are 69 and 72 respectively.

max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]
print(max_length_targ)
print(max_length_inp)

69
72


In [None]:
# checking if the input sequences have been obtained and padded properly
print(target_tensor[9])
print(input_tensor[9])

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   1  47 203  18 203  26  39 553  79  29   5 270   8   2]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    1    5  106   62   63  462 6235   21    4   59
    8    2]


In [None]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2,random_state=7)

# Show length
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

56000 56000 14000 14000


In [None]:
# checking if the input sequences have been obtained and padded properly
print(input_tensor_val[9])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     1    60   451    14    60 10406    99   207   135     8     2]


In [None]:
# a function to test if the word to index / index to word mappings have been obtained correctly.
# representative output for two sample english and hindi sentences given in the code block below

def convert(lang, tensor):
  for t in tensor:
    if t!=0:
      print ("%d ----> %s" % (t, lang.index_word[t]))
      print ("%s ----> %d" % (lang.index_word[t], lang.word_index[lang.index_word[t]]))

In [None]:
print ("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print ()
print ("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

Input Language; index to word mapping
1 ----> <sos>
<sos> ----> 1
74 ----> did
did ----> 74
5 ----> you
you ----> 5
2270 ----> threaten
threaten ----> 2270
21 ----> me
me ----> 21
8 ----> ?
? ----> 8
2 ----> <eos>
<eos> ----> 2

Target Language; index to word mapping
1 ----> <sos>
<sos> ----> 1
15 ----> आप
आप ----> 15
26 ----> मुझे
मुझे ----> 26
1426 ----> धमकी
धमकी ----> 1426
45 ----> किया
किया ----> 45
29 ----> था
था ----> 29
8 ----> ?
? ----> 8
2 ----> <eos>
<eos> ----> 2


In [None]:
# BUFFER_SIZE stores the number of training points
BUFFER_SIZE = len(input_tensor_train)

# BATCH_SIZE is set to 64. Training and gradient descent happens in batches of 64
BATCH_SIZE = 64

# the number of batches in one epoch (also, the number of steps during training, when we go batch by batch)
steps_per_epoch = BUFFER_SIZE//BATCH_SIZE

# the length of the embedded vector
embedding_dim = 256

# number of GRU units (size of the hidden state)
units = 1024

# getting the size of the input and output vocabularies.
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

# now, we shuffle the dataset and split it into batches of 64
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True) # the remainder after splitting by 64 are dropped

print(BUFFER_SIZE)
print(BUFFER_SIZE//64)
print(steps_per_epoch)
print(max_length_targ)
print(max_length_inp)

56000
875
875
69
72
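
The relationship between `BUFFER_SIZE`, `BATCH_SIZE` and `steps_per_epoch` can be checked with a short plain-Python sketch. The `batches` helper is hypothetical and only imitates what `batch(..., drop_remainder=True)` does to the number of batches.

```python
def batches(items, batch_size, drop_remainder=True):
    # Split items into consecutive chunks of batch_size.
    out = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # Optionally discard a final chunk that is smaller than batch_size.
    if drop_remainder and out and len(out[-1]) < batch_size:
        out.pop()
    return out

print(len(batches(list(range(56000)), 64)))  # 875
print(len(batches(list(range(10)), 4)))      # 2
```

With 56000 training pairs and a batch size of 64 the division is exact, so nothing is dropped in this particular run.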


In [None]:
# to understand the shape of an input batch
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 72]), TensorShape([64, 69]))

# **Encoder-Decoder model**

The encoder model consists of an embedding layer and a GRU layer with 1024 units.

The decoder model consists of an embedding layer, a GRU layer and a dense layer. Its GRU is initialised with the final hidden state of the encoder.

---
![picture](https://drive.google.com/uc?id=1BjzsnC-lcn4GapfGv1hUDDyb68ySS4cW)
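
To make the role of the GRU layer more concrete, the following is a minimal pure-Python sketch of one textbook GRU step and of running it over a sequence. It is not the Keras implementation (which differs in details such as where the reset gate is applied); the parameter dictionary `p` and the function names are assumptions made for this illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gru_step(x, h, p):
    # p holds input weights W*, recurrent weights U* and biases b* for the
    # update gate (z), the reset gate (r) and the candidate state (h).
    z = [sigmoid(a + b + c) for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b + c) for a, b, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh), p["bh"])]
    # Blend the previous state and the candidate state using the update gate.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h, cand)]

def run_gru(sequence, h0, p):
    # Returns the state after every step (like return_sequences=True)
    # and the final state (like return_state=True).
    outputs, h = [], h0
    for x in sequence:
        h = gru_step(x, h, p)
        outputs.append(h)
    return outputs, h

zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {k: zeros for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
params.update({k: [0.0, 0.0] for k in ("bz", "br", "bh")})
outputs, state = run_gru([[1.0, 2.0]] * 3, [1.0, -2.0], params)
print(state)  # [0.125, -0.25]
```

With all weights at zero the update gate is 0.5 and the candidate is 0, so each step simply halves the state. In the notebook, the final encoder state plays the role of `state` here and becomes the decoder's initial hidden state.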








In [None]:
class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz # set batch size
    self.enc_units = enc_units # set the number of GRU units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) # set the embedding layer using the input's vocabulary size and the embedding dimension (which is set to 256)
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform') # define the GRU layer

  def call(self, x, hidden): # invoked when the encoder object is called with an input batch and an initial hidden state
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden) # pass input x into the GRU layer
    return output, state # function returns the encoder output and the hidden state


  def initialize_hidden_state(self): # initialise the hidden state to all zeroes, with shape (batch size, units)
    return tf.zeros((self.batch_sz, self.enc_units))

In [None]:
encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE) # create an Encoder class object

# sample input to get a sense of the shapes.
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))

Encoder output shape: (batch size, sequence length, units) (64, 72, 1024)
Encoder Hidden state shape: (batch size, units) (64, 1024)


In [None]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz # batch_size which is defined as 64
    self.dec_units = dec_units # the number of decoder GRU units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim) # defining an embedding layer for the target language output.
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform') # GRU layer
    self.fc = tf.keras.layers.Dense(vocab_size)


  def call(self, x, hidden):

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x) # look up the embeddings of the target token ids

    # passing the initial state to the GRU as the hidden state
    output, state = self.gru(x, initial_state=hidden)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output) # pass the output through the dense layer

    return x, state # return decoder output and decoder state

In [None]:
decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

sample_decoder_output, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),sample_hidden)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))

Decoder output shape: (batch_size, vocab size) (64, 22224)


# **Training the model**

The model is trained on a GPU machine for a fixed number of epochs.

A custom training loop (instead of `Model.fit` etc.) is used; further reference is available from TensorFlow [here](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch).

The model can be extended to use the validation data for early stopping and further fine-tuning.

Checkpoints are stored so that the model can be retrieved and reused without retraining.

In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none') # sparse categorical crossentropy, kept per element so that padding can be masked out

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
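
The masking in `loss_function` can be illustrated without TensorFlow. The sketch below is a simplification under stated assumptions: it takes probabilities instead of logits, handles one decoding step for a small batch, and uses the hypothetical name `masked_mean_loss`.

```python
import math

def masked_mean_loss(targets, probs):
    # targets: one token id per batch element (0 means padding).
    # probs: one probability distribution over the vocabulary per batch element.
    losses = []
    for target, dist in zip(targets, probs):
        nll = -math.log(dist[target])               # cross-entropy for this element
        losses.append(0.0 if target == 0 else nll)  # padded positions contribute nothing
    # As with tf.reduce_mean, the mean is taken over all elements, padded or not.
    return sum(losses) / len(losses)

loss = masked_mean_loss([1, 0], [[0.25, 0.5, 0.25], [0.5, 0.25, 0.25]])
print(round(loss, 4))  # 0.3466
```

Only the first element counts here: its loss is `-log(0.5)`, and the average over the two elements gives `log(2) / 2`.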

In [None]:
checkpoint_dir = './tutorial_checkpoint_nmt'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

In [None]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)

    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<sos>']] * BATCH_SIZE, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # the decoder receives only the previous token and its hidden state (enc_output is not used, as this model has no attention)
      predictions, dec_hidden = decoder(dec_input, dec_hidden)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables)) # doing gradient descent

  return batch_loss
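
Teacher forcing, as used in `train_step`, pairs each decoder input with the true next token. A minimal sketch for a single unpadded target sequence follows; `teacher_forcing_pairs` is a hypothetical helper, and the ids are only illustrative.

```python
def teacher_forcing_pairs(target_ids):
    # At step t the decoder is fed the true token at t-1 and is scored
    # against the true token at t, regardless of what it predicted.
    return [(target_ids[t - 1], target_ids[t]) for t in range(1, len(target_ids))]

pairs = teacher_forcing_pairs([1, 15, 26, 8, 2])
print(pairs)  # [(1, 15), (15, 26), (26, 8), (8, 2)]
```

Each tuple reads as (decoder input, expected output), so the first input is `<sos>` (id 1) and the last expected output is `<eos>` (id 2).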

In [None]:
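train = False # set to True to train; if left False, a saved checkpoint must exist for the restore cell below, otherwise the model keeps its random initial weights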
EPOCHS = 10
if train :
  for epoch in range(EPOCHS):
    start = time.time()

    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
      batch_loss = train_step(inp, targ, enc_hidden)
      total_loss += batch_loss

      if batch % 100 == 0:
        print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                    batch,
                                                    batch_loss.numpy()))
    # saving (checkpoint) the model every 2 epochs
    if (epoch + 1) % 2 == 0:
      checkpoint.save(file_prefix = checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                        total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

In [None]:
# restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

<tensorflow.python.checkpoint.checkpoint.InitializationOnlyStatus at 0x7f4ffa92b130>

# **Prediction using Greedy Search**

Greedy search is used for decoding the text: at each step the token with the highest score is chosen and fed back as the next decoder input.

In [None]:
def evaluate(sentence):
  sentence = preprocess_sentence(sentence)

  inputs = [inp_lang.word_index[i] for i in sentence.split(' ')]
  # pad at the front ('pre'), matching the default padding applied to the training tensors
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                         maxlen=max_length_inp,
                                                         padding='pre')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = [tf.zeros((1, units))]
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<sos>']], 0)

  for t in range(max_length_targ):
    predictions, dec_hidden = decoder(dec_input,dec_hidden)

    # the decoder receives the previous token and the decoder hidden state
    # (initialised to the encoder hidden state at the first step) and returns
    # the prediction scores and the updated hidden state

    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<eos>':
      return result, sentence

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result, sentence
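
The control flow of greedy search can be shown independently of the model. In this sketch `step_fn` is a hypothetical stand-in for the decoder call, and `toy_step` is a made-up scorer over a four-token vocabulary; neither is part of the notebook.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    # step_fn(token_id, state) -> (scores, new_state)
    token, state, output = start_id, None, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        output.append(token)
        if token == end_id:  # stop as soon as the end token is produced
            break
    return output

def toy_step(token_id, state):
    # Toy scorer: always prefers the id that follows the previous one.
    scores = [0.0] * 4
    scores[(token_id + 1) % 4] = 1.0
    return scores, state

print(greedy_decode(toy_step, start_id=1, end_id=3, max_len=10))  # [2, 3]
```

Greedy search commits to the single best token at every step, so it never revisits an earlier choice; `max_len` bounds the output when the end token is never produced.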

In [None]:
def translate(sentence):
  result, sentence = evaluate(sentence)

  print('Input: %s' % (sentence))
  print('Predicted translation: {}'.format(result))

  return result

In [None]:
translate("how are you doing")

Input: <sos> how are you doing <eos>
Predicted translation: दसवें हरा हरा स्टार revenges मुझमें त्रुटिहीन board वाहवाही मचाओ वायदा निचोड़ सहिष्णुता युगल reenlisting मौलिक alakazam horologist लादकर गन्धकी 1983 आयोगों प्रतिस्थापन nköö भूतल खेती नाश्ता बांधूंगा आएंगे 3,2,1 क्रूस probe पकड़ने अप्राकृतिक उधर बंजर सूचीबद्ध बुमेरांग rimgale लाइब्रेरी उछला police haνanschlicht रक्त मुस्कुरा jockeyed huffs 103 belt belt झुमके रचना आताemelisandeखेलके ज़हमत कैदियों प्रवृत्त रोलर लीं मार्सेलो चेकइन तूफ़ान winging सबलोग बढ़ता खटखटाया सुरक्षित सुरक्षित 11201 ब्रूक्सी 


'दसवें हरा हरा स्टार revenges मुझमें त्रुटिहीन board वाहवाही मचाओ वायदा निचोड़ सहिष्णुता युगल reenlisting मौलिक alakazam horologist लादकर गन्धकी 1983 आयोगों प्रतिस्थापन nköö भूतल खेती नाश्ता बांधूंगा आएंगे 3,2,1 क्रूस probe पकड़ने अप्राकृतिक उधर बंजर सूचीबद्ध बुमेरांग rimgale लाइब्रेरी उछला police haνanschlicht रक्त मुस्कुरा jockeyed huffs 103 belt belt झुमके रचना आताemelisandeखेलके ज़हमत कैदियों प्रवृत्त रोलर लीं मार्सेलो चेकइन तूफ़ान winging सबलोग बढ़ता खटखटाया सुरक्षित सुरक्षित 11201 ब्रूक्सी '

In [None]:
translate("I am hungry. Can you give me something to eat.")

Input: <sos> i am hungry . can you give me something to eat . <eos>
Predicted translation: दसवें हरा हरा स्टार revenges मुझमें त्रुटिहीन board वाहवाही मचाओ वायदा निचोड़ सहिष्णुता युगल reenlisting मौलिक alakazam horologist लादकर गन्धकी 1983 आयोगों प्रतिस्थापन nköö भूतल खेती नाश्ता बांधूंगा आएंगे 3,2,1 क्रूस probe पकड़ने अप्राकृतिक उधर बंजर सूचीबद्ध बुमेरांग rimgale लाइब्रेरी उछला police haνanschlicht रक्त मुस्कुरा jockeyed huffs 103 belt belt झुमके रचना आताemelisandeखेलके ज़हमत कैदियों प्रवृत्त रोलर लीं मार्सेलो चेकइन तूफ़ान winging सबलोग बढ़ता खटखटाया सुरक्षित सुरक्षित 11201 ब्रूक्सी 


'दसवें हरा हरा स्टार revenges मुझमें त्रुटिहीन board वाहवाही मचाओ वायदा निचोड़ सहिष्णुता युगल reenlisting मौलिक alakazam horologist लादकर गन्धकी 1983 आयोगों प्रतिस्थापन nköö भूतल खेती नाश्ता बांधूंगा आएंगे 3,2,1 क्रूस probe पकड़ने अप्राकृतिक उधर बंजर सूचीबद्ध बुमेरांग rimgale लाइब्रेरी उछला police haνanschlicht रक्त मुस्कुरा jockeyed huffs 103 belt belt झुमके रचना आताemelisandeखेलके ज़हमत कैदियों प्रवृत्त रोलर लीं मार्सेलो चेकइन तूफ़ान winging सबलोग बढ़ता खटखटाया सुरक्षित सुरक्षित 11201 ब्रूक्सी '

# **Calculating BLEU score for evaluation**

The BLEU (Bilingual Evaluation Understudy) score is calculated on the test data to evaluate the quality of the translations.
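
BLEU is built on clipped (modified) n-gram precision, which it combines over several n-gram sizes together with a brevity penalty. The sketch below shows only the clipped precision for one sentence pair; it is not a full BLEU implementation, and `clipped_precision` is a hypothetical helper rather than an NLTK function.

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    # Each candidate n-gram is credited at most as many times as it occurs in the reference.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

candidate = ["the"] * 7
reference = "the cat is on the mat".split()
print(clipped_precision(candidate, reference))       # 0.2857142857142857
print(clipped_precision(candidate, reference, n=2))  # 0.0
```

The clipping is what stops a degenerate output that repeats one common word from scoring well: only two of the seven occurrences of "the" are credited.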

In [None]:
test_input_sentences = []
test_output_sentences = []

for line in open(r'test.en', encoding="utf-8"):

    test_input_sentence = line.rstrip().strip("\n").strip('-')
    test_input_sentences.append(test_input_sentence)


for line in open(r'test.hi', encoding="utf-8"):
    test_output_sentence =  line.rstrip().strip("\n").strip('-')
    line = indic_tokenize.trivial_tokenize(test_output_sentence)

    test_output_sentences.append(['<sos>'] + line + ['<eos>'])

print(type(test_input_sentences[90]))
print(len(test_output_sentences))
print(test_input_sentences[90])
print(test_output_sentences[90])

<class 'str'>
1000
You're slower than molasses in January.
['<sos>', 'आप', 'जनवरी', 'में', 'गुड़', 'की', 'तुलना', 'में', 'धीमी', 'है', '.', '<eos>']


In [None]:
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.bleu_score import SmoothingFunction
chencherry = SmoothingFunction()
evaluate_n_sentences = 10

references = []
candidates = []
for i in range(evaluate_n_sentences):
  try:
    res = translate(test_input_sentences[i])
    ref = test_output_sentences[i].copy()
    ref = [e for e in ref if e not in ('<eos>', '<sos>', '.')]
    # corpus_bleu expects a list of reference translations per sentence,
    # so the single reference is wrapped in a list
    references.append([ref])
    listToStr = ' '.join(map(str, test_output_sentences[i]))
    print('Reference Translation: %s' % (listToStr))
    candidate = indic_tokenize.trivial_tokenize(res)
    candidate = [e for e in candidate if e not in ('<', 'eos','>', '.')]
    candidates.append(candidate)
  except KeyError:
    # evaluate() raises KeyError when the sentence contains a word outside the training vocabulary
    print('Sentence :', i+1, ' not translatable ..moving to next' )
score1 = corpus_bleu(references, candidates, smoothing_function=chencherry.method4)
score2 = corpus_bleu(references, candidates)
print('BLEU score on test data without smoothing function: ' ,score2)
print('BLEU score on test data with smoothing function: ' ,score1)

Sentence : 1  not translatable ..moving to next
Input: <sos> storm will be the closest man to him . <eos>
Predicted translation: दसवें हरा हरा स्टार revenges मुझमें त्रुटिहीन board वाहवाही मचाओ वायदा निचोड़ सहिष्णुता युगल reenlisting मौलिक alakazam horologist लादकर गन्धकी 1983 आयोगों प्रतिस्थापन nköö भूतल खेती नाश्ता बांधूंगा आएंगे 3,2,1 क्रूस probe पकड़ने अप्राकृतिक उधर बंजर सूचीबद्ध बुमेरांग rimgale लाइब्रेरी उछला police haνanschlicht रक्त मुस्कुरा jockeyed huffs 103 belt belt झुमके रचना आताemelisandeखेलके ज़हमत कैदियों प्रवृत्त रोलर लीं मार्सेलो चेकइन तूफ़ान winging सबलोग बढ़ता खटखटाया सुरक्षित सुरक्षित 11201 ब्रूक्सी 
Reference Translation: <sos> तूफान उसे निकटतम आदमी हो जाएगा . <eos>
Input: <sos> well , ilse , now you have to eat something , too . <eos>
Predicted translation: दसवें हरा हरा स्टार revenges मुझमें त्रुटिहीन board वाहवाही मचाओ वायदा निचोड़ सहिष्णुता युगल reenlisting मौलिक alakazam horologist लादकर गन्धकी 1983 आयोगों प्रतिस्थापन nköö भूतल खेती नाश्ता बांधूंगा आएंगे 3,2,1

# *References*

1. The dataset used is available [here](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/).

2. Refer to the TensorFlow tutorials on NMT [here](https://tensorflow.org/tutorials/text/nmt_with_attention) and [here](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt), on which this notebook is modelled.

3. Refer to the reference code and documentation [here](https://github.com/prashanthi-r/Eng-Hin-Neural-Machine-Translation), from which this notebook has been adapted.

4. The Indic NLP Library documentation can be found [here](https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf).






