<a href="https://colab.research.google.com/github/RajBharti25/Neural-Machine-Translator-for-English-to-Hindi/blob/master/Neural_Machine_translator_with_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Neural Machine Translator with attention
 ##                                              by *RAJ BHARTI*
-------------------
**Seq2Seq** is an encoder-decoder approach to machine translation that maps an input sequence to an output sequence. The idea is to use two RNN-based networks (LSTMs are common; this notebook uses GRUs) that work together: the encoder reads the source sentence, and the decoder, guided by special start and end tokens and by attention over the encoder outputs, predicts the target sequence one token at a time.

A Seq2Seq model consists of an encoder and a decoder. Each sentence is first converted, word by word, into indices in a vocabulary of known words. The encoder reads the indexed source sentence and passes its final state to the decoder. The decoder then predicts the output one word at a time, feeding each predicted word back in as the next input. A special token marks the end of every sentence, and the decoder stops once it predicts that end token.

Here is a view of a basic seq2seq model for bilingual machine translation.
<img src="https://blog.keras.io/img/seq2seq/seq2seq-teacher-forcing.png">


In this notebook we design and train an English-to-Hindi seq2seq translator based on an encoder-decoder architecture with an attention mechanism.

An example of a seq2seq model with attention is shown below.
<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg">

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/My Drive/Colab Notebooks/Neural Machine Translator/nml with attention

/content/drive/My Drive/Colab Notebooks/Neural Machine Translator/nml with attention


In [None]:
import pandas as pd
import numpy as np
import re
import os
import time
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

In [None]:
sentence_list=[]
file_dir='/content/drive/My Drive/Colab Notebooks/Neural Machine Translator/nml with attention/hin-eng/hin.txt'
with open(file_dir) as f:
  lines=f.readlines()
  for line in lines:
    sentence_list.append(line.split('\t')[0:2])
df=pd.DataFrame(sentence_list,columns=['English','Hindi'])

#printing
print(df.head())
print()
print(df.tail())
print(df.iloc[2777,1])

  English   Hindi
0    Wow!    वाह!
1   Help!   बचाओ!
2   Jump.   उछलो.
3   Jump.   कूदो.
4   Jump.  छलांग.

                                                English                                              Hindi
2773  If you go to that supermarket, you can buy mos...  उस सूपरमार्केट में तुम लगभग कोई भी रोजाने में ...
2774  The passengers who were injured in the acciden...  जिन यात्रियों को दुर्घटना मे चोट आई थी उन्हे अ...
2775  Democracy is the worst form of government, exc...  लोकतंत्र सरकार का सबसे घिनौना रूप है, अगर बाकी...
2776  If my boy had not been killed in the traffic a...  अगर मेरा बेटा ट्रेफ़िक हादसे में नहीं मारा गया...
2777  When I was a kid, touching bugs didn't bother ...  जब मैं बच्चा था, मुझे कीड़ों को छूने से कोई पर...
जब मैं बच्चा था, मुझे कीड़ों को छूने से कोई परेशानी नहीं होती थी, पर अब मैं उनकी तस्वीरें देखना भी बर्दाश्त नहीं कर सकता।


We now have a dataframe of training data with the columns 'English' and 'Hindi'. Before tokenizing the text we need to preprocess it: separate the sentence-ending punctuation from the words, remove other special characters, and collapse extra spaces. Cleaner text generally helps the model's accuracy and reduces training time.

We will use Python's **regular expression** (regex) module, `re`, to clean the text data.

---
If you look at the text, you will notice that almost all sentences end with a full stop, question mark, or exclamation mark placed directly after the last word. We need to insert a space between the word and the symbol so that the symbol becomes a separate token.
We also need to replace the remaining special characters and remove the extra white space between words.

In [None]:
#creating white space between the word and the punctuation characters
df['English']=df['English'].apply(lambda x:re.sub(r"([.!?¿])",r" \1",x ))
df['Hindi']=df['Hindi'].apply(lambda x:re.sub(r"([.!?¿।])",r" \1",x ))
print(df.iloc[2777,0])
print(df.iloc[2777,1])

# removing all the extra characters from the sentences with a white space from the English text
df['English']=df['English'].apply(lambda x:re.sub(r"([^a-zA-Z.!?¿]+)",' ' ,x ))
print(df.iloc[2777,0])

# Removing Digits, English words and special characters from Hindi Sentences
df['Hindi']=df['Hindi'].apply(lambda x:re.sub(r'([A-Za-z0-9२३०८१५७९४६,]+)',' ' ,x ))

# we also will have to remove the extra white space
df['English']=df['English'].apply(lambda x:re.sub(r'([" "]+)',' ',x ))
df['Hindi']=df['Hindi'].apply(lambda x:re.sub(r'([" "]+)',' ' ,x ))

# we will convert english text into lowercase
df['English']=df['English'].apply(lambda x:x.lower())

#use strip() to remove the leading and trailing white space from the sentences
df['English']=df['English'].apply(lambda x:x.strip())
df['Hindi']=df['Hindi'].apply(lambda x:x.strip())
print('\n',df.iloc[2777,0])
print('\n',df.iloc[2777,1])

When I was a kid, touching bugs didn't bother me a bit . Now I can hardly stand looking at pictures of them .
जब मैं बच्चा था, मुझे कीड़ों को छूने से कोई परेशानी नहीं होती थी, पर अब मैं उनकी तस्वीरें देखना भी बर्दाश्त नहीं कर सकता ।
When I was a kid touching bugs didn t bother me a bit . Now I can hardly stand looking at pictures of them .

 when i was a kid touching bugs didn t bother me a bit . now i can hardly stand looking at pictures of them .

 जब मैं बच्चा था मुझे कीड़ों को छूने से कोई परेशानी नहीं होती थी पर अब मैं उनकी तस्वीरें देखना भी बर्दाश्त नहीं कर सकता ।
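
The same English cleaning steps are needed again later, when a raw sentence is translated at inference time. As a minimal sketch, they can be wrapped in one function that works on a single string. The name `preprocess_english` is hypothetical and not part of the notebook, and the space-collapsing pattern is simplified to `" +"`, which behaves the same here because quotes have already been removed by the previous step.

```python
import re

def preprocess_english(sentence):
    """Clean one English sentence with the same steps used on the dataframe."""
    # put a space before sentence punctuation so it becomes its own token
    sentence = re.sub(r"([.!?¿])", r" \1", sentence)
    # replace every run of other non-letter characters with a single space
    sentence = re.sub(r"[^a-zA-Z.!?¿]+", " ", sentence)
    # collapse repeated spaces, then lowercase and trim both ends
    sentence = re.sub(r" +", " ", sentence)
    return sentence.lower().strip()

cleaned = preprocess_english("When I was a kid, touching bugs didn't bother me a bit.")
print(cleaned)
```

Keeping the steps in one place makes it harder for training-time and inference-time preprocessing to drift apart.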


We add a start and an end token to the target sentences so that the model knows where a sentence begins and ends. This also helps when decoding with the trained model, where we stop sampling once the end token is produced.
Here we add **start_** as the start-of-sentence token and **end_** as the end-of-sentence token in the target data.

In [None]:
#adding START_ and END_ tokens
print('Before adding tokens:',df.iloc[277,1])
df['Hindi']=df['Hindi'].apply(lambda x: 'start_ '+ x + ' end_')
print('After adding tokens:',df.iloc[277,1])

Before adding tokens: यह सही नहीं है ।
After adding tokens: start_ यह सही नहीं है । end_


We can't feed text directly into a neural network. To feed text into the network we first have to **tokenize** it and preprocess it further.
Essentially, we build a Python dictionary of all the words in the Hindi and English text, and each word is mapped to an integer (its index) based on its position in the dictionary. Every sentence is transformed into a list of integers, one for each word in that sentence.
For example, "I am a very good boy" would be translated to [345 543 56 956 2000 400].
Note: these integer values are for illustration only and do not come from the real vocabulary.
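
To make the word-to-index mapping concrete, here is a minimal pure-Python sketch. The helpers `build_word_index` and `texts_to_padded` are hypothetical and only approximate what a tokenizer does. The sketch assumes index 0 is reserved for padding, index 1 for unknown words (`<unk>`), and that more frequent words get smaller indices.

```python
def build_word_index(sentences, oov_token="<unk>"):
    """Map each word to an integer; 0 is kept free for padding."""
    counts = {}
    for s in sentences:
        for w in s.split():
            counts[w] = counts.get(w, 0) + 1
    # most frequent words first; ties keep first-seen order (sorted is stable)
    ordered = sorted(counts, key=lambda w: -counts[w])
    word_index = {oov_token: 1}
    for w in ordered:
        word_index[w] = len(word_index) + 1
    return word_index

def texts_to_padded(sentences, word_index, oov_token="<unk>"):
    """Turn sentences into equal-length integer lists, padded with 0 at the end."""
    seqs = [[word_index.get(w, word_index[oov_token]) for w in s.split()]
            for s in sentences]
    max_len = max(len(seq) for seq in seqs)
    return [seq + [0] * (max_len - len(seq)) for seq in seqs]

corpus = ["i am a boy", "i am good"]
word_index = build_word_index(corpus)
# "bad" was never seen while building the index, so it maps to <unk>
padded = texts_to_padded(corpus + ["i am bad"], word_index)
print(word_index)
print(padded)
```

Padding at the end of each sequence corresponds to `padding='post'` in the cells that follow.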

In [None]:
#we will need the maximum length of the input and target sequences
#and the number of words in the two vocabularies
df['len_Hindi_split']=df['Hindi'].apply(lambda x:len(x.split(' ')))
df['len_English_split']=df['English'].apply(lambda x:len(x.split(' ')))
max_length_src=max((df['len_English_split']))
max_length_tar=max((df['len_Hindi_split']))
print('Maximum length of source(max_len_src):',max_length_src)
print('Maximum length of target(max_len_tar):',max_length_tar)

# create word sets for the English and Hindi vocabularies
eng_vocab=set()
for line in df['English']:
  for w in line.split():
    eng_vocab.add(w)
hin_vocab=set()
for line in df['Hindi']:
  for w in line.split():
    hin_vocab.add(w)

#calculate the size of the two vocabularies
len_src_vocab=len(eng_vocab)
len_tar_vocab=len(hin_vocab)
print('length of source vocab(len_src_vocab):',len(eng_vocab))
print('length of target vocab(len_tar_vocab):',len(hin_vocab))


Maximum length of source(max_len_src): 25
Maximum length of target(max_len_tar): 28
length of source vocab(len_src_vocab): 2319
length of target vocab(len_tar_vocab): 2838


The Keras `Tokenizer` class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, based on tf-idf, and so on.
For further information on the tokenizer, please visit the following
[link](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer)


In [None]:
# Instantiate a tokenizer for the INPUT sequence
eng_tokenizer=Tokenizer(filters='',oov_token='<unk>')
eng_tokenizer.fit_on_texts(df['English'])
# Get our input sequence  word index
eng_word_index = eng_tokenizer.word_index
# create input tensor
input_tensor = eng_tokenizer.texts_to_sequences(df['English'])
input_tensor = pad_sequences(input_tensor,padding='post')


# Instantiate a tokenizer for the TARGET sequence
hin_tokenizer=Tokenizer(filters='',oov_token='<unk>')
hin_tokenizer.fit_on_texts(df['Hindi'])
# Get our target sequence word index
hin_word_index = hin_tokenizer.word_index
# create target tensor
target_tensor = hin_tokenizer.texts_to_sequences(df['Hindi'])
target_tensor = pad_sequences(target_tensor,padding='post')

print('input_tensor[0]:',input_tensor[0])
print('target_tensor[0]:',target_tensor[0])

#let's split the input and target tensors into training and validation data
input_tensor_train,input_tensor_val,target_tensor_train,target_tensor_val=train_test_split(input_tensor,
                                                                                            target_tensor,test_size=0.15)
print('\n\n','input_tensor_train:',len(input_tensor_train), '  target_tensor_train:',len(target_tensor_train), 
      '\n input_tensor_val:',len(input_tensor_val),'     target_tensor_val:' ,len(target_tensor_val))

input_tensor[0]: [1244   59    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0]
target_tensor[0]: [   2 1407   73    3    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0]


 input_tensor_train: 2361   target_tensor_train: 2361 
 input_tensor_val: 417      target_tensor_val: 417


Now, using **tf.data.Dataset**, we will build an input pipeline to train the model.

In [None]:
buffer_size=len(input_tensor_train)
#embedding dimension is the length of the vector into which each word of the sequence is transformed
embedding_dim=256
Batch_size=64
steps_per_epoch = len(input_tensor_train)//Batch_size
units=40

vocab_inp_size = len_src_vocab+1
vocab_tar_size = len_tar_vocab+1

dataset=tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(buffer_size)
dataset=dataset.batch(Batch_size, drop_remainder=True)

#lets print one iteration of the batch 
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 25]), TensorShape([64, 28]))
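
The batch arithmetic behind `steps_per_epoch` and `drop_remainder=True` can be illustrated without TensorFlow. This is only a sketch: the helper `shuffled_batches` is hypothetical, and it works on example indices rather than tensors, using the training set size (2361) and batch size (64) from this notebook.

```python
import random

def shuffled_batches(n_examples, batch_size, seed=0):
    """Shuffle example indices and keep only the full batches."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_full = n_examples // batch_size  # same value as steps_per_epoch
    return [indices[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]

batches = shuffled_batches(2361, 64)
dropped = 2361 - len(batches) * 64
print(len(batches), "full batches,", dropped, "examples left out of this pass")
```

Dropping the incomplete last batch keeps every batch at exactly 64 rows, which matters here because the encoder's initial hidden state is created with a fixed batch size.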

Implementing the architecture for the **encoder-decoder with attention** mechanism

In [None]:
class Encoder(tf.keras.Model):
  def __init__(self,vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz=batch_sz
    self.enc_units=enc_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.enc_units, return_sequences=True, return_state=True,recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    output, state = self.gru(x, initial_state = hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

In [None]:
encoder=Encoder(vocab_inp_size+1, embedding_dim, units, Batch_size)
# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)

print ('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print ('Encoder state shape: (batch size, units) {}'.format(sample_hidden.shape))


Encoder output shape: (batch size, sequence length, units) (64, 25, 40)
Encoder state shape: (batch size, units) (64, 40)


In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

In [None]:
attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)

print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Attention result shape: (batch size, units) (64, 40)
Attention weights shape: (batch_size, sequence_length, 1) (64, 25, 1)
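
For reference, the computation in `BahdanauAttention.call` can be written as equations. The symbols are introduced here for illustration and do not appear in the notebook: $s$ is the query (a hidden state), $h_i$ is the encoder output at source position $i$, and $W_1$, $W_2$, $v$ stand for the `W1`, `W2`, and `V` dense layers with their bias terms omitted for brevity.

$$
e_i = v^\top \tanh\left(W_1 s + W_2 h_i\right)
$$

$$
\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}
$$

$$
c = \sum_i \alpha_i \, h_i
$$

Here $\alpha_i$ corresponds to `attention_weights` (one weight per source position, which is why its shape is `(64, 25, 1)`), and $c$ corresponds to `context_vector`, which has the same width as the encoder output (`(64, 40)`).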


In [None]:
class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

In [None]:
decoder = Decoder(vocab_tar_size+1, embedding_dim, units, Batch_size)

sample_decoder_output, state, attention_weights = decoder(tf.random.uniform((Batch_size, 1)),
                                      sample_hidden, sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
print ('attention_weights.shape: ',attention_weights.shape)
print ('state.shape:',state.shape)

Decoder output shape: (batch_size, vocab size) (64, 2840)
attention_weights.shape:  (64, 25, 1)
state.shape: (64, 40)


In [None]:
attention_weights.shape,state.shape

(TensorShape([64, 25, 1]), TensorShape([64, 40]))

##Optimizer and loss function

In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_object=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
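
The mask in `loss_function` keeps padded positions (index 0) from contributing to the loss. A minimal pure-Python sketch of the same idea for one decoding step over a small batch is shown below. The helper `masked_mean_loss` is hypothetical, and it takes the probability the model assigned to each true word directly, rather than logits.

```python
import math

def masked_mean_loss(target_ids, target_probs, pad_id=0):
    """Average cross-entropy over a batch, with padded targets contributing 0."""
    per_example = []
    for t, p in zip(target_ids, target_probs):
        # cross-entropy for one example is -log(probability of the true word)
        loss = -math.log(p)
        # padded targets are multiplied by a mask of 0
        per_example.append(loss if t != pad_id else 0.0)
    # like tf.reduce_mean, divide by the full batch size, padded rows included
    return sum(per_example) / len(per_example)

loss = masked_mean_loss([5, 0, 7, 0], [0.5, 0.1, 0.25, 0.9])
print(round(loss, 4))
```

Because the mean is taken over all rows, batches with many padded targets at a given step report a smaller loss for that step. Dividing by the number of unmasked rows instead would be an alternative design.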

Create a checkpoint

In [None]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=encoder,
                                 decoder=decoder)

##Training steps


1.   Pass the input through the encoder, which returns the encoder output and the encoder hidden state.
2.   The encoder output, the encoder hidden state, and the decoder input (which is the start token) are passed to the decoder.
3.   The decoder returns the predictions and the decoder hidden state.
4.   The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss.
5.   Use teacher forcing to decide the next input to the decoder.
Teacher forcing is the technique where the target word is passed as the next input to the decoder.
6.   The final step is to calculate the gradients and apply them with the optimizer (backpropagation).
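
Teacher forcing in step 5 can be pictured as pairing each target word with the word that should follow it. The sketch below uses a hypothetical helper, `teacher_forcing_pairs`, on the first row of `target_tensor` printed earlier (truncated to six positions), where 2 is the index of `start_` and 3 is the index of `end_`.

```python
def teacher_forcing_pairs(target_ids):
    """Return (decoder input, expected output) for every decoding step."""
    # at step t the decoder is fed the true word t-1 and is scored on word t
    return [(target_ids[t - 1], target_ids[t]) for t in range(1, len(target_ids))]

pairs = teacher_forcing_pairs([2, 1407, 73, 3, 0, 0])
for dec_input, expected in pairs:
    print("input:", dec_input, "-> expected:", expected)
```

The last pairs have 0 (padding) as the expected output. Those steps are the ones the mask in `loss_function` zeroes out.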


In [None]:
# Define a train_step function
def train_step(inp, targ, enc_hidden):
  loss = 0
  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([hin_tokenizer.word_index['start_']] * Batch_size, 1)

    # Teacher forcing - feeding the target as the next input
    for t in range(1, targ.shape[1]):
      # passing enc_output to the decoder
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      # using teacher forcing
      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss

In [None]:
EPOCHS = 50
#start the training 
for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 100 == 0:
      print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                   batch,
                                                   batch_loss.numpy()))
  # saving (checkpoint) the model every 10 epochs
  if (epoch + 1) % 10 == 0:
    checkpoint.save(file_prefix = checkpoint_prefix)

  print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / steps_per_epoch))
  print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

#                   **INFERENCE**
Inference differs slightly from training for NMT. Because no target sentence is available at inference time, we need a way to trigger the decoder once the encoding phase is finished. The start and end tokens added to the target sentences serve this purpose: the start token tells the decoder to begin, and the end token marks where the output stops.

We do this by giving **"start_"** as the first input to the decoder, taking its prediction as the output, and feeding that prediction back in as the next input. We stop predicting/sampling when we get **"end_"** as the next sample.


<img src="https://static.packt-cdn.com/products/9781788478311/graphics/B08681_10_73.jpg" width="1000" alt="attention mechanism">

* The evaluate function is similar to the training loop, except we don't use *teacher forcing* here. The input to the decoder at each time step is its previous predictions along with the hidden state and the encoder output.
* Stop predicting when the model predicts the *end token*.
* And store the *attention weights for every time step*.

Note: The encoder output is calculated only once for one input.
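
The decoding loop described above can be sketched independently of the model. In this sketch `greedy_decode` and `toy_step` are hypothetical: `step_fn` stands in for one decoder call followed by picking the most likely word, and the toy lookup table replaces a trained network so the control flow can be run on its own.

```python
def greedy_decode(step_fn, start_token, end_token, max_len):
    """Feed each prediction back in until the end token or max_len is reached."""
    result = []
    token, state = start_token, None
    for _ in range(max_len):
        token, state = step_fn(token, state)
        if token == end_token:
            break  # stop sampling once the end token is predicted
        result.append(token)
    return result

# toy stand-in for the decoder: the next word depends only on the previous word
toy_next = {"start_": "यह", "यह": "सही", "सही": "है", "है": "end_"}

def toy_step(token, state):
    return toy_next[token], state

translation = greedy_decode(toy_step, "start_", "end_", max_len=28)
print(" ".join(translation))
```

The `max_len` cap (28 here, the maximum target length in this dataset) guards against a model that never predicts the end token.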

In [None]:
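# In the evaluate function we also need to preprocess the raw input sentence,
# using almost the same preprocessing we applied to the input dataset
def evaluate(sentence):
  # body not provided in this notebook; the next cell walks through the first steps on a sample sentence
  raise NotImplementedError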

In [None]:
sentence="Ram is 2 [ good boy."
sentence=re.sub(r"([.!?¿])",r" \1",sentence)
sentence=re.sub(r"([^a-zA-Z.!?¿]+)",' ' ,sentence)
sentence=re.sub(r'([" "]+)',' ',sentence)
sentence=sentence.lower()
sentence=sentence.strip()
a= []
for w in sentence.split():
  if  eng_word_index.get(w):
    a.append(eng_word_index[w])
  else:
    a.append(eng_word_index['<unk>'])
sentence_token = tf.keras.preprocessing.sequence.pad_sequences([a],maxlen=max_length_src,padding='post')
sentence_tensor=inputs = tf.convert_to_tensor(sentence_token)

hidden = [tf.zeros((1, units))]
enc_out, enc_hidden = encoder(inputs, hidden)

In [None]:
enc_hidden.shape

TensorShape([1, 40])

For further information:

*   [Neural machine translation with attention](https://www.tensorflow.org/)
*   [Neural Machine Translation (seq2seq) Tutorial](https://github.com/tensorflow/nmt)
