# Word Level English to Marathi Neural Machine Translation using Encoder-Decoder Model

* Recurrent Neural Networks (more precisely, their LSTM/GRU variants) have proven very effective at solving complex sequence-related problems when given a large amount of data. They have real-world applications in speech recognition, Natural Language Processing (NLP), time series forecasting, etc. This blog nicely explains some of these applications.
* Sequence to Sequence (often abbreviated to seq2seq) models are a special class of Recurrent Neural Network architectures typically used (but not restricted) to solve complex language-related problems such as Machine Translation, Question Answering, building chatbots, Text Summarization, etc.

# Summary of the encoder:
* We read the input sequence (English sentence) word by word and keep the internal states of the LSTM network produced after the last time step, hk and ck (assuming the sentence has 'k' words). These two vectors are called the encoding of the input sequence, because they summarize the entire input in vector form. Since we only start generating output once the entire sequence has been read, the outputs (Yi) of the Encoder at each time step are discarded.
* You should also understand what kind of vectors Xi, hi, ci and Yi are: what their sizes (shapes) are and what they represent. If this part is unclear, first strengthen your understanding of LSTMs and language models.
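
To make the shapes concrete, here is a minimal NumPy sketch of an encoder loop. It is not the Keras implementation used later in this notebook. The weights are random, the gate ordering is one common convention, and the names `lstm_step`, `encode`, `embed_dim` and `hidden_dim` are chosen here purely for illustration. It only shows that each Xi is a vector of size `embed_dim`, that hi and ci are vectors of size `hidden_dim`, and that only the final pair (hk, ck) is kept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; stacked gate order here: input, forget, candidate, output."""
    z = W @ x_t + U @ h_prev + b                 # shape: (4 * hidden_dim,)
    i, f, g, o = np.split(z, 4)                  # four vectors of shape (hidden_dim,)
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t

def encode(X, hidden_dim, seed=0):
    """Run the LSTM over all k time steps and return only the final states."""
    rng = np.random.default_rng(seed)
    embed_dim = X.shape[1]
    W = rng.normal(scale=0.1, size=(4 * hidden_dim, embed_dim))
    U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
    b = np.zeros(4 * hidden_dim)
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x_t in X:                                # x_t plays the role of Xi
        h, c = lstm_step(x_t, h, c, W, U, b)     # per-step outputs are not stored
    return h, c

k, embed_dim, hidden_dim = 5, 8, 16              # a 5-word sentence, illustrative sizes
X = np.random.default_rng(1).normal(size=(k, embed_dim))
h_k, c_k = encode(X, hidden_dim)
print(X.shape, h_k.shape, c_k.shape)
```

Whatever the sentence length k is, the encoding always has the same fixed size, which is what lets the decoder start from it.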

# Inference Algorithm:
1. During inference, we generate one word at a time. Thus the Decoder LSTM is called in a loop, every time processing only one time step.
2. The initial states of the decoder are set to the final states of the encoder.
3. The initial input to the decoder is always the START_ token.
4. At each time step, we preserve the states of the decoder and set them as initial states for the next time step.
5. At each time step, the predicted output is fed as input in the next time step.
6. We break the loop when the decoder predicts the END_ token.
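
The six steps above can be sketched as a plain greedy decoding loop. This is a hedged illustration rather than the notebook's actual inference code: `greedy_decode`, `decoder_step` and `toy_decoder_step` are hypothetical names, and the toy step function merely stands in for a trained decoder LSTM so that the loop can run on its own. The `max_len` cap is an added safeguard that the list above does not mention.

```python
import numpy as np

def greedy_decode(encoder_states, decoder_step, target_token_index,
                  reverse_target_index, start_token="START_",
                  end_token="END_", max_len=50):
    states = encoder_states                          # step 2: start from encoder states
    token = target_token_index[start_token]          # step 3: first input is START_
    decoded = []
    for _ in range(max_len):                         # step 1: one time step per iteration
        probs, states = decoder_step(token, states)  # step 4: carry the states forward
        token = int(np.argmax(probs))                # step 5: prediction becomes next input
        word = reverse_target_index[token]
        if word == end_token:                        # step 6: stop at END_
            break
        decoded.append(word)
    return " ".join(decoded)

# Toy stand-in for a trained decoder: it always predicts the next vocabulary entry.
vocab = ["START_", "पळ", "धाव", "END_"]
target_token_index = {w: i + 1 for i, w in enumerate(vocab)}   # 0 is reserved for padding
reverse_target_index = {i: w for w, i in target_token_index.items()}

def toy_decoder_step(token, states):
    probs = np.zeros(len(vocab) + 1)
    probs[min(token + 1, len(vocab))] = 1.0
    return probs, states

result = greedy_decode(None, toy_decoder_step, target_token_index, reverse_target_index)
print(result)
```

With a real model, `decoder_step` would wrap a call to the decoder network that takes the previous token and states and returns the next-word probabilities together with the updated states.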

In [1]:
import pandas as pd
import numpy as np


In [18]:
lines = pd.read_table('mar.txt', names=['eng', 'mar', 'code'])

In [19]:
lines

Unnamed: 0,eng,mar,code
0,Go.,जा.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1,Run!,पळ!,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
2,Run!,धाव!,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
3,Run!,पळा!,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
4,Run!,धावा!,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
...,...,...,...
40746,Just saying you don't like fish because of the...,हड्डींमुळे मासे आवडत नाही असं म्हणणं हे काय मा...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
40747,The Japanese Parliament today officially elect...,आज जपानी संसदेने अधिकृतरित्या र्‍यौतारौ हाशिमो...,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
40748,Tom tried to sell his old VCR instead of throw...,टॉमने त्याचा जुना व्ही.सी.आर फेकून टाकण्याऐवजी...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
40749,You can't view Flash content on an iPad. Howev...,आयपॅडवर फ्लॅश आशय बघता येत नाही. पण तुम्ही त्य...,CC-BY 2.0 (France) Attribution: tatoeba.org #9...


# Data cleaning



In [36]:
# Lowercase all characters
import re, string

lines.eng= lines.eng.apply(lambda x: x.lower())
lines.mar=lines.mar.apply(lambda x: x.lower())

#Remove quotes

lines.eng=lines.eng.apply(lambda x: re.sub("'", '', x))
lines.mar=lines.mar.apply(lambda x: re.sub("'", '', x))

#remove all numbers from text

remove_digits = str.maketrans('', '', string.digits)
lines.eng=lines.eng.apply(lambda x: x.translate(remove_digits))
lines.mar = lines.mar.apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

#Remove all the special characters
exclude = set(string.punctuation) # Set of all special characters (ASCII punctuation)

lines.eng=lines.eng.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.mar = lines.mar.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

#Remove extra spaces

lines.eng=lines.eng.apply(lambda x: x.strip())
lines.mar=lines.mar.apply(lambda x: x.strip())
lines.eng=lines.eng.apply(lambda x: re.sub(" +", " ", x))
lines.mar=lines.mar.apply(lambda x: re.sub(" +", " ", x))
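
As a quick sanity check, the same cleaning steps can be collected into a single function and tried on individual sentences. This is only a sketch: `clean_sentence` is a helper name introduced here, and it applies all steps to any string rather than treating the English and Marathi columns separately as the cell above does.

```python
import re
import string

MARATHI_DIGITS = "२३०८१५७९४६"
PUNCTUATION = set(string.punctuation)
REMOVE_ASCII_DIGITS = str.maketrans("", "", string.digits)

def clean_sentence(text):
    """Apply the notebook's cleaning steps, in the same order, to one string."""
    text = text.lower()                                      # lowercase
    text = re.sub("'", "", text)                             # remove quotes
    text = text.translate(REMOVE_ASCII_DIGITS)               # remove ASCII digits
    text = re.sub("[" + MARATHI_DIGITS + "]", "", text)      # remove Devanagari digits
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = text.strip()                                      # trim both ends
    return re.sub(" +", " ", text)                           # collapse runs of spaces

print(clean_sentence("  You can't view   Flash content on an iPad 2. "))
print(clean_sentence("व्ही.सी.आर"))
```

The order matters slightly: stripping and collapsing spaces come last, because removing digits and punctuation can leave new gaps behind.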



# Data preparation

Below we compute the vocabulary for both English and Marathi. We also compute the vocabulary sizes and the maximum sequence length for both languages. Finally, we create four Python dictionaries (two for each language) to convert a given token into an integer index and vice versa.

In [106]:
# Vocabulary of English
all_eng_words =set()
for eng in lines.eng:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)


# Vocabulary of Marathi
all_marathi_words=set()
for mar in lines.mar:
    for word in mar.split():
        if word not in all_marathi_words:
            all_marathi_words.add(word)
            
            
# Max Length of source sequence
length_list=[]
for l in lines.eng:
    length_list.append(len(l.split(' ')))
max_length_src = np.max(length_list)


# Max Length of target sequence
length_list = []
for l in lines.mar:
    length_list.append(len(l.split(' ')))
max_length_tar = np.max(length_list)

input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_marathi_words))

# Calculate Vocab size for both source and target

num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_marathi_words)
num_decoder_tokens += 1 #for zero padding

#Create word to token dictionary for both source and target
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])

# Create token to word dictionary for both source and target

reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())
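
To see how the dictionaries are meant to be used, here is a small self-contained sketch on three toy sentences. The helper names `encode` and `decode` are introduced only for this illustration. As in the cell above, indices start at 1 so that 0 stays free for padding.

```python
sentences = ["go", "run", "tom tried to run"]

# Same construction as above, on a toy corpus
vocab = sorted({word for s in sentences for word in s.split()})
token_index = {word: i + 1 for i, word in enumerate(vocab)}
reverse_index = {i: word for word, i in token_index.items()}
max_len = max(len(s.split()) for s in sentences)

def encode(sentence):
    """Map words to integer indices and right-pad with zeros to max_len."""
    ids = [token_index[word] for word in sentence.split()]
    return ids + [0] * (max_len - len(ids))

def decode(ids):
    """Map indices back to words, skipping the padding index 0."""
    return " ".join(reverse_index[i] for i in ids if i != 0)

print(token_index)
print(encode("go"))
print(decode(encode("tom tried to run")))
```

Because the largest index equals the vocabulary size, any embedding or one-hot dimension built from these indices needs one more slot than the number of words.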

# Code for loading Batches of Data

Then we make a 90–10 train and test split and write a Python generator function to load the data in batches as follows:

In [107]:
def generate_batch(X_train, y_train, batch_size = 128):
    
    """ Generate batches of data indefinitely (one pass over the data per outer loop)"""
    while True:
        for j in range(0, len(X_train), batch_size):
            # encoder and decoder inputs hold integer token indices, padded with zeros
            encoder_input_data = np.zeros((batch_size, max_length_src), dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar), dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X_train[j:j+batch_size], y_train[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] #encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t <len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input seq
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]]= 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)  
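
The 90–10 split itself is not shown in this notebook. One possible way to do it, together with a tiny check of the one-step offset between decoder input and decoder target, is sketched below. The function names `split_90_10` and `decoder_pairs` and the toy data are assumptions made for this example. It also assumes that every target sentence is wrapped in START_ and END_ tokens, which the cleaning cells above do not add.

```python
import numpy as np

def split_90_10(X, y, seed=42):
    """Shuffle the sentence pairs and keep 90% for training, 10% for testing."""
    X, y = np.asarray(X), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(X))
    cut = len(X) * 9 // 10
    train, test = order[:cut], order[cut:]
    return X[train], X[test], y[train], y[test]

def decoder_pairs(target_text):
    """Decoder input drops the last word; decoder target drops the first word."""
    words = target_text.split()
    return words[:-1], words[1:]

# Toy data standing in for lines.eng and lines.mar
X = [f"eng sentence {i}" for i in range(20)]
y = [f"START_ mar sentence {i} END_" for i in range(20)]

X_train, X_test, y_train, y_test = split_90_10(X, y)
print(len(X_train), len(X_test))

dec_in, dec_target = decoder_pairs("START_ पळ END_")
print(dec_in, dec_target)
```

At every position the target word is the word that follows the corresponding input word, which is exactly the shift the `t - 1` index in the generator implements.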

# Code to define the Model to be trained

Then we define the model required for training as follows:

In [116]:
#from keras.layers import Embedding
import tensorflow as tf
print(tf.version.VERSION)

2.0.0-beta1


In [119]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout,LSTM, Embedding

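In [125]:
# Encoder
from tensorflow.keras.layers import Input

latent_dim = 50  # assumed size of the embedding and LSTM state; the notebook never defines it

encoder_inputs = Input(shape=(None,))
# +1 because token indices start at 1 and index 0 is reserved for padding
enc_emb =  Embedding(num_encoder_tokens + 1, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
# with return_state=True the layer returns the output plus the final h and c states
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)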

In [40]:
word = 'hello guys when is   ou r   class'

In [41]:
re.sub(" + ", " ", word) # removes extra spaces

'hello guys when is ou r class'

In [30]:
word.strip() # only removes leading and trailing whitespace

'hello guys when is   ou r   class'

In [78]:
lines

Unnamed: 0,eng,mar,code
0,go,जा,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1,run,पळ,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
2,run,धाव,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
3,run,पळा,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
4,run,धावा,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
...,...,...,...
40746,just saying you dont like fish because of the ...,हड्डींमुळे मासे आवडत नाही असं म्हणणं हे काय मा...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
40747,the japanese parliament today officially elect...,आज जपानी संसदेने अधिकृतरित्या र्‍यौतारौ हाशिमो...,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
40748,tom tried to sell his old vcr instead of throw...,टॉमने त्याचा जुना व्हीसीआर फेकून टाकण्याऐवजी व...,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
40749,you cant view flash content on an ipad however...,आयपॅडवर फ्लॅश आशय बघता येत नाही पण तुम्ही त्या...,CC-BY 2.0 (France) Attribution: tatoeba.org #9...
