# **English to Hindi Machine Translation using TensorFlow and Keras**

~ By Nakshatra Singh

For this project, I'll be using the Hindi-English parallel corpus provided by IIT Bombay.

Click here for the [official dataset](https://www.cfilt.iitb.ac.in/~parallelcorp/iitb_en_hi_parallel/)

## 1. Setting up the imports

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
##@ Necessary imports
import numpy as np
import pandas as pd
import tensorflow
import keras
from keras.models import Model
from keras.layers import Input, LSTM, Dense,TimeDistributed,Embedding,Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from string import digits
import nltk
import re
import os
import string

## 2. Preprocessing

In [None]:
##@ loading dataset
lines = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Internships/Shape.ai/nmt/Dataset/hindi_english_parallel.csv')
lines.head() 

Unnamed: 0,hindi,english
0,अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें,Give your application an accessibility workout
1,एक्सेर्साइसर पहुंचनीयता अन्वेषक,Accerciser Accessibility Explorer
2,निचले पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the bottom panel
3,ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the top panel
4,उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि...,A list of plugins that are disabled by default


In [None]:
lines.shape

(1561841, 2)

In [None]:
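# Keep only the first 100,000 rows to save RAM and compute.
# .copy() avoids pandas' SettingWithCopyWarning on the column assignments below.
lines = lines[:100000].copy()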

Convert every row to a string and lowercase it.

In [None]:
lines['english']=lines['english'].apply(lambda x: str(x))
lines['hindi']=lines['hindi'].apply(lambda x: str(x))
lines['english']=lines['english'].apply(lambda x: x.lower())
lines['hindi']=lines['hindi'].apply(lambda x: x.lower())

In [None]:
lines['hindi'][0]

'अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें'

In [None]:
# Remove quotes
lines['english']=lines['english'].apply(lambda x: re.sub("'", '', x))
lines['hindi']=lines['hindi'].apply(lambda x: re.sub("'", '', x)) 

In [None]:
lines.head()

Unnamed: 0,hindi,english
0,अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें,give your application an accessibility workout
1,एक्सेर्साइसर पहुंचनीयता अन्वेषक,accerciser accessibility explorer
2,निचले पटल के लिए डिफोल्ट प्लग-इन खाका,the default plugin layout for the bottom panel
3,ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका,the default plugin layout for the top panel
4,उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि...,a list of plugins that are disabled by default


Remove punctuation.

In [None]:
exclude = set(string.punctuation) 
# Remove all the special characters
lines['english']=lines['english'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines['hindi']=lines['hindi'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
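Note that `string.punctuation` contains only ASCII characters, so Hindi sentence terminators such as the danda (`।`) pass through the filter above untouched. The sketch below shows one way to extend the exclusion set. It assumes you also want the danda and double danda removed; `DEVANAGARI_PUNCT`, `exclude_all`, and `strip_punctuation` are hypothetical names that are not part of this notebook.

```python
import string

# Assumption: treat the Devanagari danda and double danda as punctuation too.
DEVANAGARI_PUNCT = "\u0964\u0965"  # the danda and double danda characters
exclude_all = set(string.punctuation) | set(DEVANAGARI_PUNCT)

def strip_punctuation(text, exclude=exclude_all):
    """Drop every character that appears in `exclude`."""
    return "".join(ch for ch in text if ch not in exclude)

sample = "यह एक वाक्य है\u0964"
cleaned = strip_punctuation(sample)
print(cleaned)
```

The same function could be passed to `.apply()` on the `hindi` column in place of the lambda.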

In [None]:
lines.head()

Unnamed: 0,hindi,english
0,अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें,give your application an accessibility workout
1,एक्सेर्साइसर पहुंचनीयता अन्वेषक,accerciser accessibility explorer
2,निचले पटल के लिए डिफोल्ट प्लगइन खाका,the default plugin layout for the bottom panel
3,ऊपरी पटल के लिए डिफोल्ट प्लगइन खाका,the default plugin layout for the top panel
4,उन प्लगइनों की सूची जिन्हें डिफोल्ट रूप से निष...,a list of plugins that are disabled by default


In [None]:
# Build a translation table with maketrans() that deletes every ASCII digit
remove_digits = str.maketrans('', '', digits)

In [None]:
a = lines['english'][0].translate(remove_digits)

In [None]:
a.strip()

'give your application an accessibility workout'

Remove all digits from the text, then strip and collapse extra spaces.

In [None]:
remove_digits = str.maketrans('', '', digits)
lines['english']=lines['english'].apply(lambda x: x.translate(remove_digits))
lines['hindi']=lines['hindi'].apply(lambda x: x.translate(remove_digits))

# string.digits covers ASCII digits only, so Devanagari digits are removed separately
lines['hindi'] = lines['hindi'].apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

lines['english']=lines['english'].apply(lambda x: x.strip())
lines['hindi']=lines['hindi'].apply(lambda x: x.strip())
lines['english']=lines['english'].apply(lambda x: re.sub(" +", " ", x))
lines['hindi']=lines['hindi'].apply(lambda x: re.sub(" +", " ", x))
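The digit removal and whitespace cleanup above can also be folded into one helper. This is a minimal sketch, assuming both ASCII and Devanagari digits should be deleted from either language; `DEVANAGARI_DIGITS`, `remove_all_digits`, and `strip_digits_and_spaces` are illustrative names, not part of the notebook.

```python
import re
from string import digits

# Assumption: delete ASCII and Devanagari digits with a single translation table.
DEVANAGARI_DIGITS = "०१२३४५६७८९"
remove_all_digits = str.maketrans("", "", digits + DEVANAGARI_DIGITS)

def strip_digits_and_spaces(text):
    """Delete digits, trim the ends, and collapse runs of spaces."""
    text = text.translate(remove_all_digits)
    return re.sub(" +", " ", text.strip())

print(strip_digits_and_spaces("  save 2 logs  "))
print(strip_digits_and_spaces("पृष्ठ १२ और 34"))
```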

In [None]:
'hello! how are you buddy?'.strip()

'hello! how are you buddy?'

In [None]:
lines['english'][0]

'give your application an accessibility workout'

Add start and end tokens.

In [None]:
# Add start and end tokens to target sequences
lines['hindi'] = lines['hindi'].apply(lambda x : 'START_ '+ x + ' _END')

In [None]:
lines['hindi'][0]

'START_ अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें _END'

In [None]:
### Get English and Hindi vocabulary
all_eng_words = set()
for eng in lines['english']:
    all_eng_words.update(eng.split())

all_hindi_words = set()
for hin in lines['hindi']:
    all_hindi_words.update(hin.split())

In [None]:
lines.head()

Unnamed: 0,hindi,english
0,START_ अपने अनुप्रयोग को पहुंचनीयता व्यायाम का...,give your application an accessibility workout
1,START_ एक्सेर्साइसर पहुंचनीयता अन्वेषक _END,accerciser accessibility explorer
2,START_ निचले पटल के लिए डिफोल्ट प्लगइन खाका _END,the default plugin layout for the bottom panel
3,START_ ऊपरी पटल के लिए डिफोल्ट प्लगइन खाका _END,the default plugin layout for the top panel
4,START_ उन प्लगइनों की सूची जिन्हें डिफोल्ट रूप...,a list of plugins that are disabled by default


In [None]:
lines['length_eng']=lines['english'].apply(lambda x:len(x.split(" ")))
lines['length_hin']=lines['hindi'].apply(lambda x:len(x.split(" ")))

In [None]:
lines.head()

Unnamed: 0,hindi,english,length_eng,length_hin
0,START_ अपने अनुप्रयोग को पहुंचनीयता व्यायाम का...,give your application an accessibility workout,6,10
1,START_ एक्सेर्साइसर पहुंचनीयता अन्वेषक _END,accerciser accessibility explorer,3,5
2,START_ निचले पटल के लिए डिफोल्ट प्लगइन खाका _END,the default plugin layout for the bottom panel,8,9
3,START_ ऊपरी पटल के लिए डिफोल्ट प्लगइन खाका _END,the default plugin layout for the top panel,8,9
4,START_ उन प्लगइनों की सूची जिन्हें डिफोल्ट रूप...,a list of plugins that are disabled by default,9,14


Drop sentences longer than 40 tokens to save compute; the limit can be raised.

In [None]:
lines=lines[lines['length_eng']<=40]
lines=lines[lines['length_hin']<=40]

In [None]:
print("Maximum length of Hindi Sentence: ",max(lines['length_hin']))
print("Maximum length of English Sentence: ",max(lines['length_eng']))

Maximum length of Hindi Sentence:  40
Maximum length of English Sentence:  40


In [None]:
# English is the source language and Hindi is the target
max_length_src=max(lines['length_eng'])
max_length_tar=max(lines['length_hin'])

In [None]:
input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_hindi_words))
num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_hindi_words)
num_encoder_tokens, num_decoder_tokens

(5743, 8457)

In [None]:
num_decoder_tokens

8457

In [None]:
# Add one slot because index 0 is reserved for zero padding
num_decoder_tokens += 1

In [None]:
num_decoder_tokens

8458

In [None]:
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])

In [None]:
# input_token_index

In [None]:
reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())
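The lookup tables start at index 1 so that 0 stays free for padding. A toy illustration of the round trip, using a made-up three-word vocabulary (all `toy_*` names are hypothetical):

```python
# Toy vocabulary; indices start at 1 because 0 is reserved for padding.
toy_words = sorted({"save", "log", "close"})
toy_index = {word: i + 1 for i, word in enumerate(toy_words)}
toy_reverse = {i: word for word, i in toy_index.items()}

# Encode a sentence, pad it to length 4 with zeros, then decode it again.
encoded = [toy_index[w] for w in "save log".split()]
padded = encoded + [0] * (4 - len(encoded))
decoded = " ".join(toy_reverse[i] for i in padded if i != 0)

print(toy_index)
print(padded)
print(decoded)
```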

In [None]:
# reverse_input_char_index

In [None]:
lines.head(10)

Unnamed: 0,hindi,english,length_eng,length_hin
0,START_ अपने अनुप्रयोग को पहुंचनीयता व्यायाम का...,give your application an accessibility workout,6,10
1,START_ एक्सेर्साइसर पहुंचनीयता अन्वेषक _END,accerciser accessibility explorer,3,5
2,START_ निचले पटल के लिए डिफोल्ट प्लगइन खाका _END,the default plugin layout for the bottom panel,8,9
3,START_ ऊपरी पटल के लिए डिफोल्ट प्लगइन खाका _END,the default plugin layout for the top panel,8,9
4,START_ उन प्लगइनों की सूची जिन्हें डिफोल्ट रूप...,a list of plugins that are disabled by default,9,14
5,START_ अवधि को हाइलाइट रकें _END,highlight duration,2,6
6,START_ पहुंचनीय आसंधि नोड को चुनते समय हाइलाइट...,the duration of the highlight box when selecti...,10,12
7,START_ सीमांत बोर्डर के रंग को हाइलाइट करें _END,highlight border color,3,9
8,START_ हाइलाइट किए गए सीमांत का रंग और अपारदर्...,the color and opacity of the highlight border,8,10
9,START_ भराई के रंग को हाइलाइट करें _END,highlight fill color,3,8


In [None]:
from sklearn.model_selection import train_test_split
X, y = lines['english'], lines['hindi']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)
X_train.shape, X_test.shape 

((74648,), (24883,))

In [None]:
X_train.to_pickle('X_train.pkl')
X_test.to_pickle('X_test.pkl')

In [None]:
X_train

56790                                               server
29222                                                close
33300                                             save log
91644                                  america porto velho
45894                                   select editor font
                               ...                        
6272                             single step over function
55110    whether empathy should use the avatar of the c...
77175                            could not scan folder s s
860                                               position
15807                  manages messages from external apps
Name: english, Length: 74648, dtype: object

In [None]:
encoder_input_data = np.zeros((2, max_length_src),dtype='float32')
decoder_input_data = np.zeros((2, max_length_tar),dtype='float32')
decoder_target_data = np.zeros((2, max_length_tar, num_decoder_tokens),dtype='float32') 

## 3. Creating batches

In [None]:
def generate_batch(X = X_train, y = y_train, batch_size = 128):
    ''' Yield ([encoder_input, decoder_input], decoder_target) batches forever '''
    while True:
        for j in range(0, len(X), batch_size):
            # Arrays start as zeros, so unused positions act as padding
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input seq
                target_tokens = target_text.split()
                for t, word in enumerate(target_tokens):
                    if t<len(target_tokens)-1:
                        # decoder input seq: every token except the final _END
                        decoder_input_data[i, t] = target_token_index[word]
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data) 
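The one-timestep offset in the generator is easier to see on a single sentence. In this small sketch, the decoder input drops the final `_END` token and the decoder target drops the leading `START_` token, so each input token is paired with the token the model should predict next. The variable names are illustrative only.

```python
tokens = "START_ लॉग सहेजें _END".split()

decoder_input = tokens[:-1]   # everything except the final _END
decoder_target = tokens[1:]   # everything except the leading START_

# Each pair is (token fed to the decoder, token it should predict next).
pairs = list(zip(decoder_input, decoder_target))
for fed, expected in pairs:
    print(fed, "->", expected)
```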

## 4. Encoder - Decoder Architecture

In [None]:
latent_dim = 300
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(num_encoder_tokens+1, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [None]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens+1, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs) 

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
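Because the generator one-hot encodes every target token, the target array grows with the vocabulary size. The arithmetic below is a rough sketch of the per-batch memory cost, assuming `batch_size = 64`, a maximum length of 40, and the 8458-slot Hindi vocabulary from this notebook; the helper name `one_hot_target_bytes` is hypothetical. Keras also provides the `sparse_categorical_crossentropy` loss, which takes integer token ids as targets, and the second figure estimates what such targets would need. Switching losses is a possible variation, not what this notebook does.

```python
def one_hot_target_bytes(batch_size, max_len, vocab_size, bytes_per_value=4):
    """Size in bytes of a float32 one-hot target array for one batch."""
    return batch_size * max_len * vocab_size * bytes_per_value

dense_bytes = one_hot_target_bytes(64, 40, 8458)  # one-hot float32 targets
sparse_bytes = 64 * 40 * 4                        # one int32 id per position

print(f"one-hot targets: {dense_bytes / 1e6:.1f} MB per batch")
print(f"integer targets: {sparse_bytes / 1e3:.1f} KB per batch")
```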

In [None]:
model.summary()
train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 64
epochs = 10

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 300)    1723200     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 300)    2537700     input_2[0][0]                    
______________________________________________________________________________________________

In [None]:
a, b = next(generate_batch())

## 5. Training

Train for 10 epochs to limit overfitting.

In [None]:
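# steps must be whole numbers, so use integer division.
# Note: fit_generator is deprecated in TensorFlow 2.1+; model.fit accepts generators directly.
model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),
                    steps_per_epoch = train_samples//batch_size,
                    epochs=epochs,
                    validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                    validation_steps = val_samples//batch_size)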

Epoch 1/10
   1/1166 [..............................] - ETA: 1:54 - loss: 0.7937 - accuracy: 0.2382



Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f581232f510>
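The step counts follow directly from the split sizes reported earlier (74648 training and 24883 test sentences) and the batch size of 64. A short check of the arithmetic, as a sketch: rounding down skips the final partial batch in each epoch, while rounding up would include it.

```python
import math

train_samples, val_samples, batch_size = 74648, 24883, 64

# Rounding down leaves out the last partial batch each epoch.
floor_train = train_samples // batch_size
leftover_train = train_samples % batch_size

# Rounding up includes the partial batch.
ceil_train = math.ceil(train_samples / batch_size)
ceil_val = math.ceil(val_samples / batch_size)

print(floor_train, leftover_train, ceil_train, ceil_val)
```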

Save the model.

In [None]:
model.save('nmt-eng-to-hindi.h5')  

In [None]:
train_gen = generate_batch(X_train, y_train, batch_size = 1)
k=-1

## 6. Encoder-decoder inference models

In [None]:
# Encode the input sequence to get the "thought vectors"
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2= dec_emb_layer(decoder_inputs) # Get the embeddings of the decoder sequence

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2) # A dense softmax layer to generate prob dist. over the target vocabulary

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2) 

## 7. Decoding a sequence with the encoder and decoder models

In [None]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first token of the target sequence with the start token.
    target_seq[0, 0] = target_token_index['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        # Exit condition: either the decoded string exceeds
        # 50 characters or the end token is produced.
        if (sampled_char == '_END' or
           len(decoded_sentence) > 50):
            stop_condition = True

        # Update the target sequence (of length 1) 
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence 
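The loop above is greedy decoding: at each step it keeps the single most likely token and feeds it back in. The sketch below isolates that control flow with a stand-in for the Keras models so it runs without TensorFlow. `greedy_decode`, `toy_step`, and `canned` are hypothetical, and unlike `decode_sequence` this version limits the output by token count rather than by character count.

```python
def greedy_decode(step_fn, start_token, end_token, max_tokens=50):
    """Repeatedly call step_fn(token, state) -> (next_token, new_state)
    until the end token appears or max_tokens tokens have been produced."""
    tokens, token, state = [], start_token, None
    while len(tokens) < max_tokens:
        token, state = step_fn(token, state)
        if token == end_token:
            break
        tokens.append(token)
    return " ".join(tokens)

# Hypothetical stand-in for decoder_model.predict plus argmax:
# a fixed table mapping each token to the next one.
canned = {"START_": "बंद", "बंद": "करें", "करें": "_END"}

def toy_step(token, state):
    return canned[token], state

result = greedy_decode(toy_step, "START_", "_END")
print(result)
```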

## 8. Test performance on samples

In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

Input English sentence: server
Actual Hindi Translation:  सर्वर s 
Predicted Hindi Translation:  सर्वर 


In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

Input English sentence: close
Actual Hindi Translation:  बंद करें 
Predicted Hindi Translation:  बंद करें c 


In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

Input English sentence: save log
Actual Hindi Translation:  लॉग सहेजें s 
Predicted Hindi Translation:  लॉग सहेजें s 
