## English To Hindi Translator Model 

At a time when the capabilities of LLMs and LLM-based architectures are scaling rapidly, this project is an attempt to build an English-to-Hindi translator from scratch using an encoder-decoder architecture.

#### Dataset : 
The dataset has been developed since 2016 at the Centre for Indian Language Technology, IIT Bombay (IITB). Several derivative corpora are available; the version hosted on HuggingFace contains 1,662,110 rows (https://huggingface.co/datasets/cfilt/iitb-english-hindi). Due to computational constraints, I have restricted my dataset to only 2500 rows, consisting of shuffled, mid-length to long sentences.

#### Encoder-Decoder Model :
Encoder-decoder models are neural network architectures, typically built from recurrent layers such as RNNs and LSTMs, for sequence-to-sequence tasks like machine translation. The encoder reads the input sequence in one language and compresses it into a context vector. The decoder accepts that context vector as input and generates the desired output sequence in the other language.
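
To make the data flow concrete, here is a minimal NumPy sketch of the encoder-decoder idea. It assumes a plain tanh RNN cell with random, untrained weights rather than the LSTM layers used later in this notebook, and every name and size in it (`rnn_step`, `encode`, `decode`, `hidden`, the toy vocabulary sizes) is hypothetical. It only illustrates how the final encoder state becomes the decoder's starting point.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, src_vocab, tgt_vocab = 8, 5, 6   # hypothetical toy sizes

# Random, untrained weights: this only illustrates shapes and data flow.
E_src = rng.normal(size=(src_vocab, hidden))   # source embeddings
E_tgt = rng.normal(size=(tgt_vocab, hidden))   # target embeddings
W_enc = rng.normal(size=(2 * hidden, hidden)) * 0.1
W_dec = rng.normal(size=(2 * hidden, hidden)) * 0.1
W_out = rng.normal(size=(hidden, tgt_vocab)) * 0.1

def rnn_step(x, h, W):
    """One tanh RNN step: combine the input vector with the previous state."""
    return np.tanh(np.concatenate([x, h]) @ W)

def encode(src_ids):
    """Read the whole source sequence and return the final state (context vector)."""
    h = np.zeros(hidden)
    for i in src_ids:
        h = rnn_step(E_src[i], h, W_enc)
    return h

def decode(context, start_id, steps):
    """Greedily generate `steps` target ids, starting from the context vector."""
    h, token, out = context, start_id, []
    for _ in range(steps):
        h = rnn_step(E_tgt[token], h, W_dec)
        token = int(np.argmax(h @ W_out))   # most likely next token id
        out.append(token)
    return out

context = encode([1, 3, 2])
generated = decode(context, start_id=0, steps=4)
print(context.shape, generated)
```

Because the weights are random, the generated ids are meaningless; the point is that the decoder never sees the source tokens directly, only the fixed-size context vector.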

#### Possibilities :
While I have restricted this project to a plain encoder-decoder architecture, attention layers could also be added so that the decoder can focus on the most relevant parts of the input at each step, moving the design closer to a transformer-like architecture.
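
As a rough illustration of what such a layer computes, the sketch below implements dot-product attention for a single decoder step in NumPy. This is only one of several attention variants, it is not part of the model trained in this notebook, and the function name `dot_product_attention` and the toy shapes are assumptions made for the example.

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """Return (context, weights) for one decoder step.

    decoder_state:  shape (hidden,)
    encoder_states: shape (src_len, hidden), one vector per source token
    """
    scores = encoder_states @ decoder_state          # similarity per source token
    scores = scores - scores.max()                   # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over source positions
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 4))   # 5 source tokens, hidden size 4
dec = rng.normal(size=4)
context, weights = dot_product_attention(dec, enc)
print(weights.round(3), context.shape)
```

Instead of a single context vector for the whole sentence, the decoder would receive a fresh context at every step, weighted towards the source tokens most similar to its current state.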

#### Import Libraries

In [51]:
import numpy as np
import pandas as pd
import os
import string
from string import digits
import matplotlib.pyplot as plt
import re

import seaborn as sns
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model

#### Import dataset

In [52]:
lines=pd.read_csv("trans_data.csv",encoding='utf-8')

In [53]:
lines.head(20)

Unnamed: 0,English,Hindi
0,Help!,बचाओ!
1,Jump.,उछलो.
2,Jump.,कूदो.
3,Jump.,छलांग.
4,Hello!,नमस्ते।
5,Hello!,नमस्कार।
6,Cheers!,वाह-वाह!
7,Cheers!,चियर्स!
8,Got it?,समझे कि नहीं?
9,I'm OK.,मैं ठीक हूँ।


In [54]:
pd.isnull(lines).sum()

English     0
Hindi      18
dtype: int64

#### Preprocessing the textual data

In [55]:
lines.drop_duplicates(inplace=True)

In [56]:
lines.dropna(inplace = True)

In [57]:
# Lowercase all characters
lines['English']=lines['English'].apply(lambda x: x.lower())
lines['Hindi']=lines['Hindi'].apply(lambda x: x.lower())

In [58]:
# Remove quotes
lines['English']=lines['English'].apply(lambda x: re.sub("'", '', x))
lines['Hindi']=lines['Hindi'].apply(lambda x: re.sub("'", '', x))

In [59]:
exclude = set(string.punctuation) # Set of ASCII special characters
# Remove the special characters. string.punctuation is ASCII-only, so the
# Devanagari danda (।) and curly quotes (“ ”) are left in place by this step.
lines['English']=lines['English'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines['Hindi']=lines['Hindi'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

In [60]:
# Remove all numbers from text
remove_digits = str.maketrans('', '', digits)
lines['English']=lines['English'].apply(lambda x: x.translate(remove_digits))
lines['Hindi']=lines['Hindi'].apply(lambda x: x.translate(remove_digits))
lines['Hindi'] = lines['Hindi'].apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

# Remove extra spaces
lines['English']=lines['English'].apply(lambda x: x.strip())
lines['Hindi']=lines['Hindi'].apply(lambda x: x.strip())
lines['English']=lines['English'].apply(lambda x: re.sub(" +", " ", x))
lines['Hindi']=lines['Hindi'].apply(lambda x: re.sub(" +", " ", x))
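
The cleaning steps above can be sketched as a single function applied to one string, which makes them easy to try on individual sentences. The helper name `clean_text` and the constants are hypothetical; the steps and their order follow the cells above.

```python
import re
import string

ASCII_PUNCT = set(string.punctuation)
# ASCII digits plus the Devanagari digits ० to ९
DIGITS = str.maketrans('', '', string.digits + '०१२३४५६७८९')

def clean_text(text):
    """Apply the notebook's cleaning steps to a single string."""
    text = text.lower()
    # Dropping ASCII punctuation also removes the apostrophe
    text = ''.join(ch for ch in text if ch not in ASCII_PUNCT)
    text = text.translate(DIGITS)
    # Trim the ends, then collapse runs of spaces
    return re.sub(' +', ' ', text.strip())

print(clean_text("I'm  OK, 2 times!"))
print(clean_text("नमस्ते।"))
```

The first call returns `im ok times`. The second returns the input unchanged, because the danda is not an ASCII punctuation character.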


In [61]:
# Add start and end tokens to target sequences
lines['Hindi'] = lines['Hindi'].apply(lambda x : 'START_ '+ x + ' _END')

In [62]:
lines.head()

Unnamed: 0,English,Hindi
0,help,START_ बचाओ _END
1,jump,START_ उछलो _END
2,jump,START_ कूदो _END
3,jump,START_ छलांग _END
4,hello,START_ नमस्ते। _END


#### Vocabulary

In [63]:
all_eng_words=set()
for eng in lines['English']:
    all_eng_words.update(eng.split())  # a set ignores duplicates on its own

all_hindi_words=set()
for hin in lines['Hindi']:
    all_hindi_words.update(hin.split())

In [64]:
len(all_eng_words)

15526

In [65]:
len(all_hindi_words)

19048
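
One detail worth checking on a toy example: the `START_` and `_END` markers are ordinary whitespace-delimited tokens, so they become part of the Hindi vocabulary. The two sentences below are taken from the preprocessed rows shown earlier; the variable names are hypothetical.

```python
toy_hindi = ['START_ बचाओ _END', 'START_ मैं ठीक हूँ। _END']

vocab = set()
for sentence in toy_hindi:
    vocab.update(sentence.split())   # duplicates such as START_ are stored once

print(len(vocab), 'START_' in vocab, '_END' in vocab)
```

This is what later allows the inference code to look up the index of `START_` and to recognise `_END` as a stop signal.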

In [66]:
lines['length_eng_sentence']=lines['English'].apply(lambda x:len(x.split(" ")))
lines['length_hin_sentence']=lines['Hindi'].apply(lambda x:len(x.split(" ")))

In [67]:
lines.head()

Unnamed: 0,English,Hindi,length_eng_sentence,length_hin_sentence
0,help,START_ बचाओ _END,1,3
1,jump,START_ उछलो _END,1,3
2,jump,START_ कूदो _END,1,3
3,jump,START_ छलांग _END,1,3
4,hello,START_ नमस्ते। _END,1,3


#### Preparation for model training

In [68]:
lines[lines['length_eng_sentence']>30].shape

(695, 4)

In [69]:
lines=lines[lines['length_eng_sentence']<=20]
lines=lines[lines['length_hin_sentence']<=20]

In [70]:
lines.shape

(7588, 4)

In [71]:
print("maximum length of Hindi Sentence ",max(lines['length_hin_sentence']))
print("maximum length of English Sentence ",max(lines['length_eng_sentence']))

maximum length of Hindi Sentence  20
maximum length of English Sentence  20


In [72]:
# The source language is English and the target language is Hindi
max_length_src=max(lines['length_eng_sentence'])
max_length_tar=max(lines['length_hin_sentence'])

In [75]:
input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_hindi_words))
num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_hindi_words)
num_encoder_tokens, num_decoder_tokens

(15526, 19048)

In [76]:
num_decoder_tokens += 1 #for zero padding
num_encoder_tokens += 1

In [77]:
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])

In [78]:
reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())
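
A small round trip shows how these lookup tables are meant to be used, with index 0 kept free for padding. The three-word vocabulary and the helper names `encode` and `decode` are hypothetical and only serve the example.

```python
words = sorted({'help', 'jump', 'hello'})               # toy vocabulary
token_index = {w: i + 1 for i, w in enumerate(words)}   # 0 is reserved for padding
reverse_index = {i: w for w, i in token_index.items()}

def encode(sentence, max_len):
    """Map words to ids and right-pad with zeros up to max_len."""
    ids = [token_index[w] for w in sentence.split()]
    return ids + [0] * (max_len - len(ids))

def decode(ids):
    """Map ids back to words, skipping the padding id 0."""
    return ' '.join(reverse_index[i] for i in ids if i != 0)

ids = encode('jump hello', max_len=4)
print(ids, decode(ids))
```

Starting the indices at 1 is what makes `mask_zero = True` in the embedding layers safe: no real word ever shares the padding id.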

In [79]:
lines = shuffle(lines)
lines.head(10)

Unnamed: 0,English,Hindi,length_eng_sentence,length_hin_sentence
2464,the girl tried hard to hold back her tears,START_ उस लड़की ने अपने आँसुओं को रोकने की बहु...,9,13
2487,dont forget to come here at seven tomorrow,START_ कल यहाँ सात बजे पहुँचना न भूलना। _END,8,9
9510,it should be short enough to arouse interest,START_ उतना ही छोटा जिससे कौतूहल जगे _END,8,8
6534,“ bengal is being run by two chief ministers ”...,START_ वे मुस्कराते हे कहती हैं बंगाल में इस स...,13,14
7020,mineral deficiency,START_ खनिजों की कमी _END,2,5
7137,it was an elite concept,START_ यह एक विशिष्ट अवधारणा थी _END,5,7
4426,im going to tell you about one more,START_ मैं आपको अपने परिवार के एक और _END,8,9
1032,hes sleeping like a baby,START_ वह बच्चे की तरह सो रहा है। _END,5,9
5078,and the tough ones show up for a reason,START_ और बिगडैल वाले आते ही खास वजह से हैं। _END,9,11
6693,the word “”quran“” has come about times in the...,START_ स्वयं कुरान में इस शब्द का कोई बार उल्ल...,10,13


### Train-Test split

In [81]:
X, y = lines['English'], lines['Hindi']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state=42)
X_train.shape, X_test.shape

((6070,), (1518,))

In [83]:
def generate_batch(X = X_train, y = y_train, batch_size = 128):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t<len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input seq
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)
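
The one-step offset between decoder input and decoder target is easier to see on a single sentence. The sketch below repeats the inner loop of `generate_batch` for one target sentence with a hypothetical three-token vocabulary.

```python
import numpy as np

target_index = {'START_': 1, 'बचाओ': 2, '_END': 3}   # toy vocabulary, 0 = padding
num_tokens, max_len = len(target_index) + 1, 5

tokens = 'START_ बचाओ _END'.split()
decoder_input = np.zeros(max_len, dtype='int64')
decoder_target = np.zeros((max_len, num_tokens), dtype='float32')

for t, word in enumerate(tokens):
    if t < len(tokens) - 1:
        decoder_input[t] = target_index[word]            # everything except _END
    if t > 0:
        decoder_target[t - 1, target_index[word]] = 1.0  # everything except START_

print(decoder_input)
print(decoder_target.argmax(axis=1))
```

The decoder input ids are `[1, 2, 0, 0, 0]` and the target ids are `[2, 3, 0, 0, 0]`: at each position the model is shown the current token and asked to predict the next one. Padded positions in the target are all-zero rows, which `argmax` reports as 0.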

### Encoder-Decoder Architecture

In [84]:
latent_dim=300

In [85]:
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(num_encoder_tokens, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [86]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [87]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

In [88]:
model.summary()
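
The summary output is not reproduced here, but the parameter counts can be estimated by hand. The sketch below assumes the standard formulas for embedding, LSTM (four gates, with bias) and dense layers, and uses the vocabulary sizes and `latent_dim` from the cells above; the helper function names are hypothetical.

```python
latent_dim = 300
num_encoder_tokens = 15526 + 1   # vocabulary size plus the padding index
num_decoder_tokens = 19048 + 1

def embedding_params(vocab, dim):
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

counts = {
    'encoder embedding': embedding_params(num_encoder_tokens, latent_dim),
    'encoder LSTM': lstm_params(latent_dim, latent_dim),
    'decoder embedding': embedding_params(num_decoder_tokens, latent_dim),
    'decoder LSTM': lstm_params(latent_dim, latent_dim),
    'output dense': dense_params(latent_dim, num_decoder_tokens),
}
for name, n in counts.items():
    print(f'{name:18s} {n:>12,}')
print(f'{"total":18s} {sum(counts.values()):>12,}')
```

Under these assumptions the total comes to roughly 17.5 million parameters, most of them in the two embedding tables and the output layer, all of which scale with vocabulary size.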

In [89]:
train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 128
epochs = 100
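
As a quick sanity check on these settings, the arithmetic below works out how many full batches fit into the split sizes reported above and how large one batch of one-hot decoder targets is. The variable names `leftover_train` and `target_bytes` are hypothetical; the numbers are taken from the earlier cells.

```python
train_samples, val_samples = 6070, 1518   # from the train-test split above
batch_size = 128
max_length_tar, num_decoder_tokens = 20, 19049

steps_per_epoch = train_samples // batch_size
validation_steps = val_samples // batch_size
# Samples that fall into the final, partially filled batch of one pass
leftover_train = train_samples - steps_per_epoch * batch_size

# float32 one-hot target tensor for a single batch (4 bytes per value)
target_bytes = batch_size * max_length_tar * num_decoder_tokens * 4

print(steps_per_epoch, validation_steps, leftover_train)
print(f'{target_bytes / 2**20:.0f} MiB per batch of one-hot targets')
```

A single batch of targets already takes about 186 MiB, which is the practical reason for building the data batch by batch in a generator instead of materialising the whole one-hot tensor at once.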

In [40]:
# `fit_generator` is deprecated; in TensorFlow 2.x and later, `fit` accepts generators directly.
model.fit(generate_batch(X_train, y_train, batch_size = batch_size),
          steps_per_epoch = train_samples//batch_size,
          epochs=epochs,
          validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
          validation_steps = val_samples//batch_size)

Epoch 1/100
...
Epoch 42/100
  2/154 [..............................] - ETA: 46s - loss: 1.1351

In [90]:
model.save_weights('translation_model.weights.h5')

In [91]:
# Encode the input sequence to get the "thought vectors"
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder setup
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2= dec_emb_layer(decoder_inputs) # Get the embeddings of the decoder sequence

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

In [92]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first position of the target sequence with the start token.
    target_seq[0, 0] = target_token_index['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

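        # Pick the most probable token (greedy decoding); tokens are words here
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        # Stop at the end token or once the decoded string exceeds 50 characters
        if (sampled_char == '_END' or
           len(decoded_sentence) > 50):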
            stop_condition = True

        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        states_value = [h, c]

    return decoded_sentence
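
The loop above is a greedy decoder: at each step it keeps the single most probable token and feeds it back in. The sketch below isolates that loop from Keras so it can be run on its own. It caps the output by number of tokens rather than by characters, which is a deliberate departure from the cell above. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is only a stand-in for the real decoder model.

```python
import numpy as np

def greedy_decode(step_fn, state, start_id, end_id, max_tokens=20):
    """Greedy decoding loop.

    step_fn(token_id, state) must return (probabilities, new_state).
    Stops at end_id or after max_tokens generated tokens.
    """
    token, output = start_id, []
    for _ in range(max_tokens):
        probs, state = step_fn(token, state)
        token = int(np.argmax(probs))   # keep only the most likely token
        if token == end_id:
            break
        output.append(token)
    return output

# Hypothetical stand-in for the decoder: always predicts the next id, up to 4.
def toy_step(token_id, state):
    probs = np.zeros(5)
    probs[min(token_id + 1, 4)] = 1.0
    return probs, state

print(greedy_decode(toy_step, state=None, start_id=1, end_id=4))
```

With the toy step function the loop produces ids 2 and 3 and then stops when it reaches the end id 4. A token-based cap of 20 would match the sentence length limit applied during preprocessing.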

In [44]:
train_gen = generate_batch(X_train, y_train, batch_size = 1)
k=-1


In [45]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

Input English sentence: in order to understand whether this is true
Actual Hindi Translation:  यह समझने के लिए कि क्या यह सच है 
Predicted Hindi Translation:  यह समझने के लिए कि क्या यह सच है 


In [47]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

Input English sentence: then theyll live years longer”
Actual Hindi Translation:  तो वे साल अधिक जियेंगे” 
Predicted Hindi Translation:  तो वे साल अधिक जियेंगे” 
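
The slices used when printing, `[6:-4]` and `[:-4]`, remove the markers by character count: `START_` is 6 characters and `_END` is 4. The `[:-4]` slice on the prediction therefore assumes the decoded sentence really ends in `_END`, which would not hold if decoding stopped because of the length cap. A token-based alternative, with the hypothetical helper name `strip_markers`, could look like this:

```python
def strip_markers(sentence, start='START_', end='_END'):
    """Remove the start/end markers by token rather than by character offset."""
    tokens = [t for t in sentence.split() if t not in (start, end)]
    return ' '.join(tokens)

reference = 'START_ बचाओ _END'
print(repr(reference[6:-4]))            # slicing, as in the cells above
print(repr(strip_markers(reference)))   # token-based, no surrounding spaces
```

The slice keeps the spaces that surrounded the markers, while the token-based version returns the bare sentence and leaves a prediction without `_END` untouched.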
