## Problem Statement - Translate English Statements to Hindi

In [1]:
import numpy as np
import pandas as pd
import string
import re # to work with regular expressions
from sklearn.model_selection import train_test_split

#### 1. Load the data and inspect the data points

In [2]:
data = pd.read_csv('hin.txt',sep='\t',header=None)
data.columns=["English_Text","Hindi_Text","Extra_Info"]
data.head(15)

Unnamed: 0,English_Text,Hindi_Text,Extra_Info
0,Wow!,वाह!,CC-BY 2.0 (France) Attribution: tatoeba.org #5...
1,Help!,बचाओ!,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
2,Jump.,उछलो.,CC-BY 2.0 (France) Attribution: tatoeba.org #6...
3,Jump.,कूदो.,CC-BY 2.0 (France) Attribution: tatoeba.org #6...
4,Jump.,छलांग.,CC-BY 2.0 (France) Attribution: tatoeba.org #6...
5,Hello!,नमस्ते।,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
6,Hello!,नमस्कार।,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
7,Cheers!,वाह-वाह!,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
8,Cheers!,चियर्स!,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
9,Got it?,समझे कि नहीं?,CC-BY 2.0 (France) Attribution: tatoeba.org #4...


In [3]:
# remove the Extra info column 
data.drop("Extra_Info",axis=1,inplace=True)

In [4]:
data.head()

Unnamed: 0,English_Text,Hindi_Text
0,Wow!,वाह!
1,Help!,बचाओ!
2,Jump.,उछलो.
3,Jump.,कूदो.
4,Jump.,छलांग.


#### 2. Data Preprocessing

1. Lowercase all characters (English text only)
2. Remove special characters
3. Remove extra spaces
4. Remove quotes
5. Remove all numbers from text
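The steps above can be sketched as a single helper. This is only a minimal sketch, not the notebook's own code: `clean_text` is a hypothetical name, and it handles only ASCII punctuation and ASCII digits.

```python
import re
import string

# Translation table that deletes ASCII punctuation (quotes included) and ASCII digits.
_DELETE = str.maketrans('', '', string.punctuation + string.digits)

def clean_text(text):
    """Hypothetical helper: lowercase, drop punctuation/quotes/digits, squeeze spaces."""
    text = text.lower()
    text = text.translate(_DELETE)
    text = re.sub(r" +", " ", text)  # collapse runs of spaces into one
    return text.strip()              # drop leading and trailing whitespace

print(clean_text("  I don't  have 2 dogs!  "))
```

The cells below apply the same steps one at a time, which makes it easier to inspect the data after each step.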

In [5]:
data['English_Text'] = data['English_Text'].apply(lambda x:x.lower())

In [6]:
# string.punctuation is the set of ASCII punctuation characters; remove them from both columns
# note: it does not contain the Devanagari danda '।', so that character stays in the Hindi text
remove = set(string.punctuation)

data['English_Text'] = data['English_Text'].apply(lambda x:''.join([ele for ele in x if ele not in remove]))
data['Hindi_Text'] = data['Hindi_Text'].apply(lambda x:''.join([ele for ele in x if ele not in remove]))

In [7]:
# strip() removes any leading and trailing whitespace
data['English_Text']=data['English_Text'].apply(lambda x:x.strip())
data['Hindi_Text']=data['Hindi_Text'].apply(lambda x:x.strip())

# Replace every run of spaces with a single space using re.sub(pattern, replacement, text)
# the regex " +" matches one or more consecutive spaces
data['English_Text'] = data['English_Text'].apply(lambda x: re.sub(" +"," ",x))
data['Hindi_Text']=data['Hindi_Text'].apply(lambda x: re.sub(" +"," ",x))

In [8]:
# remove quotes (ASCII apostrophes were already stripped with the punctuation above, so this is a safety net)
data['English_Text'] = data['English_Text'].apply(lambda x: re.sub("'",'',x))
data['Hindi_Text']=data['Hindi_Text'].apply(lambda x: re.sub("'",'',x))

In [9]:
# Create a translation table with str.maketrans() and use it in the translate() method
# this removes all the ASCII digits from a sentence
digits= '0123456789'
remove=str.maketrans('', '', digits)

data['English_Text'] = data['English_Text'].apply(lambda x: x.translate(remove))
data['Hindi_Text'] = data['Hindi_Text'].apply(lambda x: x.translate(remove))

# remove Hindi (Devanagari) digits as well, if any
data['Hindi_Text'] = data['Hindi_Text'].apply(lambda x: re.sub('[२३०८१५७९४६]','',x))

#### 3. Add START_ and _END to the target (Hindi) sentences and create the vocabulary of unique English and Hindi words.

In [10]:
data['Hindi_Text'] = data['Hindi_Text'].apply(lambda x : 'START_ '+ x + ' _END')

In [11]:
# English and Hindi vocabulary - the words are taken from the dataset itself (before the train/test split)
english_words = set()
for ele in data['English_Text']:
    for word in ele.split():
        english_words.add(word)
        
        
hindi_words = set()
for ele in data['Hindi_Text']:
    for word in ele.split():
        hindi_words.add(word)

In [12]:
# vocabulary sizes 
print(len(english_words))
print(len(hindi_words))

2343
2969
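The same kind of vocabulary can also be built with a set comprehension. The following is a minimal sketch on a hypothetical two-sentence sample (`toy_hindi` is not part of the dataset code); it also shows that the START_ and _END markers become vocabulary entries like any other word.

```python
# Hypothetical sample, already wrapped in the START_ / _END markers.
toy_hindi = ["START_ वाह _END", "START_ बचाओ _END"]

# One pass over every word of every sentence; the set keeps each word once.
toy_vocab = {word for sentence in toy_hindi for word in sentence.split()}

print(sorted(toy_vocab))
print(len(toy_vocab))
```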


#### 4. Now comes the very interesting TOKENISATION step - every word in the input sentences is converted into a token

For a neural network to make predictions on text data, the text first has to be turned into data the network can work with.

Text such as "dog" is stored as a sequence of character encodings. Since a neural network is a series of multiplication and addition operations, its input needs to be numeric.

1. We can turn each character or each word into a number (an integer id). This is character-level and word-level tokenisation, respectively; an embedding layer then maps each id to a learned vector.
2. Character-level models generate text predictions one character at a time.
3. Word-level models generate text predictions one word at a time.
4. Word-level models are often easier to train on short sentences, since the sequences they have to model are much shorter (at the cost of a larger vocabulary), so we'll use those.
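The difference between the two levels can be seen on a single sentence. This is a minimal sketch with a hypothetical helper, `build_index`, and a made-up sentence; ids start at 1 so that 0 stays free for padding.

```python
def build_index(tokens):
    """Map each distinct token to an integer id, starting at 1 (0 is left free for padding)."""
    return {tok: i + 1 for i, tok in enumerate(sorted(set(tokens)))}

sentence = "the dog saw the cat"

# Word level: one id per word.
words = sentence.split()
word_index = build_index(words)
word_ids = [word_index[w] for w in words]

# Character level: one id per character (spaces included).
chars = list(sentence)
char_index = build_index(chars)
char_ids = [char_index[c] for c in chars]

print(len(word_ids), word_ids)
print(len(char_ids), char_ids)
```

The word-level sequence has 5 ids while the character-level one has 19, which is the trade-off between sequence length and vocabulary size.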

In [13]:
# tokenisation with Keras - ids are assigned by word frequency (the most frequent word gets id 1)
# Turn each sentence into a sequence of integer word ids using Keras's Tokenizer class.
# The function below is used to tokenize the English and Hindi sentences.

from keras.preprocessing.text import Tokenizer

def tokenize(x):
    """
    x: List of sentences/strings to be tokenized
    returns : Tuple of (tokenized x data, tokenizer used to tokenize x)
    
    """
    tokenizer=Tokenizer()
    tokenizer.fit_on_texts(x)
    t=tokenizer.texts_to_sequences(x)
    return t, tokenizer

In [14]:
# testing the tokenizer on a few input sentences
text_sentences = [
    'Meghana is a good executor.',
    'She is a constant learner and also implements her learnings.',
    'This made her an excellent problem-solver.']
text_tokenized, text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
print()
for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print('Sequence {} in x'.format(sample_i + 1))
    print('  Input:  {}'.format(sent))
    print('  Output: {}'.format(token_sent))

{'is': 1, 'a': 2, 'her': 3, 'meghana': 4, 'good': 5, 'executor': 6, 'she': 7, 'constant': 8, 'learner': 9, 'and': 10, 'also': 11, 'implements': 12, 'learnings': 13, 'this': 14, 'made': 15, 'an': 16, 'excellent': 17, 'problem': 18, 'solver': 19}

Sequence 1 in x
  Input:  Meghana is a good executor.
  Output: [4, 1, 2, 5, 6]
Sequence 2 in x
  Input:  She is a constant learner and also implements her learnings.
  Output: [7, 1, 2, 8, 9, 10, 11, 12, 3, 13]
Sequence 3 in x
  Input:  This made her an excellent problem-solver.
  Output: [14, 15, 3, 16, 17, 18, 19]


In [15]:
# another way of tokenising - ids are assigned in sorted (alphabetical) order
input_words = sorted(list(english_words))
target_words = sorted(list(hindi_words))

num_encoder_tokens = len(input_words)
# +1 because index 0 is reserved for padding
num_decoder_tokens = len(target_words)+1

num_encoder_tokens,num_decoder_tokens

(2343, 2970)

In [16]:
# create 2 Python dictionaries that convert a given word into an integer index (indices start at 1; 0 is kept for padding)
input_token_index =  dict([(ele,i+1) for i,ele in enumerate(input_words)])
target_token_index = dict([(ele,i+1) for i,ele in enumerate(target_words)])

In [17]:
input_token_index

{'a': 1,
 'abandoned': 2,
 'ability': 3,
 'ablaze': 4,
 'able': 5,
 'about': 6,
 'above': 7,
 'abroad': 8,
 'absence': 9,
 'absent': 10,
 'absolute': 11,
 'absurd': 12,
 'abused': 13,
 'accepted': 14,
 'access': 15,
 'accident': 16,
 'accidental': 17,
 'accompanied': 18,
 'accompany': 19,
 'according': 20,
 'account': 21,
 'accountable': 22,
 'accused': 23,
 'accustomed': 24,
 'ache': 25,
 'acknowledgement': 26,
 'acquaintance': 27,
 'acquaintances': 28,
 'acquainted': 29,
 'across': 30,
 'act': 31,
 'actions': 32,
 'actor': 33,
 'actress': 34,
 'add': 35,
 'adding': 36,
 'address': 37,
 'admit': 38,
 'adopted': 39,
 'advantage': 40,
 'advice': 41,
 'advise': 42,
 'advised': 43,
 'affected': 44,
 'afford': 45,
 'afraid': 46,
 'africa': 47,
 'after': 48,
 'afternoon': 49,
 'again': 50,
 'against': 51,
 'age': 52,
 'ago': 53,
 'agree': 54,
 'agreement': 55,
 'aids': 56,
 'air': 57,
 'airport': 58,
 'alarm': 59,
 'alcohol': 60,
 'alike': 61,
 'alive': 62,
 'all': 63,
 'allergic': 64,
 'al

In [18]:
# create 2 reverse dictionaries that convert an integer index back into a word
reverse_input_token_index = dict([(i,ele) for ele,i in input_token_index.items()])
reverse_target_token_index = dict([(i,ele) for ele,i in target_token_index.items()])
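How the forward and reverse dictionaries work together can be sketched on a toy vocabulary. The names `toy_index`, `toy_reverse`, `encode` and `decode` are hypothetical and only mirror the structure of the dictionaries above.

```python
toy_words = sorted({"i", "am", "happy"})                       # hypothetical toy vocabulary
toy_index = {word: i + 1 for i, word in enumerate(toy_words)}  # word -> id, ids start at 1
toy_reverse = {i: word for word, i in toy_index.items()}       # id -> word

def encode(sentence, index):
    """Turn a sentence into a list of word ids."""
    return [index[word] for word in sentence.split()]

def decode(ids, reverse_index):
    """Turn word ids back into a sentence; id 0 is treated as padding and skipped."""
    return " ".join(reverse_index[i] for i in ids if i != 0)

ids = encode("i am happy", toy_index)
print(ids)
print(decode(ids + [0, 0], toy_reverse))
```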

#### 5. Split the data into train and test sets and define a generator function that yields the train and test data in batches

In [19]:
X,Y = data['English_Text'],data['Hindi_Text']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.1)
X_train.shape,X_test.shape

((2496,), (278,))

In [20]:
# Find the maximum lengths of the input and target sequences

from keras.preprocessing.sequence import pad_sequences

def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y)
    """
    preprocess_x, x_tk = tokenize(x)
    preprocess_y, y_tk = tokenize(y)

    preprocess_x = pad_sequences(preprocess_x)
    preprocess_y = pad_sequences(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y

In [21]:
preproc_english_sentences, preproc_hindi_sentences = preprocess(X, Y)
    
max_english_sequence_length = preproc_english_sentences.shape[1]
max_hindi_sequence_length = preproc_hindi_sentences.shape[1]

print('Data Preprocessed')
print("Max English sentence length:", max_english_sequence_length)
print("Max Hindi sentence length:", max_hindi_sequence_length)


Data Preprocessed
Max English sentence length: 22
Max Hindi sentence length: 27
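The maximum lengths come from padding every sequence to the length of the longest one. The following is a minimal sketch of that idea in plain Python; `pad` is a hypothetical helper, and the sketch assumes Keras's default of adding the zeros at the front of each sequence (`padding='pre'`).

```python
def pad(sequences, padding="pre", value=0):
    """Pad lists of ids with `value` so that all of them have the same length."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [value] * (max_len - len(seq))
        # 'pre' puts the zeros before the ids, anything else puts them after
        padded.append(fill + list(seq) if padding == "pre" else list(seq) + fill)
    return padded

toy_sequences = [[4, 1], [7, 1, 2]]
print(pad(toy_sequences))                  # zeros at the front
print(pad(toy_sequences, padding="post"))  # zeros at the end
```

Only the resulting length is used in this notebook, so the padding side does not matter here.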


In [22]:
def mini_batches(X,Y,batch_size=64):
    ''' generator that yields ([encoder_input, decoder_input], decoder_target) batches indefinitely; a short last batch is left zero-padded '''
    while True:
        for j in range(0,len(X),batch_size):
            encoder_input_data = np.zeros((batch_size, max_english_sequence_length),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_hindi_sequence_length),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_hindi_sequence_length, num_decoder_tokens), dtype='float32')
            for i, (input_text,target_text) in enumerate(zip(X[j:j+batch_size], Y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i,t] = input_token_index[word] # -----> encoder input sequence
                for t, word in enumerate(target_text.split()):
                    if t<len(target_text.split())-1:
                        decoder_input_data[i,t] = target_token_index[word] # ------> decoder input sequence
                    if t>0:
                        # decoder target ------> one-hot encoded
                        # START_ token not included
                        # offset by one time step
                        decoder_target_data[i,t-1,target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)
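The one-step offset between the decoder input and the decoder target can be sketched on a single sentence. `decoder_pairs` is a hypothetical helper that works on words rather than ids, just to make the shift visible.

```python
def decoder_pairs(target_text):
    """Split a marked target sentence into (decoder input, decoder target) word lists."""
    words = target_text.split()
    decoder_input = words[:-1]   # everything except the final _END
    decoder_target = words[1:]   # everything except the leading START_
    return decoder_input, decoder_target

dec_in, dec_out = decoder_pairs("START_ मुझे नहीं पता _END")
for given, expected in zip(dec_in, dec_out):
    print(given, "->", expected)
```

At every time step the decoder receives one word and is asked to predict the next one, which is what the `t<len(...)-1` and `t>0` conditions in the generator implement.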
                    

#### 6. Define the encoder-decoder architecture that has basic LSTM cells at each time step.

In [28]:
from keras.layers import Dropout,Embedding,InputLayer
import tensorflow as tf

latent_dim = 64

# Encoder
encoder_inputs = tf.keras.layers.Input((None,), name="input_enc")
# word ids run from 1 to num_encoder_tokens (0 is the padding id), so the embedding needs num_encoder_tokens + 1 rows
enc_emb = tf.keras.layers.Embedding(num_encoder_tokens + 1, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = tf.keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# we discard encoder_outputs and keep only the states
encoder_states = [state_h,state_c]

In [29]:
import keras
# set up the decoder using encoder_states as initial states
decoder_inputs = tf.keras.layers.Input((None,), name="input_dec")
# keep the embedding layer in a variable so that the inference decoder can reuse its trained weights
decoder_embedding = tf.keras.layers.Embedding(num_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = decoder_embedding(decoder_inputs)

# setting up decoder to return full output sequences and internal states
decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences=True ,return_state=True)
decoder_outputs,_,_ = decoder_lstm(dec_emb,initial_state=encoder_states)

# A dense softmax layer to generate a probability distribution over the target vocabulary at every time step
# (without it the model output would not match the one-hot decoder targets)
decoder_dense = tf.keras.layers.Dense(units=num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# now the MODEL ---- that will turn 'encoder input data' and 'decoder input data' to 'decoder target data'
model = keras.Model([encoder_inputs,decoder_inputs],decoder_outputs)

In [30]:
model.compile(optimizer='rmsprop',loss='categorical_crossentropy')

#### 7. After training, it is time for prediction on test data. For this, we have to set up the DECODER in TEST MODE.

In [34]:
# Encode the input sequence to get the 'thought vectors'
encoder_model = keras.Model(encoder_inputs,encoder_states)

# Decoder setup
# The tensors below will hold the states of the previous time step
decoder_state_input_h = tf.keras.layers.Input((latent_dim,))
decoder_state_input_c = tf.keras.layers.Input((latent_dim,))
decoder_state_inputs = [decoder_state_input_h,decoder_state_input_c]


# embeddings of the decoder sequence, reusing the embedding layer of the training model
dec_emb2 = decoder_embedding(decoder_inputs)


# to predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2,initial_state=decoder_state_inputs)
decoder_states2 = [state_h2, state_c2]

# reuse the dense softmax layer of the training model
decoder_outputs2 = decoder_dense(decoder_outputs2)

# Final decoder model
decoder_model = keras.Model(
       [decoder_inputs] + decoder_state_inputs,
       [decoder_outputs2] + decoder_states2)


#### 8. Finally, we will generate the output by defining the following function and later calling it.


In [43]:
def decode_sequence(input_seq):
    # encode the input sentence into the encoder state vectors
    states_vectors = encoder_model.predict(input_seq)
    # initialize a target sequence of zeros of length 1.
    target_seq = np.zeros((1,1))
    # populate the first position of the target sequence with the START_ token
    target_seq[0,0] = target_token_index['START_']
    
    
    # decoding loop, assuming a batch size of 1 for simplicity
    stop_condition=False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens,h,c = decoder_model.predict([target_seq]+states_vectors)
        
        
        # pick the most probable word (greedy decoding)
        sampled_token_index = np.argmax(output_tokens[0,-1,:])
        sampled_char = reverse_target_token_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char
        
        # exit condition : either the decoded string exceeds 50 characters
        # or the _END token is produced
        if (sampled_char == '_END' or
           len(decoded_sentence)>50):
            stop_condition= True
            
        # Update the target sequence (of length 1) with the word just predicted
        target_seq = np.zeros((1,1))
        target_seq[0,0] = sampled_token_index
        
        # Update States
        states_vectors = [h,c]
    return decoded_sentence
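The control flow of the function above is a greedy decoding loop. Here is a minimal, model-free sketch of the same loop: `greedy_decode`, `toy_step` and `TABLE` are hypothetical stand-ins for the decoder model, and this variant caps the number of words instead of the number of characters.

```python
def greedy_decode(step_fn, start_token, end_token, max_tokens=10):
    """Generic greedy loop; step_fn(token, state) returns (scores dict, new state)."""
    token, state, output = start_token, None, []
    while len(output) < max_tokens:
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)  # greedy: take the highest-scoring word
        if token == end_token:
            break
        output.append(token)
    return " ".join(output)

# Toy stand-in for the decoder: a fixed table of next-word scores.
TABLE = {
    "START_": {"नमस्ते": 0.9, "_END": 0.1},
    "नमस्ते": {"दुनिया": 0.6, "_END": 0.4},
    "दुनिया": {"_END": 0.8, "नमस्ते": 0.2},
}

def toy_step(token, state):
    return TABLE[token], state

print(greedy_decode(toy_step, "START_", "_END"))
```

Stopping on a word count keeps the last word intact, whereas a character cap can end the output in the middle of a word.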

## FINALLY !!! Results Time - can't wait to see how the machine performs on TRAINING DATA

In [44]:
training_batch = mini_batches(X_train,Y_train,batch_size=1)
k=-1

In [45]:
k+=1
(input_seq,actual_output),_ = next(training_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_train[k:k+1].values[0])
print("Actual Hindi Translation:",Y_train[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: what is the difference between this and that
Actual Hindi Translation:  इस और उस में क्या फ़र्क है 
Predicted Hindi Translation:  दुःखी फ़ुट टूटी फ़ोटोकॉपियर फ़ोटोकॉपियर हमेशा हि


In [46]:
k+=1
(input_seq,actual_output),_ = next(training_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_train[k:k+1].values[0])
print("Actual Hindi Translation:",Y_train[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: nobody could tell what he meant by that
Actual Hindi Translation:  कोई नहीं बता सका उसका वह कहने से मतलब क्या था। 
Predicted Hindi Translation:  लगता मोटे तबाह तबाह उड़ाई। तबाह दौरान बिछाई। सूखा क


In [47]:
k+=1
(input_seq,actual_output),_ = next(training_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_train[k:k+1].values[0])
print("Actual Hindi Translation:",Y_train[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: he looked back and smiled at me
Actual Hindi Translation:  उसने पीछे मुड़कर मुझपर मुस्कुराया। 
Predicted Hindi Translation:  फ़ुट बुलाया। फ़ुट बनी सुनाओ। जेब फ़ोटोकॉपियर बता फ़ोटोकॉ


In [48]:
k+=1
(input_seq,actual_output),_ = next(training_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_train[k:k+1].values[0])
print("Actual Hindi Translation:",Y_train[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: i dont know when she will leave for london
Actual Hindi Translation:  मुझे नहीं पता वह लंदन के लिए रवाना कब होएगी। 
Predicted Hindi Translation:  आऊँगा। लगता मोटे महाराष्ट्र चेतावनी जाए लेखक लेखक अक्ख


In [49]:
k+=1
(input_seq,actual_output),_ = next(training_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_train[k:k+1].values[0])
print("Actual Hindi Translation:",Y_train[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: she was brought up by her grandmother
Actual Hindi Translation:  उसकी दादी ने उसे पालपोस कर बड़ा किया था। 
Predicted Hindi Translation:  पालनी खो खो जाएँगे। जाएँगे। जाएँगे। जाएँगे। मिल


In [None]:
'''
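Well, the model is not performing well at all, even on the training data, so a detailed scrutiny is needed.
Likely reasons: the dataset is too small (about 2,800 sentence pairs), the model is a very simple one
(a single LSTM layer with latent_dim = 64), and the vocabulary is generated from this small dataset itself
(taking a larger one from the internet can be considered).
Also, no model.fit call appears in the cells shown above, so the training step should be checked first.
All these changes can help it perform better.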

what matters is we tried !!!!
'''

### Results on test data

In [50]:
test_batch = mini_batches(X_test,Y_test,batch_size=1)
k=-1

In [51]:
k+=1
(input_seq,actual_output),_ = next(test_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_test[k:k+1].values[0])
print("Actual Hindi Translation:",Y_test[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: this car was made in japan
Actual Hindi Translation:  यह गाड़ी जापान में बनी थी। 
Predicted Hindi Translation:  वाह पैक खींची। तौलिए निभता जानते तुरन्त तुरन्त 


In [52]:
k+=1
(input_seq,actual_output),_ = next(test_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_test[k:k+1].values[0])
print("Actual Hindi Translation:",Y_test[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: i think there has been some misunderstanding here
Actual Hindi Translation:  मुझे लगता है कि यहाँ कुछ ग़लतफ़ैमी हुई है। 
Predicted Hindi Translation:  आए रहे। रहे। अक्खड़पन इक्कीस रोना पार चढ़ा। बी


In [53]:
k+=1
(input_seq,actual_output),_ = next(test_batch)
decoded_sentence = decode_sequence(input_seq)
print('English Input Sentence:',X_test[k:k+1].values[0])
print("Actual Hindi Translation:",Y_test[k:k+1].values[0][6:-4])
print("Predicted Hindi Translation:",decoded_sentence[:-4])

English Input Sentence: is your father a teacher
Actual Hindi Translation:  क्या आपके पापा टीचर हैं 
Predicted Hindi Translation:  शब्दकोष ख़राब नानी निशान हमे मोटे आजकल अकल संदेश शादी
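To put a rough number on such comparisons, the share of reference words that also occur in the prediction can be computed. This is a minimal sketch with a hypothetical helper, `word_overlap`; it is a crude measure and not a standard translation metric. The example pair is the first test sentence shown above.

```python
def word_overlap(reference, prediction):
    """Fraction of reference words that also appear somewhere in the prediction."""
    ref_words = reference.split()
    pred_words = set(prediction.split())
    if not ref_words:
        return 0.0
    return sum(word in pred_words for word in ref_words) / len(ref_words)

actual = "यह गाड़ी जापान में बनी थी।"
predicted = "वाह पैक खींची। तौलिए निभता जानते तुरन्त तुरन्त"
print(word_overlap(actual, predicted))
```

For this pair the overlap is 0, which matches the impression that the prediction is unrelated to the reference.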


 ----------------------------------- END ---------------------------------------------