### Machine Translation: English to Hindi

This notebook presents a word-to-word sequence machine translation model built on an LSTM neural network. The dataset used in the notebook was taken from http://www.manythings.org/anki/hin-eng.zip.

## Import the required libraries

In [1]:
import re
import sys
import time
import nltk
import string
import pandas as pd
import numpy as np
from numpy.random import shuffle
import unicodedata
import warnings
from nltk import word_tokenize
import keras 
import keras.backend as K
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding,LSTM, Dense, SpatialDropout1D
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
warnings.filterwarnings("ignore")

Using TensorFlow backend.


### Data Reading, Cleaning, and Transformation for Both Languages

To make this data suitable for our model, we need to do some data cleaning and transformation, such as:

**Read the data**:<br> 
1. Read the data from file

**Data Cleaning**:<br> 
1. Remove unrecognized and special characters<br>
2. Drop words containing non-alphabetic characters

**Transformation**:<br>
1. Normalize the Unicode encoding so that equivalent characters share the same representation
2. Convert the text to lower case to reduce the capitalization overhead
3. Tokenize the texts
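
The cleaning and transformation steps listed above can be sketched on a single English sentence. This is a minimal illustration, not the notebook's own pipeline: the helper name `clean_english` is hypothetical, and a regular expression stands in for `nltk.word_tokenize` so the snippet runs without downloading tokenizer data. The last lines show why ASCII folding should only be applied to the English side.

```python
import re
import unicodedata


def clean_english(text):
    """Fold accents to ASCII, lower-case, and split into alphabetic tokens."""
    # NFD separates base letters from combining accents, e.g. "é" -> "e" + U+0301
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping every non-ASCII character removes the combining accents
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep runs of letters only, which also drops digits and punctuation
    return re.findall(r"[a-z]+", ascii_text.lower())


tokens = clean_english("Café déjà vu, Tom!")

# ASCII folding erases Devanagari entirely, so it must not be applied to Hindi
hindi_folded = "नमस्ते".encode("ascii", "ignore").decode("ascii")

print(tokens, repr(hindi_folded))
```

Applying the same folding to the Hindi target would leave an empty string, so the target text has to keep its original Unicode form.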

**Read and process the data**

We will read the data and then perform the required pre-processing, such as tokenization and normalization.

In [3]:
def read_text(filename, encoding='utf-8', normalize=False, tokenize=True):
    """
    Read tab-separated sentence pairs and pre-process the source side
    :param filename: path to the tab-separated data file
    :param encoding: file encoding
    :param normalize: fold the source text to ASCII if True
    :param tokenize: split the source text into word tokens if True
    :return: list of source texts, list of target texts
    """
    with open(filename, 'r', encoding=encoding) as file:
        data = file.readlines()
    src_text = []
    trg_text = []
    for line in data:
        # Split the line to separate source and target text
        text_array = line.split("\t")
        src = text_array[0]
        trg = text_array[1].strip()
        if normalize:
            # ASCII folding is applied to the English source only;
            # it would erase the Devanagari characters of the Hindi target
            src = unicodedata.normalize('NFD', src).encode('ascii', 'ignore').decode('UTF-8')
        if tokenize:
            src = nltk.word_tokenize(src)
        src_text.append(src)
        # The target is kept as a raw string; the Keras tokenizer splits it later
        trg_text.append(trg)

    return src_text, trg_text
src_txt, trg_text = read_text('hin.txt', encoding='utf-8', normalize=True, tokenize=True)
text = pd.DataFrame({"src":src_txt,"target":trg_text})

**Data Partitioning**

We need to split the data into a training set for fitting the model and a test set for evaluating the model on held-out samples.

In [4]:
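# Shuffle the rows; sample(frac=1) returns a shuffled copy, whereas shuffling
# data.values in place is not guaranteed to affect the DataFrame
data = text.sample(frac=1).reset_index(drop=True)
# Use 90% of the rows for training and the remaining 10% for testing
split_index = int(data.shape[0]*0.9)
train = data[0:split_index]
test = data[split_index:]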
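
As a dependency-free illustration of the same 90/10 partition, the sketch below shuffles a list of sentence pairs with a fixed seed and cuts it at the 90% mark. The function name `split_pairs`, the seed value, and the toy sentences are assumptions made for this example and are not part of the notebook.

```python
import random


def split_pairs(pairs, train_fraction=0.9, seed=13):
    """Shuffle a copy of the pairs and split it into train and test lists."""
    shuffled = list(pairs)
    # A seeded generator makes the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy (source, target) pairs standing in for the real corpus
pairs = [("sentence %d" % i, "वाक्य %d" % i) for i in range(10)]
train_pairs, test_pairs = split_pairs(pairs)
print(len(train_pairs), len(test_pairs))
```

Shuffling before the cut matters because the source file is roughly ordered by sentence length, so an unshuffled split would put only the longest sentences in the test set.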

**Helper Functions**

We will define some helper functions to perform processing such as mapping text to features and one-hot encoding the target.

In [5]:
def map_to_features(tokenizer, texts, max_length):
    """
    The function maps the texts to feature vectors based on the fitted tokenizer
    :param tokenizer: Tokenizer for the given language
    :param texts: texts to map into feature vectors
    :param max_length: longest text length
    """
    feature_vectors = tokenizer.texts_to_sequences(np.array(texts, dtype=object))
    # Pad with zeros at the end so that every vector has the same length
    feature_vectors = pad_sequences(feature_vectors, maxlen=max_length, padding='post')
    return feature_vectors

def tokenizer(text, max_num_words=None):
    """
    The function fits a keras tokenizer on the text
    :param text: text to tokenize
    :param max_num_words: maximum number of words to consider
    """
    if max_num_words is None:
        fitted_tokenizer = Tokenizer()
    else:
        fitted_tokenizer = Tokenizer(num_words=max_num_words)
    fitted_tokenizer.fit_on_texts(np.array(text, dtype=object))
    return fitted_tokenizer

def one_hot_encode(target_feature_vectors, vocab_size):
    """
    This method encodes the features into one-hot encoding
    :param target_feature_vectors: feature vectors to encode
    :param vocab_size: size of the vocabulary
    """
    one_hot_encoded_target = []
    for i in range(target_feature_vectors.shape[0]):
        one_hot_encoded_target.append(keras.utils.to_categorical(target_feature_vectors[i], num_classes=vocab_size))
    return np.array(one_hot_encoded_target)

def text_to_sequence(tokenizer, texts, length):
    """
    text_to_sequence maps the text to a sequence of numbers using keras tokenizer
    :param tokenizer: fitted keras tokenizer
    :param texts: a pandas Series of texts, or a single text
    :param length: length to pad every sequence to
    """
    # integer encode sequences
    if isinstance(texts, pd.Series):
        X = tokenizer.texts_to_sequences(np.array(texts))
    else:
        X = tokenizer.texts_to_sequences(np.array([texts]))
    # pad sequences with 0 values
    X = pad_sequences(X, maxlen=length, padding='post')
    return X
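
To make the effect of these helpers concrete, here is a small sketch of post-padding and one-hot encoding on a toy sequence. It is written in plain Python so it runs without Keras; `pad_post` and `one_hot` are hypothetical stand-ins for `pad_sequences(..., padding='post')` and `keras.utils.to_categorical`, and they only cover the case where no truncation is needed.

```python
def pad_post(sequence, max_length, pad_value=0):
    """Right-pad a list of word indices with pad_value up to max_length."""
    # Sequences that are already long enough are returned unchanged
    return sequence + [pad_value] * (max_length - len(sequence))


def one_hot(indices, vocab_size):
    """Turn each word index into a vocab_size-long vector with a single 1."""
    return [[1 if column == index else 0 for column in range(vocab_size)]
            for index in indices]


# A two-word sentence encoded as word indices, padded to four time steps
padded = pad_post([3, 1], max_length=4)
encoded = one_hot(padded, vocab_size=4)
print(padded)
print(encoded)
```

Index 0 is reserved for padding, which is why the vocabulary size used for the model is the number of known words plus one.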




**Feature Mapping**

In [8]:
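# Keep only the 500 most frequent words when converting texts to sequences
src_tokenizer = tokenizer(train['src'], max_num_words=500)
# 'src' holds token lists, so len counts words
src_max_text_length = max(train['src'].apply(len))
# +1 because index 0 is reserved for padding
src_vocab_size = len(src_tokenizer.word_index)+1

target_tokenizer = tokenizer(train['target'], max_num_words=500)
# Note: 'target' holds raw strings, so len counts characters rather than words;
# this over-estimates the word count and only adds extra padding
target_max_text_length = max(train['target'].apply(len))
target_vocab_size = len(target_tokenizer.word_index)+1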

train_X = map_to_features(src_tokenizer, train['src'], src_max_text_length)
train_Y = map_to_features(target_tokenizer, train['target'], target_max_text_length)
train_Y_labels = one_hot_encode(train_Y, len(target_tokenizer.word_index)+1)

test_X = map_to_features(src_tokenizer, test['src'], src_max_text_length)
test_Y = map_to_features(target_tokenizer, test['target'], target_max_text_length)
test_Y_labels = one_hot_encode(test_Y, len(target_tokenizer.word_index)+1)

#### Learning Model

In [50]:
embed_dim = 128
lstm_units = 64

model = Sequential()
model.add(Embedding(src_vocab_size, embed_dim, input_length=src_max_text_length, mask_zero=True))
model.add(LSTM(lstm_units))
model.add(RepeatVector(target_max_text_length))
model.add(LSTM(lstm_units, return_sequences=True))
model.add(Dense(target_vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 26, 128)           289920    
_________________________________________________________________
lstm_4 (LSTM)                (None, 64)                49408     
_________________________________________________________________
repeat_vector_2 (RepeatVecto (None, 121, 64)           0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 121, 64)           33024     
_________________________________________________________________
dense_2 (Dense)              (None, 121, 2852)         185380    
Total params: 557,732
Trainable params: 557,732
Non-trainable params: 0
_________________________________________________________________
None
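
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers. The source vocabulary size of 2265 is not printed anywhere in the notebook; it is inferred here by dividing the embedding parameter count by `embed_dim`, so treat it as an assumption.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-long vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, output_dim):
    # One weight per input/output pair plus one bias per output
    return (input_dim + 1) * output_dim


embed_dim, lstm_units = 128, 64
src_vocab_size = 289920 // embed_dim  # inferred from the summary (assumption)
target_vocab_size = 2852              # last dimension of the Dense output

counts = [
    embedding_params(src_vocab_size, embed_dim),  # embedding_3
    lstm_params(embed_dim, lstm_units),           # lstm_4 (encoder)
    lstm_params(lstm_units, lstm_units),          # lstm_5 (decoder)
    dense_params(lstm_units, target_vocab_size),  # dense_2
]
total = sum(counts)
print(counts, total)
```

The Dense layer alone accounts for roughly a third of the parameters because it projects every time step onto the full target vocabulary.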


**Train the model**

In [51]:
batch_size=16
model.fit(train_X, train_Y_labels, epochs=20, batch_size=batch_size, validation_data=(test_X, test_Y_labels), verbose=1)

Train on 2579 samples, validate on 287 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x18e04b78e10>

### Predict Sequence

In [52]:
def translate(text, source_tokenizer, source_text_length, target_tokenizer, target_text_length):
    """
    This method translates the text using the trained model
    :param text: Source text to translate
    :param source_tokenizer: Source language tokenizer
    :param source_text_length: Longest text length in source language
    :param target_tokenizer: Target language tokenizer
    :param target_text_length: Longest text length in target language (currently unused)
    :return: list of translated texts
    """
    features = text_to_sequence(source_tokenizer, text, source_text_length)
    pred = model.predict(features)
    # Pick the most probable word index at every time step
    max_probable_word_indices = np.argmax(pred, axis=2)
    # Build the reverse lookup once instead of scanning word_index for every position
    index_to_word = {index: word for word, index in target_tokenizer.word_index.items()}
    translated_texts = []
    for i in range(max_probable_word_indices.shape[0]):
        translated_text = ""
        for j in range(max_probable_word_indices.shape[1]):
            # Index 0 is padding and has no entry in the lookup, so it is skipped
            word = index_to_word.get(max_probable_word_indices[i, j])
            if word is not None:
                translated_text += " " + word
        translated_texts.append(translated_text)
    return translated_texts
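
The decoding step inside `translate` is a greedy argmax over the vocabulary at each time step. The toy sketch below shows that step in isolation, without the model: the probability rows, the three-word vocabulary, and the function name `greedy_decode` are all made up for illustration.

```python
def greedy_decode(probabilities, index_to_word):
    """Pick the most probable word at each time step and join the words."""
    words = []
    for step in probabilities:
        # Position of the largest probability in this time step
        best = max(range(len(step)), key=step.__getitem__)
        # Index 0 is padding and is absent from the lookup, so it is dropped
        if best in index_to_word:
            words.append(index_to_word[best])
    return " ".join(words)


# Hypothetical vocabulary; index 0 is reserved for padding
index_to_word = {1: "मैं", 2: "घर", 3: "गया"}
# One row per time step, one column per vocabulary index
probabilities = [
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.60, 0.20],
    [0.20, 0.10, 0.10, 0.60],
    [0.90, 0.05, 0.03, 0.02],
]
decoded = greedy_decode(probabilities, index_to_word)
print(decoded)
```

Greedy decoding chooses each word independently of its neighbours, which is one reason the model can emit the same frequent word several times in a row.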

**Model Evaluation**

In [71]:
np.random.seed(13)
count = 10
weights=(1.0, 0, 0, 0)
index = np.random.choice(train['src'].shape[0], size=count, replace=False)
text = train['src'].iloc[index].reset_index(drop=True)
predicted_text = translate(text,src_tokenizer, src_max_text_length, target_tokenizer, target_max_text_length)
target_text = list(train['target'].iloc[index])
bleu_s = 0
for i in range(len(index)):
    # Display only the last 2 translations
    if i>7:
        print("Predicted Text="+predicted_text[i]+"\t\t Actual Target Text="+target_text[i])
    # Note: sentence_bleu expects (list of reference token lists, hypothesis token list).
    # Raw strings are passed here, so n-grams are counted over characters, and the
    # prediction sits in the reference position; the score below reflects this call
    bleu_s+=nltk.translate.bleu_score.sentence_bleu([predicted_text[i]], target_text[i], weights=weights)
bleu_s = bleu_s/len(index)
print("\nAverage BLEU score on training data for %d random texts = %.3f" % (len(index), bleu_s))

Predicted Text= वह में के के से से		 Actual Target Text=इस समस्या की तहकीकात करने के लिए एक समिति स्थापित करी गई है।
Predicted Text= मैं में में में में में में से		 Actual Target Text=मेरी ट्रेन में एक पुराने दोस्त से मुलाक़ात हुई।

Average BLEU score on training data for 10 random texts = 0.271


In [72]:
np.random.seed(13)
count = 10
weights=(0.5, 0.5, 0, 0)
index = np.random.choice(train['src'].shape[0], size=count, replace=False)
text = train['src'].iloc[index].reset_index(drop=True)
predicted_text = translate(text,src_tokenizer, src_max_text_length, target_tokenizer, target_max_text_length)
target_text = list(train['target'].iloc[index])
bleu_s = 0
for i in range(len(index)):
    # Display only the last 2 translations
    if i>7:
        print("Predicted Text="+predicted_text[i]+"\t\t Actual Target Text="+target_text[i])
    # Same call as in the previous cell, now weighting 1-grams and 2-grams equally
    bleu_s+=nltk.translate.bleu_score.sentence_bleu([predicted_text[i]], target_text[i], weights=weights)
bleu_s = bleu_s/len(index)
print("\nAverage BLEU score on training data for %d random texts = %.3f" % (len(index), bleu_s))

Predicted Text= वह में के के से से		 Actual Target Text=इस समस्या की तहकीकात करने के लिए एक समिति स्थापित करी गई है।
Predicted Text= मैं में में में में में में से		 Actual Target Text=मेरी ट्रेन में एक पुराने दोस्त से मुलाक़ात हुई।

Average BLEU score on training data for 10 random texts = 0.161


**Inference**
The 1-gram BLEU score is 0.27 and the 2-gram BLEU score is 0.16. Neither score is satisfactory, and the translations are not good. The model might need more training or more and better data, since the lack of good data is also a problem.
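
For intuition about what the n-gram scores measure, the sketch below computes the clipped n-gram precision at the core of BLEU on word tokens, using the first sample translation printed in the evaluation output. It is a simplified illustration: it omits BLEU's brevity penalty and smoothing, the helper names are hypothetical, and its values are not the scores reported in this notebook, which come from `nltk`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "इस समस्या की तहकीकात करने के लिए एक समिति स्थापित करी गई है।".split()
hypothesis = "वह में के के से से".split()

p1 = clipped_precision(reference, hypothesis, 1)
p2 = clipped_precision(reference, hypothesis, 2)
print(round(p1, 3), round(p2, 3))
```

Only one of the six predicted words appears in the reference, and no predicted word pair does, which matches the impression that the model is mostly emitting frequent function words.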

In [73]:
model.save("word2word_hin.h5")