### Name: Mansi Mrugen Shah
### NetID: ws2865

### Build a word-level sequence-to-sequence model for English-to-Marathi translation (using GloVe embeddings)

In this model, I use GloVe word embeddings. GloVe stands for Global Vectors; it is a set of pretrained word embeddings. With them, the model reaches a maximum validation accuracy of about 73%. The model is trained with the categorical cross-entropy loss and the RMSprop optimizer (as set in `model.compile`), and the output layer uses a softmax activation.

#### Import the required libraries and configure values for different parameters

In [103]:
# Import libraries
import os, sys
%tensorflow_version 1.x
from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
import re
import string
import pandas as pd
from string import digits

TensorFlow is already loaded. Please restart the runtime to change versions.


In [104]:
# check the number of sentences
lines = pd.read_table('/content/mar.txt', names=['eng', 'mar', 'na'])
lines = lines.drop(columns = ['na'])
lines.shape

(38696, 2)

In [0]:
BATCH_SIZE = 64
EPOCHS = 20
LSTM_NODES = 256
NUM_SENTENCES = 20000
MAX_SENTENCE_LENGTH = 50
MAX_NUM_WORDS = 20000
EMBEDDING_SIZE = 100

### Load the dataset and clean the data by removing punctuation and digits and converting to lowercase

The seq2seq architecture is an encoder-decoder architecture that consists of two LSTM networks: the encoder LSTM and the decoder LSTM.
The input to the encoder LSTM is the sentence in English; the input to the decoder LSTM is the sentence in Marathi with a start-of-sentence token prepended. The target output is the same Marathi sentence with an end-of-sentence token appended.
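
To make the three sequences concrete, here is a minimal sketch of how one sentence pair is turned into an encoder input, a decoder input, and a decoder target. The helper name `make_training_triple` is hypothetical and is used only for illustration; the sample pair is one that appears later in this notebook.

```python
def make_training_triple(english, marathi):
    """Return (encoder input, decoder input, decoder target) for one pair."""
    encoder_input = english
    decoder_input = '<sos> ' + marathi    # what the decoder reads
    decoder_target = marathi + ' <eos>'   # what the decoder should predict
    return encoder_input, decoder_input, decoder_target


enc, dec_in, dec_out = make_training_triple('sit here', 'इथे बस')
print(enc)
print(dec_in)
print(dec_out)

# The decoder input is the target shifted right by one position:
# at each step the decoder reads one token and must predict the next one.
for seen, expected in zip(dec_in.split(), dec_out.split()):
    print(seen, '->', expected)
```

Feeding the ground-truth previous token to the decoder during training is commonly called teacher forcing.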

In [106]:
input_sentences = []
output_sentences = []
output_sentences_inputs = []
exclude = set(string.punctuation)
remove_digits = str.maketrans('', '', digits)
count = 0
for line in open(r'/content/mar.txt', encoding="utf-8"):
    count += 1

    if count > NUM_SENTENCES:
        break

    if '\t' not in line:
        continue

    input_sentence, output, c = line.rstrip().split('\t')
    input_sentence = input_sentence.lower()
    output = output.lower()
    input_sentence = re.sub("'", '', input_sentence)
    input_sentence = re.sub(",", ' COMMA', input_sentence)
    output = re.sub("'", '', output)
    output = re.sub(",", ' COMMA', output)
    input_sentence = ''.join(x for x in input_sentence if x not in exclude)
    output = ''.join(x for x in output if x not in exclude)
    input_sentence = input_sentence.translate(remove_digits)
    output = output.translate(remove_digits)
    output_sentence = output + ' <eos>'
    output_sentence_input = '<sos> ' + output
    input_sentences.append(input_sentence)
    output_sentences.append(output_sentence)
    output_sentences_inputs.append(output_sentence_input)

print("num samples input:", len(input_sentences))
print("num samples output:", len(output_sentences))
print("num samples output input:", len(output_sentences_inputs))

num samples input: 20000
num samples output: 20000
num samples output input: 20000
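
The cleaning steps in the loop above can be collected into one function to see their combined effect on a single string. This is only a sketch: the function name `clean` and the sample strings are made up for illustration. Note that `string.punctuation` contains ASCII punctuation only, so a Devanagari danda (`।`) would pass through unchanged.

```python
import re
import string
from string import digits

_PUNCTUATION = set(string.punctuation)          # ASCII punctuation only
_REMOVE_DIGITS = str.maketrans('', '', digits)  # deletes 0-9


def clean(text):
    """Apply the same cleaning steps as the loading loop, in the same order."""
    text = text.lower()
    text = re.sub("'", '', text)          # drop apostrophes: "i'm" -> "im"
    text = re.sub(",", ' COMMA', text)    # keep commas as a separate word
    text = ''.join(ch for ch in text if ch not in _PUNCTUATION)
    return text.translate(_REMOVE_DIGITS)


print(repr(clean("I'm ready, let's go at 5!")))
print(repr(clean('इथे बस।')))
```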


In [108]:
# print one sample sentence (index 182) in all three forms
print(input_sentences[182])
print(output_sentences[182])
print(output_sentences_inputs[182])

sit here
इथे बस <eos>
<sos> इथे बस


### Tokenize the input sentences:

In [109]:
input_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
input_tokenizer.fit_on_texts(input_sentences)
input_integer_seq = input_tokenizer.texts_to_sequences(input_sentences)

word2idx_inputs = input_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))

max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %g" % max_input_len)

Total unique words in the input: 3003
Length of longest sentence in input: 7
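
The following is a simplified, pure-Python stand-in for what the tokenizer does here: it builds a word-to-integer index ordered by frequency and then replaces words with their integers. It is not the Keras implementation; the function names and the toy sentences are assumptions made for illustration.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: i
            for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(sentences, word_index):
    """Replace every known word with its integer id; drop unknown words."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in sentences]


toy_sentences = ['sit here', 'come here', 'sit down here']
toy_index = build_word_index(toy_sentences)
toy_ids = texts_to_ids(toy_sentences, toy_index)

print(toy_index)
print(toy_ids)
print('longest sentence:', max(len(ids) for ids in toy_ids))
```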


### Tokenize the output sentences

In [110]:
output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
output_tokenizer.fit_on_texts(output_sentences + output_sentences_inputs)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)

word2idx_outputs = output_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))

num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %g" % max_out_len)

Total unique words in the output: 6385
Length of longest sentence in the output: 10


### Apply padding to the input sentences

In [111]:
encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("encoder_input_sequences[182]:", encoder_input_sequences[182])

encoder_input_sequences.shape: (20000, 7)
encoder_input_sequences[182]: [  0   0   0   0   0 288  34]


In [112]:
print(word2idx_inputs["can"])
print(word2idx_inputs["take"])

27
89


### Apply padding to the decoder outputs and the decoder inputs

In [113]:
decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_input_sequences[182]:", decoder_input_sequences[182])

decoder_input_sequences.shape: (20000, 10)
decoder_input_sequences[182]: [  2  34 249   0   0   0   0   0   0   0]
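
The encoder sequences are padded at the front (the default), while the decoder sequences are padded at the back (`padding='post'`). The sketch below reproduces both behaviours on plain lists using the two rows printed above. The helper `pad` is hypothetical and, unlike the library function, does not truncate.

```python
def pad(token_ids, maxlen, padding='pre', value=0):
    """Pad a list of ints with `value` up to `maxlen` entries."""
    if len(token_ids) > maxlen:
        raise ValueError('sequence longer than maxlen; '
                         'truncation is not handled in this sketch')
    fill = [value] * (maxlen - len(token_ids))
    if padding == 'pre':
        return fill + list(token_ids)
    return list(token_ids) + fill


encoder_row = pad([288, 34], 7)                        # zeros first
decoder_row = pad([2, 34, 249], 10, padding='post')    # zeros last
print(encoder_row)
print(decoder_row)
```

A common motivation for this split is that pre-padding puts the real words at the end of the encoder input, right before its final state is handed to the decoder, while post-padding keeps `<sos>` at position 0 of the decoder input.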


### GloVe word embedding - loading the GloVe word vectors

In [0]:
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

glove_file = open(r'/content/glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()

#### Create a matrix in which the row index is the integer ID of a word and the columns hold the dimensions of its embedding vector. This matrix contains the GloVe embeddings for the words in our input sentences.

In [0]:
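num_words = min(MAX_NUM_WORDS, len(word2idx_inputs) + 1)
embedding_matrix = zeros((num_words, EMBEDDING_SIZE))
for word, index in word2idx_inputs.items():
    # skip indices outside the matrix (only possible when the
    # vocabulary is larger than MAX_NUM_WORDS)
    if index >= num_words:
        continue
    embedding_vector = embeddings_dictionary.get(word)
    # words without a GloVe vector keep an all-zero row
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector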
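
The same construction can be followed by hand on a toy vocabulary. In the sketch below the vectors are invented three-dimensional stand-ins for the GloVe file, and all names starting with `toy_` are hypothetical. It also shows what happens to a word that has no pretrained vector.

```python
EMBEDDING_DIM = 3  # toy size; the notebook uses 100

# Invented vectors standing in for the pretrained GloVe file.
toy_vectors = {
    'sit': [0.1, 0.2, 0.3],
    'here': [0.4, 0.5, 0.6],
}
toy_word_index = {'here': 1, 'sit': 2, 'tom': 3}

# One row per index; row 0 is reserved for padding.
toy_matrix = [[0.0] * EMBEDDING_DIM for _ in range(len(toy_word_index) + 1)]
missing = []
for word, index in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is None:
        missing.append(word)        # its row stays all zeros
    else:
        toy_matrix[index] = vector

for row in toy_matrix:
    print(row)
print('words without a pretrained vector:', missing)
```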

### Create the embedding layer for the input

In [0]:
embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)

### Create the empty output array

In [0]:
decoder_targets_one_hot = np.zeros((
        len(input_sentences),
        max_out_len,
        num_words_output
    ),
    dtype='float32'
)

In [119]:
decoder_targets_one_hot.shape

(20000, 10, 6386)

##### Create the one-hot encoded targets. The final layer of the model is a dense layer with a softmax activation, so the targets must be one-hot encoded vectors.

In [0]:
decoder_output_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post')
for i, d in enumerate(decoder_output_sequences):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1
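
As a small illustration of the loop above, the sketch below one-hot encodes a single padded target sequence over a toy vocabulary of five entries. The function name and the toy values are assumptions. It also computes how much memory a dense `float32` array of shape (20000, 10, 6386) takes.

```python
def one_hot_sequence(token_ids, vocab_size):
    """Turn a list of token ids into a list of one-hot rows."""
    rows = []
    for token_id in token_ids:
        row = [0.0] * vocab_size
        row[token_id] = 1.0
        rows.append(row)
    return rows


toy_target = [3, 1, 0, 0]   # two real tokens followed by padding
encoded = one_hot_sequence(toy_target, vocab_size=5)
for row in encoded:
    print(row)

# Size of the full target array: samples * time steps * vocabulary * 4 bytes.
n_bytes = 20000 * 10 * 6386 * 4
print(round(n_bytes / 1e9, 2), 'GB')
```

Padded positions are encoded as a one at index 0, so the model is also trained to predict the padding index at those time steps.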

##### Create the encoder. The input to the encoder will be the sentence in English and the output will be the hidden state and cell state of the LSTM

In [0]:
encoder_inputs_placeholder = Input(shape=(max_input_len,))
x = embedding_layer(encoder_inputs_placeholder)
encoder = LSTM(LSTM_NODES, return_state=True)

encoder_outputs, h, c = encoder(x)
encoder_states = [h, c]

##### Create the decoder. The decoder has two inputs: the hidden state and cell state from the encoder, and the decoder input sentence, which is the output sentence with an 'sos' token prepended.

In [0]:
decoder_inputs_placeholder = Input(shape=(max_out_len,))

decoder_embedding = Embedding(num_words_output, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs_placeholder)

decoder_lstm = LSTM(LSTM_NODES, return_sequences=True, return_state=True, dropout=0.3)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

##### The output from the decoder LSTM is passed through a dense layer to predict decoder outputs

In [0]:
decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

#### Compile the model

In [0]:
model = Model([encoder_inputs_placeholder,
  decoder_inputs_placeholder], decoder_outputs)
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

### Train the model using the fit() method

In [126]:
history = model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.1,
)

Train on 18000 samples, validate on 2000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Check the maximum validation accuracy

In [127]:
print("Maximum accuracy of validation: ", max(history.history['val_acc']))

Maximum accuracy of validation:  0.731650004863739
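
One caveat when reading this number: the sketch below assumes that the accuracy metric is averaged over every decoder time step and that, since no masking is configured in this model, zero-padded positions are counted as well. Under that assumption, correctly predicted padding raises the score. The function name and the toy sequences are invented for illustration.

```python
def token_accuracy(y_true, y_pred, ignore_id=None):
    """Fraction of positions where the prediction equals the target.

    Positions whose target equals `ignore_id` are left out.
    """
    pairs = [(t, p)
             for true_row, pred_row in zip(y_true, y_pred)
             for t, p in zip(true_row, pred_row)
             if t != ignore_id]
    return sum(t == p for t, p in pairs) / len(pairs)


# Toy targets and predictions, padded with 0 to length 6.
toy_true = [[5, 7, 1, 0, 0, 0], [9, 4, 8, 1, 0, 0]]
toy_pred = [[5, 3, 1, 0, 0, 0], [2, 4, 6, 1, 0, 0]]

acc_all = token_accuracy(toy_true, toy_pred)
acc_no_pad = token_accuracy(toy_true, toy_pred, ignore_id=0)
print('all positions:    ', acc_all)
print('padding excluded: ', round(acc_no_pad, 3))
```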


In [0]:
encoder_model = Model(encoder_inputs_placeholder, encoder_states)

In [0]:
decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

In [0]:
decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)

In [0]:
decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)

##### To make predictions, the decoder output is passed through the dense layer

In [0]:
decoder_states = [h, c]
decoder_outputs = decoder_dense(decoder_outputs)

In [0]:
# updated decoder model
decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

### Predictions
During tokenization we converted words to integers, so the outputs of the decoder are integers too. To obtain the translation as Marathi words, we need to map these integers back to words. For this we create reversed dictionaries for both inputs and outputs, in which the keys are the integers and the values are the corresponding words.

In [0]:
idx2word_input = {v:k for k, v in word2idx_inputs.items()}
idx2word_target = {v:k for k, v in word2idx_outputs.items()}

##### The translate_sentence() function accepts a padded English sentence in integer form and returns the translated Marathi sentence.

In [0]:
def translate_sentence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<sos>']
    eos = word2idx_outputs['<eos>']
    output_sentence = []

    for _ in range(max_out_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        if eos == idx:
            break

        # index 0 is the padding index and has no word
        if idx > 0:
            output_sentence.append(idx2word_target[idx])

        target_seq[0, 0] = idx
        states_value = [h, c]

    return ' '.join(output_sentence)
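
The loop in `translate_sentence()` is a greedy decoder: at every step it keeps only the highest-scoring word and feeds it back in as the next input. The sketch below isolates that control flow. The scripted `toy_step` function is a made-up stand-in for `decoder_model.predict`, and the toy vocabulary is an assumption.

```python
def greedy_decode(step_fn, initial_state, sos_id, eos_id, max_len):
    """Repeatedly pick the highest-scoring token until <eos> or max_len."""
    token, state, result = sos_id, initial_state, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == eos_id:
            break
        result.append(token)
    return result


# Scripted stand-in for the decoder: emits ids 3, 4 and then <eos> (2).
SCRIPT = [3, 4, 2]


def toy_step(token, state):
    scores = [0.0] * 5
    scores[SCRIPT[state]] = 1.0     # the state is just a step counter here
    return scores, state + 1


toy_idx2word = {1: '<sos>', 2: '<eos>', 3: 'इथे', 4: 'बस'}
ids = greedy_decode(toy_step, 0, sos_id=1, eos_id=2, max_len=10)
print(' '.join(toy_idx2word[i] for i in ids))
```

Greedy decoding is simple and fast, but it cannot revise an early word choice; beam search is a common alternative when that matters.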

### Sample inferences

In [136]:
# translate 20 randomly chosen sentences from the training data
for k in range(20):
  i = np.random.choice(len(input_sentences))
  input_seq = encoder_input_sequences[i:i+1]
  translation = translate_sentence(input_seq)
  print('-')
  print('Input:', input_sentences[i])
  print('Actual:', output_sentences[i])
  print('Response:', translation)

-
Input: we are good friends
Actual: आम्ही चांगले मित्र आहोत <eos>
Response: आम्ही चांगले मित्र आहोत
-
Input: we love our children
Actual: आपलं आपल्या मुलांवर प्रेम आहे <eos>
Response: आपलं आपल्या मुलांवर प्रेम आहे
-
Input: she made me hurry
Actual: त्यांनी मला घाई करायला लावली <eos>
Response: तिने मला घाई करायला लावली
-
Input: did you manage to sleep
Actual: झोपायला जमलं का <eos>
Response: झोप झोपायला का
-
Input: do you know who won
Actual: कोण जिंकलं हे तुम्हाला माहीत आहे का <eos>
Response: कोण जिंकलं हे तुम्हाला माहीत आहे का
-
Input: call me tonight
Actual: मला आज रात्री बोलव <eos>
Response: मला आज रात्री फोन करा
-
Input: were you responsible
Actual: तू जबाबदार होतास का <eos>
Response: तू जबाबदार होतीस का
-
Input: science is fun
Actual: विज्ञानात मजा येते <eos>
Response: मजा येते का
-
Input: im ready to go
Actual: मी जायला तयार आहे <eos>
Response: मी जायला तयार आहे
-
Input: she called him
Actual: तिने त्यांना फोन केला <eos>
Response: तिने त्यांना फोन केला
-
Input: use your head
Actu