# Libraries and Configuration Settings

First, we import the required libraries and set the values of the parameters used throughout the code. Let's begin with the imports:

In [2]:

import os, sys

from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt

Execute the following script to set values for different parameters:

In [3]:
BATCH_SIZE = 64
EPOCHS = 20
LSTM_NODES = 256
NUM_SENTENCES = 20000
MAX_SENTENCE_LENGTH = 50
MAX_NUM_WORDS = 20000
EMBEDDING_SIZE = 100

# Data Preprocessing
Neural machine translation models are often based on the seq2seq architecture. Seq2seq is an encoder-decoder architecture that consists of two LSTM networks: the encoder LSTM and the decoder LSTM. The input to the encoder LSTM is the sentence in the original language. The input to the decoder LSTM is the sentence in the translated language, preceded by a start-of-sentence token. The expected output of the decoder is the same target sentence, followed by an end-of-sentence token.

In our dataset, the input sentences need no processing. However, we need to generate two copies of each translated sentence: one with the start-of-sentence token and one with the end-of-sentence token. The following script does that:

In [4]:
input_sentences = []
output_sentences = []
output_sentences_inputs = []

count = 0
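# Each line of fra.txt holds three tab-separated fields:
# source sentence, translation, attribution.
with open(r'C:\Users\Paaras Jamwal\Dropbox\My PC (PAARAS)\Desktop\Project\fra.txt', encoding="utf-8") as data_file:
    for line in data_file:
        count += 1

        if count > NUM_SENTENCES:
            break

        if '\t' not in line:
            continue

        input_sentence, output, attrib = line.rstrip().split('\t')

        # Decoder target: translation followed by <eos>.
        output_sentence = output + ' <eos>'
        # Decoder input: translation preceded by <sos>.
        output_sentence_input = '<sos> ' + output

        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)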

print("num samples input:", len(input_sentences))
print("num samples output:", len(output_sentences))
print("num samples output input:", len(output_sentences_inputs))

num samples input: 20000
num samples output: 20000
num samples output input: 20000


Let's now print one sample (index 111) from each of the input_sentences, output_sentences, and output_sentences_inputs lists:

In [5]:
print(input_sentences[111])
print(output_sentences[111])
print(output_sentences_inputs[111])

Drop it!
Laisse-le tomber ! <eos>
<sos> Laisse-le tomber !
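
The relationship between the two copies can be checked directly: removing `<sos>` from the decoder input and `<eos>` from the decoder target should leave the same words. The following is a minimal sketch, assuming whitespace-separated tokens; the helper `make_decoder_pair` is a hypothetical name and is not part of the script above.

```python
def make_decoder_pair(translation):
    """Return (decoder_input, decoder_target) for one translated sentence."""
    decoder_input = '<sos> ' + translation    # fed to the decoder
    decoder_target = translation + ' <eos>'   # what the decoder should predict
    return decoder_input, decoder_target


decoder_input, decoder_target = make_decoder_pair('Laisse-le tomber !')

# The target is the input shifted one position to the left.
assert decoder_input.split()[1:] == decoder_target.split()[:-1]

# Show, step by step, which token is fed in and which one is expected back.
for step, (fed, expected) in enumerate(zip(decoder_input.split(), decoder_target.split())):
    print(step, fed, '->', expected)
```

At every step the decoder receives one token and is expected to produce the next one, which is why both copies are needed.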


# Tokenization and Padding
The Tokenizer class performs two tasks:

1. It splits a sentence into a list of words.
2. It converts each word to an integer.

For tokenization, we use the Tokenizer class from the keras.preprocessing.text module.
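
To make the two tasks concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: it ignores punctuation filtering and the `num_words` limit, and the function names `fit_word_index` and `to_sequence` are hypothetical.

```python
from collections import Counter


def fit_word_index(sentences):
    """Map each word to an integer; the most frequent word gets index 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Index 0 is left unused so that it can serve as the padding value later.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(sentence, word_index):
    """Replace every known word by its integer; unknown words are skipped."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]


sentences = ['drop it', 'take it', 'drop it now']
word_index = fit_word_index(sentences)
sequences = [to_sequence(s, word_index) for s in sentences]

print(word_index)
print(sequences)
```

More frequent words receive smaller integers, which is also how the Keras Tokenizer orders its `word_index`.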
    

The following script is used to tokenize the input sentences:

In [6]:
input_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
input_tokenizer.fit_on_texts(input_sentences)
input_integer_seq = input_tokenizer.texts_to_sequences(input_sentences)

word2idx_inputs = input_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))

max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %g" % max_input_len)

Total unique words in the input: 3518
Length of longest sentence in input: 6


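The output sentences are tokenized in the same way, with one difference: filters='' is passed so that the Tokenizer does not strip the angle brackets from the <sos> and <eos> tokens: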

In [7]:
output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
output_tokenizer.fit_on_texts(output_sentences + output_sentences_inputs)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)

word2idx_outputs = output_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))

num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %g" % max_out_len)

Total unique words in the output: 9546
Length of longest sentence in the output: 12


Next, we need to pad the sequences. Sentences vary in length, but the LSTM (the network we are going to train) expects all input instances to have the same length. We therefore convert our sentences into fixed-length vectors, and one way to do this is padding.

In padding, a fixed length is defined for every sentence. In our case, the lengths of the longest input and output sentences are used for padding the input and output sentences, respectively. The longest input sentence contains 6 words. For sentences with fewer than 6 words, zeros are added to fill the remaining positions; by default, pad_sequences adds them at the beginning of the sequence. The following script applies padding to the input sentences:

In [11]:
encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("encoder_input_sequences[111]:", encoder_input_sequences[111])

encoder_input_sequences.shape: (20000, 6)
encoder_input_sequences[111]: [  0   0   0   0 340   4]


In [19]:
print(word2idx_inputs["drop"])
print(word2idx_inputs["it"])

340
4
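
Going in the opposite direction, a padded sequence can be turned back into words by inverting the word-to-integer dictionary. The sketch below uses only the two vocabulary entries printed above; `sample_word2idx` is a hypothetical stand-in for the full `word2idx_inputs` dictionary.

```python
# Stand-in for word2idx_inputs, holding only the two entries printed above.
sample_word2idx = {'drop': 340, 'it': 4}

# Invert the mapping: integer -> word.
sample_idx2word = {index: word for word, index in sample_word2idx.items()}

padded = [0, 0, 0, 0, 340, 4]   # same values as encoder_input_sequences[111]

# 0 is the padding value and has no word attached, so it is skipped.
words = [sample_idx2word[i] for i in padded if i != 0]
print(' '.join(words))
```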


The decoder inputs are padded in a similar way, except that padding='post' is passed so that the zeros are appended at the end of each sequence instead of the beginning:

In [20]:
decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_input_sequences[111]:", decoder_input_sequences[111])

decoder_input_sequences.shape: (20000, 12)
decoder_input_sequences[111]: [  2 726 369   5   0   0   0   0   0   0   0   0]


In [25]:
print(word2idx_outputs["<sos>"])
print(word2idx_outputs["laisse-le"])
print(word2idx_outputs["tomber"])
print(word2idx_outputs["!"])


2
726
369
5
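
The difference between the default pre-padding used for the encoder inputs and the post-padding used for the decoder inputs can be sketched in plain Python. The `pad` function below is a hypothetical, simplified illustration of what `pad_sequences` does for a single sequence; it assumes the sequence is no longer than `maxlen` and therefore performs no truncation.

```python
def pad(sequence, maxlen, padding='pre', value=0):
    """Pad a list of integers with `value` until it has `maxlen` items."""
    fill = [value] * max(0, maxlen - len(sequence))
    if padding == 'pre':
        return fill + list(sequence)    # zeros in front
    return list(sequence) + fill        # zeros at the end


encoder_example = pad([340, 4], maxlen=6)
decoder_example = pad([2, 726, 369, 5], maxlen=12, padding='post')

print(encoder_example)
print(decoder_example)
```

The two results reproduce the padded rows printed above for sentence 111.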


# Word Embeddings
With word embeddings, every word is represented as an n-dimensional dense vector, and similar words have similar vectors. Word embedding techniques such as GloVe and Word2Vec have proven very effective at converting words into such dense vectors. The vectors are small, and none of the positions in a vector is empty.

There are two main differences between the single-integer representation and word embeddings. First, with the integer representation a word is represented by just one integer, whereas with the vector representation a word is represented by a vector of 50, 100, 200, or however many dimensions you like. Word embeddings therefore capture much more information about a word. Second, the single-integer representation does not capture the relationships between different words, whereas word embeddings do.
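
A common way to measure how similar two word vectors are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not real GloVe values, and the words were chosen only to show the contrast.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors for illustration only (NOT real GloVe embeddings).
toy_vectors = {
    'drop':   np.array([0.9, 0.1, 0.0]),
    'fall':   np.array([0.8, 0.2, 0.1]),
    'banana': np.array([0.0, 0.1, 0.9]),
}

related = cosine_similarity(toy_vectors['drop'], toy_vectors['fall'])
unrelated = cosine_similarity(toy_vectors['drop'], toy_vectors['banana'])

print('drop vs fall:  ', round(related, 3))
print('drop vs banana:', round(unrelated, 3))
```

With integer indexes no such comparison is meaningful: the numeric distance between two word indexes says nothing about how the words relate.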


Let's create word embeddings for the inputs first. To do so, we need to load the GloVe word vectors into memory. We will then create a dictionary where words are the keys and the corresponding vectors are the values, as shown below:

In [27]:
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

glove_file = open(r'C:\Users\Paaras Jamwal\Dropbox\My PC (PAARAS)\Desktop\Project\glove.6B.100d.txt', encoding="utf8")

for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()
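
The same loading logic can also be wrapped in a reusable function. The sketch below is one possible variant, not the script used in this tutorial: `load_glove` is a hypothetical name, it uses a `with` block so the file is closed even if parsing fails, and it skips lines that hold no vector.

```python
import numpy as np


def load_glove(path):
    """Read a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding='utf8') as glove_file:
        for line in glove_file:
            records = line.split()
            if len(records) < 2:
                continue  # blank or malformed line: nothing to store
            # First field is the word, the remaining fields are its vector.
            embeddings[records[0]] = np.asarray(records[1:], dtype='float32')
    return embeddings
```

Calling `load_glove` with the path of the GloVe file should give a dictionary equivalent to `embeddings_dictionary`.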

We now create a matrix in which the row number is the integer assigned to a word and the columns hold the dimensions of that word's embedding vector. This matrix will contain the word embeddings for the words in our input sentences.

In [28]:
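num_words = min(MAX_NUM_WORDS, len(word2idx_inputs) + 1)
embedding_matrix = zeros((num_words, EMBEDDING_SIZE))
for word, index in word2idx_inputs.items():
    # Skip words whose index falls outside the matrix (only possible when
    # the vocabulary is larger than MAX_NUM_WORDS).
    if index >= num_words:
        continue
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        # Words missing from GloVe keep their all-zero row.
        embedding_matrix[index] = embedding_vector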
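
Words that have no GloVe vector keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is covered. The following is a small sketch with toy dictionaries standing in for `word2idx_inputs` and `embeddings_dictionary`; the function name `embedding_coverage` and the toy values are hypothetical.

```python
def embedding_coverage(word_index, embeddings):
    """Split the vocabulary into words with and without a pre-trained vector."""
    found = [w for w in word_index if w in embeddings]
    missing = [w for w in word_index if w not in embeddings]
    return found, missing


# Toy stand-ins for word2idx_inputs and embeddings_dictionary (illustrative only).
toy_word_index = {'drop': 1, 'it': 2, 'qwertyzz': 3}
toy_embeddings = {'drop': [0.1, 0.2], 'it': [0.3, 0.4]}

found, missing = embedding_coverage(toy_word_index, toy_embeddings)
print(f'{len(found)} of {len(toy_word_index)} words have a vector; missing: {missing}')
```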

Let's print the embedding for the word "drop" from the GloVe dictionary, and then row 340 of the embedding matrix (340 is the integer assigned to "drop"). The two should contain the same values.

In [30]:
print(embeddings_dictionary["drop"])

[ 0.2934     0.49122    0.021428  -0.0032031 -0.30625   -0.70369
 -0.02503    0.20014   -0.11357    0.20024   -0.096917   0.24
 -0.43177    0.12703    0.50062    0.33529   -0.84563    0.23978
 -0.094769   0.20813    0.85062   -0.14943   -0.10232    0.74019
  0.11983   -0.11003   -0.33117   -0.31358   -0.12285   -0.76261
  0.29028   -0.0053144 -0.14366    0.14945   -0.25026    0.51785
  0.47238   -0.2494     0.069556   0.041659  -0.49759   -0.17819
 -0.16676   -0.062862   0.55204    0.026228  -0.029453  -0.35108
 -0.16275   -1.4509     0.52228    0.044831  -0.25485    0.95877
 -0.11863   -2.002     -0.29194   -0.22581    2.1544     0.17959
  0.052697  -0.022561  -0.47654    0.27268    0.15112    0.13092
  0.17475   -0.20071    0.12448   -0.05814   -0.12978    0.24495
 -0.11701    0.22073   -0.061256   0.015408  -0.33155   -0.020674
  0.13623    0.24073    0.63338   -0.11291   -0.75424   -0.27938
 -0.85368   -1.7428     0.2717    -0.23938    0.3652     0.1187
 -0.28925    0.41028   -0.70

In [31]:
print(embedding_matrix[340])

[ 0.29339999  0.49122     0.021428   -0.0032031  -0.30625001 -0.70368999
 -0.02503     0.20014    -0.11357     0.20024    -0.096917    0.23999999
 -0.43177     0.12703     0.50062001  0.33529001 -0.84562999  0.23977999
 -0.094769    0.20813     0.85061997 -0.14943001 -0.10232     0.74019003
  0.11983    -0.11003    -0.33116999 -0.31358001 -0.12285    -0.76261002
  0.29028001 -0.0053144  -0.14365999  0.14945    -0.25026     0.51784998
  0.47238001 -0.2494      0.069556    0.041659   -0.49759001 -0.17818999
 -0.16676    -0.062862    0.55203998  0.026228   -0.029453   -0.35108
 -0.16275001 -1.45089996  0.52227998  0.044831   -0.25485     0.95876998
 -0.11863    -2.00200009 -0.29194    -0.22581001  2.15440011  0.17959
  0.052697   -0.022561   -0.47654     0.27268001  0.15112001  0.13091999
  0.17475    -0.20071     0.12448    -0.05814    -0.12977999  0.24495
 -0.11701     0.22073001 -0.061256    0.015408   -0.33155    -0.020674
  0.13623001  0.24073     0.63338    -0.11291    -0.75423998 -

This word embedding matrix will be used to create the embedding layer for our LSTM model.

The following script creates the embedding layer for the input:

In [33]:
embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)
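
Conceptually, an embedding layer acts as a lookup table: each integer in a padded sequence selects one row of the embedding matrix. The NumPy sketch below illustrates only this lookup on a toy matrix with made-up values; the Keras layer additionally manages the weights as part of the model.

```python
import numpy as np

# Toy embedding matrix: 5 "words" (row 0 is reserved for padding), 3 dimensions each.
toy_matrix = np.arange(15, dtype='float32').reshape(5, 3)
toy_matrix[0] = 0.0

# A batch of two padded sequences, each of length 4.
batch = np.array([[0, 0, 3, 1],
                  [0, 2, 4, 4]])

# Integer-array indexing picks one matrix row per integer in the batch.
embedded = toy_matrix[batch]
print(embedded.shape)   # (2, 4, 3): batch size, sequence length, embedding size
```

By the same logic, a batch of padded input sequences of length 6 passed through a 100-dimensional embedding has shape (batch size, 6, 100).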

# Phase 2 
Creating the Model