In [2]:
import os, sys

from keras.models import Model
from keras.layers import Input, LSTM, GRU, Dense, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt

The many-to-many sequence modelling technique known as seq2seq architecture is frequently used for a number of applications including text summarization, chatbot generation, conversational modelling, and neural machine translation, among others.

In this article we look at how to build a language translation model, one of the best-known applications of neural machine translation. We will construct the model with Python's Keras library, using the seq2seq architecture.

In [3]:
#set values for different parameters
BATCH_SIZE = 64
EPOCHS = 20
LSTM_NODES =256
NUM_SENTENCES = 20000
MAX_SENTENCE_LENGTH = 50
MAX_NUM_WORDS = 20000
EMBEDDING_SIZE = 100

The model of language translation we'll create in this article will convert English sentences into their French equivalents. We require a dataset with English sentences and their French translations in order to create such a model. 

The dataset is stored in the file "fra.txt". Each line of this text file contains an English sentence and its French translation, separated by a tab.

**Data Preprocessing**

The seq2seq architecture is frequently used as the basis for neural machine translation models. It is an encoder-decoder architecture made up of an encoder LSTM and a decoder LSTM. The sentence in the original language serves as the input for the encoder LSTM, and the sentence in the translated language, preceded by a start-of-sentence token, serves as the input for the decoder LSTM. The expected output is the actual target sentence followed by an end-of-sentence token.

In our dataset, we do not need to process the input. However, we need to generate two copies of each translated sentence: one with the start-of-sentence token at the beginning and the other with the end-of-sentence token at the end.

In [4]:
input_sentences = []
output_sentences = []
output_sentences_inputs = []
# Three lists hold the English inputs, the French targets ending in <eos>,
# and the French decoder inputs starting with <sos>.

count = 0
# Read fra.txt line by line; the with block closes the file when the loop ends.
with open('fra.txt', encoding="utf-8") as f:
    for line in f:
        count += 1

        # Stop once NUM_SENTENCES (20,000) lines have been read.
        if count > NUM_SENTENCES:
            break

        # Skip lines that do not contain a tab-separated sentence pair.
        if '\t' not in line:
            continue

        # Split the line at the tab: the left part is the English sentence,
        # the right part is the corresponding French translation.
        input_sentence, output = line.rstrip().split('\t')

        # Append <eos> (end of sentence) to the translation for the decoder
        # target, and prepend <sos> (start of sentence) for the decoder input.
        output_sentence = output + ' <eos>'
        output_sentence_input = '<sos> ' + output

        input_sentences.append(input_sentence)
        output_sentences.append(output_sentence)
        output_sentences_inputs.append(output_sentence_input)

print("num samples input:", len(input_sentences))
print("num samples output:", len(output_sentences))
print("num samples output input:", len(output_sentences_inputs))

num samples input: 19974
num samples output: 19974
num samples output input: 19974


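We use only the first 20,000 lines of the file to train our model. Lines without a tab are skipped, so the lists can end up slightly shorter; here they hold 19,974 sentence pairs.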

In [5]:
# print the sentence at index 172 from the input_sentences[], output_sentences[], and output_sentences_inputs[] lists:
print(input_sentences[172])
print(output_sentences[172])
print(output_sentences_inputs[172])

Be nice.
Sois gentille ! <eos>
<sos> Sois gentille !


You can see the original sentence, i.e. Be nice., and its corresponding translation in the output, i.e. Sois gentille ! <eos>. Notice that here the <eos> token is at the end of the sentence. Similarly, for the input to the decoder, we have <sos> Sois gentille !, with the <sos> token at the start.
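The relationship between the two French copies can be checked token by token. The following is a minimal sketch in plain Python, using the sentence pair printed above as a hard-coded sample (the variable names are illustrative and not part of the notebook). It shows that at every step the decoder reads one token and is expected to predict the next one.

```python
# Sketch: build the decoder input and target for one sentence pair and
# check that the target is the decoder input shifted left by one token.
line = "Be nice.\tSois gentille !"   # sample pair, hard-coded for illustration

english, french = line.rstrip().split("\t")
decoder_input = ("<sos> " + french).split()    # what the decoder reads
decoder_target = (french + " <eos>").split()   # what the decoder must predict

# At step t the decoder reads decoder_input[t] and should emit decoder_target[t].
for t, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(t, seen, "->", expected)

# The target equals the input shifted by one position, with <eos> at the end.
assert decoder_target[:-1] == decoder_input[1:]
```

This one-token shift is why two copies of every translated sentence are kept.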

**Tokenization and Padding**

After tokenizing the original and translated sentences, padding is applied so that all sentences in a set have the same length. The inputs are padded to the length of the longest input sentence, and the outputs are padded to the length of the longest output sentence.

The Tokenizer class from the keras.preprocessing.text package can be used for tokenization. It carries out two tasks: it breaks a sentence up into its component words, then turns those words into integers.
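To make these two tasks concrete, here is a simplified plain-Python sketch. It only approximates the behaviour described above and is not the Keras implementation: the helper names (simple_split, fit_word_index, to_sequences) and the toy sentences are made up for illustration. Words are lowercased, punctuation is dropped, and the most frequent word gets index 1.

```python
import string
from collections import Counter

_PUNCT = str.maketrans("", "", string.punctuation)

def simple_split(sentence):
    """Lowercase a sentence, drop punctuation and split it on whitespace."""
    return sentence.lower().translate(_PUNCT).split()

def fit_word_index(sentences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for sentence in sentences:
        counts.update(simple_split(sentence))
    # Index 0 is left unused so that it can serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(sentences, word_index):
    """Replace every known word by its integer index."""
    return [[word_index[w] for w in simple_split(s) if w in word_index]
            for s in sentences]

toy_sentences = ["Be nice.", "Be fair.", "Be nice to him."]
word_index = fit_word_index(toy_sentences)
sequences = to_sequences(toy_sentences, word_index)
print(word_index)
print(sequences)
```

Reserving index 0 matters later, because the padded sequences use 0 for empty positions.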

In [6]:
# to tokenize the input sentences

input_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
input_tokenizer.fit_on_texts(input_sentences)
input_integer_seq = input_tokenizer.texts_to_sequences(input_sentences)

word2idx_inputs = input_tokenizer.word_index
print('Total unique words in the input: %s' % len(word2idx_inputs))

max_input_len = max(len(sen) for sen in input_integer_seq)
print("Length of longest sentence in input: %g" % max_input_len)

Total unique words in the input: 3426
Length of longest sentence in input: 5


In addition to tokenization and integer conversion, the word_index attribute of the Tokenizer class returns a word-to-index dictionary where words are the keys and the corresponding integers are the values. The script above also prints the number of unique words in the dictionary and the length of the longest sentence in the input.

The output sentences are tokenized in the same way. The only difference is filters='', which stops the Tokenizer from stripping punctuation, so that the angle brackets of the <sos> and <eos> tokens are preserved.

In [7]:
output_tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
output_tokenizer.fit_on_texts(output_sentences + output_sentences_inputs)
output_integer_seq = output_tokenizer.texts_to_sequences(output_sentences)
output_input_integer_seq = output_tokenizer.texts_to_sequences(output_sentences_inputs)

word2idx_outputs = output_tokenizer.word_index
print('Total unique words in the output: %s' % len(word2idx_outputs))

num_words_output = len(word2idx_outputs) + 1
max_out_len = max(len(sen) for sen in output_integer_seq)
print("Length of longest sentence in the output: %g" % max_out_len)

Total unique words in the output: 9504
Length of longest sentence in the output: 12


Comparing the two outputs, the English input has a much smaller vocabulary than the French output (3,426 versus 9,504 unique tokens), and its longest sentence is also shorter (5 versus 12 tokens). Part of the vocabulary gap arises because punctuation is kept in the output, so a word followed directly by a punctuation mark counts as a separate token.

Next, we need to pad the input. Text sentences can be of different lengths, but the LSTM (the algorithm we will use to train our model) expects input instances of the same length. We must therefore turn our sentences into fixed-length vectors, and padding is one way to do this. It is applied to both the input and the output.

In padding, a fixed sentence length is defined. In our example, the input and output sentences are padded to the length of the longest sentence in the inputs and outputs, respectively. The longest input sentence is five words long, so zeros are added to the empty positions of sentences with fewer than five words.

In [8]:
# to apply padding to the input sentences
encoder_input_sequences = pad_sequences(input_integer_seq, maxlen=max_input_len)
print("encoder_input_sequences.shape:", encoder_input_sequences.shape)
print("encoder_input_sequences[172]:", encoder_input_sequences[172])

encoder_input_sequences.shape: (19974, 5)
encoder_input_sequences[172]: [  0   0   0  22 114]


The script above prints the shape of the padded input sentences. The padded integer sequence for the sentence at index 172 is also printed. 

In [42]:
# In the same way, the decoder outputs and the decoder inputs are padded
decoder_input_sequences = pad_sequences(output_input_integer_seq, maxlen=max_out_len, padding='post')
print("decoder_input_sequences.shape:", decoder_input_sequences.shape)
print("decoder_input_sequences[172]:", decoder_input_sequences[172])

decoder_input_sequences.shape: (19974, 12)
decoder_input_sequences[172]: [  2  62 783   4   0   0   0   0   0   0   0   0]


Note that the decoder uses post-padding, which adds zeros at the end of the sentence, whereas the encoder input was padded with zeros at the start. The reason is that the encoder's output is the LSTM state after the last time step, so the actual words are kept at the end of the sequence and the zeros are placed at the beginning. The decoder, on the other hand, starts processing at the beginning of a sentence, so post-padding is applied to the decoder inputs and outputs.
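The difference between the two padding modes can be reproduced with a few lines of plain Python. This is a sketch and not the pad_sequences implementation: the pad helper is hypothetical, it assumes maxlen is positive, and the integer sequences are the unpadded versions of the rows printed above.

```python
def pad(sequence, maxlen, padding="pre", value=0):
    """Pad (or truncate) a list of integers to exactly maxlen entries."""
    if len(sequence) >= maxlen:
        # Keep the last maxlen items, i.e. truncate from the front.
        return sequence[-maxlen:]
    fill = [value] * (maxlen - len(sequence))
    return fill + sequence if padding == "pre" else sequence + fill

# Encoder side: zeros go in front, the words stay at the end.
encoder_row = pad([22, 114], maxlen=5)
# Decoder side: the words stay at the start, zeros are added at the end.
decoder_row = pad([2, 62, 783, 4], maxlen=12, padding="post")

print(encoder_row)
print(decoder_row)
```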

**Word Embeddings**

We must transform our words into their corresponding numeric vector representations because we are using deep learning models, and deep learning models only work with numbers. However, we have already transformed our words into integers. What distinguishes word embeddings from integer representation, then?

Word embeddings and single-integer representations differ primarily in two ways. First, in integer representation a word is represented by a single integer, whereas in vector representation a word is represented by a vector that can have any number of dimensions (50, 100, 200, etc.). Word embeddings therefore capture a great deal more information about words. Second, the single-integer representation does not capture the relationships between words, whereas word embeddings preserve them. You have two options: pretrained word embeddings or custom word embeddings.

Let's create word embeddings for the inputs first. Since the inputs are English sentences, we use pretrained GloVe word vectors, which must first be loaded into memory. We then create a dictionary where words are the keys and the corresponding vectors are the values.

In [12]:
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

# Each line of the GloVe file holds a word followed by its vector components.
# The with block closes the file automatically.
with open('glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

Next, we create an embedding matrix for the words in our input sentences. The row number in the matrix corresponds to the integer value of a word, and the columns hold the dimensions of that word's vector.

In [13]:
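num_words = min(MAX_NUM_WORDS, len(word2idx_inputs) + 1)
embedding_matrix = zeros((num_words, EMBEDDING_SIZE))
for word, index in word2idx_inputs.items():
    # Skip indices beyond the matrix (only possible if the vocabulary exceeds MAX_NUM_WORDS).
    if index >= num_words:
        continue
    embedding_vector = embeddings_dictionary.get(word)
    # Words without a GloVe vector keep their all-zero row.
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector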

This word embedding matrix will be used to create the embedding layer for our LSTM model.
The following script creates the embedding layer for the input:

In [16]:
embedding_layer = Embedding(num_words, EMBEDDING_SIZE, weights=[embedding_matrix], input_length=max_input_len)
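Conceptually, an embedding layer is a row lookup in the embedding matrix. The NumPy sketch below illustrates that idea with a toy matrix whose values are made up (real rows would come from GloVe). It is an illustration of the lookup, not of how Keras implements the layer.

```python
import numpy as np

# Toy embedding matrix: 4 rows (index 0 is padding) with 3 dimensions each.
# The values are invented for illustration only.
toy_matrix = np.array([
    [0.0, 0.0, 0.0],   # index 0: padding
    [0.1, 0.2, 0.3],   # index 1
    [0.4, 0.5, 0.6],   # index 2
    [0.7, 0.8, 0.9],   # index 3
])

padded_sentence = np.array([0, 0, 3, 1])   # a pre-padded integer sequence

# The lookup: each integer selects the matrix row with the same number.
embedded = toy_matrix[padded_sentence]
print(embedded.shape)   # one 3-dimensional vector per time step
print(embedded)
```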

**Creating the Model**

The first thing we need to do is define our outputs. We know that the output will be a sequence of words, so for each input sentence we need a corresponding output sentence.

In [17]:
# creates the empty output array
decoder_targets_one_hot = np.zeros((
        len(input_sentences),
        max_out_len,
        num_words_output
    ),
    dtype='float32'
)

In [18]:
# print the shape of the decoder targets
decoder_targets_one_hot.shape

(19974, 12, 9505)

The final layer of the model will be a dense layer with the softmax activation function, trained with the categorical cross-entropy loss, which expects targets in the form of one-hot encoded vectors. To produce such one-hot encoded output, the next step is to assign 1 to the column number that corresponds to each word's integer representation.
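A one-hot target array of this shape is large, which is worth checking before training. The short calculation below uses the shape printed above and the 4 bytes of a float32 value. It is only a back-of-the-envelope estimate.

```python
# Shape of decoder_targets_one_hot as printed above.
num_samples, max_out_len, num_words_output = 19974, 12, 9505
bytes_per_float32 = 4

total_bytes = num_samples * max_out_len * num_words_output * bytes_per_float32
print(f"{total_bytes / 1024**3:.2f} GiB")
```

If this does not fit into memory, one commonly used alternative (not what this notebook does) is the sparse_categorical_crossentropy loss in Keras, which takes the integer sequences as targets directly.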

In [19]:
decoder_output_sequences = pad_sequences(output_integer_seq, maxlen=max_out_len, padding='post')

In [20]:
for i, d in enumerate(decoder_output_sequences):
    for t, word in enumerate(d):
        decoder_targets_one_hot[i, t, word] = 1
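The nested loop can also be written with NumPy index arrays. The sketch below compares both versions on a small made-up array (toy_sequences and num_classes are illustrative names, not from the notebook). Note that in both versions the padding value 0 is also encoded, as a 1 in column 0.

```python
import numpy as np

toy_sequences = np.array([[3, 1, 0],
                          [2, 2, 4]])   # made-up padded target sequences
num_classes = 5

one_hot = np.zeros(toy_sequences.shape + (num_classes,), dtype="float32")

# Loop version, following the same pattern as the cell above.
for i, row in enumerate(toy_sequences):
    for t, word in enumerate(row):
        one_hot[i, t, word] = 1

# Vectorised version: index arrays pick one (sample, step, word) cell each.
rows, steps = np.indices(toy_sequences.shape)
one_hot_vectorised = np.zeros_like(one_hot)
one_hot_vectorised[rows, steps, toy_sequences] = 1

assert np.array_equal(one_hot, one_hot_vectorised)
```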

Next, we need to create the encoder and the decoder. The input to the encoder will be the sentence in English, and the output will be the hidden state and cell state of the LSTM.

In [21]:
# The following script defines the encoder
encoder_inputs_placeholder = Input(shape=(max_input_len,))
x = embedding_layer(encoder_inputs_placeholder)
encoder = LSTM(LSTM_NODES, return_state=True)

encoder_outputs, h, c = encoder(x)
encoder_states = [h, c]
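To make the hidden state h and the cell state c more concrete, the following NumPy sketch runs a textbook LSTM cell over a short sequence and keeps only the final states. It is an illustration under stated assumptions: the weights are random, the sizes are toy values rather than the notebook's 100 and 256, and the gate layout is chosen for readability and may differ from how Keras stores its weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b stack the four gates along the last axis."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b             # shape: (4 * units,)
    i = sigmoid(z[0 * units:1 * units])      # input gate
    f = sigmoid(z[1 * units:2 * units])      # forget gate
    g = np.tanh(z[2 * units:3 * units])      # candidate cell values
    o = sigmoid(z[3 * units:4 * units])      # output gate
    c = f * c_prev + i * g                   # new cell state
    h = o * np.tanh(c)                       # new hidden state
    return h, c

rng = np.random.default_rng(0)
embedding_dim, units, steps = 4, 3, 5        # toy sizes for illustration
W = rng.normal(size=(embedding_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in rng.normal(size=(steps, embedding_dim)):   # stand-in for embedded words
    h, c = lstm_step(x_t, h, c, W, U, b)

# Only the states after the last step are kept, one value per LSTM unit.
final_states = [h, c]
print(h.shape, c.shape)
```

Both states have one entry per LSTM unit, which is why the decoder LSTM must use the same number of units to accept them as its initial state.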

The next step is to define the decoder. The decoder has two inputs: the hidden state and cell state from the encoder, and the decoder input sentence, which is the output sentence with an <sos> token prepended.

In [22]:
decoder_inputs_placeholder = Input(shape=(max_out_len,))

decoder_embedding = Embedding(num_words_output, LSTM_NODES)
decoder_inputs_x = decoder_embedding(decoder_inputs_placeholder)

decoder_lstm = LSTM(LSTM_NODES, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs_x, initial_state=encoder_states)

In [23]:
# the output from the decoder LSTM is passed through a dense layer to predict decoder outputs
decoder_dense = Dense(num_words_output, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

Finally, we define and compile the model:

In [24]:
model = Model([encoder_inputs_placeholder,
  decoder_inputs_placeholder], decoder_outputs)
model.compile(
    optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

In [25]:
pip install pydot


In [26]:
pip install graphviz


The model has two inputs. input_1 is the input placeholder for the encoder, which is embedded and passed through the lstm_1 layer, which is the encoder LSTM. The lstm_1 layer has three outputs: the output, the hidden state and the cell state. However, only the hidden state and the cell state are passed to the decoder.

Here the lstm_2 layer is the decoder LSTM. input_2 contains the output sentences with the <sos> token prepended. It is also passed through an embedding layer and used as input to the decoder LSTM, lstm_2. Finally, the output from the decoder LSTM is passed through the dense layer to make predictions.

In [28]:
#  train the model using the fit() method
r = model.fit(
    [encoder_input_sequences, decoder_input_sequences],
    decoder_targets_one_hot,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.1,
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


The model is trained on roughly 18,000 records (90%) and validated on the remaining roughly 2,000 records (10%). The model is trained for 20 epochs; you can modify the number of epochs to see if you can get better results. After 20 epochs, I got a training accuracy of 89.83% and a validation accuracy of 78.12%.
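The exact split sizes follow from the number of sentence pairs and validation_split=0.1. The calculation below is a sketch that assumes the split point is the floor of num_samples * (1 - validation_split), with the last samples held out for validation.

```python
num_samples = 19974        # number of sentence pairs reported earlier
validation_split = 0.1

# Assumption: training uses the first split_at samples, validation the rest.
split_at = int(num_samples * (1 - validation_split))
num_train = split_at
num_val = num_samples - split_at
print(num_train, num_val)
```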

**Modifying the Model for Predictions**

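During training, the decoder receives the whole target sentence at once, so the true previous word is available at every step. At prediction time the target is unknown, so the decoder has to generate one word at a time and feed each predicted word back in as the next input. The trained layers are therefore reused in two separate models. The encoder model remains the same: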

In [29]:
encoder_model = Model(encoder_inputs_placeholder, encoder_states)

Since now at each step we need the decoder hidden and cell states, we will modify our model to accept the hidden and cell states as shown below:

In [30]:
decoder_state_input_h = Input(shape=(LSTM_NODES,))
decoder_state_input_c = Input(shape=(LSTM_NODES,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

At each time step there will now be only a single word in the decoder input, so we need to modify the decoder embedding input as follows:

In [31]:
decoder_inputs_single = Input(shape=(1,))
decoder_inputs_single_x = decoder_embedding(decoder_inputs_single)

Next, we need to create the placeholder for decoder outputs:

In [32]:
decoder_outputs, h, c = decoder_lstm(decoder_inputs_single_x, initial_state=decoder_states_inputs)


To make predictions, the decoder output is passed through the dense layer:

In [33]:
decoder_states = [h, c]
decoder_outputs = decoder_dense(decoder_outputs)

The final step is to define the updated decoder model, as shown here:

In [34]:
decoder_model = Model(
    [decoder_inputs_single] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

**Making Predictions**

Words were converted into integers during tokenization, and the decoder also produces integer outputs. However, we want our output to be a sequence of French words, so we must convert the integers back to words. For both inputs and outputs, we create new dictionaries with the integers as keys and the corresponding words as values.

In [37]:
idx2word_input = {v:k for k, v in word2idx_inputs.items()}
idx2word_target = {v:k for k, v in word2idx_outputs.items()}
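As a small worked example of the reverse lookup, the sketch below decodes the padded decoder input printed earlier for the sentence at index 172. The four dictionary entries are read off that printout under the assumption that the tokens map to the integers in order. The toy_ names are illustrative.

```python
# Assumed from the earlier printout: [2, 62, 783, 4] is "<sos> sois gentille !".
toy_word2idx = {"<sos>": 2, "sois": 62, "gentille": 783, "!": 4}
toy_idx2word = {index: word for word, index in toy_word2idx.items()}

padded_sequence = [2, 62, 783, 4, 0, 0, 0, 0, 0, 0, 0, 0]

# Index 0 is padding and has no dictionary entry, so it is skipped.
words = [toy_idx2word[i] for i in padded_sequence if i > 0]
print(" ".join(words))
```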

Next we will create a method, translate_sentence(). The method accepts a padded English input sequence (in integer form) and returns the translated French sentence. Look at the translate_sentence() method:

In [38]:
def translate_sentence(input_seq):
    # The encoder predicts the hidden and cell states for the input sequence.
    states_value = encoder_model.predict(input_seq)

    # target_seq is a 1 x 1 matrix holding the first decoder input, <sos>.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = word2idx_outputs['<sos>']

    # eos stores the integer for <eos>; output_sentence collects the predicted words.
    eos = word2idx_outputs['<eos>']
    output_sentence = []

    # The loop runs at most max_out_len times, the length of the longest output sentence.
    for _ in range(max_out_len):
        # Predict the next-word scores and the new states from the current
        # token and states (in the first iteration: <sos> and the encoder states).
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        idx = np.argmax(output_tokens[0, 0, :])

        # Stop as soon as the decoder predicts <eos>.
        if eos == idx:
            break

        # Index 0 is padding; any other index is looked up in idx2word_target.
        if idx > 0:
            word = idx2word_target[idx]
            output_sentence.append(word)

        # Feed the predicted word and the updated states into the next iteration.
        target_seq[0, 0] = idx
        states_value = [h, c]

    # Join the predicted words with spaces and return the resulting string.
    return ' '.join(output_sentence)
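The control flow of this loop, often called greedy decoding, can be tried without a trained model. In the sketch below, toy_step is a hypothetical stand-in for decoder_model.predict, and all token ids are made up. It simply favours the next index, so the loop's stopping behaviour is easy to follow.

```python
def greedy_decode(step_fn, initial_state, sos_id, eos_id, max_len):
    """Feed the last predicted token back in until <eos> or max_len is reached."""
    token, state, output = sos_id, initial_state, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        # Pick the index with the highest score (argmax over the vocabulary).
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == eos_id:
            break
        if token > 0:            # index 0 is padding and is never emitted
            output.append(token)
    return output

# Hypothetical stand-in for the decoder: it always favours the next index,
# so starting from <sos> = 1 it proposes 2, then 3, then <eos> = 4.
def toy_step(token, state):
    scores = [0.0] * 5
    scores[min(token + 1, 4)] = 1.0
    return scores, state + 1

decoded = greedy_decode(toy_step, initial_state=0, sos_id=1, eos_id=4, max_len=10)
print(decoded)
```

Greedy decoding keeps only the single most likely word at each step, which is simple but means an early mistake cannot be corrected later in the sentence.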

**Testing the Model**

To test the code, we will randomly choose a sentence from the input_sentences list, retrieve the corresponding padded sequence for the sentence, and will pass it to the translate_sentence() method. The method will return the translated sentence as shown below.

In [39]:
# test the functionality of the model
i = np.random.choice(len(input_sentences))
input_seq = encoder_input_sequences[i:i+1]
translation = translate_sentence(input_seq)
print('-')
print('Input:', input_sentences[i])
print('Response:', translation)

-
Input: They just left.
Response: ils sont partir.


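The model has translated the sampled English sentence into French. The output is close to, but not exactly, grammatical French (the infinitive "partir" should be the participle "partis"), which is plausible given the validation accuracy reported above.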