<b><h1><center>Recurrent Neural Network (using Dataset = Europarl)</h1>
<b><h1><center>Machine Translation

This tutorial extends the idea of Natural Language Processing to Machine Learning of Human Languages. Two Recurrent Neural Networks will be combined together to serve the purpose.
The pre-requisites for this tutorial include basic know ledge of Linear Algebra, Machine Learning and Classification, Recurrent Neural Networks and Layers, Python programming language, Jupyter Notebook editor, TensorFlow and Keras.
The Neural Network in this tutorial consists of two parts;
-  The first part consistes of an encoder that converts the source text into a thought vector and gives a summary of the input as the output.
-  The second part is the decoder that takes the output of the encoder as its input and decodes the thought vector to the destination-text.
Neural Networks does not work with texts directly, therefore, the texts i the datasets are converted into integers also called tokens. The Neural Networks also does not work with integer numbers, therefore, the integer tokens are converted into vectors of floating point values using an Embedding Layer.
In other words, the entire dataset is converted into tokens using a tokenizer. These tokens are then converted into embedding vectors using an embedding layer. The embedding vectors are then fed into the Neural Network that contains three GRU Layers.
The last GRU Layer summarizes the text into a single thought vector which is then fed into initial state of the GRU Units in the decoder part.
The destinations texts are padded with special markers "ssss" and "eeee". These markers indicate the beginning and ending of each text sequence. 

<b><h2>Importing Libraries

Following are the libraries that will be used for the entire tutorial.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import math
import os

In [2]:
from tensorflow.python.keras.models import Model
from tensorflow.python.keras.layers import Input, Dense, GRU, Embedding
from tensorflow.python.keras.optimizers import RMSprop
from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

<b><h2>Importing Data

The dataset used in this tutorial is Europarl dataset which consists of sentence-pairs in most European languages.

In [3]:
import europarl

In this tutorial, the German-English dataset is used. The file europarl.py contains other language codes if another dataset of Euoparl is needed.

In [4]:
language_code='de'

If the data files are not already there, the below command will automatically download and extract them.

In [5]:
europarl.maybe_download_and_extract(language_code=language_code)

Data has apparently already been downloaded and unpacked.


The decoder needs to understand the start and end of each text sequence. Therefore, the begginings and endings of each text sequence is marked with words that are most likely not to occur in the dataset.

In [6]:
starting_marker = 'ssss'
ending_marker = ' eeee'

Load the datasets for source (German) and destination (English) languages in separate variables.

In [7]:
data_source = europarl.load_data(english=False,
                                 language_code=language_code)

data_destination = europarl.load_data(english=True,
                                      language_code=language_code,
                                      start=starting_marker,
                                      end=ending_marker)

The project uses more RAM with complete data. Therefore, in this tutorial, the data can be sliced according to the device's technicality. But if the device is capable of running full dataset, then there is no need to slice.

In [8]:
data_source = data_source[0:5000]
data_destination = data_destination[0:5000]

<b><h2>Tokenizer

Neural Networks does not work with text data. Therefore, the texts in the dataset will be converted into integers also called tokens. These tokens will then be converted into embedding vectors so that it can be fed to the the Neural Network. The Tokenizer can be instructed to use a particular number of frequently used or popular words. In this tutorial, the tokenizer is instructed to use 10,000 most popular words from the dataset.

In [9]:
num_words = 1000

Keras Tokenizer does not provide all the required functionalities. Therefore, the following functions are wrapped up in the Tokenizer Class.

In [10]:
class TokenizerWrap(Tokenizer):
    def __init__(self, texts, padding, reverse=False, num_words=None):
        
        Tokenizer.__init__(self, num_words=num_words)
        
        #Builiding Texts Vocabulary
        self.fit_on_texts(texts)
        
        #Converting all texts of different sequence lengths into integers (tokens)
        self.tokens = self.texts_to_sequences(texts)
        
        #Inverse Mapping to convert integers back to words
        self.index_to_words = dict(zip(self.word_index.values(), self.word_index.keys()))
        
        if reverse:
            #Reversing the token sequences
            self.tokens = [list(reversed(x)) for x in self.tokens]
            
            #Long sequences will be truncated from the beginning
            #Since the sequences are already reversed,
            #it will be correspond to the original text as it was truncated from the end
            truncating = 'pre'
            
        else:
            
            #Long sequences will be truncated from the end
            truncating = 'post'
        
        #Calculating the total number of tokens in each sequence because they have different lengths
        self.number_of_tokens = [len(x) for x in self.tokens]
        
        #The maximum number of tokens in each sequence that will be allowed
        #is set to the average plus two standard deviations that will cover around 95% of the dataset.
        self.maximum_sequence_length = int(np.mean(self.number_of_tokens) + 2 * np.std(self.number_of_tokens))
        
        #Tokens with equences longer than maximum sequence length allowed will be truncated
        #Tokens with equences shorter than maximum sequence length allowed will be padded
        self.tokens_padded = pad_sequences(self.tokens,
                                           maxlen=self.maximum_sequence_length,
                                           padding=padding,
                                           truncating=truncating)
    
    #Function to look for a single word from its token
    def token_to_word(self, token):
        word = " " if token == 0 else self.index_to_words[token]
        return word
    
    #Function to convert a list of tokens into a sequence (string) of words
    def tokens_to_string(self, tokens):
        
        #Creating list of individual words
        words = [self.index_to_words[token] for token in tokens if token !=0]
        
        #Concatenating the list of words
        text = " ".join(words)
        return text
    
    #Function to create a string of words into a list of tokens
    #Reversal and Padding is optional
    def text_to_tokens(self, text, reverse=False, padding=False):
        
        #Converting string of words into a list of tokens
        tokens = np.array(self.texts_to_sequences([text]))
        
        if reverse:
            #Reversing the sequence
            tokens = np.flip(tokens, axis=1)
            
            #Truncating the reversed sequence from the beginning
            truncating = 'pre'
        else:
            #Truncating the sequence from the end
            truncating = 'post'
            
        if padding:
            #Padding the sequence
            tokens = pad_sequences(tokens, maxlen=self.maximum_sequence_length, padding='pre', truncating=truncating)
        return tokens

Creating the Tokenizer for the source language.
The literature suggests that the performance is better with reversing the sequences for the source texts.
Because the sequences are reversed, the 'pre' padding option is selected.

In [11]:
%%time
tokenizer_source = TokenizerWrap(texts=data_source,
                              padding='pre',
                              reverse=True,
                              num_words=num_words)

Wall time: 1.44 s


Creating the Tokenizer for the destination language.
The sequences are not revesed for the destination language sequences and the padding option is 'post' that pads the sequences at the end to get the length of maximum sequence length allowed.

In [12]:
%%time
tokenizer_destination = TokenizerWrap(texts=data_destination,
                               padding='post',
                               reverse=False,
                               num_words=num_words)

Wall time: 771 ms


Converting the source and destination text sequences into tokens. The source and destination sets might have different maximum lengths of sequences because the number of words in one language may be different with the number of words in the other language for a same sentence.

In [13]:
tokens_source = tokenizer_source.tokens_padded
tokens_destination = tokenizer_destination.tokens_padded
print(tokens_source.shape)
print(tokens_destination.shape)

(5000, 40)
(5000, 48)


Token for the starting marker in the destination language.

In [14]:
starting_token = tokenizer_destination.word_index[starting_marker.strip()]
starting_token

244

Token for the ending marker in the destination language.

In [15]:
ending_token = tokenizer_destination.word_index[ending_marker.strip()]
ending_token

2

<b><h2>Training Data

The text sequences are converted into token sequences for both the source and the destination languages. Therefore, the data is now being prepared to be fed into the Neural Network.

The input to the Encoder is the padded and truncated token sequences produced by the Tokenizer for the source data.

In [16]:
encoder_input_data = tokens_source

The input to the Decoder is the padded and truncated token sequences produced by the Tokenizer for the destination data.
The input and output data for the decoder is identical, except shifted one time-step. This is because of the fact that the starting and ending markers were added into the destination data.

In [17]:
decoder_input_data = tokens_destination[:, :-1]
decoder_input_data.shape

(5000, 47)

In [18]:
decoder_output_data = tokens_destination[:, 1:]
decoder_output_data.shape

(5000, 47)

<b><h2>Creating the Neural Network

<b><h3>Creating the Encoder

First, Keras API is used to build an encoder that will map the token sequences from the source langugae into a thought vector. The objects of all the layers of the Neural Network are created at first and then they will be connected later. 

Batches of token sequences will be taken as input for the encoder. The None indicates that it can use arbitrary length of sequences.

In [19]:
encoder_input = Input(shape=(None, ), name='encoder_input')

Integer tokens are converted into embedding vector of length specified in the below command. The values of the embedding vectors will range between -1 and 1. 

In [20]:
embedding_size = 128

Now create the Embedding Layer.

In [21]:
encoder_embedding = Embedding(input_dim=num_words,
                              output_dim=embedding_size,
                              name='encoder_embedding')

The size of the internal state of the Gated Recurrent Units has to be specified as well. It will be the same for both Encoder and the Decoder part.

In [22]:
state_size = 512

For the encoder, three GRU Layers will be created to map the embedding vector sequences into a single thought vector which will summarize the input contents. The last GRU layer will only return a single thought vector which will be fed to the decoder later.

In [23]:
encoder_gru1 = GRU(state_size, name='encoder_gru1',
                   return_sequences=True)
encoder_gru2 = GRU(state_size, name='encoder_gru2',
                   return_sequences=True)
encoder_gru3 = GRU(state_size, name='encoder_gru3',
                   return_sequences=False)

Below is the helper function that will connect all the layers of the encoder.

In [24]:
def connect_encoder():
    # Starting with the input-layer
    layer = encoder_input
    
    # Connecting the embedding-layer
    layer = encoder_embedding(layer)

    # Connecting all the GRU-layers
    layer = encoder_gru1(layer)
    layer = encoder_gru2(layer)
    layer = encoder_gru3(layer)

    # Storing the output
    encoder_output = layer
    
    return encoder_output

The output of the last GRU layer is the thought vector needed to be fed into the decoder.

All the encoder layers are connected together using the helper function made for this purpose and the output is stored in a variable that will be fed to the decoder.

In [25]:
encoder_output = connect_encoder()

<b><h3>Creating the Decoder

The decoder will map the thought cevtor into a sequence of tokens.
The decoder needs two inputs:
-  The first input will be the thought vector produced by the encoder that will be used as the decoder's initial state.
-  The second input of the decoder will be the token sequences of the destination text with added starting and ending markers.
The None indicates that it can use arbitrary length of sequences.

In [26]:
decoder_initial_state = Input(shape=(state_size,),
                              name='decoder_initial_state')
decoder_input = Input(shape=(None, ), name='decoder_input')

Integer tokens are converted into embedding vector of the specified length. The values of the embedding vectors will range between -1 and 1. Now create the Embedding Layer.

In [27]:
decoder_embedding = Embedding(input_dim=num_words,
                              output_dim=embedding_size,
                              name='decoder_embedding')

For the decoder, three GRU Layers will be created to map the embedding vector sequences into a single thought vector which will summarize the input contents. All the GRU layers will return sequences which will be used to output a sequence of integer tokens that will be later translated into a text sequence.

In [28]:
decoder_gru1 = GRU(state_size, name='decoder_gru1',
                   return_sequences=True)
decoder_gru2 = GRU(state_size, name='decoder_gru2',
                   return_sequences=True)
decoder_gru3 = GRU(state_size, name='decoder_gru3',
                   return_sequences=True)

The output of the GRU Layers will be in the shape of [batch_size, sequence_length, state_size]. Each word will be encoded as a vector of length state_size. These vectors will be converted into the integer tokens which will then be used to be translated into texts.
For this purpose, a Fully-Connected Dense layer will be used where the avtivation function is set to linear.

In [29]:
decoder_dense = Dense(num_words,
                      activation='linear',
                      name='decoder_output')

Below is the helper function that will connect all the layers of the encoder.

In [30]:
def connect_decoder(initial_state):
    # Starting with the input-layer.
    layer = decoder_input

    # Connecting the embedding-layer.
    layer = decoder_embedding(layer)
    
    # Connecting all the GRU-layers.
    layer = decoder_gru1(layer, initial_state=initial_state)
    layer = decoder_gru2(layer, initial_state=initial_state)
    layer = decoder_gru3(layer, initial_state=initial_state)

    # Connecting dense layer that converts to one-hot encoded arrays.
    decoder_output = decoder_dense(layer)
    
    return decoder_output

<b><h3>Connecting and Creating the Models

Encoder and Decoder will be connected in different ways for different uses.
First, the Decoder will be directly connected to the Encoder to create a one complete model that will be used for training. The Decoder's initial state will be set to the output of the Encoder.

In [31]:
decoder_output = connect_decoder(initial_state=encoder_output)

model_train = Model(inputs=[encoder_input, decoder_input],
                    outputs=[decoder_output])

Second, a model for the encoder alone will be created that will summarize the input contents by mapping the token sequences into a thought vector.

In [32]:
model_encoder = Model(inputs=[encoder_input],
                      outputs=[encoder_output])

Third, a model will be created for the decoder alone which will input the initial state of the decoder's GRU units.

In [33]:
decoder_output = connect_decoder(initial_state=decoder_initial_state)

model_decoder = Model(inputs=[decoder_input, decoder_initial_state],
                      outputs=[decoder_output])

All the models will use the same weights and variables for the encoder and decoder. Only the way they are connected is different. The complete model will be used for training the Neural Network after which the Encoder and Decoder models will be run separately with the trained weights.

<b><h3>Loss Function

To train the model and produce the desired result, a loss function like cross-entropy is used. The dataset used in this tutorial contains texts that are converted into integer tokens. Therefore, sparse cross-entropy loss-function, which does the conversion internally from integers to one-hot encoded arrays will be used.
The sparse cross-entropy is used with softmax function ti imporove numeriacal stability.
The loss function is calculated across the entire batch and entire sequences and then its average is taken to get a single scalor value.

The decoder outputs a 3-rank tensor with shape [batch_size, sequence_length, num_words]which contains batches of sequences of one-hot encoded arrays. It will be compared to a 2-rank tensor with shape [batch_size, sequence_length] containing sequences of integer-tokens.
This comparison is done with a sparse-cross-entropy function directly from TensorFlow

In [34]:
def sparse_cross_entropy(y_true, y_pred):
    
    #y_true is the 2-rank tensor
    #y_pred is the 3-rank tensor

    # Calculating the loss that outputs a 2-rank tensor of shape [batch_size, sequence_length]
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true,
                                                          logits=y_pred)
    #Calculating the average
    loss_mean = tf.reduce_mean(loss)

    return loss_mean

<b><h3>Compiling the Training Model

The optimizer used in this tutorial is RMSprop.

In [35]:
optimizer = RMSprop(lr=1e-3)

Create a placeholder variable for the decoder's output. The shape is set to (None, None) which means the batch can have an arbitrary number of sequences, which can have an arbitrary number of integer-tokens.

In [36]:
decoder_target = tf.placeholder(dtype='int32', shape=(None, None))

Compiling the Model.

In [37]:
model_train.compile(optimizer=optimizer,
                    loss=sparse_cross_entropy,
                    target_tensors=[decoder_target])

<b><h3>Callback Functions

During training, the checkboards are saved and the progress is logged into the TensorFlow board. Therefore, appropriate callbacks for Keras are created.

Callbacks to write checkpoints during the training.

In [38]:
path_checkpoint = '21_checkpoint.keras'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
                                      monitor='val_loss',
                                      verbose=1,
                                      save_weights_only=True,
                                      save_best_only=True)

When performance starts to worsen on the validation set, the optimization is stopped by this callback.

In [39]:
callback_early_stopping = EarlyStopping(monitor='val_loss',
                                        patience=3, verbose=1)

This is the callback for writing the TensorBoard log during training.

In [40]:
callback_tensorboard = TensorBoard(log_dir='./21_logs/',
                                   histogram_freq=0,
                                   write_graph=False)

Storing all callbacks in one variable.

In [41]:
callbacks = [callback_early_stopping,
             callback_checkpoint,
             callback_tensorboard]

<b><h3>Load Checkpoint

The last saved checkpoint can be reloaded so that the model does not need to be trained every time it is used.

In [42]:
try:
    model_train.load_weights(path_checkpoint)
except Exception as error:
    print("Error while Checkpoint Loading.")
    print(error)

Error while Checkpoint Loading.
`load_weights` requires h5py.


<b><h3>Train the Model

The data is wrapped in named dicts so that it is correctly passed into the inputs and outputs of the model.

In [43]:
input_data = \
{
    'encoder_input': encoder_input_data,
    'decoder_input': decoder_input_data
}

In [44]:
output_data = \
{
    'decoder_output': decoder_output_data
}

Preparing the validation set.

In [45]:
validation_split = 10000 / len(encoder_input_data)
validation_split

2.0

Training the Model.

In [46]:
model_train.fit(x=input_data,
                y=output_data,
                batch_size=64,
                epochs=10,
                validation_split=validation_split,
                callbacks=callbacks)

Epoch 1/10

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 2/10


--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 3/10

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 4/10

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 5/10


--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 6/10

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 7/10


--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 8/10

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  



  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\IPython\core\interactiveshell.py", line 2856, in run_ast_nodes
    if self.run_code(code, result):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-46-27b86ea60bba>", line 6, in <module>
    callbacks=callbacks)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\keras\_impl\keras\engine\training.py", line 1793, in fit
    validation_steps=validation_steps)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\keras\_impl\keras\engine\training.py", line 1302, in _fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\keras\_impl\keras\callbacks.py", line 94, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File


Epoch 9/10

--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  


Epoch 10/10


--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 986, in emit
    msg = self.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 836, in format
    return fmt.format(record)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 573, in format
    record.message = record.getMessage()
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\logging\__init__.py", line 336, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Khizer\Anaconda3\envs\tensorflow\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
  




<tensorflow.python.keras._impl.keras.callbacks.History at 0x1a313e37e80>

<b><h2>Texts Translation

Below is the helper function that translates the text from source language into the destination language and prints it optionally.

In [47]:
#Function to translate a text string.
def translate(input_text, true_output_text=None):

    # Convert the input-text to a reversed sequence of integer-tokens.
    input_tokens = tokenizer_source.text_to_tokens(text=input_text,
                                                reverse=True,
                                                padding=True)
    
    # Geting the output of the encoder's GRU that is used as the initial state in the decoder's GRU.
    initial_state = model_encoder.predict(input_tokens)

    # Maximum number of tokens / words in the output sequence.
    maximum_sequence_length = tokenizer_destination.maximum_sequence_length

    # Pre-allocating the 2-dim array that is used as decoder's input holding a single sequence of integer-tokens.
    shape = (1, maximum_sequence_length)
    decoder_input_data = np.zeros(shape=shape, dtype=np.int)

    # The first input-token is the start-token for the starting marker 'ssss '.
    token_int = starting_token

    # Initialize an empty output-text.
    output_text = ''

    # Initializing the number of tokens processed.
    count_tokens = 0

    # Run while loop until the end-token for ' eeee' is not sampled
    # and the maximum number of tokens are not processed
    while token_int != ending_token and count_tokens < maximum_sequence_length:
        # Update the input-sequence to the decoder
        # with the last token that was sampled.
        # In the first iteration this will set the
        # first element to the start-token.
        decoder_input_data[0, count_tokens] = token_int

        # Wrap the input-data in a dict for clarity and safety,
        # to assure the data is input in correct order.
        input_data = \
        {
            'decoder_initial_state': initial_state,
            'decoder_input': decoder_input_data
        }

        # Input data to decoder and get predicted output.
        decoder_output = model_decoder.predict(input_data)

        # Get the last predicted token as a one-hot encoded array.
        token_onehot = decoder_output[0, count_tokens, :]
        
        # Convert to an integer-token.
        token_int = np.argmax(token_onehot)

        # Lookup the word corresponding to this integer-token.
        sampled_word = tokenizer_destination.token_to_word(token_int)

        # Append the word to the output-text.
        output_text += " " + sampled_word

        # Increment the token-counter.
        count_tokens += 1

    # Sequence of tokens output by the decoder.
    output_tokens = decoder_input_data[0]
    
    # Print the input-text.
    print("Input text:")
    print(input_text)
    print()

    # Print the translated output-text.
    print("Translated text:")
    print(output_text)
    print()

    # Optionally print the true translated text.
    if true_output_text is not None:
        print("True output text:")
        print(true_output_text)
        print()

<b><h2>Examples

Running a few translation examples from the dataset.

In [48]:
index = 3
translate(input_text=data_source[index],
          true_output_text=data_destination[index])

Input text:
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen.

Translated text:
 the president the of the european union is to be a of fundamental rights eeee

True output text:
ssssYou have requested a debate on this subject in the course of the next few days, during this part-session. eeee



In [49]:
index = 4
translate(input_text=data_source[index],
          true_output_text=data_destination[index])

Input text:
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen -, allen Opfern der Stürme, insbesondere in den verschiedenen Ländern der Europäischen Union, in einer Schweigeminute zu gedenken.

Translated text:
 parliament adopted the resolution to the commission in the of the european union eeee

True output text:
ssssIn the meantime, I should like to observe a minute' s silence, as a number of Members have requested, on behalf of all the victims concerned, particularly those of the terrible storms, in the various countries of the European Union. eeee



Running a few self made examples.

In [50]:
translate(input_text="Was kostet das?",
          true_output_text='How much is this?')

Input text:
Was kostet das?

Translated text:
 eeee

True output text:
How much is this?



In [51]:
translate(input_text="Ich will es nicht",
          true_output_text='I do not want it')

Input text:
Ich will es nicht

Translated text:
 eeee

True output text:
I do not want it



<b><h1><center>