In [None]:
"""
The MIT License (MIT)
Copyright (c) 2021 NVIDIA
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
"""


This code example demonstrates how to build a neural machine translation network. It is a sequence-to-sequence network based on a Transformer encoder-decoder architecture. More context for this code example can be found in video 7.6 "Programming Example: Machine Translation Using Transformer with TensorFlow" in the video series "Learning Deep Learning: From Perceptron to Large Language Models" by Magnus Ekman (Video ISBN-13: 9780138177614).

The data used to train the model is expected to be in the file ../data/fra.txt.
We begin by importing modules that we need for the program.

In [None]:
import keras_nlp
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text \
    import text_to_word_sequence
from tensorflow.keras.preprocessing.sequence \
    import pad_sequences
from keras_nlp.layers import TransformerEncoder
from keras_nlp.layers import TransformerDecoder
import tensorflow as tf
import logging
import math
import numpy as np
import random
tf.get_logger().setLevel(logging.ERROR)


Next, we define some constants. We specify a vocabulary size of 10,000 symbols, out of which four indices are reserved for padding, out-of-vocabulary words (denoted as UNK), START tokens, and STOP tokens. Our training corpus is large, so we set the parameter READ_LINES to the number of lines in the input file we want to use in our example (60,000). The parameter LAYER_SIZE defines the width of the intermediate fully-connected layer in the Transformer, and the embedding layers output 128 dimensions (EMBEDDING_WIDTH). We use 20% (TEST_PERCENT) of the dataset as test set and further select 20 sentences (SAMPLE_SIZE) to inspect in detail during training. We limit the length of the source and destination sentences to, at most, 60 words (MAX_LENGTH). Finally, we provide the path to the data file, where each line is expected to contain two versions of the same sentence (one in each language) separated by a tab character.

In [None]:
# Constants
EPOCHS = 20
BATCH_SIZE = 128
MAX_WORDS = 10000
READ_LINES = 60000
NUM_HEADS = 8
LAYER_SIZE = 256
EMBEDDING_WIDTH = 128
TEST_PERCENT = 0.2
SAMPLE_SIZE = 20
OOV_WORD = 'UNK'
PAD_INDEX = 0
OOV_INDEX = 1
START_INDEX = MAX_WORDS - 2
STOP_INDEX = MAX_WORDS - 1
MAX_LENGTH = 60
SRC_DEST_FILE_NAME = '../data/fra.txt'


The next code snippet shows the function used to read the input data file and do some initial processing. Each line is split into two strings, where the first contains the sentence in the destination language and the second contains the sentence in the source language. We use the function text_to_word_sequence() to clean the data somewhat (make everything lowercase and remove punctuation) and split each sentence into a list of individual words. If the list (sentence) is longer than the maximum allowed length, then it is truncated.

In [None]:
# Function to read file.
def read_file_combined(file_name, max_len):
    src_word_sequences = []
    dest_word_sequences = []
    # The context manager closes the file even if an error occurs.
    with open(file_name, 'r', encoding='utf-8') as file:
        for i, line in enumerate(file):
            if i == READ_LINES:
                break
            pair = line.split('\t')
            # Second column: sentence in the source language.
            word_sequence = text_to_word_sequence(pair[1])
            src_word_sequence = word_sequence[0:max_len]
            src_word_sequences.append(src_word_sequence)
            # First column: sentence in the destination language.
            word_sequence = text_to_word_sequence(pair[0])
            dest_word_sequence = word_sequence[0:max_len]
            dest_word_sequences.append(dest_word_sequence)
    return src_word_sequences, dest_word_sequences
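
To make the cleaning step concrete, the following minimal sketch processes a single line in the same layout as the data file. The helper simple_word_sequence() is a hypothetical stand-in that imitates the default behavior of text_to_word_sequence() (lowercase, replace the filtered characters with spaces, split on spaces), and the sample line is made up for illustration.

```python
# Hypothetical stand-in for text_to_word_sequence() with default settings.
# Assumption: the filtered characters are the punctuation below plus tab
# and newline. Note that the apostrophe is not filtered.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text):
    # Lowercase, turn every filtered character into a space, then split
    # on spaces and drop the empty strings.
    table = str.maketrans({c: ' ' for c in FILTERS})
    cleaned = text.lower().translate(table)
    return [w for w in cleaned.split(' ') if w]

# Made-up line: destination sentence, tab, source sentence.
line = "I'm a student.\tJe suis étudiant.\n"
pair = line.split('\t')
dest_words = simple_word_sequence(pair[0])
src_words = simple_word_sequence(pair[1])
print(dest_words)
print(src_words)
```

Accented characters such as "é" are kept, because only the listed punctuation characters are filtered.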


The next code snippet shows functions used to turn sequences of words into sequences of tokens, and vice versa. We call tokenize() a single time for each language, so the argument sequences is a list of lists where each of the inner lists represents a sentence. The Tokenizer class assigns indices to the most common words and returns either these indices or the reserved OOV_INDEX for less common words that did not make it into the vocabulary. We tell the Tokenizer to use a vocabulary of 9998 (MAX_WORDS-2), that is, to use only indices 0 to 9997, so that we can use indices 9998 and 9999 as our START and STOP tokens. (The Tokenizer does not support the notion of START and STOP tokens, but it does reserve index 0 for padding and index 1 for out-of-vocabulary words.) Our tokenize() function returns both the Tokenizer object itself and the tokenized sequences. The Tokenizer object will be needed anytime we want to convert tokens back into words.

The function tokens_to_words() requires a Tokenizer and a list of indices. We simply check for the reserved indices: If we find a match, we replace them with hardcoded strings, and if we find no match, we let the Tokenizer convert the index to the corresponding word string. The Tokenizer expects a list of lists of indices and returns a list of strings, which is why we need to call it with [[index]] and then select the 0th element to arrive at a string.


In [None]:
# Functions to tokenize and un-tokenize sequences.
def tokenize(sequences):
    # "MAX_WORDS-2" used to reserve two indices
    # for START and STOP.
    tokenizer = Tokenizer(num_words=MAX_WORDS-2,
                          oov_token=OOV_WORD)
    tokenizer.fit_on_texts(sequences)
    token_sequences = tokenizer.texts_to_sequences(sequences)
    return tokenizer, token_sequences

def tokens_to_words(tokenizer, seq):
    word_seq = []
    for index in seq:
        if index == PAD_INDEX:
            word_seq.append('PAD')
        elif index == OOV_INDEX:
            word_seq.append(OOV_WORD)
        elif index == START_INDEX:
            word_seq.append('START')
        elif index == STOP_INDEX:
            word_seq.append('STOP')
        else:
            word_seq.append(tokenizer.sequences_to_texts(
                [[index]])[0])
    print(word_seq)
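
The index layout described above can be illustrated with a small pure-Python sketch. It is a simplified stand-in for the Tokenizer, not its actual implementation: build_word_index() and to_tokens() are hypothetical helpers, and the vocabulary is shrunk to 10 entries so that the reserved indices are easy to see.

```python
from collections import Counter

MAX_WORDS = 10              # tiny vocabulary (the example above uses 10,000)
PAD_INDEX = 0
OOV_INDEX = 1
START_INDEX = MAX_WORDS - 2
STOP_INDEX = MAX_WORDS - 1

def build_word_index(sequences, num_words):
    # The most frequent word gets index 2, the next one index 3, and so on.
    # Words whose index would reach num_words are left out.
    counts = Counter(w for seq in sequences for w in seq)
    ranked = [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(ranked, start=2) if i < num_words}

def to_tokens(sequence, word_index):
    # Words that are not in the vocabulary map to the reserved OOV index.
    return [word_index.get(w, OOV_INDEX) for w in sequence]

sentences = [['i', 'am', 'a', 'student'],
             ['i', 'am', 'a', 'teacher'],
             ['i', 'am', 'here']]
word_index = build_word_index(sentences, MAX_WORDS - 2)
tokens = to_tokens(['i', 'am', 'unknown'], word_index)
print(tokens)
```

Because every assigned index stays below MAX_WORDS-2, the two highest indices remain free for START and STOP.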


Given these helper functions, it is trivial to read the input data file and convert it into tokenized sequences.

In [None]:
# Read file and tokenize.
src_seq, dest_seq = read_file_combined(SRC_DEST_FILE_NAME,
                                       MAX_LENGTH)
src_tokenizer, src_token_seq = tokenize(src_seq)
dest_tokenizer, dest_token_seq = tokenize(dest_seq)


It is now time to arrange the data into tensors that can be used for training and testing. The following example provides some insight into what we need as input and output for a single training example, where src_input is the input to the encoder network, dest_input is the input to the decoder network, and dest_target is the desired output from the decoder network:

src_input = [PAD, PAD, PAD, id("je"), id("suis"), id("étudiant")]
dest_input = [START, id("i"), id("am"), id("a"), id("student"), STOP, PAD, PAD]
dest_target = [one_hot_id("i"), one_hot_id("am"), one_hot_id("a"), one_hot_id("student"), one_hot_id(STOP), one_hot_id(PAD), one_hot_id(PAD), one_hot_id(PAD)]

In the example, id(string) refers to the tokenized index of the string, and one_hot_id is the one-hot encoded version of the index. We have assumed that the longest source sentence is six words, so we padded src_input to be of that length. Similarly, we have assumed that the longest destination sentence is eight words including START and STOP tokens, so we padded both dest_input and dest_target to be of that length. Note how the symbols in dest_input are offset by one location compared to the symbols in dest_target because when we later do inference, the inputs into the decoder network will be coming from the output of the network for the previous timestep. Although this example has shown the training example as being lists, in reality, they will be rows in NumPy arrays, where each array contains multiple training examples.

The padding is done to ensure that we can use mini-batches for training. That is, all source sentences need to be the same length, and all destination sentences need to be the same length. We pad the source input at the beginning (known as prepadding) and the destination at the end (known as postpadding).

The code snippet below shows a compact way of creating the three arrays that we need. The first two lines create two new lists, each containing the destination sequences but the first (dest_target_token_seq) also augmented with STOP_INDEX after each sequence and the second (dest_input_token_seq) augmented with both START_INDEX and STOP_INDEX. It is easy to miss that dest_input_token_seq has a STOP_INDEX, but that falls out naturally because it is created from the dest_target_token_seq for which a STOP_INDEX was just added to each sentence.

Next, we call pad_sequences() on the original src_token_seq list (of lists) and on the two new destination lists. The pad_sequences() function pads the sequences with the PAD value and returns a NumPy array. The default behavior of pad_sequences() is to do prepadding, which is what we want for the source sequences, but we explicitly ask for postpadding for the destination sequences.

You might wonder why there is no call to to_categorical() in the statement that creates the target (output) data. We are used to wanting to have the ground truth one-hot encoded for textual data. Not doing so is an optimization to avoid wasting too much memory. With a vocabulary of 10,000 words, and 60,000 training examples, where each training example is a sentence, the memory footprint of the one-hot encoded data starts becoming a problem. Therefore, instead of one-hot encoding all data up front, there is a way to let Keras deal with that in the loss function itself.
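
A rough back-of-the-envelope calculation shows the scale of the problem. The padded sentence length of 20 tokens and the 4 bytes per stored value are assumptions chosen for illustration; the real numbers depend on the dataset and the data type.

```python
# Rough memory estimate for one-hot targets versus integer targets.
examples = 60_000
vocab_size = 10_000
padded_length = 20      # assumption: padded destination length in tokens
bytes_per_value = 4     # assumption: 4 bytes per stored value

one_hot_bytes = examples * padded_length * vocab_size * bytes_per_value
index_bytes = examples * padded_length * bytes_per_value

print(one_hot_bytes / 1e9, 'GB for one-hot encoded targets')
print(index_bytes / 1e6, 'MB for integer index targets')
```

Under these assumptions the one-hot encoded targets need about 48 GB, while the integer indices need about 4.8 MB, a factor equal to the vocabulary size.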


In [None]:
# Prepare training data.
dest_target_token_seq = [x + [STOP_INDEX] for x in dest_token_seq]
dest_input_token_seq = [[START_INDEX] + x for x in
                        dest_target_token_seq]
src_input_data = pad_sequences(src_token_seq)
dest_input_data = pad_sequences(dest_input_token_seq,
                                padding='post')
dest_target_data = pad_sequences(
    dest_target_token_seq, padding='post',
    maxlen=len(dest_input_data[0]))
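
To see what these statements produce, here is a small pure-Python sketch of the same steps on two made-up sentences. The helpers pre_pad() and post_pad() are hypothetical stand-ins for pad_sequences() (without its truncation behavior), and the index values are arbitrary.

```python
PAD, START, STOP = 0, 8, 9   # small made-up index values

def pre_pad(seqs, pad=PAD):
    # Pad at the beginning so that all rows get the same length.
    width = max(len(s) for s in seqs)
    return [[pad] * (width - len(s)) + s for s in seqs]

def post_pad(seqs, pad=PAD, width=None):
    # Pad at the end, optionally to a given width.
    width = width or max(len(s) for s in seqs)
    return [s + [pad] * (width - len(s)) for s in seqs]

src = [[2, 3, 4], [5, 6, 7, 2, 3, 4]]
dest = [[2, 3, 4, 5], [6, 7]]

dest_target = [s + [STOP] for s in dest]
dest_input = [[START] + s for s in dest_target]

src_in = pre_pad(src)
dest_in = post_pad(dest_input)
dest_out = post_pad(dest_target, width=len(dest_in[0]))

for row in (src_in, dest_in, dest_out):
    print(row)
```

Each row of dest_out is the corresponding row of dest_in shifted one position to the left, which is the offset between decoder input and decoder target described earlier.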


The next code snippet demonstrates how we can manually split our dataset into a training dataset and a test dataset. In previous examples, we either relied on datasets that are already split this way or we used functionality inside of Keras when calling the fit() function. However, in this case, we want some more control ourselves because we will want to inspect a few select members of the test set in detail. We split the dataset by first creating a list test_indices, which contains a 20% (TEST_PERCENT) subset of all the numbers from 0 to N−1, where N is the size of our original dataset. We then create a list train_indices, which contains the remaining 80%. We can now use these lists to select a number of rows in the matrices representing the dataset and create two new collections of matrices, one to be used as training set and one to be used as test set. Finally, we create a third collection of matrices, which only contains 20 (SAMPLE_SIZE) random examples from the test dataset. We will use them to inspect the resulting translations in detail, but since that is a manual process, we limit ourselves to a small number of sentences.


In [None]:
# Split into training and test set.
rows = src_input_data.shape[0]
all_indices = list(range(rows))
test_rows = int(rows * TEST_PERCENT)
test_indices = random.sample(all_indices, test_rows)
# A set gives fast membership tests when building the training indices.
test_index_set = set(test_indices)
train_indices = [x for x in all_indices if x not in test_index_set]

train_src_input_data = src_input_data[train_indices]
train_dest_input_data = dest_input_data[train_indices]
train_dest_target_data = dest_target_data[train_indices]

test_src_input_data = src_input_data[test_indices]
test_dest_input_data = dest_input_data[test_indices]
test_dest_target_data = dest_target_data[test_indices]

# Create a sample of the test set that we will inspect in detail.
test_indices = list(range(test_rows))
sample_indices = random.sample(test_indices, SAMPLE_SIZE)
sample_input_data = test_src_input_data[sample_indices]
sample_target_data = test_dest_target_data[sample_indices]
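
The properties we rely on (the two index lists are disjoint and together cover the whole dataset) can be checked on a toy example. This is a sketch with made-up sizes; the fixed seed is there only to make the run repeatable.

```python
import random

random.seed(0)          # fixed seed only to make the sketch repeatable
rows = 10               # made-up dataset size
test_percent = 0.2

all_indices = list(range(rows))
test_indices = random.sample(all_indices, int(rows * test_percent))
test_set = set(test_indices)
train_indices = [i for i in all_indices if i not in test_set]

# No example may appear in both sets, and none may be lost.
print(len(train_indices), 'training rows,', len(test_indices), 'test rows')
print('disjoint:', test_set.isdisjoint(train_indices))
print('complete:', sorted(train_indices + test_indices) == all_indices)
```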


The Transformer has no built-in notion of word order, so we need to add a positional encoding to each embedding. We do this by creating a class PositionalEmbedding that extends the Embedding class. We calculate the positional encoding using sine and cosine as in the original Transformer paper and add it to the embedding.


In [None]:
class PositionalEmbedding(Embedding):
    def __init__(self, max_len, *args, **kwargs):
        super(PositionalEmbedding, self).__init__(*args, **kwargs)
        self.max_len = max_len
        self.positional_encodings = self.create_positional_encodings()

    def create_positional_encodings(self):
        i_range = np.arange(self.output_dim).reshape(1, self.output_dim)
        pos_range = np.arange(self.max_len).reshape(self.max_len, 1)
        sine_matrix = np.sin(1 / np.power(10000, i_range/self.output_dim) * pos_range)
        cosine_matrix = np.cos(1 / np.power(10000, (i_range-1)/self.output_dim) * pos_range)
        pos_matrix = np.zeros((self.max_len, self.output_dim))
        for i in range(self.output_dim):
            if (i % 2 == 0):
                pos_matrix[:, i] = sine_matrix[:, i]
            else:
                pos_matrix[:, i] = cosine_matrix[:, i]
        pos_matrix = pos_matrix.reshape(1, self.max_len, self.output_dim)
        return tf.cast(pos_matrix, dtype=tf.float32)

    def call(self, inputs):
        embeddings = super(PositionalEmbedding, self).call(inputs)
        # Scale by the square root of the embedding width of this layer.
        embeddings = embeddings * math.sqrt(self.output_dim)
        length = tf.shape(inputs)[1]
        pos_encodings = self.positional_encodings[:, :length, :]
        return embeddings + pos_encodings

    def compute_mask(self, inputs, mask=None):
        return mask
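
The encoding itself is easier to inspect in isolation. The sketch below recomputes the same table with plain Python for a tiny configuration: even columns k hold sin(pos / 10000^(k/dim)) and odd columns k hold cos(pos / 10000^((k-1)/dim)). The function name positional_encoding() and the sizes are chosen for illustration only.

```python
import math

def positional_encoding(max_len, dim):
    # One row per position, one column per embedding dimension.
    table = []
    for pos in range(max_len):
        row = []
        for k in range(dim):
            if k % 2 == 0:
                row.append(math.sin(pos / 10000 ** (k / dim)))
            else:
                row.append(math.cos(pos / 10000 ** ((k - 1) / dim)))
        table.append(row)
    return table

pe = positional_encoding(max_len=4, dim=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always yields the pattern 0, 1, 0, 1, and columns further to the right vary more slowly with the position, which is what lets the model tell nearby and distant positions apart.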


We are now ready to build our model. It consists of an encoder part and a decoder part. The encoder consists of a PositionalEmbedding layer and two Transformer encoder modules stacked on top of each other. The decoder consists of a PositionalEmbedding layer, two Transformer decoder modules stacked on top of each other, and a fully connected softmax layer. We define the encoder and decoder as two separate models, which we later tie together. To be able to express this complex model we need to use the Keras Functional API.

The code snippet below contains the implementation of the encoder model. We first define the layers and then connect them together. Once all layers are connected, we create the actual model by calling the Model() constructor and providing arguments to specify what inputs and outputs will be external to the model. The model takes the source sentence as input and produces the encoded intermediate representation as output.


In [None]:
# Build encoder model.
# Input is input sequence in source language.
enc_embedding_input = Input(shape=(None, ))

# Create the encoder layers.
enc_embedding_layer = PositionalEmbedding(max_len=MAX_LENGTH, input_dim=MAX_WORDS,
                                          output_dim=EMBEDDING_WIDTH, mask_zero=True)
enc_layer1 = TransformerEncoder(intermediate_dim=LAYER_SIZE, num_heads=NUM_HEADS,
                                dropout=0.1)
enc_layer2 = TransformerEncoder(intermediate_dim=LAYER_SIZE, num_heads=NUM_HEADS,
                                dropout=0.1)

# Connect the encoder layers.
enc_embedding_layer_outputs = \
    enc_embedding_layer(enc_embedding_input)


enc_layer1_outputs = enc_layer1(enc_embedding_layer_outputs)
enc_layer2_outputs = enc_layer2(enc_layer1_outputs)

# Build the model.
enc_model = Model(enc_embedding_input, enc_layer2_outputs)
enc_model.summary()


The next code snippet shows the implementation of the decoder model. The first Transformer decoder module takes the output from the embedding layer as one input (for self-attention), and also takes the output from the encoder stack as a second input (for cross-attention). Similarly, the second Transformer decoder module takes the output from the first Transformer decoder module as well as the output from the encoder stack as inputs. The model ends with a fully-connected softmax layer.

We create the model by calling the Model() constructor. The inputs consist of the destination sentence (time shifted by one timestep) and output from the encoder stack.


In [None]:
# Build decoder model.
# Input to the network is input sequence in destination
# language and state from encoder.
dec_state_input = Input(shape=(None, EMBEDDING_WIDTH))
dec_embedding_input = Input(shape=(None,))

# Create the decoder layers.
dec_embedding_layer = PositionalEmbedding(max_len=MAX_LENGTH, input_dim=MAX_WORDS,
                                          output_dim=EMBEDDING_WIDTH, mask_zero=True)
dec_layer1 = TransformerDecoder(intermediate_dim=LAYER_SIZE, num_heads=NUM_HEADS,
                                dropout=0.1)
dec_layer2 = TransformerDecoder(intermediate_dim=LAYER_SIZE, num_heads=NUM_HEADS,
                                dropout=0.1)
dec_layer3 = Dense(MAX_WORDS, activation='softmax')

# Connect the decoder layers.
dec_embedding_layer_outputs = dec_embedding_layer(
    dec_embedding_input)

dec_layer1_outputs = dec_layer1(dec_embedding_layer_outputs,
                                dec_state_input)
dec_layer2_outputs = dec_layer2(dec_layer1_outputs,
                                dec_state_input)
dec_layer3_outputs = dec_layer3(dec_layer2_outputs)

# Build the model.
dec_model = Model([dec_embedding_input,
                   dec_state_input],
                   dec_layer3_outputs)
dec_model.summary()


The next code snippet connects the two models to build a full encoder-decoder network. We decided to use RMSProp as optimizer because some experiments indicate that it performs better than Adam for this specific model. We use sparse_categorical_crossentropy instead of the normal categorical_crossentropy as loss function because we have not one-hot encoded the output data.

Even after connecting the encoder and decoder model to form a joint model, they can both still be used in isolation. If we train the joint model, it will update the weights of the first two models. This is useful because, when we do inference, we want an encoder model that is decoupled from the decoder model.


In [None]:
# Build and compile full training model.
train_enc_embedding_input = Input(shape=(None, ))
train_dec_embedding_input = Input(shape=(None, ))
intermediate_state = enc_model(train_enc_embedding_input)
train_dec_output = dec_model([train_dec_embedding_input,
                             intermediate_state])
training_model = Model([train_enc_embedding_input,
                        train_dec_embedding_input],
                        train_dec_output)
# "learning_rate" replaces the deprecated "lr" argument name.
optimizer = RMSprop(learning_rate=0.001)
training_model.compile(loss='sparse_categorical_crossentropy',
                       optimizer=optimizer, metrics =['accuracy'])
training_model.summary()
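
The reason the sparse loss function works with integer targets is that, for a one-hot target, the categorical cross-entropy reduces to the negative log of the probability assigned to the correct class. The following sketch checks this for a single prediction; both functions are simplified illustrations written for this example, not the Keras implementations.

```python
import math

def categorical_crossentropy(one_hot, probs):
    # Sum over all classes; only the class marked 1.0 contributes.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(index, probs):
    # Look up the probability of the correct class directly.
    return -math.log(probs[index])

probs = [0.1, 0.7, 0.2]         # made-up softmax output
target_index = 1
one_hot = [0.0, 1.0, 0.0]

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(target_index, probs)
print(dense_loss, sparse_loss)
```

Both calls yield the same loss value, but the sparse version never needs the one-hot vector to exist in memory.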


The final code snippet shows how to train and test the model. We create our own training loop where we instruct fit() to train for only a single epoch at a time. We then use our model to create some predictions before going back and training for another epoch. This approach enables some detailed evaluation of just a small set of samples after each epoch.

Most of the code sequence is the loop used to create translations for the smaller set of samples that we created from the test dataset. This piece of code consists of a loop that iterates over all the examples in sample_input_data. We provide the source sentence to the encoder model to create the resulting internal state and store to the variable intermediate_states. We then set the input x to the START token and use the decoder to make a prediction. We retrieve the most probable word and append it to x. We then provide this sequence to the decoder and make a new prediction. We iterate this with a gradually growing input sequence in an autoregressive manner until the model produces a STOP token or until a given number of words have been produced. Finally, we convert the produced tokenized sequences into the corresponding word sequences and print them out.


In [None]:
# Train and test repeatedly.
for i in range(EPOCHS):
    print('step: ' , i)
    # Train model for one epoch.
    history = training_model.fit(
        [train_src_input_data, train_dest_input_data],
        train_dest_target_data, validation_data=(
            [test_src_input_data, test_dest_input_data],
            test_dest_target_data), batch_size=BATCH_SIZE,
        epochs=1)

    # Loop through samples to see result
    for (test_input, test_target) in zip(sample_input_data,
                                         sample_target_data):
        # Run a single sentence through encoder model.
        x = np.reshape(test_input, (1, -1))
        intermediate_states = enc_model.predict(
            x, verbose=0)
        # Provide resulting state and START_INDEX as input
        # to decoder model.
        x = np.array([[START_INDEX]])
        pred_seq = []

        for j in range(MAX_LENGTH):
            # Predict the next-word probabilities for every position
            # in the sequence produced so far.
            preds = dec_model.predict(
                [x, intermediate_states], verbose=0)
            # Find the most probable word.
            word_index = np.asarray(preds[0][j]).argmax()
            pred_seq.append(word_index)
            if word_index == STOP_INDEX:
                break
            x = np.append(x, [[word_index]], axis=1)
        tokens_to_words(src_tokenizer, test_input)
        tokens_to_words(dest_tokenizer, test_target)
        tokens_to_words(dest_tokenizer, pred_seq)
        print('\n\n')
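
The autoregressive loop can be studied without a trained model. In the sketch below, toy_decoder() is a hypothetical stand-in for dec_model.predict(): it returns one probability row per input position and always "translates" to the tokens 2, 3, 4 followed by STOP. The index values and the vocabulary size of 10 are made up.

```python
START_INDEX, STOP_INDEX, MAX_LENGTH = 8, 9, 6

def toy_decoder(tokens):
    # One probability row (over 10 words) per input position.
    rows = []
    for position in range(len(tokens)):
        row = [0.0] * 10
        row[[2, 3, 4, STOP_INDEX][min(position, 3)]] = 1.0
        rows.append(row)
    return rows

def greedy_decode(decoder):
    x = [START_INDEX]
    pred_seq = []
    for j in range(MAX_LENGTH):
        preds = decoder(x)
        # Row j belongs to the most recently added input token.
        word_index = max(range(len(preds[j])), key=preds[j].__getitem__)
        pred_seq.append(word_index)
        if word_index == STOP_INDEX:
            break
        # Feed the chosen word back in as the next decoder input.
        x.append(word_index)
    return pred_seq

result = greedy_decode(toy_decoder)
print(result)
```

As in the training loop above, the decoder is rerun on the whole growing sequence at every step, and only the last row of its output is used to pick the next word.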
