# Language Translation with Transformers

In this notebook, we train a Transformer that takes a Portuguese sentence as input and returns its English translation. This task is commonly referred to as machine translation.

## Setup and Data Handling

We will install [TensorFlow Datasets](https://tensorflow.org/datasets) for loading the dataset and [TensorFlow Text](https://www.tensorflow.org/text) for text preprocessing (Tokenization).

In [1]:
!pip install protobuf~=3.20.3
!pip install -q tensorflow_datasets
!pip install -q -U tensorflow-text tensorflow



In [2]:
import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_text

### Downloading the dataset

We use TensorFlow Datasets to load the [Portuguese-English translation dataset](https://www.tensorflow.org/datasets/catalog/ted_hrlr_translate#ted_hrlr_translatept_to_en) from the TED Talks Open Translation Project. This dataset contains approximately 52,000 training, 1,200 validation and 1,800 test examples.

The `tf.data.Dataset` object returned by TensorFlow Datasets yields pairs of text examples, each representing a sentence.

In [3]:
examples, metadata = tfds.load(
  'ted_hrlr_translate/pt_to_en',
  with_info=True,
  as_supervised=True
)

train_examples, val_examples = examples['train'], examples['validation']

In [4]:
for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))

  print()

  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))

> Examples in Portuguese:
e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
mas e se estes fatores fossem ativos ?
mas eles não tinham a curiosidade de me testar .

> Examples in English:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .


### Set up the tokenizer

We will use a pretrained tokenizer for Portuguese and English provided by TensorFlow. The loaded model contains two text tokenizers, one for English and one for Portuguese, both with the same methods.

Of the available methods, we use `tokenize`, which converts a batch of strings to a batch of token IDs (a `RaggedTensor`, padded later with `to_tensor()`). This method splits punctuation, lowercases and Unicode-normalizes the input before tokenizing.

In [5]:
model_name = 'ted_hrlr_translate_pt_en_converter'
tf.keras.utils.get_file(
  f'{model_name}.zip',
  f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
  cache_dir='.', cache_subdir='', extract=True
)
tokenizers = tf.saved_model.load(model_name)

In [6]:
print('> This is a batch of strings:')
for en in en_examples.numpy():
  print(en.decode('utf-8'))

encoded = tokenizers.en.tokenize(en_examples)
print()

print('> This is a padded-batch of token IDs:')
for row in encoded.to_list():
  print(row)

> This is a batch of strings:
and when you improve searchability , you actually take away the one advantage of print , which is serendipity .
but what if it were active ?
but they did n't test for curiosity .

> This is a padded-batch of token IDs:
[2, 72, 117, 79, 1259, 1491, 2362, 13, 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423, 540, 15, 3]
[2, 87, 90, 107, 76, 129, 1852, 30, 3]
[2, 87, 83, 149, 50, 9, 56, 664, 85, 2512, 15, 3]


### Set up a data pipeline with `tf.data`

The following function takes batches of text as input, and converts them to a format suitable for training.

1. It tokenizes each batch and trims every sequence to at most `MAX_TOKENS` tokens.
2. It splits the target (English) tokens into inputs and labels. These are shifted by one step so that at each input location the `label` is the id of the next token.
3. It converts the `RaggedTensor`s to padded dense `Tensor`s.
4. It returns an `(inputs, labels)` pair.
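
To make the one-step shift concrete, here is a minimal plain-Python sketch that uses the token IDs of the second English example printed earlier. Judging by that output, ID 2 is `[START]` and ID 3 is `[END]`. The helper name `shift_targets` is illustrative and not part of the notebook; `prepare_batch` applies the same slicing to ragged tensors.

```python
def shift_targets(token_ids):
    """Split one tokenized target sentence into decoder inputs and labels."""
    inputs = token_ids[:-1]   # everything except the final [END] token
    labels = token_ids[1:]    # everything except the leading [START] token
    return inputs, labels

# Token IDs for "but what if it were active ?" from the tokenizer output above.
en_tokens = [2, 87, 90, 107, 76, 129, 1852, 30, 3]
en_inputs, en_labels = shift_targets(en_tokens)

# At every position, the label is the token that follows the input token.
for inp, lab in zip(en_inputs, en_labels):
    print(f"input {inp:>5} -> label {lab:>5}")
```

Both lists have the same length, so the decoder sees one input token per position and is scored on the token that comes next.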


In [7]:
MAX_TOKENS=128

def prepare_batch(pt, en):
    pt = tokenizers.pt.tokenize(pt)
    pt = pt[:, :MAX_TOKENS]             # Trim to MAX_TOKENS.
    pt = pt.to_tensor()                 # Convert to 0-padded dense Tensor

    en = tokenizers.en.tokenize(en)
    en = en[:, :(MAX_TOKENS+1)]
    en_inputs = en[:, :-1].to_tensor()  # Drop the [END] tokens
    en_labels = en[:, 1:].to_tensor()   # Drop the [START] tokens

    return (pt, en_inputs), en_labels

BUFFER_SIZE = 20000
BATCH_SIZE = 64

def make_batches(ds):
  return (
    ds
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(prepare_batch, tf.data.AUTOTUNE)
      .prefetch(buffer_size=tf.data.AUTOTUNE)
  )


In [8]:
# Create training and validation set batches.
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)

## Transformer Architecture

### Word Embedding and Positional Encoding

Attention layers treat their input as a set of vectors with no inherent order, so we add a positional encoding to the embedding vectors. It uses a set of sines and cosines at different frequencies across the sequence. By construction, nearby positions have similar encodings.

The original paper uses the following formula for calculating the positional encoding:

$$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})} $$

- The code below implements it, but instead of interleaving the sines and cosines, the vectors of sines and cosines are simply concatenated. Permuting the channels like this is functionally equivalent, and just a little easier to implement and show in the plots below.
- The position encoding function is a stack of sines and cosines that vibrate at different frequencies depending on their location along the depth of the embedding vector.
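
As a small illustration of the first point, the sketch below computes the encoding in plain Python for a tiny `d_model` and shows that the concatenated layout is just a reordering of the paper's interleaved layout. The function names `positional_encoding_rows` and `interleave` are hypothetical helpers written for this example only.

```python
import math

def positional_encoding_rows(length, d_model):
    """Return `length` rows of d_model values: all sines first, then all cosines."""
    half = d_model // 2
    rows = []
    for pos in range(length):
        # i / half equals 2i / d_model, the exponent used in the formula above.
        angles = [pos / (10000 ** (i / half)) for i in range(half)]
        rows.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return rows

def interleave(row):
    """Reorder a concatenated row into the paper's sin, cos, sin, cos layout."""
    half = len(row) // 2
    out = []
    for i in range(half):
        out.extend([row[i], row[half + i]])
    return out

pe = positional_encoding_rows(length=4, d_model=8)
print(pe[0])              # position 0: every sine is 0.0 and every cosine is 1.0
print(interleave(pe[1]))  # position 1 in the interleaved channel order
```

Because the two layouts contain the same values in a different channel order, the layers that follow can learn equally well from either one.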

In [9]:
# length := Number of positions to encode (the maximum sequence length)
# depth  := Dimensionality of the encoding (d_model), split evenly between sines and cosines
def positional_encoding(length, depth):
  depth = depth/2

  positions = np.arange(length)[:, np.newaxis]     # (seq, 1)
  depths = np.arange(depth)[np.newaxis, :]/depth   # (1, depth)

  angle_rates = 1 / (10000**depths)         # (1, depth)
  angle_rads = positions * angle_rates      # (pos, depth)

  pos_encoding = np.concatenate(
    [
      np.sin(angle_rads),
      np.cos(angle_rads)
    ],
    axis=-1
  )

  return tf.cast(pos_encoding, dtype=tf.float32)

In [10]:
class PositionalEmbedding(tf.keras.layers.Layer):
  def __init__(self, vocab_size, d_model):
    super().__init__()
    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(
      vocab_size,
      # Dimensions of the Embedding
      d_model,
      # Tokenizer reserves the 0 index of padding
      mask_zero=True
    )
    self.pos_encoding = positional_encoding(
      length=2048,
      depth=d_model
    )

  def compute_mask(self, *args, **kwargs):
    return self.embedding.compute_mask(*args, **kwargs)

  def call(self, x):
    # The length of the inputs
    length = tf.shape(x)[1]
    x = self.embedding(x)
    # This factor sets the relative scale of the embedding and the positional encoding.
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x = x + self.pos_encoding[tf.newaxis, :length, :]
    return x


### Multi Head Attention Layer

Attention layers are used throughout the model. They are all identical except for how the attention is configured: the decoder's self-attention is causally masked, while the encoder's self-attention and the decoder's cross-attention are not. Each one contains a `layers.MultiHeadAttention`, a `layers.LayerNormalization` and a `layers.Add`.
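
For intuition about what the attention call computes, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is a simplification: `tf.keras.layers.MultiHeadAttention` also applies learned projections and splits the work across several heads, which this example omits. The function names and toy vectors are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # One score per key: how relevant is that position to this query?
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 target positions
context = [[0.5, 0.5], [1.0, -1.0]]           # 2 source positions

self_attn = attention(x, x, x)                # query, key and value are all x
cross_attn = attention(x, context, context)   # query is x; key and value are the context
print(len(self_attn), len(cross_attn))        # one output vector per query position
```

The only difference between self-attention and cross-attention in this sketch is which sequence supplies the keys and values, which mirrors how the layer classes in this section differ.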


In [11]:
# A base class for an attention layer followed by Add & Norm
# It is subclassed three times: cross attention, causal (masked) self-attention and global (unmasked) self-attention
class BaseAttention(tf.keras.layers.Layer):
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.layernorm = tf.keras.layers.LayerNormalization()
    self.add = tf.keras.layers.Add()

In [12]:
# Used in Decoder
class CrossAttention(BaseAttention):
  def call(self, x, context):
    attn_output = self.mha(
      query=x,
      key=context,
      value=context,
    )

    x = self.add([x, attn_output])
    x = self.layernorm(x)

    return x

In [13]:
# Used in the Decoder layer
# This layer must be unidirectional (future tokens are masked), so we set use_causal_mask=True
# In training, the loss is computed at every position at once; at inference, one token is generated per step
class CausalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x,
        use_causal_mask = True)
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x
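
The effect of a causal mask can be sketched without TensorFlow. The example below builds a lower-triangular mask and applies it to uniform attention scores, so each position spreads its attention evenly over itself and earlier positions only. The helper names are hypothetical, and this is an illustration of the idea rather than how Keras implements `use_causal_mask`.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, allowed):
    """Softmax over `scores`, giving zero weight to disallowed positions."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
uniform_scores = [0.0, 0.0, 0.0, 0.0]
weights = [masked_softmax(uniform_scores, row) for row in mask]

for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Every weight above the diagonal is zero, so the representation at a position cannot depend on tokens that come after it.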

In [14]:
# Used in the Encoder
# Data flows bidirectionally, and Q, K and V are all set to the same input
class GlobalSelfAttention(BaseAttention):
  def call(self, x):
    attn_output = self.mha(
        query=x,
        value=x,
        key=x
    )
    x = self.add([x, attn_output])
    x = self.layernorm(x)
    return x

### Feed Forward Network
The network consists of two linear layers (`tf.keras.layers.Dense`) with a ReLU activation in-between, and a dropout layer. As with the attention layers the code here also includes the residual connection and normalization:

In [15]:
class FeedForward(tf.keras.layers.Layer):
  def __init__(self, d_model, dff, dropout_rate=0.1):
    super().__init__()
    self.seq = tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),
      tf.keras.layers.Dense(d_model),
      tf.keras.layers.Dropout(dropout_rate)
    ])
    self.add = tf.keras.layers.Add()
    self.layer_norm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    x = self.add([x, self.seq(x)])
    x = self.layer_norm(x)
    return x
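
As a rough numeric illustration of the block above, this plain-Python sketch runs one vector through two dense layers with a ReLU in between, adds the residual and applies layer normalization. It leaves out dropout and the learned scale and offset of `LayerNormalization`. The weights are made up for the example, and the `eps` default of `1e-3` is chosen to match the default of Keras's `LayerNormalization`.

```python
import math

def dense(vec, weights, biases, relu=False):
    """One dense layer; `weights` holds one weight vector per output unit."""
    out = [sum(x * w for x, w in zip(vec, unit)) + b for unit, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def layer_norm(vec, eps=1e-3):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def feed_forward_block(vec, w1, b1, w2, b2):
    hidden = dense(vec, w1, b1, relu=True)          # d_model -> dff with ReLU
    projected = dense(hidden, w2, b2)               # dff -> d_model
    residual = [x + p for x, p in zip(vec, projected)]
    return layer_norm(residual)

# Toy sizes: d_model = 2, dff = 3 (illustrative values only).
w1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
w2, b2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]

y = feed_forward_block([1.0, -1.0], w1, b1, w2, b2)
print(y)
```

The output has the same size as the input, which is what lets these blocks be stacked one after another.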


### Encoder layer and Encoder

The encoder contains a stack of `N` encoder layers, where each `EncoderLayer` contains a `GlobalSelfAttention` and `FeedForward` layer.

In [16]:
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self,*, d_model, num_heads, dff, dropout_rate=0.1):
    super().__init__()

    self.self_attention = GlobalSelfAttention(
        num_heads=num_heads,
        key_dim=d_model,
        dropout=dropout_rate)
    self.ffn = FeedForward(d_model, dff, dropout_rate)

  def call(self, x):
    x = self.self_attention(x)
    x = self.ffn(x)
    return x

In [17]:
class Encoder(tf.keras.layers.Layer):
  def __init__(
    self,
    *,
    num_layers,
    d_model,
    num_heads,
    dff,
    vocab_size,
    dropout_rate=0.1
  ):
    super().__init__()

    self.d_model = d_model
    self.num_layers = num_layers
    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size, d_model=d_model)
    self.enc_layers = [
        EncoderLayer(
          d_model=d_model,
          num_heads=num_heads,
          dff=dff,
          dropout_rate=dropout_rate
        ) for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

  def call(self, x):
    # x is token-IDs shape: (batch, seq_len)
    x = self.pos_embedding(x)  # Shape (batch_size, seq_len, d_model)

    x = self.dropout(x)
    for i in range(self.num_layers):
      x = self.enc_layers[i](x)

    return x  # Shape (batch_size, seq_len, d_model)

### Decoder Layer and Decoder

In [18]:
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(
    self,
    *,
    d_model,
    num_heads,
    dff,
    dropout_rate=0.1
  ):
    super(DecoderLayer, self).__init__()

    self.causal_self_attention = CausalSelfAttention(
      num_heads=num_heads,
      key_dim=d_model,
      dropout=dropout_rate
    )
    self.cross_attention = CrossAttention(
      num_heads=num_heads,
      key_dim=d_model,
      dropout=dropout_rate
    )
    self.ffn = FeedForward(d_model, dff, dropout_rate)

  def call(self, x, context):
    x = self.causal_self_attention(x=x)
    x = self.cross_attention(x=x, context=context)
    x = self.ffn(x)  # Shape (batch_size, seq_len, d_model)
    return x

In [19]:
class Decoder(tf.keras.layers.Layer):
  def __init__(
    self,
    *,
    num_layers,
    d_model,
    num_heads,
    dff,
    vocab_size,
    dropout_rate=0.1
  ):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers
    self.pos_embedding = PositionalEmbedding(vocab_size=vocab_size, d_model=d_model)
    self.dropout = tf.keras.layers.Dropout(dropout_rate)
    self.dec_layers = [
        DecoderLayer(
          d_model=d_model,
          num_heads=num_heads,
          dff=dff,
          dropout_rate=dropout_rate
        ) for _ in range(num_layers)
      ]

    self.last_attn_scores = None

  def call(self, x, context):
    # x is token-IDs shape (batch, target_seq_len)
    x = self.pos_embedding(x)  # (batch_size, target_seq_len, d_model)
    x = self.dropout(x)
    for i in range(self.num_layers):
      x  = self.dec_layers[i](x, context)

    return x

## Transformer

The `Transformer` model consists of the Encoder, the Decoder, and a final linear (`Dense`) layer that converts the resulting vector at each location into output token logits. These are unnormalized scores; a softmax turns them into probabilities.

The output of the decoder is the input to this final linear layer.

In [20]:
class Transformer(tf.keras.Model):
  def __init__(self, *, num_layers, d_model, num_heads, dff,
               input_vocab_size, target_vocab_size, dropout_rate=0.1):
    super().__init__()

    self.encoder = Encoder(
        num_layers=num_layers,
        d_model=d_model,
        num_heads=num_heads,
        dff=dff,
        vocab_size=input_vocab_size,
        dropout_rate=dropout_rate
      )

    self.decoder = Decoder(
        num_layers=num_layers,
        d_model=d_model,
        num_heads=num_heads,
        dff=dff,
        vocab_size=target_vocab_size,
        dropout_rate=dropout_rate
      )

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs):
    # context: the tokenized source (Portuguese) sentence fed to the encoder
    # x: the target (English) tokens decoded so far, fed to the decoder
    context, x  = inputs

    context = self.encoder(context)
    x = self.decoder(x, context)
    logits = self.final_layer(x)  # (batch_size, target_len, target_vocab_size)

    try:
      # Drop the keras mask, so it doesn't scale the losses/metrics.
      del logits._keras_mask
    except AttributeError:
      pass

    return logits

### Training

In [21]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1

transformer = Transformer(
  num_layers=num_layers,
  d_model=d_model,
  num_heads=num_heads,
  dff=dff,
  input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
  target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
  dropout_rate=dropout_rate
)

Since the target sequences are padded, it is important to apply a padding mask when calculating the loss. Use the cross-entropy loss function (`tf.keras.losses.SparseCategoricalCrossentropy`).

In [22]:
def masked_loss(label, pred):
  mask = label != 0
  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,
    reduction='none'
  )
  loss = loss_object(label, pred)

  mask = tf.cast(mask, dtype=loss.dtype)
  loss *= mask

  loss = tf.reduce_sum(loss)/tf.reduce_sum(mask)
  return loss


def masked_accuracy(label, pred):
  pred = tf.argmax(pred, axis=2)
  label = tf.cast(label, pred.dtype)
  match = label == pred

  mask = label != 0

  match = match & mask

  match = tf.cast(match, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(match)/tf.reduce_sum(mask)
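
To see why the mask matters, here is a small worked example in plain Python. It computes the cross-entropy at four positions, two of which are padding (label 0), and compares the masked mean with a naive mean over all positions. The function names and the toy logits are assumptions made for this illustration.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one position, computed from raw logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]

def masked_mean_loss(labels, logits_per_position, pad_id=0):
    """Average the loss over non-padding positions only."""
    losses = [
        cross_entropy(logits, label)
        for label, logits in zip(labels, logits_per_position)
        if label != pad_id
    ]
    return sum(losses) / len(losses)

labels = [2, 1, 0, 0]                 # the last two positions are padding
logits = [
    [0.0, 0.0, 5.0],                  # confident and correct
    [0.0, 5.0, 0.0],                  # confident and correct
    [0.0, 0.0, 0.0],                  # padding position
    [0.0, 0.0, 0.0],                  # padding position
]

masked = masked_mean_loss(labels, logits)
naive = sum(cross_entropy(lg, lab) for lab, lg in zip(labels, logits)) / len(labels)
print(f"masked: {masked:.4f}  naive: {naive:.4f}")
```

The naive mean is dominated by the padding positions even though the model predicted every real token correctly, which is the distortion the mask removes.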

In [23]:
transformer.compile(
  loss=masked_loss,
  optimizer="adam",
  metrics=[masked_accuracy]
)

transformer.fit(
  train_batches,
  epochs=25,
  validation_data=val_batches
)


## Inference

We now test the model by performing a translation. The following steps are used for inference:

* Encode the input sentence using the Portuguese tokenizer (`tokenizers.pt`). This is the encoder input.
* The decoder input is initialized to the `[START]` token.
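* The padding masks and the look-ahead (causal) masks are computed inside the model: the embedding layer's `mask_zero=True` produces the padding mask, and `use_causal_mask=True` produces the look-ahead mask.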
* The `decoder` then outputs the predictions by looking at the `encoder output` and its own output (self-attention).
* Concatenate the predicted token to the decoder input and pass it to the decoder.
* In this approach, the decoder predicts the next token based on the previous tokens it predicted.
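
The loop described by these steps is greedy decoding, and it can be sketched independently of TensorFlow. In the example below, `toy_next_token_scores` is a hypothetical stand-in for the trained transformer (it simply copies the source tokens and then emits the end token), and the IDs 2 and 3 for `[START]` and `[END]` are taken from the tokenizer output shown earlier.

```python
START, END = 2, 3

def toy_next_token_scores(source_ids, decoded_ids):
    """Hypothetical stand-in for the model: copy the source, then end."""
    vocab_size = 10
    scores = [0.0] * vocab_size
    step = len(decoded_ids) - 1      # number of tokens generated so far
    target = source_ids[step] if step < len(source_ids) else END
    scores[target] = 1.0
    return scores

def greedy_decode(source_ids, next_token_scores, max_length=16):
    """Repeatedly append the highest-scoring token until [END] or max_length."""
    decoded = [START]
    for _ in range(max_length):
        scores = next_token_scores(source_ids, decoded)
        predicted = max(range(len(scores)), key=scores.__getitem__)
        decoded.append(predicted)    # the prediction becomes part of the next input
        if predicted == END:
            break
    return decoded

print(greedy_decode([5, 7, 4], toy_next_token_scores))
# [2, 5, 7, 4, 3]
```

The `max_length` cap matters in practice: if the model never predicts the end token, the loop still terminates after a fixed number of steps.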


In [24]:
class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    self.tokenizers = tokenizers
    self.transformer = transformer

  def __call__(self, sentence, max_length=MAX_TOKENS):
    # The input sentence is Portuguese; the Portuguese tokenizer adds the `[START]` and `[END]` tokens.
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]

    sentence = self.tokenizers.pt.tokenize(sentence).to_tensor()
    encoder_input = sentence

    # As the output language is English, initialize the output with the English `[START]` token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]

    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)

    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())
      predictions = self.transformer([encoder_input, output], training=False)

      # Select the last token from the `seq_len` dimension.
      predictions = predictions[:, -1:, :]              # Shape (batch_size, 1, vocab_size)
      predicted_id = tf.argmax(predictions, axis=-1)

      # Concatenate the `predicted_id` to the output which is given to the decoder as its input.
      output_array = output_array.write(i+1, predicted_id[0])

      if predicted_id == end:
        break

    output = tf.transpose(output_array.stack())
    # The output shape is (1, tokens)
    # Use the tokenizers stored on the instance rather than the global variable.
    text = self.tokenizers.en.detokenize(output)[0]
    tokens = self.tokenizers.en.lookup(output)[0]

    return text, tokens

In [25]:
translator = Translator(tokenizers, transformer)

def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input:":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')

In [26]:
# Example 1
sentence = 'este é um problema que temos que resolver.'
ground_truth = 'this is a problem we have to solve .'

translated_text, translated_tokens = translator(tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Input:         : este é um problema que temos que resolver.
Prediction     : thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank
Ground truth   : this is a problem we have to solve .


In [27]:
# Example 2
sentence = 'os meus vizinhos ouviram sobre esta ideia.'
ground_truth = 'and my neighboring homes heard about this idea .'

translated_text, translated_tokens = translator(tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Input:         : os meus vizinhos ouviram sobre esta ideia.
Prediction     : thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank
Ground truth   : and my neighboring homes heard about this idea .


In [28]:
# Example 3
sentence = 'vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.'
ground_truth = "so i'll just share with you some stories very quickly of some magical things that have happened."

translated_text, translated_tokens = translator(tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Input:         : vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.
Prediction     : thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank thank
Ground truth   : so i'll just share with you some stories very quickly of some magical thi