# Sequence-to-sequence modeling with RNNs and Transformers

<a href="https://colab.research.google.com/drive/1eg8wxp-rNu_be1fDRUjsHQZ54LIGHd9i" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

Language translation is the task of converting text written in one language (the source language) into another (the target language). In machine learning (ML), this typically means training models to automate the process.

There are several key challenges in building an effective ML-based language translation system. One is the need to handle the variability and complexity of natural language, which can involve ambiguous or context-dependent meanings, idioms, and other factors that can make translation difficult.

Another challenge is the vast number of possible language pairs, as well as the need to support multiple target languages for a given source language. Despite these challenges, ML-based language translation systems have progressed significantly in recent years. They are now used in various applications, including translating websites, documents, and spoken language in real time.

One of the most significant recent advances in language modeling and _sequence-to-sequence_ machine translation was made possible by the invention of the `transformer` architecture.

<img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="drawing" height="450"/>

[Source](https://machinelearningmastery.com/the-transformer-model/).

A `transformer` model is a type of neural network architecture first described in the 2017 paper "_[Attention Is All You Need](https://arxiv.org/abs/1706.03762)_" by Vaswani et al. Transformers were initially used for machine translation, but this adaptable architecture is now used in various fields and problems.

In contrast to conventional convolutional neural networks (`CNNs`) or recurrent neural networks (`RNNs`), the `transformer` model uses `self-attention` mechanisms to dynamically weigh the input features and compute a weighted sum of them. This is especially useful in natural language processing, where the meaning of a word can depend on the other words in the sentence: self-attention lets the model capture long-range dependencies in the input data more effectively.

This notebook will show sequence-to-sequence modeling on a machine translation task, which is precisely the task for which the `Transformer` was developed. We'll create a recurrent sequence model (`GRU`) and the complete `Transformer` architecture for comparison.

We'll work with an [`English-to-Portuguese`](https://www.kaggle.com/datasets/nageshsingh/englishportuguese-translation) translation dataset. The downloaded text file contains one example per line: an English sentence, followed by a tab character, followed by the corresponding Portuguese sentence.

Let us load our dataset from the Hub and print the first ten samples. We also need to wrap the samples of our target language with special tokens (`[start]` and `[end]`) to help the generative part of our models understand _when a sentence starts and ends_.


In [None]:
!pip install huggingface_hub -q

from huggingface_hub import hf_hub_download

portuguese_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/GRU-eng-por",
    filename="eng-por.txt",
    repo_type='model',
    local_dir="./")

with open("eng-por.txt", encoding='utf-8') as f:
    lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
    english, portuguese, _ = line.split("\t")
    portuguese = "[start] " + portuguese + " [end]"
    text_pairs.append((english, portuguese))
display(text_pairs[:10])

[('Go.', '[start] Vai. [end]'),
 ('Go.', '[start] Vá. [end]'),
 ('Hi.', '[start] Oi. [end]'),
 ('Run!', '[start] Corre! [end]'),
 ('Run!', '[start] Corra! [end]'),
 ('Run!', '[start] Corram! [end]'),
 ('Run.', '[start] Corre! [end]'),
 ('Run.', '[start] Corra! [end]'),
 ('Run.', '[start] Corram! [end]'),
 ('Who?', '[start] Quem? [end]')]

Now, let us divide our dataset into the customary training, validation, and test sets after shuffling them.

In [None]:
import random

random.shuffle(text_pairs)

num_val_samples = int(0.15 * len(text_pairs))

num_train_samples = len(text_pairs) - 2 * num_val_samples

train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Each language needs its own vocabulary, so we will now create two separate `TextVectorization` layers (one for English and one for Portuguese). The "`[start]`" and "`[end]`" tokens that we have inserted must also be kept. The characters "`[`" and "`]`" would usually be stripped, but we want to leave them in.

So that we can reuse the learned vocabularies, we will save them as `txt` files. All the words learned by the `TextVectorization` layers are in these files.

Both vocabularies will have a maximum of 20,000 words, and all sequences will be truncated after 20 tokens/words.
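
As a rough illustration of the standardization step, the sketch below mimics in plain Python what the custom standardization is meant to do: lowercase the text and strip punctuation while keeping the square brackets. The helper name `standardize` is hypothetical, and plain `re` is used here instead of TensorFlow string ops.

```python
import re
import string

# Punctuation to strip, keeping "[" and "]" so the special tokens survive
strip_chars = string.punctuation.replace("[", "").replace("]", "")

def standardize(text):
    # Hypothetical pure-Python stand-in for the TensorFlow standardization
    return re.sub(f"[{re.escape(strip_chars)}]", "", text.lower())

print(standardize("[start] Vá. [end]"))
print(standardize("Don't do it!"))
```

The first call keeps `[start]` and `[end]` intact while dropping the period; the second shows that apostrophes are removed as well.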

> **Note:** Modern language models use other tokenization techniques, like [Byte Pair Encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (BPE). BPE helps create better vocabularies and avoids unknown tokens by breaking down words into subword units. This allows the model to handle various language variations and new words efficiently. To learn how to train large language models from scratch, visit our [other repository](https://github.com/Nkluge-correa/Aira).

In [None]:
from keras import layers
import tensorflow as tf
import string
import re

# Select characters to strip, but preserve the "[" and "]"
strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    """
    Applies custom text standardization to the input string.
    Converts all characters to lowercase and removes any special
    characters in `strip_chars` from the string.

    Args:
        input_string: A string of text to standardize.

    Returns:
        The input string after converting all characters to lowercase
        and removing special characters.
    """
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# Size of the vocabulary
vocab_size = 20000

# Maximum sequence length
sequence_length = 20

# Initiate tokenizer from the source (English)
source_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length
    )

# Initiate tokenizer from the target (Portuguese)
target_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
    )

# Separate pairs of (english, portuguese) sentences into different lists
train_english_texts = [pair[0] for pair in train_pairs]
train_portuguese_texts = [pair[1] for pair in train_pairs]

# Train each tokenizer on their respective languages
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_portuguese_texts)

# Save the vocabularies for later use
portuguese_vocab = target_vectorization.get_vocabulary()

# The `with` statement closes the file automatically
with open('portuguese_vocabulary.txt', 'w', encoding='utf-8') as fp:
    for word in portuguese_vocab:
        fp.write("%s\n" % word)

english_vocab = source_vectorization.get_vocabulary()

with open('english_vocabulary.txt', 'w', encoding='utf-8') as fp:
    for word in english_vocab:
        fp.write("%s\n" % word)

Now, we will create a data pipeline using our dataset. The dataset will be transformed into `(inputs, target)` batches of `batch_size = 32`, where the target is the Portuguese sentence offset one step ahead (at each position, the next token of the decoder input). Meanwhile, inputs are a dictionary with two keys: "`english`" (the encoder input, i.e., the English sentence) and "`portuguese`" (the decoder input, i.e., the Portuguese sentence without its last token).

In [None]:
batch_size = 32

def format_dataset(eng, por):
    """
    Formats the input English and Portuguese datasets
    for training a neural machine translation model.

    Args:
        eng (numpy.ndarray): A 2D numpy array
            of English sentences.
        por (numpy.ndarray): A 2D numpy array
            of Portuguese sentences.

    Returns:
        tuple: A tuple containing the formatted English
        and Portuguese datasets ready for training, where
        the first element is a dictionary with the 'english' and
        'portuguese' keys corresponding to the input and output
        sequences respectively, and the second element is a 2D numpy
        array of the Portuguese target sequences.
    """
    eng = source_vectorization(eng)
    por = target_vectorization(por)
    return ({
    "english": eng,
    "portuguese": por[:, :-1],
    }, por[:, 1:])

def make_dataset(pairs):
    """
    Create a TensorFlow dataset from a list of pairs of
    English and Portuguese sentences.

    Args:
        pairs: A list of tuples, where each tuple contains an
        English sentence and its corresponding Portuguese sentence.

    Returns:
        A TensorFlow dataset consisting of pairs of English and
        Portuguese sentences.
    """
    eng_texts, por_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    por_texts = list(por_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, por_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=4)
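    # Cache before shuffling so that each epoch is reshuffled; 42 is the
    # shuffle buffer size (in batches), not a random seed
    return dataset.cache().shuffle(42).prefetch(16)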

# Create the train and validation datasets
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
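
The one-step offset between the decoder input and the target can be seen on a single toy sequence. This is a minimal sketch with NumPy standing in for the TensorFlow tensors; the token ids are made up for illustration (only `0` as the padding index follows the `TextVectorization` convention).

```python
import numpy as np

# Hypothetical token ids: 0 = padding, 2 = [start], 3 = [end]
por = np.array([[2, 15, 7, 42, 3, 0]])

# Same slicing as in `format_dataset`
decoder_inputs = por[:, :-1]  # everything except the last token
targets = por[:, 1:]          # everything except the first token

print(decoder_inputs)
print(targets)
```

At every position, the target holds the token that comes right after the corresponding decoder input token, which is what the model is trained to predict.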

Now we can train our first model, an `RNN`. `Recurrent neural networks` dominated `sequence-to-sequence` learning from 2015–2017 before being surpassed by the Transformer. They were the basis for many real-world machine-translation systems, like the [Google Translate circa 2017](https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation), which was powered by a stack of eight large `LSTM` layers.

But for this notebook, we will use [`GRUs`](https://en.wikipedia.org/wiki/Gated_recurrent_unit) instead of an [`LSTM`](https://en.wikipedia.org/wiki/Long_short-term_memory). The GRU is like an `LSTM` with a forget gate but has fewer parameters than LSTM, as it lacks an output gate.

If you wish, you can stack more `GRU` layers in the encoder and decoder parts. As it is, this model already has more than 40M parameters. Stacking recurrent layers will slow training down: given the _sequential processing nature_ of `RNNs`, there are limits to how much we can parallelize.

We are giving each word in our vocabulary an embedding of dimension 256. This dense vector represents the "_relationships of a word with another_." Each `GRU` layer will have a hidden state with 1024 units.

> Note: An `embedding` layer is a neural network layer that maps categorical variables into a continuous vector space, such as words in a vocabulary. The goal of the embedding layer is to represent each word or categorical variable in a way that captures its relevant semantic and syntactic properties.

> Note:  A language model's `latent dimension` refers to the underlying vector space representing a language's learned semantic and syntactic features. In other words, it is the number of hidden variables representing the language model's internal state.
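
The parameter count quoted above can be checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not output from Keras: the helper functions are hypothetical, and the GRU formula assumes the Keras default `reset_after=True`, which gives each gate two bias vectors.

```python
# Hyperparameters taken from the notebook
vocab_size = 20000
embed_dim = 256
latent_dim = 1024

def embedding_params(vocab, dim):
    # One vector of size `dim` per vocabulary entry
    return vocab * dim

def gru_params(input_dim, units):
    # Three weight groups (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel, and two bias vectors
    # (assumes the Keras default `reset_after=True`)
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim

total = (
    2 * embedding_params(vocab_size, embed_dim)   # encoder and decoder embeddings
    + 2 * gru_params(embed_dim, latent_dim)       # bidirectional encoder GRU
    + gru_params(embed_dim, latent_dim)           # decoder GRU
    + dense_params(latent_dim, vocab_size)        # output projection
)
print(f"{total:,}")
```

Under these assumptions the total comes to 42,554,912, consistent with the "more than 40M" figure. Almost half of it (20,500,000) sits in the final dense layer that projects onto the vocabulary.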

In [None]:
from tensorflow import keras
from keras import layers

# The dimensionality of the embedding layer
embed_dim = 256

# The dimensionality of the GRU hidden state (latent space)
latent_dim = 1024

# Encoder Block
source = keras.Input(shape=(None,), dtype="int64", name="english")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoder_gru = layers.Bidirectional(layers.GRU(latent_dim), merge_mode="sum")(x)

# Decoder Block
past_target = keras.Input(shape=(None,), dtype="int64", name="portuguese")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim, return_sequences=True)

# Encoder-Decoder connection
x = decoder_gru(x, initial_state=encoder_gru)

# Dropout layer (for regularization)
x = layers.Dropout(0.5)(x)

# Final layer is a dense neural network that outputs a vector of size `vocab_size`
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)

seq2seq_rnn = keras.Model([source, past_target], target_next_step)

seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

seq2seq_rnn.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 english (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 portuguese (InputLayer)     [(None, None)]               0         []                            
                                                                                                  
 embedding (Embedding)       (None, None, 256)            5120000   ['english[0][0]']             
                                                                                                  
 embedding_1 (Embedding)     (None, None, 256)            5120000   ['portuguese[0][0]']          
                                                                                              

Now, all that is left is to train our `seq2seq_rnn`. Be mindful that without a GPU, the training of this model could take many hours.

> Note: You can skip the training code block below and use the pre-trained model (`GRU-eng-por`) available on the Hub. 🤗


In [None]:
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

callbacks = [keras.callbacks.ModelCheckpoint("GRU-eng-por.h5",
                                                save_best_only=True),
            keras.callbacks.EarlyStopping(monitor="val_loss",
                                            patience=10,
                                            verbose=1,
                                            mode="auto",
                                            baseline=None,
                                            restore_best_weights=True)]

seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds, callbacks=callbacks)

Accuracy is a crude way to monitor validation-set performance during this task. On average, this model correctly predicts the next word in the Portuguese sentence $65\%$ of the time. However, next-token accuracy isn't an excellent metric for machine translation models. It assumes that the correct target tokens from $0$ to $N$ are already known when predicting token $N+1$, whereas during inference you're generating the target sentence from scratch and can't rely on previously generated tokens being $100\%$ correct. We would likely use "_BLEU scores_" in real-world machine translation applications to evaluate our models.

`BLEU` (_bilingual evaluation understudy_) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. Quality is the correspondence between a machine's output and that of a human: "_the closer a machine translation is to a professional human translation, the better it is_" – this is the central idea behind BLEU.
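
To make the idea concrete, here is a simplified, sentence-level sketch of the BLEU computation in plain Python: clipped n-gram precision combined with a brevity penalty. It is not a reference implementation (standard BLEU is computed at corpus level, usually with n-grams up to 4, and may use several references); the function names and the reference sentence are assumptions made for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count every contiguous run of `n` tokens
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def simple_bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean

# Assumed human reference for "What do you all do?" (illustrative only)
reference = "o que vocês fazem".split()
candidate = "o que vocês faz".split()
score = simple_bleu(candidate, reference)
print(round(score, 4))
```

With this assumed reference, the unigram precision is 3/4 and the bigram precision is 2/3, so the score is $\sqrt{0.5} \approx 0.71$; an exact match would score 1.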

We can use sentences from our test set or write some English sentences to test our model. Given an English sentence, we will feed the decoder block a `[start]` token, together with the encoded version of the English sentence. The decoder will then auto-regressively generate the next token, append it to the decoded sequence (`[[start], [new token], ...]`), and repeat until the `[end]` token is generated or the maximum sentence length is reached.

Below, we import our trained model and vocabulary and test it on some sample text.

In [None]:
!pip install huggingface_hub["tensorflow"] -q

from huggingface_hub import from_pretrained_keras
from huggingface_hub import hf_hub_download
import tensorflow as tf
import numpy as np
import string
import re


# Select characters to strip, but preserve the "[" and "]"
strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# Load the `seq2seq_rnn` from the Hub
seq2seq_rnn = from_pretrained_keras("AiresPucrs/GRU-eng-por")

# Load the portuguese vocabulary
portuguese_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/GRU-eng-por",
    filename="portuguese_vocabulary.txt",
    repo_type='model',
    local_dir="./")

# Load the english vocabulary
english_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/GRU-eng-por",
    filename="english_vocabulary.txt",
    repo_type='model',
    local_dir="./")

# The `with` statement closes each file automatically
with open(portuguese_vocabulary_path, encoding='utf-8', errors='backslashreplace') as fp:
    portuguese_vocab = [line.strip() for line in fp]

with open(english_vocabulary_path, encoding='utf-8', errors='backslashreplace') as fp:
    english_vocab = [line.strip() for line in fp]

# Initialize the vectorizers with the learned vocabularies
target_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                        output_mode="int",
                                        output_sequence_length=21,
                                        standardize=custom_standardization,
                                        vocabulary=portuguese_vocab)

source_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                        output_mode="int",
                                        output_sequence_length=20,
                                        vocabulary=english_vocab)

# Create a dictionary from `int` to Portuguese words
portuguese_index_lookup = dict(zip(range(len(portuguese_vocab)), portuguese_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    """
    Decodes a sequence using a trained seq2seq RNN model.

    Args:
        input_sentence (str): the input sentence to be decoded

    Returns:
        decoded_sentence (str): the decoded sentence
            generated by the model
    """
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"

    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict([tokenized_input_sentence, tokenized_target_sentence], verbose=0)
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = portuguese_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence

eng_sentences =["What is its name?",
                "How old are you?",
                "I know you know where Mary is.",
                "We will show Tom.",
                "What do you all do?",
                "Don't do it!"]

for sentence in eng_sentences:
    print(f"English sentence:\n{sentence}")
    print(f'Portuguese translation:\n{decode_sequence(sentence)}')
    print('-' * 50)


English sentence:
What is its name?
Portuguese translation:
[start] qual é o nome [end]
--------------------------------------------------
English sentence:
How old are you?
Portuguese translation:
[start] quantos anos você tem [end]
--------------------------------------------------
English sentence:
I know you know where Mary is.
Portuguese translation:
[start] eu sei que você sabe onde maria está [end]
--------------------------------------------------
English sentence:
We will show Tom.
Portuguese translation:
[start] nós vamos tom [end]
--------------------------------------------------
English sentence:
What do you all do?
Portuguese translation:
[start] o que vocês faz [end]
--------------------------------------------------
English sentence:
Don't do it!
Portuguese translation:
[start] não faça isso [end]
--------------------------------------------------


As you can see, translations are far from perfect. To improve this model, we could:

1. Use a deep stack of recurrent layers for both the encoder and the decoder.
2. Or, we could use an `LSTM` instead of a `GRU`.

However, `RNNs` have limitations when it comes to sequence-to-sequence tasks and language modeling in general. For example, due to their propensity to gradually forget the past, `RNNs` struggle to handle extremely long sequences. Consequently, `RNN`-based models struggle to retain long-term context, which is necessary for translating lengthy documents (one of the reasons why many online translation tools limit the size of the input).

Because of these limitations, the machine learning community has embraced the `Transformer` architecture for sequence-to-sequence problems. And sequence-to-sequence learning is the task where Transformers excel the most.

Below, you can see a depiction of a full sequence-to-sequence Transformer.

![transformer-block](https://d2l.ai/_images/transformer.svg)

_source_: [Dive into Deep Learning 11. Attention Mechanisms and Transformers - 11.7. The Transformer Architecture](https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html).

> Note: To learn more about the transformer architecture, we recommend the [LLM Visualization](https://bbycroft.net/llm) tool.

The transformer model consists of two main components: the `encoder` and the `decoder`.

The `encoder` is responsible for converting the input sequence into a set of internal representations that capture the relevant information in the input. It applies a series of `self-attention` and `feedforward layers` to the input sequence. The `self-attention` layers allow the model to dynamically weigh the input features and compute a weighted sum of the features, which allows the model to capture long-range dependencies in the input data.

The `decoder` is responsible for generating the output sequence based on the internal representations produced by the `encoder`. It does this by applying a series of `self-attention` and `feedforward layers` to the tokens generated so far, together with a set of additional "_context_" vectors that are computed from the `encoder` output. The `decoder` uses a second `attention` mechanism (often called cross-attention) to weigh these context vectors and incorporate information from the `encoder` output into the generation of the output sequence.

Together, the `encoder` and `decoder` allow the `transformer` model to effectively process and translate input sequences and capture long-range dependencies in the data.

For an extremely _comprehensive_ and _illustrated_ explanation of what "_attention_" is or how a "_transformer works_", we also recommend the work of _Jay Alammar_:

- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).
- [The Illustrated GPT-2](https://jalammar.github.io/illustrated-gpt2/).

But the general "_gist_" of attention and self-attention is this:

In a `transformer` model, `self-attention` is calculated using three input matrices: the query matrix ($Q$), the key matrix ($K$), and the value matrix ($V$). The `self-attention` mechanism computes a weighted sum of the value matrix ($V$) based on the similarity between the queries ($Q$) and the keys ($K$).

The query, key, and value matrices are typically derived from the input data by applying a linear transformation to the input features. The dimensions of the matrices depend on the size of the input and the number of attention "_heads_" used by the model.

To compute `self-attention`, the model first computes the dot product of the query and key matrices ($QK^T$), which produces a matrix of similarity scores. The scores are then divided by the square root of the dimensionality of the key vectors ($\sqrt{d_k}$) to keep them from growing too large, which would push the `softmax` into regions with very small gradients. Finally, the scaled scores are passed through a `softmax` function to produce a matrix of attention weights in which each row sums to $1$.

The attention weights are then used to weight the value matrix ($V$) and compute a weighted sum of the values, which is then used as the output of the `self-attention` mechanism. This process is repeated for each attention head, and the output of all the attention heads is concatenated to produce the final `self-attention` output.

The `self-attention` mechanism allows the model to dynamically weigh the input features and compute a weighted sum of the features based on the similarity between the queries and the keys. This allows the model to capture long-range dependencies in the input data and effectively process input sequences of variable length.

Here is the equation for self-attention calculation in transformer models:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where
- $Q$ is the query matrix.
- $K$ is the key matrix.
- $V$ is the value matrix.
- $d_k$ is the dimensionality of the key vectors (the number of columns of $K$).
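
The equation can be written out directly in NumPy. This is a minimal single-head sketch on random data, without masking or learned projections; the function names and the toy shapes (4 tokens, $d_k = 8$) are chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Each row becomes a probability distribution over the keys
    weights = softmax(scores, axis=-1)
    # Weighted sum of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)          # one output vector per query
print(weights.sum(axis=-1))  # each row of weights sums to 1
```

In a multi-head layer, this computation would be repeated on separately projected versions of $Q$, $K$, and $V$, and the per-head outputs concatenated.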

## Transformer Encoder

Fundamentally, the `encoder` part is a "_text classification_" machine, being a very generic module that takes in a sequence and learns how to transform it into a more useful representation. We used the `encoder` part to make a toxicity classifier in this [notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/master/ML%20Intro%20Course/15_toxicity_detection.ipynb). Below, we implement the `encoder` block using the subclassing features of the `Keras API`.


In [None]:
import tensorflow as tf
from tensorflow import keras
from keras import layers

class TransformerEncoder(layers.Layer):
    """
    The TransformerEncoder class is a custom Keras layer that implements a
    single transformer encoder block. The transformer encoder block consists
    of a multi-head self-attention layer followed by a feedforward neural
    network, with a residual connection and layer normalization applied
    after each of these two sub-layers.

    The class takes in the following arguments:

        embed_dim: an integer specifying the dimensionality of the embedding space.
        dense_dim: an integer specifying the number of units in the feedforward neural network.
        num_heads: an integer specifying the number of attention heads to use.

    The call method is the main computation performed by the layer. It takes
    in an input tensor and an optional mask tensor indicating which inputs to
    consider in the attention calculation. It returns the output tensor of the
    transformer encoder block.

    The get_config method returns a dictionary of configuration information for
    the layer, including the embed_dim, num_heads, and dense_dim parameters.
    """
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

## Positional Embedding

`Self-attention` is an order-agnostic technique. However, as done by [Vaswani et al.](https://arxiv.org/abs/1706.03762), we can manually inject order information in the representations through _positional encoding_ (i.e., give the model access to word-order information).

The original "_Attention is all you need_" paper added to the word embeddings a vector containing values in the range [-1, 1] that varied cyclically depending on the position (using sine and cosine functions of different frequencies).
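
A minimal NumPy sketch of the sinusoidal encoding described in the paper, where even dimensions use a sine and odd dimensions a cosine of the position, scaled by a dimension-dependent frequency. The function name is hypothetical and the sketch assumes an even `embed_dim`; it is shown for reference only.

```python
import numpy as np

def sinusoidal_positional_encoding(sequence_length, embed_dim):
    # Column vector of positions: shape (sequence_length, 1)
    positions = np.arange(sequence_length)[:, np.newaxis]
    # Even dimension indices 0, 2, 4, ...: shape (1, embed_dim / 2)
    even_dims = np.arange(0, embed_dim, 2)[np.newaxis, :]
    # pos / 10000^(2i / d_model)
    angles = positions / np.power(10000.0, even_dims / embed_dim)

    encoding = np.zeros((sequence_length, embed_dim))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions
    return encoding

encoding = sinusoidal_positional_encoding(sequence_length=20, embed_dim=256)
print(encoding.shape)
print(encoding.min(), encoding.max())
```

The resulting matrix has one row per position and would be added to the token embeddings, so that identical words at different positions receive different representations.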

We'll simply use the `tf.keras.layers.Embedding` layer to create a positional encoding parallel to the normal `Embedding` layer.

> Note: There are other techniques to add positional information to discrete sequences, like [Rotary Position Embeddings](https://arxiv.org/abs/2104.09864).

In [None]:
class PositionalEmbedding(layers.Layer):
    """
    The PositionalEmbedding layer class is used to create an embedding layer that
    combines both token embeddings and positional embeddings for input sequences.

    The class takes in the following arguments:

    sequence_length: An integer representing the maximum length of the input sequence.
    input_dim: An integer representing the size of the input vocabulary.
    output_dim: An integer representing the size of the embedding vectors.

    The call(self, inputs) method takes the input tensor as an argument,
    computes the positions for the input sequence, and returns the sum of
    the token embeddings and the positional embeddings.

    The compute_mask(self, inputs, mask=None) method returns a mask tensor
    that is True wherever the input token is not the padding index (0).

    The get_config(self) method returns a dictionary containing the
    configuration of the layer.
    """
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

## Transformer Decoder

The `transformer decoder` is the model's second half. It reads tokens $0 \dots N$ in the target sequence and tries to predict token $N+1$, just like the `RNN decoder`. While doing so, it applies masked self-attention to the target tokens seen so far, and a second attention layer over the encoder output to determine which tokens in the encoded source sentence are most closely related to the target token it is currently attempting to predict.

An extra complication that the decoder brings is the idea of _causal padding_ (also known as causal masking). Causal padding ensures that the network does not violate the temporal or causal order of the data. Unlike an `RNN`, which looks at its input one step at a time and thus only has access to steps $0 \dots N$ to generate output step $N+1$, the `TransformerDecoder` can look at the entire target sequence at once (without causal padding). If it could use all of its input, it would simply learn to copy input step $N+1$. As a result, the model would achieve perfect training accuracy while learning nothing useful. That is why we need causal padding: to "_hide the future_" from our model during training.

![causal-attention-padding](https://jalammar.github.io/images/gpt2/transformer-attention-mask.png)

_Source:_ [The Illustrated GPT-2 (Visualizing Transformer Language Models)](https://jalammar.github.io/illustrated-gpt2/).
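
As a small illustration in plain Python (the helper names `causal_mask` and `combine_with_padding` are hypothetical), the mask is a lower-triangular matrix built from the same `i >= j` comparison used by the decoder layer in this section. It can be combined with a padding mask by taking the element-wise minimum:

```python
def causal_mask(sequence_length):
    """Row i marks with 1 the positions j that step i may attend to (j <= i)."""
    return [[1 if i >= j else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]

def combine_with_padding(mask, padding):
    """Element-wise minimum: a position is visible only if both masks allow it."""
    return [[min(cell, padding[j]) for j, cell in enumerate(row)] for row in mask]

mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# Assume the last target position is padding (token id 0): it is hidden in every row.
padded = combine_with_padding(mask, padding=[1, 1, 1, 0])
```

Row `i` of the mask tells the attention layer which positions step `i` is allowed to see, so the prediction for each token can only depend on the tokens before it.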



In [None]:
class TransformerDecoder(layers.Layer):
    """
    A Transformer decoder layer that attends over the input
    sequence and the encoder outputs.

    Args:
        embed_dim (int): Dimension of the input embeddings.
        dense_dim (int): Dimension of the dense layer in the feedforward sublayer.
        num_heads (int): Number of attention heads in each multi-head attention layer.

    Attributes:
        attention_1 (MultiHeadAttention): First multi-head attention layer.
        attention_2 (MultiHeadAttention): Second multi-head attention layer.
        dense_proj (Sequential): Feedforward sublayer consisting of two dense layers.
        layernorm_1 (LayerNormalization): Layer normalization layer
            after the first attention layer.
        layernorm_2 (LayerNormalization): Layer normalization layer
            after the second attention layer.
        layernorm_3 (LayerNormalization): Layer normalization layer
            after the feedforward sublayer.
        supports_masking (bool): Whether the layer supports masking.

    Methods:
        get_config(): Returns a dictionary with the configuration of the layer.
        get_causal_attention_mask(inputs): Returns a 3D tensor with a
            causal mask for the given input sequence.
        call(inputs, encoder_outputs, mask=None): Computes the output of
            the layer for the given inputs and encoder outputs.
    """
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
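        causal_mask = self.get_causal_attention_mask(inputs)
        # Default to no mask so `padding_mask` is always defined,
        # even when the layer is called without a mask.
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)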
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

We get the complete end-to-end Transformer when we put all the puzzle pieces together. It simply combines the components we've created thus far: the `PositionalEmbedding` layers, the `TransformerEncoder`, and the `TransformerDecoder`. Similarly to our GRU model, we could stack these decoder and encoder blocks on top of each other to create a more robust model. However, let us keep it small and use just one `encoder-decoder` block.

> Note: The original Transformer model consists of 6 stacked `encoder-decoder` blocks.

In [None]:
# The dimensionality of the embedding layer
embed_dim = 256

# The dimensionality of the MLP (feed forward network)
dense_dim = 2048

# Number of attention heads per block
num_heads = 8

# Vocabulary size
vocab_size = 20000

# Maximum sequence length
sequence_length = 20

# Encoder Block + positional embedding
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

# Decoder Block + positional embedding
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="portuguese")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)

# Dropout layer (for regularization)
x = layers.Dropout(0.5)(x)

# Final layer is a dense layer that outputs a vector of size `vocab_size`
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

transformer.summary()

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 english (InputLayer)        [(None, None)]               0         []                            
                                                                                                  
 portuguese (InputLayer)     [(None, None)]               0         []                            
                                                                                                  
 positional_embedding (Posi  (None, None, 256)            5125120   ['english[0][0]']             
 tionalEmbedding)                                                                                 
                                                                                                  
 positional_embedding_1 (Po  (None, None, 256)            5125120   ['portuguese[0][0]']    
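
The `Param #` value reported for each `PositionalEmbedding` layer can be checked by hand: the layer holds one embedding table for tokens and one for positions, so a quick sanity check (using the hyperparameters defined in the cell above) looks like this:

```python
vocab_size = 20000
sequence_length = 20
embed_dim = 256

token_table_params = vocab_size * embed_dim          # one vector per vocabulary entry
position_table_params = sequence_length * embed_dim  # one vector per position

total = token_table_params + position_table_params
print(total)  # 5125120
```

Almost all of these parameters come from the token table. The learned positional table adds only 5,120 weights.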

As you can see, this model is almost half the size of our RNN and will be a little faster to train! However, if you wish to skip training and use the pre-trained model (`transformer_eng_por.keras`), you can skip the cell below and download the trained model straight from the Hub. 🤗

In [None]:
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

callbacks = [
    keras.callbacks.ModelCheckpoint("transformer-eng-por.h5",
                                    save_best_only=True),
    keras.callbacks.EarlyStopping(monitor="val_loss",
                                  patience=10,
                                  verbose=1,
                                  mode="auto",
                                  baseline=None,
                                  restore_best_weights=True),
]

transformer.fit(train_ds, epochs=30, validation_data=val_ds, callbacks=callbacks)

Version:  2.14.0
Eager mode:  True
GPU is available
Epoch 1/30



Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 28: early stopping


<keras.src.callbacks.History at 0x7f9c2b9d31f0>

Below, we import our trained model and vocabulary and test it on some sample text. Since the model uses custom Keras layer subclasses, we need to pass them to `load_model` through the `custom_objects` argument, like this:


```python
transformer = keras.models.load_model(
    "transformer-eng-por/transformer-eng-por.h5",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding,
                    "TransformerDecoder": TransformerDecoder})
```

In the cell below, you can clone and load the pre-trained model.

In [None]:
!git lfs install
!git clone https://huggingface.co/AiresPucrs/transformer-eng-por

Git LFS initialized.
Cloning into 'transformer-eng-por'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 12 (delta 0), reused 0 (delta 0), pack-reused 4
Unpacking objects: 100% (12/12), 112.21 KiB | 7.01 MiB/s, done.
Filtering content: 100% (2/2), 205.26 MiB | 29.53 MiB/s, done.


In [None]:
from huggingface_hub import hf_hub_download
import tensorflow as tf
import numpy as np
import string
import keras
import re

strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")


def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# Download the module that defines the custom Transformer layers
transformer_blocks_path = hf_hub_download(
    repo_id="AiresPucrs/transformer-eng-por",
    filename="keras_transformer_blocks.py",
    repo_type='model',
    local_dir="./")

from keras_transformer_blocks import TransformerEncoder, PositionalEmbedding, TransformerDecoder

transformer = keras.models.load_model("./transformer-eng-por/transformer-eng-por.h5",
    custom_objects={"TransformerEncoder": TransformerEncoder,
        "PositionalEmbedding": PositionalEmbedding,
        "TransformerDecoder": TransformerDecoder})

# The `with` statement closes each file automatically
with open('portuguese_vocabulary.txt', encoding='utf-8', errors='backslashreplace') as fp:
    portuguese_vocab = [line.strip() for line in fp]

with open('english_vocabulary.txt', encoding='utf-8', errors='backslashreplace') as fp:
    english_vocab = [line.strip() for line in fp]


target_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                        output_mode="int",
                                        output_sequence_length=21,
                                        standardize=custom_standardization,
                                        vocabulary=portuguese_vocab)

source_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                        output_mode="int",
                                        output_sequence_length=20,
                                        vocabulary=english_vocab)

portuguese_index_lookup = dict(zip(range(len(portuguese_vocab)), portuguese_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"

    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = portuguese_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence


eng_sentences = ["What is its name?",
                 "How old are you?",
                 "I know you know where Mary is.",
                 "We will show Tom.",
                 "What do you all do?",
                 "Don't do it!"]

for sentence in eng_sentences:
    print(f"English sentence:\n{sentence}")
    print(f'Portuguese translation:\n{decode_sequence(sentence)}')
    print('-' * 50)

English sentence:
What is its name?
Portuguese translation:
[start] qual é o nome dele [end]
--------------------------------------------------
English sentence:
How old are you?
Portuguese translation:
[start] quantos anos você tem [end]
--------------------------------------------------
English sentence:
I know you know where Mary is.
Portuguese translation:
[start] eu sei que você sabe onde mary está [end]
--------------------------------------------------
English sentence:
We will show Tom.
Portuguese translation:
[start] vamos ligar para o tom [end]
--------------------------------------------------
English sentence:
What do you all do?
Portuguese translation:
[start] o que vocês todos nós têm feito [end]
--------------------------------------------------
English sentence:
Don't do it!
Portuguese translation:
[start] não faça isso [end]
--------------------------------------------------
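
The `decode_sequence` function above performs _greedy_ decoding: at every step it feeds the model everything generated so far and keeps only the single most likely next token. The loop can be sketched in isolation as follows. The `toy_scores` function is a hypothetical stand-in for the trained model (it just "translates" by reversing the source tokens), used here only so the loop can run without TensorFlow:

```python
def toy_scores(source_tokens, generated_tokens):
    """Hypothetical stand-in for the model: scores every candidate next token."""
    step = len(generated_tokens) - 1  # tokens produced after "[start]"
    target = list(reversed(source_tokens)) + ["[end]"]
    wanted = target[min(step, len(target) - 1)]
    return {token: (1.0 if token == wanted else 0.0) for token in set(target)}

def greedy_decode(source_tokens, max_length=20):
    generated = ["[start]"]
    for _ in range(max_length):
        scores = toy_scores(source_tokens, generated)
        next_token = max(scores, key=scores.get)  # argmax over the vocabulary
        generated.append(next_token)
        if next_token == "[end]":
            break
    return " ".join(generated)

print(greedy_decode(["a", "b", "c"]))  # [start] c b a [end]
```

Greedy decoding is simple and fast, but it commits to one token per step and never revisits that choice.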


You can now compare the output of both models. The `transformer` (_being lighter and faster to train_) performs better than the `RNN`. However, there is a problem with the translation of the first sentence.

While the `RNN` translates `What is its name?` to `[start] o que é o seu nome [end]`, the `transformer` model makes a gender assumption, even though the source sentence wasn't gendered (`[start] qual é o nome dele [end]`). Errors like these are not uncommon in NLP, algorithmic bias being one of the great problems associated with the use of language models in real applications.

In AI literature, we call this a "_hallucination_." When a machine learning model "_hallucinates_" something, it means that the model produces output that is not based on the input data and is not representative of the underlying distribution of the data. This can occur when the model is overfitting to the training data and has learned to memorize specific examples rather than generalizing to new data.

Hallucinations can occur in a variety of forms, depending on the type of ML model and the task it is trained to perform. For example, in the case of image classification, a model might hallucinate an object or pattern that is not present in the input image. In natural language processing, a model might generate nonsensical or unrelated text.

Hallucinations can be a sign of overfitting and can indicate that the model is not generalizing well to new data. To reduce hallucinations, it is important to use techniques such as regularization and early stopping to limit overfitting, and to train models on curated, high-quality datasets so that the model is not poisoned with unwanted biases.

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).