# Chapter 16 – Natural Language Processing with RNNs and Attention
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alirezatheh/handson-ml3-notes/blob/main/notebooks/16_natural_language_processing_with_rnns_and_attention.ipynb)
[![Open in Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/alirezatheh/handson-ml3-notes/blob/main/notebooks/16_natural_language_processing_with_rnns_and_attention.ipynb)

## Generating Shakespearean Text Using a Character RNN
In a famous 2015 blog post titled [“The Unreasonable Effectiveness of Recurrent Neural Networks”](https://homl.info/charrnn), Andrej Karpathy showed how to train an RNN to predict the next character in a sentence. This *char-RNN* can then be used to generate novel text, one character at a time. This is our first example of a *language model*; similar language models are at the core of modern NLP.

**Warning**: This chapter can be very slow without a GPU, so let’s make sure there’s one, or else issue a warning:

In [1]:
import sys

import tensorflow as tf

if not tf.config.list_physical_devices('GPU'):
    print('No GPU was detected. Neural nets can be very slow without a GPU.')
    if 'google.colab' in sys.modules:
        print(
            'Go to Runtime > Change runtime and select a GPU hardware '
            'accelerator.'
        )
    if 'kaggle_secrets' in sys.modules:
        print('Go to Settings > Accelerator and select GPU.')

### Creating the Training Dataset
Let’s download the Shakespeare data from Andrej Karpathy’s [char-rnn project](https://github.com/karpathy/char-rnn/):

In [2]:
import keras

# Shortcut URL
shakespeare_url = 'https://homl.info/shakespeare'
filepath = keras.utils.get_file('shakespeare.txt', shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Downloading data from https://homl.info/shakespeare


In [3]:
# Shows a short text sample
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


In [4]:
# Shows all 39 distinct characters (after converting to lower case)
''.join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

In [5]:
text_vec_layer = keras.layers.TextVectorization(
    split='character', standardize='lower'
)
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

In [6]:
# Drop tokens 0 (pad) and 1 (unknown), which we will not use
encoded -= 2
n_tokens = text_vec_layer.vocabulary_size() - 2
dataset_size = len(encoded)

In [7]:
n_tokens

39

In [8]:
dataset_size

1115394

We can turn this very long sequence into a dataset of windows that we can then use to train a sequence-to-sequence RNN. The targets will be similar to the inputs, but shifted by one time step into the “future”.
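
To make the windowing idea concrete without running TensorFlow, here is a minimal plain-Python sketch. The helper name `make_windows` is hypothetical (it is not part of the notebook), and it leaves out shuffling and batching; it only shows how each target is the input shifted by one step.

```python
def make_windows(sequence, length, shift=1):
    """Return (input, target) pairs; each target is its input shifted by one."""
    sequence = list(sequence)
    pairs = []
    # Only keep full windows of length + 1 items (like drop_remainder=True).
    for start in range(0, len(sequence) - length, shift):
        window = sequence[start : start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs


# One window: the input spells 'to b' and the target spells 'o be'
print(make_windows('to be', length=4))
# With shift equal to length, the input sequences no longer overlap
print(make_windows(range(10), length=3, shift=3))
```

With `shift=1`, almost every character ends up in many windows, which is why the windows overlap so heavily.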

In [9]:
from typing import Optional


def to_dataset(
    sequence: tf.Tensor,
    length: int,
    shuffle: bool = False,
    seed: Optional[int] = None,
    batch_size: int = 32,
) -> tf.data.Dataset:
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

In [10]:
# A simple example using to_dataset()
# There’s just one sample in this dataset: the input represents 'to b'
# and the output represents 'o be'
list(to_dataset(text_vec_layer(['To be'])[0], length=4))

[(<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 4,  5,  2, 23]])>,
  <tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[ 5,  2, 23,  3]])>)]

We will use roughly 90% of the text for training, 5% for validation, and 5% for testing:

In [11]:
length = 100
keras.utils.set_random_seed(42)
train_set = to_dataset(
    encoded[:1_000_000], length=length, shuffle=True, seed=42
)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)

### Building and Training the Char-RNN Model
**Warning**: The following code may take one or two hours to run, depending on our GPU. Without a GPU, it may take over 24 hours. If we don’t want to wait, just skip the next two code cells and run the code below to download a pretrained model.

**Note**: The `GRU` class will only use cuDNN acceleration (assuming we have a GPU) when using the default values for the following arguments: `activation`, `recurrent_activation`, `recurrent_dropout`, `unroll`, `use_bias` and `reset_after`.

In [12]:
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)
model = keras.Sequential(
    [
        keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
        keras.layers.GRU(128, return_sequences=True),
        keras.layers.Dense(n_tokens, activation='softmax'),
    ]
)
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='nadam',
    metrics=['accuracy'],
)
model_ckpt = keras.callbacks.ModelCheckpoint(
    'my_shakespeare_model', monitor='val_accuracy', save_best_only=True
)
history = model.fit(
    train_set, validation_data=valid_set, epochs=10, callbacks=[model_ckpt]
)

Epoch 1/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 2/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 3/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 4/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 5/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 6/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 7/10
Epoch 8/10
Epoch 9/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 10/10


INFO:tensorflow:Assets written to: my_shakespeare_model/assets




- The inputs of the `Embedding` layer will be 2D tensors of shape [*batch size*, *window length*], and its output will be a 3D tensor of shape [*batch size*, *window length*, *embedding size*].
- Since the input windows overlap, the concept of epoch is not so clear in this case: during each epoch (as implemented by Keras), the model will actually see the same character multiple times.
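
A quick back-of-the-envelope calculation makes the second point concrete. The sketch below only uses numbers from the cells above (1,000,000 training characters, windows of 100 inputs plus one extra character for the targets, a shift of 1, and batches of 32); the variable names are ours.

```python
import math

n_train_chars = 1_000_000  # encoded[:1_000_000]
length = 100               # input window length
batch_size = 32

# window(length + 1, shift=1, drop_remainder=True) yields one window per
# valid start position.
n_windows = n_train_chars - (length + 1) + 1
n_batches = math.ceil(n_windows / batch_size)
# A character far from both ends of the text falls inside length + 1
# different windows.
windows_per_char = length + 1

print(n_windows, n_batches, windows_per_char)
```

So a single Keras "epoch" here covers roughly a hundred passes over each character, just at different positions within the window.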

In [13]:
shakespeare_model = keras.Sequential(
    [
        text_vec_layer,
        # No <PAD> or <UNK> tokens
        keras.layers.Lambda(lambda X: X - 2),
        model,
    ]
)

If we don’t want to wait for training to complete, we can download the pretrained model. Uncomment the last line to use it instead of the model trained above:

In [14]:
from pathlib import Path

url = 'https://github.com/ageron/data/raw/main/shakespeare_model.tgz'
path = keras.utils.get_file('shakespeare_model.tgz', url, extract=True)
model_path = Path(path).with_name('shakespeare_model')
# shakespeare_model = keras.models.load_model(model_path)

In [15]:
y_proba = shakespeare_model.predict(['To be or not to b'])[0, -1]
# Choose the most probable character ID
y_pred = tf.argmax(y_proba)
text_vec_layer.get_vocabulary()[y_pred + 2]

'e'

### Generating Fake Shakespearean Text
To generate new text using the char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on. This is called *greedy decoding*. But in practice this often leads to the same words being repeated over and over again. Instead, we can sample the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s `tf.random.categorical()` function. This will generate more diverse and interesting text. The `categorical()` function samples random class indices, given the class log probabilities (logits).

In [16]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])
keras.utils.set_random_seed(42)
# Draw 8 samples
tf.random.categorical(log_probas, num_samples=8)

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]])>

To have more control over the diversity of the generated text, we can divide the logits by a number called the *temperature*. A temperature close to zero favors high-probability characters, while a high temperature gives all characters an equal probability. Lower temperatures are typically preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text.
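
To see the effect of the temperature on a concrete distribution, here is a minimal sketch in plain Python (no TensorFlow). The helper name `apply_temperature` is hypothetical; it divides the log probabilities by the temperature and renormalizes with a softmax, which is the distribution that sampling from the rescaled logits would follow.

```python
import math


def apply_temperature(probas, temperature):
    """Divide the log probabilities by the temperature, then renormalize."""
    logits = [math.log(p) / temperature for p in probas]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probas = [0.5, 0.4, 0.1]
for temperature in (0.01, 1, 100):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1 leaves the distribution unchanged, a very low temperature puts almost all the mass on the most likely class, and a very high temperature makes the classes nearly equiprobable.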

In [17]:
def next_char(text: str, temperature: float = 1) -> str:
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

In [18]:
def extend_text(text: str, n_chars: int = 50, temperature: float = 1) -> str:
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [19]:
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)

In [20]:
print(extend_text('To be or not to be', temperature=0.01))

To be or not to be the duke
as it is a proper strange death,
and the


In [21]:
print(extend_text('To be or not to be', temperature=1))

To be or not to behold?

second push:
gremio, lord all, a sistermen,


In [22]:
print(extend_text('To be or not to be', temperature=100))

To be or not to bef ,mt'&o3fpadm!$
wh!nse?bws3est--vgerdjw?c-y-ewznq


To generate more convincing text, a common technique is to sample only from the top *k* characters, or only from the smallest set of top characters whose total probability exceeds some threshold (this is called *nucleus sampling*). Alternatively, we could try using *beam search*.
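
As a rough illustration of the first two ideas, here is a minimal plain-Python sketch. The function names `top_k_filter` and `nucleus_filter` and the example probabilities are made up for this sketch; a real implementation would apply the same filtering to the model's predicted probabilities before sampling.

```python
import random


def top_k_filter(probas, k):
    """Keep the k most probable entries, zero out the rest, renormalize."""
    order = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
    keep = order[:k]
    total = sum(probas[i] for i in keep)
    return [probas[i] / total if i in keep else 0.0 for i in range(len(probas))]


def nucleus_filter(probas, threshold):
    """Keep the smallest set of top entries whose total probability reaches
    the threshold, zero out the rest, renormalize."""
    order = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probas[i]
        if cumulative >= threshold:
            break
    total = sum(probas[i] for i in keep)
    return [probas[i] / total if i in keep else 0.0 for i in range(len(probas))]


probas = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(42)
filtered = nucleus_filter(probas, threshold=0.9)
# Sample one class index from the filtered distribution
sample = rng.choices(range(len(probas)), weights=filtered, k=1)[0]
print(top_k_filter(probas, k=2), filtered, sample)
```

In both cases the least likely characters can never be sampled, which removes most of the gibberish seen at high temperatures.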

### Stateful RNN
Until now, we have only used *stateless RNNs*: at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step it throws the state away, as it is not needed anymore. If we instead instruct the RNN to preserve this final state after processing a training batch and use it as the initial state for the next training batch, the model can learn long-term patterns despite only backpropagating through short sequences. This is called a *stateful RNN*.

A stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So we need to use sequential and nonoverlapping input sequences. When creating the `tf.data.Dataset`, we must therefore use `shift=length` when calling the `window()` method. Moreover, we must not call the `shuffle()` method.

Batching is much harder. If we were to call `batch(32)`, then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of these windows where it left off. The first batch would contain windows 1 to 32 and the second batch would contain windows 33 to 64, so if we consider, say, the first window of each batch (i.e., windows 1 and 33), we can see that they are not consecutive. The simplest solution to this problem is to just use a batch size of 1.
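
The mismatch is easy to see with window indices alone. The sketch below is plain Python with hypothetical helper names and zero-based window indices. It compares naive batching of consecutive windows with a layout in which row *k* of each batch continues row *k* of the previous batch.

```python
def naive_batches(n_windows, batch_size):
    """Group consecutive window indices (full batches only)."""
    return [
        list(range(start, start + batch_size))
        for start in range(0, n_windows - batch_size + 1, batch_size)
    ]


def stateful_batches(n_windows, batch_size):
    """Split the windows into batch_size contiguous parts, then take one
    window from each part per batch."""
    per_part = n_windows // batch_size
    return [
        [part * per_part + step for part in range(batch_size)]
        for step in range(per_part)
    ]


# Row 0 jumps from window 0 to window 3: not consecutive
print(naive_batches(12, 3))
# Row 0 goes 0, 1, 2, 3; row 1 goes 4, 5, 6, 7; and so on
print(stateful_batches(12, 3))
```

With a batch size of 1, both layouts coincide, which is why that choice is the simplest fix.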

In [23]:
def to_dataset_for_stateful_rnn(
    sequence: tf.Tensor, length: int
) -> tf.data.Dataset:
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=length, drop_remainder=True)
    ds = ds.flat_map(lambda window: window.batch(length + 1)).batch(1)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)


stateful_train_set = to_dataset_for_stateful_rnn(encoded[:1_000_000], length)
stateful_valid_set = to_dataset_for_stateful_rnn(
    encoded[1_000_000:1_060_000], length
)
stateful_test_set = to_dataset_for_stateful_rnn(encoded[1_060_000:], length)

In [24]:
# Simple example using to_dataset_for_stateful_rnn()
list(to_dataset_for_stateful_rnn(tf.range(10), 3))

[(<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[0, 1, 2]], dtype=int32)>,
  <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[1, 2, 3]], dtype=int32)>),
 (<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[3, 4, 5]], dtype=int32)>,
  <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[4, 5, 6]], dtype=int32)>),
 (<tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[6, 7, 8]], dtype=int32)>,
  <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[7, 8, 9]], dtype=int32)>)]

Batching is harder, but it is not impossible. For example, we could chop Shakespeare’s text into 32 texts of equal length, create one dataset of consecutive input sequences for each of them, and finally use `tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows))` to create proper consecutive batches, where the $n^{\text{th}}$ input sequence in a batch starts off exactly where the $n^{\text{th}}$ input sequence ended in the previous batch.

In [25]:
# Shows one way to prepare a batched dataset for a stateful RNN

import numpy as np


def to_non_overlapping_windows(
    sequence: np.ndarray, length: int
) -> tf.data.Dataset:
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=length, drop_remainder=True)
    return ds.flat_map(lambda window: window.batch(length + 1))


def to_batched_dataset_for_stateful_rnn(
    sequence: tf.Tensor, length: int, batch_size: int = 32
) -> tf.data.Dataset:
    parts = np.array_split(sequence, batch_size)
    datasets = tuple(
        to_non_overlapping_windows(part, length) for part in parts
    )
    ds = tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows))
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)


list(to_batched_dataset_for_stateful_rnn(tf.range(20), length=3, batch_size=2))

[(<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 0,  1,  2],
         [10, 11, 12]], dtype=int32)>,
  <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 1,  2,  3],
         [11, 12, 13]], dtype=int32)>),
 (<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 3,  4,  5],
         [13, 14, 15]], dtype=int32)>,
  <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 4,  5,  6],
         [14, 15, 16]], dtype=int32)>),
 (<tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 6,  7,  8],
         [16, 17, 18]], dtype=int32)>,
  <tf.Tensor: shape=(2, 3), dtype=int32, numpy=
  array([[ 7,  8,  9],
         [17, 18, 19]], dtype=int32)>)]

We need to set the `stateful` argument to `True` when creating each recurrent layer. Moreover, because the stateful RNN needs to know the batch size (since it will preserve a state for each input sequence in the batch), we must set the `batch_input_shape` argument in the first layer.

In [26]:
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)
model = keras.Sequential(
    [
        keras.layers.Embedding(
            input_dim=n_tokens, output_dim=16, batch_input_shape=[1, None]
        ),
        keras.layers.GRU(128, return_sequences=True, stateful=True),
        keras.layers.Dense(n_tokens, activation='softmax'),
    ]
)

At the end of each epoch, we need to reset the states before we go back to the beginning of the text:

In [27]:
from typing import Any


class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch: int, logs: dict[str, Any]) -> None:
        self.model.reset_states()

In [28]:
# Use a different directory to save the checkpoints
model_ckpt = keras.callbacks.ModelCheckpoint(
    'my_stateful_shakespeare_model',
    monitor='val_accuracy',
    save_best_only=True,
)

**Warning**: The following cell will take a while to run (possibly an hour if we are not using a GPU).

In [29]:
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='nadam',
    metrics=['accuracy'],
)
history = model.fit(
    stateful_train_set,
    validation_data=stateful_valid_set,
    epochs=10,
    callbacks=[ResetStatesCallback(), model_ckpt],
)

INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 2/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 3/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 4/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 5/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 6/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 7/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 8/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets


Epoch 9/10
Epoch 10/10


INFO:tensorflow:Assets written to: my_stateful_shakespeare_model/assets




**Tip**: After this model is trained, it will only be possible to use it to make predictions for batches of the same size as were used during training. To avoid this restriction, create an identical stateless model, and copy the stateful model’s weights to this model.

In [30]:
stateless_model = keras.Sequential(
    [
        keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
        keras.layers.GRU(128, return_sequences=True),
        keras.layers.Dense(n_tokens, activation='softmax'),
    ]
)

To set the weights, we first need to build the model (so the weights get created):

In [31]:
stateless_model.build(tf.TensorShape([None, None]))

In [32]:
stateless_model.set_weights(model.get_weights())

In [33]:
shakespeare_model = keras.Sequential(
    [
        text_vec_layer,
        # No <PAD> or <UNK> tokens
        keras.layers.Lambda(lambda X: X - 2),
        stateless_model,
    ]
)

In [34]:
keras.utils.set_random_seed(42)

print(extend_text('to be or not to be', temperature=0.01))

to be or not to be so in the world and the strangeness
to see the wo


A 2017 paper [“Learning to Generate Reviews and Discovering Sentiment”](https://homl.info/sentimentneuron) by Alec Radford and other OpenAI researchers describes how the authors trained a big char-RNN-like model on a large dataset, and found that one of the neurons acted as an excellent sentiment analysis classifier: although the model was trained without any labels, the *sentiment neuron* reached state-of-the-art performance on sentiment analysis benchmarks. This foreshadowed and motivated unsupervised pretraining in NLP.

## Sentiment Analysis
If image classification on the MNIST dataset is the “Hello world!” of computer vision, then sentiment analysis on the IMDb reviews dataset is the “Hello world!” of natural language processing. The IMDb dataset consists of 50,000 movie reviews in English (25,000 for training, 25,000 for testing) extracted from the [Internet Movie Database](https://imdb.com), along with a simple binary target for each review indicating whether it is negative (0) or positive (1).

In [35]:
import tensorflow_datasets as tfds

raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name='imdb_reviews',
    split=['train[:90%]', 'train[90%:]', 'test'],
    as_supervised=True,
)
keras.utils.set_random_seed(42)
train_set = raw_train_set.shuffle(5000, seed=42).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/ageron/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /home/ageron/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


**Tip**: Keras also has `keras.datasets.imdb.load_data()`.

In [36]:
for review, label in raw_train_set.take(4):
    print(review.numpy().decode('utf-8')[:200], '...')
    print('Label:', label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Moun ...
Label: 0
This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful perf ...
Label: 1


This time we will chop the text into words instead of characters, using spaces as token boundaries. Note that this approach does not work well in some languages: for example, Chinese writing does not use spaces between words, Vietnamese uses spaces even within words, and German often attaches multiple words together, without spaces. Even in English, spaces are not always the best way to tokenize text: think of “San Francisco” or “#ILoveDeepLearning”.
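
The English examples are easy to reproduce with Python's built-in `str.split()`, used here as a simple stand-in for space-based tokenization (the sample sentence is made up):

```python
text = 'I moved to San Francisco #ILoveDeepLearning'
tokens = text.split()  # split on whitespace only
print(tokens)
```

The city name ends up as two unrelated tokens, while the hashtag stays glued together as a single token that a word-level vocabulary is unlikely to contain.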

In a 2016 paper, [“Neural Machine Translation of Rare Words with Subword Units”](https://homl.info/rarewords), Rico Sennrich et al. from the University of Edinburgh explored several methods to tokenize and detokenize text at the subword level. This way, even if our model encounters a rare word it has never seen before, it can still reasonably guess what it means. For example, even if the model never saw the word “smartest” during training, if it learned the word “smart” and it also learned that the suffix “est” means “the most”, it can infer the meaning of “smartest”. One of the techniques the authors evaluated is *byte pair encoding* (BPE). BPE works by splitting the whole training set into individual characters (including spaces), then repeatedly merging the most frequent adjacent pairs until the vocabulary reaches the desired size.
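
The merge loop at the heart of BPE fits in a few lines. The following is only a toy sketch in plain Python, not the paper's implementation. It works on a single short string, stops after a fixed number of merges rather than at a target vocabulary size, and breaks ties in favor of the pair seen first; the function names are hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bpe(text, n_merges):
    """Start from single characters and apply up to n_merges merges."""
    tokens = list(text)
    merges = []
    for _ in range(n_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Ties go to the pair seen first (an arbitrary choice in this sketch)
        best_pair = max(pair_counts, key=pair_counts.get)
        if pair_counts[best_pair] < 2:
            break  # no pair repeats, so stop merging
        tokens = merge_pair(tokens, best_pair)
        merges.append(best_pair)
    return tokens, merges


tokens, merges = bpe('aaabdaaabac', n_merges=3)
print(tokens)
print(merges)
```

The learned list of merges is what gets reused later to tokenize new text, which is how an unseen word can still be broken into known subword units.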

A 2018 paper [“Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates”](https://homl.info/subword) by Taku Kudo at Google further improved subword tokenization, often removing the need for language-specific preprocessing prior to tokenization. Moreover, the paper proposed a novel regularization technique called *subword regularization*, which improves accuracy and robustness by introducing some randomness in tokenization during training: e.g. “New England” may be tokenized as “New” + “England”, or “New” + “Eng” + “land”, or simply “New England” (just one token). Google’s [*SentencePiece*](https://github.com/google/sentencepiece) project provides an open source implementation, which is described in a 2018 paper [“SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”](https://homl.info/sentencepiece) by Taku Kudo and John Richardson.

The [TensorFlow Text](https://homl.info/tftext) library also implements various tokenization strategies, including WordPiece (a variant of BPE by Yonghui Wu et al., [“Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation”](https://homl.info/wordpiece) in 2016), and last but not least, the [Tokenizers library by Hugging Face](https://homl.info/tokenizers) implements a wide range of extremely fast tokenizers.

However, for the IMDb task in English, using spaces for token boundaries should be good enough:

In [37]:
vocab_size = 1000
text_vec_layer = keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))

**Warning**: The following cell will take a few minutes to run and the model will probably not learn anything because we didn’t mask the padding tokens (that’s the point of the next section).

In [38]:
embed_size = 128
keras.utils.set_random_seed(42)
model = keras.Sequential(
    [
        text_vec_layer,
        keras.layers.Embedding(vocab_size, embed_size),
        keras.layers.GRU(128),
        keras.layers.Dense(1, activation='sigmoid'),
    ]
)
model.compile(
    loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy']
)
history = model.fit(train_set, validation_data=valid_set, epochs=2)

Epoch 1/2
Epoch 2/2


### Masking
Making the model ignore padding tokens is trivial using Keras: simply add `mask_zero=True` when creating the `Embedding` layer. This means that padding tokens (whose ID is 0) will be ignored by all downstream layers.

**Warning**: The following cell will take a while to run (possibly 30 minutes if we are not using a GPU).

In [39]:
embed_size = 128
keras.utils.set_random_seed(42)
model = keras.Sequential(
    [
        text_vec_layer,
        keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),
        keras.layers.GRU(128),
        keras.layers.Dense(1, activation='sigmoid'),
    ]
)
model.compile(
    loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy']
)
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The way this works is that the `Embedding` layer creates a mask tensor equal to `tf.math.not_equal(inputs, 0)`: it is a Boolean tensor with the same shape as the inputs, and it is equal to `False` anywhere the token IDs are 0, or `True` otherwise. This mask tensor is then automatically propagated by the model to the next layer. If that layer’s `call()` method has a `mask` argument, then it automatically receives the mask.

This allows the layer to ignore the appropriate time steps. Each layer may handle the mask differently, but in general they simply ignore masked time steps. For example, when a recurrent layer encounters a masked time step, it simply copies the output from the previous time step.
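
The following toy sketch illustrates this behavior in plain Python. It is not how Keras implements its recurrent layers. The recurrence `h = 0.5 * h + x` and the input values are made up, while the token IDs are those of a short review padded to five tokens.

```python
def toy_rnn(inputs, mask=None):
    """Run a toy recurrence h = 0.5 * h + x, skipping masked time steps."""
    state = 0.0
    outputs = []
    for step, x in enumerate(inputs):
        if mask is None or mask[step]:
            state = 0.5 * state + x
        # On a masked step the state is left untouched, so the output is a
        # copy of the previous one.
        outputs.append(state)
    return outputs


token_ids = [86, 18, 0, 0, 0]       # a two-word review padded with zeros
mask = [token_id != 0 for token_id in token_ids]
inputs = [1.0, 2.0, 0.0, 0.0, 0.0]  # made-up stand-ins for the embeddings

print(mask)
print(toy_rnn(inputs, mask))  # the output stops changing after step 1
print(toy_rnn(inputs))        # without a mask, padding keeps altering it
```

With the mask, the final output only depends on the real tokens, whatever the amount of padding.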

Next, if the layer’s `supports_masking` attribute is `True`, then the mask is automatically propagated to the next layer. It keeps propagating this way for as long as the layers have `supports_masking=True`. For example, a recurrent layer’s `supports_masking` attribute is `True` when `return_sequences=True`, but it’s `False` when `return_sequences=False`.

**Tip**: Some layers need to update the mask before propagating it to the next layer: they do so by implementing the `compute_mask()` method, which takes two arguments: the inputs and the previous mask. It then computes the updated mask and returns it. The default implementation of `compute_mask()` just returns the previous mask unchanged.

If the mask propagates all the way to the output, then it gets applied to the losses as well, so the masked time steps will not contribute to the loss (their loss will be 0).

**Warning**: The LSTM and GRU layers have an optimized implementation for GPUs, based on Nvidia’s cuDNN library. However, this implementation only supports masking if all the padding tokens are at the end of the sequences. It also requires us to use the default values for several hyperparameters: `activation`, `recurrent_activation`, `recurrent_dropout`, `unroll`, `use_bias`, and `reset_after`. If that’s not the case, then these layers will fall back to the (much slower) default GPU implementation.

If our model does not start with an `Embedding` layer, we may use the `keras.layers.Masking` layer instead: by default, it sets the mask to `tf.math.reduce_any(tf.math.not_equal(X, 0), axis=-1)`, meaning that time steps where the last dimension is full of zeros will be masked out in subsequent layers.
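
For a single sequence stored as a list of feature vectors, that rule amounts to the following plain-Python sketch (the function name and the sample values are made up):

```python
def masking_layer_mask(sequence):
    """One Boolean per time step: True if any feature is nonzero."""
    return [any(value != 0 for value in features) for features in sequence]


sequence = [[0.3, 0.0], [0.0, 1.2], [0.0, 0.0]]
print(masking_layer_mask(sequence))
```

Only the last time step, whose features are all zeros, would be masked out.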

Convolutional layers (including `Conv1D`) do not support masking, since it’s not obvious how they would do so anyway. In models that use such layers, we need to explicitly compute the mask and pass it to the appropriate layers.

The following model is equivalent to the previous one, except that it is built using the functional API, handles masking manually, and adds some dropout to the `GRU` layer:

In [40]:
# Ensures reproducibility on the CPU
keras.utils.set_random_seed(42)
inputs = keras.layers.Input(shape=[], dtype=tf.string)
token_ids = text_vec_layer(inputs)
mask = tf.math.not_equal(token_ids, 0)
Z = keras.layers.Embedding(vocab_size, embed_size)(token_ids)
Z = keras.layers.GRU(128, dropout=0.2)(Z, mask=mask)
outputs = keras.layers.Dense(1, activation='sigmoid')(Z)
model = keras.Model(inputs=[inputs], outputs=[outputs])

**Warning**: The following cell will take a while to run (possibly 30 minutes if we are not using a GPU).

In [41]:
# Compiles and trains the model, as usual
model.compile(
    loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy']
)
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


One last approach to masking is to feed the model with ragged tensors:

In [42]:
text_vec_layer_ragged = keras.layers.TextVectorization(
    max_tokens=vocab_size, ragged=True
)
text_vec_layer_ragged.adapt(train_set.map(lambda reviews, labels: reviews))
text_vec_layer_ragged(['Great movie!', 'This is DiCaprio’s best role.'])

<tf.RaggedTensor [[86, 18], [11, 7, 1, 116, 217]]>

In [43]:
text_vec_layer(['Great movie!', 'This is DiCaprio’s best role.'])

<tf.Tensor: shape=(2, 5), dtype=int64, numpy=
array([[ 86,  18,   0,   0,   0],
       [ 11,   7,   1, 116, 217]])>

Keras’s recurrent layers have built-in support for ragged tensors:

**Warning**: The following cell will take a while to run (possibly 30 minutes if we are not using a GPU).

In [44]:
embed_size = 128
keras.utils.set_random_seed(42)
model = keras.Sequential(
    [
        text_vec_layer_ragged,
        keras.layers.Embedding(vocab_size, embed_size),
        keras.layers.GRU(128),
        keras.layers.Dense(1, activation='sigmoid'),
    ]
)
model.compile(
    loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy']
)
history = model.fit(train_set, validation_data=valid_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Reusing Pretrained Embeddings and Language Models
Instead of training word embeddings, we could just download and use pretrained embeddings, such as Google’s [Word2vec embeddings](https://homl.info/word2vec), Stanford’s [GloVe embeddings](https://homl.info/glove), or Facebook’s [FastText embeddings](https://fasttext.cc).

Using pretrained word embeddings was popular for several years, but this approach has its limits. For example, a word has a single representation, no matter the context.

To address this limitation, a 2018 paper [“Deep Contextualized Word Representations”](https://homl.info/elmo) by Matthew Peters et al. introduced *Embeddings from Language Models* (ELMo): these are contextualized word embeddings learned from the internal states of a deep bidirectional language model. Instead of just using pretrained embeddings in our model, we reuse part of a pretrained language model.

At roughly the same time, the *Universal Language Model Fine-Tuning* (ULMFiT) paper [“Universal Language Model Fine-Tuning for Text Classification”](https://homl.info/ulmfit) by Jeremy Howard and Sebastian Ruder demonstrated the effectiveness of unsupervised pretraining for NLP tasks: the authors trained an LSTM language model on a huge text corpus using self-supervised learning, then they fine-tuned it on various tasks. The authors showed that a pretrained model fine-tuned on just 100 labeled examples could achieve the same performance as one trained from scratch on 10,000 examples. This paper marked the beginning of a new era in NLP: today, reusing pretrained language models is the norm.

Let’s build a classifier based on the Universal Sentence Encoder, a model architecture introduced in a 2018 paper [“Universal Sentence Encoder”](https://homl.info/139) by a team of Google researchers.

**Warning**: the following cell will take a while to run (possibly an hour if we are not using a GPU).

In [45]:
import os

import tensorflow_hub as hub

os.environ['TFHUB_CACHE_DIR'] = 'my_tfhub_cache'
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)
model = keras.Sequential(
    [
        hub.KerasLayer(
            'https://tfhub.dev/google/universal-sentence-encoder/4',
            trainable=True,
            dtype=tf.string,
            input_shape=[],
        ),
        keras.layers.Dense(64, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ]
)
model.compile(
    loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy']
)
model.fit(train_set, validation_data=valid_set, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f89897f6d30>

**Tip**: By default, TensorFlow Hub modules are saved to a temporary directory, and they get downloaded again and again every time we run our program. To avoid that, we must set the `TFHUB_CACHE_DIR` environment variable to a directory of our choice: the modules will then be saved there, and only downloaded once.

## An Encoder–Decoder Network for Neural Machine Translation
Let’s begin with a simple *neural machine translation* (NMT) model by Ilya Sutskever et al., [“Sequence to Sequence Learning with Neural Networks”](https://homl.info/103), that will translate English sentences to Spanish.

<a id="simple-nmt-figure"></a>
<center>
  <img 
    src="../images/16/simple_nmt.png" 
    onerror="
      this.onerror = null;
      const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
      this.src = repo + this.src.split('..')[1];
    "
  >
</center>

The architecture is as follows: English sentences are fed as inputs to the encoder, and the decoder outputs the Spanish translations. Note that the Spanish translations are also used as inputs to the decoder during training, but shifted back by one step. In other words, during training the decoder is given as input the word that it *should* have output at the previous step, regardless of what it actually output. This is called *teacher forcing*, a technique that significantly speeds up training and improves the model’s performance. For the very first word, the decoder is given the start-of-sequence (SOS) token, and the decoder is expected to end the sentence with an end-of-sequence (EOS) token.
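
The shift used by teacher forcing can be illustrated in plain Python. This is only a minimal sketch: the helper name `make_teacher_forcing_pair` is hypothetical, and splitting on whitespace stands in for the real tokenization. The token strings match the `startofseq` and `endofseq` words used later in this chapter.

```python
SOS, EOS = 'startofseq', 'endofseq'


def make_teacher_forcing_pair(target_sentence):
    """Returns (decoder inputs, decoder targets) for one target sentence."""
    words = target_sentence.split()
    # The decoder input is the target shifted by one step, starting with SOS
    decoder_inputs = [SOS] + words
    # The decoder must predict the next word at each step, ending with EOS
    decoder_targets = words + [EOS]
    return decoder_inputs, decoder_targets


dec_in, dec_out = make_teacher_forcing_pair('me gusta el fútbol')
for step, (given, expected) in enumerate(zip(dec_in, dec_out)):
    print(f'step {step}: input={given!r} -> target={expected!r}')
```

At every step the input is the word the decoder *should* have produced at the previous step, whatever it actually predicted.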

Each word is initially represented by its ID. Next, an `Embedding` layer returns the word embedding. These word embeddings are then fed to the encoder and the decoder.

At each step, the decoder outputs a score for each word in the output vocabulary (i.e., Spanish), then the softmax activation function turns these scores into probabilities.

**Note**: At inference time, we will not have the target sentence to feed to the decoder. Instead, we need to feed it the word that it has just output at the previous step and this will require an embedding lookup.

**Tip**: In a 2015 paper [“Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks”](https://homl.info/scheduledsampling), Samy Bengio et al. proposed gradually switching from feeding the decoder the previous *target* token to feeding it the previous *output* token during training.
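
The idea of scheduled sampling can be sketched as a coin flip at each decoder step. The linear decay below is just one possible schedule, assumed here for illustration, and the function names are hypothetical; this is not the paper's reference implementation.

```python
import random


def teacher_forcing_proba(epoch, n_epochs):
    """Probability of feeding the target token (assumed linear decay)."""
    return max(0.0, 1.0 - epoch / max(1, n_epochs - 1))


def pick_decoder_input(target_token, output_token, proba_target, rng):
    """Feeds the previous target with probability `proba_target`,
    otherwise the token the model actually output."""
    return target_token if rng.random() < proba_target else output_token


rng = random.Random(42)
n_epochs = 5
for epoch in range(n_epochs):
    p = teacher_forcing_proba(epoch, n_epochs)
    picks = [
        pick_decoder_input('target', 'output', p, rng) for _ in range(1000)
    ]
    share = picks.count('target') / len(picks)
    print(f'epoch {epoch}: p(target)={p:.2f}, observed share={share:.2f}')
```

Early in training the decoder sees mostly ground-truth tokens. By the last epoch it sees only its own outputs, which is the situation it faces at inference time.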

Let’s download a dataset of English/Spanish sentence pairs:

In [46]:
from pathlib import Path

url = 'https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip'
path = keras.utils.get_file(
    'spa-eng.zip', origin=url, cache_dir='datasets', extract=True
)
text = (Path(path).with_name('spa-eng') / 'spa.txt').read_text()

This dataset was created by contributors of the [Tatoeba project](https://tatoeba.org). About 120,000 sentence pairs were selected by the authors of the website https://manythings.org/anki.

In [47]:
import numpy as np

# Removes the Spanish characters “¡” and “¿”
text = text.replace('¡', '').replace('¿', '')
pairs = [line.split('\t') for line in text.splitlines()]
# Ensures reproducibility on CPU
np.random.seed(42)
np.random.shuffle(pairs)
# Separates the pairs into 2 lists
sentences_en, sentences_es = zip(*pairs)

Let’s take a look at the first three sentence pairs:

In [48]:
for i in range(3):
    print(sentences_en[i], '=>', sentences_es[i])

How boring! => Qué aburrimiento!
I love sports. => Adoro el deporte.
Would you like to swap jobs? => Te gustaría que intercambiemos los trabajos?


In [49]:
vocab_size = 1000
max_length = 50
text_vec_layer_en = keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length
)
text_vec_layer_es = keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length
)
text_vec_layer_en.adapt(sentences_en)
text_vec_layer_es.adapt([f'startofseq {s} endofseq' for s in sentences_es])

- We limit the vocabulary size to 1,000, which is quite small. That’s because the training set is not very large, and because using a small value will speed up training. State-of-the-art translation models typically use a much larger vocabulary (e.g., 30,000), a much larger training set (gigabytes), and a much larger model (hundreds or even thousands of megabytes).
- For the Spanish text, we add “startofseq” and “endofseq” to each sentence when adapting the `TextVectorization` layer: we will use these words as SOS and EOS tokens. We could use any other words, as long as they are not actual Spanish words.

In [50]:
text_vec_layer_en.get_vocabulary()[:10]

['', '[UNK]', 'the', 'i', 'to', 'you', 'tom', 'a', 'is', 'he']

In [51]:
text_vec_layer_es.get_vocabulary()[:10]

['', '[UNK]', 'startofseq', 'endofseq', 'de', 'que', 'a', 'no', 'tom', 'la']

In [52]:
X_train = tf.constant(sentences_en[:100_000])
X_valid = tf.constant(sentences_en[100_000:])
X_train_dec = tf.constant([f'startofseq {s}' for s in sentences_es[:100_000]])
X_valid_dec = tf.constant([f'startofseq {s}' for s in sentences_es[100_000:]])
Y_train = text_vec_layer_es([f'{s} endofseq' for s in sentences_es[:100_000]])
Y_valid = text_vec_layer_es([f'{s} endofseq' for s in sentences_es[100_000:]])

In [53]:
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)
encoder_inputs = keras.layers.Input(shape=[], dtype=tf.string)
decoder_inputs = keras.layers.Input(shape=[], dtype=tf.string)

In [54]:
embed_size = 128
encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)
encoder_embedding_layer = keras.layers.Embedding(
    vocab_size, embed_size, mask_zero=True
)
decoder_embedding_layer = keras.layers.Embedding(
    vocab_size, embed_size, mask_zero=True
)
encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

**Tip**: When the languages share many words, you may get better performance using the same embedding layer for both the encoder and the decoder.

In [55]:
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, *encoder_state = encoder(encoder_embeddings)

In [56]:
decoder = keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

In [57]:
output_layer = keras.layers.Dense(vocab_size, activation='softmax')
Y_proba = output_layer(decoder_outputs)

<div style="border: 1px solid;">

### Optimizing the Output Layer
If the target vocabulary contained, say, 50,000 Spanish words instead of 1,000, then the decoder would output 50,000-dimensional vectors, and computing the softmax function over such a large vector would be very computationally intensive. To avoid this, one solution is to look only at the logits output by the model for the correct word and for a random sample of incorrect words, then compute an approximation of the loss based only on these logits. This *sampled softmax* technique was introduced in a 2015 paper [“On Using Very Large Target Vocabulary for Neural Machine Translation”](https://homl.info/104) by Sébastien Jean et al. In TensorFlow we can use the `tf.nn.sampled_softmax_loss()` function for this during training and use the normal softmax function at inference time (sampled softmax cannot be used at inference time because it requires knowing the target).

An extra thing we can do to speed up training is to tie the weights of the output layer to the transpose of the decoder’s embedding matrix (see Chapter 17). This significantly reduces the number of model parameters, which speeds up training and may sometimes improve the model’s accuracy as well, especially if we don’t have a lot of training data. The embedding matrix is equivalent to one-hot encoding followed by a linear layer with no bias term and no activation function that maps the one-hot vectors to the embedding space. The output layer does the reverse. So, if the model can find an embedding matrix whose transpose is close to its inverse (such a matrix is called an *orthogonal matrix*), then there’s no need to learn a separate set of weights for the output layer.
</div>
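
The weight-tying argument can be checked numerically with NumPy. The sizes below are toy values chosen for illustration, and the sketch assumes the decoder output has the same dimensionality as the embeddings (in the model above, a 512-unit decoder would first need a projection down to `embed_size`).

```python
import numpy as np

rng = np.random.default_rng(42)
vocab_size, embed_size = 6, 4
# Embedding matrix: one row per word ID
E = rng.normal(size=(vocab_size, embed_size))

# An embedding lookup is a one-hot vector times the embedding matrix
word_id = 3
one_hot = np.eye(vocab_size)[word_id]
embedding = one_hot @ E
print(np.allclose(embedding, E[word_id]))

# A tied output layer reuses the transpose of E instead of its own weights
decoder_output = rng.normal(size=embed_size)
logits = decoder_output @ E.T
print(logits.shape)

# With an orthogonal matrix Q (transpose equal to inverse), the tied output
# layer maps the embedding of word 2 straight back to the one-hot vector of
# word 2
Q, _ = np.linalg.qr(rng.normal(size=(embed_size, embed_size)))
recovered = Q[2] @ Q.T
print(int(np.argmax(recovered)))
```

The last part uses a square matrix so that exact orthogonality is possible. With a real vocabulary that is much larger than the embedding size, the transpose can only approximate an inverse.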

**Warning**: The following cell will take a while to run (possibly a couple of hours if we are not using a GPU).

In [58]:
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[Y_proba])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='nadam',
    metrics=['accuracy'],
)
model.fit(
    (X_train, X_train_dec),
    Y_train,
    epochs=10,
    validation_data=((X_valid, X_valid_dec), Y_valid),
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f897878ac10>

After training, the decoder expects as input the word that was predicted at the previous time step. One way to do this is to write a custom memory cell that keeps track of the previous output and feeds it to the decoder at the next time step. However, to keep things simple, we can just call the model multiple times, predicting one extra word at each round:

In [59]:
def translate(sentence_en: str) -> str:
    translation = ''
    for word_idx in range(max_length):
        # Encoder input
        X = np.array([sentence_en])
        # Decoder input
        X_dec = np.array(['startofseq ' + translation])
        # Last token’s probas
        y_proba = model.predict((X, X_dec))[0, word_idx]
        predicted_word_id = np.argmax(y_proba)
        predicted_word = text_vec_layer_es.get_vocabulary()[predicted_word_id]
        if predicted_word == 'endofseq':
            break
        translation += ' ' + predicted_word
    return translation.strip()

In [60]:
translate('I like soccer')

'me gusta el fútbol'

Nice! However, the model struggles with longer sentences:

In [61]:
translate('I like soccer and also going to the beach')

'me gusta el fútbol y a veces mismo al bus'

### Bidirectional RNNs
A regular recurrent layer is *causal*, meaning it cannot look into the future. This type of RNN makes sense when forecasting time series, or in the decoder of a sequence-to-sequence (seq2seq) model. But for tasks like text classification, or in the encoder of a seq2seq model, it is often preferable to look ahead at the next words before encoding a given word. Consider the phrases “the right arm”, “the right person”, and “the right to criticize”: to properly encode the word “right”, we need to look ahead. One solution is to run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, then combine their outputs at each time step, typically by concatenating them. This is what a *bidirectional recurrent layer* does.

<center>
  <img 
    src="../images/16/bidirectional_recurrent_layer.png" 
    onerror="
      this.onerror = null;
      const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
      this.src = repo + this.src.split('..')[1];
    "
  >
</center>

To create a bidirectional recurrent layer, just wrap a regular recurrent layer in a `Bidirectional` layer:

In [62]:
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)
encoder = keras.layers.Bidirectional(keras.layers.LSTM(256, return_state=True))

In [63]:
encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [
    # Short-term (0 & 2)
    tf.concat(encoder_state[::2], axis=-1),
    # Long-term (1 & 3)
    tf.concat(encoder_state[1::2], axis=-1),
]
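
The slicing above relies on the order in which a bidirectional LSTM returns its states: forward short-term, forward long-term, backward short-term, backward long-term. Here is a small NumPy sketch of the same concatenation, with constant-valued placeholder arrays (toy shapes, not real LSTM states) so that the result is easy to read:

```python
import numpy as np

batch_size, units = 2, 3
# Placeholder states in the order [h_fwd, c_fwd, h_bwd, c_bwd]
h_fwd = np.full((batch_size, units), 1.0)
c_fwd = np.full((batch_size, units), 2.0)
h_bwd = np.full((batch_size, units), 3.0)
c_bwd = np.full((batch_size, units), 4.0)
states = [h_fwd, c_fwd, h_bwd, c_bwd]

# Even indices (0 & 2) are the short-term states, odd ones (1 & 3) the
# long-term states
short_term = np.concatenate(states[::2], axis=-1)
long_term = np.concatenate(states[1::2], axis=-1)
print(short_term.shape, long_term.shape)
print(short_term[0], long_term[0])
```

Concatenating along the last axis doubles the state size, which is why a 256-unit bidirectional encoder can initialize a 512-unit decoder.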

**Warning**: The following cell will take a while to run (possibly a couple of hours if we are not using a GPU).

In [64]:
# Completes the model and trains it
decoder = keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)
output_layer = keras.layers.Dense(vocab_size, activation='softmax')
Y_proba = output_layer(decoder_outputs)
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[Y_proba])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='nadam',
    metrics=['accuracy'],
)
model.fit(
    (X_train, X_train_dec),
    Y_train,
    epochs=10,
    validation_data=((X_valid, X_valid_dec), Y_valid),
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f892d2d5fa0>

In [65]:
translate('I like soccer')

'me gusta el fútbol'

### Beam Search
We can give the model a chance to go back and fix mistakes it made earlier. *Beam search* keeps track of a short list of the *k* most promising sentences (say, the top three), and at each decoder step it tries to extend them by one word, keeping only the *k* most likely sentences. The parameter *k* is called the *beam width*.

<center>
  <img 
    src="../images/16/beam_search.png" 
    onerror="
      this.onerror = null;
      const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
      this.src = repo + this.src.split('..')[1];
    "
  >
</center>

**Tip**: The TensorFlow Addons library includes a full seq2seq API that lets us build encoder–decoder models with attention, including beam search, and more. However, its documentation is currently very limited.

Here is a very basic implementation of beam search. It is readable and understandable, but it’s definitely not optimized for speed! The function first uses the model to find the top *k* words to start the translations (where *k* is the beam width). For each of the top *k* translations, it evaluates the conditional probabilities of top *k* words it could add to that translation. These extended translations and their probabilities are added to the list of candidates. Once we’ve gone through all top *k* translations and all top *k* words that could complete them, we keep only the top *k* candidates with the highest probability, and we iterate over and over until they all finish with an EOS token. The top translation is then returned (after removing its EOS token).

**Note**: If $p(S)$ is the probability of sentence $S$, and $p(W|S)$ is the conditional probability of the word $W$ given that the translation starts with $S$, then the probability of the sentence $S^\prime=\text{concat}(S,W)$ is $p(S^\prime)=p(S)\times p(W|S)$. As we add more words, the probability gets smaller and smaller. To avoid the risk of it getting too small, which could cause floating point precision errors, the function keeps track of log probabilities instead of probabilities: recall that $\log(ab)=\log(a)+\log(b)$, therefore $\log(p(S^\prime))=\log(p(S))+\log(p(W|S))$.
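
A quick way to see why log probabilities are safer is to multiply many small probabilities together. The numbers below are made up for illustration (200 steps with a conditional probability of 0.01 each, far more extreme than a typical translation):

```python
import math

# Hypothetical conditional probabilities, one per generated word
word_probas = [0.01] * 200

# Multiplying the probabilities underflows 64-bit floats
proba = 1.0
for p in word_probas:
    proba *= p

# Summing the log probabilities stays well within range
log_proba = sum(math.log(p) for p in word_probas)

print(proba)
print(log_proba)
```

The product is $10^{-400}$, which is smaller than the smallest positive 64-bit float, so it collapses to zero and all such candidates would become indistinguishable. The sum of logs is a perfectly ordinary number.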

In [66]:
# A basic implementation of beam search


def beam_search(
    sentence_en: str, beam_width: int, verbose: bool = False
) -> str:
    # Encoder input
    X = np.array([sentence_en])
    # Decoder input
    X_dec = np.array(['startofseq'])
    # First token’s probas
    y_proba = model.predict((X, X_dec))[0, 0]
    top_k = tf.math.top_k(y_proba, k=beam_width)
    # List of best (log_proba, translation)
    top_translations = [
        (np.log(word_proba), text_vec_layer_es.get_vocabulary()[word_id])
        for word_id, word_proba in zip(top_k.indices, top_k.values)
    ]

    # Displays the top first words in verbose mode
    if verbose:
        print('Top first words:', top_translations)

    for idx in range(1, max_length):
        candidates = []
        for log_proba, translation in top_translations:
            if translation.endswith('endofseq'):
                candidates.append((log_proba, translation))
                # Translation is finished, so don’t try to extend it
                continue
            # Encoder input
            X = np.array([sentence_en])
            # Decoder input
            X_dec = np.array(['startofseq ' + translation])
            # Last token’s proba
            y_proba = model.predict((X, X_dec))[0, idx]
            top_k = tf.math.top_k(y_proba, k=beam_width)
            for word_id, word_proba in zip(top_k.indices, top_k.values):
                word = text_vec_layer_es.get_vocabulary()[word_id]
                candidates.append(
                    (log_proba + np.log(word_proba), f'{translation} {word}')
                )
        top_translations = sorted(candidates, reverse=True)[:beam_width]

        # Displays the top translation so far in verbose mode
        if verbose:
            print('Top translations so far:', top_translations)

        if all([tr.endswith('endofseq') for _, tr in top_translations]):
            break

    # Also returns the best candidate if max_length is reached first
    return top_translations[0][1].replace('endofseq', '').strip()

In [67]:
# Shows the model making an error
sentence_en = 'I love cats and dogs'
translate(sentence_en)

'me [UNK] los gatos y los gatos'

In [68]:
# Shows how beam search can help
beam_search(sentence_en, beam_width=3, verbose=True)

Top first words: [(-0.012974381, 'me'), (-4.592527, '[UNK]'), (-6.314033, 'yo')]
Top translations so far: [(-0.4831518, 'me [UNK]'), (-1.4920667, 'me encanta'), (-1.986235, 'me gustan')]
Top translations so far: [(-0.6793061, 'me [UNK] los'), (-1.9889652, 'me gustan los'), (-2.0470557, 'me encanta los')]
Top translations so far: [(-0.7609749, 'me [UNK] los gatos'), (-2.0677316, 'me gustan los gatos'), (-2.26029, 'me encanta los gatos')]
Top translations so far: [(-0.76985043, 'me [UNK] los gatos y'), (-2.0701222, 'me gustan los gatos y'), (-2.2649746, 'me encanta los gatos y')]
Top translations so far: [(-0.81283045, 'me [UNK] los gatos y los'), (-2.118244, 'me gustan los gatos y los'), (-2.96167, 'me encanta los gatos y los')]
Top translations so far: [(-1.2259341, 'me [UNK] los gatos y los gatos'), (-1.9556838, 'me [UNK] los gatos y los perros'), (-2.7524388, 'me gustan los gatos y los perros')]
Top translations so far: [(-1.2261332, 'me [UNK] los gatos y los gatos endofseq'), (-1.95

'me [UNK] los gatos y los gatos'

The correct translation is in the top 3 sentences found by beam search, but it’s not the first. Since we’re using a small vocabulary, the [UNK] token is quite frequent, so we may want to penalize it (e.g., divide its probability by 2 in the beam search function): this will discourage beam search from using it too much.
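
One way to apply such a penalty is to adjust the log probability of `[UNK]` before picking the top *k* words. The helper below is a hypothetical sketch in NumPy, not part of the `beam_search()` function above. It assumes `[UNK]` has ID 1, as in the vocabularies shown earlier, and the probabilities are toy values.

```python
import numpy as np


def top_k_with_unk_penalty(y_proba, k, unk_id=1, penalty=2.0):
    """Returns the k best (word_id, log_proba) pairs after dividing the
    [UNK] probability by `penalty`."""
    log_probas = np.log(np.asarray(y_proba, dtype=np.float64))
    # Subtracting log(penalty) is the same as dividing the probability
    log_probas[unk_id] -= np.log(penalty)
    top_ids = np.argsort(log_probas)[::-1][:k]
    return [(int(i), float(log_probas[i])) for i in top_ids]


# Toy distribution over 4 word IDs where [UNK] (ID 1) is the most likely
y_proba = [0.05, 0.40, 0.30, 0.25]
print(top_k_with_unk_penalty(y_proba, k=2, penalty=1.0))
print(top_k_with_unk_penalty(y_proba, k=2, penalty=2.0))
```

Without the penalty, `[UNK]` is the top candidate. With its probability halved it drops to third place and falls out of a beam of width 2.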

**Note**: The most common metric used in NMT is the *bilingual evaluation understudy* (BLEU) score, which compares each translation produced by the model with several good translations produced by humans: it counts the number of *n*-grams that appear in any of the target translations and adjusts the score to take into account the frequency of the produced *n*-grams in the target translations.
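
The core ingredient of BLEU, the clipped (or "modified") *n*-gram precision, can be sketched in a few lines of plain Python. This is a simplified illustration only: the full BLEU score also combines several *n*-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The reference translations are made up for the example.

```python
from collections import Counter


def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in the references, where each
    n-gram is counted at most as often as in any single reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        ref_counts = Counter(ngrams(reference.split(), n))
        for gram, count in ref_counts.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(
        min(count, max_ref_counts[gram])
        for gram, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())


candidate = 'me gusta los gatos y los gatos'
# Hypothetical human reference translations
references = [
    'me gustan los gatos y los perros',
    'adoro los gatos y los perros',
]
print(modified_precision(candidate, references, n=1))
print(modified_precision(candidate, references, n=2))
```

Clipping is what stops the repeated “gatos” from being rewarded twice: it appears only once in each reference, so only one occurrence counts.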

## Attention Mechanisms
<style>ul {list-style-type: none;}</style>

*Bahdanau attention*
- Consider the path from the word “soccer” to its translation “fútbol” in the [previous network](#simple-nmt-figure). It is quite long! This means that a representation of this word (and other words) needs to be carried over many steps before it is actually used. To make this path shorter, Dzmitry Bahdanau et al., in a landmark 2014 paper [“Neural Machine Translation by Jointly Learning to Align and Translate”](https://homl.info/attention), introduced a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step. For example, at the time step where the decoder needs to output the word “fútbol”, it will focus its attention on the word “soccer”.
  
  Here is our model with an added attention mechanism:
  
  <center>
    <img 
      src="../images/16/simple_nmt_with_attention.png" 
      onerror="
        this.onerror = null;
        const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
        this.src = repo + this.src.split('..')[1];
      "
    >
  </center>
    
  We now also send all of the encoder’s outputs to the decoder. Since the decoder cannot deal with all these encoder outputs at once, they need to be aggregated: at each time step, the decoder’s memory cell computes a weighted sum of all the encoder outputs to determine which words to focus on. The weight $\alpha_{(t,i)}$ is the weight of the $i^{\text{th}}$ encoder output at the $t^{\text{th}}$ decoder time step. Then it uses this weighted sum to compute the decoder’s current hidden state. The rest of the decoder works just like earlier.

  These $\alpha_{(t,i)}$ weights are generated by a small neural network called an *alignment model* (or an *attention layer*), which is trained jointly with the rest of the encoder–decoder model. It starts with a `Dense` layer composed of a single neuron to process each of the encoder’s outputs, along with the decoder’s previous hidden state. This layer outputs a score (or energy) for each encoder output (e.g., $e_{(3,2)}$): this score measures how well each output is aligned with the decoder’s previous hidden state. For example, in the figure above, the model has already output “me gusta el” (meaning “I like”), so it’s now expecting a noun: the word “soccer” is the one that best aligns with the current state, so it gets a high score. Finally, all the scores go through a softmax layer to get a final weight for each encoder output (e.g., $\alpha_{(3,2)}$). Since it concatenates the encoder output with the decoder’s previous hidden state, it is sometimes called *concatenative attention* (or *additive attention*).
  
  **Note**: If the input sentence is $n$ words long, and assuming the output sentence is about as long, then this model will need to compute about $n^2$ weights. Fortunately, this quadratic computational complexity is still tractable because even long sentences don’t have thousands of words.

*Luong attention*
- Also known as *multiplicative attention*, it was proposed shortly after, in a 2015 paper [“Effective Approaches to Attention-Based Neural Machine Translation”](https://homl.info/luongattention), by Minh-Thang Luong et al. Because the goal of the alignment model is to measure the similarity between one of the encoder’s outputs and the decoder’s previous hidden state, the authors proposed to simply compute the dot product of these two vectors, as this is often a fairly good similarity measure, and modern hardware can compute it very efficiently. For this, both vectors must have the same dimensionality. The dot product gives a score, and all the scores (at a given decoder time step) go through a softmax layer to give the final weights.

  They also proposed to use the decoder’s hidden state at the current time step rather than at the previous time step (i.e., $\mathbf{h}_{(t)}$ rather than $\mathbf{h}_{(t-1)}$), then to use the output of the attention mechanism (noted $\widetilde{\mathbf{h}}_{(t)}$) directly to compute the decoder’s predictions, rather than using it to compute the decoder’s current hidden state.

  They also proposed a variant of the dot product mechanism where the encoder outputs first go through a fully connected layer (without a bias term) before the dot products are computed. This is called the “general” dot product approach. The researchers compared both dot product approaches with the concatenative attention mechanism (adding a rescaling parameter vector $\mathbf{v}$), and they observed that the dot product variants performed better than concatenative attention.

**Equation 16-1** Attention mechanisms
$$
\begin{split}
&\widetilde{\mathbf{h}}_{(t)}=\sum_i\alpha_{(t,i)}\mathbf{y}_{(i)}
\\&\text{with}\;\alpha_{(t,i)}
=\frac{\exp(e_{(t,i)})}{\sum_{i^\prime}\exp({e_{(t,i^\prime)}})}
\\&\text{and}\;e_{(t,i)}=\begin{cases}
{\mathbf{h}_{(t)}}^\top\mathbf{y}_{(i)}&\text{dot}
\\{\mathbf{h}_{(t)}}^\top\mathbf{W}\mathbf{y}_{(i)}&\text{general}
\\\mathbf{v}^\top\tanh(\mathbf{W}[\mathbf{h}_{(t)};\mathbf{y}_{(i)}])&\text{concat}
\end{cases}
\end{split}
$$
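
Equation 16-1 translates almost line by line into NumPy. The sketch below handles a single decoder time step with random toy values. The function names, shapes, and the `units` size of the concat variant are assumptions made for illustration, and it is not how the Keras layers are implemented.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def alignment_scores(h_t, Y, kind, W=None, v=None):
    """Computes e_(t,i) for every encoder output y_(i) (the rows of Y)."""
    if kind == 'dot':
        return Y @ h_t
    if kind == 'general':
        # Row i of (Y @ W.T) is W y_(i), so this gives h^T W y_(i)
        return (Y @ W.T) @ h_t
    if kind == 'concat':
        # Each row is the concatenation [h_(t); y_(i)]
        H = np.tile(h_t, (len(Y), 1))
        return np.tanh(np.concatenate([H, Y], axis=1) @ W.T) @ v
    raise ValueError(f'Unknown kind: {kind}')


def attend(h_t, Y, kind, W=None, v=None):
    alpha = softmax(alignment_scores(h_t, Y, kind, W, v))
    # Weighted sum of the encoder outputs
    return alpha, alpha @ Y


rng = np.random.default_rng(42)
n_steps, dim, units = 4, 3, 5
Y = rng.normal(size=(n_steps, dim))  # encoder outputs
h_t = rng.normal(size=dim)  # decoder hidden state
W_general = rng.normal(size=(dim, dim))
W_concat = rng.normal(size=(units, 2 * dim))
v = rng.normal(size=units)

for kind, W in [('dot', None), ('general', W_general), ('concat', W_concat)]:
    alpha, h_tilde = attend(h_t, Y, kind, W, v)
    print(kind, alpha.round(3), h_tilde.round(3))
```

Only the score function differs between the three variants: the softmax and the weighted sum are shared.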

Keras provides a `keras.layers.Attention` layer for Luong attention, and an `AdditiveAttention` layer for Bahdanau attention.

Let’s add it to our model. We need to feed all the encoder’s outputs to the `Attention` layer, so we must add `return_sequences=True` to the encoder:

In [69]:
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)
encoder = keras.layers.Bidirectional(
    keras.layers.LSTM(256, return_sequences=True, return_state=True)
)

In [70]:
# This part of the model is exactly the same as earlier
encoder_outputs, *encoder_state = encoder(encoder_embeddings)
encoder_state = [
    # Short-term (0 & 2)
    tf.concat(encoder_state[::2], axis=-1),
    # Long-term (1 & 3)
    tf.concat(encoder_state[1::2], axis=-1),
]
decoder = keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_state)

Next, we need to create the attention layer and pass it the decoder’s states and the encoder’s outputs. However, to access the decoder’s states at each step we would need to write a custom memory cell. For simplicity, let’s use the decoder’s outputs instead of its states: in practice this works well too, and it’s much easier to code.

In [71]:
attention_layer = keras.layers.Attention()
attention_outputs = attention_layer([decoder_outputs, encoder_outputs])
output_layer = keras.layers.Dense(vocab_size, activation='softmax')
Y_proba = output_layer(attention_outputs)

**Warning**: The following cell will take a while to run (possibly a couple of hours if we are not using a GPU).

In [72]:
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[Y_proba])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='nadam',
    metrics=['accuracy'],
)
model.fit(
    (X_train, X_train_dec),
    Y_train,
    epochs=10,
    validation_data=((X_valid, X_valid_dec), Y_valid),
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f87e5c8ad90>

In [73]:
translate('I like soccer and also going to the beach')

'me gusta el fútbol y también ir a la playa'

In [74]:
beam_search(
    'I like soccer and also going to the beach', beam_width=3, verbose=True
)

Top first words: [(-0.26210824, 'me'), (-2.553061, 'prefiero'), (-3.2005944, 'yo')]
Top translations so far: [(-0.32478744, 'me gusta'), (-3.0608056, 'prefiero el'), (-3.1685317, 'me gustan')]
Top translations so far: [(-0.7464272, 'me gusta el'), (-2.4712462, 'me gusta fútbol'), (-2.9149299, 'me gusta al')]
Top translations so far: [(-1.0369574, 'me gusta el fútbol'), (-2.3301778, 'me gusta el el'), (-2.9658434, 'me gusta fútbol y')]
Top translations so far: [(-1.0404125, 'me gusta el fútbol y'), (-2.5983238, 'me gusta el el fútbol'), (-2.9736564, 'me gusta fútbol y también')]
Top translations so far: [(-1.0520902, 'me gusta el fútbol y también'), (-2.6003318, 'me gusta el el fútbol y'), (-3.128903, 'me gusta fútbol y también me')]
Top translations so far: [(-1.9568634, 'me gusta el fútbol y también ir'), (-2.6169589, 'me gusta el el fútbol y también'), (-2.6949644, 'me gusta el fútbol y también fuera')]
Top translations so far: [(-1.9676423, 'me gusta el fútbol y también ir a'), (-2.

'me gusta el fútbol y también ir a la playa'

There’s another way to think of this layer: it acts as a differentiable memory retrieval mechanism.

For example, suppose the encoder analyzed the input sentence “I like soccer”, and it managed to understand that the word “I” is the subject and the word “like” is the verb, so it encoded this information in its outputs for these words. Now suppose the decoder has already translated the subject, and it thinks that it should translate the verb next. For this, it needs to fetch the verb from the input sentence. This is analogous to a dictionary lookup: it’s as if the encoder had created a dictionary {“subject”: “I”, “verb”: “like”, ...} and the decoder wanted to look up the value that corresponds to the key “verb”.

However, the model does not have discrete tokens to represent the keys (like “subject” or “verb”); instead, it has vectorized representations of these concepts that it learned during training, so the query it will use for the lookup will not perfectly match any key in the dictionary. The solution is to compute a similarity measure between the query and each key in the dictionary, and then use the softmax function to convert these similarity scores to weights that add up to 1. As we saw earlier, that’s exactly what the attention layer does. If the key that represents the verb is by far the most similar to the query, then that key’s weight will be close to 1. Next, the attention layer computes a weighted sum of the corresponding values: if the weight of the “verb” key is close to 1, then the weighted sum will be very close to the representation of the word “like”.

This is why the Keras `Attention` and `AdditiveAttention` layers both expect a list as input, containing two or three items: the *queries*, the *keys*, and optionally the *values*. If we do not pass any values, then they are automatically equal to the *keys*.
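
The dictionary analogy can be made concrete with a few NumPy lines. The key, value, and query vectors below are hand-picked toy numbers (not learned representations), chosen so that the query is close to, but not identical to, the second key:

```python
import numpy as np

# Toy key vectors for the concepts "subject" and "verb"
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
# Toy value vectors: the representations stored under each key
values = np.array([[10.0, 0.0, 0.0],
                   [0.0, 20.0, 0.0]])
# A query that resembles the "verb" key without matching it exactly
query = np.array([1.0, 9.0])

# Similarity between the query and each key (dot product)
scores = keys @ query
# Softmax turns the similarity scores into weights that add up to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()
# The result of the lookup is a weighted sum of the values
result = weights @ values

print(weights.round(4))
print(result.round(3))
```

Because the “verb” key is by far the most similar to the query, the result is almost exactly the second value vector. Unlike a hard dictionary lookup, every step here is differentiable.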

### Attention Is All You Need: The Original Transformer Architecture
In a groundbreaking 2017 paper, a team of Google researchers suggested that [“Attention Is All You Need”](https://homl.info/transformer). They created an architecture called the *transformer*, which significantly improved the state-of-the-art in NMT. Because the model is not recurrent:
- It doesn’t suffer as much from the vanishing or exploding gradients problems as RNNs.
- It can be trained in fewer steps.
- It’s easier to parallelize across multiple GPUs.
- It can better capture long-range patterns than RNNs.

<center>
  <img 
    src="../images/16/transformer.png" 
    onerror="
      this.onerror = null;
      const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
      this.src = repo + this.src.split('..')[1];
    "
  >
</center>

The left part of the figure is the encoder, and the right part is the decoder. Each embedding layer outputs a 3D tensor of shape [*batch size*, *sequence length*, *embedding size*]. After that, the tensors are gradually transformed as they flow through the transformer, but their shape remains the same.

The encoder’s role is to gradually transform the inputs (word representations of the English sentence) until each word’s representation perfectly captures the meaning of the word, in the context of the sentence.

The decoder’s role is to gradually transform each word representation in the translated sentence into a word representation of the next word in the translation.

After going through the decoder, each word representation goes through a final `Dense` layer with a softmax activation function, which will hopefully output a high probability for the correct next word and a low probability for all other words.

Let’s go into the details:
- Both the encoder and the decoder contain modules that are stacked $N$ times. In the paper, $N=6$. The final outputs of the whole encoder stack are fed to the decoder at each of these $N$ levels.
- There are two embedding layers.
- There are several skip connections, each of them followed by a layer normalization layer.
- There are several feedforward modules that are composed of two dense layers each (the first one using the ReLU activation function, the second with no activation function).
- The output layer is a dense layer using the softmax activation function.
- We can also use a bit of dropout after the attention layers and the feedforward modules.
- Since all of these layers are time-distributed, each word is treated independently from all the others. But we can’t translate a sentence by looking at the words completely separately. That’s where the new components come in:
  - The encoder’s *multi-head attention* layer updates each word representation by attending to (i.e., paying attention to) all other words in the same sentence.
  - The decoder’s *masked multi-head attention* layer does the same thing, but when it processes a word, it doesn’t attend to words located after it: it’s a causal layer.
  - The decoder’s upper *multi-head attention* layer is where the decoder pays attention to the words in the English sentence. This is called *cross*-attention, not *self*-attention in this case.
  - The *positional encodings* are dense vectors (much like word embeddings) that represent the position of each word in the sentence. The $n^{\text{th}}$ positional encoding is added to the word embedding of the $n^{\text{th}}$ word in each sentence. This is needed because all layers in the transformer architecture ignore word positions: without positional encodings, we could shuffle the input sequences, and it would just shuffle the output sequences in the same way.
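
The claim about shuffling can be checked numerically. The following is a minimal NumPy sketch, not the notebook's Keras model: `self_attention` is a hypothetical single-head self-attention layer with no learned projections, and the embeddings are random values chosen for illustration. Without positional encodings, permuting the input words permutes the output rows in exactly the same way.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X):
    # Single-head self-attention without learned projections:
    # queries, keys, and values are all equal to X
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores) @ X


rng = np.random.default_rng(42)
# 5 "words", each represented by an 8-dimensional embedding
X = rng.normal(size=(5, 8))
perm = rng.permutation(5)

out = self_attention(X)
out_shuffled = self_attention(X[perm])
# Shuffling the inputs just shuffles the outputs in the same way
is_equivariant = np.allclose(out_shuffled, out[perm])
```

Since `is_equivariant` holds for any permutation, the layer on its own cannot tell "I like soccer" from "soccer like I", which is why position information has to be added to the inputs.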

**Note**: The first two arrows going into each multi-head attention layer represent the keys and values, and the third arrow represents the queries.

#### Positional encodings
The easiest way to implement this is to use an `Embedding` layer and make it encode all the positions from 0 to the maximum sequence length in the batch, then add the result to the word embeddings. The rules of broadcasting will ensure that the positional encodings get applied to every input sequence.

In [75]:
# Max length in the whole training set
max_length = 50
embed_size = 128
# Ensures reproducibility on CPU
keras.utils.set_random_seed(42)
pos_embed_layer = keras.layers.Embedding(max_length, embed_size)
batch_max_len_enc = tf.shape(encoder_embeddings)[1]
encoder_in = encoder_embeddings + pos_embed_layer(tf.range(batch_max_len_enc))
batch_max_len_dec = tf.shape(decoder_embeddings)[1]
decoder_in = decoder_embeddings + pos_embed_layer(tf.range(batch_max_len_dec))

Alternatively, we can use fixed, non-trainable positional encodings (however, when there is a large amount of pretraining data, trainable positional encodings are usually favored):

**Equation 16-2** Sine/cosine positional encodings
$$
P_{p,i}=\begin{cases}
\sin(p/10000^{i/d})&\text{if $i$ is even}
\\\cos(p/10000^{(i-1)/d})&\text{if $i$ is odd}
\end{cases}
$$
- $P_{p,i}$: The $i^{\text{th}}$ component of the encoding for the word located at the $p^{\text{th}}$ position in the sentence.
- $d$: The embedding size

<center>
  <img 
    src="../images/16/positional_encoding.png" 
    onerror="
      this.onerror = null;
      const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
      this.src = repo + this.src.split('..')[1];
    "
  >
</center>

In [76]:
class PositionalEncoding(keras.layers.Layer):
    def __init__(
        self,
        max_length: int,
        embed_size: int,
        dtype: tf.DType = tf.float32,
        **kwargs,
    ) -> None:
        super().__init__(dtype=dtype, **kwargs)
        assert embed_size % 2 == 0, 'embed_size must be even'
        p, i = np.meshgrid(
            np.arange(max_length), 2 * np.arange(embed_size // 2)
        )
        pos_emb = np.empty((1, max_length, embed_size))
        pos_emb[0, :, ::2] = np.sin(p / 10_000 ** (i / embed_size)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10_000 ** (i / embed_size)).T
        self.pos_encodings = tf.constant(pos_emb.astype(self.dtype))
        self.supports_masking = True

    def call(self, inputs: tf.Tensor) -> tf.Tensor:
        batch_max_length = tf.shape(inputs)[1]
        return inputs + self.pos_encodings[:, :batch_max_length]

In [77]:
pos_embed_layer = PositionalEncoding(max_length, embed_size)
encoder_in = pos_embed_layer(encoder_embeddings)
decoder_in = pos_embed_layer(decoder_embeddings)

#### Multi-head attention
To understand how a multi-head attention layer works, we must first understand the *scaled dot-product attention* layer, which it is based on. It’s the same as Luong attention, except for a scaling factor.

**Equation 16-3** Scaled dot-product attention
$$
\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})
=\text{softmax}(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_{keys}}})\mathbf{V}
$$
- $\mathbf{Q}$: A matrix containing one row per *query*. Its shape is [$n_{queries}$, $d_{keys}$], where $n_{queries}$ is the number of queries and $d_{keys}$ is the number of dimensions of each query and each key.
- $\mathbf{K}$: A matrix containing one row per *key*. Its shape is [$n_{keys}$, $d_{keys}$], where $n_{keys}$ is the number of keys and values.
- $\mathbf{V}$: A matrix containing one row per *value*. Its shape is [$n_{keys}$, $d_{values}$], where $d_{values}$ is the number of dimensions of each value.
- The shape of $\mathbf{Q}\mathbf{K}^\top$ is [$n_{queries}$, $n_{keys}$]: it contains one similarity score for each query/key pair. To prevent this matrix from being huge, the input sequences must not be too long. The final output has a shape of [$n_{queries}$, $d_{values}$]: there is one row per query, where each row represents the query result (a weighted sum of the values).
- The scaling factor $1/(\sqrt{d_{keys}})$ scales down the similarity scores to avoid saturating the softmax function, which would lead to tiny gradients.
- It is possible to mask out some key/value pairs by adding a very large negative value to the corresponding similarity scores, just before computing the softmax. This is useful in the masked multi-head attention layer.
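
Equation 16-3 and the masking trick can be sketched in a few lines of NumPy. This is an illustrative implementation for a single sequence (no batch dimension), not the code used inside Keras; the function name, the random inputs, and the constant `-1e9` are assumptions made for the example.

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Equation 16-3 for a single sequence (no batch dimension).

    Q: [n_queries, d_keys], K: [n_keys, d_keys], V: [n_keys, d_values].
    mask: optional Boolean array of shape [n_queries, n_keys]; False
    entries are masked out.
    """
    d_keys = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_keys)
    if mask is not None:
        # Add a very large negative value to the masked similarity scores
        scores = scores + np.where(mask, 0.0, -1e9)
    # Row-wise softmax (numerically stable)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))  # 3 queries
K = rng.normal(size=(5, 4))  # 5 keys
V = rng.normal(size=(5, 2))  # 5 values

output, weights = scaled_dot_product_attention(Q, K, V)
# Mask out the last two key/value pairs for every query
mask = np.array([[True, True, True, False, False]] * 3)
masked_output, masked_weights = scaled_dot_product_attention(Q, K, V, mask)
```

Each row of `weights` sums to 1, and the masked columns of `masked_weights` end up at zero, so the corresponding values no longer contribute to the weighted sum.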

**Note**: If we set `use_scale=True` when creating a `keras.layers.Attention` layer, then it will create an additional parameter that lets the layer learn how to properly downscale the similarity scores.

**Note**: The `Attention` layer’s inputs are just like $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, except with an extra batch dimension (the first dimension). Internally, the layer computes all the attention scores for all sentences in the batch with just one call to `tf.matmul(queries, keys)`. In TensorFlow, if `A` and `B` are tensors with more than two dimensions, say, of shape [2, 3, 4, 5] and [2, 3, 5, 6], respectively, then `tf.matmul(A, B)` will treat these tensors as 2 $\times$ 3 arrays where each cell contains a matrix, and it will multiply the corresponding matrices: the matrix at the $i^{\text{th}}$ row and $j^{\text{th}}$ column in `A` will be multiplied by the matrix at the $i^{\text{th}}$ row and $j^{\text{th}}$ column in `B`. Since the product of a 4 $\times$ 5 matrix with a 5 $\times$ 6 matrix is a 4 $\times$ 6 matrix, `tf.matmul(A, B)` will return an array of shape [2, 3, 4, 6].
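
The batched matrix product is easy to check on shapes alone. The sketch below uses `np.matmul`, which applies the same rule as `tf.matmul` for arrays with more than two dimensions; the batch size and sequence lengths in the second half are arbitrary values chosen for illustration, and the keys are transposed over their last two axes so that the shapes line up.

```python
import numpy as np

A = np.ones((2, 3, 4, 5))
B = np.ones((2, 3, 5, 6))
# The last two axes are multiplied as matrices; the leading axes are kept
C = np.matmul(A, B)
result_shape = C.shape

# Same idea for attention scores with a batch dimension
queries = np.ones((32, 10, 64))  # [batch size, n_queries, d_keys]
keys = np.ones((32, 12, 64))  # [batch size, n_keys, d_keys]
scores = np.matmul(queries, np.swapaxes(keys, -1, -2))
scores_shape = scores.shape
```

Here `result_shape` is `(2, 3, 4, 6)` and `scores_shape` is `(32, 10, 12)`: one similarity score per query/key pair, for every sentence in the batch.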

Now let’s look at the multi-head attention layer:

<center>
  <img 
    src="../images/16/multi_head_attention_layer.png" 
    onerror="
      this.onerror = null;
      const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
      this.src = repo + this.src.split('..')[1];
    "
  >
</center>

It is just a bunch of scaled dot-product attention layers, each preceded by a linear transformation of the values, keys, and queries (i.e., a time-distributed dense layer with no activation function). All the outputs are simply concatenated, and they go through a final linear transformation (again, time-distributed).

What is the intuition behind this architecture? Consider the word “like” in the sentence “I like soccer”. The encoder was smart enough to encode the fact that it is a verb. But the word representation also includes its position and many other features that are useful for its translation, such as the fact that it is in the present tense. If we just used a single scaled dot-product attention layer, we would only be able to query all of these characteristics in one shot. 

Applying *multiple* different linear transformations of the values, keys, and queries allows the model to apply many different projections of the word representation into different subspaces, each focusing on a subset of the word’s characteristics. Then the scaled dot-product attention layers implement the lookup phase, and finally we concatenate all the results and project them back to the original space.
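
To make the project, look up, concatenate, project back pipeline concrete, here is a rough NumPy sketch of a multi-head attention layer for a single sequence. It is an illustration only: the weights are random rather than learned, biases and dropout are omitted, and the names `multi_head_attention` and `params` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(queries, keys, values, params):
    # params holds one (W_q, W_k, W_v) triple per head, plus W_out
    head_outputs = []
    for W_q, W_k, W_v in params['heads']:
        # Linear projections into this head's subspace
        Q, K, V = queries @ W_q, keys @ W_k, values @ W_v
        # Scaled dot-product attention: the lookup phase
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        head_outputs.append(weights @ V)
    # Concatenate all heads and project back to the original space
    return np.concatenate(head_outputs, axis=-1) @ params['W_out']


rng = np.random.default_rng(42)
seq_len, d_model, n_heads, d_head = 6, 16, 4, 4
params = {
    'heads': [
        tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
        for _ in range(n_heads)
    ],
    'W_out': rng.normal(size=(n_heads * d_head, d_model)),
}
X = rng.normal(size=(seq_len, d_model))
# Self-attention: queries, keys, and values all come from X
Z = multi_head_attention(X, X, X, params)
```

The output `Z` has the same shape as `X`, which is what allows these layers to be stacked and combined with skip connections.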

Let’s build the rest of the transformer:

In [78]:
# Instead of 6
N = 2
num_heads = 8
dropout_rate = 0.1
# For the first Dense layer in each Feed Forward block
n_units = 128
# [batch size, 1, max sequence length]
encoder_pad_mask = tf.math.not_equal(encoder_input_ids, 0)[:, tf.newaxis]
Z = encoder_in
for _ in range(N):
    skip = Z
    attn_layer = keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate
    )
    Z = attn_layer(
        Z,
        value=Z,
        # The layer now supports automatic masking
        # attention_mask=encoder_pad_mask
    )
    Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))
    skip = Z
    Z = keras.layers.Dense(n_units, activation='relu')(Z)
    Z = keras.layers.Dense(embed_size)(Z)
    Z = keras.layers.Dropout(dropout_rate)(Z)
    Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))

A while ago, the `MultiHeadAttention` layer did not support automatic masking, so we had to handle it manually (this is the commented-out `attention_mask` argument in the code above). Understanding manual masking is still really helpful.

The `MultiHeadAttention` layer accepts an `attention_mask` argument, which is a Boolean tensor of shape [*batch size*, *max query length*, *max value length*]: for every token in every query sequence, this mask indicates which tokens in the corresponding value sequence should be attended to. We want to tell the `MultiHeadAttention` layer to ignore all the padding tokens in the values. So, we first compute the padding mask using `tf.math.not_equal(encoder_input_ids, 0)`. This returns a Boolean tensor of shape [*batch size*, *max sequence length*]. We then insert a second axis using `[:, tf.newaxis]`, to get a mask of shape [*batch size*, *1*, *max sequence length*]. This allows us to use this mask as the `attention_mask` when calling the `MultiHeadAttention` layer: thanks to broadcasting, the same mask will be used for all tokens in each query. This way, the padding tokens in the values will be ignored correctly.
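
The shapes involved can be traced with a small NumPy sketch. The token IDs and the all-zero score tensor below are made up for illustration; indexing with `np.newaxis` behaves like `tf.newaxis`, and `np.broadcast_to` only makes explicit what broadcasting does implicitly.

```python
import numpy as np

# Two padded sequences of token IDs (0 is the padding ID)
input_ids = np.array([[5, 8, 2, 0, 0],
                      [7, 3, 9, 4, 0]])
pad_mask = (input_ids != 0)[:, np.newaxis]  # shape [2, 1, 5]

# Hypothetical similarity scores of shape [batch, queries, values]
scores = np.zeros((2, 5, 5))
# Broadcasting applies the same row mask to every query position
full_mask = np.broadcast_to(pad_mask, scores.shape)
masked_scores = np.where(full_mask, scores, -1e9)
```

For the first sequence, every row of `full_mask[0]` is `[True, True, True, False, False]`, so no query can attend to the two padding tokens.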

**Note**: Currently `Z + skip` does not support automatic masking, which is why we had to write `keras.layers.Add()([Z, skip])` instead.

In the decoder, the first multi-head attention layer is also a self-attention layer, like in the encoder, but it is a *masked* multi-head attention layer, meaning it is causal: it should ignore all tokens in the future. So, we need two masks: a padding mask and a causal mask.

In [79]:
decoder_pad_mask = tf.math.not_equal(decoder_input_ids, 0)[:, tf.newaxis]
# Creates a lower triangular matrix
causal_mask = tf.linalg.band_part(
    tf.ones((batch_max_len_dec, batch_max_len_dec), tf.bool), -1, 0
)

The `tf.linalg.band_part()` function takes a tensor and returns a copy with all the values outside a diagonal band set to zero. With these arguments, we get a square matrix of size `batch_max_len_dec` (the max length of the input sequences in the batch), with 1s in the lower-left triangle and 0s in the upper right. If we use this mask as the attention mask, we will get exactly what we want. The causal mask only has two dimensions: it’s missing the batch dimension, but that’s okay since broadcasting ensures that it gets copied across all the instances in the batch.
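
Here is a small NumPy sketch of how the two masks combine. It assumes a single made-up sequence of length 4 whose last token is padding; `np.tril` plays the role of `tf.linalg.band_part(..., -1, 0)`.

```python
import numpy as np

seq_len = 4
# Lower triangular matrix: position i may attend to positions 0..i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# One sequence whose last token is padding (ID 0)
decoder_ids = np.array([[6, 2, 9, 0]])
pad_mask = (decoder_ids != 0)[:, np.newaxis]  # shape [1, 1, 4]

# Broadcasting combines [4, 4] with [1, 1, 4] into [1, 4, 4]
combined_mask = causal_mask & pad_mask
```

In `combined_mask[0]`, row $i$ allows attention to positions 0 through $i$ only, and the last column is `False` everywhere because that position is padding.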

In [80]:
# Let’s save the encoder’s final outputs
encoder_outputs = Z
# The decoder starts with its own inputs
Z = decoder_in
for _ in range(N):
    skip = Z
    attn_layer = keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate
    )
    Z = attn_layer(
        Z,
        value=Z,
        # The layer now supports automatic masking
        use_causal_mask=True,
        # attention_mask=causal_mask & decoder_pad_mask,
    )
    Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))
    skip = Z
    attn_layer = keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate
    )
    Z = attn_layer(
        Z,
        value=encoder_outputs,
        # The layer now supports automatic masking
        # attention_mask=encoder_pad_mask,
    )
    Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))
    skip = Z
    Z = keras.layers.Dense(n_units, activation='relu')(Z)
    Z = keras.layers.Dense(embed_size)(Z)
    Z = keras.layers.LayerNormalization()(keras.layers.Add()([Z, skip]))

**Warning**: The following cell will take a while to run (possibly 2 or 3 hours if we are not using a GPU).

In [81]:
Y_proba = keras.layers.Dense(vocab_size, activation='softmax')(Z)
model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[Y_proba])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='nadam',
    metrics=['accuracy'],
)
model.fit(
    (X_train, X_train_dec),
    Y_train,
    epochs=10,
    validation_data=((X_valid, X_valid_dec), Y_valid),
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f8946cdf9a0>

In [82]:
translate('I like soccer and also going to the beach')

'me gusta el fútbol y yo también voy a la playa'

**Tip**: The Keras team has created a new [Keras NLP project](https://github.com/keras-team/keras-nlp), including an API to build a transformer more easily. We may also be interested in the new [Keras CV project](https://github.com/keras-team/keras-cv) for computer vision.

## An Avalanche of Transformer Models
The year 2018 has been called the “ImageNet moment for NLP”. Since then, progress has been astounding, with larger and larger transformer-based architectures trained on immense datasets.

- First, the GPT paper [“Improving Language Understanding by Generative Pre-Training”](https://homl.info/gpt) by Alec Radford and other OpenAI researchers in 2018 once again demonstrated the effectiveness of unsupervised pretraining, like the ELMo and ULMFiT papers before it, but this time using a transformer-like architecture. The authors pretrained a large but fairly simple architecture composed of a stack of 12 transformer modules using only masked multi-head attention layers, like in the original transformer’s decoder. They trained it on a very large dataset, using the same autoregressive technique we used for our Shakespearean char-RNN: just predict the next token. This is a form of self-supervised learning. Then they fine-tuned it on various language tasks, using only minor adaptations for each task: text classification, *entailment* (whether sentence A imposes, involves, or implies sentence B as a necessary consequence), similarity (e.g., “Nice weather today” is very similar to “It is sunny”), and question answering (given a few paragraphs of text giving some context, the model must answer some multiple-choice questions).
- Then in 2019 Google’s BERT paper [“BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”](https://homl.info/bert) by Jacob Devlin et al. came out: it also demonstrated the effectiveness of self-supervised pretraining on a large corpus, using a similar architecture to GPT but with nonmasked multi-head attention layers only, like in the original transformer’s encoder. This means that the model is naturally bidirectional; hence the B in BERT (*Bidirectional Encoder Representations from Transformers*). The authors proposed two pretraining tasks that explain most of the model’s strength:
  
  *Masked language model (MLM)*
  - Each word in a sentence has a 15% probability of being selected, and the model is trained to predict the selected words. Each selected word has an 80% chance of being masked, a 10% chance of being replaced by a random word (to reduce the discrepancy between pretraining and fine-tuning, since the model will not see \<mask> tokens during fine-tuning), and a 10% chance of being left alone (to bias the model toward the correct answer).
  
  *Next sentence prediction (NSP)*
  - The model is trained to predict whether two sentences are consecutive or not. 
  Later research showed that NSP was not as important as was initially thought, 
  so it was dropped in most later architectures.
  
  The model is trained on these two tasks simultaneously.
  
  <center>
    <img 
      src="../images/16/bert.png" 
      onerror="
        this.onerror = null;
        const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
        this.src = repo + this.src.split('..')[1];
      "
    >
  </center>
  
  For the NSP task, the authors inserted a class token (\<CLS>) at the start of every input. The two input sentences are concatenated, separated only by a special separation token (\<SEP>), and they are fed as input to the model. To help the model know which sentence each input token belongs to, a *segment embedding* is added on top of each token’s positional embeddings. The loss is only computed on the NSP prediction and the masked tokens, not on the unmasked ones.
  
  After this, the model is fine-tuned on many different tasks, changing very little for each task. For example, for text classification such as sentiment analysis, all output tokens are ignored except for the first one, corresponding to the class token, and a new output layer replaces the previous one, which was just a binary classification layer for NSP.
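
The MLM corruption rule described above (15% of words selected, then an 80/10/10 split) can be sketched in plain Python. This is an illustration, not BERT's actual preprocessing code: it works on whole words instead of subword tokens, and the function name `mask_tokens` and the toy vocabulary are assumptions.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Returns (corrupted tokens, list of selected positions)."""
    corrupted = list(tokens)
    selected = []
    for index in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # not selected: left as is and not predicted
        selected.append(index)
        roll = rng.random()
        if roll < 0.8:
            corrupted[index] = '<mask>'
        elif roll < 0.9:
            corrupted[index] = rng.choice(vocab)
        # else: keep the original token (remaining 10%)
    return corrupted, selected


rng = random.Random(42)
vocab = ['she', 'likes', 'soccer', 'and', 'the', 'beach', 'very', 'much']
tokens = vocab * 250  # 2,000 tokens, long enough to see the proportions
corrupted, selected = mask_tokens(tokens, vocab, rng)
selected_fraction = len(selected) / len(tokens)
n_masked = sum(corrupted[i] == '<mask>' for i in selected)
masked_fraction = n_masked / len(selected)
```

On a long enough sequence, `selected_fraction` should land near 0.15 and `masked_fraction` near 0.8. During pretraining, the loss would be computed only at the positions listed in `selected`.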

- In February 2019, Alec Radford, Jeffrey Wu, and other OpenAI researchers published the GPT-2 paper [“Language Models Are Unsupervised Multitask Learners”](https://homl.info/gpt2), which proposed a very similar architecture to GPT, but larger still (with over 1.5 billion parameters!). The researchers showed that the new and improved GPT model could perform *zero-shot learning* (ZSL), meaning it could achieve good performance on many tasks without any fine-tuning.
- Google’s [“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”](https://homl.info/switch), introduced in January 2021, used 1 trillion parameters, and soon much larger models came out, such as the Wu Dao 2.0 model by the Beijing Academy of Artificial Intelligence (BAAI), announced in June 2021.
- Luckily, ingenious researchers are finding new ways to downsize transformers and make them more data-efficient. For example, the DistilBERT model, introduced in October 2019 by Victor Sanh et al. from Hugging Face in [“DistilBERT, A Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter”](https://homl.info/distilbert), is a small and fast transformer model based on BERT. It is available on Hugging Face’s excellent model hub.

  DistilBERT was trained using *distillation* (hence the name): this means transferring knowledge from a teacher model to a student one, which is usually much smaller than the teacher model. This is typically done by using the teacher’s predicted probabilities for each training instance as targets for the student. Surprisingly, distillation often works better than training the student from scratch on the same dataset as the teacher! Indeed, the student benefits from the teacher’s more nuanced labels.
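
The idea of using the teacher's probabilities as targets can be written as a cross-entropy loss. The sketch below is a simplified illustration and not DistilBERT's exact training objective; the logits are made-up values and `distillation_loss` is a hypothetical helper.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits):
    # Cross-entropy of the student's predictions against the teacher's
    # predicted probabilities (soft targets), averaged over instances
    teacher_proba = softmax(teacher_logits)
    student_log_proba = np.log(softmax(student_logits))
    return -(teacher_proba * student_log_proba).sum(axis=-1).mean()


# Hypothetical logits for 2 instances and 3 classes
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 2.5]])
good_student = np.array([[3.5, 1.2, 0.4], [0.1, 2.8, 2.6]])
poor_student = np.array([[0.5, 1.0, 4.0], [3.0, 0.2, 2.5]])

loss_good = distillation_loss(teacher_logits, good_student)
loss_poor = distillation_loss(teacher_logits, poor_student)
# Lower bound: the loss when the student matches the teacher exactly
loss_min = distillation_loss(teacher_logits, teacher_logits)
```

Unlike a one-hot label, the teacher's distribution for the second instance says that classes 2 and 3 are both plausible, which is the kind of nuance the student benefits from.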

- Many more transformer architectures came out after BERT, almost on a monthly basis, often improving on the state of the art across all NLP tasks: XLNet (June 2019), RoBERTa (July 2019), StructBERT (August 2019), ALBERT (September 2019), T5 (October 2019), ELECTRA (March 2020), GPT-3 (May 2020), DeBERTa (June 2020), Switch Transformers (January 2021), Wu Dao 2.0 (June 2021), Gopher (December 2021), GPT-NeoX-20B (February 2022), Chinchilla (March 2022), OPT (May 2022), and the list goes on and on. Mariya Yao summarized many of these models in this post: https://homl.info/yaopost.
- Let’s take a quick look at the T5 paper [“Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”](https://homl.info/t5) by Colin Raffel et al. at Google: it frames all NLP tasks as text-to-text, using an encoder–decoder transformer. For example, to translate “I like soccer” to Spanish, we can just call the model with the input sentence “translate English to Spanish: I like soccer” and it outputs “me gusta el fútbol”. To summarize a paragraph, we just enter “summarize:” followed by the paragraph, and it outputs the summary. For classification, we only need to change the prefix to “classify:” and the model outputs the class name, as text. This simplifies using the model, and it also makes it possible to pretrain it on even more tasks.
- In April 2022, Google researchers used a new large-scale training platform named *Pathways* to train a humongous language model named the *Pathways Language Model* (PaLM) described in [“PaLM: Scaling Language Modeling with Pathways”](https://homl.info/palm) by Aakanksha Chowdhery et al. with a whopping 540 billion parameters, using over 6,000 TPUs. This model is a standard transformer, using only masked multi-head attention layers, with just a few tweaks. This model achieved incredible performance on all sorts of NLP tasks, particularly in natural language understanding (NLU). It’s capable of impressive feats, such as explaining jokes, giving detailed step-by-step answers to questions, and even coding. 
- PaLM’s strength is due in part to the model’s size, but also to a technique called *chain of thought prompting*, which was introduced a couple of months earlier by another team of Google researchers in [“Chain of Thought Prompting Elicits Reasoning in Large Language Models”](https://homl.info/ctp) by Jason Wei et al.

  In question answering tasks, regular prompting typically includes a few examples of questions and answers, such as: “Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: 11.” The prompt then continues with the actual question, such as “Q: John takes care of 10 dogs. Each dog takes .5 hours a day to walk and take care of their business. How many hours a week does he spend taking care of dogs? A:”, and the model’s job is to append the answer: in this case, “35.”

  But with chain of thought prompting, the example answers include all the reasoning steps that lead to the conclusion. For example, instead of “A: 11”, the prompt contains “A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.” This encourages the model to give a detailed answer to the actual question, such as “John takes care of 10 dogs. Each dog takes .5 hours a day to walk and take care of their business. So that is 10 $\times$ .5 = 5 hours a day. 5 hours a day $\times$ 7 days a week = 35 hours a week. The answer is 35 hours a week.” This is an actual example from the paper!

  Not only does the model give the right answer much more frequently than with regular prompting (we’re encouraging the model to think things through), but it also provides all the reasoning steps, which can be useful to better understand the rationale behind a model’s answer.
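
The difference between regular and chain of thought prompting is only in how the prompt text is assembled. The following sketch builds both prompts from the example quoted above; `build_prompt` is a hypothetical helper, and the exact layout (blank lines between examples, a trailing “A:”) is an assumption rather than the format used in the paper.

```python
def build_prompt(examples, question):
    # Each example is a (question, answer) pair shown before the real question
    parts = [f'Q: {q}\nA: {a}' for q, a in examples]
    parts.append(f'Q: {question}\nA:')
    return '\n\n'.join(parts)


example_question = (
    'Roger has 5 tennis balls. He buys 2 more cans of tennis balls. '
    'Each can has 3 tennis balls. How many tennis balls does he have now?'
)
question = (
    'John takes care of 10 dogs. Each dog takes .5 hours a day to walk and '
    'take care of their business. How many hours a week does he spend '
    'taking care of dogs?'
)

# Regular prompting: the example answer is just the final result
regular_prompt = build_prompt([(example_question, '11.')], question)
# Chain of thought prompting: the example answer spells out the reasoning
cot_prompt = build_prompt(
    [(
        example_question,
        'Roger started with 5 balls. 2 cans of 3 tennis balls each is '
        '6 tennis balls. 5 + 6 = 11.',
    )],
    question,
)
```

Either string would then be passed to a language model, whose job is to continue the text after the final “A:”.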

## Vision Transformers
- One of the first applications of attention mechanisms beyond NMT was in generating image captions using visual attention proposed in [“Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”](https://homl.info/visualattention) by Kelvin Xu et al. in 2015: a convolutional neural network first processes the image and outputs some feature maps, then a decoder RNN equipped with an attention mechanism generates the caption, one word at a time. 
  
  At each decoder time step (i.e., each word), the decoder uses the attention model to focus on just the right part of the image.
  
  <center>
    <img 
      src="../images/16/visual_attention.png" 
      onerror="
        this.onerror = null;
        const repo = 'https://github.com/alirezatheh/handson-ml3-notes/blob/main';
        this.src = repo + this.src.split('..')[1];
      "
    >
  </center>
  
  Here the model generated the caption “A woman is throwing a frisbee in a park”, and we can see what part of the input image the decoder focused its attention on when it was about to output the word “frisbee”.

<div style="border: 1px solid;">

### Explainability
One extra benefit of attention mechanisms is that they make it easier to understand what led the model to produce its output. This is called *explainability*. It can be especially useful when the model makes a mistake: e.g. if an image of a dog walking in the snow is labeled as “a wolf walking in the snow”, then we can go back and check what the model focused on when it output the word “wolf”. We may find that it was paying attention not only to the dog, but also to the snow, hinting at a possible explanation: perhaps the way the model learned to distinguish dogs from wolves is by checking whether or not there’s a lot of snow around. We can then fix this by training the model with more images of wolves without snow, and dogs with snow. This example comes from a great 2016 paper [“‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier”](https://homl.info/explainclass) by Marco Tulio Ribeiro et al. that uses a different approach to explainability: learning an interpretable model locally around a classifier’s prediction.

In some applications, explainability is not just a tool to debug a model; it can be a legal requirement: think of a system deciding whether or not it should grant us a loan.
</div>

- At first, transformers were used alongside CNNs, without replacing them. Instead, transformers were generally used to replace RNNs. Transformers became slightly more visual in a 2020 paper [“End-to-End Object Detection with Transformers”](https://homl.info/detr) by Nicolas Carion et al. at Facebook, which proposed a hybrid CNN–transformer architecture for object detection. The CNN first processes the input images and outputs a set of feature maps, then these feature maps are converted to sequences and fed to a transformer, which outputs bounding box predictions.
- In October 2020, a team of Google researchers released a paper [“An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale”](https://homl.info/vit) by Alexey Dosovitskiy et al. that introduced a fully transformer-based vision model, called a *vision transformer* (ViT). They chopped the image into little 16 $\times$ 16 squares, and treated the sequence of squares as if it were a sequence of word representations. The squares are first flattened into 16 $\times$ 16 $\times$ 3 = 768-dimensional vectors, then these vectors go through a linear layer that transforms them but retains their dimensionality. The resulting sequence of vectors can then be treated just like a sequence of word embeddings: this means adding positional embeddings, and passing the result to the transformer. This model beat the state of the art on ImageNet image classification, but the authors had to use over 300 million additional images for training. This makes sense since transformers don’t have as many *inductive biases* as convolutional neural nets, so they need extra data just to learn things that CNNs implicitly assume.

**Note**: An inductive bias is an implicit assumption made by the model, due to its architecture. Linear models implicitly assume that the data is linear. CNNs implicitly assume that patterns learned in one location will likely be useful in other locations. RNNs implicitly assume that the inputs are ordered, and that recent tokens are more important than older ones. The more inductive biases a model has, assuming they are correct, the less training data the model will require.
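
The ViT input pipeline (chop into 16 $\times$ 16 squares, flatten, apply a linear layer, add positional embeddings) can be sketched with NumPy reshapes. The 224 $\times$ 224 image size, the random image, and the random weights below are assumptions for illustration; a real ViT learns `W` and the positional embeddings, and other details of the model are left out.

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.random((224, 224, 3))  # hypothetical 224 x 224 RGB image
patch_size = 16
n_rows = image.shape[0] // patch_size
n_cols = image.shape[1] // patch_size

# Split height and width into (grid index, offset within patch) pairs,
# bring the two grid axes together, then flatten each patch
patches = image.reshape(n_rows, patch_size, n_cols, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4)
patches = patches.reshape(n_rows * n_cols, patch_size * patch_size * 3)

# Linear layer that keeps the dimensionality (random weights here)
W = rng.normal(size=(768, 768)) * 0.02
# Stand-in for learned positional embeddings, one per patch
pos_embed = rng.normal(size=(n_rows * n_cols, 768)) * 0.02
tokens = patches @ W + pos_embed
```

With these sizes the image becomes a sequence of 196 tokens of dimension 768, which a transformer can process like a 196-word sentence.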

- Just two months later, a team of Facebook researchers led by Hugo Touvron released a paper [“Training Data-Efficient Image Transformers & Distillation Through Attention”](https://homl.info/deit) that introduced *data-efficient image transformers* (DeiTs). Their model achieved competitive results on ImageNet without requiring any additional data for training. The model’s architecture is virtually the same as the original ViT, but the authors used a distillation technique to transfer knowledge from state-of-the-art CNN models to their model.
- In March 2021, DeepMind released an important paper [“Perceiver: General Perception with Iterative Attention”](https://homl.info/perceiver) by Andrew Jaegle et al., that introduced the *Perceiver* architecture.
  - It is a *multimodal* transformer, meaning we can feed it text, images, audio, or virtually any other modality. Until then, transformers had been restricted to fairly short sequences because of the performance and RAM bottleneck in the attention layers. This excluded modalities such as audio or video, and it forced researchers to treat images as sequences of patches, rather than sequences of pixels. 
  - The bottleneck is due to self-attention, where every token must attend to every other token: if the input sequence has $M$ tokens, then the attention layer must compute an $M\times M$ matrix, which can be huge if $M$ is very large.
  - The Perceiver solves this problem by gradually improving a fairly short *latent representation* of the inputs, composed of $N$ tokens, typically just a few hundred. (The word *latent* means hidden, or internal.)
  - The model uses cross-attention layers only, feeding them the latent representation as the queries, and the (possibly large) inputs as the values. This only requires computing an $M\times N$ matrix, so the computational complexity is linear with regard to $M$, instead of quadratic. 
  - After going through several cross-attention layers, the latent representation ends up capturing everything that matters in the inputs. 
  - The authors also suggested sharing the weights between consecutive cross-attention layers: if we do that, then the Perceiver effectively becomes an RNN.
    - The shared cross-attention layers can be seen as the same memory cell at different time steps
    - And the latent representation corresponds to the cell’s context vector.
    - The same inputs are repeatedly fed to the memory cell at every time step.
    
    It looks like RNNs are not dead after all!
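
The cost argument and the RNN analogy can be illustrated with a stripped-down NumPy sketch. It is not the Perceiver architecture itself: there are no learned projections, the sizes `M`, `N`, and `d` are arbitrary, and `cross_attention` is a hypothetical helper. The score matrix here has shape [$N$, $M$] (one row per latent query), which has the same number of entries as the $M\times N$ matrix mentioned above.

```python
import numpy as np


def cross_attention(latents, inputs):
    # Queries come from the latents, keys and values from the inputs
    # (no learned projections in this sketch)
    scores = latents @ inputs.T / np.sqrt(latents.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ inputs, weights


rng = np.random.default_rng(42)
M, N, d = 10_000, 256, 32  # many input tokens, few latent tokens
inputs = rng.normal(size=(M, d))
latents = rng.normal(size=(N, d))

# Reusing the same function at every step mimics weight sharing: the
# latents play the role of an RNN state, and the same inputs are fed
# again at each step
for step in range(3):
    latents, weights = cross_attention(latents, inputs)

n_scores_cross = weights.size  # N * M
n_scores_self = M * M  # what self-attention over the inputs would need
```

With these sizes, cross-attention computes about 2.6 million scores per layer, versus 100 million for self-attention over the raw inputs.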
- Just a month later, Mathilde Caron et al. introduced DINO in [“Emerging Properties in Self-Supervised Vision Transformers”](https://homl.info/dino), an impressive vision transformer trained entirely without labels, using self-supervision, and capable of high-accuracy semantic segmentation. 
  - The model is duplicated during training, with one network acting as a teacher and the other acting as a student. 
  - Gradient descent only affects the student, while the teacher’s weights are just an exponential moving average of the student’s weights.
  - The student is trained to match the teacher’s predictions: since they’re almost the same model, this is called *self-distillation*.
  - At each training step, the input images are augmented in different ways for the teacher and the student, so they don’t see the exact same image, but their predictions must match. This forces them to come up with high-level representations.
  - To prevent *mode collapse*, where both the student and the teacher would always output the same thing, completely ignoring the inputs, DINO keeps track of a moving average of the teacher’s outputs, and it tweaks the teacher’s predictions to ensure that they remain centered on zero, on average.
  - DINO also forces the teacher to have high confidence in its predictions: this is called *sharpening*.
  
  Together, these techniques preserve diversity in the teacher’s outputs.
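
Three of the ingredients listed above (the moving-average teacher, centering, and sharpening) are simple enough to sketch in NumPy. This is a loose illustration rather than DINO's training code: the momentum values, the temperature, and all function names are assumptions, and the weights and logits are random.

```python
import numpy as np


def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ema_update(teacher_weights, student_weights, momentum=0.99):
    # The teacher is an exponential moving average of the student
    return momentum * teacher_weights + (1 - momentum) * student_weights


def teacher_targets(teacher_logits, center, temperature=0.05):
    # Centering: subtract the running mean of the teacher's outputs.
    # Sharpening: a low temperature makes the distribution more peaked.
    return softmax(teacher_logits - center, temperature)


rng = np.random.default_rng(42)
teacher_w = rng.normal(size=(4,))
student_w = rng.normal(size=(4,))
new_teacher_w = ema_update(teacher_w, student_w)

teacher_logits = rng.normal(size=(8, 5))  # batch of 8, 5 output dimensions
center = np.zeros(5)
# Update the running center with the batch mean of the teacher's outputs
center = 0.9 * center + 0.1 * teacher_logits.mean(axis=0)

sharp = teacher_targets(teacher_logits, center)
soft = softmax(teacher_logits - center, temperature=1.0)
```

Comparing `sharp` with `soft` shows the effect of sharpening: each row of `sharp` puts at least as much probability on its top entry as the corresponding row of `soft`.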
- In a 2021 paper [“Scaling Vision Transformers”](https://homl.info/scalingvits) by Xiaohua Zhai et al., Google researchers showed how to scale ViTs up or down, depending on the amount of data. They managed to create a huge 2 billion parameter model that reached over 90.4% top-1 accuracy on ImageNet. Conversely, they also trained a scaled-down model that reached over 84.8% top-1 accuracy on ImageNet, using only 10,000 images: that’s just 10 images per class!
- In March 2022, a paper [“Model Soups: Averaging Weights of Multiple Fine-tuned Models Improves Accuracy Without Increasing Inference Time”](https://homl.info/modelsoups) by Mitchell Wortsman et al. demonstrated that it’s possible to first train multiple transformers, then average their weights to create a new and improved model. This is similar to an ensemble, except there’s just one model in the end, which means there’s no inference time penalty.
- In OpenAI’s 2021 CLIP paper [“Learning Transferable Visual Models From Natural Language Supervision”](https://homl.info/clip), Alec Radford et al. proposed a large transformer model pretrained to match captions with images: this task allows it to learn excellent image representations, and the model can then be used directly for tasks such as image classification using simple text prompts such as “a photo of a cat”.
- Soon after, OpenAI announced DALL·E in the paper [“Zero-Shot Text-to-Image Generation”](https://homl.info/dalle), a model capable of generating amazing images based on text prompts, and then DALL·E 2 in the paper [“Hierarchical Text-Conditional Image Generation with CLIP Latents”](https://homl.info/dalle2), which generates even higher-quality images using a diffusion model. Both papers are by Aditya Ramesh et al.
- In April 2022, DeepMind released the Flamingo paper [“Flamingo: a Visual Language Model for Few-Shot Learning”](https://homl.info/flamingo) by Jean-Baptiste Alayrac et al., which introduced a family of models pretrained on a wide variety of tasks across multiple modalities, including text, images, and videos. A single model can be used across very different tasks, such as question answering, image captioning, and more.
- Soon after, in May 2022, DeepMind introduced GATO in the paper [“A Generalist Agent”](https://homl.info/gato) by Scott Reed et al. GATO is a multimodal model that can be used as a policy for a reinforcement learning agent. The same transformer can chat with us, caption images, play Atari games, control (simulated) robotic arms, and more, all with “only” 1.2 billion parameters.
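
The DINO item above describes three mechanisms: a teacher whose weights are an exponential moving average (EMA) of the student’s weights, centering of the teacher’s outputs, and sharpening. A minimal NumPy sketch of these three steps could look as follows. It is only an illustration under simplifying assumptions: the momentum values, the temperature, and the helper names (`ema_update`, `teacher_probas`) are hypothetical choices, not taken from the paper’s code.

```python
import numpy as np


def softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # For numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ema_update(
    teacher_w: np.ndarray, student_w: np.ndarray, momentum: float = 0.99
) -> np.ndarray:
    # The teacher is never trained directly: it slowly tracks the student
    return momentum * teacher_w + (1 - momentum) * student_w


def teacher_probas(
    teacher_logits: np.ndarray, center: np.ndarray, temperature: float = 0.04
) -> np.ndarray:
    # Centering: subtract a moving average of the teacher's outputs
    # Sharpening: a low temperature yields confident (peaked) predictions
    return softmax(teacher_logits - center, temperature)


rng = np.random.default_rng(42)

# Illustrative weights (assumption: one flat vector per network)
student_w = rng.normal(size=4)
teacher_w = ema_update(np.zeros(4), student_w)

# Illustrative teacher outputs: batch of 8 images, 5 output dimensions
teacher_logits = rng.normal(size=(8, 5))
center = 0.9 * np.zeros(5) + 0.1 * teacher_logits.mean(axis=0)
targets = teacher_probas(teacher_logits, center)

# Compare the average confidence with and without sharpening
sharp_confidence = targets.max(axis=-1).mean()
soft_confidence = softmax(teacher_logits - center, 1.0).max(axis=-1).mean()
```

In this sketch, `targets` plays the role of the teacher’s predictions that the student would be trained to match, and `sharp_confidence` is larger than `soft_confidence` because of the low temperature.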

## Hugging Face’s Transformers Library
The Transformers library allows us to easily download a pretrained model, including its corresponding tokenizer, and then fine-tune it on our own dataset, if needed. Plus, the library supports TensorFlow, PyTorch, and JAX.

The simplest way to use the Transformers library is to use the `transformers.pipeline()` function: we just specify which task we want, such as sentiment analysis, and it downloads a default pretrained model:

First, let’s install the Transformers library if we’re running on Colab:

In [83]:
if 'google.colab' in sys.modules:
    %pip install -q -U transformers

In [84]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
result = classifier('The actors were very convincing.')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


Models can be very biased. For example, a model may like or dislike some countries depending on the data it was trained on and how it is used, so use it with care:

In [85]:
classifier(['I am from India.', 'I am from Iraq.'])

[{'label': 'POSITIVE', 'score': 0.9896161556243896},
 {'label': 'NEGATIVE', 'score': 0.9811071157455444}]

For text classification tasks such as sentiment analysis, at the time of writing, the pipeline defaults to `distilbert-base-uncased-finetuned-sst-2-english`, a DistilBERT model with an uncased tokenizer, trained on English Wikipedia and a corpus of English books, and fine-tuned on the Stanford Sentiment Treebank v2 (SST-2) task. We could instead use a DistilBERT model fine-tuned on the Multi-Genre Natural Language Inference (MultiNLI) task, which classifies two sentences into three classes: contradiction, neutral, or entailment. Here is how:

In [86]:
model_name = 'huggingface/distilbert-base-uncased-finetuned-mnli'
classifier_mnli = pipeline('text-classification', model=model_name)
classifier_mnli('She loves me. [SEP] She loves me not.')

Some layers from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'contradiction', 'score': 0.9790192246437073}]

**Tip**: We can find the available models at https://huggingface.co/models, and the list of tasks at https://huggingface.co/tasks.

Let’s load the same DistilBERT model, along with its corresponding tokenizer, using the `TFAutoModelForSequenceClassification` and `AutoTokenizer` classes:

In [87]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

Some layers from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [88]:
token_ids = tokenizer(
    [
        'I like soccer. [SEP] We all love soccer!',
        'Joe lived for a very long time. [SEP] Joe is old.',
    ],
    padding=True,
    return_tensors='tf',
)
token_ids

{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[ 101, 1045, 2066, 4715, 1012,  102, 2057, 2035, 2293, 4715,  999,
         102,    0,    0,    0],
       [ 101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012,  102, 3533,
        2003, 2214, 1012,  102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

The output is a dictionary-like instance of the `BatchEncoding` class, which contains the sequences of token IDs, as well as a mask containing 0s for the padding tokens.
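
To make the padding and the mask concrete, here is a minimal NumPy sketch that rebuilds the two tensors shown above from unpadded lists of token IDs. The helper name `pad_batch` is hypothetical, and the sketch assumes that the padding token ID is 0, as in the output above; it is not how the tokenizer is actually implemented.

```python
import numpy as np

# Token IDs copied from the tokenizer output above, without the padding
short_ids = [101, 1045, 2066, 4715, 1012, 102, 2057, 2035, 2293, 4715, 999,
             102]
long_ids = [101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012, 102, 3533,
            2003, 2214, 1012, 102]


def pad_batch(
    sequences: list[list[int]], pad_id: int = 0
) -> tuple[np.ndarray, np.ndarray]:
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        input_ids[row, : len(seq)] = seq  # Real tokens go first
        attention_mask[row, : len(seq)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([short_ids, long_ids])
```

The first row of `attention_mask` ends with three 0s, which tells the model to ignore the three padding tokens appended to the shorter sequence.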

**Tip**: Instead of passing `'Sentence 1 [SEP] Sentence 2'` to the tokenizer, you can equivalently pass it a tuple: `('Sentence 1', 'Sentence 2')`.

In [89]:
token_ids = tokenizer(
    [
        ('I like soccer.', 'We all love soccer!'),
        ('Joe lived for a very long time.', 'Joe is old.'),
    ],
    padding=True,
    return_tensors='tf',
)
token_ids

{'input_ids': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[ 101, 1045, 2066, 4715, 1012,  102, 2057, 2035, 2293, 4715,  999,
         102,    0,    0,    0],
       [ 101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012,  102, 3533,
        2003, 2214, 1012,  102]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 15), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}

If we set `return_token_type_ids=True` when calling the tokenizer, we will also get an extra tensor that indicates which sentence each token belongs to. This is needed by some models.

In [90]:
outputs = model(token_ids)
outputs

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-2.1123817 ,  1.1786783 ,  1.4101017 ],
       [-0.01478387,  1.0962474 , -0.9919954 ]], dtype=float32)>, hidden_states=None, attentions=None)

In [91]:
Y_probas = keras.activations.softmax(outputs.logits)
Y_probas

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.01619702, 0.43523544, 0.5485676 ],
       [0.22655967, 0.6881726 , 0.0852678 ]], dtype=float32)>

In [92]:
Y_pred = tf.argmax(Y_probas, axis=1)
# 0 = contradiction, 1 = entailment, 2 = neutral
Y_pred

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([2, 1])>
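
The last two steps (softmax, then argmax) can be checked by hand. The following NumPy sketch recomputes them from the logits printed above; it only assumes that those logits are copied correctly.

```python
import numpy as np

# Logits copied from the model output above
logits = np.array(
    [
        [-2.1123817, 1.1786783, 1.4101017],
        [-0.01478387, 1.0962474, -0.9919954],
    ]
)

# Softmax: exponentiate (after subtracting the row maximum for numerical
# stability), then normalize each row so that it sums to 1
shifted = logits - logits.max(axis=1, keepdims=True)
exp = np.exp(shifted)
probas = exp / exp.sum(axis=1, keepdims=True)

# The predicted class is the index of the largest probability in each row
pred = probas.argmax(axis=1)  # array([2, 1])
```

Note that the argmax of the probabilities is the same as the argmax of the logits, since softmax preserves the ordering within each row.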

We can fine-tune this model on our own dataset, and train the model as usual with Keras since it’s just a regular Keras model with a few extra methods. Because the model outputs logits instead of probabilities, we must use the `keras.losses.SparseCategoricalCrossentropy(from_logits=True)` loss instead of the usual `'sparse_categorical_crossentropy'` loss. Moreover, the model does not support `BatchEncoding` inputs during training, so we must use its `data` attribute to get a regular dictionary instead:

In [93]:
sentences = [('Sky is blue', 'Sky is red'), ('I love her', 'She loves me')]
X_train = tokenizer(sentences, padding=True, return_tensors='tf').data
# Contradiction, Neutral
y_train = tf.constant([0, 2])
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer='nadam', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=2)

Epoch 1/2
Epoch 2/2
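
To see why `from_logits=True` matters, here is a NumPy sketch of the underlying math (an illustration only, not Keras’s actual implementation; the function names and the example logits are made up). The cross-entropy computed directly from logits matches the one computed from softmax probabilities, whereas treating logits as if they were probabilities gives meaningless values.

```python
import numpy as np


def sparse_ce_from_logits(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # Log-softmax computed with the log-sum-exp trick
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probas = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probas[np.arange(len(labels)), labels]


def sparse_ce_from_probas(probas: np.ndarray, labels: np.ndarray) -> np.ndarray:
    return -np.log(probas[np.arange(len(labels)), labels])


# Made-up logits for two instances and three classes
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

exp = np.exp(logits)
probas = exp / exp.sum(axis=1, keepdims=True)

loss_from_logits = sparse_ce_from_logits(logits, labels)
loss_from_probas = sparse_ce_from_probas(probas, labels)
# Mistake: feeding logits where probabilities are expected
loss_mistaken = sparse_ce_from_probas(logits, labels)
```

Here `loss_from_logits` and `loss_from_probas` agree, while `loss_mistaken` is negative, which is impossible for a cross-entropy loss.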


Hugging Face has also built a Datasets library that we can use to easily download a standard dataset (such as IMDb) or a custom one, and use it to fine-tune our model. It’s similar to TensorFlow Datasets, but it also provides tools to perform common preprocessing tasks on the fly, such as masking. The list of datasets is available at https://huggingface.co/datasets.

## Exercises

### 1. to 8.
1. What are the pros and cons of using a stateful RNN versus a stateless RNN?
> Stateless RNNs can only capture patterns whose length is less than, or equal to, the size of the windows the RNN is trained on. Conversely, stateful RNNs can capture longer-term patterns. However, implementing a stateful RNN is much harder, especially preparing the dataset properly. Moreover, stateful RNNs do not always work better, in part because consecutive batches are not independent and identically distributed (IID). Gradient descent is not fond of non-IID datasets.
2. Why do people use encoder–decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?
> In general, if we translate a sentence one word at a time, the result will be terrible. For example, the French sentence “Je vous en prie” means “You are welcome”, but if we translate it one word at a time, we get “I you in pray”. Huh? It is much better to read the whole sentence first and then translate it. A plain sequence-to-sequence RNN would start translating a sentence immediately after reading the first word, while an encoder–decoder RNN will first read the whole sentence and then translate it. That said, one could imagine a plain sequence-to-sequence RNN that would output silence whenever it is unsure about what to say next (just like human translators do when they must translate a live broadcast).
3. How can we deal with variable-length input sequences? What about variable-length output sequences?
> Variable-length input sequences can be handled by padding the shorter sequences so that all sequences in a batch have the same length, and using masking to ensure the RNN ignores the padding token. For better performance, we may also want to create batches containing sequences of similar sizes. Ragged tensors can hold sequences of variable lengths, and Keras now supports them, which simplifies handling variable-length input sequences (at the time of this writing, it still does not handle ragged tensors as targets on the GPU, though). Regarding variable-length output sequences, if the length of the output sequence is known in advance (e.g., if we know that it is the same as the input sequence), then we just need to configure the loss function so that it ignores tokens that come after the end of the sequence. Similarly, the code that will use the model should ignore tokens beyond the end of the sequence. But generally the length of the output sequence is not known ahead of time, so the solution is to train the model so that it outputs an end-of-sequence token at the end of each sequence.
4. What is beam search, and why would we use it? What tool can we use to implement it?
> Beam search is a technique used to improve the performance of a trained encoder–decoder model, for example in a neural machine translation system. The algorithm keeps track of a short list of the *k* most promising output sentences (say, the top three), and at each decoder step it tries to extend them by one word; then it keeps only the *k* most likely sentences. The parameter *k* is called the *beam width*: the larger it is, the more CPU and RAM will be used, but also the more accurate the system will be. Instead of greedily choosing the most likely next word at each step to extend a single sentence, this technique allows the system to explore several promising sentences simultaneously. Moreover, this technique lends itself well to parallelization. We can implement beam search by writing a custom memory cell. Alternatively, TensorFlow Addons’ seq2seq API provides an implementation.
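
> The idea can be illustrated without any neural network. The sketch below runs beam search over a made-up table of next-token probabilities (`NEXT_PROBAS` is a hypothetical toy “model” that only looks at the previous token; a real decoder would condition on the whole prefix and on the encoder’s output). With a beam width of 1 the search is greedy and settles for a sequence of probability 0.30, while a beam width of 2 finds a better sequence of probability 0.36.

```python
import math

# Hypothetical toy model: P(next token | previous token)
NEXT_PROBAS = {
    '<sos>': {'a': 0.6, 'b': 0.4},
    'a': {'a': 0.1, 'b': 0.4, '<eos>': 0.5},
    'b': {'a': 0.05, 'b': 0.05, '<eos>': 0.9},
}


def beam_search(
    beam_width: int, max_steps: int = 5
) -> list[tuple[float, list[str]]]:
    beams = [(0.0, ['<sos>'])]  # (log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for log_p, tokens in beams:
            if tokens[-1] == '<eos>':
                candidates.append((log_p, tokens))  # Finished: keep as is
                continue
            # Try to extend this sequence by one token
            for token, p in NEXT_PROBAS[tokens[-1]].items():
                candidates.append((log_p + math.log(p), tokens + [token]))
        # Keep only the `beam_width` most likely sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens[-1] == '<eos>' for _, tokens in beams):
            break
    return beams


greedy_log_p, greedy_tokens = beam_search(beam_width=1)[0]
best_log_p, best_tokens = beam_search(beam_width=2)[0]
```
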
5. What is an attention mechanism? How does it help?
> An attention mechanism is a technique initially used in encoder–decoder models to give the decoder more direct access to the input sequence, allowing it to deal with longer input sequences. At each decoder time step, the decoder’s current state and the full output of the encoder are processed by an alignment model that outputs an alignment score for each input time step. This score indicates which part of the input is most relevant to the current decoder time step. The weighted sum of the encoder outputs (weighted by their alignment scores) is then fed to the decoder, which produces the next decoder state and the output for this time step. The main benefit of using an attention mechanism is the fact that the encoder–decoder model can successfully process longer input sequences. Another benefit is that the alignment scores make the model easier to debug and interpret: for example, if the model makes a mistake, we can look at which part of the input it was paying attention to, and this can help diagnose the issue. An attention mechanism is also at the core of the transformer architecture, in the multi-head attention layers. See the next answer.
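
> A single decoder time step of such a mechanism can be sketched in a few lines of NumPy. This sketch assumes a simple dot product as the alignment model (one possible choice among several), and uses random arrays in place of real encoder outputs and decoder states.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for real activations: 6 input time steps, 8 units each
encoder_outputs = rng.normal(size=(6, 8))
decoder_state = rng.normal(size=8)

# Alignment model (assumption: dot product): one score per input time step
scores = encoder_outputs @ decoder_state

# Softmax turns the scores into weights that sum to 1
exp = np.exp(scores - scores.max())
weights = exp / exp.sum()

# Context vector: weighted sum of the encoder outputs, fed to the decoder
context = weights @ encoder_outputs

# The input time step the decoder is paying most attention to
most_relevant_step = int(weights.argmax())
```

> Inspecting `weights` is what makes attention models easier to interpret: it shows how much each input time step contributes to the current output.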
6. What is the most important layer in the transformer architecture? What is its purpose?
> The multi-head attention layer (the original transformer architecture contains 18 of them, including 6 masked multi-head attention layers). It is at the core of language models such as BERT and GPT-2. Its purpose is to allow the model to identify which words are most aligned with each other, and then improve each word’s representation using these contextual clues.
7. When would we need to use sampled softmax?
> Sampled softmax is used when training a classification model when there are many classes (e.g., thousands). It computes an approximation of the cross-entropy loss based on the logit predicted by the model for the correct class, and the predicted logits for a sample of incorrect words. This speeds up training considerably compared to computing the softmax over all logits and then estimating the cross-entropy loss. After training, the model can be used normally, using the regular softmax function to compute all the class probabilities based on all the logits.
8. *Embedded Reber grammars* were used by Hochreiter and Schmidhuber in [their paper](https://homl.info/93) about LSTMs. They are artificial grammars that produce strings such as “BPBTSXXVPSEPE”. Check out Jenny Orr’s [nice introduction](https://homl.info/108) to this topic, then choose a particular embedded Reber grammar (such as the one represented on Jenny Orr’s page), then train an RNN to identify whether a string respects that grammar or not. We will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar, and 50% that don’t.
> First we need to build a function that generates strings based on a grammar. The grammar will be represented as a list of possible transitions for each state. A transition specifies the string to output (or a grammar to generate it) and the next state.

In [94]:
default_reber_grammar = [
    # (state 0) =B=>(state 1)
    [('B', 1)],
    # (state 1) =T=>(state 2) or =P=>(state 3)
    [('T', 2), ('P', 3)],
    # (state 2) =S=>(state 2) or =X=>(state 4)
    [('S', 2), ('X', 4)],
    # and so on...
    [('T', 3), ('V', 5)],
    [('X', 3), ('S', 6)],
    [('P', 4), ('V', 6)],
    # (state 6) =E=>(terminal state)
    [('E', None)],
]

embedded_reber_grammar = [
    [('B', 1)],
    [('T', 2), ('P', 3)],
    [(default_reber_grammar, 4)],
    [(default_reber_grammar, 5)],
    [('T', 6)],
    [('P', 6)],
    [('E', None)],
]

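from typing import Optional, Union

# Recursive alias: a production is either a string or a nested grammar.
# Union is used because `str | 'Grammer'` raises a TypeError at runtime.
Grammer = list[list[tuple[Union[str, 'Grammer'], Optional[int]]]]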


def generate_string(grammar: Grammer) -> str:
    state = 0
    output = []
    while state is not None:
        index = np.random.randint(len(grammar[state]))
        production, state = grammar[state][index]
        if isinstance(production, list):
            production = generate_string(grammar=production)
        output.append(production)
    return ''.join(output)

> Let’s generate a few strings based on the default Reber grammar:

In [95]:
np.random.seed(42)

for _ in range(25):
    print(generate_string(default_reber_grammar), end=' ')

BTXXTTVPXTVPXTTVPSE BPVPSE BTXSE BPVVE BPVVE BTSXSE BPTVPXTTTVVE BPVVE BTXSE BTXXVPSE BPTTTTTTTTVVE BTXSE BPVPSE BTXSE BPTVPSE BTXXTVPSE BPVVE BPVVE BPVVE BPTTVVE BPVVE BPVVE BTXXVVE BTXXVVE BTXXVPXVVE 

> Looks good. Now let’s generate a few strings based on the embedded Reber grammar:

In [96]:
np.random.seed(42)

for _ in range(25):
    print(generate_string(embedded_reber_grammar), end=' ')

BTBPTTTVPXTVPXTTVPSETE BPBPTVPSEPE BPBPVVEPE BPBPVPXVVEPE BPBTXXTTTTVVEPE BPBPVPSEPE BPBTXXVPSEPE BPBTSSSSSSSXSEPE BTBPVVETE BPBTXXVVEPE BPBTXXVPSEPE BTBTXXVVETE BPBPVVEPE BPBPVVEPE BPBTSXSEPE BPBPVVEPE BPBPTVPSEPE BPBTXXVVEPE BTBPTVPXVVETE BTBPVVETE BTBTSSSSSSSXXVVETE BPBTSSSXXTTTTVPSEPE BTBPTTVVETE BPBTXXTVVEPE BTBTXSETE 

> Okay, now we need a function to generate strings that do not respect the grammar. We could generate a random string, but the task would be a bit too easy, so instead we will generate a string that respects the grammar, and we will corrupt it by changing just one character:

In [97]:
POSSIBLE_CHARS = 'BEPSTVX'


def generate_corrupted_string(
    grammar: Grammer, chars: str = POSSIBLE_CHARS
) -> str:
    good_string = generate_string(grammar)
    index = np.random.randint(len(good_string))
    good_char = good_string[index]
    bad_char = np.random.choice(sorted(set(chars) - set(good_char)))
    return good_string[:index] + bad_char + good_string[index + 1 :]

> Let’s look at a few corrupted strings:

In [98]:
np.random.seed(42)

for _ in range(25):
    print(generate_corrupted_string(embedded_reber_grammar), end=' ')

BTBPTTTPPXTVPXTTVPSETE BPBTXEEPE BPBPTVVVEPE BPBTSSSSXSETE BPTTXSEPE BTBPVPXTTTTTTEVETE BPBTXXSVEPE BSBPTTVPSETE BPBXVVEPE BEBTXSETE BPBPVPSXPE BTBPVVVETE BPBTSXSETE BPBPTTTPTTTTTVPSEPE BTBTXXTTSTVPSETE BBBTXSETE BPBTPXSEPE BPBPVPXTTTTVPXTVPXVPXTTTVVEVE BTBXXXTVPSETE BEBTSSSSSXXVPXTVVETE BTBXTTVVETE BPBTXSTPE BTBTXXTTTVPSBTE BTBTXSETX BTBTSXSSTE 
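
> As a sanity check, it can be useful to verify mechanically whether a given string respects a grammar. The sketch below is one possible way to do it, using the same grammar representation as above (the grammars are repeated so that the block is self-contained). The helpers `match_ends` and `respects_grammar` are hypothetical names that are not used elsewhere in this notebook.

```python
default_reber_grammar = [
    [('B', 1)],
    [('T', 2), ('P', 3)],
    [('S', 2), ('X', 4)],
    [('T', 3), ('V', 5)],
    [('X', 3), ('S', 6)],
    [('P', 4), ('V', 6)],
    [('E', None)],
]

embedded_reber_grammar = [
    [('B', 1)],
    [('T', 2), ('P', 3)],
    [(default_reber_grammar, 4)],
    [(default_reber_grammar, 5)],
    [('T', 6)],
    [('P', 6)],
    [('E', None)],
]


def match_ends(grammar: list, s: str, pos: int = 0, state: int = 0) -> set[int]:
    """Returns every position at which the grammar can terminate."""
    ends = set()
    for production, next_state in grammar[state]:
        if isinstance(production, list):
            # Nested grammar: it may terminate at several positions
            after = match_ends(production, s, pos)
        elif s.startswith(production, pos):
            after = {pos + len(production)}
        else:
            after = set()
        for end in after:
            if next_state is None:
                ends.add(end)  # Terminal state reached
            else:
                ends |= match_ends(grammar, s, end, next_state)
    return ends


def respects_grammar(grammar: list, s: str) -> bool:
    # The string is valid if the grammar can terminate exactly at its end
    return len(s) in match_ends(grammar, s)


is_good = respects_grammar(embedded_reber_grammar, 'BPBPVVEPE')
is_bad = respects_grammar(embedded_reber_grammar, 'BPTTXSEPE')
```

> Here `'BPBPVVEPE'` is taken from the good strings generated earlier and `'BPTTXSEPE'` from the corrupted ones, so `is_good` is `True` and `is_bad` is `False`.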

> We cannot feed strings directly to an RNN, so we need to encode them somehow. One option would be to one-hot encode each character. Another option is to use embeddings. Let’s go for the second option (but since there are just a handful of characters, one-hot encoding would probably be a good option as well). For embeddings to work, we need to convert each string into a sequence of character IDs. Let’s write a function for that, using each character’s index in the string of possible characters `'BEPSTVX'`:

In [99]:
def string_to_ids(s: str, chars: str = POSSIBLE_CHARS) -> list[int]:
    return [chars.index(c) for c in s]

In [100]:
string_to_ids('BTTTXXVVETE')

[0, 4, 4, 4, 6, 6, 5, 5, 1, 4, 1]
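
> For comparison, the first option mentioned above (one-hot encoding) could be sketched as follows with NumPy. This is only an illustration of the alternative; the rest of this solution uses embeddings.

```python
import numpy as np

POSSIBLE_CHARS = 'BEPSTVX'

ids = [POSSIBLE_CHARS.index(c) for c in 'BTTTXXVVETE']

# Row i of the identity matrix is the one-hot vector for character ID i
one_hot = np.eye(len(POSSIBLE_CHARS), dtype=np.float32)[ids]
```

> The result has one row per character and one column per possible character (shape `(11, 7)` here), with a single 1 in each row.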

> We can now generate the dataset, with 50% good strings, and 50% bad strings:

In [101]:
def generate_dataset(size: int) -> tuple[tf.RaggedTensor, np.ndarray]:
    good_strings = [
        string_to_ids(generate_string(embedded_reber_grammar))
        for _ in range(size // 2)
    ]
    bad_strings = [
        string_to_ids(generate_corrupted_string(embedded_reber_grammar))
        for _ in range(size - size // 2)
    ]
    all_strings = good_strings + bad_strings
    X = tf.ragged.constant(all_strings, ragged_rank=1)
    y = np.array(
        [[1.0] for _ in range(len(good_strings))]
        + [[0.0] for _ in range(len(bad_strings))]
    )
    return X, y

In [102]:
np.random.seed(42)

X_train, y_train = generate_dataset(10000)
X_valid, y_valid = generate_dataset(2000)

> Let’s take a look at the first training sequence:

In [103]:
X_train[0]

<tf.Tensor: shape=(22,), dtype=int32, numpy=
array([0, 4, 0, 2, 4, 4, 4, 5, 2, 6, 4, 5, 2, 6, 4, 4, 5, 2, 3, 1, 4, 1],
      dtype=int32)>

> What class does it belong to?

In [104]:
y_train[0]

array([1.])

> Perfect! We are ready to create the RNN to identify good strings. We build a simple sequence binary classifier:

In [105]:
np.random.seed(42)
keras.utils.set_random_seed(42)

embedding_size = 5

model = keras.Sequential(
    [
        keras.layers.InputLayer(
            input_shape=[None], dtype=tf.int32, ragged=True
        ),
        keras.layers.Embedding(
            input_dim=len(POSSIBLE_CHARS), output_dim=embedding_size
        ),
        keras.layers.GRU(30),
        keras.layers.Dense(1, activation='sigmoid'),
    ]
)
optimizer = keras.optimizers.SGD(
    learning_rate=0.02, momentum=0.95, nesterov=True
)
model.compile(
    loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy']
)
history = model.fit(
    X_train, y_train, epochs=20, validation_data=(X_valid, y_valid)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


> Now let’s test our RNN on two tricky strings: the first one is bad while the second one is good. They only differ by the second to last character. If the RNN gets this right, it shows that it managed to notice the pattern that the second letter should always be equal to the second to last letter. That requires a fairly long short-term memory (which is the reason why we used a GRU cell).

In [106]:
test_strings = [
    'BPBTSSSSSSSXXTTVPXVPXTTTTTVVETE',
    'BPBTSSSSSSSXXTTVPXVPXTTTTTVVEPE',
]
X_test = tf.ragged.constant(
    [string_to_ids(s) for s in test_strings], ragged_rank=1
)

y_proba = model.predict(X_test)
print()
print('Estimated probability that these are Reber strings:')
for index, string in enumerate(test_strings):
    print(f'{string}: {100 * y_proba[index][0]:.2f}%')


Estimated probability that these are Reber strings:
BPBTSSSSSSSXXTTVPXVPXTTTTTVVETE: 0.02%
BPBTSSSSSSSXXTTVPXVPXTTTTTVVEPE: 99.99%


> Ta-da! It worked fine. The RNN found the correct answers with very high confidence. :)

### 9.
Train an encoder–decoder model that can convert a date string from one format to another (e.g., from “April 22, 2019” to “2019-04-22”).
> Let’s start by creating the dataset. We will use random days between 1000-01-01 and 9999-12-31:

In [107]:
from datetime import date

# Cannot use strftime()’s %B format since it depends on the locale
MONTHS = [
    'January',
    'February',
    'March',
    'April',
    'May',
    'June',
    'July',
    'August',
    'September',
    'October',
    'November',
    'December',
]


def random_dates(n_dates: int) -> tuple[list[str], ...]:
    min_date = date(1000, 1, 1).toordinal()
    max_date = date(9999, 12, 31).toordinal()

    ordinals = np.random.randint(max_date - min_date, size=n_dates) + min_date
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]

    x = [MONTHS[dt.month - 1] + ' ' + dt.strftime('%d, %Y') for dt in dates]
    y = [dt.isoformat() for dt in dates]
    return x, y

> Here are a few random dates, displayed in both the input format and the target format:

In [108]:
np.random.seed(42)

n_dates = 3
x_example, y_example = random_dates(n_dates)
print('{:25s}{:25s}'.format('Input', 'Target'))
print('-' * 50)
for idx in range(n_dates):
    print('{:25s}{:25s}'.format(x_example[idx], y_example[idx]))

Input                    Target                   
--------------------------------------------------
September 20, 7075       7075-09-20               
May 15, 8579             8579-05-15               
January 11, 7103         7103-01-11               


> Let’s get the list of all possible characters in the inputs:

In [109]:
INPUT_CHARS = ''.join(sorted(set(''.join(MONTHS) + '0123456789, ')))
INPUT_CHARS

' ,0123456789ADFJMNOSabceghilmnoprstuvy'

> And here’s the list of possible characters in the outputs:

In [110]:
OUTPUT_CHARS = '0123456789-'

> Let’s write a function to convert a string to a list of character IDs, as we did in the previous exercise:

In [111]:
def date_str_to_ids(date_str: str, chars: str = INPUT_CHARS) -> list[int]:
    return [chars.index(c) for c in date_str]

In [112]:
date_str_to_ids(x_example[0], INPUT_CHARS)

[19, 23, 31, 34, 23, 28, 21, 23, 32, 0, 4, 2, 1, 0, 9, 2, 9, 7]

In [113]:
date_str_to_ids(y_example[0], OUTPUT_CHARS)

[7, 0, 7, 5, 10, 0, 9, 10, 2, 0]

In [114]:
def prepare_date_strs(
    date_strs: list[str], chars: str = INPUT_CHARS
) -> tf.Tensor:
    X_ids = [date_str_to_ids(dt, chars) for dt in date_strs]
    X = tf.ragged.constant(X_ids, ragged_rank=1)
    # Using 0 as the padding token ID
    return (X + 1).to_tensor()


def create_dataset(n_dates: int) -> tuple[tf.Tensor, ...]:
    x, y = random_dates(n_dates)
    return prepare_date_strs(x, INPUT_CHARS), prepare_date_strs(
        y, OUTPUT_CHARS
    )

In [115]:
np.random.seed(42)

X_train, Y_train = create_dataset(10000)
X_valid, Y_valid = create_dataset(2000)
X_test, Y_test = create_dataset(2000)

In [116]:
Y_train[0]

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([ 8,  1,  8,  6, 11,  1, 10, 11,  3,  1], dtype=int32)>
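
> Because ID 0 is reserved for padding, every character ID is shifted by 1. The following plain-Python sketch decodes the target sequence above back into a date string; the helper names `encode` and `decode` are hypothetical and only serve this illustration.

```python
OUTPUT_CHARS = '0123456789-'


def encode(date_str: str, chars: str = OUTPUT_CHARS) -> list[int]:
    # Shift every ID by 1 so that 0 stays free for the padding token
    return [chars.index(c) + 1 for c in date_str]


def decode(ids: list[int], chars: str = OUTPUT_CHARS) -> str:
    # Skip the padding token (ID 0) and undo the shift
    return ''.join(chars[i - 1] for i in ids if i != 0)


# IDs copied from Y_train[0] above
decoded = decode([8, 1, 8, 6, 11, 1, 10, 11, 3, 1])  # '7075-09-20'
```

> This matches the first target date shown earlier (7075-09-20).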

> #### First version: a very basic seq2seq model
> Let’s first try the simplest possible model: we feed in the input sequence, which first goes through the encoder (an embedding layer followed by a single LSTM layer), which outputs a vector, then it goes through a decoder (a single LSTM layer, followed by a dense output layer), which outputs a sequence of vectors, each representing the estimated probabilities for all possible output characters.
>
> Since the decoder expects a sequence as input, we repeat the vector (which is output by the encoder) as many times as the longest possible output sequence.
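>
> The repetition itself is simple. As a rough NumPy illustration (with made-up numbers standing in for the encoder’s output), it turns a batch of vectors into a batch of sequences in which every time step holds the same vector:

```python
import numpy as np

# Made-up encoder output: batch of 2 instances, 3 units each
encoder_output = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
max_output_length = 10

# Insert a time axis, then repeat the vector along it
decoder_input = np.repeat(
    encoder_output[:, np.newaxis, :], max_output_length, axis=1
)
```

> The result has shape `(2, 10, 3)`: one copy of the encoder’s output vector per output time step.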

In [117]:
embedding_size = 32
max_output_length = Y_train.shape[1]

np.random.seed(42)
keras.utils.set_random_seed(42)

encoder = keras.Sequential(
    [
        keras.layers.Embedding(
            input_dim=len(INPUT_CHARS) + 1,
            output_dim=embedding_size,
            input_shape=[None],
        ),
        keras.layers.LSTM(128),
    ]
)

decoder = keras.Sequential(
    [
        keras.layers.LSTM(128, return_sequences=True),
        keras.layers.Dense(len(OUTPUT_CHARS) + 1, activation='softmax'),
    ]
)

model = keras.Sequential(
    [encoder, keras.layers.RepeatVector(max_output_length), decoder]
)

optimizer = keras.optimizers.Nadam()
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy'],
)
history = model.fit(
    X_train, Y_train, epochs=20, validation_data=(X_valid, Y_valid)
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


> Looks great, we reach 100% validation accuracy! Let’s use the model to make some predictions. We will need to be able to convert a sequence of character IDs to a readable string:

In [118]:
def ids_to_date_strs(ids: tf.Tensor, chars: str = OUTPUT_CHARS) -> list[str]:
    return [
        ''.join([('?' + chars)[index] for index in sequence])
        for sequence in ids
    ]

> Now we can use the model to convert some dates:

In [119]:
X_new = prepare_date_strs(['September 17, 2009', 'July 14, 1789'])

In [120]:
ids = model.predict(X_new).argmax(axis=-1)
for date_str in ids_to_date_strs(ids):
    print(date_str)

2009-09-17
1789-07-14


> Perfect! :)
>
> However, since the model was only trained on input strings of length 18 (which is the length of the longest date), it does not perform well if we try to use it to make predictions on shorter sequences:

In [121]:
X_new = prepare_date_strs(['May 02, 2020', 'July 14, 1789'])

In [122]:
ids = model.predict(X_new).argmax(axis=-1)
for date_str in ids_to_date_strs(ids):
    print(date_str)

2020-02-02
1789-01-14


> Oops! We need to ensure that we always pass sequences of the same length as during training, using padding if necessary. Let’s write a little helper function for that:

In [123]:
max_input_length = X_train.shape[1]


def prepare_date_strs_padded(date_strs: list[str]) -> tf.Tensor:
    X = prepare_date_strs(date_strs)
    if X.shape[1] < max_input_length:
        X = tf.pad(X, [[0, 0], [0, max_input_length - X.shape[1]]])
    return X


def convert_date_strs(date_strs: list[str]) -> list[str]:
    X = prepare_date_strs_padded(date_strs)
    ids = model.predict(X).argmax(axis=-1)
    return ids_to_date_strs(ids)

In [124]:
convert_date_strs(['May 02, 2020', 'July 14, 1789'])

['2020-05-02', '1789-07-14']

> Cool! Granted, there are certainly much easier ways to write a date conversion tool (e.g., using regular expressions or even basic string manipulation), but we have to admit that using neural networks is way cooler. ;-)
>
> However, real-life sequence-to-sequence problems will usually be harder, so for the sake of completeness, let’s build a more powerful model.
>
> #### Second version: feeding the shifted targets to the decoder (teacher forcing)
> Instead of feeding the decoder a simple repetition of the encoder’s output vector, we can feed it the target sequence, shifted by one time step to the right. This way, at each time step the decoder will know what the previous target character was. This should help to tackle more complex sequence-to-sequence problems.
>
> Since the first output character of each target sequence has no previous character, we will need a new token to represent the start-of-sequence (sos).
>
> During inference, we won’t know the target, so what will we feed the decoder? We can just predict one character at a time, starting with an sos token, then feeding the decoder all the characters that were predicted so far (we will look at this in more details later in this notebook).
>
> But if the decoder’s LSTM expects to get the previous target as input at each step, how shall we pass it the vector output by the encoder? Well, one option is to ignore the output vector, and instead use the encoder’s LSTM state as the initial state of the decoder’s LSTM (which requires the encoder’s LSTM to have the same number of units as the decoder’s LSTM).
>
> Now let’s create the decoder’s inputs (for training, validation and testing). The sos token will be represented using the last possible output character’s ID + 1.
>
> **Note**: The decoder’s inputs and outputs must have the same length. Since we are not adding an eos token to the outputs, we drop the last token from the shifted inputs.

In [125]:
sos_id = len(OUTPUT_CHARS) + 1


def shifted_output_sequences(Y: tf.Tensor) -> tf.Tensor:
    sos_tokens = tf.fill(dims=(len(Y), 1), value=sos_id)
    return tf.concat([sos_tokens, Y[:, :-1]], axis=1)


X_train_decoder = shifted_output_sequences(Y_train)
X_valid_decoder = shifted_output_sequences(Y_valid)
X_test_decoder = shifted_output_sequences(Y_test)

> Let’s take a look at the decoder’s training inputs:

In [126]:
X_train_decoder

<tf.Tensor: shape=(10000, 10), dtype=int32, numpy=
array([[12,  8,  1, ..., 10, 11,  3],
       [12,  9,  6, ...,  6, 11,  2],
       [12,  8,  2, ...,  2, 11,  2],
       ...,
       [12, 10,  8, ...,  2, 11,  4],
       [12,  2,  2, ...,  3, 11,  3],
       [12,  8,  9, ...,  8, 11,  3]], dtype=int32)>
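
> As a sanity check, the shift can be reproduced on two hand-encoded targets. This sketch uses NumPy instead of TensorFlow and assumes the output vocabulary is `'0123456789-'` with IDs starting at 1 (0 being reserved for padding). The `encode()` helper is hypothetical:

```python
import numpy as np

# Assumed output vocabulary; ID 0 is assumed to be reserved for padding
OUTPUT_CHARS = '0123456789-'
sos_id = len(OUTPUT_CHARS) + 1


def encode(date_str: str) -> list[int]:
    # Hypothetical helper: maps each character to its 1-based ID
    return [OUTPUT_CHARS.index(char) + 1 for char in date_str]


Y = np.array([encode('2020-05-02'), encode('1789-07-14')])

# Prepend the sos token and drop the last target character
sos_tokens = np.full((len(Y), 1), sos_id)
X_decoder = np.concatenate([sos_tokens, Y[:, :-1]], axis=1)
print(Y)
print(X_decoder)
```

> Each row of `X_decoder` starts with the sos ID and then repeats the target delayed by one step, so the decoder’s input at step *t* is the target character at step *t* – 1.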

> Now let’s build the model. It’s not a simple sequential model anymore, so let’s use the functional API:

In [127]:
encoder_embedding_size = 32
decoder_embedding_size = 32
lstm_units = 128

np.random.seed(42)
keras.utils.set_random_seed(42)

encoder_input = keras.layers.Input(shape=[None], dtype=tf.int32)
encoder_embedding = keras.layers.Embedding(
    input_dim=len(INPUT_CHARS) + 1, output_dim=encoder_embedding_size
)(encoder_input)
_, *encoder_state = keras.layers.LSTM(lstm_units, return_state=True)(
    encoder_embedding
)

decoder_input = keras.layers.Input(shape=[None], dtype=tf.int32)
decoder_embedding = keras.layers.Embedding(
    input_dim=len(OUTPUT_CHARS) + 2, output_dim=decoder_embedding_size
)(decoder_input)
decoder_lstm_output = keras.layers.LSTM(lstm_units, return_sequences=True)(
    decoder_embedding, initial_state=encoder_state
)
decoder_output = keras.layers.Dense(
    len(OUTPUT_CHARS) + 1, activation='softmax'
)(decoder_lstm_output)

model = keras.Model(
    inputs=[encoder_input, decoder_input], outputs=[decoder_output]
)

optimizer = keras.optimizers.Nadam()
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy'],
)
history = model.fit(
    [X_train, X_train_decoder],
    Y_train,
    epochs=10,
    validation_data=([X_valid, X_valid_decoder], Y_valid),
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


> This model also reaches 100% validation accuracy, but it does so even faster. Let’s once again use the model to make some predictions. This time we need to predict characters one by one.

In [128]:
def predict_date_strs(date_strs: list[str]) -> list[str]:
    X = prepare_date_strs_padded(date_strs)
    # Start every sequence with the sos token
    Y_pred = tf.fill(dims=(len(X), 1), value=sos_id)
    for index in range(max_output_length):
        # Pad the characters predicted so far to the full decoder input length
        pad_size = max_output_length - Y_pred.shape[1]
        X_decoder = tf.pad(Y_pred, [[0, 0], [0, pad_size]])
        # Keep only the probabilities for the current time step
        Y_probas_next = model.predict([X, X_decoder])[:, index : index + 1]
        Y_pred_next = tf.argmax(Y_probas_next, axis=-1, output_type=tf.int32)
        Y_pred = tf.concat([Y_pred, Y_pred_next], axis=1)
    # Drop the leading sos token before converting the IDs back to strings
    return ids_to_date_strs(Y_pred[:, 1:])

In [129]:
predict_date_strs(['July 14, 1789', 'May 01, 2020'])

['1789-07-14', '2020-05-01']

> Works fine! Next, feel free to write a transformer version. :)
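
> The character-by-character loop used for prediction is an instance of greedy decoding. The sketch below shows the idea on its own, with a hypothetical `toy_next_token_probas()` function standing in for the trained model (it simply favors the previous token ID plus one). None of these names come from the notebook:

```python
import numpy as np

VOCAB_SIZE = 5
SOS_ID = 0


def toy_next_token_probas(prefix: list[int]) -> np.ndarray:
    # Hypothetical stand-in for a trained decoder: puts most of the
    # probability mass on (last token + 1) modulo the vocabulary size
    probas = np.full(VOCAB_SIZE, 0.05)
    probas[(prefix[-1] + 1) % VOCAB_SIZE] = 0.8
    return probas


def greedy_decode(next_token_probas, sos_id: int, max_length: int) -> list[int]:
    tokens = [sos_id]
    for _ in range(max_length):
        # Feed everything predicted so far, keep the most likely next token
        probas = next_token_probas(tokens)
        tokens.append(int(np.argmax(probas)))
    # Drop the leading sos token
    return tokens[1:]


decoded = greedy_decode(toy_next_token_probas, SOS_ID, max_length=6)
print(decoded)
```

> Greedy decoding always commits to the single most likely token at each step. That is enough for a deterministic task like date conversion, but harder tasks may benefit from strategies such as beam search or sampling.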


### 10. to 11
10. Go through the example on the Keras website for [“Natural language image search with a Dual Encoder”](https://homl.info/dualtuto). We will learn how to build a model capable of representing both images and text within the same embedding space. This makes it possible to search for images using a text prompt, like in the [CLIP model](https://openai.com/blog/clip/) by OpenAI.
> Just click the link and follow the instructions.
11. Use the Hugging Face Transformers library to download a pretrained language model capable of generating text (e.g., GPT), and try generating more convincing Shakespearean text. We will need to use the model’s `generate()` method. See Hugging Face’s documentation for more details.
> First, let’s load a pretrained model. In this example, we will use OpenAI’s GPT model with a language modeling head on top (just a linear layer with weights tied to the input embeddings). Let’s import it and load the pretrained weights (this will download about 445MB of data to *~/.cache/torch/transformers*):

In [130]:
from transformers import TFOpenAIGPTLMHeadModel

model = TFOpenAIGPTLMHeadModel.from_pretrained('openai-gpt')

All model checkpoint layers were used when initializing TFOpenAIGPTLMHeadModel.

All the layers of TFOpenAIGPTLMHeadModel were initialized from the model checkpoint at openai-gpt.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFOpenAIGPTLMHeadModel for predictions without further training.


> Next we will need a specialized tokenizer for this model. This one will try to use the [spaCy](https://spacy.io/) and [ftfy](https://pypi.org/project/ftfy/) libraries if they are installed, or else it will fall back to BERT’s `BasicTokenizer` followed by byte-pair encoding (which should be fine for most use cases).

In [131]:
from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.


> Now let’s use the tokenizer to tokenize and encode the prompt text:

In [132]:
tokenizer('hello everyone')

{'input_ids': [3570, 1473], 'attention_mask': [1, 1]}

In [133]:
prompt_text = 'This royal throne of kings, this sceptred isle'
encoded_prompt = tokenizer.encode(
    prompt_text, add_special_tokens=False, return_tensors='tf'
)
encoded_prompt

<tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187]], dtype=int32)>

> Easy! Next, let’s use the model to generate text after the prompt. We will generate 5 different sentences, each starting with the prompt text, followed by 40 additional tokens. For an explanation of what all the hyperparameters do, make sure to check out this great [blog post](https://huggingface.co/blog/how-to-generate) by Patrick von Platen (from Hugging Face). We can play around with the hyperparameters to try to obtain better results.

In [134]:
num_sequences = 5
length = 40

generated_sequences = model.generate(
    input_ids=encoded_prompt,
    do_sample=True,
    max_length=length + len(encoded_prompt[0]),
    temperature=1.0,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    num_return_sequences=num_sequences,
)

generated_sequences

<tf.Tensor: shape=(5, 50), dtype=int32, numpy=
array([[  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187,   498,   481,   550, 12974,   554, 20275,   544,   481,
          808,  1082,   525,   759, 13717,   507,   617,   616,  1294,
         1276,   239, 40477,   249,  1048,  2210,   525,   249,   880,
          694,   817,   485,   788,   507,   240,   244,   481,   762,
         4049,  3983,  6474,  1387,   485],
       [  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187,   509,  1163,   485,  1272,  8660,  3380, 14760,   240,
         1389,   557,   481,  7232,     8,   789,  3408,   239,   754,
        10253,   558,   694,  2556,   488,  2093,   485,  2185,   917,
           11,  5272,  6372,   562,  1272, 11413,   239, 40477,   481,
         1583,   618,   558,   524,  1074],
       [  616,  5751,  6404,   498,  9606,   240,   616, 26271,  7428,
        16187,   544,   597,   622,  1163,   488,   481,  1594,   498,
          622
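
> To get a feel for what `temperature` and `top_p` do, here is a rough NumPy sketch of temperature scaling followed by nucleus (top-p) filtering on a single made-up logits vector. It is not the library’s actual implementation, and the `top_p_probas()` helper is hypothetical:

```python
import numpy as np


def top_p_probas(logits, temperature=1.0, top_p=0.9) -> np.ndarray:
    # Temperature scaling followed by a numerically stable softmax
    scaled = np.asarray(logits, dtype=float) / temperature
    probas = np.exp(scaled - scaled.max())
    probas /= probas.sum()

    # Sort tokens from most to least likely and accumulate their probabilities
    order = np.argsort(probas)[::-1]
    cumulative = np.cumsum(probas[order])

    # Keep the smallest set of tokens whose total probability reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]

    # Zero out everything else and renormalize
    filtered = np.zeros_like(probas)
    filtered[kept] = probas[kept]
    return filtered / filtered.sum()


rng = np.random.default_rng(42)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # Made-up scores for a 5-token vocabulary
probas = top_p_probas(logits, temperature=1.0, top_p=0.9)
next_token = int(rng.choice(len(probas), p=probas))
print(probas.round(3), next_token)
```

> With these logits, only the three most likely tokens survive the filter, and the next token is sampled among them. Lower temperatures sharpen the distribution before filtering, so fewer tokens tend to survive.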

> Now let’s decode the generated sequences and print them:

In [135]:
for sequence in generated_sequences:
    text = tokenizer.decode(sequence, clean_up_tokenization_spaces=True)
    print(text)
    print('-' * 80)

this royal throne of kings, this sceptred isle of the necronomicon is the only place that can unlock it from this dark world. 
 i am surprised that i've been able to see it, " the man named dallon says to
--------------------------------------------------------------------------------
this royal throne of kings, this sceptred isle was home to many beloved possessors, such as the mighty astaroth. their wives had been husband and wife to lord teixiara for many generations. 
 the high king had his own
--------------------------------------------------------------------------------
this royal throne of kings, this sceptred isle is now our home and the land of our fathers!'this was made the standard of the coates, which is at king celebrant's command. 
 this was the longest story the coates
--------------------------------------------------------------------------------
this royal throne of kings, this sceptred isle has a powerful spirit that can not be severed or erased. it will reign unti

> We can try more recent (and larger) models, such as GPT-2, CTRL, Transformer-XL or XLNet, which are all available as pretrained models in the transformers library, including variants with language modeling heads on top. The preprocessing steps vary slightly between models. Hope you enjoyed this chapter! :)