# **Natural Language Processing with RNNs and Attention**
When Alan Turing imagined his famous [Turing test](https://homl.info/turingtest)  in 1950, he proposed a way to evaluate a machine’s ability to match human intelligence. He could have tested for many things, such as the ability to recognize cats in pictures, play chess, compose music, or escape a maze, but, interestingly, he chose a linguistic task. More specifically, he devised a chatbot capable of fooling its interlocutor into thinking it was human.  This test does have its weaknesses: a set of hardcoded rules can fool unsuspecting or naive humans (e.g., the machine could give vague predefined answers in response to some keywords, it could pretend that it is joking or drunk to get a pass on its weirdest answers, or it could escape difficult questions by answering them with its own questions), and many aspects of human intelligence are utterly ignored (e.g., the ability to interpret nonverbal communication such as facial expressions, or to learn a manual task). But the test does highlight the fact that mastering language is arguably Homo sapiens’s greatest cognitive ability.

Can we build a machine that can master written and spoken language? This is the ultimate goal of NLP research, but it’s a bit too broad, so in practice researchers focus on more specific tasks, such as text classification, translation, summarization, question answering, and many more.

A common approach for natural language tasks is to use recurrent neural networks. We will therefore continue to explore RNNs (introduced in Chapter 15), starting with a character RNN, or char-RNN, trained to predict the next character in a sentence. This will allow us to generate some original text. We will first use a stateless RNN (which learns on random portions of text at each iteration, without any information on the rest of the text), then we will build a stateful RNN (which preserves the hidden state between training iterations and continues reading where it left off, allowing it to learn longer patterns). Next, we will build an RNN to perform sentiment analysis (e.g., reading movie reviews and extracting the rater’s feeling about the movie), this time treating sentences as sequences of words, rather than characters. Then we will show how RNNs can be used to build an encoder–decoder architecture capable of performing neural machine translation (NMT), translating English to Spanish.

In the second part of this chapter, we will explore attention mechanisms. As their name suggests, these are neural network components that learn to select the part of the inputs that the rest of the model should focus on at each time step. First, we will boost the performance of an RNN-based encoder–decoder architecture using attention. Next, we will drop RNNs altogether and use a very successful attention-only architecture, called the transformer, to build a translation model. We will then discuss some of the most important advances in NLP in the last few years, including incredibly powerful language models such as GPT and BERT, both based on transformers. Lastly, I will show you how to get started with the excellent Transformers library by Hugging Face.

Let's start with simple and fun model that can write like Shakespeare (sort of).

## **Generating Shakespearean Text Using a Character RNN**
In a famous 2015 blog post titled “The Unreasonable Effectiveness of Recurrent Neural Networks”, Andrej Karpathy showed how to train an RNN to predict the next character in a sentence. This char-RNN can then be used to generate novel text, one character at a time. Here is a small sample of the text generated by a char-RNN model after it was trained on all of Shakespeare’s works:
- **PANDARUS*:
- *Alas, I think he shall be come approached and the day*
- *When little srain would be attain’d into being never fed*,
- *And who is but a chain and subjects of his death*,
- *I should not sleep.*
  
Not exactly a masterpiece, but it is still impressive that the model was able to learn words, grammar, proper punctuation, and more, just by learning to predict the next character in a sentence. This is our first example of a language model; similar (but much more powerful) language models, discussed later in this chapter, are at the core of modern NLP. In the remainder of this section we’ll build a char-RNN step by step, starting with the creation of the dataset.

### **Creating the Dataset**
First, using Keras’s handy tf.keras.utils.get_file() function, let’s download all of Shakespeare’s works. The data is loaded from Andrej Karpathy’s [char-rnn project](https://github.com/karpathy/char-rnn):

In [1]:
import tensorflow as tf
shakespeare_url = "https://homl.info/shakespeare" # shortcut URL
filepath = tf.keras.utils.get_file("shakespear.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()


2025-07-23 07:33:07.275963: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1753245187.379020   10215 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753245187.407294   10215 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1753245187.486993   10215 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1753245187.487082   10215 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1753245187.487098   10215 computation_placer.cc:177] computation placer alr

Let's print the first few lines:

In [2]:
print(shakespeare_text[:149])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?




Look's like shakespeare all right!

Next, we’ll use a ***tf.keras.layers.TextVectorization*** layer (introduced in Chapter 13) to encode this text. We set ***split="character"*** to get character-level encoding rather than the default word-level encoding, and we use ***standardize="lower"*** to convert the text to lowercase (which will simplify the task):


In [3]:
text_vec_layer = tf.keras.layers.TextVectorization(split='character', 
                                                   standardize='lower')
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text]) [0]

I0000 00:00:1753221097.507433    7933 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1698 MB memory:  -> device: 0, name: NVIDIA GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1
2025-07-23 00:51:41.367295: W external/local_xla/xla/service/gpu/llvm_gpu_backend/default/nvptx_libdevice_path.cc:40] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  ipykernel_launcher.runfiles/cuda_nvcc
  ipykern/cuda_nvcc
  
  /usr/local/cuda
  /opt/cuda
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/

Each character is now mapped to an integer, starting at 2. The ***TextVectorization*** layer reserved the value 0 for padding tokens, and it reserved 1 for unknown characters. We won’t need either of these tokens for now, so let’s subtract 2 from the character IDs and compute the number of distinct characters and the total number of characters:

In [4]:
encoded -= 2 # drop tokens 0 (pad) and 1 (unknown), which will not use
n_tokens = text_vec_layer.vocabulary_size() - 2 # number of distinct chars = 39
dataset_size = len(encoded) # total number of chars = 1,115,394
dataset_size

1115394

Next, just like we did in Chapter 15, we can turn this very long sequence into a dataset of windows that we can then use to train a sequence-to-sequence RNN. The targets will be similar to the inputs, but shifted by one time step into the “future”. For example, one sample in the dataset may be a sequence of character IDs representing the text “to be or not to b” (without the final “e”), and the corresponding target—a sequence of character IDs representing the text “o be or not to be” (with the final “e”, but without the leading “t”). Let’s write a small utility function to convert a long sequence of character IDs into a dataset of input/target window pairs:

In [5]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    sequence = tf.where(sequence < 0, 0, sequence)
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(buffer_size=100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)



This function starts much like the ***to_windows()*** custom utility function we created in Chapter 15:
- It takes a sequence as input (i.e., the encoded text), and creates a dataset containing all the windows of the desired length.
- It increases the length by one, since we need the next character for the target.
- Then, it shuffles the windows (optionally), batches them, splits them into input/output pairs, and activates prefetching.

[Figure 16-1](#fig161) summarizes the dataset preparation steps: it shows windows of length 11, and a batch size of 3. The start index of each window is indicated next to it.

<span id=fig161>
![Preparing a dataset of shuffled windows](f161.png)
</span>

Now we’re ready to create the training set, the validation set, and the test set. We will use roughly 90% of the text for training, 5% for validation, and 5% for testing:

In [6]:
length = 100
train = int((dataset_size*0.9)//1)
remain = (dataset_size - train) // 2
tf.random.set_seed(50)
train_set = to_dataset(encoded[:train], length=length, shuffle=True, seed=50)
valid_set = to_dataset(encoded[train:(train+remain)], length=length)
test_set = to_dataset(encoded[-remain:], length=length)

> #### **TIP**
> We set the window length to 100, but you can try tuning it: it’s easier and faster to train RNNs on shorter input sequences, but the RNN will not be able to learn any pattern longer than length, so don’t make it too small.

That's it! Preparing the dataset was the hardest part. Now let's create the model.

### **Building and Training the Char-RNN Model**
Since our dataset is reasonably large, and modeling language is quite a difficult task, we need more than a simple RNN with a few recurrent neurons. Let’s build and train a model with one GRU layer composed of 128 units (you can try tweaking the number of layers and units later, if needed):

In [None]:
tf.config.optimizer.set_jit(False)

tf.debugging.set_log_device_placement(True)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16), 
    tf.keras.layers.GRU(128, return_sequences=True), 
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer='nadam', 
              metrics=['accuracy'])
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespear_model.keras", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10, callbacks=[model_ckpt])

Epoch 1/10


2025-07-23 01:09:30.305188: W tensorflow/core/framework/op_kernel.cc:1844] UNKNOWN: JIT compilation failed.


UnknownError: Graph execution error:

Detected at node sequential_6/gru_8/gru_cell/recurrent_kernel/Initializer/Sign defined at (most recent call last):
  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_7933/1277159966.py", line 42, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 377, in fit

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 220, in function

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 133, in multi_step_on_iterator

JIT compilation failed.
	 [[{{node sequential_6/gru_8/gru_cell/recurrent_kernel/Initializer/Sign}}]] [Op:__inference_initialize_variables_8394]

In [25]:
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="nadam",
    metrics=["accuracy"],
    run_eagerly=True   # <— this forces eager execution, no JIT at all
)


In [26]:
from tensorflow.keras.initializers import GlorotUniform
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

model = Sequential([
    Embedding(input_dim=n_tokens, output_dim=16),
    GRU(
      128,
      return_sequences=True,
      # Use a uniform initializer which doesn’t call tf.sign under the hood
      recurrent_initializer=GlorotUniform()
    ),
    Dense(n_tokens, activation="softmax")
])

model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="nadam",
    metrics=["accuracy"]
)
history = model.fit(train_set, validation_data=valid_set, epochs=10)


Epoch 1/10


2025-07-23 01:16:38.051352: W external/local_xla/xla/service/gpu/llvm_gpu_backend/nvptx_backend.cc:110] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2025-07-23 01:16:38.060594: W external/local_xla/xla/service/gpu/llvm_gpu_backend/nvptx_backend.cc:110] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2025-07-23 01:16:38.070643: W external/local_xla/xla/service/gpu/llvm_gpu_backend/nvptx_backend.cc:110] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2025-07-23 01:16:38.080369: W external/local_xla/xla/service/gpu/llvm_gpu_backend/nvptx_backend.cc:110] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2025-07-23 01:16:38.089755: W external/local_xla/xla/service/gpu/llvm_gpu_backend/nvptx_backend.cc:110] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2025-07-23 01:16:38.098293: W external/local_xla/xla/service/gpu/llvm_gpu_backen

UnknownError: Graph execution error:

Detected at node nadam/Pow_4 defined at (most recent call last):
  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_7933/2763532858.py", line 21, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 377, in fit

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 220, in function

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 133, in multi_step_on_iterator

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 114, in one_step_on_data

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 81, in train_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/optimizers/base_optimizer.py", line 463, in apply_gradients

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/optimizers/base_optimizer.py", line 527, in apply

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/optimizers/base_optimizer.py", line 593, in _backend_apply_gradients

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/optimizers/nadam.py", line 106, in _backend_update_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/optimizer.py", line 120, in _backend_update_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/optimizer.py", line 134, in _distributed_tf_update_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/optimizer.py", line 131, in apply_grad_to_update_var

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/optimizers/nadam.py", line 119, in update_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/numpy.py", line 6391, in power

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/numpy.py", line 2653, in power

JIT compilation failed.
	 [[{{node nadam/Pow_4}}]] [Op:__inference_multi_step_on_iterator_12190]

Let's go over this code:
- We use an ***Embedding*** layer as the first layer, to encode the character IDs (embeddings were introduced in Chapter 13). The ***Embedding*** layer’s number of input dimensions is the number of distinct character IDs, and the number of output dimensions is a hyperparameter you can tune—we’ll set it to 16 for now. Whereas the inputs of the ***Embedding*** layer will be 2D tensors of shape [*batch size, window length*], the output of the ***Embedding*** layer will be a 3D tensor of shape [*batch size, window length, embedding size*].

- We use a ***Dense*** layer for the output layer: it must have 39 units (***n_tokens***) because there are 39 distinct characters in the text, and we want to output a probability for each possible character (at each time step). The 39 output probabilities should sum up to 1 at each time step, so we apply the softmax activation function to the outputs of the ***Dense*** layer.

- Lastly, we compile this model, using the "***sparse_categorical_crossentropy***" loss and a Nadam optimizer, and we train the model for several epochs, using a ***ModelCheckpoint*** callback to save the best model (in terms of validation accuracy) as training progresses.

> #### **TIP**
>  If you are running this code on Colab with a GPU activated, then training should take roughly one to two hours. You can reduce the number of epochs if you don’t want to wait that long, but of course the model’s accuracy will probably be lower. If the Colab session times out, make sure to reconnect quickly, or else the Colab runtime will be destroyed.

This model does not handle text preprocessing, so let’s wrap it in a final model containing the ***tf.keras.layers.TextVectorization*** layer as the first layer, plus a ***tf.keras.layers.Lambda*** layer to subtract 2 from the character IDs since we’re not using the padding and unknown tokens for now:

In [27]:
model = tf.keras.models.load_model('my_shakespeare_model.keras')

2025-07-23 01:17:20.007086: W tensorflow/core/framework/op_kernel.cc:1844] UNKNOWN: JIT compilation failed.


UnknownError: {{function_node __wrapped__Sign_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Sign] name: 

In [28]:
model = tf.keras.models.load_model('my_shakespear_model.keras')
shakespeare_model = tf.keras.Sequential([
    text_vec_layer, 
    tf.keras.layers.Lambda(lambda x: x - 2), # no <PAD> or <UNK> tokens
    model
])

2025-07-23 01:17:44.919439: W tensorflow/core/framework/op_kernel.cc:1844] UNKNOWN: JIT compilation failed.


UnknownError: {{function_node __wrapped__Sign_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Sign] name: 

In [9]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer, 
    tf.keras.layers.Lambda(lambda x: x - 2), # no <PAD> or <UNK> tokens
    model
])

text = "To be or not to b"
input_text = tf.constant([text])

try:
    y_proba = shakespeare_model.predict(input_text)[0, -1]
    y_pred = tf.argmax(y_proba) # Choose the most probable character ID
    predicted_char = text_vec_layer.get_vocabulary()[y_pred + 2]
    print(text + predicted_char)
except Exception as e:
    print("Prediction failed:", e)

Prediction failed: Graph execution error:

Detected at node sequential_2_1/sequential_1_1/gru_1_1/split_1 defined at (most recent call last):
  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/ba

2025-07-23 00:53:51.268071: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 7858462850924658090
2025-07-23 00:53:51.268140: I tensorflow/core/framework/local_rendezvous.cc:430] Local rendezvous send item cancelled. Key hash: 14693624709018958126
2025-07-23 00:53:51.268162: I tensorflow/core/framework/local_rendezvous.cc:430] Local rendezvous send item cancelled. Key hash: 7994855678121314662
2025-07-23 00:53:51.268194: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: INVALID_ARGUMENT: -input rank(-1) <= split_dim < input rank (1), but got 1
	 [[{{node sequential_2_1/sequential_1_1/gru_1_1/split_1}}]]
	 [[sequential_2_1/text_vectorization_1/UnicodeSplit/UnicodeEncode/UnicodeEncode/RaggedFromTensor/Reshape/_14]]
2025-07-23 00:53:51.268233: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 13125160337373383403
2025-07-23 00:53:51.268263: I t

And now let's use it to predict the next character in a sentence:

The model performance seem to not performing well. Now let's use this model to pretend we're shakespeare!

### **Generating Fake Shakespearean Text**
To generate new text using the char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on. This is called *greedy decoding*. But in practice this often leads to the same words being repeated over and over again. Instead, we can sample the next character randomly, with a probability equal to the estimated probability, using TensorFlow’s ***tf.random.categorical()*** function. This will generate more diverse and interesting text. The **categorical()** function samples random class indices, given the class log probabilities (logits). For example:

In [10]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]]) # probas = 50%, 40%, and 10%
tf.random.set_seed(50)
tf.random.categorical(log_probas, num_samples=8) # draw 8 samples

2025-07-23 00:54:21.555409: W external/local_xla/xla/service/gpu/llvm_gpu_backend/nvptx_backend.cc:110] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
error: libdevice not found at ./libdevice.10.bc
2025-07-23 00:54:21.556778: E tensorflow/compiler/mlir/tools/kernel_gen/tf_framework_c_interface.cc:227] INTERNAL: Generating device code failed.
2025-07-23 00:54:21.559431: W tensorflow/core/framework/op_kernel.cc:1844] UNKNOWN: JIT compilation failed.


UnknownError: {{function_node __wrapped__Log_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Log]

To have more control over the diversity of the generated text, we can divide the logits by a number called the temperature, which we can tweak as we wish. A temperature close to zero favors high probability characters, while a high temperature gives all characters an equal probability. Lower temperatures are typically preferred when generating fairly rigid and precise text, such as mathematical equations, while higher temperatures are preferred when generating more diverse and creative text. The following ***next_char()*** custom helper function uses this approach to pick the next character to add to the input text:

In [11]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

Next, we can write another small helper function that will repeatedly call ***next_char()*** to get the next character and append it to the given text:

In [12]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

We wre now ready to generate some text! Let's try with different temperature values:

In [13]:
tf.random.set_seed(50)
input_text = tf.constant(['To be or not to be'])
output = extend_text(input_text, temperature=0.01)
extended_string = output.numpy().item()
extended_string

Expected: keras_tensor_8
Received: inputs=('Tensor(shape=(1,))',)
2025-07-23 00:55:47.735680: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 7858462850924658090
2025-07-23 00:55:47.735795: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 13125160337373383403
2025-07-23 00:55:47.735827: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 17520575464057185985
2025-07-23 00:55:47.735854: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 1100417879316850649
2025-07-23 00:55:47.735935: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 83755775956897098
2025-07-23 00:55:47.735975: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 12983520393139297892


InvalidArgumentError: Graph execution error:

Detected at node sequential_2_1/sequential_1_1/gru_1_1/split_1 defined at (most recent call last):
  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_7933/3505144283.py", line 3, in <module>

  File "/tmp/ipykernel_7933/3562896888.py", line 3, in extend_text

  File "/tmp/ipykernel_7933/4243554360.py", line 2, in next_char

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 566, in predict

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 260, in one_step_on_data_distributed

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 250, in one_step_on_data

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 105, in predict_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 601, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/rnn.py", line 406, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 568, in inner_loop

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 484, in gru

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 704, in _cudnn_gru

Detected at node sequential_2_1/sequential_1_1/gru_1_1/split_1 defined at (most recent call last):
  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_7933/3505144283.py", line 3, in <module>

  File "/tmp/ipykernel_7933/3562896888.py", line 3, in extend_text

  File "/tmp/ipykernel_7933/4243554360.py", line 2, in next_char

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 566, in predict

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 260, in one_step_on_data_distributed

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 250, in one_step_on_data

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 105, in predict_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 601, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/rnn.py", line 406, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 568, in inner_loop

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 484, in gru

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 704, in _cudnn_gru

2 root error(s) found.
  (0) INVALID_ARGUMENT:  -input rank(-1) <= split_dim < input rank (1), but got 1
	 [[{{node sequential_2_1/sequential_1_1/gru_1_1/split_1}}]]
	 [[sequential_2_1/text_vectorization_1/UnicodeSplit/UnicodeEncode/UnicodeEncode/RaggedFromTensor/Reshape/_14]]
  (1) INVALID_ARGUMENT:  -input rank(-1) <= split_dim < input rank (1), but got 1
	 [[{{node sequential_2_1/sequential_1_1/gru_1_1/split_1}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_one_step_on_data_distributed_3644]

Oops! It only outputs space " " prediction let's maybe try with different temperature values:

In [14]:
tf.random.set_seed(42)
input_text = tf.constant(['To be or not to be'])
output = extend_text(input_text, temperature=1)
extended_string = output.numpy().item()
extended_string

2025-07-23 00:55:54.546412: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: INVALID_ARGUMENT: -input rank(-1) <= split_dim < input rank (1), but got 1
	 [[{{node sequential_2_1/sequential_1_1/gru_1_1/split_1}}]]
2025-07-23 00:55:54.546503: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 7858462850924658090
2025-07-23 00:55:54.546532: I tensorflow/core/framework/local_rendezvous.cc:430] Local rendezvous send item cancelled. Key hash: 7994855678121314662
2025-07-23 00:55:54.546569: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 13125160337373383403
2025-07-23 00:55:54.546620: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 17520575464057185985
2025-07-23 00:55:54.546651: I tensorflow/core/framework/local_rendezvous.cc:426] Local rendezvous recv item cancelled. Key hash: 11004178793168

InvalidArgumentError: Graph execution error:

Detected at node sequential_2_1/sequential_1_1/gru_1_1/split_1 defined at (most recent call last):
  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_7933/3505144283.py", line 3, in <module>

  File "/tmp/ipykernel_7933/3562896888.py", line 3, in extend_text

  File "/tmp/ipykernel_7933/4243554360.py", line 2, in next_char

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 566, in predict

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 260, in one_step_on_data_distributed

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 250, in one_step_on_data

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 105, in predict_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 601, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/rnn.py", line 406, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 568, in inner_loop

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 484, in gru

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 704, in _cudnn_gru

Detected at node sequential_2_1/sequential_1_1/gru_1_1/split_1 defined at (most recent call last):
  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/runpy.py", line 86, in _run_code

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 603, in run_forever

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/asyncio/events.py", line 80, in _run

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code

  File "/tmp/ipykernel_7933/3505144283.py", line 3, in <module>

  File "/tmp/ipykernel_7933/3562896888.py", line 3, in extend_text

  File "/tmp/ipykernel_7933/4243554360.py", line 2, in next_char

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 566, in predict

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 260, in one_step_on_data_distributed

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 250, in one_step_on_data

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/trainer.py", line 105, in predict_step

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/sequential.py", line 220, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 183, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/function.py", line 177, in _run_through_graph

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/models/functional.py", line 648, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/layer.py", line 936, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/ops/operation.py", line 58, in __call__

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 156, in error_handler

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 601, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/rnn.py", line 406, in call

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 568, in inner_loop

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 484, in gru

  File "/home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 704, in _cudnn_gru

2 root error(s) found.
  (0) INVALID_ARGUMENT:  -input rank(-1) <= split_dim < input rank (1), but got 1
	 [[{{node sequential_2_1/sequential_1_1/gru_1_1/split_1}}]]
	 [[sequential_2_1/text_vectorization_1/UnicodeSplit/UnicodeEncode/UnicodeEncode/RaggedFromTensor/Reshape/_14]]
  (1) INVALID_ARGUMENT:  -input rank(-1) <= split_dim < input rank (1), but got 1
	 [[{{node sequential_2_1/sequential_1_1/gru_1_1/split_1}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_one_step_on_data_distributed_3644]

In [168]:
tf.random.set_seed(50)
input_text = tf.constant(['To be or not to be'])
output = extend_text(input_text, temperature=100)
extended_string = output.numpy().item()
extended_string

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 266ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 83ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 77ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 77ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 84ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 8

b"To be or not to begb.tznogllookzk!l:eg3c& ;c-qs!pg:3?av3az's-esvztpa"

Shakespeare seems to be suffering from a heatwave. To generate more convincing text, a common technique is to sample only from the top k characters, or only from the smallest set of top characters whose total probability exceeds some threshold (this is called *nucleus sampling*). Alternatively, you could try using *beam search*, which we will discuss later in this chapter, or using more GRU layers and more neurons per layer, training for longer, and adding some regularization if needed. Also note that the model is currently incapable of learning patterns longer than ***length***, which is just 100 characters. You could try making this window larger, but it will also make training harder, and even LSTM and GRU cells cannot handle very long sequences. An alternative approach is to use a stateful RNN.

### **Stateful RNN**
Until now, we have only used *stateless RNNs*: at each training iteration the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step, it throws it away as it is not needed anymore. What if we instructed the RNN to preserve this final state after processing a training batch and use it as the initial state for the next training batch? This way the model could learn long-term patterns despite only backpropagating through short sequences. This is called a *stateful RNN*. Let’s go over how to build one.

First, note that a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off. So the first thing we need to do to build a stateful RNN is to use sequential and nonoverlapping input sequences (rather than the shuffled and overlapping sequences we used to train stateless RNNs). When creating the ***tf.data.Dataset***, we must therefore use ***shift=length*** (instead of ***shift=1***) when calling the ***window()*** method. Moreover, we must not call the ***shuffle()*** method.

Unfortunately, batching is much harder when preparing a dataset for a stateful RNN than it is for a stateless RNN. Indeed, if we were to call ***batch(32)***, then 32 consecutive windows would be put in the same batch, and the following batch would not continue each of these windows where it left off. The first batch would contain windows 1 to 32 and the second batch would contain windows 33 to 64, so if you consider, say, the first window of each batch (i.e., windows 1 and 33), you can see that they are not consecutive. The simplest solution to this problem is to just use a batch size of 1. The following ***to_dataset_for_stateful_rnn()*** custom utility function uses this strategy to prepare a dataset for a stateful RNN:

In [15]:
def to_dataset_for_stateful_rnn(sequence, length):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=length, drop_remainder=True)
    ds.flat_map(lambda window: window.batch(length + 1))
    ds.batch(1, drop_remainder=True)
    ds.map(lambda window: (window[:, :-1], window[:, 1:]))
    return ds.prefetch(1)

train = int((dataset_size*0.9)//1)
remain = (dataset_size - train) // 2
tf.random.set_seed(50)
stateful_train_set = to_dataset(encoded[:train], length=length, seed=50)
stateful_valid_set = to_dataset(encoded[train:(train+remain)], length=length)
stateful_test_set = to_dataset(encoded[-remain:], length=length)

[Figure 16-2](#fig162) summarizes the main steps of this function.

<span id=fig162></span>
![Preparing a dataset of consecutive sequence fragments for a stateful RNN](f162.png)

Batching is harder, but it is not impossible. For example, we could chop Shakespeare’s text into 32 texts of equal length, create one dataset of consecutive input sequences for each of them, and finally use ***tf.data.Dataset.zip(datasets).map(lambda *windows: tf.stack(windows))*** to create proper consecutive batches, where the n input sequence in a batch starts off exactly where the n input sequence ended in the previous batch (see the notebook for the full code).

Now, let’s create the stateful RNN. We need to set the ***stateful*** argument to ***True*** when creating each recurrent layer, and because the stateful RNN needs to know the batch size (since it will preserve a state for each input sequence in the batch). Therefore we must set the ***batch_input_shape*** argument in the first layer. Note that we can leave the second dimension unspecified, since the input sequences could have any length:

In [17]:
import tensorflow as tf
from tensorflow.keras import backend as K

_orig_sign = K.sign
def _patched_sign(x):
    if x.dtype.is_integer:
        return tf.cast(tf.sign(tf.cast(x, tf.float32)), x.dtype)
    return _orig_sign(x)
K.sign = _patched_sign


In [None]:
inputs = tf.keras.Input(batch_shape=(1, None), dtype=tf.int32)
x = tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16)(inputs)
x = tf.keras.layers.GRU(128, return_sequences=True, stateful=True, )(x)
outputs = tf.keras.layers.Dense(n_tokens, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)

2025-07-23 01:04:57.531176: W tensorflow/core/framework/op_kernel.cc:1844] UNKNOWN: JIT compilation failed.


UnknownError: {{function_node __wrapped__Sign_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Sign] name: 

At the end of each epoch, we need to reset the states before we go back to the beginning of the text. For this, we can use a small custom Keras callback:

In [125]:
class ResetStatesCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        for layer in self.model.layers:
            if hasattr(layer, 'reset_states'):
                layer.reset_states()
        print(f"\nStates reset at the end of epoch {epoch + 1}")

And now we can compile the model and train it using our callback:

In [133]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam', 
metrics=["accuracy"])
history = model.fit(stateful_train_set, validation_data=stateful_valid_set, 
epochs=10, callbacks=[ResetStatesCallback(), model_ckpt])

Epoch 1/10


ValueError: Exception encountered when calling GRU.call().

[1mInput tensor `functional_13_1/gru_13_1/ReadVariableOp:0` enters the loop with shape (1, 128), but has shape (None, 128) after one iteration. To allow the shape to vary across iterations, use the `shape_invariants` argument of tf.while_loop to specify a less-specific shape.[0m

Arguments received by GRU.call():
  • sequences=tf.Tensor(shape=(None, None, 16), dtype=float32)
  • initial_state=None
  • mask=None
  • training=True

In [134]:
import tensorflow as tf

def to_dataset_for_stateful_rnn(sequence, length):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    # 1. Create windows of size length+1, stepping by length, drop remainders
    ds = ds.window(size=length + 1, shift=length, drop_remainder=True)
    # 2. Turn each window into a dense tensor of shape (length+1,)
    ds = ds.flat_map(lambda w: w.batch(length + 1))
    # 3. Batch into groups of 1 window, forcing the batch dimension to 1
    ds = ds.batch(1, drop_remainder=True)
    # 4. Split into (inputs, targets); each will be shape (1, length)
    ds = ds.map(lambda window: (window[:, :-1], window[:, 1:]))
    return ds.prefetch(1)

# Splits
train_end = int(dataset_size * 0.9)
val_end   = train_end + (dataset_size - train_end) // 2

tf.random.set_seed(50)
stateful_train_set = to_dataset_for_stateful_rnn(encoded[:train_end], length)
stateful_valid_set = to_dataset_for_stateful_rnn(encoded[train_end:val_end], length)
stateful_test_set  = to_dataset_for_stateful_rnn(encoded[val_end:], length)

# Verify shapes
for inp, tgt in stateful_train_set.take(1):
    print("Input shape:", inp.shape)   # should be (1, length)
    print("Target shape:", tgt.shape)  # should be (1, length)

# Build model
inputs = tf.keras.Input(batch_shape=(1, None), dtype=tf.int32)
x = tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16)(inputs)
x = tf.keras.layers.GRU(128, return_sequences=True, stateful=True)(x)
outputs = tf.keras.layers.Dense(n_tokens, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

class ResetStatesCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        for layer in self.model.layers:
            if hasattr(layer, 'reset_states'):
                layer.reset_states()
        print(f"\nStates reset at the end of epoch {epoch + 1}")

model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='nadam',
    metrics=['accuracy']
)

# Train
history = model.fit(
    stateful_train_set,
    validation_data=stateful_valid_set,
    epochs=10,
    callbacks=[ResetStatesCallback(), model_ckpt]
)


Input shape: (1, 100)
Target shape: (1, 100)
Epoch 1/10
  10038/Unknown [1m484s[0m 48ms/step - accuracy: 0.3874 - loss: 2.1183




States reset at the end of epoch 1
[1m10038/10038[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m493s[0m 49ms/step - accuracy: 0.3874 - loss: 2.1183 - val_accuracy: 0.4993 - val_loss: 1.6770
Epoch 2/10
[1m10038/10038[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 43ms/step - accuracy: 0.5236 - loss: 1.5838
States reset at the end of epoch 2
[1m10038/10038[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m453s[0m 44ms/step - accuracy: 0.5236 - loss: 1.5838 - val_accuracy: 0.5228 - val_loss: 1.5949
Epoch 3/10
[1m10037/10038[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 47ms/step - accuracy: 0.5488 - loss: 1.4885
States reset at the end of epoch 3
[1m10038/10038[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m478s[0m 48ms/step - accuracy: 0.5488 - loss: 1.4885 - val_accuracy: 0.5321 - val_loss: 1.5615
Epoch 4/10
[1m10038/10038[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step - accuracy: 0.5595 - loss: 1.4454
States reset at the end of epoch 4
[1m10038/

In [152]:
text = "To be or not to b"
input_text = tf.constant([text])

try:
    y_proba = shakespeare_model.predict(input_text)[0, -1]
    y_pred = tf.argmax(y_proba) # Choose the most probable character ID
    predicted_char = text_vec_layer.get_vocabulary()[y_pred + 2]
    print(text + predicted_char)
except Exception as e:
    print("Prediction failed:", e)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 101ms/step
To be or not to be


In [170]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]]) # probas = 50%, 40%, and 10%
tf.random.set_seed(50)
tf.random.categorical(log_probas, num_samples=8) # draw 8 samples

def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

def extend_text(text, n_chars=100, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

tf.random.set_seed(50)
input_text = tf.constant(['To be or not to be'])
output = extend_text(input_text, temperature=0.01)
extended_string = output.numpy().item()
extended_string

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 260ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 74ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 72ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 71ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 78ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 110ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 90ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 89ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 80ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 83ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 84ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 91ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 

b'To be or not to be so\nmay be so man the words that i shall be so\nmay be a shame the words that i shall be so\nmay be so'

Interesting i see now it is able to give meaningful words, but there not in proper and continuation series as they keep repeating for some of the words with no proper meaning which we will look after it on the as you move on *sentiment analysis* part in this chapter.

> #### **TIP**
> After this model is trained, it will only be possible to use it to make predictions for batches of the same size as were used during training. To avoid this restriction, create an identical stateless model, and copy the stateful model’s weights to this model.

Interestingly, although a char-RNN model is just trained to predict the next character, this seemingly simple task actually requires it to learn some higher-level tasks as well. For example, to find the next character after “Great movie, I really”, it’s helpful to understand that the sentence is positive, so what follows is more likely to be the letter “l” (for “loved”) rather than “h” (for “hated”). In fact, a [2017 paper](https://homl.info/sentimentneuron)  by Alec Radford and other OpenAI researchers describes how the authors trained a big char-RNN-like model on a large dataset, and found that one of the neurons acted as an excellent sentiment analysis classifier: although the model was trained without any labels, the sentiment neuron—as they called it—reached state-of-the-art performance on sentiment analysis benchmarks. This foreshadowed and motivated unsupervised pretraining in NLP. 

But before we explore unsupervised pretraining, let’s turn our attention to word-level models and how to use them in a supervised fashion for sentiment analysis. In the process, you will learn how to handle sequences of variable lengths using masking.

## **Sentiment Analysis**

Generating text can be fun and instructive, but in real-life projects, one of the most common applications of NLP is text classification—especially sentiment analysis. If image classification on the MNIST dataset is the “Hello world!” of computer vision, then sentiment analysis on the IMDb reviews dataset is the “Hello world!” of natural language processing. The IMDb dataset consists of 50,000 movie reviews in English (25,000 for training, 25,000 for testing) extracted from the famous [Internet Movie Database](https://imdb.com), along with a simple binary target for each review indicating whether it is negative (0) or positive (1). Just like MNIST, the IMDb reviews dataset is popular for good reasons: it is simple enough to be tackled on a laptop in a reasonable amount of time, but challenging enough to be fun and rewarding.

Let’s load the IMDb dataset using the TensorFlow Datasets library (introduced in Chapter 13). We’ll use the first 90% of the training set for training, and the remaining 10% for validation:

In [11]:
pip install tensorflow_datasets

Note: you may need to restart the kernel to use updated packages.


In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

raw_train_set, raw_valid_set, raw_test_set = tfds.load(
    name="imdb_reviews", 
    split=['train[:90%]', 'train[90%:]', 'test'], 
    as_supervised=True
)
tf.random.set_seed(50)
train_set = raw_train_set.shuffle(5000, seed=50).batch(32).prefetch(1)
valid_set = raw_valid_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)

  from .autonotebook import tqdm as notebook_tqdm


[1mDownloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/jaxon/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...[0m


Dl Completed...:   0%|          | 0/1 [02:40<?, ? url/s]

> #### **TIP**
>  Keras also includes a function for loading the IMDb dataset, if you prefer: ***tf.keras.datasets.imdb.load_data()***. The reviews are already preprocessed as sequences of word IDs.

Let's inspect a few reviews:

In [2]:
for review, label in raw_train_set.take(4):
    print(review.numpy().decode("utf-8"))
    print("Label:", label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
Label: 0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development

Some reviews are easy to classify. For example, the first review includes the words “terrible movie” in the very first sentence. But in many cases things are not that simple. For example, the third review starts off positively, even though it’s ultimately a negative review (label 0).

To build a model for this task, we need to preprocess the text, but this time we will chop it into words instead of characters. For this, we can use the ***tf.keras***.

***layers.TextVectorization*** layer again. Note that it uses spaces to identify word boundaries, which will not work well in some languages. For example, Chinese writing does not use spaces between words, Vietnamese uses spaces even within words, and German often attaches multiple words together, without spaces. Even in English, spaces are not always the best way to tokenize text: think of “San Francisco” or “#ILoveDeepLearning”.

Fortunately, there are solutions to address these issues. In a [2016 paper](https://homl.info/rarewords),  Rico Sennrich et al. from the University of Edinburgh explored several methods to tokenize and detokenize text at the subword level. This way, even if your model encounters a rare word it has never seen before, it can still reasonably guess what it means. For example, even if the model never saw the word “smartest” during training, if it learned the word “smart” and it also learned that the suffix “est” means “the most”, it can infer the meaning of “smartest”. One of the techniques the authors evaluated is *byte pair encoding* (**BPE**). BPE works by splitting the whole training set into individual characters (including spaces), then repeatedly merging the most frequent adjacent pairs until the vocabulary reaches the desired size.

A [2018 paper](https://homl.info/subword)  by Taku Kudo at Google further improved subword tokenization, often removing the need for language-specific preprocessing prior to tokenization. Moreover, the paper proposed a novel regularization technique called *subword regularization*, which improves accuracy and robustness by introducing some randomness in tokenization during training: for example, “New England” may be tokenized as “New” + “England”, or “New” + “Eng” + “land”, or simply “New England” (just one token). Google’s [SentencePiece](https://github.com/google/sentencepiece) project provides an open source implementation, which is described in a [paper](https://homl.info/sentencepiece)  by Taku Kudo and John Richardson.

The [TensorFlow Text](https://homl.info/tftext) library also implements various tokenization strategies, including [WordPiece](https://homl.info/wordpiece)  (a variant of BPE), and last but not least, the [Tokenizers library by Hugging Face](https://homl.info/tokenizers) implements a wide range of extremely fast tokenizers.

However, for the IMDb task in English, using spaces for token boundaries should be good enough. So let’s go ahead with creating a ***TextVectorization*** layer and adapting it to the training set. We will limit the vocabulary to 1,000 tokens, including the most frequent 998 words plus a padding token and a token for unknown words, since it’s unlikely that very rare words will be important for this task, and limiting the vocabulary size will reduce the number of parameters the model needs to learn:

In [3]:
vocab_size = 1000
text_vec_layer = tf.keras.layers.TextVectorization(max_tokens=vocab_size)
text_vec_layer.adapt(train_set.map(lambda reviews, labels: reviews))

Finally, we create the model and train it:

In [31]:
embed_size = 128
tf.random.set_seed(50)
model = tf.keras.Sequential([
    text_vec_layer, 
    tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True), 
    tf.keras.layers.GRU(128), 
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss="binary_crossentropy", optimizer="nadam", 
              metrics=['accuracy'])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', 
    patience=2, 
    restore_best_weights=True)

model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "sentiment_text_masking.keras", monitor="val_accuracy", save_best_only=True)

history = model.fit(train_set, validation_data=valid_set, epochs=10, callbacks=[early_stop, model_ckpt])

Epoch 1/10
[1m704/704[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 611ms/step - accuracy: 0.6445 - loss: 0.6194

UnicodeEncodeError: 'charmap' codec can't encode character '\x96' in position 3182: character maps to <undefined>

In [5]:
model.evaluate(test_set)

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m87s[0m 112ms/step - accuracy: 0.4976 - loss: 0.6933


[0.6934395432472229, 0.49939998984336853]

The first layer is the ***TextVectorization*** layer we just prepared, followed by an ***Embedding*** layer that will convert word IDs into embeddings. The embedding matrix needs to have one row per token in the vocabulary (***vocab_size***) and one column per embedding dimension (this example uses 128 dimensions, but this is a hyperparameter you could tune). Next we use a ***GRU*** layer and a ***Dense*** layer with a single neuron and the sigmoid activation function, since this is a binary classification task: the model’s output will be the estimated probability that the review expresses a positive sentiment regarding the movie. We then compile the model, and we fit it on the dataset we prepared earlier for a couple of epochs (or you can train for longer to get better results).

Sadly, if you run this code, you will generally find that the model fails to learn anything at all: the accuracy remains close to 50%, no better than random chance. Why is that? The reviews have different lengths, so when the ***TextVectorization*** layer converts them to sequences of token IDs, it pads the shorter sequences using the padding token (with ID 0) to make them as long as the longest sequence in the batch. As a result, most sequences end with many padding tokens—often dozens or even hundreds of them. Even though we’re using a ***GRU*** layer, which is much better than a ***SimpleRNN*** layer, its short-term memory is still not great, so when it goes through manypadding tokens, it ends up forgetting what the review was about! One solution is to feed the model with batches of equal-length sentences (which also speeds up training). Another solution is to make the RNN ignore the padding tokens. This can be done using masking.

### **Masking**
Making the model ignore padding tokens is trivial using Keras: simply add ***mask_zero=True*** when creating the ***Embedding*** layer. This means that padding tokens (whose ID is 0) will be ignored by all downstream layers. That’s all! If you retrain the previous model for a few epochs, you will find that the validation accuracy quickly reaches over 80%.

The way this works is that the ***Embedding*** layer creates a *mask tensor* equal to ***tf.math.not_equal(inputs, 0)***: it is a Boolean tensor with the same shape as the inputs, and it is equal to ***False*** anywhere the token IDs are 0, or ***True*** otherwise. This mask tensor is then automatically propagated by the model to the next layer. If that layer’s ***call()*** method has a ***mask*** argument, then it automatically receives the mask. This allows the layer to ignore the appropriate time steps. Each layer may handle the mask differently, but in general they simply ignore masked time steps (i.e., time steps for which the mask is ***False***). For example, when a recurrent layer encounters a masked time step, it simply copies the output from the previous time step.

Next, if the layer’s ***supports_masking*** attribute is ***True***, then the mask is automatically propagated to the next layer. It keeps propagating this way for as long as the layers have ***supports_masking=True***. As an example, a recurrent layer’s ***supports_ masking*** attribute is True when ***return_sequences=True***, but it’s ***False*** when ***return_sequences=False*** since there’s no need for a mask anymore in this case. So if you have a model with several recurrent layers with ***return_sequences=True***, followed by a recurrent layer with ***return_sequences=False***, then the mask will automatically propagate up to the last recurrent layer: that layer will use the mask to ignore masked steps, but it will not propagate the mask any further. Similarly, if you set ***mask_zero=True*** when creating the ***Embedding*** layer in the sentiment analysis model we just built, then the ***GRU*** layer will receive and use the mask automatically, but it will not propagate it any further, since ***return_sequences*** is not set to ***True***.

> #### **TIP**
> Some layers need to update the mask before propagating it to the next layer: they do so by implementing the ***compute_mask()*** method, which takes two arguments: the inputs and the previous mask. It then computes the updated mask and returns it. The default implementation of ***compute_mask()*** just returns the previous mask unchanged.

Many Keras layers support masking: ***SimpleRNN, GRU, LSTM, Bidirectional, Dense, TimeDistributed, Add***, and a few others (all in the ***tf.keras.layers package***). However, convolutional layers (including ***Conv1D***) do not support masking—it’s not obvious how they would do so anyway.

If the mask propagates all the way to the output, then it gets applied to the losses as well, so the masked time steps will not contribute to the loss (their loss will be 0). This assumes that the model outputs sequences, which is not the case in our sentiment analysis model.

> #### **WARNING**
>  The ***LSTM*** and ***GRU*** layers have an optimized implementation for GPUs, based on Nvidia’s cuDNN library. However, this implementation only supports masking if all the padding tokens are at the end of the sequences. It also requires you to use the default values for several hyperparameters: ***activation, recurrent_activation, recurrent_dropout, unroll, use_bias***, and ***reset_after***. If that’s not the case, then these layers will fall back to the (much slower) default GPU implementation.

If you want to implement your own custom layer with masking support, you should add a ***mask*** argument to the ***call()*** method, and obviously make the method use the mask. Additionally, if the mask must be propagated to the next layers, then you should set ***self.supports_masking=True*** in the constructor. If the mask must be updated before it is propagated, then you must implement the ***compute_mask()*** method.

If your model does not start with an ***Embedding*** layer, you may use the ***tf.keras.layers.Masking*** layer instead: by default, it sets the mask to ***tf.math.reduce_any(tf.math.not_equal(X, 0), axis=-1)***, meaning that time steps where the last dimension is full of zeros will be masked out in subsequent layers.

Using masking layers and automatic mask propagation works best for simple models. It will not always work for more complex models, such as when you need to mix Conv1D layers with recurrent layers. In such cases, you will need to explicitly compute the mask and pass it to the appropriate layers, using either the functional API or the subclassing API. For example, the following model is equivalent to the previous model, except it is built using the functional API and handles masking manually. It also adds a bit of dropout since the previous model was overfitting slightly:

In [11]:
inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
token_ids = text_vec_layer(inputs)
mask = tf.keras.layers.Lambda(
    lambda x: tf.math.not_equal(x, 0), 
) (token_ids)
x = tf.keras.layers.Embedding(vocab_size, embed_size)(token_ids)
x = tf.keras.layers.GRU(128, dropout=0.2)(x, mask=mask)
outputs = tf.keras.layers.Dense(1, activation="sigmoid") (x)
model = tf.keras.Model(inputs=[inputs], outputs=[outputs])







One last approach to masking is to feed the model with ragged tensors.  In practice, all you need to do is to set ***ragged=True*** when creating the ***TextVectorization*** layer, so that the input sequences are represented as ragged tensors:

In [12]:
text_vec_layer_ragged = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size, ragged=True
)
text_vec_layer_ragged.adapt(train_set.map(lambda reviews, labels: reviews))
text_vec_layer_ragged(["Great movie!", "This is DiCaprio's best role."])

<tf.RaggedTensor [[86, 18], [11, 7, 1, 116, 217]]>

Compare this ragged tensor representation with the regular tensor representation, which uses padding tokens:

In [13]:
text_vec_layer(["Great movie!", "This is DiCaprio's best role."])

<tf.Tensor: shape=(2, 5), dtype=int64, numpy=
array([[ 86,  18,   0,   0,   0],
       [ 11,   7,   1, 116, 217]], dtype=int64)>

Finally a full picture of our model.

In [17]:
import sys
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")

In [38]:
embed_size = 128
tf.random.set_seed(50)

inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
token_ids = text_vec_layer_ragged(inputs)
x = tf.keras.layers.Embedding(vocab_size, embed_size)(token_ids)
x = tf.keras.layers.GRU(128, dropout=0.2)(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid") (x)
model = tf.keras.Model(inputs=[inputs], outputs=[outputs])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', 
    patience=2, 
    restore_best_weights=True)

model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "sentiment_text_masking.keras", monitor="val_accuracy", save_best_only=True)

model.compile(loss="binary_crossentropy", optimizer="nadam", 
              metrics=['accuracy'])

history = model.fit(train_set, validation_data=valid_set, epochs=10, 
                    callbacks=[early_stop, model_ckpt], verbose=2)

Epoch 1/10


Expected: ['keras_tensor_103']
Received: inputs=Tensor(shape=(None,))


ValueError: Exception encountered when calling GRU.call().

[1mCannot index into an inner ragged dimension.[0m

Arguments received by GRU.call():
  • sequences=tf.Tensor(shape=(None, None, 128), dtype=float32)
  • initial_state=None
  • mask=None
  • training=True

Keras’s recurrent layers have built-in support for ragged tensors, so there’s nothing else you need to do: just use this ***TextVectorization*** layer in your model. There’s no need to pass ***mask_zero=True*** or handle masks explicitly—it’s all implemented for you. That’s convenient! However, as of early 2022, the support for ragged tensors in Keras is still fairly recent, so there are a few rough edges. For example, it is currently not possible to use ragged tensors as targets when running on the GPU (but this may be resolved by the time you read these lines).

Whichever masking approach you prefer, after training this model for a few epochs, it will become quite good at judging whether a review is positive or not. If you use the ***tf.keras.callbacks.TensorBoard()*** callback, you can visualize the embeddings in TensorBoard as they are being learned: it is fascinating to see words like “awesome” and “amazing” gradually cluster on one side of the embedding space, while words like “awful” and “terrible” cluster on the other side. Some words are not as positive as you might expect (at least with this model), such as the word “good”, presumably because many negative reviews contain the phrase “not good”.

### **Reusing Pretrained Embeddings and Language Models**
It’s impressive that the model is able to learn useful word embeddings based on just 25,000 movie reviews. Imagine how good the embeddings would be if we had billions of reviews to train on! Unfortunately, we don’t, but perhaps we can reuse word embeddings trained on some other (very) large text corpus (e.g., Amazon reviews, available on TensorFlow Datasets), even if it is not composed of movie reviews? After all, the word “amazing” generally has the same meaning whether you use it to talk about movies or anything else. Moreover, perhaps embeddings would be useful for sentiment analysis even if they were trained on another task: since words like “awesome” and “amazing” have a similar meaning, they will likely cluster in the embedding space even for tasks such as predicting the next word in a sentence. If all positive words and all negative words form clusters, then this will be helpful for sentiment analysis. So, instead of training word embeddings, we could just download and use pretrained embeddings, such as Google’s [Word2vec embeddings](https://homl.info/word2vec), Stanford’s [GloVe embeddings](https://homl.info/glove), or Facebook’s [FastText embeddings](https://fasttext.cc).

Using pretrained word embeddings was popular for several years, but this approach has its limits. In particular, a word has a single representation, no matter the context. For example, the word “right” is encoded the same way in “left and right” and “right and wrong”, even though it means two very different things. To address this limitation, a [2018 paper](https://homl.info/elmo) by Matthew Peters introduced *Embeddings from Language Models* (**ELMo**): these are contextualized word embeddings learned from the internal states of a deep bidirectional language model. Instead of just using pretrained embeddings in your model, you reuse part of a pretrained language model. 

At roughly the same time, the [Universal Language Model Fine-Tuning (ULMFiT) paper](https://homl.info/ulmfit) by Jeremy Howard and Sebastian Ruder demonstrated the effectiveness of unsupervised pretraining for NLP tasks: the authors trained an LSTM language model on a huge text corpus using selfsupervised learning (i.e., generating the labels automatically from the data), then they fine-tuned it on various tasks. Their model outperformed the state of the art on six text classification tasks by a large margin (reducing the error rate by 18–24% in most cases). Moreover, the authors showed a pretrained model fine-tuned on just 100 labeled examples could achieve the same performance as one trained from scratch on 10,000 examples. Before the ULMFiT paper, using pretrained models was only the norm in computer vision; in the context of NLP, pretraining was limited to word embeddings. This paper marked the beginning of a new era in NLP: today, reusing pretrained language models is the norm.

For example, let’s build a classifier based on the Universal Sentence Encoder, a model architecture introduced in a 2018 paper by a team of Google researchers. This model is based on the transformer architecture, which we will look at later in this chapter. Conveniently, the model is available on TensorFlow Hub:

In [3]:
pip install tensorflow-hub

Collecting tensorflow-hub
  Using cached tensorflow_hub-0.16.1-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting tf-keras>=2.14.1 (from tensorflow-hub)
  Using cached tf_keras-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorflow<2.20,>=2.19 (from tf-keras>=2.14.1->tensorflow-hub)
  Using cached tensorflow-2.19.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorboard~=2.19.0 (from tensorflow<2.20,>=2.19->tf-keras>=2.14.1->tensorflow-hub)
  Using cached tensorboard-2.19.0-py3-none-any.whl.metadata (1.8 kB)
Collecting keras>=3.5.0 (from tensorflow<2.20,>=2.19->tf-keras>=2.14.1->tensorflow-hub)
  Using cached keras-3.10.0-py3-none-any.whl.metadata (6.0 kB)
Collecting numpy>=1.12.0 (from tensorflow-hub)
  Using cached numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting rich (from keras>=3.5.0->tensorflow<2.20,>=2.19->tf-keras>=2.14.1->tensorflow-hub)
  Using cached rich-14.0.0-py3-none-any

In [6]:
import tensorflow as tf
import tensorflow_hub as hub

os.environ['TFHUB_CACHE_DIR'] = 'my_tfhub_cache'


raw_use_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    trainable=True,
    dtype=tf.string,
    input_shape=[]     # <<< this line is the key
)

class UseWrapper(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        # Embed the pre-built Hublayer
        self._use = raw_use_layer
        
    def call(self, inputs):
        return self._use(inputs)

inputs = tf.keras.Input(shape=(), dtype=tf.string)
embeddings = UseWrapper()(inputs)
x = tf.keras.layers.Dense(64, activation="relu") (embeddings)
outputs = tf.keras.layers.Dense(1, activation="sigmoid") (x)

model = tf.keras.Model(inputs, outputs)

model.compile(loss='binary_crossentropy', optimizer='nadam', 
              metrics=['accuracy'])
model.fit(train_set, validation_data=valid_set, epochs=10, 
          callbacks=[early_stop, model_ckpt], verbose=2)

NameError: name 'train_set' is not defined

In [25]:
import os
import tensorflow as tf
import tensorflow_hub as hub

# 1) Point TF‑Hub cache (optional)
os.environ['TFHUB_CACHE_DIR'] = 'my_tfhub_cache'

# 2) Define a custom Keras Layer for USE
class USELayer(tf.keras.layers.Layer):
    def __init__(self, trainable=True, **kwargs):
        super().__init__(**kwargs)
        # Load once, outside of any tf.function tracing
        self.embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
        self.trainable = trainable

    def call(self, inputs):
        # inputs is a 1-D tf.Tensor of dtype string
        return self.embed(inputs)

    def get_config(self):
        base = super().get_config()
        return {**base, "trainable": self.trainable}

# 3) Build the Functional model
inputs = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text_input")
x = USELayer(name="use")(inputs)                 # now a true Layer
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)

# 4) Compile & fit with ASCII logs
model.compile(
    loss='binary_crossentropy',
    optimizer='nadam',
    metrics=['accuracy']
)
model.fit(
    train_set,
    validation_data=valid_set,
    epochs=10,
    callbacks=[early_stop, model_ckpt],
    verbose=2    # simple one‑line/per‑epoch output avoids Unicode issues
)









Epoch 1/10
704/704 - 27s - 38ms/step - accuracy: 0.8401 - loss: 0.3797 - val_accuracy: 0.8544 - val_loss: 0.3277
Epoch 2/10
704/704 - 24s - 34ms/step - accuracy: 0.8599 - loss: 0.3252 - val_accuracy: 0.8508 - val_loss: 0.3240
Epoch 3/10
704/704 - 24s - 34ms/step - accuracy: 0.8634 - loss: 0.3202 - val_accuracy: 0.8520 - val_loss: 0.3231
Epoch 4/10
704/704 - 24s - 35ms/step - accuracy: 0.8661 - loss: 0.3157 - val_accuracy: 0.8492 - val_loss: 0.3304
Epoch 5/10
704/704 - 24s - 34ms/step - accuracy: 0.8684 - loss: 0.3120 - val_accuracy: 0.8492 - val_loss: 0.3339


<keras.src.callbacks.history.History at 0x259851a42c0>

> #### **TIP**
> This model is quite large—close to 1 GB in size—so it may take a while to download. By default, TensorFlow Hub modules are saved to a temporary directory, and they get downloaded again and again every time you run your program. To avoid that, you must set the ***TFHUB_CACHE_DIR*** environment variable to a directory of your choice: the modules will then be saved there, and only downloaded once.

Note that the last part of the TensorFlow Hub module URL specifies that we want version 4 of the model. This versioning ensures that if a new module version is released on TF Hub, it will not break our model. Conveniently, if you just enter this URL in a web browser, you will get the documentation for this module.

Also note that we set ***trainable=True*** when creating the ***hub.KerasLayer***. This way, the pretrained Universal Sentence Encoder is fine-tuned during training: some of its weights are tweaked via backprop. Not all TensorFlow Hub modules are fine-tunable, so make sure to check the documentation for each pretrained module you’re interested in.

After training, this model should reach a validation accuracy of over 90%. That’s actually really good: if you try to perform the task yourself, you will probably do only marginally better since many reviews contain both positive and negative comments. Classifying these ambiguous reviews is like flipping a coin.

So far we have looked at text generation using a char-RNN, and sentiment analysis with word-level RNN models (based on trainable embeddings) and using a powerful pretrained language model from TensorFlow Hub. In the next section, we will explore another important NLP task: *neural machine translation* (**NMT**).

## **An Encoder-Decoder Network for Neural Machine Translation**

Let’s begin with a simple [NMT model](https://homl.info/103) that will translate English sentences to Spanish

In short, the architecture is as follows: English sentences are fed as inputs to the encoder, and thedecoder outputs the Spanish translations. Note that the Spanish translations are also used as inputsto the decoder during training, but shifted back by one step. In other words, during training the decoder is given as input the word that it should have output at the previous step, regardless of what it actually output. This is called teacher forcing—a technique that significantly speeds up training and improves the model’s performance. For the very first word, the decoder is given the start-ofsequence (SOS) token, and the decoder is expected to end the sentence with an end-of-sequence (EOS) token.

Each word is initially represented by its ID (e.g., 854 for the word “soccer”). Next, an Embedding layer returns the word embedding. These word embeddings are then fed to the encoder and the decoder.

At each step, the decoder outputs a score for each word in the output vocabulary (i.e., Spanish), then the softmax activation function turns these scores into probabilities. For example, at the first step the word “Me” may have a probability of 7%, “Yo” may have a probability of 1%, and so on. The word with the highest probability is output. This is very much like a regular classification task, and indeed you can train the model using the "sparse_categorical_crossentropy" loss, much like we did in the char-RNN model.

Note that at inference time (after training), you will not have the target sentence to feed to the decoder. Instead, you need to feed it the word that it has just output at the previous step, Let’s build and train this model! First, we need to download a dataset of English/Spanish sentence pairs:

In [2]:
import os

os.environ["TF_XLA_FLAGS"] = "--tf_xla_enable_xla_devices=false"

import tensorflow as tf

tf.config.optimizer.set_jit(False)


from zipfile import ZipFile
from pathlib import Path

# 1) Download
url = "https://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
zip_path = Path(tf.keras.utils.get_file("spa-eng.zip", origin=url))
print("ZIP file is here:", zip_path)

# 2) Unzip manually
extract_dir = zip_path.parent / "spa-eng"
extract_dir.mkdir(exist_ok=True)    # make sure the folder exists
with ZipFile(zip_path, 'r') as zf:
    zf.extractall(extract_dir)
print("Extracted to:", extract_dir)

# 3) Find and read spa.txt (wherever it landed)
candidates = list(extract_dir.rglob("spa.txt"))
if not candidates:
    raise FileNotFoundError(f"spa.txt not found under {extract_dir}")
txt_path = candidates[0]
print("Reading from:", txt_path)
text = txt_path.read_text(encoding="utf-8")

# quick sanity-check
print(text[:200])


ZIP file is here: /home/jaxon/.keras/datasets/spa-eng.zip
Extracted to: /home/jaxon/.keras/datasets/spa-eng
Reading from: /home/jaxon/.keras/datasets/spa-eng/spa-eng/spa.txt
Go.	Ve.
Go.	Vete.
Go.	Vaya.
Go.	Váyase.
Hi.	Hola.
Run!	¡Corre!
Run.	Corred.
Who?	¿Quién?
Fire!	¡Fuego!
Fire!	¡Incendio!
Fire!	¡Disparad!
Help!	¡Ayuda!
Help!	¡Socorro! ¡Auxilio!
Help!	¡Auxilio!
Jump!	¡


In [3]:
import numpy as np

text = text.replace("¡", "").replace("¿", "")
pairs = [line.split("\t") for line in text.splitlines()]
np.random.shuffle(pairs)
sentences_en, sentences_es = zip(*pairs) # separates the pairs into 2 lists

Let’s take a look at the first three sentence pairs:

In [4]:
for i in range(3):
    print(sentences_en[i], "=>", sentences_es[i])
    

We had no water to drink. => No teníamos agua que beber.
I like listening to music. => Me gusta escuchar música.
She was kissed by him. => Él la besó.


Next, let’s create two TextVectorization layers—one per language—and adapt them to the text:

In [5]:
vocab_size = 1000
max_length = 50
text_vec_layer_en = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length
)
text_vec_layer_es = tf.keras.layers.TextVectorization(
    vocab_size, output_sequence_length=max_length
)
text_vec_layer_en.adapt(sentences_en)
text_vec_layer_es.adapt([f"start-fseq {s} endofseq" for s in sentences_es])

I0000 00:00:1753366097.992260    5422 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1730 MB memory:  -> device: 0, name: NVIDIA GeForce MX150, pci bus id: 0000:01:00.0, compute capability: 6.1


There are a few things to note here:
- We limit the vocabulary size to 1,000, which is quite small. That’s because the training set is not very large, and because using a small value will speed up training. State-of-the-art translation models typically use a much larger vocabulary (e.g., 30,000), a much larger training set (gigabytes), and a much larger model (hundreds or even thousands of megabytes). For example, check out the Opus-MT models by the University of Helsinki, or the M2M-100 model by Facebook.
  
- Since all sentences in the dataset have a maximum of 50 words, we set output_sequence_length to 50: this way the input sequences will automatically be padded with zeros until they are all 50 tokens long. If there was any sentence longer than 50tokens in the training set, it would be cropped to 50 tokens.

- For the Spanish text, we add “startofseq” and “endofseq” to each sentence when adapting the TextVectorization layer: we will use these words as SOS and EOS tokens. You could use any other words, as long as they are not actual Spanish words. Let’s inspect the first 10 tokens in both vocabularies. They start with the padding token, the unknown token, the SOS and EOS tokens (only in the Spanish vocabulary), then the actual words, sorted by decreasing frequency:

In [6]:
print(text_vec_layer_en.get_vocabulary()[:10], 
text_vec_layer_es.get_vocabulary()[:10])

['', '[UNK]', np.str_('the'), np.str_('i'), np.str_('to'), np.str_('you'), np.str_('tom'), np.str_('a'), np.str_('is'), np.str_('he')] ['', '[UNK]', np.str_('startfseq'), np.str_('endofseq'), np.str_('de'), np.str_('que'), np.str_('a'), np.str_('no'), np.str_('tom'), np.str_('la')]


Next, let’s create the training set and the validation set (you could also create a test set if you needed it). We will use the first 100,000 sentence pairs for training, and the rest for validation. The decoder’s inputs are the Spanish sentences plus an SOS token prefix. The targets are the Spanish sentences plus an EOS suffix:

In [7]:
x_train = tf.constant(sentences_en[:100_000])
x_valid = tf.constant(sentences_en[100_000:])
x_train_dec = tf.constant([f"startofseq {s}" for s in sentences_es[:100_000]])
x_valid_dec = tf.constant([f"{s} endofseq" for s in sentences_es[100_000:]])
y_train = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[:100_000]])
y_valid = text_vec_layer_es([f"{s} endofseq" for s in sentences_es[100_000:]])

2025-07-24 17:08:23.455413: W external/local_xla/xla/service/gpu/llvm_gpu_backend/default/nvptx_libdevice_path.cc:40] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  ipykernel_launcher.runfiles/cuda_nvcc
  ipykern/cuda_nvcc
  
  /usr/local/cuda
  /opt/cuda
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/platform/../../cuda
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/platform/../../../../../..
  /home/jaxon/anaconda3/envs/ai_env/lib/python3.10/site-packages/tensorflow/python/platform/../../../.

OK, we’re now ready to build our translation model. We will use the functional API for that since the model is not sequential. It requires two text inputs—one for the encoder and one for the decoder so let’s start with that:

In [8]:
encoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)
decoder_inputs = tf.keras.layers.Input(shape=[], dtype=tf.string)

Next, we need to encode these sentences using the ***TextVectorization*** layers we prepared earlier, followed by an ***Embedding*** layer for each language, with ***mask_zero=True*** to ensure masking is handled automatically. The embedding size is a hyperparameter you can tune, as always:

In [9]:
embed_size = 128
encoder_input_ids = text_vec_layer_en(encoder_inputs)
decoder_input_ids = text_vec_layer_es(decoder_inputs)
encoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True)
decoder_embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_size, mask_zero=True)
encoder_embeddings = encoder_embedding_layer(encoder_input_ids)
decoder_embeddings = decoder_embedding_layer(decoder_input_ids)

> **TIP**
> When the languages share many words, you may get better performance using the same embedding layer for both the encoder and the decoder.

Now let's create the encoder and pass it the embedded inputs:

In [10]:
print(tf.__version__)


2.19.0


In [11]:
import tensorflow as tf
from tensorflow.keras.layers import RNN, LSTMCell

# assume encoder_embeddings is already defined:
# batch_size, timesteps, features = 32, 10, 64
# encoder_embeddings = tf.random.uniform((batch_size, timesteps, features))

with tf.device("/CPU:0"):
    # build a generic (non‑cuDNN) RNN wrapping an LSTMCell
    cpu_encoder = RNN(
        LSTMCell(512),
        return_state=True,
        return_sequences=False
    )
    # call it exactly like your old LSTM
    encoder_outputs, state_h, state_c = cpu_encoder(encoder_embeddings)


In [12]:
from tensorflow.keras.layers import RNN, LSTMCell, TimeDistributed, Dense

with tf.device("/CPU:0"):
    cpu_decoder = RNN(
        LSTMCell(512), 
        return_state=True, 
        return_sequences=True
    )
    decoder_seq, dec_h, dec_c = cpu_decoder(decoder_embeddings, initial_state=[state_h, state_c])    

Next, we can pass the decoder’s outputs through a Dense layer with the softmax activation function to get the word probabilities for each step:

In [13]:
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
y_proba = output_layer(decoder_seq)

> **OPTIMIZING THE OUTPUT LAYER**
> When the output vocabulary is large, outputting a probability for each and every possible wordcan be quite slow. If the target vocabulary contained, say, 50,000 Spanish words instead of1,000, then the decoder would output 50,000-dimensional vectors, and computing the softmaxfunction over such a large vector would be very computationally intensive. To avoid this, onesolution is to look only at the logits output by the model for the correct word and for a randomsample of incorrect words, then compute an approximation of the loss based only on theselogits. This sampled softmax technique was introduced in 2015 by Sébastien Jean et al. InTensorFlow you can use the tf.nn.sampled_softmax_loss() function for this duringtraining and use the normal softmax function at inference time (sampled softmax cannot beused at inference time because it requires knowing the target).Another thing you can do to speed up training—which is compatible with sampled softmax—is to tie the weights of the output layer to the transpose of the decoder’s embedding matrix (you will see how to tie weights in Chapter 17). This significantly reduces the number of model parameters, which speeds up training and may sometimes improve the model’s accuracy as well, especially if you don’t have a lot of training data. The embedding matrix is equivalent to one-hot encoding followed by a linear layer with no bias term and no activation function that maps the one-hot vectors to the embedding space. The output layer does the reverse. So, if the model can find an embedding matrix whose transpose is close to its inverse (such a matrix is called an orthogonal matrix), then there’s no need to learn a separate set of weights for the output layer.

In [17]:
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs], 
                       outputs=[y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", 
              metrics=["accuracy"], jit_compile=False)
model.fit((x_train, x_train_dec), y_train, epochs=10, 
          validation_data=((x_valid, x_valid_dec), y_valid))

Epoch 1/10


[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4317s[0m 1s/step - accuracy: 0.0509 - loss: 3.5252 - val_accuracy: 0.0330 - val_loss: 5.0494
Epoch 2/10
[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2180s[0m 698ms/step - accuracy: 0.0789 - loss: 1.9470 - val_accuracy: 0.0324 - val_loss: 5.0158
Epoch 3/10
[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2051s[0m 656ms/step - accuracy: 0.0911 - loss: 1.4606 - val_accuracy: 0.0287 - val_loss: 5.2889
Epoch 4/10
[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2238s[0m 716ms/step - accuracy: 0.0985 - loss: 1.1988 - val_accuracy: 0.0296 - val_loss: 5.5027
Epoch 5/10
[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2317s[0m 741ms/step - accuracy: 0.1037 - loss: 1.0211 - val_accuracy: 0.0284 - val_loss: 5.8300
Epoch 6/10
[1m3125/3125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2320s[0m 742ms/step - accuracy: 0.1078 - loss: 0.8804 - val_accuracy: 0.0280 - val_loss: 6.1498


<keras.src.callbacks.history.History at 0x7fc97288f400>

In [20]:
model.save('eng_spanish_translator.keras')

In [15]:
def translate(sentence_en):
    translation = ""
    for word_idx in range(max_length):
        x = np.array([sentence_en]) # encoder input
        x_dec = np.array(["startofseq " + translation]) # decoder input
        y_proba = model.predict((x, x_dec))[0, word_idx] # last token's probas
        predicted_word_id = np.argmax(y_proba)
        predicted_word = text_vec_layer_es.get_vocabulary()[predicted_word_id]
        if predicted_word == "endofseq":
            break
        translation += " " + predicted_word
        return translation.strip()

In [16]:
import tensorflow as tf
import numpy as np

# Load or define the model (adjust path)
model = tf.keras.models.load_model("eng_spanish_translator.keras")

# Define parameters
max_length = 50  # Must match training
vocab_en = ["", "[UNK]", "i", "like", "soccer", "to", "play"]  # Example; replace with real vocab
vocab_es = ["", "[UNK]", "inicio", "fin", "me", "gusta", "fútbol"]  # Example

# Initialize TextVectorization layers
text_vec_layer_en = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    output_mode="int",
    output_sequence_length=max_length
)
text_vec_layer_es = tf.keras.layers.TextVectorization(
    max_tokens=10000,
    output_mode="int",
    output_sequence_length=max_length
)
text_vec_layer_en.set_vocabulary(vocab_en)
text_vec_layer_es.set_vocabulary(vocab_es)

def translate(sentence_en):
    encoder_input = text_vec_layer_en([sentence_en])  # Shape: (1, max_length)
    start_token_id = text_vec_layer_es(["startofseq"])[0][0].numpy()
    translation_ids = [start_token_id]
    x_dec = tf.keras.preprocessing.sequence.pad_sequences(
        [translation_ids], maxlen=max_length, padding="post"
    )

    for _ in range(max_length):
        predictions = model.predict([encoder_input, x_dec])
        predicted_id = tf.argmax(predictions[0, len(translation_ids) - 1], axis=-1).numpy()
        if predicted_id == text_vec_layer_es.get_vocabulary().index("endofseq", 1):
            break
        translation_ids.append(predicted_id)
        x_dec = tf.keras.preprocessing.sequence.pad_sequences(
            [translation_ids], maxlen=max_length, padding="post"
        )

    translation = " ".join(
        text_vec_layer_es.get_vocabulary()[id] for id in translation_ids[1:]
    )
    return translation.replace("endofseq", "").strip()

# Test
try:
    result = translate("I like soccer")
    print("Translation:", result)
except Exception as e:
    print("Error:", e)

2025-07-24 17:09:09.302736: W tensorflow/core/framework/op_kernel.cc:1844] UNKNOWN: JIT compilation failed.
2025-07-24 17:09:09.302772: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: UNKNOWN: JIT compilation failed.


UnknownError: {{function_node __wrapped__Sign_device_/job:localhost/replica:0/task:0/device:GPU:0}} JIT compilation failed. [Op:Sign] name: 

The function simply keeps predicting one word at a time, gradually completing the translation, and it stops once it reaches the EOS token. Let's give it a try!

In [58]:
message = "I like soccer"
translate(tf.convert_to_tensor(message))

ValueError: Invalid dtype: object

Hurray, it works! Well, at least it does with very short sentences. If you try playing with this model for a while, you will find that it’s not bilingual yet, and in particular it really struggles with longer sentences. For example:

In [22]:
translate("I like soccer and also going to the beach")

ValueError: Invalid dtype: str7904

The translation says “I like soccer and sometimes even the bus”. So how can you improve it? One way is to increase the training set size and add more LSTM layers in both the encoder and thedecoder. But this will only get you so far, so let’s look at more sophisticated techniques, starting with bidirectional recurrent layers.

### **Bidirectional RNNs**
At each time step, a regular recurrent layer only looks at past and present inputs before generating its output. In other words, it is causal, meaning it cannot look into the future. This type of RNN makes sense when forecasting time series, or in the decoder of a sequence-to-sequence (seq2seq) model. But for tasks like text classification, or in the encoder of a seq2seq model, it is often preferable to look ahead at the next words before encoding a given word.

For example, consider the phrases “the right arm”, “the right person”, and “the right to criticize”: to properly encode the word “right”, you need to look ahead. One solution is to run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, then combine their outputs at each time step, typically by concatenating them. This is what a ***bidirectional recurrent layer*** does.

To implement a bidirectional recurrent layer in Keras, just wrap a recurrent layer in a ***tf.keras.layers.Bidirectional*** layer. For example, the following Bidirectional layer could be used as the encoder in our translation model:

In [23]:
encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_state=True)
)

> #### **NOTE**
> The ***Bidirectional*** layer will create a clone of the GRU layer (but in the reverse direction), and it will run both and concatenate their outputs. So although the ***GRU*** layer has 10 units, the ***Bidirectional*** layer will output 20 values per time step.

There’s just one problem. This layer will now return four states instead of two: the final short-term and long-term states of the forward ***LSTM*** layer, and the final short-term and long-term states of the backward ***LSTM*** layer. We cannot use this quadruple state directly as the initial state of the decoder’s ***LSTM*** layer, since it expects just two states (short-term and long-term). We cannot make the decoder bidirectional, since it must remain causal: otherwise it would cheat during training and it would not work. Instead, we can concatenate the two short-term states, and also concatenate the two long-term states:

In [24]:
encoder_outputs, *encoder_state = cpu_encoder(encoder_embeddings)
encoder_state = [tf.concat(encoder_state[::2], axis=-1), # short-term (0 & 2)
                 tf.concat(encoder_state[1::2], axis=-1)] # long-term (1 & 3)

ValueError: A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.ops`). You are likely doing something like:

```
x = Input(...)
...
tf_fn(x)  # Invalid.
```

What you should do instead is wrap `tf_fn` in a layer:

```
class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)
```


Now let’s look at another popular technique that can greatly improve the performance of a translation model at inference time: beam search.

### **Beam Search**

Suppose you have trained an encoder–decoder model, and you use it to translate the sentence “I like soccer” to Spanish. You are hoping that it will output the proper translation “me gusta el fútbol”, but unfortunately it outputs “me gustan los jugadores”, which means “I like the players”. Looking at the training set, you notice many sentences such as “I like cars”, which translates to “me gustan los autos”, so it wasn’t absurd for the model to output “me gustan los” after seeing “I like”. Unfortunately, in this case it was a mistake since “soccer” is singular. The model could not go back and fix it, so it tried to complete the sentence as best it could, in this case using the word “jugadores”. How can we give the model a chance to go back and fix mistakes it made earlier? One of the most common solutions is beam search: it keeps track of a short list of the k most promising sentences (say, the top three), and at each decoder step it tries to extend them by one word, keeping only the k most likely sentences. The parameter ***k*** is called the ***beam width.***

For example, suppose you use the model to translate the sentence “I like soccer” using beam search with a beam width of 3 (see Figure 16-6). At the first decoder step, the model will output an estimated probability for each possible first word in the translated sentence. Suppose the top three words are “me” (75% estimated probability), “a” (3%), and “como” (1%). That’s our short list so far. Next, we use the model to find the next word for each sentence. For the first sentence (“me”), perhaps the model outputs a probability of 36% for the word “gustan”, 32% for the word “gusta”, 16% for the word “encanta”, and so on. Note that these are actually conditional probabilities, given that the sentence starts with “me”. For the second sentence (“a”), the model might output a conditional probability of 50% for the word “mi”, and so on. Assuming the vocabulary has 1,000 words, we will end up with 1,000 probabilities per sentence. 

Next, we compute the probabilities of each of the 3,000 two-word sentences we considered (3 × 1,000). We do this by multiplying the estimated conditional probability of each word by the estimated probability of the sentence it completes. For example, the estimated probability of the sentence “me” was 75%, while the estimated conditional probability of the word “gustan” (given that the first word is “me”) was 36%, so the estimated probability of the sentence “me gustan” is 75% × 36% = 27%. After computing the probabilities of all 3,000 two-word sentences, we keep only the top 3. In this example they all start with the word “me”: “me gustan” (27%), “me gusta” (24%), and “me encanta” (12%). Right now, the sentence “me gustan” is winning, but “me gusta” has not been eliminated.

Then we repeat the same process: we use the model to predict the next word in each of these three sentences, and we compute the probabilities of all 3,000 three-word sentences we considered. Perhaps the top three are now “me gustan los” (10%), “me gusta el” (8%), and “me gusta mucho” (2%). At the next step we may get “me gusta el fútbol” (6%), “me gusta mucho el” (1%), and “me gusta el deporte” (0.2%). Notice that “me gustan” was eliminated, and the correct translation is now ahead. We boosted our encoder–decoder model’s performance without any extra training, simply by using it more wisely.

> #### **TIP**
> The TensorFlow Addons library includes a full seq2seq API that lets you build encoder–decoder models with attention, including beam search, and more. However, its documentation is currently very limited. Implementing beam search is a good exercise, so give it a try! Check out this chapter’s notebook for a possible solution.

With all this, you can get reasonably good translations for fairly short sentences. Unfortunately, this model will be really bad at translating long sentences. Once again, the problem comes from the limited short-term memory of RNNs. ***Attention mechanisms*** are the game-changing innovation that addressed this problem.

## **Attention Mechanisms**
Consider the path from the word “soccer” to its translation “fútbol” back in Figure 16-3: it is quite long! This means that a representation of this word (along with all the other words) needs to be carried over many steps before it is actually used. Can’t we make this path shorter?

This was the core idea in a landmark 2014 paper by Dzmitry Bahdanau et al., where the authors introduced a technique that allowed the decoder to focus on the appropriate words (as encoded by the encoder) at each time step. For example, at the time step where the decoder needs to output the word “fútbol”, it will focus its attention on the word “soccer”. This means that the path from an input word to its translation is now much shorter, so the short-term memory limitations of RNNs have much less impact. Attention mechanisms revolutionized neural machine translation (and deep learning in general), allowing a significant improvement in the state of the art, especially for long sentences (e.g., over 30 words).

> #### **NOTE**
> The most common metric used in NMT is the ***bilingual evaluation understudy*** (**BLEU**) score, which compares each translation produced by the model with several good translations produced by humans: it counts the number of n-grams (sequences of n words) that appear in any of the target translations and adjusts the score to take into account the frequency of the produced n-grams in the target translations.

One of the common attention mechanisms are:
- ***Bahdanau attention*** (named after the 2014 paper’s first author). Since it concatenates the encoder output with the decoder’s previous hidden state, it is sometimes called ***concatenative attention*** (or ***additive attention***).

- Another common attention mechanism, known as ***Luong attention*** or ***multiplicative attention***, was proposed shortly after, in 2015, by Minh-Thang Luong et al. Because the goal of the alignment model is to measure the similarity between one of the encoder’s outputs and the decoder’s previous hidden state, the authors proposed to simply compute the dot product (see Chapter 4) of these two vectors, as this is often a fairly good similarity measure, and modern hardware can compute it very efficiently. For this to be possible, both vectors must have the same dimensionality. The dot product gives a score, and all the scores (at a given decoder time step) go through a softmax layer to give the final weights, just like in Bahdanau attention. Another simplification Luong et al.

Keras provides a ***tf.keras.layers.Attention*** layer for Luong attention, and an ***AdditiveAttention*** layer for Bahdanau attention. Let’s add Luong attention to our encoder– decoder model. Since we will need to pass all the encoder’s outputs to the ***Attention*** layer, we first need to set ***return_sequences=True*** when creating the encoder:

In [25]:
encoder = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True, return_state=True)
)

Next, we need to create the attention layer and pass it the decoder’s states and the encoder’s outputs. However, to access the decoder’s states at each step we would need to write a custom memory cell. For simplicity, let’s use the decoder’s outputs instead of its states: in practice this works well too, and it’s much easier to code. Then we just pass the attention layer’s outputs directly to the output layer, as suggested in the Luong attention paper:

In [26]:
attention_layer = tf.keras.layers.Attention()
attention_outputs = attention_layer([decoder_seq, encoder_outputs])
output_layer = tf.keras.layers.Dense(vocab_size, activation="softmax")
y_proba = output_layer(attention_outputs)

And that’s it! If you train this model, you will find that it now handles much longer sentences. For example:

In [27]:
translate("I like soccer and also going to the beach")

ValueError: Invalid dtype: str7904

In short, the attention layer provides a way to focus the attention of the model on part of the inputs. But there’s another way to think of this layer: it acts as a differentiable memory retrieval mechanism.

For example, let’s suppose the encoder analyzed the input sentence “I like soccer”, and it managed to understand that the word “I” is the subject and the word “like” is the verb, so it encoded this information in its outputs for these words. Now suppose the decoder has already translated the subject, and it thinks that it should translate the verb next. For this, it needs to fetch the verb from the input sentence. This is analogous to a dictionary lookup: it’s as if the encoder had created a dictionary {"subject”: “They”, “verb”: “played”, …} and the decoder wanted to look up the value that corresponds to the key “verb”.

However, the model does not have discrete tokens to represent the keys (like “subject” or “verb”); instead, it has vectorized representations of these concepts that it learned during training, so the query it will use for the lookup will not perfectly match any key in the dictionary. The solution is to compute a similarity measure between the query and each key in the dictionary, and then use the softmax function to convert these similarity scores to weights that add up to 1. As we saw earlier, that’s exactly what the attention layer does. If the key that represents the verb is by far the most similar to the query, then that key’s weight will be close to 1.

Next, the attention layer computes a weighted sum of the corresponding values: if the weight of the “verb” key is close to 1, then the weighted sum will be very close to the representation of the word “played”.

This is why the Keras Attention and AdditiveAttention layers both expect a list as input, containing two or three items: the queries, the keys, and optionally the values. If you do not pass any values, then they are automatically equal to the keys. So, looking at the previous code example again, the decoder outputs are the queries, and the encoder outputs are both the keys and the values. For each decoder output (i.e., each query), the attention layer returns a weighted sum of the encoder outputs (i.e., the keys/values) that are most similar to the decoder output.

The bottom line is that an attention mechanism is a trainable memory retrieval system. It is so powerful that you can actually build state-of-the-art models using only attention mechanisms. Enter the transformer architecture.

### **Attention Is All You Need: The Original Transformer Architecture**


In a groundbreaking 2017 paper, a team of Google researchers suggested that “Attention Is All You Need”. They created an architecture called the transformer, which significantly improved the state-of-the-art in NMT without using any recurrent or convolutional layers, just attention mechanisms (plus embedding layers, dense layers, normalization layers, and a few other bits and pieces). Because the model is not recurrent, it doesn’t suffer as much from the vanishing or exploding gradients problems as RNNs, it can be trained in fewer steps, it’s easier to parallelize across multiple GPUs, and it can better capture long-range patterns than RNNs.

If you use the transformer for NMT, then during training you must feed the English sentences to the encoder and the corresponding Spanish translations to the decoder, with an extra SOS token inserted at the start of each sentence. At inference time, you must call the transformer multiple times, producing the translations one word at a time and feeding the partial translations to the decoder at each round, just like we did earlier in the translate() function.

The encoder’s role is to gradually transform the inputs—word representations of the English sentence—until each word’s representation perfectly captures the meaning of the word, in the context of the sentence. For example, if you feed the encoder with the sentence “I like soccer”, then the word “like” will start off with a rather vague representation, since this word could mean different things in different contexts: think of “I like soccer” versus “It’s like that”. But after going through the encoder, the word’s representation should capture the correct meaning of “like” in the given sentence (i.e., to be fond of), as well as any other information that may be required for translation (e.g., it’s a verb).

The decoder’s role is to gradually transform each word representation in the translated sentence into a word representation of the next word in the translation. For example, if the sentence to translate is “I like soccer”, and the decoder’s input sentence is “<SOS> me gusta el fútbol”, then after going through the decoder, the word representation of the word “el” will end up transformed into a representation of the word “fútbol”. Similarly, the representation of the word “fútbol” will be transformed into a representation of the EOS token.

After going through the decoder, each word representation goes through a final Dense layer with a softmax activation function, which will hopefully output a high probability for the correct next word and a low probability for all other words. The predicted sentence should be “me gusta el fútbol <EOS>

#### **Positional encodings**
A positional encoding is a dense vector that encodes the position of a word within a sentence: the i positional encoding is added to the word embedding of the i word in the sentence. The easiest way to implement this is to use an Embedding layer and make it encode all the positions from 0 to the maximum sequence length in the batch, then add the result to the word embeddings. The rules of broadcasting will ensure that the positional encodings get applied to every input sequence. For example, here is how to add positional encodings to the encoder and decoder inputs:

In [28]:
max_length = 50 # max length of the whole training set.

embed_size = 128
pos_embed_layer = tf.keras.layers.Embedding(max_length, embed_size)
batch_max_len_enc = tf.shape(encoder_embeddings)[1]
encoder_in = encoder_embeddings + pos_embed_layer(tf.range(batch_max_len_dec))
batch_max_len_dec = tf.shape(decoder_embeddings)[1]
decoder_in = decoder_embeddings + pos_embed_layer(tf.range(batch_max_len_dec))

ValueError: A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.ops`). You are likely doing something like:

```
x = Input(...)
...
tf_fn(x)  # Invalid.
```

What you should do instead is wrap `tf_fn` in a layer:

```
class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)
```


Note that this implementation assumes that the embeddings are represented as regular tensors, notragged tensors. The encoder and the decoder share the same Embedding layer for the positional encodings, since they have the same embedding size (this is often the case).

There is no ***PositionalEncoding*** layer in TensorFlow, but it is not too hard to create one. For efficiency reasons, we precompute the positional encoding matrix in the constructor. The ***call()*** method just truncates this encoding matrix to the max length of the input sequences, and it adds them to the inputs. We also set supports_masking=True to propagate the input’s automatic mask to the next layer:

In [30]:
class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_length, embed_size, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        assert embed_size % 2 == 0, "embed_size must be even"
        p, i = np.meshgrid(np.arange(max_length), 
                           2 * np.arange(embed_size // 2))
        pos_emb = np.empty((1, max_length, embed_size))
        pos_emb[0, :, ::2] = np.sin(p / 10_000 ** (i / embed_size)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10_000 ** (i / embed_size)).T
        self.pos_encodings = tf.constant(pos_emb.astype(self.dtype))
        self.supports_masking = True
        
    def call(self, inputs):
        batch_max_length = tf.shape(inputs)[1]
        return inputs + self.pos_encodings[:, :batch_max_length]

Let’s use this layer to add the positional encoding to the encoder’s inputs:

In [31]:
pos_embed_layer = PositionalEncoding(max_length, embed_size)
encoder_in = pos_embed_layer(encoder_embeddings)
decoder_in = pos_embed_layer(decoder_embeddings)

Now let’s look deeper into the heart of the transformer model, at the multi-head attention layer.

#### **Multi-head attention**
It is just a bunch of scaled dot-product attention layers, each preceded by a linear transformation of the values, keys, and queries (i.e., a time-distributed dense layer with no activation function). All the outputs are simply concatenated, and they go through a final linear transformation (again, time-distributed).
 
But why? What is the intuition behind this architecture? Well, consider once again the word “like” in the sentence “I like soccer”. The encoder was smart enough to encode the fact that it is a verb. But the word representation also includes its position in the text, thanks to the positional encodings, and it probably includes many other features that are useful for its translation, such as the fact that it is in the present tense. In short, the word representation encodes many different characteristics of the word. If we just used a single scaled dot-product attention layer, we would only be able to query all of these characteristics in one shot.

This is why the multi-head attention layer applies multiple different linear transformations of the values, keys, and queries: this allows the model to apply many different projections of the word representation into different subspaces, each focusing on a subset of the word’s characteristics. Perhaps one of the linear layers will project the word representation into a subspace where all that remains is the information that the word is a verb, another linear layer will extract just the fact that it is present tense, and so on. Then the scaled dot-product attention layers implement the lookup phase, and finally we concatenate all the results and project them back to the original space.

Keras includes a tf.keras.layers.MultiHeadAttention layer, so we now have everything we need to build the rest of the transformer. Let’s start with the full encoder, which is exactly like in Figure 16-8, except we use a stack of two blocks (N = 2) instead of six, since we don’t have a huge training set, and we add a bit of dropout as well:

In [32]:
N = 2 # instead of 6
num_heads = 8
dropout_rate = 0.1
n_units = 128 # for the first dense layer in each feedforward block
encoder_pad_mask = tf.math.not_equal(encoder_input_ids, 0)[:, tf.newaxis]
Z = encoder_in
for _ in range(N):
    skip = Z
    attn_layer = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate)
    Z = attn_layer(Z, value=Z, attention_mask=encoder_pad_mask)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))
    skip = Z
    Z = tf.keras.layers.Dense(n_units, activation="relu")(Z)
    Z = tf.keras.layers.Dense(embed_size)(Z)
    Z = tf.keras.layers.Dropout(dropout_rate)(Z)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))

ValueError: A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.ops`). You are likely doing something like:

```
x = Input(...)
...
tf_fn(x)  # Invalid.
```

What you should do instead is wrap `tf_fn` in a layer:

```
class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)
```


This code should be mostly straightforward, except for one thing: masking. As of the time of writing, the MultiHeadAttention layer does not support automatic masking, so we must handle it manually. How can we do that?

The MultiHeadAttention layer accepts an attention_mask argument, which is a Boolean tensor of shape [batch size, max query length, max value length]: for every token in every query sequence, this mask indicates which tokens in the corresponding value sequence should be attended to. We want to tell the MultiHeadAttention layer to ignore all the padding tokens in the values. So, we first compute the padding mask using tf.math.not_equal(encoder_input_ids, 0). This returns a Boolean tensor of shape [batch size, max sequence length]. We then insert a second axis using [:, tf.newaxis], to get a mask of shape [batch size, 1, max sequence length]. This allows us to use this mask as the attention_mask when calling the MultiHeadAttention layer: thanks to broadcasting, the same mask will be used for all tokens in each query. This way, the padding tokens in the values will be ignored correctly.

However, the layer will compute outputs for every single query token, including the padding tokens. We need to mask the outputs that correspond to these padding tokens. Recall that we used mask_zero in the Embedding layers, and we set supports_masking to True in the PositionalEncoding layer, so the automatic mask was propagated all the way to the MultiHeadAttention layer’s inputs (encoder_in). We can use this to our advantage in the skip connection: indeed, the Add layer supports automatic masking, so when we add Z and skip (which is initially equal to encoder_in), the outputs get automatically masked correctly. Yikes! Masking required much more explanation than code.

Now on to the decoder! Once again, masking is going to be the only tricky part, so let’s start with that. The first multi-head attention layer is a self-attention layer, like in the encoder, but it is a masked multi-head attention layer, meaning it is causal: it should ignore all tokens in the future. So, we need two masks: a padding mask and a causal mask. Let’s create them:

In [33]:
decoder_pad_mask = tf.math.not_equal(decoder_input_ids, 0)[:, tf.newaxis]
causal_mask = tf.linalg.band_part( # creates a lower triangular matrix
    tf.ones((batch_max_len_dec, batch_max_len_dec), tf.bool), -1, 0)

ValueError: A KerasTensor cannot be used as input to a TensorFlow function. A KerasTensor is a symbolic placeholder for a shape and dtype, used when constructing Keras Functional models or Keras Functions. You can only use it as input to a Keras layer or a Keras operation (from the namespaces `keras.layers` and `keras.ops`). You are likely doing something like:

```
x = Input(...)
...
tf_fn(x)  # Invalid.
```

What you should do instead is wrap `tf_fn` in a layer:

```
class MyLayer(Layer):
    def call(self, x):
        return tf_fn(x)

x = MyLayer()(x)
```


The padding mask is exactly like the one we created for the encoder, except it’s based on the decoder’s inputs rather than the encoder’s. The causal mask is created using the tf.linalg.band_part() function, which takes a tensor and returns a copy with all the values outside a diagonal band set to zero. With these arguments, we get a square matrix of size batch_max_len_dec (the max length of the input sequences in the batch), with 1s in the lowerleft triangle and 0s in the upper right. If we use this mask as the attention mask, we will get exactly what we want: the first query token will only attend to the first value token, the second will only attend to the first two, the third will only attend to the first three, and so on. In other words, query tokens cannot attend to any value token in the future.

Let’s now build the decoder:

In [34]:
encoder_outputs = Z # let's save the encoder's final outputs
Z = decoder_in # the decoder starts with its own inputs
for _ in range(N):
    skip = Z
    attn_layer = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate)
    Z = attn_layer(Z, value=Z, attention_mask=causal_mask & decoder_pad_mask)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))
    skip = Z
    attn_layer = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_size, dropout=dropout_rate)
    Z = attn_layer(Z, value=encoder_outputs, attention_mask=encoder_pad_mask)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))
    skip = Z
    Z = tf.keras.layers.Dense(n_units, activation="relu")(Z)
    Z = tf.keras.layers.Dense(embed_size)(Z)
    Z = tf.keras.layers.LayerNormalization()(tf.keras.layers.Add()([Z, skip]))

NameError: name 'Z' is not defined

For the first attention layer, we use causal_mask & decoder_pad_mask to mask both the padding tokens and future tokens. The causal mask only has two dimensions: it’s missing the batch dimension, but that’s okay since broadcasting ensures that it gets copied across all the instances in the batch.

For the second attention layer, there’s nothing special. The only thing to note is that we are usingencoder_pad_mask, not decoder_pad_mask, because this attention layer uses the encoder’s final outputs as its values.

We’re almost done. We just need to add the final output layer, create the model, compile it, and train it:

In [35]:
Y_proba = tf.keras.layers.Dense(vocab_size, activation="softmax")(Z)
model = tf.keras.Model(inputs=[encoder_inputs, decoder_inputs],
                       outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"])
model.fit((X_train, X_train_dec), Y_train, epochs=10,
          validation_data=((X_valid, X_valid_dec), Y_valid))

NameError: name 'Z' is not defined

> #### **TIP**
> The Keras team has created a new [*Keras NLP project*](https://github.com/keras-team/keras-cv), including an API to build a transformer more easily. You may also be interested in the new [*Keras CV project for computer vision*](https://github.com/keras-team/keras-cv).

But the field didn’t stop there. Let’s now explore some of the recent advances.

## **An Avalanche of Transformer Models**
The year 2018 has been called the “ImageNet moment for NLP”. Since then, progress has been astounding, with larger and larger transformer-based architectures trained on immense datasets.

First, the [*GPT paper*](https://homl.info/gpt) by Alec Radford and other OpenAI researchers once again demonstrated the effectiveness of unsupervised pretraining, like the ELMo and ULMFiT papers before it, but this time using a transformer-like architecture. The authors pretrained a large but fairly simple architecture composed of a stack of 12 transformer modules using only masked multi-head attention layers, like in the original transformer’s decoder. They trained it on a very large dataset, using the same autoregressive technique we used for our Shakespearean char-RNN: just predict the next token. This is a form of self-supervised learning. Then they fine-tuned it on various language tasks, using only minor adaptations for each task. The tasks were quite diverse: they included text classification, entailment (whether sentence A imposes, involves, or implies sentence B as a necessary consequence), similarity (e.g., “Nice weather today” is very similar to “It is sunny”), and question answering (given a few paragraphs of text giving some context, the model must answer some multiple-choice questions).

Then Google’s [*BERT paper*](https://homl.info/bert) came out: it also demonstrated the effectiveness of self-supervised pretraining on a large corpus, using a similar architecture to GPT but with nonmasked multi-head attention layers only, like in the original transformer’s encoder. This means that the model is naturally bidirectional; hence the B in BERT (Bidirectional Encoder Representations from Transformers). Most importantly, the authors proposed two pretraining tasks that explain most of the model’s strength:
- ***Masked language model (MLM)***
  Each word in a sentence has a 15% probability of being masked, and the model is trained to predict the masked words. For example, if the original sentence is “She had fun at the birthday party”, then the model may be given the sentence “She <mask> fun at the <mask> party” and it must predict the words “had” and “birthday” (the other outputs will be ignored). To be more precise, each selected word has an 80% chance of being masked, a 10% chance of being replaced by a random word (to reduce the discrepancy between pretraining and fine-tuning, since the model will not see <mask> tokens during fine-tuning), and a 10% chance of being left alone (to bias the model toward the correct answer).

- ***Next sentence prediction (NSP)***
  The model is trained to predict whether two sentences are consecutive or not. For example, it should predict that “The dog sleeps” and “It snores loudly” are consecutive sentences, while “The dog sleeps” and “The Earth orbits the Sun” are not consecutive. Later research showed that NSP was not as important as was initially thought, so it was dropped in most later architectures.

The model is trained on these two tasks simultaneously. For the NSP task, the authors inserted a class token (<CLS>) at the start of every input, and the corresponding output token represents the model’s prediction: sentence B follows sentence A, or it does not. The two input sentences are concatenated, separated only by a special separation token (<SEP>), and they are fed as input to the model. To help the model know which sentence each input token belongs to, a segment embedding is added on top of each token’s positional embeddings: there are just two possible segment embeddings, one for sentence A and one for sentence B. For the MLM task, some input words are masked (as we just saw) and the model tries to predict what those words were. The loss is only computed on the NSP prediction and the masked tokens, not on the unmasked ones.

After this unsupervised pretraining phase on a very large corpus of text, the model is then finetuned on many different tasks, changing very little for each task. For example, for text classificationsuch as sentiment analysis, all output tokens are ignored except for the first one, corresponding to the class token, and a new output layer replaces the previous one, which was just a binary classification layer for NSP.

In February 2019, just a few months after BERT was published, Alec Radford, Jeffrey Wu, and other OpenAI researchers published the [GPT-2 paper](https://homl.info/gpt2), which proposed a very similar architecture to GPT, but larger still (with over 1.5 billion parameters!). The researchers showed that the new and improved GPT model could perform zero-shot learning (ZSL), meaning it could achieve good performance on many tasks without any fine-tuning. This was just the start of a race toward larger and larger models: Google’s [Switch Transformers](https://homl.info/switch) (introduced in January 2021) used 1 trillion parameters, and soon much larger models came out, such as the Wu Dao 2.0 model by the Beijing Academy of Artificial Intelligence (BAII), announced in June 2021.

An unfortunate consequence of this trend toward gigantic models is that only well-funded organizations can afford to train such models: it can easily cost hundreds of thousands of dollars or more. And the energy required to train a single model corresponds to an American household’s electricity consumption for several years; it’s not eco-friendly at all. Many of these models are just too big to even be used on regular hardware: they wouldn’t fit in RAM, and they would be horribly slow. Lastly, some are so costly that they are not released publicly.

Luckily, ingenious researchers are finding new ways to downsize transformers and make them more data-efficient. For example, the [DistilBERT model](https://homl.info/distilbert), introduced in October 2019 by Victor Sanh et al. from Hugging Face, is a small and fast transformer model based on BERT. It is available on Hugging Face’s excellent model hub, along with thousands of others—you’ll see an example later in this chapter.

DistilBERT was trained using distillation (hence the name): this means transferring knowledge from a teacher model to a student one, which is usually much smaller than the teacher model. This is typically done by using the teacher’s predicted probabilities for each training instance as targets for the student. Surprisingly, distillation often works better than training the student from scratch on the same dataset as the teacher! Indeed, the student benefits from the teacher’s more nuanced labels.

Many more transformer architectures came out after BERT, almost on a monthly basis, often improving on the state of the art across all NLP tasks: XLNet (June 2019), RoBERTa (July 2019), StructBERT (August 2019), ALBERT (September 2019), T5 (October 2019), ELECTRA (March 2020), GPT3 (May 2020), DeBERTa (June 2020), Switch Transformers (January 2021), Wu Dao 2.0 (June 2021), Gopher (December 2021), GPT-NeoX-20B (February 2022), Chinchilla (March 2022), OPT (May 2022), and the list goes on and on. Each of these models brought new ideas and techniques, but I particularly like the [T5 paper](https://homl.info/t5) by Google researchers: it frames all NLP tasks as text-to-text, using an encoder–decoder transformer. For example, to translate “I like soccer” to Spanish, you can just call the model with the input sentence “translate English to Spanish: I like soccer” and it outputs “me gusta el fútbol”. To summarize a paragraph, you just enter “summarize:” followed by the paragraph, and it outputs the summary. For classification, you only need to change the prefix to “classify:” and the model outputs the class name, as text. This simplifies using the model, and it also makes it possible to pretrain it on even more tasks.

Last but not least, in April 2022, Google researchers used a new large-scale training platform named Pathways (which we will briefly discuss in Chapter 19) to train a humongous language model named the [Pathways Language Model (PaLM)](https://homl.info/palm), with a whopping 540 billion parameters, using over 6,000 TPUs. Other than its incredible size, this model is a standard transformer, using decoders only (i.e., with masked multi-head attention layers), with just a few tweaks (see the paper for details). This model achieved incredible performance on all sorts of NLP tasks, particularly in natural language understanding (NLU). It’s capable of impressive feats, such as explaining jokes, giving detailed step-by-step answers to questions, and even coding. This is in part due to the model’s size, but also thanks to a technique called [Chain of thought prompting](https://homl.info/cpt), which was introduced a couple months earlier by another team of Google researchers.

In question answering tasks, regular prompting typically includes a few examples of questions and answers, such as: “Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: 11.” The prompt then continues with the actual question, such as “Q: John takes care of 10 dogs. Each dog takes .5 hours a day to walk and take care of their business. How many hours a week does he spend taking care of dogs? A:”, and the model’s job is to append the answer: in this case, “35.”

But with chain of thought prompting, the example answers include all the reasoning steps that lead to the conclusion. For example, instead of “A: 11”, the prompt contains “A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11.” This encourages the model to give a detailed answer to the actual question, such as “John takes care of 10 dogs. Each dog takes .5 hours a day to walk and take care of their business. So that is 10 × .5 = 5 hours a day. 5 hours a day × 7 days a week = 35 hours a week. The answer is 35 hours a week.” This is an actual example from the paper!

Not only does the model give the right answer much more frequently than using regular prompting we’re encouraging the model to think things through—but it also provides all the reasoning steps, which can be useful to better understand the rationale behind a model’s answer. Transformers have taken over NLP, but they didn’t stop there: they soon expanded to computer vision as well.

## **Vision Transformers**
One of the first applications of attention mechanisms beyond NMT was in generating image captions using [visual attention](https://homl.info/visualattention): a convolutional neural network first processes the image and outputs some feature maps, then a decoder RNN equipped with an attention mechanism generates the caption, one word at a time.

At each decoder time step (i.e., each word), the decoder uses the attention model to focus on just the right part of the image.

> #### **EXPLAINABILITY**
> One extra benefit of attention mechanisms is that they make it easier to understand what led the model to produce its output. This is called explainability. It can be especially useful when the model makes a mistake: for example, if an image of a dog walking in the snow is labeled as “a wolf walking in the snow”, then you can go back and check what the model focused on when it output the word “wolf”. You may find that it was paying attention not only to the dog, but also to the snow, hinting at a possible explanation: perhaps the way the model learned to distinguish dogs from wolves is by checking whether or not there’s a lot of snow around. You can then fix this by training the model with more images of wolves without snow, and dogs with snow. This example comes from a great [2016 paper](https://homl.info/explainclass) by Marco Tulio Ribeiro et al. that uses a different approach to explainability: learning an interpretable model locally around a classifier’s prediction. In some applications, explainability is not just a tool to debug a model; it can be a legal requirement—think of a system deciding whether or not it should grant you a loan.

When transformers came out in 2017 and people started to experiment with them beyond NLP, they were first used alongside CNNs, without replacing them. Instead, transformers were generally used to replace RNNs, for example, in image captioning models. Transformers became slightly more visual in a [2020 paper](https://homl.info/detr) by Facebook researchers, which proposed a hybrid CNN–transformer architecture for object detection. Once again, the CNN first processes the input images and outputs a set of feature maps, then these feature maps are converted to sequences and fed to a transformer, which outputs bounding box predictions. But again, most of the visual work is still done by the CNN.

Then, in October 2020, a team of Google researchers released [a paper](https://homl.info/vit) that introduced a fully transformer-based vision model, called a vision transformer (ViT). The idea is surprisingly simple: just chop the image into little 16 × 16 squares, and treat the sequence of squares as if it were a sequence of word representations. To be more precise, the squares are first flattened into 16 × 16 × 3 = 768-dimensional vectors—the 3 is for the RGB color channels—then these vectors go through a linear layer that transforms them but retains their dimensionality. The resulting sequence of vectors can then be treated just like a sequence of word embeddings: this means adding positional embeddings, and passing the result to the transformer. That’s it! This model beat the state of the art on ImageNet image classification, but to be fair the authors had to use over 300 million additional images for training. This makes sense since transformers don’t have as many inductive biases as convolution neural nets, so they need extra data just to learn things that CNNs implicitly assume.

> #### **NOTE**
> An inductive bias is an implicit assumption made by the model, due to its architecture. For example, linear models implicitly assume that the data is, well, linear. CNNs implicitly assume that patterns learned in one location will likely be useful in other locations as well. RNNs implicitly assume that the inputs are ordered, and that recent tokens are more important than older ones. The more inductive biases a model has, assuming they are correct, the less training data the model will require. But if the implicit assumptions are wrong, then the model may perform poorly even if it is trained on a large dataset.

Just two months later, a team of Facebook researchers released [a paper](https://homl.info/deit) that introduced dataefficient image transformers (DeiTs). Their model achieved competitive results on ImageNet without requiring any additional data for training. The model’s architecture is virtually the same as the original ViT, but the authors used a distillation technique to transfer knowledge from state-ofthe-art CNN models to their model.

Then, in March 2021, DeepMind released an important [paper](https://homl.info/perceiver) that introduced the Perceiver architecture. It is a multimodal transformer, meaning you can feed it text, images, audio, or virtually any other modality. Until then, transformers had been restricted to fairly short sequences because of the performance and RAM bottleneck in the attention layers. This excluded modalities such as audio or video, and it forced researchers to treat images as sequences of patches, rather than sequences of pixels. The bottleneck is due to self-attention, where every token must attend to every other token: if the input sequence has M tokens, then the attention layer must compute an M × M matrix, which can be huge if M is very large. The Perceiver solves this problem by gradually improving a fairly short latent representation of the inputs, composed of N tokens—typically just a few hundred. (The word latent means hidden, or internal.) The model uses cross-attention layers only, feeding them the latent representation as the queries, and the (possibly large) inputs as the values. This only requires computing an M × N matrix, so the computational complexity is linear with regard to M, instead of quadratic. After going through several cross-attention layers, if everything goes well, the latent representation ends up capturing everything that matters in the inputs. The authors also suggested sharing the weights between consecutive cross-attention layers: if you do that, then the Perceiver effectively becomes an RNN. Indeed, the shared cross-attention layers can be seen as the same memory cell at different time steps, and the latent representation corresponds to the cell’s context vector. The same inputs are repeatedly fed to the memory cell at every time step. It looks like RNNs are not dead after all!

Just a month later, Mathilde Caron et al. introduced [DINO](https://homl.info/dino), an impressive vision transformer trained entirely without labels, using self-supervision, and capable of high-accuracy semantic segmentation. The model is duplicated during training, with one network acting as a teacher and the other acting as a student. Gradient descent only affects the student, while the teacher’s weights are just an exponential moving average of the student’s weights. The student is trained to match the teacher’s predictions: since they’re almost the same model, this is called self-distillation. At each training step, the input images are augmented in different ways for the teacher and the student, so they don’t see the exact same image, but their predictions must match. This forces them to come up with high-level representations. To prevent mode collapse, where both the student and the teacher would always output the same thing, completely ignoring the inputs, DINO keeps track of a moving average of the teacher’s outputs, and it tweaks the teacher’s predictions to ensure that they remain centered on zero, on average. DINO also forces the teacher to have high confidence in its predictions: this is called sharpening. Together, these techniques preserve diversity in the teacher’s outputs.

In a [2021 paper](https://homl.info/scalingvits), Google researchers showed how to scale ViTs up or down, depending on the amount of data. They managed to create a huge 2 billion parameter model that reached over 90.4% top-1 accuracy on ImageNet. Conversely, they also trained a scaled-down model that reached over 84.8% top-1 accuracy on ImageNet, using only 10,000 images: that’s just 10 images per class! And progress in visual transformers has continued steadily to this day. For example, in March 2022, a [paper](https://homl.info/modelsoups) by Mitchell Wortsman et al. demonstrated that it’s possible to first train multiple transformers, then average their weights to create a new and improved model. This is similar to an ensemble (see Chapter 7), except there’s just one model in the end, which means there’s no inference time penalty.

The latest trend in transformers consists in building large multimodal models, often capable of zeroshot or few-shot learning. For example, [OpenAI’s 2021 CLIP paper](https://homl.info/clip) proposed a large transformer model pretrained to match captions with images: this task allows it to learn excellent image representations, and the model can then be used directly for tasks such as image classification using simple text prompts such as “a photo of a cat”. Soon after, OpenAI announced [DALL·E](https://homl.info/dalle), capable of generating amazing images based on text prompts. The [DALL·E 2](https://homl.info/dalle), which generates even higher quality images using a diffusion model (see Chapter 17).

> #### **NOTE**
> astounding advances have led some researchers to claim that human-level AI is near, that “scale is all you need”, and that some of these models may be “slightly conscious”. Others point out that despite the amazing progress, these models still lack the reliability and adaptability of human intelligence, our ability to reason symbolically, to generalize based on a single example, and more.

As you can see, transformers are everywhere! And the good news is that you generally won’t have to implement transformers yourself since many excellent pretrained models are readily available for download via TensorFlow Hub or Hugging Face’s model hub. You’ve already seen how to use a model from TF Hub, so let’s close this chapter by taking a quick look at Hugging Face’s ecosystem.

## **Hugging Face's Transformers Library**

It’s impossible to talk about transformers today without mentioning Hugging Face, an AI company that has built a whole ecosystem of easy-to-use open source tools for NLP, vision, and beyond. The
central component of their ecosystem is the Transformers library, which allows you to easily download a pretrained model, including its corresponding tokenizer, and then fine-tune it on your own dataset, if needed. Plus, the library supports TensorFlow, PyTorch, and JAX (with the Flax library).

The simplest way to use the Transformers library is to use the ***transformers.pipeline()*** function: you just specify which task you want, such as sentiment analysis, and it downloads a default pretrained model, ready to be used—it really couldn’t be any simpler:

In [37]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", framework="tf")
result = classifier("The actors were very convincing")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


NameError: name 'torch' is not defined

In [None]:
# This takes a lot of Mbs
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

Classifier = pipeline(
    "sentiment_analysis", 
    model=model, 
    tokenizer=tokenizer, 
    framework="tf"
)

print(Classifier("The actors were very convincing"))

> #### **BIAS AND FAIRNESS**
> As the output suggests, this specific classifier loves Indians, but is severely biased against Iraqis. You can try this code with your own country or city. Such an undesirable bias generally comes in large part from the training data itself: in this case, there were plenty of negative sentences related to the wars in Iraq in the training data. This bias was then amplified during the fine-tuning process since the model was forced to choose between just two classes: positive or negative. If you add a neutral class when fine-tuning, then the country bias mostly disappears. But the training data is not the only source of bias: the model’s architecture, the type of loss or regularization used for training, the optimizer; all of these can affect what the model ends up learning. Even a mostly unbiased model can be used in a biased way, much like survey questions can be biased.
>
> Understanding bias in AI and mitigating its negative effects is still an area of active research, but one thing is certain: you should pause and think before you rush to deploy a model to production. Ask yourself how the model could do harm, even indirectly. For example, if the model’s predictions are used to decide whether or not to give someone a loan, the process should be fair. So, make sure you evaluate the model’s performance not just on average over the whole test set, but across various subsets as well: for example, you may find that although the model works very well on average, its performance is abysmal for some categories of people. You may also want to run counterfactual tests: for example, you may want to check that the model’s predictions do not change when you simply switch someone’s gender.
>
> If the model works well on average, it’s tempting to push it to production and move on to something else, especially if it’s just one component of a much larger system. But in general, if you don’t fix such issues, no one else will, and your model may end up doing more harm than good. The solution depends on the problem: it may require rebalancing the dataset, fine-tuning on a different dataset, switching to another pretrained model, tweaking the model’s architecture or hyperparameters, etc.

The ***pipeline()*** function uses the default model for the given task. For example, for text classification tasks such as sentiment analysis, at the time of writing, it defaults to distilbertbase-uncased-finetuned-sst-2-english—a DistilBERT model with an uncased tokenizer, trained on English Wikipedia and a corpus of English books, and fine-tuned on the Stanford Sentiment Treebank v2 (SST 2) task. It’s also possible to manually specify a different model. For example, you could use a DistilBERT model fine-tuned on the Multi-Genre Natural Language Inference (MultiNLI) task, which classifies two sentences into three classes: contradiction, neutral, or entailment. Here is how:

The pipeline() function uses the default model for the given task. For example, for text classification tasks such as sentiment analysis, at the time of writing, it defaults to distilbertbase-uncased-finetuned-sst-2-english—a DistilBERT model with an uncased tokenizer, trained on English Wikipedia and a corpus of English books, and fine-tuned on the Stanford Sentiment Treebank v2 (SST 2) task. It’s also possible to manually specify a different model. For example, you could use a DistilBERT model fine-tuned on the Multi-Genre Natural Language Inference (MultiNLI) task, which classifies two sentences into three classes: contradiction, neutral, or entailment. Here is how:

In [None]:
from transformers import pipeline


model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
classifier_mnli = pipeline("text-classification", model=model_name)
classifier_mnli("She loves me. [SEP] She loves me not.")

> #### **TIP**
> You can find the available models at https://huggingface.co/models, and the list of tasks at https://huggingface.co/tasks.

The pipeline API is very simple and convenient, but sometimes you will need more control. For such cases, the Transformers library provides many classes, including all sorts of tokenizers, models, configurations, callbacks, and much more. For example, let’s load the same DistilBERT model, along with its corresponding tokenizer, using the TFAutoModelForSequenceClassification and AutoTokenizer classes:

In [None]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

Next, let’s tokenize a couple of pairs of sentences. In this code, we activate padding and specify that we want TensorFlow tensors instead of Python lists:

In [None]:
token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
                        "Joe lived for a very long time. [SEP] Joe is old."],
                        padding=True, return_tensors="tf")

> #### **TIP**
> Instead of passing "Sentence 1 [SEP] Sentence 2" to the tokenizer, you can equivalently pass it a tuple: ("Sentence 1", "Sentence 2").

The output is a dictionary-like instance of the BatchEncoding class, which contains the sequences of token IDs, as well as a mask containing 0s for the padding tokens:

In [None]:
token_ids

If you set ***return_token_type_ids=True*** when calling the tokenizer, you will also get an extra tensor that indicates which sentence each token belongs to. This is needed by some models, but not DistilBERT.

Next, we can directly pass this BatchEncoding object to the model; it returns a ***TFSequenceClassifierOutput*** object containing its predicted class logits:

In [None]:
outputs = model(token_ids)
outputs

Lastly, we can apply the softmax activation function to convert these logits to class probabilities, and use the ***argmax()*** function to predict the class with the highest probability for each input sentence pair:

In [None]:
Y_probas = tf.keras.activations.softmax(outputs.logits)
Y_probas

Y_pred = tf.argmax(Y_probas, axis=1)
Y_pred

In this example, the model correctly classifies the first sentence pair as neutral (the fact that I like soccer does not imply that everyone else does) and the second pair as an entailment (Joe must indeed be quite old).

If you wish to fine-tune this model on your own dataset, you can train the model as usual with Keras since it’s just a regular Keras model with a few extra methods. However, because the model outputs logits instead of probabilities, you must use the ***tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)*** loss instead of the usual "***sparse_categorical_crossentropy***" loss. Moreover, the modeldoes not support BatchEncoding inputs during training, so you must use its data attribute to get a regular dictionary instead:

In [None]:
sentences = [("Sky is blue", "Sky is red"), ("I love her", "She loves me")]
X_train = tokenizer(sentences, padding=True, return_tensors="tf").data
y_train = tf.constant([0, 2])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=2)

Hugging Face has also built a Datasets library that you can use to easily download a standard dataset (such as IMDb) or a custom one, and use it to fine-tune your model. It’s similar to TensorFlow Datasets, but it also provides tools to perform common preprocessing tasks on the fly, such as masking. The list of datasets is available at https://huggingface.co/datasets.

This should get you started with Hugging Face’s ecosystem. To learn more, you can head over to https://huggingface.co/docs for the documentation, which includes many tutorial notebooks, videos, the full API, and more. I also recommend you check out the O’Reilly book Natural Language Processing with Transformers: Building Language Applications with Hugging Face by Lewis Tunstall, Leandro von Werra, and Thomas Wolf—all from the Hugging Face team.

In the next chapter we will discuss how to learn deep representations in an unsupervised way using autoencoders, and we will use generative adversarial networks to produce images and more!