# Chapter 16: Natural Language Processing with RNNs and Attention

This notebook contains the code reproductions and theoretical explanations for Chapter 16 of *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*.

## Chapter Summary

This chapter dives into Natural Language Processing (NLP) using RNNs and newer, more powerful architectures.

Key topics covered include:

* **Text Generation (Char-RNN):** We start by building a character-level RNN to generate "Shakespearean" text. This covers how to prepare sequential text data, create windows, and build both stateless and stateful RNNs.
* **Sentiment Analysis:** We build a model to classify movie reviews as positive or negative. This introduces word-level processing, word embeddings, and the crucial concept of **masking** to handle variable-length sequences.
* **Encoder-Decoder Models:** We build a model for Neural Machine Translation (NMT). This introduces the **Encoder–Decoder** architecture, where one RNN (the encoder) processes the input sequence into a state vector, and another RNN (the decoder) uses that vector to generate an output sequence. We also cover **beam search** as a technique to improve translation quality.
* **Attention Mechanisms:** We improve the Encoder–Decoder model by adding an **attention mechanism**. This allows the decoder to "look back" at the most relevant parts of the *entire* input sequence at each step, solving the bottleneck of using a single fixed-size state vector, which is especially problematic for long sentences.
* **The Transformer:** We explore the groundbreaking "Attention Is All You Need" architecture. The Transformer eschews RNNs entirely and relies exclusively on attention mechanisms (specifically **self-attention** and **multi-head attention**). It also introduces **positional embeddings** to encode word order.
* **Modern Language Models:** The chapter concludes with an overview of state-of-the-art Transformer-based models like BERT and GPT-2, which have revolutionized NLP.

## Setup

First, let's import the necessary libraries and set up the environment.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

# Common setup for plotting
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## Generating Shakespearean Text Using a Character RNN

### Theoretical Explanation

A **Character-RNN (Char-RNN)** is a model trained to predict the next *character* in a sequence. By learning the statistical patterns of a text (like word spellings, grammar, and punctuation), it can be used to generate new text, one character at a time.

We will train an RNN on all of Shakespeare's work. The model will process a sequence of characters (e.g., 100 long) and learn to predict the next character for each position in the sequence.

### Creating the Training Dataset

In [2]:
# Download the Shakespeare dataset
shakespeare_url = "https://homl.info/shakespeare" # shortcut URL
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Downloading data from https://homl.info/shakespeare
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [3]:
# Tokenize the text at the character level
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

In [4]:
# Check how it works
print(tokenizer.texts_to_sequences(["First"]))
print(tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]]))

[[20, 6, 9, 8, 3]]
['f i r s t']


In [13]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = len(shakespeare_text) # total number of characters (`encoded` is only defined in the next cell)

print("Total distinct characters:", max_id)
print("Total characters in the text:", dataset_size)

Total distinct characters: 39
Total characters in the text: 1115394


In [14]:
# Encode the full text (we subtract 1 to get IDs from 0 to 38, not 1 to 39)
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

### How to Split a Sequential Dataset

For sequential data, we **cannot** shuffle the characters before splitting: the training and validation sets must cover separate, contiguous parts of the text to prevent any overlap and data leakage. We'll take the first 90% for training.

In [15]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

### Chopping the Sequential Dataset into Multiple Windows

We can't train the RNN on the full text (over 1 million characters). Instead, we use **truncated backpropagation through time** by creating many smaller "windows" from the text.

We use `dataset.window()` to create overlapping windows. For example, window 1 is characters 0-100, window 2 is 1-101, etc. The input for each window will be the first 100 chars, and the target will be the last 100 chars (shifted by one).
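
To make the windowing concrete, here is a minimal NumPy sketch on a toy sequence of 8 IDs with windows of length 4. The helper `make_windows` is a hypothetical stand-in for the `window()` and `flat_map()` steps and is not part of `tf.data`; the toy sizes are assumptions chosen for readability.

```python
import numpy as np

def make_windows(sequence, window_length, shift=1):
    """Return every full window of `window_length` items, one every `shift` items."""
    last_start = len(sequence) - window_length
    return np.array([sequence[start:start + window_length]
                     for start in range(0, last_start + 1, shift)])

toy_sequence = np.arange(8)             # stands in for the encoded character IDs
windows = make_windows(toy_sequence, window_length=4, shift=1)
X, Y = windows[:, :-1], windows[:, 1:]  # targets are the inputs shifted by one step

print(windows.shape)                    # (5, 4)
print(X[0].tolist(), Y[0].tolist())     # [0, 1, 2] [1, 2, 3]
```

With `shift=1` almost every character appears in `window_length` different windows, which is why the real dataset yields so many batches per epoch.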

In [16]:
n_steps = 100
window_length = n_steps + 1 # e.g., 0-100 (input is 0-99, target is 1-100)
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

In [17]:
# `window()` creates a nested dataset. We must flatten it.
# `flat_map()` also lets us transform each window (e.g., batch it into a tensor).
dataset = dataset.flat_map(lambda window: window.batch(window_length))

In [18]:
# Now we shuffle the windows (not the characters *inside* them) and batch them.
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)

# Finally, we split each batch into inputs (X) and targets (Y).
# X = first 100 chars, Y = last 100 chars (shifted by one)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [19]:
# The model needs to predict the *class* of the next character.
# We one-hot encode the input characters. The labels (Y) can remain as integers
# because we will use "sparse_categorical_crossentropy" as the loss.
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

# Add prefetching for performance
dataset = dataset.prefetch(1)

### Building and Training the Char-RNN Model

In [22]:
model = keras.models.Sequential([
    keras.Input(shape=[None, max_id]), # Explicit Input layer to avoid UserWarning
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2), # input_shape removed
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                  activation="softmax"))
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Diagnostic: Verify dataset is not empty before training
print("Checking dataset content before training...")
try:
    for X_batch, Y_batch in dataset.take(1):
        print(f"Dataset yields: X_batch shape {X_batch.shape}, Y_batch shape {Y_batch.shape}")
except Exception as e:
    print(f"Error taking element from dataset: {e}")
print("Dataset check complete.")

history = model.fit(dataset, epochs=2) # Reduced epochs for faster run time

Checking dataset content before training...
Dataset yields: X_batch shape (32, 100, 39), Y_batch shape (32, 100)
Dataset check complete.
Epoch 1/2
     40/Unknown 27s 452ms/step - loss: 3.3939

KeyboardInterrupt: 

### Generating Fake Shakespearean Text

To generate new text, we feed the model a starting string (seed text), have it predict the next character, append that character to the text, and repeat the process.

In [None]:
# First, we need a function to preprocess the seed text
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

In [None]:
# Test predicting the next char
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model.predict(X_new), axis=-1) # predict_classes() is not available in recent TensorFlow versions
print(tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]) # 'u'

**Theoretical Explanation: Temperature**

Instead of greedily picking the *most likely* next character, we can sample from the probability distribution. This creates more diverse and less repetitive text.

The `temperature` hyperparameter controls this:
* **Low Temperature (e.g., 0.2):** Boosts high-probability characters, making the text safer and more predictable (but also more repetitive).
* **High Temperature (e.g., 2.0):** Flattens the distribution, making all characters more equally likely. This leads to more random, creative, and often nonsensical text.

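Our model outputs probabilities rather than logits, so we take the logarithm of the predicted probabilities (which equals the logits up to an additive constant), divide it by the temperature, and pass the result to `tf.random.categorical()`, which samples class indices from unnormalized log-probabilities.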
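
The effect of the temperature can be checked on a tiny distribution. The sketch below uses plain NumPy; `apply_temperature` and the three probabilities are hypothetical and only illustrate the rescaling, not the sampling step.

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    logits = np.log(probas) / temperature
    logits = logits - logits.max()        # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

probas = np.array([0.7, 0.2, 0.1])        # hypothetical next-character probabilities
cold = apply_temperature(probas, 0.2)     # sharper: the top character dominates
neutral = apply_temperature(probas, 1.0)  # unchanged distribution
hot = apply_temperature(probas, 2.0)      # flatter: closer to uniform

print(cold.round(3), neutral.round(3), hot.round(3))
```

A temperature of 1 leaves the distribution untouched, which is why it is a sensible default for `next_char()`.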

In [None]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [None]:
print(complete_text("t", temperature=0.2))
print("-" * 40)
print(complete_text("w", temperature=1))
print("-" * 40)
print(complete_text("w", temperature=2))

### Stateful RNN

**Theoretical Explanation:**

So far, our RNN has been **stateless**. At each training iteration, it starts with a hidden state of zeros. It has no memory of the text from the previous batch.

A **stateful RNN** preserves its hidden state from one batch to the next. This allows it to learn longer-term patterns.

To build a stateful RNN:
1.  **Set `stateful=True`** in all recurrent layers.
2.  **Provide `batch_input_shape`** in the first layer (it needs to know the batch size to preserve states for each sequence in the batch).
3.  **Prepare the dataset carefully:** The dataset must be non-overlapping and sequential. Batch *i* must contain the windows that immediately follow the windows in batch *i-1*.
4.  **Reset the states** at the end of each epoch.
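
The dataset requirement in step 3 is the subtle part. As a rough sketch on a toy "text" of 20 IDs (all sizes and the `windows` helper are assumptions for illustration), the batches can be laid out so that row *j* of each batch continues row *j* of the previous batch:

```python
import numpy as np

n_steps, batch_size = 3, 2
window_length = n_steps + 1
toy_text = np.arange(20)                       # stands in for the encoded text

# Split the text into `batch_size` contiguous parts, one per row of the batch.
parts = np.array_split(toy_text, batch_size)   # IDs 0..9 and IDs 10..19

def windows(part):
    # Consecutive windows share exactly one item (shift = n_steps), so the
    # last target of a window is the first input of the next window.
    return [part[start:start + window_length]
            for start in range(0, len(part) - window_length + 1, n_steps)]

# Batch i stacks the i-th window of every part.
batches = [np.stack(rows) for rows in zip(*map(windows, parts))]
for batch in batches:
    print(batch[:, :-1].tolist())              # inputs of each batch
# [[0, 1, 2], [10, 11, 12]]
# [[3, 4, 5], [13, 14, 15]]
# [[6, 7, 8], [16, 17, 18]]
```

Because each row picks up exactly where it left off, the hidden state carried over from the previous batch is the right starting state for the current one.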

In [None]:
# 1. Prepare the dataset for a stateful model (batch_size=32)
# (A simpler way is batch_size=1, but the book shows the batch_size=32 version)

# Simple way: batch_size=1
# dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
# dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
# dataset = dataset.flat_map(lambda window: window.batch(window_length))
# dataset = dataset.batch(1)

# More complex way: batch_size > 1 (as in the book)
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []
for encoded_part in encoded_parts:
    dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    datasets.append(dataset)
dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

In [None]:
# 2. Build the stateful model
model_stateful = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                  activation="softmax"))
])

In [None]:
# 3. Create a callback to reset the states at the beginning of each epoch
# (i.e., before going back to the start of the text)
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()

In [None]:
# 4. Compile and fit
model_stateful.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history_stateful = model_stateful.fit(dataset, epochs=50,
                                    callbacks=[ResetStatesCallback()])

## Sentiment Analysis

**Theoretical Explanation:**

We now move to a **word-level** model to classify IMDb movie reviews as positive (1) or negative (0). This is a **sequence-to-vector** task.

The main challenge is that reviews have **variable lengths**. We can't use fixed-size tensors directly. The solution is **masking**.

1.  **Padding:** We pad all sequences in a batch with a special padding token (usually represented by 0) so they all have the same length.
2.  **Masking:** We use `mask_zero=True` in the `Embedding` layer. This tells the layer to create a *mask* (a tensor of booleans) that is `False` for all padding tokens (ID 0) and `True` for all other tokens. This mask is then automatically propagated to all subsequent layers (like GRU/LSTM), which will then correctly ignore the padding tokens.
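
The two steps can be illustrated without Keras. In this minimal NumPy sketch, `pad_sequences` is a hypothetical helper and the word IDs are made up; the mask is computed with the same "not equal to 0" rule that `mask_zero=True` applies.

```python
import numpy as np

def pad_sequences(sequences, pad_id=0):
    """Right-pad variable-length lists of word IDs to a common length."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

reviews = [[12, 7, 45], [3, 9], [8]]   # hypothetical word IDs (0 is reserved for padding)
padded = pad_sequences(reviews)
mask = padded != 0                      # True for real tokens, False for padding
lengths = mask.sum(axis=1)              # number of real tokens per review

print(padded.tolist())    # [[12, 7, 45], [3, 9, 0], [8, 0, 0]]
print(lengths.tolist())   # [3, 2, 1]
```

A recurrent layer that receives this mask skips the `False` positions, so its final output corresponds to the last real token of each review rather than to trailing padding.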

In [None]:
# Load the IMDb dataset (pre-tokenized)
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

# Show an example review (list of word IDs)
X_train[0][:10]

In [None]:
# Decode the review back to text
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(('<pad>', '<sos>', '<unk>')):
    id_to_word[id_] = token

print(" ".join([id_to_word[id_] for id_ in X_train[0][:10]]))

#### Preprocessing Text with TensorFlow Datasets (TFDS)

Instead of the pre-tokenized data, let's load the raw text reviews from TFDS to show a more complete pipeline.

In [None]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits["train"].num_examples

In [None]:
# Preprocessing function
def preprocess(X_batch, y_batch):
    # Truncate to 300 characters
    X_batch = tf.strings.substr(X_batch, 0, 300)
    # Replace <br /> with spaces
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")
    # Keep only letters and quotes
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    # Split by spaces
    X_batch = tf.strings.split(X_batch)
    # Convert from ragged to dense tensor, padding with "<pad>"
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

print(preprocess(tf.constant([b"This movie was faaaaaantastic<br />"]), tf.constant([1])))

In [None]:
# Build the vocabulary
from collections import Counter
vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

print("Most common words:", vocabulary.most_common()[:3])

In [None]:
# Truncate the vocabulary
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]

In [None]:
# Create a lookup table (with 1000 OOV buckets for unknown words)
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

In [None]:
# Create the final training dataset
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [None]:
# Build the sentiment analysis model with masking
embed_size = 128
model = keras.models.Sequential([
    # The Embedding layer converts word IDs to dense vectors (embeddings).
    # mask_zero=True tells the layer to ignore ID 0, which should be "<pad>" here
    # since the padding token is by far the most frequent one in the vocabulary.
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                         mask_zero=True,
                         input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128), # Only return the last output
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

### Reusing Pretrained Embeddings

Instead of learning embeddings from scratch (which requires a lot of data), we can use embeddings pretrained on a huge corpus (like Wikipedia). **TensorFlow Hub** is a repository for pretrained model components, called *modules*. We can load a module as a Keras layer.

In [None]:
import tensorflow_hub as hub

# Load a pretrained sentence embedding module from TF Hub
model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Create a new dataset that doesn't do our manual preprocessing
train_set_hub = datasets["train"].batch(32).prefetch(1)
history_hub = model.fit(train_set_hub, epochs=5)

## An Encoder–Decoder Network for Neural Machine Translation

**Theoretical Explanation:**

An **Encoder–Decoder** model is a common architecture for NMT.

1.  **Encoder:** An RNN (e.g., LSTM/GRU) that reads the input sentence (e.g., in English) one word at a time and compresses it into a single state vector (the last hidden state). This vector is often called the *context vector* or *thought vector*.
2.  **Decoder:** Another RNN that is initialized with the encoder's final state. Its job is to generate the translated sentence (e.g., in French), one word at a time.

**Training:** At each time step, the decoder is fed the *previous* target word. For example, to output "Je", it's given the `<sos>` (start-of-sequence) token. To output "bois", it's given "Je".

**Inference:** At inference time, we don't have the target. Instead, we feed the decoder the word it *just* predicted at the previous step. It stops when it outputs an `<eos>` (end-of-sequence) token.
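
During training, the decoder inputs are simply the targets shifted one step to the right, with `<sos>` in front. A minimal NumPy sketch, assuming hypothetical token IDs (`SOS_ID`, `EOS_ID`, and the word IDs are made up for illustration):

```python
import numpy as np

SOS_ID, EOS_ID = 1, 2                      # hypothetical special-token IDs

def make_decoder_inputs(targets, sos_id=SOS_ID):
    """Shift each target sequence one step right and prepend the <sos> token."""
    sos_column = np.full((targets.shape[0], 1), sos_id, dtype=targets.dtype)
    return np.concatenate([sos_column, targets[:, :-1]], axis=1)

# One target sentence of four hypothetical word IDs followed by <eos>.
targets = np.array([[10, 11, 12, 13, EOS_ID]])
decoder_inputs = make_decoder_inputs(targets)

print(decoder_inputs.tolist())  # [[1, 10, 11, 12, 13]]
```

At step *t* the decoder sees `decoder_inputs[:, t]` and is trained to output `targets[:, t]`, so it never receives the word it is supposed to predict.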

The `tensorflow_addons` package provides tools to build these models. (Note: The book's code for this section is complex, so this is a simplified representation of the concept).

In [None]:
# This is a conceptual representation.
# The full implementation is very involved.

# We'll use a placeholder for the real vocab_size and embed_size
vocab_size = 10000
embed_size = 128

try:
    import tensorflow_addons as tfa

    encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
    decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
    sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

    embeddings = keras.layers.Embedding(vocab_size, embed_size)
    encoder_embeddings = embeddings(encoder_inputs)
    decoder_embeddings = embeddings(decoder_inputs)

    encoder = keras.layers.LSTM(512, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
    encoder_state = [state_h, state_c]

    sampler = tfa.seq2seq.sampler.TrainingSampler()

    decoder_cell = keras.layers.LSTMCell(512)
    output_layer = keras.layers.Dense(vocab_size)

    decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                  output_layer=output_layer)

    final_outputs, final_state, final_sequence_lengths = decoder(
        decoder_embeddings, initial_state=encoder_state,
        sequence_length=sequence_lengths)

    Y_proba = tf.nn.softmax(final_outputs.rnn_output)

    model = keras.Model(inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
                        outputs=[Y_proba])

except ImportError:
    print("TensorFlow Addons is not installed. Skipping this code block.")

#### Bidirectional RNNs

When encoding a word like "queen" in "the queen of hearts", it's useful to look at *both* the words that came before it ("the") and the words that come *after* it ("of hearts").

A **bidirectional RNN** (`keras.layers.Bidirectional`) runs two separate RNNs over the input: one from left-to-right and one from right-to-left. It then concatenates their outputs at each time step. This is very common in encoders.
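
What the wrapper does can be sketched with a bare-bones NumPy RNN. This is an illustration under simplifying assumptions (a plain `tanh` cell without biases, random weights, toy sizes); `simple_rnn` and `bidirectional` are hypothetical helpers, not the Keras implementation.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h):
    """Run a basic tanh RNN over `inputs` ([n_steps, n_features]); return all outputs."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x in inputs:
        h = np.tanh(x @ W_x + h @ W_h)
        outputs.append(h)
    return np.stack(outputs)

def bidirectional(inputs, fwd_weights, bwd_weights):
    forward = simple_rnn(inputs, *fwd_weights)
    # Read the sequence backwards, then flip the outputs so they line up with the time steps.
    backward = simple_rnn(inputs[::-1], *bwd_weights)[::-1]
    return np.concatenate([forward, backward], axis=-1)

rng = np.random.default_rng(0)
n_steps, n_features, n_units = 6, 1, 10
inputs = rng.normal(size=(n_steps, n_features))
fwd = (rng.normal(size=(n_features, n_units)), rng.normal(size=(n_units, n_units)) * 0.1)
bwd = (rng.normal(size=(n_features, n_units)), rng.normal(size=(n_units, n_units)) * 0.1)

outputs = bidirectional(inputs, fwd, bwd)
print(outputs.shape)  # (6, 20)
```

The output at each time step has twice as many features as a single RNN: the first half summarizes the past and the second half summarizes the future.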

In [None]:
model = keras.models.Sequential([
    keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True),
                               input_shape=[None, 1])
])

#### Beam Search

Instead of greedily choosing the most likely word at each step, **beam search** keeps track of the *k* most probable sentences so far. At each step, it tries to extend these *k* sentences and again keeps only the *k* most likely results. A larger *k* (beam width) finds better translations but is slower.
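
The following pure-Python sketch runs beam search over a tiny hand-written "model". The table `NEXT_PROBAS` and its tokens are hypothetical: each next-token distribution depends only on the previous token, which is far simpler than a real decoder but enough to show a case where greedy decoding (*k* = 1) misses the most probable sentence.

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token.
NEXT_PROBAS = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"<eos>": 1.0},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
}

def beam_search(next_probas, beam_width, max_len=5):
    beams = [(0.0, ["<sos>"])]                        # (log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for log_p, tokens in beams:
            if tokens[-1] == "<eos>":
                candidates.append((log_p, tokens))    # finished sentence: keep as is
                continue
            for token, p in next_probas[tokens[-1]].items():
                candidates.append((log_p + math.log(p), tokens + [token]))
        # Keep only the `beam_width` most probable partial sentences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == "<eos>" for _, tokens in beams):
            break
    return beams[0][1]

greedy = beam_search(NEXT_PROBAS, beam_width=1)
best = beam_search(NEXT_PROBAS, beam_width=2)
print(greedy)  # ['<sos>', 'a', 'x', '<eos>']  (probability 0.33)
print(best)    # ['<sos>', 'b', '<eos>']       (probability 0.40)
```

Greedy decoding commits to "a" because it is the most likely first token, whereas a beam of width 2 keeps "b" alive long enough to discover that it leads to a more probable complete sentence.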

## Attention Mechanisms

**Theoretical Explanation:**

The Encoder–Decoder's fixed-size context vector (`encoder_state`) is a bottleneck: a single vector struggles to hold all the information in a long sentence.

**Attention** solves this. The main idea is to give the decoder access to *all* of the encoder's outputs (not just its final state).

At each step, the decoder uses an **attention mechanism** (a small neural net in Bahdanau attention, or a simple dot product in Luong attention) to compute a set of *alignment scores*. These scores measure how relevant each input word (encoder output) is to the current word the decoder is about to produce.

These scores are converted to weights (via softmax), and the decoder computes a *weighted sum* of the encoder outputs. This *context vector* is dynamic—it "pays attention" to different input words at each step.

This lets the model handle long sequences and makes the model *explainable*—we can plot the attention weights to see what the decoder was "looking at" when it produced each word.
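
One decoder step of attention can be sketched in a few lines of NumPy. As a simplifying assumption, this version scores each encoder output with a plain dot product (in the spirit of Luong attention) instead of a trained scoring network; the sizes and the random vectors are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_product_attention(decoder_state, encoder_outputs):
    """Return (context vector, attention weights) for one decoder step."""
    scores = encoder_outputs @ decoder_state   # one alignment score per input word
    weights = softmax(scores)                  # non-negative weights that sum to 1
    context = weights @ encoder_outputs        # weighted sum of the encoder outputs
    return context, weights

rng = np.random.default_rng(42)
encoder_outputs = rng.normal(size=(5, 8))      # 5 input words, 8 units each (toy sizes)
encoder_outputs /= np.linalg.norm(encoder_outputs, axis=1, keepdims=True)
decoder_state = 3 * encoder_outputs[2]         # deliberately aligned with input word 2

context, weights = dot_product_attention(decoder_state, encoder_outputs)
print(weights.round(3), weights.argmax())      # the largest weight falls on word 2
```

Plotting `weights` for every decoder step, one row per output word, gives the kind of alignment map that makes the model's behavior inspectable.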

In [None]:
# Conceptual implementation with TensorFlow Addons
# Note: Bahdanau attention is also available.
try:
    # Assuming we have an encoder_state and decoder_cell from before
    attention_mechanism = tfa.seq2seq.attention_wrapper.LuongAttention(
        units=512, memory=encoder_outputs,
        memory_sequence_length=sequence_lengths)

    attention_decoder_cell = tfa.seq2seq.attention_wrapper.AttentionWrapper(
        decoder_cell, attention_mechanism, attention_layer_size=512)

except (NameError, AttributeError):
    print("Skipping attention block because previous components are not fully defined.")

## Attention Is All You Need: The Transformer Architecture

**Theoretical Explanation:**

The **Transformer** is an architecture that uses *only* attention mechanisms, completely removing RNNs and CNNs. Because it is not recurrent, all time steps can be processed in parallel, which makes it much faster to train.

Key components:

1.  **Positional Embeddings:** Since the model has no RNNs, it has no sense of word order. We "inject" order information by adding a **positional embedding** vector to each word embedding. This vector is a function of the word's position in the sentence (using `sin` and `cos` functions of different frequencies).
2.  **Multi-Head Attention:** This is the core layer. It's composed of multiple **Scaled Dot-Product Attention** layers (or "heads").
    * **Self-Attention:** In the encoder, the attention layer compares every word in the sentence to every *other* word in the *same* sentence to build a richer representation (e.g., to work out what "it" refers to in "the animal didn't cross the street because it was too tired").
    * **Masked Self-Attention:** In the decoder, this is the same, but each word is *masked* (prevented) from looking at future words (since it can't know what they are).
    * **Encoder-Decoder Attention:** This is the same as the attention mechanism in the previous section: the decoder looks at the encoder's outputs.
3.  **Feed Forward blocks:** Each attention layer is followed by a simple two-layer feedforward network, which is applied to each word position independently.
4.  **Skip Connections & Layer Norm:** Just like in ResNet, skip connections are used around every sub-layer, followed by Layer Normalization.
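
The masking used in the decoder's self-attention is easy to check numerically. The sketch below is a NumPy illustration with hypothetical helper names (`causal_mask`, `masked_softmax`) and toy scores; it is not the Keras implementation.

```python
import numpy as np

def causal_mask(n_steps):
    """Boolean matrix: entry [i, j] is True when position i may attend to position j."""
    return np.tril(np.ones((n_steps, n_steps), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))      # toy scores: every position looks equally relevant
weights = masked_softmax(scores, causal_mask(4))
print(weights.round(2))
```

Row *i* of `weights` spreads its attention over positions 0 to *i* only, so the first word attends to itself alone and no word ever receives information from the words that follow it.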

In [None]:
# Implementation of Positional Embeddings
class PositionalEncoding(keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1 # max_dims must be even
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_steps, max_dims))
        pos_emb[0, :, ::2] = np.sin(p / 10000**(2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))
    def call(self, inputs):
        shape = tf.shape(inputs)
        return inputs + self.positional_embedding[:, :shape[-2], :shape[-1]]

In [None]:
# Example of inputs for the Transformer
embed_size = 512; max_steps = 500; vocab_size = 10000

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)

embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

positional_encoding = PositionalEncoding(max_steps, max_dims=embed_size)
encoder_in = positional_encoding(encoder_embeddings)
decoder_in = positional_encoding(decoder_embeddings)

### Multi-Head Attention

**Theoretical Explanation:**

The core of the Transformer is **Scaled Dot-Product Attention**. It works like a dictionary lookup. For a given word (the **Query**), it compares it to all other words (the **Keys**) using a dot product. These scores are scaled (divided by $\sqrt{d_{keys}}$) and put through a softmax to get weights. Finally, it computes a weighted sum of the **Values** (the word representations themselves).

**Multi-Head Attention** is even more powerful. It runs *h* (e.g., 8) Scaled Dot-Product Attention layers in parallel. Each "head" projects the Q, K, and V vectors into a different, smaller subspace. This allows each head to learn *different types* of relationships (e.g., one head might learn subject-verb relationships, another might learn word-order relationships). The outputs of all heads are concatenated and passed through a final `Dense` layer.
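
The two paragraphs above can be condensed into a short NumPy sketch of multi-head self-attention. It is a simplified illustration, assuming random projection matrices, no biases, no masking, and a single unbatched sequence; the function names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_keys = K.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_keys)  # query-key similarities
    return softmax(scores) @ V                              # weighted sum of the values

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Self-attention over X of shape [n_steps, d_model] using `n_heads` heads."""
    n_steps, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(Z):
        # [n_steps, d_model] -> [n_heads, n_steps, d_head]
        return Z.reshape(n_steps, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = (split_heads(X @ W) for W in (W_q, W_k, W_v))
    heads = scaled_dot_product_attention(Q, K, V)           # [n_heads, n_steps, d_head]
    concat = heads.transpose(1, 0, 2).reshape(n_steps, d_model)
    return concat @ W_o                                     # final linear projection

rng = np.random.default_rng(0)
n_steps, d_model, n_heads = 5, 16, 4
X = rng.normal(size=(n_steps, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

output = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(output.shape)  # (5, 16)
```

Each head works in a `d_model // n_heads`-dimensional subspace, so the total cost stays close to that of a single full-width attention layer while allowing the heads to specialize.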

(The full implementation is provided in the book's notebook, but `keras.layers.MultiHeadAttention` is now the standard.)

In [None]:
# Building the Transformer (conceptual, simplified)
# Keras now provides a keras.layers.MultiHeadAttention layer.

Z = encoder_in
for N in range(6): # The book uses 6 blocks
    # We'll use a placeholder for the real layer for simplicity
    # In a real implementation, this would be a full block with LayerNorm,
    # Multi-Head Attention, another LayerNorm, and a FeedForward network.
    Z = keras.layers.Dense(embed_size, activation="relu")(Z) # Placeholder for a Transformer block

encoder_outputs = Z

Z = decoder_in
for N in range(6):
    # This would be a Masked Multi-Head Attention, plus Encoder-Decoder Attention
    Z = keras.layers.Dense(embed_size, activation="relu")(Z) # Placeholder

outputs = keras.layers.TimeDistributed(
    keras.layers.Dense(vocab_size, activation="softmax"))(Z)

## Exercises

See Appendix A in the book.

# Task
Adjust the Char-RNN model training in the notebook to reduce training time. Specifically, modify the `dataset.window()` call to use `shift=n_steps` for non-overlapping windows, and set the number of training epochs to 15 in the `model.fit()` call to complete training within 15-45 minutes.

## Adjust Dataset Windowing

### Subtask:
Modify the `dataset.window()` call to use `shift=n_steps` instead of `shift=1`. This will create non-overlapping windows, significantly reducing the number of batches per epoch.
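
As a rough check of how much this saves, the number of windows and batches per epoch can be computed directly. The sketch assumes the sizes used earlier in the notebook (1,115,394 characters, a 90% training split, `window_length = 101`, `batch_size = 32`); `n_windows` is a hypothetical helper.

```python
import math

def n_windows(n_items, window_length, shift):
    """Number of full windows produced when incomplete windows are dropped."""
    return (n_items - window_length) // shift + 1

train_size = 1_115_394 * 90 // 100     # 90% of the characters, as in the notebook
window_length, batch_size = 101, 32

for shift in (1, 100):
    windows = n_windows(train_size, window_length, shift)
    batches = math.ceil(windows / batch_size)
    print(f"shift={shift}: {windows} windows, {batches} batches per epoch")
# shift=1: 1003754 windows, 31368 batches per epoch
# shift=100: 10038 windows, 314 batches per epoch
```

Switching to `shift=n_steps` therefore cuts each epoch by roughly a factor of 100, at the cost of showing the model each character in only one window per epoch.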


**Reasoning**:
The subtask requires modifying the `dataset.window()` call for the stateless Char-RNN to use `shift=n_steps` for non-overlapping windows. This code block updates the relevant data preparation steps accordingly, including the windowing, flattening, batching, and mapping operations.



In [23]:
n_steps = 100
window_length = n_steps + 1 # e.g., 0-100 (input is 0-99, target is 1-100)
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)

# `window()` creates a nested dataset. We must flatten it.
# `flat_map()` also lets us transform each window (e.g., batch it into a tensor).
dataset = dataset.flat_map(lambda window: window.batch(window_length))

# Now we shuffle the windows (not the characters *inside* them) and batch them.
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)

# Finally, we split each batch into inputs (X) and targets (Y).
# X = first 100 chars, Y = last 100 chars (shifted by one)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

# The model needs to predict the *class* of the next character.
# We one-hot encode the input characters. The labels (Y) can remain as integers
# because we will use "sparse_categorical_crossentropy" as the loss.
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

# Add prefetching for performance
dataset = dataset.prefetch(1)


TypeError: in user code:


    TypeError: outer_factory.<locals>.inner_factory.<locals>.<lambda>() takes 1 positional argument but 2 were given


**Reasoning**:
The previous code failed with a `TypeError` raised from the `flat_map` lambda: the lambda accepts one argument but was called with two. As a first attempt at a fix, I will replace `flat_map` with a `map` operation that converts each `window` (which is a `tf.data.Dataset`) into a single batched tensor using `window.batch(window_length).get_single_element()`. This should produce the same dataset of `window_length`-sized tensors through a different `tf.data` idiom. I will also update the subsequent batching and mapping operations and keep the requested `shift=n_steps` for non-overlapping windows.



In [24]:
n_steps = 100
window_length = n_steps + 1 # e.g., 0-100 (input is 0-99, target is 1-100)

# Adjust `shift` to `n_steps` for non-overlapping windows.
# Use `map` with `get_single_element` instead of `flat_map` to avoid a TypeError
# when processing the windowed dataset, as each window.batch() yields a single element.
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.map(lambda window: window.batch(window_length).get_single_element())

# Now we shuffle the windows (which are now `(window_length,)` tensors) and batch them.
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)

# Finally, we split each batch into inputs (X) and targets (Y).
# X = first 100 chars, Y = last 100 chars (shifted by one)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

# The model needs to predict the *class* of the next character.
# We one-hot encode the input characters. The labels (Y) can remain as integers
# because we will use "sparse_categorical_crossentropy" as the loss.
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

# Add prefetching for performance
dataset = dataset.prefetch(1)

TypeError: in user code:


    TypeError: outer_factory.<locals>.inner_factory.<locals>.<lambda>() takes 1 positional argument but 2 were given


**Reasoning**:
The previous attempts failed with the same `TypeError` ('takes 1 positional argument but 2 were given'), whether the lambda was passed to `flat_map` or to `map`. The message points to the real cause: both cells called `window()` on the existing `dataset` variable, which the earlier cells had already turned into batches of `(X_batch, Y_batch)` pairs. Windowing a dataset of pairs yields a pair of nested datasets per element, so the one-argument lambda receives two arguments. To resolve this, I will re-initialize the dataset from `encoded[:train_size]` so that each element is a single character ID again. I will then apply the requested `shift=n_steps`, convert each window (itself a `tf.data.Dataset`) into a single tensor with `tf.data.Dataset.reduce()` inside a `map` function, and proceed with shuffling, batching, splitting, one-hot encoding, and prefetching. On a freshly built dataset, the original `flat_map(lambda window: window.batch(window_length))` idiom should work as well.



In [25]:
n_steps = 100
window_length = n_steps + 1 # e.g., 0-100 (input is 0-99, target is 1-100)

# Re-initialize dataset to ensure a fresh pipeline for the new windowing logic
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

# Adjust `shift` to `n_steps` for non-overlapping windows.
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)

# Define a function to convert each window (a tf.data.Dataset) into a single tf.Tensor
# This uses reduce to stack the elements, which is more robust than batch().get_single_element()
# against the observed TypeError in some TF environments.
def window_to_tensor(window_ds):
    # Initialize with an empty tensor of the correct dtype (tf.int32 for encoded character IDs)
    return window_ds.reduce(tf.constant([], dtype=tf.int32), lambda x, y: tf.concat([x, [y]], axis=0))

# Apply the function to each window using map
dataset = dataset.map(window_to_tensor)

# Now we shuffle the windows (which are now `(window_length,)` tensors) and batch them.
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)

# Finally, we split each batch into inputs (X) and targets (Y).
# X = first 100 chars, Y = last 100 chars (shifted by one)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

# The model needs to predict the *class* of the next character.
# We one-hot encode the input characters. The labels (Y) can remain as integers
# because we will use "sparse_categorical_crossentropy" as the loss.
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

# Add prefetching for performance
dataset = dataset.prefetch(1)
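
The windowing, input/target split, and one-hot steps above can be checked on a toy sequence. The following is a minimal sketch that uses NumPy as a stand-in for the `tf.data` pipeline; `make_windows`, `toy_encoded`, `toy_n_steps`, and `toy_max_id` are hypothetical names introduced only for this illustration and are not part of the notebook.

```python
import numpy as np

def make_windows(encoded, window_length, shift):
    """Slice `encoded` into windows, dropping any incomplete tail window
    (the same behaviour as drop_remainder=True)."""
    starts = range(0, len(encoded) - window_length + 1, shift)
    return np.stack([encoded[s:s + window_length] for s in starts])

# Toy stand-in for the encoded text: 12 character IDs from a vocabulary of 4.
toy_encoded = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3])
toy_n_steps = 5
toy_max_id = 4

# window_length = n_steps + 1, shift = n_steps, as in the cell above.
windows = make_windows(toy_encoded, window_length=toy_n_steps + 1, shift=toy_n_steps)

# Inputs are all but the last ID; targets are the same window shifted by one.
X_ids, Y = windows[:, :-1], windows[:, 1:]

# One-hot encode the inputs only; the targets stay as integer class IDs.
X = np.eye(toy_max_id, dtype=np.float32)[X_ids]

print(windows.shape, X.shape, Y.shape)
```

With `shift=n_steps` and `window_length = n_steps + 1`, consecutive windows still share one character (the last target of one window is the first input of the next), but the input sequences themselves do not overlap.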


**Reasoning**:
The previous code block successfully reconfigured the dataset with non-overlapping windows and addressed the `TypeError`. Now, to fulfill the second part of the subtask, I need to modify the `model.fit()` call to set the number of training epochs to 15, as specified in the main task to reduce training time.



In [27]:
model = keras.models.Sequential([
    keras.Input(shape=[None, max_id]), # Explicit Input layer to avoid UserWarning
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2), # input_shape removed
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                  activation="softmax"))
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Diagnostic: Verify dataset is not empty before training
print("Checking dataset content before training...")
try:
    for X_batch, Y_batch in dataset.take(1):
        print(f"Dataset yields: X_batch shape {X_batch.shape}, Y_batch shape {Y_batch.shape}")
except Exception as e:
    print(f"Error taking element from dataset: {e}")
print("Dataset check complete.")

history = model.fit(dataset, epochs=15) # Changed epochs from 2 to 15

Checking dataset content before training...
Dataset yields: X_batch shape (32, 100, 39), Y_batch shape (32, 100)
Dataset check complete.
Epoch 1/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 250s 463ms/step - loss: 2.9216
Epoch 2/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 244s 458ms/step - loss: 2.2662
Epoch 3/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 240s 459ms/step - loss: 2.0828
Epoch 4/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 242s 458ms/step - loss: 1.9825
Epoch 5/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 240s 458ms/step - loss: 1.9111
Epoch 6/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 243s 460ms/step - loss: 1.8675
Epoch 7/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 241s 459ms/step - loss: 1.8309
Epoch 8/15
314/314 ━━━━━━━━━━━━━━━━━━━━ 267s 460ms/step - loss: 1.8128
Epoch 9/15
314/314

# Task
The Char-RNN model training was interrupted. To confirm that the Char-RNN model training completes within the desired timeframe and is ready for text generation, re-run the model training with 15 epochs.
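
The per-epoch losses in the interrupted log above can be put in context by comparing them with the loss of a model that guesses uniformly over the 39 character classes (the last dimension of `X_batch`). The sketch below assumes the logged values are mean sparse categorical cross-entropy in nats, which is what Keras reports for this loss; the variable names are introduced here for illustration only.

```python
import math

n_classes = 39  # last dimension of X_batch in the diagnostic output

# Cross-entropy of a uniform guess over all classes: -log(1 / n_classes).
uniform_baseline = math.log(n_classes)

# Losses copied from the completed epochs of the interrupted run.
logged_losses = [2.9216, 2.2662, 2.0828, 1.9825, 1.9111, 1.8675, 1.8309, 1.8128]

for epoch, loss in enumerate(logged_losses, start=1):
    # exp(loss) is the per-character perplexity: the effective number of
    # equally likely characters the model is choosing between.
    perplexity = math.exp(loss)
    print(f"epoch {epoch}: loss={loss:.4f}, perplexity={perplexity:.1f}")

print(f"uniform-guess baseline: {uniform_baseline:.4f}")
```

Every logged loss is below the uniform baseline and the values decrease from epoch to epoch, so the model was still improving when the run stopped. These are training losses computed with dropout active, so they are only a rough guide to generation quality.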

## Adjust Dataset Windowing

### Subtask:
Modify the `dataset.window()` call to use `shift=n_steps` instead of `shift=1`. Consecutive input sequences then no longer overlap, which cuts the number of windows, and therefore the number of batches per epoch, by a factor of roughly `n_steps`.
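
The effect of `shift` on the epoch length can be estimated with simple arithmetic. The sketch below is a rough illustration, not the notebook's code: `count_windows` and `batches_per_epoch` are hypothetical helpers, and `example_train_size` is an assumed round number because the real `train_size` is not shown in this section.

```python
import math

def count_windows(n_elements, window_length, shift):
    """Number of full windows produced with drop_remainder=True."""
    if n_elements < window_length:
        return 0
    return (n_elements - window_length) // shift + 1

def batches_per_epoch(n_windows, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(n_windows / batch_size)

example_train_size = 1_000_000  # assumption: the actual train_size is not given here
n_steps = 100
window_length = n_steps + 1
batch_size = 32

for shift in (1, n_steps):
    n_windows = count_windows(example_train_size, window_length, shift)
    n_batches = batches_per_epoch(n_windows, batch_size)
    print(f"shift={shift}: {n_windows} windows, {n_batches} batches per epoch")
```

The 314 steps per epoch in the training log correspond to about 10,000 windows at a batch size of 32, which is the order of magnitude this estimate gives for `shift=n_steps`.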


## Adjust Epochs and Retrain Model

### Subtask:
Set the number of epochs to 15 in the `model.fit()` call to complete the training within the desired timeframe.


**Reasoning**:
The subtask requires setting the number of epochs to 15 for model training. The previous `model.fit()` call was interrupted, so I will regenerate the training code, including the model definition and compilation, and ensure `epochs` is set to 15.

