# Chapter 16: Natural Language Processing with RNNs and Attention

**Based on "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition" by Aurélien Géron**

This notebook reproduces the code from Chapter 16 and provides theoretical explanations for each concept, as required by the individual task.

## Chapter Summary

This chapter dives into the techniques required to process sequential data, with a special focus on Natural Language Processing (NLP).

1.  **Recurrent Neural Networks (RNNs) for Text:** We begin by building a **Character RNN (Char-RNN)** to generate original text, one character at a time. This introduces key data preparation techniques for sequential data, such as creating a `tf.data.Dataset` from text, using the `window()` method to create sequential windows, and building both **stateless** and **stateful RNNs**[cite: 943, 948].

2.  **Sentiment Analysis:** We then move from character-level to word-level models. We build an RNN to perform sentiment analysis on the IMDb movie review dataset. This section covers crucial NLP preprocessing steps, including:
    * Word-level tokenization.
    * Encoding words to integers using a vocabulary lookup table.
    * Using **masking** (via `mask_zero=True`) to handle variable-length sequences (i.e., padding) efficiently[cite: 955].
    * Using **word embeddings** (`Embedding` layer) to represent words as dense, trainable vectors[cite: 954].
    * Reusing **pretrained word embeddings** from TensorFlow Hub for transfer learning [cite: 957-958].

3.  **Neural Machine Translation (NMT):** The chapter introduces the **Encoder-Decoder** architecture, a powerful model for sequence-to-sequence (seq2seq) tasks like translation[cite: 959]. We explore:
    * The concept of an encoder (reading the source sentence) and a decoder (generating the target sentence).
    * **Bidirectional RNNs** to capture information from both directions of a sequence[cite: 963].
    * **Beam search** to improve prediction quality by keeping track of several candidate translations instead of just one[cite: 964].

4.  **Attention Mechanisms:** We explore the key limitation of the basic Encoder-Decoder model (the bottleneck of the final hidden state) and solve it with **attention**[cite: 966].
    * Attention allows the decoder to look back at the *entire* input sequence at each step, focusing on the most relevant parts.
    * We discuss the two main types: **Bahdanau (additive)** and **Luong (multiplicative)** attention[cite: 968].

5.  **The Transformer Architecture:** Finally, we move beyond RNNs entirely to the **Transformer**, the model introduced in the "Attention Is All You Need" paper[cite: 969]. This is the state-of-the-art architecture that powers models like BERT and GPT.
    * It relies exclusively on attention mechanisms (no recurrent or convolutional layers).
    * Key components include **Positional Embeddings** (to inject word order information) and **Multi-Head Attention** (which allows the model to pay attention to multiple things simultaneously)[cite: 970, 972].

The chapter concludes with a look at recent state-of-the-art language models like BERT, ELMo, and GPT-2, which are all built upon the concepts of pretraining and the Transformer architecture [cite: 975-976].

## Setup

First, let's import the necessary libraries and set up our environment.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os

# Common imports
import pandas as pd
import matplotlib.pyplot as plt

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Install tensorflow-addons for NMT section
# !pip install -q -U tensorflow-addons

## 1. Generating Shakespearean Text Using a Character RNN (Char-RNN)

We will start by building a model that predicts the next character in a text. This is called a Character RNN, or Char-RNN. Once trained, the model can be used to generate novel text, one character at a time, in the style of the original training data[cite: 943].

### Creating the Training Dataset

> **Theoretical Deep-Dive: Tokenization** 
> 
> Before we can feed text to a neural network, we must convert it into numbers. This process is called **tokenization**. We can tokenize at the **word level** (where each unique word gets an ID), the **sub-word level** (where words are broken up, e.g., "smartest" -> "smart" + "est"), or the **character level**.
> 
> For this model, we will use **character-level tokenization**. This is simple and has a very small vocabulary (only 39 unique characters in this case). It's also effective for learning grammar, punctuation, and even generating new 

In [None]:
# Download Shakespeare's work
shakespeare_url = "https://homl.info/shakespeare" # shortcut URL [cite: 944]
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

# View the first 100 characters
print(shakespeare_text[:100])

In [None]:
# Use Keras's Tokenizer to encode each character as an integer
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])  # [cite: 944]

# The tokenizer finds all unique characters and maps them to an ID, starting from 1
print(f"Unique characters: {len(tokenizer.word_index)}")
print(tokenizer.texts_to_sequences(["First"]))  # [cite: 945]
print(tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]]))  # [cite: 945]
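
To make the mapping less of a black box, here is a minimal pure-Python sketch of roughly what a character-level tokenizer does: lowercase the text, rank characters by frequency, and assign IDs starting from 1. The helper names (`build_char_vocab`, `encode`, `decode`) are illustrative and are not part of Keras.

```python
from collections import Counter

def build_char_vocab(text):
    """Map each distinct character to an integer ID, most frequent first, starting at 1."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_to_id):
    """Convert a string into a list of character IDs."""
    return [char_to_id[ch] for ch in text.lower()]

def decode(ids, char_to_id):
    """Convert a list of character IDs back into a string."""
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return "".join(id_to_char[i] for i in ids)

sample = "first citizen"
char_to_id = build_char_vocab(sample)
ids = encode("First", char_to_id)
print(ids, decode(ids, char_to_id))
```

Because IDs are assigned by frequency, the actual numbers depend on the corpus the vocabulary was built from.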

In [None]:
# Let's encode the full text. We subtract 1 to get IDs from 0 to 38 (instead of 1 to 39).
# This is useful because 0 is a natural ID for the first token (or for padding).
max_id = len(tokenizer.word_index) # number of distinct characters [cite: 945]

[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1  # [cite: 946]

# tokenizer.document_count would be 1 here, because fit_on_texts() received a list
# containing a single text, so we count the encoded characters instead.
dataset_size = len(encoded) # total number of characters [cite: 945]

print(f"max_id: {max_id}, dataset_size: {dataset_size}")
print(f"Encoded text shape: {encoded.shape}")

### How to Split a Sequential Dataset

We must split the data into training, validation, and test sets. For sequential data, we **cannot** shuffle the data randomly before splitting, as this would destroy the sequences. The patterns the RNN learns depend on the order of the data.

We must split the data along the time axis. We'll use the first 90% for training, the next 5% for validation, and the final 5% for testing[cite: 946].

In [None]:
train_size = int(dataset_size * 0.9)
train_set = tf.data.Dataset.from_tensor_slices(encoded[:train_size])  # [cite: 946]
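
The cell above only builds the training slice. As a small sketch of the full 90% / 5% / 5% split along the time axis, the hypothetical helper `time_split` below slices any sequence into three contiguous, non-overlapping parts (a plain list stands in for `encoded`).

```python
def time_split(sequence, train_frac=0.9, valid_frac=0.05):
    """Split a sequence into contiguous train/validation/test parts, preserving order."""
    n = len(sequence)
    train_end = int(n * train_frac)
    valid_end = train_end + int(n * valid_frac)
    return sequence[:train_end], sequence[train_end:valid_end], sequence[valid_end:]

encoded_demo = list(range(1000))   # stand-in for the encoded character IDs
train_part, valid_part, test_part = time_split(encoded_demo)
print(len(train_part), len(valid_part), len(test_part))
```

No element is shuffled, so each part is a continuous stretch of the original text.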

### Chopping the Sequential Dataset into Multiple Windows

> **Theoretical Deep-Dive: Truncated Backpropagation Through Time (TBPTT)** 
>
> We can't train the RNN on the full text sequence of 1 million characters. This would be like unrolling a network with 1 million layers, leading to an impossibly slow and unstable training process (due to the vanishing/exploding gradients problem).
>
> Instead, we use **Truncated Backpropagation Through Time**. We chop the long sequence into many shorter subsequences (or "windows"). The RNN is then unrolled *only* over the length of these short windows. This is what the `window()` method helps us do[cite: 947].

In [None]:
# We'll create windows of 101 characters.
# The RNN will be trained to predict the next character, given the previous 100.
# So, n_steps = 100.
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead [cite: 947]

# Use shift=1 to get overlapping windows. drop_remainder=True ensures all windows are the same size.
dataset = train_set.window(window_length, shift=1, drop_remainder=True)  # [cite: 947]

The `window()` method creates a *nested dataset* (a dataset of datasets). We must use `flat_map()` to flatten it into a dataset of tensors[cite: 947].

In [None]:
# This converts each window (which is a 'Dataset') into a tensor.
dataset = dataset.flat_map(lambda window: window.batch(window_length))  # [cite: 947]

# Now we shuffle the windows, batch them, and split them into inputs (X) and targets (Y)
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)  # [cite: 948]
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))  # [cite: 948]
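
To see what the windowing and the input/target split produce, the following pure-Python sketch mimics `window(size, shift, drop_remainder=True)` on a tiny sequence. The `windows` generator is a hypothetical stand-in, not the `tf.data` implementation.

```python
def windows(sequence, size, shift=1):
    """Yield fixed-size windows; incomplete trailing windows are dropped."""
    for start in range(0, len(sequence) - size + 1, shift):
        yield sequence[start:start + size]

text_ids = list(range(7))      # stand-in for encoded characters
n_steps_demo = 3
for w in windows(text_ids, n_steps_demo + 1, shift=1):
    X, Y = w[:-1], w[1:]       # target = input shifted one step ahead
    print(X, "->", Y)
```

With `shift=1` consecutive windows overlap almost entirely; with `shift=n_steps_demo` each window starts on the last element of the previous one.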

Finally, we need to encode the inputs. The targets (labels) can remain as character IDs (integers), because we will use `sparse_categorical_crossentropy` as the loss function. This loss function expects integer labels.

The inputs, however, must be encoded. We will use **one-hot encoding** because the vocabulary is very small (39 characters). An `Embedding` layer would also work but is overkill here[cite: 948].

In [None]:
# [batch_size, n_steps] -> [batch_size, n_steps, max_id]
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))  # [cite: 948]

# Add prefetching for performance
dataset = dataset.prefetch(1)  # [cite: 949]

### Building and Training the Char-RNN Model

> **Theoretical Deep-Dive: Char-RNN Architecture** 
>
> 1.  **Input:** A batch of one-hot encoded character sequences (shape: `[batch_size, n_steps, max_id]`).
> 2.  **Recurrent Layers (`GRU`):** We use `GRU` layers (a more efficient variant of `LSTM`). We stack two layers to learn more complex patterns. `return_sequences=True` is crucial. It tells the GRU layer to output its hidden state at *every time step*, not just the final one. This is required to feed the full sequence to the next layer (and to our final output layer)[cite: 948].
> 3.  **`TimeDistributed(Dense)` Layer:** The output of the final GRU layer has a shape of `[batch_size, n_steps, n_neurons]`. We need to predict the *next character* at *each time step*. Wrapping the `Dense` layer in `TimeDistributed` applies the *same* `Dense` layer to each of the `n_steps` time steps independently, which gives an output of shape `[batch_size, n_steps, max_id]`[cite: 948]. Note that a plain Keras `Dense` layer also operates only on the last axis, so it would produce the same result here; the wrapper mainly makes the intent explicit.
> 4.  **Softmax Activation:** This converts the outputs (logits) at each time step into probability distributions over the 39 possible characters[cite: 948].
> 5.  **Loss Function:** We use `sparse_categorical_crossentropy` because our labels (`Y_batch`) are sparse (integer IDs: `[batch_size, n_steps]`), but our model's outputs are probabilities (from softmax)[cite: 948].

In [None]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2), # [cite: 948]
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2), # [cite: 948]
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax")) # [cite: 948]
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")  # [cite: 949]
history = model.fit(dataset, epochs=10) # The book suggests 20, but 10 is faster for reproduction

### Generating Fake Shakespearean Text

> **Theoretical Deep-Dive: Text Generation & Temperature** 
>
> To generate text, we give the model a starting string (a "seed"), predict the next character, append that character to the string, and repeat the process.
>
> If we always pick the character with the *highest* probability (a greedy approach), the text becomes very repetitive and boring. 
>
> To get more interesting text, we sample from the probability distribution predicted by the model. We can control the randomness of this sampling using a **temperature** parameter[cite: 949].
> 1.  We divide the output logits by the `temperature`.
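> 2.  A **low temperature** (e.g., 0.2) makes the distribution sharper, so sampling favors the most likely characters, while a **high temperature** (e.g., 2) flattens it toward a uniform distribution, so the generated text becomes more random.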

In [None]:
# First, we need a function to preprocess the seed text
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)  # [cite: 949]

In [None]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :] # Get probabilities for the last time step
    # Divide logits by temperature
    rescaled_logits = tf.math.log(y_proba) / temperature  # [cite: 949]
    # Sample a character ID based on the new distribution
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1  # [cite: 949]
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text  # [cite: 950]

In [None]:
# Let's generate some text with different temperatures
print(complete_text("t", temperature=0.2))  # [cite: 950]
print(complete_text("w", temperature=1))  # [cite: 950]
print(complete_text("w", temperature=2))  # [cite: 950]
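
The effect of the temperature is easiest to see on a tiny, made-up distribution. The sketch below applies the same transformation as `next_char()` (log of the probabilities divided by the temperature, then softmax) using only the standard library; the probabilities `[0.6, 0.3, 0.1]` are illustrative, not model outputs.

```python
import math

def rescale(probas, temperature):
    """Return softmax(log(p) / temperature), the distribution that is actually sampled."""
    logits = [math.log(p) / temperature for p in probas]
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.6, 0.3, 0.1]
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in rescale(probas, t)])
```

A temperature of 1 leaves the distribution unchanged, values below 1 concentrate the mass on the most likely character, and values above 1 spread it out.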

## 2. Stateful RNN

> **Theoretical Deep-Dive: Stateful vs. Stateless RNNs** 
>
> -   **Stateless RNN (default):** At each training iteration, the model's hidden state is reset to zeros. It does not learn any patterns *between* batches. This is why we used overlapping windows (`shift=1`)—to show the model as many different transitions as possible[cite: 950].
> -   **Stateful RNN:** The model's final hidden state from one batch is preserved and used as the *initial* state for the next batch. This allows the RNN to learn much longer-term patterns, spanning across batches[cite: 950].
>
> **Data Requirement:** A stateful RNN *requires* that the batches are sequential and non-overlapping. The N-th sequence in a batch must be the direct continuation of the N-th sequence from the previous batch.

### Preparing the Dataset for a Stateful RNN

Preparing the data is the hardest part. The simplest way (as done in the book) is to use `batch_size=1`[cite: 951].

A more complex (but more efficient) way is to chop the text into `batch_size` equal parts and create a dataset that returns one batch of windows from each part at each step. We will reproduce the simpler `batch_size=1` version first.

In [None]:
# Simple sequential, non-overlapping dataset (as in the book)
dataset_stateful = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
# Use shift=n_steps for non-overlapping windows
dataset_stateful = dataset_stateful.window(window_length, shift=n_steps, drop_remainder=True)  # [cite: 951]
dataset_stateful = dataset_stateful.flat_map(lambda window: window.batch(window_length))  # [cite: 951]
dataset_stateful = dataset_stateful.batch(1) # batch_size=1 [cite: 951]
dataset_stateful = dataset_stateful.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset_stateful = dataset_stateful.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset_stateful = dataset_stateful.prefetch(1)

print(list(dataset_stateful.take(1)))

Here is the more complex, batched implementation. This is what you would typically do in a real project.

In [None]:
batch_size = 32
n_steps = 100
window_length = n_steps + 1

# Chop the training text into batch_size equal parts
text_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []
for text_part in text_parts:
    dataset = tf.data.Dataset.from_tensor_slices(text_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    datasets.append(dataset)

# Zip the datasets to create sequential batches
dataset_batched = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))

# Prepare inputs/targets and one-hot encode
dataset_batched = dataset_batched.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset_batched = dataset_batched.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset_batched = dataset_batched.prefetch(1)

print(list(dataset_batched.take(1)))
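
The layout a stateful RNN needs can be checked on a toy sequence. This pure-Python sketch (the `stateful_batches` helper is hypothetical and splits the sequence into equal parts, a simplification of `np.array_split`) shows that row N of each batch continues row N of the previous batch.

```python
def stateful_batches(sequence, batch_size, n_steps):
    """Split `sequence` into batch_size contiguous parts, then emit one window per part per step."""
    part_len = len(sequence) // batch_size
    parts = [sequence[i * part_len:(i + 1) * part_len] for i in range(batch_size)]
    n_batches = (part_len - 1) // n_steps       # each window needs n_steps + 1 items
    for b in range(n_batches):
        start = b * n_steps
        yield [part[start:start + n_steps + 1] for part in parts]

batches = list(stateful_batches(list(range(20)), batch_size=2, n_steps=3))
for batch in batches:
    print(batch)
```

Each window ends on the element that starts the next window in the same row, so the hidden state carried across batches always corresponds to the right piece of text.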

### Building and Training the Stateful Model

To build a stateful model, we must:
1.  Set `stateful=True` in every recurrent layer.
2.  Specify the `batch_input_shape` in the first layer (it must know the batch size)[cite: 951].
3.  Manually reset the model's states at the start of each epoch, so that it does not carry its state from the end of the text over to the beginning[cite: 951].

In [None]:
model_stateful = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True, # [cite: 951]
                     dropout=0.2, recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]), # [cite: 951]
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2), # [cite: 951]
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])

In [None]:
# Custom callback to reset the model's states at the beginning of each epoch
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()  # [cite: 952]

In [None]:
model_stateful.compile(loss="sparse_categorical_crossentropy", optimizer="adam")  # [cite: 952]
history_stateful = model_stateful.fit(dataset_batched, epochs=20,
                                      callbacks=[ResetStatesCallback()])  # [cite: 952]

After training, a stateful model can only be used to make predictions for batches of the same size. To avoid this, you can create an identical *stateless* model and copy the stateful model's weights to it.

## 3. Sentiment Analysis

Next, we'll build a model to classify IMDb movie reviews as positive (1) or negative (0). This is a sequence-to-vector task[cite: 952].

Instead of a Char-RNN, we will now process text at the **word level**. This requires several new preprocessing steps.

### Loading the IMDb Dataset

We can load the raw text from `tensorflow_datasets` (TFDS).

In [None]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)  # [cite: 954]
train_size = info.splits["train"].num_examples
print(f"Training set size: {train_size}")

for X_batch, y_batch in datasets["train"].batch(2).take(1):
    print(X_batch.numpy())
    print(y_batch.numpy())

### Text Preprocessing

We need to convert these text reviews (byte strings) into sequences of word IDs. We'll follow the book's example by creating a preprocessing pipeline using `tf.strings` operations. This is crucial because it means the preprocessing logic can be embedded *inside* the model, simplifying deployment[cite: 954].

1.  **Truncate:** Keep only the first 300 characters (strictly speaking, the first 300 bytes, since `tf.strings.substr` counts bytes by default).
2.  **Clean:** Remove `<br />` tags and non-letter characters.
3.  **Split:** Split the text by spaces.
4.  **Pad:** Convert the resulting `RaggedTensor` to a dense `Tensor` by padding with `"<pad>"`.

In [None]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)  # [cite: 954]
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")  # [cite: 954]
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")  # [cite: 954]
    X_batch = tf.strings.split(X_batch)  # [cite: 954]
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch  # [cite: 954]

### Building the Vocabulary

Next, we need to build a vocabulary by counting all word occurrences in the training set[cite: 955].

In [None]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))  # [cite: 955]

print("Top 3 most common words:", vocabulary.most_common()[:3])

The vocabulary is huge. We'll truncate it to the 10,000 most common words[cite: 955].

In [None]:
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]  # [cite: 955]

# Create a lookup table. We add 1000 'out-of-vocabulary' (oov) buckets.
# Any word not in our 10k vocab will be hashed into one of these 1000 buckets.
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)  # [cite: 955]
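
The idea behind the out-of-vocabulary buckets can be sketched without TensorFlow. In the toy version below, known words get their rank as ID and unknown words are hashed into a bucket placed after the vocabulary. The `build_lookup` helper is hypothetical, and `zlib.crc32` is only a deterministic stand-in for the hash function TensorFlow actually uses.

```python
import zlib
from collections import Counter

def build_lookup(tokens, vocab_size, num_oov_buckets):
    """Return a function mapping a word to an ID in [0, vocab_size + num_oov_buckets)."""
    vocab = [w for w, _ in Counter(tokens).most_common(vocab_size)]
    word_to_id = {w: i for i, w in enumerate(vocab)}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # Unknown word: hash it into one of the OOV buckets that follow the vocabulary.
        return len(vocab) + zlib.crc32(word.encode("utf-8")) % num_oov_buckets

    return lookup

tokens = "the movie was great the plot was thin the movie the end".split()
lookup = build_lookup(tokens, vocab_size=3, num_oov_buckets=2)
print([lookup(w) for w in ["the", "was", "movie", "unseen"]])
```

Two different unknown words may land in the same bucket; using more buckets reduces such collisions at the cost of a larger embedding table.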

### Creating the Final Dataset Pipeline

Now we can create our final training pipeline. It will batch, preprocess, and encode the reviews.

In [None]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch  # [cite: 956]

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)  # [cite: 956]

# We also create the test set
test_set = datasets["test"].batch(32).map(preprocess)
test_set = test_set.map(encode_words).prefetch(1)

### Building the Model with an Embedding Layer

> **Theoretical Deep-Dive: Word Embeddings** 
>
> We could one-hot encode the word IDs, but with a vocabulary of 11,000, this would create *massive*, *sparse* vectors, which is computationally inefficient.
>
> A better solution is to use an **Embedding Layer**[cite: 956]. This layer is a trainable lookup table. It maps each word ID to a dense vector of a fixed size (e.g., 128 dimensions). 
>
> -   `input_dim`: The size of the vocabulary (10k + 1k oov buckets).
> -   `output_dim`: The size of the dense embedding vector (a hyperparameter).
>
> Initially, these vectors are random. During training, the model learns to place similar words close together in this "embedding space." For example, "good" and "great" will end up having similar vectors, while "good" and "terrible" will be far apart. This is a form of **representation learning**.
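
As a small illustration of the "trainable lookup table" view, the sketch below shows that an embedding lookup returns the same vector as multiplying a one-hot vector by the embedding matrix, just without the wasted multiplications. The sizes and names are made up for the example.

```python
import random

random.seed(0)
vocab_size_demo, embed_size_demo = 5, 3
# One row per word ID (random at initialization, as described above).
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embed_size_demo)]
                    for _ in range(vocab_size_demo)]

def embed(word_ids):
    """Embedding lookup: pick row i of the matrix for each word ID i."""
    return [embedding_matrix[i] for i in word_ids]

def one_hot_matmul(word_id):
    """The same result computed the expensive way: one-hot vector times the matrix."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(vocab_size_demo)]
    return [sum(one_hot[i] * embedding_matrix[i][j] for i in range(vocab_size_demo))
            for j in range(embed_size_demo)]

print(embed([2, 0, 2]))
```

During training, only the rows that were looked up receive gradient updates, which is what moves related words closer together over time.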

In [None]:
embed_size = 128
total_vocab_size = vocab_size + num_oov_buckets

model = keras.models.Sequential([
    keras.layers.Embedding(total_vocab_size, embed_size,
                           input_shape=[None]), # [cite: 956]
    keras.layers.GRU(128, return_sequences=True), # [cite: 956]
    keras.layers.GRU(128), # [cite: 956]
    keras.layers.Dense(1, activation="sigmoid") # [cite: 956]
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # [cite: 956]
model.summary()

In [None]:
# Train the model
history = model.fit(train_set, epochs=5, validation_data=test_set)

### Masking

> **Theoretical Deep-Dive: Masking** 
>
> Our input batches are padded with the `"<pad>"` token. It maps to ID 0 only because it is the most frequent token in the vocabulary, so it comes first in the lookup table. We don't want the model to learn anything from this padding; we want it to be ignored.
>
> By setting `mask_zero=True` in the `Embedding` layer, we tell it that the token with ID 0 is a padding token. The layer will generate a **mask** (a boolean tensor) and propagate it to all subsequent layers[cite: 955]. 
>
> Recurrent layers like `GRU` and `LSTM` know how to handle this mask: they will simply ignore the time steps where the mask is `False` (i.e., where the token was a pad token)[cite: 956].
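
What "ignoring" a time step means can be shown with a toy recurrent update. In this sketch (a made-up cell `h = 0.5 * h + x`, not a GRU), masked steps simply carry the previous state forward, so a padded sequence ends in the same state as the unpadded one.

```python
def final_state(inputs, mask):
    """Toy recurrent cell h = 0.5 * h + x that skips masked-out time steps."""
    h = 0.0
    for x, keep in zip(inputs, mask):
        if keep:
            h = 0.5 * h + x
        # when keep is False, the previous state is carried forward unchanged
    return h

PAD = 0
sequence = [3, 1, 4]
padded = sequence + [PAD, PAD]
mask = [token != PAD for token in padded]

print(final_state(sequence, [True] * 3), final_state(padded, mask))
```

Without the mask, the two padding steps would keep decaying the state and the model would see a different summary of the same review.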

In [None]:
model_masked = keras.models.Sequential([
    keras.layers.Embedding(total_vocab_size, embed_size,
                           mask_zero=True, # <= This is the key change [cite: 955]
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model_masked.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model_masked.summary()

In [None]:
# Train the masked model
history_masked = model_masked.fit(train_set, epochs=5, validation_data=test_set)

### Reusing Pretrained Embeddings

Instead of learning embeddings from scratch (which requires a lot of data), we can use **pretrained embeddings** trained on a massive text corpus (like all of Wikipedia or Google News). We can do this easily using the **TensorFlow Hub** library[cite: 957].

In [None]:
# !pip install -q tensorflow_hub
import tensorflow_hub as hub

# This model from TF Hub is a sentence encoder. 
# It takes strings as input and outputs a 50-dimensional embedding vector for the whole sentence.
# It handles all preprocessing (tokenization, embedding lookup, averaging) internally.
model_pretrained = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                     dtype=tf.string, input_shape=[], output_shape=[50]), # [cite: 958]
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])  # [cite: 958]

model_pretrained.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # [cite: 958]
model_pretrained.summary()

In [None]:
# Create a new dataset that doesn't do any preprocessing
train_set_simple = datasets["train"].batch(32).prefetch(1)
test_set_simple = datasets["test"].batch(32).prefetch(1)

history_pretrained = model_pretrained.fit(train_set_simple, epochs=5, validation_data=test_set_simple)

## 4. Neural Machine Translation (NMT)

Here we build a model to translate from one language (e.g., English) to another (e.g., French). This is a **sequence-to-sequence (seq2seq)** task.

> **Theoretical Deep-Dive: The Encoder-Decoder Architecture** 
>
> A simple seq2seq RNN is not ideal for translation because it tries to translate as it reads. A better approach is the **Encoder-Decoder** model[cite: 959]:
>
> 1.  An **Encoder** RNN reads the source sentence (e.g., "I drink milk") and compresses it into a final hidden state (a vector). This vector is often called the **context vector** or **thought vector**.
> 2.  A **Decoder** RNN takes this context vector as its *initial hidden state* and generates the target sentence word by word (e.g., "Je bois du lait").
>
> **Training (Teacher Forcing):** As shown in Figure 16-3[cite: 960], the decoder is fed the *actual* target sentence from the training data, shifted one step to the right (i.e., it's fed the `<sos>` (start-of-sequence) token first, then "Je", then "bois", etc.). It is trained to predict the *next* word at each step.
>
> **Inference (Making Predictions):** As shown in Figure 16-4[cite: 961], we don't have the target sentence. So, we feed the decoder the `<sos>` token, it predicts the first word ("Je"), then we feed *that* word back into it to predict the second word ("bois"), and so on, until it predicts an `<eos>` (end-of-sequence) token.
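
The difference between the two regimes can be sketched in a few lines of plain Python. The `toy_decoder` table below is a hypothetical stand-in for a trained decoder; the point is only to show how decoder inputs are built during training and how predictions are fed back in at inference time.

```python
SOS, EOS = "<sos>", "<eos>"

def teacher_forcing_pairs(target_words):
    """Decoder inputs are the targets shifted right by one step, starting with <sos>."""
    decoder_inputs = [SOS] + target_words
    decoder_targets = target_words + [EOS]
    return decoder_inputs, decoder_targets

def greedy_decode(next_word, max_len=10):
    """Inference loop: feed each predicted word back in until <eos> is produced."""
    words, previous = [], SOS
    for _ in range(max_len):
        previous = next_word(previous)
        if previous == EOS:
            break
        words.append(previous)
    return words

# Hypothetical stand-in for a trained decoder: a fixed next-word table.
toy_decoder = {SOS: "Je", "Je": "bois", "bois": "du", "du": "lait", "lait": EOS}

print(teacher_forcing_pairs(["Je", "bois", "du", "lait"]))
print(greedy_decode(toy_decoder.get))
```

The `max_len` cap guards against a decoder that never emits the end-of-sequence token.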

The book uses the `tensorflow_addons` library to build the NMT model. We will reproduce the architecture snippet from the book[cite: 962].

*(Note: This is an architecture snippet and requires a pre-processed dataset (like the one in the book's notebook) to be runnable.)*

In [None]:
import tensorflow_addons as tfa

# Assume we have vocab_size and embed_size defined
vocab_size = 10000
embed_size = 128
n_units = 512

# ENCODER
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

embeddings = keras.layers.Embedding(vocab_size, embed_size) 
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

encoder = keras.layers.LSTM(n_units, return_state=True)  # [cite: 962]
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]

# DECODER
sampler = tfa.seq2seq.sampler.TrainingSampler()  # [cite: 962]

decoder_cell = keras.layers.LSTMCell(n_units)  # [cite: 962]
output_layer = keras.layers.Dense(vocab_size)  # [cite: 962]

decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)  # [cite: 962]

final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths)  # [cite: 962]

Y_proba = tf.nn.softmax(final_outputs.rnn_output)

model_nmt = keras.Model(
    inputs=[encoder_inputs, decoder_inputs, sequence_lengths],
    outputs=[Y_proba])  # [cite: 962]

model_nmt.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

print("NMT Model Built Successfully")

### Bidirectional RNNs

> **Theoretical Deep-Dive:** 
> 
> When processing a word, a normal RNN only has access to the words that came *before* it. For many NLP tasks (like translation), it's incredibly useful to also know the words that come *after* it. 
> 
> A **Bidirectional RNN** consists of two RNNs: one reads the sequence from left-to-right, and the other reads it from right-to-left. At each time step, their outputs are concatenated. This gives the model a richer representation of each word *in its full context*[cite: 963].

In [None]:
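# Standalone snippet: wrap a recurrent layer in Bidirectional. It is not added to the
# sentiment model above, whose last layer (Dense) no longer outputs a sequence.
bidirectional_gru = keras.layers.Bidirectional(
    keras.layers.GRU(10, return_sequences=True)) # [cite: 963]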

### Beam Search

> **Theoretical Deep-Dive:** 
>
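> At inference time, a "greedy" decoder just picks the single most probable word at each step. This can lead to errors: for example, "Comment vas-tu?" may be translated as "How *will* you?" because "will" looked most likely right after "How", and the decoder cannot go back and fix that choice.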
>
> **Beam Search** is a better approach. It keeps track of the *k* most probable sentences (where *k* is the **beam width**) at each step. It explores multiple hypotheses and is much more likely to find a good translation[cite: 964].
>
> `tfa.seq2seq.beam_search_decoder.BeamSearchDecoder` can be used to implement this.
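
To make the contrast with greedy decoding concrete, here is a minimal beam search over a toy next-word table. The probabilities in `NEXT` are invented for illustration and the `beam_search` function is a sketch, not the `tensorflow_addons` implementation.

```python
import math

# Hypothetical next-word probabilities keyed by the previous word (a toy bigram model).
NEXT = {
    "<sos>": {"How": 1.0},
    "How": {"will": 0.6, "are": 0.4},
    "will": {"you": 0.6, "it": 0.4},
    "are": {"you": 1.0},
    "you": {"<eos>": 1.0},
    "it": {"<eos>": 1.0},
}

def beam_search(beam_width, max_len=6):
    """Return (words, probability) of the best hypothesis found with the given beam width."""
    beams = [(0.0, ["<sos>"])]                  # (log probability, words so far)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            if words[-1] == "<eos>":            # finished hypotheses are kept as they are
                candidates.append((logp, words))
                continue
            for word, p in NEXT[words[-1]].items():
                candidates.append((logp + math.log(p), words + [word]))
        # keep only the beam_width most probable hypotheses
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(words[-1] == "<eos>" for _, words in beams):
            break
    best_logp, best_words = beams[0]
    return best_words[1:-1], math.exp(best_logp)

print(beam_search(beam_width=1))   # beam width 1 is greedy decoding
print(beam_search(beam_width=2))
```

With a beam width of 1 the search commits to "will" and ends with a less probable sentence; with a width of 2 it keeps "How are" alive long enough to find the more probable one.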

## 5. Attention Mechanisms

> **Theoretical Deep-Dive: The Bottleneck Problem** 
>
> A basic Encoder-Decoder model must compress the *entire* source sentence into a single context vector (the encoder's final state). This is a major bottleneck. Information about the first words can be lost by the time the encoder is done. This makes the model struggle with long sentences.
>
> **Attention** solves this[cite: 966]. The core idea is to allow the decoder to look back at the encoder's outputs (its hidden states from *every* time step) during decoding. At each step, the decoder decides which input words are most relevant for generating the *next* output word.
>
> This creates a "shortcut" between the decoder and the relevant input words, bypassing the bottleneck. We will reproduce the architecture for **Luong Attention**[cite: 968].
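
The core computation for a single decoder step can be sketched in plain Python. This is the simplest dot-product form of multiplicative attention, without the learned projection that a full Luong attention layer would add; the vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_attention(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder step using dot-product scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, output))
              for output in encoder_outputs]
    weights = softmax(scores)
    size = len(encoder_outputs[0])
    context = [sum(w * output[j] for w, output in zip(weights, encoder_outputs))
               for j in range(size)]
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one vector per source word
decoder_state = [0.0, 5.0]                                # current decoder hidden state
weights, context = dot_attention(decoder_state, encoder_outputs)
print([round(w, 3) for w in weights])
```

The context vector is recomputed at every decoder step, so each output word can draw on a different part of the source sentence.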

In [None]:
# This snippet shows how to wrap a decoder cell with LuongAttention
# (Assumes decoder_cell is defined, and that encoder_outputs holds the encoder's output at
# every time step, i.e. the encoder LSTM was built with return_sequences=True)

attention_mechanism = tfa.seq2seq.attention_wrapper.LuongAttention(
    units=n_units, memory=encoder_outputs) # [cite: 968]

attention_decoder_cell = tfa.seq2seq.attention_wrapper.AttentionWrapper(
    decoder_cell, attention_mechanism, attention_layer_size=n_units)  # [cite: 968]

## 6. The Transformer Architecture

The 2017 paper "Attention Is All You Need" introduced the **Transformer**, an architecture that *completely removes* RNNs and relies *only* on attention mechanisms[cite: 969]. This is the basis for most state-of-the-art models today (like BERT and GPT).

> **Theoretical Deep-Dive: Transformer Components** 
>
> 1.  **Positional Embeddings:** Since the model has no RNNs, it has no sense of word order. We must inject word position information. We do this by adding a vector (a **Positional Embedding**) to each word embedding. The Transformer uses a fixed (non-trainable) pattern of sines and cosines of different frequencies[cite: 970].
> 2.  **Multi-Head Attention:** This is the core of the Transformer. It's composed of multiple **Scaled Dot-Product Attention** layers. 
>     -   **Scaled Dot-Product Attention:** For each word, the model learns three vectors: a **Query (Q)**, a **Key (K)**, and a **Value (V)**. The Query (a word's question) is compared (via dot product) to every other word's Key. The resulting scores are scaled and softmaxed to get attention weights. These weights are then used to get a weighted sum of all the Values. This is the new representation for that word[cite: 971].
>     -   **Multi-Head:** The model does this multiple times in parallel (e.g., 8 "heads"). Each head can learn to pay attention to different things. The results are concatenated and fed to a final `Dense` layer[cite: 972].
> 
> The Transformer's encoder and decoder stacks use these layers to build rich, context-aware representations for each word.
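
For reference, the two formulas behind these components can be written out as follows. They are taken from the "Attention Is All You Need" paper rather than from this notebook; $d_{keys}$ denotes the dimension of the key vectors, $p$ the word position, $i$ the index of a sine/cosine pair, and $d$ the embedding size.

```latex
\operatorname{Attention}(Q, K, V) =
  \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_{keys}}}\right) V

P_{p,\,2i} = \sin\!\left(\frac{p}{10000^{2i/d}}\right), \qquad
P_{p,\,2i+1} = \cos\!\left(\frac{p}{10000^{2i/d}}\right)
```

The scaling by $\sqrt{d_{keys}}$ keeps the dot products from growing with the vector size, which would otherwise push the softmax into regions with very small gradients.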

In [None]:
# This custom layer implements the fixed sine/cosine positional embeddings [cite: 971]
class PositionalEncoding(keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1 # max_dims must be even
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_steps, max_dims))
        pos_emb[0, :, ::2] = np.sin(p / 10000**(2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))
    def call(self, inputs):
        shape = tf.shape(inputs)
        return inputs + self.positional_embedding[:, :shape[-2], :shape[-1]] # [cite: 971]
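
To see the pattern this layer adds to the embeddings, the same formula can be evaluated for a tiny table in plain Python. This is only an illustrative sketch: the standalone `positional_encoding` function below is a hypothetical helper (not part of the book's code) and it assumes `max_dims` is even.

```python
import math

def positional_encoding(max_steps, max_dims):
    """Return a max_steps x max_dims table of sine/cosine position codes."""
    table = []
    for p in range(max_steps):           # p: position in the sequence
        row = []
        for i in range(max_dims // 2):   # i: index of the (sin, cos) pair
            angle = p / 10000 ** (2 * i / max_dims)
            row.append(math.sin(angle))  # even dimension 2i
            row.append(math.cos(angle))  # odd dimension 2i + 1
        table.append(row)
    return table

table = positional_encoding(max_steps=4, max_dims=6)
for p, row in enumerate(table):
    print(p, [round(v, 3) for v in row])
```

Position 0 always maps to alternating 0s and 1s (sin 0 and cos 0), and later dimensions oscillate more slowly than earlier ones, which gives every position a distinct code.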

In [None]:
# Base building blocks for a Transformer-like encoder-decoder.
# Note: the book's code uses keras.layers.Attention (single-head attention).
# The newer keras.layers.MultiHeadAttention layer implements multi-head attention directly.
# This cell reproduces the conceptual snippet from the book[cite: 973].

embed_size = 512; max_steps = 500; vocab_size = 10000; n_units = 512

# Create inputs and embeddings (as in the book)
encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)
positional_encoding = PositionalEncoding(max_steps, max_dims=embed_size)
encoder_in = positional_encoding(encoder_embeddings)
decoder_in = positional_encoding(decoder_embeddings)

# --- Simplified Transformer-like structure ---
# (A single attention layer per step: no stacked blocks, multi-head attention,
# skip connections, layer normalization, or feed-forward modules from the paper)
# The book uses keras.layers.Attention to show the core idea.

# Encoder Self-Attention
Z = encoder_in
Z = keras.layers.Attention(use_scale=True)([Z, Z]) # [cite: 973]
encoder_outputs = Z

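# Decoder Masked Self-Attention
# (causal=True is a constructor argument in the TF version used by the book;
# newer Keras versions deprecate it in favor of passing use_causal_mask=True
# when calling the layer)
Z = decoder_in
Z = keras.layers.Attention(use_scale=True, causal=True)([Z, Z]) # [cite: 973]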

# Decoder Encoder-Decoder Attention (cross-attention)
Z = keras.layers.Attention(use_scale=True)([Z, encoder_outputs]) # [cite: 973]

# Final Output Layer
outputs = keras.layers.TimeDistributed(
    keras.layers.Dense(vocab_size, activation="softmax"))(Z) # [cite: 973]

model_transformer_simple = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[outputs])
print("Simple Transformer-like model built successfully.")

## 7. Recent Innovations in Language Models

> **Theoretical Deep-Dive:** 
>
> The end of the chapter highlights recent language models that rely on massive-scale **unsupervised pretraining**. ELMo and ULMFiT are built on LSTMs, while GPT and BERT build on the Transformer architecture.
>
> -   **ELMo (Embeddings from Language Models):** Generates *contextualized* word embeddings. The embedding for "queen" is different in "queen bee" vs. "Queen of England"[cite: 975].
> -   **ULMFiT (Universal Language Model Fine-Tuning):** Showed that pretraining an LSTM on a large corpus (like Wikipedia) and then fine-tuning it on a specific task (like sentiment analysis) could achieve state-of-the-art results, even with very little labeled data[cite: 975].
> -   **GPT (Generative Pre-Training):** Used the Transformer's *decoder* stack for pretraining. It was pretrained on a massive dataset to predict the next word. GPT-2, its successor, showed it could perform many tasks *without any fine-tuning* (called **zero-shot learning**)[cite: 975].
> -   **BERT (Bidirectional Encoder Representations from Transformers):** Used the Transformer's *encoder* stack. Unlike GPT, it's bidirectional (it sees text from both left and right). It was pretrained on two novel tasks[cite: 976]:
>     1.  **Masked Language Model (MLM):** Each word has a 15% probability of being selected and hidden (typically replaced by a `<mask>` token), and the model must predict the original words.
>     2.  **Next Sentence Prediction (NSP):** The model reads two sentences and predicts if the second one is the actual next sentence or just a random sentence.
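
The two BERT pretraining tasks can be illustrated with a small data-preparation sketch in plain Python. The helper names `mask_tokens` and `make_nsp_example`, the example sentences, and the whitespace tokenization are all assumptions made for this illustration. The masking is simplified to the rule stated above (every selected word becomes `<mask>`), and the 50/50 split between true and random next sentences is the usual setup rather than something stated in this notebook.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Return (masked_tokens, labels); labels hold the hidden word or None."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)      # the model must recover this word
        else:
            masked.append(token)
            labels.append(None)       # position ignored by the MLM loss
    return masked, labels

def make_nsp_example(sentences, index, rng=None):
    """Return (first, second, is_next) built from a list of >= 3 sentences."""
    rng = rng or random.Random()
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    # Negative example: a sentence that is neither `first` nor its true successor
    candidates = [s for j, s in enumerate(sentences)
                  if j not in (index, index + 1)]
    return first, rng.choice(candidates), False

rng = random.Random(42)  # fixed seed so the run is repeatable
tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, rng=rng)

sentences = ["I went to the store.", "I bought some milk.", "Penguins cannot fly."]
first, second, is_next = make_nsp_example(sentences, index=0, rng=rng)

print(masked, labels)
print(first, "|", second, "|", is_next)
```

In real pretraining the tokens would be subword IDs and the labels would feed a cross-entropy loss computed only at the masked positions.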