# Chapter 16: Natural Language Processing with RNNs and Attention

## 1. Chapter Overview
**Goal:** Natural Language Processing (NLP) is one of the most active fields in AI. In this chapter, we build models that generate text (in the style of Shakespeare), translate between languages (English to Spanish), and use **Attention Mechanisms** to focus on the relevant context. The chapter culminates with the **Transformer** architecture, the foundation of modern Large Language Models.

**Key Concepts:**
* **Char-RNN:** Generating text character by character.
* **Stateful RNNs:** Preserving hidden state across batches for long sequences.
* **Sentiment Analysis:** Classifying text (positive/negative).
* **Encoder-Decoder Network:** The standard architecture for Neural Machine Translation (NMT).
* **Bidirectional RNNs:** Reading text forward and backward.
* **Beam Search:** Finding the most likely sequence of words, not just the greedy choice.
* **Attention Mechanisms:** Allowing the decoder to "focus" on specific parts of the input sentence.
* **The Transformer:** An architecture based entirely on Attention (Self-Attention, Multi-Head Attention), abandoning RNNs.

**Practical Skills:**
* Building a character-level text generation model.
* Implementing a basic Encoder-Decoder for translation.
* Using **TensorFlow Addons (tfa)** for Seq2Seq models.
* Implementing **Positional Encoding** and **Multi-Head Attention** from scratch.

In [None]:
# Setup
import sys
assert sys.version_info >= (3, 5)

import sklearn
assert sklearn.__version__ >= "0.20"

import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
tf.random.set_seed(42)

print("TensorFlow version:", tf.__version__)

## 2. Theoretical Explanation (In-Depth)

### 1. Char-RNN & Stateful RNNs
To generate text, we can train an RNN to predict the next character given a sequence of previous characters. 
* **Stateless RNN:** The hidden state is reset to zero at the start of every batch. Good for independent sentences.
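* **Stateful RNN:** The final hidden state for batch $i$ is used as the initial state for batch $i+1$. This lets the network learn patterns that span batches (e.g., long chapters), provided each sequence in batch $i+1$ continues exactly where the corresponding sequence in batch $i$ left off, so the windows must be consecutive and unshuffled.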
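
The difference between the two modes can be illustrated with a minimal NumPy sketch. It is not the Keras implementation: the single tanh cell, its random weights (`W_x`, `W_h`), and the helper `run_rnn` are all assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_units = 3, 4

# Hypothetical, randomly initialised weights for a single tanh RNN cell
W_x = rng.normal(scale=0.5, size=(n_inputs, n_units))
W_h = rng.normal(scale=0.5, size=(n_units, n_units))

def run_rnn(sequence, initial_state):
    """Run the cell over a (steps, n_inputs) sequence and return the final state."""
    h = initial_state
    for x_t in sequence:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

text = rng.normal(size=(6, n_inputs))    # one long sequence
chunk_1, chunk_2 = text[:3], text[3:]    # two consecutive "batches"
zero_state = np.zeros(n_units)

# Reference: one uninterrupted pass over the whole sequence
full_pass = run_rnn(text, zero_state)

# Stateful: the final state of chunk 1 initialises chunk 2
stateful = run_rnn(chunk_2, run_rnn(chunk_1, zero_state))

# Stateless: the state is reset to zero before chunk 2
stateless = run_rnn(chunk_2, zero_state)

print("stateful matches full pass: ", np.allclose(stateful, full_pass))
print("stateless matches full pass:", np.allclose(stateless, full_pass))
```

Carrying the state over reproduces the uninterrupted pass exactly, while resetting it discards everything the cell saw in the first chunk.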

### 2. Encoder-Decoder Architecture
This architecture is the standard choice for sequence-to-sequence tasks such as translation.
* **Encoder:** An RNN that reads the input sentence (e.g., "Hello") and condenses it into a single vector (the final hidden state).
* **Decoder:** An RNN that takes that vector and generates the translation ("Hola") step-by-step.
* **Problem:** Condensing a long sentence into a single vector causes information loss (the "bottleneck" problem).

### 3. Attention Mechanisms
Bahdanau et al. (2014) introduced Attention to solve the bottleneck. Instead of just sending the final state to the decoder, we send *all* encoder outputs. At each step, the decoder calculates a weighted sum of these outputs, focusing (attending) on the words most relevant to the current word it is generating.
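
The weighted sum described above can be sketched in a few lines of NumPy. This is a rough illustration of additive (Bahdanau-style) scoring for a single decoder step, not a trained model: the sizes and the parameters `W_enc`, `W_dec`, and `v` are random placeholders chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, enc_units, dec_units, attn_units = 5, 8, 8, 6

encoder_outputs = rng.normal(size=(n_steps, enc_units))  # one vector per source word
decoder_state = rng.normal(size=(dec_units,))            # decoder's previous hidden state

# Hypothetical, untrained parameters of the alignment model
W_enc = rng.normal(size=(enc_units, attn_units))
W_dec = rng.normal(size=(dec_units, attn_units))
v = rng.normal(size=(attn_units,))

# Additive score for each source position: v . tanh(W_enc h_i + W_dec s)
scores = np.tanh(encoder_outputs @ W_enc + decoder_state @ W_dec) @ v  # (n_steps,)

# Softmax turns the scores into positive weights that sum to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: weighted sum of all encoder outputs
context = weights @ encoder_outputs  # (enc_units,)

print(weights.round(3), context.shape)
```

In a full model, this computation would be repeated at every decoder step with the updated decoder state, so the weights shift from one source word to another as the translation progresses.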

### 4. The Transformer (Vaswani et al., 2017)
The paper "Attention Is All You Need" proposed removing RNNs entirely. 
* **Positional Encoding:** Since there is no recurrence, we must inject information about word order (index 1, 2, 3...) mathematically.
* **Self-Attention:** Each word in the sentence looks at every other word to understand context (e.g., "Bank" looks at "River" or "Money").
* **Multi-Head Attention:** Running several self-attention layers in parallel to capture different types of relationships.
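
To make the "several heads in parallel" idea concrete, here is a small shape walk-through in NumPy. It only shows how a tensor is split into heads and merged back; the learned projections and the attention computation itself are omitted, and the sizes are arbitrary values picked for the example.

```python
import numpy as np

batch_size, seq_len, d_model, n_heads = 2, 5, 16, 4
d_head = d_model // n_heads  # each head works on a 4-dimensional slice

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, d_head)
heads = x.reshape(batch_size, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)

# ... each head would run its own attention over its (seq, d_head) slice here ...

# Merge: undo the transpose, then reshape back to (batch, seq, d_model)
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape)   # (2, 4, 5, 4)
print(merged.shape)  # (2, 5, 16)
```

Because the split is just a reshape and a transpose, all heads can be processed by one batched matrix multiplication rather than a Python loop.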

## 3. Code Reproduction

### 3.1 Char-RNN: Generating Shakespearean Text
We download the Shakespeare dataset, tokenize it by character, and define a two-layer GRU model. The training call is left commented out for speed.

In [None]:
filepath = keras.utils.get_file("shakespeare.txt", "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt")
with open(filepath) as f:
    shakespeare_text = f.read()

# Tokenizer: Char to Int
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

# Encode full text
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
max_id = len(tokenizer.word_index)
dataset_size = len(encoded)

print(f"Total characters: {dataset_size}")
print(f"Unique characters: {max_id}")

# Create Dataset (Windowing)
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
n_steps = 100
window_length = n_steps + 1
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id),
                              Y_batch))
dataset = dataset.prefetch(1)

# Build Model
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id], dropout=0.2),
    keras.layers.GRU(128, return_sequences=True, dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax"))
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# history = model.fit(dataset, epochs=5) # Skipped for speed

### 3.2 Encoder-Decoder for Neural Machine Translation (NMT)
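We will build a simple English-to-Spanish translator. To keep the example short, the encoder below is a plain (unidirectional) `LSTM`. Wrapping it in `keras.layers.Bidirectional` would let it read the sentence in both directions, at the cost of having to concatenate the forward and backward states before passing them to the decoder.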

In [None]:
vocab_size = 1000
embed_size = 10

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)

# One embedding layer is shared by the encoder and decoder inputs here for brevity
embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

# Encoder
encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_states = [state_h, state_c]

# Decoder (Initialized with Encoder states)
decoder = keras.layers.LSTM(512, return_sequences=True)
decoder_outputs = decoder(decoder_embeddings, initial_state=encoder_states)

# Output
output_layer = keras.layers.Dense(vocab_size, activation="softmax")
Y_proba = output_layer(decoder_outputs)

model = keras.models.Model(inputs=[encoder_inputs, decoder_inputs], outputs=[Y_proba])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

### 3.3 Positional Encoding (Transformer Component)
Since Transformers have no recurrence, we add sine/cosine waves to the embeddings to represent position.

In [None]:
class PositionalEncoding(keras.layers.Layer):
    def __init__(self, max_steps, max_dims, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1
        p, i = np.meshgrid(np.arange(max_steps), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_steps, max_dims))
        pos_emb[0, :, 0::2] = np.sin(p / 10000**(2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))

    def call(self, inputs):
        shape = tf.shape(inputs)
        return inputs + self.positional_embedding[:, :shape[1], :shape[2]]

# Visualizing Positional Encodings
max_steps = 201
max_dims = 512
pos_emb = PositionalEncoding(max_steps, max_dims)
PE = pos_emb(np.zeros((1, max_steps, max_dims), np.float32))
plt.figure(figsize=(10, 7))
plt.pcolormesh(PE[0], cmap='RdBu')
plt.xlabel('Embedding Dimension')
plt.xlim((0, 512))
plt.ylabel('Token Position')
plt.colorbar()
plt.title("Positional Encoding Matrix")
plt.show()

### 3.4 Multi-Head Attention (Transformer Component)
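The core operation of the Transformer is scaled dot-product attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(QK^\top / \sqrt{d_k}\right) V$, where $d_k$ is the dimension of each key vector (`d_head` in the code below).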

In [None]:
class MultiHeadAttention(keras.layers.Layer):
    def __init__(self, n_heads, d_model, **kwargs):
        super().__init__(**kwargs)
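        # Each head gets an equal slice of the model dimension
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_model = d_model
        self.d_head = d_model // n_heads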
        self.wq = keras.layers.Dense(d_model)
        self.wk = keras.layers.Dense(d_model)
        self.wv = keras.layers.Dense(d_model)
        self.dense = keras.layers.Dense(d_model)

    def split_heads(self, inputs, batch_size):
        inputs = tf.reshape(
            inputs, shape=(batch_size, -1, self.n_heads, self.d_head))
        return tf.transpose(inputs, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        qs = self.split_heads(self.wq(q), batch_size)
        ks = self.split_heads(self.wk(k), batch_size)
        vs = self.split_heads(self.wv(v), batch_size)
        
        # Scaled Dot-Product Attention
        matmul_qk = tf.matmul(qs, ks, transpose_b=True)
        dk = tf.cast(tf.shape(ks)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

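        # Convention assumed here: mask == 1 marks positions to hide. Adding a
        # large negative number drives their softmax weight to (almost) zero.
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)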

        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, vs)
        output = tf.transpose(output, perm=[0, 2, 1, 3])
        original_size_output = tf.reshape(output, (batch_size, -1, self.d_model))
        return self.dense(original_size_output)

print("MultiHeadAttention layer defined.")

## 4. Step-by-Step Explanation

### 1. Data Pipeline for Char-RNN
**Input:** A long string of text.
**Process:**
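1.  **Windowing:** We chop the text into overlapping windows of 101 characters (`n_steps + 1`), each shifted by one character from the previous one.
2.  **Split:** The first 100 characters of each window are the input ($X$); the last 100 characters (the same window shifted one character forward) are the target ($Y$). In other words, at each step $t$ the model predicts character $t+1$.
3.  **One-Hot:** We convert the input integer IDs to one-hot vectors of length `max_id`. The targets stay as integer IDs, which is what the `sparse_categorical_crossentropy` loss expects.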
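
The three steps above can be traced on a toy string with plain NumPy. This sketch stands in for the `tf.data` pipeline rather than reproducing it: the sample text, the tiny `n_steps`, and the hand-built `char_to_id` vocabulary are assumptions chosen so the arrays are small enough to inspect.

```python
import numpy as np

text = "to be or not"
chars = sorted(set(text))                        # toy vocabulary
char_to_id = {c: i for i, c in enumerate(chars)}
encoded = np.array([char_to_id[c] for c in text])

n_steps = 4                     # the chapter uses 100
window_length = n_steps + 1     # one extra character for the shifted target

# 1. Windowing: overlapping windows, each shifted by one character
windows = np.array([encoded[i:i + window_length]
                    for i in range(len(encoded) - window_length + 1)])

# 2. Split: inputs and targets are the same window, offset by one character
X = windows[:, :-1]
Y = windows[:, 1:]

# 3. One-hot: encode the inputs only; the targets stay as integer IDs
X_one_hot = np.eye(len(chars))[X]

print(X.shape, Y.shape, X_one_hot.shape)
print(repr("".join(chars[i] for i in X[0])), "->", repr("".join(chars[i] for i in Y[0])))
```

Each row of `Y` is the matching row of `X` moved one character to the right, which is what lets the model learn next-character prediction at every time step.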

### 2. Encoder-Decoder Flow
1.  **Encoder:** The LSTM processes the input sequence (English). We discard the `encoder_outputs` and keep only the final `state_h` and `state_c`. These states represent the "meaning" or "context" of the sentence.
2.  **Bridge:** We pass these states to the Decoder as its `initial_state`. 
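3.  **Decoder:** The LSTM starts from the encoder's final states and the `<start>` token, and generates the first translated (Spanish) word. During inference, each generated word is fed back as the input for the next step. During training, the decoder is instead fed the target sentence shifted one step to the right (teacher forcing), which is the role of `decoder_inputs` in the model above.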
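
The inference-time feedback loop can be sketched as follows. A trained decoder is not available here, so `toy_decoder_step` is a hypothetical stand-in that follows a fixed script; the vocabulary and the helper `greedy_decode` are likewise made up for illustration. Only the loop structure is the point.

```python
import numpy as np

vocab = ["<start>", "<end>", "hola", "mundo"]
START, END = 0, 1

def toy_decoder_step(token_id, state):
    """Hypothetical stand-in for one decoder step.

    A trained decoder would run its LSTM cell and softmax layer here; this toy
    version follows a fixed script so the loop can run without any training.
    """
    script = {START: 2, 2: 3, 3: END}   # <start> -> hola -> mundo -> <end>
    probas = np.full(len(vocab), 0.05)
    probas[script[token_id]] = 0.85     # the four probabilities sum to 1
    return probas, state                # state is passed through unchanged

def greedy_decode(encoder_state, max_length=10):
    # Start from the <start> token and the encoder's final state
    token_id, state = START, encoder_state
    output = []
    for _ in range(max_length):
        probas, state = toy_decoder_step(token_id, state)
        token_id = int(np.argmax(probas))   # greedy choice: most likely token
        if token_id == END:
            break
        output.append(vocab[token_id])      # token_id is fed back on the next pass
    return output

print(greedy_decode(encoder_state=np.zeros(4)))  # ['hola', 'mundo']
```

Beam search would replace the single `argmax` with a short list of the best partial translations kept at every step.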

### 3. Why Positional Encoding?
RNNs process "I" then "am" then "happy". They know "I" comes first because it was processed at step 0. 
Transformers process "I", "am", and "happy" simultaneously (in parallel). Without Positional Encoding, the model would see "I am happy" and "Happy am I" as identical "bags of words". The sine/cosine waves add a unique signature to each position index so the model can distinguish order.
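
This order-blindness can be checked numerically. The sketch below is an illustration under simplifying assumptions: the word vectors are random placeholders, and `pooled_self_attention` is a projection-free self-attention followed by mean pooling, not a full Transformer layer.

```python
import numpy as np

def positional_encoding(n_positions, n_dims):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    p = np.arange(n_positions)[:, np.newaxis]
    i = np.arange(n_dims // 2)[np.newaxis, :]
    angles = p / 10000 ** (2 * i / n_dims)
    pe = np.empty((n_positions, n_dims))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def pooled_self_attention(x):
    """Projection-free self-attention followed by mean pooling over positions."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ x).mean(axis=0)

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(3, 8))   # hypothetical vectors for "I", "am", "happy"
reversed_order = embeddings[::-1]      # "happy", "am", "I"

pe = positional_encoding(3, 8)

# Without positional encoding, both word orders give the same pooled output
same_without_pe = np.allclose(pooled_self_attention(embeddings),
                              pooled_self_attention(reversed_order))

# With positional encoding, the two orders become distinguishable
same_with_pe = np.allclose(pooled_self_attention(embeddings + pe),
                           pooled_self_attention(reversed_order + pe))

print(same_without_pe, same_with_pe)
```

Self-attention on its own only reorders its outputs when the inputs are reordered, so any order-insensitive summary of them is unchanged; adding the encodings ties each vector to its position and breaks that symmetry.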

### 4. Attention Mechanism Logic
**Query (Q), Key (K), Value (V):**
* Analogy: Searching a library database.
* **Query:** What you are looking for (e.g., "Books about Space").
* **Key:** The labels on the books in the library (e.g., "Science", "Cooking", "Space").
* **Value:** The content of the books.
* **Dot Product ($QK^\top$):** Measures the similarity between the Query and each Key. "Space" matches "Space" (high score).
* **Softmax:** Converts the scores into positive weights that sum to 1.
* **Output:** Weighted sum of the Values. You get mostly the content of the Space books.

In Self-Attention, Q, K, and V are all computed from the same sequence of word embeddings (in the Transformer, through three separate learned linear projections).
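
The library analogy can be run as a tiny NumPy example. The key, value, and query vectors below are hand-picked assumptions chosen so that the query lines up with the "Space" key; in a real model they would come from learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

labels = ["Science", "Cooking", "Space"]

# Hypothetical key vectors, one per book label
K = np.array([[4.0, 0.0, 0.0],
              [0.0, 4.0, 0.0],
              [0.0, 0.0, 4.0]])

# Values: the "content" of each book, here just a 2-D vector
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [5.0, 5.0]])

# Query pointing in the same direction as the "Space" key
Q = np.array([[0.0, 0.0, 4.0]])

output, weights = scaled_dot_product_attention(Q, K, V)
print(dict(zip(labels, weights[0].round(3))))
print(output.round(2))
```

Almost all of the weight lands on "Space", so the output is very close to that book's value vector, with a small contribution from the other two.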

## 5. Chapter Summary

* **Char-RNN** works surprisingly well for generating text but struggles with long-term coherence.
* **Encoder-Decoder** is the standard for Seq2Seq tasks like translation.
* **Beam Search** improves translation quality by exploring multiple potential translations simultaneously.
* **Attention** solves the bottleneck problem by letting the decoder look at the entire source sentence dynamically.
* **The Transformer** underlies most current state-of-the-art NLP models. It uses Multi-Head Attention and Positional Encodings to process all positions of a sequence in parallel, which makes training at massive scale practical (e.g., BERT, GPT).