# Chapter 5: State-of-the-art in deep learning: Transformers

This notebook reproduces the code and summarizes the theoretical concepts from Chapter 5 of *'TensorFlow in Action'* by Thushan Ganegedara.

This chapter introduces the **Transformer model**, the architecture that has become the foundation for modern state-of-the-art Natural Language Processing (NLP). We will cover:
1.  How text is represented numerically for a model.
2.  The core components of the Transformer: the **encoder-decoder** architecture.
3.  The **self-attention** mechanism (the "Query, Key, and Value" concept).
4.  **Masked self-attention** for the decoder.
5.  **Multi-head attention**.
6.  Building a complete, simplified Transformer model from scratch using Keras.

## 5.1 Representing Text as Numbers

Machine learning models cannot understand raw text. We must first convert text into numbers. This is a multi-step process:

1.  **Tokenization**: Split a sentence into individual pieces, or "tokens." This can be done at the word level (e.g., `"I went"` -> `["I", "went"]`).
2.  **Build a Vocabulary**: Assign a unique integer ID to each unique token (e.g., `{"<PAD>": 0, "I": 1, "went": 2, ...}`). We reserve ID 0 for a special `<PAD>` (padding) token.
3.  **Integer Encoding**: Convert each token in the sentence to its corresponding ID.
4.  **Padding & Truncating**: Ensure all sequences in a batch have the same length. 
    * Sentences shorter than the fixed length are **padded** with the `<PAD>` (0) token.
    * Sentences longer than the fixed length are **truncated**.
    * Example (length 5): `"It was cold"` -> `[6, 7, 8]` -> `[6, 7, 8, 0, 0]`
5.  **Vector Representation (e.g., One-Hot Encoding)**: Convert each integer ID into a vector. In one-hot encoding, the ID becomes the index of the single '1' in a vector of zeros whose length equals the vocabulary size. This prevents the model from assuming a false numerical relationship between words (e.g., that word 3 is "more" than word 2).

The final result is a 3D tensor with the shape `(batch_size, sequence_length, vocabulary_size)`.
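
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the first sentence, `max_len`, and the helper `pad_or_truncate` are assumptions chosen so that `"It was cold"` maps to `[6, 7, 8]` as in the example.

```python
import numpy as np

# Toy corpus; the sentences and max_len are illustrative assumptions.
sentences = ["I went to the bathroom", "It was cold"]
max_len = 5

# 1. Tokenization: split on whitespace (word-level tokens).
tokenized = [s.split() for s in sentences]

# 2. Vocabulary: ID 0 is reserved for <PAD>; other IDs follow first appearance.
vocab = {"<PAD>": 0}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

# 3. Integer encoding: look up each token's ID.
encoded = [[vocab[tok] for tok in tokens] for tokens in tokenized]

# 4. Padding and truncating to a fixed length (hypothetical helper).
def pad_or_truncate(ids, length, pad_id=0):
    return ids[:length] + [pad_id] * max(0, length - len(ids))

padded = np.array([pad_or_truncate(ids, max_len) for ids in encoded])

# 5. One-hot encoding: row i of the identity matrix is the one-hot vector for ID i.
one_hot = np.eye(len(vocab), dtype=np.float32)[padded]

print(padded)
print(one_hot.shape)  # (batch_size, sequence_length, vocabulary_size)
```

With this toy corpus the vocabulary has 9 entries, so the resulting tensor has shape `(2, 5, 9)`.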

---

## 5.2 Understanding the Transformer Model

The Transformer is based on an **encoder-decoder architecture**, which is common for sequence-to-sequence tasks like machine translation.

* **Encoder**: Takes the input sequence (e.g., an English sentence) and maps it to a rich, latent representation (a set of vectors).
* **Decoder**: Takes the encoder's output and generates the target sequence (e.g., the French translation) one token at a time.

Unlike RNNs/LSTMs, which process text one word at a time, the Transformer processes all tokens at once using a mechanism called **self-attention**.

### 5.2.1-5.2.2 Diving Deeper: Encoder and Decoder Layers

Both the encoder and decoder are stacks of identical layers.

A single **Encoder Layer** has two sub-layers:
1.  A **self-attention layer**: Allows every token in the input sequence to look at and weigh the importance of all other tokens in the same sequence. For example, in "I kicked the ball and **it** disappeared," self-attention helps the model learn that "it" refers to "ball."
2.  A **fully connected layer**: A simple feed-forward network applied to each token's representation.

A single **Decoder Layer** has three sub-layers:
1.  A **masked self-attention layer**: This is the same as self-attention, but it's "masked" to prevent a token from "seeing" future tokens. When predicting the 3rd word, the model can only attend to words 1 and 2.
2.  An **encoder-decoder attention layer**: This is the key link between the two halves. The output of the masked self-attention sub-layer that precedes it is used to "query" the encoder's output, deciding which parts of the *input* sentence are most relevant to generating the *current* output token.
3.  A **fully connected layer**.
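
The shapes involved in encoder-decoder attention can be sketched with NumPy, using the scaled dot-product weighting that Section 5.2.3 describes. This is an assumption-laden toy: the sizes and the names `enc_out` and `dec_h` are hypothetical, random arrays stand in for real layer outputs, and the learned projection matrices are left out.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_src, n_tgt, d = 6, 4, 8              # toy sizes: 6 input tokens, 4 output tokens

enc_out = rng.normal(size=(n_src, d))  # stands in for the encoder's final output
dec_h = rng.normal(size=(n_tgt, d))    # stands in for the masked self-attention output

# Queries come from the decoder; keys and values come from the encoder.
weights = softmax(dec_h @ enc_out.T / np.sqrt(d))  # (n_tgt, n_src)
context = weights @ enc_out                        # (n_tgt, d)

print(weights.shape, context.shape)
```

Each row of `weights` is a distribution over the input tokens for one output position, which is why the input and output sequences do not need to have the same length.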

### 5.2.3 The Self-Attention Layer

Self-attention is the core idea of the Transformer. It computes a token's representation by taking a weighted sum over the representations of all tokens in the sequence, including the token itself. The weights are calculated from three vectors derived for each token:

* **Query (Q)**: A representation of the current token, used to "ask" other tokens for their relevance.
* **Key (K)**: A representation of a token that queries are compared against. The Query-Key dot product determines the attention score (the weight).
* **Value (V)**: The actual content or representation of the token. Once scores are calculated, we sum up the **Values** weighted by their scores.

The formula is: $h = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimensionality of the key vectors (`self.d` in the code below).
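
To make the formula concrete, here is a minimal NumPy sketch on a toy sequence. The sizes, the random seed, and the variable names are illustrative assumptions, not values from the book.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_in, d_k = 3, 4, 2          # toy sizes, chosen for illustration

x = rng.normal(size=(n_tokens, d_in))  # one row per token
Wq, Wk, Wv = (rng.normal(size=(d_in, d_k)) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv
P = softmax(Q @ K.T / np.sqrt(d_k))    # attention weights, shape (3, 3)
h = P @ V                              # attended outputs, shape (3, 2)

print(P.sum(axis=-1))  # each row sums to 1
print(h.shape)
```

Row `i` of `P` holds the weights that token `i` assigns to every token in the sequence, and row `i` of `h` is the corresponding weighted sum of value vectors.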

We can implement this as a custom Keras layer.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import math

# Implementation of the Self-Attention Layer (based on Listing 5.1)
class SelfAttentionLayer(layers.Layer):
    def __init__(self, d):
        super(SelfAttentionLayer, self).__init__()
        self.d = d # Output dimensionality
    
    def build(self, input_shape):
        # Define the weight matrices for Q, K, V
        self.Wq = self.add_weight(
            shape=(input_shape[-1], self.d), initializer='glorot_uniform',
            trainable=True, dtype='float32', name='Wq'
        )
        self.Wk = self.add_weight(
            shape=(input_shape[-1], self.d), initializer='glorot_uniform',
            trainable=True, dtype='float32', name='Wk'
        )
        self.Wv = self.add_weight(
            shape=(input_shape[-1], self.d), initializer='glorot_uniform',
            trainable=True, dtype='float32', name='Wv'
        )
    
    def call(self, q_x, k_x, v_x):
        # Calculate Q, K, V matrices
        q = tf.matmul(q_x, self.Wq)
        k = tf.matmul(k_x, self.Wk)
        v = tf.matmul(v_x, self.Wv)
        
        # Calculate attention scores (P)
        # (Q * K_transpose) / sqrt(d)
        p = tf.matmul(q, k, transpose_b=True) / math.sqrt(float(self.d))
        p = tf.nn.softmax(p)
        
        # Calculate the final attended output (h = P * V)
        h = tf.matmul(p, v)
        return h, p

# Test the layer
K_ = tf.keras.backend
K_.clear_session()

d_model = 512
n_seq = 7
x = tf.constant(np.random.normal(size=(1, n_seq, d_model)), dtype=tf.float32)

layer = SelfAttentionLayer(d_model)
h, p = layer(x, x, x)

print(f"Input shape: {x.shape}")
print(f"Output (h) shape: {h.shape}")
print(f"Attention (p) shape: {p.shape}")

### 5.2.6 Masked Self-Attention Layers

The decoder uses **masked** self-attention to prevent it from "cheating" by looking at future tokens. When predicting the word at position `t`, it should only have access to tokens from `0` to `t`.

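This is achieved by adding a very large negative number (e.g., `-1e9`) to the attention scores of future positions right before the softmax. The mask is a matrix with 1 at every future position and 0 elsewhere, so `mask * -1e9` leaves the allowed positions unchanged. After the softmax, the weights for future positions are effectively zero, so those tokens are ignored.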
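
The effect of the mask on the softmax can be checked in isolation. The sketch below is a NumPy illustration under simplifying assumptions: all raw scores are set to zero so that only the mask shapes the result, and `np.triu` is used here as a stand-in for the TensorFlow mask construction.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))  # equal raw scores, so the mask's effect is easy to see

# 1 marks a future position (column index > row index), 0 marks an allowed one.
mask = np.triu(np.ones((n, n)), k=1)

weights = softmax(scores + mask * -1e9)
print(weights.round(2))
```

Each row spreads its weight evenly over the positions it is allowed to see, and every entry above the diagonal comes out as zero.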

Here is the updated `SelfAttentionLayer` with masking capabilities.

In [None]:
# Updated SelfAttentionLayer with masking (based on Listing 5.2)
class SelfAttentionLayer(layers.Layer):
    def __init__(self, d, **kwargs):
        super(SelfAttentionLayer, self).__init__(**kwargs)
        self.d = d

    def build(self, input_shape):
        self.Wq = self.add_weight(shape=(input_shape[-1], self.d), initializer='glorot_uniform', trainable=True, name='Wq')
        self.Wk = self.add_weight(shape=(input_shape[-1], self.d), initializer='glorot_uniform', trainable=True, name='Wk')
        self.Wv = self.add_weight(shape=(input_shape[-1], self.d), initializer='glorot_uniform', trainable=True, name='Wv')
    
    def call(self, q_x, k_x, v_x, mask=None):
        q = tf.matmul(q_x, self.Wq)
        k = tf.matmul(k_x, self.Wk)
        v = tf.matmul(v_x, self.Wv)
        
        p = tf.matmul(q, k, transpose_b=True) / math.sqrt(float(self.d))
        
        # Apply the mask (if one is provided)
        if mask is not None:
            p += (mask * -1e9) # Add a large negative number where mask is 1
            
        p = tf.nn.softmax(p)
        h = tf.matmul(p, v)
        return h, p

# Create the triangular mask for a sequence of length 7
# This is a look-ahead mask for the decoder
mask = 1 - tf.linalg.band_part(tf.ones((n_seq, n_seq)), -1, 0)
print("Decoder Look-Ahead Mask:")
print(mask.numpy())

# Test the masked layer
K_.clear_session()
masked_layer = SelfAttentionLayer(d_model)
h_masked, p_masked = masked_layer(x, x, x, mask=mask)

print("\nAttention scores (p) with mask (should be lower-triangular):")
print(p_masked.numpy()[0].round(2)) # Rounding for readability

### 5.2.7 Multi-Head Attention

Instead of one large self-attention calculation, **Multi-Head Attention** runs several smaller self-attention mechanisms in parallel (e.g., 8 "heads"). Each head can theoretically learn a different type of relationship between tokens.

The outputs of all heads are concatenated and passed through a final linear layer to produce the sub-layer's final output.
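
As a rough NumPy sketch of the full data flow, including the final linear layer, consider the following. The sizes, the helper `attention_head`, and the output matrix `Wo` are hypothetical names introduced for illustration, and random matrices stand in for learned weights.

```python
import numpy as np

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(x, Wq, Wk, Wv):
    # One self-attention head: project, score, normalize, and mix the values.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4  # toy sizes
d_head = d_model // n_heads

x = rng.normal(size=(n_tokens, d_model))

# Each head owns its own Q, K, V projections into a smaller d_head space.
head_outputs = [
    attention_head(x, *(rng.normal(size=(d_model, d_head)) for _ in range(3)))
    for _ in range(n_heads)
]

concat = np.concatenate(head_outputs, axis=-1)     # (n_tokens, n_heads * d_head)
Wo = rng.normal(size=(n_heads * d_head, d_model))  # final linear layer (no bias here)
out = concat @ Wo                                  # (n_tokens, d_model)

print(head_outputs[0].shape, concat.shape, out.shape)
```

Because `d_head = d_model // n_heads`, the concatenated output has the same width as a single full-size head would produce.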

In [None]:
# Code snippet for multi-head attention
n_heads = 8
d_head = d_model // n_heads # 512 / 8 = 64

multi_attn_heads = [SelfAttentionLayer(d_head) for _ in range(n_heads)]

# Pass the same input x to every head and keep only the attended output h
outputs = [head(x, x, x)[0] for head in multi_attn_heads]

# Concatenate all head outputs
outputs_concat = tf.concat(outputs, axis=-1)

print(f"Shape of one head: {outputs[0].shape}")
print(f"Shape after concatenation: {outputs_concat.shape}")

# This concatenated output would then be passed through a final Dense layer.

### 5.2.8 Fully Connected Layer

The final sub-layer in both the encoder layer and the decoder layer (the second and third sub-layer, respectively) is a simple, position-wise fully connected network. It consists of two `Dense` layers with a ReLU activation in between. This is applied independently to each token's representation.
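
"Applied independently to each token" means that running the network on the whole sequence gives the same result as running it on one token at a time. The NumPy sketch below checks this; the sizes, the random weights, and the helper `feed_forward` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_hidden = 4, 8, 32  # toy sizes

W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def feed_forward(x):
    # Dense -> ReLU -> Dense, acting on the last axis only.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(n_tokens, d_model))

whole_sequence = feed_forward(x)
token_by_token = np.stack([feed_forward(tok) for tok in x])

print(np.allclose(whole_sequence, token_by_token))  # True
```

No information moves between positions in this sub-layer; all mixing across tokens happens in the attention sub-layers.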

In [None]:
# Implementation using Keras Dense layers (based on Listing 5.4)
class FCLayer(layers.Layer):
    def __init__(self, d1, d2, **kwargs):
        super(FCLayer, self).__init__(**kwargs)
        self.dense_layer_1 = layers.Dense(d1, activation='relu')
        self.dense_layer_2 = layers.Dense(d2)
    
    def call(self, x):
        ff1 = self.dense_layer_1(x)
        ff2 = self.dense_layer_2(ff1)
        return ff2

# Test the FC layer
fc = FCLayer(d1=2048, d2=d_model)
fc_out = fc(h_masked) # Using output from previous layer
print(f"Input shape to FC: {h_masked.shape}")
print(f"Output shape from FC: {fc_out.shape}")

### 5.2.9 Putting Everything Together: A Full Transformer Model

Now we combine these components (`SelfAttentionLayer` and `FCLayer`) into an `EncoderLayer` and a `DecoderLayer`. Note that, for simplicity, this code omits the residual connections and layer normalization mentioned in Chapter 13.

In [None]:
# Based on Listing 5.5 - Encoder Layer
# (Note: This is a simplified version without multi-head attention or residual connections, for clarity)
class EncoderLayer(layers.Layer):
    def __init__(self, d, **kwargs):
        super(EncoderLayer, self).__init__(**kwargs)
        self.attn_head = SelfAttentionLayer(d)
        self.fc_layer = FCLayer(2048, d)
    
    def call(self, x):
        h, _ = self.attn_head(x, x, x)
        y = self.fc_layer(h)
        return y

# Based on Listing 5.6 - Decoder Layer
class DecoderLayer(layers.Layer):
    def __init__(self, d, **kwargs):
        super(DecoderLayer, self).__init__(**kwargs)
        self.masked_attn_head = SelfAttentionLayer(d, name="masked_attn")
        self.attn_head = SelfAttentionLayer(d, name="enc_dec_attn")
        self.fc_layer = FCLayer(2048, d)
    
    def call(self, de_x, en_x, mask=None):
        # 1. Masked self-attention (on decoder input)
        h1, _ = self.masked_attn_head(de_x, de_x, de_x, mask)
        
        # 2. Encoder-decoder attention (Query=decoder, Key/Value=encoder)
        h2, _ = self.attn_head(q_x=h1, k_x=en_x, v_x=en_x)
        
        # 3. Fully connected layer
        y = self.fc_layer(h2)
        return y

In [None]:
# Based on Listing 5.7 - Building the final "MinTransformer"
K_.clear_session()

# Hyperparameters
n_steps = 25
n_en_vocab = 300
n_de_vocab = 400
d = 512

# Define the decoder look-ahead mask
mask = 1 - tf.linalg.band_part(tf.ones((n_steps, n_steps)), -1, 0)

# --- Define Model Inputs ---
en_inp = layers.Input(shape=(n_steps,), name="encoder_input_ids")
de_inp = layers.Input(shape=(n_steps,), name="decoder_input_ids")

# --- Define Embedding Layers ---
# The sequence length is already fixed by the Input layers, so the
# deprecated `input_length` argument is not needed here.
en_emb_layer = layers.Embedding(n_en_vocab, d)
de_emb_layer = layers.Embedding(n_de_vocab, d)

en_emb = en_emb_layer(en_inp)
de_emb = de_emb_layer(de_inp)

# --- Build Encoder & Decoder ---
# (We'll use two layers for each)
en_out1 = EncoderLayer(d, name="enc_1")(en_emb)
en_out2 = EncoderLayer(d, name="enc_2")(en_out1)

de_out1 = DecoderLayer(d, name="dec_1")(de_emb, en_out2, mask)
de_out2 = DecoderLayer(d, name="dec_2")(de_out1, en_out2, mask)

# --- Final Prediction Layer ---
de_pred = layers.Dense(n_de_vocab, activation='softmax', name='final_output')(de_out2)

# --- Create and Compile the Model ---
transformer = tf.keras.models.Model(
    inputs=[en_inp, de_inp],
    outputs=de_pred,
    name='MinTransformer'
)

transformer.compile(
    loss='categorical_crossentropy', 
    optimizer='adam', 
    metrics=['acc']
)

transformer.summary()