# Chapter 5: State-of-the-Art in Deep Learning: Transformers

## 1️⃣ Chapter Overview

In the previous chapters, we explored Recurrent Neural Networks (RNNs) and LSTMs for sequence processing. While powerful, these models have two key limitations: they process tokens one at a time, which prevents parallelization and slows training, and they struggle to retain long-range dependencies.

This chapter introduces the **Transformer**, a groundbreaking architecture proposed in the paper *"Attention Is All You Need"* (Vaswani et al., 2017). The Transformer discards recurrence entirely in favor of a mechanism called **Self-Attention**, which lets the model look at the entire input sequence simultaneously.

**Key Machine Learning Concepts:**
* **Encoder-Decoder Architecture:** The standard framework for sequence-to-sequence tasks.
* **Self-Attention:** The ability of a model to weigh the importance of different words in a sentence relative to a specific word.
* **Multi-Head Attention:** Running multiple self-attention mechanisms in parallel to capture different types of relationships.
* **Positional Encoding:** Injecting information about the order of words, since Transformers process them in parallel.

**Practical Skills:**
* Implementing custom Keras layers using the Subclassing API.
* Building complex mathematical operations (Scaled Dot-Product Attention) from scratch in TensorFlow.
* Assembling a full Transformer architecture from custom components.

## 2️⃣ Theoretical Explanation

### 2.1 The Intuition: Why "Attention"?
Imagine translating the sentence: *"The animal did not cross the street because it was too tired."*

When a human reads the word **"it"**, they immediately know it refers to **"the animal"**, not **"the street"**. 

* **RNNs** struggle with this if the distance between "animal" and "it" is large.
* **Self-Attention** calculates a score relating "it" to every other word in the sentence. The score for "animal" would be high, effectively telling the model: *"When processing 'it', pay close attention to 'animal'."*

### 2.2 The Core Mechanism: Q, K, V
Self-attention is mathematically described using three vectors for every input token: **Query (Q)**, **Key (K)**, and **Value (V)**.

1.  **Query (Q):** What the token is looking for.
2.  **Key (K):** What the token offers to be matched against by other tokens' queries.
3.  **Value (V):** The actual information content of the token.

**Analogy:** 
Think of searching a library catalog.
* **Query:** The search term you type in.
* **Key:** The book titles/categories in the database.
* **Value:** The content of the book you eventually get.

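We measure how well each query matches each key by taking their dot product, scale the result by $\sqrt{d_k}$ (the dimensionality of the key vectors), and apply a softmax so that the scores for each query become attention weights that sum to 1. Each output is then the weighted sum of the value vectors $V$: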

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
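
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention for a single sequence. It is an illustration only: the toy sizes (3 tokens, $d_k = 4$), the random inputs, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions made for this example, not part of the chapter's TensorFlow code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len_q, seq_len_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the value vectors

# Toy inputs (illustrative values): 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # one sum per query row
print(output.shape)          # (3, 4): one output vector per query token
```

Row `i` of `weights` says how much token `i` attends to every token in the sequence, which is the "score" described in the intuition above.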

### 2.3 Multi-Head Attention
Instead of having one set of Q, K, V matrices, we have multiple sets (heads). This allows the model to focus on different aspects simultaneously. For example, one head might track grammatical relationships (subject-verb), while another tracks semantic references (pronoun resolution).
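
The bookkeeping behind "multiple heads" is mostly a reshape. The NumPy sketch below shows only that step, under assumed toy sizes (`seq_len = 5`, `d_model = 8`, `n_heads = 2`); the learned Q, K, V projections that a real layer applies first are left out on purpose.

```python
import numpy as np

seq_len, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads  # each head works in a d_model / n_heads subspace

# Stand-in for an already-projected Q, K, or V matrix of shape (seq_len, d_model)
x = np.arange(seq_len * d_model, dtype=np.float32).reshape(seq_len, d_model)

# Split the last dimension into heads, then move the head axis to the front:
# (seq_len, d_model) -> (seq_len, n_heads, d_head) -> (n_heads, seq_len, d_head)
heads = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

# After attention runs per head, the heads are merged back by reversing the steps
merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

print(heads.shape)   # (2, 5, 4)
print(merged.shape)  # (5, 8)
```

Head 0 sees the first `d_head` features of every token and head 1 sees the rest, so each head can learn its own attention pattern at no extra cost in total width.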

### 2.4 Positional Encodings
Self-attention by itself has no notion of word order: without recurrence, permuting the input tokens simply permutes the outputs in the same way, so "Dog bites Man" and "Man bites Dog" would look like the same bag of words. We must therefore explicitly add information about the position of each word. We add a **Positional Encoding** vector to the input embeddings to give the model a sense of order.
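
The order-blindness of plain self-attention can be checked numerically. The sketch below is a simplified stand-in (it uses $Q = K = V = X$ with no learned projections, and random vectors in place of real word embeddings); it shows that reversing the token order just reverses the output rows.

```python
import numpy as np

def self_attention(X):
    # Simplified self-attention: Q = K = V = X, no learned projections
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))  # rows stand in for "Dog", "bites", "Man"
perm = [2, 1, 0]             # reordered to "Man", "bites", "Dog"

out = self_attention(X)
out_perm = self_attention(X[perm])

# Each word gets the same representation regardless of where it appears
print(np.allclose(out_perm, out[perm]))  # True
```

Because the representation of each word is unchanged by reordering, the model cannot tell who bit whom unless position information is added to the embeddings.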

## 3️⃣ Code Reproduction: Building the Transformer Components

We will build the Transformer from the bottom up using the **Keras Subclassing API**. This gives us fine-grained control over the forward pass calculations.

### 3.1 Setup

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models
import numpy as np
import math

# Ensure reproducibility
tf.random.set_seed(42)
np.random.seed(42)

### 3.2 The Self-Attention Layer

This is the heart of the Transformer. We calculate the $Q$, $K$, and $V$ vectors using dense layers (linear projections) and then apply the attention formula.

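**Note on Masking:** In the decoder, we must prevent positions from attending to subsequent positions (peeking into the future). We implement this via a `mask` argument: a tensor holding `1` at positions that must be blocked and `0` elsewhere, or `None` to disable masking.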
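
As a rough illustration of what such a mask looks like, the NumPy sketch below builds a look-ahead mask and applies it the way the layer does, by adding a large negative number before the softmax. The helper name `create_look_ahead_mask` is hypothetical, and the convention (1 = blocked) is inferred from the `mask * -1e9` line in the layer's code.

```python
import numpy as np

def create_look_ahead_mask(size):
    # 1 above the diagonal (future positions, blocked), 0 elsewhere (allowed)
    return 1.0 - np.tril(np.ones((size, size), dtype=np.float32))

mask = create_look_ahead_mask(4)
print(mask)

# Start from uniform logits so that only the effect of the mask is visible
logits = np.zeros((4, 4), dtype=np.float32) + mask * -1e9
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights)
```

Row `i` of `weights` spreads its attention evenly over positions `0..i` and gives zero weight to everything after `i`, which is exactly the "no peeking" behaviour the decoder needs.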

In [None]:
class SelfAttentionLayer(layers.Layer):
    def __init__(self, d_model, n_heads):
        super(SelfAttentionLayer, self).__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        # d_model must split evenly across the heads
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_head = d_model // n_heads

        # The weight matrices for Q, K, V are implemented as Dense layers
        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        
        # Linear layer for the output
        self.dense = layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (n_heads, d_head).
        Transpose the result to shape (batch_size, n_heads, seq_len, d_head)
        """
        x = tf.reshape(x, (batch_size, -1, self.n_heads, self.d_head))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]

        # 1. Generate Q, K, V
        qw = self.wq(q)
        kw = self.wk(k)
        vw = self.wv(v)

        # 2. Split heads
        # Shape: (batch, n_heads, seq_len, d_head)
        q_heads = self.split_heads(qw, batch_size)
        k_heads = self.split_heads(kw, batch_size)
        v_heads = self.split_heads(vw, batch_size)

        # 3. Scaled Dot-Product Attention
        # Matmul of Q and K
        # Result Shape: (batch, n_heads, seq_len_q, seq_len_k)
        matmul_qk = tf.matmul(q_heads, k_heads, transpose_b=True)

        # Scale by sqrt(d_k)
        dk = tf.cast(self.d_head, tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        # Apply mask (if provided). The mask holds 1 at positions to block and 0
        # elsewhere; adding a large negative number to the blocked logits drives
        # their softmax weights to (effectively) 0.
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        # Softmax to get probabilities
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

        # Weighted sum of Values
        output = tf.matmul(attention_weights, v_heads)

        # 4. Concatenate heads and run through final dense layer
        # Transpose back: (batch, seq_len, n_heads, d_head)
        output = tf.transpose(output, perm=[0, 2, 1, 3])
        # Reshape: (batch, seq_len, d_model)
        concat_attention = tf.reshape(output, (batch_size, -1, self.d_model))

        return self.dense(concat_attention)


### 3.3 The Feed-Forward Layer

After self-attention, each token is processed independently by a fully connected network. This typically consists of two linear transformations with a ReLU activation in between.

$$ \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2 $$
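
"Processed independently" means the same weights are applied to every position, with no mixing between tokens. The NumPy sketch below checks this with randomly initialized stand-in weights (the sizes and the `ffn` helper are assumptions for illustration, not trained parameters).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff, seq_len = 4, 16, 3

# Random stand-ins for the two linear transformations
W1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
W2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn(x):
    # ReLU(x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))

full = ffn(x)                                               # whole sequence at once
row_by_row = np.stack([ffn(x[i]) for i in range(seq_len)])  # one token at a time

print(np.allclose(full, row_by_row))  # True
```

Applying the network to the whole sequence gives the same result as applying it token by token, which is why this sub-layer parallelizes trivially.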

In [None]:
class FCLayer(layers.Layer):
    def __init__(self, d_model, dff):
        """
        d_model: Dimensionality of the model (embedding size)
        dff: Dimensionality of the inner feed-forward layer (usually 4x d_model)
        """
        super(FCLayer, self).__init__()
        self.dense1 = layers.Dense(dff, activation='relu')
        self.dense2 = layers.Dense(d_model)

    def call(self, x):
        return self.dense2(self.dense1(x))

### 3.4 The Encoder Layer

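Now we compose the `SelfAttentionLayer` and `FCLayer` into a single Encoder Layer. Crucially, the Transformer wraps each sub-layer in a **Residual Connection** (Add) followed by **Layer Normalization** (Norm): the sub-layer's output is added to its input and the sum is normalized, i.e. $\text{LayerNorm}(x + \text{Sublayer}(x))$.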
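
A bare-bones NumPy sketch of the Add & Norm step follows. It is a simplification: the `layer_norm` helper here omits the learned scale and offset parameters that Keras' `LayerNormalization` applies by default, and the inputs are random stand-ins.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize each token's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))             # sub-layer input: 3 tokens, 8 features
sublayer_out = rng.normal(size=(3, 8))  # stand-in for attention or FFN output

out = layer_norm(x + sublayer_out)      # Add (residual), then Norm

print(out.mean(axis=-1))  # approximately 0 for every token
print(out.std(axis=-1))   # approximately 1 for every token
```

The residual path lets gradients flow around each sub-layer, while the normalization keeps the activations of every token on a comparable scale from layer to layer.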

In [None]:
class EncoderLayer(layers.Layer):
    def __init__(self, d_model, n_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()
        
        self.mha = SelfAttentionLayer(d_model, n_heads)
        self.ffn = FCLayer(d_model, dff)

        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, x, training, mask):
        # Sub-layer 1: Multi-Head Attention
        attn_output = self.mha(x, x, x, mask)  # Q, K, V are all x (Self-Attention)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # Add & Norm

        # Sub-layer 2: Feed Forward Network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # Add & Norm

        return out2

### 3.5 The Decoder Layer

The Decoder layer is slightly more complex. It has three sub-layers:
1.  **Masked Self-Attention:** Each position can attend only to itself and earlier positions in the target sequence.
2.  **Encoder-Decoder Attention:** Queries come from the decoder, Keys and Values come from the encoder output.
3.  **Feed Forward Network.**
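
The second sub-layer is where source and target sequences of different lengths meet. The NumPy sketch below (toy sizes and random stand-in tensors, no learned projections) shows the shapes involved: the attention weights form a `(target_len, source_len)` matrix, and the result has one vector per decoder position.

```python
import numpy as np

rng = np.random.default_rng(0)
src_len, tgt_len, d = 6, 4, 8

enc_output = rng.normal(size=(src_len, d))  # Keys and Values come from the encoder
dec_state = rng.normal(size=(tgt_len, d))   # Queries come from the decoder

scores = dec_state @ enc_output.T / np.sqrt(d)        # (tgt_len, src_len)
scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

context = weights @ enc_output                        # (tgt_len, d)

print(weights.shape)  # (4, 6): each target position scores every source position
print(context.shape)  # (4, 8): one context vector per target position
```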

In [None]:
class DecoderLayer(layers.Layer):
    def __init__(self, d_model, n_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = SelfAttentionLayer(d_model, n_heads) # Masked Self-Attention
        self.mha2 = SelfAttentionLayer(d_model, n_heads) # Encoder-Decoder Attention

        self.ffn = FCLayer(d_model, dff)

        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)
        self.dropout3 = layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # 1. Masked Self-Attention (on the target sequence)
        attn1 = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)

        # 2. Encoder-Decoder Attention
        # Q = Decoder output so far (out1)
        # K, V = Encoder output
        attn2 = self.mha2(out1, enc_output, enc_output, padding_mask)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)

        # 3. Feed Forward Network
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)

        return out3

### 3.6 Positional Encoding

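The book uses the standard sinusoidal positional encoding described in Vaswani et al. (2017), where $pos$ is the position of the token in the sequence and $i$ indexes pairs of embedding dimensions: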
$$ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) $$
$$ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) $$

In [None]:
def get_positional_encoding(position, d_model):
    # Compute the angle for every (position, dimension) pair; shape: (position, d_model)
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    
    # Apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    
    # Apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates
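
A quick way to sanity-check the encoding is to look at position 0, where every sine term is 0 and every cosine term is 1. The sketch below re-derives the same formulas in plain NumPy so it can run on its own; the function name `sinusoidal_encoding` and the toy sizes are assumptions for this example.

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, np.newaxis]   # (n_positions, 1)
    i = np.arange(d_model)[np.newaxis, :]         # (1, d_model)
    # Columns 2i and 2i+1 share the same frequency, hence the i // 2
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])         # even columns: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])         # odd columns: cosine
    return pe

pe = sinusoidal_encoding(50, 8)
print(pe.shape)  # (50, 8)
print(pe[0])     # position 0: 0 in even columns, 1 in odd columns
```

Low-numbered columns oscillate quickly with position and high-numbered columns oscillate slowly, so every position receives a distinct pattern with values bounded in $[-1, 1]$.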

### 3.7 The Final Transformer Model (MinTransformer)

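We assemble everything into a Keras Model. This model takes both Encoder and Decoder inputs and outputs, for every decoder position, unnormalized logits over the target vocabulary; applying a softmax to them gives the probability distribution of the next token.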

In [None]:
class MinTransformer(models.Model):
    def __init__(self, n_layers, d_model, n_heads, dff, input_vocab_size, 
                 target_vocab_size, pe_input, pe_target, rate=0.1):
        super(MinTransformer, self).__init__()

        # --- ENCODER ---
        self.encoder_embedding = layers.Embedding(input_vocab_size, d_model)
        self.encoder_pos_encoding = get_positional_encoding(pe_input, d_model)
        self.encoder_layers = [EncoderLayer(d_model, n_heads, dff, rate) 
                               for _ in range(n_layers)]
        self.encoder_dropout = layers.Dropout(rate)

        # --- DECODER ---
        self.decoder_embedding = layers.Embedding(target_vocab_size, d_model)
        self.decoder_pos_encoding = get_positional_encoding(pe_target, d_model)
        self.decoder_layers = [DecoderLayer(d_model, n_heads, dff, rate) 
                               for _ in range(n_layers)]
        self.decoder_dropout = layers.Dropout(rate)

        # --- FINAL OUTPUT ---
        self.final_layer = layers.Dense(target_vocab_size)

    def call(self, inputs, training):
        # inputs is a tuple: (encoder_input, decoder_input)
        inp, tar = inputs

        # Mask generation is omitted for brevity; this chapter focuses on the architecture.
        # With every mask set to None, padding tokens are not ignored and the decoder
        # can attend to future positions, so real training must supply proper masks.
        enc_padding_mask = None
        look_ahead_mask = None
        dec_padding_mask = None

        # --- ENCODER PASS ---
        enc_out = self.encoder_embedding(inp)
        # Scale by sqrt(d_model), read from the embedding layer rather than hardcoded
        enc_out *= tf.math.sqrt(tf.cast(self.encoder_embedding.output_dim, tf.float32))
        enc_out += self.encoder_pos_encoding[:, :tf.shape(inp)[1], :]
        enc_out = self.encoder_dropout(enc_out, training=training)

        for layer in self.encoder_layers:
            enc_out = layer(enc_out, training, enc_padding_mask)

        # --- DECODER PASS ---
        dec_out = self.decoder_embedding(tar)
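        # Scale by sqrt(d_model), read from the embedding layer rather than hardcoded
        dec_out *= tf.math.sqrt(tf.cast(self.decoder_embedding.output_dim, tf.float32))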
        dec_out += self.decoder_pos_encoding[:, :tf.shape(tar)[1], :]
        dec_out = self.decoder_dropout(dec_out, training=training)

        for layer in self.decoder_layers:
            dec_out = layer(dec_out, enc_out, training,
                            look_ahead_mask, dec_padding_mask)

        # Final prediction
        final_output = self.final_layer(dec_out)
        return final_output
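
Since the model above leaves its masks as `None`, here is one possible way the missing padding mask could be built and combined with a look-ahead mask, sketched in NumPy. Everything here is an assumption for illustration: the helper name `create_padding_mask`, the choice of `0` as the padding token ID, and the extra axes added so the mask broadcasts against attention logits of shape `(batch, n_heads, seq_len_q, seq_len_k)`.

```python
import numpy as np

def create_padding_mask(seq, pad_id=0):
    # 1.0 where the token is padding (blocked), 0.0 elsewhere (assumes pad_id marks padding)
    mask = (seq == pad_id).astype(np.float32)
    # Add axes so the mask broadcasts over heads and query positions
    return mask[:, np.newaxis, np.newaxis, :]   # (batch, 1, 1, seq_len)

# Two toy sequences of token IDs, padded with 0 to length 4
seq = np.array([[7, 6, 0, 0],
                [1, 2, 3, 0]])

padding_mask = create_padding_mask(seq)
print(padding_mask.shape)  # (2, 1, 1, 4)

# For the decoder's self-attention, block a position if it is padding OR in the future
look_ahead = 1.0 - np.tril(np.ones((4, 4), dtype=np.float32))
combined = np.maximum(padding_mask, look_ahead)
print(combined.shape)      # (2, 1, 4, 4)
```

Both masks follow the same 1 = blocked convention as the attention layer, so either one can be passed directly as its `mask` argument.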

## 4️⃣ Testing the Model

Let's instantiate the model with some sample hyperparameters to verify that the forward pass works and produces the correct output shape.

In [None]:
# Define Hyperparameters
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
input_vocab_size = 8500
target_vocab_size = 8000
dropout_rate = 0.1

transformer = MinTransformer(
    num_layers, 
    d_model, 
    num_heads, 
    dff,
    input_vocab_size, 
    target_vocab_size, 
    pe_input=1000, 
    pe_target=1000,
    rate=dropout_rate
)

# Create dummy inputs (random token IDs)
# Batch size: 64; encoder sequence length: 38; decoder sequence length: 36
temp_input = tf.random.uniform((64, 38), dtype=tf.int64, minval=0, maxval=200)
temp_target = tf.random.uniform((64, 36), dtype=tf.int64, minval=0, maxval=200)

# Run Forward Pass
fn_out = transformer((temp_input, temp_target), training=False)

print(f"Encoder Input Shape: {temp_input.shape}")
print(f"Decoder Input Shape: {temp_target.shape}")
print(f"Transformer Output Shape: {fn_out.shape}")

### Output Analysis
The output shape should be `(64, 36, 8000)`.
* `64`: Batch size.
* `36`: Target sequence length (the model outputs a prediction for every step in the decoder input).
* `8000`: Target vocab size (one logit per vocabulary entry; a softmax over this axis gives the next-word probabilities).
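
To turn logits of this shape into actual predictions, a softmax over the last axis followed by an argmax is the simplest (greedy) option. The NumPy sketch below uses a small random array as a stand-in for the model output; the sizes are illustrative and much smaller than `(64, 36, 8000)`.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 10))  # stand-in for (batch, target_len, vocab_size)

# Softmax over the vocabulary axis (shifted by the max for numerical stability)
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Greedy choice: the most probable token ID at every target position
predicted_ids = probs.argmax(axis=-1)

print(probs.shape)          # (2, 5, 10)
print(predicted_ids.shape)  # (2, 5)
```

Because softmax preserves ordering, taking the argmax of the raw logits gives the same token IDs; the softmax is only needed when the probabilities themselves matter, for example when computing a loss or sampling.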

## 5️⃣ Chapter Summary

In this chapter, we dissected the Transformer, the architecture behind most current state-of-the-art NLP models.

* **No More Recurrence:** We replaced LSTMs with **Self-Attention**, allowing parallel processing of sequences.
* **Attention Mechanism:** We implemented Scaled Dot-Product Attention, which allows the model to dynamically focus on different parts of the input.
* **Multi-Head Attention:** We split the attention mechanism into multiple heads to capture different linguistic features simultaneously.
* **Positional Encoding:** We injected sine/cosine waves into the embeddings so the model knows the order of words.
* **Architecture:** We built the Encoder (extracts features) and Decoder (generates sequence) and connected them into a full `MinTransformer` model in Keras.