# Appendix C: NLP Deep Dive - Attention & Encodings

## 1️⃣ Overview

In **Chapter 5** and **Chapter 13**, we used Transformers for translation and classification. While we implemented the code, understanding the *geometry* of these models helps debug and optimize them.

This appendix focuses on visualizing the two most abstract components of the Transformer:
1.  **Positional Encodings:** How do we represent "order" in a model that processes everything in parallel? We will visualize the unique properties of Sinusoidal embeddings.
2.  **Attention Masks:** How do we prevent the decoder from "cheating" (looking into the future) during training? We will visualize the Look-Ahead Mask.

---

## 2️⃣ Positional Encoding Visualization

Transformers process tokens in parallel. To the model, "Man bites Dog" and "Dog bites Man" look identical (bag of words) without positional information.
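The order-blindness described above can be checked numerically. The following is a minimal NumPy sketch, not the chapters' implementation: it assumes a single attention head with identity projections (no learned weights) and hypothetical random 4-dimensional embeddings for the three words.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity projections (illustrative simplification)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
# Hypothetical embeddings; no positional information is added
emb = {word: rng.normal(size=4) for word in ["Man", "bites", "Dog"]}

sent_a = np.stack([emb[w] for w in ["Man", "bites", "Dog"]])
sent_b = np.stack([emb[w] for w in ["Dog", "bites", "Man"]])

out_a = self_attention(sent_a)
out_b = self_attention(sent_b)

# Reversing the input order merely reverses the output rows ...
rows_permuted = np.allclose(out_a[::-1], out_b)
# ... so an order-insensitive summary (mean pooling) is the same for both sentences
pooled_identical = np.allclose(out_a.mean(axis=0), out_b.mean(axis=0))
print(rows_permuted, pooled_identical)
```

Each word gets the same output vector in both sentences, so nothing downstream can recover who bit whom unless position is injected into the inputs.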

We inject information about position $pos$ into the embedding vector of size $d_{model}$ using sine and cosine waves of different frequencies, where $i$ indexes each (sine, cosine) pair of dimensions:

$$ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) $$
$$ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) $$

Let's visualize why this works. Each position receives a unique "fingerprint," and because the encoding at $pos+k$ is a fixed linear function of the encoding at $pos$, the model can easily learn to attend by relative distance (e.g., "the word $k$ positions ahead").
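The relative-distance property follows from the angle-addition identities: within each (sine, cosine) pair, moving from $pos$ to $pos+k$ is a rotation that depends only on $k$. The sketch below checks this numerically. The helper names `sinusoidal_pe` and `shift` are hypothetical, and the small `d_model = 16` is chosen only to keep the example quick.

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Sinusoidal encodings following the two formulas above."""
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

d_model, k = 16, 3
pe = sinusoidal_pe(50, d_model)
freqs = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)

def shift(pe_row, k):
    """Rotate each (sin, cos) pair by freqs * k; uses only the offset k, not pos."""
    s, c = pe_row[0::2], pe_row[1::2]
    out = np.empty_like(pe_row)
    out[0::2] = s * np.cos(freqs * k) + c * np.sin(freqs * k)
    out[1::2] = c * np.cos(freqs * k) - s * np.sin(freqs * k)
    return out

# The same rotation maps PE(pos) to PE(pos + k) for every starting position
shift_ok = all(np.allclose(shift(pe[p], k), pe[p + k]) for p in range(50 - k))
# Every one of the 50 positions has a distinct encoding
rows_unique = len(np.unique(pe.round(6), axis=0)) == 50
print(shift_ok, rows_unique)
```

Because `shift` never looks at the absolute position, a linear layer could in principle represent "move $k$ steps" with one fixed matrix.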

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    
    # Apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    
    # Apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)

# Generate Encoding for a sentence of length 50 with embedding dimension 512
pos_encoding = positional_encoding(50, 512)

# Visualization
plt.figure(figsize=(12, 8))
plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('Depth (Embedding Dimension)')
plt.xlim((0, 512))
plt.ylabel('Position (Sequence Index)')
plt.title("Positional Encodings (50 Positions x 512 Dims)")
plt.colorbar()
plt.show()

### Analysis
* **X-Axis (Depth):** Represents the 512 dimensions of the word vector.
* **Y-Axis (Position):** Represents the token's position in the sentence (0 to 49).
* **Pattern:** 
    * On the left (low dimensions), the wave frequency is high (rapid changes).
    * On the right (high dimensions), the wave frequency is low (slow changes).
* **Interpretation:** This lets the model distinguish nearby positions (using the high-frequency dimensions) and distant positions (using the low-frequency dimensions), much as a clock uses second, minute, and hour hands to represent time.
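To put numbers on the clock analogy, the wavelength of each dimension pair can be computed directly from the formula: pair $i$ completes one cycle every $2\pi \cdot 10000^{2i/d_{model}}$ positions. This is a small sketch assuming the same `d_model = 512` as the plot; the variable names are illustrative.

```python
import numpy as np

d_model = 512
pair_index = np.arange(d_model // 2)   # i in the formulas above
# Number of positions needed for one full sine/cosine cycle in pair i
wavelengths = 2 * np.pi * np.power(10000.0, 2 * pair_index / d_model)

print(f"fastest pair (dims 0-1):     {wavelengths[0]:.2f} positions per cycle")
print(f"slowest pair (dims 510-511): {wavelengths[-1]:.0f} positions per cycle")

# Pairs that complete at least one full cycle within a 50-token sentence
print("pairs cycling within 50 positions:", int(np.sum(wavelengths <= 50)))
```

The fastest pair repeats roughly every $2\pi$ positions (the "second hand"), while the slowest pairs barely move across a 50-token sentence, which is why the right side of the heatmap looks almost constant.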

## 3️⃣ Masking Visualization

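In the decoder of a Transformer (as in GPT), we must ensure that when predicting the word at position $t$, the model can only see words at positions $0$ to $t-1$; it cannot see the word at $t$ itself or anything after it. In the usual training setup the decoder inputs are shifted right by one position, so this is equivalent to letting the query in row $t$ attend to input columns $0$ through $t$, which is why the mask below leaves the diagonal unblocked.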

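We achieve this by adding a **Look-Ahead Mask** to the attention scores before the softmax. Conceptually, the scores of future tokens are set to negative infinity ($-\infty$); in practice the code adds a large negative constant ($-10^9$) instead. After the softmax, the corresponding weights are **0** (the exponentials underflow to zero in floating point).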
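The effect on the softmax can be seen on a single row of scores. This is a minimal NumPy sketch with made-up score values; it compares a true $-\infty$ mask with the large-negative-constant approach.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Made-up attention scores for one query over four keys
scores = np.array([2.0, 1.0, 0.5, 3.0], dtype=np.float32)
future = np.array([0.0, 0.0, 1.0, 1.0], dtype=np.float32)   # 1 = future position

w_inf = softmax(np.where(future == 1, -np.inf, scores))   # idealized mask
w_big = softmax(scores + future * -1e9)                    # additive large-negative mask

print("with -inf:", w_inf)
print("with -1e9:", w_big)
```

In both cases the two future positions receive zero weight and the remaining weights renormalize to sum to 1, even though the last key had the highest raw score.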

In [None]:
def create_look_ahead_mask(size):
    # Band part returns the lower triangular part of the matrix
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

# Create a mask for a sequence of length 5
seq_len = 5
temp_mask = create_look_ahead_mask(seq_len)

print("Look Ahead Mask (Numerical):")
print(temp_mask.numpy())

plt.figure(figsize=(5, 5))
plt.imshow(temp_mask, cmap='gray')  # 'gray' maps 0 to black and 1 to white, matching the title
plt.title("Look Ahead Mask (White = 1 = Masked)")
plt.xlabel("Key Position (Attending To)")
plt.ylabel("Query Position (Current Word)")
plt.grid(False)
plt.show()

### Analysis
* **Row 0 (Word 0):** Can only attend to Column 0. All other columns are white (1, meaning masked/blocked).
* **Row 2 (Word 2):** Can attend to Columns 0, 1, and 2. Columns 3 and 4 are blocked.

This triangle ensures causality: future tokens cannot influence the prediction at the current position during training.
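Causality can also be tested directly: if only the last token changes, the outputs at all earlier positions should stay the same. The sketch below assumes a single head with identity projections and random inputs; `causal_self_attention` is a hypothetical helper, not the function used in the chapters.

```python
import numpy as np

def causal_self_attention(x):
    """Self-attention with a look-ahead mask (identity projections for simplicity)."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1 above the diagonal = future
    scores = scores + mask * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
x_changed = x.copy()
x_changed[4] = rng.normal(size=4)   # alter only the last token

out = causal_self_attention(x)
out_changed = causal_self_attention(x_changed)

past_unchanged = np.allclose(out[:4], out_changed[:4])
last_changed = not np.allclose(out[4], out_changed[4])
print(past_unchanged, last_changed)
```

Only the final row of the output moves; positions 0 to 3 never saw the altered token.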

## 4️⃣ Self-Attention Heatmaps (Matrix Math)

Let's simulate the dot-product attention mechanism on dummy data to see how the matrix multiplication results in alignment.

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

In [None]:
def scaled_dot_product_attention_viz(q, k, v, mask=None):
    # q, k, v shape: (batch_size, seq_len, d_model)
    # v is accepted for symmetry with the formula but unused: only the weights are returned
    matmul_qk = tf.matmul(q, k, transpose_b=True)
    
    # Scale
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # Add Mask
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Softmax
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
    return attention_weights

# Simulate 3 words, embedding dimension 4
np.random.seed(42)
temp_q = tf.constant(np.random.randn(1, 3, 4), dtype=tf.float32)
temp_k = tf.constant(np.random.randn(1, 3, 4), dtype=tf.float32)
temp_v = tf.constant(np.random.randn(1, 3, 4), dtype=tf.float32)

# 1. Without Mask
attn_weights_no_mask = scaled_dot_product_attention_viz(temp_q, temp_k, temp_v)

# 2. With Look-Ahead Mask
mask = create_look_ahead_mask(3)
attn_weights_with_mask = scaled_dot_product_attention_viz(temp_q, temp_k, temp_v, mask)

# Visualization
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.matshow(attn_weights_no_mask[0], fignum=0, cmap='viridis')
plt.title("Full Self-Attention (Encoder)")
plt.colorbar()

plt.subplot(1, 2, 2)
plt.matshow(attn_weights_with_mask[0], fignum=0, cmap='viridis')
plt.title("Masked Self-Attention (Decoder)")
plt.colorbar()

plt.show()

### Analysis
* **Left Plot (Encoder):** The attention matrix is fully populated. Every word attends to every other word.
* **Right Plot (Decoder):** The upper triangle is dark purple, the low end of the `viridis` colormap (zero probability). Attention is strictly confined to the lower triangle, including the diagonal (history plus the current position).
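The heatmaps show only the softmax weights; the formula's final step multiplies those weights by $V$. The following NumPy sketch completes that step under the same masking convention (1 = blocked). The name `attention_output` and the random inputs are illustrative assumptions, and the values will differ from the TensorFlow cell above.

```python
import numpy as np

def attention_output(q, k, v, mask=None):
    """Return (weights, output) for single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9   # 1 in the mask marks a blocked key
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, weights @ v

rng = np.random.default_rng(42)
q, k, v = (rng.normal(size=(3, 4)) for _ in range(3))
look_ahead = np.triu(np.ones((3, 3)), k=1)   # strictly upper triangle = future

weights, output = attention_output(q, k, v, look_ahead)
print(weights.round(3))
print(output.round(3))
```

Each row of `weights` sums to 1, and since row 0 can attend only to key 0, the first output row is simply `v[0]`.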