# Chapter 12: Sequence-to-Sequence Learning: Part 2 (Attention)

## 1️⃣ Chapter Overview

In Chapter 11, we built a standard Encoder-Decoder model for machine translation. While effective for short sentences, standard Seq2Seq models suffer from a fundamental **bottleneck**: the Encoder must compress all the information in a source sentence into a *single fixed-size vector* (the Context Vector).

This chapter introduces the **Attention Mechanism** (specifically **Bahdanau Attention**), which solves this bottleneck. Instead of relying on a single static context vector, Attention allows the Decoder to "look back" at the entire sequence of Encoder outputs and focus on specific words relevant to the current time step.

**Key Machine Learning Concepts:**
* **The Information Bottleneck:** Why fixed-size context vectors fail for long sequences.
* **Bahdanau Attention (Additive Attention):** A mechanism to compute a dynamic context vector for every decoding step.
* **Alignment Scores:** Calculating how relevant an encoder state is to the current decoder state.
* **Model Interpretability:** Using attention weights to visualize word-to-word alignment (Heatmaps).

**Practical Skills:**
* Implementing a custom Keras Layer for Attention (`DecoderRNNAttentionWrapper`) from scratch.
* Integrating Attention into a GRU-based Decoder.
* Extracting attention weights during inference to generate Alignment Heatmaps.

## 2️⃣ Theoretical Explanation

### 2.1 The Bottleneck Problem
In a standard Seq2Seq model:
1.  The Encoder reads the input: $X = [x_1, x_2, ..., x_T]$.
2.  It produces a final hidden state: $h_T$.
3.  The Decoder uses $h_T$ as the *only* source of information to generate the translation.

Whether the input is a 9-word sentence such as "The quick brown fox jumps over the lazy dog" or a complex 100-word paragraph, the Encoder must squash all of its meaning into a vector of the same size (e.g., 256 floats). The longer the input, the more information is lost, which leads to poor translations for long sentences.
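
To make the bottleneck concrete, here is a minimal NumPy sketch. It is not the Chapter 11 encoder: `encode_final_state` is a hypothetical toy tanh RNN with random weights, and the 32-dimensional embeddings are made up. It only illustrates that the final state has the same size whether the input has 9 tokens or 100.

```python
import numpy as np

def encode_final_state(token_vectors, hidden_size=256, seed=0):
    """Toy tanh RNN (hypothetical, untrained): return only the final hidden state."""
    rng = np.random.default_rng(seed)
    input_size = token_vectors.shape[1]
    W_x = rng.normal(scale=0.1, size=(input_size, hidden_size))
    W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    h = np.zeros(hidden_size)
    for x in token_vectors:            # one state update per source token
        h = np.tanh(x @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(1)
short_sentence = rng.normal(size=(9, 32))     # 9 tokens, made-up 32-dim embeddings
long_paragraph = rng.normal(size=(100, 32))   # 100 tokens

h_short = encode_final_state(short_sentence)
h_long = encode_final_state(long_paragraph)
print(h_short.shape, h_long.shape)  # (256,) (256,)
```

Both inputs end up in a 256-float vector, so the longer input has far less capacity per token.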

### 2.2 The Attention Solution
Attention allows the Decoder to access **all** encoder hidden states $[h_1, h_2, ..., h_T]$ at every time step. 

For each step $t$ in the Decoder:
1.  **Score:** The model calculates an **alignment score** between the Decoder's previous state $s_{t-1}$ and every Encoder state $h_j$. This score answers: *"How relevant is the word at position $j$ for generating the next word?"*
2.  **Weight:** The scores are normalized using Softmax to create **Attention Weights** $\alpha_{tj}$. These weights sum to 1.
3.  **Context:** A weighted sum of all encoder states is computed: $c_t = \sum_{j=1}^{T} \alpha_{tj} h_j$.
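
The three steps can be sketched in a few lines of NumPy. The encoder states and alignment scores below are hypothetical values chosen for illustration; in the real model the scores come from a learned scoring function.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical encoder states h_1..h_3 (T=3 positions, hidden size 2)
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])

# Step 1 (Score): made-up alignment scores for one decoder step t
scores = np.array([2.0, 0.5, 0.1])

# Step 2 (Weight): normalize the scores so they sum to 1
alpha = softmax(scores)

# Step 3 (Context): weighted sum of the encoder states
context = alpha @ encoder_states

print("attention weights:", alpha.round(3))
print("context vector:  ", context.round(3))
```

Because the first position has the highest score, it receives the largest weight and dominates the context vector.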

### 2.3 Bahdanau Attention (Additive)
In the mechanism proposed by Bahdanau et al. (2014), the alignment score is calculated using a small feed-forward neural network:

$$ \text{score}(s_{t-1}, h_j) = v_a^T \tanh(W_a s_{t-1} + U_a h_j) $$

Here $W_a$ and $U_a$ are learnable weight matrices and $v_a$ is a learnable weight vector. This mechanism allows the model to learn *where* to look without explicit alignment data.
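
As a rough illustration of the formula, the sketch below scores one decoder state against every encoder state. The function name `additive_scores`, the dimensions, and the random weights are assumptions made for this example; a trained model would learn $W_a$, $U_a$, and $v_a$.

```python
import numpy as np

def additive_scores(s_prev, H, W_a, U_a, v_a):
    """Compute v_a^T tanh(W_a s_{t-1} + U_a h_j) for every encoder position j.

    s_prev: (dec_hidden,)   H: (T, enc_hidden)
    W_a: (units, dec_hidden)   U_a: (units, enc_hidden)   v_a: (units,)
    """
    query = W_a @ s_prev                  # (units,)  same for all positions
    keys = H @ U_a.T                      # (T, units) one row per encoder state
    return np.tanh(query + keys) @ v_a    # (T,) one score per encoder state

# Hypothetical sizes and random (untrained) parameters
rng = np.random.default_rng(0)
T, enc_hidden, dec_hidden, units = 5, 4, 6, 8
H = rng.normal(size=(T, enc_hidden))
s_prev = rng.normal(size=dec_hidden)
W_a = rng.normal(size=(units, dec_hidden))
U_a = rng.normal(size=(units, enc_hidden))
v_a = rng.normal(size=units)

scores = additive_scores(s_prev, H, W_a, U_a, v_a)
print(scores.shape)  # (5,)
```

The decoder term is computed once and broadcast across all encoder positions, which is why the mechanism is called additive.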

## 3️⃣ Data Preparation

We will reuse the English-German translation dataset preparation steps from Chapter 11. We load the data, clean it, vectorize it, and prepare the `tf.data` pipeline.

In [None]:
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, backend as K
import pandas as pd
import zipfile

# Ensure reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# --- 1. Data Loading (Same as Chapter 11) ---
url = "http://www.manythings.org/anki/deu-eng.zip"
zip_path = tf.keras.utils.get_file("deu-eng.zip", origin=url, extract=True)
text_file = os.path.join(os.path.dirname(zip_path), "deu.txt")

def load_data(path, num_samples=50000):
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')
    data = []
    for line in lines[:min(num_samples, len(lines)-1)]:
        parts = line.split('\t')
        if len(parts) >= 2:
            english = parts[0]
            german = f"sos {parts[1]} eos" # Start/End tokens
            data.append([english, german])
    return np.array(data)

raw_data = load_data(text_file)
print(f"Loaded {len(raw_data)} samples.")

# --- 2. Train/Val/Test Split ---
indices = np.arange(len(raw_data))
np.random.shuffle(indices)
raw_data = raw_data[indices]

num_val = int(0.1 * len(raw_data))
num_test = int(0.1 * len(raw_data))
train_pairs = raw_data[:-(num_val + num_test)]
val_pairs = raw_data[-(num_val + num_test):-num_test]
test_pairs = raw_data[-num_test:]

# --- 3. Text Vectorization ---
VOCAB_SIZE = 10000
SEQUENCE_LENGTH = 20

en_vectorizer = layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_mode='int', output_sequence_length=SEQUENCE_LENGTH,
    standardize='lower_and_strip_punctuation')

de_vectorizer = layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_mode='int', output_sequence_length=SEQUENCE_LENGTH + 1,
    standardize='lower_and_strip_punctuation')

en_vectorizer.adapt(train_pairs[:, 0])
de_vectorizer.adapt(train_pairs[:, 1])

# --- 4. Dataset Pipeline ---
def format_dataset(eng, deu):
    eng = en_vectorizer(eng)
    deu = de_vectorizer(deu)
    return ({'encoder_inputs': eng, 'decoder_inputs': deu[:, :-1]}, deu[:, 1:])

def make_dataset(pairs, batch_size=64):
    eng_texts, deu_texts = pairs[:, 0], pairs[:, 1]
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, deu_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
    # Cache the vectorized batches first, then shuffle so the batch order changes every epoch
    return dataset.cache().shuffle(2048).prefetch(16)

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

## 4️⃣ Custom Attention Layer

This is the core contribution of this chapter. Since Keras did not natively support a simple plug-and-play Bahdanau Attention wrapper for RNN cells at the time of writing, we implement it manually.

We define `DecoderRNNAttentionWrapper`. This layer acts as a wrapper around a standard GRU cell. Inside its `call` method, it calculates the attention context and feeds the concatenated `[input, context]` to the GRU cell.

In [None]:
class DecoderRNNAttentionWrapper(layers.Layer):
    def __init__(self, cell_fn, units, **kwargs):
        super(DecoderRNNAttentionWrapper, self).__init__(**kwargs)
        self._cell_fn = cell_fn  # The basic GRU Cell
        self.units = units       # Dimension of Attention internal layers

    def build(self, input_shape):
        # input_shape comes as [Encoder_Out_Shape, Decoder_Input_Shape]
        # Encoder Out Shape: (batch, time, hidden)
        
        # W_a: Weight for Encoder Outputs (Values)
        self.W_a = self.add_weight(
            name='W_a', 
            shape=tf.TensorShape((input_shape[0][2], self.units)),
            initializer='uniform', trainable=True)
        
        # U_a: Weight for Decoder State (Query)
        self.U_a = self.add_weight(
            name='U_a', 
            shape=tf.TensorShape((self._cell_fn.units, self.units)),
            initializer='uniform', trainable=True)
        
        # V_a: Weight for calculating the final score
        self.V_a = self.add_weight(
            name='V_a', 
            shape=tf.TensorShape((self.units, 1)),
            initializer='uniform', trainable=True)
        
        super(DecoderRNNAttentionWrapper, self).build(input_shape)

    def call(self, inputs, initial_state, training=False):
        # inputs is a list: [encoder_outputs, decoder_inputs]
        encoder_outputs, decoder_inputs = inputs
        
        # Define the step function for K.rnn
        # This function runs for every time step of the Decoder
        def _step(inputs, states):
            # inputs: Single decoder step input
            # states: List containing [prev_decoder_state, encoder_outputs]
            
            prev_decoder_state = states[0]
            full_encoder_outputs = states[-1] # Accessed via constants

            # --- Attention Score Calculation ---
            
            # 1. Transform Encoder Outputs: H * W_a 
            # Result shape: (batch, en_seq_len, units)
            W_a_dot_h = K.dot(full_encoder_outputs, self.W_a)
            
            # 2. Transform Previous Decoder State: S_{t-1} * U_a
            # Result shape: (batch, 1, units)
            U_a_dot_s = K.expand_dims(K.dot(prev_decoder_state, self.U_a), 1)
            
            # 3. Tanh Activation: tanh(W_a.H + U_a.S)
            # Broadcasting happens here to add (batch, 1, units) to (batch, seq, units)
            Wa_plus_Ua = K.tanh(W_a_dot_h + U_a_dot_s)
            
            # 4. Calculate Score: V_a^T * tanh(...)
            # Shape: (batch, en_seq_len, 1)
            e_i = K.dot(Wa_plus_Ua, self.V_a)
            # Squeeze to remove last dim -> (batch, en_seq_len)
            e_i = K.squeeze(e_i, axis=-1) 
            
            # 5. Softmax to get Attention Weights (alpha)
            a_i = K.softmax(e_i)
            
            # --- Context Vector Calculation ---
            # Weighted sum of encoder outputs
            # context = sum(alpha * H)
            c_i = K.sum(full_encoder_outputs * K.expand_dims(a_i, -1), axis=1)
            
            # --- GRU Step ---
            # Concatenate Context Vector + Current Decoder Input
            gru_input = K.concatenate([inputs, c_i], axis=-1)
            
            # Call the basic GRU cell with its own state only (not the encoder constants)
            output, new_states = self._cell_fn(gru_input, [prev_decoder_state])
            
            # Return: (Output, Attention_Weights), [New_State]
            return (output, a_i), new_states

        # K.rnn loops the _step function over the time dimension of decoder_inputs
        # constants=[encoder_outputs] ensures encoder_outputs is available in _step
        _, outputs, _ = K.rnn(
            step_function=_step,
            inputs=decoder_inputs,
            initial_states=[initial_state],
            constants=[encoder_outputs]
        )
        
        # Unpack outputs (GRU outputs, Attention Weights)
        decoder_outputs, attention_weights = outputs
        return decoder_outputs, attention_weights

### Step-by-Step Code Explanation

1.  **Input:** The layer receives `encoder_outputs` (the full history of the source sentence) and `decoder_inputs` (the target sentence offset by one).
2.  **`K.rnn` Loop:** We cannot compute all decoder steps in a single matrix operation, because the attention at time $t$ depends on the decoder state from $t-1$. `K.rnn` runs `_step` once per decoder time step and carries the state forward.
3.  **`_step` Function:** Inside the loop:
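    * We project the encoder states (with `W_a`) and the previous decoder state (with `U_a`) into a shared latent space of size `units`. Note that the code swaps the two names relative to the equation in Section 2.3, where $W_a$ multiplies the decoder state and $U_a$ the encoder states; the computation is otherwise the same.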
    * We apply `tanh` nonlinearity.
    * We project down to a score using `V_a`.
    * `K.softmax` converts scores into probabilities ($\{ \alpha_1, ..., \alpha_T \}$).
    * We compute the **Context Vector** as the weighted sum of encoder states.
    * We feed `[Decoder_Input, Context_Vector]` into the actual GRU cell.
4.  **Output:** The layer returns the GRU outputs (for prediction) AND the attention weights (for visualization).
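
The shapes involved in one `_step` call can be traced outside Keras with plain NumPy. This is a sketch that mirrors the layer's arithmetic under assumed sizes (a batch of 2, 20 encoder positions, 256-dimensional encoder and decoder states, `units=512`, 128-dimensional decoder embeddings). The function `attention_step` and the random weights are hypothetical and stand in for the trained layer.

```python
import numpy as np

def attention_step(encoder_outputs, prev_state, W_a, U_a, V_a):
    """One attention step, using the same weight naming as the custom layer."""
    W_a_dot_h = encoder_outputs @ W_a                  # (batch, seq, units)
    U_a_dot_s = (prev_state @ U_a)[:, np.newaxis, :]   # (batch, 1, units)
    # Broadcasting adds the decoder term to every encoder position
    e = np.tanh(W_a_dot_h + U_a_dot_s) @ V_a           # (batch, seq, 1)
    e = e.squeeze(-1)                                  # (batch, seq)
    e = e - e.max(axis=-1, keepdims=True)              # stabilize the softmax
    a = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)
    c = (encoder_outputs * a[..., np.newaxis]).sum(axis=1)  # (batch, enc_hidden)
    return c, a

# Assumed sizes for illustration
batch, seq, enc_hidden, dec_hidden, units, emb = 2, 20, 256, 256, 512, 128
rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(batch, seq, enc_hidden))
prev_state = rng.normal(size=(batch, dec_hidden))
decoder_input = rng.normal(size=(batch, emb))          # one embedded target token
W_a = rng.normal(scale=0.1, size=(enc_hidden, units))
U_a = rng.normal(scale=0.1, size=(dec_hidden, units))
V_a = rng.normal(scale=0.1, size=(units, 1))

context, weights = attention_step(encoder_outputs, prev_state, W_a, U_a, V_a)
gru_input = np.concatenate([decoder_input, context], axis=-1)

print(weights.shape)    # (2, 20)
print(context.shape)    # (2, 256)
print(gru_input.shape)  # (2, 384)
```

Under these assumptions the GRU cell sees a 384-wide input at every step: the 128-dimensional embedding plus the 256-dimensional context vector.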

## 5️⃣ Building the Attention-Based Seq2Seq Model

Now we integrate our custom layer into the functional Keras API.

In [None]:
EMBEDDING_DIM = 128
LATENT_DIM = 256

# --- Encoder (Same as Chapter 11) ---
encoder_inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int64", name="encoder_inputs")
enc_emb = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True)(encoder_inputs)
encoder_gru = layers.Bidirectional(layers.GRU(LATENT_DIM // 2, return_sequences=True, return_state=True))
encoder_out, fwd_state, bwd_state = encoder_gru(enc_emb)

# Concatenate forward and backward states for the initial decoder state
encoder_state = layers.Concatenate()([fwd_state, bwd_state])

# --- Decoder with Attention ---
decoder_inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int64", name="decoder_inputs")
dec_emb_layer = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True)
dec_emb = dec_emb_layer(decoder_inputs)

# Define the Basic GRU Cell
decoder_cell = layers.GRUCell(LATENT_DIM)

# Wrap it with Attention
# Note: We pass the encoder outputs AND decoder embeddings to the call
attention_wrapper = DecoderRNNAttentionWrapper(cell_fn=decoder_cell, units=512)
decoder_outputs, attention_weights = attention_wrapper(
    [encoder_out, dec_emb],
    initial_state=encoder_state
)

# Final Output Layer
decoder_dense = layers.Dense(VOCAB_SIZE, activation="softmax")
final_outputs = decoder_dense(decoder_outputs)

# --- Model Compilation ---
attn_model = models.Model([encoder_inputs, decoder_inputs], final_outputs)

attn_model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

attn_model.summary()

## 6️⃣ Training the Model

We train using Teacher Forcing (feeding the correct previous word to generate the next).
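
Concretely, teacher forcing corresponds to the `deu[:, :-1]` / `deu[:, 1:]` split made in `format_dataset`. The sketch below shows that split on one made-up sequence; the token IDs are hypothetical and do not come from the real vocabulary.

```python
import numpy as np

# Hypothetical token IDs for "sos ich bin sehr glücklich eos" plus padding (0)
target = np.array([[2, 15, 27, 88, 301, 3, 0, 0]])

decoder_input = target[:, :-1]   # what the decoder is fed at each step
decoder_label = target[:, 1:]    # what it must predict at that step

# Each input token is paired with the token that follows it in the target
for seen, expected in zip(decoder_input[0], decoder_label[0]):
    print(f"input {int(seen)} -> label {int(expected)}")
```

The decoder always receives the correct previous token, even if its own prediction at the earlier step was wrong.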

In [None]:
history = attn_model.fit(
    train_ds,
    epochs=5, # Reduced for demonstration speed
    validation_data=val_ds
)

## 7️⃣ Visualization: Peeking into the "Brain"

The beauty of Attention is interpretability. We can extract the attention weights to see which English words the model focused on when generating a specific German word.

To do this, we need a separate **Inference Model** that outputs the attention weights.

In [None]:
import matplotlib.pyplot as plt

# --- Define Inference Model for Visualization ---
# This model takes inputs and returns: Predictions, Attention Weights
vis_model = models.Model(
    inputs=[encoder_inputs, decoder_inputs], 
    outputs=[final_outputs, attention_weights]
)

def plot_attention(text_en, text_de_input):
    # Preprocess
    en_seq = en_vectorizer([text_en])
    de_seq = de_vectorizer([text_de_input])[:, :-1]  # Drop the last (padding) position to match the decoder input length
    
    # Predict
    preds, attn_weights = vis_model.predict([en_seq, de_seq])
    
    # attn_weights shape: (1, dec_len, enc_len)
    # Squeeze batch dimension
    attn_weights = attn_weights[0]
    
    # Get words for plotting
    en_words = text_en.split()
    de_words = text_de_input.split()[1:] # Skip 'sos'
    
    # Plot
    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(1, 1, 1)
    
    # Slice the attention matrix to match sentence lengths
    # (The model pads to 20, but we only want to visualize real words)
    ax.matshow(attn_weights[:len(de_words), :len(en_words)], cmap='viridis')
    
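    # Set tick positions explicitly so each label lines up with its row/column
    ax.set_xticks(range(len(en_words)))
    ax.set_yticks(range(len(de_words)))
    ax.set_xticklabels(en_words, rotation=90)
    ax.set_yticklabels(de_words)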
    
    plt.xlabel('Source (English)')
    plt.ylabel('Target (German)')
    plt.show()

# Example Visualization
# Ideally, we generate the translation first, but for visualization here
# we will force feed a known pair to see alignment.
sample_en = "i am very happy"
sample_de = "sos ich bin sehr glücklich"

plot_attention(sample_en, sample_de)
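
The example above force-feeds a known German sentence. Generating the translation first, as the code comment suggests, usually means decoding one token at a time. The sketch below shows greedy decoding in isolation. `greedy_decode` and `toy_next_token_probs` are hypothetical names, and the toy function is a scripted stand-in for the trained model. In practice it would have to wrap a call to the model on the padded decoder input and return the probabilities at the current position, which is an assumption not shown here.

```python
import numpy as np

def greedy_decode(next_token_probs, sos_id, eos_id, max_len=20):
    """Generate token IDs one at a time, always taking the most probable token."""
    tokens = [sos_id]
    for _ in range(max_len):
        probs = next_token_probs(tokens)   # distribution over the vocabulary
        next_id = int(np.argmax(probs))
        if next_id == eos_id:              # stop at the end token
            break
        tokens.append(next_id)
    return tokens[1:]                      # drop the start token

def toy_next_token_probs(tokens, vocab_size=10, eos_id=3):
    """Hypothetical stand-in for the model: emits 5, 6, 7, then the end token."""
    script = [5, 6, 7, eos_id]
    probs = np.full(vocab_size, 0.01)
    probs[script[len(tokens) - 1]] = 0.9
    return probs / probs.sum()

print(greedy_decode(toy_next_token_probs, sos_id=2, eos_id=3))  # [5, 6, 7]
```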

### Interpretation of the Plot

If the model has learned a sensible alignment, the heatmap should be roughly diagonal for this sentence pair, because the English and German word orders match:
* "ich" (I) should attend to "i".
* "bin" (am) should attend to "am".
* "sehr" (very) should attend to "very".

In complex translations where word order flips (e.g., German verbs moving to the end), the attention mechanism allows the model to jump to the correct encoder position effectively.
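
A heatmap can also be read programmatically by taking, for each target word, the source word with the largest weight. The attention matrix below is hypothetical, hand-written to resemble a well-trained model; it is not output from the model in this chapter.

```python
import numpy as np

en_words = ["i", "am", "very", "happy"]
de_words = ["ich", "bin", "sehr", "glücklich"]

# Hypothetical attention weights, shape (len(de_words), len(en_words)).
# Row t holds the weights the decoder placed on each source word at step t.
attn = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.10, 0.75, 0.10, 0.05],
    [0.05, 0.10, 0.70, 0.15],
    [0.05, 0.05, 0.20, 0.70],
])

# Hard alignment: index of the most-attended source word for each target word
best = attn.argmax(axis=1)
for i, j in enumerate(best):
    print(f"{de_words[i]} -> {en_words[j]} ({attn[i, j]:.2f})")
```

With weights from the real model, a row sliced down to the real words may sum to less than 1, because the custom layer applies softmax over all 20 encoder positions, including padding.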

## 8️⃣ Chapter Summary

* **Limitations of Fixed Context:** Standard Seq2Seq models struggle with long sentences because they compress everything into one vector.
* **Attention Mechanism:** Allows the Decoder to create a dynamic context vector at every time step by taking a weighted sum of Encoder states.
* **Implementation:** We built a custom `DecoderRNNAttentionWrapper` layer and used `K.rnn` to run custom step logic, including the alignment score calculation, at every decoder time step.
* **Alignment:** The attention weights $\alpha_{tj}$ represent the alignment between the target word at step $t$ and source word at step $j$.
* **Result:** Attention can improve translation quality, especially on long sentences (commonly measured with BLEU scores), and it makes the model interpretable via heatmaps.