# Chapter 12 — Sequence-to-sequence learning: Part 2 (Attention)

Chapter 11 built an encoder–decoder translator that predicts German tokens from an English sentence.  
The key limitation of the plain encoder–decoder setup is that the decoder often relies too heavily on a *single* compressed representation of the source sentence. When sentences get longer or contain multiple clauses, that compression becomes a bottleneck.

This chapter introduces **Bahdanau attention** (additive attention). Instead of treating the encoder as a one-shot summary, attention lets the decoder access *all* encoder time steps and select what is most relevant while generating each output token.

In this notebook, I reproduce the chapter workflow using the same English → German dataset and the same preprocessing steps as Chapter 11:
- download and prepare the dataset,
- build tokenizers and aligned input/target sequences with teacher forcing,
- implement Bahdanau attention with the Keras subclassing API,
- integrate attention into the encoder–decoder model,
- train and evaluate using accuracy and BLEU score,
- and finally inspect the model by **visualizing attention weights** (which words the model “looked at” while producing each German token).


## 1) Summary

### 1.1 Why attention improves seq2seq translation
In a plain encoder–decoder model, the encoder produces a fixed-size vector, and the decoder conditions on that vector to generate the translation.  
This works for short sentences, but for longer inputs the fixed vector can lose information.

Attention changes the information flow:
- the encoder produces a sequence of hidden states \(h_1, h_2, \dots, h_T\),
- at each decoder step \(t\), the model computes attention weights \(\alpha_{t, i}\) over encoder states,
- the decoder receives a context vector \(c_t = \sum_i \alpha_{t,i} h_i\), which is a weighted summary of the source sentence that is specific to the current output step.

This helps the model focus on relevant source words during generation.

### 1.2 Bahdanau attention (additive attention) in one place
Bahdanau attention uses a small feed-forward network to compute alignment scores (“energies”):

\[
e_{t,i} = v^\top \tanh(W h_i + U s_t)
\]

where:
- \(h_i\) is the encoder output at position \(i\),
- \(s_t\) is the decoder hidden state at step \(t\),
- \(W, U, v\) are learned parameters.

The attention weights are a softmax of the energies over the encoder positions \(i\):

\[
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}
\]

and the context vector is:

\[
c_t = \sum_i \alpha_{t,i} h_i
\]

In this notebook, I implement this mechanism as a reusable Keras layer.
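
To make the three formulas concrete, here is a minimal NumPy sketch of a single decoder step. The dimensions, the random parameters, and the function name `additive_attention` are illustrative assumptions, not part of the chapter's code. Vectors are stored as rows, so \(W h_i\) becomes `enc_states @ W`.

```python
import numpy as np

def additive_attention(enc_states, dec_state, W, U, v):
    """Additive attention for one decoder step.

    enc_states: (T, H_enc) encoder outputs h_1..h_T
    dec_state:  (H_dec,)   decoder state s_t
    W: (H_enc, A), U: (H_dec, A), v: (A,)  learned parameters (random here)
    """
    # e_{t,i} = v^T tanh(W h_i + U s_t), computed for all i at once
    energies = np.tanh(enc_states @ W + dec_state @ U) @ v   # (T,)
    # Softmax over encoder positions (shifted by the max for stability)
    exp_e = np.exp(energies - energies.max())
    weights = exp_e / exp_e.sum()                            # (T,)
    # c_t = sum_i alpha_{t,i} h_i
    context = weights @ enc_states                           # (H_enc,)
    return context, weights

rng = np.random.default_rng(0)
T, H_enc, H_dec, A = 4, 6, 5, 3   # toy sizes, chosen arbitrarily
enc_states = rng.normal(size=(T, H_enc))
dec_state = rng.normal(size=H_dec)
W = rng.normal(size=(H_enc, A))
U = rng.normal(size=(H_dec, A))
v = rng.normal(size=A)

context, weights = additive_attention(enc_states, dec_state, W, U, v)
print("weights:", weights.round(3), "sum:", weights.sum())
print("context shape:", context.shape)
```

The weights always sum to 1, so the context vector is a convex combination of the encoder states.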

### 1.3 What I evaluate (and why BLEU matters)
Token-level accuracy is useful for debugging, but it does not fully capture translation quality because:
- multiple translations can be valid,
- a single inserted or dropped word shifts every later position, so position-wise accuracy can mark an otherwise good translation as mostly wrong.

BLEU is an n-gram overlap metric that correlates better with translation quality than raw accuracy for this kind of task.  
I keep the BLEU implementation from Chapter 11 so results are comparable across models.
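
A small made-up example shows the difference. The two helper functions and the sentences below are hypothetical and exist only for this illustration; they are not the notebook's BLEU code. The candidate drops one word, which misaligns every later position but leaves most n-grams intact.

```python
from collections import Counter

def position_accuracy(reference, candidate):
    """Fraction of reference positions where the candidate has the same token."""
    hits = sum(r == c for r, c in zip(reference, candidate))
    return hits / len(reference)

def ngram_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against one reference."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    total = sum(cand.values())
    matched = sum(min(count, ref[ng]) for ng, count in cand.items())
    return matched / total if total else 0.0

reference = "ich bin heute sehr müde".split()
candidate = "ich bin sehr müde".split()   # one word dropped

acc = position_accuracy(reference, candidate)
p1 = ngram_precision(reference, candidate, 1)
p2 = ngram_precision(reference, candidate, 2)
print(f"position accuracy: {acc:.2f}, unigram precision: {p1:.2f}, bigram precision: {p2:.2f}")
```

Only two of five positions match, yet every candidate word and two of its three bigrams appear in the reference.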

### 1.4 Attention visualization as a sanity check
A major practical benefit of attention is interpretability:  
I can visualize the attention matrix (decoder steps × encoder steps) to see whether the model aligns German tokens to relevant English words (e.g., nouns → nouns, verbs → verbs, phrase boundaries).


## 2) Setup

Imports and reproducibility settings.

In [None]:
# Core
import os
import re
import zipfile
from pathlib import Path
from collections import Counter

# Data
import numpy as np
import pandas as pd

# TensorFlow
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Visualization
import matplotlib.pyplot as plt

# Reproducibility
SEED = 4321
np.random.seed(SEED)
tf.random.set_seed(SEED)

print("TensorFlow:", tf.__version__)


## 3) Download and load the English–German sentence pairs

This chapter uses the same dataset as Chapter 11:
- `http://www.manythings.org/anki/deu-eng.zip`

The extracted text file is `deu.txt`, which contains tab-separated lines:
`English<TAB>German<TAB>...`

I keep only the English and German columns for machine translation.
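
As a quick illustration of the layout, the sketch below splits one line by hand. The line itself is made up, and the third field is a placeholder for whatever extra columns the real file carries.

```python
# A made-up line in the same tab-separated layout as deu.txt
line = "Go.\tGeh.\t<extra metadata>\n"

fields = line.rstrip("\n").split("\t")
en, de = fields[0], fields[1]   # keep only the sentence pair
print(en, "->", de)
```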


In [None]:
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

ZIP_PATH = DATA_DIR / "deu-eng.zip"
TXT_PATH = DATA_DIR / "deu.txt"

URL = "http://www.manythings.org/anki/deu-eng.zip"

def download_and_extract(url=URL, zip_path=ZIP_PATH, txt_path=TXT_PATH):
    if not zip_path.exists():
        print("Downloading:", url)
        tf.keras.utils.get_file(
            fname=str(zip_path),
            origin=url,
            cache_dir=".",
            cache_subdir="",
        )
    else:
        print("Zip already exists:", zip_path)

    if not txt_path.exists():
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(DATA_DIR)
        print("Extracted to:", DATA_DIR)
    else:
        print("Text file already exists:", txt_path)

download_and_extract()

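# Load to DataFrame (the file has a third column besides the sentence pair)
df = pd.read_csv(
    TXT_PATH,
    sep="\t",
    header=None,
    names=["EN", "DE", "META"],
    usecols=[0, 1, 2],
)
# Keep only the English and German columns, as described above
df = df[["EN", "DE"]]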

print("Total pairs:", len(df))
df.head()


### 3.1 Basic cleaning and sampling

To keep runtime manageable in Colab, I sample a fixed number of sentence pairs.
I also add **start** and **end** tokens to German sequences because:
- the decoder needs an explicit start signal (`sos`),
- and we need a clean stopping criterion (`eos`) during inference.

I keep the same tokens as Chapter 11: `sos` and `eos`.


In [None]:
N_SAMPLES = 50000
df_sample = df.sample(n=min(N_SAMPLES, len(df)), random_state=SEED).reset_index(drop=True)

start_token = "sos"
end_token = "eos"
df_sample["DE"] = start_token + " " + df_sample["DE"].astype(str) + " " + end_token

df_sample.head()


## 4) Train/validation/test split and sequence length inspection

I split 80/10/10 so validation and test are both large enough to be informative.

I also inspect sentence lengths to choose a fixed maximum length for vectorization:
- this helps avoid extreme outliers dominating memory and compute,
- and it keeps encoder/decoder tensors consistent for training.

The length choice is a modeling decision: keeping longer sequences captures more information but increases cost.
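
A fixed maximum length means every sentence is either cut or padded. The sketch below mimics that effect on plain Python lists. `fit_to_length` and the token IDs are hypothetical. Right-padding with `0` is an assumption, chosen to match what `TextVectorization` does when `output_sequence_length` is set.

```python
def fit_to_length(token_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token IDs to exactly max_len entries."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

short = [12, 7, 30]
long = [5, 9, 2, 41, 8, 3, 17]

print(fit_to_length(short, max_len=5))   # [12, 7, 30, 0, 0]
print(fit_to_length(long, max_len=5))    # [5, 9, 2, 41, 8]
```

Tokens beyond the cap are lost, which is why the cap is chosen from a high quantile of the observed lengths rather than from the mean.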


In [None]:
# Train/valid/test split (80/10/10)
n = len(df_sample)
train_end = int(0.8 * n)
valid_end = int(0.9 * n)

train_df = df_sample.iloc[:train_end].reset_index(drop=True)
valid_df = df_sample.iloc[train_end:valid_end].reset_index(drop=True)
test_df  = df_sample.iloc[valid_end:].reset_index(drop=True)

print("Train:", len(train_df), "Valid:", len(valid_df), "Test:", len(test_df))

def sequence_length_stats(series, name, q_low=0.01, q_high=0.99):
    lengths = series.astype(str).apply(lambda s: len(s.split()))
    print("\n===", name, "===")
    print("Min:", int(lengths.min()), "Max:", int(lengths.max()))
    print("Mean:", float(lengths.mean()))
    print("Median:", float(lengths.median()))
    print(lengths.describe())

    lo = int(lengths.quantile(q_low))
    hi = int(lengths.quantile(q_high))
    trimmed = lengths[(lengths >= lo) & (lengths <= hi)]
    print(f"Between {q_low:.0%} and {q_high:.0%} quantiles (ignore outliers)")
    print(trimmed.describe())
    return lengths, lo, hi

en_lens, en_lo, en_hi = sequence_length_stats(train_df["EN"], "English (EN)")
de_lens, de_lo, de_hi = sequence_length_stats(train_df["DE"], "German (DE) with sos/eos")

# Conservative max lengths based on high quantiles (cap to keep runtime reasonable)
en_seq_length = int(min(en_hi, 20))
de_seq_length = int(min(de_hi, 22))

print("\nChosen EN max length:", en_seq_length)
print("Chosen DE max length:", de_seq_length)


## 5) Vectorizers and shifted decoder sequences (teacher forcing)

The decoder is trained with teacher forcing:
- input sequence is the German sentence shifted right (starts with `sos`),
- label sequence is the German sentence shifted left (ends with `eos`).

Example:
- original: `sos ich bin müde eos`
- decoder input : `sos ich bin müde`
- decoder label : `ich bin müde eos`

This shift makes the learning problem well-defined: the decoder predicts the next token given previous ground-truth tokens.
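
Spelled out step by step for the example sentence, the shift yields one (prefix, next token) training pair per position. This is only an illustration of the idea on a plain token list.

```python
tokens = "sos ich bin müde eos".split()

decoder_input = tokens[:-1]   # what the decoder sees
decoder_label = tokens[1:]    # what it must predict

# At step t the decoder has seen decoder_input[: t + 1] and predicts decoder_label[t]
for t, target in enumerate(decoder_label):
    prefix = " ".join(decoder_input[: t + 1])
    print(f"step {t}: given '{prefix}' -> predict '{target}'")
```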


In [None]:
def get_vectorizer(corpus: np.ndarray, n_vocab: int, max_length=None, name="vectorizer"):
    inp = tf.keras.Input(shape=(1,), dtype=tf.string)
    vectorize_layer = TextVectorization(
        max_tokens=n_vocab + 2,   # +2 for '' and [UNK]
        output_mode="int",
        output_sequence_length=max_length,
        standardize="lower_and_strip_punctuation",
    )
    vectorize_layer.adapt(corpus)
    vectorized_out = vectorize_layer(inp)
    return tf.keras.models.Model(inp, vectorized_out, name=name), vectorize_layer.get_vocabulary(), vectorize_layer

# Vocabulary sizes (can be tuned)
en_vocab = 15000
de_vocab = 15000

en_vectorizer, en_vocabulary, en_vec_layer = get_vectorizer(
    corpus=np.array(train_df["EN"].tolist()),
    n_vocab=en_vocab,
    max_length=en_seq_length,
    name="en_vectorizer",
)

# Decoder inputs/labels are length de_seq_length-1 after shifting
de_vectorizer, de_vocabulary, de_vec_layer = get_vectorizer(
    corpus=np.array(train_df["DE"].tolist()),
    n_vocab=de_vocab,
    max_length=de_seq_length - 1,
    name="de_vectorizer",
)

print("EN vocab size (with special tokens):", len(en_vocabulary))
print("DE vocab size (with special tokens):", len(de_vocabulary))
print("Special tokens (start of vocab):", de_vocabulary[:8])


In [None]:
def shift_target_sequence(s: str):
    tokens = s.split()
    de_in = " ".join(tokens[:-1])   # drop last token
    de_out = " ".join(tokens[1:])   # drop first token
    return de_in, de_out

def prepare_data(train_df: pd.DataFrame, valid_df: pd.DataFrame, test_df: pd.DataFrame):
    data = {}
    for name, dfi in [("train", train_df), ("valid", valid_df), ("test", test_df)]:
        en_inputs = dfi["EN"].astype(str).values.reshape(-1, 1)

        de_in_list = []
        de_out_list = []
        for s in dfi["DE"].astype(str).tolist():
            de_in, de_out = shift_target_sequence(s)
            de_in_list.append(de_in)
            de_out_list.append(de_out)

        de_inputs = np.array(de_in_list, dtype=object).reshape(-1, 1)
        de_labels = np.array(de_out_list, dtype=object).reshape(-1, 1)

        data[name] = {
            "encoder_inputs": en_inputs,
            "decoder_inputs": de_inputs,
            "decoder_labels": de_labels,
        }
    return data

data_dict = prepare_data(train_df, valid_df, test_df)
{k: {kk: v.shape for kk, v in data_dict[k].items()} for k in data_dict}


## 6) BLEU implementation (same as Chapter 11)

To keep evaluation consistent with Chapter 11, I use the same lightweight BLEU implementation:
- modified n-gram precision (up to 4-grams),
- brevity penalty,
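- a small floor (`1e-9`) on each n-gram precision, so a missing higher-order n-gram does not produce `log(0)`.

This is a mean of sentence-level BLEU scores, not an official corpus-level BLEU implementation. Candidates with fewer than four tokens, or with no matching 4-gram, still score close to zero because of the floor. It is nevertheless sufficient for relative comparisons while training.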
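
Two of these ingredients are easy to check by hand. The sketch below uses hypothetical helper names and a made-up sentence pair. It shows how clipping stops a repetitive candidate from earning full credit, and how the standard brevity penalty scales down a candidate that is shorter than the reference.

```python
import math
from collections import Counter

def clipped_unigram_precision(reference, candidate):
    """Each candidate word is credited at most as often as it occurs in the reference."""
    overlap = Counter(candidate) & Counter(reference)   # element-wise minimum
    return sum(overlap.values()) / len(candidate)

def brevity_penalty(ref_len, cand_len):
    """1.0 if the candidate is longer than the reference, else exp(1 - r/c)."""
    return 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)

reference = "die katze sitzt auf der matte".split()
repetitive = "die die die die".split()

p1 = clipped_unigram_precision(reference, repetitive)
bp = brevity_penalty(len(reference), len(repetitive))
print(f"clipped precision: {p1:.2f}, brevity penalty: {bp:.3f}")
```

Without clipping, all four candidate words would count as matches. With clipping, only one does.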


In [None]:
def _ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def _modified_precision(reference, candidate, n):
    ref_ngrams = Counter(_ngrams(reference, n))
    cand_ngrams = Counter(_ngrams(candidate, n))
    if len(cand_ngrams) == 0:
        return 0.0

    clipped = 0
    total = 0
    for ng, count in cand_ngrams.items():
        clipped += min(count, ref_ngrams.get(ng, 0))
        total += count
    return clipped / total if total > 0 else 0.0

def sentence_bleu(reference_tokens, candidate_tokens, max_n=4, smooth=1e-9):
    ref_len = len(reference_tokens)
    cand_len = len(candidate_tokens)
    if cand_len == 0:
        return 0.0

    bp = 1.0 if cand_len > ref_len else np.exp(1.0 - (ref_len / (cand_len + smooth)))

    precisions = []
    for n in range(1, max_n+1):
        p = _modified_precision(reference_tokens, candidate_tokens, n)
        precisions.append(max(p, smooth))

    log_p = sum((1.0/max_n) * np.log(p) for p in precisions)
    return float(bp * np.exp(log_p))

def clean_text(token_tensor, end_token="eos"):
    # token_tensor: (batch, time) int
    out = []
    for row in token_tensor:
        toks = []
        for tid in row:
            if tid == 0:
                continue
            tok = de_vocabulary[int(tid)] if int(tid) < len(de_vocabulary) else "[UNK]"
            if tok == end_token:
                break
            toks.append(tok)
        out.append(toks)
    return out

class BleuMetric:
    def __init__(self, end_token="eos"):
        self.end_token = end_token

    def calculate_bleu_from_predictions(self, y_true_ids, y_pred_probs):
        # y_true_ids: (batch, time)
        # y_pred_probs: (batch, time, vocab)
        y_pred_ids = np.argmax(y_pred_probs, axis=-1)

        refs = clean_text(y_true_ids, end_token=self.end_token)
        cands = clean_text(y_pred_ids, end_token=self.end_token)

        scores = []
        for r, c in zip(refs, cands):
            scores.append(sentence_bleu(r, c))
        return float(np.mean(scores)) if scores else 0.0

bleu_metric = BleuMetric(end_token="eos")


## 7) Model with Bahdanau attention

### 7.1 Encoder output changes from Part 1
Attention needs **all encoder time steps**, not only a final vector.  
So the encoder must output:
- `enc_outputs`: (batch, enc_time, hidden_dim)
- forward and backward states to initialize the decoder.

### 7.2 Attention layer implementation detail
I implement Bahdanau attention as a Keras layer:
- it takes `enc_outputs` and `dec_outputs`,
- computes alignment scores and a softmax distribution over encoder steps,
- returns a context sequence aligned to decoder time steps and the attention weights.

The attention weights are kept accessible so I can build a second model later that outputs them for visualization.
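
The layer scores every decoder step against every encoder step in one pass by broadcasting. The shapes can be checked with plain NumPy. The sizes and random arrays below are arbitrary stand-ins for the projected encoder and decoder outputs, not values from the trained model.

```python
import numpy as np

B, Td, Te, units, H = 2, 3, 5, 4, 6   # toy sizes, chosen arbitrarily
rng = np.random.default_rng(1)

w_enc = rng.normal(size=(B, Te, units))   # stands in for W(enc_outputs)
u_dec = rng.normal(size=(B, Td, units))   # stands in for U(dec_outputs)
v = rng.normal(size=(units,))
enc_outputs = rng.normal(size=(B, Te, H))

# (B, 1, Te, units) + (B, Td, 1, units) broadcasts to (B, Td, Te, units)
summed = w_enc[:, None, :, :] + u_dec[:, :, None, :]
scores = np.tanh(summed) @ v                                   # (B, Td, Te)

# Softmax over the encoder axis
exp_s = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_s / exp_s.sum(axis=-1, keepdims=True)            # (B, Td, Te)

# Weighted sum of encoder outputs for every decoder step
context = weights @ enc_outputs                                # (B, Td, H)

print(summed.shape, weights.shape, context.shape)
```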


In [None]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.W = tf.keras.layers.Dense(units, use_bias=False)
        self.U = tf.keras.layers.Dense(units, use_bias=False)
        self.V = tf.keras.layers.Dense(1, use_bias=False)

    def call(self, inputs):
        # inputs = [enc_outputs, dec_outputs]
        enc_outputs, dec_outputs = inputs  # (B, Te, H), (B, Td, H)

        # Transform
        # W(enc): (B, Te, units)
        # U(dec): (B, Td, units)
        w_enc = self.W(enc_outputs)
        u_dec = self.U(dec_outputs)

        # Broadcast addition: (B, Td, Te, units)
        w_enc = tf.expand_dims(w_enc, axis=1)         # (B, 1, Te, units)
        u_dec = tf.expand_dims(u_dec, axis=2)         # (B, Td, 1, units)

        score = self.V(tf.nn.tanh(w_enc + u_dec))     # (B, Td, Te, 1)
        attn_weights = tf.nn.softmax(score, axis=2)   # softmax over encoder steps

        # Context: weighted sum over encoder outputs
        enc_exp = tf.expand_dims(enc_outputs, axis=1)         # (B, 1, Te, H)
        context = tf.reduce_sum(attn_weights * enc_exp, axis=2)  # (B, Td, H)

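        return context, tf.squeeze(attn_weights, axis=-1)     # weights: (B, Td, Te)

    def get_config(self):
        # Include `units` so the layer can be serialized when the model is saved
        config = super().get_config()
        config.update({"units": self.units})
        return config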


In [None]:
def get_encoder_with_states(n_vocab: int, vectorizer: tf.keras.Model, enc_len: int):
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name="e_input")
    x = vectorizer(inp)  # (B, enc_len)

    emb = tf.keras.layers.Embedding(n_vocab + 2, 128, mask_zero=True, name="e_embedding")
    x = emb(x)  # (B, enc_len, 128)

    bi_gru = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, return_sequences=True, return_state=True),
        name="e_bi_gru",
    )
    # output: (B, enc_len, 256), states: (B, 128), (B, 128)
    enc_outputs, fwd_state, bwd_state = bi_gru(x)

    encoder = tf.keras.models.Model(
        inputs=inp,
        outputs=[fwd_state, bwd_state, enc_outputs],
        name="encoder_with_states",
    )
    return encoder

encoder = get_encoder_with_states(n_vocab=en_vocab, vectorizer=en_vectorizer, enc_len=en_seq_length)
encoder.summary()


In [None]:
def get_attention_seq2seq_model(n_vocab: int, encoder: tf.keras.Model, de_vectorizer: tf.keras.Model):
    # Encoder inputs
    e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name="e_input_final")
    fwd_state, bwd_state, enc_outputs = encoder(e_inp)
    d_init_state = tf.concat([fwd_state, bwd_state], axis=-1)  # (B, 256)

    # Decoder inputs (teacher forcing)
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name="d_input")
    d_vec = de_vectorizer(d_inp)  # (B, dec_len)

    d_emb = tf.keras.layers.Embedding(n_vocab + 2, 128, mask_zero=True, name="d_embedding")
    d_emb_out = d_emb(d_vec)

    d_gru = tf.keras.layers.GRU(256, return_sequences=True, name="d_gru")
    d_gru_out = d_gru(d_emb_out, initial_state=d_init_state)  # (B, dec_len, 256)

    # Attention
    attn_layer = BahdanauAttention(256, name="bahdanau_attention")
    context, attn_weights = attn_layer([enc_outputs, d_gru_out])  # (B, dec_len, 256), (B, dec_len, enc_len)

    # Combine context + decoder output
    concat = tf.concat([context, d_gru_out], axis=-1)  # (B, dec_len, 512)
    combine = tf.keras.layers.Dense(256, activation="tanh", name="attn_combine")(concat)

    # Classifier head
    d_dense_1 = tf.keras.layers.Dense(512, activation="relu", name="d_dense_1")
    x = d_dense_1(combine)

    d_final = tf.keras.layers.Dense(n_vocab + 2, activation="softmax", name="d_dense_final")
    out = d_final(x)

    model = tf.keras.models.Model(inputs=[e_inp, d_inp], outputs=out, name="seq2seq_with_attention")
    return model

attn_model = get_attention_seq2seq_model(n_vocab=de_vocab, encoder=encoder, de_vectorizer=de_vectorizer)
attn_model.summary()


## 8) Training helpers (batch evaluation + BLEU)

I keep the same approach as Chapter 11:
- vectorize labels to token IDs,
- train using sparse categorical cross-entropy,
- report mean loss and accuracy,
- compute BLEU from model predictions for a more translation-relevant metric.
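
For reference, sparse categorical cross-entropy on a single sequence reduces to the mean of \(-\log p(\text{label})\) over time steps. The sketch below computes it by hand on made-up probabilities. It ignores batching and padding masks, so it only approximates what Keras reports.

```python
import math

def sparse_categorical_crossentropy(probs, labels):
    """Mean of -log p[label] over time steps; labels are integer token IDs."""
    losses = [-math.log(step_probs[label]) for step_probs, label in zip(probs, labels)]
    return sum(losses) / len(losses)

# Two decoder steps over a toy 3-token vocabulary (made-up probabilities)
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
labels = [0, 1]   # correct token ID at each step

loss = sparse_categorical_crossentropy(probs, labels)
# Token accuracy: share of steps where the argmax equals the label
accuracy = sum(p.index(max(p)) == y for p, y in zip(probs, labels)) / len(labels)
print(f"loss: {loss:.4f}, accuracy: {accuracy:.2f}")
```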


In [None]:
def evaluate_model(model, vectorizer, data_split, batch_size=128, bleu_metric=None):
    en_inputs_raw = data_split["encoder_inputs"]
    de_inputs_raw = data_split["decoder_inputs"]
    de_labels_raw = data_split["decoder_labels"]

    loss_log, acc_log, bleu_log = [], [], []
    n_batches = en_inputs_raw.shape[0] // batch_size
    if n_batches == 0:
        n_batches = 1

    for i in range(n_batches):
        x = [
            en_inputs_raw[i*batch_size:(i+1)*batch_size],
            de_inputs_raw[i*batch_size:(i+1)*batch_size],
        ]
        y = vectorizer(de_labels_raw[i*batch_size:(i+1)*batch_size])

        eval_out = model.evaluate(x, y, verbose=0)
        loss_log.append(eval_out[0])
        acc_log.append(eval_out[1])

        if bleu_metric is not None:
            pred_y = model.predict(x, verbose=0)
            bleu = bleu_metric.calculate_bleu_from_predictions(y.numpy(), pred_y)
            bleu_log.append(bleu)

    mean_loss = float(np.mean(loss_log))
    mean_acc = float(np.mean(acc_log))
    mean_bleu = float(np.mean(bleu_log)) if bleu_log else 0.0
    return mean_loss, mean_acc, mean_bleu

def train_model(model, vectorizer, data_dict, epochs=5, batch_size=128, bleu_metric=None):
    train_split = data_dict["train"]
    valid_split = data_dict["valid"]

    for ep in range(1, epochs+1):
        print(f"\nEpoch {ep}/{epochs}")

        # Train in mini-batches
        en_inputs_raw = train_split["encoder_inputs"]
        de_inputs_raw = train_split["decoder_inputs"]
        de_labels_raw = train_split["decoder_labels"]

        idx = np.arange(len(en_inputs_raw))
        np.random.shuffle(idx)

        n_batches = len(idx) // batch_size
        if n_batches == 0:
            n_batches = 1

        for i in range(n_batches):
            b = idx[i*batch_size:(i+1)*batch_size]
            x = [en_inputs_raw[b], de_inputs_raw[b]]
            y = vectorizer(de_labels_raw[b])
            model.train_on_batch(x, y)

        tr_loss, tr_acc, tr_bleu = evaluate_model(model, vectorizer, train_split, batch_size=batch_size, bleu_metric=bleu_metric)
        va_loss, va_acc, va_bleu = evaluate_model(model, vectorizer, valid_split, batch_size=batch_size, bleu_metric=bleu_metric)

        print(f"(train) loss: {tr_loss:.4f} - accuracy: {tr_acc:.4f} - bleu: {tr_bleu:.4f}")
        print(f"(valid) loss: {va_loss:.4f} - accuracy: {va_acc:.4f} - bleu: {va_bleu:.4f}")

    return model


In [None]:
# Compile and train
attn_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)

EPOCHS = 5
BATCH_SIZE = 128

attn_model = train_model(
    attn_model,
    vectorizer=de_vectorizer,
    data_dict=data_dict,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    bleu_metric=bleu_metric,
)


## 9) Test evaluation

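I evaluate on the test split after training.  
The BLEU score is computed from teacher-forced predictions on the same batches (the decoder sees the ground-truth previous tokens), so it is likely to be higher than what free-running greedy decoding would reach. `evaluate_model` also processes only full batches, so up to `batch_size - 1` trailing examples are left out.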


In [None]:
te_loss, te_acc, te_bleu = evaluate_model(
    attn_model,
    vectorizer=de_vectorizer,
    data_split=data_dict["test"],
    batch_size=BATCH_SIZE,
    bleu_metric=bleu_metric,
)

print(f"(test) loss: {te_loss:.4f} - accuracy: {te_acc:.4f} - bleu: {te_bleu:.4f}")


## 10) Save the trained model and vocabularies

Saving the model and vocabularies makes the notebook reproducible:
- the same token-to-id mapping is necessary for consistent inference,
- and it avoids needing to re-adapt vectorizers when testing translations later.
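
One detail matters when reading such a file back: the first vocabulary entry is the empty padding token, so the file starts with a blank line that must be kept. The round trip below is a sketch on a toy vocabulary with a hypothetical file name, not the notebook's loading code.

```python
import tempfile
from pathlib import Path

# A toy vocabulary in the layout the vectorizers use: '' (padding) first, then [UNK]
vocab = ["", "[UNK]", "sos", "eos", "ich", "bin"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocab.txt"   # hypothetical file name
    path.write_text("".join(tok + "\n" for tok in vocab), encoding="utf-8")

    # Split on newlines only; skipping blank lines would drop the padding
    # token at index 0 and shift every ID by one.
    restored = path.read_text(encoding="utf-8").split("\n")[:-1]

token_to_id = {tok: i for i, tok in enumerate(restored)}
print(restored == vocab, token_to_id["sos"])
```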


In [None]:
MODEL_DIR = Path("models")
MODEL_DIR.mkdir(exist_ok=True)

# Save model
attn_model.save(MODEL_DIR / "ch12_seq2seq_attention.keras")

# Save vocabularies
with open(MODEL_DIR / "ch12_en_vocab.txt", "w", encoding="utf-8") as f:
    for tok in en_vocabulary:
        f.write(tok + "\n")

with open(MODEL_DIR / "ch12_de_vocab.txt", "w", encoding="utf-8") as f:
    for tok in de_vocabulary:
        f.write(tok + "\n")

print("Saved model and vocabularies in:", MODEL_DIR.resolve())


## 11) Attention visualization

To visualize attention, I build a second model that shares the same trained weights but also outputs the attention weights tensor.

Because the attention layer is part of the computation graph, I can retrieve its second output (weights) without changing the training model.


In [None]:
# Build a model that outputs both translation probabilities and attention weights
attn_layer = attn_model.get_layer("bahdanau_attention")

# Layer outputs: (context, weights)
context_tensor, weights_tensor = attn_layer.output

viz_model = tf.keras.Model(
    inputs=attn_model.inputs,
    outputs=[attn_model.output, weights_tensor],
    name="seq2seq_with_attention_viz",
)

viz_model.summary()


### 11.1 Greedy decoding (simple inference) + attention matrix

For a qualitative check, I implement a small greedy decoder:
- start with `sos`,
- repeatedly predict the next token by argmax,
- stop when `eos` is produced or when max steps is reached.

At the same time, I record the attention weights at each decoding step so I can visualize a heatmap.
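
Stripped of the model, the greedy loop looks like the sketch below. `TOY_MODEL` is a hypothetical lookup table standing in for the trained network. It maps the tokens generated so far to made-up next-token probabilities.

```python
# Hypothetical stand-in for the trained model
TOY_MODEL = {
    ("sos",): {"ich": 0.9, "bin": 0.1},
    ("sos", "ich"): {"bin": 0.8, "müde": 0.2},
    ("sos", "ich", "bin"): {"müde": 0.7, "eos": 0.3},
    ("sos", "ich", "bin", "müde"): {"eos": 0.95, "ich": 0.05},
}

def greedy_decode(next_token_probs, max_steps=10):
    tokens = ["sos"]
    for _ in range(max_steps):
        probs = next_token_probs[tuple(tokens)]
        next_tok = max(probs, key=probs.get)   # argmax over candidates
        if next_tok == "eos":
            break
        tokens.append(next_tok)
    return tokens[1:]   # drop the start token

print(greedy_decode(TOY_MODEL))   # ['ich', 'bin', 'müde']
```

The `max_steps` cap guarantees termination even if the end token is never the most likely choice.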


In [None]:
# Build token id mappings from vocabulary lists
en_word_index = {w: i for i, w in enumerate(en_vocabulary)}
de_word_index = {w: i for i, w in enumerate(de_vocabulary)}
de_index_word = {i: w for i, w in enumerate(de_vocabulary)}

sos_id = de_word_index.get("sos", None)
eos_id = de_word_index.get("eos", None)
print("sos id:", sos_id, "| eos id:", eos_id)

def standardize_text(s: str):
    s = s.lower()
    s = re.sub(r"[^a-zA-Zäöüß0-9\s]", "", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def greedy_translate_with_attention(sentence: str, max_len=30):
    sentence = standardize_text(sentence)
    enc_inp = np.array([[sentence]], dtype=object)

    # Start decoder with "sos"
    dec_tokens = ["sos"]
    attn_rows = []

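    # The decoder vectorizer emits only de_seq_length - 1 positions, so cap the
    # loop there to keep the time index below in range if "eos" never appears.
    for _ in range(min(max_len, de_seq_length - 1)):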
        dec_inp = np.array([[" ".join(dec_tokens)]], dtype=object)

        probs, weights = viz_model.predict([enc_inp, dec_inp], verbose=0)
        # probs: (1, dec_len, vocab), weights: (1, dec_len, enc_len)

        # Next token is last time step prediction
        next_id = int(np.argmax(probs[0, len(dec_tokens)-1]))
        next_tok = de_index_word.get(next_id, "[UNK]")

        # Save attention row for this step (use same time index)
        attn_row = weights[0, len(dec_tokens)-1]  # (enc_len,)
        attn_rows.append(attn_row)

        dec_tokens.append(next_tok)
        if next_tok == "eos":
            break

    # Remove sos/eos for display
    out_tokens = [t for t in dec_tokens[1:] if t != "eos"]
    return out_tokens, np.array(attn_rows), sentence

def decode_en_tokens(sentence: str):
    # Approximate token display for x-axis
    toks = standardize_text(sentence).split()
    return toks[:en_seq_length]

sample_sentence = "I am a student."
out_tokens, attn_mat, cleaned_en = greedy_translate_with_attention(sample_sentence, max_len=25)

print("EN:", cleaned_en)
print("DE:", " ".join(out_tokens))
print("Attention matrix shape:", attn_mat.shape)  # (dec_steps, enc_len)


### 11.2 Plot the attention heatmap

Rows: generated German tokens  
Columns: English tokens

Even if translation is not perfect, I expect to see some structured alignment pattern (rather than uniform or random weights).
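
One possible numeric companion to the visual check, offered here as an extra heuristic rather than something from the chapter, is the normalized entropy of each attention row. The function name and the example weights are made up.

```python
import math

def normalized_entropy(row):
    """Entropy of one attention row divided by log(n): 1.0 is uniform, near 0 is peaked."""
    entropy = -sum(p * math.log(p) for p in row if p > 0)
    return entropy / math.log(len(row))

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.91, 0.03, 0.03, 0.03]   # made-up weights

print(round(normalized_entropy(uniform), 3), round(normalized_entropy(peaked), 3))
```

Rows with values close to 1.0 suggest the model is not focusing on any particular source token at that step.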


In [None]:
def plot_attention(attn_mat, en_tokens, de_tokens):
    # attn_mat: (T_dec, enc_len). Limit to real token counts for readability.
    enc_len = min(len(en_tokens), attn_mat.shape[1])
    dec_len = min(len(de_tokens), attn_mat.shape[0])

    mat = attn_mat[:dec_len, :enc_len]

    plt.figure(figsize=(min(12, 0.6 * enc_len + 2), min(8, 0.6 * dec_len + 2)))
    plt.imshow(mat, aspect="auto")
    plt.colorbar()
    plt.xticks(range(enc_len), en_tokens[:enc_len], rotation=45, ha="right")
    plt.yticks(range(dec_len), de_tokens[:dec_len])
    plt.xlabel("English tokens")
    plt.ylabel("Generated German tokens")
    plt.title("Bahdanau attention heatmap")
    plt.tight_layout()
    plt.show()

en_tokens = decode_en_tokens(cleaned_en)
plot_attention(attn_mat, en_tokens, out_tokens)


## 12) Takeaways

- Attention improves seq2seq translation by reducing the compression bottleneck of plain encoder–decoder models.
- Bahdanau attention provides a learnable alignment mechanism that selects relevant encoder time steps per decoder step.
- BLEU complements token accuracy for translation evaluation and gives a more task-relevant signal.
- Visualizing attention weights is a practical way to validate that the model is using source context in a structured way.


## 13) References

- Thushan Ganegedara, *TensorFlow in Action* (Chapter 12).
- Bahdanau et al., “Neural Machine Translation by Jointly Learning to Align and Translate.”
- ManyThings.org English–German sentence pairs (Anki dataset).
