# Chapter 11 — Sequence-to-sequence learning: Part 1

This chapter moves from *single-output* NLP tasks (classification and language modeling) into **sequence-to-sequence (seq2seq)** learning, where:

- the input is a sequence (e.g., an English sentence),
- the output is another sequence (e.g., a German sentence),
- and input/output lengths can differ.

The main example is **English → German machine translation** using an **encoder–decoder** architecture with GRU layers. I reproduce the workflow from the book:

1. Download and inspect a translation dataset (`deu.txt` from `deu-eng.zip`).
2. Prepare the data for teacher forcing (decoder inputs vs. decoder labels).
3. Build an end-to-end model that accepts **raw strings** and performs tokenization internally using `TextVectorization`.
4. Train and evaluate using **masked accuracy** and a **BLEU-style** metric.
5. Repurpose the trained model into an **inference model** that generates one token at a time, starting from `sos` and stopping at `eos`.

> Notes for Colab: This notebook is written to run in Google Colab (CPU or GPU).  
> If training is slow, reduce `N_SAMPLES` (Section 2) and/or `EPOCHS` (Section 10).


## 0) Setup

I keep the setup minimal: TensorFlow 2.x + standard Python libraries.

The dataset is small enough to download directly inside the notebook.


In [1]:
# Core
import os
import re
import zipfile
from pathlib import Path

# Data / math
import numpy as np
import pandas as pd

# TensorFlow
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Reproducibility
np.random.seed(4321)
tf.random.set_seed(4321)

print("TensorFlow:", tf.__version__)


TensorFlow: 2.19.0


## 1) Download and load the English–German dataset

The chapter uses a tab-separated text file (`deu.txt`) where each row has:

- English sentence (source)
- German sentence (target)
- Attribution metadata (we ignore this for modeling)

The file is distributed inside `deu-eng.zip`.
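
To make the row layout concrete, here is a minimal sketch that parses one tab-separated row with the standard library. The sample line and its attribution text are made up for illustration; only the three-column layout comes from the description above.

```python
import csv
import io

# Hypothetical sample row: English, German, attribution (tab-separated).
sample = "Go.\tGeh.\tsome attribution text\n"

# QUOTE_NONE treats quotation marks inside sentences as ordinary characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [(en, de) for en, de, _attribution in reader]
print(rows)
```

The attribution column is read and then discarded, which mirrors how the notebook keeps only the `EN` and `DE` columns.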


In [3]:
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

ZIP_PATH = DATA_DIR / "deu-eng.zip"
TXT_PATH = DATA_DIR / "deu.txt"

URL = "http://www.manythings.org/anki/deu-eng.zip"

def download_and_extract(url=URL, zip_path=ZIP_PATH, txt_path=TXT_PATH):
    if not zip_path.exists():
        print("Downloading:", url)
        # Keras 3 rejects paths in `fname`; pass the bare file name and
        # point `cache_dir` at the target directory instead.
        tf.keras.utils.get_file(
            fname=zip_path.name,
            origin=url,
            cache_dir=str(zip_path.parent.resolve()),
            cache_subdir="",
        )
    else:
        print("Zip already exists:", zip_path)

    if not txt_path.exists():
        print("Extracting zip...")
        with zipfile.ZipFile(zip_path, "r") as zf:
            zf.extractall(DATA_DIR)
    else:
        print("Text file already exists:", txt_path)

download_and_extract()

print("Exists?", TXT_PATH.exists())

In [None]:
# Load the raw file into a DataFrame
df = pd.read_csv(TXT_PATH, delimiter="\t", header=None)
df.columns = ["EN", "DE", "Attribution"]
df = df[["EN", "DE"]]

print("df.shape =", df.shape)
df.head()


## 2) Cleaning, sampling, and adding `sos` / `eos`

### Why `sos` and `eos`?
For sequence generation, it helps to mark:

- `sos`: *start of sequence* (the first input token to the decoder during inference)
- `eos`: *end of sequence* (stopping condition during inference)

During training, I insert these tokens into every German sentence so that the model learns the same “format” it will face at inference time.

### Sampling
To keep training practical, the chapter uses a subset of the full dataset. I do the same using `N_SAMPLES`.


In [None]:
# Basic filtering: drop rows whose UTF-8 encoding contains the byte 0xC2.
# That byte is the lead byte of code points U+0080 to U+00BF (for example a
# non-breaking space), which the book filters out as problematic.
def has_bad_bytes(s: str) -> bool:
    try:
        return b"\xc2" in s.encode("utf-8")
    except Exception:
        return True

mask = ~df["DE"].apply(has_bad_bytes)
df = df[mask].reset_index(drop=True)

print("After filtering:", df.shape)


In [None]:
N_SAMPLES = 50000
df_sample = df.sample(n=min(N_SAMPLES, len(df)), random_state=4321).reset_index(drop=True)

start_token = "sos"
end_token = "eos"

df_sample["DE"] = start_token + " " + df_sample["DE"].astype(str) + " " + end_token

df_sample.head()


In [None]:
# Train/valid/test split (80/10/10)
n = len(df_sample)
n_train = int(0.8 * n)
n_valid = int(0.1 * n)

train_df = df_sample.iloc[:n_train].reset_index(drop=True)
valid_df = df_sample.iloc[n_train:n_train+n_valid].reset_index(drop=True)
test_df  = df_sample.iloc[n_train+n_valid:].reset_index(drop=True)

print("train:", train_df.shape, "valid:", valid_df.shape, "test:", test_df.shape)


## 3) Quick EDA: vocabulary size and sequence lengths

The book uses two practical checks:

1. **Vocabulary size above a frequency threshold**  
   This gives a sense of how many tokens the model will need to handle.

2. **Sequence length statistics**  
   This helps pick a reasonable maximum sequence length for padding/truncation.

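These are not strict rules, but they ground the hyperparameter choices in the data. Both checks split on whitespace only, whereas `TextVectorization` also lowercases and strips punctuation by default, so the numbers approximate what the model will actually see.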


In [None]:
from collections import Counter

def vocabulary_stats(series: pd.Series, min_freq: int = 10, top_k: int = 10):
    # Count tokens in a single pass; summing lists is quadratic in corpus size.
    counter = Counter(tok for tokens in series.str.split() for tok in tokens)
    freq = pd.Series(counter).sort_values(ascending=False)

    print("="*50)
    print("Top tokens")
    print(freq.head(top_k))
    vocab_size = int((freq >= min_freq).sum())
    print(f"Vocabulary size (>= {min_freq} frequent): {vocab_size}")
    return vocab_size, freq

en_vocab, en_freq = vocabulary_stats(train_df["EN"], min_freq=10, top_k=10)
de_vocab, de_freq = vocabulary_stats(train_df["DE"], min_freq=10, top_k=10)


In [None]:
def sequence_length_stats(series: pd.Series, name: str, q_low=0.01, q_high=0.99):
    lengths = series.str.split().apply(len)
    print("="*50)
    print(name)
    print("Median length:", float(lengths.median()))
    print(lengths.describe())

    lo = int(lengths.quantile(q_low))
    hi = int(lengths.quantile(q_high))
    trimmed = lengths[(lengths >= lo) & (lengths <= hi)]
    print(f"Between {int(q_low*100)}% and {int(q_high*100)}% quantiles (ignore outliers)")
    print(trimmed.describe())
    return lengths, lo, hi

en_lens, en_lo, en_hi = sequence_length_stats(train_df["EN"], "English (EN)")
de_lens, de_lo, de_hi = sequence_length_stats(train_df["DE"], "German (DE) with sos/eos")


### Choosing maximum sequence lengths

In the chapter, the chosen maximum lengths are:

- English max length: **19**
- German max length: **21**

German includes `sos` and `eos`, so it is slightly longer.

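I keep these values because they match the book’s workflow. It is worth confirming that they sit at or above the 99% quantiles printed by `sequence_length_stats`, so that few sentences get truncated.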
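
One way to sanity-check a maximum length is to measure what fraction of sentences fit within it. The sketch below is an assumption-laden illustration: `coverage` is a hypothetical helper, and the toy lengths are invented rather than taken from the dataset.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose token count is <= max_len."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Toy token counts, not taken from the dataset.
toy_lengths = [3, 5, 6, 6, 7, 8, 9, 12, 19, 25]
print(coverage(toy_lengths, 19))  # 9 of the 10 toy sequences fit
```

On the real data, the same function could be applied to `en_lens.tolist()` and `de_lens.tolist()` from the previous cell.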


In [None]:
en_seq_length = 19
de_seq_length = 21

print("EN vocabulary size:", en_vocab)
print("DE vocabulary size:", de_vocab)
print("EN max sequence length:", en_seq_length)
print("DE max sequence length:", de_seq_length)


## 4) TextVectorization inside the model

Previously, tokenization was treated as a preprocessing step. Here the model is more end-to-end:

- Input: raw strings
- Inside the model: `TextVectorization` converts strings → integer token IDs
- Then: embedding layer converts IDs → trainable word vectors

This makes the model easier to use at inference time, because it accepts raw text directly.
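
To show what the string-to-ID step does conceptually, here is a rough pure-Python stand-in. It is not the Keras layer: `toy_vectorize` and the tiny vocabulary are hypothetical, and the sketch only assumes the layer's default behaviour of lowercasing, stripping punctuation, reserving ID 0 for padding and ID 1 for `[UNK]`.

```python
import re
import string

def toy_vectorize(text, vocab, max_len):
    """Rough stand-in for TextVectorization(output_mode="int")."""
    # Default-style standardization: lowercase, then strip punctuation.
    cleaned = re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in cleaned.split()]  # 1 = [UNK]
    ids = ids[:max_len]                                    # truncate
    return ids + [0] * (max_len - len(ids))                # 0 = padding

vocab = ["", "[UNK]", "sos", "eos", "ich", "bin", "müde"]
print(toy_vectorize("sos Ich bin müde. eos", vocab, max_len=7))
```

Because punctuation is stripped, plain words such as `sos` and `eos` survive standardization unchanged, whereas a bracketed marker like `<sos>` would lose its brackets.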


In [None]:
def get_vectorizer(corpus: np.ndarray, n_vocab: int, max_length=None, name="vectorizer"):
    """Return a (string -> token IDs) model and its vocabulary."""
    inp = tf.keras.Input(shape=(1,), dtype=tf.string)
    vectorize_layer = TextVectorization(
        max_tokens=n_vocab + 2,  # +2 for '' and [UNK]
        output_mode="int",
        output_sequence_length=max_length,
    )
    vectorize_layer.adapt(corpus)
    vectorized_out = vectorize_layer(inp)

    return tf.keras.models.Model(inp, vectorized_out, name=name), vectorize_layer.get_vocabulary()

# English vectorizer (encoder input)
en_vectorizer, en_vocabulary = get_vectorizer(
    corpus=np.array(train_df["EN"].tolist()),
    n_vocab=en_vocab,
    max_length=en_seq_length,
    name="en_vectorizer",
)

# German vectorizer: decoder inputs and labels each hold de_seq_length - 1 tokens after shifting
de_vectorizer, de_vocabulary = get_vectorizer(
    corpus=np.array(train_df["DE"].tolist()),
    n_vocab=de_vocab,
    max_length=de_seq_length - 1,
    name="d_vectorizer",
)

print("EN vocab size (with special tokens):", len(en_vocabulary))
print("DE vocab size (with special tokens):", len(de_vocabulary))
print("Special tokens (start):", de_vocabulary[:5])


## 5) Building the encoder

The encoder consumes the **English** sequence and produces a single vector representation (context vector).

Architecture (matching the chapter):
- `TextVectorization` (English)
- `Embedding` (128-dim)
- `Bidirectional(GRU(128))`

Because the bidirectional GRU concatenates forward and backward representations, the final context vector has size **256**.
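
The layer sizes can be checked with a little arithmetic. The sketch below assumes the Keras GRU default `reset_after=True`, which keeps two bias vectors per gate; `gru_params` is a hypothetical helper, not a Keras function.

```python
def gru_params(input_dim, units, reset_after=True):
    """Trainable weights in one GRU layer (3 gates: update, reset, candidate)."""
    bias = 2 * units if reset_after else units
    return 3 * (input_dim * units + units * units + bias)

one_direction = gru_params(input_dim=128, units=128)
print(one_direction, 2 * one_direction)  # per direction, bidirectional total

# Forward and backward final states are concatenated into the context vector.
context_size = 2 * 128
print(context_size)
```

These numbers can be compared against the `e_bi_gru` row printed by `encoder.summary()`.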


In [None]:
def get_encoder(n_vocab: int, vectorizer: tf.keras.Model):
    """Define the encoder of the seq2seq model."""
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name="e_input")
    vectorized_out = vectorizer(inp)

    emb_layer = tf.keras.layers.Embedding(
        n_vocab + 2, 128, mask_zero=True, name="e_embedding"
    )
    emb_out = emb_layer(vectorized_out)

    gru_layer = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128), name="e_bi_gru"
    )
    gru_out = gru_layer(emb_out)

    encoder = tf.keras.models.Model(inputs=inp, outputs=gru_out, name="encoder")
    return encoder

encoder = get_encoder(n_vocab=en_vocab, vectorizer=en_vectorizer)
encoder.summary()


## 6) Building the decoder + final seq2seq model (teacher forcing)

During training, the decoder receives **ground-truth target tokens** (teacher forcing).  
It learns to predict the **next** token at each step.

So I construct:
- decoder inputs: target sentence **without the last token**
- decoder labels: target sentence **without the first token**

Example target sentence:
`"sos ich möchte ... eos"`

- decoder inputs: `"sos ich möchte ..."`
- decoder labels: `"ich möchte ... eos"`
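
The shift described above can be seen step by step in a minimal sketch. The sentence is a toy example chosen for illustration.

```python
target = "sos ich bin müde eos".split()  # toy sentence for illustration

decoder_inputs = target[:-1]   # everything except the last token
decoder_labels = target[1:]    # everything except the first token

# At each step the decoder sees one token and must predict the next one.
for step, (seen, expected) in enumerate(zip(decoder_inputs, decoder_labels)):
    print(f"step {step}: input={seen!r:8} -> label={expected!r}")
```

Both lists have the same length, which is why the German vectorizer pads to `de_seq_length - 1` rather than `de_seq_length`.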

Decoder architecture:
- `TextVectorization` (German)
- `Embedding` (128-dim)
- `GRU(256, return_sequences=True)` initialized with the encoder context vector
- `Dense(512, relu)`
- `Dense(vocab_size, softmax)` producing a probability distribution at each time step


In [None]:
def get_final_seq2seq_model(n_vocab: int, encoder: tf.keras.Model, vectorizer: tf.keras.Model):
    """Define the final encoder–decoder model."""
    e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name="e_input_final")
    d_init_state = encoder(e_inp)

    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name="d_input")
    d_vectorized_out = vectorizer(d_inp)

    d_emb_layer = tf.keras.layers.Embedding(
        n_vocab + 2, 128, mask_zero=True, name="d_embedding"
    )
    d_emb_out = d_emb_layer(d_vectorized_out)

    d_gru_layer = tf.keras.layers.GRU(256, return_sequences=True, name="d_gru")
    d_gru_out = d_gru_layer(d_emb_out, initial_state=d_init_state)

    d_dense_layer_1 = tf.keras.layers.Dense(512, activation="relu", name="d_dense_1")
    d_dense1_out = d_dense_layer_1(d_gru_out)

    d_dense_layer_final = tf.keras.layers.Dense(
        n_vocab + 2, activation="softmax", name="d_dense_final"
    )
    d_final_out = d_dense_layer_final(d_dense1_out)

    seq2seq = tf.keras.models.Model(
        inputs=[e_inp, d_inp], outputs=d_final_out, name="final_seq2seq"
    )
    return seq2seq

final_model = get_final_seq2seq_model(n_vocab=de_vocab, encoder=encoder, vectorizer=de_vectorizer)
final_model.summary()


## 7) Compiling with masked loss / masked accuracy

Both decoder labels and predictions are padded to a fixed length (`de_seq_length-1`).

Padding tokens should not contribute to:
- cross-entropy loss
- accuracy

So I use masked versions of both.
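
A small pure-Python sketch shows why masking matters. The IDs are invented, and `masked_accuracy_py` is a hypothetical helper that only mirrors the idea of the TensorFlow version in the next cell.

```python
def masked_accuracy_py(y_true, y_pred, pad_id=0):
    """Token accuracy over non-padding positions of flat ID lists."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != pad_id]
    if not kept:
        return 0.0
    return sum(t == p for t, p in kept) / len(kept)

y_true = [5, 9, 3, 0, 0, 0]   # three real tokens, three padding slots
y_pred = [5, 2, 3, 0, 0, 0]   # one mistake among the real tokens

naive = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(naive, masked_accuracy_py(y_true, y_pred))
```

The unmasked score counts the three trivially correct padding positions and reports 5/6, while the masked score reports the 2/3 that reflects the real tokens.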


In [None]:
loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False, reduction="none")

def masked_loss(y_true, y_pred):
    # y_true: (batch, time), y_pred: (batch, time, vocab)
    loss = loss_obj(y_true, y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)  # 0 is padding
    loss = tf.reduce_sum(loss * mask) / tf.reduce_sum(mask)
    return loss

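def masked_accuracy(y_true, y_pred):
    # Cast labels to int32: tf.equal needs matching dtypes, and y_true may
    # arrive as int64 or float32 depending on the Keras version.
    y_true = tf.cast(y_true, tf.int32)
    y_pred_ids = tf.argmax(y_pred, axis=-1, output_type=tf.int32)
    matches = tf.cast(tf.equal(y_true, y_pred_ids), tf.float32)
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
    return tf.reduce_sum(matches * mask) / tf.reduce_sum(mask)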

final_model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=masked_loss,
    metrics=[masked_accuracy],
)


## 8) Preparing data for teacher forcing

Each split holds three arrays of raw strings shaped `(N, 1)`:

- encoder input: English sentence
- decoder input: German sentence **without the last token**
- decoder label: German sentence **without the first token** (vectorized during training/eval)

I keep these as raw strings and let `TextVectorization` inside the model handle the tokenization.


In [None]:
def shift_target_sequence(s: str):
    tokens = s.split()
    # input: drop last token, label: drop first token
    de_in = " ".join(tokens[:-1])
    de_out = " ".join(tokens[1:])
    return de_in, de_out

def prepare_data(train_df: pd.DataFrame, valid_df: pd.DataFrame, test_df: pd.DataFrame):
    data = {}
    for name, dfi in [("train", train_df), ("valid", valid_df), ("test", test_df)]:
        en_inputs = dfi["EN"].astype(str).values.reshape(-1, 1)
        de_in_list = []
        de_out_list = []
        for s in dfi["DE"].astype(str).tolist():
            de_in, de_out = shift_target_sequence(s)
            de_in_list.append(de_in)
            de_out_list.append(de_out)

        de_inputs = np.array(de_in_list, dtype=object).reshape(-1, 1)
        de_labels = np.array(de_out_list, dtype=object).reshape(-1, 1)

        data[name] = {
            "encoder_inputs": en_inputs,
            "decoder_inputs": de_inputs,
            "decoder_labels": de_labels,
        }
    return data

data_dict = prepare_data(train_df, valid_df, test_df)
{k: {kk: v.shape for kk, v in data_dict[k].items()} for k in data_dict}


In [None]:
def shuffle_data(en_inputs, de_inputs, de_labels, shuffle_indices=None):
    if shuffle_indices is None:
        shuffle_indices = np.random.permutation(np.arange(en_inputs.shape[0]))
    else:
        shuffle_indices = np.random.permutation(shuffle_indices)

    return (
        en_inputs[shuffle_indices],
        de_inputs[shuffle_indices],
        de_labels[shuffle_indices],
        shuffle_indices,
    )


## 9) BLEU-style metric for translation quality

Accuracy checks whether each predicted token equals the target token at the same position.  
That is useful, but it can be misleading: a translation that is correct but shifted by one word scores close to zero, because quality depends on **phrases**, not just isolated words.

BLEU addresses this by measuring overlap of n-grams between candidate and reference translations (plus a brevity penalty).

Here, I compute a lightweight BLEU-style score:

- convert predictions to token sequences
- remove padding and cut off at `eos`
- compute BLEU-4 with modified n-gram precision and brevity penalty
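
The "modified" part of the precision is the clipping step: a candidate word is credited at most as many times as it occurs in the reference. A minimal unigram-only sketch, using the well-known degenerate candidate that repeats a single word (`clipped_unigram_precision` is a hypothetical helper):

```python
from collections import Counter

def clipped_unigram_precision(reference, candidate):
    """Modified (clipped) unigram precision for one reference."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    clipped = sum(min(c, ref_counts[tok]) for tok, c in cand_counts.items())
    return clipped / max(len(candidate), 1)

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
print(clipped_unigram_precision(reference, candidate))  # 2 / 7
```

Without clipping, every word of the candidate appears in the reference and the precision would be a misleading 7/7.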


In [None]:
def _ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def _modified_precision(reference, candidate, n):
    ref_ngrams = Counter(_ngrams(reference, n))
    cand_ngrams = Counter(_ngrams(candidate, n))
    if len(cand_ngrams) == 0:
        return 0.0

    clipped = 0
    total = 0
    for ng, count in cand_ngrams.items():
        clipped += min(count, ref_ngrams.get(ng, 0))
        total += count
    return clipped / total if total > 0 else 0.0

def bleu_score_single(reference_tokens, candidate_tokens, max_n=4, smooth=1e-9):
    # Brevity penalty
    ref_len = len(reference_tokens)
    cand_len = len(candidate_tokens)
    if cand_len == 0:
        return 0.0

    # Brevity penalty is exactly 1 when the candidate is at least as long as the reference.
    bp = 1.0 if cand_len >= ref_len else np.exp(1.0 - ref_len / cand_len)

    precisions = []
    for n in range(1, max_n+1):
        p = _modified_precision(reference_tokens, candidate_tokens, n)
        precisions.append(max(p, smooth))

    # geometric mean
    log_p = sum((1.0/max_n) * np.log(p) for p in precisions)
    return float(bp * np.exp(log_p))

def clean_text(token_tensor, start_token="sos", end_token="eos"):
    """Convert a (batch, time) token tensor to a list of token lists, cut at eos, drop padding and sos."""
    # token_tensor is usually dtype=string/bytes
    tokens = token_tensor.numpy()
    out = []
    for row in tokens:
        row_tokens = []
        for tok in row:
            if isinstance(tok, bytes):
                tok = tok.decode("utf-8")
            tok = tok.strip()
            if tok == "" or tok == start_token:
                continue
            if tok == end_token:
                break
            row_tokens.append(tok)
        out.append(row_tokens)
    return out

class BLEUMetric:
    def __init__(self, vocabulary):
        self.vocab = vocabulary
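        # The vocabulary already contains "[UNK]", which can collide with the
        # default `oov_token` when `invert=True`; use an unused sentinel instead.
        self.id_to_token = tf.keras.layers.StringLookup(
            vocabulary=self.vocab, invert=True, num_oov_indices=0,
            oov_token="[UNKUNK]",
        )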

    def calculate_bleu_from_predictions(self, y_true_ids, y_pred_probs):
        # Argmax IDs
        pred_ids = tf.argmax(y_pred_probs, axis=-1, output_type=tf.int32)

        # IDs -> tokens
        pred_tokens = self.id_to_token(pred_ids)
        true_tokens = self.id_to_token(y_true_ids)

        pred_clean = clean_text(pred_tokens)
        true_clean = clean_text(true_tokens)

        scores = []
        for ref, cand in zip(true_clean, pred_clean):
            scores.append(bleu_score_single(ref, cand))
        return float(np.mean(scores)) if scores else 0.0

bleu_metric = BLEUMetric(de_vocabulary)


## 10) Training and evaluation loop

The book uses a custom loop to make evaluation explicit and to include BLEU in the reporting.

I follow the same idea:

- `evaluate_model(...)`: loops over batches and returns mean loss / masked accuracy / BLEU
- `train_model(...)`: trains epoch-by-epoch, shuffles the data each epoch, and reports train + validation metrics
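
Both loops count batches with integer division, so a final partial batch is skipped. The sketch below works through the arithmetic, assuming the 50,000-sentence sample and the 80/10/10 split from Section 2 (the real counts are smaller if filtering leaves fewer rows). `batch_plan` is a hypothetical helper.

```python
def batch_plan(n_examples, batch_size):
    """Return (full_batches, dropped_examples) under floor division."""
    full_batches = n_examples // batch_size
    return full_batches, n_examples - full_batches * batch_size

print(batch_plan(40_000, 128))  # training split
print(batch_plan(5_000, 128))   # validation split
```

Dropping a few dozen examples per epoch is usually harmless for training, but it does mean the reported validation metrics cover slightly fewer sentences than the split contains.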


In [None]:
def evaluate_model(model, vectorizer, data_split, batch_size=128, bleu_metric=None):
    en_inputs_raw = data_split["encoder_inputs"]
    de_inputs_raw = data_split["decoder_inputs"]
    de_labels_raw = data_split["decoder_labels"]

    loss_log, acc_log, bleu_log = [], [], []
    n_batches = en_inputs_raw.shape[0] // batch_size
    if n_batches == 0:
        n_batches = 1

    for i in range(n_batches):
        x = [
            en_inputs_raw[i*batch_size:(i+1)*batch_size],
            de_inputs_raw[i*batch_size:(i+1)*batch_size],
        ]
        y = vectorizer(de_labels_raw[i*batch_size:(i+1)*batch_size])

        # Keras returns [loss, metric...]
        eval_out = model.evaluate(x, y, verbose=0)
        loss_log.append(eval_out[0])
        acc_log.append(eval_out[1])

        if bleu_metric is not None:
            pred_y = model.predict(x, verbose=0)
            bleu = bleu_metric.calculate_bleu_from_predictions(y, pred_y)
            bleu_log.append(bleu)

    mean_loss = float(np.mean(loss_log))
    mean_acc = float(np.mean(acc_log))
    mean_bleu = float(np.mean(bleu_log)) if bleu_log else 0.0
    return mean_loss, mean_acc, mean_bleu

def train_model(model, vectorizer, data_dict, epochs=5, batch_size=128):
    train_split = data_dict["train"]
    valid_split = data_dict["valid"]

    en_inputs_raw = train_split["encoder_inputs"]
    de_inputs_raw = train_split["decoder_inputs"]
    de_labels_raw = train_split["decoder_labels"]

    prev_shuffle = None

    for epoch in range(1, epochs+1):
        # Shuffle at the beginning of each epoch
        en_inputs_raw, de_inputs_raw, de_labels_raw, prev_shuffle = shuffle_data(
            en_inputs_raw, de_inputs_raw, de_labels_raw, prev_shuffle
        )

        n_batches = en_inputs_raw.shape[0] // batch_size
        if n_batches == 0:
            n_batches = 1

        for i in range(n_batches):
            x = [
                en_inputs_raw[i*batch_size:(i+1)*batch_size],
                de_inputs_raw[i*batch_size:(i+1)*batch_size],
            ]
            y = vectorizer(de_labels_raw[i*batch_size:(i+1)*batch_size])

            model.train_on_batch(x, y)

        # End-of-epoch evaluation (train + valid)
        train_loss, train_acc, train_bleu = evaluate_model(
            model, vectorizer, {"encoder_inputs": en_inputs_raw, "decoder_inputs": de_inputs_raw, "decoder_labels": de_labels_raw},
            batch_size=batch_size,
            bleu_metric=bleu_metric
        )
        valid_loss, valid_acc, valid_bleu = evaluate_model(
            model, vectorizer, valid_split,
            batch_size=batch_size,
            bleu_metric=bleu_metric
        )

        print(f"Epoch {epoch}/{epochs}")
        print(f"  (train) loss: {train_loss:.4f} - acc: {train_acc:.4f} - bleu: {train_bleu:.6f}")
        print(f"  (valid) loss: {valid_loss:.4f} - acc: {valid_acc:.4f} - bleu: {valid_bleu:.6f}")

# Default hyperparameters from the chapter
EPOCHS = 5
BATCH_SIZE = 128

# Uncomment to train:
# train_model(final_model, de_vectorizer, data_dict, epochs=EPOCHS, batch_size=BATCH_SIZE)


## 11) From training to inference: a recursive decoder

Teacher forcing trains the decoder using the full target sequence.  
At inference time, the full German sentence is unknown — that is what the model must generate.

So I create an **inference decoder** that:
- takes a single token (string) and the previous GRU state,
- outputs the next-token distribution and the updated state.

Then translation works like:
1. Encode the English sentence → context vector (initial state).
2. Start decoder input with `sos`.
3. Predict next token, feed it back in, repeat until `eos` or max steps.
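
The three steps above amount to a greedy decoding loop. The sketch below isolates that control flow from the model: `greedy_decode` is a hypothetical helper, and the lookup table is a made-up stand-in for the trained decoder.

```python
def greedy_decode(step_fn, init_state, start="sos", end="eos", max_steps=30):
    """Feed each predicted token back in until `end` or `max_steps`."""
    token, state, output = start, init_state, []
    for _ in range(max_steps):
        token, state = step_fn(token, state)
        if token == end:
            break
        output.append(token)
    return output

# Hypothetical stand-in for the trained decoder: a fixed lookup table.
toy_table = {"sos": "ich", "ich": "bin", "bin": "müde", "müde": "eos"}

def toy_step(token, state):
    return toy_table.get(token, "eos"), state + 1  # state just counts steps

print(greedy_decode(toy_step, init_state=0))
```

The `max_steps` cap matters in practice because an undertrained model may never emit `eos`.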


In [None]:
def build_inference_decoder(trained_model, de_vocabulary, de_vocab_size):
    # 1) A vectorizer that maps a *single token* -> a *single ID*
    token_vectorizer = TextVectorization(
        max_tokens=de_vocab_size + 2,
        output_mode="int",
        output_sequence_length=1,
    )
    token_vectorizer.set_vocabulary(de_vocabulary)

    # 2) Pull trained layers to copy weights
    trained_emb = trained_model.get_layer("d_embedding")
    trained_dense1 = trained_model.get_layer("d_dense_1")
    trained_dense_final = trained_model.get_layer("d_dense_final")
    trained_gru = trained_model.get_layer("d_gru")

    # 3) New layers for inference (single step)
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name="d_infer_input")
    d_state_inp = tf.keras.Input(shape=(256,), name="d_infer_state")

    d_vec = token_vectorizer(d_inp)  # (batch, 1)
    emb_layer = tf.keras.layers.Embedding(de_vocab_size + 2, 128, mask_zero=True, name="d_infer_embedding")
    emb_out = emb_layer(d_vec)  # (batch, 1, 128)

    gru_layer = tf.keras.layers.GRU(256, name="d_infer_gru")  # returns (batch, 256)
    gru_out = gru_layer(emb_out, initial_state=d_state_inp)

    dense1 = tf.keras.layers.Dense(512, activation="relu", name="d_infer_dense_1")
    dense1_out = dense1(gru_out)

    dense_final = tf.keras.layers.Dense(de_vocab_size + 2, activation="softmax", name="d_infer_dense_final")
    final_out = dense_final(dense1_out)

    de_infer = tf.keras.Model(inputs=[d_inp, d_state_inp], outputs=[final_out, gru_out], name="decoder_inference")

    # 4) Transfer weights
    emb_layer.set_weights(trained_emb.get_weights())
    dense1.set_weights(trained_dense1.get_weights())
    dense_final.set_weights(trained_dense_final.get_weights())
    gru_layer.set_weights(trained_gru.get_weights())

    return de_infer, token_vectorizer

# Build inference pieces
en_infer = final_model.get_layer("encoder")
de_infer, de_token_vectorizer = build_inference_decoder(final_model, de_vocabulary, de_vocab)

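# "[UNK]" is already in the vocabulary and can collide with the default
# `oov_token` when `invert=True`; use an unused sentinel instead.
id_to_token = tf.keras.layers.StringLookup(
    vocabulary=de_vocabulary, invert=True, num_oov_indices=0,
    oov_token="[UNKUNK]",
)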

def translate(sentence: str, max_steps=30):
    # encoder expects shape (batch, 1)
    enc_in = np.array([[sentence]], dtype=object)
    state = en_infer(enc_in)  # (1, 256)

    token = start_token
    out_tokens = []

    for _ in range(max_steps):
        dec_in = np.array([[token]], dtype=object)
        probs, state = de_infer([dec_in, state], training=False)

        next_id = int(tf.argmax(probs, axis=-1).numpy()[0])
        next_tok = id_to_token([next_id]).numpy()[0].decode("utf-8")

        if next_tok == "" or next_tok == end_token:
            break

        out_tokens.append(next_tok)
        token = next_tok

    return " ".join(out_tokens)

# Example usage (after training):
# print(translate("I love you."))


## 12) Testing on the held-out test set (after training)

After training, I typically test by:
- translating a few random English sentences from the test split,
- comparing the generated German output with the ground-truth German sentence.

This is not a rigorous evaluation, but it is a good sanity check that the inference loop works.


In [None]:
# Uncomment after training:
# for i in np.random.choice(len(test_df), size=5, replace=False):
#     en = test_df.loc[i, "EN"]
#     de_true = test_df.loc[i, "DE"]
#     de_pred = translate(en)
#     print("="*80)
#     print("EN      :", en)
#     print("DE true :", de_true)
#     print("DE pred :", de_pred)


## 13) Chapter wrap-up

What I take away from this chapter:

- **Seq2seq problems** require models that can map sequences to sequences, often with different lengths.
- The **encoder–decoder** structure is a natural fit: the encoder builds a context representation, and the decoder generates the output sequence.
- **Teacher forcing** makes training much easier by feeding the decoder the ground-truth previous token.
- Plain accuracy is helpful but not sufficient; **BLEU-style** scores give a more sequence-aware signal.
- Using `TextVectorization` inside the model makes the workflow closer to an end-to-end translation system: raw strings in, translation out.
- To actually translate, the trained model must be adapted into an **inference version** where the decoder runs recursively and consumes its own previous predictions.
