# Chapter 11: Sequence-to-sequence learning: Part 1

This notebook reproduces the code and summarizes the theoretical concepts from Chapter 11 of *'TensorFlow in Action'* by Thushan Ganegedara.

This chapter introduces **sequence-to-sequence (seq2seq)** models, an architecture for tasks that map an input sequence to an output sequence whose length may differ from the input's (e.g., machine translation).

We will cover:
1.  **Data Preparation**: Loading and processing a parallel English-to-German text corpus.
2.  **The `TextVectorization` Layer**: Using this Keras layer to build an end-to-end model that accepts raw strings.
3.  **Seq2seq Model Architecture**: Building an encoder-decoder model using GRUs (Gated Recurrent Units).
4.  **Training (Teacher Forcing)**: How to train a seq2seq model using the "teacher forcing" technique.
5.  **Inference Model**: Building a separate model for generating new translations recursively.

---

## 11.1 Understanding the machine translation data

We will use an English-to-German parallel corpus from `manythings.org`. The data is a text file in which each line contains an English sentence, a tab, its German translation, and a final tab-separated attribution field that we discard.
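
As a minimal sketch of that line format, the snippet below splits two made-up lines on the tab character. The sample strings and the `parse_line` helper are illustrative assumptions, not taken from the corpus or the book. The pandas loading code later in this section does the same job for the whole file.

```python
def parse_line(line):
    """Split one corpus line into (english, german), ignoring any extra fields."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1]

# Hypothetical sample lines in the "EN<TAB>DE<TAB>attribution" layout
sample_lines = [
    "Go.\tGeh.\tattribution text\n",
    "I like cats.\tIch mag Katzen.\tattribution text\n",
]

pairs = [parse_line(line) for line in sample_lines]
for en, de in pairs:
    print(f"EN: {en!r}  DE: {de!r}")
```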

**Note**: The book requires you to manually download the file `deu-eng.zip` from `http://www.manythings.org/anki/deu-eng.zip` and place it in a `data` folder.

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow.keras.backend as K
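try:
    # Newer TensorFlow 2.x releases expose the layer directly
    from tensorflow.keras.layers import TextVectorization
except ImportError:
    # Older TensorFlow 2.x releases keep it under the experimental namespace
    from tensorflow.keras.layers.experimental.preprocessing import TextVectorization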
import numpy as np
import pandas as pd
import os
import zipfile
import json

# Set a random seed for reproducibility
random_seed = 4321
np.random.seed(random_seed)
tf.random.set_seed(random_seed)

# --- 1. Load and Extract Data ---
data_dir = 'data'
zip_path = os.path.join(data_dir, 'deu-eng.zip')
extracted_path = os.path.join(data_dir, 'deu.txt')

if not os.path.exists(extracted_path):
    if not os.path.exists(zip_path):
        # Stop here: the read_csv call below cannot succeed without the data file
        raise FileNotFoundError(
            "Please download 'deu-eng.zip' from http://www.manythings.org/anki/ "
            "and place it in the 'data' folder."
        )
    else:
        print("Extracting data...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(data_dir)
        print("Extraction complete.")
else:
    print("Data already extracted.")

# --- 2. Read data into pandas ---
df = pd.read_csv(extracted_path, delimiter='\t', header=None)
df.columns = ["EN", "DE", "Attribution"]
df = df[["EN", "DE"]]

# Clean up problematic unicode characters from the book's example
clean_inds = [i for i in range(len(df)) if b"\xc2" not in df.iloc[i]["DE"].encode("utf-8")]
df = df.iloc[clean_inds]

# --- 3. Sample and Preprocess Data ---
n_samples = 50000
df = df.sample(n=n_samples, random_state=random_seed)

# Add 'sos' (start of sentence) and 'eos' (end of sentence) tokens
# These are crucial for the decoder during training and inference.
start_token = 'sos'
end_token = 'eos'
df["DE"] = start_token + ' ' + df["DE"] + ' ' + end_token

print(f"Loaded and processed {len(df)} samples.")
print("\nSample data:")
print(df.head())

In [None]:
# --- 4. Create train/validation/test splits ---
n_test = int(n_samples / 10)
n_valid = int(n_samples / 10)

test_df = df.sample(n=n_test, random_state=random_seed)
valid_df = df.loc[~df.index.isin(test_df.index)].sample(n=n_valid, random_state=random_seed)
train_df = df.loc[~(df.index.isin(test_df.index) | df.index.isin(valid_df.index))]

print(f"\nTraining samples: {len(train_df)}")
print(f"Validation samples: {len(valid_df)}")
print(f"Test samples: {len(test_df)}")

# --- 5. Analyze Vocabulary and Sequence Length ---
# (Using helper function from Listing 11.1)
from collections import Counter

def get_vocabulary_size_greater_than(words, n, verbose=True):
    counter = Counter(words)
    freq_df = pd.Series(list(counter.values()), index=list(counter.keys())).sort_values(ascending=False)
    n_vocab = (freq_df >= n).sum()
    if verbose: print(f"Vocabulary size (>= {n} freq): {n_vocab}")
    return n_vocab

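# explode() flattens the per-sentence token lists without the quadratic cost of summing lists
en_words = train_df["EN"].str.split().explode().tolist()
de_words = train_df["DE"].str.split().explode().tolist()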

en_vocab = get_vocabulary_size_greater_than(en_words, n=10)
de_vocab = get_vocabulary_size_greater_than(de_words, n=10)

# Get 99th percentile for sequence lengths
en_seq_length = int(train_df["EN"].str.split().str.len().quantile(0.99)) + 5
de_seq_length = int(train_df["DE"].str.split().str.len().quantile(0.99)) + 5

print(f"EN max sequence length (99th percentile + 5): {en_seq_length}")
print(f"DE max sequence length (99th percentile + 5): {de_seq_length}")

---

## 11.2 Writing an English-German seq2seq machine translator

A seq2seq model consists of two main parts:
1.  **Encoder**: An RNN (we'll use a GRU) that reads the input English sentence one token at a time and compresses its meaning into a single vector, known as the **context vector** or "thought vector". This is the final hidden state of the encoder.
2.  **Decoder**: Another RNN (also a GRU) that takes the encoder's context vector as its *initial hidden state*. It then generates the output German sentence one token at a time.

### 11.2.1 The `TextVectorization` Layer

Instead of converting our text to integers *before* feeding it to the model, we can build the preprocessing *into* the model using the `TextVectorization` layer. This layer:
1.  Is `adapt`ed (fitted) on our training corpus to build a vocabulary.
2.  Accepts raw strings as input when the model runs.
3.  Tokenizes the strings, converts the tokens to integer IDs, and pads or truncates each sequence to a fixed length, all inside the model graph.
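
To make those three steps concrete, here is a simplified pure-Python imitation of what the layer does. It is a sketch under stated assumptions, not the Keras implementation: the helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, only ASCII punctuation is stripped, and the real layer may order equally frequent tokens differently.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and strip ASCII punctuation (approximates the layer's default)
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(corpus, max_tokens):
    # Step 1 ("adapt"): count tokens and keep the most frequent ones.
    # IDs 0 and 1 are reserved for padding and out-of-vocabulary tokens.
    counts = Counter(tok for s in corpus for tok in standardize(s).split())
    return ["", "[UNK]"] + [w for w, _ in counts.most_common(max_tokens - 2)]

def vectorize(text, vocab, seq_len):
    # Steps 2 and 3: raw string -> token IDs -> truncate or pad with zeros
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

corpus = ["I like cats.", "I like dogs.", "Cats like fish."]
vocab = build_vocab(corpus, max_tokens=5)
print(vocab)
print(vectorize("I like birds", vocab, seq_len=6))
```

The word "birds" never appears in the toy corpus, so it maps to ID 1, and the trailing zeros are padding. This is the same reason the real vectorizers in this notebook are created with `n_vocab + 2` tokens.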

In [None]:
# Based on Listing 11.3
def get_vectorizer(corpus, n_vocab, max_length=None, return_vocabulary=True, name=None):
    """Creates a TextVectorization layer/model."""
    
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name=f'{name}_input')
    
    # We add 2 to the vocab size for the <PAD> (ID 0) and [UNK] (ID 1) tokens
    vectorize_layer = TextVectorization(
        max_tokens=n_vocab + 2, 
        output_mode='int',
        output_sequence_length=max_length,
        name=name
    )
    
    # Build the vocabulary
    vectorize_layer.adapt(corpus)
    vectorized_out = vectorize_layer(inp)
    
    model = tf.keras.models.Model(inputs=inp, outputs=vectorized_out)
    
    if return_vocabulary:
        return model, vectorize_layer.get_vocabulary()
    return model

# Create the vectorizers for English and German
# Note: The decoder's max_length is de_seq_length - 1
# This is because we will feed it 'sos ... word_n' (length N) to predict 'word_1 ... eos' (length N)
en_vectorizer, en_vocabulary = get_vectorizer(
    corpus=np.array(train_df["EN"].tolist()), 
    n_vocab=en_vocab, 
    max_length=en_seq_length, 
    name='en_vectorizer'
)
de_vectorizer, de_vocabulary = get_vectorizer(
    corpus=np.array(train_df["DE"].tolist()), 
    n_vocab=de_vocab,
    max_length=de_seq_length - 1, 
    name='de_vectorizer'
)

print(f"English Vocabulary size: {len(en_vocabulary)}")
print(f"German Vocabulary size: {len(de_vocabulary)}")

# Test the English vectorizer
print("\nTest EN Vectorizer:")
print(en_vectorizer(np.array([["I like machine learning"]])))

### 11.2.3 & 11.2.4 Defining the Encoder and Decoder

Now we build the full seq2seq model using the Keras Functional API.

**Encoder (Listing 11.4):**
1.  Input (Raw English strings)
2.  `en_vectorizer` (Text -> Integer IDs)
3.  `Embedding` Layer (IDs -> Dense Vectors)
4.  `Bidirectional(GRU)`: Reads the sequence forwards and backwards. The final forward and backward hidden states (128 units each) are concatenated into a 256-dimensional context vector.

**Decoder (Listing 11.5):**
1.  Input (Raw German strings, e.g., "sos Ich möchte ein...")
2.  `de_vectorizer` (Text -> Integer IDs)
3.  `Embedding` Layer (IDs -> Dense Vectors)
4.  `GRU`: This GRU's **initial_state** is set to the **encoder's context vector**.
5.  `Dense` Layer (with Softmax): Predicts the next word in the German vocabulary.

In [None]:
K.clear_session()

# --- Define Encoder --- 
def get_encoder(n_vocab, vectorizer):
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input')
    vectorized_out = vectorizer(inp)
    emb_layer = layers.Embedding(
        n_vocab + 2, 128, mask_zero=True, name='e_embedding'
    )
    emb_out = emb_layer(vectorized_out)
    gru_layer = layers.Bidirectional(
        layers.GRU(128, name='e_gru'), name='e_bidirectional_gru'
    )
    gru_out = gru_layer(emb_out)
    encoder = tf.keras.models.Model(inputs=inp, outputs=gru_out, name='encoder')
    return encoder

# --- Define Final Seq2Seq Model ---
def get_final_seq2seq_model(n_vocab, encoder, vectorizer):
    e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input_final')
    d_init_state = encoder(e_inp)
    
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
    d_vectorized_out = vectorizer(d_inp)
    
    d_emb_layer = layers.Embedding(
        n_vocab + 2, 128, mask_zero=True, name='d_embedding'
    )
    d_emb_out = d_emb_layer(d_vectorized_out)
    
    d_gru_layer = layers.GRU(
        256, return_sequences=True, name='d_gru' # 256 units = 128 (fwd) + 128 (bwd) from encoder
    )
    # The encoder's state is fed as the initial state to the decoder's GRU
    d_gru_out = d_gru_layer(d_emb_out, initial_state=d_init_state)
    
    d_dense_layer_1 = layers.Dense(512, activation='relu', name='d_dense_1')
    d_dense1_out = d_dense_layer_1(d_gru_out)
    
    d_final_layer = layers.Dense(n_vocab + 2, activation='softmax', name='d_dense_final')
    d_final_out = d_final_layer(d_dense1_out)
    
    seq2seq = tf.keras.models.Model(
        inputs=[e_inp, d_inp], outputs=d_final_out, name='final_seq2seq'
    )
    return seq2seq

# Get the models
encoder = get_encoder(n_vocab=en_vocab, vectorizer=en_vectorizer)
final_model = get_final_seq2seq_model(n_vocab=de_vocab, encoder=encoder, vectorizer=de_vectorizer)

# Compile the model
final_model.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)

final_model.summary()

---

## 11.3 Training and evaluating the model

To train this model, we use **teacher forcing**. 

This means for a translation pair `("I like cats", "sos Ich mag Katzen eos")`:
* `x` (inputs) = `("I like cats", "sos Ich mag Katzen")`
* `y` (target) = `("Ich", "mag", "Katzen", "eos")`

The decoder receives the *true* previous word (e.g., "mag") as input to help it predict the next word (e.g., "Katzen"). This stabilizes and speeds up training.
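
The shift between decoder inputs and labels can be sketched in a few lines of plain Python. The `teacher_forcing_pair` helper is a hypothetical name used only for illustration; the notebook itself performs the same shift with pandas string operations.

```python
def teacher_forcing_pair(target_sentence):
    """Return (decoder_inputs, decoder_labels) token lists for one target sentence."""
    tokens = target_sentence.split()
    # Inputs drop the final token ('eos'); labels drop the first token ('sos')
    return tokens[:-1], tokens[1:]

dec_in, dec_out = teacher_forcing_pair("sos Ich mag Katzen eos")

# At every step the decoder sees the true previous word and must predict the next one
for step, (seen, expected) in enumerate(zip(dec_in, dec_out), start=1):
    print(f"step {step}: input={seen!r} -> target={expected!r}")
```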

The book defines a custom training loop so that it can also report the **BLEU score**, a standard machine translation metric that measures the n-gram overlap between predicted and reference translations. To keep things simple, this notebook trains with `model.fit()` and tracks only loss and token-level accuracy.
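
To make the idea of n-gram overlap concrete, the sketch below computes a clipped n-gram precision for one candidate against one reference. This is only one ingredient of BLEU, not the full metric: BLEU combines these precisions over several n-gram orders and applies a brevity penalty. The function names and example sentences are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total

reference = "ich mag kleine Katzen".split()
candidate = "ich mag Katzen".split()

p1 = clipped_precision(candidate, reference, n=1)
p2 = clipped_precision(candidate, reference, n=2)
print(f"unigram precision: {p1:.2f}, bigram precision: {p2:.2f}")
```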

In [None]:
# Based on Listing 11.6 - Prepare data for teacher forcing
def prepare_data(df):
    en_inputs = np.array(df["EN"].tolist())
    # Decoder inputs = 'sos ... word_n'
    de_inputs = np.array(df["DE"].str.rsplit(n=1, expand=True).iloc[:, 0].tolist())
    # Decoder labels = 'word_1 ... eos'
    de_labels = np.array(df["DE"].str.split(n=1, expand=True).iloc[:, 1].tolist())
    
    # Vectorize the label strings with a vectorizer adapted on the same training
    # corpus and vocabulary size, so label IDs match the decoder's vocabulary.
    de_label_vectorizer = get_vectorizer(
        corpus=np.array(train_df["DE"].tolist()), 
        n_vocab=de_vocab,
        max_length=de_seq_length - 1, 
        return_vocabulary=False
    )
    
    # Convert string labels to token IDs
    de_labels_vec = de_label_vectorizer(de_labels)
    return en_inputs, de_inputs, de_labels_vec

en_train, de_train_in, de_train_out = prepare_data(train_df)
en_valid, de_valid_in, de_valid_out = prepare_data(valid_df)

print("Training data shapes:")
print(en_train.shape, de_train_in.shape, de_train_out.shape)

# Train the model (simplified .fit() call from the book)
# The book uses a custom training loop (Listing 11.10) to calculate BLEU.
# For simplicity, we will use model.fit() here.

print("\nStarting model training (1 epoch for demo)...")
history = final_model.fit(
    x=[en_train, de_train_in],
    y=de_train_out,
    validation_data=([en_valid, de_valid_in], de_valid_out),
    epochs=1, # Book uses 5
    batch_size=128  # assumed batch size; tune to fit available memory
)

print("Training complete.")

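# Save the model in the SavedModel format (a directory); in TF 2.x the HDF5
# (.h5) format may fail to serialize the TextVectorization layers
os.makedirs('models', exist_ok=True)
model_path = os.path.join('models', 'seq2seq_ch11')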
final_model.save(model_path)
print(f"Model saved to {model_path}")

---

## 11.4 From training to inference: Defining the inference model

We can't use the trained model directly for inference because it relies on **teacher forcing** (i.e., it expects the *true* German sentence as an input to the decoder).

For inference, we must build a new model that generates text **recursively**:
1.  Feed the English sentence to the **Encoder** to get the context vector.
2.  Feed the context vector (as the initial state) and the `sos` token to the **Decoder**.
3.  The Decoder predicts the first word (e.g., "Ich").
4.  Feed the *new* state and the predicted word ("Ich") back into the Decoder.
5.  The Decoder predicts the second word (e.g., "möchte").
6.  Repeat this process until the Decoder predicts the `eos` token.
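
The steps above can be sketched as a greedy decoding loop that is independent of any particular model. The `toy_encode` and `toy_decode_step` functions below are hypothetical stand-ins for the trained encoder and decoder: they replace the neural networks with a lookup table so that the control flow can be run on its own.

```python
def greedy_decode(encode, decode_step, source, start="sos", end="eos", max_len=20):
    """Generate tokens one at a time, feeding each prediction back in."""
    state = encode(source)                          # step 1: context vector
    word, output = start, []                        # step 2: begin with 'sos'
    for _ in range(max_len):                        # cap length in case 'eos' never comes
        word, state = decode_step(word, state)      # steps 3-5: predict and update state
        if word == end:                             # step 6: stop at 'eos'
            break
        output.append(word)
    return output

# Toy stand-ins (assumptions for illustration only)
TOY_TABLE = {"sos": "Ich", "Ich": "mag", "mag": "Katzen", "Katzen": "eos"}

def toy_encode(sentence):
    return 0  # dummy state: counts decoding steps taken so far

def toy_decode_step(prev_word, state):
    return TOY_TABLE[prev_word], state + 1

translation = greedy_decode(toy_encode, toy_decode_step, "I like cats")
print(" ".join(translation))
```

The `max_len` cap matters with a real model, which can fail to emit `eos` and would otherwise loop indefinitely.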

In [None]:
# Based on Listing 11.11 - Create the inference models

def get_inference_model(save_path, de_vocab_size):
    print("Loading trained model and building inference models...")
    K.clear_session()
    model = tf.keras.models.load_model(save_path)
    
    # 1. Get the Encoder
    en_model = model.get_layer("encoder")
    
    # 2. Build the Decoder
    # We need to define new inputs for the decoder's state
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_infer_input')
    d_state_inp = tf.keras.Input(shape=(256,), name='d_infer_state') # 256 = GRU units
    
    # Get the layers from the trained model
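    # get_vectorizer() leaves its wrapper model unnamed, so locate the German
    # vectorizer by type: it is the only nested model other than the encoder
    d_vectorizer = next(
        layer for layer in model.layers
        if isinstance(layer, tf.keras.Model) and layer.name != 'encoder'
    )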
    d_emb_layer = model.get_layer('d_embedding')
    d_gru_layer = model.get_layer("d_gru")
    d_gru_layer.return_sequences = False # Only need the last output
    d_dense_layer_1 = model.get_layer("d_dense_1")
    d_final_layer = model.get_layer("d_dense_final")
    
    # Build the graph
    d_vectorized_out = d_vectorizer(d_inp)
    d_emb_out = d_emb_layer(d_vectorized_out)
    d_gru_out = d_gru_layer(d_emb_out, initial_state=d_state_inp)
    d_dense1_out = d_dense_layer_1(d_gru_out)
    d_final_out = d_final_layer(d_dense1_out)
    
    de_model = tf.keras.models.Model(
        inputs=[d_inp, d_state_inp], 
        outputs=[d_final_out, d_gru_out] # Output prediction AND new state
    )
    return en_model, de_model

en_model, de_model = get_inference_model(model_path, de_vocab_size=de_vocab)
print("Inference models built.")

In [None]:
# Based on Listing 11.12 - Function to generate a translation

def generate_new_translation(en_model, de_model, de_vocabulary, sample_en_text, max_len=20):
    print(f"Input: {sample_en_text}")
    
    # 1. Get the context vector from the encoder
    d_state = en_model.predict(np.array([sample_en_text]))
    
    # 2. Start the decoder with the 'sos' token
    de_word = start_token
    de_translation = []
    
    # 3. Recursive loop
    for _ in range(max_len):
        # Predict the next word and get the new state
        de_pred, d_state = de_model.predict([np.array([de_word]), d_state])
        
        # Get the word ID with the highest probability
        de_word_id = np.argmax(de_pred[0])
        
        # Look up the word from the ID
        de_word = de_vocabulary[de_word_id]
        
        if de_word == end_token:
            break
        
        de_translation.append(de_word)
    
    print(f"Translation: {' '.join(de_translation)}\n")

# --- Test the inference model ---
for i in range(5):
    sample_en_text = test_df["EN"].iloc[i]
    generate_new_translation(en_model, de_model, de_vocabulary, sample_en_text)