# Chapter 11: Sequence-to-Sequence Learning: Part 1

## 1️⃣ Chapter Overview

In previous chapters, sequence modeling tasks such as Sentiment Analysis (Many-to-One) and Language Modeling (One-to-Many / Many-to-Many) were introduced. These tasks assume either a fixed-length output or a direct alignment between input and output tokens. This chapter extends sequence modeling to a more general and challenging setting known as **Sequence-to-Sequence (Seq2Seq) learning**, where both input and output are variable-length sequences with no explicit alignment.

The chapter uses **Machine Translation**, specifically English-to-German translation, as the primary application to motivate Seq2Seq learning. This task highlights several real-world challenges, including differing sentence lengths, word reordering across languages, and the need to preserve semantic meaning rather than surface-level word correspondence.

To address these challenges, the chapter introduces the **Encoder–Decoder architecture**, implemented using recurrent neural networks (RNNs) and constructed via the Keras **Functional API**. Unlike the Sequential API, the Functional API enables flexible multi-input and multi-output network topologies, which are essential for encoder–decoder models.

The chapter further distinguishes between **training-time behavior**, where Teacher Forcing is employed to stabilize optimization, and **inference-time behavior**, where the model must rely on its own predictions through Recursive (autoregressive) decoding.

---


## 2️⃣ Theoretical Explanation

### 2.1 The Sequence-to-Sequence Problem

Traditional feedforward neural networks require fixed-size inputs and outputs. Recurrent neural networks (RNNs) relax the fixed-size input constraint by processing sequences of arbitrary length, but they are commonly applied to tasks where the output is either a single label (classification) or a sequence aligned with the input (sequence tagging).

Machine translation represents a fundamentally harder problem. The input sequence length (English sentence) and output sequence length (German sentence) are rarely equal, and there is often no one-to-one correspondence between input and output tokens. Additionally, syntactic structures differ across languages, requiring the model to learn reordering and long-range dependencies.

Formally, the task is to model the conditional probability distribution:

$$ P(Y | X) = \prod_{t=1}^{T'} P(y_t | y_{<t}, X) $$

where $X = (x_1, x_2, ..., x_T)$ is the source sequence and $Y = (y_1, y_2, ..., y_{T'})$ is the target sequence. This formulation naturally leads to **autoregressive decoding**, in which each output token depends on all previously generated tokens and the encoded representation of the input sequence.
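
The factorization above can be illustrated with a minimal sketch. It assumes we already have the per-step probabilities $P(y_t | y_{<t}, X)$ that some model assigned to the correct tokens; the numbers and the function name `sequence_log_prob` are made up for illustration. Summing log-probabilities is the usual way to evaluate the product without numerical underflow.

```python
import math

def sequence_log_prob(step_probs):
    """Return log P(Y | X) given the per-step probabilities P(y_t | y_<t, X)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities assigned to each correct target token
step_probs = [0.9, 0.8, 0.5]
log_p = sequence_log_prob(step_probs)

# Exponentiating recovers the product of the three step probabilities
print(math.exp(log_p))
```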

---


### 2.2 The Encoder–Decoder Architecture

Seq2Seq models decompose the sequence transformation problem into two distinct components: an **encoder** and a **decoder**. This separation allows the model to first build an abstract representation of the input sequence and then generate the output sequence conditioned on that representation.

**The Encoder** processes the source sentence one token at a time and updates its hidden state recursively:

$$ h_t^{enc} = f(h_{t-1}^{enc}, x_t) $$

where $f(\cdot)$ denotes a recurrent cell such as a GRU or LSTM. The encoder does not produce predictions at each timestep. Instead, it discards intermediate outputs and retains only the final hidden state, commonly referred to as the **context vector**:

$$ c = h_T^{enc} $$

This context vector is intended to compress the semantic information of the entire input sentence into a fixed-dimensional representation.

In this chapter, a **Bidirectional RNN encoder** is used to improve representational capacity. A forward RNN processes the sequence from left to right, while a backward RNN processes it from right to left. The final context vector is constructed by concatenating the terminal states of both directions:

$$ c = [h_T^{forward}; h_1^{backward}] $$

This design allows the encoder to capture information from both past and future contexts within the source sentence.
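
As a rough illustration of the two encoder equations, the sketch below runs a toy recurrent cell over a short sequence and keeps only the final state. Everything here is an assumption made for brevity: the cell is a plain elementwise `tanh` rather than a GRU, the weights `w_h` and `w_x` are arbitrary scalars, the input and hidden sizes are equal, and both directions share the same weights (a real bidirectional layer learns separate weights per direction).

```python
import math

def rnn_cell(h_prev, x_t, w_h=0.5, w_x=1.0):
    """Toy recurrent cell: elementwise tanh(w_h * h_prev + w_x * x_t)."""
    return [math.tanh(w_h * h + w_x * x) for h, x in zip(h_prev, x_t)]

def encode(inputs, hidden_size):
    """Run the cell over the sequence and return only the final hidden state."""
    h = [0.0] * hidden_size
    for x_t in inputs:
        h = rnn_cell(h, x_t)
    return h

def encode_bidirectional(inputs, hidden_size):
    """Concatenate the final states of a forward and a backward pass."""
    h_forward = encode(inputs, hidden_size)
    h_backward = encode(list(reversed(inputs)), hidden_size)
    return h_forward + h_backward

# Three timesteps of 2-dimensional (already embedded) inputs
source = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
context = encode_bidirectional(source, hidden_size=2)
print(len(context))  # 4: two forward units plus two backward units
```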

---


**The Decoder** is an autoregressive recurrent neural network responsible for generating the target sentence one token at a time. It is initialized using the encoder’s context vector $c$, which conditions the entire generation process on the input sentence.

At each decoding timestep $t$, the decoder updates its hidden state according to:

$$ h_t^{dec} = f(h_{t-1}^{dec}, y_{t-1}, c) $$

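In the implementation later in this chapter, $c$ enters only as the decoder's initial state ($h_0^{dec} = c$) rather than as an input at every step. The decoder then computes a probability distribution over the target vocabulary: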

$$ P(y_t | y_{<t}, X) = \text{softmax}(W h_t^{dec} + b) $$

This distribution represents the model’s belief about the most likely next token. The decoding process continues iteratively until a special **end-of-sequence (eos)** token is generated or a predefined maximum length is reached.
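
The output step $\text{softmax}(W h_t^{dec} + b)$ can be sketched in a few lines. The hidden state, weight matrix, and bias below are hypothetical values for a 2-unit decoder and a 3-word vocabulary; they only serve to show how a hidden state becomes a distribution over tokens.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(h_dec, W, b):
    """Compute softmax(W h + b) for one decoder hidden state."""
    logits = [sum(w_ij * h_j for w_ij, h_j in zip(row, h_dec)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)

# Hypothetical 2-unit hidden state and a 3-word vocabulary
h_dec = [0.2, -0.1]
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.1]

probs = output_distribution(h_dec, W, b)
# Index of the most probable token under this distribution
next_token = max(range(len(probs)), key=probs.__getitem__)
print(probs, next_token)
```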

---


### 2.3 Training Strategy: Teacher Forcing

During training, the decoder must learn to predict the next target token $y_t$ given the previous tokens and the encoded input. If the decoder were to use its own predictions as inputs at early stages of training, small errors would rapidly accumulate, leading to unstable learning dynamics.

**Teacher Forcing** addresses this issue by feeding the **ground-truth token** from the previous timestep as input to the decoder, regardless of what the model actually predicted. As a result, the decoder learns under idealized conditions where previous context is always correct.

The training objective minimizes the negative log-likelihood of the target sequence:

$$ \mathcal{L} = - \sum_{t=1}^{T'} \log P(y_t^{true} | y_{<t}^{true}, X) $$

Teacher forcing significantly improves convergence speed and gradient stability. However, it introduces a mismatch between training and inference, known as **exposure bias**, since the model never observes its own prediction errors during training.
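
The loss above can be sketched as a loop in which the model always sees the ground-truth prefix. The function `toy_model` is a hypothetical stand-in for the decoder (it is not the chapter's network), and the token ids are invented: 0 plays the role of `sos` and 3 the role of `eos`.

```python
import math

def teacher_forced_nll(target_ids, predict_next):
    """Sum of -log P(y_t | y_<t) with the true prefix fed at every step."""
    loss = 0.0
    for t in range(1, len(target_ids)):
        prefix = target_ids[:t]        # ground-truth tokens, never model predictions
        probs = predict_next(prefix)   # distribution over the vocabulary
        loss -= math.log(probs[target_ids[t]])
    return loss

def toy_model(prefix):
    """Hypothetical decoder stand-in: favors token (last + 1) mod 4."""
    favored = (prefix[-1] + 1) % 4
    return [0.7 if i == favored else 0.1 for i in range(4)]

# The first id is the given start token, so three tokens are predicted
loss = teacher_forced_nll([0, 1, 2, 3], toy_model)
print(loss)
```

Because the prefix is sliced from `target_ids` rather than built from the model's own outputs, an early mistake cannot propagate to later steps, which is exactly the property that gives rise to exposure bias at inference time.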

---


### 2.4 Inference Strategy: Recursive Decoding

During inference, ground-truth target tokens are unavailable. The decoder must rely entirely on its own predictions to generate the output sequence. The inference procedure proceeds as follows:

1. The encoder processes the source sentence and produces the context vector $c$.
2. The decoder is initialized with $c$ and a start-of-sequence (sos) token.
3. At each timestep, the decoder selects the most probable next token (greedy decoding):

$$ \hat{y}_t = \arg\max_{y_t} P(y_t | \hat{y}_{<t}, X) $$

4. The predicted token is fed back as input to the decoder for the next timestep.
5. The process repeats until an eos token is generated or a predefined maximum length is reached.

Because inference requires step-by-step state propagation, the training architecture cannot be reused directly. Separate encoder and decoder inference models must be defined, sharing learned parameters but operating under different input assumptions.
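
The procedure above can be sketched independently of any framework. Here `step_fn` is an assumed interface (token and state in, probabilities and new state out), and `toy_step` is a hypothetical decoder that counts upward and then emits `eos`; neither is part of the chapter's Keras code.

```python
def greedy_decode(step_fn, initial_state, sos_id, eos_id, max_len):
    """Feed each predicted token back in until eos or max_len is reached."""
    state, token, output = initial_state, sos_id, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == eos_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Hypothetical decoder step: counts upward, then favors eos (id 4)."""
    state = state + 1
    favored = 4 if state > 3 else token + 1
    probs = [0.8 if i == favored else 0.05 for i in range(5)]
    return probs, state

print(greedy_decode(toy_step, initial_state=0, sos_id=0, eos_id=4, max_len=10))
# [1, 2, 3]
```

The `max_len` bound matters in practice: a poorly trained model may never emit `eos`, and without the bound the loop would not terminate.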

---


### 2.5 Limitations of Vanilla Seq2Seq Models

While encoder–decoder architectures successfully enable variable-length sequence transformation, they suffer from a fundamental limitation: all information from the source sequence must be compressed into a single fixed-size context vector.

As input sequences become longer, this bottleneck leads to information loss, degraded long-range dependency modeling, and reduced translation quality. This limitation motivates the introduction of **attention mechanisms**, which allow the decoder to dynamically focus on different parts of the encoder’s hidden states during generation.

The next chapter extends the Seq2Seq framework by incorporating attention, effectively removing the fixed-context bottleneck.


## 3️⃣ Data Preparation

We will use a standard English-German translation dataset from [ManyThings.org](http://www.manythings.org/anki/).

**Requirements:**
1.  Download the dataset.
2.  Clean the text.
3.  Add `sos` (Start of Sentence) and `eos` (End of Sentence) tokens to the target (German) sentences.

In [None]:
import os
import random
import zipfile
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models

# Ensure reproducibility
np.random.seed(42)
tf.random.set_seed(42)
random.seed(42)

# 1. Download Dataset
url = "http://www.manythings.org/anki/deu-eng.zip"
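zip_path = tf.keras.utils.get_file("deu-eng.zip", origin=url, extract=True)
# Depending on the Keras version, get_file returns either the archive path
# or the extraction directory, so handle both cases.
base_dir = zip_path if os.path.isdir(zip_path) else os.path.dirname(zip_path)
text_file = os.path.join(base_dir, "deu.txt")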

# 2. Load and Preprocess
def load_data(path, num_samples=50000):
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')
    
    # Format: English \t German \t Attribution
    # We only care about the first two columns
    data = []
    for line in lines[:min(num_samples, len(lines)-1)]:
        parts = line.split('\t')
        if len(parts) >= 2:
            english = parts[0]
            # Add start and end tokens to the target (German)
            german = f"sos {parts[1]} eos"
            data.append([english, german])
            
    return np.array(data)

raw_data = load_data(text_file)
print(f"Total Samples: {len(raw_data)}")
print(f"Sample: {raw_data[100]}")
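
The "clean the text" requirement can be made concrete with a small per-line helper. This is only a sketch under assumptions: `parse_line` is a hypothetical name, the cleaning is limited to Unicode normalization and whitespace collapsing, and the sample line is invented to match the tab-separated format described in the code comments.

```python
import unicodedata

def parse_line(line):
    """Split one tab-separated line into (english, german_with_markers), or None."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 2:
        return None
    # Normalize Unicode and collapse repeated whitespace in both sentences
    english, german = (
        " ".join(unicodedata.normalize("NFC", p).split()) for p in parts[:2]
    )
    if not english or not german:
        return None
    return english, f"sos {german} eos"

# Invented line in the format: English \t German \t Attribution
sample = "Go.\tGeh.\t(attribution text)"
print(parse_line(sample))  # ('Go.', 'sos Geh. eos')
```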

### 3.1 Splitting the Data
We split the data into Training, Validation, and Test sets.

In [None]:
# Shuffle indices
indices = np.arange(len(raw_data))
np.random.shuffle(indices)
raw_data = raw_data[indices]

# Split 80-10-10
num_val_samples = int(0.1 * len(raw_data))
num_test_samples = int(0.1 * len(raw_data))
num_train_samples = len(raw_data) - num_val_samples - num_test_samples

train_pairs = raw_data[:num_train_samples]
val_pairs = raw_data[num_train_samples : num_train_samples + num_val_samples]
test_pairs = raw_data[num_train_samples + num_val_samples:]

print(f"Training pairs: {len(train_pairs)}")
print(f"Validation pairs: {len(val_pairs)}")
print(f"Test pairs: {len(test_pairs)}")

### 3.2 Text Vectorization
We need two vectorizers: one for English (Encoder Input) and one for German (Decoder Input/Output).

We will use the `TextVectorization` layer which handles standardization, tokenization, and indexing.

In [None]:
from tensorflow.keras.layers import TextVectorization

# Parameters
VOCAB_SIZE = 10000
SEQUENCE_LENGTH = 20

# --- English Vectorizer ---
en_vectorizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH,
    standardize='lower_and_strip_punctuation'
)

# --- German Vectorizer ---
# Same standardization as English. The plain-word markers `sos` and `eos`
# survive punctuation stripping; bracketed markers such as [start] would not.
de_vectorizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH + 1, # +1 for offset
    standardize='lower_and_strip_punctuation'
)

# Adapt to the text
train_en_texts = train_pairs[:, 0]
train_de_texts = train_pairs[:, 1]

en_vectorizer.adapt(train_en_texts)
de_vectorizer.adapt(train_de_texts)

print("English Vocabulary Sample:", en_vectorizer.get_vocabulary()[:10])
print("German Vocabulary Sample:", de_vectorizer.get_vocabulary()[:10])
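
To make the three steps (standardization, tokenization, indexing) concrete, here is a simplified pure-Python stand-in. It is not the Keras implementation: it only strips ASCII punctuation and splits on whitespace, and the helper names are hypothetical. It does follow the same index convention as integer output mode, where 0 is padding and 1 is the unknown token.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is the unknown token, then words by frequency."""
    counts = Counter(word for t in texts for word in standardize(t).split())
    words = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + words

def vectorize(text, vocab, length):
    """Map words to ids, truncate to `length`, and right-pad with 0."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, 1) for w in standardize(text).split()][:length]
    return ids + [0] * (length - len(ids))

vocab = build_vocab(["sos Ich mag Katzen eos", "sos Ich mag Hunde eos"], max_tokens=10)
# "Fische" was never seen, so it maps to the unknown id 1
result = vectorize("sos Ich mag Fische eos", vocab, length=8)
print(result)
```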

### 3.3 Data Pipeline for Teacher Forcing

For training, we need to prepare the data in a specific tuple format `(inputs, outputs)`.

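**Inputs:** A dictionary containing:
1.  `encoder_inputs`: The vectorized English sentence.
2.  `decoder_inputs`: The vectorized German sentence without its final position, so it begins with `sos`.

**Outputs:**
1.  The vectorized German sentence shifted left by one position, so it begins with the first real word and contains `eos`.

Because every sequence is padded to a fixed length, the dropped final position is padding for all but the longest sentences; the example below ignores padding.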

Example:
* Source: "I like cats"
* Decoder Input: "sos Ich mag Katzen"
* Decoder Target: "Ich mag Katzen eos"
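
The same shift on a padded integer sequence looks as follows. The ids are hypothetical (2 for `sos`, 3 for `eos`, 0 for padding), and `teacher_forcing_pair` is an illustrative helper rather than part of the pipeline.

```python
def teacher_forcing_pair(padded_ids):
    """Return (decoder_input, decoder_target) from one padded target sequence."""
    return padded_ids[:-1], padded_ids[1:]

# Hypothetical ids: 2 = sos, 3 = eos, 0 = padding, others are words
padded = [2, 7, 8, 9, 3, 0, 0]
decoder_input, decoder_target = teacher_forcing_pair(padded)

print(decoder_input)   # [2, 7, 8, 9, 3, 0]
print(decoder_target)  # [7, 8, 9, 3, 0, 0]
```

Note that with padding present, the slice removes a trailing 0 rather than `eos`, so `eos` still appears in the decoder input for shorter sentences.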

In [None]:
def format_dataset(eng, deu):
    eng = en_vectorizer(eng)
    deu = de_vectorizer(deu)
    
    # Inputs to the model
    # 1. Encoder Input (English)
    # 2. Decoder Input (German, all positions except the last)
    decoder_input = deu[:, :-1]
    
    # Targets (German, shifted left by one so the leading sos is dropped)
    decoder_target = deu[:, 1:]
    
    return (
        {"encoder_inputs": eng, "decoder_inputs": decoder_input},
        decoder_target
    )

def make_dataset(pairs, batch_size=64):
    eng_texts, deu_texts = pairs[:, 0], pairs[:, 1]
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, deu_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
    # Cache before shuffling so that batches are reshuffled every epoch
    return dataset.cache().shuffle(2048).prefetch(16)

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

# Check shapes
for inputs, targets in train_ds.take(1):
    print(f"Encoder Input Shape: {inputs['encoder_inputs'].shape}")
    print(f"Decoder Input Shape: {inputs['decoder_inputs'].shape}")
    print(f"Target Shape: {targets.shape}")

## 4️⃣ Building the Seq2Seq Model

We use the **Functional API** to connect the Encoder and Decoder.

### 4.1 The Encoder
The encoder processes the input English sequence. We use a **Bidirectional GRU** to capture context from both directions. The crucial output here is the **state**, not the sequence.

In [None]:
EMBEDDING_DIM = 256
LATENT_DIM = 512

# --- Encoder ---
encoder_inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int64", name="encoder_inputs")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True)(encoder_inputs)

# Bidirectional GRU
# We return the state to initialize the decoder
encoder_gru = layers.Bidirectional(layers.GRU(LATENT_DIM // 2, return_state=True), name="encoder_gru")
encoder_out, state_h_fwd, state_h_bwd = encoder_gru(x)

# Concatenate forward and backward states to pass to decoder
encoder_state = layers.Concatenate()([state_h_fwd, state_h_bwd])

### 4.2 The Decoder (Training)
The decoder takes the German sequence (offset by one) and the Encoder's state. It predicts the next word.

In [None]:
# --- Decoder ---
decoder_inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int64", name="decoder_inputs")
decoder_embedding = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True)
x = decoder_embedding(decoder_inputs)

# Decoder GRU
# We initialize it with the Encoder State
decoder_gru = layers.GRU(LATENT_DIM, return_sequences=True, return_state=True, name="decoder_gru")
decoder_outputs, _ = decoder_gru(x, initial_state=encoder_state)

# Output Layer
decoder_dense = layers.Dense(VOCAB_SIZE, activation="softmax", name="decoder_final")
decoder_outputs = decoder_dense(decoder_outputs)

# --- Full Model ---
seq2seq_model = models.Model([encoder_inputs, decoder_inputs], decoder_outputs, name="Seq2Seq_Training")
seq2seq_model.summary()

### 4.3 Training
We compile and train the model. Note that we use `sparse_categorical_crossentropy` because our targets are integers, not one-hot vectors.

In [None]:
seq2seq_model.compile(
    optimizer="adam", 
    loss="sparse_categorical_crossentropy", 
    metrics=["accuracy"]
)

history = seq2seq_model.fit(
    train_ds, 
    epochs=10, 
    validation_data=val_ds,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)]
)

## 5️⃣ Inference: The Recursive Decoder

The training model cannot be used directly for inference because it expects the answer (`decoder_inputs`) to be provided. For inference, we need to generate the answer word by word.

We need to define a separate inference architecture that shares the weights of the trained model.

In [None]:
# 1. Define the Inference Encoder
# Input: English Sentence -> Output: Context Vector (State)
encoder_model = models.Model(encoder_inputs, encoder_state)

# 2. Define the Inference Decoder
# Inputs: Previous Word + Previous State
decoder_state_input = layers.Input(shape=(LATENT_DIM,), name="input_state")
decoder_word_input = layers.Input(shape=(1,), dtype="int64", name="input_word")

# Reuse layers from training model
x = decoder_embedding(decoder_word_input)
decoder_outputs, decoder_state = decoder_gru(x, initial_state=decoder_state_input)
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = models.Model(
    [decoder_word_input, decoder_state_input], 
    [decoder_outputs, decoder_state]
)

# 3. Decoding Loop
def decode_sentence(input_sentence):
    # 1. Encode the input
    input_seq = en_vectorizer([input_sentence])
    states_value = encoder_model.predict(input_seq, verbose=0)

    # 2. Start with 'sos' token
    # We need the ID for 'sos'
    vocab = de_vectorizer.get_vocabulary()
    sos_id = vocab.index('sos')
    eos_id = vocab.index('eos')
    
    target_seq = np.array([[sos_id]])
    decoded_sentence = []

    # 3. Loop until 'eos' or max length
    for _ in range(SEQUENCE_LENGTH):
        output_tokens, h = decoder_model.predict([target_seq, states_value], verbose=0)

        # Greedy choice: take the most probable token
        sampled_token_index = int(np.argmax(output_tokens[0, 0, :]))
        sampled_word = vocab[sampled_token_index]

        if sampled_token_index == eos_id:
            break
            
        decoded_sentence.append(sampled_word)

        # Update the target sequence and state for next step
        target_seq = np.array([[sampled_token_index]])
        states_value = h

    return " ".join(decoded_sentence)

### 5.1 Testing the Translation
Let's see how our model performs on unseen data.

In [None]:
test_examples = [
    "I love cats",
    "He works every day",
    "She is happy",
    "The weather is nice today"
]

for sent in test_examples:
    translation = decode_sentence(sent)
    print(f"English: {sent}")
    print(f"German:  {translation}")
    print("-" * 30)

## 6️⃣ Chapter Summary

* **Seq2Seq Tasks:** Mapping sequences to sequences is fundamentally different from classification. It requires handling variable-length inputs *and* outputs.
* **Encoder:** Compresses the source sentence into a single **Context Vector** (State). We used a Bidirectional GRU for this.
* **Decoder:** Unpacks the Context Vector into the target sentence.
* **Teacher Forcing:** During training, we feed the *correct* previous word to the decoder. This stabilizes training.
* **Inference:** During testing, we must feed the *predicted* previous word to the decoder (Recursive decoding).
* **Performance:** While this basic Seq2Seq model works, it struggles with long sentences because the entire meaning must be compressed into one fixed-size vector. In the next chapter, we will address this with **Attention**.