# Chapter 11: Sequence-to-Sequence Learning: Part 1

## 1️⃣ Chapter Overview

In the previous chapters, we covered Sentiment Analysis (Many-to-One) and Language Modeling (One-to-Many / Many-to-Many). This chapter introduces **Sequence-to-Sequence (Seq2Seq)** learning, a paradigm used when we need to map an input sequence of arbitrary length to an output sequence of arbitrary length.

We will build a **Machine Translation** system to translate English sentences into German. We will move beyond the standard `Sequential` API and strictly use the **Functional API** to handle the complex topology of Encoder-Decoder networks. We will also explore **Teacher Forcing** during training and **Recursive Decoding** during inference.

### Key Machine Learning Concepts:
* **Encoder-Decoder Architecture:** Separating the understanding of input (encoding) from the generation of output (decoding).
* **Context Vector:** The bottleneck vector that compresses the meaning of the source sentence.
* **Teacher Forcing:** A training strategy where the model uses the *ground truth* previous token as input instead of its own prediction.
* **BLEU Score:** The standard metric for evaluating machine-generated text against human references.

### Practical Skills:
* Using the `TextVectorization` layer to standardize, tokenize, and index text inside a `tf.data` pipeline.
* Building sophisticated models with the Keras Functional API.
* Separating Training architecture (Teacher Forcing) from Inference architecture (Recursive).
* Handling bilingual datasets.

---

## 2️⃣ Theoretical Explanation

### 2.1 The Sequence-to-Sequence Problem
Standard feed-forward networks accept fixed-size inputs. Recurrent neural networks (RNNs) can handle variable-length inputs, but they typically produce either one output per input token (tagging) or a single output at the end (classification).

Machine translation poses a harder problem: the input length (English source) and output length (German translation) are rarely the same, and the word order often changes.

### 2.2 The Encoder-Decoder Architecture
To solve this, we use two separate RNNs:

1.  **The Encoder:** 
    * Reads the source sequence (English) one token at a time.
    * Updates its internal state.
    * Discards the outputs but passes the **Final State** (Context Vector) to the decoder.
    * *Analogy:* Reading a book and forming a mental summary.

2.  **The Decoder:**
    * Initialized with the Context Vector from the encoder.
    * Generates the target sequence (German) one token at a time.
    * *Analogy:* Writing a summary in a different language based on your mental summary.
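
The division of labour above can be sketched with a toy recurrent update in plain Python. The update rule, the `EMBED` table, and the function name `encode` are illustrative assumptions, not the GRU built later in this chapter. The sketch only shows that the encoder's final state has the same size whatever the sentence length, and that this state is all the decoder receives.

```python
import math

# Hypothetical 2-dimensional "embeddings" for a tiny English vocabulary.
EMBED = {
    "i": [0.1, 0.3],
    "like": [0.7, -0.2],
    "cats": [-0.4, 0.5],
    "very": [0.2, 0.2],
    "much": [-0.1, -0.6],
}

def encode(tokens, state_size=2):
    """Fold a token sequence into one fixed-size state (the context vector)."""
    state = [0.0] * state_size
    for tok in tokens:
        x = EMBED[tok]
        # Toy recurrent update: mix the previous state with the current token.
        state = [math.tanh(0.5 * h + xi) for h, xi in zip(state, x)]
    return state  # per-step outputs are discarded; only the final state is kept

short_context = encode(["i", "like", "cats"])
long_context = encode(["i", "like", "cats", "very", "much"])

# Both contexts have the same size even though the inputs differ in length.
print(len(short_context), len(long_context))

# The decoder would start from this state instead of from zeros.
decoder_initial_state = short_context
```

Because everything the decoder knows about the source must fit into this one vector, the context vector acts as the bottleneck mentioned in the chapter overview.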

### 2.3 Training Strategy: Teacher Forcing
During training, the decoder needs to learn to predict $y_t$ given the history $y_0, \dots, y_{t-1}$ and the context vector.

If we let the decoder predict $y_1$, it might be wrong. If we feed that *wrong* prediction as input for $y_2$, the model will drift further and further away (error accumulation), making training slow and unstable.

**Teacher Forcing** solves this by feeding the **Ground Truth** token from the previous timestep as input to the current timestep, regardless of what the model actually predicted. 

* **Input to Decoder:** `<sos> Ich bin gut`
* **Target for Decoder:** `Ich bin gut <eos>`
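
The shift between decoder input and decoder target can be sketched in a few lines of plain Python. The helper name `teacher_forcing_pair` is hypothetical; the chapter's actual pipeline performs the same slicing on padded integer tensors.

```python
def teacher_forcing_pair(target_tokens):
    """Split one target sentence into (decoder input, decoder target)."""
    decoder_input = target_tokens[:-1]   # everything except the final token
    decoder_target = target_tokens[1:]   # everything except the first token
    return decoder_input, decoder_target

sentence = ["<sos>", "Ich", "bin", "gut", "<eos>"]
dec_in, dec_target = teacher_forcing_pair(sentence)

# At step t the decoder sees dec_in[t] (ground truth) and must predict dec_target[t].
for step, (seen, expected) in enumerate(zip(dec_in, dec_target)):
    print(f"step {step}: input={seen} -> target={expected}")
```

Each input position is paired with the token that follows it, so the model is always conditioned on the correct history, whatever it predicted at the previous step.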

### 2.4 Inference Strategy: Recursive Decoding
During inference (real-world use), we don't have the ground truth. We must rely on the model's own predictions.

1.  Feed Encoder the source sentence $\rightarrow$ Get Context Vector.
2.  Feed Decoder the Context Vector + `<sos>` token.
3.  Decoder predicts token `A`.
4.  Feed token `A` as input to Decoder to predict token `B`.
5.  Repeat until `<eos>` is predicted.
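
The loop above can be sketched with a lookup table standing in for the trained decoder. `NEXT_TOKEN_SCORES`, `decoder_step`, and `greedy_decode` are hypothetical names used for illustration only; a real decoder computes its scores from the previous token and a recurrent state.

```python
# Hypothetical stand-in for a trained decoder: maps the previous token to
# scores over the next token. A real decoder would also use and update a state.
NEXT_TOKEN_SCORES = {
    "<sos>": {"Ich": 0.9, "bin": 0.1},
    "Ich": {"bin": 0.8, "gut": 0.2},
    "bin": {"gut": 0.7, "<eos>": 0.3},
    "gut": {"<eos>": 0.95, "Ich": 0.05},
}

def decoder_step(prev_token, state):
    """Return (scores for the next token, updated state)."""
    # The toy state is passed through unchanged.
    return NEXT_TOKEN_SCORES[prev_token], state

def greedy_decode(context, max_len=10):
    state = context            # step 2: start from the encoder's context vector
    token = "<sos>"
    output = []
    for _ in range(max_len):   # the cap prevents an endless loop if <eos> never appears
        scores, state = decoder_step(token, state)
        token = max(scores, key=scores.get)   # steps 3-4: pick the best token, feed it back
        if token == "<eos>":                  # step 5: stop at end-of-sentence
            break
        output.append(token)
    return output

print(greedy_decode(context=None))  # ['Ich', 'bin', 'gut']
```

Picking the single most probable token at each step is known as greedy decoding. The length cap matters in practice, because nothing guarantees that a model will ever emit `<eos>`.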

## 3️⃣ Data Preparation

We will use a standard English-German translation dataset from [ManyThings.org](http://www.manythings.org/anki/).

**Requirements:**
1.  Download the dataset.
2.  Clean the text.
3.  Add `sos` (Start of Sentence) and `eos` (End of Sentence) tokens to the target (German) sentences.

In [None]:
import os
import random
import zipfile
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models

# Ensure reproducibility
np.random.seed(42)
tf.random.set_seed(42)
random.seed(42)

# 1. Download Dataset
url = "http://www.manythings.org/anki/deu-eng.zip"
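zip_path = tf.keras.utils.get_file("deu-eng.zip", origin=url, extract=True)
# Depending on the Keras version, get_file may return the archive path
# (files extracted next to it) or the extraction directory itself; check both.
candidates = [
    os.path.join(zip_path, "deu.txt"),
    os.path.join(os.path.dirname(zip_path), "deu.txt"),
]
text_file = next(p for p in candidates if os.path.exists(p))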

# 2. Load and Preprocess
def load_data(path, num_samples=50000):
    with open(path, 'r', encoding='utf-8') as f:
        lines = f.read().split('\n')
    
    # Format: English \t German \t Attribution
    # We only care about the first two columns
    data = []
    for line in lines[:min(num_samples, len(lines)-1)]:
        parts = line.split('\t')
        if len(parts) >= 2:
            english = parts[0]
            # Add start and end tokens to the target (German)
            german = f"sos {parts[1]} eos"
            data.append([english, german])
            
    return np.array(data)

raw_data = load_data(text_file)
print(f"Total Samples: {len(raw_data)}")
print(f"Sample: {raw_data[100]}")

### 3.1 Splitting the Data
We split the data into Training, Validation, and Test sets.

In [None]:
# Shuffle indices
indices = np.arange(len(raw_data))
np.random.shuffle(indices)
raw_data = raw_data[indices]

# Split 80-10-10
num_val_samples = int(0.1 * len(raw_data))
num_test_samples = int(0.1 * len(raw_data))
num_train_samples = len(raw_data) - num_val_samples - num_test_samples

train_pairs = raw_data[:num_train_samples]
val_pairs = raw_data[num_train_samples : num_train_samples + num_val_samples]
test_pairs = raw_data[num_train_samples + num_val_samples:]

print(f"Training pairs: {len(train_pairs)}")
print(f"Validation pairs: {len(val_pairs)}")
print(f"Test pairs: {len(test_pairs)}")

### 3.2 Text Vectorization
We need two vectorizers: one for English (Encoder Input) and one for German (Decoder Input/Output).

We will use the `TextVectorization` layer which handles standardization, tokenization, and indexing.
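
To make those three steps concrete, here is a plain-Python approximation of what the layer does. It is a sketch, not the layer's implementation: the helper names are hypothetical, and the alphabetical tie-breaking between equally frequent tokens is an assumption. The reserved indices (0 for padding, 1 for `[UNK]`) do match the layer's behaviour.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and remove ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Most frequent first; alphabetical tie-breaking is an assumption of this sketch.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

def vectorize(text, vocab, length):
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    return (ids + [0] * length)[:length]   # pad with 0, then truncate

corpus = ["sos Ich mag Katzen. eos", "sos Ich bin gut! eos"]
vocab = build_vocab(corpus, max_tokens=10)
print(vocab)
print(vectorize("sos Ich mag Hunde eos", vocab, length=6))  # "hunde" maps to 1 ([UNK])
print(standardize("<sos> Hallo <eos>"))                     # sos hallo eos
```

The last line shows why the dataset uses the plain markers `sos` and `eos`: punctuation stripping would remove the angle brackets from `<sos>` and `<eos>` anyway.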

In [None]:
from tensorflow.keras.layers import TextVectorization

# Parameters
VOCAB_SIZE = 10000
SEQUENCE_LENGTH = 20

# --- English Vectorizer ---
en_vectorizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH,
    standardize='lower_and_strip_punctuation'
)

# --- German Vectorizer ---
# The default standardization lowercases and strips punctuation. The plain-text
# `sos`/`eos` markers contain no punctuation, so they survive it unchanged.
de_vectorizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH + 1,  # one extra position, consumed by the input/target shift
    standardize='lower_and_strip_punctuation'
)

# Adapt to the text
train_en_texts = train_pairs[:, 0]
train_de_texts = train_pairs[:, 1]

en_vectorizer.adapt(train_en_texts)
de_vectorizer.adapt(train_de_texts)

print("English Vocabulary Sample:", en_vectorizer.get_vocabulary()[:10])
print("German Vocabulary Sample:", de_vectorizer.get_vocabulary()[:10])

### 3.3 Data Pipeline for Teacher Forcing

For training, we need to prepare the data in a specific tuple format `(inputs, outputs)`.

**Inputs:** A dictionary containing:
1.  `encoder_inputs`: The English sentence.
2.  `decoder_inputs`: The German sentence (including `sos`, but WITHOUT `eos`).

**Outputs:**
1.  The German sentence shifted by one (including `eos`, but WITHOUT `sos`).

Example:
* Source: "I like cats"
* Decoder Input: "sos Ich mag Katzen"
* Decoder Target: "Ich mag Katzen eos"
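
With padded integer sequences the same shift looks slightly different, as the sketch below shows. The ids and the short `SEQ_LEN` are hypothetical values chosen for readability; real ids come from the German vectorizer.

```python
PAD, SOS, EOS = 0, 2, 3   # hypothetical ids; real ids come from the vectorizer
SEQ_LEN = 5               # small stand-in for SEQUENCE_LENGTH

# "sos ich mag katzen eos" as ids, padded to SEQ_LEN + 1 positions.
deu = [SOS, 10, 11, 12, EOS, PAD]

decoder_input = deu[:-1]    # drops the last position (padding here)
decoder_target = deu[1:]    # drops the leading sos

print(decoder_input)    # [2, 10, 11, 12, 3]
print(decoder_target)   # [10, 11, 12, 3, 0]
```

Both slices have exactly `SEQ_LEN` positions, which is why the German vectorizer pads to one position more than the English one. Note that for sentences shorter than the limit, the slice removes a padding position rather than `eos`, so `eos` stays in the decoder input.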

In [None]:
def format_dataset(eng, deu):
    eng = en_vectorizer(eng)
    deu = de_vectorizer(deu)
    
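    # Inputs to the model
    # 1. Encoder input (English)
    # 2. Decoder input (German): drop the last position so the length is SEQUENCE_LENGTH.
    #    Because sequences are padded, this removes `eos` only from full-length sentences.
    decoder_input = deu[:, :-1]
    
    # Targets (German): drop the leading `sos`, i.e. the input shifted left by one
    decoder_target = deu[:, 1:]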
    
    return (
        {"encoder_inputs": eng, "decoder_inputs": decoder_input},
        decoder_target
    )

def make_dataset(pairs, batch_size=64):
    eng_texts, deu_texts = pairs[:, 0], pairs[:, 1]
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, deu_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
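    # Cache the vectorized batches first, so each epoch reshuffles them instead of
    # replaying one cached order; prefetch goes last to overlap input with training.
    return dataset.cache().shuffle(2048).prefetch(16)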

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

# Check shapes
for inputs, targets in train_ds.take(1):
    print(f"Encoder Input Shape: {inputs['encoder_inputs'].shape}")
    print(f"Decoder Input Shape: {inputs['decoder_inputs'].shape}")
    print(f"Target Shape: {targets.shape}")

## 4️⃣ Building the Seq2Seq Model

We use the **Functional API** to connect the Encoder and Decoder.

### 4.1 The Encoder
The encoder processes the input English sequence. We use a **Bidirectional GRU** to capture context from both directions. The crucial output here is the **state**, not the sequence.
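
The sketch below illustrates why each direction gets `LATENT_DIM // 2` units. It is a toy in plain Python: `run_rnn` is a hypothetical stand-in for a GRU, and unlike a real `Bidirectional` layer, which learns separate weights for each direction, it reuses one update rule for both passes.

```python
import math

def run_rnn(inputs, units):
    """Toy recurrent pass; returns only the final state (a list of `units` floats)."""
    state = [0.0] * units
    for x in inputs:
        state = [math.tanh(0.5 * h + x * (i + 1) * 0.1) for i, h in enumerate(state)]
    return state

LATENT_DIM = 8                          # small stand-in for the chapter's 512
sequence = [0.3, -0.7, 0.5, 0.9]        # hypothetical scalar "embeddings"

state_fwd = run_rnn(sequence, LATENT_DIM // 2)         # reads left to right
state_bwd = run_rnn(sequence[::-1], LATENT_DIM // 2)   # reads right to left

encoder_state = state_fwd + state_bwd   # list concatenation, in the role of layers.Concatenate
print(len(state_fwd), len(state_bwd), len(encoder_state))  # 4 4 8
```

Halving the units per direction means the concatenated state has exactly `LATENT_DIM` entries, which is the size the decoder GRU expects for its initial state.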

In [None]:
EMBEDDING_DIM = 256
LATENT_DIM = 512

# --- Encoder ---
encoder_inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int64", name="encoder_inputs")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True)(encoder_inputs)

# Bidirectional GRU
# We return the state to initialize the decoder
encoder_gru = layers.Bidirectional(layers.GRU(LATENT_DIM // 2, return_state=True), name="encoder_gru")
encoder_out, state_h_fwd, state_h_bwd = encoder_gru(x)

# Concatenate forward and backward states to pass to decoder
encoder_state = layers.Concatenate()([state_h_fwd, state_h_bwd])

### 4.2 The Decoder (Training)
The decoder takes the German sequence (offset by one) and the Encoder's state. It predicts the next word.

In [None]:
# --- Decoder ---
decoder_inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int64", name="decoder_inputs")
decoder_embedding = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, mask_zero=True)
x = decoder_embedding(decoder_inputs)

# Decoder GRU
# We initialize it with the Encoder State
decoder_gru = layers.GRU(LATENT_DIM, return_sequences=True, return_state=True, name="decoder_gru")
decoder_outputs, _ = decoder_gru(x, initial_state=encoder_state)

# Output Layer
decoder_dense = layers.Dense(VOCAB_SIZE, activation="softmax", name="decoder_final")
decoder_outputs = decoder_dense(decoder_outputs)

# --- Full Model ---
seq2seq_model = models.Model([encoder_inputs, decoder_inputs], decoder_outputs, name="Seq2Seq_Training")
seq2seq_model.summary()

### 4.3 Training
We compile and train the model. Note that we use `sparse_categorical_crossentropy` because our targets are integers, not one-hot vectors.
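
The difference between the two target encodings can be checked by hand. The sketch below uses hypothetical probabilities and helper names, and computes the loss for a single prediction both ways.

```python
import math

def sparse_cce(target_id, probs):
    """Cross-entropy when the target is an integer class id."""
    return -math.log(probs[target_id])

def one_hot_cce(one_hot, probs):
    """Cross-entropy when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = [0.1, 0.7, 0.2]        # hypothetical softmax output over a 3-word vocabulary
target_id = 1                  # integer target, as produced by the vectorizer
one_hot = [0.0, 1.0, 0.0]      # the same target written as a one-hot vector

print(sparse_cce(target_id, probs))
print(one_hot_cce(one_hot, probs))
```

Both calls give the same value, $-\log 0.7$. The sparse form simply avoids materializing a one-hot vector of `VOCAB_SIZE` entries for every target token.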

In [None]:
seq2seq_model.compile(
    optimizer="adam", 
    loss="sparse_categorical_crossentropy", 
    metrics=["accuracy"]
)

history = seq2seq_model.fit(
    train_ds, 
    epochs=10, 
    validation_data=val_ds,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)]
)

## 5️⃣ Inference: The Recursive Decoder

The training model cannot be used directly for inference because it expects the answer (`decoder_inputs`) to be provided. For inference, we need to generate the answer word by word.

We need to define a separate inference architecture that shares the weights of the trained model.

In [None]:
# 1. Define the Inference Encoder
# Input: English Sentence -> Output: Context Vector (State)
encoder_model = models.Model(encoder_inputs, encoder_state)

# 2. Define the Inference Decoder
# Inputs: Previous Word + Previous State
decoder_state_input = layers.Input(shape=(LATENT_DIM,), name="input_state")
decoder_word_input = layers.Input(shape=(1,), dtype="int64", name="input_word")

# Reuse layers from training model
x = decoder_embedding(decoder_word_input)
decoder_outputs, decoder_state = decoder_gru(x, initial_state=decoder_state_input)
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = models.Model(
    [decoder_word_input, decoder_state_input], 
    [decoder_outputs, decoder_state]
)

# 3. Decoding Loop
def decode_sentence(input_sentence):
    # 1. Encode the input
    input_seq = en_vectorizer([input_sentence])
    states_value = encoder_model.predict(input_seq, verbose=0)

    # 2. Start with 'sos' token
    # We need the ID for 'sos'
    vocab = de_vectorizer.get_vocabulary()
    sos_id = vocab.index('sos')
    eos_id = vocab.index('eos')
    
    target_seq = np.array([[sos_id]])
    decoded_sentence = []

    # 3. Loop until 'eos' or max length
    for _ in range(SEQUENCE_LENGTH):
        output_tokens, h = decoder_model.predict([target_seq, states_value], verbose=0)

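        # Greedy choice: take the most probable next token
        sampled_token_index = int(np.argmax(output_tokens[0, 0, :]))

        # Stop as soon as the end-of-sentence token is predicted
        if sampled_token_index == eos_id:
            break
        sampled_word = vocab[sampled_token_index]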
            
        decoded_sentence.append(sampled_word)

        # Update the target sequence and state for next step
        target_seq = np.array([[sampled_token_index]])
        states_value = h

    return " ".join(decoded_sentence)

### 5.1 Testing the Translation
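Let's see how our model translates a few short, hand-written English sentences. These are a quick sanity check rather than a formal evaluation on `test_pairs`.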

In [None]:
test_examples = [
    "I love cats",
    "He works every day",
    "She is happy",
    "The weather is nice today"
]

for sent in test_examples:
    translation = decode_sentence(sent)
    print(f"English: {sent}")
    print(f"German:  {translation}")
    print("-" * 30)
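
Eyeballing translations only goes so far. The BLEU score named in the chapter overview puts a number on them, and the sketch below shows the idea in plain Python. It is a simplified, sentence-level version under stated assumptions: one reference per sentence, n-grams up to length 2 (standard BLEU uses up to 4 and is computed over a whole corpus), uniform weights, and no smoothing. The function names and example sentences are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU with one reference, uniform weights, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates that are not longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("ich liebe katzen", "ich liebe katzen"))            # 1.0
print(bleu("er arbeitet jeden tag", "er arbeitet jeden tag hart"))
print(bleu("ich mag katzen", "ich liebe katzen"))              # 0.0
```

The second call scores below 1 only because of the brevity penalty: every n-gram matches, but the candidate is shorter than the reference. The third call shows how harsh the unsmoothed score is on short sentences, since a single wrong word removes every matching bigram.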

## 6️⃣ Chapter Summary

* **Seq2Seq Tasks:** Mapping sequences to sequences is fundamentally different from classification. It requires handling variable length inputs *and* outputs.
* **Encoder:** Compresses the source sentence into a single **Context Vector** (State). We used a Bidirectional GRU for this.
* **Decoder:** Unpacks the Context Vector into the target sentence.
* **Teacher Forcing:** During training, we feed the *correct* previous word to the decoder. This stabilizes training.
* **Inference:** During testing, we must feed the *predicted* previous word to the decoder (Recursive decoding).
* **Performance:** While this basic Seq2Seq model works, it struggles with very long sentences because the entire meaning must be compressed into one vector. In the next chapter, we will solve this with **Attention**.