**PART 3**

**ADVANCED DEEP NETWORKS FOR COMPLEX PROBLEMS**

---

**CHAPTER 11 - Sequence-to-sequence learning: Part 1**

---

### **11.1 Understanding the machine translation data**

In this chapter, we explore **sequence-to-sequence (seq2seq)** learning, a paradigm used for tasks where an arbitrary-length input sequence is mapped to an arbitrary-length output sequence. Our specific goal is to build an **English-to-German machine translator**.

**Data Acquisition and Cleaning**

We utilize a bilingual parallel corpus provided by *manythings.org*, which contains English sentences paired with their German translations. The data requires rigorous preprocessing to ensure the model can learn effectively:

1.  **Cleaning**: The raw data often contains metadata (e.g., attribution info like "CC-BY 2.0") that is irrelevant to the translation task and must be stripped. We also filter out lines with problematic Unicode characters (such as `\xc2`) that can cause encoding errors in downstream TensorFlow operations.
2.  **Special Tokens**: To give the decoder explicit start and stop signals, we add two special tokens to each German translation:
    * `sos` (Start of Sentence): Marks the beginning of the translation.
    * `eos` (End of Sentence): Marks the end of the translation.
    These tokens serve as functional signals for the decoder during inference, telling it when to start generating text and when to stop.
3.  **Sampling and Splitting**: To keep training time reasonable for this exercise, we sample a subset of 50,000 phrase pairs. These are split into disjoint training (80%), validation (10%), and testing (10%) sets so that we can monitor overfitting and evaluate generalization.
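
The three steps above can be sketched in plain Python. This is a minimal illustration, not the chapter's actual preprocessing code: it assumes each raw line is tab-separated as `English<TAB>German<TAB>attribution`, and the function names `prepare_pairs` and `split_pairs` are hypothetical.

```python
import random

def prepare_pairs(raw_lines, n_samples=50000, seed=42):
    """Clean raw lines, wrap the German side in sos/eos, and sample a subset."""
    pairs = []
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        # Keep only the two sentences; any attribution field is dropped
        en, de = fields[0].strip(), fields[1].strip()
        if not en or not de:
            continue
        if "\xc2" in en or "\xc2" in de:
            continue  # filter the problematic character mentioned in the text
        pairs.append((en, "sos " + de + " eos"))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    rng.shuffle(pairs)
    return pairs[:n_samples]

def split_pairs(pairs, train_frac=0.8, valid_frac=0.1):
    """Split into disjoint train/validation/test lists (the remainder is test)."""
    n_train = int(len(pairs) * train_frac)
    n_valid = int(len(pairs) * valid_frac)
    train = pairs[:n_train]
    valid = pairs[n_train:n_train + n_valid]
    test = pairs[n_train + n_valid:]
    return train, valid, test

# Toy data standing in for the real corpus
raw = ["Sentence {}.\tSatz {}.\tCC-BY 2.0 (attribution)".format(i, i) for i in range(10)]
pairs = prepare_pairs(raw)
train, valid, test = split_pairs(pairs)
print(len(train), len(valid), len(test))
```

Shuffling before slicing keeps the three sets disjoint while avoiding any ordering bias in the source file.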

**Vocabulary Analysis**

Before modeling, we must understand the data distribution. We analyze the **vocabulary size** (number of unique words) and **sequence length** (number of words per sentence). This analysis helps us define the input dimension for our embedding layers and the padding length for our batches. The following code calculates the vocabulary size based on word frequency.

In [None]:
from collections import Counter
import pandas as pd

def get_vocabulary_size_greater_than(words, n, verbose=True):
    """ Get the vocabulary size above a certain threshold """
    counter = Counter(words)
    freq_df = pd.Series(
        list(counter.values()),
        index=list(counter.keys())
    ).sort_values(ascending=False)
    
    if verbose:
        print(freq_df.head(n=10))
        
    n_vocab = (freq_df >= n).sum()
    
    if verbose:
        print("\nVocabulary size (>= {} frequent): {}".format(n, n_vocab))
        
    return n_vocab
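
The function above covers vocabulary size only. For the sequence-length side of the analysis, a small sketch like the following could be used to pick a padding length. The helper name `summarize_sequence_lengths` and the choice of the 95th percentile are assumptions made for illustration.

```python
import statistics

def summarize_sequence_lengths(sentences):
    """Summarize the number of whitespace-separated words per sentence."""
    lengths = sorted(len(s.split()) for s in sentences)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {
        "median": statistics.median(lengths),
        "p95": cuts[94],
        "max": lengths[-1],
    }

stats = summarize_sequence_lengths(["a b c", "a b", "a b c d e", "a"])
print(stats)
```

Choosing a padding length near a high percentile rather than the maximum keeps batches compact, at the cost of truncating a few unusually long sentences.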

### **11.2 Writing an English-German seq2seq machine translator**

The core architecture for this task is the **Encoder-Decoder** model. This architecture decouples the understanding of the input from the generation of the output, allowing the model to handle sequences of different lengths (e.g., a 5-word English sentence translating to a 7-word German sentence).

1.  **Encoder**: Consumes the source language (English) and compresses the information into a latent representation called a **context vector** (or thought vector).
2.  **Decoder**: Takes the context vector and generates the target language (German).

![Figure 11.1 High-level components of the encoder-decoder architecture in the context of machine translation](./11.Chapter-11/Figure11-1.jpg)

#### **The TextVectorization layer**

To build a modern, end-to-end model, we integrate preprocessing directly into the network using the `TextVectorization` layer. This eliminates the need for external tokenization scripts and ensures the model can accept raw strings as input during deployment.

* **Functionality**: It tokenizes strings (splitting them into words) and maps them to integer IDs based on a learned vocabulary lookup.
* **Configuration**:
    * `output_mode='int'`: Ensures the layer outputs integer indices suitable for an Embedding layer.
    * `adapt()`: This method scans the training corpus to build the internal vocabulary mapping from words to integers.
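
To make the two ideas above concrete, here is a pure-Python imitation of what the layer does conceptually. It is not the Keras implementation: `build_vocabulary` and `vectorize` are hypothetical helpers, and the sketch only lowercases text, whereas the real layer's default standardization also strips punctuation.

```python
from collections import Counter

def build_vocabulary(corpus, max_tokens):
    """Mimic adapt(): rank words by frequency, reserving two special slots."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    # Index 0 is reserved for padding ('') and index 1 for unknown words
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    """Mimic output_mode='int': map words to IDs, then truncate or pad with 0."""
    word_to_id = {w: i for i, w in enumerate(vocabulary)}
    ids = [word_to_id.get(w, 1) for w in text.lower().split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["i like cake", "i like tea", "you like cake"]
vocab = build_vocabulary(corpus, max_tokens=5)
ids = vectorize("like pie", vocab, sequence_length=4)
print(vocab)
print(ids)
```

The word "pie" never appeared in the corpus, so it maps to the unknown-word ID 1, and the remaining positions are filled with the padding ID 0.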

#### **Defining the TextVectorization layers for the seq2seq model**

We define a helper function, `get_vectorizer`, to instantiate these layers for both the source (English) and target (German) languages. Note that we set the vocabulary size to `n_vocab + 2` to explicitly account for the padding token (used to make batches uniform) and the `[UNK]` token (used for out-of-vocabulary words).

In [None]:
import tensorflow as tf
import numpy as np

def get_vectorizer(corpus, n_vocab, max_length=None, return_vocabulary=True, name=None):
    """ Return a text vectorization layer or a model """
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='encoder_input')
    vectorize_layer = tf.keras.layers.TextVectorization(
        max_tokens=n_vocab+2,
        output_mode='int',
        output_sequence_length=max_length,
    )
    vectorize_layer.adapt(corpus)
    vectorized_out = vectorize_layer(inp)
    
    if not return_vocabulary:
        return tf.keras.models.Model(
            inputs=inp, outputs=vectorized_out, name=name
        )
    else:
        return tf.keras.models.Model(
            inputs=inp, outputs=vectorized_out, name=name
        ), vectorize_layer.get_vocabulary()

#### **Defining the encoder**

The encoder's primary role is to ingest the source English sequence and compress its semantic meaning into a latent representation known as the **context vector**.

* **Bidirectional Processing**: We use a `Bidirectional` wrapper around a GRU (Gated Recurrent Unit) layer. Unlike a standard RNN that reads strictly left-to-right, a bidirectional RNN processes the sequence in both directions. This is critical for capturing dependencies where the meaning of a word is influenced by future words (e.g., distinguishing the word "bank" in "bank of the river" vs. "bank of America").
* **Context Vector Extraction**: While the GRU processes the entire sequence, we are specifically interested in the **final state** (not the sequence of outputs). This final state represents the aggregated understanding of the entire sentence and is passed to the decoder as its initial state.
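
For reference, one common formulation of the GRU update at time step $t$ is given below. The symbols $W$, $U$, and $b$ are generic weight and bias names introduced here for illustration, and library implementations may differ in detail (for example, in which term the update gate $z_t$ multiplies).

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$

With a bidirectional encoder over an input of length $T$, the forward GRU finishes at $x_T$ and the backward GRU finishes at $x_1$. Under the default concatenation behavior of the `Bidirectional` wrapper, the context vector is

$$
c = [\overrightarrow{h}_T \, ; \, \overleftarrow{h}_1] \in \mathbb{R}^{256}
$$

because each direction has 128 units. This is why the decoder GRU is defined with 256 units: its initial state must have the same size as $c$.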

![Figure 11.2 Specific components in the encoder and decoder modules](./11.Chapter-11/Figure11-2.jpg)

In [None]:
def get_encoder(n_vocab, vectorizer):
    """ Define the encoder of the seq2seq model """
    inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input')
    vectorized_out = vectorizer(inp)
    
    # Embedding layer with mask_zero=True to handle padding
    emb_layer = tf.keras.layers.Embedding(
        n_vocab+2, 128, mask_zero=True, name='e_embedding'
    )
    emb_out = emb_layer(vectorized_out)
    
    # Bidirectional GRU
    gru_layer = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(128, name='e_gru'),
        name='e_bidirectional_gru'
    )
    gru_out = gru_layer(emb_out)
    
    # The encoder returns the final state (context vector)
    encoder = tf.keras.models.Model(
        inputs=inp, outputs=gru_out, name='encoder'
    )
    return encoder

#### **Defining the decoder and the final model**

The decoder is responsible for generating the German translation, token by token. It differs from the encoder in several key ways:

* **Initialization**: The decoder's GRU layer does not start with a zero state. Instead, its `initial_state` is set to the encoder's output (`d_init_state`). This transfers the knowledge from the English sentence to the German generation process.
* **Unidirectional**: The decoder is a standard, unidirectional GRU because it generates text sequentially and cannot "see" the future words it hasn't generated yet.
* **Teacher Forcing**: During training, we employ **Teacher Forcing**, where the decoder is fed the *correct* German word at time $t$ to predict the word at $t+1$. This stabilizes training and helps the model converge faster than if it relied solely on its own predictions.
* **Sequence Output**: The decoder sets `return_sequences=True` because it must output a prediction for every time step in the target sequence, which is then passed through a Softmax layer to predict the specific word from the vocabulary.

![Figure 11.3 The implementation of the final sequence-to-sequence model with the focus on various layers and outputs involved](./11.Chapter-11/Figure11-3.jpg)

In [None]:
def get_final_seq2seq_model(n_vocab, encoder, vectorizer):
    """ Define the final encoder-decoder model """
    # 1. Get Context Vector from Encoder
    e_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='e_input_final')
    d_init_state = encoder(e_inp)
    
    # 2. Define Decoder Input (Teacher Forcing Input)
    d_inp = tf.keras.Input(shape=(1,), dtype=tf.string, name='d_input')
    d_vectorized_out = vectorizer(d_inp)
    
    d_emb_layer = tf.keras.layers.Embedding(
        n_vocab+2, 128, mask_zero=True, name='d_embedding'
    )
    d_emb_out = d_emb_layer(d_vectorized_out)
    
    # 3. Decoder GRU initialized with Encoder State
    d_gru_layer = tf.keras.layers.GRU(256, return_sequences=True, name='d_gru')
    d_gru_out = d_gru_layer(d_emb_out, initial_state=d_init_state)
    
    # 4. Final Prediction Layers
    d_dense_layer_1 = tf.keras.layers.Dense(512, activation='relu', name='d_dense_1')
    d_dense1_out = d_dense_layer_1(d_gru_out)
    
    d_dense_layer_final = tf.keras.layers.Dense(
        n_vocab+2, activation='softmax', name='d_dense_final'
    )
    d_final_out = d_dense_layer_final(d_dense1_out)
    
    seq2seq = tf.keras.models.Model(
        inputs=[e_inp, d_inp], outputs=d_final_out, name='final_seq2seq'
    )
    return seq2seq

#### **Compiling the model**

We compile the model using standard Keras settings for multi-class classification (predicting the next word from the vocabulary).

* **Loss**: `sparse_categorical_crossentropy` (targets are integers, predictions are probabilities).
* **Optimizer**: `adam`.
* **Metrics**: `accuracy`.

In [None]:
# Example compilation code (assuming model instance 'final_model' exists)
# final_model.compile(
#    loss='sparse_categorical_crossentropy',
#    optimizer='adam',
#    metrics=['accuracy']
# )
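
To see what the loss measures, here is a hand-computed version on toy data. It is a sketch of the formula only, not the Keras implementation. The optional `pad_id` argument is an assumption added for illustration: it skips padded positions, whereas how Keras treats them depends on how the mask propagates through the model.

```python
import math

def sparse_categorical_crossentropy(target_ids, predicted_probs, pad_id=None):
    """Average negative log-probability assigned to the correct token IDs."""
    losses = []
    for target, probs in zip(target_ids, predicted_probs):
        if pad_id is not None and target == pad_id:
            continue  # ignore padded time steps (a choice made in this sketch)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Three time steps over a 4-word vocabulary; the last target is padding (ID 0)
targets = [2, 3, 0]
probs = [
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.2, 0.2, 0.5],
    [0.25, 0.25, 0.25, 0.25],
]
loss = sparse_categorical_crossentropy(targets, probs, pad_id=0)
print(round(loss, 4))
```

The targets are plain integer IDs rather than one-hot vectors, which is what "sparse" refers to in the loss name.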

### **11.3 Training and evaluating the model**

**Data Preparation for Teacher Forcing**

Training a machine translation model requires careful alignment of inputs and outputs to support the Teacher Forcing mechanism. We cannot simply feed the entire German sentence as both input and label; we must offset them by one time step.

* **Encoder Input**: The source English text (e.g., "I like cake").
* **Decoder Input**: The target German text *including* the `sos` token but *excluding* the final `eos` token (e.g., "sos Ich mag Kuchen").
* **Decoder Label**: The target German text *excluding* the `sos` token but *including* the `eos` token (e.g., "Ich mag Kuchen eos").

This setup forces the model to predict the next token (`Ich`) given the current token (`sos`).
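
The one-step offset can be sketched on a single sentence as follows. The helper name `make_teacher_forcing_pair` is hypothetical, and the sketch works on whitespace-separated strings rather than on tensors.

```python
def make_teacher_forcing_pair(de_sentence):
    """Split 'sos ... eos' into a decoder input and a label shifted by one step."""
    tokens = de_sentence.split()
    decoder_input = " ".join(tokens[:-1])  # drops the trailing eos
    decoder_label = " ".join(tokens[1:])   # drops the leading sos
    return decoder_input, decoder_label

decoder_input, decoder_label = make_teacher_forcing_pair("sos Ich mag Kuchen eos")

# Each input token is aligned with the token the decoder should predict next
aligned = list(zip(decoder_input.split(), decoder_label.split()))
print(aligned)
```

Both sequences have the same length, so the label at each position is exactly the token that follows the corresponding input token.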

**Evaluating with BLEU**

Standard classification accuracy is a poor metric for machine translation because there are often multiple valid ways to translate a sentence. To evaluate our model effectively, we implement the **BLEU (BiLingual Evaluation Understudy)** score.

* **Mechanism**: BLEU measures the overlap of n-grams (sequences of 1, 2, 3, or more words) between the model's prediction and the reference translation, providing a more nuanced view of quality than simple word-matching.
* **Brevity Penalty**: The metric includes a penalty for translations that are shorter than the reference, preventing the model from "gaming" the score by outputting very short, high-confidence snippets.

Since BLEU is not a built-in Keras metric, we implement a custom `BLEUMetric` class and a custom training loop to compute it during validation.

In [None]:
class BLEUMetric(object):
    def __init__(self, vocabulary, name='bleu', **kwargs):
        self.vocab = vocabulary
        # StringLookup to convert IDs back to words for comparison
        self.id_to_token_layer = tf.keras.layers.StringLookup(
            vocabulary=self.vocab, invert=True, num_oov_indices=0
        )
    
    def calculate_bleu_from_predictions(self, real, pred):
        """ Decodes predictions and calculates BLEU score """
        # Convert probability distribution to token IDs
        pred_argmax = tf.argmax(pred, axis=-1)
        pred_tokens = self.id_to_token_layer(pred_argmax)
        real_tokens = self.id_to_token_layer(real)
        
        # (Detailed logic to clean text, remove padding/EOS/SOS, and call compute_bleu would exist here)
        # This step is crucial for accurate BLEU calculation
        
        return 0.0 # Placeholder for result
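
The class above leaves the scoring step as a placeholder. To illustrate the two ingredients described earlier (clipped n-gram overlap and the brevity penalty), here is a simplified sentence-level sketch in plain Python. It is not the `compute_bleu` helper the class refers to: it scores one sentence against one reference, uses no smoothing, and the names `ngram_counts` and `sentence_bleu` are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, candidate, max_order=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_order + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Counter intersection clips each n-gram count to the reference count
        overlap = sum((cand & ref).values())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty order gives a zero score
        log_precisions.append(math.log(overlap / total))
    geo_mean = math.exp(sum(log_precisions) / max_order)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1.0 - r / c)
    return brevity_penalty * geo_mean

reference = "ich mag kuchen sehr gern".split()
exact = sentence_bleu(reference, reference)
shorter = sentence_bleu(reference, "ich mag kuchen sehr".split())
print(exact, shorter)
```

The shorter candidate matches every n-gram it contains, so its precision is perfect and only the brevity penalty lowers its score. Corpus-level BLEU, which is what is usually reported, aggregates n-gram counts across all sentences before computing the precisions.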

### **11.4 From training to inference: Defining the inference model**

**The Inference Challenge**

The training model relies on Teacher Forcing, meaning it requires the target German sequence as input. However, in a real-world inference scenario, we do not have the German translation; generating it is the goal.

**The Recursive Decoder Solution**

To solve this, we construct a separate **Recursive Inference Model** that reuses the trained weights but alters the data flow to a recursive loop:

1.  **Encoder Step**: Pass the English text through the encoder to generate the **context vector**. This vector is calculated once and serves as the seed for the decoder.
2.  **Initialization**: We feed the `sos` (Start of Sentence) token and the context vector into the decoder to predict the *first* word.
3.  **Recursive Loop**: For every subsequent step, we take the *predicted* word from the previous step (along with the decoder's updated internal state) and feed them back as inputs to predict the *next* word.
4.  **Termination**: This loop continues until the decoder generates the `eos` (End of Sentence) token, signaling that the translation is complete.

![Figure 11.4 Using the sequence-to-sequence model for inference (i.e., generating translations from English inputs)](./11.Chapter-11/Figure11-4.jpg)

The code below demonstrates this recursive generation loop:

In [None]:
def generate_new_translation(en_model, de_model, de_vocabulary, sample_en_text):
    """ Generate a new translation using recursive decoding """
    start_token = 'sos'
    end_token = 'eos'
    
    # Step 1: Get the context vector from the encoder
    d_state = en_model.predict(np.array([sample_en_text]))
    
    de_word = start_token
    de_translation = []
    
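    # Step 2 & 3: Recursive prediction loop
    # (capped at 50 tokens as a safety guard in case 'eos' is never predicted)
    while de_word != end_token and len(de_translation) < 50: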
        # Predict the next word and the NEW state given the previous word and state
        # Note: de_model for inference is modified to accept state inputs
        de_pred, d_state = de_model.predict([np.array([de_word]), d_state])
        
        # Convert the prediction index back to a string word
        de_word = de_vocabulary[np.argmax(de_pred[0])]
        de_translation.append(de_word)
        
    print("Translation: {}".format(' '.join(de_translation)))
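
The control flow of this loop can be tested without any trained model. The sketch below separates the loop from the network by taking a `step_fn` callable. That interface, the `toy_step` stand-in, and the `max_len` cap are all assumptions made for illustration. Unlike the function above, the sketch leaves the `eos` token out of the returned translation.

```python
def greedy_decode(step_fn, initial_state, start_token="sos", end_token="eos", max_len=20):
    """Feed each predicted word back in until eos appears or max_len is hit."""
    word, state = start_token, initial_state
    translation = []
    for _ in range(max_len):
        # step_fn plays the role of the decoder: (word, state) -> (next word, new state)
        word, state = step_fn(word, state)
        if word == end_token:
            break
        translation.append(word)
    return translation

def toy_step(prev_word, state):
    """Hypothetical decoder stand-in: replays a fixed sentence, state = position."""
    canned = ["ich", "mag", "kuchen", "eos"]
    return canned[state], state + 1

result = greedy_decode(toy_step, initial_state=0)
print(" ".join(result))
```

Picking the single most likely word at each step is greedy decoding. The cap on the number of steps ensures the loop ends even if the decoder never emits `eos`.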