# Neural Machine Translation: Building an RNN Encoder-Decoder with Keras 3

**Learning Objectives**
1.  Learn how to build an efficient `tf.data.Dataset` pipeline for a seq2seq task.
2.  Learn how to preprocess text using the `keras.layers.TextVectorization` layer.
3.  Learn how to train an encoder-decoder model in Keras using the Functional API.
4.  Learn how to create separate encoder and decoder models for inference.
5.  Learn how to implement a translation (decoding) function from scratch.
6.  Learn how to use the BLEU score to evaluate a translation model.


***Note: For faster execution, please ensure you are using a GPU runtime. An NVIDIA T4 GPU is recommended for this notebook.***

## Introduction

In this notebook, we'll build a Spanish-to-English translation model using a modern RNN encoder-decoder architecture with Keras 3.

We will start by building an efficient and scalable input pipeline with the `tf.data.Dataset` API. A key part of our workflow will be using the `TextVectorization` layer to handle all text preprocessing—from standardization and tokenization to integer-encoding—directly within our model.

Next, we will use the Keras Functional API to build and train our RNN encoder-decoder model. After training, we will create two specialized models—a dedicated encoder and a decoder—from the layers of our trained model. These specialized models are essential for performing inference, allowing us to implement a function that generates translations word by word.

Finally, we'll evaluate the quality of our model's translations using the industry-standard BLEU score.


In [None]:
import os
import warnings

warnings.filterwarnings("ignore")
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

In [None]:
import pickle
import re
import string
import sys

import evaluate
import keras
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.layers import GRU, Dense, Embedding, Input
from keras.models import Model, load_model
from sklearn.model_selection import train_test_split
from tqdm import tqdm

In [None]:
SEED = 42
MODEL_PATH = "translate_models/baseline"
DATA_URL = (
    "http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip"
)
LOAD_CHECKPOINT = False

In [None]:
keras.utils.set_random_seed(SEED)

## Downloading the Data

We'll use a language dataset provided by http://www.manythings.org/anki/. The dataset contains Spanish-English translation pairs in the format:

```
May I borrow this book?	¿Puedo tomar prestado este libro?
```

The dataset is a curated list of 120K translation pairs from http://tatoeba.org/, a platform for community-contributed translations by native speakers.

In [None]:
path_to_zip = keras.utils.get_file("spa-eng.zip", origin=DATA_URL, extract=True)

path_to_file = os.path.join(
    os.path.dirname(path_to_zip), "spa-eng_extracted/spa-eng/spa.txt"
)
print("Translation data stored at:", path_to_file)

In [None]:
data = pd.read_csv(
    path_to_file, sep="\t", header=None, names=["english", "spanish"]
)

In [None]:
data.sample(3)

## Create the tf.data Pipeline
To begin, we'll construct our training and evaluation datasets using the `tf.data` API, which is the standard for building efficient and scalable input pipelines in TensorFlow. This approach allows us to handle data in a memory-efficient way and seamlessly integrate it with Keras.

Our process will be as follows:

1. Load the data: We create a `tf.data.Dataset` directly from our pandas DataFrame using `tf.data.Dataset.from_tensor_slices`. This creates a dataset where each element is a pair of (Spanish, English) sentences.

2. Split the data: We use the `.take()` and `.skip()` methods to create our training and validation sets. Because the dataset is not shuffled before the split, the first `TRAIN_SIZE` pairs form the training set and the remaining pairs form the validation set, with no overlap.

3. Define a standardization function: We create a `custom_standardization` function to preprocess our raw text. This function replicates the logic of the original `preprocess_sentence` function by lowercasing text, adding spaces around punctuation, and, most importantly, adding `<start>` and `<end>` tokens to each sentence. It is built using `tf.strings` operations, which allows it to be embedded directly into our `TextVectorization` layer and run as part of the `tf.data` pipeline.
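As a rough illustration of the split semantics only, the plain-Python sketch below mimics `.take(n)` and `.skip(n)` with `itertools.islice` on a small list. The helper names `take` and `skip` and the toy data are assumptions made for this example; this is not the `tf.data` implementation.

```python
from itertools import islice


def take(items, n):
    """Return the first n items (analogous to Dataset.take)."""
    return list(islice(items, n))


def skip(items, n):
    """Return everything after the first n items (analogous to Dataset.skip)."""
    return list(islice(items, n, None))


# Toy (source, target) pairs standing in for the sentence pairs
pairs = [(f"es_{i}", f"en_{i}") for i in range(10)]
train_size = 8

train_pairs = take(pairs, train_size)
val_pairs = skip(pairs, train_size)

print(len(train_pairs), len(val_pairs))
```

The two parts are disjoint and together cover the whole input, which is the property the training/validation split relies on.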

In [None]:
# Define constants
BUFFER_SIZE = 32000
BATCH_SIZE = 64
TEST_PROP = 0.2
NUM_EXAMPLES = 30000

# Create a single dataset from your pandas DataFrame
full_dataset = tf.data.Dataset.from_tensor_slices(
    (data["spanish"][:NUM_EXAMPLES], data["english"][:NUM_EXAMPLES])
)

# Create the training and validation splits using take() and skip()
TRAIN_SIZE = int(NUM_EXAMPLES * (1 - TEST_PROP))
train_raw = full_dataset.take(TRAIN_SIZE)
val_raw = full_dataset.skip(TRAIN_SIZE)

# Define the custom standardization function for TextVectorization


@keras.utils.register_keras_serializable()
def custom_standardization(input_string):
    # Lowercase and strip leading/trailing whitespace
    s = tf.strings.lower(input_string)
    s = tf.strings.strip(s)

    # Add spaces around punctuation
    s = tf.strings.regex_replace(s, r"([?.!,¿])", r" \1 ")

    # Collapse runs of spaces into a single space
    s = tf.strings.regex_replace(s, r" +", " ")

    # Filter out unwanted characters, replacing them with a space
    s = tf.strings.regex_replace(s, r"[^a-zA-Z?.!,¿]+", " ")

    # Strip again to remove any leading/trailing spaces created by cleaning
    s = tf.strings.strip(s)

    # Add the <start> and <end> tokens
    s = tf.strings.join(["<start>", s, "<end>"], separator=" ")
    return s
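To check the standardization rules on ordinary strings, it can help to mirror them with Python's `re` module. The function name `standardize_py` is hypothetical, and the sketch assumes the same regular expressions as `custom_standardization`. It also makes one side effect visible: characters outside `a-z`, `A-Z` and the listed punctuation, including accented letters, are replaced by a space.

```python
import re


def standardize_py(sentence: str) -> str:
    """Plain-Python mirror of the standardization rules, for inspection only."""
    s = sentence.lower().strip()
    # Add spaces around punctuation
    s = re.sub(r"([?.!,¿])", r" \1 ", s)
    # Replace every run of other characters (spaces included) with one space
    s = re.sub(r"[^a-zA-Z?.!,¿]+", " ", s)
    s = s.strip()
    return f"<start> {s} <end>"


print(standardize_py("¿Puedo tomar prestado este libro?"))
print(standardize_py("Está llegando el invierno."))
```

In the second sentence the accented `á` is dropped, so `está` becomes `est`. This is worth keeping in mind when reading the Spanish vocabulary later.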

### Create and Adapt the TextVectorization Layers
With our raw text datasets ready, we need to convert the sentences from strings into integer sequences that our model can understand. In Keras 3, the modern and efficient way to do this is with the `keras.layers.TextVectorization` layer. This layer handles tokenization, vocabulary creation, and integer-encoding directly within our model's graph.

We will create two separate TextVectorization layers: one for the source language (Spanish) and one for the target language (English). We pass our custom_standardization function to the standardize argument to ensure the text is preprocessed according to our specific rules (including adding `<start>` and `<end>` tokens) before tokenization.

The most crucial step is calling the `.adapt()` method. The `adapt` method reads through the text from our training dataset and builds a vocabulary of all the unique words. It's during this step that the layer learns the mapping from each word to a unique integer index. We do this for both the source and target vectorization layers on their respective text data.
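The idea behind `adapt` can be sketched in a few lines of plain Python: count the tokens, order them by frequency, and reserve the first two indices for padding (`""`) and unknown words (`"[UNK]"`). The names `build_vocab` and `encode` are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch, not a statement about how Keras orders tokens with equal counts.

```python
from collections import Counter


def build_vocab(sentences):
    """Build a frequency-ordered vocabulary with reserved padding/unknown slots."""
    counts = Counter(tok for s in sentences for tok in s.split())
    # Most frequent first; ties broken alphabetically (assumption of this sketch)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return ["", "[UNK]"] + ordered


def encode(sentence, vocab):
    """Map each token to its index, falling back to 1 ([UNK]) for unseen words."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in sentence.split()]


corpus = ["<start> i am here <end>", "<start> i am <end>", "<start> i <end>"]
vocab = build_vocab(corpus)
print(vocab)
print(encode("<start> i was here <end>", vocab))
```

The word `was` never appears in the toy corpus, so it is mapped to index 1, which is how out-of-vocabulary words are handled at inference time.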

***NOTE: This cell takes 10-15 to run while it's adapting the pre-processing layer to our dataset***

In [None]:
# Create and adapt the TextVectorization layers
source_vectorization = keras.layers.TextVectorization(
    max_tokens=None,
    output_mode="int",
    output_sequence_length=None,
    standardize=custom_standardization,
)

target_vectorization = keras.layers.TextVectorization(
    max_tokens=None,
    output_mode="int",
    output_sequence_length=None,
    standardize=custom_standardization,
)

# Adapt the layers to the training data
source_text = train_raw.map(lambda x, y: x)
target_text = train_raw.map(lambda x, y: y)

source_vectorization.adapt(source_text)
target_vectorization.adapt(target_text)

### Create the Final Datasets
Now that our TextVectorization layers have learned the vocabularies, we can build our final input pipelines.

First, we store the vocabulary sizes for later use in our model's Embedding layers. Then, we define two helper functions:

1. `vectorize_text`: This function takes the raw text sentences and applies the appropriate TextVectorization layer to convert them into integer sequences.

2. `create_dataset`: This function prepares the integerized sequences for our encoder-decoder model. The decoder needs two versions of the target sequence during training: one for input (to predict the next word) and one for output (to compare against the prediction for calculating loss). This function creates:
   - target_in: The target sequence with the last token removed.
   - target_out: The target sequence with the first (<start>) token removed.  

Finally, we chain together several `tf.data` methods to construct our final train_dataset and eval_dataset. We `.shuffle()` the training data, `.batch()` both datasets, apply our preprocessing functions using `.map()`, and call `.prefetch()` for better performance. We also inspect the shape of the first batch to confirm our pipeline is set up correctly.
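The one-token shift between the decoder input and the decoder target is easiest to see on a tiny padded batch. This is a minimal sketch using plain lists; the function name `split_decoder_sequences` and the token IDs (2 for `<start>`, 3 for `<end>`, 0 for padding) are made up for the example.

```python
def split_decoder_sequences(batch):
    """Return (target_in, target_out) for a batch of padded token-ID lists."""
    target_in = [seq[:-1] for seq in batch]   # drop the last position
    target_out = [seq[1:] for seq in batch]   # drop the first (<start>) position
    return target_in, target_out


# Hypothetical IDs: 2 = <start>, 3 = <end>, 0 = padding
batch = [
    [2, 7, 9, 3, 0],
    [2, 5, 3, 0, 0],
]
target_in, target_out = split_decoder_sequences(batch)
print(target_in)
print(target_out)
```

At every position `t`, `target_out[t]` is the token that follows `target_in[t]`, which is exactly what the decoder is trained to predict.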

In [None]:
# Set vocabulary sizes
INPUT_VOCAB_SIZE = source_vectorization.vocabulary_size()
TARGET_VOCAB_SIZE = target_vectorization.vocabulary_size()


def vectorize_text(source, target):
    source = source_vectorization(source)
    target = target_vectorization(target)
    return source, target


def create_dataset(source, target):
    target_in = target[:, :-1]
    target_out = target[:, 1:]
    return (source, target_in), target_out


# Create the final training and validation datasets
train_dataset = (
    train_raw.shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE)
    .map(vectorize_text)
    .map(create_dataset)
    .prefetch(tf.data.AUTOTUNE)
)

eval_dataset = (
    val_raw.batch(BATCH_SIZE)
    .map(vectorize_text)
    .map(create_dataset)
    .prefetch(tf.data.AUTOTUNE)
)

# Record the padded sequence lengths of the first batch.
# Each batch is padded to its own longest sentence, so these values can differ between batches.
for (source, target_in), target_out in train_dataset.take(1):
    max_length_inp = source.shape[1]
    max_length_targ = target_in.shape[1]
    print("Source shape:", source.shape)
    print("Target in shape:", target_in.shape)
    print("Target out shape:", target_out.shape)

### Preview a Batch of Data
To verify that our data pipeline is working correctly, let's take one batch from our train_dataset and inspect the first example. We'll convert the integer tensors back to text to see what the model will receive during training.

In [None]:
# Get the vocabularies from the vectorization layers
source_vocab = source_vectorization.get_vocabulary()
source_index_lookup = {i: word for i, word in enumerate(source_vocab)}

target_vocab = target_vectorization.get_vocabulary()
target_index_lookup = {i: word for i, word in enumerate(target_vocab)}

# Helper function to convert a tensor of token IDs back to a string


def to_text(tensor, lookup_dict):
    # Join the words, filtering out the padding token (ID 0)
    return " ".join([lookup_dict[i] for i in tensor.numpy() if i != 0])


# Take one batch from the training dataset and inspect the first example
for (source_batch, target_in_batch), target_out_batch in train_dataset.take(1):
    # Get the first example from the batch
    source_example = source_batch[0]
    target_in_example = target_in_batch[0]
    target_out_example = target_out_batch[0]

    print("--- Example from a Training Batch ---")
    print("\nSource (Spanish):")
    print(f"  Tensor: {source_example.numpy()}")
    print(f"  Text:   {to_text(source_example, source_index_lookup)}")

    print("\nTarget Input (English - for Decoder):")
    print(f"  Tensor: {target_in_example.numpy()}")
    print(f"  Text:   {to_text(target_in_example, target_index_lookup)}")

    print("\nTarget Output (English - for Loss Calculation):")
    print(f"  Tensor: {target_out_example.numpy()}")
    print(f"  Text:   {to_text(target_out_example, target_index_lookup)}")

## Training the RNN encoder-decoder model

We use an encoder-decoder architecture. Before feeding the words into the RNN, we embed them into a latent space.

In [None]:
EMBEDDING_DIM = 256
HIDDEN_UNITS = 1024

### Build the Encoder

**Exercise: Build the Encoder Model**  
Build the encoder component for a standard encoder-decoder translator. The encoder's job is to process an entire input sentence and compress its meaning into a single fixed-size "thought vector" (or "context vector"), which will eventually be passed to the decoder.

Use the Keras Functional API to construct the encoder, which gives us a clear way to define the flow of data. Our encoder will have two main layers:

1. **Embedding Layer**: This layer takes the integer-encoded vocabulary and learns a dense vector representation (an embedding) for each word. These embeddings can capture semantic relationships between words.

2. **GRU (Gated Recurrent Unit) Layer**: This is a type of Recurrent Neural Network (RNN) that processes the sequence of word embeddings one by one. We configure it with `return_state=True` to get the final hidden state of the GRU after it has processed the entire input sentence. This final hidden state is the "thought vector" that encapsulates the meaning of the input sentence and will be passed to the decoder. Also add the argument `reset_after=False`, which applies the reset gate before the recurrent matrix multiplication. Note that the Keras documentation describes `reset_after=True` (the default) as the cuDNN-compatible convention, so with `reset_after=False` the layer runs on the GPU with the generic, non-cuDNN implementation. See this [documentation](https://keras.io/api/layers/recurrent_layers/gru/) for more info.
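To make the notion of a "final hidden state" concrete, here is a deliberately tiny GRU with a scalar input and a scalar state, written in plain Python. It is a sketch for intuition only: the weight names in `w` and the function names are hypothetical, and the reset gate is applied before the recurrent product, which corresponds to the `reset_after=False` convention.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, w):
    """One GRU step for a scalar input x and a scalar hidden state h."""
    z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])  # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])  # reset gate
    # Reset gate applied before the recurrent product
    h_candidate = math.tanh(w["wh"] * x + w["uh"] * (r * h) + w["bh"])
    return z * h + (1.0 - z) * h_candidate


def run_gru(inputs, w, initial_state=0.0):
    """Process a sequence; return all per-step outputs and the final state."""
    h = initial_state
    outputs = []
    for x in inputs:
        h = gru_step(x, h, w)
        outputs.append(h)
    return outputs, h


# Arbitrary toy weights, all set to 0.5
weights = {k: 0.5 for k in ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}
outputs, final_state = run_gru([1.0, -2.0, 0.5], weights)
print(outputs, final_state)
```

The final state equals the last per-step output, which is what `return_state=True` exposes. Passing a non-zero `initial_state` to `run_gru` is the scalar analogue of starting the decoder from the encoder's state.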

In [None]:
encoder_inputs = Input(shape=(None,), name="encoder_input")

# TODO - Define the Embedding Layer
encoder_inputs_embedded = Embedding()(encoder_inputs)

encoder_rnn = GRU()  # TODO - Define the GRU Layer

encoder_outputs, encoder_state = encoder_rnn(encoder_inputs_embedded)

### Building the Decoder
**Exercise: Define the Decoder**  
Build the decoder so that it takes the "thought vector" from the encoder and generates the translated sentence word by word.

Its architecture mirrors the encoder, containing Embedding and GRU layers. The crucial difference is that we initialize the decoder's GRU layer with the final hidden state of the encoder `(initial_state=encoder_state)`. This is how the decoder receives the context from the source sentence, which it then uses to generate the correct translation.

In [None]:
decoder_inputs = Input(shape=(None,), name="decoder_input")

# TODO - Define the Embedding Layer
decoder_inputs_embedded = Embedding()(decoder_inputs)

decoder_rnn = GRU()  # TODO - Define the Decoder Layer

decoder_outputs, decoder_state = decoder_rnn(
    # TODO - Connect the encoder and decoder using the "initial_state" parameter
)

The last part of the encoder-decoder architecture is a softmax `Dense` layer that turns `decoder_outputs` into next-word probability vectors, the `predictions`:

In [None]:
decoder_dense = Dense(TARGET_VOCAB_SIZE, activation="softmax", name="dense")

predictions = decoder_dense(decoder_outputs)

### Create and Compile the Model

**Exercise: Complete the compilation code for the model**  
Instantiate a `Model` object for training, specifying the encoder and decoder inputs as its inputs and the final predictions as its output.

Compile the model, configuring it for training. Use the `adam` optimizer and the `sparse_categorical_crossentropy` loss function. This loss is ideal for our task because our targets are integers (the word indices) and the model's output is a probability distribution over the vocabulary.
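The pairing of a softmax output with integer targets can be illustrated in plain Python. This is a sketch of the underlying arithmetic under simple assumptions (no masking, a mean over positions); the function names mirror the Keras loss name for readability but are defined locally here.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(target_ids, prob_rows):
    """Mean negative log-probability assigned to the integer targets."""
    losses = [-math.log(probs[t]) for t, probs in zip(target_ids, prob_rows)]
    return sum(losses) / len(losses)


# One position, vocabulary of 4 words, uniform predictions
probs = softmax([0.0, 0.0, 0.0, 0.0])
loss = sparse_categorical_crossentropy([2], [probs])
print(probs, loss)
```

With a uniform prediction over 4 words the loss is `log(4)`, so a freshly initialized model should start near the logarithm of the vocabulary size.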

Finally, call `.summary()` to print a useful overview of the model's architecture, including the layers, output shapes, and the number of trainable parameters.

In [None]:
model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=predictions)

# TODO- Fill in the parameters
model.compile()
# TODO - Print the model summary

Train the model!

***NOTE: Increase `EPOCHS` to 10 or 15 to get decent translation performance. However, this will increase the training time.***

In [None]:
EPOCHS = 5

history = model.fit(
    train_dataset, validation_data=eval_dataset, epochs=EPOCHS, verbose=2
)

### Implementing the Translation (Inference) Function

Now that our model is trained, we need a way to use it for translation. We can't simply call `model.predict()` on a Spanish sentence because the model expects both a source (Spanish) and a target (English) sentence as input. During training, we used the ground-truth target sentence in a technique called "teacher forcing." For inference, we don't have the target sentence—that's what we need to generate!

The solution is to generate the translation word by word in a loop. The process works like this:

1.  Take the input sentence (e.g., "No estamos comiendo.") and pass it through the encoder to get its final hidden state (the "thought vector").
2.  Start the decoder with the `<start>` token.
3.  Feed the `<start>` token and the encoder's state into the decoder to predict the first word of the translation.
4.  Take the predicted word, feed it back into the decoder as input for the next step, and use the new decoder state.
5.  Repeat this process until the decoder predicts the `<end>` token, signaling that the translation is complete.
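The loop described above can be sketched independently of any neural network. In this minimal example the "decoder" is a hypothetical lookup table (`TOY_TRANSITIONS`) and the state is just a step counter; only the control flow is meant to carry over to the real models.

```python
def greedy_decode(step_fn, initial_state, start_token, end_token, max_len=10):
    """Generate tokens one at a time until end_token or max_len is reached."""
    token, state = start_token, initial_state
    result = []
    for _ in range(max_len):
        # The previous prediction and state are fed back in at every step
        token, state = step_fn(token, state)
        if token == end_token:
            break
        result.append(token)
    return result


# Hypothetical stand-in for a trained decoder: next word given the current word
TOY_TRANSITIONS = {
    "<start>": "we",
    "we": "are",
    "are": "not",
    "not": "eating",
    "eating": "<end>",
}


def toy_step(token, state):
    return TOY_TRANSITIONS[token], state + 1


print(greedy_decode(toy_step, 0, "<start>", "<end>"))
```

The `max_len` guard matters in practice: a model that never emits `<end>` would otherwise loop forever.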

**Exercise**  
To implement this, create two new, specialized models using the layers from the model we just trained:

*   An **`encoder_model`** that takes a raw string sentence, vectorizes it, and returns the encoder's final hidden state.
*   A **`decoder_model`** that takes the current predicted sequence and a hidden state, and returns the prediction for the next word, along with the updated hidden state.


In [None]:
# --- Create Inference Encoder Model ---
# This model will convert raw Spanish sentences into the encoder's final state.

# Input layer for raw strings
encoder_string_input = keras.Input(
    shape=(1,), dtype="string", name="encoder_string_input"
)

# Vectorize the strings using the adapted layer
# TODO

# Reuse the trained Embedding layer
# TODO

# Reuse the trained GRU layer
# TODO

# Create the final encoder model for inference
encoder_model = Model(
    inputs=encoder_string_input,
    outputs=encoder_state_output,
    name="inference_encoder",
)

In [None]:
# --- Create Inference Decoder Model ---
# This model's structure is similar to the original, as it works with token IDs.

# Input layers for the decoder
decoder_input = keras.Input(shape=(None,), name="decoder_input")
decoder_state_input = keras.Input(
    shape=(HIDDEN_UNITS,), name="decoder_state_input"
)

# Reuse trained layers from the main model
# TODO

# Build the decoder graph
# TODO

# Create the final decoder model for inference
decoder_model = Model(
    inputs=[decoder_input, decoder_state_input],
    outputs=[decoder_predictions, decoder_state_output],
    name="inference_decoder",
)

### Performing the Translation

With our specialized encoder and decoder models ready, we can now define the main `decode_sequences` function that ties everything together to perform the translation.

This function implements the step-by-step decoding process:

1.  **Encode the input**: The raw input sentences are passed to the `encoder_model` to get the initial "thought vector" or hidden state.

2.  **Initialize the sequence**: A target sequence is created, containing only the `<start>` token for each sentence in the batch.

3.  **Iteratively decode**: In a loop, the `decoder_model` is called with the current target sequence and the hidden state to predict the next token. For simplicity, we use `argmax` to select the most probable token as our prediction.

4.  **Append and update**: The predicted token is appended to our result, and the process is repeated with the new token and the updated hidden state from the decoder.

5.  **Stop condition**: The loop continues until the special `<end>` token is generated or the maximum sequence length is reached.

Finally, we'll test our function with a few example sentences to see the translation results in action.


In [None]:
# --- Define the Translation Function (decode_sequences) ---

# Get the vocabulary and create the reverse mapping
targ_vocab = target_vectorization.get_vocabulary()
targ_index_lookup = dict(zip(range(len(targ_vocab)), targ_vocab))
start_token_id = target_vectorization.get_vocabulary().index("<start>")
end_token_id = target_vectorization.get_vocabulary().index("<end>")


def decode_sequences(input_sentences, max_decode_length=50):
    """
    Arguments:
        input_sentences: A tensor of raw strings in the source language.
    Returns:
        A list of translated sentences.
    """
    batch_size = input_sentences.shape[0]

    # Encode the input strings to get the initial state.
    states_value = encoder_model(input_sentences)

    # Initialize the target sequence with the <start> token ID.
    target_seq = keras.ops.full((batch_size, 1), start_token_id, dtype="int32")

    decoded_tokens = [[] for _ in range(batch_size)]
    # Track which sentences have already produced <end>
    finished = [False] * batch_size

    for _ in range(max_decode_length):
        output_tokens, decoder_state = decoder_model([target_seq, states_value])

        # Sample a token (we use argmax for simplicity)
        sampled_token_index = keras.ops.argmax(output_tokens[:, -1, :], axis=-1)

        # Update the target sequence and the decoder state for the next iteration
        target_seq = keras.ops.expand_dims(sampled_token_index, axis=-1)
        states_value = decoder_state

        token_ids = keras.ops.convert_to_numpy(sampled_token_index)
        for j, token_id in enumerate(token_ids):
            if finished[j]:
                continue  # Ignore anything generated after <end>
            if token_id == end_token_id:
                finished[j] = True  # Don't add the end token to the output
            else:
                decoded_tokens[j].append(targ_index_lookup[int(token_id)])

        # Stop early once every sentence in the batch has ended
        if all(finished):
            break

    return [" ".join(tokens) for tokens in decoded_tokens]


sentences = [
    "No estamos comiendo.",
    "Está llegando el invierno.",
]

machine_translations = decode_sequences(keras.ops.convert_to_tensor(sentences))
print(machine_translations)

Now we're ready to predict!

In [None]:
sentences = [
    "No estamos comiendo.",
    "Está llegando el invierno.",
    "El invierno se acerca.",
    "Tom no comio nada.",
    "Su pierna mala le impidió ganar la carrera.",
    "Su respuesta es erronea.",
    "¿Qué tal si damos un paseo después del almuerzo?",
]

reference_translations = [
    "We're not eating.",
    "Winter is coming.",
    "Winter is coming.",
    "Tom ate nothing.",
    "His bad leg prevented him from winning the race.",
    "Your answer is wrong.",
    "How about going for a walk after lunch?",
]

machine_translations = decode_sequences(keras.ops.convert_to_tensor(sentences))

for i in range(len(sentences)):
    print("-")
    print("INPUT:")
    print(sentences[i])
    print("REFERENCE TRANSLATION:")
    print(reference_translations[i])
    print("MACHINE TRANSLATION:")
    print(machine_translations[i])

## Evaluation Metric (BLEU)

Unlike, say, image classification, there is no single right answer for a machine translation. However, our current loss, cross entropy, only gives credit when the machine translation matches the reference translation word for word, in the same order.

Many attempts have been made to develop a better metric for natural language evaluation. The most popular currently is Bilingual Evaluation Understudy (BLEU).

- It is quick and inexpensive to calculate.
- It allows flexibility for the ordering of words and phrases.
- It is easy to understand.
- It is language independent.
- It correlates highly with human evaluation.
- It has been widely adopted.

The score is from 0 to 1, where 1 is an exact match.

It works by counting matching n-grams between the machine and reference texts, regardless of where they appear in the sentence. BLEU-4 combines the matching n-grams of orders 1 through 4 (1-gram, 2-gram, 3-gram and 4-gram). It is common to report both BLEU-1 and BLEU-4.
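For intuition about the n-gram counting, here is a simplified sentence-level sketch with a single reference, written in plain Python. It uses clipped n-gram precision, a geometric mean over the orders and a brevity penalty. The function names are local to this example, and it omits details such as smoothing, tokenization rules and corpus-level aggregation, so its numbers need not match a library implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


def simple_bleu(candidate, reference, max_order=4):
    """Geometric mean of the precisions times a brevity penalty."""
    precisions = [
        modified_precision(candidate, reference, n)
        for n in range(1, max_order + 1)
    ]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_order
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat sat on the mat".split()
print(simple_bleu(reference, reference))
print(modified_precision("the the the the".split(), reference, 1))
```

The second call shows why counts are clipped: repeating `the` four times only earns credit for the two occurrences present in the reference.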

It is still imperfect, since it gives no credit to synonyms, so human evaluation is still best when feasible. However, BLEU is commonly considered the best among bad options for an automated metric.

The Hugging Face `evaluate` framework has an implementation that we will use.

We can't calculate BLEU during training, because at that time the correct decoder input is used (teacher forcing). Instead we'll calculate it now.

For more info: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

In [None]:
def postprocess(sentence):
    filtered = list(filter(lambda x: x != "" and x != "<end>", sentence))
    return " ".join(filtered)

Let's now compute the `bleu_1` and `bleu_4` scores over sentence pairs from the eval set. The next cell takes around 1 minute (8 minutes for full dataset eval) to run, most of which is spent decoding the sentences in the validation set. Please wait until it completes.

In [None]:
NUM_EVALUATE = 1000

source_sentences = []
reference = []  # This will be a list of lists for the BLEU score calculation

# Iterate over the raw validation dataset to get the source and reference texts.
# We use .numpy().decode() to convert the EagerTensors from the dataset into strings.
for spa, eng in tqdm(val_raw.take(NUM_EVALUATE), total=NUM_EVALUATE):
    source_sentences.append(spa.numpy().decode("utf-8"))
    reference.append([eng.numpy().decode("utf-8")])

# Generate all machine translations in a single batch for better performance.
candidate = decode_sequences(tf.constant(source_sentences))

# Print a few examples to see the results
for i in range(5):
    print("-" * 20)
    print(f"Source:    {source_sentences[i]}")
    print(f"Reference: {reference[i][0]}")
    print(f"Machine:   {candidate[i]}")

### Check the score

In [None]:
bleu = evaluate.load("bleu")
bleu_1 = bleu.compute(predictions=candidate, references=reference, max_order=1)
bleu_4 = bleu.compute(predictions=candidate, references=reference, max_order=4)

In [None]:
bleu_1["bleu"]

In [None]:
bleu_4["bleu"]

Copyright 2025 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License