<a href="https://colab.research.google.com/github/Monaa48/TensorFlow-in-Action-starter/blob/main/notebooks/Ch05_State_of_the_art_in_deep_learning_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 05 — State-of-the-art in Deep Learning: Transformers


## 1) Summary

This chapter introduces the **Transformer** model and explains why it became a standard approach for many sequence problems.

Key points that I extracted while working through the code:

- **Text must be converted into numbers** before it can be used by a neural network. In practice, this usually means:
  - mapping tokens to integer IDs,
  - padding/truncating sequences to a fixed length for batching,
  - using an **embedding layer** to convert IDs to vectors.

- A Transformer is usually described as an **encoder–decoder** architecture:
  - the **encoder** reads the source sequence and produces a sequence of contextual representations,
  - the **decoder** generates the output sequence step by step, attending to both:
    - previously generated tokens (masked self-attention),
    - encoder outputs (encoder–decoder attention).

- The main difference from RNN-based sequence models is the heavy use of **attention**:
  - attention lets each token look at other tokens directly,
  - during training the model can process all tokens of a sequence in parallel (there is no recurrent hidden state to wait for),
  - position information is injected through **positional encoding**.

- Internally, each encoder/decoder layer is mainly a combination of:
  1. **(Masked) Multi-head self-attention**
  2. **A position-wise feed-forward network**
  3. Residual connections + layer normalization (for stable optimization)

In this notebook, I implement the major components in TensorFlow/Keras:
- scaled dot-product attention,
- look-ahead and padding masks,
- multi-head attention,
- encoder/decoder layers,
- a small Transformer model trained on a small translation-like dataset.

The goal is not to obtain high translation quality (that requires more data and compute), but to reproduce the architecture and understand what each block contributes.


## 2) Setup


In [1]:
import random
import numpy as np
import tensorflow as tf

print("TensorFlow version:", tf.__version__)

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)


TensorFlow version: 2.19.0


## 3) Representing text as numbers

Neural networks operate on tensors. For text, I need a pipeline that turns sentences into:
- integer sequences (token IDs),
- padded batches with a consistent length.

In this notebook I use `TextVectorization` because it is simple and integrates well with Keras.
I build separate vectorizers for source and target text, which is a common practice in translation.


In [2]:
from tensorflow.keras.layers import TextVectorization

def build_vectorizer(texts, vocab_size=8000, seq_len=20):
    vec = TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        output_sequence_length=seq_len,
        standardize="lower_and_strip_punctuation",
        split="whitespace"
    )
    vec.adapt(texts)
    return vec

# A small translation-like dataset (French -> English).
# This is intentionally small so it runs quickly in Colab.
pairs = [
    ("bonjour", "hello"),
    ("merci", "thank you"),
    ("je suis etudiant", "i am a student"),
    ("je suis fatigué", "i am tired"),
    ("où est la gare", "where is the station"),
    ("je veux de l eau", "i want water"),
    ("je veux un café", "i want coffee"),
    ("je ne comprends pas", "i do not understand"),
    ("pouvez vous aider", "can you help"),
    ("bonne nuit", "good night"),
    ("comment ça va", "how are you"),
    ("je vais bien", "i am fine"),
    ("je viens d indonésie", "i come from indonesia"),
    ("j aime apprendre", "i like learning"),
    ("j aime tensorflow", "i like tensorflow"),
]

src_texts = [s for s, t in pairs]
tgt_texts = [t for s, t in pairs]

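# Add explicit start/end tokens on the target side (common decoder setup).
# Plain words are used because the vectorizer strips punctuation, so "[start]" would
# become "start" anyway; in a larger corpus they would collide with the real words.
START, END = "start", "end"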
tgt_in_texts  = [f"{START} {t}" for t in tgt_texts]
tgt_out_texts = [f"{t} {END}" for t in tgt_texts]

SRC_SEQ_LEN = 12
TGT_SEQ_LEN = 12
VOCAB_SIZE  = 2000

src_vec = build_vectorizer(src_texts, vocab_size=VOCAB_SIZE, seq_len=SRC_SEQ_LEN)
tgt_vec = build_vectorizer(tgt_in_texts + tgt_out_texts, vocab_size=VOCAB_SIZE, seq_len=TGT_SEQ_LEN)

# Peek at vocab
print("Source vocab size:", len(src_vec.get_vocabulary()))
print("Target vocab size:", len(tgt_vec.get_vocabulary()))
print("Example source tokens:", src_vec(src_texts[:2]).numpy())
print("Example target-in tokens:", tgt_vec(tgt_in_texts[:2]).numpy())
print("Example target-out tokens:", tgt_vec(tgt_out_texts[:2]).numpy())


Source vocab size: 38
Target vocab size: 35
Example source tokens: [[34  0  0  0  0  0  0  0  0  0  0  0]
 [19  0  0  0  0  0  0  0  0  0  0  0]]
Example target-in tokens: [[ 3 25  0  0  0  0  0  0  0  0  0  0]
 [ 3 14  5  0  0  0  0  0  0  0  0  0]]
Example target-out tokens: [[25  4  0  0  0  0  0  0  0  0  0  0]
 [14  5  4  0  0  0  0  0  0  0  0  0]]
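
To make the ID mapping and padding concrete, here is a minimal plain-Python sketch of the same idea. It is not how `TextVectorization` is implemented: the helpers `build_vocab` and `encode` are hypothetical, and the only conventions I assume from the layer are that ID 0 is reserved for padding and ID 1 for out-of-vocabulary tokens. Tie-breaking between equally frequent tokens may differ from the real layer.

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # assumed conventions: 0 = padding, 1 = unknown token

def build_vocab(texts):
    # Hypothetical helper: more frequent tokens get smaller IDs.
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {"": PAD_ID, "[UNK]": UNK_ID}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, seq_len):
    # Map tokens to IDs, truncate to seq_len, then right-pad with PAD_ID.
    ids = [vocab.get(tok, UNK_ID) for tok in text.lower().split()][:seq_len]
    return ids + [PAD_ID] * (seq_len - len(ids))

toy_vocab = build_vocab(["i am tired", "i am fine"])
print(encode("i am fine", toy_vocab, seq_len=5))    # [2, 3, 5, 0, 0]
print(encode("i am hungry", toy_vocab, seq_len=5))  # [2, 3, 1, 0, 0] ("hungry" is unknown)
```

The trailing zeros are what the padding mask later has to hide from attention and from the loss.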


### 3.1 Build a `tf.data` pipeline

For training a decoder, I need:
- encoder input: source token IDs
- decoder input: target sequence shifted right (starts with ``start``)
- labels: target sequence shifted left (ends with ``end``)

This is the standard teacher-forcing setup.


In [3]:
def make_dataset(src_texts, tgt_in_texts, tgt_out_texts, batch_size=8, shuffle=True):
    src_ids = src_vec(tf.constant(src_texts))
    tgt_in_ids = tgt_vec(tf.constant(tgt_in_texts))
    tgt_out_ids = tgt_vec(tf.constant(tgt_out_texts))

    ds = tf.data.Dataset.from_tensor_slices(((src_ids, tgt_in_ids), tgt_out_ids))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(src_texts), seed=SEED)
    ds = ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds

train_ds = make_dataset(src_texts, tgt_in_texts, tgt_out_texts, batch_size=8, shuffle=True)

((src_b, tgt_in_b), tgt_out_b) = next(iter(train_ds))
print("src batch:", src_b.shape)
print("tgt_in batch:", tgt_in_b.shape)
print("tgt_out batch:", tgt_out_b.shape)


src batch: (8, 12)
tgt_in batch: (8, 12)
tgt_out batch: (8, 12)
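
The shifted decoder input and labels are easier to see on a single sentence. The sketch below uses plain Python lists instead of tensors; `teacher_forcing_pair` is a hypothetical helper that mirrors how `tgt_in_texts` and `tgt_out_texts` are built above.

```python
def teacher_forcing_pair(target_tokens, start="start", end="end"):
    # Decoder input: the target prefixed with the start token.
    # Labels: the same target suffixed with the end token.
    decoder_input = [start] + target_tokens
    labels = target_tokens + [end]
    return decoder_input, labels

dec_in, labels = teacher_forcing_pair("i am tired".split())

for step, expected in enumerate(labels):
    # At each step the decoder sees the ground-truth prefix, not its own predictions.
    print(step, dec_in[: step + 1], "->", expected)
```

At step 0 the decoder sees only `start` and must predict `i`; at the last step it has seen the whole sentence and must predict `end`.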


## 4) Transformer overview (encoder–decoder view)

A Transformer can be seen as a stack of encoder layers and decoder layers.

- Each **encoder layer** uses:
  1) multi-head self-attention (tokens attend to all tokens in the source sequence),
  2) a feed-forward network applied to each position independently.

- Each **decoder layer** uses:
  1) masked multi-head self-attention (a token can only attend to earlier tokens),
  2) encoder–decoder attention (decoder attends to encoder outputs),
  3) a feed-forward network.

In practice, the model also needs:
- positional encoding (to represent word order),
- masking (to ignore padding tokens and to enforce autoregressive decoding).


## 5) Scaled dot-product attention

The basic attention computation can be written as:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]

- `Q` (query), `K` (key), and `V` (value) are learned projections of embeddings.
- dividing by \(\sqrt{d_k}\) keeps the dot products from growing with the key dimension; large scores would saturate the softmax (one weight near 1, the rest near 0) and shrink its gradients.

I implement this function first because it is the core of the Transformer.


In [4]:
def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (..., seq_len_q, depth); k, v: (..., seq_len_k, depth)
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # mask is broadcastable to (..., seq_len_q, seq_len_k); 1 marks a position to hide
        scaled += (mask * -1e9)

    weights = tf.nn.softmax(scaled, axis=-1)
    output = tf.matmul(weights, v)
    return output, weights

# Toy check: random attention on small tensors
q = tf.random.normal([2, 4, 8])  # batch=2, seq=4, depth=8
k = tf.random.normal([2, 5, 8])
v = tf.random.normal([2, 5, 8])

out, w = scaled_dot_product_attention(q, k, v)
print("output:", out.shape, "| weights:", w.shape)


output: (2, 4, 8) | weights: (2, 4, 5)
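
The reason for the \(\sqrt{d_k}\) factor can be checked numerically. If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance \(d_k\). The sketch below estimates that variance in plain Python; the standard-normal components and the helper `dot_product_variance` are assumptions made for illustration, not properties of the trained model.

```python
import random
import statistics

random.seed(0)

def dot_product_variance(d_k, n_samples=2000):
    # Empirical variance of q.k when all components are independent N(0, 1).
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)

for d_k in (4, 64):
    raw = dot_product_variance(d_k)
    # Dividing the scores by sqrt(d_k) divides their variance by d_k.
    print(f"d_k={d_k:3d}  var(q.k)={raw:6.2f}  var(q.k / sqrt(d_k))={raw / d_k:4.2f}")
```

The unscaled variance should come out close to `d_k`, and the scaled one close to 1 for both sizes, so the softmax input stays in a similar range regardless of depth.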


## 6) Masks: padding mask and look-ahead mask

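Two masks are commonly used. In this notebook a mask value of 1 marks a position to hide and 0 marks a position that may be attended to: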

1. **Padding mask**: ignore padded positions (token ID = 0)  
2. **Look-ahead mask**: during decoding, prevent attention to future tokens


In [5]:
def create_padding_mask(seq):
    # seq: (batch, seq_len) with 0 as padding
    mask = tf.cast(tf.equal(seq, 0), tf.float32)
    # shape -> (batch, 1, 1, seq_len) for broadcasting
    return mask[:, tf.newaxis, tf.newaxis, :]

def create_look_ahead_mask(size):
    # upper-triangular matrix of 1s (excluding diagonal)
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (size, size)

# Example masks
example_seq = tf.constant([[7, 6, 0, 0],
                           [1, 2, 3, 0]], dtype=tf.int32)

pad_mask = create_padding_mask(example_seq)
la_mask  = create_look_ahead_mask(4)

print("Padding mask shape:", pad_mask.shape)
print("Look-ahead mask:\n", la_mask.numpy())


Padding mask shape: (2, 1, 1, 4)
Look-ahead mask:
 [[0. 1. 1. 1.]
 [0. 0. 1. 1.]
 [0. 0. 0. 1.]
 [0. 0. 0. 0.]]
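
To see what the masks do to the attention weights, the sketch below combines a padding mask and a look-ahead mask with an element-wise maximum and applies them to uniform scores. It uses plain Python lists so it runs without TensorFlow; the uniform scores and the toy sequence are made up for illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

token_ids = [7, 6, 3, 0]  # the last position is padding (ID 0)
size = len(token_ids)

# 1 marks a position that must NOT be attended to, as in the masks above.
padding = [1.0 if t == 0 else 0.0 for t in token_ids]
look_ahead = [[1.0 if j > i else 0.0 for j in range(size)] for i in range(size)]
combined = [[max(look_ahead[i][j], padding[j]) for j in range(size)]
            for i in range(size)]

scores = [[0.0] * size for _ in range(size)]  # uniform scores, illustration only
weights = [softmax([s + hide * -1e9 for s, hide in zip(scores[i], combined[i])])
           for i in range(size)]

for row in weights:
    print([round(w, 2) for w in row])
```

Row `i` spreads its weight evenly over positions `0..i` that are not padding, and the padded last column gets zero weight in every row.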


## 7) Positional encoding

Attention alone does not encode word order. Transformers inject position information by adding a positional encoding to token embeddings.

A standard choice is the sinusoidal encoding used in the original Transformer paper.


In [6]:
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(max_position, d_model):
    angle_rads = get_angles(
        np.arange(max_position)[:, np.newaxis],
        np.arange(d_model)[np.newaxis, :],
        d_model
    )

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_enc = angle_rads[np.newaxis, ...].astype("float32")
    return tf.constant(pos_enc)

# Sanity check
pe = positional_encoding(50, 32)
print("positional encoding:", pe.shape)


positional encoding: (1, 50, 32)
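
Individual entries of the encoding can be computed by hand, which helps when checking the vectorized version. The scalar helper `pe_value` below is hypothetical; it follows the same rule as `get_angles`, with sine in even columns and cosine in odd columns.

```python
import math

def pe_value(pos, i, d_model):
    # Columns 2k and 2k+1 share the rate 1 / 10000**(2k / d_model).
    angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
    # Even columns use sine, odd columns use cosine.
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)

d_model = 32

# Position 0 gives sin(0) = 0 in even columns and cos(0) = 1 in odd columns.
print([pe_value(0, i, d_model) for i in range(4)])

# The first column pair has rate 1, so position 1 gives (sin 1, cos 1).
print(pe_value(1, 0, d_model), pe_value(1, 1, d_model))

# The last column pair changes very slowly with position (long wavelength).
print(pe_value(1, d_model - 2, d_model))
```

Low columns oscillate quickly and separate neighbouring positions; high columns change slowly and separate distant ones.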


## 8) Multi-head attention

Instead of computing a single attention output, Transformers use **multiple heads**.
Each head attends in its own learned subspace of size `d_model / num_heads`, so different heads can focus on different relationships between tokens.

Implementation outline:
- project inputs to Q, K, V
- split into heads
- run scaled dot-product attention per head
- concatenate heads and project back


In [7]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")
        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # x: (batch, seq_len, d_model) -> (batch, heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask=None):
        # Note the argument order: values, keys, queries (not q, k, v).
        batch_size = tf.shape(q)[0]

        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # scaled_dot_product_attention expects (..., seq, depth)
        attn_out, attn_weights = scaled_dot_product_attention(q, k, v, mask)
        # back to (batch, seq_len, d_model)
        attn_out = tf.transpose(attn_out, perm=[0, 2, 1, 3])
        concat = tf.reshape(attn_out, (batch_size, -1, self.d_model))

        out = self.dense(concat)
        return out, attn_weights

# Quick check
mha = MultiHeadAttention(d_model=32, num_heads=4)
x_tmp = tf.random.normal([2, 10, 32])
y_tmp, w_tmp = mha(x_tmp, x_tmp, x_tmp, mask=None)
print("mha output:", y_tmp.shape, "| weights:", w_tmp.shape)


mha output: (2, 10, 32) | weights: (2, 4, 10, 10)
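
The reshape and transpose in `split_heads` are easy to get wrong, so here is the same index manipulation in NumPy on a tensor whose entries are just counters. This is a sketch of the layout only (no projections, no attention); the small sizes are arbitrary.

```python
import numpy as np

batch, seq_len, d_model, num_heads = 2, 3, 8, 4
depth = d_model // num_heads  # features per head

x = np.arange(batch * seq_len * d_model).reshape(batch, seq_len, d_model)

# (batch, seq_len, d_model) -> (batch, seq_len, heads, depth) -> (batch, heads, seq_len, depth)
heads = x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)
print(heads.shape)  # (2, 4, 3, 2)

# Head h holds features [h*depth, (h+1)*depth) of every position.
h = 1
print(np.array_equal(heads[:, h], x[:, :, h * depth:(h + 1) * depth]))  # True

# Undo the split: transpose back and merge the last two axes.
merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
print(np.array_equal(merged, x))  # True
```

Each head therefore works on a contiguous slice of the projected features, and concatenation is the exact inverse of the split.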


## 9) Feed-forward network (position-wise)

Each token position also goes through a small fully connected network:

\[
\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
\]

This is applied independently to each time step.


In [8]:
def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation="relu"),
        tf.keras.layers.Dense(d_model)
    ])
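
"Applied independently to each time step" can be verified directly. The NumPy sketch below uses random weights as stand-ins for the trained `Dense` kernels (an assumption for illustration) and compares running the network on the whole sequence with running it one position at a time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff, seq_len = 4, 8, 3

# Random weights stand in for the trained Dense kernels and biases.
W1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
W2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied along the last axis.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))
whole_sequence = ffn(x)
one_at_a_time = np.stack([ffn(x[t]) for t in range(seq_len)])

# Same result either way: no information flows between positions here.
print(np.allclose(whole_sequence, one_at_a_time))  # True
```

Mixing information across positions is left entirely to the attention sub-layers.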


## 10) Encoder and decoder layers

Each layer uses:
- attention + residual + layer norm
- feed-forward + residual + layer norm

Dropout is included to reduce overfitting.


In [9]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        attn_out, _ = self.mha(x, x, x, mask)
        attn_out = self.dropout1(attn_out, training=training)
        out1 = self.layernorm1(x + attn_out)

        ffn_out = self.ffn(out1)
        ffn_out = self.dropout2(ffn_out, training=training)
        out2 = self.layernorm2(out1 + ffn_out)
        return out2


class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super().__init__()
        self.mha1 = MultiHeadAttention(d_model, num_heads)  # masked self-attn
        self.mha2 = MultiHeadAttention(d_model, num_heads)  # enc-dec attn

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, enc_out, training=False, look_ahead_mask=None, padding_mask=None):
        # 1) masked self-attention
        attn1, attn_weights1 = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(x + attn1)

        # 2) encoder-decoder attention (queries from decoder, keys/values from encoder)
        attn2, attn_weights2 = self.mha2(enc_out, enc_out, out1, padding_mask)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(out1 + attn2)

        # 3) feed-forward
        ffn_out = self.ffn(out2)
        ffn_out = self.dropout3(ffn_out, training=training)
        out3 = self.layernorm3(out2 + ffn_out)

        return out3, attn_weights1, attn_weights2
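
The "residual + layer norm" pattern used after every sub-layer can be sketched in a few lines of NumPy. This is a simplified version under stated assumptions: the learned scale and shift of `LayerNormalization` are omitted, and the sub-layer output is a zero placeholder.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance.
    # The learned scale (gamma) and shift (beta) are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer_out)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
sublayer_out = np.zeros_like(x)  # stand-in for an attention or feed-forward output

out = add_and_norm(x, sublayer_out)
print(out.mean(axis=-1))  # close to 0 for every position
print(out.std(axis=-1))   # close to 1 for every position
```

Both rows normalize to nearly the same vector even though their scales differ by a factor of ten, which is the stabilizing effect the layers above rely on.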


## 11) Encoder and decoder stacks

The encoder and decoder are stacks of the above layers with:
- token embedding,
- positional encoding,
- dropout.

Note: I scale embeddings by \(\sqrt{d_{model}}\), which is commonly used in Transformer implementations.


In [10]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, dropout_rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)  # (batch, seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training=training, mask=mask)
        return x


class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 target_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, dropout_rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, x, enc_out, training=False, look_ahead_mask=None, padding_mask=None):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](
                x, enc_out, training=training,
                look_ahead_mask=look_ahead_mask,
                padding_mask=padding_mask
            )
            attention_weights[f"decoder_layer{i+1}_block1"] = block1
            attention_weights[f"decoder_layer{i+1}_block2"] = block2

        return x, attention_weights


## 12) Full Transformer model

The final layer projects decoder outputs to vocabulary logits.


In [11]:
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size,
                 pe_input, pe_target, dropout_rate=0.1):
        super().__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, pe_input, dropout_rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, pe_target, dropout_rate)
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inputs, training=False):
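        # inputs: a tuple (inp, tar_in) of source and shifted-target token IDs
        inp, tar_in = inputs

        # Both masks come from the source: one for encoder self-attention,
        # one for the decoder's attention over the encoder outputs.
        enc_padding_mask = create_padding_mask(inp)
        dec_padding_mask = create_padding_mask(inp)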

        look_ahead_mask = create_look_ahead_mask(tf.shape(tar_in)[1])
        dec_target_padding_mask = create_padding_mask(tar_in)
        combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask[tf.newaxis, tf.newaxis, :, :])

        enc_out = self.encoder(inp, training=training, mask=enc_padding_mask)
        dec_out, attn = self.decoder(tar_in, enc_out, training=training,
                                     look_ahead_mask=combined_mask,
                                     padding_mask=dec_padding_mask)

        final_logits = self.final_layer(dec_out)
        return final_logits, attn


## 13) Training setup

### 13.1 Loss with padding masking

Padded positions carry no information, so they should not contribute to the loss.
I compute the cross-entropy per token, zero it out where the label is padding (ID 0), and average over the remaining tokens.


In [12]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")

def loss_function(real, pred):
    # real: (batch, seq_len), pred: (batch, seq_len, vocab)
    loss_ = loss_object(real, pred)
    mask = tf.cast(tf.not_equal(real, 0), tf.float32)
    loss_ *= mask
    return tf.reduce_sum(loss_) / tf.reduce_sum(mask)
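
A small numeric example shows why the division is by the number of real tokens and not by the batch size times sequence length. The per-token loss values below are made up, and `masked_mean` is a hypothetical plain-Python counterpart of `loss_function`.

```python
def masked_mean(per_token_loss, labels, pad_id=0):
    # Average the per-token losses over non-padding positions only.
    total, count = 0.0, 0
    for loss_row, label_row in zip(per_token_loss, labels):
        for loss, label in zip(loss_row, label_row):
            if label != pad_id:
                total += loss
                count += 1
    return total / count

labels = [[25, 4, 0, 0],   # two real tokens, two padding positions
          [14, 5, 4, 0]]   # three real tokens, one padding position
per_token_loss = [[1.0, 3.0, 9.0, 9.0],  # made-up values; the 9.0 entries sit on padding
                  [2.0, 2.0, 2.0, 9.0]]

print(masked_mean(per_token_loss, labels))  # (1 + 3 + 2 + 2 + 2) / 5 = 2.0
```

An unmasked mean over all eight positions would give 37 / 8 = 4.625, dominated by positions the model is never asked to predict.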


### 13.2 Optimizer and learning rate schedule

A common choice for Transformers is Adam with a warmup learning-rate schedule.
I do not implement that schedule in this notebook: for such a small dataset a constant learning rate is enough and easier to debug.


In [13]:
D_MODEL = 64
NUM_LAYERS = 2
NUM_HEADS = 4
DFF = 128
DROP_RATE = 0.1

# For this small dataset, a constant learning rate is enough and easier to debug.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
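
For reference, the warmup schedule from the original Transformer paper can be written as a plain function of the step number. This is only a sketch of the formula and is not used in the training below; `warmup_lr` is a hypothetical name, and `warmup_steps=4000` is the paper's value, not something tuned for this notebook.

```python
def warmup_lr(step, d_model=64, warmup_steps=4000):
    # lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    step = max(step, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    # Linear increase up to warmup_steps, then inverse-square-root decay.
    print(step, f"{warmup_lr(step):.6f}")
```

With two batches per epoch and 200 epochs, training here takes only about 400 optimizer steps, so this schedule would still be in its linear warmup phase at the end. To use it with Keras, the same formula would typically be wrapped in a subclass of `tf.keras.optimizers.schedules.LearningRateSchedule`.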


### 13.3 Create the model and run training

Because this dataset is tiny (15 sentence pairs, two batches per epoch), training converges quickly but also overfits easily.
I train for 200 epochs and mainly check that the loss drops and that the model learns a sensible mapping for the training pairs.


In [14]:
transformer = Transformer(
    num_layers=NUM_LAYERS,
    d_model=D_MODEL,
    num_heads=NUM_HEADS,
    dff=DFF,
    input_vocab_size=len(src_vec.get_vocabulary()),
    target_vocab_size=len(tgt_vec.get_vocabulary()),
    pe_input=SRC_SEQ_LEN,
    pe_target=TGT_SEQ_LEN,
    dropout_rate=DROP_RATE
)

# `compile`/`fit` apply the loss to the model output, so the output must be the
# logits tensor only. The Transformer returns (logits, attention), so this wrapper
# exposes just the logits.
class TransformerLogits(tf.keras.Model):
    def __init__(self, transformer):
        super().__init__()
        self.transformer = transformer

    def call(self, inputs, training=False):
        logits, _ = self.transformer(inputs, training=training)
        return logits

logit_model = TransformerLogits(transformer)
logit_model.compile(optimizer=optimizer, loss=loss_function)

EPOCHS = 200
history = logit_model.fit(train_ds, epochs=EPOCHS, verbose=0)

# Print a quick training signal
print("Final training loss:", float(history.history["loss"][-1]))
print("First 5 losses:", [float(x) for x in history.history["loss"][:5]])
print("Last  5 losses:", [float(x) for x in history.history["loss"][-5:]])


Final training loss: 0.011037083342671394
First 5 losses: [3.8343255519866943, 3.261533498764038, 3.0069315433502197, 2.955033302307129, 2.861940622329712]
Last  5 losses: [0.02810092829167843, 0.010155181400477886, 0.011375500820577145, 0.011499704793095589, 0.011037083342671394]


## 14) Simple decoding (greedy)

To translate a sentence, I:
1. encode the source
2. start the target with ``start``
3. iteratively predict the next token and append it
4. stop if ``end`` is generated or max length is reached

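This is greedy decoding (no beam search). The test sentences all come from the training set, so correct outputs show that the model has memorized the pairs, not that it generalizes.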


In [15]:
# Helper maps for decoding
tgt_vocab = tgt_vec.get_vocabulary()
tgt_id_to_token = {i: tok for i, tok in enumerate(tgt_vocab)}
tgt_token_to_id = {tok: i for i, tok in enumerate(tgt_vocab)}

start_id = tgt_token_to_id.get(START)
end_id = tgt_token_to_id.get(END)

print("start token:", START, "| id:", start_id)
print("end token  :", END,   "| id:", end_id)

def greedy_translate(src_sentence, max_len=TGT_SEQ_LEN):
    # Vectorize source
    src_ids = src_vec(tf.constant([src_sentence]))

    if start_id is None or end_id is None:
        return "(start/end token not in vocabulary)"

    # Start with the start token id
    out_ids = [start_id]

    for _ in range(max_len - 1):
        tar_in = tf.constant([out_ids], dtype=tf.int64)  # variable-length decoding
        logits, _ = transformer((src_ids, tar_in), training=False)

        # Next-token distribution is at the last position
        next_id = int(tf.argmax(logits[0, -1]).numpy())
        out_ids.append(next_id)

        if next_id == end_id:
            break

    # Convert ids to tokens; drop start/end and padding
    tokens = []
    for i in out_ids:
        if i in (0, start_id, end_id):
            continue
        tok = tgt_id_to_token.get(i, "")
        if tok:
            tokens.append(tok)

    return " ".join(tokens)

tests = [
    "bonjour",
    "merci",
    "je suis etudiant",
    "je ne comprends pas",
    "où est la gare",
]

for s in tests:
    print(f"{s:25s} -> {greedy_translate(s)}")


start token: start | id: 3
end token  : end | id: 4
bonjour                   -> hello
merci                     -> thank you
je suis etudiant          -> i am a student
je ne comprends pas       -> i do not understand
où est la gare            -> where is the station
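
The decoding loop itself does not depend on the Transformer. The sketch below isolates it behind a hypothetical `next_token_scores` callback and drives it with a toy scoring function, which makes the two stopping conditions (end token, maximum length) easy to test without a trained model.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_len):
    # next_token_scores(prefix) returns one score per vocabulary ID.
    out_ids = [start_id]
    for _ in range(max_len - 1):
        scores = next_token_scores(out_ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        out_ids.append(next_id)
        if next_id == end_id:
            break
    return out_ids

# Hypothetical stand-in for the model: always prefers "previous ID + 1".
def toy_scores(prefix, vocab_size=6):
    preferred = min(prefix[-1] + 1, vocab_size - 1)
    return [1.0 if i == preferred else 0.0 for i in range(vocab_size)]

print(greedy_decode(toy_scores, start_id=2, end_id=5, max_len=10))  # [2, 3, 4, 5]
print(greedy_decode(toy_scores, start_id=2, end_id=5, max_len=3))   # [2, 3, 4]
```

The first call stops because the end token is produced; the second stops because the length limit is reached first. Beam search would replace the single `argmax` with several candidate prefixes kept in parallel.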


## 15) Takeaways

- The Transformer replaces recurrence with attention, which enables parallel processing of tokens.
- The two masks are essential: padding mask ignores padding; look-ahead mask enforces autoregressive decoding.
- Multi-head attention is built from the same attention function repeated across several heads, then concatenated.
- Even with a small dataset, the architecture can be assembled end-to-end in Keras once attention and masking are implemented carefully.
