<a href="https://colab.research.google.com/github/MylearninghubAIMLGENAILLM/LLM-Research-Roadmap/blob/main/transformer_attention_all_you_need_tf_end2end.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Original Vaswani et al. Transformer (2017) in TensorFlow with:**

**Full encoder–decoder architecture (6 layers, multi-head self-attention + cross-attention, residuals + layer norm).**

Exact training loop.

Data pipeline (TF Datasets).

**End-to-end MLOps pipeline (data prep → training → evaluation → inference → model save/load).**

**1) Setup & imports — run first**

This cell installs TF and TFDS (Colab usually has TF preinstalled, but we ensure a compatible version), optionally mounts Drive for saving the model and checkpoints, and imports the libraries.

In [1]:
# Colab cell 1: Setup & imports
# Run this first. It installs tfds (if missing), mounts Drive to save artifacts (optional).
# If you already have TF 2.10+ in Colab, the install lines are safe.

!pip install -q "tensorflow>=2.10.0" tensorflow-datasets sentencepiece

import os, time, math, json
from pathlib import Path
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import layers
print("TF version:", tf.__version__)

# Optional: mount Google Drive to store checkpoints & exported model
from google.colab import drive
DRIVE_MOUNT = "/content/drive"
try:
    drive.mount(DRIVE_MOUNT)
    BASE_DIR = Path(DRIVE_MOUNT) / "MyDrive" / "transformer_training"
    BASE_DIR.mkdir(parents=True, exist_ok=True)
    print("Artifacts will be saved to:", BASE_DIR)
except Exception as e:
    print("Drive mount failed or skipped:", e)
    BASE_DIR = Path("/content/transformer_artifacts")
    BASE_DIR.mkdir(parents=True, exist_ok=True)
    print("Artifacts will be saved to:", BASE_DIR)

TF version: 2.19.0
Mounted at /content/drive
Artifacts will be saved to: /content/drive/MyDrive/transformer_training


**2) Data source & tokenizers**

We use ted_hrlr_translate/pt_to_en (a small parallel dataset) via TFDS. We build separate subword tokenizers for source and target to keep things simple, then encode a sample sentence as a sanity check.

In [2]:
# Colab cell 2: Load parallel dataset from TFDS and build subword tokenizers
# Using a modest subset size (configurable). The training pipeline uses the tokenizers produced here.

DATASET_NAME = "ted_hrlr_translate/pt_to_en"   # Portuguese -> English
SPLIT_TRAIN = "train"
SPLIT_VALID = "validation"

# Load raw examples (text) — only take a subset if you want faster runs (set take=N)
ds_train_raw = tfds.load(DATASET_NAME, split=SPLIT_TRAIN, as_supervised=True)
ds_valid_raw = tfds.load(DATASET_NAME, split=SPLIT_VALID, as_supervised=True)

# Collect raw text to build subword tokenizers (we'll collect up to N examples to keep it fast)
MAX_EXAMPLES_FOR_VOCAB = 20000   # increase if you want larger vocab
src_texts = []
tgt_texts = []
for i, (src, tgt) in enumerate(tfds.as_numpy(ds_train_raw)):
    if i >= MAX_EXAMPLES_FOR_VOCAB:
        break
    src_texts.append(src.decode("utf-8"))
    tgt_texts.append(tgt.decode("utf-8"))

print("Collected", len(src_texts), "sentences for tokenizer building.")

# Build SubwordTextEncoder tokenizers (TFDS' own subword encoder; it is not SentencePiece-backed)
tokenizer_src = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (s for s in src_texts), target_vocab_size=2**13)  # ~8192
tokenizer_tgt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (t for t in tgt_texts), target_vocab_size=2**13)

# Helper functions to save tokenizers for later reuse
tok_dir = BASE_DIR / "tokenizers"
tok_dir.mkdir(parents=True, exist_ok=True)
tokenizer_src.save_to_file(str(tok_dir / "tokenizer_src"))
tokenizer_tgt.save_to_file(str(tok_dir / "tokenizer_tgt"))
print("Saved tokenizers to", tok_dir)

# Quick tokenizer sanity
sample_src = src_texts[0]
print("SRC sample:", sample_src)
print("SRC tokens (first 20):", tokenizer_src.encode(sample_src)[:20])



Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/ted_hrlr_translate/pt_to_en/1.0.0...



Dataset ted_hrlr_translate downloaded and prepared to /root/tensorflow_datasets/ted_hrlr_translate/pt_to_en/1.0.0. Subsequent calls will reuse this data.
Collected 20000 sentences for tokenizer building.
Saved tokenizers to /content/drive/MyDrive/transformer_training/tokenizers
SRC sample: e quando melhoramos a procura , tiramos a única vantagem da impressão , que é a serendipidade .
SRC tokens (first 20): [6, 41, 5086, 58, 3, 2048, 1, 6990, 3, 492, 6936, 7917, 19, 2672, 129, 1, 5, 8, 3, 4414]


Notes:

SubwordTextEncoder is TFDS' convenience wrapper over subword units. It does not add <sos> or <eos> tokens itself (its ids start at 1, with 0 reserved for padding), so we add explicit <sos>/<eos> in our pipeline by reserving ids.

We set vocab sizes ~8k to keep memory manageable.
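
To make the id reservation concrete, here is a minimal sketch of the shift in plain Python. The helper names `add_specials` and `strip_specials` are hypothetical and exist only for this illustration; the notebook's own pipeline does the same thing inside `encode_sentence` and `detokenize`.

```python
# Hypothetical helpers for illustration; the ids mimic SubwordTextEncoder output.
SPECIALS = {"pad": 0, "sos": 1, "eos": 2, "unk": 3}
SHIFT = len(SPECIALS)  # 4 reserved slots

def add_specials(subword_ids):
    """Shift tokenizer ids past the reserved slots and wrap with <sos>/<eos>."""
    return [SPECIALS["sos"]] + [i + SHIFT for i in subword_ids] + [SPECIALS["eos"]]

def strip_specials(ids):
    """Undo add_specials: stop at <eos>, drop reserved ids, reverse the shift."""
    out = []
    for i in ids:
        if i == SPECIALS["eos"]:
            break
        if i >= SHIFT:
            out.append(i - SHIFT)
    return out

raw = [6, 41, 5086]  # first three ids from the sample output above
wrapped = add_specials(raw)
print(wrapped)                  # [1, 10, 45, 5090, 2]
print(strip_specials(wrapped))  # [6, 41, 5086]
```

Because every tokenizer id moves up by 4, the embedding tables need `vocab_size + 4` rows.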

**3) Preprocessing & tf.data pipeline**

Explain: create functions to encode text to integer sequences with <sos> and <eos>, pad/truncate to max_len, build tf.data pipelines with shuffle/cache/batch/prefetch. Also create masks for Transformer.

In [None]:
# Colab cell 3: Preprocessing helpers & tf.data pipelines

# Special token ids: reserve first slots in each tokenizer by offsetting indices.
# We'll add explicit specials: <pad>=0, <sos>=1, <eos>=2, <unk>=3, so we shift tokenizer vocab indices by +4
SPECIAL_TOKENS = {"pad":0, "sos":1, "eos":2, "unk":3}
shift = len(SPECIAL_TOKENS)  # 4

SRC_VOCAB_SIZE = tokenizer_src.vocab_size + shift
TGT_VOCAB_SIZE = tokenizer_tgt.vocab_size + shift
print("SRC vocab size (including specials):", SRC_VOCAB_SIZE)
print("TGT vocab size (including specials):", TGT_VOCAB_SIZE)

def encode_sentence(sentence, tokenizer, max_len=64):
    """Encode Python string -> int list with special token shifting and SOS/EOS."""
    pieces = tokenizer.encode(sentence.numpy().decode("utf-8"))
    pieces = [SPECIAL_TOKENS["sos"]] + [p + shift for p in pieces] + [SPECIAL_TOKENS["eos"]]
    if len(pieces) > max_len:
        pieces = pieces[:max_len]
        if pieces[-1] != SPECIAL_TOKENS["eos"]:
            pieces[-1] = SPECIAL_TOKENS["eos"]
    # pad
    pieces += [SPECIAL_TOKENS["pad"]] * (max_len - len(pieces))
    return tf.constant(pieces, dtype=tf.int32)

def tf_encode(src, tgt, max_len=64):
    """Wrapper that works in tf.data via tf.py_function."""
    # tf.py_function only accepts tensors in `inp`, so the tokenizer objects and
    # max_len are captured by closure instead of being passed as inputs.
    src_enc = tf.py_function(func=lambda s: encode_sentence(s, tokenizer_src, max_len), inp=[src], Tout=tf.int32)
    tgt_enc = tf.py_function(func=lambda t: encode_sentence(t, tokenizer_tgt, max_len), inp=[tgt], Tout=tf.int32)
    src_enc.set_shape([max_len])
    tgt_enc.set_shape([max_len])
    return src_enc, tgt_enc

# Create masks used by Transformer Keras model (we'll create them inside the model training step as needed)
def make_masks(src, tgt):
    # src, tgt: (batch, seq_len)
    src_padding_mask = tf.cast(tf.math.equal(src, SPECIAL_TOKENS["pad"]), tf.float32)[:, tf.newaxis, tf.newaxis, :]  # (B,1,1,src_len)
    tgt_padding_mask = tf.cast(tf.math.equal(tgt, SPECIAL_TOKENS["pad"]), tf.float32)[:, tf.newaxis, tf.newaxis, :]  # (B,1,1,tgt_len)
    seq_len = tf.shape(tgt)[1]
    look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)  # (tgt_len, tgt_len)
    look_ahead_mask = look_ahead_mask[tf.newaxis, tf.newaxis, :, :]  # (1,1,tgt_len,tgt_len)
    combined_mask = tf.maximum(tgt_padding_mask, look_ahead_mask)
    return src_padding_mask, combined_mask

# Build datasets
MAX_LEN = 64
BATCH_SIZE = 64
AUTOTUNE = tf.data.AUTOTUNE

def prepare_dataset(ds_raw, max_examples=None):
    ds = ds_raw.map(lambda s, t: (s, t))
    if max_examples:
        ds = ds.take(max_examples)
    ds = ds.map(lambda s, t: tf_encode(s, t, MAX_LEN), num_parallel_calls=AUTOTUNE)
    ds = ds.cache()
    ds = ds.shuffle(20000)
    ds = ds.batch(BATCH_SIZE)
    ds = ds.prefetch(AUTOTUNE)
    return ds

train_ds = prepare_dataset(ds_train_raw, max_examples=20000)  # cap for Colab speed
valid_ds = prepare_dataset(ds_valid_raw, max_examples=2000)
print("Prepared train and valid tf.data pipelines.")

Notes:

We used tf.py_function to apply the Python tokenizer. This keeps flexibility; for large-scale production you’d use TF ops or pre-tokenize into TFRecord.
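
The mask logic in `make_masks` can be easier to follow on a tiny example. The sketch below reproduces it with plain Python lists rather than tensors (an illustration only, with 1 meaning "hide this position", matching the notebook's convention of adding `mask * -1e9` to the logits).

```python
PAD = 0  # same pad id as SPECIAL_TOKENS["pad"] in the notebook

def padding_mask(ids):
    """1 marks a <pad> position that attention should ignore."""
    return [1 if t == PAD else 0 for t in ids]

def look_ahead_mask(n):
    """1 marks a future position (column j > row i) that must stay hidden."""
    return [[1 if j > i else 0 for j in range(n)] for i in range(n)]

def combined_mask(ids):
    """Element-wise maximum of the two masks, as in make_masks."""
    pad = padding_mask(ids)
    ahead = look_ahead_mask(len(ids))
    n = len(ids)
    return [[max(pad[j], ahead[i][j]) for j in range(n)] for i in range(n)]

tgt = [1, 7, 9, 0]  # <sos>, two tokens, one <pad>
for row in combined_mask(tgt):
    print(row)
# [0, 1, 1, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 1]
```

Each row is one query position: it may attend to itself and earlier tokens, but never to later tokens or to padding.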

**4) Transformer model (exact architecture)**

Explain: implement the Transformer as in the paper using tf.keras layers — multi-head attention, feed-forward, residual connections and layer normalization. We'll implement MultiHeadAttention (custom) and EncoderLayer, DecoderLayer, then stack 6 layers each.
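
Before the full Keras implementation, it may help to see scaled dot-product attention for a single head in plain Python. This is a sketch for intuition only, using nested lists instead of tensors; the function names are illustrative and not part of the notebook.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (seq_q, d); k, v: (seq_k, d); mask: (seq_q, seq_k) with 1 = hide."""
    d_k = len(k[0])
    outputs, weights = [], []
    for i, q_row in enumerate(q):
        # dot(q_i, k_j) / sqrt(d_k) for every key j
        logits = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        if mask is not None:
            logits = [l + mask[i][j] * -1e9 for j, l in enumerate(logits)]
        w = softmax(logits)
        weights.append(w)
        # weighted sum of the value rows
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))])
    return outputs, weights

q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out, w = scaled_dot_product_attention(q, k, v, mask=[[0, 1]])
print(w[0])    # the masked second key receives (effectively) zero weight
print(out[0])
```

Multi-head attention runs this computation in parallel on `num_heads` slices of size `d_model // num_heads` and concatenates the results.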

In [None]:
# Colab cell 4: Transformer model implementation in Keras (6+6 layers)
from typing import Tuple

class MultiHeadAttentionLayer(layers.Layer):
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads

        self.wq = layers.Dense(d_model)
        self.wk = layers.Dense(d_model)
        self.wv = layers.Dense(d_model)
        self.dense = layers.Dense(d_model)
        self.dropout = layers.Dropout(dropout)

    def split_heads(self, x, batch_size):
        """Split last dim into (num_heads, depth) and transpose to (batch, heads, seq_len, depth)."""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0,2,1,3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.wq(q)  # (B, seq, d_model)
        k = self.wk(k)
        v = self.wv(v)

        q = self.split_heads(q, batch_size)  # (B, heads, seq_q, depth)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # scaled dot-product attention
        matmul_qk = tf.matmul(q, k, transpose_b=True)  # (B, heads, seq_q, seq_k)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)  # mask shape broadcastable
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (B, heads, seq_q, seq_k)
        attention_weights = self.dropout(attention_weights)
        scaled_attention = tf.matmul(attention_weights, v)  # (B, heads, seq_q, depth)
        scaled_attention = tf.transpose(scaled_attention, perm=[0,2,1,3])  # (B, seq_q, heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))  # (B, seq_q, d_model)
        output = self.dense(concat_attention)  # final linear
        return output, attention_weights

class EncoderLayerK(layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout=0.1):
        super().__init__()
        self.mha = MultiHeadAttentionLayer(d_model, num_heads, dropout)
        self.ffn = tf.keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(dropout)
        self.dropout2 = layers.Dropout(dropout)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)  # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # residual + norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2

class DecoderLayerK(layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout=0.1):
        super().__init__()
        self.mha1 = MultiHeadAttentionLayer(d_model, num_heads, dropout)  # self
        self.mha2 = MultiHeadAttentionLayer(d_model, num_heads, dropout)  # cross
        self.ffn = tf.keras.Sequential([
            layers.Dense(dff, activation='relu'),
            layers.Dense(d_model)
        ])

        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = layers.Dropout(dropout)
        self.dropout2 = layers.Dropout(dropout)
        self.dropout3 = layers.Dropout(dropout)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # Masked self-attention (look-ahead)
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(x + attn1)

        # Cross-attention with encoder output
        attn2, attn_weights_block2 = self.mha2(enc_output, enc_output, out1, padding_mask)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(out1 + attn2)

        # Feed forward
        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(out2 + ffn_output)

        return out3, attn_weights_block1, attn_weights_block2

class EncoderK(layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, maximum_position_encoding, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = self.positional_encoding(maximum_position_encoding, d_model)
        self.enc_layers = [EncoderLayerK(d_model, num_heads, dff, dropout) for _ in range(num_layers)]
        self.dropout = layers.Dropout(dropout)

    def positional_encoding(self, position, d_model):
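        # Cast to float32 first: TensorFlow does not promote int32 to float32 in get_angles' arithmetic.
        pos = tf.cast(tf.range(position), tf.float32)[:, tf.newaxis]
        i = tf.cast(tf.range(d_model), tf.float32)[tf.newaxis, :]
        angle_rads = self.get_angles(pos, i, d_model)
        # sin of the even-indexed columns, cos of the odd-indexed ones; the two halves are
        # concatenated below rather than interleaved as in the paper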
        sines = tf.math.sin(angle_rads[:, 0::2])
        coses = tf.math.cos(angle_rads[:, 1::2])
        pos_encoding = tf.concat([sines, coses], axis=-1)
        pos_encoding = pos_encoding[tf.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / tf.math.pow(10000.0, (2 * (i//2)) / tf.cast(d_model, tf.float32))
        return pos * angle_rates

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)  # (B, seq, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        for i in range(self.num_layers):
            # Keyword arguments: Keras 3 only accepts tensors as positional call arguments.
            x = self.enc_layers[i](x, training=training, mask=mask)
        return x  # (B, seq, d_model)

class DecoderK(layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size, maximum_position_encoding, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = self.positional_encoding(maximum_position_encoding, d_model)
        self.dec_layers = [DecoderLayerK(d_model, num_heads, dff, dropout) for _ in range(num_layers)]
        self.dropout = layers.Dropout(dropout)

    def positional_encoding(self, position, d_model):
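        # Cast to float32 first: TensorFlow does not promote int32 to float32 in get_angles' arithmetic.
        pos = tf.cast(tf.range(position), tf.float32)[:, tf.newaxis]
        i = tf.cast(tf.range(d_model), tf.float32)[tf.newaxis, :]
        angle_rads = self.get_angles(pos, i, d_model)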
        sines = tf.math.sin(angle_rads[:, 0::2])
        coses = tf.math.cos(angle_rads[:, 1::2])
        pos_encoding = tf.concat([sines, coses], axis=-1)
        pos_encoding = pos_encoding[tf.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)

    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / tf.math.pow(10000.0, (2 * (i//2)) / tf.cast(d_model, tf.float32))
        return pos * angle_rates

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
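            # Keyword arguments: Keras 3 only accepts tensors as positional call arguments.
            x, block1, block2 = self.dec_layers[i](
                x, enc_output, training=training,
                look_ahead_mask=look_ahead_mask, padding_mask=padding_mask)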
            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        return x, attention_weights

class TransformerK(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size, pe_input, pe_target, dropout=0.1):
        super().__init__()
        self.encoder = EncoderK(num_layers, d_model, num_heads, dff, input_vocab_size, pe_input, dropout)
        self.decoder = DecoderK(num_layers, d_model, num_heads, dff, target_vocab_size, pe_target, dropout)
        self.final_layer = layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
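        # Keyword arguments: Keras 3 only accepts tensors as positional call arguments.
        enc_output = self.encoder(inp, training=training, mask=enc_padding_mask)  # (B, inp_seq, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training=training,
            look_ahead_mask=look_ahead_mask, padding_mask=dec_padding_mask)  # (B, tar_seq, d_model)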
        final_output = self.final_layer(dec_output)  # (B, tar_seq, target_vocab)
        return final_output, attention_weights

Key architecture choices and how they implement the paper:

6 encoder + 6 decoder layers achieved by num_layers=6 when instantiating TransformerK.

Each layer: MHA → Add & Norm → FFN → Add & Norm. Residuals implemented via x + sublayer.

LayerNorm uses epsilon=1e-6 like common practice.

Sinusoidal positional encodings implemented in positional_encoding.
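
As a reference point, the paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), which interleaves sine and cosine columns. The notebook's `positional_encoding` computes the same values but concatenates the sine half and the cosine half. The plain-Python sketch below (an illustration, not the notebook's code) shows both layouts.

```python
import math

def positional_encoding(num_positions, d_model, interleave=True):
    """Sinusoidal encodings; interleave=True follows the paper's column order."""
    table = []
    for pos in range(num_positions):
        angles = [pos / (10000 ** (2 * (i // 2) / d_model)) for i in range(d_model)]
        sines = [math.sin(a) for a in angles[0::2]]
        coses = [math.cos(a) for a in angles[1::2]]
        if interleave:
            row = [x for pair in zip(sines, coses) for x in pair]
        else:
            row = sines + coses  # layout used by the notebook's positional_encoding
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

The two layouts contain the same numbers in a different column order.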

**5) Training internals: loss, label smoothing, LR warmup**

Explain: label smoothing and the optimizer + warmup schedule per Vaswani et al. We implement the warmup scheduler:

In [None]:
# Colab cell 5: loss (label smoothing) and learning rate schedule (warmup + decay)
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step + 1e-9)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

def loss_function(real, pred, label_smoothing=0.1):
    # real: (batch, seq_len), pred: (batch, seq_len, vocab)
    vocab_size = tf.shape(pred)[-1]
    # one-hot with label smoothing
    real_one_hot = tf.one_hot(real, depth=vocab_size, dtype=tf.float32)
    smooth_positives = 1.0 - label_smoothing
    smooth_negatives = label_smoothing / tf.cast(vocab_size - 1, tf.float32)
    soft_targets = real_one_hot * smooth_positives + smooth_negatives * (1 - real_one_hot)
    loss = tf.keras.losses.categorical_crossentropy(soft_targets, pred, from_logits=True)
    mask = tf.cast(tf.not_equal(real, SPECIAL_TOKENS["pad"]), dtype=loss.dtype)
    loss = loss * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

def accuracy_function(real, pred):
    predictions = tf.argmax(pred, axis=-1, output_type=tf.int32)
    mask = tf.cast(tf.not_equal(real, SPECIAL_TOKENS["pad"]), dtype=tf.float32)
    matches = tf.cast(tf.equal(real, predictions), tf.float32)
    return tf.reduce_sum(matches * mask) / tf.reduce_sum(mask)

Notes:

We use label smoothing implemented by creating soft targets. This matches the idea in the paper (they used smoothing of 0.1).
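
A small numeric example of the soft targets, as a sketch in plain Python (the function name `smooth_targets` is illustrative; it mirrors the arithmetic in `loss_function`, where the true class gets 1 - smoothing and the remaining mass is split evenly over the other classes):

```python
def smooth_targets(label, vocab_size, smoothing=0.1):
    """Soft target distribution for one token position."""
    off_value = smoothing / (vocab_size - 1)
    return [1.0 - smoothing if i == label else off_value for i in range(vocab_size)]

dist = smooth_targets(label=2, vocab_size=5)
print(dist)  # [0.025, 0.025, 0.9, 0.025, 0.025]
```

The distribution still sums to 1, but the model is no longer rewarded for pushing the true-class probability all the way to 1.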

CustomSchedule matches the paper’s warmup/decay schedule.
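
The schedule's formula is lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The sketch below evaluates it in plain Python so the shape of the curve can be checked without TensorFlow; the function name `transformer_lr` is illustrative.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 16000):
    print(s, f"{transformer_lr(s):.6f}")
```

With these defaults the rate rises linearly until step 4000, peaks at roughly 7e-4, and then decays. A larger `warmup_steps` lowers and delays the peak.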

**6) Training loop (tf.function for performance) and checkpoints**

Explain: training step with tf.function for speed, a training loop that iterates dataset, logs with TensorBoard, and saves checkpoints to Drive.

In [None]:
# Colab cell 6: Instantiate model, optimizer, checkpointing, and training loop
NUM_LAYERS = 6
D_MODEL = 512          # paper uses 512
NUM_HEADS = 8
DFF = 2048
DROPOUT = 0.1

transformer = TransformerK(
    num_layers=NUM_LAYERS,
    d_model=D_MODEL,
    num_heads=NUM_HEADS,
    dff=DFF,
    input_vocab_size=SRC_VOCAB_SIZE,
    target_vocab_size=TGT_VOCAB_SIZE,
    pe_input=MAX_LEN,
    pe_target=MAX_LEN,
    dropout=DROPOUT
)

learning_rate = CustomSchedule(D_MODEL, warmup_steps=8000)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

# Checkpoint manager
ckpt_dir = str(BASE_DIR / "checkpoints")
ckpt = tf.train.Checkpoint(transformer=transformer, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=5)
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Restored from", ckpt_manager.latest_checkpoint)
else:
    print("Initializing from scratch.")

# Metrics and TensorBoard
train_summary_writer = tf.summary.create_file_writer(str(BASE_DIR / "logs" / "train"))
valid_summary_writer = tf.summary.create_file_writer(str(BASE_DIR / "logs" / "valid"))

EPOCHS = 10  # set small for Colab; increase if you have time/GPU
import datetime

@tf.function
def train_step(inp, tar):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]
    # Build the masks inline; their shapes broadcast against (B, heads, seq_q, seq_k).
    enc_padding_mask = tf.cast(tf.equal(inp, SPECIAL_TOKENS["pad"]), tf.float32)[:, tf.newaxis, tf.newaxis, :]
    dec_padding_mask = enc_padding_mask
    seq_len = tf.shape(tar_inp)[1]
    look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)  # (tgt_len,tgt_len)
    look_ahead_mask = look_ahead_mask[tf.newaxis, tf.newaxis, :, :]
    dec_target_padding_mask = tf.cast(tf.equal(tar_inp, SPECIAL_TOKENS["pad"]), tf.float32)[:, tf.newaxis, tf.newaxis, :]
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    with tf.GradientTape() as tape:
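        # Keyword arguments: Keras 3 only accepts tensors as positional call arguments.
        predictions, _ = transformer(inp, tar_inp, training=True, enc_padding_mask=enc_padding_mask,
                                     look_ahead_mask=combined_mask, dec_padding_mask=dec_padding_mask)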
        loss = loss_function(tar_real, predictions, label_smoothing=0.1)
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    acc = accuracy_function(tar_real, predictions)
    return loss, acc

@tf.function
def valid_step(inp, tar):
    tar_inp = tar[:, :-1]
    tar_real = tar[:, 1:]
    enc_padding_mask = tf.cast(tf.equal(inp, SPECIAL_TOKENS["pad"]), tf.float32)[:, tf.newaxis, tf.newaxis, :]
    dec_padding_mask = enc_padding_mask
    seq_len = tf.shape(tar_inp)[1]
    look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    look_ahead_mask = look_ahead_mask[tf.newaxis, tf.newaxis, :, :]
    dec_target_padding_mask = tf.cast(tf.equal(tar_inp, SPECIAL_TOKENS["pad"]), tf.float32)[:, tf.newaxis, tf.newaxis, :]
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)
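    # Keyword arguments: Keras 3 only accepts tensors as positional call arguments.
    predictions, _ = transformer(inp, tar_inp, training=False, enc_padding_mask=enc_padding_mask,
                                 look_ahead_mask=combined_mask, dec_padding_mask=dec_padding_mask)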
    loss = loss_function(tar_real, predictions, label_smoothing=0.1)
    acc = accuracy_function(tar_real, predictions)
    return loss, acc

# Run training
global_step = tf.Variable(0, trainable=False, dtype=tf.int64)

for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0.0
    total_acc = 0.0
    steps = 0
    # training
    for (batch, (inp, tar)) in enumerate(train_ds):
        loss, acc = train_step(inp, tar)
        total_loss += loss
        total_acc += acc
        steps += 1
        global_step.assign_add(1)
        if batch % 100 == 0:
            print(f"Epoch {epoch+1} Batch {batch} Loss {loss:.4f} Acc {acc:.4f}")

    train_loss = total_loss / tf.cast(steps, tf.float32)
    train_acc = total_acc / tf.cast(steps, tf.float32)

    # validation
    val_loss = 0.0
    val_acc = 0.0
    val_steps = 0
    for (batch, (inp, tar)) in enumerate(valid_ds):
        loss_v, acc_v = valid_step(inp, tar)
        val_loss += loss_v
        val_acc += acc_v
        val_steps += 1
    if val_steps > 0:
        val_loss /= tf.cast(val_steps, tf.float32)
        val_acc /= tf.cast(val_steps, tf.float32)
    else:
        val_loss = tf.constant(0.0)
        val_acc = tf.constant(0.0)

    print(f"Epoch {epoch+1} finished | Train Loss: {train_loss:.4f} Train Acc: {train_acc:.4f} | Val Loss: {val_loss:.4f} Val Acc: {val_acc:.4f} | Time: {time.time()-start:.1f}s")

    # write TensorBoard scalars
    with train_summary_writer.as_default():
        tf.summary.scalar('loss', train_loss, step=epoch)
        tf.summary.scalar('accuracy', train_acc, step=epoch)
    with valid_summary_writer.as_default():
        tf.summary.scalar('loss', val_loss, step=epoch)
        tf.summary.scalar('accuracy', val_acc, step=epoch)

    # save checkpoint
    ckpt_save_path = ckpt_manager.save()
    print("Saved checkpoint to", ckpt_save_path)


Notes:

EPOCHS is small for Colab; increase if you have time.

Checkpoints saved to BASE_DIR/checkpoints. You can download from Drive.
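
The `tar[:, :-1]` / `tar[:, 1:]` slicing in `train_step` is teacher forcing: the decoder sees the target up to position t and is scored on the token at position t+1. Here is a sketch on a single sequence in plain Python, together with the pad-masked accuracy (the helper names are illustrative, not from the notebook):

```python
PAD = 0

def shift_for_teacher_forcing(tar):
    """Decoder input drops the last token; the label sequence drops the first."""
    return tar[:-1], tar[1:]

def masked_accuracy(real, predicted):
    """Fraction of non-pad positions where the prediction matches the label."""
    pairs = [(r, p) for r, p in zip(real, predicted) if r != PAD]
    return sum(r == p for r, p in pairs) / len(pairs)

tar = [1, 10, 45, 2, 0, 0]  # <sos> a b <eos> <pad> <pad>
tar_inp, tar_real = shift_for_teacher_forcing(tar)
print(tar_inp)   # [1, 10, 45, 2, 0]
print(tar_real)  # [10, 45, 2, 0, 0]
print(masked_accuracy(tar_real, [10, 99, 2, 7, 7]))  # 0.6666666666666666
```

Predictions at pad positions (the trailing 7s) do not affect the score, which is why both `loss_function` and `accuracy_function` multiply by a pad mask.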

**7) Inference / Greedy decoding**

Explain: Greedy decoding function that uses the trained encoder and decoder. For clearer outputs we convert token ids back to text by undoing the shift.
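
The control flow of greedy decoding is independent of the model, so it can be sketched with a stand-in scorer. In the example below, `toy_model` is a hypothetical placeholder for the Transformer's next-token logits; only the loop structure is meant to carry over.

```python
SOS, EOS = 1, 2

def greedy_decode(next_token_logits, max_len=10):
    """Repeatedly append the arg-max token until <eos> or max_len is reached."""
    seq = [SOS]
    while len(seq) < max_len:
        logits = next_token_logits(seq)  # scores for the next position
        next_id = max(range(len(logits)), key=logits.__getitem__)
        seq.append(next_id)
        if next_id == EOS:
            break
    return seq

def toy_model(prefix):
    """Hypothetical stand-in for the Transformer: emits 5, 6, then <eos>."""
    script = {1: 5, 2: 6}
    target = script.get(len(prefix), EOS)
    return [1.0 if i == target else 0.0 for i in range(8)]

print(greedy_decode(toy_model))  # [1, 5, 6, 2]
```

The decoder input starts as `<sos>` alone and grows by one token per step, so the newest token is always at the last position.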

In [None]:
# Colab cell 7: Greedy inference utilities
def detokenize(toks, tokenizer, shift):
    # toks is a list of ints (including our special tokens)
    ids = []
    for t in toks:
        t = int(t)
        if t == SPECIAL_TOKENS["eos"]:
            break
        # Skip <sos>, <pad>, <unk>, and the unused id left by the shift
        # (SubwordTextEncoder ids start at 1, so valid shifted ids are > shift).
        if t <= shift:
            continue
        ids.append(t - shift)  # reverse shift
    # Decode all ids in one call so subwords are merged back into whole words.
    return tokenizer.decode(ids)

def evaluate_sentence(sentence, max_len=MAX_LEN):
    # Encode a single sentence. This runs eagerly, so encode_sentence can be called directly.
    enc_in = encode_sentence(tf.constant(sentence), tokenizer_src, max_len)
    enc_in = tf.expand_dims(enc_in, 0)  # batch 1
    enc_padding_mask = tf.cast(tf.equal(enc_in, SPECIAL_TOKENS["pad"]), tf.float32)[:, tf.newaxis, tf.newaxis, :]
    dec_padding_mask = enc_padding_mask

    # Start from <sos> only and grow the decoder input by one token per step, so the
    # last position always holds the newest token and the length never exceeds max_len.
    dec_in = tf.constant([[SPECIAL_TOKENS["sos"]]], dtype=tf.int32)
    for i in range(1, max_len):
        seq_len = tf.shape(dec_in)[1]
        look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
        # dec_in contains no <pad>, so the look-ahead mask is the full decoder mask
        combined_mask = look_ahead_mask[tf.newaxis, tf.newaxis, :, :]

        predictions, _ = transformer(enc_in, dec_in, training=False, enc_padding_mask=enc_padding_mask,
                                     look_ahead_mask=combined_mask, dec_padding_mask=dec_padding_mask)
        # take the logits for the last position
        logits = predictions[:, -1:, :]  # (1,1,vocab)
        predicted_id = int(tf.argmax(logits, axis=-1, output_type=tf.int32).numpy()[0][0])
        # append predicted to dec_in
        dec_in = tf.concat([dec_in, tf.constant([[predicted_id]], dtype=tf.int32)], axis=-1)
        if predicted_id == SPECIAL_TOKENS["eos"]:
            break

    returned = tf.squeeze(dec_in, axis=0).numpy().tolist()
    return detokenize(returned, tokenizer_tgt, shift)

# Example
example_sent = "este é um problema que precisamos resolver ."
print("Input:", example_sent)
print("Model output:", evaluate_sentence(example_sent))


**8) Export model and MLOps tips for Colab**

Explain: how to save a SavedModel and pointers to productionize (TF Serving / Docker / CI). Also how to run TensorBoard in Colab.

In [None]:
# Colab cell 8: Save/export model (SavedModel) and quick MLOps tips

export_dir = BASE_DIR / "saved_model"
export_dir.mkdir(parents=True, exist_ok=True)

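# Wrap a serving function that accepts raw strings and returns translations.
# Note: tf.py_function runs Python in-process and is not serialized into a SavedModel,
# so this wrapper is for local use only.
@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def serving_fn(raw_inputs):
    # raw_inputs: [batch] of strings
    def translate(s):
        # py_function passes an EagerTensor, so call .numpy() before decoding the bytes
        out = tf.py_function(func=lambda x: evaluate_sentence(x.numpy().decode("utf-8")), inp=[s], Tout=tf.string)
        out.set_shape([])
        return out
    # tf.map_fn replaces the Python list append, which does not work inside a tf.range loop in graph mode
    return tf.map_fn(translate, raw_inputs, fn_output_signature=tf.string)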

# Save model weights + custom objects if you want. Because our model is a subclassed Model with Python-side tokenization,
# a robust production step usually packs tokenizers and deterministic TF ops. For a Colab demo we'll save weights.
# Keras 3 (bundled with recent TF releases) requires the ".weights.h5" suffix for save_weights.
transformer.save_weights(str(export_dir / "transformer.weights.h5"))

print("Weights saved to", export_dir)

# To run TensorBoard in Colab:
# %load_ext tensorboard
# %tensorboard --logdir {BASE_DIR / "logs"}

print("MLOps tips:")
print("- Use checkpoints in training and keep model weights in stable storage (GCS, S3, or Drive).")
print("- For production, pre-tokenize inputs to TFRecords, incorporate tokenization into TF graph (e.g., TF Text / SentencePiece),")
print("  and export a SavedModel that accepts raw strings (or integer sequences) with deterministic steps.")
print("- CI/CD: use GitHub Actions to run tests; use Docker + TF Serving for deployment. For GCP, use Cloud Build + Cloud Run/AI Platform.")


**9) Practical notes, scaling & suggestions**

Short, exact guidance you can act on immediately:

For exact Vaswani setup:

d_model=512, num_heads=8, dff=2048, num_layers=6.

Warmup steps = 4000 (the paper uses 4000; the example in cell 6 uses 8000, so set warmup_steps=4000 to match).

Label smoothing = 0.1.

Adam with betas (0.9, 0.98) and epsilon=1e-9.

For tokenization at production scale: train SentencePiece model and integrate it into the TF SavedModel using the TF Text ops (so serving receives raw text).

For data pipeline: pre-tokenize to TFRecord (int32 arrays) and use tf.data from TFRecord for fastest I/O in production.

For MLOps:

Continuous training: push new TFRecord shards to storage and trigger retrain via Cloud Build or GitHub Actions.

Serving: export a SavedModel that accepts raw text & outputs strings or token IDs; containerize with TF Serving.

Monitoring: use TensorBoard logs, push metrics to Prometheus/Grafana for production.


**Final: quick checklist to run immediately in Colab**

Create new Colab notebook.

Paste each code block (1 → 8) into its own cell in order.

Runtime → Change runtime type → GPU.

Run cells. For speed, start with reduced MAX_EXAMPLES_FOR_VOCAB, EPOCHS and smaller NUM_LAYERS while debugging; then set NUM_LAYERS=6, D_MODEL=512, EPOCHS higher for full training.