<a href="https://colab.research.google.com/github/Daprosero/Procesamiento_Lenguaje_Natural/blob/main/2.%20Transformer/Transformer_Parte_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Logo UNAL CHEC](https://www.funcionpublica.gov.co/documents/d/guest/logo-universidad-nacional)



# **Mecanismo de atención**
### Departamento de Ingeniería Eléctrica, Electrónica y Computación
#### Universidad Nacional de Colombia - Sede Manizales

#### Profesor: Diego A. Pérez

# Encoder-Only: Clasificación SMS con Transformer Encoder

**Idea:** un *encoder* bidireccional procesa toda la secuencia con *self-attention* (no causal), hacemos *pooling* y clasificamos (spam/ham).

**Por qué aquí:** Las tareas discriminativas (clasificación, NER) suelen ir mejor con encoders (p. ej., BERT).

**Esquema del bloque (Pre-LN):**


In [1]:
# -------- Datos y prepro compartidos --------
import numpy as np, pandas as pd, tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

url = "https://raw.githubusercontent.com/juacardonahe/Curso_NLP/refs/heads/main/data/SMSSpamCollection/SMSSpamCollection"
df = pd.read_csv(url, sep="\t", header=None, names=["label","message"])
df["label"] = df["label"].map({"ham":0, "spam":1})

max_length = 128
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(X_train)
vocab_size = len(tokenizer.word_index) + 1

X_train_pad = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_length, padding="pre", truncating="pre")
X_test_pad  = pad_sequences(tokenizer.texts_to_sequences(X_test),  maxlen=max_length, padding="pre", truncating="pre")
print("Vocab:", vocab_size, "| Train:", X_train_pad.shape, "| Test:", X_test_pad.shape)


Vocab: 7935 | Train: (4457, 128) | Test: (1115, 128)


In [2]:
# -------- Utilidades comunes (PE seno-coseno y bloque encoder) --------
from tensorflow.keras import layers, Model

def sinusoidal_pe(max_len, d_model):
    pos = tf.range(max_len, dtype=tf.float32)[:, None]
    i   = tf.range(d_model, dtype=tf.float32)[None, :]
    rates = 1.0 / tf.pow(10000.0, (2*(i//2))/tf.cast(d_model, tf.float32))
    ang = pos * rates
    s, c = tf.sin(ang[:, 0::2]), tf.cos(ang[:, 1::2])
    return tf.concat([s, c], axis=-1)[None, ...]  # (1,T,D)

class AddSinusoidalPE(layers.Layer):
    def __init__(self, max_len, d_model):
        super().__init__(); self.pe = sinusoidal_pe(max_len, d_model)
    def call(self, x):  # x: (B,T,D)
        return x + self.pe[:, :tf.shape(x)[1], :]

def encoder_block(x, pad_bool, d_model=128, num_heads=4, d_ff=512, drop=0.1, name="enc"):
    h = layers.LayerNormalization(epsilon=1e-6, name=f"{name}_ln1")(x)
    # máscara (B,T,T) = query_valid AND key_valid
    attn_mask = layers.Lambda(
        lambda m: tf.cast(tf.logical_and(tf.expand_dims(m,2), tf.expand_dims(m,1)), tf.float32),
        output_shape=lambda s: (s[0], s[1], s[1]),
        name=f"{name}_mask"
    )(pad_bool)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model//num_heads, dropout=drop,
                                  name=f"{name}_mha")(h, h, attention_mask=attn_mask)
    h = layers.Dropout(drop, name=f"{name}_drop1")(h)
    x = layers.Add(name=f"{name}_res1")([x, h])

    h = layers.LayerNormalization(epsilon=1e-6, name=f"{name}_ln2")(x)
    h = layers.Dense(d_ff, activation=tf.nn.gelu, name=f"{name}_ff1")(h)
    h = layers.Dropout(drop, name=f"{name}_drop2")(h)
    h = layers.Dense(d_model, name=f"{name}_ff2")(h)
    h = layers.Dropout(drop, name=f"{name}_drop3")(h)
    x = layers.Add(name=f"{name}_res2")([x, h])
    return x


In [3]:
# -------- Modelo encoder-only --------
tf.keras.backend.clear_session()
d_model, num_heads, d_ff, drop = 128, 4, 512, 0.1

ids = layers.Input(shape=(max_length,), dtype="int32", name="ids")
pad_bool = layers.Lambda(lambda t: tf.not_equal(t, 0), name="pad_bool")(ids)

x = layers.Embedding(vocab_size, d_model, name="emb")(ids)
x = AddSinusoidalPE(max_length, d_model)(x)

for i in range(1, 4):
    x = encoder_block(x, pad_bool, d_model, num_heads, d_ff, drop, name=f"enc{i}")

x = layers.LayerNormalization(epsilon=1e-6)(x)
mask_f = layers.Lambda(lambda m: tf.cast(tf.expand_dims(m,-1), tf.float32))(pad_bool)
sum_x  = layers.Lambda(lambda xm: tf.reduce_sum(xm[0]*xm[1], axis=1))([x, mask_f])
len_x  = layers.Lambda(lambda m: tf.reduce_sum(m, axis=1) + 1e-9)(mask_f)
pooled = layers.Lambda(lambda sl: sl[0]/sl[1])([sum_x, len_x])

logits = layers.Dense(1, activation="sigmoid")(pooled)
enc_model = Model(ids, logits, name="EncoderOnly_SMS")

enc_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
enc_model.summary()


In [4]:
# -------- Entrenamiento y evaluación --------
hist = enc_model.fit(X_train_pad, y_train, validation_split=0.1, epochs=3, batch_size=64, verbose=1)
probs = enc_model.predict(X_test_pad, batch_size=256, verbose=0).ravel()
preds = (probs >= 0.5).astype(int)
print("EncoderOnly  | Acc:", accuracy_score(y_test, preds), "| AUC:", roc_auc_score(y_test, probs))


Epoch 1/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m37s[0m 264ms/step - accuracy: 0.8586 - loss: 0.4037 - val_accuracy: 0.8924 - val_loss: 0.2984
Epoch 2/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 26ms/step - accuracy: 0.9279 - loss: 0.1932 - val_accuracy: 0.9619 - val_loss: 0.1116
Epoch 3/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 26ms/step - accuracy: 0.9842 - loss: 0.0623 - val_accuracy: 0.9776 - val_loss: 0.0843
EncoderOnly  | Acc: 0.9730941704035875 | AUC: 0.9777328497783706


# Decoder-Only: Clasificación con máscara causal

**Idea:** un *decoder* autoregresivo (máscara causal triangular). Para clasificar, tomamos la **representación del último token** (o agregamos un token especial) y aplicamos una capa de salida.

**Por qué aquí:** Los decoders puros (GPT) son ideales para **generación**. Para **clasificación** funcionan, pero suelen requerir *pooling* sobre la última posición o un token especial.

**Esquema del bloque (Pre-LN):**



**Nota:** Aquí **no** entrenamos como LM causal puro (sin *teacher forcing*); usamos el bloque como extractor autoregresivo y añadimos un clasificador.


In [5]:
# -------- Bloque decoder con máscara causal + padding --------
from tensorflow.keras import layers, Model

def lower_triangular(T):
    # (T,T) máscara causal (True si j<=i)
    return tf.linalg.band_part(tf.ones((T,T), dtype=tf.bool), -1, 0)

def decoder_block(x, pad_bool, d_model=128, num_heads=4, d_ff=512, drop=0.1, name="dec"):
    h = layers.LayerNormalization(epsilon=1e-6, name=f"{name}_ln1")(x)

    # Máscara causal ∩ padding -> (B,T,T)
    def mk_mask(m):
        B = tf.shape(m)[0]
        T = tf.shape(m)[1]
        pad_q = tf.expand_dims(m, 2)       # (B,T,1)
        pad_k = tf.expand_dims(m, 1)       # (B,1,T)
        pad_full = tf.logical_and(pad_q, pad_k)  # (B,T,T)
        causal = tf.expand_dims(lower_triangular(T), 0)        # (1,T,T)
        causal = tf.tile(causal, [B,1,1])                      # (B,T,T)
        return tf.cast(tf.logical_and(pad_full, causal), tf.float32)

    attn_mask = layers.Lambda(mk_mask, output_shape=lambda s: (s[0], s[1], s[1]), name=f"{name}_mask")(pad_bool)

    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model//num_heads, dropout=drop,
                                  name=f"{name}_mha")(h, h, attention_mask=attn_mask)
    h = layers.Dropout(drop, name=f"{name}_drop1")(h)
    x = layers.Add(name=f"{name}_res1")([x, h])

    h = layers.LayerNormalization(epsilon=1e-6, name=f"{name}_ln2")(x)
    h = layers.Dense(d_ff, activation=tf.nn.gelu, name=f"{name}_ff1")(h)
    h = layers.Dropout(drop, name=f"{name}_drop2")(h)
    h = layers.Dense(d_model, name=f"{name}_ff2")(h)
    h = layers.Dropout(drop, name=f"{name}_drop3")(h)
    x = layers.Add(name=f"{name}_res2")([x, h])
    return x


In [6]:
# -------- Modelo decoder-only --------
tf.keras.backend.clear_session()
d_model, num_heads, d_ff, drop = 128, 4, 512, 0.1

ids = layers.Input(shape=(max_length,), dtype="int32", name="ids")
pad_bool = layers.Lambda(lambda t: tf.not_equal(t, 0), name="pad_bool")(ids)

x = layers.Embedding(vocab_size, d_model, name="emb")(ids)
x = AddSinusoidalPE(max_length, d_model)(x)

for i in range(1, 4):
    x = decoder_block(x, pad_bool, d_model, num_heads, d_ff, drop, name=f"dec{i}")

x = layers.LayerNormalization(epsilon=1e-6)(x)

# Tomar la representación del ÚLTIMO token válido (no <pad>)
def last_valid(xm):
    x, m = xm  # x:(B,T,D), m:(B,T) bool
    m = tf.cast(m, tf.int32)
    lengths = tf.reduce_sum(m, axis=1)                               # (B,)
    idx = tf.maximum(lengths - 1, 0)                                 # (B,)
    batch = tf.range(tf.shape(x)[0], dtype=tf.int32)
    gather_idx = tf.stack([batch, idx], axis=1)                      # (B,2)
    return tf.gather_nd(x, gather_idx)                               # (B,D)

last = layers.Lambda(last_valid, name="last_valid")([x, pad_bool])   # (B,D)
logits = layers.Dense(1, activation="sigmoid")(last)

dec_model = Model(ids, logits, name="DecoderOnly_SMS")
dec_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
dec_model.summary()


In [7]:
# -------- Entrenamiento y evaluación --------
hist = dec_model.fit(X_train_pad, y_train, validation_split=0.1, epochs=3, batch_size=64, verbose=1)
probs = dec_model.predict(X_test_pad, batch_size=256, verbose=0).ravel()
preds = (probs >= 0.5).astype(int)
print("DecoderOnly  | Acc:", accuracy_score(y_test, preds), "| AUC:", roc_auc_score(y_test, probs))


Epoch 1/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m35s[0m 263ms/step - accuracy: 0.7990 - loss: 0.5207 - val_accuracy: 0.8453 - val_loss: 0.3253
Epoch 2/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 28ms/step - accuracy: 0.8702 - loss: 0.2728 - val_accuracy: 0.8408 - val_loss: 0.3109
Epoch 3/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 27ms/step - accuracy: 0.8670 - loss: 0.2753 - val_accuracy: 0.8453 - val_loss: 0.3424
DecoderOnly  | Acc: 0.8663677130044843 | AUC: 0.873431572804202


# Encoder-Decoder: Clasificación con Cross-Attention (seq2seq)

**Idea:** el *encoder* codifica el mensaje; el *decoder* (con máscara causal) atiende al encoder vía **cross-attention**. Para clasificar, usamos un **token de inicio** (learnable) y predecimos la clase desde su representación (1 paso).

**Por qué aquí:** Muchos modelos *text-to-text* (T5/BART) usan esta forma; aquí lo simplificamos a **1 paso** del decoder para clasificación.

**Bloque decoder:**


In [12]:
from tensorflow.keras import layers
import tensorflow as tf

def lower_triangular(T):
    return tf.linalg.band_part(tf.ones((T, T), dtype=tf.bool), -1, 0)

def cross_attention_block(y, enc_out, dec_pad_bool, enc_pad_bool,
                          d_model=128, num_heads=4, d_ff=512, drop=0.1, name="xdec"):
    # --- Self-attention (causal + padding) sobre y ---
    y1 = layers.LayerNormalization(epsilon=1e-6, name=f"{name}_ln1")(y)

    def mk_self_mask(m):
        B = tf.shape(m)[0]; T = tf.shape(m)[1]
        pad_q = tf.expand_dims(m, 2)
        pad_k = tf.expand_dims(m, 1)
        pad_full = tf.logical_and(pad_q, pad_k)          # (B,T,T)
        causal = tf.expand_dims(lower_triangular(T), 0)  # (1,T,T)
        causal = tf.tile(causal, [B, 1, 1])              # (B,T,T)
        return tf.cast(tf.logical_and(pad_full, causal), tf.float32)

    self_mask = layers.Lambda(mk_self_mask, output_shape=lambda s: (s[0], s[1], s[1]),
                              name=f"{name}_selfmask")(dec_pad_bool)

    self_mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model//num_heads,
                                         dropout=drop, name=f"{name}_selfmha")
    h = self_mha(query=y1, value=y1, key=y1, attention_mask=self_mask)
    h = layers.Dropout(drop, name=f"{name}_drop1")(h)
    y = layers.Add(name=f"{name}_res1")([y, h])

    # --- Cross-attention: Q=y, K=V=enc_out ---
    y2 = layers.LayerNormalization(epsilon=1e-6, name=f"{name}_ln2")(y)

    def mk_cross_mask(qm_em):
        qm, em = qm_em  # qm:(B,T_dec)  em:(B,T_enc)
        return tf.cast(tf.logical_and(tf.expand_dims(qm, 2), tf.expand_dims(em, 1)), tf.float32)

    cross_mask = layers.Lambda(mk_cross_mask,
                               output_shape=lambda s: (s[0][0], s[0][1], s[1][1]),
                               name=f"{name}_crossmask")([dec_pad_bool, enc_pad_bool])

    cross_mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model//num_heads,
                                          dropout=drop, name=f"{name}_crossmha")
    h = cross_mha(query=y2, value=enc_out, key=enc_out, attention_mask=cross_mask)
    h = layers.Dropout(drop, name=f"{name}_drop2")(h)
    y = layers.Add(name=f"{name}_res2")([y, h])

    # --- FFN ---
    h = layers.LayerNormalization(epsilon=1e-6, name=f"{name}_ln3")(y)
    h = layers.Dense(d_ff, activation=tf.nn.gelu, name=f"{name}_ff1")(h)
    h = layers.Dropout(drop, name=f"{name}_drop3")(h)
    h = layers.Dense(d_model, name=f"{name}_ff2")(h)
    h = layers.Dropout(drop, name=f"{name}_drop4")(h)
    y = layers.Add(name=f"{name}_res3")([y, h])

    return y


In [13]:
# -------- Modelo encoder-decoder (1 paso del decoder) --------
tf.keras.backend.clear_session()
d_model, num_heads, d_ff, drop = 128, 4, 512, 0.1

# Inputs
ids_enc = layers.Input(shape=(max_length,), dtype="int32", name="ids_enc")
ids_dec = layers.Input(shape=(1,), dtype="int32", name="ids_dec")  # un solo paso (token de inicio)

enc_pad_bool = layers.Lambda(lambda t: tf.not_equal(t, 0), name="enc_pad_bool")(ids_enc)
dec_pad_bool = layers.Lambda(lambda t: tf.not_equal(t, 0), name="dec_pad_bool")(ids_dec)

# Compartimos embedding entre enc/dec para simplificar (opcional)
emb = layers.Embedding(vocab_size, d_model, name="emb_shared")

# Encoder
x = emb(ids_enc)
x = AddSinusoidalPE(max_length, d_model)(x)
for i in range(1, 3):  # 2 capas encoder p/ejemplo
    x = encoder_block(x, enc_pad_bool, d_model, num_heads, d_ff, drop, name=f"enc{i}")
enc_out = layers.LayerNormalization(epsilon=1e-6, name="enc_ln")(x)  # (B,T_enc,D)

# Decoder (1 token: BOS)
y = emb(ids_dec)                                # (B,1,D)
# PE de longitud 1 (tomará pos=0)
y = AddSinusoidalPE(max_len=1, d_model=d_model)(y)
# Un bloque decoder con self + cross
y = cross_attention_block(y, enc_out, dec_pad_bool, enc_pad_bool,
                          d_model, num_heads, d_ff, drop, name="xdec1")

y = layers.LayerNormalization(epsilon=1e-6, name="dec_ln")(y)  # (B,1,D)

# Clasificador desde el token del decoder
y_cls = layers.Flatten()(y)        # (B,D)
logits = layers.Dense(1, activation="sigmoid")(y_cls)

encdec_model = Model([ids_enc, ids_dec], logits, name="EncDec_SMS")
encdec_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
encdec_model.summary()


In [14]:
# -------- Entrenamiento y evaluación --------
# Construimos el token de inicio (BOS). Usamos índice 1 (o cualquier no cero) para marcar 'válido'.
bos = np.ones((X_train_pad.shape[0], 1), dtype=np.int32)
bos_test = np.ones((X_test_pad.shape[0], 1), dtype=np.int32)

hist = encdec_model.fit([X_train_pad, bos], y_train, validation_split=0.1, epochs=3, batch_size=64, verbose=1)
probs = encdec_model.predict([X_test_pad, bos_test], batch_size=256, verbose=0).ravel()
preds = (probs >= 0.5).astype(int)
print("Enc-Dec      | Acc:", accuracy_score(y_test, preds), "| AUC:", roc_auc_score(y_test, probs))


Epoch 1/3




[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m48s[0m 335ms/step - accuracy: 0.8290 - loss: 0.5303 - val_accuracy: 0.8857 - val_loss: 0.3145
Epoch 2/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 21ms/step - accuracy: 0.8837 - loss: 0.2592 - val_accuracy: 0.9529 - val_loss: 0.1814
Epoch 3/3
[1m63/63[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 21ms/step - accuracy: 0.9767 - loss: 0.1053 - val_accuracy: 0.9507 - val_loss: 0.1644




Enc-Dec      | Acc: 0.9713004484304932 | AUC: 0.9640529687217753



---

### Tabla de decisión (Encoder-Only vs Decoder-Only vs Encoder-Decoder)

| Arquitectura        | Objetivo de pretraining                      | Máscara                                                                 | Cuándo usarla (tareas típicas)                                              | Fortalezas                                                            | Debilidades                                              | Modelos base                      |
| ------------------- | -------------------------------------------- | ----------------------------------------------------------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------------------- | -------------------------------------------------------- | --------------------------------- |
| **Encoder-Only**    | MLM (Masked Language Modeling)               | **Padding** (no causal)                                                 | Clasificación, NER, extracción, búsqueda semántica, QA extractivo           | Contexto bidireccional, buenas representaciones, finetuning eficiente | No genera texto de forma natural                         | **BERT**, RoBERTa, DeBERTa, MPNet |
| **Decoder-Only**    | CLM (Causal LM)                              | **Causal** (triangular) + padding                                       | Generación (chat, resumen libre), continuación de texto, few-shot prompting | Generación fluida, control por sampling, fácil para agentes           | Peor en tareas puramente discriminativas si no se adapta | **GPT**, LLaMA, Mistral           |
| **Encoder-Decoder** | Denoising / Span corruption (*text-to-text*) | **Self-attn enc** (padding), **self-attn dec** (causal), **cross-attn** | Traducción, resumen condicional, QA abstractive, tasks “input→output”       | Flexibilidad *condicional*, fuerte en seq2seq                         | Más costoso; pipeline más complejo                       | **T5**, BART, Marian              |

> Regla práctica:
>
> * ¿Necesitas **etiquetar/comprender**? → **Encoder-Only**.
> * ¿Necesitas **generar**? → **Decoder-Only**.
> * ¿Necesitas **mapear entrada→salida** (traducción/resumen condicional)? → **Encoder-Decoder**.
