<a href="https://colab.research.google.com/github/Daprosero/Procesamiento_Lenguaje_Natural/blob/main/2.%20Transformer/Comparaci%C3%B3n_de_Arquitecturas_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Logo UNAL CHEC](https://www.funcionpublica.gov.co/documents/d/guest/logo-universidad-nacional)



# **Comparison of Transformer Architectures**
### Department of Electrical, Electronic and Computer Engineering
#### Universidad Nacional de Colombia - Sede Manizales

#### Professor: Diego A. Pérez

# Encoder-Only, Decoder-Only, and Encoder-Decoder

**Objective:** show one working example per **architecture type** and **task** using **pretrained** models (with light fine-tuning where feasible) with `transformers` and `datasets`.

**Contents**
1) Encoder-Only → **Classification** (DistilBERT on IMDb), *quick fine-tuning*.  
2) Decoder-Only → **Generation** (GPT-2), *inference + sampling*.  
3) Encoder-Decoder → **Summarization** (T5-small on SAMSum), *brief fine-tuning*.

> Tip: In Colab, enable the GPU (Runtime → Change runtime type → GPU).


In [1]:
#@title Installation and utilities
!pip -q install transformers datasets accelerate evaluate sentencepiece -U

import torch, random, numpy as np
from datasets import load_dataset
from transformers import set_seed

def seed_all(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
    set_seed(seed)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seed_all(42)
device


device(type='cuda')

## 1) Encoder-Only: Sentiment classification (IMDb) with DistilBERT

- Model: `distilbert-base-uncased` (bidirectional **encoder**).  
- Task: **binary classification** (positive/negative).  
- Technique: light **fine-tuning** (1 epoch, small batch).


In [2]:
#@title Load IMDb
from datasets import load_dataset
imdb = load_dataset("imdb")
imdb


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
imdb["train"], imdb["test"]


(Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }))

In [4]:

# Show the column names
imdb["train"].column_names


['text', 'label']

In [5]:
imdb["train"][0], imdb["train"][1]

({'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

### Tokenization and preparation (dynamic padding)

This section shows **how to convert raw text into tensors** ready for the model (DistilBERT) and **how to pad** each *batch* efficiently:

- **`AutoTokenizer.from_pretrained("distilbert-base-uncased")`**  
  Loads the pretrained *tokenizer* for DistilBERT. It:
  - Splits the text into *tokens* compatible with the model.
  - Inserts *special tokens* (e.g., `[CLS]`, `[SEP]`) where needed.
  - Maps each token to its **ID** and builds the **attention mask** (1 for a real token, 0 for padding).

- **`tok_fn` + `imdb.map(...)`**  
  Applies the tokenization function **to the whole dataset**:
  - `truncation=True` truncates texts that exceed the model's maximum length (512 tokens for DistilBERT).
  - `remove_columns=["text"]` drops the original plain-text column and keeps only the tokenized fields (e.g., `input_ids`, `attention_mask`, `label`).

- **`DataCollatorWithPadding(tokenizer=enc_tok)`**  
  Enables **dynamic padding per batch**:
  - Instead of padding *all* sequences to one global length, it **pads to the longest text in each batch**.
  - Advantages: **fewer padding tokens**, **less compute**, and **better memory/VRAM usage**.
  - The *collator* automatically builds the tensors (`input_ids`, `attention_mask`, `labels`) with uniform dimensions for the model's *forward* pass.

> In short: we tokenize all of IMDb with DistilBERT's vocabulary and configure a *collator* that applies the **minimum padding needed** in each batch, saving time and memory during training/evaluation.
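The idea of padding each batch only to its own longest sequence can be sketched in plain Python. This is an illustrative stand-in, not the implementation inside `DataCollatorWithPadding`; the helper name `pad_batch` and the token IDs are hypothetical.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs: two batches with very different lengths
short_batch = pad_batch([[101, 7592, 102], [101, 2023, 3185, 102]])
long_batch = pad_batch([[101, 102], [101] + [2000] * 8 + [102]])

print(len(short_batch["input_ids"][0]))  # padded width of the short batch
print(len(long_batch["input_ids"][0]))   # padded width of the long batch
```

Each batch gets its own width, so a batch of short reviews never pays for the longest review in the dataset.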


In [6]:
#@title Tokenization and preparation (dynamic padding)
from transformers import AutoTokenizer, DataCollatorWithPadding

enc_ckpt = "distilbert-base-uncased"
enc_tok = AutoTokenizer.from_pretrained(enc_ckpt)

def tok_fn(batch):
    return enc_tok(batch["text"], truncation=True)

imdb_tok = imdb.map(tok_fn, batched=True, remove_columns=["text"])
collator = DataCollatorWithPadding(tokenizer=enc_tok)


In [7]:
#@title Fine-tuning with Trainer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate, numpy as np

acc_metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = (logits.argmax(-1))
    return acc_metric.compute(predictions=preds, references=labels)

model_enc = AutoModelForSequenceClassification.from_pretrained(enc_ckpt, num_labels=2).to(device)

args = TrainingArguments(
    output_dir="enc_clf",
    eval_strategy="epoch",
    save_strategy="no",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=3e-5,
    weight_decay=0.01,
    logging_steps=50,
    fp16=torch.cuda.is_available(),
    report_to="none",  # skip the interactive Weights & Biases login prompt
)

trainer = Trainer(
    model=model_enc,
    args=args,
    train_dataset=imdb_tok["train"].shuffle(seed=42).select(range(20_000)),  # subset for speed
    eval_dataset=imdb_tok["test"],
    tokenizer=enc_tok,
    data_collator=collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.2514 | 0.195914 | 0.92612 |


{'eval_loss': 0.1959141343832016,
 'eval_accuracy': 0.92612,
 'eval_runtime': 101.5467,
 'eval_samples_per_second': 246.192,
 'eval_steps_per_second': 7.701,
 'epoch': 1.0}
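The accuracy reported above comes from taking the `argmax` over the two logits of each example, as `compute_metrics` does. A minimal sketch of that step, using made-up logits and labels rather than real model outputs, also shows how a softmax turns the logits into class probabilities:

```python
import math

# Made-up logits for 4 reviews; columns are [negative, positive]
logits = [[2.0, -1.0], [0.3, 1.2], [-0.5, 0.5], [1.5, 1.0]]
labels = [0, 1, 0, 0]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


preds = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
probs = [softmax(row) for row in logits]

print(preds, accuracy)
print([round(p[1], 3) for p in probs])  # probability of the positive class
```

The third example is the only mistake: the logits favor class 1 while the label is 0, so three of the four predictions are correct.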

In [8]:
#@title Example inference
test_text = "This movie was surprisingly good. Great acting and tight story."
inputs = enc_tok(test_text, return_tensors="pt", truncation=True).to(device)
with torch.no_grad():
    logits = model_enc(**inputs).logits
pred = logits.argmax(-1).item()
print("Text:", test_text)
print("Prediction:", "positive" if pred==1 else "negative")


Text: This movie was surprisingly good. Great acting and tight story.
Prediction: positive


## 2) Decoder-Only: Text generation with GPT-2 (inference)

- Model: `gpt2` (autoregressive **decoder**).  
- Task: **generation** conditioned on a *prompt*.  
- Technique: **sampling** (`top_k`, `top_p`) with no additional training.
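The labels "bidirectional encoder" and "autoregressive decoder" used in this notebook come down to which positions each token may attend to. The sketch below builds both attention patterns as plain 0/1 matrices; the function names are hypothetical and this is not how `transformers` stores its masks internally.

```python
def bidirectional_mask(n):
    """Encoder-style: every position can attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i can attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for row in bidirectional_mask(4):
    print(row)
print()
for row in causal_mask(4):
    print(row)
```

The lower-triangular pattern is what allows GPT-2 to generate left to right: the prediction for the next token never depends on tokens that have not been produced yet.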


- **Model and *tokenizer***  
  - `AutoTokenizer.from_pretrained("gpt2")`: loads the GPT-2 *tokenizer*.  
  - `dec_tok.pad_token = dec_tok.eos_token`: GPT-2 has no `pad_token` by default; we reuse the end-of-sequence token (`eos_token`) to avoid warnings.  
  - `AutoModelForCausalLM.from_pretrained("gpt2").to(device)`: loads GPT-2 (**decoder** only) and moves it to the **GPU/CPU**.  
  - `gpt2.eval()`: evaluation mode (*dropout* disabled).

- **Input (*prompt*)**  
  - A seed text (`prompt`) is defined and tokenized: `dec_tok(prompt, return_tensors="pt").to(device)`.

- **Generation with *sampling*** (`model.generate(...)`)  
  Key parameters for controlling diversity and coherence:
  - `max_new_tokens`: maximum number of **new tokens** to generate. The cell below sets it to `1`, so only one token is appended to the prompt; raise it (e.g., to `80`) for a longer continuation.  
  - `do_sample=True`: enables **stochastic sampling** (not greedy).  
  - `top_k=50`: restricts the choice to the **50 most probable tokens** (cuts off the extreme tail).  
  - `top_p=0.9` (*nucleus sampling*): dynamically selects the **smallest set** of tokens whose cumulative probability is ≥ 0.9 (more adaptive than `top_k`).  
  - `temperature=0.9`: rescales the logits before sampling; values <1 **sharpen** the distribution (more conservative), values >1 flatten it (more diverse).  
  - `repetition_penalty=1.1`: down-weights tokens that have already appeared in the sequence (mitigates loops).  
  - `pad_token_id=dec_tok.eos_token_id`: tells `generate` which ID to use if *padding* is needed.

- **Decoding**  
  - `dec_tok.decode(..., skip_special_tokens=True)`: converts IDs → readable text.

**In short:** GPT-2 continues the *prompt* while **diversity** (`top_k`, `top_p`, `temperature`) and **repetition** are controlled. Tune these hyperparameters to explore styles (more creative vs. more precise).
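To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a simplified sketch that filters a toy distribution in plain Python. It is an assumption-laden illustration, not the code path used by `generate`; the function `filter_distribution` and the toy logits are hypothetical.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide the logits before the softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most probable tokens (0 disables the filter)
    if top_k > 0:
        order = order[:top_k]

    # top-p: keep the smallest prefix whose renormalized cumulative mass reaches top_p
    kept_mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they form a distribution again
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
nucleus = filter_distribution(toy_logits, top_p=0.9)
top2 = filter_distribution(toy_logits, top_k=2)
print(sorted(nucleus))  # [0, 1, 2]
print(sorted(top2))     # [0, 1]

# Draw the next token from the filtered distribution
rng = random.Random(0)
next_token = rng.choices(list(nucleus), weights=list(nucleus.values()))[0]
```

With these toy logits the three most probable tokens already cover more than 90% of the mass, so nucleus sampling drops the two unlikely ones without needing a fixed `k`.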


In [9]:
#@title Loading and generation with sampling
from transformers import AutoTokenizer, AutoModelForCausalLM

dec_ckpt = "gpt2"
dec_tok = AutoTokenizer.from_pretrained(dec_ckpt)
dec_tok.pad_token = dec_tok.eos_token  # avoid warnings
gpt2 = AutoModelForCausalLM.from_pretrained(dec_ckpt).to(device)
gpt2.eval()

prompt = "In a future where messages are filtered by AI,"
inputs = dec_tok(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    out = gpt2.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=True,
        top_k=50,
        top_p=0.9,
        temperature=0.9,
        repetition_penalty=1.1,
        pad_token_id=dec_tok.eos_token_id
    )
print(dec_tok.decode(out[0], skip_special_tokens=True))


In a future where messages are filtered by AI, the


## 3) Encoder-Decoder: Summarization with T5-small (brief fine-tuning on SAMSum)

- Model: `t5-small` (**encoder-decoder** with *cross-attention*).  
- Task: dialogue **summarization** (SAMSum; the loading cell falls back to DialogSum or CNN/DailyMail if SAMSum cannot be loaded).  
- Technique: short **fine-tuning** on a *subset*.


In [10]:
#@title 0) Installation
# Install the libraries and the ROUGE dependencies
!pip -q install -U transformers datasets accelerate sentencepiece
!pip -q install -U evaluate rouge-score



In [11]:
#@title 1) Imports, seed, and device
import torch, random, numpy as np
from datasets import load_dataset
from transformers import set_seed

def seed_all(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); set_seed(seed)
    if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

seed_all(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


device(type='cuda')

In [12]:
#@title 2) Dataset loading with fallback (SAMSum → DialogSum → CNN/DailyMail)
def load_any_summarization():
    tried = []
    for name, kwargs in [
        ("samsum", {}),
        ("knkarthick/dialogsum", {}),
        ("cnn_dailymail", {"name": "3.0.0"}),
    ]:
        try:
            ds = load_dataset(name, **kwargs)
            print(f"✓ Dataset: {name}")
            return name, ds
        except Exception as e:
            tried.append((name, str(e).splitlines()[:1]))
    raise RuntimeError(f"Could not load any dataset. Attempts: {tried}")

ds_name, ds = load_any_summarization()
ds


✓ Dataset: knkarthick/dialogsum


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [13]:
#@title 3) Tokenizer and preprocessing
from transformers import AutoTokenizer
t5_ckpt = "t5-small"
t5_tok  = AutoTokenizer.from_pretrained(t5_ckpt)

max_in, max_out = 512, 128

def preprocess(batch):
    if ds_name in ("samsum", "knkarthick/dialogsum"):
        inputs = ["summarize: " + d for d in batch["dialogue"]]
        targets = batch["summary"]
    else:  # cnn_dailymail
        inputs  = ["summarize: " + a for a in batch["article"]]
        targets = batch["highlights"]

    model_inputs = t5_tok(inputs, max_length=max_in, truncation=True)
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    labels = t5_tok(text_target=targets, max_length=max_out, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

cols_to_remove = ds["train"].column_names
ds_tok = ds.map(preprocess, batched=True, remove_columns=cols_to_remove)
{split: ds_tok[split].num_rows for split in ds_tok}


{'train': 12460, 'validation': 500, 'test': 1500}

In [14]:
#@title 4) ROUGE metric (evaluate, or rouge-score as fallback)
use_evaluate = True
try:
    import evaluate
    rouge_eval = evaluate.load("rouge")
except Exception:
    use_evaluate = False
    from rouge_score import rouge_scorer

def compute_rouge_from_arrays(preds, labels):
    preds  = np.where(preds  != -100, preds,  t5_tok.pad_token_id)
    labels = np.where(labels != -100, labels, t5_tok.pad_token_id)
    pred_str  = t5_tok.batch_decode(preds,  skip_special_tokens=True)
    label_str = t5_tok.batch_decode(labels, skip_special_tokens=True)
    if use_evaluate:
        scores = rouge_eval.compute(predictions=pred_str, references=label_str, use_stemmer=True)
        return {k: round(v, 4) for k, v in scores.items()}
    else:
        scorer = rouge_scorer.RougeScorer(["rouge1","rouge2","rougeLsum"], use_stemmer=True)
        r1=r2=rl=0.0
        for p,g in zip(pred_str, label_str):
            s = scorer.score(g, p)
            r1+=s["rouge1"].fmeasure; r2+=s["rouge2"].fmeasure; rl+=s["rougeLsum"].fmeasure
        n = max(1, len(pred_str))
        return {"rouge1": r1/n, "rouge2": r2/n, "rougeLsum": rl/n}


In [15]:
#@title 5) Model and data collator
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
t5 = AutoModelForSeq2SeqLM.from_pretrained(t5_ckpt).to(device)
collator = DataCollatorForSeq2Seq(tokenizer=t5_tok, model=t5)


In [16]:
#@title 6) Training
from transformers import Trainer, TrainingArguments

# Small subsets for a quick demo (adjust sizes to your VRAM/time budget)
train_subset = ds_tok["train"].shuffle(seed=42).select(range(min(400, ds_tok["train"].num_rows)))
eval_split = "validation" if "validation" in ds_tok else "test"
eval_subset  = ds_tok[eval_split].select(range(min(200, ds_tok[eval_split].num_rows)))

# Attempt A: Seq2SeqTrainer
trainer_type = "seq2seq"
try:
    from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
    args = Seq2SeqTrainingArguments(
        output_dir="t5_sum",
        do_train=True, do_eval=True,
        eval_strategy="epoch", save_strategy="no",
        num_train_epochs=1,
        per_device_train_batch_size=4, per_device_eval_batch_size=8,
        learning_rate=5e-4, weight_decay=0.01,
        predict_with_generate=True,
        generation_max_length=128,
        logging_steps=50,
        fp16=torch.cuda.is_available(),
        report_to="none",
    )
    def compute_metrics(eval_pred):
        preds, labels = eval_pred
        return compute_rouge_from_arrays(preds, labels)

    trainer = Seq2SeqTrainer(
        model=t5, args=args,
        train_dataset=train_subset, eval_dataset=eval_subset,
        tokenizer=t5_tok, data_collator=collator,
        compute_metrics=compute_metrics,
    )
except Exception as e:
    # Fallback B: plain Trainer; evaluation is done by manual generation
    trainer_type = "classic"
    args = TrainingArguments(
        output_dir="t5_sum",
        do_train=True, do_eval=True,
        eval_strategy="epoch", save_strategy="no",
        num_train_epochs=1,
        per_device_train_batch_size=4, per_device_eval_batch_size=8,
        learning_rate=5e-4, weight_decay=0.01,
        logging_steps=50,
        fp16=torch.cuda.is_available(),
        report_to="none",
    )
    trainer = Trainer(
        model=t5, args=args,
        train_dataset=train_subset, eval_dataset=eval_subset,
        tokenizer=t5_tok, data_collator=collator,
    )

trainer_type


'seq2seq'

In [17]:
#@title 7) Fit and evaluation
trainer.train()

if trainer_type == "seq2seq":
    metrics = trainer.evaluate()
else:
    # Manual generation over the validation/test subset
    from tqdm.auto import tqdm
    preds, refs = [], []
    # We need the original (untokenized) examples to get the reference texts
    raw_eval = ds[eval_split].select(range(len(eval_subset)))  # same length as eval_subset
    for i in tqdm(range(len(eval_subset))):
        if ds_name in ("samsum", "knkarthick/dialogsum"):
            inp_txt = "summarize: " + raw_eval[i]["dialogue"]
            ref_txt = raw_eval[i]["summary"]
        else:
            inp_txt = "summarize: " + raw_eval[i]["article"]
            ref_txt = raw_eval[i]["highlights"]
        batch = t5_tok(inp_txt, return_tensors="pt", truncation=True, max_length=max_in)
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            gen = t5.generate(**batch, max_new_tokens=128, num_beams=4)
        hyp = t5_tok.decode(gen[0], skip_special_tokens=True)
        preds.append(hyp); refs.append(ref_txt)

    if use_evaluate:
        import evaluate
        rouge_eval = evaluate.load("rouge")
        metrics = rouge_eval.compute(predictions=preds, references=refs, use_stemmer=True)
    else:
        from rouge_score import rouge_scorer
        scorer = rouge_scorer.RougeScorer(["rouge1","rouge2","rougeLsum"], use_stemmer=True)
        r1=r2=rl=0.0
        for p,g in zip(preds, refs):
            s = scorer.score(g, p)
            r1+=s["rouge1"].fmeasure; r2+=s["rouge2"].fmeasure; rl+=s["rougeLsum"].fmeasure
        n = max(1, len(preds))
        metrics = {"rouge1": r1/n, "rouge2": r2/n, "rougeLsum": rl/n}

{ k: round(float(v), 4) for k, v in metrics.items() if isinstance(v, (int,float)) }


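| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|-------|---------------|-----------------|--------|--------|--------|-----------|
| 1 | 1.6641 | 1.478864 | 0.3644 | 0.124 | 0.295 | 0.296 |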


{'eval_loss': 1.4789,
 'eval_rouge1': 0.3644,
 'eval_rouge2': 0.124,
 'eval_rougeL': 0.295,
 'eval_rougeLsum': 0.296,
 'eval_runtime': 25.649,
 'eval_samples_per_second': 7.798,
 'eval_steps_per_second': 0.975,
 'epoch': 1.0}
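The ROUGE-1 score above measures unigram overlap between the generated summary and the reference. A minimal hand-rolled version is sketched below to show what is being counted; it assumes whitespace tokenization and no stemming, so its numbers will not match `evaluate` or `rouge-score` exactly. The function name `rouge1_f` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure with lowercase whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))  # 0.8333
```

Five of the six words overlap in both directions, so precision and recall are both 5/6. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.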

In [18]:
#@title 8) Inference on one example (mind CPU/GPU placement)
# Pick an example from the test (or validation) split
split = "test" if "test" in ds else "validation"
sample = ds[split][0]

if ds_name in ("samsum", "knkarthick/dialogsum"):
    inp_txt = "summarize: " + sample["dialogue"]
    gold    = sample["summary"]
else:
    inp_txt = "summarize: " + sample["article"]
    gold    = sample["highlights"]

# Make sure the inputs are on the same device as the model
model_device = next(t5.parameters()).device
batch = t5_tok(inp_txt, return_tensors="pt", truncation=True, max_length=max_in)
batch = {k: v.to(model_device) for k, v in batch.items()}

t5.eval()
with torch.no_grad():
    gen = t5.generate(**batch, max_new_tokens=128, num_beams=4)

pred = t5_tok.decode(gen[0], skip_special_tokens=True)
print("=== GOLD ===\n", gold, "\n")
print("=== PRED ===\n", pred)


=== GOLD ===
 Ms. Dawson helps #Person1# to write a memo to inform every employee that they have to change the communication method and should not use Instant Messaging anymore. 

=== PRED ===
 Ms. Dawson needs to take a dictation for her. Ms. Dawson tells #Person1# to go out as an intra-office memorandum to all employees by this afternoon.


## Notes for class

- **Encoder-Only (BERT/DistilBERT)**, *Classification*:
  - Measure **Accuracy**/F1 and analyze errors (e.g., ironic sentences).
  - Experiment with `num_train_epochs`, partial *freezing*, and the *learning rate*.

- **Decoder-Only (GPT-2)**, *Generation*:
  - Compare `greedy`, `beam`, `top_k`, `top_p`; observe coherence/diversity.
  - Control length and repetition: `max_new_tokens`, `no_repeat_ngram_size`.

- **Encoder-Decoder (T5)**, *Summarization*:
  - Adjust `generation_max_length` and `num_beams`; compare **ROUGE** on the eval set.
  - Change prefixes: `summarize:` vs. other instructions (*text-to-text*).
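For the `no_repeat_ngram_size` option mentioned under Decoder-Only, the underlying idea is to forbid any next token that would complete an n-gram already present in the generated sequence. The sketch below illustrates that rule on a list of token IDs; the helper `banned_next_tokens` is hypothetical and is not the routine used inside `generate`.

```python
def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


# Made-up token IDs: the sequence currently ends with (5, 7), which was once followed by 9
history = [5, 7, 9, 5, 7]
print(banned_next_tokens(history, n=3))  # {9}
```

During decoding, the scores of the banned tokens would be set to minus infinity before picking the next token, so the trigram `5 7 9` cannot appear a second time.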


### Assignment

## 1. **Encoder-Only** model (BERT / DistilBERT), *Classification (Sentiment Analysis)*

1.1. Modify the hyperparameters:
- `num_train_epochs`: 1, 2, and 3. Does performance improve or degrade?
- `learning_rate`: try 5e-5 vs. 2e-5.

1.2. Analyze errors:
- Show 5 misclassified examples.
- Do they contain sarcasm, irony, or ambiguous phrases?

1.3. Does BERT understand the **meaning of the text** or only “keywords”?

---

## 2. **Decoder-Only** model (GPT-2), *Text Generation*

2.1. Change the generation modes:

| Strategy | Parameters |
|------------|------------|
| Greedy | `do_sample=False` |
| Creative (Sampling) | `do_sample=True`, `top_k=50` |
| Nucleus | `do_sample=True`, `top_p=0.9` |
| Temperature | `temperature=0.7` vs. `temperature=1.2` |

2.2. Comment:
- Which one is most coherent?  
- Which one repeats phrases or makes things up (*hallucination*)?

2.3. Does GPT generate *text of its own* or only fill in patterns?

---

## 3. **Encoder-Decoder** model (T5), *Automatic Summarization*

3.1. Adjust the generation:
- `num_beams`: 1, 4, 8  
- `generation_max_length`: 50 vs. 150

3.2. Evaluate with ROUGE:
- Which configuration gives the best ROUGE?
- Compare the generated summary with the reference (GOLD).

3.3. Change the task:
- Use `summarize:`, `translate:`, or `question:`.  
  Does T5 understand *text-to-text* instructions?

3.4. Why can T5 solve multiple tasks with the same *text in → text out* format?


