# **Fine-tuning the T5 model**

In this notebook we fine-tune the T5 model. The process was complex and had to be corrected several times because of library incompatibilities, especially with accelerate, and because of tokenization problems.

We start by installing the libraries, which, as noted above, caused problems.

In [None]:
"""
T5 FINE-TUNING FOR SPANISH ELECTORAL ANALYSIS
Google Colab Pro - STABLE, CORRECTED VERSION
With a built-in fix for the "piece id is out of range" error
"""

# ============================================================================
# CELL 1: DEPENDENCY INSTALLATION (STABLE VERSION)
# ============================================================================
print("🚀 STARTING T5 FINE-TUNING - STABLE VERSION")
print("=" * 70)

# Install mutually COMPATIBLE dependency versions
!pip install -q transformers==4.36.0
!pip install -q accelerate==0.21.0  # Pin an accelerate version compatible with transformers
!pip install -q datasets==2.15.0
!pip install -q evaluate==0.4.0
!pip install -q rouge-score==0.1.2
!pip install -q nltk==3.8.1
!pip install -q sentencepiece==0.1.99
!pip install -q protobuf==3.20.3
!pip install -q peft==0.5.0

import os
import sys
import json
import torch
import numpy as np
import pandas as pd
from datetime import datetime
import gc
import warnings
warnings.filterwarnings('ignore')

print("✅ Dependencies installed")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

🚀 STARTING T5 FINE-TUNING - STABLE VERSION
✅ Dependencies installed
PyTorch: 2.9.0+cu126
CUDA available: True
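
Since most of the trouble came from incompatible library versions, it can help to confirm that the runtime really holds the pinned versions before going further. The following is a minimal sketch and not part of the original notebook: `PINS` repeats some of the pins from the install cell, and `find_mismatches` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINS = {
    "transformers": "4.36.0",
    "accelerate": "0.21.0",
    "datasets": "2.15.0",
    "evaluate": "0.4.0",
    "sentencepiece": "0.1.99",
    "peft": "0.5.0",
}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches

if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINS).items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means every pin was honored. Packages that were already imported before `pip install` ran may still need a runtime restart to pick up the new version.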


DeepSeek added this resource-check cell after I told it I would be using the L4 GPU on Colab Pro.

In [None]:
# ============================================================================
# CELL 2: RESOURCE CHECK AND CONFIGURATION
# ============================================================================
def check_resources_and_setup():
    """Check available resources and pick conservative parameters."""

    print("\n" + "=" * 70)
    print("🔍 RESOURCE CHECK - Google Colab Pro")
    print("=" * 70)

    # GPU information
    gpu_info = {}
    if torch.cuda.is_available():
        gpu_info['name'] = torch.cuda.get_device_name(0)
        gpu_info['memory_gb'] = torch.cuda.get_device_properties(0).total_memory / 1e9
        gpu_info['available_memory'] = torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated()

        print(f"🎯 GPU: {gpu_info['name']}")
        print(f"💾 Total GPU memory: {gpu_info['memory_gb']:.2f} GB")
        print(f"💾 Available GPU memory: {gpu_info['available_memory'] / 1e9:.2f} GB")

        # Choose the model according to GPU memory
        if gpu_info['memory_gb'] >= 16:  # V100/A100
            model_name = "google-t5/t5-base"  # 220M parameters
            batch_size = 4  # Kept small to be conservative
            print("   → Using T5-base (220M parameters) with batch size 4")
        else:  # T4 (15 GB) or smaller
            model_name = "google-t5/t5-small"  # 60M parameters
            batch_size = 8
            print("   → Using T5-small (60M parameters) with batch size 8")
    else:
        print("⚠️  No GPU available - using CPU (very slow)")
        model_name = "google-t5/t5-small"
        batch_size = 2

    # RAM information
    import psutil
    ram_gb = psutil.virtual_memory().total / 1e9
    ram_available = psutil.virtual_memory().available / 1e9
    print(f"\n💿 Total RAM: {ram_gb:.2f} GB")
    print(f"💿 Available RAM: {ram_available:.2f} GB")

    # Disk space
    import shutil
    total, used, free = shutil.disk_usage("/")
    print(f"📁 Total disk: {total / 1e9:.1f} GB")
    print(f"📁 Free disk: {free / 1e9:.1f} GB")

    return model_name, batch_size, gpu_info

MODEL_NAME, BASE_BATCH_SIZE, GPU_INFO = check_resources_and_setup()


🔍 RESOURCE CHECK - Google Colab Pro
🎯 GPU: NVIDIA L4
💾 Total GPU memory: 23.80 GB
💾 Available GPU memory: 23.80 GB
   → Using T5-base (220M parameters) with batch size 4

💿 Total RAM: 56.86 GB
💿 Available RAM: 54.45 GB
📁 Total disk: 253.1 GB
📁 Free disk: 210.4 GB


Here the safe parameters and related settings are configured. This is when I realized how convenient libraries such as Unsloth are...

In [None]:
# ============================================================================
# CELL 3: SAFE PARAMETER CONFIGURATION
# ============================================================================
class SafeT5TrainingConfig:
    """Safe configuration for fine-tuning T5"""

    # ========== MODEL ==========
    model_name = MODEL_NAME
    tokenizer_name = MODEL_NAME

    # ========== DATASET ==========
    # Change these paths to match your own directory layout
    dataset_base_path = "/content/drive/MyDrive/Practica_LLM_Engineering_25/dataset_t5_final"
    train_path = f"{dataset_base_path}/train.jsonl"
    val_path = f"{dataset_base_path}/val.jsonl"
    test_path = f"{dataset_base_path}/test.jsonl"

    # ========== SAFE HYPERPARAMETERS ==========
    max_input_length = 256  # Reduced for safety
    max_target_length = 128
    batch_size = BASE_BATCH_SIZE
    gradient_accumulation_steps = max(1, 8 // BASE_BATCH_SIZE)  # Targets an effective batch of 8
    num_train_epochs = 3
    learning_rate = 2e-4  # Slightly lower, for stability
    weight_decay = 0.01
    warmup_steps = 50

    # ========== SAFE TRAINING ==========
    logging_steps = 25
    eval_steps = 100  # Evaluate every 100 steps
    save_steps = 200
    save_total_limit = 2
    load_best_model_at_end = True
    metric_for_best_model = "eval_loss"
    greater_is_better = False

    # ========== GENERATION (DISABLED DURING EVAL) ==========
    num_beams = 4
    temperature = 0.7
    top_p = 0.9
    repetition_penalty = 1.2

    # ========== OUTPUT ==========
    output_dir = f"/content/drive/MyDrive/Practica_LLM_Engineering_25/t5_electoral_safe_{datetime.now().strftime('%Y%m%d_%H%M')}"
    report_to = "none"

    # ========== SAFETY ==========
    use_safe_tokenization = True  # Use safe tokenization
    filter_problematic_examples = True  # Filter out problematic examples
    predict_with_generate_during_training = False  # IMPORTANT: disable generation during eval

    def __str__(self):
        config_str = "\n⚙️ SAFE TRAINING CONFIGURATION:\n"
        config_str += "=" * 50 + "\n"

        # NOTE: every setting above is a class attribute, so self.__dict__ is
        # empty for a fresh instance and only the header is printed (see the
        # output). Iterating over vars(type(self)) would list the settings.
        for key, value in self.__dict__.items():
            if not key.startswith('_'):
                config_str += f"{key:35}: {value}\n"

        return config_str

config = SafeT5TrainingConfig()
print(config)



⚙️ SAFE TRAINING CONFIGURATION:
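
The printed configuration contains only its header because the settings are declared as class attributes, while `self.__dict__` holds only attributes assigned on the instance. The following is a minimal sketch of that effect and of one possible way to list the settings; `DemoConfig` and `describe` are hypothetical names, not the notebook's own class.

```python
class DemoConfig:
    """Hypothetical stand-in for a config class that uses class attributes."""
    batch_size = 4
    learning_rate = 2e-4

    def describe(self):
        """List public, non-callable settings from the class and the instance."""
        # Class attributes live in vars(type(self)); instance overrides in vars(self).
        merged = {**vars(type(self)), **vars(self)}
        return {
            key: value
            for key, value in merged.items()
            if not key.startswith("_") and not callable(value)
        }

cfg = DemoConfig()
print(vars(cfg))        # empty: nothing was assigned on the instance
print(cfg.describe())   # class-level settings are found through the class

cfg.batch_size = 8      # an instance-level override now appears in vars(cfg)
print(vars(cfg))
print(cfg.describe())
```

Merging the class dictionary first and the instance dictionary second lets instance overrides win, which matches normal attribute lookup for plain data attributes.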



We load the three datasets (train, validation, and test) and display a few examples.

In [None]:
# ============================================================================
# CELL 4: DATASET LOADING AND VERIFICATION
# ============================================================================
from datasets import Dataset, DatasetDict
import hashlib

print("\n" + "=" * 70)
print("📂 DATASET LOADING AND VERIFICATION")
print("=" * 70)

def load_and_validate_dataset(config):
    """Load the dataset splits and validate them carefully."""

    def read_jsonl_safely(filepath, max_examples=None):
        """Read a JSONL file with robust validation."""
        data = []
        errors = []

        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                for line_num, line in enumerate(f, 1):
                    if max_examples and len(data) >= max_examples:
                        break

                    line = line.strip()
                    if not line:
                        continue

                    try:
                        example = json.loads(line)

                        # Validate structure
                        if not isinstance(example, dict):
                            errors.append(f"Line {line_num}: Not a dictionary")
                            continue

                        # Check required fields
                        if 'input_text' not in example or 'target_text' not in example:
                            errors.append(f"Line {line_num}: Missing required fields")
                            continue

                        # Check that the fields are not empty
                        if not example['input_text'] or not example['target_text']:
                            errors.append(f"Line {line_num}: Empty fields")
                            continue

                        # Clean the texts
                        input_text = str(example['input_text']).strip()
                        target_text = str(example['target_text']).strip()

                        # Derive a short ID from the content (first 8 hex digits of its MD5)
                        _id = hashlib.md5(
                            f"{input_text}_{target_text}".encode()
                        ).hexdigest()[:8]

                        # Only append the cleaned and necessary fields to ensure consistent schema
                        data.append({
                            'input_text': input_text,
                            'target_text': target_text,
                            '_id': _id
                        })

                    except json.JSONDecodeError as e:
                        errors.append(f"Line {line_num}: Invalid JSON - {str(e)}")
                    except Exception as e:
                        errors.append(f"Line {line_num}: Error - {str(e)}")

            print(f"  ✅ {len(data)} valid examples")
            if errors:
                print(f"  ⚠️  {len(errors)} errors (first 3):")
                for err in errors[:3]:
                    print(f"     {err}")

            return data

        except FileNotFoundError:
            print(f"  ❌ File not found: {filepath}")
            return []
        except Exception as e:
            print(f"  ❌ Error loading {filepath}: {e}")
            return []

    print("Loading splits...")

    # Load each split (max_examples can cap very large files; no cap is applied here)
    train_data = read_jsonl_safely(config.train_path)
    val_data = read_jsonl_safely(config.val_path) if os.path.exists(config.val_path) else []
    test_data = read_jsonl_safely(config.test_path) if os.path.exists(config.test_path) else []

    if not train_data:
        raise ValueError("❌ Could not load the training dataset")

    # Build the DatasetDict
    dataset_dict = {}
    dataset_dict["train"] = Dataset.from_list(train_data)

    if val_data:
        dataset_dict["validation"] = Dataset.from_list(val_data)

    if test_data:
        dataset_dict["test"] = Dataset.from_list(test_data)

    dataset_dict = DatasetDict(dataset_dict)

    # Statistics
    print(f"\n📊 DATASET STATISTICS:")
    print(f"  Training: {len(train_data)} examples")
    if val_data:
        print(f"  Validation: {len(val_data)} examples")
    if test_data:
        print(f"  Test: {len(test_data)} examples")

    # Show sample examples
    print(f"\n📝 DATASET SAMPLE (3 examples):")
    for i in range(min(3, len(train_data))):
        example = train_data[i]
        print(f"\n  Example {i+1} (ID: {example.get('_id', 'N/A')}):")
        print(f"  Input: {example['input_text'][:80]}...")
        print(f"  Target: {example['target_text'][:80]}...")
        # read_jsonl_safely keeps only input_text, target_text and _id, so this
        # branch runs only if the loader is changed to keep metadata
        if 'metadata' in example:
            nivel = example['metadata'].get('nivel_original', 'N/A')
            dificultad = example['metadata'].get('dificultad', 'N/A')
            print(f"  Level: {nivel}, Difficulty: {dificultad}")

    return dataset_dict

# Load the dataset
dataset = load_and_validate_dataset(config)



📂 DATASET LOADING AND VERIFICATION
Loading splits...
  ✅ 1440 valid examples
  ✅ 180 valid examples
  ✅ 180 valid examples

📊 DATASET STATISTICS:
  Training: 1440 examples
  Validation: 180 examples
  Test: 180 examples

📝 DATASET SAMPLE (3 examples):

  Example 1 (ID: 1863b614):
  Input: ### Instrucción:
¿Cuál fue el % VOX mediano en la comunidad de Andalucía en Novi...
  Target: El % VOX mediano en la comunidad de Andalucía en Noviembre 2019 fue de 19.90%....

  Example 2 (ID: 86113f39):
  Input: ### Instrucción:
¿Qué diferencia hay entre PP y VOX en la provincia de Segovia e...
  Target: En la elección de Abril 2019 en la provincia de Segovia, el partido PP obtuvo el...

  Example 3 (ID: 2fcd4bb0):
  Input: ### Instrucción:
¿Cómo evolucionó el porcentaje de voto a VOX en Cillorigo de Li...
  Target: El VOX en Cillorigo de Liébana pasó de 19.61% en Noviembre 2019 a 15.03% en Abri...
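
The loader expects one JSON object per line with non-empty `input_text` and `target_text` fields, and it derives an 8-character ID from an MD5 hash of both texts. The following is a minimal sketch of that record format using made-up sample lines; `parse_record` is a hypothetical name, not the notebook's function, and it returns `None` instead of collecting error messages.

```python
import hashlib
import json

def parse_record(line):
    """Parse one JSONL line; return a cleaned record or None if it is unusable."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(example, dict):
        return None
    if "input_text" not in example or "target_text" not in example:
        return None
    if not example["input_text"] or not example["target_text"]:
        return None
    input_text = str(example["input_text"]).strip()
    target_text = str(example["target_text"]).strip()
    # Same ID scheme as the loader: first 8 hex digits of the MD5 of both texts.
    digest = hashlib.md5(f"{input_text}_{target_text}".encode()).hexdigest()[:8]
    return {"input_text": input_text, "target_text": target_text, "_id": digest}

# Made-up lines for illustration only; they are not taken from the real dataset.
sample_lines = [
    '{"input_text": "### Instrucción:\\n¿Pregunta de ejemplo?", "target_text": "Respuesta de ejemplo."}',
    '{"input_text": "", "target_text": "Sin pregunta"}',
    'not json at all',
]

for line in sample_lines:
    print(parse_record(line))
```

Because the ID depends only on the two texts, exact duplicates across splits end up with the same `_id`, which makes leakage between train and test easy to spot.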


Next comes tokenization. Although the tokenizer is loaded from transformers, I kept running into problems with it. After DeepSeek's corrections, Colab spent close to an hour trying to solve the problem.

In [None]:
# ============================================================================
# CELL 5: SAFE TOKENIZATION FOR T5
# ============================================================================
from transformers import T5Tokenizer
import re

print("\n" + "=" * 70)
print("🔤 SAFE DATASET TOKENIZATION")
print("=" * 70)

print(f"Loading tokenizer: {config.tokenizer_name}")
tokenizer = T5Tokenizer.from_pretrained(config.tokenizer_name)

# Configure the tokenizer for T5
# NOTE: T5 already ships a dedicated <pad> token (ID 0). Aliasing pad to EOS
# makes both share ID 1, so the padding mask applied to the labels further
# down also masks the real EOS token.
tokenizer.pad_token = tokenizer.eos_token

print(f"  Vocabulary: {tokenizer.vocab_size:,} tokens")
print(f"  Pad token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"  EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")

# ============================================================================
# SAFE PREPROCESSING FUNCTION
# ============================================================================
def safe_preprocess_function(examples, is_validation=False):
    """
    SAFE preprocessing function with exhaustive checks
    """
    # Clean the texts
    def clean_text(text):
        if not isinstance(text, str):
            return ""

        # 1. Normalize whitespace
        text = ' '.join(text.split())

        # 2. Keep only safe characters
        # Allowed: letters, digits, whitespace, basic punctuation, Spanish accented characters
        # (anything else, for example '#', is dropped)
        safe_pattern = r'[^a-zA-Z0-9áéíóúÁÉÍÓÚñÑüÜ\s.,;:¿?¡!()\-%\d]'
        text = re.sub(safe_pattern, '', text)

        # 3. Limit length
        max_chars = config.max_input_length * 4  # Conservative estimate
        if len(text) > max_chars:
            text = text[:max_chars]

        return text.strip()

    # Clean inputs and targets
    inputs = [clean_text(text) for text in examples["input_text"]]
    targets = [clean_text(text) for text in examples["target_text"]]

    # Tokenize inputs
    model_inputs = tokenizer(
        inputs,
        max_length=config.max_input_length,
        truncation=True,
        padding="max_length",
        return_tensors=None
    )

    # Tokenize targets; text_target replaces the deprecated
    # as_target_tokenizer() context manager
    tokenized_targets = tokenizer(
        text_target=targets,
        max_length=config.max_target_length,
        truncation=True,
        padding="max_length",
        return_tensors=None
    )

    # EXHAUSTIVE LABEL CHECK AND CORRECTION
    labels = tokenized_targets["input_ids"]

    for i in range(len(labels)):
        for j in range(len(labels[i])):
            token_id = labels[i][j]

            # 1. Check that it is an integer
            if not isinstance(token_id, int):
                labels[i][j] = tokenizer.pad_token_id
                continue

            # 2. Check the vocabulary range (0 to vocab_size-1)
            if token_id < 0 or token_id >= tokenizer.vocab_size:
                labels[i][j] = tokenizer.pad_token_id
                continue

            # 3. Replace the padding token with -100 (ignore index)
            if token_id == tokenizer.pad_token_id:
                labels[i][j] = -100

    model_inputs["labels"] = labels

    # Sanity check (first example of each batch only)
    if not is_validation and len(model_inputs["input_ids"]) > 0:
        # Check the first example
        first_input_ids = model_inputs["input_ids"][0]
        first_labels = model_inputs["labels"][0]

        max_input_id = max(first_input_ids)
        valid_labels = [l for l in first_labels if l != -100]
        max_label_id = max(valid_labels) if valid_labels else 0

        if max_input_id >= tokenizer.vocab_size:
            print(f"⚠️  WARNING: input_ids contains ID {max_input_id} >= {tokenizer.vocab_size}")

        if valid_labels and max_label_id >= tokenizer.vocab_size:
            print(f"⚠️  WARNING: labels contains ID {max_label_id} >= {tokenizer.vocab_size}")

    return model_inputs

# ============================================================================
# DIAGNOSIS OF PROBLEMS IN THE VALIDATION SET
# ============================================================================
print("\n" + "=" * 70)
print("🔍 PREVENTIVE DIAGNOSIS OF THE VALIDATION SET")
print("=" * 70)

def diagnose_validation_set(validation_dataset):
    """Diagnose and correct problems in the validation set"""

    if not validation_dataset:
        print("⚠️  No validation set to diagnose")
        return None

    print(f"Analyzing validation set ({len(validation_dataset)} examples)...")

    # Tokenize the validation set for diagnosis
    val_tokenized_for_diagnosis = validation_dataset.map(
        lambda x: safe_preprocess_function(x, is_validation=True),
        batched=True,
        batch_size=32,
        remove_columns=validation_dataset.column_names
    )

    # Look for problematic examples
    problematic_indices = []
    problematic_details = []

    for i in range(len(val_tokenized_for_diagnosis)):
        example = val_tokenized_for_diagnosis[i]

        # Check input_ids
        input_ids = example["input_ids"]
        if any(token_id >= tokenizer.vocab_size for token_id in input_ids):
            problematic_indices.append(i)
            problematic_details.append({
                "index": i,
                "issue": "input_ids out of range",
                "max_id": max(input_ids),
                "vocab_size": tokenizer.vocab_size
            })
            continue

        # Check labels
        labels = example["labels"]
        valid_labels = [l for l in labels if l != -100]

        if valid_labels:
            if any(label_id >= tokenizer.vocab_size for label_id in valid_labels):
                problematic_indices.append(i)
                problematic_details.append({
                    "index": i,
                    "issue": "labels out of range",
                    "max_id": max(valid_labels),
                    "vocab_size": tokenizer.vocab_size
                })

    if problematic_indices:
        print(f"❌ FOUND {len(problematic_indices)} PROBLEMATIC EXAMPLES")
        print("Details (first 5):")
        for detail in problematic_details[:5]:
            print(f"  Index {detail['index']}: {detail['issue']}")
            print(f"     Max ID: {detail['max_id']}, Vocab size: {detail['vocab_size']}")

        # Show the problematic texts
        print(f"\n📝 Texts of problematic examples:")
        for idx in problematic_indices[:3]:
            original_example = validation_dataset[idx]
            print(f"\n  Index {idx}:")
            print(f"  Input: {original_example['input_text'][:100]}...")
            print(f"  Target: {original_example['target_text'][:100]}...")

        if config.filter_problematic_examples:
            print(f"\n🛠️  FILTERING OUT PROBLEMATIC EXAMPLES...")

            # Build a new validation set without the problematic examples
            good_indices = [i for i in range(len(validation_dataset)) if i not in problematic_indices]

            from datasets import Dataset
            good_examples = [validation_dataset[i] for i in good_indices]

            print(f"  Original: {len(validation_dataset)} examples")
            print(f"  Filtered: {len(good_examples)} examples")
            print(f"  Removed: {len(problematic_indices)} examples")

            return Dataset.from_list(good_examples)
        else:
            print("⚠️  Keeping problematic examples (may cause errors)")
            return validation_dataset
    else:
        print("✅ Validation set verified - no problems found")
        return validation_dataset

# Run the diagnosis on the validation set if there is one
if "validation" in dataset:
    print("\nRunning diagnosis on the validation set...")
    dataset["validation"] = diagnose_validation_set(dataset["validation"])

# ============================================================================
# FINAL TOKENIZATION
# ============================================================================
print("\n🔄 FINAL TOKENIZATION OF DATASETS...")

def safe_tokenize_dataset(dataset_split, split_name):
    """Tokenize one dataset split safely"""
    print(f"  Tokenizing {split_name}...")

    is_validation = (split_name == "validation")

    tokenized = dataset_split.map(
        lambda x: safe_preprocess_function(x, is_validation=is_validation),
        batched=True,
        batch_size=32,
        remove_columns=dataset_split.column_names
    )

    print(f"    ✅ {len(tokenized)} examples tokenized")
    return tokenized

tokenized_datasets = {}

# Tokenize train
tokenized_datasets["train"] = safe_tokenize_dataset(dataset["train"], "train")

# Tokenize validation if it exists and is not empty
if "validation" in dataset and len(dataset["validation"]) > 0:
    tokenized_datasets["validation"] = safe_tokenize_dataset(dataset["validation"], "validation")
else:
    print("⚠️  Validation set unavailable or empty")
    config.eval_steps = None  # Disable evaluation

# Tokenize test if it exists
if "test" in dataset and len(dataset["test"]) > 0:
    tokenized_datasets["test"] = safe_tokenize_dataset(dataset["test"], "test")

print(f"\n✅ TOKENIZATION COMPLETE:")
print(f"  Train: {len(tokenized_datasets['train'])} examples")
if "validation" in tokenized_datasets:
    print(f"  Validation: {len(tokenized_datasets['validation'])} examples")
if "test" in tokenized_datasets:
    print(f"  Test: {len(tokenized_datasets['test'])} examples")


🔤 SAFE DATASET TOKENIZATION
Loading tokenizer: google-t5/t5-base


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


  Vocabulary: 32,000 tokens
  Pad token: </s> (ID: 1)
  EOS token: </s> (ID: 1)

🔍 PREVENTIVE DIAGNOSIS OF THE VALIDATION SET

Running diagnosis on the validation set...
Analyzing validation set (180 examples)...


Map:   0%|          | 0/180 [00:00<?, ? examples/s]

✅ Validation set verified - no problems found

🔄 FINAL TOKENIZATION OF DATASETS...
  Tokenizing train...


Map:   0%|          | 0/1440 [00:00<?, ? examples/s]

    ✅ 1440 examples tokenized
  Tokenizing validation...


Map:   0%|          | 0/180 [00:00<?, ? examples/s]

    ✅ 180 examples tokenized
  Tokenizing test...


Map:   0%|          | 0/180 [00:00<?, ? examples/s]

    ✅ 180 examples tokenized

✅ TOKENIZATION COMPLETE:
  Train: 1440 examples
  Validation: 180 examples
  Test: 180 examples
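
The log above reports the same ID (1) for the pad and EOS tokens, a consequence of `tokenizer.pad_token = tokenizer.eos_token`. Because padding positions in the labels are replaced with -100, the real end-of-sequence token is masked as well, so the loss never sees it. The following is a minimal sketch with made-up content token IDs; only ID 0 for `<pad>` and ID 1 for `</s>` follow the T5 vocabulary, and `mask_padding` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_padding(label_ids, pad_token_id):
    """Replace every occurrence of pad_token_id in a label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if token == pad_token_id else token for token in label_ids]

EOS_ID = 1  # </s> in the T5 vocabulary
PAD_ID = 0  # <pad> in the T5 vocabulary

# Made-up target: three content tokens, then EOS, then padding up to length 6.
padded_with_pad = [284, 17, 905, EOS_ID, PAD_ID, PAD_ID]
# The same target when the tokenizer pads with EOS because pad_token == eos_token.
padded_with_eos = [284, 17, 905, EOS_ID, EOS_ID, EOS_ID]

print(mask_padding(padded_with_pad, pad_token_id=PAD_ID))  # EOS survives
print(mask_padding(padded_with_eos, pad_token_id=EOS_ID))  # EOS is masked too
```

If generated answers tend to run on past the expected end, one option worth trying is to leave the tokenizer's default `<pad>` token in place. That is a suggestion, not something the original run did.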


We load the T5 model from transformers. This was perhaps the only cell that caused no problems.

In [None]:
# ============================================================================
# CELL 6: LOADING THE T5 MODEL
# ============================================================================
from transformers import T5ForConditionalGeneration

print("\n" + "=" * 70)
print("🏗️  LOADING THE T5 MODEL")
print("=" * 70)

print(f"Loading model: {config.model_name}")

# Choose the device up front so the fallback branch below can use it too
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

try:
    # Load the pretrained model
    model = T5ForConditionalGeneration.from_pretrained(config.model_name)

    # Move to the GPU if one is available
    model = model.to(device)

    print(f"✅ Model loaded on: {device}")
    print(f"   Total parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f"   Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

except Exception as e:
    print(f"❌ Error loading the model: {e}")
    print("Trying an alternative configuration...")

    # Retry with an alternative configuration
    try:
        model = T5ForConditionalGeneration.from_pretrained(
            config.model_name,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
        )
        model = model.to(device)
        print("✅ Model loaded with the alternative configuration")
    except Exception as e2:
        print(f"❌ Critical error: {e2}")
        raise



🏗️  LOADING THE T5 MODEL
Loading model: google-t5/t5-base
✅ Model loaded on: cuda
   Total parameters: 222,903,552
   Trainable parameters: 222,903,552
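
With the parameter count reported above, a back-of-the-envelope estimate gives a sense of how much GPU memory full fine-tuning needs at a minimum. The sketch below counts only one copy of the weights and one of the gradients in 32-bit floats (4 bytes each). Optimizer state, mixed-precision copies and activations come on top and are not modeled, so treat the result as a lower bound under those assumptions. `weights_and_grads_gb` is a hypothetical helper.

```python
def weights_and_grads_gb(num_params, bytes_per_value=4):
    """Lower-bound memory (in GB, 1e9 bytes) for one copy of weights plus one of gradients."""
    copies = 2  # parameters + gradients
    return num_params * bytes_per_value * copies / 1e9

T5_BASE_PARAMS = 222_903_552  # total parameters printed when the model was loaded

print(f"{weights_and_grads_gb(T5_BASE_PARAMS):.2f} GB")
```

The figure uses the same 1e9 divisor as the resource-check cell, so it can be compared directly with the 23.80 GB reported there.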


This cell configures the metrics 'safely'. The word 'safe' is a constant in DeepSeek's output, adopted after I reported the problems I kept running into.

That said, I am somewhat skeptical, perhaps out of ignorance, of these ROUGE measurements, at least for this assignment. I explain why in the assignment's presentation PDF.
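
One possible reason for caution, offered here as an illustration rather than as the argument made in the PDF: n-gram overlap barely reacts when the only wrong token is the number itself. The sketch below implements a simplified unigram-overlap F1 (lowercased whitespace tokens, no stemming), which is in the spirit of ROUGE-1 but is not the `rouge_score` implementation. The reference is the first target shown in the dataset sample; the prediction with 91.09% is made up.

```python
from collections import Counter

def unigram_f1(prediction, reference):
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "El % VOX mediano en la comunidad de Andalucía en Noviembre 2019 fue de 19.90%."
# Hypothetical model output: identical wording, wrong percentage.
wrong_number = "El % VOX mediano en la comunidad de Andalucía en Noviembre 2019 fue de 91.09%."

print(f"{unigram_f1(wrong_number, reference):.3f}")
```

Fourteen of the fifteen tokens match, so the score stays at about 0.93 even though the answer is factually wrong. For a task whose answers are mostly templated sentences around one figure, an exact-match check on the extracted number may say more than ROUGE does.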

In [None]:
# ============================================================================
# CELL 7: SAFE METRIC CONFIGURATION
# ============================================================================
import evaluate
import nltk

print("\n" + "=" * 70)
print("📊 SAFE METRIC CONFIGURATION")
print("=" * 70)

try:
    nltk.download('punkt', quiet=True)

    # Load the metrics
    rouge = evaluate.load("rouge")

    def safe_compute_metrics(eval_pred):
        """
        Compute metrics defensively.
        Only used when predict_with_generate=True.
        """
        predictions, labels = eval_pred

        # Make sure predictions and labels are present
        if predictions is None or labels is None:
            return {"eval_loss": 0.0}

        # Some models return a tuple; the generated IDs come first
        if isinstance(predictions, tuple):
            predictions = predictions[0]

        # Replace -100 (padding added when batches are gathered) before decoding;
        # IDs outside the vocabulary trigger "piece id is out of range"
        predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)

        # Decode predictions
        try:
            decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        except Exception:
            decoded_preds = [""] * len(predictions)

        # Replace -100 in labels
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        try:
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        except Exception:
            decoded_labels = [""] * len(labels)

        # Compute ROUGE defensively
        try:
            rouge_result = rouge.compute(
                predictions=decoded_preds,
                references=decoded_labels,
                use_stemmer=True,
                use_aggregator=True
            )
        except Exception:
            rouge_result = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0, "rougeLsum": 0.0}

        return {
            "rouge1": rouge_result["rouge1"],
            "rouge2": rouge_result["rouge2"],
            "rougeL": rouge_result["rougeL"],
            "rougeLsum": rouge_result["rougeLsum"]
        }

    print("✅ Metrics configured (for final evaluation only)")

except Exception as e:
    print(f"⚠️  Error configuring metrics: {e}")
    print("Continuing without detailed metrics")

    def safe_compute_metrics(eval_pred):
        return {"eval_loss": 0.0}


📊 SAFE METRIC CONFIGURATION
✅ Metrics configured (for final evaluation only)
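
The "piece id is out of range" error mentioned in the notebook header typically appears when an ID that is not in the vocabulary reaches the SentencePiece decoder. The value -100, used as the ignore index for labels and as padding when predictions from several batches are gathered, is a common culprit. The following is a minimal sketch of the clean-up step in plain Python lists so that it runs without the model; `toy_vocab` and `decode` are hypothetical stand-ins for the tokenizer.

```python
IGNORE_INDEX = -100

def replace_ignore_index(batch, pad_token_id):
    """Swap IGNORE_INDEX for pad_token_id so every ID can be decoded."""
    return [
        [pad_token_id if token == IGNORE_INDEX else token for token in row]
        for row in batch
    ]

# Hypothetical four-entry vocabulary standing in for the real tokenizer.
toy_vocab = {0: "<pad>", 1: "</s>", 2: "hola", 3: "mundo"}

def decode(row, skip_ids=(0, 1)):
    """Toy decoder: raises KeyError on unknown IDs, much as a real decoder rejects them."""
    return " ".join(toy_vocab[token] for token in row if token not in skip_ids)

batch = [
    [2, 3, 1, IGNORE_INDEX, IGNORE_INDEX],
    [3, 1, IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX],
]
cleaned = replace_ignore_index(batch, pad_token_id=0)
print([decode(row) for row in cleaned])
```

With NumPy arrays, as returned to `compute_metrics`, the same step is usually written as `np.where(ids != -100, ids, tokenizer.pad_token_id)`, and it applies to predictions as well as labels.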


Now we configure the fine-tuning, again 'safely', using the transformers library. Most of the parameters were defined in cell 3, in the `config` object created there from the SafeT5TrainingConfig class.

Finally, we create the trainer with the Seq2SeqTrainer class.

In [None]:
# ============================================================================
# CELL 8: SAFE TRAINING CONFIGURATION
# ============================================================================
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

print("\n" + "=" * 70)
print("🎯 SAFE TRAINING CONFIGURATION")
print("=" * 70)

# Data collator for T5
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

# Check whether a validation set is available
has_validation = "validation" in tokenized_datasets and len(tokenized_datasets["validation"]) > 0

# SAFE training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=config.output_dir,
    overwrite_output_dir=True,

    # Hyperparameters
    num_train_epochs=config.num_train_epochs,
    per_device_train_batch_size=config.batch_size,
    per_device_eval_batch_size=config.batch_size,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    learning_rate=config.learning_rate,
    weight_decay=config.weight_decay,
    warmup_steps=config.warmup_steps,

    # SAFE strategies
    evaluation_strategy="steps" if has_validation else "no",
    eval_steps=config.eval_steps if has_validation else None,
    save_strategy="steps",
    save_steps=config.save_steps,
    save_total_limit=config.save_total_limit,
    load_best_model_at_end=has_validation,
    metric_for_best_model=config.metric_for_best_model if has_validation else None,
    greater_is_better=config.greater_is_better if has_validation else None,

    # Logging and reporting
    logging_strategy="steps",
    logging_steps=config.logging_steps,
    report_to=config.report_to,

    # SAFE optimization
    fp16=torch.cuda.is_available(),  # T5 is known to overflow in fp16 in some setups; watch for NaN losses
    optim="adafactor",  # Adafactor tends to be more stable for T5
    gradient_checkpointing=True,

    # ⚠️ CRITICAL SETTING: generation during evaluation
    # Disabled to avoid the "piece id is out of range" error
    predict_with_generate=config.predict_with_generate_during_training and has_validation,

    # Generation settings (only used if predict_with_generate=True)
    generation_max_length=config.max_target_length if config.predict_with_generate_during_training else None,
    generation_num_beams=config.num_beams if config.predict_with_generate_during_training else None,

    # Other safety settings
    dataloader_num_workers=0,  # 0 to avoid problems on Colab
    remove_unused_columns=True,
    push_to_hub=False,
    ddp_find_unused_parameters=False,
)

print("‚úÖ Training arguments configurados SEGUROS")
print(f"   Output directory: {config.output_dir}")
print(f"   Batch size: {config.batch_size}")
print(f"   Gradient accumulation: {config.gradient_accumulation_steps}")
print(f"   Total effective batch: {config.batch_size * config.gradient_accumulation_steps}")
print(f"   Evaluation: {'Activada' if has_validation else 'Desactivada'}")
print(f"   Predict with generate: {training_args.predict_with_generate}")

# Crear trainer SEGURO
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets.get("validation", None),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=safe_compute_metrics if training_args.predict_with_generate else None,
)

print("✅ Trainer creado exitosamente")


🎯 CONFIGURACIÓN DEL ENTRENAMIENTO SEGURO
✅ Training arguments configurados SEGUROS
   Output directory: /content/drive/MyDrive/Practica_LLM_Engineering_25/t5_electoral_safe_20260203_1558
   Batch size: 4
   Gradient accumulation: 2
   Total effective batch: 8
   Evaluation: Activada
   Predict with generate: False
✅ Trainer creado exitosamente


We then ran the fine-tuning, which took about 5-7 minutes (the trainer itself logged a `train_runtime` of 0:04:30). The loss trends downward over the three epochs on both the training and validation sets.

In [None]:
# ============================================================================
# CELL 9: SAFE TRAINING WITH ERROR HANDLING
# ============================================================================
print("\n" + "=" * 70)
print("🚀 INICIANDO ENTRENAMIENTO SEGURO")
print("=" * 70)

# Free memory before starting
torch.cuda.empty_cache()
gc.collect()

# Show GPU memory usage
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1e9
    total = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"Memoria GPU antes del entrenamiento: {allocated:.2f} GB / {total:.2f} GB")

# Estimate the total number of optimizer steps
total_steps = len(tokenized_datasets["train"]) // (config.batch_size * config.gradient_accumulation_steps) * config.num_train_epochs
print(f"Pasos totales estimados: {total_steps}")
# Rough duration guess that assumes about 2 seconds per step
print(f"Duración estimada: {total_steps * 2 / 60:.1f} minutos (aproximadamente)")

# Callback for periodic housekeeping during training
from transformers import TrainerCallback

class SafeTrainingCallback(TrainerCallback):
    """Frees cached GPU memory periodically and reports each evaluation."""

    def on_step_end(self, args, state, control, **kwargs):
        # Release cached GPU memory every 50 steps
        if state.global_step % 50 == 0:
            torch.cuda.empty_cache()

        return control

    def on_evaluate(self, args, state, control, **kwargs):
        print(f"\n🔍 Evaluación completada en paso {state.global_step}")
        return control

# Attach the callback to the trainer
trainer.add_callback(SafeTrainingCallback())

# TRAIN, WITH ERROR HANDLING
try:
    print("\n" + "=" * 70)
    print("⏳ ENTRENANDO... (esto puede tomar varios minutos)")
    print("=" * 70)

    # Train
    train_result = trainer.train()

    # Save the final model
    trainer.save_model()
    tokenizer.save_pretrained(config.output_dir)

    # Save the training metrics
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    print(f"\n" + "=" * 70)
    print("🎉 ¡ENTRENAMIENTO COMPLETADO EXITOSAMENTE!")
    print("=" * 70)
    print(f"📁 Modelo guardado en: {config.output_dir}")

    # Show a summary
    print(f"\n📊 RESUMEN DEL ENTRENAMIENTO:")
    print(f"  Épocas completadas: {config.num_train_epochs}")
    print(f"  Pasos totales: {train_result.global_step}")
    # training_loss is the mean loss over the whole run, not the last logged value
    print(f"  Pérdida final: {train_result.training_loss:.4f}")

    if has_validation:
        # log_history[-1] is the end-of-training summary, which carries no
        # eval_loss (hence the "N/A" in the recorded output below); use the
        # most recent entry that does report one.
        eval_losses = [log["eval_loss"] for log in trainer.state.log_history if "eval_loss" in log]
        print(f"  Pérdida en validación: {eval_losses[-1] if eval_losses else 'N/A'}")

except RuntimeError as e:
    if "out of memory" in str(e):
        print("\n❌ ERROR: Memoria insuficiente (OOM)")
        print("Intentando recuperación...")

        # Save whatever progress was made
        try:
            trainer.save_model(config.output_dir + "_partial")
            print(f"✅ Modelo parcial guardado en: {config.output_dir}_partial")
        except Exception:
            print("⚠️  No se pudo guardar modelo parcial")

        print("\n💡 SOLUCIONES:")
        print("1. Reduce batch_size a la mitad")
        print("2. Reduce max_input_length y max_target_length")
        print("3. Usa gradient_checkpointing=True")
        print("4. Usa T5-small en lugar de T5-base")

    elif "piece id is out of range" in str(e):
        print("\n❌ ERROR: Token ID fuera de rango durante evaluación")
        print("Esto probablemente ocurrió durante la evaluación con generación.")

        print("\n🛠️  SOLUCIÓN APLICADA AUTOMÁTICAMENTE:")
        print("Reentrenando SIN generación durante evaluación...")

        # Reconfigure to train without generation
        training_args.predict_with_generate = False

        # Build a new trainer
        trainer_no_generate = Seq2SeqTrainer(
            model=model,
            args=training_args,
            train_dataset=tokenized_datasets["train"],
            eval_dataset=tokenized_datasets.get("validation", None),
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=None,  # No metrics that require generation
        )

        # Train again
        print("Reiniciando entrenamiento sin generación durante evaluación...")
        train_result = trainer_no_generate.train()

        # Save model and tokenizer to the same directory that the message reports
        no_generate_dir = config.output_dir + "_no_generate"
        trainer_no_generate.save_model(no_generate_dir)
        tokenizer.save_pretrained(no_generate_dir)

        print(f"\n✅ Entrenamiento completado sin generación")
        print(f"📁 Modelo guardado en: {no_generate_dir}")

    else:
        print(f"\n❌ ERROR durante el entrenamiento: {e}")
        import traceback
        traceback.print_exc()

except Exception as e:
    print(f"\n❌ ERROR inesperado: {e}")
    import traceback
    traceback.print_exc()


🚀 INICIANDO ENTRENAMIENTO SEGURO
Memoria GPU antes del entrenamiento: 0.89 GB / 23.80 GB
Pasos totales estimados: 540
Duración estimada: 18.0 minutos (aproximadamente)

⏳ ENTRENANDO... (esto puede tomar varios minutos)


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.038 | 0.807598 |
| 200 | 0.6905 | 0.674657 |
| 300 | 0.7071 | 0.616378 |
| 400 | 0.5656 | 0.593632 |
| 500 | 0.5229 | 0.583102 |



🔍 Evaluación completada en paso 100

🔍 Evaluación completada en paso 200

🔍 Evaluación completada en paso 300

🔍 Evaluación completada en paso 400

🔍 Evaluación completada en paso 500


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


***** train metrics *****
  epoch                    =        3.0
  total_flos               =  1225014GF
  train_loss               =     0.8151
  train_runtime            = 0:04:30.71
  train_samples_per_second =     15.958
  train_steps_per_second   =      1.995

🎉 ¡ENTRENAMIENTO COMPLETADO EXITOSAMENTE!
📁 Modelo guardado en: /content/drive/MyDrive/Practica_LLM_Engineering_25/t5_electoral_safe_20260203_1558

📊 RESUMEN DEL ENTRENAMIENTO:
  Épocas completadas: 3
  Pasos totales: 540
  Pérdida final: 0.8151
  Pérdida en validación: N/A

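As a sanity check on the step count and the duration estimate printed above, the arithmetic can be sketched as follows. The function name `estimate_optimizer_steps` is hypothetical, and the training-set size of 1440 examples is an inference from the log (15.958 samples/s × 270.71 s ÷ 3 epochs), not a figure stated in the notebook. The rounding is meant to mirror how the `Trainer` seems to count update steps, which is also an assumption.

```python
import math


def estimate_optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps, num_epochs):
    """Estimate optimizer updates for a single-device run (hypothetical helper)."""
    # Number of mini-batches the dataloader yields per epoch
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # One optimizer update every `grad_accum_steps` mini-batches
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


# 1440 is inferred from the logged throughput, not stated in the notebook
steps = estimate_optimizer_steps(
    num_examples=1440,
    per_device_batch_size=4,
    grad_accum_steps=2,
    num_epochs=3,
)
print(f"Estimated optimizer steps: {steps}")

# Wall-clock estimate using the logged rate instead of a fixed 2 s/step guess
logged_steps_per_second = 1.995
print(f"Estimated duration: {steps / logged_steps_per_second / 60:.1f} minutes")
```

At the logged rate of 1.995 steps per second, 540 steps take about 4.5 minutes, so the 2-seconds-per-step guess in the cell overestimated the duration roughly fourfold.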

We then evaluated the fine-tuned T5 model on the test set. The test loss, 0.57, is close to the last logged training and validation losses (0.52 and 0.58 at step 500), so there is no clear sign of overfitting.

The ROUGE scores, however, all come out as zero, so I asked DeepSeek about it.
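One thing worth ruling out when ROUGE is exactly zero, or when decoding raises "piece id is out of range", is that the ids handed to the tokenizer contain values SentencePiece cannot decode, such as the `-100` used to mask padded label positions. The sketch below is only an assumption about a possible cause, not a diagnosis of this run: `sanitize_token_ids` is a hypothetical helper and is not the notebook's `safe_compute_metrics`.

```python
def sanitize_token_ids(sequences, pad_token_id, vocab_size):
    """Replace ids outside [0, vocab_size) with pad_token_id (hypothetical helper).

    This covers the -100 label mask as well as ids beyond the vocabulary,
    both of which a SentencePiece-based tokenizer refuses to decode.
    """
    return [
        [tok if 0 <= tok < vocab_size else pad_token_id for tok in seq]
        for seq in sequences
    ]


# Toy example: -100 is a masked label position, 32100 is out of range
# for a vocabulary of 32100 entries (valid ids are 0..32099).
labels = [[5, -100, 32100, 7]]
print(sanitize_token_ids(labels, pad_token_id=0, vocab_size=32100))
```

Inside a metrics function, the cleaned lists could then be passed to `tokenizer.batch_decode(..., skip_special_tokens=True)` before scoring, so that padding is dropped from both predictions and references.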

In [None]:
# ============================================================================
# CELL 10: SAFE FINAL EVALUATION
# ============================================================================
print("\n" + "=" * 70)
print("📊 EVALUACIÓN FINAL SEGURA")
print("=" * 70)

if "test" in tokenized_datasets and len(tokenized_datasets["test"]) > 0:
    print("Evaluando en conjunto de test...")

    try:
        # For the final evaluation we can afford generation,
        # but with a conservative configuration
        eval_args = Seq2SeqTrainingArguments(
            output_dir=config.output_dir + "_final_eval",
            per_device_eval_batch_size=config.batch_size,
            predict_with_generate=True,  # Generate only for the final evaluation
            generation_max_length=config.max_target_length,
            generation_num_beams=1,  # Greedy decoding (num_beams=1) for stability
            dataloader_num_workers=0,
            report_to="none",  # Keep evaluate() from stopping at an interactive W&B prompt
        )

        eval_trainer = Seq2SeqTrainer(
            model=model,
            args=eval_args,
            eval_dataset=tokenized_datasets["test"],
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=safe_compute_metrics,
        )

        print("Realizando evaluación final con generación...")
        eval_results = eval_trainer.evaluate()

        print("\n📈 RESULTADOS EN TEST:")
        print("-" * 40)
        for key, value in eval_results.items():
            if isinstance(value, float):
                print(f"{key:20}: {value:.4f}")

        # Save the results
        with open(f"{config.output_dir}/test_results.json", "w") as f:
            json.dump(eval_results, f, indent=2)

        print(f"✅ Resultados guardados en: {config.output_dir}/test_results.json")

    except Exception as e:
        print(f"⚠️  Error en evaluación final: {e}")
        print("Realizando evaluación simple (solo loss)...")

        # Fallback: loss-only evaluation without generation
        try:
            simple_eval_args = Seq2SeqTrainingArguments(
                output_dir=config.output_dir + "_simple_eval",
                per_device_eval_batch_size=config.batch_size,
                predict_with_generate=False,
                dataloader_num_workers=0,
                report_to="none",  # No interactive logger prompts here either
            )

            simple_eval_trainer = Seq2SeqTrainer(
                model=model,
                args=simple_eval_args,
                eval_dataset=tokenized_datasets["test"],
                tokenizer=tokenizer,
                data_collator=data_collator,
            )

            simple_results = simple_eval_trainer.evaluate()
            print(f"📊 Pérdida en test: {simple_results['eval_loss']:.4f}")

        except Exception as e2:
            print(f"⚠️  Error incluso en evaluación simple: {e2}")
else:
    print("⚠️  No hay conjunto de test para evaluación")



📊 EVALUACIÓN FINAL SEGURA
Evaluando en conjunto de test...
Realizando evaluación final con generación...


wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

 3


wandb: You chose "Don't visualize my results"
wandb: Using W&B in offline mode.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin



📈 RESULTADOS EN TEST:
----------------------------------------
eval_loss           : 0.5740
eval_rouge1         : 0.0000
eval_rouge2         : 0.0000
eval_rougeL         : 0.0000
eval_rougeLsum      : 0.0000
eval_runtime        : 142.2180
eval_samples_per_second: 1.2660
eval_steps_per_second: 0.3160
✅ Resultados guardados en: /content/drive/MyDrive/Practica_LLM_Engineering_25/t5_electoral_safe_20260203_1558/test_results.json

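The "no clear overfitting" reading of the test loss can be made explicit with a small comparison. This is only a sketch: `relative_gap` is a hypothetical helper, and the three values are copied from the logs above (training and validation loss at step 500, test loss from the final evaluation).

```python
def relative_gap(reference_loss, other_loss):
    """Signed gap of other_loss relative to reference_loss (hypothetical helper)."""
    return (other_loss - reference_loss) / reference_loss


final_train_loss = 0.5229   # training loss logged at step 500
final_val_loss = 0.583102   # validation loss logged at step 500
test_loss = 0.5740          # eval_loss on the test set

val_vs_train = relative_gap(final_train_loss, final_val_loss)
test_vs_val = relative_gap(final_val_loss, test_loss)

print(f"Validation vs. train: {val_vs_train:+.1%}")
print(f"Test vs. validation:  {test_vs_val:+.1%}")
```

A validation loss moderately above the training loss, together with a test loss on par with the validation loss, is consistent with the conclusion in the text. The logged training loss is computed with dropout active, so the comparison is only approximate.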

DeepSeek added this cell to test the model on question/answer pairs it had not seen before. Honestly, I did not think it would be informative while we have not even set up the RAG, so I did not run it.

In [None]:
# ============================================================================
# CELL 11: TESTING THE TRAINED MODEL
# ============================================================================
print("\n" + "=" * 70)
print("🧪 PRUEBA DEL MODELO ENTRENADO")
print("=" * 70)

def test_trained_model(model_path, test_questions, max_examples=3):
    """Try the model on a few example questions"""

    from transformers import T5Tokenizer, T5ForConditionalGeneration
    import torch

    print(f"Cargando modelo desde {model_path}...")

    try:
        tokenizer_test = T5Tokenizer.from_pretrained(model_path)
        model_test = T5ForConditionalGeneration.from_pretrained(model_path)

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model_test = model_test.to(device)
        model_test.eval()

        print(f"✅ Modelo cargado en {device}")

    except Exception as e:
        print(f"❌ Error cargando modelo: {e}")
        return

    print("\n📝 PROBANDO CON PREGUNTAS DE EJEMPLO:")

    for i, question in enumerate(test_questions[:max_examples]):
        print(f"\n--- Ejemplo {i+1} ---")
        print(f"Pregunta: {question}")

        # Ensure the expected prompt format
        if not question.strip().endswith("respuesta:"):
            question = question.strip() + " respuesta:"

        try:
            # Tokenize
            inputs = tokenizer_test(
                question,
                return_tensors="pt",
                truncation=True,
                max_length=config.max_input_length,
                padding=True
            ).to(device)

            # Generate
            with torch.no_grad():
                outputs = model_test.generate(
                    **inputs,
                    max_length=config.max_target_length,
                    num_beams=config.num_beams,
                    temperature=config.temperature,
                    do_sample=True,
                    repetition_penalty=config.repetition_penalty,
                    early_stopping=True
                )

            # Decode. T5 is an encoder-decoder model, so generate() returns only
            # the answer tokens; there is no echoed prompt to slice off.
            response_only = tokenizer_test.decode(outputs[0], skip_special_tokens=True).strip()

            print(f"🤖 Respuesta: {response_only}")

        except Exception as e:
            print(f"❌ Error generando respuesta: {e}")

# Test questions
test_questions = [
    "pregunta: ¿Qué porcentaje de votos obtuvo el PSOE en el municipio de Madrid en 2019?",
    "pregunta: ¿Cómo evolucionó la participación electoral entre 2016 y 2019?",
    "pregunta: ¿Cuál fue el partido más votado en las últimas elecciones?",
]

# Try the model
test_trained_model(config.output_dir, test_questions)


After I asked DeepSeek about the zero ROUGE score, it recommended this additional cell, which inspects individual test examples and does obtain ROUGE scores above zero. The problem is that in those cases the answer repeats data we had already passed in the question, so, frankly, I do not find this metric useful.
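To illustrate why word overlap can look good while the answer is wrong, here is a minimal sketch that compares a simplified unigram F1 with a check restricted to numbers that were not already in the question. Both helpers are hypothetical, the tokenization is a plain regular expression rather than the `rouge_score` tokenizer and stemmer, and the strings are the first sentences of the first test example shown in the diagnostic output (only the instruction line is used as the question, since the context is truncated there).

```python
import re
from collections import Counter

NUMBER = r"\d+(?:\.\d+)?"  # integers or decimals, without a trailing dot


def unigram_f1(reference, candidate):
    """Simplified ROUGE-1-style F1 on lowercase word tokens (not rouge_score)."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def answer_number_recall(question, reference, candidate):
    """Share of reference numbers NOT present in the question that the candidate reproduces."""
    expected = set(re.findall(NUMBER, reference)) - set(re.findall(NUMBER, question))
    if not expected:
        return 1.0  # nothing new to get right
    produced = set(re.findall(NUMBER, candidate))
    return len(expected & produced) / len(expected)


question = "¿Cómo evolucionó el porcentaje de voto a VOX en Medina del Campo entre Noviembre 2019 y Abril 2019?"
target = "El VOX en Medina del Campo pasó de 19.86% en Noviembre 2019 a 13.75% en Abril 2019."
generated = "El VOX en Medina del Campo pasó de 37.53% en Noviembre 2019 a 33.79% en Abril 2019."

print(f"Unigram F1:           {unigram_f1(target, generated):.3f}")
print(f"Answer-number recall: {answer_number_recall(question, target, generated):.3f}")
```

The overlap score is high because the sentence template and the year are shared, while none of the percentages that actually answer the question are reproduced. A number-focused check of this kind may therefore be a more informative complement to ROUGE for this task.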

In [None]:
# ADD THIS CELL TO DIAGNOSE ROUGE
print("\n" + "=" * 70)
print("🔍 DIAGNÓSTICO DE MÉTRICAS ROUGE")
print("=" * 70)

def debug_rouge_metrics(model_path, test_dataset, num_examples=5):
    """Detailed debugging of the ROUGE metrics"""

    from transformers import T5Tokenizer, T5ForConditionalGeneration
    import torch

    # Load the model
    tokenizer = T5Tokenizer.from_pretrained(model_path)
    model = T5ForConditionalGeneration.from_pretrained(model_path)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()

    print(f"Analizando {num_examples} ejemplos del test set...\n")

    for i in range(min(num_examples, len(test_dataset))):
        example = test_dataset[i]

        print(f"\n--- Ejemplo {i+1} ---")
        print(f"📝 INPUT ORIGINAL:")
        print(f"   {example['input_text'][:150]}...")
        print(f"\n🎯 TARGET ESPERADO:")
        print(f"   {example['target_text']}")

        # Generate the model's answer
        inputs = tokenizer(
            example['input_text'],
            return_tensors="pt",
            truncation=True,
            max_length=256
        ).to(device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=128,
                num_beams=1,
                temperature=0.7,
                do_sample=True
            )

        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Strip the prompt if it was echoed. T5 normally returns only the
        # answer, so this check is usually a no-op.
        input_decoded = tokenizer.decode(inputs.input_ids[0], skip_special_tokens=True)
        if generated.startswith(input_decoded):
            generated_response = generated[len(input_decoded):].strip()
        else:
            generated_response = generated

        print(f"\n🤖 RESPUESTA GENERADA:")
        print(f"   {generated_response}")

        # Compute ROUGE manually to see what is going on
        from rouge_score import rouge_scorer

        scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        scores = scorer.score(example['target_text'], generated_response)

        print(f"\n📊 ROUGE manual:")
        print(f"   Rouge-1: {scores['rouge1'].fmeasure:.4f}")
        print(f"   Rouge-2: {scores['rouge2'].fmeasure:.4f}")
        print(f"   Rouge-L: {scores['rougeL'].fmeasure:.4f}")

        # Compare the numbers that appear in both answers.
        # The pattern requires digits after a dot, so a sentence-final
        # "2019." is captured as "2019" rather than "2019."
        import re
        target_numbers = set(re.findall(r'\d+(?:\.\d+)?', example['target_text']))
        generated_numbers = set(re.findall(r'\d+(?:\.\d+)?', generated_response))

        print(f"\n🔢 Números en target: {target_numbers}")
        print(f"🔢 Números en generado: {generated_numbers}")
        print(f"📐 Coincidencia numérica: {len(target_numbers.intersection(generated_numbers))}/{len(target_numbers)}")

        print("-" * 80)

# Run the diagnosis
debug_rouge_metrics(config.output_dir, dataset["test"], num_examples=6)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



🔍 DIAGNÓSTICO DE MÉTRICAS ROUGE
Analizando 6 ejemplos del test set...


--- Ejemplo 1 ---
📝 INPUT ORIGINAL:
   ### Instrucción:
¿Cómo evolucionó el porcentaje de voto a VOX en Medina del Campo entre Noviembre 2019 y Abril 2019?

### Contexto:
Municipio: Medina ...

🎯 TARGET ESPERADO:
   El VOX en Medina del Campo pasó de 19.86% en Noviembre 2019 a 13.75% en Abril 2019. Esto representa una disminución de 6.11 puntos porcentuales (30.8% menos).

🤖 RESPUESTA GENERADA:
   <pad> El VOX en Medina del Campo pasó de 37.53% en Noviembre 2019 a 33.79% en Abril 2019. Esto representa un incremento de 6.28% (de incremento porcentual). Esto representa un incremento de 1.64% (de incremento porcentual). Esto representa un incremento de 4.09 puntos porcentuales (0.25% disminución). Esto representa un incremento de 1.39 puntos porcentuales (0.66% disminución porcent

📊 ROUGE manual:
   Rouge-1: 0.4646
   Rouge-2: 0.3505
   Rouge-L: 0.4444

üî¢ N√∫meros en target: {'19.86', '2