# üéØ Entrenamiento LSE con Visualizaciones

Notebook para entrenar modelo T5 de traducci√≥n **Espa√±ol ‚Üí LSE** con gr√°ficas en tiempo real.

## üìã Contenido:
1. Instalaci√≥n de dependencias
2. Imports y configuraci√≥n
3. Clase de visualizaci√≥n en tiempo real
4. Funciones de carga de datos
5. Entrenamiento
6. Evaluaci√≥n y ejemplos

In [27]:
# üì¶ Instalaci√≥n de dependencias (ejecuta solo la primera vez)
!pip install matplotlib seaborn plotly ipywidgets pandas transformers datasets pyyaml -q

In [28]:
# Imports b√°sicos
import os
import json
import yaml
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, clear_output
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer, 
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq, 
    TrainingArguments, 
    Trainer,
    TrainerCallback
)

# Configuraci√≥n de estilo
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("‚úÖ Imports completados")

‚úÖ Imports completados


In [29]:
class LivePlotCallback(TrainerCallback):
    """Callback para visualizar m√©tricas en tiempo real"""
    
    def __init__(self):
        self.train_losses = []
        self.eval_losses = []
        self.learning_rates = []
        self.grad_norms = []
        self.epochs = []
        self.steps = []
        
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            if 'loss' in logs:
                self.train_losses.append(logs['loss'])
                self.steps.append(state.global_step)
                
            if 'learning_rate' in logs:
                self.learning_rates.append(logs['learning_rate'])
                
            if 'grad_norm' in logs:
                self.grad_norms.append(logs['grad_norm'])
                
            if 'epoch' in logs:
                self.epochs.append(logs['epoch'])
            
            if state.global_step % 50 == 0 and len(self.train_losses) > 1:
                self.plot_metrics()
    
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        if metrics and 'eval_loss' in metrics:
            self.eval_losses.append({
                'step': state.global_step,
                'epoch': state.epoch,
                'loss': metrics['eval_loss']
            })
            self.plot_metrics()
    
    def plot_metrics(self):
        clear_output(wait=True)
        
        fig, axes = plt.subplots(2, 2, figsize=(16, 10))
        
        # Loss
        if len(self.train_losses) > 0:
            axes[0, 0].plot(self.steps, self.train_losses, 'b-', linewidth=2, label='Train Loss')
            
            if len(self.eval_losses) > 0:
                eval_steps = [e['step'] for e in self.eval_losses]
                eval_vals = [e['loss'] for e in self.eval_losses]
                axes[0, 0].plot(eval_steps, eval_vals, 'ro-', linewidth=2, 
                              markersize=8, label='Validation Loss')
            
            axes[0, 0].set_xlabel('Steps', fontsize=12)
            axes[0, 0].set_ylabel('Loss', fontsize=12)
            axes[0, 0].set_title('üìâ Loss durante el entrenamiento', fontsize=14, fontweight='bold')
            axes[0, 0].legend(loc='upper right')
            axes[0, 0].grid(True, alpha=0.3)
            
            current_loss = self.train_losses[-1]
            axes[0, 0].text(0.02, 0.98, f'Loss actual: {current_loss:.4f}',
                          transform=axes[0, 0].transAxes, 
                          verticalalignment='top',
                          bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
        
        # Learning Rate
        if len(self.learning_rates) > 0:
            axes[0, 1].plot(self.steps, self.learning_rates, 'g-', linewidth=2)
            axes[0, 1].set_xlabel('Steps', fontsize=12)
            axes[0, 1].set_ylabel('Learning Rate', fontsize=12)
            axes[0, 1].set_title('üìä Learning Rate Schedule', fontsize=14, fontweight='bold')
            axes[0, 1].grid(True, alpha=0.3)
            axes[0, 1].ticklabel_format(style='scientific', axis='y', scilimits=(0,0))
        
        # Gradient Norm
        if len(self.grad_norms) > 0:
            axes[1, 0].plot(self.steps, self.grad_norms, 'purple', linewidth=2)
            axes[1, 0].set_xlabel('Steps', fontsize=12)
            axes[1, 0].set_ylabel('Gradient Norm', fontsize=12)
            axes[1, 0].set_title('üéØ Gradient Norm', fontsize=14, fontweight='bold')
            axes[1, 0].grid(True, alpha=0.3)
            
            if max(self.grad_norms) > 10:
                axes[1, 0].axhline(y=10, color='r', linestyle='--', 
                                  label='Umbral de alerta', linewidth=2)
                axes[1, 0].legend()
        
        # Resumen
        axes[1, 1].axis('off')
        
        if len(self.train_losses) > 0:
            stats_text = f"""
            üìä RESUMEN DEL ENTRENAMIENTO
            ‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê‚ïê
            
            ‚úì Steps: {self.steps[-1] if self.steps else 0}
            ‚úì √âpoca: {self.epochs[-1]:.2f if self.epochs else 0}
            
            üìâ Loss:
               ‚Ä¢ Inicial: {self.train_losses[0]:.4f}
               ‚Ä¢ Actual: {self.train_losses[-1]:.4f}
               ‚Ä¢ Reducci√≥n: {((self.train_losses[0] - self.train_losses[-1]) / self.train_losses[0] * 100):.1f}%
            
            üéØ Gradiente:
               ‚Ä¢ Promedio: {np.mean(self.grad_norms[-10:]) if len(self.grad_norms) >= 10 else 0:.4f}
               ‚Ä¢ M√°ximo: {max(self.grad_norms) if self.grad_norms else 0:.4f}
            """
            
            if len(self.eval_losses) > 0:
                last_eval = self.eval_losses[-1]['loss']
                stats_text += f"\n            üîç Val Loss: {last_eval:.4f}"
            
            axes[1, 1].text(0.1, 0.9, stats_text, 
                          transform=axes[1, 1].transAxes,
                          fontsize=11,
                          verticalalignment='top',
                          fontfamily='monospace',
                          bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.3))
        
        plt.tight_layout()
        plt.show()
    
    def save_final_report(self, output_dir):
        os.makedirs(output_dir, exist_ok=True)
        
        df = pd.DataFrame({
            'step': self.steps,
            'loss': self.train_losses,
            'learning_rate': self.learning_rates[:len(self.steps)],
            'grad_norm': self.grad_norms[:len(self.steps)],
            'epoch': self.epochs[:len(self.steps)]
        })
        df.to_csv(f"{output_dir}/training_metrics.csv", index=False)
        
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=('Loss', 'Learning Rate', 'Gradient Norm', 'Train vs Val')
        )
        
        fig.add_trace(
            go.Scatter(x=self.steps, y=self.train_losses, 
                      name='Train Loss', line=dict(color='blue', width=2)),
            row=1, col=1
        )
        
        if len(self.eval_losses) > 0:
            eval_steps = [e['step'] for e in self.eval_losses]
            eval_vals = [e['loss'] for e in self.eval_losses]
            
            fig.add_trace(
                go.Scatter(x=eval_steps, y=eval_vals,
                          name='Val Loss', mode='lines+markers',
                          line=dict(color='red', width=2)),
                row=2, col=2
            )
        
        fig.update_layout(height=800, title_text="Reporte de Entrenamiento LSE")
        fig.write_html(f"{output_dir}/training_report.html")
        
        print(f"‚úÖ Reporte: {output_dir}/training_report.html")
        print(f"‚úÖ M√©tricas: {output_dir}/training_metrics.csv")

print("‚úÖ Clase LivePlotCallback creada")

‚úÖ Clase LivePlotCallback creada


In [30]:
def load_jsonl(path):
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rows.append(json.loads(line))
    return rows

def build_hf_dataset(train_path, dev_path, test_path):
    train = load_jsonl(train_path)
    dev = load_jsonl(dev_path)
    test = load_jsonl(test_path)
    
    ds_train = Dataset.from_dict({"src": [x["src"] for x in train], "tgt": [x["tgt"] for x in train]})
    ds_dev = Dataset.from_dict({"src": [x["src"] for x in dev], "tgt": [x["tgt"] for x in dev]})
    ds_test = Dataset.from_dict({"src": [x["src"] for x in test], "tgt": [x["tgt"] for x in test]})
    
    return DatasetDict(train=ds_train, validation=ds_dev, test=ds_test)

print("‚úÖ Funciones de carga creadas")

‚úÖ Funciones de carga creadas


In [31]:
def train_lse_model(config_path="../configs/training.yaml"):
    print("üöÄ Iniciando entrenamiento LSE")
    print("="*60)
    
    # Cargar config
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f)
    
    model_name = cfg.get("model_name", "t5-small")
    out_dir = cfg.get("output_dir", "../runs/exp_cpu_t5s_fixed")
    
    train_path = cfg.get("train_path", "../data/synthetic/train.jsonl")
    dev_path = cfg.get("dev_path", "../data/synthetic/dev.jsonl")
    test_path = cfg.get("test_path", "../data/synthetic/test.jsonl")
    
    max_src_len = int(cfg.get("max_source_length", 96))
    max_tgt_len = int(cfg.get("max_target_length", 64))
    
    per_device_train_bs = int(cfg.get("per_device_train_batch_size", 8))
    per_device_eval_bs = int(cfg.get("per_device_eval_batch_size", 8))
    grad_accum_steps = int(cfg.get("grad_accum_steps", 1))
    
    num_epochs = float(cfg.get("num_train_epochs", 6))
    lr = float(cfg.get("learning_rate", 5e-4))
    weight_decay = float(cfg.get("weight_decay", 0.01))
    warmup_ratio = float(cfg.get("warmup_ratio", 0.05))
    logging_steps = int(cfg.get("logging_steps", 100))
    
    print(f"ü§ñ Modelo: {model_name}")
    print(f"üìä √âpocas: {num_epochs}")
    print(f"üìà LR: {lr}")
    print("="*60)
    
    # Cargar tokenizer
    print("\nüì¶ Cargando tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    spec_path = "../configs/special_tokens.json"
    if os.path.exists(spec_path):
        with open(spec_path, "r", encoding="utf-8") as f:
            spec = json.load(f)
        tokenizer.add_special_tokens(spec)
        print(f"‚úì Tokens especiales: {spec}")
    
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.resize_token_embeddings(len(tokenizer))
    
    # Cargar datos
    print("\nüìö Cargando datasets...")
    ds = build_hf_dataset(train_path, dev_path, test_path)
    print(f"‚úì Train: {len(ds['train'])}")
    print(f"‚úì Val: {len(ds['validation'])}")
    print(f"‚úì Test: {len(ds['test'])}")
    
    # Preprocesar
    def preprocess(batch):
        model_inputs = tokenizer(batch["src"], max_length=max_src_len, truncation=True)
        labels = tokenizer(text_target=batch["tgt"], max_length=max_tgt_len, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    
    print("\nüîÑ Tokenizando...")
    ds_tok = ds.map(preprocess, batched=True, remove_columns=["src", "tgt"])
    
    # Training args
    training_args = TrainingArguments(
        output_dir=out_dir,
        per_device_train_batch_size=per_device_train_bs,
        per_device_eval_batch_size=per_device_eval_bs,
        learning_rate=lr,
        num_train_epochs=num_epochs,
        warmup_ratio=warmup_ratio,
        weight_decay=weight_decay,
        fp16=False,
        bf16=False,
        gradient_checkpointing=False,
        remove_unused_columns=False,
        logging_strategy="steps",
        logging_steps=logging_steps,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=2,
        report_to="none",
        gradient_accumulation_steps=grad_accum_steps,
    )
    
    data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        model=model,
        label_pad_token_id=-100
    )
    
    # Callback
    plot_callback = LivePlotCallback()
    
    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=ds_tok["train"],
        eval_dataset=ds_tok["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        callbacks=[plot_callback]
    )
    
    # ¬°Entrenar!
    print("\nüéØ Entrenando...")
    print("="*60)
    trainer.train()
    
    # Guardar
    print("\nüíæ Guardando...")
    trainer.save_model(out_dir)
    tokenizer.save_pretrained(out_dir)
    print(f"‚úÖ Guardado en: {out_dir}")
    
    plot_callback.save_final_report(out_dir)
    
    return trainer, plot_callback

print("‚úÖ Funci√≥n de entrenamiento creada")

‚úÖ Funci√≥n de entrenamiento creada


In [32]:
def evaluate_and_show_examples(trainer, num_examples=10):
    import torch
    
    print("\n" + "="*60)
    print("üîç EVALUACI√ìN Y EJEMPLOS")
    print("="*60)
    
    results = trainer.evaluate()
    
    print("\nüìä M√©tricas:")
    for key, value in results.items():
        print(f"   ‚Ä¢ {key}: {value:.4f}")
    
    print(f"\nüìù Ejemplos de traducci√≥n:")
    print("-"*60)
    
    tokenizer = trainer.tokenizer
    test_samples = trainer.eval_dataset.select(range(min(num_examples, len(trainer.eval_dataset))))
    
    for i, sample in enumerate(test_samples):
        input_ids = sample['input_ids']
        input_text = tokenizer.decode(input_ids, skip_special_tokens=True)
        
        outputs = trainer.model.generate(
            input_ids=torch.tensor([input_ids]),
            max_length=64,
            num_beams=4,
            early_stopping=True
        )
        
        prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
        reference = tokenizer.decode(sample['labels'], skip_special_tokens=True)
        
        match = prediction == reference
        
        print(f"\nüîπ Ejemplo {i+1}:")
        print(f"   üì• Entrada:     {input_text}")
        print(f"   ‚úÖ Referencia:  {reference}")
        print(f"   ü§ñ Predicci√≥n:  {prediction}")
        print(f"   {'‚úì Correcto' if match else '‚úó Diferente'}")

print("‚úÖ Funci√≥n de evaluaci√≥n creada")

‚úÖ Funci√≥n de evaluaci√≥n creada


In [33]:
# üöÄ EJECUTAR ENTRENAMIENTO
trainer, plot_callback = train_lse_model("../configs/training.yaml")

üöÄ Iniciando entrenamiento LSE
ü§ñ Modelo: t5-small
üìä √âpocas: 6.0
üìà LR: 0.0005

üì¶ Cargando tokenizer...
‚úì Tokens especiales: {'additional_special_tokens': ['#']}

üìö Cargando datasets...
‚úì Train: 3000
‚úì Val: 400
‚úì Test: 400

üîÑ Tokenizando...


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

In [None]:
# üîç EVALUAR Y VER EJEMPLOS
evaluate_and_show_examples(trainer, num_examples=10)

## üìä An√°lisis Final

### Archivos generados:
- `runs/exp_cpu_t5s_fixed/training_report.html` - Gr√°fica interactiva
- `runs/exp_cpu_t5s_fixed/training_metrics.csv` - Datos CSV
- `runs/exp_cpu_t5s_fixed/` - Modelo entrenado

### Pr√≥ximos pasos:
1. Revisa el archivo HTML para an√°lisis detallado
2. Prueba el modelo con tus propias frases
3. Ajusta hiperpar√°metros si es necesario