# 🚀 Atheria 4: Progressive Training (Long-Running)

Notebook optimized for **long training runs** (many hours) that make the most of the available GPU quota.

## ✨ Features

- 🔄 **Auto-save to Google Drive** - Configurable sync interval
- 📊 **Resource monitoring** - GPU %, RAM, session time
- ⚡ **Auto-recovery** - Resumes automatically from checkpoints
- ⏰ **Automatic time limit** - Emergency save before the session times out
- 📈 **Real-time visualization** - Progress, metrics, resources
- 💾 **Smart checkpointing** - Keeps only the best models plus the latest one

## 📋 GPU Quotas

- **Colab Free**: ~12 hours/day (variable)
- **Colab Pro**: ~24 continuous hours
- **Kaggle**: 30 hours/week (T4/P100)

## 🎯 Recommended Workflow

1. Configure the experiment (Section 5)
2. Run all cells
3. Leave it running unattended
4. Checkpoints are saved to Drive automatically
5. If the session disconnects: run the notebook again and it resumes from Drive

---
## 📦 Section 1: Setup and Environment Detection

In [13]:
# Detect the environment (Kaggle or Colab)
import os
import sys
from pathlib import Path

# IMPORTANT: check Kaggle FIRST (it has google.colab installed, but it does not work there)
IN_KAGGLE = os.path.exists("/kaggle/input") or os.path.exists("/kaggle/working")

# Only check for Colab when NOT on Kaggle
if not IN_KAGGLE:
    try:
        import google.colab
        # Confirm we are really on Colab (/content exists)
        IN_COLAB = os.path.exists("/content")
    except ImportError:
        IN_COLAB = False
else:
    IN_COLAB = False

ENV_NAME = "Kaggle" if IN_KAGGLE else "Colab" if IN_COLAB else "Local"
print(f"🌍 Environment detected: {ENV_NAME}")

# Install the basic dependencies
print("📦 Installing dependencies...")
%pip install -q snntorch scikit-learn matplotlib

# On Colab: install pybind11 (optional, only needed for the native engine)
if IN_COLAB:
    %pip install -q pybind11

print("✅ Dependencies installed")

🌍 Environment detected: Local
📦 Installing dependencies...
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jonathan.correa/Projects/Atheria/ath_venv/lib/python3.10/site-packages/pip/__main__.py", line 29, in <module>
    from pip._internal.cli.main import main as _main
  File "/home/jonathan.correa/Projects/Atheria/ath_venv/lib/python3.10/site-packages/pip/_internal/cli/main.py", line 9, in <module>
    from pip._internal.cli.autocompletion import autocomplete
  File "/home/jonathan.correa/Projects/Atheria/ath_venv/lib/python3.10/site-packages/pip/_internal/cli/autocompletion.py", line 12, in <module>
    from pip._internal.metadata import get_default_environment
  File "/home/jonathan.correa/Projects/Atheria/ath_venv/lib/python3.10/site-packages/pip/_internal/metadata/__init__
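
The detection order used in this section (Kaggle first, then Colab, otherwise local) can be isolated into a pure function, which makes it easy to check without being on either platform. This is only a sketch: `detect_environment`, `path_exists` and `colab_importable` are hypothetical names introduced for illustration and are not part of Atheria.

```python
import os
from typing import Callable


def detect_environment(
    path_exists: Callable[[str], bool] = os.path.exists,
    colab_importable: bool = False,
) -> str:
    """Return "Kaggle", "Colab" or "Local", checking Kaggle first."""
    # Kaggle is checked first because google.colab may be importable there too.
    if path_exists("/kaggle/input") or path_exists("/kaggle/working"):
        return "Kaggle"
    # Colab needs both the package and the /content directory.
    if colab_importable and path_exists("/content"):
        return "Colab"
    return "Local"


# Simulated file systems: each lambda stands in for os.path.exists.
print(detect_environment(lambda p: p.startswith("/kaggle"), colab_importable=True))
print(detect_environment(lambda p: p == "/content", colab_importable=True))
print(detect_environment(lambda p: False))
```

Passing the path check as a parameter is a design choice for testability; the notebook cell itself calls `os.path.exists` directly.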

---
## 💾 Section 2: Google Drive - Mounting and Configuration

In [None]:
# Mount Google Drive (Colab only)
if IN_COLAB:
    from google.colab import drive
    print("📁 Mounting Google Drive...")
    drive.mount('/content/drive')
    DRIVE_ROOT = Path("/content/drive/MyDrive/Atheria")
    print(f"✅ Drive mounted at: {DRIVE_ROOT}")
elif IN_KAGGLE:
    # On Kaggle, use /kaggle/working as the alternative
    DRIVE_ROOT = Path("/kaggle/working/atheria_checkpoints")
    print(f"📁 Using local directory: {DRIVE_ROOT}")
else:
    DRIVE_ROOT = Path.home() / "atheria_checkpoints"
    print(f"💻 Using local directory: {DRIVE_ROOT}")

# Create the folder structure
DRIVE_CHECKPOINT_DIR = DRIVE_ROOT / "checkpoints"
DRIVE_LOGS_DIR = DRIVE_ROOT / "logs"
DRIVE_EXPORTS_DIR = DRIVE_ROOT / "exports"

for directory in [DRIVE_CHECKPOINT_DIR, DRIVE_LOGS_DIR, DRIVE_EXPORTS_DIR]:
    directory.mkdir(parents=True, exist_ok=True)

print("\n📂 Folder structure created:")
print(f"  - Checkpoints: {DRIVE_CHECKPOINT_DIR}")
print(f"  - Logs: {DRIVE_LOGS_DIR}")
print(f"  - Exports: {DRIVE_EXPORTS_DIR}")

💻 Using local directory: /home/jonathan.correa/atheria_checkpoints

📂 Folder structure created:
  - Checkpoints: /home/jonathan.correa/atheria_checkpoints/checkpoints
  - Logs: /home/jonathan.correa/atheria_checkpoints/logs
  - Exports: /home/jonathan.correa/atheria_checkpoints/exports
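
Before starting a run that lasts many hours, it may be worth confirming that the checkpoint directory actually accepts writes, since a failed mount would otherwise only surface at the first sync. The following is a minimal sketch under that assumption; `is_writable` is a hypothetical helper, and the temporary directory stands in for `DRIVE_CHECKPOINT_DIR`.

```python
import tempfile
import uuid
from pathlib import Path


def is_writable(directory: Path) -> bool:
    """Return True if a small probe file can be written and read back in `directory`."""
    probe = directory / f".write_probe_{uuid.uuid4().hex}"
    try:
        directory.mkdir(parents=True, exist_ok=True)
        probe.write_text("ok")
        return probe.read_text() == "ok"
    except OSError:
        return False
    finally:
        # Remove the probe file if it was created.
        try:
            probe.unlink()
        except OSError:
            pass


# Demo on a temporary directory standing in for DRIVE_CHECKPOINT_DIR.
with tempfile.TemporaryDirectory() as tmp:
    print(is_writable(Path(tmp) / "checkpoints"))
```

On Colab a successful probe only shows that the mounted path is writable locally; it does not prove the file has been uploaded to Drive yet.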


---
## 🔧 Section 3: Clone the Atheria Project

In [None]:
# Configure the project path
if IN_KAGGLE:
    PROJECT_ROOT = Path("/kaggle/working/Atheria")
    if not PROJECT_ROOT.exists():
        print("⚠️ Project not found. Cloning from GitHub...")
        !git clone https://github.com/Jonakss/Atheria.git /kaggle/working/Atheria
elif IN_COLAB:
    PROJECT_ROOT = Path("/content/Atheria")
    if not PROJECT_ROOT.exists():
        print("📥 Cloning project from GitHub...")
        !git clone https://github.com/Jonakss/Atheria.git /content/Atheria
else:
    # Local: assume we are running from notebooks/
    PROJECT_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()

# Add the project to the Python path
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

print(f"📁 Project configured at: {PROJECT_ROOT}")

# Check the basic structure
src_path = PROJECT_ROOT / "src"
if src_path.exists():
    print("✅ Project structure verified")
else:
    print("❌ Error: the 'src' folder was not found. Check the installation.")

📁 Project configured at: /home/jonathan.correa/Projects/Atheria
✅ Project structure verified


---
## 📊 Section 4: Resource Monitoring Utilities

In [None]:
import torch
import psutil
import time
from datetime import datetime, timedelta
from IPython.display import clear_output

class ResourceMonitor:
    """Monitors GPU, RAM and elapsed time."""

    def __init__(self, max_training_hours=10):
        self.start_time = time.time()
        self.max_training_seconds = max_training_hours * 3600
        self.max_training_hours = max_training_hours

    def get_gpu_usage(self):
        """Return GPU utilization in %."""
        if not torch.cuda.is_available():
            return 0.0
        try:
            return torch.cuda.utilization()
        except Exception:
            # Utilization may be unavailable on some setups; report 0 instead
            return 0.0

    def get_gpu_memory(self):
        """Return GPU memory allocated and reserved by PyTorch, in GB."""
        if not torch.cuda.is_available():
            return 0.0, 0.0
        allocated = torch.cuda.memory_allocated() / 1e9  # GB
        reserved = torch.cuda.memory_reserved() / 1e9
        return allocated, reserved

    def get_ram_usage(self):
        """Return RAM used and total, in GB."""
        mem = psutil.virtual_memory()
        return mem.used / 1e9, mem.total / 1e9

    def get_elapsed_time(self):
        """Return elapsed and remaining time, in seconds."""
        elapsed = time.time() - self.start_time
        remaining = max(0, self.max_training_seconds - elapsed)
        return elapsed, remaining

    def should_stop(self):
        """True once 90% of the time budget has been used."""
        elapsed, remaining = self.get_elapsed_time()
        return remaining < (self.max_training_seconds * 0.1)  # less than 10% left

    def get_status_str(self):
        """Return a string describing the current resource usage."""
        gpu_usage = self.get_gpu_usage()
        gpu_mem_used, gpu_mem_reserved = self.get_gpu_memory()
        ram_used, ram_total = self.get_ram_usage()
        elapsed, remaining = self.get_elapsed_time()

        elapsed_str = str(timedelta(seconds=int(elapsed)))
        remaining_str = str(timedelta(seconds=int(remaining)))

        status = (
            f"📊 RESOURCES:\n"
            f"  GPU Utilization: {gpu_usage:.1f}%\n"
            f"  GPU Memory (allocated / reserved): {gpu_mem_used:.2f}GB / {gpu_mem_reserved:.2f}GB\n"
            f"  RAM: {ram_used:.2f}GB / {ram_total:.2f}GB ({ram_used/ram_total*100:.1f}%)\n"
            f"  \n"
            f"⏰ TIME:\n"
            f"  Elapsed: {elapsed_str}\n"
            f"  Remaining: {remaining_str} (of {self.max_training_hours}h maximum)\n"
        )
        return status

print("✅ ResourceMonitor defined")

✅ ResourceMonitor defined
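
The stop rule in `should_stop` keeps the last 10% of the budget as a reserve, so a 10-hour budget stops training at roughly the 9-hour mark. The sketch below reproduces only that arithmetic with an injectable clock so it can be run instantly; `TimeBudget`, `reserve_fraction` and `clock` are hypothetical names, not part of `ResourceMonitor`.

```python
import time
from typing import Callable


class TimeBudget:
    """Signals a stop once less than `reserve_fraction` of the budget remains."""

    def __init__(self, max_hours: float, reserve_fraction: float = 0.1,
                 clock: Callable[[], float] = time.time):
        self.clock = clock
        self.start = clock()
        self.max_seconds = max_hours * 3600
        self.reserve_seconds = self.max_seconds * reserve_fraction

    def remaining(self) -> float:
        """Seconds left in the budget, never negative."""
        return max(0.0, self.max_seconds - (self.clock() - self.start))

    def should_stop(self) -> bool:
        return self.remaining() < self.reserve_seconds


# Fake clock so the example runs instantly: time is advanced by hand.
now = [0.0]
budget = TimeBudget(max_hours=10, clock=lambda: now[0])

now[0] = 8.9 * 3600   # 1.1 h left, the reserve is 1.0 h
print(budget.should_stop())  # → False
now[0] = 9.1 * 3600   # 0.9 h left, inside the reserve
print(budget.should_stop())  # → True
```

When picking `MAX_TRAINING_HOURS`, the reserve therefore has to cover the emergency checkpoint and the final Drive sync.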


---
## ⚙️ Section 5: Experiment Configuration

In [None]:
from types import SimpleNamespace
import torch

# ============================================================================
# 🎯 CURRICULUM CONFIGURATION (TRAINING PHASES)
# ============================================================================

# We define a list of phases. The system runs them one after another.
# If a phase is already completed (its PHASE_COMPLETED.marker file exists), it moves on to the next one.

TRAINING_PHASES = [
    # --- PHASE 1: VACUUM STABILITY ---
    {
        "PHASE_NAME": "Fase1_Vacuum_Stability",
        "LOAD_FROM_PHASE": None,  # Start from scratch

        "MODEL_ARCHITECTURE": "UNET",
        "MODEL_PARAMS": {
            "d_state": 4,           # Low dimension to learn quickly
            "hidden_channels": 32,
        },

        "GRID_SIZE_TRAINING": 32,
        "QCA_STEPS_TRAINING": 50,
        "LR_RATE_M": 1e-4,
        "GAMMA_DECAY": 0.001,       # Low decay

        "TOTAL_EPISODES": 500,
        "SAVE_EVERY_EPISODES": 50,
    },

    # --- PHASE 2: MATTER EMERGENCE ---
    {
        "PHASE_NAME": "Fase2_Matter_Emergence",
        "LOAD_FROM_PHASE": "Fase1_Vacuum_Stability", # Load the Phase 1 weights

        "MODEL_ARCHITECTURE": "UNET",
        "MODEL_PARAMS": {
            "d_state": 4,           # Same dimension
            "hidden_channels": 64,  # More capacity (neurons)
        },

        "GRID_SIZE_TRAINING": 64,   # Larger grid
        "QCA_STEPS_TRAINING": 100,
        "LR_RATE_M": 5e-5,          # Smaller learning rate
        "GAMMA_DECAY": 0.01,        # More evolutionary pressure

        "TOTAL_EPISODES": 1000,
        "SAVE_EVERY_EPISODES": 20,
    },
]

# Global configuration
GLOBAL_CONFIG = {
    "DRIVE_SYNC_EVERY": 50,
    "MAX_TRAINING_HOURS": 10,     # Total time for ALL phases
    "AUTO_RESUME": True,
    "MAX_CHECKPOINTS_TO_KEEP": 3,
}

# Detect the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Base directories
EXPERIMENT_ROOT_NAME = "MultiPhase_Experiment_v1" # Root folder name
BASE_CHECKPOINT_DIR = DRIVE_CHECKPOINT_DIR / EXPERIMENT_ROOT_NAME
BASE_LOG_DIR = DRIVE_LOGS_DIR / EXPERIMENT_ROOT_NAME
BASE_CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
BASE_LOG_DIR.mkdir(parents=True, exist_ok=True)

print("=" * 70)
print("📊 CURRICULUM CONFIGURATION")
print("=" * 70)
print(f"Root Experiment: {EXPERIMENT_ROOT_NAME}")
print(f"Total Phases: {len(TRAINING_PHASES)}")
print(f"Device: {device}")
print("=" * 70)

for i, phase in enumerate(TRAINING_PHASES):
    print(f"\n🔹 PHASE {i+1}: {phase['PHASE_NAME']}")
    print(f"   Load From: {phase['LOAD_FROM_PHASE']}")
    print(f"   Model: {phase['MODEL_ARCHITECTURE']} (d={phase['MODEL_PARAMS']['d_state']}, ch={phase['MODEL_PARAMS']['hidden_channels']})")
    print(f"   Grid: {phase['GRID_SIZE_TRAINING']}x{phase['GRID_SIZE_TRAINING']}")
    print(f"   Episodes: {phase['TOTAL_EPISODES']}")


📊 CURRICULUM CONFIGURATION
Root Experiment: MultiPhase_Experiment_v1
Total Phases: 2
Device: cuda

🔹 PHASE 1: Fase1_Vacuum_Stability
   Load From: None
   Model: UNET (d=4, ch=32)
   Grid: 32x32
   Episodes: 500

🔹 PHASE 2: Fase2_Matter_Emergence
   Load From: Fase1_Vacuum_Stability
   Model: UNET (d=4, ch=64)
   Grid: 64x64
   Episodes: 1000
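
Because each phase refers to an earlier one by name through `LOAD_FROM_PHASE`, a typo in that string would only show up hours later, when the next phase starts. A small consistency check run right after the configuration cell could catch it early. This is a sketch, not part of Atheria: `validate_phases` and `REQUIRED_KEYS` are hypothetical, and the key list is simply the set of keys used in `TRAINING_PHASES` above.

```python
REQUIRED_KEYS = {
    "PHASE_NAME", "LOAD_FROM_PHASE", "MODEL_ARCHITECTURE", "MODEL_PARAMS",
    "GRID_SIZE_TRAINING", "QCA_STEPS_TRAINING", "LR_RATE_M", "GAMMA_DECAY",
    "TOTAL_EPISODES", "SAVE_EVERY_EPISODES",
}


def validate_phases(phases: list) -> list:
    """Return human-readable problems; an empty list means the curriculum is consistent."""
    problems = []
    seen = set()
    for index, phase in enumerate(phases, start=1):
        name = phase.get("PHASE_NAME", f"<phase {index}>")
        missing = REQUIRED_KEYS - phase.keys()
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        if name in seen:
            problems.append(f"{name}: duplicate PHASE_NAME")
        # A phase may only load weights from a phase that appears before it.
        source = phase.get("LOAD_FROM_PHASE")
        if source is not None and source not in seen:
            problems.append(f"{name}: LOAD_FROM_PHASE '{source}' is not an earlier phase")
        seen.add(name)
    return problems


# Deliberately broken demo: phase B points at a phase that does not exist.
demo = [
    dict.fromkeys(REQUIRED_KEYS, 1) | {"PHASE_NAME": "A", "LOAD_FROM_PHASE": None},
    dict.fromkeys(REQUIRED_KEYS, 1) | {"PHASE_NAME": "B", "LOAD_FROM_PHASE": "C"},
]
for problem in validate_phases(demo):
    print(problem)
```

In the notebook, the call would be `validate_phases(TRAINING_PHASES)`, with training started only when the returned list is empty.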


In [None]:
from src.trainers.qc_trainer_v4 import QC_Trainer_v4
from src.model_loader import instantiate_model, load_weights, load_checkpoint_data
import matplotlib.pyplot as plt
from IPython.display import clear_output, display
import json
import shutil

# Initialize the GLOBAL resource monitor
monitor = ResourceMonitor(max_training_hours=GLOBAL_CONFIG["MAX_TRAINING_HOURS"])


def find_latest_checkpoint(checkpoint_dir):
    """Find the latest checkpoint in a directory (returns a str path or None)."""
    checkpoints = list(checkpoint_dir.glob("*.pth"))
    if not checkpoints:
        return None

    # Prefer last_model.pth when it exists
    last_model = checkpoint_dir / "last_model.pth"
    if last_model.exists():
        return str(last_model)

    # Otherwise fall back to the most recently modified file
    latest = max(checkpoints, key=lambda p: p.stat().st_mtime)
    return str(latest)

# Helper to save to Drive
def sync_checkpoint_to_drive(local_path, drive_dir, filename=None):
    """Copy a local checkpoint to Drive; returns True on success."""
    try:
        if filename is None:
            filename = Path(local_path).name
        drive_path = drive_dir / filename
        shutil.copy2(local_path, drive_path)
        # logger.info(f"💾 Checkpoint synced to Drive: {filename}")
        return True
    except Exception as e:
        print(f"❌ Error syncing to Drive: {e}")
        return False

# ============================================================================
# 🔄 MAIN PHASE LOOP
# ============================================================================

for phase_idx, phase_cfg in enumerate(TRAINING_PHASES):
    PHASE_NAME = phase_cfg["PHASE_NAME"]

    print("\n" + "#" * 70)
    print(f"🚀 STARTING PHASE {phase_idx+1}/{len(TRAINING_PHASES)}: {PHASE_NAME}")
    print("#" * 70)

    # 1. Set up the directories for this phase
    PHASE_CHECKPOINT_DIR = BASE_CHECKPOINT_DIR / PHASE_NAME
    PHASE_LOG_DIR = BASE_LOG_DIR / PHASE_NAME
    PHASE_CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    PHASE_LOG_DIR.mkdir(parents=True, exist_ok=True)

    LOCAL_PHASE_DIR = PROJECT_ROOT / "output" / "checkpoints" / EXPERIMENT_ROOT_NAME / PHASE_NAME
    LOCAL_PHASE_DIR.mkdir(parents=True, exist_ok=True)

    # 2. Skip the phase if it is already completed
    final_marker = PHASE_CHECKPOINT_DIR / "PHASE_COMPLETED.marker"
    if final_marker.exists() and GLOBAL_CONFIG["AUTO_RESUME"]:
        print(f"✅ Phase {PHASE_NAME} already completed. Skipping...")
        continue

    # 3. Build the phase configuration
    current_exp_cfg = SimpleNamespace(**phase_cfg)
    current_exp_cfg.MODEL_PARAMS = SimpleNamespace(**phase_cfg["MODEL_PARAMS"])
    current_exp_cfg.DEVICE = device

    # 4. Instantiate the model
    print(f"🛠️ Instantiating {phase_cfg['MODEL_ARCHITECTURE']} model...")
    model = instantiate_model(current_exp_cfg)

    # 5. Load weights (transition logic)
    resume_from_episode = 0
    weights_loaded = False

    # A) Try to resume this same phase (if it was interrupted)
    if GLOBAL_CONFIG["AUTO_RESUME"]:
        latest_ckpt = find_latest_checkpoint(PHASE_CHECKPOINT_DIR)
        if not latest_ckpt:
            latest_ckpt = find_latest_checkpoint(LOCAL_PHASE_DIR)

        if latest_ckpt:
            print(f"🔄 Resuming current phase from: {Path(latest_ckpt).name}")
            ckpt_data = load_checkpoint_data(latest_ckpt)
            if ckpt_data:
                load_weights(model, ckpt_data) # Strict load, because it is the same model
                resume_from_episode = ckpt_data.get('episode', 0)
                weights_loaded = True

    # B) If not resuming, load from the previous phase (transfer learning)
    if not weights_loaded and phase_cfg["LOAD_FROM_PHASE"]:
        prev_phase_name = phase_cfg["LOAD_FROM_PHASE"]
        print(f"📥 Looking for weights from previous phase: {prev_phase_name}")

        prev_dir = BASE_CHECKPOINT_DIR / prev_phase_name
        best_prev = prev_dir / "best_model.pth"
        if not best_prev.exists():
            best_prev = find_latest_checkpoint(prev_dir)

        if best_prev and Path(best_prev).exists():
            print(f"✅ Loading previous weights from: {Path(best_prev).name}")
            ckpt_data = load_checkpoint_data(str(best_prev))

            # CRITICAL: strict=False allows the architecture to change between phases
            # (e.g. hidden_channels 32 -> 64).
            # Smart load: keep only the tensors whose shape matches the new model
            model_state = model.state_dict()
            pretrained_state = ckpt_data['model_state_dict']
            filtered_state = {}
            ignored_keys = []

            for k, v in pretrained_state.items():
                if k in model_state:
                    if v.shape == model_state[k].shape:
                        filtered_state[k] = v
                    else:
                        ignored_keys.append(k)

            if ignored_keys:
                print(f"⚠️ Ignoring {len(ignored_keys)} layers due to shape mismatch (Transfer Learning).")

            missing, unexpected = model.load_state_dict(filtered_state, strict=False)
            print(f"ℹ️ Transfer Learning: {len(missing)} new layers initialized, {len(unexpected)} layers discarded.")
            weights_loaded = True
        else:
            print("⚠️ No weights found for the previous phase. Starting from scratch (random).")

    if not weights_loaded:
        print("🆕 Starting phase with random weights.")

    # 6. Initialize the trainer
    trainer = QC_Trainer_v4(
        experiment_name=f"{EXPERIMENT_ROOT_NAME}/{PHASE_NAME}", # Subdirectory in the internal logs
        model=model,
        model_params=phase_cfg['MODEL_PARAMS'], # Fix: pass model params
        device=device,
        lr=phase_cfg["LR_RATE_M"],
        grid_size=phase_cfg["GRID_SIZE_TRAINING"],
        qca_steps=phase_cfg["QCA_STEPS_TRAINING"],
        gamma_decay=phase_cfg["GAMMA_DECAY"],
        max_checkpoints_to_keep=GLOBAL_CONFIG["MAX_CHECKPOINTS_TO_KEEP"]
    )

    # Hack: redirect the trainer's checkpoints to our local phase folder
    trainer.checkpoint_dir = str(LOCAL_PHASE_DIR)

    # Local tracking
    training_log = {"episodes": [], "losses": []}

    # 7. Episode loop
    print(f"\n▶️ Training {phase_cfg['TOTAL_EPISODES'] - resume_from_episode} episodes...")

    try:
        for episode in range(resume_from_episode, phase_cfg["TOTAL_EPISODES"]):

            # Check the global time budget
            if monitor.should_stop():
                print(f"\n⏰ GLOBAL TIME BUDGET EXHAUSTED in phase {PHASE_NAME}, Ep {episode}")
                trainer.save_checkpoint(episode, is_best=False)
                sync_checkpoint_to_drive(str(LOCAL_PHASE_DIR / f"checkpoint_ep{episode}.pth"), PHASE_CHECKPOINT_DIR)
                raise KeyboardInterrupt("Global Timeout") # Exit every phase

            # Train
            loss, metrics = trainer.train_episode(episode)

            training_log["episodes"].append(episode)
            training_log["losses"].append(loss)

            # Save / sync
            if (episode + 1) % phase_cfg["SAVE_EVERY_EPISODES"] == 0:
                # "Best" is schedule-based here (every second save), not loss-based
                is_best = (episode + 1) % (phase_cfg["SAVE_EVERY_EPISODES"] * 2) == 0
                trainer.save_checkpoint(episode, is_best=is_best)

            if (episode + 1) % GLOBAL_CONFIG["DRIVE_SYNC_EVERY"] == 0:
                # Sync last & best
                sync_checkpoint_to_drive(str(LOCAL_PHASE_DIR / "last_checkpoint.pth"), PHASE_CHECKPOINT_DIR, "last_model.pth")
                if (LOCAL_PHASE_DIR / "best_model.pth").exists():
                    sync_checkpoint_to_drive(str(LOCAL_PHASE_DIR / "best_model.pth"), PHASE_CHECKPOINT_DIR)
                print(f"☁️ Sync Drive Ep {episode}")

            # Simple progress display
            if (episode + 1) % 10 == 0:
                print(f"Ep {episode+1}/{phase_cfg['TOTAL_EPISODES']} | Loss: {loss:.4f} | {monitor.get_status_str().splitlines()[0]}")

        # End of phase
        print(f"\n✅ PHASE {PHASE_NAME} COMPLETED")

        # Write the completion marker
        with open(final_marker, 'w') as f:
            f.write(f"Completed at {datetime.now()}")

        # Final sync
        sync_checkpoint_to_drive(str(LOCAL_PHASE_DIR / "best_model.pth"), PHASE_CHECKPOINT_DIR)

    except KeyboardInterrupt:
        print("\n🛑 Training interrupted.")
        break
    except Exception as e:
        print(f"\n❌ Error in phase {PHASE_NAME}: {e}")
        raise

print("\n" + "="*70)
print("🏁 TRAINING FINISHED")
print("="*70)



######################################################################
🚀 STARTING PHASE 1/2: Fase1_Vacuum_Stability
######################################################################
✅ Phase Fase1_Vacuum_Stability already completed. Skipping...

######################################################################
🚀 STARTING PHASE 2/2: Fase2_Matter_Emergence
######################################################################
🛠️ Instantiating UNET model...


🔄 Resuming current phase from: last_model.pth

▶️ Training 961 episodes...
Ep 40/1000 | Loss: 3.1667 | 📊 RESOURCES:
☁️ Sync Drive Ep 49
Ep 50/1000 | Loss: 3.1293 | 📊 RESOURCES:
Ep 60/1000 | Loss: 3.1106 | 📊 RESOURCES:

🛑 Training interrupted.

🏁 TRAINING FINISHED

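
The transfer step in the phase loop keeps only the tensors whose name and shape both match the new model, then loads them non-strictly. The sketch below shows that filtering on its own, using stand-in objects that merely expose a `.shape` attribute so it runs without PyTorch. `filter_compatible`, `fake` and the layer names are hypothetical and chosen only for illustration.

```python
from types import SimpleNamespace


def filter_compatible(pretrained: dict, target: dict):
    """Split `pretrained` into entries that fit `target` and keys that do not.

    Values only need a `.shape` attribute, as torch tensors have.
    """
    compatible, shape_mismatch, not_in_target = {}, [], []
    for key, value in pretrained.items():
        if key not in target:
            not_in_target.append(key)
        elif tuple(value.shape) != tuple(target[key].shape):
            shape_mismatch.append(key)
        else:
            compatible[key] = value
    return compatible, shape_mismatch, not_in_target


def fake(*shape):
    """Stand-in for a tensor: an object with only a shape."""
    return SimpleNamespace(shape=shape)


# Old model with 32 hidden channels, new model with 64 (made-up layer names).
old_state = {
    "enc.weight": fake(32, 8, 3, 3),
    "enc.bias": fake(32),
    "head.weight": fake(8, 8, 1, 1),
    "legacy.bias": fake(4),
}
new_state = {
    "enc.weight": fake(64, 8, 3, 3),
    "enc.bias": fake(64),
    "head.weight": fake(8, 8, 1, 1),
}

compatible, shape_mismatch, not_in_target = filter_compatible(old_state, new_state)
print(sorted(compatible), shape_mismatch, not_in_target)
```

With real checkpoints, `compatible` would be passed to `model.load_state_dict(compatible, strict=False)`, as the phase loop does. Reporting the mismatched and absent keys separately makes it easier to see how much of the previous phase actually carried over.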

In [None]:
from src.trainers.qc_trainer_v4 import QC_Trainer_v4
from src.model_loader import instantiate_model, load_weights, load_checkpoint_data
import matplotlib.pyplot as plt
from IPython.display import clear_output, display
import json
import logging

# Single-phase variant of the loop above. The original cell relied on names the
# notebook never defines (exp_cfg, EXPERIMENT_CONFIG, checkpoint_path, logger, ...).
# ASSUMPTION: they are derived here from the last entry of TRAINING_PHASES and
# from GLOBAL_CONFIG so that the cell can run after the cells above.
logger = logging.getLogger("atheria.training")
phase_cfg = TRAINING_PHASES[-1]
EXPERIMENT_CONFIG = {
    **phase_cfg,
    **GLOBAL_CONFIG,
    "EXPERIMENT_NAME": f"{EXPERIMENT_ROOT_NAME}/{phase_cfg['PHASE_NAME']}",
}
exp_cfg = SimpleNamespace(**EXPERIMENT_CONFIG)
exp_cfg.MODEL_PARAMS = SimpleNamespace(**phase_cfg["MODEL_PARAMS"])
exp_cfg.DEVICE = device
EXPERIMENT_CHECKPOINT_DIR = BASE_CHECKPOINT_DIR / phase_cfg["PHASE_NAME"]
EXPERIMENT_LOG_DIR = BASE_LOG_DIR / phase_cfg["PHASE_NAME"]
LOCAL_CHECKPOINT_DIR = PROJECT_ROOT / "output" / "checkpoints" / EXPERIMENT_ROOT_NAME / phase_cfg["PHASE_NAME"]
for directory in (EXPERIMENT_CHECKPOINT_DIR, EXPERIMENT_LOG_DIR, LOCAL_CHECKPOINT_DIR):
    directory.mkdir(parents=True, exist_ok=True)
checkpoint_path = find_latest_checkpoint(EXPERIMENT_CHECKPOINT_DIR) if exp_cfg.AUTO_RESUME else None
resume_from_episode = 0

# Initialize the resource monitor
monitor = ResourceMonitor(max_training_hours=exp_cfg.MAX_TRAINING_HOURS)

# Instantiate the model
model = instantiate_model(exp_cfg)

# Load weights if there is a checkpoint
if checkpoint_path:
    ckpt_data = load_checkpoint_data(checkpoint_path)
    if ckpt_data:
        load_weights(model, ckpt_data)
        resume_from_episode = ckpt_data.get("episode", 0)

# Initialize the trainer
trainer = QC_Trainer_v4(
    experiment_name=exp_cfg.EXPERIMENT_NAME,
    model=model,
    model_params=phase_cfg['MODEL_PARAMS'], # Fix: pass model params
    device=device,
    lr=exp_cfg.LR_RATE_M,
    grid_size=exp_cfg.GRID_SIZE_TRAINING,
    qca_steps=exp_cfg.QCA_STEPS_TRAINING,
    gamma_decay=exp_cfg.GAMMA_DECAY,
    max_checkpoints_to_keep=exp_cfg.MAX_CHECKPOINTS_TO_KEEP
)
# Same redirect as in the phase loop, so the files below are found locally
trainer.checkpoint_dir = str(LOCAL_CHECKPOINT_DIR)

# Metric tracking
training_log = {
    "experiment_name": exp_cfg.EXPERIMENT_NAME,
    "config": EXPERIMENT_CONFIG,
    "episodes": [],
    "losses": [],
    "metrics": [],
    "checkpoints_saved": []
}

# Drive syncing reuses sync_checkpoint_to_drive(local_path, drive_dir, filename=None)
# from the previous cell. Redefining it here with a different signature would
# break the phase loop if that cell were run again.

# PROGRESSIVE TRAINING
print("\n" + "=" * 70)
print("🎯 STARTING PROGRESSIVE TRAINING")
print("=" * 70)

try:
    for episode in range(resume_from_episode, exp_cfg.TOTAL_EPISODES):

        # Check the time limit BEFORE starting the episode
        if monitor.should_stop():
            print("\n⏰ TIME LIMIT REACHED!")
            print("💾 Saving emergency checkpoint...")

            # Save the emergency checkpoint
            trainer.save_checkpoint(episode, is_best=False)

            # Sync to Drive
            last_checkpoint = LOCAL_CHECKPOINT_DIR / "last_model.pth"
            if last_checkpoint.exists():
                sync_checkpoint_to_drive(str(last_checkpoint), EXPERIMENT_CHECKPOINT_DIR)

            print(f"✅ Training stopped at episode {episode}")
            print(f"📊 Progress: {episode}/{exp_cfg.TOTAL_EPISODES} episodes ({episode/exp_cfg.TOTAL_EPISODES*100:.1f}%)")
            break

        # Train one episode
        loss, metrics = trainer.train_episode(episode)

        # Record metrics (train_episode returns the loss and a metrics object)
        training_log["episodes"].append(episode)
        training_log["losses"].append(float(loss))
        training_log["metrics"].append(metrics)

        # Save a checkpoint periodically
        if (episode + 1) % exp_cfg.SAVE_EVERY_EPISODES == 0:
            is_best = (episode + 1) % (exp_cfg.SAVE_EVERY_EPISODES * 5) == 0  # Every 5th save = best candidate
            trainer.save_checkpoint(episode, is_best=is_best)
            training_log["checkpoints_saved"].append(episode)

        # Sync to Drive periodically
        if (episode + 1) % exp_cfg.DRIVE_SYNC_EVERY == 0:
            print(f"\n💾 Syncing checkpoint to Drive (episode {episode})...")
            last_checkpoint = LOCAL_CHECKPOINT_DIR / "last_model.pth"
            if last_checkpoint.exists():
                sync_checkpoint_to_drive(str(last_checkpoint), EXPERIMENT_CHECKPOINT_DIR)

            # Also sync the best model
            best_checkpoint = LOCAL_CHECKPOINT_DIR / "best_model.pth"
            if best_checkpoint.exists():
                sync_checkpoint_to_drive(str(best_checkpoint), EXPERIMENT_CHECKPOINT_DIR)

        # Visualization every 10 episodes
        if (episode + 1) % 10 == 0:
            clear_output(wait=True)

            print("=" * 70)
            print(f"📈 PROGRESS: Episode {episode + 1}/{exp_cfg.TOTAL_EPISODES} ({(episode+1)/exp_cfg.TOTAL_EPISODES*100:.1f}%)")
            print("=" * 70)

            # Show resources
            print(monitor.get_status_str())

            # Show the latest metrics
            if len(training_log["losses"]) > 0:
                recent_losses = training_log["losses"][-10:]
                avg_loss = sum(recent_losses) / len(recent_losses)
                print("\n📊 METRICS (last 10 episodes):")
                print(f"  Average loss: {avg_loss:.4f}")
                print(f"  Current loss: {training_log['losses'][-1]:.4f}")

            # Loss plot
            if len(training_log["episodes"]) > 1:
                plt.figure(figsize=(10, 4))
                plt.plot(training_log["episodes"], training_log["losses"], 'b-', alpha=0.6)
                plt.xlabel('Episode')
                plt.ylabel('Loss')
                plt.title(f'{exp_cfg.EXPERIMENT_NAME} - Training Progress')
                plt.grid(True, alpha=0.3)
                plt.tight_layout()
                plt.show()

    print("\n" + "=" * 70)
    print("✅ TRAINING COMPLETED SUCCESSFULLY")
    print("=" * 70)

except KeyboardInterrupt:
    print("\n⚠️ Training interrupted by the user")
    print("💾 Saving checkpoint...")
    trainer.save_checkpoint(episode, is_best=False)

except Exception as e:
    print(f"\n❌ ERROR DURING TRAINING: {e}")
    logger.error(f"Error: {e}", exc_info=True)
    print("💾 Saving emergency checkpoint...")
    try:
        trainer.save_checkpoint(episode, is_best=False)
    except Exception:
        pass
    raise

finally:
    # Save the training log to Drive (default=str covers non-JSON metric values)
    log_file = EXPERIMENT_LOG_DIR / f"training_log_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(log_file, 'w') as f:
        json.dump(training_log, f, indent=2, default=str)
    print(f"\n📄 Log saved to: {log_file}")
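
Both training loops flag a checkpoint as "best" on a fixed schedule (every second or every fifth save) rather than by comparing losses. One possible alternative is to flag a checkpoint as best only when a smoothed loss improves on the best value seen so far. The class below is a minimal sketch of that idea; `BestLossTracker`, `window` and `min_delta` are hypothetical names, and nothing here reflects how `QC_Trainer_v4` itself decides what to keep.

```python
import math


class BestLossTracker:
    """Flags an update as best when the smoothed loss beats the best seen so far."""

    def __init__(self, window: int = 10, min_delta: float = 0.0):
        self.window = window        # number of recent losses to average
        self.min_delta = min_delta  # required improvement over the previous best
        self.recent = []
        self.best = math.inf

    def update(self, loss: float) -> bool:
        """Record one loss value and return True if it sets a new best average."""
        self.recent = (self.recent + [float(loss)])[-self.window:]
        smoothed = sum(self.recent) / len(self.recent)
        if smoothed < self.best - self.min_delta:
            self.best = smoothed
            return True
        return False


tracker = BestLossTracker(window=2)
flags = [tracker.update(x) for x in [3.0, 2.0, 4.0, 1.0, 1.0]]
print(flags)  # → [True, True, False, False, True]
```

In a loop like the ones above, `update` could be called on each save episode with the average loss since the previous save, and its result passed as `is_best`. The best value is held in memory only, so after a resume it would need to be restored from the checkpoint or it starts over.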

---
## 📊 Section 8: Visualization of Final Results

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# The phase loop only records "episodes" and "losses", so the optional keys
# are read with .get() to avoid a KeyError.
metrics_log = training_log.get("metrics", [])
checkpoints_saved = training_log.get("checkpoints_saved", [])
# EXPERIMENT_LOG_DIR is not defined by the configuration cell; BASE_LOG_DIR is.
summary_path = BASE_LOG_DIR / 'training_summary.png'

if len(training_log["episodes"]) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Total loss
    axes[0, 0].plot(training_log["episodes"], training_log["losses"], 'b-', alpha=0.6, label='Loss')
    axes[0, 0].set_xlabel('Episode')
    axes[0, 0].set_ylabel('Total Loss')
    axes[0, 0].set_title('Loss Evolution')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].legend()

    # 2. Individual metrics (if available)
    if len(metrics_log) > 0 and "survival_rate" in metrics_log[0]:
        survival_rates = [m.get("survival_rate", 0) for m in metrics_log]
        axes[0, 1].plot(training_log["episodes"], survival_rates, 'g-', alpha=0.6, label='Survival Rate')
        axes[0, 1].set_xlabel('Episode')
        axes[0, 1].set_ylabel('Survival Rate')
        axes[0, 1].set_title('Survival Rate')
        axes[0, 1].grid(True, alpha=0.3)
        axes[0, 1].legend()

    # 3. Saved checkpoints
    axes[1, 0].scatter(checkpoints_saved,
                      [1] * len(checkpoints_saved),
                      c='red', marker='|', s=100, label='Checkpoint')
    axes[1, 0].set_xlabel('Episode')
    axes[1, 0].set_yticks([])
    axes[1, 0].set_title('Saved Checkpoints')
    axes[1, 0].grid(True, alpha=0.3, axis='x')
    axes[1, 0].legend()

    # 4. Loss histogram
    axes[1, 1].hist(training_log["losses"], bins=30, alpha=0.7, color='purple', edgecolor='black')
    axes[1, 1].set_xlabel('Loss')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Loss Distribution')
    axes[1, 1].grid(True, alpha=0.3, axis='y')

    plt.tight_layout()
    plt.savefig(summary_path, dpi=150, bbox_inches='tight')
    plt.show()

    print(f"\n📊 Plots saved to: {summary_path}")
else:
    print("⚠️ No training data to visualize")
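
Per-episode losses tend to be noisy, which can hide a slow trend in the loss plot. A trailing moving average is one simple way to smooth the curve before plotting it next to the raw values. The function below is a sketch; `moving_average` is a hypothetical helper and the window size is an arbitrary choice.

```python
def moving_average(values, window=10):
    """Trailing moving average; the first entries average over fewer points."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


print(moving_average([4.0, 2.0, 6.0, 8.0], window=2))  # → [4.0, 3.0, 4.0, 7.0]
```

In the notebook this could be plotted with something like `axes[0, 0].plot(training_log["episodes"], moving_average(training_log["losses"]))`, assuming the losses are plain floats.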

---
## 📦 Section 9: Export and Wrap-Up

In [None]:
from src.model_loader import load_model
import shutil
from types import SimpleNamespace

print("📤 Exporting the final model of the LAST PHASE...\n")

# Get the configuration of the last phase
last_phase = TRAINING_PHASES[-1]
last_phase_name = last_phase["PHASE_NAME"]
print(f"🔹 Last phase: {last_phase_name}")

# Directories of the last phase
LAST_PHASE_CHECKPOINT_DIR = BASE_CHECKPOINT_DIR / last_phase_name
LAST_PHASE_LOG_DIR = BASE_LOG_DIR / last_phase_name

# Find the best checkpoint of the last phase
best_checkpoint = LAST_PHASE_CHECKPOINT_DIR / "best_model.pth"
if not best_checkpoint.exists():
    best_checkpoint = find_latest_checkpoint(LAST_PHASE_CHECKPOINT_DIR)

if best_checkpoint:
    print(f"📥 Loading model from: {Path(best_checkpoint).name}")

    # Rebuild the config object that load_model expects
    export_cfg = SimpleNamespace(**last_phase)
    export_cfg.MODEL_PARAMS = SimpleNamespace(**last_phase["MODEL_PARAMS"])
    export_cfg.DEVICE = device
    export_cfg.EXPERIMENT_NAME = f"{EXPERIMENT_ROOT_NAME}_{last_phase_name}"

    # Load the model
    model = load_model(export_cfg, str(best_checkpoint))
    model.eval()
    model.to(device)

    # Export to TorchScript, using the dimensions of the last phase
    d_state = last_phase["MODEL_PARAMS"]["d_state"]
    grid_size = last_phase["GRID_SIZE_TRAINING"]

    example_input = torch.randn(1, 2 * d_state, grid_size, grid_size, device=device)

    torchscript_path = DRIVE_EXPORTS_DIR / f"{EXPERIMENT_ROOT_NAME}_{last_phase_name}_model.pt"
    torchscript_exported = False

    try:
        traced_model = torch.jit.trace(model, example_input, strict=False)
        traced_model.save(str(torchscript_path))
        torchscript_exported = True
        print(f"✅ TorchScript model exported: {torchscript_path.name}")
    except Exception as e:
        print(f"⚠️ Error exporting TorchScript: {e}")

    # Copy the best checkpoint to Drive (exports)
    export_checkpoint_path = DRIVE_EXPORTS_DIR / f"{EXPERIMENT_ROOT_NAME}_{last_phase_name}_best.pth"
    shutil.copy2(best_checkpoint, export_checkpoint_path)
    print(f"✅ Best checkpoint exported: {export_checkpoint_path.name}")

    # Generate the training report
    report_path = DRIVE_EXPORTS_DIR / f"{EXPERIMENT_ROOT_NAME}_REPORT.md"

    with open(report_path, 'w') as f:
        f.write(f"# Training Report: {EXPERIMENT_ROOT_NAME}\n\n")
        f.write(f"**Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
        f.write("## Completed Phases\n\n")

        for i, phase in enumerate(TRAINING_PHASES):
            f.write(f"### Phase {i+1}: {phase['PHASE_NAME']}\n")
            f.write(f"- Grid: {phase['GRID_SIZE_TRAINING']}x{phase['GRID_SIZE_TRAINING']}\n")
            f.write(f"- Episodes: {phase['TOTAL_EPISODES']}\n\n")

        f.write("## Exported Files\n\n")
        # Only list the TorchScript file if the export actually succeeded
        if torchscript_exported:
            f.write(f"- TorchScript: `{torchscript_path.name}`\n")
        f.write(f"- Checkpoint: `{export_checkpoint_path.name}`\n")

    print(f"✅ Report generated: {report_path.name}")

else:
    print("⚠️ No checkpoint found to export in the last phase")

print("\n" + "=" * 70)
print("🎉 EXPORT COMPLETED")
print("=" * 70)
print(f"\n📂 All files are in Drive: {DRIVE_EXPORTS_DIR}")


---
## 📝 Final Notes

### ✅ Verification Checklist

- ✓ Checkpoints synced to Drive every 50 episodes
- ✓ Training logs persisted
- ✓ Final model exported to TorchScript
- ✓ Training report generated
- ✓ Auto-recovery configured for the next session

### 🔄 To Continue Training

1. Run this notebook again
2. Keep `GLOBAL_CONFIG["AUTO_RESUME"]` set to `True`
3. Raise a phase's `TOTAL_EPISODES` if you want more episodes; a phase whose `PHASE_COMPLETED.marker` file exists is skipped, so delete that marker to reopen it
4. The notebook automatically detects the latest checkpoint in Drive
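
The phase loop skips any phase whose checkpoint folder contains a `PHASE_COMPLETED.marker` file, so extending a finished phase also means removing that marker. A small helper for doing so could look like the sketch below; `reopen_phase` is a hypothetical name, and the temporary directory stands in for a phase folder under `BASE_CHECKPOINT_DIR`.

```python
import tempfile
from pathlib import Path


def reopen_phase(phase_checkpoint_dir: Path,
                 marker_name: str = "PHASE_COMPLETED.marker") -> bool:
    """Delete the completion marker so the phase loop trains this phase again.

    Returns True if a marker was removed, False if there was none.
    """
    marker = phase_checkpoint_dir / marker_name
    if marker.exists():
        marker.unlink()
        return True
    return False


# Demo on a temporary folder standing in for a phase checkpoint directory.
with tempfile.TemporaryDirectory() as tmp:
    phase_dir = Path(tmp)
    (phase_dir / "PHASE_COMPLETED.marker").write_text("Completed")
    print(reopen_phase(phase_dir))  # → True
    print(reopen_phase(phase_dir))  # → False
```

The checkpoints in the folder are left untouched, so the next run resumes the phase from its latest checkpoint instead of starting over.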

### 📊 Best Practices

- **Colab Free**: Train in 6-8 hour sessions
- **Colab Pro**: 12-20 hour sessions
- **Kaggle**: Make full use of the 30 weekly hours
- Always verify that Drive is syncing correctly
- Review the plots every 50-100 episodes to catch problems early
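
One way to check that a sync worked is to compare a hash of the local checkpoint with a hash of the copy. The sketch below does this with SHA-256 from the standard library; `sha256_of` and `copy_verified` are hypothetical helpers, and the temporary files stand in for a local checkpoint and its Drive copy.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source: Path, destination: Path) -> bool:
    """Copy `source` to `destination` and confirm both files hash identically."""
    shutil.copy2(source, destination)
    return sha256_of(source) == sha256_of(destination)


# Demo with temporary files standing in for a checkpoint and its Drive copy.
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "best_model.pth"
    local.write_bytes(b"\x00" * 4096)
    print(copy_verified(local, Path(tmp) / "best_model_copy.pth"))  # → True
```

On Colab, a matching hash confirms the copy at the mounted path, which may still be waiting to upload; it does not by itself prove that the file has reached Drive.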

---

**Happy training! 🚀**