# Local Training of Pokémon Agents
This notebook coordinates local training of the specialized agents and the hybrid agent using the `advanced_agents` utilities. Each section describes what it configures or runs, so you can follow the flow without consulting other files. For details on the equivalent batch scripts, see `README_LOCAL_TRAINING.md`.

In [1]:
import sys
import os

# FIX: Work around the OpenMP conflict (Error #15) that crashes the kernel
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

import json
import shutil
import types
import importlib
from gymnasium import spaces

# Local path configuration
project_path = os.getcwd()
if project_path not in sys.path:
    sys.path.append(project_path)

baselines_path = os.path.join(project_path, 'baselines')
if baselines_path not in sys.path:
    sys.path.append(baselines_path)

print(f"Working directory: {project_path}")

Working directory: c:\Users\javi1\Documents\repos_git\TEL351-PokemonRed


## 1. Environment setup
Initializes the paths and environment variables that PyBoy and Stable-Baselines3 need to run without conflicts (for example, `KMP_DUPLICATE_LIB_OK` is set to avoid OpenMP errors).

## 1.1 Numba optimization (optional)
To speed up the reward calculations (especially the percentile computation over the loss history), installing `numba` is recommended. If it is not installed, the code falls back to a slower pure-Python implementation.
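The optional-dependency pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual reward code: `fraction_below` and `loss_percentile_rank` are hypothetical names, and "percentile" is taken here to mean the fraction of stored losses below a given value. When `numba` is missing, `njit` becomes a no-op decorator and the same loop runs as plain Python.

```python
try:
    import numpy as np
    from numba import njit
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

    def njit(func):
        """No-op stand-in for numba.njit when Numba is not installed."""
        return func


@njit
def fraction_below(values, threshold):
    """Fraction of entries in `values` that are strictly below `threshold`."""
    count = 0
    for i in range(len(values)):
        if values[i] < threshold:
            count += 1
    return count / len(values)


def loss_percentile_rank(history, value):
    """Percentile rank (0.0 to 1.0) of `value` within a non-empty loss history."""
    if HAVE_NUMBA:
        # Numba compiles best against typed NumPy arrays.
        data = np.asarray(history, dtype=np.float64)
    else:
        data = list(history)
    return fraction_below(data, float(value))


print(HAVE_NUMBA, loss_percentile_rank([1.0, 2.0, 3.0, 4.0], 3.5))
```

Both paths return the same number; only the speed differs, which is why the dependency can stay optional.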

In [3]:
try:
    import numba
    print(f"Numba installed: {numba.__version__}")
except ImportError:
    print("Numba not detected. Installing...")
    !pip install numba
    print("Installation complete. Restart the kernel if necessary.")

# Check for the progress-bar dependencies and install them if missing
try:
    import tqdm
    import rich
    import ipywidgets
    print("tqdm, rich and ipywidgets available (progress bar enabled)")
except ImportError:
    print("Installing dependencies to enable the progress bar...")
    !pip install tqdm rich ipywidgets
    print("Installation complete.")

Numba installed: 0.62.1
tqdm, rich and ipywidgets available (progress bar enabled)


In [None]:
try:
    import torch
    print(f"PyTorch GPU available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"   Device: {torch.cuda.get_device_name(0)}")
    else:
        print("PyTorch is using the CPU. Training will be slow.")
except OSError as e:
    print(f"CRITICAL ERROR DETECTED: {e}")
    if "126" in str(e) or "caffe2_nvrtc.dll" in str(e):
        print("\n" + "="*60)
        print("   DON'T WORRY: THIS ERROR IS EXPECTED WHEN THE INSTALLED VERSION IS BROKEN")
        print("   FIX: Run the NEXT CELL to repair PyTorch.")
        print("="*60 + "\n")
    else:
        raise  # re-raise unrelated errors with the original traceback

PyTorch GPU available: True
   Device: NVIDIA GeForce RTX 3050


## 1.2 GPU troubleshooting
If the previous cell reports that **PyTorch is using the CPU**, you probably have the wrong PyTorch build installed or the CUDA drivers are missing.
To fix this on your **RTX 3050**, run the following cell to reinstall a stable PyTorch build with CUDA 12.4 support (compatible with your current drivers).
**Note:** After the installation, you must restart the notebook kernel ("Restart" button in the top bar).

In [5]:
# INSTALLATION CHECK
# If this cell fails with "WinError 126" or "ModuleNotFoundError",
# the previous installation failed because files were locked.
#
# FIX:
# 1. Close this notebook or restart the kernel (Restart ↻ button above).
# 2. Open a TERMINAL in VS Code (Ctrl+Ñ).
# 3. Run the commands for reinstalling Torch manually.
# 4. Come back here and run this cell.

import torch
print(f"Torch version: {torch.__version__}")
print(f"CUDA version in Torch: {torch.version.cuda}")
print(f"CUDA available?: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ GPU not detected. If you have an RTX card, reinstall Torch from the terminal.")

Torch version: 2.4.1+cu121
CUDA version in Torch: 12.1
CUDA available?: True
GPU detected: NVIDIA GeForce RTX 3050


In [6]:
# --- RELOAD MODULES ---
def reload_modules():
    modules_to_reload = [
        'v2.red_gym_env_v2',
        'advanced_agents.features',
        'advanced_agents.wrappers',
        'advanced_agents.base',
        'advanced_agents.train_agents',
        'advanced_agents.combat_apex_agent',
        'advanced_agents.puzzle_speed_agent',
        'advanced_agents.hybrid_sage_agent',
        'advanced_agents.transition_models'
    ]
    for mod_name in modules_to_reload:
        if mod_name in sys.modules:
            try:
                importlib.reload(sys.modules[mod_name])
                print(f"Reloaded: {mod_name}")
            except Exception as e:
                print(f"Could not reload {mod_name}: {e}")

reload_modules()

## 2. Module reload
Refreshes the key `advanced_agents` modules and the RedGym environment whenever you change the source code, without restarting the kernel. Run this cell if you modify any related Python files.
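To see what a reload actually does, here is a small self-contained sketch using only the standard library. The module name `demo_reward_mod` and its `SCALE` constant are hypothetical throwaway names; the point is only that `importlib.reload` re-executes the edited source in the already-imported module object.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "demo_reward_mod.py"  # hypothetical throwaway module
    source.write_text("SCALE = 1\n")
    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        import demo_reward_mod
        before = demo_reward_mod.SCALE

        # Simulate editing the file on disk, then refresh it in place.
        source.write_text("SCALE = 10\n")
        importlib.reload(demo_reward_mod)
        after = demo_reward_mod.SCALE
    finally:
        # Leave the interpreter state as we found it.
        sys.path.remove(tmp)
        sys.modules.pop("demo_reward_mod", None)

print(before, after)
```

Keep in mind that objects created before the reload (for example an agent instance) still use the old class definitions, so recreate them after reloading.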

In [7]:
# Copy events.json if necessary
events_source = os.path.join(project_path, 'baselines', 'events.json')
events_dest = os.path.join(project_path, 'events.json')
if os.path.exists(events_source) and not os.path.exists(events_dest):
    shutil.copy(events_source, events_dest)
    print(f"Copied events.json to {events_dest}")

## 3. Syncing `events.json`
Ensures the events file required by PyBoy is available in the project root by copying it from `baselines/events.json` when it is missing.

## 4. Training utilities
Defines the agent registry, validates that the `.state` files exist, builds the environment configurations, and exposes `train_single_run`/`train_plan`, the entry points for launching training from the cells that follow.
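The utilities read `gym_scenarios/scenarios.json`. Judging only from the keys the notebook code accesses (`scenarios`, `id`, `phases`, `name`, `state_file`), the file is expected to look roughly like the sample below. The state-file paths are made up for illustration, and `find_phase` is a hypothetical stand-in that mirrors the lookup logic rather than the notebook's own function.

```python
import json

# Hypothetical file content: only the key names come from the notebook code.
SAMPLE_JSON = """
{
  "scenarios": [
    {
      "id": "pewter_brock",
      "phases": [
        {"name": "battle", "state_file": "example_states/pewter_battle.state"},
        {"name": "puzzle", "state_file": "example_states/pewter_puzzle.state"}
      ]
    }
  ]
}
"""

# Index scenarios by id so lookups are a single dictionary access.
scenarios = {s["id"]: s for s in json.loads(SAMPLE_JSON)["scenarios"]}


def find_phase(scenario_id, phase_name):
    """Look up one phase entry, raising ValueError when it is missing."""
    scenario = scenarios.get(scenario_id)
    if scenario is None:
        raise ValueError(f"Scenario {scenario_id} not found")
    for phase in scenario["phases"]:
        if phase["name"] == phase_name:
            return phase
    raise ValueError(f"Phase {phase_name} not found in {scenario_id}")


print(find_phase("pewter_brock", "puzzle")["state_file"])
```

Failing early with a clear message matters here because a missing phase or `.state` file would otherwise only surface once the emulator starts.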

In [8]:
import json
import shutil
import types
import importlib
from typing import Dict, Iterable, List, Optional
import os

from gymnasium import spaces

try:
    from advanced_agents.train_agents import _base_env_config
    from advanced_agents.combat_apex_agent import CombatApexAgent, CombatAgentConfig
    from advanced_agents.puzzle_speed_agent import PuzzleSpeedAgent, PuzzleAgentConfig
    from advanced_agents.hybrid_sage_agent import HybridSageAgent, HybridAgentConfig
except ImportError as e:
    print(f"⚠️ IMPORT ERROR: {e}")
    raise
except OSError as e:
    print(f"⚠️ CRITICAL PYTORCH ERROR: {e}")
    if "126" in str(e) or "caffe2_nvrtc.dll" in str(e):
        print("\n" + "="*60)
        print("   YOUR PYTORCH INSTALLATION IS BROKEN!")
        print("   The kernel has locked files or the version is incompatible.")
        print("   ")
        print("   DEFINITIVE FIX:")
        print("   1. Open the terminal (Ctrl+Ñ)")
        print("   2. Run: ./repair_torch.ps1")
        print("   3. Restart the kernel (Restart ↻ button)")
        print("="*60 + "\n")
    raise

# --- Load scenarios ---
SCENARIO_PATH = os.path.join(project_path, 'gym_scenarios', 'scenarios.json')
with open(SCENARIO_PATH, 'r', encoding='utf-8') as f:
    scenarios_data = json.load(f)

SCENARIOS: Dict[str, Dict] = {scenario['id']: scenario for scenario in scenarios_data['scenarios']}

AGENT_REGISTRY = {
    'combat': {
        'agent_cls': CombatApexAgent,
        'config_cls': CombatAgentConfig,
        'default_phase': 'battle'
    },
    'puzzle': {
        'agent_cls': PuzzleSpeedAgent,
        'config_cls': PuzzleAgentConfig,
        'default_phase': 'puzzle'
    },
    'hybrid': {
        'agent_cls': HybridSageAgent,
        'config_cls': HybridAgentConfig,
        'default_phase': 'battle'
    }
}

MODELS_DIR = os.path.join(project_path, 'models_local')
os.makedirs(MODELS_DIR, exist_ok=True)

def resolve_phase(scenario_id: str, phase_name: Optional[str]) -> Dict:
    scenario = SCENARIOS.get(scenario_id)
    if scenario is None:
        raise ValueError(f"Scenario {scenario_id} not found in {SCENARIO_PATH}")
    # When no phase is given, fall back to the combat agent's default phase.
    target_phase = phase_name or AGENT_REGISTRY['combat']['default_phase']
    selected_phase = next((p for p in scenario['phases'] if p['name'] == target_phase), None)
    if selected_phase is None:
        raise ValueError(f"Phase {target_phase} not found in scenario {scenario_id}")
    return selected_phase

def ensure_state_file(state_file_path: str) -> str:
    abs_path = os.path.join(project_path, state_file_path) if not os.path.isabs(state_file_path) else state_file_path
    if not os.path.exists(abs_path):
        raise FileNotFoundError(
            f"Required state file not found: {abs_path}. "
            "Generate the .state files with generate_gym_states.py or adjust the path."
        )
    return abs_path

def build_env_overrides(state_file_path: str, headless: bool) -> Dict:
    return {
        'init_state': state_file_path,
        'headless': headless,
        'save_video': False,
        'gb_path': os.path.join(project_path, 'PokemonRed.gb'),
        'session_path': os.path.join(project_path, 'sessions', f"local_{os.path.basename(state_file_path)}"),
        'render_mode': 'rgb_array' if headless else 'human',
        'fast_video': headless
    }

def _patch_callbacks(agent, additional_callbacks: Optional[List] = None):
    base_callbacks_method = agent.extra_callbacks

    def _patched_callbacks(self):
        callbacks = list(base_callbacks_method())
        if additional_callbacks:
            callbacks.extend(additional_callbacks)
        return callbacks

    agent.extra_callbacks = types.MethodType(_patched_callbacks, agent)

def train_single_run(
    agent_key: str,
    scenario_id: str,
    phase_name: str,
    total_timesteps: int = 200_000,
    headless: bool = False,
    additional_callbacks: Optional[List] = None
):
    registry_entry = AGENT_REGISTRY.get(agent_key)
    if registry_entry is None:
        raise ValueError(f"Agente desconocido: {agent_key}")

    phase = resolve_phase(scenario_id, phase_name)
    state_file_path = ensure_state_file(phase['state_file'])

    env_overrides = build_env_overrides(state_file_path, headless=headless)
    config = registry_entry['config_cls'](
        env_config=_base_env_config(env_overrides),
        total_timesteps=total_timesteps
    )

    agent = registry_entry['agent_cls'](config)

    env_for_check = agent.make_env()
    obs_space = getattr(env_for_check, 'observation_space', None)
    if isinstance(obs_space, spaces.Dict):
        print("Dict observation detected -> MultiInputPolicy")
        agent.policy_name = types.MethodType(lambda self: "MultiInputPolicy", agent)
    env_for_check.close()

    if additional_callbacks:
        _patch_callbacks(agent, additional_callbacks)

    print(
        f"\n=== Training {agent_key.upper()} on {scenario_id} ({phase_name}) for {total_timesteps:,} steps ===")
    runtime = agent.train()

    agent_dir = os.path.join(MODELS_DIR, agent_key)
    os.makedirs(agent_dir, exist_ok=True)
    model_path = os.path.join(agent_dir, f"{scenario_id}_{phase_name}.zip")
    runtime.model.save(model_path)
    print(f"Model saved to {model_path}")

    return runtime

def train_plan(
    agent_key: str,
    plan: List[Dict],
    default_timesteps: int = 200_000,
    headless: bool = False,
    callback_factory: Optional[callable] = None
) -> Dict[tuple, object]:
    results = {}
    total_runs = len(plan)
    for run_idx, entry in enumerate(plan, start=1):
        scenario_id = entry['scenario']
        phase_name = entry.get('phase') or AGENT_REGISTRY[agent_key]['default_phase']
        run_timesteps = entry.get('timesteps', default_timesteps)
        callbacks = None
        if callback_factory is not None:
            callbacks = callback_factory(entry)
        print(f"\n>>> [{agent_key.upper()}] Run {run_idx}/{total_runs}")
        runtime = train_single_run(
            agent_key=agent_key,
            scenario_id=scenario_id,
            phase_name=phase_name,
            total_timesteps=run_timesteps,
            headless=headless,
            additional_callbacks=callbacks
        )
        results[(scenario_id, phase_name)] = runtime
    return results



## 5. Training plans
Adjust here which scenarios, phases, and step counts to cover for each agent. Use this as a checklist before launching long runs; you can override the timesteps per row and toggle `headless` to show the emulator window.

### Configure local training plans
Specify the scenarios, phases, and timesteps you want for each agent. You can run each block separately and switch between `headless=True` and `headless=False` depending on whether you want to see the emulator window.
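Before committing to a long run, it can help to dry-run a plan. The sketch below is a hypothetical helper (`dry_run` is not part of `advanced_agents`) that resolves each row the same way `train_plan` does and estimates how many steps PPO will really execute. It assumes a single environment and `n_steps=1024`, which matches the 1024-step increments in the training log; PPO collects whole rollouts, so the requested timesteps are effectively rounded up to a multiple of the rollout size.

```python
import math


def dry_run(plan, default_phase, default_timesteps, n_steps=1024, n_envs=1):
    """Resolve each plan row without training and estimate the steps PPO will run."""
    rollout = n_steps * n_envs  # steps collected per PPO iteration
    rows = []
    for entry in plan:
        requested = entry.get("timesteps", default_timesteps)
        rows.append({
            "scenario": entry["scenario"],
            "phase": entry.get("phase") or default_phase,
            "requested": requested,
            # PPO keeps collecting whole rollouts until the target is reached.
            "effective": math.ceil(requested / rollout) * rollout,
        })
    return rows


plan = [
    {"scenario": "pewter_brock", "phase": "battle", "timesteps": 40_000},
    {"scenario": "pewter_brock", "timesteps": 200},  # quick smoke test
]
rows = dry_run(plan, default_phase="battle", default_timesteps=40_000)
for row in rows:
    print(row)
```

Under these assumptions a 200-step smoke test still costs a full 1024-step rollout, which is why reducing `n_steps` is worthwhile for quick checks.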

In [9]:
# NOTE: For quick tests with few steps (e.g. 200), consider reducing n_steps
# in the agent configuration, since PPO always collects full rollouts (n_steps=1024 by default here).
# For real training, use values such as 40_000+ timesteps.

combat_plan_local = [
    {"scenario": "pewter_brock", "phase": "battle", "timesteps": 40_000},
    # {"scenario": "cerulean_misty", "phase": "battle", "timesteps": 50_000},
]

puzzle_plan_local = [
    {"scenario": "pewter_brock", "phase": "puzzle", "timesteps": 40_000},
    # {"scenario": "cerulean_misty", "phase": "puzzle", "timesteps": 50_000},
]

hybrid_plan_local = [
    {"scenario": "pewter_brock", "phase": "battle", "timesteps": 50_000},
    # {"scenario": "vermillion_lt_surge", "phase": "battle", "timesteps": 60_000},
]

DEFAULT_TIMESTEPS_LOCAL = 40_000
DEFAULT_HEADLESS_LOCAL = False  # Set to True if you do not need the SDL window

## 6. Run the combat plan

**IMPORTANT**: If your previous training run showed `value_loss > 1000` or `explained_variance < 0.1`, the model **did not learn correctly**.

**Symptoms of a failed training run:**
- value_loss = 3200 (should be close to 0)
- explained_variance = 0.036 (should be >0.5)
- Constant reward during evaluation
- Episodes end in a timeout without progress

**Fix**: The next cell automatically uses **stabilized** parameters. Just run it to retrain with a more robust configuration.
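The two numeric thresholds above can be turned into a quick check. The sketch below uses hypothetical helper names; `explained_variance` follows the usual definition, 1 - Var(returns - predictions) / Var(returns), so a value head that predicts a constant scores 0 and a perfect one scores 1.

```python
from statistics import pvariance


def explained_variance(returns, predicted_values):
    """1 - Var(returns - predictions) / Var(returns); NaN if returns are constant."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")
    residuals = [r - v for r, v in zip(returns, predicted_values)]
    return 1.0 - pvariance(residuals) / var_returns


def looks_failed(value_loss, explained_var):
    """Apply the two numeric thresholds quoted in the text."""
    return value_loss > 1000 or explained_var < 0.1


# A value head that always predicts the same number explains nothing.
constant_head = explained_variance([0.0, 1.0, 2.0, 3.0], [1.5, 1.5, 1.5, 1.5])
print(constant_head)

# Values taken from the symptom list.
print(looks_failed(value_loss=3200, explained_var=0.036))
```

These cutoffs are rules of thumb: a reasonable `value_loss` depends on the reward scale, so treat the check as a prompt to inspect the logs rather than a verdict.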

In [None]:
# ==================== TRAINING WITH STABLE PARAMETERS ====================
# If your previous training run failed (high value_loss), this version uses more
# conservative parameters intended to reduce the risk of divergence.
# ==========================================================================

import torch  # needed below to select the device
from gymnasium import spaces

from advanced_agents.combat_apex_agent import CombatApexAgent, CombatAgentConfig

def train_combat_stable(scenario_id='pewter_brock', phase_name='battle', timesteps=40_000):
    """Train CombatApexAgent with stabilized parameters."""
    print(f"\n{'='*70}")
    print("   STABLE TRAINING - COMBAT APEX AGENT")
    print(f"   Scenario: {scenario_id} | Phase: {phase_name}")
    print(f"   Steps: {timesteps:,}")
    print("   Parameters: reduced LR, conservative clipping, limited gradients")
    print(f"{'='*70}\n")

    # Set up the environment
    phase = resolve_phase(scenario_id, phase_name)
    state_file_path = ensure_state_file(phase['state_file'])
    env_overrides = build_env_overrides(state_file_path, headless=True)
    base_config = _base_env_config(env_overrides)

    # STABLE configuration (parameters tuned to avoid divergence)
    agent_config = CombatAgentConfig(
        env_config=base_config,
        total_timesteps=timesteps,
        learning_rate=1e-4,      # More conservative than 2.5e-4
        n_steps=512,             # More frequent updates
        batch_size=128,          # Smaller batches
        gamma=0.998,             # Less influence from the distant future
        gae_lambda=0.95,
        clip_range=0.1,          # Stricter clipping
        vf_coef=0.25,            # Less weight on the value function
        ent_coef=0.01,           # Entropy bonus for exploration
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )

    # Create the agent
    agent = CombatApexAgent(agent_config)

    # Check the observation space
    env_check = agent.make_env()
    if isinstance(env_check.observation_space, spaces.Dict):
        print("Dict observation detected -> MultiInputPolicy")
        agent.policy_name = lambda: "MultiInputPolicy"
    env_check.close()

    # Train
    print("\n🚀 Starting stable training...")
    runtime = agent.train()

    # Save
    save_dir = os.path.join(MODELS_DIR, 'combat')
    os.makedirs(save_dir, exist_ok=True)
    save_path = os.path.join(save_dir, f"{scenario_id}_{phase_name}_stable.zip")
    runtime.model.save(save_path)

    print(f"\nSTABLE model saved to: {save_path}")
    print("Check the logs; you should see:")
    print("   - value_loss < 100 (ideally < 10)")
    print("   - explained_variance > 0.3 (improving toward 0.7+)")
    print("   - approx_kl < 0.05")

    return save_path

# RUN STABLE TRAINING
combat_model_stable = train_combat_stable(
    scenario_id='pewter_brock',
    phase_name='battle',
    timesteps=40_000
)

print(f"\n{'='*70}")
print("TRAINING COMPLETE")
print(f"{'='*70}")
print(f"Model saved to: {combat_model_stable}")
print("\n🎮 To try it out:")
print("python run_combat_agent_interactive.py --scenario pewter_brock --phase battle")
print("(Rename the _stable.zip file to .zip if necessary)")
print(f"{'='*70}")


>>> [COMBAT] Run 1/1
Dict observation detected -> MultiInputPolicy

=== Training COMBAT on pewter_brock (battle) for 40,000 steps ===



Using cuda device
Wrapping the env in a VecTransposeImage.


-----------------------------
| time/              |      |
|    fps             | 5    |
|    iterations      | 1    |
|    time_elapsed    | 191  |
|    total_timesteps | 1024 |
-----------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 5            |
|    iterations           | 2            |
|    time_elapsed         | 405          |
|    total_timesteps      | 2048         |
| train/                  |              |
|    approx_kl            | 0.007673993  |
|    clip_fraction        | 0.195        |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.94        |
|    explained_variance   | 0.0006688237 |
|    learning_rate        | 0.00025      |
|    loss                 | 6.58e+03     |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.00447     |
|    value_loss           | 3.3e+03      |
------------------------------------------


--------------------------------------------
| time/                   |                |
|    fps                  | 4              |
|    iterations           | 3              |
|    time_elapsed         | 618            |
|    total_timesteps      | 3072           |
| train/                  |                |
|    approx_kl            | 0.009874553    |
|    clip_fraction        | 0.3            |
|    clip_range           | 0.15           |
|    entropy_loss         | -1.92          |
|    explained_variance   | -1.1920929e-07 |
|    learning_rate        | 0.00025        |
|    loss                 | -0.0317        |
|    n_updates            | 20             |
|    policy_gradient_loss | -0.00729       |
|    value_loss           | 0.00461        |
--------------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 4            |
|    iterations           | 4            |
|    time_elapsed         | 831          |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0066226544 |
|    clip_fraction        | 0.126        |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.92        |
|    explained_variance   | 0.0013555288 |
|    learning_rate        | 0.00025      |
|    loss                 | -0.0152      |
|    n_updates            | 30           |
|    policy_gradient_loss | -0.00315     |
|    value_loss           | 3.29e+03     |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 5           |
|    time_elapsed         | 1043        |
|    total_timesteps      | 5120        |
| train/                  |             |
|    approx_kl            | 0.014401462 |
|    clip_fraction        | 0.241       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.93       |
|    explained_variance   | 0.0         |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0228     |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.00576    |
|    value_loss           | 0.000994    |
-----------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 4            |
|    iterations           | 6            |
|    time_elapsed         | 1255         |
|    total_timesteps      | 6144         |
| train/                  |              |
|    approx_kl            | 0.0073117116 |
|    clip_fraction        | 0.15         |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.93        |
|    explained_variance   | 0.0129413605 |
|    learning_rate        | 0.00025      |
|    loss                 | 6.54e+03     |
|    n_updates            | 50           |
|    policy_gradient_loss | -0.00125     |
|    value_loss           | 3.27e+03     |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 7           |
|    time_elapsed         | 1467        |
|    total_timesteps      | 7168        |
| train/                  |             |
|    approx_kl            | 0.010050582 |
|    clip_fraction        | 0.232       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.93       |
|    explained_variance   | 0.0         |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0236     |
|    n_updates            | 60          |
|    policy_gradient_loss | -0.00594    |
|    value_loss           | 0.000591    |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 8           |
|    time_elapsed         | 1679        |
|    total_timesteps      | 8192        |
| train/                  |             |
|    approx_kl            | 0.009176165 |
|    clip_fraction        | 0.259       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.94       |
|    explained_variance   | 0.015097022 |
|    learning_rate        | 0.00025     |
|    loss                 | 6.53e+03    |
|    n_updates            | 70          |
|    policy_gradient_loss | -0.00437    |
|    value_loss           | 3.27e+03    |
-----------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 4            |
|    iterations           | 9            |
|    time_elapsed         | 1892         |
|    total_timesteps      | 9216         |
| train/                  |              |
|    approx_kl            | 0.0069254516 |
|    clip_fraction        | 0.152        |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.93        |
|    explained_variance   | 0.0          |
|    learning_rate        | 0.00025      |
|    loss                 | -0.0262      |
|    n_updates            | 80           |
|    policy_gradient_loss | -0.00371     |
|    value_loss           | 0.000445     |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 10          |
|    time_elapsed         | 2104        |
|    total_timesteps      | 10240       |
| train/                  |             |
|    approx_kl            | 0.006389723 |
|    clip_fraction        | 0.0894      |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.93       |
|    explained_variance   | 0.016813517 |
|    learning_rate        | 0.00025     |
|    loss                 | 6.52e+03    |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.00192    |
|    value_loss           | 3.26e+03    |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 11          |
|    time_elapsed         | 2316        |
|    total_timesteps      | 11264       |
| train/                  |             |
|    approx_kl            | 0.009003445 |
|    clip_fraction        | 0.232       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.93       |
|    explained_variance   | 0.0         |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0253     |
|    n_updates            | 100         |
|    policy_gradient_loss | -0.00473    |
|    value_loss           | 0.000292    |
-----------------------------------------


----------------------------------------
| time/                   |            |
|    fps                  | 4          |
|    iterations           | 12         |
|    time_elapsed         | 2528       |
|    total_timesteps      | 12288      |
| train/                  |            |
|    approx_kl            | 0.00857245 |
|    clip_fraction        | 0.254      |
|    clip_range           | 0.15       |
|    entropy_loss         | -1.92      |
|    explained_variance   | 0.01816523 |
|    learning_rate        | 0.00025    |
|    loss                 | -0.0251    |
|    n_updates            | 110        |
|    policy_gradient_loss | -0.00577   |
|    value_loss           | 3.26e+03   |
----------------------------------------


-------------------------------------------
| time/                   |               |
|    fps                  | 4             |
|    iterations           | 13            |
|    time_elapsed         | 2740          |
|    total_timesteps      | 13312         |
| train/                  |               |
|    approx_kl            | 0.0055834865  |
|    clip_fraction        | 0.146         |
|    clip_range           | 0.15          |
|    entropy_loss         | -1.92         |
|    explained_variance   | 5.9604645e-08 |
|    learning_rate        | 0.00025       |
|    loss                 | -0.0306       |
|    n_updates            | 120           |
|    policy_gradient_loss | -0.00292      |
|    value_loss           | 0.000548      |
-------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 14          |
|    time_elapsed         | 2953        |
|    total_timesteps      | 14336       |
| train/                  |             |
|    approx_kl            | 0.00931727  |
|    clip_fraction        | 0.335       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.92       |
|    explained_variance   | 0.019625008 |
|    learning_rate        | 0.00025     |
|    loss                 | 6.5e+03     |
|    n_updates            | 130         |
|    policy_gradient_loss | -0.00648    |
|    value_loss           | 3.25e+03    |
-----------------------------------------


-------------------------------------------
| time/                   |               |
|    fps                  | 4             |
|    iterations           | 15            |
|    time_elapsed         | 3165          |
|    total_timesteps      | 15360         |
| train/                  |               |
|    approx_kl            | 0.007498509   |
|    clip_fraction        | 0.246         |
|    clip_range           | 0.15          |
|    entropy_loss         | -1.92         |
|    explained_variance   | 5.9604645e-08 |
|    learning_rate        | 0.00025       |
|    loss                 | -0.0215       |
|    n_updates            | 140           |
|    policy_gradient_loss | -0.00521      |
|    value_loss           | 0.000224      |
-------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 16          |
|    time_elapsed         | 3377        |
|    total_timesteps      | 16384       |
| train/                  |             |
|    approx_kl            | 0.006463942 |
|    clip_fraction        | 0.245       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.92       |
|    explained_variance   | 0.020989478 |
|    learning_rate        | 0.00025     |
|    loss                 | 6.49e+03    |
|    n_updates            | 150         |
|    policy_gradient_loss | -0.00489    |
|    value_loss           | 3.25e+03    |
-----------------------------------------


--------------------------------------------
| time/                   |                |
|    fps                  | 4              |
|    iterations           | 17             |
|    time_elapsed         | 3588           |
|    total_timesteps      | 17408          |
| train/                  |                |
|    approx_kl            | 0.009252979    |
|    clip_fraction        | 0.342          |
|    clip_range           | 0.15           |
|    entropy_loss         | -1.92          |
|    explained_variance   | -1.1920929e-07 |
|    learning_rate        | 0.00025        |
|    loss                 | -0.0326        |
|    n_updates            | 160            |
|    policy_gradient_loss | -0.0101        |
|    value_loss           | 0.000705       |
--------------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 4            |
|    iterations           | 18           |
|    time_elapsed         | 3800         |
|    total_timesteps      | 18432        |
| train/                  |              |
|    approx_kl            | 0.0073879464 |
|    clip_fraction        | 0.231        |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.93        |
|    explained_variance   | 0.022357464  |
|    learning_rate        | 0.00025      |
|    loss                 | -0.0136      |
|    n_updates            | 170          |
|    policy_gradient_loss | -0.00372     |
|    value_loss           | 3.24e+03     |
------------------------------------------


-------------------------------------------
| time/                   |               |
|    fps                  | 4             |
|    iterations           | 19            |
|    time_elapsed         | 4013          |
|    total_timesteps      | 19456         |
| train/                  |               |
|    approx_kl            | 0.0074165924  |
|    clip_fraction        | 0.191         |
|    clip_range           | 0.15          |
|    entropy_loss         | -1.91         |
|    explained_variance   | 5.9604645e-08 |
|    learning_rate        | 0.00025       |
|    loss                 | -0.0325       |
|    n_updates            | 180           |
|    policy_gradient_loss | -0.00607      |
|    value_loss           | 0.00063       |
-------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 20          |
|    time_elapsed         | 4225        |
|    total_timesteps      | 20480       |
| train/                  |             |
|    approx_kl            | 0.009384371 |
|    clip_fraction        | 0.279       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.91       |
|    explained_variance   | 0.023714006 |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0224     |
|    n_updates            | 190         |
|    policy_gradient_loss | -0.00608    |
|    value_loss           | 3.24e+03    |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 21          |
|    time_elapsed         | 4438        |
|    total_timesteps      | 21504       |
| train/                  |             |
|    approx_kl            | 0.006461496 |
|    clip_fraction        | 0.163       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.91       |
|    explained_variance   | 0.0         |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0283     |
|    n_updates            | 200         |
|    policy_gradient_loss | -0.0047     |
|    value_loss           | 0.000108    |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 22          |
|    time_elapsed         | 4651        |
|    total_timesteps      | 22528       |
| train/                  |             |
|    approx_kl            | 0.008232214 |
|    clip_fraction        | 0.121       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.93       |
|    explained_variance   | 0.025066972 |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0133     |
|    n_updates            | 210         |
|    policy_gradient_loss | -0.00149    |
|    value_loss           | 3.23e+03    |
-----------------------------------------


--------------------------------------------
| time/                   |                |
|    fps                  | 4              |
|    iterations           | 23             |
|    time_elapsed         | 4863           |
|    total_timesteps      | 23552          |
| train/                  |                |
|    approx_kl            | 0.004975764    |
|    clip_fraction        | 0.15           |
|    clip_range           | 0.15           |
|    entropy_loss         | -1.91          |
|    explained_variance   | -1.1920929e-07 |
|    learning_rate        | 0.00025        |
|    loss                 | -0.0329        |
|    n_updates            | 220            |
|    policy_gradient_loss | -0.00483       |
|    value_loss           | 0.000482       |
--------------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 4            |
|    iterations           | 24           |
|    time_elapsed         | 5076         |
|    total_timesteps      | 24576        |
| train/                  |              |
|    approx_kl            | 0.0065936847 |
|    clip_fraction        | 0.126        |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.92        |
|    explained_variance   | 0.026413202  |
|    learning_rate        | 0.00025      |
|    loss                 | -0.00726     |
|    n_updates            | 230          |
|    policy_gradient_loss | -0.000469    |
|    value_loss           | 3.23e+03     |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 25          |
|    time_elapsed         | 5288        |
|    total_timesteps      | 25600       |
| train/                  |             |
|    approx_kl            | 0.009861588 |
|    clip_fraction        | 0.244       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.91       |
|    explained_variance   | 0.0         |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0117     |
|    n_updates            | 240         |
|    policy_gradient_loss | -0.00623    |
|    value_loss           | 0.00127     |
-----------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 4            |
|    iterations           | 26           |
|    time_elapsed         | 5501         |
|    total_timesteps      | 26624        |
| train/                  |              |
|    approx_kl            | 0.0069283834 |
|    clip_fraction        | 0.174        |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.91        |
|    explained_variance   | 0.027794242  |
|    learning_rate        | 0.00025      |
|    loss                 | -0.0315      |
|    n_updates            | 250          |
|    policy_gradient_loss | -0.00271     |
|    value_loss           | 3.22e+03     |
------------------------------------------


--------------------------------------------
| time/                   |                |
|    fps                  | 4              |
|    iterations           | 27             |
|    time_elapsed         | 5713           |
|    total_timesteps      | 27648          |
| train/                  |                |
|    approx_kl            | 0.006899758    |
|    clip_fraction        | 0.189          |
|    clip_range           | 0.15           |
|    entropy_loss         | -1.93          |
|    explained_variance   | -2.3841858e-07 |
|    learning_rate        | 0.00025        |
|    loss                 | -0.0325        |
|    n_updates            | 260            |
|    policy_gradient_loss | -0.00461       |
|    value_loss           | 0.000769       |
--------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 28          |
|    time_elapsed         | 5925        |
|    total_timesteps      | 28672       |
| train/                  |             |
|    approx_kl            | 0.006442425 |
|    clip_fraction        | 0.193       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.93       |
|    explained_variance   | 0.029028237 |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0214     |
|    n_updates            | 270         |
|    policy_gradient_loss | -0.00395    |
|    value_loss           | 3.22e+03    |
-----------------------------------------


-------------------------------------------
| time/                   |               |
|    fps                  | 4             |
|    iterations           | 29            |
|    time_elapsed         | 6137          |
|    total_timesteps      | 29696         |
| train/                  |               |
|    approx_kl            | 0.009296153   |
|    clip_fraction        | 0.193         |
|    clip_range           | 0.15          |
|    entropy_loss         | -1.92         |
|    explained_variance   | 5.9604645e-08 |
|    learning_rate        | 0.00025       |
|    loss                 | -0.0261       |
|    n_updates            | 280           |
|    policy_gradient_loss | -0.00734      |
|    value_loss           | 0.00479       |
-------------------------------------------


------------------------------------------
| time/                   |              |
|    fps                  | 4            |
|    iterations           | 30           |
|    time_elapsed         | 6349         |
|    total_timesteps      | 30720        |
| train/                  |              |
|    approx_kl            | 0.0055149514 |
|    clip_fraction        | 0.212        |
|    clip_range           | 0.15         |
|    entropy_loss         | -1.9         |
|    explained_variance   | 0.030292451  |
|    learning_rate        | 0.00025      |
|    loss                 | -0.0262      |
|    n_updates            | 290          |
|    policy_gradient_loss | -0.00419     |
|    value_loss           | 3.22e+03     |
------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 31          |
|    time_elapsed         | 6562        |
|    total_timesteps      | 31744       |
| train/                  |             |
|    approx_kl            | 0.017431054 |
|    clip_fraction        | 0.183       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.89       |
|    explained_variance   | 0.0         |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0298     |
|    n_updates            | 300         |
|    policy_gradient_loss | -0.00464    |
|    value_loss           | 0.000128    |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 32          |
|    time_elapsed         | 6774        |
|    total_timesteps      | 32768       |
| train/                  |             |
|    approx_kl            | 0.005536956 |
|    clip_fraction        | 0.113       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.89       |
|    explained_variance   | 0.03149408  |
|    learning_rate        | 0.00025     |
|    loss                 | 6.42e+03    |
|    n_updates            | 310         |
|    policy_gradient_loss | -0.00196    |
|    value_loss           | 3.21e+03    |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 33          |
|    time_elapsed         | 6987        |
|    total_timesteps      | 33792       |
| train/                  |             |
|    approx_kl            | 0.009741454 |
|    clip_fraction        | 0.247       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.89       |
|    explained_variance   | 0.0         |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0128     |
|    n_updates            | 320         |
|    policy_gradient_loss | -0.00581    |
|    value_loss           | 0.000685    |
-----------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 34          |
|    time_elapsed         | 7200        |
|    total_timesteps      | 34816       |
| train/                  |             |
|    approx_kl            | 0.009235102 |
|    clip_fraction        | 0.218       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.89       |
|    explained_variance   | 0.032627106 |
|    learning_rate        | 0.00025     |
|    loss                 | 6.41e+03    |
|    n_updates            | 330         |
|    policy_gradient_loss | -0.00387    |
|    value_loss           | 3.21e+03    |
-----------------------------------------


-------------------------------------------
| time/                   |               |
|    fps                  | 4             |
|    iterations           | 35            |
|    time_elapsed         | 7413          |
|    total_timesteps      | 35840         |
| train/                  |               |
|    approx_kl            | 0.00914288    |
|    clip_fraction        | 0.195         |
|    clip_range           | 0.15          |
|    entropy_loss         | -1.86         |
|    explained_variance   | 1.7881393e-07 |
|    learning_rate        | 0.00025       |
|    loss                 | -0.0209       |
|    n_updates            | 340           |
|    policy_gradient_loss | -0.00565      |
|    value_loss           | 0.000643      |
-------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 36          |
|    time_elapsed         | 7625        |
|    total_timesteps      | 36864       |
| train/                  |             |
|    approx_kl            | 0.00937172  |
|    clip_fraction        | 0.243       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.85       |
|    explained_variance   | 0.034033835 |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0264     |
|    n_updates            | 350         |
|    policy_gradient_loss | -0.00438    |
|    value_loss           | 3.2e+03     |
-----------------------------------------


--------------------------------------------
| time/                   |                |
|    fps                  | 4              |
|    iterations           | 37             |
|    time_elapsed         | 7836           |
|    total_timesteps      | 37888          |
| train/                  |                |
|    approx_kl            | 0.0071684606   |
|    clip_fraction        | 0.14           |
|    clip_range           | 0.15           |
|    entropy_loss         | -1.84          |
|    explained_variance   | -1.1920929e-07 |
|    learning_rate        | 0.00025        |
|    loss                 | -0.0319        |
|    n_updates            | 360            |
|    policy_gradient_loss | -0.00321       |
|    value_loss           | 0.000568       |
--------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 38          |
|    time_elapsed         | 8047        |
|    total_timesteps      | 38912       |
| train/                  |             |
|    approx_kl            | 0.013177071 |
|    clip_fraction        | 0.158       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.83       |
|    explained_variance   | 0.035197556 |
|    learning_rate        | 0.00025     |
|    loss                 | -0.00521    |
|    n_updates            | 370         |
|    policy_gradient_loss | -0.00362    |
|    value_loss           | 3.2e+03     |
-----------------------------------------


-------------------------------------------
| time/                   |               |
|    fps                  | 4             |
|    iterations           | 39            |
|    time_elapsed         | 8258          |
|    total_timesteps      | 39936         |
| train/                  |               |
|    approx_kl            | 0.009246286   |
|    clip_fraction        | 0.212         |
|    clip_range           | 0.15          |
|    entropy_loss         | -1.82         |
|    explained_variance   | 1.1920929e-07 |
|    learning_rate        | 0.00025       |
|    loss                 | -0.0392       |
|    n_updates            | 380           |
|    policy_gradient_loss | -0.00546      |
|    value_loss           | 0.00176       |
-------------------------------------------


-----------------------------------------
| time/                   |             |
|    fps                  | 4           |
|    iterations           | 40          |
|    time_elapsed         | 8455        |
|    total_timesteps      | 40960       |
| train/                  |             |
|    approx_kl            | 0.011068063 |
|    clip_fraction        | 0.229       |
|    clip_range           | 0.15        |
|    entropy_loss         | -1.82       |
|    explained_variance   | 0.036417663 |
|    learning_rate        | 0.00025     |
|    loss                 | -0.0234     |
|    n_updates            | 390         |
|    policy_gradient_loss | -0.00463    |
|    value_loss           | 3.2e+03     |
-----------------------------------------


Modelo guardado en c:\Users\javi1\Documents\repos_git\TEL351-PokemonRed\models_local\combat\pewter_brock_battle.zip
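
The path printed above follows an `<agent>/<scenario>_<phase>.zip` layout, which is also what the evaluation cell later in this notebook looks for. A minimal sketch of that naming rule, assuming a hypothetical helper `local_model_path` (the name is not part of the project code):

```python
import os

def local_model_path(models_dir, agent_key, scenario_id, phase_name):
    """Build the checkpoint path for one agent/scenario/phase combination."""
    # Hypothetical helper: mirrors the <agent>/<scenario>_<phase>.zip layout.
    return os.path.join(models_dir, agent_key, f"{scenario_id}_{phase_name}.zip")

path = local_model_path("models_local", "combat", "pewter_brock", "battle")
print(path)
```

Keeping the rule in one place makes it harder for the training and evaluation cells to drift apart on file names.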


## 7. Run the puzzle plan
Runs the `puzzle_plan_local` plan with `PuzzleSpeedAgent` and saves the outputs to `models_local/puzzle/`. Useful for measuring navigation and puzzle-solving times ahead of the combat phase.

In [10]:
puzzle_runs_local = train_plan(
    agent_key='puzzle',
    plan=puzzle_plan_local,
    default_timesteps=DEFAULT_TIMESTEPS_LOCAL,
    headless=DEFAULT_HEADLESS_LOCAL
)


>>> [PUZZLE] Ejecución 1/1
Observación Dict detectada -> MultiInputPolicy

=== Entrenando PUZZLE en pewter_brock (puzzle) por 200 pasos ===
Using cuda device
Wrapping the env in a VecTransposeImage.


AttributeError: 'RedGymEnv' object has no attribute 'seen_coords'

## 8. Run the hybrid plan
Runs `HybridSageAgent` on the scenarios defined in `hybrid_plan_local`, combining combat and navigation behaviors, and stores the results in `models_local/hybrid/`.

In [None]:
hybrid_runs_local = train_plan(
    agent_key='hybrid',
    plan=hybrid_plan_local,
    default_timesteps=DEFAULT_TIMESTEPS_LOCAL,
    headless=DEFAULT_HEADLESS_LOCAL
)

## 9. Manual save (optional)
Example snippet for saving a trained model under a custom name. Use it only if the session already defines `model`, `AGENT_TYPE`, `SCENARIO_ID`, and `PHASE_NAME`; otherwise it raises a `NameError`, as the output below shows.

In [20]:
# Save the model
save_dir = "models_local"
os.makedirs(save_dir, exist_ok=True)
save_path = os.path.join(save_dir, f"{AGENT_TYPE}_{SCENARIO_ID}_{PHASE_NAME}")
model.save(save_path)
print(f"Modelo guardado en {save_path}")

NameError: name 'AGENT_TYPE' is not defined
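
The `NameError` above is what the snippet produces when those variables were never defined in the session. One way to fail more gently is to check for them first. The sketch below assumes a hypothetical helper `missing_names` and uses a made-up namespace for the demo; in the notebook you would pass `globals()` instead.

```python
def missing_names(required, namespace):
    """Return the names in `required` that are not defined in `namespace`."""
    return [name for name in required if name not in namespace]

REQUIRED = ("model", "AGENT_TYPE", "SCENARIO_ID", "PHASE_NAME")

# Demo with a hypothetical namespace; in the notebook, pass globals() instead.
demo_namespace = {"model": object(), "SCENARIO_ID": "pewter_brock"}
missing = missing_names(REQUIRED, demo_namespace)
if missing:
    print(f"Skipping manual save; undefined names: {', '.join(missing)}")
```

Only when the returned list is empty would the save cell be worth running.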

## 10. Comparison with the baseline (PPO v2)

This section compares the performance of your trained agents (Combat, Puzzle, Hybrid) against a baseline.

**IMPORTANT - RAM limitations (16GB):**
- The `poke_26214400.zip` model (26M steps) needs >10GB just to load
- **Recommended alternative**: train your own lightweight baseline (40k-100k steps) instead of using the heavy model
- Or simply evaluate your local models without a comparison (see the following cells)

**Alternative for comparing without the heavy .zip:**
You can use `run_pretrained_interactive.py` as a baseline by running it manually and recording the metrics; this section, by contrast, automates the evaluation of **your models** without needing the giant baseline.

In [None]:
import pandas as pd
import numpy as np
from stable_baselines3 import PPO
from v2.red_gym_env_v2 import RedGymEnv

def load_baseline_model(path):
    if not os.path.exists(path):
        print(f"No se encontró el modelo baseline en: {path}")
        return None
    try:
        return PPO.load(path)
    except Exception as e:
        print(f"Error cargando baseline: {e}")
        return None

def evaluate_agent_model(model, env, num_episodes=1):
    """Run evaluation episodes and return the average metrics."""
    rewards = []
    steps = []
    
    for _ in range(num_episodes):
        # Handle both reset() return formats
        reset_result = env.reset()
        obs = reset_result[0] if isinstance(reset_result, tuple) else reset_result
        
        done = False
        truncated = False
        total_reward = 0
        step_count = 0
        max_steps = 5000  # Step limit per episode
        
        while not done and not truncated and step_count < max_steps:
            action, _ = model.predict(obs, deterministic=True)
            step_result = env.step(action)
            
            # Handle both step() return formats
            if len(step_result) == 5:
                obs, reward, done, truncated, info = step_result
            elif len(step_result) == 4:
                obs, reward, done, info = step_result
                truncated = False
            else:
                raise ValueError(f"Formato inesperado de step(): {len(step_result)} valores")
            
            # Convert the reward to a scalar
            reward_scalar = float(reward.item() if hasattr(reward, 'item') else reward)
            total_reward += reward_scalar
            step_count += 1
            
        rewards.append(total_reward)
        steps.append(step_count)
        
    return {
        'mean_reward': np.mean(rewards),
        'std_reward': np.std(rewards),
        'mean_steps': np.mean(steps)
    }

def run_comparison_lightweight(plans_dict, baseline_path=None, headless=True, skip_baseline=False):
    """
    Version optimized for limited RAM (<=16GB with Windows using 10GB).
    skip_baseline=True: only evaluates your local models (recommended for 16GB RAM)
    """
    results = []
    
    # Load the baseline only if requested and available
    baseline_model = None
    if not skip_baseline and baseline_path:
        print(f"Intentando cargar modelo baseline desde: {baseline_path}")
        baseline_model = load_baseline_model(baseline_path)
        if not baseline_model:
            print("No se pudo cargar el baseline. Solo se evaluarán modelos locales.")
    else:
        print("Modo sin baseline activado (ahorra ~10GB RAM)")
    
    for agent_key, plan in plans_dict.items():
        for entry in plan:
            scenario_id = entry['scenario']
            phase_name = entry.get('phase') or AGENT_REGISTRY[agent_key]['default_phase']
            
            print(f"\n--- Evaluando {agent_key.upper()} en {scenario_id} ({phase_name}) ---")
            
            # 1. Prepare the shared configuration
            phase = resolve_phase(scenario_id, phase_name)
            state_file_path = ensure_state_file(phase['state_file'])
            env_overrides = build_env_overrides(state_file_path, headless=headless)
            base_config = _base_env_config(env_overrides)
            
            # ---------------------------------------------------------
            # 2. Evaluate the local agent (with its own wrapper/env)
            # ---------------------------------------------------------
            registry_entry = AGENT_REGISTRY[agent_key]
            agent_config = registry_entry['config_cls'](
                env_config=base_config,
                total_timesteps=1000 
            )
            local_agent_wrapper = registry_entry['agent_cls'](agent_config)
            
            local_model_path = os.path.join(MODELS_DIR, agent_key, f"{scenario_id}_{phase_name}.zip")
            
            if os.path.exists(local_model_path):
                print(f"Cargando modelo local: {local_model_path}")
                try:
                    env_local = local_agent_wrapper.make_env()
                    local_agent_wrapper.model = PPO.load(local_model_path)
                    
                    print("Ejecutando evaluación...")
                    metrics_local = evaluate_agent_model(local_agent_wrapper.model, env_local)
                    env_local.close()
                    
                    print(f"Local: Reward={metrics_local['mean_reward']:.2f}, Steps={metrics_local['mean_steps']:.0f}")
                    
                    results.append({
                        'Agent': agent_key.upper(),
                        'Scenario': scenario_id,
                        'Phase': phase_name,
                        'Model': 'Local (Specialized)',
                        'Reward': metrics_local['mean_reward'],
                        'Steps': metrics_local['mean_steps']
                    })
                    
                    # Free memory
                    del local_agent_wrapper.model
                    del env_local
                    
                except Exception as e:
                    print(f"Error evaluando local: {e}")
                    import traceback
                    traceback.print_exc()
            else:
                print(f"No existe modelo local en {local_model_path}")

            # ---------------------------------------------------------
            # 3. Evaluate the baseline only if it is available
            # ---------------------------------------------------------
            if baseline_model:
                print("Evaluando Baseline...")
                try:
                    env_baseline = RedGymEnv(base_config)
                    metrics_baseline = evaluate_agent_model(baseline_model, env_baseline)
                    env_baseline.close()
                    
                    print(f"Baseline: Reward={metrics_baseline['mean_reward']:.2f}, Steps={metrics_baseline['mean_steps']:.0f}")
                    
                    results.append({
                        'Agent': agent_key.upper(),
                        'Scenario': scenario_id,
                        'Phase': phase_name,
                        'Model': 'Baseline (PPO v2)',
                        'Reward': metrics_baseline['mean_reward'],
                        'Steps': metrics_baseline['mean_steps']
                    })
                    
                    del env_baseline
                    
                except Exception as e:
                    print(f"Error evaluando baseline: {e}")

    return pd.DataFrame(results) if results else None
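
`evaluate_agent_model` accepts two environment conventions: Gymnasium's `reset()` returns `(obs, info)` and its `step()` returns five values, while older Gym environments and Stable-Baselines3 `VecEnv`s return the observation alone from `reset()` and four values from `step()`. The sketch below isolates that normalization in two hypothetical helpers (`unpack_reset` and `unpack_step` are illustrative names, not part of the project). Like the original check, it assumes observations are never plain tuples.

```python
def unpack_reset(reset_result):
    """Return only the observation from either reset() convention."""
    # (obs, info) tuple -> obs; anything else is assumed to be the obs itself.
    if isinstance(reset_result, tuple):
        return reset_result[0]
    return reset_result

def unpack_step(step_result):
    """Return (obs, reward, terminated, truncated, info) from a 4- or 5-tuple."""
    if len(step_result) == 5:
        return tuple(step_result)
    if len(step_result) == 4:
        obs, reward, done, info = step_result
        # The 4-value convention has no separate truncation flag.
        return obs, reward, done, False, info
    raise ValueError(f"Unexpected step() result with {len(step_result)} values")

print(unpack_step(("obs", 1.0, True, {})))
```

With helpers like these, the evaluation loop can always unpack five values regardless of which kind of environment `make_env()` returns.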

In [23]:
# ==================== COMPARISON SETTINGS ====================
# For systems with 16GB RAM (with Windows using ~10GB):
# skip_baseline=True: only evaluates your models (saves ~10GB)
# skip_baseline=False: tries to load the baseline (needs >20GB total RAM)
# ==============================================================

BASELINE_MODEL_PATH = os.path.join(project_path, 'v2', 'runs', 'poke_26214400.zip')

comparison_plans = {
    'combat': combat_plan_local,
    # 'puzzle': puzzle_plan_local,   # Commented out to evaluate fewer models
    # 'hybrid': hybrid_plan_local,   # Commented out to evaluate fewer models
}

# IMPORTANT: skip_baseline=True to save RAM
df_results = run_comparison_lightweight(
    comparison_plans, 
    baseline_path=BASELINE_MODEL_PATH, 
    headless=True,
    skip_baseline=True  # Set to False only if you have >24GB RAM
)

if df_results is not None and not df_results.empty:
    print("\n" + "="*60)
    print("           RESULTADOS DE EVALUACIÓN")
    print("="*60)
    print(df_results.to_string(index=False))
    print("="*60)
    
    # Save CSV
    csv_path = "evaluacion_modelos_locales.csv"
    df_results.to_csv(csv_path, index=False)
    print(f"\nResultados guardados en: {csv_path}")
else:
    print("\nNo se generaron resultados. Verifica que existan modelos entrenados.")

Modo sin baseline activado (ahorra ~10GB RAM)

--- Evaluando COMBAT en pewter_brock (battle) ---
Cargando modelo local: c:\Users\javi1\Documents\repos_git\TEL351-PokemonRed\models_local\combat\pewter_brock_battle.zip
Error evaluando local: too many values to unpack (expected 2)

No se generaron resultados. Verifica que existan modelos entrenados.


## 11. Train a lightweight baseline (optional alternative to the heavy model)

If you want to compare your specialized agents against a generic PPO baseline **without using the giant 26M-step model**, you can train your own lightweight baseline here. It is a standard model on `v2/red_gym_env_v2.py`, trained for the **same 40k steps** as your specialized agents so the comparison is fair.

In [None]:
import torch  # needed for torch.cuda.is_available() below
from stable_baselines3 import PPO
from v2.red_gym_env_v2 import RedGymEnv

def train_lightweight_baseline(scenario_id='pewter_brock', phase_name='battle', timesteps=40_000):
    """
    Train a simple PPO baseline (no specialized wrappers) for a fair comparison.
    Uses the same number of steps as your specialized agents.
    """
    print(f"\n{'='*60}")
    print("   ENTRENANDO BASELINE LIGERO (PPO Genérico)")
    print(f"   Escenario: {scenario_id} | Fase: {phase_name}")
    print(f"   Pasos: {timesteps:,}")
    print(f"{'='*60}\n")
    
    # Prepare the environment configuration (same as your agents)
    phase = resolve_phase(scenario_id, phase_name)
    state_file_path = ensure_state_file(phase['state_file'])
    env_overrides = build_env_overrides(state_file_path, headless=True)
    base_config = _base_env_config(env_overrides)
    
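    # Create the standard environment (no specialized wrappers)
    env = RedGymEnv(base_config)
    
    # Pick the policy from the observation space: SB3 needs MultiInputPolicy for
    # Dict observations and CnnPolicy for plain image observations.
    # (`spaces` comes from `from gymnasium import spaces` in the first cell.)
    policy_name = "MultiInputPolicy" if isinstance(env.observation_space, spaces.Dict) else "CnnPolicy"
    
    # Create the PPO model with settings similar to your agents
    model = PPO(
        policy_name,
        env,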
        learning_rate=2.5e-4,
        n_steps=1024,
        batch_size=256,
        gamma=0.999,
        verbose=1,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )
    
    # Train
    try:
        import tqdm, rich
        model.learn(total_timesteps=timesteps, progress_bar=True)
    except ImportError:
        model.learn(total_timesteps=timesteps)
    
    # Save
    baseline_dir = os.path.join(MODELS_DIR, 'baseline_lightweight')
    os.makedirs(baseline_dir, exist_ok=True)
    baseline_path = os.path.join(baseline_dir, f"{scenario_id}_{phase_name}.zip")
    model.save(baseline_path)
    
    print(f"\nBaseline ligero guardado en: {baseline_path}")
    env.close()
    
    return baseline_path

# Train the baseline (uncomment to run)
# baseline_ligero_path = train_lightweight_baseline(
#     scenario_id='pewter_brock',
#     phase_name='battle',
#     timesteps=40_000  # Same number of steps as your agents
# )

### Compare against the lightweight baseline

Once the lightweight baseline is trained, you can compare it with your specialized agents using this cell:

In [None]:
# Path to the lightweight baseline you just trained
BASELINE_LIGERO_PATH = os.path.join(project_path, 'models_local', 'baseline_lightweight', 'pewter_brock_battle.zip')

# Compare (only if the lightweight baseline exists)
if os.path.exists(BASELINE_LIGERO_PATH):
    print("Comparando con Baseline Ligero (entrenado con los mismos 40k pasos)")
    
    df_comparison = run_comparison_lightweight(
        {'combat': combat_plan_local},
        baseline_path=BASELINE_LIGERO_PATH,
        headless=True,
        skip_baseline=False  # Here we DO load the baseline (it is lightweight)
    )
    
    if df_comparison is not None and not df_comparison.empty:
        print("\n" + "="*70)
        print("     COMPARACIÓN: AGENTE ESPECIALIZADO vs BASELINE LIGERO")
        print("="*70)
        print(df_comparison.to_string(index=False))
        print("="*70)
        
        # Compute the relative improvement (skipped when the baseline reward is 0)
        if len(df_comparison) == 2:
            reward_especializado = df_comparison[df_comparison['Model'].str.contains('Specialized')]['Reward'].values[0]
            reward_baseline = df_comparison[df_comparison['Model'].str.contains('Baseline')]['Reward'].values[0]
            if reward_baseline != 0:
                mejora = ((reward_especializado - reward_baseline) / abs(reward_baseline)) * 100
                print(f"\nMejora del agente especializado: {mejora:+.1f}%")
        
        df_comparison.to_csv("comparacion_especializado_vs_baseline_ligero.csv", index=False)
else:
    print(f"Baseline ligero no encontrado en: {BASELINE_LIGERO_PATH}")
    print("Ejecuta primero la celda anterior para entrenar el baseline ligero.")
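
The improvement figure in the cell above is the reward difference divided by the absolute value of the baseline reward, so that a negative baseline still yields a correctly signed percentage. A minimal sketch of that formula as a standalone function, assuming a hypothetical name `percent_improvement` and a choice to return `None` when the baseline reward is exactly zero (the percentage is undefined there):

```python
def percent_improvement(specialized, baseline):
    """Relative improvement in percent, or None when the baseline reward is 0."""
    if baseline == 0:
        return None  # undefined: cannot divide by a zero baseline
    return (specialized - baseline) / abs(baseline) * 100.0

# Made-up rewards, for illustration only.
print(percent_improvement(12.0, 10.0))
print(percent_improvement(-5.0, -10.0))
print(percent_improvement(3.0, 0.0))
```

Dividing by `abs(baseline)` rather than `baseline` is what keeps the second example positive: moving from -10 to -5 is an improvement, not a regression.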

In [None]:
# =================================================================================
# AGENT EVALUATION IN A GYM SCENARIO (HEADLESS + METRICS)
# =================================================================================

import json
import time
import sys
import numpy as np
from pathlib import Path
from stable_baselines3 import PPO
from gym_scenarios.gym_metrics import GymMetricsTracker

# Memory addresses needed for the injection
# (copied from memory_addresses.py to make sure they are available)
PARTY_SIZE_ADDRESS = 0xD163
PARTY_ADDRESSES = [0xD164, 0xD165, 0xD166, 0xD167, 0xD168, 0xD169]
LEVELS_ADDRESSES = [0xD18C, 0xD1B8, 0xD1E4, 0xD210, 0xD23C, 0xD268]
HP_ADDRESSES = [0xD16C, 0xD198, 0xD1C4, 0xD1F0, 0xD21C, 0xD248]
MAX_HP_ADDRESSES = [0xD18D, 0xD1B9, 0xD1E5, 0xD211, 0xD23D, 0xD269]
MONEY_ADDRESS_1 = 0xD347
MONEY_ADDRESS_2 = 0xD348
MONEY_ADDRESS_3 = 0xD349
BADGE_COUNT_ADDRESS = 0xD356
BAG_ITEMS_START = 0xD31E
BAG_ITEM_COUNT = 0xD31D

def get_base_env(env):
    """Return the base environment (RedGymEnv) from a wrapper or VecEnv."""
    if hasattr(env, 'envs'):  # DummyVecEnv
        env = env.envs[0]
    if hasattr(env, 'unwrapped'):
        return env.unwrapped
    return env

def inject_gym_config(env, config):
    """Inject the team and inventory configuration into the emulator's memory."""
    base_env = get_base_env(env)
    pyboy = base_env.pyboy
    
    def write_mem(addr, val):
        if hasattr(pyboy, "set_memory_value"):
            pyboy.set_memory_value(addr, val & 0xFF)
        else:
            pyboy.memory[addr] = val & 0xFF

    def write_word(addr, val):
        write_mem(addr, (val >> 8) & 0xFF)
        write_mem(addr + 1, val & 0xFF)

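    def write_bcd(val):
        # Despite its name, this only encodes a two-digit value (0-99) as one BCD byte; the caller writes it.
        return ((val // 10) << 4) | (val % 10)

    print("Injecting team and inventory configuration...")

    # 1. Team
    team = config.get('player_team', [])
    write_mem(PARTY_SIZE_ADDRESS, len(team))
    for i, poke in enumerate(team):
        # Default to the list position so entries without a 'slot' do not all land in slot 1
        slot = poke.get('slot', i + 1) - 1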
        if 0 <= slot < 6:
            write_mem(PARTY_ADDRESSES[slot], poke.get('species_id', 0))
            write_mem(LEVELS_ADDRESSES[slot], poke.get('level', 5))
            write_word(HP_ADDRESSES[slot], poke.get('current_hp', 20))
            write_word(MAX_HP_ADDRESSES[slot], poke.get('max_hp', 20))

    # 2. Items
    items = config.get('bag_items', [])
    item_count = min(len(items), 20)
    write_mem(BAG_ITEM_COUNT, item_count)
    for i, item in enumerate(items[:20]):
        base = BAG_ITEMS_START + (i * 2)
        write_mem(base, item.get('item_id', 0))
        write_mem(base + 1, item.get('quantity', 1))
    write_mem(BAG_ITEMS_START + (item_count * 2), 0xFF)

    # 3. Money and badges
    money = config.get('money', 0)
    write_mem(MONEY_ADDRESS_1, write_bcd(money // 10000))
    write_mem(MONEY_ADDRESS_2, write_bcd((money // 100) % 100))
    write_mem(MONEY_ADDRESS_3, write_bcd(money % 100))
    write_mem(BADGE_COUNT_ADDRESS, config.get('badge_bits', 0))

    # 4. Safe warp
    start_pos = config.get('start_position', {'x': 4, 'y': 13})
    map_id = config.get('map_id', 0)

    print(f"🌀 Scheduling warp to map {map_id} ({start_pos['x']}, {start_pos['y']})...")
    write_mem(0xD365, map_id)          # wWarpDestMap
    write_mem(0xD366, start_pos['x'])  # wWarpDestX
    write_mem(0xD367, start_pos['y'])  # wWarpDestY
    
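    # NOTE: the variable is named after wd72d, but the address accessed is 0xD12B;
    # check it against the ROM symbol file if the warp does not trigger.
    if hasattr(pyboy, "get_memory_value"):
        current_wd72d = pyboy.get_memory_value(0xD12B)
    else:
        current_wd72d = pyboy.memory[0xD12B]
    write_mem(0xD12B, current_wd72d | 0x08)  # Trigger warp
    write_mem(0xD35D, 0x00)  # Reset script state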

    return map_id, start_pos

def evaluate_gym_scenario(model_path, scenario_path, headless=True):
    """Run the full evaluation of a gym scenario."""

    # Paths
    state_file = os.path.join(scenario_path, "gym_scenario.state")
    config_file = os.path.join(scenario_path, "team_config.json")

    if not os.path.exists(model_path):
        print(f"Model not found: {model_path}")
        return
    if not os.path.exists(state_file):
        print(f"State not found: {state_file}")
        return
    if not os.path.exists(config_file):
        print(f"Configuration not found: {config_file}")
        return

    # Load the configuration
    with open(config_file, 'r', encoding='utf-8') as f:
        team_config = json.load(f)

    print(f"\nEvaluating in: {team_config.get('gym_name', 'Unknown Gym')}")
    print(f"Model: {os.path.basename(model_path)}")
    print(f"Headless mode: {headless}")

    # Configure the environment
    env_overrides = build_env_overrides(state_file, headless=headless)
    env_overrides['max_steps'] = 2048 * 5
    base_config = _base_env_config(env_overrides)

    # Instantiate the agent
    agent_config = CombatAgentConfig(env_config=base_config, total_timesteps=1000)
    agent_wrapper = CombatApexAgent(agent_config)

    # Load the model
    print("Loading model weights...")
    agent_wrapper.model = PPO.load(model_path)

    # Create the environment
    env = agent_wrapper.make_env()
    
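    # --- FIX: handle VecEnv vs. standard Env ---
    # A Gymnasium env returns (obs, info) from reset(); a VecEnv returns only obs.
    # Calling reset() once and inspecting the result avoids a second reset.
    reset_result = env.reset()
    obs = reset_result[0] if isinstance(reset_result, tuple) else reset_result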
    
    # Inject the configuration
    target_map, target_pos = inject_gym_config(env, team_config)
    base_env = get_base_env(env)  # Direct reference for memory reads

    # Warm-up so the warp can take effect
    print("Warming up the emulator for the warp (3s)...")
    for _ in range(180):
        base_env.pyboy.tick(1, False)
        if not headless:
            env.render()  # Keep the window alive

    # Check whether the warp worked
    current_map = base_env.read_m(0xD35E)
    if current_map != target_map:
        print(f"⚠️ WARNING: The warp failed. Current map: {current_map}, expected: {target_map}")
        print("   Retrying injection...")
        inject_gym_config(env, team_config)
        for _ in range(60):
            base_env.pyboy.tick(1, False)
            if not headless:
                env.render()
        
    # Initialize the tracker
    tracker = GymMetricsTracker(
        gym_number=team_config.get('gym_number', 1),
        agent_name="CombatApex_Local",
        gym_name=team_config.get('gym_name', "")
    )
    tracker.start()
    
    done = False
    truncated = False
    
    print("\nStarting agent run...")
    while not done and not truncated:

        # --- FIX: make sure obs is not a tuple before predict ---
        if isinstance(obs, tuple):
            obs = obs[0]

        # Predict
        action, _ = agent_wrapper.model.predict(obs, deterministic=True)

        # Step
        step_result = env.step(action)
        
        # --- FIX: flexible unpacking for VecEnv ---
        if len(step_result) == 4:
            obs, reward, done, info = step_result
            truncated = False
            # With a VecEnv, reward/done/info are batched (one entry per env)
            if isinstance(done, (list, np.ndarray)):
                done = done[0]
            if isinstance(reward, (list, np.ndarray)):
                reward = reward[0]
            if isinstance(info, (list, np.ndarray)):
                info = info[0]
        else:
            obs, reward, done, truncated, info = step_result

        # Render
        if not headless:
            env.render()
            
        # Record metrics (memory is read through base_env)
        # FIX: use native Python types to avoid JSON serialization errors
        game_state = {
            'x': int(base_env.read_m(0xD362)),
            'y': int(base_env.read_m(0xD361)),
            'map': int(base_env.read_m(0xD35E)),
            # HP is stored as two bytes, high byte first (matching write_word in inject_gym_config)
            'hp': [
                (int(base_env.read_m(addr)) << 8) | int(base_env.read_m(addr + 1))
                for addr in HP_ADDRESSES
            ],
            'in_battle': bool(base_env.read_m(0xD057) != 0)
        }

        # FIX: convert action and reward to native scalars
        action_scalar = action.item() if isinstance(action, np.ndarray) else action
        reward_scalar = float(reward)

        tracker.record_step(action_scalar, reward_scalar, game_state)
        
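        # Battle logic
        if game_state['in_battle'] and not tracker.battle_started:
            tracker.record_battle_start()
        elif not game_state['in_battle'] and tracker.battle_started:
            # Heuristic (assumption): count the battle as won if any party member still has HP,
            # instead of always reporting a win.
            party_size = min(int(base_env.read_m(PARTY_SIZE_ADDRESS)), 6)
            party_hp = [
                (int(base_env.read_m(addr)) << 8) | int(base_env.read_m(addr + 1))
                for addr in HP_ADDRESSES[:party_size]
            ]
            tracker.record_battle_end(won=any(hp > 0 for hp in party_hp))

        # Safety check: placeholder, currently a no-op when the agent leaves the target map
        if game_state['map'] != target_map and not game_state['in_battle']:
            pass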

    env.close()
    
    # Finalize
    tracker.end(success=tracker.battle_won)
    tracker.save_metrics(output_dir="metrics_evaluation")

    stats = tracker.get_summary_stats()
    print("\nEvaluation summary:")
    print(json.dumps(stats, indent=2))

    # --- AUTOMATIC DIAGNOSIS ---
    if not stats['battle_won']:
        print("\n⚠️ FAILURE DIAGNOSIS:")
        print("   'FAILURE' means the agent did not win the gym battle.")
        if stats['unique_tiles_explored'] <= 1:
            print("   🔴 THE AGENT IS STUCK: It explored only 1 tile.")
            print("   Suggestion: Check whether the model trained correctly or whether the warp left it trapped.")
        elif stats['battle_steps'] == 0:
            print("   🟠 NEVER ENTERED BATTLE: The agent moved but did not start the fight.")
            
    return stats

# --- RUN EVALUATION ---
MODEL_PATH = os.path.join(project_path, 'models_local', 'combat', 'pewter_brock_battle.zip')
SCENARIO_PATH = os.path.join(project_path, 'gym_scenarios', 'gym1_pewter_brock')

# Run (switched to headless=False to watch what happens)
stats = evaluate_gym_scenario(MODEL_PATH, SCENARIO_PATH, headless=False)


Evaluating in: Pewter City Gym - Brock
Model: pewter_brock_battle.zip
Headless mode: False
Loading model weights...
Injecting team and inventory configuration...
🌀 Scheduling warp to map 54 (4, 13)...
Warming up the emulator for the warp (3s)...
📊 Metrics started for CombatApex_Local in Gym 1

Starting agent run...




KeyboardInterrupt: 
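
The injection code in the evaluation cell relies on two byte encodings: money is written as three binary-coded decimal (BCD) bytes, and HP as a two-byte value with the high byte first. The sketch below is a minimal, standalone illustration of both, using a plain `bytearray` as a stand-in for emulator RAM. The helper names (`to_bcd`, `encode_money`, `decode_money`, `write_word_be`, `read_word_be`) are hypothetical and belong neither to the notebook nor to PyBoy.

```python
def to_bcd(value: int) -> int:
    """Pack a two-digit number (0-99) into one BCD byte: tens in the high nibble."""
    if not 0 <= value <= 99:
        raise ValueError(f"BCD byte holds 0-99, got {value}")
    return ((value // 10) << 4) | (value % 10)


def encode_money(money: int) -> bytes:
    """Encode 0..999999 as three BCD bytes, most significant digit pair first."""
    if not 0 <= money <= 999_999:
        raise ValueError(f"money must fit in six digits, got {money}")
    return bytes([
        to_bcd(money // 10000),
        to_bcd((money // 100) % 100),
        to_bcd(money % 100),
    ])


def decode_money(raw: bytes) -> int:
    """Inverse of encode_money: read each byte as two decimal digits."""
    total = 0
    for byte in raw:
        total = total * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return total


def write_word_be(memory: bytearray, addr: int, value: int) -> None:
    """Write a 16-bit value with the high byte at `addr` (big-endian)."""
    memory[addr] = (value >> 8) & 0xFF
    memory[addr + 1] = value & 0xFF


def read_word_be(memory: bytearray, addr: int) -> int:
    """Read a 16-bit big-endian value starting at `addr`."""
    return (memory[addr] << 8) | memory[addr + 1]


memory = bytearray(0x10000)  # stand-in for emulator RAM (assumption for this sketch)
memory[0xD347:0xD34A] = encode_money(3000)   # same addresses as MONEY_ADDRESS_1..3
write_word_be(memory, 0xD16C, 300)           # same address as HP_ADDRESSES[0]

print(memory[0xD347:0xD34A].hex())            # 003000
print(decode_money(memory[0xD347:0xD34A]))    # 3000
print(read_word_be(memory, 0xD16C))           # 300
print(memory[0xD16C])                         # 1 (high byte only)
```

The last line shows why HP has to be read as a pair of bytes: the first byte alone holds only the high part of the value, so any HP below 256 would read as 0.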