# Complete Tutorial: Policy Gradient - REINFORCE and Actor-Critic

This tutorial explores Policy Gradient methods, one of the pillars of deep Reinforcement Learning.

## Contents:
1. Introduction to Policy Gradient (3 cells)
2. Mathematics: Policy Gradient Theorem (5 cells with LaTeX)
3. REINFORCE - Implementation (13 cells with CartPole)
4. Actor-Critic with GAE (13 cells)
5. Comparison: REINFORCE vs A2C (8 cells)
6. Practical Experiments (6 cells)
7. Exercises (5 cells)
8. Conclusions (3 cells)

## Section 1: Introduction to Policy Gradient

### What is Policy Gradient?

Policy Gradient is a class of algorithms that **optimizes the policy directly** instead of deriving it from an estimated value function.

**Key difference:**
- **Value-based (DQN)**: Learns Q(s,a), then extracts the policy with argmax
- **Policy-based (PG)**: Learns the policy π(a|s) directly
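
The contrast between the two families can be sketched for a single state. This is only an illustration: the Q-values and logits below are made-up numbers, not outputs of any network in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical action values Q(s, a) for one state with three actions.
q_values = np.array([1.0, 2.5, 0.3])

# Value-based: the policy is implicit, so we act greedily with respect to Q.
greedy_action = int(np.argmax(q_values))

# Policy-based: the model outputs logits, a softmax turns them into pi(a|s),
# and the action is sampled from that distribution.
logits = np.array([0.2, 1.0, -0.5])  # hypothetical policy outputs
probs = np.exp(logits - logits.max())
probs /= probs.sum()
sampled_action = int(rng.choice(len(probs), p=probs))

print("greedy action:", greedy_action)
print("policy probabilities:", probs.round(3))
print("sampled action:", sampled_action)
```

The greedy rule always returns the same action for a given Q, while the sampled action varies from call to call, which is where the built-in exploration of policy-based methods comes from.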

### Advantages of Policy Gradient
1. **Continuous actions**: Handles continuous action spaces naturally
2. **Local convergence**: Under suitable step-size conditions, gradient ascent converges to a local optimum of $J(\theta)$ (not necessarily the global one)
3. **Stochastic exploration**: The policy is probabilistic by nature
4. **No bootstrapped targets in the Monte Carlo variant**: REINFORCE does not bootstrap from its own value estimates as DQN does (actor-critic methods reintroduce bootstrapping through the critic)

### Disadvantages
1. **High variance**: Gradient estimates are very noisy
2. **Slow convergence**: Requires many samples
3. **Local optima**: Convergence is only to a local optimum

### Policy Gradient Timeline

| Year | Method | Key feature |
|------|--------|-------------|
| 1992 | REINFORCE | Policy gradient with Monte Carlo returns |
| 2014 | Policy Networks | Deep networks for policies |
| 2015 | TRPO | Trust region for large, safe steps |
| 2016 | A3C/A2C | Asynchronous/synchronous Actor-Critic |
| 2017 | PPO | Clipped objective for stability |
| 2018 | SAC | Soft Actor-Critic with entropy |

In this tutorial: REINFORCE (basic) → A2C (with critic) → Comparison

### Initial Setup

We import the required libraries and configure the environment.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical, Normal
import gymnasium as gym
from typing import List, Tuple, Optional, Dict, Any
import matplotlib.pyplot as plt
from collections import deque
import warnings
warnings.filterwarnings('ignore')

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Dispositivo: {device}")
print(f"PyTorch version: {torch.__version__}")

## Section 2: The Mathematics of Policy Gradient

### 2.1 Optimization Objective

We want to maximize the **expected return**:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

where:
- $\theta$ = policy parameters
- $\tau = (s_0, a_0, r_0, s_1, ...)$ = trajectory
- $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ = discounted return

**Gradient ascent:** $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
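
As a small worked example of the two formulas above, the sketch below computes $R(\tau)$ for a short reward sequence and then runs a few gradient ascent steps. The objective `J(theta) = -(theta - 2)^2` is a hypothetical stand-in chosen only because its gradient is known in closed form; it is not the RL objective itself.

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * r_t for a single trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
R = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print("R(tau) =", R)

# Gradient ascent on a toy objective J(theta) = -(theta - 2)^2 (assumption).
def grad_J(theta):
    return -2.0 * (theta - 2.0)

theta, alpha = 0.0, 0.25
for _ in range(3):
    theta = theta + alpha * grad_J(theta)  # theta <- theta + alpha * grad J

print("theta after 3 steps =", theta)
```

Note the plus sign in the update: the parameters move along the gradient because the goal is to maximize $J$, unlike the descent step used for a loss.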

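### 2.2 Policy Gradient Theorem (PGT)

**The central theorem of policy gradient methods:**

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) Q^\pi(s,a)]$$

where:
- $\rho^\pi(s)$ = (discounted) state visitation distribution under $\pi$; the equality holds up to a normalization constant, which the learning rate absorbs
- $Q^\pi(s,a)$ = action-value function (expected return after taking $a$ in $s$ and then following $\pi$)
- $\nabla_\theta \log \pi_\theta(a|s)$ = score function

**Interpretation:** The gradient is the expected score function weighted by how good the action is: actions with high $Q^\pi$ have their log-probability pushed up, actions with low $Q^\pi$ pushed down.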
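
To make the score function concrete: for a softmax policy over discrete actions, the gradient of $\log \pi(a|s)$ with respect to the logits is `one_hot(a) - pi`. The sketch below checks this identity with autograd on made-up logits; it treats the logits themselves as the parameters, which is a simplification of a real policy network.

```python
import torch
from torch.distributions import Categorical

# Hypothetical logits for one state with three actions.
logits = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
dist = Categorical(logits=logits)
action = torch.tensor(2)

# Autograd gradient of log pi(a|s) with respect to the logits.
log_prob = dist.log_prob(action)
log_prob.backward()

# Closed form for a softmax policy: one_hot(a) - pi(.|s).
one_hot = torch.zeros(3)
one_hot[action] = 1.0
expected = one_hot - dist.probs.detach()

print("autograd:   ", logits.grad)
print("closed form:", expected)
```

The chosen action's logit receives a positive gradient and all others a negative one, so multiplying by a positive $Q^\pi$ raises the probability of that action.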

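### 2.3 REINFORCE: Monte Carlo Estimation

REINFORCE estimates $Q^\pi(s_t,a_t)$ with the **return from step $t$ onward, $G_t$:**

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \, G_t^{(i)}$$

where $N$ is the number of sampled episodes and:
$$G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$$

**Advantages:**
- Unbiased: $G_t$ is an unbiased sample of $Q^\pi(s_t,a_t)$
- Works without bootstrapping

**Disadvantages:**
- **High variance**: $G_t$ accumulates the randomness of every remaining step, and updates must wait until the episode ends
- Requires many samples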
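
The returns $G_t$ do not need a separate sum per time step: iterating backwards, each one follows from the next via $G_t = r_t + \gamma G_{t+1}$. A minimal sketch (the function name `rewards_to_go` and the sample rewards are illustrative):

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, ..., G_T] using the backward recursion G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    out = []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    out.reverse()  # restore chronological order
    return out

returns = rewards_to_go([1.0, 0.0, 2.0], gamma=0.5)
print(returns)

# Cross-check G_0 against the direct definition sum_k gamma^k * r_k.
direct_G0 = 1.0 + 0.5 * 0.0 + 0.25 * 2.0
print(direct_G0)
```

Appending and reversing once keeps the computation linear in the episode length, whereas inserting at the front of a list on every step is quadratic.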

### 2.4 Variance Reduction with a Baseline

We subtract a baseline $b(s)$ without introducing bias:

$$\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) [G_t - b(s_t)]$$

**Why this form helps:**
- **Reduces variance** without adding bias, as long as $b$ does not depend on the action, because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)\, b(s)] = b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) = 0$
- **Standard choice:** $b(s) = V^\pi(s)$, which is close to (though not exactly) the variance-minimizing baseline

With $b(s) = V^\pi(s)$:
$$\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) A^\pi(s_t, a_t)$$

where $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ is the **advantage**.
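
The claim that a state-dependent baseline leaves the expected gradient unchanged can be checked exactly for a small softmax policy by summing over all actions instead of sampling. The logits and Q-values below are made-up numbers, and the gradient is taken with respect to the logits.

```python
import numpy as np

logits = np.array([0.1, 0.7, -0.3])       # hypothetical policy logits
probs = np.exp(logits) / np.exp(logits).sum()
q = np.array([1.0, 3.0, -2.0])            # hypothetical Q(s, a)

# Row a of `scores` is grad_logits log pi(a|s) = one_hot(a) - pi.
scores = np.eye(3) - probs

def expected_gradient(baseline):
    """Exact E_a[ score(a) * (Q(a) - b) ] under pi."""
    weights = probs * (q - baseline)       # pi(a) * (Q(a) - b), one entry per action
    return (weights[:, None] * scores).sum(axis=0)

v = float(probs @ q)                       # V(s) = sum_a pi(a) Q(s, a)
g_no_baseline = expected_gradient(0.0)
g_with_baseline = expected_gradient(v)

print(g_no_baseline)
print(g_with_baseline)
```

Both vectors agree because the baseline term multiplies $\sum_a \pi(a|s)\,(\text{one\_hot}(a) - \pi) = 0$. The benefit shows up only in the variance of sampled estimates, not in their mean.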

### 2.5 Actor-Critic: TD-based Policy Gradient

Instead of waiting for the end of the episode (Monte Carlo), we use **Temporal Difference** learning:

**Critic:** Estimates $V(s)$ by minimizing the squared TD error, treating the target $r + \gamma V(s')$ as a constant
$$L_{critic} = (r + \gamma V(s') - V(s))^2$$

**Actor:** Uses the TD error as the advantage estimate
$$L_{actor} = -\log \pi_\theta(a|s) \cdot \delta_t$$

where $\delta_t = r + \gamma V(s') - V(s)$ is the **TD error**.

**Advantages over REINFORCE:**
- Lower variance than Monte Carlo returns, at the cost of some bias from the critic's approximation error
- Can update online (no need to wait for the end of the episode)
- Compatible with GAE, which interpolates between TD and Monte Carlo estimates
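
The interpolation mentioned in the last bullet can be seen on a three-step toy episode. The sketch below is a standalone NumPy version of GAE; the function name `gae`, the `last_value` argument, and the sample numbers are assumptions for illustration. With $\lambda = 0$ the advantages reduce to the one-step TD errors $\delta_t$, and with $\lambda = 1$ they become the Monte Carlo returns minus $V(s_t)$.

```python
import numpy as np

def gae(rewards, values, last_value, gamma, lam):
    """Generalized Advantage Estimation for one episode.

    values[t] = V(s_t); last_value = V(s_T), which is 0 for a terminal state.
    """
    values_ext = np.append(values, last_value)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]  # TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.4, 0.3])   # hypothetical critic outputs
gamma = 0.9

adv_td = gae(rewards, values, 0.0, gamma, lam=0.0)  # pure TD errors
adv_mc = gae(rewards, values, 0.0, gamma, lam=1.0)  # Monte Carlo return minus V

print("lambda=0:", adv_td)
print("lambda=1:", adv_mc)
```

Intermediate values of $\lambda$ trade the low variance of the TD estimate against the low bias of the Monte Carlo estimate.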

## Section 3: REINFORCE - Step-by-Step Implementation

### 3.1 Policy Neural Network

In [None]:
class PolicyNetwork(nn.Module):
    """Red Neuronal para la política π_θ(a|s)"""
    
    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: List[int] = [128, 128],
                 continuous: bool = False):
        super(PolicyNetwork, self).__init__()
        
        self.continuous = continuous
        self.action_dim = action_dim
        
        # Capas compartidas
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU()
            ])
            prev_dim = hidden_dim
        
        self.shared_layers = nn.Sequential(*layers)
        
        if continuous:
            # Para continuo: media y log_std
            self.mean_layer = nn.Linear(prev_dim, action_dim)
            self.log_std_layer = nn.Linear(prev_dim, action_dim)
        else:
            # Para discreto: logits
            self.action_head = nn.Linear(prev_dim, action_dim)
    
    def forward(self, state: torch.Tensor):
        """Forward pass"""
        x = self.shared_layers(state)
        
        if self.continuous:
            mean = self.mean_layer(x)
            log_std = torch.clamp(self.log_std_layer(x), -20, 2)
            return mean, log_std
        else:
            logits = self.action_head(x)
            return logits, None
    
    def get_action(self, state: torch.Tensor, deterministic: bool = False):
        """Selecciona acción"""
        if self.continuous:
            mean, log_std = self.forward(state)
            std = log_std.exp()
            
            if deterministic:
                return mean, None, None
            
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1)
            return action, log_prob, entropy
        else:
            logits, _ = self.forward(state)
            
            if deterministic:
                action = logits.argmax(dim=-1)
                return action, None, None
            
            dist = Categorical(logits=logits)
            action = dist.sample()
            log_prob = dist.log_prob(action)
            entropy = dist.entropy()
            return action, log_prob, entropy

print("PolicyNetwork definida")
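
For the continuous branch, the network above sums `log_prob` over the last dimension. The reason is that a diagonal Gaussian factorizes across action dimensions, so the joint log-density is the sum of the per-dimension log-densities. The standalone sketch below (made-up tensors, independent of `PolicyNetwork`) illustrates this together with the effect of clamping `log_std`.

```python
import math
import torch
from torch.distributions import Normal

# One state, two action dimensions, standard normal policy (illustrative values).
mean = torch.zeros(1, 2)
std = torch.ones(1, 2)
dist = Normal(mean, std)
action = torch.zeros(1, 2)

per_dim = dist.log_prob(action)   # shape (1, 2): one log-density per action dimension
joint = per_dim.sum(dim=-1)       # shape (1,): log-density of the whole action vector

# Each dimension contributes -0.5 * log(2*pi) at the mean, so the sum is -log(2*pi).
print(joint.item(), -math.log(2 * math.pi))

# Clamping log_std to [-20, 2] keeps std = exp(log_std) in a numerically safe range.
clamped = torch.clamp(torch.tensor([-30.0, 0.0, 5.0]), -20, 2)
print(clamped)
```

Without the sum, the loss would receive one log-probability per action dimension instead of one per action, and its shape would no longer match the per-step returns.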

### 3.2 Value Network (Baseline)

In [None]:
class ValueNetwork(nn.Module):
    """Red Neuronal para función de valor V(s)"""
    
    def __init__(self, state_dim: int, hidden_dims: List[int] = [128, 128]):
        super(ValueNetwork, self).__init__()
        
        layers = []
        prev_dim = state_dim
        for hidden_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, hidden_dim),
                nn.ReLU()
            ])
            prev_dim = hidden_dim
        
        layers.append(nn.Linear(prev_dim, 1))
        self.network = nn.Sequential(*layers)
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward pass retorna V(s)"""
        return self.network(state).squeeze(-1)

print("ValueNetwork definida")

### 3.3 REINFORCE Agent

In [None]:
class REINFORCEAgent:
    """Agente REINFORCE con baseline opcional"""
    
    def __init__(self, state_dim: int, action_dim: int,
                 continuous: bool = False,
                 learning_rate: float = 3e-4,
                 gamma: float = 0.99,
                 use_baseline: bool = True,
                 baseline_lr: float = 1e-3,
                 entropy_coef: float = 0.01,
                 normalize_advantages: bool = True,
                 hidden_dims: List[int] = [128, 128]):
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.continuous = continuous
        self.gamma = gamma
        self.use_baseline = use_baseline
        self.entropy_coef = entropy_coef
        self.normalize_advantages = normalize_advantages
        self.device = device
        
        # Redes
        self.policy = PolicyNetwork(state_dim, action_dim, hidden_dims, continuous).to(device)
        self.policy_optimizer = optim.Adam(self.policy.parameters(), lr=learning_rate)
        
        # Baseline (value network)
        if use_baseline:
            self.value_network = ValueNetwork(state_dim, hidden_dims).to(device)
            self.value_optimizer = optim.Adam(self.value_network.parameters(), lr=baseline_lr)
        else:
            self.value_network = None
        
        # Historial
        self.history = {
            'episode_rewards': [],
            'episode_lengths': [],
            'policy_losses': [],
            'value_losses': [],
            'entropies': []
        }
    
    def get_action(self, state: np.ndarray, deterministic: bool = False):
        """Select an action for a single state.

        When sampling, log_prob and entropy are returned as tensors that stay
        attached to the autograd graph, so train_episode can backpropagate
        through them. (Wrapping this call in torch.no_grad() and returning
        .item() values would leave the policy loss without any gradient.)
        """
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        
        # Gradients are only needed when sampling for training
        with torch.set_grad_enabled(not deterministic):
            action, log_prob, entropy = self.policy.get_action(state_tensor, deterministic)
        
        if self.continuous:
            action = action.detach().cpu().numpy().flatten()
        else:
            action = action.item()
        
        if deterministic:
            return action
        
        return action, log_prob.squeeze(0), entropy.squeeze(0)
    
    def compute_returns(self, rewards: List[float]) -> List[float]:
        """Compute discounted returns G_t"""
        returns = []
        G = 0
        for reward in reversed(rewards):
            G = reward + self.gamma * G
            returns.insert(0, G)
        return returns
    
    def train_episode(self, states: List[np.ndarray], log_probs: List[torch.Tensor],
                     entropies: List[torch.Tensor], rewards: List[float]) -> Dict[str, float]:
        """Train on one complete episode"""
        # Compute returns
        returns = self.compute_returns(rewards)
        
        # Convert to tensors; stacking keeps log_probs and entropies differentiable
        states_tensor = torch.FloatTensor(np.array(states)).to(self.device)
        returns_tensor = torch.FloatTensor(returns).to(self.device)
        log_probs_tensor = torch.stack(log_probs)
        entropies_tensor = torch.stack(entropies)
        
        # Calcular ventajas
        if self.use_baseline:
            values = self.value_network(states_tensor)
            advantages = returns_tensor - values.detach()
            
            # Entrenar value network
            value_loss = F.mse_loss(values, returns_tensor)
            self.value_optimizer.zero_grad()
            value_loss.backward()
            torch.nn.utils.clip_grad_norm_(self.value_network.parameters(), 0.5)
            self.value_optimizer.step()
        else:
            advantages = returns_tensor
            value_loss = torch.tensor(0.0)
        
        # Normalizar ventajas
        if self.normalize_advantages and len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Policy loss
        policy_loss = -(log_probs_tensor * advantages).mean()
        entropy_loss = -entropies_tensor.mean()
        total_loss = policy_loss + self.entropy_coef * entropy_loss
        
        # Optimizar
        self.policy_optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 0.5)
        self.policy_optimizer.step()
        
        return {
            'policy_loss': policy_loss.item(),
            'value_loss': value_loss.item() if self.use_baseline else 0.0,
            'entropy': entropies_tensor.mean().item(),
            'total_loss': total_loss.item()
        }
    
    def train(self, env: gym.Env, n_episodes: int = 500,
             max_steps: int = 500, print_every: int = 50) -> Dict[str, List]:
        """Entrena el agente"""
        print(f"Entrenando REINFORCE {'con' if self.use_baseline else 'sin'} baseline...\n")
        
        for episode in range(n_episodes):
            state, _ = env.reset()
            
            # Almacenar trayectoria
            states = []
            log_probs = []
            entropies = []
            rewards = []
            
            # Generar episodio
            for step in range(max_steps):
                action, log_prob, entropy = self.get_action(state, deterministic=False)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                
                states.append(state)
                log_probs.append(log_prob)
                entropies.append(entropy)
                rewards.append(reward)
                
                state = next_state
                
                if done:
                    break
            
            # Entrenar
            metrics = self.train_episode(states, log_probs, entropies, rewards)
            
            # Registrar
            episode_reward = sum(rewards)
            self.history['episode_rewards'].append(episode_reward)
            self.history['episode_lengths'].append(len(rewards))
            self.history['policy_losses'].append(metrics['policy_loss'])
            self.history['value_losses'].append(metrics['value_loss'])
            self.history['entropies'].append(metrics['entropy'])
            
            # Logging
            if (episode + 1) % print_every == 0:
                avg_reward = np.mean(self.history['episode_rewards'][-100:])
                print(f"Episode {episode + 1}/{n_episodes} | "
                      f"Reward: {episode_reward:.2f} | "
                      f"Avg (100): {avg_reward:.2f}")
        
        print("\nEntrenamiento completado!")
        return self.history

print("REINFORCEAgent definida")
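
The advantage normalization step inside `train_episode` can be illustrated in isolation. The sketch below uses made-up advantages. It also shows why the code guards with `len(advantages) > 1`: `torch.Tensor.std()` applies Bessel's correction by default, so the standard deviation of a single value is undefined.

```python
import torch

advantages = torch.tensor([1.0, 2.0, 3.0])  # illustrative values

# Same formula as in train_episode: zero mean, unit (sample) standard deviation.
normalized = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
print(normalized)

# With a single element the sample std is NaN (PyTorch may also emit a warning),
# which would turn the whole loss into NaN.
single = torch.tensor([5.0])
print(single.std())
```

Normalizing rescales the gradient per episode, which tends to make the step size less sensitive to the raw reward scale. It also shifts the advantages by their episode mean, which acts as an extra constant baseline.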

### 3.4 Evaluation Function

In [None]:
def evaluate_agent(agent, env: gym.Env, n_episodes: int = 20) -> Tuple[float, float]:
    """Evalúa agente sin exploración"""
    rewards = []
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        episode_reward = 0
        done = False
        
        while not done:
            action = agent.get_action(state, deterministic=True)
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        
        rewards.append(episode_reward)
    
    return np.mean(rewards), np.std(rewards)

print("evaluate_agent definida")

### 3.5 Training on CartPole

In [None]:
# Crear ambiente
env_cartpole = gym.make('CartPole-v1')
state_dim = env_cartpole.observation_space.shape[0]
action_dim = env_cartpole.action_space.n

print(f"CartPole-v1")
print(f"  State dim: {state_dim}")
print(f"  Action dim: {action_dim}")
print(f"\nEntrenando REINFORCE con Baseline...\n")

# Crear agente
agent_reinforce = REINFORCEAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    continuous=False,
    learning_rate=3e-4,
    gamma=0.99,
    use_baseline=True,
    baseline_lr=1e-3,
    entropy_coef=0.01,
    normalize_advantages=True,
    hidden_dims=[128, 128]
)

# Entrenar
history_reinforce = agent_reinforce.train(
    env=env_cartpole,
    n_episodes=300,
    max_steps=500,
    print_every=50
)

# Evaluar
mean_reward, std_reward = evaluate_agent(agent_reinforce, env_cartpole, n_episodes=20)
print(f"\nEvaluación REINFORCE: {mean_reward:.2f} ± {std_reward:.2f}")

### 3.6 Visualizing the Results

In [None]:
# Visualizar resultados
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Recompensas
ax = axes[0, 0]
rewards = history_reinforce['episode_rewards']
ax.plot(rewards, alpha=0.3, label='Episodio')
window = max(1, min(100, len(rewards) // 10))  # at least 1, so np.ones(window)/window never divides by zero
if len(rewards) >= window:
    moving_avg = np.convolve(rewards, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(rewards)), moving_avg, label=f'MA({window})', linewidth=2)
ax.set_xlabel('Episodio')
ax.set_ylabel('Recompensa')
ax.set_title('REINFORCE: Recompensas por Episodio')
ax.legend()
ax.grid(True, alpha=0.3)

# Policy Loss
ax = axes[0, 1]
losses = history_reinforce['policy_losses']
ax.plot(losses, alpha=0.3, color='red')
if len(losses) >= window:
    moving_avg = np.convolve(losses, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(losses)), moving_avg, label=f'MA({window})', linewidth=2, color='darkred')
ax.set_xlabel('Episodio')
ax.set_ylabel('Loss')
ax.set_title('REINFORCE: Policy Loss')
ax.legend()
ax.grid(True, alpha=0.3)

# Value Loss
ax = axes[1, 0]
v_losses = history_reinforce['value_losses']
ax.plot(v_losses, alpha=0.3, color='green')
if len(v_losses) >= window:
    moving_avg = np.convolve(v_losses, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(v_losses)), moving_avg, label=f'MA({window})', linewidth=2, color='darkgreen')
ax.set_xlabel('Episodio')
ax.set_ylabel('Loss')
ax.set_title('REINFORCE: Value Loss')
ax.legend()
ax.grid(True, alpha=0.3)

# Entropía
ax = axes[1, 1]
entropies = history_reinforce['entropies']
ax.plot(entropies, alpha=0.3, color='purple')
if len(entropies) >= window:
    moving_avg = np.convolve(entropies, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(entropies)), moving_avg, label=f'MA({window})', linewidth=2, color='purple')
ax.set_xlabel('Episodio')
ax.set_ylabel('Entropía')
ax.set_title('REINFORCE: Policy Entropy')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
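
The plots above smooth each curve with `np.convolve(..., mode='valid')`. In that mode the output has `len(data) - window + 1` points, and each point averages the `window` most recent episodes, which is why the x-axis starts at `window - 1`. A tiny example with made-up rewards:

```python
import numpy as np

rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative episode rewards
window = 3

# Moving average: each output is the mean of `window` consecutive values.
moving_avg = np.convolve(rewards, np.ones(window) / window, mode='valid')

# x positions: the first full window ends at index window - 1.
x = np.arange(window - 1, len(rewards))

print(moving_avg)
print(x)
```

Aligning the average with the end of its window means the smoothed curve at episode `i` only uses episodes up to `i`, so it never looks ahead of the raw data.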

## Section 4: Actor-Critic with GAE

### 4.1 Introduction to Actor-Critic

**Problems with REINFORCE:**
- High variance: it relies on full-episode returns and must wait for the episode to end
- Slow convergence

**Solution: Actor-Critic**
- **Actor:** Policy π(a|s), optimized with the policy gradient
- **Critic:** Value function V(s), estimated with TD learning

Actor-critic methods combine a policy-gradient actor with a value-based critic to reduce variance.

In [None]:
class A2CAgent:
    """Advantage Actor-Critic con GAE"""
    
    def __init__(self, state_dim: int, action_dim: int,
                 continuous: bool = False,
                 actor_lr: float = 3e-4,
                 critic_lr: float = 1e-3,
                 gamma: float = 0.99,
                 gae_lambda: float = 0.95,
                 entropy_coef: float = 0.01,
                 value_loss_coef: float = 0.5,
                 normalize_advantages: bool = True,
                 use_gae: bool = True,
                 hidden_dims: List[int] = [256, 256]):
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.continuous = continuous
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.entropy_coef = entropy_coef
        self.value_loss_coef = value_loss_coef
        self.normalize_advantages = normalize_advantages
        self.use_gae = use_gae
        self.device = device
        
        # Actor y Critic
        self.actor = PolicyNetwork(state_dim, action_dim, hidden_dims, continuous).to(device)
        self.critic = ValueNetwork(state_dim, hidden_dims).to(device)
        
        # Optimizadores
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=critic_lr)
        
        # Historial
        self.history = {
            'episode_rewards': [],
            'episode_lengths': [],
            'actor_losses': [],
            'critic_losses': [],
            'entropies': [],
            'advantages': []
        }
    
    def get_action(self, state: np.ndarray, deterministic: bool = False):
        """Selecciona acción"""
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        
        with torch.no_grad():
            if deterministic:
                if self.continuous:
                    mean, _ = self.actor.forward(state_tensor)
                    action = mean
                else:
                    logits, _ = self.actor.forward(state_tensor)
                    action = logits.argmax(dim=-1)
            else:
                action, _, _ = self.actor.get_action(state_tensor)
        
        if self.continuous:
            return action.cpu().numpy().flatten()
        else:
            return action.item()
    
    def compute_gae(self, rewards: List[float], values: torch.Tensor,
                   next_values: torch.Tensor, dones: List[bool]) -> Tuple[torch.Tensor, torch.Tensor]:
        """Calcula Generalized Advantage Estimation
        
        GAE(λ) = Σ (γλ)^l δ_{t+l}
        donde δ_t = r_t + γV(s_{t+1}) - V(s_t) es TD error
        """
        advantages = []
        gae = 0
        
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = next_values[t]
            else:
                next_value = values[t + 1]
            
            # TD error
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            
            # GAE
            gae = delta + self.gamma * self.gae_lambda * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        
        advantages = torch.tensor(advantages, dtype=torch.float32).to(self.device)
        returns = advantages + values
        
        return advantages, returns
    
    def train_step(self, states: List[np.ndarray], actions: List,
                  rewards: List[float], next_states: List[np.ndarray],
                  dones: List[bool]) -> Dict[str, float]:
        """Paso de entrenamiento"""
        # Convertir a tensors
        states_tensor = torch.FloatTensor(np.array(states)).to(self.device)
        next_states_tensor = torch.FloatTensor(np.array(next_states)).to(self.device)
        
        if self.continuous:
            actions_tensor = torch.FloatTensor(np.array(actions)).to(self.device)
        else:
            actions_tensor = torch.LongTensor(actions).to(self.device)
        
        # Calcular valores
        with torch.no_grad():
            values = self.critic(states_tensor)
            next_values = self.critic(next_states_tensor)
        
        # Calcular ventajas con GAE
        if self.use_gae:
            advantages, returns = self.compute_gae(rewards, values, next_values, dones)
        else:
            # Simple n-step
            returns_list = []
            R = next_values[-1] if len(rewards) > 0 else 0
            for t in reversed(range(len(rewards))):
                R = rewards[t] + self.gamma * R * (1 - dones[t])
                returns_list.insert(0, R)
            returns = torch.tensor(returns_list, dtype=torch.float32).to(self.device)
            advantages = returns - values
        
        # Normalizar ventajas
        if self.normalize_advantages and len(advantages) > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
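        # Actor loss: evaluate log pi(a_t|s_t) and the entropy for the actions actually taken.
        # (get_action cannot be reused here: its second argument is the `deterministic`
        # flag, not a batch of actions, and it would sample new actions.)
        if self.continuous:
            mean, log_std = self.actor(states_tensor)
            dist = Normal(mean, log_std.exp())
            log_probs = dist.log_prob(actions_tensor).sum(dim=-1)
            entropies = dist.entropy().sum(dim=-1)
        else:
            logits, _ = self.actor(states_tensor)
            dist = Categorical(logits=logits)
            log_probs = dist.log_prob(actions_tensor)
            entropies = dist.entropy()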
        
        actor_loss = -(log_probs * advantages.detach()).mean()
        entropy_loss = -entropies.mean()
        total_actor_loss = actor_loss + self.entropy_coef * entropy_loss
        
        # Critic loss
        values_new = self.critic(states_tensor)
        critic_loss = F.mse_loss(values_new, returns.detach())
        
        # Optimizar
        self.actor_optimizer.zero_grad()
        total_actor_loss.backward()
        self.actor_optimizer.step()
        
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        
        return {
            'actor_loss': actor_loss.item(),
            'critic_loss': critic_loss.item(),
            'entropy': entropies.mean().item(),
            'mean_advantage': advantages.mean().item()
        }
    
    def train(self, env: gym.Env, n_episodes: int = 500,
             max_steps: int = 500, print_every: int = 50) -> Dict[str, List]:
        """Entrena el agente"""
        print(f"Entrenando A2C con GAE...\n")
        
        for episode in range(n_episodes):
            state, _ = env.reset()
            
            # Buffers
            states = []
            actions = []
            rewards = []
            next_states = []
            dones = []
            
            episode_reward = 0
            
            # Generar episodio
            for step in range(max_steps):
                action = self.get_action(state, deterministic=False)
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                
                states.append(state)
                actions.append(action)
                rewards.append(reward)
                next_states.append(next_state)
                dones.append(float(terminated))  # only true termination stops bootstrapping; on truncation V(s') is still used
                
                episode_reward += reward
                state = next_state
                
                if done:
                    break
            
            # Entrenar
            metrics = self.train_step(states, actions, rewards, next_states, dones)
            
            # Registrar
            self.history['episode_rewards'].append(episode_reward)
            self.history['episode_lengths'].append(step + 1)
            self.history['actor_losses'].append(metrics['actor_loss'])
            self.history['critic_losses'].append(metrics['critic_loss'])
            self.history['entropies'].append(metrics['entropy'])
            self.history['advantages'].append(metrics['mean_advantage'])
            
            # Logging
            if (episode + 1) % print_every == 0:
                avg_reward = np.mean(self.history['episode_rewards'][-100:])
                print(f"Episode {episode + 1}/{n_episodes} | "
                      f"Reward: {episode_reward:.2f} | "
                      f"Avg (100): {avg_reward:.2f}")
        
        print("\nEntrenamiento completado!")
        return self.history

print("A2CAgent definida")

### 4.2 Training A2C

In [None]:
# Crear agente A2C
agent_a2c = A2CAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    continuous=False,
    actor_lr=3e-4,
    critic_lr=1e-3,
    gamma=0.99,
    gae_lambda=0.95,
    entropy_coef=0.01,
    value_loss_coef=0.5,
    normalize_advantages=True,
    use_gae=True,
    hidden_dims=[256, 256]
)

# Entrenar
print("Entrenando A2C en CartPole...\n")
history_a2c = agent_a2c.train(
    env=env_cartpole,
    n_episodes=300,
    max_steps=500,
    print_every=50
)

# Evaluar
mean_reward_a2c, std_reward_a2c = evaluate_agent(agent_a2c, env_cartpole, n_episodes=20)
print(f"\nEvaluación A2C: {mean_reward_a2c:.2f} ± {std_reward_a2c:.2f}")

### 4.3 GAE Analysis

In [None]:
# Plot A2C training metrics (agent trained with GAE)
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

window = 20

# Recompensas
ax = axes[0, 0]
ax.plot(history_a2c['episode_rewards'], alpha=0.3, color='blue')
ma = np.convolve(history_a2c['episode_rewards'], np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(history_a2c['episode_rewards'])), ma, linewidth=2, color='darkblue')
ax.set_xlabel('Episodio')
ax.set_ylabel('Recompensa')
ax.set_title('A2C: Recompensas')
ax.grid(True, alpha=0.3)

# Actor Loss
ax = axes[0, 1]
ax.plot(history_a2c['actor_losses'], alpha=0.3, color='green')
ma = np.convolve(history_a2c['actor_losses'], np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(history_a2c['actor_losses'])), ma, linewidth=2, color='darkgreen')
ax.set_xlabel('Episodio')
ax.set_ylabel('Loss')
ax.set_title('A2C: Actor Loss')
ax.grid(True, alpha=0.3)

# Critic Loss
ax = axes[0, 2]
ax.plot(history_a2c['critic_losses'], alpha=0.3, color='red')
ma = np.convolve(history_a2c['critic_losses'], np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(history_a2c['critic_losses'])), ma, linewidth=2, color='darkred')
ax.set_xlabel('Episodio')
ax.set_ylabel('Loss')
ax.set_title('A2C: Critic Loss')
ax.grid(True, alpha=0.3)

# Entropía
ax = axes[1, 0]
ax.plot(history_a2c['entropies'], alpha=0.3, color='purple')
ma = np.convolve(history_a2c['entropies'], np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(history_a2c['entropies'])), ma, linewidth=2, color='purple')
ax.set_xlabel('Episodio')
ax.set_ylabel('Entropía')
ax.set_title('A2C: Policy Entropy')
ax.grid(True, alpha=0.3)

# Ventajas
ax = axes[1, 1]
ax.plot(history_a2c['advantages'], alpha=0.3, color='orange')
ma = np.convolve(history_a2c['advantages'], np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(history_a2c['advantages'])), ma, linewidth=2, color='orange')
ax.axhline(y=0, color='r', linestyle='--', alpha=0.5)
ax.set_xlabel('Episodio')
ax.set_ylabel('Mean Advantage')
ax.set_title('A2C: Ventajas')
ax.grid(True, alpha=0.3)

# Longitud de episodios
ax = axes[1, 2]
ax.plot(history_a2c['episode_lengths'], alpha=0.3, color='brown')
ma = np.convolve(history_a2c['episode_lengths'], np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(history_a2c['episode_lengths'])), ma, linewidth=2, color='brown')
ax.set_xlabel('Episodio')
ax.set_ylabel('Longitud')
ax.set_title('A2C: Episode Length')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Section 5: Comparison of REINFORCE vs A2C

### 5.1 Visual Comparison

In [None]:
# Comparar resultados
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

window = 20

# Recompensas
ax = axes[0, 0]
reinforce_ma = np.convolve(history_reinforce['episode_rewards'], np.ones(window)/window, mode='valid')
a2c_ma = np.convolve(history_a2c['episode_rewards'], np.ones(window)/window, mode='valid')

ax.plot(range(window-1, len(history_reinforce['episode_rewards'])), reinforce_ma,
       label='REINFORCE', linewidth=2.5, color='blue')
ax.plot(range(window-1, len(history_a2c['episode_rewards'])), a2c_ma,
       label='A2C', linewidth=2.5, color='red')
ax.set_xlabel('Episodio', fontsize=11)
ax.set_ylabel('Recompensa (MA-20)', fontsize=11)
ax.set_title('Comparación: Recompensas por Episodio', fontsize=12)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Reward variability: moving average of the absolute change between consecutive episodes (not a true variance)
ax = axes[0, 1]
reinforce_var = np.convolve(np.abs(np.diff(history_reinforce['episode_rewards'])),
                            np.ones(window)/window, mode='valid')
a2c_var = np.convolve(np.abs(np.diff(history_a2c['episode_rewards'])),
                      np.ones(window)/window, mode='valid')
ax.plot(range(window-1, len(reinforce_var)+window-1), reinforce_var,
       label='REINFORCE', linewidth=2.5, color='blue')
ax.plot(range(window-1, len(a2c_var)+window-1), a2c_var,
       label='A2C', linewidth=2.5, color='red')
ax.set_xlabel('Episodio', fontsize=11)
ax.set_ylabel('Variabilidad', fontsize=11)
ax.set_title('Comparación: Variabilidad de Recompensas', fontsize=12)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Policy Loss
ax = axes[1, 0]
reinforce_ploss = np.convolve(history_reinforce['policy_losses'], np.ones(window)/window, mode='valid')
a2c_aloss = np.convolve(history_a2c['actor_losses'], np.ones(window)/window, mode='valid')

ax.plot(range(window-1, len(reinforce_ploss)+window-1), reinforce_ploss,
       label='REINFORCE', linewidth=2.5, color='blue')
ax.plot(range(window-1, len(a2c_aloss)+window-1), a2c_aloss,
       label='A2C', linewidth=2.5, color='red')
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Loss (MA-20)', fontsize=11)
ax.set_title('Comparison: Policy/Actor Loss', fontsize=12)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Entropy
ax = axes[1, 1]
reinforce_ent = np.convolve(history_reinforce['entropies'], np.ones(window)/window, mode='valid')
a2c_ent = np.convolve(history_a2c['entropies'], np.ones(window)/window, mode='valid')

ax.plot(range(window-1, len(reinforce_ent)+window-1), reinforce_ent,
       label='REINFORCE', linewidth=2.5, color='blue')
ax.plot(range(window-1, len(a2c_ent)+window-1), a2c_ent,
       label='A2C', linewidth=2.5, color='red')
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Entropy (MA-20)', fontsize=11)
ax.set_title('Comparison: Policy Entropy', fontsize=12)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
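
The variability panel in the cell above plots a moving average of absolute episode-to-episode reward differences, which is a proxy rather than a true variance. For comparison, a rolling standard deviation could be computed along these lines. This is a minimal sketch: `rolling_std` is a hypothetical helper that is not part of the tutorial's code, and it works on plain Python lists so it runs without the training histories.

```python
import statistics

def rolling_std(values, window):
    """Population standard deviation over each full sliding window."""
    if window < 1 or window > len(values):
        return []
    return [
        statistics.pstdev(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]

# Synthetic rewards, only to show the call; not results from the notebook.
rewards = [10.0, 12.0, 9.0, 30.0, 31.0, 29.0]
print(rolling_std(rewards, window=3))
```

With the notebook's data, the same helper could be applied to `history_reinforce['episode_rewards']` and `history_a2c['episode_rewards']` and plotted in place of the absolute-difference curves.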

### 5.2 Comparison Table

In [None]:
import pandas as pd

# Statistics
reinforce_last100 = np.mean(history_reinforce['episode_rewards'][-100:])
a2c_last100 = np.mean(history_a2c['episode_rewards'][-100:])

reinforce_final_train_var = np.var(history_reinforce['episode_rewards'][-100:])
a2c_final_train_var = np.var(history_a2c['episode_rewards'][-100:])

comparison_df = pd.DataFrame({
    'Métrica': [
        'Training reward (last 100)',
        'Training reward variance (last 100)',
        'Evaluation reward (mean)',
        'Evaluation reward (std)',
        'Final policy/actor loss (last 10)',
        'Final entropy (last 10)'
    ],
    'REINFORCE': [
        f"{reinforce_last100:.2f}",
        f"{reinforce_final_train_var:.2f}",
        f"{mean_reward:.2f}",
        f"{std_reward:.2f}",
        f"{np.mean(history_reinforce['policy_losses'][-10:]):.4f}",
        f"{np.mean(history_reinforce['entropies'][-10:]):.4f}"
    ],
    'A2C': [
        f"{a2c_last100:.2f}",
        f"{a2c_final_train_var:.2f}",
        f"{mean_reward_a2c:.2f}",
        f"{std_reward_a2c:.2f}",
        f"{np.mean(history_a2c['actor_losses'][-10:]):.4f}",
        f"{np.mean(history_a2c['entropies'][-10:]):.4f}"
    ]
})

print("\n" + "="*70)
print("QUANTITATIVE COMPARISON: REINFORCE vs A2C")
print("="*70)
print(comparison_df.to_string(index=False))
print("="*70)

### 5.3 Qualitative Analysis

In [None]:
print("\n" + "="*70)
print("COMPARATIVE ANALYSIS")
print("="*70)

analysis = """
1. REINFORCE (Monte Carlo Policy Gradient)
   - Update: at the end of each episode
   - Estimator: G_t (discounted Monte Carlo return)
   - Variance: HIGH (full-episode returns accumulate the noise of every step)
   - Bias: None (G_t is an unbiased estimate)
   - Convergence: slow; reaches a local optimum under suitable step sizes
   - Baseline: reduces variance without adding bias

2. A2C (Advantage Actor-Critic with GAE)
   - Update: once per episode here, using every step of the episode
   - Estimator: TD errors combined through GAE
   - Variance: LOW (bootstrapping from the critic reduces variance)
   - Bias: some, from critic error; lambda sets the trade-off (lambda=1 recovers Monte Carlo)
   - Convergence: typically faster and more stable
   - Critic: acts as a learned baseline that reduces variance

3. Trade-offs
   ┌────────────────────────┬──────────────┬──────────────┐
   │ Aspect                 │ REINFORCE    │ A2C          │
   ├────────────────────────┼──────────────┼──────────────┤
   │ Complexity             │ Simple       │ Medium       │
   │ Variance               │ High         │ Low          │
   │ Speed                  │ Slow         │ Fast         │
   │ Stability              │ Good         │ Very good    │
   │ Convergence            │ Slow         │ Fast         │
   │ Continuous actions     │ Excellent    │ Excellent    │
   │ Discrete actions       │ Good         │ Good         │
   │ Parallelization        │ Easy         │ Harder       │
   └────────────────────────┴──────────────┴──────────────┘

4. Usage Recommendations
   REINFORCE if:
     - It is your first algorithm (educational)
     - You need maximum simplicity

   A2C if:
     - You need fast convergence
     - Variance is a problem
     - Your compute budget is limited
     - Actions are discrete or continuous
     - You plan to scale out with parallel workers (A3C is the asynchronous actor-critic variant)
"""

print(analysis)
print("="*70)
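
To make the REINFORCE estimator named in the analysis concrete, the discounted return G_t and the effect of subtracting a baseline can be sketched as follows. The function names `discounted_returns` and `subtract_baseline` are hypothetical, and the tutorial's `REINFORCEAgent` may compute these quantities differently (for example with tensors and a learned value network).

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_(t+1) for every step of one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def subtract_baseline(returns, baselines):
    """Advantage-like signal G_t - b(s_t); b must not depend on the action taken."""
    return [g - b for g, b in zip(returns, baselines)]

# Toy episode with three steps of reward 1 (values chosen for easy checking).
G = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(G)
print(subtract_baseline(G, [1.0, 1.0, 1.0]))
```

The second list has the same ordering of good and bad steps as the first but a smaller magnitude, which is how a baseline can lower gradient variance without changing the expected gradient.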

## Section 6: Practical Experiments

### 6.1 Impact of the Baseline in REINFORCE

In [None]:
# Train REINFORCE WITHOUT a baseline for comparison
print("Training REINFORCE WITHOUT baseline...\n")

agent_reinforce_no_baseline = REINFORCEAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    continuous=False,
    learning_rate=3e-4,
    gamma=0.99,
    use_baseline=False,
    entropy_coef=0.01,
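    # Note: if normalization subtracts the batch mean, it acts as a crude baseline
    # even with use_baseline=False; disable it to isolate the baseline's effect.
    normalize_advantages=True,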
    hidden_dims=[128, 128]
)

history_no_baseline = agent_reinforce_no_baseline.train(
    env=env_cartpole,
    n_episodes=300,
    max_steps=500,
    print_every=50
)

# Evaluate
mean_no_bl, std_no_bl = evaluate_agent(agent_reinforce_no_baseline, env_cartpole, n_episodes=20)
print(f"\nREINFORCE evaluation without baseline: {mean_no_bl:.2f} ± {std_no_bl:.2f}")
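
The cell above trains the agent without a baseline but does not put the two runs side by side. A small comparison helper could look like the sketch below. `summarize_runs` is a hypothetical function, and the demo numbers are synthetic, not results from this notebook.

```python
def summarize_runs(runs, last_n=100):
    """Map each run name to (mean, std) of its last `last_n` episode rewards."""
    summary = {}
    for name, rewards in runs.items():
        tail = list(rewards)[-last_n:]
        if not tail:
            continue  # skip runs with no recorded episodes
        mean = sum(tail) / len(tail)
        std = (sum((r - mean) ** 2 for r in tail) / len(tail)) ** 0.5
        summary[name] = (mean, std)
    return summary

# Synthetic rewards, only to demonstrate the output format.
demo = {"with baseline": [180.0, 200.0, 220.0], "no baseline": [90.0, 200.0, 310.0]}
for name, (mean, std) in summarize_runs(demo).items():
    print(f"{name}: {mean:.1f} ± {std:.1f}")
```

In the notebook, the dictionary would hold `history_reinforce['episode_rewards']` and `history_no_baseline['episode_rewards']`.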

### 6.2 Impact of GAE Lambda

In [None]:
# Compare different GAE lambda values
lambdas = [0.0, 0.5, 0.95, 1.0]
histories_lambda = []

for lam in lambdas:
    print(f"\nTraining A2C with GAE lambda={lam}...")
    agent = A2CAgent(
        state_dim=state_dim,
        action_dim=action_dim,
        continuous=False,
        gae_lambda=lam,
        use_gae=(lam < 1.0),  # For lambda=1.0, use Monte Carlo returns
        hidden_dims=[256, 256]
    )
    
    hist = agent.train(
        env=env_cartpole,
        n_episodes=250,
        max_steps=500,
        print_every=100
    )
    histories_lambda.append((lam, hist))

# Visualize
fig, ax = plt.subplots(figsize=(14, 6))

window = 15
colors = ['red', 'orange', 'blue', 'green']

for (lam, hist), color in zip(histories_lambda, colors):
    rewards = hist['episode_rewards']
    ma = np.convolve(rewards, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(rewards)), ma, label=f'λ={lam}', linewidth=2.5, color=color)

ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Reward (MA-15)', fontsize=12)
ax.set_title('Impact of GAE Lambda on A2C', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nGAE lambda analysis:")
print("  λ=0.0: one-step TD error - low variance, high bias")
print("  λ=0.5: intermediate - moderate variance and bias")
print("  λ=0.95: balance (recommended)")
print("  λ=1.0: Monte Carlo - high variance, no bias")
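
The two extremes listed above follow directly from the GAE recursion. The sketch below is a minimal plain-Python version, assuming a single episode that terminates at its last step unless `last_value` supplies a bootstrap value; `compute_gae` is a hypothetical name, and `A2CAgent` may implement this differently.

```python
def compute_gae(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma * lam * A_(t+1), with
    delta_t = r_t + gamma * V(s_(t+1)) - V(s_t)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # value of the state after the final step
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]  # toy critic predictions
print(compute_gae(rewards, values, gamma=0.9, lam=0.0))  # one-step TD errors
print(compute_gae(rewards, values, gamma=0.9, lam=1.0))  # Monte Carlo return minus V
```

With `lam=0.0` each advantage is just the TD error of that step, so it leans entirely on the critic. With `lam=1.0` the critic terms telescope away and the result equals the discounted return minus `values[t]`.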

### 6.3 Performance on Pendulum (Continuous)

In [None]:
# Create the Pendulum environment
env_pendulum = gym.make('Pendulum-v1')
state_dim_p = env_pendulum.observation_space.shape[0]  # 3
action_dim_p = env_pendulum.action_space.shape[0]      # 1 (continuous)

print(f"Pendulum-v1:")
print(f"  State dim: {state_dim_p} (continuous)")
print(f"  Action dim: {action_dim_p} (continuous)")
print(f"  Reward per step: about -16.3 to 0 (0 is best)")

# Train A2C on Pendulum
print(f"\nTraining A2C on Pendulum...\n")

agent_pendulum = A2CAgent(
    state_dim=state_dim_p,
    action_dim=action_dim_p,
    continuous=True,  # IMPORTANT: continuous actions
    actor_lr=1e-4,
    critic_lr=1e-3,
    gamma=0.99,
    gae_lambda=0.95,
    entropy_coef=0.001,  # Smaller entropy bonus for the continuous case
    hidden_dims=[256, 256]
)

history_pendulum = agent_pendulum.train(
    env=env_pendulum,
    n_episodes=200,
    max_steps=200,
    print_every=50
)

# Evaluate
mean_pend, std_pend = evaluate_agent(agent_pendulum, env_pendulum, n_episodes=10)
print(f"\nA2C evaluation on Pendulum: {mean_pend:.2f} ± {std_pend:.2f}")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

window = 10
rewards = history_pendulum['episode_rewards']
ma = np.convolve(rewards, np.ones(window)/window, mode='valid')

ax1.plot(rewards, alpha=0.3, color='blue')
ax1.plot(range(window-1, len(rewards)), ma, linewidth=2.5, color='darkblue')
ax1.set_xlabel('Episode')
ax1.set_ylabel('Reward')
ax1.set_title('A2C on Pendulum-v1: Rewards')
ax1.grid(True, alpha=0.3)

# Critic Loss
losses = history_pendulum['critic_losses']
ma_loss = np.convolve(losses, np.ones(window)/window, mode='valid')
ax2.plot(losses, alpha=0.3, color='red')
ax2.plot(range(window-1, len(losses)), ma_loss, linewidth=2.5, color='darkred')
ax2.set_xlabel('Episode')
ax2.set_ylabel('Loss')
ax2.set_title('A2C on Pendulum-v1: Critic Loss')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

env_pendulum.close()
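
With `continuous=True` the actor has to define a distribution over real-valued actions. A common choice is a Gaussian, and that is assumed here because the internals of `A2CAgent` are not shown in this section. Under that assumption, the quantity that enters the policy loss is the Gaussian log-probability of the sampled action, sketched below for a single action dimension (`gaussian_log_prob` is a hypothetical helper).

```python
import math

def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) evaluated at `action`."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

# Actions near the mean are more likely, so their log-probability is higher.
print(gaussian_log_prob(0.0, mean=0.0, std=1.0))
print(gaussian_log_prob(2.0, mean=0.0, std=1.0))
```

For Pendulum's one-dimensional torque this is the whole log-probability; with several independent action dimensions the per-dimension terms would be summed.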

## Section 7: Exercises

### 7.1 Exercise 1: Entropy Regularization with a Decay Schedule

Modify A2CAgent so that:
1. It has an `entropy_schedule` that decays the entropy coefficient over episodes
2. It starts with high exploration and ends with low exploration
3. You can compare the results against the fixed-coefficient agent

Hint: use a lambda function or exponential decay

In [None]:
# TODO: Implement entropy decay
def entropy_decay_schedule(episode, initial_entropy=0.01, final_entropy=0.001, total_episodes=500):
    """Decaying schedule for the entropy coefficient"""
    # Implement here
    pass

print("Exercise 1: Entropy Decay Schedule")
print("TODO: Implement and train an agent with a decaying entropy coefficient")
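
As a starting point for the exponential-decay hint, one possible shape for the schedule is sketched below. It is not the only valid answer to the exercise: the name `exponential_entropy_schedule` and its parameters are illustrative, and it assumes both coefficients are strictly positive.

```python
def exponential_entropy_schedule(episode, initial=0.01, final=0.001, total_episodes=500):
    """Interpolate geometrically from `initial` (episode 0) to `final` (last episode)."""
    progress = min(max(episode / total_episodes, 0.0), 1.0)  # clamp to [0, 1]
    return initial * (final / initial) ** progress

for ep in (0, 250, 500):
    print(ep, exponential_entropy_schedule(ep))
```

Wiring it into training still requires the agent to read the coefficient each episode, which is the part of the exercise left open.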

### 7.2 Exercise 2: Compare Different Architectures

Try different network sizes:
1. Small: [64, 64]
2. Medium: [256, 256] (current)
3. Large: [512, 512]

Which converges fastest? Which reaches the best final reward?

In [None]:
# TODO: Architecture grid search
architectures = {
    'Small': [64, 64],
    'Medium': [256, 256],
    'Large': [512, 512]
}

print("Ejercicio 2: Network Architecture Comparison")
print("TODO: Train A2C with each architecture")
print(f"Architectures: {list(architectures.keys())}")
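
Before training, it can help to know how much capacity each option adds. The sketch below counts parameters for a plain fully connected network with biases, assuming CartPole's 4 observation features and 2 actions. The tutorial's actual networks may have extra heads or layers, so treat the numbers as rough guides; `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(input_dim, hidden_dims, output_dim):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    sizes = [input_dim, *hidden_dims, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

architectures = {'Small': [64, 64], 'Medium': [256, 256], 'Large': [512, 512]}
for name, hidden in architectures.items():
    print(name, mlp_param_count(4, hidden, 2))
```

The count grows roughly with the square of the hidden width, so the Large network has far more parameters to fit from the same number of episodes.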

### 7.3 Exercise 3: Saving and Loading Models

In [None]:
def save_agent(agent, path: str):
    """Saves the actor/critic weights, optimizer states, and history"""
    checkpoint = {
        'actor': agent.actor.state_dict(),
        'critic': agent.critic.state_dict(),
        'actor_opt': agent.actor_optimizer.state_dict(),
        'critic_opt': agent.critic_optimizer.state_dict(),
        'history': agent.history
    }
    torch.save(checkpoint, path)
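    print(f"Model saved to {path}")

def load_agent(agent, path: str):
    """Loads the actor/critic weights, optimizer states, and history"""
    # weights_only=False (PyTorch >= 1.13) because the checkpoint also stores the
    # plain-Python history; only load checkpoint files you trust.
    checkpoint = torch.load(path, map_location=device, weights_only=False)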
    agent.actor.load_state_dict(checkpoint['actor'])
    agent.critic.load_state_dict(checkpoint['critic'])
    agent.actor_optimizer.load_state_dict(checkpoint['actor_opt'])
    agent.critic_optimizer.load_state_dict(checkpoint['critic_opt'])
    agent.history = checkpoint['history']
    print(f"Model loaded from {path}")

# Test
save_path = '/tmp/a2c_cartpole.pth'
save_agent(agent_a2c, save_path)
print(f"\nA2C model saved successfully")

### 7.4 Exercise 4: Trajectory Analysis

In [None]:
# Collect trajectories from trained agents
def analyze_trajectories(agent, env, n_episodes=5):
    """Collects deterministic rollouts from a trained agent"""
    trajectories = []
    
    for _ in range(n_episodes):
        state, _ = env.reset()
        traj = {'states': [], 'actions': [], 'rewards': []}
        done = False
        
        while not done:
            action = agent.get_action(state, deterministic=True)
            traj['states'].append(state)
            traj['actions'].append(action)
            
            state, reward, terminated, truncated, _ = env.step(action)
            traj['rewards'].append(reward)
            done = terminated or truncated
        
        trajectories.append(traj)
    
    return trajectories

# Analysis
trajs_a2c = analyze_trajectories(agent_a2c, env_cartpole, n_episodes=3)

print("A2C trajectory analysis on CartPole:")
for i, traj in enumerate(trajs_a2c):
    total_reward = sum(traj['rewards'])
    length = len(traj['states'])
    print(f"  Episode {i+1}: Length={length}, Reward={total_reward:.0f}")
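
The cell above reports only length and total reward. A natural next step is to look at which actions the policy actually takes. The sketch below assumes the trajectory dictionaries have the same `'actions'` key as above and that actions are discrete and hashable (as in CartPole); `action_frequencies` is a hypothetical helper and the demo data is synthetic.

```python
from collections import Counter

def action_frequencies(trajectories):
    """Fraction of steps on which each discrete action was chosen."""
    counts = Counter()
    for traj in trajectories:
        counts.update(traj['actions'])
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()} if total else {}

# Synthetic trajectories with CartPole-style actions (0 = left, 1 = right).
demo = [{'actions': [0, 1, 1]}, {'actions': [1]}]
print(action_frequencies(demo))
```

A balanced CartPole policy is expected to use both actions at similar rates; a strong skew can indicate that the cart is drifting to one side.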

### 7.5 Exercise 5: Robustness to Environment Changes

In [None]:
# Robustness test: how does the policy perform when environment parameters change?
print("Robustness test: evaluation under varied conditions\n")

# Standard CartPole (reference point for the comparison)
env_test = gym.make('CartPole-v1')
reward_std, std_std = evaluate_agent(agent_a2c, env_test, n_episodes=10)
print(f"Standard CartPole: {reward_std:.2f} ± {std_std:.2f}")

env_test.close()

print("\nNote: this cell only evaluates the standard environment. Repeat the evaluation with modified parameters before drawing conclusions about generalization.")
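
The cell above evaluates only the unmodified environment. To actually vary a physical parameter, one option is to change attributes on the unwrapped environment. The sketch below assumes the attribute names `length`, `masspole` and `polemass_length` used by Gymnasium's `CartPoleEnv`; verify them against your installed version. `set_pole_length` is a hypothetical helper, demonstrated on a stand-in object so it runs without Gymnasium.

```python
from types import SimpleNamespace

def set_pole_length(cartpole, half_length):
    """Change the pole half-length and refresh the quantity derived from it."""
    cartpole.length = half_length
    cartpole.polemass_length = cartpole.masspole * cartpole.length
    return cartpole

# Stand-in object with CartPole-like default values (assumption, see lead-in).
fake_env = SimpleNamespace(masspole=0.1, length=0.5, polemass_length=0.05)
set_pole_length(fake_env, 0.75)
print(fake_env.length, fake_env.polemass_length)

# Hypothetical use with the notebook's objects (not executed here):
# env_long = gym.make('CartPole-v1')
# set_pole_length(env_long.unwrapped, 0.75)
# print(evaluate_agent(agent_a2c, env_long, n_episodes=10))
```

Comparing the evaluation reward across several pole lengths against the standard value gives a first, empirical picture of how far the learned policy generalizes.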

## Section 8: Conclusions

### 8.1 Summary of Concepts

**Policy Gradient is powerful because:**

1. **It optimizes the policy directly**
   - No need to extract a policy with argmax
   - Handles continuous actions naturally

2. **Convergence properties**
   - With suitable step sizes, it converges to a local optimum (not necessarily the global one)
   - Policy changes are often smoother than in value-based methods such as DQN, although large updates can still collapse performance

3. **REINFORCE is simple but effective**
   - Unbiased
   - A baseline reduces variance without adding bias
   - Fundamental for understanding policy gradient

4. **A2C improves significantly on it**
   - The critic reduces variance, which is essential
   - GAE interpolates between TD and Monte Carlo
   - Faster convergence than REINFORCE

5. **REINFORCE vs A2C**
   - REINFORCE: educational, simple, slow convergence
   - A2C: practical, faster, usually more stable

### 8.2 Advanced Extensions

**Related algorithms you can explore:**

1. **A3C (Asynchronous Advantage Actor-Critic)**
   - Parallel: multiple independent workers
   - Does not need a replay buffer
   - Very efficient on multi-core machines

2. **PPO (Proximal Policy Optimization)**
   - Clipped objective for safer steps
   - A strong default for many tasks
   - Comparatively robust to hyperparameters

3. **TRPO (Trust Region Policy Optimization)**
   - KL constraint that defines a trust region
   - Stronger theoretical guarantees
   - More complicated to implement

4. **SAC (Soft Actor-Critic)**
   - Maximizes entropy plus reward
   - Excellent for exploration
   - Off-policy (uses a replay buffer)

5. **DDPG (Deep Deterministic Policy Gradient)**
   - Deterministic policy gradient
   - Off-policy (can use a replay buffer)
   - For continuous actions
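
To illustrate the clipped objective mentioned under PPO, the per-sample surrogate term can be sketched as follows. The sketch assumes `ratio` is the probability ratio between the new and old policy for the sampled action; `ppo_clipped_term` is a hypothetical name, and a full PPO implementation averages this term over a batch and adds value and entropy losses.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clipped_term(1.5, advantage=1.0))   # gain is capped once ratio exceeds 1 + eps
print(ppo_clipped_term(0.5, advantage=-1.0))  # pessimistic bound is kept for bad actions
print(ppo_clipped_term(1.0, advantage=2.0))   # unchanged when the policy has not moved
```

Taking the minimum removes the incentive to push the ratio far outside `[1 - eps, 1 + eps]`, which is what keeps each update step conservative.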

### 8.3 Implementation Guide

In [None]:
print("\n" + "="*70)
print("FINAL SUMMARY: POLICY GRADIENT")
print("="*70)

summary = """
RESULTS ON CARTPOLE-V1 (300 training episodes):

REINFORCE (with baseline):
  - Training reward (last 100 episodes): {:.2f}
  - Evaluation reward: {:.2f} ± {:.2f}
  - Speed: Slow
  - Stability: Good

A2C (with GAE λ=0.95):
  - Training reward (last 100 episodes): {:.2f}
  - Evaluation reward: {:.2f} ± {:.2f}
  - Speed: Fast
  - Stability: Very good

KEY CONCLUSIONS:

1. A2C is typically faster and more stable than REINFORCE
2. The critic (value network) is key to reducing variance
3. GAE provides a flexible interpolation between TD and MC
4. Policy gradient converges to local optima, which are not necessarily global
5. Entropy regularization encourages exploration

USAGE RECOMMENDATIONS:

✓ Use REINFORCE if:
  - You are learning policy gradient
  - You need maximum simplicity

✓ Use A2C if:
  - You need a practical general-purpose baseline
  - Variance is a problem
  - You have any type of action space (continuous/discrete)
  - You plan to parallelize later (A3C is its asynchronous variant)
  
✓ Use PPO if:
  - You want a simple, state-of-the-art method
  - You need robustness to hyperparameters
  - You want better overall performance
"""

print(summary.format(
    np.mean(history_reinforce['episode_rewards'][-100:]),
    mean_reward, std_reward,
    np.mean(history_a2c['episode_rewards'][-100:]),
    mean_reward_a2c, std_reward_a2c
))

print("="*70)
print("\nCongratulations! You have completed the Policy Gradient tutorial.")
print("\nNext steps:")
print("  1. Implement PPO (similar to A2C but more stable)")
print("  2. Try more complex environments (Atari, robotics)")
print("  3. Study A3C for parallelization")
print("  4. Explore combinations: DQN + Policy Gradient")