# Policy Gradient Methods: REINFORCE and Actor-Critic

Welcome to the comprehensive tutorial on Policy Gradient methods in Deep Reinforcement Learning!

By the end of this notebook, you will be able to:
- Understand the theoretical foundations of the Policy Gradient Theorem
- Implement the REINFORCE algorithm with and without a baseline
- Implement Actor-Critic methods with Generalized Advantage Estimation (GAE)
- Compare different policy gradient approaches empirically
- Apply these methods to discrete and continuous action spaces

Let's get started!

## Table of Contents
- [1 - Packages](#1)
- [2 - Mathematical Foundations](#2)
    - [2.1 - Policy Gradient Theorem](#2-1)
    - [2.2 - REINFORCE Algorithm](#2-2)
    - [2.3 - Baseline for Variance Reduction](#2-3)
    - [2.4 - Actor-Critic Methods](#2-4)
    - [2.5 - Generalized Advantage Estimation](#2-5)
- [3 - Exercise 1: Implement Policy Network](#ex-1)
- [4 - Exercise 2: Implement REINFORCE Loss](#ex-2)
- [5 - Exercise 3: Implement Baseline (Value Network)](#ex-3)
- [6 - Exercise 4: Implement Actor-Critic](#ex-4)
- [7 - Exercise 5: Implement GAE](#ex-5)
- [8 - Experimental Comparison](#8)
    - [8.1 - REINFORCE vs A2C](#8-1)
    - [8.2 - Impact of Baseline](#8-2)
    - [8.3 - Continuous Action Spaces](#8-3)
- [9 - Conclusions](#9)

<a name='1'></a>
## 1 - Packages

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical, Normal
import gymnasium as gym
from typing import List, Tuple, Optional, Dict, Any
import matplotlib.pyplot as plt
from collections import deque
import warnings
warnings.filterwarnings('ignore')

# Import test utilities
from pg_utils import (
    test_implement_policy_network,
    test_implement_reinforce_loss,
    test_implement_baseline,
    test_implement_actor_critic,
    test_implement_gae
)

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
print(f"PyTorch version: {torch.__version__}")

<a name='2'></a>
## 2 - Mathematical Foundations

<a name='2-1'></a>
### 2.1 - Policy Gradient Theorem

The **Policy Gradient Theorem** is the cornerstone of policy-based reinforcement learning:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) Q^\pi(s,a)]$$

where:
- $\theta$ = policy parameters
- $\rho^\pi(s)$ = state visitation distribution under $\pi$
- $\pi_\theta(a|s)$ = policy (probability of action $a$ given state $s$)
- $Q^\pi(s,a)$ = state-action value function
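- $\nabla_\theta \log \pi_\theta(a|s)$ = score function (gradient of the log-probability of the chosen action)

**Key insight:** The gradient is an expectation of the score function weighted by how good each action is (its Q-value), so actions with higher $Q^\pi(s,a)$ have their log-probability pushed up more strongly.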
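
As a sanity check on the theorem, the sketch below uses a hypothetical one-state, three-action problem (a bandit) with made-up action values `q`, so that $J(\theta) = \sum_a \pi_\theta(a) Q(a)$ can be differentiated numerically and compared with the score-function form. It relies on the standard softmax identity $\partial \log \pi_\theta(a) / \partial \theta_i = \mathbb{1}[i = a] - \pi_\theta(i)$. All function names and numbers are illustrative assumptions, not part of the graded code.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def objective(theta, q):
    """J(theta) = sum_a pi(a) * Q(a) for a one-state problem."""
    return sum(p * qa for p, qa in zip(softmax(theta), q))

def policy_gradient(theta, q):
    """Score-function form: sum_a pi(a) * grad log pi(a) * Q(a)."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p_a, q_a) in enumerate(zip(pi, q)):
        for i in range(len(theta)):
            score = (1.0 if i == a else 0.0) - pi[i]  # d log pi(a) / d theta_i
            grad[i] += p_a * score * q_a
    return grad

def finite_difference(theta, q, eps=1e-6):
    """Central-difference estimate of dJ/dtheta, used only as a reference."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up, q) - objective(down, q)) / (2 * eps))
    return grad

theta = [0.2, -0.5, 1.0]   # assumed logits
q = [1.0, 2.0, 0.5]        # assumed action values
pg = policy_gradient(theta, q)
fd = finite_difference(theta, q)
print(all(abs(a - b) < 1e-6 for a, b in zip(pg, fd)))
```

The two gradients agree to numerical precision, which is what the theorem predicts for this toy case.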

<a name='2-2'></a>
### 2.2 - REINFORCE Algorithm

**REINFORCE** estimates $Q^\pi(s,a)$ using Monte Carlo returns:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) G_t$$

where $G_t$ is the discounted return from step $t$ until the episode ends at step $T$:
$$G_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k}$$

**Update rule:**
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

**Advantages:**
- Unbiased gradient estimates
- Works with episodic tasks
- Handles continuous action spaces naturally

**Disadvantages:**
- HIGH VARIANCE: Monte Carlo returns accumulate the randomness of every subsequent action and transition
- SLOW CONVERGENCE: Requires many samples
- NO BOOTSTRAPPING: Updates must wait until the episode ends, when $G_t$ is known
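
The return $G_t$ can be computed in a single backward pass over an episode's rewards using the recursion $G_t = r_t + \gamma G_{t+1}$. The helper below is a minimal sketch; `discounted_returns` is an illustrative name and is not one of the graded functions.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

The backward pass is linear in the episode length, whereas evaluating the sum separately for every $t$ would be quadratic.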

<a name='2-3'></a>
### 2.3 - Baseline for Variance Reduction

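Subtracting a baseline $b(s)$ that depends only on the state reduces variance **without introducing bias**, because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$: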

$$\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) [G_t - b(s_t)]$$

A common, near-optimal choice of **baseline** is the state value function:
$$b^*(s) = V^\pi(s) = \mathbb{E}[G_t | s_t = s]$$

This leads to the **Advantage Function**:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

which tells us how much better action $a$ is compared to the average action in state $s$.
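
To make the "lower variance, same mean" claim concrete, the sketch below enumerates every action of a hypothetical two-action softmax policy and computes the exact mean and total variance of the single-sample estimator $\nabla_\theta \log \pi_\theta(a) \, (Q(a) - b)$, once with $b = 0$ and once with $b = V$. The policy probabilities and action values are made-up numbers chosen for easy arithmetic.

```python
def score(pi, a):
    """Gradient of log softmax probability of action a w.r.t. the logits."""
    return [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]

def estimator_stats(pi, q, baseline):
    """Exact mean vector and total variance of score(a) * (Q(a) - baseline)."""
    n = len(pi)
    samples = [[s * (q[a] - baseline) for s in score(pi, a)] for a in range(n)]
    mean = [sum(pi[a] * samples[a][i] for a in range(n)) for i in range(n)]
    var = sum(pi[a] * (samples[a][i] - mean[i]) ** 2
              for a in range(n) for i in range(n))
    return mean, var

pi = [0.5, 0.5]                            # assumed policy probabilities
q = [10.0, 11.0]                           # assumed action values
v = sum(p * qa for p, qa in zip(pi, q))    # state value V = 10.5

mean_raw, var_raw = estimator_stats(pi, q, baseline=0.0)
mean_base, var_base = estimator_stats(pi, q, baseline=v)
print(mean_raw, mean_base)   # [-0.25, 0.25] [-0.25, 0.25]
print(var_raw, var_base)     # 55.125 0.0
```

The expected gradient is identical in both cases, while the variance drops from 55.125 to 0 in this deliberately simple example. In realistic problems the reduction is smaller but follows the same mechanism.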

<a name='2-4'></a>
### 2.4 - Actor-Critic Methods

**Actor-Critic** combines two networks:

1. **Actor** (Policy Network): $\pi_\theta(a|s)$
   - Updates using policy gradient
   - Goal: Learn better actions

2. **Critic** (Value Network): $V_\phi(s)$
   - Updates using Temporal Difference (TD) learning
   - Goal: Better baseline for variance reduction

**Advantages over REINFORCE:**
- LOWER VARIANCE: Bootstrapped TD targets replace Monte Carlo returns, at the cost of some bias from critic errors
- FASTER CONVERGENCE: Critic provides immediate feedback
- ONLINE UPDATES: Don't need to wait for episode end

**Loss functions:**

Actor loss (policy improvement):
$$L_{actor} = -\log \pi_\theta(a|s) \cdot A(s,a)$$

Critic loss (value estimation):
$$L_{critic} = (r + \gamma V_\phi(s') - V_\phi(s))^2$$
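
The two losses can be traced by hand for a single transition. The sketch below uses made-up numbers for the reward, the critic outputs, and the log-probability, and uses the one-step TD error as the advantage estimate. The `terminated` flag is an added assumption that zeroes the bootstrap term when $s'$ is terminal.

```python
def td_error(reward, value, next_value, gamma=0.99, terminated=False):
    """One-step TD error: r + gamma * V(s') - V(s), with no bootstrap at terminal states."""
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap - value

# Assumed numbers for one transition.
delta = td_error(reward=1.0, value=0.5, next_value=0.6)   # about 1.094
log_prob = -0.7                                           # assumed log pi(a|s)

actor_loss = -log_prob * delta    # positive advantage: minimizing raises log pi(a|s)
critic_loss = delta ** 2          # squared TD error

print(round(delta, 3), round(actor_loss, 4), round(critic_loss, 4))
```

A positive TD error means the action turned out better than the critic expected, so the actor step increases its probability and the critic step moves $V_\phi(s)$ upward.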

<a name='2-5'></a>
### 2.5 - Generalized Advantage Estimation (GAE)

GAE provides a flexible interpolation between TD (low variance, high bias) and Monte Carlo (high variance, no bias):

$$A_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where the TD error is:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

**The $\lambda$ parameter controls bias-variance tradeoff:**
- $\lambda = 0$: Pure TD(0) - low variance, high bias
- $\lambda = 0.95$: Recommended (good balance)
- $\lambda = 1$: Pure Monte Carlo - high variance, no bias
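
The two limiting cases can be checked numerically. The sketch below computes GAE for a short, made-up episode that ends in a terminal state and confirms that $\lambda = 0$ reproduces the one-step TD errors while $\lambda = 1$ reproduces $G_t - V(s_t)$. The function name, rewards, and critic values are illustrative assumptions.

```python
def gae_advantages(rewards, values, last_value, gamma, lam):
    """GAE for one episode; last_value is V(s_T), 0.0 if the episode terminated."""
    next_values = values[1:] + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] - values[t]   # TD error
        running = delta + gamma * lam * running                   # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 1.0]      # assumed critic estimates V(s_0), V(s_1), V(s_2)
gamma = 0.9

one_step = gae_advantages(rewards, values, 0.0, gamma, lam=0.0)
monte_carlo = gae_advantages(rewards, values, 0.0, gamma, lam=1.0)

# Reference values computed directly from the definitions.
td = [1.0 + 0.9 * 0.4 - 0.5, 0.0 + 0.9 * 1.0 - 0.4, 2.0 + 0.9 * 0.0 - 1.0]
mc = [2.62 - 0.5, 1.8 - 0.4, 2.0 - 1.0]   # G_t - V(s_t) with G = [2.62, 1.8, 2.0]

print(all(abs(a - b) < 1e-9 for a, b in zip(one_step, td)))
print(all(abs(a - b) < 1e-9 for a, b in zip(monte_carlo, mc)))
```

Intermediate values of $\lambda$ give an exponentially weighted mix of the $n$-step advantage estimates between these two extremes.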

<a name='ex-1'></a>
## 3 - Exercise 1: Implement Policy Network

**Objective:** Create a neural network that outputs action probabilities (for discrete actions) or action distribution parameters (for continuous actions).

Complete the `PolicyNetwork` class below. This network should:
- Take state as input
- Output logits for discrete actions OR (mean, log_std) for continuous actions
- Have a method `get_action()` that samples actions and computes log probabilities

In [None]:
# GRADED FUNCTION: PolicyNetwork

class PolicyNetwork(nn.Module):
    """Neural network for policy π_θ(a|s)"""
    
    def __init__(self, state_dim: int, action_dim: int,
                 hidden_dims: List[int] = [128, 128],
                 continuous: bool = False):
        super(PolicyNetwork, self).__init__()
        
        self.continuous = continuous
        self.action_dim = action_dim
        
        # YOUR CODE STARTS HERE
        # Build shared hidden layers
        # Then create action head(s) based on continuous flag
        
        layers = []
        prev_dim = state_dim
        
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim
        
        self.shared_layers = nn.Sequential(*layers)
        
        if continuous:
            # For continuous: output mean and log standard deviation
            self.mean_layer = nn.Linear(prev_dim, action_dim)
            self.log_std_layer = nn.Linear(prev_dim, action_dim)
        else:
            # For discrete: output logits for each action
            self.action_head = nn.Linear(prev_dim, action_dim)
        # YOUR CODE ENDS HERE
    
    def forward(self, state: torch.Tensor):
        """Forward pass through network"""
        x = self.shared_layers(state)
        
        if self.continuous:
            mean = self.mean_layer(x)
            log_std = torch.clamp(self.log_std_layer(x), -20, 2)
            return mean, log_std
        else:
            logits = self.action_head(x)
            return logits, None
    
    def get_action(self, state: torch.Tensor, deterministic: bool = False):
        """Sample action from policy"""
        if self.continuous:
            mean, log_std = self.forward(state)
            std = log_std.exp()
            
            if deterministic:
                return mean, None, None
            
            dist = Normal(mean, std)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1)
            return action, log_prob, entropy
        else:
            logits, _ = self.forward(state)
            
            if deterministic:
                action = logits.argmax(dim=-1)
                return action, None, None
            
            dist = Categorical(logits=logits)
            action = dist.sample()
            log_prob = dist.log_prob(action)
            entropy = dist.entropy()
            return action, log_prob, entropy

print("PolicyNetwork defined")

In [None]:
# Test Exercise 1
print("Testing Exercise 1: PolicyNetwork")
test_implement_policy_network()
print("\nAll tests passed!")

<font color='blue'>

**What you should remember:**
- A policy network outputs action probabilities (discrete) or action distribution parameters (continuous)
- The score function $\nabla_\theta \log \pi_\theta(a|s)$ is what we use to update the policy
- For continuous actions, we typically use a Gaussian distribution with learned mean and standard deviation
- The log probability is crucial for computing policy gradients

</font>
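
The shapes involved in sampling and scoring actions are a frequent source of bugs, so here is a small standalone sketch using `torch.distributions` directly (it does not depend on `PolicyNetwork`). The batch size, action counts, and `log_std` value are arbitrary assumptions.

```python
import torch
from torch.distributions import Categorical, Normal

torch.manual_seed(0)

# Discrete case: a batch of 4 states, 3 possible actions.
logits = torch.randn(4, 3)
cat = Categorical(logits=logits)
actions = cat.sample()                 # shape (4,), integer action indices
log_probs = cat.log_prob(actions)      # shape (4,)

# Continuous case: a batch of 4 states, 2-dimensional actions.
mean = torch.zeros(4, 2)
log_std = torch.full((4, 2), -0.5)
normal = Normal(mean, log_std.exp())
cont_actions = normal.sample()         # shape (4, 2)
# Independent dimensions: the joint log-density is the sum over action dims.
cont_log_probs = normal.log_prob(cont_actions).sum(dim=-1)   # shape (4,)

print(actions.shape, log_probs.shape, cont_log_probs.shape)
```

In both cases the result is one log-probability per state, which is the shape the loss functions later in the notebook expect.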

<a name='ex-2'></a>
## 4 - Exercise 2: Implement REINFORCE Loss

**Objective:** Implement the REINFORCE loss function that optimizes the policy to maximize expected returns.

The loss function should:
- Take log probabilities and returns (or advantages) as input
- Compute: $L = -\frac{1}{N} \sum_t \log \pi_\theta(a_t|s_t) \cdot G_t$
- Be differentiable with respect to policy parameters

In [None]:
# GRADED FUNCTION: compute_reinforce_loss

def compute_reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """
    Compute REINFORCE loss (negative expected return weighted by log probabilities)
    
    Arguments:
    log_probs -- log probabilities of taken actions, shape (batch_size,)
    returns -- cumulative discounted returns, shape (batch_size,)
    
    Returns:
    loss -- scalar loss value (we want to minimize this)
    """
    # YOUR CODE STARTS HERE
    # Compute the negative expected return weighted by log probabilities
    # This is: -mean(log_prob * return)
    loss = -(log_probs * returns).mean()
    # YOUR CODE ENDS HERE
    return loss

print("compute_reinforce_loss defined")

In [None]:
# Test Exercise 2
print("Testing Exercise 2: REINFORCE Loss")
test_implement_reinforce_loss()
print("\nAll tests passed!")

<font color='blue'>

**What you should remember:**
- REINFORCE loss is: $L = -\frac{1}{N} \sum \log \pi(a|s) \cdot G$
- We negate the objective because optimizers minimize a loss, whereas we want to maximize expected return
- High-variance returns can make training unstable (this is why baselines are important)
- The gradient estimate is unbiased up to a constant factor: $\mathbb{E}[-\nabla_\theta L] \propto \nabla_\theta J(\theta)$

</font>
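
To see the sign convention in action, the sketch below takes one plain gradient-descent step on the REINFORCE loss for a hypothetical two-action policy and checks that the probability of an action with a positive return goes up. The uniform starting logits, the return of 1.0, and the step size of 0.1 are assumptions made for illustration.

```python
import torch
from torch.distributions import Categorical

logits = torch.zeros(2, requires_grad=True)   # uniform policy over 2 actions
action = torch.tensor(0)                      # the action that was taken
ret = torch.tensor(1.0)                       # assumed positive return

prob_before = torch.softmax(logits, dim=0)[0].item()

# REINFORCE loss for a single sample: -log pi(a) * G
loss = -(Categorical(logits=logits).log_prob(action) * ret)
loss.backward()

with torch.no_grad():
    logits -= 0.1 * logits.grad               # one gradient-descent step

prob_after = torch.softmax(logits, dim=0)[0].item()
print(prob_before < prob_after)  # True
```

With a negative return the same step would lower the action's probability, which is why the scale and sign of the returns matter so much for stability.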

<a name='ex-3'></a>
## 5 - Exercise 3: Implement Baseline (Value Network)

**Objective:** Implement a value network that estimates $V(s)$, used as a baseline to reduce variance.

Requirements:
- Input: state (shape: [batch_size, state_dim])
- Output: scalar value estimate for each state (shape: [batch_size])
- Used to compute advantages: $A = G_t - V(s_t)$

In [None]:
# GRADED FUNCTION: ValueNetwork

class ValueNetwork(nn.Module):
    """Neural network for value function V(s)"""
    
    def __init__(self, state_dim: int, hidden_dims: List[int] = [128, 128]):
        super(ValueNetwork, self).__init__()
        
        # YOUR CODE STARTS HERE
        # Build a neural network:
        # Input: state_dim
        # Hidden layers with ReLU activations
        # Output: 1 (scalar value)
        
        layers = []
        prev_dim = state_dim
        
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim
        
        layers.append(nn.Linear(prev_dim, 1))
        self.network = nn.Sequential(*layers)
        # YOUR CODE ENDS HERE
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        """Forward pass returns V(s)"""
        return self.network(state).squeeze(-1)

print("ValueNetwork defined")

In [None]:
# Test Exercise 3
print("Testing Exercise 3: Baseline (Value Network)")
test_implement_baseline()
print("\nAll tests passed!")

<font color='blue'>

**What you should remember:**
- A state-dependent baseline such as $b(s) = V(s)$ reduces variance without adding bias
- The state value function is a common, near-optimal choice of baseline
- Advantages $A = G_t - V(s_t)$ tell us how much better/worse an action is vs. average
- The value network is trained with MSE loss: $L = (G_t - V(s_t))^2$
- Baselines are critical for practical policy gradient learning

</font>
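
When the value network serves as a baseline, the advantages are usually treated as constants in the policy loss so that the actor's gradient does not flow into the critic. The sketch below illustrates this with a stand-in linear critic and made-up returns and log-probabilities. It is an assumption-laden toy, not the graded `ValueNetwork`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
critic = nn.Linear(4, 1)                               # stand-in value network
states = torch.randn(5, 4)
returns = torch.tensor([1.0, 0.5, 2.0, 0.0, 1.5])      # assumed Monte Carlo returns
log_probs = torch.randn(5, requires_grad=True)         # stand-in for actor outputs

values = critic(states).squeeze(-1)
advantages = (returns - values).detach()               # constants from the actor's view

actor_loss = -(log_probs * advantages).mean()
actor_loss.backward()
critic_untouched = critic.weight.grad is None          # no gradient reached the critic

critic_loss = ((returns - values) ** 2).mean()         # the critic has its own MSE loss
critic_loss.backward()
critic_trained = critic.weight.grad is not None

print(critic_untouched, critic_trained)  # True True
```

Keeping the two backward passes separate makes it explicit which loss trains which network.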

<a name='ex-4'></a>
## 6 - Exercise 4: Implement Actor-Critic

**Objective:** Combine the policy network (actor) and value network (critic) into a unified training procedure.

The actor-critic algorithm should:
1. Collect experience using the current policy
2. Compute TD errors: $\delta = r + \gamma V(s') - V(s)$
3. Update actor using policy gradient weighted by advantages
4. Update critic using TD loss

In [None]:
# GRADED FUNCTION: ActorCriticAgent

class ActorCriticAgent:
    """Advantage Actor-Critic (A2C) agent"""
    
    def __init__(self, state_dim: int, action_dim: int,
                 continuous: bool = False,
                 actor_lr: float = 3e-4,
                 critic_lr: float = 1e-3,
                 gamma: float = 0.99,
                 entropy_coef: float = 0.01,
                 normalize_advantages: bool = True,
                 hidden_dims: List[int] = [256, 256]):
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.continuous = continuous
        self.gamma = gamma
        self.entropy_coef = entropy_coef
        self.normalize_advantages = normalize_advantages
        self.device = device
        
        # YOUR CODE STARTS HERE
        # Initialize actor (PolicyNetwork) and critic (ValueNetwork)
        # Create Adam optimizers for both
        
        self.actor = PolicyNetwork(state_dim, action_dim, hidden_dims, continuous).to(device)
        self.critic = ValueNetwork(state_dim, hidden_dims).to(device)
        
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=critic_lr)
        # YOUR CODE ENDS HERE
        
        self.history = {
            'episode_rewards': [],
            'actor_losses': [],
            'critic_losses': [],
        }
    
    def compute_td_error(self, rewards: List[float], values: torch.Tensor,
                        next_values: torch.Tensor, dones: List[bool]) -> torch.Tensor:
        """Compute TD errors for advantage estimation"""
        # YOUR CODE STARTS HERE
        # δ_t = r_t + γV(s_{t+1}) - V(s_t)
        td_errors = []
        for t in range(len(rewards)):
            next_val = next_values[t] if (t == len(rewards) - 1) else values[t + 1]
            delta = rewards[t] + self.gamma * next_val * (1 - dones[t]) - values[t]
            td_errors.append(delta)
        # YOUR CODE ENDS HERE
        return torch.stack(td_errors)
    
    def train_step(self, states: List[np.ndarray], actions: List,
                  rewards: List[float], next_states: List[np.ndarray],
                  dones: List[bool]) -> Dict[str, float]:
        """Single training step"""
        states_tensor = torch.FloatTensor(np.array(states)).to(self.device)
        next_states_tensor = torch.FloatTensor(np.array(next_states)).to(self.device)
        
        # Compute values
        with torch.no_grad():
            values = self.critic(states_tensor)
            next_values = self.critic(next_states_tensor)
        
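        # Compute advantages
        advantages = self.compute_td_error(rewards, values, next_values, dones)
        
        # Critic targets must be built from the raw (unnormalized) advantages
        returns = advantages + values
        
        # std() of a single element is NaN, so only normalize batches of 2 or more
        if self.normalize_advantages and advantages.numel() > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)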
        
        # YOUR CODE STARTS HERE
        # Get log probabilities and compute actor loss
        # Compute critic loss
        # Update both networks
        
        if self.continuous:
            actions_tensor = torch.FloatTensor(np.array(actions)).to(self.device)
            mean, log_std = self.actor.forward(states_tensor)
            std = log_std.exp()
            dist = Normal(mean, std)
            log_probs = dist.log_prob(actions_tensor).sum(dim=-1)
            entropies = dist.entropy().sum(dim=-1)
        else:
            actions_tensor = torch.LongTensor(actions).to(self.device)
            logits, _ = self.actor.forward(states_tensor)
            dist = Categorical(logits=logits)
            log_probs = dist.log_prob(actions_tensor)
            entropies = dist.entropy()
        
        actor_loss = -(log_probs * advantages.detach()).mean()
        entropy_loss = -entropies.mean()
        total_actor_loss = actor_loss + self.entropy_coef * entropy_loss
        
        values_new = self.critic(states_tensor)
        critic_loss = F.mse_loss(values_new, returns.detach())
        
        self.actor_optimizer.zero_grad()
        total_actor_loss.backward()
        self.actor_optimizer.step()
        
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        # YOUR CODE ENDS HERE
        
        return {
            'actor_loss': actor_loss.item(),
            'critic_loss': critic_loss.item(),
        }

print("ActorCriticAgent defined")

In [None]:
# Test Exercise 4
print("Testing Exercise 4: Actor-Critic")
test_implement_actor_critic()
print("\nAll tests passed!")

<font color='blue'>

**What you should remember:**
- Actor-Critic combines policy gradient (actor) with value learning (critic)
- TD errors $\delta_t = r_t + \gamma V(s') - V(s)$ estimate the advantage
- Actor loss: $L_{actor} = -\log \pi(a|s) \cdot \delta_t$
- Critic loss: $L_{critic} = \delta_t^2$
- Actor-Critic typically converges faster than pure REINFORCE due to lower variance
- The actor and critic can share hidden layers to reuse features (we keep them separate for clarity)

</font>
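
One detail that is easy to get wrong in an actor-critic update is the order of advantage normalization and critic-target construction. The sketch below assumes the critic should regress on targets built from the raw advantages, while only the actor sees the normalized copy. The numbers and the `normalize` helper are illustrative.

```python
import statistics

def normalize(xs, eps=1e-8):
    """Zero-mean, unit-variance rescaling (sample std, like torch.Tensor.std's default)."""
    mean = statistics.fmean(xs)
    std = statistics.stdev(xs)
    return [(x - mean) / (std + eps) for x in xs]

values = [10.0, 12.0, 11.0]       # assumed critic outputs V(s_t)
advantages = [1.0, -2.0, 4.0]     # assumed raw advantage estimates

# Critic targets keep the original reward scale.
returns = [a + v for a, v in zip(advantages, values)]

# The actor only needs relative weights, so rescaling is harmless there.
actor_weights = normalize(advantages)

print(returns)   # [11.0, 10.0, 15.0]
```

If the targets were built from `actor_weights` instead, they would sit within roughly one unit of the current value estimates regardless of the true reward scale, and the critic would stop tracking actual returns.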

<a name='ex-5'></a>
## 7 - Exercise 5: Implement GAE

**Objective:** Implement Generalized Advantage Estimation (GAE) for more efficient advantage computation.

GAE formula:
$$A_t^{GAE} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

This interpolates between:
- $\lambda = 0$: Pure 1-step TD (low variance, high bias)
- $\lambda = 1$: Monte Carlo returns (high variance, no bias)

In [None]:
# GRADED FUNCTION: compute_gae

def compute_gae(rewards: List[float], values: torch.Tensor,
                next_values: torch.Tensor, dones: List[bool],
                gamma: float = 0.99, gae_lambda: float = 0.95) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Compute Generalized Advantage Estimation
    
    Arguments:
    rewards -- list of rewards for each timestep
    values -- value estimates V(s_t), shape (T,)
    next_values -- value estimates V(s_{t+1}), shape (T,)
    dones -- whether episode ended at each timestep
    gamma -- discount factor
    gae_lambda -- GAE parameter (0-1)
    
    Returns:
    advantages -- estimated advantages A(s,a)
    returns -- estimated returns G_t
    """
    # YOUR CODE STARTS HERE
    # Compute TD errors: δ_t = r_t + γV(s_{t+1}) - V(s_t)
    # Then compute GAE by accumulating: A_t = δ_t + (γλ) * A_{t+1}
    
    # Preallocate on the same device as `values` so `advantages + values` also works on GPU
    advantages = torch.zeros(len(rewards), dtype=torch.float32, device=values.device)
    gae = 0.0
    
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            next_value = next_values[t]
        else:
            next_value = values[t + 1]
        
        # TD error
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        
        # GAE recursion
        gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
        advantages[t] = float(gae)
    
    returns = advantages + values
    # YOUR CODE ENDS HERE
    
    return advantages, returns

print("compute_gae defined")

In [None]:
# Test Exercise 5
print("Testing Exercise 5: GAE")
test_implement_gae()
print("\nAll tests passed!")

<font color='blue'>

**What you should remember:**
- GAE is computed backwards through the episode: $A_t = \delta_t + (\gamma\lambda) A_{t+1}$
- GAE parameter $\lambda$ controls the bias-variance tradeoff
- $\lambda = 0.95$ is a good default that works well in practice
- GAE allows efficient advantage estimation without waiting for episode end
- Returns are computed as: $G_t = A_t + V(s_t)$

</font>
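
The `(1 - done)` factor in the recursion matters most when several episodes are stored back to back in one batch, because it stops advantage from leaking across an episode boundary. The pure-Python sketch below uses two made-up two-step episodes with all value estimates set to zero, so the advantages reduce to plain reward sums. `gae_with_dones` is an illustrative name, not the graded function.

```python
def gae_with_dones(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """GAE over a flat batch; dones[t] marks the last step of an episode."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - float(dones[t])          # 0 at an episode end, 1 otherwise
        delta = rewards[t] + gamma * next_values[t] * mask - values[t]
        running = delta + gamma * lam * mask * running
        advantages[t] = running
    return advantages

# Two 2-step episodes stored consecutively; zero values keep the arithmetic simple.
rewards = [1.0, 1.0, 1.0, 1.0]
zeros = [0.0, 0.0, 0.0, 0.0]
dones = [False, True, False, True]

adv = gae_with_dones(rewards, zeros, zeros, dones, gamma=1.0, lam=1.0)
print(adv)  # [2.0, 1.0, 2.0, 1.0]
```

Without the mask, the first episode's advantages would also include the second episode's rewards.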

<a name='8'></a>
## 8 - Experimental Comparison

Now let's train agents and compare different policy gradient approaches.

In [None]:
# Create CartPole environment
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print(f"Environment: CartPole-v1")
print(f"  State dimension: {state_dim}")
print(f"  Action dimension: {action_dim}")
print(f"  Action space: Discrete ({action_dim} actions)")

<a name='8-1'></a>
### 8.1 - REINFORCE vs A2C Comparison

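The cell below defines a generic training loop and trains the Actor-Critic agent on CartPole. A REINFORCE agent exposing the same `actor` and `train_step` interface could be trained with the same function for a side-by-side comparison: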

In [None]:
# Training function
def train_agent(agent, env, n_episodes=300, max_steps=500, agent_name="Agent"):
    """Train an agent for n_episodes"""
    print(f"Training {agent_name}...\n")
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        
        states = []
        actions = []
        rewards = []
        next_states = []
        dones = []
        episode_reward = 0
        
        for step in range(max_steps):
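            # Get action from policy
            with torch.no_grad():
                action, _, _ = agent.actor.get_action(
                    torch.FloatTensor(state).unsqueeze(0).to(device)
                )
            # Discrete envs expect an int; continuous envs expect an array of shape (action_dim,)
            if agent.continuous:
                action = action.squeeze(0).cpu().numpy()
            else:
                action = action.item()
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            states.append(state)
            actions.append(action)
            rewards.append(float(reward))
            next_states.append(next_state)
            # Only true termination disables bootstrapping; a time-limit truncation does not
            dones.append(float(terminated))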
            episode_reward += reward
            
            state = next_state
            if done:
                break
        
        # Train
        metrics = agent.train_step(states, actions, rewards, next_states, dones)
        agent.history['episode_rewards'].append(episode_reward)
        agent.history['actor_losses'].append(metrics['actor_loss'])
        agent.history['critic_losses'].append(metrics['critic_loss'])
        
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(agent.history['episode_rewards'][-50:])
            print(f"Episode {episode + 1}/{n_episodes} | Avg Reward (50): {avg_reward:.1f}")
    
    print("Training complete!\n")
    return agent.history

# Train Actor-Critic
ac_agent = ActorCriticAgent(
    state_dim=state_dim,
    action_dim=action_dim,
    continuous=False,
    actor_lr=3e-4,
    critic_lr=1e-3,
    gamma=0.99,
    entropy_coef=0.01
)

history_ac = train_agent(ac_agent, env, n_episodes=300, agent_name="Actor-Critic")

### Performance Comparison Table

Let's summarize the Actor-Critic run in a table:

In [None]:
import pandas as pd

# Compute statistics
ac_last100 = np.mean(history_ac['episode_rewards'][-100:])
ac_final_var = np.var(history_ac['episode_rewards'][-100:])
ac_final_loss = np.mean(history_ac['actor_losses'][-10:])

# Create comparison table
comparison_data = {
    'Metric': [
        'Avg Reward (last 100 episodes)',
        'Reward Variance (last 100)',
        'Final Actor Loss',
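        'Final Critic Loss',
        'Reward Std Dev (last 100)'
    ],
    'Actor-Critic': [
        f"{ac_last100:.1f}",
        f"{ac_final_var:.1f}",
        f"{ac_final_loss:.4f}",
        f"{np.mean(history_ac['critic_losses'][-10:]):.4f}",
        # Measured spread rather than a hard-coded qualitative rating
        f"{np.std(history_ac['episode_rewards'][-100:]):.1f}"
    ]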
}

df_comparison = pd.DataFrame(comparison_data)

print("\n" + "="*70)
print("PERFORMANCE COMPARISON: Actor-Critic Methods")
print("="*70)
print(df_comparison.to_string(index=False))
print("="*70)

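### Comparison Summary

The table below is a qualitative rule-of-thumb summary, not a measurement from this notebook. PPO is listed for reference only and is not implemented here.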

<table style="border: 2px solid black; margin-left: 20px;">
    <tr style="background-color: #4CAF50; color: white;">
        <th style="border: 1px solid black; padding: 10px;">Method</th>
        <th style="border: 1px solid black; padding: 10px;">Variance</th>
        <th style="border: 1px solid black; padding: 10px;">Convergence Speed</th>
        <th style="border: 1px solid black; padding: 10px;">Stability</th>
        <th style="border: 1px solid black; padding: 10px;">Continuous Actions</th>
    </tr>
    <tr style="background-color: #f2f2f2;">
        <td style="border: 1px solid black; padding: 10px;"><strong>REINFORCE</strong></td>
        <td style="border: 1px solid black; padding: 10px;">High</td>
        <td style="border: 1px solid black; padding: 10px;">Slow</td>
        <td style="border: 1px solid black; padding: 10px;">Good</td>
        <td style="border: 1px solid black; padding: 10px;">Excellent</td>
    </tr>
    <tr style="background-color: #ffffff;">
        <td style="border: 1px solid black; padding: 10px;"><strong>A2C/Actor-Critic</strong></td>
        <td style="border: 1px solid black; padding: 10px;">Low</td>
        <td style="border: 1px solid black; padding: 10px;">Fast</td>
        <td style="border: 1px solid black; padding: 10px;">Very Good</td>
        <td style="border: 1px solid black; padding: 10px;">Excellent</td>
    </tr>
    <tr style="background-color: #f2f2f2;">
        <td style="border: 1px solid black; padding: 10px;"><strong>PPO</strong></td>
        <td style="border: 1px solid black; padding: 10px;">Very Low</td>
        <td style="border: 1px solid black; padding: 10px;">Very Fast</td>
        <td style="border: 1px solid black; padding: 10px;">Excellent</td>
        <td style="border: 1px solid black; padding: 10px;">Excellent</td>
    </tr>
</table>

<a name='8-2'></a>
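### 8.2 - Impact of Baseline

The plots below show the Actor-Critic agent's reward curve and its actor and critic losses. The critic acts as the learned baseline here; this notebook does not train a baseline-free agent for direct comparison.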

In [None]:
# Visualize training progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

window = 20
rewards = history_ac['episode_rewards']
ma = np.convolve(rewards, np.ones(window)/window, mode='valid')

# Rewards
ax = axes[0]
ax.plot(rewards, alpha=0.3, color='blue')
ax.plot(range(window-1, len(rewards)), ma, linewidth=2.5, color='darkblue')
ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Episode Reward', fontsize=11)
ax.set_title('Actor-Critic: Training Progress', fontsize=12)
ax.grid(True, alpha=0.3)

# Actor vs Critic Loss
ax = axes[1]
actor_losses = history_ac['actor_losses']
critic_losses = history_ac['critic_losses']

ax.plot(actor_losses, alpha=0.3, label='Actor Loss', color='green')
ax.plot(critic_losses, alpha=0.3, label='Critic Loss', color='red')

if len(actor_losses) >= window:
    actor_ma = np.convolve(actor_losses, np.ones(window)/window, mode='valid')
    critic_ma = np.convolve(critic_losses, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(actor_losses)), actor_ma, linewidth=2, color='darkgreen')
    ax.plot(range(window-1, len(critic_losses)), critic_ma, linewidth=2, color='darkred')

ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Loss', fontsize=11)
ax.set_title('Loss Evolution', fontsize=12)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<a name='8-3'></a>
### 8.3 - Continuous Action Spaces

Let's test our implementation on a continuous control task:

In [None]:
# Create Pendulum environment (continuous actions)
env_continuous = gym.make('Pendulum-v1')
state_dim_cont = env_continuous.observation_space.shape[0]
action_dim_cont = env_continuous.action_space.shape[0]

print(f"Environment: Pendulum-v1")
print(f"  State dimension: {state_dim_cont} (continuous)")
print(f"  Action dimension: {action_dim_cont} (continuous)")
print(f"  Action bounds: [{env_continuous.action_space.low}, {env_continuous.action_space.high}]")

# Create continuous action actor-critic agent
ac_continuous = ActorCriticAgent(
    state_dim=state_dim_cont,
    action_dim=action_dim_cont,
    continuous=True,
    actor_lr=1e-4,
    critic_lr=1e-3,
    gamma=0.99,
    entropy_coef=0.001  # Lower for continuous
)

print("\nTraining continuous control agent...")
history_continuous = train_agent(ac_continuous, env_continuous, 
                                n_episodes=200, max_steps=200,
                                agent_name="Continuous AC")

print(f"Final avg reward (Pendulum): {np.mean(history_continuous['episode_rewards'][-50:]):.2f}")

<a name='9'></a>
## 9 - Conclusions and Key Takeaways

### What You've Learned

1. **Policy Gradient Theorem** is the theoretical foundation for optimizing policies directly
   - Enables handling of continuous action spaces naturally
   - Provides unbiased gradient estimates

2. **REINFORCE** is the simplest policy gradient algorithm
   - Uses full episode returns as targets
   - Suffers from high variance; with suitably decaying step sizes it converges to a local optimum
   - A baseline (such as a value function) substantially reduces variance

3. **Actor-Critic** methods significantly improve upon REINFORCE
   - Actor learns the policy, Critic estimates value function
   - TD bootstrapping in the critic reduces variance at the cost of some bias
   - Enables online learning within episodes

4. **GAE** provides elegant bias-variance tradeoff
   - Interpolates between TD and Monte Carlo
   - Single $\lambda$ parameter controls the tradeoff
   - $\lambda = 0.95$ is a common default that works well on many tasks

5. **Practical considerations**
   - Normalize advantages for stable training
   - Clip gradients to prevent instability
   - Entropy regularization encourages exploration
   - GAE is often preferable to raw one-step TD or Monte Carlo advantages
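
Gradient clipping is mentioned above but not used in the notebook's training code. A minimal sketch with PyTorch's `torch.nn.utils.clip_grad_norm_` is shown below; the stand-in model, the deliberately inflated loss, and the `max_norm` value of 0.5 are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                                   # stand-in for an actor or critic
loss = (model(torch.randn(8, 4)) * 100.0).pow(2).mean()   # deliberately large loss
loss.backward()

max_norm = 0.5  # assumed threshold
# Rescales all gradients in place and returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(float(norm_before) > max_norm, float(norm_after) <= max_norm + 1e-4)
```

In a training step, the call would sit between `loss.backward()` and `optimizer.step()`.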

### Next Steps

To extend your knowledge of policy gradient methods:

1. **PPO (Proximal Policy Optimization)**
   - Adds clipping to limit policy updates
   - Strong performance and stability across many benchmarks
   - Recommended for practical applications

2. **TRPO (Trust Region Policy Optimization)**
   - Constrains updates using KL divergence
   - Stronger theoretical guarantees
   - More complex to implement

3. **A3C (Asynchronous Advantage Actor-Critic)**
   - Parallel training with multiple workers
   - No need for replay buffer
   - Excellent for distributed systems

4. **SAC (Soft Actor-Critic)**
   - Off-policy learning
   - Entropy regularization in reward
   - Strong exploration driven by its maximum-entropy objective
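
As a preview of the clipping idea mentioned for PPO, the sketch below evaluates the clipped surrogate term for a single sample, where `ratio` stands for $\pi_\theta(a|s) / \pi_{\theta_{old}}(a|s)$. The function name and the clip range of 0.2 are assumptions for illustration; a full PPO implementation involves considerably more than this one expression.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

print(ppo_clipped_term(1.5, 1.0))    # 1.2: gain from raising a good action's probability is capped
print(ppo_clipped_term(0.5, -1.0))   # -0.8: gain from lowering a bad action's probability is capped
print(ppo_clipped_term(1.1, 1.0))    # 1.1: inside the clip range, nothing changes
```

Once the ratio leaves the clip range in the direction that would improve the objective, the term becomes constant and contributes no gradient, which limits how far a single update can move the policy.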

In [None]:
# Summary statistics
print("\n" + "="*70)
print("SUMMARY STATISTICS")
print("="*70)

ac_reward = np.mean(history_ac['episode_rewards'][-100:])
ac_reward_std = np.std(history_ac['episode_rewards'][-100:])

print(f"\nActor-Critic (CartPole):")
print(f"  Average reward (last 100 episodes): {ac_reward:.2f} ± {ac_reward_std:.2f}")
print(f"  Final actor loss: {np.mean(history_ac['actor_losses'][-10:]):.4f}")
print(f"  Final critic loss: {np.mean(history_ac['critic_losses'][-10:]):.4f}")

cont_reward = np.mean(history_continuous['episode_rewards'][-50:])
print(f"\nContinuous Control (Pendulum):")
print(f"  Average reward (last 50 episodes): {cont_reward:.2f}")

print("\n" + "="*70)
print("Congratulations! You've completed the Policy Gradient tutorial.")
print("="*70)