# Actor-Critic - Interactive Exercise

Welcome! In this notebook, you will implement **Actor-Critic**, a policy gradient method that combines a learned policy (as in REINFORCE) with a learned value function (as in value-based methods).

## What is Actor-Critic?

Actor-Critic methods use two neural networks:
- **Actor**: Learns the policy π(a|s) - what action to take
- **Critic**: Learns the value function V(s) - how good is the current state

The critic helps reduce variance in the actor's gradient updates by providing a **baseline**.
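
To see why a baseline helps, here is a minimal sketch on a hypothetical two-action problem. The `probs` and `rewards` values are made up for illustration and are not taken from CartPole. The sketch computes the exact mean and variance of the score-function gradient sample for one softmax preference, first without a baseline and then with a baseline equal to the expected reward, which is the role V(s) plays for the critic.

```python
def grad_samples(probs, rewards, baseline):
    """Gradient sample for the preference of action 0, one entry per action.

    For a softmax policy, d log pi(a) / d theta_0 = 1[a == 0] - pi(0).
    """
    return [((1.0 if a == 0 else 0.0) - probs[0]) * (rewards[a] - baseline)
            for a in range(len(probs))]


def mean_and_variance(samples, probs):
    """Exact mean and variance of the samples under the policy probabilities."""
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


probs = [0.5, 0.5]       # hypothetical policy over two actions
rewards = [10.0, 11.0]   # hypothetical reward for each action
expected_reward = sum(p * r for p, r in zip(probs, rewards))  # 10.5

no_baseline = mean_and_variance(grad_samples(probs, rewards, 0.0), probs)
with_baseline = mean_and_variance(grad_samples(probs, rewards, expected_reward), probs)
print(no_baseline)    # (-0.25, 27.5625)
print(with_baseline)  # (-0.25, 0.0)
```

In this toy case the mean gradient is unchanged while the variance drops to zero. With a learned, imperfect critic the reduction is typically smaller, but the mechanism is the same.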

## Key Differences from REINFORCE

| Aspect | REINFORCE | Actor-Critic |
|--------|-----------|---------------|
| Learning | Monte Carlo (full episodes) | Temporal Difference (step-by-step) |
| Baseline | Fixed or moving average | Learned value function V(s) |
| Variance | High variance | Lower variance |
| Update Frequency | End of episode | Every step (online) |
| Sample Efficiency | Lower | Higher |
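
The "Learning" and "Update Frequency" rows can be made concrete with a small sketch. The helper names `discounted_returns` and `td_target` are illustrative and are not part of the graded code. The Monte Carlo return needs every reward up to the end of the episode, while the TD target needs only one reward and the critic's estimate of the next state.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo returns G_t for a finished episode (computed backwards)."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def td_target(reward, next_value, gamma):
    """One-step TD target r + gamma * V(s'), available after a single step."""
    return reward + gamma * next_value


# REINFORCE-style: wait for the whole episode, then compute all returns.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]

# Actor-Critic-style: one reward plus a bootstrapped estimate V(s') = 2.0.
print(td_target(1.0, next_value=2.0, gamma=0.5))  # 2.0
```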

## Learning Objectives

By the end of this notebook, you will:
- Understand the Actor-Critic architecture
- Implement both Actor and Critic networks
- Compute advantages using TD error
- Combine policy and value losses
- Compare Actor-Critic with REINFORCE

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import gymnasium as gym
import matplotlib.pyplot as plt
from actor_critic_tests import *

In [None]:
# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## The Environment: CartPole

We'll use CartPole-v1, where the goal is to balance a pole on a moving cart.

- **State**: [cart position, cart velocity, pole angle, pole angular velocity]
- **Actions**: 0 (push left), 1 (push right)
- **Reward**: +1 for each timestep the pole stays upright
- **Success**: Average reward of at least 475 over 100 consecutive episodes

In [None]:
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print(f"State dimension: {state_dim}")
print(f"Action dimension: {action_dim}")

## Exercise 1: Actor Network

The **Actor** network outputs action probabilities. It's similar to REINFORCE's PolicyNetwork.

**Architecture**:
```
Input (state) → FC1 (128) → ReLU → FC2 (action_dim) → Softmax → Action Probabilities
```
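
The final softmax step is what turns the raw outputs of FC2 into a probability distribution. As a minimal sketch in plain Python (not the PyTorch call you will use in the exercise), the hypothetical `softmax` helper below subtracts the maximum before exponentiating, a common trick to avoid overflow.

```python
import math


def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    m = max(logits)  # subtracting the max leaves the result unchanged
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 0.0])        # hypothetical FC2 outputs for two actions
print(probs, sum(probs))

# Large scores do not overflow thanks to the max subtraction.
print(softmax([1000.0, 1000.0]))   # [0.5, 0.5]
```

The larger score receives the larger probability, and equal scores give a uniform distribution.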

**Task**: Implement the Actor network.

In [None]:
# GRADED FUNCTION: ActorNetwork

class ActorNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        """
        Actor network that outputs action probabilities.
        
        Arguments:
        state_dim -- dimension of state space
        action_dim -- dimension of action space
        hidden_dim -- number of hidden units
        """
        super(ActorNetwork, self).__init__()
        
        # (approx. 2 lines)
        # Define two fully connected layers:
        # fc1: state_dim -> hidden_dim
        # fc2: hidden_dim -> action_dim
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    def forward(self, state):
        """
        Forward pass to get action probabilities.
        
        Arguments:
        state -- state tensor
        
        Returns:
        action_probs -- probability distribution over actions
        """
        # (approx. 3 lines)
        # 1. Pass through fc1 and apply ReLU
        # 2. Pass through fc2
        # 3. Apply softmax to get probabilities
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
        
        return action_probs

In [None]:
# Test your implementation
actor_network_test(ActorNetwork)

## Exercise 2: Critic Network

The **Critic** network estimates the value function V(s). It helps the actor by providing a baseline.

**Architecture**:
```
Input (state) → FC1 (128) → ReLU → FC2 (1) → State Value
```

Note: Output is a single value (not a probability distribution).
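
To make the data flow concrete, here is a minimal sketch of the same two-layer forward pass in plain Python, using tiny hand-picked (hypothetical) weights instead of `nn.Linear` layers. Taking element `[0]` of the last layer's output plays the role of the squeeze step: it turns a one-element list into a single number.

```python
def relu(x):
    return max(0.0, x)


def linear(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def critic_forward(state, w1, b1, w2, b2):
    """FC1 -> ReLU -> FC2, returning a single scalar state value."""
    hidden = [relu(h) for h in linear(state, w1, b1)]
    return linear(hidden, w2, b2)[0]


# Hypothetical weights: 2 inputs, 2 hidden units, 1 output.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[2.0, 3.0]], [0.5]

value = critic_forward([1.0, -1.0], w1, b1, w2, b2)
print(value)  # 2.5
```

The second hidden unit is zeroed by ReLU, so only the first contributes: 2.0 * 1.0 + 0.5 = 2.5.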

**Task**: Implement the Critic network.

In [None]:
# GRADED FUNCTION: CriticNetwork

class CriticNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        """
        Critic network that estimates state value V(s).
        
        Arguments:
        state_dim -- dimension of state space
        hidden_dim -- number of hidden units
        """
        super(CriticNetwork, self).__init__()
        
        # (approx. 2 lines)
        # Define two fully connected layers:
        # fc1: state_dim -> hidden_dim
        # fc2: hidden_dim -> 1 (single value output)
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    def forward(self, state):
        """
        Forward pass to get state value.
        
        Arguments:
        state -- state tensor
        
        Returns:
        value -- estimated value of the state
        """
        # (approx. 3 lines)
        # 1. Pass through fc1 and apply ReLU
        # 2. Pass through fc2 to get value
        # 3. Squeeze to remove extra dimensions
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
        
        return value

In [None]:
# Test your implementation
critic_network_test(CriticNetwork)

## Exercise 3: Select Action

Similar to REINFORCE, we sample actions from the policy distribution.
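
The idea of "sample an action, then record its log probability" can be sketched in plain Python. The helper `sample_action` below is illustrative and is not the graded `select_action`. In the exercise itself, `torch.distributions.Categorical` provides the sampling and log-probability steps on tensors.

```python
import math
import random


def sample_action(probs, rng):
    """Draw an action index according to `probs` and return its log probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    log_prob = math.log(probs[action])
    return action, log_prob


rng = random.Random(0)  # seeded generator, for reproducibility
action, log_prob = sample_action([0.2, 0.8], rng)
print(action, log_prob)
```

The log probability is always less than or equal to zero, and it is the quantity the actor loss later scales by the advantage.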

**Task**: Implement action selection with log probability tracking.

In [None]:
# GRADED FUNCTION: select_action

def select_action(actor, state):
    """
    Select action from policy and compute log probability.
    
    Arguments:
    actor -- Actor network
    state -- current state (numpy array)
    
    Returns:
    action -- selected action (int)
    log_prob -- log probability of the action
    """
    # (approx. 5-6 lines)
    # 1. Convert state to tensor
    # 2. Get action probabilities from actor
    # 3. Create categorical distribution
    # 4. Sample action
    # 5. Get log probability
    # 6. Return action as int and log_prob
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return action, log_prob

In [None]:
# Test your implementation
select_action_test(select_action, ActorNetwork)

## Exercise 4: Compute Actor-Critic Loss

The Actor-Critic algorithm uses two losses:

**Actor Loss** (Policy Gradient with advantage):
$$L_{actor} = -\log \pi(a|s) \cdot A(s,a)$$

where the **advantage** is:
$$A(s,a) = r + \gamma V(s') - V(s)$$

This is the **TD error** δ!

**Critic Loss** (Mean Squared Error):
$$L_{critic} = [r + \gamma V(s') - V(s)]^2 = \delta^2$$
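
A worked example with plain numbers may help before moving to tensors. The helpers `td_error` and `ac_losses` are illustrative names, and the inputs are made up. Because these are plain floats, there is no gradient tracking here, so the detach step from the exercise has no counterpart in this sketch.

```python
def td_error(reward, value, next_value, done, gamma=0.99):
    """Return delta = target - V(s), with no bootstrapping at terminal states."""
    target = reward if done else reward + gamma * next_value
    return target - value


def ac_losses(log_prob, delta):
    """Per-step actor and critic losses for a given TD error (plain floats)."""
    actor_loss = -log_prob * delta
    critic_loss = delta ** 2
    return actor_loss, critic_loss


# Non-terminal step: target = 1 + 0.5 * 8 = 5, so delta = 5 - 4 = 1
delta = td_error(reward=1.0, value=4.0, next_value=8.0, done=False, gamma=0.5)
print(delta, ac_losses(log_prob=-0.5, delta=delta))  # 1.0 (0.5, 1.0)

# Terminal step: V(s') is ignored, so delta = 1 - 4 = -3
delta_terminal = td_error(reward=1.0, value=4.0, next_value=8.0, done=True, gamma=0.5)
print(delta_terminal, ac_losses(log_prob=-0.5, delta=delta_terminal))  # -3.0 (-1.5, 9.0)
```

A positive δ means the step went better than the critic expected, so minimizing the actor loss raises the log probability of that action. A negative δ lowers it.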

**Task**: Implement both losses.

In [None]:
# GRADED FUNCTION: compute_ac_loss

def compute_ac_loss(log_prob, value, next_value, reward, done, gamma=0.99):
    """
    Compute Actor-Critic loss.
    
    Arguments:
    log_prob -- log probability of action taken
    value -- V(s) from critic
    next_value -- V(s') from critic
    reward -- reward received
    done -- whether episode ended
    gamma -- discount factor
    
    Returns:
    actor_loss -- loss for actor
    critic_loss -- loss for critic
    """
    # (approx. 7-8 lines)
    # 1. Compute TD target (treat it as a constant, e.g. use next_value.detach()):
    #    - If done: target = reward
    #    - Else: target = reward + gamma * next_value
    # 2. Compute advantage (TD error): delta = target - value
    # 3. Detach advantage for actor (so the actor loss does not backprop into the critic)
    # 4. Compute actor loss: -log_prob * advantage
    # 5. Compute critic loss: delta^2 (or use F.mse_loss)
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return actor_loss, critic_loss

In [None]:
# Test your implementation
compute_ac_loss_test(compute_ac_loss)

## Exercise 5: Train Actor-Critic

Now let's combine everything into the training loop!

**Algorithm** (per step):
1. Select action using actor
2. Take action, get reward and next state
3. Compute V(s) and V(s') using critic
4. Compute actor and critic losses
5. Update both networks

**Key difference from REINFORCE**: We update **every step**, not at the end of episodes!
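
To show the per-step structure without neural networks, here is a minimal tabular sketch on a hypothetical 4-state corridor, not CartPole. The environment, the function name `run_tabular_actor_critic`, and all hyperparameters are assumptions made for illustration. The actor is a table of action preferences and the critic is a table of state values, and both are updated after every step using the same TD error.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_tabular_actor_critic(n_episodes=200, gamma=0.99, actor_lr=0.1,
                             critic_lr=0.1, max_steps=50, seed=0):
    """One-step actor-critic on a toy corridor with states 0..3.

    State 3 is terminal. Action 0 moves left, action 1 moves right.
    Every step costs -1, so shorter episodes earn more reward.
    """
    rng = random.Random(seed)
    n_states, goal = 4, 3
    theta = [[0.0, 0.0] for _ in range(n_states)]  # actor: action preferences
    values = [0.0] * n_states                      # critic: V(s)
    episode_rewards = []

    for episode in range(n_episodes):
        state, total_reward = 0, 0.0
        for step in range(max_steps):
            # 1. Select action using the actor
            probs = softmax(theta[state])
            action = 1 if rng.random() < probs[1] else 0

            # 2. Take action, get reward and next state
            next_state = min(state + 1, goal) if action == 1 else max(state - 1, 0)
            reward = -1.0
            done = next_state == goal

            # 3-4. TD error from V(s) and V(s'), no bootstrapping when done
            target = reward if done else reward + gamma * values[next_state]
            delta = target - values[state]

            # 5. Update both tables immediately (online)
            values[state] += critic_lr * delta
            for b in range(2):
                # gradient of log softmax w.r.t. preference b
                grad_log_pi = (1.0 if b == action else 0.0) - probs[b]
                theta[state][b] += actor_lr * delta * grad_log_pi

            total_reward += reward
            state = next_state
            if done:
                break
        episode_rewards.append(total_reward)
    return episode_rewards, theta, values


rewards, theta, values = run_tabular_actor_critic()
print(sum(rewards[:20]) / 20, sum(rewards[-20:]) / 20)
```

The graded function follows the same loop shape, with the table updates replaced by optimizer steps on the actor and critic losses.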

**Task**: Implement the training loop.

In [None]:
# GRADED FUNCTION: train_actor_critic

def train_actor_critic(env, actor, critic, actor_optimizer, critic_optimizer, 
                       n_episodes=500, gamma=0.99, max_steps=500):
    """
    Train Actor-Critic on the environment.
    
    Arguments:
    env -- Gym environment
    actor -- Actor network
    critic -- Critic network
    actor_optimizer -- optimizer for actor
    critic_optimizer -- optimizer for critic
    n_episodes -- number of episodes to train
    gamma -- discount factor
    max_steps -- max steps per episode
    
    Returns:
    episode_rewards -- list of total rewards per episode
    """
    episode_rewards = []
    
    # (approx. 25-30 lines)
    # For each episode:
    #   1. Reset environment, get initial state: state, _ = env.reset()
    #   2. total_reward = 0
    #   3. For each step (up to max_steps):
    #      a. Select action using select_action()
    #      b. Take action: next_state, reward, done, truncated, _ = env.step(action)
    #         (done is Gymnasium's `terminated` flag; truncated marks a time-limit cutoff)
    #      c. Get value estimates: value = critic(state), next_value = critic(next_state)
    #         (convert both states to float tensors first)
    #      d. Compute losses: compute_ac_loss(...)
    #      e. Update actor:
    #         - actor_optimizer.zero_grad()
    #         - actor_loss.backward()
    #         - actor_optimizer.step()
    #      f. Update critic:
    #         - critic_optimizer.zero_grad()
    #         - critic_loss.backward()
    #         - critic_optimizer.step()
    #      g. Update state and total_reward
    #      h. If done or truncated: break
    #   4. Append total_reward to episode_rewards
    #   5. Print progress every 50 episodes
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return episode_rewards

In [None]:
# Test your implementation
train_actor_critic_test(train_actor_critic, ActorNetwork, CriticNetwork, select_action)

## Full Training Run

Let's train Actor-Critic on CartPole and visualize the results!

In [None]:
# Initialize networks and optimizers
actor = ActorNetwork(state_dim, action_dim)
critic = CriticNetwork(state_dim)
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)

# Train
episode_rewards = train_actor_critic(env, actor, critic, actor_optimizer, critic_optimizer, n_episodes=500)

# Plot results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(episode_rewards, alpha=0.6)
plt.plot(np.convolve(episode_rewards, np.ones(50)/50, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Actor-Critic Training Progress')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
window = 100
moving_avg = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
plt.plot(moving_avg)
plt.axhline(y=475, color='r', linestyle='--', label='Solved threshold (475)')
plt.xlabel('Episode')
plt.ylabel(f'Average Reward (last {window} episodes)')
plt.title('Moving Average')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Check if solved
if len(moving_avg) > 0 and moving_avg[-1] >= 475:
    print(f"\n🎉 Environment solved! Final average reward: {moving_avg[-1]:.2f}")
else:
    print(f"\n📊 Training completed. Final average reward: {moving_avg[-1]:.2f}")

## Comparison: Actor-Critic vs REINFORCE

**Advantages of Actor-Critic**:
- ✅ **Lower variance**: The critic provides a state-dependent baseline, which is more informative than a simple average
- ✅ **Online learning**: Updates every step (no need to wait for the episode to end)
- ✅ **Faster convergence**: More frequent updates often lead to faster learning
- ✅ **Works for continuing tasks**: Doesn't require episodic structure

**Disadvantages**:
- ❌ **More complex**: Two networks instead of one
- ❌ **Bias-variance tradeoff**: Bootstrapping (using V(s')) introduces bias
- ❌ **Hyperparameter sensitivity**: Need to balance actor and critic learning rates

**When to use**:
- Use **Actor-Critic** for most practical applications (better sample efficiency)
- Use **REINFORCE** when you want simplicity or unbiased estimates

## Congratulations!

You've successfully implemented Actor-Critic! You now understand:
- ✅ The Actor-Critic architecture (two networks working together)
- ✅ How the Critic reduces variance by providing a learned baseline
- ✅ TD error as an estimate of the advantage function
- ✅ Online learning with step-by-step updates
- ✅ Differences from REINFORCE

**Next Steps**: 
- Try **Advantage Actor-Critic (A2C)** with n-step returns
- Explore **A3C** (Asynchronous Actor-Critic) for parallel training
- Learn **PPO** (Proximal Policy Optimization) for more stable training!