# REINFORCE Algorithm (Policy Gradient)

Welcome to REINFORCE! REINFORCE is a classic **policy gradient** algorithm: it optimizes the policy directly, and in this notebook the policy is represented by a neural network. By the end of this notebook, you'll be able to:

* Understand the difference between value-based and policy-based methods
* Implement a policy network that outputs action probabilities
* Calculate the policy gradient using the REINFORCE algorithm
* Train an agent using gradient ascent on expected return

## Policy Gradient: A Paradigm Shift

**Value-Based (Q-Learning, DQN)**:
- Learn $Q(s,a)$
- Extract policy: $\pi(s) = \arg\max_a Q(s,a)$
- Indirect optimization

**Policy-Based (REINFORCE, PPO)**:
- Learn $\pi_\theta(a|s)$ directly
- Optimize policy parameters $\theta$
- Direct optimization
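
The contrast can be sketched in a few lines. This is only an illustration: the Q-values and action probabilities below are made-up numbers, not outputs of any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Q-values for one state with three actions (illustrative numbers).
q_values = np.array([1.0, 2.5, 0.3])

# Value-based: the policy is implicit, obtained by acting greedily on Q.
greedy_action = int(np.argmax(q_values))

# Policy-based: the model outputs action probabilities directly,
# and the agent samples an action from them.
action_probs = np.array([0.2, 0.7, 0.1])
sampled_action = int(rng.choice(len(action_probs), p=action_probs))

print(greedy_action)   # 1, the index of the largest Q-value
print(sampled_action)  # one of 0, 1, 2, drawn according to action_probs
```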

## REINFORCE: Key Idea

**Objective**: Maximize expected return
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$

**Policy Gradient Theorem**:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted return from time step $t$ onward.
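
A brief sketch of where this expression comes from, using the log-derivative trick. The symbol $p_\theta(\tau)$ for the probability of trajectory $\tau$ under $\pi_\theta$ is notation introduced here for illustration:

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\right]$$

The trajectory probability factorizes as $p_\theta(\tau) = p(s_0) \prod_{t} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$, and the environment dynamics do not depend on $\theta$, so

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t)$$

For the undiscounted return, replacing $R(\tau)$ with the return from step $t$ onward leaves the expectation unchanged, because an action cannot influence rewards received before it was taken. With a discount factor, weighting each term by the discounted return from that step is the common practical variant.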

**REINFORCE Update**:
$$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t$$

**Intuition**: Increase the probability of actions that were followed by high returns!

## Important Note

Please ensure:
1. No extra print statements
2. No extra code cells
3. Function parameters unchanged
4. No global variables

## Table of Contents
- [1 - Packages](#1)
- [2 - Policy Network](#2)
    - [Exercise 1 - PolicyNetwork](#ex-1)
- [3 - Action Sampling](#3)
    - [Exercise 2 - select_action](#ex-2)
- [4 - Policy Gradient Loss](#4)
    - [Exercise 3 - compute_policy_loss](#ex-3)
- [5 - Complete REINFORCE](#5)
    - [Exercise 4 - train_reinforce](#ex-4)
- [6 - Testing on CartPole](#6)

<a name='1'></a>
## 1 - Packages

In [None]:
import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
import matplotlib.pyplot as plt
from reinforce_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 6.0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

torch.manual_seed(42)
np.random.seed(42)

<a name='2'></a>
## 2 - Policy Network

The policy network $\pi_\theta(a|s)$ outputs a **probability distribution** over actions.

**Architecture for CartPole**:
```
Input: State (4 values)
Hidden: 128 neurons (ReLU)
Output: Action probabilities (2 actions, Softmax)
```

**Key difference from DQN**: Output is probabilities, not Q-values!
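
As a small illustration of the output layer only (not the graded network), softmax over the last dimension turns raw scores into one probability distribution per state. The logits below are made-up values.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw network outputs (logits) for a batch of two states.
logits = torch.tensor([[2.0, 0.0],
                       [-1.0, 1.0]])

# Softmax over the last dimension normalizes each row independently.
probs = F.softmax(logits, dim=-1)

print(probs.shape)        # torch.Size([2, 2])
print(probs.sum(dim=-1))  # each row sums to 1
```

Using `dim=-1` keeps the code correct whether the input is a single state or a batch.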

<a name='ex-1'></a>
### Exercise 1 - PolicyNetwork

Implement the policy network.

In [None]:
# GRADED FUNCTION: PolicyNetwork

class PolicyNetwork(nn.Module):
    """
    Policy network for REINFORCE.
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        
        # (approx. 3-4 lines)
        # Build network:
        # 1. fc1: Linear(state_dim, hidden_dim)
        # 2. fc2: Linear(hidden_dim, action_dim)
        # Note: No softmax layer - we'll apply it in forward()
        
        # YOUR CODE STARTS HERE
        
        
        
        # YOUR CODE ENDS HERE
    
    def forward(self, state):
        """
        Forward pass returns action probabilities.
        
        Arguments:
        state -- state tensor
        
        Returns:
        action_probs -- probability distribution over actions
        """
        # (approx. 3-4 lines)
        # 1. x = ReLU(fc1(state))
        # 2. x = fc2(x)
        # 3. action_probs = Softmax(x, dim=-1)
        # Hint: Use F.relu() and F.softmax()
        
        # YOUR CODE STARTS HERE
        
        
        
        # YOUR CODE ENDS HERE
        
        return action_probs

In [None]:
# Test your implementation
policy_net = PolicyNetwork(state_dim=4, action_dim=2)
print("Policy Network:")
print(policy_net)

test_state = torch.randn(1, 4)
action_probs = policy_net(test_state)
print(f"\nInput shape: {test_state.shape}")
print(f"Output shape: {action_probs.shape}")
print(f"Action probabilities: {action_probs.detach().numpy()}")
print(f"Sum of probabilities: {action_probs.sum().item():.4f} (should be 1.0)")

policy_network_test(PolicyNetwork)

<a name='3'></a>
## 3 - Action Sampling

Given action probabilities, we need to:
1. **Sample** an action from the distribution
2. **Compute** log probability $\log \pi_\theta(a|s)$ (needed for gradient)

**Using PyTorch Categorical**:
```python
dist = Categorical(action_probs)
action = dist.sample()  # Sample action
log_prob = dist.log_prob(action)  # Get log probability
```
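
The snippet above can be tried on its own. In this sketch, a hypothetical `logits` tensor stands in for the policy network's pre-softmax output. It shows that `log_prob` is the log of the probability assigned to the sampled action, and that gradients flow back to the parameters.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical logits standing in for a policy network's pre-softmax output.
logits = torch.tensor([0.5, -0.5], requires_grad=True)
action_probs = torch.softmax(logits, dim=-1)

dist = Categorical(action_probs)
action = dist.sample()            # 0-dim integer tensor
log_prob = dist.log_prob(action)  # log of the sampled action's probability

# Same value computed by hand from the probabilities.
manual_log_prob = torch.log(action_probs[action])
print(torch.isclose(log_prob, manual_log_prob))

# Backpropagating through log_prob reaches the logits.
log_prob.backward()
print(logits.grad)
```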

<a name='ex-2'></a>
### Exercise 2 - select_action

Sample action and get log probability.

In [None]:
# GRADED FUNCTION: select_action

def select_action(policy_net, state):
    """
    Select action from policy and return log probability.
    
    Arguments:
    policy_net -- PolicyNetwork
    state -- current state (numpy array)
    
    Returns:
    action -- sampled action (integer)
    log_prob -- log probability of action
    """
    # (approx. 5-7 lines)
    # 1. Convert state to tensor and add batch dimension
    # 2. Get action probabilities from policy_net
    # 3. Create Categorical distribution
    # 4. Sample action from distribution
    # 5. Get log probability
    # 6. Return action.item() and log_prob
    
    # Hint: Categorical is already imported
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return action, log_prob

In [None]:
# Test your implementation
policy_net = PolicyNetwork(4, 2)
test_state = np.array([0.1, 0.2, 0.3, 0.4])

action, log_prob = select_action(policy_net, test_state)
print(f"Sampled action: {action}")
print(f"Log probability: {log_prob.item():.4f}")
print(f"Action type: {type(action)}")
print(f"Log prob requires grad: {log_prob.requires_grad}")

select_action_test(select_action, PolicyNetwork)

<a name='4'></a>
## 4 - Policy Gradient Loss

REINFORCE estimates the policy gradient from sampled episodes:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) G_t\right]$$

**PyTorch implementation**:
```python
loss = -sum(lp * G for lp, G in zip(log_probs, returns))
```

**Why negative?** PyTorch optimizers minimize a loss, but we want to maximize expected return, so we minimize the negative of the objective.

**Baseline (optional)**: Subtract a baseline $b$ (for example, the mean return of the episode) from each return to reduce variance:
$$\text{loss} = -\sum_t \log\pi_\theta(a_t|s_t)\,(G_t - b)$$
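
To check that minimizing this loss pushes the policy in the intended direction, here is a minimal sketch. It assumes a state-independent softmax policy whose logits are a hypothetical parameter vector `theta`, and uses made-up actions and returns for a two-step episode.

```python
import torch

# Hypothetical policy: logits that ignore the state; both actions start at p = 0.5.
theta = torch.zeros(2, requires_grad=True)
actions = torch.tensor([0, 1])       # actions taken at t = 0 and t = 1
returns = torch.tensor([3.0, 1.0])   # G_0 and G_1 (illustrative numbers)

# Log probability of each action that was taken.
log_probs = torch.log_softmax(theta, dim=-1)[actions]

loss = -(log_probs * returns).sum()
loss.backward()

# For a softmax policy, grad log pi(a) = onehot(a) - probs, so
# d loss / d theta = -sum_t G_t * (onehot(a_t) - probs) = [-1, 1].
print(theta.grad)
```

A gradient descent step subtracts this gradient, which raises the logit of action 0 (the action followed by the larger return) and lowers the logit of action 1.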

<a name='ex-3'></a>
### Exercise 3 - compute_policy_loss

Compute the policy gradient loss.

In [None]:
# GRADED FUNCTION: compute_policy_loss

def compute_policy_loss(log_probs, returns):
    """
    Compute policy gradient loss.
    
    Arguments:
    log_probs -- list of log probabilities for each action
    returns -- list of returns for each timestep
    
    Returns:
    loss -- policy gradient loss (negative for gradient ascent)
    """
    # (approx. 5-7 lines)
    # 1. Convert returns to tensor
    # 2. Normalize returns (subtract mean, divide by std + eps)
    #    Subtracting the mean acts as a baseline for variance reduction;
    #    dividing by the std keeps the gradient scale similar across episodes
    # 3. Compute policy_loss = -sum(log_probs[t] * returns[t])
    # 4. Return loss
    
    # Hint: torch.stack(log_probs) to convert list to tensor
    # Hint: (returns - returns.mean()) / (returns.std() + 1e-8)
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return policy_loss

In [None]:
# Test your implementation
test_log_probs = [torch.tensor(-0.5, requires_grad=True), 
                  torch.tensor(-0.3, requires_grad=True),
                  torch.tensor(-0.7, requires_grad=True)]
test_returns = [1.0, 2.0, 0.5]  # plain list, matching the docstring

loss = compute_policy_loss(test_log_probs, test_returns)
print(f"Policy loss: {loss.item():.4f}")
print(f"Loss requires grad: {loss.requires_grad}")

compute_policy_loss_test(compute_policy_loss)

<a name='5'></a>
## 5 - Complete REINFORCE

**REINFORCE Algorithm**:
```
Initialize policy network π_θ
For each episode:
    Generate episode using π_θ
    For each timestep t:
        Calculate return G_t
    Compute loss = -Σ log π_θ(a_t|s_t) * G_t
    Update θ using gradient descent on loss
```
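
The "Calculate return $G_t$" step can be done in one backward pass over the rewards, using $G_t = r_t + \gamma G_{t+1}$. The helper name `discounted_returns` below is hypothetical and shown only as a sketch. It is not part of the graded interface.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] using G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one computed after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Iterating in reverse keeps the computation linear in the episode length instead of summing the remaining rewards again at every step.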

<a name='ex-4'></a>
### Exercise 4 - train_reinforce

Implement complete REINFORCE training.

In [None]:
# GRADED FUNCTION: train_reinforce

def train_reinforce(env, n_episodes=1000, lr=0.01, gamma=0.99):
    """
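    Train policy using REINFORCE.
    
    Arguments:
    env -- Gymnasium environment with a discrete action space
    n_episodes -- number of training episodes
    lr -- learning rate for the optimizer
    gamma -- discount factor used when computing returns
    
    Returns:
    policy_net -- trained policy network
    rewards_history -- list of total reward per episode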
    """
    # (approx. 25-30 lines)
    # 1. Initialize policy network and optimizer
    # 2. For each episode:
    #    a. Reset environment
    #    b. Generate episode:
    #       - For each step:
    #         * Select action using select_action
    #         * Store log_prob and reward
    #         * Take action in environment
    #    c. Calculate returns (discounted cumulative rewards)
    #    d. Compute loss using compute_policy_loss
    #    e. Backprop and update policy
    #    f. Store total reward
    # 3. Return policy_net and rewards_history
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return policy_net, rewards_history

<a name='6'></a>
## 6 - Testing on CartPole

In [None]:
# Train REINFORCE
env = gym.make('CartPole-v1')

print("Training REINFORCE on CartPole...\n")
policy_net, rewards = train_reinforce(env, n_episodes=1000, lr=0.01, gamma=0.99)

print("\nTraining completed!")
print(f"Average reward (last 100): {np.mean(rewards[-100:]):.2f}")

# Plot results
plt.figure(figsize=(10, 5))
plt.plot(rewards, alpha=0.3, label='Episode reward')

window = 20
if len(rewards) >= window:
    moving_avg = np.convolve(rewards, np.ones(window)/window, mode='valid')
    plt.plot(range(window-1, len(rewards)), moving_avg,
             label=f'Moving average ({window})', linewidth=2)

# CartPole-v1 caps episodes at 500 steps and uses a reward threshold of 475
plt.axhline(y=475, color='r', linestyle='--', label='Solved (475)', alpha=0.7)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('REINFORCE on CartPole')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

env.close()

## Congratulations!

You've implemented REINFORCE, the foundation of policy gradient methods! Here's what you learned:

✅ Policy networks that output action probabilities

✅ Sampling actions from learned distributions

✅ Policy gradient theorem and loss computation

✅ Complete REINFORCE training loop

### Key Takeaways:

1. **Direct policy optimization**: No Q-function needed!
2. **Stochastic policies**: Naturally handles exploration
3. **High variance**: Monte Carlo returns vary a lot between episodes → slower convergence
4. **Baseline trick**: Subtracting a baseline that does not depend on the action reduces variance without biasing the gradient
5. **On-policy**: Must collect new data after each update
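
The claim about the baseline can be checked numerically for a simple case. This sketch assumes a softmax policy over three actions in a single state, with made-up logits. It verifies that the expected score $\mathbb{E}_{a \sim \pi}[\nabla \log \pi(a)]$ is zero, so a constant $b$ multiplied into it adds nothing to the expected gradient.

```python
import numpy as np

# Hypothetical softmax policy over three actions in a single state.
logits = np.array([0.2, -1.0, 0.7])
probs = np.exp(logits) / np.exp(logits).sum()

# Row a holds the gradient of log pi(a) w.r.t. the logits: onehot(a) - probs.
grad_log_pi = np.eye(3) - probs

# Expected score: sum over actions of pi(a) * grad log pi(a).
expected_score = probs @ grad_log_pi

print(np.allclose(expected_score, 0.0))  # True
```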

### REINFORCE vs DQN:

| Aspect | REINFORCE | DQN |
|--------|-----------|-----|
| Type | Policy-based | Value-based |
| Output | π(a\|s) | Q(s,a) |
| Learning | Policy gradient | TD learning |
| Exploration | Stochastic policy | ε-greedy |
| Variance | High | Low |
| Continuous actions | Easy | Hard |

### Next Steps:

- Learn **Actor-Critic** (combines value and policy)
- Explore **PPO** (more stable policy gradients)
- Understand **advantage functions** A(s,a)
- Try **continuous action spaces** (Gaussian policies)