In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability, report versions, and set random seeds

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# Policy Gradient Foundations: From Preferences to Gradient Ascent

*Part 1 of the Vizuara series on Policy Gradient Methods*
*Estimated time: 45 minutes*

## 1. Why Does This Matter?

Policy gradient methods underpin much of modern reinforcement learning. RLHF (Reinforcement Learning from Human Feedback), PPO (Proximal Policy Optimization), and the fine-tuning of ChatGPT all build on policy gradients.

But here is the key insight: instead of learning a value function and deriving a policy from it (like Q-learning), policy gradient methods **directly optimize the policy itself**. This is a paradigm shift. It is like the difference between memorizing an answer key versus learning how to solve problems.

By the end of this notebook, you will:
- Build a softmax policy from scratch
- Implement the performance measure $J(\theta)$
- Derive and code the policy gradient theorem step by step
- See gradient ascent move a policy toward higher returns

Let us begin.

## 2. Building Intuition

Let us think about a simple problem. You are learning to throw darts at a dartboard. You could try to memorize the value of every hand position (this is what Q-learning does). Or you could directly adjust your throwing motion based on where the darts land.

Policy gradient methods take the second approach. Instead of estimating values for every state-action pair, we directly parameterize a policy — a function that outputs the probability of each action — and then use gradient ascent to make it better.

Think of it this way: imagine a mountain landscape where the height at each point represents how good your policy is. Your policy parameters $\theta$ determine where you stand on this landscape. Gradient ascent tells you which direction to walk to go uphill.
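
To make the hill-climbing picture concrete, here is a minimal sketch of gradient ascent on a made-up one-dimensional landscape. The function `toy_landscape`, its peak at $\theta = 3$, and the step size are illustrative assumptions and not part of any RL problem.

```python
def toy_landscape(theta):
    """Hypothetical 'height' of the landscape; the peak is at theta = 3."""
    return -(theta - 3.0) ** 2

def toy_landscape_grad(theta):
    """Analytic slope of toy_landscape at theta."""
    return -2.0 * (theta - 3.0)

def gradient_ascent(theta, step_size=0.1, n_steps=50):
    """Repeatedly step uphill: theta <- theta + step_size * slope."""
    history = [theta]
    for _ in range(n_steps):
        theta = theta + step_size * toy_landscape_grad(theta)
        history.append(theta)
    return history

path = gradient_ascent(theta=0.0)
print(f"start: theta={path[0]:.3f}, height={toy_landscape(path[0]):.3f}")
print(f"end:   theta={path[-1]:.3f}, height={toy_landscape(path[-1]):.3f}")
```

Each step moves $\theta$ in the direction of the slope, so the height never decreases for this step size. Policy gradient methods do the same thing, except that the slope has to be estimated from sampled trajectories.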

### Think About This

If you have a continuous action space (like robot joint angles), why would a lookup table approach fail? How many entries would you need for a robot arm with 7 joints, each with 360 possible angles?

## 3. The Mathematics

### 3.1 Policy Parameterization

We write our policy as $\pi(a|s, \theta)$ — the probability of action $a$ given state $s$ and parameters $\theta$.

For discrete actions, we use the softmax function to convert raw preferences into probabilities:

$$\pi(a|s, \theta) = \frac{e^{h(s, a, \theta)}}{\sum_{a'} e^{h(s, a', \theta)}}$$

This equation says: compute a preference score $h$ for each action, exponentiate them, and normalize. Actions with higher preferences get higher probabilities, but every action retains some probability — this is how the agent explores.

Let us plug in simple numbers. Suppose we have 3 actions with preferences $h(a_1) = 2.0$, $h(a_2) = 1.0$, $h(a_3) = 0.5$:

$$e^{2.0} = 7.39, \quad e^{1.0} = 2.72, \quad e^{0.5} = 1.65$$
$$\text{sum} = 7.39 + 2.72 + 1.65 = 11.76$$
$$\pi(a_1) = 0.63, \quad \pi(a_2) = 0.23, \quad \pi(a_3) = 0.14$$

Action $a_1$ gets 63% probability because it has the highest preference. This is exactly what we want.

### 3.2 The Performance Measure

What are we optimizing? The expected total return starting from the initial state:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$$

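This equation says: $J(\theta)$ is the average discounted return $G$ over trajectories $\tau$ generated by policy $\pi_\theta$, where $\gamma \in [0, 1]$ is the discount factor and $r_t$ is the reward at step $t$. In practice we estimate it by sampling many trajectories and averaging their returns. We want to maximize this.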
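
A minimal sketch of that Monte Carlo average, assuming we already have the reward sequences of a few sampled trajectories. The helper names `discounted_return` and `estimate_performance` and the reward values are illustrative, not part of the notebook's later code.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """G(tau) = sum_t gamma^t * r_t for one trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards, dtype=float)))

def estimate_performance(reward_sequences, gamma=0.99):
    """Monte Carlo estimate of J(theta): average G over sampled trajectories."""
    return float(np.mean([discounted_return(seq, gamma) for seq in reward_sequences]))

# Hypothetical reward sequences from three sampled trajectories
sampled_rewards = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0],
    [1.0],
]
for i, reward_seq in enumerate(sampled_rewards):
    print(f"G(tau_{i}) = {discounted_return(reward_seq, gamma=0.9):.3f}")
print(f"J estimate = {estimate_performance(sampled_rewards, gamma=0.9):.3f}")
```

With only three trajectories the estimate is crude; more samples give a better approximation of the expectation.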

### 3.3 The Policy Gradient Theorem

The gradient of $J(\theta)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G(\tau)\right]$$

This is the most important equation in this notebook. It says: for each trajectory, multiply the gradient of the log-probability of each action by the total return. Trajectories with a large positive $G$ make their actions more likely; trajectories with a negative $G$ make their actions less likely.
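
A sketch of where this expression comes from, assuming a finite-horizon episodic task. The notation $p_\theta(\tau)$ for the trajectory probability and $p(s_{t+1}|s_t, a_t)$ for the environment dynamics is introduced here for the derivation only (for continuous states, replace the sum over $\tau$ with an integral):

$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta p_\theta(\tau)\, G(\tau) = \sum_\tau p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau)\, G(\tau)\right]$$

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

The middle step uses the identity $\nabla_\theta p = p\, \nabla_\theta \log p$. The last line holds because $p(s_0)$ and the dynamics do not depend on $\theta$, so their log-terms vanish under the gradient. Substituting the last line into the expectation recovers the formula above.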

## 4. Let's Build It — Component by Component

### 4.1 Softmax Policy

Let us build a softmax policy from scratch. We start by implementing the softmax function and a simple linear policy.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

def manual_softmax(preferences):
    """Compute softmax probabilities from raw preferences."""
    # Subtract max for numerical stability
    shifted = preferences - np.max(preferences)
    exp_prefs = np.exp(shifted)
    return exp_prefs / np.sum(exp_prefs)

# Example: 3 actions with preferences
preferences = np.array([2.0, 1.0, 0.5])
probabilities = manual_softmax(preferences)

print("Action preferences:", preferences)
print("Action probabilities:", probabilities)
print("Sum of probabilities:", np.sum(probabilities))  # Should be 1.0

In [None]:
# Visualization checkpoint: See how preferences become probabilities
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: preferences
axes[0].bar(['a1', 'a2', 'a3'], preferences, color=['#2563eb', '#3b82f6', '#93c5fd'])
axes[0].set_title('Action Preferences h(a, θ)', fontsize=14)
axes[0].set_ylabel('Preference Value')

# Right: probabilities after softmax
axes[1].bar(['a1', 'a2', 'a3'], probabilities, color=['#2563eb', '#3b82f6', '#93c5fd'])
axes[1].set_title('Action Probabilities π(a|s, θ)', fontsize=14)
axes[1].set_ylabel('Probability')

plt.tight_layout()
plt.show()
print("Notice how the highest preference gets the highest probability!")

### 4.2 Policy Neural Network

Now let us build the actual policy network that maps states to action probabilities:

In [None]:
class PolicyNetwork(nn.Module):
    """A neural network that parameterizes the policy."""
    def __init__(self, state_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions)
        )

    def forward(self, state):
        """Return action logits (preferences) for the given state."""
        return self.net(state)

    def get_action_probs(self, state):
        """Return action probabilities (softmax of logits)."""
        logits = self.forward(state)
        return F.softmax(logits, dim=-1)

    def sample_action(self, state):
        """Sample an action from the policy and return (action, log_prob)."""
        probs = self.get_action_probs(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob

# Create a policy for CartPole (4-dim state, 2 actions)
policy = PolicyNetwork(state_dim=4, n_actions=2)
print(f"Policy parameters: {sum(p.numel() for p in policy.parameters())}")

# Test with a dummy state
dummy_state = torch.randn(4)
probs = policy.get_action_probs(dummy_state)
print(f"State: {dummy_state.numpy().round(2)}")
print(f"Action probs: {probs.detach().numpy().round(4)}")

### 4.3 The Log-Derivative Trick

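The key insight behind policy gradients is the log-derivative trick: for any positive function $f$, $\nabla_\theta f(\theta) = f(\theta)\, \nabla_\theta \log f(\theta)$. It lets us rewrite the gradient of a probability as something we can estimate from samples. Let us verify it numerically: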

In [None]:
# The log-derivative trick: ∇P(x)/P(x) = ∇log(P(x))
# Or equivalently: ∇P(x) = P(x) * ∇log(P(x))

# Let us verify this with a concrete example
theta = torch.tensor([1.5], requires_grad=True)

# f(theta) = theta^2
f = theta ** 2
f.backward()
grad_f = theta.grad.item()

# Reset
theta = torch.tensor([1.5], requires_grad=True)
log_f = torch.log(theta ** 2)
log_f.backward()
grad_log_f = theta.grad.item()

print(f"f(θ) = θ² at θ = 1.5")
print(f"∇f(θ) = 2θ = {grad_f:.4f}")
print(f"∇log(f(θ)) = {grad_log_f:.4f}")
print(f"f(θ) × ∇log(f(θ)) = {1.5**2 * grad_log_f:.4f}")
print(f"These should match: ∇f = {grad_f:.4f} ≈ f × ∇log(f) = {1.5**2 * grad_log_f:.4f}")
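
In the policy gradient formula, the quantity this trick produces is $\nabla \log \pi$. For a softmax over logits it has a simple closed form: the gradient of $\log \pi(a)$ with respect to the logits is `one_hot(a) - π`. The sketch below checks that identity against autograd; the logits and the chosen action are arbitrary illustrative values, and the `demo_` names are not used elsewhere in the notebook.

```python
import torch
import torch.nn.functional as F

demo_logits = torch.tensor([2.0, 1.0, 0.5], requires_grad=True)  # illustrative preferences
demo_action = 1  # arbitrary action index

# Autograd: gradient of log pi(demo_action) with respect to the logits
demo_log_probs = F.log_softmax(demo_logits, dim=-1)
demo_log_probs[demo_action].backward()
autograd_score = demo_logits.grad.clone()

# Closed form for a softmax policy: one_hot(action) - pi
demo_probs = F.softmax(demo_logits.detach(), dim=-1)
analytic_score = F.one_hot(torch.tensor(demo_action), num_classes=3).float() - demo_probs

print("autograd:", autograd_score.numpy().round(4))
print("analytic:", analytic_score.numpy().round(4))
print("match:", torch.allclose(autograd_score, analytic_score, atol=1e-6))
```

The chosen action's logit gets a positive gradient and the others get negative ones, so scaling this vector by a positive return raises the probability of the chosen action.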

In [None]:
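# Visualization: How the gradient direction depends on the return
returns_range = np.linspace(-5, 5, 100)
grad_log_prob = 0.5  # Fixed illustrative value of ∇log π(a|s) along one parameter direction

# Each sample contributes ∇log π(a|s) * G(τ) to the gradient estimate
gradients = grad_log_prob * returns_range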

plt.figure(figsize=(10, 5))
plt.plot(returns_range, gradients, 'b-', linewidth=2)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.fill_between(returns_range, gradients, where=(returns_range > 0), alpha=0.2, color='green', label='Positive return: reinforce')
plt.fill_between(returns_range, gradients, where=(returns_range < 0), alpha=0.2, color='red', label='Negative return: penalize')
plt.xlabel('Return G(τ)', fontsize=12)
plt.ylabel('Gradient contribution', fontsize=12)
plt.title('How Returns Shape the Policy Gradient', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Positive returns push the gradient to increase action probability.")
print("Negative returns push the gradient to decrease action probability.")

## 5. Your Turn

### TODO: Implement the Policy Gradient Estimator

In [None]:
def estimate_policy_gradient(log_probs, returns):
    """
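    Estimate the policy gradient objective from a batch of trajectories.

    The policy gradient is ∇J(θ) ≈ (1/N) Σ_i Σ_t ∇log π(a_t|s_t) * G(τ_i).
    Here the log-probabilities are plain floats, so we compute the scalar
    (1/N) Σ_i [Σ_t log π(a_t|s_t)] * G(τ_i), whose gradient with respect
    to θ is the estimate above.

    Args:
        log_probs: List of lists. log_probs[i][t] = log π(a_t|s_t) for trajectory i, step t
        returns: List of floats. returns[i] = G(τ_i) for trajectory i

    Returns:
        Scalar surrogate value (its gradient is the policy gradient estimate)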
    """
    # ============ TODO ============
    # Step 1: For each trajectory, sum up all the log_probs
    # Step 2: Multiply each sum by the corresponding return
    # Step 3: Average over all trajectories
    # ==============================

    gradient = ???  # YOUR CODE HERE

    return gradient

In [None]:
# Verification
test_log_probs = [
    [-0.5, -1.2],   # Trajectory 1: two actions
    [-0.3, -0.8],   # Trajectory 2: two actions
]
test_returns = [3.0, -1.0]  # Traj 1 was good, Traj 2 was bad

result = estimate_policy_gradient(test_log_probs, test_returns)
expected = ((-0.5 + -1.2) * 3.0 + (-0.3 + -0.8) * (-1.0)) / 2.0
# = (-1.7 * 3.0 + -1.1 * -1.0) / 2.0 = (-5.1 + 1.1) / 2.0 = -2.0

assert abs(result - expected) < 1e-6, f"Expected {expected}, got {result}"
print("Correct! Your policy gradient estimator works.")

### TODO: Implement Temperature-Scaled Softmax

In [None]:
def temperature_softmax(preferences, temperature=1.0):
    """
    Compute softmax with temperature scaling.

    Higher temperature -> more uniform (more exploration)
    Lower temperature -> more peaked (more exploitation)
    Temperature = 1.0 -> standard softmax

    Args:
        preferences: numpy array of preference values
        temperature: float > 0, controls the sharpness

    Returns:
        numpy array of probabilities
    """
    # ============ TODO ============
    # Step 1: Divide preferences by temperature
    # Step 2: Apply softmax (subtract max for stability)
    # ==============================

    probabilities = ???  # YOUR CODE HERE

    return probabilities

In [None]:
# Verification
prefs = np.array([2.0, 1.0, 0.5])

low_temp = temperature_softmax(prefs, temperature=0.1)
mid_temp = temperature_softmax(prefs, temperature=1.0)
high_temp = temperature_softmax(prefs, temperature=10.0)

assert np.argmax(low_temp) == 0, "Low temp should heavily favor the best action"
assert abs(np.sum(mid_temp) - 1.0) < 1e-6, "Probabilities must sum to 1"
assert np.max(high_temp) - np.min(high_temp) < 0.1, "High temp should be nearly uniform"
print("Correct! Temperature scaling works as expected.")
print(f"Low temp (0.1):  {low_temp.round(4)} — almost deterministic")
print(f"Mid temp (1.0):  {mid_temp.round(4)} — standard softmax")
print(f"High temp (10.0): {high_temp.round(4)} — nearly uniform")

## 6. Putting It All Together

Let us combine everything into a simple demonstration. We will create a policy, compute gradients for sample trajectories, and show how gradient ascent works.

In [None]:
import gymnasium as gym

# Create CartPole environment
env = gym.make("CartPole-v1")
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Initialize policy
policy = PolicyNetwork(state_dim, n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)

def collect_episode(env, policy):
    """Collect one full episode using the current policy."""
    states, actions, rewards, log_probs = [], [], [], []
    state, _ = env.reset()
    done = False

    while not done:
        state_t = torch.as_tensor(state, dtype=torch.float32)
        action, log_prob = policy.sample_action(state_t)

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        log_probs.append(log_prob)

        state = next_state

    return states, actions, rewards, log_probs

def compute_returns(rewards, gamma=0.99):
    """Compute discounted returns working backwards from the end."""
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

# Collect one episode and examine it
states, actions, rewards, log_probs = collect_episode(env, policy)
returns = compute_returns(rewards)

print(f"Episode length: {len(rewards)} steps")
print(f"Total reward: {sum(rewards):.1f}")
print(f"First 5 returns: {[f'{r:.2f}' for r in returns[:5]]}")
print(f"Last 5 returns: {[f'{r:.2f}' for r in returns[-5:]]}")
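
Note that `compute_returns` produces one value per step, the reward-to-go $G_t$, and its first entry is the trajectory return $G(\tau)$ from Section 3.2. A small standalone cross-check of the backward recursion against the direct definition follows. `returns_backward` mirrors `compute_returns` (using append-then-reverse rather than `insert(0, ...)`), while `returns_direct` and `toy_rewards` are illustrative names.

```python
import numpy as np

def returns_backward(rewards, gamma):
    """Reward-to-go via the recursion G_t = r_t + gamma * G_{t+1}."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]

def returns_direct(rewards, gamma):
    """Reward-to-go from the definition G_t = sum_k gamma^k * r_{t+k}."""
    T = len(rewards)
    return [
        sum(gamma ** k * rewards[t + k] for k in range(T - t))
        for t in range(T)
    ]

toy_rewards = [1.0, 1.0, 1.0, 1.0]  # illustrative: +1 per step, as in CartPole
backward = returns_backward(toy_rewards, gamma=0.5)
direct = returns_direct(toy_rewards, gamma=0.5)
print("backward:", backward)
print("direct:  ", direct)
print("agree:", np.allclose(backward, direct))
```

The recursion needs one pass over the rewards, while the direct sum recomputes every tail, which is why the backward form is the usual choice.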

## 7. Training and Results

In [None]:
# Train with vanilla policy gradient for NUM_EPISODES episodes
GAMMA = 0.99
NUM_EPISODES = 300
reward_history = []

policy = PolicyNetwork(state_dim, n_actions)
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)

for episode in range(NUM_EPISODES):
    # Collect episode
    states, actions, rewards, log_probs = collect_episode(env, policy)
    returns = compute_returns(rewards, GAMMA)

    # Compute policy gradient loss
    returns_t = torch.tensor(returns, dtype=torch.float32)
    log_probs_t = torch.stack(log_probs)

    # Loss = -Σ_t log_prob_t * G_t, where G_t is the reward-to-go from step t
    # (negative because the optimizer minimizes while we want to maximize J)
    loss = -(log_probs_t * returns_t).sum()

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    episode_reward = sum(rewards)
    reward_history.append(episode_reward)

    if (episode + 1) % 50 == 0:
        avg = np.mean(reward_history[-50:])
        print(f"Episode {episode+1:4d} | Reward: {episode_reward:6.1f} | Avg(50): {avg:.1f}")

env.close()

In [None]:
# Training curves
fig, ax = plt.subplots(figsize=(10, 5))

ax.plot(reward_history, alpha=0.3, color='steelblue', label='Per-episode')
# Smoothed average
window = 20
if len(reward_history) >= window:
    smoothed = np.convolve(reward_history, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(reward_history)), smoothed, color='navy', linewidth=2, label=f'{window}-episode average')

ax.axhline(y=500, color='gray', linestyle='--', alpha=0.5, label='Max reward (500)')
ax.set_xlabel('Episode', fontsize=12)
ax.set_ylabel('Episode Reward', fontsize=12)
ax.set_title('Policy Gradient Training on CartPole', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("The 20-episode average should trend upward; single-episode rewards stay noisy.")

## 8. Final Output

In [None]:
# Visualize the trained policy's action probabilities across different states
states_to_test = [
    ("Cart centered, pole upright", [0.0, 0.0, 0.0, 0.0]),
    ("Cart left, pole tilting right", [-1.0, -0.5, 0.1, 0.5]),
    ("Cart right, pole tilting left", [1.0, 0.5, -0.1, -0.5]),
    ("Pole falling right fast", [0.0, 0.0, 0.2, 1.5]),
    ("Pole falling left fast", [0.0, 0.0, -0.2, -1.5]),
]

fig, axes = plt.subplots(1, len(states_to_test), figsize=(16, 4))

for idx, (desc, state) in enumerate(states_to_test):
    state_t = torch.tensor(state, dtype=torch.float32)
    probs = policy.get_action_probs(state_t).detach().numpy()

    colors = ['#ef4444', '#3b82f6']  # Red for left, Blue for right
    axes[idx].bar(['Left', 'Right'], probs, color=colors)
    axes[idx].set_title(desc, fontsize=9, wrap=True)
    axes[idx].set_ylim(0, 1)
    if idx == 0:
        axes[idx].set_ylabel('Probability')

plt.suptitle('Trained Policy: Action Probabilities for Different States', fontsize=14)
plt.tight_layout()
plt.show()
print("Congratulations! You have built a policy gradient agent from scratch!")
print("Check the plots: does the policy favor pushing right when the pole tilts right, and left when it tilts left?")

## 9. Reflection and Next Steps

### Reflection Questions
1. Why do we use the log-derivative trick instead of directly differentiating the trajectory probability? What makes direct differentiation difficult?
2. If the return for all trajectories is positive (e.g., rewards are always non-negative), what happens to the gradient estimate? Why might this be a problem?
3. How does the number of sampled trajectories affect the quality of the gradient estimate?
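
As a starting point for question 3, here is a small experiment sketch on a hypothetical two-armed bandit (one-step trajectories). The policy, the reward means, and the noise level are all made-up values chosen so that the exact gradient is known; the estimator uses the softmax score `one_hot(a) - π`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit (all values illustrative)
pi = np.array([0.5, 0.5])            # softmax policy with equal logits
mean_reward = np.array([1.0, 2.0])   # expected reward of each arm

def gradient_estimate(n_samples):
    """Score-function estimate of dJ/d(logits) from n_samples one-step trajectories."""
    actions = rng.choice(2, size=n_samples, p=pi)
    rewards = mean_reward[actions] + rng.normal(0.0, 1.0, size=n_samples)
    scores = np.eye(2)[actions] - pi  # grad of log softmax w.r.t. logits
    return (scores * rewards[:, None]).mean(axis=0)

# Exact gradient for this bandit: pi_a * (r_a - J), with J = sum_a pi_a * r_a
true_gradient = pi * (mean_reward - pi @ mean_reward)

for n in [10, 100, 1000]:
    estimates = np.array([gradient_estimate(n) for _ in range(500)])
    print(f"N={n:5d} | mean {estimates.mean(axis=0).round(3)} "
          f"| std {estimates.std(axis=0).round(3)} | true {true_gradient.round(3)}")
```

Compare how the spread of the estimates changes as `n` grows, and whether their mean stays close to `true_gradient`.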

### Optional Challenges
1. Modify the policy network to use 2 hidden layers instead of 1. Does it learn faster?
2. Experiment with different learning rates (0.001, 0.01, 0.1). Plot the training curves for each.
3. Replace CartPole with LunarLander-v3 (4 actions instead of 2). Does the same approach work?