# Policy Gradients Basics

In this notebook, we’ll explore **policy gradient methods**, a cornerstone of modern reinforcement learning and the foundation of algorithms like **REINFORCE**, **Actor-Critic**, and **PPO**.

Unlike value-based methods such as Q-learning, **policy gradient methods** learn the *policy function* directly, i.e., a mapping from states to actions (or action probabilities), without explicitly estimating a value function.

## 🎯 1. What are Policy Gradient Methods?

A policy gradient method optimizes a **parameterized policy** $\pi_\theta(a|s)$ by maximizing the expected cumulative reward $R$ of the trajectories it generates:

$$ J(\theta) = \mathbb{E}_{\pi_\theta}[R] $$

We update the policy parameters in the direction of the gradient:

$$ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) $$

where $\alpha$ is the learning rate.

The key insight: we increase the probability of actions that lead to higher returns and decrease the probability of actions that lead to lower ones.
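To make the update rule concrete, here is a minimal sketch on a two-armed bandit, where the expectation can be computed exactly instead of sampled. It assumes a softmax policy over one logit per arm and uses the score-function form $\nabla_\theta J = \mathbb{E}[R \, \nabla_\theta \log \pi_\theta(a)]$. The helper names (`softmax`, `policy_gradient`, `train_bandit`) and the reward values are illustrative and not part of this notebook's later code.

```python
import math

def softmax(logits):
    """Convert logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient(logits, rewards):
    """Exact gradient of J = sum_a pi(a) * r(a), written as E[R * grad log pi(a)]."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            # Derivative of log pi(a) with respect to logit k for a softmax policy
            dlogp = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += p_a * r_a * dlogp
    return grad

def train_bandit(rewards, alpha=0.1, steps=200):
    """Gradient ascent: theta <- theta + alpha * grad J(theta)."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        grad = policy_gradient(logits, rewards)
        logits = [t + alpha * g for t, g in zip(logits, grad)]
    return softmax(logits)

# Arm 1 pays more than arm 0, so its probability should grow.
final_probs = train_bandit(rewards=[1.0, 2.0])
print(final_probs)
```

Starting from a uniform policy, the probability of the higher-paying arm rises with each step, which is the "key insight" above in its simplest form.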

## 🧩 2. The REINFORCE Algorithm

The **REINFORCE algorithm** is one of the simplest policy gradient methods. It uses Monte Carlo sampling to estimate gradients.

### Algorithm Steps
1. Run the policy $\pi_\theta$ in the environment and collect trajectories (episodes).
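2. Compute the discounted return $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$ for each time step $t$, where $\gamma$ is the discount factor.
3. Update the policy parameters, applying the following for each time step $t$ of the episode: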
   
$$ \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi_\theta(a_t|s_t) $$
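Step 2 can be sketched in plain Python as a single backward pass over the rewards, using the recursion $G_t = r_t + \gamma G_{t+1}$. The function name `discounted_returns` and the example rewards are illustrative assumptions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ..., G_T] for one episode's reward list."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already-computed G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # [1.75, 1.5, 1.0]
```

Walking backwards costs one pass over the episode, whereas summing the future rewards separately for every time step would be quadratic in the episode length.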

## ⚙️ 3. Implementation with Gym and PyTorch

In [ ]:
# Requires gym >= 0.26 (or gymnasium): reset() returns (obs, info), step() returns 5 values
import gym
import torch
import torch.nn as nn
import torch.optim as optim

# Define policy network
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, x):
        return self.fc(x)

# Create environment and model
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

policy = PolicyNetwork(state_dim, action_dim)
optimizer = optim.Adam(policy.parameters(), lr=0.01)

def select_action(state):
    state = torch.FloatTensor(state)
    probs = policy(state)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)
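
As a rough illustration of what `select_action` relies on, the following stdlib-only sketch samples an index from a list of probabilities and returns the log-probability of the sampled index. It is an assumption-laden stand-in for intuition, not how `torch.distributions.Categorical` is implemented. The name `sample_categorical` and the example probabilities are hypothetical.

```python
import math
import random

def sample_categorical(probs, rng):
    """Sample an index from a categorical distribution; return (index, log_prob)."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index, math.log(p)
    # Guard against floating-point round-off in the cumulative sum
    last = len(probs) - 1
    return last, math.log(probs[last])

rng = random.Random(0)
action, log_prob = sample_categorical([0.2, 0.8], rng)
print(action, log_prob)
```

REINFORCE needs the log-probability because the gradient of that quantity, scaled by the return, is what the loss backpropagates through the network.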

In [ ]:
# Train using REINFORCE
for episode in range(500):
    log_probs = []
    rewards = []
    state = env.reset()[0]
    done = False
    
    while not done:
        action, log_prob = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # stop on failure or when the time limit is hit
        log_probs.append(log_prob)
        rewards.append(reward)
        state = next_state
    
    # Compute discounted returns
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + 0.99 * G
        returns.insert(0, G)
    returns = torch.FloatTensor(returns)
    
    # Normalize for stability
    returns = (returns - returns.mean()) / (returns.std() + 1e-9)
    
    # Compute loss and update
    # Negative sign: the optimizer minimizes, but we want gradient ascent on J
    loss = -(torch.stack(log_probs) * returns).sum()
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if episode % 50 == 0:
        print(f"Episode {episode}, Total Reward: {sum(rewards)}")
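
The "normalize for stability" step in the loop above can be examined in isolation. This is a minimal stdlib sketch, assuming the sample standard deviation (which is also PyTorch's default for `std()`). The function name `normalize_returns` and the input values are illustrative.

```python
import statistics

def normalize_returns(returns, eps=1e-9):
    """Shift returns to zero mean and scale them to roughly unit variance."""
    mean = statistics.fmean(returns)
    # The sample standard deviation is undefined for a single value
    std = statistics.stdev(returns) if len(returns) > 1 else 0.0
    return [(g - mean) / (std + eps) for g in returns]

# Discounted returns of a 3-step episode with reward 1 per step, gamma = 0.99
normalized = normalize_returns([2.9701, 1.99, 1.0])
print(normalized)
```

After normalization, early steps (with above-average returns) receive positive weights and late steps receive negative ones, so the update pushes probability toward the actions taken early in the episode and away from those taken just before it ended.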

## 📈 4. Advantages and Disadvantages

**Advantages:**
- Works directly with stochastic policies.
- Handles continuous action spaces.
- Does not require a differentiable environment or reward function, because the gradient is estimated from sampled trajectories.

**Disadvantages:**
- High variance in gradient estimates.
- Can be sample inefficient.
- Sensitive to hyperparameters (learning rate, discount factor).
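
The variance problem, and why subtracting a baseline helps, can be seen in a small exact calculation. The sketch below assumes a two-armed bandit with a softmax policy and enumerates both actions rather than sampling them. The function name `grad_stats`, the reward values, and the baseline of 10.5 are illustrative choices.

```python
def grad_stats(probs, rewards, baseline=0.0, arm=0):
    """Exact mean and variance of the single-sample estimator
    (R - b) * d log pi(a) / d logit_arm for a softmax policy."""
    samples = []
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        dlogp = (1.0 if a == arm else 0.0) - probs[arm]
        samples.append((p_a, (r_a - baseline) * dlogp))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var

probs = [0.5, 0.5]
rewards = [10.0, 11.0]  # large rewards that differ only slightly

mean_plain, var_plain = grad_stats(probs, rewards)
mean_base, var_base = grad_stats(probs, rewards, baseline=10.5)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

Both estimators have the same expected value, so the baseline introduces no bias, but the variance drops sharply once the common reward offset is removed.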

## 🧠 5. Key Takeaways
- Policy gradients optimize the policy directly and do not require a learned value function.
- REINFORCE estimates the policy gradient from Monte Carlo returns of complete episodes.
- Variance reduction (via baselines or actor-critic methods) improves stability.

Next, we’ll move to **Actor-Critic Methods**, which combine value-based and policy-based ideas to overcome these limitations.