# Policy Gradient Methods: Theory and Implementation

This comprehensive tutorial covers policy gradient methods in reinforcement learning, including theoretical foundations, practical implementations, and real-world applications.

## Table of Contents
1. [Introduction to Policy Gradient Methods](#introduction)
2. [Theoretical Foundations](#theory)
3. [REINFORCE Algorithm](#reinforce)
4. [Actor-Critic Methods](#actor-critic)
5. [Proximal Policy Optimization (PPO)](#ppo)
6. [Real-World Case Study: Algorithmic Trading](#trading)
7. [Comparison and Analysis](#comparison)
8. [Advanced Topics](#advanced)
9. [Conclusion](#conclusion)


## 1. Introduction to Policy Gradient Methods {#introduction}

Policy gradient methods are a class of reinforcement learning algorithms that optimize a parameterized policy directly, by gradient ascent on the expected return. Unlike value-based methods (like Q-learning), which derive a policy from estimated action values, they do not need a value function to select actions, although many variants (such as actor-critic) learn one to reduce variance.

### Key Advantages:
- Can handle continuous action spaces naturally
- Can learn stochastic policies
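- Often change behavior more smoothly than value-based methods, because a small parameter update produces a small change in action probabilities
- Can work with high-dimensional state spaces when the policy is a neural network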

### Key Disadvantages:
- Can have high variance in gradient estimates
- May converge to local optima
- Can be sample inefficient

Let's start by setting up our environment and importing the necessary libraries.


In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import seaborn as sns
import gymnasium as gym
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Import our custom implementations
import sys
sys.path.append('../src')

from algorithms import REINFORCE, ActorCritic, PPO
from environments import TradingEnvironment, CustomCartPoleEnv
from utils import plot_training_progress, calculate_metrics, evaluate_policy

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")


## 2. Theoretical Foundations {#theory}

### Policy Gradient Theorem

The policy gradient theorem provides the foundation for policy gradient methods. It states that the gradient of the expected return with respect to the policy parameters can be expressed as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t \right]$$

Where:
- $J(\theta)$ is the expected return
- $\pi_\theta(a_t|s_t)$ is the policy probability of action $a_t$ given state $s_t$
- $\hat{A}_t$ is an estimate of the advantage function
- $\tau$ is a trajectory
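
As a sanity check on the theorem, here is a minimal sketch for the simplest possible case: a one-step problem (a 3-armed bandit) with a softmax policy over logits $\theta$. In that setting the sum over $t$ has a single term, the reward plays the role of $\hat{A}_t$, and $\nabla_\theta \log \pi_\theta(a) = \text{onehot}(a) - \pi_\theta$. The reward values, sample count, and function names below are assumptions made for illustration; they are not part of the `algorithms` module imported earlier. The sketch uses only the standard library.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_softmax(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: onehot(action) - pi."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def estimate_policy_gradient(logits, rewards, num_samples, rng):
    """Sample-based estimate: average of grad log pi(a) * reward(a)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(logits)), weights=probs)[0]
        score = grad_log_softmax(probs, action)
        for i in range(len(grad)):
            grad[i] += score[i] * rewards[action] / num_samples
    return grad

def exact_policy_gradient(logits, rewards):
    """Closed form for this toy case: dJ/dtheta_j = pi_j * (r_j - J)."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]          # uniform initial policy
rewards = [1.0, 2.0, 0.5]         # hypothetical deterministic reward per action
estimated = estimate_policy_gradient(logits, rewards, 20000, rng)
exact = exact_policy_gradient(logits, rewards)
print("estimated:", [round(g, 3) for g in estimated])
print("exact:    ", [round(g, 3) for g in exact])
```

The two vectors should agree up to sampling noise, and the largest component belongs to the action with the highest reward, so a gradient ascent step shifts probability towards it.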

### Advantage Function

The advantage function measures how much better taking action $a$ in state $s$ is than following the policy's average behavior in that state:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$

Where:
- $Q^\pi(s,a)$ is the action-value function
- $V^\pi(s)$ is the state-value function
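
A small worked example may help here. For a single state, $V^\pi(s)$ is the policy-weighted average of $Q^\pi(s,a)$, so the advantages average to zero under the policy. The action names and numbers below are made up for illustration.

```python
# Hypothetical action values Q(s, a) and policy probabilities pi(a | s) for one state
q_values = {"left": 1.0, "stay": 2.5, "right": 0.5}
policy = {"left": 0.2, "stay": 0.5, "right": 0.3}

# V(s) is the expectation of Q(s, a) under the policy
state_value = sum(policy[a] * q_values[a] for a in q_values)

# A(s, a) = Q(s, a) - V(s)
advantages = {a: q_values[a] - state_value for a in q_values}

# The policy-weighted mean of the advantages is zero by construction
expected_advantage = sum(policy[a] * advantages[a] for a in q_values)

print("V(s) =", round(state_value, 3))
print("A(s, a) =", {a: round(v, 3) for a, v in advantages.items()})
print("E[A] =", round(expected_advantage, 12))
```

Only "stay" has a positive advantage, so it is the only action whose probability a policy gradient step would push up in this state.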

### Different Estimators

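1. **Monte Carlo (REINFORCE)**: $\hat{A}_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ (the discounted return from step $t$; unbiased but high variance, usually reduced by subtracting a baseline)
2. **Actor-Critic**: $\hat{A}_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$ (the one-step TD error, with $V^\pi$ replaced in practice by a learned critic; lower variance but biased)
3. **GAE (Generalized Advantage Estimation)**: $\hat{A}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$ and $\lambda \in [0, 1]$ trades bias against variance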
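
The three estimators can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation in the `algorithms` module: the function names are hypothetical, `values` is assumed to hold one critic estimate per state plus a final bootstrap value (0 for a terminal state), and GAE is written in its usual recursive form $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$ with $\delta_t$ the one-step TD error.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo estimate: discounted sum of rewards from each step onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def td_residuals(rewards, values, gamma):
    """One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has len(rewards) + 1 entries; the last one is the bootstrap value.
    """
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]

def gae(rewards, values, gamma, lam):
    """GAE: exponentially weighted sum of TD errors with weight (gamma * lam)."""
    deltas = td_residuals(rewards, values, gamma)
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-step episode ending in a terminal state (bootstrap value 0)
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3, 0.0]
gamma = 0.9

returns = discounted_returns(rewards, gamma)
deltas = td_residuals(rewards, values, gamma)
gae_lam0 = gae(rewards, values, gamma, lam=0.0)
gae_lam1 = gae(rewards, values, gamma, lam=1.0)
print("returns:  ", [round(x, 3) for x in returns])
print("TD errors:", [round(x, 3) for x in deltas])
print("GAE(0):   ", [round(x, 3) for x in gae_lam0])
print("GAE(1):   ", [round(x, 3) for x in gae_lam1])
```

Setting `lam=0` recovers the one-step TD error, and `lam=1` recovers the Monte Carlo return minus the value baseline, which is why GAE is often described as interpolating between the first two estimators.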

Let's visualize the policy gradient concept:


In [None]:
# Visualize policy gradient concept
def visualize_policy_gradient():
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Policy before and after update
    actions = np.array([0, 1, 2])
    
    # Before update (near-uniform policy)
    probs_before = np.array([0.33, 0.33, 0.34])
    
    # After update (probability mass shifted towards action 1)
    probs_after = np.array([0.1, 0.7, 0.2])
    
    # Draw the two distributions side by side so neither bar hides the other
    width = 0.4
    ax1.bar(actions - width / 2, probs_before, width, alpha=0.7, label='Before Update', color='lightblue')
    ax1.bar(actions + width / 2, probs_after, width, alpha=0.7, label='After Update', color='darkblue')
    ax1.set_xlabel('Action')
    ax1.set_ylabel('Probability')
    ax1.set_title('Policy Update Example')
    ax1.legend()
    ax1.set_xticks(actions)
    
    # Advantage of one fixed action across states (synthetic curve, for illustration only)
    states = np.linspace(0, 10, 100)
    advantages = np.sin(states) * np.exp(-states/5)
    
    ax2.plot(states, advantages, linewidth=2, label='Advantage Function')
    ax2.axhline(y=0, color='black', linestyle='--', alpha=0.5)
    ax2.set_xlabel('State')
    ax2.set_ylabel('Advantage')
    ax2.set_title('Advantage Function Example')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

visualize_policy_gradient()
