# Week 16, Day 5: Policy Gradients for Trading

## Learning Objectives
- Understand the REINFORCE algorithm and policy gradient theorem
- Implement Actor-Critic architecture for trading
- Learn advantage estimation techniques (GAE)
- Build a trading policy network with continuous actions

## Why Policy Gradients for Trading?

Unlike value-based methods (DQN), policy gradients:
- **Directly optimize the policy** - Learn what action to take
- **Handle continuous actions** - Position sizing, not just buy/sell
- **Support stochastic policies** - Natural exploration through probability distributions
- **Change behavior smoothly** - Action probabilities shift gradually with the parameters, whereas a greedy value-based policy can flip actions after a small value change; the price is noisier gradient estimates

---

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal, Categorical
import matplotlib.pyplot as plt
from collections import deque, namedtuple
from typing import Tuple, List, Optional
import warnings
warnings.filterwarnings('ignore')

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

---
## Part 1: Policy Gradient Theorem

### The Core Idea

The **policy gradient theorem** states:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s, a) \right]$$

Where:
- $J(\theta)$ = Expected cumulative reward
- $\pi_\theta(a|s)$ = Policy (probability of action $a$ given state $s$)
- $Q^{\pi_\theta}(s, a)$ = Action-value function

### Intuition
- If an action leads to **high reward**, increase its probability
- If an action leads to **low reward**, decrease its probability
- The gradient naturally does this!
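
A minimal sketch of this intuition, assuming a single-state (bandit) problem with a softmax policy over three actions. The logits, the reward of 1.0, and the step size of 0.5 are illustrative choices, not part of the trading setup used later.

```python
import torch
from torch.distributions import Categorical

# Illustrative setup: one state, three actions, uniform initial policy.
logits = torch.zeros(3, requires_grad=True)
probs_before = torch.softmax(logits, dim=-1).detach().clone()

# Pretend action 2 was sampled and earned a positive reward.
demo_action = torch.tensor(2)
demo_reward = 1.0

# Score-function surrogate loss: -log pi(a) * reward.
loss = -Categorical(logits=logits).log_prob(demo_action) * demo_reward
loss.backward()

# One plain gradient-descent step on the loss (step size is arbitrary).
with torch.no_grad():
    logits -= 0.5 * logits.grad

probs_after = torch.softmax(logits, dim=-1).detach()
print("before:", probs_before.numpy())
print("after: ", probs_after.numpy())
```

The probability of the rewarded action rises and the other two fall. With a negative reward the same update would push the probability down instead.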

---

## Part 2: REINFORCE Algorithm

### Algorithm Overview

REINFORCE uses **Monte Carlo** returns to estimate $Q(s, a)$:

$$\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$$

Where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the **return from time $t$**.
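
As a small worked example of the return formula, the sketch below computes $G_t$ for a three-step episode with the backward recursion $G_t = r_t + \gamma G_{t+1}$. The helper name `discounted_returns`, the rewards, and $\gamma = 0.5$ are made up for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] for one episode via the backward recursion."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


example_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(example_returns)  # [1.5, 1.0, 2.0]
```

Reading from the end: $G_2 = 2$, $G_1 = 0 + 0.5 \cdot 2 = 1$, and $G_0 = 1 + 0.5 \cdot 1 = 1.5$.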

### Pseudocode
```
1. Initialize policy network π_θ
2. For each episode:
   a. Generate trajectory: s₀, a₀, r₀, s₁, a₁, r₁, ..., s_T
   b. For each timestep t:
      - Calculate return G_t
      - Calculate loss: -log π_θ(a_t|s_t) * G_t
   c. Sum the per-step losses and update θ by gradient descent
```

In [None]:
class PolicyNetwork(nn.Module):
    """
    Simple policy network for discrete actions.
    Outputs probability distribution over actions.
    """
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super(PolicyNetwork, self).__init__()
        
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.network(state)
    
    def get_action(self, state: torch.Tensor) -> Tuple[int, torch.Tensor]:
        """Sample action from policy distribution."""
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


# Test the network
state_dim = 10
action_dim = 3  # Buy, Hold, Sell

policy = PolicyNetwork(state_dim, action_dim)
test_state = torch.randn(1, state_dim)
action, log_prob = policy.get_action(test_state)

print(f"State shape: {test_state.shape}")
print(f"Action probabilities: {policy(test_state).detach().numpy()}")
print(f"Sampled action: {action}")
print(f"Log probability: {log_prob.item():.4f}")

In [None]:
class REINFORCE:
    """
    REINFORCE algorithm implementation.
    
    Key components:
    - Policy network outputs action probabilities
    - Monte Carlo returns for gradient estimation
    - Optional baseline for variance reduction
    """
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr: float = 1e-3,
        gamma: float = 0.99,
        baseline: bool = True
    ):
        self.gamma = gamma
        self.baseline = baseline
        
        # Policy network
        self.policy = PolicyNetwork(state_dim, action_dim).to(device)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        # Episode storage
        self.log_probs = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        """Select action using current policy."""
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        action, log_prob = self.policy.get_action(state)
        self.log_probs.append(log_prob)
        return action
    
    def store_reward(self, reward: float):
        """Store reward for current step."""
        self.rewards.append(reward)
    
    def compute_returns(self) -> torch.Tensor:
        """
        Compute discounted returns for each timestep.
        G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
        """
        returns = []
        G = 0.0
        
        # Accumulate returns backwards, then restore chronological order
        for reward in reversed(self.rewards):
            G = reward + self.gamma * G
            returns.append(G)
        returns.reverse()
        
        returns = torch.tensor(returns, dtype=torch.float32, device=device)
        
        # Standardize returns: subtracting the mean acts as a simple
        # constant baseline, and dividing by the std rescales the gradient
        if self.baseline and len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        return returns
    
    def update(self) -> float:
        """
        Update policy using REINFORCE gradient.
        Loss = -Σ log π(a|s) * G_t
        """
        returns = self.compute_returns()
        
        # Policy gradient loss
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        policy_loss = torch.stack(policy_loss).sum()
        
        # Gradient descent
        self.optimizer.zero_grad()
        policy_loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        loss_value = policy_loss.item()
        self.log_probs = []
        self.rewards = []
        
        return loss_value


print("REINFORCE agent created successfully!")
print("\nKey insight: REINFORCE updates only at episode end (Monte Carlo)")

---
## Part 3: Simple Trading Environment

Let's create a trading environment to test our algorithms.

In [None]:
class SimpleTradingEnv:
    """
    Simple trading environment for policy gradient algorithms.
    
    State: [price_changes, position, unrealized_pnl, volatility, momentum]
    Actions: 0=Sell, 1=Hold, 2=Buy
    Reward: PnL from position changes and holding
    """
    def __init__(
        self,
        prices: np.ndarray = None,
        lookback: int = 10,
        transaction_cost: float = 0.001
    ):
        self.lookback = lookback
        self.transaction_cost = transaction_cost
        
        # Generate synthetic prices if not provided
        if prices is None:
            self.prices = self._generate_prices(1000)
        else:
            self.prices = prices
        
        self.reset()
    
    def _generate_prices(self, n: int) -> np.ndarray:
        """Generate synthetic price series with trends and mean reversion."""
        prices = [100.0]
        trend = 0
        
        for _ in range(n - 1):
            # Random trend changes
            if np.random.random() < 0.05:
                trend = np.random.uniform(-0.002, 0.002)
            
            # Price change with trend and noise
            change = trend + np.random.randn() * 0.02
            new_price = prices[-1] * (1 + change)
            prices.append(new_price)
        
        return np.array(prices)
    
    def reset(self) -> np.ndarray:
        """Reset environment to initial state."""
        self.current_step = self.lookback
        self.position = 0  # -1, 0, or 1
        self.entry_price = 0
        self.total_pnl = 0
        self.trades = []
        
        return self._get_state()
    
    def _get_state(self) -> np.ndarray:
        """Construct state vector."""
        # Price returns over lookback period
        prices = self.prices[self.current_step - self.lookback:self.current_step + 1]
        returns = np.diff(prices) / prices[:-1]
        
        # Current price and position info
        current_price = self.prices[self.current_step]
        
        # Unrealized PnL (normalized)
        if self.position != 0:
            unrealized_pnl = self.position * (current_price - self.entry_price) / self.entry_price
        else:
            unrealized_pnl = 0
        
        # Volatility (rolling std of returns)
        volatility = np.std(returns) if len(returns) > 1 else 0
        
        # Momentum (sum of recent returns)
        momentum = np.sum(returns[-5:]) if len(returns) >= 5 else np.sum(returns)
        
        # Combine features
        state = np.concatenate([
            returns,                    # Lookback returns
            [self.position],            # Current position
            [unrealized_pnl],           # Unrealized PnL
            [volatility],               # Volatility
            [momentum]                  # Momentum
        ])
        
        return state.astype(np.float32)
    
    def step(self, action: int) -> Tuple[np.ndarray, float, bool, dict]:
        """
        Execute action and return new state, reward, done, info.
        
        Actions: 0=Sell/Short, 1=Hold, 2=Buy/Long
        """
        current_price = self.prices[self.current_step]
        reward = 0
        
        # Map action to target position
        target_position = action - 1  # 0->-1, 1->0, 2->1
        
        # Calculate position change
        position_change = target_position - self.position
        
        # Transaction cost for position changes
        if position_change != 0:
            reward -= abs(position_change) * self.transaction_cost
            
            # Close existing position
            if self.position != 0:
                pnl = self.position * (current_price - self.entry_price) / self.entry_price
                reward += pnl
                self.total_pnl += pnl
                self.trades.append(pnl)
            
            # Open new position
            self.position = target_position
            self.entry_price = current_price if target_position != 0 else 0
        
        # Move to next step
        self.current_step += 1
        
        # Check if episode is done
        done = self.current_step >= len(self.prices) - 1
        
        # Add holding reward/penalty
        if not done and self.position != 0:
            next_price = self.prices[self.current_step]
            holding_return = self.position * (next_price - current_price) / current_price
            reward += holding_return * 0.1  # Scale down holding rewards
        
        info = {
            'total_pnl': self.total_pnl,
            'position': self.position,
            'n_trades': len(self.trades)
        }
        
        return self._get_state(), reward, done, info
    
    @property
    def state_dim(self) -> int:
        return self.lookback + 4  # returns + position + pnl + vol + momentum
    
    @property
    def action_dim(self) -> int:
        return 3  # Sell, Hold, Buy


# Test the environment
env = SimpleTradingEnv()
state = env.reset()

print(f"State dimension: {env.state_dim}")
print(f"Action dimension: {env.action_dim}")
print(f"Initial state shape: {state.shape}")
print(f"\nSample state: {state[:5]}...")

In [None]:
def train_reinforce(
    env: SimpleTradingEnv,
    agent: REINFORCE,
    n_episodes: int = 500,
    print_every: int = 50
) -> Tuple[List[float], List[float]]:
    """
    Train REINFORCE agent on trading environment.
    """
    episode_rewards = []
    episode_pnls = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            # Select action
            action = agent.select_action(state)
            
            # Take step in environment
            next_state, reward, done, info = env.step(action)
            
            # Store reward
            agent.store_reward(reward)
            total_reward += reward
            
            state = next_state
        
        # Update policy at end of episode
        loss = agent.update()
        
        episode_rewards.append(total_reward)
        episode_pnls.append(info['total_pnl'])
        
        if (episode + 1) % print_every == 0:
            avg_reward = np.mean(episode_rewards[-print_every:])
            avg_pnl = np.mean(episode_pnls[-print_every:])
            print(f"Episode {episode + 1:4d} | Avg Reward: {avg_reward:8.4f} | "
                  f"Avg PnL: {avg_pnl:8.4f} | Trades: {info['n_trades']}")
    
    return episode_rewards, episode_pnls


# Train REINFORCE agent
print("Training REINFORCE Agent...")
print("=" * 60)

env = SimpleTradingEnv()
reinforce_agent = REINFORCE(
    state_dim=env.state_dim,
    action_dim=env.action_dim,
    lr=1e-3,
    gamma=0.99,
    baseline=True
)

reinforce_rewards, reinforce_pnls = train_reinforce(env, reinforce_agent, n_episodes=300)

In [None]:
# Visualize REINFORCE training
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Smoothed rewards
window = 20
smoothed_rewards = pd.Series(reinforce_rewards).rolling(window).mean()
smoothed_pnls = pd.Series(reinforce_pnls).rolling(window).mean()

axes[0].plot(reinforce_rewards, alpha=0.3, color='blue', label='Raw')
axes[0].plot(smoothed_rewards, color='blue', linewidth=2, label=f'MA({window})')
axes[0].axhline(y=0, color='red', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Episode Reward')
axes[0].set_title('REINFORCE: Episode Rewards')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(reinforce_pnls, alpha=0.3, color='green', label='Raw')
axes[1].plot(smoothed_pnls, color='green', linewidth=2, label=f'MA({window})')
axes[1].axhline(y=0, color='red', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Episode')
axes[1].set_ylabel('Total PnL')
axes[1].set_title('REINFORCE: Trading PnL')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nFinal 50 episodes - Avg Reward: {np.mean(reinforce_rewards[-50:]):.4f}")
print(f"Final 50 episodes - Avg PnL: {np.mean(reinforce_pnls[-50:]):.4f}")

---
## Part 4: Actor-Critic Architecture

### Motivation

REINFORCE has two practical drawbacks:
- **High variance**: full-episode Monte Carlo returns accumulate the noise of every later reward
- **Delayed updates**: it must wait until the episode ends before it can learn

### Actor-Critic Solution

Use **two networks**:
1. **Actor** (Policy): Decides which action to take
2. **Critic** (Value): Estimates how good the current state is

### Key Equations

**Critic Update** (TD Learning):
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

**Actor Update** (Policy Gradient):
$$\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \delta_t$$

Where $\delta_t$ is the **TD error** (temporal difference).
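
To make the TD error concrete, here is a minimal sketch for a single transition. The helper name `td_error` and the numbers are illustrative assumptions; the `done` flag zeroes the bootstrap term at terminal states.

```python
def td_error(reward, value, next_value, gamma=0.99, done=False):
    """One-step TD error: r + gamma * V(s') - V(s), with V(s') = 0 if terminal."""
    bootstrap = 0.0 if done else next_value
    return reward + gamma * bootstrap - value


delta = td_error(reward=0.02, value=0.50, next_value=0.55, gamma=0.99)
print(f"{delta:.4f}")  # 0.0645
```

A positive $\delta_t$ means the step turned out better than the critic expected, so the actor update raises the probability of the action taken. A negative $\delta_t$ lowers it.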

---

In [None]:
class ActorNetwork(nn.Module):
    """Actor network for discrete actions."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super(ActorNetwork, self).__init__()
        
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.network(state)


class CriticNetwork(nn.Module):
    """Critic network - estimates state value V(s)."""
    def __init__(self, state_dim: int, hidden_dim: int = 128):
        super(CriticNetwork, self).__init__()
        
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.network(state)


class ActorCritic:
    """
    Actor-Critic algorithm with TD learning.
    
    Advantages over REINFORCE:
    - Lower variance (bootstrapping)
    - Online updates (no need to wait for episode end)
    - Better sample efficiency
    """
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        actor_lr: float = 1e-3,
        critic_lr: float = 1e-3,
        gamma: float = 0.99
    ):
        self.gamma = gamma
        
        # Actor (policy) network
        self.actor = ActorNetwork(state_dim, action_dim).to(device)
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
        
        # Critic (value) network
        self.critic = CriticNetwork(state_dim).to(device)
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=critic_lr)
    
    def select_action(self, state: np.ndarray) -> Tuple[int, torch.Tensor]:
        """Select action from current policy."""
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        probs = self.actor(state)
        dist = Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob
    
    def update(
        self,
        state: np.ndarray,
        action: int,
        reward: float,
        next_state: np.ndarray,
        done: bool,
        log_prob: torch.Tensor
    ) -> Tuple[float, float]:
        """
        Online update using TD error.
        """
        state = torch.FloatTensor(state).unsqueeze(0).to(device)
        next_state = torch.FloatTensor(next_state).unsqueeze(0).to(device)
        reward = torch.FloatTensor([reward]).to(device)
        
        # Current value estimate
        value = self.critic(state)
        
        # Next value estimate (0 if terminal)
        with torch.no_grad():
            next_value = self.critic(next_state) if not done else torch.zeros(1).to(device)
        
        # TD error (advantage estimate)
        td_target = reward + self.gamma * next_value
        td_error = td_target - value
        
        # Critic loss (MSE of TD error)
        critic_loss = td_error.pow(2).mean()
        
        # Actor loss (policy gradient with TD error)
        actor_loss = -log_prob * td_error.detach()
        
        # Update critic
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()
        
        # Update actor
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()
        
        return actor_loss.item(), critic_loss.item()


print("Actor-Critic class created!")
print("\nKey difference from REINFORCE: Updates every step, not every episode")

In [None]:
def train_actor_critic(
    env: SimpleTradingEnv,
    agent: ActorCritic,
    n_episodes: int = 500,
    print_every: int = 50
) -> Tuple[List[float], List[float]]:
    """
    Train Actor-Critic agent with online updates.
    """
    episode_rewards = []
    episode_pnls = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            # Select action
            action, log_prob = agent.select_action(state)
            
            # Take step
            next_state, reward, done, info = env.step(action)
            
            # Online update
            agent.update(state, action, reward, next_state, done, log_prob)
            
            total_reward += reward
            state = next_state
        
        episode_rewards.append(total_reward)
        episode_pnls.append(info['total_pnl'])
        
        if (episode + 1) % print_every == 0:
            avg_reward = np.mean(episode_rewards[-print_every:])
            avg_pnl = np.mean(episode_pnls[-print_every:])
            print(f"Episode {episode + 1:4d} | Avg Reward: {avg_reward:8.4f} | "
                  f"Avg PnL: {avg_pnl:8.4f} | Trades: {info['n_trades']}")
    
    return episode_rewards, episode_pnls


# Train Actor-Critic agent
print("Training Actor-Critic Agent...")
print("=" * 60)

env = SimpleTradingEnv()
ac_agent = ActorCritic(
    state_dim=env.state_dim,
    action_dim=env.action_dim,
    actor_lr=1e-3,
    critic_lr=1e-3,
    gamma=0.99
)

ac_rewards, ac_pnls = train_actor_critic(env, ac_agent, n_episodes=300)

---
## Part 5: Advantage Estimation (GAE)

### The Problem

Simple TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ has:
- **Low variance** (single-step)
- **High bias** (depends on value function accuracy)

### Generalized Advantage Estimation (GAE)

GAE balances bias and variance with parameter $\lambda$:

$$\hat{A}_t^{GAE} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

Where:
- $\lambda = 0$: TD(0), low variance, high bias
- $\lambda = 1$: Monte Carlo, high variance, low bias
- $\lambda \in (0, 1)$: Trade-off between the two

### Practical Computation

Recursive formula:
$$\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$$
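
The two limiting cases can be checked numerically. The sketch below is a standalone version of the recursion (the name `gae_advantages` and the toy numbers are assumptions for illustration) and confirms that $\lambda = 0$ reproduces the one-step TD errors while $\lambda = 1$ reproduces the Monte Carlo return minus $V(s_t)$, assuming the trajectory ends in a terminal state.

```python
import numpy as np


def gae_advantages(rewards, values, last_value, gamma, lam):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1} for one trajectory."""
    values_ext = np.append(np.asarray(values, dtype=float), last_value)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


demo_rewards = [0.1, -0.05, 0.2]
demo_values = [0.5, 0.45, 0.55]
demo_gamma = 0.9
last_value = 0.0  # assumption: the last step is terminal

adv_td = gae_advantages(demo_rewards, demo_values, last_value, demo_gamma, lam=0.0)
adv_mc = gae_advantages(demo_rewards, demo_values, last_value, demo_gamma, lam=1.0)

# Reference quantities computed directly from their definitions.
v_ext = np.array(demo_values + [last_value])
td_errors = np.array(demo_rewards) + demo_gamma * v_ext[1:] - v_ext[:-1]
mc_returns = np.array([
    sum(demo_gamma ** (k - t) * demo_rewards[k] for k in range(t, 3))
    for t in range(3)
])

print(np.allclose(adv_td, td_errors))                           # True
print(np.allclose(adv_mc, mc_returns - np.array(demo_values)))  # True
```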

---

In [None]:
def compute_gae(
    rewards: List[float],
    values: List[float],
    next_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute Generalized Advantage Estimation.
    
    Args:
        rewards: List of rewards from trajectory
        values: List of value estimates V(s_t)
        next_value: Value estimate of the state after the last step (use 0.0 if that step was terminal)
        gamma: Discount factor
        gae_lambda: GAE parameter (0=TD(0), 1=MC)
    
    Returns:
        advantages: GAE advantages
        returns: Target returns for value function
    """
    rewards = np.array(rewards)
    values = np.array(values)
    
    T = len(rewards)
    advantages = np.zeros(T)
    
    # Compute GAE backwards
    gae = 0
    for t in reversed(range(T)):
        if t == T - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]
        
        # TD error
        delta = rewards[t] + gamma * next_val - values[t]
        
        # GAE recursive formula
        gae = delta + gamma * gae_lambda * gae
        advantages[t] = gae
    
    # Target returns for value function training
    returns = advantages + values
    
    return advantages, returns


# Demonstrate GAE with different lambda values
print("GAE Example with different λ values")
print("=" * 50)

# Sample trajectory
rewards = [0.1, -0.05, 0.2, 0.15, -0.1]
values = [0.5, 0.45, 0.55, 0.6, 0.5]
next_value = 0.48

for lam in [0.0, 0.5, 0.95, 1.0]:
    advantages, returns = compute_gae(rewards, values, next_value, gae_lambda=lam)
    print(f"\nλ = {lam}:")
    print(f"  Advantages: {advantages.round(4)}")
    print(f"  Advantage std: {advantages.std():.4f}")

In [None]:
# Experience buffer for batch updates
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done', 'log_prob', 'value'])


class A2CAgent:
    """
    Advantage Actor-Critic (A2C) with GAE.
    
    Improvements over basic Actor-Critic:
    - GAE for better advantage estimation
    - Batch updates for stability
    - Entropy bonus for exploration
    """
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        entropy_coef: float = 0.01,
        value_coef: float = 0.5,
        update_steps: int = 128
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.entropy_coef = entropy_coef
        self.value_coef = value_coef
        self.update_steps = update_steps
        
        # Separate actor and critic networks (no shared backbone)
        self.actor = ActorNetwork(state_dim, action_dim).to(device)
        self.critic = CriticNetwork(state_dim).to(device)
        
        # Single optimizer for both networks
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            lr=lr
        )
        
        # Experience buffer
        self.buffer = []
    
    def select_action(self, state: np.ndarray) -> Tuple[int, torch.Tensor, torch.Tensor]:
        """Select action and get value estimate."""
        state_t = torch.FloatTensor(state).unsqueeze(0).to(device)
        
        probs = self.actor(state_t)
        value = self.critic(state_t)
        
        dist = Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        
        return action.item(), log_prob, value.squeeze()
    
    def store_experience(self, exp: Experience):
        """Store experience in buffer."""
        self.buffer.append(exp)
    
    def update(self, next_state: np.ndarray) -> Tuple[float, float, float]:
        """
        Batch update using GAE.
        """
        # Nothing to learn from an empty buffer. A partial buffer is still
        # flushed at episode end so experience never spans two episodes.
        if len(self.buffer) == 0:
            return 0.0, 0.0, 0.0
        
        # Extract data from buffer
        states = torch.FloatTensor(np.array([e.state for e in self.buffer])).to(device)
        actions = torch.LongTensor([e.action for e in self.buffer]).to(device)
        rewards = [e.reward for e in self.buffer]
        dones = [e.done for e in self.buffer]
        values = torch.stack([e.value for e in self.buffer]).detach().cpu().numpy()
        
        # Bootstrap value for GAE: zero if the last stored step ended the episode
        if dones[-1]:
            next_value = 0.0
        else:
            with torch.no_grad():
                next_state_t = torch.FloatTensor(next_state).unsqueeze(0).to(device)
                next_value = self.critic(next_state_t).item()
        
        # Compute GAE advantages and returns
        advantages, returns = compute_gae(
            rewards, values, next_value,
            self.gamma, self.gae_lambda
        )
        
        advantages = torch.FloatTensor(advantages).to(device)
        returns = torch.FloatTensor(returns).to(device)
        
        # Normalize advantages (skipped for a single sample, where std is undefined)
        if advantages.numel() > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Forward pass
        probs = self.actor(states)
        current_values = self.critic(states).squeeze(-1)
        
        dist = Categorical(probs)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy().mean()
        
        # Actor loss (policy gradient with advantages)
        actor_loss = -(log_probs * advantages.detach()).mean()
        
        # Critic loss (MSE with returns)
        critic_loss = F.mse_loss(current_values, returns)
        
        # Total loss with entropy bonus
        total_loss = (
            actor_loss +
            self.value_coef * critic_loss -
            self.entropy_coef * entropy
        )
        
        # Backpropagation
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            max_norm=0.5
        )
        self.optimizer.step()
        
        # Clear buffer
        self.buffer = []
        
        return actor_loss.item(), critic_loss.item(), entropy.item()


print("A2C Agent with GAE created!")

In [None]:
def train_a2c(
    env: SimpleTradingEnv,
    agent: A2CAgent,
    n_episodes: int = 500,
    print_every: int = 50
) -> Tuple[List[float], List[float]]:
    """
    Train A2C agent with periodic batch updates.
    """
    episode_rewards = []
    episode_pnls = []
    total_steps = 0
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            # Select action
            action, log_prob, value = agent.select_action(state)
            
            # Take step
            next_state, reward, done, info = env.step(action)
            
            # Store experience
            exp = Experience(state, action, reward, next_state, done, log_prob, value)
            agent.store_experience(exp)
            
            total_reward += reward
            total_steps += 1
            
            # Update when buffer is full or episode ends
            if len(agent.buffer) >= agent.update_steps or done:
                agent.update(next_state)
            
            state = next_state
        
        episode_rewards.append(total_reward)
        episode_pnls.append(info['total_pnl'])
        
        if (episode + 1) % print_every == 0:
            avg_reward = np.mean(episode_rewards[-print_every:])
            avg_pnl = np.mean(episode_pnls[-print_every:])
            print(f"Episode {episode + 1:4d} | Avg Reward: {avg_reward:8.4f} | "
                  f"Avg PnL: {avg_pnl:8.4f} | Steps: {total_steps}")
    
    return episode_rewards, episode_pnls


# Train A2C agent
print("Training A2C Agent with GAE...")
print("=" * 60)

env = SimpleTradingEnv()
a2c_agent = A2CAgent(
    state_dim=env.state_dim,
    action_dim=env.action_dim,
    lr=3e-4,
    gamma=0.99,
    gae_lambda=0.95,
    entropy_coef=0.01,
    update_steps=64
)

a2c_rewards, a2c_pnls = train_a2c(env, a2c_agent, n_episodes=300)

---
## Part 6: Trading Policy Network with Continuous Actions

Real trading usually calls for **continuous actions** (position sizes), not just discrete buy/sell decisions.

### Gaussian Policy

For continuous actions, the policy outputs parameters of a Gaussian distribution:

$$\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta(s)^2)$$

Where:
- $\mu_\theta(s)$ = Mean action (from neural network)
- $\sigma_\theta(s)$ = Standard deviation (learned or fixed)
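
A minimal sketch of such a Gaussian policy, assuming a state-independent learned log-standard-deviation and a `tanh`-bounded mean. The class name `GaussianPolicySketch`, the hidden size, and `state_dim=18` are hypothetical choices for illustration, not the notebook's final network.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianPolicySketch(nn.Module):
    """Hypothetical minimal Gaussian policy: state -> Normal(mean, std)."""

    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.Tanh())
        self.mean_head = nn.Linear(hidden_dim, 1)
        # State-independent log-std, learned as a free parameter (one common choice)
        self.log_std = nn.Parameter(torch.zeros(1))

    def forward(self, state: torch.Tensor) -> Normal:
        mean = torch.tanh(self.mean_head(self.body(state)))  # mean kept in [-1, 1]
        std = self.log_std.exp().expand_as(mean)
        return Normal(mean, std)


policy_sketch = GaussianPolicySketch(state_dim=18)
action_dist = policy_sketch(torch.randn(1, 18))

raw_action = action_dist.sample()                            # unbounded sample
sample_log_prob = action_dist.log_prob(raw_action).sum(dim=-1)
position = raw_action.clamp(-1.0, 1.0)                       # valid position size

print(position.shape, sample_log_prob.shape)
```

The log-probability is taken at the raw sample, while the clamped value is what would be sent to the environment as a position size. Parameterizing the log of the standard deviation keeps $\sigma$ positive without extra constraints.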

---

In [None]:
class ContinuousTradingEnv:
    """
    Trading environment with continuous position sizing.
    
    Action: Position size in [-1, 1]
        -1 = Full short
         0 = No position
        +1 = Full long
    """
    def __init__(
        self,
        prices: np.ndarray = None,
        lookback: int = 20,
        transaction_cost: float = 0.001
    ):
        self.lookback = lookback
        self.transaction_cost = transaction_cost
        
        if prices is None:
            self.prices = self._generate_prices(1000)
        else:
            self.prices = prices
        
        self.reset()
    
    def _generate_prices(self, n: int) -> np.ndarray:
        """Generate a synthetic price series with regime changes."""
        prices = [100.0]
        regime = 'trending'
        trend = 0.001
        
        for i in range(n - 1):
            # Regime changes
            if np.random.random() < 0.02:
                regime = np.random.choice(['trending', 'mean_revert', 'volatile'])
                trend = np.random.uniform(-0.003, 0.003)
            
            if regime == 'trending':
                change = trend + np.random.randn() * 0.015
            elif regime == 'mean_revert':
                mean_price = np.mean(prices[-20:]) if len(prices) >= 20 else prices[0]
                change = 0.1 * (mean_price - prices[-1]) / prices[-1] + np.random.randn() * 0.01
            else:  # volatile
                change = np.random.randn() * 0.03
            
            new_price = prices[-1] * (1 + change)
            prices.append(max(new_price, 1.0))  # Price floor
        
        return np.array(prices)
    
    def reset(self) -> np.ndarray:
        self.current_step = self.lookback
        self.position = 0.0  # Continuous position [-1, 1]
        self.entry_price = 0
        self.total_pnl = 0
        self.pnl_history = []
        
        return self._get_state()
    
    def _get_state(self) -> np.ndarray:
        """Construct feature-rich state."""
        prices = self.prices[self.current_step - self.lookback:self.current_step + 1]
        returns = np.diff(prices) / prices[:-1]
        
        # Technical features
        sma_5 = np.mean(prices[-5:]) / prices[-1] - 1
        sma_10 = np.mean(prices[-10:]) / prices[-1] - 1
        sma_20 = np.mean(prices[-20:]) / prices[-1] - 1  # window holds lookback + 1 prices
        
        volatility = np.std(returns)
        momentum = np.sum(returns[-5:])
        
        # RSI approximation
        gains = np.maximum(returns, 0)
        losses = np.maximum(-returns, 0)
        avg_gain = np.mean(gains[-14:]) if len(gains) >= 14 else np.mean(gains)
        avg_loss = np.mean(losses[-14:]) if len(losses) >= 14 else np.mean(losses)
        rsi = avg_gain / (avg_gain + avg_loss + 1e-8) - 0.5  # Centered at 0
        
        # Position features
        unrealized_pnl = 0
        if self.position != 0:
            unrealized_pnl = self.position * (prices[-1] - self.entry_price) / self.entry_price
        
        state = np.concatenate([
            returns[-10:],           # Recent returns
            [sma_5, sma_10, sma_20], # Moving average features
            [volatility],            # Volatility
            [momentum],              # Momentum
            [rsi],                   # RSI
            [self.position],         # Current position
            [unrealized_pnl]         # Unrealized PnL
        ])
        
        return state.astype(np.float32)
    
    def step(self, action: float) -> Tuple[np.ndarray, float, bool, dict]:
        """
        Execute a continuous action.
        
        The action is the target position and is clipped to [-1, 1].
        Returns (next_state, reward, done, info).
        """
        # Clip action to valid range
        target_position = np.clip(action, -1, 1)
        
        current_price = self.prices[self.current_step]
        reward = 0
        
        # Position change
        position_change = abs(target_position - self.position)
        
        # Transaction cost
        reward -= position_change * self.transaction_cost
        
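        # PnL from closing/adjusting position.
        # entry_price is re-marked on every step (see below), so `pnl` covers only
        # the old position's move over the last bar. That move was already paid
        # as a holding reward on the previous step, so the closed fraction is
        # counted twice in `reward`.
        if self.position != 0:
            pnl = self.position * (current_price - self.entry_price) / self.entry_price
            # Scale PnL by how much we're closing
            close_fraction = min(position_change / (abs(self.position) + 1e-8), 1.0)
            realized_pnl = pnl * close_fraction
            reward += realized_pnl
            self.total_pnl += realized_pnl
            self.pnl_history.append(realized_pnl)
        
        # Update position and re-mark the entry price to the current bar
        self.position = target_position
        self.entry_price = current_price if target_position != 0 else 0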
        
        # Move to next step
        self.current_step += 1
        done = self.current_step >= len(self.prices) - 1
        
        # Holding reward: return of the new position over the next bar
        if not done and self.position != 0:
            next_price = self.prices[self.current_step]
            holding_return = self.position * (next_price - current_price) / current_price
            reward += holding_return
        
        info = {
            'total_pnl': self.total_pnl,
            'position': self.position,
            'n_trades': len(self.pnl_history),
            'sharpe': self._calculate_sharpe()
        }
        
        return self._get_state(), reward, done, info
    
    def _calculate_sharpe(self) -> float:
        """Sharpe ratio of the realized PnL entries; sqrt(252) annualization assumes daily bars."""
        if len(self.pnl_history) < 2:
            return 0.0
        returns = np.array(self.pnl_history)
        return np.mean(returns) / (np.std(returns) + 1e-8) * np.sqrt(252)
    
    @property
    def state_dim(self) -> int:
        return 18  # 10 returns + 3 SMA + vol + mom + rsi + pos + pnl
    
    @property
    def action_dim(self) -> int:
        return 1  # Continuous position


# Test continuous environment
cont_env = ContinuousTradingEnv()
state = cont_env.reset()
print(f"State dimension: {cont_env.state_dim}")
print(f"State shape: {state.shape}")
print(f"Sample state: {state[:5]}...")

In [None]:
class GaussianPolicyNetwork(nn.Module):
    """
    Gaussian policy for continuous actions.
    Outputs mean and log_std for position sizing.
    """
    def __init__(
        self,
        state_dim: int,
        action_dim: int = 1,
        hidden_dim: int = 128,
        log_std_min: float = -20,
        log_std_max: float = 2
    ):
        super(GaussianPolicyNetwork, self).__init__()
        
        self.log_std_min = log_std_min
        self.log_std_max = log_std_max
        
        # Shared layers
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Mean head (tanh keeps the mean in [-1, 1]; samples can still fall outside)
        self.mean_head = nn.Sequential(
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh()  # Bounds the mean, not the sampled action
        )
        
        # Log std head (learned)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Output mean and log_std of action distribution."""
        features = self.shared(state)
        mean = self.mean_head(features)
        log_std = self.log_std_head(features)
        log_std = torch.clamp(log_std, self.log_std_min, self.log_std_max)
        return mean, log_std
    
    def sample(self, state: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Sample action and compute log probability."""
        mean, log_std = self.forward(state)
        std = log_std.exp()
        
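        # Sample from Gaussian
        dist = Normal(mean, std)
        # Plain sample(): the score-function gradient treats the action as a
        # constant. With rsample() the log-probability of the reparameterized
        # action no longer depends on the mean, so its gradient would vanish.
        action = dist.sample()
        
        # Log probability of the unclamped sample
        log_prob = dist.log_prob(action).sum(dim=-1)
        
        # Clamp action to valid range
        action = torch.clamp(action, -1, 1)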
        
        return action, log_prob
    
    def evaluate(self, state: torch.Tensor, action: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Evaluate log probability of actions."""
        mean, log_std = self.forward(state)
        std = log_std.exp()
        
        dist = Normal(mean, std)
        log_prob = dist.log_prob(action).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)
        
        return log_prob, entropy


# Test Gaussian policy
gaussian_policy = GaussianPolicyNetwork(state_dim=cont_env.state_dim)
test_state = torch.randn(1, cont_env.state_dim)

mean, log_std = gaussian_policy(test_state)
action, log_prob = gaussian_policy.sample(test_state)

print(f"Mean action: {mean.item():.4f}")
print(f"Std: {log_std.exp().item():.4f}")
print(f"Sampled action: {action.item():.4f}")
print(f"Log probability: {log_prob.item():.4f}")

In [None]:
class ContinuousA2C:
    """
    A2C for continuous trading actions.
    Uses Gaussian policy for position sizing.
    """
    def __init__(
        self,
        state_dim: int,
        action_dim: int = 1,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        entropy_coef: float = 0.01,
        value_coef: float = 0.5,
        update_steps: int = 64
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.entropy_coef = entropy_coef
        self.value_coef = value_coef
        self.update_steps = update_steps
        
        # Networks
        self.actor = GaussianPolicyNetwork(state_dim, action_dim).to(device)
        self.critic = CriticNetwork(state_dim).to(device)
        
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            lr=lr
        )
        
        self.buffer = []
    
    def select_action(self, state: np.ndarray) -> Tuple[float, torch.Tensor, torch.Tensor]:
        """Select continuous action."""
        state_t = torch.FloatTensor(state).unsqueeze(0).to(device)
        
        # No autograd graph needed here: update() re-evaluates log-probs and values
        with torch.no_grad():
            action, log_prob = self.actor.sample(state_t)
            value = self.critic(state_t)
        
        return action.item(), log_prob, value.squeeze()
    
    def store_experience(self, exp: Experience):
        self.buffer.append(exp)
    
    def update(self, next_state: np.ndarray) -> Tuple[float, float, float]:
        """Update actor and critic on the buffered rollout using GAE."""
        # Short rollouts are accepted too: the training loop calls update() at
        # episode end, and the buffer must be flushed there so experiences do
        # not leak into the next episode.
        if not self.buffer:
            return 0.0, 0.0, 0.0
        
        # Extract data (stack states into one array before building the tensor)
        states = torch.FloatTensor(np.array([e.state for e in self.buffer])).to(device)
        actions = torch.FloatTensor([[e.action] for e in self.buffer]).to(device)
        rewards = [e.reward for e in self.buffer]
        values = torch.stack([e.value for e in self.buffer]).detach().cpu().numpy()
        
        # Get next value
        with torch.no_grad():
            next_state_t = torch.FloatTensor(next_state).unsqueeze(0).to(device)
            next_value = self.critic(next_state_t).item()
        
        # Compute GAE
        advantages, returns = compute_gae(
            rewards, values, next_value,
            self.gamma, self.gae_lambda
        )
        
        advantages = torch.FloatTensor(advantages).to(device)
        returns = torch.FloatTensor(returns).to(device)
        
        # Normalize advantages (the std of a single sample is undefined)
        if advantages.numel() > 1:
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # Evaluate actions
        log_probs, entropy = self.actor.evaluate(states, actions)
        current_values = self.critic(states).squeeze(-1)  # keep the batch dimension
        
        # Losses
        actor_loss = -(log_probs * advantages.detach()).mean()
        critic_loss = F.mse_loss(current_values, returns)
        entropy_loss = -entropy.mean()
        
        total_loss = (
            actor_loss +
            self.value_coef * critic_loss +
            self.entropy_coef * entropy_loss
        )
        
        # Update
        self.optimizer.zero_grad()
        total_loss.backward()
        torch.nn.utils.clip_grad_norm_(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            max_norm=0.5
        )
        self.optimizer.step()
        
        self.buffer = []
        
        return actor_loss.item(), critic_loss.item(), -entropy_loss.item()


print("Continuous A2C Agent created!")

In [None]:
def train_continuous_a2c(
    env: ContinuousTradingEnv,
    agent: ContinuousA2C,
    n_episodes: int = 500,
    print_every: int = 50
) -> Tuple[List[float], List[float], List[float]]:
    """
    Train continuous A2C agent.
    """
    episode_rewards = []
    episode_pnls = []
    episode_sharpes = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        done = False
        
        while not done:
            action, log_prob, value = agent.select_action(state)
            next_state, reward, done, info = env.step(action)
            
            exp = Experience(state, action, reward, next_state, done, log_prob, value)
            agent.store_experience(exp)
            
            total_reward += reward
            
            if len(agent.buffer) >= agent.update_steps or done:
                agent.update(next_state)
            
            state = next_state
        
        episode_rewards.append(total_reward)
        episode_pnls.append(info['total_pnl'])
        episode_sharpes.append(info['sharpe'])
        
        if (episode + 1) % print_every == 0:
            avg_reward = np.mean(episode_rewards[-print_every:])
            avg_pnl = np.mean(episode_pnls[-print_every:])
            avg_sharpe = np.mean(episode_sharpes[-print_every:])
            print(f"Episode {episode + 1:4d} | Reward: {avg_reward:8.4f} | "
                  f"PnL: {avg_pnl:8.4f} | Sharpe: {avg_sharpe:6.2f}")
    
    return episode_rewards, episode_pnls, episode_sharpes


# Train Continuous A2C
print("Training Continuous A2C Agent...")
print("=" * 60)

cont_env = ContinuousTradingEnv()
cont_a2c_agent = ContinuousA2C(
    state_dim=cont_env.state_dim,
    action_dim=cont_env.action_dim,
    lr=3e-4,
    gamma=0.99,
    gae_lambda=0.95
)

cont_rewards, cont_pnls, cont_sharpes = train_continuous_a2c(
    cont_env, cont_a2c_agent, n_episodes=400
)

In [None]:
# Comprehensive comparison visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

window = 20

# REINFORCE vs Actor-Critic rewards
ax1 = axes[0, 0]
ax1.plot(pd.Series(reinforce_rewards).rolling(window).mean(), label='REINFORCE', alpha=0.8)
ax1.plot(pd.Series(ac_rewards).rolling(window).mean(), label='Actor-Critic', alpha=0.8)
ax1.plot(pd.Series(a2c_rewards).rolling(window).mean(), label='A2C (GAE)', alpha=0.8)
ax1.axhline(y=0, color='red', linestyle='--', alpha=0.5)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Episode Reward')
ax1.set_title('Discrete Actions: Algorithm Comparison')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Continuous A2C results
ax2 = axes[0, 1]
ax2.plot(pd.Series(cont_rewards).rolling(window).mean(), color='purple', linewidth=2)
ax2.axhline(y=0, color='red', linestyle='--', alpha=0.5)
ax2.set_xlabel('Episode')
ax2.set_ylabel('Episode Reward')
ax2.set_title('Continuous A2C: Episode Rewards')
ax2.grid(True, alpha=0.3)

# PnL comparison
ax3 = axes[1, 0]
ax3.plot(pd.Series(reinforce_pnls).rolling(window).mean(), label='REINFORCE', alpha=0.8)
ax3.plot(pd.Series(ac_pnls).rolling(window).mean(), label='Actor-Critic', alpha=0.8)
ax3.plot(pd.Series(a2c_pnls).rolling(window).mean(), label='A2C (GAE)', alpha=0.8)
ax3.axhline(y=0, color='red', linestyle='--', alpha=0.5)
ax3.set_xlabel('Episode')
ax3.set_ylabel('Total PnL')
ax3.set_title('Discrete Actions: PnL Comparison')
ax3.legend()
ax3.grid(True, alpha=0.3)

# Continuous Sharpe ratio
ax4 = axes[1, 1]
ax4.plot(pd.Series(cont_sharpes).rolling(window).mean(), color='purple', linewidth=2, label='Rolling mean')
ax4.axhline(y=0, color='red', linestyle='--', alpha=0.5)
ax4.axhline(y=1, color='blue', linestyle='--', alpha=0.5, label='Sharpe=1')
ax4.axhline(y=2, color='green', linestyle='--', alpha=0.5, label='Sharpe=2')
ax4.set_xlabel('Episode')
ax4.set_ylabel('Sharpe Ratio')
ax4.set_title('Continuous A2C: Sharpe Ratio')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Evaluate trained continuous agent
def evaluate_agent(env, agent, n_episodes: int = 10) -> dict:
    """Evaluate trained agent without exploration."""
    total_rewards = []
    total_pnls = []
    positions_history = []
    
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        episode_positions = []
        total_reward = 0
        
        while not done:
            with torch.no_grad():
                state_t = torch.FloatTensor(state).unsqueeze(0).to(device)
                mean, _ = agent.actor(state_t)
                action = mean.item()  # Use mean action (no sampling)
            
            next_state, reward, done, info = env.step(action)
            episode_positions.append(info['position'])
            total_reward += reward
            state = next_state
        
        total_rewards.append(total_reward)
        total_pnls.append(info['total_pnl'])
        positions_history.append(episode_positions)
    
    return {
        'mean_reward': np.mean(total_rewards),
        'std_reward': np.std(total_rewards),
        'mean_pnl': np.mean(total_pnls),
        'std_pnl': np.std(total_pnls),
        'positions': positions_history[-1]  # Last episode positions
    }


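# Evaluate
# Note: cont_env replays the price series used for training and the mean
# action is deterministic, so the episodes are effectively identical and
# the numbers below are in-sample.
eval_results = evaluate_agent(cont_env, cont_a2c_agent, n_episodes=10)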

print("Continuous A2C Evaluation Results")
print("=" * 40)
print(f"Mean Reward: {eval_results['mean_reward']:.4f} ± {eval_results['std_reward']:.4f}")
print(f"Mean PnL: {eval_results['mean_pnl']:.4f} ± {eval_results['std_pnl']:.4f}")

# Plot position distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

positions = eval_results['positions']
axes[0].plot(positions)
axes[0].axhline(y=0, color='red', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Step')
axes[0].set_ylabel('Position')
axes[0].set_title('Agent Position Over Time')
axes[0].set_ylim(-1.1, 1.1)
axes[0].grid(True, alpha=0.3)

axes[1].hist(positions, bins=30, edgecolor='black', alpha=0.7)
axes[1].axvline(x=0, color='red', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Position Distribution')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---
## Summary & Key Takeaways

### Algorithms Covered

| Algorithm | Update | Variance | Bias | Use Case |
|-----------|--------|----------|------|----------|
| **REINFORCE** | Episode end | High | None (unbiased Monte Carlo returns) | Simple, educational |
| **Actor-Critic** | Every step | Medium | Medium (bootstrapped critic) | Online learning |
| **A2C + GAE** | Batch | Low | Adjustable via $\lambda$ | Batched training with a tunable bias-variance trade-off |
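
The bias and variance entries for GAE can be made concrete with a few lines of plain Python. The sketch below is a standalone illustration rather than the notebook's `compute_gae`; the name `gae_advantages` and the toy rewards and values are assumptions. With $\lambda = 0$ each advantage reduces to the one-step TD error (low variance, more bias from the critic), and with $\lambda = 1$ it becomes the discounted return minus the baseline (less bias, more variance).

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float], next_value: float,
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation over one rollout (hypothetical helper)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Value of the following state; bootstrap from next_value at the end
        v_next = values[t + 1] if t + 1 < len(values) else next_value
        delta = rewards[t] + gamma * v_next - values[t]  # one-step TD error
        running = delta + gamma * lam * running          # discounted sum of TD errors
        advantages[t] = running
    return advantages


# Toy rollout (made-up numbers)
rewards = [1.0, 0.0, -1.0]
values = [0.5, 0.2, -0.1]

td_only = gae_advantages(rewards, values, next_value=0.0, gamma=0.9, lam=0.0)
mc_like = gae_advantages(rewards, values, next_value=0.0, gamma=0.9, lam=1.0)
print("lambda=0:", [round(a, 4) for a in td_only])
print("lambda=1:", [round(a, 4) for a in mc_like])
```

Intermediate values of `lam` interpolate between the two extremes, which is the sense in which the bias is adjustable.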

### Key Concepts

1. **Policy Gradient Theorem**: $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot Q(s,a)]$

2. **Variance Reduction**:
   - Baseline subtraction
   - Value function (Critic)
   - GAE ($\lambda$ parameter)

3. **Continuous Actions**:
   - Gaussian policy: $\pi(a|s) = \mathcal{N}(\mu(s), \sigma(s)^2)$
   - Natural for position sizing
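
For a one-dimensional Gaussian policy, the log-probability and entropy that enter the actor loss have closed forms. The following is a minimal sketch in plain Python, assuming a scalar action; the helper names and the example mean and standard deviation are illustrative and not part of the notebook.

```python
import math


def gaussian_log_prob(action: float, mean: float, std: float) -> float:
    """Log-density of N(mean, std^2) evaluated at `action`."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def gaussian_entropy(std: float) -> float:
    """Differential entropy of a 1-D Gaussian; it depends only on std."""
    return 0.5 + 0.5 * math.log(2.0 * math.pi) + math.log(std)


# Hypothetical policy output for one state: a long bias with moderate uncertainty
mean, std = 0.4, 0.2
for a in (0.4, 0.6, 1.0):
    print(f"action={a:+.1f}  log_prob={gaussian_log_prob(a, mean, std):+.3f}")
print(f"entropy={gaussian_entropy(std):.3f}")
```

Actions far from the mean receive a much lower log-probability, and a smaller `std` lowers the entropy, which is why the entropy bonus in the loss works against premature collapse of exploration.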

### Trading Applications

- **Position sizing** → Continuous actions
- **Risk management** → Reward shaping (Sharpe ratio)
- **Transaction costs** → Built into reward
- **Market regimes** → State representation
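
As an illustration of building transaction costs into the reward, here is a simplified sketch of a one-step reward: the return earned by the new position minus a cost proportional to the traded amount. It mirrors only the cost and holding-return terms of `ContinuousTradingEnv.step` and leaves out the realized-PnL term; the function name `step_reward` is an assumption.

```python
def step_reward(prev_position: float, target_position: float,
                price_now: float, price_next: float,
                transaction_cost: float = 0.001) -> float:
    """Return of the new position over one bar, net of trading cost."""
    target_position = max(-1.0, min(1.0, target_position))  # clip to [-1, 1]
    cost = abs(target_position - prev_position) * transaction_cost
    holding_return = target_position * (price_next - price_now) / price_now
    return holding_return - cost


# Flat -> fully long ahead of a 1% up-move: pays cost on 1 unit of position
flat_to_long = step_reward(0.0, 1.0, 100.0, 101.0)
# Holding an existing long position: no trade, no cost
hold_long = step_reward(1.0, 1.0, 100.0, 101.0)
# Long -> short flip: trades 2 units and is on the wrong side of the move
flip_short = step_reward(1.0, -1.0, 100.0, 101.0)
print(flat_to_long, hold_long, flip_short)
```

Because the cost scales with the size of the position change, a policy trained on this kind of reward is penalized for churning between long and short.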

### Next Steps

1. **PPO (Proximal Policy Optimization)** - More stable training
2. **SAC (Soft Actor-Critic)** - Maximum entropy RL
3. **Multi-asset trading** - Multiple action dimensions
4. **Risk-adjusted rewards** - CVaR, Sortino ratio

---

## Practice Exercises

### Exercise 1: Implement PPO
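Add the clipped surrogate objective to A2C, where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ is the probability ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clip range: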
$$L^{CLIP}(\theta) = \mathbb{E}[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]$$
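
As a starting point, the objective can be sketched per sample in plain Python before porting it to tensors. This is an illustrative sketch, not a full PPO implementation: `clipped_surrogate` is a hypothetical name, the ratio is computed from log-probabilities, and the old log-probabilities are assumed to be the ones recorded when the actions were sampled.

```python
import math
from typing import Sequence


def clipped_surrogate(new_log_probs: Sequence[float],
                      old_log_probs: Sequence[float],
                      advantages: Sequence[float],
                      epsilon: float = 0.2) -> float:
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # r_t = pi_new / pi_old
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Pessimistic choice: take the smaller of the two surrogate terms
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at 1 + epsilon
print(clipped_surrogate([math.log(1.5)], [0.0], [1.0]))
# Ratio 1.5 with a negative advantage: the penalty is not clipped
print(clipped_surrogate([math.log(1.5)], [0.0], [-1.0]))
```

In a gradient-based implementation the actor loss would be the negative of this quantity, with the same batch reused for several optimization epochs.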

### Exercise 2: Risk-Adjusted Reward
Modify the reward to include:
- Sharpe ratio component
- Drawdown penalty
- Position risk penalty
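
One possible shape for such a reward is sketched below in plain Python, covering the drawdown and position penalties only (a Sharpe component is left out). The function names and the penalty weights `dd_weight` and `pos_weight` are assumptions chosen for illustration and would need tuning.

```python
from typing import List


def drawdown(equity_curve: List[float]) -> float:
    """Fractional drop of the latest equity value from the running peak."""
    peak = max(equity_curve)
    return (peak - equity_curve[-1]) / peak


def risk_adjusted_reward(step_return: float, equity_curve: List[float],
                         position: float, dd_weight: float = 0.1,
                         pos_weight: float = 0.01) -> float:
    """Step return minus a drawdown penalty and a quadratic position penalty."""
    dd_penalty = dd_weight * drawdown(equity_curve)
    pos_penalty = pos_weight * position ** 2
    return step_return - dd_penalty - pos_penalty


# Equity rose to 1.10 and then fell 5% from the peak while fully long
equity = [1.0, 1.1, 1.045]
print(drawdown(equity))
print(risk_adjusted_reward(step_return=-0.05, equity_curve=equity, position=1.0))
```

The quadratic position term penalizes large exposures more than small ones, which nudges the policy toward partial positions when the signal is weak.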

### Exercise 3: Multi-Asset Trading
Extend to 2+ assets:
- Correlation features in state
- Portfolio-level position constraints
- Rebalancing costs
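
A portfolio-level constraint and a rebalancing cost could look like the sketch below. It assumes one target position per asset in [-1, 1] and a cap on gross exposure (the sum of absolute positions); `constrain_positions`, `rebalancing_cost`, and the cap of 1.0 are illustrative choices, not part of the notebook.

```python
from typing import List


def constrain_positions(raw_positions: List[float], max_gross: float = 1.0) -> List[float]:
    """Clip each position to [-1, 1], then scale down if gross exposure exceeds the cap."""
    clipped = [max(-1.0, min(1.0, p)) for p in raw_positions]
    gross = sum(abs(p) for p in clipped)
    if gross <= max_gross:
        return clipped
    scale = max_gross / gross
    return [p * scale for p in clipped]


def rebalancing_cost(old: List[float], new: List[float],
                     transaction_cost: float = 0.001) -> float:
    """Proportional cost on the total amount traded across all assets."""
    return transaction_cost * sum(abs(n - o) for o, n in zip(old, new))


# Three assets whose raw targets add up to 2.0 gross exposure
targets = constrain_positions([0.8, -0.6, 0.6])
print(targets)
print(rebalancing_cost([0.0, 0.0, 0.0], targets))
```

Scaling all positions by the same factor preserves the relative weights the policy asked for, which is simpler than projecting onto the constraint set but changes every position at once.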

### Exercise 4: Market Regime Detection
Add regime-aware features:
- Hidden Markov Model states
- Volatility regime indicators
- Trend strength metrics
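
Two of these features can be prototyped with the standard library before adding them to `_get_state`. The sketch below is one simple way to define them, not a canonical one: `volatility_regime` is the ratio of short-window to long-window return volatility, and `trend_strength` is the net move divided by the total path length. Window lengths and names are assumptions, and the Hidden Markov Model variant is not covered.

```python
import statistics
from typing import List


def volatility_regime(returns: List[float], short: int = 5, long: int = 20) -> float:
    """Short-window volatility relative to long-window volatility (> 1 means turbulent)."""
    short_vol = statistics.pstdev(returns[-short:])
    long_vol = statistics.pstdev(returns[-long:])
    return short_vol / (long_vol + 1e-8)


def trend_strength(returns: List[float]) -> float:
    """Net move divided by total path length: near +/-1 for a clean trend, near 0 for chop."""
    path = sum(abs(r) for r in returns)
    return sum(returns) / (path + 1e-8)


# Made-up return series: a calm stretch followed by a volatility burst
calm = [0.001, -0.001] * 10
turbulent = calm[:15] + [0.03, -0.03, 0.03, -0.03, 0.03]

print(volatility_regime(calm), volatility_regime(turbulent))
print(trend_strength([0.01] * 5), trend_strength([0.01, -0.01] * 3))
```

Both features are scale-free ratios, so they can be concatenated with the existing return and moving-average features without extra normalization.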