# Week 16: Reinforcement Learning Theory

## Learning Objectives
- Understand Markov Decision Processes (MDPs) as the mathematical framework for RL
- Master Bellman equations for value computation
- Implement Q-learning from scratch
- Understand policy gradient methods
- Learn Deep Q-Networks (DQN) and their innovations

---

## 1. Markov Decision Processes (MDPs)

### What is an MDP?

An MDP is a mathematical framework for modeling sequential decision-making problems. It consists of:

- **States (S)**: Set of all possible situations the agent can be in
- **Actions (A)**: Set of all possible actions the agent can take
- **Transition Function P(s'|s,a)**: Probability of reaching state s' from state s after taking action a
- **Reward Function R(s,a,s')**: Immediate reward received after transitioning
- **Discount Factor γ ∈ [0,1]**: How much future rewards are valued relative to immediate rewards; γ < 1 keeps the infinite-horizon return finite when rewards are bounded

### The Markov Property

The next state depends only on the current state and action, not on the history that led there:

$$P(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1}|s_t, a_t)$$

### Trading as an MDP

| MDP Component | Trading Context |
|---------------|----------------|
| State | Portfolio holdings, prices, indicators, market regime |
| Action | Buy, sell, hold (continuous: position sizes) |
| Transition | Market dynamics (partially observable, stochastic) |
| Reward | PnL, risk-adjusted returns, Sharpe ratio |
| Discount | Time preference, risk aversion |

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, Dict, List
np.random.seed(42)

In [None]:
class SimpleTradingMDP:
    """
    Simple trading environment as an MDP.
    States: Price relative to moving average (low, neutral, high)
    Actions: Buy (0), Hold (1), Sell (2)
    """
    
    def __init__(self):
        self.states = ['low', 'neutral', 'high']  # Price vs MA
        self.actions = ['buy', 'hold', 'sell']
        self.n_states = len(self.states)
        self.n_actions = len(self.actions)
        
        # Transition probabilities P(s'|s) - simplified: identical for every action,
        # because a small price-taking trader does not move the market state
        self.transitions = {
            'low': {'low': 0.3, 'neutral': 0.5, 'high': 0.2},
            'neutral': {'low': 0.25, 'neutral': 0.5, 'high': 0.25},
            'high': {'low': 0.2, 'neutral': 0.5, 'high': 0.3}
        }
        
        # Reward structure (mean reversion strategy)
        # Buy when low is profitable, sell when high is profitable
        self.rewards = {
            ('low', 'buy'): 1.0,      # Good: buy low
            ('low', 'hold'): 0.0,
            ('low', 'sell'): -0.5,    # Bad: sell low
            ('neutral', 'buy'): 0.0,
            ('neutral', 'hold'): 0.1, # Small reward for patience
            ('neutral', 'sell'): 0.0,
            ('high', 'buy'): -0.5,    # Bad: buy high
            ('high', 'hold'): 0.0,
            ('high', 'sell'): 1.0,    # Good: sell high
        }
        
        self.current_state = 'neutral'
    
    def reset(self) -> str:
        self.current_state = np.random.choice(self.states)
        return self.current_state
    
    def step(self, action: str) -> Tuple[str, float, bool]:
        """Take action, return (next_state, reward, done)"""
        reward = self.rewards[(self.current_state, action)]
        
        # Sample next state from transition probabilities
        probs = self.transitions[self.current_state]
        next_state = np.random.choice(
            list(probs.keys()),
            p=list(probs.values())
        )
        
        self.current_state = next_state
        return next_state, reward, False

# Test the MDP
env = SimpleTradingMDP()
state = env.reset()
print(f"Initial state: {state}")

for i in range(5):
    action = np.random.choice(env.actions)
    next_state, reward, done = env.step(action)
    print(f"Action: {action:6s} -> State: {next_state:8s}, Reward: {reward:+.1f}")

---

## 2. Bellman Equations

### Value Functions

**State Value Function V(s)**: Expected return starting from state s, following policy π

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} | S_0 = s\right]$$

**Action Value Function Q(s,a)**: Expected return starting from state s, taking action a, then following policy π

$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1} | S_0 = s, A_0 = a\right]$$
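
The discounted sum inside both expectations can be computed for a single sampled trajectory. The sketch below is illustrative only: the helper name `discounted_return` and the example rewards are assumptions, and `rewards[t]` plays the role of $R_{t+1}$ in the formulas above.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    G = 0.0
    # Work backwards so each step is G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        G = r + gamma * G
    return G


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards from one episode
for gamma in (0.0, 0.9, 1.0):
    print(f"gamma={gamma:.1f} -> G_0 = {discounted_return(rewards, gamma):.3f}")
```

With γ = 0 only the first reward counts; with γ = 1 the return is the plain sum. Averaging this quantity over many trajectories that start in s gives a Monte Carlo estimate of V(s).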

### Bellman Expectation Equation

The value of a state equals the expected immediate reward plus the discounted expected value of the next state:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]$$

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')]$$
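
As a rough illustration, the expectation equation for V can be applied repeatedly as an update rule until the values stop changing (iterative policy evaluation). The sketch below copies the dynamics and rewards of `SimpleTradingMDP` into plain dictionaries and evaluates a uniform random policy. The names `P`, `R`, `evaluate_policy`, and `uniform` are assumptions made for this example, and the rewards depend only on (s, a) as in the class above.

```python
STATES = ["low", "neutral", "high"]
ACTIONS = ["buy", "hold", "sell"]

# P[s][s'] (action-independent) and R[(s, a)], as in SimpleTradingMDP
P = {
    "low": {"low": 0.3, "neutral": 0.5, "high": 0.2},
    "neutral": {"low": 0.25, "neutral": 0.5, "high": 0.25},
    "high": {"low": 0.2, "neutral": 0.5, "high": 0.3},
}
R = {
    ("low", "buy"): 1.0, ("low", "hold"): 0.0, ("low", "sell"): -0.5,
    ("neutral", "buy"): 0.0, ("neutral", "hold"): 0.1, ("neutral", "sell"): 0.0,
    ("high", "buy"): -0.5, ("high", "hold"): 0.0, ("high", "sell"): 1.0,
}


def evaluate_policy(policy, gamma=0.9, tol=1e-10):
    """Apply the Bellman expectation backup until V changes by less than tol."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # sum_a pi(a|s) R(s,a): expected immediate reward under the policy
            expected_reward = sum(policy[s][a] * R[(s, a)] for a in ACTIONS)
            # sum_s' P(s'|s) V(s'): expected value of the successor state
            expected_next = sum(p * V[s2] for s2, p in P[s].items())
            new_v = expected_reward + gamma * expected_next
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V


uniform = {s: {a: 1.0 / 3.0 for a in ACTIONS} for s in STATES}
V_uniform = evaluate_policy(uniform)
for s in STATES:
    print(f"V_uniform({s}) = {V_uniform[s]:.3f}")
```

Because the transitions and rewards are symmetric between `low` and `high`, the two states should receive the same value under the uniform policy.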

### Bellman Optimality Equation

For optimal policy π*:

$$V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$$

$$Q^*(s,a) = \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')]$$

### Key Insight

The Bellman equations express a **recursive relationship**: the value of a state depends on the values of successor states. This enables **dynamic programming** solutions.

In [None]:
def value_iteration(env: SimpleTradingMDP, gamma: float = 0.9, 
                    theta: float = 1e-6) -> Tuple[Dict, Dict]:
    """
    Value Iteration: Find optimal value function and policy.
    Uses Bellman optimality equation iteratively.
    """
    # Initialize V(s) = 0 for all states
    V = {s: 0.0 for s in env.states}
    
    iteration = 0
    while True:
        delta = 0
        
        for s in env.states:
            v = V[s]
            
            # Bellman optimality: V(s) = max_a [R(s,a) + γ Σ P(s'|s,a)V(s')]
            action_values = []
            for a in env.actions:
                # Expected value for action a
                expected_value = env.rewards[(s, a)]  # Immediate reward
                for s_next, prob in env.transitions[s].items():
                    expected_value += gamma * prob * V[s_next]
                action_values.append(expected_value)
            
            V[s] = max(action_values)
            delta = max(delta, abs(v - V[s]))
        
        iteration += 1
        if delta < theta:
            break
    
    # Extract optimal policy
    policy = {}
    for s in env.states:
        action_values = []
        for a in env.actions:
            expected_value = env.rewards[(s, a)]
            for s_next, prob in env.transitions[s].items():
                expected_value += gamma * prob * V[s_next]
            action_values.append(expected_value)
        policy[s] = env.actions[np.argmax(action_values)]
    
    print(f"Value Iteration converged in {iteration} iterations")
    return V, policy

# Run value iteration
V_optimal, optimal_policy = value_iteration(env)

print("\nOptimal Value Function:")
for s, v in V_optimal.items():
    print(f"  V({s:8s}) = {v:.3f}")

print("\nOptimal Policy:")
for s, a in optimal_policy.items():
    print(f"  π({s:8s}) = {a}")

---

## 3. Q-Learning

### Model-Free Learning

Q-learning learns optimal Q-values **without knowing the MDP dynamics** (transition probabilities). It learns from experience (samples).

### Q-Learning Update Rule

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$

Where:
- $\alpha$ = learning rate
- $r + \gamma \max_{a'} Q(s',a')$ = TD target (bootstrapped estimate)
- $r + \gamma \max_{a'} Q(s',a') - Q(s,a)$ = TD error
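
A single update can be worked through by hand. The sketch below uses made-up numbers and a hypothetical helper `q_update`; it only mirrors the rule above and is not part of the agent class.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update; return (new Q(s,a), TD error)."""
    td_target = reward + gamma * max(next_q_values)  # bootstrapped estimate
    td_error = td_target - q_sa
    return q_sa + alpha * td_error, td_error


# Hypothetical transition: Q(s,a) = 0.5, r = 1.0, Q(s', .) = [0.2, 0.8, 0.4]
new_q, td_error = q_update(q_sa=0.5, reward=1.0, next_q_values=[0.2, 0.8, 0.4])
print(f"TD error = {td_error:.3f}, updated Q(s,a) = {new_q:.3f}")
```

Here the TD target is 1.0 + 0.9 × 0.8 = 1.72, so the estimate moves one tenth of the way from 0.5 toward 1.72.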

### Key Properties

1. **Off-policy**: Learns about the optimal (greedy) policy while following an exploratory behavior policy
2. **Temporal Difference**: Updates based on difference between consecutive estimates
3. **Bootstrapping**: Uses current Q estimates to update Q estimates

### Exploration vs Exploitation

**ε-greedy policy**:
- With probability ε: take random action (explore)
- With probability 1-ε: take best action according to Q (exploit)

In [None]:
class QLearningAgent:
    """
    Tabular Q-Learning Agent.
    """
    
    def __init__(self, env: SimpleTradingMDP, alpha: float = 0.1,
                 gamma: float = 0.9, epsilon: float = 0.1):
        self.env = env
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration rate
        
        # Initialize Q-table
        self.Q = {}
        for s in env.states:
            for a in env.actions:
                self.Q[(s, a)] = 0.0
    
    def get_action(self, state: str, training: bool = True) -> str:
        """ε-greedy action selection."""
        if training and np.random.random() < self.epsilon:
            return np.random.choice(self.env.actions)
        else:
            # Greedy: select action with highest Q-value
            q_values = [self.Q[(state, a)] for a in self.env.actions]
            return self.env.actions[np.argmax(q_values)]
    
    def update(self, state: str, action: str, reward: float, 
               next_state: str) -> float:
        """
        Q-learning update.
        Returns TD error for monitoring.
        """
        # Current Q-value
        current_q = self.Q[(state, action)]
        
        # TD target: r + γ max_a' Q(s', a')
        max_next_q = max(self.Q[(next_state, a)] for a in self.env.actions)
        td_target = reward + self.gamma * max_next_q
        
        # TD error
        td_error = td_target - current_q
        
        # Update Q-value
        self.Q[(state, action)] += self.alpha * td_error
        
        return td_error
    
    def get_policy(self) -> Dict[str, str]:
        """Extract greedy policy from Q-table."""
        policy = {}
        for s in self.env.states:
            q_values = [self.Q[(s, a)] for a in self.env.actions]
            policy[s] = self.env.actions[np.argmax(q_values)]
        return policy

In [None]:
def train_q_learning(env: SimpleTradingMDP, n_episodes: int = 1000,
                     max_steps: int = 100) -> Tuple[QLearningAgent, List]:
    """
    Train Q-learning agent.
    """
    agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.2)
    episode_rewards = []
    
    for episode in range(n_episodes):
        state = env.reset()
        total_reward = 0
        
        for step in range(max_steps):
            action = agent.get_action(state, training=True)
            next_state, reward, done = env.step(action)
            
            agent.update(state, action, reward, next_state)
            
            total_reward += reward
            state = next_state
            
            if done:
                break
        
        episode_rewards.append(total_reward)
        
        # Decay epsilon
        agent.epsilon = max(0.01, agent.epsilon * 0.995)
    
    return agent, episode_rewards

# Train the agent
agent, rewards = train_q_learning(env, n_episodes=1000)

# Plot learning curve
plt.figure(figsize=(10, 4))
window = 50
smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
plt.plot(smoothed)
plt.xlabel('Episode')
plt.ylabel('Total Reward (smoothed)')
plt.title('Q-Learning Training Progress')
plt.grid(True, alpha=0.3)
plt.show()

# Show learned Q-values and policy
print("\nLearned Q-values:")
for s in env.states:
    q_str = ", ".join([f"{a}:{agent.Q[(s,a)]:+.2f}" for a in env.actions])
    print(f"  {s:8s}: {q_str}")

print("\nLearned Policy:")
learned_policy = agent.get_policy()
for s, a in learned_policy.items():
    print(f"  π({s:8s}) = {a}")

print("\nCompare with optimal policy:")
for s in env.states:
    match = "✓" if learned_policy[s] == optimal_policy[s] else "✗"
    print(f"  {s:8s}: learned={learned_policy[s]:6s}, optimal={optimal_policy[s]:6s} {match}")

---

## 4. Policy Gradient Methods

### Why Policy Gradients?

Value-based methods (like Q-learning) have limitations:
- Struggle with **continuous action spaces**
- Can't represent **stochastic policies** naturally
- Small changes in Q can cause large policy changes

**Policy gradients** directly parameterize and optimize the policy.

### Policy Parameterization

The policy $\pi_\theta(a|s)$ is a parameterized function (often a neural network) with parameters θ:

$$\pi_\theta(a|s) = P(A=a | S=s, \theta)$$

### Objective: Maximize Expected Return

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

### Policy Gradient Theorem

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$$

### REINFORCE Algorithm

Monte Carlo policy gradient using returns G_t as estimate of Q:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$

### Variance Reduction: Baseline

Subtract a baseline b(s) to reduce variance:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q(s,a) - b(s))\right]$$

Common baseline: Value function V(s), giving **Advantage**: A(s,a) = Q(s,a) - V(s)
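
The effect of a baseline can be checked exactly on a tiny example. The sketch below assumes a one-state problem (a bandit) with a softmax policy over three actions and deterministic, made-up rewards; `gradient_moments` is a hypothetical helper that enumerates all actions instead of sampling, so the mean and variance of the estimator are exact.

```python
def gradient_moments(probs, rewards, baseline):
    """Exact mean and total variance of g(a) = grad log pi(a) * (r(a) - b).

    Assumes a softmax policy over logits theta, for which
    grad_theta log pi(a) = one_hot(a) - probs.
    """
    n = len(probs)
    samples = []
    for a in range(n):
        score = [(1.0 if i == a else 0.0) - probs[i] for i in range(n)]
        samples.append([s * (rewards[a] - baseline) for s in score])
    # Expectation over a ~ pi of each gradient component
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    # Total variance = E[||g||^2] - ||E[g]||^2
    second_moment = sum(probs[a] * sum(x * x for x in samples[a]) for a in range(n))
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


probs = [1.0 / 3.0] * 3        # uniform softmax policy (all logits equal)
rewards = [10.0, 11.0, 12.0]   # hypothetical per-action rewards
mean_plain, var_plain = gradient_moments(probs, rewards, baseline=0.0)
mean_base, var_base = gradient_moments(probs, rewards, baseline=11.0)
print("mean gradient, no baseline:  ", [round(m, 4) for m in mean_plain])
print("mean gradient, with baseline:", [round(m, 4) for m in mean_base])
print(f"variance: {var_plain:.3f} (no baseline) vs {var_base:.3f} (baseline = mean reward)")
```

The expected gradient is the same in both cases, because the score function has zero mean under the policy, while the variance drops sharply once the large common offset in the rewards is subtracted.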

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """
    Simple policy network for discrete actions.
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.network(state)
    
    def get_action(self, state: torch.Tensor) -> Tuple[int, torch.Tensor]:
        """Sample action and return log probability."""
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)


class REINFORCEAgent:
    """
    REINFORCE (Monte Carlo Policy Gradient) Agent.
    """
    
    def __init__(self, state_dim: int, action_dim: int, 
                 lr: float = 0.01, gamma: float = 0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        # Episode storage
        self.log_probs = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action, log_prob = self.policy.get_action(state_tensor)
        self.log_probs.append(log_prob)
        return action
    
    def store_reward(self, reward: float):
        self.rewards.append(reward)
    
    def update(self) -> float:
        """
        Update policy using REINFORCE.
        Returns the loss value.
        """
        # Calculate discounted returns
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        
        # Force a float dtype so mean()/std() also work when rewards are ints
        returns = torch.tensor(returns, dtype=torch.float32)
        
        # Normalize returns (baseline: mean)
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        # Policy gradient loss: -log(π) * G
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        loss = torch.stack(policy_loss).sum()
        
        # Gradient descent
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        self.log_probs = []
        self.rewards = []
        
        return loss.item()

print("REINFORCE Agent initialized successfully")
print("\nKey components:")
print("1. Policy network: state -> action probabilities")
print("2. Log probability tracking for gradient computation")
print("3. Return calculation with discounting")
print("4. Policy gradient loss: L = -Σ log π(a|s) * G (minimizing L ascends J)")

---

## 5. Deep Q-Networks (DQN)

### Scaling Q-Learning with Neural Networks

Tabular Q-learning doesn't scale to large/continuous state spaces. **DQN** uses a neural network to approximate Q-values:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

### Challenge: Training Instability

Naively combining a neural network with Q-learning is unstable because of:
1. **Correlated samples**: Sequential data breaks i.i.d. assumption
2. **Non-stationary targets**: Q-targets change as network updates
3. **Overestimation**: max operator causes positive bias
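
The overestimation effect in item 3 can be seen in a small simulation. The sketch below assumes every action has a true value of 0 and that the estimates carry independent Gaussian noise; the function name and its parameters are illustrative assumptions.

```python
import random


def mean_max_of_noisy_estimates(n_actions=3, noise_std=1.0, n_trials=20000, seed=0):
    """Average of max_a Q_hat(a) when every true Q(a) is 0 and Q_hat is noisy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / n_trials


bias = mean_max_of_noisy_estimates()
print(f"true max Q = 0.0, average estimated max = {bias:.3f}")
```

Although each individual estimate is unbiased, the maximum over them is positive on average, and bootstrapping then propagates that bias into the TD targets.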

### DQN Innovations

#### 1. Experience Replay

Store transitions $(s, a, r, s')$ in replay buffer. Sample **random mini-batches** for training:
- Breaks correlation between consecutive samples
- Improves sample efficiency (reuse data)

#### 2. Target Network

Use separate network $\theta^-$ for computing TD targets:

$$y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$$

Update target network periodically: $\theta^- \leftarrow \theta$

This stabilizes training by keeping targets fixed between updates.

### DQN Loss Function

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(y - Q(s,a;\theta)\right)^2\right]$$

where $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$

In [None]:
from collections import deque
import random

class ReplayBuffer:
    """
    Experience Replay Buffer.
    Stores transitions and samples random mini-batches.
    """
    
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        """Store a transition."""
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size: int) -> Tuple:
        """Sample a random mini-batch."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
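        # Convert to tensors with explicit dtypes (done flags become 0.0/1.0)
        return (
            torch.as_tensor(np.array(states), dtype=torch.float32),
            torch.as_tensor(actions, dtype=torch.long),
            torch.as_tensor(rewards, dtype=torch.float32),
            torch.as_tensor(np.array(next_states), dtype=torch.float32),
            torch.as_tensor(dones, dtype=torch.float32)
        )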
    
    def __len__(self):
        return len(self.buffer)


class QNetwork(nn.Module):
    """
    Q-Network: maps state to Q-values for all actions.
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.network(state)


class DQNAgent:
    """
    Deep Q-Network Agent with experience replay and target network.
    """
    
    def __init__(self, state_dim: int, action_dim: int, lr: float = 1e-3,
                 gamma: float = 0.99, epsilon: float = 1.0,
                 epsilon_decay: float = 0.995, epsilon_min: float = 0.01,
                 buffer_size: int = 10000, batch_size: int = 64,
                 target_update_freq: int = 100):
        
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size
        self.target_update_freq = target_update_freq
        self.update_counter = 0
        
        # Q-Networks
        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.replay_buffer = ReplayBuffer(buffer_size)
    
    def select_action(self, state: np.ndarray) -> int:
        """ε-greedy action selection."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.action_dim)
        
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            q_values = self.q_network(state_tensor)
        return q_values.argmax().item()
    
    def store_transition(self, state, action, reward, next_state, done):
        """Store transition in replay buffer."""
        self.replay_buffer.push(state, action, reward, next_state, done)
    
    def update(self) -> float:
        """
        Update Q-network using a mini-batch from replay buffer.
        Returns the loss value.
        """
        if len(self.replay_buffer) < self.batch_size:
            return 0.0
        
        # Sample mini-batch
        states, actions, rewards, next_states, dones = \
            self.replay_buffer.sample(self.batch_size)
        
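        # Current Q-values for the actions actually taken, shape (batch,)
        # squeeze(1) keeps the batch dimension even when batch_size == 1
        current_q = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Target Q-values (using target network); (1 - dones) removes the
        # bootstrap term for terminal transitions
        with torch.no_grad():
            max_next_q = self.target_network(next_states).max(1)[0]
            target_q = rewards + self.gamma * max_next_q * (1 - dones)

        # MSE loss between predicted and target Q-values (same shape)
        loss = nn.MSELoss()(current_q, target_q)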
        
        # Gradient descent
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Update target network periodically
        self.update_counter += 1
        if self.update_counter % self.target_update_freq == 0:
            self.target_network.load_state_dict(self.q_network.state_dict())
        
        # Decay epsilon
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        
        return loss.item()

print("DQN Agent initialized successfully")
print("\nKey components:")
print("1. Q-Network: state -> Q(s,a) for all actions")
print("2. Target Network: separate network for stable TD targets")
print("3. Replay Buffer: stores transitions, samples mini-batches")
print("4. ε-greedy exploration with decay")

In [None]:
# Demonstrate DQN training loop structure

def dqn_training_pseudocode():
    """
    DQN Training Algorithm (pseudocode with real structure)
    """
    print("="*60)
    print("DQN TRAINING ALGORITHM")
    print("="*60)
    print("""
    Initialize:
        Q-network with random weights θ
        Target network with weights θ⁻ = θ
        Replay buffer D with capacity N
    
    For each episode:
        Initialize state s
        
        For each step:
            # Action Selection (ε-greedy)
            With probability ε: a = random action
            Otherwise: a = argmax_a Q(s, a; θ)
            
            # Environment Interaction
            Execute action a, observe r, s'
            Store transition (s, a, r, s') in D
            
            # Learning (if enough samples)
            Sample random minibatch from D
            
            For each transition (sⱼ, aⱼ, rⱼ, s'ⱼ):
                if s'ⱼ is terminal:
                    yⱼ = rⱼ
                else:
                    yⱼ = rⱼ + γ max_a' Q(s'ⱼ, a'; θ⁻)  # Target network!
            
            # Gradient descent on (yⱼ - Q(sⱼ, aⱼ; θ))²
            
            # Periodic target network update
            Every C steps: θ⁻ ← θ
            
            s ← s'
    """)
    print("="*60)

dqn_training_pseudocode()

---

## 6. DQN Extensions

### Double DQN

**Problem**: Standard DQN overestimates Q-values (max over noisy estimates is biased).

**Solution**: Decouple action selection from evaluation:

$$y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)$$

Use online network to **select** best action, target network to **evaluate** it.
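
For a single transition, the two targets can be compared directly. This is a minimal sketch with hypothetical Q-value lists standing in for network outputs; `dqn_target` and `double_dqn_target` are names invented for the example.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: online network selects the action, target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda i: q_online_next[i])
    return reward + gamma * q_target_next[best_action]


# Hypothetical Q(s', .) from the two networks
q_online_next = [1.0, 2.0, 0.5]   # online network prefers action 1
q_target_next = [3.0, 1.5, 0.0]   # target network's largest value is action 0

y_dqn = dqn_target(1.0, False, q_target_next, gamma=0.9)
y_ddqn = double_dqn_target(1.0, False, q_online_next, q_target_next, gamma=0.9)
print(f"DQN target: {y_dqn:.2f}, Double DQN target: {y_ddqn:.2f}")
```

When the two networks disagree about the best action, the Double DQN target is never larger than the standard one, which is what dampens the upward bias.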

### Dueling DQN

**Insight**: Some states are valuable regardless of action taken.

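**Architecture**: Separate streams estimate the state value V(s) and the advantages A(s,a). Subtracting the mean advantage over the action set $\mathcal{A}$ makes the decomposition identifiable:

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$$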
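
The aggregation step can be sketched without any network. `dueling_q` below is a hypothetical helper that takes the two stream outputs for one state as plain numbers.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, .) into Q(s, .) using the mean-subtracted form."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


print(dueling_q(2.0, [1.0, 0.0, -1.0]))   # [3.0, 2.0, 1.0]
# Shifting every advantage by the same constant leaves Q unchanged
print(dueling_q(2.0, [6.0, 5.0, 4.0]))    # [3.0, 2.0, 1.0]
```

The second call shows why the mean is subtracted: without it, a constant could move freely between the value stream and the advantage stream.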

### Prioritized Experience Replay

**Insight**: Not all transitions are equally informative.

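**Solution**: Sample transitions with probability proportional to a power of the absolute TD error, where α controls how strongly large errors are favored (α = 0 recovers uniform sampling):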

$$P(i) \propto |\delta_i|^\alpha$$

Higher TD error = more surprising = more to learn.
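
A minimal sketch of the sampling probabilities follows. Two details are assumptions added for this example and do not appear in the text above: a small constant `eps` so that zero-error transitions can still be sampled, and importance-sampling weights with exponent `beta`, which are commonly used to correct the bias introduced by non-uniform sampling.

```python
def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i)) ** (-beta), scaled so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


td_errors = [0.1, 1.0, 2.0]  # hypothetical TD errors of three stored transitions
probs = priority_probabilities(td_errors)
weights = importance_weights(probs)
for d, p, w in zip(td_errors, probs, weights):
    print(f"|delta|={d:.1f}  P(i)={p:.3f}  weight={w:.3f}")
```

Transitions with larger TD error are sampled more often but receive smaller weights in the loss, so frequently replayed samples do not dominate the gradient.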

### Rainbow DQN

Combines six extensions to DQN in a single agent:
- Double DQN
- Dueling architecture
- Prioritized replay
- Multi-step learning
- Distributional RL
- Noisy networks

---

## 7. RL in Finance: Considerations

### Challenges

| Challenge | Description | Mitigation |
|-----------|-------------|------------|
| **Non-stationarity** | Markets change over time | Continuous retraining, regime detection |
| **Partial observability** | Can't observe all market state | Use LSTM/attention, add features |
| **High noise** | Low signal-to-noise ratio | Robust reward design, regularization |
| **Sample efficiency** | Limited historical data | Transfer learning, simulation |
| **Transaction costs** | Real costs eat into profits | Include in reward, action penalties |
| **Market impact** | Large trades move prices | Model impact, constrain actions |

### Reward Design

```python
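# All names below are placeholders for per-step quantities.

# Simple PnL
reward = position * returns

# Risk-adjusted (mean-variance penalty, Sharpe-like); lam = risk aversion
reward = position * returns - lam * position**2 * variance

# With transaction costs
reward = position * returns - cost * abs(delta_position)

# Differential Sharpe ratio; A_prev and B_prev are exponential moving
# averages of the strategy return and its square from the previous step
strategy_return = position * returns
delta_A = strategy_return - A_prev
delta_B = strategy_return**2 - B_prev
reward = (B_prev * delta_A - 0.5 * A_prev * delta_B) / (B_prev - A_prev**2) ** 1.5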
```
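
As a runnable illustration of the transaction-cost variant, the sketch below computes per-step rewards for a short, made-up sequence of positions and returns. The function name, the flat starting position, and the convention that `positions[t]` is held while `returns[t]` is realized are assumptions for this example.

```python
def step_rewards(positions, returns, cost_rate, initial_position=0.0):
    """Per-step reward = position * return - cost_rate * |change in position|."""
    rewards = []
    prev = initial_position
    for position, ret in zip(positions, returns):
        pnl = position * ret
        trading_cost = cost_rate * abs(position - prev)
        rewards.append(pnl - trading_cost)
        prev = position
    return rewards


positions = [1.0, 1.0, -1.0]      # long, stay long, flip to short
returns = [0.02, -0.01, 0.03]     # hypothetical asset returns
rewards = step_rewards(positions, returns, cost_rate=0.001)
print([round(r, 4) for r in rewards], "total:", round(sum(rewards), 4))
```

Flipping from long to short changes the position by two units, so that step pays twice the cost of opening the initial position.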

### State Representation

```python
state = [
    # Price features
    returns_1d, returns_5d, returns_20d,
    volatility_10d, volatility_30d,
    
    # Technical indicators
    rsi, macd, bollinger_position,
    
    # Portfolio state
    current_position, unrealized_pnl,
    
    # Market context
    vix_level, sector_momentum
]
```

---

## 8. Summary: Algorithm Comparison

| Method | Type | Strengths | Weaknesses | Use Case |
|--------|------|-----------|------------|----------|
| **Q-Learning** | Value-based, tabular | Simple; converges to Q* if every state-action pair is visited infinitely often and the learning rate decays appropriately | Doesn't scale | Small discrete problems |
| **DQN** | Value-based, deep | Handles high-dim states | Discrete actions only | Image input, discrete actions |
| **REINFORCE** | Policy gradient | Continuous actions, stochastic | High variance | Simple continuous control |
| **Actor-Critic** | Hybrid | Lower variance than PG | More complex | General purpose |
| **PPO** | Policy gradient | Stable, easy to tune | Sample inefficient | General deep RL |
| **SAC** | Off-policy AC | Sample efficient, robust | Complex | Continuous control, robotics |

### Key Equations to Remember

1. **Bellman Optimality**: $V^*(s) = \max_a \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^*(s')]$

2. **Q-Learning Update**: $Q(s,a) \leftarrow Q(s,a) + \alpha[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$

3. **Policy Gradient (REINFORCE estimate)**: $\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t$

4. **DQN Target**: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$

---

## Practice Problems

### Conceptual

1. Why does Q-learning converge to optimal Q* even with ε-greedy (off-policy)?

2. Explain why experience replay helps stabilize DQN training.

3. What's the difference between on-policy (SARSA) and off-policy (Q-learning)?

4. Why do policy gradients have high variance? How does a baseline help?

### Coding Challenges

1. Implement Double DQN by modifying the DQNAgent class.

2. Add prioritized experience replay to the ReplayBuffer.

3. Implement SARSA and compare with Q-learning on the trading MDP.

4. Create a simple trading environment with continuous position sizing.

---

## Next Steps

- **Day 01-03**: Review this theory, run all code cells
- **Day 04**: Implement DQN for simple trading (see Day_04_DQN.ipynb)
- **Day 05**: Policy gradients and Actor-Critic methods
- **Day 06-07**: Advanced: PPO/SAC for portfolio optimization