# **CHAPTER 17: REINFORCEMENT LEARNING (RL)**

*Learning Through Interaction*

## **Chapter Overview**

Reinforcement Learning is the paradigm of learning through trial and error: an agent interacts with an environment and learns to maximize cumulative reward. From mastering games like Go and StarCraft to aligning Large Language Models via RLHF, RL enables autonomous decision-making in complex, dynamic environments. This chapter builds from Markov Decision Processes to widely used algorithms such as DQN and PPO, culminating in the RLHF techniques that align modern AI systems with human intent.

**Estimated Time:** 60-70 hours (4-5 weeks)  
**Prerequisites:** Chapters 10-11 (Neural networks, Optimization), Chapter 15 (LLMs, for RLHF connection)

---

## **17.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Formulate problems as Markov Decision Processes (MDPs) and derive Bellman equations for optimality
2. Implement value-based methods (DQN, Double DQN, Dueling DQN) with experience replay and target networks
3. Apply policy gradient methods (REINFORCE, Actor-Critic) and understand the bias-variance tradeoff in gradients
4. Train agents using Proximal Policy Optimization (PPO), a widely used algorithm for continuous control and LLM alignment
5. Understand Model-Based RL and planning with learned dynamics
6. Implement Multi-Agent RL systems for competitive and cooperative scenarios
7. Apply RLHF (Reinforcement Learning from Human Feedback) to align language models with human preferences

---

## **17.1 Foundations: The Markov Decision Process**

#### **17.1.1 Formal Definition**

An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$:

- **$\mathcal{S}$:** Set of states (observations of the environment)
- **$\mathcal{A}$:** Set of actions available to the agent
- **$\mathcal{P}(s'|s,a)$:** Transition dynamics (probability of next state given current state and action)
- **$\mathcal{R}(s,a,s')$:** Reward function (immediate feedback)
- **$\gamma \in [0,1]$:** Discount factor (importance of future rewards vs immediate)

**Markov Property:** The future is independent of the past given the present: $P(s_{t+1}|s_t, a_t) = P(s_{t+1}|s_0, a_0, ..., s_t, a_t)$.

#### **17.1.2 Policies and Value Functions**

**Policy $\pi(a|s)$:** Probability distribution over actions given state. Can be deterministic ($a = \pi(s)$) or stochastic.

**State Value Function $V^\pi(s)$:** Expected cumulative reward starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

**Action Value Function (Q-Function) $Q^\pi(s,a)$:** Expected cumulative reward starting from state $s$, taking action $a$, then following $\pi$:
$$Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

**Relationship:** $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$

#### **17.1.3 Bellman Equations**

The recursive structure of value functions:

**Bellman Expectation Equation:**
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} \mathcal{P}(s'|s,a) [\mathcal{R}(s,a,s') + \gamma V^\pi(s')]$$

**Bellman Optimality Equation:**
$$V^*(s) = \max_a \sum_{s'} \mathcal{P}(s'|s,a) [\mathcal{R}(s,a,s') + \gamma V^*(s')]$$

$$Q^*(s,a) = \sum_{s'} \mathcal{P}(s'|s,a) [\mathcal{R}(s,a,s') + \gamma \max_{a'} Q^*(s',a')]$$

The optimal policy $\pi^*$ is greedy with respect to $Q^*$: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
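The Bellman optimality equation can be turned into an algorithm by applying it repeatedly as an update rule (value iteration). The sketch below does this on a made-up MDP with two states and two actions; the transition table `P`, the rewards, and `gamma = 0.9` are illustrative assumptions, not values from this chapter.

```python
# Value iteration on a tiny, hypothetical MDP (2 states, 2 actions).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)],                   # stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},   # try to move to state 1
    1: {0: [(1.0, 1, 2.0)],                   # stay in state 1, reward 2
        1: [(1.0, 0, 0.0)]},                  # go back to state 0
}
gamma = 0.9

def q_from_v(V, s, a):
    """One-step lookahead: sum over s' of P(s'|s,a) * [R + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def value_iteration(tol=1e-10, max_iters=10_000):
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        # Bellman optimality backup for every state
        new_V = {s: max(q_from_v(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values
    policy = {s: max(P[s], key=lambda a: q_from_v(V, s, a)) for s in P}
    return V, policy

V_star, pi_star = value_iteration()
print(V_star, pi_star)
```

For this toy problem the fixed point can be checked by hand: staying in state 1 forever yields $2 / (1 - 0.9) = 20$, and the greedy policy moves from state 0 to state 1 and then stays there.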

---

## **17.2 Model-Free Value-Based Methods**

When the transition dynamics $\mathcal{P}$ are unknown, we learn from experience.

#### **17.2.1 Q-Learning (Off-Policy)**

Update Q-values using temporal difference (TD) learning:

$$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$$

**Off-policy:** Can learn about optimal policy while following exploratory policy (e.g., $\epsilon$-greedy).

**Exploration vs Exploitation:**
- **$\epsilon$-greedy:** With probability $\epsilon$ take random action, else greedy.
- **Boltzmann:** Sample from softmax of Q-values: $\pi(a|s) \propto \exp(Q(s,a)/\tau)$
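The update rule and both exploration strategies can be sketched in a few lines of plain Python. The five-state chain environment, the `step` helper, and all hyperparameters below are hypothetical choices made only for illustration.

```python
import math
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def boltzmann(q_row, tau, rng):
    """Sample an action with probability proportional to exp(Q / tau)."""
    m = max(q_row)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / tau) for q in q_row]
    return rng.choices(range(len(q_row)), weights=weights)[0]

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    bootstrap = 0.0 if done else gamma * max(Q[s_next])
    Q[s][a] += alpha * (r + bootstrap - Q[s][a])

# Hypothetical chain: states 0..4, action 0 = left, action 1 = right.
# Reaching state 4 gives reward 1 and ends the episode.
def step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == 4
    return s_next, (1.0 if done else 0.0), done

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(5)]
for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(Q[s], epsilon=0.2, rng=rng)
        s_next, r, done = step(s, a)
        q_update(Q, s, a, r, s_next, done)
        s = s_next

greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy_policy)
```

The behavior policy here is $\epsilon$-greedy, while the update bootstraps from $\max_{a'} Q(s',a')$; that mismatch is what makes Q-learning off-policy. Swapping `epsilon_greedy` for `boltzmann` changes only how actions are sampled, not the update.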

#### **17.2.2 Deep Q-Network (DQN)**

Use neural network $Q(s,a; \theta)$ to approximate Q-function for high-dimensional state spaces (e.g., pixels).

**Key Innovations:**
1. **Experience Replay:** Store transitions $(s,a,r,s')$ in buffer, sample random mini-batches to break correlation.
2. **Target Network:** Separate network $Q(s',a'; \theta^-)$ for computing target values, updated periodically to stabilize learning.

**Loss Function:**
$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[(r + \gamma \max_{a'} Q(s',a'; \theta^-) - Q(s,a; \theta))^2\right]$$

```python
import torch
import torch.nn as nn
import numpy as np
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim)
        )
    
    def forward(self, x):
        return self.net(x)

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards), 
                np.array(next_states), np.array(dones))
    
    def __len__(self):
        return len(self.buffer)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.policy_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        
        self.optimizer = torch.optim.Adam(self.policy_net.parameters(), lr=lr)
        self.buffer = ReplayBuffer()
        self.gamma = gamma
        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.action_dim = action_dim
        
    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_dim)
        
        with torch.no_grad():
            state = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.policy_net(state)
            return q_values.argmax().item()
    
    def train(self, batch_size=64):
        if len(self.buffer) < batch_size:
            return
        
        states, actions, rewards, next_states, dones = self.buffer.sample(batch_size)
        
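        states = torch.as_tensor(states, dtype=torch.float32)
        actions = torch.as_tensor(actions, dtype=torch.int64)
        rewards = torch.as_tensor(rewards, dtype=torch.float32)
        next_states = torch.as_tensor(next_states, dtype=torch.float32)
        # `dones` may arrive as booleans; cast explicitly to float
        dones = torch.as_tensor(dones, dtype=torch.float32)
        
        # Current Q values for the actions actually taken, shape (batch,)
        current_q = self.policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Target Q values, Double DQN style: the policy net selects the
        # next action and the target net evaluates it
        with torch.no_grad():
            next_actions = self.policy_net(next_states).argmax(1)
            next_q = self.target_net(next_states).gather(1, next_actions.unsqueeze(1)).squeeze(1)
            target_q = rewards + (1 - dones) * self.gamma * next_q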
        
        # Huber loss (smooth L1) for stability
        loss = nn.functional.smooth_l1_loss(current_q, target_q)
        
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), max_norm=1.0)
        self.optimizer.step()
        
        # Decay epsilon
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
    
    def update_target(self):
        self.target_net.load_state_dict(self.policy_net.state_dict())
```

**Improvements:**
- **Double DQN:** Decouples action selection and evaluation to reduce overestimation bias.
- **Dueling DQN:** Separates value and advantage streams: $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$.
- **Prioritized Replay:** Sample transitions with higher TD error more frequently.
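As one illustration of these improvements, the dueling aggregation formula can be written as a small PyTorch module. This is a minimal sketch: the class name `DuelingDQN`, the single hidden layer, and the layer sizes are assumptions, not the architecture from the original paper.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Q(s,a) = V(s) + A(s,a) - mean over a' of A(s,a')."""
    def __init__(self, input_dim, output_dim, hidden=128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s) stream
        self.advantage = nn.Linear(hidden, output_dim)  # A(s,a) stream

    def forward(self, x):
        h = self.feature(x)
        v = self.value(h)        # shape (batch, 1)
        a = self.advantage(h)    # shape (batch, n_actions)
        # Subtracting the mean advantage makes V and A identifiable
        return v + a - a.mean(dim=1, keepdim=True)

net = DuelingDQN(input_dim=4, output_dim=2)
x = torch.randn(3, 4)
q = net(x)
print(q.shape)  # torch.Size([3, 2])
```

Because the mean advantage is subtracted, the average of the Q-values over actions equals the value stream's output, which is a quick sanity check for an implementation.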

---

## **17.3 Policy Gradient Methods**

Instead of learning value functions, directly parameterize and optimize the policy $\pi_\theta(a|s)$.

#### **17.3.1 REINFORCE (Monte-Carlo Policy Gradient)**

The policy gradient theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t]$$

Where $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the return from time $t$.

**Intuition:** Increase probability of actions that lead to high returns, decrease for low returns.

**High Variance:** Uses Monte-Carlo returns (no bootstrapping), leading to noisy gradients.
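A minimal sketch of the REINFORCE estimate for a single episode follows. The helper names `discounted_returns` and `reinforce_loss`, the state-independent toy policy, and the rewards are all assumptions chosen so the numbers can be verified by hand.

```python
import torch

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Surrogate loss whose gradient is the negated REINFORCE estimate."""
    returns = torch.tensor(discounted_returns(rewards, gamma))
    return -(log_probs * returns).sum()

# Toy usage: a state-independent policy over 2 actions, one 3-step episode.
logits = torch.zeros(2, requires_grad=True)
dist = torch.distributions.Categorical(logits=logits)
actions = torch.tensor([1, 1, 0])
loss = reinforce_loss(dist.log_prob(actions), rewards=[0.0, 0.0, 1.0], gamma=0.5)
loss.backward()

print(discounted_returns([0.0, 0.0, 1.0], 0.5))  # [0.25, 0.5, 1.0]
print(logits.grad)
```

Minimizing this loss with gradient descent raises the log-probability of each action in proportion to the return that followed it. Since every $G_t$ comes from one sampled trajectory, the estimate is unbiased but noisy, which motivates the baselines and critics introduced next.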

#### **17.3.2 Actor-Critic Methods**

Combine policy gradient (Actor) with value function approximation (Critic) to reduce variance.

**Advantage Function:** $A(s,a) = Q(s,a) - V(s)$ (how much better is action $a$ than average).

**Gradient:**
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot A(s,a)]$$

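**A2C (Advantage Actor-Critic):** The synchronous variant of A3C: workers collect experience in parallel and a single update is applied to the shared model. The actor and critic often share their lower layers.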

```python
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU()
        )
        self.actor = nn.Linear(128, action_dim)  # Policy head
        self.critic = nn.Linear(128, 1)          # Value head
    
    def forward(self, state):
        x = self.shared(state)
        logits = self.actor(x)
        value = self.critic(x)
        return logits, value
    
    def get_action(self, state):
        logits, value = self.forward(state)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        return action, dist.log_prob(action), value

# Training step. Written as a method of a trainer class (not shown) that
# holds self.model, self.optimizer, self.gamma and self.gae_lambda.
# Expected inputs: states/next_states are float tensors (T, state_dim),
# actions is a long tensor (T,), rewards/dones are float tensors (T,).
# log_probs is unused in A2C and kept only for interface compatibility.
def train_step(self, states, actions, log_probs, rewards, next_states, dones):
    # Value estimates used as fixed targets for GAE (no gradient)
    with torch.no_grad():
        _, values = self.model(states)
        _, next_values = self.model(next_states)
        values = values.squeeze(-1)
        next_values = next_values.squeeze(-1)
    
    # Compute GAE (Generalized Advantage Estimation), walking backwards
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # next_values[t] is V(s_{t+1}); mask it out at episode ends
        delta = rewards[t] + self.gamma * next_values[t] * not_done - values[t]
        gae = delta + self.gamma * self.gae_lambda * not_done * gae
        advantages[t] = gae
    returns = advantages + values
    
    # Normalize advantages
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    
    # Update actor and critic
    logits, new_values = self.model(states)
    dist = torch.distributions.Categorical(logits=logits)
    
    new_log_probs = dist.log_prob(actions)
    entropy = dist.entropy().mean()
    
    # Actor loss (policy gradient)
    policy_loss = -(new_log_probs * advantages).mean()
    
    # Critic loss (value function)
    value_loss = nn.functional.mse_loss(new_values.squeeze(-1), returns)
    
    loss = policy_loss + 0.5 * value_loss - 0.01 * entropy  # Entropy bonus for exploration
    
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
```

---

## **17.4 Proximal Policy Optimization (PPO)**

PPO is a widely used default for continuous control and for the RL stage of LLM alignment (RLHF). It addresses the instability of vanilla policy gradients.

**Problem:** Large policy updates can collapse performance.  
**Solution:** Clipped surrogate objective to prevent large changes.

**Surrogate Objective:**
$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

Where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ (probability ratio).

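If the advantage is positive, the objective stops rewarding increases in the ratio beyond $1+\epsilon$. If it is negative, the objective stops rewarding decreases below $1-\epsilon$. Either way, the policy has no incentive to move far from $\pi_{\theta_{old}}$ in a single update.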
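A few hand-picked numbers make the clipping behavior concrete. The function below evaluates the per-sample term of $L^{CLIP}$ in plain Python; the ratios and advantages are arbitrary illustrative values.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps.
inside = clipped_surrogate(1.1, 2.0)       # inside the clip range: r * A
capped_up = clipped_surrogate(1.5, 2.0)    # capped at (1 + eps) * A

# Negative advantage: no extra credit for pushing the ratio below 1 - eps.
capped_down = clipped_surrogate(0.5, -2.0)  # capped at (1 - eps) * A

# A move in the harmful direction is never clipped away.
harmful = clipped_surrogate(1.5, -2.0)      # r * A, the more pessimistic term

print(inside, capped_up, capped_down, harmful)
```

The last case shows why the objective takes a `min`: when the ratio grows for an action with negative advantage, the unclipped term is kept, so the gradient still pushes the policy back.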

**Implementation:**
```python
class PPO:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99, gae_lambda=0.95, 
                 clip_epsilon=0.2, epochs=4, batch_size=64):
        self.policy = ActorCritic(state_dim, action_dim)
        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=lr)
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_epsilon = clip_epsilon
        self.epochs = epochs
        self.batch_size = batch_size
        
    def compute_gae(self, rewards, values, next_values, dones):
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_val = next_values[t]
            else:
                next_val = values[t+1]
            
            delta = rewards[t] + self.gamma * next_val * (1-dones[t]) - values[t]
            gae = delta + self.gamma * self.gae_lambda * (1-dones[t]) * gae
            advantages.insert(0, gae)
        return torch.tensor(advantages, dtype=torch.float32)
    
    def update(self, states, actions, old_log_probs, rewards, next_states, dones):
        # Convert to tensors
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        old_log_probs = torch.FloatTensor(old_log_probs)
        
        # Get values and next_values
        with torch.no_grad():
            _, values = self.policy(states)
            _, next_values = self.policy(torch.FloatTensor(next_states))
            values = values.squeeze(-1)
            next_values = next_values.squeeze(-1)
        
        # Compute advantages using GAE
        advantages = self.compute_gae(rewards, values.numpy(), next_values.numpy(), dones)
        returns = advantages + values
        
        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        # PPO epochs
        dataset_size = len(states)
        indices = np.arange(dataset_size)
        
        for _ in range(self.epochs):
            np.random.shuffle(indices)
            
            for start in range(0, dataset_size, self.batch_size):
                end = start + self.batch_size
                batch_idx = indices[start:end]
                
                batch_states = states[batch_idx]
                batch_actions = actions[batch_idx]
                batch_old_log_probs = old_log_probs[batch_idx]
                batch_advantages = advantages[batch_idx]
                batch_returns = returns[batch_idx]
                
                # Evaluate current policy
                logits, curr_values = self.policy(batch_states)
                dist = torch.distributions.Categorical(logits=logits)
                curr_log_probs = dist.log_prob(batch_actions)
                entropy = dist.entropy().mean()
                
                # Probability ratio
                ratio = torch.exp(curr_log_probs - batch_old_log_probs)
                
                # Clipped surrogate loss
                surr1 = ratio * batch_advantages
                surr2 = torch.clamp(ratio, 1-self.clip_epsilon, 1+self.clip_epsilon) * batch_advantages
                actor_loss = -torch.min(surr1, surr2).mean()
                
                # Value loss
                critic_loss = nn.functional.mse_loss(curr_values.squeeze(-1), batch_returns)
                
                # Total loss
                loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
                
                self.optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 0.5)
                self.optimizer.step()
```

---

## **17.5 Model-Based RL (Brief Overview)**

Learn a model of the environment dynamics $\hat{P}(s'|s,a)$ and $\hat{R}(s,a)$, then plan.

**Methods:**
- **Dyna-Q:** Mix real experience with simulated experience from learned model
- **MPC (Model Predictive Control):** Plan action sequences using learned model, execute first action, replan
- **MuZero:** Successor to AlphaZero that plans with Monte Carlo Tree Search over a learned latent dynamics model, so it does not need to be given the environment's rules

**Trade-off:** More sample efficient (less environment interaction), but model errors compound over long planning horizons.
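The Dyna-Q idea of mixing real and simulated experience can be sketched as a single function. This assumes a tabular setting with a deterministic learned model and no terminal states; `dyna_q_step`, the three-state Q-table, and the transitions fed to it are hypothetical.

```python
import random

def dyna_q_step(Q, model, s, a, r, s_next, rng,
                n_planning=10, alpha=0.1, gamma=0.95):
    """One Dyna-Q step: learn from a real transition, then replay the model."""
    def td_update(s0, a0, r0, s1):
        target = r0 + gamma * max(Q[s1])
        Q[s0][a0] += alpha * (target - Q[s0][a0])

    td_update(s, a, r, s_next)       # (1) direct RL from real experience
    model[(s, a)] = (r, s_next)      # (2) update the deterministic model
    for _ in range(n_planning):      # (3) planning with simulated experience
        (ps, pa), (pr, ps_next) = rng.choice(list(model.items()))
        td_update(ps, pa, pr, ps_next)

rng = random.Random(0)
Q = [[0.0, 0.0] for _ in range(3)]
model = {}

# One real transition: state 1, action 1, reward 1, next state 2
dyna_q_step(Q, model, s=1, a=1, r=1.0, s_next=2, rng=rng)
q_after_first = Q[1][1]

# A second real transition that leads into state 1
dyna_q_step(Q, model, s=0, a=1, r=0.0, s_next=1, rng=rng)
print(q_after_first, Q[0][1])
```

After the first call, the single real transition has been replayed ten more times from the model, so `Q[1][1]` has already moved most of the way to its target without further environment interaction. That is the sample-efficiency gain; the cost is that any error in `model` would be replayed just as faithfully.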

---

## **17.6 Multi-Agent RL**

Multiple agents interact in a shared environment. The setting can be:
- **Cooperative:** Shared reward (team success)
- **Competitive:** Zero-sum (one wins, one loses)
- **Mixed:** General sum (prisoner's dilemma)

**Challenges:**
- Non-stationarity: Other agents are part of the environment, constantly changing
- Credit assignment: Which agent caused success/failure?

**Algorithms:**
- **MADDPG:** Multi-Agent DDPG (centralized training, decentralized execution)
- **QMIX:** Value factorization for cooperative agents, $Q_{tot} = f(Q_1, Q_2, ..., Q_n)$, where the mixing function $f$ is constrained to be monotonic in each $Q_i$
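The monotonicity constraint behind QMIX-style value factorization can be illustrated with a stripped-down mixer. This is a simplification: in QMIX the mixing weights are produced by hypernetworks conditioned on the global state, whereas the hypothetical `MonotonicMixer` below uses fixed learnable weights to keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """Simplified, state-independent stand-in for the QMIX mixing network."""
    def __init__(self, n_agents, hidden=8):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_agents, hidden))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(hidden, 1))
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, agent_qs):  # agent_qs: (batch, n_agents)
        # Absolute values keep every weight non-negative, and ELU is
        # increasing, so Q_tot is non-decreasing in each agent's Q_i.
        h = F.elu(agent_qs @ self.w1.abs() + self.b1)
        return (h @ self.w2.abs() + self.b2).squeeze(-1)

mixer = MonotonicMixer(n_agents=3)
q_low = torch.tensor([[0.0, 1.0, 2.0]])
q_high = torch.tensor([[0.5, 1.0, 2.0]])  # only agent 0 improves
with torch.no_grad():
    q_tot_low, q_tot_high = mixer(q_low), mixer(q_high)
print(q_tot_low, q_tot_high)
```

Monotonicity means each agent can greedily maximize its own $Q_i$ at execution time and the joint action still maximizes $Q_{tot}$, which is what allows decentralized execution.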

---

## **17.7 Reinforcement Learning from Human Feedback (RLHF)**

RLHF is the technique used to align assistant LLMs such as ChatGPT and Claude.

#### **17.7.1 The Pipeline (Review and Expansion from Chapter 15)**

1. **Supervised Fine-Tuning (SFT):** Train on high-quality human demonstrations (instruction-following).
2. **Reward Modeling (RM):** Train a model $r_\phi(x,y)$ to predict human preferences.
   - Data: Comparisons $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$
   - Loss: $-\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$ (Bradley-Terry model)
3. **RL Optimization:** Use PPO to maximize expected reward while staying close to SFT policy (KL penalty).
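The reward-modeling loss from step 2 is short enough to write out directly. The sketch below assumes the reward model has already produced one scalar score per response; `reward_model_loss` and the example scores are illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r(x, y_w) - r(x, y_l)), batch mean."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical reward-model scores for two preference pairs
r_w = torch.tensor([2.0, 0.5])  # scores of the preferred responses
r_l = torch.tensor([1.0, 0.5])  # scores of the rejected responses
loss = reward_model_loss(r_w, r_l)
print(loss.item())
```

Only the difference between the two scores matters, so the reward model's output is defined up to an additive constant per prompt. Using `logsigmoid` rather than `log(sigmoid(...))` avoids numerical underflow when the margin is large and negative.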

**The RLHF Objective:**
$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} [r_\phi(x,y)] - \beta \mathbb{D}_{KL}(\pi_\theta(y|x) || \pi_{SFT}(y|x))$$

#### **17.7.2 PPO for Language Models**

When applying PPO to LLMs:
- **State:** Context/prompt $x$ and tokens generated so far $y_{<t}$
- **Action:** Next token $y_t$ (vocabulary size ~50k)
- **Policy:** The language model $\pi_\theta(y_t|x, y_{<t})$
- **Reward:** Reward model score at end of sequence + KL penalty per token

**Important Implementation Details:**
- **Reward Hacking:** Reward models can be exploited (generate high reward but gibberish). Mitigate with KL penalty and diverse sampling.
- **Token-Level Rewards:** Since reward is only at sequence end, use value function to estimate future returns for intermediate tokens.

```python
# Simplified RLHF PPO for LLMs (conceptual)
def compute_rewards(sequences, reward_model, sft_model, policy_model, kl_coeff=0.1):
    """
    sequences: token IDs of generated text, shape (batch, seq_len)

    Assumed interfaces (conceptual, not a specific library's API):
    policy_model and sft_model map token IDs to logits of shape
    (batch, seq_len, vocab); reward_model returns one score per
    sequence, shape (batch,). Padding is ignored for simplicity.
    """
    with torch.no_grad():
        # End-of-sequence reward from RM
        rm_scores = reward_model(sequences)
        
        # Per-token KL(policy || SFT), summed over the vocabulary
        sft_logp = torch.log_softmax(sft_model(sequences), dim=-1)
        policy_logp = torch.log_softmax(policy_model(sequences), dim=-1)
        kl_div = torch.sum(policy_logp.exp() * (policy_logp - sft_logp), dim=-1)
    
    # Combined reward: KL penalty at every token, RM score at the last token
    rewards = -kl_coeff * kl_div
    rewards[:, -1] += rm_scores  # Add RM score to last token
    
    return rewards
```

#### **17.7.3 Direct Preference Optimization (DPO)**

As mentioned in Chapter 15, DPO bypasses explicit reward modeling and PPO by deriving the optimal policy in closed form from the KL-regularized objective and training directly on preference data.

**Loss:**
$$\mathcal{L}_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$

**Advantages:** Simpler, more stable, often performs as well as PPO.
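The DPO loss maps directly onto a few tensor operations. The sketch below assumes the sequence log-probabilities $\log \pi(y|x)$ (summed over tokens) have already been computed for the policy and the frozen reference model; the function name `dpo_loss` and the example values are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed sequence log-probabilities, batch mean."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi/pi_ref for y_w
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi/pi_ref for y_l
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Hypothetical log-probabilities for a batch of two preference pairs
ref_w = torch.tensor([-10.0, -12.0])
ref_l = torch.tensor([-11.0, -9.0])

# At initialization the policy equals the reference, so the loss is log 2
loss_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# A policy that raises the chosen and lowers the rejected log-probs does better
loss_better = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)
print(loss_init.item(), loss_better.item())
```

The term $\beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}$ plays the role of an implicit reward, which is why the expression has the same Bradley-Terry shape as the reward-model loss.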

---

## **17.8 Workbook Labs**

### **Lab 1: Q-Learning from Scratch**
Implement tabular Q-learning on FrozenLake (Gymnasium, formerly OpenAI Gym):
1. Epsilon-greedy exploration with decay
2. Q-table updates
3. Solve 4x4 and 8x8 maps
4. Visualize value function and policy

**Deliverable:** Solved environment with visualization of learned Q-values.

### **Lab 2: DQN on Atari**
Train DQN on CartPole or Atari Pong:
1. Experience replay with prioritized sampling
2. Target network updates
3. Frame stacking (for Atari)
4. Plot reward curves and epsilon decay

**Deliverable:** Training script achieving an average reward above a stated threshold over 100 consecutive episodes (e.g., 475 for CartPole-v1).

### **Lab 3: PPO for Continuous Control**
Use Stable Baselines3 or a custom implementation for BipedalWalker or the continuous-action variant of LunarLander (the default LunarLander has discrete actions):
1. Actor-Critic with separate networks
2. GAE for advantage estimation
3. Multiple epochs per update
4. Monitor KL divergence to detect too-large steps

**Deliverable:** Agent successfully walking/landing with video recording.

### **Lab 4: Multi-Agent Gridworld**
Compare independent Q-learning with MADDPG on a simple pursuit-evasion or cooperative navigation task:
1. Centralized training with decentralized execution
2. Experience sharing between agents
3. Analysis of emergent behaviors

**Deliverable:** Visualization of multi-agent coordination.

---

## **17.9 Common Pitfalls**

1. **Reward Hacking:** Agent finds loophole in reward function (e.g., crashing quickly to avoid negative step penalty). Solution: Careful reward shaping, human oversight.

2. **Catastrophic Forgetting:** In continual RL, forgetting old tasks when learning new ones. Solution: Elastic Weight Consolidation (EWC), progressive networks.

3. **Exploration Collapse:** Policy becomes deterministic too early, stops exploring. Solution: Entropy regularization, noise injection (parameter space noise), count-based exploration.

4. **Deadly Triad:** Function approximation + Bootstrapping + Off-policy learning can diverge (e.g., DQN with large learning rates). Solution: Target networks, gradient clipping, conservative updates.

5. **Sample Inefficiency:** Model-free RL often needs millions of environment steps. Solution: Model-based methods, demonstration data (imitation learning or offline pretraining), better replay buffers.

---

## **17.10 Interview Questions**

**Q1:** What is the difference between on-policy and off-policy RL? Give examples of each.
*A: On-policy learns about the policy currently being followed (must generate new experience after each update). Examples: REINFORCE, A2C, PPO. Off-policy can learn from experience generated by old policies or other policies (can reuse old data). Examples: Q-Learning, DQN, DDPG. Off-policy is more sample efficient but can be less stable; on-policy is more stable but sample inefficient.*

**Q2:** Why does DQN use a target network, and how is it updated?
*A: Without target networks, DQN chases its own tail: as parameters update, the target $r + \gamma \max Q(s',a'; \theta)$ changes simultaneously with the current estimate $Q(s,a; \theta)$, leading to oscillations or divergence. The target network $\theta^-$ is a lagged copy of the policy network, updated periodically (hard update: copy every N steps) or smoothly (soft update: $\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-$). This stabilizes learning by providing consistent targets.*

**Q3:** Explain the purpose of the clipping objective in PPO. Why not use a simple trust region like TRPO?
*A: PPO prevents policy updates that are too large (which can collapse performance) by clipping the probability ratio $r(\theta)$ to $[1-\epsilon, 1+\epsilon]$ inside the surrogate objective. Once the ratio leaves this range in the direction that would improve the objective, the gradient vanishes, so there is no incentive to make actions much more or less probable than under the old policy. TRPO instead solves a constrained optimization problem (KL divergence < $\delta$) using conjugate gradient and a line search, which is more complex and computationally expensive. PPO's clipped surrogate needs only first-order optimization, is simpler to implement, and often works as well or better.*

**Q4:** What is the advantage function, and why is it used in Actor-Critic methods instead of raw returns?
*A: Advantage $A(s,a) = Q(s,a) - V(s)$ measures how much better action $a$ is than the policy's average action at state $s$. Subtracting the state-value baseline reduces the variance of the policy gradient without changing its expectation, because the baseline does not depend on the action. Lower variance means more stable learning and faster convergence. Bias enters only when the advantage is estimated with a learned critic and bootstrapping (e.g., TD targets or GAE with $\lambda < 1$), and it shrinks as the value function approximation improves.*

**Q5:** How does RLHF prevent the policy from drifting too far from the original language model?
*A: RLHF adds a KL divergence penalty $\beta D_{KL}(\pi_\theta || \pi_{SFT})$ to the reward. This penalizes the RL policy for producing output distributions that differ too much from the supervised fine-tuned model. It limits reward hacking (exploiting the reward model's weaknesses) and preserves the fluency and coherence of the SFT model. The $\beta$ coefficient trades off reward maximization against staying close to the SFT policy.*

---

## **17.11 Further Reading**

**Books:**
- *Reinforcement Learning: An Introduction* (Sutton & Barto) - The bible, free PDF available
- *Spinning Up in Deep RL* (OpenAI) - Practical guide with implementations

**Papers:**
- "Human-level control through deep reinforcement learning" (Mnih et al., 2015) - DQN
- "Proximal Policy Optimization Algorithms" (Schulman et al., 2017) - PPO
- "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments" (Lowe et al., 2017) - MADDPG
- "Training language models to follow instructions with human feedback" (Ouyang et al., 2022) - InstructGPT/RLHF
- "Direct Preference Optimization" (Rafailov et al., 2023) - DPO

**Libraries:**
- **Stable Baselines3:** Reliable implementations of DQN, PPO, A2C, SAC
- **RLlib (Ray):** Scalable RL for multi-agent and distributed training
- **CleanRL:** Single-file implementations for educational purposes

---

## **17.12 Checkpoint Project: Autonomous Trading Agent**

Build an RL agent for simulated algorithmic trading, using a custom environment or an existing Gym-compatible trading environment.

**Requirements:**

1. **Environment:**
   - State: Price history (OHLCV), technical indicators (RSI, MACD), portfolio state (cash, position, PnL)
   - Actions: Discrete [Buy, Sell, Hold] or continuous [position sizing -1 to 1]
   - Reward: Sharpe ratio of returns, or PnL with risk penalty (drawdown penalty)

2. **Algorithms:**
   - Baseline: DQN with discretized actions
   - Advanced: PPO with continuous action space for position sizing
   - Compare against Buy-and-Hold and simple moving average crossover

3. **Risk Management:**
   - Stop-loss logic in environment (forced exit on large drawdown)
   - Position limits (max leverage)
   - Transaction cost modeling (slippage, fees)

4. **Evaluation:**
   - Backtest on unseen 6-month period
   - Metrics: Total return, Sharpe ratio, max drawdown, Calmar ratio
   - Visualization: Equity curve, drawdown periods, action distribution

5. **Safety:**
   - Sanity checks: No lookahead bias (only past data in state)
   - Overfitting detection: Performance degradation from train to test
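Two of the evaluation metrics listed above can be computed with the standard library alone. The sketch assumes daily simple returns and 252 trading periods per year; the function names and the sample data are illustrative, and a real backtest would normally use a vectorized library.

```python
import math
import statistics

def sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualized Sharpe ratio from per-period simple returns."""
    excess = [r - risk_free_rate / periods_per_year for r in returns]
    std = statistics.stdev(excess)  # sample standard deviation
    if std == 0:
        return 0.0
    return statistics.mean(excess) / std * math.sqrt(periods_per_year)

def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak, worst = equity_curve[0], 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

# Hypothetical backtest output
daily_returns = [0.01, -0.01, 0.02, 0.0]
equity = [100.0, 110.0, 99.0, 120.0, 90.0]

sharpe = sharpe_ratio(daily_returns)
mdd = max_drawdown(equity)
print(sharpe, mdd)
```

In the sample equity curve the deepest drop runs from the peak of 120 to 90, a drawdown of 25%. The Calmar ratio in the metrics list is the annualized return divided by this maximum drawdown.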

**Deliverables:**
- `trading_rl/` package with custom Gym environment and agents
- Backtesting report with risk metrics
- Analysis of what the agent learned, backed by the backtest, e.g., a statement of the form "Agent learned to avoid high-volatility periods, achieving 1.5 Sharpe ratio vs 0.8 buy-and-hold"

**Success Criteria:**
- Positive returns on test set (out-of-sample)
- Sharpe ratio > 1.0
- Max drawdown < 20%
- Interpretable policy (e.g., buys dips in uptrend)

---

**End of Chapter 17**

*You now master reinforcement learning from fundamentals to RLHF. Chapter 18 will cover Specialized Applications (Time Series, Recommendation Systems, Graph Neural Networks) — applying deep learning to specific domains.*

---