In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from collections import deque, namedtuple
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
try:
    import gymnasium as gym
    GYM_AVAILABLE = True
except ImportError:
    GYM_AVAILABLE = False

from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 7), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- Utility Functions ---
def note(msg, **kwargs): display(Markdown(f"<div class='alert alert-block alert-info'>📝 **Note:** {msg}</div>"))
def sec(title): print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

if not GYM_AVAILABLE: note("Gymnasium is not installed. Skipping code labs. Run `pip install gymnasium[classic_control]`.")
note(f"Environment initialized. Gymnasium available: {GYM_AVAILABLE}. Using device: {device}")

# Part 7: Advanced and Frontier Topics
## Chapter 7.20: Advanced Deep Reinforcement Learning

### Introduction: The Challenge of Scaling Reinforcement Learning

The tabular methods discussed in the previous chapter are foundational but suffer from the **curse of dimensionality**. They require storing a value for every state-action pair, which is infeasible for problems with large or continuous state spaces. The solution is **function approximation**, where we use a parameterized function—most powerfully, a deep neural network—to estimate the value function, the policy, or both. This is the core idea behind **Deep Reinforcement Learning (DRL)**.

However, the combination of three elements—**function approximation** (like a neural network), **bootstrapping** (updating estimates from other estimates, as in TD learning), and **off-policy training** (learning about policy $\pi$ from data generated by policy $\mu$)—is known as the **deadly triad**. When combined naively, this triad can lead to unstable and divergent training. The history of modern DRL is largely a story of developing algorithms that can successfully manage this instability.

This chapter provides a PhD-level survey of the key algorithmic breakthroughs in DRL, covering both major families of algorithms: **Value-Based Methods** (learning a value function) and **Policy-Based Methods** (learning a policy directly).

## 1. Value-Based Methods: Learning What to Expect

Value-based methods focus on estimating the optimal action-value function, $Q^*(s, a)$. Once this function is known, the optimal policy is simply to choose the action with the highest Q-value in any given state. The archetypal algorithm in this family is Q-Learning.

### 1.1 Deep Q-Networks (DQN): Taming the Deadly Triad

The DQN algorithm (Mnih et al., 2015) was the first to successfully train a deep neural network to perform Q-learning, matching or exceeding human-level performance on many Atari games using only raw pixel data. It introduced two key innovations to stabilize the learning process:

1.  **Experience Replay:** A large buffer stores past transitions $(s_t, a_t, r_t, s_{t+1})$. For training, mini-batches are sampled *randomly* from this buffer. This breaks the strong temporal correlations in the data, making the samples more like the i.i.d. data that standard optimizers expect.

2.  **Fixed Target Network:** The TD target in the Bellman update, $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$, is problematic because the same network weights $\theta$ are used to predict the current Q-value and the target Q-value. This creates a "moving target" problem. DQN solves this by using a separate **target network**, with weights $\theta^-$, to compute the TD target. The weights of this target network are frozen for many steps and only periodically updated with the weights of the main policy network ($\theta^- \leftarrow \theta$).
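
The two mechanisms above can be sketched without any deep learning library. This is a minimal illustration, not the implementation from the DQN paper: `ReplayBuffer`, `Transition`, and `maybe_sync_target` are hypothetical names, the `done` field is an addition to the four-element transition in the text, and plain dictionaries stand in for network weights.

```python
import random
from collections import deque, namedtuple

# One stored transition. `done` is an added field (assumption) marking terminal states.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling (hypothetical helper)."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transition once the buffer is full.
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporally adjacent transitions.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def maybe_sync_target(step, online_params, target_params, sync_every):
    """Copy the online weights into the target every `sync_every` steps (theta^- <- theta)."""
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(online_params)
        return True
    return False


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(t, 0, 1.0, t + 1, False)  # only the three most recent transitions are kept
batch = buffer.sample(2)
```

With PyTorch modules, the same periodic copy is usually written as `target_net.load_state_dict(policy_net.state_dict())`, where `target_net` and `policy_net` are assumed names for the two networks.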

The loss function for a DQN is the Mean Squared Error (MSE) between the predicted Q-value and the stable TD target:
$$ L(\theta) = E_{(s,a,r,s') \sim U(B)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right] $$ 
where $U(B)$ denotes a mini-batch sampled uniformly from the replay buffer $B$.
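
As a worked instance of this loss, the sketch below evaluates it on a two-transition mini-batch. It assumes lookup tables `q_online` and `q_target` as stand-ins for $Q(\cdot; \theta)$ and $Q(\cdot; \theta^-)$; the function name `dqn_loss` and all numbers are illustrative only.

```python
def dqn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over a mini-batch of (s, a, r, s') tuples.

    q_online and q_target map a state to a list of action values. They stand in
    for the online network (theta) and the frozen target network (theta^-).
    """
    squared_errors = []
    for s, a, r, s_next in batch:
        # The TD target is computed with the frozen parameters theta^-.
        td_target = r + gamma * max(q_target[s_next])
        # The prediction is computed with the online parameters theta.
        td_error = td_target - q_online[s][a]
        squared_errors.append(td_error ** 2)
    return sum(squared_errors) / len(squared_errors)


q_online = {0: [0.0, 1.0], 1: [2.0, 0.0]}
q_target = {0: [0.5, 0.5], 1: [1.0, 3.0]}
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
loss = dqn_loss(batch, q_online, q_target, gamma=0.5)
# Transition 1: target 1 + 0.5 * 3 = 2.5, prediction 1.0, squared error 2.25
# Transition 2: target 0 + 0.5 * 0.5 = 0.25, prediction 2.0, squared error 3.0625
# loss = (2.25 + 3.0625) / 2 = 2.65625
```

The sketch follows the formula as written. Practical implementations also drop the bootstrap term $\gamma \max_{a'} Q(s', a'; \theta^-)$ when $s'$ is terminal.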

### 1.2 Dueling DQN Architecture: Decoupling Value and Advantage

The Dueling Network (Wang et al., 2016) is an architectural improvement that provides a more efficient and robust way to estimate Q-values. It decomposes the Q-value into two separate, fully-connected streams:

1.  **State-Value Function ($V(s; \theta, \beta)$):** A scalar representing how good it is to be in state $s$. It has parameters $\theta$ (shared with the convolutional base) and $\beta$ (for its own fully-connected layers).
2.  **Advantage Function ($A(s, a; \theta, \alpha)$):** A vector representing how much better taking action $a$ is compared to other actions in state $s$. It has shared parameters $\theta$ and its own parameters $\alpha$.

The Q-value is then reconstructed from these two streams. However, we cannot simply add them, $Q = V + A$, because $V$ and $A$ are not uniquely identifiable: adding a constant to $V$ and subtracting it from every $A(s,a)$ leaves $Q$ unchanged. To enforce identifiability, the advantage stream is constrained, either so that the greedy action has zero advantage (subtracting $\max_{a'} A(s, a')$) or so that the advantages have zero mean (subtracting their average). The most common formulation is the mean-subtracted one:
$$ Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right) $$ 

This separation allows the network to learn the value of states without having to learn the effect of each action for every state. This is particularly useful in environments where many actions have little to no impact on the state's value.
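
The aggregation formula can be checked numerically. The sketch below is a plain-Python illustration with a hypothetical helper `dueling_q`; in a real network, `value` and `advantages` would be the outputs of the two streams.

```python
def dueling_q(value, advantages):
    """Combine a scalar V(s) and a list of A(s, a) using mean-subtracted advantages."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]


q = dueling_q(1.0, [1.0, 2.0, 3.0])
# Shifting every advantage by the same constant leaves the Q-values unchanged,
# which is the identifiability property the mean subtraction is meant to provide.
q_shifted = dueling_q(1.0, [11.0, 12.0, 13.0])
```

Both calls give Q-values of `[0.0, 1.0, 2.0]`, and their mean equals the value input, so under this formulation $V(s)$ is the average of the Q-values over actions.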

In [None]:
# This diagram illustrates the Dueling DQN architecture.
# The network splits into two streams: one for the state-value function (V) and one for the advantage function (A).
# These streams are then combined to produce the final Q-values.
from IPython.display import Image  # Image is not imported in the setup cell
display(Image(filename='../images/07-Machine-Learning/dueling_dqn_architecture.png'))

## 2. Policy-Based Methods: Learning What to Do

Policy-based methods directly parameterize the policy, $\pi_\theta(a|s)$, and optimize the policy parameters $\theta$ by performing gradient ascent on the expected total reward.
$$ J(\theta) = E_{\tau \sim \pi_\theta} [R(\tau)] = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T R(s_t, a_t) \right] $$

### 2.1 The Policy Gradient Theorem and REINFORCE

The **Policy Gradient Theorem** provides a way to compute the gradient of the expected reward $J(\theta)$ without needing to know the dynamics of the environment:
$$ \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \left( \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \right) \left( \sum_{t=0}^T R(s_t, a_t) \right) \right] $$
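
A short derivation sketch shows why the environment dynamics drop out. The notation $p(\tau; \theta)$ for the trajectory probability, $\rho_0$ for the initial-state distribution, and $P$ for the transition kernel is introduced here for illustration. A trajectory's probability factorizes as

$$ p(\tau; \theta) = \rho_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) \, P(s_{t+1}|s_t, a_t) $$

Taking the logarithm turns the product into a sum, and only the policy terms depend on $\theta$:

$$ \nabla_\theta \log p(\tau; \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) $$

Applying the identity $\nabla_\theta p = p \, \nabla_\theta \log p$ to the objective then gives

$$ \nabla_\theta J(\theta) = \int \nabla_\theta p(\tau; \theta) \, R(\tau) \, d\tau = E_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log p(\tau; \theta) \, R(\tau) \right] $$

so the gradient can be estimated from sampled trajectories alone, without access to $\rho_0$ or $P$.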

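The **REINFORCE** algorithm is the most direct application of this theorem. It collects a trajectory and, for each action $a_t$ taken, multiplies the score function, $\nabla_\theta \log \pi_\theta(a_t|s_t)$, by the discounted return from that point onward, $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$, rather than by the full trajectory return. Dropping the rewards received before time $t$ is valid because an action cannot influence rewards that have already been collected, so those terms contribute nothing to the gradient in expectation. The intuition is to "reinforce" actions that led to high returns by increasing their log-probability.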
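
The return computation and the resulting surrogate objective can be sketched in a few lines. This is an illustrative, dependency-free version: `discounted_returns` and `reinforce_surrogate` are hypothetical names, and the log-probabilities are fixed numbers rather than outputs of a trainable policy.

```python
import math


def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] with G_t = sum_{k >= t} gamma^(k - t) * r_k."""
    returns = []
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_surrogate(log_probs, returns):
    """Scalar loss -sum_t log pi(a_t|s_t) * G_t.

    In an autodiff framework, minimizing this scalar (with G_t treated as a
    constant) performs gradient ascent along the REINFORCE estimate.
    """
    return -sum(lp * g for lp, g in zip(log_probs, returns))


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)  # [1.75, 1.5, 1.0]
loss = reinforce_surrogate([math.log(0.5)] * 3, returns)  # 4.25 * ln(2)
```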

**Problem:** While unbiased, this gradient estimate has extremely **high variance**. The return $G_t$ can vary significantly depending on the stochastic actions taken and the environment's transitions, leading to noisy and slow learning.

### 2.2 Actor-Critic Methods: Reducing Variance with a Critic

**Actor-Critic (AC)** methods combine the best of value-based and policy-based methods to tackle the variance problem. They use two networks:

1.  **The Actor ($\pi_\theta(a|s)$):** A policy network that controls how the agent acts.
2.  **The Critic ($V_\phi(s)$):** A value network that learns the state-value function, $V^{\pi_\theta}(s)$.

The critic's job is to provide a low-variance baseline for the return. Subtracting the state value $V(s_t)$ from the return $G_t$ gives a sample estimate of the **Advantage Function**, $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$, because $G_t$ is a single-sample estimate of $Q(s_t, a_t)$. The policy gradient update becomes:
$$ \nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} [\nabla_\theta \log \pi_\theta(a_t|s_t) A(s_t, a_t)] $$

Using the advantage function has a clear intuition: instead of just asking "was this action good?", we ask "was this action *better than average*?". This significantly reduces variance because the baseline $V(s_t)$ accounts for much of the stochasticity in the return.

In practice, we use the critic's estimate of the value function to compute the **TD Advantage Estimate**: $A(s_t, a_t) \approx r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$. This is the core of modern algorithms like **Advantage Actor-Critic (A2C)**.
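
The one-step TD advantage estimate can be computed as follows. This is a sketch with a hypothetical helper `td_advantages`; the critic's outputs are passed in as plain lists, and the `dones` mask that drops the bootstrap term at terminal states is an assumption beyond the formula above.

```python
def td_advantages(rewards, values, next_values, dones, gamma):
    """One-step estimates r_t + gamma * V(s_{t+1}) - V(s_t) for each time step."""
    advantages = []
    for r, v, v_next, done in zip(rewards, values, next_values, dones):
        # Assumption: a terminal next state has value zero, so nothing is bootstrapped.
        bootstrap = 0.0 if done else gamma * v_next
        advantages.append(r + bootstrap - v)
    return advantages


adv = td_advantages(
    rewards=[1.0, 1.0],
    values=[2.0, 1.0],        # V(s_t) from the critic
    next_values=[1.0, 5.0],   # V(s_{t+1}) from the critic
    dones=[False, True],
    gamma=0.5,
)
# Step 0: 1 + 0.5 * 1 - 2 = -0.5   (the action did worse than the critic expected)
# Step 1: 1 + 0 - 1 = 0.0          (terminal step, no bootstrap)
```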

### 2.3 Proximal Policy Optimization (PPO): The State of the Art

A major challenge in policy gradient methods is that a single large, noisy policy update can destroy the policy, leading to a catastrophic drop in performance from which the agent cannot recover. **Trust Region Policy Optimization (TRPO)** addressed this by constraining policy updates to be within a certain KL-divergence of the old policy, but it is computationally expensive.

**Proximal Policy Optimization (PPO)** (Schulman et al., 2017) achieves the stability of TRPO with a much simpler mechanism. It modifies the objective function to penalize large changes in the policy. The key innovation is the **clipped surrogate objective function**:

Let $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ be the probability ratio between the new and old policies.

The PPO objective is:
$$ L^{CLIP}(\theta) = E_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right] $$

This objective takes the minimum of two terms:
1.  The standard policy gradient objective, $r_t(\theta) \hat{A}_t$.
2.  A clipped version, where the probability ratio $r_t(\theta)$ is clipped to be within the range $[1 - \epsilon, 1 + \epsilon]$ (e.g., $\epsilon=0.2$).

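If the advantage $\hat{A}_t$ is positive, the objective increases with $r_t$, but it stops increasing once $r_t$ exceeds $1 + \epsilon$, so there is no incentive to raise the action's probability further. If the advantage is negative, the objective improves as $r_t$ decreases, but it stops improving once $r_t$ falls below $1 - \epsilon$. Because of the outer $\min$, moves in the harmful direction (for example, $r_t > 1 + \epsilon$ with a negative advantage) are never clipped and remain fully penalized. This simple mechanism approximates a trust region, discouraging destructive policy updates and making PPO robust across a wide range of tasks and hyperparameters. As an on-policy method it is typically less sample-efficient than off-policy algorithms that reuse replay data, but it is often the default choice for new DRL problems.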
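
The behavior of the clipped objective is easiest to see by evaluating it for a single sample in each combination of advantage sign and ratio. The sketch below is a scalar illustration with a hypothetical helper `clipped_surrogate`; a real implementation would apply the same formula to batched tensors and average the result.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
gain_capped = clipped_surrogate(1.5, 1.0)       # min(1.5, 1.2)
gain_unclipped = clipped_surrogate(0.5, 1.0)    # min(0.5, 0.8)

# Negative advantage: the gain is capped once the ratio falls below 1 - epsilon,
# while raising the probability of a bad action stays fully penalized.
loss_capped = clipped_surrogate(0.5, -1.0)      # min(-0.5, -0.8)
loss_unclipped = clipped_surrogate(1.5, -1.0)   # min(-1.5, -1.2)
```

The four values are approximately 1.2, 0.5, -0.8, and -1.5. In both capped cases the objective no longer depends on the ratio, so its gradient with respect to the policy parameters is zero for that sample.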

## 3. Code Lab: Advantage Actor-Critic (A2C) in PyTorch

We will now implement a robust A2C agent to solve the classic `CartPole-v1` environment. The goal is to balance a pole on a cart by moving the cart left or right. The state is a 4-dimensional vector (cart position, cart velocity, pole angle, pole angular velocity), and the action space is discrete (left or right).

Our A2C implementation will feature:
- A single neural network with two heads: one for the policy (Actor) and one for the value function (Critic).
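- One gradient update per complete episode collected from the environment.
- Monte Carlo discounted returns $G_t$ as the critic's regression target, with advantages $G_t - V_\phi(s_t)$ driving the actor update (rather than the one-step TD estimate described in Section 2.2).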

In [None]:
sec("Implementing Advantage Actor-Critic (A2C)")

if not GYM_AVAILABLE:
    note("Gymnasium not installed. Skipping this code lab.")
else:
    # 1. Define the Actor-Critic Network
    # This network has a shared body and two heads: one for the policy (Actor) and one for the value function (Critic).
    class ActorCritic(nn.Module):
        def __init__(self, input_dims, n_actions):
            super(ActorCritic, self).__init__()
            # Shared layers learn common features from the input state.
            self.shared_layer = nn.Sequential(
                nn.Linear(input_dims, 128),
                nn.ReLU()
            )
            # The Actor head outputs a probability distribution over actions (the policy).
            self.actor_head = nn.Linear(128, n_actions)
            # The Critic head outputs a single value, estimating the value of the current state.
            self.critic_head = nn.Linear(128, 1)

        def forward(self, state):
            shared_features = self.shared_layer(state)
            action_logits = self.actor_head(shared_features)
            state_value = self.critic_head(shared_features)
            # The softmax function converts the logits into a probability distribution.
            return F.softmax(action_logits, dim=-1), state_value

    # 2. A2C Agent Training Loop
    def train_a2c(env, episodes=1000, gamma=0.99):
        input_dims = env.observation_space.shape[0]
        n_actions = env.action_space.n
        model = ActorCritic(input_dims, n_actions).to(device)
        optimizer = optim.Adam(model.parameters(), lr=0.001)
        
        episode_rewards = []

        for episode in range(episodes):
            log_probs = []
            values = []
            rewards = []
            
            state, _ = env.reset()
            done = False
            ep_reward = 0

            # Collect one complete trajectory from the environment.
            while not done:
                state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
                probs, value = model(state_tensor)
                
                # Sample an action from the policy distribution.
                dist = torch.distributions.Categorical(probs)
                action = dist.sample()
                
                next_state, reward, terminated, truncated, _ = env.step(action.cpu().item())
                done = terminated or truncated
                
                log_probs.append(dist.log_prob(action))
                values.append(value)
                rewards.append(reward)
                state = next_state
                ep_reward += reward
            
            episode_rewards.append(ep_reward)
            
            # Calculate the discounted returns for each step in the trajectory.
            returns = []
            R = 0
            for r in reversed(rewards):
                R = r + gamma * R
                returns.insert(0, R)
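            # Build the tensor as float32 directly on the target device.
            returns = torch.tensor(returns, dtype=torch.float32, device=device)

            log_probs = torch.cat(log_probs)
            # squeeze(-1) keeps a 1-D shape even for single-step episodes.
            values = torch.cat(values).squeeze(-1)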
            
            # Calculate the advantage: how much better was the return than the critic's estimate?
            advantage = returns - values
            
            # The actor loss encourages actions that led to a positive advantage.
            actor_loss = -(log_probs * advantage.detach()).mean()
            # The critic loss trains the value function to be a better predictor of the returns.
            # Argument order follows the (input, target) convention of F.mse_loss.
            critic_loss = F.mse_loss(values, returns)
            total_loss = actor_loss + 0.5 * critic_loss
            
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()

            if (episode + 1) % 100 == 0:
                avg_reward = np.mean(episode_rewards[-100:])
                print(f"Episode {episode+1}/{episodes} | Average Reward (last 100): {avg_reward:.2f}")
                if avg_reward >= 475:
                    print("\nEnvironment solved!")
                    break
        return episode_rewards

    # 3. Run the training
    env = gym.make('CartPole-v1')
    rewards_history = train_a2c(env, episodes=200)
    note("A2C training complete.")

    plt.figure(figsize=(12, 7))
    plt.plot(rewards_history, label='Reward per Episode')
    plt.plot(pd.Series(rewards_history).rolling(100).mean(), label='100-episode average', lw=3)
    plt.title('A2C Training on CartPole-v1', fontsize=16)
    plt.xlabel('Episode'); plt.ylabel('Total Reward')
    plt.legend(); plt.grid(True)
    plt.show()

## 4. Exercises

1.  **The Deadly Triad:** Explain the "deadly triad" in your own words. For each of the three components (function approximation, bootstrapping, off-policy learning), describe what it is and why it is desirable. Then, explain how the DQN algorithm's two main innovations (Experience Replay and Fixed Target Networks) specifically address the instability caused by combining these three components.

2.  **Value vs. Policy-Based Methods:** What is the fundamental difference in what value-based (e.g., DQN) and policy-based (e.g., REINFORCE) methods learn? What are the primary advantages and disadvantages of each approach? Why are Actor-Critic methods considered a hybrid approach that captures the best of both worlds?

3.  **The Role of the Advantage Function:** In the A2C code lab, we calculate the advantage as `advantage = returns - values`. Explain the intuition behind this calculation. Why is using the advantage $A(s, a)$ to weight the policy gradient update $\nabla_\theta \log \pi_\theta(a_t|s_t)$ superior to using the raw return $G_t$, as is done in the REINFORCE algorithm? Relate your answer to the concept of variance reduction.

4.  **PPO's Clipped Objective:** The PPO algorithm is the current workhorse of DRL. Its key innovation is the clipped surrogate objective function. Explain the purpose of the $\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ term. How does this term prevent the policy from changing too drastically in a single update, and why is this property crucial for stable and efficient learning?