# Deep Q-Network (DQN) and Variants

Understanding and implementing advanced Deep Reinforcement Learning algorithms.

By the end of this notebook, you will be able to:
* Implement a DQN agent from scratch
* Understand experience replay and target networks
* Improve upon DQN with Double DQN
* Leverage architectural advantages with Dueling DQN
* Train and evaluate agents on control tasks

## Table of Contents
- [1. Packages](#1)
- [2. Introduction to Deep Reinforcement Learning](#2)
    - [2.1 The Challenge of Q-Learning](#2-1)
    - [2.2 Deep Q-Networks (DQN)](#2-2)
- [3. PyTorch Fundamentals for RL](#3)
    - [3.1 Tensors and Autograd](#3-1)
    - [3.2 Neural Networks for Q-Approximation](#3-2)
- [4. Exercise 1: Implement DQN Network](#4)
    - [Exercise 1 - implement_dqn_network](#ex-1)
- [5. Exercise 2: Implement Replay Buffer](#5)
    - [Exercise 2 - implement_replay_buffer](#ex-2)
- [6. Exercise 3: Implement DQN Update](#6)
    - [Exercise 3 - implement_dqn_update](#ex-3)
- [7. Double DQN: Reducing Overestimation](#7)
    - [7.1 The Overestimation Problem](#7-1)
    - [Exercise 4 - implement_double_dqn](#ex-4)
- [8. Dueling DQN: Architectural Improvements](#8)
    - [8.1 Value-Advantage Decomposition](#8-1)
    - [Exercise 5 - implement_dueling_dqn](#ex-5)
- [9. Algorithm Comparison](#9)
    - [9.1 Comparative Analysis](#9-1)
    - [9.2 Implementation Differences](#9-2)

<a name='1'></a>
## 1. Packages

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
from collections import deque, namedtuple
import random
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Import DQN utilities
import sys
sys.path.append('/home/user/Reinforcement-learning-guide/notebooks')
from dqn_utils import *

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
random.seed(42)

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
print(f"PyTorch Version: {torch.__version__}")

# Plotting style
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

<a name='2'></a>
## 2. Introduction to Deep Reinforcement Learning

Deep Reinforcement Learning combines two powerful paradigms:
1. **Reinforcement Learning**: Learning through interaction and reward signals
2. **Deep Neural Networks**: Function approximation for high-dimensional problems

<a name='2-1'></a>
### 2.1 The Challenge of Q-Learning

In traditional Q-Learning, we maintain a table Q[s,a] for all state-action pairs:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]$$
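
To make the update rule concrete, here is a minimal sketch of a single tabular Q-learning step. The two-state table, the reward, and the values of `alpha` and `gamma` are invented for illustration.

```python
# Toy tabular Q-learning step; states, actions and numbers are illustrative only.
alpha, gamma = 0.5, 0.9

# Q[state][action]: two states with two actions each
Q = {
    "s0": [0.0, 0.0],
    "s1": [1.0, 2.0],
}

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    td_target = r + gamma * max(Q[s_next])   # r + gamma * max_a' Q(s', a')
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

td_error = q_learning_update(Q, "s0", 1, r=1.0, s_next="s1", alpha=alpha, gamma=gamma)
print(f"TD error: {td_error:.2f}, updated Q(s0, 1): {Q['s0'][1]:.2f}")
```

Here the target is 1.0 + 0.9 × 2.0 = 2.8, so the entry moves halfway from 0.0 towards 2.8.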

**Limitations:**
- Infeasible for large/continuous state spaces
- Cannot generalize to unseen states
- Curse of dimensionality (images, complex features)

<a name='2-2'></a>
### 2.2 Deep Q-Networks (DQN)

**Solution:** Approximate Q-values with a neural network:

$$Q(s,a) \approx Q_\theta(s,a) = \text{Network}(s, \theta)$$

**Key Innovations from Mnih et al. (2015):**
1. **Experience Replay**: Break temporal correlations by training on random batches
2. **Target Network**: Separate network for computing targets, updated periodically

**Bellman Equation for DQN:**
$$Q_\text{target}(s,a) = r + \gamma \max_{a'} Q_\text{target}(s', a')$$

**Loss Function:**
$$\mathcal{L}(\theta) = \mathbb{E}\left[ \left( Q_\text{target}(s,a) - Q_\theta(s,a) \right)^2 \right]$$
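
As a worked example of the target and loss above, the sketch below evaluates them with NumPy for a hand-written batch of three transitions. The Q-value arrays stand in for network outputs and are invented for illustration. The `(1 - done)` factor, which the training step in Exercise 3 also uses, removes the bootstrap term at terminal states.

```python
import numpy as np

gamma = 0.99

# Hypothetical network outputs for a batch of 3 transitions and 2 actions
q_online = np.array([[1.0, 2.0], [0.5, 0.0], [3.0, 1.0]])       # Q_theta(s, .)
q_target_next = np.array([[2.0, 4.0], [1.0, 1.5], [0.0, 5.0]])  # Q_target(s', .)

actions = np.array([1, 0, 0])
rewards = np.array([1.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])   # the last transition is terminal

# Target: r + gamma * (1 - done) * max_a' Q_target(s', a')
targets = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

# Q_theta(s, a) for the actions that were actually taken
q_taken = q_online[np.arange(len(actions)), actions]

# Mean squared error between targets and current estimates
loss = np.mean((targets - q_taken) ** 2)
print("targets:", targets)
print("loss:", round(float(loss), 4))
```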

<a name='3'></a>
## 3. PyTorch Fundamentals for RL

<a name='3-1'></a>
### 3.1 Tensors and Autograd

PyTorch provides:
- **Tensors**: GPU-accelerated arrays with automatic differentiation
- **Autograd**: Automatic computation of gradients
- **nn.Module**: Framework for building neural networks

In [None]:
# Example: Basic tensor operations and autograd

# Create tensors
x = torch.randn(2, 3, requires_grad=True)
y = torch.randn(3, 4, requires_grad=True)

# Forward pass
z = torch.mm(x, y)
loss = z.sum()

# Backward pass
loss.backward()

print(f"x gradient shape: {x.grad.shape}")
print(f"Gradient exists: {x.grad is not None}")
print(f"Loss value: {loss.item():.4f}")

<a name='3-2'></a>
### 3.2 Neural Networks for Q-Approximation

We need a network that:
- Takes state as input
- Outputs Q-value for each action
- Is differentiable for gradient-based optimization

In [None]:
# Example architecture
class ExampleQNetwork(nn.Module):
    """Simple Q-Network architecture for demonstration"""
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(ExampleQNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
    
    def forward(self, state):
        return self.net(state)

# Test on CartPole-v1
env_test = gym.make('CartPole-v1')
state_dim = env_test.observation_space.shape[0]
action_dim = env_test.action_space.n

print(f"CartPole Environment:")
print(f"  State dimension: {state_dim}")
print(f"  Action dimension: {action_dim}")

# Create and test network
q_net = ExampleQNetwork(state_dim, action_dim).to(device)
print(f"\nNetwork architecture:")
print_network_summary(q_net, (1, state_dim))

# Forward pass
dummy_state = torch.randn(1, state_dim).to(device)
q_values = q_net(dummy_state)
print(f"\nQ-values output shape: {q_values.shape}")
print(f"Q-values: {q_values.cpu().detach().numpy()}")

env_test.close()

<a name='4'></a>
## 4. Exercise 1: Implement DQN Network

<a name='ex-1'></a>
### Exercise 1 - implement_dqn_network

**Instructions:** Implement a basic DQN network that:
1. Takes state vector as input
2. Passes through two hidden layers with ReLU activation
3. Outputs Q-values for all actions

**Network Architecture:**
```
Input (state_dim) → Hidden (hidden_dim) → ReLU → Hidden (hidden_dim) → ReLU → Output (action_dim)
```
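
As a sanity check on this architecture, the number of trainable parameters can be counted by hand. The sketch below does so for the sizes used in the test call of this exercise (`state_dim=4`, `action_dim=2`, `hidden_dim=128`). `count_mlp_parameters` is a helper written for this illustration and is not part of `dqn_utils`.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    A linear layer from n_in to n_out units has n_in * n_out weights
    plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

n_params = count_mlp_parameters([4, 128, 128, 2])
print(n_params)  # 17410
```

Once the `DQN` class is implemented, `sum(p.numel() for p in DQN(4, 2).parameters())` should give the same number.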

**Hints:**
- Use `nn.Sequential` for clean architecture
- ReLU activation between layers
- No activation on output (raw Q-values)
- Initialize properly using PyTorch defaults

In [None]:
# GRADED FUNCTION: DQN Network Implementation

class DQN(nn.Module):
    """
    Deep Q-Network for approximating Q(s,a) values.
    
    Args:
        state_dim (int): Dimension of state space
        action_dim (int): Dimension of action space
        hidden_dim (int): Dimension of hidden layers (default: 128)
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(DQN, self).__init__()
        
        # YOUR CODE STARTS HERE
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )
        # YOUR CODE ENDS HERE
    
    def forward(self, x):
        """
        Forward pass: state → Q-values
        
        Args:
            x (torch.Tensor): State tensor of shape (batch_size, state_dim)
            
        Returns:
            torch.Tensor: Q-values of shape (batch_size, action_dim)
        """
        # YOUR CODE STARTS HERE
        return self.network(x)
        # YOUR CODE ENDS HERE

# Test Exercise 1
print("Testing Exercise 1: DQN Network Implementation")
print("="*60)

test_dqn_network(DQN, state_dim=4, action_dim=2, hidden_dim=128)

<font color='blue'>
    
**What you should remember**:
- Neural networks can approximate Q-values: $Q(s,a) \approx Q_\theta(s,a)$
- Use ReLU activations for hidden layers to enable non-linear function approximation
- Output layer has no activation - Q-values can be positive or negative
- Forward pass takes a state vector and outputs Q-values for all actions
</font>

<a name='5'></a>
## 5. Exercise 2: Implement Replay Buffer

<a name='ex-2'></a>
### Exercise 2 - implement_replay_buffer

**Instructions:** Implement an Experience Replay Buffer that:
1. Stores transitions (s, a, r, s', done) in a deque
2. Has a maximum capacity
3. Can sample random mini-batches
4. Returns Transition namedtuples

**Why Experience Replay?**
- **Breaks temporal correlations**: Consecutive transitions are highly correlated
- **Enables data reuse**: Same experience used multiple times
- **Improves stability**: Random sampling prevents biased updates

**Key Methods:**
- `push()`: Store a transition
- `sample()`: Get random batch of transitions
- `__len__()`: Return current buffer size

In [None]:
# GRADED FUNCTION: Replay Buffer Implementation

class ReplayBuffer:
    """
    Experience Replay Buffer for storing and sampling transitions.
    
    Args:
        capacity (int): Maximum number of transitions to store (default: 10000)
    """
    
    def __init__(self, capacity=10000):
        # YOUR CODE STARTS HERE
        self.buffer = deque(maxlen=capacity)
        # YOUR CODE ENDS HERE
    
    def push(self, state, action, reward, next_state, done):
        """
        Store a transition in the buffer.
        
        Args:
            state: Current state
            action: Action taken
            reward: Reward received
            next_state: Next state
            done: Whether episode ended
        """
        # YOUR CODE STARTS HERE
        self.buffer.append(Transition(state, action, reward, next_state, done))
        # YOUR CODE ENDS HERE
    
    def sample(self, batch_size):
        """
        Sample a random batch of transitions.
        
        Args:
            batch_size (int): Number of transitions to sample
            
        Returns:
            list: List of Transition namedtuples
        """
        # YOUR CODE STARTS HERE
        return random.sample(self.buffer, batch_size)
        # YOUR CODE ENDS HERE
    
    def __len__(self):
        """
        Get current buffer size.
        
        Returns:
            int: Number of transitions currently in buffer
        """
        # YOUR CODE STARTS HERE
        return len(self.buffer)
        # YOUR CODE ENDS HERE

# Test Exercise 2
print("Testing Exercise 2: Replay Buffer Implementation")
print("="*60)

test_replay_buffer(ReplayBuffer, capacity=10000)

<font color='blue'>
    
**What you should remember**:
- Experience Replay breaks temporal correlations in training data
- Use a deque with maxlen for efficient circular buffer management
- Sample uniformly at random, so every stored experience is equally likely to be drawn (Prioritized Experience Replay, listed under Recommended Extensions, relaxes this)
- Sufficient buffer size is critical: typically 10,000 to 1,000,000 transitions
</font>
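
The eviction behaviour of `deque(maxlen=...)` is easy to check in isolation. A minimal sketch with a capacity of 3 and integers standing in for transitions:

```python
from collections import deque
import random

buffer = deque(maxlen=3)
for transition in range(5):   # push 0, 1, 2, 3, 4
    buffer.append(transition)

# Once the buffer is full, each append silently drops the oldest item
print(list(buffer))           # [2, 3, 4]

# random.sample draws without replacement, so a batch has no duplicates
random.seed(0)
batch = random.sample(list(buffer), 2)
print(len(batch), set(batch) <= {2, 3, 4})
```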

<a name='6'></a>
## 6. Exercise 3: Implement DQN Update

<a name='ex-3'></a>
### Exercise 3 - implement_dqn_update

**Instructions:** Implement a complete DQN Agent with:
1. Q-network and target network
2. Action selection with ε-greedy exploration
3. Training step using Bellman equation
4. Epsilon decay for reducing exploration over time
5. Target network updates

**DQN Training Loop:**
1. Select action using ε-greedy: with probability ε take random action, else take argmax_a Q(s,a)
2. Execute action and observe reward and next state
3. Store transition in replay buffer
4. Sample mini-batch from buffer
5. Compute target: $Q_\text{target} = r + \gamma (1-\text{done}) \max_{a'} Q_\text{target}(s', a')$
6. Update Q-network: minimize $(Q_\text{target} - Q(s,a))^2$
7. Periodically update target network

In [None]:
# GRADED FUNCTION: DQN Agent Implementation

class DQNAgent:
    """
    Deep Q-Network Agent with experience replay and target network.
    """
    
    def __init__(self, state_dim, action_dim,
                 learning_rate=1e-3,
                 gamma=0.99,
                 epsilon_start=1.0,
                 epsilon_end=0.01,
                 epsilon_decay=500,
                 buffer_size=10000,
                 batch_size=64,
                 target_update=10):
        """
        Initialize DQN Agent.
        
        Args:
            state_dim: State space dimension
            action_dim: Action space dimension
            learning_rate: Learning rate for optimizer
            gamma: Discount factor
            epsilon_start: Initial exploration rate
            epsilon_end: Minimum exploration rate
            epsilon_decay: Time constant of the exponential epsilon decay, in steps (larger values decay more slowly)
            buffer_size: Replay buffer capacity
            batch_size: Training batch size
            target_update: Steps between target network updates
        """
        # YOUR CODE STARTS HERE
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update = target_update
        self.device = device
        self.steps = 0
        
        # Networks
        self.q_network = DQN(state_dim, action_dim).to(self.device)
        self.target_network = DQN(state_dim, action_dim).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.target_network.eval()
        
        # Optimizer and replay buffer
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.replay_buffer = ReplayBuffer(buffer_size)
        # YOUR CODE ENDS HERE
    
    def get_action(self, state, training=True):
        """
        Select action using ε-greedy policy.
        
        Args:
            state: Current state
            training: If True, use exploration. If False, use pure exploitation
            
        Returns:
            int: Selected action
        """
        # YOUR CODE STARTS HERE
        if training and random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            q_values = self.q_network(state_tensor)
            return q_values.argmax(dim=1).item()
        # YOUR CODE ENDS HERE
    
    def update_epsilon(self):
        """
        Decay epsilon using exponential decay formula:
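        epsilon = epsilon_end + (epsilon_start - epsilon_end) * exp(-steps / epsilon_decay)

        Note: no method of this class increments self.steps, so the training
        loop must advance it before calling update_epsilon(); otherwise
        epsilon stays at epsilon_start.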
        """
        # YOUR CODE STARTS HERE
        self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
                       np.exp(-self.steps / self.epsilon_decay)
        # YOUR CODE ENDS HERE
    
    def store_transition(self, state, action, reward, next_state, done):
        """
        Store transition in replay buffer.
        """
        # YOUR CODE STARTS HERE
        self.replay_buffer.push(state, action, reward, next_state, done)
        # YOUR CODE ENDS HERE
    
    def train_step(self):
        """
        Perform one training step:
        1. Sample batch from replay buffer
        2. Compute targets using target network
        3. Compute loss (MSE between current and target Q-values)
        4. Update Q-network parameters
        
        Returns:
            float: Loss value for this batch (or None if buffer too small)
        """
        # YOUR CODE STARTS HERE
        if len(self.replay_buffer) < self.batch_size:
            return None
        
        # Sample batch
        transitions = self.replay_buffer.sample(self.batch_size)
        batch = Transition(*zip(*transitions))
        
        # Convert to tensors
        state_batch = torch.FloatTensor(np.array(batch.state)).to(self.device)
        action_batch = torch.LongTensor(batch.action).unsqueeze(1).to(self.device)
        reward_batch = torch.FloatTensor(batch.reward).to(self.device)
        next_state_batch = torch.FloatTensor(np.array(batch.next_state)).to(self.device)
        done_batch = torch.FloatTensor(batch.done).to(self.device)
        
        # Current Q-values
        current_q = self.q_network(state_batch).gather(1, action_batch)
        
        # Target Q-values
        with torch.no_grad():
            next_q = self.target_network(next_state_batch).max(1)[0]
            target_q = reward_batch + (1 - done_batch) * self.gamma * next_q
        
        # Loss
        # squeeze(1) removes only the action dimension, so a batch of size 1 keeps shape (1,)
        loss = nn.MSELoss()(current_q.squeeze(1), target_q)
        
        # Optimization
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 1.0)
        self.optimizer.step()
        
        return loss.item()
        # YOUR CODE ENDS HERE
    
    def update_target_network(self):
        """
        Hard update: Copy Q-network parameters to target network.
        Keeping the target network fixed between copies holds the regression
        targets stable while the Q-network is being trained.
        """
        # YOUR CODE STARTS HERE
        self.target_network.load_state_dict(self.q_network.state_dict())
        # YOUR CODE ENDS HERE

# Test Exercise 3
print("Testing Exercise 3: DQN Agent Training")
print("="*60)

test_dqn_update(DQNAgent, state_dim=4, action_dim=2)

<font color='blue'>
    
**What you should remember**:
- **ε-greedy exploration**: Balance exploration (random) vs exploitation (argmax)
- **Bellman equation target**: $Q_\text{target} = r + \gamma (1-\text{done}) \max_{a'} Q_\text{target}(s', a')$
- **Target network**: A separate, periodically synced network mitigates the moving-target problem
- **No gradients through target network**: Use `torch.no_grad()` when computing targets
- **Gradient clipping**: Prevents exploding gradients, improves stability
</font>
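
To get a feel for the exponential schedule in `update_epsilon`, the sketch below evaluates it at a few step counts using the agent's default hyperparameters. `epsilon_at` is a standalone helper written for this illustration.

```python
import math

def epsilon_at(step, epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=500):
    """Exponentially decayed exploration rate after `step` steps."""
    return epsilon_end + (epsilon_start - epsilon_end) * math.exp(-step / epsilon_decay)

for step in (0, 500, 1000, 2500):
    print(f"step {step:>4}: epsilon = {epsilon_at(step):.3f}")
```

With these defaults, epsilon has dropped to roughly 0.37 after 500 steps, which shows that `epsilon_decay` acts as a time constant measured in steps rather than as a per-step multiplier.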

<a name='7'></a>
## 7. Double DQN: Reducing Overestimation

<a name='7-1'></a>
### 7.1 The Overestimation Problem

**Problem with Standard DQN:**

In DQN, the target uses $\max_a$ which can lead to overestimation:
$$Q_\text{target} = r + \gamma \max_{a'} Q_\text{target}(s', a')$$

If Q-values are noisy (as they are early in training), the maximum can be biased upward.

**Example:**
- True Q-values: [0.5, 0.5, 0.5]
- Estimated Q-values: [0.7, 0.5, 0.3] (with noise)
- DQN selects: max = 0.7 (overestimated!)
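
The upward bias can be reproduced with a short simulation. The sketch below assumes three actions that all have true value 0.5 and adds independent Gaussian noise to each estimate (the standard deviation of 0.2 is an arbitrary choice). It compares the plain max with a double estimator that selects the action from one noisy set of estimates and evaluates it with a second, independent set.

```python
import random

random.seed(0)
TRUE_Q = [0.5, 0.5, 0.5]
NOISE_STD = 0.2
N_TRIALS = 10_000

def noisy_estimates():
    """Return one noisy estimate per action."""
    return [q + random.gauss(0.0, NOISE_STD) for q in TRUE_Q]

single_total = 0.0
double_total = 0.0
for _ in range(N_TRIALS):
    est_a = noisy_estimates()
    est_b = noisy_estimates()
    # Single estimator: select and evaluate with the same noisy values
    single_total += max(est_a)
    # Double estimator: select with est_a, evaluate with the independent est_b
    best_action = est_a.index(max(est_a))
    double_total += est_b[best_action]

single_mean = single_total / N_TRIALS
double_mean = double_total / N_TRIALS
print(f"true max:              {max(TRUE_Q):.3f}")
print(f"single-estimator mean: {single_mean:.3f}")
print(f"double-estimator mean: {double_mean:.3f}")
```

The single estimator averages well above 0.5, while the double estimator stays close to it. In a real agent the two sets of estimates are not fully independent, so the bias is reduced rather than removed.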

**Solution: Double DQN**

From van Hasselt et al. (2015), decouple action selection from evaluation:
$$Q_\text{target} = r + \gamma Q_\text{target}\left(s', \arg\max_{a'} Q(s', a')\right)$$

Key insight: Use online network to **select** action, target network to **evaluate** it.

**Benefits:**
- Reduces overestimation bias
- More stable learning
- Minimal code change (literally 1-2 lines!)

<a name='ex-4'></a>
### Exercise 4 - implement_double_dqn

**Instructions:** Modify the DQN training step to implement Double DQN:

**Original DQN update:**
```python
next_q = target_network(next_state_batch).max(1)[0]
```

**Double DQN update:**
```python
# 1. Select action with online network
next_actions = q_network(next_state_batch).argmax(1, keepdim=True)
# 2. Evaluate with target network
next_q = target_network(next_state_batch).gather(1, next_actions).squeeze(1)
```

In [None]:
# GRADED FUNCTION: Double DQN Agent

class DoubleDQNAgent(DQNAgent):
    """
    Double DQN Agent - Reduces overestimation of Q-values.
    
    Key difference: Use online network for action selection,
    target network for action evaluation.
    """
    
    def train_step(self):
        """
        Double DQN training step.
        
        The only difference from DQN is in the target computation:
        - DQN: target = r + γ * max_a' Q_target(s', a')
        - DoubleDQN: target = r + γ * Q_target(s', argmax_a' Q_online(s', a'))
        """
        if len(self.replay_buffer) < self.batch_size:
            return None
        
        # Sample batch
        transitions = self.replay_buffer.sample(self.batch_size)
        batch = Transition(*zip(*transitions))
        
        # Convert to tensors
        state_batch = torch.FloatTensor(np.array(batch.state)).to(self.device)
        action_batch = torch.LongTensor(batch.action).unsqueeze(1).to(self.device)
        reward_batch = torch.FloatTensor(batch.reward).to(self.device)
        next_state_batch = torch.FloatTensor(np.array(batch.next_state)).to(self.device)
        done_batch = torch.FloatTensor(batch.done).to(self.device)
        
        # Current Q-values
        current_q = self.q_network(state_batch).gather(1, action_batch)
        
        # DOUBLE DQN TARGET COMPUTATION
        with torch.no_grad():
            # YOUR CODE STARTS HERE
            # Step 1: Select action using ONLINE network
            next_actions = self.q_network(next_state_batch).argmax(1, keepdim=True)
            
            # Step 2: Evaluate action using TARGET network
            next_q = self.target_network(next_state_batch).gather(1, next_actions).squeeze(1)
            
            # Compute target
            target_q = reward_batch + (1 - done_batch) * self.gamma * next_q
            # YOUR CODE ENDS HERE
        
        # Loss
        # squeeze(1) removes only the action dimension, so a batch of size 1 keeps shape (1,)
        loss = nn.MSELoss()(current_q.squeeze(1), target_q)
        
        # Optimization
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.q_network.parameters(), 1.0)
        self.optimizer.step()
        
        return loss.item()

# Test Exercise 4
print("Testing Exercise 4: Double DQN Implementation")
print("="*60)

test_double_dqn_update(DoubleDQNAgent, state_dim=4, action_dim=2)

<font color='blue'>
    
**What you should remember**:
- **Overestimation bias**: Taking max of noisy estimates biases upward
- **Decouple selection and evaluation**: Use two networks for different roles
- **Minimal code change**: Double DQN requires only 1-2 line modifications
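- **Low overhead**: Better stability for the cost of one extra forward pass of the online network on the next states
- **Principle**: Selecting an action with one estimator and evaluating it with another is the idea behind Double Q-learning, which Double DQN adapts to deep networks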
</font>
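
The two targets can be compared by hand on a small batch. The NumPy sketch below uses invented Q-value arrays in place of network outputs, and `np.take_along_axis` plays the role of `gather`.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 1.0])
dones = np.array([0.0, 0.0])

# Hypothetical outputs for the next states s' (batch of 2, 3 actions)
q_online_next = np.array([[0.2, 0.9, 0.1], [0.7, 0.3, 0.6]])
q_target_next = np.array([[1.0, 0.4, 0.8], [0.5, 0.2, 0.9]])

# DQN: the target network both selects and evaluates the action
dqn_next_q = q_target_next.max(axis=1)

# Double DQN: the online network selects, the target network evaluates
next_actions = q_online_next.argmax(axis=1)[:, None]          # shape (2, 1)
ddqn_next_q = np.take_along_axis(q_target_next, next_actions, axis=1).squeeze(1)

dqn_target = rewards + gamma * (1.0 - dones) * dqn_next_q
ddqn_target = rewards + gamma * (1.0 - dones) * ddqn_next_q
print("DQN targets:       ", dqn_target)
print("Double DQN targets:", ddqn_target)
```

Because the evaluated action is not necessarily the target network's own maximizer, the Double DQN target can never exceed the DQN target for the same transition.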

<a name='8'></a>
## 8. Dueling DQN: Architectural Improvements

<a name='8-1'></a>
### 8.1 Value-Advantage Decomposition

**Key Insight (Wang et al., 2016):**

Decompose Q-function into:
1. **Value function V(s)**: How good is state s? (independent of action)
2. **Advantage function A(s,a)**: How much better is action a vs others?

$$Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a')$$

**Why This Helps:**
- Better learning of state values when actions have similar effects
- Improved generalization in large action spaces
- Clearer signal: V learns "which states are good", A learns "which actions matter"

**Example:**
- State: "Enemy far away"
  - V(s) = 0.8 (good state)
  - A(s, move_left) = 0.1 (slightly better)
  - A(s, move_right) = 0.1 (slightly better)
  - A(s, attack) = -0.2 (worse, enemy too far)
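
Plugging the numbers from this example into the aggregation formula gives the Q-values directly. The sketch below also shifts every advantage by a constant to show that the Q-values are unaffected: subtracting the mean pins V(s) to the average of Q(s, ·). `dueling_q_values` is a helper written for this illustration.

```python
def dueling_q_values(value, advantages):
    """Q(s,a) = V(s) + A(s,a) - mean over a' of A(s,a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (adv - mean_adv) for adv in advantages]

value = 0.8
advantages = {"move_left": 0.1, "move_right": 0.1, "attack": -0.2}

q = dueling_q_values(value, list(advantages.values()))
for action, q_sa in zip(advantages, q):
    print(f"Q(s, {action}) = {q_sa:.2f}")

# Adding the same constant to every advantage leaves Q unchanged
shifted = [adv + 5.0 for adv in advantages.values()]
q_shifted = dueling_q_values(value, shifted)
print(all(abs(a - b) < 1e-9 for a, b in zip(q, q_shifted)))
```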

**Architecture:**
```
Input (state) → Shared Features (hidden) ─┬→ Value Stream     → V(s)   ─┐
                                          └→ Advantage Stream → A(s,a) ─┴→ Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
```

<a name='ex-5'></a>
### Exercise 5 - implement_dueling_dqn

**Instructions:** Implement a Dueling DQN network with:
1. **Shared feature extraction layer**
2. **Separate Value stream**: outputs single value V(s)
3. **Separate Advantage stream**: outputs A(s,a) for each action
4. **Aggregation layer**: Combine V and A via the formula above
5. **Optional method**: `get_value_and_advantage()` for analysis

**Key Implementation Details:**
- Value stream should output shape (batch, 1)
- Advantage stream should output shape (batch, action_dim)
- Subtract the mean of the advantages so that V and A are identifiable (otherwise a constant could be moved between them without changing Q)
- Use keepdim=True in mean() to preserve dimensions for broadcasting
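
The role of `keepdim=True` can be seen with NumPy, whose `keepdims` argument behaves analogously. In the sketch below (arbitrary numbers, batch of 2, 3 actions), dropping the dimension makes the subtraction fail because shapes `(2, 3)` and `(2,)` do not broadcast.

```python
import numpy as np

value = np.array([[1.0], [2.0]])                           # shape (2, 1)
advantage = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]])   # shape (2, 3)

# Keeping the reduced axis gives shape (2, 1), which broadcasts per row
mean_kept = advantage.mean(axis=1, keepdims=True)
q_values = value + (advantage - mean_kept)                 # shape (2, 3)
print(q_values)

# Dropping the axis gives shape (2,), which cannot broadcast against (2, 3)
mean_dropped = advantage.mean(axis=1)
try:
    advantage - mean_dropped
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print("broadcast failed without keepdims:", broadcast_failed)
```

When the batch size happens to equal the number of actions, the shapes do broadcast, but the mean is then subtracted along the wrong axis without raising any error, which is much harder to spot.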

In [None]:
# GRADED FUNCTION: Dueling DQN Network

class DuelingDQN(nn.Module):
    """
    Dueling DQN Network Architecture.
    
    Separates Q-value computation into:
    - Value stream V(s): scalar value of state
    - Advantage stream A(s,a): advantage of each action
    
    Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(DuelingDQN, self).__init__()
        
        # YOUR CODE STARTS HERE
        # Shared feature extraction layer
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Value stream: outputs single value V(s)
        self.value_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1)
        )
        
        # Advantage stream: outputs A(s,a) for each action
        self.advantage_stream = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, action_dim)
        )
        
        self.action_dim = action_dim
        # YOUR CODE ENDS HERE
    
    def forward(self, x):
        """
        Forward pass computing Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
        
        Args:
            x (torch.Tensor): State tensor (batch_size, state_dim)
            
        Returns:
            torch.Tensor: Q-values (batch_size, action_dim)
        """
        # YOUR CODE STARTS HERE
        # Feature extraction
        features = self.feature(x)
        
        # Value stream
        value = self.value_stream(features)  # (batch, 1)
        
        # Advantage stream
        advantage = self.advantage_stream(features)  # (batch, action_dim)
        
        # Aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a))
        q_values = value + (advantage - advantage.mean(dim=1, keepdim=True))
        
        return q_values
        # YOUR CODE ENDS HERE
    
    def get_value_and_advantage(self, x):
        """
        Get Value and Advantage separately (for analysis/debugging).
        
        Args:
            x (torch.Tensor): State tensor
            
        Returns:
            tuple: (value, advantage) tensors
        """
        # YOUR CODE STARTS HERE
        features = self.feature(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        return value, advantage
        # YOUR CODE ENDS HERE

# Test Exercise 5
print("Testing Exercise 5: Dueling DQN Architecture")
print("="*60)

test_dueling_dqn_architecture(DuelingDQN, state_dim=4, action_dim=2, hidden_dim=128)

<font color='blue'>
    
**What you should remember**:
- **Dueling architecture**: Separates state value from action advantages
- **Value function V(s)**: Depends only on state, represents intrinsic state quality
- **Advantage function A(s,a)**: Depends on both state and action
- **Aggregation formula**: Subtracting the mean advantage makes V and A identifiable; without it, a constant could shift between them without changing Q
- **When to use**: Large action spaces or when actions have redundant effects
</font>

<a name='9'></a>
## 9. Algorithm Comparison

<a name='9-1'></a>
### 9.1 Comparative Analysis

#### Comparison Table

<table style="border-collapse: collapse; width: 100%;">
  <tr style="background-color: #f2f2f2;">
    <th style="border: 1px solid black; padding: 10px; text-align: left;"><b>Aspect</b></th>
    <th style="border: 1px solid black; padding: 10px; text-align: left;"><b>DQN</b></th>
    <th style="border: 1px solid black; padding: 10px; text-align: left;"><b>Double DQN</b></th>
    <th style="border: 1px solid black; padding: 10px; text-align: left;"><b>Dueling DQN</b></th>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 10px;"><b>Update Equation</b></td>
    <td style="border: 1px solid black; padding: 10px;">r + γ max_a' Q_t(s',a')</td>
    <td style="border: 1px solid black; padding: 10px;">r + γ Q_t(s', argmax Q(s',a'))</td>
    <td style="border: 1px solid black; padding: 10px;">Same target as DQN; Q(s,a) = V(s) + (A(s,a) - mean A)</td>
  </tr>
  <tr style="background-color: #f9f9f9;">
    <td style="border: 1px solid black; padding: 10px;"><b>Main Problem Addressed</b></td>
    <td style="border: 1px solid black; padding: 10px;">Correlation in RL data</td>
    <td style="border: 1px solid black; padding: 10px;">Overestimation of Q-values</td>
    <td style="border: 1px solid black; padding: 10px;">Learning efficiency</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 10px;"><b>Network Complexity</b></td>
    <td style="border: 1px solid black; padding: 10px;">Simple sequential</td>
    <td style="border: 1px solid black; padding: 10px;">Simple sequential</td>
    <td style="border: 1px solid black; padding: 10px;">Dual stream</td>
  </tr>
  <tr style="background-color: #f9f9f9;">
    <td style="border: 1px solid black; padding: 10px;"><b>Code Complexity</b></td>
    <td style="border: 1px solid black; padding: 10px;">Medium</td>
    <td style="border: 1px solid black; padding: 10px;">Very Low (1-2 line change)</td>
    <td style="border: 1px solid black; padding: 10px;">Medium</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 10px;"><b>Convergence Speed</b></td>
    <td style="border: 1px solid black; padding: 10px;">Good</td>
    <td style="border: 1px solid black; padding: 10px;">Better</td>
    <td style="border: 1px solid black; padding: 10px;">Best (large action spaces)</td>
  </tr>
  <tr style="background-color: #f9f9f9;">
    <td style="border: 1px solid black; padding: 10px;"><b>Memory Overhead</b></td>
    <td style="border: 1px solid black; padding: 10px;">Baseline</td>
    <td style="border: 1px solid black; padding: 10px;">Baseline</td>
    <td style="border: 1px solid black; padding: 10px;">+1 extra stream</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 10px;"><b>Stability</b></td>
    <td style="border: 1px solid black; padding: 10px;">Good</td>
    <td style="border: 1px solid black; padding: 10px;">Excellent</td>
    <td style="border: 1px solid black; padding: 10px;">Excellent</td>
  </tr>
  <tr style="background-color: #f9f9f9;">
    <td style="border: 1px solid black; padding: 10px;"><b>Best Use Cases</b></td>
    <td style="border: 1px solid black; padding: 10px;">Baseline, learning</td>
    <td style="border: 1px solid black; padding: 10px;">General purpose</td>
    <td style="border: 1px solid black; padding: 10px;">Large action spaces</td>
  </tr>
  <tr>
    <td style="border: 1px solid black; padding: 10px;"><b>Paper Citation</b></td>
    <td style="border: 1px solid black; padding: 10px;">Mnih et al. (2015)</td>
    <td style="border: 1px solid black; padding: 10px;">van Hasselt et al. (2015)</td>
    <td style="border: 1px solid black; padding: 10px;">Wang et al. (2016)</td>
  </tr>
</table>

<a name='9-2'></a>
### 9.2 Implementation Differences

#### Key Code Differences

**DQN - Target Computation:**
```python
next_q = target_network(next_state_batch).max(1)[0]
```

**Double DQN - Target Computation:**
```python
next_actions = q_network(next_state_batch).argmax(1, keepdim=True)
next_q = target_network(next_state_batch).gather(1, next_actions).squeeze(1)
```

**Dueling DQN - Architecture:**
```python
class DuelingDQN(nn.Module):
    def forward(self, x):
        features = self.feature(x)
        value = self.value_stream(features)
        advantage = self.advantage_stream(features)
        return value + (advantage - advantage.mean(dim=1, keepdim=True))
```

## Summary and Next Steps

### What You've Learned

1. **DQN Fundamentals**
   - Function approximation with neural networks
   - Experience replay for decorrelation
   - Target network for stability

2. **Double DQN**
   - Addresses overestimation by decoupling action selection from action evaluation
   - Minimal code change with significant benefits

3. **Dueling DQN**
   - Architectural innovation: Value + Advantage streams
   - Better learning in complex environments

### Key Implementation Skills

- Building neural networks with PyTorch
- Experience replay buffer management
- Bellman equation in practice
- ε-greedy exploration strategies
- Gradient computation and optimization

### Recommended Extensions

1. **Prioritized Experience Replay (PER)**
   - Sample important transitions more frequently

2. **Soft Target Updates**
   - Smooth target network updates instead of hard copies

3. **Noisy Networks**
   - Learned parametric noise in the network weights drives exploration, replacing ε-greedy

4. **Rainbow DQN**
   - Combines DQN, Double DQN, Dueling DQN, PER, and more

5. **Policy Gradient Methods**
   - Actor-Critic, PPO, TRPO
   - Better suited to continuous action spaces, where the max over actions used by DQN is not directly computable

## Congratulations!

You have successfully completed the Deep Q-Network tutorial. You can now:

✓ Implement DQN agents from scratch  
✓ Understand and apply experience replay  
✓ Work with target networks for stability  
✓ Reduce overestimation with Double DQN  
✓ Leverage architectural innovations with Dueling DQN  
✓ Compare and select appropriate algorithms  

**Great job on your Deep Reinforcement Learning journey!**