# Double DQN - Interactive Exercise

Welcome! In this notebook, you will implement **Double DQN**, an improvement over standard DQN that reduces overestimation of Q-values.

## What is Double DQN?

Standard DQN often **overestimates** action values due to the max operator in the Bellman equation. Double DQN fixes this by **decoupling action selection from action evaluation**.

## The Overestimation Problem

**Standard DQN** computes the target as:
$$y = r + \gamma \max_{a'} Q_{target}(s', a')$$

Problem: The **same network** (the target network) selects AND evaluates the next action, so whichever action is overrated by estimation noise is also the one whose inflated value enters the target.

**Double DQN** computes the target as:
$$y = r + \gamma Q_{target}(s', \arg\max_{a'} Q_{online}(s', a'))$$

Solution: **Online network** selects the action, **Target network** evaluates it.
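
The effect of the max operator can be checked with a small simulation. The sketch below is not part of the exercise and makes simplifying assumptions: every action has a true value of 0, and the two sets of estimates carry independent Gaussian noise (`n_actions`, `n_trials`, and `noise_std` are illustrative values). In real Double DQN the target network is a lagged copy of the online network, so the two estimates are not fully independent and the bias is reduced rather than removed.

```python
import random
import statistics

random.seed(0)

n_actions = 4      # illustrative number of actions
n_trials = 5000    # number of simulated states
noise_std = 1.0    # assumed estimation noise

single, double = [], []
for _ in range(n_trials):
    # True Q-value of every action is 0, so the true maximum is 0.
    est_a = [random.gauss(0.0, noise_std) for _ in range(n_actions)]
    est_b = [random.gauss(0.0, noise_std) for _ in range(n_actions)]

    # Single estimator: select and evaluate with the same noisy estimates.
    single.append(max(est_a))

    # Double estimator: select with est_a, evaluate with independent est_b.
    best = max(range(n_actions), key=lambda a: est_a[a])
    double.append(est_b[best])

single_bias = statistics.mean(single)
double_bias = statistics.mean(double)
print(f"single estimator bias: {single_bias:+.3f}")
print(f"double estimator bias: {double_bias:+.3f}")
```

Under these assumptions the single estimator is clearly biased upward, while the double estimator averages close to the true value of 0.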

## Key Differences from DQN

| Aspect | DQN | Double DQN |
|--------|-----|------------|
| Action Selection | Target network | **Online network** |
| Action Evaluation | Target network | Target network |
| Q-value Estimates | Overestimated | More accurate |
| Implementation | Simple max | Two-step: argmax then index |
| Performance | Good | Usually **better** (more stable) |
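
To make the "two-step" row concrete, here is a tiny worked example for a single transition. All numbers are hypothetical and chosen so that the two networks disagree about the best next action.

```python
gamma = 0.9
reward = 1.0

# Hypothetical Q-values for the next state s' (two actions).
q_online_next = [1.0, 2.0]   # online network prefers action 1
q_target_next = [3.0, 0.5]   # target network overrates action 0

# Standard DQN: the target network selects (max) and evaluates.
dqn_target = reward + gamma * max(q_target_next)

# Double DQN: the online network selects, the target network evaluates.
best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
double_dqn_target = reward + gamma * q_target_next[best_action]

print(dqn_target, double_dqn_target)
```

Here the DQN target is about 3.7 while the Double DQN target is about 1.45, because the overrated action 0 is never selected by the online network.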

## Learning Objectives

By the end of this notebook, you will:
- Understand why DQN overestimates Q-values
- Implement the Double DQN loss function
- See the difference between DQN and Double DQN in practice
- Learn when to use Double DQN over standard DQN

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import gymnasium as gym
import matplotlib.pyplot as plt
from collections import deque, namedtuple
import random
from double_dqn_tests import *

In [None]:
# Set random seeds
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

## The Environment: CartPole

We'll use CartPole-v1 to compare DQN and Double DQN.

- **State**: [position, velocity, angle, angular velocity]
- **Actions**: 0 (left), 1 (right)
- **Reward**: +1 per timestep
- **Success**: Average reward of at least 475 over the last 100 episodes

In [None]:
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print(f"State dimension: {state_dim}")
print(f"Action dimension: {action_dim}")

## Exercise 1: Q-Network

The Q-network architecture is the same as standard DQN.

**Architecture**:
```
Input (state) → FC1 (128) → ReLU → FC2 (128) → ReLU → FC3 (action_dim) → Q-values
```
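
As a quick sanity check on the architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the CartPole-v1 sizes (4 state features, 2 actions) and the default `hidden_dim=128`; `linear_params` is a helper introduced only for this illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# CartPole-v1 sizes with the default hidden width (assumed for this example).
state_dim, hidden_dim, action_dim = 4, 128, 2

layer_shapes = [
    (state_dim, hidden_dim),   # fc1
    (hidden_dim, hidden_dim),  # fc2
    (hidden_dim, action_dim),  # fc3
]
total = sum(linear_params(n_in, n_out) for n_in, n_out in layer_shapes)
print(total)
```

Once your `QNetwork` is implemented, `sum(p.numel() for p in net.parameters())` should report the same total.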

**Task**: Implement the Q-network.

In [None]:
# GRADED FUNCTION: QNetwork

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        """
        Q-Network for Double DQN.
        
        Arguments:
        state_dim -- dimension of state space
        action_dim -- dimension of action space
        hidden_dim -- number of hidden units
        """
        super(QNetwork, self).__init__()
        
        # (approx. 3 lines)
        # Define three fully connected layers:
        # fc1: state_dim -> hidden_dim
        # fc2: hidden_dim -> hidden_dim
        # fc3: hidden_dim -> action_dim
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    def forward(self, state):
        """
        Forward pass to get Q-values.
        
        Arguments:
        state -- state tensor
        
        Returns:
        q_values -- Q-value for each action
        """
        # (approx. 4 lines)
        # 1. Pass through fc1 and apply ReLU
        # 2. Pass through fc2 and apply ReLU
        # 3. Pass through fc3 (no activation)
        # 4. Return Q-values
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
        
        return q_values

In [None]:
# Test your implementation
qnetwork_test(QNetwork)

## Exercise 2: Replay Buffer

Experience replay is essential for stable DQN training. The buffer is the same as in standard DQN.
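
Two standard-library behaviours do most of the work in a replay buffer: a `deque` with `maxlen` silently evicts its oldest items, and `zip(*batch)` turns a list of tuples into one tuple per field. The sketch below illustrates both with a hypothetical three-field `Step` tuple (smaller than the `Transition` used in the exercise) and is not the graded solution.

```python
import random
from collections import deque, namedtuple

# Hypothetical record type, used only for this illustration.
Step = namedtuple("Step", ["state", "action", "reward"])

buffer = deque(maxlen=3)
for t in range(5):
    buffer.append(Step(state=t, action=t % 2, reward=1.0))

# Only the 3 most recent items remain; items 0 and 1 were evicted.
stored_states = [step.state for step in buffer]
print(stored_states)

random.seed(0)
batch = random.sample(buffer, 2)  # uniform sampling without replacement

# Transpose the list of tuples into one tuple per field.
states, actions, rewards = zip(*batch)
print(states, actions, rewards)
```

The transposed tuples are what you would later convert to tensors when unpacking a batch in the loss function.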

**Task**: Implement the replay buffer with store and sample methods.

In [None]:
# GRADED FUNCTION: ReplayBuffer

Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state', 'done'])

class ReplayBuffer:
    def __init__(self, capacity=10000):
        """
        Experience replay buffer.
        
        Arguments:
        capacity -- maximum number of transitions to store
        """
        # (approx. 1 line)
        # Initialize a deque with maxlen=capacity
        
        # YOUR CODE STARTS HERE
        
        # YOUR CODE ENDS HERE
    
    def push(self, state, action, reward, next_state, done):
        """
        Store a transition.
        """
        # (approx. 1 line)
        # Append Transition namedtuple to buffer
        
        # YOUR CODE STARTS HERE
        
        # YOUR CODE ENDS HERE
    
    def sample(self, batch_size):
        """
        Sample a batch of transitions.
        
        Arguments:
        batch_size -- number of transitions to sample
        
        Returns:
        batch -- list of Transition namedtuples
        """
        # (approx. 1 line)
        # Use random.sample to get batch_size transitions
        
        # YOUR CODE STARTS HERE
        
        # YOUR CODE ENDS HERE
        
        return batch
    
    def __len__(self):
        """Return current buffer size."""
        return len(self.buffer)

In [None]:
# Test your implementation
replay_buffer_test(ReplayBuffer)

## Exercise 3: Double DQN Loss

This is where Double DQN differs from standard DQN!

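**Standard DQN**:
```python
# Target network both selects (max) and evaluates the next action
target_q = reward + gamma * target_net(next_state).max(dim=1)[0]
```

**Double DQN**:
```python
# Step 1: Select action using online network (keepdim=True -> shape (batch, 1))
best_actions = online_net(next_state).argmax(dim=1, keepdim=True)

# Step 2: Evaluate action using target network (squeeze back to shape (batch,))
target_q = reward + gamma * target_net(next_state).gather(1, best_actions).squeeze(1)
```

Both snippets omit the `(1 - done)` terminal mask for brevity.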
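
Tensor shapes are the most common source of bugs in this step, so here is a minimal shape check on hand-written tensors. The Q-values, rewards, and `dones` below are hypothetical stand-ins for network outputs, and `gamma = 0.5` is chosen only to keep the arithmetic exact (the notebook default is 0.99).

```python
import torch

gamma = 0.5  # illustrative value; the notebook default is 0.99

# Hypothetical network outputs for a batch of 3 next states and 2 actions.
online_next_q = torch.tensor([[1.0, 2.0], [0.5, 0.25], [0.25, 1.0]])
target_next_q = torch.tensor([[3.0, 0.5], [0.25, 0.75], [1.0, 1.5]])
rewards = torch.tensor([1.0, 1.0, 1.0])
dones = torch.tensor([0.0, 1.0, 0.0])  # the second transition is terminal

# Step 1: the online network picks the action; keepdim=True gives shape (3, 1).
best_actions = online_next_q.argmax(dim=1, keepdim=True)

# Step 2: the target network scores that action; squeeze back to shape (3,).
next_q = target_next_q.gather(1, best_actions).squeeze(1)

# Terminal transitions do not bootstrap from the next state.
targets = rewards + gamma * next_q * (1.0 - dones)

print(best_actions.shape, next_q.shape, targets)
```

Without `keepdim=True`, `gather` would be given a 1-D index for a 2-D tensor and raise an error. Without `squeeze(1)`, adding a `(3,)` reward to a `(3, 1)` tensor would broadcast to `(3, 3)`.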

**Task**: Implement the Double DQN loss function.

In [None]:
# GRADED FUNCTION: compute_double_dqn_loss

def compute_double_dqn_loss(batch, online_net, target_net, gamma=0.99):
    """
    Compute Double DQN loss.
    
    Arguments:
    batch -- list of Transition namedtuples
    online_net -- online Q-network (being trained)
    target_net -- target Q-network (for stability)
    gamma -- discount factor
    
    Returns:
    loss -- scalar loss value
    """
    # (approx. 15-18 lines)
    # 1. Unpack batch into separate tensors:
    #    states, actions, rewards, next_states, dones
    # 2. Get current Q-values: online_net(states).gather(1, actions)
    #    (actions must be an int64 tensor of shape (batch, 1))
    # 3. Compute target Q-values (Double DQN way), ideally inside
    #    torch.no_grad() so no gradients flow through the target:
    #    a. Get best actions from online network:
    #       next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
    #    b. Evaluate these actions using target network:
    #       next_q_values = target_net(next_states).gather(1, next_actions)
    #    c. Compute target: reward + gamma * next_q_values * (1 - done)
    #       (reward, done and next_q_values must have the same shape)
    # 4. Compute loss: F.mse_loss or F.smooth_l1_loss
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return loss

In [None]:
# Test your implementation
compute_double_dqn_loss_test(compute_double_dqn_loss, QNetwork, ReplayBuffer)

## Exercise 4: Update Target Network

Periodically copy weights from online network to target network for stability.

**Task**: Implement target network update.

In [None]:
# GRADED FUNCTION: update_target_network

def update_target_network(online_net, target_net):
    """
    Copy weights from online network to target network.
    
    Arguments:
    online_net -- online Q-network
    target_net -- target Q-network
    """
    # (approx. 1 line)
    # Use load_state_dict to copy online_net parameters to target_net
    
    # YOUR CODE STARTS HERE
    
    # YOUR CODE ENDS HERE

In [None]:
# Test your implementation
update_target_network_test(update_target_network, QNetwork)

## Exercise 5: Train Double DQN

Now let's put everything together!

**Algorithm**:
1. Initialize online and target networks
2. For each episode:
   - Reset environment
   - For each step:
     - Select action (epsilon-greedy)
     - Take action, observe reward and next state
     - Store transition in replay buffer
     - Sample batch and compute Double DQN loss
     - Update online network
   - Every `target_update_freq` episodes, update target network
   - Decay epsilon
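
The exploration part of the algorithm can be sketched in plain Python. The helper `epsilon_greedy` below is a hypothetical illustration that works on a list of Q-values rather than a network, and the schedule uses the default hyperparameters of `train_double_dqn` with one decay step per episode.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Per-episode multiplicative decay with the notebook's default values.
epsilon, epsilon_end, epsilon_decay = 1.0, 0.01, 0.995
schedule = []
while epsilon > epsilon_end:
    schedule.append(epsilon)
    epsilon = max(epsilon_end, epsilon * epsilon_decay)

epsilon_after_500 = schedule[500]   # epsilon after 500 decay steps
episodes_to_floor = len(schedule)   # decay steps needed to reach epsilon_end

print(f"epsilon after 500 episodes: {epsilon_after_500:.4f}")
print(f"episodes until epsilon hits the floor: {episodes_to_floor}")
```

With these defaults epsilon is still roughly 0.08 at the end of a 500-episode run and would need about 900 episodes to reach `epsilon_end`, which is worth keeping in mind when reading the reward curve.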

**Task**: Implement the training loop.

In [None]:
# GRADED FUNCTION: train_double_dqn

def train_double_dqn(env, n_episodes=500, gamma=0.99, epsilon_start=1.0, 
                     epsilon_end=0.01, epsilon_decay=0.995, lr=1e-3, 
                     batch_size=64, target_update_freq=10):
    """
    Train Double DQN on the environment.
    
    Arguments:
    env -- Gymnasium environment
    n_episodes -- number of episodes to train
    gamma -- discount factor
    epsilon_start -- initial epsilon for epsilon-greedy
    epsilon_end -- minimum epsilon
    epsilon_decay -- epsilon decay rate
    lr -- learning rate
    batch_size -- batch size for training
    target_update_freq -- how often to update target network
    
    Returns:
    episode_rewards -- list of total rewards per episode
    online_net -- trained online network
    """
    # (approx. 35-40 lines)
    # 1. Initialize networks, optimizer, replay buffer
    # 2. epsilon = epsilon_start
    # 3. For each episode:
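    #    a. Reset environment (env.reset() returns (state, info))
    #    b. total_reward = 0
    #    c. For each step:
    #       - Epsilon-greedy action selection
    #       - Take action, get next_state, reward, done
    #         (env.step() returns terminated and truncated;
    #          done = terminated or truncated)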
    #       - Store in replay buffer
    #       - If buffer has enough samples:
    #         * Sample batch
    #         * Compute loss using compute_double_dqn_loss
    #         * Update online network
    #       - Update state and total_reward
    #       - If done: break
    #    d. Update target network every target_update_freq episodes
    #    e. Decay epsilon: epsilon = max(epsilon_end, epsilon * epsilon_decay)
    #    f. Append total_reward to episode_rewards
    #    g. Print progress every 50 episodes
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return episode_rewards, online_net

In [None]:
# Test your implementation
train_double_dqn_test(train_double_dqn)

## Full Training Run

Let's train Double DQN on CartPole!

In [None]:
# Train Double DQN
episode_rewards, trained_net = train_double_dqn(
    env, 
    n_episodes=500,
    gamma=0.99,
    epsilon_decay=0.995,
    target_update_freq=10
)

# Plot results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(episode_rewards, alpha=0.6)
plt.plot(np.convolve(episode_rewards, np.ones(50)/50, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Double DQN Training Progress')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
window = 100
if len(episode_rewards) >= window:
    moving_avg = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
    plt.plot(moving_avg)
    plt.axhline(y=475, color='r', linestyle='--', label='Solved threshold (475)')
    plt.xlabel('Episode')
    plt.ylabel(f'Average Reward (last {window} episodes)')
    plt.title('Moving Average')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Check if solved
if len(episode_rewards) >= 100:
    final_avg = np.mean(episode_rewards[-100:])
    if final_avg >= 475:
        print(f"\n🎉 Environment solved! Final average: {final_avg:.2f}")
    else:
        print(f"\n📊 Training completed. Final average: {final_avg:.2f}")

## Comparison: Standard DQN vs Double DQN

Let's summarize the key difference in Q-value estimation.

**Overestimation Analysis**:
- **Standard DQN**: Tends to overestimate Q-values, especially early in training
- **Double DQN**: More conservative estimates, leading to more stable learning

**When to use Double DQN**:
- ✅ **As your default** - It's almost always at least as good as standard DQN
- ✅ When you notice training instability in standard DQN
- ✅ When you need more accurate value estimates
- ✅ Environments where overestimation causes poor policies

**Computational Cost**:
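- Nearly the same as standard DQN: one extra forward pass of the online network on the next states, plus an argmax
- No additional memory requirements: it reuses the online and target networks that DQN already keeps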

## Key Insights

**Why Double DQN Works**:
1. **Decoupling**: Separates action selection from action evaluation
2. **Reduces bias**: An action the online network overrates is scored by the target network, whose errors are less correlated with that choice
3. **More stable**: Prevents runaway overestimation

**Implementation Tips**:
- Start with Double DQN as your default (not standard DQN)
- Combine with other improvements: Dueling DQN, Prioritized Replay
- Monitor Q-values during training to check for overestimation
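
One simple way to act on the last tip is to compare predicted Q-values against what is actually achievable. The sketch below assumes CartPole-v1 (reward +1 per step, episodes truncated at 500 steps) and the default `gamma=0.99`; `discounted_return` and `looks_overestimated` are hypothetical helpers for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

gamma = 0.99
max_steps = 500  # CartPole-v1 truncates episodes at 500 steps

# Best possible return from the start state: +1 reward on every step.
best_case = discounted_return([1.0] * max_steps, gamma)

# With rewards of at most +1, no return can exceed 1 / (1 - gamma).
upper_bound = 1.0 / (1.0 - gamma)

def looks_overestimated(predicted_q, bound=upper_bound):
    """Flag Q-values that no real trajectory could achieve."""
    return predicted_q > bound

print(f"best-case discounted return: {best_case:.2f}")
print(f"upper bound on any Q-value:  {upper_bound:.2f}")
```

A predicted Q-value above the bound is a sure sign of overestimation. Values below it can still be too high, so this is a coarse check rather than a full diagnostic.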

**Further Improvements**:
- **Dueling DQN**: Separate value and advantage streams
- **Prioritized Experience Replay**: Sample important transitions more often
- **Rainbow DQN**: Combines these and other improvements in a single agent

## Congratulations!

You've successfully implemented Double DQN! You now understand:
- ✅ Why standard DQN overestimates Q-values
- ✅ How Double DQN fixes this with action decoupling
- ✅ The simple but powerful modification to the loss function
- ✅ When and why to use Double DQN
- ✅ Which other DQN improvements it can be combined with

**Next Steps**: 
- Try **Dueling DQN** for better value estimation
- Implement **Prioritized Experience Replay** for better sampling
- Explore **Rainbow DQN**, which combines six DQN improvements!