# Deep Q-Network (DQN)

Welcome to the DQN assignment! This is where Reinforcement Learning meets Deep Learning. By the end of this notebook, you'll be able to:

* Understand why we need function approximation in RL
* Implement a neural network for Q-value approximation
* Build an experience replay buffer
* Implement the DQN algorithm with target networks
* Train an agent on environments with continuous state spaces

## From Q-Learning to DQN

**Problem with Q-Tables:**
- CartPole's state is 4 continuous values, so it has infinitely many possible states
- Atari has 256^(84×84×4) ≈ 10^67,970 possible stacked-frame inputs!
- Q-tables are impossible for large/continuous state spaces
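
The Atari exponent can be checked with a few lines of arithmetic. The sketch below also shows how quickly a table grows if you try to discretize CartPole; the `table_entries` helper and the bin counts are hypothetical, chosen only for illustration.

```python
import math

# Stacked Atari input: 4 frames of 84x84 pixels, 256 intensity levels per pixel.
n_pixels = 84 * 84 * 4
log10_atari_states = n_pixels * math.log10(256)
print(f"Atari: 256^{n_pixels} is about 10^{log10_atari_states:,.0f}")


def table_entries(bins, state_dim=4, n_actions=2):
    """Q-table size if each state variable is cut into `bins` buckets (hypothetical scheme)."""
    return bins ** state_dim * n_actions


for bins in (10, 100, 1000):
    print(f"{bins} bins per dimension -> {table_entries(bins):,} Q-table entries")
```

Even a coarse grid needs thousands of entries, and a finer grid multiplies that by `bins` for every state variable.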

**Solution: Function Approximation**
- Instead of a table, use a neural network: $Q(s,a; \theta)$
- The network learns to **approximate** Q-values
- Can generalize to unseen states!

## DQN Key Innovations

1. **Neural Network**: Replace Q-table with neural network
2. **Experience Replay**: Store and reuse past experiences
3. **Target Network**: Stabilize training with separate target Q-network
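4. **Reward Clipping**: Clip rewards to a fixed range such as [-1, 1] so the scale of TD errors is comparable across games (used for Atari; not needed for CartPole in this notebook)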
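
Reward clipping is not used in the exercises, but it is easy to picture. A minimal sketch, assuming a [-1, 1] range and a hypothetical `clip_reward` helper:

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Bound a raw reward to [low, high]; the range is an assumption for this sketch."""
    return max(low, min(high, reward))


# Made-up raw rewards, only to show the effect of clipping.
raw_rewards = [0.0, 1.0, -5.0, 100.0, 0.5]
clipped = [clip_reward(r) for r in raw_rewards]
print(clipped)
```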

<img src="https://miro.medium.com/max/1400/1*w5GuxedZ9ivRYhQCv8kVZQ.png" style="width:600px;height:300px;">

## Important Note on Submission

Please ensure:
1. No extra print statements
2. No extra code cells  
3. Function parameters unchanged
4. No global variables in graded functions

## Table of Contents
- [1 - Packages](#1)
- [2 - Q-Network Architecture](#2)
    - [Exercise 1 - QNetwork](#ex-1)
- [3 - Experience Replay Buffer](#3)
    - [Exercise 2 - ReplayBuffer](#ex-2)
- [4 - DQN Training](#4)
    - [Exercise 3 - compute_td_loss](#ex-3)
    - [Exercise 4 - train_dqn](#ex-4)
- [5 - Testing on CartPole](#5)

<a name='1'></a>
## 1 - Packages

In [None]:
import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from collections import deque, namedtuple
from dqn_tests import *

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 6.0)

# Check for GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

np.random.seed(42)
torch.manual_seed(42)

<a name='2'></a>
## 2 - Q-Network Architecture

The Q-Network is a neural network that takes a state as input and outputs Q-values for each action.

**Architecture for CartPole:**
```
Input: State (4 values) → [cart position, cart velocity, pole angle, pole angular velocity]
Hidden Layer 1: 128 neurons (ReLU)
Hidden Layer 2: 128 neurons (ReLU)
Output: Q-values (2 actions) → [Q(s, left), Q(s, right)]
```
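
As a sanity check on the architecture above, you can count its trainable parameters by hand. This is plain arithmetic, not part of the graded code; `linear_params` is a helper name introduced here.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Layer widths from the CartPole architecture: 4 -> 128 -> 128 -> 2.
layer_sizes = [4, 128, 128, 2]
per_layer = [linear_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(f"Parameters per layer: {per_layer}")
print(f"Total parameters: {total_params}")
```

A table with the same number of entries would cover only a very coarse grid over the state space, while the network produces a Q-value for any state you feed it.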

<a name='ex-1'></a>
### Exercise 1 - QNetwork

Implement a Q-Network using PyTorch.

In [None]:
# GRADED FUNCTION: QNetwork

class QNetwork(nn.Module):
    """
    Q-Network for DQN.
    
    Arguments:
    state_dim -- dimension of state space
    action_dim -- number of discrete actions (one Q-value output per action)
    hidden_dim -- number of neurons in hidden layers (default: 128)
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(QNetwork, self).__init__()
        
        # (approx. 5-7 lines)
        # Build a neural network with:
        # 1. Input layer: state_dim
        # 2. Hidden layer 1: Linear(state_dim, hidden_dim) + ReLU
        # 3. Hidden layer 2: Linear(hidden_dim, hidden_dim) + ReLU
        # 4. Output layer: Linear(hidden_dim, action_dim)
        # 
        # Hint: Use nn.Sequential to combine layers
        # Hint: ReLU activation: nn.ReLU()
        
        # YOUR CODE STARTS HERE
        
        
        
        
        
        # YOUR CODE ENDS HERE
    
    def forward(self, state):
        """
        Forward pass.
        
        Arguments:
        state -- state tensor of shape (batch_size, state_dim)
        
        Returns:
        q_values -- Q-values for each action, shape (batch_size, action_dim)
        """
        # (approx. 1 line)
        # Pass state through the network
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
        
        return q_values

In [None]:
# Test your implementation
q_net = QNetwork(state_dim=4, action_dim=2, hidden_dim=128)
print("Q-Network architecture:")
print(q_net)

# Test forward pass
test_state = torch.randn(1, 4)
q_values = q_net(test_state)
print(f"\nInput shape: {test_state.shape}")
print(f"Output shape: {q_values.shape}")
print(f"Q-values: {q_values.detach().numpy()}")

# Run the grader
qnetwork_test(QNetwork)

<a name='3'></a>
## 3 - Experience Replay Buffer

**Why Experience Replay?**

Problem with online learning:
- Consecutive samples are highly correlated
- Network can forget previous experiences (catastrophic forgetting)
- Inefficient use of data

**Solution: Replay Buffer**
1. Store experiences (s, a, r, s', done) in a buffer
2. Sample random mini-batches for training
3. Breaks correlation, improves stability and sample efficiency

```python
# Store experience
buffer.add(state, action, reward, next_state, done)

# Sample mini-batch
batch = buffer.sample(batch_size=32)
```
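
Two building blocks do most of the work here: a fixed-size container that drops its oldest item when full, and uniform random sampling. The sketch below illustrates both with the standard library only. It uses `random.sample` rather than NumPy, and the `memory` name and placeholder transitions are made up for the example.

```python
import random
from collections import deque

random.seed(0)  # only to make the illustration repeatable

# Tiny capacity so eviction is easy to see.
memory = deque(maxlen=3)
for t in range(5):
    memory.append(("transition", t))  # placeholder items labeled by time step

# The two oldest items (t=0 and t=1) have been dropped automatically.
print(list(memory))

# Uniform sampling without replacement breaks up the time ordering.
batch = random.sample(list(memory), k=2)
print(batch)
```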

<a name='ex-2'></a>
### Exercise 2 - ReplayBuffer

Implement an experience replay buffer.

In [None]:
# Define experience tuple
Experience = namedtuple('Experience', ['state', 'action', 'reward', 'next_state', 'done'])

In [None]:
# GRADED FUNCTION: ReplayBuffer

class ReplayBuffer:
    """
    Experience Replay Buffer.
    
    Arguments:
    capacity -- maximum number of experiences to store
    """
    
    def __init__(self, capacity=10000):
        # (approx. 1 line)
        # Use deque with maxlen=capacity to store experiences
        # Hint: deque automatically removes oldest items when full
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    def add(self, state, action, reward, next_state, done):
        """
        Add experience to buffer.
        """
        # (approx. 1-2 lines)
        # Create Experience tuple and append to buffer
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    def sample(self, batch_size):
        """
        Sample random batch of experiences.
        
        Arguments:
        batch_size -- number of experiences to sample
        
        Returns:
        Tuple of batched (states, actions, rewards, next_states, dones)
        """
        # (approx. 8-10 lines)
        # 1. Randomly sample batch_size experiences from buffer
        #    Hint: Use np.random.choice with replace=False to sample indices
        # 2. Extract and stack each component into numpy arrays
        # 3. Return tuple of (states, actions, rewards, next_states, dones)
        
        # YOUR CODE STARTS HERE
        
        
        
        
        
        
        
        # YOUR CODE ENDS HERE
        
        return states, actions, rewards, next_states, dones
    
    def __len__(self):
        """Return current size of buffer."""
        return len(self.buffer)

In [None]:
# Test your implementation
buffer = ReplayBuffer(capacity=1000)

# Add some experiences
for i in range(10):
    state = np.array([i, i+1, i+2, i+3])
    action = i % 2
    reward = float(i)
    next_state = state + 1
    done = (i == 9)
    buffer.add(state, action, reward, next_state, done)

print(f"Buffer size: {len(buffer)}")

# Sample a batch
states, actions, rewards, next_states, dones = buffer.sample(batch_size=3)
print(f"\nSampled batch shapes:")
print(f"States: {states.shape}")
print(f"Actions: {actions.shape}")
print(f"Rewards: {rewards.shape}")

# Run the grader
replay_buffer_test(ReplayBuffer)

<a name='4'></a>
## 4 - DQN Training

DQN uses **two networks**:
1. **Q-Network** (θ): Updated every step, used to select actions
2. **Target Network** (θ⁻): Updated periodically, used to compute targets

**Why Target Network?**
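- Without it, the TD target $r + \gamma \max_{a'} Q(s',a')$ shifts every time θ is updated (the moving-target problem)
- Holding θ⁻ fixed between updates stabilizes training
- Updated every C steps: θ⁻ ← θ (in `train_dqn` below, every `target_update` episodes)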
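
The periodic update can be pictured with a toy example in which a single integer stands in for each network's parameters. The period `C = 4` and the variable names are assumptions for illustration only.

```python
C = 4                 # hypothetical update period
online_w = 0          # stands in for the Q-network parameters (theta)
target_w = online_w   # stands in for the target parameters (theta^-)

target_trace = []
for step in range(1, 11):
    online_w += 1                # the online network changes every step
    if step % C == 0:
        target_w = online_w      # hard update: theta^- <- theta
    target_trace.append(target_w)

print(target_trace)
```

The target value stays constant between syncs, which is what keeps the regression targets from shifting on every gradient step. In PyTorch, the copy is typically done with `target_network.load_state_dict(q_network.state_dict())`.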

**DQN Loss Function:**
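$$L(\theta) = \mathbb{E}_{(s,a,r,s',d)\sim \text{Buffer}}\left[\left(r + \gamma (1 - d) \max_{a'} Q(s',a'; \theta^-) - Q(s,a;\theta)\right)^2\right]$$

Here $d$ is the `done` flag (1 if $s'$ is terminal, 0 otherwise), so terminal transitions do not bootstrap from $Q(s', \cdot)$.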
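
To make the loss concrete, here is a worked example on a batch of three transitions. It uses plain Python lists and includes the `(1 - done)` mask that the Exercise 3 hints add for terminal transitions. All numbers are made up, and this is not the PyTorch implementation you are asked to write.

```python
gamma = 0.99

# Hypothetical target-network outputs Q(s', . ; theta^-): 3 transitions, 2 actions each.
next_q_target = [[1.0, 2.0], [0.5, 0.2], [3.0, 4.0]]
rewards = [1.0, 1.0, 1.0]
dones = [0.0, 0.0, 1.0]       # the last transition ended its episode
# Hypothetical online-network estimates Q(s, a; theta) for the actions actually taken.
q_taken = [2.5, 1.0, 1.5]

# TD target: r + gamma * (1 - done) * max_a' Q(s', a'; theta^-)
td_targets = [r + gamma * (1.0 - d) * max(q_next)
              for r, d, q_next in zip(rewards, dones, next_q_target)]

# Mean squared TD error over the batch.
loss = sum((y - q) ** 2 for y, q in zip(td_targets, q_taken)) / len(q_taken)

print(f"TD targets: {td_targets}")
print(f"Loss: {loss:.4f}")
```

For the terminal transition the target is just the reward, even though the target network reports a large value for the next state.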

<a name='ex-3'></a>
### Exercise 3 - compute_td_loss

Implement the DQN loss computation.

In [None]:
# GRADED FUNCTION: compute_td_loss

def compute_td_loss(q_network, target_network, batch, gamma=0.99):
    """
    Compute TD loss for DQN.
    
    Arguments:
    q_network -- current Q-network
    target_network -- target Q-network  
    batch -- tuple of (states, actions, rewards, next_states, dones)
    gamma -- discount factor
    
    Returns:
    loss -- mean squared TD error
    """
    states, actions, rewards, next_states, dones = batch
    
    # Convert to tensors (approx. 5 lines)
    # Hint: actions must be int64 (torch.long) for gather(); convert dones to float
    # YOUR CODE STARTS HERE
    
    
    
    
    # YOUR CODE ENDS HERE
    
    # Compute Q-values (approx. 5-7 lines)
    # 1. Get current Q-values: Q(s, a) for actions taken
    #    Hint: Use gather() to select Q-values for specific actions
    # 2. Get target Q-values: max Q(s', a') from target network
    #    Hint: Use torch.no_grad() for target network
    # 3. Compute TD targets: r + gamma * max Q(s', a') * (1 - done)
    # 4. Compute loss: MSE between current Q and TD targets
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return loss

<a name='ex-4'></a>
### Exercise 4 - train_dqn

Now implement the complete DQN training loop!
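
Before writing the loop, it helps to see how fast exploration decays. The sketch below assumes a multiplicative per-episode update floored at `epsilon_end`, which is how the `epsilon_decay` and `epsilon_end` parameters are usually used, and plugs in the `train_dqn` default values. The helper `epsilon_after` is introduced here for illustration.

```python
import math

epsilon_start, epsilon_end, epsilon_decay = 1.0, 0.01, 0.995


def epsilon_after(n_episodes):
    """Epsilon after n_episodes multiplicative decays, floored at epsilon_end (assumed schedule)."""
    return max(epsilon_end, epsilon_start * epsilon_decay ** n_episodes)


for n in (0, 100, 250, 500):
    print(f"after {n:3d} episodes: epsilon = {epsilon_after(n):.3f}")

# Number of episodes needed before the floor epsilon_end is reached.
episodes_to_floor = math.ceil(
    math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay)
)
print(f"floor reached after {episodes_to_floor} episodes")
```

Under this assumed schedule, epsilon is still about 0.08 after 500 episodes, so the agent keeps exploring throughout the default training run.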

In [None]:
# Helper function
def select_action(q_network, state, epsilon, n_actions):
    """Epsilon-greedy action selection."""
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)
    else:
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)
            q_values = q_network(state_tensor)
            return q_values.argmax().item()

In [None]:
# GRADED FUNCTION: train_dqn

def train_dqn(env, n_episodes=500, batch_size=32, gamma=0.99,
              epsilon_start=1.0, epsilon_end=0.01, epsilon_decay=0.995,
              lr=0.001, target_update=10, buffer_size=10000):
    """
    Train DQN agent.
    
    Returns:
    q_network -- trained Q-network
    rewards_history -- list of episode rewards
    """
    # Initialize (approx. 8 lines)
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    rewards_history = []
    epsilon = epsilon_start
    
    # Training loop (approx. 25-30 lines)
    # For each episode:
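    #   1. Reset environment (gymnasium: state, info = env.reset())
    #   2. For each step:
    #      a. Select action using epsilon-greedy
    #      b. Take action, observe reward and next_state
    #         (gymnasium: env.step returns next_state, reward, terminated, truncated, info;
    #          the episode is done when terminated or truncated)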
    #      c. Store experience in buffer
    #      d. If buffer has enough samples:
    #         - Sample batch from buffer
    #         - Compute loss
    #         - Update Q-network
    #      e. If done, break
    #   3. Decay epsilon (multiply by epsilon_decay, but do not go below epsilon_end)
    #   4. Every target_update episodes, copy Q-network to target_network
    #   5. Store episode reward
    
    # YOUR CODE STARTS HERE
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    # YOUR CODE ENDS HERE
    
    return q_network, rewards_history

<a name='5'></a>
## 5 - Testing on CartPole

Let's train DQN on the classic CartPole environment!

**Goal**: Balance a pole on a cart for as long as possible.
- **State**: [cart position, cart velocity, pole angle, pole angular velocity]
- **Actions**: [push left, push right]
- **Reward**: +1 for each timestep the pole stays up
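- **Success**: Average reward ≥ 195 over 100 consecutive episodes (the CartPole-v0 criterion, used here as a target; CartPole-v1 episodes can last up to 500 steps and its official threshold is 475)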
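
The success criterion can be checked directly on a list of episode rewards. A minimal sketch, assuming a hypothetical `is_solved` helper and synthetic reward curves rather than real training results:

```python
def is_solved(rewards_history, threshold=195.0, window=100):
    """True if any `window` consecutive episodes average at least `threshold`."""
    if len(rewards_history) < window:
        return False
    running_sum = sum(rewards_history[:window])
    if running_sum / window >= threshold:
        return True
    # Slide the window one episode at a time, updating the sum incrementally.
    for i in range(window, len(rewards_history)):
        running_sum += rewards_history[i] - rewards_history[i - window]
        if running_sum / window >= threshold:
            return True
    return False


# Synthetic reward curves for illustration only.
never_improves = [20.0] * 150
improves_late = [20.0] * 50 + [200.0] * 100

print(is_solved(never_improves))
print(is_solved(improves_late))
```

After training, you could call `is_solved(rewards_history)` on the list returned by `train_dqn`.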

In [None]:
# Create environment
env = gym.make('CartPole-v1')

print("Training DQN on CartPole...")
print(f"State dimension: {env.observation_space.shape[0]}")
print(f"Number of actions: {env.action_space.n}")
print("\nThis may take a few minutes...\n")

# Train
q_network, rewards_history = train_dqn(
    env,
    n_episodes=500,
    batch_size=32,
    gamma=0.99,
    epsilon_start=1.0,
    epsilon_end=0.01,
    epsilon_decay=0.995,
    lr=0.001,
    target_update=10,
    buffer_size=10000
)

print("\nTraining completed!")
print(f"Final average reward (last 100 eps): {np.mean(rewards_history[-100:]):.2f}")

In [None]:
# Plot results
fig, ax = plt.subplots(figsize=(12, 5))

ax.plot(rewards_history, alpha=0.3, label='Episode reward')

window = 20
if len(rewards_history) >= window:
    moving_avg = np.convolve(rewards_history, np.ones(window)/window, mode='valid')
    ax.plot(range(window-1, len(rewards_history)), moving_avg,
            label=f'Moving average ({window} episodes)', linewidth=2)

ax.axhline(y=195, color='r', linestyle='--', label='Success threshold (195)', alpha=0.7)
ax.set_xlabel('Episode')
ax.set_ylabel('Total Reward')
ax.set_title('DQN Training on CartPole')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

env.close()

## Congratulations!

You've successfully implemented Deep Q-Network (DQN)! This is a major milestone in your RL journey.

### What You've Learned:

✅ Function approximation with neural networks

✅ Experience replay for stable training

✅ Target networks to prevent moving targets

✅ How to train agents on continuous state spaces

### Key Insights:

1. **Neural networks** can approximate Q-functions for large/continuous state spaces
2. **Experience replay** breaks correlation and improves sample efficiency
3. **Target networks** stabilize training by fixing targets temporarily
4. **Hyperparameters matter**: learning rate, buffer size, target update frequency

### DQN vs Q-Learning:

| Aspect | Q-Learning | DQN |
|--------|-----------|-----|
| State space | Small, discrete | Large or continuous |
| Q-function | Table | Neural network |
| Update | Online | Replay buffer |
| Stability | Stable | Needs target network |

### Next Steps:

- Try DQN on other environments (MountainCar, LunarLander)
- Learn about improvements: Double DQN, Dueling DQN, Rainbow
- Explore Policy Gradient methods (A2C, PPO)
- Try DQN on Atari games!