# 1.2 Deep Reinforcement Learning Introduction

## Learning Objectives
- Understand why we need function approximation
- Learn the basics of Deep Q-Networks (DQN)
- Understand Policy Gradient methods
- Implement a simple neural network-based agent

## The Curse of Dimensionality

Q-tables work for small state spaces, but real-world problems have:
- **Continuous states**: Robot joint angles, pixel images
- **High dimensions**: A raw Atari frame has 210×160 pixels × 3 color channels = 100,800 input values
- **Combinatorial explosion**: Chess has ~10^43 possible states

**Solution**: Use function approximation (neural networks) to generalize across states.
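To make the scaling argument concrete, the sketch below compares the number of entries in a tabular Q-function with the number of weights in a small fully connected network. The helper names (`q_table_entries`, `mlp_parameter_count`) and the choice of 10 bins per state dimension are illustrative assumptions, not part of any library. The network shape mirrors the two-hidden-layer, 128-unit design used later in this section.

```python
def q_table_entries(bins_per_dim: int, state_dims: int, n_actions: int) -> int:
    """Entries in a Q-table after discretizing each state dimension into bins."""
    return (bins_per_dim ** state_dims) * n_actions


def mlp_parameter_count(state_dims: int, hidden: int, n_actions: int) -> int:
    """Weights and biases in an MLP with two hidden layers of equal width."""
    layer1 = state_dims * hidden + hidden
    layer2 = hidden * hidden + hidden
    output = hidden * n_actions + n_actions
    return layer1 + layer2 + output


for dims in (4, 8, 16):
    table = q_table_entries(bins_per_dim=10, state_dims=dims, n_actions=2)
    net = mlp_parameter_count(state_dims=dims, hidden=128, n_actions=2)
    print(f"{dims:>2} state dims: table entries = {table:.1e}, network parameters = {net}")
```

The table grows exponentially with the number of state dimensions, while the network's parameter count grows only linearly with the input size.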

In [None]:
# Install dependencies
# !pip install torch numpy gymnasium matplotlib

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque
import random
import matplotlib.pyplot as plt
import gymnasium as gym

## Deep Q-Network (DQN)

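DQN (Mnih et al., 2015) was a breakthrough: a single network architecture learned to play many Atari games directly from raw pixels, reaching human-level or better scores on many of them.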

### Key Innovations:

1. **Neural Network Q-Function**: Q(s,a;θ) approximates Q*(s,a)

2. **Experience Replay**: Store transitions in a buffer, sample randomly
   - Breaks correlation between consecutive samples
   - Improves sample efficiency, since each transition can be reused in many updates

3. **Target Network**: Separate network for computing targets
   - Stabilizes training
   - Updated periodically from main network
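
The three pieces above meet in the temporal-difference target that the network is trained toward: $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$ for non-terminal transitions and $y = r$ for terminal ones, where $\theta^-$ denotes the target-network weights. The plain-Python sketch below works through that arithmetic for a hand-written transition. The function name `td_target` and the Q-values are made up for illustration.

```python
from typing import Sequence


def td_target(reward: float, next_q_values: Sequence[float], done: bool,
              gamma: float = 0.99) -> float:
    """One-step TD target: bootstrap from the best next action unless terminal."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Q-values a (frozen) target network might output for the next state
next_q = [1.5, 2.0]

print(td_target(1.0, next_q, done=False))  # bootstraps: 1.0 + 0.99 * 2.0
print(td_target(1.0, next_q, done=True))   # terminal: reward only
```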

In [None]:
class ReplayBuffer:
    """Experience replay buffer for DQN."""
    
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        """Store a transition."""
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size: int):
        """Sample a batch of transitions."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.FloatTensor(dones)
        )
    
    def __len__(self):
        return len(self.buffer)
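
Two standard-library behaviours do most of the work in this buffer: `deque(maxlen=...)` silently discards the oldest item once it is full, and `random.sample` draws without replacement. A small sketch with integer stand-ins for transitions (the capacity of 3 is chosen only to keep the output readable):

```python
import random
from collections import deque

buffer = deque(maxlen=3)      # tiny capacity for demonstration
for transition_id in range(5):
    buffer.append(transition_id)

print(list(buffer))           # the two oldest entries (0 and 1) have been evicted

random.seed(0)                # seeded only so repeated runs draw the same batch
batch = random.sample(list(buffer), 2)
print(batch)                  # two distinct items from the buffer
```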

In [None]:
class DQN(nn.Module):
    """Deep Q-Network."""
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

In [None]:
class DQNAgent:
    """DQN Agent with experience replay and target network."""
    
    def __init__(self, state_dim: int, action_dim: int,
                 lr: float = 1e-3, gamma: float = 0.99,
                 epsilon_start: float = 1.0, epsilon_end: float = 0.01,
                 epsilon_decay: float = 0.995, buffer_size: int = 10000,
                 batch_size: int = 64, target_update: int = 10):
        
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update = target_update
        self.update_counter = 0
        
        # Networks
        self.policy_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.policy_net.state_dict())
        
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.buffer = ReplayBuffer(buffer_size)
    
    def select_action(self, state: np.ndarray) -> int:
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.policy_net(state_tensor)
            return q_values.argmax().item()
    
    def update(self):
        """Update the network from replay buffer."""
        if len(self.buffer) < self.batch_size:
            return 0.0
        
        states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)
        
        # Current Q values
        current_q = self.policy_net(states).gather(1, actions.unsqueeze(1))
        
        # Target Q values (using target network)
        with torch.no_grad():
            next_q = self.target_net(next_states).max(1)[0]
            target_q = rewards + self.gamma * next_q * (1 - dones)
        
        # Compute loss
        loss = F.mse_loss(current_q.squeeze(1), target_q)  # squeeze(1) keeps the batch dim even when batch_size == 1
        
        # Optimize
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Update target network periodically
        self.update_counter += 1
        if self.update_counter % self.target_update == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())
        
        # Decay epsilon
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
        
        return loss.item()
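
Because `update()` decays ε on every gradient step rather than once per episode, exploration fades quickly with the default settings. The sketch below counts how many updates it takes for ε to reach its floor. `steps_until_floor` is a hypothetical helper written for this illustration, and its defaults copy those of `DQNAgent`.

```python
def steps_until_floor(epsilon_start: float = 1.0, epsilon_end: float = 0.01,
                      epsilon_decay: float = 0.995) -> int:
    """Count multiplicative decay steps until epsilon reaches epsilon_end."""
    epsilon, steps = epsilon_start, 0
    while epsilon > epsilon_end:
        epsilon = max(epsilon_end, epsilon * epsilon_decay)
        steps += 1
    return steps


print(steps_until_floor())                      # defaults used by DQNAgent
print(steps_until_floor(epsilon_decay=0.999))   # slower decay for comparison
```

With the defaults, ε reaches its floor in under a thousand updates, so most of the training run is nearly greedy. A decay factor closer to 1 is one way to keep exploring for longer.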

In [None]:
def train_dqn(env_name: str = 'CartPole-v1', n_episodes: int = 300):
    """Train DQN on a Gymnasium environment."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = DQNAgent(state_dim, action_dim)
    
    episode_rewards = []
    losses = []
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        episode_loss = []
        
        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            # Store `terminated`, not `done`: a time-limit truncation should still bootstrap
            agent.buffer.push(state, action, reward, next_state, float(terminated))
            loss = agent.update()
            if loss > 0:
                episode_loss.append(loss)
            
            state = next_state
            total_reward += reward
            
            if done:
                break
        
        episode_rewards.append(total_reward)
        losses.append(np.mean(episode_loss) if episode_loss else 0)
        
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:])
            print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}, Epsilon: {agent.epsilon:.3f}")
    
    env.close()
    return agent, episode_rewards, losses

# Train the agent
agent, rewards, losses = train_dqn(n_episodes=300)

In [None]:
# Plot training results
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Rewards
window = 20
smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
axes[0].plot(smoothed)
axes[0].axhline(y=475, color='r', linestyle='--', label='Solved threshold')  # 475 for CartPole-v1 (195 applies to v0)
axes[0].set_xlabel('Episode')
axes[0].set_ylabel('Reward')
axes[0].set_title('Episode Rewards (CartPole-v1)')
axes[0].legend()

# Loss
smoothed_loss = np.convolve(losses, np.ones(window)/window, mode='valid')
axes[1].plot(smoothed_loss)
axes[1].set_xlabel('Episode')
axes[1].set_ylabel('Loss')
axes[1].set_title('Training Loss')

plt.tight_layout()
plt.show()
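
`np.convolve(..., mode='valid')` with a uniform kernel is a moving average that only emits points where the window fits entirely inside the data, so the smoothed curve has `len(rewards) - window + 1` points and its x-axis is offset from the raw episode index. A dependency-free sketch of the same computation (the function name `moving_average` and the sample values are illustrative):

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values (like mode='valid')."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


rewards_demo = [10, 20, 30, 40, 50]
print(moving_average(rewards_demo, window=2))  # 4 points from 5 inputs
```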

## Policy Gradient Methods

Instead of learning Q-values, we can directly optimize the policy.

### REINFORCE Algorithm

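Parameterize the policy as π(a|s;θ) and follow the gradient of the expected return J(θ):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi(a_t \mid s_t;\theta)\, G_t\right]$$

where $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$ is the discounted return from time step t onward.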
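
The return can be computed for every step of an episode in a single backward pass using the recursion $G_t = r_t + \gamma G_{t+1}$. A small worked example in plain Python (the helper name `discounted_returns` and the three-step reward sequence are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] where G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()          # computed back-to-front, so flip into time order
    return returns


# Three steps with reward 1 each: later steps have less future reward left
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))
```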

In [None]:
class PolicyNetwork(nn.Module):
    """Simple policy network."""
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)

In [None]:
class REINFORCEAgent:
    """REINFORCE (Monte Carlo Policy Gradient) agent."""
    
    def __init__(self, state_dim: int, action_dim: int,
                 lr: float = 1e-3, gamma: float = 0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        self.saved_log_probs = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        """Sample action from policy."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state_tensor)
        
        # Sample from categorical distribution
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        
        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()
    
    def update(self):
        """Update policy using collected episode data."""
        # Calculate returns
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        
        returns = torch.FloatTensor(returns)
        # Normalize returns for stability
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        # Policy gradient loss
        policy_loss = []
        for log_prob, G in zip(self.saved_log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        loss = torch.stack(policy_loss).sum()
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        self.saved_log_probs = []
        self.rewards = []
        
        return loss.item()
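
Normalizing returns shifts them to zero mean and unit variance within the episode, so actions followed by below-average returns receive a negative weight instead of a merely smaller positive one. A plain-Python sketch of the same arithmetic, using the sample standard deviation (which is also what `torch.Tensor.std` computes by default). The function name `normalize` and the input values are illustrative.

```python
import statistics


def normalize(returns, eps=1e-8):
    """Standardize returns: subtract the mean, divide by the sample std."""
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)   # sample std (divides by n - 1)
    return [(g - mean) / (std + eps) for g in returns]


normalized = normalize([2.71, 1.9, 1.0])
print([round(x, 3) for x in normalized])
```

Like the tensor version, this needs at least two time steps: a one-step episode has no defined sample standard deviation.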

In [None]:
def train_reinforce(env_name: str = 'CartPole-v1', n_episodes: int = 500):
    """Train REINFORCE agent."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = REINFORCEAgent(state_dim, action_dim)
    episode_rewards = []
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        
        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            agent.rewards.append(reward)
            state = next_state
            total_reward += reward
            
            if done:
                break
        
        agent.update()
        episode_rewards.append(total_reward)
        
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:])
            print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")
    
    env.close()
    return agent, episode_rewards

# Train REINFORCE
pg_agent, pg_rewards = train_reinforce(n_episodes=500)

## Comparison: Value vs Policy Methods

| Aspect | Value-Based (DQN) | Policy-Based (REINFORCE) |
|--------|-------------------|-------------------------|
| Output | Q-values | Action probabilities |
| Action space | Discrete only | Discrete or Continuous |
| Exploration | ε-greedy | Built-in (stochastic) |
| Sample efficiency | Higher (replay) | Lower (on-policy) |
| Convergence | Can be unstable (bootstrapped targets) | Smoother policy updates, but high-variance gradient estimates |

## Actor-Critic: Best of Both Worlds

Combines value and policy methods:
- **Actor**: Policy network that selects actions
- **Critic**: Value network that evaluates actions

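Using the critic's value estimate in place of the raw Monte Carlo return reduces the variance of the policy gradient, while the actor keeps the flexibility of a directly parameterized policy (for example, continuous actions).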

```
┌─────────┐     state      ┌────────┐
│  State  │───────────────▶│ Actor  │──▶ action
│         │                └────────┘
│         │     state      ┌────────┐
│         │───────────────▶│ Critic │──▶ value estimate
└─────────┘                └────────┘
```

This leads to algorithms like A2C, A3C, PPO, SAC (covered later with RLlib).
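
One common way the critic enters the update is through the one-step advantage estimate $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which replaces $G_t$ as the weight on $\nabla_\theta \log \pi$. The sketch below evaluates it for hand-picked numbers. `one_step_advantage` and the value estimates are hypothetical, and the algorithms listed above differ in how they estimate the advantage.

```python
def one_step_advantage(reward, value, next_value, done, gamma=0.99):
    """TD-error advantage: how much better the step went than the critic expected."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# Critic expected a value of 10 before and after the step
print(one_step_advantage(reward=1.0, value=10.0, next_value=10.0, done=False))
# Same step, but the episode ended: no future value to bootstrap from
print(one_step_advantage(reward=1.0, value=10.0, next_value=10.0, done=True))
```

A positive advantage increases the probability of the action taken, and a negative one decreases it.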

## Key Takeaways

1. **Function approximation** enables RL in complex environments

2. **DQN** uses replay buffers and target networks for stable training

3. **Policy Gradients** directly optimize the policy and also work with continuous actions

4. **Actor-Critic** methods combine both approaches

## Next Steps

In the next section, we'll set up Ray and RLlib to train these algorithms at scale!

## Exercises

1. **Double DQN**: Implement Double DQN to reduce overestimation bias

2. **Dueling DQN**: Separate state-value and advantage streams

3. **Baseline subtraction**: Add a baseline to REINFORCE to reduce variance

4. **Different environments**: Try LunarLander-v2 (LunarLander-v3 in newer Gymnasium releases; requires the Box2D extra) or Acrobot-v1