# Deep Reinforcement Learning: Neural Networks Meet RL

**Prerequisites**: Complete [01_rl_concepts](./01_rl_concepts.ipynb) first!

In the last notebook, we used a Q-table to store values. But what happens when states become complex?

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE PROBLEM: STATE SPACES EXPLODE                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Q-Table works:              Q-Table FAILS:                                 │
│  ─────────────               ──────────────                                 │
│                                                                             │
│  GridWorld (16 states)       Atari Game (pixels)                            │
│  ┌───┬───┬───┬───┐           ┌─────────────────┐                            │
│  │ Q │ Q │ Q │ Q │           │ 210 x 160 x 3   │                            │
│  ├───┼───┼───┼───┤           │ = 100,800 values│                            │
│  │ Q │ Q │ Q │ Q │           │                 │                            │
│  ├───┼───┼───┼───┤           │ Each value 0-255│                            │
│  │ Q │ Q │ Q │ Q │           │ = 256^100800    │                            │
│  ├───┼───┼───┼───┤           │ possible states │                            │
│  │ Q │ Q │ Q │ Q │           │                 │                            │
│  └───┴───┴───┴───┘           │ (more than atoms│                            │
│  16 x 4 = 64 entries         │  in universe!)  │                            │
│  Easy to store!              └─────────────────┘                            │
│                              Impossible to store!                           │
│                                                                             │
│  SOLUTION: Use a neural network to GENERALIZE across similar states         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
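
To put numbers on the diagram above, the following sketch counts table entries for GridWorld and estimates how many decimal digits the Atari state count has. The helper name `q_table_entries` is hypothetical, and the digit count is computed with logarithms rather than by building the full integer.

```python
import math

def q_table_entries(n_states: int, n_actions: int) -> int:
    """Number of cells a tabular Q-function needs."""
    return n_states * n_actions

# GridWorld from the diagram: 16 states, 4 actions
grid_entries = q_table_entries(16, 4)

# Atari frame: 210 x 160 pixels x 3 color channels, each value in 0..255
n_values = 210 * 160 * 3

# The number of distinct frames is 256 ** n_values; count its decimal digits via log10
atari_state_digits = math.floor(n_values * math.log10(256)) + 1

print(f"GridWorld Q-table entries: {grid_entries}")
print(f"Values per Atari frame:    {n_values:,}")
print(f"Distinct frames: a number with about {atari_state_digits:,} digits")
```

For comparison, the number of atoms in the observable universe is usually estimated at around 10^80, which has only 81 digits.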

## The Key Insight: Function Approximation

Instead of storing Q(s,a) in a table, we **approximate** it with a neural network:

```
Q-TABLE APPROACH                    NEURAL NETWORK APPROACH
─────────────────                   ───────────────────────

State (0,0) ──> Look up row 0       State (0,0) ──┐
                     │                            │
                     v                            v
              ┌─────────────┐              ┌─────────────┐
              │ 0.5 0.2 0.8 │              │   Neural    │
              │ 0.1 0.9 0.3 │              │   Network   │
              │ ... ... ... │              │   Q(s;θ)    │
              └─────────────┘              └─────────────┘
                     │                            │
                     v                            v
              Q-values for                 Q-values for
              each action                  each action
              [0.5, 0.2, 0.8]              [0.5, 0.2, 0.8]

Memory: O(states × actions)        Memory: O(network parameters)
        Grows with state space             FIXED size!
```

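**The magic**: the network maps similar states to similar Q-values (generalization), so experience gathered in one state also improves estimates for nearby states the agent has never visited!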
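
The "fixed size" claim can be checked with simple arithmetic. The sketch below counts weights and biases for a fully connected network. `mlp_param_count` is a hypothetical helper, and the layer sizes (4 inputs, two hidden layers of 128, 2 outputs) are assumed to match the CartPole network built later in this notebook.

```python
def mlp_param_count(layer_sizes: list[int]) -> int:
    """Count weights + biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # Weight matrix + bias vector
    return total

cartpole_params = mlp_param_count([4, 128, 128, 2])
print(f"Parameters: {cartpole_params:,}")
```

The count depends only on the layer widths, not on how many distinct states the environment can produce. CartPole's state is four real numbers, so a table could not enumerate it at all.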

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import deque
import random
import matplotlib.pyplot as plt
import gymnasium as gym

---

## DQN: Deep Q-Network

DQN (Mnih et al., 2015) was a breakthrough: a single architecture learned to play dozens of Atari games directly from raw pixels, matching or beating human-level scores on many of them!

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         DQN ARCHITECTURE                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                          ┌─────────────────────────────────┐                │
│                          │        Q-Network                │                │
│     State s              │                                 │                │
│  ┌─────────────┐         │  ┌─────┐   ┌─────┐   ┌─────┐  │   Q-values     │
│  │ obs[0]      │────────>│  │     │   │     │   │     │  │  ┌─────────┐   │
│  │ obs[1]      │         │  │ 128 │──>│ 128 │──>│  n  │──│─>│ Q(s,a0) │   │
│  │ obs[2]      │         │  │     │   │     │   │     │  │  │ Q(s,a1) │   │
│  │ obs[3]      │         │  └─────┘   └─────┘   └─────┘  │  │ Q(s,a2) │   │
│  └─────────────┘         │   ReLU      ReLU     Linear   │  │   ...   │   │
│   (4 numbers for         │                                │  └─────────┘   │
│    CartPole)             └─────────────────────────────────┘                │
│                                                                             │
│  Action Selection: a = argmax Q(s, a)                                       │
│                              a                                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

The network takes a state and outputs Q-values for ALL actions at once!

### DQN's Two Key Innovations

Naive neural network Q-learning is unstable. DQN fixes this with two tricks:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    INNOVATION #1: EXPERIENCE REPLAY                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  PROBLEM: Consecutive samples are correlated (s1->s2->s3->...)              │
│           Neural networks learn poorly from correlated data!                │
│                                                                             │
│  SOLUTION: Store experiences in a buffer, sample RANDOMLY                   │
│                                                                             │
│  ┌────────────────────────────────────────────────────────────┐             │
│  │               REPLAY BUFFER (size = 10,000)                │             │
│  ├────────────────────────────────────────────────────────────┤             │
│  │  (s₁, a₁, r₁, s₁')  <── oldest                             │             │
│  │  (s₂, a₂, r₂, s₂')                                         │             │
│  │  (s₃, a₃, r₃, s₃')      ↑                                  │             │
│  │       ...               │ Random sample batch of 32        │             │
│  │  (s₈, a₈, r₈, s₈')      ↓                                  │             │
│  │       ...                                                   │             │
│  │  (sₙ, aₙ, rₙ, sₙ')  <── newest                             │             │
│  └────────────────────────────────────────────────────────────┘             │
│                              │                                              │
│                              v                                              │
│                    Train on random batch                                    │
│                    (breaks correlation!)                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    INNOVATION #2: TARGET NETWORK                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  PROBLEM: We update Q using Q itself → moving target → instability!         │
│                                                                             │
│       Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]                      │
│                              ↑                                              │
│                        This changes as                                      │
│                        we update Q!                                         │
│                                                                             │
│  SOLUTION: Use a SEPARATE "target" network that updates slowly              │
│                                                                             │
│     ┌─────────────────┐                    ┌─────────────────┐              │
│     │  Policy Network │                    │  Target Network │              │
│     │    Q(s; θ)      │                    │    Q(s; θ⁻)     │              │
│     │                 │  copy weights      │                 │              │
│     │  Updates every  │ ───────────────>   │  Updates every  │              │
│     │  training step  │  (every 100 steps) │  100 steps      │              │
│     └─────────────────┘                    └─────────────────┘              │
│            │                                        │                       │
│            v                                        v                       │
│     Used for action                         Used for computing              │
│     selection                               target Q-values                 │
│                                                                             │
│  The target is now STABLE for 100 steps → much easier to learn!             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
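
The copy schedule in the diagram can be sketched without any neural network. Here a single float stands in for the weights; `TARGET_UPDATE`, `N_STEPS`, and the fake "gradient step" of 0.01 are illustrative assumptions, not part of the agent defined below.

```python
TARGET_UPDATE = 100   # Copy interval, taken from the diagram above
N_STEPS = 350         # Hypothetical number of training steps

theta = 0.0           # Stand-in for the policy network's weights
theta_target = theta  # Target starts as an exact copy
sync_steps = []       # Steps at which the target was refreshed

for step in range(1, N_STEPS + 1):
    theta += 0.01                 # Policy "weights" change on every step
    if step % TARGET_UPDATE == 0:
        theta_target = theta      # Hard update: copy policy weights into the target
        sync_steps.append(step)

print("Target refreshed at steps:", sync_steps)
print(f"Policy: {theta:.2f}, target: {theta_target:.2f}")
```

Between refreshes the target value stays constant while the policy value keeps moving, which is the stability the diagram describes.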

In [None]:
class ReplayBuffer:
    """Experience replay buffer for DQN.
    
    Stores (state, action, reward, next_state, done) tuples
    and provides random sampling for training.
    """
    
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        """Store a transition."""
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size: int):
        """Sample a RANDOM batch of transitions."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        
        return (
            torch.FloatTensor(np.array(states)),
            torch.LongTensor(actions),
            torch.FloatTensor(rewards),
            torch.FloatTensor(np.array(next_states)),
            torch.FloatTensor(dones)
        )
    
    def __len__(self):
        return len(self.buffer)
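
To see the two behaviors the buffer relies on (dropping the oldest entries and uniform random sampling) in isolation, here is a minimal standard-library sketch. The tiny capacity, the fake transitions, and the seed are assumptions chosen only to make the effect visible.

```python
import random
from collections import deque

random.seed(0)  # Assumed seed, only to make the demo repeatable

buffer = deque(maxlen=5)            # Tiny capacity so eviction is visible
for t in range(8):                  # Push 8 fake transitions labelled by time step
    buffer.append((f"s{t}", t % 2, 1.0, f"s{t + 1}", False))

stored_states = [tr[0] for tr in buffer]
print("Stored states:", stored_states)   # The 3 oldest transitions were dropped

batch = random.sample(list(buffer), 3)   # Uniform sampling without replacement
print("Sampled batch:", [tr[0] for tr in batch])
```

Because `deque(maxlen=...)` discards from the opposite end on `append`, the buffer always holds the most recent transitions without any explicit bookkeeping.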

In [None]:
class DQN(nn.Module):
    """Deep Q-Network.
    
    Architecture: state → [128] → [128] → Q-values for each action
    """
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))  # First hidden layer + ReLU
        x = F.relu(self.fc2(x))  # Second hidden layer + ReLU
        return self.fc3(x)       # Output Q-values (no activation)

### The DQN Training Loop

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          DQN TRAINING LOOP                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. COLLECT EXPERIENCE                                                      │
│     ┌─────┐  action (ε-greedy)  ┌─────────────┐                             │
│     │Agent│ ──────────────────> │ Environment │                             │
│     │     │ <────────────────── │             │                             │
│     └─────┘  state, reward      └─────────────┘                             │
│        │                                                                    │
│        v                                                                    │
│  2. STORE IN REPLAY BUFFER                                                  │
│     ┌─────────────────────────────────┐                                     │
│     │  (s, a, r, s', done)  ───────>  │ Replay Buffer                       │
│     └─────────────────────────────────┘                                     │
│                                                                             │
│  3. SAMPLE RANDOM BATCH                                                     │
│     ┌─────────────────────────────────┐                                     │
│     │  Random 32 samples  <─────────  │ Replay Buffer                       │
│     └─────────────────────────────────┘                                     │
│                                                                             │
│  4. COMPUTE LOSS                                                            │
│                                                                             │
│     predicted = Q(s, a; θ)           ← Policy network                       │
│     target = r + γ max Q(s', a'; θ⁻) ← Target network (frozen)              │
│                   a'                                                        │
│     loss = (predicted - target)²                                            │
│                                                                             │
│  5. UPDATE POLICY NETWORK                                                   │
│     θ ← θ - α ∇loss                                                         │
│                                                                             │
│  6. PERIODICALLY UPDATE TARGET NETWORK                                      │
│     Every N steps: θ⁻ ← θ                                                   │
│                                                                             │
│  REPEAT!                                                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
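
Step 4 can be worked through by hand for a single transition. The sketch below uses plain Python numbers; `td_target` is a hypothetical helper, and the reward, Q-values, and prediction are made-up inputs. The `done` branch reflects the usual convention that no future value is added after a terminal state.

```python
def td_target(reward: float, next_q_values: list[float], done: bool,
              gamma: float = 0.99) -> float:
    """Target for one transition: r + gamma * max Q(s', a'), or just r if the episode ended."""
    if done:
        return reward                      # No future value after a terminal state
    return reward + gamma * max(next_q_values)

predicted = 1.5                            # Hypothetical Q(s, a) from the policy network
target = td_target(1.0, [0.5, 0.8], done=False)
loss = (predicted - target) ** 2           # Squared error for this single sample

print(f"target = {target:.3f}, loss = {loss:.4f}")
```

In a real batch the same arithmetic is applied to every sampled transition and the squared errors are averaged.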

In [None]:
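class DQNAgent:
    """DQN Agent with experience replay and target network.
    
    Note: the defaults below (batch_size=64, target_update=10) differ from the
    illustrative numbers in the diagrams above (batch of 32, copy every 100 steps).
    """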
    
    def __init__(self, state_dim: int, action_dim: int,
                 lr: float = 1e-3, gamma: float = 0.99,
                 epsilon_start: float = 1.0, epsilon_end: float = 0.01,
                 epsilon_decay: float = 0.995, buffer_size: int = 10000,
                 batch_size: int = 64, target_update: int = 10):
        
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.batch_size = batch_size
        self.target_update = target_update
        self.update_counter = 0
        
        # Two networks: policy (updates often) and target (updates slowly)
        self.policy_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.policy_net.state_dict())  # Start same
        
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.buffer = ReplayBuffer(buffer_size)
    
    def select_action(self, state: np.ndarray) -> int:
        """Epsilon-greedy action selection."""
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)  # Explore
        
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.policy_net(state_tensor)
            return q_values.argmax().item()  # Exploit
    
    def update(self):
        """Update the network from replay buffer."""
        if len(self.buffer) < self.batch_size:
            return 0.0
        
        # Sample random batch
        states, actions, rewards, next_states, dones = self.buffer.sample(self.batch_size)
        
        # Current Q values from POLICY network
        current_q = self.policy_net(states).gather(1, actions.unsqueeze(1))
        
        # Target Q values from TARGET network (frozen)
        with torch.no_grad():
            next_q = self.target_net(next_states).max(1)[0]
            target_q = rewards + self.gamma * next_q * (1 - dones)
        
        # Compute MSE loss (squeeze only dim 1 so a batch of size 1 stays 1-D)
        loss = F.mse_loss(current_q.squeeze(1), target_q)
        
        # Optimize policy network
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Update target network periodically
        self.update_counter += 1
        if self.update_counter % self.target_update == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())
        
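        # Decay epsilon (less exploration over time). This runs once per update
        # step, not once per episode, so with the defaults epsilon reaches
        # epsilon_end after roughly 900 updates.
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)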
        
        return loss.item()
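
Since `update()` multiplies epsilon by `epsilon_decay` on every call, it is worth checking how quickly exploration shrinks. The sketch below replays that decay rule with the agent's default values. `steps_until_floor` is a hypothetical helper, not part of the agent.

```python
def steps_until_floor(eps_start: float, eps_end: float, decay: float) -> int:
    """Count multiplicative decay steps until epsilon is clamped at eps_end."""
    eps, steps = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps * decay)   # Same rule as in DQNAgent.update()
        steps += 1
    return steps

n_updates = steps_until_floor(1.0, 0.01, 0.995)
print(f"Epsilon reaches its floor after {n_updates} updates")
```

With these defaults the agent is almost fully greedy within the first few episodes. A slower decay, or decaying once per episode instead, would be a reasonable thing to try if training gets stuck early.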

In [None]:
def train_dqn(env_name: str = 'CartPole-v1', n_episodes: int = 300):
    """Train DQN on an environment."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = DQNAgent(state_dim, action_dim)
    
    episode_rewards = []
    
    print(f"Training DQN on {env_name}")
    print("=" * 50)
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        
        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
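            # Store `terminated`, not `done`: a time-limit truncation is not a true
            # terminal state, so the target should still bootstrap from next_state
            agent.buffer.push(state, action, reward, next_state, float(terminated))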
            agent.update()
            
            state = next_state
            total_reward += reward
            
            if done:
                break
        
        episode_rewards.append(total_reward)
        
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:])
            print(f"Episode {episode + 1:>4}, Avg Reward: {avg_reward:>6.1f}, Epsilon: {agent.epsilon:.3f}")
    
    env.close()
    return agent, episode_rewards

# Train the agent
agent, rewards = train_dqn(n_episodes=300)

In [None]:
# Plot training results
plt.figure(figsize=(10, 4))

window = 20
smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')

plt.plot(smoothed)
plt.axhline(y=475, color='r', linestyle='--', label='Solved threshold (475)')
plt.axhline(y=500, color='g', linestyle='--', label='Max score (500)')
plt.xlabel('Episode')
plt.ylabel('Reward (smoothed)')
plt.title('DQN Training on CartPole-v1')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
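
The smoothing step above is a plain moving average. The sketch below reproduces it in pure Python to show that only full windows are kept, as with `mode='valid'`. `moving_average` and the demo rewards are illustrative assumptions.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Mean of each full window; the result has len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

rewards_demo = [10, 20, 30, 40, 50, 60]   # Hypothetical episode rewards
smoothed_demo = moving_average(rewards_demo, window=3)
print(smoothed_demo)
```

One consequence is that the first point of the smoothed curve summarizes the first `window` episodes, so the x-axis of the plot is offset from the true episode index by `window - 1`.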

---

## Policy Gradient Methods

DQN learns Q-values and derives a policy from them. **Policy Gradient** methods take a different approach - they learn the policy DIRECTLY.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    VALUE-BASED vs POLICY-BASED                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  VALUE-BASED (DQN)                  POLICY-BASED (Policy Gradient)          │
│  ─────────────────                  ──────────────────────────────          │
│                                                                             │
│  Learn: Q(s, a)                     Learn: π(a|s) directly                  │
│                                                                             │
│       State s                            State s                            │
│          │                                  │                               │
│          v                                  v                               │
│    ┌───────────┐                     ┌───────────┐                          │
│    │  Network  │                     │  Network  │                          │
│    └───────────┘                     └───────────┘                          │
│          │                                  │                               │
│          v                                  v                               │
│    ┌─────────────┐                   ┌─────────────┐                        │
│    │Q(s,a1)=0.5  │                   │P(a1|s)=0.7  │                        │
│    │Q(s,a2)=0.8  │ ← pick max        │P(a2|s)=0.3  │ ← sample from          │
│    └─────────────┘                   └─────────────┘   distribution         │
│          │                                  │                               │
│    Action = a2                       Action ~ Categorical([0.7, 0.3])       │
│    (deterministic)                   (stochastic)                           │
│                                                                             │
│  Pros:                              Pros:                                   │
│  - Sample efficient                 - Works with continuous actions         │
│  - Stable with replay               - Natural exploration                   │
│                                     - Can learn stochastic policies         │
│                                                                             │
│  Cons:                              Cons:                                   │
│  - Discrete actions only            - High variance                         │
│  - Can be unstable                  - Less sample efficient                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
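
The difference between "pick max" and "sample from distribution" can be demonstrated with the numbers from the diagram. This is a standard-library sketch; the seed and the sample count are assumptions.

```python
import random

random.seed(0)  # Assumed seed so the demo is repeatable

q_values = [0.5, 0.8]        # Value-based: one score per action (from the diagram)
action_probs = [0.7, 0.3]    # Policy-based: one probability per action (from the diagram)

# Value-based: always pick the index of the largest Q-value
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])

# Policy-based: draw actions in proportion to their probabilities
samples = random.choices(range(len(action_probs)), weights=action_probs, k=10_000)
freq_first = samples.count(0) / len(samples)

print(f"Greedy action index: {greedy_action}")
print(f"Empirical frequency of the first action: {freq_first:.2f}")
```

The greedy choice is the same every time, while the sampled actions follow the policy's probabilities only on average. That built-in randomness is what the diagram calls natural exploration.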

### REINFORCE: The Simplest Policy Gradient

The key insight: **increase the probability of actions that led to high rewards**.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         REINFORCE INTUITION                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  After an episode:                                                          │
│                                                                             │
│  t=0: state s₀ ──> action a₀ ──> ... ──> total reward = 100 (good!)        │
│                                                                             │
│  Update rule:                                                               │
│  ────────────                                                               │
│  If total reward was HIGH:                                                  │
│      INCREASE P(a₀|s₀), P(a₁|s₁), ...  (do more of this!)                  │
│                                                                             │
│  If total reward was LOW:                                                   │
│      DECREASE P(a₀|s₀), P(a₁|s₁), ...  (do less of this!)                  │
│                                                                             │
│                                                                             │
│  Mathematically:                                                            │
│                                                                             │
│     ∇J(θ) = Σ  ∇ log π(aₜ|sₜ; θ) × Gₜ                                      │
│             t                                                               │
│                                                                             │
│     where Gₜ = discounted reward from time t on ("return")                  │
│                                                                             │
│  This estimator follows from the "Policy Gradient Theorem"                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
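
The return Gₜ is the piece of the formula that is easiest to get wrong, so here is a small worked example. `discounted_returns` is a hypothetical helper, and the three-step reward sequence and γ = 0.9 are made-up inputs.

```python
def discounted_returns(rewards: list[float], gamma: float) -> list[float]:
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t."""
    returns, running = [], 0.0
    for r in reversed(rewards):    # Work backwards so each step reuses the next one
        running = r + gamma * running
        returns.append(running)
    returns.reverse()              # Back to chronological order
    return returns

example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.9)
print([round(g, 2) for g in example_returns])
```

Earlier steps receive larger returns because more future reward still lies ahead of them, so their log-probabilities are weighted more heavily in the gradient.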

In [None]:
class PolicyNetwork(nn.Module):
    """Policy network that outputs action probabilities."""
    
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 128):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)  # Outputs PROBABILITIES (sum to 1)

In [None]:
class REINFORCEAgent:
    """REINFORCE (Monte Carlo Policy Gradient) agent."""
    
    def __init__(self, state_dim: int, action_dim: int,
                 lr: float = 1e-3, gamma: float = 0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        
        # Store episode data
        self.saved_log_probs = []
        self.rewards = []
    
    def select_action(self, state: np.ndarray) -> int:
        """Sample action from policy (stochastic!)."""
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state_tensor)  # Get probabilities
        
        # Sample from the distribution
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        
        # Save log probability for training
        self.saved_log_probs.append(dist.log_prob(action))
        return action.item()
    
    def update(self):
        """Update policy using collected episode data."""
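        # Calculate discounted returns (reward-to-go), working backwards
        returns = []
        G = 0.0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.append(G)
        returns.reverse()  # Back to chronological order
        
        returns = torch.FloatTensor(returns)
        # Normalize for stability (std is NaN for a single-step episode, so skip it)
        if len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)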
        
        # Policy gradient loss: -log(π) × G
        # (negative because we do gradient ASCENT on reward)
        policy_loss = []
        for log_prob, G in zip(self.saved_log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        loss = torch.stack(policy_loss).sum()
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        self.saved_log_probs = []
        self.rewards = []
        
        return loss.item()
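
The normalization step in `update()` matters more than it may appear. In CartPole every step yields a reward of +1, so all raw returns are positive and every sampled action would be pushed up. The sketch below mirrors the normalization in pure Python to show how it produces both positive and negative weights. `normalize` and the sample returns are illustrative assumptions.

```python
import statistics

def normalize(returns: list[float], eps: float = 1e-8) -> list[float]:
    """Shift returns to zero mean and scale by the sample standard deviation."""
    mean = statistics.mean(returns)
    std = statistics.stdev(returns)   # Sample std (n - 1), the default of torch.Tensor.std()
    return [(g - mean) / (std + eps) for g in returns]

raw = [2.71, 1.9, 1.0]                # Hypothetical returns, all positive
norm = normalize(raw)
print([round(g, 3) for g in norm])
```

After normalization, actions with above-average returns are reinforced and those with below-average returns are discouraged, which acts like a simple baseline.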

In [None]:
def train_reinforce(env_name: str = 'CartPole-v1', n_episodes: int = 500):
    """Train REINFORCE agent."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    agent = REINFORCEAgent(state_dim, action_dim)
    episode_rewards = []
    
    print(f"Training REINFORCE on {env_name}")
    print("=" * 50)
    
    for episode in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0
        
        while True:
            action = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            
            agent.rewards.append(reward)
            state = next_state
            total_reward += reward
            
            if done:
                break
        
        agent.update()  # Update AFTER episode ends (Monte Carlo)
        episode_rewards.append(total_reward)
        
        if (episode + 1) % 50 == 0:
            avg_reward = np.mean(episode_rewards[-50:])
            print(f"Episode {episode + 1:>4}, Avg Reward: {avg_reward:>6.1f}")
    
    env.close()
    return agent, episode_rewards

# Train REINFORCE
pg_agent, pg_rewards = train_reinforce(n_episodes=500)

---

## Actor-Critic: Best of Both Worlds

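REINFORCE has high variance (noisy gradients) because it weights each action by the full Monte Carlo return. Actor-Critic methods reduce this by learning a value function (the critic) and using it as a baseline: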

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         ACTOR-CRITIC ARCHITECTURE                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                              State s                                        │
│                                 │                                           │
│                    ┌────────────┴────────────┐                              │
│                    │                         │                              │
│                    v                         v                              │
│             ┌─────────────┐           ┌─────────────┐                       │
│             │   ACTOR     │           │   CRITIC    │                       │
│             │  π(a|s; θ)  │           │   V(s; w)   │                       │
│             │             │           │             │                       │
│             │ "What to do"│           │ "How good   │                       │
│             │             │           │  is this?"  │                       │
│             └─────────────┘           └─────────────┘                       │
│                    │                         │                              │
│                    v                         v                              │
│              Action probs               Value estimate                      │
│              [0.7, 0.3]                    V(s) = 42                        │
│                    │                         │                              │
│                    └───────────┬─────────────┘                              │
│                                │                                            │
│                                v                                            │
│                         ADVANTAGE                                           │
│                    A = r + γV(s') - V(s)                                    │
│                                                                             │
│                    "Was action better or worse than expected?"              │
│                                                                             │
│  If A > 0: Action was BETTER than expected → increase probability           │
│  If A < 0: Action was WORSE than expected → decrease probability            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
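
The advantage formula in the diagram is a one-line computation. The sketch below evaluates it for two made-up transitions. `td_advantage` is a hypothetical helper, and the `done` flag (dropping the bootstrap term at episode end) is an assumption that goes slightly beyond what the diagram shows.

```python
def td_advantage(reward: float, v_s: float, v_next: float, done: bool,
                 gamma: float = 0.99) -> float:
    """One-step advantage estimate: A = r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * v_next   # No future value after termination
    return reward + bootstrap - v_s

# The critic expected V(s) = 42, but the next state only looks worth 40
adv_worse = td_advantage(reward=1.0, v_s=42.0, v_next=40.0, done=False)

# Same starting estimate, but the next state looks better than expected
adv_better = td_advantage(reward=1.0, v_s=42.0, v_next=45.0, done=False)

print(f"Worse than expected:  A = {adv_worse:+.2f}")
print(f"Better than expected: A = {adv_better:+.2f}")
```

The sign of the advantage decides whether the actor raises or lowers the probability of the action it took, and its magnitude scales the size of that change.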

This leads to modern algorithms like **A2C, A3C, PPO, and SAC** (which we'll use in RLlib)!

---

## Summary: Algorithm Family Tree

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RL ALGORITHM FAMILIES                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                         Reinforcement Learning                              │
│                                   │                                         │
│                 ┌─────────────────┴─────────────────┐                       │
│                 │                                   │                       │
│           VALUE-BASED                         POLICY-BASED                  │
│           "Learn Q(s,a)"                      "Learn π(a|s)"                │
│                 │                                   │                       │
│         ┌───────┴───────┐                   ┌───────┴───────┐               │
│         │               │                   │               │               │
│      Q-Table          DQN              REINFORCE      Actor-Critic          │
│   (small states)  (neural net)        (vanilla)     (reduced variance)     │
│                         │                                   │               │
│                 ┌───────┴───────┐               ┌───────────┼───────────┐   │
│                 │               │               │           │           │   │
│              Double          Dueling         A2C/A3C       PPO         SAC  │
│              DQN             DQN            (parallel)   (stable)   (entropy)│
│                                                                             │
│  Discrete actions only ◄──────────────►  Continuous actions supported      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Key Takeaways

| Method | Key Idea | Pros | Cons |
|--------|----------|------|------|
| **DQN** | Neural net Q-values + replay + target net | Sample efficient | Discrete only |
| **REINFORCE** | Directly optimize policy | Simple, continuous actions | High variance |
| **Actor-Critic** | Combine value + policy learning | Lower variance, flexible | More complex |

## What's Next?

Now that you understand the foundations, we'll use **RLlib** to:
- Train these algorithms at scale (parallel workers)
- Use optimized implementations (PPO, SAC, etc.)
- Skip the low-level details!

```
┌───────────────────┐          ┌───────────────────┐
│  01.2 Deep RL     │   ───>   │  02.1 RLlib Setup │
│  (you are here)   │          │                   │
│                   │          │  - PPO in 10 lines│
│  - DQN            │          │  - Parallel envs  │
│  - Policy Gradient│          │  - Easy config    │
│  - Actor-Critic   │          │                   │
└───────────────────┘          └───────────────────┘
```