# RLlib Algorithms: Choosing the Right Tool

**Prerequisites**: Complete [02.1 RLlib Setup](./01_ray_setup.ipynb)

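RLlib ships many algorithms. This guide narrows the choice to four of them (PPO, DQN, SAC, A2C) using two questions: is your action space discrete or continuous, and how much does sample efficiency matter?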

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        ALGORITHM DECISION FLOWCHART                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                         What's your ACTION SPACE?                           │
│                                   │                                         │
│                    ┌──────────────┴──────────────┐                          │
│                    │                             │                          │
│              DISCRETE                       CONTINUOUS                      │
│           (left/right/jump)              (torque: -1.0 to 1.0)              │
│                    │                             │                          │
│             ┌──────┴──────┐               ┌──────┴──────┐                   │
│             │             │               │             │                   │
│    Need sample      Don't care     Need sample     Don't care               │
│    efficiency?                     efficiency?                              │
│             │             │               │             │                   │
│             v             v               v             v                   │
│         ┌─────┐       ┌─────┐         ┌─────┐       ┌─────┐                │
│         │ DQN │       │ PPO │         │ SAC │       │ PPO │                │
│         └─────┘       └─────┘         └─────┘       └─────┘                │
│                           │                             │                   │
│                           └──────────────┬──────────────┘                   │
│                                          │                                  │
│                                  ┌───────────────┐                          │
│                                  │  PPO is your  │                          │
│                                  │  DEFAULT!     │                          │
│                                  └───────────────┘                          │
│                                                                             │
│  Rule of thumb: When in doubt, use PPO. It works almost everywhere.         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

In [None]:
import warnings
import logging
warnings.filterwarnings("ignore")
logging.getLogger("ray").setLevel(logging.ERROR)

import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.algorithms.sac import SACConfig
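# Note: recent Ray releases moved A2C out of core RLlib (into rllib_contrib),
# so this import only works where ray.rllib.algorithms.a2c is still available.
from ray.rllib.algorithms.a2c import A2CConfig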
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

ray.init(
    num_cpus=4,
    object_store_memory=1 * 1024 * 1024 * 1024,
    ignore_reinit_error=True,
)
print(f"Ray initialized: {ray.cluster_resources()}")

---

## Understanding the Algorithm Zoo

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           ALGORITHM TAXONOMY                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                            RL Algorithms                                    │
│                                  │                                          │
│              ┌───────────────────┴───────────────────┐                      │
│              │                                       │                      │
│        MODEL-FREE                              MODEL-BASED                  │
│    "Learn from experience"               "Learn environment model"          │
│              │                              (Dreamer, MBPO)                 │
│              │                                                              │
│      ┌───────┴───────┐                                                      │
│      │               │                                                      │
│  VALUE-BASED    POLICY-BASED                                                │
│  "Learn Q(s,a)"  "Learn π(a|s)"                                             │
│      │               │                                                      │
│      │         ┌─────┴─────┐                                                │
│      │         │           │                                                │
│    DQN    Actor-Critic  Pure Policy                                         │
│   Rainbow  (both!)     REINFORCE                                            │
│              │                                                              │
│       ┌──────┴──────┐                                                       │
│       │             │                                                       │
│   ON-POLICY     OFF-POLICY                                                  │
│   "Fresh data"  "Replay buffer"                                             │
│       │             │                                                       │
│   ┌───┴───┐     ┌───┴───┐                                                   │
│   │       │     │       │                                                   │
│  A2C     PPO   SAC     TD3                                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

### On-Policy vs Off-Policy: The Key Distinction

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                     ON-POLICY vs OFF-POLICY                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ON-POLICY (PPO, A2C)                OFF-POLICY (DQN, SAC)                  │
│  ────────────────────                ──────────────────────                 │
│                                                                             │
│  Collect ──> Train ──> DISCARD       Collect ──> Store ──> Sample ──> Train │
│   data        data      data          data       buffer    randomly         │
│                                                     │                       │
│     ┌──────┐                              ┌────────────────┐                │
│     │ Data │  Use once,                   │ Replay Buffer  │                │
│     │      │  throw away                  │ ┌─┐┌─┐┌─┐┌─┐   │                │
│     └──────┘                              │ │ ││ ││ ││ │...│  Reuse many    │
│                                           │ └─┘└─┘└─┘└─┘   │  times!        │
│                                           └────────────────┘                │
│                                                                             │
│  Pros:                                   Pros:                              │
│  ✓ More stable                           ✓ Sample efficient                 │
│  ✓ Simpler to implement                  ✓ Can learn from old data          │
│  ✓ Fewer hyperparameters                 ✓ Better for expensive envs        │
│                                                                             │
│  Cons:                                   Cons:                              │
│  ✗ Needs lots of data                    ✗ Can be unstable                  │
│  ✗ Can't reuse old experience            ✗ More hyperparameters             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
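
The two data flows above can be sketched in plain Python. The `ReplayBuffer` class here is a hypothetical, minimal uniform-sampling buffer written for illustration only; it is not RLlib's buffer, which also handles prioritization, episodes, and multi-agent data.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform sampling (illustrative only)."""

    def __init__(self, capacity, seed=0):
        # deque(maxlen=...) silently drops the oldest item when full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # No duplicates inside one batch, but the same transition can be
        # drawn again in later batches: that is the "reuse" in the diagram.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


# Off-policy: store (s, a, r, s') tuples, then sample from the buffer repeatedly.
buffer = ReplayBuffer(capacity=5)
for step in range(8):
    buffer.add((f"s{step}", 0, 1.0, f"s{step + 1}"))

batch_1 = buffer.sample(3)
batch_2 = buffer.sample(3)  # may overlap with batch_1

# On-policy: one rollout feeds one update and is then thrown away.
rollout = [(f"s{step}", 0, 1.0, f"s{step + 1}") for step in range(8)]
# ... a single policy update on `rollout` would happen here ...
rollout.clear()
```

With a capacity of 5 and 8 inserts, only the 5 most recent transitions remain, which is also why very old experience eventually ages out of an off-policy learner.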

---

## 1. PPO (Proximal Policy Optimization)

**The safe default.** Works on almost anything.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              PPO EXPLAINED                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TYPE: On-policy, Actor-Critic                                              │
│  ACTIONS: Discrete AND Continuous                                           │
│                                                                             │
│  KEY IDEA: Limit how much the policy can change per update                  │
│                                                                             │
│     Old Policy                    New Policy                                │
│     π_old(a|s)                    π_new(a|s)                                │
│         │                              │                                    │
│         └──────────┬───────────────────┘                                    │
│                    │                                                        │
│                    v                                                        │
│         ratio = π_new(a|s) / π_old(a|s)                                     │
│                    │                                                        │
│                    v                                                        │
│         ┌──────────────────────────┐                                        │
│         │  CLIP(ratio, 1-ε, 1+ε)   │  ← Keep ratio between 0.8 and 1.2     │
│         │  where ε = 0.2           │    (prevents wild policy swings)       │
│         └──────────────────────────┘                                        │
│                                                                             │
│  Without clipping:          With PPO clipping:                              │
│  ┌──────────────────┐       ┌──────────────────┐                            │
│  │       /\         │       │         ___      │                            │
│  │      /  \  crash!│       │        /   \     │  (smooth learning)         │
│  │     /    \___    │       │     __/     \__  │                            │
│  │    /             │       │    /             │                            │
│  └──────────────────┘       └──────────────────┘                            │
│                                                                             │
│  WHEN TO USE:                                                               │
│  • First algorithm to try on any new problem                                │
│  • Continuous control (robotics, MuJoCo)                                    │
│  • When stability matters more than sample efficiency                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
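
The clipping box in the diagram corresponds to PPO's clipped surrogate objective, `min(r * A, clip(r, 1 - ε, 1 + ε) * A)`, where `r` is the probability ratio and `A` the advantage. Here is a minimal per-sample sketch in plain Python; `clipped_surrogate` is an illustrative helper, not an RLlib function.

```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0), ratio already above 1 + eps: the gain is capped,
# so there is no incentive to push the ratio even higher.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)

# Bad action (A < 0), ratio already below 1 - eps: also capped.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-2.0)

# Bad action whose probability went UP: the min() keeps the unclipped,
# more pessimistic term, so the full penalty still applies.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-2.0)

# Inside the clip range the objective is just ratio * advantage.
unclipped = clipped_surrogate(ratio=1.1, advantage=2.0)

print(capped_gain, capped_loss, full_penalty, unclipped)
```

Note that the clip does not forbid large policy changes outright; it removes the gradient signal that would reward them.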

In [None]:
# PPO Configuration
ppo_config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=3e-4,                  # Learning rate
        gamma=0.99,               # Discount factor
        train_batch_size=4000,    # Samples per update
        
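        # PPO-specific (older Ray releases name the next two settings
        # sgd_minibatch_size and num_sgd_iter)
        minibatch_size=128,       # Minibatch size for each SGD step
        num_epochs=10,            # Passes over each train batch
        clip_param=0.2,           # Clip range for the probability ratio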
        vf_loss_coeff=0.5,        # Value function loss weight
        entropy_coeff=0.01,       # Exploration bonus
        
        # GAE (Generalized Advantage Estimation)
        use_gae=True,
        lambda_=0.95,
    )
)

print("PPO config created")
print("  clip_param=0.2 clips the probability ratio to [0.8, 1.2] in the loss")
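
The `use_gae=True` and `lambda_=0.95` settings control how advantages are estimated. As a rough sketch of Generalized Advantage Estimation in plain Python: `gae_advantages` is a hypothetical helper, and it assumes a single trajectory fragment with no episode termination inside it (RLlib's own implementation handles episode boundaries and batching).

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lambda_=0.95):
    """GAE for one fragment: A_t = sum_k (gamma * lambda)^k * delta_{t+k}."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # value estimate of the state after the fragment
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lambda_ * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.2, 0.1]

# lambda_ = 0 reduces to one-step TD errors (low variance, more bias).
td_only = gae_advantages(rewards, values, last_value=0.0, gamma=0.9, lambda_=0.0)

# lambda_ = 1 sums all future TD errors (less bias, more variance).
full_sum = gae_advantages(rewards, values, last_value=0.0, gamma=0.9, lambda_=1.0)

print(td_only)
print(full_sum)
```

Values of `lambda_` between 0 and 1, such as the 0.95 used above, trade off between those two extremes.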

---

## 2. DQN (Deep Q-Network)

**For discrete actions when you need sample efficiency.**

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              DQN EXPLAINED                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TYPE: Off-policy, Value-based                                              │
│  ACTIONS: Discrete ONLY (no continuous!)                                    │
│                                                                             │
│  KEY IDEA: Learn Q(s,a) with neural network + replay buffer + target net    │
│                                                                             │
│     ┌──────────────────────────────────────────────────────────────┐        │
│     │                    DQN COMPONENTS                            │        │
│     ├──────────────────────────────────────────────────────────────┤        │
│     │                                                              │        │
│     │  1. Q-Network: state → Q-values for each action              │        │
│     │                                                              │        │
│     │  2. Replay Buffer: stores (s, a, r, s') tuples               │        │
│     │     └─> enables learning from past experience                │        │
│     │                                                              │        │
│     │  3. Target Network: slowly-updating copy of Q-network        │        │
│     │     └─> provides stable learning targets                     │        │
│     │                                                              │        │
│     │  4. ε-greedy: explore randomly with probability ε            │        │
│     │                                                              │        │
│     └──────────────────────────────────────────────────────────────┘        │
│                                                                             │
│  ENHANCEMENTS (RLlib supports all of these):                                │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │ Double DQN  │  │ Dueling DQN │  │ N-step      │  │ Prioritized │        │
│  │             │  │             │  │             │  │ Replay      │        │
│  │ Fixes over- │  │ Separates   │  │ Look ahead  │  │ Sample imp- │        │
│  │ estimation  │  │ V and A     │  │ N steps     │  │ ortant exp. │        │
│  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                                             │
│  WHEN TO USE:                                                               │
│  • Discrete action spaces (Atari, board games)                              │
│  • When sample efficiency matters (expensive environments)                  │
│  • When you can collect lots of data over time                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
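
To make the "Double DQN fixes overestimation" box concrete, here is a small sketch of the two learning targets in plain Python. Both functions are illustrative helpers with made-up Q-values, not RLlib code.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Vanilla DQN: the target network both picks and scores the next action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for the next state s'. The target network has a
# noisy, overly optimistic estimate for action 0.
next_q_online = [1.0, 1.2]
next_q_target = [3.0, 1.0]

vanilla = dqn_target(1.0, next_q_target)
double = double_dqn_target(1.0, next_q_online, next_q_target)
print(f"vanilla target: {vanilla:.2f}, double DQN target: {double:.2f}")
```

Because the max over noisy estimates is biased upward, decoupling action selection from action evaluation tends to give lower, more realistic targets.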

In [None]:
# DQN Configuration
dqn_config = (
    DQNConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=1e-3,
        gamma=0.99,
        train_batch_size=32,
        
        # DQN-specific
        replay_buffer_config={
            # Episode-based prioritized buffer used by RLlib's new API stack
            "type": "PrioritizedEpisodeReplayBuffer",
            "capacity": 50000,
        },
        double_q=True,              # Double DQN (reduces overestimation)
        dueling=True,               # Dueling DQN (separates V and A)
        n_step=3,                   # N-step returns (look ahead)
        target_network_update_freq=500,  # Timesteps between target-network syncs
        # Epsilon-greedy schedule as (timestep, epsilon) pairs:
        # 100% random at step 0, annealed to 2% random by step 10,000.
        # (The old API stack set this via .exploration(exploration_config=...).)
        epsilon=[(0, 1.0), (10000, 0.02)],
    )
)

print("DQN config created")
print("  Using: Double DQN + Dueling DQN + Prioritized Replay + 3-step returns")
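
The epsilon-greedy schedule configured above starts fully random and ends at 2% random after 10,000 timesteps. Here is a sketch of what that means, assuming linear annealing between the two points; both functions are hypothetical helpers, not RLlib's exploration code.

```python
import random


def linear_epsilon(timestep, initial_epsilon=1.0, final_epsilon=0.02,
                   epsilon_timesteps=10000):
    """Linearly anneal epsilon, then hold it at final_epsilon."""
    fraction = min(max(timestep, 0) / epsilon_timesteps, 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


def epsilon_greedy_action(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
for t in (0, 5000, 10000, 50000):
    eps = linear_epsilon(t)
    action = epsilon_greedy_action([0.1, 0.9], eps, rng)
    print(f"t={t:>6}  epsilon={eps:.2f}  action={action}")
```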

---

## 3. SAC (Soft Actor-Critic)

**For continuous control when you need sample efficiency.**

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              SAC EXPLAINED                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TYPE: Off-policy, Actor-Critic                                             │
│  ACTIONS: Continuous (can do discrete but not recommended)                  │
│                                                                             │
│  KEY IDEA: Maximize reward AND entropy (built-in exploration!)              │
│                                                                             │
│     Regular RL objective:        SAC objective:                             │
│     ────────────────────         ──────────────                             │
│     max  Σ γᵗ rₜ                 max  Σ γᵗ (rₜ + α H(π))                   │
│                                            └─────────┘                      │
│                                            Entropy bonus!                   │
│                                            (keep exploring)                 │
│                                                                             │
│  WHY ENTROPY MATTERS:                                                       │
│                                                                             │
│     Low Entropy Policy          High Entropy Policy                         │
│     ─────────────────           ────────────────────                        │
│     "Always do action 1"        "Spread bets across actions"                │
│                                                                             │
│     P(a) █████████              P(a) ██ ██ ██ ██                            │
│          │░░░░░░░│                   │  │  │  │                             │
│          1 2 3 4                     1  2  3  4                             │
│                                                                             │
│     Problem: Gets stuck!        Benefit: Keeps exploring!                   │
│                                                                             │
│  ARCHITECTURE:                                                              │
│                                                                             │
│     ┌────────────┐    ┌────────────┐    ┌────────────┐                     │
│     │   Actor    │    │  Critic 1  │    │  Critic 2  │  ← Two critics      │
│     │   π(a|s)   │    │  Q₁(s,a)   │    │  Q₂(s,a)   │    (min of both)    │
│     └────────────┘    └────────────┘    └────────────┘                     │
│                                                                             │
│  WHEN TO USE:                                                               │
│  • Continuous control (robotics, MuJoCo)                                    │
│  • When sample efficiency is critical                                       │
│  • When you want automatic exploration tuning                               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
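
The two bar charts in the diagram can be put into numbers. This sketch computes Shannon entropy for a peaked and a uniform policy over four discrete actions, then shows how an entropy bonus changes the per-step objective. The probabilities, reward, and `alpha` value are made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy H(p) = -sum(p * log p), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


low_entropy_policy = [0.97, 0.01, 0.01, 0.01]   # "Always do action 1"
high_entropy_policy = [0.25, 0.25, 0.25, 0.25]  # "Spread bets across actions"

h_low = entropy(low_entropy_policy)
h_high = entropy(high_entropy_policy)  # uniform over 4 actions: log(4)

# Entropy-augmented per-step objective: r_t + alpha * H(pi).
alpha = 0.2   # hypothetical entropy coefficient
reward = 1.0  # hypothetical reward
print(f"low:  H={h_low:.3f}  r + alpha*H = {reward + alpha * h_low:.3f}")
print(f"high: H={h_high:.3f}  r + alpha*H = {reward + alpha * h_high:.3f}")
```

With equal rewards, the higher-entropy policy scores better under the SAC objective, which is what keeps the agent exploring.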

In [None]:
# SAC Configuration (for continuous action space)
sac_config = (
    SACConfig()
    .environment("Pendulum-v1")  # Continuous action space!
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=3e-4,
        gamma=0.99,
        train_batch_size=256,
        
        # SAC-specific
        tau=0.005,                # Soft update coefficient (Polyak averaging)
        initial_alpha=1.0,        # Starting entropy coefficient (tuned during training)
        target_entropy="auto",    # Entropy target for alpha tuning, derived from the action space
        n_step=1,
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "capacity": 100000,
        },
    )
)

print("SAC config created")
print("  target_entropy='auto' lets SAC tune alpha toward a default entropy target")
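
The `tau=0.005` setting controls how quickly the target critics track the online critics. Here is a plain-Python sketch of that soft (Polyak) update on a toy parameter vector; `polyak_update` is an illustrative helper, not RLlib's implementation.

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Soft update: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


online = [1.0, -2.0]  # pretend the online network stays fixed
target = [0.0, 0.0]

history = []
for step in range(200):
    target = polyak_update(target, online, tau=0.005)
    history.append(target[0])

# The target moves 0.5% of the remaining distance on every update.
print(f"after   1 update : {history[0]:.4f}")
print(f"after 200 updates: {history[-1]:.4f}")
```

A small `tau` means the learning targets change slowly, which is the same stabilizing idea as DQN's periodic target-network sync, applied continuously.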

---

## 4. A2C (Advantage Actor-Critic)

**Simple baseline, good for understanding.**

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              A2C EXPLAINED                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  TYPE: On-policy, Actor-Critic                                              │
│  ACTIONS: Discrete AND Continuous                                           │
│                                                                             │
│  KEY IDEA: Synchronous advantage actor-critic                               │
│                                                                             │
│     A2C vs A3C:                                                             │
│     ────────────                                                            │
│                                                                             │
│     A3C (Async):              A2C (Sync):                                   │
│     Workers update            Workers wait,                                 │
│     independently             update together                               │
│                                                                             │
│     W1 ──┐                    W1 ──┐                                        │
│     W2 ──┼──> update          W2 ──┼──> wait ──> update                     │
│     W3 ──┘    (anytime)       W3 ──┘    (sync)   (together)                 │
│                                                                             │
│     More throughput           More stable                                   │
│     Less stable               Simpler to implement                          │
│                                                                             │
│  WHY USE A2C?                                                               │
│  • Simpler than PPO (good for learning)                                     │
│  • Good baseline to compare against                                         │
│  • Fast iteration (no replay buffer overhead)                               │
│                                                                             │
│  WHY NOT USE A2C?                                                           │
│  • PPO is almost always better                                              │
│  • Less sample efficient than off-policy methods                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
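
The sync-versus-async picture can be sketched with toy gradient vectors. This is a simplification in plain Python with hypothetical helper names: real A3C workers would also compute their gradients from possibly stale parameters, which is not modeled here.

```python
def average_gradients(worker_gradients):
    """Element-wise mean of per-worker gradient vectors."""
    num_workers = len(worker_gradients)
    return [sum(column) / num_workers for column in zip(*worker_gradients)]


def sgd_step(params, gradient, lr=0.1):
    """One plain gradient-descent step."""
    return [p - lr * g for p, g in zip(params, gradient)]


params = [1.0, 2.0]
worker_gradients = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]  # W1, W2, W3

# A2C-style: wait for every worker, average, then apply a single update.
sync_params = sgd_step(params, average_gradients(worker_gradients))

# A3C-style: apply each worker's gradient as soon as it arrives.
async_params = params
for gradient in worker_gradients:
    async_params = sgd_step(async_params, gradient)

print("sync :", sync_params)
print("async:", async_params)
```

The synchronous version takes one averaged step per round, which smooths out per-worker noise at the cost of waiting for the slowest worker.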

In [None]:
# A2C Configuration
a2c_config = (
    A2CConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=1e-3,
        gamma=0.99,
        train_batch_size=500,
        
        # A2C-specific
        vf_loss_coeff=0.5,
        entropy_coeff=0.01,
        use_gae=True,
        lambda_=0.95,
    )
)

print("A2C config created")
print("  A2C is simpler than PPO (no clipping) but less stable")

---

## Algorithm Comparison

Let's train each algorithm and compare!

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        WHAT TO EXPECT                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Algorithm   Sample Efficiency   Stability   Convergence Speed              │
│  ─────────   ─────────────────   ─────────   ─────────────────              │
│  PPO         Medium              High        Medium                         │
│  DQN         High (replay)       Medium      Slow then fast                 │
│  A2C         Low                 Medium      Fast but noisy                 │
│                                                                             │
│  Expected on CartPole:                                                      │
│  • All can solve it (475+) given enough iterations; 20 may be too few       │
│  • PPO: Steady improvement                                                  │
│  • DQN: Slow start, then takes off                                          │
│  • A2C: Fast start, more variance                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

In [None]:
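def train_and_evaluate(config, name, n_iters=20):
    """Build the algorithm, train for n_iters iterations, and return the
    mean episode return recorded after each iteration."""
    # build_algo() is the builder in recent Ray releases; older ones use build().
    algo = config.build_algo()
    rewards = []
    
    try:
        for i in range(n_iters):
            result = algo.train()
            # The new API stack reports returns under result["env_runners"].
            reward = result["env_runners"]["episode_return_mean"]
            rewards.append(reward)
            
            if (i + 1) % 5 == 0:
                print(f"{name:>6} - Iter {i+1:>2}: {reward:>6.1f}")
    finally:
        algo.stop()  # Release env runners even if training raises
    return rewards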

In [None]:
# Compare PPO vs DQN vs A2C on CartPole
print("Training PPO...")
print("=" * 40)
ppo_rewards = train_and_evaluate(ppo_config, "PPO", n_iters=20)

print("\nTraining DQN...")
print("=" * 40)
dqn_rewards = train_and_evaluate(dqn_config, "DQN", n_iters=20)

print("\nTraining A2C...")
print("=" * 40)
a2c_rewards = train_and_evaluate(a2c_config, "A2C", n_iters=20)

In [None]:
# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(ppo_rewards, label='PPO', linewidth=2, marker='o', markersize=4)
plt.plot(dqn_rewards, label='DQN', linewidth=2, marker='s', markersize=4)
plt.plot(a2c_rewards, label='A2C', linewidth=2, marker='^', markersize=4)
plt.axhline(y=475, color='gray', linestyle='--', label='Solved (475)')
plt.xlabel('Training Iteration')
plt.ylabel('Mean Episode Return')
plt.title('Algorithm Comparison on CartPole-v1')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
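
To read the curves against the dashed "Solved (475)" line programmatically, a small helper like the one below can report the first iteration at which each algorithm crosses the threshold. `first_solved_iteration` is a hypothetical helper, and the sample curve is made-up data, not a result from the training runs above.

```python
import math


def first_solved_iteration(rewards, threshold=475.0):
    """Return the 1-based iteration where the mean return first reaches
    the threshold, or None if it never does. NaN entries are skipped."""
    for iteration, reward in enumerate(rewards, start=1):
        if not math.isnan(reward) and reward >= threshold:
            return iteration
    return None


# Made-up learning curve for illustration only.
example_curve = [float("nan"), 22.0, 140.5, 480.0, 470.0]
print(first_solved_iteration(example_curve))
print(first_solved_iteration([10.0, 20.0]))
```

Calling it on `ppo_rewards`, `dqn_rewards`, and `a2c_rewards` gives a quick numeric summary to go with the plot; a result of `None` simply means more iterations are needed.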

---

## Quick Reference: Algorithm Cheat Sheet

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        ALGORITHM CHEAT SHEET                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Algorithm │ Actions    │ Sample Eff │ Stability │ Best For                │
│  ──────────┼────────────┼────────────┼───────────┼─────────────────────────│
│  PPO       │ Both       │ Medium     │ High      │ DEFAULT CHOICE!         │
│  DQN       │ Discrete   │ High       │ Medium    │ Atari, board games      │
│  SAC       │ Continuous │ High       │ Medium    │ Robotics, MuJoCo        │
│  A2C       │ Both       │ Low        │ Medium    │ Simple baseline         │
│  IMPALA    │ Both       │ Medium     │ Medium    │ Massive scale           │
│  APEX-DQN  │ Discrete   │ High       │ Medium    │ Distributed DQN         │
│                                                                             │
│  DECISION RULES:                                                            │
│  ───────────────                                                            │
│  1. Start with PPO. Always.                                                 │
│  2. Discrete actions + need efficiency? --> Try DQN                         │
│  3. Continuous actions + need efficiency? --> Try SAC                       │
│  4. Massive scale (100+ workers)? --> Consider IMPALA                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
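
The decision rules above can be written as a small function. This is only a sketch: `choose_algorithm` is a hypothetical helper, and letting the massive-scale rule take precedence over the efficiency rules is an assumption, since the cheat sheet does not say which rule wins when several apply.

```python
def choose_algorithm(action_space, need_sample_efficiency=False, num_workers=1):
    """Map the cheat-sheet decision rules to an algorithm name."""
    if action_space not in ("discrete", "continuous"):
        raise ValueError("action_space must be 'discrete' or 'continuous'")

    # Rule 4: massive scale (100+ workers) -> consider IMPALA.
    # Assumption: this rule is checked first.
    if num_workers >= 100:
        return "IMPALA"

    # Rules 2 and 3: sample efficiency picks the off-policy option.
    if need_sample_efficiency:
        return "DQN" if action_space == "discrete" else "SAC"

    # Rule 1: otherwise start with PPO.
    return "PPO"


print(choose_algorithm("discrete"))
print(choose_algorithm("discrete", need_sample_efficiency=True))
print(choose_algorithm("continuous", need_sample_efficiency=True))
print(choose_algorithm("continuous", num_workers=128))
```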

## More Algorithms in RLlib

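RLlib has offered many more algorithms. Availability depends on your Ray version, and several of these have moved to the separate `rllib_contrib` package in recent releases: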

| Category | Algorithms | Use Case |
|----------|-----------|----------|
| **Distributed** | IMPALA, APEX-DQN | Scale to many workers |
| **Continuous** | TD3, DDPG | Alternative to SAC |
| **Multi-Agent** | QMIX, MADDPG | Cooperative/competitive agents |
| **Offline RL** | CQL, MARWIL | Learn from fixed datasets |
| **Model-Based** | Dreamer | Learn world model |

## What's Next?

```
┌──────────────────┐          ┌──────────────────┐          ┌──────────────────┐
│ 02.2 Algorithms  │   ───>   │ 03 Custom Envs   │   ───>   │ 05 Distributed   │
│ (you are here)   │          │                  │          │                  │
│                  │          │ Build your own   │          │ Scale to many    │
│ - PPO, DQN, SAC  │          │ Gymnasium envs   │          │ GPUs & workers   │
│ - When to use    │          │                  │          │                  │
└──────────────────┘          └──────────────────┘          └──────────────────┘
```

In [None]:
ray.shutdown()