# 2.2 RLlib Algorithms Overview

## Learning Objectives
- Understand the main algorithm families in RLlib
- Learn when to use each algorithm
- Compare performance characteristics
- Run multiple algorithms on the same environment

In [None]:
import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.dqn import DQNConfig
from ray.rllib.algorithms.sac import SACConfig
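# Requires a Ray release that still ships A2C in core RLlib;
# later releases moved it to the separate rllib_contrib package.
from ray.rllib.algorithms.a2c import A2CConfig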
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

ray.init(ignore_reinit_error=True)

## Algorithm Taxonomy

```
                    RL Algorithms
                          │
            ┌─────────────┴─────────────┐
            │                           │
      Model-Free                   Model-Based
            │                      (Dreamer, MBPO)
    ┌───────┴───────┐
    │               │
Value-Based    Policy-Based
(DQN, Rainbow)  (REINFORCE)
                    │
              Actor-Critic
            ┌───────┴───────┐
            │               │
        On-Policy       Off-Policy
        (A2C, PPO)      (SAC, TD3)
```

## 1. DQN (Deep Q-Network)

**Type**: Value-based, Off-policy

**Best for**: Discrete action spaces where sample efficiency matters

**Key features**:
- Experience replay for sample efficiency
- Target network for stability
- Works only with discrete actions
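
To make the target-network idea concrete, here is a minimal sketch of a one-step Double DQN target for a single transition. The function name `double_dqn_target` and its list-based inputs are illustrative assumptions, not RLlib APIs; a real implementation works on batched tensors.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """One-step Double DQN target for a single transition.

    q_online_next / q_target_next are lists of Q-values for the next state,
    one entry per discrete action (illustrative inputs, not RLlib objects).
    """
    if done:
        # No bootstrapping past a terminal state.
        return reward
    # The online network selects the greedy next action...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...and the slowly updated target network evaluates it.
    return reward + gamma * q_target_next[best_action]


target = double_dqn_target(
    reward=1.0,
    done=False,
    gamma=0.99,
    q_online_next=[0.5, 2.0],   # online net prefers action 1
    q_target_next=[1.5, 1.0],   # target net's value for action 1 is 1.0
)
print(target)
```

Separating action selection from action evaluation is what reduces the overestimation that a plain `max` over the target network's Q-values tends to produce.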

In [None]:
# DQN Configuration
dqn_config = (
    DQNConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=1e-3,
        gamma=0.99,
        train_batch_size=32,
        # DQN specific
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "capacity": 50000,
        },
        double_q=True,        # Double DQN
        dueling=True,         # Dueling DQN
        n_step=3,             # N-step returns
        target_network_update_freq=500,
    )
    .exploration(
        exploration_config={
            "type": "EpsilonGreedy",
            "initial_epsilon": 1.0,
            "final_epsilon": 0.02,
            "epsilon_timesteps": 10000,
        }
    )
)

print("DQN config created")
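
The exploration settings above describe an epsilon that decays from 1.0 to 0.02 over 10,000 timesteps. The sketch below assumes a linear decay followed by a constant floor; `linear_epsilon` is a hypothetical helper written for illustration and is not part of RLlib.

```python
def linear_epsilon(timestep, initial_epsilon=1.0, final_epsilon=0.02,
                   epsilon_timesteps=10_000):
    """Linearly anneal epsilon, then hold it at final_epsilon (assumed schedule)."""
    # Fraction of the annealing period that has elapsed, clamped to [0, 1].
    fraction = min(max(timestep / epsilon_timesteps, 0.0), 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


for t in (0, 5_000, 10_000, 20_000):
    print(t, round(linear_epsilon(t), 3))
```

Early in training the agent acts almost entirely at random; once the annealing period ends, it still takes a random action about 2% of the time.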

## 2. PPO (Proximal Policy Optimization)

**Type**: Actor-Critic, On-policy

**Best for**: General-purpose, continuous & discrete actions, robustness

**Key features**:
- Clipped objective prevents large policy updates
- Stable and reliable across many tasks
- Usually needs little hyperparameter tuning to get a working baseline
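
The clipped objective can be illustrated for a single sample. This is a textbook-style sketch, assuming `ratio` is the new-to-old policy probability ratio for the action taken; the function name `clipped_surrogate` is made up for this example.

```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """PPO clipped surrogate objective for one sample (to be maximized)."""
    unclipped = ratio * advantage
    # Restrict the probability ratio to [1 - clip_param, 1 + clip_param].
    clipped_ratio = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)
    clipped = clipped_ratio * advantage
    # Taking the minimum gives a pessimistic bound on the improvement.
    return min(unclipped, clipped)


# Positive advantage: the gain is capped once the ratio exceeds 1.2.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Negative advantage: the penalty is not capped when the ratio grows.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

Because the capped branch has zero gradient with respect to the policy, a sample stops pushing the policy further once its ratio has left the clipping range in the favorable direction.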

In [None]:
# PPO Configuration
ppo_config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=3e-4,
        gamma=0.99,
        train_batch_size=4000,
        # PPO specific
        sgd_minibatch_size=128,
        num_sgd_iter=10,
        clip_param=0.2,           # Clipping range
        vf_loss_coeff=0.5,        # Value function loss coefficient
        entropy_coeff=0.01,       # Entropy bonus for exploration
        use_gae=True,             # Generalized Advantage Estimation
        lambda_=0.95,             # GAE lambda
    )
)

print("PPO config created")
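
The `use_gae` and `lambda_` settings above refer to Generalized Advantage Estimation. A minimal sketch of the recursion for one trajectory segment follows; `compute_gae` and its plain-list inputs are assumptions for illustration, not RLlib's internal implementation.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lambda_=0.95):
    """Return GAE advantages for one trajectory segment.

    values[t] is the critic's estimate V(s_t); last_value is V(s_T) for the
    state after the final step (use 0.0 if the episode terminated there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Work backwards so each step can reuse the accumulated future terms.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lambda_ * running
        advantages[t] = running
        next_value = values[t]
    return advantages


print(compute_gae(rewards=[1.0, 1.0, 1.0], values=[0.5, 0.5, 0.5], last_value=0.0))
```

With `lambda_=0` each advantage reduces to the one-step TD error (low variance, more bias); with `lambda_=1` it becomes the discounted return minus the value estimate (less bias, more variance).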

## 3. SAC (Soft Actor-Critic)

**Type**: Actor-Critic, Off-policy

**Best for**: Continuous action spaces, sample efficiency

**Key features**:
- Maximum entropy framework (exploration is built into the objective)
- Sample efficient because it learns from replayed transitions
- Automatic tuning of the entropy temperature (alpha)

In [None]:
# SAC Configuration (for continuous action space)
sac_config = (
    SACConfig()
    .environment("Pendulum-v1")  # Continuous action space
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=3e-4,
        gamma=0.99,
        train_batch_size=256,
        # SAC specific
        tau=0.005,                # Soft (Polyak) target update coefficient
        initial_alpha=1.0,        # Initial entropy temperature (alpha)
        target_entropy="auto",    # Let RLlib choose the target entropy
        n_step=1,
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "capacity": 100000,
        },
    )
)

print("SAC config created")
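
The `tau` setting above controls how quickly the target network tracks the online network. The following sketch shows the soft (Polyak) update on plain lists of floats; `soft_update` is a hypothetical helper, and real implementations apply the same rule to every network parameter tensor.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [
        (1.0 - tau) * target + tau * online
        for target, online in zip(target_params, online_params)
    ]


target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(100):
    target = soft_update(target, online, tau=0.005)
print(target)
```

After `n` updates toward fixed online parameters, the remaining gap has shrunk by a factor of `(1 - tau) ** n`. This contrasts with the periodic hard copy that DQN's `target_network_update_freq` describes.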

## 4. A2C (Advantage Actor-Critic)

**Type**: Actor-Critic, On-policy

**Best for**: Simple problems, baseline for comparison

**Key features**:
- Synchronous version of A3C
- Simple and fast
- Good for understanding actor-critic methods

In [None]:
# A2C Configuration
a2c_config = (
    A2CConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=1e-3,
        gamma=0.99,
        train_batch_size=500,
        # A2C specific
        vf_loss_coeff=0.5,
        entropy_coeff=0.01,
        use_gae=True,
        lambda_=0.95,
    )
)

print("A2C config created")
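
The `vf_loss_coeff` and `entropy_coeff` settings above weight the three terms of a typical actor-critic loss. The sketch below is a generic textbook form for one sample with a categorical policy; `a2c_loss` is an illustrative name, and RLlib's exact loss may differ in scaling and details.

```python
import math


def a2c_loss(action_probs, action, advantage, value, value_target,
             vf_loss_coeff=0.5, entropy_coeff=0.01):
    """Combined actor-critic loss for one sample (to be minimized)."""
    # Policy gradient term: raise the log-probability of advantageous actions.
    policy_loss = -math.log(action_probs[action]) * advantage
    # Critic term: squared error between the value estimate and its target.
    value_loss = (value - value_target) ** 2
    # Entropy of the action distribution; subtracting it rewards exploration.
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0.0)
    return policy_loss + vf_loss_coeff * value_loss - entropy_coeff * entropy


loss = a2c_loss(action_probs=[0.5, 0.5], action=0, advantage=1.0,
                value=0.0, value_target=1.0)
print(round(loss, 4))
```

Raising `entropy_coeff` keeps the policy closer to uniform for longer, while `vf_loss_coeff` balances how strongly the critic's error influences shared network parameters.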

## Algorithm Selection Guide

| Scenario | Recommended Algorithm |
|----------|----------------------|
| Discrete actions, need sample efficiency | DQN |
| Continuous actions, sample efficiency critical | SAC |
| General purpose, stability important | PPO |
| Simple baseline | A2C |
| Multi-agent setting | PPO, QMIX |
| Offline RL (fixed dataset) | CQL, MARWIL |
| Image observations | DQN (with CNN), PPO (with CNN) |

## Comparing Algorithms

In [None]:
def train_and_evaluate(config, name, n_iters=20):
    """Train an algorithm and return its learning curve.

    The curve holds the mean episode reward reported after each iteration.
    """
    algo = config.build()
    rewards = []

    try:
        for i in range(n_iters):
            result = algo.train()
            reward = result["env_runners"]["episode_reward_mean"]
            rewards.append(reward)

            if (i + 1) % 5 == 0:
                print(f"{name} - Iter {i+1}: {reward:.2f}")
    finally:
        # Release the algorithm's workers even if training raises.
        algo.stop()

    return rewards

In [None]:
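# Compare PPO vs DQN vs A2C on CartPole
# Note: This may take several minutes
# Note: one training iteration consumes a different number of environment
# steps for each algorithm (PPO collects train_batch_size=4000 per iteration),
# so per-iteration curves do not compare sample efficiency directly.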

print("Training PPO...")
ppo_rewards = train_and_evaluate(ppo_config, "PPO", n_iters=20)

print("\nTraining DQN...")
dqn_rewards = train_and_evaluate(dqn_config, "DQN", n_iters=20)

print("\nTraining A2C...")
a2c_rewards = train_and_evaluate(a2c_config, "A2C", n_iters=20)

In [None]:
# Plot comparison
plt.figure(figsize=(10, 6))
plt.plot(ppo_rewards, label='PPO', linewidth=2)
plt.plot(dqn_rewards, label='DQN', linewidth=2)
plt.plot(a2c_rewards, label='A2C', linewidth=2)
plt.axhline(y=475, color='gray', linestyle='--', label='Solved')
plt.xlabel('Training Iteration')
plt.ylabel('Mean Episode Reward')
plt.title('Algorithm Comparison on CartPole-v1')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
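
Raw learning curves are noisy, so it can help to smooth them and to record when each curve first crosses the "solved" line. The helpers below are a small sketch in plain Python; their names and the toy reward list are made up for illustration and are not results from the runs above.

```python
def moving_average(values, window=5):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def first_iteration_reaching(values, threshold=475.0):
    """Return the 1-based iteration at which values first reach threshold, or None."""
    for i, value in enumerate(values):
        if value >= threshold:
            return i + 1
    return None


toy_rewards = [20.0, 60.0, 150.0, 480.0, 500.0]  # illustrative numbers only
print(moving_average(toy_rewards, window=2))
print(first_iteration_reaching(toy_rewards))
```

Passing `moving_average(ppo_rewards)` to `plt.plot` instead of the raw list gives a smoother curve without changing the rest of the plotting cell.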

## On-Policy vs Off-Policy

### On-Policy (PPO, A2C)
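- Learns only from data collected by the current policy
- More stable, but less sample efficient, because each batch is discarded after the update
- Scales well with parallel env runners, which supply the fresh rollouts every update needs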

### Off-Policy (DQN, SAC)
- Can reuse old experience (replay buffer)
- More sample efficient
- Can be less stable
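
The difference comes down to how collected transitions are stored and reused. Below is a minimal uniform replay buffer in plain Python, assuming transitions are arbitrary tuples; the `ReplayBuffer` class is a teaching sketch and not RLlib's replay buffer implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer that evicts the oldest transition when full."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement; old and new data are mixed.
        return random.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # (observation, action, reward) placeholders for illustration.
    buffer.add((step, 0, 1.0))

batch = buffer.sample(batch_size=2)
print(len(buffer), batch)
```

An off-policy learner can draw many such batches from the same stored data, whereas an on-policy learner would use each rollout for one update phase and then throw it away.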

In [None]:
# Demonstrate off-policy sample efficiency with SAC on Pendulum
print("Training SAC on Pendulum-v1...")

sac_algo = sac_config.build()
sac_rewards = []

for i in range(30):
    result = sac_algo.train()
    reward = result["env_runners"]["episode_reward_mean"]
    sac_rewards.append(reward)
    
    if (i + 1) % 10 == 0:
        print(f"SAC Iter {i+1}: {reward:.2f}")

sac_algo.stop()

plt.figure(figsize=(10, 6))
plt.plot(sac_rewards, linewidth=2)
plt.xlabel('Training Iteration')
plt.ylabel('Mean Episode Reward')
plt.title('SAC on Pendulum-v1 (Continuous Action Space)')
plt.grid(True, alpha=0.3)
plt.show()

## Additional Algorithms in RLlib

RLlib has shipped many more algorithms; which of them are available in core RLlib depends on the Ray version:

### Model-Free
- **IMPALA**: Distributed actor-critic with V-trace
- **APEX-DQN**: Distributed DQN with prioritized experience replay
- **TD3**: Twin Delayed DDPG (continuous actions)
- **DDPG**: Deep Deterministic Policy Gradient

### Multi-Agent
- **QMIX**: Q-value mixing for cooperative agents
- **MADDPG**: Multi-agent DDPG

### Offline RL
- **CQL**: Conservative Q-Learning
- **MARWIL**: Monotonic Advantage Re-Weighted Imitation Learning

### Model-Based
- **Dreamer**: World model learning

## Key Takeaways

1. **PPO** is a safe default choice for most problems

2. **DQN** excels with discrete actions and when sample efficiency matters

3. **SAC** is a strong choice for continuous control, since its entropy regularization builds exploration into the objective

4. **Off-policy** methods are more sample efficient but can be less stable

## Next Steps

In the next section, we'll learn how to create custom environments for RLlib.

In [None]:
ray.shutdown()