# RLlib: From Ray Core to Reinforcement Learning

**Prerequisites**: Complete [00_ray_core](../00_ray_core/01_ray_core_fundamentals.ipynb) first!

Remember the three Ray primitives? RLlib uses ALL of them:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         HOW RLLIB USES RAY CORE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   RAY CORE PRIMITIVE          RLLIB USAGE                                   │
│   ──────────────────          ───────────                                   │
│                                                                             │
│   @ray.remote                 EnvRunners: Actors that run environments      │
│   class Actor                 and collect experience in parallel            │
│                                                                             │
│   @ray.remote                 Training updates: Gradient computation        │
│   def task()                  runs as distributed tasks                     │
│                                                                             │
│   ray.put()                   Policy weights: Stored in object store,       │
│   Object Store                broadcast to all workers efficiently          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## The Problem: RL Training is Slow

Reinforcement learning needs MILLIONS of environment interactions. One environment is too slow:

```
SEQUENTIAL RL (slow)                     PARALLEL RL with RLlib (fast)
────────────────────                     ─────────────────────────────

┌──────────────────┐                     ┌──────────────────┐
│   Environment    │                     │   Env 1 (Worker) │ ──┐
│   ┌──┐           │                     │   ┌──┐           │   │
│   │  │ step      │                     │   │  │ step      │   │
│   └──┘           │                     │   └──┘           │   │
│   ┌──┐           │                     └──────────────────┘   │
│   │  │ step      │                     ┌──────────────────┐   │  experience
│   └──┘           │                     │   Env 2 (Worker) │   │  in parallel
│   ┌──┐           │                     │   ┌──┐           │   │
│   │  │ step      │                     │   │  │ step      │ ──┼──> Policy
│   └──┘           │                     │   └──┘           │   │    Update
│   ┌──┐           │                     └──────────────────┘   │
│   │  │ step      │                     ┌──────────────────┐   │
│   └──┘           │                     │   Env 3 (Worker) │   │
│   ...            │                     │   ┌──┐           │   │
│                  │                     │   │  │ step      │ ──┘
│   1000 steps/sec │                     │   └──┘           │
└──────────────────┘                     └──────────────────┘
                                         
                                         10,000+ steps/sec!
```

**RLlib = Ray + RL algorithms.** It handles all the parallelization for you.

---

## Installation

```bash
# Using uv (recommended - fast!)
uv pip install "ray[rllib]" gymnasium torch

# Or with pip
pip install "ray[rllib]" gymnasium torch
```

In [1]:
import warnings
import logging
import os

warnings.filterwarnings("ignore")
logging.getLogger("ray").setLevel(logging.ERROR)

import ray
from ray.rllib.algorithms.ppo import PPOConfig
import gymnasium as gym
import numpy as np

print(f"Ray version: {ray.__version__}")

Ray version: 2.53.0


In [2]:
# Initialize Ray (same as Module 00!)
ray.init(
    num_cpus=4,
    object_store_memory=1 * 1024 * 1024 * 1024,
    ignore_reinit_error=True,
)

print(f"Ray resources: {ray.cluster_resources()}")

2026-02-01 22:42:51,564	INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at [1m[32mhttp://127.0.0.1:8267 [39m[22m


Ray resources: {'node:127.0.0.1': 1.0, 'node:__internal_head__': 1.0, 'memory': 10002006016.0, 'CPU': 4.0, 'object_store_memory': 1073741824.0}


---

## RLlib Architecture: What Happens When You Train

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RLLIB TRAINING LOOP                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  STEP 1: COLLECT EXPERIENCE (parallel)                                      │
│  ─────────────────────────────────────                                      │
│                                                                             │
│    ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐          │
│    │  EnvRunner 1    │   │  EnvRunner 2    │   │  EnvRunner 3    │          │
│    │  (Ray Actor)    │   │  (Ray Actor)    │   │  (Ray Actor)    │          │
│    │                 │   │                 │   │                 │          │
│    │  ┌───────────┐  │   │  ┌───────────┐  │   │  ┌───────────┐  │          │
│    │  │ CartPole  │  │   │  │ CartPole  │  │   │  │ CartPole  │  │          │
│    │  └───────────┘  │   │  └───────────┘  │   │  └───────────┘  │          │
│    │       │         │   │       │         │   │       │         │          │
│    │  state -> policy -> action -> reward  │   │       ...       │          │
│    │       │         │   │       │         │   │       │         │          │
│    │  [experience]   │   │  [experience]   │   │  [experience]   │          │
│    └────────┬────────┘   └────────┬────────┘   └────────┬────────┘          │
│             │                     │                     │                   │
│             └─────────────────────┴─────────────────────┘                   │
│                                   │                                         │
│                                   ▼                                         │
│                    ┌──────────────────────────┐                             │
│                    │  Object Store            │  <- Ray Core!               │
│                    │  [batched experience]    │                             │
│                    └──────────────────────────┘                             │
│                                   │                                         │
│  STEP 2: UPDATE POLICY            │                                         │
│  ─────────────────────            ▼                                         │
│                    ┌──────────────────────────┐                             │
│                    │  Learner                 │                             │
│                    │  - Compute loss          │                             │
│                    │  - Backprop gradients    │                             │
│                    │  - Update neural network │                             │
│                    └──────────────────────────┘                             │
│                                   │                                         │
│  STEP 3: BROADCAST NEW WEIGHTS    │                                         │
│  ─────────────────────────────    ▼                                         │
│                    ┌──────────────────────────┐                             │
│                    │  Object Store            │  <- Ray Core!               │
│                    │  [new policy weights]    │                             │
│                    └──────────────────────────┘                             │
│                                   │                                         │
│             ┌─────────────────────┼─────────────────────┐                   │
│             │                     │                     │                   │
│             ▼                     ▼                     ▼                   │
│    ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐          │
│    │  EnvRunner 1    │   │  EnvRunner 2    │   │  EnvRunner 3    │          │
│    │  (gets new      │   │  (gets new      │   │  (gets new      │          │
│    │   weights)      │   │   weights)      │   │   weights)      │          │
│    └─────────────────┘   └─────────────────┘   └─────────────────┘          │
│                                                                             │
│  REPEAT until converged!                                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## CartPole: The "Hello World" of RL

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               CARTPOLE                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│                                  │                                          │
│                                  │  <- pole (keep upright!)                 │
│                                  │                                          │
│                               ┌──┴──┐                                       │
│                               │     │  <- cart                              │
│        ◄── push left          │     │          push right ──►               │
│                               └─────┘                                       │
│        ════════════════════════════════════════════════                     │
│                                                                             │
│   STATE (4 numbers):           ACTIONS (2 choices):                         │
│   ──────────────────           ────────────────────                         │
│   - Cart position              - 0: Push left                               │
│   - Cart velocity              - 1: Push right                              │
│   - Pole angle                                                              │
│   - Pole angular velocity                                                   │
│                                                                             │
│   REWARD: +1 for each step you keep the pole upright                        │
│   GOAL:   Reach 500 steps (max score)                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

In [3]:
env = gym.make("CartPole-v1")

print("CartPole Environment")
print("=" * 50)
print(f"State space:  {env.observation_space}")  # 4 numbers
print(f"Action space: {env.action_space}")       # 2 choices (left/right)

# Run one episode with RANDOM actions
state, _ = env.reset()
print(f"\nInitial state: {state}")
print("               [cart_pos, cart_vel, pole_angle, pole_vel]")

total_reward = 0
for step in range(500):
    action = env.action_space.sample()  # Random action!
    state, reward, done, _, _ = env.step(action)
    total_reward += reward
    if done:
        break

print(f"\nRandom policy: {int(total_reward)} steps before falling")
print("Goal: Learn to reach 500 steps!")
env.close()

CartPole Environment
State space:  Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)
Action space: Discrete(2)

Initial state: [ 0.01545564 -0.04757685 -0.02987708  0.02439807]
               [cart_pos, cart_vel, pole_angle, pole_vel]

Random policy: 16 steps before falling
Goal: Learn to reach 500 steps!


---

## PPO: The Go-To Algorithm

**PPO (Proximal Policy Optimization)** is the most popular RL algorithm. Why?

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                               WHY PPO?                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. STABLE: Clips updates to prevent destroying your policy                 │
│                                                                             │
│     Without clipping:              With PPO clipping:                       │
│     ┌────────────────┐             ┌────────────────┐                       │
│     │      /\        │             │         ___    │                       │
│     │     /  \crash! │             │        /   \   │  (stable!)            │
│     │    /    \____  │             │     __/     \__│                       │
│     │   /            │             │    /           │                       │
│     └────────────────┘             └────────────────┘                       │
│                                                                             │
│  2. SAMPLE EFFICIENT: Reuses data multiple times                            │
│                                                                             │
│     Collect 4000 steps ──► Train 10 epochs on same data                     │
│                                                                             │
│  3. SIMPLE: Fewer hyperparameters than other algorithms                     │
│                                                                             │
│  4. GENERAL: Works on discrete AND continuous action spaces                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Configuring PPO: The Builder Pattern

RLlib uses a **builder pattern** - chain method calls to configure:

```python
config = (
    PPOConfig()              # Start with defaults
    .environment(...)        # What to train on
    .framework(...)          # PyTorch or TensorFlow
    .env_runners(...)        # Parallelization
    .training(...)           # Algorithm settings
)
```

In [4]:
# Configure PPO step by step
#
# Each .method() returns self, so you can chain them

config = (
    PPOConfig()
    
    # 1. ENVIRONMENT: What are we training on?
    .environment("CartPole-v1")
    
    # 2. FRAMEWORK: PyTorch (recommended this is what everyone uses) or TensorFlow
    .framework("torch")
    
    # 3. ENV_RUNNERS: How many parallel environments?
    #    Each EnvRunner is a Ray Actor!
    #    More workers = faster data collection
    .env_runners(
        num_env_runners=2,         # 2 parallel workers
        num_envs_per_env_runner=1, # 1 env per worker
    )
    
    # 4. TRAINING: PPO hyperparameters
    .training(
        lr=0.0003,             # Learning rate
        gamma=0.99,            # Discount factor
        train_batch_size=4000, # Steps per update
    )
)

### What Do These Settings Mean?

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        KEY HYPERPARAMETERS                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  num_env_runners = 2                                                        │
│  ───────────────────                                                        │
│  How many Ray Actors collect experience in parallel?                        │
│                                                                             │
│       EnvRunner 1 ──► 2000 steps ──┐                                        │
│                                    ├──► 4000 steps total                    │
│       EnvRunner 2 ──► 2000 steps ──┘                                        │
│                                                                             │
│  More workers = faster, but uses more CPU                                   │
│                                                                             │
│  ───────────────────────────────────────────────────────────────────────    │
│                                                                             │
│  train_batch_size = 4000                                                    │
│  ───────────────────────                                                    │
│  How many steps before updating the policy?                                 │
│                                                                             │
│  Bigger = more stable but slower. Smaller = faster but noisier.             │
│                                                                             │
│  ───────────────────────────────────────────────────────────────────────    │
│                                                                             │
│  lr = 0.0003 (learning rate)                                                │
│  ───────────────────────────                                                │
│  How big are the policy updates?                                            │
│                                                                             │
│  Too high = unstable training                                               │
│  Too low  = slow learning                                                   │
│                                                                             │
│  ───────────────────────────────────────────────────────────────────────    │
│                                                                             │
│  gamma = 0.99 (discount factor)                                             │
│  ──────────────────────────────                                             │
│  How much to value FUTURE rewards vs IMMEDIATE rewards?                     │
│                                                                             │
│  gamma = 0.99: Future rewards matter (look ~100 steps ahead)                │
│  gamma = 0.9:  Focus on near-term (~10 steps ahead)                         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

In [5]:
# Build the algorithm
#
# This creates:
#   - 2 EnvRunner actors (Ray Actors running CartPole)
#   - 1 Learner (updates the neural network)
#   - The policy neural network

print("Building algorithm...")
print("  (This creates Ray Actors for parallel environment execution)")
algo = config.build_algo()
print("Done!")



Building algorithm...
  (This creates Ray Actors for parallel environment execution)


`UnifiedLogger` will be removed in Ray 2.7.
  return UnifiedLogger(config, logdir, loggers=None)
The `JsonLogger interface is deprecated in favor of the `ray.tune.json.JsonLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `CSVLogger interface is deprecated in favor of the `ray.tune.csv.CSVLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
The `TBXLogger interface is deprecated in favor of the `ray.tune.tensorboardx.TBXLoggerCallback` interface and will be removed in Ray 2.7.
  self._loggers.append(cls(self.config, self.logdir, self.trial))
[2026-02-01 22:42:53,054 E 76571 393559] core_worker.cc:2223: Actor with class name: 'SingleAgentEnvRunner' and ID: '5557ca47576bea403769754701000000' has constructor arguments in the object store and max_restarts > 0. If the arguments in the object store go out of scope or are lost, the actor restart 

Done!


---

## Training: Watch It Learn!

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      WHAT algo.train() DOES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   1. Workers collect 4000 steps of (state, action, reward) tuples           │
│                         │                                                   │
│                         ▼                                                   │
│   2. Data sent to Learner via Object Store                                  │
│                         │                                                   │
│                         ▼                                                   │
│   3. Learner computes loss and updates neural network                       │
│                         │                                                   │
│                         ▼                                                   │
│   4. New weights broadcast to all workers                                   │
│                         │                                                   │
│                         ▼                                                   │
│   5. Returns dict with metrics (rewards, loss, etc.)                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

In [None]:
# Train for 10 iterations
#
# Each iteration:
#   - Collects train_batch_size steps (4000)
#   - Updates the policy
#   - Returns metrics

print("Training PPO on CartPole")
print("=" * 60)
print(f"{'Iter':>4} | {'Mean Reward':>12} | {'Max Reward':>11} | {'Episodes':>10}")
print("-" * 60)

for i in range(5):
    # Train one iteration
    result = algo.train()
    
    # Get metrics from result dict
    # New API: 'episode_return_mean' (old was 'episode_reward_mean')
    mean_reward = result["env_runners"]["episode_return_mean"]
    max_reward = result["env_runners"]["episode_return_max"]
    episodes = result["env_runners"]["num_episodes"]
    
    print(f"{i+1:>4} | {mean_reward:>12.1f} | {max_reward:>11.1f} | {episodes:>10}")

print("-" * 60)
print(f"Final mean reward: {mean_reward:.1f}")
print(f"(CartPole is 'solved' at 475+)")

Training PPO on CartPole
Iter |  Mean Reward |  Max Reward |   Episodes
------------------------------------------------------------


---

## Using the Trained Policy

Now let's use our trained agent!

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            INFERENCE                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   state = env.reset()                                                       │
│         │                                                                   │
│         ▼                                                                   │
│   action = algo.compute_single_action(state)  # Policy inference            │
│         │                                                                   │
│         ▼                                                                   │
│   state, reward, done = env.step(action)                                    │
│         │                                                                   │
│         └──────────── repeat until done ──────────────┘                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

In [None]:
import torch

env = gym.make("CartPole-v1")

# Get the trained RLModule
module = algo.get_module()

print("Evaluating trained policy")
print("=" * 40)

eval_rewards = []
for episode in range(5):
    state, _ = env.reset()
    total_reward = 0
    
    while True:
        # Convert state to tensor and get action from trained policy
        state_tensor = torch.tensor([state], dtype=torch.float32)
        with torch.no_grad():
            output = module.forward_inference({"obs": state_tensor})
            # For discrete actions, get the action with highest probability
            action = output["action_dist_inputs"].argmax(dim=-1).item()
        
        state, reward, done, _, _ = env.step(action)
        total_reward += reward
        
        if done:
            break
    
    eval_rewards.append(total_reward)
    print(f"Episode {episode+1}: {int(total_reward)} steps")

print("-" * 40)
print(f"Mean: {np.mean(eval_rewards):.1f} steps")
print(f"\nCompare to random policy: ~20 steps")

env.close()

Evaluating trained policy
Episode 1: 695 steps
Episode 2: 1362 steps
Episode 3: 473 steps
Episode 4: 500 steps
Episode 5: 334 steps
----------------------------------------
Mean: 672.8 steps

Compare to random policy: ~20 steps


---

## Detailed Single Run: See the Agent in Action

Let's watch one complete episode step-by-step to see exactly what the trained agent is doing:

In [None]:
from IPython.display import HTML
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# Create environment with rgb_array rendering
env = gym.make("CartPole-v1", render_mode="rgb_array")
state, _ = env.reset()

frames = []
total_reward = 0

print("Running episode and capturing frames...")

for step in range(500):
    # Capture frame
    frame = env.render()
    frames.append(frame)
    
    # Get action from trained policy
    state_tensor = torch.tensor([state], dtype=torch.float32)
    with torch.no_grad():
        output = module.forward_inference({"obs": state_tensor})
        action = output["action_dist_inputs"].argmax(dim=-1).item()
    
    # Take action
    state, reward, done, _, _ = env.step(action)
    total_reward += reward
    
    if done:
        # Capture final frame
        frames.append(env.render())
        break

env.close()

print(f"Episode finished: {int(total_reward)} steps")
print(f"Captured {len(frames)} frames")

# Create animation
fig, ax = plt.subplots(figsize=(6, 4))
ax.axis('off')
img = ax.imshow(frames[0])

def animate(i):
    img.set_array(frames[i])
    return [img]

# Create animation (show every 2nd frame to speed it up)
anim = animation.FuncAnimation(
    fig, 
    animate, 
    frames=range(0, len(frames), 2),  # Skip frames for speed
    interval=50,  # 50ms between frames
    blit=True
)

plt.close()  # Prevent static display

# Display as HTML5 video
HTML(anim.to_jshtml())

Running episode and capturing frames...
Episode finished: 500 steps
Captured 500 frames


---

## Saving and Loading Checkpoints

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        CHECKPOINT WORKFLOW                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   SAVE                               LOAD                                   │
│   ────                               ────                                   │
│   path = algo.save()                 algo = Algorithm.from_checkpoint(path) │
│         │                                    │                              │
│         ▼                                    ▼                              │
│   checkpoint/                        Restored algorithm:                    │
│   ├── algorithm_state.pkl            - Same policy weights                  │
│   ├── rllib_checkpoint.json          - Same config                          │
│   └── policies/                      - Ready to train more or evaluate      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

In [None]:
# Save checkpoint
checkpoint_path = algo.save()
print(f"Checkpoint saved to: {checkpoint_path}")

# To load later:
#
# from ray.rllib.algorithms.algorithm import Algorithm
# restored_algo = Algorithm.from_checkpoint(checkpoint_path)
# action = restored_algo.compute_single_action(state)

Checkpoint saved to: TrainingResult(checkpoint=Checkpoint(filesystem=local, path=/var/folders/ts/3v49jkvx06z91kc356z9c41r0000gn/T/tmpmz09f4wu), metrics={'timers': {'training_iteration': 5.662603166998451, 'restore_env_runners': 1.862499993876554e-05, 'training_step': 5.66246291599964, 'env_runner_sampling_timer': 0.6978446659995825, 'learner_update_timer': 4.961909957999524, 'synch_weights': 0.002381500000410597, 'synch_env_connectors': 0.002458416998706525}, 'env_runners': {'env_to_module_connector': {'connector_pipeline_timer': np.float64(6.031488154228519e-05), 'timers': {'connectors': {'add_states_from_episodes_to_batch': np.float64(8.117834481433668e-07), 'add_observations_from_episodes_to_batch': np.float64(3.3765500309750846e-06), 'add_time_dim_to_batch_and_zero_pad': np.float64(9.94349102162394e-07), 'numpy_to_tensor': np.float64(1.2594524458906535e-05), 'batch_individual_items': np.float64(7.735972699250877e-06)}}}, 'episode_len_mean': np.float64(324.66666666666663), 'module_t

In [None]:
# Clean up
algo.stop()

---

## Summary: The RLlib Pattern

```python
# 1. CONFIGURE
config = (
    PPOConfig()
    .environment("MyEnv-v1")
    .framework("torch")
    .env_runners(num_env_runners=4)
    .training(lr=3e-4, train_batch_size=4000)
)

# 2. BUILD
algo = config.build_algo()

# 3. TRAIN
for i in range(100):
    result = algo.train()
    print(result["env_runners"]["episode_return_mean"])

# 4. USE
action = algo.compute_single_action(state)

# 5. SAVE
algo.save("checkpoint")
```

### Quick Reference

| Task | Code |
|------|------|
| Configure | `PPOConfig().environment(...).training(...)` |
| Build | `algo = config.build_algo()` |
| Train one iteration | `result = algo.train()` |
| Get mean reward | `result["env_runners"]["episode_return_mean"]` |
| Get action | `algo.compute_single_action(state)` |
| Save | `algo.save(path)` |
| Load | `Algorithm.from_checkpoint(path)` |
| Stop | `algo.stop()` |

---

## What's Next?

```
┌─────────────────────────┐          ┌─────────────────────────┐
│  02.1 RLlib Setup       │   ───>   │  02.2 Algorithms        │
│  (you are here)         │          │                         │
│                         │          │  - PPO vs DQN vs SAC    │
│  - How RLlib works      │          │  - When to use each     │
│  - Config API           │          │  - Comparison           │
│  - Train & evaluate     │          │                         │
└─────────────────────────┘          └─────────────────────────┘

Full path: Custom environments (03) → Distributed (05) → Robotics (09)!
```

In [None]:
ray.shutdown()