# 5.1 Scaling RLlib Training

## Learning Objectives
- Understand RLlib's distributed architecture
- Configure multi-GPU and multi-node training
- Optimize resource utilization
- Use Ray Cluster for large-scale training

## RLlib Distributed Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                        Driver (Trainer)                       │
│                                                              │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐        │
│  │   Policy    │   │  Learner    │   │   Replay    │        │
│  │   (GPU)     │   │  (GPU)      │   │   Buffer    │        │
│  └─────────────┘   └─────────────┘   └─────────────┘        │
└──────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
       ┌──────────┐    ┌──────────┐    ┌──────────┐
       │ EnvRunner│    │ EnvRunner│    │ EnvRunner│
       │  (CPU)   │    │  (CPU)   │    │  (CPU)   │
       │  ┌───┐   │    │  ┌───┐   │    │  ┌───┐   │
       │  │Env│   │    │  │Env│   │    │  │Env│   │
       │  └───┘   │    │  └───┘   │    │  └───┘   │
       └──────────┘    └──────────┘    └──────────┘
```

Key components:
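- **Driver**: Runs the `Algorithm` object (called `Trainer` before Ray 2.0), coordinates sampling and training, and manages the policy
- **Learner**: Performs gradient updates (can use GPU)
- **EnvRunners**: Step environments in parallel and collect experience (CPU workers)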
- **Replay Buffer**: Stores experience (for off-policy algorithms)
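
To make the data flow between these components concrete, here is a minimal, library-free sketch of one synchronous training iteration. It is not RLlib code: `env_runner_sample`, `learner_update`, and `driver_iteration` are hypothetical stand-ins, threads replace Ray actors, and the "policy" is a single float.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def env_runner_sample(runner_id, fragment_length, seed=0):
    """Stand-in for an EnvRunner: return a fragment of fake (obs, reward) steps."""
    rng = random.Random(seed + runner_id)
    return [(rng.random(), 1.0) for _ in range(fragment_length)]


def learner_update(weights, batch, lr=0.01):
    """Stand-in for a Learner: nudge a scalar 'weight' toward the mean reward."""
    mean_reward = sum(reward for _, reward in batch) / len(batch)
    return weights + lr * (mean_reward - weights)


def driver_iteration(weights, num_env_runners=3, fragment_length=100):
    """One synchronous iteration: sample in parallel, then do a single update."""
    with ThreadPoolExecutor(max_workers=num_env_runners) as pool:
        fragments = list(pool.map(
            lambda runner_id: env_runner_sample(runner_id, fragment_length),
            range(num_env_runners),
        ))
    # The driver concatenates all fragments into one training batch.
    batch = [step for fragment in fragments for step in fragment]
    return learner_update(weights, batch), len(batch)


weights, batch_size = driver_iteration(weights=0.0)
print(f"batch size: {batch_size}, updated weight: {weights:.4f}")
```

The point of the sketch is the shape of the loop: sampling parallelizes across runners, while the update happens once per collected batch.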

In [None]:
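import time

import ray
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.impala import IMPALAConfig

# APEX-DQN may not be importable from core RLlib in recent Ray releases,
# so guard the import instead of failing the whole notebook.
try:
    from ray.rllib.algorithms.apex_dqn import ApexDQNConfig
except ImportError:
    ApexDQNConfig = None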

# Initialize Ray with resource specification
ray.init(
    ignore_reinit_error=True,
    # Uncomment for specific resource allocation:
    # num_cpus=8,
    # num_gpus=1,
)

print(f"Available resources: {ray.cluster_resources()}")

## Scaling with Multiple Workers

In [None]:
def benchmark_workers(num_workers_list, n_iters=5, train_batch_size=4000):
    """Benchmark training speed with different worker counts."""
    results = {}

    for num_workers in num_workers_list:
        config = (
            PPOConfig()
            .environment("CartPole-v1")
            .framework("torch")
            .env_runners(
                num_env_runners=num_workers,
                num_envs_per_env_runner=1,
            )
            .training(
                train_batch_size=train_batch_size,
                minibatch_size=128,  # named sgd_minibatch_size in older Ray releases
            )
        )

        algo = config.build()
        try:
            # Warm up (excludes actor startup and first-iteration overhead)
            algo.train()

            # Benchmark
            start = time.perf_counter()
            for _ in range(n_iters):
                algo.train()
            elapsed = time.perf_counter() - start
        finally:
            algo.stop()  # Release workers even if training fails

        # Approximation: assumes each iteration consumes train_batch_size env steps
        samples_per_sec = (n_iters * train_batch_size) / elapsed
        results[num_workers] = samples_per_sec
        print(f"Workers: {num_workers}, Samples/sec: {samples_per_sec:.0f}")

    return results

# Benchmark with different worker counts
# results = benchmark_workers([1, 2, 4], n_iters=3)
print("Benchmark example (uncomment to run)")
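
Raw samples/sec is easier to interpret as speedup and parallel efficiency relative to the smallest worker count. The helper below is a small sketch that takes a dictionary shaped like the one `benchmark_workers` returns; `scaling_report` is a hypothetical name, and the throughput numbers are made up for illustration, not measured.

```python
def scaling_report(samples_per_sec):
    """Return {num_workers: (speedup, efficiency)} relative to the smallest worker count."""
    base_workers = min(samples_per_sec)
    base_rate = samples_per_sec[base_workers]
    report = {}
    for workers, rate in sorted(samples_per_sec.items()):
        speedup = rate / base_rate
        ideal_speedup = workers / base_workers
        report[workers] = (speedup, speedup / ideal_speedup)
    return report


# Hypothetical throughput numbers, for illustration only (not measured).
example_results = {1: 1000.0, 2: 1800.0, 4: 3000.0}
for workers, (speedup, efficiency) in scaling_report(example_results).items():
    print(f"{workers} workers: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Efficiency well below 100% usually suggests that something other than sampling (for example the learner update) is the bottleneck, so adding more workers yields diminishing returns.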

## GPU Configuration

In [None]:
# Single GPU configuration
single_gpu_config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(
        num_env_runners=4,
        num_cpus_per_env_runner=1,
    )
    .resources(
        num_gpus=1,  # GPU for training
        num_cpus_for_main_process=1,
    )
    .training(
        train_batch_size=4000,
        model={
            "fcnet_hiddens": [256, 256],
        },
    )
)

print("Single GPU config created")

In [None]:
# Multi-GPU with Learner API (RLlib 2.x)
multi_gpu_config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(
        num_env_runners=8,
    )
    .learners(
        num_learners=2,        # Number of learner workers
        num_gpus_per_learner=1,  # GPU per learner
    )
    .training(
        train_batch_size=8000,
        minibatch_size=256,
    )
)

print("Multi-GPU config created")

## IMPALA: Scalable Distributed Training

IMPALA (Importance Weighted Actor-Learner Architecture) is designed for massive scale:
- Asynchronous actors and learners
- V-trace for off-policy correction
- Can scale to thousands of workers
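
The V-trace correction can be sketched in a few lines of plain Python. This is an illustrative single-trajectory version of the recursion from the IMPALA paper, not RLlib's implementation: `vtrace_targets` is a hypothetical helper, `ratios` holds the per-step importance ratios (target policy probability divided by behavior policy probability), and episode boundaries are ignored for brevity. Here `clip_rho` plays the role of the rho clipping threshold and `clip_c` is the trace-cutting clip.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, clip_rho=1.0, clip_c=1.0):
    """Compute V-trace value targets for one trajectory (no episode boundaries)."""
    n = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0  # running value of (v_{t+1} - V(x_{t+1}))
    for t in reversed(range(n)):
        rho = min(clip_rho, ratios[t])  # clipped weight on the TD error
        c = min(clip_c, ratios[t])      # clipped trace-cutting coefficient
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]

# On-policy data (all ratios 1): targets reduce to bootstrapped n-step returns.
on_policy = vtrace_targets(rewards, values, 0.0, [1.0, 1.0, 1.0], gamma=1.0)
# Off-policy data: ratios below 1 shrink the correction toward the current values.
off_policy = vtrace_targets(rewards, values, 0.0, [0.5, 0.5, 0.5], gamma=1.0)
print("on-policy targets: ", on_policy)
print("off-policy targets:", off_policy)
```

With ratios of zero the targets collapse to the current value estimates, which is how V-trace discounts experience that the current policy would no longer generate.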

In [None]:
impala_config = (
    IMPALAConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(
        num_env_runners=4,
        num_envs_per_env_runner=5,  # Vectorized environments
    )
    .resources(
        num_gpus=1,
    )
    .training(
        lr=5e-4,
        train_batch_size=500,
        # V-trace parameters
        vtrace=True,
        vtrace_clip_rho_threshold=1.0,
        vtrace_clip_pg_rho_threshold=1.0,
    )
)

# Train IMPALA
# algo = impala_config.build()
# result = algo.train()
print("IMPALA config created")

## APEX-DQN: Distributed Experience Replay

In [None]:
if ApexDQNConfig is None:
    # Guard for Ray versions where APEX-DQN is not importable from core RLlib.
    print("ApexDQNConfig is not available in this Ray installation; skipping")
else:
    apex_config = (
        ApexDQNConfig()
        .environment("CartPole-v1")
        .framework("torch")
        .env_runners(
            num_env_runners=4,
        )
        .resources(
            num_gpus=1,
        )
        .training(
            lr=5e-4,
            n_step=3,
            # Apex-specific: prioritized replay
            replay_buffer_config={
                "type": "MultiAgentPrioritizedReplayBuffer",
                "capacity": 100000,
                "prioritized_replay_alpha": 0.6,
                "prioritized_replay_beta": 0.4,
            },
        )
    )

    print("APEX-DQN config created")
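
The `prioritized_replay_alpha` and `prioritized_replay_beta` settings above control how strongly sampling favors high-priority transitions and how much that bias is corrected. The sketch below shows the standard prioritized-replay formulas in plain Python; it is not RLlib's buffer implementation, and `sampling_probabilities` and `importance_weights` are hypothetical helper names.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha=0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalized so the largest weight is 1."""
    n = len(probabilities)
    raw = [(n * p) ** (-beta) for p in probabilities]
    max_weight = max(raw)
    return [w / max_weight for w in raw]


# Hypothetical priorities (e.g. absolute TD errors) for four stored transitions.
priorities = [0.1, 1.0, 2.0, 4.0]
probs = sampling_probabilities(priorities)
weights = importance_weights(probs)

rng = random.Random(0)
batch_indices = rng.choices(range(len(priorities)), weights=probs, k=8)
print("sampling probabilities:", [round(p, 3) for p in probs])
print("importance weights:    ", [round(w, 3) for w in weights])
print("sampled indices:       ", batch_indices)
```

Transitions with higher priority are sampled more often but receive smaller importance weights, so their larger share of the batch does not dominate the gradient.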

## Ray Cluster Setup

For multi-node training, you need a Ray cluster.

### Option 1: Ray Cluster Launcher

In [None]:
# cluster_config.yaml example:
cluster_yaml = """
cluster_name: rllib-cluster

max_workers: 4

provider:
    type: aws
    region: us-west-2

auth:
    ssh_user: ubuntu

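available_node_types:
    ray.head.default:
        resources: {}
        node_config:
            InstanceType: m5.2xlarge
            ImageId: ami-0a2363a9cff180a64  # Ray AMI
    ray.worker.default:
        min_workers: 0
        max_workers: 4
        resources: {}
        node_config:
            InstanceType: m5.2xlarge
            ImageId: ami-0a2363a9cff180a64

head_node_type: ray.head.default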

setup_commands:
    - pip install "ray[rllib]" torch
"""

print("Example Ray cluster config")
print("Commands:")
print("  ray up cluster_config.yaml    # Start cluster")
print("  ray submit cluster_config.yaml train.py  # Submit job")
print("  ray down cluster_config.yaml  # Stop cluster")

### Option 2: Kubernetes with KubeRay

In [None]:
# RayCluster Kubernetes manifest example
kuberay_yaml = """
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: rllib-cluster
spec:
  rayVersion: '2.9.0'
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.9.0
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: "1"
  workerGroupSpecs:
  - groupName: workers
    replicas: 4
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.9.0
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"
"""

print("KubeRay cluster manifest example")

## Optimizing Throughput

In [None]:
# High-throughput configuration
high_throughput_config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    
    # Maximize parallelism
    .env_runners(
        num_env_runners=16,           # Many workers
        num_envs_per_env_runner=5,    # Vectorized envs per worker
        rollout_fragment_length=200,  # Steps before sending data
        sample_timeout_s=60,
    )
    
    # GPU training
    .resources(
        num_gpus=1,
        num_cpus_for_main_process=1,
    )
    
    # Large batches for GPU efficiency
    .training(
        train_batch_size=16000,       # Large batch
        minibatch_size=4096,          # GPU-friendly minibatch (sgd_minibatch_size in older Ray releases)
        num_epochs=10,                # SGD passes per batch (num_sgd_iter in older Ray releases)
        lr=3e-4,
    )
)

print("High-throughput config created")
print(f"Expected samples per iteration: 16000")
print(f"Workers x Envs x Fragment: 16 x 5 x 200 = 16000")
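
The batch-size arithmetic printed above can be wrapped in two small helpers, which is handy when changing the worker count. This is a sketch of the same multiplication, not an RLlib API; `samples_per_round` and `fragment_length_for` are hypothetical names.

```python
def samples_per_round(num_env_runners, num_envs_per_env_runner, rollout_fragment_length):
    """Env steps collected when every environment contributes one fragment."""
    return num_env_runners * num_envs_per_env_runner * rollout_fragment_length


def fragment_length_for(train_batch_size, num_env_runners, num_envs_per_env_runner):
    """Fragment length that fills train_batch_size in exactly one sampling round."""
    parallel_envs = num_env_runners * num_envs_per_env_runner
    if train_batch_size % parallel_envs != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"{parallel_envs} parallel environments"
        )
    return train_batch_size // parallel_envs


print(samples_per_round(16, 5, 200))
print(fragment_length_for(16000, 16, 5))
```

If the division is not exact, one sampling round either undershoots or overshoots the training batch, so it is worth checking whenever `num_env_runners` or `num_envs_per_env_runner` changes.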

## Memory Optimization

In [None]:
# Memory-efficient configuration for large observations
memory_efficient_config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(
        num_env_runners=4,
        # Compress observations to save memory
        compress_observations=True,
    )
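    .training(
        train_batch_size=4000,
        # Smaller minibatches lower peak memory per SGD step
        # (this is not gradient accumulation: each minibatch triggers its own update)
        minibatch_size=256,  # sgd_minibatch_size in older Ray releases
        num_epochs=16,       # num_sgd_iter in older Ray releases
    )
    .debugging(
        # Reduce log verbosity; this setting does not log memory usage
        log_level="WARN",
    )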
)

print("Memory-efficient config created")
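
To judge whether observation memory is a concern at all, a rough estimate of the raw observation footprint of one training batch helps. The sketch below is simple arithmetic, not an RLlib utility; `observation_batch_bytes` is a hypothetical helper, and the 84x84x4 `uint8` image shape is an assumed example of a large observation rather than something used elsewhere in this section.

```python
def observation_batch_bytes(batch_size, obs_shape, bytes_per_element):
    """Raw (uncompressed) size in bytes of the observations in one batch."""
    elements = 1
    for dim in obs_shape:
        elements *= dim
    return batch_size * elements * bytes_per_element


# CartPole-v1: 4 float32 values per observation.
cartpole_bytes = observation_batch_bytes(4000, (4,), 4)
# Assumed image observation: 84x84x4 uint8 frames (hypothetical example).
image_bytes = observation_batch_bytes(4000, (84, 84, 4), 1)

print(f"CartPole batch: {cartpole_bytes / 1e6:.3f} MB")
print(f"Image batch:    {image_bytes / 1e6:.1f} MB")
```

For vector observations like CartPole the footprint is negligible, so options such as `compress_observations` mainly pay off for image-like inputs.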

## Production Training Script

In [None]:
# train_distributed.py
train_script = '''#!/usr/bin/env python
"""Distributed RLlib training script."""

import argparse
import ray
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

def main(args):
    # Connect to Ray cluster
    ray.init(address=args.ray_address)
    
    config = (
        PPOConfig()
        .environment(args.env)
        .framework("torch")
        .env_runners(
            num_env_runners=args.num_workers,
            num_envs_per_env_runner=args.envs_per_worker,
        )
        .resources(
            num_gpus=args.num_gpus,
        )
        .training(
            train_batch_size=args.batch_size,
            lr=args.lr,
        )
    )
    
    # Run with Tune for experiment tracking
    tuner = tune.Tuner(
        "PPO",
        param_space=config,
        run_config=tune.RunConfig(
            name=args.experiment_name,
            stop={"training_iteration": args.max_iterations},
            checkpoint_config=tune.CheckpointConfig(
                checkpoint_frequency=10,
                checkpoint_at_end=True,
            ),
            storage_path=args.storage_path,
        ),
    )
    
    results = tuner.fit()
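    # get_best_result() needs a metric and mode unless they are set in TuneConfig.
    # The metric key depends on the RLlib version/API stack; adjust it if needed.
    best = results.get_best_result(metric="env_runners/episode_return_mean", mode="max")
    print(f"Best result: {best}")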

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ray-address", default="auto")
    parser.add_argument("--env", default="CartPole-v1")
    parser.add_argument("--num-workers", type=int, default=8)
    parser.add_argument("--envs-per-worker", type=int, default=5)
    parser.add_argument("--num-gpus", type=float, default=1)  # float allows fractional GPUs
    parser.add_argument("--batch-size", type=int, default=4000)
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--max-iterations", type=int, default=100)
    parser.add_argument("--experiment-name", default="ppo_experiment")
    parser.add_argument("--storage-path", default="/tmp/ray_results")
    
    main(parser.parse_args())
'''

print("Distributed training script example")

## Key Takeaways

1. **Scale workers** for data collection, GPUs for training

2. **IMPALA/APEX** are designed for massive distributed training

3. **Ray Cluster** enables multi-node training (AWS, GCP, Kubernetes)

4. **Tune batch sizes** to match your hardware capabilities

## Next Steps

In the next section, we'll cover hyperparameter tuning with Ray Tune.

In [None]:
ray.shutdown()