# 2.1 Setting Up Ray and RLlib

## Learning Objectives
- Install and configure Ray and RLlib
- Understand Ray's architecture and concepts
- Run your first RLlib training job
- Monitor training with TensorBoard

## What is Ray?

Ray is a unified framework for scaling AI and Python applications. It provides:

- **Ray Core**: Distributed computing primitives
- **Ray RLlib**: Scalable reinforcement learning
- **Ray Tune**: Hyperparameter tuning
- **Ray Serve**: Model serving
- **Ray Data**: Distributed data processing
- **Ray Train**: Distributed model training

```
┌─────────────────────────────────────────────────────────┐
│                    Ray Libraries                         │
├──────────┬──────────┬──────────┬──────────┬────────────┤
│  RLlib   │   Tune   │  Serve   │   Data   │   Train    │
├──────────┴──────────┴──────────┴──────────┴────────────┤
│                      Ray Core                           │
│            (Tasks, Actors, Objects)                     │
├─────────────────────────────────────────────────────────┤
│              Cluster Management                         │
│        (Local, Cloud, Kubernetes)                       │
└─────────────────────────────────────────────────────────┘
```

## Installation

Install Ray with RLlib support:

In [None]:
# Install Ray with RLlib
# !pip install "ray[rllib]" gymnasium torch tensorboard

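# For reproducibility, pin a specific version with "ray[rllib]==<version>".
# Note: the code below uses the `.env_runners()` config API, which Ray 2.9.0
# predates, so pin a release that includes it:
# !pip install "ray[rllib]==<version>" gymnasium torch tensorboard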

In [None]:
import ray
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.dqn import DQNConfig
import gymnasium as gym
import numpy as np

print(f"Ray version: {ray.__version__}")

## Ray Core Basics

Before using RLlib, let's understand Ray's core concepts.

In [None]:
# Start a local Ray instance (pass address="auto" to connect to a running cluster instead)
ray.init(ignore_reinit_error=True)

# Check cluster resources
print(ray.cluster_resources())

In [None]:
# Ray Tasks: Parallel function execution
@ray.remote
def simulate_episode(env_name: str) -> float:
    """Simulate one episode and return total reward."""
    env = gym.make(env_name)
    state, _ = env.reset()
    total_reward = 0
    
    while True:
        action = env.action_space.sample()  # Random policy
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    
    env.close()
    return total_reward

# Run 10 episodes in parallel
futures = [simulate_episode.remote("CartPole-v1") for _ in range(10)]
results = ray.get(futures)

print(f"Episode rewards: {results}")
print(f"Mean reward: {np.mean(results):.2f}")

In [None]:
# Ray Actors: Stateful distributed objects
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0
    
    def increment(self):
        self.value += 1
        return self.value
    
    def get_value(self):
        return self.value

# Create actor and call methods
counter = Counter.remote()
for _ in range(5):
    ray.get(counter.increment.remote())

print(f"Counter value: {ray.get(counter.get_value.remote())}")

## RLlib: Your First Training Job

RLlib uses a config-based API. Let's train PPO on CartPole.

In [None]:
# Configure PPO algorithm
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")  # or "tf2" for TensorFlow
    .env_runners(
        num_env_runners=2,  # Number of parallel workers
        num_envs_per_env_runner=1,
    )
    .training(
        lr=0.0003,
        gamma=0.99,
        train_batch_size=4000,
    )
)

# Build the algorithm
algo = config.build()

print("Algorithm built successfully!")
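# get_policy() is part of RLlib's older API stack and may be unavailable
# when the new API stack is active.
print(f"Policy: {algo.get_policy()}")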
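
The `gamma=0.99` setting in the config above is the discount factor applied to future rewards. The following is a minimal sketch of what that means numerically, not RLlib's implementation; `discounted_return` is a hypothetical helper.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    # Work backwards so each step folds in the discounted future return
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1.0: 1 + 0.99 + 0.99**2
short_return = discounted_return([1.0, 1.0, 1.0], gamma=0.99)
print(f"Three-step return: {short_return:.4f}")

# A full-length CartPole-v1 episode (500 steps of reward 1.0) stays below
# the geometric-series limit 1 / (1 - gamma) = 100
long_return = discounted_return([1.0] * 500, gamma=0.99)
print(f"500-step return: {long_return:.2f}")
```

With `gamma` close to 1 the agent values long-term reward almost as much as immediate reward, which suits CartPole because every surviving step pays the same.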

In [None]:
# Train for a few iterations
results_history = []

for i in range(10):
    result = algo.train()
    results_history.append(result)
    
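    # Extract key metrics. The mean-reward key is `episode_reward_mean` on
    # RLlib's older API stack and `episode_return_mean` on the newer one.
    env_metrics = result["env_runners"]
    mean_reward = env_metrics.get(
        "episode_reward_mean", env_metrics.get("episode_return_mean", float("nan"))
    )
    episodes = env_metrics["num_episodes"]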
    
    print(f"Iteration {i+1}: Mean Reward = {mean_reward:.2f}, Episodes = {episodes}")

In [None]:
# Evaluate the trained policy
env = gym.make("CartPole-v1")

eval_rewards = []
for _ in range(10):
    state, _ = env.reset()
    total_reward = 0
    
    while True:
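        # compute_single_action() is part of RLlib's older API stack; newer
        # releases run inference through the policy's RLModule instead.
        action = algo.compute_single_action(state)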
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    
    eval_rewards.append(total_reward)

print(f"Evaluation rewards: {eval_rewards}")
print(f"Mean evaluation reward: {np.mean(eval_rewards):.2f}")

env.close()

In [None]:
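# Save the trained model
save_result = algo.save()
# Recent Ray releases return a result object wrapping the checkpoint;
# older ones return the checkpoint path directly.
checkpoint_dir = (
    save_result.checkpoint.path if hasattr(save_result, "checkpoint") else save_result
)
print(f"Checkpoint saved to: {checkpoint_dir}")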

# Clean up
algo.stop()

## Using Ray Tune for Training

Ray Tune provides a more robust way to run experiments with logging, checkpointing, and early stopping.

In [None]:
from ray.tune.registry import register_env

# Configure the experiment
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=2)
    .training(
        lr=0.0003,
        train_batch_size=4000,
    )
)

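# Run with Tune. Recent Ray releases expose RunConfig and CheckpointConfig
# under `ray.tune`; older ones provide them under `ray.train` (or `ray.air`).
tuner = tune.Tuner(
    "PPO",
    param_space=config,
    run_config=tune.RunConfig(
        # Stop once the mean episode reward reaches 450 (CartPole-v1 caps at 500)
        stop={"env_runners/episode_reward_mean": 450},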
        checkpoint_config=tune.CheckpointConfig(
            checkpoint_at_end=True,
            checkpoint_frequency=5,
        ),
    ),
)

results = tuner.fit()

# Get best result
best_result = results.get_best_result(metric="env_runners/episode_reward_mean", mode="max")
print(f"Best reward: {best_result.metrics['env_runners']['episode_reward_mean']:.2f}")
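
The stop criterion and `get_best_result` above address the metric as `"env_runners/episode_reward_mean"`, while the result dictionary itself is nested. The sketch below shows one way such slash-separated keys can map onto nested results. It is illustrative only and is not Tune's implementation; `flatten_result`, `should_stop`, and the sample values are hypothetical.

```python
def flatten_result(result: dict, prefix: str = "", sep: str = "/") -> dict:
    """Flatten nested dicts into a single dict with sep-joined key paths."""
    flat = {}
    for key, value in result.items():
        path = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_result(value, path, sep))
        else:
            flat[path] = value
    return flat


def should_stop(result: dict, stop: dict) -> bool:
    """Return True once any listed metric reaches its threshold."""
    flat = flatten_result(result)
    return any(
        flat.get(key, float("-inf")) >= threshold for key, threshold in stop.items()
    )


# Made-up result for illustration, not real training output
example = {
    "training_iteration": 7,
    "env_runners": {"episode_reward_mean": 455.2, "num_episodes": 9},
}

flat_example = flatten_result(example)
print(flat_example)
print(should_stop(example, {"env_runners/episode_reward_mean": 450}))
```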

## Monitoring with TensorBoard

Ray writes training metrics as TensorBoard event files under `~/ray_results` by default.

```bash
# In terminal, run:
tensorboard --logdir ~/ray_results
```

Then open http://localhost:6006 in your browser.

In [None]:
# You can also load TensorBoard in Jupyter
%load_ext tensorboard
%tensorboard --logdir ~/ray_results
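
If TensorBoard shows no runs, it can help to confirm that event files exist. The following is a small sketch that assumes the files follow TensorBoard's usual `events.out.tfevents.*` naming; `find_event_files` is a hypothetical helper, not part of Ray.

```python
from pathlib import Path


def find_event_files(logdir: str) -> list[Path]:
    """Return TensorBoard event files under logdir, newest first."""
    root = Path(logdir).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        root.rglob("events.out.tfevents.*"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )


# List the five most recently written event files, if any
for event_file in find_event_files("~/ray_results")[:5]:
    print(event_file)
```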

## RLlib Configuration Deep Dive

An RLlib config is assembled by chaining builder methods, each of which groups related settings:

In [None]:
# Comprehensive configuration example
full_config = (
    PPOConfig()
    
    # Environment settings
    .environment(
        env="CartPole-v1",
        env_config={},  # Pass config to env
        observation_space=None,  # Override if needed
        action_space=None,
    )
    
    # Framework (PyTorch or TensorFlow)
    .framework(
        framework="torch",
    )
    
    # Env runners (rollout workers)
    .env_runners(
        num_env_runners=4,            # Parallel workers
        num_envs_per_env_runner=1,    # Envs per worker
        num_cpus_per_env_runner=1,    # CPUs reserved per worker
        rollout_fragment_length=200,  # Steps per rollout
        batch_mode="truncate_episodes",
    )
    
    # Training settings
    .training(
        lr=3e-4,
        gamma=0.99,
        train_batch_size=4000,
        model={
            "fcnet_hiddens": [256, 256],
            "fcnet_activation": "tanh",
        },
        # PPO-specific (newer Ray releases rename the next two settings to
        # `minibatch_size` and `num_epochs`)
        sgd_minibatch_size=128,
        num_sgd_iter=10,
        clip_param=0.2,
    )
    
    # Resources
    .resources(
        num_gpus=0,  # GPUs for training
    )
    
    # Evaluation
    .evaluation(
        evaluation_interval=5,       # Eval every N iterations
        evaluation_num_env_runners=2,
        evaluation_duration=10,      # Episodes per eval
    )
)

print("Configuration created successfully!")
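
The sampling and batch settings in this configuration interact, and a little arithmetic makes the relationship concrete. This sketch assumes each env runner contributes exactly `rollout_fragment_length` steps per environment per sampling round, and that a trailing partial minibatch counts as one update. RLlib's actual batching may differ, so treat the numbers as estimates.

```python
import math

# Values copied from the configuration above
num_env_runners = 4
num_envs_per_env_runner = 1
rollout_fragment_length = 200
train_batch_size = 4000
sgd_minibatch_size = 128
num_sgd_iter = 10

# Environment steps gathered when every runner samples once
steps_per_round = num_env_runners * num_envs_per_env_runner * rollout_fragment_length

# Sampling rounds needed to fill one training batch
sampling_rounds = math.ceil(train_batch_size / steps_per_round)

# Minibatches per pass over the batch, and total updates per iteration
minibatches_per_epoch = math.ceil(train_batch_size / sgd_minibatch_size)
max_sgd_updates = num_sgd_iter * minibatches_per_epoch

print(f"Steps per sampling round: {steps_per_round}")
print(f"Sampling rounds per training batch: {sampling_rounds}")
print(f"Minibatches per epoch: {minibatches_per_epoch}")
print(f"SGD updates per training iteration (upper bound): {max_sgd_updates}")
```

If partial minibatches are dropped rather than used, the count per epoch would be 31 instead of 32, giving 310 updates per iteration.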

## Loading and Restoring Checkpoints

In [None]:
from ray.rllib.algorithms.algorithm import Algorithm

# Restore from checkpoint
# algo = Algorithm.from_checkpoint(checkpoint_dir)

# Or restore a specific algorithm type
# from ray.rllib.algorithms.ppo import PPO
# algo = PPO.from_checkpoint(checkpoint_dir)

print("Checkpoint loading example (uncomment to use)")

## Key Takeaways

1. **Ray Core** provides distributed computing primitives (tasks, actors)

2. **RLlib** uses a config-based API for easy experimentation

3. **Ray Tune** adds experiment management, logging, and hyperparameter tuning

4. **TensorBoard** integration lets you monitor training progress

## Next Steps

In the next notebook, we'll explore different RLlib algorithms and when to use each one.

In [None]:
# Clean up Ray
ray.shutdown()