# Hyperparameter Tuning with Ray Tune

**Prerequisites**: Complete [05_distributed_training](../05_distributed_training/01_scaling_rllib.ipynb)

RL algorithms are highly sensitive to hyperparameters: the same algorithm can converge, stall, or collapse depending on a few settings. Ray Tune automates the search for good values.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    WHY HYPERPARAMETER TUNING?                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  SAME ALGORITHM, DIFFERENT HYPERPARAMETERS:                                 │
│                                                                             │
│  lr=0.01 (too high)        lr=0.0003 (good)         lr=0.00001 (too low)   │
│                                                                             │
│  Reward │      /\          Reward │      ___         Reward │              │
│         │     /  \                │     /   \               │   _____      │
│         │    /    \  crash        │    /     \              │  /           │
│         │   /      \___           │   /       \___          │ /            │
│         │__/                      │__/                      │/             │
│         └──────────────           └──────────────           └───────────   │
│              Iterations                Iterations               Iterations │
│                                                                             │
│  Unstable: big updates         Stable: converges!       Slow: barely learns│
│  destroy learning                                                           │
│                                                                             │
│  Finding the RIGHT hyperparameters can be the difference between            │
│  success and failure!                                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
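
The same pattern shows up in a much simpler setting. The sketch below is a toy analogue, not an RL experiment: it runs plain gradient descent on f(x) = x² with three step sizes. The function name `gradient_descent` and the step sizes are illustrative choices for this quadratic and are unrelated to the PPO learning rates in the diagram.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 from x0; return the final distance from the optimum."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # one gradient step
    return abs(x)

too_high = gradient_descent(lr=1.1)    # every step overshoots, so |x| grows
good = gradient_descent(lr=0.3)        # every step shrinks |x| by a factor of 0.4
too_low = gradient_descent(lr=0.001)   # every step shrinks |x| by only 0.2%

print(f"lr=1.1   -> |x| = {too_high:.3e}")
print(f"lr=0.3   -> |x| = {good:.3e}")
print(f"lr=0.001 -> |x| = {too_low:.3e}")
```

With the large step the iterate moves away from the optimum, with the small step it has barely moved after 50 steps, and only the middle value gets close to zero.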

## Ray Tune: Hyperparameter Search at Scale

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          RAY TUNE OVERVIEW                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Ray Tune runs MANY training trials in PARALLEL:                            │
│                                                                             │
│     ┌─────────────────────────────────────────────────────────────────┐    │
│     │                         SEARCH SPACE                            │    │
│     │                                                                 │    │
│     │  lr: [0.00001 ─────────────────────────────────────── 0.01]    │    │
│     │  gamma: [0.9 ────────────────────────────────────────── 0.999] │    │
│     │  batch_size: [1000, 2000, 4000, 8000]                          │    │
│     │                                                                 │    │
│     └─────────────────────────────────────────────────────────────────┘    │
│                                   │                                         │
│               ┌───────────────────┼───────────────────┐                    │
│               │                   │                   │                    │
│               v                   v                   v                    │
│     ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐           │
│     │    TRIAL 1      │ │    TRIAL 2      │ │    TRIAL 3      │           │
│     │ lr=0.0005       │ │ lr=0.001        │ │ lr=0.0001       │           │
│     │ gamma=0.99      │ │ gamma=0.95      │ │ gamma=0.999     │           │
│     │ batch=4000      │ │ batch=2000      │ │ batch=8000      │           │
│     │                 │ │                 │ │                 │           │
│     │ reward: 420     │ │ reward: 180     │ │ reward: 490 *   │           │
│     └─────────────────┘ └─────────────────┘ └─────────────────┘           │
│               │                   │                   │                    │
│               └───────────────────┴───────────────────┘                    │
│                                   │                                         │
│                                   v                                         │
│                        BEST: Trial 3 config!                                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
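
The flow in the diagram (sample configs, run trials side by side, keep the best) can be sketched without Ray. This is a simplified illustration only: `sample_config` and `toy_score` are hypothetical helpers, and `toy_score` is a made-up stand-in for a real training run, not a measured reward.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def sample_config(rng):
    """Draw one configuration from the search space shown in the diagram."""
    return {
        "lr": math.exp(rng.uniform(math.log(1e-5), math.log(1e-2))),  # log-uniform
        "gamma": rng.uniform(0.9, 0.999),
        "batch_size": rng.choice([1000, 2000, 4000, 8000]),
    }

def toy_score(config):
    """Hypothetical stand-in for a training run; peaks near lr=3e-4, gamma=0.99."""
    lr_penalty = abs(math.log10(config["lr"]) - math.log10(3e-4))
    gamma_penalty = abs(config["gamma"] - 0.99) * 10
    return 500.0 - 100.0 * lr_penalty - 100.0 * gamma_penalty

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(6)]

# Evaluate two trials at a time, then keep the best-scoring configuration.
with ThreadPoolExecutor(max_workers=2) as pool:
    scores = list(pool.map(toy_score, configs))

best_score, best_config = max(zip(scores, configs), key=lambda pair: pair[0])
print(f"best score {best_score:.1f} with {best_config}")
```

Ray Tune follows the same outline but distributes the trials across a cluster and adds schedulers that can stop or modify trials while they run.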

In [None]:
# Suppress warnings
import warnings
import logging
warnings.filterwarnings("ignore")
logging.getLogger("ray").setLevel(logging.ERROR)

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler, PopulationBasedTraining
from ray.rllib.algorithms.ppo import PPOConfig
import numpy as np
import matplotlib.pyplot as plt

ray.init(
    num_cpus=4,
    object_store_memory=1 * 1024 * 1024 * 1024,
    ignore_reinit_error=True,
)
print(f"Ray initialized: {ray.cluster_resources()}")

---

## Defining Search Spaces

Tell Tune what values to try for each hyperparameter.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          SEARCH SPACE TYPES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  tune.uniform(a, b)                                                         │
│  ──────────────────                                                         │
│  Uniform between a and b                                                    │
│  Good for: bounded parameters like gamma                                    │
│                                                                             │
│     |████████████████████████|                                              │
│     a                        b                                              │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────      │
│                                                                             │
│  tune.loguniform(a, b)                                                      │
│  ─────────────────────                                                      │
│  Log-uniform: samples uniformly in LOG space                                │
│  Good for: learning rates, coefficients that span orders of magnitude       │
│                                                                             │
│     |████████  ████  ██  █ █|                                               │
│     1e-5              1e-2   (more samples at small values)                 │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────      │
│                                                                             │
│  tune.choice([a, b, c])                                                     │
│  ──────────────────────                                                     │
│  Pick one from a list                                                       │
│  Good for: categorical choices like batch sizes                             │
│                                                                             │
│     [1000]  [2000]  [4000]  [8000]                                          │
│                                                                             │
│  ─────────────────────────────────────────────────────────────────────      │
│                                                                             │
│  tune.grid_search([a, b, c])                                                │
│  ───────────────────────────                                                │
│  Try EVERY value (exhaustive)                                               │
│  Good for: small sets you want to fully explore                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
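
To see why log-uniform sampling suits learning rates, the following sketch compares it with uniform sampling over the range 1e-5 to 1e-2 using only the standard library. The helper names `sample_loguniform` and `fraction_below` are illustrative and are not part of Ray Tune.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def fraction_below(samples, threshold):
    """Share of samples that are smaller than threshold."""
    return sum(s < threshold for s in samples) / len(samples)

rng = random.Random(0)
n = 10_000
log_samples = [sample_loguniform(1e-5, 1e-2, rng) for _ in range(n)]
uni_samples = [rng.uniform(1e-5, 1e-2) for _ in range(n)]

# [1e-5, 1e-4) is one of three decades, so log-uniform places about a third
# of its samples there, while uniform places under 1% there.
log_frac = fraction_below(log_samples, 1e-4)
uni_frac = fraction_below(uni_samples, 1e-4)
print(f"log-uniform below 1e-4: {log_frac:.1%}")
print(f"uniform below 1e-4:     {uni_frac:.1%}")
```

Uniform sampling would spend almost the whole budget on learning rates between 1e-3 and 1e-2, leaving the smaller decades nearly unexplored.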

In [None]:
# Define a search space
print("SEARCH SPACE EXAMPLES")
print("=" * 50)

# Learning rate: log-uniform because it spans orders of magnitude
lr_space = tune.loguniform(1e-5, 1e-3)
print("\nLearning rate (loguniform 1e-5 to 1e-3):")
for _ in range(5):
    print(f"  {lr_space.sample():.6f}")

# Gamma: uniform because it's bounded
gamma_space = tune.uniform(0.9, 0.999)
print("\nGamma (uniform 0.9 to 0.999):")
for _ in range(5):
    print(f"  {gamma_space.sample():.4f}")

# Batch size: choice because we want specific values
batch_space = tune.choice([1000, 2000, 4000, 8000])
print("\nBatch size (choice):")
for _ in range(5):
    print(f"  {batch_space.sample()}")

---

## ASHA Scheduler: Early Stopping

ASHA stops underperforming trials early so that compute goes to the promising ones.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            ASHA SCHEDULER                                   │
│             (Asynchronous Successive Halving Algorithm)                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  IDEA: Don't waste resources on trials that are clearly losing              │
│                                                                             │
│  Without ASHA:                      With ASHA:                              │
│  ──────────────                     ──────────                              │
│                                                                             │
│  Run ALL trials to completion       Stop bad trials early                   │
│                                                                             │
│  Reward │    * good                 Reward │    * good                      │
│         │   /                              │   /                            │
│         │  /  ─── meh                      │  /                             │
│         │ /  /                             │ /                              │
│         │/  /  ─── bad                     │/  X stopped!                   │
│         └──────────────                    └──────────────                  │
│         0   10  20  30  40  50             0   10  20  30  40  50           │
│                                                 ^                           │
│  Wasted compute on bad trials!             Save resources for               │
│                                            promising trials!                │
│                                                                             │
│  HOW IT WORKS:                                                              │
│  1. grace_period: minimum iterations before stopping                        │
│  2. Compare trials at checkpoints                                           │
│  3. Stop bottom performers                                                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
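
The three steps under HOW IT WORKS can be sketched as plain successive halving. This is a simplified, synchronous version for illustration: ASHA makes the same kind of decision asynchronously as each trial reaches a comparison point, and Ray's exact rung placement may differ. The helper names and the linear learning curves are hypothetical.

```python
def rung_milestones(grace_period, max_t, reduction_factor):
    """Iterations at which trials are compared: grace_period * reduction_factor**k."""
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones

def successive_halving(curves, grace_period, max_t, reduction_factor):
    """Return {trial name: iteration at which the trial stopped}.

    curves maps a trial name to a function iteration -> reward.
    """
    alive = list(curves)
    stopped_at = {}
    for t in rung_milestones(grace_period, max_t, reduction_factor):
        ranked = sorted(alive, key=lambda name: curves[name](t), reverse=True)
        keep = max(1, len(ranked) // reduction_factor)  # top 1/reduction_factor
        for name in ranked[keep:]:
            stopped_at[name] = t  # bottom performers stop at this rung
        alive = ranked[:keep]
    for name in alive:
        stopped_at[name] = max_t  # survivors run to the end
    return stopped_at

# Hypothetical linear learning curves: reward grows at a different rate per trial.
curves = {
    "good": lambda t: 10.0 * t,
    "meh": lambda t: 5.0 * t,
    "poor": lambda t: 2.0 * t,
    "bad": lambda t: 1.0 * t,
}
stopped_at = successive_halving(curves, grace_period=5, max_t=30, reduction_factor=2)
used = sum(stopped_at.values())
print(stopped_at)
print(f"iterations used: {used} (vs {30 * len(curves)} without early stopping)")
```

In this toy run only the strongest trial reaches `max_t`; the other three are cut at iterations 5 and 10, which is where the compute savings come from.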

In [None]:
# ASHA scheduler configuration
asha_scheduler = ASHAScheduler(
    metric="env_runners/episode_return_mean",  # What to optimize
    mode="max",                                 # Maximize reward
    max_t=30,                                   # Max iterations per trial
    grace_period=5,                             # Min iterations before a trial can be stopped
    reduction_factor=2,                         # Keep roughly the top 1/2 of trials at each rung
)

print("ASHA Scheduler:")
print("  - Every trial runs at least 5 iterations before it can be stopped")
print("  - Promising trials may run up to 30 iterations")
print("  - Reduction factor 2: keep roughly the top 50% at each rung")

---

## Running a Tune Experiment

In [None]:
# Create a config with search space
tunable_config = (
    PPOConfig()
    .environment("CartPole-v1")
    .framework("torch")
    .env_runners(num_env_runners=1)  # Keep small for demo
    .training(
        # These will be SEARCHED
        lr=tune.loguniform(1e-5, 1e-3),
        gamma=tune.uniform(0.95, 0.999),
        train_batch_size=tune.choice([1000, 2000, 4000]),
        
        # These are fixed. Note: newer Ray releases rename these two
        # arguments to minibatch_size and num_epochs.
        sgd_minibatch_size=128,
        num_sgd_iter=10,
    )
)

print("Tunable config created with search space:")
print("  - lr: loguniform(1e-5, 1e-3)")
print("  - gamma: uniform(0.95, 0.999)")
print("  - train_batch_size: choice([1000, 2000, 4000])")

In [None]:
# Run tuning experiment
print("Running hyperparameter search...")
print("=" * 50)

tuner = tune.Tuner(
    "PPO",
    param_space=tunable_config,
    tune_config=tune.TuneConfig(
        scheduler=asha_scheduler,
        num_samples=6,             # Try 6 different configs
        max_concurrent_trials=2,   # Run 2 at a time
    ),
    run_config=tune.RunConfig(
        stop={"training_iteration": 30},  # Same cap as ASHA's max_t
        verbose=1,
    ),
)

results = tuner.fit()

In [None]:
# Analyze results
best_result = results.get_best_result(
    metric="env_runners/episode_return_mean",
    mode="max"
)

print("\nBEST TRIAL")
print("=" * 50)
print(f"Reward: {best_result.metrics['env_runners']['episode_return_mean']:.1f}")
print("\nBest hyperparameters:")
print(f"  lr:               {best_result.config['lr']:.6f}")
print(f"  gamma:            {best_result.config['gamma']:.4f}")
print(f"  train_batch_size: {best_result.config['train_batch_size']}")

---

## Population Based Training (PBT)

PBT is often a strong choice for RL because it adapts hyperparameters during training instead of fixing them up front.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      POPULATION BASED TRAINING                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  KEY IDEA: Don't just search, EVOLVE during training                        │
│                                                                             │
│  Regular Search:                   PBT:                                     │
│  ───────────────                   ────                                     │
│  Fix hyperparams at start          Change hyperparams AS YOU TRAIN          │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                         PBT PROCESS                                  │  │
│  │                                                                      │  │
│  │  Population of 8 agents training in parallel:                        │  │
│  │                                                                      │  │
│  │  Start:  [A1] [A2] [A3] [A4] [A5] [A6] [A7] [A8]                     │  │
│  │          all random hyperparameters                                  │  │
│  │                                                                      │  │
│  │  After 10 iters, compare rewards:                                    │  │
│  │          A3 and A5 are doing best!                                   │  │
│  │                                                                      │  │
│  │  Exploit: Copy A3's weights to A1, A7 (they were worst)             │  │
│  │  Explore: Mutate A1, A7's hyperparameters slightly                  │  │
│  │                                                                      │  │
│  │  Result: Bad agents get a "boost" from good agents,                 │  │
│  │          and try slightly different hyperparameters                  │  │
│  │                                                                      │  │
│  │  Repeat every N iterations!                                          │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  WHY PBT IS GREAT FOR RL:                                                   │
│  - Optimal hyperparams CHANGE during training                               │
│  - Early: high exploration (high entropy, high lr)                          │
│  - Late: low exploration (low entropy, low lr)                              │
│  - PBT naturally discovers this schedule!                                   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
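
One exploit/explore round from the diagram can be sketched in plain Python. This is a simplified illustration under stated assumptions: the `pbt_step` helper, the agent dictionaries, and the rewards are hypothetical, the perturbation multiplies each hyperparameter by 0.8 or 1.2, and the occasional resampling step of the real scheduler is left out. In this sketch the copied hyperparameters come from the donor along with its weights.

```python
import copy
import random

def pbt_step(population, quantile_fraction, rng, factors=(0.8, 1.2)):
    """One exploit/explore round on a list of agent dicts (modified in place).

    Each agent has "name", "reward", "weights", and "hyperparams".
    """
    ranked = sorted(population, key=lambda agent: agent["reward"], reverse=True)
    n_swap = max(1, int(len(ranked) * quantile_fraction))
    top, bottom = ranked[:n_swap], ranked[-n_swap:]
    for agent in bottom:
        donor = rng.choice(top)
        # Exploit: take over the donor's weights and hyperparameters.
        agent["weights"] = copy.deepcopy(donor["weights"])
        agent["hyperparams"] = dict(donor["hyperparams"])
        agent["donor"] = donor["name"]
        # Explore: nudge each hyperparameter up or down by a fixed factor.
        for key in agent["hyperparams"]:
            agent["hyperparams"][key] *= rng.choice(factors)
    return population

rng = random.Random(0)
rewards = [50, 120, 400, 90, 380, 150, 40, 200]  # A3 and A5 best, A1 and A7 worst
population = [
    {
        "name": f"A{i + 1}",
        "reward": float(reward),
        "weights": [i],  # placeholder for model weights
        "hyperparams": {"lr": 1e-4 * (i + 1)},
    }
    for i, reward in enumerate(rewards)
]
pbt_step(population, quantile_fraction=0.25, rng=rng)
for agent in population:
    print(agent["name"], agent.get("donor", "-"), f"{agent['hyperparams']['lr']:.2e}")
```

After the step, A1 and A7 carry a copy of a top agent's weights plus a slightly perturbed learning rate, while the other six agents are untouched.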

In [None]:
# PBT scheduler configuration
pbt_scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="env_runners/episode_return_mean",
    mode="max",
    perturbation_interval=5,        # Check every 5 iterations
    
    # Which hyperparameters to mutate
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-3),
        "entropy_coeff": tune.loguniform(1e-4, 1e-2),
    },
    
    quantile_fraction=0.25,  # Bottom 25% exploit top 25%
    resample_probability=0.25,  # 25% chance to resample vs perturb
)

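# To use this scheduler, pass scheduler=pbt_scheduler in tune.TuneConfig,
# the same way the ASHA scheduler is passed in the experiment above.
print("PBT Scheduler:")
print("  - Check every 5 iterations")
print("  - Bottom 25% copy from top 25%")
print("  - Mutate lr and entropy_coeff")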

---

## Hyperparameter Sensitivity

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                     HYPERPARAMETER IMPORTANCE                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  MOST IMPORTANT (tune first):                                               │
│  ────────────────────────────                                               │
│                                                                             │
│  1. Learning Rate (lr)                                                      │
│     └─> Try: 1e-5 to 1e-3 (log scale)                                       │
│     └─> HUGE impact on stability and speed                                  │
│                                                                             │
│  2. Batch Size (train_batch_size)                                           │
│     └─> Try: 1000, 2000, 4000, 8000                                         │
│     └─> Larger = more stable, slower                                        │
│                                                                             │
│  MODERATELY IMPORTANT:                                                      │
│  ─────────────────────                                                      │
│                                                                             │
│  3. Discount Factor (gamma)                                                 │
│     └─> Try: 0.95 to 0.999                                                  │
│     └─> Higher = longer horizon, but harder to train                        │
│                                                                             │
│  4. Entropy Coefficient (entropy_coeff)                                     │
│     └─> Try: 0.0001 to 0.1                                                  │
│     └─> Higher = more exploration                                           │
│                                                                             │
│  LESS IMPORTANT (use defaults):                                             │
│  ──────────────────────────────                                             │
│                                                                             │
│  5. GAE Lambda (lambda_)                                                    │
│     └─> 0.95 is a common value (RLlib's PPO default is 1.0)                 │
│                                                                             │
│  6. Clip Parameter (clip_param)                                             │
│     └─> 0.2 is a common value (RLlib's PPO default is 0.3)                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
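
The "longer horizon" effect of gamma can be made concrete with a common rule of thumb: the discount weights 1, gamma, gamma², ... sum to 1 / (1 - gamma), so rewards much further away than that many steps contribute little. The helper name `effective_horizon` is illustrative.

```python
def effective_horizon(gamma):
    """Approximate number of future steps that matter under discount factor gamma."""
    return 1.0 / (1.0 - gamma)

for gamma in (0.95, 0.99, 0.999):
    print(f"gamma={gamma}: about {effective_horizon(gamma):.0f} steps")
```

Moving gamma from 0.95 to 0.999 stretches the horizon from roughly 20 to roughly 1000 steps, which is why a seemingly small change in gamma can alter training noticeably.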

## Best Practices

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                       TUNING BEST PRACTICES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. START WITH KNOWN GOOD DEFAULTS                                          │
│     ─────────────────────────────────                                       │
│     Common PPO starting values that usually work:                           │
│       lr = 3e-4                                                             │
│       gamma = 0.99                                                          │
│       clip_param = 0.2                                                      │
│       entropy_coeff = 0.01                                                  │
│                                                                             │
│  2. TUNE LEARNING RATE FIRST                                                │
│     ─────────────────────────────                                           │
│     It has the biggest impact!                                              │
│                                                                             │
│  3. USE LOG-UNIFORM FOR RATES/COEFFICIENTS                                  │
│     ─────────────────────────────────────                                   │
│     tune.loguniform(1e-5, 1e-3) not tune.uniform(0, 0.001)                  │
│                                                                             │
│  4. USE PBT FOR RL (adapts during training)                                 │
│     ───────────────────────────────────────                                 │
│                                                                             │
│  5. SET REASONABLE STOPPING CRITERIA                                        │
│     ──────────────────────────────────                                      │
│     Don't run forever - diminishing returns                                 │
│                                                                             │
│  6. USE ASHA FOR QUICK EXPLORATION                                          │
│     ─────────────────────────────────                                       │
│     Quickly filter out bad configs                                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Key Takeaways

1. **tune.loguniform** for learning rates and coefficients

2. **tune.choice** for categorical values like batch sizes

3. **ASHA** for quick exploration with early stopping

4. **PBT** for adapting hyperparameters during training, which often suits RL

5. **Learning rate** is usually the most important hyperparameter

## What's Next?

```
┌──────────────────┐          ┌──────────────────┐          ┌──────────────────┐
│   06 Ray Tune    │          │  07 Production   │          │ 08 Best Practice │
│  (you are here)  │   ───>   │                  │   ───>   │                  │
│                  │          │  Deploy trained  │          │     Industry     │
│  - Search spaces │          │  policies        │          │     patterns     │
│  - ASHA, PBT     │          │                  │          │                  │
└──────────────────┘          └──────────────────┘          └──────────────────┘
```

In [None]:
ray.shutdown()