# Goodhart's Law Simulation - Colab Training

This notebook trains RL agents to demonstrate Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

**Setup:** Copy the entire `goodharts_law` folder to your Google Drive before running.

**Expected Drive structure** (Colab mounts `My Drive` at `/content/drive/MyDrive`):
```
My Drive/
  goodharts_law/
    goodharts/           <- Python package
    config.default.toml  <- Configuration
    models/              <- Created during training
```
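
As a minimal sketch, the layout above can be checked before any imports are attempted. The helper name `missing_entries` and the `REQUIRED_ENTRIES` tuple are illustrative, not part of the `goodharts` package; `models/` is left out because it is created during training.

```python
from pathlib import Path

# Entries the setup note lists as required before training.
# models/ is omitted because it is created during training.
REQUIRED_ENTRIES = ("goodharts", "config.default.toml")


def missing_entries(project_root):
    """Return the required entries that are absent under project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


# Example (path assumes the default Colab mount point for Drive):
# missing = missing_entries("/content/drive/MyDrive/goodharts_law")
# if missing:
#     print(f"Missing from project folder: {missing}")
```

An empty list means both the package folder and the configuration file were found.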

---
## 1. Environment Setup

Mount Drive, configure paths, and install dependencies.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Configure project path - EDIT THIS if your folder is named differently
PROJECT_PATH = '/content/drive/MyDrive/goodharts_law'

import os
import sys

# Verify the path exists
if not os.path.exists(PROJECT_PATH):
    raise FileNotFoundError(
        f"Project not found at {PROJECT_PATH}\n"
        f"Please copy the goodharts_law folder to your Google Drive."
    )

# Add to Python path so imports work
if PROJECT_PATH not in sys.path:
    sys.path.insert(0, PROJECT_PATH)

# Change to project directory (for config file loading)
os.chdir(PROJECT_PATH)

print(f"Working directory: {os.getcwd()}")
print(f"Python path includes: {PROJECT_PATH}")

In [None]:
# Check GPU availability
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
    
    # Memory info
    props = torch.cuda.get_device_properties(0)
    print(f"GPU Memory: {props.total_memory / 1024**3:.1f} GB")
else:
    print("WARNING: No GPU detected. Training will be slow.")
    print("Go to Runtime -> Change runtime type -> GPU")

In [None]:
# Install additional dependencies if needed
# (Colab has torch, numpy, matplotlib pre-installed)

# Check for tomllib (Python 3.11+) or install toml
try:
    import tomllib
except ImportError:
    !pip install toml -q
    print("Installed toml package")

# Verify package imports
try:
    from goodharts.utils.device import get_device
    from goodharts.config import get_config
    from goodharts.modes import get_all_mode_names
    from goodharts.configs.default_config import get_simulation_config
    
    device = get_device()
    config = get_config()
    sim_config = get_simulation_config()
    
    print(f"\nPackage loaded successfully!")
    print(f"Device: {device}")
    print(f"World size: {config.get('WORLD_SIZE')}")
    print(f"Available training modes: {get_all_mode_names(sim_config)}")
except ImportError as e:
    print(f"Import error: {e}")
    print("\nCheck that the goodharts/ folder is in your Drive.")
    raise

---
## 2. Training Configuration

Set your training parameters here. The defaults are tuned for Colab's free tier (T4 GPU).

In [None]:
# Training configuration
# These override config.default.toml settings

TRAINING_CONFIG = {
    # Which mode(s) to train
    # Options: 'ground_truth', 'ground_truth_handhold', 'ground_truth_blinded', 'proxy', 'all',
    # or a comma-separated list such as 'ground_truth,proxy'
    'mode': 'ground_truth',
    
    # Training duration
    'total_timesteps': 500_000,  # Increase for better results (1M+ recommended)
    
    # Performance tuning for Colab
    'n_envs': 128,           # Parallel environments (reduce if OOM)
    'n_minibatches': 4,      # Increase if OOM (trades speed for memory)
    
    # Compilation (faster training after warmup)
    'compile_models': True,  # Set False for faster startup, slower training
    'use_amp': True,         # Mixed precision (faster, less memory)
    
    # Logging
    'tensorboard': True,     # Enable TensorBoard logging
}

print("Training configuration:")
for k, v in TRAINING_CONFIG.items():
    print(f"  {k}: {v}")

---
## 3. Run Training

Execute training with the configuration above. Progress is printed inline.

**Tips:**
- Training saves checkpoints to `models/` in your Drive
- You can interrupt (stop button) and resume later - models are saved periodically
- For the full Goodhart demonstration, train both `ground_truth` and `proxy` modes
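
To train both modes in one run, the `mode` setting can be given as a comma-separated string. The sketch below mirrors the parsing done in the next cell and adds a check for misspelled names; `parse_modes` and `KNOWN_MODES` are hypothetical names used only for illustration.

```python
def parse_modes(mode_setting, all_modes):
    """Expand a mode setting into a validated list of mode names."""
    if mode_setting == "all":
        return list(all_modes)
    requested = [m.strip() for m in mode_setting.split(",") if m.strip()]
    if not requested:
        raise ValueError("No training mode given")
    unknown = [m for m in requested if m not in all_modes]
    if unknown:
        raise ValueError(f"Unknown mode(s) {unknown}; expected one of {list(all_modes)}")
    return requested


# Mode names copied from the options listed in the configuration cell.
KNOWN_MODES = ["ground_truth", "ground_truth_handhold", "ground_truth_blinded", "proxy"]
print(parse_modes("ground_truth, proxy", KNOWN_MODES))  # ['ground_truth', 'proxy']
```

Failing early on an unknown name avoids discovering a typo only after the first mode has finished training.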

In [None]:
# Import training components
from goodharts.training.ppo import PPOTrainer, PPOConfig
from goodharts.modes import get_all_mode_names
from goodharts.configs.default_config import get_simulation_config
import os
import time

# Create models directory if it doesn't exist
os.makedirs('models', exist_ok=True)

# Parse mode(s)
sim_config = get_simulation_config()
all_modes = get_all_mode_names(sim_config)

mode_setting = TRAINING_CONFIG['mode']
if mode_setting == 'all':
    modes_to_train = all_modes
elif ',' in mode_setting:
    modes_to_train = [m.strip() for m in mode_setting.split(',')]
else:
    modes_to_train = [mode_setting]

print(f"Will train: {modes_to_train}")

In [None]:
# Run training
results = {}

for mode in modes_to_train:
    print(f"\n{'='*60}")
    print(f"Training mode: {mode}")
    print(f"{'='*60}")
    
    # Build config for this mode
    output_path = f'models/ppo_{mode}.pth'
    
    ppo_config = PPOConfig.from_config(
        mode=mode,
        total_timesteps=TRAINING_CONFIG['total_timesteps'],
        n_envs=TRAINING_CONFIG['n_envs'],
        n_minibatches=TRAINING_CONFIG['n_minibatches'],
        compile_models=TRAINING_CONFIG['compile_models'],
        use_amp=TRAINING_CONFIG['use_amp'],
        tensorboard=TRAINING_CONFIG['tensorboard'],
        output_path=output_path,
    )
    
    # Create trainer and run
    trainer = PPOTrainer(ppo_config)
    
    start_time = time.time()
    result = trainer.train()
    elapsed = time.time() - start_time
    
    results[mode] = result
    
    print(f"\nCompleted {mode} in {elapsed/60:.1f} minutes")
    print(f"Model saved to: {output_path}")
    
    # Clean up GPU memory between modes
    del trainer
    torch.cuda.empty_cache()

print(f"\n{'='*60}")
print("Training complete!")
print(f"{'='*60}")

---
## 4. TensorBoard

View training metrics in TensorBoard. Logs are saved to `generated/logs/tensorboard/`.

**Metrics tracked:**
- `loss/policy`, `loss/value` - PPO losses
- `metrics/entropy` - Policy entropy (exploration)
- `metrics/food_ratio` - food / (food + poison) per update
- `reward/episode` - Episode rewards
- `validation/*` - Validation metrics (if enabled)

In [None]:
# Load TensorBoard extension
%load_ext tensorboard

In [None]:
# Check for TensorBoard logs
import os
import glob

tb_dir = 'generated/logs/tensorboard'
if os.path.exists(tb_dir):
    subdirs = [d for d in os.listdir(tb_dir) if os.path.isdir(os.path.join(tb_dir, d))]
    print(f"TensorBoard logs found for: {subdirs}")
    
    # Show event file counts
    for subdir in subdirs:
        events = glob.glob(os.path.join(tb_dir, subdir, 'events.*'))
        print(f"  {subdir}: {len(events)} event file(s)")
else:
    print(f"No TensorBoard logs found at {tb_dir}")
    print("Run training with tensorboard=True first.")

In [None]:
# Launch TensorBoard (embedded in notebook)
# This will show training curves for all modes
%tensorboard --logdir generated/logs/tensorboard

---
## 5. Evaluation

Evaluate trained models using the continuous survival paradigm. Agents run until they die (starvation), then respawn. We track death events and survival times.
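
The paradigm can be pictured with a toy loop. This is a sketch under stated assumptions, not the `ModelTester` implementation: the energy values, the random "food found" event, and the function name `run_survival_eval` are all hypothetical. Only the result shape (`deaths` records with `survival_time`, plus `aggregates`) follows what the cells below read.

```python
import random


def run_survival_eval(n_envs, total_timesteps, food_prob, seed=0):
    """Toy continuous-survival loop: agents respawn after each death.

    All numbers (initial energy, step cost, food gain) are illustrative
    assumptions, not values taken from the goodharts package.
    """
    rng = random.Random(seed)
    initial_energy, step_cost, food_gain = 10.0, 1.0, 3.0
    energy = [initial_energy] * n_envs
    age = [0] * n_envs
    deaths = []  # one record per death event

    for _ in range(total_timesteps):
        for i in range(n_envs):
            age[i] += 1
            energy[i] -= step_cost
            if rng.random() < food_prob:
                energy[i] += food_gain
            if energy[i] <= 0:
                deaths.append({"env": i, "survival_time": age[i]})
                energy[i], age[i] = initial_energy, 0  # respawn in place

    times = [d["survival_time"] for d in deaths]
    survival_mean = sum(times) / len(times) if times else float("nan")
    return {
        "deaths": deaths,
        "aggregates": {"n_deaths": len(deaths), "survival_mean": survival_mean},
    }
```

Because agents respawn instead of ending the run, a short-lived policy contributes many death events and a long-lived one contributes few, which is why both the death count and the survival time are reported.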

In [None]:
# List available trained models
import glob
import os

model_files = glob.glob('models/*.pth')
print("Available models:")
for f in sorted(model_files):
    size_mb = os.path.getsize(f) / 1024 / 1024
    print(f"  {f} ({size_mb:.1f} MB)")

if not model_files:
    print("  No models found. Run training first.")

In [None]:
# Evaluation configuration
EVAL_CONFIG = {
    'models_to_evaluate': ['ground_truth', 'proxy'],  # Edit this list
    'total_timesteps': 1000,   # Steps per environment
    'n_envs': 64,              # Parallel environments
    'deterministic': False,    # Use stochastic policy (more realistic)
}

In [None]:
# Run evaluation using ModelTester (continuous survival paradigm)
from goodharts.evaluation.evaluator import ModelTester, EvaluationConfig

eval_results = {}

for mode in EVAL_CONFIG['models_to_evaluate']:
    model_path = f'models/ppo_{mode}.pth'
    
    if not os.path.exists(model_path):
        print(f"Skipping {mode}: model not found at {model_path}")
        continue
    
    print(f"\nEvaluating {mode}...")
    
    # Create evaluation config
    eval_cfg = EvaluationConfig.from_config(
        mode=mode,
        model_path=model_path,
        total_timesteps=EVAL_CONFIG['total_timesteps'],
        n_envs=EVAL_CONFIG['n_envs'],
        deterministic=EVAL_CONFIG['deterministic'],
        output_path=f'generated/eval_{mode}.json',
    )
    
    # Run evaluation
    tester = ModelTester(eval_cfg)
    result = tester.run()
    eval_results[mode] = result
    
    # Clean up
    del tester
    torch.cuda.empty_cache()

print("\nEvaluation complete!")

In [None]:
# Comparison table
if len(eval_results) >= 1:
    import pandas as pd
    
    # Build comparison dataframe from aggregates
    comparison = []
    for mode, result in eval_results.items():
        agg = result.get('aggregates')
        if agg:
            comparison.append({
                'Mode': mode,
                'Deaths': agg['n_deaths'],
                'Survival (mean)': f"{agg['survival_mean']:.0f}",
                'Food/Death': f"{agg['food_per_death_mean']:.1f}",
                'Poison/Death': f"{agg['poison_per_death_mean']:.1f}",
                'Efficiency': f"{agg['overall_efficiency']:.1%}",
                'Deaths/1k Steps': f"{agg['deaths_per_1k_steps']:.2f}",
            })
    
    if comparison:
        df = pd.DataFrame(comparison)
        print("\nComparison of trained agents:")
        display(df)

In [None]:
# Goodhart's Law demonstration
if 'ground_truth' in eval_results and 'proxy' in eval_results:
    gt_agg = eval_results['ground_truth'].get('aggregates')
    px_agg = eval_results['proxy'].get('aggregates')
    
    if gt_agg and px_agg:
        print("\n" + "="*60)
        print("GOODHART'S LAW DEMONSTRATION")
        print("="*60)
        
        survival_diff = gt_agg['survival_mean'] - px_agg['survival_mean']
        efficiency_diff = gt_agg['overall_efficiency'] - px_agg['overall_efficiency']
        poison_diff = px_agg['poison_per_death_mean'] - gt_agg['poison_per_death_mean']
        
        print(f"\nThe proxy agent optimizes for 'interestingness' instead of energy.")
        print(f"")
        print(f"Results:")
        print(f"  Survival gap: {survival_diff:+.0f} steps (positive = ground truth lives longer)")
        print(f"  Efficiency gap: {efficiency_diff:+.1%} (positive = ground truth eats a higher share of food)")
        print(f"  Poison gap: {poison_diff:+.1f} per death (positive = proxy eats more poison)")
        print(f"")
        print(f"The proxy metric (interestingness) failed as a target.")
        print(f"When the agent optimized for it, it ate MORE poison because")
        print(f"poison is configured to be MORE interesting than food.")
        print(f"")
        print(f"This is Goodhart's Law in action: optimizing for a proxy metric")
        print(f"(interestingness) led to worse outcomes on the true objective (survival).")

---
## 6. Visualization

Plot evaluation results and compare agent behaviors.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Plot survival comparison
if len(eval_results) >= 2:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    modes = list(eval_results.keys())
    colors = ['#16c79a', '#ff6b6b', '#ffa500', '#00d9ff']
    
    # 1. Survival time comparison
    ax = axes[0]
    valid_modes = [m for m in modes if eval_results[m].get('aggregates')]
    survivals = [eval_results[m]['aggregates']['survival_mean'] for m in valid_modes]
    survival_stds = [eval_results[m]['aggregates']['survival_std'] for m in valid_modes]
    
    bars = ax.bar(valid_modes, survivals, color=colors[:len(valid_modes)], yerr=survival_stds, capsize=5)
    ax.set_ylabel('Steps')
    ax.set_title('Mean Survival Time')
    ax.set_ylim(0, max(survivals, default=1) * 1.3)  # default avoids ValueError when no aggregates exist
    
    # 2. Efficiency comparison
    ax = axes[1]
    efficiencies = [eval_results[m]['aggregates']['overall_efficiency'] * 100 for m in valid_modes]
    
    bars = ax.bar(valid_modes, efficiencies, color=colors[:len(valid_modes)])
    ax.set_ylabel('Efficiency (%)')
    ax.set_title('Food Efficiency (food / total consumed)')
    ax.set_ylim(0, 100)
    ax.axhline(y=50, color='gray', linestyle='--', alpha=0.5, label='Random baseline')
    ax.legend()
    
    # 3. Consumption comparison
    ax = axes[2]
    food_rates = [eval_results[m]['aggregates']['food_per_1k_steps'] for m in valid_modes]
    poison_rates = [eval_results[m]['aggregates']['poison_per_1k_steps'] for m in valid_modes]
    
    x = np.arange(len(valid_modes))
    width = 0.35
    
    ax.bar(x - width/2, food_rates, width, label='Food', color='#16c79a')
    ax.bar(x + width/2, poison_rates, width, label='Poison', color='#ff6b6b')
    ax.set_ylabel('Per 1000 Steps')
    ax.set_title('Consumption Rates')
    ax.set_xticks(x)
    ax.set_xticklabels(valid_modes)
    ax.legend()
    
    plt.tight_layout()
    plt.show()
else:
    print("Need at least 2 evaluated models for comparison plots.")

In [None]:
# Plot survival time distributions
if len(eval_results) >= 1:
    fig, ax = plt.subplots(figsize=(10, 5))
    
    for mode, result in eval_results.items():
        deaths = result.get('deaths', [])
        if deaths:
            survival_times = [d['survival_time'] for d in deaths]
            ax.hist(survival_times, bins=30, alpha=0.6, label=f'{mode} (n={len(deaths)})')
    
    ax.set_xlabel('Survival Time (steps)')
    ax.set_ylabel('Frequency')
    ax.set_title('Survival Time Distribution')
    ax.legend()
    plt.show()

---
## 7. Download Models

Download trained models to your local machine.
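
When several modes have been trained, bundling them into one archive can be more convenient than downloading each file. The following is a sketch using only the standard library; the helper name `archive_models` and the default archive name are assumptions for illustration.

```python
import shutil
from pathlib import Path


def archive_models(models_dir="models", archive_base="goodharts_models"):
    """Zip the contents of models_dir and return the archive path."""
    models_dir = Path(models_dir)
    if not any(models_dir.glob("*.pth")):
        raise FileNotFoundError(f"No .pth files found in {models_dir}")
    # make_archive appends the .zip extension and returns the archive path.
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(models_dir))


# In Colab, the archive could then be downloaded in one step:
# from google.colab import files
# files.download(archive_models())
```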

In [None]:
# Download a specific model
from google.colab import files

# Uncomment to download:
# files.download('models/ppo_ground_truth.pth')
# files.download('models/ppo_proxy.pth')

---
## Notes

### Training Tips
- **Memory errors (OOM):** Reduce `n_envs` or increase `n_minibatches`
- **Slow training:** Enable `compile_models` (slower startup, faster training)
- **Better results:** Increase `total_timesteps` to 1M+
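
The OOM advice follows from simple arithmetic, sketched here under the assumption that the trainer splits each rollout evenly into minibatches, as is common in PPO implementations. The rollout length `n_steps=128` is a hypothetical value chosen for illustration; it is not read from `config.default.toml`.

```python
def minibatch_size(n_envs, n_steps, n_minibatches):
    """Samples per gradient step when a rollout is split evenly."""
    batch = n_envs * n_steps
    if batch % n_minibatches != 0:
        raise ValueError("rollout batch must divide evenly into minibatches")
    return batch // n_minibatches


# n_steps=128 is a hypothetical rollout length chosen for illustration.
for n_envs, n_mb in [(128, 4), (64, 4), (128, 8)]:
    print(n_envs, n_mb, minibatch_size(n_envs, 128, n_mb))
# 128 4 4096
# 64 4 2048
# 128 8 2048
```

Under that assumption, halving `n_envs` or doubling `n_minibatches` both halve the number of samples held in a single gradient step.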

### Understanding the Modes
- **ground_truth:** Agent sees actual cell types (food vs poison). Should learn to survive.
- **proxy:** Agent sees only "interestingness" values. Will learn to seek poison (more interesting than food).
- **ground_truth_handhold:** Ground truth with scaled rewards. Easier learning.
- **ground_truth_blinded:** Proxy observations but real energy reward. Control condition.

### The Goodhart Demonstration
When you train both `ground_truth` and `proxy` modes:
1. Ground truth agent learns to eat food and avoid poison (survives longer)
2. Proxy agent optimizes for interestingness (eats poison, dies faster)

This demonstrates Goodhart's Law: the proxy metric (interestingness) fails catastrophically when optimized directly.
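
The mechanism can be shown without any training. In this toy sketch the cell values are hypothetical (the real ones live in the project configuration); the only property carried over from the simulation is that poison is more interesting than food while costing energy.

```python
# Hypothetical cell values: poison is more "interesting" than food but costs energy.
CELLS = {
    "food":   {"interestingness": 0.5, "energy": +1.0},
    "poison": {"interestingness": 1.0, "energy": -1.0},
    "empty":  {"interestingness": 0.0, "energy":  0.0},
}


def greedy_choice(signal):
    """Pick the cell type that maximizes the given signal."""
    return max(CELLS, key=lambda name: CELLS[name][signal])


def energy_after(signal, n_steps=10):
    """Total true energy gained by always following one signal."""
    return n_steps * CELLS[greedy_choice(signal)]["energy"]


print(greedy_choice("energy"), energy_after("energy"))                    # food 10.0
print(greedy_choice("interestingness"), energy_after("interestingness"))  # poison -10.0
```

Both agents optimize perfectly; only the one whose target matches the true objective gains energy.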

### Key Metrics
- **Survival time:** How long agents live before dying (starvation)
- **Efficiency:** food / (food + poison) - the key Goodhart metric
- **Deaths per 1k steps:** Population death rate
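
As a worked example, the two ratio metrics can be computed from raw counts. This sketch assumes that "steps" in the death rate means total agent-steps (`n_envs` times steps per environment); the evaluator's exact definition may differ, and the function names are illustrative.

```python
def efficiency(food, poison):
    """food / (food + poison); NaN when nothing was eaten."""
    total = food + poison
    return food / total if total else float("nan")


def deaths_per_1k_steps(n_deaths, n_envs, steps_per_env):
    """Death events per 1000 agent-steps across all environments."""
    return 1000 * n_deaths / (n_envs * steps_per_env)


print(efficiency(30, 10))                  # 0.75
print(deaths_per_1k_steps(32, 64, 1000))   # 0.5
```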

### TensorBoard Metrics
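- **loss/policy, loss/value:** PPO training losses (value loss should trend down; the policy loss is a clipped surrogate that typically fluctuates near zero rather than decreasing steadily)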
- **metrics/entropy:** Policy entropy (should decrease as policy becomes confident)
- **metrics/food_ratio:** Curriculum-invariant efficiency metric
- **reward/episode:** Episode rewards (mode-specific, affected by curriculum)
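
To make the entropy trend concrete, here is a small sketch comparing an undecided policy with a confident one. The four-action distributions are hypothetical; the agent's actual action space is not assumed here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # untrained policy: no preference
peaked = [0.97, 0.01, 0.01, 0.01]    # confident policy: one dominant action

print(f"{entropy(uniform):.3f}")  # 1.386 (= ln 4, the maximum for four actions)
print(f"{entropy(peaked):.3f}")   # much lower than the uniform case
```

A curve that collapses toward zero very early can indicate premature convergence, so the rate of decline is worth watching alongside the rewards.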