# SAPO Experiment: Config 2 (4 Local / 4 External) **BEST**

This notebook replicates **Configuration 2** from the SAPO paper - **THE OPTIMAL CONFIGURATION**:
- **I = 4**: Generate 4 rollouts locally per round
- **J = 4**: Fetch 4 external rollouts from swarm peers
- **G = 8**: Generate 8 completions per question
- **Rounds = 2000**: Train for 2000 rounds

This configuration splits each round's training set evenly between local exploration and swarm collaboration: 4 of the 8 rollouts (50%) come from external peers.

**Expected Performance:**
- Cumulative reward after 2000 rounds: ~1093
- **+94% improvement** over baseline (no sharing)
- **BEST configuration** in the paper

**Why This Works Best:**
- 50/50 split balances local innovation with swarm diversity
- External rollouts provide diverse strategies without overwhelming local learning
- Sweet spot between exploration and exploitation

**Setup Requirements:**
1. Run ONE coordinator node (set `NODE_ROLE = 'coordinator'`)
2. Run 7+ worker nodes (set `NODE_ROLE = 'worker'`, each with a unique `NODE_ID`)
3. All nodes must use the same `EXPERIMENT_NAME`
4. All nodes must use the same Google Drive account
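
The setup steps above amount to one shared experiment name plus a distinct role and ID per notebook. A minimal sketch of the per-node settings for an 8-node swarm follows. `plan_nodes` is a hypothetical helper, not part of the rl-swarm repository, and the `node_<k>` naming only mirrors the convention used in the configuration cell.

```python
def plan_nodes(experiment_name, num_nodes=8):
    """Return one settings dict per node: node_0 coordinates, the rest are workers.

    Hypothetical helper for illustration only; copy the values into each
    notebook's configuration cell by hand.
    """
    if num_nodes < 1:
        raise ValueError("need at least one node (the coordinator)")
    nodes = []
    for k in range(num_nodes):
        nodes.append({
            'EXPERIMENT_NAME': experiment_name,  # identical on every node
            'NODE_ROLE': 'coordinator' if k == 0 else 'worker',
            'NODE_ID': f'node_{k}',              # unique per node
        })
    return nodes


for settings in plan_nodes('sapo_config2_4loc4ext'):
    print(settings)
```

Printing the plan before launching makes it easy to spot a duplicated `NODE_ID` or a second coordinator.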

**Paper Reference:** arXiv:2509.08721 - SAPO (Section 5.2, Table 1, Row 3 - BEST RESULT)

## 1. Configuration

In [None]:
# Experiment Configuration
EXPERIMENT_NAME = 'sapo_config2_4loc4ext'  # MUST BE SAME ACROSS ALL NODES
NODE_ROLE = 'coordinator'  # 'coordinator' for first node, 'worker' for others
NODE_ID = 'node_0'  # MUST BE UNIQUE (node_0, node_1, node_2, etc.)

# Model Configuration
MODEL_NAME = 'Gensyn/Qwen2.5-0.5B-Instruct'  # Same as paper
SEED = 42  # For reproducibility

# SAPO Configuration (Config 2: 4/4 - BEST)
NUM_TRAIN_SAMPLES = 4        # I: Local rollouts per round
NUM_TRANSPLANT_TREES = 4     # J: External rollouts from swarm
NUM_GENERATIONS = 8          # G: Completions per question
MAX_ROUNDS = 2000            # Train for 2000 rounds (same as paper)

# Coordinator Configuration (only used if NODE_ROLE='coordinator')
ADVANCEMENT_STRATEGY = 'hybrid'  # 'time_based', 'completion_based', or 'hybrid'
ROUND_DURATION_MINUTES = 15      # How long to wait for peers per round
MIN_SUBMISSION_PERCENT = 0.5     # Minimum fraction of peers that must submit before advancing (0.5 = 50%)
MAX_ROUND_DURATION_MINUTES = 30  # Maximum wait time

# Rollout Sharing Configuration
ROLLOUT_PUBLISH_FREQUENCY = 'stage'  # When to share rollouts
ROLLOUT_CLEANUP_ENABLED = True       # Enable cleanup (2000 rounds = lots of data)
ROLLOUT_KEEP_LAST_N_ROUNDS = 20      # Keep recent rollouts only
ROLLOUT_ARCHIVE_OLD = False          # Don't archive (saves space)

# Optional: HuggingFace Token
HUGGINGFACE_TOKEN = None  # Set to your token or keep None

# Optional: Wandb Configuration
WANDB_API_KEY = None  # Set to your Wandb API key or keep None
WANDB_PROJECT = 'sapo-replication'

print(f"✓ Experiment: {EXPERIMENT_NAME}")
print(f"✓ Node Role: {NODE_ROLE}")
print(f"✓ Node ID: {NODE_ID}")
print(f"✓ Configuration: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
print(f"✓ Model: {MODEL_NAME}")
print(f"✓ Max Rounds: {MAX_ROUNDS}")
print()
print("🏆 Config 2: OPTIMAL swarm collaboration (50% external)")
print("   Expected cumulative reward: ~1093 (+94% vs baseline)")
print("   This is the BEST configuration from the paper!")
print()
if NODE_ROLE == 'coordinator':
    print("📡 Running as COORDINATOR - will manage round progression")
else:
    print("👷 Running as WORKER - will follow coordinator")

## 2. Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Set base path (MUST BE SAME ACROSS ALL NODES)
GDRIVE_BASE_PATH = '/content/drive/MyDrive/rl-swarm'
os.makedirs(GDRIVE_BASE_PATH, exist_ok=True)

print(f"✓ Google Drive mounted at: {GDRIVE_BASE_PATH}")

# Check if experiment exists (for workers)
if NODE_ROLE == 'worker':
    experiment_path = os.path.join(GDRIVE_BASE_PATH, 'experiments', EXPERIMENT_NAME)
    if not os.path.exists(experiment_path):
        print(f"⚠️  Experiment '{EXPERIMENT_NAME}' not found!")
        print(f"   Expected at: {experiment_path}")
        print()
        print("Make sure:")
        print("  1. Coordinator is running")
        print("  2. Coordinator has initialized the experiment (section 4)")
        print("  3. EXPERIMENT_NAME matches the coordinator")
        raise FileNotFoundError(f"Experiment not found: {EXPERIMENT_NAME}")
    else:
        print(f"✓ Found experiment: {EXPERIMENT_NAME}")

## 3. System Setup & Dependencies

In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  No GPU detected - training will be slow")
    print("  Consider: Runtime > Change runtime type > GPU")

In [None]:
# Clone repository
import os

# Change to safe directory first
%cd /content

# Remove existing directory if it exists
if os.path.exists('/content/rl-swarm'):
    print("Removing existing repository...")
    !rm -rf /content/rl-swarm

# Clone fresh copy
print("Cloning repository...")
!git clone https://github.com/Elrashid/rl-swarm.git /content/rl-swarm

# Change to repo directory
%cd /content/rl-swarm

# Verify clone worked
if not os.path.exists('requirements.txt'):
    print("❌ Clone failed! requirements.txt not found")
    raise FileNotFoundError("Repository clone failed")

print("✓ Repository cloned successfully")

# Install dependencies
print("Installing dependencies (this may take 3-5 minutes)...")
!pip install -q -r requirements.txt
!pip install -q gensyn-genrl==0.1.9

print("✓ Dependencies installed")

In [None]:
if WANDB_API_KEY:
    import wandb
    wandb.login(key=WANDB_API_KEY)
    print("✓ Wandb configured")
else:
    print("ℹ️ Wandb disabled (WANDB_API_KEY not set)")

## 4. Initialize Experiment (Coordinator Only)

**⚠️ Run this cell ONLY on the coordinator node!**

Workers should skip this cell - the coordinator will create the experiment structure.
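
The coordinator settings from Section 1 (`ROUND_DURATION_MINUTES`, `MIN_SUBMISSION_PERCENT`, `MAX_ROUND_DURATION_MINUTES`) are written into the experiment at initialization. The notebook does not spell out how the `'hybrid'` strategy combines them, so the sketch below shows only one plausible reading: advance once the normal round duration has passed and enough peers have submitted, or unconditionally at the maximum duration. `should_advance` is a hypothetical function, and the repository's actual rule may differ.

```python
def should_advance(elapsed_minutes, submitted, total_peers,
                   round_duration_minutes=15,
                   min_submission_percent=0.5,
                   max_round_duration_minutes=30):
    """Assumed 'hybrid' rule (illustration only, not the repository's code)."""
    # Hard cap: never wait longer than the maximum round duration.
    if elapsed_minutes >= max_round_duration_minutes:
        return True
    # Quorum: at least the minimum fraction of peers has submitted rollouts.
    quorum = total_peers > 0 and submitted / total_peers >= min_submission_percent
    # Otherwise require both the normal round duration and the quorum.
    return elapsed_minutes >= round_duration_minutes and quorum


print(should_advance(elapsed_minutes=15, submitted=4, total_peers=8))
print(should_advance(elapsed_minutes=20, submitted=3, total_peers=8))
```

Under this assumed rule, the first call advances (15 minutes elapsed, 4 of 8 peers submitted) and the second keeps waiting until the 30-minute cap.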

In [None]:
if NODE_ROLE == 'coordinator':
    from rgym_exp.utils.experiment_manager import init_experiment
    
    # Initialize experiment structure in Google Drive
    config_overrides = {
        'training.max_round': MAX_ROUNDS,
        'training.num_generations': NUM_GENERATIONS,
        'training.num_transplant_trees': NUM_TRANSPLANT_TREES,
        'training.num_train_samples': NUM_TRAIN_SAMPLES,
        'training.seed': SEED,
        'coordinator_manager.advancement_strategy': ADVANCEMENT_STRATEGY,
        'coordinator_manager.round_duration_minutes': ROUND_DURATION_MINUTES,
        'coordinator_manager.min_submission_percent': MIN_SUBMISSION_PERCENT,
        'coordinator_manager.max_round_duration_minutes': MAX_ROUND_DURATION_MINUTES,
    }
    
    init_experiment(
        gdrive_base_path=GDRIVE_BASE_PATH,
        experiment_name=EXPERIMENT_NAME,
        config_overrides=config_overrides
    )
    
    print(f"✓ Experiment initialized: {EXPERIMENT_NAME}")
    print(f"  Path: {GDRIVE_BASE_PATH}/experiments/{EXPERIMENT_NAME}")
    print(f"  Config: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
    print()
    print("✓ Workers can now join this experiment!")
else:
    print("ℹ️ Skipping initialization (worker node)")
    print("  Coordinator will create the experiment structure")

## 5. Set Environment Variables

In [None]:
import os
import uuid

# Set environment variables
os.environ['GDRIVE_PATH'] = GDRIVE_BASE_PATH
os.environ['EXPERIMENT_NAME'] = EXPERIMENT_NAME
os.environ['NODE_ROLE'] = NODE_ROLE
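# Resolve the node ID once so later cells (which read NODE_ID) match the environment
NODE_ID = NODE_ID or f"node_{uuid.uuid4().hex[:8]}"
os.environ['NODE_ID'] = NODE_ID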
os.environ['MODEL_NAME'] = MODEL_NAME
os.environ['SEED'] = str(SEED)

# SAPO configuration
os.environ['NUM_TRAIN_SAMPLES'] = str(NUM_TRAIN_SAMPLES)
os.environ['NUM_TRANSPLANT_TREES'] = str(NUM_TRANSPLANT_TREES)
os.environ['NUM_GENERATIONS'] = str(NUM_GENERATIONS)
os.environ['MAX_ROUNDS'] = str(MAX_ROUNDS)

# Rollout configuration
os.environ['ROLLOUT_PUBLISH_FREQUENCY'] = ROLLOUT_PUBLISH_FREQUENCY
os.environ['ROLLOUT_CLEANUP_ENABLED'] = str(ROLLOUT_CLEANUP_ENABLED)
os.environ['ROLLOUT_KEEP_LAST_N_ROUNDS'] = str(ROLLOUT_KEEP_LAST_N_ROUNDS)
os.environ['ROLLOUT_ARCHIVE_OLD'] = str(ROLLOUT_ARCHIVE_OLD)

if HUGGINGFACE_TOKEN:
    os.environ['HUGGINGFACE_ACCESS_TOKEN'] = HUGGINGFACE_TOKEN

if WANDB_API_KEY:
    os.environ['WANDB_API_KEY'] = WANDB_API_KEY
    os.environ['WANDB_PROJECT'] = WANDB_PROJECT

print("✓ Environment variables set")
print(f"  Node ID: {os.environ['NODE_ID']}")
print(f"  Role: {NODE_ROLE}")
print(f"  Config: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")

## 6. Start Training

**This cell will run for ~24-48 hours (2000 rounds).**

The training will:
- Generate 4 local rollouts per round
- Fetch 4 external rollouts from swarm peers (50% external!)
- Train using GRPO algorithm with optimal local/external balance
- Share rollouts with other nodes after each stage
- Save checkpoints every 10 rounds
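
The first two steps above can be sketched as follows: each round's training set is the node's own I rollouts plus J rollouts drawn from what peers have shared. `build_training_set` is a hypothetical function, and uniform random sampling is an assumption; the actual launcher may filter or weight external rollouts differently.

```python
import random


def build_training_set(local_rollouts, swarm_rollouts,
                       num_local=4, num_external=4, rng=None):
    """Combine I local rollouts with J rollouts sampled from the swarm pool.

    Illustration only: uniform sampling without replacement is assumed.
    """
    rng = rng or random.Random()
    if len(local_rollouts) < num_local:
        raise ValueError("not enough local rollouts")
    local = list(local_rollouts[:num_local])
    # If peers have shared fewer than J rollouts, take what is available.
    k = min(num_external, len(swarm_rollouts))
    external = rng.sample(list(swarm_rollouts), k)
    return local + external


local = [f"local_{i}" for i in range(4)]
swarm = [f"peer{p}_{i}" for p in range(1, 8) for i in range(4)]
batch = build_training_set(local, swarm, rng=random.Random(42))
n_external = sum(r.startswith("peer") for r in batch)
print(len(batch), "rollouts,", n_external, "external")  # 8 rollouts, 4 external
```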

**This is the BEST configuration - expect highest performance!**

**Monitor progress:**
- Use `EX12.02.RL_Swarm_Monitoring.ipynb` in a separate tab
- Check peer discovery: Should see 8+ active peers
- Watch for +94% improvement over baseline

**Press stop button to gracefully shutdown.**

In [None]:
from rgym_exp.utils.notebook_utils import run_with_live_output
import sys

print("="*60)
print(f"Starting SAPO Config 2 Experiment (BEST)")
print(f"Configuration: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
print(f"Node: {NODE_ID} ({NODE_ROLE})")
print(f"Experiment: {EXPERIMENT_NAME}")
print(f"Model: {MODEL_NAME}")
print(f"Max Rounds: {MAX_ROUNDS}")
print("="*60)
print()

# Run training with live output
exit_code = run_with_live_output([
    sys.executable, '-m', 'rgym_exp.runner.swarm_launcher'
])

if exit_code == -1:
    print("\n⚠️  Training interrupted by user")
elif exit_code != 0:
    print(f"\n❌ Training exited with code: {exit_code}")
else:
    print(f"\n✅ Training completed successfully")
    print(f"   Total rounds: {MAX_ROUNDS}")
    print(f"   Expected cumulative reward: ~1093 (+94% vs baseline)")
    print(f"   🏆 This is the BEST configuration!")

## 7. View Results

In [None]:
from rgym_exp.utils.experiment_manager import get_experiment_status, get_experiment_metrics
import pandas as pd

# Get current status
status = get_experiment_status(GDRIVE_BASE_PATH, EXPERIMENT_NAME)

print(f"Experiment: {EXPERIMENT_NAME}")
print(f"Configuration: Config 2 (I=4, J=4, G=8) - BEST")
print(f"Current Round: {status.get('current_round', 0)} / {MAX_ROUNDS}")
print(f"Active Peers: {status.get('active_peers', 0)}")
print()

# Load and display metrics for this node
try:
    df = get_experiment_metrics(GDRIVE_BASE_PATH, EXPERIMENT_NAME)
    if not df.empty:
        # Filter to this node
        node_df = df[df['node_id'] == NODE_ID]
        if not node_df.empty:
            cumulative_reward = node_df['my_reward'].sum()
            print(f"Cumulative Reward ({NODE_ID}): {cumulative_reward:.2f}")
            print(f"Expected (paper): ~1093")
            print(f"Baseline: ~562")
            improvement = ((cumulative_reward / 562) - 1) * 100 if cumulative_reward > 0 else 0
            print(f"Improvement: {improvement:+.1f}%")
            print()
            
            # Show recent rounds
            print("Recent rounds (last 10):")
            recent = node_df.tail(10)
            print(recent[['round', 'stage', 'my_reward']].to_string(index=False))
        else:
            print(f"No metrics for {NODE_ID} yet")
    else:
        print("No metrics available yet")
except Exception as e:
    print(f"Could not load metrics: {e}")

## 8. Resume Training (If Disconnected)

If your Colab session disconnects:
1. Re-run sections 1-3 (keep the same `EXPERIMENT_NAME` and `NODE_ID`)
2. Skip section 4 (initialization is already done)
3. Re-run sections 5-6 (environment variables and training)
4. The system will automatically resume from the last checkpoint
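
Before skipping initialization, it can help to confirm that the experiment directory is really present on the mounted drive. The sketch below assumes only the `<base>/experiments/<name>` layout that the worker check in Section 2 uses. `experiment_initialized` is a hypothetical helper; it checks that the directory exists and says nothing about whether a usable checkpoint is inside.

```python
import os


def experiment_initialized(gdrive_base_path, experiment_name):
    """Return True if <base>/experiments/<name> exists as a directory."""
    path = os.path.join(gdrive_base_path, 'experiments', experiment_name)
    return os.path.isdir(path)


# Example with the notebook's default values (False unless the drive is mounted
# and the coordinator has already initialized the experiment).
print(experiment_initialized('/content/drive/MyDrive/rl-swarm', 'sapo_config2_4loc4ext'))
```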

## Notes

### SAPO Config 2: OPTIMAL Balance 🏆

This experiment uses:
- **I = 4**: 4 local rollouts generated per round
- **J = 4**: 4 external rollouts fetched from swarm
- **G = 8**: 8 completions generated per question
- **Total rollouts per round**: 8 (4 local + 4 external)
- **External ratio**: 50% (4/8) - PERFECT BALANCE

### Why This Is Best

The 50/50 split achieves optimal balance:
1. **Sufficient local exploration**: 4 local rollouts allow discovering new strategies
2. **Maximum diversity**: 4 external rollouts provide varied perspectives from swarm
3. **No overwhelming**: External experience doesn't drown out local learning
4. **Stable training**: Balanced approach prevents instability (unlike 2/6 config)

### Expected Results

From the SAPO paper (Table 1):
- **Cumulative reward after 2000 rounds**: ~1093
- **Improvement over baseline**: +94% (baseline: ~562)
- **BEST among all configurations tested**

### Performance Comparison

| Config | I/J Split | External % | Cumulative Reward | Improvement |
|--------|-----------|------------|-------------------|-------------|
| Baseline | 8/0 | 0% | ~562 | - |
| Config 1 | 6/2 | 25% | ~854 | +52% |
| **Config 2** | **4/4** | **50%** | **~1093** | **+94%** |
| Config 3 | 2/6 | 75% | ~946 | +68% |

Notice how 4/4 outperforms even 2/6 despite less external data - balance matters!
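
The External % and Improvement columns follow directly from the I/J split and the approximate rewards in the table. A short sketch of that arithmetic, using only the table's own values (the helper names are illustrative):

```python
BASELINE_REWARD = 562  # approximate cumulative reward with no sharing (8/0)

configs = {
    'Config 1': {'I': 6, 'J': 2, 'reward': 854},
    'Config 2': {'I': 4, 'J': 4, 'reward': 1093},
    'Config 3': {'I': 2, 'J': 6, 'reward': 946},
}


def external_share(i, j):
    """Fraction of a round's rollouts that come from the swarm: J / (I + J)."""
    return j / (i + j)


def improvement_pct(reward, baseline=BASELINE_REWARD):
    """Percentage gain in cumulative reward relative to the baseline."""
    return (reward / baseline - 1) * 100


for name, c in configs.items():
    print(f"{name}: {external_share(c['I'], c['J']):.0%} external, "
          f"{improvement_pct(c['reward']):+.0f}% vs baseline")
# Config 1: 25% external, +52% vs baseline
# Config 2: 50% external, +94% vs baseline
# Config 3: 75% external, +68% vs baseline
```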

### Key Insights

1. **More external ≠ better**: 2/6 (75% external) performs worse than 4/4 (50%)
2. **Balance is critical**: Need both local innovation AND swarm diversity
3. **Swarm amplifies learning**: +94% improvement shows power of collaboration
4. **Reproducibility**: A faithful replication should again rank this config first among the four tested

### Multi-Node Setup

For authentic replication:
- Run 8 nodes total (1 coordinator + 7 workers)
- All nodes benefit equally from swarm sharing
- Each node contributes 4 rollouts, receives 4 from others
- Collective intelligence emerges from collaboration