# SAPO Experiment: Config 3 (2 Local / 6 External)

This notebook replicates **Configuration 3** from the SAPO paper:
- **I = 2**: Generate only 2 rollouts locally per round
- **J = 6**: Fetch 6 external rollouts from swarm peers
- **G = 8**: Generate 8 completions per question
- **Rounds = 2000**: Train for 2000 rounds

This configuration explores **heavy swarm dependence** (75% external experience).
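
The 75% figure follows directly from I and J. A minimal sketch of the arithmetic (the helper name `external_fraction` is illustrative and not part of the rl-swarm codebase):

```python
def external_fraction(num_local: int, num_external: int) -> float:
    """Share of a round's training rollouts that come from swarm peers."""
    total = num_local + num_external
    if total <= 0:
        raise ValueError("at least one rollout is required")
    return num_external / total


# Config 3: I = 2 local rollouts, J = 6 external rollouts
print(f"{external_fraction(2, 6):.0%}")  # 75%
```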

**Expected Performance:**
- Cumulative reward after 2000 rounds: ~946
- **+68% improvement** over baseline (no sharing)
- **WORSE than 4/4** despite more external data!

**Why This Underperforms 4/4:**
- Too few local rollouts (only 2) limits exploration
- Over-reliance on external experience can cause instability
- Local innovation is critical for swarm diversity
- Demonstrates that "more external ≠ better"

**Setup Requirements:**
1. Run ONE coordinator node (set `NODE_ROLE = 'coordinator'`)
2. Run 7+ worker nodes (set `NODE_ROLE = 'worker'` and give each a unique `NODE_ID`)
3. All nodes must use the same `EXPERIMENT_NAME`
4. All nodes must use the same Google Drive account
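
Before launching, it can help to sanity-check the planned node settings against the requirements above. The sketch below is a stand-alone illustration: `check_swarm_settings` and the dict layout are hypothetical and do not exist in the rl-swarm repository.

```python
VALID_ROLES = {"coordinator", "worker"}


def check_swarm_settings(nodes):
    """Return a list of problems; an empty list means the settings look consistent.

    `nodes` is a list of dicts with the keys 'experiment', 'role', and 'node_id'
    (a hypothetical layout used only for this illustration).
    """
    problems = []
    roles = [n["role"] for n in nodes]
    bad_roles = sorted({r for r in roles if r not in VALID_ROLES})
    if bad_roles:
        problems.append(f"unknown roles: {bad_roles}")
    if roles.count("coordinator") != 1:
        problems.append(f"expected exactly 1 coordinator, found {roles.count('coordinator')}")
    ids = [n["node_id"] for n in nodes]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        problems.append(f"duplicate NODE_IDs: {duplicates}")
    if len({n["experiment"] for n in nodes}) > 1:
        problems.append("EXPERIMENT_NAME differs between nodes")
    return problems


# One coordinator plus seven workers, all in the same experiment
nodes = [{"experiment": "sapo_config3_2loc6ext", "role": "coordinator", "node_id": "node_0"}]
nodes += [
    {"experiment": "sapo_config3_2loc6ext", "role": "worker", "node_id": f"node_{k}"}
    for k in range(1, 8)
]
print(check_swarm_settings(nodes))  # []
```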

**Paper Reference:** arXiv:2509.08721 - SAPO (Section 5.2, Table 1, Row 4)

## 1. Configuration

In [None]:
# Experiment Configuration
EXPERIMENT_NAME = 'sapo_config3_2loc6ext'  # MUST BE SAME ACROSS ALL NODES
NODE_ROLE = 'coordinator'  # 'coordinator' for first node, 'worker' for others
NODE_ID = 'node_0'  # MUST BE UNIQUE (node_0, node_1, node_2, etc.)

# Model Configuration
MODEL_NAME = 'Gensyn/Qwen2.5-0.5B-Instruct'  # Same as paper
SEED = 42  # For reproducibility

# SAPO Configuration (Config 3: 2/6)
NUM_TRAIN_SAMPLES = 2        # I: Local rollouts per round (MINIMAL)
NUM_TRANSPLANT_TREES = 6     # J: External rollouts from swarm (MAXIMUM)
NUM_GENERATIONS = 8          # G: Completions per question
MAX_ROUNDS = 2000            # Train for 2000 rounds (same as paper)

# Coordinator Configuration (only used if NODE_ROLE='coordinator')
ADVANCEMENT_STRATEGY = 'hybrid'  # 'time_based', 'completion_based', or 'hybrid'
ROUND_DURATION_MINUTES = 15      # How long to wait for peers per round
MIN_SUBMISSION_PERCENT = 0.5     # Minimum fraction of peers (0.5 = 50%) before advancing
MAX_ROUND_DURATION_MINUTES = 30  # Maximum wait time

# Rollout Sharing Configuration
ROLLOUT_PUBLISH_FREQUENCY = 'stage'  # When to share rollouts
ROLLOUT_CLEANUP_ENABLED = True       # Enable cleanup (2000 rounds = lots of data)
ROLLOUT_KEEP_LAST_N_ROUNDS = 20      # Keep recent rollouts only
ROLLOUT_ARCHIVE_OLD = False          # Don't archive (saves space)

# Optional: HuggingFace Token
HUGGINGFACE_TOKEN = None  # Set to your token or keep None


print(f"✓ Experiment: {EXPERIMENT_NAME}")
print(f"✓ Node Role: {NODE_ROLE}")
print(f"✓ Node ID: {NODE_ID}")
print(f"✓ Configuration: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
print(f"✓ Model: {MODEL_NAME}")
print(f"✓ Max Rounds: {MAX_ROUNDS}")
print()
print("⚠️  Config 3: Heavy swarm dependence (75% external)")
print("   Expected cumulative reward: ~946 (+68% vs baseline)")
print("   Note: Worse than 4/4 config despite more external data!")
print()
if NODE_ROLE == 'coordinator':
    print("📡 Running as COORDINATOR - will manage round progression")
else:
    print("👷 Running as WORKER - will follow coordinator")
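
The coordinator settings in the cell above do not spell out what `'hybrid'` means. One plausible reading, shown here purely as an assumption, is: advance once the normal round duration has passed and enough peers have submitted, or unconditionally once the maximum duration is reached. The function `should_advance` is hypothetical, and the actual coordinator logic in the repository may differ.

```python
def should_advance(elapsed_minutes, submitted, total_peers,
                   round_duration=15, min_fraction=0.5, max_duration=30):
    """Hypothetical 'hybrid' advancement rule (an assumption, not the repo's code)."""
    if elapsed_minutes >= max_duration:
        return True  # hard cap: never wait longer than the maximum duration
    fraction = submitted / total_peers if total_peers else 0.0
    # Otherwise require both the time condition and the completion condition
    return elapsed_minutes >= round_duration and fraction >= min_fraction


print(should_advance(10, 8, 8))  # False: round duration not reached yet
print(should_advance(15, 4, 8))  # True: 15 minutes passed and 50% submitted
print(should_advance(30, 0, 8))  # True: maximum duration reached
```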

## 2. Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Set base path (MUST BE SAME ACROSS ALL NODES)
GDRIVE_BASE_PATH = '/content/drive/MyDrive/rl-swarm'
os.makedirs(GDRIVE_BASE_PATH, exist_ok=True)

print(f"✓ Google Drive mounted at: {GDRIVE_BASE_PATH}")

# Check if experiment exists (for workers)
if NODE_ROLE == 'worker':
    experiment_path = os.path.join(GDRIVE_BASE_PATH, 'experiments', EXPERIMENT_NAME)
    if not os.path.exists(experiment_path):
        print(f"⚠️  Experiment '{EXPERIMENT_NAME}' not found!")
        print(f"   Expected at: {experiment_path}")
        print()
        print("Make sure:")
        print("  1. Coordinator is running")
        print("  2. Coordinator has initialized the experiment (cell 4)")
        print("  3. EXPERIMENT_NAME matches the coordinator")
        raise FileNotFoundError(f"Experiment not found: {EXPERIMENT_NAME}")
    else:
        print(f"✓ Found experiment: {EXPERIMENT_NAME}")

## 3. System Setup & Dependencies

In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  No GPU detected - training will be slow")
    print("  Consider: Runtime > Change runtime type > GPU")

In [None]:
# Clone repository
import os

# Change to safe directory first
%cd /content

# Remove existing directory if it exists
if os.path.exists('/content/rl-swarm'):
    print("Removing existing repository...")
    !rm -rf /content/rl-swarm

# Clone fresh copy
print("Cloning repository...")
!git clone https://github.com/Elrashid/rl-swarm.git /content/rl-swarm

# Change to repo directory
%cd /content/rl-swarm

# Verify clone worked
if not os.path.exists('requirements.txt'):
    print("❌ Clone failed! requirements.txt not found")
    raise FileNotFoundError("Repository clone failed")

print("✓ Repository cloned successfully")

# Install dependencies
print("Installing dependencies (this may take 3-5 minutes)...")
!pip install -q -r requirements.txt
!pip install -q gensyn-genrl==0.1.9

print("✓ Dependencies installed")

## 4. Initialize Experiment (Coordinator Only)

**⚠️ Run this cell ONLY on the coordinator node!**

Workers should skip this cell - the coordinator will create the experiment structure.

In [None]:
if NODE_ROLE == 'coordinator':
    from rgym_exp.utils.experiment_manager import init_experiment
    
    # Initialize experiment structure in Google Drive
    config_overrides = {
        'training.max_round': MAX_ROUNDS,
        'training.num_generations': NUM_GENERATIONS,
        'training.num_transplant_trees': NUM_TRANSPLANT_TREES,
        'training.num_train_samples': NUM_TRAIN_SAMPLES,
        'training.seed': SEED,
        'coordinator_manager.advancement_strategy': ADVANCEMENT_STRATEGY,
        'coordinator_manager.round_duration_minutes': ROUND_DURATION_MINUTES,
        'coordinator_manager.min_submission_percent': MIN_SUBMISSION_PERCENT,
        'coordinator_manager.max_round_duration_minutes': MAX_ROUND_DURATION_MINUTES,
    }
    
    init_experiment(
        gdrive_base_path=GDRIVE_BASE_PATH,
        experiment_name=EXPERIMENT_NAME,
        config_overrides=config_overrides
    )
    
    print(f"✓ Experiment initialized: {EXPERIMENT_NAME}")
    print(f"  Path: {GDRIVE_BASE_PATH}/experiments/{EXPERIMENT_NAME}")
    print(f"  Config: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
    print()
    print("✓ Workers can now join this experiment!")
else:
    print("ℹ️ Skipping initialization (worker node)")
    print("  Coordinator will create the experiment structure")

## 5. Set Environment Variables

In [None]:
import os
import uuid

# Set environment variables
os.environ['GDRIVE_PATH'] = GDRIVE_BASE_PATH
os.environ['EXPERIMENT_NAME'] = EXPERIMENT_NAME
os.environ['NODE_ROLE'] = NODE_ROLE
os.environ['NODE_ID'] = NODE_ID or f"node_{uuid.uuid4().hex[:8]}"
os.environ['MODEL_NAME'] = MODEL_NAME
os.environ['SEED'] = str(SEED)

# SAPO configuration
os.environ['NUM_TRAIN_SAMPLES'] = str(NUM_TRAIN_SAMPLES)
os.environ['NUM_TRANSPLANT_TREES'] = str(NUM_TRANSPLANT_TREES)
os.environ['NUM_GENERATIONS'] = str(NUM_GENERATIONS)
os.environ['MAX_ROUNDS'] = str(MAX_ROUNDS)

# Rollout configuration
os.environ['ROLLOUT_PUBLISH_FREQUENCY'] = ROLLOUT_PUBLISH_FREQUENCY
os.environ['ROLLOUT_CLEANUP_ENABLED'] = str(ROLLOUT_CLEANUP_ENABLED)
os.environ['ROLLOUT_KEEP_LAST_N_ROUNDS'] = str(ROLLOUT_KEEP_LAST_N_ROUNDS)
os.environ['ROLLOUT_ARCHIVE_OLD'] = str(ROLLOUT_ARCHIVE_OLD)

if HUGGINGFACE_TOKEN:
    os.environ['HUGGINGFACE_ACCESS_TOKEN'] = HUGGINGFACE_TOKEN


print("✓ Environment variables set")
print(f"  Node ID: {os.environ['NODE_ID']}")
print(f"  Role: {NODE_ROLE}")
print(f"  Config: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")

## 6. Start Training

**This cell will run for ~24-48 hours (2000 rounds).**

The training will:
- Generate only 2 local rollouts per round (minimal local exploration)
- Fetch 6 external rollouts from swarm peers (75% external!)
- Train using GRPO algorithm with heavy external dependence
- Share rollouts with other nodes after each stage
- Save checkpoints every 10 rounds
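
The first two steps above can be sketched as follows. This is a simplified illustration rather than the repository's implementation: `build_training_set` is a hypothetical helper, and uniform random sampling from the swarm pool is an assumption.

```python
import random


def build_training_set(local_rollouts, swarm_pool, num_local=2, num_external=6, seed=42):
    """Combine I local rollouts with J rollouts sampled from swarm peers."""
    rng = random.Random(seed)
    local = local_rollouts[:num_local]
    # Never request more external rollouts than the pool currently holds
    k = min(num_external, len(swarm_pool))
    external = rng.sample(swarm_pool, k)
    return local + external


local_rollouts = [f"local_{i}" for i in range(2)]
swarm_pool = [f"peer_{i}" for i in range(20)]
batch = build_training_set(local_rollouts, swarm_pool)
print(len(batch))  # 8
```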

**Expected outcome:**
- Good improvement over baseline (+68%)
- BUT worse than 4/4 config despite more external data
- Demonstrates importance of local/external balance

**Monitor progress:**
- Use `EX12.02.RL_Swarm_Monitoring.ipynb` in a separate tab
- Check peer discovery: Should see 8+ active peers

**Press the stop button to shut down gracefully.**

In [None]:
from rgym_exp.utils.notebook_utils import run_with_live_output
import sys

print("="*60)
print(f"Starting SAPO Config 3 Experiment")
print(f"Configuration: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
print(f"Node: {NODE_ID} ({NODE_ROLE})")
print(f"Experiment: {EXPERIMENT_NAME}")
print(f"Model: {MODEL_NAME}")
print(f"Max Rounds: {MAX_ROUNDS}")
print("="*60)
print()

# Run training with live output
exit_code = run_with_live_output([
    sys.executable, '-m', 'rgym_exp.runner.swarm_launcher'
])

if exit_code == -1:
    print("\n⚠️  Training interrupted by user")
elif exit_code != 0:
    print(f"\n❌ Training exited with code: {exit_code}")
else:
    print(f"\n✅ Training completed successfully")
    print(f"   Total rounds: {MAX_ROUNDS}")
    print(f"   Expected cumulative reward: ~946 (+68% vs baseline)")
    print(f"   Note: Config 2 (4/4) achieves +94% with less external data")

## 7. View Results

In [None]:
from rgym_exp.utils.experiment_manager import get_experiment_status, get_experiment_metrics
import pandas as pd

# Get current status
status = get_experiment_status(GDRIVE_BASE_PATH, EXPERIMENT_NAME)

print(f"Experiment: {EXPERIMENT_NAME}")
print(f"Configuration: Config 3 (I=2, J=6, G=8)")
print(f"Current Round: {status.get('current_round', 0)} / {MAX_ROUNDS}")
print(f"Active Peers: {status.get('active_peers', 0)}")
print()

# Load and display metrics for this node
try:
    df = get_experiment_metrics(GDRIVE_BASE_PATH, EXPERIMENT_NAME)
    if not df.empty:
        # Filter to this node
        node_df = df[df['node_id'] == NODE_ID]
        if not node_df.empty:
            cumulative_reward = node_df['my_reward'].sum()
            print(f"Cumulative Reward ({NODE_ID}): {cumulative_reward:.2f}")
            print(f"Expected (paper): ~946")
            print(f"Baseline: ~562")
            print(f"Config 2 (4/4): ~1093")
            # Signed format so a result below baseline prints as "-x%" instead of "+-x%"
            improvement = ((cumulative_reward / 562) - 1) * 100 if cumulative_reward > 0 else 0
            print(f"Improvement vs baseline: {improvement:+.1f}%")
            print()
            
            # Show recent rounds
            print("Recent rounds (last 10):")
            recent = node_df.tail(10)
            print(recent[['round', 'stage', 'my_reward']].to_string(index=False))
        else:
            print(f"No metrics for {NODE_ID} yet")
    else:
        print("No metrics available yet")
except Exception as e:
    print(f"Could not load metrics: {e}")

In [None]:
# === Real-Time Progress Viewer ===
# Run this cell anytime to check progress from GDrive
# Useful if you reconnect after notebook disconnect

import sys
sys.path.append('/content/rl-swarm')

from rgym_exp.utils.progress_tracker import get_experiment_progress

progress = get_experiment_progress(GDRIVE_BASE_PATH, EXPERIMENT_NAME)

print("="*70)
print("📊 REAL-TIME PROGRESS FROM GDRIVE")
print("="*70)
print(f"Experiment: {progress.get('experiment')}")
print()

for node_id, node_data in progress.get('nodes', {}).items():
    if 'error' in node_data:
        print(f"  {node_id}: {node_data['error']}")
    else:
        elapsed_hours = node_data.get('elapsed_seconds', 0) / 3600
        print(f"  {node_id}:")
        print(f"    Latest event: {node_data.get('latest_event')}")
        print(f"    Current round: {node_data.get('latest_round')}")
        print(f"    Elapsed time: {elapsed_hours:.1f} hours")
        print()

print("="*70)
print("Note: Progress updates every round. Logs flush every 30s to GDrive.")

## 7.5. Check Real-Time Progress from GDrive (Optional)

**Reconnected after a disconnect?** Run the progress viewer cell above to check training progress:
- Shows the current round for each node
- Displays elapsed time
- Works even if your notebook disconnected

Progress is saved to GDrive every round, and logs flush every 30 seconds.
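
The per-node round and elapsed time reported by the viewer are enough for a rough estimate of the time remaining. The sketch below assumes a constant pace per round, which real runs will only approximate; `estimate_remaining_hours` is an illustrative helper, not part of `progress_tracker`.

```python
def estimate_remaining_hours(current_round, elapsed_seconds, max_rounds=2000):
    """Extrapolate the remaining wall-clock hours from the average pace so far."""
    if current_round <= 0:
        return None  # no completed rounds yet, so there is no pace to extrapolate
    remaining_rounds = max(max_rounds - current_round, 0)
    return elapsed_seconds * remaining_rounds / current_round / 3600


# Example: 500 rounds finished after 6 hours
print(estimate_remaining_hours(500, 6 * 3600))  # 18.0
```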

## 8. Resume Training (If Disconnected)

If your Colab session disconnects:
1. Re-run sections 1-3 (keep the same `EXPERIMENT_NAME` and `NODE_ID`)
2. Skip section 4 (initialization is already done)
3. Re-run sections 5-6 (environment variables and training)
4. System will automatically resume from last checkpoint
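
Conceptually, resuming means finding the most recent checkpoint and continuing from that round. The sketch below illustrates the idea only: the `round_<n>` naming scheme and the function `latest_checkpoint_round` are hypothetical, and the actual checkpoint layout used by the repository may differ.

```python
import re

# Hypothetical naming scheme: one checkpoint directory per round, e.g. "round_120"
CHECKPOINT_PATTERN = re.compile(r"^round_(\d+)$")


def latest_checkpoint_round(names):
    """Return the highest round number among checkpoint names, or None if there are none."""
    rounds = [int(m.group(1)) for m in map(CHECKPOINT_PATTERN.match, names) if m]
    return max(rounds) if rounds else None


print(latest_checkpoint_round(["round_10", "round_20", "notes.txt", "round_120"]))  # 120
```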

## Notes

### SAPO Config 3: Too Much External?

This experiment uses:
- **I = 2**: Only 2 local rollouts generated per round
- **J = 6**: 6 external rollouts fetched from swarm
- **G = 8**: 8 completions generated per question
- **Total rollouts per round**: 8 (2 local + 6 external)
- **External ratio**: 75% (6/8) - VERY HIGH

### Expected Results

From the SAPO paper (Table 1):
- **Cumulative reward after 2000 rounds**: ~946
- **Improvement over baseline**: +68% (baseline: ~562)
- **WORSE than Config 2 (4/4)** which achieves +94%

### Why More External Can Be Worse

Performance comparison:
```
Config 2 (4/4):  50% external → +94%  ✓ BEST
Config 3 (2/6):  75% external → +68%  ✗ Worse despite more data!
```

Key insights:
1. **Too few local rollouts** (only 2): Limited local exploration and innovation
2. **Over-reliance on external**: Can create unstable training dynamics
3. **Loss of diversity**: If all nodes rely heavily on external, swarm diversity decreases
4. **Local matters**: Each node needs sufficient local exploration to contribute unique strategies

### The Diversity Paradox

A plausible explanation is that when all nodes generate only 2 local rollouts:
- Each node's contribution to the swarm is limited
- External rollouts become more homogeneous over time
- The swarm loses the diversity that made collaboration effective
- The system can converge to suboptimal strategies

### Performance Comparison (All Configs)

| Config | I | J | External % | Cumulative | Improvement |
|--------|---|---|------------|------------|-------------|
| Baseline | 8 | 0 | 0% | ~562 | - |
| Config 1 | 6 | 2 | 25% | ~854 | +52% |
| **Config 2** | **4** | **4** | **50%** | **~1093** | **+94%** |
| Config 3 | 2 | 6 | 75% | ~946 | +68% |
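
The percentages in the table can be reproduced from the cumulative rewards. A small check, using the approximate values listed above (the helper name is illustrative):

```python
# (I, J, approximate cumulative reward) for each configuration in the table
results = {
    "Baseline": (8, 0, 562),
    "Config 1": (6, 2, 854),
    "Config 2": (4, 4, 1093),
    "Config 3": (2, 6, 946),
}


def improvement_over_baseline(reward, baseline):
    """Relative gain over the no-sharing baseline, as a fraction."""
    return reward / baseline - 1


baseline_reward = results["Baseline"][2]
for name, (num_local, num_external, reward) in results.items():
    external = num_external / (num_local + num_external)
    gain = improvement_over_baseline(reward, baseline_reward)
    print(f"{name}: {external:.0%} external, {gain:+.0%} vs baseline")
```

Because the rewards are rounded, the recomputed percentages match the table only to the nearest whole percent.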

### Key Lesson

**Balance is critical!**
- More external data ≠ better performance
- Need sufficient local exploration (in these experiments, 4 local rollouts outperformed 2)
- 50/50 split (Config 2) performed best among the configurations tested
- Local innovation fuels swarm diversity

### When To Use This Config

This configuration is primarily useful for:
- Understanding the importance of local/external balance
- Demonstrating diminishing returns of external dependence
- Research on swarm dynamics and diversity
- Comparison studies (as in the SAPO paper)

**For practical applications, use Config 2 (4/4) instead.**