# SAPO Experiment: Baseline (8 Local / 0 External)

This notebook replicates the **baseline configuration** from the SAPO paper:
- **I = 8**: Generate 8 rollouts locally per round
- **J = 0**: No external rollouts from swarm (no experience sharing)
- **G = 8**: Generate 8 completions per question
- **Rounds = 2000**: Train for 2000 rounds

This is the **control experiment**: training without swarm collaboration.

**Expected Performance:**
- Cumulative reward after 2000 rounds: ~562
- This serves as the baseline for comparison with collaborative configs

**Before running:**
1. Mount your Google Drive
2. Run all cells in order
3. Training will run for ~24-48 hours (2000 rounds)
4. Use `EX12.02.RL_Swarm_Monitoring.ipynb` in a separate tab to track progress

**Paper Reference:** arXiv:2509.08721 - SAPO (Section 5.2, Table 1)

## 1. Configuration

In [None]:
# Experiment Configuration
EXPERIMENT_NAME = 'sapo_baseline_8loc0ext'  # Unique experiment name
NODE_ROLE = 'coordinator'  # Single node acts as coordinator
NODE_ID = 'baseline_node_0'  # Unique node identifier

# Model Configuration
MODEL_NAME = 'Gensyn/Qwen2.5-0.5B-Instruct'  # Same as paper
SEED = 42  # For reproducibility

# SAPO Configuration (Baseline)
NUM_TRAIN_SAMPLES = 8        # I: Local rollouts per round
NUM_TRANSPLANT_TREES = 0     # J: External rollouts (NONE for baseline)
NUM_GENERATIONS = 8          # G: Completions per question
MAX_ROUNDS = 2000            # Train for 2000 rounds (same as paper)

# Rollout Sharing (disabled for baseline)
ROLLOUT_PUBLISH_FREQUENCY = 'never'  # No sharing needed
ROLLOUT_CLEANUP_ENABLED = False
ROLLOUT_KEEP_LAST_N_ROUNDS = 10
ROLLOUT_ARCHIVE_OLD = False

# Optional: HuggingFace Token
HUGGINGFACE_TOKEN = None  # Set to your token or keep None


print(f"✓ Experiment: {EXPERIMENT_NAME}")
print(f"✓ Configuration: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
print(f"✓ Model: {MODEL_NAME}")
print(f"✓ Max Rounds: {MAX_ROUNDS}")
print()
print("⚠️  This is the BASELINE config (no swarm collaboration)")
print("   Expected cumulative reward: ~562 after 2000 rounds")

## 2. Mount Google Drive

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Set base path
GDRIVE_BASE_PATH = '/content/drive/MyDrive/rl-swarm'
os.makedirs(GDRIVE_BASE_PATH, exist_ok=True)

print(f"✓ Google Drive mounted; base path: {GDRIVE_BASE_PATH}")

## 3. System Setup & Dependencies

In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️  No GPU detected - training will be slow")
    print("  Consider: Runtime > Change runtime type > GPU")

In [None]:
# Clone repository
import os

# Change to safe directory first
%cd /content

# Remove existing directory if it exists
if os.path.exists('/content/rl-swarm'):
    print("Removing existing repository...")
    !rm -rf /content/rl-swarm

# Clone fresh copy
print("Cloning repository...")
!git clone https://github.com/Elrashid/rl-swarm.git /content/rl-swarm

# Change to repo directory
%cd /content/rl-swarm

# Verify clone worked
if not os.path.exists('requirements.txt'):
    print("❌ Clone failed! requirements.txt not found")
    raise FileNotFoundError("Repository clone failed")

print("✓ Repository cloned successfully")

# Install dependencies
print("Installing dependencies (this may take 3-5 minutes)...")
!pip install -q -r requirements.txt
!pip install -q gensyn-genrl==0.1.9

print("✓ Dependencies installed")

## 4. Initialize Experiment

In [None]:
from rgym_exp.utils.experiment_manager import init_experiment

# Initialize experiment structure in Google Drive
config_overrides = {
    'training.max_round': MAX_ROUNDS,
    'training.num_generations': NUM_GENERATIONS,
    'training.num_transplant_trees': NUM_TRANSPLANT_TREES,
    'training.num_train_samples': NUM_TRAIN_SAMPLES,
    'training.seed': SEED,
}

init_experiment(
    gdrive_base_path=GDRIVE_BASE_PATH,
    experiment_name=EXPERIMENT_NAME,
    config_overrides=config_overrides
)

print(f"✓ Experiment initialized: {EXPERIMENT_NAME}")
print(f"  Path: {GDRIVE_BASE_PATH}/experiments/{EXPERIMENT_NAME}")
print(f"  Config: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")

## 5. Set Environment Variables

In [None]:
import os
import uuid

# Set environment variables
os.environ['GDRIVE_PATH'] = GDRIVE_BASE_PATH
os.environ['EXPERIMENT_NAME'] = EXPERIMENT_NAME
os.environ['NODE_ROLE'] = NODE_ROLE
os.environ['NODE_ID'] = NODE_ID or f"baseline_{uuid.uuid4().hex[:8]}"
os.environ['MODEL_NAME'] = MODEL_NAME
os.environ['SEED'] = str(SEED)

# SAPO configuration
os.environ['NUM_TRAIN_SAMPLES'] = str(NUM_TRAIN_SAMPLES)
os.environ['NUM_TRANSPLANT_TREES'] = str(NUM_TRANSPLANT_TREES)
os.environ['NUM_GENERATIONS'] = str(NUM_GENERATIONS)
os.environ['MAX_ROUNDS'] = str(MAX_ROUNDS)

# Rollout configuration
os.environ['ROLLOUT_PUBLISH_FREQUENCY'] = ROLLOUT_PUBLISH_FREQUENCY
os.environ['ROLLOUT_CLEANUP_ENABLED'] = str(ROLLOUT_CLEANUP_ENABLED)
os.environ['ROLLOUT_KEEP_LAST_N_ROUNDS'] = str(ROLLOUT_KEEP_LAST_N_ROUNDS)
os.environ['ROLLOUT_ARCHIVE_OLD'] = str(ROLLOUT_ARCHIVE_OLD)

if HUGGINGFACE_TOKEN:
    os.environ['HUGGINGFACE_ACCESS_TOKEN'] = HUGGINGFACE_TOKEN


print("✓ Environment variables set")
print(f"  Node ID: {os.environ['NODE_ID']}")
print(f"  Config: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
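
Because `os.environ` stores only strings, a value such as `ROLLOUT_CLEANUP_ENABLED` reaches the training process as `'False'`, and `bool('False')` is `True` in Python. How the launcher parses these variables is not shown in this notebook; the helper names `env_bool` and `env_int` below are hypothetical, and this is only a sketch of one safe way to read the values back.

```python
import os


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Read an integer from the environment (hypothetical helper)."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


# Mirror the cell above: every value is stored as a string.
os.environ["ROLLOUT_CLEANUP_ENABLED"] = str(False)
os.environ["NUM_TRANSPLANT_TREES"] = str(0)

cleanup_enabled = env_bool("ROLLOUT_CLEANUP_ENABLED")
num_transplant_trees = env_int("NUM_TRANSPLANT_TREES", default=0)

# Pitfall: any non-empty string is truthy, so this evaluates to True.
naive_cleanup = bool(os.environ["ROLLOUT_CLEANUP_ENABLED"])
```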

## 6. Start Training

**This cell will run for ~24-48 hours (2000 rounds).**

The training will:
- Generate 8 local rollouts per round (no external sharing)
- Train using the GRPO algorithm
- Save checkpoints every 10 rounds
- Log metrics to Google Drive

**Monitor progress:**
- Use `EX12.02.RL_Swarm_Monitoring.ipynb` in a separate tab
- Check metrics file: `{GDRIVE_BASE_PATH}/experiments/{EXPERIMENT_NAME}/metrics/`
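
The metrics location above can be resolved programmatically. The sketch below assumes only the directory layout quoted in the list; the function names are hypothetical, and the names of the files inside `metrics/` are not specified in this notebook, so the code simply lists whatever is present.

```python
from pathlib import Path


def metrics_dir(gdrive_base_path: str, experiment_name: str) -> Path:
    """Build the metrics directory path from the layout described above."""
    return Path(gdrive_base_path) / "experiments" / experiment_name / "metrics"


def list_metrics_files(gdrive_base_path: str, experiment_name: str) -> list[str]:
    """Return sorted file names in the metrics directory, or [] if it is missing."""
    directory = metrics_dir(gdrive_base_path, experiment_name)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if p.is_file())


path = metrics_dir("/content/drive/MyDrive/rl-swarm", "sapo_baseline_8loc0ext")
print(path)
print(list_metrics_files("/content/drive/MyDrive/rl-swarm", "sapo_baseline_8loc0ext"))
```

An empty list early in a run most likely means that no metrics have been written yet, or that Drive is not mounted.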

**Press the stop button to shut down gracefully.**

In [None]:
from rgym_exp.utils.notebook_utils import run_with_live_output
import sys

print("="*60)
print("Starting SAPO Baseline Experiment")
print(f"Configuration: I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS}")
print(f"Experiment: {EXPERIMENT_NAME}")
print(f"Model: {MODEL_NAME}")
print(f"Max Rounds: {MAX_ROUNDS}")
print("="*60)
print()

# Run training with live output
exit_code = run_with_live_output([
    sys.executable, '-m', 'rgym_exp.runner.swarm_launcher'
])

if exit_code == -1:
    print("\n⚠️  Training interrupted by user")
elif exit_code != 0:
    print(f"\n❌ Training exited with code: {exit_code}")
else:
    print("\n✅ Training completed successfully")
    print(f"   Total rounds: {MAX_ROUNDS}")
    print("   Expected cumulative reward: ~562")

## 7. View Results

In [None]:
from rgym_exp.utils.experiment_manager import get_experiment_status, get_experiment_metrics
import pandas as pd

# Get current status
status = get_experiment_status(GDRIVE_BASE_PATH, EXPERIMENT_NAME)

print(f"Experiment: {EXPERIMENT_NAME}")
print(f"Configuration: Baseline (I={NUM_TRAIN_SAMPLES}, J={NUM_TRANSPLANT_TREES}, G={NUM_GENERATIONS})")
print(f"Current Round: {status.get('current_round', 0)} / {MAX_ROUNDS}")
print(f"Current Stage: {status.get('current_stage', 0)}")
print()

# Load and display metrics
try:
    df = get_experiment_metrics(GDRIVE_BASE_PATH, EXPERIMENT_NAME)
    if not df.empty:
        # Calculate cumulative reward
        cumulative_reward = df['my_reward'].sum()
        print(f"Cumulative Reward: {cumulative_reward:.2f}")
        print("Expected (paper): ~562")
        print()
        
        # Show recent rounds
        print("Recent rounds (last 10):")
        recent = df.tail(10)
        print(recent[['round', 'stage', 'my_reward']].to_string(index=False))
    else:
        print("No metrics available yet")
except Exception as e:
    print(f"Could not load metrics: {e}")

In [None]:
# === Real-Time Progress Viewer ===
# Run this cell anytime to check progress from GDrive
# Useful if you reconnect after notebook disconnect

import sys
sys.path.append('/content/rl-swarm')

from rgym_exp.utils.progress_tracker import get_experiment_progress

progress = get_experiment_progress(GDRIVE_BASE_PATH, EXPERIMENT_NAME)

print("="*70)
print("📊 REAL-TIME PROGRESS FROM GDRIVE")
print("="*70)
print(f"Experiment: {progress.get('experiment')}")
print()

for node_id, node_data in progress.get('nodes', {}).items():
    if 'error' in node_data:
        print(f"  {node_id}: {node_data['error']}")
    else:
        elapsed_hours = node_data.get('elapsed_seconds', 0) / 3600
        print(f"  {node_id}:")
        print(f"    Latest event: {node_data.get('latest_event')}")
        print(f"    Current round: {node_data.get('latest_round')}")
        print(f"    Elapsed time: {elapsed_hours:.1f} hours")
        print()

print("="*70)
print("Note: Progress updates every round. Logs flush every 30s to GDrive.")

## 7.5. Check Real-Time Progress from GDrive (Optional)

**Reconnected after a disconnect?** Run the progress viewer cell above to check training progress:
- Shows current round for each node
- Displays elapsed time
- Works even if your notebook disconnected

Progress is saved to GDrive every round; logs flush every 30 seconds.
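
The elapsed time and current round reported for each node are enough for a rough estimate of the time remaining. The sketch below assumes a constant pace per round and that `elapsed_seconds` covers every round completed so far, which may not hold after a resumed session; `estimate_remaining_hours` is a hypothetical helper, not part of `rgym_exp`.

```python
from typing import Optional


def estimate_remaining_hours(
    elapsed_seconds: float, latest_round: int, max_rounds: int = 2000
) -> Optional[float]:
    """Extrapolate the remaining wall-clock hours from the average pace so far."""
    if latest_round <= 0:
        return None  # no completed rounds yet, so there is no pace to extrapolate
    seconds_per_round = elapsed_seconds / latest_round
    remaining_rounds = max(max_rounds - latest_round, 0)
    return seconds_per_round * remaining_rounds / 3600


# Example: 500 rounds completed after 6 hours of training.
eta_hours = estimate_remaining_hours(elapsed_seconds=6 * 3600, latest_round=500)
print(f"Estimated time remaining: {eta_hours:.1f} hours")
```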

## 8. Resume Training (If Disconnected)

If your Colab session disconnects:
1. Re-run the cells in sections 1-5 (keep the same `EXPERIMENT_NAME` and `NODE_ID`)
2. Re-run the training cell in section 6
3. System will automatically resume from last checkpoint
4. Training continues from last saved round
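
Since checkpoints are saved every 10 rounds, a disconnect can cost at most a few rounds of work. The sketch below assumes checkpoints fall on multiples of the interval, which this notebook does not confirm; both function names are hypothetical.

```python
CHECKPOINT_INTERVAL = 10  # "Save checkpoints every 10 rounds"


def last_checkpoint_round(last_completed_round: int,
                          interval: int = CHECKPOINT_INTERVAL) -> int:
    """Most recent round at which a checkpoint would have been written
    (assumes checkpoints land on multiples of the interval)."""
    return (last_completed_round // interval) * interval


def rounds_to_redo(last_completed_round: int,
                   interval: int = CHECKPOINT_INTERVAL) -> int:
    """Rounds finished after the last checkpoint that must be repeated on resume."""
    return last_completed_round - last_checkpoint_round(last_completed_round, interval)


# Example: the session dropped after round 1237 finished.
resume_from = last_checkpoint_round(1237)
lost_rounds = rounds_to_redo(1237)
print(f"Resume from round {resume_from}; {lost_rounds} rounds are repeated")
```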

## Notes

### SAPO Baseline Configuration

This experiment uses:
- **I = 8**: 8 local rollouts generated per round
- **J = 0**: No external rollouts (no swarm sharing)
- **G = 8**: 8 completions generated per question
- **Total rollouts per round**: 8 (all local)
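
The relationship between these parameters can be written down as a small data structure. The `SapoConfig` class below is a hypothetical illustration, not a type from the repository; it only encodes that the rollouts used per round are the sum of local (I) and external (J) rollouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SapoConfig:
    """Hypothetical container for the I / J / G settings described above."""
    num_train_samples: int     # I: local rollouts per round
    num_transplant_trees: int  # J: external rollouts per round
    num_generations: int       # G: completions per question

    @property
    def rollouts_per_round(self) -> int:
        return self.num_train_samples + self.num_transplant_trees

    @property
    def external_fraction(self) -> float:
        total = self.rollouts_per_round
        return self.num_transplant_trees / total if total else 0.0


baseline = SapoConfig(num_train_samples=8, num_transplant_trees=0, num_generations=8)
balanced = SapoConfig(num_train_samples=4, num_transplant_trees=4, num_generations=8)

print(baseline.rollouts_per_round, baseline.external_fraction)
print(balanced.rollouts_per_round, balanced.external_fraction)
```

Both configurations use 8 rollouts per round; only the share drawn from the swarm differs.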

This is the **control experiment** for the SAPO paper. It trains a single model without any collaborative experience sharing.

### Expected Results

From the SAPO paper (Table 1):
- **Cumulative reward after 2000 rounds**: ~562
- This serves as the baseline for comparison

### Comparison with Other Configs

The paper shows significant improvements with experience sharing:
- **6 local / 2 external**: +52% improvement (cumulative reward ~854)
- **4 local / 4 external**: +94% improvement (cumulative reward ~1093) **BEST**
- **2 local / 6 external**: +68% improvement (cumulative reward ~946)
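
The percentages above follow directly from the approximate cumulative rewards. A short check, using only the rounded figures quoted in this notebook:

```python
BASELINE_REWARD = 562  # approximate cumulative reward for 8 local / 0 external

# Approximate cumulative rewards quoted in the list above.
collaborative_rewards = {
    "6 local / 2 external": 854,
    "4 local / 4 external": 1093,
    "2 local / 6 external": 946,
}


def relative_improvement(reward: float, baseline: float = BASELINE_REWARD) -> float:
    """Percentage gain of a configuration over the baseline."""
    return (reward - baseline) / baseline * 100


improvements = {
    name: relative_improvement(reward)
    for name, reward in collaborative_rewards.items()
}

for name, pct in improvements.items():
    print(f"{name}: +{pct:.0f}%")
```

The same function can be applied to the cumulative reward measured in this run to see how closely it matches the paper's baseline.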

### Next Steps

After this baseline completes:
1. Run collaborative experiments: `EX12.11`, `EX12.12`, `EX12.13`
2. Compare results using `EX12.20.SAPO_Results_Analysis.ipynb`
3. Reproduce paper's findings on swarm collaboration benefits