A multi-agent benchmark in which six role-specialized LLM agents, built on the Felix Agent SDK, manage a RimWorld colony. Think FLE (Factorio Learning Environment), but for multi-agent coordination under uncertainty.
- 6 agents, not 1 — ResourceManager, DefenseCommander, ResearchDirector, SocialOverseer, ConstructionPlanner, MedicalOfficer coordinate through a hub-spoke communication network
- Stochastic environment — raids, plague, mental breaks, weather. Agents adapt, not just optimize
- Helix-driven strategy — agents shift from exploration (diverse strategies) to synthesis (decisive actions) as the colony progresses
- Provider-agnostic — runs on a free local 4B model or a cloud 120B, same architecture
RimWorld (game)
↕ Harmony patches
RIMAPI mod (C# REST :8765 + SSE events)
↕ httpx async + SSE
RLE Orchestrator (pause → read → deliberate → resolve → execute → score → unpause)
↕ CentralPost hub-spoke
Felix Agent SDK (6 role agents, parallel deliberation)
↕ OpenAI-compatible API
LLM (Nemotron 4B local / 120B cloud / Anthropic / OpenAI)
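The orchestrator's tick cycle (pause → read → deliberate → resolve → execute → score → unpause) can be sketched roughly as follows. This is an illustration only, not the RLE implementation: `StubGame`, `deliberate`, `resolve`, and `run_tick` are hypothetical names, the game client is an in-memory stub instead of RIMAPI, and the LLM call is replaced by a placeholder.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubGame:
    """Hypothetical stand-in for the RIMAPI client; records each step in order."""
    log: list = field(default_factory=list)

    async def pause(self) -> None:
        self.log.append("pause")

    async def read_state(self) -> dict:
        self.log.append("read")
        return {"colonists": 3, "food_days": 4.0}

    async def execute(self, actions: list) -> None:
        self.log.append("execute")

    async def unpause(self) -> None:
        self.log.append("unpause")


async def deliberate(role: str, state: dict) -> list:
    """Placeholder for one agent's LLM call; returns proposed action names."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"{role}:noop"]


def resolve(proposals: list) -> list:
    """Flatten all proposals and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(a for p in proposals for a in p))


async def run_tick(game: StubGame, roles: list) -> list:
    await game.pause()
    try:
        state = await game.read_state()
        # All agents deliberate in parallel on the same state snapshot.
        proposals = await asyncio.gather(*(deliberate(r, state) for r in roles))
        game.log.append("deliberate")
        actions = resolve(list(proposals))
        game.log.append("resolve")
        await game.execute(actions)
        game.log.append("score")  # a real loop would compute metrics here
    finally:
        await game.unpause()  # never leave the game paused if a step fails
    return actions


game = StubGame()
actions = asyncio.run(run_tick(game, ["ResourceManager", "DefenseCommander"]))
```

Pausing before reading state means every agent reasons about the same snapshot, and the `finally` block resumes the game even when a step raises.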
| Agent | Domain | Actions |
|---|---|---|
| ResourceManager | Food, materials, power, hauling | set_work_priority, haul_resource, set_growing_zone, toggle_power |
| DefenseCommander | Raids, drafting, positioning | draft_colonist, undraft_colonist, move_colonist |
| ResearchDirector | Tech tree, researcher assignment | set_research_target, assign_researcher |
| SocialOverseer | Mood, recreation, mental breaks | set_recreation_policy, assign_social_activity |
| ConstructionPlanner | Buildings, walls, repairs | place_blueprint, cancel_blueprint |
| MedicalOfficer | Injuries, disease, medicine | assign_bed_rest, administer_medicine |
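One way to read the table above is as a per-role allowlist: an action proposed by an agent outside its own domain gets dropped before execution. The sketch below assumes that interpretation; `ALLOWED_ACTIONS`, `is_allowed`, and `filter_proposals` are hypothetical names, and the real action schema in `agents/` may enforce this differently.

```python
# Role -> permitted action names, copied from the table above.
ALLOWED_ACTIONS = {
    "ResourceManager": frozenset(
        {"set_work_priority", "haul_resource", "set_growing_zone", "toggle_power"}
    ),
    "DefenseCommander": frozenset(
        {"draft_colonist", "undraft_colonist", "move_colonist"}
    ),
    "ResearchDirector": frozenset({"set_research_target", "assign_researcher"}),
    "SocialOverseer": frozenset({"set_recreation_policy", "assign_social_activity"}),
    "ConstructionPlanner": frozenset({"place_blueprint", "cancel_blueprint"}),
    "MedicalOfficer": frozenset({"assign_bed_rest", "administer_medicine"}),
}


def is_allowed(role: str, action: str) -> bool:
    """True if `action` belongs to `role`'s domain; unknown roles allow nothing."""
    return action in ALLOWED_ACTIONS.get(role, frozenset())


def filter_proposals(role: str, proposed: list) -> list:
    """Keep only the proposed actions the role may issue, preserving order."""
    return [a for a in proposed if is_allowed(role, a)]


kept = filter_proposals("MedicalOfficer", ["assign_bed_rest", "draft_colonist"])
```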
git clone https://github.com/AppSprout-dev/RLE.git
cd RLE
pip install -e ".[dev]"
- Download unsloth/nvidia-nemotron-3-nano-4b (GGUF, Q4_K_M)
- Settings: Flash Attention ON, Context 10000, GPU Offload max, Keep in Memory ON
- Start the server (default port 1234)
# Single scenario against live RimWorld colony (local LM Studio)
python scripts/run_scenario.py crashlanded_survival \
--provider openai \
--model unsloth/nvidia-nemotron-3-nano-4b \
--base-url http://localhost:1234/v1 \
--no-think --visualize --ticks 10
# Same scenario via OpenRouter (cloud, ~$0.01)
OPENAI_API_KEY=<your-key> python scripts/run_scenario.py crashlanded_survival \
--provider openai \
--model nvidia/nemotron-3-super-120b-a12b \
--base-url https://openrouter.ai/api/v1 \
--no-think --visualize --ticks 10
# Full benchmark (mock game state, real LLM)
python scripts/run_benchmark.py \
--provider openai \
--model unsloth/nvidia-nemotron-3-nano-4b \
--base-url http://localhost:1234/v1 \
--ticks 10 --no-think --output results/
# List available scenarios
python scripts/run_scenario.py --list
# Terminal 1: Run RLE with --output to export tick data
python scripts/run_scenario.py crashlanded_survival --output results/live/ ...
# Terminal 2: Serve tick data for dashboard
python scripts/serve_dashboard.py results/live
# Terminal 3: Start dashboard
cd ../rimapi-dashboard && bun run start
# Open http://localhost:3000, add the 5 RLE widgets
Tested across 6 scenarios, 10 ticks each:
| Config | Model | Avg Score | Parse Rate | s/tick | Cost |
|---|---|---|---|---|---|
| Local (RX 5700 XT 8GB) | Nemotron Nano 4B | 0.738 | 100% | 41.7 | free |
| Local (RX 7800 XT 16GB) | Nemotron Nano 4B | 0.739 | 100% | 16.8 | free |
| OpenRouter (cloud) | Nemotron Super 120B | 0.739 | 99.4% | 4.7 | ~$0.09 |
Live game test (10 ticks, Crashlanded, real RIMAPI data): 0.843 composite, 96.8% execution rate, 100% parse rate, all colonists alive.
| # | Name | Difficulty | Duration |
|---|---|---|---|
| 01 | Crashlanded Survival | easy | 30 days |
| 02 | First Winter | medium | 60 days |
| 03 | Toxic Fallout | hard | 20 days |
| 04 | Raid Defense | hard | 15 days |
| 05 | Plague Response | hard | 20 days |
| 06 | Ship Launch | extreme | 120 days |
8 metrics, weighted composite (scenarios can override weights):
| Metric | Default Weight | What it measures |
|---|---|---|
| survival | 0.25 | alive / started colonists |
| threat_response | 0.15 | draft response speed |
| mood | 0.15 | avg colonist mood |
| food_security | 0.10 | days of food (10+ = 1.0) |
| wealth | 0.10 | wealth growth ratio |
| research | 0.10 | % research tree completed |
| self_sufficiency | 0.10 | power + food + population stability |
| efficiency | 0.05 | action execution rate |
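As a rough illustration of how the weighted composite could be computed from the defaults above, here is a minimal sketch. It assumes each metric is already normalized to [0, 1], that `food_security` scales linearly up to the 10-day cap, and that overridden weights are renormalized to sum to 1. None of this is confirmed by the source, and `composite` and `food_security` are hypothetical names.

```python
from typing import Dict, Optional

# Default weights from the table above (they sum to 1.0).
DEFAULT_WEIGHTS: Dict[str, float] = {
    "survival": 0.25,
    "threat_response": 0.15,
    "mood": 0.15,
    "food_security": 0.10,
    "wealth": 0.10,
    "research": 0.10,
    "self_sufficiency": 0.10,
    "efficiency": 0.05,
}


def food_security(days_of_food: float) -> float:
    """Map days of stored food to [0, 1]; 10+ days scores 1.0 (linear ramp assumed)."""
    return max(0.0, min(days_of_food / 10.0, 1.0))


def composite(
    metrics: Dict[str, float],
    overrides: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of metric scores; missing metrics count as 0.0."""
    weights = {**DEFAULT_WEIGHTS, **(overrides or {})}
    total = sum(weights.values())  # renormalize in case overrides change the sum
    return sum(w * metrics.get(name, 0.0) for name, w in weights.items()) / total


perfect = {name: 1.0 for name in DEFAULT_WEIGHTS}
only_survival = composite({"survival": 1.0})
```

With the default weights, a colony that keeps everyone alive but scores zero on everything else would get 0.25 under these assumptions.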
pytest # Run all tests (262)
ruff check src/ tests/ scripts/ # Lint
mypy src/ # Type check
src/rle/
├── config.py # RLEConfig (pydantic-settings)
├── rimapi/ # RIMAPI async HTTP client + SSE + Pydantic schemas
├── agents/ # 6 role agents + base class + action schema
├── orchestration/ # Game loop, state manager, action executor/resolver
├── scoring/ # 8 metrics, composite scorer, CSV recorder
└── scenarios/ # YAML schema, loader, evaluator, 6 definitions
- felix-agent-sdk — agents, communication, helix geometry, providers
- RIMAPI — C# RimWorld mod exposing REST API
- httpx, pydantic, pyyaml
MIT
Built by AppSprout with Claude Code