Wire #13 infrastructure into game loop + Docker e2e by jkbennitt · Pull Request #16 · AppSprout-dev/RLE

jkbennitt · 2026-04-12T06:13:18Z

Summary

EventLog emits at every tick stage (7 event types, thread-safe). CostTracker records token usage from agents after each deliberation.
ActionResolver returns ResolverStats for conflict counting. MetricContext populated with conflicts + CentralPost message tracking.
DEFAULT_WEIGHTS redistributed (proportional 20% cut) to activate coordination (0.12) and communication_efficiency (0.08). All 6 scenario YAMLs updated.
--ablation runs 8 benchmark passes (full + 7 agent-removed).
Optional Weave tracing on _call_provider().
Docker e2e validated against real headless RimWorld — avg 0.804 across 6 scenarios, 99% parse rate.
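
As a rough illustration of the CostTracker item above, the sketch below accumulates per-agent token counts from a usage dict after each deliberation. The class shape, the `record` method, and the `prompt_tokens` / `completion_tokens` keys are assumptions made for this example; the PR text only states that usage is read from agent._last_usage.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CostTracker:
    """Accumulates token usage per agent (hypothetical shape, not the RLE class)."""

    prompt_tokens: dict[str, int] = field(default_factory=dict)
    completion_tokens: dict[str, int] = field(default_factory=dict)
    calls: int = 0

    def record(self, agent_id: str, usage: dict[str, int] | None) -> None:
        # An agent that made no LLM call this tick has nothing to record.
        if not usage:
            return
        self.prompt_tokens[agent_id] = (
            self.prompt_tokens.get(agent_id, 0) + usage.get("prompt_tokens", 0)
        )
        self.completion_tokens[agent_id] = (
            self.completion_tokens.get(agent_id, 0) + usage.get("completion_tokens", 0)
        )
        self.calls += 1

    def total_tokens(self) -> int:
        return sum(self.prompt_tokens.values()) + sum(self.completion_tokens.values())


tracker = CostTracker()
# Agent names and token counts below are made up for the example.
tracker.record("medic", {"prompt_tokens": 1200, "completion_tokens": 150})
tracker.record("medic", {"prompt_tokens": 900, "completion_tokens": 100})
tracker.record("builder", None)  # skipped: no usage reported
print(tracker.total_tokens(), tracker.calls)  # 2350 2
```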

Test plan

mypy strict, ruff, 331 tests pass
Smoke test produces events.jsonl + cost_snapshot + event_summary in benchmark_summary.json
Docker: RIMAPI responds from host via socat bridge
Docker: full 6-scenario benchmark (10 ticks, Nemotron 120B via OpenRouter)

Closes #13

🤖 Generated with Claude Code

EventLog emits at every run_tick() stage (7 event types, thread-safe). CostTracker records token usage from agent._last_usage after each deliberation. ActionResolver returns ResolverStats for conflict counting. MetricContext populated with conflicts + CentralPost message tracking. DEFAULT_WEIGHTS redistributed (proportional 20% cut) to activate coordination (0.12) and communication_efficiency (0.08). All 6 scenario YAMLs updated. --ablation runs 8 benchmark passes (full + 7 agent-removed). Weave tracing optional on _call_provider via enable_weave(). Docker entrypoint CRLF fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Entrypoint now seeds ModsConfig.xml (Harmony → Core → Royalty → HeadlessRim → RIMAPI), links Workshop mods from SteamCMD download, replaces game Mods/ dir with merged mods symlink. Runs as root for setup then drops to rimworld user via su. Dockerfile strips CRLF from entrypoint.sh, extends healthcheck for longer startup. docker-compose.yml drops :ro on game volume (entrypoint needs to swap Mods dir). Tested: RIMAPI responds inside container with patched HeadlessRimPatch (PR IlyaChichkov/HeadlessRimPatch#6). IPv6 loopback binding blocks external access through Docker port forwarding — separate RIMAPI fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mono's HttpListener binds to [::1] regardless of config. Docker port forwarding can't reach loopback. socat bridges 0.0.0.0:8765 (IPv4) to [::1]:8765 (where RIMAPI actually listens). Tested: curl from host gets HTTP 200 JSON response through Docker port mapping. Full e2e validated with patched HeadlessRimPatch (no autoplay) + socat bridge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jkbennitt · 2026-04-12T18:01:24Z

Code Review

Docker & Infrastructure

CRITICAL: Container runs as root
docker/Dockerfile:36-37 — USER rimworld is commented out. The entrypoint drops to rimworld for the game process via su, but socat and tail run as root. Privilege escalation surface if the container is compromised.

Socat bridge has no supervision
docker/entrypoint.sh:137 — socat is backgrounded with no PID tracking, no restart-on-crash, no error logging if bind fails, a race condition (healthcheck may hit before socat is ready), and no signal forwarding on container shutdown.

Healthcheck allows 30-minute wait
docker/Dockerfile:41-42 — 120s start-period + 30s × 60 retries = 1920s. Previous was ~2 min. 60 retries is overkill — reduce to ~20 retries (~10-15 min cap).
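
The arithmetic behind that figure can be checked quickly. The helper below is only a back-of-the-envelope bound (start period plus one interval per retry); it ignores per-check timeouts, and the function name is made up for this note.

```python
def max_wait_seconds(start_period: int, interval: int, retries: int) -> int:
    """Approximate upper bound before the container is marked unhealthy."""
    return start_period + interval * retries


before = max_wait_seconds(start_period=120, interval=30, retries=60)
after = max_wait_seconds(start_period=120, interval=30, retries=20)
print(before, before / 60)  # 1920 32.0
print(after, after / 60)    # 720 12.0
```

With 20 retries and the same 120s start period and 30s interval, the cap lands at about 12 minutes, inside the suggested 10-15 minute range.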

CRLF fix is good ✓
docker/Dockerfile:34 — sed -i 's/\r$//' prevents Windows line endings from breaking the script.

Orchestration & Scoring

MetricContext wiring — correct ✓
game_loop.py:413-431 tracks messages_acted_on, game_loop.py:450-451 accumulates ResolverStats for conflicts. Clean implementation.
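
For readers who have not opened the diff, here is a minimal sketch of how a resolver might return conflict counts alongside the resolved actions. Everything in it (the `Action` fields, the "same target means conflict" rule, the priority tie-break) is an assumption for illustration and may differ from the actual ActionResolver.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    agent_id: str
    target: str    # hypothetical: two actions on the same target conflict
    priority: int


@dataclass
class ResolverStats:
    total_actions: int = 0
    conflicts: int = 0


def resolve(actions: list[Action]) -> tuple[list[Action], ResolverStats]:
    """Keep the highest-priority action per target and count the collisions."""
    stats = ResolverStats(total_actions=len(actions))
    winners: dict[str, Action] = {}
    for action in actions:
        current = winners.get(action.target)
        if current is None:
            winners[action.target] = action
            continue
        stats.conflicts += 1
        if action.priority > current.priority:
            winners[action.target] = action
    return list(winners.values()), stats


kept, stats = resolve([
    Action("medic", "pawn_1", priority=5),
    Action("builder", "pawn_1", priority=2),  # collides with the medic's action
    Action("farmer", "pawn_2", priority=3),
])
print(len(kept), stats.conflicts)  # 2 1
```

The game loop would then add `stats.conflicts` to a running total each tick and pass that total into MetricContext.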

EventLog thread safety — correct ✓
event_log.py:67 uses threading.Lock() with proper critical sections.
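
A minimal sketch of the locking pattern being approved here, assuming an in-memory list of dict events. The method names and event fields are illustrative and are not taken from event_log.py.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor


class EventLog:
    """Append-only event list guarded by one lock (illustrative sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def emit(self, event_type, tick, **payload):
        # Build the record outside the lock; hold the lock only for the append.
        record = {"type": event_type, "tick": tick, **payload}
        with self._lock:
            self._events.append(record)

    def summary(self):
        # Count events per type under the lock so the view is consistent.
        with self._lock:
            counts = {}
            for event in self._events:
                counts[event["type"]] = counts.get(event["type"], 0) + 1
            return counts

    def to_jsonl(self):
        with self._lock:
            return "\n".join(json.dumps(event) for event in self._events)


log = EventLog()
with ThreadPoolExecutor(max_workers=4) as pool:
    for i in range(100):
        pool.submit(log.emit, "agent_deliberated", tick=i % 10, agent=f"a{i % 4}")

print(log.summary())  # {'agent_deliberated': 100}
```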

Weight redistribution — correct ✓
composite.py:10-21 — DEFAULT_WEIGHTS sums to ~1.0 within float tolerance. Tests validate via pytest.approx(1.0).
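
To make the "proportional 20% cut" concrete: every existing weight is scaled by 0.8 and the freed 0.20 is split between the two new metrics. The starting weights below are invented so the example sums to 1.0; they are not the values in composite.py.

```python
import math


def redistribute(old_weights, new_weights):
    """Scale old weights by (1 - sum(new_weights)) and add the new entries.

    Assumes old_weights already sums to 1.0.
    """
    scale = 1.0 - sum(new_weights.values())
    merged = {name: weight * scale for name, weight in old_weights.items()}
    merged.update(new_weights)
    return merged


# Hypothetical pre-change weights, chosen only so they sum to 1.0.
old = {"survival": 0.4, "mood": 0.3, "wealth": 0.3}
new = {"coordination": 0.12, "communication_efficiency": 0.08}

weights = redistribute(old, new)
assert math.isclose(sum(weights.values()), 1.0)
assert math.isclose(weights["survival"], 0.32)  # 0.4 * 0.8
```

Because the result is a sum of floats, comparing with a tolerance (math.isclose here, pytest.approx in the test suite) is safer than testing for exact equality with 1.0.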

Agents & Scripts

Ablation logic — correct ✓
run_benchmark.py:386-387 iterates _ALL_AGENT_IDS, excluding one per pass. Clean one-at-a-time removal.
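
The pass structure can be sketched as a generator: the full roster first, then one pass per removed agent. The agent IDs are placeholders, since the contents of _ALL_AGENT_IDS are not shown in this thread.

```python
# Placeholder IDs; the real _ALL_AGENT_IDS in run_benchmark.py is not shown here.
ALL_AGENT_IDS = [f"agent_{i}" for i in range(1, 8)]


def ablation_passes(agent_ids):
    """Yield (label, active_agents): the full roster, then one agent removed per pass."""
    yield "full", list(agent_ids)
    for removed in agent_ids:
        yield f"no_{removed}", [a for a in agent_ids if a != removed]


passes = list(ablation_passes(ALL_AGENT_IDS))
print(len(passes))  # 8
```

With 7 agents this gives the 8 passes described in the summary, and every ablated pass runs with exactly 6 agents.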

Weave tracing — truly optional ✓
base_role.py:163 initializes _weave_op = None, gated by None check. No hard imports.
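
A sketch of the gating pattern, under stated assumptions: the module-level `_weave_op` and `enable_weave()` names follow the PR text, but the injectable `op` parameter, the `call_provider` body, and the use of `weave.op` as a plain function wrapper are assumptions made so the example runs without Weave installed.

```python
from typing import Callable, Optional

# None means tracing is off; no weave import happens at module load.
_weave_op: Optional[Callable] = None


def enable_weave(op: Optional[Callable] = None) -> None:
    """Turn tracing on. `op` is a hypothetical injection point for tests."""
    global _weave_op
    if op is None:
        import weave  # lazy import keeps the dependency optional (assumed usage)
        op = weave.op
    _weave_op = op


def call_provider(prompt: str) -> str:
    def _raw(prompt: str) -> str:
        return f"echo:{prompt}"  # stand-in for the real provider call

    # Wrap only when tracing was explicitly enabled.
    fn = _weave_op(_raw) if _weave_op is not None else _raw
    return fn(prompt)


print(call_provider("hello"))  # echo:hello
```

The point of the pattern is that a plain install never touches the tracing library: the import sits inside enable_weave(), and every call site checks for None first.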

Gap: ablation + save load failure
At run_benchmark.py:399-402, if load_game() fails the loop carries on regardless, so agents run against stale state and the ablation silently produces garbage results.

Gap: --ablation lacks conflict validation
Nothing validates how --ablation combines with --runs or --no-baseline, so conflicting flag combinations may interact unexpectedly.

Tests

Good coverage on new features ✓

TestResolverStats (5 test cases) covers zero-conflict through multi-conflict
test_sum_to_one validates DEFAULT_WEIGHTS

Gap: scenario YAML weight sums not tested
test_scenario_loader.py checks structure but doesn't assert each scenario's weight overrides sum to 1.0.

Action Items

Issue	Severity
Docker root user	Fix
Socat supervision	Fix
Ablation error handling on load failure	Improve
Scenario YAML weight-sum test	Add
Healthcheck retries (60 → ~20)	Nit

Core instrumentation work is solid — thread safety, metric wiring, and weight math are all correct. Main concerns are Docker security and the ablation save-load gap.

jkbennitt · 2026-04-12T18:14:01Z

Fixes Applied

All review items addressed in feature/issue-13-wiring:

Docker security & reliability

Root user: Removed commented-out USER rimworld. Entrypoint still starts as root for setup (chown, mv, ln), but socat and tail now run as rimworld via su
Socat supervision: PID tracking + startup verification — exits immediately if bind fails. No more silent failures
Signal forwarding: Added trap handler forwarding SIGTERM/SIGINT to all child processes (socat, game, Xvfb) for clean shutdown
Healthcheck retries: 60 → 20 (~12 min cap instead of ~32 min, assuming the 120s start-period and 30s interval are unchanged)
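
The entrypoint itself is a shell script, and its exact commands are not shown in this thread. The Python sketch below only illustrates the same supervision idea (record the PID, verify the port actually accepts connections, fail fast otherwise). The function names and the sample socat arguments in the trailing comment are assumptions.

```python
import socket
import subprocess
import time


def port_open(host, port, timeout=0.2):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_supervised(cmd, host, port, timeout=5.0):
    """Start cmd, then fail fast unless it binds host:port within timeout."""
    proc = subprocess.Popen(cmd)
    print(f"bridge started (pid={proc.pid})")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            # The process died, for example because the bind failed.
            raise RuntimeError(f"bridge exited early with code {proc.returncode}")
        if port_open(host, port):
            return proc
        time.sleep(0.1)
    proc.terminate()
    proc.wait()
    raise RuntimeError(f"bridge did not bind {host}:{port} within {timeout}s")


# Illustrative usage (not executed here; arguments are an assumption):
# start_supervised(
#     ["socat", "TCP4-LISTEN:8765,fork,reuseaddr", "TCP6:[::1]:8765"],
#     "127.0.0.1", 8765,
# )
```

Checking the port rather than just the PID also closes the race flagged in the review, where the healthcheck could fire before the bridge was ready.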

Ablation save-load guard

Both the ablation and main benchmark loops now skip the run (via continue) when load_game() fails, instead of silently running against stale state
Prints SKIP message so dropped runs are visible
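
The guard described above can be sketched as follows. It assumes load_game() reports failure through a falsy return value, which this thread does not confirm; the function names, the second scenario name, and the score are stand-ins.

```python
def run_passes(scenarios, load_game, run_scenario):
    """Run each scenario, skipping any whose save fails to load."""
    results, skipped = {}, []
    for name in scenarios:
        if not load_game(name):
            # Make the dropped run visible instead of scoring stale state.
            print(f"SKIP {name}: save failed to load")
            skipped.append(name)
            continue
        results[name] = run_scenario(name)
    return results, skipped


# Stand-in callables for the example; "hypothetical_raid" is not a real scenario.
def fake_load(name):
    return name != "hypothetical_raid"


def fake_run(name):
    return 0.5  # placeholder score


results, skipped = run_passes(["crashlanded", "hypothetical_raid"], fake_load, fake_run)
print(sorted(results), skipped)  # ['crashlanded'] ['hypothetical_raid']
```

Returning the skipped list alongside the results lets the summary report how many runs were dropped, so a partial benchmark is not mistaken for a complete one.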

Test coverage

New TestScoringWeights — parametrized across all 6 YAML definitions, asserts scoring_weights sum to pytest.approx(1.0)
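
The check amounts to a tolerance comparison per scenario. The sketch below uses inline dicts as stand-ins for the parsed YAML files and math.isclose in place of pytest.approx; the scenario names and weights are invented. In the real suite each scenario would be one parametrized test case.

```python
import math


def weight_sum_errors(scenarios, tol=1e-6):
    """Return {scenario: actual_sum} for every scenario whose weights miss 1.0."""
    errors = {}
    for name, weights in scenarios.items():
        total = sum(weights.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            errors[name] = total
    return errors


# Invented data standing in for the scoring_weights sections of the YAML files.
scenarios = {
    "balanced": {"survival": 0.5, "mood": 0.3, "coordination": 0.2},
    "broken": {"survival": 0.5, "mood": 0.3, "coordination": 0.3},
}

errors = weight_sum_errors(scenarios)
print(list(errors))  # ['broken']
```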

Verification

337 tests pass (6 new), ruff clean, mypy strict clean
Need to re-validate Docker e2e before merge

jkbennitt · 2026-04-12T18:59:29Z

Docker e2e Validated

Rebuilt image with all fixes, ran crashlanded scenario (10 ticks, Nemotron 120B paid via OpenRouter):

Metric	Score
survival	1.000
threat_response	1.000
wealth	1.000
coordination	1.000
food_security	0.950
communication_efficiency	0.952
efficiency	0.889
self_sufficiency	0.667
mood	0.407
research	0.226
COMPOSITE	0.844

Stats: 95K tokens, 68 LLM calls, $0.03, 900s wall time

Entrypoint fixes confirmed working:

socat bridge started with PID tracking (pid=146), no silent failures
Signal trap registered for clean shutdown
Save loaded successfully (game/load 200), all map endpoints 200
Healthcheck passed with reduced retries (20 vs 60)
Score climbed 0.824 → 0.844 over 10 ticks, 3 colonists alive

Ship it.

jkbennitt and others added 3 commits April 10, 2026 03:27

jkbennitt merged commit faf7313 into master Apr 12, 2026
6 checks passed

This was referenced Apr 12, 2026

HeadlessRim Docker: real automated benchmarks + leaderboard infrastructure #13 (Closed)

RLE v1.0: Multi-model colony management leaderboard #8 (Open)

jkbennitt merged 3 commits into master from feature/issue-13-wiring
