Wire #13 infrastructure into game loop + Docker e2e#16

Merged
jkbennitt merged 3 commits into master from feature/issue-13-wiring
Apr 12, 2026

@jkbennitt
Member

Summary

  • EventLog emits at every tick stage (7 event types, thread-safe). CostTracker records token usage from agents after each deliberation.
  • ActionResolver returns ResolverStats for conflict counting. MetricContext populated with conflicts + CentralPost message tracking.
  • DEFAULT_WEIGHTS redistributed (proportional 20% cut) to activate coordination (0.12) and communication_efficiency (0.08). All 6 scenario YAMLs updated.
  • --ablation runs 8 benchmark passes (full + 7 agent-removed).
  • Optional Weave tracing on _call_provider().
  • Docker e2e validated against real headless RimWorld — avg 0.804 across 6 scenarios, 99% parse rate.

Test plan

  • mypy strict, ruff, 331 tests pass
  • Smoke test produces events.jsonl + cost_snapshot + event_summary in benchmark_summary.json
  • Docker: RIMAPI responds from host via socat bridge
  • Docker: full 6-scenario benchmark (10 ticks, Nemotron 120B via OpenRouter)

Closes #13

🤖 Generated with Claude Code

jkbennitt and others added 3 commits April 10, 2026 03:27
EventLog emits at every run_tick() stage (7 event types, thread-safe).
CostTracker records token usage from agent._last_usage after each
deliberation. ActionResolver returns ResolverStats for conflict counting.
MetricContext populated with conflicts + CentralPost message tracking.
DEFAULT_WEIGHTS redistributed (proportional 20% cut) to activate
coordination (0.12) and communication_efficiency (0.08). All 6 scenario
YAMLs updated. --ablation runs 8 benchmark passes (full + 7 agent-removed).
Weave tracing optional on _call_provider via enable_weave(). Docker
entrypoint CRLF fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Entrypoint now seeds ModsConfig.xml (Harmony → Core → Royalty →
HeadlessRim → RIMAPI), links Workshop mods from SteamCMD download,
replaces game Mods/ dir with merged mods symlink. Runs as root for
setup then drops to rimworld user via su.

Dockerfile strips CRLF from entrypoint.sh, extends healthcheck for
longer startup. docker-compose.yml drops :ro on game volume (entrypoint
needs to swap Mods dir).

Tested: RIMAPI responds inside container with patched HeadlessRimPatch
(PR IlyaChichkov/HeadlessRimPatch#6). IPv6 loopback binding blocks
external access through Docker port forwarding — separate RIMAPI fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Mono's HttpListener binds to [::1] regardless of config. Docker port
forwarding can't reach loopback. socat bridges 0.0.0.0:8765 (IPv4)
to [::1]:8765 (where RIMAPI actually listens).

Tested: curl from host gets HTTP 200 JSON response through Docker
port mapping. Full e2e validated with patched HeadlessRimPatch
(no autoplay) + socat bridge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jkbennitt
Member Author

Code Review

Docker & Infrastructure

CRITICAL: Container runs as root
In docker/Dockerfile:36-37, USER rimworld is commented out. The entrypoint drops to rimworld for the game process via su, but socat and tail still run as root, which leaves a privilege-escalation surface if the container is compromised.

Socat bridge has no supervision
docker/entrypoint.sh:137 — socat is backgrounded with no PID tracking, no restart-on-crash, no error logging if bind fails, a race condition (healthcheck may hit before socat is ready), and no signal forwarding on container shutdown.
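The healthcheck race noted above can be closed by polling the bridged port before the container is declared ready. The sketch below is written in Python purely for illustration (the real entrypoint is a shell script); wait_for_port and its parameters are hypothetical names, not code from this repository.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.1) -> bool:
    """Return True once a TCP connect to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A supervisor could call wait_for_port("127.0.0.1", 8765) right after launching the bridge and exit non-zero if it returns False, so a failed bind shows up at startup rather than as a late healthcheck failure.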

Healthcheck allows 30-minute wait
docker/Dockerfile:41-42 — 120s start-period + 30s × 60 retries = 1920s. Previous was ~2 min. 60 retries is overkill — reduce to ~20 retries (~10-15 min cap).
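The arithmetic behind both caps, as a quick check. This uses the same simplified bound as the review (start period plus interval times retries) and ignores the per-check timeout; worst_case_wait is a hypothetical helper name.

```python
def worst_case_wait(start_period_s: int, interval_s: int, retries: int) -> int:
    """Approximate upper bound, in seconds, before the container is marked unhealthy."""
    return start_period_s + interval_s * retries


before = worst_case_wait(120, 30, 60)  # 1920 s
after = worst_case_wait(120, 30, 20)   # 720 s
print(before / 60, after / 60)  # prints: 32.0 12.0
```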

CRLF fix is good
docker/Dockerfile:34sed -i 's/\r$//' prevents Windows line endings from breaking the script.


Orchestration & Scoring

MetricContext wiring — correct
game_loop.py:413-431 tracks messages_acted_on, game_loop.py:450-451 accumulates ResolverStats for conflicts. Clean implementation.
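For readers without the diff open, here is a minimal sketch of how a resolver might return conflict stats alongside its winners. The Action fields, the ResolverStats fields, and the first-claim-wins rule are all assumptions for illustration; the actual ActionResolver may define a conflict differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    agent_id: str
    target: str  # hypothetical: the resource or pawn this action claims


@dataclass
class ResolverStats:
    total: int = 0
    conflicts: int = 0


def resolve(actions: list[Action]) -> tuple[list[Action], ResolverStats]:
    """Keep the first action per target; count every later claim as a conflict."""
    stats = ResolverStats(total=len(actions))
    winners: dict[str, Action] = {}
    for action in actions:
        if action.target in winners:
            stats.conflicts += 1
        else:
            winners[action.target] = action
    return list(winners.values()), stats
```

The game loop can then add stats.conflicts to a running total each tick and hand that total to MetricContext.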

EventLog thread safety — correct
event_log.py:67 uses threading.Lock() with proper critical sections.
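The locking pattern described here looks roughly like the sketch below. It is not the code in event_log.py: the method names, the event fields, and the "tick_start" event type in the usage note are assumptions.

```python
import json
import threading
import time
from typing import Any


class EventLog:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._events: list[dict[str, Any]] = []

    def emit(self, event_type: str, **payload: Any) -> None:
        # Build the event outside the lock; only the shared append is critical.
        event = {"type": event_type, "ts": time.time(), **payload}
        with self._lock:
            self._events.append(event)

    def to_jsonl(self) -> str:
        # Copy under the lock, serialize outside it to keep the section short.
        with self._lock:
            snapshot = list(self._events)
        return "\n".join(json.dumps(e) for e in snapshot)
```

Calling log.emit("tick_start", tick=3) from several agent threads at once is safe because every mutation of the shared list happens inside the lock.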

Weight redistribution — correct
composite.py:10-21 — DEFAULT_WEIGHTS sums to ~1.0 within float tolerance. Tests validate via pytest.approx(1.0).
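A proportional 20% cut can be sketched as follows. The old weight values here are made up for illustration (only the metric names and the 0.12 / 0.08 targets come from this PR), and the helper assumes the old weights already sum to 1.0.

```python
import math


def redistribute(old: dict[str, float], activated: dict[str, float]) -> dict[str, float]:
    """Scale existing weights down proportionally to make room for the activated ones."""
    budget = sum(activated.values())  # 0.20 in this PR
    scaled = {name: w * (1.0 - budget) for name, w in old.items()}
    return {**scaled, **activated}


# Hypothetical pre-PR weights; the real values live in composite.py.
old_weights = {
    "survival": 0.25, "threat_response": 0.15, "wealth": 0.10,
    "food_security": 0.15, "efficiency": 0.10, "self_sufficiency": 0.10,
    "mood": 0.10, "research": 0.05,
}
new_weights = redistribute(
    old_weights, {"coordination": 0.12, "communication_efficiency": 0.08}
)
assert math.isclose(sum(new_weights.values()), 1.0)
```

Because every existing weight is multiplied by the same factor (0.8), the relative ordering of the original metrics is preserved.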


Agents & Scripts

Ablation logic — correct
run_benchmark.py:386-387 iterates _ALL_AGENT_IDS, excluding one per pass. Clean one-at-a-time removal.
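The one-at-a-time removal amounts to the pattern below. The agent IDs are placeholders (the real _ALL_AGENT_IDS are not listed in this thread) and ablation_passes is a hypothetical helper name.

```python
# Placeholder IDs; the real roster lives in run_benchmark.py.
ALL_AGENT_IDS = tuple(f"agent_{i}" for i in range(1, 8))


def ablation_passes(agent_ids: tuple[str, ...]) -> list[tuple[str, list[str]]]:
    """Return (label, active_agents): the full roster first, then one pass per removed agent."""
    passes = [("full", list(agent_ids))]
    for removed in agent_ids:
        active = [a for a in agent_ids if a != removed]
        passes.append((f"without_{removed}", active))
    return passes
```

With seven agents this yields eight passes, matching the "full + 7 agent-removed" count in the summary.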

Weave tracing — truly optional
base_role.py:163 initializes _weave_op = None, gated by None check. No hard imports.

Gap: ablation + save load failure
run_benchmark.py:399-402 — If load_game() fails, the loop carries on anyway and agents run against stale state, silently producing garbage ablation results.

Gap: --ablation lacks conflict validation
No check that --ablation and --runs / --no-baseline don't interact unexpectedly.


Tests

Good coverage on new features

  • TestResolverStats (5 test cases) covers zero-conflict through multi-conflict
  • test_sum_to_one validates DEFAULT_WEIGHTS

Gap: scenario YAML weight sums not tested
test_scenario_loader.py checks structure but doesn't assert each scenario's weight overrides sum to 1.0.
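One way to frame the missing check, sketched with plain dicts rather than the real YAML loader. The function name and the scenario contents are hypothetical; only "crashlanded" is a scenario name that appears in this PR.

```python
import math


def scenarios_with_bad_weights(scenarios: dict[str, dict[str, float]]) -> list[str]:
    """Return the names of scenarios whose weight overrides do not sum to 1.0."""
    return sorted(
        name
        for name, weights in scenarios.items()
        if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9)
    )


# Made-up override values for illustration only.
example = {
    "crashlanded": {"survival": 0.5, "mood": 0.3, "research": 0.2},
    "broken_example": {"survival": 0.5, "mood": 0.3},
}
print(scenarios_with_bad_weights(example))  # prints: ['broken_example']
```

A parametrized test over the six scenario files would assert that this list is empty.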


Action Items

| Issue | Severity |
| --- | --- |
| Docker root user | Fix |
| Socat supervision | Fix |
| Ablation error handling on load failure | Improve |
| Scenario YAML weight-sum test | Add |
| Healthcheck retries (60 → ~20) | Nit |

Core instrumentation work is solid — thread safety, metric wiring, and weight math are all correct. Main concerns are Docker security and the ablation save-load gap.

@jkbennitt
Member Author

Fixes Applied

All review items addressed in feature/issue-13-wiring:

Docker security & reliability

  • Root user: Removed commented-out USER rimworld. Entrypoint still starts as root for setup (chown, mv, ln), but socat and tail now run as rimworld via su
  • Socat supervision: PID tracking + startup verification — exits immediately if bind fails. No more silent failures
  • Signal forwarding: Added trap handler forwarding SIGTERM/SIGINT to all child processes (socat, game, Xvfb) for clean shutdown
  • Healthcheck retries: 60 → 20 (~12 min cap instead of ~32 min, assuming the 120s start-period and 30s interval are unchanged)

Ablation save-load guard

  • Both ablation and main benchmark loops now continue (skip the run) when load_game() fails, instead of silently running against stale state
  • Prints SKIP message so dropped runs are visible
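The guard described above reduces to a skip-and-report loop. In this sketch, run_passes, its callbacks, and the boolean return from load_game are assumptions for illustration; the real load_game() may signal failure differently.

```python
from typing import Callable


def run_passes(
    run_ids: list[str],
    load_game: Callable[[str], bool],
    run_once: Callable[[str], float],
) -> dict[str, float]:
    """Run each pass only after a successful load; skipped runs leave no score."""
    results: dict[str, float] = {}
    for run_id in run_ids:
        if not load_game(run_id):
            # Make the dropped run visible instead of scoring stale state.
            print(f"SKIP {run_id}: load_game() failed")
            continue
        results[run_id] = run_once(run_id)
    return results
```

Because skipped runs are absent from the result dict rather than recorded as zero, downstream averages are not dragged down by runs that never started.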

Test coverage

  • New TestScoringWeights — parametrized across all 6 YAML definitions, asserts scoring_weights sum to pytest.approx(1.0)

Verification

  • 337 tests pass (6 new), ruff clean, mypy strict clean
  • Need to re-validate Docker e2e before merge

@jkbennitt
Member Author

Docker e2e Validated

Rebuilt image with all fixes, ran crashlanded scenario (10 ticks, Nemotron 120B paid via OpenRouter):

| Metric | Score |
| --- | --- |
| survival | 1.000 |
| threat_response | 1.000 |
| wealth | 1.000 |
| coordination | 1.000 |
| food_security | 0.950 |
| communication_efficiency | 0.952 |
| efficiency | 0.889 |
| self_sufficiency | 0.667 |
| mood | 0.407 |
| research | 0.226 |
| COMPOSITE | 0.844 |

Stats: 95K tokens, 68 LLM calls, $0.03, 900s wall time

Entrypoint fixes confirmed working:

  • socat bridge started with PID tracking (pid=146), no silent failures
  • Signal trap registered for clean shutdown
  • Save loaded successfully (game/load 200), all map endpoints 200
  • Healthcheck passed with reduced retries (20 vs 60)
  • Score climbed 0.824 → 0.844 over 10 ticks, 3 colonists alive

Ship it.

@jkbennitt jkbennitt merged commit faf7313 into master Apr 12, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

HeadlessRim Docker: real automated benchmarks + leaderboard infrastructure