Conversation

@relh relh commented Nov 10, 2025

Here’s the 10‑seed sweep you asked for.

  • Ran: uv run packages/cogames/scripts/run_evaluation.py --agent racecar thinky --mission-set all --repeats 1 --jobs 0 --no-plots --seed <1..10> --output eval_plots/seed_<n>_racecar_thinky.json
  • Outputs: eval_plots/seed_1_racecar_thinky.json … seed_10_racecar_thinky.json (10 files, 432 tests per run, 216 per agent).

Aggregate over 10 seeds (4,320 total cases; 2,160 per agent):

  • RaceCar: 1,243/2,160 successes (57.5%), avg_total_reward 4.86, avg_reward_per_agent 2.35.
  • Thinky: 1,245/2,160 successes (57.6%), avg_total_reward 4.85, avg_reward_per_agent 2.34.
  • Run-to-run stability (success rate stdev): RaceCar 2.1pp, Thinky 1.7pp; medians 57.9% (RaceCar) vs 57.2% (Thinky).
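For reference, the mean/median/stdev figures above can be recomputed from the per-seed JSONs with a few lines of Python. The `summarize` helper is generic; the `load_rates` part assumes a `"results"` list of cases with `"agent"` and `"success"` fields, which is a guess at the eval output schema, not taken from run_evaluation.py:

```python
import json
import statistics

def summarize(rates: list[float]) -> dict[str, float]:
    """Mean/median/sample-stdev of per-run success rates (%, percentage points)."""
    return {
        "mean_pct": statistics.mean(rates),
        "median_pct": statistics.median(rates),
        "stdev_pp": statistics.stdev(rates),  # sample stdev across seeds
    }

def load_rates(paths: list[str], agent: str) -> list[float]:
    """One success rate (%) per seed file.

    NOTE: the "results"/"agent"/"success" layout below is an assumption
    about the eval JSON schema, not confirmed from the script.
    """
    rates = []
    for path in paths:
        with open(path) as f:
            cases = [c for c in json.load(f)["results"] if c["agent"] == agent]
        rates.append(100.0 * sum(bool(c["success"]) for c in cases) / len(cases))
    return rates
```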

Where Thinky beats RaceCar most (successes out of 10 runs, same mission & cogs):

  • diagnostic_memory_hard (2 cogs): 6 vs 1.
  • single_use_swarm (1 cog): 8 vs 4.
  • diagnostic_extract_missing_oxygen_hard (2 cogs): 6 vs 3.
  • vibe_check_easy (2 cogs): 5 vs 2.
  • diagnostic_assemble_seeded_search_hard (2 cogs): 5 vs 2.

Where RaceCar beats Thinky most:

  • balanced_corners (2 cogs): 8 vs 4.
  • vibe_check_hard (4 cogs): 6 vs 2.
  • distant_resources (4 cogs): 4 vs 1.
  • single_use_swarm_easy (1 cog): 10 vs 8.
  • oxygen_bottleneck_standard (2 cogs): 10 vs 8.

If you want a deeper dive, I can break down per-difficulty or plot deltas, but the raw JSONs above have everything.


Changes implemented (low‑risk):

  • Broader exploration seeding around assembler/chest (adds ±20 ring, extra chest offsets) and tracks unreachable targets to avoid re-chasing failed A* paths.
  • Single-resource bottleneck handling: when only one input is required, zero other targets and prevent dumping that resource.
  • Dump throttling now respects active targets so we don’t discard the bottleneck resource.
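A minimal sketch of the unreachable-target bookkeeping described above (names and structure are illustrative Python, not the actual Nim implementation):

```python
from typing import Callable, Optional

Pos = tuple[int, int]

class TargetTracker:
    """Skip targets whose A* search already failed, instead of re-chasing them."""

    def __init__(self) -> None:
        self.unreachable: set[Pos] = set()

    def next_target(self, candidates: list[Pos],
                    find_path: Callable[[Pos], Optional[list[Pos]]]) -> Optional[Pos]:
        for target in candidates:
            if target in self.unreachable:
                continue  # a previous A* attempt failed; don't retry this target
            if find_path(target) is not None:
                return target
            self.unreachable.add(target)  # remember the failed search
        return None
```

The key point is that a failed path search permanently (for the episode) removes the target from consideration, so the agent stops burning steps re-running A* against the same wall.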

Rebuilt Nim bindings.

10-seed sweep rerun (outputs in eval_plots/seed_<n>_racecar_thinky_postfix.json, seeds 1–10):

  • RaceCar: 1,023/1,728 successes (59.2%), mean per-run 59.2% (stdev 2.19pp).
  • Thinky: 1,001/1,728 successes (57.9%), mean per-run 57.9% (stdev 1.74pp).

Rare-success missions (RaceCar, fewer than 5 successes out of 8 runs; counts are out of 8 rather than 10 because two seeds timed out, so only 8 runs completed):

  • diagnostic_charge_up_hard (4 cogs): 1/8
  • oxygen_bottleneck_hard (1c: 1/8, 2c: 3/8)
  • diagnostic_assemble_seeded_search_hard (2c: 2/8)
  • diagnostic_memory_hard (2c: 2/8)
  • quadrant_buildings family (1–4c: 1–4/8)
  • vibe_check_easy/standard/hard (2c: 1–4/8; 4c: 3/8)
  • diagnostic_extract_missing_silicon_hard (2c: 3/8)
  • diagnostic_chest_navigation1_hard (4c: 4/8)

Observations:

  • The bottleneck prioritization and unreachable-target marking helped overall success (roughly +1.7pp over the previous sweep) and trimmed the rare list from 21 to 17 cases.
  • Remaining rare cases mostly cluster into: (a) vibe-check tasks (need deliberate vibe selection), (b) quadrant/building spread (needs stronger cog role separation), (c) hard energy/charge maps (charge_up_hard, oxygen_bottleneck_hard), and (d) long/maze memory (memory_hard).

If you want, I’ll tackle the next set of targeted tweaks:

  1. Add simple vibe-matching logic for vibe_check*/assemble_seeded_search_hard.
  2. Assign directional roles per agent to cover quadrants faster.
  3. Cache depleted/visited extractors per resource with cooldown for single_use/charge_hard maps.

Let me know which to do next.
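For tweak 3, the depleted-extractor cooldown could look something like this (a sketch under assumed step-counter semantics; none of these names come from the actual agent code):

```python
Pos = tuple[int, int]

class ExtractorCooldowns:
    """Remember extractors that came up empty and skip them for a while."""

    def __init__(self, cooldown_steps: int = 200) -> None:
        self.cooldown_steps = cooldown_steps
        self._blocked_until: dict[tuple[str, Pos], int] = {}

    def mark_depleted(self, resource: str, pos: Pos, step: int) -> None:
        # Block this extractor for the next cooldown_steps steps.
        self._blocked_until[(resource, pos)] = step + self.cooldown_steps

    def usable(self, resource: str, pos: Pos, step: int) -> bool:
        # Unknown extractors are usable; known ones only after the cooldown.
        return step >= self._blocked_until.get((resource, pos), 0)
```

A cooldown (rather than a permanent blacklist) matters for single_use/charge maps where extractors can recharge, so a depleted site becomes worth revisiting later.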

Asana Task: https://app.asana.com/1/1209016784099267/project/1210348820405981/task/1211941496103104

@relh relh changed the base branch from main to simple_nim_agents November 10, 2025 20:42
@relh relh changed the base branch from simple_nim_agents to main November 10, 2025 20:42
@relh relh changed the base branch from main to simple_nim_agents November 10, 2025 22:08
@treeform treeform force-pushed the simple_nim_agents branch 2 times, most recently from 826e7b1 to bf259ca Compare November 11, 2025 00:54
@relh relh changed the title racecar agents race_car -> racecar, max_steps changes, thinky output appropriately, and racecar agents Dec 4, 2025
@relh relh enabled auto-merge December 4, 2025 20:32
sys.path.append(bindings_dir)


def _maybe_rebuild_nim_bindings() -> None:

This is not needed.

jsony

# Disable debug output (comment out to enable)
template echo(args: varargs[string, `$`]) = discard
Don't like this

const spiral* = generateSpiral(1000)   # before
# The spiral is only used for exploratory walk ordering. Keep it long enough
# to cover the extended 10k-step episodes now used in main.
const spiral* = generateSpiral(10000)  # after

Wut?
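For context on the snippet above: `generateSpiral(n)` presumably yields the first n grid offsets in an outward square spiral from the origin, so bumping 1000 to 10000 just extends the exploratory walk ordering to cover longer episodes. A Python equivalent of the assumed behavior (not the actual Nim code) is:

```python
def generate_spiral(n: int) -> list[tuple[int, int]]:
    """First n grid offsets walking outward in a square spiral from (0, 0)."""
    offsets = [(0, 0)]
    x = y = 0
    dx, dy = 1, 0  # start heading right
    step = 1       # current leg length
    while len(offsets) < n:
        for _ in range(2):            # two legs share each leg length
            for _ in range(step):
                x, y = x + dx, y + dy
                offsets.append((x, y))
                if len(offsets) == n:
                    return offsets
            dx, dy = -dy, dx          # 90-degree turn
        step += 1
    return offsets
```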

@relh relh added this pull request to the merge queue Dec 4, 2025
@relh relh changed the title race_car -> racecar, max_steps changes, thinky output appropriately, and racecar agents race_car -> racecar, max_steps changes, thinky set output appropriately, fix unknown features, and racecar agents Dec 4, 2025
Merged via the queue into main with commit 30f40c2 Dec 4, 2025
27 checks passed
@relh relh deleted the richard-scripted branch December 4, 2025 20:45
zfogg pushed a commit that referenced this pull request Dec 20, 2025
…and racecar agents (#3631)

Co-authored-by: treeform <starplant@gmail.com>
Co-authored-by: Nishad <nishad@stem.ai>
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>