race_car -> racecar, max_steps changes, thinky set output appropriately, fix unknown features, and racecar agents #3631
Conversation
826e7b1 to bf259ca (compare)
sys.path.append(bindings_dir)


def _maybe_rebuild_nim_bindings() -> None:
This is not needed.
jsony

# Disable debug output (comment out to enable)
template echo(args: varargs[string, `$`]) = discard
Don't like this
- const spiral* = generateSpiral(1000)
+ # The spiral is only used for exploratory walk ordering. Keep it long enough
+ # to cover the extended 10k-step episodes now used in main.
+ const spiral* = generateSpiral(10000)
Wut?
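For context on the spiral thread above: a square spiral of offsets used purely for exploratory walk ordering can be generated along these lines. This is a rough Python sketch, not the repo's Nim `generateSpiral`; the real ordering may differ. The point of the diff is just that with 10k-step episodes the precomputed spiral has to be long enough that the walk ordering never runs out of candidate offsets.

```python
def generate_spiral(n: int) -> list[tuple[int, int]]:
    """(dx, dy) offsets in square-spiral order, starting at the origin.

    Hypothetical Python equivalent of a generateSpiral(1000 | 10000) helper.
    """
    offsets = [(0, 0)]
    x = y = 0
    step = 1                                          # current run length
    directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]   # right, up, left, down
    d = 0
    while len(offsets) < n:
        for _ in range(2):                            # two runs share each length
            dx, dy = directions[d % 4]
            for _ in range(step):
                x, y = x + dx, y + dy
                offsets.append((x, y))
                if len(offsets) == n:
                    return offsets
            d += 1
        step += 1
    return offsets
```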
…and racecar agents (#3631)

Here's the 10-seed sweep you asked for.

- Ran: `uv run packages/cogames/scripts/run_evaluation.py --agent racecar thinky --mission-set all --repeats 1 --jobs 0 --no-plots --seed <1..10> --output eval_plots/seed_<n>_racecar_thinky.json`
- Outputs: eval_plots/seed_1_racecar_thinky.json … seed_10_racecar_thinky.json (10 files, 432 tests per run, 216 per agent).

Aggregate over 10 seeds (4,320 total cases; 2,160 per agent):

- RaceCar: 1,243/2,160 successes (57.5%), avg_total_reward 4.86, avg_reward_per_agent 2.35.
- Thinky: 1,245/2,160 successes (57.6%), avg_total_reward 4.85, avg_reward_per_agent 2.34.
- Run-to-run stability (success-rate stdev): RaceCar 2.1pp, Thinky 1.7pp; medians 57.9% (RaceCar) vs 57.2% (Thinky).

Where Thinky beats RaceCar most (successes out of 10 runs, same mission & cogs):

- diagnostic_memory_hard (2 cogs): 6 vs 1.
- single_use_swarm (1 cog): 8 vs 4.
- diagnostic_extract_missing_oxygen_hard (2 cogs): 6 vs 3.
- vibe_check_easy (2 cogs): 5 vs 2.
- diagnostic_assemble_seeded_search_hard (2 cogs): 5 vs 2.

Where RaceCar beats Thinky most:

- balanced_corners (2 cogs): 8 vs 4.
- vibe_check_hard (4 cogs): 6 vs 2.
- distant_resources (4 cogs): 4 vs 1.
- single_use_swarm_easy (1 cog): 10 vs 8.
- oxygen_bottleneck_standard (2 cogs): 10 vs 8.

If you want a deeper dive, I can break down per-difficulty or plot deltas, but the raw JSONs above have everything.

---

Changes implemented (low-risk):

- Broader exploration seeding around the assembler/chest (adds a ±20 ring and extra chest offsets) and tracks unreachable targets to avoid re-chasing failed A* paths.
- Single-resource bottleneck handling: when only one input is required, zero the other targets and prevent dumping that resource.
- Dump throttling now respects active targets so we don't discard the bottleneck resource.

(A minimal illustrative sketch of the unreachable-target and bottleneck handling is included at the end of this note.)

Rebuilt Nim bindings.

10-seed sweep rerun (outputs in eval_plots/seed_<n>_racecar_thinky_postfix.json, seeds 1–10):

- RaceCar: 1023/1728 successes (59.2%), mean per-run 59.2% (stdev 2.19pp).
- Thinky: 1001/1728 successes (57.9%), mean per-run 57.9% (stdev 1.74pp).

Rare-success missions (RaceCar, successes <5/8 runs; lower count because two seeds timed out previously, so 8 runs completed):

- diagnostic_charge_up_hard (4 cogs): 1/8
- oxygen_bottleneck_hard (1c: 1/8, 2c: 3/8)
- diagnostic_assemble_seeded_search_hard (2c: 2/8)
- diagnostic_memory_hard (2c: 2/8)
- quadrant_buildings family (1–4c: 1–4/8)
- vibe_check_easy/standard/hard (2c: 1–4/8; 4c: 3/8)
- diagnostic_extract_missing_silicon_hard (2c: 3/8)
- diagnostic_chest_navigation1_hard (4c: 4/8)

Observations:

- The bottleneck prioritization and unreachable marking helped overall success (+~1.7pp over the previous sweep) and trimmed the rare list from 21 to 17 cases.
- The remaining rare cases mostly cluster into: (a) vibe-check tasks (need deliberate vibe selection), (b) quadrant/building spread (needs stronger cog role separation), (c) hard energy/charge maps (charge_up_hard, oxygen_bottleneck_hard), and (d) long/maze memory (memory_hard).

If you want, I'll tackle the next set of targeted tweaks:

1. Add simple vibe-matching logic for vibe_check*/assemble_seeded_search_hard.
2. Assign directional roles per agent to cover quadrants faster.
3. Cache depleted/visited extractors per resource with a cooldown for single_use/charge_hard maps.

Let me know which to do next.
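The changes above live in the Nim racecar agent, so purely as an illustration here is a Python sketch of the two mechanisms described under "Changes implemented": remembering unreachable targets so failed A* goals are not re-chased, and refusing to dump the single bottleneck resource. All names (`ring_offsets`, `Target`, `ExplorationState`, the `a_star` callable) are hypothetical and do not mirror the real agent's API.

```python
# Illustrative sketch only; the real racecar agent is Nim and its APIs differ.
from dataclasses import dataclass, field
from typing import Callable, Optional


def ring_offsets(radius: int = 20) -> list[tuple[int, int]]:
    """Square ring of offsets, e.g. the +-20 ring seeded around the assembler/chest."""
    r = radius
    return [(dx, dy)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            if max(abs(dx), abs(dy)) == r]


@dataclass(frozen=True)
class Target:
    x: int
    y: int
    resource: Optional[str] = None    # e.g. "oxygen" when the target feeds the assembler


@dataclass
class ExplorationState:
    unreachable: set = field(default_factory=set)       # A* goals that already failed
    required_inputs: set = field(default_factory=set)   # inputs the assembler still needs

    def next_target(self, candidates: list, a_star: Callable) -> Optional[Target]:
        """Pick the first candidate not already known to be unreachable."""
        for t in candidates:
            if t in self.unreachable:
                continue                   # don't re-chase a failed path
            if a_star(t) is None:
                self.unreachable.add(t)    # remember the failure for later calls
                continue
            return t
        return None

    def may_dump(self, resource: str) -> bool:
        """Dump throttling: never discard the single bottleneck resource."""
        if len(self.required_inputs) == 1 and resource in self.required_inputs:
            return False
        return True
```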
[Asana Task](https://app.asana.com/1/1209016784099267/project/1210348820405981/task/1211941496103104)

---------

Co-authored-by: treeform <starplant@gmail.com>
Co-authored-by: Nishad <nishad@stem.ai>
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>