Conversation

@relh relh commented Nov 10, 2025

Here’s the 10‑seed sweep you asked for.

  • Ran: uv run packages/cogames/scripts/run_evaluation.py --agent racecar thinky --mission-set all --repeats 1 --jobs 0 --no-plots --seed <1..10> --output eval_plots/seed_<n>_racecar_thinky.json
  • Outputs: eval_plots/seed_1_racecar_thinky.json … seed_10_racecar_thinky.json (10 files, 432 tests per run, 216 per agent).

Aggregate over 10 seeds (4,320 total cases; 2,160 per agent):

  • RaceCar: 1,243/2,160 successes (57.5%), avg_total_reward 4.86, avg_reward_per_agent 2.35.
  • Thinky: 1,245/2,160 successes (57.6%), avg_total_reward 4.85, avg_reward_per_agent 2.34.
  • Run-to-run stability (success rate stdev): RaceCar 2.1pp, Thinky 1.7pp; medians 57.9% (RaceCar) vs 57.2% (Thinky).
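For reference, the mean/median/stdev figures above can be recomputed from the per-seed JSONs with a few lines of Python. The `summarize` helper is generic; the `load_rates` part assumes a `"results"` list of cases with `"agent"` and `"success"` fields, which is a guess at the eval output schema, not taken from run_evaluation.py:

```python
import json
import statistics

def summarize(rates: list[float]) -> dict[str, float]:
    """Mean/median/sample-stdev of per-run success rates (%, percentage points)."""
    return {
        "mean_pct": statistics.mean(rates),
        "median_pct": statistics.median(rates),
        "stdev_pp": statistics.stdev(rates),  # sample stdev across seeds
    }

def load_rates(paths: list[str], agent: str) -> list[float]:
    """One success rate (%) per seed file.

    NOTE: the "results"/"agent"/"success" layout below is an assumption
    about the eval JSON schema, not confirmed from the script.
    """
    rates = []
    for path in paths:
        with open(path) as f:
            cases = [c for c in json.load(f)["results"] if c["agent"] == agent]
        rates.append(100.0 * sum(bool(c["success"]) for c in cases) / len(cases))
    return rates
```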

Where Thinky beats RaceCar most (successes out of 10 runs, same mission & cogs):

  • diagnostic_memory_hard (2 cogs): 6 vs 1.
  • single_use_swarm (1 cog): 8 vs 4.
  • diagnostic_extract_missing_oxygen_hard (2 cogs): 6 vs 3.
  • vibe_check_easy (2 cogs): 5 vs 2.
  • diagnostic_assemble_seeded_search_hard (2 cogs): 5 vs 2.

Where RaceCar beats Thinky most:

  • balanced_corners (2 cogs): 8 vs 4.
  • vibe_check_hard (4 cogs): 6 vs 2.
  • distant_resources (4 cogs): 4 vs 1.
  • single_use_swarm_easy (1 cog): 10 vs 8.
  • oxygen_bottleneck_standard (2 cogs): 10 vs 8.

If you want a deeper dive, I can break down per-difficulty or plot deltas, but the raw JSONs above have everything.


Changes implemented (low‑risk):

  • Broader exploration seeding around assembler/chest (adds ±20 ring, extra chest offsets) and tracks unreachable targets to avoid re-chasing failed A* paths.
  • Single-resource bottleneck handling: when only one input is required, zero other targets and prevent dumping that resource.
  • Dump throttling now respects active targets so we don’t discard the bottleneck resource.
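A minimal sketch of the unreachable-target bookkeeping described above (names and structure are illustrative Python, not the actual Nim implementation):

```python
from typing import Callable, Optional

Pos = tuple[int, int]

class TargetTracker:
    """Skip targets whose A* search already failed, instead of re-chasing them."""

    def __init__(self) -> None:
        self.unreachable: set[Pos] = set()

    def next_target(self, candidates: list[Pos],
                    find_path: Callable[[Pos], Optional[list[Pos]]]) -> Optional[Pos]:
        for target in candidates:
            if target in self.unreachable:
                continue  # a previous A* attempt failed; don't retry this target
            if find_path(target) is not None:
                return target
            self.unreachable.add(target)  # remember the failed search
        return None
```

The key point is that a failed path search permanently (for the episode) removes the target from consideration, so the agent stops burning steps re-running A* against the same wall.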

Rebuilt Nim bindings.

10-seed sweep rerun (outputs in eval_plots/seed_<n>_racecar_thinky_postfix.json, seeds 1–10):

  • RaceCar: 1,023/1,728 successes (59.2%), mean per-run 59.2% (stdev 2.19pp).
  • Thinky: 1,001/1,728 successes (57.9%), mean per-run 57.9% (stdev 1.74pp).

Rare-success missions (RaceCar, fewer than 5 successes out of 8 runs; counts are out of 8 rather than 10 because two seeds timed out, so only 8 runs completed):

  • diagnostic_charge_up_hard (4 cogs): 1/8
  • oxygen_bottleneck_hard (1c: 1/8, 2c: 3/8)
  • diagnostic_assemble_seeded_search_hard (2c: 2/8)
  • diagnostic_memory_hard (2c: 2/8)
  • quadrant_buildings family (1–4c: 1–4/8)
  • vibe_check_easy/standard/hard (2c: 1–4/8; 4c: 3/8)
  • diagnostic_extract_missing_silicon_hard (2c: 3/8)
  • diagnostic_chest_navigation1_hard (4c: 4/8)

Observations:

  • The bottleneck prioritization and unreachable-target marking helped overall success (roughly +1.7pp over the previous sweep) and trimmed the rare list from 21 to 17 cases.
  • Remaining rare cases mostly cluster into: (a) vibe-check tasks (need deliberate vibe selection), (b) quadrant/building spread (needs stronger cog role separation), (c) hard energy/charge maps (charge_up_hard, oxygen_bottleneck_hard), and (d) long/maze memory (memory_hard).

If you want, I’ll tackle the next set of targeted tweaks:

  1. Add simple vibe-matching logic for vibe_check*/assemble_seeded_search_hard.
  2. Assign directional roles per agent to cover quadrants faster.
  3. Cache depleted/visited extractors per resource with cooldown for single_use/charge_hard maps.

Let me know which to do next.
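For tweak 3, the depleted-extractor cooldown could look something like this (a sketch under assumed step-counter semantics; none of these names come from the actual agent code):

```python
Pos = tuple[int, int]

class ExtractorCooldowns:
    """Remember extractors that came up empty and skip them for a while."""

    def __init__(self, cooldown_steps: int = 200) -> None:
        self.cooldown_steps = cooldown_steps
        self._blocked_until: dict[tuple[str, Pos], int] = {}

    def mark_depleted(self, resource: str, pos: Pos, step: int) -> None:
        # Block this extractor for the next cooldown_steps steps.
        self._blocked_until[(resource, pos)] = step + self.cooldown_steps

    def usable(self, resource: str, pos: Pos, step: int) -> bool:
        # Unknown extractors are usable; known ones only after the cooldown.
        return step >= self._blocked_until.get((resource, pos), 0)
```

A cooldown (rather than a permanent blacklist) matters for single_use/charge maps where extractors can recharge, so a depleted site becomes worth revisiting later.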

Asana Task: https://app.asana.com/1/1209016784099267/project/1210348820405981/task/1211941496103104

@relh relh changed the base branch from main to simple_nim_agents November 10, 2025 20:42
@relh relh changed the base branch from simple_nim_agents to main November 10, 2025 20:42
@relh relh changed the base branch from main to simple_nim_agents November 10, 2025 22:08
@treeform treeform force-pushed the simple_nim_agents branch 2 times, most recently from 826e7b1 to bf259ca Compare November 11, 2025 00:54
@relh relh changed the title racecar agents race_car -> racecar, max_steps changes, thinky output appropriately, and racecar agents Dec 4, 2025
@relh relh enabled auto-merge December 4, 2025 20:32
sys.path.append(bindings_dir)


def _maybe_rebuild_nim_bindings() -> None:

This is not needed.

jsony

# Disable debug output (comment out to enable)
template echo(args: varargs[string, `$`]) = discard
Don't like this

const spiral* = generateSpiral(1000)   # before
# The spiral is only used for exploratory walk ordering. Keep it long enough
# to cover the extended 10k-step episodes now used in main.
const spiral* = generateSpiral(10000)  # after

Wut?
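For context on the snippet above: `generateSpiral(n)` presumably yields the first n grid offsets in an outward square spiral from the origin, so bumping 1000 to 10000 just extends the exploratory walk ordering to cover longer episodes. A Python equivalent of the assumed behavior (not the actual Nim code) is:

```python
def generate_spiral(n: int) -> list[tuple[int, int]]:
    """First n grid offsets walking outward in a square spiral from (0, 0)."""
    offsets = [(0, 0)]
    x = y = 0
    dx, dy = 1, 0  # start heading right
    step = 1       # current leg length
    while len(offsets) < n:
        for _ in range(2):            # two legs share each leg length
            for _ in range(step):
                x, y = x + dx, y + dy
                offsets.append((x, y))
                if len(offsets) == n:
                    return offsets
            dx, dy = -dy, dx          # 90-degree turn
        step += 1
    return offsets
```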

@relh relh added this pull request to the merge queue Dec 4, 2025
@relh relh changed the title race_car -> racecar, max_steps changes, thinky output appropriately, and racecar agents race_car -> racecar, max_steps changes, thinky set output appropriately, fix unknown features, and racecar agents Dec 4, 2025
Merged via the queue into main with commit 30f40c2 Dec 4, 2025
27 checks passed
@relh relh deleted the richard-scripted branch December 4, 2025 20:45
zfogg pushed a commit that referenced this pull request Dec 20, 2025
…and racecar agents (#3631)

Co-authored-by: treeform <starplant@gmail.com>
Co-authored-by: Nishad <nishad@stem.ai>
Co-authored-by: graphite-app[bot] <96075541+graphite-app[bot]@users.noreply.github.com>