An autonomous research loop pointed at a SystemVerilog RV32IM CPU. Each round the agent proposes a microarchitectural hypothesis, implements it in an isolated git worktree, then runs it through riscv-formal + Verilator cosim + 3-seed FPGA place-and-route. Only hypotheses that beat the current champion on CoreMark/MHz get merged.
The repo is multi-core: every architecture lives under cores/<name>/ with
its own RTL, tests, log, and core.yaml. make loop TARGET=<name> runs the
loop on one core at a time, on a dedicated core-<name> git branch inside
its own worktree, so two cores can iterate in parallel without stomping each
other.
73 hypotheses, 9h 51m wall-clock, on the run that produced cores/v1/ from
cores/baseline/. Locked baseline → champion went from
301 iter/s (2.23 CoreMark/MHz) to 577 iter/s (2.91 CoreMark/MHz) —
+92% on fitness, +26% over VexRiscv's published 2.30 CoreMark/MHz, with
40% fewer LUTs.
The 10 accepted winners, in order of merge:
| Δt | iter/s | CM/MHz | Fmax | Hypothesis |
|---|---|---|---|---|
| 0.0h | 301.04 | 2.226 | 135 MHz | Baseline |
| 0.4h | 313.10 | 2.320 | 135 MHz | Backward-Branch Taken Predictor |
| 0.7h | 324.48 | 2.348 | 138 MHz | IF Direct-Jump Predictor |
| 2.1h | 375.43 | 2.348 | 160 MHz | Cold Multi-Cycle DIV/REM Unit |
| 2.7h | 397.55 | 2.366 | 168 MHz | One-Deep Store Retirement Slot |
| 3.5h | 422.77 | 2.366 | 179 MHz | Segmented RVFI Order Counter |
| 3.8h | 472.96 | 2.891 | 164 MHz | Registered Lookahead I-Fetch Replay Predictor |
| 4.0h | 505.65 | 2.891 | 175 MHz | Compressed Resetless I-Fetch Replay Tags |
| 5.3h | 529.35 | 2.891 | 183 MHz | RTL-Only Hot/Cold ALU Opcode Split |
| 6.1h | 577.76 | 2.908 | 199 MHz | Banked Registered I-Fetch Replay Predictor |
Full writeup: docs/auto-arch-tournament-blog-post.md.
macOS only for now. One-time toolchain install:
bash setup.sh
Fetches Verilator, OSS CAD Suite (yosys, nextpnr-himbaechel, sby, bitwuzla),
the riscv-none-elf cross compiler, and a few Python deps into .toolchain/.
Tools already on $PATH are detected and reused.
You'll also need a coding-agent CLI. Codex is the default; Claude Code works
too — pass AGENT=claude to any orchestrator target.
Every core-touching command takes TARGET=<name>. The Makefile errors out
otherwise and lists the cores it found.
make next TARGET=v1 # one round, one slot — smoke test
make loop TARGET=v1 N=10 # 10 rounds, sequential
make loop TARGET=v1 N=10 K=3 # 10 rounds, 3 parallel slots/round
make loop TARGET=v1 N=10 LUT=3000 COREMARK=400 # set per-target fitness goals
make loop TARGET=v1 N=10 AGENT=claude # use Claude Code instead of Codex
make loop TARGET=mycore BASE=baseline N=10 # fork a new core from cores/baseline/
make report TARGET=v1 # summary of cores/v1/experiments/log.jsonl
What make loop TARGET=<name> does:
- Creates
.worktrees/<name>/on branchcore-<name>if it doesn't exist (subsequent runs reuse it). Re-execs make inside the worktree so two parallelmake loops on different TARGETs don't collide on the git index. - For each round:
- Hypothesis agent writes
cores/<name>/experiments/hypotheses/<id>.yamlagainst the JSON schema. The prompt sees: source RTL, recent log entries,LESSONS.md,CORE_PHILOSOPHY.md,core.yaml. - Implementation agent edits
cores/<name>/rtl/(and optionallycores/<name>/test/test_*.py) in a per-slot worktree. - Eval pipeline: verilator lint → yosys synth → bench
make→ cosim build → riscv-formal → Verilator cosim vs. Python ISS → 3-seed FPGA P&R + CoreMark. Each step is gated; first failure short-circuits with abroken: <step>: <stderr tail>outcome. - Scribe distills one bullet into
cores/<name>/LESSONS.mdso the next round's hypothesis agent reads what failed and why.
- Hypothesis agent writes
- The highest-fitness slot above the current champion merges into
core-<name>. Regressions / broken / placement-failed → worktree destroyed, log entry written, next round.
The orchestrator never merges to main. Each core's evolution is one PR
diff: git push -u origin core-<name> && gh pr create --base main.
To opt out of the worktree wrap (run directly on the current branch):
make loop TARGET=v1 WORKTREE=.
Other useful targets — all accept TARGET=:
make lint TARGET=v1 # verilator lint over cores/v1/rtl/
make test TARGET=v1 # cocotb unit tests under cores/v1/test/
make cosim TARGET=v1 # cosim alone (no orchestrator)
make formal TARGET=v1 # riscv-formal fast suite (ALTOPS — see CLAUDE.md)
make formal-deep TARGET=v1 # full formal suite WITHOUT ALTOPS — slow, real bitvector arithmetic
make fpga TARGET=v1 # FPGA eval alone (3-seed P&R + CoreMark)
make bench # build selftest.elf / coremark.elf (shared across cores)
make clean TARGET=v1 # nuke per-core build artifacts
make test-infra # pytest under tools/ (no TARGET needed)
If you SSH-sign your commits and run unattended, tools/orchestrator.py
sets commit.gpgsign=false for its own subprocess tree so the loop
doesn't hang on a 1Password biometric prompt. Manual commits from your
shell are unaffected.
The repo holds multiple cores under cores/<name>/. Each has its own RTL,
tests, experiment log, core.yaml, and LESSONS.md. The orchestrator runs
against one core at a time.
Available cores on main:
cores/baseline/— the original simple RV32IM 5-stage in-order core. Universal seed for new cores.cores/v1/— the evolved core from the historical run shown above (branch prediction, banked I-fetch replay, hot/cold ALU split).
Other cores live on their own core-<name> branches, not yet merged to
main — that's the normal state for a core mid-evolution.
Forking a new core:
make loop TARGET=mycore BASE=baseline N=20 LUT=3000 COREMARK=400
The orchestrator:
- Copies
cores/baseline/→cores/mycore/. - Re-runs the eval gates against the freshly-copied RTL to get a measured
current:block incores/mycore/core.yaml— the first log entry. - Iterates
Nrounds, with the LUT/CoreMark targets as Pareto goals.
The work lands on branch core-mycore in .worktrees/mycore/. To review
or ship:
git push -u origin core-mycore
gh pr create --base main
This is intentional: each core's evolution is one reviewable diff, and
multiple cores can iterate concurrently — make loop TARGET=mini and
make loop TARGET=maxperf run side-by-side without colliding because each
runs in its own worktree on its own branch.
Per-core artifacts (under cores/<name>/):
| File / dir | Purpose |
|---|---|
rtl/*.sv |
The design. The agent's playground. |
test/test_*.py |
cocotb tests. The agent may add tests for new modules. |
core.yaml |
Declared targets + auto-updated current: (running champion's measured numbers). |
CORE_PHILOSOPHY.md |
Optional architect intent — injected verbatim into hypothesis prompts. |
LESSONS.md |
Append-only one-line lessons from prior iterations (the scribe agent). |
experiments/log.jsonl |
Per-iteration outcomes: proposal + fitness numbers + verdict + lesson. |
experiments/progress.png |
Fitness chart over time. |
The orchestrator is hardcoded. The model never edits it. What the model can touch is small and explicit:
cores/<TARGET>/rtl/**— any SystemVerilog file. Add modules, delete modules, rename, restructure, rewrite from scratch. The only top-level invariant is the I/O contract oncore.sv(clock/reset, imem/dmem, NRET=2 RVFI).cores/<TARGET>/test/test_*.py— cocotb suites. Add tests for new modules.
Everything else is off-limits. The path sandbox in tools/orchestrator.py
rejects the round before any eval runs if the agent touched formal/,
tools/, fpga/, test/cosim/main.cpp, the CRC table, or any sibling
core's directory. The agent doesn't get to soften its own grader.
The verifier does the heavy lifting:
- riscv-formal — symbolic BMC against RV32IM: decode, traps, ordering, liveness, M-ext discipline. ~105 checks at NRET=2.
- Verilator cosim — random ~22% bus stalls, RVFI byte-identical against
a Python ISS on
selftest.elfandcoremark.elf. - 3-seed P&R — yosys + nextpnr on a Gowin GW2A-LV18 (Tang Nano 20K). Median Fmax × CoreMark iter/cycle = fitness. One seed is a coin flip; three is comparable across rounds.
- CoreMark CRC validation — the four canonical 2K-config CRCs. CoreMark prints "Correct operation validated." even when it isn't, so the bench re-checks them itself.
- Path sandbox — the agent cannot edit anything outside its target
core's
rtl/andtest/test_*.py.
Of 73 hypotheses in the run shown above, 63 were rejected by the verifier. That's the point.
The full contract — invariants, don't-touch list, what may change — is in
CLAUDE.md. The I/O contract is in ARCHITECTURE.md.
Karpathy's autoresearch showed an agent loop finding 20 training-time wins on a nanochat in two days. That worked because Python and gradient descent are the agent's home turf. This repo asks whether the recipe generalizes when you point it somewhere it has no business being good at — SystemVerilog, formal verification, FPGA timing.
It does. But the loop isn't the moat — the loop is commodity. The artifact that survived 10 wins past 63 rejections wasn't the agent; it was the verifier. That's the part that encodes what correct means in this domain. The full argument is in the blog post.
cores/ # per-target architectures
<name>/
rtl/ # SystemVerilog sources (the design — agent-editable)
test/ # cocotb unit tests (agent-editable)
experiments/ # log.jsonl + progress.png + transient hypotheses/<id>.yaml
core.yaml # declared targets + auto-updated current:
CORE_PHILOSOPHY.md # optional architect intent (injected into prompts)
LESSONS.md # append-only scribe output
bench/programs/ # selftest.S, crt0.S, link.ld, EEMBC CoreMark — shared, off-limits
formal/ # riscv-formal wrapper, checks.cfg, run_all.sh — correctness contract
fpga/ # core_bench.sv, synth.tcl, nextpnr scripts, constraints — fitness contract
test/cosim/ # Verilator cosim harness (main.cpp + reference Python ISS)
tools/ # orchestrator, worktree manager, eval gates, scribe, plotting
schemas/ # hypothesis + eval-result JSON schemas
docs/ # design notes, blog post
.worktrees/ # auto-created by `make loop`, one git worktree per TARGET
| Concern | Tool |
|---|---|
| RTL | SystemVerilog (IEEE 1800-2017 synthesizable subset) |
| Sim | Verilator ≥ 5.0 |
| Unit tests | cocotb ≥ 1.8 (Python harness over Verilator) |
| Formal | YosysHQ riscv-formal (vendored submodule); sby + bitwuzla |
| Synth | Yosys + synth_gowin |
| Place & route | nextpnr-himbaechel (Gowin GW2A-LV18QN88C8/I7 = Tang Nano 20K) |
| Cross-compiler | xPack riscv-none-elf-gcc 15.x (symlinked to riscv32-unknown-elf) |
| Orchestrator | Python 3.11+, jsonschema, pyyaml, matplotlib |
| Coding agent | Codex CLI (default) or Claude Code (AGENT=claude) |
If you use HWE Bench in research, please cite the repository (an arXiv
preprint is in preparation). GitHub renders a "Cite this repository"
button from CITATION.cff on the repo's main page.
BibTeX:
@software{bonetto_hwebench_2026,
author = {Bonetto, Felipe Sens},
title = {{HWE Bench: An Unbounded Benchmark for LLM Hardware Development on RISC-V}},
year = {2026},
url = {https://hwebench.com},
howpublished = {\url{https://github.com/FeSens/auto-arch-tournament}}
}Public site: https://hwebench.com.
