
feat(studio)!: benchmarks.yaml as single source of truth, live-reloaded #1145

Merged
christso merged 12 commits into main from feat/1144-studio-runtime-discovery
Apr 20, 2026

@christso christso commented Apr 20, 2026

Closes #1144.

Summary

Studio previously treated benchmark discovery as startup-time bootstrap — any change to the registry required restarting agentv serve. This PR makes the registry re-read on every /api/benchmarks request, so edits propagate live; the Studio UI polls every ~10 s and reflects the change without operator action.

Implements option 2 of #1144's "Proposed direction": runtime reload of the project registry when the file changes. No filesystem watcher; no background scanning; no cache; the single source of truth is ~/.agentv/benchmarks.yaml.

The PR also does a full terminology sweep: the on-disk filename, every API response key, every Studio URL, every CLI flag and error message now uses "benchmark" consistently.

What's in the PR

Runtime behavior (the #1144 ask)

  • /api/benchmarks re-reads benchmarks.yaml on every request. ConfigMap-driven, file-edit-driven, and API-driven changes all propagate within the UI's ~10 s poll interval.
  • POST /api/benchmarks {path} registers a benchmark; DELETE /api/benchmarks/:id unregisters. Both write to benchmarks.yaml.
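
A minimal sketch of the reload-per-request model in the bullets above, not the code in this PR. `TextStore`, `MemoryStore`, `createRegistryApi`, and the two-field entry shape are hypothetical, JSON stands in for YAML so the block has no dependencies, and deriving the id from the path basename is an assumption. The point is only that every operation goes back to the file and nothing is cached in between.

```typescript
// Hypothetical stand-in for ~/.agentv/benchmarks.yaml: anything that can
// return the current file text and overwrite it.
interface TextStore {
  read(): string;
  write(text: string): void;
}

// Assumed entry shape for illustration; the real BenchmarkEntry may carry more fields.
interface Entry {
  id: string;
  path: string;
}

// JSON stands in for YAML to keep the sketch dependency-free.
function load(store: TextStore): Entry[] {
  const text = store.read().trim();
  if (text === "") return []; // an empty file means zero benchmarks
  const parsed = JSON.parse(text) as { benchmarks?: Entry[] };
  return parsed.benchmarks ?? [];
}

function save(store: TextStore, entries: Entry[]): void {
  store.write(JSON.stringify({ benchmarks: entries }, null, 2));
}

function createRegistryApi(store: TextStore) {
  return {
    // GET /api/benchmarks: re-read on every call, no cache.
    list(): { benchmarks: Entry[] } {
      return { benchmarks: load(store) };
    },
    // POST /api/benchmarks {path}: read-modify-write, idempotent on path.
    add(path: string): Entry {
      const entries = load(store);
      const existing = entries.find((e) => e.path === path);
      if (existing) return existing;
      // Assumption: id is the last path segment; collisions are not handled here.
      const id = path.split("/").filter(Boolean).pop() ?? "benchmark";
      const entry: Entry = { id, path };
      save(store, [...entries, entry]);
      return entry;
    },
    // DELETE /api/benchmarks/:id: returns false when the id is unknown.
    remove(id: string): boolean {
      const entries = load(store);
      const kept = entries.filter((e) => e.id !== id);
      if (kept.length === entries.length) return false;
      save(store, kept);
      return true;
    },
  };
}

// In-memory store used to exercise the sketch; `text` can be overwritten
// directly to simulate an external edit or a ConfigMap update.
class MemoryStore implements TextStore {
  text = "";
  read(): string {
    return this.text;
  }
  write(text: string): void {
    this.text = text;
  }
}
```

Because `list()` holds no state of its own, overwriting `store.text` between two calls is enough for the second call to return the new set, which is the behavior the UI poll relies on.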

Filename rename (breaking)

  • ~/.agentv/projects.yaml → ~/.agentv/benchmarks.yaml.

Wire-format rename (breaking)

  • {"projects": [...]} → {"benchmarks": [...]}.
  • project_id / project_name → benchmark_id / benchmark_name on /api/benchmarks/all-runs.
  • /api/config returns benchmark_name + multi_benchmark_dashboard.

Studio URL rename (breaking)

  • /projects/$benchmarkId/... → /benchmarks/$benchmarkId/... throughout.
  • apps/studio/src/routes/projects/ → routes/benchmarks/, ProjectCard.tsx → BenchmarkCard.tsx.

YAML key convention (breaking)

  • All keys snake_case per AGENTS.md §"Wire Format Convention". addedAt/lastOpenedAt → added_at/last_opened_at. fromYaml/toYaml in packages/core/src/benchmarks.ts translate at the boundary; TS internals stay camelCase.
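
The boundary pattern can be sketched as follows. This is an illustration under assumptions, not the contents of packages/core/src/benchmarks.ts: only `addedAt`/`lastOpenedAt`, their snake_case forms, and the names `fromYaml`, `toYaml`, and `BenchmarkEntryYaml` come from this PR; the `id`, `name`, and `path` fields and the exact signatures are assumed.

```typescript
// In-memory shape (camelCase). id, name, and path are assumed for illustration.
interface BenchmarkEntry {
  id: string;
  name: string;
  path: string;
  addedAt: string;
  lastOpenedAt: string;
}

// On-disk shape (snake_case), as it would be serialized into benchmarks.yaml.
interface BenchmarkEntryYaml {
  id: string;
  name: string;
  path: string;
  added_at: string;
  last_opened_at: string;
}

// Read side of the boundary: snake_case in, camelCase out.
function fromYaml(y: BenchmarkEntryYaml): BenchmarkEntry {
  return {
    id: y.id,
    name: y.name,
    path: y.path,
    addedAt: y.added_at,
    lastOpenedAt: y.last_opened_at,
  };
}

// Write side of the boundary: camelCase in, snake_case out.
function toYaml(e: BenchmarkEntry): BenchmarkEntryYaml {
  return {
    id: e.id,
    name: e.name,
    path: e.path,
    added_at: e.addedAt,
    last_opened_at: e.lastOpenedAt,
  };
}
```

Keeping two explicit types means a camelCase key cannot reach disk without a type error at the call to `toYaml`, which is what the round-trip test in this PR checks at runtime.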

CLI cleanup (breaking)

  • --discover flag removed.
  • --add / --remove / --single / --multi unchanged in behavior; help text now says "benchmark".
  • Error messages "Project not found" / "Register a project…" → "Benchmark not found" / "Register a benchmark…".

AGENTS.md tightening

  • §"Wire Format Convention" now states the rule as a blanket requirement (every YAML file on disk and every cross-process JSON payload is snake_case), adds an anti-patterns list, and cites fromYaml/toYaml as the reference boundary pattern.

Tests

  • packages/core/test/benchmarks.test.ts: start-empty, add/remove, idempotency, touch isolation, snake_case on-disk round-trip.
  • apps/cli/test/commands/results/serve.test.ts: updated for renamed identifiers and multi_benchmark_dashboard wire key.

Breaking-change safety

Multi-benchmark Studio shipped ~1 week ago and nothing is known to depend on the old shape. No migration shim; users with an empty or trivial setup just re-register with agentv studio --add <path>.

Test plan

  • Unit: 2259 tests pass.
  • Build, typecheck, lint clean.
  • Pre-push hooks (prek): Build / Typecheck / Lint / Test / Validate eval YAML — all pass.
  • Manual UAT — empty start → POST → external YAML edit → DELETE → removed endpoints 404 → removed flag rejected.

Acceptance-criteria mapping (#1144)

  • Studio can start with zero benchmarks and stay healthy.
  • Adding a benchmark reflects in Studio without restart.
  • Removing a benchmark reflects without restart.
  • /api/benchmarks and the UI are updated while the same agentv serve process remains running.

Not implemented in this PR (and not required by #1144): automatic filesystem scanning of a "discovery root." Earlier commits prototyped that and were removed in favor of the simpler config-driven model. Deployments that want ad-hoc repo drops to register automatically can add a sidecar that watches the directory and calls POST /api/benchmarks.
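
As a rough sketch of the sidecar idea mentioned above (not something this PR ships), one polling pass could look like this. `SidecarDeps`, `listRepos`, `register`, and `syncOnce` are hypothetical names; the directory listing and the HTTP call are injected so the block runs without a filesystem or a server, and a real `register` would be asynchronous.

```typescript
// Hypothetical dependencies, injected so the sketch stays self-contained.
interface SidecarDeps {
  // Returns the paths of repos currently present under the watched directory.
  listRepos(): string[];
  // Would send POST /api/benchmarks with body { path } in a real sidecar.
  register(path: string): void;
}

// One polling pass: register every repo not seen on an earlier pass.
// Returns the paths registered during this pass.
function syncOnce(seen: Set<string>, deps: SidecarDeps): string[] {
  const added: string[] = [];
  for (const path of deps.listRepos()) {
    if (seen.has(path)) continue;
    deps.register(path);
    seen.add(path);
    added.push(path);
  }
  return added;
}
```

A sidecar would call `syncOnce` on a timer or from a directory-watch callback. Unregistering repos that disappear would need a matching DELETE /api/benchmarks/:id call, which this sketch leaves out.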

Persist optional discoveryRoots in ~/.agentv/projects.yaml and resolve the
effective benchmark set at request time so repos appearing or disappearing
under a root are reflected in Studio without restarting `agentv serve`.

- Add BenchmarkEntry.source ('manual' | 'discovered') and
  BenchmarkRegistry.discoveryRoots; keep YAML unchanged when empty.
- Add resolveActiveBenchmarks / getActiveBenchmark: merges persisted entries
  with a live rescan of every root; persisted wins on path conflict;
  discovered entries are never written to disk.
- Route /api/benchmarks, /api/benchmarks/all-runs, /api/benchmarks/:id/summary
  and withBenchmark / registerEvalRoutes through the active list so
  discovered repos participate in every benchmark-scoped route.
- New HTTP endpoints: GET/POST/DELETE /api/benchmarks/discovery-roots and
  POST /api/benchmarks/rescan. DELETE /api/benchmarks/:id rejects discovered
  entries with a clear error.
- New --discovery-root <path> CLI flag (repeatable) that persists a root and
  continues to start the server; --discover's one-shot semantics are
  preserved.
- Count active benchmarks when picking single/multi dashboard mode.
- Unit tests in packages/core/test/benchmarks.test.ts cover add/remove/
  idempotency, live appear/disappear, manual-vs-discovered precedence, and
  standalone manual entries.

Closes #1144.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented Apr 20, 2026

Deploying agentv with Cloudflare Pages

Latest commit: b97f20e
Status: ✅  Deploy successful!
Preview URL: https://f20c4855.agentv.pages.dev
Branch Preview URL: https://feat-1144-studio-runtime-dis.agentv.pages.dev

- Drop external-issue-reference comments per AGENTS.md §7 (AI-First).
- Document single-writer assumption in benchmarks.ts header; the existing
  read-modify-write model is safe for the single-process Studio case that
  motivated the change.
- Sort discoverBenchmarks output so id assignment under basename collisions
  is deterministic across filesystems.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christso christso marked this pull request as ready for review April 20, 2026 05:02
christso and others added 9 commits April 20, 2026 07:54
AGENTS.md §"Wire Format Convention" mandates snake_case for YAML config
fields, with camelCase reserved for internal TypeScript. The previous
commit emitted discoveryRoots (camelCase) on disk. TS field name stays
discoveryRoots; only the serialization boundary changes.

Adds a regression test that reads projects.yaml after a write and asserts
the on-disk key is discovery_roots.

Pre-existing benchmarks[] fields (addedAt, lastOpenedAt) are left as-is
in this PR since changing them would be a back-compat-breaking migration
orthogonal to runtime discovery; they're flagged in the file header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BREAKING: benchmark entries in ~/.agentv/projects.yaml are serialized with
snake_case keys (added_at, last_opened_at, source) instead of camelCase
(addedAt, lastOpenedAt). Single-project Studio users are unaffected because
they don't touch projects.yaml; multi-project users on pre-release builds
must re-register projects (`agentv serve --add <path>`).

- Introduce BenchmarkEntryYaml + fromYaml/toYaml in packages/core/src/benchmarks.ts
  so TS internals stay camelCase and the YAML boundary stays snake_case.
- Drop the camelCase → snake_case carry-over for addedAt / lastOpenedAt; the
  file header now documents the fully snake_case format as canonical.
- Tighten AGENTS.md §"Wire Format Convention" to apply the rule blanket
  across all on-disk YAML (eval configs, projects.yaml, future files),
  add an anti-patterns list, and cite fromYaml/toYaml as the reference
  pattern for YAML boundaries.
- Add a regression test asserting serialized keys on disk are snake_case
  and that the file round-trips into the camelCase TS shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BREAKING: hard-remove the --discover CLI flag and POST /api/benchmarks/discover
endpoint. Callers should use --discovery-root + POST /api/benchmarks/discovery-roots
for the runtime-watching model. The core discoverBenchmarks util stays exported.

Better UX for Remove on a discovered entry: instead of 400-ing, add the repo's
path to a new persisted excluded_paths[] list and hide it from future scans.
The .agentv/ directory stays on disk, so the user can re-show the repo (via
DELETE /api/benchmarks/exclusions) or pin it manually (POST /api/benchmarks,
which auto-unexcludes).

- New BenchmarkRegistry.excludedPaths?: string[] (YAML key: excluded_paths).
- New core helpers: getExcludedPaths, addExcludedPath, removeExcludedPath.
- resolveActiveBenchmarks filters the discovered set by excludedPaths; pinned
  entries are never filtered.
- addBenchmark() strips the path from excludedPaths if present — explicit pin
  wins over a prior hide.
- DELETE /api/benchmarks/:id on a discovered entry calls addExcludedPath and
  returns { ok: true, excluded: <path> }; on a manual entry it still removes
  from benchmarks[] as before.
- GET /api/benchmarks/exclusions lists excluded paths; DELETE unhides one.
- Route-ordering fix: DELETE /api/benchmarks/:benchmarkId is now registered
  after all /api/benchmarks/<literal> sub-paths so Hono doesn't route
  DELETE /api/benchmarks/exclusions (or /discovery-roots) through the :id
  handler with benchmarkId=<literal>. Inline comment documents the constraint.
- Tests: exclude-hides-then-unexclude-shows round-trip, snake_case on-disk
  key assertion, pin-beats-exclusion flow. All 2260 tests pass.
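
To illustrate the route-ordering constraint in the list above, here is a toy first-match router. It is not Hono and does not claim to reproduce Hono's matching rules; it only assumes, as the commit describes, that a `:param` route registered earlier can capture a literal path segment. `createToyRouter` and its methods are hypothetical.

```typescript
type Handler = (params: Record<string, string>) => string;

interface Route {
  method: string;
  segments: string[];
  handler: Handler;
}

function createToyRouter() {
  const routes: Route[] = [];
  return {
    on(method: string, pattern: string, handler: Handler): void {
      routes.push({ method, segments: pattern.split("/").filter(Boolean), handler });
    },
    // The first registered route that matches wins.
    handle(method: string, path: string): string | undefined {
      const parts = path.split("/").filter(Boolean);
      for (const route of routes) {
        if (route.method !== method || route.segments.length !== parts.length) continue;
        const params: Record<string, string> = {};
        const matched = route.segments.every((seg, i) => {
          const part = parts[i] ?? "";
          if (seg.startsWith(":")) {
            params[seg.slice(1)] = part; // a parameter accepts any segment
            return true;
          }
          return seg === part; // a literal must match exactly
        });
        if (matched) return route.handler(params);
      }
      return undefined;
    },
  };
}

// Literal sub-path first, parameterized route last: the literal wins.
const ordered = createToyRouter();
ordered.on("DELETE", "/api/benchmarks/exclusions", () => "unhide");
ordered.on("DELETE", "/api/benchmarks/:benchmarkId", (p) => `remove ${p["benchmarkId"]}`);

// Parameterized route first: it swallows the literal segment.
const misordered = createToyRouter();
misordered.on("DELETE", "/api/benchmarks/:benchmarkId", (p) => `remove ${p["benchmarkId"]}`);
misordered.on("DELETE", "/api/benchmarks/exclusions", () => "unhide");
```

In the misordered router, a DELETE to /api/benchmarks/exclusions reaches the `:benchmarkId` handler with `benchmarkId` set to `exclusions`, which is the failure the commit avoids by registering the parameterized route last.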

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Studio already called the domain concept a "benchmark" in code (BenchmarkRegistry, BenchmarkEntry, /api/benchmarks), but a long tail of surfaces still said "project": the on-disk registry filename, API response keys, CLI help text, error messages, docs, Studio routes, and component names. This sweeps every remaining "project" surface to "benchmark".

BREAKING (safe — multi-benchmark Studio isn't in use yet):

- File rename: ~/.agentv/projects.yaml → ~/.agentv/benchmarks.yaml. migrateLegacyRegistry() copies any existing projects.yaml on first load (from either the AGENTV_HOME or the config-dir location) and deletes the legacy file.
- API response shape: /api/benchmarks now returns { benchmarks: [...] } (was projects); /api/benchmarks/all-runs rows now carry benchmark_id / benchmark_name (were project_id / project_name); /api/config returns benchmark_name + multi_benchmark_dashboard.
- Studio URLs: /projects/$benchmarkId/... → /benchmarks/$benchmarkId/...; the routes/projects/ directory is renamed to routes/benchmarks/.
- CLI: --discover flag removed (use --discovery-root instead); --add/--remove/--multi/--single help text says "benchmark"; error messages say "Benchmark not found" / "Registered benchmark: …"; console.log is "Multi-benchmark mode".
- Core: BenchmarkEntry / BenchmarkRegistry unchanged (already benchmark-named); saveBenchmarkRegistry now preserves excludedPaths alongside discoveryRoots; the old migrateProjectsYaml is folded into migrateLegacyRegistry.
- Frontend: ProjectCard component → BenchmarkCard; all Project*Sidebar / Project*Tab internal components renamed; UI strings say "benchmark".
- Docs: studio.mdx option table updated; Auto-Discovery section replaced with Runtime Discovery that describes --discovery-root and the excluded_paths flow. running-evals.mdx filename reference updated.
- AGENTS.md: Wire Format Convention lists benchmarks.yaml instead of projects.yaml.
- Tests: resolveDashboardMode tests renamed ("single/multi-benchmark"); /api/config test reads the new keys; new benchmarks.test.ts case asserts migration from legacy projects.yaml.

All 2261 tests pass; build, typecheck, and lint clean. Manual UAT confirmed:
legacy projects.yaml → benchmarks.yaml migration, on-disk snake_case keys, response shape and /api/config emit the new names end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The projects.yaml → benchmarks.yaml migration shim is one-time: any
surviving file is migrated on the first post-upgrade `agentv` invocation.
Leaving the code in place past v5.0.0 is dead weight. Flag it with a
concrete TODO so future maintainers know where and when to delete it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-benchmark Studio only shipped last week — nobody has a populated
projects.yaml to migrate from in practice. The shim (and its TODO pointing
at a future v5.0.0 cleanup) is dead code that future maintainers have to
carry. Delete it now while there's no adoption to protect, and let any
(hypothetical) stale file simply mean "re-register your benchmarks."

- Remove migrateLegacyRegistry() and its call site in loadBenchmarkRegistry.
- Remove the legacy-filename paragraph from the benchmarks.ts header.
- Remove the unused copyFileSync / rmSync / getAgentvHome imports.
- Remove the migration test case from benchmarks.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- AGENTS.md §"Wire Format Convention" now uses benchmark_id in its
  HTTP-body example, matching the actual wire shape.
- Delete docs/plans/1144-runtime-benchmark-discovery.md. Per AGENTS.md
  §"Plans and Worktrees", plans are working materials and should be
  removed before merging; the design decisions it captured have all
  landed in code, the header docstring, and studio.mdx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final-review findings from a fresh subagent pass over commits since the
first review:

- RunEvalModal invalidated queryKey: ['projects'] after an eval run
  finished. That key never existed after the rename (the benchmark list
  uses ['benchmarks']), so the multi-benchmark dashboard's pass-rate /
  last-run columns did not refresh when an eval completed from the modal.
  Rename to ['benchmarks']. Real regression — the only functional bug the
  reviewer found.
- /api/benchmarks/:id/summary returned "Failed to read project" on 500.
  Bring it in line with the rest of the API: "Failed to read benchmark".
- resolveDashboardMode took a projectCount parameter and one of its
  internal comments still said "project-scoped routes"; Sidebar.tsx had
  a "Project-scoped sidebars" section header. Pure TS drift from the
  rename sweep.
- addExcludedPath now early-returns when the path is already pinned in
  benchmarks[]. The exclusion filter only applies to the discovered set,
  so recording an exclusion for a pinned path is meaningless state; the
  guard keeps the YAML invariant crisp and mirrors the auto-unexclude
  that addBenchmark already does. New unit test covers the invariant.

Skipped nits (per YAGNI): defensive literal-path guard inside DELETE
/api/benchmarks/:id and the pre-existing benchmark_name-from-basename
quirk in single-benchmark mode. Route ordering already works and the
inline comment documents the constraint; the basename issue was there
before this PR.

2261 tests pass; build, typecheck, lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gistry

The Studio runtime-update acceptance criteria for #1144 are satisfied by
runtime reload of benchmarks.yaml (option 2 of the issue's "Proposed
direction"), which is what loadBenchmarkRegistry() has always done on
every request. The filesystem-scanning path added in earlier commits
(discovery_roots, excluded_paths, source=manual|discovered, the active-
vs-persisted split, per-request depth-2 readdirSync) was significant
code surface for a single niche workflow — dropping a .agentv/ directory
into a watched folder and having it appear without an explicit API call
or file edit.

Trade it for the simpler declarative model: benchmarks.yaml is the single
source of truth; edits to it (direct, via POST /api/benchmarks, via
--add/--remove, or via a Kubernetes ConfigMap mount) propagate within
the UI's 10 s poll interval. Deployments that want declarative config
get the clean path; deployments that want ad-hoc repo drops can script
POST /api/benchmarks.

BREAKING (safe — multi-benchmark Studio shipped last week and nothing
adopted yet):

- Remove BenchmarkSource, BenchmarkEntry.source,
  BenchmarkRegistry.discoveryRoots, BenchmarkRegistry.excludedPaths.
- Remove core helpers: addDiscoveryRoot, removeDiscoveryRoot,
  getDiscoveryRoots, addExcludedPath, removeExcludedPath, getExcludedPaths,
  resolveActiveBenchmarks, getActiveBenchmark.
- Remove HTTP endpoints: GET/POST/DELETE /api/benchmarks/discovery-roots,
  GET/DELETE /api/benchmarks/exclusions, POST /api/benchmarks/rescan.
- Remove --discovery-root CLI flag (and the multioption/array cmd-ts
  imports it needed).
- Remove wire-format source field from /api/benchmarks responses.
- Remove Watch form + addDiscoveryRootApi from the Studio frontend.
- Simplify DELETE /api/benchmarks/:benchmarkId back to a straight remove.
- Docs: studio.mdx drops --discovery-root from the options table and
  replaces the Runtime Discovery section with "Runtime behavior: no
  restart needed" covering the 10 s poll model and ConfigMap flow.
- Tests: rewrite benchmarks.test.ts to cover the core CRUD surface
  (start-empty, add/remove, idempotency, touch, snake_case on-disk
  round-trip) and drop the discovery/exclusion/precedence cases.

Net -472 lines. All 2259 tests pass; build, typecheck, lint clean. UAT
confirmed: POST add, external YAML edit, DELETE by id all reflect live;
removed endpoints return 404; removed flag rejected with "Unknown
arguments".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christso christso changed the title from "feat(studio): runtime benchmark discovery without server restart" to "feat(studio)!: benchmarks.yaml as single source of truth, live-reloaded" Apr 20, 2026
…equests

Retro on #1145: the PR started as a modest runtime-discovery feature,
grew to include source/excluded_paths/route-ordering machinery, and
was later torn back out in favor of the one-line runtime-reload that
the existing registry already provided. YAGNI was in AGENTS.md but
only covered "don't build features nobody asked for" — it didn't
catch "someone asked for X and I built a bigger X than necessary."

Add five habits to §YAGNI that would have caught the miss:
  1. Audit existing primitives before adding new ones.
  2. Treat issue language as a hint; summarize acceptance criteria in
     your own words, strip implementation nouns, then check existing
     primitives before designing.
  3. Prefer data/config changes over new mechanisms.
  4. Stop and re-plan when scope doubles — don't push through.
  5. Stop when you're about to add a second mode, precedence rules,
     or invariants between optional fields. Those are complexity tells.

Also add a "call out existing overengineering" rule: when working on a
task, if you spot an overengineered existing feature, open a cleanup
tracking issue rather than widening the current PR. Names the shape of
issue to open so it's actionable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christso christso merged commit af559e3 into main Apr 20, 2026
4 checks passed
@christso christso deleted the feat/1144-studio-runtime-discovery branch April 20, 2026 23:09

Development

Successfully merging this pull request may close these issues.

feat(studio): support runtime benchmark discovery without server restart