feat(studio)!: benchmarks.yaml as single source of truth, live-reloaded by christso · Pull Request #1145 · EntityProcess/agentv

christso · 2026-04-20T04:55:32Z

Closes #1144.

Summary

Studio previously treated benchmark discovery as startup-time bootstrap — any change to the registry required restarting agentv serve. This PR makes the registry re-read on every /api/benchmarks request, so edits propagate live; the Studio UI polls every ~10 s and reflects the change without operator action.

Implements option 2 of #1144's "Proposed direction": runtime reload of the project registry when the file changes. No filesystem watcher; no background scanning; no cache; the single source of truth is ~/.agentv/benchmarks.yaml.
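The "no cache, re-read on every request" model can be sketched roughly as follows. This is a minimal illustration, not the code in packages/core/src/benchmarks.ts: `makeRegistryLoader`, `Entry`, and `Registry` are hypothetical names, the file read is replaced by an injected `readText` function, and `JSON.parse` stands in for a YAML parser only to keep the sketch dependency-free. Treating an unreadable file as an empty registry is also an assumption.

```typescript
// Hypothetical shapes; the real registry types live in packages/core/src/benchmarks.ts.
interface Entry { id: string; path: string }
interface Registry { benchmarks: Entry[] }

// Build a loader that re-reads and re-parses its source on every call.
// Nothing is memoized, so an external edit is visible to the very next caller.
function makeRegistryLoader(
  readText: () => string,
  parse: (text: string) => unknown,
): () => Registry {
  return () => {
    let parsed: unknown;
    try {
      parsed = parse(readText());
    } catch {
      // Assumption: a missing or unparsable file means "no benchmarks registered".
      return { benchmarks: [] };
    }
    const list = (parsed as { benchmarks?: unknown } | null)?.benchmarks;
    return { benchmarks: Array.isArray(list) ? (list as Entry[]) : [] };
  };
}

// Stand-in for the file on disk; a real loader would read ~/.agentv/benchmarks.yaml.
let fileText = JSON.stringify({ benchmarks: [] });
const loadRegistry = makeRegistryLoader(() => fileText, JSON.parse);

const countBeforeEdit = loadRegistry().benchmarks.length;
// Simulate an external edit (ConfigMap update, manual edit, or API write).
fileText = JSON.stringify({ benchmarks: [{ id: "demo", path: "/repos/demo" }] });
const countAfterEdit = loadRegistry().benchmarks.length;
```

Because the loader holds no state, there is nothing to invalidate: the next call after an edit sees the new content.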

The PR also does a full terminology sweep: the on-disk filename, every API response key, every Studio URL, every CLI flag and error message now uses "benchmark" consistently.

What's in the PR

Runtime behavior (the #1144 ask)

/api/benchmarks re-reads benchmarks.yaml on every request. ConfigMap-driven, file-edit-driven, and API-driven changes all propagate within the UI's ~10 s poll interval.
POST /api/benchmarks {path} registers a benchmark; DELETE /api/benchmarks/:id unregisters. Both write to benchmarks.yaml.
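For client authors, the three calls above could be described like this. It is a sketch only: `ApiCall` and the three builder functions are hypothetical helpers, and `baseUrl` is a placeholder for wherever the server is listening. Only the methods, paths, and the `{path}` body come from this PR description.

```typescript
// Minimal description of an HTTP call; hand it to fetch() or any other client.
interface ApiCall {
  method: "GET" | "POST" | "DELETE";
  url: string;
  body?: string;
}

// List the registry; the server re-reads benchmarks.yaml for each such request.
function listBenchmarks(baseUrl: string): ApiCall {
  return { method: "GET", url: `${baseUrl}/api/benchmarks` };
}

// Register a benchmark by its path on the server's filesystem.
function registerBenchmark(baseUrl: string, path: string): ApiCall {
  return { method: "POST", url: `${baseUrl}/api/benchmarks`, body: JSON.stringify({ path }) };
}

// Unregister a benchmark by id; the id is encoded so it cannot alter the route.
function unregisterBenchmark(baseUrl: string, id: string): ApiCall {
  return { method: "DELETE", url: `${baseUrl}/api/benchmarks/${encodeURIComponent(id)}` };
}
```

A caller would typically pass `method` and `body` through to `fetch(call.url, ...)`; sending a JSON content-type header with the POST is an assumption about what the server expects.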

Filename rename (breaking)

~/.agentv/projects.yaml → ~/.agentv/benchmarks.yaml.

Wire-format rename (breaking)

{"projects": [...]} → {"benchmarks": [...]}.
project_id / project_name → benchmark_id / benchmark_name on /api/benchmarks/all-runs.
/api/config returns benchmark_name + multi_benchmark_dashboard.
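A client that needs to tell the new wire shape from the old one might use guards along these lines. The key names come from the list above; the field types (strings, a boolean flag, an untyped array) and the guard functions themselves are assumptions for illustration.

```typescript
// Key names are from the PR description; field types are assumed.
interface BenchmarksResponse { benchmarks: unknown[] }
interface AllRunsRow { benchmark_id: string; benchmark_name: string }
interface ConfigResponse { benchmark_name: string; multi_benchmark_dashboard: boolean }

// Accepts { benchmarks: [...] }; rejects the old { projects: [...] } shape.
function isBenchmarksResponse(value: unknown): value is BenchmarksResponse {
  if (typeof value !== "object" || value === null) return false;
  return Array.isArray((value as Record<string, unknown>)["benchmarks"]);
}

// Accepts rows carrying benchmark_id / benchmark_name; rejects project_id / project_name rows.
function isAllRunsRow(value: unknown): value is AllRunsRow {
  if (typeof value !== "object" || value === null) return false;
  const row = value as Record<string, unknown>;
  return typeof row["benchmark_id"] === "string" && typeof row["benchmark_name"] === "string";
}

// Example of the renamed /api/config keys (values are made up).
const sampleConfig: ConfigResponse = { benchmark_name: "demo", multi_benchmark_dashboard: false };
```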

Studio URL rename (breaking)

/projects/$benchmarkId/... → /benchmarks/$benchmarkId/... throughout.
apps/studio/src/routes/projects/ → routes/benchmarks/, ProjectCard.tsx → BenchmarkCard.tsx.

YAML key convention (breaking)

All keys snake_case per AGENTS.md §"Wire Format Convention". addedAt/lastOpenedAt → added_at/last_opened_at. fromYaml/toYaml in packages/core/src/benchmarks.ts translate at the boundary; TS internals stay camelCase.
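The boundary translation might look roughly like the sketch below. The function names `fromYaml`/`toYaml` and the four timestamp key names come from this PR; the `id`, `name`, and `path` fields and all field types are assumptions, and the real implementation in packages/core/src/benchmarks.ts may differ.

```typescript
// On-disk shape: snake_case keys, as written to benchmarks.yaml.
interface BenchmarkEntryYaml {
  id: string;
  name: string;
  path: string;
  added_at: string;
  last_opened_at: string;
}

// In-memory shape: camelCase, used by TypeScript internals.
interface BenchmarkEntry {
  id: string;
  name: string;
  path: string;
  addedAt: string;
  lastOpenedAt: string;
}

// Translate a parsed YAML entry into the internal shape.
function fromYaml(raw: BenchmarkEntryYaml): BenchmarkEntry {
  return {
    id: raw.id,
    name: raw.name,
    path: raw.path,
    addedAt: raw.added_at,
    lastOpenedAt: raw.last_opened_at,
  };
}

// Translate an internal entry back into the on-disk shape.
function toYaml(entry: BenchmarkEntry): BenchmarkEntryYaml {
  return {
    id: entry.id,
    name: entry.name,
    path: entry.path,
    added_at: entry.addedAt,
    last_opened_at: entry.lastOpenedAt,
  };
}
```

Keeping both directions in one module means the snake_case names appear in exactly one place, which is what makes an on-disk round-trip test meaningful.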

CLI cleanup (breaking)

--discover flag removed.
--add / --remove / --single / --multi unchanged in behavior; help text now says "benchmark".
Error messages "Project not found" / "Register a project…" → "Benchmark not found" / "Register a benchmark…".

AGENTS.md tightening

§"Wire Format Convention" now applies the rule across the board (every YAML file on disk and every cross-process JSON payload is snake_case), adds an anti-patterns list, and cites fromYaml/toYaml as the reference boundary pattern.

Tests

packages/core/test/benchmarks.test.ts: start-empty, add/remove, idempotency, touch isolation, snake_case on-disk round-trip.
apps/cli/test/commands/results/serve.test.ts: updated for renamed identifiers and multi_benchmark_dashboard wire key.

Breaking-change safety

Multi-benchmark Studio shipped ~1 week ago and nothing is known to depend on the old shape. No migration shim; users with an empty or trivial setup just re-register with agentv studio --add <path>.

Test plan

Unit: 2259 tests pass.
Build, typecheck, lint clean.
Pre-push hooks (prek): Build / Typecheck / Lint / Test / Validate eval YAML — all pass.
Manual UAT — empty start → POST → external YAML edit → DELETE → removed endpoints 404 → removed flag rejected.

Acceptance-criteria mapping (#1144)

Studio can start with zero benchmarks and stay healthy.
Adding a benchmark reflects in Studio without restart.
Removing a benchmark reflects without restart.
/api/benchmarks and the UI are updated while the same agentv serve process remains running.

Not implemented in this PR (and not required by #1144): automatic filesystem scanning of a "discovery root." Earlier commits prototyped that and were removed in favor of the simpler config-driven model. Deployments that want ad-hoc repo drops to register automatically can add a sidecar that watches the directory and calls POST /api/benchmarks.
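The sidecar idea could be sketched as a polling loop body like the one below. Nothing here ships with agentv: `newlySeen` and `tick` are hypothetical names, directory listing and the HTTP call are injected as functions, and how a sidecar decides that a directory is a benchmark repo is left open.

```typescript
// Paths present in the current scan that were absent from the previous one.
function newlySeen(previous: ReadonlySet<string>, current: readonly string[]): string[] {
  return current.filter((path) => !previous.has(path));
}

// One sidecar tick: list candidate repos, register the new ones,
// and return the set to compare against on the next tick.
async function tick(
  seen: ReadonlySet<string>,
  listRepos: () => Promise<string[]>,
  register: (path: string) => Promise<void>,
): Promise<Set<string>> {
  const current = await listRepos();
  for (const path of newlySeen(seen, current)) {
    // In a real sidecar this would be POST /api/benchmarks with { path }.
    await register(path);
  }
  // If register() throws, the old set is kept by the caller and the path is retried next tick.
  return new Set(current);
}
```

A caller would run `tick` on a timer and keep the returned set between runs. Removal of vanished repos is deliberately not handled in this sketch.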

Persist optional discoveryRoots in ~/.agentv/projects.yaml and resolve the effective benchmark set at request time so repos appearing or disappearing under a root are reflected in Studio without restarting `agentv serve`.

- Add BenchmarkEntry.source ('manual' | 'discovered') and BenchmarkRegistry.discoveryRoots; keep YAML unchanged when empty.
- Add resolveActiveBenchmarks / getActiveBenchmark: merges persisted entries with a live rescan of every root; persisted wins on path conflict; discovered entries are never written to disk.
- Route /api/benchmarks, /api/benchmarks/all-runs, /api/benchmarks/:id/summary and withBenchmark / registerEvalRoutes through the active list so discovered repos participate in every benchmark-scoped route.
- New HTTP endpoints: GET/POST/DELETE /api/benchmarks/discovery-roots and POST /api/benchmarks/rescan. DELETE /api/benchmarks/:id rejects discovered entries with a clear error.
- New --discovery-root <path> CLI flag (repeatable) that persists a root and continues to start the server; --discover's one-shot semantics are preserved.
- Count active benchmarks when picking single/multi dashboard mode.
- Unit tests in packages/core/test/benchmarks.test.ts cover add/remove/idempotency, live appear/disappear, manual-vs-discovered precedence, and standalone manual entries.

Closes #1144.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-04-20T04:55:53Z

Deploying agentv with Cloudflare Pages

Latest commit:	`b97f20e`
Status:	✅ Deploy successful!
Preview URL:	https://f20c4855.agentv.pages.dev
Branch Preview URL:	https://feat-1144-studio-runtime-dis.agentv.pages.dev


- Drop external-issue-reference comments per AGENTS.md §7 (AI-First).
- Document single-writer assumption in benchmarks.ts header; the existing read-modify-write model is safe for the single-process Studio case that motivated the change.
- Sort discoverBenchmarks output so id assignment under basename collisions is deterministic across filesystems.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AGENTS.md §"Wire Format Convention" mandates snake_case for YAML config fields, with camelCase reserved for internal TypeScript. The previous commit emitted discoveryRoots (camelCase) on disk. TS field name stays discoveryRoots; only the serialization boundary changes. Adds a regression test that reads projects.yaml after a write and asserts the on-disk key is discovery_roots.

Pre-existing benchmarks[] fields (addedAt, lastOpenedAt) are left as-is in this PR since changing them would be a back-compat-breaking migration orthogonal to runtime discovery; they're flagged in the file header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BREAKING: benchmark entries in ~/.agentv/projects.yaml are serialized with snake_case keys (added_at, last_opened_at, source) instead of camelCase (addedAt, lastOpenedAt). Single-project Studio users are unaffected because they don't touch projects.yaml; multi-project users on pre-release builds must re-register projects (`agentv serve --add <path>`).

- Introduce BenchmarkEntryYaml + fromYaml/toYaml in packages/core/src/benchmarks.ts so TS internals stay camelCase and the YAML boundary stays snake_case.
- Drop the camelCase → snake_case carry-over for addedAt / lastOpenedAt; the file header now documents the fully snake_case format as canonical.
- Tighten AGENTS.md §"Wire Format Convention" to apply the rule across all on-disk YAML (eval configs, projects.yaml, future files), add an anti-patterns list, and cite fromYaml/toYaml as the reference pattern for YAML boundaries.
- Add a regression test asserting serialized keys on disk are snake_case and that the file round-trips into the camelCase TS shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BREAKING: hard-remove the --discover CLI flag and POST /api/benchmarks/discover endpoint. Callers should use --discovery-root + POST /api/benchmarks/discovery-roots for the runtime-watching model. The core discoverBenchmarks util stays exported.

Better UX for Remove on a discovered entry: instead of 400-ing, add the repo's path to a new persisted excluded_paths[] list and hide it from future scans. The .agentv/ directory stays on disk, so the user can re-show the repo (via DELETE /api/benchmarks/exclusions) or pin it manually (POST /api/benchmarks, which auto-unexcludes).

- New BenchmarkRegistry.excludedPaths?: string[] (YAML key: excluded_paths).
- New core helpers: getExcludedPaths, addExcludedPath, removeExcludedPath.
- resolveActiveBenchmarks filters the discovered set by excludedPaths; pinned entries are never filtered.
- addBenchmark() strips the path from excludedPaths if present; an explicit pin wins over a prior hide.
- DELETE /api/benchmarks/:id on a discovered entry calls addExcludedPath and returns { ok: true, excluded: <path> }; on a manual entry it still removes from benchmarks[] as before.
- GET /api/benchmarks/exclusions lists excluded paths; DELETE unhides one.
- Route-ordering fix: DELETE /api/benchmarks/:benchmarkId is now registered after all /api/benchmarks/<literal> sub-paths so Hono doesn't route DELETE /api/benchmarks/exclusions (or /discovery-roots) through the :id handler with benchmarkId=<literal>. Inline comment documents the constraint.
- Tests: exclude-hides-then-unexclude-shows round-trip, snake_case on-disk key assertion, pin-beats-exclusion flow.

All 2260 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
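The route-ordering constraint in this commit can be illustrated with a toy first-match router. This is a simplified model written for this note, not Hono's router: `matchPath` and `dispatch` are hypothetical, and the only behavior assumed is the one the commit message describes, namely that a `:benchmarkId` route registered first will capture a literal sub-path.

```typescript
type Handler = (params: Record<string, string>) => string;
interface Route { method: string; pattern: string; handler: Handler }

// Match "/a/:id" style patterns segment by segment; returns captured params or null.
function matchPath(pattern: string, path: string): Record<string, string> | null {
  const want = pattern.split("/");
  const got = path.split("/");
  if (want.length !== got.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < want.length; i++) {
    const w = want[i] ?? "";
    const g = got[i] ?? "";
    if (w.startsWith(":")) params[w.slice(1)] = g;
    else if (w !== g) return null;
  }
  return params;
}

// The first registered route that matches wins.
function dispatch(routes: Route[], method: string, path: string): string | null {
  for (const route of routes) {
    if (route.method !== method) continue;
    const params = matchPath(route.pattern, path);
    if (params) return route.handler(params);
  }
  return null;
}

const byId: Route = {
  method: "DELETE",
  pattern: "/api/benchmarks/:benchmarkId",
  handler: (p) => `remove ${p["benchmarkId"]}`,
};
const exclusions: Route = {
  method: "DELETE",
  pattern: "/api/benchmarks/exclusions",
  handler: () => "unhide",
};

// Param route first: the literal path is swallowed as benchmarkId="exclusions".
const wrongOrder = dispatch([byId, exclusions], "DELETE", "/api/benchmarks/exclusions");
// Literal route first: the request reaches the intended handler.
const rightOrder = dispatch([exclusions, byId], "DELETE", "/api/benchmarks/exclusions");
```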

Studio already called the domain concept a "benchmark" in code (BenchmarkRegistry, BenchmarkEntry, /api/benchmarks), but a long tail of surfaces still said "project": the on-disk registry filename, API response keys, CLI help text, error messages, docs, Studio routes, and component names. This sweeps every remaining "project" surface to "benchmark".

BREAKING (safe: multi-benchmark Studio isn't in use yet):

- File rename: ~/.agentv/projects.yaml → ~/.agentv/benchmarks.yaml. migrateLegacyRegistry() copies any existing projects.yaml on first load (from either the AGENTV_HOME or the config-dir location) and deletes the legacy file.
- API response shape: /api/benchmarks now returns { benchmarks: [...] } (was projects); /api/benchmarks/all-runs rows now carry benchmark_id / benchmark_name (were project_id / project_name); /api/config returns benchmark_name + multi_benchmark_dashboard.
- Studio URLs: /projects/$benchmarkId/... → /benchmarks/$benchmarkId/...; the routes/projects/ directory is renamed to routes/benchmarks/.
- CLI: --discover flag removed (use --discovery-root instead); --add/--remove/--multi/--single help text says "benchmark"; error messages say "Benchmark not found" / "Registered benchmark: …"; console.log is "Multi-benchmark mode".
- Core: BenchmarkEntry / BenchmarkRegistry unchanged (already benchmark-named); saveBenchmarkRegistry now preserves excludedPaths alongside discoveryRoots; the old migrateProjectsYaml is folded into migrateLegacyRegistry.
- Frontend: ProjectCard component → BenchmarkCard; all Project*Sidebar / Project*Tab internal components renamed; UI strings say "benchmark".
- Docs: studio.mdx option table updated; Auto-Discovery section replaced with Runtime Discovery that describes --discovery-root and the excluded_paths flow. running-evals.mdx filename reference updated.
- AGENTS.md: Wire Format Convention lists benchmarks.yaml instead of projects.yaml.
- Tests: resolveDashboardMode tests renamed ("single/multi-benchmark"); /api/config test reads the new keys; new benchmarks.test.ts case asserts migration from legacy projects.yaml.

All 2261 tests pass; build, typecheck, and lint clean. Manual UAT confirmed: legacy projects.yaml → benchmarks.yaml migration, on-disk snake_case keys, response shape and /api/config emit the new names end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The projects.yaml → benchmarks.yaml migration shim is one-time: any surviving file is migrated on the first post-upgrade `agentv` invocation. Leaving the code in place past v5.0.0 is dead weight. Flag it with a concrete TODO so future maintainers know where and when to delete it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Multi-benchmark Studio only shipped last week, so in practice nobody has a populated projects.yaml to migrate from. The shim (and its TODO pointing at a future v5.0.0 cleanup) is dead code that future maintainers have to carry. Delete it now while there's no adoption to protect, and let any (hypothetical) stale file simply mean "re-register your benchmarks."

- Remove migrateLegacyRegistry() and its call site in loadBenchmarkRegistry.
- Remove the legacy-filename paragraph from the benchmarks.ts header.
- Remove the unused copyFileSync / rmSync / getAgentvHome imports.
- Remove the migration test case from benchmarks.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- AGENTS.md §"Wire Format Convention" now uses benchmark_id in its HTTP-body example, matching the actual wire shape.
- Delete docs/plans/1144-runtime-benchmark-discovery.md. Per AGENTS.md §"Plans and Worktrees", plans are working materials and should be removed before merging; the design decisions it captured have all landed in code, the header docstring, and studio.mdx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Final-review findings from a fresh subagent pass over commits since the first review:

- RunEvalModal invalidated queryKey: ['projects'] after an eval run finished. That key never existed after the rename (the benchmark list uses ['benchmarks']), so the multi-benchmark dashboard's pass-rate / last-run columns did not refresh when an eval completed from the modal. Rename to ['benchmarks']. Real regression: the only functional bug the reviewer found.
- /api/benchmarks/:id/summary returned "Failed to read project" on 500. Bring it in line with the rest of the API: "Failed to read benchmark".
- resolveDashboardMode took a projectCount parameter and one of its internal comments still said "project-scoped routes"; Sidebar.tsx had a "Project-scoped sidebars" section header. Pure TS drift from the rename sweep.
- addExcludedPath now early-returns when the path is already pinned in benchmarks[]. The exclusion filter only applies to the discovered set, so recording an exclusion for a pinned path is meaningless state; the guard keeps the YAML invariant crisp and mirrors the auto-unexclude that addBenchmark already does. New unit test covers the invariant.

Skipped nits (per YAGNI): defensive literal-path guard inside DELETE /api/benchmarks/:id and the pre-existing benchmark_name-from-basename quirk in single-benchmark mode. Route ordering already works and the inline comment documents the constraint; the basename issue was there before this PR.

2261 tests pass; build, typecheck, lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
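The stale-key regression described in this commit can be modeled with a tiny prefix-matching cache. This is a sketch under the assumption that the query cache matches invalidation filters by key prefix, as TanStack Query does by default; `keyMatches` and `invalidated` are hypothetical helpers, and the second cached key is made up for the example.

```typescript
type QueryKey = readonly string[];

// True when `filter` is a prefix of `key`.
function keyMatches(filter: QueryKey, key: QueryKey): boolean {
  return filter.length <= key.length && filter.every((part, i) => part === key[i]);
}

// Cached keys that an invalidation with `filter` would mark stale.
function invalidated(cache: readonly QueryKey[], filter: QueryKey): QueryKey[] {
  return cache.filter((key) => keyMatches(filter, key));
}

// After the rename, the benchmark list is cached under ['benchmarks'].
const cachedKeys: QueryKey[] = [["benchmarks"], ["benchmarks", "demo", "runs"]];

// The leftover key matches nothing, so the dashboard never refetches.
const staleWithOldKey = invalidated(cachedKeys, ["projects"]);
// The corrected key matches the list and everything nested under it.
const staleWithNewKey = invalidated(cachedKeys, ["benchmarks"]);
```

The failure is silent because invalidating a key that matches nothing is not an error, which is why a rename sweep can miss it without any test or type check failing.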

…gistry

The Studio runtime-update acceptance criteria for #1144 are satisfied by runtime reload of benchmarks.yaml (option 2 of the issue's "Proposed direction"), which is what loadBenchmarkRegistry() has always done on every request. The filesystem-scanning path added in earlier commits (discovery_roots, excluded_paths, source=manual|discovered, the active-vs-persisted split, per-request depth-2 readdirSync) was significant code surface for a single niche workflow: dropping a .agentv/ directory into a watched folder and having it appear without an explicit API call or file edit.

Trade it for the simpler declarative model: benchmarks.yaml is the single source of truth; edits to it (direct, via POST /api/benchmarks, via --add/--remove, or via a Kubernetes ConfigMap mount) propagate within the UI's 10 s poll interval. Deployments that want declarative config get the clean path; deployments that want ad-hoc repo drops can script POST /api/benchmarks.

BREAKING (safe: multi-benchmark Studio shipped last week and nothing adopted yet):

- Remove BenchmarkSource, BenchmarkEntry.source, BenchmarkRegistry.discoveryRoots, BenchmarkRegistry.excludedPaths.
- Remove core helpers: addDiscoveryRoot, removeDiscoveryRoot, getDiscoveryRoots, addExcludedPath, removeExcludedPath, getExcludedPaths, resolveActiveBenchmarks, getActiveBenchmark.
- Remove HTTP endpoints: GET/POST/DELETE /api/benchmarks/discovery-roots, GET/DELETE /api/benchmarks/exclusions, POST /api/benchmarks/rescan.
- Remove --discovery-root CLI flag (and the multioption/array cmd-ts imports it needed).
- Remove wire-format source field from /api/benchmarks responses.
- Remove Watch form + addDiscoveryRootApi from the Studio frontend.
- Simplify DELETE /api/benchmarks/:benchmarkId back to a straight remove.
- Docs: studio.mdx drops --discovery-root from the options table and replaces the Runtime Discovery section with "Runtime behavior: no restart needed" covering the 10 s poll model and ConfigMap flow.
- Tests: rewrite benchmarks.test.ts to cover the core CRUD surface (start-empty, add/remove, idempotency, touch, snake_case on-disk round-trip) and drop the discovery/exclusion/precedence cases.

Net -472 lines. All 2259 tests pass; build, typecheck, lint clean. UAT confirmed: POST add, external YAML edit, DELETE by id all reflect live; removed endpoints return 404; removed flag rejected with "Unknown arguments".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…equests

Retro on #1145: the PR started as a modest runtime-discovery feature, grew to include source/excluded_paths/route-ordering machinery, and was later torn back out in favor of the one-line runtime-reload that the existing registry already provided. YAGNI was in AGENTS.md but only covered "don't build features nobody asked for"; it didn't catch "someone asked for X and I built a bigger X than necessary."

Add five habits to §YAGNI that would have caught the miss:

1. Audit existing primitives before adding new ones.
2. Treat issue language as a hint; summarize acceptance criteria in your own words, strip implementation nouns, then check existing primitives before designing.
3. Prefer data/config changes over new mechanisms.
4. Stop and re-plan when scope doubles; don't push through.
5. Stop when you're about to add a second mode, precedence rules, or invariants between optional fields. Those are complexity tells.

Also add a "call out existing overengineering" rule: when working on a task, if you spot an overengineered existing feature, open a cleanup tracking issue rather than widening the current PR. Names the shape of issue to open so it's actionable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

christso marked this pull request as ready for review April 20, 2026 05:02

christso and others added 9 commits April 20, 2026 07:54

christso changed the title ~~feat(studio): runtime benchmark discovery without server restart~~ feat(studio)!: benchmarks.yaml as single source of truth, live-reloaded Apr 20, 2026

christso merged commit af559e3 into main Apr 20, 2026
4 checks passed

christso deleted the feat/1144-studio-runtime-discovery branch April 20, 2026 23:09

christso mentioned this pull request Apr 20, 2026

docs(targets): add CLI Provider page + oracle-validation pattern #1146 (Merged, 4 tasks)

christso merged 12 commits into main from feat/1144-studio-runtime-discovery
