feat(studio)!: benchmarks.yaml as single source of truth, live-reloaded #1145
Merged
Persist optional discoveryRoots in ~/.agentv/projects.yaml and resolve the
effective benchmark set at request time so repos appearing or disappearing
under a root are reflected in Studio without restarting `agentv serve`.
- Add BenchmarkEntry.source ('manual' | 'discovered') and
BenchmarkRegistry.discoveryRoots; keep YAML unchanged when empty.
- Add resolveActiveBenchmarks / getActiveBenchmark: merges persisted entries
with a live rescan of every root; persisted wins on path conflict;
discovered entries are never written to disk.
- Route /api/benchmarks, /api/benchmarks/all-runs, /api/benchmarks/:id/summary
and withBenchmark / registerEvalRoutes through the active list so
discovered repos participate in every benchmark-scoped route.
- New HTTP endpoints: GET/POST/DELETE /api/benchmarks/discovery-roots and
POST /api/benchmarks/rescan. DELETE /api/benchmarks/:id rejects discovered
entries with a clear error.
- New --discovery-root <path> CLI flag (repeatable) that persists a root and
continues to start the server; --discover's one-shot semantics are
preserved.
- Count active benchmarks when picking single/multi dashboard mode.
- Unit tests in packages/core/test/benchmarks.test.ts cover add/remove/
idempotency, live appear/disappear, manual-vs-discovered precedence, and
standalone manual entries.
Closes #1144.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
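The merge rule in this commit (persisted entries win on a path conflict, discovered entries are only held in memory) can be sketched as below. This is a minimal illustration, not the code in `packages/core`: the `BenchmarkEntry` shape is trimmed down and `mergeActiveBenchmarks` is a hypothetical name standing in for the merge step inside `resolveActiveBenchmarks`.

```typescript
// Hypothetical, simplified shape; the real BenchmarkEntry has more fields.
type BenchmarkSource = "manual" | "discovered";

interface BenchmarkEntry {
  id: string;
  path: string;
  source: BenchmarkSource;
}

// Merge persisted entries with the result of a fresh scan.
// Persisted entries are inserted first, so a discovered entry with the
// same path is skipped. Nothing is written to disk here.
function mergeActiveBenchmarks(
  persisted: BenchmarkEntry[],
  discovered: BenchmarkEntry[],
): BenchmarkEntry[] {
  const byPath = new Map<string, BenchmarkEntry>();
  for (const entry of persisted) {
    byPath.set(entry.path, entry);
  }
  for (const entry of discovered) {
    if (!byPath.has(entry.path)) {
      byPath.set(entry.path, entry);
    }
  }
  return Array.from(byPath.values());
}

// Usage: "/repos/a" is both pinned and found by the scan; the pinned entry wins.
const active = mergeActiveBenchmarks(
  [{ id: "a", path: "/repos/a", source: "manual" }],
  [
    { id: "a-2", path: "/repos/a", source: "discovered" },
    { id: "b", path: "/repos/b", source: "discovered" },
  ],
);
```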
Deploying agentv with
| Latest commit: | b97f20e |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://f20c4855.agentv.pages.dev |
| Branch Preview URL: | https://feat-1144-studio-runtime-dis.agentv.pages.dev |
- Drop external-issue-reference comments per AGENTS.md §7 (AI-First).
- Document single-writer assumption in benchmarks.ts header; the existing read-modify-write model is safe for the single-process Studio case that motivated the change.
- Sort discoverBenchmarks output so id assignment under basename collisions is deterministic across filesystems.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AGENTS.md §"Wire Format Convention" mandates snake_case for YAML config fields, with camelCase reserved for internal TypeScript. The previous commit emitted discoveryRoots (camelCase) on disk. TS field name stays discoveryRoots; only the serialization boundary changes.
Adds a regression test that reads projects.yaml after a write and asserts the on-disk key is discovery_roots.
Pre-existing benchmarks[] fields (addedAt, lastOpenedAt) are left as-is in this PR since changing them would be a back-compat-breaking migration orthogonal to runtime discovery; they're flagged in the file header.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BREAKING: benchmark entries in ~/.agentv/projects.yaml are serialized with snake_case keys (added_at, last_opened_at, source) instead of camelCase (addedAt, lastOpenedAt). Single-project Studio users are unaffected because they don't touch projects.yaml; multi-project users on pre-release builds must re-register projects (`agentv serve --add <path>`).
- Introduce BenchmarkEntryYaml + fromYaml/toYaml in packages/core/src/benchmarks.ts so TS internals stay camelCase and the YAML boundary stays snake_case.
- Drop the camelCase → snake_case carry-over for addedAt / lastOpenedAt; the file header now documents the fully snake_case format as canonical.
- Tighten AGENTS.md §"Wire Format Convention" to apply the rule blanket across all on-disk YAML (eval configs, projects.yaml, future files), add an anti-patterns list, and cite fromYaml/toYaml as the reference pattern for YAML boundaries.
- Add a regression test asserting serialized keys on disk are snake_case and that the file round-trips into the camelCase TS shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
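A rough sketch of the boundary pattern this commit describes: camelCase inside TypeScript, snake_case on disk, with translation only in `fromYaml` / `toYaml`. The field set below is an assumption trimmed for illustration (only `addedAt` / `lastOpenedAt` and their snake_case forms are named in the commit), and the real functions in `packages/core/src/benchmarks.ts` may differ.

```typescript
// Internal TypeScript shape (camelCase). Field set is illustrative.
interface BenchmarkEntry {
  id: string;
  path: string;
  addedAt: string;
  lastOpenedAt: string;
}

// On-disk YAML shape (snake_case). Field set is illustrative.
interface BenchmarkEntryYaml {
  id: string;
  path: string;
  added_at: string;
  last_opened_at: string;
}

// Translate a parsed YAML record into the internal shape.
function fromYaml(raw: BenchmarkEntryYaml): BenchmarkEntry {
  return {
    id: raw.id,
    path: raw.path,
    addedAt: raw.added_at,
    lastOpenedAt: raw.last_opened_at,
  };
}

// Translate an internal entry into the record that gets serialized.
function toYaml(entry: BenchmarkEntry): BenchmarkEntryYaml {
  return {
    id: entry.id,
    path: entry.path,
    added_at: entry.addedAt,
    last_opened_at: entry.lastOpenedAt,
  };
}

// Usage: a round trip leaves the internal shape unchanged.
const sampleEntry: BenchmarkEntry = {
  id: "demo",
  path: "/repos/demo",
  addedAt: "2026-01-01T00:00:00Z",
  lastOpenedAt: "2026-01-02T00:00:00Z",
};
const onDiskRecord = toYaml(sampleEntry);
const roundTripped = fromYaml(onDiskRecord);
```

Keeping the translation in two small functions means the rest of the code never sees a snake_case key, and the regression test only has to inspect what `toYaml` emits.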
BREAKING: hard-remove the --discover CLI flag and POST /api/benchmarks/discover
endpoint. Callers should use --discovery-root + POST /api/benchmarks/discovery-roots
for the runtime-watching model. The core discoverBenchmarks util stays exported.
Better UX for Remove on a discovered entry: instead of 400-ing, add the repo's
path to a new persisted excluded_paths[] list and hide it from future scans.
The .agentv/ directory stays on disk, so the user can re-show the repo (via
DELETE /api/benchmarks/exclusions) or pin it manually (POST /api/benchmarks,
which auto-unexcludes).
- New BenchmarkRegistry.excludedPaths?: string[] (YAML key: excluded_paths).
- New core helpers: getExcludedPaths, addExcludedPath, removeExcludedPath.
- resolveActiveBenchmarks filters the discovered set by excludedPaths; pinned
entries are never filtered.
- addBenchmark() strips the path from excludedPaths if present — explicit pin
wins over a prior hide.
- DELETE /api/benchmarks/:id on a discovered entry calls addExcludedPath and
returns { ok: true, excluded: <path> }; on a manual entry it still removes
from benchmarks[] as before.
- GET /api/benchmarks/exclusions lists excluded paths; DELETE unhides one.
- Route-ordering fix: DELETE /api/benchmarks/:benchmarkId is now registered
after all /api/benchmarks/<literal> sub-paths so Hono doesn't route
DELETE /api/benchmarks/exclusions (or /discovery-roots) through the :id
handler with benchmarkId=<literal>. Inline comment documents the constraint.
- Tests: exclude-hides-then-unexclude-shows round-trip, snake_case on-disk
key assertion, pin-beats-exclusion flow. All 2260 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
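The route-ordering constraint in this commit can be illustrated with a toy first-match router. This is not Hono and does not use its API; it is a dependency-free sketch that assumes handlers are tried in registration order, which is the behavior the commit message relies on. The handler return strings are invented for the example.

```typescript
type Handler = (params: Record<string, string>) => string;

interface Route {
  method: string;
  pattern: string;
  handler: Handler;
}

// Match a pattern such as "/api/benchmarks/:benchmarkId" against a path.
// Returns the captured params, or null when the path does not match.
function matchPath(pattern: string, path: string): Record<string, string> | null {
  const patternParts = pattern.split("/");
  const pathParts = path.split("/");
  if (patternParts.length !== pathParts.length) return null;
  const params: Record<string, string> = {};
  const matched = patternParts.every((expected, i) => {
    const actual = pathParts[i] ?? "";
    if (expected.startsWith(":")) {
      params[expected.slice(1)] = actual;
      return true;
    }
    return expected === actual;
  });
  return matched ? params : null;
}

// Dispatch to the first registered route that matches.
function dispatch(routes: Route[], method: string, path: string): string {
  for (const route of routes) {
    if (route.method !== method) continue;
    const params = matchPath(route.pattern, path);
    if (params) return route.handler(params);
  }
  return "404";
}

const removeById: Route = {
  method: "DELETE",
  pattern: "/api/benchmarks/:benchmarkId",
  handler: (p) => `remove benchmark ${p["benchmarkId"]}`,
};
const unhide: Route = {
  method: "DELETE",
  pattern: "/api/benchmarks/exclusions",
  handler: () => "unhide path",
};

// Parameterized route first: the literal path is captured as an id.
const wrongOrder = dispatch([removeById, unhide], "DELETE", "/api/benchmarks/exclusions");
// Literal route first: the request reaches the intended handler.
const rightOrder = dispatch([unhide, removeById], "DELETE", "/api/benchmarks/exclusions");
```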
Studio already called the domain concept a "benchmark" in code (BenchmarkRegistry, BenchmarkEntry, /api/benchmarks), but a long tail of surfaces still said "project": the on-disk registry filename, API response keys, CLI help text, error messages, docs, Studio routes, and component names. This sweeps every remaining "project" surface to "benchmark".
BREAKING (safe — multi-benchmark Studio isn't in use yet):
- File rename: ~/.agentv/projects.yaml → ~/.agentv/benchmarks.yaml. migrateLegacyRegistry() copies any existing projects.yaml on first load (from either the AGENTV_HOME or the config-dir location) and deletes the legacy file.
- API response shape: /api/benchmarks now returns { benchmarks: [...] } (was projects); /api/benchmarks/all-runs rows now carry benchmark_id / benchmark_name (were project_id / project_name); /api/config returns benchmark_name + multi_benchmark_dashboard.
- Studio URLs: /projects/$benchmarkId/... → /benchmarks/$benchmarkId/...; the routes/projects/ directory is renamed to routes/benchmarks/.
- CLI: --discover flag removed (use --discovery-root instead); --add/--remove/--multi/--single help text says "benchmark"; error messages say "Benchmark not found" / "Registered benchmark: …"; console.log is "Multi-benchmark mode".
- Core: BenchmarkEntry / BenchmarkRegistry unchanged (already benchmark-named); saveBenchmarkRegistry now preserves excludedPaths alongside discoveryRoots; the old migrateProjectsYaml is folded into migrateLegacyRegistry.
- Frontend: ProjectCard component → BenchmarkCard; all Project*Sidebar / Project*Tab internal components renamed; UI strings say "benchmark".
- Docs: studio.mdx option table updated; Auto-Discovery section replaced with Runtime Discovery that describes --discovery-root and the excluded_paths flow. running-evals.mdx filename reference updated.
- AGENTS.md: Wire Format Convention lists benchmarks.yaml instead of projects.yaml.
- Tests: resolveDashboardMode tests renamed ("single/multi-benchmark"); /api/config test reads the new keys; new benchmarks.test.ts case asserts migration from legacy projects.yaml.
All 2261 tests pass; build, typecheck, and lint clean. Manual UAT confirmed:
legacy projects.yaml → benchmarks.yaml migration, on-disk snake_case keys, response shape and /api/config emit the new names end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The projects.yaml → benchmarks.yaml migration shim is one-time: any surviving file is migrated on the first post-upgrade `agentv` invocation. Leaving the code in place past v5.0.0 is dead weight. Flag it with a concrete TODO so future maintainers know where and when to delete it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-benchmark Studio only shipped last week, so nobody has a populated projects.yaml to migrate from in practice. The shim (and its TODO pointing at a future v5.0.0 cleanup) is dead code that future maintainers have to carry. Delete it now while there's no adoption to protect, and let any (hypothetical) stale file simply mean "re-register your benchmarks."
- Remove migrateLegacyRegistry() and its call site in loadBenchmarkRegistry.
- Remove the legacy-filename paragraph from the benchmarks.ts header.
- Remove the unused copyFileSync / rmSync / getAgentvHome imports.
- Remove the migration test case from benchmarks.test.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- AGENTS.md §"Wire Format Convention" now uses benchmark_id in its HTTP-body example, matching the actual wire shape.
- Delete docs/plans/1144-runtime-benchmark-discovery.md. Per AGENTS.md §"Plans and Worktrees", plans are working materials and should be removed before merging; the design decisions it captured have all landed in code, the header docstring, and studio.mdx.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final-review findings from a fresh subagent pass over commits since the first review:
- RunEvalModal invalidated queryKey: ['projects'] after an eval run finished. That key never existed after the rename (the benchmark list uses ['benchmarks']), so the multi-benchmark dashboard's pass-rate / last-run columns did not refresh when an eval completed from the modal. Rename to ['benchmarks']. Real regression, and the only functional bug the reviewer found.
- /api/benchmarks/:id/summary returned "Failed to read project" on 500. Bring it in line with the rest of the API: "Failed to read benchmark".
- resolveDashboardMode took a projectCount parameter and one of its internal comments still said "project-scoped routes"; Sidebar.tsx had a "Project-scoped sidebars" section header. Pure TS drift from the rename sweep.
- addExcludedPath now early-returns when the path is already pinned in benchmarks[]. The exclusion filter only applies to the discovered set, so recording an exclusion for a pinned path is meaningless state; the guard keeps the YAML invariant crisp and mirrors the auto-unexclude that addBenchmark already does. New unit test covers the invariant.
Skipped nits (per YAGNI): defensive literal-path guard inside DELETE /api/benchmarks/:id and the pre-existing benchmark_name-from-basename quirk in single-benchmark mode. Route ordering already works and the inline comment documents the constraint; the basename issue was there before this PR.
2261 tests pass; build, typecheck, lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gistry
The Studio runtime-update acceptance criteria for #1144 are satisfied by runtime reload of benchmarks.yaml (option 2 of the issue's "Proposed direction"), which is what loadBenchmarkRegistry() has always done on every request. The filesystem-scanning path added in earlier commits (discovery_roots, excluded_paths, source=manual|discovered, the active-vs-persisted split, per-request depth-2 readdirSync) was significant code surface for a single niche workflow: dropping a .agentv/ directory into a watched folder and having it appear without an explicit API call or file edit.
Trade it for the simpler declarative model: benchmarks.yaml is the single source of truth; edits to it (direct, via POST /api/benchmarks, via --add/--remove, or via a Kubernetes ConfigMap mount) propagate within the UI's 10 s poll interval. Deployments that want declarative config get the clean path; deployments that want ad-hoc repo drops can script POST /api/benchmarks.
BREAKING (safe; multi-benchmark Studio shipped last week and nothing adopted yet):
- Remove BenchmarkSource, BenchmarkEntry.source, BenchmarkRegistry.discoveryRoots, BenchmarkRegistry.excludedPaths.
- Remove core helpers: addDiscoveryRoot, removeDiscoveryRoot, getDiscoveryRoots, addExcludedPath, removeExcludedPath, getExcludedPaths, resolveActiveBenchmarks, getActiveBenchmark.
- Remove HTTP endpoints: GET/POST/DELETE /api/benchmarks/discovery-roots, GET/DELETE /api/benchmarks/exclusions, POST /api/benchmarks/rescan.
- Remove --discovery-root CLI flag (and the multioption/array cmd-ts imports it needed).
- Remove wire-format source field from /api/benchmarks responses.
- Remove Watch form + addDiscoveryRootApi from the Studio frontend.
- Simplify DELETE /api/benchmarks/:benchmarkId back to a straight remove.
- Docs: studio.mdx drops --discovery-root from the options table and replaces the Runtime Discovery section with "Runtime behavior: no restart needed" covering the 10 s poll model and ConfigMap flow.
- Tests: rewrite benchmarks.test.ts to cover the core CRUD surface (start-empty, add/remove, idempotency, touch, snake_case on-disk round-trip) and drop the discovery/exclusion/precedence cases.
Net -472 lines. All 2259 tests pass; build, typecheck, lint clean. UAT confirmed: POST add, external YAML edit, DELETE by id all reflect live; removed endpoints return 404; removed flag rejected with "Unknown arguments".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
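The reload-per-request model can be sketched as a handler that calls the registry loader on every invocation and keeps no cache. This is an illustration under assumptions: `createBenchmarksHandler` and `RegistryLoader` are hypothetical names, the loader stands in for reading and parsing `benchmarks.yaml`, and the entry shape is trimmed. Only the `{ benchmarks: [...] }` response shape comes from this PR.

```typescript
// Hypothetical, simplified shapes for illustration.
interface BenchmarkEntry {
  id: string;
  path: string;
}

interface BenchmarkRegistry {
  benchmarks: BenchmarkEntry[];
}

// Stand-in for "read and parse ~/.agentv/benchmarks.yaml".
type RegistryLoader = () => BenchmarkRegistry;

// Build a list handler that re-reads the registry on each request.
// There is deliberately no cached copy between calls.
function createBenchmarksHandler(loadRegistry: RegistryLoader) {
  return (): { benchmarks: BenchmarkEntry[] } => {
    const registry = loadRegistry();
    return { benchmarks: registry.benchmarks };
  };
}

// Usage: simulate an external edit to the file between two requests.
let simulatedFile: BenchmarkRegistry = { benchmarks: [] };
const listBenchmarks = createBenchmarksHandler(() => simulatedFile);

const countBeforeEdit = listBenchmarks().benchmarks.length;
simulatedFile = { benchmarks: [{ id: "demo", path: "/repos/demo" }] };
const countAfterEdit = listBenchmarks().benchmarks.length;
```

Because the second call observes the edit without any invalidation step, a UI that polls the endpoint picks up the change on its next poll.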
…equests
Retro on #1145: the PR started as a modest runtime-discovery feature, grew to include source/excluded_paths/route-ordering machinery, and was later torn back out in favor of the one-line runtime-reload that the existing registry already provided. YAGNI was in AGENTS.md but only covered "don't build features nobody asked for"; it didn't catch "someone asked for X and I built a bigger X than necessary."
Add five habits to §YAGNI that would have caught the miss:
1. Audit existing primitives before adding new ones.
2. Treat issue language as a hint; summarize acceptance criteria in your own words, strip implementation nouns, then check existing primitives before designing.
3. Prefer data/config changes over new mechanisms.
4. Stop and re-plan when scope doubles; don't push through.
5. Stop when you're about to add a second mode, precedence rules, or invariants between optional fields. Those are complexity tells.
Also add a "call out existing overengineering" rule: when working on a task, if you spot an overengineered existing feature, open a cleanup tracking issue rather than widening the current PR. Names the shape of issue to open so it's actionable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #1144.
Summary
Studio previously treated benchmark discovery as startup-time bootstrap: any change to the registry required restarting `agentv serve`. This PR makes the registry re-read on every `/api/benchmarks` request, so edits propagate live; the Studio UI polls every ~10 s and reflects the change without operator action.
Implements option 2 of #1144's "Proposed direction": runtime reload of the project registry when the file changes. No filesystem watcher; no background scanning; no cache; the single source of truth is `~/.agentv/benchmarks.yaml`.
The PR also does a full terminology sweep: the on-disk filename, every API response key, every Studio URL, every CLI flag and error message now uses "benchmark" consistently.
What's in the PR
Runtime behavior (the #1144 ask)
- `/api/benchmarks` re-reads `benchmarks.yaml` on every request. ConfigMap-driven, file-edit-driven, and API-driven changes all propagate within the UI's ~10 s poll interval.
- `POST /api/benchmarks {path}` registers a benchmark; `DELETE /api/benchmarks/:id` unregisters. Both write to `benchmarks.yaml`.
Filename rename (breaking)
- `~/.agentv/projects.yaml` → `~/.agentv/benchmarks.yaml`.
Wire-format rename (breaking)
- `{"projects": [...]}` → `{"benchmarks": [...]}`.
- `project_id` / `project_name` → `benchmark_id` / `benchmark_name` on `/api/benchmarks/all-runs`.
- `/api/config` returns `benchmark_name` + `multi_benchmark_dashboard`.
Studio URL rename (breaking)
- `/projects/$benchmarkId/...` → `/benchmarks/$benchmarkId/...` throughout.
- `apps/studio/src/routes/projects/` → `routes/benchmarks/`, `ProjectCard.tsx` → `BenchmarkCard.tsx`.
YAML key convention (breaking)
- `addedAt` / `lastOpenedAt` → `added_at` / `last_opened_at`.
- `fromYaml` / `toYaml` in `packages/core/src/benchmarks.ts` translate at the boundary; TS internals stay camelCase.
CLI cleanup (breaking)
- `--discover` flag removed.
- `--add` / `--remove` / `--single` / `--multi` unchanged in behavior; help text now says "benchmark".
AGENTS.md tightening
- Cites `fromYaml` / `toYaml` as the reference boundary pattern.
Tests
- `packages/core/test/benchmarks.test.ts`: start-empty, add/remove, idempotency, touch isolation, snake_case on-disk round-trip.
- `apps/cli/test/commands/results/serve.test.ts`: updated for renamed identifiers and `multi_benchmark_dashboard` wire key.
Breaking-change safety
Multi-benchmark Studio shipped ~1 week ago and nothing is known to depend on the old shape. No migration shim; users with an empty or trivial setup just re-register with `agentv studio --add <path>`.
Test plan
Acceptance-criteria mapping (#1144)
`/api/benchmarks` and the UI are updated while the same `agentv serve` process remains running.
Not implemented in this PR (and not required by #1144): automatic filesystem scanning of a "discovery root." Earlier commits prototyped that and were removed in favor of the simpler config-driven model. Deployments that want ad-hoc repo drops to register automatically can add a sidecar that watches the directory and calls `POST /api/benchmarks`.
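The sidecar idea mentioned above could look roughly like the sketch below. It is not part of this PR. Every name in it (`pathsToRegister`, `syncOnce`, and the three injected callbacks) is hypothetical, and it assumes the sidecar can learn which paths are already registered, for example from the `/api/benchmarks` response.

```typescript
// Given the repo paths currently present under the watched directory and the
// paths already registered, return the ones that still need registering.
function pathsToRegister(present: string[], registered: string[]): string[] {
  const known = new Set(registered);
  return present.filter((path) => !known.has(path));
}

// One sync pass. I/O is injected so the core logic stays testable:
// - listPresent: scan the watched directory for benchmark repos
// - listRegistered: fetch the currently registered paths
// - register: register one path (for example via POST /api/benchmarks)
async function syncOnce(
  listPresent: () => Promise<string[]>,
  listRegistered: () => Promise<string[]>,
  register: (path: string) => Promise<void>,
): Promise<string[]> {
  const missing = pathsToRegister(await listPresent(), await listRegistered());
  for (const path of missing) {
    await register(path);
  }
  return missing;
}

// Usage of the pure step: only "/repos/b" is new.
const newlyFound = pathsToRegister(["/repos/a", "/repos/b"], ["/repos/a"]);
```

In a real sidecar, `register` would send `POST /api/benchmarks` with a `{path}` body, and `syncOnce` would run on a timer or in response to a directory-change notification.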