From edcc9d202e1de0f5c92eb4157ca681b8cdec68d5 Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Fri, 15 May 2026 01:32:50 +0200
Subject: [PATCH] =?UTF-8?q?refactor(docs):=20rename=20docs/skills=20benchm?=
 =?UTF-8?q?ark=20=E2=86=92=20project,=20retain=20academic=20uses?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR 4 of 4 in the benchmark → project rename. Aligns documentation and
skill cards with the rename that landed in PRs 1–3, and adds a
naming-convention note to AGENTS.md so the project/benchmark distinction
is durable.

Renamed (registry/workspace concept → "project"):
- apps/web/src/content/docs/docs/tools/studio.mdx
  - "## Benchmarks Dashboard" → "## Projects Dashboard"
  - "### Registering Benchmarks" → "### Registering Projects"
  - "### Removing a Benchmark" → "### Removing a Project"
  - CLI flag descriptions ("Register a benchmark by path", etc.)
  - All `~/.agentv/benchmarks.yaml` references → `projects.yaml`
  - YAML example `benchmarks:` top-level key → `projects:`
  - `/api/benchmarks` URLs → `/api/projects`
  - "Add Benchmark" / "single-benchmark view" / "multi-benchmark" text
- apps/web/src/content/docs/docs/evaluation/running-evals.mdx
  - `benchmarks.yaml` in the "Lightweight config and cache files" list
    → `projects.yaml`
- AGENTS.md (Wire Format Convention section)
  - YAML file list updated to `projects.yaml`
  - HTTP response field examples updated to `project_id`
  - TypeScript boundary example: BenchmarkEntry/BenchmarkEntryYaml →
    ProjectEntry/ProjectEntryYaml
  - "Reading back" pointer: `packages/core/src/benchmarks.ts` →
    `projects.ts`

Added (new):
- AGENTS.md "Naming Convention: Project vs Benchmark" section between
  TypeScript Guidelines and Wire Format Convention. Codifies the
  distinction so future contributors don't re-conflate the registry
  concept with academic eval-suite terminology.

Intentionally kept (academic / artifact / verb usages):
- benchmark.json per-run artifact and all references to it (Agent Skills
  compatibility — different concept, different rename if ever).
- examples/*-benchmark/ directory names (benchmark-tooling,
  multi-model-benchmark, offline-grader-benchmark, bug-fix-benchmark) —
  they really are eval suites.
- "benchmark agents" / "benchmark datasets" / "grader benchmarks" usages
  (verb / academic ML sense).
- "Snapshot MCP for benchmarks" reference in AGENTS.md (academic).

Stacks on refactor/rename-pr3-studio. With this PR landed, the codebase
consistently uses "project" for the registry/workspace concept and
"benchmark" only for eval-suite or per-run-artifact usages.

Co-Authored-By: Claude Opus 4.7
---
 AGENTS.md                                  | 25 +++++++---
 .../docs/docs/evaluation/running-evals.mdx |  2 +-
 .../src/content/docs/docs/tools/studio.mdx | 46 +++++++++----------
 3 files changed, 42 insertions(+), 31 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index a6eaa75d..2c323a34 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -154,6 +154,17 @@ cd ../agentv.worktrees/-
 - Prefer named exports
 - Keep modules cohesive
 
+## Naming Convention: "Project" vs "Benchmark"
+
+These two words have distinct, non-interchangeable meanings in this codebase. Get them right when adding new symbols, docs, or example dirs:
+
+- **Project** — the top-level container Studio organises around: a registered workspace directory (`.agentv/` + run artifacts + traces + experiments). Lives in `~/.agentv/projects.yaml`. Modelled by `ProjectEntry` / `ProjectRegistry` in `packages/core/src/projects.ts`. Matches the terminology used by Phoenix, Langfuse, Braintrust, W&B Weave, and LangSmith.
+- **Benchmark** — a curated *eval suite* designed to measure something specific (academic ML sense: MMLU, HumanEval, SWE-bench). Example dirs use this sense: `examples/showcase/multi-model-benchmark/`, `examples/showcase/offline-grader-benchmark/`, `examples/features/benchmark-tooling/`. Do not rename these — they are correctly named.
+
+The legacy registry file `~/.agentv/benchmarks.yaml` is auto-migrated to `projects.yaml` on first load by `migrateLegacyBenchmarksFile()`. The unrelated per-run `benchmark.json` artifact (Agent Skills compatibility output) is a third, separate concept — also keep that name.
+
+When in doubt: if the thing holds runs / traces / experiments, it's a **project**. If it's a curated set of eval cases meant to measure capability, it's a **benchmark**.
+
 ## Wire Format Convention
 
 **Everything that crosses a process boundary uses `snake_case` keys. Internal TypeScript uses `camelCase`. Translate at the boundary — never in the middle.**
@@ -161,10 +172,10 @@ cd ../agentv.worktrees/-
 The rule is blanket: if the key is going to disk, to a user's editor, into a JSON response, or onto a CLI, it's snake_case. There is no "well this file is internal-ish" carve-out. If in doubt, snake_case.
 
 ### snake_case surfaces
-- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `benchmarks.yaml`, `studio/config.yaml`, any future YAML we add.
+- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `projects.yaml`, `studio/config.yaml`, any future YAML we add.
 - JSONL result files (`test_id`, `token_usage`, `duration_ms`).
 - Artifact-writer output (`pass_rate`, `tests_run`, `total_tool_calls`).
-- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `benchmark_id`).
+- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `project_id`).
 - CLI JSON output (`agentv results summary`, `results failures`, `results show`).
 - Anything consumed by non-TS tooling (Python, jq pipelines, external dashboards).
 
@@ -177,7 +188,7 @@ Define a second interface for the wire shape and convert in one place — don't
 
 ```typescript
 // Wire shape — snake_case, matches what hits disk / the network
-interface BenchmarkEntryYaml {
+interface ProjectEntryYaml {
   id: string;
   name: string;
   path: string;
@@ -186,7 +197,7 @@ interface BenchmarkEntryYaml {
 }
 
 // Internal shape — camelCase, what every TS call site sees
-interface BenchmarkEntry {
+interface ProjectEntry {
   id: string;
   name: string;
   path: string;
@@ -194,11 +205,11 @@ interface BenchmarkEntry {
   lastOpenedAt: string;
 }
 
-function fromYaml(e: BenchmarkEntryYaml): BenchmarkEntry {
+function fromYaml(e: ProjectEntryYaml): ProjectEntry {
   return { id: e.id, name: e.name, path: e.path, addedAt: e.added_at, lastOpenedAt: e.last_opened_at };
 }
 
-function toYaml(e: BenchmarkEntry): BenchmarkEntryYaml {
+function toYaml(e: ProjectEntry): ProjectEntryYaml {
   return { id: e.id, name: e.name, path: e.path, added_at: e.addedAt, last_opened_at: e.lastOpenedAt };
 }
 ```
@@ -213,7 +224,7 @@ Yes, this is two interfaces and two functions per entity. That's the price of ke
 ### Existing divergences
 If you spot a camelCase key already on disk or in a response (e.g. a legacy endpoint), treat it as a bug: migrate it to snake_case in the same PR where you touch that code path. Don't grandfather it in.
 
-**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/benchmarks.ts` is the model for YAML boundaries.
+**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/projects.ts` is the model for YAML boundaries.
 
 **Why:** Aligns with skill-creator (claude-plugins-official) and broader Python/JSON ecosystem conventions where snake_case is the standard wire format.
 
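Reviewer note: the boundary example in the AGENTS.md hunks above appears only in fragments, because the hunks skip the unchanged field lines. Assembled, it might read roughly as below. This is a sketch rather than the file's actual contents: the `added_at` / `last_opened_at` fields and their `string` types are inferred from the `fromYaml` / `toYaml` bodies, and the sample entry and its timestamps are hypothetical.

```typescript
// Wire shape: snake_case, as it would be written to projects.yaml.
// The last two fields are inferred from fromYaml/toYaml (assumption).
interface ProjectEntryYaml {
  id: string;
  name: string;
  path: string;
  added_at: string;
  last_opened_at: string;
}

// Internal shape: camelCase, what TypeScript call sites see.
interface ProjectEntry {
  id: string;
  name: string;
  path: string;
  addedAt: string;
  lastOpenedAt: string;
}

// Convert once, at the boundary, in each direction.
function fromYaml(e: ProjectEntryYaml): ProjectEntry {
  return { id: e.id, name: e.name, path: e.path, addedAt: e.added_at, lastOpenedAt: e.last_opened_at };
}

function toYaml(e: ProjectEntry): ProjectEntryYaml {
  return { id: e.id, name: e.name, path: e.path, added_at: e.addedAt, last_opened_at: e.lastOpenedAt };
}

// Hypothetical entry, used only to show the round trip.
const onDisk: ProjectEntryYaml = {
  id: "my-evals",
  name: "My Evals",
  path: "/srv/agentv/my-evals",
  added_at: "2026-05-15T00:00:00Z",
  last_opened_at: "2026-05-15T00:00:00Z",
};

const inMemory = fromYaml(onDisk);      // camelCase inside the process
const roundTripped = toYaml(inMemory);  // snake_case again before writing
```

Converting in exactly two functions keeps snake_case keys out of internal call sites, which is the property the Wire Format Convention section asks for.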
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index f0bc287b..dfa63c70 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -429,7 +429,7 @@ The `{timestamp}` placeholder is replaced with an ISO-like timestamp (e.g., `202
 
 ### AGENTV_HOME
 
-Override the data directory for heavy runtime artifacts — workspaces, workspace pool, subagents, trace state, git cache, and downloaded dependencies. Lightweight config and cache files (`version-check.json`, `last-config.json`, `benchmarks.yaml`) always stay in `~/.agentv` regardless of this setting.
+Override the data directory for heavy runtime artifacts — workspaces, workspace pool, subagents, trace state, git cache, and downloaded dependencies. Lightweight config and cache files (`version-check.json`, `last-config.json`, `projects.yaml`) always stay in `~/.agentv` regardless of this setting.
 
 ```bash
 # Linux/macOS
diff --git a/apps/web/src/content/docs/docs/tools/studio.mdx b/apps/web/src/content/docs/docs/tools/studio.mdx
index 8eb7bd07..c026d6d1 100644
--- a/apps/web/src/content/docs/docs/tools/studio.mdx
+++ b/apps/web/src/content/docs/docs/tools/studio.mdx
@@ -45,10 +45,10 @@ agentv studio .agentv/results/runs/2026-03-30T11-45-56-989Z
 |--------|-------------|
 | `--port`, `-p` | Port to listen on (flag > `PORT` env var > 3117) |
 | `--dir`, `-d` | Working directory (default: current directory) |
-| `--multi` | Launch in multi-benchmark dashboard mode (deprecated; use auto-detect or `--single`) |
-| `--single` | Force single-benchmark dashboard mode |
-| `--add ` | Register a benchmark by path |
-| `--remove ` | Unregister a benchmark by ID |
+| `--multi` | Launch in multi-project dashboard mode (deprecated; use auto-detect or `--single`) |
+| `--single` | Force single-project dashboard mode |
+| `--add ` | Register a project by path |
+| `--remove ` | Unregister a project by ID |
 
 ## Features
 
@@ -138,25 +138,25 @@ The section includes the following visualizations:
 
 The baseline comparison is also available via the API: `GET /api/compare?baseline=` adds `delta` and `normalized_gain` fields to each non-baseline cell in the response.
 
-## Benchmarks Dashboard
+## Projects Dashboard
 
-By default, Studio shows results for the current directory. Register multiple benchmark repos to view them from a single dashboard.
+By default, Studio shows results for the current directory. Register multiple project repos to view them from a single dashboard.
 
-### Registering Benchmarks
+### Registering Projects
 
-Register benchmark repos one at a time:
+Register project repos one at a time:
 
 ```bash
 agentv studio --add /path/to/my-evals
 agentv studio --add /path/to/other-evals
 ```
 
-Each path must contain a `.agentv/` directory. Registered benchmarks are stored in `~/.agentv/benchmarks.yaml`.
+Each path must contain a `.agentv/` directory. Registered projects are stored in `~/.agentv/projects.yaml`.
 
-To register a remote repo and keep it synced automatically, add a `source` block to the entry in `~/.agentv/benchmarks.yaml`:
+To register a remote repo and keep it synced automatically, add a `source` block to the entry in `~/.agentv/projects.yaml`:
 
 ```yaml
-benchmarks:
+projects:
   - id: my-evals
     name: My Evals
     path: /srv/agentv/my-evals
@@ -169,32 +169,32 @@ On each Studio startup, AgentV clones the repo if the path is empty (`git clone
 
 ### Runtime behavior: no restart needed
 
-`benchmarks.yaml` is the single source of truth. Studio re-reads it on every `/api/benchmarks` request (which the UI polls every ~10 s), so any of these changes appear live without restarting `agentv serve`:
+`projects.yaml` is the single source of truth. Studio re-reads it on every `/api/projects` request (which the UI polls every ~10 s), so any of these changes appear live without restarting `agentv serve`:
 
-- Adding via the UI's **Add Benchmark** form or `POST /api/benchmarks`.
-- Removing via the UI's **Remove** button or `DELETE /api/benchmarks/:id`.
-- Editing `~/.agentv/benchmarks.yaml` directly.
+- Adding via the UI's **Add Project** form or `POST /api/projects`.
+- Removing via the UI's **Remove** button or `DELETE /api/projects/:id`.
+- Editing `~/.agentv/projects.yaml` directly.
 - Mounting the file via a Kubernetes ConfigMap — GitOps the ConfigMap and Studio reflects it within the next poll.
 
-This satisfies the 24/7-Studio use case: the server stays up; benchmarks come and go through config edits or API calls.
+This satisfies the 24/7-Studio use case: the server stays up; projects come and go through config edits or API calls.
 
 ### Launching the Dashboard
 
-Studio auto-detects the mode based on how many benchmarks are registered:
+Studio auto-detects the mode based on how many projects are registered:
 
-- `0` or `1` registered: single-benchmark view
-- `2+` registered: Benchmarks dashboard
+- `0` or `1` registered: single-project view
+- `2+` registered: Projects dashboard
 
 ```bash
 agentv studio # auto-detects
-agentv studio --single # force single-benchmark view
+agentv studio --single # force single-project view
 ```
 
-The landing page shows a card for each benchmark with run count, pass rate, and last run time.
+The landing page shows a card for each project with run count, pass rate, and last run time.
 
-AgentV Studio benchmarks dashboard showing benchmark cards with pass rates
+AgentV Studio projects dashboard showing project cards with pass rates
 
-### Removing a Benchmark
+### Removing a Project
 
 Unregister by its ID: