From edcc9d202e1de0f5c92eb4157ca681b8cdec68d5 Mon Sep 17 00:00:00 2001
From: Christopher Tso <christso@gmail.com>
Date: Fri, 15 May 2026 01:32:50 +0200
Subject: [PATCH] =?UTF-8?q?refactor(docs):=20rename=20docs/skills=20benchm?=
 =?UTF-8?q?ark=20=E2=86=92=20project,=20retain=20academic=20uses?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR 4 of 4 in the benchmark → project rename. Aligns documentation and skill
cards with the rename that landed in PRs 1–3, and adds a naming-convention
note to AGENTS.md so the project/benchmark distinction is durable.

Renamed (registry/workspace concept → "project"):
- apps/web/src/content/docs/docs/tools/studio.mdx
  - "## Benchmarks Dashboard" → "## Projects Dashboard"
  - "### Registering Benchmarks" → "### Registering Projects"
  - "### Removing a Benchmark" → "### Removing a Project"
  - CLI flag descriptions ("Register a benchmark by path" → "Register a project by path", etc.)
  - All `~/.agentv/benchmarks.yaml` references → `projects.yaml`
  - YAML example `benchmarks:` top-level key → `projects:`
  - `/api/benchmarks` URLs → `/api/projects`
  - "Add Benchmark" / "single-benchmark view" / "multi-benchmark" text → "Add Project" / "single-project view" / "multi-project"
- apps/web/src/content/docs/docs/evaluation/running-evals.mdx
  - `benchmarks.yaml` in the "Lightweight config and cache files" list → `projects.yaml`
- AGENTS.md (Wire Format Convention section)
  - YAML file list updated to `projects.yaml`
  - HTTP response field example: `benchmark_id` → `project_id`
  - TypeScript boundary example: BenchmarkEntry/BenchmarkEntryYaml → ProjectEntry/ProjectEntryYaml
  - "Reading back" pointer: `packages/core/src/benchmarks.ts` → `projects.ts`

Added (new):
- AGENTS.md "Naming Convention: Project vs Benchmark" section between
  TypeScript Guidelines and Wire Format Convention. Codifies the
  distinction so future contributors don't re-conflate the registry
  concept with academic eval-suite terminology.

Intentionally kept (academic / artifact / verb usages):
- benchmark.json per-run artifact and all references to it
  (Agent Skills compatibility output; a separate concept that would need its own rename, if ever).
- examples/ directory names containing "benchmark" (benchmark-tooling, multi-model-benchmark,
  offline-grader-benchmark, bug-fix-benchmark); they really are eval suites.
- "benchmark agents" / "benchmark datasets" / "grader benchmarks" usages
  (verb / academic ML sense).
- "Snapshot MCP for benchmarks" reference in AGENTS.md (academic).

Stacks on refactor/rename-pr3-studio. With this PR landed, the codebase
consistently uses "project" for the registry/workspace concept and
"benchmark" only for eval-suite or per-run-artifact usages.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 AGENTS.md                                     | 25 +++++++---
 .../docs/docs/evaluation/running-evals.mdx    |  2 +-
 .../src/content/docs/docs/tools/studio.mdx    | 46 +++++++++----------
 3 files changed, 42 insertions(+), 31 deletions(-)
diff --git a/AGENTS.md b/AGENTS.md
index a6eaa75d..2c323a34 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -154,6 +154,17 @@ cd ../agentv.worktrees/<type>-<short-desc>
 - Prefer named exports
 - Keep modules cohesive
 
+## Naming Convention: "Project" vs "Benchmark"
+
+These two words have distinct, non-interchangeable meanings in this codebase. Get them right when adding new symbols, docs, or example dirs:
+
+- **Project** — the top-level container Studio organises around: a registered workspace directory (`.agentv/` + run artifacts + traces + experiments). Lives in `~/.agentv/projects.yaml`. Modelled by `ProjectEntry` / `ProjectRegistry` in `packages/core/src/projects.ts`. Matches the terminology used by Phoenix, Langfuse, Braintrust, W&B Weave, and LangSmith.
+- **Benchmark** — a curated *eval suite* designed to measure something specific (academic ML sense: MMLU, HumanEval, SWE-bench). Example dirs use this sense: `examples/showcase/multi-model-benchmark/`, `examples/showcase/offline-grader-benchmark/`, `examples/features/benchmark-tooling/`. Do not rename these — they are correctly named.
+
+The legacy registry file `~/.agentv/benchmarks.yaml` is auto-migrated to `projects.yaml` on first load by `migrateLegacyBenchmarksFile()`. The unrelated per-run `benchmark.json` artifact (Agent Skills compatibility output) is a third, separate concept — also keep that name.
+
+When in doubt: if the thing holds runs / traces / experiments, it's a **project**. If it's a curated set of eval cases meant to measure capability, it's a **benchmark**.
+
 ## Wire Format Convention
 
 **Everything that crosses a process boundary uses `snake_case` keys. Internal TypeScript uses `camelCase`. Translate at the boundary — never in the middle.**
@@ -161,10 +172,10 @@ cd ../agentv.worktrees/<type>-<short-desc>
 The rule is blanket: if the key is going to disk, to a user's editor, into a JSON response, or onto a CLI, it's snake_case. There is no "well this file is internal-ish" carve-out. If in doubt, snake_case.
 
 ### snake_case surfaces
-- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `benchmarks.yaml`, `studio/config.yaml`, any future YAML we add.
+- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `projects.yaml`, `studio/config.yaml`, any future YAML we add.
 - JSONL result files (`test_id`, `token_usage`, `duration_ms`).
 - Artifact-writer output (`pass_rate`, `tests_run`, `total_tool_calls`).
-- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `benchmark_id`).
+- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `project_id`).
 - CLI JSON output (`agentv results summary`, `results failures`, `results show`).
 - Anything consumed by non-TS tooling (Python, jq pipelines, external dashboards).
 
@@ -177,7 +188,7 @@ Define a second interface for the wire shape and convert in one place — don't
 
 ```typescript
 // Wire shape — snake_case, matches what hits disk / the network
-interface BenchmarkEntryYaml {
+interface ProjectEntryYaml {
   id: string;
   name: string;
   path: string;
@@ -186,7 +197,7 @@ interface BenchmarkEntryYaml {
 }
 
 // Internal shape — camelCase, what every TS call site sees
-interface BenchmarkEntry {
+interface ProjectEntry {
   id: string;
   name: string;
   path: string;
@@ -194,11 +205,11 @@ interface BenchmarkEntry {
   lastOpenedAt: string;
 }
 
-function fromYaml(e: BenchmarkEntryYaml): BenchmarkEntry {
+function fromYaml(e: ProjectEntryYaml): ProjectEntry {
   return { id: e.id, name: e.name, path: e.path, addedAt: e.added_at, lastOpenedAt: e.last_opened_at };
 }
 
-function toYaml(e: BenchmarkEntry): BenchmarkEntryYaml {
+function toYaml(e: ProjectEntry): ProjectEntryYaml {
   return { id: e.id, name: e.name, path: e.path, added_at: e.addedAt, last_opened_at: e.lastOpenedAt };
 }
 ```
@@ -213,7 +224,7 @@ Yes, this is two interfaces and two functions per entity. That's the price of ke
 ### Existing divergences
 If you spot a camelCase key already on disk or in a response (e.g. a legacy endpoint), treat it as a bug: migrate it to snake_case in the same PR where you touch that code path. Don't grandfather it in.
 
-**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/benchmarks.ts` is the model for YAML boundaries.
+**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/projects.ts` is the model for YAML boundaries.
 
 **Why:** Aligns with skill-creator (claude-plugins-official) and broader Python/JSON ecosystem conventions where snake_case is the standard wire format.
 
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
index f0bc287b..dfa63c70 100644
--- a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
+++ b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
@@ -429,7 +429,7 @@ The `{timestamp}` placeholder is replaced with an ISO-like timestamp (e.g., `202
 
 ### AGENTV_HOME
 
-Override the data directory for heavy runtime artifacts — workspaces, workspace pool, subagents, trace state, git cache, and downloaded dependencies. Lightweight config and cache files (`version-check.json`, `last-config.json`, `benchmarks.yaml`) always stay in `~/.agentv` regardless of this setting.
+Override the data directory for heavy runtime artifacts — workspaces, workspace pool, subagents, trace state, git cache, and downloaded dependencies. Lightweight config and cache files (`version-check.json`, `last-config.json`, `projects.yaml`) always stay in `~/.agentv` regardless of this setting.
 
 ```bash
 # Linux/macOS
diff --git a/apps/web/src/content/docs/docs/tools/studio.mdx b/apps/web/src/content/docs/docs/tools/studio.mdx
index 8eb7bd07..c026d6d1 100644
--- a/apps/web/src/content/docs/docs/tools/studio.mdx
+++ b/apps/web/src/content/docs/docs/tools/studio.mdx
@@ -45,10 +45,10 @@ agentv studio .agentv/results/runs/2026-03-30T11-45-56-989Z
 |--------|-------------|
 | `--port`, `-p` | Port to listen on (flag > `PORT` env var > 3117) |
 | `--dir`, `-d` | Working directory (default: current directory) |
-| `--multi` | Launch in multi-benchmark dashboard mode (deprecated; use auto-detect or `--single`) |
-| `--single` | Force single-benchmark dashboard mode |
-| `--add <path>` | Register a benchmark by path |
-| `--remove <id>` | Unregister a benchmark by ID |
+| `--multi` | Launch in multi-project dashboard mode (deprecated; use auto-detect or `--single`) |
+| `--single` | Force single-project dashboard mode |
+| `--add <path>` | Register a project by path |
+| `--remove <id>` | Unregister a project by ID |
 
 ## Features
 
@@ -138,25 +138,25 @@ The section includes the following visualizations:
 
 The baseline comparison is also available via the API: `GET /api/compare?baseline=<target>` adds `delta` and `normalized_gain` fields to each non-baseline cell in the response.
 
-## Benchmarks Dashboard
+## Projects Dashboard
 
-By default, Studio shows results for the current directory. Register multiple benchmark repos to view them from a single dashboard.
+By default, Studio shows results for the current directory. Register multiple project repos to view them from a single dashboard.
 
-### Registering Benchmarks
+### Registering Projects
 
-Register benchmark repos one at a time:
+Register project repos one at a time:
 
 ```bash
 agentv studio --add /path/to/my-evals
 agentv studio --add /path/to/other-evals
 ```
 
-Each path must contain a `.agentv/` directory. Registered benchmarks are stored in `~/.agentv/benchmarks.yaml`.
+Each path must contain a `.agentv/` directory. Registered projects are stored in `~/.agentv/projects.yaml`.
 
-To register a remote repo and keep it synced automatically, add a `source` block to the entry in `~/.agentv/benchmarks.yaml`:
+To register a remote repo and keep it synced automatically, add a `source` block to the entry in `~/.agentv/projects.yaml`:
 
 ```yaml
-benchmarks:
+projects:
   - id: my-evals
     name: My Evals
     path: /srv/agentv/my-evals
@@ -169,32 +169,32 @@ On each Studio startup, AgentV clones the repo if the path is empty (`git clone
 
 ### Runtime behavior: no restart needed
 
-`benchmarks.yaml` is the single source of truth. Studio re-reads it on every `/api/benchmarks` request (which the UI polls every ~10 s), so any of these changes appear live without restarting `agentv serve`:
+`projects.yaml` is the single source of truth. Studio re-reads it on every `/api/projects` request (which the UI polls every ~10 s), so any of these changes appear live without restarting `agentv serve`:
 
-- Adding via the UI's **Add Benchmark** form or `POST /api/benchmarks`.
-- Removing via the UI's **Remove** button or `DELETE /api/benchmarks/:id`.
-- Editing `~/.agentv/benchmarks.yaml` directly.
+- Adding via the UI's **Add Project** form or `POST /api/projects`.
+- Removing via the UI's **Remove** button or `DELETE /api/projects/:id`.
+- Editing `~/.agentv/projects.yaml` directly.
 - Mounting the file via a Kubernetes ConfigMap — GitOps the ConfigMap and Studio reflects it within the next poll.
 
-This satisfies the 24/7-Studio use case: the server stays up; benchmarks come and go through config edits or API calls.
+This satisfies the 24/7-Studio use case: the server stays up; projects come and go through config edits or API calls.
 
 ### Launching the Dashboard
 
-Studio auto-detects the mode based on how many benchmarks are registered:
+Studio auto-detects the mode based on how many projects are registered:
 
-- `0` or `1` registered: single-benchmark view
-- `2+` registered: Benchmarks dashboard
+- `0` or `1` registered: single-project view
+- `2+` registered: Projects dashboard
 
 ```bash
 agentv studio          # auto-detects
-agentv studio --single # force single-benchmark view
+agentv studio --single # force single-project view
 ```
 
-The landing page shows a card for each benchmark with run count, pass rate, and last run time.
+The landing page shows a card for each project with run count, pass rate, and last run time.
 
-<Image src={studioProjectsMulti} alt="AgentV Studio benchmarks dashboard showing benchmark cards with pass rates" />
+<Image src={studioProjectsMulti} alt="AgentV Studio projects dashboard showing project cards with pass rates" />
 
-### Removing a Benchmark
+### Removing a Project
 
 Unregister by its ID: