25 changes: 18 additions & 7 deletions AGENTS.md
@@ -154,17 +154,28 @@ cd ../agentv.worktrees/<type>-<short-desc>
- Prefer named exports
- Keep modules cohesive

## Naming Convention: "Project" vs "Benchmark"

These two words have distinct, non-interchangeable meanings in this codebase. Get them right when adding new symbols, docs, or example dirs:

- **Project** — the top-level container Studio organises around: a registered workspace directory (`.agentv/` + run artifacts + traces + experiments). Lives in `~/.agentv/projects.yaml`. Modelled by `ProjectEntry` / `ProjectRegistry` in `packages/core/src/projects.ts`. Matches the terminology used by Phoenix, Langfuse, Braintrust, W&B Weave, and LangSmith.
- **Benchmark** — a curated *eval suite* designed to measure something specific (academic ML sense: MMLU, HumanEval, SWE-bench). Example dirs use this sense: `examples/showcase/multi-model-benchmark/`, `examples/showcase/offline-grader-benchmark/`, `examples/features/benchmark-tooling/`. Do not rename these — they are correctly named.

The legacy registry file `~/.agentv/benchmarks.yaml` is auto-migrated to `projects.yaml` on first load by `migrateLegacyBenchmarksFile()`. The unrelated per-run `benchmark.json` artifact (Agent Skills compatibility output) is a third, separate concept — also keep that name.

When in doubt: if the thing holds runs / traces / experiments, it's a **project**. If it's a curated set of eval cases meant to measure capability, it's a **benchmark**.
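
As a rough illustration of the legacy-file migration described above, the sketch below copies `benchmarks.yaml` to `projects.yaml` and renames the top-level key. It is not the actual `migrateLegacyBenchmarksFile()`: the function name `migrateLegacyRegistrySketch`, the decision to leave the legacy file in place, and the regex-based key rename (instead of a real YAML parser) are all assumptions made for the example.

```typescript
import { existsSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical sketch; the real migrateLegacyBenchmarksFile() may behave differently.
function migrateLegacyRegistrySketch(agentvDir: string): boolean {
  const legacyPath = join(agentvDir, "benchmarks.yaml");
  const currentPath = join(agentvDir, "projects.yaml");

  // Nothing to do if projects.yaml already exists or there is no legacy file.
  if (existsSync(currentPath) || !existsSync(legacyPath)) return false;

  const legacy = readFileSync(legacyPath, "utf8");
  // Assumption: only the top-level `benchmarks:` key needs renaming.
  const migrated = legacy.replace(/^benchmarks:/m, "projects:");
  writeFileSync(currentPath, migrated, "utf8");
  return true;
}

// Usage against a throwaway directory rather than the real ~/.agentv.
const demoDir = mkdtempSync(join(tmpdir(), "agentv-migrate-"));
writeFileSync(join(demoDir, "benchmarks.yaml"), "benchmarks:\n  - id: my-evals\n", "utf8");
const firstRun = migrateLegacyRegistrySketch(demoDir); // performs the migration
const secondRun = migrateLegacyRegistrySketch(demoDir); // no-op: projects.yaml now exists
const migratedText = readFileSync(join(demoDir, "projects.yaml"), "utf8");
```

Because the function returns early when `projects.yaml` exists, it is safe to call on every load.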

## Wire Format Convention

**Everything that crosses a process boundary uses `snake_case` keys. Internal TypeScript uses `camelCase`. Translate at the boundary — never in the middle.**

The rule is blanket: if the key is going to disk, to a user's editor, into a JSON response, or onto a CLI, it's snake_case. There is no "well this file is internal-ish" carve-out. If in doubt, snake_case.

### snake_case surfaces
- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `benchmarks.yaml`, `studio/config.yaml`, any future YAML we add.
- All YAML files on disk: `*.eval.yaml`, `agentv.config.yaml`, `projects.yaml`, `studio/config.yaml`, any future YAML we add.
- JSONL result files (`test_id`, `token_usage`, `duration_ms`).
- Artifact-writer output (`pass_rate`, `tests_run`, `total_tool_calls`).
- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `benchmark_id`).
- HTTP response bodies from `agentv serve` / Studio (`added_at`, `pass_rate`, `project_id`).
- CLI JSON output (`agentv results summary`, `results failures`, `results show`).
- Anything consumed by non-TS tooling (Python, jq pipelines, external dashboards).

@@ -177,7 +188,7 @@ Define a second interface for the wire shape and convert in one place — don't

```typescript
// Wire shape — snake_case, matches what hits disk / the network
interface BenchmarkEntryYaml {
interface ProjectEntryYaml {
id: string;
name: string;
path: string;
@@ -186,19 +197,19 @@ interface BenchmarkEntryYaml {
}

// Internal shape — camelCase, what every TS call site sees
interface BenchmarkEntry {
interface ProjectEntry {
id: string;
name: string;
path: string;
addedAt: string;
lastOpenedAt: string;
}

function fromYaml(e: BenchmarkEntryYaml): BenchmarkEntry {
function fromYaml(e: ProjectEntryYaml): ProjectEntry {
return { id: e.id, name: e.name, path: e.path, addedAt: e.added_at, lastOpenedAt: e.last_opened_at };
}

function toYaml(e: BenchmarkEntry): BenchmarkEntryYaml {
function toYaml(e: ProjectEntry): ProjectEntryYaml {
return { id: e.id, name: e.name, path: e.path, added_at: e.addedAt, last_opened_at: e.lastOpenedAt };
}
```
@@ -213,7 +224,7 @@ Yes, this is two interfaces and two functions per entity. That's the price of ke
### Existing divergences
If you spot a camelCase key already on disk or in a response (e.g. a legacy endpoint), treat it as a bug: migrate it to snake_case in the same PR where you touch that code path. Don't grandfather it in.

**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/benchmarks.ts` is the model for YAML boundaries.
**Reading back:** `parseJsonlResults()` in `artifact-writer.ts` converts snake_case → camelCase when reading JSONL into TypeScript. `fromYaml` / `toYaml` in `packages/core/src/projects.ts` is the model for YAML boundaries.

**Why:** Aligns with skill-creator (claude-plugins-official) and broader Python/JSON ecosystem conventions where snake_case is the standard wire format.
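
The same two-interface pattern can be applied when reading JSONL back. The following is a minimal sketch, not the actual `parseJsonlResults()`: the names `ResultLineWire`, `ResultLine`, `fromWire`, and `parseJsonlSketch` are hypothetical, and only two of the result fields mentioned above (`test_id`, `duration_ms`) are modelled.

```typescript
// Wire shape of one JSONL line: snake_case, as written to disk.
// Assumption: only two fields are shown; real result lines carry more.
interface ResultLineWire {
  test_id: string;
  duration_ms: number;
}

// Internal shape: camelCase, what TypeScript call sites would see.
interface ResultLine {
  testId: string;
  durationMs: number;
}

// The single place where the key translation happens.
function fromWire(w: ResultLineWire): ResultLine {
  return { testId: w.test_id, durationMs: w.duration_ms };
}

// Parse a JSONL document: one JSON object per non-empty line.
function parseJsonlSketch(text: string): ResultLine[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => fromWire(JSON.parse(line) as ResultLineWire));
}

const parsed = parseJsonlSketch(
  '{"test_id":"t1","duration_ms":120}\n{"test_id":"t2","duration_ms":85}\n',
);
```

The cast after `JSON.parse` performs no runtime validation; a production reader would likely validate each line before converting it.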

Original file line number Diff line number Diff line change
@@ -429,7 +429,7 @@ The `{timestamp}` placeholder is replaced with an ISO-like timestamp (e.g., `202

### AGENTV_HOME

Override the data directory for heavy runtime artifacts — workspaces, workspace pool, subagents, trace state, git cache, and downloaded dependencies. Lightweight config and cache files (`version-check.json`, `last-config.json`, `benchmarks.yaml`) always stay in `~/.agentv` regardless of this setting.
Override the data directory for heavy runtime artifacts — workspaces, workspace pool, subagents, trace state, git cache, and downloaded dependencies. Lightweight config and cache files (`version-check.json`, `last-config.json`, `projects.yaml`) always stay in `~/.agentv` regardless of this setting.
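
The split between the two directories might be resolved along these lines. This is a sketch under assumptions: the function name `resolveAgentvDirs` and the `AgentvDirs` type are hypothetical, and falling back to `~/.agentv` for heavy artifacts when `AGENTV_HOME` is unset is assumed rather than stated above.

```typescript
import { join } from "node:path";

interface AgentvDirs {
  configDir: string; // lightweight config and cache files, e.g. projects.yaml
  dataDir: string; // heavy runtime artifacts, e.g. workspaces and git cache
}

// Hypothetical helper; the real resolution logic may differ.
function resolveAgentvDirs(
  env: Record<string, string | undefined>,
  homeDir: string,
): AgentvDirs {
  // Config always lives under ~/.agentv, regardless of AGENTV_HOME.
  const configDir = join(homeDir, ".agentv");
  const override = env["AGENTV_HOME"];
  // Assumption: an unset or blank override falls back to ~/.agentv.
  const dataDir =
    override !== undefined && override.trim() !== "" ? override : configDir;
  return { configDir, dataDir };
}

const defaultDirs = resolveAgentvDirs({}, "/home/u");
const overriddenDirs = resolveAgentvDirs({ AGENTV_HOME: "/data/agentv" }, "/home/u");
```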

```bash
# Linux/macOS
46 changes: 23 additions & 23 deletions apps/web/src/content/docs/docs/tools/studio.mdx
@@ -45,10 +45,10 @@ agentv studio .agentv/results/runs/2026-03-30T11-45-56-989Z
|--------|-------------|
| `--port`, `-p` | Port to listen on (flag > `PORT` env var > 3117) |
| `--dir`, `-d` | Working directory (default: current directory) |
| `--multi` | Launch in multi-benchmark dashboard mode (deprecated; use auto-detect or `--single`) |
| `--single` | Force single-benchmark dashboard mode |
| `--add <path>` | Register a benchmark by path |
| `--remove <id>` | Unregister a benchmark by ID |
| `--multi` | Launch in multi-project dashboard mode (deprecated; use auto-detect or `--single`) |
| `--single` | Force single-project dashboard mode |
| `--add <path>` | Register a project by path |
| `--remove <id>` | Unregister a project by ID |

## Features

@@ -138,25 +138,25 @@ The section includes the following visualizations:

The baseline comparison is also available via the API: `GET /api/compare?baseline=<target>` adds `delta` and `normalized_gain` fields to each non-baseline cell in the response.

## Benchmarks Dashboard
## Projects Dashboard

By default, Studio shows results for the current directory. Register multiple benchmark repos to view them from a single dashboard.
By default, Studio shows results for the current directory. Register multiple project repos to view them from a single dashboard.

### Registering Benchmarks
### Registering Projects

Register benchmark repos one at a time:
Register project repos one at a time:

```bash
agentv studio --add /path/to/my-evals
agentv studio --add /path/to/other-evals
```

Each path must contain a `.agentv/` directory. Registered benchmarks are stored in `~/.agentv/benchmarks.yaml`.
Each path must contain a `.agentv/` directory. Registered projects are stored in `~/.agentv/projects.yaml`.

To register a remote repo and keep it synced automatically, add a `source` block to the entry in `~/.agentv/benchmarks.yaml`:
To register a remote repo and keep it synced automatically, add a `source` block to the entry in `~/.agentv/projects.yaml`:

```yaml
benchmarks:
projects:
- id: my-evals
name: My Evals
path: /srv/agentv/my-evals
@@ -169,32 +169,32 @@ On each Studio startup, AgentV clones the repo if the path is empty (`git clone

### Runtime behavior: no restart needed

`benchmarks.yaml` is the single source of truth. Studio re-reads it on every `/api/benchmarks` request (which the UI polls every ~10 s), so any of these changes appear live without restarting `agentv serve`:
`projects.yaml` is the single source of truth. Studio re-reads it on every `/api/projects` request (which the UI polls every ~10 s), so any of these changes appear live without restarting `agentv serve`:

- Adding via the UI's **Add Benchmark** form or `POST /api/benchmarks`.
- Removing via the UI's **Remove** button or `DELETE /api/benchmarks/:id`.
- Editing `~/.agentv/benchmarks.yaml` directly.
- Adding via the UI's **Add Project** form or `POST /api/projects`.
- Removing via the UI's **Remove** button or `DELETE /api/projects/:id`.
- Editing `~/.agentv/projects.yaml` directly.
- Mounting the file via a Kubernetes ConfigMap — GitOps the ConfigMap and Studio reflects it within the next poll.

This satisfies the 24/7-Studio use case: the server stays up; benchmarks come and go through config edits or API calls.
This satisfies the 24/7-Studio use case: the server stays up; projects come and go through config edits or API calls.
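
The "re-read on every request" behavior can be pictured as a handler that keeps no in-memory copy of the registry. The sketch below is illustrative only: `listProjectIdsSketch` is a hypothetical name, and it extracts `- id:` lines with a regular expression, whereas a real implementation would presumably use a YAML parser.

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Reads the registry from disk on every call, so there is no cache to invalidate.
function listProjectIdsSketch(registryPath: string): string[] {
  const text = readFileSync(registryPath, "utf8");
  const ids: string[] = [];
  for (const line of text.split("\n")) {
    // Simplification: match `- id: <value>` lines instead of parsing YAML.
    const match = /^\s*-\s+id:\s*(\S+)\s*$/.exec(line);
    if (match && match[1] !== undefined) ids.push(match[1]);
  }
  return ids;
}

// Simulate an edit to the file between two polls, using a throwaway directory.
const registryDir = mkdtempSync(join(tmpdir(), "agentv-registry-"));
const registryPath = join(registryDir, "projects.yaml");
writeFileSync(registryPath, "projects:\n  - id: my-evals\n    name: My Evals\n", "utf8");
const beforeEdit = listProjectIdsSketch(registryPath);
writeFileSync(registryPath, "projects:\n  - id: my-evals\n  - id: other-evals\n", "utf8");
const afterEdit = listProjectIdsSketch(registryPath);
```

The second call sees the new entry without any restart or explicit reload, which is the property a direct file edit or a ConfigMap mount relies on.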

### Launching the Dashboard

Studio auto-detects the mode based on how many benchmarks are registered:
Studio auto-detects the mode based on how many projects are registered:

- `0` or `1` registered: single-benchmark view
- `2+` registered: Benchmarks dashboard
- `0` or `1` registered: single-project view
- `2+` registered: Projects dashboard
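
The auto-detect rule reduces to a single comparison. A minimal sketch, assuming hypothetical names (`StudioMode`, `detectStudioMode`) and ignoring the deprecated `--multi` flag:

```typescript
type StudioMode = "single" | "dashboard";

// registeredCount: number of entries in the registry.
// forceSingle: stands in for the --single flag.
function detectStudioMode(registeredCount: number, forceSingle = false): StudioMode {
  if (forceSingle) return "single";
  // 0 or 1 registered: single-project view; 2 or more: Projects dashboard.
  return registeredCount >= 2 ? "dashboard" : "single";
}

const modes = [0, 1, 2, 5].map((n) => detectStudioMode(n));
const forcedMode = detectStudioMode(5, true);
```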

```bash
agentv studio # auto-detects
agentv studio --single # force single-benchmark view
agentv studio --single # force single-project view
```

The landing page shows a card for each benchmark with run count, pass rate, and last run time.
The landing page shows a card for each project with run count, pass rate, and last run time.

<Image src={studioProjectsMulti} alt="AgentV Studio benchmarks dashboard showing benchmark cards with pass rates" />
<Image src={studioProjectsMulti} alt="AgentV Studio projects dashboard showing project cards with pass rates" />

### Removing a Benchmark
### Removing a Project

Unregister by its ID:
