From fa4cfb63d666c29b449bef315d053aad50e2d664 Mon Sep 17 00:00:00 2001
From: Christopher Tso
Date: Tue, 21 Apr 2026 01:53:03 +0200
Subject: [PATCH] docs(targets): add CLI Provider page + oracle-validation pattern
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The cli provider was effectively undocumented — configuration.mdx had one
line in the provider table and a two-line example, with nothing on template
placeholders, the {OUTPUT_FILE} contract, batch mode, or what errors look
like. Grep for "oracle" / "type: cli" / "cli provider" across apps/web
turned up essentially zero hits, so the oracle-validation composition
pattern AGENTS.md §3 cites as an example of "compose, don't add a feature"
was invisible to users.

New page: docs/targets/cli-provider.mdx. Covers:

- minimal worked example
- the command contract (template rendered per case; command writes to
  {OUTPUT_FILE}; AgentV reads that file)
- every template placeholder (PROMPT, PROMPT_FILE, OUTPUT_FILE, FILES,
  EVAL_ID, ATTEMPT)
- the JSON output schema and the plain-text fallback
- full configuration-field table (including healthcheck, workers,
  provider_batching, grader_target, keep_temp_files)
- a Batching section explaining when to enable it
- a dedicated "Pattern: Oracle validation" section with a worked config
  that uses {EVAL_ID} to look up per-case fixtures, and a CLI workflow
  (run oracle first; if it's not 100% the grader is the bug)
- a Debugging checklist

Verified against packages/core/src/evaluation/providers/cli.ts:

- {EVAL_ID} expands to request.evalCaseId at command-render time (line 721),
  so it's available per case and works for the oracle per-fixture pattern.
- parseOutputContent falls back to plain-text wrapping when JSON parse fails
  or the JSON lacks output/text (lines 482, 489, 522).

Sidebar order: inserted CLI Provider at order 4 (LLM=2, Coding Agents=3,
CLI=4) and bumped Retry to 5, Custom Providers to 6.
The "Supported Providers" row in configuration.mdx now links to the new
page.

Node 18.19.1 in this dev environment can't run the Astro build (requires
18.20.8+), but biome check passes on all 598 files and the URL pattern
matches the rest of the docs site. Cloudflare Pages will build on PR open.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../docs/docs/targets/cli-provider.mdx        | 145 ++++++++++++++++++
 .../docs/docs/targets/configuration.mdx       |   2 +-
 .../docs/docs/targets/custom-providers.mdx    |   2 +-
 .../src/content/docs/docs/targets/retry.mdx   |   2 +-
 4 files changed, 148 insertions(+), 3 deletions(-)
 create mode 100644 apps/web/src/content/docs/docs/targets/cli-provider.mdx

diff --git a/apps/web/src/content/docs/docs/targets/cli-provider.mdx b/apps/web/src/content/docs/docs/targets/cli-provider.mdx
new file mode 100644
index 000000000..f1eb91e65
--- /dev/null
+++ b/apps/web/src/content/docs/docs/targets/cli-provider.mdx
@@ -0,0 +1,145 @@
+---
+title: CLI Provider
+description: Wrap any shell command as an evaluation target
+sidebar:
+  order: 4
+---
+
+The `cli` provider runs an arbitrary shell command per test case and captures its output as the target's response. It's the escape hatch that lets you evaluate *anything* that exposes a command-line entry point — your own agent, a third-party CLI, a stub that prints a fixed answer, a script that calls an in-house microservice, etc.
+
+Because the contract is "we invoke a command and read a file," almost any useful composition pattern (sanity-checking your grader against a known-good answer, diffing two implementations, driving a batch mode) can be built on top without any new primitives.
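To make that contract concrete before the worked example below: a wrapped command can be as small as a script that reads a prompt and writes a file. The following is a hypothetical sketch, not part of this patch; the `--prompt`/`--out` flag names mirror the example on this page but are otherwise arbitrary, since AgentV only requires that the command writes its answer to the `{OUTPUT_FILE}` path and exits `0`.

```python
# agent.py (illustrative stub): satisfies the cli-provider contract by
# reading a prompt from argv and writing the answer to the --out path.
import argparse
import json


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--out", required=True)
    args = parser.parse_args(argv)

    # Any real logic can go here; echoing the prompt is enough to
    # smoke-test the wiring. Plain text would also work, but emitting
    # the JSON "text" key leaves room for token_usage/cost fields later.
    answer = f"echo: {args.prompt}"
    with open(args.out, "w") as f:
        json.dump({"text": answer}, f)
    # Returning normally gives exit code 0; AgentV then reads the file.
```

Invoked as `python agent.py --prompt "…" --out /tmp/out.json`, this writes `{"text": "echo: …"}`, which the JSON output path described on this page picks up; writing raw text instead would hit the documented plain-text fallback.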
+ +## Minimal example + +```yaml +# .agentv/targets.yaml +targets: + - name: my_agent + provider: cli + command: python agent.py --prompt {PROMPT} --out {OUTPUT_FILE} + grader_target: azure-base # required if your evals use LLM graders +``` + +Your `agent.py` reads the prompt, writes its response to the path passed as `--out`, and exits `0`. That's it. + +## Command contract + +Before each test case, AgentV renders the `command` template and spawns it as a shell process. The command has two responsibilities: + +1. **Read the input** via one of the placeholders below. +2. **Write the response to `{OUTPUT_FILE}`** — AgentV reads *that file*, not your stdout. + +When the process exits successfully, AgentV parses the contents of `{OUTPUT_FILE}` and treats it as the target's response. Non-zero exits, timeouts, and unreadable output files are surfaced as test errors with the underlying stderr/exit code. + +### Template placeholders + +Use these in `command`; AgentV substitutes them per test case. + +| Placeholder | What it expands to | +|---|---| +| `{PROMPT}` | The test case's input text, shell-escaped. | +| `{PROMPT_FILE}` | Path to a temp file containing the prompt (use this when the input is large enough to blow past shell argv limits). | +| `{OUTPUT_FILE}` | Path to a temp file the command **must** write to. Deleted after the run unless `keep_temp_files: true`. | +| `{FILES}` | Space-separated paths of any input files attached to the test case, formatted via `files_format`. | +| `{EVAL_ID}` | Unique identifier of the current test case — useful for logging or per-case scratch dirs. | +| `{ATTEMPT}` | Retry attempt number (0 on the first try). | + +### Output file format + +AgentV tries to parse `{OUTPUT_FILE}` as JSON first. If it parses and contains any of these keys, they're picked up; if it doesn't parse, the entire content is treated as the assistant's message text. 
+ +```jsonc +{ + "output": [ // preferred: full message array + { "role": "assistant", "content": "..." } + ], + "text": "...", // fallback: plain assistant text + "token_usage": { "input": 123, "output": 456, "cached": 0 }, + "cost_usd": 0.0042, + "duration_ms": 1800 +} +``` + +For the common case, plain text is fine: + +```bash +echo "Hello, world!" > {OUTPUT_FILE} +``` + +## Configuration fields + +| Field | Type | Required | Default | Description | +|---|---|---|---|---| +| `name` | string | yes | — | Target identifier used in eval configs. | +| `provider` | literal `"cli"` | yes | — | Selects this provider. | +| `command` | string | yes | — | Shell command template. | +| `timeout_seconds` | number | no | — | Kill the process if it runs longer than this. | +| `cwd` | string | no | eval dir | Working directory. Relative paths resolve against the eval file. | +| `files_format` | string | no | `{path}` | How each entry in `{FILES}` is formatted. Placeholders: `{path}`, `{basename}`. | +| `verbose` | boolean | no | `false` | Log the rendered command and cwd to stdout. Useful for debugging template substitution. | +| `keep_temp_files` | boolean | no | `false` | Preserve `{PROMPT_FILE}` / `{OUTPUT_FILE}` after the run — handy while iterating on your command. | +| `healthcheck` | object | no | — | Pre-run health check (HTTP or command); the eval aborts if it fails. | +| `workers` | number | no | — | Concurrent test-case executions against this target. | +| `provider_batching` | boolean | no | `false` | Run all cases in one command invocation — see [Batching](#batching). | +| `grader_target` | string | no | — | LLM target used by this target's LLM graders. Required if your evals use LLM-based graders. | + +## Batching + +For targets where spin-up cost dominates per-case work (e.g. loading a model, authenticating), set `provider_batching: true`. 
+AgentV invokes the command *once*, hands it a JSONL stream of cases, and expects a JSONL response keyed by each case's `id`:
+
+```yaml
+targets:
+  - name: batched_agent
+    provider: cli
+    provider_batching: true
+    command: python agent.py --batch-in {PROMPT_FILE} --batch-out {OUTPUT_FILE}
+```
+
+`{PROMPT_FILE}` contains one JSON object per line with an `id` and the case's inputs; your command writes one line per case to `{OUTPUT_FILE}`, each carrying the matching `id` plus the same output shape as the non-batched case.
+
+## Pattern: Oracle validation (sanity-check your grader)
+
+A common question when building a new eval: **"if my grader scores my agent poorly, is the agent wrong or is the grader wrong?"** The classic testing answer is to run a known-correct reference ("the oracle") through the same grader — if a perfect answer doesn't pass, the grader is the bug.
+
+AgentV has no dedicated "oracle" feature because the `cli` provider already composes into one. Declare a second target that prints your known-good answer into `{OUTPUT_FILE}`, run the same eval against it, and assert a perfect score:
+
+```yaml
+# .agentv/targets.yaml
+targets:
+  - name: my_agent
+    provider: cli
+    command: python agent.py --prompt {PROMPT} --out {OUTPUT_FILE}
+    grader_target: azure-base
+
+  - name: oracle
+    provider: cli
+    command: cp fixtures/{EVAL_ID}.expected.txt {OUTPUT_FILE}
+    grader_target: azure-base
+```
+
+```bash
+# While iterating on your grader, run the oracle first.
+# If it doesn't score 100%, fix the grader before trusting any agent results.
+agentv eval my.EVAL.yaml --target oracle
+
+# Then run the real target.
+agentv eval my.EVAL.yaml --target my_agent
+```
+
+A few practical notes:
+
+- `{EVAL_ID}` in the oracle command lets one target serve an entire eval suite — just ship one `fixtures/{EVAL_ID}.expected.txt` per case. Alternatively, read the expected output from wherever your rubric already keeps it.
+- If the oracle doesn't reach 100%, that's the bug.
+  Do not proceed to scoring real agents until it does.
+- If the oracle *does* reach 100%, low scores on real agents are a signal about the agent, not the grader.
+- The same composition works for other meta-tests: a "deliberately wrong" target that should score 0, a "mostly right" target pinned at a known partial score, etc.
+
+The pattern needs no special config field, no directory convention, and no flag — it's just a second target that happens to know the answer.
+
+## Debugging
+
+When a `cli` target misbehaves:
+
+1. Set `verbose: true` to see the rendered command and cwd.
+2. Set `keep_temp_files: true` and inspect `{PROMPT_FILE}` / `{OUTPUT_FILE}` after the run.
+3. Run the rendered command by hand with those files and check it exits `0` and writes the expected output shape.
+4. If the output looks right but grading is off, check the JSON schema — a typo in `output` vs `output_messages` silently falls back to "treat whole file as plain text."
diff --git a/apps/web/src/content/docs/docs/targets/configuration.mdx b/apps/web/src/content/docs/docs/targets/configuration.mdx
index 92befdd63..e78408642 100644
--- a/apps/web/src/content/docs/docs/targets/configuration.mdx
+++ b/apps/web/src/content/docs/docs/targets/configuration.mdx
@@ -56,7 +56,7 @@ already-exported secrets into `.env`.
 | `pi-coding-agent` | Agent | Pi Coding Agent |
 | `vscode` | Agent | VS Code with Copilot |
 | `vscode-insiders` | Agent | VS Code Insiders |
-| `cli` | Agent | Any CLI command |
+| `cli` | Agent | Any CLI command — see [CLI Provider](/docs/targets/cli-provider) |
 | `mock` | Testing | Mock provider for dry runs |
 
 ## Referencing Targets in Evals
diff --git a/apps/web/src/content/docs/docs/targets/custom-providers.mdx b/apps/web/src/content/docs/docs/targets/custom-providers.mdx
index aa652bd33..737ba57b0 100644
--- a/apps/web/src/content/docs/docs/targets/custom-providers.mdx
+++ b/apps/web/src/content/docs/docs/targets/custom-providers.mdx
@@ -2,7 +2,7 @@
 title: Custom Providers (SDK)
 description: Implement native TypeScript providers using the ProviderRegistry API
 sidebar:
-  order: 5
+  order: 6
 ---
 
 Custom providers let you implement evaluation targets in TypeScript instead of shelling out to a CLI command. This is useful when you want to call an HTTP API, use an SDK, or implement custom logic that goes beyond what the CLI provider supports.
diff --git a/apps/web/src/content/docs/docs/targets/retry.mdx b/apps/web/src/content/docs/docs/targets/retry.mdx
index eba647944..b01769520 100644
--- a/apps/web/src/content/docs/docs/targets/retry.mdx
+++ b/apps/web/src/content/docs/docs/targets/retry.mdx
@@ -2,7 +2,7 @@
 title: Retry Configuration
 description: Configure automatic retry with exponential backoff
 sidebar:
-  order: 4
+  order: 5
 ---
 
 Configure automatic retry with exponential backoff for transient failures.