A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
This is the recommended method for users who want to use agentv as a command-line tool.
- Install via npm:

  ```bash
  # Install globally
  npm install -g agentv
  # Or use npx to run without installing
  npx agentv --help
  ```

- Verify the installation:

  ```bash
  agentv --help
  ```

Follow these steps if you want to contribute to the agentv project itself. This workflow uses pnpm workspaces and an editable install for immediate feedback.
- Clone the repository and navigate into it:

  ```bash
  git clone https://github.com/EntityProcess/agentv.git
  cd agentv
  ```

- Install dependencies:

  ```bash
  # Install pnpm if you don't have it
  npm install -g pnpm
  # Install all workspace dependencies
  pnpm install
  ```

- Build the project:

  ```bash
  pnpm build
  ```

- Run tests:

  ```bash
  pnpm test
  ```

You are now ready to start development. The monorepo contains:

- `packages/core/` - Core evaluation engine
- `apps/cli/` - Command-line interface
- Initialize your workspace:
  - Run `agentv init` at the root of your repository.
  - This command automatically sets up the `.agentv/` directory structure and configuration files.

- Configure environment variables:
  - The init command creates a `.env.template` file in your project root.
  - Copy `.env.template` to `.env` and fill in your API keys, endpoints, and other configuration values.
  - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file.
You can use the following examples as a starting point.
- Simple Example: A minimal working example to help you get started fast.
- Showcase: A collection of advanced use cases and real-world agent evaluation scenarios.
Validate your eval and targets files before running them:
```bash
# Validate a single file
agentv validate evals/my-eval.yaml
# Validate multiple files
agentv validate evals/eval1.yaml evals/eval2.yaml
# Validate an entire directory (recursively finds all YAML files)
agentv validate evals/
```

File type detection:

All AgentV files must include a `$schema` field:

```yaml
# Eval files
$schema: agentv-eval-v2
evalcases:
  - id: eval-1
    # ...

# Targets files
$schema: agentv-targets-v2.2
targets:
  - name: default
    # ...
```

Files without a `$schema` field will be rejected with a clear error message.
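The file type detection described above can be pictured with a small sketch. It classifies an already-parsed YAML mapping by its `$schema` value. The function name `detect_file_type` and the prefix-based matching are assumptions made for illustration; AgentV's own validator may compare exact schema versions and word its errors differently.

```python
def detect_file_type(doc: dict) -> str:
    """Classify a parsed YAML mapping as 'eval' or 'targets' by its $schema value."""
    schema = doc.get("$schema")
    if not isinstance(schema, str):
        # Files without a $schema field are rejected.
        raise ValueError("Missing required $schema field")
    # Assumption: match on the schema family prefix rather than an exact version.
    if schema.startswith("agentv-eval-"):
        return "eval"
    if schema.startswith("agentv-targets-"):
        return "targets"
    raise ValueError(f"Unrecognized $schema value: {schema!r}")
```

For example, a mapping with `$schema: agentv-targets-v2.2` is classified as a targets file, while a mapping with no `$schema` key raises an error.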
Run eval (target auto-selected from eval file or CLI override):
```bash
# If your eval.yaml contains "target: azure_base", it will be used automatically
agentv eval "path/to/eval.yaml"
# Override the eval file's target with a CLI flag
agentv eval --target vscode_projectx "path/to/eval.yaml"
# Run multiple evals via glob
agentv eval "path/to/evals/**/*.yaml"
```

Run a specific eval case with a custom targets path:

```bash
agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id "my-eval-case" "path/to/eval.yaml"
```

- `eval_paths...`: Path(s) or glob(s) to eval YAML files (required; e.g., `evals/**/*.yaml`)
- `--target TARGET`: Execution target name from targets.yaml (overrides the target specified in the eval file)
- `--targets TARGETS`: Path to the targets.yaml file (default: `./.agentv/targets.yaml`)
- `--eval-id EVAL_ID`: Run only the eval case with this specific ID
- `--out OUTPUT_FILE`: Output file path (default: `.agentv/results/eval_<timestamp>.jsonl`)
- `--output-format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
- `--dry-run`: Run with a mock model for testing
- `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
- `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
- `--cache`: Enable caching of LLM responses (default: disabled)
- `--dump-prompts`: Save all prompts to the `.agentv/prompts/` directory
- `--workers COUNT`: Parallel workers for eval cases (default: 3; the target's `workers` setting is used when provided)
- `--verbose`: Verbose output
The CLI determines which execution target to use with the following precedence:
- CLI flag override: `--target my_target` (when provided and not 'default')
- Eval file specification: `target: my_target` key in the .eval.yaml file
- Default fallback: Uses the 'default' target (original behavior)
This allows eval files to specify their preferred target while still allowing command-line overrides for flexibility, and maintains backward compatibility with existing workflows.
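The precedence rules above can be sketched as a short function. The name `resolve_target` and its parameters are hypothetical and chosen for illustration; this is a minimal sketch of the documented ordering, not the CLI's actual code.

```python
from typing import Optional


def resolve_target(cli_target: Optional[str], eval_file_target: Optional[str]) -> str:
    """Pick the execution target name using the documented precedence."""
    # 1. The CLI flag wins when it is provided and is not the literal 'default'.
    if cli_target and cli_target != "default":
        return cli_target
    # 2. Otherwise use the target named in the eval file, if there is one.
    if eval_file_target:
        return eval_file_target
    # 3. Fall back to the 'default' target.
    return "default"
```

Under this reading, passing `--target default` explicitly does not override a target named in the eval file, because the flag only takes effect when it is "not 'default'".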
Output goes to .agentv/results/eval_<timestamp>.jsonl (or .yaml) unless --out is provided.
Workspace Switching: The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.
Recommended Models: Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
Execution targets in .agentv/targets.yaml decouple evals from providers/settings and provide flexible environment variable mapping.
Each target specifies:
- `name`: Unique identifier for the target
- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
- Provider-specific configuration fields at the top level (no `settings` wrapper needed)
- Optional fields: `judge_target`, `workers`, `provider_batching`
Azure OpenAI targets:
```yaml
- name: azure_base
  provider: azure
  endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
  api_key: ${{ AZURE_OPENAI_API_KEY }}
  model: ${{ AZURE_DEPLOYMENT_NAME }}
```

Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax. The actual values are resolved from your `.env` file at runtime.
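To make the `${{ VARIABLE_NAME }}` substitution concrete, here is a minimal sketch of how such references could be resolved against a mapping of environment values. The helper name `resolve_env_refs`, the exact pattern, and the choice to raise an error for an undefined variable are all assumptions for illustration; AgentV's resolver may behave differently.

```python
import re

# Matches ${{ NAME }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def resolve_env_refs(value: str, env: dict) -> str:
    """Replace every ${{ NAME }} reference in value with env[NAME]."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Assumption: an undefined variable is treated as an error.
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, value)
```

In practice `env` would hold the values loaded from your `.env` file, for example `resolve_env_refs("${{ AZURE_OPENAI_ENDPOINT }}", env)`.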
VS Code targets:
```yaml
- name: vscode_projectx
  provider: vscode
  workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
  provider_batching: false
  judge_target: azure_base

- name: vscode_insiders_projectx
  provider: vscode-insiders
  workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
  provider_batching: false
  judge_target: azure_base
```

CLI targets (template-based):
```yaml
- name: local_cli
  provider: cli
  judge_target: azure_base
  command_template: 'uv run ./my_agent.py --prompt {PROMPT} {FILES}'
  files_format: '--file {path}'
  cwd: ${{ CLI_EVALS_DIR }}       # optional working directory
  timeout_seconds: 30             # optional per-command timeout
  healthcheck:
    type: command                 # or http
    command_template: uv run ./my_agent.py --healthcheck
```

Supported placeholders in CLI commands:

- `{PROMPT}` - The rendered prompt text (shell-escaped)
- `{FILES}` - Expands to multiple file arguments using the `files_format` template
- `{GUIDELINES}` - Guidelines content
- `{EVAL_ID}` - Current eval case ID
- `{ATTEMPT}` - Retry attempt number
- `{OUTPUT_FILE}` - Path to output file (for agents that write responses to disk)
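The placeholder expansion can be illustrated with a short sketch that renders a `command_template` in a single pass. The function `render_command` is hypothetical, and the use of POSIX-style quoting via `shlex.quote` for both placeholder values and file paths is an assumption; the source only states that `{PROMPT}` is shell-escaped.

```python
import re
import shlex


def render_command(command_template: str, files_format: str, values: dict, files: list) -> str:
    """Expand {PLACEHOLDER} tokens in a CLI command template."""
    # {FILES} expands to one files_format instance per file path.
    file_args = " ".join(
        files_format.replace("{path}", shlex.quote(path)) for path in files
    )
    replacements = {"FILES": file_args}
    # Assumption: every other placeholder value is shell-escaped.
    replacements.update({key: shlex.quote(str(val)) for key, val in values.items()})

    def substitute(match: re.Match) -> str:
        # Unknown placeholders are left untouched.
        return replacements.get(match.group(1), match.group(0))

    # A single regex pass avoids re-expanding placeholders that appear inside values.
    return re.sub(r"\{([A-Z_]+)\}", substitute, command_template)
```

With the `local_cli` target above, a prompt of `say hi` and the files `a.txt` and `b c.txt` would render as `uv run ./my_agent.py --prompt 'say hi' --file a.txt --file 'b c.txt'`.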
Codex CLI targets:
```yaml
- name: codex_cli
  provider: codex
  judge_target: azure_base
  executable: ${{ CODEX_CLI_PATH }}     # defaults to `codex` if omitted
  args:                                 # optional CLI arguments
    - --profile
    - ${{ CODEX_PROFILE }}
    - --model
    - ${{ CODEX_MODEL }}
  timeout_seconds: 180
  cwd: ${{ CODEX_WORKSPACE_DIR }}
  log_format: json                      # 'summary' or 'json'
```

Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
Confirm the CLI works by running codex exec --json --profile <name> "ping" (or any supported dry run) before starting an eval. This prints JSONL events; seeing item.completed messages indicates the CLI is healthy.
Code evaluators receive input via stdin and write output to stdout as JSON.
Input Format (via stdin):
```json
{
  "task": "string describing the task",
  "outcome": "expected outcome description",
  "expected": "expected output string",
  "output": "generated code/text from the agent",
  "system_message": "system message if any",
  "guideline_paths": ["path1", "path2"],
  "attachments": ["file1", "file2"],
  "user_segments": [{"type": "text", "value": "..."}]
}
```

Output Format (to stdout):

```json
{
  "score": 0.85,
  "hits": ["list of successful checks"],
  "misses": ["list of failed checks"],
  "reasoning": "explanation of the score"
}
```

Key Points:

- Evaluators receive the full context but should select only the relevant fields
- Most evaluators only need the `output` field; ignore the rest to avoid false positives
- Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
- Score range: `0.0` to `1.0` (float)
- `hits` and `misses` are optional but recommended for debugging
```python
#!/usr/bin/env python3
import json
import sys


def evaluate(input_data):
    # Extract only the fields you need
    output = input_data.get("output", "")

    # Your validation logic here
    score = 0.0  # to 1.0
    hits = ["successful check 1", "successful check 2"]
    misses = ["failed check 1"]
    reasoning = "Explanation of score"

    return {
        "score": score,
        "hits": hits,
        "misses": misses,
        "reasoning": reasoning
    }


if __name__ == "__main__":
    try:
        input_data = json.loads(sys.stdin.read())
        result = evaluate(input_data)
        print(json.dumps(result, indent=2))
    except Exception as e:
        error_result = {
            "score": 0.0,
            "hits": [],
            "misses": [f"Evaluator error: {str(e)}"],
            "reasoning": f"Evaluator error: {str(e)}"
        }
        print(json.dumps(error_result, indent=2))
        sys.exit(1)
```

```markdown
# Judge Name

Evaluation criteria and guidelines...

## Scoring Guidelines

0.9-1.0: Excellent
0.7-0.8: Good
...

## Output Format

{
  "score": 0.85,
  "passed": true,
  "reasoning": "..."
}
```

AgentV supports automatic retry with exponential backoff for handling rate limiting (HTTP 429) and transient errors. All retry configuration fields are optional and work with Azure, Anthropic, and Gemini providers.
Available retry fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `max_retries` | number | 3 | Maximum number of retry attempts |
| `retry_initial_delay_ms` | number | 1000 | Initial delay in milliseconds before first retry |
| `retry_max_delay_ms` | number | 60000 | Maximum delay cap in milliseconds |
| `retry_backoff_factor` | number | 2 | Exponential backoff multiplier |
| `retry_status_codes` | number[] | [500, 408, 429, 502, 503, 504] | HTTP status codes to retry |
Example configuration:
```yaml
$schema: agentv-targets-v2.2
targets:
  - name: azure_base
    provider: azure
    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
    api_key: ${{ AZURE_OPENAI_API_KEY }}
    model: gpt-4
    max_retries: 5                # Maximum retry attempts
    retry_initial_delay_ms: 2000  # Initial delay before first retry
    retry_max_delay_ms: 120000    # Maximum delay cap
    retry_backoff_factor: 2       # Exponential backoff multiplier
    retry_status_codes: [500, 408, 429, 502, 503, 504]  # HTTP status codes to retry
```

Retry behavior:
- Exponential backoff with jitter (0.75-1.25x) to avoid thundering herd
- Automatically retries on HTTP 429 (rate limiting), 5xx errors, and network failures
- Respects abort signals for cancellation
- If no retry config is specified, uses sensible defaults
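The backoff arithmetic can be sketched as follows, using the default values from the table above. The function name `retry_delay_ms`, the zero-based `attempt` index, and the choice to apply the cap before the jitter are assumptions for illustration; the actual implementation may order these steps differently.

```python
import random


def retry_delay_ms(
    attempt: int,
    initial_delay_ms: float = 1000,
    max_delay_ms: float = 60000,
    backoff_factor: float = 2.0,
    rng=random.random,
) -> float:
    """Delay before retry number `attempt` (0-based), in milliseconds."""
    # Exponential growth: initial * factor^attempt, capped at the maximum delay.
    base = min(initial_delay_ms * backoff_factor ** attempt, max_delay_ms)
    # Jitter multiplier drawn uniformly from [0.75, 1.25) to spread out retries.
    jitter = 0.75 + rng() * 0.5
    return base * jitter
```

With the defaults, the un-jittered delays grow as 1000, 2000, 4000, 8000 ms and so on until they reach the 60000 ms cap. The jitter then scales each delay by a factor between 0.75 and 1.25, so clients that failed at the same moment do not all retry together.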
- subagent - VS Code Copilot programmatic interface
- ai-sdk - Vercel AI SDK
- Agentic Context Engineering (ACE)
MIT License - see LICENSE for details.