feat: make agentv-bench work across all agent harnesses #641
Merged
Conversation
- `run_eval.py`: add `--target` and `--mode` args; CLI mode delegates to `agentv eval` instead of hardcoded `claude -p`
- `improve_description.py`: add `--llm-command` for configurable LLM inference (default: `claude -p --output-format text`)
- `run_loop.py`: pass `--target`, `--mode`, and `--llm-command` through to child scripts

All defaults match current behavior for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l mappings

Add a `ToolMatcher` interface and `PROVIDER_TOOL_SEMANTICS` static mapping to support skill-trigger evaluation across Claude, Copilot, Pi, VS Code, and other providers. Config-level overrides (`skill_tools`, `read_tools`, `skill_input_field`, `read_input_field`) provide an escape hatch for edge cases. Providers known not to emit tool calls (Codex) get descriptive error messages instead of silent failures. Backward compatible: existing evals work without modification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
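A minimal sketch of what the `ToolMatcher` / `PROVIDER_TOOL_SEMANTICS` design described above might look like. The field names and per-provider values here are illustrative assumptions (Copilot's `Read File`/`readFile` variants are mentioned elsewhere in this PR; the input-field names are guesses), not the actual agentv source:

```typescript
// Hypothetical shape of a per-provider tool matcher.
interface ToolMatcher {
  skillTools: string[];     // tool names that indicate a skill invocation
  readTools: string[];      // tool names that indicate a file read
  skillInputField: string;  // input field carrying the skill name
  readInputField: string;   // input field carrying the file path
}

// Data-driven static mapping keyed by provider id.
const PROVIDER_TOOL_SEMANTICS: Record<string, ToolMatcher> = {
  claude: {
    skillTools: ["Skill"],
    readTools: ["Read"],
    skillInputField: "command",
    readInputField: "file_path",
  },
  copilot: {
    skillTools: ["Skill"],
    readTools: ["Read File", "readFile"], // ACP tool names vary
    skillInputField: "name",
    readInputField: "path",
  },
};

// Unknown providers fall back to Claude defaults, as the PR summary notes.
function resolveMatcher(provider: string): ToolMatcher {
  return PROVIDER_TOOL_SEMANTICS[provider] ?? PROVIDER_TOOL_SEMANTICS.claude;
}
```

The data-driven table keeps provider support additive: adding a provider means adding one entry, not another conditional branch.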
Deploying agentv with Cloudflare Pages

| | |
| --- | --- |
| Latest commit: | 232f9cf |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://108c69e1.agentv.pages.dev |
| Branch Preview URL: | https://feat-multi-provider-skill-tr.agentv.pages.dev |
Codex SDK emits command_execution and file_change tool calls. Remove it from NO_TOOL_CALL_PROVIDERS and update tests/docs to reflect actual behavior. Users can use skill_tools/read_tools config overrides to map Codex tool names to skill-trigger detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s/.codex Codex supports skills via .agents/ and .codex/ folders. Update provider matrix and notes to reflect this. Link to #643 for proper tool mapping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- DRY the Copilot/VS Code matcher into a `COPILOT_MATCHER` constant
- Add `pi-agent-sdk` to `PROVIDER_TOOL_SEMANTICS`
- Clarify hardcoded `should_trigger` in `run_single_query_cli`
- Deduplicate `executor.submit` branches in `run_eval`
- Add `read_tools`/`read_input_field` config override test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demonstrates skill-trigger evaluation across providers:

- Default Claude tool detection (`Skill`, `Read`)
- Auto-detection for Copilot (`Read File`, `readFile`, etc.)
- Config override escape hatch for custom providers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… YAML

The evaluator parser was silently dropping the `skill_tools`, `read_tools`, `skill_input_field`, and `read_input_field` fields from YAML skill-trigger assertions. These config overrides now pass through to the evaluator, enabling custom tool-name mappings directly from EVAL.yaml. Remove the stale "not yet supported" note from the example.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The whole point is that skill-trigger just works across providers. Remove the custom tool mapping example that taught users to specify skill_tools/read_tools — those are internal escape hatches, not something end users should need to configure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… unsupported providers

Config override fields (`skill_tools`, `read_tools`, `skill_input_field`, `read_input_field`) added unnecessary configuration surface. For providers not covered by the built-in mapping, a code-grader is the correct escape hatch: it's more flexible and follows the "lightweight core, plugin extensibility" principle.

- Remove override fields from `SkillTriggerEvaluatorConfig` type
- Remove override logic from `resolveMatcher()` in skill-trigger evaluator
- Revert parser forwarding of override fields
- Remove 2 override-specific tests (13 tests remain, all passing)
- Update SKILL.md: replace override docs with code-grader example
- Update CLAUDE.md: clarify that config overrides on evaluators = code-grader

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
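For providers outside the built-in mapping, the code-grader escape hatch could look roughly like the sketch below. It grades whether a skill was triggered by scanning recorded tool calls; the `ToolCall` shape, function name, and the `command_execution` matching heuristic (Codex's tool name per this PR) are assumptions for illustration, not agentv's actual grader API:

```typescript
// Hypothetical recorded tool call from a provider transcript.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Hypothetical code-grader: returns 1.0 when the observed behavior matches
// the expectation (skill triggered vs. not triggered), else 0.0.
function gradeSkillTrigger(
  toolCalls: ToolCall[],
  expectedSkill: string,
  shouldTrigger: boolean,
): number {
  const triggered = toolCalls.some(
    (call) =>
      call.name === "command_execution" &&
      String(call.input["command"] ?? "").includes(expectedSkill),
  );
  return triggered === shouldTrigger ? 1.0 : 0.0;
}
```

Because the grader is plain code, it can match any provider's tool-call format without widening the evaluator's config schema, which is the "lightweight core, plugin extensibility" trade-off the commit describes.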
- Remove dead `NO_TOOL_CALL_PROVIDERS` set and unreachable guard branch
- Replace insider reference to "skill-creator's run_eval.py" with standalone docs
- Add "To add a new provider" 3-step guide in file header
- Add comment explaining Copilot's ACP tool name variance
…sign

Expand Section 6 with concrete code quality guidelines: standalone file headers, data-driven patterns over conditional chains, no dead code, and extension recipes in module headers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- `ToolMatcher` interface with `PROVIDER_TOOL_SEMANTICS` static mapping for Claude, Copilot, Pi, and VS Code providers. Falls back to Claude defaults for unknown providers. For unsupported providers with different tool-call formats, use a code-grader.
- Skill-creator scripts (`run_eval.py`, `improve_description.py`, `run_loop.py`) accept `--target`, `--mode`, and `--llm-command` args for multi-provider execution. CLI mode delegates to `agentv eval` instead of hardcoded `claude -p`.
- `cli` mode documented as the recommended default.

All defaults match current behavior — backward compatible, no existing evals need modification.
Test plan
- Added multi-provider example eval (`examples/features/agent-skills-evals/multi-provider-skill-trigger.EVAL.yaml`)
- `--target copilot --test-id should-not-trigger-unrelated` — score 1.0, correctly detected no csv-analyzer trigger
- `--target copilot --test-id should-trigger-direct-request` — score 0.0, correctly detected first tool was "Using skill: using-superpowers" (no csv-analyzer skill installed). Evaluator correctly identifies Copilot tool calls.
- `--target pi --test-id should-not-trigger-unrelated` — score 1.0, correctly detected no csv-analyzer trigger
- `--target pi --test-id should-trigger-direct-request` — score 0.0, "No tool calls recorded" (no csv-analyzer skill installed, Pi answered without tools). Evaluator correctly graded as fail.

Note: Positive cases score 0.0 because no `csv-analyzer` skill exists in the test workspace — the agents can't trigger a skill that isn't installed. This validates that the evaluator correctly detects the absence of skill invocation across providers.

Closes #613
🤖 Generated with Claude Code