
feat: make agentv-bench work across all agent harnesses #641

Merged
christso merged 11 commits into main from feat/multi-provider-skill-trigger on Mar 17, 2026

christso commented Mar 17, 2026

Summary

  • The skill-trigger evaluator is now target-agnostic: it adds a ToolMatcher interface and a static PROVIDER_TOOL_SEMANTICS mapping for the Claude, Copilot, Pi, and VS Code providers. Unknown providers fall back to the Claude defaults. For unsupported providers with different tool-call formats, use a code-grader.
  • The Python scripts (run_eval.py, improve_description.py, run_loop.py) accept --target, --mode, and --llm-command args for multi-provider execution. CLI mode delegates to agentv eval instead of a hardcoded claude -p.
  • SKILL.md is updated with a provider support matrix, a code-grader example for unsupported providers, and cli mode as the recommended default.
  • CLAUDE.md now clarifies that anything requiring config overrides on built-in evaluators should be implemented as a code-grader instead.

All defaults match current behavior — backward compatible, no existing evals need modification.
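As a rough illustration of the lookup described in the summary, the sketch below mirrors the ToolMatcher and PROVIDER_TOOL_SEMANTICS idea in Python. The evaluator in this PR is TypeScript, so this is not its code. The tool names (Skill, Read, skill, Read File, readFile, readTextFile) and the provider ids claude-cli and copilot-cli come from the PR text. The input field names, the tool-call dictionary shape, the first-call-only check, and the `skill_triggered` and `score` helpers are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ToolMatcher:
    """Tool names (and input fields) that count as a skill invocation."""
    skill_tools: Tuple[str, ...]   # tools that invoke a skill directly
    read_tools: Tuple[str, ...]    # tools that read a skill file
    skill_input_field: str         # assumed field carrying the skill name
    read_input_field: str          # assumed field carrying the file path


# Tool names are taken from the PR text; the input field names are guesses.
CLAUDE_MATCHER = ToolMatcher(("Skill",), ("Read",), "skill", "file_path")
COPILOT_MATCHER = ToolMatcher(
    ("skill",), ("Read File", "readFile", "readTextFile"), "skill", "path"
)

# Static provider -> matcher table (only two providers shown here).
PROVIDER_TOOL_SEMANTICS: Dict[str, ToolMatcher] = {
    "claude-cli": CLAUDE_MATCHER,
    "copilot-cli": COPILOT_MATCHER,
}


def resolve_matcher(provider: str) -> ToolMatcher:
    """Unknown providers fall back to the Claude defaults."""
    return PROVIDER_TOOL_SEMANTICS.get(provider, CLAUDE_MATCHER)


def skill_triggered(provider: str, skill: str, tool_calls: List[dict]) -> bool:
    """Assumed rule: only the first recorded tool call is inspected."""
    if not tool_calls:
        return False
    matcher = resolve_matcher(provider)
    first = tool_calls[0]
    name, args = first.get("tool"), first.get("input", {})
    if name in matcher.skill_tools:
        return skill in str(args.get(matcher.skill_input_field, ""))
    if name in matcher.read_tools:
        return skill in str(args.get(matcher.read_input_field, ""))
    return False


def score(provider: str, skill: str, tool_calls: List[dict],
          should_trigger: bool = True) -> float:
    """1.0 when observed behavior matches the expectation, else 0.0."""
    triggered = skill_triggered(provider, skill, tool_calls)
    return 1.0 if triggered == should_trigger else 0.0
```

Keeping the per-provider differences in a table means a new provider is one more dictionary entry rather than another branch in the matching logic.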

Test plan

  • 13 unit tests for skill-trigger evaluator covering:
    • Claude-cli resolves to Claude tool names
    • Copilot-cli resolves to Copilot tool names (Read File, readFile, readTextFile, skill)
    • Unknown provider falls back to Claude defaults
    • Codex with non-matching tool calls fails correctly
    • Codex with should_trigger: false passes correctly
    • Backward compatibility with existing Claude Skill/Read behavior
    • should_trigger: false for negative test cases
  • TypeScript typecheck passes across all packages
  • Biome lint passes
  • Full test suite passes (pre-push hook: Build, Typecheck, Lint, Test)
  • Multi-provider example EVAL.yaml added (examples/features/agent-skills-evals/multi-provider-skill-trigger.EVAL.yaml)
  • Manual e2e (Copilot): --target copilot --test-id should-not-trigger-unrelated — score 1.0, correctly detected no csv-analyzer trigger
  • Manual e2e (Copilot): --target copilot --test-id should-trigger-direct-request — score 0.0, correctly detected first tool was "Using skill: using-superpowers" (no csv-analyzer skill installed). Evaluator correctly identifies Copilot tool calls.
  • Manual e2e (Pi): --target pi --test-id should-not-trigger-unrelated — score 1.0, correctly detected no csv-analyzer trigger
  • Manual e2e (Pi): --target pi --test-id should-trigger-direct-request — score 0.0, "No tool calls recorded" (no csv-analyzer skill installed, Pi answered without tools). Evaluator correctly graded as fail.

Note: Positive cases score 0.0 because no csv-analyzer skill exists in the test workspace — the agents can't trigger a skill that isn't installed. This validates that the evaluator correctly detects the absence of skill invocation across providers.

Closes #613

🤖 Generated with Claude Code

christso and others added 2 commits March 17, 2026 08:26
- run_eval.py: add --target and --mode args; cli mode delegates to
  agentv eval instead of hardcoded claude -p
- improve_description.py: add --llm-command for configurable LLM
  inference (default: claude -p --output-format text)
- run_loop.py: pass --target, --mode, --llm-command through to
  child scripts

All defaults match current behavior for backward compatibility.
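
A minimal sketch of how these flags could be wired with argparse. The flag names and the --llm-command default are taken from the commit message above. The `build_parser` and `build_eval_command` helpers, the positional eval-file argument to agentv eval, and the fallback to claude -p for any non-cli mode are assumptions for illustration, not the scripts' actual code.

```python
import argparse
import shlex
from typing import List


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run skill-trigger evals")
    parser.add_argument("--target", default=None,
                        help="provider target passed through to agentv eval")
    # Only 'cli' is named in the PR; other mode values are not listed there.
    parser.add_argument("--mode", default=None,
                        help="execution mode; 'cli' delegates to agentv eval")
    parser.add_argument("--llm-command",
                        default="claude -p --output-format text",
                        help="command used for LLM inference")
    return parser


def build_eval_command(args: argparse.Namespace, eval_file: str,
                       prompt: str) -> List[str]:
    """Pick the command line for one query (hypothetical helper)."""
    if args.mode == "cli":
        cmd = ["agentv", "eval", eval_file]  # positional file is assumed
        if args.target:
            cmd += ["--target", args.target]
        return cmd
    # Assumed legacy path: call claude directly in print mode.
    return ["claude", "-p", prompt]


def llm_argv(args: argparse.Namespace) -> List[str]:
    """Split the configurable LLM command into an argv list."""
    return shlex.split(args.llm_command)
```

For example, `build_parser().parse_args(["--mode", "cli", "--target", "copilot"])` followed by `build_eval_command(...)` yields an agentv eval invocation, while parsing no arguments leaves the claude -p defaults in place.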

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l mappings

Add ToolMatcher interface and PROVIDER_TOOL_SEMANTICS static mapping to
support skill-trigger evaluation across Claude, Copilot, Pi, VS Code,
and other providers. Config-level overrides (skill_tools, read_tools,
skill_input_field, read_input_field) provide escape hatch for edge cases.

Providers known to not emit tool calls (codex) get descriptive error
messages instead of silent failures.

Backward compatible: existing evals work without modification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages bot commented Mar 17, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 232f9cf
Status: ✅  Deploy successful!
Preview URL: https://108c69e1.agentv.pages.dev
Branch Preview URL: https://feat-multi-provider-skill-tr.agentv.pages.dev

Codex SDK emits command_execution and file_change tool calls. Remove it
from NO_TOOL_CALL_PROVIDERS and update tests/docs to reflect actual
behavior. Users can use skill_tools/read_tools config overrides to map
Codex tool names to skill-trigger detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
christso and others added 8 commits March 17, 2026 09:16
…s/.codex

Codex supports skills via .agents/ and .codex/ folders. Update provider
matrix and notes to reflect this. Link to #643 for proper tool mapping.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- DRY Copilot/VSCode matcher into COPILOT_MATCHER constant
- Add pi-agent-sdk to PROVIDER_TOOL_SEMANTICS
- Clarify hardcoded should_trigger in run_single_query_cli
- Deduplicate executor.submit branches in run_eval
- Add read_tools/read_input_field config override test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demonstrates skill-trigger evaluation across providers:
- Default Claude tool detection (Skill, Read)
- Auto-detection for Copilot (Read File, readFile, etc.)
- Config override escape hatch for custom providers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… YAML

The evaluator-parser was silently dropping skill_tools, read_tools,
skill_input_field, and read_input_field fields from YAML skill-trigger
assertions. These config overrides now pass through to the evaluator,
enabling custom tool-name mappings directly from EVAL.yaml.

Remove stale "not yet supported" note from example.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The whole point is that skill-trigger just works across providers.
Remove the custom tool mapping example that taught users to specify
skill_tools/read_tools — those are internal escape hatches, not
something end users should need to configure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… unsupported providers

Config override fields (skill_tools, read_tools, skill_input_field, read_input_field)
added unnecessary configuration surface. For providers not covered by the built-in
mapping, a code-grader is the correct escape hatch — it's more flexible and follows
the "lightweight core, plugin extensibility" principle.

- Remove override fields from SkillTriggerEvaluatorConfig type
- Remove override logic from resolveMatcher() in skill-trigger evaluator
- Revert parser forwarding of override fields
- Remove 2 override-specific tests (13 tests remain, all passing)
- Update SKILL.md: replace override docs with code-grader example
- Update CLAUDE.md: clarify that config overrides on evaluators = code-grader
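
To make the code-grader escape hatch concrete, here is a hedged sketch of the grading logic such a script might contain for Codex. It relies on two facts stated elsewhere in this PR: Codex emits command_execution and file_change tool calls, and its skills live under .agents/ and .codex/. Everything else is an assumption: the tool-call dictionary shape, the `references_skill` and `grade` helpers, and the substring check. How a code-grader receives its input and reports its score in agentv is not described in this PR, so that wiring is deliberately left out.

```python
import json
from typing import List

# Folders where Codex skills live, per the PR notes.
SKILL_DIRS = (".agents/", ".codex/")


def references_skill(call: dict, skill_name: str) -> bool:
    """True if a command_execution call touches the named skill's folder."""
    if call.get("tool") != "command_execution":
        return False
    # Serialize the whole input so the check does not depend on field names.
    text = json.dumps(call.get("input", {}))
    return skill_name in text and any(d in text for d in SKILL_DIRS)


def grade(tool_calls: List[dict], skill_name: str,
          should_trigger: bool) -> float:
    """Score 1.0 when observed triggering matches the expectation."""
    triggered = any(references_skill(c, skill_name) for c in tool_calls)
    return 1.0 if triggered == should_trigger else 0.0
```

Because the grader is ordinary code, provider-specific quirks stay out of the built-in evaluator's configuration surface.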

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove dead NO_TOOL_CALL_PROVIDERS set and unreachable guard branch
- Replace insider reference to "skill-creator's run_eval.py" with standalone docs
- Add "To add a new provider" 3-step guide in file header
- Add comment explaining Copilot's ACP tool name variance
…sign

Expand Section 6 with concrete code quality guidelines: standalone file
headers, data-driven patterns over conditional chains, no dead code, and
extension recipes in module headers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
christso marked this pull request as ready for review March 17, 2026 13:10
christso merged commit 018da48 into main Mar 17, 2026
1 check passed
christso deleted the feat/multi-provider-skill-trigger branch March 17, 2026 13:11

Development

Successfully merging this pull request may close these issues.

feat(evaluator): make skill-trigger target-agnostic
