feat: make agentv-bench work across all agent harnesses #641
Merged
Conversation
- `run_eval.py`: add `--target` and `--mode` args; CLI mode delegates to `agentv eval` instead of hardcoded `claude -p`
- `improve_description.py`: add `--llm-command` for configurable LLM inference (default: `claude -p --output-format text`)
- `run_loop.py`: pass `--target`, `--mode`, and `--llm-command` through to child scripts

All defaults match current behavior for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…l mappings

Add a `ToolMatcher` interface and `PROVIDER_TOOL_SEMANTICS` static mapping to support skill-trigger evaluation across Claude, Copilot, Pi, VS Code, and other providers. Config-level overrides (`skill_tools`, `read_tools`, `skill_input_field`, `read_input_field`) provide an escape hatch for edge cases. Providers known not to emit tool calls (Codex) get descriptive error messages instead of silent failures. Backward compatible: existing evals work without modification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
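A minimal sketch of what the `ToolMatcher` / `PROVIDER_TOOL_SEMANTICS` design described above might look like. The field names and per-provider values here are illustrative assumptions (Copilot's `Read File`/`readFile` variants are mentioned elsewhere in this PR; the input-field names are guesses), not the actual agentv source:

```typescript
// Hypothetical shape of a per-provider tool matcher.
interface ToolMatcher {
  skillTools: string[];     // tool names that indicate a skill invocation
  readTools: string[];      // tool names that indicate a file read
  skillInputField: string;  // input field carrying the skill name
  readInputField: string;   // input field carrying the file path
}

// Data-driven static mapping keyed by provider id.
const PROVIDER_TOOL_SEMANTICS: Record<string, ToolMatcher> = {
  claude: {
    skillTools: ["Skill"],
    readTools: ["Read"],
    skillInputField: "command",
    readInputField: "file_path",
  },
  copilot: {
    skillTools: ["Skill"],
    readTools: ["Read File", "readFile"], // ACP tool names vary
    skillInputField: "name",
    readInputField: "path",
  },
};

// Unknown providers fall back to Claude defaults, as the PR summary notes.
function resolveMatcher(provider: string): ToolMatcher {
  return PROVIDER_TOOL_SEMANTICS[provider] ?? PROVIDER_TOOL_SEMANTICS.claude;
}
```

The data-driven table keeps provider support additive: adding a provider means adding one entry, not another conditional branch.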
Deploying agentv with Cloudflare Pages

| | |
| --- | --- |
| Latest commit: | 232f9cf |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://108c69e1.agentv.pages.dev |
| Branch Preview URL: | https://feat-multi-provider-skill-tr.agentv.pages.dev |
Codex SDK emits command_execution and file_change tool calls. Remove it from NO_TOOL_CALL_PROVIDERS and update tests/docs to reflect actual behavior. Users can use skill_tools/read_tools config overrides to map Codex tool names to skill-trigger detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s/.codex Codex supports skills via .agents/ and .codex/ folders. Update provider matrix and notes to reflect this. Link to #643 for proper tool mapping. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- DRY the Copilot/VS Code matcher into a `COPILOT_MATCHER` constant
- Add `pi-agent-sdk` to `PROVIDER_TOOL_SEMANTICS`
- Clarify hardcoded `should_trigger` in `run_single_query_cli`
- Deduplicate `executor.submit` branches in `run_eval`
- Add `read_tools`/`read_input_field` config override test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demonstrates skill-trigger evaluation across providers:

- Default Claude tool detection (`Skill`, `Read`)
- Auto-detection for Copilot (`Read File`, `readFile`, etc.)
- Config override escape hatch for custom providers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… YAML

The evaluator parser was silently dropping the `skill_tools`, `read_tools`, `skill_input_field`, and `read_input_field` fields from YAML skill-trigger assertions. These config overrides now pass through to the evaluator, enabling custom tool-name mappings directly from EVAL.yaml. Remove the stale "not yet supported" note from the example.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The whole point is that skill-trigger just works across providers. Remove the custom tool mapping example that taught users to specify skill_tools/read_tools — those are internal escape hatches, not something end users should need to configure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… unsupported providers

Config override fields (`skill_tools`, `read_tools`, `skill_input_field`, `read_input_field`) added unnecessary configuration surface. For providers not covered by the built-in mapping, a code-grader is the correct escape hatch: it's more flexible and follows the "lightweight core, plugin extensibility" principle.

- Remove override fields from `SkillTriggerEvaluatorConfig` type
- Remove override logic from `resolveMatcher()` in skill-trigger evaluator
- Revert parser forwarding of override fields
- Remove 2 override-specific tests (13 tests remain, all passing)
- Update SKILL.md: replace override docs with code-grader example
- Update CLAUDE.md: clarify that config overrides on evaluators = code-grader

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
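For providers outside the built-in mapping, the code-grader escape hatch could look roughly like the sketch below. It grades whether a skill was triggered by scanning recorded tool calls; the `ToolCall` shape, function name, and the `command_execution` matching heuristic (Codex's tool name per this PR) are assumptions for illustration, not agentv's actual grader API:

```typescript
// Hypothetical recorded tool call from a provider transcript.
interface ToolCall {
  name: string;
  input: Record<string, unknown>;
}

// Hypothetical code-grader: returns 1.0 when the observed behavior matches
// the expectation (skill triggered vs. not triggered), else 0.0.
function gradeSkillTrigger(
  toolCalls: ToolCall[],
  expectedSkill: string,
  shouldTrigger: boolean,
): number {
  const triggered = toolCalls.some(
    (call) =>
      call.name === "command_execution" &&
      String(call.input["command"] ?? "").includes(expectedSkill),
  );
  return triggered === shouldTrigger ? 1.0 : 0.0;
}
```

Because the grader is plain code, it can match any provider's tool-call format without widening the evaluator's config schema, which is the "lightweight core, plugin extensibility" trade-off the commit describes.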
- Remove dead `NO_TOOL_CALL_PROVIDERS` set and unreachable guard branch
- Replace insider reference to "skill-creator's run_eval.py" with standalone docs
- Add "To add a new provider" 3-step guide in file header
- Add comment explaining Copilot's ACP tool name variance
…sign

Expand Section 6 with concrete code quality guidelines: standalone file headers, data-driven patterns over conditional chains, no dead code, and extension recipes in module headers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- `ToolMatcher` interface with `PROVIDER_TOOL_SEMANTICS` static mapping for Claude, Copilot, Pi, and VS Code providers. Falls back to Claude defaults for unknown providers. For unsupported providers with different tool-call formats, use a code-grader.
- Skill-creator scripts (`run_eval.py`, `improve_description.py`, `run_loop.py`) accept `--target`, `--mode`, and `--llm-command` args for multi-provider execution. CLI mode delegates to `agentv eval` instead of hardcoded `claude -p`.
- `cli` mode documented as the recommended default.

All defaults match current behavior — backward compatible, no existing evals need modification.
Test plan
- Added multi-provider example eval (`examples/features/agent-skills-evals/multi-provider-skill-trigger.EVAL.yaml`)
- `--target copilot --test-id should-not-trigger-unrelated` — score 1.0, correctly detected no csv-analyzer trigger
- `--target copilot --test-id should-trigger-direct-request` — score 0.0, correctly detected first tool was "Using skill: using-superpowers" (no csv-analyzer skill installed). Evaluator correctly identifies Copilot tool calls.
- `--target pi --test-id should-not-trigger-unrelated` — score 1.0, correctly detected no csv-analyzer trigger
- `--target pi --test-id should-trigger-direct-request` — score 0.0, "No tool calls recorded" (no csv-analyzer skill installed, Pi answered without tools). Evaluator correctly graded as fail.

Note: Positive cases score 0.0 because no `csv-analyzer` skill exists in the test workspace — the agents can't trigger a skill that isn't installed. This validates that the evaluator correctly detects the absence of skill invocation across providers.

Closes #613
🤖 Generated with Claude Code