Evaluation harness for Claude Code custom Agent Skills.
The harness runs claude -p with a project-level .claude/skills/<name>/
fixture, but the spawned subprocess inherits the user's full Claude environment.
Things that can change results between machines (or between runs on the same
machine):
- Built-in skills.
claude-codeships with skills (debug,simplify,update-config, etc.) that always appear inavailable_skills. If your skill's prompt territory overlaps a built-in's, Claude may pick the built-in instead — checkinvoked_skillinraw.jsonlto diagnose. - MCP servers. Servers configured globally (
~/.claude/settings.json) or per-project (.mcp.json) expose tools Claude can use as alternatives to invoking a skill. Run with--strict-mcp-config+ an empty--mcp-configto isolate this variable, or disable them at the source. - User CLAUDE.md.
~/.claude/CLAUDE.mdis auto-loaded into the system prompt and may bias skill selection. - Settings, hooks, custom agents, plugins. Same caveat — anything in the
user's
~/.claude/can affect what Claude sees and does.
These can't all be cleanly suppressed: --bare disables project-level skill
discovery (the very thing we're testing), and HOME=<tmpdir> breaks auth. For
now, every record in raw.jsonl includes available_skills so reviewers can
confirm what Claude actually saw.
When comparing two versions of a skill description, run both evals back-to-back on the same machine to hold the environment constant.