Ryan-Small/skill-eval

skill-eval

Evaluation harness for Claude Code custom Agent Skills.

Eval environment caveats

The harness runs claude -p with a project-level .claude/skills/<name>/ fixture, but the spawned subprocess inherits the user's full Claude environment. Things that can change results between machines (or between runs on the same machine):

  • Built-in skills. claude-code ships with skills (debug, simplify, update-config, etc.) that always appear in available_skills. If your skill's prompt territory overlaps a built-in's, Claude may pick the built-in instead — check invoked_skill in raw.jsonl to diagnose.
  • MCP servers. Servers configured globally (~/.claude/settings.json) or per-project (.mcp.json) expose tools Claude can use as alternatives to invoking a skill. To isolate this variable, run with --strict-mcp-config and an empty --mcp-config, or disable the servers where they are configured.
  • User CLAUDE.md. ~/.claude/CLAUDE.md is auto-loaded into the system prompt and may bias skill selection.
  • Settings, hooks, custom agents, plugins. Same caveat — anything in the user's ~/.claude/ can affect what Claude sees and does.
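
As a rough illustration of the MCP isolation mentioned in the list above, the sketch below builds an argument list for claude -p that passes --strict-mcp-config together with an MCP config that defines no servers. The helper name isolated_claude_argv, the file name empty-mcp.json, and the {"mcpServers": {}} shape of the empty config are assumptions made for this example; they are not taken from the harness source.

```python
import json
import tempfile
from pathlib import Path


def isolated_claude_argv(prompt: str, workdir: Path) -> list[str]:
    """Build an argv for `claude -p` that should load no MCP servers.

    Hypothetical helper: the harness may construct its command differently.
    """
    # Assumption: an "empty" MCP config is a JSON object whose mcpServers
    # map has no entries, mirroring the layout of a .mcp.json file.
    empty_config = workdir / "empty-mcp.json"
    empty_config.write_text(json.dumps({"mcpServers": {}}), encoding="utf-8")

    # The prompt is placed before the MCP flags so it cannot be mistaken
    # for an extra value of --mcp-config.
    return [
        "claude",
        "-p",
        prompt,
        "--strict-mcp-config",
        "--mcp-config",
        str(empty_config),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        argv = isolated_claude_argv("Summarize this repository", Path(tmp))
        print(" ".join(argv))
        # To execute for real, pass argv to subprocess.run(...) with cwd set
        # to the directory that contains the .claude/skills/<name>/ fixture.
```

This only removes MCP servers as a variable; built-in skills, the user's CLAUDE.md, and other ~/.claude/ settings are still inherited.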

These can't all be cleanly suppressed: --bare disables project-level skill discovery (the very thing we're testing), and setting HOME=<tmpdir> breaks authentication. For now, every record in raw.jsonl includes available_skills so reviewers can confirm what Claude actually saw.
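
One way to review those records is a small audit script. The sketch below is a minimal example under assumptions: it treats raw.jsonl as one JSON object per line, invoked_skill as a string or null, and available_skills as a list of skill names. The exact record schema is not documented here, and my-skill is a hypothetical skill name.

```python
import json
from collections import Counter


def load_records(path):
    """Read one JSON object per non-empty line (assumed raw.jsonl layout)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def audit(records, expected_skill):
    """Summarize which skills were invoked and which were visible."""
    # Count invocations per skill; None means no skill was invoked.
    invoked = Counter(r.get("invoked_skill") for r in records)
    # frozenset ignores ordering, so the same skills listed in a different
    # order count as one visible set.
    visible_sets = {frozenset(r.get("available_skills") or []) for r in records}
    # Records where the skill under test was not offered to Claude at all.
    missing = sum(
        1 for r in records
        if expected_skill not in (r.get("available_skills") or [])
    )
    return {
        "invoked": invoked,
        "distinct_visible_sets": len(visible_sets),
        "records_missing_expected": missing,
    }


if __name__ == "__main__":
    # In-memory stand-ins for load_records("raw.jsonl").
    sample = [
        {"invoked_skill": "my-skill", "available_skills": ["my-skill", "debug"]},
        {"invoked_skill": "debug", "available_skills": ["debug", "my-skill"]},
        {"invoked_skill": None, "available_skills": ["debug"]},
    ]
    print(audit(sample, expected_skill="my-skill"))
```

More than one distinct visible set within a single run suggests the environment changed mid-run, and a non-zero count of records missing the expected skill points at a fixture or discovery problem rather than a skill-selection problem.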

When comparing two versions of a skill description, run both evals back-to-back on the same machine to hold the environment constant.
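
A comparison along these lines could be sketched as follows. It checks that both runs saw the same sets of available skills before reporting the change in how often the skill under test was invoked. As before, the record fields are assumed to be invoked_skill (string or null) and available_skills (list of names), and the function names and my-skill are hypothetical.

```python
def invocation_rate(records, skill):
    """Fraction of records in which `skill` was the invoked skill."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r.get("invoked_skill") == skill)
    return hits / len(records)


def visible_skill_sets(records):
    """Every distinct set of available skills seen across the records."""
    return {frozenset(r.get("available_skills") or []) for r in records}


def compare_runs(run_a, run_b, skill):
    """Compare two eval runs of the same skill under test."""
    rate_a = invocation_rate(run_a, skill)
    rate_b = invocation_rate(run_b, skill)
    return {
        # False means the two runs did not see the same skills, so the
        # rate difference may reflect the environment, not the description.
        "same_available_skills": visible_skill_sets(run_a) == visible_skill_sets(run_b),
        "rate_a": rate_a,
        "rate_b": rate_b,
        "delta": rate_b - rate_a,
    }


if __name__ == "__main__":
    # Stand-ins for records loaded from two raw.jsonl files.
    old = [
        {"invoked_skill": "my-skill", "available_skills": ["my-skill", "debug"]},
        {"invoked_skill": "debug", "available_skills": ["my-skill", "debug"]},
    ]
    new = [
        {"invoked_skill": "my-skill", "available_skills": ["my-skill", "debug"]},
        {"invoked_skill": "my-skill", "available_skills": ["debug", "my-skill"]},
    ]
    print(compare_runs(old, new, "my-skill"))
```

Matching available_skills only covers one of the inherited variables; differences in CLAUDE.md, hooks, or MCP servers would not show up in this check.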
