skill-eval

Evaluation harness for Claude Code custom Agent Skills.

Eval environment caveats

The harness runs claude -p with a project-level .claude/skills/<name>/ fixture, but the spawned subprocess inherits the user's full Claude environment. Things that can change results between machines (or between runs on the same machine):

Built-in skills. Claude Code ships with built-in skills (debug, simplify, update-config, etc.) that always appear in available_skills. If your skill's prompt territory overlaps a built-in's, Claude may pick the built-in instead; check invoked_skill in raw.jsonl to diagnose.
MCP servers. Servers configured globally (~/.claude/settings.json) or per-project (.mcp.json) expose tools Claude can use as alternatives to invoking a skill. To isolate this variable, run with --strict-mcp-config together with an empty --mcp-config, or disable the servers at the source.
User CLAUDE.md. ~/.claude/CLAUDE.md is auto-loaded into the system prompt and may bias skill selection.
Settings, hooks, custom agents, plugins. Same caveat — anything in the user's ~/.claude/ can affect what Claude sees and does.
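
As a rough illustration of the MCP isolation mentioned above, the sketch below builds a claude -p command line that passes --strict-mcp-config with an empty config file. The helper name build_isolated_command, the file name empty-mcp.json, and the assumption that an "empty" config is a JSON object with no entries under mcpServers are all illustrative; the harness itself may do this differently.

```python
import json
import tempfile
from pathlib import Path


def build_isolated_command(prompt: str, config_dir: Path) -> list[str]:
    """Build a `claude -p` argv that should ignore user and project MCP servers.

    Hypothetical helper, not part of the harness.
    """
    # Assumption: an "empty" MCP config is a JSON file that lists no servers.
    empty_config = config_dir / "empty-mcp.json"
    empty_config.write_text(json.dumps({"mcpServers": {}}))
    return [
        "claude",
        "-p",
        prompt,
        "--strict-mcp-config",
        # Kept last so the config path cannot be confused with the prompt.
        "--mcp-config",
        str(empty_config),
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The prompt here is a made-up example.
        print(build_isolated_command("Brainstorm names for a CLI tool", Path(tmp)))
```

This only removes MCP servers as a variable; built-in skills, CLAUDE.md, hooks, and plugins are still inherited.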

These can't all be cleanly suppressed: --bare disables project-level skill discovery (the very thing we're testing), and HOME=<tmpdir> breaks auth. For now, every record in raw.jsonl includes available_skills so reviewers can confirm what Claude actually saw.
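
One way to review a run is to tally invoked_skill across raw.jsonl and count records where the skill under test was missing from available_skills. The sketch below assumes each line of raw.jsonl is a JSON object, that invoked_skill is a string or null, and that available_skills is a list of skill names; only the two field names come from this README, and the function name summarize is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize(path: Path, skill_under_test: str) -> dict:
    """Tally which skill was invoked and how often the tested skill was not visible."""
    invoked = Counter()
    missing = 0
    for line in path.read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumed schema: invoked_skill is a skill name, or null if none was invoked.
        invoked[record.get("invoked_skill")] += 1
        # Assumed schema: available_skills is a list of skill names.
        if skill_under_test not in record.get("available_skills", []):
            missing += 1
    return {"invoked": dict(invoked), "records_missing_skill": missing}


if __name__ == "__main__":
    # "my-skill" is a placeholder for the skill under test.
    print(summarize(Path("raw.jsonl"), "my-skill"))
```

A high count for a built-in such as debug under "invoked" would suggest the overlap problem described above.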

When comparing two versions of a skill description, run both evals back-to-back on the same machine to hold the environment constant.
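
A quick sanity check that the environment did hold constant is to compare the available_skills sets recorded in the two runs. This is a minimal sketch under the same assumed record schema (one JSON object per line, available_skills as a list of names); the function names and the file names in the usage lines are hypothetical.

```python
import json
from pathlib import Path


def skill_sets(path: Path) -> set[frozenset[str]]:
    """Collect the distinct available_skills sets seen in one raw.jsonl."""
    seen = set()
    for line in path.read_text().splitlines():
        if line.strip():
            record = json.loads(line)
            seen.add(frozenset(record.get("available_skills", [])))
    return seen


def same_environment(run_a: Path, run_b: Path) -> bool:
    """True if both runs saw exactly the same sets of available skills."""
    return skill_sets(run_a) == skill_sets(run_b)


if __name__ == "__main__":
    # Placeholder file names for the two description versions being compared.
    print(same_environment(Path("raw_v1.jsonl"), Path("raw_v2.jsonl")))
```

A mismatch does not say which variable changed, only that the two runs are not directly comparable; matching sets also do not rule out differences in CLAUDE.md, hooks, or MCP tools.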

