Benchmarking for Agent Skills (SKILL.md files). Measures whether a skill actually improves LLM output quality, and by how much, using blind judge evaluation and confidence intervals.
Thousands of Agent Skills exist across public registries. There is no systematic way to know whether any of them actually work. SkillBenchmark fills that gap.
An Agent Skill is only as good as the improvement it produces. Right now, skill authors publish based on intuition, and users install based on anecdote. There is no objective way to answer:
- Does this skill improve output quality, and by how much?
- Does it do so consistently, or only on certain task types?
- Is the quality improvement worth the extra token cost?
Each task is run N times. Every run produces two outputs from the same LLM: one with the skill injected as the system prompt, one without. Both outputs are then scored by a judge LLM. After all runs, confidence intervals are computed over the scores and compared.
The runner LLM receives the task prompt. It runs twice: once with a plain system prompt, once with the skill's instructions as well. Everything else is identical: same model, same temperature, same task.
The judge LLM scores each output against a rubric. This is the part people ask about most, so it's worth explaining carefully.
The judge never sees the original task prompt. This sounds like a limitation but it is a deliberate design choice. The rubric contains a context block that tells the judge exactly what a good answer looks like for this type of task: what to reward, what to penalise, and why. The rubric is the definition of quality. If you need the judge to see the task prompt to score the output, the rubric is underspecified and should be improved. Keeping the judge prompt-blind also prevents a common failure mode where the judge rewards outputs that literally mirror the task instructions, which would contaminate the comparison.
The judge does not know which output used the skill. It receives the output and the rubric only. It cannot favour one condition over the other because it cannot tell them apart.
How judge bias is handled. Any LLM judge has tendencies — it might prefer longer responses, or penalise terse phrasing, or be slightly inconsistent between calls. SkillBenchmark handles this in two ways:
- The same judge, with the same prompt, scores both outputs. This means any systematic bias applies equally to both conditions and cancels out when you compute the delta. You are not trying to get an absolute quality score — you are measuring a relative difference between two conditions evaluated under identical circumstances. Absolute judge accuracy matters far less than consistency.
- Using multiple judges per run and treating each score as an independent sample reduces random variance and gives tighter confidence intervals.
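The cancellation argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the scores and the constant 4-point bias are made-up values, and `delta` is a hypothetical helper, not a function from the project.

```python
from statistics import mean

def delta(with_scores, without_scores):
    """Difference between the mean score of the two conditions."""
    return mean(with_scores) - mean(without_scores)

# Hypothetical "true" quality scores for three runs of one task.
true_with = [90, 92, 88]
true_without = [75, 78, 72]

# Assumption: the judge inflates every score by the same constant amount,
# regardless of which condition produced the output.
bias = 4
judged_with = [score + bias for score in true_with]
judged_without = [score + bias for score in true_without]

print(delta(true_with, true_without))      # delta on the unbiased scores
print(delta(judged_with, judged_without))  # same delta after the constant bias
```

Only a bias that is the same for both conditions drops out this way. A bias tied to something the skill itself changes, such as response length, would still show up in the delta, which is one reason the rubric wording matters.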
The rubric is the main lever for evaluation quality. Criteria with clear, distinguishable scoring levels produce consistent results. Vague criteria produce noisy ones.
All scores across runs and judges are treated as independent samples. A t-distribution confidence interval is computed for each condition (with skill, without skill). The delta — the difference between the two means — gets its own CI using Welch's t-interval, which correctly accounts for the uncertainty in both samples.
Results are displayed as mean ± margin. Non-overlapping CIs on the two conditions indicate a statistically meaningful difference, but the delta CI is the more direct test: if it excludes zero, the observed gap is unlikely to be noise, even when the two per-condition CIs overlap slightly. A delta CI that includes zero is not a failure. It means the current number of runs is not enough to confirm a difference. Add more runs to tighten it.
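As a rough illustration of the statistics described above, here is a minimal standard-library sketch of a t-interval per condition and a Welch interval for the delta. It is not the project's `src/stats.py`: all function names are hypothetical, the sample scores are made up, and the t quantile is found numerically (Simpson's rule plus bisection) as a stand-in for a library routine such as SciPy's.

```python
import math
from statistics import mean, stdev, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus the symmetric lower half."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_ci(scores, confidence=0.95):
    """Return (mean, margin) for a two-sided t-interval."""
    n = len(scores)
    t = t_quantile(1 - (1 - confidence) / 2, n - 1)
    return mean(scores), t * stdev(scores) / math.sqrt(n)

def welch_delta_ci(a, b, confidence=0.95):
    """Return (delta, margin) for mean(a) - mean(b) using Welch's t-interval."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    d = mean(a) - mean(b)
    if va + vb == 0:  # both samples are constant, so there is no spread
        return d, 0.0
    # Welch-Satterthwaite degrees of freedom (usually not an integer).
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return d, t_quantile(1 - (1 - confidence) / 2, df) * math.sqrt(va + vb)

# Made-up scores for illustration only.
with_skill = [94, 91, 89, 93, 90]
without_skill = [76, 70, 81, 72, 74]

for label, scores in (("with skill", with_skill), ("without skill", without_skill)):
    m, margin = mean_ci(scores)
    print(f"{label}: {m:.1f} ± {margin:.1f}")

d, margin = welch_delta_ci(with_skill, without_skill)
print(f"delta: {d:+.1f} ± {margin:.1f}, excludes zero: {abs(d) > margin}")
```

The last line applies the decision rule directly: the gap counts as confirmed only when the delta interval does not contain zero.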
Current scope: SkillBenchmark v1 evaluates skills on raw text LLM calls: single-turn prompt-in, response-out. The next major milestone is full agent environment support — sandboxed tool-use runs where a skill's effect on multi-turn, agentic tasks can be measured end-to-end. If you're interested in contributing, reach out on LinkedIn.
git clone https://github.com/TiesPetersen/SkillBenchmark
cd SkillBenchmark
pip install -r requirements.txt
cp .env.example .env
Add your Anthropic API key to .env:
ANTHROPIC_API_KEY=sk-ant-...
The repo ships with a working example: the Caveman skill and three tasks are already in tasks/ and skills/, and config.yml points at them. You can run python run.py immediately to see real output before touching anything.
When you're ready to benchmark your own skill, drop it into skills/ and update config.yml:
skill_path: skills/my-skill/SKILL.md
Then replace or extend the tasks in tasks/ with tasks relevant to your skill. The more tasks you add, the more meaningful the results will be (see Writing Tasks).
Run the benchmark:
python run.py
Results are written to results/ as both a JSON log and a markdown report.
SkillBenchmark/
├── tasks/                                # One YAML file per task
│   ├── caveman_debug_explanation.yml     # Example: explain a Python bug to a developer
│   ├── caveman_user_error_message.yml    # Example: write a user-facing error message
│   └── caveman_commit_message.yml        # Example: write a commit message for a diff
├── skills/                               # One folder per skill (Agent Skills standard)
│   └── caveman/                          # Example: Caveman skill
│       └── SKILL.md
├── results/                              # Output (gitignored)
├── src/
│   ├── config.py                         # Config loader
│   ├── task.py                           # Task YAML parser
│   ├── runner.py                         # LLM runner (with/without skill)
│   ├── evaluator.py                      # Blind judge scoring
│   ├── stats.py                          # Confidence interval calculation
│   ├── report.py                         # JSON + markdown reporter
│   └── main.py                           # Orchestrator
├── run.py                                # Entry point
├── config.yml                            # Configuration
└── .env.example                          # Example environment variables (copy to .env and fill in)
All settings live in config.yml:
number_of_runs_per_task: 3
number_of_judges_per_run: 1
runner_model: claude-sonnet-4-6
runner_temperature: 0.7
runner_max_tokens: 4096
judge_model: claude-sonnet-4-6
judge_temperature: 0.1
judge_max_tokens: 1024
skill_path: skills/my-skill/SKILL.md
tasks_dir: tasks
results_dir: results
confidence_level: 0.95
number_of_runs_per_task — How many times each task is run in full (both with and without skill). Each run is an independent sample. More runs produce tighter confidence intervals and more reliable verdicts, but cost more tokens. Use 3 during development and 10+ before drawing real conclusions.
number_of_judges_per_run — How many times the judge LLM scores each output. Multiple judges average out scoring inconsistency. 1 is fine for quick tests; use 3+ for important benchmarks.
runner_model — The model used to complete the tasks. This is what you're measuring the skill's effect on.
runner_temperature — Controls how varied the runner's outputs are between runs. Higher values (e.g. 0.7) produce more diverse outputs across runs, which is what you want — it gives the benchmark more signal to work with. Setting this to 0 would produce nearly identical outputs every run, making multiple runs pointless.
runner_max_tokens — The maximum length of the runner's output. Increase this for tasks that require long responses (documents, code files). The right value depends on your task — too low and the output gets cut off mid-response.
judge_model — The model used to score outputs. Using a different model than the runner reduces the risk of one model systematically favouring its own style of output.
judge_temperature — Controls how consistent the judge's scores are. Keep this at 0.0–0.2 so the judge produces near-deterministic, repeatable scores rather than noticeably different scores for the same output.
judge_max_tokens — The maximum length of the judge's response. The judge only returns structured JSON, so 1024 is almost always sufficient.
confidence_level — The confidence level used for the confidence intervals (e.g. 0.95 = 95% CI). Higher values produce wider intervals, but the result is more trustworthy when CIs don't overlap.
skill_path — Path to the SKILL.md file to benchmark.
tasks_dir / results_dir — Where to read tasks from and write results to.
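To get a feel for how the two count settings drive cost and sample size, the sketch below tallies calls for a given configuration. It assumes one API call per runner output and one per judge score, which the text implies but does not state outright; `plan_calls` is a hypothetical helper, not part of the project.

```python
def plan_calls(num_tasks, runs_per_task, judges_per_run):
    """Estimate call counts and samples for one benchmark invocation."""
    # Each run produces two runner outputs: with skill and without skill.
    runner_calls = num_tasks * runs_per_task * 2
    # Assumption: every output is scored judges_per_run times, one call each.
    judge_calls = runner_calls * judges_per_run
    return {
        "runner_calls": runner_calls,
        "judge_calls": judge_calls,
        # Scores available per condition, per task, for the confidence interval.
        "samples_per_condition": runs_per_task * judges_per_run,
    }

plan = plan_calls(num_tasks=3, runs_per_task=5, judges_per_run=3)
for key, value in plan.items():
    print(f"{key}: {value}")
```

With three tasks, 5 runs, and 3 judges this gives 15 samples per condition, so raising either setting tightens the intervals at a proportional cost in judge calls.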
Tasks live in tasks/ as YAML files. Each file defines the prompt given to the runner LLM and a scoring rubric given only to the judge. They are kept separate so the judge cannot reverse-engineer which output was "helped".
name: "Write an incident postmortem"
version: "1.0"
task:
  context: |
    You are a senior engineer at a tech company. An incident has just been
    resolved and you need to write the official postmortem report.
  description: |
    Write a complete postmortem for the following incident:
    [incident description here]
rubric:
  context: |
    A good postmortem is specific, actionable, and structured. It should
    allow any engineer who was not present to fully understand what happened,
    why, and what will be done to prevent recurrence.
  criteria:
    - name: Timeline
      points: 20
      levels:
        - range: [0, 8]
          description: "No timeline, or a vague narrative without specific times."
        - range: [9, 14]
          description: "Timeline present but incomplete or uses approximate times."
        - range: [15, 20]
          description: "Chronological timeline with precise timestamps covering all key events."
    - name: Root cause analysis
      points: 25
      levels:
        - range: [0, 10]
          description: "Root cause missing, vague, or incorrect."
        - range: [11, 18]
          description: "Root cause identified but causation chain is incomplete."
        - range: [19, 25]
          description: "Full causation chain explained across at least two levels of why."
    # ... more criteria
Task design tips:
- Write the rubric generically enough that the skill-assisted output gains no unfair advantage simply because the rubric mirrors the skill's own instructions. For example, if the skill is designed to help with structured output, the rubric should reward good structure but not explicitly call out "used the exact format from the skill instructions".
- Each criterion's levels should be clearly distinguishable so the judge produces consistent scores
- The context in the rubric gives the judge background without revealing the task prompt
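One way to catch rubric mistakes before spending tokens is to check that each criterion's levels tile the score range without gaps or overlaps. The sketch below assumes integer scores and contiguous bands starting at 0, as in the example rubric; the project itself may not enforce this, and `validate_rubric` is a hypothetical helper operating on the already-parsed criteria list.

```python
def validate_rubric(criteria):
    """Return a list of problems; an empty list means the bands look consistent."""
    problems = []
    for criterion in criteria:
        name = criterion["name"]
        levels = sorted(criterion["levels"], key=lambda level: level["range"][0])
        expected_low = 0  # assumption: the lowest band starts at 0
        for level in levels:
            low, high = level["range"]
            if high < low:
                problems.append(f"{name}: range [{low}, {high}] is reversed")
            if low != expected_low:
                problems.append(
                    f"{name}: expected a level starting at {expected_low}, found {low}"
                )
            expected_low = high + 1
        if expected_low - 1 != criterion["points"]:
            problems.append(
                f"{name}: levels end at {expected_low - 1}, but points is {criterion['points']}"
            )
    return problems

# The "Timeline" criterion from the example task, plus a deliberately broken copy.
timeline = {
    "name": "Timeline",
    "points": 20,
    "levels": [{"range": [0, 8]}, {"range": [9, 14]}, {"range": [15, 20]}],
}
broken = {
    "name": "Broken",
    "points": 20,
    "levels": [{"range": [0, 8]}, {"range": [10, 14]}],
}

print(validate_rubric([timeline]))
for problem in validate_rubric([broken]):
    print(problem)
```

A gap or overlap between bands leaves the judge without a clear level for some scores, which is one source of the noisy results the tips above warn about.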
Skills follow the Agent Skills open standard, compatible with Claude Code, Cursor, Gemini CLI, GitHub Copilot, and others.
Each skill lives in its own folder under skills/:
skills/
└── my-skill/
    └── SKILL.md
SKILL.md format:
---
name: my-skill
description: What this skill does and when to use it.
license: MIT
metadata:
author: your-name
version: "1.0"
---
# Skill instructions here
Everything below the frontmatter becomes the system prompt for the runner LLM.
The frontmatter name and description fields are standard Agent Skills metadata. The markdown body is what gets injected as the system prompt; the frontmatter is stripped automatically.
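The frontmatter stripping described above could look roughly like this. It is a sketch, not the project's actual loader: `strip_frontmatter` is a hypothetical name, and it assumes the frontmatter is delimited by `---` lines at the very top of the file, as in the format shown.

```python
def strip_frontmatter(text):
    """Return the markdown body of a SKILL.md, without the leading frontmatter."""
    lines = text.splitlines()
    # No opening delimiter on the first line: treat the whole file as the body.
    if not lines or lines[0].strip() != "---":
        return text.strip()
    # Find the closing delimiter and keep everything after it.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "\n".join(lines[index + 1:]).strip()
    # Opening delimiter without a closing one: fall back to the full text.
    return text.strip()

skill_md = """---
name: my-skill
description: What this skill does and when to use it.
---
# Skill instructions here
Be brief.
"""

print(strip_frontmatter(skill_md))
```

Falling back to the full text when a delimiter is missing is a design choice in this sketch; a stricter loader might raise an error instead so that a malformed skill file is noticed early.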
Each benchmark run writes a directory under results/<skill>__<timestamp>/ containing three file types.
<task_slug>.md — human-readable summary for each task:
# Write an incident postmortem
*Skill: `my-skill` | 2026-05-26 17:30:02*
## Score summary
| | With skill | Without skill |
|---|---|---|
| Mean score | 91.4 / 100 | 74.2 / 100 |
| 95% CI | 91.4 ± 2.9 | 74.2 ± 5.5 |
| Delta (95% CI) | +17.2 ± 6.2 | — |
| Std dev | 3.1 | 6.8 |
| Samples (runs × judges) | 10 | 10 |
## Per-run scores
| Run | Judge | With skill | Without skill |
|---|---|---|---|
| 1 | 1 | 94 | 76 |
...
## Criterion breakdown (mean across runs)
| Criterion | With skill | Without skill | Max |
|---|---|---|---|
| Timeline | 18.9 | 12.1 | 20 |
| Root cause | 23.1 | 17.4 | 25 |
...
## Token usage
| Run | With skill | Without skill |
|---|---|---|
| 1 | 1913 | 944 |
...
<task_slug>.json — full run log for reproducibility. Contains every output, per-criterion score and reasoning, token counts, and aggregate stats:
{
"skill": "my-skill",
"task": "Write an incident postmortem",
"timestamp": "20260526_173002",
"with_skill_stats": { "mean": 91.4, "std": 3.1, "ci_lower": 88.5, "ci_upper": 94.3, "n": 10 },
"without_skill_stats": { "mean": 74.2, "std": 6.8, "ci_lower": 68.7, "ci_upper": 79.7, "n": 10 },
"runs": [
{
"run_index": 0,
"with_skill_output": "...",
"without_skill_output": "...",
"with_skill_tokens": 1913,
"without_skill_tokens": 944,
"with_skill": [
{
"total_score": 94,
"max_score": 100,
"criteria_scores": [
{ "name": "Timeline", "score": 18, "max_score": 20, "reasoning": "..." },
...
]
}
],
"without_skill": [ ... ]
},
...
]
}
overview.md — cross-task summary table with all scores, deltas, and run configuration in one place.
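Because the `<task_slug>.json` log keeps every judge score, the aggregate numbers can be recomputed independently. The sketch below uses the field names shown in the log excerpt (`runs`, `with_skill`, `without_skill`, `total_score`) and assumes the elided `without_skill` entries have the same shape as the `with_skill` ones. The inline log is a made-up miniature, and `collect_scores` is a hypothetical helper.

```python
import json
from statistics import mean

def collect_scores(log):
    """Flatten all judge scores across runs into one list per condition."""
    with_scores, without_scores = [], []
    for run in log["runs"]:
        with_scores += [judge["total_score"] for judge in run["with_skill"]]
        # Assumption: same structure as the with_skill entries.
        without_scores += [judge["total_score"] for judge in run["without_skill"]]
    return with_scores, without_scores

# Made-up miniature log; a real file would be read with json.load(open(path)).
log = json.loads("""
{
  "runs": [
    {"with_skill": [{"total_score": 94}], "without_skill": [{"total_score": 76}]},
    {"with_skill": [{"total_score": 90}], "without_skill": [{"total_score": 72}]}
  ]
}
""")

with_scores, without_scores = collect_scores(log)
print("with skill:", with_scores, "mean", mean(with_scores))
print("without skill:", without_scores, "mean", mean(without_scores))
```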
The tasks/ and skills/ folders include a ready-to-run example that benchmarks the Caveman skill by Julius Brussee. Caveman is a popular Agent Skill that makes an LLM respond in compressed, fragment-based prose to cut output token usage by roughly 65–75% while claiming to maintain technical accuracy.
Note: This is not a rigorous or definitive evaluation of Caveman. It is a small illustrative example using three tasks and a small number of runs, intended only to show how SkillBenchmark works in practice. Treat the results as a demonstration, not a verdict on the skill.
Full credit for the Caveman skill goes to Julius Brussee. It is not part of this project and is included here solely as a benchmark subject.
The three example tasks were chosen to probe different contexts where Caveman's terse style might help, hurt, or have no effect:
| Task | Hypothesis |
|---|---|
| Explain a Python bug to a developer | Caveman-style fragments are acceptable for dev-to-dev communication — skill may help or be neutral |
| Write a user-facing error message | Non-technical users need full sentences and clear tone — skill likely hurts |
| Write a commit message for a diff | Commit messages reward brevity and precision — skill may help |
Run config: 5 runs × 3 judges = 15 samples per condition · claude-haiku-4-5 runner and judge · 95% CI
| Task | With skill | Without skill | Delta | Avg tokens (with / without skill) |
|---|---|---|---|---|
| Write a commit message | 93.5 ± 1.5 | 89.9 ± 2.3 | +3.6 ± 2.8 | 1896 / 952 |
| Explain a Python bug | 99.5 ± 0.5 | 100.0 ± 0.0 | −0.5 ± 0.5 | 1551 / 729 |
| Write a user-facing error message | 89.7 ± 3.2 | 87.7 ± 2.5 | +2.0 ± 4.0 | 1233 / 306 |
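The per-condition CIs overlap on all three tasks. Judged by the delta CI, only the commit message task shows a gap that excludes zero (+3.6 ± 2.8, i.e. +0.8 to +6.4). The delta for the user-facing error message (+2.0 ± 4.0) spans zero, and the delta for the Python bug explanation (−0.5 ± 0.5) reaches it. With three tasks and 15 samples per condition, even the commit message result should be treated as tentative. The extra token cost per task (822 to 944 tokens on average) is roughly the size of the SKILL.md being injected as system prompt (~950 tokens), with output length largely unchanged.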
Criterion breakdown — Write a commit message
| Criterion | With skill | Without skill | Max |
|---|---|---|---|
| Conventional Commits format | 24.3 | 24.2 | 25 |
| Accuracy — what changed | 33.7 | 32.4 | 35 |
| Explains the why | 22.3 | 20.8 | 25 |
| Conciseness | 13.3 | 12.5 | 15 |
Criterion breakdown — Explain a Python bug
| Criterion | With skill | Without skill | Max |
|---|---|---|---|
| Root cause accuracy | 40.0 | 40.0 | 40 |
| Fix correctness | 34.9 | 35.0 | 35 |
| Clarity for a developer | 24.6 | 25.0 | 25 |
Criterion breakdown — Write a user-facing error message
| Criterion | With skill | Without skill | Max |
|---|---|---|---|
| Clarity for a non-technical user | 28.0 | 28.0 | 30 |
| Actionability | 25.5 | 23.9 | 30 |
| Tone | 17.7 | 17.7 | 20 |
| Structure and completeness | 18.5 | 18.1 | 20 |
- Text output only (v1). Agentic / file-writing runs are not yet supported.
Help wanted! If you have ideas or want to contribute, reach out on LinkedIn.
- Claude API only (v1). Multi-provider support is planned if there's interest.
If you'd like to see support for your model of choice, let me know on LinkedIn or open an issue. If enough people want it, it'll move up the priority list.
MIT — see LICENSE. Built by Ties Petersen (LinkedIn)