SkillBenchmark

Benchmarking for Agent Skills (SKILL.md files). Measures whether a skill actually improves LLM output quality, and by how much, using blind judge evaluation and confidence intervals.

Thousands of Agent Skills exist across public registries. There is no systematic way to know whether any of them actually work. SkillBenchmark fills that gap.

The problem

An Agent Skill is only as good as the improvement it produces. Right now, skill authors publish based on intuition, and users install based on anecdote. There is no objective way to answer:

Does this skill improve output quality, and by how much?
Does it do so consistently, or only on certain task types?
Is the quality improvement worth the extra token cost?

How it works

Each task is run N times. Every run produces two outputs from the same LLM: one with the skill injected as the system prompt, one without. Both outputs are then scored by a judge LLM. After all runs, confidence intervals are computed over the scores and compared.

Step 1 — Two outputs per run

The runner LLM receives the task prompt. It runs twice: once with a plain system prompt, once with the skill's instructions as well. Everything else is identical: same model, same temperature, same task.

Step 2 — Blind scoring with a rubric

The judge LLM scores each output against a rubric. This is the part people ask about most, so it's worth explaining carefully.

The judge never sees the original task prompt. This sounds like a limitation but it is a deliberate design choice. The rubric contains a context block that tells the judge exactly what a good answer looks like for this type of task: what to reward, what to penalise, and why. The rubric is the definition of quality. If you need the judge to see the task prompt to score the output, the rubric is underspecified and should be improved. Keeping the judge prompt-blind also prevents a common failure mode where the judge rewards outputs that literally mirror the task instructions, which would contaminate the comparison.

The judge does not know which output used the skill. It receives the output and the rubric only. It cannot favour one condition over the other because it cannot tell them apart.

How judge bias is handled. Any LLM judge has tendencies — it might prefer longer responses, or penalise terse phrasing, or be slightly inconsistent between calls. SkillBenchmark handles this in two ways:

The same judge, with the same prompt, scores both outputs. This means any systematic bias applies equally to both conditions and cancels out when you compute the delta. You are not trying to get an absolute quality score — you are measuring a relative difference between two conditions evaluated under identical circumstances. Absolute judge accuracy matters far less than consistency.
Using multiple judges per run and treating each score as an independent sample reduces random variance and gives tighter confidence intervals.

The rubric is the main lever for evaluation quality. Criteria with clear, distinguishable scoring levels produce consistent results. Vague criteria produce noisy ones.

Step 3 — Confidence intervals on the scores and the delta

All scores across runs and judges are treated as independent samples. A t-distribution confidence interval is computed for each condition (with skill, without skill). The delta — the difference between the two means — gets its own CI using Welch's t-interval, which correctly accounts for the uncertainty in both samples.

Results are displayed as mean ± margin. Non-overlapping CIs on the two conditions indicate a statistically meaningful difference. The delta CI tells you whether the observed gap is real or consistent with zero. Overlapping CIs are not a failure — they mean the current number of runs is not enough to confirm a difference. Add more runs to tighten them.

Current scope: SkillBenchmark v1 evaluates skills on raw text LLM calls: single-turn prompt-in, response-out. The next major milestone is full agent environment support — sandboxed tool-use runs where a skill's effect on multi-turn, agentic tasks can be measured end-to-end. If you're interested in contributing, reach out on LinkedIn.

Quickstart

git clone https://github.com/TiesPetersen/SkillBenchmark
cd SkillBenchmark
pip install -r requirements.txt
cp .env.example .env

Add your Anthropic API key to .env:

ANTHROPIC_API_KEY=sk-ant-...

The repo ships with a working example: the Caveman skill and three tasks are already in tasks/ and skills/, and config.yml points at them. You can run python run.py immediately to see real output before touching anything.

When you're ready to benchmark your own skill, drop it into skills/ and update config.yml:

skill_path: skills/my-skill/SKILL.md

Then replace or extend the tasks in tasks/ with tasks relevant to your skill. The more tasks you add, the more meaningful the results will be (see Writing Tasks).

Run the benchmark:

python run.py

Results are written to results/ as both a JSON log and a markdown report.

Project structure

SkillBenchmark/
├── tasks/                          # One YAML file per task
│   ├── caveman_debug_explanation.yml       # Example: explain a Python bug to a developer
│   ├── caveman_user_error_message.yml      # Example: write a user-facing error message
│   └── caveman_commit_message.yml          # Example: write a commit message for a diff
├── skills/                         # One folder per skill (Agent Skills standard)
│   └── caveman/                    # Example: Caveman skill
│       └── SKILL.md
├── results/                        # Output (gitignored)
├── src/
│   ├── config.py                   # Config loader
│   ├── task.py                     # Task YAML parser
│   ├── runner.py                   # LLM runner (with/without skill)
│   ├── evaluator.py                # Blind judge scoring
│   ├── stats.py                    # Confidence interval calculation
│   ├── report.py                   # JSON + markdown reporter
│   └── main.py                     # Orchestrator
├── run.py                          # Entry point
├── config.yml                      # Configuration
└── .env.example                    # Example environment variables (copy to .env and fill in)

Configuration

All settings live in config.yml:

number_of_runs_per_task: 3
number_of_judges_per_run: 1

runner_model: claude-sonnet-4-6
runner_temperature: 0.7
runner_max_tokens: 4096

judge_model: claude-sonnet-4-6
judge_temperature: 0.1
judge_max_tokens: 1024

skill_path: skills/my-skill/SKILL.md
tasks_dir: tasks
results_dir: results
confidence_level: 0.95

number_of_runs_per_task — How many times each task is run in full (both with and without skill). Each run is an independent sample. More runs produce tighter confidence intervals and more reliable verdicts, but cost more tokens. Use 3 during development and 10+ before drawing real conclusions.

number_of_judges_per_run — How many times the judge LLM scores each output. Multiple judges average out scoring inconsistency. 1 is fine for quick tests; use 3+ for important benchmarks.

runner_model — The model used to complete the tasks. This is what you're measuring the skill's effect on.

runner_temperature — Controls how varied the runner's outputs are between runs. Higher values (e.g. 0.7) produce more diverse outputs across runs, which is what you want — it gives the benchmark more signal to work with. Setting this to 0 would produce nearly identical outputs every run, making multiple runs pointless.

runner_max_tokens — The maximum length of the runner's output. Increase this for tasks that require long responses (documents, code files). The right value depends on your task — too low and the output gets cut off mid-response.

judge_model — The model used to score outputs. Using a different model than the runner reduces the risk of one model systematically favouring its own style of output.

judge_temperature — Controls how consistent the judge's scores are. Keep this at 0.0-0.2 so the judge produces deterministic, repeatable scores rather than varying scores for the same output.

judge_max_tokens — The maximum length of the judge's response. The judge only returns structured JSON, so 1024 is almost always sufficient.

confidence_level — The confidence level used for the confidence intervals (e.g. 0.95 = 95% CI). Higher values produce wider intervals, but the result is more trustworthy when CIs don't overlap.

skill_path — Path to the SKILL.md file to benchmark.

tasks_dir / results_dir — Where to read tasks from and write results to.

Writing tasks

Tasks live in tasks/ as YAML files. Each file defines the prompt given to the runner LLM and a scoring rubric given only to the judge. They are kept separate so the judge cannot reverse-engineer which output was "helped".

name: "Write an incident postmortem"
version: "1.0"

task:
  context: |
    You are a senior engineer at a tech company. An incident has just been
    resolved and you need to write the official postmortem report.
  description: |
    Write a complete postmortem for the following incident:
    [incident description here]

rubric:
  context: |
    A good postmortem is specific, actionable, and structured. It should
    allow any engineer who was not present to fully understand what happened,
    why, and what will be done to prevent recurrence.
  criteria:
    - name: Timeline
      points: 20
      levels:
        - range: [0, 8]
          description: "No timeline, or a vague narrative without specific times."
        - range: [9, 14]
          description: "Timeline present but incomplete or uses approximate times."
        - range: [15, 20]
          description: "Chronological timeline with precise timestamps covering all key events."

    - name: Root cause analysis
      points: 25
      levels:
        - range: [0, 10]
          description: "Root cause missing, vague, or incorrect."
        - range: [11, 18]
          description: "Root cause identified but causation chain is incomplete."
        - range: [19, 25]
          description: "Full causation chain explained across at least two levels of why."

    # ... more criteria

Task design tips:

The rubric should be written generically enough, so that the LLM using the skill doesn't have an advantage, because the rubric is focused on exactly what the skill is designed to improve. For example, if the skill is designed to help with structured output, the rubric should reward good structure but not explicitly call out "used the exact format from the skill instructions".
Each criterion's levels should be clearly distinguishable so the judge produces consistent scores
The context in the rubric gives the judge background without revealing the task prompt

Using skills

Skills follow the Agent Skills open standard, compatible with Claude Code, Cursor, Gemini CLI, GitHub Copilot, and others.

Each skill lives in its own folder under skills/:

skills/
└── my-skill/
    └── SKILL.md

SKILL.md format:

---
name: my-skill
description: What this skill does and when to use it.
license: MIT
metadata:
  author: your-name
  version: "1.0"
---

# Skill instructions here

Everything below the frontmatter becomes the system prompt for the runner LLM.

The frontmatter name and description fields are standard Agent Skills metadata. The markdown body is what gets injected as the system prompt, frontmatter is stripped automatically.

Reading results

Each benchmark run writes a directory under results/<skill>__<timestamp>/ containing three file types.

<task_slug>.md — human-readable summary for each task:

# Write an incident postmortem
*Skill: `my-skill` | 2026-05-26 17:30:02*

## Score summary

| | With skill | Without skill |
|---|---|---|
| Mean score          | 91.4 / 100   | 74.2 / 100   |
| 95% CI              | 91.4 ± 2.9   | 74.2 ± 5.5   |
| Delta (95% CI)      | +17.2 ± 6.2  | —            |
| Std dev             | 3.1          | 6.8          |
| Samples (runs × judges) | 10      | 10           |

## Per-run scores
| Run | Judge | With skill | Without skill |
|---|---|---|---|
| 1 | 1 | 94 | 76 |
...

## Criterion breakdown (mean across runs)
| Criterion     | With skill | Without skill | Max |
|---|---|---|---|
| Timeline      | 18.9       | 12.1          | 20  |
| Root cause    | 23.1       | 17.4          | 25  |
...

## Token usage
| Run | With skill | Without skill |
|---|---|---|
| 1 | 1913 | 944 |
...

<task_slug>.json — full run log for reproducibility. Contains every output, per-criterion score and reasoning, token counts, and aggregate stats:

{
  "skill": "my-skill",
  "task": "Write an incident postmortem",
  "timestamp": "20260526_173002",
  "with_skill_stats": { "mean": 91.4, "std": 3.1, "ci_lower": 88.5, "ci_upper": 94.3, "n": 10 },
  "without_skill_stats": { "mean": 74.2, "std": 6.8, "ci_lower": 67.9, "ci_upper": 80.5, "n": 10 },
  "runs": [
    {
      "run_index": 0,
      "with_skill_output": "...",
      "without_skill_output": "...",
      "with_skill_tokens": 1913,
      "without_skill_tokens": 944,
      "with_skill": [
        {
          "total_score": 94,
          "max_score": 100,
          "criteria_scores": [
            { "name": "Timeline", "score": 18, "max_score": 20, "reasoning": "..." },
            ...
          ]
        }
      ],
      "without_skill": [ ... ]
    },
    ...
  ]
}

overview.md — cross-task summary table with all scores, deltas, and run configuration in one place.

Example benchmark: Caveman

The tasks/ and skills/ folders include a ready-to-run example that benchmarks the Caveman skill by Julius Brussee. Caveman is a popular Agent Skill that makes an LLM respond in compressed, fragment-based prose to cut output token usage by roughly 65–75% while claiming to maintain technical accuracy.

Note: This is not a rigorous or definitive evaluation of Caveman. It is a small illustrative example using three tasks and a small number of runs, intended only to show how SkillBenchmark works in practice. Treat the results as a demonstration, not a verdict on the skill.

Full credit for the Caveman skill goes to Julius Brussee, it is not part of this project and is included here solely as a benchmark subject.

The three example tasks were chosen to probe different contexts where Caveman's terse style might help, hurt, or have no effect:

Task	Hypothesis
Explain a Python bug to a developer	Caveman-style fragments are acceptable for dev-to-dev communication — skill may help or be neutral
Write a user-facing error message	Non-technical users need full sentences and clear tone — skill likely hurts
Write a commit message for a diff	Commit messages reward brevity and precision — skill may help

Results

Run config: 5 runs × 3 judges = 15 samples per condition · claude-haiku-4-5 runner and judge · 95% CI

Task	With skill	Without skill	Delta	Avg tokens (with / without skill)
Write a commit message	93.5 ± 1.5	89.9 ± 2.3	+3.6 ± 2.8	1896 / 952
Explain a Python bug	99.5 ± 0.5	100.0 ± 0.0	−0.5 ± 0.5	1551 / 729
Write a user-facing error message	89.7 ± 3.2	87.7 ± 2.5	+2.0 ± 4.0	1233 / 306

The CI intervals on all three tasks overlap, so no difference is statistically confirmed at 95%. The extra token cost across all tasks is roughly the size of the SKILL.md being injected as system prompt (~950 tokens), with output length largely unchanged.

Criterion breakdown — Write a commit message

Criterion	With skill	Without skill	Max
Conventional Commits format	24.3	24.2	25
Accuracy — what changed	33.7	32.4	35
Explains the why	22.3	20.8	25
Conciseness	13.3	12.5	15

Criterion breakdown — Explain a Python bug

Criterion	With skill	Without skill	Max
Root cause accuracy	40.0	40.0	40
Fix correctness	34.9	35.0	35
Clarity for a developer	24.6	25.0	25

Criterion breakdown — Write a user-facing error message

Criterion	With skill	Without skill	Max
Clarity for a non-technical user	28.0	28.0	30
Actionability	25.5	23.9	30
Tone	17.7	17.7	20
Structure and completeness	18.5	18.1	20

Limitations

Text output only (v1). Agentic / file-writing runs are not yet supported.

Help wanted! If you have ideas or want to contribute, reach out on LinkedIn.
Claude API only (v1). Multi-provider support is planned if there's interest.

If you'd like to see support for your model of choice, let me know on LinkedIn or open an issue. If enough people want it, it'll move up the priority list.

License

MIT — see LICENSE. Built by Ties Petersen (Linkedin)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SkillBenchmark

The problem

How it works

Step 1 — Two outputs per run

Step 2 — Blind scoring with a rubric

Step 3 — Confidence intervals on the scores and the delta

Quickstart

Project structure

Configuration

Writing tasks

Using skills

Reading results

Example benchmark: Caveman

Results

Limitations

License

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
skills/caveman		skills/caveman
src		src
tasks		tasks
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.yml		config.yml
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

SkillBenchmark

The problem

How it works

Step 1 — Two outputs per run

Step 2 — Blind scoring with a rubric

Step 3 — Confidence intervals on the scores and the delta

Quickstart

Project structure

Configuration

Writing tasks

Using skills

Reading results

Example benchmark: Caveman

Results

Limitations

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages