English | 中文
Test and regression-check your Agent Skills before shipping.
Agent Skills are becoming reusable software assets, but most skills are shipped without tests. SkillCI helps you check structure, resource references, risky commands, and trigger behavior before publishing a Skill.
随着 Claude Skills、OpenAI Codex Skills、GitHub gh skill、AGENTS.md 等 Agent 工作流资产逐渐普及,AI Agent 的能力复用方式正在从"写 Prompt"转向"维护可复用 Skill"。
但当前 Skill 生态仍处于早期阶段,开发者主要面临以下问题:
- Skill 写完后无法判断是否真的有效 — 只能靠人工试几条 Prompt
- Skill 修改后容易产生行为回归 — 原本能触发的任务不再触发
- Skill 质量缺少工程化门禁 — 没有 lint、测试、回归、报告和 CI 检查
- Skill 逐渐成为团队资产,但缺少维护工具
SkillCI 的目标是:让 Agent Skills 像代码一样可以被测试、回归检查和持续集成。
| Check | Severity | Description |
|---|---|---|
SKILL.md missing |
error | No SKILL.md in skill directory |
| Invalid frontmatter | error | YAML frontmatter parse failure |
| Missing name | error | No name in frontmatter |
| Missing description | error | No description in frontmatter |
| Description too short | warning | Word count < 8 |
| Description too long | warning | Word count > 80 or chars > 500 |
| Description too broad | warning | Contains generic phrases like "all tasks" |
| Missing scope hints | warning | No "when to use" indicators |
| Broken reference | error | Referenced file does not exist |
| Dangerous command | warning | Contains risky commands like rm -rf |
| Sensitive path | warning | References sensitive paths like .ssh |
- Local mode: Rule-based scoring using description similarity, keyword overlap, domain terms, headings, and tags
- LLM mode: Calls an LLM to judge whether the Skill should trigger for each test case
- Regression: Compares current results against a saved baseline to detect behavior changes
# Install
pip install -e .[dev]
# Lint a skill
skillci lint examples/api-doc-writer
# Run local trigger test
skillci test examples/api-doc-writer --mode local
# Run LLM trigger test (mock)
skillci test examples/api-doc-writer --mode llm --provider mock
# Save baseline snapshot
skillci snapshot examples/api-doc-writer
# Compare with baseline
skillci test examples/api-doc-writer --mode local --compare latest
# Generate markdown report
skillci report examples/api-doc-writer --format markdown --output report.md---
name: api-doc-writer
description: Use this skill when the user asks to generate, refine, or validate
API documentation from backend routes, controllers, request and response models,
or OpenAPI specifications.
---
# API Doc Writer
This skill helps generate API documentation from backend source code,
including Spring Boot controllers, FastAPI routes, REST endpoints,
request DTOs, response schemas, and OpenAPI specifications.
## When to use
Use this skill for:
- Generating OpenAPI documentation
- Writing REST API docs
- Converting backend routes into API specs
## References
See `references/openapi-style-guide.md`.skill: .
mode: local
judge:
provider: openai
base_url: https://your-api.com/v1
model: gpt-4o-mini
temperature: 0
timeout: 30
thresholds:
trigger_score: 0.45
trigger_f1: 0.8
llm_confidence: 0.7
cases:
- name: should_trigger_for_spring_controller
input: 根据这个 Spring Controller 生成 OpenAPI 文档
expected_trigger: true
tags: [api, spring, openapi]
- name: should_trigger_for_fastapi_routes
input: 把这些 FastAPI 路由整理成接口说明
expected_trigger: true
tags: [api, fastapi, documentation]
- name: should_not_trigger_for_sql_optimization
input: 帮我优化这个 SQL 查询
expected_trigger: false
tags: [sql, database]
- name: should_not_trigger_for_css_explanation
input: 解释一下这个 CSS 样式
expected_trigger: false
tags: [frontend, css]$ skillci test examples/api-doc-writer --mode local
SkillCI Report: api-doc-writer
Static Health
OK no issues found
Trigger Check
PASS should_trigger_for_spring_controller: expected=True, actual=True, score=0.4786
matched_terms: api, controller, openapi, spring, 文档
PASS should_trigger_for_fastapi_routes: expected=True, actual=True, score=0.57
matched_terms: api, documentation, fastapi
PASS should_not_trigger_for_sql_optimization: expected=False, actual=False, score=0.0
matched_terms: none
PASS should_not_trigger_for_css_explanation: expected=False, actual=False, score=0.0
matched_terms: none
Summary
Local Precision: 1.0
Local Recall: 1.0
Local F1: 1.0
Local Accuracy: 1.0
Result
passed
$ skillci test examples/api-doc-writer --mode llm --provider mock
SkillCI Report: api-doc-writer
Static Health
OK no issues found
LLM Trigger Check
PASS should_trigger_for_spring_controller: expected=True, actual=True, confidence=0.92
reason: mock judge returned expected result
PASS should_trigger_for_fastapi_routes: expected=True, actual=True, confidence=0.92
reason: mock judge returned expected result
PASS should_not_trigger_for_sql_optimization: expected=False, actual=False, confidence=0.88
reason: mock judge returned expected result
PASS should_not_trigger_for_css_explanation: expected=False, actual=False, confidence=0.88
reason: mock judge returned expected result
LLM Summary
LLM Precision: 1.0
LLM Recall: 1.0
LLM F1: 1.0
LLM Accuracy: 1.0
Average Confidence: 0.9
Result
passed
$ skillci test examples/api-doc-writer --mode local --compare latest
Regression Report
Baseline: .skillci/baselines/api-doc-writer/latest.json
Total cases: 4
Unchanged: 4
Regressed: 0
Improved: 0
No behavior changes detected.
| Command | Description |
|---|---|
skillci init <path> |
Generate a starter skillci.yaml for a Skill directory |
skillci lint <path> |
Run static health checks |
skillci test <path> --mode local |
Run local trigger test |
skillci test <path> --mode llm --provider openai |
Run LLM trigger test |
skillci test <path> --mode llm --provider mock |
Run LLM test with mock provider |
skillci test <path> --json |
Output JSON report |
skillci test <path> --compare latest |
Compare with baseline snapshot |
skillci snapshot <path> |
Save baseline snapshot |
skillci report <path> --format markdown |
Generate markdown report |
| Field | Type | Default | Description |
|---|---|---|---|
skill |
string | . |
Skill directory path |
mode |
string | local |
Test mode: local, llm, both |
judge.provider |
string | openai |
LLM provider |
judge.base_url |
string | null |
Custom API endpoint |
judge.api_key |
string | null |
API key, overrides OPENAI_API_KEY when set |
judge.model |
string | gpt-4.1-mini |
Model name |
judge.temperature |
float | 0 |
Temperature |
judge.timeout |
int | 30 |
Timeout in seconds |
thresholds.trigger_score |
float | 0.6 |
Local trigger threshold |
thresholds.trigger_f1 |
float | 0.8 |
F1 pass threshold |
thresholds.llm_confidence |
float | 0.7 |
LLM confidence threshold |
cases[].name |
string | required | Test case name |
cases[].input |
string | required | User input |
cases[].expected_trigger |
bool | required | Expected trigger result |
cases[].tags |
list | [] |
Tags for the case |
| Variable | Description |
|---|---|
OPENAI_API_KEY |
API key for OpenAI-compatible LLM providers |
OPENAI_BASE_URL |
Default API endpoint (overridden by judge.base_url) |
If you want to keep sensitive settings (like API keys) out of version control, create a skillci.local.yaml file in the same directory as skillci.yaml. It will be automatically loaded and merged, but is gitignored by default.
# skillci.local.yaml (not committed)
judge:
provider: openai
api_key: sk-...
base_url: https://your-api.com/v1
model: your-model-nameThe local config overrides any matching fields in skillci.yaml. You don't need to change any commands — it loads automatically.
API key priority: judge.api_key in skillci.local.yaml > judge.api_key in skillci.yaml > OPENAI_API_KEY
To use LLM mode, install the optional dependency:
pip install -e .[llm]Then configure the judge section in skillci.yaml:
judge:
provider: openai
base_url: https://your-api.com/v1
model: your-model-nameOr use environment variables:
export OPENAI_API_KEY=your-key
export OPENAI_BASE_URL=https://your-api.com/v1pip install -e .[dev]
python -m pytest -v
ruff check .Current status: 48 tests passing, ruff clean.
See GitHub Actions Guide for CI/CD setup instructions.
bothmode is planned — currently onlylocalandllmmodes are supported- Judge cache is planned — LLM results are not cached yet, each run calls the API
- LLM confidence is self-reported — the confidence score comes from the model, not from logprobs or multi-run consensus
- Local trigger scoring is heuristic — based on word coverage, not semantic understanding
- Chinese tokenization is character-level — no word segmentation library is used
- Only OpenAI-compatible API is supported — Anthropic, Gemini, and local models (Ollama) are planned
- Anthropic / Gemini / Ollama provider support
- logprobs-based confidence
- Multi-run consistency check (
--runs N) - Judge cache
- Both mode (local + LLM comparison)
- HTML report
- GitHub Action
- Skill Health Score
- Batch testing (multiple skills)
Apache-2.0