Your skill could be much better. But better how? Which part? In what order?
| | |
|---|---|
| What it is | An evaluation-driven skill evolution engine for Claude Code / OpenClaw — six-dimension scoring, directed improvement, version management. |
| Pain it solves | Turns "tweak and hope" into diagnose → targeted fix → verified improvement. |
| Use in 30 seconds | `/skill-compass evaluate {skill}` — an instant quality report showing exactly what's weakest and what to improve next. |
Find the weakest link → fix it → prove it worked → next weakness → repeat.
| For | Not For |
|---|---|
Prerequisites: Claude Opus 4.6 (complex reasoning + consistent scoring) · Node.js v18+ (local validators)
```bash
git clone https://github.com/Evol-ai/SkillCompass.git
cd SkillCompass && npm install

# User-level (all projects)
rsync -a --exclude='.git' . ~/.claude/skills/skill-compass/

# Or project-level (current project only)
rsync -a --exclude='.git' . .claude/skills/skill-compass/
```

First run: Claude Code will request permission for `node -e` and `bash` commands. Select "Allow always" to avoid repeated prompts.
```bash
git clone https://github.com/Evol-ai/SkillCompass.git
cd SkillCompass && npm install

# Follow OpenClaw skill installation docs for your setup
rsync -a --exclude='.git' . <your-openclaw-skills-path>/skill-compass/
```

Two ways to invoke SkillCompass:
```
/skill-compass evaluate ./my-skill/SKILL.md
/skill-compass improve the nano-banana skill
/skill-compass security scan ./my-skill/SKILL.md
/skill-compass audit all skills in .claude/skills/
/skill-compass compare my-skill 1.0.0 vs 1.0.0-evo.2
/skill-compass roll back my-skill to previous version
```
No slash command needed — Claude automatically recognizes the intent:
```
Evaluate the nano-banana skill for me
Improve this skill — fix the weakest dimension
Scan all skills in .claude/skills/ for security issues
```
Capability reference
| Intent | Maps to |
|---|---|
| Evaluate / score / review a skill | `eval-skill` |
| Improve / fix / upgrade a skill | `eval-improve` |
| Security scan a skill | `eval-security` |
| Batch audit a directory | `eval-audit` |
| Compare two versions | `eval-compare` |
| Merge with upstream | `eval-merge` |
| Rollback to previous version | `eval-rollback` |
The score isn't the point — the direction is. You instantly see which dimension is the bottleneck and what to do about it.
Each `/eval-improve` round follows a closed loop: fix the weakest → re-evaluate → verify improvement → next weakest. No fix is saved unless the re-evaluation confirms it actually helped.
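The closed loop above can be sketched in plain Node.js. `evaluate` and `applyFix` here are hypothetical stand-ins for SkillCompass's internal steps, not its real API:

```javascript
// Minimal sketch of one closed improvement round: target the weakest
// dimension, re-evaluate, and keep the change only if that dimension
// improved and nothing else regressed.
function improveRound(scores, applyFix, evaluate) {
  // Find the weakest dimension (lowest score wins).
  const weakest = Object.keys(scores)
    .reduce((a, b) => (scores[b] < scores[a] ? b : a));

  const candidate = applyFix(weakest);   // propose a targeted fix
  const newScores = evaluate(candidate); // re-evaluate the candidate

  // Accept only if the weakest dimension improved and no dimension dropped.
  const improved = newScores[weakest] > scores[weakest];
  const regressed = Object.keys(scores)
    .some((d) => newScores[d] < scores[d]);

  return improved && !regressed
    ? { accepted: true, scores: newScores }
    : { accepted: false, scores };       // discard the candidate
}
```

A fix that lifts the weakest dimension but dents another one is rejected, which is what keeps multi-round evolution monotonic.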
| ID | Dimension | Weight | What it evaluates |
|---|---|---|---|
| D1 | Structure | 10% | Frontmatter validity, markdown format, declarations |
| D2 | Trigger | 15% | Activation quality, rejection accuracy, discoverability |
| D3 | Security | 20% | Secrets, injection, permissions, exfiltration |
| D4 | Functional | 30% | Core quality, edge cases, output stability, error handling |
| D5 | Comparative | 15% | Value over direct prompting (with vs without skill) |
| D6 | Uniqueness | 10% | Overlap with similar skills, model supersession risk |
```
overall_score = round((D1×0.10 + D2×0.15 + D3×0.20 + D4×0.30 + D5×0.15 + D6×0.10) × 10)
```
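As a sanity check, the weighted formula can be reproduced directly (dimension scores are assumed to be on a 0–10 scale, as the ×10 scaling implies):

```javascript
// Weighted overall score: each dimension is 0-10, the weights sum to 1,
// and the ×10 scaling maps the result onto a 0-100 scale.
const WEIGHTS = { D1: 0.10, D2: 0.15, D3: 0.20, D4: 0.30, D5: 0.15, D6: 0.10 };

function overallScore(dims) {
  const weighted = Object.entries(WEIGHTS)
    .reduce((sum, [d, w]) => sum + dims[d] * w, 0);
  return Math.round(weighted * 10);
}
```

For example, a skill scoring 8 on every dimension lands at 80, and a 1-point swing in D4 moves the overall score three times as much as the same swing in D1.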
| Verdict | Condition |
|---|---|
| PASS | score ≥ 70 AND D3 pass |
| CAUTION | 50–69, or D3 High findings |
| FAIL | score < 50, or D3 Critical (gate override) |
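The verdict table reads as a small decision function. The D3 states are modeled here as an explicit `"pass"` / `"high"` / `"critical"` argument — assumed labels, since the README doesn't specify how the gate reports severity:

```javascript
// Verdict from the overall score plus the D3 security gate.
// d3Severity: "pass", "high", or "critical" (assumed labels).
function verdict(score, d3Severity) {
  if (d3Severity === "critical" || score < 50) return "FAIL"; // gate override
  if (d3Severity === "high" || score < 70) return "CAUTION";
  return "PASS"; // score >= 70 AND D3 pass
}
```

Note the ordering: a D3 Critical finding forces FAIL even on a high-scoring skill, which is the "gate override" in the table.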
| Feature | Description |
|---|---|
| Directed Evolution | Diagnose → targeted fix → verify → next weakness. Not random patching. |
| Closed-Loop Improve | `/eval-improve` auto re-evaluates after each fix. Only saves if improved and nothing regressed. |
| Scope Control | `--scope gate` = D1+D3 (~8K tokens). `--scope target --dimension D4` = single dim + gate. |
| Tiered Verification | L0 syntax → L1 single dimension → L2 full re-eval → L3 cross-skill. |
| D1+D2 Grouping | Both metadata dimensions weak (≤5)? Improved together — they share the frontmatter layer. |
| Feature | Description |
|---|---|
| Pre-Accept Gate | Hooks auto-scan every SKILL.md write. D1 + D3 checks. Zero config. Warns, never blocks. |
| Pre-Eval Scan | Static analysis blocks malicious code, exfiltration, prompt injection before LLM eval. |
| Output Guard | Validates improvement output for URL injection, dangerous commands, size anomalies. |
| Auto-Rollback | Any dimension drops >2 points after improvement? Changes discarded. |
| Local Validators | JS-based D1/D2/D3 validators run locally. Saves ~60% tokens on clear-cut issues. |
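The Auto-Rollback rule in the table above reduces to one comparison per dimension. A sketch, assuming score objects keyed D1–D6:

```javascript
// Auto-rollback check: discard an improvement if any single dimension
// dropped by more than `maxDrop` points versus the pre-fix baseline.
function shouldRollback(before, after, maxDrop = 2) {
  return Object.keys(before)
    .some((d) => before[d] - after[d] > maxDrop);
}
```

This is stricter than checking the overall score: a fix that nets +3 overall but costs one dimension 3 points is still rolled back.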
| Feature | Description |
|---|---|
| Correction Tracking | Detects repeated manual fixes, maps to dimensions, prompts update at next invocation. |
| Feedback Integration | Real usage data fuses into scores: 60% static + 40% feedback signals. |
| Multi-Language Triggers | Detects your language, tests trigger accuracy in it, fixes multilingual gaps. |
| Obsolescence Detection | Compares skill vs base model. Tracks supersession risk across model updates. |
| Skill Type Detection | Auto-classifies atom / composite / meta. Evaluation adapts accordingly. |
| Feature | Description |
|---|---|
| Version Management | SHA-256 hashed snapshots. Rollback to any version anytime. |
| Three-Way Merge | Merges upstream updates region-by-region. Local improvements preserved. |
| Multi-Round Evolution | `/eval-evolve` runs up to 6 rounds autonomously. Stops at PASS or plateau. |
| Batch Audit + Auto-Fix | `/eval-audit --fix --budget 3` scans worst-first, auto-fixes within budget. |
| CI Mode | `--ci` flag, exit codes: 0=PASS, 1=CAUTION, 2=FAIL. |
No point-to-point integration needed. The Pre-Accept Gate intercepts all SKILL.md edits regardless of source.
| Tool | How it works together | Guide |
|---|---|---|
| Auto-Updater | Pulls new version → Gate auto-checks for security regressions → keep or rollback | guide |
| Claudeception | Extracts skill → auto-evaluation catches security holes + redundancy → directed fix | guide |
| Self-Improving Agent | Logs errors → feed as signals → SkillCompass maps to dimensions and fixes | guide |
SkillCompass defines an open `feedback-signal.json` schema for any tool to report skill usage data:

```
/eval-skill ./my-skill/SKILL.md --feedback ./feedback-signals.json
```

Signals: `trigger_accuracy`, `correction_count`, `correction_patterns`, `adoption_rate`, `ignore_rate`, `usage_frequency`. The schema is extensible (`additionalProperties: true`) — any pipeline can produce or consume this format.
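The 60% static + 40% feedback fusion mentioned under Feedback Integration can be sketched as a weighted blend. The mapping from signal fields to a single feedback score below is illustrative, not SkillCompass's actual one:

```javascript
// Blend a static evaluation score (0-100) with a feedback-derived
// score (0-100): 60% static + 40% real-usage signals.
function fuseScores(staticScore, fbScore) {
  return Math.round(staticScore * 0.6 + fbScore * 0.4);
}

// Illustrative mapping from feedback-signal fields to a 0-100 score:
// trigger accuracy and adoption raise it, repeated corrections lower it.
function feedbackScore(signals) {
  const base = 100 * (0.5 * signals.trigger_accuracy + 0.5 * signals.adoption_rate);
  return Math.max(0, Math.round(base - 5 * signals.correction_count));
}
```

With this blend, strong real-world usage can lift a mediocre static score, but never fully mask it — and vice versa.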
MIT — Use, modify, distribute freely. See LICENSE for details.
