diff --git a/packages/rules/.ai-rules/skills/README.md b/packages/rules/.ai-rules/skills/README.md index 7a00a06..fcbd5a6 100644 --- a/packages/rules/.ai-rules/skills/README.md +++ b/packages/rules/.ai-rules/skills/README.md @@ -57,6 +57,7 @@ Reusable workflows for consistent development practices. | agent-design | Design new specialist agent JSON definitions with schema, expertise, and system prompts | Adding new agents to codingbuddy | | mcp-builder | NestJS-based MCP server development with Tools/Resources/Prompts design | Building or extending MCP servers | | rule-authoring | Write unambiguous AI coding rules compatible across multiple AI tools | Creating rules for .ai-rules/ directories | +| skill-creator | Create, eval, improve, and benchmark skills with measurable behavior-change tests | Creating new skills, testing skill effectiveness, optimizing underperforming skills | ## Skill Format @@ -219,6 +220,8 @@ EOF │ └── SKILL.md ├── mcp-builder/ │ └── SKILL.md - └── rule-authoring/ + ├── rule-authoring/ + │ └── SKILL.md + └── skill-creator/ └── SKILL.md ``` diff --git a/packages/rules/.ai-rules/skills/skill-creator/SKILL.md b/packages/rules/.ai-rules/skills/skill-creator/SKILL.md new file mode 100644 index 0000000..ea25c02 --- /dev/null +++ b/packages/rules/.ai-rules/skills/skill-creator/SKILL.md @@ -0,0 +1,461 @@ +--- +name: skill-creator +description: >- + Create new skills, modify and improve existing skills, + and measure skill performance with eval pipeline. + Use when creating a skill from scratch, editing or optimizing + an existing skill, running evals to test a skill, + or benchmarking skill performance. +disable-model-invocation: true +argument-hint: [create|eval|improve|benchmark] [skill-name] +--- + +# Skill Creator + +## Overview + +Skills are reusable workflows that encode expert processes into repeatable instructions. A well-crafted skill transforms inconsistent ad-hoc work into systematic, verifiable outcomes across any AI tool. + +**Core principle:** A skill must change behavior. If an AI assistant produces the same output with and without the skill loaded, the skill has failed. + +**Iron Law:** +``` +EVERY SKILL MUST HAVE A MEASURABLE "DID BEHAVIOR CHANGE?" TEST +No eval = no confidence. Ship nothing you haven't measured. +``` + +## Modes + +| Mode | Purpose | Input | Output | +|------|---------|-------|--------| +| **Create** | Build a new skill from scratch | Intent or problem statement | `SKILL.md` + scaffold | +| **Eval** | Test skill effectiveness | Skill + test cases | Graded scorecard | +| **Improve** | Refine based on eval results | Skill + eval data | Improved `SKILL.md` | +| **Benchmark** | Compare performance metrics | Skill + baseline | Performance report | + +## When to Use + +- Creating a new skill for `.ai-rules/skills/` +- Testing whether an existing skill produces correct behavior +- Optimizing a skill that underperforms on edge cases +- Comparing skill versions to select the best one +- Measuring skill quality before shipping + +## When NOT to Use + +- Writing one-off instructions (not reusable = not a skill) +- Creating rules (use `rule-authoring` skill) +- Designing agents (use `agent-design` skill) +- Process is too simple to warrant a workflow (< 3 steps) + +--- + +## Create Mode + +**Trigger:** `skill-creator create ` + +### Phase 1: Intent — Define What the Skill Does + +Answer before writing anything: + +``` +1. What SPECIFIC problem does this skill solve? + Bad: "helps with testing" + Good: "enforces Red-Green-Refactor TDD cycle with mandatory verification" + +2. What behavior change should loading this skill cause? + Without: "AI writes implementation first, tests after" + With: "AI writes failing test first, verifies failure, then implements" + +3. Who consumes this skill? + Which AI tools? (Claude Code, Cursor, Codex, Q, Kiro) + What user skill level? + +4. What is the boundary? + What it handles vs. what it delegates to other skills + Name 2-3 skills it does NOT overlap with +``` + +### Phase 2: Interview — Gather Domain Knowledge + +Collect the expertise the skill will encode: + +``` +For each major workflow step: + 1. What is the step? + 2. What is the expected input/output? + 3. What are the most common mistakes? + 4. How do you verify correctness? + 5. What red flags should halt progress? +``` + +**Sources to consult:** +- Existing codebase patterns (search for conventions) +- Project documentation and ADRs +- Domain experts (ask the user) +- Related skills (check for reusable patterns) + +### Phase 3: Write — Author the SKILL.md + +**Required structure:** + +```markdown +--- +name: skill-name +description: "Use when... (max 500 chars)" +[optional frontmatter fields] +--- + +# Skill Title + +## Overview +2-3 sentences. Core principle. Iron Law. + +## When to Use +Bullet list of trigger scenarios. + +## When NOT to Use (if applicable) + +## Process / Phases +The actual workflow steps. + +## Verification Checklist + +## Red Flags — STOP +Table of rationalizations vs. reality. +``` + +**Writing rules:** + +| Rule | Why | +|------|-----| +| Imperative mood ("Write the test") | Direct instructions produce consistent behavior | +| Concrete examples over abstractions | AI tools follow examples more reliably than rules | +| Good/Bad comparisons for ambiguous steps | Eliminates interpretation variance across tools | +| One responsibility per phase | Multi-purpose phases get half-completed | +| Max 500 lines total | Longer skills get truncated or ignored | + +### Phase 4: Scaffold — Create Supporting Files + +```bash +mkdir -p packages/rules/.ai-rules/skills/ +# Write SKILL.md (from Phase 3) +# Optional supporting references: +# /reference-guide.md +# /examples.md +``` + +**Frontmatter field reference:** + +| Field | Required | Description | +|-------|----------|-------------| +| `name` | Yes | `^[a-z0-9-]+$`, matches directory name | +| `description` | Yes | 1-500 chars, start with "Use when..." | +| `disable-model-invocation` | No | `true` if skill handles its own execution flow | +| `argument-hint` | No | Usage hint shown in skill listings | +| `allowed-tools` | No | Restrict available tools during execution | +| `context` | No | `fork` to run in isolated context | +| `agent` | No | Agent to activate with the skill | + +### Phase 5: Test — Verify the Skill Works + +``` +- [ ] Frontmatter validates (name matches directory, description <= 500 chars) +- [ ] Skill loads without error in target tool (list_skills / get_skill) +- [ ] Following the skill produces different behavior than without it +- [ ] Every phase has a verifiable output +- [ ] Red flags table covers top 3 rationalizations for skipping +- [ ] No overlap with existing skills (check skills/README.md) +- [ ] Multi-tool compatible (no tool-specific syntax in core workflow) +``` + +--- + +## Eval Mode + +**Trigger:** `skill-creator eval ` + +Measure whether a skill produces the intended behavior change. + +### Phase 1: Define — Write Test Scenarios + +Create scenarios exercising the skill's key behaviors: + +``` +Scenario: [descriptive name] + Given: [initial state / context] + When: [skill is applied with this input] + Then: [expected behavior / output] + Anti-pattern: [what happens WITHOUT the skill] +``` + +**Minimum scenarios:** +- 1 happy path (standard use case) +- 1 edge case (unusual but valid input) +- 1 adversarial case (input that tempts skipping the skill) + +### Phase 2: Spawn — Execute Test Scenarios + +Run each scenario against the skill: + +``` +For each scenario: + 1. Load the skill content + 2. Present the scenario input + 3. Capture the AI's response + 4. Save response for grading +``` + +**Execution options:** +- **Manual:** Paste skill + scenario, capture response +- **Automated:** Use subagent with skill loaded, capture output +- **Parallel:** Run via dispatching-parallel-agents skill + +### Phase 3: Assert — Check Expected Behavior + +Grade each response: + +``` + PASS: Response follows skill workflow + PARTIAL: Some steps followed, others skipped + FAIL: Skill ignored or wrong behavior produced +``` + +Note which specific steps were followed/skipped. + +### Phase 4: Grade — Assign Severity + +| Severity | Definition | Action | +|----------|-----------|--------| +| **Critical** | Skill completely ignored | Must fix before shipping | +| **High** | Key phase skipped or verification missing | Must fix before shipping | +| **Medium** | Minor step deviation, output still usable | Fix in next iteration | +| **Low** | Style/formatting difference, behavior correct | Optional fix | + +### Phase 5: Aggregate — Summarize Results + +``` +Skill: [name] +Scenarios: [total] | Pass: [n] | Partial: [n] | Fail: [n] + +Critical: [count] High: [count] Medium: [count] Low: [count] + +Verdict: + SHIP = Critical=0 AND High=0 + ITERATE = Critical=0, High>0 + REWRITE = Critical>0 +``` + +### Phase 6: View — Present Findings + +```markdown +## Eval Report: [skill-name] + +### Summary +[Aggregate from Phase 5] + +### Scenario Results +| Scenario | Result | Issues | +|----------|--------|--------| +| ... | PASS/PARTIAL/FAIL | ... | + +### Recommendations +1. [Fix for highest-severity issue] +2. [Next fix] +``` + +--- + +## Improve Mode + +**Trigger:** `skill-creator improve ` + +Refine an existing skill based on eval data or observed behavior gaps. + +### Phase 1: Read — Understand Current State + +``` +1. Read the current SKILL.md +2. Read eval results (if available) +3. Identify the gap: + - Which phases are being skipped? + - Which instructions are ambiguous? + - Where does behavior diverge across AI tools? +``` + +### Phase 2: Generalize — Find Patterns in Failures + +``` +Look for systemic issues: +- Same step skipped across scenarios → Step is unclear or seems optional +- Different behavior per AI tool → Instructions use tool-specific syntax +- Partial compliance → Steps too large, need decomposition +- Complete skip → Trigger conditions don't match use case +``` + +### Phase 3: Apply — Make Targeted Changes + +| Issue Type | Fix Strategy | +|-----------|-------------| +| Step skipped | Add "MANDATORY" marker + red flag for skipping | +| Ambiguous instruction | Replace with concrete Good/Bad example | +| Tool-specific behavior | Remove tool-specific syntax, use universal patterns | +| Steps too large | Decompose into sub-steps with verification | +| Missing edge case | Add scenario to When to Use section | + +**Rules for changes:** +- One targeted change per identified issue +- Do not rewrite the entire skill (preserve what works) +- Add examples where instructions were misinterpreted +- Strengthen red flags for commonly skipped steps + +### Phase 4: Re-run — Eval the Improved Version + +Run the same eval scenarios against the modified skill: + +``` +Compare original vs. improved: + - Did the targeted fix resolve the issue? + - Did the fix introduce new issues? + - Is overall pass rate higher? +``` + +### Phase 5: Compare — Side-by-Side Analysis + +```markdown +| Metric | Before | After | Delta | +|--------|--------|-------|-------| +| Pass rate | X% | Y% | +/-Z% | +| Critical issues | N | M | +/-D | +| High issues | N | M | +/-D | +| Avg. steps followed | X/Y | X/Y | +/-D | +``` + +### Phase 6: Analyze — Document Learnings + +``` +1. Which fix strategy was most effective? +2. Which issues persisted despite changes? +3. Are there structural problems requiring a rewrite? +4. What patterns should inform future skill authoring? +``` + +--- + +## Benchmark Mode + +**Trigger:** `skill-creator benchmark ` + +Measure skill performance across dimensions and optimize weak spots. + +### Phase 1: Generate — Create Benchmark Suite + +Design a comprehensive test suite: + +``` +Dimensions: + 1. Compliance: Does the AI follow every step? + 2. Consistency: Same input → same behavior across runs? + 3. Portability: Works across Claude Code, Cursor, Codex, Q, Kiro? + 4. Robustness: Handles edge cases without breaking? + 5. Efficiency: Does the skill add unnecessary overhead? +``` + +- Minimum 2 cases per dimension +- Mix of simple and complex inputs +- At least 1 adversarial case per dimension + +### Phase 2: Review — Analyze Benchmark Results + +``` +For each dimension: + Score: [0-100] + Weakest case: [description] + Root cause: [why this dimension scored low] +``` + +| Score | Meaning | +|-------|---------| +| 90-100 | Excellent — production-ready | +| 70-89 | Good — minor improvements needed | +| 50-69 | Fair — significant gaps in at least one dimension | +| 0-49 | Poor — needs major rework | + +### Phase 3: Optimize — Target Weak Dimensions + +Focus on the lowest-scoring dimension first: + +``` +For each dimension scoring < 70: + 1. Identify the specific instruction causing the gap + 2. Apply the appropriate fix from Improve Mode Phase 3 + 3. Re-run that dimension's cases only + 4. Verify improvement without regression in other dimensions +``` + +### Phase 4: Apply — Finalize and Document + +``` +1. Update SKILL.md with optimized content +2. Update skills/README.md if category or description changed +3. Record benchmark baseline: + + Benchmark: [skill-name] @ [date] + Compliance: [score] + Consistency: [score] + Portability: [score] + Robustness: [score] + Efficiency: [score] + Overall: [weighted average] +``` + +--- + +## Additional Resources + +### Related Skills + +| Skill | Relationship | +|-------|-------------| +| `rule-authoring` | Rules constrain behavior; skills define workflows. Complementary. | +| `agent-design` | Agents are personas; skills are processes. Non-overlapping. | +| `prompt-engineering` | Prompt techniques apply within skill instructions. Supporting. | +| `writing-plans` | Plans are one-time; skills are reusable. Different lifecycle. | + +### Agent Support + +| Agent | When to Involve | +|-------|----------------| +| Code Quality Specialist | Reviewing skill structure and clarity | +| Test Engineer | Designing eval scenarios | +| Architecture Specialist | Skill decomposition and boundary design | + +### Multi-Tool Compatibility + +Skills must work across all supported AI tools: + +| Tool | How Skills Load | Key Consideration | +|------|----------------|-------------------| +| Claude Code | `get_skill` MCP tool | Full markdown + frontmatter parsed | +| Cursor | `@file` reference | Inline loading, no frontmatter processing | +| Codex / Copilot | `cat` file content | Plain text only, examples critical | +| Amazon Q | `.q/rules/` reference | Rule-style integration | +| Kiro | `.kiro/` reference | Spec-based integration | + +**Portability rules:** +- No tool-specific syntax in core workflow +- Examples in generic markdown, not tool-specific blocks +- Phases described as actions, not tool commands +- Test with at least 2 different tools before shipping + +## Red Flags — STOP + +| Thought | Reality | +|---------|---------| +| "This skill is obvious, no need to eval" | Obvious skills still get ignored. Eval proves they work. | +| "I'll test it manually later" | Manual tests are forgotten. Eval now. | +| "One scenario is enough" | One is anecdote. Three is pattern. | +| "It works in Claude Code, ship it" | Cursor/Codex may ignore the same instructions. Test portability. | +| "Small change, no need to re-eval" | Small changes cause cascading behavior shifts. Re-eval. | +| "The skill is too long but everything is needed" | Max 500 lines. Cut or decompose into reference files. | +| "I'll add examples later" | Skills without examples produce inconsistent behavior. Add now. |