Parent: #738
Depends on: Sub-issue C (after references/schemas.md is complete)
## Purpose
Implement the Python/Bash scripts used in skill-creator's Eval/Benchmark modes.
## File Locations

```
packages/rules/.ai-rules/skills/skill-creator/scripts/
├── aggregate_benchmark.py
├── run_loop.py
└── init_skill.sh
```
## 1. aggregate_benchmark.py — Benchmark Result Aggregation

Role: Aggregate grading/timing results from an iteration directory into benchmark.json + benchmark.md

CLI Interface:

```
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```

Behavior:
- Scan all `eval-*/` subdirectories in `iteration-N/`
- Read `with_skill/grading.json` and `without_skill/grading.json` for each eval
- Read `with_skill/timing.json` and `without_skill/timing.json` for each eval
- Compute statistics:
  - pass_rate: assertion pass rate (mean ± stddev)
  - tokens: tokens used (mean ± stddev)
  - duration_seconds: execution time (mean ± stddev)
- Output files:
  - `benchmark.json` — conforming to the benchmark.json schema in `references/schemas.md`
  - `benchmark.md` — human-readable markdown summary
Dependencies: Python 3.8+ standard library only (no external packages)
Error Handling:
- Eval without `grading.json` → print a warning, skip that eval
- Eval without `timing.json` → excluded from token/time statistics
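The aggregation and missing-file handling above can be sketched with the standard library alone. A `pass_rate` top-level field in grading.json is an assumption for illustration; the actual schema lives in `references/schemas.md`:

```python
import json
import statistics
from pathlib import Path

def summarize(values):
    """Return mean and stddev for a list of numbers (stddev is 0 for one sample)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"mean": mean, "stddev": stdev}

def aggregate(iteration_dir):
    """Collect pass rates from each eval-*/ subdirectory, per condition.

    Evals missing grading.json get a warning and are skipped, mirroring
    the error-handling rules above.
    """
    results = {"with_skill": [], "without_skill": []}
    for eval_dir in sorted(Path(iteration_dir).glob("eval-*")):
        for condition in results:
            grading = eval_dir / condition / "grading.json"
            if not grading.exists():
                print(f"warning: {grading} missing, skipping")  # warn + skip
                continue
            data = json.loads(grading.read_text())
            results[condition].append(data["pass_rate"])
    # Conditions with no readable gradings are left out of the summary.
    return {cond: summarize(vals) for cond, vals in results.items() if vals}
```

The same pattern extends to `timing.json` for the token and duration statistics.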
## 2. run_loop.py — Description Optimization Loop

Role: Automatically optimize the skill description in Benchmark mode

CLI Interface:

```
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id> \
  --max-iterations 5 \
  --verbose
```

Behavior:
- Load `trigger_eval.json` (20 should_trigger / should_not_trigger queries) and apply a 60/40 train/test split
- Each iteration:
  a. Measure the trigger rate with the current description (train set)
  b. Analyze results and generate 3 improved description candidates
  c. Measure the trigger rate for each candidate (test set)
  d. Select the highest-scoring candidate
- Output the final optimal description
Input Schema: trigger_eval.json format from references/schemas.md
Output: Optimized description string + per-iteration score log
Dependencies: Python 3.8+ standard library only
Note: Steps that require actual LLM calls guide the user to manual execution via CLI output (tool-independent)
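The deterministic 60/40 split step can be sketched as follows (the seeding and function name are assumptions; candidate generation and scoring are omitted since they involve LLM calls):

```python
import random

def split_queries(queries, train_fraction=0.6, seed=0):
    """Shuffle deterministically, then split at the train/test boundary.

    With the 20 queries from trigger_eval.json this yields 12 train / 8 test.
    """
    rng = random.Random(seed)  # fixed seed keeps iterations comparable
    shuffled = list(queries)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A fixed seed matters here: every iteration of the loop must score candidates against the same test set, or per-iteration scores are not comparable.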
## 3. init_skill.sh — New Skill Directory Scaffolding

Role: Create the new skill directory structure + a template SKILL.md in Create mode

CLI Interface:

```
./scripts/init_skill.sh <skill-name> [--path <output-directory>]
```

Behavior:
- Create the directory structure:

  ```
  <skill-name>/
  ├── SKILL.md
  ├── references/
  ├── examples/
  └── scripts/
  ```

- Generate the SKILL.md template (based on `assets/skill-template.md`):

  ```
  ---
  name: <skill-name>
  description: TODO - describe when to use this skill
  ---

  # <Skill Name>

  ## Overview
  TODO

  **Core principle:** TODO

  ## When to Use
  TODO

  ## When NOT to Use
  TODO
  ```
- Print creation results
Dependencies: bash, mkdir, cat (standard utilities)
Notes:
- Defaults to the current directory if `--path` is not specified
- Error + abort if the directory already exists (no overwriting)
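The scaffolding and overwrite guard can be sketched as a bash function (a sketch only: the real script would parse a positional argument and a `--path` flag, and write the full template from `assets/skill-template.md`):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch of init_skill.sh's core logic.
init_skill() {
    local skill_name="$1"
    local out_dir="${2:-.}"          # stand-in for --path; defaults to cwd
    local target="$out_dir/$skill_name"

    # Refuse to overwrite an existing directory (error + abort).
    if [ -e "$target" ]; then
        echo "error: $target already exists" >&2
        return 1
    fi

    mkdir -p "$target/references" "$target/examples" "$target/scripts"

    # Abbreviated frontmatter only; the real template has more sections.
    cat > "$target/SKILL.md" <<EOF
---
name: $skill_name
description: TODO - describe when to use this skill
---
EOF

    echo "created $target"
}
```

Using plain `mkdir`/`cat` keeps the script within the standard-utilities dependency constraint above.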
## Acceptance Criteria

- [ ] `scripts/aggregate_benchmark.py` created
- [ ] `benchmark.json` output conforms to the `references/schemas.md` schema
- [ ] `benchmark.md` markdown summary generated
- [ ] Python 3.8+ standard library only
- [ ] Graceful handling of missing files (warning + skip)
- [ ] `scripts/run_loop.py` created
- [ ] `trigger_eval.json` schema-conforming input
- [ ] 60/40 train/test split logic
- [ ] Per-iteration score log output
- [ ] `scripts/init_skill.sh` created
- [ ] Template reflects codingbuddy patterns (Core principle, When to Use, etc.)
- [ ] Existing directory overwrite prevention
- [ ] `--path` option support
- [ ] All 3 scripts support a `--help` option
- [ ] No external package dependencies
## References

- JSON schemas: Sub-issue C's `references/schemas.md`
- Template: Sub-issue E's `assets/skill-template.md`