feat(skills): skill-usage observability + parametrized catalog test by spashii · Pull Request #73 · Dembrane/sam

spashii · 2026-05-24T17:22:21Z

What this enables

Two pieces of skill-usage evaluation, addressing the SAM-43 questions on catalog verification + audit-log-based usage measurement:

1. Skill usage scan in daily-maintenance §1

A new subsection aggregates yesterday's audit log for read_file calls on src/skills/<name>/skill.md paths into per-skill discovery counts. The counts land in §4 journal synthesis under a new ### Skill usage sub-section so the trend is queryable over time.

Decision rule when a skill shows consistently zero reads: treat it as a catalog issue (refine frontmatter), an obsolete skill (delete), or a genuinely missed skill (note + watch). Sam-as-LLM makes the call.
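The scan itself ships as skill prose rather than code, so the following is only an illustrative sketch of the aggregation it describes. It assumes the audit log is JSONL with hypothetical `tool` and `path` keys; the real schema may differ, and `count_skill_reads` is a made-up name.

```python
import json
import re
from collections import Counter

# Path shape taken from the PR text: src/skills/<name>/skill.md
SKILL_PATH = re.compile(r"(?:^|/)src/skills/([^/]+)/skill\.md$")


def count_skill_reads(audit_lines, known_skills=()):
    """Count read_file calls per skill from JSONL audit lines.

    Assumption: each line is a JSON object with "tool" and "path" keys.
    These key names are hypothetical, not the project's confirmed schema.
    """
    # Seed every known skill with 0 so zero-read skills stay visible.
    counts = Counter({name: 0 for name in known_skills})
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        if not isinstance(entry, dict) or entry.get("tool") != "read_file":
            continue
        match = SKILL_PATH.search(str(entry.get("path", "")))
        if match:
            counts[match.group(1)] += 1
    return counts
```

Seeding `known_skills` matters for the decision rule above: a skill with no reads still appears in the result with a count of 0 instead of silently dropping out.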

2. Parametrized catalog-presence test

test_every_skill_visible_in_catalog in tests/eval/test_structural.py parametrizes over every src/skills/<name>/ directory and asserts that the skill name appears in the assembled system prompt. A newly added skill is covered automatically; no manual test addition is needed.

Trigger: PR #70 shipped exa-search without any catalog-presence test. Under the prior single-skill check (test_skill_creator_visible_in_catalog), a new skill could ship invisible. The parametrized test closes that gap for every future skill.
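A minimal sketch of the parametrization pattern follows; it is not the code in tests/eval/test_structural.py. `discover_skills` and `assemble_system_prompt` are hypothetical names, and the prompt builder here is a stand-in to be replaced with the project's real one.

```python
from pathlib import Path

import pytest

SKILLS_DIR = Path("src/skills")  # directory layout taken from the PR text


def discover_skills(skills_dir: Path = SKILLS_DIR) -> list[str]:
    """Return the name of every skill directory, sorted for stable test IDs."""
    if not skills_dir.is_dir():
        return []
    return sorted(p.name for p in skills_dir.iterdir() if p.is_dir())


def assemble_system_prompt() -> str:
    """Hypothetical stand-in: swap in the project's real prompt builder."""
    catalog = "\n".join(f"- {name}" for name in discover_skills())
    return f"Available skills:\n{catalog}"


@pytest.mark.parametrize("skill_name", discover_skills())
def test_every_skill_visible_in_catalog(skill_name: str) -> None:
    # One case per directory, so a new skill is covered without editing this file.
    assert skill_name in assemble_system_prompt()
```

Because the parameter list is computed from the directory at collection time, the number of cases tracks the number of skill directories. A real version would probably also filter out non-skill directories such as `__pycache__`.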

Consequences

Operator sees per-skill usage trends daily without doing any querying.
Any future skill that doesn't appear in the catalog fails CI immediately (parametrize includes its case automatically).
The deeper "did Sam apply the right skill" eval (Opus-as-judge) stays as a follow-up — it's a meta-eval rubric design problem, not a build task.

What this doesn't cover

Doesn't measure whether Sam applied the skill correctly once it was read. Just discovery.
Doesn't fire alerts when a skill is consistently zero-used; that's the operator's daily-synthesis call.

How to verify

pytest tests/eval/test_structural.py — 16 passed (was 8); 9 new parametrized cases, one per skill.
After merge + a real cron fire: §4 journal synthesis includes a ### Skill usage block with per-skill counts.

Bonus

.gitignore adds mining/ so blog-scratch files from session-jsonl mining don't keep leaking into PRs (the entry on PR #71 hasn't merged yet).

Tier

Tier 1 (skill prose + tests). No runtime changes.

Closes the catalog-verification + skill-usage-observability part of SAM-43.

…ized catalog test

Two pieces of skill-usage evaluation, both small:

1. Daily-maintenance §1 'Skill usage scan' subsection. Aggregates yesterday's audit log for read_file calls on src/skills/<name>/skill.md paths → per-skill discovery counts. Operator decision rule: consistent zero-reads with obvious triggers = catalog issue (refine frontmatter), obsolete skill (delete), or genuinely missed (note + watch). Counts land in §4 journal synthesis under a new '### Skill usage' sub-section so the trend is queryable.

2. Parametrized test: every skill in src/skills/<name>/ must appear in the assembled system prompt's catalog. Added at the eval-harness structural layer. Trigger: PR #70 shipped exa-search without any catalog-presence test; under the prior single-skill check (test_skill_creator_visible_in_catalog), a new skill could ship invisible. Parametrize fixes it for every future skill automatically.

Plus a .gitignore entry for mining/ so the blog scratch from session-jsonl mining doesn't keep leaking into PRs (the entry on the ask_operator branch in PR #71 hasn't merged yet).

Together (1) + (2) cover the operator's catalog-presence and discovery-observability questions on SAM-43. The deeper Opus-as-judge eval ('did Sam apply the right skill') stays as a separate follow-up.

Tests: 9 skills defended by the parametrize. 16 eval tests pass (was 8). Full suite: 143 passed locally.

linear · 2026-05-24T17:22:24Z

SAM-43

spashii enabled auto-merge May 24, 2026 17:22

Merge branch 'main' into sam/skill-usage-eval

6b2546f

spashii disabled auto-merge May 24, 2026 17:27

spashii merged commit 4c69886 into main May 24, 2026
2 checks passed

spashii deleted the sam/skill-usage-eval branch May 24, 2026 17:29

spashii merged 2 commits into main from sam/skill-usage-eval
