feat(skills): skill-usage observability + parametrized catalog test#73

Merged
spashii merged 2 commits into main from sam/skill-usage-eval
May 24, 2026

Conversation

Member

@spashii spashii commented May 24, 2026

What this enables

Two pieces of skill-usage evaluation, addressing the SAM-43 questions on catalog verification + audit-log-based usage measurement:

1. Skill usage scan in daily-maintenance §1

New subsection that aggregates yesterday's audit log for read_file calls on src/skills/<name>/skill.md paths → per-skill discovery counts. Counts land in §4 journal synthesis under a new ### Skill usage sub-section so the trend is queryable over time.

Decision rule when a skill consistently shows zero reads: treat it as a catalog issue (refine frontmatter), an obsolete skill (delete), or a genuinely missed skill (note + watch). Sam-as-LLM makes the call.
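A minimal sketch of the aggregation step, for illustration only. The audit-log format is not shown in this PR, so the sketch assumes one JSON object per line with hypothetical `tool` and `path` fields; `count_skill_reads` and `known_skills` are illustrative names, not part of the codebase.

```python
import json
import re
from collections import Counter

# Matches src/skills/<name>/skill.md (relative or absolute) and captures <name>.
SKILL_PATH = re.compile(r"(?:^|/)src/skills/([^/]+)/skill\.md$")


def count_skill_reads(audit_lines, known_skills=()):
    """Count read_file calls per skill from JSONL audit lines.

    Assumption: each line is a JSON object with "tool" and "path" keys.
    Skills in known_skills start at 0 so unread skills still show up.
    """
    counts = Counter({name: 0 for name in known_skills})
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        if not isinstance(entry, dict) or entry.get("tool") != "read_file":
            continue
        match = SKILL_PATH.search(str(entry.get("path", "")))
        if match:
            counts[match.group(1)] += 1
    return dict(counts)
```

Seeding the counter with every known skill keeps zero-read skills visible in the output, which is what the decision rule above depends on.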

2. Parametrized catalog-presence test

test_every_skill_visible_in_catalog in tests/eval/test_structural.py parametrizes over every src/skills/<name>/ directory and asserts that each skill name appears in the assembled system prompt. A newly added skill is covered automatically, with no manual test addition needed.

Trigger: PR #70 shipped exa-search without any catalog-presence test. Under the prior single-skill check, a new skill could ship without appearing in the catalog. Parametrizing closes that gap for every future skill.
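The shape of such a test can be sketched as follows. Only the test name and the src/skills/<name>/ layout come from this PR; `discover_skill_names` and `build_system_prompt` are hypothetical stand-ins, and the real test would call the project's actual prompt assembly instead of the placeholder below.

```python
from pathlib import Path

import pytest

SKILLS_ROOT = Path("src/skills")


def discover_skill_names(root=SKILLS_ROOT):
    """Return the sorted names of all skill directories under root."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def build_system_prompt():
    """Hypothetical placeholder for the real prompt assembly.

    As written it lists whatever discover_skill_names() finds, so the
    assertion below is trivially true; swap in the real assembler.
    """
    lines = ["Available skills:"]
    lines += [f"- {name}" for name in discover_skill_names()]
    return "\n".join(lines)


# Cases are generated at collection time, one per skill directory,
# so a new directory adds a new test case with no code change.
@pytest.mark.parametrize("skill_name", discover_skill_names())
def test_every_skill_visible_in_catalog(skill_name):
    assert skill_name in build_system_prompt()
```

Because the parameter list is computed from the filesystem, the case count tracks the number of skill directories.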

Consequences

  • Operator sees per-skill usage trends daily without doing any querying.
  • Any future skill that doesn't appear in the catalog fails CI immediately (parametrize includes its case automatically).
  • The deeper "did Sam apply the right skill" eval (Opus-as-judge) stays as a follow-up — it's a meta-eval rubric design problem, not a build task.

What this doesn't cover

  • Doesn't measure whether Sam applied the skill correctly once it was read. Just discovery.
  • Doesn't fire alerts when a skill is consistently zero-used; that's the operator's daily-synthesis call.

How to verify

  • pytest tests/eval/test_structural.py: 16 passed (was 8). The 9 parametrized cases, one per skill, replace the prior single-skill check.
  • After merge + a real cron fire: §4 journal synthesis includes a ### Skill usage block with per-skill counts.
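For orientation, the journal block could be rendered along these lines. The exact layout of the ### Skill usage block is not specified in this PR, so the bullet format and the `render_skill_usage` helper are assumptions made for illustration.

```python
def render_skill_usage(counts):
    """Render per-skill read counts as a markdown sub-section.

    Assumed layout: heading, blank line, then one bullet per skill,
    most-read first, with ties broken alphabetically.
    """
    lines = ["### Skill usage", ""]
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, reads in ordered:
        lines.append(f"- {name}: {reads}")
    return "\n".join(lines)
```

Keeping zero-count skills in the rendered list makes a consistently unread skill easy to spot when reviewing the journal over several days.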

Bonus

.gitignore adds mining/ so blog-scratch files from session-jsonl mining don't keep leaking into PRs (the entry on PR #71 hasn't merged yet).

Tier

Tier 1 (skill prose + tests). No runtime changes.

Closes the catalog-verification + skill-usage-observability part of SAM-43.

…ized catalog test

Two pieces of skill-usage evaluation, both small:

1. Daily-maintenance §1 'Skill usage scan' subsection. Aggregates
   yesterday's audit log for read_file calls on src/skills/<name>/skill.md
   paths → per-skill discovery counts. Operator decision rule:
   consistent zero-reads with obvious triggers = catalog issue
   (refine frontmatter), obsolete skill (delete), or genuinely missed
   (note + watch). Counts land in §4 journal synthesis under a new
   '### Skill usage' sub-section so the trend is queryable.

2. Parametrized test: every skill in src/skills/<name>/ must appear in
   the assembled system prompt's catalog. Added at the eval-harness
   structural layer. Trigger: PR #70 shipped exa-search without any
   catalog-presence test. Under the prior single-skill check
   (test_skill_creator_visible_in_catalog), a new skill could ship
   invisible. Parametrizing fixes it for every future skill automatically.

Plus .gitignore entry for mining/ so the blog scratch from session-jsonl
mining doesn't keep leaking into PRs (the entry on the ask_operator
branch in PR #71 hasn't merged yet).

Together (1) + (2) cover the operator's catalog-presence and
discovery-observability questions on SAM-43. The deeper Opus-as-judge
('did Sam apply the right skill') stays as a separate follow-up.

Tests: 9 skills defended by the parametrize. 16 eval tests pass
(was 8). Full suite: 143 passed locally.
@spashii spashii enabled auto-merge May 24, 2026 17:22

linear Bot commented May 24, 2026

SAM-43

@spashii spashii disabled auto-merge May 24, 2026 17:27
@spashii spashii merged commit 4c69886 into main May 24, 2026
2 checks passed
@spashii spashii deleted the sam/skill-usage-eval branch May 24, 2026 17:29