Merged
33 changes: 33 additions & 0 deletions src/skills/daily-maintenance/skill.md
@@ -60,6 +60,38 @@ for (tool, prefix), n in groups.most_common(30):

**The predicate.** An actionable audit finding is **friction-pattern × not-already-covered**, not raw frequency. Frequency alone is noise — it just says Sam used a tool a lot. Friction × not-covered says Sam paid time-cost from the lack of documentation and no existing rule helps.
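
As a minimal sketch of the predicate above (not the scan's actual code): the helper `actionable_findings`, the `has_friction` and `is_covered` callables, and the sample group keys are all hypothetical, chosen only to show that frequency ranks the findings while friction and coverage decide which ones survive.

```python
from collections import Counter


def actionable_findings(groups, has_friction, is_covered, top_n=30):
    """Keep only groups that show friction AND are not already covered.

    `groups` is a Counter keyed by (tool, prefix), as in the scan above.
    `has_friction` and `is_covered` are hypothetical callables standing in
    for whatever judgement Sam applies to each group.
    """
    findings = []
    for key, count in groups.most_common(top_n):
        if has_friction(key) and not is_covered(key):
            findings.append((key, count))
    return findings


# Illustrative data only: the most frequent group is not a finding.
groups = Counter({
    ("read_file", "src/"): 40,
    ("bash", "gcloud storage"): 12,
    ("bash", "git rebase"): 5,
})
friction = {("bash", "gcloud storage"), ("bash", "git rebase")}
covered = {("bash", "git rebase")}

result = actionable_findings(groups, friction.__contains__, covered.__contains__)
print(result)
```

In this made-up data the 40-call `read_file` group drops out because it carries no friction, and the `git rebase` group drops out because a rule already covers it.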

### Skill usage scan

A separate aggregation — independent of the friction-pattern scan above. Goal: see which skills Sam actually *discovered* yesterday (i.e., `read_file` on `src/skills/<name>/skill.md`). A skill never read across multiple sessions where its `when_to_use` plausibly applied is either undiscovered (catalog problem) or unneeded (delete-it candidate).

```bash
gcloud storage cat gs://${GCS_DATA_BUCKET:-dembrane-sameer-cli-sam-data}/tool_calls/$(date -u -d yesterday +%F).jsonl \
| python3 -c "
import sys, json, re, collections
pat = re.compile(r'src/skills/([a-z0-9-]+)/skill\.md')
reads = collections.Counter()
for line in sys.stdin:
    line = line.strip()
    if not line: continue
    try: d = json.loads(line)
    except: continue
    if d.get('tool') != 'read_file': continue
    fp = (d.get('args') or {}).get('file_path') or ''
    m = pat.search(fp)
    if m: reads[m.group(1)] += 1
for name, n in reads.most_common(30):
    print(f'{n:3d} {name}')
"
```

Compare against the list of all skills (`ls src/skills/`). Skills with zero reads yesterday are not automatically a problem — many skills only fire on specific triggers. But over a multi-day window, a skill with consistent zeros while its `when_to_use` triggers obviously fired in the audit log is a real signal:

- Catalog issue → the skill's `name` / `description` / `when_to_use` isn't surfacing the trigger Sam should pattern-match on. Open a Tier 1 PR refining the frontmatter.
- Skill obsolete → the pattern it codifies has been absorbed into a capability, or the workflow no longer exists. Open a Tier 1 PR to remove the skill (skills accumulate; deleting them is fine).
- Sam genuinely missed it → the skill is right but Sam didn't catch the trigger. Less common; harder to fix with prose. Note in journal, watch over a week before acting.
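
The multi-day comparison described above could be sketched roughly as follows. It assumes each day's per-skill counts have already been collected into a dict by the scan; the helper name `zero_read_skills` and the sample counts are hypothetical. The output is only a list of candidates: deciding whether a skill's `when_to_use` trigger actually fired still needs the audit log.

```python
from collections import Counter


def zero_read_skills(all_skills, daily_reads):
    """Return skills with no reads on any day in the window, sorted by name.

    `all_skills` is the listing of src/skills/; `daily_reads` is one
    {skill_name: read_count} dict per day (hypothetical input shape).
    """
    total = Counter()
    for reads in daily_reads:
        total.update(reads)
    return sorted(name for name in all_skills if total[name] == 0)


# Illustrative three-day window; the counts are made up.
all_skills = ["daily-maintenance", "exa-search", "skill-creator"]
window = [
    {"daily-maintenance": 2},
    {"daily-maintenance": 1, "skill-creator": 1},
    {},
]

candidates = zero_read_skills(all_skills, window)
print(candidates)
```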

§4's daily synthesis appends the per-skill count under `### Skill usage` so future-Sam (and the operator) can see the trend.

## 2. Propose changes if there's substance

If reflection surfaces something concrete to codify, first decide where it belongs using `src/capabilities/self-maintenance.md` ("Where does a change belong?"), then open self-PRs via the same file's flow. **No artificial cap on how many** — open one per distinct concept.
@@ -138,6 +170,7 @@ Then append a `## Daily synthesis` section to **yesterday's** journal entry (clo
- proposed PRs (from §2 and §3 combined) — one line each. Lead with the behavior change Sam is proposing so future-Sam can grep on intent. The PR number/title is the reference at the end of the line, not the headline.
- open threads to pick up today — one line each, named by what Sam should do today, not what happened yesterday.
- if the active-blocker count changed yesterday, one line: `blockers: <N> active (see Linear)`. Don't re-list blocker details — they're in Linear.
- a sub-section **`### Skill usage`** — the per-skill `read_file` counts from §1's skill-usage scan, plus any notable zero-usage skills with their flag (catalog issue / obsolete / Sam missed it). One short paragraph; if every skill got expected use, just `(typical usage shape)`.

If none of the items above has content, skip the section entirely. Don't write a "nothing to say" placeholder.
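
One possible way to render the `### Skill usage` sub-section described in the list above, as a sketch under stated assumptions: the function name `render_skill_usage`, its input shapes, and the exact wording of the paragraph are hypothetical. Only the heading, the three flag labels, and the `(typical usage shape)` fallback come from the text.

```python
def render_skill_usage(reads, flagged):
    """Render the `### Skill usage` sub-section as one short paragraph.

    `reads` maps skill name -> read_file count from the scan.
    `flagged` maps zero-usage skill name -> flag, where the flag is one of
    "catalog issue", "obsolete", or "Sam missed it".
    """
    if not flagged:
        # Every skill got expected use: collapse to the short form.
        body = "(typical usage shape)"
    else:
        # Highest count first, ties broken alphabetically.
        ordered = sorted(reads.items(), key=lambda kv: (-kv[1], kv[0]))
        counts = ", ".join(f"{name} {n}" for name, n in ordered) or "none"
        flags = ", ".join(f"{name} ({flag})" for name, flag in sorted(flagged.items()))
        body = f"Reads: {counts}. Zero usage: {flags}."
    return f"### Skill usage\n\n{body}"


# Illustrative inputs; the counts and the flag are made up.
section = render_skill_usage(
    {"daily-maintenance": 2, "skill-creator": 1},
    {"exa-search": "catalog issue"},
)
print(section)
```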

50 changes: 50 additions & 0 deletions tests/eval/test_structural.py
@@ -156,6 +156,11 @@ def test_skill_creator_visible_in_catalog():
    skill catalog. If Sam can't find the skill, Sam can't apply its
    sample-before-codifying rule — and the next refactored predicate
    hallucinates the way PR #46 did.

    Kept as a named test (not just covered by the parametrized
    `test_every_skill_visible_in_catalog` below) because skill-creator
    is load-bearing for *other* skills — losing it from the catalog
    would silently degrade the whole skill-authoring loop.
    """
    sp = assemble_system_prompt()
    assert "skill-creator" in sp, (
@@ -164,6 +169,51 @@ def test_skill_creator_visible_in_catalog():
    )


def _all_skill_dirs() -> list:
    """Enumerate every src/skills/<name>/ with a skill.md. Used by the
    parametrized catalog-presence test below. Computed from `__file__`
    (not from SAM_SRC) because pytest parametrization runs at
    collection time, before the autouse SAM_SRC fixture fires."""
    from pathlib import Path
    repo_root = Path(__file__).resolve().parent.parent.parent
    skills_dir = repo_root / "src" / "skills"
    if not skills_dir.exists():
        return []
    return sorted(
        p for p in skills_dir.iterdir()
        if p.is_dir()
        and not p.name.startswith("_")
        and (p / "skill.md").exists()
    )


@pytest.mark.parametrize("skill_dir", _all_skill_dirs(), ids=lambda p: p.name)
def test_every_skill_visible_in_catalog(skill_dir):
    """Every skill in `src/skills/<name>/` must appear in the assembled
    system prompt's catalog. A skill that's been added to the file
    system but doesn't surface in the prompt is functionally invisible
    — Sam can't decide to apply a skill it doesn't know exists.

    Catalog discoverability is the load-bearing invariant: skills are
    catalog-only by design (bodies are read on demand) so the catalog
    entry IS the discovery surface. Parametrized so adding a new skill
    automatically gets defended — no manual test addition needed.

    The trigger for adding this test: PR #70 shipped a new `exa-search`
    skill that wasn't covered by any catalog-presence test. Without
    this parametrized check, the next new skill could ship invisible
    the same way.
    """
    sp = assemble_system_prompt()
    name = skill_dir.name
    assert name in sp, (
        f"skill {name!r} exists at {skill_dir} but doesn't appear in "
        "the assembled system prompt's catalog — Sam won't discover it. "
        "Check the skill's frontmatter (name/description/when_to_use) "
        "and the catalog assembly in `src/runtime/prompts.py`."
    )


def test_skill_creator_still_carries_sampling_rule():
    """The body of `src/skills/skill-creator/skill.md` must still contain
    the sample-before-codifying rule. The catalog only carries the