feat: ask_operator tool — async pause-and-resume for unknown unknowns (v2) by spashii · Pull Request #71 · Dembrane/sam

spashii · 2026-05-24T16:46:19Z

What this enables

Sam can break out of a session mid-work for genuine unknown unknowns, post a question to Slack, exit cleanly. Operator replies whenever — daemon detects the continuation, Sam picks up via the audit log. Slack reactions are the source of truth for paused state; no in-memory daemon map, no parallel state store, no special-case boot recovery.

How the lifecycle works

Sam pauses → ask_operator tool posts question + adds :speech_balloon: atomically
            → session exits clean, :white_check_mark: on inbound
            → ledger entry has ask_operator_called: true

  ... operator replies whenever (works across daemon restarts) ...

Operator reply → daemon fetches thread_history (already happens)
              → finds the bot message with :speech_balloon: from this bot
              → looks up session_id from sessions.jsonl
                 (filter thread_ts + ask_operator_called + ts_start strictly before post)
              → injects paused_session_id into IncomingMessage
              → calls reactions.remove on the question post (marks resolved)
              → continuation prompt fires: read audit log, apply answer, continue

Consequences

New ADK tool ask_operator(question: str) on the main agent only — workers/pro_executor/mentor cannot escalate to operator directly.
Tool atomically posts question + adds :speech_balloon: (race-mitigation per your design call).
SessionLedgerEntry.ask_operator_called: bool is the new ledger field the daemon's lookup correlates on.
_find_active_paused_question + _lookup_paused_session_id replace the deleted _paused_threads map. Reactions on Slack + ledger on GCS both persist across daemon restarts.
Daily-maintenance §6 handles abandoned questions (>24h) — Sam enumerates 💬 via reactions.list, decides per-question whether to remind / mark abandoned / escalate. Daemon mechanical, Sam judgment.

Composition with existing gates

silent-exit gate (PR feat(daemon): enforce Slack reply before clean exit (silent-exit gate) #68): ask_operator counts as a post → closes the loop for that turn.
ack-first rule (PR feat(capability/slack): ACK-first rule on tool-using sessions (SAM-32) #44): unaffected.
retry / silent-exit retry: takes precedence over continuation (failure narration wins over resumption — defends against a failed pause spawning a continuation loop).
recovered=True (boot recovery): paused_session_id takes precedence (more specific signal).

Tier

Tier 3 (src/runtime/) + Tier 1 (src/capabilities/slack.md, src/skills/daily-maintenance/skill.md). Both layers: system enforces routing; prose explains the rule and Sam's review responsibility.

Test plan

pytest tests/ — 159 passed (24 new in test_ask_operator.py, no regressions)
_find_active_paused_question defended for: empty history, no bot messages, no reaction, reaction from another user, multiple paused (most-recent wins), missing reactions field
_lookup_paused_session_id defended for: missing ledger, matching session, future-skipping, most-recent-match
SessionLedgerEntry field defaulting + plumbing
Live: Sam mid-work calls ask_operator → 💬 appears → operator reply → continuation runs + clears reaction

Closes the async-question class of bug. Three new tickets queued for follow-up work (24h reminder, introspect tool, skill descriptions audit) tracked separately.

Operator design call (2026-05-24): Sam should be able to break out of a session mid-work when it hits a genuine unknown unknown, post a question to Slack, exit cleanly, and let the operator's reply trigger a continuation session that picks up via the audit log. Mechanism (v2 — explicit tool + daemon plumbing): 1. New ADK FunctionTool 'ask_operator(question: str)' registered only on main_agent (NOT workers / pro_executor / mentor — they must return to their caller, never escalate). Tool closure binds the originating Slack channel/thread_ts/event_ts so the LLM signature stays minimal. 2. Tool atomically posts the question + adds 💬 reaction on the question post (race-mitigation: by the time the LLM returns to the runner, the operator can already see Sam is paused). If reaction.add fails, the post still goes through — daemon's paused_threads map is the primary signal, the reaction is for operator visibility. 3. SessionResult.ask_operator_called surfaces the pause. Daemon checks it after a clean session exit: if True, records thread_ts -> session_id in self._paused_threads. When an inbound reply lands in a thread present in this map, daemon pops the entry and injects paused_session_id into IncomingMessage. 4. IncomingMessage.paused_session_id triggers _format_continuation_message instead of the default shape. The continuation prompt tells Sam to: read prior audit log slice (ground truth, not journal text), apply the operator's answer, continue from where the prior session stopped, close the loop with a final reply. 5. Continuation takes precedence over recovered=True (more specific); retry_context takes precedence over continuation (failure narration wins over resumption — defends against a failed pause spawning a silent-retry loop). Composition with existing gates: - silent-exit gate (PR #68): ask_operator counts as a post — closes the loop for that turn. No false-positive silent-exit retry. - ack-first rule (PR #44): ask_operator can also satisfy ack-first if it's the first substantive post in a session (uncommon but valid). Abandoned-question handling: per operator direction, uses the existing reactions-scan infrastructure (_recover_from_reactions extends to age out 💬 markers >24h with a contextual reminder + clear). Same reactions.list call, no new state store. Filed as follow-up. Coverage: 11 new tests in test_ask_operator.py covering classifier behavior, continuation-prompt dispatch, precedence rules, graceful tool failures (no channel, no token). Updated test_silent_exit.py to reflect the 6-tuple classifier return. 146 total tests pass. slack.md gets a 'Pausing for operator input mid-work' subsection explaining when to call ask_operator (genuine unknown unknowns only), the operator-visibility model (💬 on the question post), and the abandoned-question mechanism.

…mory map Operator pushback (2026-05-24): the in-memory _paused_threads map was the wrong substrate. Slack reactions are already the visible state; the map duplicates them, loses on restart, and violates the 'no parallel state stores' rule. Complete redesign: - REMOVE Daemon._paused_threads in-memory map - REMOVE post-session update of _paused_threads in _worker - REMOVE _paused_threads.pop in _handle_event - ADD _find_active_paused_question(thread_history, bot_user_id): module-level pure function that walks thread_history newest-first, returns the first bot message with 💬 from this bot. Slack's conversations.replies returns reactions inline per message so no extra API call needed for the scan. - ADD Daemon._lookup_paused_session_id(thread_ts, question_post_ts): reads sessions.jsonl, filters thread_ts + ask_operator_called=True + ts_start strictly before the post, picks most-recent match. - ADD SessionLedgerEntry.ask_operator_called: bool — the ledger field the lookup correlates on. Default False for backwards compat. - ADD reactions.remove call after detection (clears 💬 from the question post = marks resolved). Idempotent — failures are non-fatal, the only cost is the reaction lingering until daily-maintenance's abandoned-question scan ages it out. - ADD daily-maintenance §6 'Review abandoned ask_operator questions': Sam-as-LLM enumerates 💬 > 24h via reactions.list, decides per-question whether to remind / mark abandoned / escalate. Daemon mechanical, Sam judgment. - REORDER session.run: classify_tool_use moved before ledger write so ask_operator_called lands in the ledger entry that the lookup reads. - UPDATE slack.md prose: reactions-as-source-of-truth framing, daemon restart explicitly named as not-a-special-case, abandoned-question handling delegated to daily-maintenance §6. Daemon restart is no longer a special case: the 💬 lives on Slack, the ledger lives on GCS, both persist. Next inbound triggers detection naturally via thread_history reactions + ledger lookup. Tests: - 13 new tests in test_ask_operator.py covering _find_active_paused_question (empty/no-bot/no-reaction/own-vs-other-user/multiple/missing-field), _lookup_paused_session_id (missing-ledger/matching/future-skipping/ most-recent), and SessionLedgerEntry field plumbing. - 159 total tests pass (was 146 before this PR's tests were added).

CI failed on the bare try/except/continue at the json.loads step in _lookup_paused_session_id. Replaced with json.JSONDecodeError-specific catch + log.debug. Malformed lines are still skipped (they're noise, not signal), but the exception is now visible at debug level for post-mortem if needed.

…ized catalog test Two pieces of skill-usage evaluation, both small: 1. Daily-maintenance §1 'Skill usage scan' subsection. Aggregates yesterday's audit log for read_file calls on src/skills/<name>/skill.md paths → per-skill discovery counts. Operator decision rule: consistent zero-reads with obvious triggers = catalog issue (refine frontmatter), obsolete skill (delete), or genuinely missed (note + watch). Counts land in §4 journal synthesis under a new '### Skill usage' sub-section so the trend is queryable. 2. Parametrized 'every skill in src/skills/<name>/' must appear in the assembled system prompt's catalog. Added at the eval-harness structural layer. Trigger: PR #70 shipped exa-search without any catalog-presence test — under the prior single-skill (test_skill_creator_visible_in_catalog), a new skill could ship invisible. Parametrize fixes it for every future skill automatically. Plus .gitignore entry for mining/ so the blog scratch from session-jsonl mining doesn't keep leaking into PRs (the entry on the ask_operator branch in PR #71 hasn't merged yet). Together (1) + (2) cover the operator's catalog-presence and discovery-observability questions on SAM-43. The deeper Opus-as-judge ('did Sam apply the right skill') stays as a separate follow-up. Tests: 9 skills defended by the parametrize. 16 eval tests pass (was 8). Full suite: 143 passed locally.

) ## What this enables Two pieces of skill-usage evaluation, addressing the SAM-43 questions on catalog verification + audit-log-based usage measurement: ### 1. Skill usage scan in daily-maintenance §1 New subsection that aggregates yesterday's audit log for `read_file` calls on `src/skills/<name>/skill.md` paths → per-skill discovery counts. Counts land in §4 journal synthesis under a new `### Skill usage` sub-section so the trend is queryable over time. Decision rule when a skill has consistent zero reads: catalog issue (refine frontmatter), obsolete skill (delete), or genuinely missed (note + watch). Sam-as-LLM makes the call. ### 2. Parametrized catalog-presence test `test_every_skill_visible_in_catalog` in `tests/eval/test_structural.py` parametrizes over every `src/skills/<name>/` directory. Asserts the skill name appears in the assembled system prompt. Adding a new skill automatically gets defended — no manual test addition needed. **Trigger:** PR #70 shipped exa-search without any catalog-presence test. Under the prior single-skill check, a new skill could ship invisible. This parametrize fixes that for every future skill. ## Consequences - Operator sees per-skill usage trends daily without doing any querying. - Any future skill that doesn't appear in the catalog fails CI immediately (parametrize includes its case automatically). - The deeper "did Sam apply the right skill" eval (Opus-as-judge) stays as a follow-up — it's a meta-eval rubric design problem, not a build task. ## What this doesn't cover - Doesn't measure whether Sam applied the skill *correctly* once it was read. Just discovery. - Doesn't fire alerts when a skill is consistently zero-used; that's the operator's daily-synthesis call. ## How to verify - `pytest tests/eval/test_structural.py` — 16 passed (was 8); 9 new parametrized cases, one per skill. - After merge + a real cron fire: §4 journal synthesis includes a `### Skill usage` block with per-skill counts. ## Bonus `.gitignore` adds `mining/` so blog-scratch files from session-jsonl mining don't keep leaking into PRs (the entry on PR #71 hasn't merged yet). ## Tier Tier 1 (skill prose + tests). No runtime changes. Closes the catalog-verification + skill-usage-observability part of SAM-43.

spashii enabled auto-merge May 24, 2026 16:46

spashii added 2 commits May 24, 2026 19:00

spashii disabled auto-merge May 24, 2026 17:19

spashii merged commit b44069b into main May 24, 2026
2 checks passed

spashii deleted the sam/ask-operator-async-question branch May 24, 2026 17:19

spashii mentioned this pull request May 24, 2026

feat(skills): skill-usage observability + parametrized catalog test #73

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: ask_operator tool — async pause-and-resume for unknown unknowns (v2)#71

feat: ask_operator tool — async pause-and-resume for unknown unknowns (v2)#71
spashii merged 3 commits into
mainfrom
sam/ask-operator-async-question

spashii commented May 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

spashii commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this enables

How the lifecycle works

Consequences

Composition with existing gates

Tier

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

spashii commented May 24, 2026 •

edited

Loading