feat: ask_operator tool — async pause-and-resume for unknown unknowns (v2)#71
Merged
Conversation
Operator design call (2026-05-24): Sam should be able to break out of a session mid-work when it hits a genuine unknown unknown, post a question to Slack, exit cleanly, and let the operator's reply trigger a continuation session that picks up via the audit log. Mechanism (v2 — explicit tool + daemon plumbing): 1. New ADK FunctionTool 'ask_operator(question: str)' registered only on main_agent (NOT workers / pro_executor / mentor — they must return to their caller, never escalate). Tool closure binds the originating Slack channel/thread_ts/event_ts so the LLM signature stays minimal. 2. Tool atomically posts the question + adds 💬 reaction on the question post (race-mitigation: by the time the LLM returns to the runner, the operator can already see Sam is paused). If reaction.add fails, the post still goes through — daemon's paused_threads map is the primary signal, the reaction is for operator visibility. 3. SessionResult.ask_operator_called surfaces the pause. Daemon checks it after a clean session exit: if True, records thread_ts -> session_id in self._paused_threads. When an inbound reply lands in a thread present in this map, daemon pops the entry and injects paused_session_id into IncomingMessage. 4. IncomingMessage.paused_session_id triggers _format_continuation_message instead of the default shape. The continuation prompt tells Sam to: read prior audit log slice (ground truth, not journal text), apply the operator's answer, continue from where the prior session stopped, close the loop with a final reply. 5. Continuation takes precedence over recovered=True (more specific); retry_context takes precedence over continuation (failure narration wins over resumption — defends against a failed pause spawning a silent-retry loop). Composition with existing gates: - silent-exit gate (PR #68): ask_operator counts as a post — closes the loop for that turn. No false-positive silent-exit retry. - ack-first rule (PR #44): ask_operator can also satisfy ack-first if it's the first substantive post in a session (uncommon but valid). Abandoned-question handling: per operator direction, uses the existing reactions-scan infrastructure (_recover_from_reactions extends to age out 💬 markers >24h with a contextual reminder + clear). Same reactions.list call, no new state store. Filed as follow-up. Coverage: 11 new tests in test_ask_operator.py covering classifier behavior, continuation-prompt dispatch, precedence rules, graceful tool failures (no channel, no token). Updated test_silent_exit.py to reflect the 6-tuple classifier return. 146 total tests pass. slack.md gets a 'Pausing for operator input mid-work' subsection explaining when to call ask_operator (genuine unknown unknowns only), the operator-visibility model (💬 on the question post), and the abandoned-question mechanism.
…mory map Operator pushback (2026-05-24): the in-memory _paused_threads map was the wrong substrate. Slack reactions are already the visible state; the map duplicates them, loses on restart, and violates the 'no parallel state stores' rule. Complete redesign: - REMOVE Daemon._paused_threads in-memory map - REMOVE post-session update of _paused_threads in _worker - REMOVE _paused_threads.pop in _handle_event - ADD _find_active_paused_question(thread_history, bot_user_id): module-level pure function that walks thread_history newest-first, returns the first bot message with 💬 from this bot. Slack's conversations.replies returns reactions inline per message so no extra API call needed for the scan. - ADD Daemon._lookup_paused_session_id(thread_ts, question_post_ts): reads sessions.jsonl, filters thread_ts + ask_operator_called=True + ts_start strictly before the post, picks most-recent match. - ADD SessionLedgerEntry.ask_operator_called: bool — the ledger field the lookup correlates on. Default False for backwards compat. - ADD reactions.remove call after detection (clears 💬 from the question post = marks resolved). Idempotent — failures are non-fatal, the only cost is the reaction lingering until daily-maintenance's abandoned-question scan ages it out. - ADD daily-maintenance §6 'Review abandoned ask_operator questions': Sam-as-LLM enumerates 💬 > 24h via reactions.list, decides per-question whether to remind / mark abandoned / escalate. Daemon mechanical, Sam judgment. - REORDER session.run: classify_tool_use moved before ledger write so ask_operator_called lands in the ledger entry that the lookup reads. - UPDATE slack.md prose: reactions-as-source-of-truth framing, daemon restart explicitly named as not-a-special-case, abandoned-question handling delegated to daily-maintenance §6. Daemon restart is no longer a special case: the 💬 lives on Slack, the ledger lives on GCS, both persist. Next inbound triggers detection naturally via thread_history reactions + ledger lookup. Tests: - 13 new tests in test_ask_operator.py covering _find_active_paused_question (empty/no-bot/no-reaction/own-vs-other-user/multiple/missing-field), _lookup_paused_session_id (missing-ledger/matching/future-skipping/ most-recent), and SessionLedgerEntry field plumbing. - 159 total tests pass (was 146 before this PR's tests were added).
CI failed on the bare try/except/continue at the json.loads step in _lookup_paused_session_id. Replaced with json.JSONDecodeError-specific catch + log.debug. Malformed lines are still skipped (they're noise, not signal), but the exception is now visible at debug level for post-mortem if needed.
spashii
added a commit
that referenced
this pull request
May 24, 2026
…ized catalog test Two pieces of skill-usage evaluation, both small: 1. Daily-maintenance §1 'Skill usage scan' subsection. Aggregates yesterday's audit log for read_file calls on src/skills/<name>/skill.md paths → per-skill discovery counts. Operator decision rule: consistent zero-reads with obvious triggers = catalog issue (refine frontmatter), obsolete skill (delete), or genuinely missed (note + watch). Counts land in §4 journal synthesis under a new '### Skill usage' sub-section so the trend is queryable. 2. Parametrized 'every skill in src/skills/<name>/' must appear in the assembled system prompt's catalog. Added at the eval-harness structural layer. Trigger: PR #70 shipped exa-search without any catalog-presence test — under the prior single-skill (test_skill_creator_visible_in_catalog), a new skill could ship invisible. Parametrize fixes it for every future skill automatically. Plus .gitignore entry for mining/ so the blog scratch from session-jsonl mining doesn't keep leaking into PRs (the entry on the ask_operator branch in PR #71 hasn't merged yet). Together (1) + (2) cover the operator's catalog-presence and discovery-observability questions on SAM-43. The deeper Opus-as-judge ('did Sam apply the right skill') stays as a separate follow-up. Tests: 9 skills defended by the parametrize. 16 eval tests pass (was 8). Full suite: 143 passed locally.
spashii
added a commit
that referenced
this pull request
May 24, 2026
) ## What this enables Two pieces of skill-usage evaluation, addressing the SAM-43 questions on catalog verification + audit-log-based usage measurement: ### 1. Skill usage scan in daily-maintenance §1 New subsection that aggregates yesterday's audit log for `read_file` calls on `src/skills/<name>/skill.md` paths → per-skill discovery counts. Counts land in §4 journal synthesis under a new `### Skill usage` sub-section so the trend is queryable over time. Decision rule when a skill has consistent zero reads: catalog issue (refine frontmatter), obsolete skill (delete), or genuinely missed (note + watch). Sam-as-LLM makes the call. ### 2. Parametrized catalog-presence test `test_every_skill_visible_in_catalog` in `tests/eval/test_structural.py` parametrizes over every `src/skills/<name>/` directory. Asserts the skill name appears in the assembled system prompt. Adding a new skill automatically gets defended — no manual test addition needed. **Trigger:** PR #70 shipped exa-search without any catalog-presence test. Under the prior single-skill check, a new skill could ship invisible. This parametrize fixes that for every future skill. ## Consequences - Operator sees per-skill usage trends daily without doing any querying. - Any future skill that doesn't appear in the catalog fails CI immediately (parametrize includes its case automatically). - The deeper "did Sam apply the right skill" eval (Opus-as-judge) stays as a follow-up — it's a meta-eval rubric design problem, not a build task. ## What this doesn't cover - Doesn't measure whether Sam applied the skill *correctly* once it was read. Just discovery. - Doesn't fire alerts when a skill is consistently zero-used; that's the operator's daily-synthesis call. ## How to verify - `pytest tests/eval/test_structural.py` — 16 passed (was 8); 9 new parametrized cases, one per skill. - After merge + a real cron fire: §4 journal synthesis includes a `### Skill usage` block with per-skill counts. ## Bonus `.gitignore` adds `mining/` so blog-scratch files from session-jsonl mining don't keep leaking into PRs (the entry on PR #71 hasn't merged yet). ## Tier Tier 1 (skill prose + tests). No runtime changes. Closes the catalog-verification + skill-usage-observability part of SAM-43.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this enables
Sam can break out of a session mid-work for genuine unknown unknowns, post a question to Slack, exit cleanly. Operator replies whenever — daemon detects the continuation, Sam picks up via the audit log. Slack reactions are the source of truth for paused state; no in-memory daemon map, no parallel state store, no special-case boot recovery.
How the lifecycle works
Consequences
ask_operator(question: str)on the main agent only — workers/pro_executor/mentor cannot escalate to operator directly.:speech_balloon:(race-mitigation per your design call).SessionLedgerEntry.ask_operator_called: boolis the new ledger field the daemon's lookup correlates on._find_active_paused_question+_lookup_paused_session_idreplace the deleted_paused_threadsmap. Reactions on Slack + ledger on GCS both persist across daemon restarts.Composition with existing gates
ask_operatorcounts as a post → closes the loop for that turn.Tier
Tier 3 (
src/runtime/) + Tier 1 (src/capabilities/slack.md,src/skills/daily-maintenance/skill.md). Both layers: system enforces routing; prose explains the rule and Sam's review responsibility.Test plan
pytest tests/— 159 passed (24 new intest_ask_operator.py, no regressions)_find_active_paused_questiondefended for: empty history, no bot messages, no reaction, reaction from another user, multiple paused (most-recent wins), missing reactions field_lookup_paused_session_iddefended for: missing ledger, matching session, future-skipping, most-recent-matchCloses the async-question class of bug. Three new tickets queued for follow-up work (24h reminder, introspect tool, skill descriptions audit) tracked separately.