Skip to content

feat: ask_operator tool — async pause-and-resume for unknown unknowns (v2)#71

Merged
spashii merged 3 commits into
mainfrom
sam/ask-operator-async-question
May 24, 2026
Merged

feat: ask_operator tool — async pause-and-resume for unknown unknowns (v2)#71
spashii merged 3 commits into
mainfrom
sam/ask-operator-async-question

Conversation

@spashii
Copy link
Copy Markdown
Member

@spashii spashii commented May 24, 2026

What this enables

Sam can break out of a session mid-work for genuine unknown unknowns, post a question to Slack, exit cleanly. Operator replies whenever — daemon detects the continuation, Sam picks up via the audit log. Slack reactions are the source of truth for paused state; no in-memory daemon map, no parallel state store, no special-case boot recovery.

How the lifecycle works

Sam pauses → ask_operator tool posts question + adds :speech_balloon: atomically
            → session exits clean, :white_check_mark: on inbound
            → ledger entry has ask_operator_called: true

  ... operator replies whenever (works across daemon restarts) ...

Operator reply → daemon fetches thread_history (already happens)
              → finds the bot message with :speech_balloon: from this bot
              → looks up session_id from sessions.jsonl
                 (filter thread_ts + ask_operator_called + ts_start strictly before post)
              → injects paused_session_id into IncomingMessage
              → calls reactions.remove on the question post (marks resolved)
              → continuation prompt fires: read audit log, apply answer, continue

Consequences

  • New ADK tool ask_operator(question: str) on the main agent only — workers/pro_executor/mentor cannot escalate to operator directly.
  • Tool atomically posts question + adds :speech_balloon: (race-mitigation per your design call).
  • SessionLedgerEntry.ask_operator_called: bool is the new ledger field the daemon's lookup correlates on.
  • _find_active_paused_question + _lookup_paused_session_id replace the deleted _paused_threads map. Reactions on Slack + ledger on GCS both persist across daemon restarts.
  • Daily-maintenance §6 handles abandoned questions (>24h) — Sam enumerates 💬 via reactions.list, decides per-question whether to remind / mark abandoned / escalate. Daemon mechanical, Sam judgment.

Composition with existing gates

Tier

Tier 3 (src/runtime/) + Tier 1 (src/capabilities/slack.md, src/skills/daily-maintenance/skill.md). Both layers: system enforces routing; prose explains the rule and Sam's review responsibility.

Test plan

  • pytest tests/ — 159 passed (24 new in test_ask_operator.py, no regressions)
  • _find_active_paused_question defended for: empty history, no bot messages, no reaction, reaction from another user, multiple paused (most-recent wins), missing reactions field
  • _lookup_paused_session_id defended for: missing ledger, matching session, future-skipping, most-recent-match
  • SessionLedgerEntry field defaulting + plumbing
  • Live: Sam mid-work calls ask_operator → 💬 appears → operator reply → continuation runs + clears reaction

Closes the async-question class of bug. Three new tickets queued for follow-up work (24h reminder, introspect tool, skill descriptions audit) tracked separately.

Operator design call (2026-05-24): Sam should be able to break out of
a session mid-work when it hits a genuine unknown unknown, post a
question to Slack, exit cleanly, and let the operator's reply trigger
a continuation session that picks up via the audit log.

Mechanism (v2 — explicit tool + daemon plumbing):

1. New ADK FunctionTool 'ask_operator(question: str)' registered only
   on main_agent (NOT workers / pro_executor / mentor — they must
   return to their caller, never escalate). Tool closure binds the
   originating Slack channel/thread_ts/event_ts so the LLM signature
   stays minimal.

2. Tool atomically posts the question + adds 💬 reaction
   on the question post (race-mitigation: by the time the LLM returns
   to the runner, the operator can already see Sam is paused). If
   reaction.add fails, the post still goes through — daemon's
   paused_threads map is the primary signal, the reaction is for
   operator visibility.

3. SessionResult.ask_operator_called surfaces the pause. Daemon checks
   it after a clean session exit: if True, records thread_ts ->
   session_id in self._paused_threads. When an inbound reply lands in
   a thread present in this map, daemon pops the entry and injects
   paused_session_id into IncomingMessage.

4. IncomingMessage.paused_session_id triggers _format_continuation_message
   instead of the default shape. The continuation prompt tells Sam to:
   read prior audit log slice (ground truth, not journal text), apply
   the operator's answer, continue from where the prior session
   stopped, close the loop with a final reply.

5. Continuation takes precedence over recovered=True (more specific);
   retry_context takes precedence over continuation (failure narration
   wins over resumption — defends against a failed pause spawning a
   silent-retry loop).

Composition with existing gates:

- silent-exit gate (PR #68): ask_operator counts as a post — closes
  the loop for that turn. No false-positive silent-exit retry.
- ack-first rule (PR #44): ask_operator can also satisfy ack-first if
  it's the first substantive post in a session (uncommon but valid).

Abandoned-question handling: per operator direction, uses the existing
reactions-scan infrastructure (_recover_from_reactions extends to age
out 💬 markers >24h with a contextual reminder + clear).
Same reactions.list call, no new state store. Filed as follow-up.

Coverage: 11 new tests in test_ask_operator.py covering classifier
behavior, continuation-prompt dispatch, precedence rules, graceful
tool failures (no channel, no token). Updated test_silent_exit.py to
reflect the 6-tuple classifier return. 146 total tests pass.

slack.md gets a 'Pausing for operator input mid-work' subsection
explaining when to call ask_operator (genuine unknown unknowns only),
the operator-visibility model (💬 on the question post),
and the abandoned-question mechanism.
@spashii spashii enabled auto-merge May 24, 2026 16:46
spashii added 2 commits May 24, 2026 19:00
…mory map

Operator pushback (2026-05-24): the in-memory _paused_threads map was
the wrong substrate. Slack reactions are already the visible state;
the map duplicates them, loses on restart, and violates the
'no parallel state stores' rule.

Complete redesign:

- REMOVE Daemon._paused_threads in-memory map
- REMOVE post-session update of _paused_threads in _worker
- REMOVE _paused_threads.pop in _handle_event

- ADD _find_active_paused_question(thread_history, bot_user_id):
  module-level pure function that walks thread_history newest-first,
  returns the first bot message with 💬 from this bot.
  Slack's conversations.replies returns reactions inline per message
  so no extra API call needed for the scan.

- ADD Daemon._lookup_paused_session_id(thread_ts, question_post_ts):
  reads sessions.jsonl, filters thread_ts + ask_operator_called=True
  + ts_start strictly before the post, picks most-recent match.

- ADD SessionLedgerEntry.ask_operator_called: bool — the ledger field
  the lookup correlates on. Default False for backwards compat.

- ADD reactions.remove call after detection (clears 💬
  from the question post = marks resolved). Idempotent — failures are
  non-fatal, the only cost is the reaction lingering until
  daily-maintenance's abandoned-question scan ages it out.

- ADD daily-maintenance §6 'Review abandoned ask_operator questions':
  Sam-as-LLM enumerates 💬 > 24h via reactions.list,
  decides per-question whether to remind / mark abandoned / escalate.
  Daemon mechanical, Sam judgment.

- REORDER session.run: classify_tool_use moved before ledger write so
  ask_operator_called lands in the ledger entry that the lookup reads.

- UPDATE slack.md prose: reactions-as-source-of-truth framing, daemon
  restart explicitly named as not-a-special-case, abandoned-question
  handling delegated to daily-maintenance §6.

Daemon restart is no longer a special case: the 💬 lives
on Slack, the ledger lives on GCS, both persist. Next inbound triggers
detection naturally via thread_history reactions + ledger lookup.

Tests:
- 13 new tests in test_ask_operator.py covering _find_active_paused_question
  (empty/no-bot/no-reaction/own-vs-other-user/multiple/missing-field),
  _lookup_paused_session_id (missing-ledger/matching/future-skipping/
  most-recent), and SessionLedgerEntry field plumbing.
- 159 total tests pass (was 146 before this PR's tests were added).
CI failed on the bare try/except/continue at the json.loads step in
_lookup_paused_session_id. Replaced with json.JSONDecodeError-specific
catch + log.debug. Malformed lines are still skipped (they're noise,
not signal), but the exception is now visible at debug level for
post-mortem if needed.
@spashii spashii disabled auto-merge May 24, 2026 17:19
@spashii spashii merged commit b44069b into main May 24, 2026
2 checks passed
@spashii spashii deleted the sam/ask-operator-async-question branch May 24, 2026 17:19
spashii added a commit that referenced this pull request May 24, 2026
…ized catalog test

Two pieces of skill-usage evaluation, both small:

1. Daily-maintenance §1 'Skill usage scan' subsection. Aggregates
   yesterday's audit log for read_file calls on src/skills/<name>/skill.md
   paths → per-skill discovery counts. Operator decision rule:
   consistent zero-reads with obvious triggers = catalog issue
   (refine frontmatter), obsolete skill (delete), or genuinely missed
   (note + watch). Counts land in §4 journal synthesis under a new
   '### Skill usage' sub-section so the trend is queryable.

2. Parametrized 'every skill in src/skills/<name>/' must appear in
   the assembled system prompt's catalog. Added at the eval-harness
   structural layer. Trigger: PR #70 shipped exa-search without any
   catalog-presence test — under the prior single-skill
   (test_skill_creator_visible_in_catalog), a new skill could ship
   invisible. Parametrize fixes it for every future skill automatically.

Plus .gitignore entry for mining/ so the blog scratch from session-jsonl
mining doesn't keep leaking into PRs (the entry on the ask_operator
branch in PR #71 hasn't merged yet).

Together (1) + (2) cover the operator's catalog-presence and
discovery-observability questions on SAM-43. The deeper Opus-as-judge
('did Sam apply the right skill') stays as a separate follow-up.

Tests: 9 skills defended by the parametrize. 16 eval tests pass
(was 8). Full suite: 143 passed locally.
spashii added a commit that referenced this pull request May 24, 2026
)

## What this enables

Two pieces of skill-usage evaluation, addressing the SAM-43 questions on
catalog verification + audit-log-based usage measurement:

### 1. Skill usage scan in daily-maintenance §1

New subsection that aggregates yesterday's audit log for `read_file`
calls on `src/skills/<name>/skill.md` paths → per-skill discovery
counts. Counts land in §4 journal synthesis under a new `### Skill
usage` sub-section so the trend is queryable over time.

Decision rule when a skill has consistent zero reads: catalog issue
(refine frontmatter), obsolete skill (delete), or genuinely missed (note
+ watch). Sam-as-LLM makes the call.

### 2. Parametrized catalog-presence test

`test_every_skill_visible_in_catalog` in `tests/eval/test_structural.py`
parametrizes over every `src/skills/<name>/` directory. Asserts the
skill name appears in the assembled system prompt. Adding a new skill
automatically gets defended — no manual test addition needed.

**Trigger:** PR #70 shipped exa-search without any catalog-presence
test. Under the prior single-skill check, a new skill could ship
invisible. This parametrize fixes that for every future skill.

## Consequences

- Operator sees per-skill usage trends daily without doing any querying.
- Any future skill that doesn't appear in the catalog fails CI
immediately (parametrize includes its case automatically).
- The deeper "did Sam apply the right skill" eval (Opus-as-judge) stays
as a follow-up — it's a meta-eval rubric design problem, not a build
task.

## What this doesn't cover

- Doesn't measure whether Sam applied the skill *correctly* once it was
read. Just discovery.
- Doesn't fire alerts when a skill is consistently zero-used; that's the
operator's daily-synthesis call.

## How to verify

- `pytest tests/eval/test_structural.py` — 16 passed (was 8); 9 new
parametrized cases, one per skill.
- After merge + a real cron fire: §4 journal synthesis includes a `###
Skill usage` block with per-skill counts.

## Bonus

`.gitignore` adds `mining/` so blog-scratch files from session-jsonl
mining don't keep leaking into PRs (the entry on PR #71 hasn't merged
yet).

## Tier

Tier 1 (skill prose + tests). No runtime changes.

Closes the catalog-verification + skill-usage-observability part of
SAM-43.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant