
obs(sdk): info-level diagnostics on the pause-cascade hot path#569

Merged
AbirAbbas merged 2 commits into main from observability/pause-cascade-diagnostic-logging on May 11, 2026

Conversation

@AbirAbbas
Contributor

Summary

Promote five points on the cross-reasoner pause-propagation hot path from debug to info so the next failing run is diagnosable from logs alone — no behavior change.

Why

Run run_1778458437977_fb6f588b (production): implement_from_issue failed with "Reasoner 'implement_from_issue' timed out after 7200.0s of active time" at wall-clock duration_ms=7201393. Its child swe-planner.build was visibly WAITING on a hax-sdk approval for ≥20 min (the gap between the plan completing at 00:42:34 and the revision iteration of run_architect starting at 01:02:57). Math: if the cascade had fired, pause_clock.total_paused() ≥ 1200s → active_elapsed ≤ 6001s → the watchdog wouldn't have tripped. It tripped at 7201s wall-clock, so pause_clock.total_paused() ≈ 0 — the cascade never started.

A local 3-hop repro (outer→middle→inner-with-app.pause) at the same SDK version (0.1.83) does fire the cascade correctly (the outer's CP exec went to waiting / awaiting_child within seconds and survived 30s past a 5s budget). Something differs in production. All cascade toggles were emitted at logger.debug level, so the production log was silent on which specific link broke — no signal on whether the parent_pause_clock lookup missed, whether the polling task ever observed the awaited child as WAITING, or whether pause_clock.start_pause() fired.
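
For orientation, a minimal sketch of that arithmetic (assuming a monotonic-clock PauseClock; the internal names here are guesses, but total_paused() counting an in-progress pause matches how the tests assert on it, per the commit notes further down):

```python
import time


class PauseClock:
    """Minimal sketch; field names are assumptions, not the SDK's."""

    def __init__(self) -> None:
        self._completed = 0.0    # seconds accumulated from finished pauses
        self._started_at = None  # monotonic timestamp of an in-progress pause

    def start_pause(self) -> None:
        if self._started_at is None:
            self._started_at = time.monotonic()

    def end_pause(self) -> None:
        if self._started_at is not None:
            self._completed += time.monotonic() - self._started_at
            self._started_at = None

    def total_paused(self) -> float:
        # Counts the in-progress pause too, so a caller can assert on it
        # while the awaited child is still WAITING.
        in_progress = (
            time.monotonic() - self._started_at if self._started_at else 0.0
        )
        return self._completed + in_progress


# The watchdog charges only active time against the budget:
#   active_elapsed = wall_elapsed - pause_clock.total_paused()
# For the failing run: wall_elapsed ≈ 7201s against a 7200s budget, so a
# fired cascade (total_paused ≥ 1200s) would have left active_elapsed ≤ 6001s
# and the watchdog quiet. It tripped, so total_paused ≈ 0.
```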

What this PR adds

All at info level (so they appear in production without dev_mode), all on the hot path but bounded: at most one line per state transition. A hedged sketch of the line shapes follows the list.

  1. agent.py — _execute_async_with_callback: log pause_clock registration with id() + execution_id + reasoner_name at reasoner entry.
  2. agent.py — app.call (cross-reasoner pause-propagation block): log target, parent_execution_id, parent_pause_clock id(), and a peek at the _pause_clocks keys. Distinguishes "no current_context" from "lookup missed" — they need different fixes.
  3. agent.py — watchdog firing: log wall_elapsed, total_paused, active_elapsed, budget, pause_clock_id at the moment the timeout cancels the reasoner. Tells "legitimate long active work" apart from "cascade never ran" without re-deriving from the run timeline.
  4. async_execution_manager.py — wait_for_result pause toggles: log start_pause and end_pause on the parent's pause_clock with the awaited child id + clock id + cumulative paused seconds.
  5. async_execution_manager.py — poll status mapper: log WAITING/RUNNING transitions on the awaited child at info (other transitions stay at debug to avoid spam from healthy workflows).
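
Roughly the shape of the promoted lines (a hedged sketch, not the diff: the logger name and helper signatures are assumptions; the fields are the ones enumerated above):

```python
import logging

logger = logging.getLogger("hax_sdk")  # logger name is an assumption


def log_clock_registration(pause_clock, execution_id: str, reasoner_name: str) -> None:
    # (1) reasoner entry: ties the clock's object identity to the execution
    logger.info(
        "pause_cascade: registered pause_clock id=%d for execution_id=%s reasoner_name=%s",
        id(pause_clock), execution_id, reasoner_name,
    )


def log_parent_clock_lookup(target: str, parent_execution_id, pause_clocks: dict) -> None:
    # (2) app.call pause-propagation block: "no current_context" and
    # "lookup missed" produce visibly different lines
    if parent_execution_id is None:
        logger.info("pause_cascade: target=%s no current_context", target)
        return
    clock = pause_clocks.get(parent_execution_id)
    logger.info(
        "pause_cascade: target=%s parent_execution_id=%s parent_pause_clock_id=%s known_clocks=%s",
        target, parent_execution_id,
        id(clock) if clock is not None else None, list(pause_clocks),
    )


def log_watchdog_firing(wall_elapsed: float, total_paused: float, budget: float, clock_id: int) -> None:
    # (3) the only error-level line, emitted iff the watchdog cancels the reasoner
    logger.error(
        "pause_cascade: WATCHDOG_FIRING wall_elapsed=%.1fs total_paused=%.1fs "
        "active_elapsed=%.1fs budget=%.1fs pause_clock_id=%d",
        wall_elapsed, total_paused, wall_elapsed - total_paused, budget, clock_id,
    )

# (4) and (5) follow the same one-line-per-transition shape in
# async_execution_manager.py: start_pause/end_pause with child id, clock id,
# and cumulative paused seconds, plus WAITING/RUNNING poll transitions at info.
```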

Diagnostic value

The next production run that hits this failure mode will show, in order:

  • (1) pause_cascade: registered pause_clock id=<X> for execution_id=<parent>
  • (2) pause_cascade: target=<child> parent_execution_id=<parent> parent_pause_clock_id=<X> (or None — which itself names the bug)
  • (5) pause_cascade: poll_observed execution_id=<child> status_transition=running->waiting (or missing — which names a different bug)
  • (4) pause_cascade: start_pause child=<child> pause_clock_id=<X> (or missing — names yet another bug)
  • (3) pause_cascade: WATCHDOG_FIRING ... total_paused=<N>s active_elapsed=<N>s budget=<N>s

Diffing expected vs. actual at each step pinpoints the broken link.

Validation Contract

  • No behavior change. Same imports, same control flow, same condition checks.
  • One additional log_info at reasoner start (per execution).
  • One additional log_info per app.call from a reasoner (per call).
  • One additional logger.info per WAITING/RUNNING transition on awaited child (bounded by the number of child status flips — typically 2-3 per pause cycle).
  • One additional log_error if and only if the watchdog actually fires (rare by design).

Test plan

  • ruff check . — clean
  • python -m pytest --no-cov — passes (exit 0)
  • Deploy to Railway test stack
  • Trigger another implement_from_issue run that pauses on hax approval
  • Read pause_cascade: log lines on github-buddy to identify which link broke

Cascade fired correctly in local repros but didn't on run
run_1778458437977_fb6f588b (implement_from_issue timed out at exactly
7200.0s "active time" while swe-planner.build was paused on a hax
approval for ~20min — math says pause_clock.total_paused()=0 despite
the awaited child being visibly WAITING in the CP). All cascade
toggles were debug-only, so production logs gave no signal on which
specific link broke. Promotes five points to INFO so the next
occurrence is diagnosable from logs alone:

- agent.py: log pause_clock registration with id+execution_id, log
  parent_pause_clock lookup result at app.call time (including the
  _pause_clocks keyset so a missing-entry case is visible), and log
  full pause_clock state at the moment the watchdog fires (wall
  elapsed, total_paused, active_elapsed, budget). The last one in
  particular tells "legitimate long active work" apart from "cascade
  never ran" without needing to re-derive it from the timeline.

- async_execution_manager.py: log start_pause / end_pause on the
  parent's pause_clock with the awaited child id + clock id, and log
  poll-observed WAITING<->RUNNING transitions for the awaited child
  (other status transitions stay at debug). The two together pin
  exactly when the polling task saw WAITING and whether the wait
  loop translated that observation into a clock pause.

No behavior change — same imports, same control flow. Safe on hot
paths: at most one log line per status transition on the awaited
child, and a one-line registration on reasoner entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner May 11, 2026 13:47
@github-actions
Contributor

github-actions Bot commented May 11, 2026

Performance

| SDK | Memory Δ | Latency Δ | Tests status |
|---|---|---|---|
| Python | 9.4 KB (+4%) | 0.32 µs (-9%) | ✓ No regressions detected |

test_wait_for_result_invokes_callbacks_on_child_waiting_transitions
raced the wait loop's 0.1s poll interval — the original 0.05s gap
between the RUNNING transition and the SUCCEEDED transition meant the
loop had a coin-flip chance of seeing WAITING and then jumping
straight to SUCCEEDED, never firing on_child_running. Any added work
in the toggle block (e.g. the diagnostic logger.info lines from the
prior commit) shifts that race; it passed on 3.12 and failed on
3.10/3.11.

Fix: bump both inter-transition sleeps from 0.05/0.20 to 0.30s — well
above the loop's 0.1s poll interval — so each transition is observed
deterministically. The widening applies only to this test because the
neighbouring test_wait_for_result_pauses_clock_on_child_waiting
asserts only on total_paused() (which includes the in-progress pause
via PauseClock.total_paused) and so doesn't race the way this one
did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
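
A minimal sketch of the widened driver (a hypothetical test shape; only the 0.1s poll interval and the 0.05/0.20 → 0.30s widening come from the message above):

```python
import asyncio

POLL_INTERVAL = 0.1  # wait_for_result's poll cadence, per the message above


async def drive_child_transitions(set_status) -> None:
    """Hypothetical driver for the flaky test: each inter-transition gap
    must comfortably exceed the poll interval, or the loop can skip a
    status entirely and never fire on_child_running."""
    set_status("WAITING")
    await asyncio.sleep(0.30)  # widened (was one of the 0.05/0.20 gaps)
    set_status("RUNNING")
    await asyncio.sleep(0.30)  # widened; the old 0.05s RUNNING->SUCCEEDED gap
                               # was a coin flip against the 0.1s poll tick
    set_status("SUCCEEDED")
```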
@github-actions
Contributor

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
|---|---|---|---|
| control-plane | 87.50% | 87.30% | ↑ +0.20 pp 🟡 |
| sdk-go | 91.90% | 90.70% | ↑ +1.20 pp 🟢 |
| sdk-python | 93.73% | 93.63% | ↑ +0.10 pp 🟢 |
| sdk-typescript | 92.68% | 92.56% | ↑ +0.12 pp 🟢 |
| web-ui | 89.91% | 90.01% | ↓ -0.10 pp 🟡 |
| aggregate | 89.02% | 89.01% | ↑ +0.01 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Contributor

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
|---|---|---|---|
| control-plane | 0 | ➖ | no changes |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 0 | ➖ | no changes |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit 1bd51fa into main May 11, 2026
27 checks passed
@AbirAbbas AbirAbbas deleted the observability/pause-cascade-diagnostic-logging branch May 11, 2026 14:28
