obs(sdk): info-level diagnostics on the pause-cascade hot path #569
Merged
Conversation
Cascade fired correctly in local repros but didn't on run run_1778458437977_fb6f588b (implement_from_issue timed out at exactly 7200.0s of "active time" while swe-planner.build was paused on a hax approval for ~20 min — the math says pause_clock.total_paused()=0 despite the awaited child being visibly WAITING in the CP). All cascade toggles were debug-only, so production logs gave no signal on which specific link broke.

Promotes five points to INFO so the next occurrence is diagnosable from logs alone:

- agent.py: log pause_clock registration with id + execution_id, log the parent_pause_clock lookup result at app.call time (including the _pause_clocks keyset so a missing-entry case is visible), and log full pause_clock state at the moment the watchdog fires (wall elapsed, total_paused, active_elapsed, budget). The last one in particular tells "legitimate long active work" apart from "cascade never ran" without needing to re-derive it from the timeline.
- async_execution_manager.py: log start_pause / end_pause on the parent's pause_clock with the awaited child id + clock id, and log poll-observed WAITING<->RUNNING transitions for the awaited child (other status transitions stay at debug). The two together pin exactly when the polling task saw WAITING and whether the wait loop translated that observation into a clock pause (a hedged sketch of these toggles follows below).

No behavior change — same imports, same control flow. Safe on hot paths: at most one log line per status transition on the awaited child, and a one-line registration on reasoner entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
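For readers without the diff handy, a minimal sketch of what the promoted wait-loop toggles look like. The function shape, the status strings, and the poll_status/fetch_result callables are illustrative assumptions rather than the actual async_execution_manager.py code; only the start_pause/end_pause calls, the id()-based clock tag, and the info-vs-debug split follow the commit text.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def wait_for_result_sketch(child_id, parent_pause_clock, poll_status, fetch_result):
    """Illustrative wait loop; poll_status/fetch_result are assumed async callables."""
    last_status = None
    while True:
        status = await poll_status(child_id)
        if status != last_status:
            if status in ("WAITING", "RUNNING"):
                # Promoted to info: one line per observed transition on the awaited child.
                logger.info(
                    "pause_cascade: poll_observed execution_id=%s status_transition=%s->%s",
                    child_id, last_status, status,
                )
            else:
                # Other transitions stay at debug, as in the commit.
                logger.debug("execution_id=%s status=%s", child_id, status)
            if parent_pause_clock is not None:
                if status == "WAITING":
                    parent_pause_clock.start_pause()
                    logger.info(
                        "pause_cascade: start_pause child=%s pause_clock_id=%s",
                        child_id, id(parent_pause_clock),
                    )
                elif last_status == "WAITING":
                    parent_pause_clock.end_pause()
                    logger.info(
                        "pause_cascade: end_pause child=%s pause_clock_id=%s total_paused=%.1fs",
                        child_id, id(parent_pause_clock), parent_pause_clock.total_paused(),
                    )
            last_status = status
        if status in ("SUCCEEDED", "FAILED"):
            return await fetch_result(child_id)
        await asyncio.sleep(0.1)  # poll interval; the test fix below is sized against this
```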
Performance
✓ No regressions detected
test_wait_for_result_invokes_callbacks_on_child_waiting_transitions raced the wait loop's 0.1s poll interval — the original 0.05s gap between the RUNNING transition and the SUCCEEDED transition meant the loop had a coin-flip chance of seeing WAITING and then jumping straight to SUCCEEDED, never firing on_child_running. Any added work in the toggle block (e.g. the diagnostic logger.info lines from the prior commit) shifts that race; the test passed on 3.12 and failed on 3.10/3.11.

Fix: bump both inter-transition sleeps from 0.05s/0.20s to 0.30s — well above the loop's poll interval — so each transition is observed deterministically. The widening is applied only to this test because the neighbouring test_wait_for_result_pauses_clock_on_child_waiting asserts only total_paused() (which includes the in-progress pause via PauseClock.total_paused) and so doesn't race the way this one did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
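The timing behind the fix, as a small sketch. The set_status hook and the function name are hypothetical test scaffolding; the point is simply that each inter-transition sleep now exceeds the wait loop's 0.1s poll interval, so a poll can no longer skip a state.

```python
import asyncio

POLL_INTERVAL = 0.1  # the wait loop's poll cadence

async def drive_child_transitions(set_status):
    """Hypothetical transition driver mirroring the fixed test timing.

    set_status is an assumed test hook that flips the fake child's status.
    With 0.30s between transitions (> POLL_INTERVAL), a loop polling every
    0.1s observes WAITING and RUNNING before SUCCEEDED, so on_child_running
    fires deterministically instead of racing the poll.
    """
    set_status("WAITING")
    await asyncio.sleep(0.30)  # widened well above POLL_INTERVAL (original gaps were 0.05s/0.20s)
    set_status("RUNNING")
    await asyncio.sleep(0.30)  # was 0.05s: short enough for a 0.1s poll to miss RUNNING entirely
    set_status("SUCCEEDED")
```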
📊 Coverage gate
Thresholds from
✅ Gate passed
No surface regressed past the allowed threshold and the aggregate stayed above the floor.
📐 Patch coverage gate
Threshold: 80% on lines this PR touches vs
✅ Patch gate passed
Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
Summary
Promote five points on the cross-reasoner pause-propagation hot path from debug to info so the next failing run is diagnosable from logs alone — no behavior change.

Why
Run run_1778458437977_fb6f588b (production): implement_from_issue failed with "Reasoner 'implement_from_issue' timed out after 7200.0s of active time" at wall-clock duration_ms=7201393. Its child swe-planner.build was visibly WAITING on a hax-sdk approval for ≥20 min (the gap between plan completing at 00:42:34 and the revision iteration of run_architect starting at 01:02:57). Math: if the cascade had fired, pause_clock.total_paused() ≥ 1200s → active_elapsed ≤ 6001s → the watchdog wouldn't have tripped. It tripped at 7201s of wall clock, so pause_clock.total_paused() ≈ 0 — the cascade never started.

A local 3-hop repro (outer→middle→inner-with-app.pause) at the same SDK version (0.1.83) does fire the cascade correctly (the outer's CP exec went to waiting / awaiting_child within seconds and survived 30s past a 5s budget). Something differs in production. All cascade toggles were emitted at logger.debug level, so the production log was silent on which specific link broke — no signal on whether the parent_pause_clock lookup missed, whether the polling task ever observed the awaited child as WAITING, or whether pause_clock.start_pause() fired.
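The arithmetic behind that inference, written out. The constant names are ours; the numbers come from the run and the timeout message, and active time is taken to be wall-clock time minus the time the pause clock was paused.

```python
BUDGET_S = 7200.0          # "timed out after 7200.0s of active time"
WALL_ELAPSED_S = 7201.393  # duration_ms=7201393 on the failed run
OBSERVED_PAUSE_S = 1200.0  # >= 20 min visibly WAITING on the hax approval

# If the cascade had started the pause, the watchdog's view would have been:
active_if_cascaded = WALL_ELAPSED_S - OBSERVED_PAUSE_S
assert active_if_cascaded < BUDGET_S  # ~6001s, under budget; it would not have fired

# It tripped at ~7201s of wall clock anyway, so the clock recorded no pause:
#   active_elapsed ≈ wall_elapsed  =>  pause_clock.total_paused() ≈ 0
```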
What this PR adds

All at info level (so they appear in production without dev_mode), all on the hot path but bounded — at most one line per state transition. A hedged sketch of the agent.py lines follows the list.

- agent.py — _execute_async_with_callback: log pause_clock registration with id() + execution_id + reasoner_name at reasoner entry.
- agent.py — app.call (cross-reasoner pause prop block): log target, parent_execution_id, parent_pause_clock id(), and a peek at the _pause_clocks keys. Distinguishes "no current_context" from "lookup missed" — they need different fixes.
- agent.py — watchdog firing: log wall_elapsed, total_paused, active_elapsed, budget, pause_clock_id at the moment the timeout cancels the reasoner. Tells "legitimate long active work" apart from "cascade never ran" without re-deriving from the run timeline.
- async_execution_manager.py — wait_for_result pause toggles: log start_pause and end_pause on the parent's pause_clock with the awaited child id + clock id + cumulative paused seconds.
- async_execution_manager.py — poll status mapper: log WAITING/RUNNING transitions on the awaited child at info (other transitions stay at debug to avoid spam from healthy workflows).
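A hedged sketch of the agent.py side of these lines (registration at reasoner entry and the watchdog snapshot). The helper names, the registry dict, and the monotonic-clock bookkeeping are assumptions for illustration; the logged fields and the id()/execution_id tags follow the list above.

```python
import logging
import time

logger = logging.getLogger(__name__)

def register_pause_clock(pause_clocks, execution_id, reasoner_name, pause_clock):
    """Illustrative registration log at reasoner entry (helper name is assumed)."""
    pause_clocks[execution_id] = pause_clock
    logger.info(
        "pause_cascade: registered pause_clock id=%s for execution_id=%s reasoner=%s",
        id(pause_clock), execution_id, reasoner_name,
    )

def log_watchdog_firing(pause_clock, started_at_monotonic, budget_s):
    """Illustrative snapshot at the moment the timeout cancels the reasoner."""
    wall_elapsed = time.monotonic() - started_at_monotonic
    total_paused = pause_clock.total_paused()  # PauseClock method named in the PR
    active_elapsed = wall_elapsed - total_paused
    logger.info(
        "pause_cascade: WATCHDOG_FIRING wall_elapsed=%.1fs total_paused=%.1fs "
        "active_elapsed=%.1fs budget=%.1fs pause_clock_id=%s",
        wall_elapsed, total_paused, active_elapsed, budget_s, id(pause_clock),
    )
```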
Diagnostic value

The next production run that hits this failure mode will show, in order:

1. pause_cascade: registered pause_clock id=<X> for execution_id=<parent>
2. pause_cascade: target=<child> parent_execution_id=<parent> parent_pause_clock_id=<X> (or None — which itself names the bug)
3. pause_cascade: poll_observed execution_id=<child> status_transition=running->waiting (or missing — which names a different bug)
4. pause_cascade: start_pause child=<child> pause_clock_id=<X> (or missing — names yet another bug)
5. pause_cascade: WATCHDOG_FIRING ... total_paused=<N>s active_elapsed=<N>s budget=<N>s

Diffing expected vs. actual at each step pinpoints the broken link.
Validation Contract
- log_info at reasoner start (per execution).
- log_info per app.call from a reasoner (per call).
- logger.info per WAITING/RUNNING transition on the awaited child (bounded by the number of child status flips — typically 2-3 per pause cycle; a spot check is sketched below).
- log_error if and only if the watchdog actually fires (rare by design).
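One way to spot-check the transition-line count in a test, sketched with pytest's caplog fixture. run_child_and_wait is a hypothetical driver (it would run the wait loop against a fake child that flips WAITING -> RUNNING -> SUCCEEDED once each); the assertion just counts the pause_cascade transition lines.

```python
import logging

def test_one_info_line_per_observed_transition(caplog):
    # run_child_and_wait is hypothetical test scaffolding, not an SDK API.
    with caplog.at_level(logging.INFO):
        run_child_and_wait()
    cascade = [m for m in caplog.messages if m.startswith("pause_cascade:")]
    transitions = [m for m in cascade if "status_transition=" in m]
    assert len(transitions) == 2  # one info line per WAITING/RUNNING flip, no more
```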
Test plan

- ruff check . — clean
- python -m pytest --no-cov — passes (exit 0)
- implement_from_issue run that pauses on hax approval
- pause_cascade: log lines on github-buddy to identify which link broke