
obs(sdk): info-level diagnostics on the pause-cascade hot path#569

Merged
AbirAbbas merged 2 commits into main from observability/pause-cascade-diagnostic-logging on May 11, 2026

Conversation

@AbirAbbas
Contributor

Summary

Promote five points on the cross-reasoner pause-propagation hot path from debug to info so the next failing run is diagnosable from logs alone — no behavior change.

Why

Run run_1778458437977_fb6f588b (production): implement_from_issue failed with "Reasoner 'implement_from_issue' timed out after 7200.0s of active time" at wall-clock duration_ms=7201393. Its child swe-planner.build was visibly WAITING on a hax-sdk approval for ≥20 min (the gap between the plan completing at 00:42:34 and the revision iteration of run_architect starting at 01:02:57). Math: if the cascade had fired, pause_clock.total_paused() ≥ 1200s → active_elapsed ≤ 6001s → the watchdog wouldn't have tripped. It tripped at 7201s wall-clock, so pause_clock.total_paused() ≈ 0 — the cascade never started.

A local 3-hop repro (outer→middle→inner-with-app.pause) at the same SDK version (0.1.83) does fire the cascade correctly (the outer's CP exec went to waiting / awaiting_child within seconds and survived 30s past a 5s budget). Something differs in production. All cascade toggles were emitted at logger.debug level, so the production log was silent on which specific link broke — no signal on whether the parent_pause_clock lookup missed, whether the polling task ever observed the awaited child as WAITING, or whether pause_clock.start_pause() fired.
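
For orientation, a minimal sketch of that arithmetic (assuming a monotonic-clock PauseClock; the internal names here are guesses, but total_paused() counting an in-progress pause matches how the tests assert on it, per the commit notes further down):

```python
import time


class PauseClock:
    """Minimal sketch; field names are assumptions, not the SDK's."""

    def __init__(self) -> None:
        self._completed = 0.0    # seconds accumulated from finished pauses
        self._started_at = None  # monotonic timestamp of an in-progress pause

    def start_pause(self) -> None:
        if self._started_at is None:
            self._started_at = time.monotonic()

    def end_pause(self) -> None:
        if self._started_at is not None:
            self._completed += time.monotonic() - self._started_at
            self._started_at = None

    def total_paused(self) -> float:
        # Counts the in-progress pause too, so a caller can assert on it
        # while the awaited child is still WAITING.
        in_progress = (
            time.monotonic() - self._started_at if self._started_at else 0.0
        )
        return self._completed + in_progress


# The watchdog charges only active time against the budget:
#   active_elapsed = wall_elapsed - pause_clock.total_paused()
# For the failing run: wall_elapsed ≈ 7201s against a 7200s budget, so a
# fired cascade (total_paused ≥ 1200s) would have left active_elapsed ≤ 6001s
# and the watchdog quiet. It tripped, so total_paused ≈ 0.
```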

What this PR adds

All at info level (so they appear in production without dev_mode), all on the hot path but bounded: at most one line per state transition. A hedged sketch of the line shapes follows the list.

  1. agent.py — _execute_async_with_callback: log pause_clock registration with id() + execution_id + reasoner_name at reasoner entry.
  2. agent.py — app.call (cross-reasoner pause-propagation block): log target, parent_execution_id, parent_pause_clock id(), and a peek at the _pause_clocks keys. Distinguishes "no current_context" from "lookup missed" — they need different fixes.
  3. agent.py — watchdog firing: log wall_elapsed, total_paused, active_elapsed, budget, pause_clock_id at the moment the timeout cancels the reasoner. Tells "legitimate long active work" apart from "cascade never ran" without re-deriving from the run timeline.
  4. async_execution_manager.py — wait_for_result pause toggles: log start_pause and end_pause on the parent's pause_clock with the awaited child id + clock id + cumulative paused seconds.
  5. async_execution_manager.py — poll status mapper: log WAITING/RUNNING transitions on the awaited child at info (other transitions stay at debug to avoid spam from healthy workflows).
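
Roughly the shape of the promoted lines (a hedged sketch, not the diff: the logger name and helper signatures are assumptions; the fields are the ones enumerated above):

```python
import logging

logger = logging.getLogger("hax_sdk")  # logger name is an assumption


def log_clock_registration(pause_clock, execution_id: str, reasoner_name: str) -> None:
    # (1) reasoner entry: ties the clock's object identity to the execution
    logger.info(
        "pause_cascade: registered pause_clock id=%d for execution_id=%s reasoner_name=%s",
        id(pause_clock), execution_id, reasoner_name,
    )


def log_parent_clock_lookup(target: str, parent_execution_id, pause_clocks: dict) -> None:
    # (2) app.call pause-propagation block: "no current_context" and
    # "lookup missed" produce visibly different lines
    if parent_execution_id is None:
        logger.info("pause_cascade: target=%s no current_context", target)
        return
    clock = pause_clocks.get(parent_execution_id)
    logger.info(
        "pause_cascade: target=%s parent_execution_id=%s parent_pause_clock_id=%s known_clocks=%s",
        target, parent_execution_id,
        id(clock) if clock is not None else None, list(pause_clocks),
    )


def log_watchdog_firing(wall_elapsed: float, total_paused: float, budget: float, clock_id: int) -> None:
    # (3) the only error-level line, emitted iff the watchdog cancels the reasoner
    logger.error(
        "pause_cascade: WATCHDOG_FIRING wall_elapsed=%.1fs total_paused=%.1fs "
        "active_elapsed=%.1fs budget=%.1fs pause_clock_id=%d",
        wall_elapsed, total_paused, wall_elapsed - total_paused, budget, clock_id,
    )

# (4) and (5) follow the same one-line-per-transition shape in
# async_execution_manager.py: start_pause/end_pause with child id, clock id,
# and cumulative paused seconds, plus WAITING/RUNNING poll transitions at info.
```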

Diagnostic value

The next production run that hits this failure mode will show, in order:

  • (1) pause_cascade: registered pause_clock id=<X> for execution_id=<parent>
  • (2) pause_cascade: target=<child> parent_execution_id=<parent> parent_pause_clock_id=<X> (or None — which itself names the bug)
  • (5) pause_cascade: poll_observed execution_id=<child> status_transition=running->waiting (or missing — which names a different bug)
  • (4) pause_cascade: start_pause child=<child> pause_clock_id=<X> (or missing — names yet another bug)
  • (3) pause_cascade: WATCHDOG_FIRING ... total_paused=<N>s active_elapsed=<N>s budget=<N>s

Diffing expected vs. actual at each step pinpoints the broken link.

Validation Contract

  • No behavior change. Same imports, same control flow, same condition checks.
  • One additional log_info at reasoner start (per execution).
  • One additional log_info per app.call from a reasoner (per call).
  • One additional logger.info per WAITING/RUNNING transition on awaited child (bounded by the number of child status flips — typically 2-3 per pause cycle).
  • One additional log_error if and only if the watchdog actually fires (rare by design).

Test plan

  • ruff check . — clean
  • python -m pytest --no-cov — passes (exit 0)
  • Deploy to Railway test stack
  • Trigger another implement_from_issue run that pauses on hax approval
  • Read pause_cascade: log lines on github-buddy to identify which link broke

Cascade fired correctly in local repros but didn't on run
run_1778458437977_fb6f588b (implement_from_issue timed out at exactly
7200.0s "active time" while swe-planner.build was paused on a hax
approval for ~20min — math says pause_clock.total_paused()=0 despite
the awaited child being visibly WAITING in the CP). All cascade
toggles were debug-only, so production logs gave no signal on which
specific link broke. Promotes five points to INFO so the next
occurrence is diagnosable from logs alone:

- agent.py: log pause_clock registration with id+execution_id, log
  parent_pause_clock lookup result at app.call time (including the
  _pause_clocks keyset so a missing-entry case is visible), and log
  full pause_clock state at the moment the watchdog fires (wall
  elapsed, total_paused, active_elapsed, budget). The last one in
  particular tells "legitimate long active work" apart from "cascade
  never ran" without needing to re-derive it from the timeline.

- async_execution_manager.py: log start_pause / end_pause on the
  parent's pause_clock with the awaited child id + clock id, and log
  poll-observed WAITING<->RUNNING transitions for the awaited child
  (other status transitions stay at debug). The two together pin
  exactly when the polling task saw WAITING and whether the wait
  loop translated that observation into a clock pause.

No behavior change — same imports, same control flow. Safe on hot
paths: at most one log line per status transition on the awaited
child, and a one-line registration on reasoner entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner May 11, 2026 13:47
@github-actions
Contributor

github-actions Bot commented May 11, 2026

Performance

| SDK | Memory Δ | Latency Δ | Tests status |
|---|---|---|---|
| Python | 9.4 KB (+4%) | 0.32 µs (-9%) | ✓ No regressions detected |

test_wait_for_result_invokes_callbacks_on_child_waiting_transitions
raced the wait loop's 0.1s poll interval — the original 0.05s gap
between the RUNNING transition and the SUCCEEDED transition meant the
loop had a coin-flip chance of seeing WAITING and then jumping
straight to SUCCEEDED, never firing on_child_running. Any added work
in the toggle block (e.g. the diagnostic logger.info lines from the
prior commit) shifts that race; it passed on 3.12 and failed on
3.10/3.11.

Fix: bump both inter-transition sleeps from 0.05/0.20 to 0.30s — well
above the loop's 0.1s poll interval — so each transition is observed
deterministically. The widening applies only to this test because the
neighbouring test_wait_for_result_pauses_clock_on_child_waiting
asserts only on total_paused() (which includes the in-progress pause
via PauseClock.total_paused) and so doesn't race the way this one
did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
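
A minimal sketch of the widened driver (a hypothetical test shape; only the 0.1s poll interval and the 0.05/0.20 → 0.30s widening come from the message above):

```python
import asyncio

POLL_INTERVAL = 0.1  # wait_for_result's poll cadence, per the message above


async def drive_child_transitions(set_status) -> None:
    """Hypothetical driver for the flaky test: each inter-transition gap
    must comfortably exceed the poll interval, or the loop can skip a
    status entirely and never fire on_child_running."""
    set_status("WAITING")
    await asyncio.sleep(0.30)  # widened (was one of the 0.05/0.20 gaps)
    set_status("RUNNING")
    await asyncio.sleep(0.30)  # widened; the old 0.05s RUNNING->SUCCEEDED gap
                               # was a coin flip against the 0.1s poll tick
    set_status("SUCCEEDED")
```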
@github-actions
Contributor

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
|---|---|---|---|
| control-plane | 87.50% | 87.30% | ↑ +0.20 pp 🟡 |
| sdk-go | 91.90% | 90.70% | ↑ +1.20 pp 🟢 |
| sdk-python | 93.73% | 93.63% | ↑ +0.10 pp 🟢 |
| sdk-typescript | 92.68% | 92.56% | ↑ +0.12 pp 🟢 |
| web-ui | 89.91% | 90.01% | ↓ -0.10 pp 🟡 |
| aggregate | 89.02% | 89.01% | ↑ +0.01 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Contributor

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
|---|---|---|---|
| control-plane | 0 | ➖ | no changes |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 0 | ➖ | no changes |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit 1bd51fa into main May 11, 2026
27 checks passed
@AbirAbbas AbirAbbas deleted the observability/pause-cascade-diagnostic-logging branch May 11, 2026 14:28
