fix(sdk): poll-driven cross-reasoner pause propagation#562

Merged
AbirAbbas merged 1 commit into main from fix/poll-driven-pause-propagation
May 9, 2026
Conversation

@AbirAbbas
Contributor

Summary

The v0.1.80 listener-based pause propagation never fired in production because it depended on the SDK's SSE event-stream subscription, which is gated behind enable_event_stream (default False) and not enabled on any deployed service.

Reproduced on Railway run run_1778268481826_8c9dd544: implement_from_issue timed out at exactly 7200s wallclock despite a ~21-minute hax-sdk approval gap clearly visible in the SWE-AF logs (20:10:42 → 20:31:35), plus a second silent gap before the timeout.

Why the listener didn't fire

  • Control plane DOES publish execution_waiting to the bus (verified in execute_approval.go).
  • Parents subscribe to that bus only when enable_event_stream=True. Default is False and no service overrides it. Verified via railway ssh ... env: no service has the var set, and the only subscriber on the live SSE bus is the user's browser.
  • As a result, register_status_listener callbacks were never invoked: pause_clock.start_pause() was never called, total_paused() stayed at 0, active_elapsed equaled wallclock, and the watchdog tripped at the budget.
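The gate can be sketched in a few lines. This is a minimal illustration, not the SDK's implementation: `enable_event_stream`, `register_status_listener`, and `_fire_status_listeners` are names from this PR, but the surrounding structure and the event shape are assumptions.

```python
import asyncio


class AsyncExecutionManager:
    """Minimal sketch of the v0.1.80 listener path; internals are assumed."""

    def __init__(self, enable_event_stream: bool = False):
        # Default False, and no deployed service overrides it.
        self.enable_event_stream = enable_event_stream
        self._status_listeners = []

    def register_status_listener(self, callback):
        # Parents registered callbacks here in v0.1.80 ...
        self._status_listeners.append(callback)

    def _fire_status_listeners(self, exec_id, status):
        for cb in self._status_listeners:
            cb(exec_id, status)

    async def consume_event_stream(self, events):
        # ... but the only code path that fires them is gated on the flag.
        if not self.enable_event_stream:
            return  # gate closed: listeners registered, never invoked
        for exec_id, status in events:
            self._fire_status_listeners(exec_id, status)


fired = []
mgr = AsyncExecutionManager()  # enable_event_stream defaults to False
mgr.register_status_listener(lambda exec_id, status: fired.append((exec_id, status)))
asyncio.run(mgr.consume_event_stream([("run_x", "WAITING")]))
assert fired == []  # the pause toggle hung off a subscription that never ran
```

With the flag off, a WAITING event on the bus is simply never delivered to the parent, which is exactly the production behavior described above.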

The previous tests poked _on_child_status_change and _handle_event_stream_payload directly, which bypassed the enable_event_stream gate entirely. They passed; production was broken.

Fix

Replace the SSE-listener mechanism with a poll-driven toggle inside wait_for_result:

  • The polling task already runs unconditionally and already updates _executions[id].status from control-plane responses (_update_execution_from_status).
  • wait_for_result now reads that status on each loop iteration. When the status reads WAITING, the parent's pause_clock.start_pause() is called; when it reads anything else, end_pause() is called. A finally block closes any in-flight pause if we exit via terminal status, timeout, or exception.
  • No SSE subscription needed. No listener registry. No refcount machinery. Works whether or not enable_event_stream is on.
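The toggle can be sketched as follows. Only the names wait_for_result, pause_clock, start_pause/end_pause/total_paused, and the WAITING status come from this PR; the string statuses, signatures, and minimal Execution/PauseClock stand-ins are hypothetical.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class Execution:
    # Kept fresh by the unconditional polling task in the real SDK.
    status: str = "RUNNING"
    result: object = None


class PauseClock:
    """Tracks cumulative paused time so active elapsed = wallclock - total_paused()."""

    def __init__(self):
        self._paused_total = 0.0
        self._pause_started = None

    def start_pause(self):
        if self._pause_started is None:  # idempotent across repeated WAITING reads
            self._pause_started = time.monotonic()

    def end_pause(self):
        if self._pause_started is not None:
            self._paused_total += time.monotonic() - self._pause_started
            self._pause_started = None

    def total_paused(self):
        in_flight = 0.0
        if self._pause_started is not None:
            in_flight = time.monotonic() - self._pause_started
        return self._paused_total + in_flight


async def wait_for_result(executions, exec_id, pause_clock, wait_timeout,
                          poll_interval=0.05):
    """Pause the parent clock while the child reads WAITING; budget active time only."""
    start = time.monotonic()
    try:
        while True:
            status = executions[exec_id].status
            if status == "WAITING":
                pause_clock.start_pause()
            else:
                pause_clock.end_pause()
            if status in ("COMPLETED", "FAILED"):
                return executions[exec_id].result
            active = (time.monotonic() - start) - pause_clock.total_paused()
            if active > wait_timeout:  # watchdog counts active time, not wallclock
                raise TimeoutError(exec_id)
            await asyncio.sleep(poll_interval)
    finally:
        pause_clock.end_pause()  # close any in-flight pause on timeout/cancel/exception
```

The headline regression falls out directly: a WAITING window longer than wait_timeout keeps wallclock running but not active time, so the wait completes instead of tripping the budget.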

Removed

  • async_execution_manager.py: register_status_listener, _status_listeners, _fire_status_listeners, the execution.waiting event-type override (the WAITING-status branch in _handle_event_stream_payload stays for when SSE is on).
  • agent.py: _on_child_status_change, _waiting_children, _parent_paused_children, the listener registration in call(). The pause_clock kwarg pass-through (the actual fix) is preserved.

Net: −387 / +274 lines across 4 files.

Tests

Replaced the listener tests with integration tests that drive the production data path: they update _executions[id].status the same way the polling task does, then assert wait_for_result toggles the parent clock and survives a long WAITING window. Includes a regression for the headline scenario (WAITING window > wait_timeout, must complete) and a finally-block check for cancelled-while-waiting.
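The cancelled-while-waiting check can be exercised along these lines. This is a self-contained sketch with hypothetical minimal stand-ins for PauseClock, the execution record, and the wait loop; names follow the PR, internals are assumed.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class Execution:
    status: str = "RUNNING"


class PauseClock:
    def __init__(self):
        self.total = 0.0
        self.started = None

    def start_pause(self):
        if self.started is None:
            self.started = time.monotonic()

    def end_pause(self):
        if self.started is not None:
            self.total += time.monotonic() - self.started
            self.started = None


async def wait_for_result(executions, exec_id, clock, poll_interval=0.02):
    try:
        while executions[exec_id].status not in ("COMPLETED", "FAILED"):
            if executions[exec_id].status == "WAITING":
                clock.start_pause()
            else:
                clock.end_pause()
            await asyncio.sleep(poll_interval)
    finally:
        clock.end_pause()  # under test: no dangling pause when cancelled mid-WAITING


async def cancelled_while_waiting():
    executions = {"e1": Execution(status="WAITING")}
    clock = PauseClock()
    waiter = asyncio.ensure_future(wait_for_result(executions, "e1", clock))
    await asyncio.sleep(0.1)  # let the loop observe WAITING and start a pause
    waiter.cancel()
    try:
        await waiter
    except asyncio.CancelledError:
        pass
    assert clock.started is None  # finally closed the in-flight pause
    assert clock.total > 0.05     # the WAITING window was recorded
    return clock
```

Cancellation surfaces as CancelledError inside the polling loop's await, so the finally block is the only thing standing between the parent and a pause that never ends.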

Test plan

  • cd sdk/python && ruff check . — clean
  • cd sdk/python && python -m pytest --no-cov — 1482 passed, 4 skipped (integration tests requiring server sources, pre-existing)
  • CI green
  • Verify on Railway after release: re-run an implement_from_issue flow that hits the hax-sdk gate; parent should not time out at wallclock 2hr if active work < 2hr

🤖 Generated with Claude Code

Commit message

The v0.1.80 listener-based propagation never fired in production because
it depended on the AsyncExecutionManager's SSE event-stream loop, which is
gated behind ``enable_event_stream`` (default False) and was not enabled
on any deployed service. Result: parents waiting on an ``app.call`` that
hit a hax-sdk human-approval gate kept ticking wallclock, and the parent
watchdog tripped at exactly the budget despite the visible WAITING state
(reproduced on Railway run_1778268481826_8c9dd544).

Replace the SSE-listener mechanism with a poll-driven toggle inside
``wait_for_result``: when the awaited child's status reads as WAITING
(updated unconditionally by the existing polling task), pause the parent's
pause-clock; when it reads back as not-WAITING, end the pause. A finally
block closes any in-flight pause if we exit via terminal/timeout. No SSE
subscription required, no listener registry, no refcount machinery.

Removes from ``async_execution_manager.py``:
  - ``register_status_listener`` / ``_status_listeners`` / ``_fire_status_listeners``
  - the ``execution.waiting`` event-type override (the WAITING-status branch
    in ``_handle_event_stream_payload`` stays for the case where SSE is on)

Removes from ``agent.py``:
  - ``_on_child_status_change`` and the ``_waiting_children`` /
    ``_parent_paused_children`` registries it consumed
  - the listener registration in ``call()``; the ``pause_clock`` kwarg
    pass-through (the actual fix) is preserved

Tests: the previous tests poked ``_on_child_status_change`` and
``_handle_event_stream_payload`` directly, which bypassed the
``enable_event_stream`` gate and never exercised the production data path.
The new tests drive the production path: they update ``_executions[id].status``
the same way the polling task does and assert that ``wait_for_result``
toggles the parent clock and survives a long WAITING window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner May 9, 2026 12:03

github-actions Bot commented May 9, 2026

Performance

| SDK | Memory | Δ | Latency | Δ | Tests Status |
|---|---|---|---|---|---|
| Python | 9.4 KB | +4% | 0.31 µs | -11% | ✓ |

✓ No regressions detected


github-actions Bot commented May 9, 2026

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
|---|---|---|---|
| control-plane | 87.40% | 87.30% | ↑ +0.10 pp 🟡 |
| sdk-go | 91.90% | 90.70% | ↑ +1.20 pp 🟢 |
| sdk-python | 93.66% | 93.63% | ↑ +0.03 pp 🟢 |
| sdk-typescript | 92.68% | 92.56% | ↑ +0.12 pp 🟢 |
| web-ui | 89.91% | 90.01% | ↓ -0.10 pp 🟡 |
| aggregate | 88.99% | 89.01% | ↓ -0.02 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.


github-actions Bot commented May 9, 2026

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
|---|---|---|---|
| control-plane | 0 | ➖ | no changes |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 0 | ➖ | no changes |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit 5bfd1ed into main May 9, 2026
32 checks passed
@AbirAbbas AbirAbbas deleted the fix/poll-driven-pause-propagation branch May 9, 2026 15:49
AbirAbbas added a commit to Agent-Field/SWE-AF that referenced this pull request May 9, 2026
Picks up the poll-driven cross-reasoner pause propagation fix shipped in
agentfield v0.1.81 (Agent-Field/agentfield#562). v0.1.80 attempted the same
fix via an SSE listener that was gated behind ``enable_event_stream``
(default off) and never fired in production — observed on a long
implement_from_issue run that timed out at exactly the 7200s wallclock
budget despite a long hax-sdk approval gap.

Bumping the constraint string is required to bust the Docker pip-install
layer cache; otherwise the cached layer would keep restoring 0.1.80 even
once 0.1.81 is on PyPI.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
