
fix(core): stop AbortSignal listener leak in long sessions (MaxListenersExceededWarning) #4366

Merged
doudouOUC merged 16 commits into QwenLM:main from doudouOUC:worktree-joyful-honking-melody
May 26, 2026

Conversation

@doudouOUC
Collaborator

@doudouOUC doudouOUC commented May 20, 2026

Summary

Fixes the MaxListenersExceededWarning: 1509 abort listeners added to [AbortSignal] warning that users hit in long interactive sessions.

The agent runtime nests parent→child AbortControllers (masterAbortController → per-message round → per-API-call round → tool execution). Each layer registered addEventListener('abort', ...) on its parent without {once:true} or reverse cleanup, so listeners piled up on long-lived parents across hundreds of model turns. After ~30–40 rounds it tripped Node's leak warning.

Two-layer fix:

  1. Structural — new packages/core/src/utils/abortController.ts helper:

    • createAbortController(maxListeners = 50) — factory that pre-caps the signal.
    • createChildAbortController(parent) — WeakRef-based propagation with {once:true} on the parent listener plus a reverse-cleanup listener on the child that detaches the parent listener when the child aborts. This is the core mechanism — short-lived children stop accumulating dead listeners on long-lived parents.
    • combineAbortSignals(signals, {timeoutMs}) — N-way combiner.
  2. Belt-and-suspenders: packages/cli/src/utils/warningHandler.ts hides any remaining MaxListenersExceededWarning.*AbortSignal from end users; debug mode (DEBUG=*/QWEN_DEBUG=*/NODE_ENV=development) keeps it visible.
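
The parent→child mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the code in packages/core/src/utils/abortController.ts: it assumes the parent is passed as an AbortSignal, it omits the maxListeners cap and the WeakRef bookkeeping, and the loop at the bottom is a made-up stand-in for per-round work.

```typescript
import { getEventListeners } from 'node:events';

// Hypothetical sketch of the reverse-cleanup pattern; names mirror the PR text,
// the body is an assumption.
function createChildAbortController(parent?: AbortSignal): AbortController {
  const child = new AbortController();
  if (!parent) return child;
  if (parent.aborted) {
    // Fast path: nothing to subscribe to, propagate immediately.
    child.abort(parent.reason);
    return child;
  }
  // Forward propagation; {once: true} drops the listener after the parent fires.
  const onParentAbort = () => child.abort(parent.reason);
  parent.addEventListener('abort', onParentAbort, { once: true });
  // Reverse cleanup: if the child aborts first, detach from the parent.
  child.signal.addEventListener(
    'abort',
    () => parent.removeEventListener('abort', onParentAbort),
    { once: true },
  );
  return child;
}

// One long-lived parent, many short-lived children.
const master = new AbortController();
for (let round = 0; round < 1000; round++) {
  const child = createChildAbortController(master.signal);
  // ... per-round work would use child.signal here ...
  child.abort(); // in real code this would sit in a finally block
}
const remaining = getEventListeners(master.signal, 'abort').length;
console.log(remaining); // 0

// Propagation still works for a child that is alive when the parent aborts.
const live = createChildAbortController(master.signal);
master.abort(new Error('cancelled'));
console.log(live.signal.aborted); // true
```

Because the child only detaches when it aborts, callers need to abort each child when its round ends, which is why the migrated call sites wrap their bodies in try/finally.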

Scope (deliberately narrow per @yiliang114's review): only the parent→child chain that actually accumulates listeners on a long-lived parent signal is migrated. Independent short-lived controllers (per-shell-command, per-fetch, per-recall, per-arena-session, per-monitor, per-spawn, etc.) stay on raw new AbortController() — they're GC'd at end of use and do not accumulate.

Migrated call sites:

  • agents/runtime/agent-interactive.ts (master + per-message round)
  • agents/runtime/agent-core.ts (per-iteration round + waitForExternalInputs + processFunctionCalls try/finally)
  • agents/runtime/agent-headless.ts (external → execution)
  • hooks/promptHookRunner.ts (had a real cleanup leak: manual addEventListener without {once:true} and never removed)
  • hooks/httpHookRunner.ts (migrated from the deprecated createCombinedAbortSignal shim to combineAbortSignals directly; shim deleted)

Plus three {once:true}-only fixes:

  • hooks/hookRunner.ts / hooks/functionHookRunner.ts / confirmation-bus/message-bus.ts

Plus openaiContentGenerator/pipeline.ts band-aid removal (per-request OpenAI signals are now children of the per-round controller, which carries maxListeners=50).

Direct proof

Self-contained reproducer at docs/verification/abort-controller-refactor/listener-accumulation-repro.mjs:

$ node docs/verification/abort-controller-refactor/listener-accumulation-repro.mjs
OLD pattern listener count on long-lived parent: 2000
NEW pattern listener count on long-lived parent: 0
PASS: OLD pattern accumulated >1500 listeners (reproduces the bug).
PASS: NEW pattern kept listener count at 0 — the helper prevents accumulation.
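
For readers without the repository at hand, the OLD half of that comparison can be approximated in a few lines. This is a sketch written for illustration, not the contents of listener-accumulation-repro.mjs; the round count matches the output above, and lifting the listener cap with setMaxListeners is an assumption made here so the demo only counts listeners.

```typescript
import { getEventListeners, setMaxListeners } from 'node:events';

const ROUNDS = 2000;
const parent = new AbortController();
// Assumption for this demo: remove the cap so we only measure the listener count.
setMaxListeners(0, parent.signal);

for (let i = 0; i < ROUNDS; i++) {
  const child = new AbortController();
  // OLD pattern: no {once: true}, and nothing ever removes the listener.
  parent.signal.addEventListener('abort', () => child.abort(parent.signal.reason));
  child.abort(); // the round ends, but the parent still holds the closure
}

const oldCount = getEventListeners(parent.signal, 'abort').length;
console.log(`OLD pattern listener count on long-lived parent: ${oldCount}`);
```

Each iteration registers a fresh closure, so the parent retains one listener per round until the parent itself aborts.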

Test plan

Automated (passing locally and on CI)

  • node docs/verification/abort-controller-refactor/listener-accumulation-repro.mjs — direct OLD-vs-NEW comparison
  • vitest run packages/core/src/utils/abortController.test.ts — 26 tests (factory cap, child propagation incl. signal-only retention, reverse cleanup, fast path, undefined parent, custom maxListeners, combineAbortSignals semantics incl. cleanup-cancels-timeout, timeout-cleans-input-listeners, timeoutMs <= 0 boundary, mid-iteration defensive check, GC safety best-effort)
  • vitest run packages/cli/src/utils/warningHandler.test.ts — 13 tests including a spawned-child integration test that asserts real Node default-printer stderr is empty for AbortSignal warnings and still contains unrelated DeprecationWarning text
  • vitest run packages/core/src/hooks/httpHookRunner.test.ts — covers the migrated combineAbortSignals consumer (the deprecated createCombinedAbortSignal shim and its test file were removed once the lone caller migrated)
  • All affected test suites pass (agent runtime, followup, hooks, services, tools, goal hook, prompt hook)
  • grep -rn "new AbortController" packages/core/src --include="*.ts" | grep -v test | grep -v abortController.ts returns the 19 intentionally-unmigrated independent controllers (see docs/verification/abort-controller-refactor/migration-completeness.txt for the captured list + rationale)
  • TypeScript strict-mode typecheck clean for core + cli
  • Prettier + ESLint clean
  • npm run build:packages succeeds; node packages/cli/dist/index.js --version boots clean under NODE_OPTIONS=--trace-warnings

End-to-end scenarios I drove locally

Scripts live under docs/verification/abort-controller-refactor/scripts/. Each captures its run to docs/verification/abort-controller-refactor/logs/.

Boot under --trace-warnings — proves no warnings on startup.

$ NODE_OPTIONS=--trace-warnings node packages/cli/dist/index.js --version
0.15.11

Expected: version line, no (node) MaxListenersExceededWarning. Observed: clean.

02-lite — single real-Qwen prompt under --trace-warnings — proves the steady-state happy path emits no warning.

$ docs/verification/abort-controller-refactor/scripts/02-lite.sh
EXIT=0
MaxListenersExceededWarning count: 0
--- log ---
OK

Input: --prompt "Reply with exactly 'OK' and nothing else."
Expected: exit 0, response OK, zero MaxListenersExceededWarning mentions in stderr.
Observed: matches expected.

06 — headless --prompt + SIGINT — proves SIGINT propagates cleanly through the abort plumbing all the way down.

$ docs/verification/abort-controller-refactor/scripts/06-headless-sigint.sh
EXIT_CODE=130 (expected 130)
MaxListenersExceededWarning count: 0

Input: long-essay prompt + kill -INT <pid> after 6 s.
Expected: exit code 130 (128 + 2, where 2 is the signal number of SIGINT), no warnings emitted during the interrupted stream.
Observed: matches expected.
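
The expected exit code follows the usual shell convention of 128 plus the signal number. A small sketch of that arithmetic, where exitCodeForSignal is a hypothetical helper that does not exist in the PR:

```typescript
import { constants } from 'node:os';

// Hypothetical helper: conventional exit status for a process ended by a signal.
function exitCodeForSignal(name: 'SIGINT' | 'SIGTERM' | 'SIGKILL'): number {
  return 128 + constants.signals[name];
}

console.log(exitCodeForSignal('SIGINT')); // 130
```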

Direct listener-accumulation repro — head-to-head proof of the structural fix.

$ node docs/verification/abort-controller-refactor/listener-accumulation-repro.mjs
Simulating 2000 rounds for each pattern.

OLD pattern listener count on long-lived parent: 2000
NEW pattern listener count on long-lived parent: 0
PASS: OLD pattern accumulated >1500 listeners (reproduces the bug).
PASS: NEW pattern kept listener count at 0 — the helper prevents accumulation.

Input: 2000 short-lived child controllers on one long-lived parent.
Expected: OLD ≥ 1500 listeners (reproduces the user-reported bug), NEW = 0.
Observed: matches expected.

Scenarios that still need a human at the keyboard

These exercise interactive-TUI code paths and were not self-verifiable here. The behavior each one targets is covered by unit tests (agent-interactive.test.ts cancellation suite for ESC mid-stream; shell.test.ts for shell-tool abort; agent.test.ts for subagent cancel; the helper's own 26 tests for listener accounting). Listed for completeness so a reviewer or release manager can replay them:

  • 00 Baseline reproduction on main — confirm warning triggers at ~30–40 rounds (needs a long mixed-tool session against the real API)
  • 01 Long-session, DEBUG mode — no MaxListenersExceededWarning after 50+ rounds
  • 02 Long-session, prod mode — clean output across 50+ rounds
  • 03 ESC mid-stream — stream stops within ~200 ms, next prompt accepts input (TUI ESC has dual meaning depending on streaming state; reliable automation needs a stream-token sentinel)
  • 04 Cancel long-running shell tool — child process killed, agent accepts next prompt (shell-tool policy blocks foreground commands > 2 s; background shells are designed to survive parent abort, so this only meaningfully exercises in interactive approval-mode)
  • 05 Subagent cancellation — child agent's tools + model stream abort cleanly
  • 07 Background agent flow — first completes, second cancels cleanly, no cross-leak
  • 08 Heap snapshots round 0 / 50 / 100 — AbortSignal instance count + per-signal listener count stable

Review notes

  • Independently reviewed twice via Codex. First pass surfaced three issues (throw between controller creation and explicit abort in agent-core.ts / agent-headless.ts, overly-broad warning regex, and process.removeAllListeners('warning') stripping third-party listeners) — all addressed in the same commit. Second pass confirmed no remaining blockers.
  • Three reverse-cleanup tests in abortController.test.ts directly verify the listener-count invariant: 1000 short-lived children on one long-lived parent leave 0 listeners.

🤖 Generated with Qwen Code

…s in long sessions

Users hit `MaxListenersExceededWarning: 1509 abort listeners added to
[AbortSignal]` in long interactive sessions. The agent runtime nests
parent→child controllers (masterAbortController → per-message round →
per-API-call round → tool execution) and each layer registered its own
`addEventListener('abort', ...)` on the parent without `{once:true}` or
reverse cleanup, so listeners accumulated on long-lived parents across
hundreds of model turns.

Add `utils/abortController.ts` with three helpers:

- `createAbortController(maxListeners = 50)` — factory that pre-caps the
  signal so the warning never fires on per-request signals.
- `createChildAbortController(parent)` — WeakRef-based parent→child
  propagation with `{once:true}` on the parent listener AND a reverse-cleanup
  listener on the child that detaches the parent listener when the child
  aborts. This is the key mechanism — short-lived children stop accumulating
  dead listeners on long-lived parents.
- `combineAbortSignals(signals, {timeoutMs})` — N-way combiner that replaces
  the existing one-input `combinedAbortSignal.ts` (kept as a `@deprecated`
  shim so `httpHookRunner.ts` doesn't churn).

Migrate every production `new AbortController()` in `packages/core/src` (24
sites) to the helper. Wrap `_runReasoningLoopInner` per-iteration body and
`AgentHeadless.execute` in `try/finally` so the round controller is aborted
(triggering reverse cleanup) even when the model stream or tool execution
throws. Add `{once:true}` to the manual abort listeners in `hookRunner`,
`functionHookRunner`, and `message-bus` that were missing it. Remove the
`raiseAbortListenerCap` band-aid from `openaiContentGenerator/pipeline.ts` —
no longer needed now that the per-round signal carries `maxListeners=50`.

Add `cli/utils/warningHandler.ts` as a belt-and-suspenders: hides
`MaxListenersExceededWarning.*AbortSignal` from end users in production
(any shape Node ≥20 emits), keeps it visible under `DEBUG`/`QWEN_DEBUG`/
`NODE_ENV=development`. Uses `process.on('warning', ...)` without
`removeAllListeners` so third-party warning subscribers stay intact.

Direct reproducer in `docs/verification/abort-controller-refactor/` proves
the old pattern accumulates 2000 listeners over 2000 rounds while the new
pattern stays at 0.
Copilot AI review requested due to automatic review settings May 20, 2026 18:38
@github-actions
Contributor

📋 Review Summary

This PR addresses a critical MaxListenersExceededWarning issue in long-running interactive sessions by introducing a robust AbortController helper module with automatic listener cleanup. The implementation is thorough, well-tested, and includes both structural fixes and a user-facing warning suppression layer. The code quality is excellent with comprehensive test coverage and careful attention to edge cases.

🔍 General Feedback

  • Excellent problem diagnosis: The PR clearly identifies the root cause (nested AbortControllers without proper cleanup) and provides a self-contained reproducer demonstrating the issue and fix.
  • Two-layer defense strategy: Smart approach combining structural fixes (proper listener management) with user-facing warning suppression for any edge cases.
  • Comprehensive migration: All 24 production call sites migrated, verified via grep—no stragglers.
  • Outstanding test coverage: 19 tests for the core helper, 9 for the warning handler, plus verification that all 71 affected test suites pass.
  • GC-aware design: Use of WeakRef for parent/child references shows deep understanding of JavaScript memory management.
  • Debug-mode passthrough: Keeps warnings visible during development while suppressing in production—excellent developer experience consideration.

🎯 Specific Feedback

🟢 Medium

  • File: packages/core/src/utils/abortController.ts:78-95 - The propagateAbort and removeAbortHandler functions use WeakRef but the comment at line 95 acknowledges a caveat: "if the parent controller is held ONLY through the child's WeakRef, it can be GC'd before its abort fires." While the comment notes this is safe for current usage patterns, consider adding a runtime assertion or TypeScript type constraint to enforce that parents passed to createChildAbortController are strongly held elsewhere. This would prevent future regressions if usage patterns change.

  • File: packages/cli/src/utils/warningHandler.ts:17 - The regex /MaxListenersExceededWarning.*AbortSignal/ is well-targeted, but consider making it slightly more explicit by matching the exact message format Node.js produces. Current regex is good, but could be: /MaxListenersExceededWarning.*\d+ abort listeners added to \[AbortSignal/ to be even more specific and avoid any false positives if Node changes the format slightly.

🔵 Low

  • File: packages/core/src/utils/abortController.ts:1 - Consider adding a JSDoc @example block to the module-level documentation showing the basic usage pattern. The individual functions have good docs, but a quick example at the module level would help adopters:

    // @example
    // const parent = createAbortController();
    // const child = createChildAbortController(parent);
    // child.signal.addEventListener('abort', () => { ... });
    // // When done: child.abort() automatically cleans up parent listener
  • File: packages/core/src/agents/runtime/agent-core.ts:601-853 - The try/finally wrapping the per-iteration body is correct, but the comment at line 850 ("Reverse-cleanup fires whether the iteration ended normally, broke, returned, or threw") could benefit from a reference to the specific invariant this satisfies from the helper's documentation (the three invariants mentioned in abortController.ts:78).

  • File: packages/core/src/utils/abortController.test.ts:175 - The GC safety test is marked as requiring --expose-gc. Consider adding a note in the test description or a separate README section explaining how to run this specific test, as GC behavior is critical to the memory leak fix.

  • File: docs/verification/abort-controller-refactor/README.md - The verification matrix is comprehensive. Consider adding a "Success Criteria" column that explicitly states what metrics to check (e.g., "listener count ≤ 50", "exit code 0", "no warnings in stderr").

✅ Highlights

  • Brilliant use of {once: true} + reverse cleanup pattern: The two-way cleanup mechanism ensures listeners never accumulate on either parent or child, solving the leak from both directions.

  • Self-contained reproducer: The listener-accumulation-repro.mjs script is a model example of bug documentation—runs without build dependencies, clearly demonstrates old vs. new behavior, and provides immediate validation.

  • Idempotent cleanup: The combineAbortSignals cleanup function being idempotent (tested explicitly) and auto-invoked on abort shows careful API design that prevents double-free issues.

  • Preserved backward compatibility: The @deprecated wrapper around createCombinedAbortSignal allows gradual migration while keeping existing consumers working.

  • Independent review integration: Incorporating feedback from two Codex review passes and addressing all three issues found (throw-between-controller creation, regex broadness, removeAllListeners side effect) demonstrates excellent review responsiveness.

  • Migration completeness verification: Using grep to verify zero stragglers is a simple but effective quality gate that should be a template for similar refactors.

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses long-session MaxListenersExceededWarning issues by introducing a shared AbortController/AbortSignal helper in packages/core and migrating runtime/tool code to use it, plus adding a CLI-level warning suppressor as a fallback.

Changes:

  • Added utils/abortController.ts with createAbortController, createChildAbortController (reverse-cleanup), and combineAbortSignals, plus unit tests.
  • Migrated many core runtime/services/tools to use the helper and to abort() child controllers in finally blocks to trigger reverse-cleanup.
  • Added CLI warningHandler to suppress remaining AbortSignal-related MaxListenersExceededWarning in non-debug mode, and added verification artifacts/docs.

Reviewed changes

Copilot reviewed 28 out of 30 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/core/src/utils/fetch.ts | Uses createAbortController() for timeout-based fetch abort. |
| packages/core/src/utils/abortController.ts | New helper implementing capped controllers, parent→child propagation, reverse-cleanup, and signal combining. |
| packages/core/src/utils/abortController.test.ts | Unit tests covering helper behavior (including listener-count invariants and GC best-effort). |
| packages/core/src/tools/shell.ts | Migrates internal controllers to helper to avoid listener warnings/leaks. |
| packages/core/src/tools/monitor.ts | Migrates monitor entry controller creation to helper. |
| packages/core/src/tools/agent/agent.ts | Uses helper for background/foreground agent abort controllers; aborts child in finally for reverse-cleanup. |
| packages/core/src/services/chatRecordingService.ts | Migrates auto-title controller to helper. |
| packages/core/src/services/chatCompressionService.ts | Migrates fallback abort signal creation to helper. |
| packages/core/src/memory/manager.ts | Migrates dream abort controller creation to helper. |
| packages/core/src/hooks/hookRunner.ts | Adds { once: true } to abort listeners to avoid accumulation. |
| packages/core/src/hooks/functionHookRunner.ts | Adds { once: true } to abort listeners to avoid accumulation. |
| packages/core/src/hooks/combinedAbortSignal.ts | Deprecation shim delegating to combineAbortSignals. |
| packages/core/src/followup/speculation.ts | Replaces manual parent abort wiring with createChildAbortController + abort in finally. |
| packages/core/src/core/openaiContentGenerator/pipeline.ts | Removes per-request raiseAbortListenerCap workaround. |
| packages/core/src/core/client.ts | Migrates recall abort controller to helper. |
| packages/core/src/confirmation-bus/message-bus.ts | Adds { once: true } to abort listener. |
| packages/core/src/agents/runtime/agent-interactive.ts | Uses helper for master/round controllers and aborts round in finally for reverse-cleanup. |
| packages/core/src/agents/runtime/agent-headless.ts | Uses createChildAbortController(externalSignal) and aborts in outer finally to guarantee cleanup. |
| packages/core/src/agents/runtime/agent-core.ts | Uses per-round child controllers and aborts them in finally to prevent parent listener buildup. |
| packages/core/src/agents/background-agent-resume.ts | Migrates background-task controllers to helper. |
| packages/core/src/agents/arena/ArenaManager.ts | Migrates arena master controller and links agent controllers to master via createChildAbortController. |
| packages/cli/src/utils/warningHandler.ts | Adds process-level warning handler to suppress AbortSignal MaxListenersExceededWarning in non-debug. |
| packages/cli/src/utils/warningHandler.test.ts | Tests for warning suppression behavior and debug passthrough. |
| packages/cli/src/gemini.tsx | Installs the warning handler during CLI startup. |
| docs/verification/abort-controller-refactor/test-summary.txt | Captured test run summary artifact. |
| docs/verification/abort-controller-refactor/smoke-boot.log | Captured CLI boot smoke output artifact. |
| docs/verification/abort-controller-refactor/README.md | Manual verification plan and scenario matrix. |
| docs/verification/abort-controller-refactor/migration-completeness.txt | Captured migration completeness artifact (grep output). |
| docs/verification/abort-controller-refactor/listener-accumulation-repro.mjs | Standalone reproducer/proof script for listener accumulation vs new helper. |
| docs/verification/abort-controller-refactor/automated-results.md | Captured automated verification outputs and notes. |
Comments suppressed due to low confidence (2)

packages/core/src/utils/abortController.ts:142

  • Auto-cleanup isn’t guaranteed if an input aborts during setup: the per-source handler aborts the controller, but the controller’s own signal.addEventListener('abort', cleanup, ...) is only registered later. If a source abort triggers before that registration, cleanup won’t run automatically and listeners may linger on other long-lived input signals unless the caller explicitly calls cleanup(). One fix is to make cleanup registration resilient (e.g., if done is already true, immediately run new cleanup fns), or to register the controller abort→cleanup hook before adding input listeners and ensure late-added cleanup fns still execute when already aborted.
  for (const sourceSignal of signals) {
    if (!sourceSignal) continue;
    const handler = () => controller.abort(sourceSignal.reason);
    sourceSignal.addEventListener('abort', handler, { once: true });
    cleanups.push(() => sourceSignal.removeEventListener('abort', handler));

docs/verification/abort-controller-refactor/README.md:26

  • The tmux example command hardcodes an absolute path into the repo worktree. Consider switching to $WT/... (or a generic placeholder path) so the instructions work for any developer without editing and don’t leak local path details.
tmux new-session -d -s qwen-verify-XX
tmux pipe-pane -t qwen-verify-XX -o "cat >> $LOGDIR/XX-name.log"
tmux send-keys -t qwen-verify-XX 'cd /path/to/your/test/workspace && exec node /Users/jinye.djy/Projects/qwen-code/.claude/worktrees/joyful-honking-melody/packages/cli/dist/index.js' C-m
tmux attach -t qwen-verify-XX

Comment thread packages/core/src/utils/abortController.ts
Comment thread packages/cli/src/utils/warningHandler.ts Outdated
Comment thread docs/verification/abort-controller-refactor/README.md Outdated
Comment thread docs/verification/abort-controller-refactor/README.md Outdated
doudouOUC added 3 commits May 21, 2026 08:00
Four issues from the Copilot review:

1. combineAbortSignals — add a per-iteration `aborted` check inside the
   for-loop so we short-circuit if an input signal flips aborted between
   the initial scan and listener registration. In single-threaded JS this
   can't actually interleave, but the defensive check makes correctness
   obvious and protects against signals whose `aborted` getter has side
   effects. New test exercises the path via a Proxy that flips after the
   initial scan.

2. warningHandler docstring — was stale: said "AbortSignal / EventTarget"
   while the regex was tightened to AbortSignal-only in the previous review.

3. README.md — replace personal absolute path with `$WT` placeholder so
   the verification recipe is shareable.

4. README.md — replace the markdown table with per-scenario headed
   sections. Prettier had interpreted an inline `ps -ef | grep sleep`
   pipe character as a column separator, breaking the table rendering on
   GitHub. Per-section format is also easier to scan and edit.
… loop check

The previous version set the Proxy's `aborted` to true before calling
combineAbortSignals, so the initial `find` scan caught it and we took the
fast path — not the per-iteration check the test was meant to validate.

Switch to an access counter so `aborted` is false on the first read (during
`find`) and true on subsequent reads (inside the loop). This forces the
loop to enter, then catches the flip via the defensive per-iteration check
before any listener is attached to the next input.

Verified the test fails if the per-iteration check is removed.
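
The access-counter trick can be illustrated in isolation. The sketch below is an assumption about how such a Proxy might look, not the test's actual code; flipAfterFirstRead is a hypothetical name.

```typescript
// Hypothetical test double: `aborted` reads false once, then true afterwards.
function flipAfterFirstRead(signal: AbortSignal): AbortSignal {
  let reads = 0;
  return new Proxy(signal, {
    get(target, prop) {
      if (prop === 'aborted') return reads++ > 0;
      const value = Reflect.get(target, prop, target);
      // Bind methods so they run against the real signal, not the Proxy.
      return typeof value === 'function' ? value.bind(target) : value;
    },
  });
}

const fake = flipAfterFirstRead(new AbortController().signal);
const firstRead = fake.aborted; // what an initial scan would see
const secondRead = fake.aborted; // what an in-loop re-check would see
console.log(firstRead, secondRead); // false true
```

Binding methods to the real signal matters because AbortSignal methods and getters reject a Proxy as their receiver.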
…ng-melody

Two conflict resolutions, both folding the helper into main's new code:

- agent-core.ts: kept main's `randomUUID` import next to our
  `createChildAbortController` import.

- client.ts: collapsed main's manual addEventListener + {once:true} +
  removeEventListener-in-finally pattern (added in the recall-decoupling
  PR) into a single `createChildAbortController(signal)` call. The
  finally now triggers reverse cleanup via `controller.abort()` on
  success, achieving the same listener-hygiene goal main wanted. Kept
  main's `cancelPendingMemoryPrefetch()` short-circuit at the top.
…ortController repro passes lint

CI Lint flagged 11 no-undef errors in
docs/verification/abort-controller-refactor/listener-accumulation-repro.mjs
(AbortController, console, process) because the project's flat config
only declared Node globals for ./scripts/**/*.mjs.

The reviewer's suggestion (`/* eslint-env node */`) doesn't work under
ESLint 9 flat config — env directives are deprecated there. The proper
fix is to extend the existing script-globals block to also cover the
verification repro script under docs/.
Comment thread packages/cli/src/utils/warningHandler.ts
Comment thread packages/core/src/utils/abortController.ts Outdated
Two real bugs the reviewer caught and I confirmed locally:

1. warningHandler.ts didn't actually suppress anything. Adding a
   `process.on('warning')` listener does NOT prevent Node's default
   onWarning printer from writing to stderr — the default is just an
   ordinary listener registered in `lib/internal/process/warning.js`.
   My previous code therefore:
   - failed to suppress targeted AbortSignal warnings (they still hit
     stderr via the default printer)
   - produced a SECOND copy of every non-suppressed warning (default
     printer + my handler's own stderr.write)
   The unit tests missed it because they synthesised a fake warning and
   called `process.listeners('warning')` directly rather than going
   through `process.emitWarning`.

   Fix: snapshot the existing `'warning'` listeners (which include the
   default printer and any third-party telemetry hooks) BEFORE replacing
   them. Install ours as the sole listener. For non-suppressed warnings
   fan out to the captured set so the default printer + telemetry still
   fire; for suppressed warnings stop here. Tests now use
   `process.emit('warning', ...)` to drive the real listener chain, plus
   a spawned-child integration test that asserts the real stderr from
   `process.emitWarning` is empty for AbortSignal warnings and still
   contains DeprecationWarning text.

2. abortController.createChildAbortController kept a WeakRef to the
   child controller. A natural usage pattern — pass `child.signal` into
   an async API and drop the controller object — could let the
   controller be GC'd while the signal is still in use, after which
   `parent.abort()` would no longer propagate. Reproduced with
   `node --expose-gc`.

   Fix: hold the child strongly via the parent's listener closure. The
   reverse-cleanup listener still removes the closure when child aborts
   (closure releases child → GC-eligible), and the parent's `{once:true}`
   listener self-removes when parent fires (same effect). Net listener
   accounting on long-lived parents is unchanged; the only difference is
   the controller now stays alive long enough for propagation to reach
   downstream consumers that hold only the signal. Tests updated: drop
   the old `--expose-gc`-dependent assertion that abandoned children
   GC immediately (that was a property of the OLD contract); add a
   signal-only-retention test that verifies propagation under the new
   contract without needing GC at all.

Verified: 32 helper/warning tests pass (incl. spawned-child stderr
integration); 363 affected caller tests pass; typecheck + prettier +
eslint clean for the touched files.
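
The snapshot-and-fan-out approach from item 1 might look roughly like this. It is a sketch under stated assumptions: installWarningFilter is a hypothetical name, the regex is the one quoted in the PR summary, matching against name plus message is a guess at how the real handler builds its text, and a plain EventEmitter stands in for process so the demo leaves real warning listeners alone. The debug check is read on every emit rather than captured once.

```typescript
import { EventEmitter } from 'node:events';

const ABORT_SIGNAL_LEAK = /MaxListenersExceededWarning.*AbortSignal/;

type WarningListener = (warning: Error) => void;

// Hypothetical sketch: replace the 'warning' listeners with one filter that
// forwards everything except the targeted warning to the captured listeners.
function installWarningFilter(target: EventEmitter, isDebug: () => boolean): void {
  // Snapshot first; on `process` this set would include the default printer.
  const prior = target.listeners('warning') as WarningListener[];
  target.removeAllListeners('warning');
  target.on('warning', (warning: Error) => {
    const text = `${warning.name}: ${warning.message}`;
    if (!isDebug() && ABORT_SIGNAL_LEAK.test(text)) return; // suppressed
    for (const listener of prior) listener.call(target, warning);
  });
}

// Demo on a stand-in emitter.
const fakeProcess = new EventEmitter();
const printed: string[] = [];
fakeProcess.on('warning', (w: Error) => printed.push(w.name));

let debug = false;
installWarningFilter(fakeProcess, () => debug);

const leak = Object.assign(new Error('1509 abort listeners added to [AbortSignal]'), {
  name: 'MaxListenersExceededWarning',
});
const deprecation = Object.assign(new Error('Plain deprecation'), {
  name: 'DeprecationWarning',
});

fakeProcess.emit('warning', leak); // suppressed
fakeProcess.emit('warning', deprecation); // forwarded
debug = true;
fakeProcess.emit('warning', leak); // forwarded in debug mode
console.log(printed); // [ 'DeprecationWarning', 'MaxListenersExceededWarning' ]
```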
Collaborator

@wenshao wenshao left a comment

Additional notes (files not in PR diff):

  • packages/core/src/hooks/promptHookRunner.ts:233 — still uses new AbortController() + manual addEventListener('abort', ...) without {once:true} or reverse cleanup. This is the same anti-pattern the PR eliminates at 24 other sites. Migration completeness grep (new AbortController) misses it because the file was untouched.
  • packages/core/src/agents/runtime/agent-headless.test.ts:1121 — the only test covering the "model call throws" error path is it.skip(...). The PR's try/finally cleanup guarantee has no active automated test.

CI note: Test (ubuntu-latest) and Test (windows-latest) are failing; macOS and local runs pass. Likely pre-existing or flaky — not correlated with PR changes.

— qwen-latest-series-invite-beta-v34 via Qwen Code /review

Comment thread packages/core/src/utils/abortController.ts Outdated
Comment thread packages/core/src/agents/arena/ArenaManager.ts Outdated
Comment thread packages/cli/src/utils/warningHandler.ts Outdated
…s orphan listeners + runtime DEBUG toggle

Two real bugs the reviewer caught:

1. combineAbortSignals registered its cleanup listener on
   controller.signal AFTER the for-loop. Node does NOT fire 'abort'
   listeners added to an already-aborted signal, so when the
   per-iteration defensive check aborted the controller mid-loop, the
   cleanup never ran — orphaning every input-signal listener registered
   before the break, and leaving the (also-registered-after-the-break)
   setTimeout uncleared.

   Fix: skip timeout scheduling when controller.signal.aborted is
   already true post-loop, and when it's true call cleanup()
   synchronously instead of registering a doomed listener. Existing
   test for the mid-iteration path now also asserts that the
   pre-break input signal (a) has zero abort listeners — that's the
   assertion that catches the orphan bug. New test for the
   already-aborted-input + timeoutMs combination confirms the timer
   isn't scheduled (would otherwise overwrite the abort reason).

2. warningHandler captured isDebugMode() in a closure at init time, so
   toggling DEBUG / QWEN_DEBUG at runtime (e.g. via a /debug slash
   command) didn't update suppression behavior. Moved the check inside
   the handler — warnings are rare so the per-emit env-lookup cost is
   negligible. New test asserts a mid-stream DEBUG=1 flip starts
   forwarding suppressed warnings to the prior-listener chain.
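
A compact sketch of the combiner with the fix from item 1 applied. This is an illustration only: the return shape ({ signal, cleanup }) and the timeout reason are assumptions, and unlike the code described above it has no separate pre-loop scan, so an already-aborted input is caught by the in-loop check directly.

```typescript
import { getEventListeners } from 'node:events';

interface Combined {
  signal: AbortSignal;
  cleanup: () => void;
}

// Hypothetical sketch of an N-way combiner with synchronous cleanup on early abort.
function combineAbortSignals(
  signals: Array<AbortSignal | undefined>,
  options: { timeoutMs?: number } = {},
): Combined {
  const controller = new AbortController();
  const cleanups: Array<() => void> = [];
  let done = false;
  const cleanup = () => {
    if (done) return; // idempotent
    done = true;
    for (const fn of cleanups) fn();
  };

  for (const source of signals) {
    if (!source) continue;
    if (source.aborted) {
      controller.abort(source.reason);
      break;
    }
    const handler = () => controller.abort(source.reason);
    source.addEventListener('abort', handler, { once: true });
    cleanups.push(() => source.removeEventListener('abort', handler));
  }

  if (controller.signal.aborted) {
    // A listener added now would never fire, so release inputs synchronously
    // and skip the timer entirely.
    cleanup();
    return { signal: controller.signal, cleanup };
  }

  const { timeoutMs } = options;
  if (timeoutMs !== undefined && timeoutMs > 0) {
    const timer = setTimeout(() => controller.abort(new Error('timeout')), timeoutMs);
    cleanups.push(() => clearTimeout(timer));
  }
  controller.signal.addEventListener('abort', cleanup, { once: true });
  return { signal: controller.signal, cleanup };
}

const longLived = new AbortController();
const alreadyAborted = new AbortController();
alreadyAborted.abort('pre');

// The long-lived input is registered first, then the aborted one ends the loop.
const early = combineAbortSignals([longLived.signal, alreadyAborted.signal], { timeoutMs: 1000 });
const orphans = getEventListeners(longLived.signal, 'abort').length;
console.log(early.signal.aborted, orphans); // true 0

// Explicit cleanup releases the input listener and the pending timer.
const manual = combineAbortSignals([longLived.signal], { timeoutMs: 60_000 });
const attached = getEventListeners(longLived.signal, 'abort').length;
manual.cleanup();
const released = getEventListeners(longLived.signal, 'abort').length;
console.log(attached, released); // 1 0
```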
Comment thread packages/core/src/utils/abortController.test.ts
…to actually exercise the new !aborted check

Reviewer correctly pointed out that the previous version of this test
took the pre-loop fast path (since `a.abort('pre')` ran before
`combineAbortSignals`), so it never reached the in-loop guard at
abortController.ts:138.

Switched to the Proxy `aborted`-getter pattern from the sibling
mid-iteration test (so the loop genuinely re-checks `aborted` and
short-circuits inside the for-loop), and added a `setTimeout` spy that
asserts the timer was never scheduled — this is the only observable
difference from "scheduled then immediately cleared by synchronous
cleanup()", which is what the timer-advance assertion alone couldn't
distinguish.

Verified by mutation testing: removing the guard makes the new test
fail; restoring it makes it pass. Refs PR QwenLM#4366.
Comment thread packages/core/src/utils/abortController.test.ts
… in combineAbortSignals

Reviewer noted the timeout path only had an empty-input test, leaving
the leak-sensitive case uncovered: when timeoutMs fires with a
long-lived source signal in the input list, do the input-side
listeners get released? They do (the timeout callback aborts the
combined controller, which fires the auto-cleanup listener registered
on its signal, which calls the per-input removeEventListener), but
that path wasn't tested.

Adds a test that snapshots the source listener count before, asserts
it increased by 1 after combineAbortSignals returns, advances fake
timers past timeoutMs, and asserts the count returns to baseline.
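
The cleanup chain described above can be sketched as follows. This is a simplified stand-in, not the code in abortController.ts: `combineSignalsSketch` and `CombinedSignal` are hypothetical names, and the timeout reason is an arbitrary `Error`. It shows the auto-cleanup listener on the combined signal, the per-input `removeEventListener`, and the guard that skips the timer when an input is already aborted.

```typescript
import { getEventListeners } from 'node:events';

interface CombinedSignal {
  signal: AbortSignal;
  cleanup: () => void;
}

function combineSignalsSketch(
  inputs: AbortSignal[],
  options: { timeoutMs?: number } = {},
): CombinedSignal {
  const controller = new AbortController();
  const detachers: Array<() => void> = [];
  let timer: ReturnType<typeof setTimeout> | undefined;
  let cleaned = false;

  // Idempotent: detaches every input listener and clears the timer.
  const cleanup = (): void => {
    if (cleaned) return;
    cleaned = true;
    for (const detach of detachers) detach();
    if (timer !== undefined) clearTimeout(timer);
  };

  for (const input of inputs) {
    if (input.aborted) {
      controller.abort(input.reason);
      break;
    }
    const onAbort = (): void => controller.abort(input.reason);
    input.addEventListener('abort', onAbort, { once: true });
    detachers.push(() => input.removeEventListener('abort', onAbort));
  }

  if (controller.signal.aborted) {
    // Already aborted: release listeners now and never schedule the timer.
    cleanup();
  } else {
    // Any abort of the combined signal (input or timeout) releases the inputs.
    controller.signal.addEventListener('abort', cleanup, { once: true });
    const { timeoutMs } = options;
    if (timeoutMs !== undefined && timeoutMs > 0) {
      timer = setTimeout(() => controller.abort(new Error('timeout')), timeoutMs);
    }
  }
  return { signal: controller.signal, cleanup };
}

// Convenience wrapper for listener-count assertions.
function abortListenerCount(signal: AbortSignal): number {
  return getEventListeners(signal, 'abort').length;
}
```

A test in the style described above would record `abortListenerCount(source.signal)` before the call, expect it to grow by one afterwards, and expect it to return to the baseline once the timeout fires or `cleanup()` is called.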

Refs PR QwenLM#4366.
yiliang114 previously approved these changes May 21, 2026
Collaborator

@yiliang114 yiliang114 left a comment

LGTM. Core mechanism is solid — the createChildAbortController reverse-cleanup pattern correctly prevents listener accumulation on long-lived parents, and all critical bugs surfaced during review (WeakRef GC issue, warningHandler not actually suppressing, combineAbortSignals orphan listeners) have been addressed.

The one minor gap (promptHookRunner.ts:233 still using raw new AbortController) is not a real leak — the signal there is per-round and short-lived — but could be cleaned up in a follow-up for migration consistency.

…ndows

CI failure on windows-latest:
  AssertionError: expected '\r\nnode:internal/modules/run_main:12…'
                  to match /DeprecationWarning.*Plain deprecation/
  Error [ERR_UNSUPPORTED_ESM_URL_SCHEME]: Only URLs with a scheme in:
    file, data, and node are supported by the default ESM loader. On
    Windows, absolute paths must be valid file:// URLs. Received protocol 'd:'

The e2e test wrote a child script with an `import "<helperPath>"` where
helperPath was a raw Windows absolute path (`D:\a\qwen-code\...`). Node's
ESM loader parses that as a URL on Windows and rejects the `D:` "scheme".

Converted the helper path to a `file://` URL via `pathToFileURL`. macOS
test still passes; the Windows-specific schemes-must-be-URL behavior is
now honored. Refs PR QwenLM#4366.
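
A minimal sketch of the conversion, assuming the test builds the child script's import line from an absolute helper path. `importLineFor` is a hypothetical name; `pathToFileURL` is the `node:url` function named above.

```typescript
import { pathToFileURL } from 'node:url';

// Build the import line for a generated child script. `helperPath` is an
// absolute filesystem path; the ESM loader needs a file:// URL instead,
// otherwise a Windows drive letter such as `D:` is parsed as a URL scheme.
function importLineFor(helperPath: string): string {
  const helperUrl = pathToFileURL(helperPath).href;
  return `import ${JSON.stringify(helperUrl)};`;
}
```

The same call works on every platform, so the test does not need a Windows-specific branch.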
Collaborator

@wenshao wenshao left a comment

Both Round 4 suggestions addressed: timeout-triggered listener cleanup test added for combineAbortSignals, and Windows E2E test fixed via pathToFileURL. All CI green (including Windows). LGTM! ✅ — qwen-latest-series-invite-beta-v34 via Qwen Code /review

Collaborator

@wenshao wenshao left a comment

Additional findings (not inline):

  • [Critical] packages/core/src/hooks/promptHookRunner.ts:233 — Missed migration site. Uses new AbortController() with anonymous signal.addEventListener('abort', () => { ... }) — no {once: true}, no removeEventListener, no reverse cleanup. Each hook invocation leaks one listener on the parent signal. This is the exact pattern the PR eliminates everywhere else. Similarly goalHook.ts:70 and goalHook.ts:169 remain un-migrated.

  • [Suggestion] ArenaManager.ts:821 — Per-agent child controllers are aborted in cancel/timeout paths but NOT on normal agent exit (handleAgentExit). Reverse-cleanup is deferred until session teardown. (Skipped inline due to existing comment at this line.)

  • [Suggestion] eslint.config.js — No no-restricted-syntax rule prevents new AbortController() in new code. The migration convention will erode without a lint guard.

— qwen-latest-series-invite-beta-v34 via Qwen Code /review

Comment thread packages/core/src/agents/runtime/agent-core.ts
Comment thread packages/cli/src/utils/warningHandler.ts
Comment thread packages/core/src/utils/abortController.ts
Comment thread packages/core/src/hooks/combinedAbortSignal.ts Outdated
Comment thread packages/core/src/utils/abortController.test.ts
Comment thread docs/verification/abort-controller-refactor/automated-results.md
doudouOUC added 2 commits May 21, 2026 18:55
…grate missed sites, tighten tests

Adopted 6 of the 7 review threads (skipping the debug-logging suggestion).

1. processFunctionCalls onAbort leak (CRITICAL): the new
   `finally { roundAbortController.abort(); }` in _runReasoningLoopInner
   would fire the `onAbort` handler in `processFunctionCalls` if
   scheduler.schedule or batchDone threw (the explicit
   removeEventListener at the old happy-path exit would be skipped),
   emitting spurious "Tool call cancelled by user abort." TOOL_RESULT
   events for every un-emitted callId — corrupting the transcript and
   misleading the model on the next round. Fixed by wrapping schedule
   + batchDone in their own try/finally so removeEventListener always
   runs before the outer finally's abort.

2. Migrate 3 new-from-main `new AbortController()` sites that this
   PR's audit missed (they came in via the merge from main):
   - goals/goalHook.ts (2 sites: judgeController, fallback signal) —
     consistency
   - hooks/promptHookRunner.ts (1 site: internalAbortController) —
     real leak (manual addEventListener without {once:true} or
     cleanup, exactly the pattern this PR exists to fix). Switched to
     createChildAbortController + finally `internalAbortController.abort()`
     for reverse cleanup on the success path.

3. Repro script (`listener-accumulation-repro.mjs`): inlined helper
   diverged from production — used WeakRef on child, while production
   was changed to strong-ref earlier in this PR. Updated the inlined
   copy to match production exactly, with a comment noting the
   intentional WeakRef-on-parent-only pattern.

4. warningHandler.ts: documented the snapshot-and-replace trade-offs
   in the JSDoc (late-added listeners bypass our filter; late
   `removeListener` calls have no effect on our fan-out). Tried the
   re-snapshot-per-warning approach the reviewer suggested but it
   doesn't work — `removeAllListeners('warning')` permanently removes
   the snapshot from Node's tracking, so a `process.listeners('warning')`
   filter at fan-out time always returns empty for prior listeners.
   The current design is the right trade-off; documentation is the
   correct fix.

5. abortController.test.ts: added three coverage gaps the reviewer
   identified —
   - createChildAbortController forwards custom maxListeners
   - manual cleanup() before scheduled timeout fires cancels it
   - timeoutMs <= 0 is treated as "no timeout"

6. Migrated `httpHookRunner.ts:202` (the lone caller of the deprecated
   `createCombinedAbortSignal`) to `combineAbortSignals` directly,
   then deleted `combinedAbortSignal.ts` + its test. All semantics
   covered by `combineAbortSignals` tests in abortController.test.ts.
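
The ordering fix in item 1 can be sketched as follows. This is a synchronous, simplified stand-in: `runBatch` and `runRound` are hypothetical names, `work` stands in for scheduler.schedule plus batchDone (which are asynchronous in the real code), and the `emitted` array stands in for TOOL_RESULT events.

```typescript
// Inner step: registers onAbort, and always detaches it in its own finally.
function runBatch(
  roundSignal: AbortSignal,
  work: () => void,
  emitted: string[],
): void {
  const onAbort = (): void => {
    emitted.push('Tool call cancelled by user abort.');
  };
  roundSignal.addEventListener('abort', onAbort);
  try {
    work(); // may throw
  } finally {
    // Runs on both the success and the throw path, before the caller's
    // outer finally aborts the round controller.
    roundSignal.removeEventListener('abort', onAbort);
  }
}

// Outer step: aborts the round controller in finally for reverse cleanup.
function runRound(work: () => void): string[] {
  const emitted: string[] = [];
  const roundAbortController = new AbortController();
  try {
    runBatch(roundAbortController.signal, work, emitted);
  } catch {
    // A real loop would surface the error; swallowed here so that
    // `emitted` can be inspected by the caller.
  } finally {
    roundAbortController.abort();
  }
  return emitted;
}
```

Without the inner `finally`, a throwing `work` would leave `onAbort` attached, and the outer `abort()` would push a spurious cancellation message.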

Refreshed `migration-completeness.txt` (now empty — clean grep).
Tests: 194 pass across abortController/warningHandler/agent-runtime/
followup/hooks/goal/promptHook suites. Typecheck + prettier clean.
…y the PR body

The PR body's "End-to-end scenarios I drove locally" section points at
docs/verification/abort-controller-refactor/scripts/02-lite.sh and 06-headless-sigint.sh.
These are the actual reproducible commands behind the EXIT codes /
warning counts reported there — checking them in so anyone can replay
without copy-pasting from the PR description.

Refs PR QwenLM#4366.
Comment thread docs/verification/abort-controller-refactor/automated-results.md Outdated
Two doc fixes the reviewer flagged:

- migration-completeness.txt was a 0-byte file with a confusing
  cross-reference. Populated with the actual grep command + its
  "(no output)" result so the empty-output state is explicit.

- automated-results.md still referenced combinedAbortSignal.test.ts (8
  tests, @deprecated shim) — both files were deleted in 94e8c58 when
  httpHookRunner.ts migrated to combineAbortSignals directly. Replaced
  the line with a reference to httpHookRunner.test.ts. Also updated
  the test counts to reflect current state (26 abortController, 13
  warningHandler — both grew with the review cycle) and removed the
  stale combinedAbortSignal.ts entry from the prettier-check command.

Refs PR QwenLM#4366.
Collaborator

@wenshao wenshao left a comment

Reviewed after the substantial multi-round refactor history. The core mechanism (reverse-cleanup, strong-ref-on-parent, {once:true} migration) is sound and all Critical concerns from prior rounds were addressed. Three remaining suggestions are inline; one minor doc nit follows.

Minor — docs/verification/abort-controller-refactor/migration-completeness.txt: committed as 0 bytes, while automated-results.md item 2 says "See migration-completeness.txt for the captured grep." Either capture the actual grep output or delete the empty file and drop the reference.

Needs human review (low confidence — not posted inline):

  • packages/core/src/hooks/promptHookRunner.ts:317-319 — new internalAbortController.abort() in finally fixes a real leak (the success-path migration the author already flagged elsewhere on this PR), but promptHookRunner.test.ts has no listener-count assertion to lock the fix in.
  • packages/core/src/hooks/httpHookRunner.ts:220,255 — manual cleanup() on both success and catch branches works correctly today (cleanup is idempotent and auto-fires on abort), but it's exactly the manual-cleanup pattern the rest of this PR replaced with try { ... } finally { cleanup(); }.
  • packages/core/src/agents/runtime/agent-headless.test.ts:1121 has a pre-existing it.skip(...) for the model-throw case — the new outer try/finally in agent-headless.ts fires only on that path, so the new throw-cleanup behavior remains uncovered by CI.

— claude-opus-4-7 via Claude Code /qreview

Comment thread docs/verification/abort-controller-refactor/automated-results.md Outdated
Comment thread packages/core/src/agents/arena/ArenaManager.ts Outdated
Comment thread packages/core/src/followup/speculation.ts Outdated
Adopting 2 of 3 new review threads (the third — automated-results.md
drift — was already fixed in 5aa7110).

1. packages/core/src/agents/arena/ArenaManager.test.ts: pin the
   master→agent abort cascade introduced by switching per-agent
   controllers to `createChildAbortController(this.masterAbortController)`.
   New test spawns ≥2 agents, calls `manager.cancel()`, and asserts every
   `agentState.abortController.signal.aborted === true`. Existing cancel
   test only checked backend + status; if a future refactor re-introduced
   independent controllers, the cascade would silently regress.

2. packages/core/src/followup/speculation.test.ts: cover the
   `startSpeculation` abort wiring introduced when the manual
   addEventListener + .finally removeEventListener pattern got replaced
   by createChildAbortController + .finally abort(). Three tests:
   - parent abort propagates to state.abortController (lifetime contract)
   - parent-already-aborted fast path returns aborted state
   - parent-signal listener count returns to baseline after the fire-and-
     forget loop settles (reverse-cleanup proof)
   Mocked `runWithForkedChatModel` and `OverlayFs` so the background
   loop is a no-op — these tests only assert the synchronous wiring,
   not the loop's content.
Collaborator

@wenshao wenshao left a comment

No review findings at this commit. Downgraded from Approve to Comment: CI failing (Test on windows-latest, ubuntu-latest, macos-latest Node 22.x). Code review passes — core mechanism (createChildAbortController reverse-cleanup) is correct, all 24+ migration sites have proper try/finally cleanup, 26 unit tests + 13 warning handler tests all pass locally. — qwen-latest-series-invite-beta-v34 via Qwen Code /review

Comment thread packages/core/src/followup/speculation.test.ts Outdated
Comment thread packages/core/src/followup/speculation.test.ts Outdated
Comment thread packages/cli/src/gemini.tsx
Comment thread docs/verification/abort-controller-refactor/README.md Outdated
Two real CI blockers in the just-added speculation tests (TS2554 and
TS2339) plus stale doc counts the reviewer flagged.

1. saveCacheSafeParams takes 3 positional args (generationConfig,
   history, model), not a single object. Compile error on every
   platform. Fixed by switching to the correct shape; also moved
   getEventListeners to a static `import` at the top of the file
   (dynamic `await import('node:events')` exposes EventEmitter's
   static method via the namespace type rather than as a direct
   property, so destructuring fails type-check).

2. docs/verification/abort-controller-refactor/README.md still claimed
   "18 + 1 GC" tests for abortController and "9" for warningHandler;
   actual current counts are 26 and 13. Also dropped the stale
   combinedAbortSignal reference and added a note about the new
   ArenaManager cascade + startSpeculation wiring pin tests.

Refreshed smoke-boot.log against current built bin (still 0.15.11,
which is what package.json reports on this branch).

Refs PR QwenLM#4366.
Collaborator

@wenshao wenshao left a comment

No review findings at e31432f. Downgraded from Approve to Comment: CI still running (15 checks pending). Both Round 9 Critical tsc items (saveCacheSafeParams arity + getEventListeners import) are correctly fixed; speculation.test.ts passes 10/10. — qwen-latest-series-invite-beta-v34 via Qwen Code /review

Collaborator

@wenshao wenshao left a comment

No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review

@wenshao
Collaborator

wenshao commented May 21, 2026

PR #4366 — Maintainer merge-validation report

Date: 2026-05-22
Validator: maintainer (on a high-speed train), driven via Claude Code + tmux
Worktree: /Users/wenshao/Work/git/qwen-code-x6 on branch pr-4366
Validated HEAD: e126d794c (PR head e31432ff1 + a local merge of origin/main to simulate post-merge state)
PR: #4366 — "fix(core): stop AbortSignal listener leak in long sessions (MaxListenersExceededWarning)"


Verdict

PASS — safe to merge.

Every PR-author claim that could be reproduced locally was reproduced.
The interactive long-session path (the actual user-facing scenario the bug was reported in)
ran 50 rounds of real Qwen API traffic plus an ESC mid-stream cancel and emitted
zero MaxListenersExceededWarning in ~3.7 MB of --trace-warnings stderr output.


What was validated and how

1. Direct listener-accumulation reproducer — PASS

$ node docs/verification/abort-controller-refactor/listener-accumulation-repro.mjs
Simulating 2000 rounds for each pattern.

OLD pattern listener count on long-lived parent: 2000
NEW pattern listener count on long-lived parent: 0
PASS: OLD pattern accumulated >1500 listeners (reproduces the bug).
PASS: NEW pattern kept listener count at 0 — the helper prevents accumulation.

The OLD pattern reaches 2000 listeners on one parent in 2000 rounds (well past the
1500 user-reported threshold); the NEW pattern via createChildAbortController
stays at 0. This is the structural invariant the PR is built around.
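
The OLD and NEW patterns compared by the reproducer can be sketched as follows. This is a simplified illustration, not the contents of listener-accumulation-repro.mjs or abortController.ts: `oldChild`, `childSketch`, and `listenersAfter` are hypothetical names, and the WeakRef bookkeeping the PR describes is omitted to keep the sketch short.

```typescript
import { getEventListeners } from 'node:events';

// OLD pattern: every round leaves one listener on the long-lived parent.
function oldChild(parent: AbortController): AbortController {
  const child = new AbortController();
  parent.signal.addEventListener('abort', () => {
    child.abort(parent.signal.reason);
  });
  return child;
}

// NEW pattern (simplified): {once:true} on the parent listener, plus a
// reverse-cleanup listener on the child that detaches the parent listener.
function childSketch(parent: AbortController): AbortController {
  const child = new AbortController();
  if (parent.signal.aborted) {
    child.abort(parent.signal.reason); // fast path: nothing to register
    return child;
  }
  const onParentAbort = (): void => child.abort(parent.signal.reason);
  parent.signal.addEventListener('abort', onParentAbort, { once: true });
  child.signal.addEventListener(
    'abort',
    () => parent.signal.removeEventListener('abort', onParentAbort),
    { once: true },
  );
  return child;
}

// Simulates `rounds` short-lived children under one parent and reports
// how many abort listeners remain on the parent afterwards.
function listenersAfter(
  rounds: number,
  makeChild: (parent: AbortController) => AbortController,
): number {
  const parent = new AbortController();
  for (let i = 0; i < rounds; i++) {
    const child = makeChild(parent);
    child.abort(); // the round has finished
  }
  return getEventListeners(parent.signal, 'abort').length;
}
```

Calling `listenersAfter(n, oldChild)` leaves `n` listeners on the parent, while `listenersAfter(n, childSketch)` leaves none, because each child removes its own parent listener when it aborts.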

2. Focused unit suites — PASS (84 passed | 1 skipped)

$ npx vitest run \
    packages/core/src/utils/abortController.test.ts \
    packages/cli/src/utils/warningHandler.test.ts \
    packages/core/src/hooks/httpHookRunner.test.ts \
    packages/core/src/agents/arena/ArenaManager.test.ts \
    packages/core/src/followup/speculation.test.ts

 ✓ src/hooks/httpHookRunner.test.ts    (10 tests)
 ✓ src/utils/abortController.test.ts   (26 tests | 1 skipped — GC test needs --expose-gc)
 ✓ src/followup/speculation.test.ts    (10 tests)
 ✓ src/utils/warningHandler.test.ts    (13 tests, incl. 1.1s spawned-child stderr e2e)
 ✓ src/agents/arena/ArenaManager.test.ts (26 tests, incl. master→agent cascade pin)

 Test Files  5 passed (5)
      Tests  84 passed | 1 skipped (85)
   Duration  7.40s

The two newest pinning tests added in b4f36d43c (ArenaManager > cancel cascades the master abort to every spawned agent controller, and the parent-signal wiring in
speculation.test.ts) both pass.

3. Build + boot smoke — PASS

$ npm run build:packages  →  BUILD_EXIT=0   (34s)
$ NODE_OPTIONS=--trace-warnings node packages/cli/dist/index.js --version
0.16.0

No warning lines emitted during boot under --trace-warnings.

4. TypeScript strict-mode typecheck — PASS

$ node_modules/.bin/tsc -p packages/core/tsconfig.json --noEmit   →  CORE_TS=0
$ node_modules/.bin/tsc -p packages/cli/tsconfig.json  --noEmit   →  CLI_TS=0

5. Headless scripted scenarios — PASS

Both scripts shipped under docs/verification/abort-controller-refactor/scripts/:

$ bash scripts/02-lite.sh
EXIT=0
MaxListenersExceededWarning count: 0
--- log ---
OK

$ bash scripts/06-headless-sigint.sh
EXIT_CODE=130 (expected 130)
MaxListenersExceededWarning count: 0

Log files: logs/02-lite-short-prompt.log (3 B), logs/06-headless-sigint.log (5.8 KB).

6. Real interactive long-session via tmux — PASS ← the bug's actual reproduction surface

Driven through tmux session pr4366, window long-session, against the real Qwen API
(idealab.alibaba-inc.com OpenAI-compat endpoint), under
NODE_OPTIONS=--trace-warnings QWEN_DEBUG=1.
Full transcript captured via tmux pipe-pane to
logs/interactive-long-session.log (3.76 MB / 12 581 lines).

Round breakdown:

| Rounds | Workload | Warnings |
| --- | --- | --- |
| 1–30 | Reply with the number N and nothing else. (model-only) | 0 |
| 31–50 | Run shell command 'echo round-N' … (model + Shell tool, YOLO) | 0 |
| 51 | 1500-word essay, then ESC mid-stream at ~3 s into generation | 0 |
| 52 | Reply with just PONG → ✦ PONG (confirms session still alive) | 0 |
| 53 | Attempted sleep 30 shell tool; blocked by shell-policy (>2 s) | 0 |
| /quit | exit; session summary printed cleanly | 0 |

Final audit of logs/interactive-long-session.log:
  size:  3 763 375 bytes / 12 581 lines
  MaxListenersExceededWarning hits: 0
  'abort listeners' hits:           0
  any 'Warning:'/'Error:' line:     0

This is the headline result. The original bug was reported as
MaxListenersExceededWarning: 1509 abort listeners added to [AbortSignal] in long
interactive sessions. We exercised exactly that surface — long interactive session,
mixed model + tool rounds, ESC cancel — with --trace-warnings actively flagging
every Node warning, and got nothing.

ESC behavior: the streaming essay (Grace Hopper / FORTRAN / Backus, partially
rendered) stopped within the 2 s sampling window after ESC and printed
● Request cancelled.; the next prompt (Reply with just PONG) was accepted and
answered (✦ PONG). No leaked listener warning fired across the cancel.

7. Migration completeness — PASS

$ grep -rn "new AbortController" packages/core/src --include="*.ts" \
    | grep -v test | grep -v abortController.ts
(empty)

Every production new AbortController() in packages/core/src has been migrated
to the helper.


Scenarios not exercised here

These were called out in the PR body and remain not reproduced live; the reasoning
under each is why I judged that acceptable for merge.

  • 00 — Pre-fix baseline on main. Not re-run. The direct OLD-vs-NEW
    reproducer in §1 already shows OLD pattern reaching 2000 listeners on a single
    parent, which is a stronger and cheaper demonstration of the bug than asking
    the real API for 30+ rounds of mixed-tool conversation against pre-fix code.
  • 04 — Foreground shell-tool cancel. Attempted (round 53). Shell tool policy
    blocked sleep 30 as a long-blocking foreground command, exactly as the PR
    body anticipated ("shell-tool policy blocks foreground commands > 2 s"). The
    background-shell path is by design unaffected by parent abort, so this
    scenario is not exercisable in YOLO without disabling the policy. Not a
    blocker for the abort-listener fix.
  • 05 / 07 — Subagent + background-agent cancellation. Unit coverage exists
    (agent.test.ts, ArenaManager.test.ts's new cascade test). I did not drive
    these interactively. Risk if regressed: subagent cancellation broken — but the
    pinning test added in b4f36d43c directly asserts cancelSession cascades
    master → every spawned agent controller, which is the exact failure mode.
  • 08 — Heap snapshots over 100 rounds. Not run. The structural invariant
    (children remove their parent listeners on abort) is proven by §1 and by the
    three reverse-cleanup tests in abortController.test.ts. With listener count
    bounded, heap growth from AbortSignal leak is bounded by definition.

Risks / things to keep an eye on post-merge

  • Arena per-agent controller change (intentional behavioral change called out
    in PR body, ArenaManager.ts:821). The new
    createChildAbortController(this.masterAbortController) makes master.abort()
    cascade to every running arena agent automatically. The new pinning test
    covers it. Worth a glance in arena-flow CI / staging if you run arena
    smoke-tests pre-release.
  • Warning suppressor scope. warningHandler.ts matches only
    MaxListenersExceededWarning.*AbortSignal — a generic EventTarget
    variant test asserts it is NOT suppressed. If Node's warning text format
    changes (unlikely between minor versions), the suppressor silently misses.
    Low risk; the 13 warningHandler tests will catch it.
  • Modified files in worktree (not part of PR diff): package-lock.json,
    packages/vscode-ide-companion/NOTICES.txt, and
    scripts/installation/install-qwen-standalone.bat show modifications.
    These look like CRLF normalisation / install-script byproducts from prior
    build runs in this worktree — not generated by this validation pass and not
    in the merge. Confirmed they do not affect the test results.

Evidence artifacts (all in docs/verification/abort-controller-refactor/)

| File | Purpose |
| --- | --- |
| listener-accumulation-repro.mjs | Side-by-side OLD vs NEW listener count proof |
| logs/02-lite-short-prompt.log | Headless single-prompt + --trace-warnings output |
| logs/06-headless-sigint.log | Headless SIGINT-mid-stream output |
| logs/interactive-long-session.log | Headline: 50 rounds + ESC cancel, full tmux capture |
| automated-results.md | Pre-existing automated-test summary (still accurate) |
| MERGE_VALIDATION.md | This report |

@yiliang114
Collaborator

Thanks for the fix — the core approach (createChildAbortController with WeakRef + {once: true} + reverse cleanup) is solid and aligns with how upstream handles this.

However, I think the scope here is significantly larger than it needs to be. A few observations:

1. Not all new AbortController() sites need migration

The actual leak only happens in the nested parent→child chain (masterAbortController → per-message round → per-API-call round → tool execution). Independent, short-lived controllers (e.g., in hooks, one-off fetches, message-bus) don't accumulate listeners on a long-lived parent — they just need {once: true} at most.

For reference, upstream Claude Code has the same createChildAbortController helper but does not migrate all new AbortController() call sites — only the ones with actual parent-child propagation relationships. There are still 30+ raw new AbortController() usages in their codebase.

2. The blast radius makes review difficult

39 files changed, +1851/−621 for what is fundamentally a listener cleanup fix. This makes it hard to review confidently and increases the risk of unrelated regressions.

Suggestion: Consider scoping this down to:

  • The abortController.ts helper (keep as-is, it's good)
  • Only the 3–5 call sites in the agent loop that actually have nested parent→child relationships
  • The {once: true} additions to hookRunner / functionHookRunner / message-bus
  • The warningHandler.ts as a belt-and-suspenders (optional, but fine to keep)

The remaining 20+ sites that are independent controllers don't benefit from the migration and just add churn. A smaller, focused PR would be much easier to review and safer to merge.

Collaborator

@wenshao wenshao left a comment

No issues found. LGTM! ✅

This is a well-engineered fix for the AbortSignal listener leak. The three core primitives (createAbortController, createChildAbortController, combineAbortSignals) are correctly designed with sound lifetime invariants (reverse cleanup, {once: true} self-removal, WeakRef parent pinning). The migration across ~20 call sites in packages/core is consistent, and the warningHandler correctly suppresses MaxListenersExceededWarning for AbortSignal while preserving other warnings.

Deterministic checks: tsc clean (0 errors), eslint clean (0 errors), all tests pass (49 passed, 1 skipped).

CI: 18/18 checks passing.

— qwen-latest-series-invite-beta-v34 via Qwen Code /review

@doudouOUC doudouOUC requested review from LaZzyMan and pomelo-nwu May 22, 2026 16:00
wenshao previously approved these changes May 22, 2026
Collaborator

@wenshao wenshao left a comment

No issues found. All findings are Suggestion-level only (warningHandler idempotency edge case, pipeline raiseAbortListenerCap enforcement gap, shell.ts AbortSignal.any() cleanup, test coverage gaps in throw paths). tsc/eslint clean, build passes, test failures pre-existing (unrelated). LGTM! ✅ — DeepSeek/deepseek-v4-pro via Qwen Code /review

…vert independent-controller migrations

Adopting @yiliang114's review feedback (QwenLM#4366 review comment, 2026-05-22):
keep only the migrations that fix the real leak path (the agent-runtime
parent→child chain that accumulates listeners on a long-lived parent
signal in long sessions) and revert the consistency-only migrations on
independent short-lived controllers.

Issue QwenLM#4423 confirms the user-visible bug is the nested-chain
accumulation — the reverted sites do not contribute to that bug.

Migrations KEPT:
- agents/runtime/agent-interactive.ts (master + per-message round)
- agents/runtime/agent-core.ts (per-iteration + wait + processFunctionCalls)
- agents/runtime/agent-headless.ts (external → execution)
- hooks/promptHookRunner.ts (real cleanup leak: addEventListener without
  {once:true}, never removed)
- hooks/httpHookRunner.ts → combineAbortSignals direct (shim deleted)
- hookRunner.ts / functionHookRunner.ts / message-bus.ts: {once:true} only
- openaiContentGenerator/pipeline.ts band-aid removal (per-request signals
  are children of the per-round controller, which carries maxListeners=50)
- warningHandler.ts belt-and-suspenders

Migrations REVERTED (independent short-lived controllers; restored to
`new AbortController()` + their original cleanup patterns):
- agents/arena/ArenaManager.ts (master + per-agent)
- agents/background-agent-resume.ts (3 sites)
- core/client.ts (recall — restored manual addEventListener + finally
  removeEventListener pattern from main)
- followup/speculation.ts (restored parentAbortHandler + finally
  removeEventListener)
- goals/goalHook.ts (judgeController + fallback signal)
- memory/manager.ts (dream controller)
- services/chatCompressionService.ts (fallback signal)
- services/chatRecordingService.ts (autoTitle controller)
- tools/agent/agent.ts (fg + bg subagent controllers — restored manual
  onParentAbort + finally removeEventListener)
- tools/monitor.ts (entryAc)
- tools/shell.ts (promote + 3 entryAc)
- utils/fetch.ts (fetchWithTimeout)

Tests removed alongside the reverts:
- ArenaManager.test.ts "cancel cascades...": the cascade itself was an
  intentional behavioral improvement that's now reverted, so the
  pin-test belongs with it
- speculation.test.ts "startSpeculation — abort-controller wiring" block
  (3 tests) — they tested helper-wired behavior we reverted

Verification docs updated to reflect the narrower scope.
Net change: 19 raw `new AbortController()` remain (intentional, per
migration-completeness.txt rationale); previously was 0.

Refs PR QwenLM#4366, issue QwenLM#4423.
@doudouOUC
Collaborator Author

Thanks for the careful review — adopted in 9323bc3. Reverted the 13 independent-controller migrations and kept only the parent→child sites that actually fix the user-visible bug (the agent-runtime nested chain plus promptHookRunner.ts which had a real cleanup leak).

Scope before: 39 files / +1851 −621, helper used in ~26 sites
Scope after: ~12 files / +110 −204 on top of the reverted state, helper used in only 4 sites (3 agent-runtime files + promptHookRunner.ts)

Mapping to your suggested scope:

Plus two things I'm keeping because they're real fixes, not consistency:

  • promptHookRunner.ts:233 — the original code did signal.addEventListener('abort', () => internalAbortController.abort()) with no {once:true} and never removed the listener. That's a parent→child leak in the same family as the agent-runtime one; switched to createChildAbortController(signal) + internalAbortController.abort() in finally.
  • pipeline.ts raiseAbortListenerCap removal — the per-request OpenAI signal becomes a child of the per-round controller (which carries maxListeners=50), so the explicit band-aid is no longer needed.
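
A minimal sketch of a controller factory that pre-caps its signal, assuming the cap is applied with `events.setMaxListeners`. `createCappedController` is a hypothetical name; whether the PR's createAbortController applies the cap this way is an assumption.

```typescript
import { setMaxListeners } from 'node:events';

// Creates a controller whose signal warns only after `maxListeners`
// abort listeners, instead of relying on a call-site band-aid.
function createCappedController(maxListeners = 50): AbortController {
  const controller = new AbortController();
  // events.setMaxListeners accepts EventTargets, so it can cap an AbortSignal.
  setMaxListeners(maxListeners, controller.signal);
  return controller;
}
```

With the cap set where the per-round controller is created, call sites that attach listeners to that signal do not need to raise the limit themselves.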

Reverted sites that the issue #4423 leak does NOT come from (tools/shell.ts ×3, tools/monitor.ts, tools/agent/agent.ts ×2, arena/ArenaManager.ts ×2, background-agent-resume.ts ×3, core/client.ts recall, followup/speculation.ts, utils/fetch.ts, memory/manager.ts, services/chatRecordingService.ts, services/chatCompressionService.ts, goals/goalHook.ts ×2). All restored to new AbortController() with their original cleanup patterns preserved.

Also dropped two pin-tests that specifically tested the reverted behavior (ArenaManager.test.ts cascade test + speculation.test.ts abort-wiring block). The 26 helper tests + 13 warningHandler tests + the existing agent-runtime tests still cover everything that matters for #4423.

Verification doc + PR body updated to reflect the narrower scope. 478 affected tests still pass; typecheck + prettier clean.

@wenshao
Collaborator

wenshao commented May 25, 2026

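Local validation report (maintainer review)

Merged the PR head directly into the current origin/main (a8a6ad2d0) with git merge to simulate the post-merge state. The merge is clean and the diff is still 25 files, +1698/-562, matching the PR description. The PR base is well behind main (merge-base ed14a3306), but the merge produces no conflicts.

Conclusion

✅ Recommend MERGE. All of the unit tests, the reproducer, and the headless E2E scripts shipped with the PR reproduce and pass on this machine; no warnings; no PR-introduced regression.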


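Validation matrix

| Check | Result | Notes |
| --- | --- | --- |
| listener-accumulation-repro.mjs | OLD: 2000 / NEW: 0 | Matches the PR description exactly |
| packages/core abortController.test.ts | 26 / 26 (1 skipped) | PR description claims 26 ✓ |
| packages/core httpHookRunner.test.ts | 10 / 10 | |
| packages/cli warningHandler.test.ts | 13 / 13 | Includes the spawned-child stderr integration assertion |
| tsc --noEmit across 4 packages (acp-bridge / cli / core / sdk) | 0 errors | |
| npm run lint at the repo root | 0 errors | |
| npm install (including prepare → build) | ✅ pass | 1372 deps; dist output for every workspace is normal |
| packages/core npm test (full run, 9420 cases) | ⚠️ 18 failed / 9398 passed | All 18 failures are pre-existing or parallel-run contamination; see §1 below |
| packages/cli npm test (full run) | ⚠️ vitest worker throws ENOENT, but the test cases themselves pass ✓ | Race on the coverage temp-file path; a toolchain problem, not a test failure |
| Real boot: --version under --trace-warnings | ✅ clean | 0.16.1, no warnings at all |
| PR script 02-lite.sh (real Qwen, single prompt) | EXIT=0, MaxListenersExceededWarning=0 | Response OK; matches the PR description exactly |
| PR script 06-headless-sigint.sh (SIGINT interrupting a long generation) | EXIT_CODE=130, warning=0 | Abort propagates cleanly through the whole chain |
| Migration-completeness grep (checked against the PR's migration-completeness.txt) | 19 independent controllers | Matches the list in the PR docs exactly |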

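§1. Attribution of the 18 core failures (none introduced by this PR)

Direct evidence: the same 3 failing files, run independently on both sides, give identical results.

| Run | Result |
| --- | --- |
| pr4366-on-main: vitest run gitDiff.test.ts crawler.test.ts skill-activation.test.ts | 1 failed / 122 passed |
| Same command on origin/main | 1 failed / 122 passed |
| The single independently failing case (same on both sides) | skill-activation > activates a skill keyed on src/**/*.ts ... pattern: "**/*.ts" |

Running the 4 failing files on origin/main gives 2 failed / 186 passed; the full 9420-case PR run gives 18 failed. The extra 16 are all cross-contamination between parallel workers or flakes, and have no point of contact with this PR's changes (the PR touches agents/runtime/*, hooks/*, confirmation-bus, and pipeline.ts, while the failures are in utils/gitDiff.ts, utils/filesearch/crawler.ts, skills/skill-activation.ts, and anthropicContentGenerator.test.ts, all of which are read-path or HTTP-path tests).

Suggest tracking these flakes as a separate issue; they do not block this PR.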

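PR fix mechanism: review of the key technical points

  • createChildAbortController(parent): the parent listener uses {once:true} so it is removed automatically; a reverse-cleanup listener is also attached to the child, so that when the child aborts it actively calls removeEventListener for the parent listener. This is the core mechanism for eliminating listener accumulation: a short-lived child that finishes early no longer leaves a dead listener on a long-lived parent.
  • combineAbortSignals({timeoutMs}): N-way merge plus timeout protection; when the timeout fires, it cleans up any listeners left on the input signals (pinned by a new test case added in PR commit c84034634).
  • WeakRef-based propagation: keeps the parent's reference from preventing GC of the child controller (the PR ships a best-effort GC safety test case).
  • warningHandler.ts fallback: suppresses only the MaxListenersExceededWarning.*AbortSignal form; debug mode (DEBUG=* / QWEN_DEBUG=* / NODE_ENV=development) keeps it visible; it does not touch process.removeAllListeners('warning') (the hazard raised in the first Codex review round has been fixed).
  • Restrained scope: only the parent→child chain that actually accumulates listeners was migrated; the other 19 independent short-lived controllers stay as plain new AbortController() (each corresponds to a fixed GC lifetime). The accompanying migration-completeness.txt names all 19 sites, and the grep results match it verbatim.
  • combinedAbortSignal.ts shim removed: its only caller, httpHookRunner.ts, has moved to combineAbortSignals, so the old shim and its old tests were deleted together to avoid two parallel implementations.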
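
As an illustration of the warningHandler.ts bullet above, the filtering decision can be written as a small pure predicate. This is a minimal sketch under stated assumptions, not the PR's actual code: the names ABORT_SIGNAL_LEAK, isDebugMode, and shouldSuppressWarning are hypothetical, the pattern is assumed to be matched against "name: message", and any non-empty DEBUG or QWEN_DEBUG value is assumed to count as debug mode.

```typescript
// Hypothetical sketch; the real warningHandler.ts may be structured differently.
const ABORT_SIGNAL_LEAK = /MaxListenersExceededWarning.*AbortSignal/;

type Env = Record<string, string | undefined>;

// Assumption: any non-empty DEBUG / QWEN_DEBUG value, or NODE_ENV=development,
// means the user wants to see the warning.
function isDebugMode(env: Env): boolean {
  return (
    Boolean(env['DEBUG']) ||
    Boolean(env['QWEN_DEBUG']) ||
    env['NODE_ENV'] === 'development'
  );
}

// Returns true only for the AbortSignal flavor of MaxListenersExceededWarning,
// and only outside debug mode. Every other warning stays visible.
function shouldSuppressWarning(
  warning: { name: string; message: string },
  env: Env,
): boolean {
  if (isDebugMode(env)) return false;
  return ABORT_SIGNAL_LEAK.test(`${warning.name}: ${warning.message}`);
}
```

Keeping the decision in a predicate like this makes it easy to unit-test separately from whatever mechanism actually intercepts the warning.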

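Risks and open items

  • ✅ No settings / migration / protocol changes; purely a runtime memory-correctness fix.
  • ✅ The PR ships a reproducible script and a complete migration inventory, and has passed two rounds of independent Codex review; auditability is strong.
  • ⚠️ The 8 "Scenarios that still need a human at the keyboard" in the PR description (ESC mid-stream, subagent cancel, heap snapshots, etc.) were not run by hand this time. These are TUI-interaction / long-session paths; unit tests already cover the core invariants (the cancellation cluster in agent-interactive.test.ts, plus shell.test.ts and agent.test.ts), so the release manager can spot-check them before release.
  • ⚠️ Verified on macOS only; the change concerns in-process Node EventEmitter behavior, which has no platform-specific surface.

Verification environment: macOS Darwin 25.4.0, Node v22.17.0, npm 11.8.0. Multiple worktrees were run in parallel under tmux (pr4366-on-main with the PR head merged, compared against an origin/main baseline), and the PR scripts 02-lite and 06-headless-sigint were run on a real machine.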

@LaZzyMan LaZzyMan added the type/bug Something isn't working as expected label May 26, 2026
Collaborator

@LaZzyMan LaZzyMan left a comment

Review

Solid fix for the MaxListenersExceededWarning users hit in long sessions. The new helper bounds listener accumulation on long-lived parent signals via three mechanisms that together cover every termination path: {once:true} on the parent listener for the parent-aborts-first path, a reverse-cleanup listener on the child for the child-aborts-first path, and a WeakRef for the parent reference so a parent that gets GC'd before its child doesn't leave the child holding a strong reference. Already-aborted parents take a synchronous fast-path that skips listener registration entirely.
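
The three mechanisms and the fast path described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the PR's implementation: the function name createChildAbortControllerSketch is hypothetical, it takes the parent controller rather than a signal, and it omits the max-listener cap mentioned in the PR summary.

```typescript
// Hypothetical sketch of the parent→child wiring; the real helper may differ.
function createChildAbortControllerSketch(
  parent: AbortController,
): AbortController {
  const child = new AbortController();

  // Fast path: an already-aborted signal never fires 'abort' again,
  // so abort the child immediately and register no listeners.
  if (parent.signal.aborted) {
    child.abort(parent.signal.reason);
    return child;
  }

  // The listener closures hold only weak references, so neither controller
  // is kept alive just because the other one still has a listener attached.
  const weakParent = new WeakRef(parent);
  const weakChild = new WeakRef(child);

  const onParentAbort = (): void => {
    weakChild.deref()?.abort(weakParent.deref()?.signal.reason);
  };

  // Parent aborts first: { once: true } drops the listener after it fires.
  parent.signal.addEventListener('abort', onParentAbort, { once: true });

  // Child aborts first: detach the now-dead listener from the parent.
  child.signal.addEventListener(
    'abort',
    () => {
      weakParent.deref()?.signal.removeEventListener('abort', onParentAbort);
    },
    { once: true },
  );

  return child;
}
```

With this shape, a long-lived parent only ever carries listeners for children that are still in flight.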

combineAbortSignals cleanup semantics also hold up: every registered input listener has a paired entry in the cleanup array, external abort and timeout both route through the same auto-cleanup, and the post-loop synchronous-cleanup fallback handles the mid-loop-break case where Node won't fire listeners on an already-aborted signal — exactly the orphan-listener bug addressed in earlier review rounds. The agent runtime refactors (interactive, core, headless) all use try/finally to guarantee the child controller is aborted on every exit path, so the reverse-cleanup fires regardless of normal completion, break, return, or exception. The processFunctionCalls onAbort removal in its finally is essential — without it, a throw between scheduling and completion would leak the listener and the outer round-controller abort would emit spurious cancellation results for un-emitted callIds. The promptHookRunner cleanup leak (manual addEventListener without {once:true} and no removal) is a real bug the migration closes.
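
The cleanup semantics credited to combineAbortSignals above (a paired cleanup entry per listener, one shared teardown path, and a post-loop fallback for the mid-loop break) might be sketched like this. It is an illustration only: combineAbortSignalsSketch, the CombinedSignal return shape, and the plain Error used as the timeout reason are all assumptions, and the real helper's signature may differ.

```typescript
// Hypothetical sketch; not the PR's actual combineAbortSignals.
interface CombinedSignal {
  signal: AbortSignal;
  cleanup: () => void;
}

function combineAbortSignalsSketch(
  signals: AbortSignal[],
  options: { timeoutMs?: number } = {},
): CombinedSignal {
  const combined = new AbortController();
  const cleanups: Array<() => void> = [];

  // Single teardown path shared by external abort, timeout, and manual cleanup.
  const cleanup = (): void => {
    for (const fn of cleanups.splice(0)) fn();
  };
  const abortWith = (reason: unknown): void => {
    cleanup();
    combined.abort(reason);
  };

  for (const input of signals) {
    if (input.aborted) {
      // An already-aborted input will not fire 'abort' again: stop registering.
      combined.abort(input.reason);
      break;
    }
    const onAbort = (): void => abortWith(input.reason);
    input.addEventListener('abort', onAbort, { once: true });
    // Every registered listener gets a paired cleanup entry.
    cleanups.push(() => input.removeEventListener('abort', onAbort));
  }

  if (combined.signal.aborted) {
    // Post-loop fallback: listeners added before the break would be orphaned.
    cleanup();
  } else if (options.timeoutMs !== undefined) {
    const timer = setTimeout(
      () => abortWith(new Error('combined signal timed out')),
      options.timeoutMs,
    );
    cleanups.push(() => clearTimeout(timer));
  }

  return { signal: combined.signal, cleanup };
}
```

A caller that finishes before any input aborts would call cleanup() itself, which detaches every listener and clears the pending timer.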

Reproducer (OLD: 2000 listeners → NEW: 0), 25 helper unit tests, and 13 warning-handler tests all pass. Codex's independent review found no actionable regressions either.
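
For readers who want to see the accumulation directly, a stripped-down reproducer in the same spirit could look like the sketch below. It is not the PR's listener-accumulation-repro.mjs: the function names are hypothetical, ROUNDS is chosen only to mirror the quoted figure, and the NEW pattern is inlined rather than calling the real helper. It relies on getEventListeners from node:events, which accepts EventTarget instances such as AbortSignal.

```typescript
import { getEventListeners } from 'node:events';

// Assumption: 2000 rounds, mirroring the OLD figure quoted in the review.
const ROUNDS = 2000;

// OLD pattern: each round attaches a listener to the long-lived parent
// and never removes it, even after the child is finished.
function countLeakyListeners(rounds: number): number {
  const parent = new AbortController();
  for (let i = 0; i < rounds; i++) {
    const child = new AbortController();
    parent.signal.addEventListener('abort', () => child.abort());
    child.abort(); // the round ends, but the parent listener stays behind
  }
  return getEventListeners(parent.signal, 'abort').length;
}

// NEW pattern: the child's own abort detaches its listener from the parent.
function countCleanedListeners(rounds: number): number {
  const parent = new AbortController();
  for (let i = 0; i < rounds; i++) {
    const child = new AbortController();
    const onParentAbort = (): void => child.abort();
    parent.signal.addEventListener('abort', onParentAbort, { once: true });
    child.signal.addEventListener(
      'abort',
      () => parent.signal.removeEventListener('abort', onParentAbort),
      { once: true },
    );
    child.abort(); // the round ends and the reverse cleanup runs
  }
  return getEventListeners(parent.signal, 'abort').length;
}

console.log(
  `OLD: ${countLeakyListeners(ROUNDS)} / NEW: ${countCleanedListeners(ROUNDS)}`,
);
```

The OLD count grows linearly with the number of rounds, while the NEW count stays at zero once every child has been aborted.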

Verdict

APPROVE — helper correctness and the migration's invariant (every created child controller is either aborted or its parent listener detached, on every code path) both hold up under inspection.
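
The invariant named in the verdict (every child controller is aborted on every exit path) is the kind of guarantee a try/finally wrapper gives. The sketch below is illustrative only: runRound and its callback shape are hypothetical names, and the parent→child wiring is inlined instead of calling the PR's helper.

```typescript
// Hypothetical sketch of a per-round wrapper; not the agent runtime's code.
async function runRound<T>(
  parent: AbortController,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const round = new AbortController();

  if (parent.signal.aborted) {
    // Already-cancelled parent: start the round in the aborted state.
    round.abort(parent.signal.reason);
  } else {
    const onParentAbort = (): void => round.abort(parent.signal.reason);
    parent.signal.addEventListener('abort', onParentAbort, { once: true });
    // Reverse cleanup: aborting the round detaches the parent listener.
    round.signal.addEventListener(
      'abort',
      () => parent.signal.removeEventListener('abort', onParentAbort),
      { once: true },
    );
  }

  try {
    return await work(round.signal);
  } finally {
    // Runs on normal completion, early return, and exceptions alike,
    // so the reverse cleanup always fires.
    round.abort();
  }
}
```

Because the abort sits in finally, a throw inside work cannot leave the round's listener attached to the parent.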

@doudouOUC doudouOUC merged commit 174e8de into QwenLM:main May 26, 2026
10 checks passed
@doudouOUC doudouOUC deleted the worktree-joyful-honking-melody branch May 26, 2026 06:21