Skip to content

feat(daemon): non-blocking POST /prompt — return 202 with promptId#4585

Merged
chiga0 merged 4 commits into
daemon_mode_b_mainfrom
feat/non-blocking-prompt
May 28, 2026
Merged

feat(daemon): non-blocking POST /prompt — return 202 with promptId#4585
chiga0 merged 4 commits into
daemon_mode_b_mainfrom
feat/non-blocking-prompt

Conversation

@chiga0

@chiga0 chiga0 commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Summary

Closes #4582.

POST /session/:id/prompt is now non-blocking: returns 202 Accepted immediately with { promptId, lastEventId }. The prompt completion is delivered asynchronously via turn_complete / turn_error SSE events, correlated by promptId.


1. Prior Design Audit

1.1 Original Blocking Model

The original design (documented in 03-http-api.md) established a dual-channel parallel model:

  • POST /prompt → HTTP connection held open until turn completes (blocking, returns 200 with `PromptResult`)
  • GET /events → SSE stream pushes real-time deltas (session_update, permission_request, etc.)

The blocking pattern was straightforward but coupled HTTP connection lifetime to LLM turn duration. The design doc explicitly shows `Promise.race(bridge.sendPrompt, deadlinePromise)` as the timeout mechanism, returning 504 on deadline breach.

1.2 Problems with the Blocking Approach

  1. Proxy/LB timeouts: Reverse proxies (nginx default 60s, ALB 60s) enforce idle-connection deadlines shorter than typical agent turns (minutes with tool calls).
  2. Deadline coupling: T2.9 `promptDeadlineMs` used `Promise.race` to emit HTTP 504 — clients had to handle two distinct error shapes (structured 504 JSON vs SSE events).
  3. Client disconnect ambiguity: TCP reset could mean "user cancelled" or "network hiccup". `res.on('close')` conflated intentional cancellation with transient failures.
  4. Resource pinning: Each in-flight prompt held an Express response object open for the entire turn duration.

1.3 External Design Doc Comparison

The external design document describes the original blocking + SSE architecture. Notably:

  • No mention of `promptId`, `turn_complete`, or `turn_error` — these are new concepts
  • The doc's event type enumeration (~14 types) does not include turn lifecycle events
  • The design relies on HTTP response completion as the implicit "turn done" signal

This PR extends the original design: the SSE channel now carries explicit turn lifecycle events, and the HTTP POST becomes a lightweight trigger.


2. /acp Transport Prior Art — Design Choice Rationale

2.1 /acp's Non-Blocking Pattern

The `/acp` route (PR #4472, MCP Streamable HTTP transport) already implements 202:

  • POST to `/acp` always returns `202 Accepted` (HTTP layer non-blocking)
  • Results are delivered via a long-lived SSE session stream
  • The transport layer internally manages request-response correlation

2.2 Why We Follow the /acp Pattern (Not Opt-In)

Initial proposal: Use `Prefer: respond-async` header for opt-in non-blocking, keeping 200 blocking as default.

Problems with opt-in:

  • Dual code paths in server (blocking + non-blocking) increase maintenance burden
  • Inconsistent with /acp which is unconditionally 202
  • The `Prefer` header is not part of the HTTP Streamable Transport proposal

Final approach (follows /acp): Unconditional 202, like the MCP Streamable HTTP transport. SDK handles the correlation transparently, just as the ACP transport layer routes JSON-RPC responses to pending requests.

Aspect /acp Transport REST /prompt (this PR)
HTTP response Always 202 Always 202 (was 200)
Result delivery JSON-RPC response in session SSE `turn_complete`/`turn_error` in session SSE
Correlation JSON-RPC `id` `promptId` (UUID)
SDK event routing Transport layer dispatches to pending requests `DaemonSessionClient._pendingPrompts` map dispatches to pending Promises

3. Architecture Impact Analysis

3.1 Bridge Layer (Additive Only)

  • `broadcastTurnComplete()` / `broadcastTurnError()` — side-effect `.then()` on the sendPrompt result promise, unconditionally published regardless of HTTP response mode
  • `getSessionLastEventId()` — reads `entry.events.lastEventId` (existing EventBus property)
  • FIFO queue, ACP child lifecycle — completely unaffected

3.2 EventBus & SSE (Correctness Argument)

  1. Server snapshots `lastEventId` before calling `sendPrompt`
  2. 202 response carries this `lastEventId` to the client
  3. `DaemonSessionClient` already has SSE running → `_dispatchTurnEvent` catches the matching event
  4. Fallback path: SDK opens temporary SSE with `Last-Event-ID` replay

Ring buffer (default 8000 frames) guarantees no event loss in the snapshot-to-dispatch window.

3.3 SDK — Two-Tier Design (Following ACP Transport Pattern)

`DaemonClient` (stateless HTTP layer):

  • `promptNonBlocking()` — sends POST, returns `{ promptId, lastEventId }` for 202
  • `prompt()` — sends POST, handles 202 via temporary SSE fallback (for standalone callers without existing subscription)
  • `matchTurnEvent()` — shared utility for turn event correlation

`DaemonSessionClient` (stateful transport layer, like ACP):

  • `_pendingPrompts: Map<promptId, { resolve, reject }>` — mirrors ACP transport's pending-request dispatch table
  • `prompt()` → when SSE subscription is active, uses `promptNonBlocking()` and registers pending entry; otherwise falls back to `DaemonClient.prompt()` with temporary SSE
  • `iterateEvents()` → intercepts `turn_complete`/`turn_error` via `_dispatchTurnEvent()` and resolves pending Promises before yielding to the consumer
  • SSE stream end → `_rejectAllPending()` cleans up

Consumer impact: ZERO. `DaemonSessionClient.prompt()` still returns `Promise` — WebUI, web-shell, IDE providers all work unchanged.

3.4 Web-Shell (Passive Observer Enhancement)

Adds `turn_complete` event handling for passive session viewers (non-prompt-originating tabs). Replaces the heuristic idle-timer for `assistant.done` dispatch.

3.5 Capability Tag

`non_blocking_prompt: { since: 'v1' }` — baseline (always-on) tag for client feature detection.


4. Backward Compatibility

Scenario Impact
SDK consumers (`DaemonSessionClient.prompt()`) None — same `Promise` API
SDK consumers (`DaemonClient.prompt()`) None — 202 handled via temp SSE fallback
Raw HTTP clients Breaking — 200→202, response body changed. Mitigated by `non_blocking_prompt` capability tag
Old SDK + New daemon `res.ok` true for 202, body parsed as `PromptResult` with `stopReason: undefined`. Daemon+SDK co-released
New SDK + Old daemon `prompt()` receives 200, handled by legacy path. `promptNonBlocking()` returns `PromptResult` directly

5. Feasibility Conclusion

Feasible and architecturally sound. Supporting arguments:

  1. /acp prior art validates the pattern — same codebase, proven in production
  2. Bridge layer unaffected — FIFO queue and ACP child lifecycle independent of HTTP response
  3. EventBus replay guarantees correctness — ring buffer closes the race window
  4. SDK upgrade transparent — two-tier design with ACP-style pending-request dispatch
  5. turn events have independent value — passive observers benefit regardless of HTTP layer

Risks to monitor:

  • Ring buffer overflow during snapshot-to-SSE window (theoretical, extremely unlikely)
  • Orphan prompts if client disconnects (prompt runs to completion, result published to SSE bus)

Changes

  • Bridge (`acp-bridge`): publishes `turn_complete` / `turn_error` SSE events unconditionally; exposes `getSessionLastEventId()`
  • Server (`cli/serve`): POST /prompt returns 202 with `{ promptId, lastEventId }`; deadline timer aborts via abort signal, outcome surfaced via `turn_error` SSE
  • SDK `DaemonClient`: `promptNonBlocking()` + `matchTurnEvent()` utility; `prompt()` retains temp SSE fallback
  • SDK `DaemonSessionClient`: ACP-style pending-prompt dispatch via existing SSE subscription
  • Web-shell: observes `turn_complete` for passive session viewers
  • Capability: `non_blocking_prompt` baseline tag

Test plan

  • 337 `server.test.ts` tests pass
  • 205 `bridge.test.ts` tests pass
  • 5 `daemon-public-surface.test.ts` tests pass
  • SDK + CLI TypeScript compilation clean
  • Manual: start daemon, send prompt via curl, verify 202 + SSE turn_complete
  • Manual: test deadline expiry → verify turn_error SSE event with promptId

Generated with AI

@chiga0 chiga0 requested review from doudouOUC and wenshao and removed request for wenshao May 28, 2026 03:11
@github-actions

Copy link
Copy Markdown
Contributor

📋 Review Summary

This PR implements non-blocking prompt handling for the daemon server, returning 202 Accepted immediately with promptId and lastEventId, while delivering completion asynchronously via SSE events. The implementation is well-architected with good backward compatibility, but has several critical issues around error handling, resource cleanup, and race conditions that must be addressed before merging.

🔍 General Feedback

Positive aspects:

  • Clean separation of concerns between bridge, server, and SDK layers
  • Excellent backward compatibility design (200/202 dual support)
  • Good use of SSE event correlation via promptId
  • Proper capability tagging for feature detection
  • Well-documented motivation in PR description

Architectural concerns:

  • Fire-and-forget pattern in server loses error visibility
  • Race condition between deadline timer and bridge completion
  • Missing cleanup for SSE subscriptions on error paths
  • Inconsistent error handling between legacy and new modes

Recurring themes:

  • Several places swallow errors that should be logged or tracked
  • Abort signal handling is complex and has edge cases
  • Event correlation relies on proper promptId propagation throughout the chain

🎯 Specific Feedback

🔴 Critical

File: packages/cli/src/serve/server.ts:1498-1507 - Uncaught promise rejection risk in fire-and-forget pattern

The .catch(() => {}) swallows ALL errors including legitimate bridge failures. While turn_error is published on the SSE bus, there's no fallback if the SSE publish itself fails or if the session is destroyed mid-prompt:

bridge
  .sendPrompt(...)
  .finally(() => {
    if (deadlineTimer !== undefined) clearTimeout(deadlineTimer);
  })
  .catch(() => {
    // Swallowed — `turn_error` already published on the SSE bus.
  });

Issue: If broadcastTurnError throws (e.g., session destroyed, event bus closed), this catch silently drops it. The client waits forever on SSE, and operators have no log trace.

Recommendation: Add structured logging for unexpected errors:

.catch((err) => {
  // turn_error already published, but log unexpected failures
  if (!(err instanceof DOMException && err.name === 'AbortError')) {
    writeStderrLine(
      `[prompt:${promptId}] Unexpected error after turn_error published: ${err.message}`,
    );
  }
});

File: packages/cli/src/serve/server.ts:1477-1483 - Race condition: deadline can fire after bridge completes

The deadline timer aborts the prompt, but there's a race window where:

  1. Bridge completes successfully and publishes turn_complete
  2. Deadline timer fires milliseconds later and aborts
  3. Client receives turn_error via SSE after already processing turn_complete

Issue: The abort happens unconditionally even if the bridge already settled:

if (effectiveDeadlineMs !== undefined) {
  deadlineTimer = setTimeout(() => {
    if (!abort.signal.aborted) {
      abort.abort(new PromptDeadlineExceededError(effectiveDeadlineMs));
    }
  }, effectiveDeadlineMs);
}

Recommendation: Track bridge completion state and skip abort if already done:

let promptSettled = false;
bridge
  .sendPrompt(...)
  .finally(() => {
    promptSettled = true;
    if (deadlineTimer !== undefined) clearTimeout(deadlineTimer);
  })
  ...

// In timer:
if (!promptSettled && !abort.signal.aborted) {
  abort.abort(new PromptDeadlineExceededError(effectiveDeadlineMs));
}

File: packages/sdk-typescript/src/daemon/DaemonClient.ts:1258-1294 - SSE subscription leak on client abort

When the caller's signal aborts during _awaitTurnComplete, the code calls cancel() out-of-band but doesn't verify the SSE stream actually terminates:

if (
  signal?.aborted &&
  err instanceof DOMException &&
  err.name === 'AbortError'
) {
  this.cancel(sessionId, clientId).catch(() => {});
  throw err;
}

Issue: The composedSignal includes sseAbort.signal, but if the SSE stream is mid-yield when abort fires, the for await loop may not exit cleanly, leaving the HTTP connection open.

Recommendation: Ensure SSE stream is explicitly destroyed:

} finally {
  if (!sseAbort.signal.aborted) sseAbort.abort();
  // Force stream cleanup if still active
  if (events && typeof (events as AsyncIterable<any>).return === 'function') {
    await (events as AsyncIterable<any>).return?.();
  }
}

File: packages/acp-bridge/src/bridge.ts:2193-2213 - AbortError swallow in turn_error path loses legitimate errors

result.then(
  (promptResult) => {
    broadcastTurnComplete(...);
  },
  (err) => {
    if (err instanceof DOMException && err.name === 'AbortError') return;
    broadcastTurnError(...);
  },
);

Issue: This swallows ALL AbortError instances, including those from internal bridge failures (not just user-initiated cancels). The server-side deadline abort uses the same DOMException type, so deadline-exceeded errors never reach clients via turn_error.

Recommendation: Distinguish between user-cancel and other aborts:

(err) => {
  if (err instanceof DOMException && err.name === 'AbortError') {
    // Only swallow if it's a genuine cancel, not deadline or internal abort
    if (err.message === 'The operation was aborted.') {
      return; // user cancel
    }
    // Fall through to broadcastTurnError for deadline/internal aborts
  }
  broadcastTurnError(entry, sessionId, err, promptId, originatorClientId);
},

Or better: use a custom error type for deadline exceeded that's checked explicitly.


🟡 High

File: packages/cli/src/serve/server.ts:1453-1462 - Missing validation for deadlineMs type

The code validates deadlineMs is a positive number but doesn't check for reasonable bounds. A client could send Number.MAX_SAFE_INTEGER or a negative value that passes the <= 0 check due to type coercion:

if (
  typeof rawRequestDeadline !== 'number' ||
  !Number.isFinite(rawRequestDeadline) ||
  !Number.isInteger(rawRequestDeadline) ||
  rawRequestDeadline <= 0
) {

Issue: Number.isFinite(Infinity) is false, but Number.isFinite(NaN) is also false — good. However, extremely large values (e.g., 1e15 ms = 31 million years) pass validation and could cause timer resource issues.

Recommendation: Add reasonable upper bound (e.g., 24 hours):

const MAX_DEADLINE_MS = 24 * 60 * 60 * 1000; // 24 hours
if (
  typeof rawRequestDeadline !== 'number' ||
  !Number.isFinite(rawRequestDeadline) ||
  !Number.isInteger(rawRequestDeadline) ||
  rawRequestDeadline <= 0 ||
  rawRequestDeadline > MAX_DEADLINE_MS
) {

File: packages/sdk-typescript/src/daemon/DaemonClient.ts:1276-1288 - SSE event matching doesn't validate sessionId

The _awaitTurnComplete method matches events by promptId only:

if (event.type === 'turn_complete') {
  const data = event.data as {
    promptId?: string;
    stopReason?: string;
  };
  if (data.promptId === promptId) {
    return { stopReason: data.stopReason ?? 'end_turn' };
  }
}

Issue: If multiple concurrent prompts use the same sessionId (possible in multi-client scenarios), a turn_complete from prompt A could be misattributed to prompt B if both happen to generate the same random promptId (extremely unlikely but theoretically possible with UUID collisions).

Recommendation: Add sessionId validation as defense-in-depth:

if (event.type === 'turn_complete') {
  const data = event.data as {
    sessionId?: string;
    promptId?: string;
    stopReason?: string;
  };
  if (data.sessionId === sessionId && data.promptId === promptId) {
    return { stopReason: data.stopReason ?? 'end_turn' };
  }
}

File: packages/acp-bridge/src/bridge.ts:432-452 - broadcastTurnError loses error stack trace

const message = err instanceof Error ? err.message : String(err);
const code =
  err instanceof Error && 'code' in err
    ? String((err as Error & { code?: string }).code)
    : undefined;

Issue: Only message and code are transmitted. The error stack, name, and cause are lost, making debugging production failures difficult.

Recommendation: Include error name and optionally stack (if not sensitive):

const message = err instanceof Error ? err.message : String(err);
const code =
  err instanceof Error && 'code' in err
    ? String((err as Error & { code?: string }).code)
    : undefined;
const name = err instanceof Error ? err.name : undefined;
entry.events.publish({
  type: 'turn_error',
  data: {
    sessionId,
    message,
    ...(code ? { code } : {}),
    ...(name ? { name } : {}),
    ...(promptId ? { promptId } : {}),
  },
  ...(originatorClientId ? { originatorClientId } : {}),
});

🟢 Medium

File: packages/cli/src/serve/server.ts:1448 - crypto.randomUUID() import not visible

The code uses crypto.randomUUID() but the import isn't shown in the excerpt. Ensure crypto is properly imported at the top of the file.

Recommendation: Verify import exists:

import * as crypto from 'node:crypto';

File: packages/sdk-typescript/src/daemon/DaemonClient.ts:1243 - Magic number in SSE error handling

const err = new DaemonHttpError(
  500,
  data.code ?? 'turn_error',
  data.message ?? 'Prompt failed',
);

Issue: Hard-coded 500 status code assumes all turn_error events are server errors. Some might be client errors (e.g., invalid prompt format that passed initial validation).

Recommendation: Consider mapping error codes to appropriate HTTP status codes, or at least document why 500 is always appropriate here.


File: packages/web-shell/client/hooks/useDaemonSession.ts:449-457 - Passive viewer logic assumes turn_complete always fires

if (
  event.type === 'turn_complete' &&
  !activePromptsRef.current.has(activeSession.sessionId)
) {
  clearPassiveAssistantDoneTimer(passiveAssistantDoneTimerRef);
  const stopReason =
    (event.data as { stopReason?: string })?.stopReason ??
    'end_turn';
  store.dispatch({ type: 'assistant.done', reason: stopReason });
}

Issue: If turn_error fires instead, the passive assistant done timer may still be running, leaving the UI in an ambiguous state.

Recommendation: Handle turn_error for passive viewers:

if (
  event.type === 'turn_complete' &&
  !activePromptsRef.current.has(activeSession.sessionId)
) {
  clearPassiveAssistantDoneTimer(passiveAssistantDoneTimerRef);
  const stopReason =
    (event.data as { stopReason?: string })?.stopReason ??
    'end_turn';
  store.dispatch({ type: 'assistant.done', reason: stopReason });
} else if (
  event.type === 'turn_error' &&
  !activePromptsRef.current.has(activeSession.sessionId)
) {
  clearPassiveAssistantDoneTimer(passiveAssistantDoneTimerRef);
  // Optionally dispatch assistant.error with the error details
}

🔵 Low

File: packages/cli/src/serve/server.ts:1399 - Comment references removed validation code

The comment about !Array.isArray(item) validation is still present, but the validation logic was simplified. Update the comment to match the current code.


File: packages/acp-bridge/src/bridgeTypes.ts:116-120 - JSDoc could mention correlation requirement

The promptId field documentation explains the purpose but doesn't explicitly state that callers MUST include this field when using non-blocking mode.

Recommendation: Add clarity:

/**
 * Caller-generated correlation id for non-blocking prompt mode.
 * REQUIRED when the caller expects to receive `turn_complete` / `turn_error`
 * SSE events correlated to this specific prompt. Generate via `crypto.randomUUID()`
 * or equivalent before the initial POST /prompt request.
 */
promptId?: string;

File: packages/sdk-typescript/src/daemon/events.ts:629-640 - Interface definitions could include JSDoc

The new DaemonTurnCompleteData and DaemonTurnErrorData interfaces lack inline documentation explaining their purpose and when they're emitted.


File: packages/cli/src/serve/capabilities.ts:230 - Capability descriptor lacks version context

non_blocking_prompt: { since: 'v1' },

Recommendation: Consider adding a description field to the capability registry for client-side feature detection documentation.


✅ Highlights

  • Excellent backward compatibility: SDK handles both 200 and 202 responses transparently, allowing gradual daemon upgrades
  • Clean SSE event design: turn_complete and turn_error events carry appropriate metadata (promptId, originatorClientId) for correlation and attribution
  • Proper resource cleanup pattern: Deadline timers are .unref()'d to prevent keeping the daemon alive, and cleared in finally blocks
  • Good capability tagging: non_blocking_prompt capability allows clients to detect and adapt to daemon features
  • Well-documented motivation: PR description clearly explains the three problems being solved (proxy timeouts, client abort ambiguity, deadline coupling)
  • Thoughtful event cursor snapshot: Capturing lastEventId before enqueuing ensures no SSE events are missed during the HTTP→SSE handoff

@chiga0 chiga0 force-pushed the feat/non-blocking-prompt branch from 9af0732 to 66bffea Compare May 28, 2026 03:19
@chiga0

chiga0 commented May 28, 2026

Copy link
Copy Markdown
Collaborator Author

实现方案复审 — 三个需要调整的问题

在完成设计审计(对比原始阻塞设计、/acp 先例、外部设计文档)后,发现当前方案虽然方向正确,但有三个值得修正的问题:

问题 1:无条件 202 是不必要的破坏性变更

当前 POST /prompt 无条件返回 202。旧 SDK 收到 202 后 res.ok === true,会把 { promptId, lastEventId } 当作 PromptResult 解析——stopReason: undefined,不崩溃但数据静默错误。

调整方案:改为 opt-in——客户端通过 Prefer: respond-async header 请求非阻塞,默认仍然 200 阻塞。新 SDK 自动带 header,旧 SDK 完全不受影响;新 SDK 对旧 daemon 也安全(header 被忽略)。

问题 2:每次 prompt 开一个临时 SSE 连接

_awaitTurnComplete 为每个 prompt 创建新的 SSE 订阅等待 turn_complete。但 SDK 消费者通常已有长生命周期 SSE 连接在跑 subscribe(),额外开一条浪费连接资源(subscriber 上限 64)、增加 EventBus 广播扇出、存在重复处理风险。

调整方案:SDK 内部复用已有 SSE 订阅——有活跃 subscription 则在上面 filter turn_complete by promptId;没有才降级开临时连接。

问题 3:turn 事件的价值独立于 202

turn_complete/turn_error 对被动观察者(其他 tab、监控、审计)有独立价值,不应与 202 非阻塞绑定。应无条件发布:

bridge.sendPrompt 完成 → 一定广播 turn_complete / turn_error(不管 HTTP 层是 200 还是 202)
POST /prompt → 根据 Prefer header 决定返回 200(等结果)还是 202(立即返回)

调整后方案概览

层级 变更
Bridge turn_complete/turn_error 无条件发布(保留当前实现,不变)
Server POST /prompt 双模:无 Prefer: respond-async → 200 阻塞;有 header → 202 非阻塞
SDK prompt() 检测 non_blocking_prompt capability → 自动带 Prefer header → 202 路径复用已有 SSE 订阅
Capability non_blocking_prompt 标签保留,含义不变

接下来按此方向调整实现。

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the daemon REST prompt flow from a blocking HTTP request to an accepted asynchronous operation, with prompt completion delivered over session SSE events. It updates the bridge, server route, TypeScript SDK event/client surfaces, web-shell observer handling, capabilities, and tests to support prompt correlation via promptId.

Changes:

  • POST /session/:id/prompt now returns 202 with { promptId, lastEventId }.
  • Bridge and SDK add turn_complete / turn_error events for prompt completion correlation.
  • Web-shell and server tests are updated for the non-blocking prompt model.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/acp-bridge/src/bridge.ts Publishes turn lifecycle SSE events and exposes session event cursor snapshots.
packages/acp-bridge/src/bridgeTypes.ts Extends bridge context/API types for prompt correlation and event cursor access.
packages/cli/src/serve/server.ts Converts prompt route to fire-and-forget 202 behavior with deadline abort handling.
packages/cli/src/serve/server.test.ts Updates prompt and deadline route tests for non-blocking responses.
packages/cli/src/serve/capabilities.ts Adds the non_blocking_prompt serve capability.
packages/sdk-typescript/src/daemon/DaemonClient.ts Adds SDK handling for 202 prompt responses by awaiting matching SSE events.
packages/sdk-typescript/src/daemon/events.ts Adds turn lifecycle event types, parsing, and reducer state.
packages/sdk-typescript/src/daemon/index.ts Exports new turn lifecycle SDK types.
packages/web-shell/client/hooks/useDaemonSession.ts Uses turn_complete for passive observer assistant completion.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/cli/src/serve/server.ts Outdated
Comment thread packages/acp-bridge/src/bridge.ts
Comment thread packages/sdk-typescript/src/daemon/DaemonClient.ts
@chiga0

chiga0 commented May 28, 2026

Copy link
Copy Markdown
Collaborator Author

方案迭代 — 最终采用 ACP transport 模式(替代上述 Prefer: respond-async 方案)

上述三个问题中,问题 2(SSE 连接复用)和问题 3(turn 事件独立价值) 在最终实现中已完整解决。问题 1 的解法调整如下:

问题 1 重新决策:unconditional 202(不走 Prefer opt-in)

复审后认为 Prefer: respond-async 方案引入了不必要的复杂性,且与 /acp 的设计哲学不一致:

  • /acp (MCP Streamable HTTP transport) 的 POST 无条件 202,不做内容协商
  • 双模代码路径(阻塞 + 非阻塞)增加了 server 维护负担
  • Prefer header 不是 HTTP Streamable Transport 提案的一部分
  • daemon + SDK 共同发版,旧 SDK 兼容性在这个发版模型下不是关键约束

最终方案 follow /acp transport 模式:unconditional 202,SDK 内部像 ACP transport 一样管理 request-response 关联。

三个问题的最终处置

问题 最终处置
问题 1:无条件 202 的兼容性 保留 unconditional 202。follow /acp 模式,不做 Prefer 协商。non_blocking_prompt capability tag 供 raw HTTP 客户端检测
问题 2:临时 SSE 连接 已解决DaemonSessionClient 通过 _pendingPrompts map + _dispatchTurnEvent() 在已有 SSE 订阅上关联 turn 事件。DaemonClient.prompt() 保留临时 SSE 作为无订阅时的 fallback
问题 3:turn 事件独立发布 已解决。bridge 的 broadcastTurnComplete/broadcastTurnError 无条件发布,不依赖 HTTP 层是 200 还是 202

@chiga0 chiga0 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Overview (AI Generated)

PR: #4585 feat(daemon): non-blocking POST /prompt — return 202 with promptId
Type: New Feature + Refactor
Change size: +519/-351 across 10 files

Findings Summary

  • Critical/Major: 2 items
  • Minor: 4 items

Key Observations

Excellent architectural refactor with a thorough design document. The non-blocking 202 pattern is well-justified (proxy timeouts, resource pinning) and the ACP-prior-art alignment provides strong precedent. The two-tier SDK design (stateless DaemonClient with temp SSE fallback vs stateful DaemonSessionClient with pending-prompt dispatch) is clean and consumer-transparent. Test suite is well-adapted from the blocking model. Main concerns: (1) abort listener leak on the external signal in DaemonSessionClient.prompt(), and (2) silent error swallowing in the fire-and-forget .catch(() => {}) with no observability fallback.


This review was generated by QoderWork AI

Comment thread packages/sdk-typescript/src/daemon/DaemonSessionClient.ts Outdated
Comment thread packages/cli/src/serve/server.ts
Comment thread packages/acp-bridge/src/bridge.ts
Comment thread packages/cli/src/serve/server.test.ts

@chiga0 chiga0 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continued review findings (Part 2/2)

Comment thread packages/cli/src/serve/server.test.ts
Comment thread packages/web-shell/client/hooks/useDaemonSession.ts

@chiga0 chiga0 left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #4585 Review: Non-blocking POST /prompt — 202 with promptId

Author: ChiGao | +519 / -351 across 10 files | Branch: feat/non-blocking-promptdaemon_mode_b_main


Overview

POST /session/:id/prompt阻塞式 (hold HTTP 直到 turn 结束, 返回 200 + PromptResult) 改为非阻塞式 (立即返回 202 + { promptId, lastEventId }, 通过 SSE turn_complete / turn_error 事件异步交付结果)。


能否解决 504 超时问题?

能。这个 PR 精准命中了 504 的根因。

之前的架构:

Client → POST /prompt → [HTTP 连接一直 hold 住] → 等 LLM turn 完成 → 200 返回
              ↑
         nginx/ALB 60s 超时 → 504 Gateway Timeout

新架构:

Client → POST /prompt → [几毫秒内] → 202 Accepted { promptId, lastEventId }
Client ← SSE stream  ← turn_complete { promptId, stopReason } (异步,走已有 SSE 长连接)

关键点:

  1. HTTP POST 请求几毫秒内就返回 202,不可能触发反向代理超时
  2. 结果通过已建立的 SSE 连接推送,SSE 有 keepalive 心跳,不会被中间件掐断
  3. 即使 LLM turn 跑了 10 分钟,HTTP 层也不会有任何超时问题

代码质量分析

架构设计 — 优秀

  • 遵循了 /acp 路由已验证的 202 模式(同 codebase 内的 prior art)
  • SDK 两层设计合理:DaemonClient.promptNonBlocking() (无状态) + DaemonSessionClient (有状态 pending map)
  • matchTurnEvent() 提取为共享 utility,避免逻辑重复

Bridge 层改动 — 安全,纯增量

  • broadcastTurnComplete / broadcastTurnError 通过 .then() side-effect 无条件发布,不影响现有 FIFO 队列
  • AbortError 被正确过滤,不会误发 turn_error
  • getSessionLastEventId() 只读操作,干净

Server 层 — 大幅简化

  • 删除了 ~100 行复杂的 Promise.race + deadline 504 + 错误分支逻辑
  • 新代码只有 ~30 行:生成 promptId → 拿 lastEventId → fire-and-forget sendPrompt → 立即 202
  • .catch(() => {}) 正确吞掉了 orphan promise 的 rejection

SDK 层 — 兼容性好

  • DaemonSessionClient.prompt() 返回值不变 (Promise<PromptResult>),对上层消费者完全透明
  • 向后兼容:新 SDK + 旧 daemon (200) 走 legacy path;旧 SDK + 新 daemon (202) 因 res.ok 对 202 为 true 也不会 crash

潜在问题和风险

1. Orphan prompt — 中等风险

客户端断开后 prompt 仍会运行到结束。结果发布到 SSE bus 但无人消费。对资源消耗的影响取决于 turn 时长。

2. _pendingPrompts 内存泄漏 — 低风险

DaemonSessionClient 的 pending map 在 SSE stream 结束时 _rejectAllPending() 清理。signal abort listener 和 stream end 都有清理路径,实际泄漏风险很低。

3. 删除了 invalid_client_id 400 错误测试 — 值得注意

原来有一个测试验证 bridge 拒绝 client ID 时返回 400。因为现在 sendPrompt 是 fire-and-forget,这个错误会变成 turn_error SSE 事件而不是同步 HTTP 错误。需要确认下游消费者都已适配。

4. setTimeout 20ms 在测试中 — 低风险

多处测试用 await new Promise(r => setTimeout(r, 20)) 等待异步 bridge call settle,可能在 CI 慢环境下 flaky。不影响生产。


总结

这个 PR 质量很高,架构清晰,精准解决 504 问题。 核心思路正确:把 HTTP 连接生命周期和 LLM turn 时长解耦。代码量净减,SDK 接口完全向后兼容。建议 merge。


Generated by Claude Opus 4.6 (claude-opus-4-6)

Comment thread packages/web-shell/client/hooks/useDaemonSession.ts
Comment thread packages/sdk-typescript/src/daemon/DaemonClient.ts
Comment thread packages/sdk-typescript/src/daemon/DaemonClient.ts
Comment thread packages/sdk-typescript/src/daemon/DaemonClient.ts
Comment thread packages/sdk-typescript/src/daemon/DaemonSessionClient.ts
Comment thread packages/sdk-typescript/src/daemon/DaemonSessionClient.ts
Comment thread packages/sdk-typescript/src/daemon/DaemonSessionClient.ts Outdated
Comment thread packages/sdk-typescript/src/daemon/DaemonSessionClient.ts
Comment thread packages/acp-bridge/src/bridge.ts
Comment thread packages/acp-bridge/src/bridge.ts Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment on lines +205 to +243
if (!this.subscriptionActive) {
return await this.client.prompt(
this.sessionId,
req,
signal,
this.clientId,
);
}

const accepted = await this.client.promptNonBlocking(
this.sessionId,
req,
signal,
this.clientId,
);
if (!isNonBlockingAccepted(accepted)) {
return accepted;
}

return new Promise<PromptResult>((resolve, reject) => {
const onAbort = () => {
if (this._pendingPrompts.delete(accepted.promptId)) {
this.client.cancel(this.sessionId, this.clientId).catch(() => {});
reject(signal!.reason ?? new DOMException('Aborted', 'AbortError'));
}
};
const cleanup = () => signal?.removeEventListener('abort', onAbort);
this._pendingPrompts.set(accepted.promptId, {
resolve: (r) => {
cleanup();
resolve(r);
},
reject: (e) => {
cleanup();
reject(e);
},
});
signal?.addEventListener('abort', onAbort, { once: true });
});
doudouOUC
doudouOUC previously approved these changes May 28, 2026

@doudouOUC doudouOUC left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

架构方向正确,执行到位。将 HTTP 连接生命周期与 LLM turn 时长解耦,精准解决 reverse proxy 504 超时问题。遵循同 codebase 内 /acp 路由的 202 先例,SDK 两层设计(无状态 DaemonClient + 有状态 DaemonSessionClient._pendingPrompts)干净且对消费者完全透明。

4 commit 迭代质量良好——abort listener leak、String(undefined) coercion、type guard 缺失、passive observer turn_error 缺失均已在后续 commit 修复。

Approve with follow-up suggestions

以下问题不阻塞 merge,建议后续 PR 跟踪:

  1. Deadline AbortError 被动观察者不可见 — bridge.ts L2317 的 AbortError 过滤会吞掉 deadline-triggered abort,导致被动 SSE 观察者看不到 terminal event。需与 prompt_cancelled 语义协调后修复(作者已 acknowledge)。

  2. _pendingPrompts 注册时序假设promptNonBlocking() 返回后才 set() pending entry,依赖 bridge IPC round-trip 延迟保证 turn_complete 不会先到。当前实现安全,但建议加注释说明时序前提或在 follow-up 中增加防御性缓冲。

  3. Fire-and-forget .catch(() => {}) 可观测性 — 如果未来 EventBus 实现变更导致 broadcastTurnError 本身可能 throw,当前 catch 会完全静默。建议至少 fallback 到 stderr debug log。

  4. SDK 202 路径测试缺口DaemonClient.prompt() 处理 202 + 临时 SSE 的路径没有直接单测覆盖(受限于 SDK 无 HTTP server test harness),建议后续补充。

LGTM overall — 净删代码、逻辑大幅简化、向后兼容在 co-release 模型下可控。

秦奇 and others added 4 commits May 28, 2026 14:19
…4582)

Decouple trigger from completion: POST /session/:id/prompt now returns
202 Accepted immediately with `{ promptId, lastEventId }`. Completion
is delivered via `turn_complete` / `turn_error` SSE events correlated
by promptId.

- Bridge publishes `turn_complete` and `turn_error` events after
  sendPrompt settles (abort-cancelled prompts are suppressed)
- Bridge exposes `getSessionLastEventId()` so the server can snapshot
  the cursor before enqueuing
- DaemonClient.prompt() transparently handles 202 by opening a
  temporary SSE subscription and awaiting the matching turn event
- Web-shell observes `turn_complete` for passive session viewers
- Capability tag `non_blocking_prompt` advertised for feature detection
- Deadline enforcement preserved: timer aborts the prompt server-side,
  surfaced through `turn_error` SSE event instead of HTTP 504

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…ource reuse

Revert the Prefer: respond-async dual-mode approach in favor of the
simpler ACP-consistent model:

Server:
- POST /prompt unconditionally returns 202 (no opt-in header needed)
- Remove emitPromptDeadline504 (deadline surfaced via turn_error SSE)

SDK DaemonClient:
- Add promptNonBlocking() for callers with existing SSE subscriptions
- Add matchTurnEvent() shared utility for turn event correlation
- prompt() retains temporary SSE fallback for standalone callers
- Export NonBlockingPromptAccepted, matchTurnEvent, isNonBlockingAccepted

SDK DaemonSessionClient:
- prompt() uses promptNonBlocking() when SSE subscription is active,
  resolving via _pendingPrompts map (like ACP transport request routing)
- iterateEvents() intercepts turn_complete/turn_error and dispatches
  to pending prompts before yielding to the consumer
- Falls back to DaemonClient.prompt() (temp SSE) when no subscription

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
When prompt() resolved via _dispatchTurnEvent (turn_complete SSE),
the abort listener on the caller's signal was never removed. Over a
long-lived session each prompt call accumulated another leaked
listener. Additionally, if the signal fired after resolution, the
stale handler called cancel() — potentially cancelling an unrelated
in-flight prompt.

Fix: wrap resolve/reject to removeEventListener on settlement.

Also: use typed DaemonTurnCompleteData instead of ad-hoc cast in
web-shell passive turn_complete handler.

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
… turn_error

- Add type guard (turn_complete/turn_error only) in _dispatchTurnEvent
  before extracting promptId. Without this, a future event type
  carrying promptId in data would silently delete the pending entry
  without resolving or rejecting the promise.

- Fix String(undefined) producing "undefined" in broadcastTurnError.
  When err.code is undefined, 'code' in err is true but
  String(undefined) yields the truthy string "undefined", bypassing
  the conditional spread and stamping a misleading error code.

- Handle turn_error for passive observers in web-shell. Passive tabs
  viewing a session that hits turn_error (agent crash, transport
  failure) now dispatch assistant.done instead of staying stuck in
  the thinking state indefinitely.

Generated with AI

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

Comment on lines +215 to +225
const accepted = await this.client.promptNonBlocking(
this.sessionId,
req,
signal,
this.clientId,
);
if (!isNonBlockingAccepted(accepted)) {
return accepted;
}

return new Promise<PromptResult>((resolve, reject) => {

@doudouOUC doudouOUC left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed after conflict resolution (rebase). Code is functionally identical to the previously approved version — clean rebase, no logic changes, no new issues introduced.

The base branch additions (user_shell_command/user_shell_result event types in events.ts) were correctly preserved alongside the new turn_complete/turn_error entries.

Copilot's repeated TOCTOU concerns (L244, L225) were already addressed in the prior review round — the event-loop timing guarantee (202 sent synchronously before bridge enqueue) makes the race unreachable in practice.

Previous follow-up suggestions remain applicable but non-blocking. LGTM.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R2 Review — Non-blocking POST /prompt (Second Opinion)

SHA: 50870b19 (R1 was 874a7d21e)
Model: qwen3-max (2nd opinion)
Build/Test: ✅ tsc clean (0 errors), 542 tests pass, CI all_pass (3/3)

R1 Status — All Addressed

R1 Finding Status
Critical: TOCTOU race on _pendingPrompts Rejected (author) — 202 sent synchronously, agreed unreachable
Abort listener leak (Copilot) ✅ Fixed (d960be5) — resolve/reject wrapped with cleanup
_dispatchTurnEvent type guard ✅ Fixed (17b435e) — early return for non-turn events
Passive observer turn_error ✅ Fixed (17b435e)
String(undefined) code guard ✅ Fixed (17b435e)
Typed event interface ✅ Fixed (d960be5)
broadcastTurnComplete drops _meta OOS — intentional design choice
prompt()/promptNonBlocking() duplication OOS — refactoring suggestion
_awaitTurnComplete no timeout OOS — new feature
matchTurnEvent throw pattern OOS — refactoring suggestion

Code Quality Assessment

The R2 changes are clean and well-structured:

  • Abort listener fix correctly wraps resolve/reject in cleanup, handles all settlement paths (resolve, reject, abort), and uses { once: true } for the abort path
  • _dispatchTurnEvent type guard prevents non-turn events with promptId from accidentally resolving pending prompts
  • Passive observer turn_error properly gates on !activePromptsRef.current.has() to avoid double-handling

Advisory: Shell Command Feature (Out of PR Scope)

The POST /session/:id/shell endpoint, executeShellCommand bridge method, and related ! bang-command infrastructure are not in this PR's diff (already present in daemon_mode_b_main). However, since they were added to the branch after R1, the following security concerns should be addressed in a follow-up:

  1. Prompt injection via hardcoded backtick fencesacpAgent.ts:2467 injects shell output into LLM history using hardcoded triple backticks. ChannelBase.ts:286-293 correctly uses a dynamic fence pattern. The acpAgent.ts path should match.
  2. No concurrency limit on concurrent shell commands (each spawns an OS child process, up to 120s)
  3. Unbounded output accumulation in outputChunks[] (no memory cap)
  4. Non-strict auth gate — shell route uses mutate() while less impactful routes use mutate({ strict: true })
  5. Zero test coverage for the entire shell command feature (~387 lines across 16 files)

These are real concerns but belong to a different PR since the code is already in the base branch.

— qwen3-max via Qwen Code /review

@doudouOUC doudouOUC left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(daemon): non-blocking POST /prompt — 202 with promptId

结论:LGTM,可合并


架构评价

将 HTTP 连接生命周期与 LLM turn 时长解耦,精准解决 reverse proxy 504 超时问题。设计遵循 /acp 路由已验证的 unconditional 202 模式,SDK 两层设计(无状态 DaemonClient fallback + 有状态 DaemonSessionClient._pendingPrompts dispatch)干净且对消费者完全透明。

验证项

  • 冲突解决干净(纯 rebase,无逻辑变更)
  • 4 commit 迭代修复了 abort listener leak、String(undefined) coercion、type guard 缺失、passive turn_error 处理
  • Bridge 层纯增量(broadcastTurnComplete/broadcastTurnError 无条件发布,不影响 FIFO)
  • Server 层净删 ~100 行 Promise.race/504 逻辑,新路径 ~30 行
  • 向后兼容:新 SDK+旧 daemon(200) 走 legacy path;co-release 模型下旧 SDK 风险可控
  • 测试覆盖:server.test.ts 337 pass / bridge.test.ts 205 pass

非阻塞 Follow-up 建议

  1. Deadline AbortError 被动观察者不可见 — bridge.ts 的 AbortError 过滤会吞掉 deadline-triggered abort,被动 SSE 观察者收不到 terminal event。需与 prompt_cancelled 语义协调(作者已 track)。

  2. _pendingPrompts 注册时序 — 依赖 bridge IPC round-trip 保证 turn_complete 不先于 set() 到达。当前安全但建议加注释说明时序前提。

  3. .catch(() => {}) 可观测性 — 如果未来 EventBus 实现变更导致 broadcastTurnError 可 throw,当前 catch 完全静默。建议 fallback 到 stderr debug log。

  4. SDK 202 路径测试DaemonClient.prompt() 处理 202 + 临时 SSE 的路径缺乏直接单测覆盖(受 SDK test infra 限制),建议后续补充。

@chiga0 chiga0 merged commit 6e381d3 into daemon_mode_b_main May 28, 2026
6 checks passed
@chiga0 chiga0 deleted the feat/non-blocking-prompt branch May 28, 2026 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants