agent/streaming: ship synthetic done when a turn ends without one #64

Merged
pufit merged 1 commit into ClickHouse:main from
alex-fedotyev:claude/fix-stuck-session-running
May 13, 2026

Conversation

@alex-fedotyev
Contributor

Summary

Chats appear stuck mid-step: the session card drops out of the sidebar's "Running" group into the time-grouped list, but the chat detail keeps showing the "thinking..." spinner because isStreaming was never reset.

The previous c439473 fix (tool schemas aborting agent turns) closed one underlying cause. Three remaining cases also leave _run_inner without emitting a terminal event:

  1. Post-stream exception. After receive_response yields ResultMessage the loop exits cleanly, but the ~80 lines after that contain unguarded post-stream work (add_message, mark_active, update_session_metadata, record_turn_usage). If any of those raise, the exception propagates up to engine.run() before broadcast_done ever fires.
  2. Hung CLI cancelled externally. claude_agent_sdk's receive_response documents "If no ResultMessage is received, the iterator continues indefinitely." When the CLI wedges and the asyncio task is cancelled, the CancelledError handler sends stopped on the normal path, but the WS message can be lost mid-reconnect.
  3. Lost WebSocket message during reconnect. The frontend handler had no symmetry: handleSessionRunning enters streaming on is_running:true but never exits on is_running:false, so a missed done/stopped/error leaves isStreaming wedged on.

I confirmed case 1 against a live session in our DB: last_activity_at stuck at connected_at (04:13:13) but updated_at advanced to 04:30:29, consistent with mark_active running but a later DB write failing before broadcast_done.

What changed

Server. StreamBroadcaster gains an _open_turns set with mark_turn_open / is_turn_open / clear_turn_open. broadcast() auto-clears the flag when a terminal event ships (done/stopped/error). engine.run() calls mark_turn_open before _run_inner; its finally block checks is_turn_open and ships a synthetic broadcast_done if still set. A single backstop covers all three cases at the source.
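As a minimal sketch of the server-side shape (only the method names mentioned above are taken from the description; event payloads, type hints, and the engine wiring are illustrative assumptions, not the actual implementation):

```python
# Hedged sketch of the open-turn backstop described above; payloads and
# engine internals are simplified.
TERMINAL_TYPES = {"done", "stopped", "error"}

class StreamBroadcaster:
    def __init__(self) -> None:
        self._open_turns: set[str] = set()

    def mark_turn_open(self, session_id: str) -> None:
        self._open_turns.add(session_id)

    def is_turn_open(self, session_id: str) -> bool:
        return session_id in self._open_turns

    def clear_turn_open(self, session_id: str) -> None:
        self._open_turns.discard(session_id)

    async def broadcast(self, session_id: str, event: dict) -> None:
        # Terminal events auto-close the turn, so every normal completion
        # path clears the flag as a side effect of shipping the event.
        if event.get("type") in TERMINAL_TYPES:
            self.clear_turn_open(session_id)
        # ... fan out to websocket listeners ...

    async def broadcast_done(self, session_id: str) -> None:
        await self.broadcast(session_id, {"type": "done"})


async def _run_inner(session_id: str) -> None:
    """Placeholder for the real agent turn; may raise or be cancelled."""


async def run(broadcaster: StreamBroadcaster, session_id: str) -> None:
    broadcaster.mark_turn_open(session_id)
    try:
        await _run_inner(session_id)
    finally:
        # Backstop: if no terminal event shipped (post-stream exception,
        # cancelled task, lost message), synthesize a done so clients
        # never stay wedged in the streaming state.
        if broadcaster.is_turn_open(session_id):
            await broadcaster.broadcast_done(session_id)
```

Because the finally clause runs on success and exception alike, all three failure modes collapse into one check (a cancelled task may additionally need something like asyncio.shield around the final await to guarantee delivery).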

Frontend. handleSessionRunning gets a symmetric exit branch. When the active session reports is_running:false while isStreaming is true, it commits any in-flight streaming blocks to messages, clears streamingBlocks, and resets agentStatus to idle. Belt-and-suspenders for the case where even the synthetic done is missed (e.g. WS reconnect drops the done event but redelivers session_running:false).

Test plan

  • tests/test_streaming.py::TestOpenTurnTracking: 5 new tests covering each terminal type clearing the flag, non-terminal types not clearing, explicit clear_turn_open, multi-session isolation, and the broadcast_done helper closing the turn
  • tests/test_streaming.py + tests/test_sessions.py + tests/test_engine.py: 76/76 pass
  • web/ tsc -b clean
  • Smoke: open a chat, run a turn that exercises post-stream cleanup, confirm the spinner clears

@alex-fedotyev
Contributor Author

Bumping for review. This is one of five small reliability fixes (#63, #64, #65, #66, #67) that compound: each closes one cause of chats that wedge mid-turn, lose context after restart, or get routed into the wrong session. Happy to split, rebase, or land any of them in any order; they don't depend on each other.

In daily use the chat surface is the single biggest friction right now, so any of these merging would help. Let me know if you want changes.

@pufit pufit merged commit f598e5e into ClickHouse:main May 13, 2026
pufit pushed a commit that referenced this pull request May 13, 2026
User-reported glitch: "I ask for something and there is no response, then
I ask again and it answers to the previous question." Five reliability
PRs (#63 shorthand-schema, #64 synthetic done, #65 stale sdk_session_id,
#66 idle timeout, #67 sticky session) each close one underlying cause.
Two gaps remain that none of those PRs cover.

Gap 1: client-side send silently drops payloads.
web/src/api/websocket.ts checked readyState === OPEN and no-op'd
otherwise. The 3s reconnect window leaves a hole: send() returns to the
caller as if it had succeeded, while chatStore.sendMessage has already
optimistically appended the user message and set isStreaming=true. The
user thinks the agent is
thinking but the message never reached the server, so the next reply
lands against a stale prompt.

Track readyState explicitly. CONNECTING or reconnect-scheduled now queues
the payload (bounded to 5 entries; oldest evicted) and flushes from
onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the
caller can revert. chatStore.sendMessage pops the optimistic user message
on 'dropped' and surfaces an inline assistant error so the user can
retry.
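The actual code is TypeScript in web/src/api/websocket.ts; the following Python sketch only illustrates the queue-or-drop state machine described above (class, state, and method names are hypothetical):

```python
from collections import deque

class WSSender:
    """Illustrative queue-or-drop logic; not the actual websocket.ts code."""
    MAX_QUEUE = 5

    def __init__(self) -> None:
        self.state = "CONNECTING"  # CONNECTING | OPEN | CLOSING | CLOSED
        self.reconnect_scheduled = False
        # Bounded buffer: deque(maxlen=5) evicts the oldest on overflow.
        self._queue: deque = deque(maxlen=self.MAX_QUEUE)

    def send(self, payload: str) -> str:
        if self.state == "OPEN":
            self._transmit(payload)
            return "sent"
        if self.state == "CONNECTING" or self.reconnect_scheduled:
            self._queue.append(payload)  # queued for flush on reconnect
            return "queued"
        # CLOSING, or CLOSED with no reconnect pending: tell the caller so
        # it can revert the optimistic append and show an inline error.
        return "dropped"

    def on_open(self) -> None:
        self.state = "OPEN"
        while self._queue:  # flush queued payloads in FIFO order
            self._transmit(self._queue.popleft())

    def _transmit(self, payload: str) -> None:
        pass  # real code: underlying WebSocket.send(payload)
```

The string return value mirrors the 'dropped' contract described above, so the caller can distinguish "will arrive eventually" from "never left the client".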

Gap 2: gateway initial-bind never replayed the broadcaster buffer.
The switch_session handler already shipped session_status with
buffered_events on session switch, but the initial-connect handshake at
server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop)
and the in-flight stream was lost from the client's view even though
the events sat in broadcaster._session_buffers waiting to be replayed.

Lift the duplicated send-status construction into _send_session_status
and call it from both branches. Initial-bind gates on
broadcaster.is_buffering so idle sessions stay silent; switch_session
calls unconditionally so the client refreshes is_running/status on
every selection. The frontend handleSessionStatus already restores
streamingBlocks, panels, todos, and interaction state from the buffer
(handled by #69), so this is purely additive at the gateway.
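A rough sketch of the deduplicated gateway path (the handler names, payload fields beyond is_running/buffered_events, and the broadcaster accessor are assumptions; the real handlers live in server.py and differ in detail):

```python
# Hypothetical shapes illustrating the refactor described above.
async def _send_session_status(ws, broadcaster, session_id: str) -> None:
    await ws.send_json({
        "type": "session_status",
        "session_id": session_id,
        "is_running": broadcaster.is_buffering(session_id),
        # Replayed so a reload mid-turn can restore the in-flight stream.
        "buffered_events": broadcaster.buffered_events(session_id),
    })

async def on_initial_bind(ws, broadcaster, session_id: str) -> None:
    # Gated: idle sessions stay silent on first connect.
    if broadcaster.is_buffering(session_id):
        await _send_session_status(ws, broadcaster, session_id)

async def on_switch_session(ws, broadcaster, session_id: str) -> None:
    # Unconditional: the client refreshes is_running/status on every
    # session selection.
    await _send_session_status(ws, broadcaster, session_id)
```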

Tests:
- 9 new asserts in tests/test_gateway_ws.py covering the helper output,
  the initial-bind gate, the switch_session regression path, and a
  load-fidelity check for buffer ordering.
- Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated
  (test_bootstrap docker-env detection and test_cli_upgrade docker
  mode, both noted in notes/repo-conventions/nerve.md).
- web/ tsc --noEmit clean, npm run build clean.

Out of scope (followups, not blocking):
- Stale-listener cleanup on swallowed send_json exceptions
  (server.py:298-301).
- Application-level message_received ack from engine after
  sessions.add_message.
- _session_locks TTL on session archive.
