fix(code): drain cloud queue for idle resumed runs stuck disconnected by tatoalo · Pull Request #2217 · PostHog/code

tatoalo · 2026-05-19T09:12:42Z

Problem

when a cloud task is resumed from a snapshot and then goes idle, an SSE transport drop (or the watcher retry it triggers) flips the session to disconnected even though the run is still alive (in_progress) on the server. A user message sent in that state is queued because status !== "connected", and then nothing ever drains it

so basically the message stays queued forever

Changes

track a run-id-scoped agentReadyForRunId on the session, set when _posthog/run_started / _posthog/turn_complete is observed for the current run (survives hydrate-from-logs)
boot race stays protected so that a still-booting run has no run_started for its run id, so recovery does not fire
add a guarded tryRecoverIdleCloudQueue recovery path

handling #2159

greptile-apps · 2026-05-19T09:16:04Z

Comments Outside Diff (2)

apps/code/src/renderer/features/sessions/service/service.test.ts, line 594-874 (link)

Prefer parameterised tests over repeated it blocks

The five new unit tests (recovery via live flag, recovery via event scan, stale run id, prompt in flight, no boot evidence) share an almost-identical scaffold: create a mock session, configure store mocks, call watchCloudTask, extract subscribeOptions, fire onData, and assert. The only meaningful variation is the session shape and the expected outcome. Per the team's simplicity rules this should be expressed as a single it.each table (or two — one for positive recovery paths, one for negative guards), which would make adding future cases trivial and keep the test file from growing proportionally with every new guard condition.

Context Used: Do not attempt to comment on incorrect alphabetica... (source)

Prompt To Fix With AI

This is a comment left during a code review.
Path: apps/code/src/renderer/features/sessions/service/service.test.ts
Line: 594-874

Comment:
**Prefer parameterised tests over repeated `it` blocks**

The five new unit tests (recovery via live flag, recovery via event scan, stale run id, prompt in flight, no boot evidence) share an almost-identical scaffold: create a mock session, configure store mocks, call `watchCloudTask`, extract `subscribeOptions`, fire `onData`, and assert. The only meaningful variation is the session shape and the expected outcome. Per the team's simplicity rules this should be expressed as a single `it.each` table (or two — one for positive recovery paths, one for negative guards), which would make adding future cases trivial and keep the test file from growing proportionally with every new guard condition.

**Context Used:** Do not attempt to comment on incorrect alphabetica... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

apps/code/src/renderer/features/sessions/service/service.ts, line 3211-3256 (link)

Fallback scan misses _posthog/turn_complete for old agents in the recreate-from-logs case

The live flag (agentReadyForRunId) is set by both _posthog/run_started and _posthog/turn_complete (the backward-compat path for agents that predate run_started). The fallback event-scan in hasAgentBootedForRun only checks _posthog/run_started, because turn_complete carries no runId param and can't be safely scoped to the current run.

The gap: for an old-agent session that was recreated from logs (no-delta dedup guard skipped reprocessing, so agentReadyForRunId was never set) whose events contain only a turn_complete, neither branch returns true — the message stays stranded. The live-flag path protects the common online case, but the hydrate-from-logs scenario for old agents remains unfixed. Whether old agents are still common enough to warrant a workaround (e.g., storing agentReadyForRunId in the persisted log entries themselves) is worth considering.

Prompt To Fix With AI

This is a comment left during a code review.
Path: apps/code/src/renderer/features/sessions/service/service.ts
Line: 3211-3256

Comment:
**Fallback scan misses `_posthog/turn_complete` for old agents in the recreate-from-logs case**

The live flag (`agentReadyForRunId`) is set by both `_posthog/run_started` and `_posthog/turn_complete` (the backward-compat path for agents that predate `run_started`). The fallback event-scan in `hasAgentBootedForRun` only checks `_posthog/run_started`, because `turn_complete` carries no `runId` param and can't be safely scoped to the current run.

The gap: for an old-agent session that was recreated from logs (no-delta dedup guard skipped reprocessing, so `agentReadyForRunId` was never set) whose events contain only a `turn_complete`, neither branch returns `true` — the message stays stranded. The live-flag path protects the common online case, but the hydrate-from-logs scenario for old agents remains unfixed. Whether old agents are still common enough to warrant a workaround (e.g., storing `agentReadyForRunId` in the persisted log entries themselves) is worth considering.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/code/src/renderer/features/sessions/service/service.test.ts:594-874
**Prefer parameterised tests over repeated `it` blocks**

The five new unit tests (recovery via live flag, recovery via event scan, stale run id, prompt in flight, no boot evidence) share an almost-identical scaffold: create a mock session, configure store mocks, call `watchCloudTask`, extract `subscribeOptions`, fire `onData`, and assert. The only meaningful variation is the session shape and the expected outcome. Per the team's simplicity rules this should be expressed as a single `it.each` table (or two — one for positive recovery paths, one for negative guards), which would make adding future cases trivial and keep the test file from growing proportionally with every new guard condition.

### Issue 2 of 2
apps/code/src/renderer/features/sessions/service/service.ts:3211-3256
**Fallback scan misses `_posthog/turn_complete` for old agents in the recreate-from-logs case**

The live flag (`agentReadyForRunId`) is set by both `_posthog/run_started` and `_posthog/turn_complete` (the backward-compat path for agents that predate `run_started`). The fallback event-scan in `hasAgentBootedForRun` only checks `_posthog/run_started`, because `turn_complete` carries no `runId` param and can't be safely scoped to the current run.

The gap: for an old-agent session that was recreated from logs (no-delta dedup guard skipped reprocessing, so `agentReadyForRunId` was never set) whose events contain only a `turn_complete`, neither branch returns `true` — the message stays stranded. The live-flag path protects the common online case, but the hydrate-from-logs scenario for old agents remains unfixed. Whether old agents are still common enough to warrant a workaround (e.g., storing `agentReadyForRunId` in the persisted log entries themselves) is worth considering.

_{Reviews (1): Last reviewed commit: "fix(code): drain cloud queue for idle re..." | Re-trigger Greptile}

…nnected

joshsny

i think some of this logic might be getting unweildy and we might need to come back and simplify it afterwards, but sounds like this will improve things at least for now, great job finding the issue

tatoalo · 2026-05-19T11:26:21Z

@joshsny agree completely, it's becoming a bit of a 🍝 , we need some time to have proper ACP cleanup of all this

tatoalo self-assigned this May 19, 2026

fix(cloud-agent): drain cloud queue for idle resumed runs stuck disco…

d307750

…nnected

tatoalo force-pushed the fix/code/idle-queued-cloud-messages branch from f1fbb7d to d307750 Compare May 19, 2026 11:01

tatoalo requested a review from a team May 19, 2026 11:02

joshsny approved these changes May 19, 2026

View reviewed changes

tatoalo added the Create Release This will trigger a new release label May 19, 2026

tatoalo merged commit ae05339 into main May 19, 2026
15 checks passed

tatoalo deleted the fix/code/idle-queued-cloud-messages branch May 19, 2026 11:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(code): drain cloud queue for idle resumed runs stuck disconnected#2217

fix(code): drain cloud queue for idle resumed runs stuck disconnected#2217
tatoalo merged 1 commit into
mainfrom
fix/code/idle-queued-cloud-messages

tatoalo commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 19, 2026 •

edited

Loading

Comments Outside Diff (2)

Uh oh!

joshsny left a comment

Uh oh!

tatoalo commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tatoalo commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

Uh oh!

greptile-apps Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comments Outside Diff (2)

Uh oh!

joshsny left a comment

Choose a reason for hiding this comment

Uh oh!

tatoalo commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tatoalo commented May 19, 2026 •

edited

Loading

greptile-apps Bot commented May 19, 2026 •

edited

Loading