Skip to content

fix(code): drain cloud queue for idle resumed runs stuck disconnected#2217

Merged
tatoalo merged 1 commit into
mainfrom
fix/code/idle-queued-cloud-messages
May 19, 2026
Merged

fix(code): drain cloud queue for idle resumed runs stuck disconnected#2217
tatoalo merged 1 commit into
mainfrom
fix/code/idle-queued-cloud-messages

Conversation

@tatoalo
Copy link
Copy Markdown
Contributor

@tatoalo tatoalo commented May 19, 2026

Problem

when a cloud task is resumed from a snapshot and then goes idle, an SSE transport drop (or the watcher retry it triggers) flips the session to disconnected even though the run is still alive (in_progress) on the server. A user message sent in that state is queued because status !== "connected", and then nothing ever drains it

image

so basically the message stays queued forever

Changes

  • track a run-id-scoped agentReadyForRunId on the session, set when _posthog/run_started / _posthog/turn_complete is observed for the current run (survives hydrate-from-logs)
  • boot race stays protected so that a still-booting run has no run_started for its run id, so recovery does not fire
  • add a guarded tryRecoverIdleCloudQueue recovery path

handling #2159

@tatoalo tatoalo self-assigned this May 19, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Comments Outside Diff (2)

  1. apps/code/src/renderer/features/sessions/service/service.test.ts, line 594-874 (link)

    P2 Prefer parameterised tests over repeated it blocks

    The five new unit tests (recovery via live flag, recovery via event scan, stale run id, prompt in flight, no boot evidence) share an almost-identical scaffold: create a mock session, configure store mocks, call watchCloudTask, extract subscribeOptions, fire onData, and assert. The only meaningful variation is the session shape and the expected outcome. Per the team's simplicity rules this should be expressed as a single it.each table (or two — one for positive recovery paths, one for negative guards), which would make adding future cases trivial and keep the test file from growing proportionally with every new guard condition.

    Context Used: Do not attempt to comment on incorrect alphabetica... (source)

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: apps/code/src/renderer/features/sessions/service/service.test.ts
    Line: 594-874
    
    Comment:
    **Prefer parameterised tests over repeated `it` blocks**
    
    The five new unit tests (recovery via live flag, recovery via event scan, stale run id, prompt in flight, no boot evidence) share an almost-identical scaffold: create a mock session, configure store mocks, call `watchCloudTask`, extract `subscribeOptions`, fire `onData`, and assert. The only meaningful variation is the session shape and the expected outcome. Per the team's simplicity rules this should be expressed as a single `it.each` table (or two — one for positive recovery paths, one for negative guards), which would make adding future cases trivial and keep the test file from growing proportionally with every new guard condition.
    
    **Context Used:** Do not attempt to comment on incorrect alphabetica... ([source](https://app.greptile.com/review/custom-context?memory=instruction-0))
    
    How can I resolve this? If you propose a fix, please make it concise.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

  2. apps/code/src/renderer/features/sessions/service/service.ts, line 3211-3256 (link)

    P2 Fallback scan misses _posthog/turn_complete for old agents in the recreate-from-logs case

    The live flag (agentReadyForRunId) is set by both _posthog/run_started and _posthog/turn_complete (the backward-compat path for agents that predate run_started). The fallback event-scan in hasAgentBootedForRun only checks _posthog/run_started, because turn_complete carries no runId param and can't be safely scoped to the current run.

    The gap: for an old-agent session that was recreated from logs (no-delta dedup guard skipped reprocessing, so agentReadyForRunId was never set) whose events contain only a turn_complete, neither branch returns true — the message stays stranded. The live-flag path protects the common online case, but the hydrate-from-logs scenario for old agents remains unfixed. Whether old agents are still common enough to warrant a workaround (e.g., storing agentReadyForRunId in the persisted log entries themselves) is worth considering.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: apps/code/src/renderer/features/sessions/service/service.ts
    Line: 3211-3256
    
    Comment:
    **Fallback scan misses `_posthog/turn_complete` for old agents in the recreate-from-logs case**
    
    The live flag (`agentReadyForRunId`) is set by both `_posthog/run_started` and `_posthog/turn_complete` (the backward-compat path for agents that predate `run_started`). The fallback event-scan in `hasAgentBootedForRun` only checks `_posthog/run_started`, because `turn_complete` carries no `runId` param and can't be safely scoped to the current run.
    
    The gap: for an old-agent session that was recreated from logs (no-delta dedup guard skipped reprocessing, so `agentReadyForRunId` was never set) whose events contain only a `turn_complete`, neither branch returns `true` — the message stays stranded. The live-flag path protects the common online case, but the hydrate-from-logs scenario for old agents remains unfixed. Whether old agents are still common enough to warrant a workaround (e.g., storing `agentReadyForRunId` in the persisted log entries themselves) is worth considering.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
apps/code/src/renderer/features/sessions/service/service.test.ts:594-874
**Prefer parameterised tests over repeated `it` blocks**

The five new unit tests (recovery via live flag, recovery via event scan, stale run id, prompt in flight, no boot evidence) share an almost-identical scaffold: create a mock session, configure store mocks, call `watchCloudTask`, extract `subscribeOptions`, fire `onData`, and assert. The only meaningful variation is the session shape and the expected outcome. Per the team's simplicity rules this should be expressed as a single `it.each` table (or two — one for positive recovery paths, one for negative guards), which would make adding future cases trivial and keep the test file from growing proportionally with every new guard condition.

### Issue 2 of 2
apps/code/src/renderer/features/sessions/service/service.ts:3211-3256
**Fallback scan misses `_posthog/turn_complete` for old agents in the recreate-from-logs case**

The live flag (`agentReadyForRunId`) is set by both `_posthog/run_started` and `_posthog/turn_complete` (the backward-compat path for agents that predate `run_started`). The fallback event-scan in `hasAgentBootedForRun` only checks `_posthog/run_started`, because `turn_complete` carries no `runId` param and can't be safely scoped to the current run.

The gap: for an old-agent session that was recreated from logs (no-delta dedup guard skipped reprocessing, so `agentReadyForRunId` was never set) whose events contain only a `turn_complete`, neither branch returns `true` — the message stays stranded. The live-flag path protects the common online case, but the hydrate-from-logs scenario for old agents remains unfixed. Whether old agents are still common enough to warrant a workaround (e.g., storing `agentReadyForRunId` in the persisted log entries themselves) is worth considering.

Reviews (1): Last reviewed commit: "fix(code): drain cloud queue for idle re..." | Re-trigger Greptile

@tatoalo tatoalo force-pushed the fix/code/idle-queued-cloud-messages branch from f1fbb7d to d307750 Compare May 19, 2026 11:01
@tatoalo tatoalo requested a review from a team May 19, 2026 11:02
Copy link
Copy Markdown
Contributor

@joshsny joshsny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think some of this logic might be getting unweildy and we might need to come back and simplify it afterwards, but sounds like this will improve things at least for now, great job finding the issue

@tatoalo
Copy link
Copy Markdown
Contributor Author

tatoalo commented May 19, 2026

@joshsny agree completely, it's becoming a bit of a 🍝 , we need some time to have proper ACP cleanup of all this

@tatoalo tatoalo added the Create Release This will trigger a new release label May 19, 2026
@tatoalo tatoalo merged commit ae05339 into main May 19, 2026
15 checks passed
@tatoalo tatoalo deleted the fix/code/idle-queued-cloud-messages branch May 19, 2026 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Create Release This will trigger a new release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants