Skip to content

remote-session: detect stale resume sessions reported via JSON error events#3406

Merged
gcsecsey merged 4 commits into
trunkfrom
gcsecsey/fix-local-agent-message
May 11, 2026
Merged

remote-session: detect stale resume sessions reported via JSON error events#3406
gcsecsey merged 4 commits into
trunkfrom
gcsecsey/fix-local-agent-message

Conversation

@gcsecsey
Copy link
Copy Markdown
Member

@gcsecsey gcsecsey commented May 8, 2026

Related issues

How AI was used in this PR

Claude Code investigated the daemon log together with me, traced the regression to PR #3360, and drafted the fix and tests. I drove the diagnosis and reviewed the diff before committing.

Proposed Changes

After #3360, the session ID stored by the remote-session bridge from a previous build can no longer be resolved by the new SessionManager. studio code --resume-session <id> now fails immediately with error No AI session found for resume ID: .... The poll loop already has a recovery path, gated on outcome.staleSession. That detection was bypassed because of where the failure surfaces:

  • The new pi runtime reports the resume failure as a JSON error event on stdout, not on stderr.
  • detectStaleSession() in apps/cli/remote-session/turn-runner.ts only inspected stderrTail and replyText, so the stale-session pattern (/no ai session found for resume id/i) never had a chance to match.
  • Result: the poll loop fell through to the generic non-zero-exit branch and posted Local agent error (no output). to Telegram. The stale ID stayed in the remote-session state, so every subsequent turn hit the same error.

This PR:

  • Captures the message from JSON error events into a local lastErrorEventMessage in runTurn().
  • Feeds it into detectStaleSession() alongside stderrTail and replyText. Cleaned up dead code in that function while there (the previous isError short-circuit had no effect on the result).
  • Adds a stale-resume-error-event scenario to the studio-code mock that emits the failure as a JSON error event (matching the real production behavior we observed in the log).
  • Adds a unit test asserting outcome.staleSession === true when the failure arrives via error event with empty stderr.

Once staleSession is correctly detected, the existing recovery path in poll-loop.ts:143-160 clears the saved session, retries without --resume-session, and the new sessionId from the retry's turn.completed is persisted at poll-loop.ts:170-172. The stuck-on-stale-ID loop self-heals on the next turn.

Testing Instructions

  • npm test -- apps/cli/remote-session should report 112 passing tests (was 111).
  • npm run typecheck should be clean.
  • Manual:
    • npm run cli:build && node apps/cli/dist/cli/main.mjs code remote-session start --detach (with a configured bot/chat).
    • In a second terminal: node apps/cli/dist/cli/main.mjs code remote-session attach.
    • Pre-populate the remote-session state file under ~/.studio/ with a non-existent session_id (or just have a session_id from before AI sessions: adopt pi-coding-agent SessionManager end-to-end #3360).
    • Send a message from Telegram. Expected before this PR: Local agent error (no output). Expected after: an info Resume failed; retrying without session_id line in the log, a Session expired; started a new one. message in Telegram, and the agent's actual reply right after.
    • The next turn should resume cleanly against the new sessionId.
State Telegram Daemon log
Before fix (stale resume) image CleanShot 2026-05-08 at 16 17 44@2x
After fix (auto-recovery) CleanShot 2026-05-08 at 16 20 35@2x CleanShot 2026-05-08 at 16 23 11@2x

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

@gcsecsey gcsecsey requested review from Copilot and epeicher May 8, 2026 15:23
@gcsecsey gcsecsey marked this pull request as ready for review May 8, 2026 15:23
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a regression in the CLI remote-session “resume” flow by detecting stale --resume-session IDs when the underlying runtime reports the failure as a JSON error event on stdout (instead of stderr), enabling the existing recovery path to clear the stale ID and retry.

Changes:

  • Capture the message from JSON error events during runTurn() and include it in stale-session detection.
  • Simplify detectStaleSession() to scan a combined haystack of stderrTail, last reply text, and the last JSON error message.
  • Add a new mock scenario plus a unit test covering stale resume failures emitted via JSON error events (with empty stderr).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
apps/cli/remote-session/turn-runner.ts Track JSON error messages and use them to detect stale resume sessions.
apps/cli/remote-session/tests/turn-runner.test.ts Add a unit test asserting staleSession is detected via JSON error events.
apps/cli/remote-session/tests/fixtures/mock-studio-code.mjs Add a mock scenario that emits the stale-resume failure as a JSON error event.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread apps/cli/remote-session/turn-runner.ts Outdated
@wpmobilebot
Copy link
Copy Markdown
Collaborator

📊 Performance Test Results

Comparing a72f5e4 vs trunk

app-size

Metric trunk a72f5e4 Diff Change
App Size (Mac) 1404.70 MB 1404.70 MB +0.00 MB ⚪ 0.0%

site-editor

Metric trunk a72f5e4 Diff Change
load 1493 ms 1469 ms 24 ms ⚪ 0.0%

site-startup

Metric trunk a72f5e4 Diff Change
siteCreation 8586 ms 8567 ms 19 ms ⚪ 0.0%
siteStartup 4925 ms 4938 ms +13 ms ⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

Copy link
Copy Markdown
Contributor

@epeicher epeicher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gcsecsey, this is working as expected. Changes also LGTM!

Image

@gcsecsey gcsecsey merged commit f6b53d7 into trunk May 11, 2026
10 checks passed
@gcsecsey gcsecsey deleted the gcsecsey/fix-local-agent-message branch May 11, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants