Skip to content

Fix session history loss on restart: detect session ID mismatch and find best events source#382

Merged
PureWeen merged 2 commits intomainfrom
fix/session-history-loss-resume-id-mismatch
Mar 16, 2026
Merged

Fix session history loss on restart: detect session ID mismatch and find best events source#382
PureWeen merged 2 commits intomainfrom
fix/session-history-loss-resume-id-mismatch

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Bug

The 'AndroidShellHandler' session loses all history on every PolyPilot restart. 181 messages show up, but they're always the same stale set from Mar 13 — work done since then is lost.

Root Cause (follows PR #378)

PR #378 fixed the missing flush after restore, but there are two additional issues:

  1. Session ID mismatch in ResumeSessionAsync: Line 1618 stored sessionId (the input) instead of copilotSession.SessionId (the actual ID returned by the SDK). If the persistent server returns a different session ID, events get written to the wrong directory.

  2. Stale events source in fallback path: When active-sessions.json points to an old session ID (before PR fix: flush active-sessions.json after restore to persist recreated session IDs #378 flush fix), the fallback always loads from that stale ID's events.jsonl (517 lines), ignoring intermediate sessions that accumulated more history (10,488+ lines).

Fix

  1. ResumeSessionAsync: Detect when copilotSession.SessionId != sessionId, copy events to the new directory, and use the actual session ID going forward.

  2. FindBestEventsSource(): New method that scans session directories matching the working directory and picks the one with the most events. Used by the fallback restore path instead of blindly loading from the stale entry.

  3. CopyEventsToNewSession(): Extracted from the inline copy logic into a shared helper, now used by ResumeSessionAsync, the reconnect path, and the fallback restore path.

Tests

  • 5 new functional tests for FindBestEventsSource (correct session found, different cwd ignored, null cwd, no workspace.yaml)
  • 4 new structural regression tests ensuring session ID mismatch detection exists in resume, reconnect, and fallback paths
  • All 2590 tests pass (1 pre-existing flaky test in ZeroIdleCaptureTests)
  • Mac Catalyst build clean (0 errors)

PureWeen and others added 2 commits March 15, 2026 19:20
…ents source

Three fixes for the recurring 'AndroidShellHandler loses history on restart' bug:

1. ResumeSessionAsync: detect when SDK returns a different session ID than
   requested (actualSessionId != sessionId). When this happens, copy events
   to the new directory and use the actual ID. Previously the input ID was
   always stored, causing events to be written to a directory that doesn't
   match what active-sessions.json tracks.

2. Fallback restore path: use FindBestEventsSource() to scan session
   directories for the one with the most events matching the working
   directory, instead of always loading from the stale entry.SessionId.
   This recovers accumulated history from intermediate sessions created
   during previous fallback cycles.

3. Extract CopyEventsToNewSession as a shared helper (was inline in the
   fallback path). Now used by ResumeSessionAsync, the reconnect path,
   and the fallback restore path.

Root cause: When active-sessions.json has a stale session ID (from before
PR #378's flush fix), each restart creates a new session via fallback.
LoadHistoryFromDisk reads from the stale ID's events.jsonl (517 lines)
instead of the most recent intermediate session (10,488 lines). The user
sees the same old history every time, losing all work done since.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ew tests

Fixes from code review on PR #382:

1. FALSE POSITIVE PATH MATCHING (High): Replace yaml.Contains() with
   exact cwd line parsing via WorkspaceYamlMatchesCwd(). Prevents
   '/project' from matching '/project-2' in workspace.yaml.

2. PERFORMANCE (Medium): Limit FindBestEventsSource to scan at most
   MaxSessionDirsToScan (200) directories, sorted newest-first via
   OrderByDescending(LastWriteTimeUtc).

3. TEMP FILE LEAK (Low): Declare tmpFile outside try block in
   CopyEventsToNewSession; clean up in catch if File.Move fails.

4. REDUNDANT FALLBACK (Low): Simplify SessionId assignment from
   'actualSessionId ?? sessionId' to just 'sessionId' since sessionId
   is already reassigned to actualSessionId in the mismatch case.

5. RACE CONDITION (Medium): Added documentation that CopyEventsToNewSession
   runs before event handler registration, minimizing the race window
   with SDK event writes.

New tests (10):
- FindBestEventsSource_DoesNotMatchSubstringPaths
- FindBestEventsSource_HandlesEmptyEventsJsonl
- WorkspaceYamlMatchesCwd_ExactMatch
- WorkspaceYamlMatchesCwd_RejectsSubstringMatch
- WorkspaceYamlMatchesCwd_CaseInsensitive
- WorkspaceYamlMatchesCwd_ReturnsFalse_WhenNoCwdLine
- WorkspaceYamlMatchesCwd_HandlesQuotedValue
- FindBestEventsSource_LimitsDirectoryScan (structural)
- CopyEventsToNewSession_CleansTempFileOnFailure (structural)
- Updated ResumeSessionAsync structural test for simplified fallback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #382 Review — Multi-Model Consensus Report

CI: No checks configured
Tests: ✅ 2599/2599 passing (13 new tests all pass)
Prior reviews: None


Consensus Findings (2+ models)

🟡 MODERATE — — Race in CopyEventsToNewSession prepend path

Flagged by 4/5 models

When the new session dir already has an events.jsonl (the existed = true branch), the code snapshots newLines = File.ReadAllLines(newEvents) (line 874), writes a tmp file, then atomically replaces via File.Move(overwrite: true) (line 881). By the time ResumeSessionAsync returns, the SDK has a live connection to actualSessionId and can begin appending replay events to that file. Any SDK writes between the ReadAllLines and the File.Move are silently lost — the overwrite replaces the file with the pre-snapshot version.

The existed = false path (most common for a freshly-remapped session ID) is not affected. Impact is limited to on-disk recovery; in-memory event flow is unaffected. But on a subsequent app restart, history could appear shorter than expected.

Mitigation: Use a file lock or, simpler, skip the prepend when isStillProcessing = true — if the session is still live, the events it writes are canonical and the old prepend isn't needed for recovery.


🟢 MINOR — CopilotService.Persistence.cs:923,954FindBestEventsSource full-file line counts

Flagged by 3/5 models

File.ReadLines(eventsFile).LongCount() streams the entire file for each candidate session directory (up to MaxSessionDirsToScan = 200). In the common case most dirs won't have a matching workspace.yaml cwd so the filter eliminates them before the line count, and the path is restore-only (not on UI thread). Acceptable in practice but worth noting if startup perf degrades as history grows.


Dismissed Concerns

  • Reconnect path missing flush (raised by Gemini): at line 2807 is followed by at line 2935 unconditionally — confirmed in source. DISMISSED.
  • Structural test string match: Assert.Contains("CopyEventsToNewSession(sessionId, actualSessionId)", serviceFile) — exact string confirmed at CopilotService.cs:1622. DISMISSED.
  • BulkInsertAsync ordering: sessionId = actualSessionId runs before BulkInsertAsync(sessionId, history) — correct. DISMISSED.

Test Coverage Assessment

New tests cover the happy paths well:

  • FindBestEventsSource with no better source, better source by line count, different cwd, null cwd, no workspace.yaml, empty events, substring path false positive, exact match
  • Structural guards for session ID mismatch detection and reconnect path

Missing test: CopyEventsToNewSession with concurrent write (the race in the existed = true path). Suggested test: write a few bytes to the new session's events.jsonl between setup and calling CopyEventsToNewSession, verify those bytes aren't lost.


Verdict

⚠️ Request changes (optional) — the MODERATE race is real but narrow and only affects an edge case where: (1) the SDK remaps the session ID AND (2) has already started writing events to the new dir. The PR is a clear improvement over the prior code and all 2599 tests pass.

If the author is comfortable accepting the narrow race (documented in a code comment), this could also be ✅ approved as-is. The fix for the race (skip prepend when session is live, or lock the file) is straightforward.

@PureWeen PureWeen merged commit 1acda33 into main Mar 16, 2026
@PureWeen PureWeen deleted the fix/session-history-loss-resume-id-mismatch branch March 16, 2026 04:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant