Skip to content

fix: distinguish session lock vs SDK errors in resume handling#417

Merged
PureWeen merged 1 commit intomainfrom
fix/session-lock-detection
Mar 23, 2026
Merged

fix: distinguish session lock vs SDK errors in resume handling#417
PureWeen merged 1 commit intomainfrom
fix/session-lock-detection

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

Resuming a closed Copilot CLI session showed Session is locked even when no lock files existed. The old code pattern-matched SDK error messages for session file + corrupt and then displayed a misleading locked message. We don't actually know why the SDK failed.

Fix

  • Pre-resume lock file check: New static FindLiveLockPid() checks inuse.*.lock files for live copilot-related processes before calling the SDK. Only says locked when there is proof (a live PID holding a lock file).
  • Stale lock cleanup: Deletes lock files from dead processes so they don't block future resumes.
  • Remove IsCorruptSessionError: Stop guessing at SDK error causes. All SDK errors now pass through FriendlyResumeError which shows the actual message.
  • Fix misleading doc comment: FindAnyLiveLockPid claimed without verifying the process name but the code does verify against copilot/node/dotnet/github.

Files Changed

  • PolyPilot/Components/Layout/SessionSidebar.razor - Pre-resume lock check, removed error guessing
  • PolyPilot/Services/ExternalSessionScanner.cs - New FindLiveLockPid static method, fixed doc comment
  • PolyPilot.Tests/ConnectionRecoveryTests.cs - Updated structural regression test

Multi-model review

Reviewed by Claude Sonnet, Claude Opus 4.5, GPT-5.1, and Gemini 3 Pro. Cross-reviewed findings between models.

@PureWeen
Copy link
Copy Markdown
Owner Author

PR #417 Review: Session lock vs SDK error handling

CI: ⚠️ No CI checks on branch
Existing reviews: 0
Tests: Failed: 2 (pre-existing flakies — TurnEndFallbackTests.FallbackTimer_NotCancelled_FiresAfterDelay + WsBridgeIntegrationTests.Organization_BroadcastsStateBack_ToClient, both pass in isolation on main), Passed: 2814


Summary

The core approach is correct and a clear improvement: replacing fragile SDK error-string pattern matching (IsCorruptSessionError) with proof-based lock file detection (FindLiveLockPid). The finally { isCreating = false; } pattern is safe with the early return. Mobile platforms are safely skipped via Directory.Exists. All File.Delete calls are wrapped in try/catch. IsCorruptSessionError and its call site are both removed.


Consensus Findings (2+ of 5 models)

🟢 N1 (3/5) — Code duplication: FindLiveLockPid vs FindActiveLockPid / FindAnyLiveLockPid
ExternalSessionScanner.cs:526,488,571

The PR adds a third lock-checking method (FindLiveLockPid static) alongside the existing FindActiveLockPid (instance) and FindAnyLiveLockPid (static). All three enumerate inuse.*.lock files, resolve PIDs, and check process names. Future fixes to the process-name heuristic (copilot/node/dotnet/github) must be applied to all three. Consider whether FindActiveLockPid and FindAnyLiveLockPid could be refactored to delegate to FindLiveLockPid. Non-blocking for this PR.

🟢 N2 (2/5) — Session-state path hardcoded inline
SessionSidebar.razor:1877

System.IO.Path.Combine(Environment.GetFolderPath(SpecialFolder.UserProfile), ".copilot", "session-state", persisted.SessionId) duplicates logic already in CopilotService's session-state path. Bypasses SetBaseDirForTesting() isolation. Pre-existing pattern elsewhere in the codebase, but this PR adds another call site. Worth extracting to a shared helper.


Non-consensus notes (1 model each, not blocking)

  • FriendlyResumeError has no "locked"/"in use" branch — if the SDK throws a locking error that the pre-check missed (e.g. process name is gh), the user sees the raw message. Acceptable given the pre-check handles the common case.
  • Stale lock files are not guaranteed to be cleaned when a live lock is found first (enumeration order not guaranteed) — very low practical risk.
  • No behavioral unit tests for FindLiveLockPid cleanup logic — the structural test only checks string presence.

Verification

  • All File.Delete calls wrapped in try { } catch { }
  • IsCorruptSessionError removed from both call site and method declaration ✅
  • finally { isCreating = false; } runs on early return
  • 2 failing tests are pre-existing flakies unrelated to this PR ✅

Verdict

Approve — The fix is correct and addresses the root cause. N1/N2 are code quality suggestions for follow-up, not bugs. The approach of checking actual lock files before the SDK call is sound.

The ResumeSession error handler showed 'Session is locked' for all SDK
errors containing 'session file' + 'corrupt', even when no lock files
existed. We don't actually know why the SDK failed — guessing 'corrupted'
is just as wrong as guessing 'locked'.

Changes:
- Add pre-resume lock file check using new static FindLiveLockPid()
  that checks inuse.*.lock files for live copilot-related processes
- Clean up stale lock files (dead PIDs) before attempting resume
- Remove IsCorruptSessionError — stop guessing at SDK error causes
- Pass all SDK errors through FriendlyResumeError with the actual message
- Extract reusable FindLiveLockPid from ExternalSessionScanner
- Fix misleading doc comment on FindAnyLiveLockPid (claimed no process
  name check, but code does verify against copilot/node/dotnet/github)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/session-lock-detection branch from 0abfbe3 to ce828c8 Compare March 23, 2026 15:45
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #417 Round 2 Review — Session Lock vs SDK Error Fix

No new commits since Round 1 (HEAD ce828c8e, 1 commit). Re-reviewing previous findings.

CI Status

⚠️ 2 pre-existing flaky failures (TurnEndFallbackTests.FallbackTimer_NotCancelled_FiresAfterDelay, WsBridgeIntegrationTests.Organization_BroadcastsStateBack_ToClient). Both pass in isolation on main; not touched by this PR. 2,814 other tests passing.


Previous Findings Status

N1 🟢 (code duplication: FindLiveLockPid vs FindAnyLiveLockPid) — STILL PRESENT, not blocking

Diff-verified: FindAnyLiveLockPid (private static, functionally near-identical) still exists alongside the new FindLiveLockPid (internal static). Three of five sub-agents incorrectly reported this as resolved — the diff was checked directly to confirm. That said, both methods are private/internal within the same file, the duplication is contained, and the code reads clearly. Suitable follow-up refactor, not a blocker.

N2 🟢 (hardcoded ~/.copilot/session-state/ path in SessionSidebar.razor) — STILL PRESENT, not blocking

Consensus 4/5 models agree this is present but low-risk. Root cause clarification: SetBaseDirForTesting() redirects ~/.polypilot/ paths (org, sessions, UI state) but also clears _sessionStatePath — so CopilotService.SessionStatePath IS redirected by the test hook. SessionSidebar.razor constructs the path independently via Environment.GetFolderPath(UserProfile), bypassing the service. This only matters if a future test directly exercises the pre-resume lock check in Razor. Pre-existing pattern in the codebase; desktop-only code path.


New Findings (Consensus Filter)

No new issues reached the 2/5 consensus threshold. Low-severity observations noted by single models only:

  • cleanupStale: true called from Sidebar before resume attempt has a silent filesystem side-effect (stale lock files deleted even if user cancels). Harmless, but worth a code comment.
  • PID-recycling false-positive envelope on "dotnet" process name filter — acceptable as a best-effort UX guard.

Test Coverage

The ConnectionRecoveryTests structural guard was correctly updated to assert FindLiveLockPid + FriendlyResumeError instead of IsCorruptSessionError. Direct behavioral tests for lock-file parsing (e.g., malformed filenames, live vs stale PID discrimination) remain absent — acceptable as follow-up.


Summary

Finding Status Blocking?
N1: Code duplication Still present 🟢 No
N2: Hardcoded path in Razor Still present 🟢 No

The fix itself is correct: replaces fragile string pattern-matching (IsCorruptSessionError) with proof-based lock-file pre-check (FindLiveLockPid). Stale lock cleanup is a nice bonus. All error paths fall through to FriendlyResumeError. Mobile path is safe (directory won't exist). TOCTOU is inherent and benign.

Recommended action: ✅ Approve — N1 and N2 are quality/maintainability nits appropriate for a follow-up. No bugs, no regressions, no blocking issues.

@PureWeen PureWeen merged commit 1dad864 into main Mar 23, 2026
@PureWeen PureWeen deleted the fix/session-lock-detection branch March 23, 2026 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant