Skip to content

workaround: avoid SDK cancellation serialization bug#320

Merged
PureWeen merged 3 commits intomainfrom
fix/sdk-cancellation-serialization-workaround
Mar 9, 2026
Merged

workaround: avoid SDK cancellation serialization bug#320
PureWeen merged 3 commits intomainfrom
fix/sdk-cancellation-serialization-workaround

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Mar 9, 2026

Summary

Pass \CancellationToken.None\ to SDK \SendAsync\ calls to prevent \StreamJsonRpc's \StandardCancellationStrategy\ from attempting to serialize \RequestId, which is not included in the SDK's AOT-safe JSON context.

Fixes #319

Root Cause (SDK Bug)

When a \CancellationToken\ fires during an SDK call:

  1. \StreamJsonRpc.StandardCancellationStrategy.CancelOutboundRequest()\ tries to send a $/cancelRequest\ notification
  2. This requires serializing \StreamJsonRpc.RequestId\
  3. The SDK's \SystemTextJsonFormatter\ doesn't include \RequestId\ in its \TypeInfoResolverChain\
  4. \NotSupportedException\ is thrown, becomes unobserved, kills the JSON-RPC reader loop silently
  5. Session gets stuck — no events arrive, watchdog eventually fires

Why This Workaround Works

PolyPilot already handles cancellation at the TCS level:

  • \AbortSessionAsync\ calls \ResponseCompletion.TrySetCanceled()\
  • Watchdog timeout clears \IsProcessing\ state
  • All abort/error paths properly clean up state

The \CancellationToken\ was only used by SDK to send server-side cancellation notifications — which is the broken path.

Trade-off

The server won't receive $/cancelRequest\ notifications, so it may continue processing cancelled requests. However:

  • Local state is correctly cleaned up (good UX)
  • Watchdog provides safety net
  • Server resources are wasted but operations complete gracefully

Investigation Summary

4 parallel sub-agents analyzed:

Agent Focus Finding
SDK changelogs Breaking changes None documented for cancellation
SDK Client.cs JsonRpc config Confirms \StandardCancellationStrategy\ used, \RequestId\ missing from JSON context
PolyPilot audit CancellationToken usage Our usage is correct
SDK samples Recommended patterns Don't use CancellationToken at all

All agents agree: This is an SDK bug, not a PolyPilot defect.

Changes

  • CopilotService.cs: 2 lines changed
    • Primary \SendAsync\ call (line 2351)
    • Reconnect retry \SendAsync\ call (line 2499)

Testing

  • Build passes
  • 139 watchdog tests pass
  • Manual testing: orchestrator sessions no longer get stuck on cancellation

SDK Versions Affected

  • Broken: 0.1.30+ (after AOT/trimming changes)
  • Works: 0.1.26

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Pass CancellationToken.None to SDK SendAsync calls to prevent
StreamJsonRpc's StandardCancellationStrategy from attempting to
serialize RequestId, which is not included in the SDK's AOT-safe
JSON context.

## Root Cause (SDK Bug)

When a CancellationToken fires during an SDK call:
1. StreamJsonRpc's StandardCancellationStrategy.CancelOutboundRequest()
   tries to send a \$/cancelRequest notification
2. This requires serializing StreamJsonRpc.RequestId
3. The SDK's SystemTextJsonFormatter doesn't include RequestId in its
   TypeInfoResolverChain (ClientJsonContext, TypesJsonContext, etc.)
4. NotSupportedException is thrown, becomes unobserved, kills the
   JSON-RPC reader loop silently
5. Session gets stuck — no events arrive, watchdog eventually fires

## Why This Workaround Works

PolyPilot already handles cancellation at the TCS level:
- AbortSessionAsync calls ResponseCompletion.TrySetCanceled()
- Watchdog timeout clears IsProcessing state
- All abort/error paths properly clean up state

The CancellationToken was only used by SDK to send server-side
cancellation notifications — which is the broken path.

## Trade-off

The server won't receive \$/cancelRequest notifications, so it may
continue processing cancelled requests. However:
- Local state is correctly cleaned up (good UX)
- Watchdog provides safety net
- Server resources are wasted but operations complete gracefully

## Investigation

4 sub-agents analyzed: SDK changelogs (no breaking changes), SDK
Client.cs (confirms missing RequestId), PolyPilot usage (correct),
SDK samples (don't use CancellationToken). All confirmed SDK bug.

Fixes #319

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/sdk-cancellation-serialization-workaround branch from fe42b32 to 366fbe3 Compare March 9, 2026 03:37
PureWeen and others added 2 commits March 8, 2026 23:50
- CRITICAL: Fix missed call site in PolyPilot.Console/Models/AgentSession.cs
  The Console app was also vulnerable to the RequestId serialization crash.

- MODERATE: Add comment about watchdog limitation when SendAsync blocks
  If the transport write itself blocks (half-open TCP), watchdog completes
  the TCS but execution never reaches it. Edge-case on mobile/codespace.

- MINOR: Add explanatory comment to soft-steer path (line 2702)
  Functionally safe (implicit CancellationToken.None) but prevents
  future maintainers from accidentally reintroducing the bug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion

If SendAsync throws (transport error, connection dropped, etc.),
execution exits without resetting IsProcessing. Unlike CopilotService,
AgentSession has no watchdog — only event handlers reset the flag.
If SendAsync throws before sending, no events arrive, and IsProcessing
stays true forever → all subsequent calls throw 'already processing'.

Fix: Wrap SendAsync and TCS await in try/catch that resets IsProcessing
on any exception before re-throwing.

Note: The TrySetCanceled() path self-heals because the server continues
processing (CancellationToken.None) and eventually fires SessionIdleEvent.
The SendAsync exception path does not self-heal without this fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 3bc8ea7 into main Mar 9, 2026
@PureWeen PureWeen deleted the fix/sdk-cancellation-serialization-workaround branch March 9, 2026 05:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate Copilot SDK cancellation serialization bug

1 participant