fix(sdk): drop stale RELAYFILE_TOKEN env shadow + recover from ws error with no successor close#99
Conversation
…or with no successor close Two related bugs surfaced in the cortical-demo against 0.6.12. Both leave the onWrite dispatcher silently stalled instead of streaming events. 1. onWrite re-degraded to a stale env literal whenever RELAYFILE_TOKEN was set in the caller's environment (the default for any workspace started via WorkspaceHandle.mountEnv, which captures the JWT once at join-time and never refreshes it). After ~1 hour the literal expired, the WS upgrade failed with an auth error, and the auto-derive fix from PR #96 never got a chance to run. Auto-derive now wins over the env literal whenever a client is available — the default-client path still inherits RELAYFILE_TOKEN through the client's own tokenProvider, so env-only callers keep working. 2. Some WebSocket implementations (notably Node 22's built-in WebSocket on auth-rejected upgrades, and proxies that abruptly RST after the handshake) emit `error` and never deliver the matching `close`. The close handler was the only path that scheduled reconnect / started polling, so the dispatcher would emit one error and then go silent forever. The error handler now arms a 250ms grace timer; if no close arrives, it forces the same recovery path the close handler would have taken. The well-behaved (close-after-error) path still works — the close handler clears the timer. Companion to PR #98 (ErrorEvent polyfill — also in sync.ts). The two changes can be rebased independently; this PR does not touch the normalizeError function. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR updates token precedence to let RelayFileSync auto-derive auth from client.getToken() (no env-token fallback) and adds a WebSocket error→close grace watchdog that forces recovery when an error is not followed by a close. ChangesToken Precedence and Error Recovery
Sequence DiagramssequenceDiagram
participant Client
participant OnWriteDispatcher
participant RelayFileSync
OnWriteDispatcher->>RelayFileSync: ensureSync(token: undefined)
RelayFileSync->>Client: getToken() on (re)connect
Client-->>RelayFileSync: fresh token
sequenceDiagram
participant WebSocket
participant RelayFileSync
participant Watchdog
WebSocket->>RelayFileSync: error event
RelayFileSync->>Watchdog: arm errorRecoveryTimer
Note right of Watchdog: timer fires if no close arrives
Watchdog->>RelayFileSync: forceReconnect(preferPolling?)
RelayFileSync->>RelayFileSync: clearErrorRecoveryTimer()
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/sdk/typescript/src/sync.ts`:
- Around line 505-534: The error watchdog must treat errors on sockets that
never reached OPEN as polling fallbacks instead of always calling
forceReconnect(); change the timeout handler in the errorRecoveryTimer block to
check whether the failed socket had ever reached open (e.g., inspect
socket.readyState === WebSocket.OPEN or an existing "socketOpened" / "hasOpened"
flag on the dispatcher instance) and, if it never opened, invoke the polling
fallback path (call the method that triggers onPollingFallback/startPolling or
call forceReconnect with an explicit option/flag that tells it to immediately
transition to polling) rather than the current unconditional
this.forceReconnect(socket, "ws-error-no-close"); also update forceReconnect
(and any call sites) to accept and honor that flag so the pre-open path does not
just retry WebSocket forever; apply the same change to the analogous code around
the other occurrence mentioned (lines ~793-813).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 07b1f3bd-afad-4cf8-b938-388c69d9e7cf
📒 Files selected for processing (4)
packages/sdk/typescript/src/onWrite.test.tspackages/sdk/typescript/src/onWrite.tspackages/sdk/typescript/src/sync.test.tspackages/sdk/typescript/src/sync.ts
…connect Addresses CodeRabbit feedback on PR #99. The error watchdog added in the prior commit unconditionally called forceReconnect() when an error fired without a successor close. For sockets that never reached OPEN — auth rejection on the upgrade, proxy RST during the handshake, server cold start returning 502 — that re-runs the same failing handshake forever. Track per-attach whether the socket reached OPEN (currentSocketHasOpened, reset when openWebSocket attaches a fresh socket; flipped true in the `open` handler). When the watchdog fires and the socket never opened, pass preferPolling:true to forceReconnect, which now routes to startPolling("forced-polling-pre-open") instead of scheduling a reconnect. Caller still gets events via HTTP and the underlying error surfaces through onPollingFallback rather than burning the reconnect loop. Test: "falls back to polling when the ws errors before reaching OPEN" asserts onPollingFallback fires with reason "forced-polling-pre-open" and no second WS socket is opened. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two related bugs in the
onWriteflow surfaced against@relayfile/sdk@0.6.12while running the cortical-demo on Node 22.22.1. Both leave the dispatcher silently stalled (one WS error, then nothing) instead of streaming events.Bug A — auto-derive shadowed by stale
RELAYFILE_TOKENenvonWrite.tsresolved the WS token asoptions.token ?? readEnv(\"RELAYFILE_TOKEN\") ?? undefined. Any caller that joined a workspace viaWorkspaceHandle.mountEnv()ends up withRELAYFILE_TOKENset in their env to the JWT captured at join-time — andmountEnv()never refreshes it. So the env literal would silently shadow the auto-derive fix from PR #96, and ~1 hour after workspace-join the WS upgrade started failing with an auth error.The fix: when the caller omits
options.token, fall through toundefinedand letRelayFileSyncauto-derive on every (re)connect viaclient.getToken(). The default-client path (whenoptions.clientis omitted) still works becausegetDefaultClient()itself wiresRELAYFILE_TOKENinto the client'stokenProvider.The cortical-demo workaround — explicitly passing
token: () => workspace.client().tokenProvider()— was working precisely because it bypassed the env shadow and letgetOrRefreshToken()run on every connect. After this PR, that workaround is no longer required.Bug B — WS error with no successor close
Some WebSocket implementations (notably Node's built-in WebSocket on auth-rejected upgrades, and proxies that RST after the upgrade handshake) deliver
errorand never the matchingclose. The close handler was the only path that scheduled reconnect / started polling, so the dispatcher emitted one error and then sat silent forever — the exact symptom the cortical-demo reported.The fix arms a short (250ms) grace timer in the WS
errorhandler. If aclosearrives in time (the well-behaved path), the close handler clears it. Otherwise the timer fires, sees the socket is still current, and forces the same recovery path the close handler would have taken (forceReconnect→scheduleReconnectorstartPollingif reconnect is disabled). Polling fallback now also engages reliably under this failure mode, addressing the secondary symptom from the bug report.Relationship to PR #98
PR #98 (ErrorEvent polyfill) is also in
sync.tsand is independent of these changes. The two can be rebased in either order — this PR does not touchnormalizeError.Test plan
npx vitest runinpackages/sdk/typescript— 112/112 pass (added 3 new tests: env-shadowing, error-no-close recovery, error-followed-by-close de-dupe)tsccleanscripts/orchestrator.tswith the WS-token workaround removed (i.e. omittokenfrom theonWriteoptions) and confirm live events stream after a Notion edit🤖 Generated with Claude Code