
fix(sdk): drop stale RELAYFILE_TOKEN env shadow + recover from ws error with no successor close#99

Merged
khaliqgant merged 2 commits into main from
fix/sdk-onwrite-prefer-live-client-token-and-error-recovery
May 8, 2026

Conversation

@khaliqgant
Member

Summary

Two related bugs in the onWrite flow surfaced against @relayfile/sdk@0.6.12 while running the cortical-demo on Node 22.22.1. Both leave the dispatcher silently stalled (one WS error, then nothing) instead of streaming events.

Bug A — auto-derive shadowed by stale RELAYFILE_TOKEN env

onWrite.ts resolved the WS token as options.token ?? readEnv("RELAYFILE_TOKEN") ?? undefined. Any caller that joined a workspace via WorkspaceHandle.mountEnv() ends up with RELAYFILE_TOKEN set in its environment to the JWT captured at join time, and mountEnv() never refreshes it. The env literal therefore silently shadowed the auto-derive fix from PR #96, and ~1 hour after workspace join the WS upgrade started failing with an auth error.

The fix: when the caller omits options.token, fall through to undefined and let RelayFileSync auto-derive on every (re)connect via client.getToken(). The default-client path (when options.client is omitted) still works because getDefaultClient() itself wires RELAYFILE_TOKEN into the client's tokenProvider.
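The precedence change can be sketched roughly as follows. This is a minimal illustration, not the SDK source: `TokenSource`, `ClientLike`, `OnWriteOptionsLike`, `resolveTokenBefore`, `resolveTokenAfter`, and `tokenForConnect` are hypothetical names, and the env is passed in as a plain record instead of going through readEnv.

```typescript
// Hypothetical shapes for illustration; the real SDK types may differ.
type TokenSource = string | (() => string | Promise<string>);

interface ClientLike {
  getToken(): Promise<string>;
}

interface OnWriteOptionsLike {
  token?: TokenSource;
  client: ClientLike;
}

// Before: a stale RELAYFILE_TOKEN literal wins whenever options.token is omitted.
function resolveTokenBefore(
  options: OnWriteOptionsLike,
  env: Record<string, string | undefined>,
): TokenSource | undefined {
  return options.token ?? env["RELAYFILE_TOKEN"] ?? undefined;
}

// After: only an explicit options.token is honored; otherwise the result is
// undefined and the sync layer is left to derive a token itself.
function resolveTokenAfter(options: OnWriteOptionsLike): TokenSource | undefined {
  return options.token;
}

// Assumed stand-in for what happens on every (re)connect: with no configured
// token, ask the live client, so an expired JWT is never reused.
async function tokenForConnect(
  configured: TokenSource | undefined,
  client: ClientLike,
): Promise<string> {
  if (configured === undefined) return client.getToken();
  return typeof configured === "function" ? configured() : configured;
}
```

With RELAYFILE_TOKEN set and no options.token, the "before" resolver returns the env literal, while the "after" resolver returns undefined and the connect step falls through to client.getToken().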

The cortical-demo workaround — explicitly passing token: () => workspace.client().tokenProvider() — was working precisely because it bypassed the env shadow and let getOrRefreshToken() run on every connect. After this PR, that workaround is no longer required.

Bug B — WS error with no successor close

Some WebSocket implementations (notably Node's built-in WebSocket on auth-rejected upgrades, and proxies that RST after the upgrade handshake) deliver error and never the matching close. The close handler was the only path that scheduled reconnect / started polling, so the dispatcher emitted one error and then sat silent forever — the exact symptom the cortical-demo reported.

The fix arms a short (250ms) grace timer in the WS error handler. If a close arrives in time (the well-behaved path), the close handler clears it. Otherwise the timer fires, checks that the socket is still current, and forces the same recovery path the close handler would have taken (forceReconnect → scheduleReconnect, or startPolling if reconnect is disabled). Polling fallback now also engages reliably under this failure mode, addressing the secondary symptom from the bug report.
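A minimal sketch of the error-to-close grace timer, under stated assumptions: `ErrorCloseWatchdog` and its method names are hypothetical, the "ws-close" reason string is made up for the example, and the real handler lives inside RelayFileSync instead of a separate class. Only ERROR_TO_CLOSE_GRACE_MS, the 250ms value, and the "ws-error-no-close" reason come from this PR.

```typescript
const ERROR_TO_CLOSE_GRACE_MS = 250;

// Hypothetical helper: recovery runs exactly once per failed socket, either
// from the close handler or from the grace timer when no close ever arrives.
class ErrorCloseWatchdog {
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly recover: (reason: string) => void;
  private readonly graceMs: number;

  constructor(recover: (reason: string) => void, graceMs: number = ERROR_TO_CLOSE_GRACE_MS) {
    this.recover = recover;
    this.graceMs = graceMs;
  }

  // Call from the socket's "error" handler. isStillCurrent lets the timer
  // skip recovery if the socket was already replaced in the meantime.
  handleError(isStillCurrent: () => boolean): void {
    this.clear();
    this.timer = setTimeout(() => {
      this.timer = undefined;
      if (isStillCurrent()) this.recover("ws-error-no-close");
    }, this.graceMs);
  }

  // Call from the socket's "close" handler: cancel the pending timer first so
  // the well-behaved error-then-close sequence does not recover twice.
  handleClose(): void {
    this.clear();
    this.recover("ws-close");
  }

  // Call from stop() so no timer outlives the dispatcher.
  clear(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
  }
}
```

The grace period is a trade-off: long enough for a well-behaved socket to deliver its close, short enough that an orphaned error does not leave the dispatcher silent for a noticeable time.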

Relationship to PR #98

PR #98 (ErrorEvent polyfill) is also in sync.ts and is independent of these changes. The two can be rebased in either order — this PR does not touch normalizeError.

Test plan

  • npx vitest run in packages/sdk/typescript — 112/112 pass (added 3 new tests: env-shadowing, error-no-close recovery, error-followed-by-close de-dupe)
  • tsc clean
  • Re-run cortical-demo scripts/orchestrator.ts with the WS-token workaround removed (i.e. omit token from the onWrite options) and confirm live events stream after a Notion edit
  • Confirm a forced auth failure (e.g. expired JWT) now logs the polling-fallback warn within ~250ms instead of stalling

🤖 Generated with Claude Code

…or with no successor close

Two related bugs surfaced in the cortical-demo against 0.6.12. Both leave
the onWrite dispatcher silently stalled instead of streaming events.

1. onWrite re-degraded to a stale env literal whenever RELAYFILE_TOKEN was
   set in the caller's environment (the default for any workspace started
   via WorkspaceHandle.mountEnv, which captures the JWT once at join-time
   and never refreshes it). After ~1 hour the literal expired, the WS
   upgrade failed with an auth error, and the auto-derive fix from PR #96
   never got a chance to run. Auto-derive now wins over the env literal
   whenever a client is available — the default-client path still inherits
   RELAYFILE_TOKEN through the client's own tokenProvider, so env-only
   callers keep working.

2. Some WebSocket implementations (notably Node 22's built-in WebSocket on
   auth-rejected upgrades, and proxies that abruptly RST after the
   handshake) emit `error` and never deliver the matching `close`. The
   close handler was the only path that scheduled reconnect / started
   polling, so the dispatcher would emit one error and then go silent
   forever. The error handler now arms a 250ms grace timer; if no close
   arrives, it forces the same recovery path the close handler would
   have taken. The well-behaved (close-after-error) path still works —
   the close handler clears the timer.

Companion to PR #98 (ErrorEvent polyfill — also in sync.ts). The two
changes can be rebased independently; this PR does not touch the
normalizeError function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 7, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f27fe1c8-f576-438c-9e7e-21ab730304ba

📥 Commits

Reviewing files that changed from the base of the PR and between d6a9439 and 6ff9ba5.

📒 Files selected for processing (2)
  • packages/sdk/typescript/src/sync.test.ts
  • packages/sdk/typescript/src/sync.ts

📝 Walkthrough

Walkthrough

The PR updates token precedence to let RelayFileSync auto-derive auth from client.getToken() (no env-token fallback) and adds a WebSocket error→close grace watchdog that forces recovery when an error is not followed by a close.

Changes

Token Precedence and Error Recovery

| Layer / File(s) | Summary |
| --- | --- |
| Token Resolution Logic (packages/sdk/typescript/src/onWrite.ts) | OnWriteDispatcher.ensureSync removes the RELAYFILE_TOKEN env fallback, allowing RelayFileSync to auto-derive fresh auth via client.getToken() on reconnect. |
| Token Auto-Derivation Test (packages/sdk/typescript/src/onWrite.test.ts) | Verifies onWrite uses fresh client.getToken() result even when process.env.RELAYFILE_TOKEN is set, ensuring env vars do not interfere. |
| Error Recovery State (packages/sdk/typescript/src/sync.ts) | Adds ERROR_TO_CLOSE_GRACE_MS, errorRecoveryTimer, and currentSocketHasOpened to track orphaned ws error events and pre-open state. |
| Handler Attachment / Open Tracking (packages/sdk/typescript/src/sync.ts) | Reset and set currentSocketHasOpened when attaching handlers and on WebSocket open. |
| Error Recovery Watchdog (packages/sdk/typescript/src/sync.ts) | WebSocket error handler arms a grace timer; if the corresponding close does not arrive in time for the still-current socket, forceReconnect(...) is invoked (optionally preferring polling for pre-open failures). |
| Close & Shutdown Cleanup (packages/sdk/typescript/src/sync.ts) | close handler and stop() clear the errorRecoveryTimer to avoid duplicate or lingering recovery actions. |
| forceReconnect Integration (packages/sdk/typescript/src/sync.ts) | forceReconnect(...) accepts an options object (e.g., preferPolling), clears the watchdog timer, and routes pre-open failures toward polling when requested. |
| Error Recovery Tests (packages/sdk/typescript/src/sync.test.ts) | Tests added for orphaned error recovery (replacement socket), pre-open error→polling fallback, and error+close preventing double recovery. |

Sequence Diagrams

```mermaid
sequenceDiagram
  participant Client
  participant OnWriteDispatcher
  participant RelayFileSync
  OnWriteDispatcher->>RelayFileSync: ensureSync(token: undefined)
  RelayFileSync->>Client: getToken() on (re)connect
  Client-->>RelayFileSync: fresh token
```

```mermaid
sequenceDiagram
  participant WebSocket
  participant RelayFileSync
  participant Watchdog
  WebSocket->>RelayFileSync: error event
  RelayFileSync->>Watchdog: arm errorRecoveryTimer
  Note right of Watchdog: timer fires if no close arrives
  Watchdog->>RelayFileSync: forceReconnect(preferPolling?)
  RelayFileSync->>RelayFileSync: clearErrorRecoveryTimer()
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • AgentWorkforce/relayfile#93: Both PRs modify RelayFileSync's WebSocket recovery/watchdog and reconnection logic in packages/sdk/typescript/src/sync.ts.
  • AgentWorkforce/relayfile#96: Both PRs modify onWrite.ensureSync's token handling and related tests—both change how tokens are resolved/auto-derived for WS auth.
  • AgentWorkforce/relayfile#85: Related earlier changes touching onWrite/sync token resolution and tests that this PR further adjusts.

Poem

🐰 I sniffed a stale token in the air,
A socket hiccupped and lingered there,
A little watchdog counted the time,
Then hopped us to fresh auth and a cleaner line,
Reconnected, we dance — jitter no more, beware!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and specifically describes both main changes: dropping stale RELAYFILE_TOKEN env shadowing and recovering from WebSocket errors without close events. |
| Description check | ✅ Passed | The description thoroughly explains both bugs fixed, their root causes, the solutions implemented, test coverage, and manual verification steps required. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/sdk/typescript/src/sync.ts`:
- Around line 505-534: The error watchdog must treat errors on sockets that
never reached OPEN as polling fallbacks instead of always calling
forceReconnect(); change the timeout handler in the errorRecoveryTimer block to
check whether the failed socket had ever reached open (e.g., inspect
socket.readyState === WebSocket.OPEN or an existing "socketOpened" / "hasOpened"
flag on the dispatcher instance) and, if it never opened, invoke the polling
fallback path (call the method that triggers onPollingFallback/startPolling or
call forceReconnect with an explicit option/flag that tells it to immediately
transition to polling) rather than the current unconditional
this.forceReconnect(socket, "ws-error-no-close"); also update forceReconnect
(and any call sites) to accept and honor that flag so the pre-open path does not
just retry WebSocket forever; apply the same change to the analogous code around
the other occurrence mentioned (lines ~793-813).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 07b1f3bd-afad-4cf8-b938-388c69d9e7cf

📥 Commits

Reviewing files that changed from the base of the PR and between 059a2d9 and d6a9439.

📒 Files selected for processing (4)
  • packages/sdk/typescript/src/onWrite.test.ts
  • packages/sdk/typescript/src/onWrite.ts
  • packages/sdk/typescript/src/sync.test.ts
  • packages/sdk/typescript/src/sync.ts

Comment thread packages/sdk/typescript/src/sync.ts
…connect

Addresses CodeRabbit feedback on PR #99. The error watchdog added in the
prior commit unconditionally called forceReconnect() when an error fired
without a successor close. For sockets that never reached OPEN — auth
rejection on the upgrade, proxy RST during the handshake, server cold
start returning 502 — that re-runs the same failing handshake forever.

Track per-attach whether the socket reached OPEN (currentSocketHasOpened,
reset when openWebSocket attaches a fresh socket; flipped true in the
`open` handler). When the watchdog fires and the socket never opened,
pass preferPolling:true to forceReconnect, which now routes to
startPolling("forced-polling-pre-open") instead of scheduling a reconnect.
Caller still gets events via HTTP and the underlying error surfaces
through onPollingFallback rather than burning the reconnect loop.

Test: "falls back to polling when the ws errors before reaching OPEN"
asserts onPollingFallback fires with reason "forced-polling-pre-open"
and no second WS socket is opened.
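The pre-open routing described above might look roughly like this. It is a simplified sketch, not the code in sync.ts: `RecoveryRouter`, `attachSocket`, `handleOpen`, `handleWatchdogFired`, and the `actions` log are hypothetical, and the socket argument that the real forceReconnect takes is omitted. The preferPolling option, currentSocketHasOpened, and the two reason strings are taken from this PR.

```typescript
// Hypothetical, simplified stand-in for the recovery routing in sync.ts.
interface ForceReconnectOptions {
  preferPolling?: boolean;
}

class RecoveryRouter {
  // Reset on every fresh socket attach; set to true in the "open" handler.
  currentSocketHasOpened = false;
  // Records which recovery path ran, so the behavior is observable.
  readonly actions: string[] = [];

  attachSocket(): void {
    this.currentSocketHasOpened = false;
  }

  handleOpen(): void {
    this.currentSocketHasOpened = true;
  }

  // Called when the error-to-close grace timer expires without a close.
  handleWatchdogFired(): void {
    this.forceReconnect("ws-error-no-close", {
      preferPolling: !this.currentSocketHasOpened,
    });
  }

  forceReconnect(reason: string, options: ForceReconnectOptions = {}): void {
    if (options.preferPolling) {
      // The handshake never succeeded, so retrying it would likely fail the
      // same way; switch to HTTP polling instead.
      this.startPolling("forced-polling-pre-open");
      return;
    }
    this.scheduleReconnect(reason);
  }

  private startPolling(reason: string): void {
    this.actions.push(`poll:${reason}`);
  }

  private scheduleReconnect(reason: string): void {
    this.actions.push(`reconnect:${reason}`);
  }
}
```

A socket that errors before opening ends up on the polling path, while one that had already opened is reconnected as before.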

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@khaliqgant khaliqgant merged commit 011175a into main May 8, 2026
4 of 5 checks passed
@khaliqgant khaliqgant deleted the fix/sdk-onwrite-prefer-live-client-token-and-error-recovery branch May 8, 2026 11:25