fix: preserve OAuth popup session state#1235
Conversation
Greptile SummaryFixes a self-host OAuth callback bug where the popup handler was forwarding the provider-echoed org-slug-wrapped state token (
Confidence Score: 4/5Safe to merge — the fix is minimal and targeted, touching only the state-unwrapping step in The core logic is correct and the implementation is clean:
Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant UI as Opener UI
participant SH as Self-Host /callback
participant SDK as decodeOAuthCallbackState
participant OA as oauth.complete()
UI->>SH: "GET /callback?state=base64({state:"raw123",orgSlug:"acme"})&code=…"
Note over SH: Before fix: passes wrapped state to complete()
SH--xOA: "complete({ state: "base64(…)" }) ❌ session not found"
Note over SH: After fix
SH->>SDK: decodeOAuthCallbackState("base64(…)")
SDK-->>SH: "{ state: "raw123", orgSlug: "acme" }"
SH->>OA: "complete({ state: "raw123", code: "…" }) ✅"
OA-->>SH: Connection
SH-->>UI: "popup HTML { sessionId: "raw123", ok: true }"
Note over UI: Opener correlates on "raw123" ✅
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant UI as Opener UI
participant SH as Self-Host /callback
participant SDK as decodeOAuthCallbackState
participant OA as oauth.complete()
UI->>SH: "GET /callback?state=base64({state:"raw123",orgSlug:"acme"})&code=…"
Note over SH: Before fix: passes wrapped state to complete()
SH--xOA: "complete({ state: "base64(…)" }) ❌ session not found"
Note over SH: After fix
SH->>SDK: decodeOAuthCallbackState("base64(…)")
SDK-->>SH: "{ state: "raw123", orgSlug: "acme" }"
SH->>OA: "complete({ state: "raw123", code: "…" }) ✅"
OA-->>SH: Connection
SH-->>UI: "popup HTML { sessionId: "raw123", ok: true }"
Note over UI: Opener correlates on "raw123" ✅
Reviews (1): Last reviewed commit: "fix: preserve OAuth popup session state" | Re-trigger Greptile |
| it("completes wrapped callback state with the raw OAuth session state", async () => { | ||
| const providerState = encodeOAuthCallbackState({ state: "session-raw", orgSlug: "default" }); | ||
| const received: Array<{ state: string }> = []; | ||
|
|
||
| const html = await Effect.runPromise( | ||
| runOAuthCallback<GoogleAuth, never, never>({ | ||
| complete: (params) => { | ||
| received.push({ state: params.state }); | ||
| return Effect.succeed({ | ||
| kind: "oauth2", | ||
| accessTokenSecretId: "s", | ||
| refreshTokenSecretId: null, | ||
| }); | ||
| }, | ||
| urlParams: { state: providerState, code: "code1" }, | ||
| toErrorMessage: () => ({ short: "" }), | ||
| channelName: "c", | ||
| }), | ||
| ); | ||
|
|
||
| expect(received).toEqual([{ state: "session-raw" }]); | ||
| expect(html).toContain('"sessionId":"session-raw"'); | ||
| expect(html).not.toContain(`"sessionId":"${providerState}"`); | ||
| }); |
There was a problem hiding this comment.
Missing test coverage for error paths with wrapped state
The new test validates success + wrapped state, but two other branches go untested: (1) provider error (urlParams.error present) with a wrapped state, and (2) complete() failing with a wrapped state. In both cases sessionId should still be the unwrapped raw token so the opener can correlate the failure popup. The code is correct because sessionId is computed once at the top of runOAuthCallback and shared across all three result paths, but a test for the provider-error + wrapped-state scenario would lock that invariant in explicitly — e.g. urlParams: { state: providerState, error: "access_denied" } with an assertion that html contains "sessionId":"session-raw" and not the wrapped value.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
|
thanks, hooking up the e2e tests to catch this moving forward as well |
The cloud e2e project never gated CI either, so ten scenarios rotted. Refresh the four whose product behavior moved intentionally: - connect-card-ssr-origin: install URLs are org-slug-scoped since the org-slug console URLs change (#974); accept the slug form. - connection-owner-isolation: /api/auth/switch-organization was deleted with cookie-based org switching (#1000); switch orgs the way the web client does, via the x-executor-organization selector header. - oauth-connections: the popup-state fix (#1235) envelopes the callback state as base64url JSON; decode it and assert the inner state + orgSlug. - unauthenticated-skeleton: the 404 page shipped as a standalone page in the same commit as the shell-framed assertion (#986); assert the page it actually renders. Quarantine the six that need product/harness work, each with a reason: mcp-browser-approval-org-scope + the two browser-approval scenarios (cloud-only: the mcporter browser-approval completion never lands), cli-device-login (device-flow terminal never reaches the emulator), and run-panel-auto-approve (autoApprove leaves the run paused; never green since the feature landed in #1183).
* e2e: fix stale docs, harden dev-CLI status, add cloud+selfhost CI jobs - e2e/AGENTS.md: the anatomy example predated the service-yielding scenario() signature (no more needs/ctx); capability notes said browser was cloud-only and mcp-oauth selfhost-only, both wrong per targets/*.ts; file placement now lists cloudflare/, local/, cli/; document summary, motel, test:* scripts, the viewer/ SPA, pr-media, and the Windows desktop/cli VM targets. - e2e dev CLI status: probe the app URL before reporting ready (a zombie runner with a dead server used to read as healthy), and only parse real state files in .dev/ (cloud.journey.json rendered as a garbage DEAD line). - CI: run the cloud and selfhost e2e projects on every PR/push with failure artifacts (trace.zip, session.mp4, step screenshots) uploaded per target. * Fix the MCP regressions and policy gaps the e2e suite caught Cloud (hibernatable MCP DO rework fallout): - server.ts no longer gates MCP dispatch behind the Axiom tracer install: with AXIOM_TOKEN unset (any dev boot without motel) every /mcp request fell through to the SPA router and 404ed. - agent-handler mounts a second serve() on /mcp/toolkits/:slug — the agents SDK builds an exact-match URLPattern, so the single /mcp handler never saw toolkit paths. - Restore the old envelope's transport contract: JSON-RPC 405 for verbs outside GET/POST/DELETE/OPTIONS (was a bare 404), 200 for session DELETE (agents SDK answers 204), and a reconnect-worded 404 for requests that race a condemned DO's abort. Selfhost (org-scoped MCP OAuth discovery): - The org-segment strip middleware now carries the original pathname in an internal header, and the protected-resource metadata echoes it, so a client that dialed /<org>/mcp/... passes the MCP SDK's RFC 9728 resource check. Bare paths are untouched; the header is stripped from unrewritten requests. Microsoft Graph URL policy: - microsoftHttpPlugin gains the hosts' local-network dev posture: selfhost, cloud, and the cloudflare host thread allowLocalNetwork into allowUnsafeUrlOverrides, and the override now also admits plain-http loopback URLs (local emulators). Production behavior is unchanged: the flag is unset there, and non-loopback http stays rejected even with it. Stale e2e assertion refreshed for an intentional product change: - tool-descriptions: the execute inventory is names-only since the skills tool slimming; drop the per-connection description assertions. * test(e2e): repair self-host scenarios and gate the suite in CI The self-host e2e project never ran in CI, so it drifted red while the app moved on. Repair the failing scenarios (stale connect-modal selectors, a racy action-bar position read, a shared-admin connection-count assertion, a multi-tenant-only org-slug 404 step, and a cloud-shaped toolkit MCP URL), add a documented skip affordance to the scenario helper, and quarantine the two Microsoft emulator scenarios that need a canonical block-YAML Graph spec (tracked separately). Cherry-picked from origin/fix-selfhost-e2e-and-ci (PR #1239); its CI job is superseded by the cloud+selfhost matrix job already on this branch. * test(e2e): quarantine the two agents-SDK transport gaps Both are real gaps in the hibernatable Agent bridge (standalone SSE supersede never resolves; response routing scopes JSON-RPC ids per session instead of per stream), not regressions on this branch. Skip with reasons so the suite gates CI while the gaps stay visible; fixing the bridge is tracked separately. * test(e2e): repair or quarantine the cloud scenarios that drifted on main The cloud e2e project never gated CI either, so ten scenarios rotted. Refresh the four whose product behavior moved intentionally: - connect-card-ssr-origin: install URLs are org-slug-scoped since the org-slug console URLs change (#974); accept the slug form. - connection-owner-isolation: /api/auth/switch-organization was deleted with cookie-based org switching (#1000); switch orgs the way the web client does, via the x-executor-organization selector header. - oauth-connections: the popup-state fix (#1235) envelopes the callback state as base64url JSON; decode it and assert the inner state + orgSlug. - unauthenticated-skeleton: the 404 page shipped as a standalone page in the same commit as the shell-framed assertion (#986); assert the page it actually renders. Quarantine the six that need product/harness work, each with a reason: mcp-browser-approval-org-scope + the two browser-approval scenarios (cloud-only: the mcporter browser-approval completion never lands), cli-device-login (device-flow terminal never reaches the emulator), and run-panel-auto-approve (autoApprove leaves the run paused; never green since the feature landed in #1183). * lint: suppress the adapter-boundary error checks in the MCP agent handler The condemned-DO abort surfaces as a plain runtime Error thrown out of the agents SDK's serve.fetch; its message string is the only signal. Narrow suppressions with boundary reasons, per the typed-errors skill. * test(e2e): quarantine the seat-limit scenario on the emulate 0.9.0 Autumn gap emulate 0.9.0's Autumn customer balances omit the expanded feature object autumn-js asserts, so useCustomer crashes the org page into the error boundary. Fixed upstream in UsefulSoftwareCo/emulate#8 (0.9.1); unskip once the publish lands and the e2e dependency is bumped. * ci: retrigger * ci: shard the cloud e2e job so each shard gets a fresh dev stack A full-suite run against one long-lived cloud dev server degrades partway through: sign-in starts refusing connections and everything after fails with fetch errors (the same SSE/OTel memory growth being instrumented on main). Four shards, each booting its own stack, stay under the threshold. Re-merge into one job once the leak is fixed. * ci: split the cloud e2e job into eight shards Four shards still hit the dev-server degradation a few minutes in on 2-core runners; eight keeps each stack's lifetime under the threshold. * ci: retry flaky browser scenarios twice on the same stack The remaining shard failures are scattered single-test Playwright waitFor timeouts on 2-core runners, not systemic stack death; vitest --retry clears them without hiding real regressions (a consistent failure still fails after 3 attempts). * test(e2e): quarantine the Graph default-add scenario on CI runners Compiling the Graph spec inside dev workerd 500s on 2-core GitHub runners and takes the dev stack down for every scenario after it in the shard (the auth-hint/org-slug/docs-link failures in the same shard were all downstream of this). Local runs are unaffected; skip only under CI. * selfhost: read the local-network posture from env in the plugins seam plugins() runs per request; loadConfig() does filesystem work (data dir, secret key resolution) that should not ride the request path. The env read is the same computation loadConfig makes for the flag. * e2e: bump @executor-js/emulate to 0.10.0, unskip the seat-limit scenario 0.10.0 ships the Autumn balances.feature expansion autumn-js asserts (UsefulSoftwareCo/emulate#8), so the org page renders again and the scenario passes.
Summary
OAuth sessions persist a raw random state token. When a self-hosted flow starts from the organization context, the authorization request wraps that token with the organization slug before sending it to the provider. The provider echoes the wrapped value back on callback, but the shared callback handler passed it directly to
oauth.complete, which looks up sessions by the raw token.Cloud already unwraps this state at its request boundary because it must use
orgSlugto select the right tenant before constructing the executor. Self-host is single-tenant, so it does not need org routing on callback, but it still needs the raw token for session lookup.This decodes wrapped callback state in the shared popup callback path and uses the raw token for both OAuth completion and popup result correlation. Raw callback state still passes through unchanged.
Testing
bun run --cwd packages/core/api test -- oauth-popup.test.tsbun run --cwd e2e test:selfhost -- selfhost/oauth-callback-unauthenticated.test.tsThe self-host e2e test was failing before this fix with
OAuth session expired or not foundon the callback page, and now passes.