fix: detect and recover from stale OAuth credentials in background workers#454

Merged
BYK merged 1 commit into main from fix/stale-oauth-401
May 21, 2026


@BYK BYK commented May 21, 2026

Problem

Background workers (distillation, curation, cache warming) use cached OAuth Bearer tokens from resolveAuth() in auth.ts. When tokens expire between client requests, all background work fails silently with 401 errors — no Sentry alert, no retry, no credential refresh.

Root cause: TRANSIENT_CODES in llm-adapter.ts does not include 401, so the worker returns null immediately with no retry. The credential remains cached and every subsequent worker call repeats the 401 silently.

Changes

auth.ts — Per-session staleness tracking

  • Add staleSessionAuth Set with markAuthStale() / isAuthStale() / clearAuthStale()
  • resolveAuth() skips stale session credentials, falling through to the global lastSeenAuth fallback (which may have been refreshed by a concurrent client request)
  • setSessionAuth() auto-clears staleness when a fresh credential arrives
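
The staleness tracking described above can be sketched roughly as follows. This is a minimal illustration, not the actual auth.ts: the `AuthCredential` shape, the `sessionAuth` map, and the assumption that `setSessionAuth()` also updates `lastSeenAuth` are all hypothetical stand-ins, since the PR description does not show them.

```typescript
// Hypothetical credential shape; the real AuthCredential union is not shown in this PR.
type AuthCredential =
  | { kind: "bearer"; value: string }
  | { kind: "api-key"; value: string };

// Assumed storage: per-session credentials plus a global "last seen" fallback.
const sessionAuth = new Map<string, AuthCredential>();
let lastSeenAuth: AuthCredential | null = null;

// Staleness lives in a separate Set, keeping the credential type unchanged.
const staleSessionAuth = new Set<string>();

function markAuthStale(sessionID: string): void {
  staleSessionAuth.add(sessionID);
}

function isAuthStale(sessionID: string): boolean {
  return staleSessionAuth.has(sessionID);
}

function clearAuthStale(sessionID: string): void {
  staleSessionAuth.delete(sessionID);
}

// Called when a client request carries a credential.
function setSessionAuth(sessionID: string, cred: AuthCredential): void {
  sessionAuth.set(sessionID, cred);
  lastSeenAuth = cred; // assumption: the global fallback tracks the latest credential
  clearAuthStale(sessionID); // a fresh credential ends the stale state
}

function resolveAuth(sessionID: string): AuthCredential | null {
  const cred = sessionAuth.get(sessionID);
  if (cred && !isAuthStale(sessionID)) return cred;
  // Stale or missing session credential: fall through to the global fallback.
  return lastSeenAuth;
}
```

In this sketch, a session marked stale resolves to whatever credential was most recently seen from any client, and returns to its own credential as soon as `setSessionAuth()` is called for it again.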

llm-adapter.ts — 401/403 detection, re-resolve + retry, Sentry alert

  • Add AUTH_ERROR_CODES set (401, 403) — checked before the generic non-transient error path
  • On auth error: mark credential stale → re-resolve → if credential changed, rebuild request and retry once → if still failing, Sentry.captureException() with fingerprint ["LOREAI-GATEWAY", "worker-auth-error", status]
  • attempt === 0 guard prevents infinite retry loops
  • Response body included in retry log for diagnostics
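
The retry-once flow above might look something like the sketch below. The `Deps` interface, `callWithAuthRetry`, and `reportAuthError` (standing in for the `Sentry.captureException()` call) are illustrative names, not the adapter's real API; transient-error handling and logging are omitted.

```typescript
type Cred = { value: string };
type HttpResult = { status: number; body: string };

const AUTH_ERROR_CODES = new Set<number>([401, 403]);

// Hypothetical dependencies, injected so the sketch stays self-contained.
interface Deps {
  resolveAuth(sessionID: string): Cred | null;
  markAuthStale(sessionID: string): void;
  send(cred: Cred): Promise<HttpResult>;
  reportAuthError(status: number): void; // stand-in for Sentry.captureException()
}

async function callWithAuthRetry(
  sessionID: string,
  deps: Deps,
): Promise<string | null> {
  const initial = deps.resolveAuth(sessionID);
  if (!initial) return null;
  let cred: Cred = initial;

  for (let attempt = 0; attempt < 2; attempt++) {
    const res = await deps.send(cred);
    if (res.status >= 200 && res.status < 300) return res.body;

    if (AUTH_ERROR_CODES.has(res.status)) {
      deps.markAuthStale(sessionID);
      // Only the first attempt may re-resolve and retry.
      if (attempt === 0) {
        const freshCred = deps.resolveAuth(sessionID);
        if (freshCred && freshCred.value !== cred.value) {
          cred = freshCred; // rebuild the request with the new credential
          continue;
        }
      }
      // Same credential, or the retry also failed: alert and give up.
      deps.reportAuthError(res.status);
      return null;
    }

    return null; // other non-transient errors are out of scope here
  }
  return null;
}
```

Comparing `freshCred.value` with `cred.value` is what keeps the retry from resending a token that is already known to be rejected.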

cache-warmer.ts — Auth-disabled sessions

  • Add authDisabledSessions Set with isWarmupAuthDisabled() / clearWarmupAuthDisabled()
  • On 401/403 in executeWarmup(): add session to disabled set + call markAuthStale() — prevents unbounded 401 spam on every 30s idle tick
  • _resetForTest() clears the new state

idle.ts — Skip auth-disabled sessions

  • Add isWarmupAuthDisabled(sessionID) early-exit check in the warmup loop, before expensive profile/histogram/survival computation
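
Taken together, the cache-warmer.ts and idle.ts changes amount to a per-session circuit breaker. The sketch below assumes a simplified shape: `recordWarmupStatus` and `sessionsToWarm` are hypothetical helpers standing in for the relevant parts of `executeWarmup()` and the idle warmup loop, which the PR description does not show.

```typescript
// Sessions whose warmups are suspended after an auth failure.
const authDisabledSessions = new Set<string>();

function isWarmupAuthDisabled(sessionID: string): boolean {
  return authDisabledSessions.has(sessionID);
}

function clearWarmupAuthDisabled(sessionID: string): void {
  authDisabledSessions.delete(sessionID);
}

// Hypothetical helper: what executeWarmup() might do with the upstream status.
function recordWarmupStatus(
  sessionID: string,
  status: number,
  markAuthStale: (sessionID: string) => void,
): void {
  if (status === 401 || status === 403) {
    authDisabledSessions.add(sessionID); // stop warming this session
    markAuthStale(sessionID); // and stop other workers from reusing the token
  }
}

// Hypothetical helper: the early-exit filter in the idle warmup loop,
// applied before any expensive per-session computation.
function sessionsToWarm(sessionIDs: readonly string[]): string[] {
  return sessionIDs.filter((id) => !isWarmupAuthDisabled(id));
}
```

A session stays out of the warmup loop until something calls `clearWarmupAuthDisabled()` for it, which the PR wires to the arrival of a fresh credential.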

pipeline.ts — Clear auth-disabled state on fresh credentials

  • Call clearWarmupAuthDisabled(sessionID) when setSessionAuth() binds a fresh credential from a client request

Design Decisions

  1. Staleness tracked via separate Set (not on AuthCredential type) — follows batch-queue.ts's disabledBatchSessions pattern, keeps the AuthCredential union clean
  2. Re-resolve, not refresh — the gateway does not implement OAuth token refresh; it passively captures credentials from client requests. The re-resolve picks up tokens that were refreshed by a concurrent client request.
  3. At most one retry — guarded by attempt === 0 and freshCred.value !== cred.value; if both session and global credentials are expired, bails immediately
  4. Not persisted — staleness and auth-disabled state are process-lifetime only; on restart, the first client request provides fresh credentials

What This Does NOT Change

  • LLMClient interface in @loreai/core — return type stays string | null
  • batch-queue.ts — already handles 401/403 correctly
  • API-key auth — API keys don't expire, so 401 is never triggered for them

Tests

  • New test/auth.test.ts: 12 tests for staleness tracking and resolveAuth behavior
  • test/llm-adapter.test.ts: 4 new tests for AUTH_ERROR_CODES
  • test/cache-warmer.test.ts: 4 new tests for warmup auth-disabled sessions
  • test/helpers/idle-worker.ts: updated mock to include isWarmupAuthDisabled

Closes #453

…rkers

Background workers (distillation, curation, cache warming) use cached
OAuth Bearer tokens that can expire between client requests. When they
hit 401, the LLM adapter treated it as a non-transient error and
returned null silently — no Sentry alert, no retry, no credential
invalidation. The cache warmer retried every 30s, generating unbounded
401 spam.

Changes:
- auth.ts: Add per-session staleness tracking (staleSessionAuth Set).
  resolveAuth() skips stale session credentials, falling through to the
  global fallback. setSessionAuth() auto-clears staleness when a fresh
  credential arrives from a client request.
- llm-adapter.ts: Add AUTH_ERROR_CODES (401, 403) detection before the
  generic non-transient error path. On auth error: mark credential
  stale, re-resolve (picks up fresh token from concurrent request if
  available), retry once if credential changed, then alert via
  Sentry.captureException with dedicated fingerprint.
- cache-warmer.ts: Add authDisabledSessions Set. On 401/403 in
  executeWarmup(), disable future warmups for that session and mark
  auth stale. Prevents unbounded 401 spam on every 30s idle tick.
- idle.ts: Skip auth-disabled sessions in the warmup loop.
- pipeline.ts: Clear warmup auth-disabled state when fresh credentials
  arrive via setSessionAuth().

Closes #453
@BYK BYK merged commit d47c12f into main May 21, 2026
7 checks passed
@BYK BYK deleted the fix/stale-oauth-401 branch May 21, 2026 19:34
@craft-deployer craft-deployer Bot mentioned this pull request May 21, 2026
6 tasks
BYK added a commit that referenced this pull request May 23, 2026
## Problem

**Sentry Issue:**
[LOREAI-GATEWAY-Z](https://byk.sentry.io/issues/7498124471/) — 19 users,
2,349 events

When a single user's OAuth bearer token expires, background workers
(distillation, curation, consolidation) keep retrying every 30 seconds
with the stale token. Each attempt generates a `Sentry.captureException`
call, flooding Sentry with thousands of events.

### Root Cause

PR #454 added a retry-once mechanism: mark the session credential stale
→ fall back to global → retry if credential changed. But in
single-session OAuth setups (the typical Claude Code user), the session
and global credentials are the **same expired token**. So:

1. `markAuthStale(sessionID)` marks the session stale
2. `resolveAuth(sessionID)` skips the stale session, falls through to
`getLastSeenAuth()`
3. The global holds the same expired token → `credentialChanged = false`
4. Retry-once path is never taken → `Sentry.captureException()` fires
5. Next 30s idle tick repeats the cycle

### Fix (three layers)

1. **`auth.ts` — `resolveAuth()` detects same-token fallback**: When the
stale session credential and global credential have the same value,
return `null` instead of the expired global token. This lets callers
know there's no usable credential available.

2. **`idle.ts` — skip background work on stale auth**: Before scheduling
idle work (distillation, curation, consolidation), check
`isAuthStale(sessionID) && !resolveAuth(sessionID)`. If auth is stale
and no fresh credential is available, skip the session entirely. Auth
refreshes when the next client request arrives.

3. **`instrument.ts` — filter auth errors in `beforeSend`**: Add
`/Worker upstream auth error/` to `TRANSIENT_ERROR_PATTERNS` as
defense-in-depth, suppressing any residual auth error events that slip
through.
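
The three layers can be sketched as below. This is an illustration under assumptions, not the actual code: the `AuthState` container, `shouldSkipIdleWork`, and `isFilteredError` are hypothetical names, and the real `resolveAuth(sessionID)` reads module state rather than taking a state argument. In Sentry's JavaScript SDK, a `beforeSend` hook that returns `null` drops the event, which is what a pattern match like `isFilteredError` would feed into.

```typescript
type Cred = { value: string };

// Hypothetical container for the state that auth.ts keeps at module level.
interface AuthState {
  session: Map<string, Cred>;
  stale: Set<string>;
  lastSeen: Cred | null;
}

// Layer 1: do not hand back the global credential when it is the same
// token as the stale session credential.
function resolveAuth(state: AuthState, sessionID: string): Cred | null {
  const sessionCred = state.session.get(sessionID);
  if (sessionCred && !state.stale.has(sessionID)) return sessionCred;

  const globalCred = state.lastSeen;
  if (sessionCred && globalCred && globalCred.value === sessionCred.value) {
    return null; // same expired token: no usable credential
  }
  return globalCred;
}

// Layer 2: the idle scheduler skips sessions with stale auth and no fallback.
function shouldSkipIdleWork(state: AuthState, sessionID: string): boolean {
  return state.stale.has(sessionID) && !resolveAuth(state, sessionID);
}

// Layer 3: pattern list consulted by beforeSend to drop residual events.
const TRANSIENT_ERROR_PATTERNS: RegExp[] = [/Worker upstream auth error/];

function isFilteredError(message: string): boolean {
  return TRANSIENT_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}
```

With a single session whose token has expired, `resolveAuth` returns `null` in this sketch, so idle work is skipped until a client request supplies a different token.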

### Tests

- Added 2 new test cases for `resolveAuth` same-token detection
- All existing auth tests pass (16 total)
- Full suite: 1810 pass, 5 skip, 0 fail

Development

Successfully merging this pull request may close these issues.

fix: background workers fail silently on 401 with stale OAuth credentials
