fix: detect and recover from stale OAuth credentials in background workers #454
Merged
…rkers

Background workers (distillation, curation, cache warming) use cached OAuth Bearer tokens that can expire between client requests. When they hit 401, the LLM adapter treated it as a non-transient error and returned null silently: no Sentry alert, no retry, no credential invalidation. The cache warmer retried every 30s, generating unbounded 401 spam.

Changes:
- auth.ts: Add per-session staleness tracking (staleSessionAuth Set). resolveAuth() skips stale session credentials, falling through to the global fallback. setSessionAuth() auto-clears staleness when a fresh credential arrives from a client request.
- llm-adapter.ts: Add AUTH_ERROR_CODES (401, 403) detection before the generic non-transient error path. On auth error: mark credential stale, re-resolve (picks up fresh token from concurrent request if available), retry once if credential changed, then alert via Sentry.captureException with dedicated fingerprint.
- cache-warmer.ts: Add authDisabledSessions Set. On 401/403 in executeWarmup(), disable future warmups for that session and mark auth stale. Prevents unbounded 401 spam on every 30s idle tick.
- idle.ts: Skip auth-disabled sessions in the warmup loop.
- pipeline.ts: Clear warmup auth-disabled state when fresh credentials arrive via setSessionAuth().

Closes #453
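As a rough illustration of the staleness tracking described for auth.ts, here is a minimal sketch. The names `staleSessionAuth`, `resolveAuth()`, `setSessionAuth()`, `markAuthStale()` and `isAuthStale()` come from this PR; the `AuthCredential` shape, the `sessionAuth` map, and the assumption that `setSessionAuth()` also updates the global fallback are hypothetical and may differ from the real module.

```typescript
// Hypothetical minimal credential shape; the real AuthCredential union is not shown in this PR.
type AuthCredential = { kind: "bearer"; value: string };

// Assumed storage: one credential per session plus a global "last seen" fallback.
const sessionAuth = new Map<string, AuthCredential>();
const staleSessionAuth = new Set<string>();
let lastSeenAuth: AuthCredential | null = null;

function setSessionAuth(sessionID: string, cred: AuthCredential): void {
  sessionAuth.set(sessionID, cred);
  // Assumption: a client request also refreshes the global fallback.
  lastSeenAuth = cred;
  // A fresh credential from a client request clears staleness automatically.
  staleSessionAuth.delete(sessionID);
}

function markAuthStale(sessionID: string): void {
  staleSessionAuth.add(sessionID);
}

function isAuthStale(sessionID: string): boolean {
  return staleSessionAuth.has(sessionID);
}

function resolveAuth(sessionID: string): AuthCredential | null {
  const cred = sessionAuth.get(sessionID);
  // Use the session credential only while it is not marked stale.
  if (cred && !staleSessionAuth.has(sessionID)) return cred;
  // Otherwise fall through to the global fallback, which a concurrent
  // client request may have refreshed in the meantime.
  return lastSeenAuth;
}
```

In this sketch, marking session "a" stale after another session has supplied a newer token makes `resolveAuth("a")` return the newer token, and the next `setSessionAuth("a", ...)` call removes the stale flag.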
BYK added a commit that referenced this pull request on May 23, 2026
## Problem

**Sentry Issue:** [LOREAI-GATEWAY-Z](https://byk.sentry.io/issues/7498124471/) (19 users, 2,349 events)

When a single user's OAuth bearer token expires, background workers (distillation, curation, consolidation) keep retrying every 30 seconds with the stale token. Each attempt generates a `Sentry.captureException` call, flooding Sentry with thousands of events.

### Root Cause

PR #454 added a retry-once mechanism: mark the session credential stale → fall back to global → retry if credential changed. But in single-session OAuth setups (the typical Claude Code user), the session and global credentials are the **same expired token**. So:

1. `markAuthStale(sessionID)` marks the session stale
2. `resolveAuth(sessionID)` skips the stale session, falls through to `getLastSeenAuth()`
3. The global holds the same expired token → `credentialChanged = false`
4. Retry-once path is never taken → `Sentry.captureException()` fires
5. Next 30s idle tick repeats the cycle

### Fix (three layers)

1. **`auth.ts`: `resolveAuth()` detects same-token fallback.** When the stale session credential and global credential have the same value, return `null` instead of the expired global token. This lets callers know there's no usable credential available.
2. **`idle.ts`: skip background work on stale auth.** Before scheduling idle work (distillation, curation, consolidation), check `isAuthStale(sessionID) && !resolveAuth(sessionID)`. If auth is stale and no fresh credential is available, skip the session entirely. Auth refreshes when the next client request arrives.
3. **`instrument.ts`: filter auth errors in `beforeSend`.** Add `/Worker upstream auth error/` to `TRANSIENT_ERROR_PATTERNS` as defense-in-depth, suppressing any residual auth error events that slip through.

### Tests

- Added 2 new test cases for `resolveAuth` same-token detection
- All existing auth tests pass (16 total)
- Full suite: 1810 pass, 5 skip, 0 fail
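The first two layers of the fix above (same-token detection in `resolveAuth()` and the idle-work guard) might look roughly like the following sketch. The function names `resolveAuth`, `getLastSeenAuth`, `markAuthStale` and `isAuthStale` are taken from the description; the credential shape, the storage, and the helpers `setLastSeenAuth` and `shouldSkipIdleWork` are hypothetical stand-ins, not the actual code.

```typescript
// Hypothetical minimal credential shape for illustration only.
type AuthCredential = { value: string };

const sessionAuth = new Map<string, AuthCredential>();
const staleSessionAuth = new Set<string>();
let lastSeenAuth: AuthCredential | null = null;

// Hypothetical setter so the sketch can simulate a refreshed global credential.
function setLastSeenAuth(cred: AuthCredential | null): void {
  lastSeenAuth = cred;
}

function getLastSeenAuth(): AuthCredential | null {
  return lastSeenAuth;
}

function markAuthStale(sessionID: string): void {
  staleSessionAuth.add(sessionID);
}

function isAuthStale(sessionID: string): boolean {
  return staleSessionAuth.has(sessionID);
}

function resolveAuth(sessionID: string): AuthCredential | null {
  const sessionCred = sessionAuth.get(sessionID);
  if (!staleSessionAuth.has(sessionID)) {
    return sessionCred ?? getLastSeenAuth();
  }
  const globalCred = getLastSeenAuth();
  // Same-token fallback: the global holds the very token that just failed,
  // so report "no usable credential" instead of returning it again.
  if (sessionCred && globalCred && globalCred.value === sessionCred.value) {
    return null;
  }
  return globalCred;
}

// Hypothetical helper wrapping the guard quoted in the idle.ts layer.
function shouldSkipIdleWork(sessionID: string): boolean {
  return isAuthStale(sessionID) && !resolveAuth(sessionID);
}
```

With a single session whose token is also the global token, marking the session stale makes `shouldSkipIdleWork()` return `true`; once the global credential holds a different token, the guard lets background work proceed again.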
## Problem

Background workers (distillation, curation, cache warming) use cached OAuth Bearer tokens from `resolveAuth()` in `auth.ts`. When tokens expire between client requests, all background work fails silently with 401 errors: no Sentry alert, no retry, no credential refresh.

**Root cause:** `TRANSIENT_CODES` in `llm-adapter.ts` does not include 401, so the worker returns `null` immediately with no retry. The credential remains cached and every subsequent worker call repeats the 401 silently.

## Changes

### `auth.ts`: Per-session staleness tracking

- `staleSessionAuth` Set with `markAuthStale()` / `isAuthStale()` / `clearAuthStale()`
- `resolveAuth()` skips stale session credentials, falling through to the global `lastSeenAuth` fallback (which may have been refreshed by a concurrent client request)
- `setSessionAuth()` auto-clears staleness when a fresh credential arrives

### `llm-adapter.ts`: 401/403 detection, re-resolve + retry, Sentry alert

- `AUTH_ERROR_CODES` set (401, 403), checked before the generic non-transient error path
- `Sentry.captureException()` with fingerprint `["LOREAI-GATEWAY", "worker-auth-error", status]`
- `attempt === 0` guard prevents infinite retry loops

### `cache-warmer.ts`: Auth-disabled sessions

- `authDisabledSessions` Set with `isWarmupAuthDisabled()` / `clearWarmupAuthDisabled()`
- On 401/403 in `executeWarmup()`: add session to disabled set + call `markAuthStale()`, which prevents unbounded 401 spam on every 30s idle tick
- `_resetForTest()` clears the new state

### `idle.ts`: Skip auth-disabled sessions

- `isWarmupAuthDisabled(sessionID)` early-exit check in the warmup loop, before expensive profile/histogram/survival computation

### `pipeline.ts`: Clear auth-disabled state on fresh credentials

- `clearWarmupAuthDisabled(sessionID)` when `setSessionAuth()` binds a fresh credential from a client request

## Design Decisions

- (… `AuthCredential` type): follows `batch-queue.ts`'s `disabledBatchSessions` pattern, keeps the `AuthCredential` union clean
- … `attempt === 0` and `freshCred.value !== cred.value`; if both session and global credentials are expired, bails immediately

## What This Does NOT Change

- `LLMClient` interface in `@loreai/core`: return type stays `string | null`
- `batch-queue.ts`: already handles 401/403 correctly

## Tests

- `test/auth.test.ts`: 12 tests for staleness tracking and `resolveAuth` behavior
- `test/llm-adapter.test.ts`: 4 new tests for `AUTH_ERROR_CODES`
- `test/cache-warmer.test.ts`: 4 new tests for warmup auth-disabled sessions
- `test/helpers/idle-worker.ts`: updated mock to include `isWarmupAuthDisabled`

Closes #453
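The retry-once behavior described for `llm-adapter.ts` can be sketched as below. `AUTH_ERROR_CODES`, the `attempt === 0` guard, and the `freshCred.value !== cred.value` comparison come from the description; the `AuthDeps` interface, `callWithAuthRecovery`, `UpstreamResponse`, and the `reportAuthError` callback (standing in for the `Sentry.captureException()` call) are hypothetical names. The upstream call is kept synchronous here only so the control flow is easy to follow; a real adapter would presumably be async and would also handle transient errors.

```typescript
// Hypothetical shapes for illustration; not the adapter's real types.
type AuthCredential = { value: string };
type UpstreamResponse = { status: number; text: string };

const AUTH_ERROR_CODES = new Set<number>([401, 403]);

// Hypothetical dependency bundle so the sketch runs without Sentry or a network.
interface AuthDeps {
  resolveAuth(sessionID: string): AuthCredential | null;
  markAuthStale(sessionID: string): void;
  send(cred: AuthCredential): UpstreamResponse;
  reportAuthError(status: number): void; // stand-in for Sentry.captureException()
}

function callWithAuthRecovery(sessionID: string, deps: AuthDeps): string | null {
  let cred = deps.resolveAuth(sessionID);
  for (let attempt = 0; attempt < 2; attempt++) {
    if (!cred) return null; // no usable credential at all
    const res = deps.send(cred);
    if (!AUTH_ERROR_CODES.has(res.status)) {
      // Simplification: any non-2xx, non-auth status just yields null here.
      return res.status >= 200 && res.status < 300 ? res.text : null;
    }
    // Auth error: invalidate the cached credential, then re-resolve.
    deps.markAuthStale(sessionID);
    const freshCred = deps.resolveAuth(sessionID);
    // Retry only once, and only if a different credential became available.
    if (attempt === 0 && freshCred && freshCred.value !== cred.value) {
      cred = freshCred;
      continue;
    }
    deps.reportAuthError(res.status);
    return null;
  }
  return null;
}
```

When re-resolving yields a different token, the call is retried exactly once; when it yields the same token (or none), the error is reported once and `null` is returned, which matches the `string | null` return type noted under "What This Does NOT Change".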