fix(auth): deactivate account on refresh-time token_expired#600
Merged
Conversation
Fixes #383. When the OAuth refresh endpoint returns error code `token_expired`, the existing code path treated it as a transient failure: it was not in `PERMANENT_FAILURE_CODES`, so `classify_refresh_error` returned `False`, `RefreshError` was built with `is_permanent=False`, and `AuthManager.refresh_account` left the account in `ACTIVE`. The load balancer then kept routing traffic to the account, every request kept failing with the same `token_expired` error, and the only mitigation was for an operator to pause the account manually -- exactly what #383 documents. This is not a transient error in practice. An access-token-only expiry would have caused the refresh call itself to return a fresh token pair (HTTP 200). A `token_expired` reaching codex-lb at the refresh boundary means the refresh token (or the session it represents) is no longer usable -- the same shape as `refresh_token_expired`, just under a different error code on the upstream side. Add `token_expired` to `PERMANENT_FAILURE_CODES` with the message "Authentication token expired - re-login required". This single change flows through both deactivation paths: - `AuthManager.refresh_account` -- via `classify_refresh_error` - usage-refresh `_should_deactivate_for_usage_error` -- via its `code in PERMANENT_FAILURE_CODES` check So a refresh-time `token_expired` now correctly deactivates the account on both the request-driven refresh path and the background usage refresh path. The reporter's affected account (`email_3e2ba4c18bda` with repeated `Token refresh failed ... status=401` / `Provided authentication token is expired`) would now be deactivated on the next failing refresh and stop draining client requests. Re-authentication clears the deactivation as it does today for the other permanent codes. Tests added: - `tests/unit/test_auth_refresh.py::test_classify_refresh_error_token_expired_is_permanent` - `tests/unit/test_auth_manager.py::test_refresh_account_deactivates_when_upstream_returns_token_expired`
Owner
Author
|
@codex review |
|
Codex Review: Didn't find any major issues. Swish! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Soju06
added a commit
that referenced
this pull request
May 17, 2026
Release 1.18.0 — v1.17.0 → main, 27 PR. ## Pre-merge verification (release-gate e2e) - ✅ CI 18/18 (mergeStateStatus CLEAN) - ✅ baseline release-gate-smoke: 10/10 (`v1.models`, chat non-stream/stream/tool_call, responses non-stream/stream, unknown-msg-key drop, transient codes, dashboard 72-byte, alembic index) - ✅ per-PR e2e extra: 14/14 (#658 strict tool × chat valid/invalid/type-omitted + responses invalid, #650-652 trim replay markers, #516 prev-resp miss mask, #558 prolite plan, #600 token_expired, #635 device poll, #647 frontend 72B, #594 account selection, #640 alembic head, #653/#654 conv archive, #655/#656 weekly pace) - ✅ pytest 562 passed / 3 PG-only skip (unit + integration suite) - ✅ SDK e2e 20/20 (`tests/e2e/test_openai_sdk_compat.py` + `tests/e2e/test_v1_responses_openai_sdk.py`) - ✅ `scripts/verify_v1_responses_openai_sdk.py` 5/5 against live container with real account - ✅ ruff check + ruff format --check + uv run ty check — All clean - ✅ prod parity probe: #637 (unknown-key drop) + #658 (strict tool 400) both confirmed fixed vs prod 1.17.0 ## Behavior change to note in announcements - **#658**: `strict: true` function tool schemas missing `additionalProperties: false` (or other strict-mode requirements) now return a local **400 `invalid_function_parameters`** with `param=tools[N].function.parameters` (chat) / `tools[N].parameters` (responses). Pre-1.18.0 these were silently forwarded upstream; 1.18.0 enforces OpenAI's documented strict-mode contract locally before forwarding. OpenSpec changes shipped in this release (to be archived after tag publish): - deactivate-on-token-expired (#600) - drop-unknown-message-fields (#637) - mask-single-http-previous-response-miss (#516) - normalize-v1-responses-openai-sdk-stream (#639) - start-device-oauth-polling (#635) - support-prolite-plan-capacity (#558) - treat-upstream-overloaded-as-retryable (#601) - validate-password-length (#598) - validate-strict-function-tool-schema (#658)
Soju06
added a commit
that referenced
this pull request
May 17, 2026
Archive changes shipped in v1.18.0 and v1.18.1 per AGENTS.md "release 이후 archive" workflow. All 12 changes verified shipped in production at 10.0.0.113 (image 1.18.1, sha256:0431a474432d). 1.18.0 (9 changes): - deactivate-on-token-expired (#600) - drop-unknown-message-fields (#637) - mask-single-http-previous-response-miss (#516) - normalize-v1-responses-openai-sdk-stream (#639) - start-device-oauth-polling (#635) [delta header fixed: MODIFIED -> ADDED] - support-prolite-plan-capacity (#558) [delta header fixed: MODIFIED -> ADDED] - treat-upstream-overloaded-as-retryable (#601) - validate-password-length (#598) - validate-strict-function-tool-schema (#658) 1.18.1 (3 changes): - harden-long-codex-websocket-turns (#586) - suppress-duplicate-side-effect-tool-calls (#586) - hotfix-db-pool-pre-ping-and-firewall-cache-ttl (#679) Specs updated: 9 capabilities (admin-auth, api-firewall, chat-completions-compat, chat-completions-extensions, database-backends, frontend-architecture, responses-api-compat, usage-refresh-policy). openspec validate --specs: 19/19 passed. Co-authored-by: Hermes Agent <hermes-agent@nous.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #383.
Root cause
When the OAuth refresh endpoint returns error code
token_expired(HTTP 401 body{"error": {"code": "token_expired"}}with message"Provided authentication token is expired. Please try signing in again."), the existing code path treats it as a transient failure:PERMANENT_FAILURE_CODESinapp/core/balancer/logic.pyonly listsrefresh_token_expired,refresh_token_reused,refresh_token_invalidated,account_deactivated,account_suspended,account_deleted. It does not includetoken_expired.classify_refresh_error("token_expired")therefore returnsFalse, soRefreshErroris built withis_permanent=False.AuthManager.refresh_accountonly deactivates on a permanent error.usage._should_deactivate_for_usage_erroronly deactivates on HTTP 402/404, on the existing permanent code set, or on the "account has been deactivated" message hint.Net effect: refresh keeps failing, the account stays
ACTIVE, and the load balancer keeps selecting it. The reporter ran into exactly this withemail_3e2ba4c18bdauntil they manually paused it.Why
token_expiredat the refresh boundary is permanentThis is not a transient error in practice:
token_expiredarriving at the refresh boundary means the refresh token (or the session it represents) is no longer usable.refresh_token_expired/refresh_token_invalidated-- just a different error code on the upstream side.So treating it as permanent matches what is actually happening on the upstream side, and matches the reporter's observation that the account only recovers after re-authentication.
Fix
Add
token_expiredtoPERMANENT_FAILURE_CODESwith the message"Authentication token expired - re-login required". This single change flows through both deactivation paths without further plumbing:AuthManager.refresh_account-- viaclassify_refresh_error("token_expired") -> True_should_deactivate_for_usage_error-- via itscode in PERMANENT_FAILURE_CODEScheckSo a refresh-time
token_expirednow correctly deactivates the account on both the request-driven refresh path and the background usage refresh path. The reporter's affected account would now be deactivated on the next failing refresh and stop draining client requests.Re-authentication clears the deactivation as it does today for the other permanent codes.
Tests
The auth-manager regression test drives
refresh_accountagainst a stubbedrefresh_access_tokenraisingRefreshError("token_expired", ..., classify_refresh_error("token_expired")), then asserts:RefreshErrorhasis_permanent=Trueandcode="token_expired"AccountStatus.DEACTIVATEDVerification on this branch:
OpenSpec
Added
openspec/changes/deactivate-on-token-expired/capturing the new behavior under the existingusage-refresh-policycapability (refresh-time and usage-refresh-timetoken_expiredboth deactivate; classification scenario; deactivation scenario; reason wording requirement).Hotfix candidate for v1.17.1 / v1.17.2
Small, scoped change to the permanent-failure classification, with two unit regressions and an OpenSpec entry. No runtime/migration impact. Pairs naturally with #598 (the bcrypt 72-byte fix) for the next patch release.