Skip to content

fix(auth): deactivate account on refresh-time token_expired#600

Merged
Soju06 merged 1 commit into
mainfrom
fix/token-expired-deactivates-account
May 13, 2026
Merged

fix(auth): deactivate account on refresh-time token_expired#600
Soju06 merged 1 commit into
mainfrom
fix/token-expired-deactivates-account

Conversation

@Soju06
Copy link
Copy Markdown
Owner

@Soju06 Soju06 commented May 13, 2026

Fixes #383.

Root cause

When the OAuth refresh endpoint returns error code token_expired (HTTP 401 body {"error": {"code": "token_expired"}} with message "Provided authentication token is expired. Please try signing in again."), the existing code path treats it as a transient failure:

  • PERMANENT_FAILURE_CODES in app/core/balancer/logic.py only lists refresh_token_expired, refresh_token_reused, refresh_token_invalidated, account_deactivated, account_suspended, account_deleted. It does not include token_expired.
  • classify_refresh_error("token_expired") therefore returns False, so RefreshError is built with is_permanent=False.
  • AuthManager.refresh_account only deactivates on a permanent error.
  • usage._should_deactivate_for_usage_error only deactivates on HTTP 402/404, on the existing permanent code set, or on the "account has been deactivated" message hint.

Net effect: refresh keeps failing, the account stays ACTIVE, and the load balancer keeps selecting it. The reporter ran into exactly this with email_3e2ba4c18bda until they manually paused it.

Why token_expired at the refresh boundary is permanent

This is not a transient error in practice:

  • Access-token-only expiry would have caused the refresh call itself to return a fresh token pair (HTTP 200). The refresh wouldn't fail.
  • A token_expired arriving at the refresh boundary means the refresh token (or the session it represents) is no longer usable.
  • That is the same shape as refresh_token_expired / refresh_token_invalidated -- just a different error code on the upstream side.

So treating it as permanent matches what is actually happening on the upstream side, and matches the reporter's observation that the account only recovers after re-authentication.

Fix

Add token_expired to PERMANENT_FAILURE_CODES with the message "Authentication token expired - re-login required". This single change flows through both deactivation paths without further plumbing:

  • AuthManager.refresh_account -- via classify_refresh_error("token_expired") -> True
  • usage-refresh _should_deactivate_for_usage_error -- via its code in PERMANENT_FAILURE_CODES check

So a refresh-time token_expired now correctly deactivates the account on both the request-driven refresh path and the background usage refresh path. The reporter's affected account would now be deactivated on the next failing refresh and stop draining client requests.

Re-authentication clears the deactivation as it does today for the other permanent codes.

Tests

tests/unit/test_auth_refresh.py::test_classify_refresh_error_token_expired_is_permanent PASSED
tests/unit/test_auth_manager.py::test_refresh_account_deactivates_when_upstream_returns_token_expired PASSED

The auth-manager regression test drives refresh_account against a stubbed refresh_access_token raising RefreshError("token_expired", ..., classify_refresh_error("token_expired")), then asserts:

  • the resulting RefreshError has is_permanent=True and code="token_expired"
  • the account row is updated to AccountStatus.DEACTIVATED
  • the deactivation reason surfaces the re-login / expired wording

Verification on this branch:

uv run ruff check app/core/balancer/logic.py tests/unit/test_auth_refresh.py tests/unit/test_auth_manager.py     # clean
uv run ruff format --check app/core/balancer/logic.py tests/unit/test_auth_refresh.py tests/unit/test_auth_manager.py  # clean
uv run ty check app/core/balancer/logic.py tests/unit/test_auth_refresh.py tests/unit/test_auth_manager.py        # clean
uv run pytest tests/unit/test_auth_refresh.py tests/unit/test_auth_manager.py tests/unit/test_load_balancer.py -q   # 74 passed
npx @fission-ai/openspec validate deactivate-on-token-expired                                                       # change is valid

OpenSpec

Added openspec/changes/deactivate-on-token-expired/ capturing the new behavior under the existing usage-refresh-policy capability (refresh-time and usage-refresh-time token_expired both deactivate; classification scenario; deactivation scenario; reason wording requirement).

Hotfix candidate for v1.17.1 / v1.17.2

Small, scoped change to the permanent-failure classification, with two unit regressions and an OpenSpec entry. No runtime/migration impact. Pairs naturally with #598 (the bcrypt 72-byte fix) for the next patch release.

Fixes #383.

When the OAuth refresh endpoint returns error code `token_expired`,
the existing code path treated it as a transient failure: it was not in
`PERMANENT_FAILURE_CODES`, so `classify_refresh_error` returned
`False`, `RefreshError` was built with `is_permanent=False`, and
`AuthManager.refresh_account` left the account in `ACTIVE`. The load
balancer then kept routing traffic to the account, every request kept
failing with the same `token_expired` error, and the only mitigation
was for an operator to pause the account manually -- exactly what
#383 documents.

This is not a transient error in practice. An access-token-only expiry
would have caused the refresh call itself to return a fresh token pair
(HTTP 200). A `token_expired` reaching codex-lb at the refresh boundary
means the refresh token (or the session it represents) is no longer
usable -- the same shape as `refresh_token_expired`, just under a
different error code on the upstream side.

Add `token_expired` to `PERMANENT_FAILURE_CODES` with the message
"Authentication token expired - re-login required". This single change
flows through both deactivation paths:

- `AuthManager.refresh_account` -- via `classify_refresh_error`
- usage-refresh `_should_deactivate_for_usage_error` -- via its
  `code in PERMANENT_FAILURE_CODES` check

So a refresh-time `token_expired` now correctly deactivates the account
on both the request-driven refresh path and the background usage
refresh path. The reporter's affected account (`email_3e2ba4c18bda`
with repeated `Token refresh failed ... status=401` /
`Provided authentication token is expired`) would now be deactivated
on the next failing refresh and stop draining client requests.

Re-authentication clears the deactivation as it does today for the
other permanent codes.

Tests added:

- `tests/unit/test_auth_refresh.py::test_classify_refresh_error_token_expired_is_permanent`
- `tests/unit/test_auth_manager.py::test_refresh_account_deactivates_when_upstream_returns_token_expired`
@Soju06
Copy link
Copy Markdown
Owner Author

Soju06 commented May 13, 2026

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@Soju06 Soju06 merged commit 7b0e0ff into main May 13, 2026
18 checks passed
Soju06 added a commit that referenced this pull request May 17, 2026
Release 1.18.0 — v1.17.0 → main, 27 PR.

## Pre-merge verification (release-gate e2e)

- ✅ CI 18/18 (mergeStateStatus CLEAN)
- ✅ baseline release-gate-smoke: 10/10 (`v1.models`, chat non-stream/stream/tool_call, responses non-stream/stream, unknown-msg-key drop, transient codes, dashboard 72-byte, alembic index)
- ✅ per-PR e2e extra: 14/14 (#658 strict tool × chat valid/invalid/type-omitted + responses invalid, #650-652 trim replay markers, #516 prev-resp miss mask, #558 prolite plan, #600 token_expired, #635 device poll, #647 frontend 72B, #594 account selection, #640 alembic head, #653/#654 conv archive, #655/#656 weekly pace)
- ✅ pytest 562 passed / 3 PG-only skip (unit + integration suite)
- ✅ SDK e2e 20/20 (`tests/e2e/test_openai_sdk_compat.py` + `tests/e2e/test_v1_responses_openai_sdk.py`)
- ✅ `scripts/verify_v1_responses_openai_sdk.py` 5/5 against live container with real account
- ✅ ruff check + ruff format --check + uv run ty check — All clean
- ✅ prod parity probe: #637 (unknown-key drop) + #658 (strict tool 400) both confirmed fixed vs prod 1.17.0

## Behavior change to note in announcements

- **#658**: `strict: true` function tool schemas missing `additionalProperties: false` (or other strict-mode requirements) now return a local **400 `invalid_function_parameters`** with `param=tools[N].function.parameters` (chat) / `tools[N].parameters` (responses). Pre-1.18.0 these were silently forwarded upstream; 1.18.0 enforces OpenAI's documented strict-mode contract locally before forwarding.

OpenSpec changes shipped in this release (to be archived after tag publish):
- deactivate-on-token-expired (#600)
- drop-unknown-message-fields (#637)
- mask-single-http-previous-response-miss (#516)
- normalize-v1-responses-openai-sdk-stream (#639)
- start-device-oauth-polling (#635)
- support-prolite-plan-capacity (#558)
- treat-upstream-overloaded-as-retryable (#601)
- validate-password-length (#598)
- validate-strict-function-tool-schema (#658)
Soju06 added a commit that referenced this pull request May 17, 2026
Archive changes shipped in v1.18.0 and v1.18.1 per AGENTS.md
"release 이후 archive" workflow. All 12 changes verified shipped in
production at 10.0.0.113 (image 1.18.1, sha256:0431a474432d).

1.18.0 (9 changes):
- deactivate-on-token-expired (#600)
- drop-unknown-message-fields (#637)
- mask-single-http-previous-response-miss (#516)
- normalize-v1-responses-openai-sdk-stream (#639)
- start-device-oauth-polling (#635) [delta header fixed: MODIFIED -> ADDED]
- support-prolite-plan-capacity (#558) [delta header fixed: MODIFIED -> ADDED]
- treat-upstream-overloaded-as-retryable (#601)
- validate-password-length (#598)
- validate-strict-function-tool-schema (#658)

1.18.1 (3 changes):
- harden-long-codex-websocket-turns (#586)
- suppress-duplicate-side-effect-tool-calls (#586)
- hotfix-db-pool-pre-ping-and-firewall-cache-ttl (#679)

Specs updated: 9 capabilities (admin-auth, api-firewall, chat-completions-compat,
chat-completions-extensions, database-backends, frontend-architecture,
responses-api-compat, usage-refresh-policy).

openspec validate --specs: 19/19 passed.

Co-authored-by: Hermes Agent <hermes-agent@nous.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Access token expiration (token_expired) does not deactivate account, so expired accounts remain routable

1 participant