Skip to content

fix(auth): do not delete credentials file on a single refresh 401#1996

Merged
wbxl2000 merged 1 commit intomainfrom
fix/oauth-toctou-credential-deletion
Apr 22, 2026
Merged

fix(auth): do not delete credentials file on a single refresh 401#1996
wbxl2000 merged 1 commit intomainfrom
fix/oauth-toctou-credential-deletion

Conversation

@wbxl2000
Copy link
Copy Markdown
Collaborator

@wbxl2000 wbxl2000 commented Apr 22, 2026

Related Issue

Refs #1547, #1350, #1940. Addresses the frequent "Authorization failed. Your session may have expired." symptom that forces OAuth users to /login repeatedly during normal use.

The bug in one sentence

delete_tokens(ref) removes whatever the ref currently points to on disk, not "the refresh_token that just got 401". In single-instance usage those are the same file content; in multi-instance usage they aren't — a concurrent OAuthManager may have legitimately rotated and written a valid new token into the same file between the load_tokens check at src/kimi_cli/auth/oauth.py:954 and the delete_tokens call at :964. We wipe it.

Why multi-instance races are common here

OAuthManager instances aren't a per-process singleton; several places in the codebase instantiate their own, sharing ~/.kimi/credentials/kimi-code.json:

Each has its own _refresh_lock (asyncio, scoped to the instance). Cross-instance coordination relies only on the file lock (_CrossProcessLock), which itself has a degraded-to-no-lock fallback (oauth.py:946). Access tokens last only 15 min (confirmed by #1819's real-token experiment), so refresh fires every ~7.5 min per manager — plenty of race opportunities.

Fix

1) Drop delete_tokens(ref). Clearing the in-memory cache remains. Convergence is still correct:

  • If the refresh_token is truly revoked: next ensure_fresh 401s again, user is prompted to /login, which atomically overwrites the file. Same UX as before, one extra network round-trip.
  • If a concurrent instance just rotated: file now holds the valid new token; the next resolve_api_key picks it up, user keeps working.

2) Add a process-local rejection tombstone (_REJECTED_REFRESH_TOKENS). Removing delete_tokens alone would regress users who configure both OAuth AND a static api_key fallback: resolve_api_key prefers OAuth over api_key by design, so a stale-but-undeleted OAuth file would keep shadowing the api_key. The tombstone records the rejected refresh_token with a 5-minute cooldown, and while live:

  • resolve_api_key does not reload the persisted token from disk → the static api_key fallback takes over.
  • _apply_access_token falls back to the provider's configured api_key on the live LLM client → in-flight sessions recover without waiting for the next turn.
  • Background ensure_fresh short-circuits and does not re-hit the server with the same dead refresh_token.
  • The tombstone clears automatically when the on-disk refresh_token differs from the rejected one (i.e. a concurrent instance rotated, or /login overwrote the file), and is also explicitly cleared in every rotation-recovery branch.

The tombstone is module-level rather than per-manager — OAuth credentials are a process-wide resource, so all managers in this process should see the same rejection state.

Scope

  • src/kimi_cli/auth/oauth.py:
    • Remove delete_tokens(ref) in the OAuthUnauthorized handler of _refresh_tokens and rewrite the surrounding comment (ref-is-a-pointer-not-a-value framing).
    • Introduce _REJECTED_REFRESH_TOKENS module dict and four helpers on OAuthManager: _rejected_refresh_state, _should_suppress_persisted_token, _can_retry_rejected_refresh_token, _mark_refresh_token_rejected, _clear_rejected_refresh_token. Call sites in ensure_fresh, _refresh_tokens, resolve_api_key, and _load_initial_tokens.
    • _apply_access_token falls back to the provider's configured api_key when the OAuth access token is empty, so the live LLM client switches over immediately when OAuth is cleared.
  • tests/auth/test_oauth_refresh.py:
    • Drop a now-redundant patch("delete_tokens") from test_ensure_fresh_force_raises_on_unauthorized.
    • Add patch("asyncio.sleep") to tests that exercise the OAuthUnauthorized handler so they no longer wait 1 s.
    • New test_unauthorized_must_not_delete_credentials_file (force=True path, real tmp fs) — fails on main, passes with this change.
    • New test_unauthorized_non_force_must_not_delete_credentials_file (background-refresh path).
    • New test_rejected_refresh_token_cooldown_skips_background_retry (cooldown is observed).
    • New test_rejected_tombstone_cleared_when_concurrent_instance_rotated (auto-heal on disk rotation).
    • Autouse fixture clears the module-level tombstone dict between tests.
  • tests/auth/test_resolve_api_key.py:
    • Autouse fixture for tombstone isolation.
    • New test_resolve_api_key_falls_back_after_rejected_refresh_token — proves the api_key fallback path.
  • Does not touch _CrossProcessLock, the defense layers upstream, or ACP / wire / web 401 handling.

Why this is still only a partial fix (follow-up work)

The TOCTOU is a symptom of a deeper architectural issue. Suggested separate PRs:

  • refactor(auth): single OAuthManager per process — eliminates all in-process races; cross-process becomes the only remaining surface. High priority.
  • fix(auth): make _CrossProcessLock not degrade silently — remove the "5 retries then no-lock fallback" path. Medium.
  • audit(auth): unify 401 handling across shell / acp / wire — ACP (session.py:221) and wire (server.py:668) raise auth_required immediately instead of routing through _run_with_connection_recovery, which likely contributes to Count auth failure problem when using kimi code in vscode or cursor. #1940's "retry button keeps popping up" loop. Medium.

The fix in this PR remains correct regardless of whether the above land — "a single 401 from the refresh endpoint is not sufficient evidence to destroy persistent credentials" is a permanent principle that no future refactor should reintroduce.

Checklist

  • I have read the CONTRIBUTING document.
  • I have linked the related issue, if any.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have run make gen-changelog to update the changelog.
  • I have run make gen-docs to update the user documentation. — N/A: internal behavior change, no user-facing docs affected.

Copilot AI review requested due to automatic review settings April 22, 2026 09:29
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

@wbxl2000 wbxl2000 force-pushed the fix/oauth-toctou-credential-deletion branch from ac430c0 to 6bf2791 Compare April 22, 2026 09:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prevents OAuth credentials loss during concurrent refresh-token rotation by avoiding deletion of the on-disk credentials file after a single refresh 401, and adds a regression test to ensure the file remains intact.

Changes:

  • Stop deleting the OAuth credentials file on OAuthUnauthorized during refresh; clear only in-memory access-token state.
  • Add a filesystem-based regression test asserting a single refresh 401 must not remove the credentials file.
  • Update changelog entries to document the auth-race fix.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/kimi_cli/auth/oauth.py Removes credentials-file deletion on refresh 401; updates rationale comment.
tests/auth/test_oauth_refresh.py Removes now-redundant mock and adds a tmpdir regression test for file preservation.
CHANGELOG.md Adds an Unreleased entry describing the OAuth race fix.
docs/en/release-notes/changelog.md Mirrors the Unreleased entry in English release notes.
docs/zh/release-notes/changelog.md Adds the corresponding Chinese release-notes entry.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/kimi_cli/auth/oauth.py Outdated
Comment on lines 959 to 966
# Do not delete the credential file on a single 401: there
# is a TOCTOU window between the load_tokens check above
# and here in which a concurrent manager may write a
# freshly rotated token. Clearing the in-memory cache is
# enough; if the refresh_token is truly revoked, the next
# ensure_fresh will 401 again and the user is prompted to
# /login, which atomically overwrites the file.
self._access_tokens.pop(ref.key, None)
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With delete_tokens(ref) removed, a truly revoked refresh_token will remain on disk and the background refresh loop (every 60s) will keep calling the refresh endpoint, hitting the 1s sleep + 401 path repeatedly and emitting warnings/telemetry indefinitely until the user logs in again. Consider adding a cooldown/backoff (in-memory is fine) after an OAuthUnauthorized so the process stops hammering the refresh endpoint while still avoiding deleting the credentials file in the TOCTOU window.

Copilot uses AI. Check for mistakes.
patch(
"kimi_cli.auth.oauth.refresh_token",
AsyncMock(side_effect=OAuthUnauthorized("invalid_grant")),
),
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test currently incurs the real await asyncio.sleep(1) in the OAuthUnauthorized handler inside _refresh_tokens, which will slow the suite (and grows with additional similar tests). Patch asyncio.sleep in kimi_cli.auth.oauth (or otherwise bypass the grace-sleep) so the test remains fast while still asserting the credentials file is preserved.

Suggested change
),
),
patch("kimi_cli.auth.oauth.asyncio.sleep", new=AsyncMock()),

Copilot uses AI. Check for mistakes.
Comment thread docs/zh/release-notes/changelog.md Outdated
Comment on lines +7 to +8
- Shell:修复 approval 弹窗超时后被误报为 `Rejected by user` 的问题——300 秒安全超时后,工具调用会以 `Rejected: approval timed out` 拒绝,让离开电脑一段时间后回来的用户能分辨出这是超时而非自己的手动拒绝。经常长时间离开的话可以加 `--yolo`/`-y` 自动批准工具调用
- Auth:修复 OAuth 用户因并发实例的 refresh token 轮换竞态被反复要求 `/login` 的问题——当另一个并发运行的 kimi-cli 实例(终端、VS Code 插件或 `kimi -p` 一次性命令)合法地轮换了 refresh token,当前实例手里过期的 refresh 请求会从服务端拿回 401,"别的实例是否刚轮换过"的磁盘检查与 `delete_tokens` 调用之间存在 TOCTOU 竞态,即使磁盘上马上会被写入一份有效的新 token,凭证文件也会被误删,迫使用户重新登录;现在依旧清理内存缓存(真正失效的 token 会在下一次请求时浮现),但保留文件,让并发实例刚写入的新 token 有机会被恢复,最终的 `/login` 仍会原子覆盖该文件
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new Chinese changelog entry uses ASCII punctuation and spacing in mixed CJK/English text (e.g., TOCTOU 竞态,即使). Per docs typography guidelines (docs/AGENTS.md), please use full-width punctuation in Chinese and add appropriate spacing around mixed-language segments to keep formatting consistent.

Suggested change
- Shell:修复 approval 弹窗超时后被误报为 `Rejected by user` 的问题——300 秒安全超时后,工具调用会以 `Rejected: approval timed out` 拒绝,让离开电脑一段时间后回来的用户能分辨出这是超时而非自己的手动拒绝。经常长时间离开的话可以加 `--yolo`/`-y` 自动批准工具调用
- Auth:修复 OAuth 用户因并发实例的 refresh token 轮换竞态被反复要求 `/login` 的问题——当另一个并发运行的 kimi-cli 实例(终端、VS Code 插件或 `kimi -p` 一次性命令)合法地轮换了 refresh token,当前实例手里过期的 refresh 请求会从服务端拿回 401,"别的实例是否刚轮换过"的磁盘检查与 `delete_tokens` 调用之间存在 TOCTOU 竞态,即使磁盘上马上会被写入一份有效的新 token,凭证文件也会被误删,迫使用户重新登录;现在依旧清理内存缓存(真正失效的 token 会在下一次请求时浮现),但保留文件,让并发实例刚写入的新 token 有机会被恢复,最终的 `/login` 仍会原子覆盖该文件
- Shell:修复 approval 弹窗超时后被误报为 `Rejected by user` 的问题——300 秒安全超时后,工具调用会以 `Rejected: approval timed out` 拒绝,让离开电脑一段时间后回来的用户能分辨出这是超时而非自己的手动拒绝。经常长时间离开的话可以加上 `--yolo`/`-y`自动批准工具调用
- Auth:修复 OAuth 用户因并发实例的 refresh token 轮换竞态被反复要求 `/login` 的问题——当另一个并发运行的 kimi-cli 实例(终端、VS Code 插件或 `kimi -p` 一次性命令)合法地轮换了 refresh token,当前实例手里过期的 refresh 请求会从服务端拿回 401,"别的实例是否刚轮换过"的磁盘检查与 `delete_tokens` 调用之间存在 TOCTOU 竞态即使磁盘上马上会被写入一份有效的新 token,凭证文件也会被误删,迫使用户重新登录;现在依旧清理内存缓存(真正失效的 token 会在下一次请求时浮现),但保留文件,让并发实例刚写入的新 token 有机会被恢复,最终的 `/login` 仍会原子覆盖该文件

Copilot uses AI. Check for mistakes.
@wbxl2000 wbxl2000 force-pushed the fix/oauth-toctou-credential-deletion branch 2 times, most recently from 3a60bc9 to 519deed Compare April 22, 2026 12:55
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 519deedcfb

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 970 to 971
self._access_tokens.pop(ref.key, None)
self._apply_access_token(runtime, "")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Invalidate persisted token after confirmed refresh 401

In the OAuthUnauthorized path, we now clear only in-memory state but keep the credential file on disk; when the 401 is a real revocation (not a concurrent rotation), resolve_api_key() later reloads that persisted OAuth token via load_tokens() and keeps preferring it over a configured static api_key, so requests continue to 401 instead of falling back. This regression is introduced by removing disk invalidation from this branch and impacts users who configure both OAuth and API-key fallback.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

## Unreleased

- Shell: Fix `Rejected by user` misleading message when an approval modal times out — after the 300s safety timeout, the tool call now rejects with `Rejected: approval timed out`, so users returning to their session after stepping away can tell the rejection was a timeout rather than a manual rejection. Pass `--yolo`/`-y` to auto-approve tool calls if you regularly leave sessions unattended
- Auth: Fix OAuth users being forced to `/login` again after an unrelated refresh-token rotation race — when a concurrently-running kimi-cli instance (terminal, VS Code extension, or `kimi -p` one-shot) legitimately rotated the refresh token, the current instance's now-stale refresh request would come back with a 401, and a TOCTOU window between the "did another instance rotate?" disk check and the `delete_tokens` call could wipe the credentials file even though a valid rotated token was about to be written to it; the in-memory cache is still cleared so truly revoked tokens surface on the next request, but the file is preserved so a concurrent instance's freshly-rotated token can be recovered, and an eventual `/login` still overwrites it atomically
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Manual edit to auto-generated English changelog violates docs/AGENTS.md rule

The docs/AGENTS.md file explicitly states: "The English changelog (docs/en/release-notes/changelog.md) is auto-generated from the root CHANGELOG.md. Do not edit it manually." (docs/AGENTS.md:185). This PR manually adds the new Auth changelog entry to docs/en/release-notes/changelog.md, violating that rule. The correct workflow is to edit only the root CHANGELOG.md and then run npm run sync (or let the build step handle it), which overwrites the English changelog via docs/scripts/sync-changelog.mjs.

Prompt for agents
The file docs/en/release-notes/changelog.md is auto-generated from the root CHANGELOG.md by the sync script at docs/scripts/sync-changelog.mjs. Per docs/AGENTS.md line 185: 'Do not edit it manually.' The fix is to revert the manual edit to docs/en/release-notes/changelog.md and instead run 'npm run sync' from the docs/ directory after editing the root CHANGELOG.md. The sync script runs automatically before npm run dev and npm run build, so the English changelog will be regenerated with the correct content from CHANGELOG.md.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@wbxl2000 wbxl2000 force-pushed the fix/oauth-toctou-credential-deletion branch from 519deed to 13e9599 Compare April 22, 2026 13:33
@wbxl2000
Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@wbxl2000 wbxl2000 merged commit 435fbf0 into main Apr 22, 2026
16 checks passed
@wbxl2000 wbxl2000 deleted the fix/oauth-toctou-credential-deletion branch April 22, 2026 13:50
@zouyee
Copy link
Copy Markdown

zouyee commented Apr 23, 2026

Interrupted
Authorization failed. Your session may have expired.
Type /login to re-authenticate.
Server: Error code: 401 - {'error': {'message': 'The API Key appears to be invalid or
may have expired. Please verify your credentials and try again.', 'type': 'invalid_aut
hentication_error'}}

It seems there is still a problem. I have just upgraded kimi-cli.

 kimi-cli --version
kimi, version 1.38.0
kimi --version
kimi, version 1.38.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants