fix(auth): do not delete credentials file on a single refresh 401#1996
fix(auth): do not delete credentials file on a single refresh 401#1996
Conversation
ac430c0 to
6bf2791
Compare
There was a problem hiding this comment.
Pull request overview
Prevents OAuth credentials loss during concurrent refresh-token rotation by avoiding deletion of the on-disk credentials file after a single refresh 401, and adds a regression test to ensure the file remains intact.
Changes:
- Stop deleting the OAuth credentials file on
OAuthUnauthorizedduring refresh; clear only in-memory access-token state. - Add a filesystem-based regression test asserting a single refresh 401 must not remove the credentials file.
- Update changelog entries to document the auth-race fix.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/kimi_cli/auth/oauth.py |
Removes credentials-file deletion on refresh 401; updates rationale comment. |
tests/auth/test_oauth_refresh.py |
Removes now-redundant mock and adds a tmpdir regression test for file preservation. |
CHANGELOG.md |
Adds an Unreleased entry describing the OAuth race fix. |
docs/en/release-notes/changelog.md |
Mirrors the Unreleased entry in English release notes. |
docs/zh/release-notes/changelog.md |
Adds the corresponding Chinese release-notes entry. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Do not delete the credential file on a single 401: there | ||
| # is a TOCTOU window between the load_tokens check above | ||
| # and here in which a concurrent manager may write a | ||
| # freshly rotated token. Clearing the in-memory cache is | ||
| # enough; if the refresh_token is truly revoked, the next | ||
| # ensure_fresh will 401 again and the user is prompted to | ||
| # /login, which atomically overwrites the file. | ||
| self._access_tokens.pop(ref.key, None) |
There was a problem hiding this comment.
With delete_tokens(ref) removed, a truly revoked refresh_token will remain on disk and the background refresh loop (every 60s) will keep calling the refresh endpoint, hitting the 1s sleep + 401 path repeatedly and emitting warnings/telemetry indefinitely until the user logs in again. Consider adding a cooldown/backoff (in-memory is fine) after an OAuthUnauthorized so the process stops hammering the refresh endpoint while still avoiding deleting the credentials file in the TOCTOU window.
| patch( | ||
| "kimi_cli.auth.oauth.refresh_token", | ||
| AsyncMock(side_effect=OAuthUnauthorized("invalid_grant")), | ||
| ), |
There was a problem hiding this comment.
This test currently incurs the real await asyncio.sleep(1) in the OAuthUnauthorized handler inside _refresh_tokens, which will slow the suite (and grows with additional similar tests). Patch asyncio.sleep in kimi_cli.auth.oauth (or otherwise bypass the grace-sleep) so the test remains fast while still asserting the credentials file is preserved.
| ), | |
| ), | |
| patch("kimi_cli.auth.oauth.asyncio.sleep", new=AsyncMock()), |
| - Shell:修复 approval 弹窗超时后被误报为 `Rejected by user` 的问题——300 秒安全超时后,工具调用会以 `Rejected: approval timed out` 拒绝,让离开电脑一段时间后回来的用户能分辨出这是超时而非自己的手动拒绝。经常长时间离开的话可以加 `--yolo`/`-y` 自动批准工具调用 | ||
| - Auth:修复 OAuth 用户因并发实例的 refresh token 轮换竞态被反复要求 `/login` 的问题——当另一个并发运行的 kimi-cli 实例(终端、VS Code 插件或 `kimi -p` 一次性命令)合法地轮换了 refresh token,当前实例手里过期的 refresh 请求会从服务端拿回 401,"别的实例是否刚轮换过"的磁盘检查与 `delete_tokens` 调用之间存在 TOCTOU 竞态,即使磁盘上马上会被写入一份有效的新 token,凭证文件也会被误删,迫使用户重新登录;现在依旧清理内存缓存(真正失效的 token 会在下一次请求时浮现),但保留文件,让并发实例刚写入的新 token 有机会被恢复,最终的 `/login` 仍会原子覆盖该文件 |
There was a problem hiding this comment.
The new Chinese changelog entry uses ASCII punctuation and spacing in mixed CJK/English text (e.g., TOCTOU 竞态,即使). Per docs typography guidelines (docs/AGENTS.md), please use full-width punctuation in Chinese and add appropriate spacing around mixed-language segments to keep formatting consistent.
| - Shell:修复 approval 弹窗超时后被误报为 `Rejected by user` 的问题——300 秒安全超时后,工具调用会以 `Rejected: approval timed out` 拒绝,让离开电脑一段时间后回来的用户能分辨出这是超时而非自己的手动拒绝。经常长时间离开的话可以加 `--yolo`/`-y` 自动批准工具调用 | |
| - Auth:修复 OAuth 用户因并发实例的 refresh token 轮换竞态被反复要求 `/login` 的问题——当另一个并发运行的 kimi-cli 实例(终端、VS Code 插件或 `kimi -p` 一次性命令)合法地轮换了 refresh token,当前实例手里过期的 refresh 请求会从服务端拿回 401,"别的实例是否刚轮换过"的磁盘检查与 `delete_tokens` 调用之间存在 TOCTOU 竞态,即使磁盘上马上会被写入一份有效的新 token,凭证文件也会被误删,迫使用户重新登录;现在依旧清理内存缓存(真正失效的 token 会在下一次请求时浮现),但保留文件,让并发实例刚写入的新 token 有机会被恢复,最终的 `/login` 仍会原子覆盖该文件 | |
| - Shell:修复 approval 弹窗超时后被误报为 `Rejected by user` 的问题——300 秒安全超时后,工具调用会以 `Rejected: approval timed out` 拒绝,让离开电脑一段时间后回来的用户能分辨出这是超时而非自己的手动拒绝。经常长时间离开的话可以加上 `--yolo`/`-y`,自动批准工具调用 | |
| - Auth:修复 OAuth 用户因并发实例的 refresh token 轮换竞态被反复要求 `/login` 的问题——当另一个并发运行的 kimi-cli 实例(终端、VS Code 插件或 `kimi -p` 一次性命令)合法地轮换了 refresh token,当前实例手里过期的 refresh 请求会从服务端拿回 401,"别的实例是否刚轮换过"的磁盘检查与 `delete_tokens` 调用之间存在 TOCTOU 竞态,即使磁盘上马上会被写入一份有效的新 token,凭证文件也会被误删,迫使用户重新登录;现在依旧清理内存缓存(真正失效的 token 会在下一次请求时浮现),但保留文件,让并发实例刚写入的新 token 有机会被恢复,最终的 `/login` 仍会原子覆盖该文件 |
3a60bc9 to
519deed
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 519deedcfb
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| self._access_tokens.pop(ref.key, None) | ||
| self._apply_access_token(runtime, "") |
There was a problem hiding this comment.
Invalidate persisted token after confirmed refresh 401
In the OAuthUnauthorized path, we now clear only in-memory state but keep the credential file on disk; when the 401 is a real revocation (not a concurrent rotation), resolve_api_key() later reloads that persisted OAuth token via load_tokens() and keeps preferring it over a configured static api_key, so requests continue to 401 instead of falling back. This regression is introduced by removing disk invalidation from this branch and impacts users who configure both OAuth and API-key fallback.
Useful? React with 👍 / 👎.
| ## Unreleased | ||
|
|
||
| - Shell: Fix `Rejected by user` misleading message when an approval modal times out — after the 300s safety timeout, the tool call now rejects with `Rejected: approval timed out`, so users returning to their session after stepping away can tell the rejection was a timeout rather than a manual rejection. Pass `--yolo`/`-y` to auto-approve tool calls if you regularly leave sessions unattended | ||
| - Auth: Fix OAuth users being forced to `/login` again after an unrelated refresh-token rotation race — when a concurrently-running kimi-cli instance (terminal, VS Code extension, or `kimi -p` one-shot) legitimately rotated the refresh token, the current instance's now-stale refresh request would come back with a 401, and a TOCTOU window between the "did another instance rotate?" disk check and the `delete_tokens` call could wipe the credentials file even though a valid rotated token was about to be written to it; the in-memory cache is still cleared so truly revoked tokens surface on the next request, but the file is preserved so a concurrent instance's freshly-rotated token can be recovered, and an eventual `/login` still overwrites it atomically |
There was a problem hiding this comment.
🟡 Manual edit to auto-generated English changelog violates docs/AGENTS.md rule
The docs/AGENTS.md file explicitly states: "The English changelog (docs/en/release-notes/changelog.md) is auto-generated from the root CHANGELOG.md. Do not edit it manually." (docs/AGENTS.md:185). This PR manually adds the new Auth changelog entry to docs/en/release-notes/changelog.md, violating that rule. The correct workflow is to edit only the root CHANGELOG.md and then run npm run sync (or let the build step handle it), which overwrites the English changelog via docs/scripts/sync-changelog.mjs.
Prompt for agents
The file docs/en/release-notes/changelog.md is auto-generated from the root CHANGELOG.md by the sync script at docs/scripts/sync-changelog.mjs. Per docs/AGENTS.md line 185: 'Do not edit it manually.' The fix is to revert the manual edit to docs/en/release-notes/changelog.md and instead run 'npm run sync' from the docs/ directory after editing the root CHANGELOG.md. The sync script runs automatically before npm run dev and npm run build, so the English changelog will be regenerated with the correct content from CHANGELOG.md.
Was this helpful? React with 👍 or 👎 to provide feedback.
519deed to
13e9599
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Already looking forward to the next diff. ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
It seems there is still a problem. I have just upgraded kimi-cli. |
Related Issue
Refs #1547, #1350, #1940. Addresses the frequent "Authorization failed. Your session may have expired." symptom that forces OAuth users to
/loginrepeatedly during normal use.The bug in one sentence
delete_tokens(ref)removes whatever the ref currently points to on disk, not "the refresh_token that just got 401". In single-instance usage those are the same file content; in multi-instance usage they aren't — a concurrentOAuthManagermay have legitimately rotated and written a valid new token into the same file between theload_tokenscheck atsrc/kimi_cli/auth/oauth.py:954and thedelete_tokenscall at:964. We wipe it.Why multi-instance races are common here
OAuthManagerinstances aren't a per-process singleton; several places in the codebase instantiate their own, sharing~/.kimi/credentials/kimi-code.json:src/kimi_cli/app.py:199— main entrysrc/kimi_cli/web/api/sessions.py:817— per session title-generation (added in fix(auth): improve OAuth error handling for skill execution and title generation (#1635) #1658)src/kimi_cli/cli/plugin.py:212,src/kimi_cli/plugin/tool.py:33— plugin pathsEach has its own
_refresh_lock(asyncio, scoped to the instance). Cross-instance coordination relies only on the file lock (_CrossProcessLock), which itself has a degraded-to-no-lock fallback (oauth.py:946). Access tokens last only 15 min (confirmed by #1819's real-token experiment), so refresh fires every ~7.5 min per manager — plenty of race opportunities.Fix
1) Drop
delete_tokens(ref). Clearing the in-memory cache remains. Convergence is still correct:ensure_fresh401s again, user is prompted to/login, which atomically overwrites the file. Same UX as before, one extra network round-trip.resolve_api_keypicks it up, user keeps working.2) Add a process-local rejection tombstone (
_REJECTED_REFRESH_TOKENS). Removingdelete_tokensalone would regress users who configure both OAuth AND a staticapi_keyfallback:resolve_api_keyprefers OAuth over api_key by design, so a stale-but-undeleted OAuth file would keep shadowing the api_key. The tombstone records the rejected refresh_token with a 5-minute cooldown, and while live:resolve_api_keydoes not reload the persisted token from disk → the static api_key fallback takes over._apply_access_tokenfalls back to the provider's configured api_key on the live LLM client → in-flight sessions recover without waiting for the next turn.ensure_freshshort-circuits and does not re-hit the server with the same dead refresh_token./loginoverwrote the file), and is also explicitly cleared in every rotation-recovery branch.The tombstone is module-level rather than per-manager — OAuth credentials are a process-wide resource, so all managers in this process should see the same rejection state.
Scope
src/kimi_cli/auth/oauth.py:delete_tokens(ref)in theOAuthUnauthorizedhandler of_refresh_tokensand rewrite the surrounding comment (ref-is-a-pointer-not-a-value framing)._REJECTED_REFRESH_TOKENSmodule dict and four helpers onOAuthManager:_rejected_refresh_state,_should_suppress_persisted_token,_can_retry_rejected_refresh_token,_mark_refresh_token_rejected,_clear_rejected_refresh_token. Call sites inensure_fresh,_refresh_tokens,resolve_api_key, and_load_initial_tokens._apply_access_tokenfalls back to the provider's configuredapi_keywhen the OAuth access token is empty, so the live LLM client switches over immediately when OAuth is cleared.tests/auth/test_oauth_refresh.py:patch("delete_tokens")fromtest_ensure_fresh_force_raises_on_unauthorized.patch("asyncio.sleep")to tests that exercise theOAuthUnauthorizedhandler so they no longer wait 1 s.test_unauthorized_must_not_delete_credentials_file(force=True path, real tmp fs) — fails on main, passes with this change.test_unauthorized_non_force_must_not_delete_credentials_file(background-refresh path).test_rejected_refresh_token_cooldown_skips_background_retry(cooldown is observed).test_rejected_tombstone_cleared_when_concurrent_instance_rotated(auto-heal on disk rotation).tests/auth/test_resolve_api_key.py:test_resolve_api_key_falls_back_after_rejected_refresh_token— proves the api_key fallback path._CrossProcessLock, the defense layers upstream, or ACP / wire / web 401 handling.Why this is still only a partial fix (follow-up work)
The TOCTOU is a symptom of a deeper architectural issue. Suggested separate PRs:
refactor(auth): single OAuthManager per process — eliminates all in-process races; cross-process becomes the only remaining surface. High priority.fix(auth): make_CrossProcessLocknot degrade silently — remove the "5 retries then no-lock fallback" path. Medium.audit(auth): unify 401 handling across shell / acp / wire — ACP (session.py:221) and wire (server.py:668) raiseauth_requiredimmediately instead of routing through_run_with_connection_recovery, which likely contributes to Count auth failure problem when using kimi code in vscode or cursor. #1940's "retry button keeps popping up" loop. Medium.The fix in this PR remains correct regardless of whether the above land — "a single 401 from the refresh endpoint is not sufficient evidence to destroy persistent credentials" is a permanent principle that no future refactor should reintroduce.
Checklist
make gen-changelogto update the changelog.make gen-docsto update the user documentation. — N/A: internal behavior change, no user-facing docs affected.