Skip to content

fix(login): hold Child handle and reap on cancel to close PID-reuse race#29

Merged
Cmochance merged 3 commits into
mainfrom
fix/login-cancel-pid-race
May 9, 2026
Merged

fix(login): hold Child handle and reap on cancel to close PID-reuse race#29
Cmochance merged 3 commits into
mainfrom
fix/login-cancel-pid-race

Conversation

@Cmochance
Copy link
Copy Markdown
Owner

Summary

Closes the PID-reuse race carried over from PR #27's first cut at cancel.

  • Replace Mutex<Option<u32>> (PID slot) with Mutex<Option<Child>> so cancel calls Child::kill() on the actual handle. Both natural-exit and cancel paths atomically take() the same slot under one lock — no more recycled-PID hazard.
  • New shared helper wait_for_login_or_cancel(child) -> AppResult<Output> drives a Child::try_wait poll loop and is reused by both mac/runtime/process.rs and win/runtime/process.rs run_codex_login. Removes the duplicated register/wait/clear dance.
  • New drop_killed_child(child) helper does kill() + wait() together. Closes a zombie-leak hole flagged by the local 3-agent review: std::process::Child has no reaping Drop, so dropping a killed child on Unix leaves a zombie until app exit — and could let a stuck codex login keep running long enough to write auth.json. Funneled through cancel, the try_wait Err cleanup branch, and register_login_child's replace-and-reap path.
  • New typed error code LOGIN_CANCELLED plumbed through to the front-end. Both cancelledLoginProfile === profile (optimistic flag) and error_code === "LOGIN_CANCELLED" (authoritative) now route to the same "已取消登录" toast.

Test plan

  • cargo test --manifest-path src-tauri/Cargo.toml --lib — 66 passed (4 new tests on login_cancel: e2e happy-path, e2e cancel-from-thread, register-replaces-and-reaps, plus the existing peek/cancel unit tests, all gated #[cfg(all(test, unix))]).
  • Hand-test on macOS: click Login → close OAuth tab without finishing → click the spinning Login button → expect "已取消登录" toast within 200 ms.
  • Hand-test: open Settings → Codex CLI 路径 row → confirm dialog still opens; verify saving a path still works (regression check on the path UI we shipped in 1.5.6).
  • Click Login twice in quick succession → expect the second click to be no-op (login serializes through acquire_login_lock).

Known follow-ups (not blocking)

  • Windows-specific: Child::wait() after a failed TerminateProcess (e.g. elevated child) blocks until the child exits naturally. In practice we always spawn codex login as a non-elevated child of the app, so access-denied shouldn't fire — but if it ever does we'd want to bound the wait or move it onto a detached thread. Documented in drop_killed_child's rustdoc.
  • try_wait Err branch is genuinely hard to unit-test portably; left as an accepted gap with the cleanup logic exercised in code review only.

🤖 Generated with Claude Code

PR #27's first cut at cancel stored only the OS PID and used shell
kill / taskkill against it. Between wait_with_output returning and
clear_login_pid running, the OS could have recycled the PID — a late
cancel click would target whatever new owner held it. Worst case on
Windows, taskkill /F /T would nuke an unrelated process tree.

Replace the slot type with Mutex<Option<Child>> so cancel goes through
Child::kill() on the actual handle. Both spawn-site exit and cancel
atomically `take()` the same slot under one lock, so they can never
race on the same Child.

A new shared helper `wait_for_login_or_cancel(child) -> AppResult<Output>`
drives a poll loop on Child::try_wait and is called from both mac and
win run_codex_login (instead of each platform repeating the
register/wait/clear dance). Adds typed `LOGIN_CANCELLED` error code
and threads it through to the front-end.

Three additional fixes from the local 3-agent review:

- `cancel_login_in_progress` and `register_login_child`'s replace
  branch and `wait_for_login_or_cancel`'s try_wait Err arm all funnel
  through a new `drop_killed_child(child)` helper that does kill +
  wait. Without the wait, every cancel leaked a zombie on Unix
  (std::process::Child has no reaping Drop) and could let a stuck
  codex login keep running in the background long enough to write
  auth.json and confuse subsequent state.
- `register_login_child` now uses `replace` and reaps any prior
  occupant rather than silently dropping it. Defensive in case
  acquire_login_lock is ever bypassed.
- Front-end's catch handler now also recognises `LOGIN_CANCELLED` as
  a cancelled state, in addition to the optimistic
  `cancelledLoginProfile` flag.

Tests: added end-to-end coverage of `wait_for_login_or_cancel` (happy
path + cancel-from-sibling-thread). The cancel test polls the slot
rather than sleeping a fixed amount so it stays stable on slow CI.
Added `register_login_child_replaces_and_reaps_previous_occupant` to
lock in the new replace/reap contract. Test module gated `#[cfg(all
(test, unix))]` because it relies on /bin/sleep + /bin/sh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Cmochance
Copy link
Copy Markdown
Owner Author

剩余在你侧

  1. 手测 cancel 路径 — open 起来的 .app(dist/codex_switch.app,含本 PR 的修复):
    • 任意账号点登录 → 浏览器开 OAuth → 不完成,关掉浏览器 tab
    • 点回到应用,按那个转圈+取消的登录按钮
    • 应在约 200ms 内看到"已取消登录" toast,按钮回到"登录"状态
    • 期望:终端 ps -ef | grep codex 看不到僵尸 codex login 进程
  2. 回归手测 1.5.6 已有功能 — codex CLI 路径对话框(设置 → Codex CLI 路径 → 更改)应仍正常;正常登录(不取消)应仍写入 auth.json。
  3. CI 全绿后 squash-merge — 改动主要在 shared/runtime/login_cancel.rs,CI 应 ~3-5 分钟跑完。
  4. 下一轮 release 可顺延 v1.5.7(如需):本 PR 是 PID race fix(critical 级 bug),值得作为单独 patch 发;如果想跟 PR feat(ui): paint card Base button red when a custom Base Url is set #26 合并发就先合 PR feat(ui): paint card Base button red when a custom Base Url is set #26

已知 follow-up(不阻塞)

  • Windows 上 Child::wait()TerminateProcess access-denied 时可能 hang 到 OAuth 超时(少见);rustdoc 已写明,未来可加超时或 detached thread。
  • try_wait Err 分支单元测试缺(要 kernel 级 fault injection),cleanup 逻辑由 code review 覆盖。

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 493c738196

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src-tauri/shared/runtime/login_cancel.rs
Cmochance and others added 2 commits May 10, 2026 02:58
…fill hang

P1 from chatgpt-codex-connector review on PR #29: the spawn-site poll
loop only called Child::try_wait without draining the piped stdout /
stderr. A verbose codex login (warnings, debug output) would fill the
~64 KB pipe buffer, block on write, never reach exit, and our loop
would see Running forever — login would appear stuck until manual
cancel.

wait_for_login_or_cancel now takes the stdio handles off the Child
*before* registering it in the slot, hands each pipe to a dedicated
drainer thread that read_to_end into a Vec<u8>, and joins them after
the child exits naturally. On cancel / try_wait Err the drainers
self-clean (the killed child's pipes close → read_to_end returns EOF
→ thread exits) and we discard their output.

Added regression test
wait_for_login_or_cancel_drains_oversized_pipe_output that pushes
256 KiB through stdout — well past the 64 KB buffer threshold — and
asserts the helper still returns Ok with the full payload. The test
uses absolute paths (`/usr/bin/yes`, `/usr/bin/head`) so it stays
stable when other parallel tests modify PATH via `std::env::set_var`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-ups from the 3rd-round review on the pipe-drain fix:

- Add `wait_for_login_or_cancel_captures_stderr_on_nonzero_exit`
  regression test. Counterpart to the stdout test: pushes 256 KiB
  through stderr only, exits non-zero, and asserts the helper still
  captures the full payload — the `LOGIN_FAILED` toast feeds
  `output.stderr`, so this is the path that actually matters in
  production.

- Add `debug_assert!` to `register_login_child` enforcing that
  stdout/stderr are taken before parking the child. Documents the
  invariant `wait_for_login_or_cancel` already maintains and catches
  any future caller that forgets.

- Document why `read_to_end`'s Err and `JoinHandle::join`'s panic
  branch silently collapse to empty bytes. Both are rare exceptional
  conditions; partial stdio on success is informational, on failure
  still better than nothing. We lose visibility because the project
  has no structured logger; flagged as a future revisit if one lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Cmochance Cmochance merged commit 67a90b2 into main May 9, 2026
3 checks passed
@Cmochance Cmochance deleted the fix/login-cancel-pid-race branch May 9, 2026 19:05
Cmochance added a commit that referenced this pull request May 9, 2026
…#31)

Two small UX/perf fixes flagged by the local 3-agent reviews on PR
#27 and PR #29:

- Login button now sets `aria-label` reflecting its dual role: "Log
  into <profile>" when idle, "Cancel login for <profile>" while a
  login is in flight. The visible label was already updated (spinner
  + red "取消") and the `title` attribute carried the cancel hint, but
  many screen readers prefer aria-label and ignore title; this makes
  the dual semantics announced consistently across assistive tech.
  New i18n keys `profileLoginCancelAria` / `profileLoginReadyAria`
  (en + zh-CN).

- mac `suggested_codex_cli_paths` walked the entire `PATH` and
  stat'd `<entry>/codex` for every component, blocking the
  `get_codex_cli_status` Tauri command for seconds when PATH includes
  NFS / SMB / slow drives. Now bounded by a 500 ms soft deadline:
  fixed locations (Codex.app bundle, Homebrew, npm-global, etc) are
  checked first since those are guaranteed-fast local stats, then the
  bounded PATH walk fills in whatever remaining suggestions fit in
  the deadline. Worst case: dialog still opens promptly even on a
  pathologically slow PATH.

Both are pure refinements — no behavior change on the happy path.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@Cmochance Cmochance mentioned this pull request May 9, 2026
5 tasks
Cmochance added a commit that referenced this pull request May 9, 2026
Bumps package.json to 1.5.7 (and lets `version-sync.mjs` propagate
to Cargo.toml + Cargo.lock + package-lock.json) and adds the 1.5.7
CHANGELOG entry covering #26, #29, #30, #31:

- #29 critical: PID-reuse race + buffer-fill hang in login cancel
- #26 baseurl button red warning style
- #30 InstallState + codex CLI helpers refactor to shared
- #31 a11y aria-label + bounded mac PATH probe

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant