fix(orphan): kill stale claude --resume on archive + warn on orphan windows#82

Merged
Time4Mind merged 1 commit into main from fix/orphan-claude-and-window-detection
May 16, 2026

@Time4Mind
Owner

Summary

  • Two claude --resume <id> processes on one session id ⇒ interleaved JSONL writes ⇒ broken tool_use ↔ tool_result pairing ⇒ ghost activity / lag in the live card (the /screenshot view was fine because it reads the tmux pane directly, not JSONL). Root cause: a tmux window survived mark_session_archived, and a later /new with resume on the same id then spawned a second writer.
  • Guard 1: tmux_manager.kill_orphan_claude_processes(claude_session_id) — after every archive's kill_window, pgrep for any leftover claude --resume <id> and SIGTERM what's left. UUID-validated before pgrep; self/parent PID guarded. Wired into lifecycle.archive_session (/kill, /done) and archive.idle_archive_sweep.
  • Guard 2: session_recovery.detect_orphan_windows — at startup, log a single WARNING per tmux window not bound to any Session record (excluding the reserved utility windows __main__ / ccbot-usage). Never auto-kills; surfaces the failure mode.

No existing-flow change on the happy path — kill_orphan_claude_processes is a no-op when kill_window worked.
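
To make Guard 1 concrete, here is a minimal sketch of the archive-time cleanup as the summary describes it: validate the id as a UUID, ask pgrep for matching command lines, skip the bot's own and parent PIDs, and SIGTERM whatever remains. It is not the code merged in this PR. The function names, the `pids` and `kill` parameters (added here only so the logic can be exercised without real processes), and the two-second timeout are all assumptions.

```python
import os
import re
import signal
import subprocess

# Canonical 8-4-4-4-12 hex UUID. Anything else never reaches pgrep, so the
# session id cannot smuggle regex metacharacters into the pattern.
_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def find_resume_pids(session_id, timeout=2.0):
    """Return PIDs whose command line matches `claude --resume <session_id>`."""
    if not _UUID_RE.match(session_id):
        return []
    try:
        proc = subprocess.run(
            ["pgrep", "-f", f"claude --resume {session_id}"],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # pgrep exits 1 when nothing matches; that is not an error
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Timeout fallback (assumed behaviour): treat as "nothing found".
        return []
    return [int(tok) for tok in proc.stdout.split() if tok.isdigit()]


def sketch_kill_orphan_claude_processes(session_id, pids=None, kill=os.kill):
    """SIGTERM leftover resume processes; return how many signals were sent."""
    if not _UUID_RE.match(session_id):
        return 0
    if pids is None:
        pids = find_resume_pids(session_id)
    protected = {os.getpid(), os.getppid()}  # never signal ourselves or our parent
    killed = 0
    for pid in pids:
        if pid in protected:
            continue
        try:
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            continue  # the process exited between pgrep and kill
        killed += 1
    return killed
```

When `kill_window` has already taken the child down, pgrep finds nothing and the function returns 0, which matches the no-op behaviour claimed for the happy path.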

Test plan

  • +10 unit tests in tests/ccbot/test_orphan_cleanup.py covering pgrep happy path, UUID validation, self/parent PID guard, ProcessLookupError swallowing, pgrep timeout fallback, reserved-window exclusion, lazy list_windows invocation.
  • Full suite 462/462 green.
  • ruff check / ruff format / pyright clean.
  • Live-tested on restart: the bot logged Dropping stale window_state: @94 and orphan_window window_id=@21 name=ccbot session_id=ee8ed9af-... (no Session record references that window), without disturbing any active session.

🤖 Generated with Claude Code

…indows

Reproduced: two ``claude --resume <id>`` processes on one session_id were
writing to the same JSONL concurrently. Interleaved entries broke
tool_use ↔ tool_result pairing, surfacing as ghost activity and large
lag in the live card (the screenshot view was fine because it reads the
tmux pane directly, bypassing JSONL). Root cause was a tmux window that
survived ``mark_session_archived``: its claude child outlived the SIGHUP,
either because of a race with a bot restart or because claude trapped
the signal. A later ``/new`` with resume on the same session id then
spawned a second writer.

Two additive guards, no existing-flow change on the happy path:

1. ``tmux_manager.kill_orphan_claude_processes(claude_session_id)`` —
   after every archive's ``kill_window``, ``pgrep`` for any
   ``claude --resume <id>`` and SIGTERM what's left. UUID-validated
   before reaching pgrep; self/parent PID guarded so the bot can't
   accidentally kill its own ancestor process. Returns count for tests.
   Wired into ``lifecycle.archive_session`` (/kill, /done) and
   ``archive.idle_archive_sweep`` (auto idle archive).

2. ``session_recovery.detect_orphan_windows`` — at startup, list tmux
   windows; any window not bound to a Session record (and not on the
   reserved utility list ``__main__`` / ``ccbot-usage``) logs a single
   WARNING with the window id, name, and claude session id (when
   known). Never auto-kills — surfaces the failure mode so it can be
   investigated and cleared via /list or manual tmux kill.
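
The startup scan in guard 2 can be sketched as a pure function over the window list. This is an illustration, not the merged ``session_recovery.detect_orphan_windows``: the ``Window`` dataclass, the ``bound_window_ids`` parameter, and the logger name are assumptions, and the log format only mimics the ``orphan_window ...`` line quoted in the live test.

```python
import logging
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

logger = logging.getLogger("orphan_windows_sketch")  # hypothetical logger name

# Utility windows named in the PR description; never reported as orphans.
RESERVED_WINDOW_NAMES = frozenset({"__main__", "ccbot-usage"})


@dataclass(frozen=True)
class Window:
    """Hypothetical stand-in for whatever the tmux manager returns per window."""
    window_id: str
    name: str
    session_id: Optional[str] = None  # claude session id, when known


def detect_orphan_windows_sketch(
    windows: Iterable[Window], bound_window_ids: Set[str]
) -> List[Window]:
    """Log one WARNING per window not bound to a Session record; kill nothing."""
    orphans = []
    for window in windows:
        if window.name in RESERVED_WINDOW_NAMES:
            continue
        if window.window_id in bound_window_ids:
            continue
        logger.warning(
            "orphan_window window_id=%s name=%s session_id=%s",
            window.window_id,
            window.name,
            window.session_id or "unknown",
        )
        orphans.append(window)
    return orphans
```

Returning the orphans instead of killing them keeps the check read-only, in line with the "never auto-kills" rule above; cleanup stays a manual step.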

Live-tested on restart: cleaned up the leftover ``@94`` window_state
and flagged a remaining orphan tmux window (``ccbot @21``) without
disturbing any active session.

+10 unit tests cover pgrep happy path, UUID validation, self/parent
guard, ProcessLookupError swallowing, pgrep timeout fallback, reserved-
window exclusion, lazy ``list_windows`` invocation. 462/462 suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Time4Mind Time4Mind merged commit edfd253 into main May 16, 2026
4 checks passed
@Time4Mind Time4Mind deleted the fix/orphan-claude-and-window-detection branch May 16, 2026 15:31
Time4Mind added a commit that referenced this pull request May 16, 2026
…87)

Recent PRs (#82#86) changed bot behaviour in user-facing and
operator-facing ways that the docs hadn't caught up with yet:

- CLAUDE.md
  * Core Design Constraints: add the SessionStart + UserPromptSubmit
    hook story (self-heal) and the singleton flock.
  * Configuration: add ``ccbot.lock`` to the state-files list.
  * Hook Configuration: full block with both events + the safety
    contract (zero stdout, always exit 0, fast-path skip).

- .claude/rules/architecture.md
  * State-files diagram: ``session_map.json`` now lists both hook
    events; new ``ccbot.lock`` entry.
  * Key Design Decisions: hook self-heal, singleton lock, archive-
    time orphan-claude kill, startup orphan-window detection.

- .claude/rules/dm-architecture.md
  * window_id → claude session_id: both hook events update the map.
  * "What is unchanged": session_map.json description matches.

- .claude/rules/secrets.md
  * Add ``ccbot.lock`` to the where-things-are table.

- doc/dm-multisession-spec.md
  * State persistence: list ``ccbot.lock``.
  * Section 9 Recovery: flock acquire step, orphan-window scan, and
    the archive cleanup path that SIGTERMs orphan claude processes.

- src/ccbot/i18n.py
  * /help → Tips body (en / ru / zh): two new bullets — single-
    instance lock, hook self-heal — so operators have visibility
    into the new guarantees without digging through code.
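
For readers unfamiliar with the singleton flock mentioned above, a minimal sketch of a single-instance lock using ``fcntl.flock`` follows. It assumes a POSIX system. The function name and the lock path argument are hypothetical; this commit message only names the file ``ccbot.lock`` and does not show how the bot acquires it.

```python
import fcntl


def acquire_singleton_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if already held.

    The caller must keep the returned file object alive for the lifetime of
    the process; closing it (or exiting) releases the lock.
    """
    handle = open(lock_path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the
        # current holder to exit.
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

A second instance that gets ``None`` back would typically log and exit rather than start a competing bot. Because the kernel drops the lock when the holder dies, a crashed instance does not leave a stale lock behind.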

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>