fix(orphan): kill stale claude --resume on archive + warn on orphan windows#82
Merged
Conversation
…indows

Reproduced: two ``claude --resume <id>`` processes on one session_id were writing to the same JSONL concurrently. Interleaved entries broke tool_use ↔ tool_result pairing, surfacing as ghost activity and large lag in the live card (the screenshot view was fine because it reads the tmux pane directly, bypassing JSONL).

Root cause was a tmux window that survived ``mark_session_archived``: its claude child outlived the SIGHUP, either because of a race with a bot restart or because claude trapped the signal, and then a later ``/new`` with resume on the same session id spawned a second writer.

Two additive guards, no existing-flow change on the happy path:

1. ``tmux_manager.kill_orphan_claude_processes(claude_session_id)``: after every archive's ``kill_window``, ``pgrep`` for any ``claude --resume <id>`` and SIGTERM what's left. UUID-validated before reaching pgrep; self/parent PID guarded so the bot can't accidentally kill its own ancestor process. Returns count for tests. Wired into ``lifecycle.archive_session`` (/kill, /done) and ``archive.idle_archive_sweep`` (auto idle archive).

2. ``session_recovery.detect_orphan_windows``: at startup, list tmux windows; any window not bound to a Session record (and not on the reserved utility list ``__main__`` / ``ccbot-usage``) logs a single WARNING with the window id, name, and claude session id (when known). Never auto-kills; it surfaces the failure mode so it can be investigated and cleared via /list or manual tmux kill.

Live-tested on restart: cleaned up the leftover ``@94`` window_state and flagged a remaining orphan tmux window (``ccbot @21``) without disturbing any active session.

+10 unit tests cover pgrep happy path, UUID validation, self/parent guard, ProcessLookupError swallowing, pgrep timeout fallback, reserved-window exclusion, lazy ``list_windows`` invocation. 462/462 suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
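The first guard can be sketched roughly as follows. This is a minimal illustration rather than the merged code: the injectable `find_pids` and `kill` parameters, the `_pgrep` helper, the 2-second timeout, and the exact UUID pattern are assumptions made so the sketch is self-contained and testable.

```python
import os
import re
import signal
import subprocess
from typing import Callable

# Assumed UUID shape; the PR only says the id is "UUID-validated".
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def _pgrep(pattern: str) -> list[int]:
    """Return PIDs whose full command line matches `pattern`.

    Falls back to an empty list if pgrep is missing or times out
    (hypothetical helper; the timeout value is an assumption).
    """
    try:
        proc = subprocess.run(
            ["pgrep", "-f", pattern],
            capture_output=True,
            text=True,
            timeout=2.0,
        )
    except (subprocess.TimeoutExpired, OSError):
        return []
    return [int(tok) for tok in proc.stdout.split() if tok.isdigit()]


def kill_orphan_claude_processes(
    claude_session_id: str,
    find_pids: Callable[[str], list[int]] = _pgrep,
    kill: Callable[[int, int], None] = os.kill,
) -> int:
    """SIGTERM leftover `claude --resume <id>` processes; return how many."""
    # Validate before the id is ever interpolated into a pgrep pattern.
    if not UUID_RE.fullmatch(claude_session_id or ""):
        return 0
    # Never signal the bot itself or its parent.
    protected = {os.getpid(), os.getppid()}
    killed = 0
    for pid in find_pids(f"claude --resume {claude_session_id}"):
        if pid in protected:
            continue
        try:
            kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            # The process exited between pgrep and kill; nothing to do.
            continue
        killed += 1
    return killed
```

When `kill_window` already took the child down, `find_pids` returns nothing and the function returns 0, which matches the "no-op on the happy path" claim.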
2 tasks
Time4Mind added a commit that referenced this pull request May 16, 2026
…87)

Recent PRs (#82–#86) changed bot behaviour in user-facing and operator-facing ways that the docs hadn't caught up with yet:

- CLAUDE.md
  * Core Design Constraints: add the SessionStart + UserPromptSubmit hook story (self-heal) and the singleton flock.
  * Configuration: add ``ccbot.lock`` to the state-files list.
  * Hook Configuration: full block with both events + the safety contract (zero stdout, always exit 0, fast-path skip).
- .claude/rules/architecture.md
  * State-files diagram: ``session_map.json`` now lists both hook events; new ``ccbot.lock`` entry.
  * Key Design Decisions: hook self-heal, singleton lock, archive-time orphan-claude kill, startup orphan-window detection.
- .claude/rules/dm-architecture.md
  * window_id → claude session_id: both hook events update the map.
  * "What is unchanged": session_map.json description matches.
- .claude/rules/secrets.md
  * Add ``ccbot.lock`` to the where-things-are table.
- doc/dm-multisession-spec.md
  * State persistence: list ``ccbot.lock``.
  * Section 9 Recovery: flock acquire step, orphan-window scan, and the archive cleanup path that SIGTERMs orphan claude processes.
- src/ccbot/i18n.py
  * /help → Tips body (en / ru / zh): two new bullets (single-instance lock, hook self-heal) so operators have visibility into the new guarantees without digging through code.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
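For readers unfamiliar with the singleton flock mentioned above, a single-instance lock on a file such as ``ccbot.lock`` might look like the sketch below. The function name `acquire_singleton_lock`, the choice to write the PID into the file, and the return convention are all assumptions for illustration; the bot's actual implementation is not shown in this PR.

```python
import fcntl
import os
from typing import IO, Optional


def acquire_singleton_lock(lock_path: str) -> Optional[IO[str]]:
    """Try to take an exclusive, non-blocking flock on `lock_path`.

    Returns the open file handle on success (the caller must keep it open
    for as long as the lock should be held), or None if another holder
    already owns the lock. Hypothetical helper, POSIX only.
    """
    handle = open(lock_path, "a+")
    try:
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: back off instead of waiting.
        handle.close()
        return None
    # Record the owner's PID for operators inspecting the file (assumption).
    handle.seek(0)
    handle.truncate()
    handle.write(f"{os.getpid()}\n")
    handle.flush()
    return handle
```

The lock is released automatically when the handle is closed or the process exits, so a crashed instance does not leave a stale lock behind.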
Summary
Two ``claude --resume <id>`` processes on one session id ⇒ interleaved JSONL writes ⇒ broken tool_use ↔ tool_result pairing ⇒ ghost activity / lag in the live card (the /screenshot view was fine because it reads the tmux pane directly, not JSONL). Root cause: a tmux window that survived ``mark_session_archived`` and a later ``/new`` with resume on the same id spawned a second writer.

- ``tmux_manager.kill_orphan_claude_processes(claude_session_id)``: after every archive's ``kill_window``, ``pgrep`` for any leftover ``claude --resume <id>`` and SIGTERM what's left. UUID-validated before pgrep; self/parent PID guarded. Wired into ``lifecycle.archive_session`` (/kill, /done) and ``archive.idle_archive_sweep``.
- ``session_recovery.detect_orphan_windows``: at startup, log a single WARNING per tmux window not bound to any Session record (excluding the reserved utility windows ``__main__`` / ``ccbot-usage``). Never auto-kills; surfaces the failure mode.

No existing-flow change on the happy path: ``kill_orphan_claude_processes`` is a no-op when ``kill_window`` worked.

Test plan

- ``tests/ccbot/test_orphan_cleanup.py`` covering pgrep happy path, UUID validation, self/parent PID guard, ProcessLookupError swallowing, pgrep timeout fallback, reserved-window exclusion, lazy ``list_windows`` invocation.
- ``Dropping stale window_state: @94`` + ``orphan_window window_id=@21 name=ccbot session_id=ee8ed9af-... — no Session record references it`` without disturbing any active session.

🤖 Generated with Claude Code
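The startup scan described in the summary can be illustrated with a small sketch. The `TmuxWindow` dataclass, the callable-based `list_windows` parameter, the `session_id_for` lookup, and the logger name are assumptions; only the reserved window names and the warn-but-never-kill behaviour come from the PR text, and the log wording is loosely modelled on the line quoted in the test plan.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

logger = logging.getLogger("ccbot.session_recovery")  # assumed logger name

# Utility windows named in the PR that must never be reported as orphans.
RESERVED_WINDOW_NAMES = frozenset({"__main__", "ccbot-usage"})


@dataclass(frozen=True)
class TmuxWindow:
    """Hypothetical minimal view of a tmux window."""
    window_id: str
    name: str


def detect_orphan_windows(
    list_windows: Callable[[], Iterable[TmuxWindow]],
    bound_window_ids: set[str],
    session_id_for: Callable[[str], Optional[str]] = lambda _wid: None,
) -> list[TmuxWindow]:
    """Log one WARNING per tmux window that no Session record references.

    Nothing is killed here: the function only reports, and returns the
    orphans so callers and tests can inspect them.
    """
    orphans: list[TmuxWindow] = []
    # list_windows is a callable so tmux is only queried when the scan runs.
    for win in list_windows():
        if win.name in RESERVED_WINDOW_NAMES:
            continue
        if win.window_id in bound_window_ids:
            continue
        logger.warning(
            "orphan_window window_id=%s name=%s session_id=%s "
            "(no Session record references it)",
            win.window_id,
            win.name,
            session_id_for(win.window_id) or "unknown",
        )
        orphans.append(win)
    return orphans
```

Keeping the scan read-only means a false positive costs one log line rather than a killed session; cleanup stays a manual step via /list or `tmux kill-window`.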