Skip to content

ClaudeTalker registry + session_alive + halt on leak (closes #473)#475

Merged
FidoCanCode merged 1 commit into
mainfrom
fix-session-owner-telemetry
Apr 14, 2026
Merged

ClaudeTalker registry + session_alive + halt on leak (closes #473)#475
FidoCanCode merged 1 commit into
mainfrom
fix-session-owner-telemetry

Conversation

@FidoCanCode
Copy link
Copy Markdown
Owner

Redesigns claude-subprocess tracking per #473.

What's in

  • ClaudeTalker registry in `kennel/claude.py` — at most one thread per repo may drive a claude subprocess at any moment
  • `ClaudeSession.enter` registers a `worker`-kind talker; `_run_streaming` registers a `webhook`-kind talker when the thread-local `current_repo()` is set
  • Thread-local repo is pinned by `WorkerThread.run` + `WebhookHandler._process_action`
  • Halt on leak: concurrent registration raises `ClaudeLeakError` → `os._exit(3)`
  • `/status` gains `session_alive` (subprocess up) + `claude_talker` (structured snapshot of who's driving)
  • `session.owner` now reads under lock via the talker registry — correct under free-threaded 3.14t (noted in CLAUDE.md)
  • Drive-by: `pytest-xdist` wired with `addopts = -n auto` (37s → 16s)

What's not in (follow-ups)

Routing webhook handlers through the persistent session (so they stop using one-shot `print_prompt` entirely) is the follow-up — the leak detector will scream the moment a webhook and worker collide, which motivates that change.

Also filed #474 to kill `test_raises_on_idle_timeout`'s 10s `sleep 60` (caps xdist parallelism).

Closes #473.

Add per-repo ClaudeTalker exclusivity tracking: at most one thread may
drive a claude subprocess per repo at any moment, enforced at the point
of registration.  ClaudeSession.__enter__ registers a "worker" talker;
_run_streaming registers a "webhook" talker when current_repo() is set
via a thread-local pinned at the webhook handler and worker thread
entrypoints.  Concurrent registration for the same repo raises
ClaudeLeakError, which the worker thread and webhook dispatcher catch
and handle by os._exit(3) — a loud halt so a leaked sub-claude can't
silently proliferate.

Surface in /status: new session_alive (bool, "subprocess up") and
claude_talker (object, authoritative who-is-talking snapshot) fields
per repo.  ClaudeSession.owner now reads from the talker registry under
_talkers_lock rather than a best-effort unsynchronized attribute —
correct under free-threaded Python 3.14t with no GIL (noted in
CLAUDE.md).

pytest-xdist wired with addopts = -n auto.  Suite drops 37s → 16s.

Closes #473.
@FidoCanCode FidoCanCode marked this pull request as ready for review April 14, 2026 17:33
@FidoCanCode FidoCanCode requested a review from rhencke April 14, 2026 17:33
@FidoCanCode FidoCanCode merged commit bf513f3 into main Apr 14, 2026
2 checks passed
@FidoCanCode FidoCanCode deleted the fix-session-owner-telemetry branch April 14, 2026 17:34
FidoCanCode added a commit that referenced this pull request Apr 14, 2026
 #477) (#478)

Fixes the \"setup produced no tasks\" regression from #456.

## Problem

When the persistent \`ClaudeSession\` was introduced in #456,
\`claude_start\` / \`claude_run\` started sending only
\`fido_dir/prompt\` (context) to the session. The sub-skill instructions
— \`setup.md\`, \`task.md\`, \`ci.md\`, \`comments.md\`, \`resume.md\` —
were built into \`fido_dir/system\` by \`build_prompt\` but **never
delivered** to the session.

Claude saw only the bare \"Request: ... Repo: ... Branch: ...\" with no
instructions, produced empty output, and \`find_or_create_pr\` raised
\`setup produced no tasks\`. Observed live after #475 landed: fido stuck
on #465, watchdog kept restarting, three abandoned branches.

## Fix

The persistent session already has \`persona.md\` as its system prompt,
so prepending the full \`fido_dir/system\` (persona + sub-skill) would
duplicate the persona. Instead:

- \`build_prompt\` now writes a third file \`fido_dir/skill\` containing
just the sub-skill content (\`setup.md\`, \`task.md\`, …)
- New \`_session_turn_prompt\` helper reads \`fido_dir/skill\` +
\`fido_dir/prompt\` and joins them with \`\\n\\n---\\n\\n\`
- Both \`claude_start\` and \`claude_run\` persistent-session paths now
send \`_session_turn_prompt(fido_dir)\`

Tests updated: \`_setup_fido_dir\` fixtures now write \`skill\`, and the
\`session_path_sends_prompt_content\` tests expect the combined
\`<skill>\\n\\n---\\n\\n<prompt>\` body.

100% coverage, suite passes.

Closes #477.

---------

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 14, 2026
…reempt (closes #479) (#481)

Fulfills the 'one claude per repo' invariant from #473 / #475 by making
webhook handler prompt calls share the worker's persistent
\`ClaudeSession\` instead of spawning one-shot subprocesses.

## Steal / handoff

\`ClaudeSession.prompt(content, model=..., system_prompt=...)\` runs one
full turn on the persistent session:

1. Set the cancel event — wakes any current holder out of
\`iter_events\`.
2. Acquire the session as a context manager (blocks until previous
holder's \`__exit__\` releases the lock + unregisters its
\`ClaudeTalker\`).
3. Send stream-json \`control_request\` interrupt + drain stale events
so stdout is clean.
4. Switch model if requested, send content, return result text.

## Resume on preempt

When \`iter_events\` exits early because \`_cancel\` was set, a new
\`last_turn_cancelled\` property records it. The worker's
\`claude_start\` / \`claude_run\` now wrap the persistent-session path
in \`_run_session_turn\`, which retries preempted turns up to
\`_MAX_PREEMPT_RETRIES\` times — 'hand the session back to the worker
and ask it to resume what it was doing'. No more crashing on empty
result after a webhook steals the session.

## Auto-routing print_prompt

\`claude.print_prompt\` now checks for a session on the current thread's
repo via a new \`set_session_resolver()\` hook. When one exists and is
alive, the call routes through \`session.prompt()\` instead of spawning
a one-shot subprocess. Falls back to one-shot on \`ClaudeStreamError\` /
\`ClaudeLeakError\` (returns \`\"\"\` like the existing failure path).

## Wiring

- \`WorkerRegistry.get_session(repo_name)\` exposes the worker thread's
\`ClaudeSession\`.
- \`server._run()\` calls
\`claude.set_session_resolver(registry.get_session)\` right after
constructing the registry.

Closes #479.

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 14, 2026
Fixes #467.

pytest-xdist and `-n auto` were already added in #475 — this PR fixes
the last slow-test pattern by adding `poll_interval=0.01` to
`_make_unregistered_server`, which was blocking 0.5s per test on
shutdown. Shaves ~1s off wall time and closes out the parallel-pytest
work.

---

## Work queue

<!-- WORK_QUEUE_START -->

<details><summary>Completed (1)</summary>

- [x] Use fast poll_interval in _make_unregistered_server test helper
<!-- type:spec -->
</details>
<!-- WORK_QUEUE_END -->

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Status redesign: claude-talker tracking, tree display, halt on leak

2 participants