Drive a single persistent claude per worker via bidirectional stream-json

## Idea

Replace per-task \`claude --print\` subprocess spawns with one long-lived claude per worker, fed by \`--input-format stream-json --output-format stream-json\`.  Kennel becomes a real state machine driving claude around the loop instead of a subprocess shepherd.

## Why

- **Spawn cost**: every task currently pays ~10s claude boot.  Persistent session = paid once at worker start.
- **Context coherence**: plan + implement + nudge + recovery all share the same transcript.  Today each task starts fresh or resumes a stale session; cross-task reasoning is lost.
- **Model switching**: send \`/model claude-opus-4-6\` for planning / triage / rescope, \`/model claude-sonnet-4-6\` for implementation — same session, no spawn cost, full context preserved.
- **Live eventing**: stream-json events arrive as they happen.  Worker reacts to tool calls, status updates, and completion in real time instead of parsing post-hoc dumps.
- **Smarter nudges**: nudge logic moves into the same conversation that produced the stuck state — claude sees its own \"I think it's done\" reasoning and can act on the contradiction.

## Sketch

\`\`\`python
class ClaudeSession:
    def __init__(self, system_file, work_dir):
        self.proc = subprocess.Popen(
            ["claude", "--input-format", "stream-json",
             "--output-format", "stream-json", "--verbose",
             "--system-prompt-file", str(system_file)],
            stdin=PIPE, stdout=PIPE, cwd=work_dir, text=True,
        )
    
    def send(self, content): ...
    def switch_model(self, model): self.send_command(f"/model {model}")
    def stream_events(self): ...
    def stop(self): ...

class WorkerStateMachine:
    def run(self):
        while not self.done:
            self.session.switch_model("claude-opus-4-6")
            self.session.send(self.plan_prompt())
            plan = self.consume_until_done()
            
            self.session.switch_model("claude-sonnet-4-6")
            self.session.send(self.impl_prompt(plan))
            self.consume_until_committed_or_nudge_threshold()
\`\`\`

## Risks

- Long-lived processes accumulate context — at some point we have to compact or restart.  Need a graceful boundary (per-issue? per-PR?).
- Process death = lost session.  Need supervisor + reattach (#430 covers some of this).
- Slash commands in stream-json — we use \`/model\`; need to verify the input format actually accepts other slash commands too.

## Migration

This is a substantial rewrite of \`kennel/claude.py\` and \`kennel/worker.py\`'s execute loop.  Probably one PR for the persistent-session abstraction (no behavior change), then incremental PRs migrating each spawn site.

## Why v1

The whole reliability batch we've been chasing (#369 backoff, #452 nudges, #267 no-commit loops, etc.) is *symptomatic* of the per-task-spawn model.  A persistent session where claude sees its own history makes most of that go away — the session knows it's been told to commit twice already, knows the previous run claimed completion, can self-correct.  This is the real fix.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drive a single persistent claude per worker via bidirectional stream-json #455

Idea

Why

Sketch

Risks

Migration

Why v1

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Drive a single persistent claude per worker via bidirectional stream-json #455

Description

Idea

Why

Sketch

Risks

Migration

Why v1

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions