Idea
Replace per-task `claude --print` subprocess spawns with one long-lived claude per worker, fed by `--input-format stream-json --output-format stream-json`. Kennel becomes a real state machine driving claude around the loop instead of a subprocess shepherd.
Why
- Spawn cost: every task currently pays ~10s claude boot. Persistent session = paid once at worker start.
- Context coherence: plan + implement + nudge + recovery all share the same transcript. Today each task starts fresh or resumes a stale session; cross-task reasoning is lost.
- Model switching: send `/model claude-opus-4-6` for planning / triage / rescope, `/model claude-sonnet-4-6` for implementation — same session, no spawn cost, full context preserved.
- Live eventing: stream-json events arrive as they happen. Worker reacts to tool calls, status updates, and completion in real time instead of parsing post-hoc dumps.
- Smarter nudges: nudge logic moves into the same conversation that produced the stuck state — claude sees its own "I think it's done" reasoning and can act on the contradiction.
Sketch
```python
class ClaudeSession:
def init(self, system_file, work_dir):
self.proc = subprocess.Popen(
["claude", "--input-format", "stream-json",
"--output-format", "stream-json", "--verbose",
"--system-prompt-file", str(system_file)],
stdin=PIPE, stdout=PIPE, cwd=work_dir, text=True,
)
def send(self, content): ...
def switch_model(self, model): self.send_command(f"/model {model}")
def stream_events(self): ...
def stop(self): ...
class WorkerStateMachine:
def run(self):
while not self.done:
self.session.switch_model("claude-opus-4-6")
self.session.send(self.plan_prompt())
plan = self.consume_until_done()
self.session.switch_model("claude-sonnet-4-6")
self.session.send(self.impl_prompt(plan))
self.consume_until_committed_or_nudge_threshold()
```
Risks
Migration
This is a substantial rewrite of `kennel/claude.py` and `kennel/worker.py`'s execute loop. Probably one PR for the persistent-session abstraction (no behavior change), then incremental PRs migrating each spawn site.
Why v1
The whole reliability batch we've been chasing (#369 backoff, #452 nudges, #267 no-commit loops, etc.) is symptomatic of the per-task-spawn model. A persistent session where claude sees its own history makes most of that go away — the session knows it's been told to commit twice already, knows the previous run claimed completion, can self-correct. This is the real fix.
Idea
Replace per-task `claude --print` subprocess spawns with one long-lived claude per worker, fed by `--input-format stream-json --output-format stream-json`. Kennel becomes a real state machine driving claude around the loop instead of a subprocess shepherd.
Why
Sketch
```python
class ClaudeSession:
def init(self, system_file, work_dir):
self.proc = subprocess.Popen(
["claude", "--input-format", "stream-json",
"--output-format", "stream-json", "--verbose",
"--system-prompt-file", str(system_file)],
stdin=PIPE, stdout=PIPE, cwd=work_dir, text=True,
)
class WorkerStateMachine:
def run(self):
while not self.done:
self.session.switch_model("claude-opus-4-6")
self.session.send(self.plan_prompt())
plan = self.consume_until_done()
```
Risks
Migration
This is a substantial rewrite of `kennel/claude.py` and `kennel/worker.py`'s execute loop. Probably one PR for the persistent-session abstraction (no behavior change), then incremental PRs migrating each spawn site.
Why v1
The whole reliability batch we've been chasing (#369 backoff, #452 nudges, #267 no-commit loops, etc.) is symptomatic of the per-task-spawn model. A persistent session where claude sees its own history makes most of that go away — the session knows it's been told to commit twice already, knows the previous run claimed completion, can self-correct. This is the real fix.