Skip to content

Drive a single persistent claude per worker via bidirectional stream-json #455

@FidoCanCode

Description

@FidoCanCode

Idea

Replace per-task `claude --print` subprocess spawns with one long-lived claude per worker, fed by `--input-format stream-json --output-format stream-json`. Kennel becomes a real state machine driving claude around the loop instead of a subprocess shepherd.

Why

  • Spawn cost: every task currently pays ~10s claude boot. Persistent session = paid once at worker start.
  • Context coherence: plan + implement + nudge + recovery all share the same transcript. Today each task starts fresh or resumes a stale session; cross-task reasoning is lost.
  • Model switching: send `/model claude-opus-4-6` for planning / triage / rescope, `/model claude-sonnet-4-6` for implementation — same session, no spawn cost, full context preserved.
  • Live eventing: stream-json events arrive as they happen. Worker reacts to tool calls, status updates, and completion in real time instead of parsing post-hoc dumps.
  • Smarter nudges: nudge logic moves into the same conversation that produced the stuck state — claude sees its own "I think it's done" reasoning and can act on the contradiction.

Sketch

```python
class ClaudeSession:
def init(self, system_file, work_dir):
self.proc = subprocess.Popen(
["claude", "--input-format", "stream-json",
"--output-format", "stream-json", "--verbose",
"--system-prompt-file", str(system_file)],
stdin=PIPE, stdout=PIPE, cwd=work_dir, text=True,
)

def send(self, content): ...
def switch_model(self, model): self.send_command(f"/model {model}")
def stream_events(self): ...
def stop(self): ...

class WorkerStateMachine:
def run(self):
while not self.done:
self.session.switch_model("claude-opus-4-6")
self.session.send(self.plan_prompt())
plan = self.consume_until_done()

        self.session.switch_model("claude-sonnet-4-6")
        self.session.send(self.impl_prompt(plan))
        self.consume_until_committed_or_nudge_threshold()

```

Risks

Migration

This is a substantial rewrite of `kennel/claude.py` and `kennel/worker.py`'s execute loop. Probably one PR for the persistent-session abstraction (no behavior change), then incremental PRs migrating each spawn site.

Why v1

The whole reliability batch we've been chasing (#369 backoff, #452 nudges, #267 no-commit loops, etc.) is symptomatic of the per-task-spawn model. A persistent session where claude sees its own history makes most of that go away — the session knows it's been told to commit twice already, knows the previous run claimed completion, can self-correct. This is the real fix.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions