Skip to content

bug: claude-local — treat ConnectionRefused as transient + rotate session after N adapter_failed #28

Description

@AnilChinchawale

Tracks upstream paperclipai/paperclip#6731.

Problem

The transient-error regex CLAUDE_TRANSIENT_UPSTREAM_RE in packages/adapters/claude-local/src/server/parse.ts only matches upstream API failures (rate limits, 429, 529, overloaded, etc.). It does NOT match local ConnectionRefused against the Claude CLI's socket. As a consequence, a brief local interruption of the claude CLI process (OOM, restart, host reboot, package upgrade) strands the agent for ~21 hours before any recovery path fires.

Why this matters for Levi

Levi is critically dependent on the Claude CLI process running on the agi.openscan.ai host. We've already seen one Claude-related outage today (5-hour usage cap). A local CLI hiccup currently has the same blast radius as a Claude account quota event. This is the exact resilience gap that turns a small interruption into a Levi-wide outage.

Files

  • packages/adapters/claude-local/src/server/parse.ts — extend CLAUDE_TRANSIENT_UPSTREAM_RE to include ConnectionRefused and related local-socket patterns
  • Heartbeat session-rotation logic — rotate Claude session after N consecutive adapter_failed runs (default 2)

Effort

S — regex extension + a single-counter session-rotation rule. Half a day.

Risk

L — narrow regex extension. Edge case: misclassifying a real persistent local failure as transient. Mitigated by the new "rotate after N" rule which forces a fresh session before retrying indefinitely.

Levi-specific implementation note

Compose with the Kimi env-swap fallback (PR #26 / its rework). The full recovery ladder becomes:

  1. classify error as transient → in-process retry
  2. consecutive adapter_failed ≥ N → rotate Claude session, retry
  3. still failing OR explicit quota signal → engage Kimi fallback

That order uses the cheap fallback only when local recovery has actually failed.

Acceptance criteria

  • ConnectionRefused from claude CLI marked as transient_upstream
  • Session rotates after 2 (configurable) consecutive adapter_failed runs
  • Unit tests in parse.test.ts cover both the regex and rotation rule
  • End-to-end test: kill claude CLI mid-run → agent recovers within one heartbeat cycle without manual intervention

Surveyed by the Levi planning agent on 2026-05-26.

Metadata

Metadata

Labels

bugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions