Tracks upstream paperclipai/paperclip#6731.
Problem
The transient-error regex CLAUDE_TRANSIENT_UPSTREAM_RE in packages/adapters/claude-local/src/server/parse.ts only matches upstream API failures (rate limits, 429, 529, overloaded, etc.). It does NOT match local ConnectionRefused against the Claude CLI's socket. As a consequence, a brief local interruption of the claude CLI process (OOM, restart, host reboot, package upgrade) strands the agent for ~21 hours before any recovery path fires.
Why this matters for Levi
Levi is critically dependent on the Claude CLI process running on the agi.openscan.ai host. We've already seen one Claude-related outage today (5-hour usage cap). A local CLI hiccup currently has the same blast radius as a Claude account quota event. This is the exact resilience gap that turns a small interruption into a Levi-wide outage.
Files
packages/adapters/claude-local/src/server/parse.ts — extend CLAUDE_TRANSIENT_UPSTREAM_RE to include ConnectionRefused and related local-socket patterns
- Heartbeat session-rotation logic — rotate Claude session after N consecutive
adapter_failed runs (default 2)
Effort
S — regex extension + a single-counter session-rotation rule. Half a day.
Risk
L — narrow regex extension. Edge case: misclassifying a real persistent local failure as transient. Mitigated by the new "rotate after N" rule which forces a fresh session before retrying indefinitely.
Levi-specific implementation note
Compose with the Kimi env-swap fallback (PR #26 / its rework). The full recovery ladder becomes:
- classify error as transient → in-process retry
- consecutive
adapter_failed ≥ N → rotate Claude session, retry
- still failing OR explicit quota signal → engage Kimi fallback
That order uses the cheap fallback only when local recovery has actually failed.
Acceptance criteria
Surveyed by the Levi planning agent on 2026-05-26.
Tracks upstream paperclipai/paperclip#6731.
Problem
The transient-error regex
CLAUDE_TRANSIENT_UPSTREAM_REinpackages/adapters/claude-local/src/server/parse.tsonly matches upstream API failures (rate limits, 429, 529, overloaded, etc.). It does NOT match localConnectionRefusedagainst the Claude CLI's socket. As a consequence, a brief local interruption of the claude CLI process (OOM, restart, host reboot, package upgrade) strands the agent for ~21 hours before any recovery path fires.Why this matters for Levi
Levi is critically dependent on the Claude CLI process running on the agi.openscan.ai host. We've already seen one Claude-related outage today (5-hour usage cap). A local CLI hiccup currently has the same blast radius as a Claude account quota event. This is the exact resilience gap that turns a small interruption into a Levi-wide outage.
Files
packages/adapters/claude-local/src/server/parse.ts— extendCLAUDE_TRANSIENT_UPSTREAM_REto includeConnectionRefusedand related local-socket patternsadapter_failedruns (default 2)Effort
S — regex extension + a single-counter session-rotation rule. Half a day.
Risk
L — narrow regex extension. Edge case: misclassifying a real persistent local failure as transient. Mitigated by the new "rotate after N" rule which forces a fresh session before retrying indefinitely.
Levi-specific implementation note
Compose with the Kimi env-swap fallback (PR #26 / its rework). The full recovery ladder becomes:
adapter_failed≥ N → rotate Claude session, retryThat order uses the cheap fallback only when local recovery has actually failed.
Acceptance criteria
ConnectionRefusedfrom claude CLI marked astransient_upstreamadapter_failedrunsparse.test.tscover both the regex and rotation ruleSurveyed by the Levi planning agent on 2026-05-26.