fix(health): exponential backoff + max-restart cap on crashing roles (#11) #21

Merged

arcaven merged 1 commit into main from fix/crashloop-backoff on Apr 18, 2026

Conversation

arcaven (Collaborator) commented on Apr 18, 2026

Summary

Closes the crash-loop footgun from #11. Reconciler used to restart failed sessions immediately with no backoff, and the restart counter lived on the session itself — so it reset every rebuild and the cap was silently ineffective. When any runtime goes into a crash loop, the daemon would spin pane IDs upward forever, burn CPU, and drown the log ring.

Design

Crash-loop state moves off the session and onto the role:

  • New per-role RoleHealth on the controller (keyed workspace/team/role). Restart count + backoff deadline survive delete+recreate.
  • Exponential backoff: 30s → 60s → 120s → 240s → 5m (capped). The first restart is immediate (k8s-style); subsequent restarts wait (see the sketch after this list).
  • New Role.MaxRestarts manifest field. 0 = unlimited, N = transition to SessionFailed once exceeded.
  • New SessionCrashLoopBackOff state — visible in `marvel get sessions` for sibling replicas that fail inside the same cooling window.
  • reconcileRole respects the role's BackoffUntil and skips spawning replacements while cooling down. Without this the reconciler would immediately recreate what we just deleted and defeat the whole backoff.
  • Controller.now injection point for deterministic tests.
  • Controller.RoleHealthSnapshot() exposes state for tests today, marvel describe team tomorrow.
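
A minimal sketch of the shape this implies, in Go. RoleHealth, BackoffUntil, and MaxRestarts are the PR's names; the key struct and the exact arithmetic are illustrative guesses:

```go
package controller

import "time"

// roleKey identifies a role independently of any session, so health
// state survives session delete+recreate.
type roleKey struct {
	Workspace, Team, Role string
}

// RoleHealth carries the crash-loop state that used to live on the
// session itself.
type RoleHealth struct {
	Restarts     int       // failures observed so far
	BackoffUntil time.Time // zero value means no cooldown active
}

// computeBackoff returns how long to wait before restart n (1-based).
// Restart 1 is immediate; after that the wait doubles from 30s and
// saturates at 5 minutes: 30s, 60s, 120s, 240s, 300s, 300s, ...
func computeBackoff(n int) time.Duration {
	if n <= 1 {
		return 0 // k8s-style: the first restart is free
	}
	shift := n - 2
	if shift > 4 {
		return 5 * time.Minute // cap reached; avoid shifting further
	}
	d := 30 * time.Second << shift
	if d > 5*time.Minute {
		d = 5 * time.Minute
	}
	return d
}
```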

Test plan

  • `just lint` — clean
  • `just test-race` — all packages green
  • New tests cover: first-restart-immediate + backoff-then-respawn, reconciler-holds-during-backoff, sibling-replica-marked-CrashLoopBackOff, max-restarts-cap-reached, and backoff-curve shape (the curve test is sketched after this list).
  • Validate on Skippy's Pi 5 once alpha lands — feed a crashing role and observe exponential cadence + cap in the daemon log ring.
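
For illustration, the curve test could pin the ladder like this. A hedged sketch: only the curve itself comes from this PR, while computeBackoff's name and signature are assumptions:

```go
package controller

import (
	"testing"
	"time"
)

// TestComputeBackoff locks in the curve: immediate first restart,
// then 30s doubling up to a 5-minute ceiling.
func TestComputeBackoff(t *testing.T) {
	want := []time.Duration{
		0, 30 * time.Second, time.Minute, 2 * time.Minute,
		4 * time.Minute, 5 * time.Minute, 5 * time.Minute,
	}
	for i, w := range want {
		if got := computeBackoff(i + 1); got != w {
			t.Fatalf("restart %d: got %v, want %v", i+1, got, w)
		}
	}
}
```

The backoff-hold tests would lean on the Controller.now injection point the same way: point now at a variable, fail the role, assert no respawn, then advance the variable past BackoffUntil and assert the replacement appears, with no real 5-minute sleeps.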

Refs: #11

Reconciler restarted failed sessions immediately with no backoff, and
Session.RestartCount lived on the session itself — it reset every
rebuild, so the cap was silently ineffective. When forestage/claude/
any runtime goes into a crash loop the daemon would spin pane IDs
upward forever, burn CPU, and drown the log ring in restart chatter.

Fix — track crash-loop state per role, not per session:

  - New Controller.roleHealth map (workspace/team/role -> RoleHealth).
    Restart counter and backoff deadline survive delete+recreate.
  - Exponential backoff: 30s, 60s, 120s, 240s, 5m (capped).
  - First restart is immediate (k8s-style); subsequent wait.
  - Role.MaxRestarts (new manifest field): 0 = unlimited, N =
    transition to SessionFailed once exceeded.
  - New SessionCrashLoopBackOff state — visible via `marvel get
    sessions` for sibling replicas caught inside the backoff window.
  - reconcileRole respects the role's BackoffUntil and skips spawning
    replacements while cooling down, so the backoff isn't defeated
    by the reconciler rushing in to respawn what we just deleted.
  - RoleHealthSnapshot() exposes state for tests + future `marvel
    describe team` surfacing.

Clock injected via Controller.now so backoff behavior is
deterministically testable without sleeping 5 minutes.

Tests:
  - TestHealthRestartAlways updated — first restart still immediate
    but verifies backoff then respawn-after-elapse.
  - TestHealthRestartBackoffHoldsReplacement — reconciler holds while
    backoff active.
  - TestHealthRestartBackoffSiblingMarked — 2-replica role: sibling
    that fails inside the cooling window is marked
    SessionCrashLoopBackOff and kept alive.
  - TestHealthRestartMaxReached — MaxRestarts=2 caps restarts at 2,
    third failure transitions to SessionFailed.
  - TestComputeBackoff — locks in the exponential curve shape.

Refs: #11
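
To make the reconciler interaction concrete, here is an illustrative (not verbatim) shape of the guard described above. RoleHealth, BackoffUntil, and SessionCrashLoopBackOff are the PR's names; failedSessionsFor and spawnMissingReplicas are hypothetical helpers:

```go
func (c *Controller) reconcileRole(key roleKey) {
	h := c.roleHealth[key] // may be nil if the role has never failed
	if h != nil && c.now().Before(h.BackoffUntil) {
		// Cooling down: mark replicas that fail inside the window
		// instead of recycling them, and spawn nothing new. Without
		// this early return the reconciler would immediately recreate
		// whatever the health check just deleted.
		for _, s := range c.failedSessionsFor(key) { // hypothetical helper
			s.State = SessionCrashLoopBackOff
		}
		return
	}
	// Normal path: any backoff has elapsed, so replacements for
	// missing replicas may be spawned again.
	c.spawnMissingReplicas(key) // hypothetical helper
}
```
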
arcaven merged commit 81a00b2 into main on Apr 18, 2026
7 checks passed
arcaven added the type.feature (Net-new capability), agent.worker (PR created by a Claude Code worker agent), and area.controller (Reconciler) labels on Apr 19, 2026
arcaven added a commit that referenced this pull request Apr 19, 2026
Both fields exist on api.Role with toml tags, the team controller
honors MaxRestarts (the crash-loop saturation cap from PR #21), and
the forestage adapter honors DangerousPermissions (PR #20). But
ManifestRole in internal/api/manifest.go never declared either
field, so yaml.v3 and BurntSushi/toml both silently dropped them
during parse, and Apply() never copied them onto the constructed
Role. Net effect:

- #28 — `max_restarts: 3` in a YAML manifest
  produced `Role.MaxRestarts = 0`, which means "unlimited" per the
  contract in types.go. The cap was permanently disabled on
  manifest-driven teams.
- #43 — `dangerous_permissions: true` produced
  `Role.DangerousPermissions = false`, so the forestage adapter
  never appended `--dangerously-skip-permissions`. Autonomous
  fleet-agent teams couldn't ship the flag from a manifest.

Adds both fields to ManifestRole with matching `toml` + `yaml`
struct tags, wires Apply() to copy them into Role, and adds
regression tests covering the round-trip in both manifest formats
plus a defaults test that pins the zero-valued behavior.

Refs #28
Refs #43
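
For reference, the fix plausibly reduces to a few lines in internal/api/manifest.go. The field names and tags below come straight from the commit message; Apply()'s exact signature is an assumption:

```go
type ManifestRole struct {
	// ...existing manifest fields...
	MaxRestarts          int  `toml:"max_restarts" yaml:"max_restarts"`
	DangerousPermissions bool `toml:"dangerous_permissions" yaml:"dangerous_permissions"`
}

func (m ManifestRole) Apply(role *Role) {
	// ...existing field copies...
	role.MaxRestarts = m.MaxRestarts                   // zero still means "unlimited"
	role.DangerousPermissions = m.DangerousPermissions // defaults to false
}
```

Without struct tags, neither yaml.v3 nor BurntSushi/toml binds the snake_case manifest keys to these CamelCase fields, so declaring the tags is what stops the silent drop; copying in Apply() is what lands the values on the Role the controller actually reads.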
arcaven added a commit that referenced this pull request Apr 19, 2026
…47)

(Commit message identical to the one above.)