fix(health): exponential backoff + max-restart cap on crashing roles (#11)#21
Merged
fix(health): exponential backoff + max-restart cap on crashing roles (#11)#21
Conversation
Reconciler restarted failed sessions immediately with no backoff, and
Session.RestartCount lived on the session itself — it reset every
rebuild, so the cap was silently ineffective. When forestage/claude/
any runtime goes into a crash loop the daemon would spin pane IDs
upward forever, burn CPU, and drown the log ring in restart chatter.
Fix — track crash-loop state per role, not per session:
- New Controller.roleHealth map (workspace/team/role -> RoleHealth).
Restart counter and backoff deadline survive delete+recreate.
- Exponential backoff: 30s, 60s, 120s, 240s, 5m (capped).
- First restart is immediate (k8s-style); subsequent wait.
- Role.MaxRestarts (new manifest field): 0 = unlimited, N =
transition to SessionFailed once exceeded.
- New SessionCrashLoopBackOff state — visible via `marvel get
sessions` for sibling replicas caught inside the backoff window.
- reconcileRole respects the role's BackoffUntil and skips spawning
replacements while cooling down, so the backoff isn't defeated
by the reconciler rushing in to respawn what we just deleted.
- RoleHealthSnapshot() exposes state for tests + future `marvel
describe team` surfacing.
Clock injected via Controller.now so backoff behavior is
deterministically testable without sleeping 5 minutes.
Tests:
- TestHealthRestartAlways updated — first restart still immediate
but verifies backoff then respawn-after-elapse.
- TestHealthRestartBackoffHoldsReplacement — reconciler holds while
backoff active.
- TestHealthRestartBackoffSiblingMarked — 2-replica role: sibling
that fails inside the cooling window is marked
SessionCrashLoopBackOff and kept alive.
- TestHealthRestartMaxReached — MaxRestarts=2 caps restarts at 2,
third failure transitions to SessionFailed.
- TestComputeBackoff — locks in the exponential curve shape.
Refs: #11
7 tasks
This was referenced Apr 19, 2026
arcaven
added a commit
that referenced
this pull request
Apr 19, 2026
Both fields exist on api.Role with toml tags, the team controller honors MaxRestarts (the crash-loop saturation cap from PR #21), and the forestage adapter honors DangerousPermissions (PR #20). But ManifestRole in internal/api/manifest.go never declared either field, so yaml.v3 and BurntSushi/toml both silently dropped them during parse, and Apply() never copied them onto the constructed Role. Net effect: - #28 — `max_restarts: 3` in a YAML manifest produced `Role.MaxRestarts = 0`, which means "unlimited" per the contract in types.go. The cap was permanently disabled on manifest-driven teams. - #43 — `dangerous_permissions: true` produced `Role.DangerousPermissions = false`, so the forestage adapter never appended `--dangerously-skip-permissions`. Autonomous fleet-agent teams couldn't ship the flag from a manifest. Adds both fields to ManifestRole with matching `toml` + `yaml` struct tags, wires Apply() to copy them into Role, and adds regression tests covering the round-trip in both manifest formats plus a defaults test that pins the zero-valued behavior. Refs #28 Refs #43
4 tasks
arcaven
added a commit
that referenced
this pull request
Apr 19, 2026
…47) Both fields exist on api.Role with toml tags, the team controller honors MaxRestarts (the crash-loop saturation cap from PR #21), and the forestage adapter honors DangerousPermissions (PR #20). But ManifestRole in internal/api/manifest.go never declared either field, so yaml.v3 and BurntSushi/toml both silently dropped them during parse, and Apply() never copied them onto the constructed Role. Net effect: - #28 — `max_restarts: 3` in a YAML manifest produced `Role.MaxRestarts = 0`, which means "unlimited" per the contract in types.go. The cap was permanently disabled on manifest-driven teams. - #43 — `dangerous_permissions: true` produced `Role.DangerousPermissions = false`, so the forestage adapter never appended `--dangerously-skip-permissions`. Autonomous fleet-agent teams couldn't ship the flag from a manifest. Adds both fields to ManifestRole with matching `toml` + `yaml` struct tags, wires Apply() to copy them into Role, and adds regression tests covering the round-trip in both manifest formats plus a defaults test that pins the zero-valued behavior. Refs #28 Refs #43
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the crash-loop footgun from #11. Reconciler used to restart failed sessions immediately with no backoff, and the restart counter lived on the session itself — so it reset every rebuild and the cap was silently ineffective. When any runtime goes into a crash loop, the daemon would spin pane IDs upward forever, burn CPU, and drown the log ring.
Design
Crash-loop state moves off the session and onto the role:
RoleHealthon the controller (keyedworkspace/team/role). Restart count + backoff deadline survive delete+recreate.Role.MaxRestartsmanifest field.0= unlimited,N= transition toSessionFailedonce exceeded.SessionCrashLoopBackOffstate — visible inmarvel get sessionsfor sibling replicas that fail inside the same cooling window.reconcileRolerespects the role'sBackoffUntiland skips spawning replacements while cooling down. Without this the reconciler would immediately recreate what we just deleted and defeat the whole backoff.Controller.nowinjection point for deterministic tests.Controller.RoleHealthSnapshot()exposes state for tests today,marvel describe teamtomorrow.Test plan
just lint— cleanjust test-race— all packages greenRefs: #11