fix(team): clear RoleHealth on workspace/team cascade delete by arcaven · Pull Request #33 · ArcavenAE/marvel

arcaven · 2026-04-19T03:42:46Z

Summary

Fixes the stall Skippy reproduced in #29: after marvel delete workspace <name> followed by re-applying the same manifest, the reconciler never spawns sessions.

team.Controller.roleHealth was keyed by workspace/team/role — three name components the operator is free to reuse after a delete. Accumulated RestartCount + BackoffUntil survived the cascade and poisoned the fresh generation.
If the prior generation hit MaxRestarts saturation, BackoffUntil = saturationFreezeUntil (far future). The reconciler gate at controller.go:230 refused spawns forever.
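The failure mode can be sketched in a few lines. This is a simplified model, not the actual controller code: the roleHealth struct, healthKey, canSpawn, and simulateReapply are hypothetical stand-ins, and the 100-year freeze merely stands in for saturationFreezeUntil.

```go
package main

import (
	"fmt"
	"time"
)

// roleHealth is a hypothetical stand-in for the controller's per-role record.
type roleHealth struct {
	RestartCount int
	BackoffUntil time.Time
}

// healthKey builds the name-only key described above (workspace/team/role).
func healthKey(workspace, team, role string) string {
	return workspace + "/" + team + "/" + role
}

// canSpawn mimics the reconciler gate: spawning is refused while
// BackoffUntil is still in the future. Unknown keys are always allowed.
func canSpawn(health map[string]roleHealth, key string, now time.Time) bool {
	h, ok := health[key]
	return !ok || !now.Before(h.BackoffUntil)
}

// simulateReapply walks through saturate -> delete -> re-apply and reports
// whether the fresh generation is allowed to spawn.
func simulateReapply(clearOnDelete bool) bool {
	now := time.Now()
	health := map[string]roleHealth{}
	key := healthKey("ws1", "teamA", "worker")

	// Generation 1 crash-loops to saturation; backoff is frozen far in the future.
	health[key] = roleHealth{RestartCount: 5, BackoffUntil: now.AddDate(100, 0, 0)}

	// The operator deletes the workspace. The in-memory entry only goes away
	// if the delete path clears it explicitly.
	if clearOnDelete {
		delete(health, key)
	}

	// The operator re-applies the same manifest: same names, therefore same key.
	return canSpawn(health, healthKey("ws1", "teamA", "worker"), now)
}

func main() {
	fmt.Println("spawn allowed without clear:", simulateReapply(false)) // false
	fmt.Println("spawn allowed with clear:", simulateReapply(true))     // true
}
```

Because the key carries names only, nothing distinguishes the re-applied role from its deleted predecessor unless the entry is removed on delete.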

Changes

team.Controller.ClearRoleHealthForTeam(workspace, team) — mutex-guarded prefix-match deletion. The prefix ends with a trailing / so sibling names that share a prefix (teamA vs teamAA) are unaffected.
team.Controller.ClearRoleHealthForWorkspace(workspace) — same contract at workspace scope.
daemon.handleDelete wires both: workspace and team branches each call the appropriate clear after the corresponding Store.Delete* succeeds.
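A minimal sketch of the prefix-match contract described above, assuming the map is keyed by plain strings. The lowercase controller type, the int value type, and the clearPrefix and has helpers are illustrative assumptions; the real implementation lives in the diff.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// controller is a simplified stand-in for team.Controller.
// The real map values hold restart/backoff state; an int is enough here.
type controller struct {
	mu         sync.Mutex
	roleHealth map[string]int
}

// ClearRoleHealthForTeam drops every entry under workspace/team/.
func (c *controller) ClearRoleHealthForTeam(workspace, team string) {
	c.clearPrefix(workspace + "/" + team + "/")
}

// ClearRoleHealthForWorkspace drops every entry under workspace/.
func (c *controller) ClearRoleHealthForWorkspace(workspace string) {
	c.clearPrefix(workspace + "/")
}

// clearPrefix (hypothetical helper) deletes all keys starting with prefix.
// The trailing slash in the prefix is what keeps "teamAA" safe when
// clearing "teamA". Deleting during range is permitted in Go.
func (c *controller) clearPrefix(prefix string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for key := range c.roleHealth {
		if strings.HasPrefix(key, prefix) {
			delete(c.roleHealth, key)
		}
	}
}

// has (hypothetical helper) reports whether a key is still tracked.
func (c *controller) has(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.roleHealth[key]
	return ok
}

func main() {
	c := &controller{roleHealth: map[string]int{
		"ws1/teamA/worker":     3,
		"ws1/teamAA/worker":    1,
		"ws1-alt/teamA/worker": 2,
	}}
	c.ClearRoleHealthForTeam("ws1", "teamA")
	fmt.Println(c.has("ws1/teamA/worker"))     // false
	fmt.Println(c.has("ws1/teamAA/worker"))    // true
	fmt.Println(c.has("ws1-alt/teamA/worker")) // true
}
```

Matching on workspace + "/" + team without the trailing slash would also match ws1/teamAA/worker, which is the boundary case the tests pin down.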

Tests

TestClearRoleHealthForTeam — populates entries under two workspaces and three teams with one prefix-boundary pair (teamA vs teamAA); asserts only the target team is cleared.
TestClearRoleHealthForWorkspace — same shape at workspace scope (ws1 vs ws1-alt).
TestClearRoleHealthForTeamEmpty — no-op on empty map.
go test -race ./... green; golangci-lint run ./... 0 issues.

No daemon-level regression test. The two new call sites are trivially auditable in the diff, and the controller-level tests cover the underlying contract. Adding a daemon-level test would have required either a 30s+ real crash-loop or exporting an internal map seeder — both worse than the current coverage.

Not addressed here

Scale 0→1 on a backoff-poisoned role (Skippy's follow-up comment, partially corrected by him). Skippy's post-test timing confirmed the reconciler did spawn, just after the full 5-minute restartBackoffMax wait. Orthogonal to this fix — if we want scale-to-1 to also reset the counter, that's a separate design call.
Generation-scoped keys (workspace/team/role@gen). Cleaner long-term but threading generation into every lookup has cost, and the cascade-clear approach covers this specific bug without the churn.
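For comparison, the generation-scoped alternative might look roughly like the sketch below. The generationKey helper and the integer generation counter are assumptions made for illustration; this PR does not specify where a generation value would come from or how it would be represented.

```go
package main

import "fmt"

// generationKey (hypothetical) appends a generation to the name-based key,
// following the workspace/team/role@gen shape mentioned above.
func generationKey(workspace, team, role string, gen int64) string {
	return fmt.Sprintf("%s/%s/%s@%d", workspace, team, role, gen)
}

func main() {
	before := generationKey("ws1", "teamA", "worker", 1)
	after := generationKey("ws1", "teamA", "worker", 2)

	// Same names, different generation: the re-applied role gets a fresh key,
	// so stale health entries can never be looked up by accident.
	fmt.Println(before)          // ws1/teamA/worker@1
	fmt.Println(after)           // ws1/teamA/worker@2
	fmt.Println(before == after) // false
}
```

The trade-off noted above still applies: every lookup site would need the current generation in hand, whereas the cascade clear touches only the delete path.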

Refs #29

Before this change, team.Controller.roleHealth survived daemon.handleDelete's workspace and team cascade paths. The map is keyed by workspace/team/role: three name components the operator is free to reuse after a delete. When a fresh manifest re-used the same names, accumulated RestartCount and BackoffUntil carried over. If the prior generation had hit MaxRestarts saturation, BackoffUntil was frozen to saturationFreezeUntil (far future) and the reconciler gate at controller.go:230 refused spawns forever, producing the silent stall Skippy reproduced in #29.

Adds:
- team.Controller.ClearRoleHealthForTeam(workspace, team)
- team.Controller.ClearRoleHealthForWorkspace(workspace)

Both take c.mu, iterate roleHealth, and delete entries whose keys match the workspace-or-team prefix with a trailing `/` so that sibling names sharing a prefix (teamA vs teamAA, ws1 vs ws1-alt) are not affected.

Wires both helpers into daemon.handleDelete's workspace and team branches, after the corresponding Store.Delete* call so the cascade happens in the same direction everywhere: drop the records, then drop the crash-loop history.

Tests cover prefix-boundary correctness, sibling-survival, and the empty-map no-op case. No daemon-level regression test: the two call sites are trivially auditable in the diff and the unit tests cover the underlying contract.

Refs #29

arcaven mentioned this pull request Apr 19, 2026

Reconciler does not respawn sessions after workspace delete + re-apply #29

Closed

arcaven merged commit 20d90ac into main Apr 19, 2026
7 checks passed

arcaven mentioned this pull request Apr 19, 2026

test: MaxRestarts saturation freezes role permanently until delete+re-apply #37

Open

arcaven added the type.bug (Broken behavior: something doesn't work as designed), agent.worker (PR created by a Claude Code worker agent), and area.controller (Reconciler) labels Apr 19, 2026

arcaven merged 1 commit into main from fix/rolehealth-cleared-on-cascade-delete
