fix(team): clear RoleHealth on workspace/team cascade delete #33

Merged
arcaven merged 1 commit into main from fix/rolehealth-cleared-on-cascade-delete
Apr 19, 2026

Conversation

Collaborator

@arcaven arcaven commented Apr 19, 2026

Summary

Fixes the stall Skippy reproduced in #29: after running marvel delete workspace <name> and then re-applying the same manifest, the reconciler never spawns sessions.

  • team.Controller.roleHealth was keyed by workspace/team/role — three name components the operator is free to reuse after a delete. Accumulated RestartCount + BackoffUntil survived the cascade and poisoned the fresh generation.
  • If the prior generation hit MaxRestarts saturation, BackoffUntil = saturationFreezeUntil (far future). The reconciler gate at controller.go:230 refused spawns forever.

Changes

  • team.Controller.ClearRoleHealthForTeam(workspace, team) — mutex-guarded, prefix-match deletion, trailing / so sibling names that share a prefix (teamA vs teamAA) are unaffected.
  • team.Controller.ClearRoleHealthForWorkspace(workspace) — same contract at workspace scope.
  • daemon.handleDelete wires both: workspace and team branches each call the appropriate clear after the corresponding Store.Delete* succeeds.
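A minimal sketch of the prefix-match contract described above, assuming the map holds string keys of the form workspace/team/role. The shared `clearPrefix` helper is an assumption made for brevity; the PR does not say whether the two methods share code, and the real `Controller` has more fields than shown.

```go
package main

import (
	"strings"
	"sync"
)

// RoleHealth is a placeholder; only the map keys matter for this sketch.
type RoleHealth struct {
	RestartCount int
}

// Controller holds only the mutex and the map discussed in this PR.
type Controller struct {
	mu         sync.Mutex
	roleHealth map[string]*RoleHealth
}

// clearPrefix deletes every entry whose key starts with prefix while
// holding the mutex. (Hypothetical helper, not confirmed by the PR.)
func (c *Controller) clearPrefix(prefix string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for key := range c.roleHealth {
		if strings.HasPrefix(key, prefix) {
			delete(c.roleHealth, key) // deleting during range is safe in Go
		}
	}
}

// ClearRoleHealthForTeam removes entries under workspace/team/.
// The trailing slash keeps teamA from matching teamAA.
func (c *Controller) ClearRoleHealthForTeam(workspace, team string) {
	c.clearPrefix(workspace + "/" + team + "/")
}

// ClearRoleHealthForWorkspace removes entries under workspace/.
// The trailing slash keeps ws1 from matching ws1-alt.
func (c *Controller) ClearRoleHealthForWorkspace(workspace string) {
	c.clearPrefix(workspace + "/")
}
```

Without the trailing slash, the prefix `ws1/teamA` would also match `ws1/teamAA/worker`, which is the boundary case the tests exercise.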

Tests

  • TestClearRoleHealthForTeam — populates entries under two workspaces and three teams with one prefix-boundary pair (teamA vs teamAA); asserts only the target team is cleared.
  • TestClearRoleHealthForWorkspace — same shape at workspace scope (ws1 vs ws1-alt).
  • TestClearRoleHealthForTeamEmpty — no-op on empty map.
  • go test -race ./... green; golangci-lint run ./... 0 issues.

No daemon-level regression test. The two new call sites are trivially auditable in the diff, and the controller-level tests cover the underlying contract. Adding a daemon-level test would have required either a 30s+ real crash-loop or exporting an internal map seeder — both worse than the current coverage.

Not addressed here

  • Scale 0→1 on a backoff-poisoned role (Skippy's follow-up comment, partially corrected by him). Skippy's post-test timing confirmed the reconciler did spawn, just after the full 5-minute restartBackoffMax wait. Orthogonal to this fix — if we want scale-to-1 to also reset the counter, that's a separate design call.
  • Generation-scoped keys (workspace/team/role@gen). Cleaner long-term but threading generation into every lookup has cost, and the cascade-clear approach covers this specific bug without the churn.
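For comparison, the generation-scoped alternative could look roughly like the sketch below. `genKey` and the integer `gen` parameter are hypothetical; this PR does not define how a generation would be represented or threaded through lookups.

```go
package main

import "fmt"

// genKey is a hypothetical key builder for the workspace/team/role@gen
// scheme. gen is assumed to be a counter that changes each time the
// workspace or team is re-created.
func genKey(workspace, team, role string, gen int64) string {
	return fmt.Sprintf("%s/%s/%s@%d", workspace, team, role, gen)
}
```

Under this scheme a re-created team would look up a different key than its predecessor, so stale entries could not block it, but every call site would need the current generation in hand.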

Refs #29

Before this change, team.Controller.roleHealth survived
daemon.handleDelete's workspace and team cascade paths. The map is
keyed by workspace/team/role — three name components the operator is
free to reuse after a delete. When a fresh manifest re-used the
same names, accumulated RestartCount and BackoffUntil carried over.
If the prior generation had hit MaxRestarts saturation, BackoffUntil
was frozen to saturationFreezeUntil (far future) and the reconciler
gate at controller.go:230 refused spawns forever — producing the
silent stall Skippy reproduced in #29.

Adds:
- team.Controller.ClearRoleHealthForTeam(workspace, team)
- team.Controller.ClearRoleHealthForWorkspace(workspace)

Both take c.mu, iterate roleHealth, and delete entries whose keys
match the workspace-or-team prefix with a trailing `/` so that
sibling names sharing a prefix (teamA vs teamAA, ws1 vs ws1-alt)
are not affected.

Wires both helpers into daemon.handleDelete's workspace and team
branches, after the corresponding Store.Delete* call so the cascade
happens in the same direction everywhere: drop the records, then
drop the crash-loop history.
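The ordering can be sketched as follows. The interfaces, the `DeleteTeam` method name, and the fakes are all hypothetical, since the text only refers to Store.Delete* and does not show handleDelete itself.

```go
package main

import "errors"

// teamStore and healthClearer are hypothetical interfaces standing in
// for the store and the team controller.
type teamStore interface {
	DeleteTeam(workspace, team string) error
}

type healthClearer interface {
	ClearRoleHealthForTeam(workspace, team string)
}

// deleteTeam drops the records first, then the crash-loop history.
// If the store delete fails, the health state is left untouched.
func deleteTeam(s teamStore, c healthClearer, workspace, team string) error {
	if err := s.DeleteTeam(workspace, team); err != nil {
		return err
	}
	c.ClearRoleHealthForTeam(workspace, team)
	return nil
}

// fakeStore and recorder are small doubles for trying the sketch out.
type fakeStore struct{ fail bool }

func (f fakeStore) DeleteTeam(workspace, team string) error {
	if f.fail {
		return errors.New("store delete failed")
	}
	return nil
}

type recorder struct{ cleared []string }

func (r *recorder) ClearRoleHealthForTeam(workspace, team string) {
	r.cleared = append(r.cleared, workspace+"/"+team)
}
```

Calling `deleteTeam(fakeStore{}, &recorder{}, "ws1", "teamA")` records one clear, while a failing store returns the error and records none.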

Tests cover prefix-boundary correctness, sibling-survival, and the
empty-map no-op case. No daemon-level regression test — the two call
sites are trivially auditable in the diff and the unit tests cover
the underlying contract.

Refs #29
@arcaven arcaven merged commit 20d90ac into main Apr 19, 2026
7 checks passed
@arcaven arcaven added labels Apr 19, 2026: type.bug (Broken behavior: something doesn't work as designed), agent.worker (PR created by a Claude Code worker agent), area.controller (Reconciler)