fix(team): clear RoleHealth on workspace/team cascade delete #33
Merged
Conversation
Before this change, `team.Controller.roleHealth` survived `daemon.handleDelete`'s workspace and team cascade paths. The map is keyed by `workspace/team/role` — three name components the operator is free to reuse after a delete. When a fresh manifest re-used the same names, accumulated `RestartCount` and `BackoffUntil` carried over. If the prior generation had hit `MaxRestarts` saturation, `BackoffUntil` was frozen to `saturationFreezeUntil` (far future) and the reconciler gate at `controller.go:230` refused spawns forever — producing the silent stall Skippy reproduced in #29.

Adds:

- `team.Controller.ClearRoleHealthForTeam(workspace, team)`
- `team.Controller.ClearRoleHealthForWorkspace(workspace)`

Both take `c.mu`, iterate `roleHealth`, and delete entries whose keys match the workspace-or-team prefix with a trailing `/`, so that sibling names sharing a prefix (`teamA` vs `teamAA`, `ws1` vs `ws1-alt`) are not affected.

Wires both helpers into `daemon.handleDelete`'s workspace and team branches, after the corresponding `Store.Delete*` call, so the cascade happens in the same direction everywhere: drop the records, then drop the crash-loop history.

Tests cover prefix-boundary correctness, sibling survival, and the empty-map no-op case. No daemon-level regression test — the two call sites are trivially auditable in the diff and the unit tests cover the underlying contract.

Refs #29
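For reviewers reading this without the diff open, here is a minimal sketch of the clearing contract described above. The `roleHealth` map, its `workspace/team/role` key layout, `c.mu`, and the `RestartCount`/`BackoffUntil` fields come from the description; everything else (struct shapes, package name) is an assumption, not the merged code.

```go
package team

import (
	"strings"
	"sync"
	"time"
)

// RoleHealth field names follow the description; the struct shape is assumed.
type RoleHealth struct {
	RestartCount int
	BackoffUntil time.Time
}

type Controller struct {
	mu         sync.Mutex
	roleHealth map[string]*RoleHealth // keyed by "workspace/team/role"
}

// ClearRoleHealthForTeam drops crash-loop history for every role under
// workspace/team. The trailing "/" keeps teamA from also matching teamAA.
func (c *Controller) ClearRoleHealthForTeam(workspace, team string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	prefix := workspace + "/" + team + "/"
	for key := range c.roleHealth {
		if strings.HasPrefix(key, prefix) {
			delete(c.roleHealth, key)
		}
	}
}

// ClearRoleHealthForWorkspace applies the same boundary rule at workspace
// scope, so ws1 does not also clear ws1-alt.
func (c *Controller) ClearRoleHealthForWorkspace(workspace string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	prefix := workspace + "/"
	for key := range c.roleHealth {
		if strings.HasPrefix(key, prefix) {
			delete(c.roleHealth, key)
		}
	}
}
```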
Summary
Fixes the stall Skippy reproduced in #29: after `marvel delete workspace <name>` + re-applying the same manifest, the reconciler never spawns sessions.

- `team.Controller.roleHealth` was keyed by `workspace/team/role` — three name components the operator is free to reuse after a delete. Accumulated `RestartCount` + `BackoffUntil` survived the cascade and poisoned the fresh generation.
- Under `MaxRestarts` saturation, `BackoffUntil = saturationFreezeUntil` (far future). The reconciler gate at `controller.go:230` refused spawns forever.
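To make the stall concrete, the gate has roughly this shape. This is a sketch continuing the types from the snippet above, not the actual check at `controller.go:230`, which lives inline in the reconciler loop and may differ.

```go
// Sketch only. With a stale entry whose BackoffUntil was frozen at
// saturationFreezeUntil, this reports true on every reconcile, so the
// role is never spawned again.
func (c *Controller) inBackoff(key string, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	h, ok := c.roleHealth[key]
	return ok && now.Before(h.BackoffUntil)
}
```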
Changes

- `team.Controller.ClearRoleHealthForTeam(workspace, team)` — mutex-guarded, prefix-match deletion with a trailing `/` so sibling names that share a prefix (`teamA` vs `teamAA`) are unaffected.
- `team.Controller.ClearRoleHealthForWorkspace(workspace)` — same contract at workspace scope.
- `daemon.handleDelete` wires both: the workspace and team branches each call the appropriate clear after the corresponding `Store.Delete*` succeeds (sketched below).
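A rough sketch of that wiring, under heavy assumptions: the kind constants, request fields, and store/daemon types below are placeholders for the real daemon package, and only the ordering (store delete first, then health clear) follows the PR.

```go
// Sketch of the cascade ordering only; everything except the two Clear*
// helpers is a placeholder for the real daemon types.
package daemon

type Kind int

const (
	KindWorkspace Kind = iota
	KindTeam
)

type DeleteRequest struct {
	Kind            Kind
	Workspace, Team string
}

// Store stands in for whatever the real record store exposes.
type Store interface {
	DeleteWorkspace(workspace string) error
	DeleteTeam(workspace, team string) error
}

// TeamController stands in for *team.Controller.
type TeamController interface {
	ClearRoleHealthForTeam(workspace, team string)
	ClearRoleHealthForWorkspace(workspace string)
}

type Daemon struct {
	store Store
	teams TeamController
}

func (d *Daemon) handleDelete(req DeleteRequest) error {
	switch req.Kind {
	case KindWorkspace:
		if err := d.store.DeleteWorkspace(req.Workspace); err != nil {
			return err
		}
		// Drop the records, then drop the crash-loop history.
		d.teams.ClearRoleHealthForWorkspace(req.Workspace)
	case KindTeam:
		if err := d.store.DeleteTeam(req.Workspace, req.Team); err != nil {
			return err
		}
		d.teams.ClearRoleHealthForTeam(req.Workspace, req.Team)
	}
	return nil
}
```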
Tests

- `TestClearRoleHealthForTeam` — populates entries under two workspaces and three teams with one prefix-boundary pair (`teamA` vs `teamAA`); asserts only the target team is cleared (assertion shape sketched below).
- `TestClearRoleHealthForWorkspace` — same shape at workspace scope (`ws1` vs `ws1-alt`).
- `TestClearRoleHealthForTeamEmpty` — no-op on empty map.
- `go test -race ./...` green; `golangci-lint run ./...` 0 issues.

No daemon-level regression test. The two new call sites are trivially auditable in the diff, and the controller-level tests cover the underlying contract. Adding a daemon-level test would have required either a 30s+ real crash-loop or exporting an internal map seeder — both worse than the current coverage.
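For the prefix-boundary case, the assertion shape is roughly as follows. This is a sketch against the `Controller`/`RoleHealth` types assumed in the first snippet, not the merged test, and the key names are illustrative.

```go
package team

import "testing"

// Sketch of the teamA-vs-teamAA boundary check, not the merged test.
func TestClearRoleHealthForTeamBoundarySketch(t *testing.T) {
	c := &Controller{roleHealth: map[string]*RoleHealth{
		"ws1/teamA/dev":  {RestartCount: 3},
		"ws1/teamAA/dev": {RestartCount: 1},
		"ws2/teamA/dev":  {RestartCount: 2},
	}}

	c.ClearRoleHealthForTeam("ws1", "teamA")

	if _, ok := c.roleHealth["ws1/teamA/dev"]; ok {
		t.Fatal("ws1/teamA/dev should have been cleared")
	}
	for _, key := range []string{"ws1/teamAA/dev", "ws2/teamA/dev"} {
		if _, ok := c.roleHealth[key]; !ok {
			t.Fatalf("%s should have survived", key)
		}
	}
}
```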
Not addressed here
- The `restartBackoffMax` wait. Orthogonal to this fix — if we want scale-to-1 to also reset the counter, that's a separate design call.
- Generation-scoped keys (e.g. `workspace/team/role@gen`). Cleaner long-term, but threading generation into every lookup has cost, and the cascade-clear approach covers this specific bug without the churn.

Refs #29