feat(cluster): add dead node removal API for Kubernetes scale-down #384
Conversation
New admin endpoint DELETE /api/v1/cluster/nodes/:id removes a dead or permanently scaled-down node from the cluster. Removes the node from Raft voting configuration, cluster FSM state, and local registry. Also fixes RemoveNodeViaRaft() which previously only called RemoveNode (FSM state) but not RemoveServer (Raft voting). Now matches the handleLeaveNotify pattern with complete three-phase cleanup. Self-removal is prevented — the endpoint rejects requests to remove the node you're connected to. Use graceful shutdown instead.
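For context, a minimal sketch of calling the endpoint from an operator's point of view. Only the method and path come from this PR; the base URL, port, bearer-token auth scheme, and node ID are illustrative assumptions:

```go
// Sketch only: invoke DELETE /api/v1/cluster/nodes/:id against a hypothetical
// local deployment. Base URL, auth header, and node ID are assumptions.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	nodeID := "worker-3" // hypothetical ID of the dead / scaled-down node
	req, err := http.NewRequest(http.MethodDelete,
		"http://localhost:8080/api/v1/cluster/nodes/"+nodeID, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer <admin-token>") // admin auth scheme assumed

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```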
Code Review
This pull request introduces a new administrative API endpoint (DELETE /api/v1/cluster/nodes/:id) for removing dead or unresponsive nodes from the cluster. The changes include adding an admin-only route in the cluster handler and updating the coordinator to remove nodes from both the Raft configuration and the FSM state. Feedback suggests improving the robustness of the removal logic by ensuring that failures in the Raft configuration step do not block FSM cleanup. Additionally, it was noted that manual unregistration from the local registry is redundant and could lead to state inconsistency, as this is already managed via FSM callbacks.
```go
if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
	return fmt.Errorf("failed to remove from Raft: %w", err)
}
```
If RemoveServer fails because the node has already been removed from the Raft configuration (e.g., due to a previous partial success or a race condition), this function returns an error, preventing the subsequent RemoveNode call from cleaning up the FSM state. For a destructive 'dead node removal' API, it is often more robust to log a warning if the Raft configuration change fails (as long as it's not a leadership or quorum error) and proceed with the FSM cleanup to ensure the node is removed from the registry across the cluster.
Suggested change:

```diff
- if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
-     return fmt.Errorf("failed to remove from Raft: %w", err)
- }
+ if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
+     c.logger.Warn().Err(err).Str("node_id", nodeID).Msg("Failed to remove node from Raft configuration (may already be removed)")
+ }
```
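To illustrate the caveat about leadership and quorum errors, here is a rough sketch of how the failure classes could be separated. It assumes the underlying library is hashicorp/raft and that the `raftNode` wrapper propagates its sentinel errors; the receiver type, method name, and `RemoveNode` signature are also assumptions:

```go
// Sketch only: tolerate "already removed" style failures, but treat loss of
// leadership or quorum as fatal so the caller can retry against the new leader.
func (c *Coordinator) removeDeadNode(nodeID string) error {
	if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
		// Assumes hashicorp/raft sentinel errors are wrapped and propagated.
		if errors.Is(err, raft.ErrNotLeader) || errors.Is(err, raft.ErrLeadershipLost) {
			return fmt.Errorf("failed to remove from Raft: %w", err)
		}
		// Anything else (e.g. the voter was already removed) is logged and
		// tolerated so the FSM cleanup below can still run.
		c.logger.Warn().Err(err).Str("node_id", nodeID).
			Msg("Failed to remove node from Raft configuration (may already be removed)")
	}
	// Authoritative state change: must succeed.
	return c.raftNode.RemoveNode(nodeID)
}
```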
```go
}

// Remove from local registry
c.registry.Unregister(nodeID)
```
This manual call to c.registry.Unregister(nodeID) is redundant and potentially leads to temporary state inconsistency. In this architecture, the node registry is synchronized via FSM callbacks (see onRaftNodeRemoved at line 1169). When c.raftNode.RemoveNode is successfully applied, it triggers the callback on all nodes, including the leader, which performs the unregistration. By calling it manually here, you risk updating the leader's local registry even if the FSM command fails to commit, creating a discrepancy between the leader's memory and the cluster's authoritative state.
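For reference, the callback path this comment describes would look roughly like the following. `onRaftNodeRemoved` is named in the review, but its body, signature, and the `Coordinator` receiver here are illustrative assumptions:

```go
// Sketch only: FSM-callback cleanup path. Runs on every node (leader and
// followers) once the RemoveNode command has been committed and applied, so the
// registry only changes after the cluster's authoritative state has changed.
func (c *Coordinator) onRaftNodeRemoved(nodeID string) {
	c.registry.Unregister(nodeID)
	c.logger.Info().Str("node_id", nodeID).
		Msg("Node removed from local registry via FSM callback")
}
```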
- RemoveServer failure now warns instead of hard-failing, allowing FSM cleanup to proceed (handles retry/idempotency scenarios where the node was already removed from Raft voting)
- Removed the manual registry.Unregister() call: the FSM callback (onRaftNodeRemoved) handles registry cleanup on all nodes when RemoveNode succeeds, avoiding state inconsistency
- RemoveNode failure now returns an error (it's the authoritative state change that must succeed)
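Putting those three points together, the removal path would look roughly like this. The `*Coordinator` receiver and exact signatures are assumptions inferred from the snippets quoted above, not the merged code:

```go
// Sketch only: approximate shape of RemoveNodeViaRaft after the review feedback.
func (c *Coordinator) RemoveNodeViaRaft(nodeID string) error {
	// 1. Raft voting configuration: best effort. A failure here (e.g. the voter
	//    was already removed) is logged but does not block FSM cleanup.
	if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
		c.logger.Warn().Err(err).Str("node_id", nodeID).
			Msg("Failed to remove node from Raft configuration (may already be removed)")
	}

	// 2. FSM state: the authoritative change. If this fails, the call fails.
	if err := c.raftNode.RemoveNode(nodeID); err != nil {
		return fmt.Errorf("failed to remove node from FSM state: %w", err)
	}

	// 3. No manual c.registry.Unregister(nodeID) here: the onRaftNodeRemoved FSM
	//    callback unregisters the node on every member once the command is applied.
	return nil
}
```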
Summary
- `DELETE /api/v1/cluster/nodes/:id` to remove a dead or permanently scaled-down node from the cluster
- Fixed `RemoveNodeViaRaft()` to call both `RemoveServer()` (Raft voting) and `RemoveNode()` (FSM state) + `Unregister()`; previously it only called `RemoveNode()`, leaving dead voters in the Raft configuration
- Admin-only route (`auth.RequireAdmin`)

Why
When a Kubernetes pod is permanently scaled down, its Raft voter entry persists. Without a removal API, dead voters accumulate and can eventually break quorum — the cluster can't elect a leader, process writes, or even remove the dead voters.
Test plan
- `go build ./cmd/... ./internal/...` passes
- `go test ./internal/cluster/...` passes
- `go test ./internal/api/...` passes