feat(cluster): add dead node removal API for Kubernetes scale-down #384
Conversation
New admin endpoint DELETE /api/v1/cluster/nodes/:id removes a dead or permanently scaled-down node from the cluster. Removes the node from Raft voting configuration, cluster FSM state, and local registry. Also fixes RemoveNodeViaRaft() which previously only called RemoveNode (FSM state) but not RemoveServer (Raft voting). Now matches the handleLeaveNotify pattern with complete three-phase cleanup. Self-removal is prevented — the endpoint rejects requests to remove the node you're connected to. Use graceful shutdown instead.
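For context, a minimal sketch of calling the endpoint from an operator's point of view. Only the method and path come from this PR; the base URL, port, bearer-token auth scheme, and node ID are illustrative assumptions:

```go
// Sketch only: invoke DELETE /api/v1/cluster/nodes/:id against a hypothetical
// local deployment. Base URL, auth header, and node ID are assumptions.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	nodeID := "worker-3" // hypothetical ID of the dead / scaled-down node
	req, err := http.NewRequest(http.MethodDelete,
		"http://localhost:8080/api/v1/cluster/nodes/"+nodeID, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer <admin-token>") // admin auth scheme assumed

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```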
Code Review
This pull request introduces a new administrative API endpoint (DELETE /api/v1/cluster/nodes/:id) for removing dead or unresponsive nodes from the cluster. The changes include adding an admin-only route in the cluster handler and updating the coordinator to remove nodes from both the Raft configuration and the FSM state. Feedback suggests improving the robustness of the removal logic by ensuring that failures in the Raft configuration step do not block FSM cleanup. Additionally, it was noted that manual unregistration from the local registry is redundant and could lead to state inconsistency, as this is already managed via FSM callbacks.
```go
if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
	return fmt.Errorf("failed to remove from Raft: %w", err)
}
```
If RemoveServer fails because the node has already been removed from the Raft configuration (e.g., due to a previous partial success or a race condition), this function returns an error, preventing the subsequent RemoveNode call from cleaning up the FSM state. For a destructive 'dead node removal' API, it is often more robust to log a warning if the Raft configuration change fails (as long as it's not a leadership or quorum error) and proceed with the FSM cleanup to ensure the node is removed from the registry across the cluster.
Suggested change:

```diff
- if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
-     return fmt.Errorf("failed to remove from Raft: %w", err)
- }
+ if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
+     c.logger.Warn().Err(err).Str("node_id", nodeID).Msg("Failed to remove node from Raft configuration (may already be removed)")
+ }
```
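To illustrate the caveat about leadership and quorum errors, here is a rough sketch of how the failure classes could be separated. It assumes the underlying library is hashicorp/raft and that the `raftNode` wrapper propagates its sentinel errors; the receiver type, method name, and `RemoveNode` signature are also assumptions:

```go
// Sketch only: tolerate "already removed" style failures, but treat loss of
// leadership or quorum as fatal so the caller can retry against the new leader.
func (c *Coordinator) removeDeadNode(nodeID string) error {
	if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
		// Assumes hashicorp/raft sentinel errors are wrapped and propagated.
		if errors.Is(err, raft.ErrNotLeader) || errors.Is(err, raft.ErrLeadershipLost) {
			return fmt.Errorf("failed to remove from Raft: %w", err)
		}
		// Anything else (e.g. the voter was already removed) is logged and
		// tolerated so the FSM cleanup below can still run.
		c.logger.Warn().Err(err).Str("node_id", nodeID).
			Msg("Failed to remove node from Raft configuration (may already be removed)")
	}
	// Authoritative state change: must succeed.
	return c.raftNode.RemoveNode(nodeID)
}
```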
```go
}

// Remove from local registry
c.registry.Unregister(nodeID)
```
This manual call to c.registry.Unregister(nodeID) is redundant and potentially leads to temporary state inconsistency. In this architecture, the node registry is synchronized via FSM callbacks (see onRaftNodeRemoved at line 1169). When c.raftNode.RemoveNode is successfully applied, it triggers the callback on all nodes, including the leader, which performs the unregistration. By calling it manually here, you risk updating the leader's local registry even if the FSM command fails to commit, creating a discrepancy between the leader's memory and the cluster's authoritative state.
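For reference, the callback path this comment describes would look roughly like the following. `onRaftNodeRemoved` is named in the review, but its body, signature, and the `Coordinator` receiver here are illustrative assumptions:

```go
// Sketch only: FSM-callback cleanup path. Runs on every node (leader and
// followers) once the RemoveNode command has been committed and applied, so the
// registry only changes after the cluster's authoritative state has changed.
func (c *Coordinator) onRaftNodeRemoved(nodeID string) {
	c.registry.Unregister(nodeID)
	c.logger.Info().Str("node_id", nodeID).
		Msg("Node removed from local registry via FSM callback")
}
```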
- RemoveServer failure now warns instead of hard-failing, allowing FSM cleanup to proceed (handles retry/idempotency scenarios where the node was already removed from Raft voting)
- Removed the manual registry.Unregister() call: the FSM callback (onRaftNodeRemoved) handles registry cleanup on all nodes when RemoveNode succeeds, avoiding state inconsistency
- RemoveNode failure now returns an error (it's the authoritative state change that must succeed)
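Putting those three points together, the removal path would look roughly like this. The `*Coordinator` receiver and exact signatures are assumptions inferred from the snippets quoted above, not the merged code:

```go
// Sketch only: approximate shape of RemoveNodeViaRaft after the review feedback.
func (c *Coordinator) RemoveNodeViaRaft(nodeID string) error {
	// 1. Raft voting configuration: best effort. A failure here (e.g. the voter
	//    was already removed) is logged but does not block FSM cleanup.
	if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
		c.logger.Warn().Err(err).Str("node_id", nodeID).
			Msg("Failed to remove node from Raft configuration (may already be removed)")
	}

	// 2. FSM state: the authoritative change. If this fails, the call fails.
	if err := c.raftNode.RemoveNode(nodeID); err != nil {
		return fmt.Errorf("failed to remove node from FSM state: %w", err)
	}

	// 3. No manual c.registry.Unregister(nodeID) here: the onRaftNodeRemoved FSM
	//    callback unregisters the node on every member once the command is applied.
	return nil
}
```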
Summary
- `DELETE /api/v1/cluster/nodes/:id` to remove a dead or permanently scaled-down node from the cluster
- Fixed `RemoveNodeViaRaft()` to call both `RemoveServer()` (Raft voting) and `RemoveNode()` (FSM state) + `Unregister()`; previously it only called `RemoveNode()`, leaving dead voters in the Raft configuration
- Admin-only route (`auth.RequireAdmin`)

Why
When a Kubernetes pod is permanently scaled down, its Raft voter entry persists. Without a removal API, dead voters accumulate and can eventually break quorum — the cluster can't elect a leader, process writes, or even remove the dead voters.
Test plan
- `go build ./cmd/... ./internal/...` passes
- `go test ./internal/cluster/...` passes
- `go test ./internal/api/...` passes