Skip to content

feat(cluster): add dead node removal API for Kubernetes scale-down#384

Merged
xe-nvdk merged 2 commits intomainfrom
feat/dead-voter-removal-api
Apr 10, 2026
Merged

feat(cluster): add dead node removal API for Kubernetes scale-down#384
xe-nvdk merged 2 commits intomainfrom
feat/dead-voter-removal-api

Conversation

@xe-nvdk
Copy link
Copy Markdown
Member

@xe-nvdk xe-nvdk commented Apr 10, 2026

Summary

  • New admin endpoint DELETE /api/v1/cluster/nodes/:id to remove a dead or permanently scaled-down node from the cluster
  • Fixes RemoveNodeViaRaft() to call both RemoveServer() (Raft voting) and RemoveNode() (FSM state) + Unregister() — previously only called RemoveNode(), leaving dead voters in the Raft configuration
  • Self-removal prevented with clear error message
  • Admin auth required (auth.RequireAdmin)

Why

When a Kubernetes pod is permanently scaled down, its Raft voter entry persists. Without a removal API, dead voters accumulate and can eventually break quorum — the cluster can't elect a leader, process writes, or even remove the dead voters.

Test plan

  • go build ./cmd/... ./internal/... passes
  • go test ./internal/cluster/... passes
  • go test ./internal/api/... passes
  • Post-implementation review: auth enforced, self-removal prevented, license-gated via coordinator
  • Manual: remove a dead node via API, verify cluster health recovers

New admin endpoint DELETE /api/v1/cluster/nodes/:id removes a dead or
permanently scaled-down node from the cluster. Removes the node from
Raft voting configuration, cluster FSM state, and local registry.

Also fixes RemoveNodeViaRaft() which previously only called RemoveNode
(FSM state) but not RemoveServer (Raft voting). Now matches the
handleLeaveNotify pattern with complete three-phase cleanup.

Self-removal is prevented — the endpoint rejects requests to remove
the node you're connected to. Use graceful shutdown instead.
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new administrative API endpoint (DELETE /api/v1/cluster/nodes/:id) for removing dead or unresponsive nodes from the cluster. The changes include adding an admin-only route in the cluster handler and updating the coordinator to remove nodes from both the Raft configuration and the FSM state. Feedback suggests improving the robustness of the removal logic by ensuring that failures in the Raft configuration step do not block FSM cleanup. Additionally, it was noted that manual unregistration from the local registry is redundant and could lead to state inconsistency, as this is already managed via FSM callbacks.

Comment on lines +1471 to +1473
if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
return fmt.Errorf("failed to remove from Raft: %w", err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If RemoveServer fails because the node has already been removed from the Raft configuration (e.g., due to a previous partial success or a race condition), this function returns an error, preventing the subsequent RemoveNode call from cleaning up the FSM state. For a destructive 'dead node removal' API, it is often more robust to log a warning if the Raft configuration change fails (as long as it's not a leadership or quorum error) and proceed with the FSM cleanup to ensure the node is removed from the registry across the cluster.

Suggested change
if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
return fmt.Errorf("failed to remove from Raft: %w", err)
}
if err := c.raftNode.RemoveServer(nodeID, 5*time.Second); err != nil {
c.logger.Warn().Err(err).Str("node_id", nodeID).Msg("Failed to remove node from Raft configuration (may already be removed)")
}

Comment thread internal/cluster/coordinator.go Outdated
}

// Remove from local registry
c.registry.Unregister(nodeID)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This manual call to c.registry.Unregister(nodeID) is redundant and potentially leads to temporary state inconsistency. In this architecture, the node registry is synchronized via FSM callbacks (see onRaftNodeRemoved at line 1169). When c.raftNode.RemoveNode is successfully applied, it triggers the callback on all nodes, including the leader, which performs the unregistration. By calling it manually here, you risk updating the leader's local registry even if the FSM command fails to commit, creating a discrepancy between the leader's memory and the cluster's authoritative state.

- RemoveServer failure now warns instead of hard-failing, allowing FSM
  cleanup to proceed (handles retry/idempotency scenarios where the
  node was already removed from Raft voting)
- Removed manual registry.Unregister() call — the FSM callback
  (onRaftNodeRemoved) handles registry cleanup on all nodes when
  RemoveNode succeeds, avoiding state inconsistency
- RemoveNode failure now returns error (it's the authoritative state
  change that must succeed)
@xe-nvdk xe-nvdk merged commit ac6ac3a into main Apr 10, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant