feat(spurctld): get Raft HA working for K8s and bare-metal #93
Merged
Conversation
Author (Member)
The Cluster Tests failure seems to be pre-existing from yesterday. The K8s integration tests pass, and those include the Raft HA integration tests.
Collaborator
Let me know if the cluster CI is flaky and I can investigate it. Thanks for the changes. We can land once we get the CI green.
Author (Member)
Yes, there is a last-minute thing I want to fix before merging. I will let you know when it is ready. Thanks.
Author (Member)
@powderluv Cluster Tests are still failing, for a reason I suspect is unrelated to this work. The K8s integration tests pass, which means we are good to merge; HA with failover handling is covered in those tests. Please hit merge whenever you are ready.
Defines AppendEntries, Vote, and InstallSnapshot RPCs using a bytes envelope to avoid mirroring openraft's complex types in proto. Runs on a dedicated port (6821), separate from the client API.

Made-with: Cursor
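The envelope keeps the proto trivial: each RPC carries a single opaque bytes payload holding the serialized openraft request or response. A minimal sketch of the idea, assuming serde_json for the encoding; RaftEnvelope, encode, and decode are illustrative names, not the actual spurctld code:

```rust
// Minimal sketch of the bytes-envelope idea (illustrative names, assumes
// serde + serde_json): any openraft RPC type that implements Serialize /
// Deserialize is carried as an opaque payload, so the .proto file never
// has to mirror openraft's generic request/response types.
use serde::{de::DeserializeOwned, Serialize};

/// Hypothetical Rust-side mirror of a proto message with `bytes payload = 1;`.
pub struct RaftEnvelope {
    pub payload: Vec<u8>,
}

pub fn encode<T: Serialize>(req: &T) -> Result<RaftEnvelope, serde_json::Error> {
    Ok(RaftEnvelope { payload: serde_json::to_vec(req)? })
}

pub fn decode<T: DeserializeOwned>(env: &RaftEnvelope) -> Result<T, serde_json::Error> {
    serde_json::from_slice(&env.payload)
}
```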
… bootstrap

- Disk-backed SpurStore (vote.json, log/*.json, snapshot.json) replaces in-memory storage so Raft state survives pod restarts
- Real gRPC transport via the RaftInternal service (replaces the stub Unreachable)
- Single-member bootstrap with add_learner retry for bare-metal robustness
- StateMachineApply trait for ClusterManager integration
- Config: add raft_listen_addr (default 6821), node_id auto-detection from the hostname ordinal, fail-fast on an undetermined node_id

Made-with: Cursor
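The node_id auto-detection amounts to parsing the trailing ordinal of a StatefulSet-style hostname and failing fast when none is found. A minimal sketch under that assumption; the function name and the 1-based id offset are illustrative, not the actual spurctld config code:

```rust
// Sketch of node_id auto-detection from a StatefulSet hostname ordinal
// (e.g. "spurctld-2"). The 1-based offset is an assumption for illustration;
// returning None models the fail-fast path for an undetermined node_id.
fn node_id_from_hostname(hostname: &str) -> Option<u64> {
    // StatefulSet pods are named <name>-<ordinal>; take the trailing number.
    let ordinal: u64 = hostname.rsplit('-').next()?.parse().ok()?;
    Some(ordinal + 1)
}

fn main() {
    assert_eq!(node_id_from_hostname("spurctld-0"), Some(1));
    assert_eq!(node_id_from_hostname("spurctld-2"), Some(3));
    // No parsable ordinal: the caller should refuse to start.
    assert_eq!(node_id_from_hostname("bare-metal-host"), None);
}
```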
…arding

ClusterManager:
- propose() routes through raft.client_write() or the local WAL
- apply_wal_op_internal() is the single state-apply path for both modes, handling resource alloc/dealloc, license tracking, and partition assignment
- Removed replay_entry(); startup WAL recovery uses apply_wal_op_internal
- Deadlock fixes: all mutation methods drop locks before calling propose()
- Snapshot includes reservations, steps, license_pool, hostname_aliases
- Redb snapshot open retries on a stale flock left by SIGKILL

Server:
- Leader check + LeaderProxy forwarding on all RPCs (reads and writes)
- x-spur-leader header uses the client API port, not the Raft port
- x-spur-forwarded header prevents forwarding loops

Scheduler/health: only run on the leader node.

Made-with: Cursor
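The leader gating boils down to one routing decision per incoming request. A hedged sketch of that decision using the header semantics described above; ForwardDecision and route_request are illustrative, not spurctld's actual server types:

```rust
// Illustrative model of the per-request leader check. Header names follow the
// commit message (x-spur-leader / x-spur-forwarded); the types are not spurctld's.
pub enum ForwardDecision {
    /// This node is the Raft leader: handle the RPC locally.
    HandleLocally,
    /// Forward to the leader's *client API* address and advertise it back to
    /// the caller via the x-spur-leader header.
    ForwardTo { leader_client_addr: String },
    /// The request already carries x-spur-forwarded (or no leader is known),
    /// so refuse instead of bouncing the request around the cluster.
    Reject,
}

pub fn route_request(
    is_leader: bool,
    leader_client_addr: Option<String>,
    already_forwarded: bool,
) -> ForwardDecision {
    if is_leader {
        ForwardDecision::HandleLocally
    } else if already_forwarded {
        ForwardDecision::Reject
    } else if let Some(addr) = leader_client_addr {
        ForwardDecision::ForwardTo { leader_client_addr: addr }
    } else {
        // Mid-election: no leader to forward to yet.
        ForwardDecision::Reject
    }
}
```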
- spurctld.yaml: 3 replicas, dual ports (6817 client + 6821 Raft), volumeClaimTemplates for persistent Raft state (1Gi)
- configmap.yaml: controller.peers with StatefulSet DNS names
- pdb.yaml: maxUnavailable=1 to maintain quorum during rollouts

Made-with: Cursor
- Tests 9-12: Raft cluster deploy, leader election (committed vote), state replication (PVCs + log entries), failover recovery
- Patch replicas to 1 for the single-node CI tests (Tests 1-8)
- Raft tests run last so no single-node restore step is needed

Made-with: Cursor
The reconciler could submit a SpurJob twice when the informer cache was stale after the finalizer patch. Extract submit_to_controller(), which re-reads the object from the API server before submitting, and apply the job-id label best-effort so the critical status patch always runs. Use the operator's Service DNS as the advertised agent address instead of the ephemeral pod IP, preventing stale addresses after rollouts.

Made-with: Cursor
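The shape of the double-submit fix, sketched under the assumption of kube-rs (with the derive feature) and schemars; SpurJob/SpurJobSpec, the job-id label name, and submit_job are hypothetical stand-ins for the operator's real types:

```rust
// Hedged sketch of submit_to_controller(): re-read from the API server so a
// stale informer cache cannot cause a second submission. All names here are
// illustrative; only the re-read-then-check pattern is the point.
use kube::{Api, Client, CustomResource};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

#[derive(CustomResource, Clone, Debug, Deserialize, Serialize, JsonSchema)]
#[kube(group = "spur.example", version = "v1", kind = "SpurJob", namespaced)]
pub struct SpurJobSpec {
    pub command: String,
}

pub async fn submit_to_controller(
    client: Client,
    namespace: &str,
    name: &str,
) -> Result<(), kube::Error> {
    let jobs: Api<SpurJob> = Api::namespaced(client, namespace);

    // Re-read from the API server rather than trusting the informer cache,
    // which can still be stale right after the finalizer patch.
    let fresh = jobs.get(name).await?;

    // If an earlier reconcile already recorded a job id, do not submit again.
    let labels = fresh.metadata.labels.clone().unwrap_or_default();
    if labels.contains_key("spur.example/job-id") {
        return Ok(());
    }

    submit_job(&fresh).await
}

// Placeholder so the sketch stands alone; the real call goes to spurctld.
async fn submit_job(_job: &SpurJob) -> Result<(), kube::Error> {
    Ok(())
}
```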
…path

Make Raft always-on even for single-node deployments (a 1-member cluster that self-elects instantly), eliminating the separate FileWalStore + redb snapshot persistence path.

Key changes:
- SpurStore takes the applier at construction; ClusterManager is created before Raft so recovery applies entries correctly
- Single-node mode synthesizes a local peer list instead of skipping Raft
- propose() always goes through raft.client_write(); no local WAL branch
- RaftHandle is non-optional throughout server.rs and scheduler_loop.rs
- Rename apply_wal_op_internal -> apply_operation and StateMachineApply::apply_wal_operation -> apply_operation
- Restore snapshot data into the applier on startup (fixes a latent data-loss bug where persisted snapshot metadata was loaded but the data was never deserialized into ClusterManager)

Fixes: snapshot scope mismatch (redb only saved jobs + nodes; the Raft snapshot includes reservations, steps, licenses, hostname_aliases), a startup ordering bug (Raft replayed entries before the applier was set), and stale redb keys (redb save never deleted removed entries).

Made-with: Cursor
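One detail worth showing is the synthesized single-node peer list: when no peers are configured, the node fabricates a one-member list for itself instead of skipping Raft. A minimal sketch with an illustrative Peer type rather than spurctld's real config structs:

```rust
// Illustrative sketch of "always-on Raft": an empty peer config becomes a
// single-member peer list, so the node still forms a 1-member Raft cluster
// that self-elects instantly. Peer is a stand-in type, not spurctld's config.
#[derive(Debug, Clone, PartialEq)]
pub struct Peer {
    pub node_id: u64,
    pub raft_addr: String,
}

pub fn effective_peers(configured: Vec<Peer>, self_id: u64, raft_listen_addr: &str) -> Vec<Peer> {
    if configured.is_empty() {
        vec![Peer { node_id: self_id, raft_addr: raft_listen_addr.to_string() }]
    } else {
        configured
    }
}

fn main() {
    let peers = effective_peers(Vec::new(), 1, "127.0.0.1:6821");
    assert_eq!(peers, vec![Peer { node_id: 1, raft_addr: "127.0.0.1:6821".into() }]);
}
```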
The spur-state crate (FileWalStore + SnapshotStore/redb) is no longer used now that Raft is the sole persistence path. Remove it entirely along with the WalEntry wrapper, WalStore trait, WalStoreError, the t53_state integration tests, and the redb workspace dependency. The WalOperation enum is kept in spur-core/src/wal.rs as the Raft log entry payload type.

Made-with: Cursor
Reflect the unified Raft persistence model in CLAUDE.md, README.md, quickstart.md, and k8s_test.sh. Remove references to WAL+redb, the spur-state crate, and the old single-node-without-Raft mode.

Made-with: Cursor
Made-with: Cursor
…membership

Replace the asymmetric bootstrap (node 1 initializes alone, then adds the others via add_learner + change_membership) with a symmetric bootstrap where every node calls raft.initialize() with the full membership set. openraft guarantees that when all nodes use the same membership, the voting protocol picks exactly one leader via randomized election timeouts. On subsequent restarts, initialize() is a no-op and normal Raft elections take over.

This removes the node_id == 1 bootstrap privilege, the 60-second add_learner retry loop, the wait_for_leadership helper, and the expand_voter_set background task. Nodes can now start in any order, with no ordering dependency.

Made-with: Cursor
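A minimal sketch of the symmetric bootstrap, assuming openraft's BasicNode type: every node builds the identical full-membership map and hands it to initialize(). The initialize() call itself is shown only as a comment, since constructing a complete Raft instance is out of scope here:

```rust
// Sketch of the symmetric bootstrap (assumes the openraft crate). Every node
// computes the same membership map from the shared peer list; openraft then
// elects exactly one leader via randomized election timeouts.
use std::collections::BTreeMap;
use openraft::BasicNode;

pub fn full_membership(peers: &[(u64, String)]) -> BTreeMap<u64, BasicNode> {
    peers
        .iter()
        .map(|(id, addr)| (*id, BasicNode::new(addr.clone())))
        .collect()
}

// Every node then runs the same call at startup, for example:
//
//     // On an already-initialized node this is a no-op error, which is
//     // ignored so restarts fall through to normal Raft elections.
//     let _ = raft.initialize(full_membership(&peers)).await;
```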
…ests

New tests 13-15 for the 3-replica Raft cluster:
- TEST 13: Submit a job, kill the Raft leader, and verify the job ID is consistent on the new leader (Raft state survived failover) and that a new leader was elected with a committed vote.
- TEST 14: After the leader failover from test 13, submit a new job to the new leader and verify it gets a higher job ID (proving the new leader accepts writes and the ID sequence is preserved).
- TEST 15: Verify all 3 nodes have replicated log entries, then kill one node and confirm it recovers with its log and vote intact.

Made-with: Cursor
Root cause: test_cluster() returned before the single-node Raft finished self-electing. The first propose() hit a not-yet-leader node, client_write failed, and propose() silently swallowed the error, so the job ID was allocated but never committed. openraft's client_write does wait for apply (it returns the apply_to_state_machine result), so there is no commit-before-apply gap; the issue was purely that propose() was called before leader election completed.

Fix: use openraft's wait API in test_cluster() to block until a leader is elected before returning, and fix the doc comments on the wait helpers to accurately describe the cause.

Made-with: Cursor
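A sketch of the wait-for-leader step, assuming openraft 0.9's Wait API; the helper name and the 10-second timeout are arbitrary choices, not the actual test_cluster() code:

```rust
// Hedged sketch: block until some node reports a current leader before the
// test harness returns the cluster to the caller (assumes openraft 0.9).
use std::time::Duration;
use openraft::{error::WaitError, Raft, RaftTypeConfig};

pub async fn wait_for_leader<C: RaftTypeConfig>(raft: &Raft<C>) -> Result<(), WaitError> {
    raft.wait(Some(Duration::from_secs(10)))
        .metrics(|m| m.current_leader.is_some(), "leader elected")
        .await?;
    Ok(())
}
```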
Author (Member)
I rebased on main and the CI is passing now. Can we merge this, @powderluv?
Raft HA for spurctld
This gets Raft-based HA working on the K8s cluster and bare metal. Changes in #84 served as a foundation. The Raft log is now the sole persistence path — even single-node deployments run a 1-member Raft cluster, eliminating the dual WAL/redb code path entirely.
Raft core (new: raft.rs, raft_server.rs, raft_internal.proto)
- AppendEntries, Vote, and InstallSnapshot RPCs use a bytes envelope on a dedicated port so the proto does not mirror openraft's types
- Raft state is persisted on disk under state_dir/raft/ (vote, log entries, snapshot)
- add_learner retry so bare-metal nodes can start simultaneously
- Fail-fast if node_id cannot be determined from hostname or config

Always-on Raft (main.rs, raft.rs, cluster.rs)
- SpurStore takes the applier (ClusterManager) at construction; Raft recovery replays entries into live state correctly on startup
- Snapshot data is restored into ClusterManager on startup, fixing a latent data-loss bug where snapshot metadata was loaded but the actual state was not deserialized
- ClusterManager::new() creates empty state; all recovery flows through Raft log replay + snapshot restore
- Removed: FileWalStore, SnapshotStore (redb), open_snapshot_with_retry, maybe_snapshot, take_snapshot, the dual-path propose(), and WalEntry/WalStore/WalStoreError

State replication (cluster.rs)
- propose() always calls raft.client_write(); no local WAL branch
- apply_wal_op_internal renamed to apply_operation (called by Raft's apply_to_state_machine on commit)
- Mutation methods drop their locks before calling propose()

Leader gating (server.rs, scheduler_loop.rs)
- RaftHandle is non-optional throughout; no if let Some(raft) branching
- Non-leader nodes forward all RPCs (reads and writes) through LeaderProxy
- The x-spur-leader header hints the client API address (not the Raft port)

Deleted
- spur-state crate; raft.rs is the only persistence layer
- redb removed from workspace dependencies
- WalOperation enum kept in spur-core/src/wal.rs as the Raft log entry payload type
- t53_state test module removed (coverage replaced by Raft storage unit tests)

K8s manifests
- spurctld.yaml: 3 replicas, PVCs for Raft state, dual ports
- configmap.yaml: peers via StatefulSet DNS
- pdb.yaml: maxUnavailable: 1 for quorum safety

K8s operator bug fixes (job_controller.rs, agent.rs)
- Double-submit guard in the reconciler; the operator's Service DNS is used as the advertised agent address

Tests
- Raft storage unit tests (#[tokio::test])
- K8s integration tests 9-15: cluster deploy, leader election, replication, failover

Issues resolved
- SnapshotStore::save() never deleted removed jobs/nodes (redb has no clear())
- Snapshot scope mismatch: redb only saved jobs and nodes, while the Raft snapshot covers reservations, steps, licenses, and hostname_aliases
- Startup ordering bug: Raft replayed entries before the applier was set

Config
- New raft_listen_addr (default 6821); node_id is auto-detected from the hostname ordinal, with fail-fast when it cannot be determined