
feat(spurctld): get Raft HA working for K8s and bare-metal #93

Merged
powderluv merged 14 commits into ROCm:main from shiv-tyagi:fix-raft
Apr 18, 2026

Conversation

@shiv-tyagi
Member

@shiv-tyagi shiv-tyagi commented Apr 15, 2026

Raft HA for spurctld

This gets Raft-based HA working on the K8s cluster and bare metal. Changes in #84 served as a foundation. The Raft log is now the sole persistence path — even single-node deployments run a 1-member Raft cluster, eliminating the dual WAL/redb code path entirely.

Raft core (new: raft.rs, raft_server.rs, raft_internal.proto)

  • Disk-backed storage for vote, log entries, and snapshots under state_dir/raft/
  • gRPC transport for Raft peer RPCs on dedicated port 6821 (separate from client API on 6817)
  • Single-member bootstrap with add_learner retry so bare-metal nodes can start simultaneously
  • Fail-fast when node_id cannot be determined from hostname or config
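
A minimal sketch of the hostname-ordinal fallback for node_id; it assumes a StatefulSet-style hostname such as "spurctld-2", and the function names and error handling are illustrative, not the actual implementation:

fn node_id_from_hostname(hostname: &str) -> Option<u64> {
    // Take the digits after the last '-' (e.g. "spurctld-2" -> 2).
    hostname.rsplit('-').next()?.parse::<u64>().ok()
}

fn resolve_node_id(configured: Option<u64>, hostname: &str) -> u64 {
    // Fail fast when neither config nor hostname yields an id.
    configured
        .or_else(|| node_id_from_hostname(hostname))
        .unwrap_or_else(|| panic!("node_id not set and not derivable from hostname {hostname}"))
}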

Always-on Raft (main.rs, raft.rs, cluster.rs)

  • Raft runs in all modes: single-node auto-synthesizes a local peer list and self-elects instantly.
  • SpurStore takes the applier (ClusterManager) at construction — Raft recovery replays entries into live state correctly on startup (see the sketch after this list)
  • Snapshot data is restored into ClusterManager on startup, fixing a latent data-loss bug where snapshot metadata was loaded but the actual state was not deserialized
  • ClusterManager::new() creates empty state; all recovery flows through Raft log replay + snapshot restore
  • Removed: FileWalStore, SnapshotStore (redb), open_snapshot_with_retry, maybe_snapshot, take_snapshot, dual-path propose(), WalEntry/WalStore/WalStoreError
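
A rough sketch of the applier seam mentioned above; only the names StateMachineApply, apply_operation, WalOperation, SpurStore, and ClusterManager come from this PR, while the signatures and variants are assumptions:

use std::sync::Arc;

// Raft log entry payload (kept in spur-core/src/wal.rs); variants are placeholders.
enum WalOperation {
    SubmitJob { id: u64 },
}

trait StateMachineApply: Send + Sync {
    // Called when an entry commits, and again during startup recovery
    // (snapshot restore followed by log replay).
    fn apply_operation(&self, op: &WalOperation);
}

struct ClusterManager;

impl StateMachineApply for ClusterManager {
    fn apply_operation(&self, _op: &WalOperation) {
        // Mutate jobs/nodes/reservations/licenses here.
    }
}

struct SpurStore {
    applier: Arc<dyn StateMachineApply>,
}

impl SpurStore {
    // Taking the applier at construction means entries replayed during
    // recovery land in live ClusterManager state instead of being dropped.
    fn new(applier: Arc<dyn StateMachineApply>) -> Self {
        Self { applier }
    }
}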

State replication (cluster.rs)

  • propose() always calls raft.client_write() — no local WAL branch
  • Unified apply path: apply_wal_op_internal renamed to apply_operation (called by Raft's apply_to_state_machine on commit)
  • Snapshot includes full cluster state (reservations, steps, licenses, aliases) not just jobs/nodes
  • Deadlock fixes: mutation methods drop locks before calling propose()
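
A minimal sketch of the lock-drop pattern behind those deadlock fixes; everything except propose() and WalOperation is illustrative:

use std::collections::HashMap;
use std::sync::Mutex;

enum WalOperation {
    CancelJob { id: u64 },
}

struct ClusterManager {
    jobs: Mutex<HashMap<u64, String>>,
}

impl ClusterManager {
    async fn propose(&self, _op: WalOperation) -> Result<(), String> {
        // In the real code this awaits raft.client_write(); apply_operation()
        // then takes the same locks when the entry commits.
        Ok(())
    }

    async fn cancel_job(&self, job_id: u64) -> Result<(), String> {
        let op = {
            // Scoped so the guard is released before the await below; holding
            // it across propose() could deadlock against the apply path.
            let jobs = self.jobs.lock().unwrap();
            if !jobs.contains_key(&job_id) {
                return Err("unknown job".into());
            }
            WalOperation::CancelJob { id: job_id }
        };
        self.propose(op).await
    }
}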

Leader gating (server.rs, scheduler_loop.rs)

  • RaftHandle is non-optional throughout — no if let Some(raft) branching
  • All gRPC RPCs hit on a follower are forwarded to the leader via LeaderProxy
  • x-spur-leader header hints the client API address (not the Raft port)
  • Scheduler, health checker, time-limit enforcer, power manager only run on leader
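
Roughly how the leader gating of the background loops can look; RaftHandle and is_leader() here are stand-ins, not the PR's actual API:

use std::time::Duration;

struct RaftHandle;

impl RaftHandle {
    // Stand-in for a leadership check against Raft metrics.
    async fn is_leader(&self) -> bool {
        true
    }
}

async fn scheduler_loop(raft: RaftHandle) {
    loop {
        if raft.is_leader().await {
            // Schedule pending jobs; the health checker, time-limit enforcer
            // and power manager are gated the same way.
        }
        // Followers idle here and pick the work up if they win an election.
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}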

Deleted: spur-state crate

  • Entire crate removed (FileWalStore + SnapshotStore/redb) — Raft storage in raft.rs is the only persistence layer
  • redb removed from workspace dependencies
  • WalOperation enum kept in spur-core/src/wal.rs as the Raft log entry payload type
  • t53_state test module removed (coverage replaced by Raft storage unit tests)

K8s manifests

  • spurctld.yaml: 3 replicas, PVCs for Raft state, dual ports
  • configmap.yaml: peers via StatefulSet DNS
  • pdb.yaml: maxUnavailable: 1 for quorum safety
  • CI patches replicas to 1 for single-node tests

K8s operator bug fixes (job_controller.rs, agent.rs)

  • Fixed double job submission caused by finalizer 422 retry seeing stale status
  • Operator refreshes controller gRPC connection on address changes
  • Agent namespace-resolve backoff increased to 15s
  • Scheduler retries dispatch 3 times on transient agent failures
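
A sketch of the bounded dispatch retry; dispatch_to_agent(), the backoff, and the error type are placeholders:

use std::time::Duration;

// Stand-in for the real gRPC dispatch call to an agent.
async fn dispatch_to_agent(_job_id: u64) -> Result<(), String> {
    Ok(())
}

async fn dispatch_with_retry(job_id: u64) -> Result<(), String> {
    let mut last_err = String::new();
    for attempt in 1..=3u64 {
        match dispatch_to_agent(job_id).await {
            Ok(()) => return Ok(()),
            Err(e) => {
                // Transient agent failures (e.g. a pod mid-restart) get two
                // more chances before the job is marked failed.
                last_err = e;
                tokio::time::sleep(Duration::from_secs(attempt)).await;
            }
        }
    }
    Err(last_err)
}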

Tests

  • Unit tests for apply logic, snapshot round-trip, Raft storage, hostname parsing (all use single-node Raft via #[tokio::test])
  • K8s integration tests 1-12: SpurJob lifecycle, multi-node jobs, cancellation, failure detection, cross-namespace, Raft cluster formation, leader election, log replication, failover recovery
  • Net: -963 lines (eliminated dual persistence path)

Issues resolved

  • Snapshot scope mismatch: redb only saved jobs+nodes; single-node users silently lost reservations/steps/licenses on restart
  • Startup ordering: Raft replayed committed entries before applier was set, silently dropping them
  • Stale redb keys: SnapshotStore::save() never deleted removed jobs/nodes (redb has no clear())
  • Snapshot data not restored on startup: metadata was loaded but state machine was empty until Raft replayed post-snapshot entries

Config

[controller]
peers = ["host1:6821", "host2:6821", "host3:6821"]
# node_id = 1           # auto-detected from hostname ordinal if unset
# raft_listen_addr = "[::]:6821"

@shiv-tyagi
Member Author

shiv-tyagi commented Apr 15, 2026

The Cluster Tests failure seems to be pre-existing from yesterday.

The K8s integration tests, which include the Raft HA integration tests, pass.

@shiv-tyagi shiv-tyagi marked this pull request as ready for review April 15, 2026 15:56
@shiv-tyagi shiv-tyagi requested a review from powderluv April 15, 2026 15:56
@powderluv
Collaborator

lmk if the cluster CI is flaky and I can investigate it. Thanks for the changes. We can land once we get the CI green.

@shiv-tyagi shiv-tyagi marked this pull request as draft April 16, 2026 08:24
@shiv-tyagi
Member Author

shiv-tyagi commented Apr 16, 2026

Yes, there is a last-minute thing I want to fix before merging. I will let you know when it is ready. Thanks.

@shiv-tyagi shiv-tyagi marked this pull request as ready for review April 16, 2026 09:02
@shiv-tyagi
Member Author

shiv-tyagi commented Apr 16, 2026

@powderluv Cluster Tests are still failing, for a reason I suspect is not related to this work.

K8s integration tests PASS, which means we are good to merge. HA with failover handling is covered in those tests. Please hit merge whenever you are ready.

Defines AppendEntries, Vote, and InstallSnapshot RPCs using a bytes
envelope to avoid mirroring openraft's complex types in proto. Runs
on a dedicated port (6821) separate from the client API.

Made-with: Cursor
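
The bytes envelope in the commit above amounts to serializing openraft's request/response structs on one side and deserializing them on the other, so the .proto only carries an opaque payload field. A minimal sketch (serde_json is an arbitrary choice here, not necessarily what the PR uses):

use serde::{de::DeserializeOwned, Serialize};

// Pack any serializable Raft RPC into the opaque payload of the proto message.
fn encode_envelope<T: Serialize>(req: &T) -> Result<Vec<u8>, serde_json::Error> {
    serde_json::to_vec(req)
}

// Unpack it on the receiving side into the concrete openraft request type.
fn decode_envelope<T: DeserializeOwned>(bytes: &[u8]) -> Result<T, serde_json::Error> {
    serde_json::from_slice(bytes)
}
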
… bootstrap

- Disk-backed SpurStore (vote.json, log/*.json, snapshot.json) replaces
  in-memory storage so Raft state survives pod restarts (sketched below)
- Real gRPC transport via RaftInternal service (replaces stub Unreachable)
- Single-member bootstrap with add_learner retry for bare-metal robustness
- StateMachineApply trait for ClusterManager integration
- Config: add raft_listen_addr (default 6821), node_id auto-detection
  from hostname ordinal, fail-fast on undetermined node_id

Made-with: Cursor
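
The on-disk layout from the commit above could be persisted along these lines; a sketch only, with a simplified Vote type and an atomicity choice (write-then-rename) that belongs to this sketch, not necessarily to the PR:

use std::{fs, path::Path};

use serde::{Deserialize, Serialize};

// Simplified stand-in for the persisted vote; the real type comes from openraft.
#[derive(Serialize, Deserialize)]
struct Vote {
    term: u64,
    voted_for: Option<u64>,
}

fn save_vote(state_dir: &Path, vote: &Vote) -> std::io::Result<()> {
    let raft_dir = state_dir.join("raft");
    fs::create_dir_all(&raft_dir)?;
    // Write to a temp file and rename so a crash mid-write cannot leave a
    // truncated vote.json behind.
    let tmp = raft_dir.join("vote.json.tmp");
    fs::write(&tmp, serde_json::to_vec(vote).expect("vote serializes"))?;
    fs::rename(tmp, raft_dir.join("vote.json"))
}
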
…arding

ClusterManager:
- propose() routes through raft.client_write() or local WAL
- apply_wal_op_internal() is the single state-apply path for both modes,
  handling resource alloc/dealloc, license tracking, partition assignment
- Removed replay_entry() — startup WAL recovery uses apply_wal_op_internal
- Deadlock fixes: all mutation methods drop locks before calling propose()
- Snapshot includes reservations, steps, license_pool, hostname_aliases
- Redb snapshot open retries on stale flock from SIGKILL

Server:
- Leader check + LeaderProxy forwarding on all RPCs (reads and writes)
- x-spur-leader header uses client API port, not Raft port
- x-spur-forwarded header prevents forwarding loops (see the sketch below)

Scheduler/health: only run on leader node.

Made-with: Cursor
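
A sketch of the forwarding-loop guard behind the x-spur-forwarded header; the header name comes from this PR, while the function names and the exact refusal status are assumptions:

use tonic::{metadata::MetadataValue, Request, Status};

// Refuse a request that was already forwarded once, instead of bouncing it
// around the cluster while no leader is known.
fn ensure_not_already_forwarded<T>(req: &Request<T>) -> Result<(), Status> {
    if req.metadata().contains_key("x-spur-forwarded") {
        return Err(Status::unavailable("no leader available to handle request"));
    }
    Ok(())
}

// Mark an outgoing proxied request so the next node will not forward it again.
fn mark_forwarded<T>(req: &mut Request<T>) {
    req.metadata_mut()
        .insert("x-spur-forwarded", MetadataValue::from_static("1"));
}
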
- spurctld.yaml: 3 replicas, dual ports (6817 client + 6821 raft),
  volumeClaimTemplates for persistent Raft state (1Gi)
- configmap.yaml: controller.peers with StatefulSet DNS names
- pdb.yaml: maxUnavailable=1 to maintain quorum during rollouts

Made-with: Cursor
- Tests 9-12: Raft cluster deploy, leader election (committed vote),
  state replication (PVCs + log entries), failover recovery
- Patch replicas to 1 for single-node CI tests (Tests 1-8)
- Raft tests run last so no single-node restore step is needed

Made-with: Cursor
The reconciler could submit a SpurJob twice when the informer cache
was stale after the finalizer patch. Extract submit_to_controller()
which re-reads from the API server before submitting, and applies the
job-id label best-effort so the critical status patch always runs.

Use the operator's Service DNS as the advertised agent address instead
of the ephemeral pod IP, preventing stale addresses after rollouts.

Made-with: Cursor
…path

Make Raft always-on even for single-node deployments (1-member cluster
that self-elects instantly), eliminating the separate FileWalStore +
redb snapshot persistence path.

Key changes:
- SpurStore takes applier at construction; ClusterManager is created
  before Raft so recovery applies entries correctly
- Single-node mode synthesizes a local peer list instead of skipping Raft
- propose() always goes through raft.client_write(); no local WAL branch
- RaftHandle is non-optional throughout server.rs and scheduler_loop.rs
- Rename apply_wal_op_internal -> apply_operation and
  StateMachineApply::apply_wal_operation -> apply_operation
- Restore snapshot data into applier on startup (fixes latent data loss
  bug where persisted snapshot metadata was loaded but data was not
  deserialized into ClusterManager)

Fixes: snapshot scope mismatch (redb only saved jobs+nodes; Raft
snapshot includes reservations, steps, licenses, hostname_aliases),
startup ordering bug (Raft replayed entries before applier was set),
and stale redb keys (redb save never deleted removed entries).

Made-with: Cursor
The spur-state crate (FileWalStore + SnapshotStore/redb) is no longer
used now that Raft is the sole persistence path. Remove it entirely
along with the WalEntry wrapper, WalStore trait, WalStoreError, the
t53_state integration tests, and the redb workspace dependency.

WalOperation enum is kept in spur-core/src/wal.rs as the Raft log
entry payload type.

Made-with: Cursor
Reflect the unified Raft persistence model in CLAUDE.md, README.md,
quickstart.md, and k8s_test.sh. Remove references to WAL+redb,
spur-state crate, and the old single-node-without-Raft mode.

Made-with: Cursor
Made-with: Cursor
…membership

Replace the asymmetric bootstrap (node 1 initializes alone, then adds
others via add_learner + change_membership) with symmetric bootstrap
where every node calls raft.initialize() with the full membership set.

Openraft guarantees that when all nodes use the same membership, the
voting protocol picks exactly one leader via randomized election
timeouts. On subsequent restarts, initialize() is a no-op and normal
Raft elections take over.

This removes the node_id==1 bootstrap privilege, the 60-second
add_learner retry loop, the wait_for_leadership helper, and the
expand_voter_set background task. Nodes can now start in any order
with no ordering dependency.

Made-with: Cursor
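
A minimal sketch of the symmetric bootstrap described above, with RaftHandle standing in for the real openraft handle:

use std::collections::BTreeMap;

struct RaftHandle;

impl RaftHandle {
    // Stand-in for openraft's initialize(); per the commit above, re-running
    // it after a restart is effectively a no-op and elections take over.
    async fn initialize(&self, _members: BTreeMap<u64, String>) -> Result<(), String> {
        Ok(())
    }
}

async fn bootstrap(raft: &RaftHandle, peers: BTreeMap<u64, String>) {
    // Every node passes the identical, full membership set; randomized
    // election timeouts then pick exactly one leader.
    if let Err(e) = raft.initialize(peers).await {
        // Already initialized from a previous run: fall through to normal
        // Raft elections.
        eprintln!("initialize skipped: {e}");
    }
}
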
…ests

New tests 13-15 for the 3-replica Raft cluster:

- TEST 13: Submit a job, kill the Raft leader, verify the job ID is
  consistent on the new leader (Raft state survived failover) and a
  new leader was elected with a committed vote.

- TEST 14: After the leader failover from test 13, submit a new job
  to the new leader and verify it gets a higher job ID (proves the
  new leader accepts writes and the ID sequence is preserved).

- TEST 15: Verify all 3 nodes have replicated log entries, then kill
  one node and confirm it recovers with its log and vote intact.

Made-with: Cursor
Root cause: test_cluster() returned before the single-node Raft
finished self-electing. The first propose() hit a not-yet-leader node,
client_write failed, and propose() silently swallowed the error. The
job ID was allocated but never committed.

openraft's client_write DOES wait for apply (it returns the
apply_to_state_machine result), so there is no commit-before-apply
gap. The issue was purely that propose() was called before the leader
election completed.

Fix: use openraft's wait API in test_cluster() to block until the
leader is elected before returning. Fix doc comments on wait helpers
to accurately describe the cause.

Made-with: Cursor
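
The shape of the fix above, sketched with a polling loop; the real code uses openraft's wait API, and RaftHandle/current_leader() here are placeholders:

use std::time::Duration;

struct RaftHandle;

impl RaftHandle {
    async fn current_leader(&self) -> Option<u64> {
        Some(1)
    }
}

// Block until a leader exists (or time out) before handing the cluster to a
// test, so the first propose() never lands on a not-yet-leader node.
async fn wait_for_leader(raft: &RaftHandle, timeout: Duration) -> Result<u64, &'static str> {
    let deadline = tokio::time::Instant::now() + timeout;
    loop {
        if let Some(leader) = raft.current_leader().await {
            return Ok(leader);
        }
        if tokio::time::Instant::now() >= deadline {
            return Err("no leader elected before timeout");
        }
        tokio::time::sleep(Duration::from_millis(50)).await;
    }
}
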
PR ROCm#94 changed requeue pending_reason from Priority to None (issue
ROCm#90). Update the requeue_resets_fields_via_apply test assertion to
match.

Made-with: Cursor
@shiv-tyagi
Member Author

I rebased on main and the CI is passing now. Can we merge this @powderluv?

@powderluv powderluv merged commit e290bd9 into ROCm:main Apr 18, 2026
6 checks passed