
feat(spurctld): get Raft HA working for K8s and bare-metal #93

Merged
powderluv merged 14 commits into ROCm:main from shiv-tyagi:fix-raft
Apr 18, 2026

Conversation

@shiv-tyagi
Member

@shiv-tyagi shiv-tyagi commented Apr 15, 2026

Raft HA for spurctld

This gets Raft-based HA working on the K8s cluster and bare metal. Changes in #84 served as a foundation. The Raft log is now the sole persistence path — even single-node deployments run a 1-member Raft cluster, eliminating the dual WAL/redb code path entirely.

Raft core (new: raft.rs, raft_server.rs, raft_internal.proto)

  • Disk-backed storage for vote, log entries, and snapshots under state_dir/raft/
  • gRPC transport for Raft peer RPCs on dedicated port 6821 (separate from client API on 6817)
  • Single-member bootstrap with add_learner retry so bare-metal nodes can start simultaneously
  • Fail-fast when node_id cannot be determined from hostname or config
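
A minimal sketch of the hostname-ordinal fallback for node_id; it assumes a StatefulSet-style hostname such as "spurctld-2", and the function names and error handling are illustrative, not the actual implementation:

fn node_id_from_hostname(hostname: &str) -> Option<u64> {
    // Take the digits after the last '-' (e.g. "spurctld-2" -> 2).
    hostname.rsplit('-').next()?.parse::<u64>().ok()
}

fn resolve_node_id(configured: Option<u64>, hostname: &str) -> u64 {
    // Fail fast when neither config nor hostname yields an id.
    configured
        .or_else(|| node_id_from_hostname(hostname))
        .unwrap_or_else(|| panic!("node_id not set and not derivable from hostname {hostname}"))
}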

Always-on Raft (main.rs, raft.rs, cluster.rs)

  • Raft runs in all modes: single-node auto-synthesizes a local peer list and self-elects instantly.
  • SpurStore takes the applier (ClusterManager) at construction — Raft recovery replays entries into live state correctly on startup (see the sketch after this list)
  • Snapshot data is restored into ClusterManager on startup, fixing a latent data-loss bug where snapshot metadata was loaded but the actual state was not deserialized
  • ClusterManager::new() creates empty state; all recovery flows through Raft log replay + snapshot restore
  • Removed: FileWalStore, SnapshotStore (redb), open_snapshot_with_retry, maybe_snapshot, take_snapshot, dual-path propose(), WalEntry/WalStore/WalStoreError
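
A rough sketch of the applier seam mentioned above; only the names StateMachineApply, apply_operation, WalOperation, SpurStore, and ClusterManager come from this PR, while the signatures and variants are assumptions:

use std::sync::Arc;

// Raft log entry payload (kept in spur-core/src/wal.rs); variants are placeholders.
enum WalOperation {
    SubmitJob { id: u64 },
}

trait StateMachineApply: Send + Sync {
    // Called when an entry commits, and again during startup recovery
    // (snapshot restore followed by log replay).
    fn apply_operation(&self, op: &WalOperation);
}

struct ClusterManager;

impl StateMachineApply for ClusterManager {
    fn apply_operation(&self, _op: &WalOperation) {
        // Mutate jobs/nodes/reservations/licenses here.
    }
}

struct SpurStore {
    applier: Arc<dyn StateMachineApply>,
}

impl SpurStore {
    // Taking the applier at construction means entries replayed during
    // recovery land in live ClusterManager state instead of being dropped.
    fn new(applier: Arc<dyn StateMachineApply>) -> Self {
        Self { applier }
    }
}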

State replication (cluster.rs)

  • propose() always calls raft.client_write() — no local WAL branch
  • Unified apply path: apply_wal_op_internal renamed to apply_operation (called by Raft's apply_to_state_machine on commit)
  • Snapshot includes full cluster state (reservations, steps, licenses, aliases) not just jobs/nodes
  • Deadlock fixes: mutation methods drop locks before calling propose()
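
A minimal sketch of the lock-drop pattern behind those deadlock fixes; everything except propose() and WalOperation is illustrative:

use std::collections::HashMap;
use std::sync::Mutex;

enum WalOperation {
    CancelJob { id: u64 },
}

struct ClusterManager {
    jobs: Mutex<HashMap<u64, String>>,
}

impl ClusterManager {
    async fn propose(&self, _op: WalOperation) -> Result<(), String> {
        // In the real code this awaits raft.client_write(); apply_operation()
        // then takes the same locks when the entry commits.
        Ok(())
    }

    async fn cancel_job(&self, job_id: u64) -> Result<(), String> {
        let op = {
            // Scoped so the guard is released before the await below; holding
            // it across propose() could deadlock against the apply path.
            let jobs = self.jobs.lock().unwrap();
            if !jobs.contains_key(&job_id) {
                return Err("unknown job".into());
            }
            WalOperation::CancelJob { id: job_id }
        };
        self.propose(op).await
    }
}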

Leader gating (server.rs, scheduler_loop.rs)

  • RaftHandle is non-optional throughout — no if let Some(raft) branching
  • All gRPC RPCs hit on a follower are forwarded to the leader via LeaderProxy
  • x-spur-leader header hints the client API address (not the Raft port)
  • Scheduler, health checker, time-limit enforcer, power manager only run on leader
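
Roughly how the leader gating of the background loops can look; RaftHandle and is_leader() here are stand-ins, not the PR's actual API:

use std::time::Duration;

struct RaftHandle;

impl RaftHandle {
    // Stand-in for a leadership check against Raft metrics.
    async fn is_leader(&self) -> bool {
        true
    }
}

async fn scheduler_loop(raft: RaftHandle) {
    loop {
        if raft.is_leader().await {
            // Schedule pending jobs; the health checker, time-limit enforcer
            // and power manager are gated the same way.
        }
        // Followers idle here and pick the work up if they win an election.
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}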

Deleted: spur-state crate

  • Entire crate removed (FileWalStore + SnapshotStore/redb) — Raft storage in raft.rs is the only persistence layer
  • redb removed from workspace dependencies
  • WalOperation enum kept in spur-core/src/wal.rs as the Raft log entry payload type
  • t53_state test module removed (coverage replaced by Raft storage unit tests)

K8s manifests

  • spurctld.yaml: 3 replicas, PVCs for Raft state, dual ports
  • configmap.yaml: peers via StatefulSet DNS
  • pdb.yaml: maxUnavailable: 1 for quorum safety
  • CI patches replicas to 1 for single-node tests

K8s operator bug fixes (job_controller.rs, agent.rs)

  • Fixed double job submission caused by finalizer 422 retry seeing stale status
  • Operator refreshes controller gRPC connection on address changes
  • Agent namespace-resolve backoff increased to 15s
  • Scheduler retries dispatch 3 times on transient agent failures
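
A sketch of the bounded dispatch retry; dispatch_to_agent(), the backoff, and the error type are placeholders:

use std::time::Duration;

// Stand-in for the real gRPC dispatch call to an agent.
async fn dispatch_to_agent(_job_id: u64) -> Result<(), String> {
    Ok(())
}

async fn dispatch_with_retry(job_id: u64) -> Result<(), String> {
    let mut last_err = String::new();
    for attempt in 1..=3u64 {
        match dispatch_to_agent(job_id).await {
            Ok(()) => return Ok(()),
            Err(e) => {
                // Transient agent failures (e.g. a pod mid-restart) get two
                // more chances before the job is marked failed.
                last_err = e;
                tokio::time::sleep(Duration::from_secs(attempt)).await;
            }
        }
    }
    Err(last_err)
}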

Tests

  • Unit tests for apply logic, snapshot round-trip, Raft storage, hostname parsing (all use single-node Raft via #[tokio::test])
  • K8s integration tests 1-12: SpurJob lifecycle, multi-node jobs, cancellation, failure detection, cross-namespace, Raft cluster formation, leader election, log replication, failover recovery
  • Net: -963 lines (eliminated dual persistence path)

Issues resolved

  • Snapshot scope mismatch: redb only saved jobs+nodes; single-node users silently lost reservations/steps/licenses on restart
  • Startup ordering: Raft replayed committed entries before applier was set, silently dropping them
  • Stale redb keys: SnapshotStore::save() never deleted removed jobs/nodes (redb has no clear())
  • Snapshot data not restored on startup: metadata was loaded but state machine was empty until Raft replayed post-snapshot entries

Config

[controller]
peers = ["host1:6821", "host2:6821", "host3:6821"]
# node_id = 1           # auto-detected from hostname ordinal if unset
# raft_listen_addr = "[::]:6821"

@shiv-tyagi
Member Author

shiv-tyagi commented Apr 15, 2026

The Cluster Tests failure seems to be pre-existing from yesterday.

The K8s integration tests, which include the Raft HA integration tests, pass.

@shiv-tyagi shiv-tyagi marked this pull request as ready for review April 15, 2026 15:56
@shiv-tyagi shiv-tyagi requested a review from powderluv April 15, 2026 15:56
@powderluv
Collaborator

lmk if the cluster CI is flaky and I can investigate it. Thanks for the changes. We can land once we get the CI green.

@shiv-tyagi shiv-tyagi marked this pull request as draft April 16, 2026 08:24
@shiv-tyagi
Member Author

shiv-tyagi commented Apr 16, 2026

Yes, there is a last-minute thing I want to fix before merging. I will let you know when it is ready. Thanks.

@shiv-tyagi shiv-tyagi marked this pull request as ready for review April 16, 2026 09:02
@shiv-tyagi
Member Author

shiv-tyagi commented Apr 16, 2026

@powderluv Cluster Tests are still failing, for a reason I suspect is not related to this work.

K8s integration tests PASS, which means we are good to merge. HA with failover handling is covered in those tests. Please hit merge whenever you are ready.

Defines AppendEntries, Vote, and InstallSnapshot RPCs using a bytes
envelope to avoid mirroring openraft's complex types in proto. Runs
on a dedicated port (6821) separate from the client API.

Made-with: Cursor
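
The bytes envelope in the commit above amounts to serializing openraft's request/response structs on one side and deserializing them on the other, so the .proto only carries an opaque payload field. A minimal sketch (serde_json is an arbitrary choice here, not necessarily what the PR uses):

use serde::{de::DeserializeOwned, Serialize};

// Pack any serializable Raft RPC into the opaque payload of the proto message.
fn encode_envelope<T: Serialize>(req: &T) -> Result<Vec<u8>, serde_json::Error> {
    serde_json::to_vec(req)
}

// Unpack it on the receiving side into the concrete openraft request type.
fn decode_envelope<T: DeserializeOwned>(bytes: &[u8]) -> Result<T, serde_json::Error> {
    serde_json::from_slice(bytes)
}
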
… bootstrap

- Disk-backed SpurStore (vote.json, log/*.json, snapshot.json) replaces
  in-memory storage so Raft state survives pod restarts (sketched below)
- Real gRPC transport via RaftInternal service (replaces stub Unreachable)
- Single-member bootstrap with add_learner retry for bare-metal robustness
- StateMachineApply trait for ClusterManager integration
- Config: add raft_listen_addr (default 6821), node_id auto-detection
  from hostname ordinal, fail-fast on undetermined node_id

Made-with: Cursor
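
The on-disk layout from the commit above could be persisted along these lines; a sketch only, with a simplified Vote type and an atomicity choice (write-then-rename) that belongs to this sketch, not necessarily to the PR:

use std::{fs, path::Path};

use serde::{Deserialize, Serialize};

// Simplified stand-in for the persisted vote; the real type comes from openraft.
#[derive(Serialize, Deserialize)]
struct Vote {
    term: u64,
    voted_for: Option<u64>,
}

fn save_vote(state_dir: &Path, vote: &Vote) -> std::io::Result<()> {
    let raft_dir = state_dir.join("raft");
    fs::create_dir_all(&raft_dir)?;
    // Write to a temp file and rename so a crash mid-write cannot leave a
    // truncated vote.json behind.
    let tmp = raft_dir.join("vote.json.tmp");
    fs::write(&tmp, serde_json::to_vec(vote).expect("vote serializes"))?;
    fs::rename(tmp, raft_dir.join("vote.json"))
}
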
…arding

ClusterManager:
- propose() routes through raft.client_write() or local WAL
- apply_wal_op_internal() is the single state-apply path for both modes,
  handling resource alloc/dealloc, license tracking, partition assignment
- Removed replay_entry() — startup WAL recovery uses apply_wal_op_internal
- Deadlock fixes: all mutation methods drop locks before calling propose()
- Snapshot includes reservations, steps, license_pool, hostname_aliases
- Redb snapshot open retries on stale flock from SIGKILL

Server:
- Leader check + LeaderProxy forwarding on all RPCs (reads and writes)
- x-spur-leader header uses client API port, not Raft port
- x-spur-forwarded header prevents forwarding loops (see the sketch below)

Scheduler/health: only run on leader node.

Made-with: Cursor
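
A sketch of the forwarding-loop guard behind the x-spur-forwarded header; the header name comes from this PR, while the function names and the exact refusal status are assumptions:

use tonic::{metadata::MetadataValue, Request, Status};

// Refuse a request that was already forwarded once, instead of bouncing it
// around the cluster while no leader is known.
fn ensure_not_already_forwarded<T>(req: &Request<T>) -> Result<(), Status> {
    if req.metadata().contains_key("x-spur-forwarded") {
        return Err(Status::unavailable("no leader available to handle request"));
    }
    Ok(())
}

// Mark an outgoing proxied request so the next node will not forward it again.
fn mark_forwarded<T>(req: &mut Request<T>) {
    req.metadata_mut()
        .insert("x-spur-forwarded", MetadataValue::from_static("1"));
}
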
- spurctld.yaml: 3 replicas, dual ports (6817 client + 6821 raft),
  volumeClaimTemplates for persistent Raft state (1Gi)
- configmap.yaml: controller.peers with StatefulSet DNS names
- pdb.yaml: maxUnavailable=1 to maintain quorum during rollouts

Made-with: Cursor
- Tests 9-12: Raft cluster deploy, leader election (committed vote),
  state replication (PVCs + log entries), failover recovery
- Patch replicas to 1 for single-node CI tests (Tests 1-8)
- Raft tests run last so no single-node restore step is needed

Made-with: Cursor
The reconciler could submit a SpurJob twice when the informer cache
was stale after the finalizer patch. Extract submit_to_controller()
which re-reads from the API server before submitting, and applies the
job-id label best-effort so the critical status patch always runs.

Use the operator's Service DNS as the advertised agent address instead
of the ephemeral pod IP, preventing stale addresses after rollouts.

Made-with: Cursor
…path

Make Raft always-on even for single-node deployments (1-member cluster
that self-elects instantly), eliminating the separate FileWalStore +
redb snapshot persistence path.

Key changes:
- SpurStore takes applier at construction; ClusterManager is created
  before Raft so recovery applies entries correctly
- Single-node mode synthesizes a local peer list instead of skipping Raft
- propose() always goes through raft.client_write(); no local WAL branch
- RaftHandle is non-optional throughout server.rs and scheduler_loop.rs
- Rename apply_wal_op_internal -> apply_operation and
  StateMachineApply::apply_wal_operation -> apply_operation
- Restore snapshot data into applier on startup (fixes latent data loss
  bug where persisted snapshot metadata was loaded but data was not
  deserialized into ClusterManager)

Fixes: snapshot scope mismatch (redb only saved jobs+nodes; Raft
snapshot includes reservations, steps, licenses, hostname_aliases),
startup ordering bug (Raft replayed entries before applier was set),
and stale redb keys (redb save never deleted removed entries).

Made-with: Cursor
The spur-state crate (FileWalStore + SnapshotStore/redb) is no longer
used now that Raft is the sole persistence path. Remove it entirely
along with the WalEntry wrapper, WalStore trait, WalStoreError, the
t53_state integration tests, and the redb workspace dependency.

WalOperation enum is kept in spur-core/src/wal.rs as the Raft log
entry payload type.

Made-with: Cursor
Reflect the unified Raft persistence model in CLAUDE.md, README.md,
quickstart.md, and k8s_test.sh. Remove references to WAL+redb,
spur-state crate, and the old single-node-without-Raft mode.

Made-with: Cursor
Made-with: Cursor
…membership

Replace the asymmetric bootstrap (node 1 initializes alone, then adds
others via add_learner + change_membership) with symmetric bootstrap
where every node calls raft.initialize() with the full membership set.

Openraft guarantees that when all nodes use the same membership, the
voting protocol picks exactly one leader via randomized election
timeouts. On subsequent restarts, initialize() is a no-op and normal
Raft elections take over.

This removes the node_id==1 bootstrap privilege, the 60-second
add_learner retry loop, the wait_for_leadership helper, and the
expand_voter_set background task. Nodes can now start in any order
with no ordering dependency.

Made-with: Cursor
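
A minimal sketch of the symmetric bootstrap described above, with RaftHandle standing in for the real openraft handle:

use std::collections::BTreeMap;

struct RaftHandle;

impl RaftHandle {
    // Stand-in for openraft's initialize(); per the commit above, re-running
    // it after a restart is effectively a no-op and elections take over.
    async fn initialize(&self, _members: BTreeMap<u64, String>) -> Result<(), String> {
        Ok(())
    }
}

async fn bootstrap(raft: &RaftHandle, peers: BTreeMap<u64, String>) {
    // Every node passes the identical, full membership set; randomized
    // election timeouts then pick exactly one leader.
    if let Err(e) = raft.initialize(peers).await {
        // Already initialized from a previous run: fall through to normal
        // Raft elections.
        eprintln!("initialize skipped: {e}");
    }
}
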
…ests

New tests 13-15 for the 3-replica Raft cluster:

- TEST 13: Submit a job, kill the Raft leader, verify the job ID is
  consistent on the new leader (Raft state survived failover) and a
  new leader was elected with a committed vote.

- TEST 14: After the leader failover from test 13, submit a new job
  to the new leader and verify it gets a higher job ID (proves the
  new leader accepts writes and the ID sequence is preserved).

- TEST 15: Verify all 3 nodes have replicated log entries, then kill
  one node and confirm it recovers with its log and vote intact.

Made-with: Cursor
Root cause: test_cluster() returned before the single-node Raft
finished self-electing. The first propose() hit a not-yet-leader node,
client_write failed, and propose() silently swallowed the error. The
job ID was allocated but never committed.

openraft's client_write DOES wait for apply (it returns the
apply_to_state_machine result), so there is no commit-before-apply
gap. The issue was purely that propose() was called before the leader
election completed.

Fix: use openraft's wait API in test_cluster() to block until the
leader is elected before returning. Fix doc comments on wait helpers
to accurately describe the cause.

Made-with: Cursor
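
The shape of the fix above, sketched with a polling loop; the real code uses openraft's wait API, and RaftHandle/current_leader() here are placeholders:

use std::time::Duration;

struct RaftHandle;

impl RaftHandle {
    async fn current_leader(&self) -> Option<u64> {
        Some(1)
    }
}

// Block until a leader exists (or time out) before handing the cluster to a
// test, so the first propose() never lands on a not-yet-leader node.
async fn wait_for_leader(raft: &RaftHandle, timeout: Duration) -> Result<u64, &'static str> {
    let deadline = tokio::time::Instant::now() + timeout;
    loop {
        if let Some(leader) = raft.current_leader().await {
            return Ok(leader);
        }
        if tokio::time::Instant::now() >= deadline {
            return Err("no leader elected before timeout");
        }
        tokio::time::sleep(Duration::from_millis(50)).await;
    }
}
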
PR ROCm#94 changed requeue pending_reason from Priority to None (issue
ROCm#90). Update the requeue_resets_fields_via_apply test assertion to
match.

Made-with: Cursor
@shiv-tyagi
Member Author

I rebased on main and the CI is passing now. Can we merge this @powderluv?

@powderluv powderluv merged commit e290bd9 into ROCm:main Apr 18, 2026
6 checks passed