
feat: RDMA Production Phases — Split-CB TP, chunked transfer fix, fair benchmarks#87

Merged
0xDaizz merged 15 commits into main from rdma-production-phases on Mar 15, 2026

Conversation

@0xDaizz (Owner) commented Mar 15, 2026

Summary

  • Fix chunked_recv/chunked_sendrecv over-posting bug — root cause of CQ timeout on multi-chunk (28KB+) RDMA transfers. The repost condition now correctly accounts for already in-flight recvs, preventing phantom recv posts that have no matching send
  • Implement Split-CB TP path (forward_with_group_split_cb) — batches attention and FFN ops into 2 command buffers per layer instead of 12 per-op dispatches. Result: 18,193 us → 392 us per layer (46x improvement)
  • Enable gate-up merged weights + weight pre-transpose in distributed benchmark for production-equivalent measurements
  • Add RoPE + KV cache (128 tokens) to benchmark for fair RMLX vs MLX comparison
  • Upgrade MLX benchmark to production path (mx.fast.rms_norm/rope/sdpa + mx.compile + KV cache)
  • Remove misdiagnosed TB5 driver bug workarounds — replace sleep(1ms) with proper CQ completion polling
  • Add CLI args to all scripts (--node0, --node1, --node0-ip, --node1-ip, --remote-dir, --help) with env var fallbacks
  • Remove all hardcoded hostnames/IPs from scripts, tests, and examples
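
The repost condition described in the first bullet can be sketched as pure bookkeeping. This is an illustrative model, not the actual rmlx-rdma API; `recvs_to_post` and its parameters are hypothetical names.

```rust
/// Number of additional recvs to post, capped so that the total posted
/// never exceeds the chunks actually needed. The over-posting bug
/// reposted a recv per completion without checking this bound, leaving
/// phantom recvs with no matching send and stalling the CQ.
pub fn recvs_to_post(
    recvs_needed: usize, // total chunks expected for this transfer
    recvs_posted: usize, // recvs posted so far (completed or in flight)
    queue_depth: usize,  // max recvs allowed outstanding at once
    in_flight: usize,    // posted but not yet completed
) -> usize {
    let remaining = recvs_needed.saturating_sub(recvs_posted);
    let slots = queue_depth.saturating_sub(in_flight);
    remaining.min(slots)
}

fn main() {
    // 28KB payload in 8KB chunks -> 4 chunks; 2 posted, 2 still in flight.
    assert_eq!(recvs_to_post(4, 2, 4, 2), 2);
    // All chunks posted: the fixed condition posts zero more.
    assert_eq!(recvs_to_post(4, 4, 4, 0), 0);
}
```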

Benchmark Results (Mixtral 8x7B-like, f16, decode, 2-node TB5 RDMA)

| Metric | RMLX | MLX (mx.compile) | Ratio |
|---|---|---|---|
| TP=1 single-CB (gate-up merged) | 200 us | 2,002 us | RMLX 10.0x |
| TP=2 sharded compute | 133 us | 890 us | RMLX 6.7x |
| TP=2 Split-CB + RDMA (measured) | 392 us | 921 us (est.) | RMLX 2.4x |
| TP=2 per-op dispatch (old path) | 18,193 us | n/a | n/a |
| RDMA allreduce (1x, 8KB) | 17.9 us | n/a | n/a |

Test plan

  • 2-node RDMA distributed benchmark passes (both ranks, all sections)
  • Single-node baseline benchmark passes
  • MLX benchmark runs with production-optimized path
  • cargo fmt --all --check passes
  • cargo clippy --workspace --all-targets passes
  • PII scan clean (no internal hostnames/IPs in tracked files)
  • All scripts support --help with documented arguments
  • Benchmark fairness validated by Codex review

🤖 Generated with Claude Code

0xDaizz and others added 15 commits March 15, 2026 17:16
Phase 1: EP combine path now uses transport-backed Group from dispatch
exchange instead of creating a stub Group without transport. This
enables MoE combine to use RDMA for multi-node token exchange.

Phase 2: Added tp_2node_e2e.rs with RowParallelLinear and
ColumnParallelLinear 2-node RDMA E2E tests with CPU reference
verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…c trait analysis

- mesh allreduce: f16/bf16 native path (no f32 expansion), halves RDMA transfer
- transport: replace send-side sleep(1ms) with yield_now() (recv-side kept for safety)
- RdmaTransport trait: documented why async methods aren't needed (sendrecv already overlaps)
- Phase 4c analysis: type coupling + existing overlap + downcast escape hatch
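
Why the native f16/bf16 path halves traffic follows from the standard ring-allreduce volume formula. A minimal sketch (function name and layout are illustrative, not the rmlx API):

```rust
/// Bytes a single rank moves in a ring allreduce: reduce-scatter plus
/// allgather, i.e. 2*(n-1) chunks of elems/n elements each.
pub fn ring_allreduce_bytes(elems: usize, elem_size: usize, ranks: usize) -> usize {
    2 * (ranks - 1) * (elems / ranks) * elem_size
}

fn main() {
    let elems = 1 << 20; // 1M elements
    let f32_bytes = ring_allreduce_bytes(elems, 4, 2); // expand-to-f32 path
    let f16_bytes = ring_allreduce_bytes(elems, 2, 2); // native f16 path
    // Skipping the f32 expansion exactly halves the RDMA transfer.
    assert_eq!(f32_bytes, 2 * f16_bytes);
}
```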

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Q corruption

The "TB5 RDMA driver bug" that corrupted RQ state on ibv_poll_cq was
actually stale CQ completions from prior operations. JACCL (Apple's
reference UC QP implementation) uses CQ poll for recv without any sleep.

Root cause: CQ wasn't drained between connection teardown/recreate,
leaving old completions that confused subsequent recv polling.
graceful_shutdown's CQ drain fixes the real issue.

Verified: 3 consecutive 2-node test passes with yield_now() instead
of sleep(1ms) on recv-side.
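
The stale-completion failure mode can be modeled with a toy queue. `Cq` below stands in for the real ibverbs completion queue; the point is that draining on teardown (as `graceful_shutdown` now does) prevents old completions from being mistaken for the new connection's recvs.

```rust
use std::collections::VecDeque;

// Toy completion queue standing in for an ibverbs CQ.
pub struct Cq {
    pending: VecDeque<u64>, // wr_ids of unconsumed completions
}

impl Cq {
    pub fn new() -> Self {
        Cq { pending: VecDeque::new() }
    }
    pub fn complete(&mut self, wr_id: u64) {
        self.pending.push_back(wr_id);
    }
    pub fn poll(&mut self) -> Option<u64> {
        self.pending.pop_front()
    }
    /// Teardown-time drain: discard stale completions, return the count.
    pub fn drain(&mut self) -> usize {
        let n = self.pending.len();
        self.pending.clear();
        n
    }
}

fn main() {
    let mut cq = Cq::new();
    cq.complete(1); // completions left over from the old connection
    cq.complete(2);
    assert_eq!(cq.drain(), 2); // graceful_shutdown drains them
    cq.complete(3); // the new connection's recv completes
    assert_eq!(cq.poll(), Some(3)); // poll sees only the fresh completion
}
```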

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-init, dead code cleanup

Phase 5a: Replace intermediate Vec allocation in allgather interleave
with direct Metal buffer write. Eliminates 1 allocation + 1 copy for
ColumnParallelLinear and QuantizedColumnParallelLinear.

Phase 5c: Move cpu_matmul_f32 and read_f32_strided from distributed
to test-only cfg (unused in production since GPU matmul migration).

Phase 6d: Auto-attach ProgressEngine in init.rs for default async
operation support.

Phase 6e: Document TopologyRing integration gap for future topology-
aware ring ordering in collectives.
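
The Phase 5a change can be sketched as writing each rank's shard directly into its interleaved column range of the destination, skipping the intermediate Vec and copy. The row-major, column-sharded layout here is an assumption for illustration:

```rust
/// Write one rank's allgather shard directly into `dst` at its column
/// offset, avoiding an intermediate allocation. `dst` holds `rows` rows
/// of `shard_cols * world` columns, row-major.
pub fn interleave_into(
    dst: &mut [f32],
    shard: &[f32],
    rank: usize,
    world: usize,
    rows: usize,
    shard_cols: usize,
) {
    let total_cols = shard_cols * world;
    for r in 0..rows {
        for c in 0..shard_cols {
            dst[r * total_cols + rank * shard_cols + c] = shard[r * shard_cols + c];
        }
    }
}

fn main() {
    // 2 rows, world=2, 2 columns per shard.
    let mut dst = vec![0.0f32; 2 * 4];
    interleave_into(&mut dst, &[1.0, 2.0, 3.0, 4.0], 0, 2, 2, 2);
    interleave_into(&mut dst, &[5.0, 6.0, 7.0, 8.0], 1, 2, 2, 2);
    assert_eq!(dst, vec![1.0, 2.0, 5.0, 6.0, 3.0, 4.0, 7.0, 8.0]);
}
```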

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion recovery

A. Async EP: exchange_and_compute splits into blocking/async paths based
   on runtime_ctx availability. Async path uses zero-copy combine.
   dispatch_async wiring documented as TODO (AcquiredBuffer API mismatch).
   FP8 exchange config flag added (enable_fp8, default false).

B. allreduce_in_place TP: RowParallelLinear and QuantizedRowParallelLinear
   now modify Metal buffer directly via to_bytes_mut(), eliminating Vec
   allocation + Array reconstruction per forward pass.

C. MR cache: RdmaConnection caches nocopy MR registrations by (ptr, len).
   Repeated sends from same Metal buffer skip ibv_reg_mr (~10-50µs saved).

D. Connection recovery: send/recv/sendrecv retry once on transient errors
   (Timeout, CqPoll). Non-transient errors propagate immediately.

E. HealthMonitor integration documented (TODO in init.rs).
   eprintln→tracing migration in init.rs and context.rs.
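
Item C's MR cache idea, memoizing registrations by `(ptr, len)`, can be sketched with a plain HashMap. The struct and counter below are illustrative, not the actual RdmaConnection internals; `get_or_register` stands in for the path that would otherwise call `ibv_reg_mr` each time.

```rust
use std::collections::HashMap;

/// Memoizes memory-region registrations by (ptr, len) so repeated sends
/// from the same Metal buffer skip the expensive register call.
pub struct MrCache {
    cache: HashMap<(usize, usize), u32>, // (ptr, len) -> mr handle
    pub registrations: u32,              // counts expensive register calls
}

impl MrCache {
    pub fn new() -> Self {
        MrCache { cache: HashMap::new(), registrations: 0 }
    }

    pub fn get_or_register(&mut self, ptr: usize, len: usize) -> u32 {
        let regs = &mut self.registrations;
        *self.cache.entry((ptr, len)).or_insert_with(|| {
            // Real code would call ibv_reg_mr here (~10-50us per call).
            *regs += 1;
            *regs
        })
    }
}

fn main() {
    let mut mrs = MrCache::new();
    let a = mrs.get_or_register(0x1000, 4096);
    let b = mrs.get_or_register(0x1000, 4096); // cache hit: no re-register
    assert_eq!(a, b);
    assert_eq!(mrs.registrations, 1);
    mrs.get_or_register(0x2000, 4096); // different buffer registers anew
    assert_eq!(mrs.registrations, 2);
}
```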

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…add TP E2E to script

- rdma_2node_integration: add 3 subtests (broadcast, reduce_scatter, MR cache)
- test_rdma_2node.sh: include tp_2node_e2e binary as Step 5/5
- Script handles TP binary build/discovery with separate port range

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Async EP pipeline fully connected:
- AcquiredBuffer::shared_buffer() accessor for Arc<SharedBuffer> access
- SharedBufferTier uses Vec<Arc<SharedBuffer>> for shared ownership
- exchange_and_compute_async: dispatch_async → compute (overlap) →
  combine_async_start → wait → combine_async_finish
- Blocking path preserved as fallback when runtime_ctx absent

EP 2-node tests added to rdma_2node_integration suite:
- Phase 8: asymmetric sendrecv (EP dispatch pattern)
- Phase 9: dispatch→compute→combine round-trip simulation

Stale TODO comments removed (Phase 3a/3b now complete).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fair comparison benchmarks for all distributed features:
- Raw send/recv throughput (4KB-16MB, 6 sizes)
- Allreduce f16/f32 (ring/mesh native)
- Allgather, all-to-all
- EP transport (sendrecv vs all_to_all)
- EP pipeline (MLX-only: full moe_dispatch/combine)
- Broadcast, reduce_scatter (RMLX-only)

Fairness verified by Opus + Codex:
- Identical sizes, warmup(10), iterations(30)
- Barrier before every iteration on both sides
- Same EP configs [(16,1024)...(512,1024)]
- JSON output with matching metrics
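
The fairness protocol above (warmup, fixed iteration count, barrier before every timed op) reduces to a small loop. This is a sketch with the barrier and op passed as closures; the real harness uses the distributed RDMA barrier.

```rust
use std::time::Instant;

/// Mean seconds per iteration: run `warmup` untimed passes, then `iters`
/// timed passes, syncing via `barrier` before every op so both ranks
/// start together and neither measures the other's skew.
pub fn bench(
    mut barrier: impl FnMut(),
    mut op: impl FnMut(),
    warmup: usize,
    iters: usize,
) -> f64 {
    for _ in 0..warmup {
        barrier();
        op();
    }
    let mut total = 0.0;
    for _ in 0..iters {
        barrier(); // sync ranks before starting the clock
        let t = Instant::now();
        op();
        total += t.elapsed().as_secs_f64();
    }
    total / iters as f64
}

fn main() {
    let mut calls = 0usize;
    // A no-op closure stands in for the 2-node barrier.
    let mean = bench(|| {}, || calls += 1, 10, 30);
    assert_eq!(calls, 40); // warmup(10) + iterations(30)
    assert!(mean >= 0.0);
}
```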

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nment, EP all_to_all

- send/recv: both ranks send+recv simultaneously (RMLX sendrecv, MLX send+recv crossover)
- allgather: RMLX per-rank contribution = size/ws (matches MLX convention)
- EP transport: RMLX uses all_to_all instead of sendrecv (matches MLX)
- All barriers use .unwrap()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- chunked_sendrecv: offset += chunk instead of chunk_size (fixes panic
  on non-aligned payload sizes, e.g. allreduce barrier with 4 bytes)
- benchmark: switch from init() coordinator to RdmaConnection::establish()
  (same reliable pattern as 2-node tests)
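
The offset fix amounts to advancing by the actual chunk length rather than the fixed chunk size, so a non-aligned final chunk is sized correctly. A minimal sketch (function name is illustrative):

```rust
/// Split a payload of `total` bytes into (offset, len) chunks. The last
/// chunk may be shorter than `chunk_size`; advancing by `chunk` instead
/// of `chunk_size` is the fix for the panic on non-aligned sizes.
pub fn chunk_offsets(total: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < total {
        let chunk = chunk_size.min(total - offset); // last chunk may be short
        out.push((offset, chunk));
        offset += chunk; // the bug advanced by chunk_size here
    }
    out
}

fn main() {
    // 10 bytes in 4-byte chunks: final chunk is 2 bytes, not 4.
    assert_eq!(chunk_offsets(10, 4), vec![(0, 4), (4, 4), (8, 2)]);
    // A 4-byte barrier payload fits in a single short chunk.
    assert_eq!(chunk_offsets(4, 8192), vec![(0, 4)]);
}
```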

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gnostics

- rdma_setup.py: add networksetup -deletenetworkservice for TB Bridge,
  verify IP actually holds after setting (macOS configd override detection)
- Removed EXO Thunderbolt network services from both hwStudio nodes
- Added device name to alloc_pd error/success messages for diagnostics
- Benchmark establish timeout increased to 30s
- run_rmlx_bench.sh updated for establish pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fair benchmark

Fix chunked_recv/chunked_sendrecv over-posting bug that caused CQ timeouts
on multi-chunk transfers: guard nocopy send for single-chunk only, track
recvs_posted to prevent re-posting beyond needed chunks.

Remove misdiagnosed TB5 driver bug workarounds (sleep→proper CQ polling)
in connection.rs warmup and all ring collectives (allreduce, allgather,
reduce_scatter).

Add Split-CB TP path (forward_with_group_split_cb) that batches attention
and FFN ops into 2 command buffers per layer instead of per-op dispatch.

Upgrade benchmarks for fair comparison:
- RMLX: RoPE + 128-token KV cache, gate-up merge, weight pre-transpose
- MLX: mx.fast.* fused kernels, mx.compile, KV cache
- Script: hardcoded hostnames/IPs → env vars, cd→env -C

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add Phase 7 entry to roadmap (Split-CB TP, chunked transfer fix, fair benchmarks)
- Add changelog entries (Added, Performance, Fixed sections)
- Document chunked transfer over-posting fix and nocopy guard in rmlx-rdma
- Document Split-CB TP path (forward_with_group_split_cb) in rmlx-nn
- Add Split-CB reference in rmlx-distributed
- Remove references to misdiagnosed TB5 driver bug workarounds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace all internal hostnames and RDMA IPs with env vars + CLI args
- All scripts now support --node0, --node1, --node0-ip, --node1-ip,
  --remote-dir flags with --help documentation
- Priority chain: CLI args > env vars > generic defaults
- Add rmlx-hosts.json to .gitignore (local config, not for repo)
- Replace test fixture IPs with generic 10.0.0.x addresses
- Add clippy allow attributes for arc_with_non_send_sync, too_many_arguments
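
The priority chain (CLI args > env vars > generic defaults) resolves to a one-liner. The sketch below models the env lookup as an `Option` for testability; the `--node0` / `RMLX_NODE0` pairing mirrors the PR, but the resolver itself is illustrative.

```rust
/// Resolve a setting with CLI args taking precedence over env vars,
/// which take precedence over a generic default.
pub fn resolve(cli: Option<&str>, env: Option<&str>, default: &str) -> String {
    cli.or(env).unwrap_or(default).to_string()
}

fn main() {
    // Neither --node0 nor RMLX_NODE0 set: fall back to the default.
    assert_eq!(resolve(None, None, "node0.local"), "node0.local");
    // Env var set, no CLI flag: env var wins over the default.
    assert_eq!(resolve(None, Some("studio-a"), "node0.local"), "studio-a");
    // CLI flag set: it wins over everything.
    assert_eq!(resolve(Some("cli-host"), Some("studio-a"), "node0.local"), "cli-host");
}
```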

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@0xDaizz 0xDaizz merged commit 045d2ba into main Mar 15, 2026
7 checks passed
@0xDaizz 0xDaizz deleted the rdma-production-phases branch March 15, 2026 15:18