
feat: RDMA Production Phases — Split-CB TP, chunked transfer fix, fair benchmarks#87

Merged
0xDaizz merged 15 commits into main from rdma-production-phases on Mar 15, 2026

Conversation

@0xDaizz (Owner) commented Mar 15, 2026

Summary

  • Fix chunked_recv/chunked_sendrecv over-posting bug — root cause of CQ timeout on multi-chunk (28KB+) RDMA transfers. The repost condition now correctly accounts for already in-flight recvs, preventing phantom recv posts that have no matching send
  • Implement Split-CB TP path (forward_with_group_split_cb) — batches attention and FFN ops into 2 command buffers per layer instead of 12 per-op dispatches. Result: 18,193 us → 392 us per layer (46x improvement)
  • Enable gate-up merged weights + weight pre-transpose in distributed benchmark for production-equivalent measurements
  • Add RoPE + KV cache (128 tokens) to benchmark for fair RMLX vs MLX comparison
  • Upgrade MLX benchmark to production path (mx.fast.rms_norm/rope/sdpa + mx.compile + KV cache)
  • Remove misdiagnosed TB5 driver bug workarounds — replace sleep(1ms) with proper CQ completion polling
  • Add CLI args to all scripts (--node0, --node1, --node0-ip, --node1-ip, --remote-dir, --help) with env var fallbacks
  • Remove all hardcoded hostnames/IPs from scripts, tests, and examples
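
The repost condition described in the first bullet can be sketched as pure bookkeeping. This is an illustrative model, not the actual rmlx-rdma API; `recvs_to_post` and its parameters are hypothetical names.

```rust
/// Number of additional recvs to post, capped so that the total posted
/// never exceeds the chunks actually needed. The over-posting bug
/// reposted a recv per completion without checking this bound, leaving
/// phantom recvs with no matching send and stalling the CQ.
pub fn recvs_to_post(
    recvs_needed: usize, // total chunks expected for this transfer
    recvs_posted: usize, // recvs posted so far (completed or in flight)
    queue_depth: usize,  // max recvs allowed outstanding at once
    in_flight: usize,    // posted but not yet completed
) -> usize {
    let remaining = recvs_needed.saturating_sub(recvs_posted);
    let slots = queue_depth.saturating_sub(in_flight);
    remaining.min(slots)
}

fn main() {
    // 28KB payload in 8KB chunks -> 4 chunks; 2 posted, 2 still in flight.
    assert_eq!(recvs_to_post(4, 2, 4, 2), 2);
    // All chunks posted: the fixed condition posts zero more.
    assert_eq!(recvs_to_post(4, 4, 4, 0), 0);
}
```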

Benchmark Results (Mixtral 8x7B-like, f16, decode, 2-node TB5 RDMA)

| Metric | RMLX | MLX (mx.compile) | Ratio |
|---|---|---|---|
| TP=1 single-CB (gate-up merged) | 200 us | 2,002 us | RMLX 10.0x |
| TP=2 sharded compute | 133 us | 890 us | RMLX 6.7x |
| TP=2 Split-CB + RDMA (measured) | 392 us | 921 us (est.) | RMLX 2.4x |
| TP=2 per-op dispatch (old path) | 18,193 us | n/a | n/a |
| RDMA allreduce (1x, 8KB) | 17.9 us | n/a | n/a |

Test plan

  • 2-node RDMA distributed benchmark passes (both ranks, all sections)
  • Single-node baseline benchmark passes
  • MLX benchmark runs with production-optimized path
  • cargo fmt --all --check passes
  • cargo clippy --workspace --all-targets passes
  • PII scan clean (no internal hostnames/IPs in tracked files)
  • All scripts support --help with documented arguments
  • Benchmark fairness validated by Codex review

🤖 Generated with Claude Code

0xDaizz and others added 15 commits March 15, 2026 17:16
Phase 1: EP combine path now uses transport-backed Group from dispatch
exchange instead of creating a stub Group without transport. This
enables MoE combine to use RDMA for multi-node token exchange.

Phase 2: Added tp_2node_e2e.rs with RowParallelLinear and
ColumnParallelLinear 2-node RDMA E2E tests with CPU reference
verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…c trait analysis

- mesh allreduce: f16/bf16 native path (no f32 expansion), halves RDMA transfer
- transport: replace send-side sleep(1ms) with yield_now() (recv-side kept for safety)
- RdmaTransport trait: documented why async methods aren't needed (sendrecv already overlaps)
- Phase 4c analysis: type coupling + existing overlap + downcast escape hatch
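
Why the native f16/bf16 path halves traffic follows from the standard ring-allreduce volume formula. A minimal sketch (function name and layout are illustrative, not the rmlx API):

```rust
/// Bytes a single rank moves in a ring allreduce: reduce-scatter plus
/// allgather, i.e. 2*(n-1) chunks of elems/n elements each.
pub fn ring_allreduce_bytes(elems: usize, elem_size: usize, ranks: usize) -> usize {
    2 * (ranks - 1) * (elems / ranks) * elem_size
}

fn main() {
    let elems = 1 << 20; // 1M elements
    let f32_bytes = ring_allreduce_bytes(elems, 4, 2); // expand-to-f32 path
    let f16_bytes = ring_allreduce_bytes(elems, 2, 2); // native f16 path
    // Skipping the f32 expansion exactly halves the RDMA transfer.
    assert_eq!(f32_bytes, 2 * f16_bytes);
}
```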

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Q corruption

The "TB5 RDMA driver bug" that corrupted RQ state on ibv_poll_cq was
actually stale CQ completions from prior operations. JACCL (Apple's
reference UC QP implementation) uses CQ poll for recv without any sleep.

Root cause: CQ wasn't drained between connection teardown/recreate,
leaving old completions that confused subsequent recv polling.
graceful_shutdown's CQ drain fixes the real issue.

Verified: 3 consecutive 2-node test passes with yield_now() instead
of sleep(1ms) on recv-side.
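
The stale-completion failure mode can be modeled with a toy queue. `Cq` below stands in for the real ibverbs completion queue; the point is that draining on teardown (as `graceful_shutdown` now does) prevents old completions from being mistaken for the new connection's recvs.

```rust
use std::collections::VecDeque;

// Toy completion queue standing in for an ibverbs CQ.
pub struct Cq {
    pending: VecDeque<u64>, // wr_ids of unconsumed completions
}

impl Cq {
    pub fn new() -> Self {
        Cq { pending: VecDeque::new() }
    }
    pub fn complete(&mut self, wr_id: u64) {
        self.pending.push_back(wr_id);
    }
    pub fn poll(&mut self) -> Option<u64> {
        self.pending.pop_front()
    }
    /// Teardown-time drain: discard stale completions, return the count.
    pub fn drain(&mut self) -> usize {
        let n = self.pending.len();
        self.pending.clear();
        n
    }
}

fn main() {
    let mut cq = Cq::new();
    cq.complete(1); // completions left over from the old connection
    cq.complete(2);
    assert_eq!(cq.drain(), 2); // graceful_shutdown drains them
    cq.complete(3); // the new connection's recv completes
    assert_eq!(cq.poll(), Some(3)); // poll sees only the fresh completion
}
```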

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-init, dead code cleanup

Phase 5a: Replace intermediate Vec allocation in allgather interleave
with direct Metal buffer write. Eliminates 1 allocation + 1 copy for
ColumnParallelLinear and QuantizedColumnParallelLinear.

Phase 5c: Move cpu_matmul_f32 and read_f32_strided from distributed
to test-only cfg (unused in production since GPU matmul migration).

Phase 6d: Auto-attach ProgressEngine in init.rs for default async
operation support.

Phase 6e: Document TopologyRing integration gap for future topology-
aware ring ordering in collectives.
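
The Phase 5a change can be sketched as writing each rank's shard directly into its interleaved column range of the destination, skipping the intermediate Vec and copy. The row-major, column-sharded layout here is an assumption for illustration:

```rust
/// Write one rank's allgather shard directly into `dst` at its column
/// offset, avoiding an intermediate allocation. `dst` holds `rows` rows
/// of `shard_cols * world` columns, row-major.
pub fn interleave_into(
    dst: &mut [f32],
    shard: &[f32],
    rank: usize,
    world: usize,
    rows: usize,
    shard_cols: usize,
) {
    let total_cols = shard_cols * world;
    for r in 0..rows {
        for c in 0..shard_cols {
            dst[r * total_cols + rank * shard_cols + c] = shard[r * shard_cols + c];
        }
    }
}

fn main() {
    // 2 rows, world=2, 2 columns per shard.
    let mut dst = vec![0.0f32; 2 * 4];
    interleave_into(&mut dst, &[1.0, 2.0, 3.0, 4.0], 0, 2, 2, 2);
    interleave_into(&mut dst, &[5.0, 6.0, 7.0, 8.0], 1, 2, 2, 2);
    assert_eq!(dst, vec![1.0, 2.0, 5.0, 6.0, 3.0, 4.0, 7.0, 8.0]);
}
```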

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion recovery

A. Async EP: exchange_and_compute splits into blocking/async paths based
   on runtime_ctx availability. Async path uses zero-copy combine.
   dispatch_async wiring documented as TODO (AcquiredBuffer API mismatch).
   FP8 exchange config flag added (enable_fp8, default false).

B. allreduce_in_place TP: RowParallelLinear and QuantizedRowParallelLinear
   now modify Metal buffer directly via to_bytes_mut(), eliminating Vec
   allocation + Array reconstruction per forward pass.

C. MR cache: RdmaConnection caches nocopy MR registrations by (ptr, len).
   Repeated sends from same Metal buffer skip ibv_reg_mr (~10-50µs saved).

D. Connection recovery: send/recv/sendrecv retry once on transient errors
   (Timeout, CqPoll). Non-transient errors propagate immediately.

E. HealthMonitor integration documented (TODO in init.rs).
   eprintln→tracing migration in init.rs and context.rs.
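
Item C's MR cache idea, memoizing registrations by `(ptr, len)`, can be sketched with a plain HashMap. The struct and counter below are illustrative, not the actual RdmaConnection internals; `get_or_register` stands in for the path that would otherwise call `ibv_reg_mr` each time.

```rust
use std::collections::HashMap;

/// Memoizes memory-region registrations by (ptr, len) so repeated sends
/// from the same Metal buffer skip the expensive register call.
pub struct MrCache {
    cache: HashMap<(usize, usize), u32>, // (ptr, len) -> mr handle
    pub registrations: u32,              // counts expensive register calls
}

impl MrCache {
    pub fn new() -> Self {
        MrCache { cache: HashMap::new(), registrations: 0 }
    }

    pub fn get_or_register(&mut self, ptr: usize, len: usize) -> u32 {
        let regs = &mut self.registrations;
        *self.cache.entry((ptr, len)).or_insert_with(|| {
            // Real code would call ibv_reg_mr here (~10-50us per call).
            *regs += 1;
            *regs
        })
    }
}

fn main() {
    let mut mrs = MrCache::new();
    let a = mrs.get_or_register(0x1000, 4096);
    let b = mrs.get_or_register(0x1000, 4096); // cache hit: no re-register
    assert_eq!(a, b);
    assert_eq!(mrs.registrations, 1);
    mrs.get_or_register(0x2000, 4096); // different buffer registers anew
    assert_eq!(mrs.registrations, 2);
}
```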

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…add TP E2E to script

- rdma_2node_integration: add 3 subtests (broadcast, reduce_scatter, MR cache)
- test_rdma_2node.sh: include tp_2node_e2e binary as Step 5/5
- Script handles TP binary build/discovery with separate port range

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Async EP pipeline fully connected:
- AcquiredBuffer::shared_buffer() accessor for Arc<SharedBuffer> access
- SharedBufferTier uses Vec<Arc<SharedBuffer>> for shared ownership
- exchange_and_compute_async: dispatch_async → compute (overlap) →
  combine_async_start → wait → combine_async_finish
- Blocking path preserved as fallback when runtime_ctx absent

EP 2-node tests added to rdma_2node_integration suite:
- Phase 8: asymmetric sendrecv (EP dispatch pattern)
- Phase 9: dispatch→compute→combine round-trip simulation

Stale TODO comments removed (Phase 3a/3b now complete).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fair comparison benchmarks for all distributed features:
- Raw send/recv throughput (4KB-16MB, 6 sizes)
- Allreduce f16/f32 (ring/mesh native)
- Allgather, all-to-all
- EP transport (sendrecv vs all_to_all)
- EP pipeline (MLX-only: full moe_dispatch/combine)
- Broadcast, reduce_scatter (RMLX-only)

Fairness verified by Opus + Codex:
- Identical sizes, warmup(10), iterations(30)
- Barrier before every iteration on both sides
- Same EP configs [(16,1024)...(512,1024)]
- JSON output with matching metrics
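
The fairness protocol above (warmup, fixed iteration count, barrier before every timed op) reduces to a small loop. This is a sketch with the barrier and op passed as closures; the real harness uses the distributed RDMA barrier.

```rust
use std::time::Instant;

/// Mean seconds per iteration: run `warmup` untimed passes, then `iters`
/// timed passes, syncing via `barrier` before every op so both ranks
/// start together and neither measures the other's skew.
pub fn bench(
    mut barrier: impl FnMut(),
    mut op: impl FnMut(),
    warmup: usize,
    iters: usize,
) -> f64 {
    for _ in 0..warmup {
        barrier();
        op();
    }
    let mut total = 0.0;
    for _ in 0..iters {
        barrier(); // sync ranks before starting the clock
        let t = Instant::now();
        op();
        total += t.elapsed().as_secs_f64();
    }
    total / iters as f64
}

fn main() {
    let mut calls = 0usize;
    // A no-op closure stands in for the 2-node barrier.
    let mean = bench(|| {}, || calls += 1, 10, 30);
    assert_eq!(calls, 40); // warmup(10) + iterations(30)
    assert!(mean >= 0.0);
}
```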

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nment, EP all_to_all

- send/recv: both ranks send+recv simultaneously (RMLX sendrecv, MLX send+recv crossover)
- allgather: RMLX per-rank contribution = size/ws (matches MLX convention)
- EP transport: RMLX uses all_to_all instead of sendrecv (matches MLX)
- All barriers use .unwrap()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- chunked_sendrecv: offset += chunk instead of chunk_size (fixes panic
  on non-aligned payload sizes, e.g. allreduce barrier with 4 bytes)
- benchmark: switch from init() coordinator to RdmaConnection::establish()
  (same reliable pattern as 2-node tests)
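
The offset fix amounts to advancing by the actual chunk length rather than the fixed chunk size, so a non-aligned final chunk is sized correctly. A minimal sketch (function name is illustrative):

```rust
/// Split a payload of `total` bytes into (offset, len) chunks. The last
/// chunk may be shorter than `chunk_size`; advancing by `chunk` instead
/// of `chunk_size` is the fix for the panic on non-aligned sizes.
pub fn chunk_offsets(total: usize, chunk_size: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < total {
        let chunk = chunk_size.min(total - offset); // last chunk may be short
        out.push((offset, chunk));
        offset += chunk; // the bug advanced by chunk_size here
    }
    out
}

fn main() {
    // 10 bytes in 4-byte chunks: final chunk is 2 bytes, not 4.
    assert_eq!(chunk_offsets(10, 4), vec![(0, 4), (4, 4), (8, 2)]);
    // A 4-byte barrier payload fits in a single short chunk.
    assert_eq!(chunk_offsets(4, 8192), vec![(0, 4)]);
}
```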

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gnostics

- rdma_setup.py: add networksetup -deletenetworkservice for TB Bridge,
  verify IP actually holds after setting (macOS configd override detection)
- Removed EXO Thunderbolt network services from both hwStudio nodes
- Added device name to alloc_pd error/success messages for diagnostics
- Benchmark establish timeout increased to 30s
- run_rmlx_bench.sh updated for establish pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fair benchmark

Fix chunked_recv/chunked_sendrecv over-posting bug that caused CQ timeouts
on multi-chunk transfers: guard nocopy send for single-chunk only, track
recvs_posted to prevent re-posting beyond needed chunks.

Remove misdiagnosed TB5 driver bug workarounds (sleep→proper CQ polling)
in connection.rs warmup and all ring collectives (allreduce, allgather,
reduce_scatter).

Add Split-CB TP path (forward_with_group_split_cb) that batches attention
and FFN ops into 2 command buffers per layer instead of per-op dispatch.

Upgrade benchmarks for fair comparison:
- RMLX: RoPE + 128-token KV cache, gate-up merge, weight pre-transpose
- MLX: mx.fast.* fused kernels, mx.compile, KV cache
- Script: hardcoded hostnames/IPs → env vars, cd→env -C

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add Phase 7 entry to roadmap (Split-CB TP, chunked transfer fix, fair benchmarks)
- Add changelog entries (Added, Performance, Fixed sections)
- Document chunked transfer over-posting fix and nocopy guard in rmlx-rdma
- Document Split-CB TP path (forward_with_group_split_cb) in rmlx-nn
- Add Split-CB reference in rmlx-distributed
- Remove references to misdiagnosed TB5 driver bug workarounds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace all internal hostnames and RDMA IPs with env vars + CLI args
- All scripts now support --node0, --node1, --node0-ip, --node1-ip,
  --remote-dir flags with --help documentation
- Priority chain: CLI args > env vars > generic defaults
- Add rmlx-hosts.json to .gitignore (local config, not for repo)
- Replace test fixture IPs with generic 10.0.0.x addresses
- Add clippy allow attributes for arc_with_non_send_sync, too_many_arguments
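
The priority chain (CLI args > env vars > generic defaults) resolves to a one-liner. The sketch below models the env lookup as an `Option` for testability; the `--node0` / `RMLX_NODE0` pairing mirrors the PR, but the resolver itself is illustrative.

```rust
/// Resolve a setting with CLI args taking precedence over env vars,
/// which take precedence over a generic default.
pub fn resolve(cli: Option<&str>, env: Option<&str>, default: &str) -> String {
    cli.or(env).unwrap_or(default).to_string()
}

fn main() {
    // Neither --node0 nor RMLX_NODE0 set: fall back to the default.
    assert_eq!(resolve(None, None, "node0.local"), "node0.local");
    // Env var set, no CLI flag: env var wins over the default.
    assert_eq!(resolve(None, Some("studio-a"), "node0.local"), "studio-a");
    // CLI flag set: it wins over everything.
    assert_eq!(resolve(Some("cli-host"), Some("studio-a"), "node0.local"), "cli-host");
}
```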

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@0xDaizz 0xDaizz merged commit 045d2ba into main Mar 15, 2026
7 checks passed
@0xDaizz 0xDaizz deleted the rdma-production-phases branch March 15, 2026 15:18