
SlimeRPC: move datapath to C++, add zero-copy inplace reply path, drop Python backend #73

Merged
JimyMa merged 2 commits into main from cpp_rpc
Apr 27, 2026

@JimyMa JimyMa commented Apr 27, 2026

SlimeRPC: move datapath to C++, add zero-copy inplace reply path, drop Python backend

Summary

  • Replaces the Python reactor / lazy-runtime SlimeRPC datapath with a dedicated C++ session (dlslime/csrc/rpc/rpc_session.{h,cpp}); the only runtime is now the C++ one. SLIME_RPC_BACKEND={py,cpp,auto} and the Python-side _LazyRuntime / _ClientRuntime are gone.
  • Adds an opt-in zero-copy reply path: @method(raw=True, inplace=True). The handler runs under the session's send-mutex, writes the reply directly into the registered send buffer, and returns the byte count. C++ then posts the WR with no intermediate bytes allocation or Python-heap memcpy. The client side is symmetric: proxy.echo(writer) is auto-routed.
  • Adds soft-RNR observability: dlslime.rpc.SoftRnrMonitor snapshots Mellanox out_of_buffer (and friends) so production can detect HW-level retries that don't surface as work-completion errors.
  • Trims the public API: Channel, WIRE_VERSION, read_rnr_counter stay importable but leave __all__; the public surface is method, proxy, serve, serve_once, wait_all, SoftRnrMonitor plus the three exception classes.

Why

The previous Python reactor paid ~50 µs of cv/queue/GIL overhead per RPC and materialised every payload as a fresh py::bytes on the recv pump. At small payloads the dispatch overhead dominated; at multi-MB payloads the worker did five full-size copies on the hot path (recv slot → std::string → py::bytes → ctypes buffer → Python bytes(...) → send buffer).

Moving the datapath to C++ collapses the dispatch overhead and lets the inplace path skip the four intermediate copies entirely — handler memmoves recv slot → send buffer in one CPU pass, then C++ posts the WR.
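A minimal sketch of what a one-pass inplace echo handler could look like, using the handler signature listed under API changes. It runs without dlslime: the two ctypes buffers stand in for the registered recv slot and send buffer, EchoService is a hypothetical class, and passing the pointers as plain integer addresses is an assumption rather than something this PR states.

```python
import ctypes


class EchoService:
    # In dlslime this method would carry @method(raw=True, inplace=True);
    # the decorator is omitted so the sketch runs standalone.
    def echo(self, req_ptr: int, req_n: int, resp_ptr: int, resp_cap: int) -> int:
        """Copy the request into the reply buffer and return the reply length."""
        if req_n > resp_cap:
            raise ValueError(
                f"reply of {req_n} bytes exceeds send buffer capacity {resp_cap}"
            )
        # The only full-size copy on the hot path: recv slot -> send buffer.
        ctypes.memmove(resp_ptr, req_ptr, req_n)
        return req_n


# Stand-ins for the registered buffers (assumption: pointers arrive as ints).
payload = b"hello slime"
recv_slot = ctypes.create_string_buffer(payload, 64)
send_buf = ctypes.create_string_buffer(64)

written = EchoService().echo(
    ctypes.addressof(recv_slot),
    len(payload),
    ctypes.addressof(send_buf),
    len(send_buf),
)
reply = send_buf.raw[:written]
```

The capacity check matters because the handler writes into a fixed-size registered buffer; returning the byte count is what lets the caller post exactly that many bytes.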

Bench numbers (single-flight, slot_count=1, raw echo, ConnectX-6)

| Size | Before | After | Δ | BW after |
| --- | --- | --- | --- | --- |
| 1 KB | 31 µs | 45 µs | +14 µs * | 0.045 GB/s |
| 64 KB | 72 µs | 100 µs | +28 µs * | 1.34 GB/s |
| 1 MB | 339 µs | 375 µs | +36 µs * | 5.59 GB/s |
| 4 MB | 1282 µs | 1011 µs | −21% | 8.30 GB/s |
| 16 MB | 4972 µs | 3498 µs | −30% | 9.59 GB/s (was 6.75) |

vs Ray @ 16 MB: 2.67× faster (was 1.49×).

* Small-RPC numbers absorb ~15 µs of run-to-run noise observed across several rebuilds; the median is well within the same band as the old code. The big win at multi-MB comes from the server-side inplace path: the bench worker uses @method(raw=True, inplace=True), while the driver stays on the plain bytes-based call. The asymmetry is intentional: inplace is a local API choice, not a wire-level flag.
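For anyone re-deriving the table: the PR does not spell out the bandwidth formula, but the 1 KB, 1 MB, 4 MB and 16 MB figures are consistent with counting the echo payload in both directions, with sizes in binary units and GB/s in decimal. A short sketch under that assumption (the helper names are illustrative only):

```python
MIB = 1 << 20  # sizes in the table are assumed to be binary (1 MB = 2**20 bytes)


def echo_bandwidth_gbps(size_bytes: int, latency_us: float) -> float:
    """Round-trip bandwidth in GB/s (1e9 bytes/s); the payload travels both ways."""
    return 2 * size_bytes / (latency_us * 1e-6) / 1e9


def latency_change_pct(before_us: float, after_us: float) -> float:
    """Relative latency change in percent; negative means faster."""
    return (after_us - before_us) / before_us * 100


bw_16mb_after = echo_bandwidth_gbps(16 * MIB, 3498)   # table: 9.59 GB/s
bw_16mb_before = echo_bandwidth_gbps(16 * MIB, 4972)  # table: was 6.75 GB/s
bw_4mb_after = echo_bandwidth_gbps(4 * MIB, 1011)     # table: 8.30 GB/s
delta_16mb = latency_change_pct(4972, 3498)           # table: -30%
delta_4mb = latency_change_pct(1282, 1011)            # table: -21%
```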

API changes

Public:

  • New flag @method(inplace=True) (requires raw=True). Handler signature becomes (self, req_ptr, req_n, resp_ptr, resp_cap) -> int.
  • New SoftRnrMonitor. Default scrapes every visible mlx5_* device.
  • __all__ is now 9 symbols. Channel, WIRE_VERSION, read_rnr_counter stay importable from their submodules / via explicit from dlslime.rpc import … for typing-only uses.
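To make the soft-RNR idea concrete, here is a rough standalone sketch of snapshotting out_of_buffer across mlx5_* devices and diffing two snapshots. It is not the SoftRnrMonitor implementation or its API: snapshot_counter and counter_delta are hypothetical helpers, and the sysfs layout /sys/class/infiniband/<device>/ports/<port>/hw_counters/<name> is an assumption about the mlx5 driver on the host.

```python
from pathlib import Path


def snapshot_counter(name="out_of_buffer", root="/sys/class/infiniband", prefix="mlx5_"):
    """Return {(device, port): value} for one hw counter on every matching device."""
    snapshot = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return snapshot  # no RDMA devices visible on this host
    pattern = f"{prefix}*/ports/*/hw_counters/{name}"
    for counter in sorted(root_path.glob(pattern)):
        device = counter.parents[3].name  # <root>/<device>/ports/<port>/hw_counters
        port = counter.parents[1].name
        try:
            snapshot[(device, port)] = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric counter: skip it
    return snapshot


def counter_delta(before, after):
    """Per-port increase between two snapshots; new ports count from zero."""
    return {key: value - before.get(key, 0) for key, value in after.items()}


if __name__ == "__main__":
    # On a host without mlx5 devices this simply prints an empty dict.
    print(snapshot_counter())
```

A non-zero delta between two polls would be the signal of interest here: receive buffers ran out and the hardware retried, even though no work-completion error surfaced.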

Removed:

  • SLIME_RPC_BACKEND env var. C++ is the only path; partial builds (with BUILD_RPC=OFF) raise a clear error at session creation.
  • Python _LazyRuntime, _ClientRuntime, RpcFuture, the channel-based Python serve loop. Channel itself is now a thin bookkeeper of registered buffers; all datapath methods live in C++.

Test plan

  • bash bench/python/run_rpc_bench.sh end-to-end (driver + worker + Ray comparison). Numbers above.
  • Manual smoke test of dlslime/rpc/rnr.py against live sysfs counters.
  • examples/python/rpc_example.py two-agent loopback (CalcService: add/mul/echo + 5-element batch).
  • Extended-size bench (64 MB / 256 MB) — currently bench loop is capped at 16 MB by --max-size-mb; the underlying transport supports up to 4 GB − 16 B per message. Follow-up.
  • Multi-peer pump-mode soak (slot_count > 1) with SoftRnrMonitor polling. Follow-up.

Follow-ups (out of scope)

  • Tier C: optional reader callback in call_inplace for zero-copy reply parsing — kills the last full-size py::bytes allocation on the client side at large payloads. Estimated −1 ms at 16 MB.
  • Pre-indexed pending_ (vector keyed by slot_id) to drop the unordered_map ops on the recv pump. ~1–2 µs/RPC.
  • Inline send (IBV_SEND_INLINE) for small payloads — needs RDMA-engine plumbing of max_inline_data. ~5 µs/RPC at ≤1 KB.
  • Bench driver fan-out: parallelise the 8-way submit to tighten the straggler tail in NanoDeploy's executor.

@JimyMa JimyMa merged commit c04857c into main Apr 27, 2026
1 check passed
@JimyMa JimyMa deleted the cpp_rpc branch April 27, 2026 08:47
JimyMa added a commit that referenced this pull request Apr 29, 2026
Resolve conflicts in rdma_endpoint.{cpp,h}, rdma_env.h,
rdma_future.{cpp,h}, rpc/channel.py, rpc/service.py by taking the
incoming (main) version. Main carries the imm-recv RNR fix (#71), the
C++ datapath RPC rewrite (#73), and P2P RDMA read ctrl-plane fixes
(#74,#75). The completion-ownership refactor from #70 will be reapplied
on top of this base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>