SlimeRPC: move datapath to C++, add zero-copy inplace reply path, drop Python backend by JimyMa · Pull Request #73 · DeepLink-org/DLSlime

JimyMa · 2026-04-27T08:42:06Z

SlimeRPC: move datapath to C++, add zero-copy `inplace` reply path, drop Python backend

Summary

Replaces the Python reactor / lazy-runtime SlimeRPC datapath with a dedicated C++ session (dlslime/csrc/rpc/rpc_session.{h,cpp}); the only runtime is now the C++ one. SLIME_RPC_BACKEND={py,cpp,auto} and the Python-side _LazyRuntime / _ClientRuntime are gone.
Adds an opt-in zero-copy reply path: @method(raw=True, inplace=True). The handler runs under the session's send-mutex, writes the reply directly into the registered send buffer, and returns the byte count. C++ then posts the WR with no intermediate bytes allocation and no Python-heap memcpy. The symmetric client-side form, proxy.echo(writer), is auto-routed.
Adds soft-RNR observability: dlslime.rpc.SoftRnrMonitor snapshots Mellanox out_of_buffer (and friends) so production can detect HW-level retries that don't surface as work-completion errors.
Trims the public API: Channel, WIRE_VERSION, read_rnr_counter stay importable but leave __all__; the public surface is method, proxy, serve, serve_once, wait_all, SoftRnrMonitor plus the three exception classes.
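
As a rough illustration of what snapshotting these counters involves, the sketch below reads `out_of_buffer` from the per-port `hw_counters` directory that mlx5 devices expose under `/sys/class/infiniband` and reports what changed between two snapshots. It is not the `SoftRnrMonitor` implementation: the function names `snapshot_counters` and `counter_deltas` are hypothetical, and the sysfs layout is assumed to be `/sys/class/infiniband/<dev>/ports/<port>/hw_counters/<name>`.

```python
from pathlib import Path

def snapshot_counters(root="/sys/class/infiniband", names=("out_of_buffer",)):
    """Return {(device, port, counter): value} for every visible mlx5_* device."""
    snap = {}
    for dev in sorted(Path(root).glob("mlx5_*")):
        for port in sorted((dev / "ports").glob("*")):
            for name in names:
                path = port / "hw_counters" / name
                try:
                    snap[(dev.name, port.name, name)] = int(path.read_text().strip())
                except (OSError, ValueError):
                    continue  # counter missing or unreadable on this device
    return snap

def counter_deltas(before, after):
    """Return only the counters that increased between two snapshots."""
    return {k: after[k] - before[k]
            for k in after
            if k in before and after[k] > before[k]}
```

Polling `snapshot_counters()` at a fixed interval and alerting on a non-empty `counter_deltas(...)` is one way to surface retries that never appear as work-completion errors.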

Why

The previous Python reactor paid ~50 µs of condition-variable, queue, and GIL overhead per RPC and materialised every payload as a fresh py::bytes on the recv pump. At small payloads the dispatch overhead dominated; at multi-MB payloads the worker did five full-size copies on the hot path (recv slot → std::string → py::bytes → ctypes buffer → Python bytes(...) → send buffer).

Moving the datapath to C++ collapses the dispatch overhead and lets the inplace path skip the four intermediate copies entirely: the handler memmoves recv slot → send buffer in one CPU pass, then C++ posts the WR.
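
A minimal sketch of what such a handler body could look like, using only `ctypes` and two locally allocated buffers as stand-ins for the registered recv slot and send buffer. The function name `echo_inplace` is hypothetical, `self` and the `@method(raw=True, inplace=True)` decorator are omitted so the block runs without DLSlime, and it assumes the pointers arrive as integer addresses. The argument order follows the handler signature listed under API changes.

```python
import ctypes

def echo_inplace(req_ptr, req_n, resp_ptr, resp_cap):
    """Copy the request into the reply buffer in one pass; return bytes written."""
    if req_n > resp_cap:
        raise ValueError("reply does not fit in the send buffer")
    ctypes.memmove(resp_ptr, req_ptr, req_n)  # single recv-slot -> send-buffer copy
    return req_n

# Stand-ins for the registered buffers (assumption: the real ones belong to the session).
recv_slot = ctypes.create_string_buffer(b"ping", 64)
send_buf = ctypes.create_string_buffer(64)

written = echo_inplace(ctypes.addressof(recv_slot), 4,
                       ctypes.addressof(send_buf), ctypes.sizeof(send_buf))
reply = send_buf.raw[:written]
```

No `bytes` object is created on the handler side; `reply` is materialised here only to inspect the result.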

Bench numbers (single-flight, slot_count=1, raw echo, ConnectX-6)

| Size | Before | After | Δ | BW after |
|---|---|---|---|---|
| 1 KB | 31 µs | 45 µs | +14 µs * | 0.045 GB/s |
| 64 KB | 72 µs | 100 µs | +28 µs * | 1.34 GB/s |
| 1 MB | 339 µs | 375 µs | +36 µs * | 5.59 GB/s |
| 4 MB | 1282 µs | 1011 µs | −21% | 8.30 GB/s |
| 16 MB | 4972 µs | 3498 µs | −30% | 9.59 GB/s (was 6.75) |

vs Ray @ 16 MB: 2.67× faster (was 1.49×).

* Small-RPC numbers absorb ~15 µs of run-to-run noise observed across several rebuilds; the median is well within the same band as the old code. The big win at multi-MB is from the server-side inplace path: the bench worker uses @method(raw=True, inplace=True), the driver stays on the plain bytes-based call (the asymmetry is intentional — inplace is a local API choice, not a wire-level flag).
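
The "BW after" column appears to count both the request and the reply bytes over the measured latency. That is an inference from the numbers rather than something stated here; the sketch below recomputes the column under that assumption, taking 1 GB as 10^9 bytes and sizes as powers of two. The function name is hypothetical.

```python
def echo_bandwidth_gb_per_s(size_bytes, latency_us):
    """Bidirectional echo bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return 2 * size_bytes / (latency_us * 1e-6) / 1e9

KB, MB = 1024, 1024 * 1024
after = {"1 KB": (1 * KB, 45), "64 KB": (64 * KB, 100), "1 MB": (1 * MB, 375),
         "4 MB": (4 * MB, 1011), "16 MB": (16 * MB, 3498)}

for label, (size, latency) in after.items():
    print(f"{label:>6}: {echo_bandwidth_gb_per_s(size, latency):.3f} GB/s")
```

Under this assumption the 1 MB, 4 MB, and 16 MB rows round to the tabulated 5.59, 8.30, and 9.59 GB/s, and the old 16 MB latency gives 6.75 GB/s. The two smallest rows differ slightly from the table, plausibly because the latencies shown are rounded.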

API changes

Public:

New flag @method(inplace=True) (requires raw=True). Handler signature becomes (self, req_ptr, req_n, resp_ptr, resp_cap) -> int.
New SoftRnrMonitor. Default scrapes every visible mlx5_* device.
__all__ is now 9 symbols. Channel, WIRE_VERSION, read_rnr_counter stay importable from their submodules / via explicit from dlslime.rpc import … for typing-only uses.
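
For readers less familiar with `__all__`: removing a name from it only affects `from module import *`; explicit imports keep working. The sketch below demonstrates this with a throwaway in-memory module named `fake_rpc`, which is hypothetical and unrelated to the real `dlslime.rpc` package.

```python
import sys
import types

# Build a throwaway module that exports `method` but hides `Channel` from star-imports.
mod = types.ModuleType("fake_rpc")
exec("__all__ = ['method']\n"
     "def method(): return 'public'\n"
     "class Channel: pass\n", mod.__dict__)
sys.modules["fake_rpc"] = mod

star_ns = {}
exec("from fake_rpc import *", star_ns)            # honours __all__
explicit_ns = {}
exec("from fake_rpc import Channel", explicit_ns)  # explicit import still works

star_names = sorted(n for n in star_ns if not n.startswith("__"))
```

Here `star_names` contains only `method`, while `Channel` is still reachable through the explicit import.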

Removed:

SLIME_RPC_BACKEND env var. C++ is the only path; partial builds (with BUILD_RPC=OFF) raise a clear error at session creation.
Python _LazyRuntime, _ClientRuntime, RpcFuture, the channel-based Python serve loop. Channel itself is now a thin bookkeeper of registered buffers; all datapath methods live in C++.

Test plan

bash bench/python/run_rpc_bench.sh end-to-end (driver + worker + Ray comparison). Numbers above.
Manual smoke test of dlslime/rpc/rnr.py against live sysfs counters.
examples/python/rpc_example.py two-agent loopback (CalcService: add/mul/echo + 5-element batch).
Extended-size bench (64 MB / 256 MB): the bench loop is currently capped at 16 MB by --max-size-mb; the underlying transport supports up to 4 GB − 16 B per message. Follow-up.
Multi-peer pump-mode soak (slot_count > 1) with SoftRnrMonitor polling. Follow-up.

Follow-ups (out of scope)

Tier C: optional reader callback in call_inplace for zero-copy reply parsing — kills the last full-size py::bytes allocation on the client side at large payloads. Estimated −1 ms at 16 MB.
Pre-indexed pending_ (vector keyed by slot_id) to drop the unordered_map ops on the recv pump. ~1–2 µs/RPC.
Inline send (IBV_SEND_INLINE) for small payloads — needs RDMA-engine plumbing of max_inline_data. ~5 µs/RPC at ≤1 KB.
Bench driver fan-out: parallelise the 8-way submit to tighten the straggler tail in NanoDeploy's executor.
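
The pre-indexed `pending_` idea can be pictured as follows. This is a Python stand-in for what would be a C++ `std::vector` in the session; the class name `SlotTable` and its methods are hypothetical, and it assumes slot ids are dense integers in `[0, slot_count)` with at most one in-flight request per slot.

```python
class SlotTable:
    """Fixed-size table of in-flight requests indexed directly by slot id."""

    def __init__(self, slot_count):
        self._slots = [None] * slot_count  # None marks a free slot

    def put(self, slot_id, pending):
        if self._slots[slot_id] is not None:
            raise RuntimeError(f"slot {slot_id} already has a request in flight")
        self._slots[slot_id] = pending

    def take(self, slot_id):
        """Remove and return the pending entry for a completed slot."""
        pending = self._slots[slot_id]
        if pending is None:
            raise KeyError(slot_id)
        self._slots[slot_id] = None
        return pending
```

Lookup on the recv pump becomes a plain array index instead of a hash-map find and erase, which is presumably where the estimated saving would come from.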

Resolve conflicts in rdma_endpoint.{cpp,h}, rdma_env.h, rdma_future.{cpp,h}, rpc/channel.py, rpc/service.py by taking the incoming (main) version. Main carries the imm-recv RNR fix (#71), the C++ datapath RPC rewrite (#73), and P2P RDMA read ctrl-plane fixes (#74,#75). The completion-ownership refactor from #70 will be reapplied on top of this base.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

JimyMa added 2 commits April 27, 2026 08:19

cpp_rpc

ee29484

clean __all__

c5ccb1c

JimyMa merged commit c04857c into main Apr 27, 2026
1 check passed

JimyMa deleted the cpp_rpc branch April 27, 2026 08:47

JimyMa mentioned this pull request Apr 29, 2026

Refactor RDMAEndpoint completion ownership #70

Merged

9 tasks
