JimyMa added a commit that referenced this pull request on Apr 29, 2026
Resolve conflicts in rdma_endpoint.{cpp,h}, rdma_env.h,
rdma_future.{cpp,h}, rpc/channel.py, rpc/service.py by taking the
incoming (main) version. Main carries the imm-recv RNR fix (#71), the
C++ datapath RPC rewrite (#73), and P2P RDMA read ctrl-plane fixes
(#74,#75). The completion-ownership refactor from #70 will be reapplied
on top of this base.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SlimeRPC: move datapath to C++, add zero-copy `inplace` reply path, drop Python backend

Summary
- The datapath moves to C++ (`dlslime/csrc/rpc/rpc_session.{h,cpp}`); the only runtime is now the C++ one.
- `SLIME_RPC_BACKEND={py,cpp,auto}` and the Python-side `_LazyRuntime`/`_ClientRuntime` are gone.
- `@method(raw=True, inplace=True)`. The handler runs under the session's send-mutex, writes the reply directly into the registered send buffer, and returns the byte count. C++ posts the WR with no intermediate `bytes` allocation or Python-heap memcpy. Symmetric client-side `proxy.echo(writer)` is auto-routed.
- `dlslime.rpc.SoftRnrMonitor` snapshots Mellanox `out_of_buffer` (and friends) so production can detect HW-level retries that don't surface as work-completion errors.
- `Channel`, `WIRE_VERSION`, `read_rnr_counter` stay importable but leave `__all__`; the public surface is `method`, `proxy`, `serve`, `serve_once`, `wait_all`, `SoftRnrMonitor` plus the three exception classes.

Why
The previous Python reactor paid ~50 µs of cv/queue/GIL overhead per RPC and materialised every payload as a fresh `py::bytes` on the recv pump. At small payloads the dispatch overhead dominated; at multi-MB payloads the worker did five full-size copies on the hot path (recv slot → `std::string` → `py::bytes` → ctypes buffer → Python `bytes(...)` → send buffer).

Moving the datapath to C++ collapses the dispatch overhead and lets the `inplace` path skip the four intermediate copies entirely: the handler memmoves recv slot → send buffer in one CPU pass, then C++ posts the WR.
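
To make the one-pass copy concrete, here is a minimal sketch of an echo handler with the `inplace` signature `(self, req_ptr, req_n, resp_ptr, resp_cap) -> int`. It assumes the pointers arrive as integer addresses, and it uses two `ctypes` buffers as stand-ins for the registered recv slot and send buffer. `EchoService` and the small harness are hypothetical, and the `@method(raw=True, inplace=True)` decorator is left out so the snippet runs without `dlslime` installed.

```python
import ctypes


class EchoService:
    """Hypothetical service; in dlslime the method would carry
    @method(raw=True, inplace=True)."""

    def echo(self, req_ptr: int, req_n: int, resp_ptr: int, resp_cap: int) -> int:
        # Refuse requests that would overflow the reply buffer.
        if req_n > resp_cap:
            raise ValueError(f"reply of {req_n} B exceeds capacity {resp_cap} B")
        # Single copy: recv slot -> send buffer, no intermediate bytes object.
        ctypes.memmove(resp_ptr, req_ptr, req_n)
        # The return value is the number of reply bytes written.
        return req_n


# Stand-ins for the registered recv slot and send buffer (assumption:
# the real buffers are owned and registered by the C++ session).
payload = b"hello rdma"
recv_slot = ctypes.create_string_buffer(payload)
send_buf = ctypes.create_string_buffer(64)

written = EchoService().echo(
    ctypes.addressof(recv_slot), len(payload),
    ctypes.addressof(send_buf), ctypes.sizeof(send_buf),
)
reply = send_buf.raw[:written]
```

The capacity check matters because the handler writes through a raw address: nothing else stops it from running past the end of the send buffer.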
Bench numbers (single-flight, slot_count=1, raw echo, ConnectX-6)
vs Ray @ 16 MB: 2.67× faster (was 1.49×).
* Small-RPC numbers absorb ~15 µs of run-to-run noise observed across several rebuilds; the median is well within the same band as the old code. The big win at multi-MB is from the server-side `inplace` path: the bench worker uses `@method(raw=True, inplace=True)`, the driver stays on the plain bytes-based call (the asymmetry is intentional: `inplace` is a local API choice, not a wire-level flag).

API changes
Public:
- `@method(inplace=True)` (requires `raw=True`). Handler signature becomes `(self, req_ptr, req_n, resp_ptr, resp_cap) -> int`.
- `SoftRnrMonitor`. Default scrapes every visible `mlx5_*` device.
- `__all__` is now 9 symbols. `Channel`, `WIRE_VERSION`, `read_rnr_counter` stay importable from their submodules / via explicit `from dlslime.rpc import …` for typing-only uses.

Removed:
- `SLIME_RPC_BACKEND` env var. C++ is the only path; partial builds (with `BUILD_RPC=OFF`) raise a clear error at session creation.
- `_LazyRuntime`, `_ClientRuntime`, `RpcFuture`, the channel-based Python serve loop. `Channel` itself is now a thin bookkeeper of registered buffers; all datapath methods live in C++.

Test plan
- `bash bench/python/run_rpc_bench.sh` end-to-end (driver + worker + Ray comparison). Numbers above.
- `dlslime/rpc/rnr.py` against live sysfs counters.
- `examples/python/rpc_example.py` two-agent loopback (`CalcService`: `add`/`mul`/`echo` + 5-element batch).
- `--max-size-mb`; the underlying transport supports up to 4 GB − 16 B per message. Follow-up.
- `SoftRnrMonitor` polling. Follow-up.

Follow-ups (out of scope)
- `call_inplace` for zero-copy reply parsing; kills the last full-size `py::bytes` allocation on the client side at large payloads. Estimated −1 ms at 16 MB.
- `pending_` (vector keyed by `slot_id`) to drop the `unordered_map` ops on the recv pump. ~1–2 µs/RPC.
- Inline sends (`IBV_SEND_INLINE`) for small payloads; needs RDMA-engine plumbing of `max_inline_data`. ~5 µs/RPC at ≤1 KB.
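
The slot-indexed `pending_` follow-up can be pictured roughly as below. This is a Python illustration of the data structure only, not the C++ code: `PendingTable`, its methods, and the string used as per-call state are all hypothetical. It assumes slot IDs are dense integers in `[0, slot_count)`, which is what would let a preallocated array stand in for the hash map.

```python
from typing import Any, List, Optional


class PendingTable:
    """Fixed-size table of in-flight calls, indexed directly by slot_id."""

    def __init__(self, slot_count: int) -> None:
        # One entry per slot, allocated once; None marks a free slot.
        self._entries: List[Optional[Any]] = [None] * slot_count

    def put(self, slot_id: int, call: Any) -> None:
        if self._entries[slot_id] is not None:
            raise RuntimeError(f"slot {slot_id} already has a pending call")
        self._entries[slot_id] = call

    def take(self, slot_id: int) -> Any:
        # Completion path: plain index lookup, no hashing.
        call = self._entries[slot_id]
        if call is None:
            raise KeyError(f"no pending call in slot {slot_id}")
        # Free the slot so it can be reused by the next request.
        self._entries[slot_id] = None
        return call


table = PendingTable(slot_count=4)
table.put(2, "call-A")
done = table.take(2)
```

The trade-off is that the table's size is fixed at construction, so this shape only fits if the number of slots is known when the session is created.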