IPC Mempool Serialization and multiprocessing Module Support #1020
Conversation
…clear). Added a test for an error case.
…esource. Add tests for buffer IPC through serialization.
…egistry from imported memory resources so that buffers can be serialized using an mr key. Test updates.
Force-pushed from 4514bf9 to cca13b5.
Force-pushed from cca13b5 to 4820512.
Force-pushed from 4820512 to 6c53cb0.
Force-pushed from ff29756 to 6be4686.
/ok to test 6be4686
Force-pushed from 6be4686 to e0d0bf4.
/ok to test e0d0bf4
It looks like this change is good to go. I'm not aware of any remaining issues.
Awesome, thanks Andy! I'll re-review in an hour or two and then merge. I don't expect any change is needed, just wanna read the new tests once more.
try:
    assert self._uuid is None
    import uuid
    self._uuid = uuid.uuid4()
@ksimpson-work I ended up creating unique identifiers to track memory pools across processes. Is this something we could add to the Driver API as a mempool attribute?
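For context, the per-pool UUID plus registry approach mentioned above can be sketched roughly as follows. All names here (`MemoryResourceRegistry`, `register`, `lookup`) are illustrative placeholders, not the PR's actual classes:

```python
import uuid


class MemoryResourceRegistry:
    """Sketch of a registry mapping a per-pool UUID to an imported memory
    resource, so that serialized buffers can reference their pool by key.
    This is a hypothetical stand-in, not cuda.core's API."""

    def __init__(self):
        self._by_key = {}

    def register(self, mr):
        # Tag the resource with a fresh UUID and remember it.
        key = uuid.uuid4()
        self._by_key[key] = mr
        return key

    def lookup(self, key):
        # A serialized buffer carries only the key; resolve it here.
        return self._by_key[key]
```

A buffer's serialized form would then carry the small UUID key instead of the memory resource itself, and the receiving process resolves the key against its own registry of imported resources.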
Sorry Andy, I re-read this again and still have a few questions 😓
btw we should fix a few doc issues:

- add `DeviceMemoryResourceOptions` to `cuda_core/docs/source/api.rst`
- add `IPCAllocationHandle` to `cuda_core/docs/source/api_private.rst`
- address other comments below

The doc preview CI can be used to check what's missing.
    return device

multiprocessing.reduction.register(Device, _reduce_device)
Sorry, I can't resist circling back to ask this: We wanted `Device` to be serializable when only `Buffer` objects are passed over to child processes through `multiprocessing` APIs. We called `.set_current()` because the child function arguments are restored/deserialized prior to dropping into the function (xref: #1020 (comment)).
Questions:

- What happens in a multi-GPU environment? If I send over `buf0` allocated on GPU 0 and `buf1` allocated on GPU 1 to the child, which device is current by the time the child function starts executing? What if we change the order of args from `(buf0, buf1)` to `(buf1, buf0)`? We don't have CI for multi-GPU tests for now, but the design should be multi-GPU-safe.
- If we really need a device ID, can't we just look it up from `Buffer.mr.device_id` instead of serializing `Device`?
I still feel it is unwise to set the global state behind users' back.
The problem here is that I don't see a way to initialize CUDA (?) without calling `set_current` on a device. If I remove `set_current` from `_reconstruct_device`, then I get `CUDA_ERROR_INVALID_CONTEXT` errors everywhere. `Device._has_inited` is only set in `set_current`. To avoid messing with the global state, I'd rather say something like `Device(n).init()` to be sure that device n is ready to go. Then, later, when mapping a buffer, I could use `mr.device_id` to get that device. I just don't see any way to do that with the current `_device` code.
OK, I think I have something here that we can accept. I was able to remove the call to `set_current` in `Device` reconstruction, since it turns out that memory resources can be mapped without setting a device context.

The only thing that really needs to be done implicitly when a process spawns is to call `cuInit`. That needs to be the very first thing, or else all CUDA API calls (including mapping a memory resource or buffer) fail. It can be accomplished by just constructing a `Device`.

I tried removing device serialization completely, but that actually makes the code quite a bit more cumbersome in several places. It's just easier to send an applicative of the form `(Device, (device_id,))` rather than send a device ID, import `Device`, and then reconstruct manually on the other end. I also think it is conceptually sensible for `Device` objects to be serializable.
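The `(callable, args)` applicative described here is exactly the form that `multiprocessing.reduction.register` consumes. A minimal self-contained sketch, using a stand-in `Device` class rather than cuda.core's:

```python
import pickle
from multiprocessing import reduction


class Device:
    # Stand-in for cuda.core's Device; only device_id matters for this sketch.
    def __init__(self, device_id=0):
        self.device_id = device_id


def _reduce_device(device):
    # Serialize a Device as the applicative (Device, (device_id,)); the
    # receiving end reconstructs it by calling Device(device_id).
    return (Device, (device.device_id,))


# Register the reducer with multiprocessing's ForkingPickler so that Devices
# embedded in task arguments are serialized this way automatically.
reduction.register(Device, _reduce_device)

# Round-trip through the same pickler multiprocessing itself uses:
data = reduction.ForkingPickler.dumps(Device(1))
restored = pickle.loads(data)
assert restored.device_id == 1
```

This is why serializing the `Device` directly is lighter-weight at the call sites than shipping a bare device ID and reconstructing manually on the other end.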
Sounds great, I'll take a look asap! A few quick notes before I forget:

- CUDA mempools are by design not bound to any CUDA context. We (CUDA) used to have a context as the powerhouse for everything (memory, queue, module, ...), but now we are making things "context-independent," including mempools and kernel modules.
- I read the `multiprocessing` docs more carefully, and I think we spawned the processes wrong (we should document this after this PR, in a Q&A or Tips & Tricks like in cuda-bindings). As you said, calling `cuInit()` is the minimum:
  - For processes spawned via `Process`, we can inherit from the `Process` class; in the constructor `__init__()` we call `Device().set_current()`, which would under the hood call `cuInit()`, and then call `super().__init__()`. My take is that this will ensure CUDA is initialized before the arguments are deserialized.
  - For processes spawned via a `Pool` of workers, the same can be done by passing `initializer` and `initargs`.

Prove me wrong! 🙂
I think subclassing `Process` as you describe would work. Are you suggesting we ship a subclass of `Process` to simplify this for our users? The current code places `set_current` calls within the child main functions. That's actually pretty nice, since it mirrors what we ask users to do in any process. As you point out, the memory pool and buffer mappings don't require a context, which simplifies this a lot.

When using worker pools, `initializer` and `initargs` do work (we have examples of this in `test_workpool.py`). Still, I think there's a ton of value in supporting the ultra-simple use case of mapping a worker pool over buffers without doing anything special at startup. The latest code just puts the call to `set_current` within each worker function, which seems great, actually. That way, one could mix and match buffers from different devices and it ought to work.

One limitation I'm facing now is that I don't know whether the test runners have multiple GPUs, so I'm not sure how to move forward with developing the multi-device scenario.
> Are you suggesting we ship a subclass of `Process` to simplify this for our users?

Certainly not, at least not now (or ever). We should only do so in our own test suite, and teach users about this.

We did not have to teach this in most Python GPU libraries (wearing my CuPy hat), because most libraries only interact with CUDA through the runtime instead of the driver, and the runtime does the implicit initialization in every single cudart API. We don't, and by design we always want users to call `dev.set_current()` first.

Now, with our own test suite, your other question is valid: `Device().set_current()` sets GPU 0 as current by default if there is no prior current device, which is a bit unfortunate if the child main wants to use, say, GPU 1. I think for now we need to call `set_current()` in both places: the process initializer (as a proxy for `cuInit()`, because we definitely don't want to teach this) as well as the child main (to actually select the right GPU, following the usual cuda.core requirement). We should do this for now and discuss whether there is a better approach at the Friday meeting. I believe cccl-rt will hit the same issue: NVIDIA/cccl#6073.
…yResource reduction with multiprocessing. Add a quick exit to from_allocation_handle. Simplify the worker pool tests based on the new reduction method.
Force-pushed from 5cd1043 to e5b8542.
/ok to test

@Andy-Jost, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

/ok to test e5b8542

/ok to test 64f154c
"""Unregister this mapped memory resource."""
assert self.is_mapped
if _ipc_registry is not None:  # can occur during shutdown catastrophe
❗
/ok to test d584ea9
# Note: if the buffer is not attached to something to prolong its life,
# CUDA_ERROR_INVALID_CONTEXT is raised from Buffer.__del__
I suspect this comment is outdated.
Adds serialization to memory IPC: memory resources can be passed directly to new processes. Expands testing. Removes `IPCChannel`.