
Conversation

Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Aug 29, 2025

Updates cuda.core.experimental.DeviceMemoryResource to support IPC-enabled memory pools. This change updates the DMR constructor, adding an option to create a new memory pool with optional IPC enablement. The previous behavior, which gets the current device memory pool via cuDeviceGetMemPool, remains in effect when no constructor options are supplied.
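For orientation, a minimal sketch of the two constructor modes; the dict-of-options form and the `max_size`/`ipc_enabled` names are taken from the test code quoted later in this thread, and the exact signature should be checked against the merged API:

```python
from cuda.core.experimental import Device, DeviceMemoryResource

device = Device(0)
device.set_current()

# Default: wrap the device's current memory pool (cuDeviceGetMemPool), as before.
mr_default = DeviceMemoryResource(device)

# New: create a dedicated pool for this resource, optionally IPC-enabled (Linux only).
mr_ipc = DeviceMemoryResource(device, dict(max_size=2**21, ipc_enabled=True))
```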

@Andy-Jost Andy-Jost self-assigned this Aug 29, 2025
Contributor

copy-pr-bot bot commented Aug 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…memory pool. Removes the `current` staticmethod. Adds an option dataclass for constructor options.
@leofang leofang added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Aug 30, 2025
@Andy-Jost Andy-Jost marked this pull request as draft September 4, 2025 21:48
Introduces `IPCAllocationHandle` to manage pool-sharing resources.
Introduces `IPCChannel` for sharing allocation handles in a
platform-independent way (though currently only Linux is supported).
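For context, a rough sketch of the flow these classes enable, assembled only from the calls quoted later in this thread (export_buffer, from_shared_channel, Buffer.import_); how the IPCChannel itself is created and transported to the other process is not shown here and is left out of the sketch:

```python
from cuda.core.experimental import Buffer, Device, DeviceMemoryResource

def export_side(mr, buffer, queue):
    # Owning process: serialize an allocation handle for the buffer and send it
    # over any ordinary IPC transport (the tests use a multiprocessing queue).
    handle = mr.export_buffer(buffer)
    queue.put(handle)

def import_side(device, channel, queue):
    # Receiving process: rebuild a memory resource from the shared IPCChannel,
    # then map the exported allocation into this process's address space.
    mr = DeviceMemoryResource.from_shared_channel(device, channel)
    handle = queue.get()
    return Buffer.import_(mr, handle)
```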
Member

@leofang leofang left a comment


Thanks a lot, @Andy-Jost! Sorry for my late reply. I've gone through the implementation and made some suggestions, in 3 categories:

  1. design: I believe your IPCChannel abstraction is key to delivering feature parity while enabling a very nice UX. I've made a few suggestions below to tidy it up and make the UI even cleaner. Lots of objects and methods can be made private/non-user-facing!
    • Sorry, my comments might seem inconsistent at first glance. I reviewed _memory.pyx from top to bottom, so at the beginning I was not seeing the full picture. I had also forgotten the details of Keenan's PR, on which this one is based.
  2. implementation: in a few places we can cythonize more; also, the addition of __bool__ is a bit nerve-wracking
  3. semantics: I would prefer to save the discussion of (1) .set_current() and (2) view vs. owning semantics until the very end. I think the current implementation is OK. No need to change until we make another pass.

Tests look OK but I need to review them again later.

…method from MemoryResource; Cythonizes helper classes; disables __init__ for non-API classes; removes abstract factory constructor from IPCChannel; removes the _requires_ipc decorator (checks are now inlined)


@Andy-Jost Andy-Jost requested review from leofang and removed request for ksimpson-work September 9, 2025 20:28
Member

@leofang leofang left a comment


Thanks, Andy!

  • We need to use IPC events for stream ordering, see my comments below in the test suite.
  • The question on import/export buffer methods is unanswered

Also, could you add two entries (if I do not miss any) to the release note?
cuda_core/docs/source/release/0.X.Y-notes.rst

  • mention IPC is supported on Linux
  • mention MRs can also take Device instances

Comment on lines +144 to +149
ptr = ctypes.cast(int(self.scratch_buffer.handle), ctypes.POINTER(ctypes.c_byte))
op = (lambda i: 255 - i) if flipped else (lambda i: i)
for i in range(self.nbytes):
    assert ctypes.c_byte(ptr[i]).value == ctypes.c_byte(op(i)).value, (
        f"Buffer contains incorrect data at index {i}"
    )
Member


ditto, we can use (arr1 == arr2).all() to check

Contributor Author


This leads to an error:

>>> np.from_dlpack(self.buffer)
RuntimeError: Unsupported device in DLTensor.

Member


Ah, for device buffers (.is_device_accessible is True), we wrap them as CuPy arrays not NumPy. Just change it to cp.from_dlpack().

Member


it might be a good idea to just wrap both buffers as cupy arrays instead of numpy arrays like I originally suggested, because one is backed by device memory and the other by managed memory, and numpy can't handle the former
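A minimal sketch of that check, assuming both arguments support the DLPack protocol (as cuda.core buffers do) and that CuPy is installed:

```python
import cupy as cp

def buffers_equal(a, b) -> bool:
    # Zero-copy views over both buffers; CuPy handles device and managed memory,
    # which plain NumPy cannot.
    return bool((cp.from_dlpack(a) == cp.from_dlpack(b)).all())
```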

Comment on lines 56 to 58
# Export the buffer via IPC.
handle = mr.export_buffer(buffer)
queue.put(handle)
Member


I need to apologize for sharing my wrong understanding when syncing with you and Keenan.

There is a reason I mentioned IPC event handle here. It turns out that only the cudaIpcMemHandle_t is considered "unsafe"/"legacy"; cudaIpcEventHandle_t (and its driver counterpart) is not, and is still used in the "safe"/"modern" IPC example, see here.

This is important because of the stream ordering reason as I explained during the sync-up. The mental model is the same:

  • process 1 does work on buffer on its stream
  • process 1 creates an IPC event and records it on its stream
  • process 1 exports the buffer and the IPC event
  • process 2 imports the buffer and the event
  • process 2 has its stream wait on the event
  • process 2 does work on buffer on its stream
  • ...

So, as it stands, our test suite is not safe due to the lack of stream ordering.
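To make the ordering concrete, here is a rough sketch of that mental model; `export_ipc_event` and `import_ipc_event` are hypothetical placeholders for the IPC event plumbing (cuIpcGetEventHandle / cuIpcOpenEventHandle underneath), which this PR does not add:

```python
from cuda.core.experimental import Buffer

def producer(device, mr, buffer, queue, do_work):
    # Process 1: work on the buffer, record an event on the producing stream,
    # then export the buffer together with the IPC event.
    stream = device.create_stream()
    do_work(buffer, stream=stream)
    event = stream.record()
    queue.put((mr.export_buffer(buffer), export_ipc_event(event)))  # hypothetical helper

def consumer(device, mr, queue, do_work):
    # Process 2: import the buffer and the event, make the consuming stream
    # wait on the event, and only then touch the buffer.
    stream = device.create_stream()
    handle, event_handle = queue.get()
    buffer = Buffer.import_(mr, handle)
    stream.wait(import_ipc_event(event_handle))  # hypothetical helper
    do_work(buffer, stream=stream)
```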

Contributor Author


Thanks for explaining this, Leo. I've addressed this in the tests by adding the appropriate stream synchronization. That is good enough for testing correctness. If we need a cleaner Python example using events, we can develop it outside this PR.

Collaborator


Can we add a test that captures the expected failure behavior when someone imports a DeviceMemoryResource via DeviceMemoryResource.from_shared_channel and tries to allocate using it? I'm not convinced that throwing on allocation is a good user experience here, versus having something like an ImportedDeviceMemoryResource that would allow better introspection and IDE support when someone writes code that tries to allocate from it.

Contributor Author


See f84d9d8 for this test.

I agree that throwing is not a great user experience. Wouldn't an ImportedDeviceMemoryResource still need to implement a throwing allocate function, though?

Collaborator


@Andy-Jost yes, but at least the user would be able to introspect the class, as well as type the allocate function as NoReturn (https://docs.python.org/3/library/typing.html#typing.NoReturn), which in theory should guide users much more nicely.
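For illustration, a minimal sketch of what such a class could look like (hypothetical; the final design avoids exposing it, see the follow-up comments below):

```python
from typing import NoReturn

from cuda.core.experimental import DeviceMemoryResource

# Hypothetical class, not part of this PR; shown only to illustrate the NoReturn typing idea.
class ImportedDeviceMemoryResource(DeviceMemoryResource):
    """A memory resource imported from another process: it can map exported
    buffers, but it cannot allocate new memory."""

    def allocate(self, size, stream=None) -> NoReturn:
        # Typed as NoReturn so IDEs and type checkers flag any call site.
        raise TypeError(
            "imported memory resources cannot allocate; "
            "allocate in the exporting process instead"
        )
```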

Member


When we revisit this thread: We should keep the test, but I think we can avoid ImportedDeviceMemoryResource: #930 (comment)

@leofang
Member

leofang commented Sep 11, 2025

/ok to test 89d5e3f

@Andy-Jost
Contributor Author

Andy-Jost commented Sep 15, 2025

I think these changes are ready to go now.

Thanks, Andy!

  • We need to use IPC events for stream ordering, see my comments below in the test suite.

I did this by adding stream synchronization points. It is sufficient to test the functionality.

  • The question on import/export buffer methods is unanswered

Done

Also, could you add two entries (if I do not miss any) to the release note? cuda_core/docs/source/release/0.X.Y-notes.rst

  • mention IPC is supported on Linux
  • mention MRs can also take Device instances

Done

@Andy-Jost
Contributor Author

/ok to test 598f9a1

@Andy-Jost
Contributor Author

/ok to test 159cf41

Member

@leofang leofang left a comment


Left a few quick notes before calling it a night, will resume tomorrow!

@Andy-Jost
Contributor Author

/ok to test df7ea5c

Member

@leofang leofang left a comment


LGTM! Left a bunch of nits. Feel free to ignore or address later.

"""Test all properties of the DeviceMemoryResource class."""
device = mempool_device
if platform.system() == "Windows":
    return # IPC not implemented for Windows
Member


nit:

Suggested change
return # IPC not implemented for Windows
pytest.skip("IPC not implemented for Windows")

or move it to a class decorator.
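For reference, the class-decorator form would look something like this (the test class name here is made up):

```python
import platform

import pytest

# Hypothetical test class name; the skip condition and reason mirror the suggestion above.
@pytest.mark.skipif(platform.system() == "Windows", reason="IPC not implemented for Windows")
class TestDeviceMemoryResourceIPC:
    ...
```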

"""Test IPC with memory pools."""
# Set up the IPC-enabled memory pool and share it.
stream = ipc_device.create_stream()
mr = DeviceMemoryResource(ipc_device, dict(max_size=POOL_SIZE, ipc_enabled=True))
Member


would be better to get DeviceMemoryResourceOptions tested
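i.e., something along these lines (a sketch; POOL_SIZE and ipc_device come from the test above, the option names mirror the dict form, and the exact DeviceMemoryResourceOptions signature is assumed):

```python
from cuda.core.experimental import DeviceMemoryResource, DeviceMemoryResourceOptions

# Exercise the options dataclass directly rather than the plain-dict shortcut.
# Assumed constructor signature; check against the merged API.
options = DeviceMemoryResourceOptions(max_size=POOL_SIZE, ipc_enabled=True)
mr = DeviceMemoryResource(ipc_device, options)
```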

Comment on lines +77 to +79
mr = DeviceMemoryResource.from_shared_channel(device, channel)
handle = queue.get() # Get exported buffer data
buffer = Buffer.import_(mr, handle)
Member

@leofang leofang Sep 17, 2025


Thinking about this more, it's probably a better idea to avoid the whole discussion of ImportedDeviceMemoryResource by not making it user-visible:

handle = queue.get()
buffer = Buffer.import_(device, handle, channel)

The MR would be a construct internal to the (imported) buffer this way.

If we manage to bypass the user-specified IPC mechanism and use IPCChannel to serve all needs, the code would be even cleaner:

buffer = Buffer.import_(device, channel)

But again the main benefit is to avoid unnecessary exposure of ImportedDeviceMemoryResource; reducing 3 lines of code to only 1 line is not a big deal.

Comment on lines +81 to +83
protocol = IPCBufferTestProtocol(device, buffer, stream=stream)
protocol.verify_buffer(flipped=False)
protocol.fill_buffer(flipped=True)
Member


One thing that makes the code a bit hard to follow is that it is unclear whether the operations are stream-ordered. It would be better to pass stream explicitly instead of holding it internally in IPCBufferTestProtocol instances.
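e.g., something like the following (a sketch; this would change the IPCBufferTestProtocol method signatures from the current test code, so the stream parameter shown here is assumed):

```python
stream = device.create_stream()
protocol = IPCBufferTestProtocol(device, buffer)

# The stream is passed at each call site, so the ordering is explicit.
protocol.verify_buffer(flipped=False, stream=stream)
protocol.fill_buffer(flipped=True, stream=stream)
```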

stream = device.create_stream()
buffer = mr.allocate(1024, stream=stream)
assert buffer.handle != 0
buffer.close()
Member


nit

Suggested change
buffer.close()
buffer.close(stream)

assert value >= current_value, f"{property_name} should be >= {current_prop}"


def test_mempool_attributes_ownership(mempool_device):
Member


👍

@leofang
Member

leofang commented Sep 17, 2025

(Will merge tomorrow; see DM)

@leofang leofang merged commit 7a24bd8 into NVIDIA:main Sep 17, 2025
49 checks passed

@Andy-Jost Andy-Jost deleted the ipc-mempool-linux branch September 17, 2025 22:38