
Conversation

Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Aug 29, 2025

Updates cuda.core.experimental.DeviceMemoryResource to support IPC-enabled memory pools. This change updates the DMR constructor, adding an option to create a new memory pool with optional IPC enablement. The previous behavior, which gets the current device memory pool via cuDeviceGetMemPool, remains in effect when no constructor options are supplied.
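For orientation, a minimal sketch of the two constructor modes; the dict-of-options form and the `max_size`/`ipc_enabled` names are taken from the test code quoted later in this thread, and the exact signature should be checked against the merged API:

```python
from cuda.core.experimental import Device, DeviceMemoryResource

device = Device(0)
device.set_current()

# Default: wrap the device's current memory pool (cuDeviceGetMemPool), as before.
mr_default = DeviceMemoryResource(device)

# New: create a dedicated pool for this resource, optionally IPC-enabled (Linux only).
mr_ipc = DeviceMemoryResource(device, dict(max_size=2**21, ipc_enabled=True))
```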

@Andy-Jost Andy-Jost self-assigned this Aug 29, 2025
Contributor

copy-pr-bot bot commented Aug 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…memory pool. Removes the `current` staticmethod. Adds an option dataclass for constructor options.
@leofang leofang added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Aug 30, 2025
@Andy-Jost Andy-Jost marked this pull request as draft September 4, 2025 21:48
Introduces `IPCAllocationHandle` to manage pool-sharing resources.
Introduces `IPCChannel` for sharing allocation handles in a
platform-independent way (though currently only Linux is supported).
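For context, a rough sketch of the flow these classes enable, assembled only from the calls quoted later in this thread (export_buffer, from_shared_channel, Buffer.import_); how the IPCChannel itself is created and transported to the other process is not shown here and is left out of the sketch:

```python
from cuda.core.experimental import Buffer, Device, DeviceMemoryResource

def export_side(mr, buffer, queue):
    # Owning process: serialize an allocation handle for the buffer and send it
    # over any ordinary IPC transport (the tests use a multiprocessing queue).
    handle = mr.export_buffer(buffer)
    queue.put(handle)

def import_side(device, channel, queue):
    # Receiving process: rebuild a memory resource from the shared IPCChannel,
    # then map the exported allocation into this process's address space.
    mr = DeviceMemoryResource.from_shared_channel(device, channel)
    handle = queue.get()
    return Buffer.import_(mr, handle)
```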
Member

@leofang leofang left a comment


Thanks a lot, @Andy-Jost! Sorry for my late reply. I've gone through the implementation and made some suggestions, in 3 categories:

  1. design: I believe your IPCChannel abstraction is key to delivering feature parity while enabling a very nice UX. I've made a few suggestions below to tidy it up and make the UI even cleaner. Lots of objects and methods can be made private/non-user-facing!
    • Sorry, my comments might seem inconsistent at first glance. I reviewed _memory.pyx from top to bottom, so at the beginning I was not seeing the full picture. I had also forgotten the details of Keenan's PR, on which this one is based.
  2. implementation: in a few places we can cythonize more; also, the addition of __bool__ is a bit nerve-wracking
  3. semantics: I would prefer to save the discussion of (1) .set_current() and (2) view vs. owning semantics until the very end. I think the current implementation is OK. No need to change until we make another pass.

Tests look OK but I need to review them again later.

…method from MemoryResource; Cythonizes helper classes; disables __init__ for non-API classes; removes abstract factory constructor from IPCChannel; removes the _requires_ipc decorator (checks are now inlined)


@Andy-Jost Andy-Jost requested review from leofang and removed request for ksimpson-work September 9, 2025 20:28
Member

@leofang leofang left a comment


Thanks, Andy!

  • We need to use IPC events for stream ordering, see my comments below in the test suite.
  • The question on import/export buffer methods is unanswered

Also, could you add two entries (if I do not miss any) to the release note?
cuda_core/docs/source/release/0.X.Y-notes.rst

  • mention IPC is supported on Linux
  • mention MRs can also take Device instances

Comment on lines +144 to +149
ptr = ctypes.cast(int(self.scratch_buffer.handle), ctypes.POINTER(ctypes.c_byte))
op = (lambda i: 255 - i) if flipped else (lambda i: i)
for i in range(self.nbytes):
    assert ctypes.c_byte(ptr[i]).value == ctypes.c_byte(op(i)).value, (
        f"Buffer contains incorrect data at index {i}"
    )
Member


ditto, we can use (arr1 == arr2).all() to check

Contributor Author


This leads to an error:

>>> np.from_dlpack(self.buffer)
RuntimeError: Unsupported device in DLTensor.

Member


Ah, for device buffers (.is_device_accessible is True), we wrap them as CuPy arrays not NumPy. Just change it to cp.from_dlpack().

Member


it might be a good idea to just wrap both buffers as cupy arrays instead of numpy arrays like I originally suggested, because one is backed by device memory and the other by managed memory, and numpy can't handle the former
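A minimal sketch of that check, assuming both arguments support the DLPack protocol (as cuda.core buffers do) and that CuPy is installed:

```python
import cupy as cp

def buffers_equal(a, b) -> bool:
    # Zero-copy views over both buffers; CuPy handles device and managed memory,
    # which plain NumPy cannot.
    return bool((cp.from_dlpack(a) == cp.from_dlpack(b)).all())
```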

Comment on lines 56 to 58
# Export the buffer via IPC.
handle = mr.export_buffer(buffer)
queue.put(handle)
Member


I need to apologize for sharing my wrong understanding when syncing with you and Keenan.

There is a reason I mentioned IPC event handle here. It turns out that only the cudaIpcMemHandle_t is considered "unsafe"/"legacy"; cudaIpcEventHandle_t (and its driver counterpart) is not, and is still used in the "safe"/"modern" IPC example, see here.

This is important because of the stream ordering reason as I explained during the sync-up. The mental model is the same:

  • process 1 does work on buffer on its stream
  • process 1 creates an IPC event and records it on its stream
  • process 1 exports the buffer and the IPC event
  • process 2 imports the buffer and the event
  • process 2 has its stream wait on the event
  • process 2 does work on buffer on its stream
  • ...

So, as it stands, our test suite is not safe due to the lack of stream ordering.
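To make the ordering concrete, here is a rough sketch of that mental model; `export_ipc_event` and `import_ipc_event` are hypothetical placeholders for the IPC event plumbing (cuIpcGetEventHandle / cuIpcOpenEventHandle underneath), which this PR does not add:

```python
from cuda.core.experimental import Buffer

def producer(device, mr, buffer, queue, do_work):
    # Process 1: work on the buffer, record an event on the producing stream,
    # then export the buffer together with the IPC event.
    stream = device.create_stream()
    do_work(buffer, stream=stream)
    event = stream.record()
    queue.put((mr.export_buffer(buffer), export_ipc_event(event)))  # hypothetical helper

def consumer(device, mr, queue, do_work):
    # Process 2: import the buffer and the event, make the consuming stream
    # wait on the event, and only then touch the buffer.
    stream = device.create_stream()
    handle, event_handle = queue.get()
    buffer = Buffer.import_(mr, handle)
    stream.wait(import_ipc_event(event_handle))  # hypothetical helper
    do_work(buffer, stream=stream)
```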

Contributor Author


Thanks for explaining this, Leo. I've addressed this in the tests by adding the appropriate stream synchronization. That is good enough for testing correctness. If we need a cleaner Python example using events, we can develop it outside this PR.

Collaborator


Can we add a test that captures the expected failure behavior when someone imports a DeviceMemoryResource via DeviceMemoryResource.from_shared_channel and tries to allocate using it? I'm not convinced that throwing on allocation is a good user experience here, versus having something like an ImportedDeviceMemoryResource that would allow better introspection and IDE support when someone writes code that tries to allocate from it.

Contributor Author


See f84d9d8 for this test.

I agree that throwing is not a great user experience. Wouldn't an ImportedDeviceMemoryResource still need to implement a throwing allocate function, though?

Collaborator


@Andy-Jost yes, but at least the user would be able to introspect the class, as well as type the allocate function as NoReturn (https://docs.python.org/3/library/typing.html#typing.NoReturn), which in theory should guide users much more nicely.
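For illustration, a minimal sketch of what such a class could look like (hypothetical; the final design avoids exposing it, see the follow-up comments below):

```python
from typing import NoReturn

from cuda.core.experimental import DeviceMemoryResource

# Hypothetical class, not part of this PR; shown only to illustrate the NoReturn typing idea.
class ImportedDeviceMemoryResource(DeviceMemoryResource):
    """A memory resource imported from another process: it can map exported
    buffers, but it cannot allocate new memory."""

    def allocate(self, size, stream=None) -> NoReturn:
        # Typed as NoReturn so IDEs and type checkers flag any call site.
        raise TypeError(
            "imported memory resources cannot allocate; "
            "allocate in the exporting process instead"
        )
```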

Member


When we revisit this thread: We should keep the test, but I think we can avoid ImportedDeviceMemoryResource: #930 (comment)

@leofang
Member

leofang commented Sep 11, 2025

/ok to test 89d5e3f

@Andy-Jost
Contributor Author

Andy-Jost commented Sep 15, 2025

I think these changes are ready to go now.

Thanks, Andy!

  • We need to use IPC events for stream ordering, see my comments below in the test suite.

I did this by adding stream synchronization points. It is sufficient to test the functionality.

  • The question on import/export buffer methods is unanswered

Done

Also, could you add two entries (if I do not miss any) to the release note? cuda_core/docs/source/release/0.X.Y-notes.rst

  • mention IPC is supported on Linux
  • mention MRs can also take Device instances

Done

@Andy-Jost
Contributor Author

/ok to test 598f9a1

@Andy-Jost
Contributor Author

/ok to test 159cf41

Member

@leofang leofang left a comment


Left a few quick notes before calling it a night, will resume tomorrow!

@Andy-Jost
Contributor Author

/ok to test df7ea5c

Member

@leofang leofang left a comment


LGTM! Left a bunch of nits. Feel free to ignore or address later.

"""Test all properties of the DeviceMemoryResource class."""
device = mempool_device
if platform.system() == "Windows":
    return # IPC not implemented for Windows
Member


nit:

Suggested change
return # IPC not implemented for Windows
pytest.skip("IPC not implemented for Windows")

or move it to a class decorator.
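For reference, the class-decorator form would look something like this (the test class name here is made up):

```python
import platform

import pytest

# Hypothetical test class name; the skip condition and reason mirror the suggestion above.
@pytest.mark.skipif(platform.system() == "Windows", reason="IPC not implemented for Windows")
class TestDeviceMemoryResourceIPC:
    ...
```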

"""Test IPC with memory pools."""
# Set up the IPC-enabled memory pool and share it.
stream = ipc_device.create_stream()
mr = DeviceMemoryResource(ipc_device, dict(max_size=POOL_SIZE, ipc_enabled=True))
Member


would be better to get DeviceMemoryResourceOptions tested
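i.e., something along these lines (a sketch; POOL_SIZE and ipc_device come from the test above, the option names mirror the dict form, and the exact DeviceMemoryResourceOptions signature is assumed):

```python
from cuda.core.experimental import DeviceMemoryResource, DeviceMemoryResourceOptions

# Exercise the options dataclass directly rather than the plain-dict shortcut.
# Assumed constructor signature; check against the merged API.
options = DeviceMemoryResourceOptions(max_size=POOL_SIZE, ipc_enabled=True)
mr = DeviceMemoryResource(ipc_device, options)
```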

Comment on lines +77 to +79
mr = DeviceMemoryResource.from_shared_channel(device, channel)
handle = queue.get() # Get exported buffer data
buffer = Buffer.import_(mr, handle)
Member

@leofang leofang Sep 17, 2025


Thinking about this more, it's probably a better idea to avoid the whole discussion of ImportedDeviceMemoryResource by not making it user-visible:

handle = queue.get()
buffer = Buffer.import_(device, handle, channel)

The MR would be a construct internal to the (imported) buffer this way.

If we manage to bypass the user-specified IPC mechanism and use IPCChannel to serve all needs, the code would be even cleaner:

buffer = Buffer.import_(device, channel)

But again the main benefit is to avoid unnecessary exposure of ImportedDeviceMemoryResource; reducing 3 lines of code to only 1 line is not a big deal.

Comment on lines +81 to +83
protocol = IPCBufferTestProtocol(device, buffer, stream=stream)
protocol.verify_buffer(flipped=False)
protocol.fill_buffer(flipped=True)
Member


One thing that makes the code a bit hard to follow is that it is unclear whether the operations are stream-ordered. It would be better to pass stream explicitly instead of holding it internally in IPCBufferTestProtocol instances.
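e.g., something like the following (a sketch; this would change the IPCBufferTestProtocol method signatures from the current test code, so the stream parameter shown here is assumed):

```python
stream = device.create_stream()
protocol = IPCBufferTestProtocol(device, buffer)

# The stream is passed at each call site, so the ordering is explicit.
protocol.verify_buffer(flipped=False, stream=stream)
protocol.fill_buffer(flipped=True, stream=stream)
```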

stream = device.create_stream()
buffer = mr.allocate(1024, stream=stream)
assert buffer.handle != 0
buffer.close()
Member


nit

Suggested change
buffer.close()
buffer.close(stream)

assert value >= current_value, f"{property_name} should be >= {current_prop}"


def test_mempool_attributes_ownership(mempool_device):
Member


👍

@leofang
Member

leofang commented Sep 17, 2025

(Will merge tomorrow; see DM)

@leofang leofang merged commit 7a24bd8 into NVIDIA:main Sep 17, 2025
49 checks passed

@Andy-Jost Andy-Jost deleted the ipc-mempool-linux branch September 17, 2025 22:38