Implements IPC-enabled memory pools for Linux in DeviceMemoryResource #930
Conversation
…rent` to access the default memory pool.
…memory pool. Removes the `current` staticmethod. Adds an option dataclass for constructor options.
Introduces `IPCAllocationHandle` to manage pool-sharing resources. Introduces `IPCChannel` for sharing allocation handles in a platform-independent way (though currently only Linux is supported).
Force-pushed from 5b57f77 to cc1fa40
Thanks a lot, @Andy-Jost! Sorry for my late reply. I've gone through the implementation and made some suggestions, in 3 categories:

- design: I believe your `IPCChannel` abstraction is key to delivering feature parity while enabling very nice UX. I've made a few suggestions below to tidy it up and make the UI even cleaner. Lots of objects and methods can be turned private/non-user-facing!
  - Sorry, my comments might seem inconsistent at first glance. I reviewed `_memory.pyx` from top to bottom, so in the beginning I was not seeing the full picture. I had also forgotten the details of Keenan's PR, on which this one is based.
- implementation: in a few places we can cythonize more; also, the addition of `__bool__` is a bit nerve-wracking.
- semantics: I would prefer to save the discussion of 1. `.set_current()`, 2. view vs. owning, to the very end. I think the current implementation is OK. No need to change until we are making another pass.

Tests look OK but I need to review them again later.
…method from MemoryResource; Cythonizes helper classes; disables __init__ for non-API classes; removes abstract factory constructor from IPCChannel; removes the _requires_ipc decorator (checks are now inlined)
Thanks, Andy!

- We need to use IPC events for stream ordering; see my comments below in the test suite.
- The question on import/export buffer methods is unanswered.

Also, could you add two entries (if I am not missing any) to the release note, `cuda_core/docs/source/release/0.X.Y-notes.rst`?

- mention IPC is supported on Linux
- mention MRs can also take `Device` instances
```python
ptr = ctypes.cast(int(self.scratch_buffer.handle), ctypes.POINTER(ctypes.c_byte))
op = (lambda i: 255 - i) if flipped else (lambda i: i)
for i in range(self.nbytes):
    assert ctypes.c_byte(ptr[i]).value == ctypes.c_byte(op(i)).value, (
        f"Buffer contains incorrect data at index {i}"
    )
```
ditto, we can use `(arr1 == arr2).all()` to check
This leads to an error:

```python
>>> np.from_dlpack(self.buffer)
RuntimeError: Unsupported device in DLTensor.
```
Ah, for device buffers (`.is_device_accessible` is True), we wrap them as CuPy arrays, not NumPy. Just change it to `cp.from_dlpack()`.
it might be a good idea to just wrap both buffers as CuPy arrays instead of NumPy arrays like I originally suggested, because one is backed by device memory and the other by managed memory, and NumPy can't handle the former.
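A minimal sketch of that suggestion, assuming both `buffer` and `scratch_buffer` are cuda.core `Buffer` objects that expose the DLPack protocol (names taken from the quoted test code):

```python
import cupy as cp

# Wrap both buffers as CuPy arrays so the comparison stays on the GPU;
# CuPy can handle both device and managed memory, unlike NumPy.
arr1 = cp.from_dlpack(buffer)
arr2 = cp.from_dlpack(scratch_buffer)
assert bool((arr1 == arr2).all()), "Buffer contains incorrect data"
```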
cuda_core/tests/test_ipc_mempool.py (outdated)
```python
# Export the buffer via IPC.
handle = mr.export_buffer(buffer)
queue.put(handle)
```
I need to apologize for sharing my wrong understanding when syncing with you and Keenan.

There is a reason I mentioned the IPC event handle here. It turns out that only `cudaIpcMemHandle_t` is considered "unsafe"/"legacy"; `cudaIpcEventHandle_t` (and its driver counterpart) is not, and is still used in the "safe"/"modern" IPC example, see here.

This is important because of the stream-ordering reason I explained during the sync-up. The mental model is the same (a rough sketch follows the list):

- process 1 does work on the buffer on its stream
- process 1 creates an IPC event and records it on its stream
- process 1 exports the buffer and the IPC event
- process 2 imports the buffer and the event
- process 2 has its stream wait on the event
- process 2 does work on the buffer on its stream
- ...

So as it stands, our test suite is not safe due to the lack of stream ordering.
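A rough sketch of this recipe using the low-level cuda-python driver bindings rather than the cuda.core API in this PR; error checking is omitted, and `stream1`/`stream2`, the `send`/`recv` transport, and the way the 64-byte event handle is (de)serialized are assumptions for illustration:

```python
from cuda.bindings import driver

# --- Process 1: record its buffer work on an interprocess event and export it.
flags = (driver.CUevent_flags.CU_EVENT_INTERPROCESS
         | driver.CUevent_flags.CU_EVENT_DISABLE_TIMING)
err, event = driver.cuEventCreate(flags)
# ... launch work on the buffer on stream1 ...
err, = driver.cuEventRecord(event, stream1)
err, event_handle = driver.cuIpcGetEventHandle(event)
send(bytes(event_handle.reserved))   # ship alongside the exported buffer handle

# --- Process 2: import the event and order its stream behind process 1.
handle = driver.CUipcEventHandle()
handle.reserved = recv()             # assumes `.reserved` is assignable
err, peer_event = driver.cuIpcOpenEventHandle(handle)
err, = driver.cuStreamWaitEvent(stream2, peer_event, 0)
# ... now it is safe to launch work on the imported buffer on stream2 ...
```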
Thanks for explaining this, Leo. I've addressed this in the tests by adding the appropriate stream synchronization. That is good enough for testing correctness. If we need a cleaner Python example using events, we can develop it outside this PR.
Can we add a test that captures the expected failure behavior for when someone imports a `DeviceMemoryResource` via `DeviceMemoryResource.from_shared_channel` and tries to allocate using it? I'm not convinced that throwing on allocation is a good user experience here vs. having something like an `ImportedDeviceMemoryResource` that would allow better introspection and IDE support if someone tried to write some code that tries to allocate from it.
See f84d9d8 for this test.

I agree that throwing is not a great user experience. Wouldn't an `ImportedDeviceMemoryResource` still need to implement a throwing `allocate` function, though?
@Andy-Jost yes, but at least the user would be able to introspect the class, as well as type the `allocate` function as `NoReturn` (https://docs.python.org/3/library/typing.html#typing.NoReturn), which in theory should help guide users much more nicely?
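A hypothetical sketch of the idea (`ImportedDeviceMemoryResource` is a name from this thread, not part of the PR): typing `allocate` as `NoReturn` lets type checkers and IDEs flag any attempt to allocate from an imported, receive-only memory resource.

```python
from typing import NoReturn


class ImportedDeviceMemoryResource:  # hypothetical class, for illustration only
    """A memory resource imported from another process; allocation is disallowed."""

    def allocate(self, size: int, stream=None) -> NoReturn:
        raise TypeError(
            "Cannot allocate from an imported (receive-only) memory resource; "
            "import buffers exported by the owning process instead."
        )
```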
When we revisit this thread: we should keep the test, but I think we can avoid `ImportedDeviceMemoryResource`: #930 (comment)
/ok to test 89d5e3f
…dds stream synchronization. Eliminates unnecessary clean-up. Removes unnecessary check for CUDA 12 or later.
I think these changes are ready to go now.
I did this by adding stream synchronization points. It is sufficient to test the functionality.
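As a rough illustration of what such a synchronization point looks like on the exporting side (assuming a cuda.core `Stream` and the test's multiprocessing queue):

```python
# Block the host until prior device work on `stream` has finished,
# then hand the exported buffer handle to the peer process.
stream.sync()
queue.put(mr.export_buffer(buffer))
```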
Done
Done
/ok to test 598f9a1
/ok to test 159cf41
Left a few quick notes before calling it a night, will resume tomorrow!
/ok to test df7ea5c
LGTM! Left a bunch of nits. Feel free to ignore or address later.
"""Test all properties of the DeviceMemoryResource class.""" | ||
device = mempool_device | ||
if platform.system() == "Windows": | ||
return # IPC not implemented for Windows |
nit:

```diff
-    return  # IPC not implemented for Windows
+    pytest.skip("IPC not implemented for Windows")
```

or move it to a class decorator.
"""Test IPC with memory pools.""" | ||
# Set up the IPC-enabled memory pool and share it. | ||
stream = ipc_device.create_stream() | ||
mr = DeviceMemoryResource(ipc_device, dict(max_size=POOL_SIZE, ipc_enabled=True)) |
would be better to get `DeviceMemoryResourceOptions` tested
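A minimal sketch of that suggestion, assuming the options dataclass added in this PR accepts the same fields as the dict used above and is exported alongside `DeviceMemoryResource`:

```python
from cuda.core.experimental import DeviceMemoryResource, DeviceMemoryResourceOptions

options = DeviceMemoryResourceOptions(max_size=POOL_SIZE, ipc_enabled=True)
mr = DeviceMemoryResource(ipc_device, options)
```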
```python
mr = DeviceMemoryResource.from_shared_channel(device, channel)
handle = queue.get()  # Get exported buffer data
buffer = Buffer.import_(mr, handle)
```
Thinking about this more, it's probably a better idea to avoid the whole discussion of `ImportedDeviceMemoryResource` by not making it user-visible:

```python
handle = queue.get()
buffer = Buffer.import_(device, handle, channel)
```

The MR would be a construct internal to the (imported) buffer this way.

If we manage to bypass the user-specified IPC mechanism and use `IPCChannel` to serve all needs, the code would be even cleaner:

```python
buffer = Buffer.import_(device, channel)
```

But again, the main benefit is to avoid unnecessary exposure of `ImportedDeviceMemoryResource`; reducing 3 lines of code to only 1 line is not a big deal.
```python
protocol = IPCBufferTestProtocol(device, buffer, stream=stream)
protocol.verify_buffer(flipped=False)
protocol.fill_buffer(flipped=True)
```
One thing that makes the code a bit hard to follow is that it is unclear to me whether the operations are stream-ordered. It'd be better to pass `stream` explicitly instead of holding it internally in `IPCBufferTestProtocol` instances.
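A hypothetical rewrite of the quoted lines along those lines; the method signatures shown are assumptions, not the test's actual API:

```python
# Pass the stream at each call site so the stream ordering is visible to the reader.
protocol = IPCBufferTestProtocol(device, buffer)
protocol.verify_buffer(stream, flipped=False)
protocol.fill_buffer(stream, flipped=True)
```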
```python
stream = device.create_stream()
buffer = mr.allocate(1024, stream=stream)
assert buffer.handle != 0
buffer.close()
```
nit:

```diff
-    buffer.close()
+    buffer.close(stream)
```
```python
assert value >= current_value, f"{property_name} should be >= {current_prop}"


def test_mempool_attributes_ownership(mempool_device):
```
👍
(Will merge tomorrow; see DM)
Updates `cuda.core.experimental.DeviceMemoryResource` to support IPC-enabled memory pools. This change updates the DMR constructor, adding an option to create a new memory pool with optional IPC enablement. The previous behavior, which gets the current device memory pool via `cuDeviceGetMemPool`, remains in effect when no constructor options are supplied.
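A rough end-to-end sketch assembled from the API names discussed in this PR (`DeviceMemoryResourceOptions`, `IPCChannel`, `export_buffer`, `from_shared_channel`, `Buffer.import_`); the exact signatures, import paths, and how the `IPCChannel` and buffer handle are transported between processes are assumptions, not the merged API:

```python
from cuda.core.experimental import Buffer, Device, DeviceMemoryResource, DeviceMemoryResourceOptions

# Exporting process: create an IPC-enabled pool and export a buffer from it.
device = Device()
device.set_current()
stream = device.create_stream()
mr = DeviceMemoryResource(device, DeviceMemoryResourceOptions(ipc_enabled=True))
buffer = mr.allocate(1024, stream=stream)
handle = mr.export_buffer(buffer)  # send `handle` (and the pool's IPCChannel) to the peer

# Importing process: open the shared pool, then import the buffer from it.
peer_mr = DeviceMemoryResource.from_shared_channel(device, channel)  # `channel`: the shared IPCChannel
imported = Buffer.import_(peer_mr, handle)
```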