IPC Mempool Serialization and multiprocessing Module Support #1020
Conversation
…clear). Added a test for an error case.
…esource. Add tests for buffer IPC through serialization.
…egistry from imported memory resources so that buffers can be serialized using an mr key. Test updates.
Force-pushed from 4514bf9 to cca13b5.
Force-pushed from cca13b5 to 4820512.
Force-pushed from 4820512 to 6c53cb0.
Force-pushed from ff29756 to 6be4686.
/ok to test 6be4686
Force-pushed from 6be4686 to e0d0bf4.
/ok to test e0d0bf4
It looks like this change is good to go. I'm not aware of any remaining issues.
Awesome, thanks Andy! I'll re-review in an hour or two and then merge. I don't expect any change is needed, just wanna read the new tests once more.
try:
    assert self._uuid is None
    import uuid
    self._uuid = uuid.uuid4()
@ksimpson-work I ended up creating unique identifiers to track memory pools across processes. Is this something we could add to the Driver API as a mempool attribute?
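For context, the per-pool UUID plus registry approach mentioned above can be sketched roughly as follows. All names here (`MemoryResourceRegistry`, `register`, `lookup`) are illustrative placeholders, not the PR's actual classes:

```python
import uuid


class MemoryResourceRegistry:
    """Sketch of a registry mapping a per-pool UUID to an imported memory
    resource, so that serialized buffers can reference their pool by key.
    This is a hypothetical stand-in, not cuda.core's API."""

    def __init__(self):
        self._by_key = {}

    def register(self, mr):
        # Tag the resource with a fresh UUID and remember it.
        key = uuid.uuid4()
        self._by_key[key] = mr
        return key

    def lookup(self, key):
        # A serialized buffer carries only the key; resolve it here.
        return self._by_key[key]
```

A buffer's serialized form would then carry the small UUID key instead of the memory resource itself, and the receiving process resolves the key against its own registry of imported resources.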
Sorry Andy, I re-read this again and still have a few questions 😓
btw we should fix a few doc issues:

- add `DeviceMemoryResourceOptions` to `cuda_core/docs/source/api.rst`
- add `IPCAllocationHandle` to `cuda_core/docs/source/api_private.rst`
- address other comments below

The doc preview CI can be used to check what's missing.
    return device

multiprocessing.reduction.register(Device, _reduce_device)
Sorry, I can't resist circling back to ask this: We wanted `Device` to be serializable when only `Buffer` objects are passed over to child processes through `multiprocessing` APIs. We called `.set_current()` because the child function arguments are restored/deserialized prior to dropping into the function (xref: #1020 (comment)).
Questions:

- What happens in a multi-GPU environment? If I send over `buf0` allocated on GPU 0 and `buf1` allocated on GPU 1 to the child, which device is current by the time the child function starts executing? What if we change the order of args from `(buf0, buf1)` to `(buf1, buf0)`? We don't have CI for multi-GPU tests for now, but the design should be multi-GPU-safe.
- If we really need a device ID, can't we just look it up from `Buffer.mr.device_id` instead of serializing `Device`?
I still feel it is unwise to set the global state behind users' back.
The problem here is that I don't see a way to initialize CUDA (?) without calling `set_current` on a device. If I remove `set_current` from `_reconstruct_device`, then I get `CUDA_ERROR_INVALID_CONTEXT` errors everywhere. `Device._has_inited` is only set in `set_current`. To avoid messing with the global state, I'd rather say something like `Device(n).init()` to be sure that device n is ready to go. Then, later, when mapping a buffer, I could use `mr.device_id` to get that device. I just don't see any way to do that with the current `_device` code.
OK, I think I have something here that we can accept. I was able to remove the call to `set_current` in `Device` reconstruction, since it turns out that memory resources can be mapped without setting a device context.

The only thing that really needs to be done implicitly when a process spawns is to call `cuInit`. That needs to be the very first thing, or else all CUDA API calls (including mapping a memory resource or buffer) fail. It can be accomplished by just constructing a `Device`.

I tried removing device serialization completely, but that actually makes the code quite a bit more cumbersome in several places. It's just easier to send an applicative of the form `(Device, (device_id,))` rather than send a device ID, import `Device`, and then reconstruct manually on the other end. I also think it is conceptually sensible for `Device` objects to be serializable.
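The `(callable, args)` applicative described here is exactly the form that `multiprocessing.reduction.register` consumes. A minimal self-contained sketch, using a stand-in `Device` class rather than cuda.core's:

```python
import pickle
from multiprocessing import reduction


class Device:
    # Stand-in for cuda.core's Device; only device_id matters for this sketch.
    def __init__(self, device_id=0):
        self.device_id = device_id


def _reduce_device(device):
    # Serialize a Device as the applicative (Device, (device_id,)); the
    # receiving end reconstructs it by calling Device(device_id).
    return (Device, (device.device_id,))


# Register the reducer with multiprocessing's ForkingPickler so that Devices
# embedded in task arguments are serialized this way automatically.
reduction.register(Device, _reduce_device)

# Round-trip through the same pickler multiprocessing itself uses:
data = reduction.ForkingPickler.dumps(Device(1))
restored = pickle.loads(data)
assert restored.device_id == 1
```

This is why serializing the `Device` directly is lighter-weight at the call sites than shipping a bare device ID and reconstructing manually on the other end.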
Sounds great, I'll take a look asap! A few quick notes before I forget:

- CUDA mempools are by design not bound to any CUDA context. We (CUDA) used to have a context as the powerhouse for everything (memory, queue, module, ...), but now we are making things "context-independent," including mempools and kernel modules.
- I read the `multiprocessing` docs more carefully, and I think we spawned the processes wrong (we should document this after this PR, in a Q&A or Tips & Tricks like in cuda-bindings). As you said, calling `cuInit()` is the minimum:
  - For processes spawned via `Process`, we can inherit from the `Process` class; in the constructor `__init__()` we call `Device().set_current()`, which would under the hood call `cuInit()`, and then call `super().__init__()`. My take is that this will ensure CUDA is initialized before the arguments are deserialized.
  - For processes spawned via a `Pool` of workers, the same can be done by passing `initializer` and `initargs`.

Prove me wrong! 🙂
I think subclassing `Process` as you describe would work. Are you suggesting we ship a subclass of `Process` to simplify this for our users? The current code places `set_current` calls within the child main functions. That's actually pretty nice, since it mirrors what we ask users to do in any process. As you point out, the memory pool and buffer mappings don't require a context, which simplifies this a lot.

When using worker pools, `initializer` and `initargs` do work (we have examples of this in `test_workpool.py`). Still, I think there's a ton of value in supporting the ultra-simple use case of mapping a worker pool over buffers without doing anything special at startup. The latest code just puts the call to `set_current` within each worker function, which seems great, actually. That way, one could mix and match buffers from different devices and it ought to work.

One limitation I'm facing now is that I don't know whether the test runners have multiple GPUs, so I'm not sure how to move forward with developing the multi-device scenario.
> Are you suggesting we ship a subclass of `Process` to simplify this for our users?

Certainly not, at least not now (or ever). We should only do so in our own test suite, and teach users about this.

We did not have to teach this in most Python GPU libraries (wearing my CuPy hat), because most libraries only interact with CUDA through the runtime instead of the driver, and the runtime does the implicit initialization in every single cudart API. We don't, and by design we always want users to call `dev.set_current()` first.

Now, with our own test suite, your other question is valid: `Device().set_current()` sets GPU 0 as current by default if there is no prior current device, which is a bit unfortunate if the child main wants to use, say, GPU 1. I think for now we need to call `set_current()` in both places: the process initializer (as a proxy for `cuInit()`, because we definitely don't want to teach this) as well as the child main (to actually select the right GPU, following the usual cuda.core requirement). We should do this for now and discuss whether there is a better approach at the Friday meeting. I believe cccl-rt will hit the same issue: NVIDIA/cccl#6073.
…yResource reduction with multiprocessing. Add a quick exit to from_allocation_handle. Simplify the worker pool tests based on the new reduction method.
Force-pushed from 5cd1043 to e5b8542.
/ok to test

@Andy-Jost, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

/ok to test e5b8542

/ok to test 64f154c
"""Unregister this mapped memory resource."""
assert self.is_mapped
if _ipc_registry is not None:  # can occur during shutdown catastrophe
❗
/ok to test d584ea9
# Note: if the buffer is not attached to something to prolong its life,
# CUDA_ERROR_INVALID_CONTEXT is raised from Buffer.__del__
I suspect this comment is outdated.
Adds serialization to memory IPC: memory resources can be passed directly to new processes. Expands testing. Removes `IPCChannel`.