
Conversation

Andy-Jost
Contributor

Adds serialization to memory IPC: memory resources can be passed directly to new processes. Expands testing. Removes IPCChannel.


copy-pr-bot bot commented Sep 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost Andy-Jost requested a review from leofang September 25, 2025 17:29
@Andy-Jost Andy-Jost self-assigned this Sep 25, 2025
@Andy-Jost Andy-Jost added the cuda.core (Everything related to the cuda.core module) and P0 (High priority - Must do!) labels Sep 25, 2025
@Andy-Jost Andy-Jost requested a review from rparolin September 25, 2025 20:45
@Andy-Jost Andy-Jost changed the title from "Ipc mempool channel" to "IPC Mempool Serialization and multiprocessing Module Support" Sep 25, 2025
@leofang leofang added this to the cuda.core beta 7 milestone Sep 26, 2025
@Andy-Jost Andy-Jost force-pushed the ipc-mempool-channel branch from ff29756 to 6be4686 Compare October 1, 2025 22:24
@Andy-Jost (Contributor Author)

/ok to test 6be4686

@Andy-Jost Andy-Jost force-pushed the ipc-mempool-channel branch from 6be4686 to e0d0bf4 Compare October 1, 2025 22:57
@Andy-Jost (Contributor Author)

/ok to test e0d0bf4

@Andy-Jost (Contributor Author)

It looks like this change is good to go. I'm not aware of any remaining issues.

@leofang (Member) commented Oct 2, 2025

Awesome, thanks Andy! I'll re-review in an hour or two and then merge. I don't expect any changes are needed; I just want to read the new tests once more.

try:
    assert self._uuid is None
    import uuid
    self._uuid = uuid.uuid4()
@Andy-Jost (Contributor Author)

@ksimpson-work I ended up creating unique identifiers to track memory pools across processes. Is this something we could add to the Driver API as a mempool attribute?
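For context, a minimal sketch of the registry idea, assuming a process-local dict keyed by the pool's UUID. The name _ipc_registry appears elsewhere in this PR, but its exact shape and the lookup_or_map helper below are illustrative, not the merged implementation:

```python
import uuid

# Process-local registry of mapped memory resources, keyed by the UUID
# assigned at pool creation, so a pool that arrives several times (e.g.
# via multiple buffers) is mapped only once per process.
_ipc_registry: dict[uuid.UUID, object] = {}

def lookup_or_map(pool_uuid: uuid.UUID, map_fn):
    """Return the already-mapped resource for this UUID, or map it now."""
    mr = _ipc_registry.get(pool_uuid)
    if mr is None:
        mr = map_fn()
        _ipc_registry[pool_uuid] = mr
    return mr
```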

@leofang leofang mentioned this pull request Oct 2, 2025
@leofang (Member) left a comment

Sorry Andy, I re-read this again and still have a few questions 😓

btw, we should fix a few doc issues:

  • add DeviceMemoryResourceOptions to cuda_core/docs/source/api.rst
  • add IPCAllocationHandle to cuda_core/docs/source/api_private.rst
  • address other comments below

The doc preview CI can be used to check what's missing.

    return device


multiprocessing.reduction.register(Device, _reduce_device)
@leofang (Member) Oct 2, 2025

Sorry, I can't resist circling back to ask this: we wanted Device to be serializable when only Buffer objects are passed over to child processes through multiprocessing APIs. We called .set_current() because the child function arguments are restored/deserialized prior to dropping into the function (xref: #1020 (comment)).

Questions:

  1. What happens in a multi-GPU environment? If I send over buf0 allocated on GPU 0 and buf1 allocated on GPU 1 to the child, which device is current by the time the child function starts executing? What if we change the order of args from (buf0, buf1) to (buf1, buf0)? We don't have CI for multi-GPU tests for now, but the design should be multi-GPU-safe.
  2. If we really need a device ID, can't we just look up from Buffer.mr.device_id, instead of serializing Device?

I still feel it is unwise to set global state behind users' backs.

@Andy-Jost (Contributor Author) Oct 2, 2025

The problem here is that I don't see a way to initialize CUDA (?) without calling set_current on a device. If I remove set_current from _reconstruct_device then I get CUDA_ERROR_INVALID_CONTEXT errors everywhere.

Device._has_inited is only set in set_current. To avoid messing with the global state, I'd rather say something like Device(n).init() to be sure that device n is ready to go. Then, later, when mapping a buffer, I could use mr.device_id to get that device. I just don't see any way to do that with the current _device code.

@Andy-Jost (Contributor Author)

OK, I think I have something here that we can accept. I was able to remove the call to set_current in Device reconstruction, since it turns out that memory resources can be mapped without setting a device context.

The only thing that really needs to be done implicitly when a process spawns is to call cuInit. This needs to be the very first thing, or else all CUDA API calls (including mapping a memory resource or buffer) fail. That can be accomplished by just constructing a Device.

I tried removing device serialization completely, but that actually makes the code quite a bit more cumbersome in several places. It's just easier to send an applicative of the form (Device, (device_id,)) than to send a device ID, import Device, and then reconstruct manually on the other end. I also think it is conceptually sensible for Device objects to be serializable.
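For reference, a minimal sketch of that reduction, assuming Device exposes device_id and that constructing a Device triggers cuInit. The helper names mirror the hunk quoted earlier, but the bodies are illustrative rather than the merged code:

```python
import multiprocessing.reduction

from cuda.core.experimental import Device

def _reconstruct_device(device_id):
    # Runs in the child during unpickling; constructing a Device
    # initializes CUDA (cuInit) before anything else touches the driver.
    device = Device(device_id)
    return device

def _reduce_device(device):
    # Pickle a Device as a (callable, args) pair; unpickling calls
    # _reconstruct_device(device_id) on the receiving side.
    return _reconstruct_device, (device.device_id,)

multiprocessing.reduction.register(Device, _reduce_device)
```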

@leofang (Member) Oct 3, 2025

Sounds great, I'll take a look asap! A few quick notes before I forget:

  1. CUDA mempools are by design not bound to any CUDA context. We (CUDA) used to have the context as the powerhouse for everything (memory, queues, modules, ...), but now we are making things "context-independent," including mempools and kernel modules.
  2. I read the multiprocessing docs more carefully, and I think we spawned the processes wrong (we should document this after this PR, in a Q&A or Tips & Tricks section like in cuda-bindings). As you said, calling cuInit() is the minimum:
    • For processes spawned via Process, we can inherit from the Process class; in the constructor __init__() we call Device().set_current(), which under the hood calls cuInit(), and then call super().__init__(). My take is that this ensures CUDA is initialized before the arguments are deserialized.
    • For processes spawned via a Pool of workers, the same can be done by passing initializer and initargs. (Both patterns are sketched after this comment.)

Prove me wrong! 🙂
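A minimal sketch of the two patterns just described, assuming Device().set_current() is a sufficient stand-in for cuInit(); CudaProcess and _init_cuda are illustrative names, not shipped API:

```python
import multiprocessing

from cuda.core.experimental import Device

class CudaProcess(multiprocessing.Process):
    # Pattern 1: a Process subclass that initializes CUDA up front,
    # per the suggestion above, before deferring to Process.__init__.
    def __init__(self, *args, **kwargs):
        Device().set_current()
        super().__init__(*args, **kwargs)

def _init_cuda():
    # Pattern 2: a Pool initializer that runs once per worker, before
    # any task arguments are deserialized.
    Device().set_current()

if __name__ == "__main__":
    with multiprocessing.Pool(processes=2, initializer=_init_cuda) as pool:
        pass  # map worker functions over IPC-shared buffers here
```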

@Andy-Jost (Contributor Author)

I think subclassing Process as you describe would work. Are you suggesting we ship a subclass of Process to simplify this for our users? The current code places set_current calls within the child main functions. That's actually pretty nice, since it mirrors what we ask users to do in any process. As you point out, the memory pool and buffer mappings don't require a context, which simplifies this a lot.

When using worker pools, initializer and initargs do work (we have examples of this in test_workpool.py). Still, I think there's a ton of value in supporting the ultra-simple use case of mapping a worker pool over buffers without doing anything special at startup. The latest code just puts the call to set_current within each worker function, which seems great, actually. That way, one can mix and match buffers from different devices and it ought to work; a worker in that style is sketched below.
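A hypothetical worker in that style, assuming the buffer's memory resource is reachable as buf.memory_resource (the lookup leofang refers to above as Buffer.mr.device_id):

```python
from cuda.core.experimental import Device

def worker(buf):
    # Select whichever device this buffer's pool lives on, rather than
    # relying on process-wide initialization having picked the right one.
    Device(buf.memory_resource.device_id).set_current()
    # ... operate on the mapped buffer ...
```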

One limitation I'm facing now is that I don't know whether the test runners have multiple GPUs, so I'm not sure how to move forward with developing the multi-device scenario.

@leofang (Member) Oct 3, 2025

"Are you suggesting we ship a subclass of Process to simplify this for our users?"

Certainly not, at least not now (maybe never). We should only do so in our own test suite, and teach users about the pattern.

We did not have to teach this in most Python GPU libraries (wearing my CuPy hat), because most libraries interact with CUDA only through the runtime instead of the driver, and the runtime does implicit initialization in every single cudart API. We don't, and by design we always want users to call dev.set_current() first.

Now, with our own test suite, your other question is valid: Device().set_current() sets GPU 0 as current by default if there is no prior current device, which is a bit unfortunate if the child main wants to use, say, GPU 1. I think for now we need to call set_current() in both places: the process initializer (as a proxy for cuInit(), because we definitely don't want to teach this) as well as the child main (to actually select the right GPU, following the usual cuda.core requirement). We should do this for now and discuss whether there is a better approach at Friday's meeting. I believe cccl-rt will hit the same issue: NVIDIA/cccl#6073.

…yResource reduction with multiprocessing. Add a quick exit to from_allocation_handle. Simplify the worker pool tests based on the new reduction method.
@Andy-Jost Andy-Jost force-pushed the ipc-mempool-channel branch from 5cd1043 to e5b8542 Compare October 3, 2025 00:02
@Andy-Jost (Contributor Author)

/ok to test


copy-pr-bot bot commented Oct 3, 2025

/ok to test

@Andy-Jost, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@Andy-Jost (Contributor Author)

/ok to test e5b8542

@Andy-Jost (Contributor Author)

/ok to test 64f154c

    """Unregister this mapped memory resource."""
    assert self.is_mapped
    if _ipc_registry is not None:  # can occur during shutdown catastrophe

@leofang (Member) commented Oct 3, 2025

/ok to test d584ea9

Comment on lines +109 to +110
# Note: if the buffer is not attached to something to prolong its life,
# CUDA_ERROR_INVALID_CONTEXT is raised from Buffer.__del__
Member

I suspect this comment is outdated.

@leofang leofang enabled auto-merge (squash) October 3, 2025 02:23
@leofang leofang merged commit 0ea67b5 into NVIDIA:main Oct 3, 2025
71 checks passed

github-actions bot commented Oct 3, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.


Labels

cuda.core (Everything related to the cuda.core module) · enhancement (Any code-related improvements) · P0 (High priority - Must do!)


Development

Successfully merging this pull request may close these issues.

Enable IPC support on Linux

4 participants