Adds use_pool option to DeviceMemoryResource. #1192
Conversation
This is by no means ready for merge. I'm interested in getting feedback on whether this is the right direction.
Greptile Overview
Greptile Summary
This PR introduces a `use_pool` option to `DeviceMemoryResource` that enables CUDA graph capture of stream-ordered memory allocations. When `use_pool=False`, the resource bypasses memory pool creation and uses `cuMemAllocAsync` directly instead of `cuMemAllocFromPoolAsync`, allowing these allocations to be captured in graph nodes. The change maintains backward compatibility by defaulting to `use_pool=True` and adds validation to prevent conflicting configurations (IPC requires pools). This extends the `cuda.core.experimental` memory abstraction to support both traditional pool-based allocations and graph-capturable stream-ordered allocations within a unified interface.
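The option-validation behavior described above can be modeled in plain Python. This is a hypothetical stand-in, not the PR's actual Cython implementation; the class and field names mirror the PR's `DeviceMemoryResourceOptions`, `use_pool`, and `ipc_enabled`:

```python
from dataclasses import dataclass


@dataclass
class DeviceMemoryResourceOptions:
    """Stand-in for the options object described in the PR summary."""
    ipc_enabled: bool = False
    use_pool: bool = True  # default preserves backward compatibility


def validate(opts: DeviceMemoryResourceOptions) -> None:
    # IPC requires a memory pool, so this combination is rejected.
    if opts.ipc_enabled and not opts.use_pool:
        raise RuntimeError("Cannot supply ipc_enabled=True with use_pool=False")


# The pool-based default and the graph-capturable non-pool mode both pass.
validate(DeviceMemoryResourceOptions())
validate(DeviceMemoryResourceOptions(use_pool=False))
```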
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| cuda_core/cuda/core/experimental/_memory.pyx | 3/5 | Adds use_pool option to DeviceMemoryResourceOptions, conditional pool creation logic, allocation path branching between pool and non-pool modes, and validation for attribute access |
| cuda_core/tests/test_graph2.py | 4/5 | New test file validating graph capture with non-pooled stream-ordered allocations, kernel launches, and result verification |
Confidence score: 3/5
- This PR requires careful review due to a potential runtime bug and incomplete cleanup
- Score lowered due to: (1) the line 542 check `mr.handle is None` is likely incorrect since the `handle` property returns a `CUmemoryPool` object, not `None`, for NULL handles, (2) the commented-out `test_no_graph` function should be removed, (3) numerous unused imports in the test file
- Pay close attention to `cuda_core/cuda/core/experimental/_memory.pyx` lines 542-543 for the handle validation logic, and to `cuda_core/tests/test_graph2.py` for cleanup of development artifacts
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant Device
participant GraphBuilder
participant DeviceMemoryResource
participant Stream
participant Module
participant Graph
participant Buffer
User->>Device: "create_graph_builder()"
Device-->>GraphBuilder: "return graph_builder"
User->>GraphBuilder: "begin_building(mode='thread_local')"
GraphBuilder-->>GraphBuilder: "create internal stream"
GraphBuilder-->>User: "return self"
User->>Device: "create device with use_pool=False"
Device-->>DeviceMemoryResource: "create with use_pool=False"
DeviceMemoryResource-->>User: "return mr"
User->>Module: "get_kernel('set_zero')"
Module-->>User: "return set_zero kernel"
User->>Module: "get_kernel('add_one')"
Module-->>User: "return add_one kernel"
User->>DeviceMemoryResource: "allocate(NBYTES, stream=target_stream)"
DeviceMemoryResource->>Stream: "cuMemAllocAsync()"
Stream-->>Buffer: "create buffer"
Buffer-->>User: "return target buffer"
User->>DeviceMemoryResource: "allocate(NBYTES, stream=gb.stream)"
DeviceMemoryResource->>GraphBuilder: "get capture stream"
DeviceMemoryResource->>Stream: "cuMemAllocAsync() [captured]"
Stream-->>Buffer: "create work_buffer"
Buffer-->>User: "return work_buffer"
User->>GraphBuilder: "launch(set_zero, work_buffer)"
GraphBuilder->>Stream: "record kernel launch [captured]"
User->>GraphBuilder: "launch(add_one, work_buffer)"
GraphBuilder->>Stream: "record kernel launch [captured]"
User->>GraphBuilder: "launch(add_one, work_buffer)"
GraphBuilder->>Stream: "record kernel launch [captured]"
User->>Buffer: "target.copy_from(work_buffer)"
Buffer->>Stream: "cuMemcpyAsync() [captured]"
User->>GraphBuilder: "end_building()"
GraphBuilder->>Graph: "finalize capture"
GraphBuilder-->>User: "return graph"
User->>Graph: "complete()"
Graph-->>User: "return completed graph"
User->>Stream: "create_stream()"
Stream-->>User: "return stream"
User->>Graph: "upload(stream)"
Graph->>Stream: "upload executable"
User->>Graph: "launch(stream)"
Graph->>Stream: "launch graph"
User->>Stream: "sync()"
Stream-->>User: "wait for completion"
User->>Buffer: "compare_buffer.copy_from(target)"
Buffer->>Stream: "cuMemcpyAsync()"
User->>Stream: "sync()"
Stream-->>User: "verification complete"
```
2 files reviewed, 8 comments
```python
if mr.handle is None:
    raise RuntimeError("DeviceMemoryResource is not configured to use a memory pool")
```
**logic:** the `mr.handle` property returns a `driver.CUmemoryPool` object (line 982), not `None`. The check should be `mr._mempool_handle != NULL`, or use the `is_using_pool` property.
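The suggested fix can be sketched with a toy model in plain Python. The `is_using_pool` property name comes from the review comment; everything else here (the fake handle values, the `require_pool` helper) is a hypothetical stand-in for the real Cython code:

```python
class DeviceMemoryResource:
    """Toy model: _mempool_handle is 0 (NULL) when use_pool=False."""

    def __init__(self, use_pool: bool = True):
        self._mempool_handle = 1234 if use_pool else 0  # fake driver handle

    @property
    def is_using_pool(self) -> bool:
        return self._mempool_handle != 0

    @property
    def handle(self):
        # Mirrors the reported bug: a wrapper object is returned even for a
        # NULL pool handle, so `mr.handle is None` is never True.
        return object()  # stand-in for driver.CUmemoryPool


def require_pool(mr: DeviceMemoryResource) -> None:
    # Reviewer's suggestion: test the pool flag, not `handle is None`.
    if not mr.is_using_pool:
        raise RuntimeError("DeviceMemoryResource is not configured to use a memory pool")
```

With this model, `require_pool` raises for a non-pooled resource even though `handle` is never `None`, which is exactly the case the original check would miss.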
```python
if opts.ipc_enabled:
    raise RuntimeError("Cannot supply ipc_enabled=True with use_pool=False")
```
**style:** Setting `ipc_enabled=True` with `use_pool=False` is correctly rejected, but this leaves the resource partially initialized (`device_id`, `handle_type`, etc. are set). Consider moving this validation before any state changes.
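One way to avoid the partial initialization this comment describes is to validate before mutating any instance state, so a rejected configuration leaves no half-built object behind. A minimal sketch with hypothetical attribute names (not the PR's actual code):

```python
class ResourceInitExample:
    """Sketch: validate options first, assign state only afterwards."""

    def __init__(self, device_id: int, ipc_enabled: bool = False,
                 use_pool: bool = True):
        # Validate up front: if this raises, no attributes have been set.
        if ipc_enabled and not use_pool:
            raise RuntimeError("Cannot supply ipc_enabled=True with use_pool=False")
        # State changes happen only after validation succeeds.
        self.device_id = device_id
        self.ipc_enabled = ipc_enabled
        self.use_pool = use_pool
```

If the check instead ran after the assignments, a caller catching the `RuntimeError` could be left holding an object with `device_id` set but no usable allocation path.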
```python
# import code
# code.interact(local=dict(globals(), **locals()))
work_buffer = mr.allocate(NBYTES, stream=gb.stream)
launch(gb, LaunchConfig(grid=1, block=1), set_zero, int(work_buffer.handle), NBYTES)
launch(gb, LaunchConfig(grid=1, block=1), add_one, int(work_buffer.handle), NBYTES)
launch(gb, LaunchConfig(grid=1, block=1), add_one, int(work_buffer.handle), NBYTES)
target.copy_from(work_buffer, stream=gb.stream)
```
I believe this is the sort of use we are targeting.
Why not add a separate memory resource type for graphs? The behavior is different enough that a separate type might be needed.
I would much prefer to do that. When I added IPC support to
Description
closes #963
Adds a `use_pool` option to `DeviceMemoryResource` to allow disabling memory pool use. This allows for graph capture of stream-ordered allocations.