Conversation

@Andy-Jost
Contributor

Description

closes #963

Adds a use_pool option to DeviceMemoryResource that allows disabling memory pool use. This enables graph capture of stream-ordered allocations.

@copy-pr-bot
Contributor

copy-pr-bot bot commented Oct 27, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost Andy-Jost self-assigned this Oct 27, 2025
@Andy-Jost Andy-Jost marked this pull request as draft October 27, 2025 19:53
@Andy-Jost Andy-Jost requested a review from leofang October 27, 2025 19:53
@Andy-Jost
Contributor Author

This is by no means ready for merge. I'm interested in getting feedback on whether this is the right direction.


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR introduces a use_pool option to DeviceMemoryResource that enables CUDA graph capture of stream-ordered memory allocations. When use_pool=False, the resource bypasses memory pool creation and uses cuMemAllocAsync directly instead of cuMemAllocFromPoolAsync, allowing these allocations to be captured in graph nodes. The change maintains backward compatibility by defaulting use_pool=True and adds validation to prevent conflicting configurations (IPC requires pools). This extends the cuda.core.experimental memory abstraction to support both traditional pool-based allocations and graph-capturable stream-ordered allocations within a unified interface.
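The allocation-path branching described above can be sketched as a plain-Python toy model. This is not cuda.core's actual implementation; the class and the `"cuMemAllocAsync"` / `"cuMemAllocFromPoolAsync"` values are illustrative stand-ins for the real driver calls:

```python
# Toy model of the pool vs. non-pool branching described in the summary.
# Illustrative only: not cuda.core's real DeviceMemoryResource, and the
# driver-call names here are stand-in strings, not actual CUDA calls.
class ToyDeviceMemoryResource:
    def __init__(self, use_pool=True, ipc_enabled=False):
        # IPC requires a pool, so the conflicting combination is rejected.
        if ipc_enabled and not use_pool:
            raise RuntimeError("IPC requires a memory pool")
        self.use_pool = use_pool
        # Stands in for creating (or skipping) a CUmemoryPool.
        self.pool = object() if use_pool else None

    def allocate(self, nbytes):
        # With a pool, allocate from it; without one, fall back to the
        # plain stream-ordered path that graph capture can record.
        if self.use_pool:
            return ("cuMemAllocFromPoolAsync", nbytes, self.pool)
        return ("cuMemAllocAsync", nbytes)

mr = ToyDeviceMemoryResource(use_pool=False)
print(mr.allocate(64)[0])  # -> cuMemAllocAsync
```

Defaulting use_pool=True preserves the existing pool-based behavior, so the new path is strictly opt-in.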

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| cuda_core/cuda/core/experimental/_memory.pyx | 3/5 | Adds use_pool option to DeviceMemoryResourceOptions, conditional pool creation logic, allocation path branching between pool and non-pool modes, and validation for attribute access |
| cuda_core/tests/test_graph2.py | 4/5 | New test file validating graph capture with non-pooled stream-ordered allocations, kernel launches, and result verification |

Confidence score: 3/5

  • This PR requires careful review due to a potential runtime bug and incomplete cleanup
  • Score lowered due to: (1) line 542 check mr.handle is None likely incorrect since handle property returns a CUmemoryPool object not None for NULL handles, (2) commented-out test_no_graph function should be removed, (3) numerous unused imports in test file
  • Pay close attention to cuda_core/cuda/core/experimental/_memory.pyx lines 542-543 for the handle validation logic, and cuda_core/tests/test_graph2.py for cleanup of development artifacts

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Device
    participant GraphBuilder
    participant DeviceMemoryResource
    participant Stream
    participant Module
    participant Graph
    participant Buffer

    User->>Device: "create_graph_builder()"
    Device-->>GraphBuilder: "return graph_builder"
    User->>GraphBuilder: "begin_building(mode='thread_local')"
    GraphBuilder-->>GraphBuilder: "create internal stream"
    GraphBuilder-->>User: "return self"

    User->>Device: "create device with use_pool=False"
    Device-->>DeviceMemoryResource: "create with use_pool=False"
    DeviceMemoryResource-->>User: "return mr"

    User->>Module: "get_kernel('set_zero')"
    Module-->>User: "return set_zero kernel"
    User->>Module: "get_kernel('add_one')"
    Module-->>User: "return add_one kernel"

    User->>DeviceMemoryResource: "allocate(NBYTES, stream=target_stream)"
    DeviceMemoryResource->>Stream: "cuMemAllocAsync()"
    Stream-->>Buffer: "create buffer"
    Buffer-->>User: "return target buffer"

    User->>DeviceMemoryResource: "allocate(NBYTES, stream=gb.stream)"
    DeviceMemoryResource->>GraphBuilder: "get capture stream"
    DeviceMemoryResource->>Stream: "cuMemAllocAsync() [captured]"
    Stream-->>Buffer: "create work_buffer"
    Buffer-->>User: "return work_buffer"

    User->>GraphBuilder: "launch(set_zero, work_buffer)"
    GraphBuilder->>Stream: "record kernel launch [captured]"
    User->>GraphBuilder: "launch(add_one, work_buffer)"
    GraphBuilder->>Stream: "record kernel launch [captured]"
    User->>GraphBuilder: "launch(add_one, work_buffer)"
    GraphBuilder->>Stream: "record kernel launch [captured]"

    User->>Buffer: "target.copy_from(work_buffer)"
    Buffer->>Stream: "cuMemcpyAsync() [captured]"

    User->>GraphBuilder: "end_building()"
    GraphBuilder->>Graph: "finalize capture"
    GraphBuilder-->>User: "return graph"

    User->>Graph: "complete()"
    Graph-->>User: "return completed graph"

    User->>Stream: "create_stream()"
    Stream-->>User: "return stream"

    User->>Graph: "upload(stream)"
    Graph->>Stream: "upload executable"

    User->>Graph: "launch(stream)"
    Graph->>Stream: "launch graph"

    User->>Stream: "sync()"
    Stream-->>User: "wait for completion"

    User->>Buffer: "compare_buffer.copy_from(target)"
    Buffer->>Stream: "cuMemcpyAsync()"
    User->>Stream: "sync()"
    Stream-->>User: "verification complete"
```

2 files reviewed, 8 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +542 to +543

```python
if mr.handle is None:
    raise RuntimeError("DeviceMemoryResource is not configured to use a memory pool")
```

logic: mr.handle property returns a driver.CUmemoryPool object (line 982), not None. The check should be mr._mempool_handle != NULL or use the is_using_pool property.
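The failure mode the reviewer flags can be shown with a small toy model. The names (`ToyCUmemoryPool`, `_mempool_handle`) are illustrative stand-ins, not cuda.core's real attributes: if a property always wraps the raw handle in an object, an `is None` check on it can never fire, even when the underlying handle is NULL.

```python
# Toy illustration of the flagged bug: a property that wraps the raw handle
# in an object is never None, even when the raw handle is NULL (modeled as 0).
# Class and attribute names are hypothetical, not cuda.core's real ones.
class ToyCUmemoryPool:
    def __init__(self, raw):
        self.raw = raw

class ToyResource:
    def __init__(self, raw_handle):
        self._mempool_handle = raw_handle  # 0 models a NULL driver handle

    @property
    def handle(self):
        # Always returns a wrapper object, so `handle is None` is always False.
        return ToyCUmemoryPool(self._mempool_handle)

mr = ToyResource(raw_handle=0)
print(mr.handle is None)        # -> False: the buggy check never fires
print(mr._mempool_handle == 0)  # -> True: checking the raw handle works
```

Hence the suggestion to test the raw handle (or a dedicated `is_using_pool` flag) rather than the wrapping property.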

Comment on lines +751 to +752

```python
if opts.ipc_enabled:
    raise RuntimeError("Cannot supply ipc_enabled=True with use_pool=False")
```

style: Setting ipc_enabled=True with use_pool=False is correctly rejected, but this leaves the resource partially initialized (device_id, handle_type, etc. are set). Consider moving this validation before any state changes.
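The validate-before-mutate pattern the reviewer suggests can be sketched in plain Python (illustrative names, not the real DeviceMemoryResource constructor): all option combinations are checked before any attribute is assigned, so a rejected configuration leaves no partially initialized object behind.

```python
# Sketch of validating options before any state changes. Hypothetical
# classes for illustration; not cuda.core's actual constructor.
class ToyOptions:
    def __init__(self, use_pool=True, ipc_enabled=False):
        self.use_pool = use_pool
        self.ipc_enabled = ipc_enabled

class ToyMemResource:
    def __init__(self, device_id, opts):
        # Validate the full option set first: if this raises, the object
        # has no device_id, handle_type, or other half-set state.
        if opts.ipc_enabled and not opts.use_pool:
            raise RuntimeError("Cannot supply ipc_enabled=True with use_pool=False")
        self.device_id = device_id
        self.use_pool = opts.use_pool

try:
    ToyMemResource(0, ToyOptions(use_pool=False, ipc_enabled=True))
except RuntimeError as exc:
    print(exc)
```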

Comment on lines +97 to +103

```python
# import code
# code.interact(local=dict(globals(), **locals()))
work_buffer = mr.allocate(NBYTES, stream=gb.stream)
launch(gb, LaunchConfig(grid=1, block=1), set_zero, int(work_buffer.handle), NBYTES)
launch(gb, LaunchConfig(grid=1, block=1), add_one, int(work_buffer.handle), NBYTES)
launch(gb, LaunchConfig(grid=1, block=1), add_one, int(work_buffer.handle), NBYTES)
target.copy_from(work_buffer, stream=gb.stream)
```
Contributor Author

I believe this is the sort of use we are targeting.

@pciolkosz

Why not add a separate memory resource type for graphs? The behavior is different enough that a separate type might be needed

@Andy-Jost
Contributor Author

> Why not add a separate memory resource type for graphs? The behavior is different enough that a separate type might be needed

I would much prefer to do that. When I added IPC support to DeviceMemoryResource I initially made DeviceMemoryResource.__new__ a factory method that chose the correct subclass, but that idea was rejected. I certainly think the existing code is getting way too complicated but I don't have a good sense of what alternatives will be accepted.
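The `__new__`-as-factory idea mentioned above can be sketched as follows. This is an illustrative toy (hypothetical class names), showing the rejected design where the base class dispatches to a concrete subclass based on options:

```python
# Toy sketch of a __new__ factory that picks a subclass from options.
# Hypothetical names; this design was considered and rejected for cuda.core.
class MemResource:
    def __new__(cls, use_pool=True):
        if cls is MemResource:
            # Dispatch: direct construction of the base class chooses the
            # concrete subclass; subclasses construct normally.
            cls = PooledResource if use_pool else StreamOrderedResource
        return super().__new__(cls)

    def __init__(self, use_pool=True):
        self.use_pool = use_pool

class PooledResource(MemResource):
    pass

class StreamOrderedResource(MemResource):
    pass

print(type(MemResource(use_pool=False)).__name__)  # -> StreamOrderedResource
```

One trade-off of this pattern is that the factory behavior is hidden inside construction of the base class, which is one reason an explicit separate type (as suggested above) can be easier to reason about.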

@leofang leofang added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Oct 28, 2025
@leofang leofang added this to the cuda.core beta 9 milestone Oct 28, 2025

Development

Successfully merging this pull request may close these issues.

CUDA graph phase 2 - memory nodes
