Add Buffer.fill() method for cuMemsetAsync support (#1314) #1318

Andy-Jost · 2025-12-04T23:54:35Z

Summary

This PR implements Buffer.fill(value, width, *, stream) method that wraps CUDA's cuMemsetAsync functions, supporting cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter.

Part of issue #1314: CUDA Graph phase 3 - memcpy nodes

Changes

Added Buffer.fill() method in cuda/core/experimental/_memory/_buffer.pyx:
- Supports width=1 (byte fill via cuMemsetD8Async)
- Supports width=2 (16-bit fill via cuMemsetD16Async)
- Supports width=4 (32-bit fill via cuMemsetD32Async)
- Validates width (must be 1, 2, or 4)
- Validates value range for each width
- Validates buffer size is divisible by width
- Follows same pattern as copy_to/copy_from methods
Added comprehensive tests in tests/test_memory.py:
- Tests all three widths (1, 2, 4 bytes)
- Tests error cases (invalid width, value out of range, size not divisible)
- Tests with different memory resource types
- Includes verification for host-accessible memory

Implementation Details

The method automatically selects the appropriate CUDA driver API function based on the width parameter:

width=1: Uses cuMemsetD8Async with N = buffer_size (bytes)
width=2: Uses cuMemsetD16Async with N = buffer_size // 2 (16-bit elements)
width=4: Uses cuMemsetD32Async with N = buffer_size // 4 (32-bit elements)

Example Usage

buffer = mr.allocate(1024, stream=stream)
buffer.fill(0x42, width=1, stream=stream)  # Fill with byte value
buffer.fill(0x1234, width=2, stream=stream)  # Fill with 16-bit value
buffer.fill(0xDEADBEEF, width=4, stream=stream)  # Fill with 32-bit value

Testing

All tests pass with comprehensive coverage of:

Success cases for all widths
Error validation (width, value range, size divisibility)
Multiple memory resource types (device, unified, pinned)

Implements Buffer.fill(value, width, *, stream) method that wraps cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter (1, 2, or 4 bytes). - Add fill() method to Buffer class in _buffer.pyx - Support width=1 (byte), width=2 (16-bit), width=4 (32-bit) - Validate width, value range, and buffer size divisibility - Add comprehensive tests in test_memory.py - Tests cover all widths, error cases, and verification Part of issue NVIDIA#1314: CUDA Graph phase 3 - memcpy nodes

copy-pr-bot · 2025-12-04T23:54:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Extend test_graph_alloc with 'fill' action parameter to test Buffer.fill() in graph capture mode. The test verifies graph capture for Buffer operations including copy_from, copy_to, fill, and kernel launch operations. Part of issue NVIDIA#1314

- Replace Python driver module calls with direct cydriver calls - Use 'with nogil:' blocks around CUDA driver API calls - Use HANDLE_RETURN macro for error handling - Cast stream to Stream type to access _handle attribute - Improves performance by eliminating Python overhead

- Replace Python driver module calls with direct cydriver calls - Use 'with nogil:' blocks around CUDA driver API calls - Use HANDLE_RETURN macro for error handling - Cast stream to Stream type to access _handle attribute - Remove unused raise_if_driver_error import - Improves performance by eliminating Python overhead

Andy-Jost · 2025-12-05T00:23:37Z

/ok to test 7d9747d

rparolin · 2025-12-05T16:05:21Z

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

+        cdef cydriver.CUstream s = s_stream._handle
+
+        # Validate width
+        if width not in (1, 2, 4):


I'd put the validation code closer to the top of the function so we avoid any setup work in the error case where the user passes an unsupported size to the function.

OK. I fiddled with it to simplify the logic. To be honest, I don't see a big improvement, here, since most of the preceding statements just declare stack variables.

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

rparolin · 2025-12-05T16:19:29Z

cuda_core/tests/test_memory.py

+    buffer1.fill(0x42, width=1, stream=stream)
+    device.sync()
+
+    if check:


Why are parametrizing these value checks in a test suite? The memory sizes don't strike me as so large that these operations would be slow.

The values are only checked when the memory allocation is pinned. This follows the existing pattern.

rparolin

Looks good generally, just a couple comments.

- Add _validate_value_against_bitwidth helper function - Move helper function to end of file as cdef function - Use 64-bit platform integers (int64_t/uint64_t) instead of Python ints - Add assertion that bitwidth < 64 - Remove magic numbers from fill() method - Update tests to match new error message format

Andy-Jost · 2025-12-05T18:38:55Z

/ok to test 8e2cddf

kkraus14 · 2025-12-08T15:29:03Z

cc @pciolkosz who implemented CCCL's cuda::fill_bytes

I think something what would make more sense would be for Buffer.fill() to only work for 1 byte width and then have a StridedMemoryView.fill() that uses the dtype of the StridedMemoryView to control the byte width used for filling as well as interpreting the signed-ness of the input. I think this has a couple of benefits:

No need for user to pass a width explicitly and inherently no need to ensure that the Buffer is aligned with a passed width
Makes it clear when a Python integer is desired to be treated as an unsigned vs signed integer

kkraus14 · 2025-12-08T15:31:04Z

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

+
+        # Map width (bytes) to bitwidth and validate value
+        cdef int bitwidth = width * 8
+        _validate_value_against_bitwidth(bitwidth, value, is_signed=False)


I think this is a bit of unexpected behavior that this will throw if someone passes a negative integer

kkraus14 · 2025-12-08T15:32:00Z

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

+
+# Helper Functions
+# ----------------
+cdef void _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except *:


nitpick: we should probably make this an inline function for performance reasons

and avoid except*, so it should return an int instead, something like

Suggested change

cdef void _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except *:

cdef int _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except?-1:

kkraus14 · 2025-12-08T15:35:33Z

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

+    if is_signed:
+        min_value = -(<int64_t>1 << (max_bits - 1))
+        max_value = (<int64_t>1 << (max_bits - 1)) - 1
+    else:
+        min_value = 0
+        max_value_unsigned = (<uint64_t>1 << max_bits) - 1
+        max_value = <int64_t>max_value_unsigned


nitpick: getting the min and max values here can be cimported from the built in libc module: https://github.com/cython/cython/blob/master/Cython/Includes/libc/limits.pxd

i.e. from libc.limits cimport INT_MAX

kkraus14 · 2025-12-08T15:38:02Z

cuda_core/cuda/core/experimental/_memory/_buffer.pyx

+        value : int
+            Integer value to fill the buffer with


I know the underlying driver APIs expect an integer, but I think for Buffer.fill() it would be good to support a byte-like input as well, maybe via buffer protocol?

To confirm my understanding, I believe the suggestion here is to accept arguments of type bytes or that provide bytes via the buffer protocol (where the number of bytes equals the buffer size) and fill our Buffer from that. Would that be better to implement as Buffer.copy_from rather than Buffer.fill? Or possibly a new function Buffer.copy_from_bytes or Buffer.copy_from_host.

The internal logic will be different from Buffer.fill, since it requires copying the bytes to a staging area, unlike cuMemset*.

I was referring to allowing a user to pass a memoryview object of 1, 2, or 4 bytes in length to be used as the fill value. I.e. Buffer.fill(b'abcd').

leofang · 2025-12-08T23:43:21Z

From #1314 (comment):

How do we envision this working? If someone says buf.fill(1) we still don't have information about the intended bit-width. Are we thinking something like buf.fill(value=1, width=4) ?

If the type is a specific NumPy dtype we can do it automatically. Is there a precedent for anything like that, or do we have a list of frameworks/dtypes that we support?

and #1318 (comment):

and then have a StridedMemoryView.fill() that uses the dtype of the StridedMemoryView to control the byte width used for filling as well as interpreting the signed-ness of the input. I think this has a couple of benefits:

My thinking was: We follow what we did for cuda.core.launch() and use a dispatcher to infer the bit width from the scalar type. So we'll do

def fill(self, value, stream: Stream | GraphBuilder):
    ...

and the user code can be:

buf.fill(np.int16(8), s)

This would provide a uniform UX across cuda.core APIs that accept Python/NumPy scalars. @kkraus14 @Andy-Jost WDYT?

rparolin · 2025-12-09T01:01:16Z

My bad, I didn't check if you had auto-merged enabled. @Andy-Jost please make sure any feedback from Keith and Leo is incorporated.

github-actions · 2025-12-09T01:07:15Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

leofang · 2025-12-09T01:17:57Z

No worries. Tracking this in #1345.

Andy-Jost · 2025-12-09T15:38:50Z

My bad, I didn't check if you had auto-merged enabled. @Andy-Jost please make sure any feedback from Keith and Leo is incorporated.

Will follow up to fix, no worries.

Andy-Jost added cuda.core Everything related to the cuda.core module feature New feature or request labels Dec 4, 2025

Andy-Jost requested review from cpcloud, ksimpson-work and leofang December 4, 2025 23:54

Andy-Jost self-assigned this Dec 4, 2025

Andy-Jost requested review from mdboom, rparolin and rwgk December 4, 2025 23:54

Andy-Jost added this to the cuda.core beta 10 milestone Dec 4, 2025

Andy-Jost removed request for cpcloud, ksimpson-work and mdboom December 4, 2025 23:55

Add graph capture tests for Buffer.fill()

8294553

Extend test_graph_alloc with 'fill' action parameter to test Buffer.fill() in graph capture mode. The test verifies graph capture for Buffer operations including copy_from, copy_to, fill, and kernel launch operations. Part of issue NVIDIA#1314

Andy-Jost added the P0 High priority - Must do! label Dec 5, 2025

Andy-Jost added 2 commits December 4, 2025 16:16

This comment has been minimized.

Sign in to view

rparolin reviewed Dec 5, 2025

View reviewed changes

cuda_core/cuda/core/experimental/_memory/_buffer.pyx Show resolved Hide resolved

rparolin reviewed Dec 5, 2025

View reviewed changes

rparolin requested changes Dec 5, 2025

View reviewed changes

Andy-Jost added 2 commits December 5, 2025 10:08

Simplified argument validation logic in Buffer.fill.

07c65d2

Andy-Jost requested a review from rparolin December 5, 2025 18:37

Merge branch 'main' into issue-1314

8e2cddf

Andy-Jost marked this pull request as ready for review December 5, 2025 18:38

Andy-Jost enabled auto-merge (squash) December 5, 2025 20:23

kkraus14 reviewed Dec 8, 2025

View reviewed changes

leofang mentioned this pull request Dec 8, 2025

CUDA Graph phase 3 - memcpy nodes #1314

Open

rparolin approved these changes Dec 9, 2025

View reviewed changes

Andy-Jost merged commit db6118e into NVIDIA:main Dec 9, 2025
80 checks passed

leofang mentioned this pull request Dec 9, 2025

Revisit Buffer.fill() #1345

Open

	cdef void _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except *:
	cdef int _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except?-1:

Add Buffer.fill() method for cuMemsetAsync support (#1314) #1318

Add Buffer.fill() method for cuMemsetAsync support (#1314) #1318

Uh oh!

Conversation

Andy-Jost commented Dec 4, 2025

Summary

Changes

Implementation Details

Example Usage

Testing

Uh oh!

copy-pr-bot bot commented Dec 4, 2025

Uh oh!

Andy-Jost commented Dec 5, 2025

Uh oh!

This comment has been minimized.

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Andy-Jost Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rparolin left a comment

Choose a reason for hiding this comment

Uh oh!

Andy-Jost commented Dec 5, 2025

Uh oh!

kkraus14 commented Dec 8, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Andy-Jost Dec 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

leofang commented Dec 8, 2025

Uh oh!

Uh oh!

rparolin commented Dec 9, 2025

Uh oh!

github-actions bot commented Dec 9, 2025

Uh oh!

leofang commented Dec 9, 2025

Uh oh!

Andy-Jost commented Dec 9, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Andy-Jost Dec 5, 2025 •

edited

Loading

Andy-Jost Dec 8, 2025 •

edited

Loading