Skip to content

Conversation

@Andy-Jost
Copy link
Contributor

Summary

This PR implements Buffer.fill(value, width, *, stream) method that wraps CUDA's cuMemsetAsync functions, supporting cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter.

Part of issue #1314: CUDA Graph phase 3 - memcpy nodes

Changes

  • Added Buffer.fill() method in cuda/core/experimental/_memory/_buffer.pyx:

    • Supports width=1 (byte fill via cuMemsetD8Async)
    • Supports width=2 (16-bit fill via cuMemsetD16Async)
    • Supports width=4 (32-bit fill via cuMemsetD32Async)
    • Validates width (must be 1, 2, or 4)
    • Validates value range for each width
    • Validates buffer size is divisible by width
    • Follows same pattern as copy_to/copy_from methods
  • Added comprehensive tests in tests/test_memory.py:

    • Tests all three widths (1, 2, 4 bytes)
    • Tests error cases (invalid width, value out of range, size not divisible)
    • Tests with different memory resource types
    • Includes verification for host-accessible memory

Implementation Details

The method automatically selects the appropriate CUDA driver API function based on the width parameter:

  • width=1: Uses cuMemsetD8Async with N = buffer_size (bytes)
  • width=2: Uses cuMemsetD16Async with N = buffer_size // 2 (16-bit elements)
  • width=4: Uses cuMemsetD32Async with N = buffer_size // 4 (32-bit elements)

Example Usage

buffer = mr.allocate(1024, stream=stream)
buffer.fill(0x42, width=1, stream=stream)  # Fill with byte value
buffer.fill(0x1234, width=2, stream=stream)  # Fill with 16-bit value
buffer.fill(0xDEADBEEF, width=4, stream=stream)  # Fill with 32-bit value

Testing

All tests pass with comprehensive coverage of:

  • Success cases for all widths
  • Error validation (width, value range, size divisibility)
  • Multiple memory resource types (device, unified, pinned)

Implements Buffer.fill(value, width, *, stream) method that wraps
cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on
the width parameter (1, 2, or 4 bytes).

- Add fill() method to Buffer class in _buffer.pyx
- Support width=1 (byte), width=2 (16-bit), width=4 (32-bit)
- Validate width, value range, and buffer size divisibility
- Add comprehensive tests in test_memory.py
- Tests cover all widths, error cases, and verification

Part of issue NVIDIA#1314: CUDA Graph phase 3 - memcpy nodes
@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Dec 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost Andy-Jost added cuda.core Everything related to the cuda.core module feature New feature or request labels Dec 4, 2025
@Andy-Jost Andy-Jost self-assigned this Dec 4, 2025
@Andy-Jost Andy-Jost requested review from mdboom, rparolin and rwgk December 4, 2025 23:54
@Andy-Jost Andy-Jost added this to the cuda.core beta 10 milestone Dec 4, 2025
Extend test_graph_alloc with 'fill' action parameter to test Buffer.fill()
in graph capture mode. The test verifies graph capture for Buffer operations
including copy_from, copy_to, fill, and kernel launch operations.

Part of issue NVIDIA#1314
@Andy-Jost Andy-Jost added the P0 High priority - Must do! label Dec 5, 2025
- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Improves performance by eliminating Python overhead
- Replace Python driver module calls with direct cydriver calls
- Use 'with nogil:' blocks around CUDA driver API calls
- Use HANDLE_RETURN macro for error handling
- Cast stream to Stream type to access _handle attribute
- Remove unused raise_if_driver_error import
- Improves performance by eliminating Python overhead
@Andy-Jost
Copy link
Contributor Author

/ok to test 7d9747d

@github-actions

This comment has been minimized.

cdef cydriver.CUstream s = s_stream._handle

# Validate width
if width not in (1, 2, 4):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put the validation code closer to the top of the function so we avoid any setup work in the error case where the user passes an unsupported size to the function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I fiddled with it to simplify the logic. To be honest, I don't see a big improvement, here, since most of the preceding statements just declare stack variables.

buffer1.fill(0x42, width=1, stream=stream)
device.sync()

if check:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are parametrizing these value checks in a test suite? The memory sizes don't strike me as so large that these operations would be slow.

Copy link
Contributor Author

@Andy-Jost Andy-Jost Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values are only checked when the memory allocation is pinned. This follows the existing pattern.

Copy link
Collaborator

@rparolin rparolin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally, just a couple comments.

- Add _validate_value_against_bitwidth helper function
- Move helper function to end of file as cdef function
- Use 64-bit platform integers (int64_t/uint64_t) instead of Python ints
- Add assertion that bitwidth < 64
- Remove magic numbers from fill() method
- Update tests to match new error message format
@Andy-Jost Andy-Jost requested a review from rparolin December 5, 2025 18:37
@Andy-Jost Andy-Jost marked this pull request as ready for review December 5, 2025 18:38
@Andy-Jost
Copy link
Contributor Author

/ok to test 8e2cddf

@Andy-Jost Andy-Jost enabled auto-merge (squash) December 5, 2025 20:23
@kkraus14
Copy link
Collaborator

kkraus14 commented Dec 8, 2025

cc @pciolkosz who implemented CCCL's cuda::fill_bytes

I think something what would make more sense would be for Buffer.fill() to only work for 1 byte width and then have a StridedMemoryView.fill() that uses the dtype of the StridedMemoryView to control the byte width used for filling as well as interpreting the signed-ness of the input. I think this has a couple of benefits:

  • No need for user to pass a width explicitly and inherently no need to ensure that the Buffer is aligned with a passed width
  • Makes it clear when a Python integer is desired to be treated as an unsigned vs signed integer


# Map width (bytes) to bitwidth and validate value
cdef int bitwidth = width * 8
_validate_value_against_bitwidth(bitwidth, value, is_signed=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit of unexpected behavior that this will throw if someone passes a negative integer


# Helper Functions
# ----------------
cdef void _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except *:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: we should probably make this an inline function for performance reasons

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and avoid except*, so it should return an int instead, something like

Suggested change
cdef void _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except *:
cdef int _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except?-1:

Comment on lines +440 to +446
if is_signed:
min_value = -(<int64_t>1 << (max_bits - 1))
max_value = (<int64_t>1 << (max_bits - 1)) - 1
else:
min_value = 0
max_value_unsigned = (<uint64_t>1 << max_bits) - 1
max_value = <int64_t>max_value_unsigned
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: getting the min and max values here can be cimported from the built in libc module: https://github.com/cython/cython/blob/master/Cython/Includes/libc/limits.pxd

i.e. from libc.limits cimport INT_MAX

Comment on lines +198 to +199
value : int
Integer value to fill the buffer with
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the underlying driver APIs expect an integer, but I think for Buffer.fill() it would be good to support a byte-like input as well, maybe via buffer protocol?

Copy link
Contributor Author

@Andy-Jost Andy-Jost Dec 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm my understanding, I believe the suggestion here is to accept arguments of type bytes or that provide bytes via the buffer protocol (where the number of bytes equals the buffer size) and fill our Buffer from that. Would that be better to implement as Buffer.copy_from rather than Buffer.fill? Or possibly a new function Buffer.copy_from_bytes or Buffer.copy_from_host.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The internal logic will be different from Buffer.fill, since it requires copying the bytes to a staging area, unlike cuMemset*.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to allowing a user to pass a memoryview object of 1, 2, or 4 bytes in length to be used as the fill value. I.e. Buffer.fill(b'abcd').

@leofang
Copy link
Member

leofang commented Dec 8, 2025

From #1314 (comment):

How do we envision this working? If someone says buf.fill(1) we still don't have information about the intended bit-width. Are we thinking something like buf.fill(value=1, width=4) ?

If the type is a specific NumPy dtype we can do it automatically. Is there a precedent for anything like that, or do we have a list of frameworks/dtypes that we support?

and #1318 (comment):

and then have a StridedMemoryView.fill() that uses the dtype of the StridedMemoryView to control the byte width used for filling as well as interpreting the signed-ness of the input. I think this has a couple of benefits:

My thinking was: We follow what we did for cuda.core.launch() and use a dispatcher to infer the bit width from the scalar type. So we'll do

def fill(self, value, stream: Stream | GraphBuilder):
    ...

and the user code can be:

buf.fill(np.int16(8), s)

This would provide a uniform UX across cuda.core APIs that accept Python/NumPy scalars. @kkraus14 @Andy-Jost WDYT?

@Andy-Jost Andy-Jost merged commit db6118e into NVIDIA:main Dec 9, 2025
80 checks passed
@rparolin
Copy link
Collaborator

rparolin commented Dec 9, 2025

My bad, I didn't check if you had auto-merged enabled. @Andy-Jost please make sure any feedback from Keith and Leo is incorporated.

@github-actions
Copy link

github-actions bot commented Dec 9, 2025

Doc Preview CI
Preview removed because the pull request was closed or merged.

@leofang leofang mentioned this pull request Dec 9, 2025
@leofang
Copy link
Member

leofang commented Dec 9, 2025

No worries. Tracking this in #1345.

@Andy-Jost
Copy link
Contributor Author

My bad, I didn't check if you had auto-merged enabled. @Andy-Jost please make sure any feedback from Keith and Leo is incorporated.

Will follow up to fix, no worries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module feature New feature or request P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants