
Conversation

@Andy-Jost (Contributor) commented Nov 19, 2025

Description

Changes the attribute value argument from 32 bits to 64 bits, as expected by the driver.

closes NVIDIA/cuda-python-private#197

Andy-Jost requested a review from rwgk Nov 19, 2025 20:27
Andy-Jost self-assigned this Nov 19, 2025
@copy-pr-bot (bot) commented Nov 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost (Contributor, Author):

/ok to test 6752a1f

@rwgk (Collaborator) commented Nov 19, 2025

I see this in the documentation:


> Description
>
> Supported attributes are:
>
> CU_MEMPOOL_ATTR_RELEASE_THRESHOLD: (value type = cuuint64_t) Amount of reserved memory in bytes to hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or context synchronize. (default 0)
>
> CU_MEMPOOL_ATTR_REUSE_FOLLOW_EVENT_DEPENDENCIES: (value type = int) Allow cuMemAllocAsync to use memory asynchronously freed in another stream as long as a stream ordering dependency of the allocating stream on the free action exists. Cuda events and null stream interactions can create the required stream ordered dependencies. (default enabled)
>
> CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC: (value type = int) Allow reuse of already completed frees when there is no dependency between the free and allocation. (default enabled)
>
> CU_MEMPOOL_ATTR_REUSE_ALLOW_INTERNAL_DEPENDENCIES: (value type = int) Allow cuMemAllocAsync to insert new stream dependencies in order to establish the stream ordering required to reuse a piece of memory released by cuMemFreeAsync (default enabled).
>
> CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT: (value type = cuuint64_t) Amount of backing memory currently allocated for the mempool
>
> CU_MEMPOOL_ATTR_RESERVED_MEM_HIGH: (value type = cuuint64_t) High watermark of backing memory allocated for the mempool since the last time it was reset.
>
> CU_MEMPOOL_ATTR_USED_MEM_CURRENT: (value type = cuuint64_t) Amount of memory from the pool that is currently in use by the application.
>
> CU_MEMPOOL_ATTR_USED_MEM_HIGH: (value type = cuuint64_t) High watermark of the amount of memory from the pool that was in use by the application.


There are three attributes with value type int.

Is that real?

If it's real, could we still get into trouble because of big-endian vs little-endian?
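
A minimal Cython sketch of the concern, with a hypothetical stand-in for the driver write (the real call is cuMemPoolGetAttribute):

```cython
from libc.stdint cimport uint64_t

cdef void fake_driver_write_int(void* out) noexcept nogil:
    # Stand-in for an attribute whose value type is int: the driver
    # writes only sizeof(int) == 4 bytes through the pointer.
    (<int*>out)[0] = 1

cdef uint64_t read_int_attr_into_u64() noexcept nogil:
    cdef uint64_t value = 0
    fake_driver_write_int(<void*>&value)
    # little-endian: value == 1 (the int lands in the low 4 bytes)
    # big-endian:    value == 1 << 32 (the int lands in the high 4 bytes)
    return value
```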

@leofang (Member) left a comment


btw I found another nerve-wracking cast:

&(self._handle), <void*><uintptr_t>(handle), IPC_HANDLE_TYPE, 0)

The driver expects an int cast directly to void* (I was bitten by this right before the v0.4.0 release).
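
A hedged sketch of the two conventions, with take_handle as a hypothetical stand-in for the driver entry point:

```cython
from libc.stdint cimport uintptr_t

cdef void take_handle(void* h) noexcept nogil:
    pass  # the driver either dereferences h or reinterprets its bits

cdef void demo(int fd) noexcept nogil:
    # pointer-to-value: the driver reads *(int*)h
    take_handle(<void*>&fd)
    # value-in-pointer: the driver reinterprets the pointer bits as the
    # handle itself; widening through uintptr_t keeps the cast well-defined
    take_handle(<void*><uintptr_t>fd)
```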

Comment on lines 118 to 125

```diff
-cdef int DMRA_getattribute(
+cdef uint64_t DMRA_getattribute(
     cydriver.CUmemoryPool pool_handle, cydriver.CUmemPool_attribute attr_enum
 ):
-    cdef int value
+    cdef uint64_t value
     with nogil:
         HANDLE_RETURN(cydriver.cuMemPoolGetAttribute(pool_handle, attr_enum, <void *> &value))
     return value
```
@leofang (Member):


This technically is still not correct, because the driver sometimes does want just an int.
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MALLOC__ASYNC.html#group__CUDA__MALLOC__ASYNC_1gd45ea7c43e4a1add4b971d06fa72eda4
Wouldn't this result in only the first 32 bits being written in those cases?

@Andy-Jost (Contributor, Author):


Yes, I need to initialize to zero.

@rwgk (Collaborator) commented Nov 19, 2025

> Wouldn't this result in only the first 32 bits being written in those cases?

Or the last 32 bits, depending on endian-ness.

@leofang (Member) commented Nov 19, 2025

Right.

Another issue: Why don't we see any segfault in the CI? We even turned on compute-sanitizer in some CI runs.

@Andy-Jost (Contributor, Author):

> Wouldn't this result in only the first 32 bits being written in those cases?

> Or the last 32 bits, depending on endian-ness.

It appears that the properties using int are using it in lieu of bool. We should be okay in those cases, since we convert to bool before returning to Python.
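
A small sketch of why the bool conversion holds up, simulating the driver's 4-byte write (the zero-initialization is what makes it safe):

```cython
from libc.stdint cimport uint64_t

cdef bint int_attr_as_bool(int driver_value) noexcept nogil:
    cdef uint64_t value = 0            # zero-init: untouched bytes stay 0
    (<int*>&value)[0] = driver_value   # simulate the driver's int-sized write
    # little-endian: value == driver_value
    # big-endian:    value == <uint64_t>driver_value << 32
    return value != 0                  # truthiness agrees on both byte orders
```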

@Andy-Jost (Contributor, Author) commented Nov 19, 2025

> Right.
>
> Another issue: Why don't we see any segfault in the CI? We even turned on compute-sanitizer in some CI runs.

Valgrind also could not detect this, since the bad write clobbers part of a local variable on the stack. When it crashes, a pointer is sliced as follows: 0x7fffxxxxxxxx -> 0x7fff00000000. I cannot explain why the crash is intermittent.
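
A sketch of the clobbering mechanism (intentionally broken; stack layout is compiler-dependent, so this illustrates the failure mode rather than reproducing the crash):

```cython
from libc.stdint cimport uint64_t

cdef void fake_driver_write_u64(void* out) noexcept nogil:
    # Stand-in for a cuuint64_t attribute, e.g. a release threshold of 0.
    (<uint64_t*>out)[0] = 0

cdef void broken() noexcept nogil:
    cdef int value                  # 4 bytes: too small for a cuuint64_t
    cdef void* neighbor = <void*>1  # may be placed right after `value`
    fake_driver_write_u64(<void*>&value)  # writes 8 bytes: 4 past `value`
    # If `neighbor` follows `value` on a little-endian stack, its low
    # 32 bits are now zeroed: 0x7fffxxxxxxxx -> 0x7fff00000000.
```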

@Andy-Jost (Contributor, Author):

/ok to test 20334d1

@leofang (Member) commented Nov 19, 2025

I propose an alternative solution (#1274) that should be safer, more extensible, and, as a bonus, slightly faster.

@rwgk (Collaborator) left a comment


Not sure how thorough we want to be here.

I'd definitely add comments.

```diff
     cydriver.CUmemoryPool pool_handle, cydriver.CUmemPool_attribute attr_enum
 ):
-    cdef int value
+    cdef uint64_t value = 0
```
@rwgk (Collaborator):


This looks ok, I believe it'll work.

But at a minimum there should be a comment/warning somewhere, explaining that

  • we know we're implicitly passing a cuuint64_t* where the driver expects an int* for some attributes, and

  • we decided to not worry about endianness here because the ints we know about today are only converted to bool.

Better IMO would be to just do it right, so that any changes to the enums in the future will lead to obvious failures that are most likely easily fixed.

For this one case it may not matter much, but in aggregate it'll make a huge difference in the long-term reliability and maintainability of cuda-python as a whole.

It's really straightforward to achieve, although it's significantly more code:
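
A hedged reconstruction of the dispatch being described (not the original code), assuming the module's existing cydriver cimport and HANDLE_RETURN helper and taking the attribute names from the documentation quoted above:

```cython
from libc.stdint cimport uint64_t

cdef uint64_t DMRA_getattribute(
    cydriver.CUmemoryPool pool_handle, cydriver.CUmemPool_attribute attr_enum
) except? 0:
    cdef int int_value = 0
    cdef uint64_t u64_value = 0
    if attr_enum in (
        cydriver.CU_MEMPOOL_ATTR_REUSE_FOLLOW_EVENT_DEPENDENCIES,
        cydriver.CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC,
        cydriver.CU_MEMPOOL_ATTR_REUSE_ALLOW_INTERNAL_DEPENDENCIES,
    ):
        # value type = int: give the driver exactly 4 bytes to write
        with nogil:
            HANDLE_RETURN(cydriver.cuMemPoolGetAttribute(
                pool_handle, attr_enum, <void*>&int_value))
        return <uint64_t>int_value
    elif attr_enum in (
        cydriver.CU_MEMPOOL_ATTR_RELEASE_THRESHOLD,
        cydriver.CU_MEMPOOL_ATTR_RESERVED_MEM_CURRENT,
        cydriver.CU_MEMPOOL_ATTR_RESERVED_MEM_HIGH,
        cydriver.CU_MEMPOOL_ATTR_USED_MEM_CURRENT,
        cydriver.CU_MEMPOOL_ATTR_USED_MEM_HIGH,
    ):
        # value type = cuuint64_t: the driver writes all 8 bytes
        with nogil:
            HANDLE_RETURN(cydriver.cuMemPoolGetAttribute(
                pool_handle, attr_enum, <void*>&u64_value))
        return u64_value
    else:
        # any future attribute fails loudly instead of corrupting the stack
        raise ValueError(f"unhandled CUmemPool_attribute: {attr_enum}")
```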

This would give us complete safety. If there are changes to the API in the future, it will fail loudly and we'll know immediately that we need to review and adjust.

@leofang (Member):


In terms of safety, I left a comment above (#1274) probably while you were leaving the review here 🙂

@Andy-Jost (Contributor, Author):

Closing this since we have a better solution.

Andy-Jost closed this Nov 20, 2025
github-actions bot pushed a commit that referenced this pull request Nov 20, 2025
Removed preview folders for the following PRs:
- PR #1272