feat: Introduce StridedLayout, support wrapping external allocations in Buffer, add StridedMemoryView.from_buffer #1283
base: main
Conversation
```
args_viewable_as_strided_memory

:template: dataclass.rst
```
- `dataclass.rst` does not render methods.
- `class.rst` omits cythonized properties.

`cyclass` places the attributes section just after the main class docstring. This way we can document the actual attributes at the end of the main docstring, followed by the docstrings of all the properties.
leofang left a comment
Checking in EOD progress. I haven't reviewed layout/memoryview yet.
Also, I assume you're working on migrating the tests?
```cython
dl_tensor = &dlm_tensor_ver.dl_tensor
dlm_tensor_ver.version.major = DLPACK_MAJOR_VERSION
dlm_tensor_ver.version.minor = DLPACK_MINOR_VERSION
cpython.Py_INCREF(buf)
```
We need to call decref in the except: branch.
The cleanup helpers do so. I moved the `cpython.Py_INCREF` call next to the assignment of its address to `manager_ctx`, for clearer "release" pointer semantics: if `manager_ctx` is set, the `Py_DECREF` needs to be called.
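A minimal pure-Python sketch of that invariant (all names here are hypothetical; the real code is Cython and holds the reference via `cpython.Py_INCREF` on a raw pointer):

```python
# Hypothetical illustration of the "release pointer" pattern described above:
# the reference is taken at the same moment manager_ctx is set, so cleanup has
# a single rule: manager_ctx set => exactly one reference to drop.
class _Exporter:
    def __init__(self):
        self.manager_ctx = None

def export(buf, exporter):
    exporter.manager_ctx = buf  # in Cython: cpython.Py_INCREF(buf) right here
    try:
        build_tensor_metadata(buf)  # may raise
    except Exception:
        cleanup(exporter)
        raise

def cleanup(exporter):
    if exporter.manager_ctx is not None:
        exporter.manager_ctx = None  # in Cython: cpython.Py_DECREF(...)

def build_tensor_metadata(buf):
    pass  # placeholder for the DLPack metadata setup
```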
```python
    driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
    driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
)
return driver.cuPointerGetAttributes(len(attrs), attrs, ptr)
```
TODO: cimport this from cydriver
🙃
I actually had it that way initially, but seeing that all the cydriver imports were gone from buffer, I went along with the Python API. I can dig up the previous variant.
Yeah, sorry. I think it's not "gone" gone; most likely @Andy-Jost found that we don't need many driver API calls in this file after the refactoring (#1205). But pointer attribute checks are on the hot path, so we should cythonize them.
In fact, I am trying to catch up with what @fbusato is doing in C++ (NVIDIA/cccl#6733), which is an equivalent check (but for C++ mdspan instead of Python SMV).
Thanks for the reference! Looking at the logic in cccl, I adjusted the managed memory "discovery". I am not sure we need to go into as much detail as fetching the particular memory pool and checking whether its readability flag is set; I did not add this, but can adjust if needed.
In any case, I moved back to the cydriver API and added tests with host/device/managed/pinned memory from CUDA malloc and pinned memory from cuMemHostRegister.
For pinned memory, I am not 100% sure what happens on devices for which
CU_DEVICE_ATTRIBUTE_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM is false, i.e. when one registers host memory with cuMemHostRegister and passes the original pointer, while the pointer used to access it on the device is different. Would it still be memory_type = 0 or memory_type = host, and what would the desired is_device_accessible value be?
```c
#include <numeric>

#define STRIDED_LAYOUT_MAX_NDIM 32
```
Q: is there any particular reason (apart from using 32-bit masks) we want to limit this to 32? There are libraries that use a higher limit, for example cuTENSOR (@yangcal would remember what the current limit is).
Initially, precompiling the copy kernels was on the table, so sticking to 32 dims was more appealing. Now I don't think there's anything preventing us from going to 64 dims, especially if there's a real need for that.
Force-pushed c238c1a to dc27268
The dlpack fix moved out of this PR is here: #1291
NB.

```python
from cuda.bindings import driver
driver.cuPointerGetAttributes(1, [driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL], 0)
```

returns

```
(<CUresult.CUDA_SUCCESS: 0>, [4294967294])
```

while the actual value returned by the underlying API is -2 (4294967294 is 2**32 - 2, i.e. -2 reinterpreted as an unsigned 32-bit integer). Looks like an unsigned/signed type mismatch.
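For reference, the raw value can be mapped back to the signed result on the caller's side:

```python
import ctypes

raw = 4294967294
print(ctypes.c_int32(raw).value)  # -2, the value the driver actually produced
```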
Let's kick off CI
/ok to test 3a904e7
A single test case failed: testing pointer attributes for host memory "manually pinned" with cuMemHostRegister. It failed on the pre-condition assert that the memory is not device accessible before it is registered, and it failed in the second of the two cases testing this. The test did not clean up properly: it was missing the unregister call. I am guessing we ended up with the same pointer in the second case. 38ddb36 should fix that.
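A hedged sketch of the cleanup pattern that fix implies (assuming the bindings mirror the C signatures of cuMemHostRegister/cuMemHostUnregister, and that the driver is initialized with a current context):

```python
import ctypes
from cuda.bindings import driver

buf = ctypes.create_string_buffer(4096)
ptr = ctypes.addressof(buf)
(err,) = driver.cuMemHostRegister(ptr, 4096, 0)
assert err == driver.CUresult.CUDA_SUCCESS
try:
    pass  # assertions on the pointer attributes of the now-pinned range
finally:
    # Always unregister, so a later case that happens to get the same
    # address starts from unregistered memory.
    (err,) = driver.cuMemHostUnregister(ptr)
    assert err == driver.CUresult.CUDA_SUCCESS
```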
Force-pushed 38ddb36 to 26dfe3b
```cython
cdef int itemsize = nbits >> 3
if (itemsize << 3) != nbits:
    raise ValueError("dl_tensor.dtype.bits must be a multiple of 8")
```
How should we handle things like 6-bit and 4-bit floats? https://github.com/dmlc/dlpack/blob/main/include/dlpack/dlpack.h#L171-L181
Good question; I am not sure, really. AFAIK we don't support it now anyway, the dtype parsing won't accept it.
Playing with https://github.com/jax-ml/ml_dtypes I see that numpy:
- does not allow exporting such types through dlpack
- does not pack them: one element corresponds to one byte offset.

We could work around it in different ways; just looking at the landscape now, I am not sure what the desired behavior is. E.g. would a dlpack dtype for fractional fp types have nbits * lanes divisible by 8, or is the numpy behavior common enough that we should round the itemsize up to 1 byte?
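A small illustration of the unpacked behavior mentioned above (assuming a recent ml_dtypes that provides `float4_e2m1fn`):

```python
import numpy as np
import ml_dtypes

a = np.zeros(4, dtype=ml_dtypes.float4_e2m1fn)
print(a.itemsize)  # 1 -- 4-bit elements are not packed; each occupies a byte
print(a.nbytes)    # 4
```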
```cython
cdef bint is_allocated = buffer.owner is None
if is_allocated:
    if buffer.memory_resource is None:
        raise ValueError(
            "Ambiguous buffer instance. The buffer must either hold an allocation "
            "(coming from MemoryResource, e.g. Device().memory_resource.allocate()) "
            "or wrap external data and specify the owner "
            "(`Buffer.from_handle(ptr, size, owner=...)`)."
        )
# For external buffers, we may not know the size. Even if we did, the size
# alone is not enough if the layout can map to negative offsets, i.e.
# the valid range is not [ptr, ptr + size - 1], but
# [ptr - offset, ptr + size - offset - 1]. The offset is not reported
# by the packages.
if is_allocated and buffer.size < layout.get_required_size_in_bytes():
    raise ValueError(
        f"Buffer size is too small for the layout. "
        f"Expected at least {layout.get_required_size_in_bytes()} bytes, "
        f"got {buffer.size} bytes."
    )
```
These checks seem somewhat expensive to do on the path of getting the ptr to the data. Should we validate these only at construction time and avoid them in a hot path like this?
Those checks now happen for a new SMV created with the from_buffer or view methods. We could limit them to from_buffer only, or get rid of them completely. I don't have a strong preference either way.
```cython
with cython.gil:
    # Device class handles the cuInit call internally
    from cuda.core.experimental import Device
    Device()
```
Can we call cuInit if that's what we're after, instead of calling Device?
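What that would look like with the Python bindings (a sketch; the real code would presumably cimport the call from cydriver to avoid Python overhead):

```python
from cuda.bindings import driver

# Initialize the driver directly instead of constructing a Device.
(err,) = driver.cuInit(0)
assert err == driver.CUresult.CUDA_SUCCESS
```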
| F = "F" | ||
|
|
||
|
|
||
| class _S: |
We're already importing numpy where this is being used; can't we use np.s_ for this instead of duplicating that functionality solely for testing?
Cool, I wasn't aware of this utility.
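For reference, `np.s_` turns indexing syntax into plain slice objects:

```python
import numpy as np

print(np.s_[::2, 1:])  # (slice(None, None, 2), slice(1, None, None))
```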
Description
Introduces StridedLayout:

- `StridedLayout(shape, strides, itemsize)`, `StridedLayout(a.shape, a.strides, a.itemsize, divide_strides=True)`
- `StridedLayout.dense(shape, itemsize, stride_order)`
- `StridedLayout.dense_like` and `self.to_dense`

From Python, StridedLayout is immutable; stride manipulation methods return a new instance. In Cython, to avoid temporary objects in a sequence of operations, layout manipulation methods can be run in place.
Please take a look at the StridedLayout docs for more details and examples.
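A hedged usage sketch of the constructors listed above (the import path and the exact form of stride_order are assumptions, not the final API):

```python
import numpy as np
from cuda.core.experimental.utils import StridedLayout  # assumed location

a = np.zeros((2, 3), dtype=np.float32)
# numpy strides are in bytes; divide_strides converts them to element counts.
layout = StridedLayout(a.shape, a.strides, a.itemsize, divide_strides=True)
# A dense (contiguous) layout for the same shape, C-ordered.
dense = StridedLayout.dense(a.shape, a.itemsize, stride_order="C")
```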
Enables wrapping an external allocation into Buffer (`Buffer.from_handle(ptr, owner=obj)`). The owner and memory resource cannot be specified together. The owner reference is kept until the Buffer is closed. Without a memory resource, Buffer now queries the driver for host/device accessibility and the device_id of the ptr.

StridedMemoryView now uses StridedLayout to represent the shape/strides.
- `from_buffer(buffer, layout, optional dtype)` to create an SMV from a Buffer and a StridedLayout. For example, to implement an `empty_like()` for a numpy array, but allocated on a device, one could do as in the sketch below.
- The StridedMemoryView can be exported via dlpack. (delayed for later)
- StridedMemoryView.copy_from and StridedMemoryView.copy_to allow copying data between views. (in a follow-up PR)
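A hedged sketch of that `empty_like()` example (module paths and details are assumptions based on the names in this PR, not the final code):

```python
import numpy as np
from cuda.core.experimental import Device
from cuda.core.experimental.utils import StridedMemoryView, StridedLayout  # assumed

def empty_like(a: np.ndarray) -> StridedMemoryView:
    dev = Device()
    dev.set_current()
    # Describe a's layout; numpy strides are in bytes, hence divide_strides=True.
    layout = StridedLayout(a.shape, a.strides, a.itemsize, divide_strides=True)
    # Allocate just enough device memory to back that layout.
    buf = dev.memory_resource.allocate(layout.get_required_size_in_bytes())
    return StridedMemoryView.from_buffer(buf, layout, dtype=a.dtype)
```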