Version
current main checkout (0477a34 locally); this code path appears to date back to the initial public import as well
CUDA Toolkit Version
not pinned yet; this came from source review rather than a runtime repro on a specific toolkit version
Which installation method(s) does this occur on?
Source
Describe the bug
I think the generic DLPack launch path is releasing consumed DLManagedTensor objects too early.
In cext/tile_kernel.cpp, arrayrepr_dlpack_common() renames the capsule to "used_dltensor" and then immediately calls tensor->deleter(tensor). That happens while kernel arguments are still being prepared, before prepare_launch() returns, and before cuLaunchKernelEx() is called.
For generic __dlpack__ objects, arrayrepr_dlpack() also calls __dlpack__(stream=-1), so the producer is explicitly being told not to synchronize.
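To make the `stream=-1` point concrete, here is a minimal pure-Python model of how a producer might treat the `stream` argument. `ToyProducer` and `synced_with` are hypothetical names, and there is no real GPU work here. It only shows that with `stream=-1` the export call orders nothing, so the consumer is on its own for lifetime and ordering.

```python
class ToyProducer:
    """Hypothetical producer that records how __dlpack__ was asked to synchronize."""

    def __init__(self):
        self.synced_with = []  # consumer streams the producer synchronized with

    def __dlpack__(self, stream=None):
        if stream == -1:
            # -1: the consumer asks the producer not to synchronize at all.
            pass
        else:
            # Any other value identifies a consumer stream; a real producer
            # would make its pending work visible to that stream here.
            self.synced_with.append(stream)
        return object()  # stand-in for the "dltensor" capsule


producer = ToyProducer()

producer.__dlpack__(stream=-1)
no_sync = list(producer.synced_with)    # nothing was recorded

producer.__dlpack__(stream=2)
with_sync = list(producer.synced_with)  # the producer synced with stream 2
```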
My understanding of the DLPack ownership/lifetime contract is that once the consumer takes ownership of the capsule, it should keep the managed tensor alive until the consumer is actually done with it. Releasing it during argument parsing looks wrong on its own. In the async CUDA launch path it also looks like it could turn into a stale-pointer / premature-release problem for producers that rely on the deleter not being called until the consumer's work on the buffer has completed.
What made me look twice is that there is already a comment in the code saying this is "technically an incorrect implementation" and suggesting an event-based deferred release after launch.
The control flow I am looking at is:
- arrayrepr_dlpack() calls __dlpack__(stream=-1)
- arrayrepr_dlpack_common() reads the pointer, renames the capsule, and immediately calls the deleter
- the actual kernel enqueue happens later in launch() via cuLaunchKernelEx()
Minimum reproducible example
I do not have a clean runtime repro yet. I found this during code review because the control flow itself looks off:
1. producer returns a DLPack capsule
2. cuTile reads the exported pointer
3. cuTile immediately calls the DLPack deleter
4. the CUDA kernel launch happens afterwards, asynchronously
The repro shape I would expect to fail is a producer whose deleter drops the last owner or returns memory to a pool before the launched kernel has actually finished using the pointer.
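As a rough illustration of that repro shape, here is a pure-Python model rather than a real repro. `ToyPool` and `ToyExport` are hypothetical stand-ins for a caching allocator and a DLPack export, and block ids stand in for device pointers. The sketch assumes a pool that hands freed blocks straight back out.

```python
class ToyPool:
    """Hypothetical caching allocator: freed blocks are handed out again."""

    def __init__(self):
        self.free_blocks = []
        self.storage = {}  # block id -> contents
        self.next_id = 0

    def alloc(self, value):
        if self.free_blocks:
            block = self.free_blocks.pop()  # reuse a released block first
        else:
            block = self.next_id
            self.next_id += 1
        self.storage[block] = value
        return block

    def release(self, block):
        self.free_blocks.append(block)


class ToyExport:
    """Hypothetical export whose deleter returns the block to the pool."""

    def __init__(self, pool, block):
        self.pool = pool
        self.block = block

    def deleter(self):
        self.pool.release(self.block)


pool = ToyPool()
export = ToyExport(pool, pool.alloc("tensor data"))

ptr = export.block                   # consumer reads the exported pointer
export.deleter()                     # consumer calls the deleter during arg parsing
other = pool.alloc("unrelated")      # producer reuses the released block
seen_by_kernel = pool.storage[ptr]   # the later "kernel" reads through the stale pointer
```

In this model the block is reused before the simulated kernel reads it, so the kernel sees the unrelated contents. A real repro would need a producer with comparable pooling behaviour and a kernel that is still in flight when the reuse happens.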
Relevant log output
none yet
Full env printout
not available yet
Other/Misc.
Code pointers on current main:
- cext/tile_kernel.cpp: arrayrepr_dlpack_common()
- cext/tile_kernel.cpp: arrayrepr_dlpack()
- cext/tile_kernel.cpp: launch()
If I am reading the intent correctly, the managed tensor probably needs to stay alive until the launched work on the relevant stream has finished using the buffer, with the deleter called only after that point, rather than being deleted during argument extraction.
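The ordering I have in mind can be sketched with a toy in-order stream. This is a pure-Python model, not the cuTile implementation: `ToyStream` and `launch_with_deferred_release` are hypothetical names, and a real fix would presumably track completion with an event recorded after the launch, as the existing code comment suggests.

```python
from collections import deque


class ToyStream:
    """Hypothetical in-order stream: queued work runs only when drained."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, fn):
        self.pending.append(fn)

    def synchronize(self):
        # Run queued work in submission order, like an in-order stream.
        while self.pending:
            self.pending.popleft()()


log = []


def kernel():
    log.append("kernel")


def deleter():
    log.append("deleter")


def launch_with_deferred_release(stream, kernel, deleter):
    # Enqueue the kernel first, then order the release behind it instead of
    # calling the deleter while arguments are still being prepared.
    stream.enqueue(kernel)
    stream.enqueue(deleter)


stream = ToyStream()
launch_with_deferred_release(stream, kernel, deleter)
log_before_sync = list(log)  # nothing has run or been released yet
stream.synchronize()         # kernel runs, then the deleter
```

The only property the sketch is meant to show is that the deleter never runs before the kernel that consumes the pointer.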
Happy to help with a repro if that would be useful.