[BUG]: DLPack tensors may be released before async launch work is finished #88

@fallintoplace

Version

current main checkout (0477a34 locally); this code path appears to date back to the initial public import as well

CUDA Toolkit Version

not pinned yet; this came from source review rather than a runtime repro on a specific toolkit version

Which installation method(s) does this occur on?

Source

Describe the bug

I think the generic DLPack launch path is releasing consumed DLManagedTensor objects too early.

In cext/tile_kernel.cpp, arrayrepr_dlpack_common() renames the capsule to "used_dltensor" and then immediately calls tensor->deleter(tensor). That happens while kernel arguments are still being prepared, before prepare_launch() returns, and before cuLaunchKernelEx() is called.

For generic __dlpack__ objects, arrayrepr_dlpack() also calls __dlpack__(stream=-1), so the producer is explicitly being told not to synchronize.

My understanding of the DLPack ownership/lifetime contract is that once the consumer takes ownership of the capsule, it should keep the managed tensor alive until it is actually done with the data. Releasing it during argument parsing looks wrong on its own. In the async CUDA launch path it could also leave the kernel with a stale pointer, because some producers rely on the deleter not being called until the consumer's work has completed in order to keep the exported memory alive.

What made me look twice is that there is already a comment in the code saying this is "technically an incorrect implementation" and suggesting an event-based deferred release after launch.

The control flow I am looking at is:

  • arrayrepr_dlpack() calls __dlpack__(stream=-1)
  • arrayrepr_dlpack_common() reads the pointer, renames the capsule, and immediately calls the deleter
  • the actual kernel enqueue happens later in launch() via cuLaunchKernelEx()
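
To make the ordering concrete, here is a minimal pure-Python model of the three steps above. It is not cuTile code: the function names only mirror the ones in cext/tile_kernel.cpp, and `ToyProducer`, `ToyManagedTensor`, the dict standing in for the capsule, and the pointer value are all made up for illustration.

```python
# Toy model of the call order described above. Nothing here is real cuTile code.

log = []

class ToyManagedTensor:
    """Hypothetical stand-in for a DLManagedTensor: a data pointer plus a deleter."""
    def __init__(self, data_ptr):
        self.data_ptr = data_ptr

    def deleter(self):
        log.append("deleter")

class ToyProducer:
    """Hypothetical producer; a dict plays the role of the capsule."""
    def __dlpack__(self, stream=None):
        log.append(f"__dlpack__(stream={stream})")
        return {"name": "dltensor", "tensor": ToyManagedTensor(0x1000)}

def arrayrepr_dlpack_common(capsule):
    tensor = capsule["tensor"]
    ptr = tensor.data_ptr
    log.append("read pointer")
    capsule["name"] = "used_dltensor"
    log.append("rename capsule")
    tensor.deleter()  # released here, during argument preparation
    return ptr

def arrayrepr_dlpack(obj):
    # stream=-1 asks the producer not to synchronize
    return arrayrepr_dlpack_common(obj.__dlpack__(stream=-1))

def launch(args):
    ptrs = [arrayrepr_dlpack(a) for a in args]
    log.append("enqueue kernel")  # stands in for cuLaunchKernelEx()
    return ptrs

ptrs = launch([ToyProducer()])
print(log)
```

In this model the "deleter" entry always lands before "enqueue kernel", which is the ordering I am worried about.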

Minimum reproducible example

I do not have a clean runtime repro yet. I found this during code review because the control flow itself looks off:

1. producer returns a DLPack capsule
2. cuTile reads the exported pointer
3. cuTile immediately calls the DLPack deleter
4. the CUDA kernel launch happens afterwards, asynchronously

The repro shape I would expect to fail is a producer whose deleter drops the last owner or returns memory to a pool before the launched kernel has actually finished using the pointer.
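
As a rough illustration of that failure shape, here is a pure-Python simulation. It does not touch CUDA or DLPack: `ToyPool` is a hypothetical caching allocator, `ToyStream` is a hypothetical queue that only runs its work on `synchronize()`, and a Python list plays the role of device memory.

```python
from collections import deque

class ToyPool:
    """Hypothetical caching allocator: released buffers are handed out again."""
    def __init__(self):
        self.free = []

    def alloc(self, fill):
        buf = self.free.pop() if self.free else [0] * 4
        buf[:] = [fill] * 4
        return buf

    def release(self, buf):
        self.free.append(buf)

class ToyStream:
    """Hypothetical async stream: work is queued and only runs on synchronize()."""
    def __init__(self):
        self.queue = deque()

    def enqueue(self, fn):
        self.queue.append(fn)

    def synchronize(self):
        while self.queue:
            self.queue.popleft()()

pool = ToyPool()
stream = ToyStream()
seen = []

exported = pool.alloc(fill=7)   # 1. producer exports a buffer holding 7s
ptr = exported                  # 2. consumer reads the exported pointer
pool.release(exported)          # 3. consumer calls the deleter immediately
stream.enqueue(lambda: seen.append(list(ptr)))  # 4. kernel is enqueued afterwards

other = pool.alloc(fill=9)      # producer reuses the same memory for new data
stream.synchronize()            # the "kernel" only runs now

print(seen)  # [[9, 9, 9, 9]] instead of the 7s that were exported
```

A real repro would need a producer whose deleter behaves like `pool.release` here, plus enough stream latency for the reuse to happen before the kernel reads the memory.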

Relevant log output

none yet

Full env printout

not available yet

Other/Misc.

Code pointers on current main:

  • cext/tile_kernel.cpp: arrayrepr_dlpack_common()
  • cext/tile_kernel.cpp: arrayrepr_dlpack()
  • cext/tile_kernel.cpp: launch()

If I am reading the intent correctly, the managed tensor probably needs to stay alive until the launched work on the relevant stream no longer needs the exported memory, rather than being deleted during argument extraction.
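
One possible shape for such a deferred release, sketched in pure Python. This is only my guess at the idea behind the existing code comment, not a proposed patch: `ToyStream`, `record_event`, and `DeferredReleases` are hypothetical names. In real driver-API code the analogous pieces would presumably be an event recorded with cuEventRecord() after the launch and checked with cuEventQuery(), but I have not tried that here.

```python
from collections import deque

class ToyStream:
    """Hypothetical in-order stream; run_one() completes the oldest queued item."""
    def __init__(self):
        self.queue = deque()

    def enqueue(self, fn):
        self.queue.append(fn)

    def record_event(self):
        # The event flips to done only after everything queued before it has run.
        event = {"done": False}
        self.queue.append(lambda: event.update(done=True))
        return event

    def run_one(self):
        if self.queue:
            self.queue.popleft()()

class DeferredReleases:
    """Holds deleters until the event recorded after the launch has completed."""
    def __init__(self):
        self.pending = []

    def defer(self, event, deleter):
        self.pending.append((event, deleter))

    def poll(self):
        still_pending = []
        for event, deleter in self.pending:
            if event["done"]:
                deleter()
            else:
                still_pending.append((event, deleter))
        self.pending = still_pending

log = []
stream = ToyStream()
releases = DeferredReleases()

stream.enqueue(lambda: log.append("kernel ran"))      # launch
event = stream.record_event()                         # marker right after the launch
releases.defer(event, lambda: log.append("deleter"))  # tensor stays alive for now

releases.poll()                  # event not done yet, so the deleter is not called
log_after_first_poll = list(log)

stream.run_one()                 # kernel completes
stream.run_one()                 # event completes
releases.poll()                  # now the deleter runs

print(log)  # ['kernel ran', 'deleter']
```

The point of the sketch is only the ordering: the deleter is called after the work that uses the pointer, not during argument extraction.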

Happy to help with a repro if that would be useful.
