[BUG]: DLPack tensors may be released before async launch work is finished #88

### Version

current `main` checkout (`0477a34` locally); this code path appears to date back to the initial public import as well

### CUDA Toolkit Version

not pinned yet; this came from source review rather than a runtime repro on a specific toolkit version

### Which installation method(s) does this occur on?

Source

### Describe the bug

I think the generic DLPack launch path is releasing consumed `DLManagedTensor` objects too early.

In `cext/tile_kernel.cpp`, `arrayrepr_dlpack_common()` renames the capsule to `"used_dltensor"` and then immediately calls `tensor->deleter(tensor)`. That happens while kernel arguments are still being prepared, before `prepare_launch()` returns, and before `cuLaunchKernelEx()` is called.

For generic `__dlpack__` objects, `arrayrepr_dlpack()` also calls `__dlpack__(stream=-1)`, so the producer is explicitly being told not to synchronize.

My understanding of the DLPack ownership/lifetime contract is that once the consumer takes ownership of the capsule, it must keep the managed tensor alive until it is actually done with the data. Releasing it during argument parsing looks wrong on its own. In the async CUDA launch path it could also turn into a stale-pointer / premature-release problem: a producer that treats the deleter call as the signal that the consumer is finished may free or reuse the memory before the kernel has run.

What made me look twice is that there is already a comment in the code saying this is "technically an incorrect implementation" and suggesting an event-based deferred release after launch.

The control flow I am looking at is:

- `arrayrepr_dlpack()` calls `__dlpack__(stream=-1)`
- `arrayrepr_dlpack_common()` reads the pointer, renames the capsule, and immediately calls the deleter
- the actual kernel enqueue happens later in `launch()` via `cuLaunchKernelEx()`

### Minimum reproducible example

I do not have a clean runtime repro yet. I found this during code review because the control flow itself looks off:

```text
1. producer returns a DLPack capsule
2. cuTile reads the exported pointer
3. cuTile immediately calls the DLPack deleter
4. the CUDA kernel launch happens afterwards, asynchronously
```

The repro shape I would expect to fail is a producer whose deleter drops the last owner or returns memory to a pool before the launched kernel has actually finished using the pointer.
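
To make that failure shape concrete, here is a toy, GPU-free model of the ordering. Everything in it is hypothetical and only for illustration: `ToyPool` stands in for a producer-side caching allocator, `ToyStream` for a CUDA stream that runs queued work later, and `run_early_release()` for the current consumer behavior. It is not a repro against cuTile and makes no claim about any particular producer.

```python
from collections import deque


class ToyPool:
    """Hypothetical caching allocator: freed blocks are handed out again."""

    def __init__(self):
        self.memory = {}     # block id -> contents
        self.free_ids = []   # blocks available for reuse
        self.next_id = 0

    def alloc(self, contents):
        if self.free_ids:
            block = self.free_ids.pop()
        else:
            block = self.next_id
            self.next_id += 1
        self.memory[block] = contents
        return block

    def free(self, block):
        self.free_ids.append(block)


class ToyStream:
    """Stand-in for a CUDA stream: work is queued now and executed later."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, fn):
        self.pending.append(fn)

    def synchronize(self):
        while self.pending:
            self.pending.popleft()()


def run_early_release():
    pool, stream, seen = ToyPool(), ToyStream(), []

    block = pool.alloc("tensor A")       # 1. producer exports a tensor
    deleter = lambda: pool.free(block)   #    its deleter drops the last owner
    ptr = block                          # 2. consumer reads the exported pointer
    deleter()                            # 3. consumer calls the deleter right away
    stream.enqueue(lambda: seen.append(pool.memory[ptr]))  # 4. launch is queued

    pool.alloc("unrelated B")            # producer reuses the freed block
    stream.synchronize()                 # the "kernel" only runs now
    return seen


print(run_early_release())  # ['unrelated B']
```

In this model the queued "kernel" reads whatever the pool put into the recycled block, not the tensor that was exported. A real repro would need a producer with a comparable allocator and a kernel that is still in flight when the memory is reused.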

### Relevant log output

none yet

### Full env printout

not available yet

### Other/Misc.

Code pointers on current `main`:

- `cext/tile_kernel.cpp`: `arrayrepr_dlpack_common()`
- `cext/tile_kernel.cpp`: `arrayrepr_dlpack()`
- `cext/tile_kernel.cpp`: `launch()`

If I am reading the intent correctly, the managed tensor needs to stay alive until the launched work on the relevant stream has finished with the pointer, and only then be handed back to the producer's deleter, rather than being deleted during argument extraction.
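
As a rough sketch of the ordering I have in mind, again in a toy, GPU-free model: the deleter is queued behind the kernel on the same stream instead of being called during argument parsing. `ToyStream` and `launch_with_deferred_release()` are hypothetical names, not cuTile APIs. How this would map onto the driver API (for example, an event recorded after the launch and checked later, as the existing code comment hints) is an assumption on my part.

```python
from collections import deque


class ToyStream:
    """Stand-in for a CUDA stream: callbacks run in FIFO order on synchronize()."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, fn):
        self.pending.append(fn)

    def synchronize(self):
        while self.pending:
            self.pending.popleft()()


def launch_with_deferred_release(stream, kernel, deleter):
    """Queue the kernel, then queue the release behind it on the same stream."""
    stream.enqueue(kernel)
    stream.enqueue(deleter)  # runs only after the kernel callback has run


log = []
stream = ToyStream()
launch_with_deferred_release(
    stream,
    kernel=lambda: log.append("kernel reads tensor"),
    deleter=lambda: log.append("deleter called"),
)
log.append("launch() returned")
stream.synchronize()
print(log)
# ['launch() returned', 'kernel reads tensor', 'deleter called']
```

The point is only the ordering: `launch()` can still return immediately, but the producer does not see the deleter call until the stream has passed the kernel.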

Happy to help with a repro if that would be useful.

