Conversation

@Andy-Jost
Contributor

Summary

Closes #1586. Adds __enter__/__exit__ to Device so it can be used as a context manager that temporarily activates a device and restores the previous CUDA context on exit.

from cuda.core import Device

dev0 = Device(0)
dev0.set_current()
# ... do work on device 0 ...

with Device(1) as dev1:
    # device 1 is now current
    buf = dev1.allocate(1024)

# device 0 is automatically restored here

Changes

  • cuda/core/_device.pyx: Added __enter__ and __exit__ methods to Device. On enter, queries the current context via cuCtxGetCurrent and saves it on a per-thread stack (_tls._ctx_stack), then calls set_current(). On exit, restores the saved context via cuCtxSetCurrent. Uses peek-then-pop ordering so the stack is not corrupted if cuCtxSetCurrent raises.
  • tests/test_device.py: Added 12 tests covering basic usage, context restoration, exception safety, same-device nesting, deep nesting, multi-GPU nesting, set_current() inside a with block, device usability after exit, device initialization, and thread safety (3 threads on 3 GPUs).
  • tests/conftest.py: Added teardown to mempool_device_x2 and mempool_device_x3 fixtures to clean up residual contexts between tests.
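To make the save/restore ordering concrete, here is a minimal, driver-free sketch of the mechanism described above. The `cu_ctx_get_current`/`cu_ctx_set_current` functions and the `("ctx", id)` tuples are illustrative stand-ins for the real driver calls and context handles, not the actual cuda.core or cuda.bindings API:

```python
import threading

# Stand-in for the CUDA driver's per-process "current context" slot.
# The real implementation calls cuCtxGetCurrent/cuCtxSetCurrent instead.
_current_ctx = None

def cu_ctx_get_current():
    return _current_ctx

def cu_ctx_set_current(ctx):
    global _current_ctx
    _current_ctx = ctx

_tls = threading.local()  # holds the per-thread stack of saved contexts

class Device:
    def __init__(self, device_id):
        self.device_id = device_id
        # Stand-in for the device's primary context handle.
        self._ctx = ("ctx", device_id)

    def set_current(self):
        cu_ctx_set_current(self._ctx)

    def __enter__(self):
        stack = getattr(_tls, "_ctx_stack", None)
        if stack is None:
            stack = _tls._ctx_stack = []
        # Save whatever is current right now (may be None on a fresh thread).
        stack.append(cu_ctx_get_current())
        self.set_current()
        return self

    def __exit__(self, exc_type, exc, tb):
        stack = _tls._ctx_stack
        prev = stack[-1]           # peek first ...
        cu_ctx_set_current(prev)   # ... restore (could raise in the real driver) ...
        stack.pop()                # ... pop only after restore succeeded
        return False               # never swallow the caller's exception
```

The peek-then-pop ordering in `__exit__` is the detail the PR calls out: if the restore call raises, the saved context is still on the stack, so a retry or a later `__exit__` still sees consistent state.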

Design

  • Stateless restoration: Each __enter__ queries the actual CUDA driver state rather than maintaining a Python-side device cache. This ensures correct interoperability with other libraries (PyTorch, CuPy) that use cudaSetDevice/cuCtxSetCurrent.
  • Reentrant: Saved contexts are stored on a per-thread stack (not on the Device singleton), so nested and reentrant usage works correctly.
  • Uses cuCtxGetCurrent/cuCtxSetCurrent: Consistent with set_current() and the runtime API model. Does not use cuCtxPushCurrent/cuCtxPopCurrent.
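The value of keeping the saved contexts on a per-thread stack can be shown with a toy model (again a stand-in, not the real cuda.core code): the driver's current context is per-thread state, so the save stack must be too, or one thread's `with` block could restore another thread's context:

```python
import threading

_driver_ctx = threading.local()  # stand-in: CUDA's current context is per-thread
_saved = threading.local()       # stand-in for the PR's _tls._ctx_stack

class FakeDevice:
    """Illustration only: nested, reentrant use with a thread-local save stack."""
    def __init__(self, n):
        self.n = n

    def __enter__(self):
        if not hasattr(_saved, "stack"):
            _saved.stack = []
        _saved.stack.append(getattr(_driver_ctx, "cur", None))
        _driver_ctx.cur = self.n
        return self

    def __exit__(self, *exc):
        _driver_ctx.cur = _saved.stack.pop()
        return False

results = {}

def worker(i):
    with FakeDevice(i):
        with FakeDevice(i + 10):      # nested use on the same thread
            results[i] = _driver_ctx.cur
    # After both blocks exit, this thread is back to its own initial state.
    results[(i, "after")] = _driver_ctx.cur

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread sees only its own nesting: the innermost context is `i + 10`, and on exit the thread returns to its own entry state regardless of what the other threads did concurrently.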

Test Coverage

All tests pass locally on a single-GPU machine (NVIDIA L40) and a multi-GPU machine (3x RTX PRO 6000 Blackwell). Stress-tested with 20 randomized iterations via pytest-repeat + pytest-randomly, with no test-ordering issues.


Closes NVIDIA#1586. Adds __enter__/__exit__ to Device so it can be used as
a context manager that saves the current CUDA context on entry and
restores it on exit. Uses cuCtxGetCurrent/cuCtxSetCurrent (not push/pop)
for interoperability with the runtime API. Saved contexts are stored on
a per-thread stack (_tls._ctx_stack) so nested and reentrant usage works
correctly.

Also adds teardown to mempool_device_x2/x3 fixtures to clean up
residual contexts between tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost added this to the cuda.core v0.6.0 milestone Feb 11, 2026
@Andy-Jost Andy-Jost added feature New feature or request cuda.core Everything related to the cuda.core module labels Feb 11, 2026
@Andy-Jost Andy-Jost self-assigned this Feb 11, 2026
@Andy-Jost Andy-Jost requested a review from leofang February 11, 2026 01:37
@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost
Contributor Author

/ok to test f02b730

@Andy-Jost Andy-Jost marked this pull request as draft February 11, 2026 01:40

Development

Successfully merging this pull request may close these issues.

Feature Request: Device context manager for temporary device switching