Closed
Labels
P0 (High priority - Must do!), cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements)
Description
Is this a duplicate?
- I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct
Area
cuda.core
Is your feature request related to a problem? Please describe.
I would like abnormal process shutdown to complete gracefully. When testing memory IPC, it was observed that a killed child process sometimes appears to invoke Buffer.__del__ after the CUDA context has been destroyed (see the Additional context section for the error message). This has been observed to result in hard errors, including floating point exceptions and segmentation violations.
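The failure mode can be imitated without CUDA. The sketch below is only an illustration: FakeContext, FakeBuffer, and run_demo are hypothetical names, not part of cuda.core, and the simulated error stands in for the driver error in the log. It shows that an exception raised inside __del__ cannot propagate to the caller and is instead handed to sys.unraisablehook, which is where the "Exception ignored in" message comes from.

```python
import sys


class FakeContext:
    """Hypothetical stand-in for a CUDA context; not part of cuda.core."""

    def __init__(self):
        self.alive = True

    def free(self, ptr):
        if not self.alive:
            # Simulates the driver rejecting a call on a destroyed context.
            raise RuntimeError("CUDA_ERROR_INVALID_CONTEXT (simulated)")


class FakeBuffer:
    """Hypothetical buffer whose finalizer frees memory through the context."""

    def __init__(self, ctx, ptr):
        self._ctx = ctx
        self._ptr = ptr

    def __del__(self):
        # Runs whenever the object is collected, even if the context is gone.
        self._ctx.free(self._ptr)


def run_demo():
    """Destroy the context first, then drop the buffer; return what was reported."""
    reported = []
    previous_hook = sys.unraisablehook
    # Record only the message text to avoid keeping the exception alive.
    sys.unraisablehook = lambda info: reported.append(str(info.exc_value))
    try:
        ctx = FakeContext()
        buf = FakeBuffer(ctx, ptr=0x1000)
        ctx.alive = False  # the "context" is torn down before the buffer
        del buf  # CPython runs __del__ here; the error cannot propagate
    finally:
        sys.unraisablehook = previous_hook
    return reported
```

In this toy version the error is merely reported. In the real case the finalizer calls into the driver, which is presumably why the observed outcome can be a hard crash rather than a warning.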
Describe the solution you'd like
Update Buffer.__del__ (and similar methods) to use Buffer.__dealloc__.
Describe alternatives you've considered
Solutions based on sys.is_finalizing appear to be unreliable.
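For reference, a guard of this kind typically looks like the sketch below. GuardedBuffer and its free callable are hypothetical and are not taken from cuda.core. The guard only skips cleanup once the interpreter itself reports that it is finalizing. One conceivable gap, stated here as an assumption rather than something this report confirms, is that the CUDA context may already be gone while sys.is_finalizing() still returns False.

```python
import sys


class GuardedBuffer:
    """Hypothetical buffer that skips cleanup during interpreter shutdown."""

    def __init__(self, free, ptr):
        self._free = free  # hypothetical callable that releases ptr
        self._ptr = ptr

    def close(self):
        # Idempotent: a second call is a no-op.
        if self._ptr is None:
            return
        ptr, self._ptr = self._ptr, None
        self._free(ptr)

    def __del__(self):
        # Only covers interpreter finalization, not earlier context teardown.
        if sys.is_finalizing():
            return
        self.close()
```

During normal execution the guard is inactive, so dropping the last reference still releases the allocation exactly once.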
Additional context
The following messages were output from a killed child process while running the cuda_core/tests/memory_ipc tests:
/opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/_pytest/unraisableexception.py:67: PytestUnraisableExceptionWarning: Exception ignored in: 'cuda.core.experimental._memory.Buffer.__del__'
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/cuda/core/experimental/_device.py", line 991, in __new__
    devices = _tls.devices
              ^^^^^^^^^^^^
AttributeError: '_thread._local' object has no attribute 'devices'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cuda/core/experimental/_memory.pyx", line 82, in cuda.core.experimental._memory.Buffer._shutdown_safe_close
    self._mr.deallocate(self._ptr, self._size, stream)
  File "cuda/core/experimental/_memory.pyx", line 801, in cuda.core.experimental._memory.DeviceMemoryResource.deallocate
    raise_if_driver_error(err)
  File "cuda/core/experimental/_utils/cuda_utils.pyx", line 69, in cuda.core.experimental._utils.cuda_utils._check_driver_error
    raise CUDAError(f"{name}: {expl}")
cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_CONTEXT: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had ::cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See ::cuCtxGetApiVersion() for more details. This can also be returned if the green context passed to an API call was not converted to a ::CUcontext using ::cuCtxFromGreenCtx API.