[ROCm] Cherry-pick: Fix profiler leaking stale hipErrorInvalidDevice #773
Merged
i-chaochen merged 3 commits into ROCm:rocm-jaxlib-v0.9.2 on Apr 8, 2026
b82a41c to dfb20c6
Replace all deprecated hipCtx_t-based context management with hipSetDevice/hipGetDevice as recommended by AMD since ROCm 1.9. On ROCm, hipCtx_t is a thin wrapper around a device ordinal and the entire context lifecycle (Retain/Release/SetCurrent/GetCurrent) is a no-op. AMD has deprecated these APIs with the warning "might not be supported in future releases."

Changes:
- rocm_context.h: Simplify RocmContext to a trivial class holding only device_ordinal_. Remove Create(), context map, mutex, GetDeviceMemoryUsage, GetDeviceTotalMemory.
- rocm_context.cc: Deleted. Method implementations (SetActive, IsActive, Synchronize) moved into rocm_executor.cc.
- rocm_executor.h: Change RocmContext* to a value field initialized in the constructor. No heap allocation or factory needed.
- rocm_executor.cc: Update all usages from pointer to address-of. Inline DeviceMemoryUsage and GetDeviceTotalMemory. Simplify DeviceFromContext() to use context->device_ordinal().
- rocm_driver_wrapper.h: Remove 6 deprecated API wrappers (hipCtxGetDevice, hipCtxSetCurrent, hipDevicePrimaryCtxGetState, hipDevicePrimaryCtxSetFlags, hipDevicePrimaryCtxRetain, hipDevicePrimaryCtxRelease).
- BUILD: Remove rocm_context.cc from srcs, simplify deps.
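The shape of the simplified context described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `RocmContextSketch`, `RocmExecutorSketch`, and `SetDeviceFn` are hypothetical names, and the injected callable stands in for hipSetDevice so that the sketch compiles without the HIP runtime.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for hipSetDevice: takes a device ordinal and returns
// 0 on success (hipSuccess is 0 in HIP). Injected so no HIP runtime is needed.
using SetDeviceFn = std::function<int(int)>;

// Illustrative only: a context that holds nothing but a device ordinal.
class RocmContextSketch {
 public:
  explicit RocmContextSketch(int device_ordinal)
      : device_ordinal_(device_ordinal) {
    assert(device_ordinal >= 0);
  }

  int device_ordinal() const { return device_ordinal_; }

  // Makes this context's device current through the injected setter.
  bool SetActive(const SetDeviceFn& set_device) const {
    return set_device(device_ordinal_) == 0;
  }

 private:
  int device_ordinal_;
};

// Illustrative only: the executor owns the context by value, so there is no
// factory, context map, or heap allocation. Callers that expect a pointer
// receive the address of the member.
class RocmExecutorSketch {
 public:
  explicit RocmExecutorSketch(int device_ordinal) : context_(device_ordinal) {}

  const RocmContextSketch* context() const { return &context_; }

 private:
  RocmContextSketch context_;
};
```

Holding the context by value ties its lifetime to the executor, which is what lets the Create() factory and its mutex-protected map go away in this sketch.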
- Replace hipGetDeviceProperties in GetDeviceCapabilities with rocprofiler agent data already available from RocmTracer. This eliminates HIP runtime calls that fail for non-visible devices when ROCR_VISIBLE_DEVICES restricts GPU visibility (e.g. pytest-xdist workers).
- Since ROCm, hipGetLastError() is sticky: it retains errors even after subsequent successful HIP API calls. The stale hipErrorInvalidDevice from the profiler leaked into unrelated GPU operations, causing flaky test failures in JAX's FFI handlers (solver, linalg, prng).
- Agent clock rates are in MHz (vs KHz in hipDeviceProp_t); memory is filtered to VRAM-only banks (FRAME_BUFFER_PUBLIC/PRIVATE).
- Add unit test verifying agent data matches hipGetDeviceProperties.
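The unit and memory handling mentioned in the third bullet can be sketched as below. This is an illustration under stated assumptions: `AgentInfo`, `MemoryBank`, `BankKind`, `MhzToKhz`, and `VramBytes` are hypothetical names rather than rocprofiler or XLA types, and summing the matching banks is an assumption about how the VRAM total is formed. Only the MHz/KHz difference and the two FRAME_BUFFER bank kinds come from the commit message.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bank kinds. Only the FRAME_BUFFER_PUBLIC/PRIVATE distinction
// is taken from the commit message; kOther covers every non-VRAM bank.
enum class BankKind { kFrameBufferPublic, kFrameBufferPrivate, kOther };

struct MemoryBank {
  BankKind kind;
  std::uint64_t size_bytes;
};

// Hypothetical agent record: clock rate in MHz plus a list of memory banks.
struct AgentInfo {
  std::uint32_t max_clock_mhz;
  std::vector<MemoryBank> banks;
};

// Scales an agent clock (MHz) to the KHz unit used by hipDeviceProp_t, so
// the two values can be compared directly.
constexpr std::uint64_t MhzToKhz(std::uint32_t mhz) {
  return static_cast<std::uint64_t>(mhz) * 1000;
}

// Sums only the VRAM banks, skipping every other bank kind.
std::uint64_t VramBytes(const AgentInfo& agent) {
  std::uint64_t total = 0;
  for (const MemoryBank& bank : agent.banks) {
    if (bank.kind == BankKind::kFrameBufferPublic ||
        bank.kind == BankKind::kFrameBufferPrivate) {
      total += bank.size_bytes;
    }
  }
  return total;
}
```

A test comparing agent data against hipGetDeviceProperties would need the same scaling before checking the clock values for equality.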
dfb20c6 to 6215a05
i-chaochen approved these changes on Apr 8, 2026
Summary
Cherry-pick of openxla#40199 into the rocm-jaxlib-v0.9.2 release branch.

Fixes the ROCm profiler leaking stale hipErrorInvalidDevice into per-thread HIP error state, causing flaky JAX test failures in solver/linalg/prng FFI handlers when running with pytest-xdist on multi-GPU systems.

- Replace hipGetDeviceProperties in GetDeviceCapabilities with rocprofiler agent data already available from RocmTracer
- Eliminate HIP runtime calls that fail for non-visible devices when ROCR_VISIBLE_DEVICES restricts GPU visibility
- Add unit test verifying agent data matches hipGetDeviceProperties
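The leak described in this summary can be illustrated with a small simulation. Nothing here calls the HIP runtime: `FakeError`, `FakeCall`, and `FakeGetLastError` are hypothetical stand-ins, and the model assumes a per-thread error slot that a failing call sets, a successful call leaves untouched, and a read clears.

```cpp
// Hypothetical error codes standing in for hipSuccess and
// hipErrorInvalidDevice.
enum class FakeError { kSuccess, kInvalidDevice };

// Simulated per-thread "last error" slot.
thread_local FakeError g_last_error = FakeError::kSuccess;

// Simulated runtime call. A failure is recorded in the slot; a success does
// NOT clear it, which is the sticky behavior assumed in this sketch.
FakeError FakeCall(bool device_visible) {
  FakeError result =
      device_visible ? FakeError::kSuccess : FakeError::kInvalidDevice;
  if (result != FakeError::kSuccess) {
    g_last_error = result;
  }
  return result;
}

// Simulated last-error query: returns the recorded error and resets the slot.
FakeError FakeGetLastError() {
  FakeError previous = g_last_error;
  g_last_error = FakeError::kSuccess;
  return previous;
}
```

In this model, a profiler query against a non-visible device (`FakeCall(false)`) followed by an unrelated successful call (`FakeCall(true)`) still leaves `FakeGetLastError()` returning the invalid-device code on that thread. That is why removing the failing query from the profiler, rather than clearing the error afterwards, keeps unrelated handlers from observing it.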