[ROCm] Cherry-pick: Fix profiler leaking stale hipErrorInvalidDevice #773

Merged
i-chaochen merged 3 commits into ROCm:rocm-jaxlib-v0.9.2 from magaonka-amd:cherry-pick-profiler-fix
Apr 8, 2026
@magaonka-amd
Summary

Cherry-pick of openxla#40199 into the rocm-jaxlib-v0.9.2 release branch.

Fixes the ROCm profiler leaking a stale hipErrorInvalidDevice into the per-thread HIP error state, which caused flaky JAX test failures in the solver/linalg/prng FFI handlers when running with pytest-xdist on multi-GPU systems.

  • Replace hipGetDeviceProperties in GetDeviceCapabilities with rocprofiler agent data already available from RocmTracer
  • Eliminate HIP runtime calls that fail for non-visible devices when ROCR_VISIBLE_DEVICES restricts GPU visibility
  • Add unit test verifying agent data matches hipGetDeviceProperties
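
The failure mode described above can be illustrated with a small model of a sticky per-thread error slot. This is only a sketch under stated assumptions: `FakeError`, `LastErrorSlot`, `FakeQueryDevice`, and `FakeGetLastError` are hypothetical stand-ins written for illustration, not HIP APIs, and the real runtime's bookkeeping may differ.

```cpp
// Hypothetical stand-ins for HIP error codes; not the real hipError_t values.
enum class FakeError { kSuccess, kInvalidDevice };

// Per-thread "last error" slot. It models the sticky behavior described in
// this PR: a later successful call does not clear a recorded error.
inline FakeError& LastErrorSlot() {
  thread_local FakeError last_error = FakeError::kSuccess;
  return last_error;
}

// Models a runtime query (such as a device-properties lookup) that fails for
// an ordinal outside the visible device set.
inline FakeError FakeQueryDevice(int ordinal, int visible_device_count) {
  if (ordinal < 0 || ordinal >= visible_device_count) {
    LastErrorSlot() = FakeError::kInvalidDevice;
    return FakeError::kInvalidDevice;
  }
  // Success leaves the slot untouched, so an earlier error stays recorded.
  return FakeError::kSuccess;
}

// Models a later, unrelated check that reads and then resets the last error.
inline FakeError FakeGetLastError() {
  FakeError error = LastErrorSlot();
  LastErrorSlot() = FakeError::kSuccess;
  return error;
}
```

In this model, a profiler-side `FakeQueryDevice(3, 1)` fails, a following `FakeQueryDevice(0, 1)` succeeds, and the next `FakeGetLastError()` on the same thread still reports `kInvalidDevice`. Avoiding the failing query in the first place, as this PR does, keeps the slot clean.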

@magaonka-amd magaonka-amd force-pushed the cherry-pick-profiler-fix branch from b82a41c to dfb20c6 Compare April 2, 2026 05:17
phambinhfin and others added 2 commits April 2, 2026 00:37
Replace all deprecated hipCtx_t-based context management with
hipSetDevice/hipGetDevice as recommended by AMD since ROCm 1.9.

On ROCm, hipCtx_t is a thin wrapper around a device ordinal and the
entire context lifecycle (Retain/Release/SetCurrent/GetCurrent) is
a no-op.  AMD has deprecated these APIs with the warning "might not
be supported in future releases."
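
A context that is only a device ordinal can be sketched roughly as below. This is not the code from rocm_context.h: `ContextSketch` and `CurrentDeviceSlot` are hypothetical names, and the thread-local integer merely stands in for the per-thread current device that hipSetDevice/hipGetDevice manage in the real runtime.

```cpp
// Hypothetical stand-in for the runtime's per-thread current device.
inline int& CurrentDeviceSlot() {
  thread_local int current_device = 0;
  return current_device;
}

// A trivial value type: no factory, no context map, no mutex.
class ContextSketch {
 public:
  explicit ContextSketch(int device_ordinal)
      : device_ordinal_(device_ordinal) {}

  int device_ordinal() const { return device_ordinal_; }

  // Makes this context's device current on the calling thread.
  void SetActive() const { CurrentDeviceSlot() = device_ordinal_; }

  // True when the calling thread's current device matches this context.
  bool IsActive() const { return CurrentDeviceSlot() == device_ordinal_; }

 private:
  int device_ordinal_;
};
```

Because the object holds a single integer, it can live as a plain value member of its owner instead of being heap-allocated and tracked.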

Changes:
- rocm_context.h: Simplify RocmContext to a trivial class holding
  only device_ordinal_.  Remove Create(), context map, mutex,
  GetDeviceMemoryUsage, GetDeviceTotalMemory.
- rocm_context.cc: Deleted.  Method implementations (SetActive,
  IsActive, Synchronize) moved into rocm_executor.cc.
- rocm_executor.h: Change RocmContext* to a value field initialized
  in the constructor.  No heap allocation or factory needed.
- rocm_executor.cc: Update all usages from pointer to address-of.
  Inline DeviceMemoryUsage and GetDeviceTotalMemory.  Simplify
  DeviceFromContext() to use context->device_ordinal().
- rocm_driver_wrapper.h: Remove 6 deprecated API wrappers
  (hipCtxGetDevice, hipCtxSetCurrent, hipDevicePrimaryCtxGetState,
  hipDevicePrimaryCtxSetFlags, hipDevicePrimaryCtxRetain,
  hipDevicePrimaryCtxRelease).
- BUILD: Remove rocm_context.cc from srcs, simplify deps.
- Replace hipGetDeviceProperties in GetDeviceCapabilities with rocprofiler
  agent data already available from RocmTracer. This eliminates HIP runtime
  calls that fail for non-visible devices when ROCR_VISIBLE_DEVICES restricts
  GPU visibility (e.g. pytest-xdist workers).
- Since ROCm, hipGetLastError() is sticky: it retains errors even after
  subsequent successful HIP API calls. The stale hipErrorInvalidDevice from
  the profiler leaked into unrelated GPU operations, causing flaky test
  failures in JAX's FFI handlers (solver, linalg, prng).
- Agent clock rates are in MHz (vs. kHz in hipDeviceProp_t); memory is
  filtered to VRAM-only banks (FRAME_BUFFER_PUBLIC/PRIVATE).
- Add unit test verifying agent data matches hipGetDeviceProperties.
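
The unit conversion and VRAM filtering mentioned in the list above might look roughly like the following. The sketch assumes a hypothetical `MemoryBank` record and `HeapType` enum; only the two frame-buffer heap names come from the commit message, and summing the matching banks is an assumption about how a total would be derived, not a description of the actual patch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical heap-type tags. Only the two frame-buffer names are taken
// from the commit message; the others are placeholders.
enum class HeapType { kSystem, kFrameBufferPublic, kFrameBufferPrivate, kOther };

// Hypothetical shape of one memory bank reported for an agent.
struct MemoryBank {
  HeapType heap_type;
  std::uint64_t size_bytes;
};

// Agent clock rates are in MHz, while hipDeviceProp_t reports kHz.
inline std::uint64_t MhzToKhz(std::uint64_t mhz) { return mhz * 1000; }

// Sums only the VRAM (frame-buffer) banks and ignores everything else.
inline std::uint64_t TotalVramBytes(const std::vector<MemoryBank>& banks) {
  std::uint64_t total = 0;
  for (const MemoryBank& bank : banks) {
    if (bank.heap_type == HeapType::kFrameBufferPublic ||
        bank.heap_type == HeapType::kFrameBufferPrivate) {
      total += bank.size_bytes;
    }
  }
  return total;
}
```

A comparison test against hipGetDeviceProperties would need the same conversion applied before checking clock values, since the two sources use different units.
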
@magaonka-amd magaonka-amd force-pushed the cherry-pick-profiler-fix branch from dfb20c6 to 6215a05 Compare April 2, 2026 05:37
@i-chaochen i-chaochen merged commit 4408ca1 into ROCm:rocm-jaxlib-v0.9.2 Apr 8, 2026
8 checks passed

3 participants