
[ROCm] Cherry-pick: Fix HSACO module cache using pointer-based key causing stale lookups #780

Draft
magaonka-amd wants to merge 2 commits into ROCm:rocm-jaxlib-v0.9.2 from magaonka-amd:cherry-pick-hsaco-cache-fix

Conversation

@magaonka-amd

Summary

Cherry-pick of openxla#40419 into rocm-jaxlib-v0.9.2 release branch.

Fixes flaky sort-related test failures caused by HSACO module cache returning stale modules when a freed buffer's address is reused.

  • Replace pointer-based ModuleHandle key with content hash of HSACO bytes (absl::HashOf)
  • Add hipPeekAtLastError guard before hipModuleGetFunction for clearer diagnostics

Fix HSACO module cache using pointer-based key causing stale lookups

- The in_memory_modules_ cache in RocmExecutor used the raw HSACO data
  pointer as the cache key. When a CustomKernelThunk's owned HSACO buffer
  was freed and a new buffer was allocated at the same address, the cache
  returned a stale module loaded from different binary content, causing
  hipModuleGetFunction to fail with hipErrorNotFound.
- Replace pointer-based ModuleHandle key with a content hash of the HSACO
  bytes (absl::HashOf). Same content still hits the cache; different
  content at a reused address correctly misses.
- Add hipPeekAtLastError guard before hipModuleGetFunction to surface
  pre-existing sticky errors early with clear diagnostics instead of
  producing confusing failures.
Collaborator

@i-chaochen i-chaochen left a comment


I'm not sure I fully understand this. @pemeliya @nurmukhametov, were you aware of this one?

Collaborator

@i-chaochen i-chaochen left a comment


IIUC, this is because of openxla@e129675.

Do we have this offending commit in 0.9.2?

@i-chaochen i-chaochen requested a review from pemeliya April 8, 2026 16:31
@magaonka-amd
Author

Yes, that commit exposes the problem, but the commit itself doesn't introduce the bug.

@i-chaochen i-chaochen requested a review from nurmukhametov April 8, 2026 16:34
@i-chaochen
Collaborator

yes that commit exposes this problem. but commit itself doesn't introduce the bug.

My question is: does xla-0.9.2 have this commit?

@magaonka-amd
Author

No, Chao, openxla@e129675 is not in the 0.9.2 XLA.

@magaonka-amd magaonka-amd marked this pull request as draft April 8, 2026 18:53
// the per-thread error state is sticky: successful HIP calls do
// not clear it, so a stale error from a prior operation would
// produce confusing diagnostics if we proceeded.
hipError_t pre_err = ::hipPeekAtLastError();

Oh, I am not sure we should just clean up the sticky error bit here. It may hide problems in other libraries, like the recent one about the rocprofiler collector using the wrong device ID.

Author

Hi Pavel. The intention here is not to clear the sticky error but to peek at it, and, if there is an error, not to proceed but to error out here. If we simply proceed with this error, we run into complex run-to-run variation issues under parallel loads. So I added this check more as a guard.


Ah ok, I see, I had confused it with hipGetLastError().

// Compute a content-based ModuleHandle from HSACO bytes.
// Using a content hash instead of the raw data pointer avoids stale cache
// entries when an HSACO buffer is freed and a new one is allocated at the
// same address (pointer-reuse cache collision).

@pemeliya pemeliya Apr 9, 2026


I am wondering how it happens that we have stale cache entries. I see we have several flat_hash_maps in rocm_executor: in_memory_modules_, kernel_to_gpu_binary_, and gpu_binary_to_module_.

Which one was returning a stale cache entry? I see at least that UnloadGpuBinary() should clean up gpu_binary_to_module_ and in_memory_modules_, though the logic for erasing in_memory_modules_ is not easy to understand at first glance.

I mean, from my understanding, this should not happen, since GpuExecutable maintains a list of module handles which should be automatically unloaded at the end.

Author

Yes, there is an issue with unload right now: it doesn't take all cases into account. I'm working on a fix; it's not ready yet and might take a couple more days.
I agree that with that fix we may not need hashing at all. I only discovered this yesterday, so we don't have to merge this PR now. I'll work on an improved one.

But can we please not revert the upstream PR; without this fix we have a very annoying JAX bug in the pytest run.

Summary: this fix works, but there is more to it. I have found the actual problem with the unload and erase routine and will work on a better fix.

cc: @i-chaochen


Yes, I think we need a proper fix for that. I mean, here we use the result of absl::HashOf() as a unique key to identify a hipModule_t object. absl::Hash is very fast but prone to collisions (unlike SHA256), so this could be a subtle bug which only shows up under heavy load, when two different hipModules suddenly get the same hash key.

Author

Hi @pemeliya, I fixed the root problem that surfaced as the stale cache and opened a downstream PR for review feedback. If the XLA team agrees with the fix, I'll open a PR upstream.
PR: #798
Can you please review it at your convenience.
