
[ROCm] Cherry-pick: Fix HSACO module cache using pointer-based key causing stale lookups #780

Draft
magaonka-amd wants to merge 2 commits into ROCm:rocm-jaxlib-v0.9.2 from magaonka-amd:cherry-pick-hsaco-cache-fix

Conversation

@magaonka-amd

Summary

Cherry-pick of openxla#40419 into rocm-jaxlib-v0.9.2 release branch.

Fixes flaky sort-related test failures caused by HSACO module cache returning stale modules when a freed buffer's address is reused.

  • Replace pointer-based ModuleHandle key with content hash of HSACO bytes (absl::HashOf)
  • Add hipPeekAtLastError guard before hipModuleGetFunction for clearer diagnostics

Fix HSACO module cache using pointer-based key causing stale lookups

- The in_memory_modules_ cache in RocmExecutor used the raw HSACO data
  pointer as the cache key. When a CustomKernelThunk's owned HSACO buffer
  was freed and a new buffer was allocated at the same address, the cache
  returned a stale module loaded from different binary content, causing
  hipModuleGetFunction to fail with hipErrorNotFound.
- Replace pointer-based ModuleHandle key with a content hash of the HSACO
  bytes (absl::HashOf). Same content still hits the cache; different
  content at a reused address correctly misses.
- Add hipPeekAtLastError guard before hipModuleGetFunction to surface
  pre-existing sticky errors early with clear diagnostics instead of
  producing confusing failures.
Collaborator

@i-chaochen i-chaochen left a comment


I'm not sure I fully understand this. @pemeliya @nurmukhametov, were you aware of this one?

Collaborator

@i-chaochen i-chaochen left a comment


IIUC, this is because of openxla@e129675.

Do we have this offending commit in 0.9.2?

@i-chaochen i-chaochen requested a review from pemeliya April 8, 2026 16:31
@magaonka-amd
Author

Yes, that commit exposes the problem, but the commit itself doesn't introduce the bug.

@i-chaochen i-chaochen requested a review from nurmukhametov April 8, 2026 16:34
@i-chaochen
Collaborator

yes that commit exposes this problem. but commit itself doesn't introduce the bug.

My question is: does xla-0.9.2 have this commit?

@magaonka-amd
Author

No, Chao, openxla@e129675 is not in the 0.9.2 XLA.

@magaonka-amd magaonka-amd marked this pull request as draft April 8, 2026 18:53
// the per-thread error state is sticky: successful HIP calls do
// not clear it, so a stale error from a prior operation would
// produce confusing diagnostics if we proceeded.
hipError_t pre_err = ::hipPeekAtLastError();

Oh, I am not sure we should just clean up the sticky error bit here. It may hide problems in other libraries, like the recent one about the rocprofiler collector using the wrong device ID.

Author

Hi Pavel. The intention here is not to clear the sticky error but to peek at it, and, if there is an error, not to proceed but to error out here. If we simply proceed with this error, we run into complex run-to-run variation issues under parallel loads. So I added this check more as a guard.


Ah ok, I see, I had confused it with hipGetLastError().

// Compute a content-based ModuleHandle from HSACO bytes.
// Using a content hash instead of the raw data pointer avoids stale cache
// entries when an HSACO buffer is freed and a new one is allocated at the
// same address (pointer-reuse cache collision).

@pemeliya pemeliya Apr 9, 2026


I am wondering how it happens that we have stale cache entries. I see we have several flat_hash_maps in rocm_executor: in_memory_modules_, kernel_to_gpu_binary_, and gpu_binary_to_module_.

Which one was returning a stale cache entry? I see at least that UnloadGpuBinary() should clean up gpu_binary_to_module_ and in_memory_modules_, though the logic for erasing in_memory_modules_ is not easy to understand at first glance.

I mean, from my understanding, this should not happen, since GpuExecutable maintains a list of module handles which should be automatically unloaded at the end.

Author

Yes, there is an issue with unload right now: it doesn't take all cases into account. I'm working on a fix; it's not ready yet and might take a couple more days.
I agree that with that fix we may not need hashing at all. I only discovered this yesterday, so we don't have to merge this PR now. I'll work on an improved one.

But can we please not revert the upstream PR; without this fix we have a very annoying JAX bug in the pytest run.

Summary: this fix works, but there is more to it. I have found the actual problem with the unload and erase routine and will work on a better fix.

cc: @i-chaochen


Yes, I think we need a proper fix for that. I mean, here we use the result of absl::HashOf() as a unique key to identify a hipModule_t object. absl::Hash is very fast but prone to collisions (unlike SHA256), so this could be a subtle bug which only shows up under heavy load, when two different hipModules suddenly get the same hash key.

Author

Hi @pemeliya, I fixed the root problem that surfaced as the stale cache and opened a downstream PR for review feedback. If the XLA team agrees with the fix, I'll open a PR upstream.
PR: #798
Can you please review it at your convenience.
