[ROCm] Fix HSACO module cache using pointer-based key causing stale lookups by magaonka-amd · Pull Request #779 · ROCm/xla

magaonka-amd · 2026-04-06T03:31:34Z

Motivation

The in_memory_modules_ cache in RocmExecutor used the raw HSACO data
pointer as the cache key. When a CustomKernelThunk's owned HSACO buffer
was freed and a new buffer was allocated at the same address, the cache
returned a stale module loaded from different binary content, causing
hipModuleGetFunction to fail with hipErrorNotFound.

Replace pointer-based ModuleHandle key with a content hash of the HSACO
bytes (absl::HashOf). Same content still hits the cache; different
content at a reused address correctly misses.

Add hipPeekAtLastError guard before hipModuleGetFunction to surface
pre-existing sticky errors early with clear diagnostics instead of
producing confusing failures.
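
The failure mode described above can be reproduced in miniature without HIP. The following is a hypothetical model, not the RocmExecutor code: `PointerKeyedCache`, `ContentKeyedCache`, `FakeModule` and `ContentHash` are illustrative names, a 64-bit FNV-1a hash stands in for absl::HashOf, and a struct that remembers its source bytes stands in for a loaded module.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Stand-in for absl::HashOf (assumption): 64-bit FNV-1a over the raw bytes.
inline uint64_t ContentHash(std::string_view bytes) {
  uint64_t h = 14695981039346656037ULL;
  for (unsigned char c : bytes) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Hypothetical stand-in for a loaded module: remembers its source bytes.
struct FakeModule {
  std::string loaded_from;
};

// Old scheme: the cache key is the address of the HSACO buffer.
class PointerKeyedCache {
 public:
  const FakeModule& GetOrLoad(std::string_view hsaco) {
    const void* key = hsaco.data();
    return cache_.try_emplace(key, FakeModule{std::string(hsaco)})
        .first->second;
  }

 private:
  std::unordered_map<const void*, FakeModule> cache_;
};

// New scheme: the cache key is a hash of the HSACO bytes.
class ContentKeyedCache {
 public:
  const FakeModule& GetOrLoad(std::string_view hsaco) {
    return cache_
        .try_emplace(ContentHash(hsaco), FakeModule{std::string(hsaco)})
        .first->second;
  }

 private:
  std::unordered_map<uint64_t, FakeModule> cache_;
};
```

Overwriting a buffer in place is a simple way to imitate "freed, then reallocated at the same address": load `"kernel_v1"` through both caches, change the last byte to get `"kernel_v2"` at the same address, and load again. The pointer-keyed cache hands back the module built from `"kernel_v1"`, while the content-keyed cache misses and loads the new bytes.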

claude · 2026-04-06T03:36:51Z

xla/stream_executor/rocm/rocm_executor.cc

+// entries when an HSACO buffer is freed and a new one is allocated at the
+// same address (pointer-reuse cache collision).
+ModuleHandle HsacoModuleHandle(const char* hsaco, size_t size) {
+  auto hash = absl::HashOf(absl::string_view(hsaco, size));
+  // Ensure hash is never 0 (ModuleHandle treats nullptr as invalid)
+  return ModuleHandle{reinterpret_cast<const void*>(hash | 1)};


Hash collision risk: HsacoModuleHandle maps arbitrary-length HSACO binaries to a single size_t hash stuffed into a void*. If two different HSACO binaries produce the same hash, the cache silently returns the wrong module — replacing one collision domain (pointer reuse) with another that may be harder to diagnose.

In practice, 64-bit absl::HashOf collisions are astronomically unlikely for distinct HSACO binaries within a single process, so this is probably fine — but it would be worth adding a brief comment acknowledging the trade-off (e.g., "collision is negligible for the number of distinct kernels a single process loads").

Also, reinterpret_cast<const void*>(hash | 1) is technically implementation-defined by the C++ standard. It works on x86-64/aarch64, but a cleaner long-term design would be to change ModuleHandle to hold a uint64_t or define a dedicated key type.
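
A dedicated key type along those lines might look like the minimal sketch below. `HsacoCacheKey`, `HsacoCacheKeyHash` and `HsacoModuleCache` are hypothetical names that do not exist in XLA, and an `int` stands in for whatever the cache actually stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Hypothetical key type: carries the 64-bit content hash as an integer, so
// no integer-to-pointer cast and no reserved "null" value are needed.
struct HsacoCacheKey {
  uint64_t content_hash;

  friend bool operator==(const HsacoCacheKey& a, const HsacoCacheKey& b) {
    return a.content_hash == b.content_hash;
  }
};

// Hash functor so the key can be used in a standard unordered container.
struct HsacoCacheKeyHash {
  size_t operator()(const HsacoCacheKey& key) const {
    return std::hash<uint64_t>{}(key.content_hash);
  }
};

// Illustrative cache type; int stands in for a loaded module.
using HsacoModuleCache =
    std::unordered_map<HsacoCacheKey, int, HsacoCacheKeyHash>;
```

Because validity is no longer encoded in the key value, a hash of 0 is an ordinary key and the `| 1` adjustment, which maps hashes 0 and 1 to the same handle, is not needed.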

Not a concern: with 2^64 possible hash values, collisions are not an issue. We don't have anywhere near that many kernels.

Agreed — 2^64 makes collisions negligible here, thanks for confirming. The reinterpret_cast<const void*>(hash | 1) concern was a minor style note for future cleanup; fine to leave as-is for this PR.
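
To put a rough number on "negligible", the birthday-style union bound can be computed directly. This is a back-of-the-envelope sketch that assumes the hash behaves like an ideal uniform hash, which is an assumption about absl::HashOf and not a documented guarantee; `CollisionProbabilityBound` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Upper bound on the probability that any two of `n` distinct inputs share
// a value under an ideal `bits`-bit hash (union bound over all pairs):
//   P <= n * (n - 1) / 2 / 2^bits
inline double CollisionProbabilityBound(double n, int bits) {
  const double pairs = n * (n - 1.0) / 2.0;
  return pairs / std::ldexp(1.0, bits);  // ldexp(1.0, bits) == 2^bits
}
```

Since `hash | 1` discards the lowest bit, 63 effective bits is the more conservative figure. Even for an illustrative one million distinct binaries in a process, `CollisionProbabilityBound(1e6, 63)` stays below 1e-7.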

claude · 2026-04-06T03:36:53Z

xla/stream_executor/rocm/BUILD

        "@com_google_absl//absl/base:core_headers",
        "@com_google_absl//absl/container:flat_hash_map",
        "@com_google_absl//absl/container:flat_hash_set",
+        "@com_google_absl//absl/hash",


nit: absl/hash sorts after absl/container:inlined_vector alphabetically. This should be placed after line 196 to keep deps sorted and avoid a buildifier lint failure.

Suggested change

-        "@com_google_absl//absl/hash",
+        "@com_google_absl//absl/container:inlined_vector",
+        "@com_google_absl//absl/hash",

Resolved ✓ — addressed in this revision.

claude · 2026-04-06T03:37:00Z

xla/stream_executor/rocm/rocm_executor.cc

+  // Check for pre-existing HIP errors before the call. On ROCm 7+
+  // hipGetLastError() is sticky: successful calls do not clear it,
+  // so a stale error from a prior operation would produce confusing
+  // diagnostics if we proceeded.
+  hipError_t pre_err = ::hipPeekAtLastError();
+  if (pre_err != hipSuccess) {
+    return absl::InternalError(
+        absl::StrCat("There was a HIP error before calling "
+                     "hipModuleGetFunction for kernel '",
+                     kernel_name, "': ", ToString(pre_err)));
+  }


Two concerns with this guard:

Comment/code mismatch: The comment says "hipGetLastError() is sticky" but the code calls hipPeekAtLastError(). These are different APIs — hipGetLastError clears the error, hipPeekAtLastError does not. The comment should reference hipPeekAtLastError and explain why Peek was chosen over Get.

False-positive risk: If ROCm 7+ truly never clears sticky errors on success, then any prior handled error anywhere in the process will cause this guard to fire and prevent all subsequent kernel launches for the process lifetime. This seems overly aggressive. Consider either:

Logging at WARNING and proceeding (if the intent is debugging aid), or

Using hipGetLastError() (which clears) so only unhandled errors are caught, or

Documenting under what circumstances a stale non-zero value from hipPeekAtLastError should not be treated as fatal.

Added a better comment. We should not drain the HIP last-error state.

Thanks for improving the comment — the hipPeekAtLastError vs hipGetLastError mismatch is fixed now.

One concern remains: since hipPeekAtLastError() does not clear the error state, if any prior HIP call in the process ever failed (even if it was correctly handled), this guard will return InternalError on every subsequent kernel launch for the lifetime of the process. This makes the guard a potential source of false-positive failures.

Consider either:

Downgrading to LOG(WARNING) instead of returning a hard error, or

Adding a brief comment explaining why false positives are not expected in practice (e.g., XLA never handles HIP errors and continues).
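
The trade-off discussed in this thread can be illustrated with a toy model of a last-error slot. This is not the HIP runtime: `FakeRuntime`, `FakeError` and both guard functions are hypothetical, and the model only mirrors the difference stated above, namely that a Peek-style call reports the error without clearing it while a Get-style call reports and resets it.

```cpp
#include <cassert>

// Toy error codes; not the real hipError_t values.
enum class FakeError { kSuccess, kNotFound };

// Toy model of a runtime's last-error slot (NOT the HIP runtime).
class FakeRuntime {
 public:
  // Simulates some earlier call failing and recording its error.
  void Fail(FakeError e) { last_error_ = e; }

  // Modeled on hipPeekAtLastError: report without clearing.
  FakeError PeekAtLastError() const { return last_error_; }

  // Modeled on hipGetLastError: report and reset to success.
  FakeError GetLastError() {
    FakeError e = last_error_;
    last_error_ = FakeError::kSuccess;
    return e;
  }

 private:
  FakeError last_error_ = FakeError::kSuccess;
};

// Guard shaped like the one in the diff: refuse whenever Peek reports an error.
inline bool PeekGuardAllowsLaunch(const FakeRuntime& rt) {
  return rt.PeekAtLastError() == FakeError::kSuccess;
}

// Alternative raised in review: drain the slot while checking it.
inline bool DrainingGuardAllowsLaunch(FakeRuntime& rt) {
  return rt.GetLastError() == FakeError::kSuccess;
}
```

In this model, after a single `Fail`, the Peek-based guard refuses every later launch until something else clears the slot, whereas the draining guard refuses once and then recovers. The draining variant has its own cost, as the author's reply points out: it consumes an error that other code may still want to observe.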

claude · 2026-04-06T03:37:02Z

xla/stream_executor/rocm/rocm_executor.cc

  if (spec.has_cuda_cubin_in_memory()) {
-    const char* hsaco = reinterpret_cast<const char*>(
-        spec.cuda_cubin_in_memory()->cubin_bytes.data());
+    const auto& cubin = spec.cuda_cubin_in_memory()->cubin_bytes;


Minor: in the LoadModule path (line 774), auto cubin = spec.cuda_cubin_in_memory() copies the span, while here you use const auto&. The inconsistency is harmless (spans are lightweight) but worth keeping uniform for readability.

Resolved ✓ — addressed in this revision.

claude · 2026-04-06T03:37:16Z

Claude Review Summary

Overall: Good fix for a real cache-correctness bug (pointer-reuse causing stale HSACO module lookups). The content-hash approach is a solid improvement. A few items worth addressing:

hipPeekAtLastError guard — comment/code mismatch (hipGetLastError vs hipPeekAtLastError) and risk of false positives from sticky errors blocking all subsequent kernel launches.
BUILD deps ordering — absl/hash is inserted out of alphabetical order; will likely fail buildifier lint.
Hash collision semantics — worth a brief comment acknowledging the (negligible) collision risk and the reinterpret_cast trade-off.
Stale comment (not in diff) — line 769 still says "We store the pointer to the HSACO binary as ModuleHandle::id()" which no longer reflects the new hash-based approach.
No tests — a unit test for HsacoModuleHandle (different content → different handles, same content → same handle) would guard against regressions.
CUDA parity — cuda_executor.cc uses the same pointer-based ModuleHandle{cubin} pattern; worth investigating as a follow-up.

See inline comments for details.
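
As a rough idea of what the suggested regression test could check, here is a standalone sketch using plain asserts. It does not call the real HsacoModuleHandle: `FakeHsacoModuleHandle` is a hypothetical stand-in that returns a raw `const void*` id, and FNV-1a replaces absl::HashOf so the block has no dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for absl::HashOf (assumption): 64-bit FNV-1a.
inline uint64_t Fnv1a64(const char* data, size_t size) {
  uint64_t h = 14695981039346656037ULL;
  for (size_t i = 0; i < size; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 1099511628211ULL;
  }
  return h;
}

// Stand-in for HsacoModuleHandle from the diff, returning only the raw id.
inline const void* FakeHsacoModuleHandle(const char* hsaco, size_t size) {
  const uint64_t hash = Fnv1a64(hsaco, size);
  // As in the diff, force the low bit so the id is never zero.
  return reinterpret_cast<const void*>(static_cast<uintptr_t>(hash | 1));
}

// The three properties the review asks a test to pin down.
inline void CheckHsacoHandleProperties() {
  std::vector<char> a = {'h', 's', 'a', 'c', 'o', '1'};
  std::vector<char> copy_of_a = a;  // same bytes, different address

  const void* handle_a = FakeHsacoModuleHandle(a.data(), a.size());
  assert(handle_a != nullptr);  // never the invalid handle

  // Same content at a different address maps to the same handle.
  assert(handle_a ==
         FakeHsacoModuleHandle(copy_of_a.data(), copy_of_a.size()));

  // Different content at the same address maps to a different handle.
  a.back() = '2';
  assert(handle_a != FakeHsacoModuleHandle(a.data(), a.size()));
}
```

The last assertion is the one that pins the original bug: the buffer address is unchanged, only the bytes differ.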

claude · 2026-04-06T03:54:58Z

xla/stream_executor/rocm/rocm_executor.cc

 #include "rocm/rocm_config.h"
+#include <unistd.h>


nit: #include <unistd.h> is a C system header and should be grouped with the other system headers near the top of the file, not between third-party (rocm/) and project (xla/) includes. Per the Google C++ Style Guide include ordering, the order is: main header, C system headers, C++ standard library, other libraries, project headers.

Suggested change

#include "rocm/rocm_config.h"

#include <unistd.h>

#include "rocm/rocm_config.h"

#include "xla/stream_executor/activate_context.h"

Move #include <unistd.h> back up with the C system headers (e.g., after line 17 where it was originally).

claude · 2026-04-06T03:56:00Z

Re-review Summary

Good progress — 2 of 4 original findings are resolved (BUILD dep sorting, const auto& consistency). Thread replies posted on the remaining 2 inline threads:

hipPeekAtLastError guard: Comment fix looks good. Flagged remaining concern — hipPeekAtLastError doesn't clear error state, so prior handled errors could cause false-positive hard failures on subsequent kernel launches. Consider LOG(WARNING) or a comment explaining why this is safe.
Hash collision / reinterpret_cast: Acknowledged — collision risk is negligible at 2^64. Minor style note only.

New finding: #include <unistd.h> moved to wrong include group (posted inline).

No test coverage for the new hashing logic — consider adding a regression test.


magaonka-amd added the claude-review label Apr 6, 2026

magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch from f46cee2 to d364132 Apr 6, 2026 03:34

claude bot reviewed Apr 6, 2026

github-actions bot removed the claude-review label Apr 6, 2026

magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch 2 times, most recently from 247f907 to 8b84ec7 Apr 6, 2026 03:46

magaonka-amd added the claude-review label Apr 6, 2026

claude bot reviewed Apr 6, 2026

github-actions bot removed the claude-review label Apr 6, 2026

magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch from 8b84ec7 to 8ba8901 Apr 6, 2026 04:01

magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch from 8ba8901 to b4bca56 Apr 6, 2026 04:04

magaonka-amd closed this Apr 7, 2026