[ROCm] Fix HSACO module cache using pointer-based key causing stale lookups #779

Closed
magaonka-amd wants to merge 1 commit into ROCm:main from magaonka-amd:fix/rocm-hsaco-cache-content-hash

Conversation

@magaonka-amd

Motivation

  • The in_memory_modules_ cache in RocmExecutor used the raw HSACO data
    pointer as the cache key. When a CustomKernelThunk's owned HSACO buffer
    was freed and a new buffer was allocated at the same address, the cache
    returned a stale module loaded from different binary content, causing
    hipModuleGetFunction to fail with hipErrorNotFound.
  • Replace pointer-based ModuleHandle key with a content hash of the HSACO
    bytes (absl::HashOf). Same content still hits the cache; different
    content at a reused address correctly misses.
  • Add hipPeekAtLastError guard before hipModuleGetFunction to surface
    pre-existing sticky errors early with clear diagnostics instead of
    producing confusing failures.
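
A minimal sketch of the failure mode and the fix described above, not the RocmExecutor code. `FakeModule`, `PointerKeyedCache` and `ContentKeyedCache` are hypothetical names, and `std::hash` stands in for `absl::HashOf` so the example has no external dependencies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical stand-in for a loaded HIP module; it only remembers which
// bytes it was "loaded" from.
struct FakeModule {
  std::string loaded_from;
};

// Keyed by buffer address, as in the code before this PR.
class PointerKeyedCache {
 public:
  const FakeModule& GetOrLoad(const char* data, std::size_t size) {
    assert(data != nullptr);
    auto [it, inserted] = modules_.try_emplace(
        static_cast<const void*>(data), FakeModule{std::string(data, size)});
    return it->second;  // Stale if the address now holds different bytes.
  }

 private:
  std::unordered_map<const void*, FakeModule> modules_;
};

// Keyed by a hash of the bytes (std::hash as a stand-in for absl::HashOf).
class ContentKeyedCache {
 public:
  const FakeModule& GetOrLoad(const char* data, std::size_t size) {
    assert(data != nullptr);
    std::string_view bytes(data, size);
    std::uint64_t key = std::hash<std::string_view>{}(bytes);
    auto [it, inserted] =
        modules_.try_emplace(key, FakeModule{std::string(bytes)});
    return it->second;
  }

 private:
  std::unordered_map<std::uint64_t, FakeModule> modules_;
};
```

Overwriting a buffer in place mimics "freed and reallocated at the same address": the pointer-keyed cache keeps returning the module built from the old bytes, while the content-keyed cache misses and loads the new ones.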

magaonka-amd added the claude-review label (Request a Claude AI code review for this PR) Apr 6, 2026
magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch from f46cee2 to d364132 on April 6, 2026 03:34
Comment on lines +214 to +219
// entries when an HSACO buffer is freed and a new one is allocated at the
// same address (pointer-reuse cache collision).
ModuleHandle HsacoModuleHandle(const char* hsaco, size_t size) {
  auto hash = absl::HashOf(absl::string_view(hsaco, size));
  // Ensure hash is never 0 (ModuleHandle treats nullptr as invalid)
  return ModuleHandle{reinterpret_cast<const void*>(hash | 1)};

Hash collision risk: HsacoModuleHandle maps arbitrary-length HSACO binaries to a single size_t hash stuffed into a void*. If two different HSACO binaries produce the same hash, the cache silently returns the wrong module — replacing one collision domain (pointer reuse) with another that may be harder to diagnose.

In practice, 64-bit absl::HashOf collisions are astronomically unlikely for distinct HSACO binaries within a single process, so this is probably fine — but it would be worth adding a brief comment acknowledging the trade-off (e.g., "collision is negligible for the number of distinct kernels a single process loads").

Also, reinterpret_cast<const void*>(hash | 1) is technically implementation-defined by the C++ standard. It works on x86-64/aarch64, but a cleaner long-term design would be to change ModuleHandle to hold a uint64_t or define a dedicated key type.
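
One possible shape for the "dedicated key type" mentioned above, as a sketch only. `HsacoKey`, `HsacoKeyHasher` and `MakeHsacoKey` are hypothetical names that do not exist in XLA, and `std::hash` again stands in for `absl::HashOf`. Storing the hash as an integer avoids the integer-to-pointer cast, and keeping the byte length alongside it is an extra assumption that cheaply separates blobs of different sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <unordered_map>

// Hypothetical cache key: content hash plus byte length, no pointer cast.
struct HsacoKey {
  std::uint64_t content_hash;
  std::size_t size;

  friend bool operator==(const HsacoKey& a, const HsacoKey& b) {
    return a.content_hash == b.content_hash && a.size == b.size;
  }
};

// Hasher so HsacoKey can be used in std::unordered_map.
struct HsacoKeyHasher {
  std::size_t operator()(const HsacoKey& k) const {
    // Simple field mixing; the exact constant is not important for a sketch.
    return static_cast<std::size_t>(
        k.content_hash ^ (k.size * 0x9e3779b97f4a7c15ULL));
  }
};

// Builds a key from the binary's bytes.
inline HsacoKey MakeHsacoKey(std::string_view bytes) {
  return HsacoKey{std::hash<std::string_view>{}(bytes), bytes.size()};
}
```

A cache declared as `std::unordered_map<HsacoKey, Module, HsacoKeyHasher>` (with `Module` being whatever the executor stores) would then need no "hash must not be 0" workaround.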

Author

Not a concern: the hash has 2^64 possible values, and we don't have anywhere near that many kernels.


Agreed — 2^64 makes collisions negligible here, thanks for confirming. The reinterpret_cast<const void*>(hash | 1) concern was a minor style note for future cleanup; fine to leave as-is for this PR.
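
To put a rough number on "negligible": the relevant quantity is the birthday bound, not kernel count divided by 2^64. The sketch below assumes hash values are uniformly distributed, which is an idealization for a non-cryptographic hash such as `absl::HashOf`. Since `hash | 1` forces the low bit, 63 bits remain free on a 64-bit platform. `ApproxCollisionProbability` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Birthday-bound approximation: with n distinct inputs hashed uniformly into
// 2^bits values, P(at least one collision) is about n * (n - 1) / 2^(bits + 1).
// Only meaningful while the result is much smaller than 1.
inline double ApproxCollisionProbability(double n, int bits) {
  assert(n >= 0.0 && bits > 0);
  return n * (n - 1.0) / std::ldexp(1.0, bits + 1);
}
```

Under these assumptions, `ApproxCollisionProbability(1e6, 63)` is on the order of 1e-8, so even a million distinct kernels in one process stays far from a realistic collision.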

"@com_google_absl//absl/base:core_headers",
"@com_google_absl//absl/container:flat_hash_map",
"@com_google_absl//absl/container:flat_hash_set",
"@com_google_absl//absl/hash",

nit: absl/hash sorts after absl/container:inlined_vector alphabetically. This should be placed after line 196 to keep deps sorted and avoid a buildifier lint failure.

Suggested change
- "@com_google_absl//absl/hash",
+ "@com_google_absl//absl/container:inlined_vector",
+ "@com_google_absl//absl/hash",

Author

done


Resolved ✓ — addressed in this revision.

Comment on lines +180 to +190
// Check for pre-existing HIP errors before the call. On ROCm 7+
// hipGetLastError() is sticky: successful calls do not clear it,
// so a stale error from a prior operation would produce confusing
// diagnostics if we proceeded.
hipError_t pre_err = ::hipPeekAtLastError();
if (pre_err != hipSuccess) {
  return absl::InternalError(
      absl::StrCat("There was a HIP error before calling "
                   "hipModuleGetFunction for kernel '",
                   kernel_name, "': ", ToString(pre_err)));
}

Two concerns with this guard:

  1. Comment/code mismatch: The comment says "hipGetLastError() is sticky" but the code calls hipPeekAtLastError(). These are different APIs — hipGetLastError clears the error, hipPeekAtLastError does not. The comment should reference hipPeekAtLastError and explain why Peek was chosen over Get.

  2. False-positive risk: If ROCm 7+ truly never clears sticky errors on success, then any prior handled error anywhere in the process will cause this guard to fire and prevent all subsequent kernel launches for the process lifetime. This seems overly aggressive. Consider either:

    • Logging at WARNING and proceeding (if the intent is debugging aid), or
    • Using hipGetLastError() (which clears) so only unhandled errors are caught, or
    • Documenting under what circumstances a stale non-zero value from hipPeekAtLastError should not be treated as fatal.

Author

Added a better comment. We should not drain the HIP last-error state, which is what hipGetLastError() would do.


Thanks for improving the comment — the hipPeekAtLastError vs hipGetLastError mismatch is fixed now.

One concern remains: since hipPeekAtLastError() does not clear the error state, if any prior HIP call in the process ever failed (even if it was correctly handled), this guard will return InternalError on every subsequent kernel launch for the lifetime of the process. This makes the guard a potential source of false-positive failures.

Consider either:

  • Downgrading to LOG(WARNING) instead of returning a hard error, or
  • Adding a brief comment explaining why false positives are not expected in practice (e.g., XLA never handles HIP errors and continues).
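
The false-positive concern can be illustrated with a toy model of a "last error" slot. This is not the HIP API: `FakeRuntime`, `FakeError` and `GuardAllowsLaunch` are hypothetical, and the model assumes (as the PR's comment does for ROCm 7+) that successful calls never reset the slot.

```cpp
#include <cassert>

// Hypothetical error codes for the toy runtime.
enum class FakeError { kSuccess, kNotFound };

// Toy model of a runtime that remembers the most recent failure.
class FakeRuntime {
 public:
  void RecordFailure(FakeError e) { last_error_ = e; }

  // "Peek": report the stored error and leave it in place.
  FakeError PeekAtLastError() const { return last_error_; }

  // "Get": report the stored error and reset the slot to success.
  FakeError GetLastError() {
    FakeError e = last_error_;
    last_error_ = FakeError::kSuccess;
    return e;
  }

 private:
  FakeError last_error_ = FakeError::kSuccess;
};

// Guard in the style of the PR: refuse to proceed while an error is visible.
inline bool GuardAllowsLaunch(const FakeRuntime& rt) {
  return rt.PeekAtLastError() == FakeError::kSuccess;
}
```

In this model, once `RecordFailure` has been called the guard keeps rejecting every launch until something calls `GetLastError()`, which is the behavior the review flags and the reason the author's "do not drain" choice needs a justification in the code comment.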

if (spec.has_cuda_cubin_in_memory()) {
  const char* hsaco = reinterpret_cast<const char*>(
      spec.cuda_cubin_in_memory()->cubin_bytes.data());
  const auto& cubin = spec.cuda_cubin_in_memory()->cubin_bytes;

Minor: in the LoadModule path (line 774), auto cubin = spec.cuda_cubin_in_memory() copies the span, while here you use const auto&. The inconsistency is harmless (spans are lightweight) but worth keeping uniform for readability.

Author

done


Resolved ✓ — addressed in this revision.

@claude

claude bot commented Apr 6, 2026

Claude Review Summary

Overall: Good fix for a real cache-correctness bug (pointer-reuse causing stale HSACO module lookups). The content-hash approach is a solid improvement. A few items worth addressing:

  • hipPeekAtLastError guard — comment/code mismatch (hipGetLastError vs hipPeekAtLastError) and risk of false positives from sticky errors blocking all subsequent kernel launches.
  • BUILD deps ordering: absl/hash is inserted out of alphabetical order; will likely fail buildifier lint.
  • Hash collision semantics — worth a brief comment acknowledging the (negligible) collision risk and the reinterpret_cast trade-off.
  • Stale comment (not in diff) — line 769 still says "We store the pointer to the HSACO binary as ModuleHandle::id()" which no longer reflects the new hash-based approach.
  • No tests — a unit test for HsacoModuleHandle (different content → different handles, same content → same handle) would guard against regressions.
  • CUDA parity: cuda_executor.cc uses the same pointer-based ModuleHandle{cubin} pattern; worth investigating as a follow-up.

See inline comments for details.
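
A sketch of the regression check suggested in the "No tests" item, written against a stand-in rather than the real function. `FakeHsacoHandle` mirrors the PR's `HsacoModuleHandle` but uses `std::hash` instead of `absl::HashOf` and returns an integer instead of a `ModuleHandle`; `HandleBehavesLikeContentKey` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <string_view>

// Stand-in for the PR's HsacoModuleHandle (assumption: std::hash replaces
// absl::HashOf, and the handle is modeled as an integer).
inline std::uintptr_t FakeHsacoHandle(std::string_view hsaco) {
  std::uintptr_t hash = std::hash<std::string_view>{}(hsaco);
  return hash | 1;  // Never 0, mirroring the "nullptr is invalid" rule.
}

// Returns true when the properties named in the review hold for two blobs.
inline bool HandleBehavesLikeContentKey(std::string_view a,
                                        std::string_view b) {
  // Same bytes stored at a different address must give the same handle.
  std::string copy_of_a(a);
  bool same_content_same_handle =
      FakeHsacoHandle(a) == FakeHsacoHandle(copy_of_a);
  // Different bytes are expected (not guaranteed) to give different handles.
  bool different_content_different_handle =
      (a == b) || FakeHsacoHandle(a) != FakeHsacoHandle(b);
  bool never_null = FakeHsacoHandle(a) != 0 && FakeHsacoHandle(b) != 0;
  return same_content_same_handle && different_content_different_handle &&
         never_null;
}
```

A real test would call `HsacoModuleHandle` directly and compare `ModuleHandle` values; the "different content" check is probabilistic by nature, so fixed, known-distinct inputs are the safer choice.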

github-actions bot removed the claude-review label (Request a Claude AI code review for this PR) Apr 6, 2026
magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch 2 times, most recently from 247f907 to 8b84ec7 on April 6, 2026 03:46
magaonka-amd added the claude-review label (Request a Claude AI code review for this PR) Apr 6, 2026
Comment on lines 48 to +49
#include "rocm/rocm_config.h"
#include <unistd.h>

nit: #include <unistd.h> is a C system header and should be grouped with the other system headers near the top of the file, not between third-party (rocm/) and project (xla/) includes. Per the Google C++ Style Guide include ordering, the order is: main header, C system headers, C++ standard library, other libraries, project headers.

Suggested change
- #include "rocm/rocm_config.h"
- #include <unistd.h>
+ #include "rocm/rocm_config.h"
+ #include "xla/stream_executor/activate_context.h"

Move #include <unistd.h> back up with the C system headers (e.g., after line 17 where it was originally).

@claude

claude bot commented Apr 6, 2026

Re-review Summary

Good progress — 2 of 4 original findings are resolved (BUILD dep sorting, const auto& consistency). Thread replies posted on the remaining 2 inline threads:

  • hipPeekAtLastError guard: Comment fix looks good. Flagged remaining concern — hipPeekAtLastError doesn't clear error state, so prior handled errors could cause false-positive hard failures on subsequent kernel launches. Consider LOG(WARNING) or a comment explaining why this is safe.
  • Hash collision / reinterpret_cast: Acknowledged — collision risk is negligible at 2^64. Minor style note only.

New finding: #include <unistd.h> moved to wrong include group (posted inline).

No test coverage for the new hashing logic — consider adding a regression test.

github-actions bot removed the claude-review label (Request a Claude AI code review for this PR) Apr 6, 2026
magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch from 8b84ec7 to 8ba8901 on April 6, 2026 04:01
magaonka-amd force-pushed the fix/rocm-hsaco-cache-content-hash branch from 8ba8901 to b4bca56 on April 6, 2026 04:04