[ROCm] Fix HSACO module cache using pointer-based key causing stale lookups#779
[ROCm] Fix HSACO module cache using pointer-based key causing stale lookups#779magaonka-amd wants to merge 1 commit intoROCm:mainfrom
Conversation
f46cee2 to
d364132
Compare
| // entries when an HSACO buffer is freed and a new one is allocated at the | ||
| // same address (pointer-reuse cache collision). | ||
| ModuleHandle HsacoModuleHandle(const char* hsaco, size_t size) { | ||
| auto hash = absl::HashOf(absl::string_view(hsaco, size)); | ||
| // Ensure hash is never 0 (ModuleHandle treats nullptr as invalid) | ||
| return ModuleHandle{reinterpret_cast<const void*>(hash | 1)}; |
There was a problem hiding this comment.
Hash collision risk: HsacoModuleHandle maps arbitrary-length HSACO binaries to a single size_t hash stuffed into a void*. If two different HSACO binaries produce the same hash, the cache silently returns the wrong module — replacing one collision domain (pointer reuse) with another that may be harder to diagnose.
In practice, 64-bit absl::HashOf collisions are astronomically unlikely for distinct HSACO binaries within a single process, so this is probably fine — but it would be worth adding a brief comment acknowledging the trade-off (e.g., "collision is negligible for the number of distinct kernels a single process loads").
Also, reinterpret_cast<const void*>(hash | 1) is technically implementation-defined by the C++ standard. It works on x86-64/aarch64, but a cleaner long-term design would be to change ModuleHandle to hold a uint64_t or define a dedicated key type.
There was a problem hiding this comment.
not a conern 2^64 options we have for hash collision we dont have that many kernels !!
There was a problem hiding this comment.
Agreed — 2^64 makes collisions negligible here, thanks for confirming. The reinterpret_cast<const void*>(hash | 1) concern was a minor style note for future cleanup; fine to leave as-is for this PR.
xla/stream_executor/rocm/BUILD
Outdated
| "@com_google_absl//absl/base:core_headers", | ||
| "@com_google_absl//absl/container:flat_hash_map", | ||
| "@com_google_absl//absl/container:flat_hash_set", | ||
| "@com_google_absl//absl/hash", |
There was a problem hiding this comment.
nit: absl/hash sorts after absl/container:inlined_vector alphabetically. This should be placed after line 196 to keep deps sorted and avoid a buildifier lint failure.
| "@com_google_absl//absl/hash", | |
| "@com_google_absl//absl/container:inlined_vector", | |
| "@com_google_absl//absl/hash", |
There was a problem hiding this comment.
Resolved ✓ — addressed in this revision.
| // Check for pre-existing HIP errors before the call. On ROCm 7+ | ||
| // hipGetLastError() is sticky: successful calls do not clear it, | ||
| // so a stale error from a prior operation would produce confusing | ||
| // diagnostics if we proceeded. | ||
| hipError_t pre_err = ::hipPeekAtLastError(); | ||
| if (pre_err != hipSuccess) { | ||
| return absl::InternalError( | ||
| absl::StrCat("There was a HIP error before calling " | ||
| "hipModuleGetFunction for kernel '", | ||
| kernel_name, "': ", ToString(pre_err))); | ||
| } |
There was a problem hiding this comment.
Two concerns with this guard:
-
Comment/code mismatch: The comment says "hipGetLastError() is sticky" but the code calls
hipPeekAtLastError(). These are different APIs —hipGetLastErrorclears the error,hipPeekAtLastErrordoes not. The comment should referencehipPeekAtLastErrorand explain why Peek was chosen over Get. -
False-positive risk: If ROCm 7+ truly never clears sticky errors on success, then any prior handled error anywhere in the process will cause this guard to fire and prevent all subsequent kernel launches for the process lifetime. This seems overly aggressive. Consider either:
- Logging at
WARNINGand proceeding (if the intent is debugging aid), or - Using
hipGetLastError()(which clears) so only unhandled errors are caught, or - Documenting under what circumstances a stale non-zero value from
hipPeekAtLastErrorshould not be treated as fatal.
- Logging at
There was a problem hiding this comment.
better comment added , we should not drain the hipLastError()
There was a problem hiding this comment.
Thanks for improving the comment — the hipPeekAtLastError vs hipGetLastError mismatch is fixed now.
One concern remains: since hipPeekAtLastError() does not clear the error state, if any prior HIP call in the process ever failed (even if it was correctly handled), this guard will return InternalError on every subsequent kernel launch for the lifetime of the process. This makes the guard a potential source of false-positive failures.
Consider either:
- Downgrading to
LOG(WARNING)instead of returning a hard error, or - Adding a brief comment explaining why false positives are not expected in practice (e.g., XLA never handles HIP errors and continues).
| if (spec.has_cuda_cubin_in_memory()) { | ||
| const char* hsaco = reinterpret_cast<const char*>( | ||
| spec.cuda_cubin_in_memory()->cubin_bytes.data()); | ||
| const auto& cubin = spec.cuda_cubin_in_memory()->cubin_bytes; |
There was a problem hiding this comment.
Minor: in the LoadModule path (line 774), auto cubin = spec.cuda_cubin_in_memory() copies the span, while here you use const auto&. The inconsistency is harmless (spans are lightweight) but worth keeping uniform for readability.
There was a problem hiding this comment.
Resolved ✓ — addressed in this revision.
Claude Review SummaryOverall: Good fix for a real cache-correctness bug (pointer-reuse causing stale HSACO module lookups). The content-hash approach is a solid improvement. A few items worth addressing:
See inline comments for details. |
247f907 to
8b84ec7
Compare
| #include "rocm/rocm_config.h" | ||
| #include <unistd.h> |
There was a problem hiding this comment.
nit: #include <unistd.h> is a C system header and should be grouped with the other system headers near the top of the file, not between third-party (rocm/) and project (xla/) includes. Per the Google C++ Style Guide include ordering, the order is: main header, C system headers, C++ standard library, other libraries, project headers.
| #include "rocm/rocm_config.h" | |
| #include <unistd.h> | |
| #include "rocm/rocm_config.h" | |
| #include "xla/stream_executor/activate_context.h" |
Move #include <unistd.h> back up with the C system headers (e.g., after line 17 where it was originally).
Re-review SummaryGood progress — 2 of 4 original findings are resolved (BUILD dep sorting,
New finding: No test coverage for the new hashing logic — consider adding a regression test. |
8b84ec7 to
8ba8901
Compare
…ookups - The in_memory_modules_ cache in RocmExecutor used the raw HSACO data pointer as the cache key. When a CustomKernelThunk's owned HSACO buffer was freed and a new buffer was allocated at the same address, the cache returned a stale module loaded from different binary content, causing hipModuleGetFunction to fail with hipErrorNotFound. - Replace pointer-based ModuleHandle key with a content hash of the HSACO bytes (absl::HashOf). Same content still hits the cache; different content at a reused address correctly misses. - Add hipPeekAtLastError guard before hipModuleGetFunction to surface pre-existing sticky errors early with clear diagnostics instead of producing confusing failures.
8ba8901 to
b4bca56
Compare
Motivation
pointer as the cache key. When a CustomKernelThunk's owned HSACO buffer
was freed and a new buffer was allocated at the same address, the cache
returned a stale module loaded from different binary content, causing
hipModuleGetFunction to fail with hipErrorNotFound.
bytes (absl::HashOf). Same content still hits the cache; different
content at a reused address correctly misses.
pre-existing sticky errors early with clear diagnostics instead of
producing confusing failures.