AiterAsmKernel: add init-time sanity checks for .co registration #3127
Two checks added to AiterAsmKernelFast::init(). Both run once per kernel variant at construction time, not on the launch hot path:
1. validate_hsaco_lds(): scans the raw .co ELF blob for group_segment_fixed_size via msgpack decode and compares it against the device LDS limit (hipDeviceGetAttribute). This gives an actionable error before __hipRegisterFatBinary is called. For example, on gfx942 (MI300X) the 64 KB limit would reject a .co built for gfx950 (MI355X, ~160 KB).
2. Registration probe: __hipRegisterFunction returns void, so a hipGetFuncBySymbol probe is used to detect silent rejection by the runtime (LDS limit exceeded, arch mismatch, corrupted binary, etc.).
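To make the LDS check concrete, here is a minimal C++17 sketch of a conservative byte scanner in the spirit of validate_hsaco_lds(). It is not the code from this PR: the helper names (decode_msgpack_uint, max_declared_lds, check_lds) are hypothetical, the key is assumed to be stored as the msgpack fixstr ".group_segment_fixed_size" (the leading dot follows the AMDGPU code object V3+ metadata convention), and the device limit is passed in as a plain integer rather than queried through hipDeviceGetAttribute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

// Hypothetical helper: decode one msgpack unsigned integer starting at blob[pos].
// Any encoding it does not recognize yields nullopt (the conservative choice).
inline std::optional<std::uint64_t> decode_msgpack_uint(const unsigned char* blob,
                                                        std::size_t size,
                                                        std::size_t pos) {
    if (pos >= size) return std::nullopt;
    const unsigned char tag = blob[pos];
    if (tag <= 0x7f) return std::uint64_t{tag};  // positive fixint
    std::size_t width = 0;
    switch (tag) {
        case 0xcc: width = 1; break;  // uint 8
        case 0xcd: width = 2; break;  // uint 16
        case 0xce: width = 4; break;  // uint 32
        case 0xcf: width = 8; break;  // uint 64
        default: return std::nullopt; // not an unsigned integer: skip
    }
    if (size - pos - 1 < width) return std::nullopt;  // truncated payload
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < width; ++i)
        value = (value << 8) | blob[pos + 1 + i];     // msgpack integers are big-endian
    return value;
}

// Hypothetical scanner: largest group_segment_fixed_size declared anywhere in the
// blob, or nullopt if the key never appears with a decodable value.
inline std::optional<std::uint64_t> max_declared_lds(const unsigned char* blob,
                                                     std::size_t size) {
    assert(blob != nullptr || size == 0);
    static const char kKey[] = ".group_segment_fixed_size";  // assumed key spelling
    const std::size_t key_len = sizeof(kKey) - 1;
    const unsigned char header = static_cast<unsigned char>(0xa0 | key_len);  // fixstr tag
    std::optional<std::uint64_t> best;
    for (std::size_t i = 0; i + 1 + key_len <= size; ++i) {
        if (blob[i] != header) continue;
        if (std::memcmp(blob + i + 1, kKey, key_len) != 0) continue;
        const auto v = decode_msgpack_uint(blob, size, i + 1 + key_len);
        if (v && (!best || *v > *best)) best = v;
    }
    return best;
}

// Returns an empty string when nothing objectionable was found, otherwise a
// message naming both the declared size and the device limit.
inline std::string check_lds(const unsigned char* blob, std::size_t size,
                             std::uint64_t device_limit) {
    const auto declared = max_declared_lds(blob, size);
    if (!declared || *declared <= device_limit) return {};
    return "kernel declares " + std::to_string(*declared) +
           " bytes of LDS, but the device limit is " + std::to_string(device_limit) +
           " bytes";
}
```

Because every unrecognized byte pattern is skipped, this kind of scanner can miss a violation but should not report one that the metadata does not declare.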
Hi @alexioslyrakis-amd,
Hi @amd-ruitang3, the probe is the safety net: it catches all failures, but it can only say "registration failed", not why. On the fragility concern: the scanner is conservative (it skips anything it doesn't recognize), so false negatives are possible (it can miss a violation), but I can't think of how false positives might appear. Alternatively, I could use LLVM's built-in msgpack parser (llvm/BinaryFormat/MsgPack.h). I guess this won't introduce a new dependency, since LLVM is already required by ROCm.
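A rough sketch of what such a probe could look like, written so that it compiles without HIP. Everything here is an assumption for illustration: FakeError and FakeFunction stand in for hipError_t and hipFunction_t, SymbolLookup mirrors the shape of hipGetFuncBySymbol(hipFunction_t*, const void*), and probe_registration is a hypothetical name, not the function added in this PR.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Stand-ins so the sketch builds without <hip/hip_runtime.h>.
using FakeError = int;        // 0 plays the role of hipSuccess
using FakeFunction = void*;   // plays the role of hipFunction_t

// Same shape as hipGetFuncBySymbol: writes the function handle, returns a status.
using SymbolLookup = std::function<FakeError(FakeFunction*, const void*)>;

// Hypothetical probe: after registration, ask the runtime for the function behind
// `symbol`. An error status or a null handle both count as a rejected registration.
inline void probe_registration(const SymbolLookup& lookup, const void* symbol,
                               const std::string& kernel_name) {
    assert(lookup && "lookup must be callable");
    FakeFunction fn = nullptr;
    const FakeError err = lookup(&fn, symbol);
    if (err != 0 || fn == nullptr) {
        // The probe cannot tell which cause applies, so it lists the candidates.
        throw std::runtime_error(
            "registration of '" + kernel_name + "' failed (lookup status " +
            std::to_string(err) +
            "); possible causes: LDS limit exceeded, arch mismatch, corrupted binary");
    }
}
```

In real code the lookup argument would simply forward to hipGetFuncBySymbol. The injected callable is only there to keep the sketch testable on a machine without a GPU.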
Summary
- Add validate_hsaco_lds() to AiterAsmKernelFast::init(): scans the raw .co ELF blob for group_segment_fixed_size via msgpack decode and checks it against the device LDS limit (hipDeviceGetAttribute). Fires before registration, giving an actionable error instead of a silent null from hipGetFuncBySymbol.
- Add a registration probe to init(): since __hipRegisterFunction returns void, a hipGetFuncBySymbol probe detects silent rejection by the runtime (LDS limit exceeded, arch mismatch, corrupted binary, etc.).
- Both checks run once at construction time, not on the launch_kernel() hot path.
Motivation
On gfx942 (MI300X) the LDS limit is 64 KB. A .co built targeting gfx950 (MI355X, ~160 KB LDS) would be silently rejected at registration, hipGetFuncBySymbol would return null, and hipModuleLaunchKernel would emit only a generic hipErrorIllegalState with no indication of the root cause.
Test plan
- Load a .co declaring group_segment_fixed_size > 65536 on gfx942: expect a clear error from validate_hsaco_lds() at init
- Load a valid .co on gfx942: expect no error, kernel executes normally