[AMDGPU] comgr: read SGPR count from msgpack metadata instead of reserved KD field #2873

Merged
chinmaydd merged 2 commits into ROCm:amd-staging from suryajasper:users/surya/sgpr-metadata-allocation on Jun 15, 2026

suryajasper commented Jun 11, 2026

Member

Summary

On GFX10+, GRANULATED_WAVEFRONT_SGPR_COUNT in the kernel descriptor is architecturally reserved and must be zero — the hardware ignores it. The scratch SGPR allocator introduced in #2328 was reading this field to determine where to start allocating scratch SGPRs, which is unreliable: a code object that correctly zeroes the field would cause the allocator to start at s8, potentially clobbering live registers.

This PR replaces the KD-field approach with a metadata-first strategy: read .sgpr_count from the amdhsa.kernels msgpack metadata note (NT_AMDGPU_METADATA), falling back to the KD field for minimal ELFs that lack a metadata note (e.g. lit test objects assembled with -nostdlib).
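To make the note lookup concrete, here is a minimal sketch of scanning the raw bytes of a PT_NOTE segment for the AMDGPU metadata note. It is not the code from this PR: `findMetadataBlob`, `makeNote`, and `alignTo4` are hypothetical names, the sketch assumes a little-endian host, and it stops at returning the msgpack blob without parsing it. The note type value 32 and the `AMDGPU` owner name follow LLVM's ELF definitions.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <string_view>

// ELF note type of the AMDGPU msgpack metadata note (value 32 in LLVM's ELF.h).
constexpr uint32_t NT_AMDGPU_METADATA = 32;

// Note name and descriptor are each padded to a 4-byte boundary.
constexpr size_t alignTo4(size_t N) { return (N + 3) & ~size_t(3); }

// Hypothetical helper (not the PR's code): scan the raw bytes of one PT_NOTE
// segment and return the descriptor of the first AMDGPU metadata note.
// Assumes a little-endian host, matching the little-endian ELF.
std::optional<std::string_view> findMetadataBlob(std::string_view Notes) {
  size_t Off = 0;
  while (Off + 12 <= Notes.size()) {
    uint32_t NameSz, DescSz, Type;
    std::memcpy(&NameSz, Notes.data() + Off, 4);
    std::memcpy(&DescSz, Notes.data() + Off + 4, 4);
    std::memcpy(&Type, Notes.data() + Off + 8, 4);

    size_t NameOff = Off + 12;
    size_t DescOff = NameOff + alignTo4(NameSz);
    if (DescOff > Notes.size() || DescSz > Notes.size() - DescOff)
      return std::nullopt; // truncated or malformed note

    // The owner name is stored with its terminating NUL ("AMDGPU\0").
    std::string_view Name = Notes.substr(NameOff, NameSz);
    if (Type == NT_AMDGPU_METADATA && Name == std::string_view("AMDGPU", 7))
      return Notes.substr(DescOff, DescSz);

    Off = DescOff + alignTo4(DescSz); // advance to the next note record
  }
  return std::nullopt;
}

// Build one note record so the scanner can be tried without a real ELF.
std::string makeNote(uint32_t Type, std::string_view Name,
                     std::string_view Desc) {
  std::string Out;
  auto Put32 = [&Out](uint32_t V) {
    Out.append(reinterpret_cast<const char *>(&V), sizeof V);
  };
  Put32(static_cast<uint32_t>(Name.size() + 1)); // name size includes the NUL
  Put32(static_cast<uint32_t>(Desc.size()));
  Put32(Type);
  Out.append(Name);
  Out.append(alignTo4(Name.size() + 1) - Name.size(), '\0');
  Out.append(Desc);
  Out.append(alignTo4(Desc.size()) - Desc.size(), '\0');
  return Out;
}
```

In the real change, the returned blob would then be handed to the msgpack reader. An empty result corresponds to the "no metadata note" case that triggers the KD fallback.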

Also fixes Gfx1250MaxVgprs from 256 to 1024 — GFX1250 has Feature1024AddressableVGPRs and with wave32, getAddressableNumVGPRs returns 1024.

Follow-up to #2328, specifically @jmmartinez's review comment noting that the SGPR field is reserved on GFX10+ and the comment noting that gfx1250 has 1024 VGPRs.

Implementation Details

comgr-hotswap-elf.cpp, getKernelSgprCount() rewritten:

  • Primary path: iterate PT_NOTE program headers → find NT_AMDGPU_METADATA note → parse msgpack blob via llvm::msgpack::Document::readFromBlob() → walk amdhsa.kernels array → match .name to the kernel → return .sgpr_count.
  • Fallback path: if no metadata note is present, read GRANULATED_WAVEFRONT_SGPR_COUNT from the kernel descriptor with the correct encoding granule (8), matching getSGPREncodingGranule() in AMDGPUBaseInfo.cpp.
  • Returns std::nullopt on failure; the caller falls back to MaxSgprs (106), which is conservative (no headroom → patch refused rather than silent clobber).
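The selection logic in the three bullets above can be sketched without any LLVM dependency. This is an illustration under stated assumptions, not the PR's implementation: `KernelMetadata`, `decodeKdSgprField`, and `getSgprCount` are hypothetical names, the msgpack document is modelled as a plain vector, and the KD decoding is assumed to be `(field + 1) * 8`, which matches the "zeroed field starts at s8" behaviour described in the summary. The PR does not say what happens when a note is present but lacks the kernel or its `.sgpr_count`; this sketch returns `std::nullopt` in that case.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for one entry of the amdhsa.kernels array (hypothetical type).
struct KernelMetadata {
  std::string Name;                  // ".name"
  std::optional<unsigned> SgprCount; // ".sgpr_count"
};

// Encoding granule quoted in the PR for the KD fallback path.
constexpr unsigned SgprEncodingGranule = 8;

// Assumed decoding of the granulated KD field: (field + 1) * granule.
unsigned decodeKdSgprField(uint32_t Field) {
  return (Field + 1) * SgprEncodingGranule;
}

// Metadata-first lookup. A null Metadata pointer models an ELF without an
// NT_AMDGPU_METADATA note, in which case the KD field (if readable) is used.
std::optional<unsigned>
getSgprCount(const std::vector<KernelMetadata> *Metadata,
             std::string_view KernelName, std::optional<uint32_t> KdField) {
  if (Metadata) {
    for (const KernelMetadata &K : *Metadata)
      if (K.Name == KernelName)
        return K.SgprCount; // may itself be nullopt if the key is missing
    return std::nullopt;    // assumption: unknown kernel counts as failure
  }
  if (KdField)
    return decodeKdSgprField(*KdField);
  return std::nullopt;
}
```

With a zeroed KD field and no note, `getSgprCount(nullptr, "foo", 0u)` yields 8, which is exactly the unreliable base that the metadata path avoids for production code objects.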

comgr-hotswap-elf.cpp, updateKernelDescriptor() simplified:

  • Removed ExtraSgprs and SgprGranuleSize parameters.
  • Deleted the SGPR write block — bumping a reserved field has no effect on hardware.
  • VGPR update path unchanged (that field is not reserved).

comgr-hotswap-internal.h:

  • getKernelSgprCount() signature: removed SgprGranuleSize parameter (metadata provides the raw count directly).
  • updateKernelDescriptor() signature: removed ExtraSgprs and SgprGranuleSize.
  • RewriteConfig: removed SgprGranuleSize field.

comgr-hotswap-patch-f32-to-e5m3.cpp:

  • allocateScratch(): updated call to getKernelSgprCount() (no granule argument).
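For context on why the `MaxSgprs` fallback is conservative, the following sketch shows one way a caller could turn the optional count into an allocation decision. It is a hypothetical model, not the actual `allocateScratch()`: the function name `allocateScratchSgprs` and its signature are assumptions, and only the limit of 106 comes from the PR description.

```cpp
#include <optional>

// SGPR limit quoted in the PR description.
constexpr unsigned MaxSgprs = 106;

// Hypothetical allocator: returns the index of the first scratch SGPR, or
// nullopt when there is not enough headroom above the kernel's own SGPRs.
std::optional<unsigned>
allocateScratchSgprs(std::optional<unsigned> KernelSgprCount, unsigned Needed) {
  // An unknown count is treated as "every SGPR is in use".
  unsigned Base = KernelSgprCount.value_or(MaxSgprs);
  if (Base > MaxSgprs || Needed > MaxSgprs - Base)
    return std::nullopt; // refuse the patch instead of clobbering registers
  return Base;
}
```

When the count is unknown, `Base` equals `MaxSgprs`, so any request for one or more scratch registers is refused. A known count of 24 with two registers requested would place them at s24 and s25.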

comgr-hotswap-b0a0.cpp:

  • Removed Gfx1250SgprGranuleSize constant.
  • Fixed Gfx1250MaxVgprs from 256 to 1024 (Feature1024AddressableVGPRs + wave32).
  • Updated updateKernelDescriptor() call site to match new signature.

No build system changes — LLVMBinaryFormat (which includes msgpack) is already linked by COMGR.

Test Coverage

  • 49/49 hotswap lit tests pass, including all 7 FP8 E5M3 tests and the new WMMA split test. The CHECK patterns use s_mov_b32 without specifying SGPR numbers, so they are agnostic to which specific SGPRs the allocator selects.
  • 4/4 HotswapElfTests and 44/44 HotswapMCTests pass.
  • The KD fallback path is exercised by every existing lit test (assembled with -nostdlib, which produces ELFs without metadata notes). The metadata path will be exercised by production code objects compiled with hipcc/clang, which always include NT_AMDGPU_METADATA.

@jmmartinez jmmartinez left a comment

The getKernelSgprCount related changes look good to me. Thanks!

@chinmaydd

@suryajasper could you take a look at the merge conflicts here? Thanks! This one should be good to go.

…rved KD field

On GFX10+ GRANULATED_WAVEFRONT_SGPR_COUNT in the kernel descriptor is
architecturally reserved (must be zero). The scratch SGPR allocator was
reading this field to determine the allocation base, which is unreliable.

Replace getKernelSgprCount() with a metadata-first approach: parse the
NT_AMDGPU_METADATA msgpack note and read .sgpr_count from the
amdhsa.kernels array. Fall back to the KD field (with the correct
encoding granule of 8) for minimal ELFs that lack a metadata note.

Also remove the SGPR write path from updateKernelDescriptor() since
bumping the reserved field has no effect on hardware, and remove
SgprGranuleSize from RewriteConfig since the metadata provides the
raw count directly.

suryajasper force-pushed the users/surya/sgpr-metadata-allocation branch from 98519e0 to a985d11 on June 15, 2026 16:00

suryajasper commented Jun 15, 2026

Member Author

Rebased and addressed merge conflicts. Also fixed Gfx1250MaxVgprs from 256 to 1024 per @jmmartinez's comment on #2328.

GFX1250 has Feature1024AddressableVGPRs. With wave32,
getAddressableNumVGPRs returns 1024, not 256. The incorrect
limit artificially constrained the VgprAllocator's headroom.