Skip to content

Fix assertion failure with sub-8B kernel arguments#34

Closed
Copilot wants to merge 2 commits intomainfrom
copilot/fix-sub-8b-kernel-arguments
Closed

Fix assertion failure with sub-8B kernel arguments#34
Copilot wants to merge 2 commits intomainfrom
copilot/fix-sub-8b-kernel-arguments

Conversation

Copy link

Copilot AI commented Mar 16, 2026

When a kernel has pointer arguments followed by sub-8B arguments (e.g. int), the injected dh_comms* in the instrumented clone lands at a 4-byte-aligned offset, so the explicit args end at a non-8-byte-aligned boundary (e.g. 44 bytes). The old roundArgsLength() call rounded that up to 48, overshooting the actual hidden-args start and causing:

assert(desc.clone_hidden_args_length <= desc.kernarg_length - desc.explicit_args_length)
// e.g. assert(72 <= 116 - 48)  →  assert(72 <= 68)  →  ABORT

Reproducer:

__global__ void hgemm_kernel(const __half* A, const __half* B, __half* C,
                              int M, int N, int K)  // ← int args trigger the bug

Changes

  • src/utils.ccKernelArgHelper::computeKernargData: use the offset of the first hidden argument from the kernel metadata as explicit_args_length. This is the exact explicit/hidden boundary as encoded by the compiler — no rounding needed. roundArgsLength is retained only for kernels with no hidden arguments (e.g. some Triton kernels), where it is both harmless and preserves existing behaviour.

  • tests/test_kernels/sub8b_args_test.cpp: regression kernel with (const int*, const int*, int*, int M, int N, int K) arguments — same layout as the failing kernel above.

  • tests/test_kernels/CMakeLists.txt, tests/run_basic_tests.sh: wire the new kernel into the build and basic test suite.

Original prompt

This section details on the original issue you should resolve

<issue_title>Error with sub-8B kernel arguments</issue_title>
<issue_description>Possibly related to this comment in intercept.cc,

it seems like Omniprobe returns the following when each kernel argument is smaller that 8 Bytes.

>>>>>>>> HSA intercept registered.
ADDRESS_MESSAGE,timestamp,kernel,src_line,dispatch,exec_mask,xcc_id,se_id,cu_id,kind,address
host_device_combined_omniprobe: /app/omniprobe/src/interceptor.cc:649: void hsaInterceptor::fixupKernArgs(void *, void *, void *, arg_descriptor_t): Assertion `desc.clone_hidden_args_length <= desc.kernarg_length - desc.explicit_args_length' failed.

This was encountered with the following kernel argument - when 32b integer arguments are used,

__global__ void hgemm_kernel(const __half* __restrict__ A,
                             const __half* __restrict__ B,
                             __half* __restrict__ C,
                             int M, int N, int K)

but not encountered (proceeds to instrumentation) with the following kernel argument - when replaced with 64b integers,

__global__ void hgemm_kernel(const __half* __restrict__ A,
                             const __half* __restrict__ B,
                             __half* __restrict__ C,
                             int64_t M, int64_t N, int64_t K)
```</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
<comment_new><author>@rwvo</author><body>
This may have been fixed already; I looked into kernel argument alignment before. Should be testable by writing a kernel that has similar arguments as the one above with the three int arguments, launch it with a single block of 64 threads, let thread 0 printf the values of the integer arguments. Then instrument it using instrumentation scoping such that nothing is actually instrumented, but a cloned kernel with an extra void* argument is created. Then run with and without instrumentation under omniprobe. Both runs should print the same values.</body></comment_new>
</comments>


📱 Kick off Copilot coding agent tasks wherever you are with GitHub Mobile, available on iOS and Android.

In KernelArgHelper::computeKernargData(), the explicit_args_length was
computed as roundArgsLength(last_explicit_arg_offset + size), rounding up
to the next 8-byte boundary.  When a kernel has pointer args followed by
32-bit int args (e.g. hgemm_kernel(half*, half*, half*, int, int, int)),
the last explicit arg in the instrumented kernel (dh_comms*) is placed at a
4-byte-aligned offset.  The rounded explicit_args_length overshot the actual
hidden-args boundary, so kernarg_length - explicit_args_length became smaller
than clone_hidden_args_length, triggering:

  assert(desc.clone_hidden_args_length <=
         desc.kernarg_length - desc.explicit_args_length)

Fix: use the offset of the FIRST hidden argument from the kernel metadata
as explicit_args_length.  This is the exact explicit/hidden boundary with no
rounding required.  Fall back to roundArgsLength only for kernels that have
no hidden arguments (e.g. some Triton kernels), where rounding is harmless.

Also:
- Add sub8b_args_test.cpp regression test kernel (ptr + int args)
- Wire it into CMakeLists.txt and run_basic_tests.sh

Co-authored-by: rwvo <21990117+rwvo@users.noreply.github.com>
Copilot AI changed the title [WIP] [ISSUE-123] Fix error with sub-8B kernel arguments Fix assertion failure with sub-8B kernel arguments Mar 16, 2026
Copilot AI requested a review from rwvo March 16, 2026 20:18
Copilot finished work on behalf of rwvo March 16, 2026 20:18
@rwvo
Copy link
Collaborator

rwvo commented Mar 23, 2026

Closing this PR. The assertion failure described in #32 was already fixed in commit 705ece5, which added a std::min(roundArgsLength(...), desc.kernarg_length) clamp. The current code works correctly for sub-8B kernel arguments because the AMD GPU ABI aligns the first hidden argument on an 8-byte boundary, so roundArgsLength() produces the correct explicit/hidden boundary.

The PR's approach of reading the first hidden argument's offset directly from metadata is theoretically more robust (it would handle a hypothetical ABI where hidden args aren't 8-byte aligned), but doesn't address a current bug.

The regression test kernel is a good idea, but as written it only checks that omniprobe doesn't crash — it doesn't verify that the integer arguments are passed through correctly to the instrumented kernel. A proper test should read back results and/or print argument values to confirm correctness with and without instrumentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error with sub-8B kernel arguments

2 participants