[watch/rebar 2/3] Ensure WAIT_MEM waits for trap#81
Open
palves wants to merge 1 commit intospr/amd-staging/c99ccb02from
Open
[watch/rebar 2/3] Ensure WAIT_MEM waits for trap#81palves wants to merge 1 commit intospr/amd-staging/c99ccb02from
palves wants to merge 1 commit intospr/amd-staging/c99ccb02from
Conversation
This was referenced Apr 15, 2026
The next patch adds a testcase that uses WAIT_MEM to wait for a
watchpoint hit with precise memory off, like so:
__global__ void
kern (int *ptr)
{
*ptr = 1; // trigger watchpoint
WAIT_MEM;
}
On gfx1030 on Linux, I observe that the watchpoint would not trigger.
If I add several more WAIT_MEM calls, I see the watchpoint trigger a
few instructions further down.
The problem is that on GFX10, WAIT_MEM expands to a single:
s_waitcnt_vscnt null, 0
which only waits for vector stores to retire. In practice this is not
sufficient to ensure that a memory watchpoint is delivered before
execution continues. For example, in:
flat_store_dword v[0:1], v2
s_waitcnt_vscnt null, 0
ROCgdb still observes the watchpoint trap several instructions later:
s_waitcnt_vscnt null, 0
s_waitcnt_vscnt null, 0
s_waitcnt_vscnt null, 0 <== trap taken here
My understanding is that even when s_waitcnt_vscnt completes, it only
means the memory controller has acknowledged that the store is "done".
However, the signal that a watchpoint was hit has to travel from the
memory hierarchy back to the Wavefront Controller to trigger the
exception.
By the time the wavefront controller receives the "stop!" signal, the
sequencer has already fetched and potentially executed the next few
instructions.
Running the same testcase on Windows with a gfx1201, I observe the
exact same thing, we need a few more instructions for the trap to be
reported.
Fix this by adding a short sequence of s_nop instructions to delay
forward progress, giving the hardware time to deliver the watchpoint
trap before exiting WAIT_MEM.
While at it, on gfx10, also wait for vmcnt(0) and lgkmcnt(0), ensuring
that reads and scalar writes are waited for too, so that the gfx10
version has the same semantics as the other architecture
implementations. Otherwise, if WAIT_MEM is meant for being useful
with writes only, then I think it'd be better to rename
e.g. WAIT_MEM_STORE or something like that.
Change-Id: Ia119ceba6d9f1e2c8ec1eaf619cb066286ea6dd3
commit-id:6dc0d30a
7ddf723 to
241989a
Compare
b5d2bee to
15d8d9d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The next patch adds a testcase that uses WAIT_MEM to wait for a
watchpoint hit with precise memory off, like so:
global void
kern (int *ptr)
{
*ptr = 1; // trigger watchpoint
WAIT_MEM;
}
On gfx1030 on Linux, I observe that the watchpoint would not trigger.
If I add several more WAIT_MEM calls, I see the watchpoint trigger a
few instructions further down.
The problem is that on GFX10, WAIT_MEM expands to a single:
s_waitcnt_vscnt null, 0
which only waits for vector stores to retire. In practice this is not
sufficient to ensure that a memory watchpoint is delivered before
execution continues. For example, in:
flat_store_dword v[0:1], v2
s_waitcnt_vscnt null, 0
ROCgdb still observes the watchpoint trap several instructions later:
s_waitcnt_vscnt null, 0
s_waitcnt_vscnt null, 0
s_waitcnt_vscnt null, 0 <== trap taken here
My understanding is that even when s_waitcnt_vscnt completes, it only
means the memory controller has acknowledged that the store is "done".
However, the signal that a watchpoint was hit has to travel from the
memory hierarchy back to the Wavefront Controller to trigger the
exception.
By the time the wavefront controller receives the "stop!" signal, the
sequencer has already fetched and potentially executed the next few
instructions.
Running the same testcase on Windows with a gfx1201, I observe the
exact same thing, we need a few more instructions for the trap to be
reported.
Fix this by adding a short sequence of s_nop instructions to delay
forward progress, giving the hardware time to deliver the watchpoint
trap before exiting WAIT_MEM.
While at it, on gfx10, also wait for vmcnt(0) and lgkmcnt(0), ensuring
that reads and scalar writes are waited for too, so that the gfx10
version has the same semantics as the other architecture
implementations. Otherwise, if WAIT_MEM is meant for being useful
with writes only, then I think it'd be better to rename
e.g. WAIT_MEM_STORE or something like that.
Change-Id: Ia119ceba6d9f1e2c8ec1eaf619cb066286ea6dd3
Stack: