
Request for permuted parallel atomic message passing test #295

Closed

raphlinus opened this issue Nov 3, 2021 · 7 comments

@raphlinus

I've lately been tracking down some strange behavior in the process of porting prefix sum to WebGPU. Right now, that's looking like a miscompilation of relaxed atomic loads and stores in the AMD 5700 XT Windows driver.

The Vulkan CTS should be catching this behavior, but the existing message passing test passes on this device. I've developed a version of the message passing test which is more or less vanilla, except that it runs many instances in parallel and permutes the locations in memory; this seems to be the difference between passing and failing. It also does both the store and load roles in one thread, which is pleasant because it avoids control flow.

My code is here: googlefonts/compute-shader-101#11. This is in WGSL, but I'm also pasting a GLSL version below (lightly edited from spirv-cross output, as its translation of atomics is... interesting):

#version 450
#extension GL_KHR_memory_scope_semantics: enable

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

struct Element
{
    uint data;
    uint flag;
};

layout(set = 0, binding = 0, std430) buffer DataBuf
{
    Element data[];
} data_buf;

layout(set = 0, binding = 1, std430) buffer ControlBuf
{
    uint strategy;
    uint failures;
} control_buf;

// Multiplying by an odd constant modulo 2^16 permutes the 16-bit index space.
uint permute_flag_ix(uint data_ix)
{
    return (data_ix * 419u) & 65535u;
}

void main()
{
    // Writer role: publish data, then (after a buffer barrier) set the flag at a permuted index.
    atomicStore(data_buf.data[gl_GlobalInvocationID.x].data, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    memoryBarrierBuffer();
    atomicStore(data_buf.data[permute_flag_ix(gl_GlobalInvocationID.x)].flag, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    // Reader role: observe another slot's flag, then (after a buffer barrier) its data.
    uint read_ix = (gl_GlobalInvocationID.x * 4099u) & 65535u;
    uint flag = atomicLoad(data_buf.data[permute_flag_ix(read_ix)].flag, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    memoryBarrierBuffer();
    uint data = atomicLoad(data_buf.data[read_ix].data, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    // Seeing the flag set (1) while the data is still 0 is a message passing violation.
    if (flag > data)
    {
        atomicAdd(control_buf.failures, 1u);
    }
}

There are many possible variants of this test, including using Release semantics for the second store and Acquire for the first load instead of the barriers. But I'm focusing on this one in large part because it represents a natural translation of WGSL source.
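
As a concrete illustration of that variant, here is a rough, untested sketch of what main() might look like with the barriers folded into release/acquire operations on the flag (reusing the buffer declarations and permute_flag_ix from the shader above; the exact semantics arguments are my reading of GL_KHR_memory_scope_semantics, not code from the linked repo):

void main()
{
    // Data store stays relaxed; ordering is carried by the release on the flag store.
    atomicStore(data_buf.data[gl_GlobalInvocationID.x].data, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    // Release store of the flag takes the place of the first memoryBarrierBuffer().
    atomicStore(data_buf.data[permute_flag_ix(gl_GlobalInvocationID.x)].flag, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelease);
    uint read_ix = (gl_GlobalInvocationID.x * 4099u) & 65535u;
    // Acquire load of the flag takes the place of the second memoryBarrierBuffer().
    uint flag = atomicLoad(data_buf.data[permute_flag_ix(read_ix)].flag, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsAcquire);
    uint data = atomicLoad(data_buf.data[read_ix].data, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    if (flag > data)
    {
        atomicAdd(control_buf.failures, 1u);
    }
}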

I'm happy to answer any questions or to help get this into the CTS. I'm not sure this is the only issue affecting my prefix sum code, so I'm still investigating whether there are more clean tests I can derive.

@jeffbolznv
Contributor

@raphlinus if you're set up to build the CTS, could you check whether changing the non-atomic loads to atomic loads in the existing CTS memory_model tests is sufficient to reproduce this bug? I'd like to know whether it's the lack of atomics or the different coordinate permutation (or both) that triggers the bug.

@raphlinus
Author

I get failures, but I'm so far unsure how to interpret them.

This is the patch I applied:

diff --git a/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp b/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp
index d45f589f9..29b0779e3 100755
--- a/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp
+++ b/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp
@@ -646,7 +646,7 @@ void MemoryModelTestCase::initPrograms (SourceCollections& programCollection) co
                        {
                        default: DE_ASSERT(0); // fall through
                        case SC_PHYSBUFFER: // fall through
-                       case SC_BUFFER:         css << "   payload.x[bufferCoord] = bufferCoord + (payload.x[partnerBufferCoord]>>31);\n"; break;
+                       case SC_BUFFER:         css << "   atomicStore(payload.x[bufferCoord], bufferCoord + (payload.x[partnerBufferCoord]>>31), " << scopeStr << ", 0, 0);\n"; break;
                        case SC_IMAGE:          css << "   imageStore(payload, imageCoord, uvec4(bufferCoord + (imageLoad(payload, partnerImageCoord).x>>31), 0, 0, 0));\n"; break;
                        case SC_WORKGROUP:      css << "   payload.x[sharedCoord] = bufferCoord + (payload.x[partnerSharedCoord]>>31);\n"; break;
                        }
@@ -774,7 +774,7 @@ void MemoryModelTestCase::initPrograms (SourceCollections& programCollection) co
                {
                default: DE_ASSERT(0); // fall through
                case SC_PHYSBUFFER: // fall through
-               case SC_BUFFER:         css << "   " << typeStr << " r = payload.x[partnerBufferCoord];\n"; break;
+               case SC_BUFFER:         css << "   " << typeStr << " r = atomicLoad(payload.x[partnerBufferCoord], " << scopeStr << ", 0, 0);\n"; break;
                case SC_IMAGE:          css << "   " << typeStr << " r = imageLoad(payload, partnerImageCoord).x;\n"; break;
                case SC_WORKGROUP:      css << "   " << typeStr << " r = payload.x[partnerSharedCoord];\n"; break;
                }

With this I get failures, but all of them are in the fragment shader tests; the compute tests all seem to pass (which is what I'm most interested in). I fear I've done something that breaks fragment shaders specifically.

Should I upload the TestResults.qpa?

@jeffbolznv
Contributor

That change looks correct; I think you've reproduced the bug. I ran that change on an NVIDIA implementation and there are no failures.

@raphlinus
Author

I'm not sure it counts as a full reproduction if the compute tests are still passing. One plausible explanation is that fragment shading permutes the invocations in a way that compute dispatch does not.

@jeffbolznv
Contributor

Yeah, I think that's likely why you're not seeing it in the compute shaders. But once the tests make it into a test lab and get run on all different sizes and families of GPUs dozens of times a day, I think there's a decent chance the bug would show up there, too.

@mnetsch
Contributor

mnetsch commented Feb 3, 2022

Please note that an internal issue has been opened and is pending.

@mnetsch
Contributor

mnetsch commented Feb 25, 2022

0f04733

@mnetsch mnetsch closed this as completed Feb 25, 2022