
Request for permuted parallel atomic message passing test #295

Closed

raphlinus opened this issue Nov 3, 2021 · 7 comments

@raphlinus

I've lately been tracking down some strange behavior in the process of porting prefix sum to WebGPU. Right now, that's looking like a miscompilation of relaxed atomic loads and stores in the AMD 5700 XT Windows driver.

The Vulkan CTS should be catching this behavior, but the existing message passing test passes on this device. I've developed a version of the message passing test which is more or less vanilla, except that it runs many instances in parallel and permutes the locations in memory; this seems to be the difference between passing and failing. It also does both the store and load roles in one thread, which is pleasant because it avoids control flow.

My code is here: googlefonts/compute-shader-101#11. This is in WGSL, but I'm also pasting a GLSL version below (lightly edited from spirv-cross output, as its translation of atomics is... interesting):

#version 450
#extension GL_KHR_memory_scope_semantics: enable

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

struct Element
{
    uint data;
    uint flag;
};

layout(set = 0, binding = 0, std430) buffer DataBuf
{
    Element data[];
} data_buf;

layout(set = 0, binding = 1, std430) buffer ControlBuf
{
    uint strategy;
    uint failures;
} control_buf;

// Multiplying by an odd constant modulo 2^16 permutes the 16-bit index space.
uint permute_flag_ix(uint data_ix)
{
    return (data_ix * 419u) & 65535u;
}

void main()
{
    // Writer role: publish data, then (after a buffer barrier) set the flag at a permuted index.
    atomicStore(data_buf.data[gl_GlobalInvocationID.x].data, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    memoryBarrierBuffer();
    atomicStore(data_buf.data[permute_flag_ix(gl_GlobalInvocationID.x)].flag, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    // Reader role: observe another slot's flag, then (after a buffer barrier) its data.
    uint read_ix = (gl_GlobalInvocationID.x * 4099u) & 65535u;
    uint flag = atomicLoad(data_buf.data[permute_flag_ix(read_ix)].flag, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    memoryBarrierBuffer();
    uint data = atomicLoad(data_buf.data[read_ix].data, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    // Seeing the flag set (1) while the data is still 0 is a message passing violation.
    if (flag > data)
    {
        atomicAdd(control_buf.failures, 1u);
    }
}

There are many possible variants of this test, including using Release semantics for the second store and Acquire for the first load instead of the barriers. But I'm focusing on this one in large part because it represents a natural translation of WGSL source.
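
As a concrete illustration of that variant, here is a rough, untested sketch of what main() might look like with the barriers folded into release/acquire operations on the flag (reusing the buffer declarations and permute_flag_ix from the shader above; the exact semantics arguments are my reading of GL_KHR_memory_scope_semantics, not code from the linked repo):

void main()
{
    // Data store stays relaxed; ordering is carried by the release on the flag store.
    atomicStore(data_buf.data[gl_GlobalInvocationID.x].data, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    // Release store of the flag takes the place of the first memoryBarrierBuffer().
    atomicStore(data_buf.data[permute_flag_ix(gl_GlobalInvocationID.x)].flag, 1u, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelease);
    uint read_ix = (gl_GlobalInvocationID.x * 4099u) & 65535u;
    // Acquire load of the flag takes the place of the second memoryBarrierBuffer().
    uint flag = atomicLoad(data_buf.data[permute_flag_ix(read_ix)].flag, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsAcquire);
    uint data = atomicLoad(data_buf.data[read_ix].data, gl_ScopeDevice, gl_StorageSemanticsBuffer, gl_SemanticsRelaxed);
    if (flag > data)
    {
        atomicAdd(control_buf.failures, 1u);
    }
}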

I'm happy to answer any questions or to help get this into the CTS. I'm not sure this is the only issue affecting my prefix sum code, so I'm still investigating whether there are more clean tests I can derive.

@jeffbolznv
Contributor

@raphlinus if you're set up to build the CTS, could you check whether changing the non-atomic loads to atomic loads in the existing CTS memory_model tests is sufficient to reproduce this bug? I'd like to know whether it's the lack of atomics or the different coordinate permutation (or both) that triggers the bug.

@raphlinus
Author

I get failures, but I'm so far unsure how to interpret them.

This is the patch I applied:

diff --git a/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp b/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp
index d45f589f9..29b0779e3 100755
--- a/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp
+++ b/external/vulkancts/modules/vulkan/memory_model/vktMemoryModelMessagePassing.cpp
@@ -646,7 +646,7 @@ void MemoryModelTestCase::initPrograms (SourceCollections& programCollection) co
                        {
                        default: DE_ASSERT(0); // fall through
                        case SC_PHYSBUFFER: // fall through
-                       case SC_BUFFER:         css << "   payload.x[bufferCoord] = bufferCoord + (payload.x[partnerBufferCoord]>>31);\n"; break;
+                       case SC_BUFFER:         css << "   atomicStore(payload.x[bufferCoord], bufferCoord + (payload.x[partnerBufferCoord]>>31), " << scopeStr << ", 0, 0);\n"; break;
                        case SC_IMAGE:          css << "   imageStore(payload, imageCoord, uvec4(bufferCoord + (imageLoad(payload, partnerImageCoord).x>>31), 0, 0, 0));\n"; break;
                        case SC_WORKGROUP:      css << "   payload.x[sharedCoord] = bufferCoord + (payload.x[partnerSharedCoord]>>31);\n"; break;
                        }
@@ -774,7 +774,7 @@ void MemoryModelTestCase::initPrograms (SourceCollections& programCollection) co
                {
                default: DE_ASSERT(0); // fall through
                case SC_PHYSBUFFER: // fall through
-               case SC_BUFFER:         css << "   " << typeStr << " r = payload.x[partnerBufferCoord];\n"; break;
+               case SC_BUFFER:         css << "   " << typeStr << " r = atomicLoad(payload.x[partnerBufferCoord], " << scopeStr << ", 0, 0);\n"; break;
                case SC_IMAGE:          css << "   " << typeStr << " r = imageLoad(payload, partnerImageCoord).x;\n"; break;
                case SC_WORKGROUP:      css << "   " << typeStr << " r = payload.x[partnerSharedCoord];\n"; break;
                }

With this I get failures, but all of them are in the fragment shader tests; the compute tests all seem to pass (which is what I'm most interested in). I fear I've done something that breaks fragment shaders specifically.

Should I upload the TestResults.qpa?

@jeffbolznv
Contributor

That change looks correct; I think you've reproduced the bug. I ran that change on an NVIDIA implementation and there are no failures.

@raphlinus
Author

I'm not sure it counts as a full reproduction if the compute tests are still passing. One plausible explanation is that fragment shading permutes the invocations in a way that compute dispatch does not.

@jeffbolznv
Contributor

Yeah, I think that's likely why you're not seeing it in the compute shaders. But once the tests make it into a test lab and get run on all different sizes and families of GPUs dozens of times a day, I think there's a decent chance the bug would show up there, too.

@mnetsch
Contributor

mnetsch commented Feb 3, 2022

Please note that an internal issue has been opened and is pending.

@mnetsch
Contributor

mnetsch commented Feb 25, 2022

0f04733

@mnetsch mnetsch closed this as completed Feb 25, 2022