
Invalid results due to shared memory + multiple function exits (?) mysteriously solved by @cuprintf #43

@ali-ramadhan

Description

I modified the scan.jl example (which uses dynamic shared memory) to calculate mean(a; dims=[1,2]) for a non-contiguous part of an OffsetArray{CuArray}.

The extra change I made is that the scan/accumulation is done along dimension 2 first and stored in a 2D array, which is then summed along dimension 1 and stored in a 1D array.
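
For reference, the answer the kernel should reproduce is just the horizontal mean, which on the CPU is a one-liner. This is a minimal sketch of the expected result for the all-ones test case used in the reproducer below (variable names here are just for illustration, not part of the kernel):

using Statistics

# CPU reference: averaging the all-ones array over dims 1 and 2 leaves one value per level.
Nx, Ny, Nz = 64, 64, 64
A_ref = ones(Nx, Ny, Nz)
HV_ref = mean(A_ref; dims=(1, 2))  # size (1, 1, Nz); every entry should be exactly 1.0
@assert all(HV_ref .== 1)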

I pasted a reproducer below.

But I noticed that the resulting mean was wrong for some entries. Interestingly, it was wrong by exactly the amount you would expect if one or more threads had not done their job (they don't seem to store the accumulated value at the end; see the output at the bottom of this post), so I suspected some sort of race condition.

But this is all fixed if I insert a @cuprintf statement before the accumulation starts, although I suspect the exact placement doesn't matter. Whatever synchronization @cuprintf performs seems to be exactly what's needed for all the threads to do the right thing, but replacing it with sync_threads() or sync_warp() doesn't help, so I'm not sure what's going on.

@maleadt suggested it may be another case of JuliaGPU/CUDAnative.jl#4

Apologies for not posting a true MWE; I wasn't sure how to reduce it down. Since it builds on an existing example, hopefully it isn't too bad. If this example is too complicated and not helpful, please close this issue. I'm also happy to reduce the example further if anyone has suggestions.


using CUDAnative, CUDAdrv, CuArrays

# Work-inefficient inclusive scan that reduces along the xy dimensions.
# Useful for computing mean(; dims=[1, 2]) lol.
# Modified from: https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/scan.jl
function gpu_accumulate_xy!(Rxy::CuDeviceArray{T}, Rx, data, op::Function) where T
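    # blockIdx.y selects the vertical level, blockIdx.x the column, and threadIdx.x the row
    # within that column, so each block scans one column of `data` at one level.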
    lvl, lvls = blockIdx().y,  gridDim().y
    col, cols = blockIdx().x,  gridDim().x
    row, rows = threadIdx().x, blockDim().x

    if lvl <= lvls && col <= cols && row <= rows
        shmem = @cuDynamicSharedMem(T, 2*rows)
        shmem[row] = data[row, col, lvl]
        sync_threads()

        # parallel reduction
        pin, pout, offset = 1, 0, 1
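        # `pin` and `pout` select which half of the double-buffered shared memory is read
        # from and written to on each pass of the scan below.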
        while offset < rows
            pout = 1 - pout
            pin = 1 - pin

            if row > offset
                shmem[pout * rows + row] = op(shmem[pin * rows + row], shmem[pin * rows + row - offset])
            else
                shmem[pout * rows + row] = shmem[pin * rows + row]
            end

            sync_threads()
            offset *= UInt32(2)
        end

        shmem[pin * rows + row] = shmem[pout * rows + row]
        sync_threads()

        # write results
        if row == rows
            Rx[1, col, lvl] = shmem[rows]
        end
        sync_threads()

        if col == cols && row == rows
            sum = 0
            @cuprintf(" ")  # Have absolutely no idea why this is "needed" but horizontal average is wrong without it...
            for j in 1:cols
                sum += Rx[1, j, lvl]
            end
            
            Rxy[1, 1, lvl] = sum
        end
    end

    return
end

# Create a 3D array filled with ones.
Nx, Ny, Nz = 64, 64, 64
A = zeros(Nx, Ny, Nz) |> CuArray
A .= 1

# This will store the horizontal average of A. Because every entry of A is 1,
# the mean along dims=[1,2] should be HV[k] = 1 for every k.
HV = CuArray(zeros(1, 1, Nz))

sz = 2Nx * sizeof(eltype(HV))
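# The launch below requests dynamic shared memory matching @cuDynamicSharedMem(T, 2*rows)
# in the kernel: with threads=Nx, each block needs 2*Nx elements of eltype(HV).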
@cuda threads=Nx blocks=(Ny, Nz) shmem=sz gpu_accumulate_xy!(HV, CuArray(zeros(1, Ny, Nz)), A, +)
HV = HV ./ (Nx*Ny)  # Normalize to get the mean along dims=[1,2].

# Print the discrepancy between the computed horizontal average and the
# correct answer (1 for every level). It should be zero everywhere.
print("HV - correct_answer = $(Array(HV)[:] .- 1)\n")

# We can actually calculate how many threads didn't do their job, since the numbers are
# all off by multiples of 0.015625 and 64*0.015625 = 1. This indicates that one or more
# threads may not have stored anything in the accumulated array (instead leaving a 0,
# which leads to the discrepancy).
print("Threads that didn't do their job = $((Array(HV)[:] .- 1) / -0.015625)\n")

Result without the @cuprintf statement:

HV - correct_answer = [-0.046875, 0.0, 0.0, 0.0, -0.03125, -0.015625, 0.0, 0.0, -0.03125, -0.046875, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.0625, 0.0, 0.0, 0.0, 0.0, -0.03125, 0.0, -0.015625, 0.0, -0.015625, 0.0, 0.0, 0.0, 0.0, -0.015625, 0.0, 0.0, 0.0, -0.015625, -0.078125, 0.0, 0.0, -0.078125, -0.015625, 0.0, 0.0, 0.0, 0.0, 0.0, -0.046875, 0.0, 0.0, -0.0625, 0.0, -0.03125, 0.0, -0.015625, 0.0, -0.046875, -0.046875, -0.046875, -0.03125, -0.125, -0.03125, -0.09375]
Threads that didn't do their job = [3.0, -0.0, -0.0, -0.0, 2.0, 1.0, -0.0, -0.0, 2.0, 3.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, 4.0, -0.0, -0.0, -0.0, -0.0, 2.0, -0.0, 1.0, -0.0, 1.0, -0.0, -0.0, -0.0, -0.0, 1.0, -0.0, -0.0, -0.0, 1.0, 5.0, -0.0, -0.0, 5.0, 1.0, -0.0, -0.0, -0.0, -0.0, -0.0, 3.0, -0.0, -0.0, 4.0, -0.0, 2.0, -0.0, 1.0, -0.0, 3.0, 3.0, 3.0, 2.0, 8.0, 2.0, 6.0]

Result with the @cuprintf statement (the leading spaces before the output are presumably the spaces printed by @cuprintf):

                                                                HV - correct_answer = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Threads that didn't do their job = [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]

Labels: bug (Something isn't working), cuda kernels (Stuff about writing CUDA kernels)
