Description
I modified the scan.jl example (which uses dynamic shared memory) to calculate mean(a; dims=[1,2]) for a non-contiguous part of an OffsetArray{CuArray}.
So the extra change I made is that the scan/accumulation is done along the dimension 2 first and stored in a 2D array, then summed along dimension 1 and stored in a 1D array.
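For reference, here is a minimal CPU-side sketch of the quantity the kernel is meant to produce (plain Statistics on the host; the names A_cpu and HV_ref are just illustrative and not part of the reproducer):
    using Statistics
    A_cpu = ones(64, 64, 64)          # same shape and contents as A in the reproducer below
    HV_ref = mean(A_cpu; dims=(1, 2)) # 1×1×64 array; every entry should be exactly 1.0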
I've pasted a reproducer below.
But I noticed that the resulting mean was wrong for some entries. Interestingly, it was off by exactly the amount you would expect if one or more threads had not done their job (i.e. had not stored the accumulated value at the end; see the output at the bottom of this post). So I suspected some sort of race condition.
But this is all fixed if I insert a @cuprintf statement before the final accumulation, although I suspect the exact placement doesn't matter. Whatever synchronization @cuprintf performs seems to be exactly what's needed for all the threads to do the right thing, but replacing @cuprintf with sync_threads() or sync_warp() doesn't help, so I'm not sure what's going on.
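For concreteness, the line in question (marked in the reproducer below) and the substitutions I tried:
    @cuprintf(" ")    # with this line present the horizontal average comes out correct
    # sync_threads()  # substituting this instead does not help
    # sync_warp()     # substituting this instead does not help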
@maleadt suggested it may be another case of JuliaGPU/CUDAnative.jl#4
Apologies for not posting a true MWE; I wasn't sure how to reduce it down. Since it builds on an existing example, hopefully it isn't too bad. If this example turns out to be too complicated to be helpful, please close this issue; I'm also happy to reduce it further if anyone has suggestions.
Reproducer:
using CUDAnative, CUDAdrv, CuArrays
# Work-inefficient inclusive scan that reduces along the xy dimensions.
# Useful for computing mean(A; dims=[1, 2]) lol.
# Modified from: https://github.com/JuliaGPU/CUDAnative.jl/blob/master/examples/scan.jl
function gpu_accumulate_xy!(Rxy::CuDeviceArray{T}, Rx, data, op::Function) where T
    lvl, lvls = blockIdx().y, gridDim().y
    col, cols = blockIdx().x, gridDim().x
    row, rows = threadIdx().x, blockDim().x
    if lvl <= lvls && col <= cols && row <= rows
        shmem = @cuDynamicSharedMem(T, 2*rows)
        shmem[row] = data[row, col, lvl]
        sync_threads()
        # parallel reduction
        pin, pout, offset = 1, 0, 1
        while offset < rows
            pout = 1 - pout
            pin = 1 - pin
            if row > offset
                shmem[pout * rows + row] = op(shmem[pin * rows + row], shmem[pin * rows + row - offset])
            else
                shmem[pout * rows + row] = shmem[pin * rows + row]
            end
            sync_threads()
            offset *= UInt32(2)
        end
        shmem[pin * rows + row] = shmem[pout * rows + row]
        sync_threads()
        # write results
        if row == rows
            Rx[1, col, lvl] = shmem[rows]
        end
        sync_threads()
        if col == cols && row == rows
            sum = 0
            @cuprintf(" ") # Have absolutely no idea why this is "needed" but horizontal average is wrong without it...
            for j in 1:cols
                sum += Rx[1, j, lvl]
            end
            Rxy[1, 1, lvl] = sum
        end
    end
    return
end
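# (On return, Rx[1, col, lvl] holds each column's total and Rxy[1, 1, lvl] holds the
#  full xy sum for level lvl, when everything works as intended.)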
# Create a 3D array filled with ones: A[i, j, k] = 1 everywhere.
Nx, Ny, Nz = 64, 64, 64
A = zeros(Nx, Ny, Nz) |> CuArray
A .= 1
# This will store the horizontal average of A. Because A is all ones,
# the mean along dims=[1,2] should be HV[k] = 1 for every k.
HV = CuArray(zeros(1, 1, Nz))
sz = 2Nx * sizeof(eltype(HV))
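# Each block needs 2*Nx elements of dynamic shared memory: the two halves the
# scan ping-pongs between via pin/pout.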
@cuda threads=Nx blocks=(Ny, Nz) shmem=sz gpu_accumulate_xy!(HV, CuArray(zeros(1, Ny, Nz)), A, +)
HV = HV ./ (Nx*Ny) # Normalize to get the mean along dims=[1,2].
# Print the discrepancy between the computed horizontal average and the
# correct answer (1). Should be zero.
print("HV - correct_answer = $(Array(HV)[:] .- 1)\n")
# We can actually estimate how many threads didn't do their job: the numbers are
# all off by multiples of 0.015625, and 64*0.015625 = 1, suggesting that one or more
# threads did not store anything in the accumulated array (instead leaving
# a 0, which leads to the discrepancy).
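# (One missing column total of Nx ones shifts the mean by Nx/(Nx*Ny) = 64/4096 = 0.015625.)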
print("Threads that didn't do their job = $((Array(HV)[:] .-1) / -0.015625)\n")Result without the @cuprintf statement:
HV - correct_answer = [-0.046875, 0.0, 0.0, 0.0, -0.03125, -0.015625, 0.0, 0.0, -0.03125, -0.046875, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -0.0625, 0.0, 0.0, 0.0, 0.0, -0.03125, 0.0, -0.015625, 0.0, -0.015625, 0.0, 0.0, 0.0, 0.0, -0.015625, 0.0, 0.0, 0.0, -0.015625, -0.078125, 0.0, 0.0, -0.078125, -0.015625, 0.0, 0.0, 0.0, 0.0, 0.0, -0.046875, 0.0, 0.0, -0.0625, 0.0, -0.03125, 0.0, -0.015625, 0.0, -0.046875, -0.046875, -0.046875, -0.03125, -0.125, -0.03125, -0.09375]
Threads that didn't do their job = [3.0, -0.0, -0.0, -0.0, 2.0, 1.0, -0.0, -0.0, 2.0, 3.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, 4.0, -0.0, -0.0, -0.0, -0.0, 2.0, -0.0, 1.0, -0.0, 1.0, -0.0, -0.0, -0.0, -0.0, 1.0, -0.0, -0.0, -0.0, 1.0, 5.0, -0.0, -0.0, 5.0, 1.0, -0.0, -0.0, -0.0, -0.0, -0.0, 3.0, -0.0, -0.0, 4.0, -0.0, 2.0, -0.0, 1.0, -0.0, 3.0, 3.0, 3.0, 2.0, 8.0, 2.0, 6.0]
Result with the @cuprintf statement:
HV - correct_answer = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Threads that didn't do their job = [-0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0, -0.0]