
@atomic is slow within AMDGPU.jl #569

Closed
OsKnoth opened this issue Dec 12, 2023 · 30 comments

@OsKnoth

OsKnoth commented Dec 12, 2023

I have noticed that with KernelAbstractions the use of @atomic is very slow on AMDGPU compared to CUDA.

I have a test example at https://github.com/CliMA/CGDycore.jl/tree/main/TestAMD with results for a kernel with @atomic and the same kernel with the @atomic macros removed. You will find timing results in InfoCUDAAMD. There is an additional data file with the global indices for the atomic gather operation, which is the input for the testExample. Before running the code you have to select the right backend (by commenting out the one you are not using).

All tests are done on the LUMI supercomputer (https://lumi-supercomputer.eu/).

After login you get the following message:
rocm/5.3.3:
This module is not installed the way it is intended by AMD. Hence some parts of ROCm may
be broken or offer slower than expected performance and there is nothing the LUMI
User Support Team can do about that. As the module is also not an official HPE Cray
module there is no guarantee that it will integrate properly with HPE Cray PE modules
or work correctly in that environment. Note that HPE Cray PE programming environments
up to at least 23.02 have only been tested with ROCm version prior to 5.3 by HPE.

@luraess
Collaborator

luraess commented Dec 12, 2023

After login you get the following message:

You get this message after loading the specific ROCm module combination that exposes 5.3.3 instead of 5.2. According to LUMI support, there should not be any issue with this configuration, other than that it may not be fully tested by HPE. I don't think this relates in any way to the slowdown you are experiencing with atomic operations in AMDGPU.jl.

@albert-de-montserrat

I will chime in on this issue. I have the same problem (tested on LUMI as well) with AMDGPU and atomics. If it helps, this is my MWE using ParallelStencil.jl instead of KernelAbstractions.jl:

using ParallelStencil, StaticArrays, Atomix, BenchmarkTools
const use_AMD = true

const TA = @static if use_AMD 
    @init_parallel_stencil(AMDGPU, Float64, 2)

    macro myatomic(expr)
        return esc(quote
            AMDGPU.@atomic $expr
        end)
    end

    ParallelStencil.ROCArray
else
    @init_parallel_stencil(Threads, Float64, 2)

    macro myatomic(expr)
        return esc(quote
            Atomix.@atomic $expr
        end)
    end

    Array
end

function init_particles(nxcell, x, y, dx, dy, nx, ny)
    ncells = nx * ny
    np = nxcell * ncells
    pcoords = TA([SA[(rand()*0.9 + 0.05)*dx, (rand()*0.9 + 0.05)*dy] for _ in 1:np])
    parent_cell = TA([(0, 0) for _ in 1:np])
    
    nx2, ny2 = length(x)-1, length(y)-1
    LinInd = LinearIndices((1:nx2, 1:ny2))

    @parallel_indices (i, j) function fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)
        k = LinInd[i, j] 
        x0, y0 = x[i], y[j] # lower-left corner of the cell
        for l in (1 + nxcell * (k-1)):(nxcell * k)
            pcoords[l] = @SVector [x0 + pcoords[l][1], y0 + pcoords[l][2]]
            parent_cell[l] = (i, j)
        end
        return nothing
    end

    @parallel (1:nx, 1:ny) fill_coords(pcoords, parent_cell, x, y, nxcell, LinInd)

    return pcoords, parent_cell
end

function foo!(F, Fp, buffer, parent_cell)
    np = length(Fp)
    @parallel (1:np) _foo!(F, Fp, buffer, parent_cell)
    return nothing
end

@parallel_indices (ipart) function _foo!(F, Fp, buffer, parent_cell)
    inode, jnode = parent_cell[ipart]
    # iterate over cells around ipart-th node
    for joffset in 0:1
        jvertex = joffset + jnode
        for ioffset in 0:1
            ivertex = ioffset + inode
            @myatomic F[ivertex, jvertex] += ω_i = rand() # a weight function is actually called here
            @myatomic buffer[ivertex, jvertex] += Fp[ipart] * ω_i 
        end
    end

    return nothing
end

let 
    np = 24
    n = 128
    nx = ny = n-1
    ni = nx , ny
    Lx = Ly = 1.0
    xvi = xv, yv = range(0, Lx, length=n), range(0, Ly, length=n)
    dxi = dx, dy = xv[2] - xv[1], yv[2] - yv[1]
    
    pcoords, parent_cell = init_particles(np, xvi..., dx, dy, nx, ny)

    F = @zeros(ni.+1...)
    buffer = similar(F)
    Fp = @rand(length(pcoords))

    @btime foo!($(F, Fp, buffer, parent_cell)...)
    # AMDGPU with atomics:    412.973 ms (45 allocations: 2.80 KiB)
    # AMDGPU without atomics:  82.009 μs (45 allocations: 2.80 KiB)
end

@vchuravy
Member

I will note that you are using floating-point atomics while the OP uses integer atomics. There was a recommendation from LUMI to try out unsafe atomics.

E.g. https://reviews.llvm.org/D91546?id=305522
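
For illustration, the two flavours look like this in a KernelAbstractions kernel with Atomix (a minimal sketch; kernel names and arguments are hypothetical):

using KernelAbstractions, Atomix

@kernel function gather_int!(counts, idx)
    i = @index(Global)
    Atomix.@atomic counts[idx[i]] += 1        # integer atomic add (the OP's gather kernel)
end

@kernel function gather_float!(sums, vals, idx)
    i = @index(Global)
    Atomix.@atomic sums[idx[i]] += vals[i]    # floating-point atomic add (the MWE above)
end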

@OsKnoth
Author

OsKnoth commented Dec 14, 2023

How is this option added to the Julia call?

@luraess
Collaborator

luraess commented Dec 15, 2023

Together with @albert-de-montserrat, we worked on an MWE comparing AMDGPU.jl, KernelAbstractions' ROCBackend(), and HIP C++ performance on the following micro-kernel, which may be somewhat representative of what happens in @albert-de-montserrat's more complete example 👆

function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    AMDGPU.@atomic target1[i1] += v
    AMDGPU.@atomic target1[i2] += v
    AMDGPU.@atomic target1[i3] += v
    AMDGPU.@atomic target1[i4] += v
    AMDGPU.@atomic target2[i1] += v
    AMDGPU.@atomic target2[i2] += v
    AMDGPU.@atomic target2[i3] += v
    AMDGPU.@atomic target2[i4] += v
    return
end

the HIP C++ version would be

__global__ void atomic_kernel(DAT *_target1, DAT *_target2, DAT *_source, int *_indices, const int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int i1 = indices(i, 0);
  int i2 = indices(i, 1);
  int i3 = indices(i, 2);
  int i4 = indices(i, 3);
  DAT v = source(i);
  atomicAdd(&target1(i1), v);
  atomicAdd(&target1(i2), v);
  atomicAdd(&target1(i3), v);
  atomicAdd(&target1(i4), v);
  atomicAdd(&target2(i1), v);
  atomicAdd(&target2(i2), v);
  atomicAdd(&target2(i3), v);
  atomicAdd(&target2(i4), v);
}

Running these codes on LUMI (MI250x - ROCm 5.3.3) and a Radeon 7800XT (Navi3 - ROCm 5.6.1) gives the following timings:

Float64

Radeon 7800 XT

  "julia" => Trial(43.370 μs)
  "reference" => Trial(38.450 μs)
  "julia-ka" => Trial(45.190 μs)

On LUMI

  "julia" => Trial(345.640 μs)
  "reference" => Trial(49.636 μs)
  "julia-ka" => Trial(350.620 μs)

On LUMI with -xhip -munsafe-fp-atomics (for C++ code only)

  "julia" => Trial(351.731 μs)
  "reference" => Trial(12.635 μs)
  "julia-ka" => Trial(357.743 μs)

Similar behaviour is observed for Float32

UInt64

Radeon 7800

  "julia" => Trial(23.400 μs)
  "reference" => Trial(18.530 μs)
  "julia-ka" => Trial(25.320 μs)

LUMI

  "julia" => Trial(39.395 μs)
  "reference" => Trial(12.714 μs)
  "julia-ka" => Trial(42.992 μs)

LUMI with -xhip -munsafe-fp-atomics

  "julia" => Trial(39.085 μs)
  "reference" => Trial(12.624 μs)
  "julia-ka" => Trial(42.531 μs)

Similar behaviour is observed for UInt32

Files can be found in PTsolvers/HPCBenchmarks.jl#1 (CUDA version to be added soon).

@vchuravy
Member

Okay, so on LUMI the C++ code is already doing something quite different from Julia, even without unsafe-fp-atomics.

Can you get the LLVM IR from hipcc on LUMI with something like -emit-llvm -S --offload-device-only?
Also @device_code dir="ir" for Julia on LUMI would be helpful.
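
For the Julia side, a minimal sketch (assuming the amd_atomic_add! kernel and arrays from the MWE above; the launch parameters, and the assumption that gridsize counts workgroups, are illustrative):

using AMDGPU

# dump the compiled kernel, including its LLVM IR, into ./ir
AMDGPU.@device_code dir="ir" @roc groupsize=256 gridsize=cld(length(source), 256) amd_atomic_add!(target1, target2, source, indices)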

@luraess
Collaborator

luraess commented Dec 15, 2023

Okay, so on LUMI the C++ code is already doing something quite different from Julia, even without unsafe-fp-atomics.

Can you get the LLVM IR from hipcc on LUMI with something like -emit-llvm -S --offload-device-only? Also @device_code dir="ir" for Julia on LUMI would be helpful.

I get the following warnings and a compilation failure when adding the suggested flags to https://github.com/PTsolvers/HPCBenchmarks.jl/blob/8b8488231a0d8279233d52866f549d2302e81b82/AMDGPU/atomic.jl#L91

clang-15: warning: -lgcc_s: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lgcc: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lpthread: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lm: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lrt: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,--enable-new-dtags: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -Wl,-rpath=/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/hip/lib:/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/lib: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lamdhip64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: -lclang_rt.builtins-x86_64: 'linker' input unused [-Wunused-command-line-argument]
clang-15: warning: argument unused during compilation: '--shared' [-Wunused-command-line-argument]

@vchuravy
Member

That should be okay. I think you might also need -c since you don't want to link.

@albert-de-montserrat

The --offload-device-only flag still doesn't work for me

clang-14: error: unsupported option '--offload-device-only'

The @device_code output for Julia is here: https://gist.github.com/albert-de-montserrat/ab4a66b1e5a1b547673041e4c5f190e2

@luraess
Collaborator

luraess commented Dec 15, 2023

Still getting the error

ERROR: LoadError: could not load library "./atomic.so"
./atomic.so: invalid ELF header

in

run(`hipcc -emit-llvm -S --offload-device-only -c  -xhip -munsafe-fp-atomics -O3 -o $libname --shared -fPIC atomic.cu`)

Libdl.dlopen("./$libname") do lib

@vchuravy
Member

You don't want --shared -fPIC since your -o should be a .ll

@vchuravy
Member

So we are just emitting atomicrmw fadd double addrspace(1)* %73, double %51 seq_cst, align 8, !dbg !374

So the issue might be that we emit seq_cst instead of acq_rel

@luraess
Collaborator

luraess commented Dec 15, 2023

You don't want --shared -fPIC since your -o should be a .ll, hipcc -xhip -munsafe-fp-atomics -O3 -o out.ll atomic.cu

If I remove it and change the -o from .so to .ll I am then getting

ld.lld: error: undefined symbol: main
>>> referenced by start.S:104 (../sysdeps/x86_64/start.S:104)
>>>               /usr/lib64/gcc/x86_64-suse-linux/7/../../../../lib64/crt1.o:(_start)
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
ERROR: LoadError: failed process: Process(`hipcc -xhip -munsafe-fp-atomics -O3 -o atomic.ll atomic.cu`, ProcessExited(1)) [1]

The "main" function of the external cpp code being extern "C" EXPORT_API void run_benchmark(){...}

@vchuravy
Member

But now you are missing -emit-llvm -S --offload-device-only -c?

@luraess
Collaborator

luraess commented Dec 15, 2023

But now you are missing -emit-llvm -S --offload-device-only -c?

OK, got it now; I was confused. The generated out.ll is out.ll.zip
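
For reference, the command that ended up producing the device IR was presumably along these lines (combining the flags discussed above; file names are illustrative):

run(`hipcc -xhip -munsafe-fp-atomics -O3 -emit-llvm -S --offload-device-only -c -o out.ll atomic.cu`)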

@vchuravy
Member

Can you also send me the one without unsafe-fp-atomics? But this is already promising:

  %41 = load i32, i32 addrspace(1)* %23, align 4, !tbaa !11, !amdgpu.noclobber !5
  %42 = load i32, i32 addrspace(1)* %27, align 4, !tbaa !11, !amdgpu.noclobber !5
  %43 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %37, double %29) #4
  %44 = tail call contract double @llvm.amdgcn.flat.atomic.fadd.f64.p0f64.f64(double* %40, double %29) #4

@luraess
Collaborator

luraess commented Dec 15, 2023

Here is the one without unsafe atomics: out_nounsafe.ll.zip

@vchuravy
Member

Oh wow that is even more interesting.

  %41 = atomicrmw fadd double addrspace(1)* %36, double %29 syncscope("agent-one-as") monotonic, align 8
  %42 = atomicrmw fadd double addrspace(1)* %38, double %29 syncscope("agent-one-as") monotonic, align 8

That is easier. That means AMDGPU.jl is emitting atomics with a stronger ordering than needed.

Try:

using Atomix: @atomic, @atomicswap, @atomicreplace


function amd_atomic_add!(target1, target2, source, indices)
    i = workitemIdx().x + (workgroupIdx().x - 0x1) * workgroupDim().x
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    @atomic :monotonic target1[i1] += v
    @atomic :monotonic target1[i2] += v
    @atomic :monotonic target1[i3] += v
    @atomic :monotonic target1[i4] += v
    @atomic :monotonic target2[i1] += v
    @atomic :monotonic target2[i2] += v
    @atomic :monotonic target2[i3] += v
    @atomic :monotonic target2[i4] += v
    return
end

The only thing we don't handle there is syncscope, so we might need to add that.
(Atomix is the device-portable interface for atomic array accesses, which is why I recommend it for KA.)
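
For the KA ("julia-ka") variant, the analogous kernel would presumably look like this (a sketch; the kernel name is made up and launch code is omitted):

using KernelAbstractions, Atomix

@kernel function ka_atomic_add!(target1, target2, source, indices)
    i = @index(Global, Linear)
    i1, i2, i3, i4 = indices[i, 1], indices[i, 2], indices[i, 3], indices[i, 4]
    v = source[i]
    # relaxed (monotonic) ordering, as suggested above
    Atomix.@atomic :monotonic target1[i1] += v
    Atomix.@atomic :monotonic target1[i2] += v
    Atomix.@atomic :monotonic target1[i3] += v
    Atomix.@atomic :monotonic target1[i4] += v
    Atomix.@atomic :monotonic target2[i1] += v
    Atomix.@atomic :monotonic target2[i2] += v
    Atomix.@atomic :monotonic target2[i3] += v
    Atomix.@atomic :monotonic target2[i4] += v
end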

@vchuravy
Member

This is consistent with Base.@atomic:

help?> @atomic
  @atomic var
  @atomic order ex

  Mark var or ex as being performed atomically, if ex is a supported expression. If no
  order is specified it defaults to :sequentially_consistent.

AMDGPU uses Atomix.jl under the hood.

The unsafe-fp-atomics flag may be something to explore, but I would bet that the biggest issue is the different default choice of atomic ordering.
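
Spelled out on the host side for comparison (a minimal sketch with arbitrary values; assumes Atomix accepts the same ordering arguments as Base.@atomic on plain arrays):

using Atomix

x = zeros(Int, 4)
Atomix.@atomic x[1] += 1              # no ordering given: defaults to :sequentially_consistent
Atomix.@atomic :monotonic x[1] += 1   # relaxed ordering, comparable to what hipcc emits above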

@luraess
Collaborator

luraess commented Dec 18, 2023

Thanks - will check the benchmarks and report

@luraess
Collaborator

luraess commented Dec 18, 2023

Adding the :monotonic keyword to the @atomic macro in Atomix (used in the KA kernel) and AMDGPU gives the following results (NOT using unsafe-fp-atomics in the C++ reference):

On LUMI (Float64)

  "julia" => Trial(53.444 μs)
  "reference" => Trial(49.606 μs)
  "julia-ka" => Trial(58.644 μs)

No effect/change on the Radeon 7800XT

@luraess
Collaborator

luraess commented Dec 18, 2023

Is there a way to have this selected automatically for some use cases, or make it default? Any better strategy than the current one?

@vchuravy
Member

Is there a way to have this selected automatically for some use cases, or make it default? Any better strategy than the current one?

I was thinking about this, but not really. The default is correct and needs to be consistent across architectures.
The docs could probably be improved to call this out.

Fundamentally the problem here is that we get lowered to a cmpxchg loop and so we might need to look at syncscope as well.

@albert-de-montserrat

Thanks @vchuravy ! This also speeds up my MWE by 400 times :)

@OsKnoth
Author

OsKnoth commented Dec 18, 2023

Thanks for the effort. My code speeds up, but it is still 3 times slower than on an A100.

@albert-de-montserrat

Indeed, I also see a difference in the benchmarks that @luraess shared when running on a P100 (on Piz Daint):

  "julia" => Trial(17.575 μs)
  "reference" => Trial(9.815 μs)
  "julia-ka" => Trial(25.806 μs)

@luraess
Collaborator

luraess commented Dec 19, 2023

Indeed, I also see a difference in the benchmarks that @luraess shared when running on a P100 (on Piz Daint):

  "julia" => Trial(17.575 μs)
  "reference" => Trial(9.815 μs)
  "julia-ka" => Trial(25.806 μs)

What are the timings without the flag?

@albert-de-montserrat

albert-de-montserrat commented Dec 19, 2023

Those are without the unsafe flag; I don't know what the equivalent flag for CUDA is. I get the same timings with and without :monotonic here.

@vchuravy
Member

Sorry, could we keep the conversation focused here?
If there is an issue with atomic operations in CUDA.jl, please open an issue there.

I assume that:

My code speeds up, but it is still 3 times slower than on an A100.

Means there is a performance difference between the A100 and the MI250X?
It is hard to tell from afar what the cause of this is. The first thing I would try is to write the code in HIP and make sure that the performance difference is apparent there as well; if it isn't, then we can investigate further.

@pxl-th
Collaborator

pxl-th commented Feb 29, 2024

I think with #604 we can close this.
Feel free to re-open though, if needed.

@pxl-th pxl-th closed this as completed Feb 29, 2024