Add `AMDGPU.@sync` macro by luraess · Pull Request #454 · JuliaGPU/AMDGPU.jl

luraess · 2023-08-02T07:23:04Z

I am trying to add similar macro to AMDGPU as it exists in CUDA.jl, CUDA.@sync.

One reason is that AMDGPU currently returns following for AMDGPU.@sync:

help?> AMDGPU.@sync
  @sync


  Wait until all lexically-enclosed uses of @async, @spawn,
  @spawnat and @distributed are complete. All exceptions thrown
  by enclosed async operations are collected and thrown as a
  CompositeException.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> Threads.nthreads()
  4
  
  julia> @sync begin
             Threads.@spawn println("Thread-id $(Threads.threadid()), task 1")
             Threads.@spawn println("Thread-id $(Threads.threadid()), task 2")
         end;
  Thread-id 3, task 1
  Thread-id 1, task 2


julia>

which seems to be a different then expected behaviour.

Trying to implement the macro, getting inspiration from CUDA.jl, I am facing some issues, namely:

The implementation does not sync the current stream (see perf output from MWE below)
One needs to import Base: @sync which does not seems to be needed in CUDA.jl
I am wondering if the quote should not rather be in following order such that one syncs after ret:

    quote
        local ret = $(esc(code))
        ret
        synchronize()
    end

The MWE I am trying the syncing on is following

using Printf
using AMDGPU

macro inn(A) esc(:( $A[ix+1, iy+1, iz+1] )) end
macro d2_xi(A) esc(:( $A[ix+2, iy+1, iz+1] - $A[ix+1, iy+1, iz+1] - $A[ix+1, iy+1, iz+1] - $A[ix, iy+1, iz+1] )) end
macro d2_yi(A) esc(:( $A[ix+1, iy+2, iz+1] - $A[ix+1, iy+1, iz+1] - $A[ix+1, iy+1, iz+1] - $A[ix+1, iy, iz+1] )) end
macro d2_zi(A) esc(:( $A[ix+1, iy+1, iz+2] - $A[ix+1, iy+1, iz+1] - $A[ix+1, iy+1, iz+1] - $A[ix+1, iy+1, iz] )) end

function lapl!(A2, A, D, dt_dx, dt_dy, dt_dz)
    ix = (workgroupIdx().x - UInt32(1)) * workgroupDim().x + workitemIdx().x
    iy = (workgroupIdx().y - UInt32(1)) * workgroupDim().y + workitemIdx().y
    iz = (workgroupIdx().z - UInt32(1)) * workgroupDim().z + workitemIdx().z
    if (ix < size(A, 1) - 2 && iy < size(A, 2) - 2 && iz < size(A, 3) - 2)
        @inbounds @inn(A2) = @inn(A) + @inn(D) * (dt_dx * @d2_xi(A) + dt_dy * @d2_yi(A) + dt_dz * @d2_zi(A))
    end
    return
end

function compute_roc(A2, A, D, dt_dx, dt_dy, dt_dz, iters, nblocks, nthreads, stream)
    print("Starting the time loop 🚀...")
    tic = time_ns()
    for it = 1:iters
        AMDGPU.@sync @roc gridsize=nblocks groupsize=nthreads #= stream=stream =# lapl!(A2, A, D, dt_dx, dt_dy, dt_dz)
        # AMDGPU.synchronize(stream)
        # AMDGPU.synchronize()
        A, A2 = A2, A
    end
    wtime = (time_ns() - tic) * 1e-9
    println("done")
    return wtime
end

function main(; resol=128)
    # physics
    lx = ly = lz = 10.0
    d0 = 1.0
    # numerics
    nx = ny = nz = resol
    nthreads = (128, 2, 1)
    iters, warmup = 33, 3
    println("Process selecting device $(AMDGPU.device_id(AMDGPU.device()))")
    # preprocessing
    nblocks = cld.((nx, ny, nz), nthreads)
    dx, dy, dz = 1.0 / nx, 1.0 / ny, 1.0 / nz
    dt = min(dx*dx, dy*dy, dz*dz)*d0 / 6.1
    dt_dx, dt_dy, dt_dz = dt / dx, dt / dy, dt / dz
    # init
    A = AMDGPU.rand(Float64, nx, ny, nz)
    D = d0 .* AMDGPU.ones(Float64, nx, ny, nz)
    A2 = copy(A)
    stream = nothing #AMDGPU.HIPStream(:high)
    # action
    # warmup
    compute_roc(A2, A, D, dt_dx, dt_dy, dt_dz, warmup, nblocks, nthreads, stream)
    # time
    wtime = compute_roc(A2, A, D, dt_dx, dt_dy, dt_dz, (iters - warmup), nblocks, nthreads, stream)
    # perf
    A_eff = 3 / 2^30 * nx * ny * nz * sizeof(Float64)
    wtime_it = wtime / (iters - warmup)
    T_eff = A_eff / wtime_it
    @printf("Executed %d steps in = %1.3e sec (@ T_eff = %1.2f GB/s) \n", (iters - warmup), wtime, round(T_eff, sigdigits=3))
    return
end

main(resol=512)

Note that the correct behaviour can be obtained by commenting the AMDGPU.@sync macro and syncing after the kernel with AMDGPU.synchronize (or syncing the specific stream), The expected performance on MI250x is:

Executed 30 steps in = 1.435e-01 sec (@ T_eff = 627.00 GB/s)

while erroneous perf would return e.g.

Executed 30 steps in = 2.051e-04 sec (@ T_eff = 439000.00 GB/s)

luraess added 7 commits August 2, 2023 09:00

Add sync macro

1d61e66

Import base sync

ebacf87

Flip order

0ded4e3

Update

e928992

Fixup!

e86f71e

test

6c5ffeb

Update

9d823fb

luraess requested a review from pxl-th August 2, 2023 07:23

pxl-th added 2 commits August 2, 2023 17:52

Fix sync macro

aa0c6bc

Drop kwargs

b08a3ba

pxl-th marked this pull request as ready for review August 2, 2023 15:33

pxl-th approved these changes Aug 2, 2023

View reviewed changes

pxl-th merged commit 125e21b into master Aug 2, 2023

pxl-th deleted the lr/sync branch August 2, 2023 15:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `AMDGPU.@sync` macro#454

Add `AMDGPU.@sync` macro#454
pxl-th merged 9 commits intomasterfrom
lr/sync

luraess commented Aug 2, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

luraess commented Aug 2, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants