Error triggered by synchronize() #603

Open
williamfgc opened this issue Feb 23, 2024 · 3 comments
@williamfgc

I think I'm missing something basic with synchronization.

When a simple @roc kernel launch inside a function is followed by AMDGPU.synchronize(), that synchronize call throws an error. The stacktrace is in our CI logs; the run uses a recent AMDGPU.jl release (v0.8.6) on an MI100 with ROCm 6.
I don't know whether the first AMDGPU.jl frame in the stacktrace, [4] synchronize (repeats 2 times) @ ~/.julia/packages/AMDGPU/rrvsy/src/highlevel.jl:49 [inlined], provides any hints.

Works:

 @roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)
end

Fails:

 @roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)
  AMDGPU.synchronize()
end

For reference the CUDA code works fine:

  CUDA.@sync @cuda threads = threads blocks = blocks _parallel_for_cuda(f, x...)
end

Any help would be appreciated!

williamfgc changed the title from "Error triggered by synchonize()" to "Error triggered by synchronize()" on Feb 23, 2024
@pxl-th
Collaborator

pxl-th commented Feb 23, 2024

It means there's an exception that's triggered by one of the kernels you run.
Sadly, at the moment it doesn't say much (just GPU Kernel Exception). I had to comment out these lines (link, link) because the functions that participate in exception reporting are not inlined, which leads to maximum scratch memory usage, and that caused issues on the MI-series GPUs.

But you can try uncommenting them and running again to trigger the exception and see in detail what's causing it.
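
To illustrate why the error shows up at synchronize() rather than at the launch line, here is a minimal CPU-only analogy. It does not use AMDGPU.jl at all: a Julia Task stands in for the asynchronously launched kernel, and wait stands in for AMDGPU.synchronize(). The function names launch_failing_work and synchronize_and_report are hypothetical and exist only for this sketch. If the analogy holds, it may also explain why the variant without synchronize() appears to work: the failure would simply never be observed.

```julia
# CPU-only analogy (assumption: a Task stands in for an asynchronously
# launched GPU kernel; none of this is AMDGPU.jl API).

# Hypothetical helper: starts background work that fails.
# Like a kernel launch, it returns immediately without raising anything.
function launch_failing_work()
    return @async error("simulated kernel exception")
end

# Hypothetical helper: waits for the background work and reports the outcome.
# wait plays the role that AMDGPU.synchronize() plays in the issue.
function synchronize_and_report(task::Task)
    try
        wait(task)
        return "ok"
    catch err
        # wait rethrows a background failure wrapped in TaskFailedException.
        return err isa TaskFailedException ? "failed at wait" : "unexpected error"
    end
end

task = launch_failing_work()            # no error is raised here
status = synchronize_and_report(task)   # the failure surfaces here
println(status)
```

In other words, the line that throws is only where the failure is collected; the cause is inside the work that was launched earlier.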

VectorAddLambda: Error During Test at /home/wfg/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:10
  Got exception outside of a @test
  GPU Kernel Exception
  Stacktrace:
    [1] error(s::String)
      @ Base ./error.jl:35
    [2] throw_if_exception(dev::AMDGPU.HIP.HIPDevice)
      @ AMDGPU ~/.julia/packages/AMDGPU/rrvsy/src/exception_handler.jl:122
    [3] synchronize(stm::AMDGPU.HIP.HIPStream; blocking::Bool, stop_hostcalls::Bool)
      @ AMDGPU ~/.julia/packages/AMDGPU/rrvsy/src/highlevel.jl:53
    [4] synchronize (repeats 2 times)

@williamfgc
Author

@pxl-th thanks for the guidance, I will give it a try and report back.

@pxl-th
Collaborator

pxl-th commented Feb 24, 2024

To make it easier, I've pushed a branch, pxl-th/exception, that has proper exception reporting, so you can use it for debugging.
