Performance gap on a 7-point stencil Laplacian kernel on Frontier MI250X GPUs #553

Open · 3 tasks

williamfgc opened this issue on Nov 29, 2023 · 1 comment
In a recent study on Frontier, a 7-point stencil kernel written with AMDGPU.jl underperforms at roughly half the bandwidth (~300 GB/s) of its HIP counterpart (~600 GB/s) on a single MI250X. The behavior is reproduced at large scale, up to 4K GPUs. Note that this was a first attempt, using AMDGPU v0.4.
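For context, here is a minimal sketch of this kind of kernel and its bandwidth measurement, written against current AMDGPU.jl conventions (indexing intrinsics and `gridsize` counted in workgroups, as in v0.5 and later). The kernel name, grid size, and launch configuration are illustrative, not the actual code from the study:

```julia
using AMDGPU

# 7-point stencil Laplacian on the interior of a 3D grid:
# out = -6*in + sum of the six face neighbors.
function laplacian3d!(out, inp, nx, ny, nz)
    ix = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    iy = (workgroupIdx().y - 1) * workgroupDim().y + workitemIdx().y
    iz = (workgroupIdx().z - 1) * workgroupDim().z + workitemIdx().z
    if 1 < ix < nx && 1 < iy < ny && 1 < iz < nz
        @inbounds out[ix, iy, iz] =
            inp[ix-1, iy, iz] + inp[ix+1, iy, iz] +
            inp[ix, iy-1, iz] + inp[ix, iy+1, iz] +
            inp[ix, iy, iz-1] + inp[ix, iy, iz+1] -
            6f0 * inp[ix, iy, iz]
    end
    return
end

nx = ny = nz = 512
inp = AMDGPU.rand(Float32, nx, ny, nz)
out = AMDGPU.zeros(Float32, nx, ny, nz)

groupsize = (256, 1, 1)
gridsize  = cld.((nx, ny, nz), groupsize)  # number of workgroups (v0.5+ convention)

# Warm up (compile), then time one launch.
@roc groupsize=groupsize gridsize=gridsize laplacian3d!(out, inp, nx, ny, nz)
AMDGPU.synchronize()

t = @elapsed begin
    @roc groupsize=groupsize gridsize=gridsize laplacian3d!(out, inp, nx, ny, nz)
    AMDGPU.synchronize()
end

# Effective bandwidth, counting one read and one write per point.
println("effective bandwidth: ", 2 * nx * ny * nz * sizeof(Float32) / t / 1e9, " GB/s")
```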

Some to-do items:

  • Test with AMDGPU.jl v0.5 and later
  • Understand the performance difference relative to the HIP 7-point stencil driven by a Laplacian operator, available here
  • Implement a performance test to guard against regressions (a sketch follows this list)
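On the last item, a minimal sketch of what such a regression guard could look like, assuming the hypothetical `laplacian3d!` kernel from the sketch above plus BenchmarkTools.jl; the baseline and tolerance constants are made-up placeholders, with the ~600 GB/s HIP figure quoted above used as the assumed target:

```julia
using AMDGPU, BenchmarkTools, Test

# Hypothetical regression guard: fail CI when the stencil's effective
# bandwidth drops below a fraction of a recorded baseline.
const BASELINE_GBS = 600.0  # assumed target, the HIP figure quoted above
const TOLERANCE    = 0.8    # allow 20% slack for run-to-run noise

function effective_bandwidth(nx=512, ny=512, nz=512)
    inp = AMDGPU.rand(Float32, nx, ny, nz)
    out = AMDGPU.zeros(Float32, nx, ny, nz)
    gs  = (256, 1, 1)
    gr  = cld.((nx, ny, nz), gs)
    # minimum time over repeated samples, via BenchmarkTools
    t = @belapsed begin
        @roc groupsize=$gs gridsize=$gr laplacian3d!($out, $inp, $nx, $ny, $nz)
        AMDGPU.synchronize()
    end
    return 2 * nx * ny * nz * sizeof(Float32) / t / 1e9  # GB/s
end

@test effective_bandwidth() >= TOLERANCE * BASELINE_GBS
```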

Opening this issue after a discussion in the HPC call with @vchuravy and @gbaraldi.

@luraess (Collaborator) commented on Nov 29, 2023

That's a really good point. Thanks for getting this started. FYI: running AMDGPU.jl on LUMI with AMDGPU v0.7.4 does not show a major performance difference between HIP C++ and AMDGPU.jl. Part of the story could be that AMDGPU.jl switched internally from HSA to HIP, and that @pxl-th did massive refactoring work (thank you!).

While developing FastIce (https://github.com/PTsolvers/FastIce.jl), which should run optimally on LUMI, we encountered a number of challenges with the Julia GPU stack, which now favors Julia task-based parallelism over the event-based model it used to expose. @utkinis therefore started a project called HPCBenchmarks (https://github.com/PTsolvers/HPCBenchmarks.jl), where we compare host overhead, memcpy, and 2D and 3D Laplacian kernels for CUDA.jl and AMDGPU.jl against their respective C++ CUDA and HIP counterparts. The benchmarks are designed to populate a BenchmarkGroup matrix (from BenchmarkTools.jl) that can be used for further analysis (see the sketch below).
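To illustrate the result-matrix idea (not the actual HPCBenchmarks.jl layout; the group names and the placeholder workload are made up), a BenchmarkGroup can be nested per backend and per kernel, then run and compared:

```julia
using BenchmarkTools

# Illustrative backend × kernel matrix; each leaf would hold the real
# GPU benchmark instead of the placeholder workload used here.
suite = BenchmarkGroup()
suite["AMDGPU.jl"] = BenchmarkGroup(["julia"])
suite["HIP"]       = BenchmarkGroup(["c++"])

for backend in keys(suite), kernel in ("memcpy", "laplacian2d", "laplacian3d")
    suite[backend][kernel] = @benchmarkable sleep(0.001)  # placeholder
end

results = run(suite; verbose = true)

# Compare Julia vs. C++ for one kernel, e.g. the 3D Laplacian:
judge(median(results["AMDGPU.jl"]["laplacian3d"]),
      median(results["HIP"]["laplacian3d"]))
```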

I just updated the suite to make sure it runs on both AMDGPU and CUDA; only the host-overhead benchmark still needs fixing.

Suggestion: maybe this repo could be added to the JuliaGPU org and extended with relevant benchmarks, so that it runs as part of GPU CI and can also be pulled and run on HPC-centre CI. Further additions could be:

  • Reporting the result matrix
  • Catching regressions
  • MPI tests
  • Different kernels
