Differences in throughput for an application in HIP/SYCL #3441

jinz2014 · 2024-04-09T14:18:06Z

The attached paper shows that the throughput of the application in SYCL is higher than that of the HIP program, but it does not explain the performance difference.

Unlocking performance portability on LUMI-G supercomputer: A virtual screening case study
3648115.3648125.pdf

bdenhollander · 2024-04-09T21:06:12Z

5.1.1 Software stack. [...] Moreover, we used the HIPIFY tool ⁴ to automatically generate a HIP implementation from the CUDA one, based on HIP 5.3. We have used the ROCm LLVM’s to perform a code build of the HIP version on AMD GPUs

5.2 Single GPU performance portability [...] Moreover, we include an automatically generated HIP version for AMD GPUs, while for NVIDIA GPUs, we include a hand-optimized CUDA version.

It doesn't sound like they made any effort to tune the generated HIP version. The CUDA results on the A100 are double those of AdaptiveCpp, so there's a good chance that a hand-optimized HIP version could also outperform SYCL.
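
To illustrate why an automatically generated HIP port is not a tuned one, here is a toy sketch of source-to-source translation by API renaming. It is not the HIPIFY tool and does not reproduce the paper's code: the `toy_hipify` function, the rename table, and the `scale` kernel are all hypothetical, chosen only to show that this kind of translation rewrites API names and leaves tuning decisions such as the launch configuration untouched.

```python
import re

# Toy subset of CUDA -> HIP renames (assumption: illustrative only;
# the real HIPIFY tools cover far more of the API).
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Try longer names first so cudaMemcpyHostToDevice is not matched as cudaMemcpy.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
)


def toy_hipify(source: str) -> str:
    """Rename known CUDA identifiers to their HIP equivalents; change nothing else."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


# Hypothetical CUDA input with a hard-coded block size of 256.
cuda_src = """#include <cuda_runtime.h>
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
void run(float *h, int n) {
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
}
"""

hip_src = toy_hipify(cuda_src)
print(hip_src)
```

The kernel body and the `<<<(n + 255) / 256, 256>>>` launch come through unchanged. Any choice made with NVIDIA hardware in mind (block size, shared-memory usage, assumptions about a 32-wide warp versus the 64-wide wavefront on AMD CDNA GPUs) is carried over as-is, which is one plausible reason a generated HIP version could trail a hand-tuned one.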

jinz2014 closed this as completed May 2, 2024
