
AdaptiveCpp 24.02.0

@illuhad released this 11 Mar 17:09 · 45 commits to develop since this release · 974adc3

Maxing out SYCL performance

AdaptiveCpp 24.02 introduces multiple compiler improvements, making it one of the best SYCL compilers in the world - and in many cases the best - when it comes to extracting performance from the hardware.

If you are not using it already, try it now and perhaps save some compute time!

The following performance results have been obtained with AdaptiveCpp's generic single-pass compiler (--acpp-targets=generic).

Note: oneAPI by default compiles with -ffast-math, while AdaptiveCpp does not enable fast math by default. All benchmarks have been explicitly compiled with -fno-fast-math to align compiler behavior, except where noted otherwise.

[Figure perf_2402_nvidia: SYCL benchmark performance on NVIDIA hardware]

[Figure perf_2402_amd: SYCL benchmark performance on AMD hardware]
Note: oneAPI for AMD does not correctly round sqrt() calls even if -fno-fast-math is passed, using approximate builtins instead. This loss of precision can substantially skew benchmark results and make them misleading. AdaptiveCpp 24.02 correctly rounds math functions by default. To align precision and the allowed compiler optimizations, AdaptiveCpp was also permitted to use approximate sqrt builtins for the AMD results.

[Figure perf_2402_intel: SYCL benchmark performance on Intel hardware]

Note: AdaptiveCpp was running on the Intel GPU through OpenCL, while DPC++ was using its default backend Level Zero, which allows for more low-level control. Some of the differences may be explained by the different backend runtimes underneath the SYCL implementations.

World's fastest compiler for C++ standard parallelism offload

AdaptiveCpp 24.02 ships with the world's fastest compiler for offloading C++ standard parallelism constructs. While this functionality was already part of 23.10, version 24.02 includes multiple important improvements. It can substantially outperform vendor compilers, and is the world's only compiler that has demonstrated C++ standard parallelism offloading performance across Intel, NVIDIA and AMD hardware. Consider the following performance results for the CloverLeaf, TeaLeaf and miniBUDE benchmarks:

[Figure apps_stdpar_normalized: normalized C++ standard parallelism application performance]

  • The green bars show AdaptiveCpp 24.02 speedup over NVIDIA nvc++ on NVIDIA A100;
  • The red bars show AdaptiveCpp 24.02 speedup over AMD roc-stdpar on AMD Instinct MI100;
  • The blue bars show AdaptiveCpp 24.02 speedup over Intel icpx -fsycl-pstl-offload=gpu on Intel Data Center GPU Max 1550;
  • The dashed blue line indicates performance +/- 20%.

In particular, note that AdaptiveCpp does not depend on the XNACK hardware feature to obtain performance on AMD GPUs. XNACK is an elusive feature that is not available on most consumer hardware and is typically not enabled on production HPC systems.

New features: Highlights

  • No target specification needed anymore! AdaptiveCpp now by default compiles with --acpp-targets=generic. This means that a simple compiler invocation such as acpp -o test -O3 test.cpp will create a binary that can run on Intel, NVIDIA and AMD GPUs. AdaptiveCpp 24.02 is the world's only SYCL compiler that does not require specifying compilation targets to generate a binary that can run "everywhere".
  • New JIT backend: Host CPU. --acpp-targets=generic can now also target the host CPU through the generic JIT compiler. This can lead to performance improvements over the old omp compiler. For example, on AMD Milan, babelstream's dot benchmark was observed to improve from 280 GB/s to 380 GB/s. This also means that it is no longer necessary to target omp to run on the CPU: generic is sufficient, and will likely perform better. Not having to compile for omp explicitly can also reduce compile times noticeably (we observed e.g. ~15% for babelstream).
  • Persistent on-disk kernel cache: AdaptiveCpp 24.02 ships with an on-disk kernel cache for JIT compilations occurring when using --acpp-targets=generic. This can substantially reduce JIT overheads.
  • Automatic runtime specialization of kernels: When using --acpp-targets=generic, AdaptiveCpp can now automatically apply optimizations to kernels at JIT-time based on runtime knowledge. This can lead to noticeable speedups in some cases, although the full potential of this is expected to only become apparent with future AdaptiveCpp versions.
    • This means that achieving the best possible performance might require running the application multiple times, as AdaptiveCpp will try to JIT-compile increasingly specialized kernels with each application run. This can be controlled using the ACPP_ADAPTIVITY_LEVEL environment variable. Set it to 0 to recover the old behavior. The default is currently 1. If you are running benchmarks, you may have to update your benchmarking infrastructure to run applications multiple times.

What's Changed in Detail

Full Changelog: v23.10.0...v24.02.0

New Contributors