
Add C++ standard parallelism offloading support #1088

Merged
merged 82 commits into from
Aug 19, 2023
Conversation

illuhad (Collaborator) commented Jul 24, 2023

This PR adds support for C++ standard parallelism, allowing automatic offloading of C++ STL algorithms of parallel_unsequenced policy to Intel, NVIDIA and AMD GPUs -- potentially even from a single binary (e.g. using --hipsycl-targets=generic). It is enabled using --hipsycl-stdpar.

Until now, only nvc++ could offload standard C++ algorithms, and only to NVIDIA GPUs. This PR makes us the first compiler able to offload standard C++ to "any" GPU.

Not all algorithms are implemented yet. See the documentation below for details. For unimplemented algorithms, the compiler falls back to the regular non-offloaded host implementations. Offloading support for further algorithms will be added over time until coverage is complete.

Note that this is NOT just a SYCL library -- it's a compiler feature, and comes with dedicated compiler support! This is a fundamental difference between the approach taken here and e.g. oneDPL or nvc++. I believe C++ standard parallelism must be a first-class citizen for offloading compilers.
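To make the feature concrete, here is a minimal sketch of the kind of code that qualifies for offloading: a SAXPY expressed as a plain std::transform with the par_unseq policy. This is my own illustrative example, not code from the PR; built with the stdpar flags described here it would be offloaded, while any regular C++17 compiler simply runs it on the host.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// SAXPY (y = a*x + y) as a standard algorithm. With --hipsycl-stdpar and a
// suitable --hipsycl-targets setting, this std::transform is offloaded; the
// vectors' heap storage is transparently turned into USM shared memory.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         std::vector<float> y) {
  std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                 y.begin(),
                 [a](float xi, float yi) { return a * xi + yi; });
  return y;
}
```

On the host, libstdc++ dispatches par_unseq to its TBB-backed parallel implementation; with stdpar offload enabled, the same source dispatches to the GPU instead.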

Here are some BabelStream numbers offloading standard C++ parallelism to an Intel iGPU:

$ HIPSYCL_VISIBILITY_MASK="omp;ze" ./stream 
BabelStream
Version: 4.0
Implementation: STD (data-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
[hipSYCL Warning] backend_loader: Could not load backend plugin: /home/aksel/src/hipSYCL/master/build/install/bin/../lib/hipSYCL/librt-backend-hip.so
[hipSYCL Warning] libamdhip64.so.5: cannot open shared object file: No such file or directory
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        25557.182   0.02101     0.03908     0.02176     
Mul         25975.789   0.02067     0.03717     0.02150     
Add         26636.233   0.03023     0.06069     0.03144     
Triad       27156.416   0.02965     0.05282     0.03095     
Dot         29998.973   0.01790     0.02145     0.01871  

With the regular C++ parallel STL, performance on this machine is around 15 GB/s, so in this case, performance has doubled just by recompiling with --hipsycl-stdpar!

CC @jeffhammond

TODO:

  • We need a new test infrastructure where we can add tests - probably this feature does not fit well into the existing sycl_tests infrastructure.
  • There are currently some issues with falling back to the other, non-offloaded execution policies like par. This needs to be fixed.

C++ standard parallelism support [taken from the documentation contained in this PR]

Open SYCL supports automatic offloading of C++ standard algorithms.

Installation & dependencies

C++ standard parallelism offload requires LLVM >= 14. It is automatically enabled when a sufficiently new LLVM is detected. cmake -DWITH_STDPAR_COMPILER=ON/OFF can be used to explicitly enable or disable it at cmake configure time.
C++ standard parallelism offload is currently only supported in conjunction with libstdc++ >= 11. Other C++ standard library versions may or may not work. Support for libc++ is likely easy to add if there is demand.

Using accelerated C++ standard parallelism

Offloading of C++ standard parallelism is enabled using --opensycl-stdpar. This flag does not by itself imply a target or compilation flow, which will have to be provided in addition using the normal --opensycl-targets argument. C++ standard parallelism is expected to work with any of our clang compiler-based compilation flows, such as omp.accelerated, cuda, hip or the generic SSCP compiler (--opensycl-targets=generic). It is not currently supported in library-only compilation flows. The focus of testing currently is the generic SSCP compiler.

Algorithms and policies supported for offloading

Currently, the following execution policies qualify for offloading:

  • par_unseq

Offloading is implemented for the following STL algorithms:

Algorithm          Notes
for_each
for_each_n
transform          both unary and binary operator overloads
copy
copy_n
copy_if
fill
fill_n
generate
generate_n
replace
replace_if
replace_copy
replace_copy_if
transform_reduce   all overloads
reduce             all overloads
any_of
all_of
none_of

For all other execution policies or algorithms, the code will still compile and execute correctly; however, the regular host implementation provided by the C++ standard library will be invoked and no offloading takes place.

Performance

Performance can generally be expected to be on par with comparable SYCL kernels, although there are some optimizations specific to the C++ standard parallelism model. See the sections on execution and memory model below for details. However, because the implementation of C++ standard parallelism depends heavily on SYCL shared USM (unified shared memory) allocations, the implementation quality of USM at the driver and hardware level can have a great impact on performance, especially for memory-intensive applications.
In particular, on some AMD GPUs USM is known to not perform well due to hardware and driver limitations.
In general, USM relies on memory pages automatically migrating between host and device, depending on where they are accessed. Consequently, patterns where the same memory region is accessed by host and offloaded C++ standard algorithms in alternating fashion should be avoided as much as possible, as this will trigger memory transfers behind the scenes.
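The point about alternating host/device access can be illustrated with a sketch (my own example, not from the PR): do all device-side work in one batch before touching the data on the host again, rather than interleaving host reads with offloaded calls.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Assuming the vector's storage has been turned into a USM shared allocation,
// touching it alternately from host and device forces pages to migrate back
// and forth. Batch the offloaded work first, then access it on the host once.
double sum_scaled(std::vector<double>& data) {
  // Device-side work in one go (offloaded under --hipsycl-stdpar):
  std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                [](double& x) { x *= 2.0; });
  // Only now read the data on the host -- a single migration instead of many:
  return std::accumulate(data.begin(), data.end(), 0.0);
}
```

The anti-pattern would be a loop that reads or writes individual elements on the host between successive offloaded algorithm calls.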

Execution model

Queues and devices

Each thread in the user application maintains a dedicated thread-local in-order SYCL queue that will be used to dispatch STL algorithms. Thus, concurrent operations can be expressed by launching them from separate threads.
The selected device is currently the device returned from the default selector. Use HIPSYCL_VISIBILITY_MASK and/or backend-specific environment variables such as HIP_VISIBLE_DEVICES to control which device this is. Because sycl::event objects are not needed in the C++ standard parallelism model, queues are set up to rely exclusively on the hipSYCL coarse grained events extension. This means that offloading a C++ standard parallel algorithm can potentially have lower overhead compared to submitting a regular SYCL kernel.
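Since each host thread owns its own in-order queue, concurrency between offloaded algorithms can be expressed with plain std::thread. A hypothetical sketch (names are mine, not from the PR):

```cpp
#include <algorithm>
#include <execution>
#include <thread>
#include <vector>

// Each std::thread dispatches through its own thread-local in-order queue,
// so the two for_each calls may execute concurrently on the device.
void process_concurrently(std::vector<int>& a, std::vector<int>& b) {
  std::thread t1([&a] {
    std::for_each(std::execution::par_unseq, a.begin(), a.end(),
                  [](int& x) { x += 1; });
  });
  std::thread t2([&b] {
    std::for_each(std::execution::par_unseq, b.begin(), b.end(),
                  [](int& x) { x *= 2; });
  });
  t1.join();
  t2.join();
}
```

Within a single thread the in-order queue serializes successive algorithm calls, which is what preserves the synchronous C++ semantics.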

Synchronous and asynchronous execution

The C++ STL algorithms are all designed around the assumption of being synchronous. This can become a performance issue, especially when multiple algorithms are executed in succession, as in principle a wait() must be executed after each algorithm is submitted to the device.

To address this issue, a dedicated compiler optimization tries to remove wait() calls between successive calls to offloaded algorithms, such that a wait() will only be executed for the last algorithm invocation. This is possible without side effects if no instructions (particularly loads and stores) are present between the algorithm invocations.
Currently, the analysis is very simplistic and the compiler gives up the optimization attempt early. It is therefore recommended for now to make it as easy as possible for the compiler to spot this opportunity by removing any code between calls to C++ algorithms where possible. This also includes code in the call arguments, such as calls to begin() and end(), which for now are best moved to before the algorithm invocations. Example:

auto first = data.begin();
auto last = data.end();
auto dest_first = dest.begin();
std::for_each(std::execution::par_unseq, first, last, ...);
std::transform(std::execution::par_unseq, first, last, dest_first, ...);

Memory model

Automatic migration of heap allocations to USM shared allocations

C++ is unaware of separate devices with their own device memory. In order to retain C++ semantics, when offloading C++ standard algorithms Open SYCL tries to move all memory allocations that the application performs in translation units compiled with --opensycl-stdpar to SYCL shared USM allocations. To this end, operator new and operator delete are replaced by our own implementations. malloc and other C-style functions are not yet replaced (but this could easily be implemented if there is need).
Note that pointers to host stack memory cannot be used in offloaded C++ algorithms, because we cannot move stack allocations to USM memory! This also means that lambdas passed to C++ algorithms should never capture by reference!
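A short sketch of the capture rule (my own example): stack variables must be captured by value, because only heap allocations are migrated to USM.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// `scale` lives on the host stack, so the lambda must capture it BY VALUE.
// Capturing [&scale] would hand the device a pointer into host stack memory,
// which the stdpar transformation cannot turn into a USM allocation.
void scale_all(std::vector<double>& data, double scale) {
  std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                [scale](double& x) { x *= scale; });  // OK: capture by value
  // BAD: [&scale](double& x) { x *= scale; }  -- reference to host stack
}
```

The vector itself is fine to use: its element storage comes from operator new, which the compiler has redirected to USM shared memory.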

This replacement is performed using a special compiler transformation. This compiler transformation also enforces that the SYCL headers perform regular allocations instead. This is important because in general the SYCL headers construct complex objects such as std::vector or std::shared_ptr which then get handed over to the SYCL runtime library. The runtime library however cannot rely on SYCL USM pointers -- in short: The runtime as the code responsible for managing these allocations cannot itself sit on them. Therefore, the compiler performs non-trivial operations to only selectively replace memory allocations.

The backend used to perform USM allocations is the backend managing the executing device as described in the previous section.

Scope and visibility of replaced functions

Functions for memory allocation are only exchanged for USM variants within translation units compiled with --opensycl-stdpar. Our USM functions for releasing memory are however overriding the standard functions within the entire linkage unit. This is motivated by the expectation that pointers may be shared within the application, and the place where they are released may not be the place where they are created. As our functions for freeing memory can handle both regular and USM allocations, making them more widely available seems like the safer choice. However, our memory release functions are currently not exported to external linkage units, such as shared libraries that the application may load. As such, you should be cautious when transferring ownership of a pointer to an external shared library, as this library may be unable to release the memory if it is a USM allocation!

Note that in C++ due to the one definition rule (ODR) the linker may in certain circumstances pick one definition of a symbol when multiple definitions are available. This can potentially be a problem if a user-defined function is both defined in a translation unit compiled with --opensycl-stdpar and one without it. In this case, there is no guarantee that the linker will pick the variant that does USM allocations. Be aware that the most vulnerable code for this issue might not only be user code directly, but also header-only library code such as std:: functions (think of e.g. the allocations performed by std::vector of common types) as these functions may be commonly used in multiple translation units.
We therefore recommend that if you enable --opensycl-stdpar for one translation unit, you also enable it for the other translation units in your project!

Such issues are not present for the functions defined in the SYCL headers, because the compiler inserts special ABI tags into their symbol names when compiled with --opensycl-stdpar to distinguish them from the regular variants, thus preventing such linking issues. Unfortunately, we cannot do the same for client code because we cannot know if other translation or linkage units will attempt to link against the user code, and expect the unaltered symbol names.

User-controlled USM device pointers

Of course, if you wish to have greater control over memory, USM device pointers from user-controlled USM memory management function calls can also be used, as in any regular SYCL kernel. The buffer-accessor model is not supported; memory stored in sycl::buffer objects can only be used when converted to a USM pointer using our buffer-USM interoperability extension.
Note that you may need to invoke SYCL functions to explicitly copy memory to device and back if you use explicit SYCL device USM allocations.
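As a sketch (mine, not from the PR; it requires a SYCL implementation to build, so it is not runnable as plain C++), explicit device USM combined with an offloaded algorithm could look like this. Raw pointers are valid iterators, but with device rather than shared allocations the host/device copies must be explicit:

```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <sycl/sycl.hpp>

// Sketch: user-managed SYCL device USM used with an offloaded std::for_each.
void scale_on_device(std::vector<double>& host_data) {
  sycl::queue q;
  const std::size_t n = host_data.size();
  double* dev = sycl::malloc_device<double>(n, q);
  q.memcpy(dev, host_data.data(), n * sizeof(double)).wait();

  // Device pointers act as iterators for the offloaded algorithm:
  std::for_each(std::execution::par_unseq, dev, dev + n,
                [](double& x) { x *= 2.0; });

  q.memcpy(host_data.data(), dev, n * sizeof(double)).wait();
  sycl::free(dev, q);
}
```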

Systems with system-level USM support

If you are on a system that supports system-level USM, i.e. a system where every CPU pointer returned from regular memory allocations or even stack pointers can directly be used on GPUs (such as on AMD MI300 or Grace-Hopper), the compiler transformation to turn heap allocations to SYCL USM shared allocations is unnecessary. In this case, you may want to request the compiler to assume system-level USM and disable the compiler transformations regarding SYCL shared USM allocations using --opensycl-stdpar-system-usm.

Functionality supported in device code

The functionality supported in device code aligns with the kernel restrictions from SYCL. This means that no exceptions, dynamic polymorphism, dynamic memory management, or calls to external shared libraries are allowed. Note that much of this functionality is already prohibited by the C++ par_unseq model anyway.

The std:: math functions are supported in device code in an experimental state when using the generic SSCP compilation flow (--opensycl-targets=generic). This is accomplished using a dedicated compiler pass that maps standard functions to our SSCP builtins.
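In practice this means plain std:: math calls can appear in the functor unchanged, as in this sketch of mine; under the generic SSCP flow the compiler pass maps them to device builtins, while on the host they resolve to libm as usual.

```cpp
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

// Plain std::sqrt in an offloadable functor; with --opensycl-targets=generic
// the call is mapped to an SSCP builtin by the dedicated compiler pass.
std::vector<double> apply_sqrt(std::vector<double> v) {
  std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                 [](double x) { return std::sqrt(x); });
  return v;
}
```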

illuhad added 29 commits June 2, 2023 03:23
illuhad (Collaborator, Author) commented Jul 24, 2023

Running on AMD Ryzen 4750U APU:

With --hipsycl-stdpar:

$ ./stream
BabelStream
Version: 4.0
Implementation: STD (data-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        41429.128   0.01296     0.01359     0.01312     
Mul         41191.144   0.01303     0.01432     0.01318     
Add         40084.954   0.02009     0.02530     0.02040     
Triad       39932.806   0.02017     0.02189     0.02038     
Dot         45922.418   0.01169     0.01269     0.01200 

Without (regular host parallel STL):

BabelStream
Version: 4.0
Implementation: STD (data-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        18598.206   0.02887     0.03543     0.02985     
Mul         18479.680   0.02905     0.03477     0.03008     
Add         20497.072   0.03929     0.04874     0.04070     
Triad       20423.192   0.03943     0.04652     0.04077     
Dot         39160.883   0.01371     0.02003     0.01456 

@illuhad illuhad marked this pull request as ready for review August 14, 2023 11:26
@illuhad illuhad merged commit aec54af into develop Aug 19, 2023
36 checks passed
@illuhad illuhad deleted the feature/pstl branch August 19, 2023 02:24