
Add C++ standard parallelism offloading support #1088

Merged
merged 82 commits into from
Aug 19, 2023
Conversation

illuhad (Collaborator) commented Jul 24, 2023

This PR adds support for C++ standard parallelism, allowing automatic offloading of C++ STL algorithms of parallel_unsequenced policy to Intel, NVIDIA and AMD GPUs -- potentially even from a single binary (e.g. using --hipsycl-targets=generic). It is enabled using --hipsycl-stdpar.

Until now, only nvc++ could offload standard C++ algorithms, and only to NVIDIA GPUs. This PR makes us the first compiler able to offload standard C++ to "any" GPU.

Not all algorithms are implemented yet. See the documentation below for details. For unimplemented algorithms, the compiler falls back to the regular non-offloaded host implementations. Offloading support for further algorithms will be added over time until coverage is complete.

Note that this is NOT just a SYCL library -- it's a compiler feature, and comes with dedicated compiler support! This is a fundamental difference between the approach taken here and e.g. oneDPL or nvc++. I believe C++ standard parallelism must be a first-class citizen for offloading compilers.
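To make the feature concrete, here is a minimal sketch of the kind of code that qualifies for offloading: a SAXPY expressed as a plain std::transform with the par_unseq policy. This is my own illustrative example, not code from the PR; built with the stdpar flags described here it would be offloaded, while any regular C++17 compiler simply runs it on the host.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// SAXPY (y = a*x + y) as a standard algorithm. With --hipsycl-stdpar and a
// suitable --hipsycl-targets setting, this std::transform is offloaded; the
// vectors' heap storage is transparently turned into USM shared memory.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         std::vector<float> y) {
  std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                 y.begin(),
                 [a](float xi, float yi) { return a * xi + yi; });
  return y;
}
```

On the host, libstdc++ dispatches par_unseq to its TBB-backed parallel implementation; with stdpar offload enabled, the same source dispatches to the GPU instead.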

Here are some BabelStream numbers offloading standard C++ parallelism to an Intel iGPU:

$ HIPSYCL_VISIBILITY_MASK="omp;ze" ./stream 
BabelStream
Version: 4.0
Implementation: STD (data-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
[hipSYCL Warning] backend_loader: Could not load backend plugin: /home/aksel/src/hipSYCL/master/build/install/bin/../lib/hipSYCL/librt-backend-hip.so
[hipSYCL Warning] libamdhip64.so.5: cannot open shared object file: No such file or directory
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        25557.182   0.02101     0.03908     0.02176     
Mul         25975.789   0.02067     0.03717     0.02150     
Add         26636.233   0.03023     0.06069     0.03144     
Triad       27156.416   0.02965     0.05282     0.03095     
Dot         29998.973   0.01790     0.02145     0.01871  

With the regular C++ parallel STL, performance on this machine is around 15 GB/s, so in this case, performance has doubled just by recompiling with --hipsycl-stdpar!

CC @jeffhammond

TODO:

  • We need a new test infrastructure where we can add tests - probably this feature does not fit well into the existing sycl_tests infrastructure.
  • There are currently some issues with falling back to the other, non-offloaded execution policies like par. This needs to be fixed.

C++ standard parallelism support [taken from the documentation contained in this PR]

Open SYCL supports automatic offloading of C++ standard algorithms.

Installation & dependencies

C++ standard parallelism offload requires LLVM >= 14. It is automatically enabled when a sufficiently new LLVM is detected. cmake -DWITH_STDPAR_COMPILER=ON/OFF can be used to explicitly enable or disable it at cmake configure time.
C++ standard parallelism offload is currently only supported in conjunction with libstdc++ >= 11. Other C++ standard library versions may or may not work. Support for libc++ is likely easy to add if there is demand.

Using accelerated C++ standard parallelism

Offloading of C++ standard parallelism is enabled using --opensycl-stdpar. This flag does not by itself imply a target or compilation flow, which will have to be provided in addition using the normal --opensycl-targets argument. C++ standard parallelism is expected to work with any of our clang compiler-based compilation flows, such as omp.accelerated, cuda, hip or the generic SSCP compiler (--opensycl-targets=generic). It is not currently supported in library-only compilation flows. The focus of testing currently is the generic SSCP compiler.

Algorithms and policies supported for offloading

Currently, the following execution policies qualify for offloading:

  • par_unseq

Offloading is implemented for the following STL algorithms:

Algorithm          Notes
for_each
for_each_n
transform          both unary and binary operator overloads
copy
copy_n
copy_if
fill
fill_n
generate
generate_n
replace
replace_if
replace_copy
replace_copy_if
transform_reduce   all overloads
reduce             all overloads
any_of
all_of
none_of

For all other execution policies or algorithms, the code will still compile and execute correctly; however, the regular host implementation provided by the C++ standard library will be invoked and no offloading takes place.

Performance

Performance can generally be expected to be on par with comparable SYCL kernels, although there are some optimizations specific to the C++ standard parallelism model. See the sections on execution and memory model below for details. However, because the implementation of C++ standard parallelism depends heavily on SYCL shared USM (unified shared memory) allocations, the implementation quality of USM at the driver and hardware level can have a great impact on performance, especially for memory-intensive applications.
In particular, on some AMD GPUs USM is known to not perform well due to hardware and driver limitations.
In general, USM relies on memory pages automatically migrating between host and device, depending on where they are accessed. Consequently, patterns where the same memory region is accessed by host and offloaded C++ standard algorithms in alternating fashion should be avoided as much as possible, as this will trigger memory transfers behind the scenes.
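The point about alternating host/device access can be illustrated with a sketch (my own example, not from the PR): do all device-side work in one batch before touching the data on the host again, rather than interleaving host reads with offloaded calls.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Assuming the vector's storage has been turned into a USM shared allocation,
// touching it alternately from host and device forces pages to migrate back
// and forth. Batch the offloaded work first, then access it on the host once.
double sum_scaled(std::vector<double>& data) {
  // Device-side work in one go (offloaded under --hipsycl-stdpar):
  std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                [](double& x) { x *= 2.0; });
  // Only now read the data on the host -- a single migration instead of many:
  return std::accumulate(data.begin(), data.end(), 0.0);
}
```

The anti-pattern would be a loop that reads or writes individual elements on the host between successive offloaded algorithm calls.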

Execution model

Queues and devices

Each thread in the user application maintains a dedicated thread-local in-order SYCL queue that will be used to dispatch STL algorithms. Thus, concurrent operations can be expressed by launching them from separate threads.
The selected device is currently the device returned from the default selector. Use HIPSYCL_VISIBILITY_MASK and/or backend-specific environment variables such as HIP_VISIBLE_DEVICES to control which device this is. Because sycl::event objects are not needed in the C++ standard parallelism model, queues are set up to rely exclusively on the hipSYCL coarse grained events extension. This means that offloading a C++ standard parallel algorithm can potentially have lower overhead compared to submitting a regular SYCL kernel.
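Since each host thread owns its own in-order queue, concurrency between offloaded algorithms can be expressed with plain std::thread. A hypothetical sketch (names are mine, not from the PR):

```cpp
#include <algorithm>
#include <execution>
#include <thread>
#include <vector>

// Each std::thread dispatches through its own thread-local in-order queue,
// so the two for_each calls may execute concurrently on the device.
void process_concurrently(std::vector<int>& a, std::vector<int>& b) {
  std::thread t1([&a] {
    std::for_each(std::execution::par_unseq, a.begin(), a.end(),
                  [](int& x) { x += 1; });
  });
  std::thread t2([&b] {
    std::for_each(std::execution::par_unseq, b.begin(), b.end(),
                  [](int& x) { x *= 2; });
  });
  t1.join();
  t2.join();
}
```

Within a single thread the in-order queue serializes successive algorithm calls, which is what preserves the synchronous C++ semantics.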

Synchronous and asynchronous execution

The C++ STL algorithms are all designed around the assumption of being synchronous. This can become a performance issue, especially when multiple algorithms are executed in succession, as in principle a wait() must be executed after each algorithm is submitted to the device.

To address this issue, a dedicated compiler optimization tries to remove wait() calls between successive calls to offloaded algorithms, such that a wait() will only be executed for the last algorithm invocation. This is possible without side effects if no instructions (particularly loads and stores) are present between the algorithm invocations.
Currently, the analysis is very simplistic and the compiler gives up the optimization attempt early. It is therefore recommended for now to make it as easy as possible for the compiler to spot this opportunity by removing any code between calls to C++ algorithms where possible. This also includes code in the call arguments, such as calls to begin() and end(), which for now are best moved to before the algorithm invocations. Example:

auto first = data.begin();
auto last = data.end();
auto dest_first = dest.begin();
std::for_each(std::execution::par_unseq, first, last, ...);
std::transform(std::execution::par_unseq, first, last, dest_first, ...);

Memory model

Automatic migration of heap allocations to USM shared allocations

C++ is unaware of separate devices with their own device memory. In order to retain C++ semantics, when offloading C++ standard algorithms Open SYCL tries to move all memory allocations that the application performs in translation units compiled with --opensycl-stdpar to SYCL shared USM allocations. To this end, operator new and operator delete are replaced by our own implementations. malloc and other C-style functions are not yet replaced (but this could easily be implemented if there is need).
Note that pointers to host stack memory cannot be used in offloaded C++ algorithms, because we cannot move stack allocations to USM memory! This also means that lambdas passed to C++ algorithms should never capture by reference!
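A short sketch of the capture rule (my own example): stack variables must be captured by value, because only heap allocations are migrated to USM.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// `scale` lives on the host stack, so the lambda must capture it BY VALUE.
// Capturing [&scale] would hand the device a pointer into host stack memory,
// which the stdpar transformation cannot turn into a USM allocation.
void scale_all(std::vector<double>& data, double scale) {
  std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                [scale](double& x) { x *= scale; });  // OK: capture by value
  // BAD: [&scale](double& x) { x *= scale; }  -- reference to host stack
}
```

The vector itself is fine to use: its element storage comes from operator new, which the compiler has redirected to USM shared memory.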

This replacement is performed using a special compiler transformation. This compiler transformation also enforces that the SYCL headers perform regular allocations instead. This is important because in general the SYCL headers construct complex objects such as std::vector or std::shared_ptr which then get handed over to the SYCL runtime library. The runtime library however cannot rely on SYCL USM pointers -- in short: The runtime as the code responsible for managing these allocations cannot itself sit on them. Therefore, the compiler performs non-trivial operations to only selectively replace memory allocations.

The backend used to perform USM allocations is the backend managing the executing device as described in the previous section.

Scope and visibility of replaced functions

Functions for memory allocation are only exchanged for USM variants within translation units compiled with --opensycl-stdpar. Our USM functions for releasing memory are however overriding the standard functions within the entire linkage unit. This is motivated by the expectation that pointers may be shared within the application, and the place where they are released may not be the place where they are created. As our functions for freeing memory can handle both regular and USM allocations, making them more widely available seems like the safer choice. However, our memory release functions are currently not exported to external linkage units, such as shared libraries that the application may load. As such, you should be cautious when transferring ownership of a pointer to an external shared library, as this library may be unable to release the memory if it is a USM allocation!

Note that in C++ due to the one definition rule (ODR) the linker may in certain circumstances pick one definition of a symbol when multiple definitions are available. This can potentially be a problem if a user-defined function is both defined in a translation unit compiled with --opensycl-stdpar and one without it. In this case, there is no guarantee that the linker will pick the variant that does USM allocations. Be aware that the most vulnerable code for this issue might not only be user code directly, but also header-only library code such as std:: functions (think of e.g. the allocations performed by std::vector of common types) as these functions may be commonly used in multiple translation units.
We therefore recommend that if you enable --opensycl-stdpar for one translation unit, you also enable it for the other translation units in your project!

Such issues are not present for the functions defined in the SYCL headers, because the compiler inserts special ABI tags into their symbol names when compiled with --opensycl-stdpar to distinguish them from the regular variants, thus preventing such linking issues. Unfortunately, we cannot do the same for client code because we cannot know if other translation or linkage units will attempt to link against the user code, and expect the unaltered symbol names.

User-controlled USM device pointers

Of course, if you wish to have greater control over memory, USM device pointers from user-controlled USM memory management function calls can also be used, as in any regular SYCL kernel. The buffer-accessor model is not supported; memory stored in sycl::buffer objects can only be used when converted to a USM pointer using our buffer-USM interoperability extension.
Note that you may need to invoke SYCL functions to explicitly copy memory to device and back if you use explicit SYCL device USM allocations.
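As a sketch (mine, not from the PR; it requires a SYCL implementation to build, so it is not runnable as plain C++), explicit device USM combined with an offloaded algorithm could look like this. Raw pointers are valid iterators, but with device rather than shared allocations the host/device copies must be explicit:

```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <sycl/sycl.hpp>

// Sketch: user-managed SYCL device USM used with an offloaded std::for_each.
void scale_on_device(std::vector<double>& host_data) {
  sycl::queue q;
  const std::size_t n = host_data.size();
  double* dev = sycl::malloc_device<double>(n, q);
  q.memcpy(dev, host_data.data(), n * sizeof(double)).wait();

  // Device pointers act as iterators for the offloaded algorithm:
  std::for_each(std::execution::par_unseq, dev, dev + n,
                [](double& x) { x *= 2.0; });

  q.memcpy(host_data.data(), dev, n * sizeof(double)).wait();
  sycl::free(dev, q);
}
```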

Systems with system-level USM support

If you are on a system that supports system-level USM, i.e. a system where every CPU pointer returned from regular memory allocations or even stack pointers can directly be used on GPUs (such as on AMD MI300 or Grace-Hopper), the compiler transformation to turn heap allocations to SYCL USM shared allocations is unnecessary. In this case, you may want to request the compiler to assume system-level USM and disable the compiler transformations regarding SYCL shared USM allocations using --opensycl-stdpar-system-usm.

Functionality supported in device code

The functionality supported in device code aligns with the kernel restrictions from SYCL. This means that no exceptions, dynamic polymorphism, dynamic memory management, or calls to external shared libraries are allowed. Note that much of this functionality is already prohibited by the C++ par_unseq model anyway.

The std:: math functions are supported in device code in an experimental state when using the generic SSCP compilation flow (--opensycl-targets=generic). This is accomplished using a dedicated compiler pass that maps standard functions to our SSCP builtins.
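In practice this means plain std:: math calls can appear in the functor unchanged, as in this sketch of mine; under the generic SSCP flow the compiler pass maps them to device builtins, while on the host they resolve to libm as usual.

```cpp
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

// Plain std::sqrt in an offloadable functor; with --opensycl-targets=generic
// the call is mapped to an SSCP builtin by the dedicated compiler pass.
std::vector<double> apply_sqrt(std::vector<double> v) {
  std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                 [](double x) { return std::sqrt(x); });
  return v;
}
```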

illuhad added 29 commits June 2, 2023 03:23
illuhad (Collaborator, Author) commented Jul 24, 2023

Running on AMD Ryzen 4750U APU:

With --hipsycl-stdpar:

$ ./stream
BabelStream
Version: 4.0
Implementation: STD (data-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        41429.128   0.01296     0.01359     0.01312     
Mul         41191.144   0.01303     0.01432     0.01318     
Add         40084.954   0.02009     0.02530     0.02040     
Triad       39932.806   0.02017     0.02189     0.02038     
Dot         45922.418   0.01169     0.01269     0.01200 

Without (regular host parallel STL):

BabelStream
Version: 4.0
Implementation: STD (data-oriented)
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        18598.206   0.02887     0.03543     0.02985     
Mul         18479.680   0.02905     0.03477     0.03008     
Add         20497.072   0.03929     0.04874     0.04070     
Triad       20423.192   0.03943     0.04652     0.04077     
Dot         39160.883   0.01371     0.02003     0.01456 

@illuhad illuhad marked this pull request as ready for review August 14, 2023 11:26
@illuhad illuhad merged commit aec54af into develop Aug 19, 2023
36 checks passed
@illuhad illuhad deleted the feature/pstl branch August 19, 2023 02:24