Add C++ standard parallelism offloading support #1088
Merged
This PR adds support for C++ standard parallelism, allowing automatic offloading of C++ STL algorithms with the parallel unsequenced (par_unseq) policy to Intel, NVIDIA and AMD GPUs -- potentially even from a single binary (e.g. using --hipsycl-targets=generic). It is enabled using --hipsycl-stdpar.

Currently, only nvc++ can offload standard C++ algorithms, and only to NVIDIA GPUs. This PR makes us the first compiler able to offload standard C++ to "any" GPU.
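For illustration, here is a minimal sketch of what such an offloadable algorithm invocation looks like (the function and variable names are mine, not from the PR):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// SAXPY-style transform. Compiled with --hipsycl-stdpar, the par_unseq
// invocation below is offloaded to the selected GPU; compiled without the
// flag, the exact same source runs the regular host parallel STL.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         std::vector<float> y) {
  std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                 y.begin(), [a](float xi, float yi) { return a * xi + yi; });
  return y;
}
```

No source changes are required to enable offloading; only the compiler invocation differs.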
Not all algorithms are implemented yet. See the documentation below for details. For unimplemented algorithms, the compiler will fall back to the regular, non-offloaded host implementations. Offloading support for further algorithms will be added in the future, so that coverage becomes complete over time.
Note that this is NOT just a SYCL library -- it's a compiler feature and comes with dedicated compiler support! This is also a fundamental difference between the approach taken here and e.g. oneDPL or nvc++. I believe C++ standard parallelism must be a first-class citizen for offloading compilers.
Here are some BabelStream numbers offloading standard C++ parallelism to an Intel iGPU:
With the regular C++ parallel STL, performance on this machine is around 15 GB/s, so in this case performance has doubled just by recompiling with --hipsycl-stdpar!

CC @jeffhammond
TODO:
sycl_tests
infrastructure.par
. This needs to be fixed.

C++ standard parallelism support [taken from the documentation contained in this PR]
Open SYCL supports automatic offloading of C++ standard algorithms.
Installation & dependencies
C++ standard parallelism offload requires LLVM >= 14. It is automatically enabled when a sufficiently new LLVM is detected.
cmake -DWITH_STDPAR_COMPILER=ON/OFF can be used to explicitly enable or disable it at cmake configure time.

C++ standard parallelism offload is currently only supported in conjunction with libstdc++ >= 11. Other C++ standard library versions may or may not work. Support for libc++ is likely easy to add if there is demand.

Using accelerated C++ standard parallelism
Offloading of C++ standard parallelism is enabled using --opensycl-stdpar. This flag does not by itself imply a target or compilation flow, which will have to be provided in addition using the normal --opensycl-targets argument. C++ standard parallelism is expected to work with any of our clang compiler-based compilation flows, such as omp.accelerated, cuda, hip or the generic SSCP compiler (--opensycl-targets=generic). It is not currently supported in library-only compilation flows. The focus of testing currently is the generic SSCP compiler.

Algorithms and policies supported for offloading
Currently, the following execution policies qualify for offloading:
par_unseq
Offloading is implemented for the following STL algorithms:
for_each
for_each_n
transform
copy
copy_n
copy_if
fill
fill_n
generate
generate_n
replace
replace_if
replace_copy
replace_copy_if
transform_reduce
reduce
any_of
all_of
none_of
For all other execution policies or algorithms, the code will compile and execute correctly; however, the regular host implementation of the algorithm provided by the C++ standard library will be invoked, and no offloading takes place.
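The distinction can be sketched as follows (a hedged illustration; function names are mine). Only the first call combines a supported policy (par_unseq) with a supported algorithm (reduce) and therefore qualifies for offloading; the second uses the par policy, which is not in the list above, so it runs on the host:

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Qualifies for offloading: par_unseq + reduce are both supported.
double sum_offloadable(const std::vector<double>& v) {
  return std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0.0);
}

// Falls back to the host: par is not a supported policy for offloading.
double sum_host_fallback(const std::vector<double>& v) {
  return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
}
```

Both functions return the same result; only the execution location differs.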
Performance
Performance can generally be expected to be on par with comparable SYCL kernels, although there are some optimizations specific to the C++ standard parallelism model. See the sections on execution and memory model below for details. However, because the implementation of C++ standard parallelism depends heavily on SYCL shared USM (unified shared memory) allocations, the implementation quality of USM at the driver and hardware level can have a great impact on performance, especially for memory-intensive applications.
In particular, on some AMD GPUs USM is known to not perform well due to hardware and driver limitations.
In general, USM relies on memory pages automatically migrating between host and device, depending on where they are accessed. Consequently, patterns where the same memory region is accessed by host and offloaded C++ standard algorithms in alternating fashion should be avoided as much as possible, as this will trigger memory transfers behind the scenes.
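The pattern to avoid can be sketched like this (all names are illustrative):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Bad: every host-side write between offloaded algorithms forces the shared
// USM pages to migrate back to the host and then to the device again.
void ping_pong(std::vector<int>& data) {
  for (int i = 0; i < 4; ++i) {
    std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                  [](int& x) { x += 1; });  // pages migrate to the device
    data[0] = 0;                            // pages migrate back to the host
  }
}

// Better: batch the offloaded work and touch the data on the host only once.
void batched(std::vector<int>& data) {
  for (int i = 0; i < 4; ++i)
    std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                  [](int& x) { x += 1; });
  data[0] = 0;
}
```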
Execution model
Queues and devices
Each thread in the user application maintains a dedicated thread-local in-order SYCL queue that will be used to dispatch STL algorithms. Thus, concurrent operations can be expressed by launching them from separate threads.
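A hedged sketch of this pattern (names are illustrative): because each std::thread owns a dedicated in-order queue, the two offloaded algorithms below may execute concurrently on the device.

```cpp
#include <algorithm>
#include <execution>
#include <thread>
#include <vector>

// Each thread dispatches to its own thread-local in-order queue, so the two
// fills can overlap on the device.
void concurrent_fill(std::vector<int>& a, std::vector<int>& b) {
  std::thread t1([&a] {
    std::fill(std::execution::par_unseq, a.begin(), a.end(), 1);
  });
  std::thread t2([&b] {
    std::fill(std::execution::par_unseq, b.begin(), b.end(), 2);
  });
  t1.join();
  t2.join();
}
```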
The selected device is currently the device returned from the default selector. Use HIPSYCL_VISIBILITY_MASK and/or backend-specific environment variables such as HIP_VISIBLE_DEVICES to control which device this is. Because sycl::event objects are not needed in the C++ standard parallelism model, queues are set up to rely exclusively on the hipSYCL coarse-grained events extension. This means that offloading a C++ standard parallel algorithm can potentially have lower overhead compared to submitting a regular SYCL kernel.

Synchronous and asynchronous execution
The C++ STL algorithms are all designed around the assumption of being synchronous. This can become a performance issue, especially when multiple algorithms are executed in succession, as in principle a wait() must be executed after each algorithm is submitted to the device.

To address this issue, a dedicated compiler optimization tries to remove wait() calls between successive calls to offloaded algorithms, such that a wait() will only be executed for the last algorithm invocation. This is possible without side effects if no instructions (particularly loads and stores) are present between the algorithm invocations.

Currently, the analysis is very simplistic and the compiler gives up the optimization attempt early. Therefore, it is recommended for now to make it as easy as possible for the compiler to spot this opportunity by removing any code between calls to C++ algorithms, if possible. This also includes code in the call arguments, such as calls to begin() and end(), which for now are better moved to before the algorithm invocation.

Memory model
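The recommendation above (hoist begin()/end() out of the call arguments, keep no code between successive algorithm calls) can be sketched as follows; this is a hedged illustration with names of my choosing:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// begin()/end() are hoisted out of the call arguments and nothing sits
// between the two offloaded algorithms, so the compiler can elide the
// intermediate wait() and synchronize only once, after the second call.
void two_steps(std::vector<float>& data) {
  auto first = data.begin();
  auto last = data.end();
  std::for_each(std::execution::par_unseq, first, last,
                [](float& x) { x *= 2.0f; });
  std::for_each(std::execution::par_unseq, first, last,
                [](float& x) { x += 1.0f; });
}
```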
Automatic migration of heap allocations to USM shared allocations
C++ is unaware of separate devices with their own device memory. In order to retain C++ semantics, when offloading C++ standard algorithms Open SYCL tries to move all memory allocations that the application performs in translation units compiled with --opensycl-stdpar to SYCL shared USM allocations. To this end, operator new and operator delete are replaced by our own implementations. malloc and other C-style functions are not yet replaced (but this could easily be implemented if there is need).

Note that pointers to host stack memory cannot be used in offloaded C++ algorithms, because we cannot move stack allocations to USM memory! This also means that lambdas passed to C++ algorithms should never capture by reference!
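A sketch of this pitfall (names are illustrative): capturing a stack variable by reference would hand the device a pointer into host stack memory, which cannot be migrated to USM; capturing by value gives the kernel a private copy and is safe.

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Wrong: [&offset](int& x) { x += offset; } -- a reference capture points
// into the host stack and is broken in an offloaded algorithm.
// Correct: capture by value, as below.
int add_offset(std::vector<int>& data, int offset) {
  std::for_each(std::execution::par_unseq, data.begin(), data.end(),
                [offset](int& x) { x += offset; });
  return data.front();
}
```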
This replacement is performed using a special compiler transformation. This compiler transformation also enforces that the SYCL headers perform regular allocations instead. This is important because in general the SYCL headers construct complex objects such as std::vector or std::shared_ptr which then get handed over to the SYCL runtime library. The runtime library, however, cannot rely on SYCL USM pointers -- in short: the runtime, as the code responsible for managing these allocations, cannot itself sit on them. Therefore, the compiler performs non-trivial operations to only selectively replace memory allocations.

The backend used to perform USM allocations is the backend managing the executing device, as described in the previous section.
Scope and visibility of replaced functions
Functions for memory allocation are only exchanged for USM variants within translation units compiled with --opensycl-stdpar. Our USM functions for releasing memory, however, override the standard functions within the entire linkage unit. This is motivated by the expectation that pointers may be shared within the application, and the place where they are released may not be the place where they are created. As our functions for freeing memory can handle both regular and USM allocations, making them more widely available seems like the safer choice. However, our memory release functions are currently not exported to external linkage units, such as shared libraries that the application may load. As such, you should be cautious when transferring ownership of a pointer to an external shared library, as this library may be unable to release the memory if it is a USM allocation!

Note that in C++, due to the one definition rule (ODR), the linker may in certain circumstances pick one definition of a symbol when multiple definitions are available. This can potentially be a problem if a user-defined function is defined both in a translation unit compiled with --opensycl-stdpar and in one without it. In this case, there is no guarantee that the linker will pick the variant that does USM allocations. Be aware that the code most vulnerable to this issue might not only be user code directly, but also header-only library code such as std:: functions (think of e.g. the allocations performed by std::vector of common types), as these functions may be used in multiple translation units.

We therefore recommend that if you enable --opensycl-stdpar for one translation unit, you also enable it for the other translation units in your project!

Such issues are not present for the functions defined in the SYCL headers, because the compiler inserts special ABI tags into their symbol names when compiled with --opensycl-stdpar to distinguish them from the regular variants, thus preventing such linking issues. Unfortunately, we cannot do the same for client code, because we cannot know if other translation or linkage units will attempt to link against the user code and expect the unaltered symbol names.

User-controlled USM device pointers
Of course, if you wish to have greater control over memory, USM device pointers from user-controlled USM memory management function calls can also be used, as in any regular SYCL kernel. The buffer-accessor model is not supported; memory stored in sycl::buffer objects can only be used after converting it to a USM pointer using our buffer-USM interoperability extension.

Note that you may need to invoke SYCL functions to explicitly copy memory to device and back if you use explicit SYCL device USM allocations.
Systems with system-level USM support
If you are on a system that supports system-level USM, i.e. a system where every CPU pointer returned from regular memory allocations or even stack pointers can be used directly on GPUs (such as AMD MI300 or Grace Hopper), the compiler transformation to turn heap allocations into SYCL USM shared allocations is unnecessary. In this case, you may want to request that the compiler assume system-level USM and disable the compiler transformations regarding SYCL shared USM allocations using --opensycl-stdpar-system-usm.

Functionality supported in device code
The functionality supported in device code aligns with the kernel restrictions from SYCL. This means that no exceptions, dynamic polymorphism, dynamic memory management, or calls to external shared libraries are allowed. Note that this functionality might already be prohibited in the C++ par_unseq model anyway.

The std:: math functions are supported in device code in an experimental state when using the generic SSCP compilation flow (--opensycl-targets=generic). This is accomplished using a dedicated compiler pass that maps standard functions to our SSCP builtins.
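An illustrative sketch of this (the helper name is mine): with the generic SSCP flow, a call such as std::sqrt inside an offloaded algorithm is remapped to the corresponding SSCP builtin by the compiler pass described above.

```cpp
#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

// std:: math used inside an offloaded algorithm; under
// --opensycl-targets=generic the std::sqrt call is mapped to a device
// builtin by the dedicated compiler pass.
void sqrt_all(std::vector<double>& v) {
  std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                 [](double x) { return std::sqrt(x); });
}
```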