Add generic SSCP compilation flow: Single pass compiler to generic LLVM IR + runtime JIT #862

Merged
merged 166 commits into develop from feature/generic-llvm-sscp
Jan 10, 2023

illuhad
Collaborator

@illuhad illuhad commented Nov 8, 2022

This PR adds the generic SSCP (single-source, single compiler pass) compilation flow.

It is still far from complete, but since it can now do things on multiple backends I think it is ripe for a draft PR to gather feedback.

Currently, it supports single-task and range-based parallel-for kernels on Intel and NVIDIA GPUs as well as AMD hardware. The llvm-spirv translator for the Intel support requires a patch for correct address space handling, which is included in this PR.

Important: The new flow will not replace the existing compilation flows, but complement them. For some interoperability use cases, the existing flows might be better equipped, while the new flow may be a better choice for generic use cases.

The world's first major SYCL implementation with a single-pass compiler architecture

Currently, most SYCL implementations rely on performing multiple compiler passes: At least one compiler invocation for the target device, and one compiler invocation for the host.
When multiple backends are targeted, or a backend requires separate compilation for each hardware, even multiple device compiler invocations might be necessary. For example, we might have to execute one pass to generate PTX, one to generate SPIR-V, one to generate code for AMD gfx906, one for gfx908 and so on.

As we add more backends, this clearly does not scale, and makes it very difficult to generate a SYCL binary that is supposed to run on "all" hardware without excruciatingly long compilation times.

Since each compilation pass also has to parse the code again, this becomes even worse for heavily templated modern C++ code, which is arguably where SYCL fits best.

This is why this PR follows a different route:

This PR introduces a new generic compilation flow, where only a single host compilation pass takes place. Kernel code is extracted within that pass. No matter how many backends or devices need to be targeted, code is only parsed a single time.

The result is very fast compile times.

A single kernel code representation: Compile once, run anywhere

In the new compilation flow, we embed a generic, backend-independent form of LLVM bitcode into the host application to represent kernel code.
This bitcode is then lowered at runtime to whatever format a device needs: PTX, SPIR-V, amdgcn etc.
(Note that for more exotic hardware such as FPGAs, where runtime lowering may not be feasible due to excessive compile times, the design is flexible enough to add support for precompiling for specific targets.)

This means we are introducing a unified code representation across all backends.

This has multiple advantages:

  • Features such as kernel fusion that depend on runtime knowledge and JIT can be implemented in a unified way for all backends
  • Users don't need to think about which devices they want to target, as the code representation will always be the same anyway. A simple --hipsycl-targets=generic will suffice.

A straightforward path to extend hipSYCL support to new hardware

Previously, hipSYCL's reliance on existing toolchains, while providing a lot of advantages, also made it unclear how best to extend hipSYCL to new hardware targets. This is now over:
The SSCP compilation flow has been designed to be extensible to new hardware targets: Simply implement the new llvm-to-backend infrastructure to lower the generic LLVM IR representation to whatever the new backend needs.

Technical details

Here are some more details from the documentation that comes with this PR:

Generic SSCP compilation flow

Note: This flow is work-in-progress and its level of support may change quickly.

hipSYCL supports a generic single-pass compiler flow, where a single compiler invocation generates both host and device code. The SSCP compilation consists of two stages:

  1. Stage 1 happens at compile time: During the regular C++ host compilation, hipSYCL extracts LLVM IR for kernels with backend-independent representations of builtins, kernel annotations etc. This LLVM IR is embedded in the host code. During stage 1, it is not yet known on which device(s) the code will ultimately run.
  2. Stage 2 typically happens at runtime: The embedded device IR is passed to hipSYCL's llvm-to-backend infrastructure, which lowers the IR to backend-specific formats, such as NVIDIA's PTX, SPIR-V or amdgcn code. Unlike stage 1, stage 2 assumes that the target device is known. While stage 2 typically happens at runtime, support for precompiling to particular devices and formats could be added in the future.
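The split between the two stages can be modeled in a few lines of plain C++. This is only a sketch of the control flow: `Backend`, `EmbeddedKernelImage`, `target_format` and `lower_for_device` are hypothetical names that do not come from hipSYCL, and the "lowering" merely tags a string instead of invoking LLVM.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical names throughout; this models the control flow only.
enum class Backend { Cuda, LevelZero, Hip };

// Stage 1 output: one backend-independent blob embedded in the host binary.
// At this point the target device is not known yet.
struct EmbeddedKernelImage {
  std::string generic_llvm_bitcode;
};

// Format that stage 2 has to produce for a given backend.
inline std::string target_format(Backend b) {
  switch (b) {
    case Backend::Cuda:      return "ptx";
    case Backend::LevelZero: return "spirv";
    case Backend::Hip:       return "amdgcn";
  }
  throw std::invalid_argument("unknown backend");
}

// Stage 2: runs once the device is known, typically at runtime.
// A real implementation would run LLVM here; this stand-in only tags the blob.
inline std::string lower_for_device(const EmbeddedKernelImage& image, Backend b) {
  assert(!image.generic_llvm_bitcode.empty());
  return target_format(b) + ":" + image.generic_llvm_bitcode;
}
```

The point of the model is that the same `EmbeddedKernelImage` is passed to `lower_for_device` for every backend; only the second argument changes.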

The generic SSCP design has several advantages over other implementation choices:

  • There is a single code representation across all backends, which allows implementing JIT-based features such as runtime kernel fusion in a backend-independent way;
  • Code is only parsed a single time, which can result in significant compilation speedups especially for template-heavy code;
  • Binaries inherently run on a wide range of hardware, without the user having to precompile for particular devices, and hence making assumptions where the binary will ultimately be executed ("Compile once, run anywhere").

The generic SSCP flow can potentially provide very fast compile times, very good portability and good performance.

Implementation status

Currently, the SSCP flow is implemented for

  • CUDA devices
  • SPIR-V devices through oneAPI Level Zero
  • AMD devices

Some builtins are not yet implemented, and some hipSYCL extensions are not yet supported.

How it works

IR constants

The SSCP kernel extraction relies on what we refer to as IR constants. IR constants are global variables that are non-const and have no defined value when the code is parsed, but are turned into constants later during the processing of LLVM IR. This is a similar idea to e.g. SYCL 2020 specialization constants, and indeed specialization constants could be implemented on top of IR constants.

Stage 1 IR constants are hard-wired. The following important S1 IR constants exist:

  • A string containing the device LLVM IR bitcode
  • Whether the LLVM module contains the host code
  • Whether the LLVM module contains the device code.

Stage 2 IR constants are intended to provide information that requires knowledge of the target device, such as backend, device, and so on. Stage 2 IR constants can also be programmatically added by the user.

After hipSYCL sets the value of an IR constant, it runs constant propagation and dead code elimination passes. This causes if statements depending entirely on IR constants to be trivially optimized away: either the code contained in the if branch is removed, or it is hardwired, i.e. kept unconditionally.
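The folding effect can be illustrated with a toy model in plain C++. This is not how hipSYCL implements it (the real mechanism operates on LLVM IR); `GuardedRegion` and `fold_guards` are hypothetical names, and statements are represented as plain strings.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A guarded region: "if (<ir_constant>) then_body else else_body".
struct GuardedRegion {
  std::string ir_constant;
  std::vector<std::string> then_body;
  std::vector<std::string> else_body;
};

// Once every IR constant has a value, each guard folds to a single branch.
// The branch that is not taken is dead code and is dropped entirely.
inline std::vector<std::string> fold_guards(
    const std::vector<GuardedRegion>& regions,
    const std::map<std::string, bool>& constant_values) {
  std::vector<std::string> result;
  for (const auto& region : regions) {
    // at() throws std::out_of_range if the constant was never assigned.
    const bool value = constant_values.at(region.ir_constant);
    const auto& live = value ? region.then_body : region.else_body;
    result.insert(result.end(), live.begin(), live.end());
  }
  return result;
}
```

Folding the same list of regions once with the constant set to true and once with it set to false yields two disjoint programs, which is the property that kernel extraction builds on.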

Stage 1: Kernel extraction

During stage 1, hipSYCL clones the module containing the regular host IR, and sets the IR constants such that one is identified as host code, and one is identified as device code.
The kernel function calls are guarded inside the hipSYCL headers by an if-statement depending on the IR constant signifying device compilation. This causes kernel code only to end up in the device module, and host code to end up only in the host module. To be sure that no host code remains in the device module, hipSYCL runs additional passes in the device module to remove all code not reachable from kernel entrypoints.
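The cleanup step, removing everything not reachable from kernel entrypoints, amounts to a reachability walk over the call graph. The following is a minimal sketch on a string-based call graph; `CallGraph`, `reachable_from_kernels` and `prune_device_module` are hypothetical names, and the actual passes work on LLVM modules rather than on maps of strings.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Function name -> names of the functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Breadth-first walk starting at the kernel entrypoints.
inline std::set<std::string> reachable_from_kernels(
    const CallGraph& calls, const std::vector<std::string>& kernel_entrypoints) {
  std::set<std::string> visited;
  std::queue<std::string> worklist;
  for (const auto& kernel : kernel_entrypoints)
    if (visited.insert(kernel).second) worklist.push(kernel);
  while (!worklist.empty()) {
    const std::string current = worklist.front();
    worklist.pop();
    const auto it = calls.find(current);
    if (it == calls.end()) continue;  // external declaration, nothing to follow
    for (const auto& callee : it->second)
      if (visited.insert(callee).second) worklist.push(callee);
  }
  return visited;
}

// Keep only functions that some kernel can reach; host-only code is dropped.
inline CallGraph prune_device_module(const CallGraph& calls,
                                     const std::vector<std::string>& kernels) {
  const std::set<std::string> keep = reachable_from_kernels(calls, kernels);
  CallGraph pruned;
  for (const auto& entry : calls)
    if (keep.count(entry.first)) pruned.insert(entry);
  return pruned;
}
```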

The implementation of SYCL builtins contains an if/else branch depending on the IR constant signifying device compilation. One branch invokes the externally defined SSCP builtins following the naming scheme __hipsycl_sscp_*, while the other branch invokes regular host builtins.
This allows SYCL kernels to simultaneously run correctly both on the host as well as on SSCP-supported devices.
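The dual-branch pattern for builtins can be sketched as follows. Everything here is a stand-in: `is_device_compilation` is an ordinary variable rather than a real IR constant, and `sscp_sqrt_stub` takes the place of an externally defined `__hipsycl_sscp_*` builtin whose exact name and signature are not given in this description. The stub is defined locally (and counts its calls) only so that the sketch links and can be checked.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the IR constant that signifies device compilation.
// In the real flow the compiler hardwires it per module; here it is mutable.
static bool is_device_compilation = false;

// Stand-in for an externally defined SSCP builtin (hypothetical name).
// In the real flow the definition is supplied during stage 2.
static int sscp_stub_calls = 0;
static float sscp_sqrt_stub(float x) {
  ++sscp_stub_calls;
  return std::sqrt(x);
}

// A builtin written once, usable from both host and device code paths.
static float sketch_sqrt(float x) {
  if (is_device_compilation) {
    return sscp_sqrt_stub(x);  // device branch: backend-independent builtin
  }
  return std::sqrt(x);         // host branch: regular host builtin
}
```

After the IR constant is fixed and the dead branch is eliminated, only one of the two calls remains in each module, so the host module never references the SSCP builtin and the device module never references the host one.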

The final LLVM IR device bitcode is then embedded into a stage 1 IR constant string in the host module.

Stage 2: llvm-to-backend

During stage 2, the llvm-to-backend infrastructure is responsible for turning the generic LLVM IR into something that a backend can actually execute. This means in particular:

  • Flavoring the LLVM IR such that the appropriate LLVM backend can handle the code; e.g. by correctly mapping address spaces, attaching information to mark kernels as entrypoints, correctly setting target triple, data layout, and function calling conventions etc.
  • Mapping __hipsycl_sscp_* builtins to backend builtins. This typically happens by linking backend-specific bitcode libraries.
  • Running optimization passes on the finalized IR
  • Lowering the flavored, optimized IR to backend-specific formats, such as ptx or SPIR-V.
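As an illustration of the address space part of flavoring, the sketch below maps an abstract address space to backend-specific numbers. The `Backend` and `AddressSpace` enums and the function name are hypothetical and not taken from hipSYCL. The numeric values reflect the LLVM NVPTX and AMDGPU conventions and the SPIR convention used by llvm-spirv as commonly documented, and should be checked against the respective backend documentation before relying on them.

```cpp
#include <cassert>

// Order matters: both enums are used as table indices below.
enum class Backend { Ptx = 0, Amdgcn = 1, Spirv = 2 };
enum class AddressSpace { Generic = 0, Global, Local, Private, Constant };

// Translate an abstract address space to the number a backend expects.
inline unsigned backend_address_space(Backend b, AddressSpace as) {
  // Rows: Ptx, Amdgcn, Spirv.
  // Columns: Generic, Global, Local, Private, Constant.
  static const unsigned table[3][5] = {
      {0, 1, 3, 5, 4},  // NVPTX: 0 is the generic address space
      {0, 1, 3, 5, 4},  // AMDGPU: 0 is the flat (generic) address space
      {4, 1, 3, 0, 2},  // SPIR convention: 0 is private, 4 is generic
  };
  return table[static_cast<int>(b)][static_cast<int>(as)];
}
```

Note how the SPIR row is the only one in which address space 0 does not mean "generic".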

For debugging, development, or advanced use cases, each llvm-to-backend implementation provides a tool (called llvm-to-ptx-tool, llvm-to-spirv-tool, ....) that can be invoked to perform the stage 2 compilation step manually.

@illuhad
Collaborator Author

illuhad commented Jan 10, 2023

Ok, I think we should prepare this to go in now. Otherwise it will get way too large. It already has other PRs depending on it. We can add the features that are still missing with subsequent PRs.

@illuhad illuhad marked this pull request as ready for review January 10, 2023 10:46
@illuhad illuhad merged commit e092e65 into develop Jan 10, 2023
@illuhad illuhad deleted the feature/generic-llvm-sscp branch January 10, 2023 16:58
# -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR>
# BUILD_COMMAND ${CMAKE_COMMAND} --build <BINARY_DIR> --config Release --target install
# )
ExternalProject_Add(LLVMSpirvTranslator

Are there any more details on the issue with address space handling here? I am experimenting with building OpenSYCL for a project, and the build fails here (because it tries to write to /usr/local and I don't have the ACLs to do so). My distro already has SPIRV-LLVM-Translator available, so I started looking at why I couldn't just use that and ended up here :)

Collaborator Author

@illuhad illuhad Aug 24, 2023

Hi, it should not write to /usr/local if you set your CMAKE_INSTALL_PREFIX to a user-writable directory. If you want to install system-wide, you will need root privileges anyway.
(EDIT: Using Open SYCL without installing it is not supported and will not work, because the directory layout will not be correct unless you make install)

The patch is necessary because llvm-spirv follows the SPIR convention where address space 0, the default address space, does not refer to the generic address space (i.e. "may point to any address space"). Instead, it assumes that address space 0 refers to the private address space.

This is different from all other device backends in LLVM like ptx or amdgcn, and is a huge complication because the IR that clang gives us assumes address space 0 everywhere by default. This makes sense because C++ semantics are that every pointer is effectively a pointer in generic address space.
Additionally, while in a multi-pass compiler the device compiler invocation can be configured with different assumptions about the default address space, this is not so easily possible in a single-pass design like ours. Here, host and device code are generated from the same compiler invocation and thus also share some features such as the default address space configuration.

So, we would need to completely rewrite the LLVM IR to replace every occurrence of address space 0 with llvm-spirv's convention for the generic address space. While we already do some modifications to address spaces to account for device/backend specifics such as the local memory address space, these changes are usually fairly isolated. On the other hand, address space 0 is used everywhere throughout the code, so rewriting all that IR requires a massive effort, and there are no preexisting LLVM passes to handle that use case to my knowledge. So it turned out that patching llvm-spirv is much, much simpler, as it boils down to a 1-line change there.

Note that even if you get your system llvm-spirv to work, it wouldn't be found by our llvm-to-spirv JIT infrastructure which expects llvm-spirv to live in the directory that we have assigned it.

EDIT2: Also, llvm-spirv must be built against the exact same LLVM distribution as the one Open SYCL is built against. This may not be the case for a distribution's package (e.g. maybe the distribution has packages for multiple LLVM versions, or maybe the user wants to use Open SYCL with some custom LLVM installation), so it's also much more robust to build our own.

Thanks for the details. Should that change be going upstream rather than be a permanent fork?

Collaborator Author

@illuhad illuhad Aug 24, 2023

Well, technically it's not a fork, but a patch that is broadly compatible with all llvm-spirv versions ;) But of course it would be more convenient if it were upstreamed at some point.

See also KhronosGroup/SPIRV-LLVM-Translator#1699

While there's no hard objection from upstream SPIRV-LLVM-Translator, it's also not really clear how this is supposed to work. Presumably the old convention would still need to work (otherwise, all other stacks depending on llvm-spirv apart from ours will break), so there needs to be some kind of toggle (macro? command line flag?). I think it's unclear what this would look like, and whether we would tie it to some other address space convention like PTX's or make it an independent thing.
