Add generic SSCP compilation flow: Single pass compiler to generic LLVM IR + runtime JIT #862

illuhad · 2022-11-08T17:18:01Z

This PR adds the generic SSCP (single-source, single compiler pass) compilation flow.

It is still far from complete, but since it can now do things on multiple backends I think it is ripe for a draft PR to gather feedback.

Currently, it supports single-task and ~~range-based~~ parallel-for kernels on Intel and NVIDIA GPUs ~~(AMD will come too, but later)~~ as well as AMD hardware. The llvm-spirv translator for the Intel support requires a patch for correct address space handling ~~which is not yet included in this PR, but I will fix this soon :)~~ which is included in this PR.

Important: The new flow will not replace the existing compilation flows, but complement them. For some interoperability use cases, the existing flows might be better equipped, while the new flow may be a better choice for generic use cases.

The world's first major SYCL implementation with a single-pass compiler architecture

Currently, most SYCL implementations rely on performing multiple compiler passes: At least one compiler invocation for the target device, and one compiler invocation for the host.
When multiple backends are targeted, or a backend requires separate compilation for each hardware target, multiple device compiler invocations might even be necessary. For example, we might have to execute one pass to generate PTX, one to generate SPIR-V, one to generate code for AMD gfx906, one for gfx908 and so on.

As we add more backends, this clearly does not scale, and it makes it very difficult to generate a SYCL binary that is supposed to run on "all" hardware without excruciatingly long compilation times.

Since each compilation pass also has to parse the code again, this becomes even worse for heavily templated modern C++ code, which is arguably where SYCL fits best.

This is why this PR follows a different route:

This PR introduces a new generic compilation flow, where only a single host compilation pass takes place. Kernel code is extracted within that pass. No matter how many backends or devices need to be targeted, code is only parsed a single time.

The result is very fast compile times.

A single kernel code representation: Compile once, run anywhere

In the new compilation flow, we embed a generic, backend-independent form of LLVM bitcode into the host application to represent kernel code.
This bitcode is then lowered at runtime to whatever format a device needs: PTX, SPIR-V, amdgcn etc.
(Note that for more exotic hardware such as FPGAs, where runtime lowering may not be feasible due to excessive compile times, the design is flexible enough to add support for precompiling for specific targets.)

This means we are introducing a unified code representation across all backends.

This has multiple advantages:

Features such as kernel fusion that depend on runtime knowledge and JIT can be implemented in a unified way for all backends
Users don't need to think about which devices they want to target, as the code representation will always be the same anyway. A simple --hipsycl-targets=generic will suffice.

Provide straight-forward path to extend hipSYCL support to new hardware

Previously, hipSYCL's reliance on existing toolchains, while providing a lot of advantages, has also made it unclear how to best extend hipSYCL to new hardware targets. This is now over:
The SSCP compilation flow has been designed to be extensible to new hardware targets: Simply implement the new llvm-to-backend infrastructure to lower the generic LLVM IR representation to whatever the new backend needs.

Technical details

Here are some more details from the documentation that comes with this PR:

Generic SSCP compilation flow

Note: This flow is work-in-progress and the level of support may change quickly.

hipSYCL supports a generic single-pass compiler flow, where a single compiler invocation generates both host and device code. The SSCP compilation consists of two stages:

Stage 1 happens at compile time: During the regular C++ host compilation, hipSYCL extracts LLVM IR for kernels with backend-independent representations of builtins, kernel annotations etc. This LLVM IR is embedded in the host code. During stage 1, it is not yet known on which device(s) the code will ultimately run.
Stage 2 typically happens at runtime: The embedded device IR is passed to hipSYCL's llvm-to-backend infrastructure, which lowers the IR to backend-specific formats, such as NVIDIA's PTX, SPIR-V or amdgcn code. Unlike stage 1, stage 2 assumes that the target device is known. While stage 2 typically happens at runtime, support for precompiling to particular devices and formats could be added in the future.

The generic SSCP design has several advantages over other implementation choices:

There is a single code representation across all backends, which allows implementing JIT-based features such as runtime kernel fusion in a backend-independent way;
Code is only parsed a single time, which can result in significant compilation speedups especially for template-heavy code;
Binaries inherently run on a wide range of hardware, without the user having to precompile for particular devices, and hence without making assumptions about where the binary will ultimately be executed ("Compile once, run anywhere").

The generic SSCP flow can potentially provide very fast compile times, very good portability and good performance.

Implementation status

Currently, the SSCP flow is implemented for

CUDA devices
SPIR-V devices through oneAPI Level Zero
AMD devices

~~Most builtins are not yet implemented, and only single-task and basic parallel for kernels are supported.~~
Some builtins are not yet implemented, and some hipSYCL extensions are not yet supported.

How it works

IR constants

The SSCP kernel extraction relies on what we refer to as IR constants. IR constants are global variables that are non-const and have no defined value when the code is parsed, but are turned into constants later during the processing of LLVM IR. This is similar to e.g. SYCL 2020 specialization constants, and indeed specialization constants could be implemented on top of IR constants.

Stage 1 IR constants are hard-wired. The following important S1 IR constants exist:

A string containing the device LLVM IR bitcode
Whether the LLVM module contains the host code
Whether the LLVM module contains the device code.

Stage 2 IR constants are intended to provide information that requires knowledge of the target device, such as backend, device, and so on. Stage 2 IR constants can also be programmatically added by the user.

After hipSYCL sets the value of an IR constant, it runs constant propagation and dead code elimination passes. This causes if statements that depend entirely on IR constants to be trivially optimized away, either removing the code contained in the if branch or hardwiring it.
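
The effect of IR constants can be approximated in plain C++. The following is a minimal sketch, not hipSYCL code: `code_path`, `select_path` and `IsDeviceModule` are hypothetical names, and a template parameter stands in for the value that would be assigned to an IR constant after parsing. Once the value is fixed, the branch condition folds to a constant and the untaken branch is dead code, much like what constant propagation and dead code elimination achieve at the IR level.

```cpp
// Illustrative sketch only, not hipSYCL code. A template parameter plays the
// role of an IR constant: unknown while the function is written, fixed later.
enum class code_path { host, device };

template <bool IsDeviceModule>  // hypothetical stand-in for an S1 IR constant
constexpr code_path select_path() {
  // Once the value is known, the condition folds to a constant and the
  // branch that is not taken becomes dead code the optimizer can drop.
  if (IsDeviceModule) {
    return code_path::device;
  }
  return code_path::host;
}

// Analogous to cloning the module and setting the constant once per clone.
constexpr code_path host_variant = select_path<false>();
constexpr code_path device_variant = select_path<true>();
```

The real mechanism differs in that the constant is a global variable in LLVM IR rather than a template parameter, so a single parse of the source can serve both variants.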

Stage 1: Kernel extraction

During stage 1, hipSYCL clones the module containing the regular host IR, and sets the IR constants such that one is identified as host code, and one is identified as device code.
The kernel function calls are guarded inside the hipSYCL headers by an if-statement depending on the IR constant signifying device compilation. This causes kernel code only to end up in the device module, and host code to end up only in the host module. To be sure that no host code remains in the device module, hipSYCL runs additional passes in the device module to remove all code not reachable from kernel entrypoints.

The implementation of SYCL builtins contains an if/else branch depending on the IR constant signifying device compilation. One branch invokes the externally defined SSCP builtins following the naming scheme __hipsycl_sscp_*, while the other branch invokes regular host builtins.
This allows SYCL kernels to simultaneously run correctly both on the host as well as on SSCP-supported devices.
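
The builtin dispatch described above might look roughly like the sketch below. All names are hypothetical: `sketch_sscp_abs` stands in for an externally defined builtin following the `__hipsycl_sscp_*` naming scheme, `sketch_host_abs` for a regular host builtin, and the template parameter for the IR constant signifying device compilation. In the real flow the device-side builtin would only be declared here and resolved later by the llvm-to-backend infrastructure; it is defined inline in this sketch so that the example compiles on its own.

```cpp
// Illustrative sketch only. sketch_sscp_abs stands in for an externally
// defined builtin following the __hipsycl_sscp_* naming scheme; in the real
// flow its definition would come from a backend bitcode library in stage 2.
constexpr int sketch_sscp_abs(int x) { return x < 0 ? -x : x; }

// Stand-in for a regular host builtin.
constexpr int sketch_host_abs(int x) { return x < 0 ? -x : x; }

// Hypothetical SYCL-style builtin wrapper: one source definition, two
// possible targets depending on the constant identifying device compilation.
template <bool IsDeviceCompilation>
constexpr int sketch_abs(int x) {
  if (IsDeviceCompilation) {
    return sketch_sscp_abs(x);  // would survive only in the device module
  }
  return sketch_host_abs(x);    // would survive only in the host module
}
```

Because both branches are written once in the same function, the kernel source itself does not need to know where it will run.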

The final LLVM IR device bitcode is then embedded into a stage 1 IR constant string in the host module.

Stage 2: llvm-to-backend

During stage 2, the llvm-to-backend infrastructure is responsible for turning the generic LLVM IR into something that a backend can actually execute. This means in particular:

Flavoring the LLVM IR such that the appropriate LLVM backend can handle the code; e.g. by correctly mapping address spaces, attaching information to mark kernels as entrypoints, correctly setting target triple, data layout, and function calling conventions etc.
Mapping __hipsycl_sscp_* builtins to backend builtins. This typically happens by linking backend-specific bitcode libraries.
Running optimization passes on the finalized IR
Lowering the flavored, optimized IR to backend-specific formats, such as PTX or SPIR-V.
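
The order of these steps can be pictured as a small pipeline. The sketch below is purely illustrative and does not reflect the actual llvm-to-backend interfaces: `stage2_step`, `module_state`, `apply` and `run_stage2` are hypothetical names, and the rule that each step requires the previous one is an assumption made for the sake of the example.

```cpp
// Illustrative sketch only; all names are hypothetical, not hipSYCL APIs.
enum class stage2_step { flavor, link_builtins, optimize, lower };

struct module_state {
  bool flavored;           // target triple, data layout, address spaces set
  bool builtins_resolved;  // __hipsycl_sscp_* calls mapped to backend builtins
  bool optimized;          // optimization passes have run
  bool lowered;            // backend format (e.g. PTX, SPIR-V) produced
};

// Applies one step; in this sketch a step only takes effect if its
// prerequisite step has already run.
constexpr module_state apply(module_state m, stage2_step s) {
  switch (s) {
  case stage2_step::flavor:        m.flavored = true; break;
  case stage2_step::link_builtins: m.builtins_resolved = m.flavored; break;
  case stage2_step::optimize:      m.optimized = m.builtins_resolved; break;
  case stage2_step::lower:         m.lowered = m.optimized; break;
  }
  return m;
}

// Runs the four steps in the order listed above.
constexpr module_state run_stage2() {
  module_state m{false, false, false, false};  // generic IR from stage 1
  m = apply(m, stage2_step::flavor);
  m = apply(m, stage2_step::link_builtins);
  m = apply(m, stage2_step::optimize);
  m = apply(m, stage2_step::lower);
  return m;
}
```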

For debugging, development, or advanced use cases, each llvm-to-backend implementation provides a tool (called llvm-to-ptx-tool, llvm-to-spirv-tool, ...) that can be invoked to perform the stage 2 compilation step manually.

illuhad · 2023-01-10T10:46:01Z

Ok, I think we should prepare this to go in now. Otherwise it will get way too large. It already has other PRs depending on it. We can add the features that are still missing with subsequent PRs.

nullr0ute · 2023-08-24T17:08:14Z

src/compiler/llvm-to-backend/CMakeLists.txt

-#		  -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR>
-#    BUILD_COMMAND ${CMAKE_COMMAND} --build <BINARY_DIR> --config Release --target install
-#    )
+  ExternalProject_Add(LLVMSpirvTranslator


Are there any more details about the address space handling issue here? I am experimenting with building OpenSYCL for a project, and the build fails here (because it tries to write to /usr/local and I don't have the ACLs to do so). My distro already has SPIRV-LLVM-Translator available, so I started looking at why I couldn't just use that and got to here :)

illuhad · 2023-08-24

Hi, it should not write to /usr/local if you set your CMAKE_INSTALL_PREFIX to a user-writable directory. If you want to install system-wide, you will need root privileges anyway.
(EDIT: Using Open SYCL without installing it is not supported and will not work, because the directory layout will not be correct unless you make install)

The patch is necessary because llvm-spirv follows the SPIR convention where address space 0, the default address space, does not refer to the generic address space (i.e. "may point to any address space"). Instead, it assumes that address space 0 refers to the private address space.

This is different from all other device backends in LLVM like ptx or amdgcn, and is a huge complication because the IR that clang gives us assumes address space 0 everywhere by default. This makes sense because C++ semantics are that every pointer is effectively a pointer in generic address space.
Additionally, while in a multi-pass compiler the device compiler invocation can be configured with different assumptions about the default address space, this is not so easily possible in a single-pass design such as our compiler. Here, host and device code are generated from the same compiler invocation and thus also share some features such as the default address space configuration.

So, we would need to completely rewrite the LLVM IR to exchange every occurrence of address space 0 with llvm-spirv's convention for the generic address space. While we already do some modifications to address spaces to account for device/backend specifics such as the local memory address space, these changes are usually fairly isolated. On the other hand, address space 0 is used everywhere throughout the code, so rewriting all that IR requires a massive effort, and there are no preexisting LLVM passes to handle that use case to my knowledge. So it turned out that patching llvm-spirv is much, much simpler, as it boils down to a 1-line change there.
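
To make the mismatch concrete, here is a small sketch that tabulates which address space number means "generic" under each convention. The statement that 0 is generic for PTX and amdgcn but private for SPIR comes from the explanation above; the value 4 for the SPIR generic address space follows the usual SPIR numbering and should be treated as an assumption to verify against the LLVM documentation. The enum and function names are hypothetical.

```cpp
// Illustrative sketch only; names are hypothetical.
enum class backend_convention { ptx, amdgcn, spir };

// Address space number that denotes the generic address space
// ("may point to any address space") under each convention.
// Assumption: SPIR uses 4 for generic, while 0 means private there.
constexpr unsigned generic_address_space(backend_convention c) {
  return c == backend_convention::spir ? 4u : 0u;
}

// True if IR that uses address space 0 for every plain pointer already
// matches the convention, i.e. no address space rewriting would be needed.
constexpr bool default_as_is_generic(backend_convention c) {
  return generic_address_space(c) == 0u;
}
```

Under these assumptions, only the SPIR convention disagrees with what clang emits by default, which is why the SPIR-V path is the one that needs special treatment.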

Note that even if you get your system llvm-spirv to work, it wouldn't be found by our llvm-to-spirv JIT infrastructure which expects llvm-spirv to live in the directory that we have assigned it.

EDIT2: Also, llvm-spirv must be built against the exact same LLVM distribution as the one Open SYCL is built against. This may not be the case for a distribution's package (e.g. maybe the distribution has packages for multiple LLVM versions, or maybe the user wants to use Open SYCL with some custom LLVM installation), so it's also much more robust to build our own.

nullr0ute · 2023-08-24

Thanks for the details. Should that change be going upstream rather than be a permanent fork?

illuhad · 2023-08-24

Well, technically it's not a fork, but a patch that is broadly compatible with all llvm-spirv versions ;) But of course it would be more convenient if it were upstreamed at some point.

See also KhronosGroup/SPIRV-LLVM-Translator#1699

While there's no hard objection from upstream SPIRV-LLVM-Translator, it's also not really clear how this is supposed to work: presumably the old convention would still need to work (otherwise, all other stacks depending on llvm-spirv apart from us would break), so there needs to be some kind of toggle (macro? command line flag?). I think it's unclear what this would look like, and whether we would tie it to some other address space convention like PTX or have it be an independent thing.

illuhad added 30 commits August 5, 2022 20:51

WIP

ba1438d

Merge branch 'develop' into feature/generic-llvm-sscp

dc24016

[SSCP] Add kernel outlining and HCF embedding support

ab1f224

[SSCP] Enable LLVM SSCP flow in syclcc

83e2b00

[SSCP] Add LLVM flag to enable/disable SSCP

04a8ea4

[SSCP] Hook up SSCP kernel launcher prototype

2e79f20

[SSCP] Generalize HCF generation and make __hipsycl_sscp_is_host work in host pass

f8b61a6

[SSCP] Add -mllvm -hipsycl-sscp-emit-hcf flag to emit HCF

5784e57

[SSCP] Make SSCP compiler support configurable in cmake and disable for old LLVM

e2aa051

[SSCP] Code cleanup

e680991

[SSCP] Support device-side __hipsycl_sscp_is_host/device

78dd517

[SSCP] Make all kernel code noexcept

4017a1f

[SSCP] Require LLVM 14+

4ef5cd6

[SSCP] Add SSCP libkernel detection and __hipsycl_if_target support

f209fc0

[SSCP] Ensure IR constants have internal linkage to allow linking multiple compilation units together

a222f96

[SSCP] Support selection of SSCP kernel launcher

454af1a

[SSCP] Add llvm-to-backend infrastructure

94cabea

[SSCP] llvm-to-backend tools: Provide HCF as argument when constructin tool to gain access to kernel names

f121a02

[SSCP] Add some stage2 spir-v logic

1c90173

[SSCP] Correctly set llvm-to-backend include dirs for LLVM

906a4bf

[SSCP] Add support for S2 IR constants

7dd204c

[SSCP] Add libkernel bitcode generation infrastructure

3c00f40

[SSCP] Map thread hierarchy to SSCP builtins

e9e8ed1

[SSCP] Enable libkernel bitcode libraries to cmake

2f3e416

[SSCP] Update IR constant and llvm-to-backend infrastructure

2bb33e0

[SSCP] Add support for basic parallel for

30cac29

[SSCP] Improve code generation quality

8ad86d4

[SSCP] Remove LLVM dependency from llvm-to-backend tools

4d592f4

[SSCP] Use shared LLVM libraries for llvm-to-backend

336705d

[SSCP] Add kernel_configuration header

bd4be80

illuhad added 5 commits December 19, 2022 22:12

[SSCP] Add HIPSYCL_SSCP_FAILED_IR_DUMP_DIRECTORY to simplify debugging when JIT fails

588d86c

[SSCP] Add atomic builtin stubs to enable running more unit tests

88d9f77

[SSCP] Emit bitcode instead of text IR in case of error

2d3289a

[SSCP] Take alignment into account when modifying global variables

204e4df

[SSCP][llvm-to-spirv] Handle initializer when creating dynamic local memory allocation

877736a

illuhad force-pushed the feature/generic-llvm-sscp branch from de352d2 to 877736a Compare December 20, 2022 03:04

illuhad added 8 commits December 20, 2022 04:58

[SSCP] Fix mobile_shared_ptr pulling type definitions of shared_ptr<T> into device IR, which can break SPIR-V due to the occurence of function pointers

8a712f6

[SSCP] Run GlobalDCE after IR constant application to eliminate unneeded functions on the host side

2074af7

[SSCP][llvm-to-spirv] Use default alloca AS 4, avoid non-standard integer widths and remove llvm.lifetime.start/end calls

fd53d6e

[SSCP] More diagnostics around alloca address spaces, and introduce interface to obtain addressspacemap for all backends

5f9a475

[SSCP] Update readme

33e561e

[SSCP][llvm-to-amdgpu] Don't pollute global namespace with hipSYCL builtin internals

3944a23

[SSCP] Emit kernel parameter information to HCF

8cea561

[SSCP] Pass kernel args by value, and add support for passing in arguments based on parameter information from HCF

6720bcd

illuhad force-pushed the feature/generic-llvm-sscp branch from 3d1a848 to 6720bcd Compare January 2, 2023 16:41

illuhad added 2 commits January 8, 2023 17:19

[SSCP] Add support for full kernel argument decomposition

770f5ed

[SSCP] Attempt to handle case when neither ByVal nor ByRef attributes are present

aac6a04

illuhad force-pushed the feature/generic-llvm-sscp branch from 88f026e to aac6a04 Compare January 9, 2023 09:10

illuhad added 4 commits January 9, 2023 12:40

[SSCP] Leverage alloca to determine kernel argument type

2129aed

[SSCP] Add hotfix for cg properties extension test for SSCP

bb50554

[SSCP] Revert to passing kernel argument by reference for more predictable behavior across multiple ABIs

9f441fb

Merge branch 'develop' into feature/generic-llvm-sscp

7ad3c69

illuhad marked this pull request as ready for review January 10, 2023 10:46

illuhad added 2 commits January 10, 2023 14:30

[SSCP] Only expose SSCP support if hipSYCL was built with SSCP enabled

8fd1cf8

[SSCP][NFC] Update documentation

e49dce5

illuhad merged commit e092e65 into develop Jan 10, 2023

illuhad deleted the feature/generic-llvm-sscp branch January 10, 2023 16:58

illuhad mentioned this pull request Feb 9, 2023

Project renaming to Open SYCL #942

Closed

nullr0ute reviewed Aug 24, 2023
