From 1450e50bee87868e02d7ccbdceec1a7e6bb954b0 Mon Sep 17 00:00:00 2001 From: Nicholas Nethercote Date: Wed, 19 Nov 2025 14:51:53 +1100 Subject: [PATCH 1/7] Use "Rust CUDA" for the project name consistently. --- guide/src/README.md | 2 +- guide/src/guide/compute_capabilities.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/guide/src/README.md b/guide/src/README.md index a74826c2..c2bce261 100644 --- a/guide/src/README.md +++ b/guide/src/README.md @@ -1,3 +1,3 @@ # Introduction -Welcome to the rust-cuda guide! Let's dive right in. +Welcome to the Rust CUDA guide! Let's dive right in. diff --git a/guide/src/guide/compute_capabilities.md b/guide/src/guide/compute_capabilities.md index 617169fb..9261cd87 100644 --- a/guide/src/guide/compute_capabilities.md +++ b/guide/src/guide/compute_capabilities.md @@ -25,7 +25,7 @@ In CUDA terminology: features - **Real architectures** (`sm_XX`) represent actual GPU hardware -rust-cuda works exclusively with virtual architectures since it only generates PTX. The +Rust CUDA works exclusively with virtual architectures since it only generates PTX. The `NvvmArch::ComputeXX` enum values correspond to CUDA's virtual architectures. ## Using Target Features @@ -217,7 +217,7 @@ If you encounter errors about missing functions or features: ## Runtime Behavior -Again, rust-cuda **only generates PTX**, not pre-compiled GPU binaries +Again, Rust CUDA **only generates PTX**, not pre-compiled GPU binaries ("[fatbinaries](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#fatbinaries)"). This PTX is then JIT-compiled by the CUDA driver at _runtime_. From cb2956f4535a9098824da8c234276d8935caf25a Mon Sep 17 00:00:00 2001 From: Nicholas Nethercote Date: Wed, 19 Nov 2025 14:53:04 +1100 Subject: [PATCH 2/7] Change the guide title and description. A little shorter and clearer. 
--- guide/book.toml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/guide/book.toml b/guide/book.toml index 2bca2762..f45759fc 100644 --- a/guide/book.toml +++ b/guide/book.toml @@ -2,5 +2,5 @@ authors = ["Riccardo D'Ambrosio"] language = "en" src = "src" -title = "GPU Computing with Rust using CUDA" -description = "Writing extremely fast GPU Computing code with rust using rustc_codegen_nvvm and CUDA" +title = "The Rust CUDA Guide" +description = "How to write GPU compute code with Rust using rustc_codegen_nvvm and CUDA" From 48445dd5843ce86886a8381d9341c9a861d3562c Mon Sep 17 00:00:00 2001 From: Nicholas Nethercote Date: Wed, 19 Nov 2025 14:59:30 +1100 Subject: [PATCH 3/7] Use consistent capitalization. The Guide is currently very inconsistent with capitalization of abbreviations. The general trend is towards lower-case for informal English, but for formal English (such as documentation) I think upper-case is still preferable. - gpu -> GPU - cuda/Cuda -> CUDA - rustacuda -> RustaCUDA - llvm -> LLVM - nvvm -> NVVM - ir -> IR - ptx -> PTX - libnvvm -> libNVVM - Optix/optix -> OptiX - SPIRV -> SPIR-V - cuBlas/cuRand -> cuBLAS/cuRAND - i (the pronoun!) 
-> I - TLDR -> TL;DR --- guide/src/cuda/README.md | 2 +- guide/src/cuda/gpu_computing.md | 16 +++++----- guide/src/cuda/pipeline.md | 4 +-- guide/src/faq.md | 42 ++++++++++++------------- guide/src/features.md | 4 +-- guide/src/guide/compute_capabilities.md | 2 +- guide/src/guide/getting_started.md | 14 ++++----- guide/src/guide/safety.md | 2 +- guide/src/guide/tips.md | 2 +- guide/src/nvvm/backends.md | 6 ++-- guide/src/nvvm/debugging.md | 16 +++++----- guide/src/nvvm/nvvm.md | 10 +++--- guide/src/nvvm/ptxgen.md | 28 ++++++++--------- guide/src/nvvm/types.md | 6 ++-- 14 files changed, 77 insertions(+), 77 deletions(-) diff --git a/guide/src/cuda/README.md b/guide/src/cuda/README.md index f8608e1c..4a8243d7 100644 --- a/guide/src/cuda/README.md +++ b/guide/src/cuda/README.md @@ -2,7 +2,7 @@ The CUDA Toolkit is an ecosystem for executing extremely fast code on NVIDIA GPUs for the purpose of general computing. -CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libnvvm, etc. CUDA +CUDA includes many libraries for this purpose, including the Driver API, Runtime API, the PTX ISA, libNVVM, etc. CUDA is currently the best option for computing in terms of libraries and control available, however, it unfortunately only works on NVIDIA GPUs. diff --git a/guide/src/cuda/gpu_computing.md b/guide/src/cuda/gpu_computing.md index eba5c508..022f4cce 100644 --- a/guide/src/cuda/gpu_computing.md +++ b/guide/src/cuda/gpu_computing.md @@ -13,13 +13,13 @@ of time and/or take different code paths. CUDA is currently one of the best choices for fast GPU computing for multiple reasons: - It offers deep control over how kernels are dispatched and how memory is managed. -- It has a rich ecosystem of tutorials, guides, and libraries such as cuRand, cuBlas, libnvvm, optix, the PTX ISA, etc. +- It has a rich ecosystem of tutorials, guides, and libraries such as cuRAND, cuBLAS, libNVVM, OptiX, the PTX ISA, etc. 
- It is mostly unmatched in performance because it is solely meant for computing and offers rich control.

And more...

-However, CUDA can only run on NVIDIA GPUs, which precludes AMD gpus from tools that use it. However, this is a drawback that
-is acceptable by many because of the significant developer cost of supporting both NVIDIA gpus with CUDA and
-AMD gpus with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.
+However, CUDA can only run on NVIDIA GPUs, which precludes AMD GPUs from tools that use it. This is a drawback that
+is acceptable by many because of the significant developer cost of supporting both NVIDIA GPUs with CUDA and
+AMD GPUs with OpenCL, since OpenCL is generally slower, clunkier, and lacks libraries and docs on par with CUDA.

 # Why Rust?

@@ -28,22 +28,22 @@ accomplish; The initial hurdle of getting Rust to compile to something CUDA can
 polish part.

 On top of its rich language features (macros, enums, traits, proc macros, great errors, etc), Rust's safety guarantees
-can be applied in gpu programming too; A field that has historically been full of implied invariants and unsafety, such
+can be applied in GPU programming too, a field that has historically been full of implied invariants and unsafety, such
 as (but not limited to):
 - Expecting some amount of dynamic shared memory from the caller.
 - Expecting a certain layout for thread blocks/threads.
 - Manually handling the indexing of data, leaving code prone to data races if not managed correctly.
 - Forgetting to free memory, using uninitialized memory, etc.

-Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of gpu kernel libraries easily possible.
+Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of GPU kernel libraries easily possible.
Most of the reasons for using rust on the CPU apply to using Rust for the GPU, these reasons have been stated countless times so -i will not repeat them here. +I will not repeat them here. A couple of particular rust features make writing CUDA code much easier: RAII and Results. In `cust` everything uses RAII (through `Drop` impls) to manage freeing memory and returning handles, which frees users from having to think about that, which yields safer, more reliable code. -Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a cuda result. +Results are particularly helpful, almost every single call in every CUDA library returns a status code in the form of a CUDA result. Ignoring these statuses is very dangerous and can often lead to random segfaults and overall unreliable code. For this purpose, both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes dependency issues down the line. diff --git a/guide/src/cuda/pipeline.md b/guide/src/cuda/pipeline.md index 0e3481b1..f88b7a1c 100644 --- a/guide/src/cuda/pipeline.md +++ b/guide/src/cuda/pipeline.md @@ -19,13 +19,13 @@ with additional restrictions including the following. - Some linkage types are not supported. - Function ABIs are ignored; everything uses the PTX calling convention. -libnvvm is a closed source library which takes NVVM IR, optimizes it further, then converts it to +libNVVM is a closed source library which takes NVVM IR, optimizes it further, then converts it to PTX. PTX is a low level, assembly-like format with an open specification which can be targeted by any language. For an assembly format, PTX is fairly user-friendly. - It is well formatted. - It is mostly fully specified (other than the iffy grammar specification). - It uses named registers/parameters. -- It uses virtual registers. 
(Because gpus have thousands of registers, listing all of them out
+- It uses virtual registers. (Because GPUs have thousands of registers, listing all of them out
 would be unrealistic.)
 - It uses ASCII as a file encoding.

diff --git a/guide/src/faq.md b/guide/src/faq.md
index 5197fd01..8db836cb 100644
--- a/guide/src/faq.md
+++ b/guide/src/faq.md
@@ -29,10 +29,10 @@ over CUDA C/C++ with the same (or better!) performance and features, therefore,
 Short answer, no.

 Long answer, there are a couple of things that make this impossible:
-- At the time of writing, libnvvm expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from later LLVM version (which is what rustc uses) does not work.
-- NVVM IR is a __subset__ of LLVM IR, there are tons of things that nvvm will not accept. Such as a lot of function attrs not being allowed.
+- At the time of writing, libNVVM expects LLVM 7 bitcode, which is a very old format. Giving it bitcode from a later LLVM version (which is what rustc uses) does not work.
+- NVVM IR is a __subset__ of LLVM IR, there are tons of things that NVVM will not accept, such as a lot of function attrs not being allowed.
 This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html). Not to mention
-many bugs in libnvvm that i have found along the way, the most infuriating of which is nvvm not accepting integer types that arent `i1, i8, i16, i32, or i64`.
+many bugs in libNVVM that I have found along the way, the most infuriating of which is NVVM not accepting integer types that aren't `i1, i8, i16, i32, or i64`.
 This required special handling in the codegen to convert these "irregular" types into vector types.

 ## What is the point of using Rust if a lot of things in kernels are unsafe?

@@ -153,13 +153,13 @@ things to gain in terms of safety using Rust. 
The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders: - CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal. - CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc. -- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by llvm and libnvvm are needed. -- SPIRV is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed. -While libnvvm (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified. +- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by LLVM and libNVVM are needed. +- SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed. +While libNVVM (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified. - rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also needs a project 100% focused on computing, and computing only. -- SPIRV cannot access many useful CUDA libraries such as Optix, cuDNN, cuBLAS, etc. -- SPIRV debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used +- SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc. +- SPIR-V debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used for profiling kernels in something like nsight compute. Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. 
Therefore @@ -190,17 +190,17 @@ when it is finished, which causes further uses of CUDA to fail. Modules are the second big difference in the driver API. Modules are similar to shared libraries, they contain all of the globals and functions (kernels) inside of a PTX/cubin file. The driver API -is language-agnostic, it purely works off of ptx/cubin files. To answer why this is important we -need to cover what cubins and ptx files are briefly. +is language-agnostic, it purely works off PTX/cubin files. To answer why this is important we +need to cover what cubins and PTX files are briefly. PTX is a low level assembly-like language which is the penultimate step before what the GPU actually executes. It is human-readable and you can dump it from a CUDA C++ program with `nvcc ./file.cu --ptx`. This PTX is then optimized and lowered into a final format called SASS (Source and Assembly) and turned into a cubin (CUDA binary) file. -Driver API modules can be loaded as either ptx, cubin, or fatbin files. If they are loaded as -ptx then the driver API will JIT compile the PTX to cubin then cache it. You can also -compile ptx to cubin yourself using ptx-compiler and cache it. +Driver API modules can be loaded as either PTX, cubin, or fatbin files. If they are loaded as +PTX then the driver API will JIT compile the PTX to cubin then cache it. You can also +compile PTX to cubin yourself using ptx-compiler and cache it. This pipeline provides much better control over what functions you actually need to load and cache. You can separate different functions into different modules you can load dynamically (and even dynamically reload). @@ -217,7 +217,7 @@ need to manage many kernels being dispatched at the same time as efficiently as ## Why target NVIDIA GPUs only instead of using something that can work on AMD? 
-This is a complex issue with many arguments for both sides, so i will give you +This is a complex issue with many arguments for both sides, so I will give you both sides as well as my opinion. Pros for using OpenCL over CUDA: @@ -235,7 +235,7 @@ new features cannot be reliably relied upon because they are unlikely to work on - OpenCL can only be written in OpenCL C (based on C99), OpenCL C++ is a thing, but again, not everything supports it. This makes complex programs more difficult to create. - OpenCL has less tools and libraries. -- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (ptx) +- OpenCL is nowhere near as language-agnostic as CUDA. CUDA works almost fully off of an assembly format (PTX) and debug info. Essentially how CPU code works. This makes writing language-agnostic things in OpenCL near impossible and locks you into using OpenCL C. - OpenCL is plagued with serious driver bugs which have not been fixed, or that occur only on certain vendors. @@ -245,10 +245,10 @@ Pros for using CUDA over OpenCL: VFX computing. - CUDA is a proprietary tool, meaning that NVIDIA is able to push out bug fixes and features much faster than releasing a new spec and waiting for vendors to implement it. This allows for more features being added, -such as cooperative kernels, cuda graphs, unified memory, new profilers, etc. +such as cooperative kernels, CUDA graphs, unified memory, new profilers, etc. - CUDA is a single entity, meaning that if something does or does not work on one system it is unlikely that that will be different on another system. Assuming you are not using different architectures, where -one gpu may be lacking a feature. +one GPU may be lacking a feature. - CUDA is usually 10-30% faster than OpenCL overall, this is likely due to subpar OpenCL drivers by NVIDIA, but it is unlikely this performance gap will change in the near future. 
- CUDA has a much richer set of libraries and tools than OpenCL, such as cuFFT, cuBLAS, cuRand, cuDNN, OptiX, NSight Compute, cuFile, etc. @@ -264,8 +264,8 @@ Cons for using CUDA over OpenCL: # What makes cust and RustaCUDA different? -Cust is a fork of rustacuda which changes a lot of things inside of it, as well as adds new features that -are not inside of rustacuda. +Cust is a fork of RustaCUDA which changes a lot of things inside of it, as well as adds new features that +are not inside of RustaCUDA. The most significant changes (This list is not complete!!) are: - Drop code no longer panics on failure to drop raw CUDA handles, this is so that InvalidAddress errors, which cause @@ -286,8 +286,8 @@ Changes that are currently in progress but not done/experimental: - Graphs - PTX validation -Just like rustacuda, cust makes no assumptions of what language was used to generate the ptx/cubin. It could be +Just like RustaCUDA, cust makes no assumptions of what language was used to generate the PTX/cubin. It could be C, C++, futhark, or best of all, Rust! -Cust's name is literally just rust + cuda mashed together in a horrible way. +Cust's name is literally just rust + CUDA mashed together in a horrible way. Or you can pretend it stands for custard if you really like custard. diff --git a/guide/src/features.md b/guide/src/features.md index 33c8128b..ab9e5c4f 100644 --- a/guide/src/features.md +++ b/guide/src/features.md @@ -18,9 +18,9 @@ around to adding it yet. | Feature Name | Support Level | Notes | | ------------ | ------------- | ----- | -| Opt-Levels | ✔️ | behaves mostly the same (because llvm is still used for optimizations). Except that libnvvm opts are run on anything except no-opts because nvvm only has -O0 and -O3 | +| Opt-Levels | ✔️ | behaves mostly the same (because LLVM is still used for optimizations). 
Except that libNVVM opts are run on anything except no-opts because NVVM only has -O0 and -O3 | | codegen-units | ✔️ | -| LTO | ➖ | we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libnvvm, so all the benefits of LTO are on without pre-libnvvm LTO being needed. | +| LTO | ➖ | we load bitcode modules lazily using dependency graphs, which then forms a single module optimized by libNVVM, so all the benefits of LTO are on without pre-libNVVM LTO being needed. | | Closures | ✔️ | | Enums | ✔️ | | Loops | ✔️ | diff --git a/guide/src/guide/compute_capabilities.md b/guide/src/guide/compute_capabilities.md index 9261cd87..57dccbc7 100644 --- a/guide/src/guide/compute_capabilities.md +++ b/guide/src/guide/compute_capabilities.md @@ -142,7 +142,7 @@ Note: While the 'a' variant enables all these features during compilation (allow For more details on suffixes, see [NVIDIA's blog post on family-specific architecture features](https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/). -### Manual Compilation (Without CudaBuilder) +### Manual Compilation (Without `cuda_builder`) If you're invoking `rustc` directly instead of using `cuda_builder`, you only need to specify the architecture through LLVM args: diff --git a/guide/src/guide/getting_started.md b/guide/src/guide/getting_started.md index be30c946..8017f706 100644 --- a/guide/src/guide/getting_started.md +++ b/guide/src/guide/getting_started.md @@ -17,9 +17,9 @@ Before you can use the project to write GPU crates, you will need a couple of pr - Finally, if neither are present or unusable, it will attempt to download and use prebuilt LLVM. This currently only works on Windows however. -- The OptiX SDK if using the optix library (the pathtracer example uses it for denoising). +- The OptiX SDK if using the OptiX library (the pathtracer example uses it for denoising). 
-- You may also need to add `libnvvm` to PATH, the builder should do it for you but in case it does not work, add libnvvm to PATH, it should be somewhere like `CUDA_ROOT/nvvm/bin`,
+- You may also need to add `libnvvm` to PATH; the builder should do this for you, but if it does not, add `libnvvm` to PATH yourself. It should be somewhere like `CUDA_ROOT/nvvm/bin`.

 - You may wish to use or consult the bundled [Dockerfile](#docker) to assist in your local config

@@ -102,7 +102,7 @@ Now we can finally start writing an actual GPU kernel.
 Firstly, we must explain a couple of things about GPU kernels, specifically, how they are executed. GPU Kernels
 (functions) are the entry point for executing anything on the GPU, they are the functions which will be executed
 from the CPU. GPU kernels do not return anything, they write their data to buffers passed into them.
 CUDA's execution model is very very complex and it is unrealistic to explain all of it in
-this section, but the TLDR of it is that CUDA will execute the GPU kernel once on every
+this section, but the TL;DR of it is that CUDA will execute the GPU kernel once on every
 thread, with the number of threads being decided by the caller (the CPU).

 We call these parameters the launch dimensions of the kernel. Launch dimensions are split
@@ -115,7 +115,7 @@ up into two basic concepts:
   of the current block.

 One important thing to note is that block and thread dimensions may be 1d, 2d, or 3d.
-That is to say, i can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
+That is to say, I can launch `1` block of `6x6x6`, `6x6`, or `6` threads. I could
 also launch `5x5x5` blocks. This is very useful for 2d/3d applications because it
 makes the 2d/3d index calculations much simpler. CUDA exposes thread and block
 indices for each dimension through special registers. We expose thread index queries through
@@ -229,7 +229,7 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p

 **Notes:**

 1. 
refer to [rust-toolchain.toml](#rust-toolchain.toml) to ensure you are using the correct toolchain in your project.
-2. despite using Docker, your machine will still need to be running a compatible driver, in this case for Cuda 11.4.1 it is >=470.57.02
-3. if you have issues within the container, it can help to start ensuring your gpu is recognized
+2. despite using Docker, your machine will still need to be running a compatible driver, in this case for CUDA 11.4.1 it is >=470.57.02
+3. if you have issues within the container, it can help to start by ensuring your GPU is recognized
   - ensure `nvidia-smi` provides meaningful output in the container
-  - NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your gpu
+  - NVIDIA provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your GPU

diff --git a/guide/src/guide/safety.md b/guide/src/guide/safety.md
index bad3dfaf..5e89a1b4 100644
--- a/guide/src/guide/safety.md
+++ b/guide/src/guide/safety.md
@@ -25,7 +25,7 @@ Behavior considered undefined inside of GPU kernels:
 undefined on the GPU too. The only exception being invalid sizes for buffers given to a GPU kernel.

-Currently we declare that the invariant that a buffer given to a gpu kernel must be large enough for any access the
+Currently we declare that the invariant that a buffer given to a GPU kernel must be large enough for any access the
 kernel is going to make is up to the caller of the kernel to uphold. This idiom may be changed in the future.
- Any kind of data race, this has the same semantics as data races in CPU code. Such as: diff --git a/guide/src/guide/tips.md b/guide/src/guide/tips.md index 98ddb1a0..9cb60dc9 100644 --- a/guide/src/guide/tips.md +++ b/guide/src/guide/tips.md @@ -10,5 +10,5 @@ will get much better in the future but currently it will cause some undesirable - Don't use recursion, CUDA allows it but threads have very limited stacks (local memory) and stack overflows yield confusing `InvalidAddress` errors. If you are getting such an error, run the executable in cuda-memcheck, -it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the ptx file through +it should yield a write failure to `Local` memory at an address of about 16mb. You can also put the PTX file through `cuobjdump` and it should yield ptxas warnings for functions without a statically known stack usage. diff --git a/guide/src/nvvm/backends.md b/guide/src/nvvm/backends.md index c4de04ca..7c34d736 100644 --- a/guide/src/nvvm/backends.md +++ b/guide/src/nvvm/backends.md @@ -2,7 +2,7 @@ Before we get into the details of rustc_codegen_nvvm, we obviously need to explain what a codegen is! -Custom codegens are rustc's answer to "well what if i want rust to compile to X?". This is a problem +Custom codegens are rustc's answer to "well what if I want rust to compile to X?". This is a problem that comes up in many situations, especially conversations of "well LLVM cannot target this, so we are screwed". To solve this problem, rustc decided to incrementally decouple itself from being attached/reliant on LLVM exclusively. @@ -27,8 +27,8 @@ which is able to target more exotic targets than LLVM, especially for embedded. format, which is a format mostly used for compiling shader languages such as GLSL or WGSL to a standard representation that Vulkan/OpenGL can use, the reasons why SPIR-V is not an alternative to CUDA/rustc_codegen_nvvm have been covered in the [FAQ](../../faq.md). 
-Finally, we come to the star of the show, `rustc_codegen_nvvm`. This backend targets NVVM IR for compiling rust to gpu kernels that can be run by CUDA.
-What NVVM IR/libnvvm are has been covered in the [CUDA section](../../cuda/pipeline.md).
+Finally, we come to the star of the show, `rustc_codegen_nvvm`. This backend targets NVVM IR for compiling Rust to GPU kernels that can be run by CUDA.
+What NVVM IR/libNVVM are has been covered in the [CUDA section](../../cuda/pipeline.md).

 # rustc_codegen_ssa

diff --git a/guide/src/nvvm/debugging.md b/guide/src/nvvm/debugging.md
index b5ab7224..63c2e345 100644
--- a/guide/src/nvvm/debugging.md
+++ b/guide/src/nvvm/debugging.md
@@ -11,7 +11,7 @@ Segfaults are usually caused in one of two ways:
 The first case can be debugged in two ways:

 - Building the codegen in debug mode and using `RUSTC_LOG="rustc_codegen_nvvm=trace"` (`$env:RUSTC_LOG = "rustc_codegen_nvvm=trace";` if using powershell).
-Note that this will dump a LOT of output, and when i say a LOT, i am not joking, so please, pipe this to a file.
+Note that this will dump a LOT of output, and when I say a LOT, I am not joking, so please, pipe this to a file.
 This will give you a detailed summary of almost every action the codegen has done, you can examine
 the final few logs to check what the last action the codegen was doing before segfaulting was. This
 is usually straightforward because the logs are detailed.
@@ -20,23 +20,23 @@
 get LLVM to throw an exception whenever something bad happens.

 The latter case is a bit worse.

-Segfaults in libnvvm are generally because we gave something to libnvvm which it did not expect. In an ideal world, libnvvm would
-just throw a validation error, but it wouldn't be an llvm-based library if it threw friendly errors ;). Libnvvm has been known to segfault
+Segfaults in libNVVM are generally because we gave something to libNVVM which it did not expect. 
In an ideal world, libNVVM would
+just throw a validation error, but it wouldn't be an LLVM-based library if it threw friendly errors ;). libNVVM has been known to segfault
 on things like:

 - using int types that arent `i1`, `i8`, `i16`, `i32`, or `i64` in functions signatures. (see int_replace.rs).
 - having debug info on multiple modules (this is technically disallowed per the spec but it still shouldn't segfault).

-Generally there is no good way to debug these failures other than hoping libnvvm throws a validation error (which will cause an ICE).
-I have created a tiny tool to run `llvm-extract` on an llvm ir file to attempt to isolate segfaulting functions which works to some degree
-which i will add to the project soon.
+Generally there is no good way to debug these failures other than hoping libNVVM throws a validation error (which will cause an ICE).
+I have created a tiny tool to run `llvm-extract` on an LLVM IR file to attempt to isolate segfaulting functions, which works to some degree,
+which I will add to the project soon.

 ## Miscompilations

 Miscompilations are rare but annoying. They usually result in one of two things happening:

 - CUDA rejecting the PTX as a whole (throwing an InvalidPtx error). This is rare but the most common cause is declaring invalid
-extern functions (just grep for `extern` in the ptx file and check if it's odd functions that aren't cuda syscalls like vprintf, malloc, free, etc).
+extern functions (just grep for `extern` in the PTX file and check if it's odd functions that aren't CUDA syscalls like vprintf, malloc, free, etc).
 - The PTX containing invalid behavior. This is very specific and rare but if you find this, the best way to debug it is:
-  - Try to get a minimal working example so we don't have to search through megabytes of llvm ir/ptx.
+  - Try to get a minimal working example so we don't have to search through megabytes of LLVM IR/PTX.
- Use `RUSTFLAGS="--emit=llvm-ir"` and find `crate_name.ll` in `target/nvptx64-nvidia-cuda//deps/` and attach it in any bug report. - Attach the final PTX file. diff --git a/guide/src/nvvm/nvvm.md b/guide/src/nvvm/nvvm.md index 4e15aa7d..5f65b669 100644 --- a/guide/src/nvvm/nvvm.md +++ b/guide/src/nvvm/nvvm.md @@ -5,7 +5,7 @@ At the highest level, our codegen workflow goes like this: ``` Source code -> Typechecking -> MIR -> SSA Codegen -> LLVM IR (NVVM IR) -> PTX -> PTX opts/function DCE -> Final PTX | | | | ^ - | | libnvvm +------+ | + | | libNVVM +------+ | | | | | rustc_codegen_nvvm +------------------------------------------------------------| Rustc +--------------------------------------------------------------------------------------------------- @@ -43,19 +43,19 @@ dive into each trait. But first, let's talk about the end of the codegen, it is pretty simple, we do a couple of things: *after codegen is done and LLVM has been run to optimize each module* 1. We gather every LLVM bitcode module we created. -2. We create a new libnvvm program. -3. We add every bitcode module to the libnvvm program. +2. We create a new libNVVM program. +3. We add every bitcode module to the libNVVM program. 4. We try to find libdevice and add it to the program (see [nvidia docs](https://docs.nvidia.com/cuda/libdevice-users-guide/introduction.html#what-is-libdevice) on what libdevice is). -5. We run the verifier on the nvvm program just to check that we did not create any invalid NVVM IR. +5. We run the verifier on the NVVM program just to check that we did not create any invalid NVVM IR. 6. We run the compiler which gives us a final PTX string, hooray! 7. Finally, the PTX goes through a small stage where its parsed and function DCE is run to eliminate most of the bloat in the file. Traditionally this is done by the linker but there's no linker to be found for miles here. 8. We write this PTX file to wherever rustc tells us to write the final file. 
-We will cover the libnvvm steps in more detail later on. +We will cover the libNVVM steps in more detail later on. # Codegen Units (CGUs) diff --git a/guide/src/nvvm/ptxgen.md b/guide/src/nvvm/ptxgen.md index 260f6612..5c4e3672 100644 --- a/guide/src/nvvm/ptxgen.md +++ b/guide/src/nvvm/ptxgen.md @@ -1,20 +1,20 @@ # PTX Generation -This is the final and most fun part of codegen, taking our LLVM bitcode and giving it to libnvvm. -It is in theory as simple as just giving nvvm every single bitcode module, but in practice, we do a couple -of things before and after to reduce ptx size and speed things up. +This is the final and most fun part of codegen, taking our LLVM bitcode and giving it to libNVVM. +It is in theory as simple as just giving NVVM every single bitcode module, but in practice, we do a couple +of things before and after to reduce PTX size and speed things up. # The NVVM API -libnvvm is a dynamically linked library which is distributed in every download of the CUDA SDK. +libNVVM is a dynamically linked library which is distributed in every download of the CUDA SDK. If you are on windows, it should be somewhere around `C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.3/nvvm/bin` -where `v11.3` is the version of cuda you have downloaded. On Windows it's usually called `nvvm64_40_0.dll`. If you are +where `v11.3` is the version of CUDA you have downloaded. On Windows it's usually called `nvvm64_40_0.dll`. If you are on linux it should be somewhere around `/opt/cuda/nvvm-prev/lib64/libnvvm.so`. You can see its API either in the [API docs](https://docs.nvidia.com/cuda/libnvvm-api/group__compilation.html) or in its header file in the `include` folder. We have our own high level bindings to it published as a crate called `nvvm`. 
-The libnvvm API could not be simpler, it is just a couple of functions: +The libNVVM API could not be simpler, it is just a couple of functions: - Make new program - Add bitcode module - Lazy add bitcode module @@ -31,13 +31,13 @@ should be a very simple thing that would involve no calls to random functions in haystack, ...right? Why of course not, you didn't seriously think we would make this straight-forward, right? -So, in theory it is very simple, just load the bitcode from the rlib and tell nvvm to load it. +So, in theory it is very simple, just load the bitcode from the rlib and tell NVVM to load it. While this is easy and it works, it has its own very visible issues. Traditionally, if you never use a function, either the compiler destroys it when using LTO, or the linker destroys it in its own dead code pass. The issue is that LTO is not always run, -and we do not have a linker, nvvm *is* our linker. However, nvvm does not eliminate dead functions. -I think you can guess why that is a problem, so unless we want `11mb` ptx files (yes this is actually +and we do not have a linker, NVVM *is* our linker. However, NVVM does not eliminate dead functions. +I think you can guess why that is a problem, so unless we want `11mb` PTX files (yes this is actually how big it was) we need to do something about it. # Module Merging and DCE @@ -58,7 +58,7 @@ into the module if they are used, doing so using dependency graphs. There are a couple of special modules we need to load before we are done, `libdevice` and `libintrinsics`. The first and most important one is libdevice, libdevice is essentially a bitcode module containing hyper-optimized math intrinsics -that nvidia provides for us. You can find it as a `.bc` file in the libdevice folder inside your nvvm install location. +that nvidia provides for us. You can find it as a `.bc` file in the libdevice folder inside your NVVM install location. 
Every function inside of it is prefixed with `__nv_`, you can find docs for it [here](https://docs.nvidia.com/cuda/libdevice-users-guide/index.html). We declare these intrinsics inside of `ctx_intrinsics.rs` and link to them inside cuda_std. We also use them to codegen @@ -69,15 +69,15 @@ libdevice is also lazy loaded so we do not import useless intrinsics. # libintrinsics This is the last special module we load, it is simple, it is just a dumping ground for random wrapper functions -we need to define that `cuda_std` or the codegen needs. You can find the llvm ir definition for it in the codegen directory +we need to define that `cuda_std` or the codegen needs. You can find the LLVM IR definition for it in the codegen directory called `libintrinsics.ll`. All of its functions should be declared with the `__nvvm_` prefix. # Compilation Finally, we have everything loaded and we can compile our program. We do one last thing however. -Nvvm has a function for verifying our program to make sure we did not add anything nvvm does not like. We run this -before compilation just to be safe. Although annoyingly this does not catch all errors, nvvm just segfaults sometimes which is unfortunate. +NVVM has a function for verifying our program to make sure we did not add anything NVVM does not like. We run this +before compilation just to be safe. Although annoyingly this does not catch all errors, NVVM just segfaults sometimes which is unfortunate. -Compiling is simple, we just call nvvm's program compile function and panic if it fails, if it doesn't, we get a final PTX string. We +Compiling is simple, we just call NVVM's program compile function and panic if it fails, if it doesn't, we get a final PTX string. We can then just write that to the file that rustc wants us to put the final item in. diff --git a/guide/src/nvvm/types.md b/guide/src/nvvm/types.md index 845a7cb3..f9bc1121 100644 --- a/guide/src/nvvm/types.md +++ b/guide/src/nvvm/types.md @@ -1,18 +1,18 @@ # Types -Types!
who doesn't love types, especially those that cause libnvvm to randomly segfault or loop forever! +Types! Who doesn't love types, especially those that cause libNVVM to randomly segfault or loop forever! Anyways, types are an integral part of the codegen and everything revolves around them and you will see them everywhere. `rustc_codegen_ssa` does not actually tell you what your type representation should be, it allows you to decide. For example, `rust-gpu` represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen represent it as -opaque llvm types: +opaque LLVM types: ```rs type Type = &'ll llvm::Type; ``` `llvm::Type` is an opaque type that comes from llvm-c. `'ll` is one of the main lifetimes you will see -throughout the whole codegen, it is used for anything that lasts as long as the current usage of llvm. +throughout the whole codegen, it is used for anything that lasts as long as the current usage of LLVM. LLVM gives you back pointers when you ask for a type or value, some time ago rustc_codegen_llvm fully switched to using references over pointers, and we follow in their footsteps. From 85ea790bf99ac6171d8663b8de1106702313cda0 Mon Sep 17 00:00:00 2001 From: Nicholas Nethercote Date: Wed, 19 Nov 2025 16:20:34 +1100 Subject: [PATCH 4/7] Be consistent with backticks. A lot of names are used sometimes with backticks, sometimes without.
This commit removes backticks where necessary for these: - rustc - rust-gpu (which will become "Rust GPU" in a subsequent commit) And adds backticks where necessary for these: - rustc_codegen_* - cuda_builder - cuda_std - lib.rs --- guide/src/SUMMARY.md | 4 ++-- guide/src/cuda/gpu_computing.md | 2 +- guide/src/faq.md | 4 ++-- guide/src/guide/getting_started.md | 6 +++--- guide/src/guide/kernel_abi.md | 2 +- guide/src/nvvm/README.md | 6 +++--- guide/src/nvvm/backends.md | 16 ++++++++-------- guide/src/nvvm/nvvm.md | 4 ++-- guide/src/nvvm/ptxgen.md | 2 +- guide/src/nvvm/types.md | 4 ++-- 10 files changed, 25 insertions(+), 25 deletions(-) diff --git a/guide/src/SUMMARY.md b/guide/src/SUMMARY.md index a93a363d..fc03685f 100644 --- a/guide/src/SUMMARY.md +++ b/guide/src/SUMMARY.md @@ -12,9 +12,9 @@ - [The CUDA Toolkit](cuda/README.md) - [GPU Computing](cuda/gpu_computing.md) - [The CUDA Pipeline](cuda/pipeline.md) -- [rustc_codegen_nvvm](nvvm/README.md) +- [`rustc_codegen_nvvm`](nvvm/README.md) - [Custom Rustc Backends](nvvm/backends.md) - - [rustc_codegen_nvvm](nvvm/nvvm.md) + - [`rustc_codegen_nvvm`](nvvm/nvvm.md) - [Types](nvvm/types.md) - [PTX Generation](nvvm/ptxgen.md) - [Debugging](nvvm/debugging.md) diff --git a/guide/src/cuda/gpu_computing.md b/guide/src/cuda/gpu_computing.md index 022f4cce..d0973ecb 100644 --- a/guide/src/cuda/gpu_computing.md +++ b/guide/src/cuda/gpu_computing.md @@ -23,7 +23,7 @@ AMD GPUs with OpenCL, since OpenCL is generally slower, clunkier, and lacks libr # Why Rust? -Rust is a great choice for GPU programming, however, it has needed a kickstart, which is what rustc_codegen_nvvm tries to +Rust is a great choice for GPU programming, however, it has needed a kickstart, which is what `rustc_codegen_nvvm` tries to accomplish; The initial hurdle of getting Rust to compile to something CUDA can run is over, now comes the design and polish part. 
diff --git a/guide/src/faq.md b/guide/src/faq.md index 8db836cb..9342502b 100644 --- a/guide/src/faq.md +++ b/guide/src/faq.md @@ -153,13 +153,13 @@ things to gain in terms of safety using Rust. The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders: - CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal. - CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc. -- rust-gpu does not perform many optimizations, and with rustc_codegen_ssa's less than ideal codegen, the optimizations by LLVM and libNVVM are needed. +- rust-gpu does not perform many optimizations, and with `rustc_codegen_ssa`'s less than ideal codegen, the optimizations by LLVM and libNVVM are needed. - SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed. While libNVVM (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified. - rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also needs a project 100% focused on computing, and computing only. - SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc. -- SPIR-V debug info is still very young and rust-gpu cannot generate it. While rustc_codegen_nvvm does, which can be used +- SPIR-V debug info is still very young and rust-gpu cannot generate it. While `rustc_codegen_nvvm` does, which can be used for profiling kernels in something like nsight compute. Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. 
Therefore diff --git a/guide/src/guide/getting_started.md b/guide/src/guide/getting_started.md index 8017f706..62816998 100644 --- a/guide/src/guide/getting_started.md +++ b/guide/src/guide/getting_started.md @@ -58,7 +58,7 @@ Where `XX` is the latest version of `cuda_std`. We changed our crate's crate types to `cdylib` and `rlib`. We specified `cdylib` because the nvptx targets do not support binary crate types. `rlib` is so that we will be able to use the crate as a dependency, such as if we would like to use it on the CPU. -## lib.rs +## `lib.rs` Before we can write any GPU kernels, we must add a few directives to our `lib.rs` which are required by the codegen: @@ -86,7 +86,7 @@ If you would like to use `alloc` or things like printing from GPU kernels (which extern crate alloc; ``` -Finally, if you would like to use types such as slices or arrays inside of GPU kernels you must allow `improper_cytypes_definitions` either on the whole crate or the individual GPU kernels. This is because on the CPU, such types are not guaranteed to be passed a certain way, so they should not be used in `extern "C"` functions (which is what kernels are implicitly declared as). However, `rustc_codegen_nvvm` guarantees the way in which things like structs, slices, and arrays are passed. See [Kernel ABI](./kernel_abi.md). +Finally, if you would like to use types such as slices or arrays inside of GPU kernels you must allow `improper_ctypes_definitions` either on the whole crate or the individual GPU kernels. This is because on the CPU, such types are not guaranteed to be passed a certain way, so they should not be used in `extern "C"` functions (which is what kernels are implicitly declared as). However, `rustc_codegen_nvvm` guarantees the way in which things like structs, slices, and arrays are passed. See [Kernel ABI](./kernel_abi.md). 
```rs #![allow(improper_ctypes_definitions)] @@ -171,7 +171,7 @@ To use it you can simply add it as a build dependency in your CPU crate (the cra +cuda_builder = "XX" ``` -Where `XX` is the current version of cuda_builder. +Where `XX` is the current version of `cuda_builder`. Then, you can simply invoke it in the build.rs of your CPU crate: diff --git a/guide/src/guide/kernel_abi.md b/guide/src/guide/kernel_abi.md index c4a9fe3d..f1a131b4 100644 --- a/guide/src/guide/kernel_abi.md +++ b/guide/src/guide/kernel_abi.md @@ -15,7 +15,7 @@ other ABI we override purely to avoid footguns. Functions marked as `#[kernel]` are enforced to be `extern "C"` by the kernel macro, and it is expected that __all__ GPU kernels be `extern "C"`, not that you should be declaring any kernels without the `#[kernel]` macro, -because the codegen/cuda_std is allowed to rely on the behavior of `#[kernel]` for correctness. +because the codegen/`cuda_std` is allowed to rely on the behavior of `#[kernel]` for correctness. ## Structs diff --git a/guide/src/nvvm/README.md b/guide/src/nvvm/README.md index 69efa4fe..701cdf1c 100644 --- a/guide/src/nvvm/README.md +++ b/guide/src/nvvm/README.md @@ -1,10 +1,10 @@ -# rustc_codegen_nvvm +# `rustc_codegen_nvvm` -This section will cover the more technical details of how rustc_codegen_nvvm works +This section will cover the more technical details of how `rustc_codegen_nvvm` works as well as the issues that came with it. It will also explain some technical details about CUDA/PTX/etc, it is not necessarily -limited to rustc_codegen_nvvm. +limited to `rustc_codegen_nvvm`. Basic knowledge of how rustc and LLVM work and what they do is assumed. You can find info about rustc in the [rustc dev guide](https://rustc-dev-guide.rust-lang.org/). 
diff --git a/guide/src/nvvm/backends.md b/guide/src/nvvm/backends.md index 7c34d736..f2c8b3fa 100644 --- a/guide/src/nvvm/backends.md +++ b/guide/src/nvvm/backends.md @@ -1,6 +1,6 @@ # Custom Rustc Backends -Before we get into the details of rustc_codegen_nvvm, we obviously need to explain what a codegen is! +Before we get into the details of `rustc_codegen_nvvm`, we obviously need to explain what a codegen is! Custom codegens are rustc's answer to "well what if I want rust to compile to X?". This is a problem that comes up in many situations, especially conversations of "well LLVM cannot target this, so we are screwed". @@ -15,22 +15,22 @@ Nowadays, Rustc is almost fully decoupled from LLVM and it is instead generic ov Rustc instead uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries. This allows rust to compile to virtually anything with a surprisingly small amount of work. At the time of writing, there are five publicly known codegens that exist: -- rustc_codegen_cranelift -- rustc_codegen_llvm -- rustc_codegen_gcc -- rustc_codegen_spirv -- rustc_codegen_nvvm, obviously the best codegen ;) +- `rustc_codegen_cranelift` +- `rustc_codegen_llvm` +- `rustc_codegen_gcc` +- `rustc_codegen_spirv` +- `rustc_codegen_nvvm`, obviously the best codegen ;) `rustc_codegen_cranelift` targets the cranelift backend, which is a codegen backend written in rust that is faster than LLVM but does not have many optimizations compared to LLVM. `rustc_codegen_llvm` is obvious, it is the backend almost everybody uses which targets LLVM. `rustc_codegen_gcc` targets GCC (GNU Compiler Collection) which is able to target more exotic targets than LLVM, especially for embedded. 
`rustc_codegen_spirv` targets the SPIR-V (Standard Portable Intermediate Representation 5) format, which is a format mostly used for compiling shader languages such as GLSL or WGSL to a standard representation that Vulkan/OpenGL can use, the reasons -why SPIR-V is not an alternative to CUDA/rustc_codegen_nvvm have been covered in the [FAQ](../../faq.md). +why SPIR-V is not an alternative to CUDA/`rustc_codegen_nvvm` have been covered in the [FAQ](../../faq.md). Finally, we come to the star of the show, `rustc_codegen_nvvm`. This backend targets NVVM IR for compiling rust to GPU kernels that can be run by CUDA. What NVVM IR/libNVVM are has been covered in the [CUDA section](../../cuda/pipeline.md). -# rustc_codegen_ssa +# `rustc_codegen_ssa` `rustc_codegen_ssa` is the central crate behind every single codegen and does much of the hard work. It abstracts away the MIR lowering logic so that custom codegens only have to implement some diff --git a/guide/src/nvvm/nvvm.md b/guide/src/nvvm/nvvm.md index 5f65b669..6bd70617 100644 --- a/guide/src/nvvm/nvvm.md +++ b/guide/src/nvvm/nvvm.md @@ -1,4 +1,4 @@ -# rustc_codegen_nvvm +# `rustc_codegen_nvvm` At the highest level, our codegen workflow goes like this: @@ -14,7 +14,7 @@ Source code -> Typechecking -> MIR -> SSA Codegen -> LLVM IR (NVVM IR) -> PTX -> Before we do anything, rustc does its normal job, it typechecks, converts everything to MIR, etc. Then, rustc loads our codegen shared lib and invokes it to codegen the MIR. It creates an instance of `NvvmCodegenBackend` and it invokes `codegen_crate`. 
You could do anything inside `codegen_crate` but -we just defer back to rustc_codegen_ssa and tell it to do the job for us: +we just defer back to `rustc_codegen_ssa` and tell it to do the job for us: ```rs fn codegen_crate<'tcx>( diff --git a/guide/src/nvvm/ptxgen.md b/guide/src/nvvm/ptxgen.md index 5c4e3672..9c54c14b 100644 --- a/guide/src/nvvm/ptxgen.md +++ b/guide/src/nvvm/ptxgen.md @@ -61,7 +61,7 @@ important one is libdevice, libdevice is essentially a bitcode module containing that nvidia provides for us. You can find it as a `.bc` file in the libdevice folder inside your NVVM install location. Every function inside of it is prefixed with `__nv_`, you can find docs for it [here](https://docs.nvidia.com/cuda/libdevice-users-guide/index.html). -We declare these intrinsics inside of `ctx_intrinsics.rs` and link to them inside cuda_std. We also use them to codegen +We declare these intrinsics inside of `ctx_intrinsics.rs` and link to them inside `cuda_std`. We also use them to codegen a lot of intrinsics inside `intrinsic.rs`, such as `sqrtf32`. libdevice is also lazy loaded so we do not import useless intrinsics. diff --git a/guide/src/nvvm/types.md b/guide/src/nvvm/types.md index f9bc1121..35e0e86f 100644 --- a/guide/src/nvvm/types.md +++ b/guide/src/nvvm/types.md @@ -4,7 +4,7 @@ Types! who doesn't love types, especially those that cause libNVVM to randomly s Anyways, types are an integral part of the codegen and everything revolves around them and you will see them everywhere. `rustc_codegen_ssa` does not actually tell you what your type representation should be, it allows you to decide. For -example, `rust-gpu` represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen represent it as +example, rust-gpu represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen represent it as opaque LLVM types: ```rs @@ -13,7 +13,7 @@ type Type = &'ll llvm::Type; `llvm::Type` is an opaque type that comes from llvm-c. 
`'ll` is one of the main lifetimes you will see throughout the whole codegen, it is used for anything that lasts as long as the current usage of LLVM. -LLVM gives you back pointers when you ask for a type or value, some time ago rustc_codegen_llvm fully switched to using +LLVM gives you back pointers when you ask for a type or value, some time ago `rustc_codegen_llvm` fully switched to using references over pointers, and we follow in their footsteps. One important fact about types is that they are opaque, you cannot take a type and ask "is this X struct?", From 9209eded6ffd5568987d5de08fdcafbe46662d47 Mon Sep 17 00:00:00 2001 From: Nicholas Nethercote Date: Wed, 19 Nov 2025 16:27:57 +1100 Subject: [PATCH 5/7] Use consistent capitalization, part 2. - Rustc -> rustc (truly!) - rust -> Rust - NVidia/nvidia -> NVIDIA --- guide/src/SUMMARY.md | 2 +- guide/src/cuda/gpu_computing.md | 8 ++++---- guide/src/faq.md | 18 +++++++++--------- guide/src/features.md | 2 +- guide/src/guide/getting_started.md | 2 +- guide/src/guide/kernel_abi.md | 6 +++--- guide/src/nvvm/backends.md | 14 +++++++------- guide/src/nvvm/debugging.md | 2 +- guide/src/nvvm/nvvm.md | 2 +- guide/src/nvvm/ptxgen.md | 2 +- 10 files changed, 29 insertions(+), 29 deletions(-) diff --git a/guide/src/SUMMARY.md b/guide/src/SUMMARY.md index fc03685f..21cd8d26 100644 --- a/guide/src/SUMMARY.md +++ b/guide/src/SUMMARY.md @@ -13,7 +13,7 @@ - [GPU Computing](cuda/gpu_computing.md) - [The CUDA Pipeline](cuda/pipeline.md) - [`rustc_codegen_nvvm`](nvvm/README.md) - - [Custom Rustc Backends](nvvm/backends.md) + - [Custom rustc Backends](nvvm/backends.md) - [`rustc_codegen_nvvm`](nvvm/nvvm.md) - [Types](nvvm/types.md) - [PTX Generation](nvvm/ptxgen.md) diff --git a/guide/src/cuda/gpu_computing.md b/guide/src/cuda/gpu_computing.md index d0973ecb..01531a29 100644 --- a/guide/src/cuda/gpu_computing.md +++ b/guide/src/cuda/gpu_computing.md @@ -36,10 +36,10 @@ as (but not limited to): - Forgetting to free memory, using 
uninitialized memory, etc. Not to mention the standardized tooling that makes the building, documentation, sharing, and linting of GPU kernel libraries easily possible. -Most of the reasons for using rust on the CPU apply to using Rust for the GPU, these reasons have been stated countless times so +Most of the reasons for using Rust on the CPU apply to using Rust for the GPU, these reasons have been stated countless times so I will not repeat them here. -A couple of particular rust features make writing CUDA code much easier: RAII and Results. +A couple of particular Rust features make writing CUDA code much easier: RAII and Results. In `cust` everything uses RAII (through `Drop` impls) to manage freeing memory and returning handles, which frees users from having to think about that, which yields safer, more reliable code. @@ -48,6 +48,6 @@ Ignoring these statuses is very dangerous and can often lead to random segfaults both the CUDA SDK, and other libraries provide macros to handle such statuses. This handling is not very reliable and causes dependency issues down the line. -Instead of an unreliable system of macros, we can leverage rust results for this. In cust we return special `CudaResult` -results that can be bubbled up using rust's `?` operator, or, similar to `CUDA_SAFE_CALL` can be unwrapped or expected if +Instead of an unreliable system of macros, we can leverage Rust results for this. In cust we return special `CudaResult` +results that can be bubbled up using Rust's `?` operator, or, similar to `CUDA_SAFE_CALL` can be unwrapped or expected if proper error handling is not needed. diff --git a/guide/src/faq.md b/guide/src/faq.md index 9342502b..3bf36452 100644 --- a/guide/src/faq.md +++ b/guide/src/faq.md @@ -21,7 +21,7 @@ seamlessly implement features which would have been impossible or very difficult - Stripping away everything we do not need, no complex ABI handling, no shared lib handling, control over how function calls are generated, etc. 
So overall, the LLVM PTX backend is fit for smaller kernels/projects/proofs of concept. -It is however not fit for compiling an entire language (core is __very__ big) with dependencies and more. The end goal is for rust to be able to be used +It is however not fit for compiling an entire language (core is __very__ big) with dependencies and more. The end goal is for Rust to be able to be used over CUDA C/C++ with the same (or better!) performance and features, therefore, we must take advantage of all optimizations NVCC has over us. ## If NVVM IR is a subset of LLVM IR, can we not give rustc-generated LLVM IR to NVVM? @@ -117,22 +117,22 @@ no control over it and no 100% reliable way to fix it, therefore we must shift t Moreover, the CUDA GPU kernel model is entirely based on trust, trusting each thread to index into the correct place in buffers, trusting the caller of the kernel to uphold some dimension invariants, etc. This is once again, completely incompatible with how -rust does things. We can provide wrappers to calculate an index that always works, and macros to index a buffer automatically, but +Rust does things. We can provide wrappers to calculate an index that always works, and macros to index a buffer automatically, but indexing in complex ways is a core operation in CUDA and it is impossible for us to prove that whatever the developer is doing is correct. Finally, we would love to be able to use mut refs in kernel parameters, but this would be unsound. Because each kernel function is *technically* called multiple times in parallel with the same parameters, we would be -aliasing the mutable ref, which Rustc declares as unsound (aliasing mechanics). So raw pointers or slightly-less-unsafe +aliasing the mutable ref, which rustc declares as unsound (aliasing mechanics). So raw pointers or slightly-less-unsafe
However, they are usually only used for the initial buffer indexing, after which you can turn them into a mutable reference just fine (because you indexed in a way where no other thread will index that element). Also note that shared refs can be used as parameters just fine. -Now that we outlined why this is a thing, why is using rust a benefit if we still need to use unsafe? +Now that we outlined why this is a thing, why is using Rust a benefit if we still need to use unsafe? Well it's simple, eliminating most of the things that a developer needs to think about to have a safe program is still exponentially safer than leaving __everything__ to the developer to think about. -By using rust, we eliminate: +By using Rust, we eliminate: - The forgotten/unhandled CUDA errors problem (yay results!). - The uninitialized memory problem. - The forgetting to dealloc memory problem. @@ -156,15 +156,15 @@ The reasoning for this is the same reasoning as to why you would use CUDA over o - rust-gpu does not perform many optimizations, and with `rustc_codegen_ssa`'s less than ideal codegen, the optimizations by LLVM and libNVVM are needed. - SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not mention many things which are needed. While libNVVM (which uses a well documented subset of LLVM IR) and the PTX ISA are very thoroughly documented/specified. -- rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the rust ecosystem needs, but it also +- rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the Rust ecosystem needs, but it also needs a project 100% focused on computing, and computing only. - SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc. - SPIR-V debug info is still very young and rust-gpu cannot generate it. 
While `rustc_codegen_nvvm` does, which can be used for profiling kernels in something like nsight compute. Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. Therefore -it is much easier for CUDA C++ users to use rust for GPU computing if most of the concepts are still the same. Plus, -we can interface with existing CUDA code by compiling it to PTX then linking it with our rust code using the CUDA linker +it is much easier for CUDA C++ users to use Rust for GPU computing if most of the concepts are still the same. Plus, +we can interface with existing CUDA code by compiling it to PTX then linking it with our Rust code using the CUDA linker API (which is exposed in a high level wrapper in cust). ## Why use the CUDA Driver API over the Runtime API? @@ -289,5 +289,5 @@ Changes that are currently in progress but not done/experimental: Just like RustaCUDA, cust makes no assumptions of what language was used to generate the PTX/cubin. It could be C, C++, futhark, or best of all, Rust! -Cust's name is literally just rust + CUDA mashed together in a horrible way. +Cust's name is literally just Rust + CUDA mashed together in a horrible way. Or you can pretend it stands for custard if you really like custard. diff --git a/guide/src/features.md b/guide/src/features.md index ab9e5c4f..b16b92a7 100644 --- a/guide/src/features.md +++ b/guide/src/features.md @@ -105,4 +105,4 @@ on things used by the wide majority of users. 
| Stream Ordered Memory | ✔️ | | Graph Memory Nodes | ❌ | | Unified Memory | ✔️ | -| `__restrict__` | ➖ | Not needed, you get that performance boost automatically through rust's noalias :) | +| `__restrict__` | ➖ | Not needed, you get that performance boost automatically through Rust's noalias :) | diff --git a/guide/src/guide/getting_started.md b/guide/src/guide/getting_started.md index 62816998..c6291c50 100644 --- a/guide/src/guide/getting_started.md +++ b/guide/src/guide/getting_started.md @@ -232,4 +232,4 @@ You can use it as follows (assuming your clone of Rust CUDA is at the absolute p 2. despite using Docker, your machine will still need to be running a compatible driver, in this case for CUDA 11.4.1 it is >=470.57.02 3. if you have issues within the container, it can help to start ensuring your GPU is recognized - ensure `nvidia-smi` provides meaningful output in the container - - NVidia provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your GPU + - NVIDIA provides a number of samples https://github.com/NVIDIA/cuda-samples. In particular, you may want to try `make`ing and running the [`deviceQuery`](https://github.com/NVIDIA/cuda-samples/tree/ba04faaf7328dbcc87bfc9acaf17f951ee5ddcf3/Samples/deviceQuery) sample. If all is well you should see many details about your GPU diff --git a/guide/src/guide/kernel_abi.md b/guide/src/guide/kernel_abi.md index f1a131b4..c7034081 100644 --- a/guide/src/guide/kernel_abi.md +++ b/guide/src/guide/kernel_abi.md @@ -7,10 +7,10 @@ In other words, how the codegen expects you to pass different types to GPU kerne ## Preface -Please note that the following __only__ applies to non-rust call conventions, we make zero guarantees -about the rust call convention, just like rustc. 
+Please note that the following __only__ applies to non-Rust call conventions, we make zero guarantees +about the Rust call convention, just like rustc. -While we currently override every ABI except rust, you should generally only use `"C"`, any +While we currently override every ABI except Rust, you should generally only use `"C"`, any other ABI we override purely to avoid footguns. Functions marked as `#[kernel]` are enforced to be `extern "C"` by the kernel macro, and it is expected diff --git a/guide/src/nvvm/backends.md b/guide/src/nvvm/backends.md index f2c8b3fa..f117c51f 100644 --- a/guide/src/nvvm/backends.md +++ b/guide/src/nvvm/backends.md @@ -1,8 +1,8 @@ -# Custom Rustc Backends +# Custom rustc Backends Before we get into the details of `rustc_codegen_nvvm`, we obviously need to explain what a codegen is! -Custom codegens are rustc's answer to "well what if I want rust to compile to X?". This is a problem +Custom codegens are rustc's answer to "well what if I want Rust to compile to X?". This is a problem that comes up in many situations, especially conversations of "well LLVM cannot target this, so we are screwed". To solve this problem, rustc decided to incrementally decouple itself from being attached/reliant on LLVM exclusively. @@ -11,9 +11,9 @@ This is great if you just want to support LLVM, but LLVM is not perfect, and ine is able to do. Or, you may just want to stop using LLVM, LLVM is not without problems (it is often slow, clunky to deal with, and does not support a lot of targets). -Nowadays, Rustc is almost fully decoupled from LLVM and it is instead generic over the "codegen" backend used. -Rustc instead uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries. -This allows rust to compile to virtually anything with a surprisingly small amount of work. 
At the time of writing, there are +Nowadays, rustc is almost fully decoupled from LLVM and it is instead generic over the "codegen" backend used. +rustc instead uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries. +This allows Rust to compile to virtually anything with a surprisingly small amount of work. At the time of writing, there are five publicly known codegens that exist: - `rustc_codegen_cranelift` - `rustc_codegen_llvm` @@ -21,13 +21,13 @@ five publicly known codegens that exist: - `rustc_codegen_spirv` - `rustc_codegen_nvvm`, obviously the best codegen ;) -`rustc_codegen_cranelift` targets the cranelift backend, which is a codegen backend written in rust that is faster than LLVM but does not have many optimizations +`rustc_codegen_cranelift` targets the cranelift backend, which is a codegen backend written in Rust that is faster than LLVM but does not have many optimizations compared to LLVM. `rustc_codegen_llvm` is obvious, it is the backend almost everybody uses which targets LLVM. `rustc_codegen_gcc` targets GCC (GNU Compiler Collection) which is able to target more exotic targets than LLVM, especially for embedded. `rustc_codegen_spirv` targets the SPIR-V (Standard Portable Intermediate Representation 5) format, which is a format mostly used for compiling shader languages such as GLSL or WGSL to a standard representation that Vulkan/OpenGL can use, the reasons why SPIR-V is not an alternative to CUDA/`rustc_codegen_nvvm` have been covered in the [FAQ](../../faq.md). -Finally, we come to the star of the show, `rustc_codegen_nvvm`. This backend targets NVVM IR for compiling rust to GPU kernels that can be run by CUDA. +Finally, we come to the star of the show, `rustc_codegen_nvvm`. This backend targets NVVM IR for compiling Rust to GPU kernels that can be run by CUDA. What NVVM IR/libNVVM are has been covered in the [CUDA section](../../cuda/pipeline.md). 
 # `rustc_codegen_ssa`
diff --git a/guide/src/nvvm/debugging.md b/guide/src/nvvm/debugging.md
index 63c2e345..b9dc69a2 100644
--- a/guide/src/nvvm/debugging.md
+++ b/guide/src/nvvm/debugging.md
@@ -47,7 +47,7 @@ If that doesn't work, then it might be a bug inside of CUDA itself, but that sho
 is to set up the crate for debug (and see if it still happens in debug). Then you can run your executable under NSight Compute,
 go to the source tab, and examine the SASS (basically an assembly lower than PTX) to see if ptxas miscompiled it.
 
-If you set up the codegen for debug, it should give you a mapping from rust code to SASS which should hopefully help to see what exactly is breaking.
+If you set up the codegen for debug, it should give you a mapping from Rust code to SASS which should hopefully help to see what exactly is breaking.
 
 Here is an example of the screen you should see:
diff --git a/guide/src/nvvm/nvvm.md b/guide/src/nvvm/nvvm.md
index 6bd70617..e0ecf082 100644
--- a/guide/src/nvvm/nvvm.md
+++ b/guide/src/nvvm/nvvm.md
@@ -8,7 +8,7 @@ Source code -> Typechecking -> MIR -> SSA Codegen -> LLVM IR (NVVM IR) -> PTX ->
    |                                     | libNVVM  +------+ |
    |                                     |          |      | |
    | rustc_codegen_nvvm +------------------------------------------------------------|
-Rustc +---------------------------------------------------------------------------------------------------
+rustc +---------------------------------------------------------------------------------------------------
 ```
 
 Before we do anything, rustc does its normal job, it typechecks, converts everything to MIR, etc. Then,
diff --git a/guide/src/nvvm/ptxgen.md b/guide/src/nvvm/ptxgen.md
index 9c54c14b..51b2e2c3 100644
--- a/guide/src/nvvm/ptxgen.md
+++ b/guide/src/nvvm/ptxgen.md
@@ -58,7 +58,7 @@ into the module if they are used, doing so using dependency graphs.
 
 There are a couple of special modules we need to load before we are done, `libdevice` and `libintrinsics`.
 The first and most important one is libdevice, libdevice is essentially a bitcode module containing hyper-optimized math intrinsics
-that nvidia provides for us. You can find it as a `.bc` file in the libdevice folder inside your NVVM install location.
+that NVIDIA provides for us. You can find it as a `.bc` file in the libdevice folder inside your NVVM install location.
 Every function inside of it is prefixed with `__nv_`, you can find docs for it [here](https://docs.nvidia.com/cuda/libdevice-users-guide/index.html).
 
 We declare these intrinsics inside of `ctx_intrinsics.rs` and link to them inside `cuda_std`. We also use them to codegen

From 227d66003423224d18c85ff8da62aeb3fd3e214f Mon Sep 17 00:00:00 2001
From: Nicholas Nethercote
Date: Wed, 19 Nov 2025 16:38:17 +1100
Subject: [PATCH 6/7] Use "Rust GPU" for the project name consistently.

---
 guide/src/faq.md        | 8 ++++----
 guide/src/nvvm/types.md | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/guide/src/faq.md b/guide/src/faq.md
index 3bf36452..aad6f1b3 100644
--- a/guide/src/faq.md
+++ b/guide/src/faq.md
@@ -148,18 +148,18 @@ a lot of them, and ease the burden of correctness from the developer. Besides,
 using Rust only adds to safety, it does not make CUDA *more* unsafe. This means there
 are only things to gain in terms of safety using Rust.
 
-## Why not use rust-gpu with compute shaders?
+## Why not use Rust GPU with compute shaders?
 
 The reasoning for this is the same reasoning as to why you would use CUDA over opengl/vulkan compute shaders:
 - CUDA usually outperforms shaders if kernels are written well and launch configurations are optimal.
 - CUDA has many useful features such as shared memory, unified memory, graphs, fine grained thread control, streams, the PTX ISA, etc.
-- rust-gpu does not perform many optimizations, and with `rustc_codegen_ssa`'s less than ideal codegen, the optimizations by LLVM and libNVVM are needed.
+- Rust GPU does not perform many optimizations, and with `rustc_codegen_ssa`'s less than ideal codegen, the optimizations by LLVM and libNVVM are needed.
 - SPIR-V is arguably still not suitable for serious GPU kernel codegen, it is underspecced, complex, and does not
 mention many things which are needed. While libNVVM (which uses a well documented subset of LLVM IR) and the PTX
 ISA are very thoroughly documented/specified.
-- rust-gpu is primarily focused on graphical shaders, compute shaders are secondary, which the Rust ecosystem needs, but it also
+- Rust GPU is primarily focused on graphical shaders, compute shaders are secondary, which the Rust ecosystem needs, but it also
 needs a project 100% focused on computing, and computing only.
 - SPIR-V cannot access many useful CUDA libraries such as OptiX, cuDNN, cuBLAS, etc.
-- SPIR-V debug info is still very young and rust-gpu cannot generate it. While `rustc_codegen_nvvm` does, which can be used
+- SPIR-V debug info is still very young and Rust GPU cannot generate it. While `rustc_codegen_nvvm` does, which can be used
 for profiling kernels in something like nsight compute.
 
 Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. Therefore
diff --git a/guide/src/nvvm/types.md b/guide/src/nvvm/types.md
index 35e0e86f..3259aad1 100644
--- a/guide/src/nvvm/types.md
+++ b/guide/src/nvvm/types.md
@@ -4,7 +4,7 @@ Types! who doesn't love types, especially those that cause libNVVM to randomly s
 Anyways, types are an integral part of the codegen and everything revolves around them and you will see them everywhere.
 `rustc_codegen_ssa` does not actually tell you what your type representation should be, it allows you to decide. For
-example, rust-gpu represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen represent it as
+example, Rust GPU represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen represent it as
 opaque LLVM types:
 
 ```rs

From 8bc37a65b23f4aae2137827241d407c21b07cc18 Mon Sep 17 00:00:00 2001
From: Nicholas Nethercote
Date: Wed, 19 Nov 2025 16:43:54 +1100
Subject: [PATCH 7/7] Clarify use of the word "codegen", and use sentence case
 for headings.

[I accidentally squashed two commits, and can't be bothered separating them.]

The existing text uses "codegen" frequently as a shorthand for "codegen
backend". I found this confusing and distracting. ("Codegens" is even worse.)
This commit replaces these uses with "codegen backend" (or occasionally
something else more appropriate). The commit preserves the use of "codegen"
for the act of code generation, e.g. "during codegen we do XYZ", because
that's more standard.

Also, currently headings are a mix of sentence case ("The quick brown fox")
and title case ("The Quick Brown Fox"). Title case is extremely formal, so
sentence case feels more natural here.
---
 guide/src/cuda/gpu_computing.md         |  2 +-
 guide/src/cuda/pipeline.md              |  2 +-
 guide/src/faq.md                        |  8 +++----
 guide/src/features.md                   | 10 ++++----
 guide/src/guide/compute_capabilities.md | 32 ++++++++++++-------------
 guide/src/guide/getting_started.md      | 12 +++++-----
 guide/src/guide/kernel_abi.md           | 10 ++++----
 guide/src/guide/safety.md               |  2 +-
 guide/src/guide/tips.md                 |  2 +-
 guide/src/nvvm/backends.md              | 19 ++++++++-------
 guide/src/nvvm/debugging.md             | 10 ++++----
 guide/src/nvvm/nvvm.md                  | 10 ++++----
 guide/src/nvvm/ptxgen.md                |  6 ++---
 guide/src/nvvm/types.md                 | 11 +++++----
 14 files changed, 69 insertions(+), 67 deletions(-)

diff --git a/guide/src/cuda/gpu_computing.md b/guide/src/cuda/gpu_computing.md
index 01531a29..9079dde1 100644
--- a/guide/src/cuda/gpu_computing.md
+++ b/guide/src/cuda/gpu_computing.md
@@ -1,4 +1,4 @@
-# GPU Computing
+# GPU computing
 
 You probably already know what GPU computing is, but if you don't, it is utilizing the extremely parallel nature of GPUs
 for purposes other than rendering. It is widely used in many scientific and consumer fields.
diff --git a/guide/src/cuda/pipeline.md b/guide/src/cuda/pipeline.md
index f88b7a1c..785fd7bd 100644
--- a/guide/src/cuda/pipeline.md
+++ b/guide/src/cuda/pipeline.md
@@ -1,4 +1,4 @@
-# The CUDA Pipeline
+# The CUDA pipeline
 
 CUDA is traditionally used via CUDA C/C++ files which have a `.cu` extension. These files can be compiled
 using NVCC (NVIDIA CUDA Compiler) into an executable.
diff --git a/guide/src/faq.md b/guide/src/faq.md
index aad6f1b3..dfd94780 100644
--- a/guide/src/faq.md
+++ b/guide/src/faq.md
@@ -1,4 +1,4 @@
-# Frequently Asked Questions
+# Frequently asked questions
 
 This page will cover a lot of the questions people often have when they encounter this project, so
 they are addressed all at once.
 
@@ -14,8 +14,8 @@ This can be circumvented by building LLVM in a special way, but this is far beyo
 which yield considerable performance differences (especially on more complex kernels with more information in the IR).
 - For some reason (either rustc giving weird LLVM IR or the LLVM PTX backend being broken) the LLVM PTX backend
 often generates completely invalid PTX for trivial programs, so it is not an acceptable workflow for a production pipeline.
-- GPU and CPU codegen is fundamentally different, creating a codegen that is only for the GPU allows us to
-seamlessly implement features which would have been impossible or very difficult to implement in the existing codegen, such as:
+- GPU and CPU codegen is fundamentally different, creating a codegen backend that is only for the GPU allows us to
+seamlessly implement features which would have been impossible or very difficult to implement in the existing codegen backend, such as:
   - Shared memory, this requires some special generation of globals with custom addrspaces, its just not possible to do without backend explicit handling.
   - Custom linking logic to do dead code elimination so as to not end up with large PTX files full of dead functions/globals.
   - Stripping away everything we do not need, no complex ABI handling, no shared lib handling, control over how function calls are generated, etc.
@@ -33,7 +33,7 @@ Long answer, there are a couple of things that make this impossible:
 - NVVM IR is a __subset__ of LLVM IR, there are tons of things that NVVM will not accept. Such as a lot of function attrs
 not being allowed. This is well documented and you can find the spec [here](https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html).
 Not to mention many bugs in libNVVM that I have found along the way, the most infuriating of which is nvvm not accepting
 integer types that arent `i1, i8, i16, i32, or i64`.
-This required special handling in the codegen to convert these "irregular" types into vector types.
+This required special handling in the codegen backend to convert these "irregular" types into vector types.
 
 ## What is the point of using Rust if a lot of things in kernels are unsafe?
diff --git a/guide/src/features.md b/guide/src/features.md
index b16b92a7..0977c480 100644
--- a/guide/src/features.md
+++ b/guide/src/features.md
@@ -1,4 +1,4 @@
-# Supported Features
+# Supported features
 
 This page is used for tracking Cargo/Rust and CUDA features that are currently
 supported or planned to be supported in the future. As well as tracking some information about how they could
@@ -14,7 +14,7 @@ around to adding it yet.
 | ✔️ | Fully Supported |
 | 🟨 | Partially Supported |
 
-# Rust Features
+# Rust features
 
 | Feature Name | Support Level | Notes |
 | ------------ | ------------- | ----- |
@@ -40,7 +40,7 @@ around to adding it yet.
 | Float Ops | ✔️ | Maps to libdevice intrinsics, calls to libm are not intercepted though, which we may want to do in the future |
 | Atomics | ❌ |
 
-# CUDA Libraries
+# CUDA libraries
 
 | Library Name | Support Level | Notes |
 | ------------ | ------------- | ----- |
@@ -54,9 +54,9 @@ around to adding it yet.
 | cuSPARSE | ❌ |
 | AmgX | ❌ |
 | cuTENSOR | ❌ |
-| OptiX | 🟨 | CPU OptiX is mostly complete, GPU OptiX is still heavily in-progress because it needs support from the codegen |
+| OptiX | 🟨 | CPU OptiX is mostly complete, GPU OptiX is still heavily in-progress because it needs support from the codegen backend |
 
-# GPU-side Features
+# GPU-side features
 
 Note: Most of these categories are used __very__ rarely in CUDA code, therefore do not be alarmed
 that it seems like many things are not supported. We just focus
diff --git a/guide/src/guide/compute_capabilities.md b/guide/src/guide/compute_capabilities.md
index 57dccbc7..562be2f0 100644
--- a/guide/src/guide/compute_capabilities.md
+++ b/guide/src/guide/compute_capabilities.md
@@ -1,9 +1,9 @@
-# Compute Capability Gating
+# Compute capability gating
 
 This section covers how to write code that adapts to different CUDA compute
 capabilities using conditional compilation.
 
-## What are Compute Capabilities?
+## What are compute capabilities?
 
 CUDA GPUs have different "compute capabilities" that determine which features
 they support. Each capability is identified by a version number like `3.5`, `5.0`, `6.1`,
@@ -17,7 +17,7 @@ For example:
 
 For comprehensive details, see [NVIDIA's CUDA documentation on GPU
 architectures](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#gpu-compilation).
 
-## Virtual vs Real Architectures
+## Virtual vs real architectures
 
 In CUDA terminology:
 
@@ -28,7 +28,7 @@ In CUDA terminology:
 Rust CUDA works exclusively with virtual architectures since it only generates PTX. The
 `NvvmArch::ComputeXX` enum values correspond to CUDA's virtual architectures.
 
-## Using Target Features
+## Using target features
 
 When building your kernel, the `NvvmArch::ComputeXX` variant you choose enables specific
 `target_feature` flags. These can be used with `#[cfg(...)]` to conditionally compile
@@ -51,12 +51,12 @@ which `NvvmArch::ComputeXX` is used to build the kernel, there is a different an
 These features let you write optimized code paths for specific GPU generations while
 still supporting older ones.
 
-## Specifying Compute Capabilites
+## Specifying compute capabilities
 
 Starting with CUDA 12.9, NVIDIA introduced architecture suffixes that affect
 compatibility.
 
-### Base Architecture (No Suffix)
+### Base architecture (no suffix)
 
 Example: `NvvmArch::Compute70`
@@ -79,7 +79,7 @@ CudaBuilder::new("kernels")
 #[cfg(target_feature = "compute_80")]  // ✗ Fail (higher base variant)
 ```
 
-### Family Suffix ('f')
+### Family suffix ('f')
 
 Example: `NvvmArch::Compute101f`
@@ -108,7 +108,7 @@ CudaBuilder::new("kernels")
 #[cfg(target_feature = "compute_110")]  // ✗ Fail (higher base variant)
 ```
 
-### Architecture Suffix ('a')
+### Architecture suffix ('a')
 
 Example: `NvvmArch::Compute100a`
@@ -142,7 +142,7 @@ Note: While the 'a' variant enables all these features during compilation (allow
 
 For more details on suffixes, see [NVIDIA's blog post on family-specific architecture
 features](https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/).
 
-### Manual Compilation (Without `cuda_builder`)
+### Manual compilation (without `cuda_builder`)
 
 If you're invoking `rustc` directly instead of using `cuda_builder`, you only need to
 specify the architecture through LLVM args:
@@ -162,11 +162,11 @@ cargo build --target nvptx64-nvidia-cuda
 
 The codegen backend automatically synthesizes target features based on the architecture
 type as described above.
 
-### Common Patterns for Base Architectures
+### Common patterns for base architectures
 
 These patterns work when using base architectures (no suffix), which enable all lower capabilities:
 
-#### At Least a Capability (Default)
+#### At least a capability (default)
 
 ```rust,no_run
 // Code that requires compute 6.0 or higher
@@ -176,7 +176,7 @@ These patterns work when using base architectures (no suffix), which enable all
 }
 ```
 
-#### Exactly One Capability
+#### Exactly one capability
 
 ```rust,no_run
 // Code that targets exactly compute 6.1 (not 6.2+)
@@ -186,7 +186,7 @@ These patterns work when using base architectures (no suffix), which enable all
 }
 ```
 
-#### Up To a Maximum Capability
+#### Up to a maximum capability
 
 ```rust,no_run
 // Code that works up to compute 6.0 (not 6.1+)
@@ -196,7 +196,7 @@ These patterns work when using base architectures (no suffix), which enable all
 }
 ```
 
-#### Targeting Specific Architecture Ranges
+#### Targeting specific architecture ranges
 
 ```rust,no_run
 // This block compiles when building for architectures >= 6.0 but < 8.0
@@ -206,7 +206,7 @@ These patterns work when using base architectures (no suffix), which enable all
 }
 ```
 
-## Debugging Capability Issues
+## Debugging capability issues
 
 If you encounter errors about missing functions or features:
 
@@ -215,7 +215,7 @@ If you encounter errors about missing functions or features:
 3. Use `nvidia-smi` to check your GPU's compute capability
 4. Add appropriate `#[cfg]` guards or increase the target architecture
 
-## Runtime Behavior
+## Runtime behavior
 
 Again, Rust CUDA **only generates PTX**, not pre-compiled GPU binaries
 ("[fatbinaries](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#fatbinaries)").
diff --git a/guide/src/guide/getting_started.md b/guide/src/guide/getting_started.md
index c6291c50..d6a781d9 100644
--- a/guide/src/guide/getting_started.md
+++ b/guide/src/guide/getting_started.md
@@ -1,8 +1,8 @@
-# Getting Started
+# Getting started
 
 This section covers how to get started writing GPU crates with `cuda_std` and `cuda_builder`.
 
-## Required Libraries
+## Required libraries
 
 Before you can use the project to write GPU crates, you will need a couple of prerequisites:
 
@@ -10,7 +10,7 @@ Before you can use the project to write GPU crates, you will need a couple of pr
 
 This is only for building GPU crates, to execute built PTX you only need CUDA `9+`.
 
-- LLVM 7.x (7.0 to 7.4), The codegen searches multiple places for LLVM:
+- LLVM 7.x (7.0 to 7.4). The codegen backend searches multiple places for LLVM:
   - If `LLVM_CONFIG` is present, it will use that path as `llvm-config`.
   - Or, if `llvm-config` is present as a binary, it will use that, assuming that `llvm-config --version` returns `7.x.x`.
@@ -60,7 +60,7 @@ We changed our crate's crate types to `cdylib` and `rlib`. We specified `cdylib`
 
 ## `lib.rs`
 
-Before we can write any GPU kernels, we must add a few directives to our `lib.rs` which are required by the codegen:
+Before we can write any GPU kernels, we must add a few directives to our `lib.rs` which are required by the codegen backend:
 
 ```rs
 #![cfg_attr(
@@ -76,7 +76,7 @@ This does a couple of things:
 
 - It only applies the attributes if we are compiling the crate for the GPU (target_os = "cuda").
 - It declares the crate to be `no_std` on CUDA targets.
-- It registers a special attribute required by the codegen for things like figuring out
+- It registers a special attribute required by the codegen backend for things like figuring out
   what functions are GPU kernels.
 - It explicitly includes `kernel` macro and `thread`
 
@@ -156,7 +156,7 @@ Internally what this does is it first checks that a couple of things are right i
 
 - The function is `unsafe`.
 - The function does not return anything.
 
-Then it declares this kernel to the codegen so that the codegen can tell CUDA this is a GPU kernel.
+Then it declares this kernel to the codegen backend so it can tell CUDA this is a GPU kernel.
 It also applies `#[no_mangle]` so the name of the kernel is the same as it is declared in the code.
 
 ## Building the GPU crate
diff --git a/guide/src/guide/kernel_abi.md b/guide/src/guide/kernel_abi.md
index c7034081..5330d207 100644
--- a/guide/src/guide/kernel_abi.md
+++ b/guide/src/guide/kernel_abi.md
@@ -1,7 +1,7 @@
 # Kernel ABI
 
-This section details how parameters are passed to GPU kernels by the Codegen at the current time.
-In other words, how the codegen expects you to pass different types to GPU kernels from the CPU.
+This section details how parameters are passed to GPU kernels by the codegen backend. In other
+words, how the codegen backend expects you to pass different types to GPU kernels from the CPU.
 
 ⚠️ If you find any bugs in the ABI please report them. ⚠️
 
@@ -15,7 +15,7 @@ other ABI we override purely to avoid footguns.
 
 Functions marked as `#[kernel]` are enforced to be `extern "C"` by the kernel macro, and it is expected
 that __all__ GPU kernels be `extern "C"`, not that you should be declaring any kernels without the `#[kernel]` macro,
-because the codegen/`cuda_std` is allowed to rely on the behavior of `#[kernel]` for correctness.
+because the codegen backend/`cuda_std` is allowed to rely on the behavior of `#[kernel]` for correctness.
 
 ## Structs
 
@@ -119,7 +119,7 @@ unsafe {
 }
 ```
 
-You may get warnings about slices being an improper C-type, but the warnings are safe to ignore, the codegen guarantees
+You may get warnings about slices being an improper C-type, but the warnings are safe to ignore, the codegen backend guarantees
 that slices are passed as pairs of params.
 
 You cannot however pass mutable slices, this is because it would violate aliasing rules, each thread receiving a copy of the mutable
@@ -135,7 +135,7 @@ ZSTs (zero-sized types) are ignored and become nothing in the final PTX.
 
 Primitive types are passed directly by value, same as structs. They map to the special PTX types `.s8`, `.s16`, `.s32`,
 `.s64`, `.u8`, `.u16`, `.u32`, `.u64`, `.f32`, and `.f64`. With the exception that `u128` and `i128` are passed as
 byte arrays (but this has no impact on how they are passed from the CPU).
 
-## References And Pointers
+## References and pointers
 
 References and Pointers are both passed as expected, as pointers. It is therefore expected that you pass
 such parameters using device memory:
diff --git a/guide/src/guide/safety.md b/guide/src/guide/safety.md
index 5e89a1b4..1a8bc4ec 100644
--- a/guide/src/guide/safety.md
+++ b/guide/src/guide/safety.md
@@ -90,7 +90,7 @@ Note however, that unified memory can be accessed by multiple GPUs and multiple
 takes care of copying and moving data automatically from GPUs/CPU when a page fault occurs. For this reason
 as well as general ease of use, we suggest that unified memory generally be used over regular device memory.
 
-### Kernel Launches
+### Kernel launches
 
 Kernel Launches are the most unsafe part of CUDA, many things must be checked by the developer to soundly launch a kernel.
 It is fundamentally impossible for us to verify a large portion of the invariants expected by the kernel/CUDA.
diff --git a/guide/src/guide/tips.md b/guide/src/guide/tips.md
index 9cb60dc9..616b7807 100644
--- a/guide/src/guide/tips.md
+++ b/guide/src/guide/tips.md
@@ -4,7 +4,7 @@ This section contains some tips on what to do and what not to do using the proje
 
 ## GPU kernels
 
-- Generally don't derive `Debug` for structs in GPU crates. The codegen currently does not do much global
+- Generally don't derive `Debug` for structs in GPU crates. The codegen backend currently does not do much global
 DCE (dead code elimination) so debug can really slow down compile times and make the PTX gigantic. This will
 get much better in the future but currently it will cause some undesirable effects.
diff --git a/guide/src/nvvm/backends.md b/guide/src/nvvm/backends.md
index f117c51f..1249f4ec 100644
--- a/guide/src/nvvm/backends.md
+++ b/guide/src/nvvm/backends.md
@@ -1,20 +1,21 @@
-# Custom rustc Backends
+# Custom rustc backends
 
-Before we get into the details of `rustc_codegen_nvvm`, we obviously need to explain what a codegen is!
+Before we get into the details of `rustc_codegen_nvvm`, we obviously need to explain what a codegen
+backend is!
 
-Custom codegens are rustc's answer to "well what if I want Rust to compile to X?". This is a problem
+Custom codegen backends are rustc's answer to "well what if I want Rust to compile to X?". This is a problem
 that comes up in many situations, especially conversations of "well LLVM cannot target this, so we are
 screwed". To solve this problem, rustc decided to incrementally decouple itself from being attached/reliant
 on LLVM exclusively.
 
-Previously, rustc only had a single codegen, the LLVM codegen. The LLVM codegen translated MIR directly to LLVM IR.
+Previously, rustc only had a single codegen backend, the LLVM codegen backend. This translated MIR directly to LLVM IR.
 This is great if you just want to support LLVM, but LLVM is not perfect, and inevitably you will hit limits to what LLVM
 is able to do. Or, you may just want to stop using LLVM, LLVM is not without problems (it is often slow,
 clunky to deal with, and does not support a lot of targets).
 
-Nowadays, rustc is almost fully decoupled from LLVM and it is instead generic over the "codegen" backend used.
+Nowadays, rustc is almost fully decoupled from LLVM and it is instead generic over the codegen backend used.
 rustc instead uses a system of codegen backends that implement traits and then get loaded as dynamically linked libraries.
 This allows Rust to compile to virtually anything with a surprisingly small amount of work. At the time of writing, there are
-five publicly known codegens that exist:
+five publicly known codegen backends that exist:
 - `rustc_codegen_cranelift`
 - `rustc_codegen_llvm`
 - `rustc_codegen_gcc`
@@ -32,9 +33,9 @@ What NVVM IR/libNVVM are has been covered in the [CUDA section](../../cuda/pipel
 
 # `rustc_codegen_ssa`
 
-`rustc_codegen_ssa` is the central crate behind every single codegen and does much of the hard work.
-It abstracts away the MIR lowering logic so that custom codegens only have to implement some
-traits and the SSA codegen does everything else. For example:
+`rustc_codegen_ssa` is the central crate behind every single codegen backend and does much of the
+hard work. It abstracts away the MIR lowering logic so that custom codegen backends only have to
+implement some traits and the SSA codegen does everything else. For example:
 - A trait for getting a type like an integer type.
 - A trait for optimizing a module.
 - A trait for linking everything.
diff --git a/guide/src/nvvm/debugging.md b/guide/src/nvvm/debugging.md
index b9dc69a2..6c0491a9 100644
--- a/guide/src/nvvm/debugging.md
+++ b/guide/src/nvvm/debugging.md
@@ -1,4 +1,4 @@
-# Debugging The Codegen
+# Debugging the codegen backend
 
 When you try to compile an entire language for a completely different type of hardware, stuff is bound
 to break. In this section we will cover how to debug 🧊, segfaults, and more.
@@ -10,10 +10,10 @@ Segfaults are usually caused in one of two ways:
 - From NVVM when linking (generating PTX). (more common)
 
 The first case can be debugged in two ways:
-- Building the codegen in debug mode and using `RUSTC_LOG="rustc_codegen_nvvm=trace"` (`$env:RUSTC_LOG = "rustc_codegen_nvvm=trace";` if using powershell).
+- Building the codegen backend in debug mode and using `RUSTC_LOG="rustc_codegen_nvvm=trace"` (`$env:RUSTC_LOG = "rustc_codegen_nvvm=trace";` if using powershell).
 Note that this will dump a LOT of output, and when I say a LOT, i am not joking, so please, pipe this to a file.
-This will give you a detailed summary of almost every action the codegen has done, you can examine the final few logs to
-check what the last action the codegen was doing before segfaulting was. This is usually straightforward because the logs are detailed.
+This will give you a detailed summary of almost every action the codegen backend has done, you can examine the final few logs to
+check what the last action the codegen backend was doing before segfaulting was. This is usually straightforward because the logs are detailed.
 - Building LLVM 7 with debug assertions. This, coupled with logging should give all the info needed to debug a segfault.
 It should get LLVM to throw an exception whenever something bad happens.
@@ -47,7 +47,7 @@ If that doesn't work, then it might be a bug inside of CUDA itself, but that sho
 is to set up the crate for debug (and see if it still happens in debug). Then you can run your executable under NSight Compute,
 go to the source tab, and examine the SASS (basically an assembly lower than PTX) to see if ptxas miscompiled it.
 
-If you set up the codegen for debug, it should give you a mapping from Rust code to SASS which should hopefully help to see what exactly is breaking.
+If you set up the codegen backend for debug, it should give you a mapping from Rust code to SASS which should hopefully help to see what exactly is breaking.
 
 Here is an example of the screen you should see:
diff --git a/guide/src/nvvm/nvvm.md b/guide/src/nvvm/nvvm.md
index e0ecf082..ed2f4651 100644
--- a/guide/src/nvvm/nvvm.md
+++ b/guide/src/nvvm/nvvm.md
@@ -12,7 +12,7 @@ Source code -> Typechecking -> MIR -> SSA Codegen -> LLVM IR (NVVM IR) -> PTX ->
 ```
 
 Before we do anything, rustc does its normal job, it typechecks, converts everything to MIR, etc. Then,
-rustc loads our codegen shared lib and invokes it to codegen the MIR. It creates an instance of
+rustc loads our codegen backend shared lib and invokes it to codegen the MIR. It creates an instance of
 `NvvmCodegenBackend` and it invokes `codegen_crate`. You could do anything inside `codegen_crate` but we
 just defer back to `rustc_codegen_ssa` and tell it to do the job for us:
@@ -34,9 +34,9 @@ fn codegen_crate<'tcx>(
 ```
 
 After that, the codegen logic is kind of abstracted away from us, which is a good thing!
-We just need to provide the SSA codegen whatever it needs to do its thing. This is
+We just need to provide the SSA codegen crate with whatever it needs to do its thing. This is
 done in the form of traits, lots and lots and lots of traits, more traits than you've ever seen, traits
-your subconscious has warned you of in nightmares, anyways. Because talking about how the SSA codegen
+your subconscious has warned you of in nightmares, anyways. Because talking about how the SSA codegen crate
 works is kind of useless, we will instead talk first about general concepts and terminology, then dive
 into each trait.
@@ -57,7 +57,7 @@ But first, let's talk about the end of the codegen, it is pretty simple, we do a
 
 We will cover the libNVVM steps in more detail later on.
 
-# Codegen Units (CGUs)
+# Codegen units (CGUs)
 
 Ah codegen units, the thing everyone just tells you to set to `1` in Cargo.toml, but what are they?
 Well, to put it simply, codegen units are rustc splitting up a crate into different modules to then
@@ -65,7 +65,7 @@ run LLVM in parallel over. For example, rustc can run LLVM over two different mo
 save time.
 
 This gets a little bit more complex with generics, because MIR is not monomorphized and monomorphized MIR is not a thing,
-the codegen monomorphizes instances on the fly. Therefore rustc needs to put any generic functions that one CGU relies on
+the compiler monomorphizes instances on the fly. Therefore rustc needs to put any generic functions that one CGU relies on
 inside of the same CGU because it needs to monomorphize them.
 
 # Rlibs
diff --git a/guide/src/nvvm/ptxgen.md b/guide/src/nvvm/ptxgen.md
index 51b2e2c3..3ec0d2c0 100644
--- a/guide/src/nvvm/ptxgen.md
+++ b/guide/src/nvvm/ptxgen.md
@@ -1,4 +1,4 @@
-# PTX Generation
+# PTX generation
 
 This is the final and most fun part of codegen, taking our LLVM bitcode and giving it to libNVVM. It is
 in theory as simple as just giving NVVM every single bitcode module, but in practice, we do a couple
@@ -40,7 +40,7 @@ and we do not have a linker, NVVM *is* our linker. However, NVVM does not elimin
 I think you can guess why that is a problem, so unless we want `11mb` PTX files (yes this is actually how big it was)
 we need to do something about it.
 
-# Module Merging and DCE
+# Module merging and DCE
 
 To solve our dead code issue, we take a pretty simple approach. We merge every module (one crate maybe
 be multiple modules because of codegen units) into a single module to start. Then, we do the following:
@@ -69,7 +69,7 @@ libdevice is also lazy loaded so we do not import useless intrinsics.
 
 # libintrinsics
 
 This is the last special module we load, it is simple, it is just a dumping ground for random wrapper functions
-we need to define that `cuda_std` or the codegen needs. You can find the LLVM IR definition for it in the codegen directory
+we need to define that `cuda_std` or the codegen backend needs. You can find the LLVM IR definition for it in the codegen directory
 called `libintrinsics.ll`. All of its functions should be declared with the `__nvvm_` prefix.
 
 # Compilation
diff --git a/guide/src/nvvm/types.md b/guide/src/nvvm/types.md
index 3259aad1..23d3933e 100644
--- a/guide/src/nvvm/types.md
+++ b/guide/src/nvvm/types.md
@@ -1,7 +1,7 @@
 # Types
 
 Types! who doesn't love types, especially those that cause libNVVM to randomly segfault or loop forever!
-Anyways, types are an integral part of the codegen and everything revolves around them and you will see them everywhere.
+Anyways, types are an integral part of the codegen backend and everything revolves around them and you will see them everywhere.
 `rustc_codegen_ssa` does not actually tell you what your type representation should be, it allows you to decide. For
 example, Rust GPU represents it as a `SpirvType` enum, while both `rustc_codegen_llvm` and our codegen represent it as
@@ -20,8 +20,8 @@ One important fact about types is that they are opaque, you cannot take a type a
 this is like asking "which chickens were responsible for this omelette?". You can ask if its a number type,
 a vector type, a void type, etc.
 
-The SSA codegen needs to ask the backend for types for everything it needs to codegen MIR. It does
-this using a trait called `BaseTypeMethods`:
+The SSA codegen crate needs to ask the backend for types for everything it needs to codegen MIR. It
+does this using a trait called `BaseTypeMethods`:
 
 ```rs
 pub trait BaseTypeMethods<'tcx>: Backend<'tcx> {
@@ -55,8 +55,9 @@ pub trait BaseTypeMethods<'tcx>: Backend<'tcx> {
 }
 ```
 
-Every codegen implements this some way or another, you can find our implementation in `ty.rs`. Our
-implementation is pretty straightforward, LLVM has functions that we link to which get us the types we need:
+Every codegen backend implements this in some way or another; you can find our implementation in
+`ty.rs`. Our implementation is pretty straightforward, LLVM has functions that we link to which get
+us the types we need:
 
 ```rs
 impl<'ll, 'tcx> BaseTypeMethods<'tcx> for CodegenCx<'ll, 'tcx> {