Kernel Launcher

Kernel Launcher is a C++ library for dynamically compiling CUDA kernels at runtime (using NVRTC) and launching them using C++ magic in a way that is type-safe, user-friendly, and with minimal boilerplate.

On top of that, Kernel Launcher supports tuning the GPU kernels in your application. This is done by capturing kernel launches, replaying them with an auto-tuning tool such as Kernel Tuner, and importing the results, saved as wisdom files, during runtime kernel compilation.

The result: highly efficient GPU applications with maximum portability.

Installation

Recommended installation is using CMake. See the installation guide.

Example

There are several ways of using Kernel Launcher. See the documentation for examples or check out the examples/ directory.

Pragma-based API

Below is an example of using the pragma-based API, which allows existing CUDA kernels to be annotated with Kernel-Launcher-specific directives.

kernel.cu

#pragma kernel tune(threads_per_block=32, 64, 128, 256, 512, 1024)
#pragma kernel block_size(threads_per_block)
#pragma kernel problem_size(n)
#pragma kernel buffers(A[n], B[n], C[n])
template <typename T, int threads_per_block>
__global__ void vector_add(int n, T *C, const T *A, const T *B) {
    int i = blockIdx.x * threads_per_block + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

main.cpp

#include "kernel_launcher.h"

int main() {
    // Initialize CUDA memory. This is outside the scope of Kernel Launcher.
    unsigned int n = 1000000;
    float *dev_A, *dev_B, *dev_C;
    /* cudaMalloc, cudaMemcpy, ... */

    // Namespace alias.
    namespace kl = kernel_launcher;

    // Launch the kernel! Again, the grid size and block size do not need to
    // be specified, they are calculated from the kernel specifications and
    // runtime arguments.
    kl::launch(
        kl::PragmaKernel("vector_add", "kernel.cu", {"float"}),
        n, dev_C, dev_A, dev_B
    );
}

Builder-based API

Below shows an example of the KernelBuilder-based API. This offers more flexibility than the pragma-based API, but is also more verbose:

kernel.cu

template <typename T>
__global__ void vector_add(int n, T *C, const T *A, const T *B) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}

main.cpp

#include "kernel_launcher.h"

int main() {
    // Namespace alias.
    namespace kl = kernel_launcher;

    // Define the variables that can be tuned for this kernel.
    auto space = kl::ConfigSpace();
    auto threads_per_block = space.tune("block_size", {32, 64, 128, 256, 512, 1024});

    // Create a kernel builder and set kernel properties such as block size,
    // grid divisor, template arguments, etc.
    auto builder = kl::KernelBuilder("vector_add", "kernel.cu", space);
    builder
        .template_args(kl::type_of<float>())
        .problem_size(kl::arg0)
        .block_size(threads_per_block);

    // Define the kernel
    auto vector_add_kernel = kl::WisdomKernel(builder);

    // Initialize CUDA memory. This is outside the scope of kernel_launcher.
    unsigned int n = 1000000;
    float *dev_A, *dev_B, *dev_C;
    /* cudaMalloc, cudaMemcpy, ... */

    // Launch the kernel! Note that the kernel is compiled on the first call.
    // The grid size and block size do not need to be specified as they are
    // derived from the kernel specifications and runtime arguments.
    vector_add_kernel(n, dev_C, dev_A, dev_B);
}

License

Licensed under Apache 2.0. See LICENSE.

Citation

If you use Kernel Launcher in your work, please cite the following publication:

S. Heldens, B. van Werkhoven (2023), "Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications", The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IPDPS 2023

As BibTeX:

@inproceedings{heldens2023kernellauncher,
  title={Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications},
  author={Heldens, Stijn and van Werkhoven, Ben},
  journal={The Eighteenth International Workshop on Automatic Performance Tuning (iWAPT2023) co-located with IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2023},
  year={2023},
  pages={744-753},
  doi={10.1109/IPDPSW59300.2023.00126}}
}

Related Work

Kernel Tuner

Name		Name	Last commit message	Last commit date
Latest commit History 303 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
include		include
python		python
src		src
tests		tests
thirdparty		thirdparty
.clang-format		.clang-format
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
test.cu		test.cu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Kernel Launcher

Installation

Example

Pragma-based API

Builder-based API

License

Citation

Related Work

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors 4

Uh oh!

Languages

License

KernelTuner/kernel_launcher

Folders and files

Latest commit

History

Repository files navigation

Kernel Launcher

Installation

Example

Pragma-based API

Builder-based API

License

Citation

Related Work

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages