[cccl.c] Split build step into compile + load#8484

Open
shwina wants to merge 2 commits into NVIDIA:main from shwina:pr/aot-c-layer

@shwina shwina commented Apr 16, 2026

Description

Closes #8410.

This PR introduces two changes to all algorithms in CCCL C:

  1. The build structs are extended to hold the lowered names of the kernels contained in the cubin, as well as the size of the void* runtime_policy where applicable.
  2. Each cccl_<algo>_build algorithm is split into two steps: cccl_<algo>_compile and cccl_<algo>_load.

These changes are needed to support ahead-of-time compilation. Specifically, splitting the compile step into its own API lets the user compile on a CPU-only machine, for example, and for target architectures other than that of the machine doing the compilation.

Currently, the cccl_<algo>_build functions include calls to cuLibraryLoadData and cuLibraryGetKernel, both of which would fail in the above scenarios.
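To make the split concrete, here is a minimal sketch of the compile/load separation described above. It is not the CCCL C implementation: the struct and the functions example_compile, example_load, and example_cleanup are hypothetical stand-ins, the "cubin" is just a byte string supplied by the caller, and the driver calls are represented only by a comment and a flag so the sketch needs no GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string_view>

// Hypothetical stand-in for a per-algorithm build struct (not the real CCCL type).
struct example_build_result
{
  void* cubin                = nullptr; // compiled device code, owned by the struct
  std::size_t cubin_size     = 0;
  char* kernel_lowered_name  = nullptr; // lowered kernel name, recorded at compile time
  bool loaded                = false;   // stands in for the CUlibrary / CUkernel handles
};

// Step 1: compile. Touches no driver state, so it can run without a GPU.
// The "compiler output" is passed in by the caller to keep the sketch self-contained.
inline void example_compile(example_build_result* build,
                            std::string_view compiled_bytes,
                            std::string_view lowered_name)
{
  assert(build != nullptr);

  auto code = std::make_unique<char[]>(compiled_bytes.size());
  compiled_bytes.copy(code.get(), compiled_bytes.size());

  auto name = std::make_unique<char[]>(lowered_name.size() + 1);
  lowered_name.copy(name.get(), lowered_name.size());
  name[lowered_name.size()] = '\0';

  build->cubin               = code.release();
  build->cubin_size          = compiled_bytes.size();
  build->kernel_lowered_name = name.release();
}

// Step 2: load. In real code, this is where cuLibraryLoadData and
// cuLibraryGetKernel would be called, using the stored cubin and lowered name.
// Here the step only checks that compile produced what load needs.
inline bool example_load(example_build_result* build)
{
  assert(build != nullptr);
  if (build->cubin == nullptr || build->kernel_lowered_name == nullptr)
  {
    return false;
  }
  build->loaded = true;
  return true;
}

// Releases everything example_compile allocated (both were created with new[]).
inline void example_cleanup(example_build_result* build)
{
  assert(build != nullptr);
  std::unique_ptr<char[]> cubin(static_cast<char*>(build->cubin));
  std::unique_ptr<char[]> name(build->kernel_lowered_name);
  *build = example_build_result{};
}
```

The point of the sketch is that everything load needs (the code bytes and the lowered kernel name) is stored by compile, so the two steps can run at different times or on different machines.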

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested review from a team as code owners April 16, 2026 14:20
@shwina shwina requested a review from alliepiper April 16, 2026 14:20
@shwina shwina requested a review from gevtushenko April 16, 2026 14:20
@github-project-automation github-project-automation bot moved this to Todo in CCCL Apr 16, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Apr 16, 2026
@shwina shwina changed the title from "Split build into compile + load" to "[cccl.c] Split build step into compile + load" Apr 16, 2026
size_t cubin_size;
CUlibrary library;
CUkernel kernel;
// Lowered (mangled) kernel name, heap-allocated, freed by cccl_device_binary_search_cleanup():

Contributor Author

Could do without these comments.

Comment thread c/parallel/src/scan.cu Outdated
Comment on lines +438 to +440
build_ptr->runtime_policy = std::malloc(sizeof(cub::detail::scan::policy_selector));
build_ptr->runtime_policy_size = sizeof(cub::detail::scan::policy_selector);
std::memcpy(build_ptr->runtime_policy, &policy_sel, sizeof(cub::detail::scan::policy_selector));

Contributor

Q: Why can't we use new to allocate the policy selector anymore?

Contributor Author

We can keep the allocation/deallocation pattern the same. I undid this change.
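For illustration, the following sketch shows how a policy object can still be allocated with new while its size is recorded next to the void* pointer. The types example_policy and policy_holder and the three functions are hypothetical, and the assumption that the size is recorded so the policy bytes can be copied without knowing the concrete type is mine, not something the PR states. The sketch also assumes the policy type is trivially copyable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical policy type standing in for the real policy selector.
struct example_policy
{
  int block_threads;
  int items_per_thread;
};
static_assert(std::is_trivially_copyable_v<example_policy>,
              "byte-wise copying below is only valid for trivially copyable types");

// Hypothetical slice of a build struct: type-erased policy plus its size.
struct policy_holder
{
  void* runtime_policy            = nullptr;
  std::size_t runtime_policy_size = 0;
};

// Allocate with new (paired with delete in destroy_policy) and record the size.
inline void store_policy(policy_holder& h, const example_policy& p)
{
  h.runtime_policy      = new example_policy{p};
  h.runtime_policy_size = sizeof(example_policy);
}

// Copy the policy bytes out through the void* using only the recorded size.
inline std::vector<unsigned char> copy_policy_bytes(const policy_holder& h)
{
  std::vector<unsigned char> bytes(h.runtime_policy_size);
  if (h.runtime_policy_size > 0)
  {
    std::memcpy(bytes.data(), h.runtime_policy, h.runtime_policy_size);
  }
  return bytes;
}

// Deallocation must know the concrete type again, because delete on void* is not valid.
inline void destroy_policy(policy_holder& h)
{
  delete static_cast<example_policy*>(h.runtime_policy);
  h.runtime_policy      = nullptr;
  h.runtime_policy_size = 0;
}
```

Whichever allocation function is used, the deallocation has to match it: memory from new goes to delete, and memory from std::malloc goes to std::free.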

Contributor

@NaderAlAwar left a comment

You did not make the same changes to the for algorithm; is that intentional?

build_ptr->cc = cc;
build_ptr->cubin = (void*) result.data.release();
build_ptr->cubin_size = result.size;
build_ptr->kernel_lowered_name = strdup(lowered_name.c_str());

Contributor

Important: we should not use strdup (applies to all other files in the PR) since it is POSIX, not standard C++. I would suggest something like

char* duplicate_c_string(std::string_view s)
{
  auto p = std::make_unique<char[]>(s.size() + 1);
  std::memcpy(p.get(), s.data(), s.size());
  p[s.size()] = '\0';
  return p.release();
}

and use that everywhere


std::unique_ptr<char[]> cubin(reinterpret_cast<char*>(build_ptr->cubin));
check(cuLibraryUnload(build_ptr->library));
std::free(build_ptr->kernel_lowered_name);

Contributor

Important: after replacing strdup, we should use something similar to the std::unique_ptr<char[]> pattern above
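A minimal sketch of how the two suggestions fit together: the name is duplicated with new[] (via std::make_unique<char[]>), so cleanup has to release it with delete[], which adopting the pointer into a std::unique_ptr<char[]> does. The struct named_build and the function release_lowered_name are hypothetical. This variant of duplicate_c_string uses std::string_view::copy instead of std::memcpy so that an empty view with a null data() pointer is also handled.

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>
#include <string_view>

// Portable replacement for strdup: returns a new[]-allocated, NUL-terminated copy.
// The caller owns the result and must release it with delete[].
inline char* duplicate_c_string(std::string_view s)
{
  auto p = std::make_unique<char[]>(s.size() + 1);
  s.copy(p.get(), s.size()); // copies s.size() characters; a no-op for an empty view
  p[s.size()] = '\0';
  return p.release();
}

// Hypothetical slice of a build struct holding the lowered kernel name.
struct named_build
{
  char* kernel_lowered_name = nullptr;
};

// Cleanup counterpart: adopting the raw pointer into unique_ptr<char[]>
// guarantees delete[] runs when the function returns. std::free would be
// wrong here, because the memory no longer comes from malloc/strdup.
inline void release_lowered_name(named_build& b)
{
  std::unique_ptr<char[]> name(b.kernel_lowered_name);
  b.kernel_lowered_name = nullptr;
}
```

Pairing the helper with a single cleanup function keeps the allocation and deallocation forms in one place, which matters once the same pattern is repeated across every algorithm file.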

Comment thread c/parallel/src/reduce.cu
#include <cstring> // strdup, free, memcpy
#include <format>
#include <memory>
#include <new> // std::nothrow

Contributor

Suggestion: this seems like it's unused, should be removed.

@github-actions

🥳 CI Workflow Results

🟩 Finished in 1h 08m: Pass: 100%/20 | Total: 5h 07m | Max: 42m 24s | Hits: 95%/1422

See results here.

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[cccl.c]: Add infrastructure to support ahead-of-time compilation (AoT)

3 participants