Skip to content

cmake: portable CUDA arch default + require CMake 3.25 for device LTO#875

Open
IvanaGyro wants to merge 2 commits into
masterfrom
claude/claude-local-rules-SB77V
Open

cmake: portable CUDA arch default + require CMake 3.25 for device LTO#875
IvanaGyro wants to merge 2 commits into
masterfrom
claude/claude-local-rules-SB77V

Conversation

@IvanaGyro
Copy link
Copy Markdown
Collaborator

@IvanaGyro IvanaGyro commented Jun 3, 2026

Summary

Two related CMake changes to harden CUDA configuration for GPU-less build environments and enable CUDA device LTO.

1. Replace native CUDA architecture default with a portable fat-binary list

Problem: The previous set(CMAKE_CUDA_ARCHITECTURES native) requires a CUDA-capable GPU to be present at configure time. Configure fails on CI runners, cloud VMs, and developer machines without a GPU — even when cross-compiling for a GPU target.

Change: A guarded default is set before enable_language(CUDA):

if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
  set(CMAKE_CUDA_ARCHITECTURES 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
endif()
  • The guard (NOT … AND NOT DEFINED ENV{CUDAARCHS}) ensures any user-supplied value — either the CMake cache variable or the CUDAARCHS environment variable — takes precedence. CMake's enable_language(CUDA) reads CUDAARCHS natively, so the old manual copy of that env var into CMAKE_CUDA_ARCHITECTURES was redundant and is removed.
  • The guard must run before enable_language(CUDA): after that call CMake fills in its own default, making the "not yet set" check unreliable.
  • Architecture floor is sm_70 (Volta), the minimum required by cuQuantum 24.x. sm_90 with PTX (90-virtual) is included so the binary JIT-compiles on future architectures.

2. Raise cmake_minimum_required to 3.25

Problem: CUDA device LTO (-dlto, activated by CMAKE_INTERPROCEDURAL_OPTIMIZATION) requires CMake ≥ 3.25. Declaring a lower minimum silently degraded to host-only LTO even when INTERPROCEDURAL_OPTIMIZATION was set.

Change: cmake_minimum_required(VERSION 3.25) with an inline comment documenting the reason.

Testing

Verified on a machine with CUDA toolkit installed but CMAKE_CUDA_ARCHITECTURES unset:

  • Configure succeeds without a GPU present.
  • Generated build.ninja shows -gencode arch=compute_70,code=sm_70 … arch=compute_90,code=sm_90 (fat binary).
  • CUDAARCHS=80 cmake --preset openblas-cuda overrides to sm_80 only, confirming the guard is respected.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the CMake configuration to dynamically set CUDA target architectures from environment variables (falling back to a portable fat binary default) and adds a workaround to disable CUDA device LTO while preserving host C++ LTO. The review feedback highlights two critical issues: first, environment variables for CUDA architectures are space-separated and must be converted to semicolon-separated lists to prevent CMake build failures; second, the LTO workaround unconditionally applies GCC-specific flags on all non-Apple platforms, which will break builds on Windows (MSVC) and Clang-based systems. Code suggestions are provided to resolve both issues.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread CMakeLists.txt Outdated
Comment thread CMakeLists.txt Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (b2fe84c) to head (57b7618).
⚠️ Report is 2 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #875   +/-   ##
=======================================
  Coverage   29.49%   29.49%           
=======================================
  Files         241      241           
  Lines       35512    35512           
  Branches    14777    14777           
=======================================
  Hits        10475    10475           
  Misses      17784    17784           
  Partials     7253     7253           
Flag Coverage Δ
cpp 29.09% <ø> (ø)
python 52.71% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
C++ backend 30.74% <ø> (ø)
Python bindings 17.09% <ø> (ø)
Python package 52.71% <ø> (ø)

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2fe84c...57b7618. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@IvanaGyro IvanaGyro force-pushed the claude/claude-local-rules-SB77V branch 2 times, most recently from d495108 to b402f36 Compare June 3, 2026 09:15
@IvanaGyro IvanaGyro changed the title cmake: portable CUDA arch default + device-LTO workaround for GPU-less CI cmake: portable CUDA arch default + nvlink empty-stub fix for glibc 2.34+ Jun 3, 2026
IvanaGyro and others added 2 commits June 3, 2026 09:21
…fault

The previous unconditional `set(CMAKE_CUDA_ARCHITECTURES native)` had
three problems for redistributable PyPI wheels built on GPU-less CI runners:

1. `native` queries the local GPU at configure time, so the build fails
   outright on a machine with no NVIDIA device.
2. Even when it succeeds, `native` bakes in only the build host's
   architecture — the resulting wheel does not run on any other GPU
   generation.
3. Because it was a plain (non-cache) `set()`, it overrode any value
   supplied via -D, a CMakePresets.json cacheVariables entry, the CUDAARCHS
   environment variable, or the cache, so there was no way to override it
   without editing the file.

Replace it with a guarded default that runs before enable_language(CUDA):

  if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
    set(CMAKE_CUDA_ARCHITECTURES
        70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
  endif()
  enable_language(CUDA)

The guard must run before enable_language(CUDA): afterwards
CMAKE_CUDA_ARCHITECTURES is never empty (CMake fills in its own default),
so the "not specified" case can no longer be detected. enable_language(CUDA)
already reads the CUDAARCHS environment variable on its own, so the guard
only has to avoid shadowing it — there is no need to copy CUDAARCHS into the
variable, and no need to honor a CMAKE_CUDA_ARCHITECTURES environment
variable (CMake defines no such variable; only CUDAARCHS is standard).
A -D flag, the cache, or a preset's cacheVariables populate the normal/cache
variable, so the NOT CMAKE_CUDA_ARCHITECTURES guard lets them win.

The default targets Volta through Hopper (sm_70 is the minimum required by
cuTENSOR/cuQuantum), with 90-virtual PTX so the driver can JIT-compile for
GPUs newer than Hopper without a rebuild.

Also drop the now-false "native" justification on the cmake_minimum_required
line: with the native keyword gone, the comment is updated to note that
cmake_language(EVAL CODE ...) is the feature setting the 3.18 lower bound,
with 3.24 kept as the tested minimum.

Closes #870.

Co-authored-by: Claude <noreply@anthropic.com>
CMake 3.25 is the first release where CMAKE_INTERPROCEDURAL_OPTIMIZATION
(and the INTERPROCEDURAL_OPTIMIZATION target property) activate CUDA device
link-time optimization (nvcc -dlto) in addition to host C++ LTO. On earlier
CMake versions the same setting silently produces no device LTO for CUDA
targets, so the optimisation the build asks for via
CMAKE_INTERPROCEDURAL_OPTIMIZATION=TRUE would be quietly dropped.

Raise the floor from 3.24 to 3.25 so the requested device LTO is actually
emitted rather than ignored. The previous 3.24 floor existed only for the
removed "native" CUDA architecture keyword; the remaining version-sensitive
feature, cmake_language(EVAL CODE ...), needs 3.18, which 3.25 also covers.

Co-authored-by: Claude <noreply@anthropic.com>
@IvanaGyro IvanaGyro force-pushed the claude/claude-local-rules-SB77V branch from b402f36 to 57b7618 Compare June 3, 2026 09:36
@IvanaGyro IvanaGyro changed the title cmake: portable CUDA arch default + nvlink empty-stub fix for glibc 2.34+ cmake: portable CUDA arch default + require CMake 3.25 for device LTO Jun 3, 2026
@IvanaGyro IvanaGyro marked this pull request as ready for review June 3, 2026 16:16
@IvanaGyro IvanaGyro requested review from pcchen and yingjerkao June 3, 2026 16:16
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57b7618b84

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread CMakeLists.txt
# the floor required by cuTENSOR/cuQuantum, up through Hopper sm_90) plus PTX
# of the newest (90-virtual) so the driver can JIT for newer/unknown GPUs.
if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
set(CMAKE_CUDA_ARCHITECTURES 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate CUDA architectures by toolkit version

When USE_CUDA=ON is configured with the documented CUDA v10+ dependency (docs/source/adv_install.rst:32) but without an explicit CMAKE_CUDA_ARCHITECTURES, this default now passes compute_89/compute_90 to nvcc. CUDA toolkits before 11.8 do not know those architectures, so CMake's CUDA compiler check or the first CUDA compile fails with nvcc fatal: Unsupported gpu architecture, even though those CUDA versions were previously supported via native. Please derive the default from CMAKE_CUDA_COMPILER_VERSION or keep the default to architectures supported by the minimum CUDA version.

Useful? React with 👍 / 👎.

@pcchen
Copy link
Copy Markdown
Collaborator

pcchen commented Jun 7, 2026

Review: PR #875 — "cmake: portable CUDA arch default + require CMake 3.25 for device LTO"

Overview

Two focused CMake changes: replace CMAKE_CUDA_ARCHITECTURES native (requires a GPU at configure time) with a guarded fat-binary default, and bump cmake_minimum_required to 3.25 for proper CUDA device LTO support.


Change 1 — Portable CUDA architecture default

The logic is correct:

  • The guard NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS} correctly defers to any user-supplied value (cache variable, preset cacheVariables, or the CUDAARCHS env var that enable_language(CUDA) reads natively).
  • Placement before enable_language(CUDA) is necessary — after that call CMake fills in its own default, making the "not yet set" check unreliable. The comment documents this correctly.
  • Removing the old manual ENV{CUDAARCHS} copy is right — enable_language(CUDA) handles it natively since CMake 3.17.
  • 90-virtual (PTX) enables JIT compilation on future architectures. Correct approach.

The existing Gemini/codex comments on CUDAARCHS string parsing and the GCC LTO flags are outdated — both were addressed by dropping the code they reviewed. @IvanaGyro's replies confirm this.

One open concern (codex P1): Architecture list vs. documented CUDA minimum

The default list includes sm_89 (Ada Lovelace) and sm_90 (Hopper), which require CUDA 11.8+. The docs (adv_install.rst:32) still say "CUDA v10+". A user following the documentation who has CUDA 10–11.7 and builds with USE_CUDA=ON without explicitly setting CMAKE_CUDA_ARCHITECTURES will get an nvcc error about unrecognised compute capabilities.

The simplest fix is to update the documentation minimum to match the arch list. In practice, the full GPU feature set (cuTENSOR 24.x, cuQuantum) already requires CUDA 12.x, so "CUDA 10+" is stale regardless. Either:

  1. Update docs to state CUDA ≥ 11.8 (or ≥ 12.x if cuTENSOR 24.x is the real floor), or
  2. Trim the default arch list to 70-real 75-real 80-real 86-real 80-virtual (safe with CUDA ≥ 11.1) and note that users on Hopper/Ada should override.

Change 2 — cmake_minimum_required(VERSION 3.25)

Correct. CMake 3.25 added proper INTERPROCEDURAL_OPTIMIZATION support for CUDA device LTO. The old 3.24 comment ("require for the 'native' value") was also becoming stale now that native is removed. The bump is a single minor version (released Nov 2022) and does not introduce a new de-facto requirement beyond what 3.24 already imposed.


Summary

Item Status
Guard logic and ordering ✅ Correct
CUDAARCHS env var handling ✅ Correct (enable_language handles natively)
90-virtual PTX for future GPUs ✅ Correct
Gemini / outdated codex comments ✅ Addressed — code they reviewed was dropped
cmake_minimum_required 3.24 → 3.25 ✅ Correct and justified
Docs still say "CUDA v10+" but default arch list requires CUDA 11.8+ ⚠️ Should update adv_install.rst before merging

Posted by Claude Code on behalf of @pcchen

@pcchen
Copy link
Copy Markdown
Collaborator

pcchen commented Jun 7, 2026

@IvanaGyro We will update the document later. Please decide if we want to follow this comment: "Trim the default arch list to 70-real 75-real 80-real 86-real 80-virtual (safe with CUDA ≥ 11.1) and note that users on Hopper/Ada should override."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants