cmake: portable CUDA arch default + require CMake 3.25 for device LTO by IvanaGyro · Pull Request #875 · Cytnx-dev/Cytnx

IvanaGyro · 2026-06-03T07:02:39Z

Summary

Two related CMake changes to harden CUDA configuration for GPU-less build environments and enable CUDA device LTO.

1. Replace `native` CUDA architecture default with a portable fat-binary list

Problem: The previous set(CMAKE_CUDA_ARCHITECTURES native) requires a CUDA-capable GPU to be present at configure time. Configure fails on CI runners, cloud VMs, and developer machines without a GPU — even when cross-compiling for a GPU target.

Change: A guarded default is set before enable_language(CUDA):

if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
  set(CMAKE_CUDA_ARCHITECTURES 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)
endif()

The guard (NOT … AND NOT DEFINED ENV{CUDAARCHS}) ensures any user-supplied value — either the CMake cache variable or the CUDAARCHS environment variable — takes precedence. CMake's enable_language(CUDA) reads CUDAARCHS natively, so the old manual copy of that env var into CMAKE_CUDA_ARCHITECTURES was redundant and is removed.
The guard must run before enable_language(CUDA): after that call CMake fills in its own default, making the "not yet set" check unreliable.
Architecture floor is sm_70 (Volta), the minimum required by cuQuantum 24.x. sm_90 with PTX (90-virtual) is included so the binary JIT-compiles on future architectures.

2. Raise `cmake_minimum_required` to 3.25

Problem: CUDA device LTO (-dlto, activated by CMAKE_INTERPROCEDURAL_OPTIMIZATION) requires CMake ≥ 3.25. Declaring a lower minimum silently degraded to host-only LTO even when INTERPROCEDURAL_OPTIMIZATION was set.

Change: cmake_minimum_required(VERSION 3.25) with an inline comment documenting the reason.

Testing

Verified on a machine with CUDA toolkit installed but CMAKE_CUDA_ARCHITECTURES unset:

Configure succeeds without a GPU present.
Generated build.ninja shows -gencode arch=compute_70,code=sm_70 … arch=compute_90,code=sm_90 (fat binary).
CUDAARCHS=80 cmake --preset openblas-cuda overrides to sm_80 only, confirming the guard is respected.

gemini-code-assist

Code Review

This pull request updates the CMake configuration to dynamically set CUDA target architectures from environment variables (falling back to a portable fat binary default) and adds a workaround to disable CUDA device LTO while preserving host C++ LTO. The review feedback highlights two critical issues: first, environment variables for CUDA architectures are space-separated and must be converted to semicolon-separated lists to prevent CMake build failures; second, the LTO workaround unconditionally applies GCC-specific flags on all non-Apple platforms, which will break builds on Windows (MSVC) and Clang-based systems. Code suggestions are provided to resolve both issues.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

codecov · 2026-06-03T07:20:16Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (b2fe84c) to head (57b7618).
⚠️ Report is 2 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #875   +/-   ##
=======================================
  Coverage   29.49%   29.49%           
=======================================
  Files         241      241           
  Lines       35512    35512           
  Branches    14777    14777           
=======================================
  Hits        10475    10475           
  Misses      17784    17784           
  Partials     7253     7253

Flag	Coverage Δ
cpp	`29.09% <ø> (ø)`
python	`52.71% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
C++ backend	`30.74% <ø> (ø)`
Python bindings	`17.09% <ø> (ø)`
Python package	`52.71% <ø> (ø)`

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2fe84c...57b7618. Read the comment docs.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…fault The previous unconditional `set(CMAKE_CUDA_ARCHITECTURES native)` had three problems for redistributable PyPI wheels built on GPU-less CI runners: 1. `native` queries the local GPU at configure time, so the build fails outright on a machine with no NVIDIA device. 2. Even when it succeeds, `native` bakes in only the build host's architecture — the resulting wheel does not run on any other GPU generation. 3. Because it was a plain (non-cache) `set()`, it overrode any value supplied via -D, a CMakePresets.json cacheVariables entry, the CUDAARCHS environment variable, or the cache, so there was no way to override it without editing the file. Replace it with a guarded default that runs before enable_language(CUDA): if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS}) set(CMAKE_CUDA_ARCHITECTURES 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual) endif() enable_language(CUDA) The guard must run before enable_language(CUDA): afterwards CMAKE_CUDA_ARCHITECTURES is never empty (CMake fills in its own default), so the "not specified" case can no longer be detected. enable_language(CUDA) already reads the CUDAARCHS environment variable on its own, so the guard only has to avoid shadowing it — there is no need to copy CUDAARCHS into the variable, and no need to honor a CMAKE_CUDA_ARCHITECTURES environment variable (CMake defines no such variable; only CUDAARCHS is standard). A -D flag, the cache, or a preset's cacheVariables populate the normal/cache variable, so the NOT CMAKE_CUDA_ARCHITECTURES guard lets them win. The default targets Volta through Hopper (sm_70 is the minimum required by cuTENSOR/cuQuantum), with 90-virtual PTX so the driver can JIT-compile for GPUs newer than Hopper without a rebuild. Also drop the now-false "native" justification on the cmake_minimum_required line: with the native keyword gone, the comment is updated to note that cmake_language(EVAL CODE ...) is the feature setting the 3.18 lower bound, with 3.24 kept as the tested minimum. Closes #870. Co-authored-by: Claude <noreply@anthropic.com>

CMake 3.25 is the first release where CMAKE_INTERPROCEDURAL_OPTIMIZATION (and the INTERPROCEDURAL_OPTIMIZATION target property) activate CUDA device link-time optimization (nvcc -dlto) in addition to host C++ LTO. On earlier CMake versions the same setting silently produces no device LTO for CUDA targets, so the optimisation the build asks for via CMAKE_INTERPROCEDURAL_OPTIMIZATION=TRUE would be quietly dropped. Raise the floor from 3.24 to 3.25 so the requested device LTO is actually emitted rather than ignored. The previous 3.24 floor existed only for the removed "native" CUDA architecture keyword; the remaining version-sensitive feature, cmake_language(EVAL CODE ...), needs 3.18, which 3.25 also covers. Co-authored-by: Claude <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 57b7618b84

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-03T16:19:30Z

+  # the floor required by cuTENSOR/cuQuantum, up through Hopper sm_90) plus PTX
+  # of the newest (90-virtual) so the driver can JIT for newer/unknown GPUs.
+  if(NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS})
+    set(CMAKE_CUDA_ARCHITECTURES 70-real 75-real 80-real 86-real 89-real 90-real 90-virtual)


Gate CUDA architectures by toolkit version

When USE_CUDA=ON is configured with the documented CUDA v10+ dependency (docs/source/adv_install.rst:32) but without an explicit CMAKE_CUDA_ARCHITECTURES, this default now passes compute_89/compute_90 to nvcc. CUDA toolkits before 11.8 do not know those architectures, so CMake's CUDA compiler check or the first CUDA compile fails with nvcc fatal: Unsupported gpu architecture, even though those CUDA versions were previously supported via native. Please derive the default from CMAKE_CUDA_COMPILER_VERSION or keep the default to architectures supported by the minimum CUDA version.

Useful? React with 👍 / 👎.

pcchen · 2026-06-07T02:50:53Z

Review: PR #875 — "cmake: portable CUDA arch default + require CMake 3.25 for device LTO"

Overview

Two focused CMake changes: replace CMAKE_CUDA_ARCHITECTURES native (requires a GPU at configure time) with a guarded fat-binary default, and bump cmake_minimum_required to 3.25 for proper CUDA device LTO support.

Change 1 — Portable CUDA architecture default

The logic is correct:

The guard NOT CMAKE_CUDA_ARCHITECTURES AND NOT DEFINED ENV{CUDAARCHS} correctly defers to any user-supplied value (cache variable, preset cacheVariables, or the CUDAARCHS env var that enable_language(CUDA) reads natively).
Placement before enable_language(CUDA) is necessary — after that call CMake fills in its own default, making the "not yet set" check unreliable. The comment documents this correctly.
Removing the old manual ENV{CUDAARCHS} copy is right — enable_language(CUDA) handles it natively since CMake 3.17.
90-virtual (PTX) enables JIT compilation on future architectures. Correct approach.

The existing Gemini/codex comments on CUDAARCHS string parsing and the GCC LTO flags are outdated — both were addressed by dropping the code they reviewed. @IvanaGyro's replies confirm this.

One open concern (codex P1): Architecture list vs. documented CUDA minimum

The default list includes sm_89 (Ada Lovelace) and sm_90 (Hopper), which require CUDA 11.8+. The docs (adv_install.rst:32) still say "CUDA v10+". A user following the documentation who has CUDA 10–11.7 and builds with USE_CUDA=ON without explicitly setting CMAKE_CUDA_ARCHITECTURES will get an nvcc error about unrecognised compute capabilities.

The simplest fix is to update the documentation minimum to match the arch list. In practice, the full GPU feature set (cuTENSOR 24.x, cuQuantum) already requires CUDA 12.x, so "CUDA 10+" is stale regardless. Either:

Update docs to state CUDA ≥ 11.8 (or ≥ 12.x if cuTENSOR 24.x is the real floor), or
Trim the default arch list to 70-real 75-real 80-real 86-real 80-virtual (safe with CUDA ≥ 11.1) and note that users on Hopper/Ada should override.

Change 2 — `cmake_minimum_required(VERSION 3.25)`

Correct. CMake 3.25 added proper INTERPROCEDURAL_OPTIMIZATION support for CUDA device LTO. The old 3.24 comment ("require for the 'native' value") was also becoming stale now that native is removed. The bump is a single minor version (released Nov 2022) and does not introduce a new de-facto requirement beyond what 3.24 already imposed.

Summary

Item	Status
Guard logic and ordering	✅ Correct
`CUDAARCHS` env var handling	✅ Correct (`enable_language` handles natively)
`90-virtual` PTX for future GPUs	✅ Correct
Gemini / outdated codex comments	✅ Addressed — code they reviewed was dropped
`cmake_minimum_required` 3.24 → 3.25	✅ Correct and justified
Docs still say "CUDA v10+" but default arch list requires CUDA 11.8+	⚠️ Should update `adv_install.rst` before merging

Posted by Claude Code on behalf of @pcchen

pcchen · 2026-06-07T02:52:36Z

@IvanaGyro We will update the document later. Please decide if we want to follow this comment: "Trim the default arch list to 70-real 75-real 80-real 86-real 80-virtual (safe with CUDA ≥ 11.1) and note that users on Hopper/Ada should override."

gemini-code-assist Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread CMakeLists.txt Outdated

Comment thread CMakeLists.txt Outdated

IvanaGyro force-pushed the claude/claude-local-rules-SB77V branch 2 times, most recently from d495108 to b402f36 Compare June 3, 2026 09:15

IvanaGyro changed the title ~~cmake: portable CUDA arch default + device-LTO workaround for GPU-less CI~~ cmake: portable CUDA arch default + nvlink empty-stub fix for glibc 2.34+ Jun 3, 2026

IvanaGyro and others added 2 commits June 3, 2026 09:21

IvanaGyro force-pushed the claude/claude-local-rules-SB77V branch from b402f36 to 57b7618 Compare June 3, 2026 09:36

IvanaGyro changed the title ~~cmake: portable CUDA arch default + nvlink empty-stub fix for glibc 2.34+~~ cmake: portable CUDA arch default + require CMake 3.25 for device LTO Jun 3, 2026

IvanaGyro mentioned this pull request Jun 3, 2026

docs: troubleshoot CUDA device-LTO elfLink failure on apt nvidia-cuda-toolkit #876

Open

IvanaGyro marked this pull request as ready for review June 3, 2026 16:16

IvanaGyro requested review from pcchen and yingjerkao June 3, 2026 16:16

chatgpt-codex-connector Bot reviewed Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmake: portable CUDA arch default + require CMake 3.25 for device LTO#875

cmake: portable CUDA arch default + require CMake 3.25 for device LTO#875
IvanaGyro wants to merge 2 commits into
masterfrom
claude/claude-local-rules-SB77V

IvanaGyro commented Jun 3, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 3, 2026

Uh oh!

pcchen commented Jun 7, 2026

Uh oh!

pcchen commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

IvanaGyro commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Replace native CUDA architecture default with a portable fat-binary list

2. Raise cmake_minimum_required to 3.25

Testing

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

pcchen commented Jun 7, 2026

Review: PR #875 — "cmake: portable CUDA arch default + require CMake 3.25 for device LTO"

Overview

Change 1 — Portable CUDA architecture default

Change 2 — cmake_minimum_required(VERSION 3.25)

Summary

Uh oh!

pcchen commented Jun 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IvanaGyro commented Jun 3, 2026 •

edited

Loading

1. Replace `native` CUDA architecture default with a portable fat-binary list

2. Raise `cmake_minimum_required` to 3.25

codecov Bot commented Jun 3, 2026 •

edited

Loading

Change 2 — `cmake_minimum_required(VERSION 3.25)`