Skip to content

Improve handling of the NVTE_CUDA_ARCHS#2665

Open
ptrendx wants to merge 10 commits intoNVIDIA:mainfrom
ptrendx:pr_get_arch_cmake2
Open

Improve handling of the NVTE_CUDA_ARCHS#2665
ptrendx wants to merge 10 commits intoNVIDIA:mainfrom
ptrendx:pr_get_arch_cmake2

Conversation

@ptrendx
Copy link
Copy Markdown
Member

@ptrendx ptrendx commented Feb 9, 2026

Description

Improve handling of the NVTE_CUDA_ARCHS env variable:

  • add the regular architectures to the build of the sources with specific architectures to enable some support for GPU architectures in the family that were not specialized directly.
  • automatically add sm75 to the build in case the CMAKE_CUDA_ARCHITECTURES becomes empty (which currently should only happen when cmake < 4.0.2 and sm120 is the only selected architecture)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx requested a review from ksivaman February 9, 2026 23:09
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Feb 9, 2026

Greptile Summary

This PR improves GPU architecture handling in the build system with two main changes: (1) for arch 100/101, the base arch is intentionally kept in CMAKE_CUDA_ARCHITECTURES (so all sources, including arch-specific ones, are compiled for the generic arch) while only the specialized variant (100a, 101a) is added to NVTE_SPECIFIC_ARCHS for the arch-specific sources' extra codegen — this provides graceful fallback for GPUs in the family not covered by a specialized variant; (2) for arch 110/120, CMake 4.0.2+ native f-suffix support is leveraged to add the family-wide arch directly to CMAKE_CUDA_ARCHITECTURES, while older CMake falls back to the manual NVTE_GENERIC_ARCHS / NVTE_SPECIFIC_ARCHS split with an sm_75 guard to avoid an empty arch list.

Key observations:

  • The architectural split between "keep 100 in cmake" vs "route 110/120 through f-suffix" is intentional and correctly implemented per the explained hardware semantics.
  • set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) on line 103 is unconditional: NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 is defined on all transformer_engine_cuda_arch_specific_sources regardless of whether NVTE_SPECIFIC_ARCHS is populated. This is an open concern from prior review threads that has not been acknowledged.
  • The ptx.cuh change is a direct consequence of the above: by removing the static_assert when NVTE_HAS_ARCH_SPECIFIC_TARGETS is defined, compile-time guards that catch arch-specific feature misuse in generic builds are disabled for every configuration — including pure generic-arch builds (e.g., sm_70/80/89/90 only) where no arch-specific code generation is actually happening.
  • The duplicate list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "110") noted in a prior thread is resolved in this revision (now appears only once at line 61).

Confidence Score: 3/5

  • This PR is conditionally safe to merge for its primary CMake 4.0.2+ path, but the unconditional NVTE_ARCH_SPECIFIC_TARGETS flag (and consequent silencing of ptx.cuh static_asserts) is an open concern that warrants resolution before landing.
  • The arch 100/101 intentional-keep and the CMake 4.0.2 "f"-suffix improvements are correctly implemented. The duplicate REMOVE_ITEM for arch 110 from prior review has been fixed. However, set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) at line 103 is unconditional, meaning NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 is always injected into arch-specific sources, which permanently disables the compile-time static_assert guards in ptx.cuh for all build configurations — including pure generic-arch builds with no arch-specific codegen. This open concern from previous review threads lowers the confidence score.
  • transformer_engine/common/CMakeLists.txt (line 103: unconditional NVTE_ARCH_SPECIFIC_TARGETS) and transformer_engine/common/util/ptx.cuh (lines 34-42, 56-64: static_assert gates now permanently disabled)

Important Files Changed

Filename Overview
transformer_engine/common/CMakeLists.txt Significant rework of GPU architecture handling: arch 100/101 intentionally kept in CMAKE_CUDA_ARCHITECTURES while adding specific variants; arch 110/120 now use CMake 4.0.2+ native "f" suffix when available; sm_75 fallback added for empty arch list on older CMake; NVTE_ARCH_SPECIFIC_TARGETS unconditionally set to TRUE regardless of whether NVTE_SPECIFIC_ARCHS is populated, which disables compile-time safety checks in all arch-specific source builds.
transformer_engine/common/util/ptx.cuh Both ArchSpecific::compatible() and FamilySpecific::compatible() now gate their static_assert behind #if !defined(NVTE_HAS_ARCH_SPECIFIC_TARGETS). When the macro is defined (always, per CMakeLists.txt line 103), the assertions are replaced by a runtime-like constexpr boolean comparison, silencing compile-time guards that were designed to catch misuse of arch/family-specific features in generic builds.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start: Process CMAKE_CUDA_ARCHITECTURES] --> B{arch 100 present?}
    B -- Yes --> C[Keep 100 in CMAKE_CUDA_ARCHITECTURES\nAppend 100a to NVTE_SPECIFIC_ARCHS\nIf CUDA>=12.9: also append 103a]
    B -- No --> D{arch 101 present?}
    C --> D
    D -- Yes --> E[Keep 101 in CMAKE_CUDA_ARCHITECTURES\nAppend 101a to NVTE_SPECIFIC_ARCHS]
    D -- No --> F{arch 110 present?}
    E --> F
    F -- Yes --> G[Remove 110 from CMAKE_CUDA_ARCHITECTURES]
    G --> H{CMake >= 4.0.2?}
    H -- Yes --> I[Append 110f to CMAKE_CUDA_ARCHITECTURES]
    H -- No --> J[Append 110 to NVTE_GENERIC_ARCHS\nAppend 110f to NVTE_SPECIFIC_ARCHS]
    F -- No --> K{arch 120 present?}
    I --> K
    J --> K
    K -- Yes --> L[Remove 120 from CMAKE_CUDA_ARCHITECTURES]
    L --> M{CMake >= 4.0.2?}
    M -- Yes --> N{CUDA >= 12.9?}
    N -- Yes --> O[Append 120f to CMAKE_CUDA_ARCHITECTURES]
    N -- No --> P[Append 120 to NVTE_GENERIC_ARCHS\nAppend 120a to NVTE_SPECIFIC_ARCHS]
    M -- No --> Q{CUDA >= 12.9?}
    Q -- Yes --> R[Append 120 to NVTE_GENERIC_ARCHS\nAppend 120f to NVTE_SPECIFIC_ARCHS]
    Q -- No --> S[Append 120 to NVTE_GENERIC_ARCHS\nAppend 120a to NVTE_SPECIFIC_ARCHS]
    K -- No --> T{CMake < 4.0.2 AND\nCMAKE_CUDA_ARCHITECTURES empty?}
    O --> T
    P --> T
    R --> T
    S --> T
    T -- Yes --> U[WARNING: set CMAKE_CUDA_ARCHITECTURES = 75]
    T -- No --> V[set NVTE_ARCH_SPECIFIC_TARGETS = TRUE unconditionally]
    U --> V
    V --> W[Compile generic sources for NVTE_GENERIC_ARCHS\nCompile arch-specific sources for NVTE_SPECIFIC_ARCHS\nDefine NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 on all arch-specific sources]
Loading

Last reviewed commit: 32d1dd5

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines 43 to 56
@@ -54,31 +52,57 @@ endif()
# Check for architecture 101 (if we see this we are in toolkit <= 12.9)
list(FIND CMAKE_CUDA_ARCHITECTURES "101" arch_101_index)
if(NOT arch_101_index EQUAL -1)
list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES "101")
list(APPEND NVTE_GENERIC_ARCHS "101")
list(APPEND NVTE_SPECIFIC_ARCHS "101a")
endif()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arch 100/101 not removed

When "100" / "101" are present in CMAKE_CUDA_ARCHITECTURES, this block only appends 100a/101a to NVTE_SPECIFIC_ARCHS but never removes the base arch from CMAKE_CUDA_ARCHITECTURES nor adds it to NVTE_GENERIC_ARCHS. As a result, the build will still compile all sources for sm_100 / sm_101 (via CMAKE_CUDA_ARCHITECTURES) while also compiling arch-specific sources for sm_100a / sm_101a (via --generate-code), which defeats the “generic + specific” split and can cause unexpected extra build work / incorrect targeting. This differs from the 110/120 handling below, which removes the base arch and routes it into the explicit lists.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai That is because arches 110/120 are routed to the "f" (like 120f) variant where the entire family is covered, whereas in the case of 100, it routes to "a" variants (like 100a) - this is intentional behavior, but in order to support different Arches in the same family, we need to still compile for the regular 100 then.

Comment on lines +93 to 101
if(CMAKE_VERSION VERSION_LESS 4.0.2)
if(NOT CMAKE_CUDA_ARCHITECTURES)
message(WARNING
"CMAKE_CUDA_ARCHITECTURES is empty after replacing arch-specific targets. "
"Please upgrade to CMake 4.0.2+ for native 'f' architecture support. "
"Adding sm_75 target in addition to the specified target to avoid configuration "
"errors - this will result in longer build time, but does not affect correctness.")
set(CMAKE_CUDA_ARCHITECTURES 75)
endif()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback wipes user archs

On CMake < 4.0.2, if the arch list becomes empty you set(CMAKE_CUDA_ARCHITECTURES 75) (dropping the user’s originally requested arch entirely). Given the warning text says “add sm_75 target in addition to the specified target”, this looks like a functional mismatch: the fallback should preserve the originally selected arch(es) and append 75, not replace everything with 75.

Copy link
Copy Markdown
Member Author

@ptrendx ptrendx Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greptileai If the list is empty, then setting is the same as appending.

@ptrendx
Copy link
Copy Markdown
Member Author

ptrendx commented Feb 11, 2026

/te-ci

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +104 to 106
set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)

# cuDNN frontend API
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always-on arch-specific define

set(NVTE_ARCH_SPECIFIC_TARGETS TRUE) makes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 get added to all transformer_engine_cuda_arch_specific_sources (lines ~247-255) even when NVTE_SPECIFIC_ARCHS is empty (e.g., user selects only 70/80/89/90). That flips the behavior in util/ptx.cuh to skip the compile-time static_assert that catches “compiled generic but using arch/family-specific features”, so those mistakes will now silently pass in exactly the builds where no arch-specific codegen flags are being injected for these sources.

@ptrendx
Copy link
Copy Markdown
Member Author

ptrendx commented Feb 12, 2026

/te-ci

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

endif()
endif()

set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unconditional flag disables compile-time safety checks. NVTE_ARCH_SPECIFIC_TARGETS is set to TRUE even when NVTE_SPECIFIC_ARCHS is empty (e.g., user builds only sm_70/80/89/90). This causes NVTE_HAS_ARCH_SPECIFIC_TARGETS=1 to be defined for all arch-specific sources (lines 249-258), which disables the static_assert in ptx.cuh lines 34-38 and 56-60. The compile-time check that catches misuse of arch/family-specific features in generic builds is now always off, even when no arch-specific code generation is happening.

Suggested change
set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)
if(NVTE_SPECIFIC_ARCHS)
set(NVTE_ARCH_SPECIFIC_TARGETS TRUE)
endif()

@ptrendx
Copy link
Copy Markdown
Member Author

ptrendx commented Feb 18, 2026

/te-ci

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@ptrendx
Copy link
Copy Markdown
Member Author

ptrendx commented Mar 4, 2026

/te-ci

ptrendx added 2 commits March 10, 2026 09:52
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Copy link
Copy Markdown
Member Author

ptrendx commented Mar 10, 2026

/te-ci

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant