
Split grouped quantize/activations and dbias for faster compilation on multicore machines #2983

Merged
ptrendx merged 4 commits into
NVIDIA:main from
ptrendx:pr_split_compilation
May 21, 2026


@ptrendx ptrendx commented May 12, 2026

Description

The proper fix would be to compile these kernels at runtime with NVRTC, but splitting them across more source files at least makes build times somewhat more manageable.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

ptrendx added 2 commits May 12, 2026 15:55
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx force-pushed the pr_split_compilation branch from 44a7d09 to 30880ef Compare May 12, 2026 22:55
@ptrendx ptrendx marked this pull request as ready for review May 13, 2026 04:33
@ptrendx ptrendx requested a review from Oleg-Goncharov as a code owner May 13, 2026 04:33

ptrendx commented May 13, 2026

/te-ci


greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR is a pure build-system refactoring that splits large CUDA compilation units into smaller, independently compilable files to enable faster parallel compilation on multicore machines. No runtime logic changes are made — every function body is moved verbatim to its new file.

  • Activation files (gelu.cu, relu.cu, swiglu.cu): grouped-quantize and dbias variants extracted into *_grouped.cu, *_dbias.cu, and *_grouped_dbias.cu siblings; CMakeLists.txt updated to register all new files in both the arch-specific sources list and the NVTE_BUILD_ACTIVATION_WITH_FAST_MATH list.
  • Cast files (cast.cu): nvte_group_quantize, nvte_group_dequantize, nvte_group_nvfp4_quantize_with_amax, nvte_quantize_dbias, and nvte_group_quantize_dbias are moved to three new files with the appropriate includes for each.
  • cast/core/common.cuh: Gains an explicit #include "../../util/ptx.cuh", which turns a previously transitive dependency into a direct one.

Confidence Score: 5/5

Safe to merge — purely a build restructuring with no logic changes; all split functions are fully accounted for and the CMakeLists is correctly updated.

Every moved function was verified against the original file: gelu/relu/swiglu/cast variants are all present in their new homes with identical bodies. The CMakeLists registers each new file in both the required source lists. The only non-trivial change (explicit ptx.cuh include in common.cuh) makes an already-transitive dependency explicit and is safe.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/CMakeLists.txt | Correctly registers all 12 new split .cu files in both the arch-specific sources list and the NVTE_BUILD_ACTIVATION_WITH_FAST_MATH list. |
| transformer_engine/common/cast/core/common.cuh | Adds explicit #include of ptx.cuh; makes an already-transitive dependency explicit, consistent with the PTX intrinsics used throughout this header. |
| transformer_engine/common/cast/cast.cu | Removes grouped/dbias functions moved to new split files; retains nvte_dequantize, which still needs the dequantize.cuh include. |
| transformer_engine/common/cast/cast_grouped.cu | New file containing nvte_group_quantize, nvte_group_dequantize, and nvte_group_nvfp4_quantize_with_amax; correctly includes dequantize.cuh and quantize.cuh with cuda.h/cudaTypedefs.h for TMA. |
| transformer_engine/common/cast/cast_dbias.cu | New file containing nvte_quantize_dbias; correct minimal include set. |
| transformer_engine/common/cast/cast_grouped_dbias.cu | New file containing nvte_group_quantize_dbias; correct grouped bwd helper call. |
| transformer_engine/common/activation/gelu.cu | Grouped and dbias variants removed and relocated; remaining non-grouped functions unchanged. |
| transformer_engine/common/activation/gelu_grouped.cu | New file hosting nvte_group_gelu/dgelu and nvte_group_qgelu/dqgelu; logic faithfully moved from gelu.cu. |
| transformer_engine/common/activation/relu.cu | Grouped and dbias variants removed and relocated; remaining non-grouped relu/srelu functions unchanged. |
| transformer_engine/common/activation/swiglu.cu | Grouped and dbias variants removed; remaining silu/dsilu/swiglu/dswiglu/clamped variants unchanged. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Before["Before"]
        A["gelu.cu (plain+grouped+dbias)"]
        B["relu.cu (plain+grouped+dbias)"]
        C["swiglu.cu (plain+grouped+dbias)"]
        D["cast.cu (plain+grouped+dbias+deq)"]
    end
    subgraph After["After (parallel compilation)"]
        A1["gelu.cu"] 
        A2["gelu_grouped.cu"]
        A3["gelu_dbias.cu"]
        A4["gelu_grouped_dbias.cu"]
        B1["relu.cu"]
        B2["relu_grouped.cu"]
        B3["relu_dbias.cu"]
        B4["relu_grouped_dbias.cu"]
        C1["swiglu.cu"]
        C2["swiglu_grouped.cu"]
        C3["swiglu_dbias.cu"]
        C4["swiglu_grouped_dbias.cu"]
        D1["cast.cu"]
        D2["cast_grouped.cu"]
        D3["cast_dbias.cu"]
        D4["cast_grouped_dbias.cu"]
    end
    A -->|split| A1 & A2 & A3 & A4
    B -->|split| B1 & B2 & B3 & B4
    C -->|split| C1 & C2 & C3 & C4
    D -->|split| D1 & D2 & D3 & D4

Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."


@timmoon10 timmoon10 left a comment


LGTM

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx merged commit a014300 into NVIDIA:main May 21, 2026
12 of 13 checks passed