Split grouped quantize/activations and dbias for faster compilation on multicore machines #2983
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
/te-ci
Greptile Summary
This PR is a pure build-system refactoring that splits large CUDA compilation units into smaller, independently compilable files to enable faster parallel compilation on multicore machines. No runtime logic changes are made: every function body is moved verbatim to its new file.
Confidence Score: 5/5
Safe to merge: purely a build restructuring with no logic changes; all split functions are fully accounted for and the CMakeLists is correctly updated.
Every moved function was verified against the original file: gelu/relu/swiglu/cast variants are all present in their new homes with identical bodies. The CMakeLists registers each new file in both the required source lists. The only non-trivial change (explicit ptx.cuh include in common.cuh) makes an already-transitive dependency explicit and is safe.
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph Before["Before"]
A["gelu.cu (plain+grouped+dbias)"]
B["relu.cu (plain+grouped+dbias)"]
C["swiglu.cu (plain+grouped+dbias)"]
D["cast.cu (plain+grouped+dbias+deq)"]
end
subgraph After["After (parallel compilation)"]
A1["gelu.cu"]
A2["gelu_grouped.cu"]
A3["gelu_dbias.cu"]
A4["gelu_grouped_dbias.cu"]
B1["relu.cu"]
B2["relu_grouped.cu"]
B3["relu_dbias.cu"]
B4["relu_grouped_dbias.cu"]
C1["swiglu.cu"]
C2["swiglu_grouped.cu"]
C3["swiglu_dbias.cu"]
C4["swiglu_grouped_dbias.cu"]
D1["cast.cu"]
D2["cast_grouped.cu"]
D3["cast_dbias.cu"]
D4["cast_grouped_dbias.cu"]
end
A -->|split| A1 & A2 & A3 & A4
B -->|split| B1 & B2 & B3 & B4
C -->|split| C1 & C2 & C3 & C4
D -->|split| D1 & D2 & D3 & D4
Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..."
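To make the parallel-compilation argument concrete, here is a rough sketch of how the before/after layout could affect wall-clock build time. The compile times are hypothetical (the PR reports no measurements), the `makespan` helper is illustrative, and a real build system may schedule jobs differently from this longest-first greedy model.

```python
import heapq


def makespan(job_seconds, workers):
    """Wall-clock time when each job, longest first, goes to the least-loaded worker."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for seconds in sorted(job_seconds, reverse=True):
        least_loaded = heapq.heappop(loads)
        heapq.heappush(loads, least_loaded + seconds)
    return max(loads)


# Hypothetical numbers, not measured: assume each original file takes 400 s
# to compile and splits evenly into four files of 100 s each.
before = [400.0] * 4   # gelu.cu, relu.cu, swiglu.cu, cast.cu
after = [100.0] * 16   # the 16 files in the "After" subgraph

for workers in (4, 8, 16):
    print(workers, makespan(before, workers), makespan(after, workers))
```

Under these assumed numbers the split changes nothing with 4 parallel jobs, and helps only once more workers are available than there were original files. In practice the pieces are unlikely to divide this evenly, since every new translation unit re-parses the shared headers.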
Description
The proper fix would be to compile these kernels at runtime with NVRTC, but this split at least makes compilation slightly more manageable.
Type of change