[None][feat] Update TRTLLM MoE cubins #12440
Conversation
Force-pushed from bbe09ba to 7e8370b (Compare)
Force-pushed from 7e8370b to 58e0fd8 (Compare)
📝 Walkthrough
Refactors kernel selection logic in the batched GEMM runner, introduces new bias/scheduling parameters and a hybrid slice-K optimization (Blackwell-only), adds bias row mapping support with CGA-granularity routing, and reorganizes configuration templates with new variants. Multiple cuBin artifacts updated via Git LFS.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 20
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h (1)
1-15: ⚠️ Potential issue | 🟡 Minor: Update the NVIDIA copyright year in this modified header.
This file changed in 2026, but the header still ends at 2025.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h (1)
691-708: ⚠️ Potential issue | 🟠 Major: Finish the mMmaTileK migration for SF TMA descriptors.
The non-routed SF branches now use mMmaTileK, but the routed-SF TMA branches still size their per-tile K extent from mTileK / numEltsPerSf. When customized MMA tile-K is enabled (which requires mTileK = 256 and mMmaTileK = 192), the same SF buffer gets different descriptor shapes: 256/numEltsPerSf for routed TMA vs. 192 for non-routed. This inconsistency should be resolved by migrating the routed TMA branches (around lines 560-582 for matrix B and lines 660-681 for matrix A) to use mMmaTileK as well.
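A minimal sketch of the suggested direction, assuming the field names quoted above; the helper name and its signature are invented for illustration and are not the actual descriptor-setup code:

```cpp
#include <cstdint>

// Invented helper: key the routed-SF TMA descriptor's K extent off mMmaTileK
// (as the non-routed branches already do) rather than mTileK, so a config with
// mTileK = 256 and mMmaTileK = 192 produces one consistent shape for the
// shared scale-factor buffer.
int32_t routedSfDescKExtent(int32_t mmaTileK, int32_t numEltsPerSf)
{
    return mmaTileK / numEltsPerSf;
}
```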
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json (1)
233-955: ⚠️ Potential issue | 🟡 Minor: Remove trailing commas from grouped-key field names in two configs.
The validation script found dangling delimiters in the field names of configs "FP4_FC2_HighThroughput" (config 13) and "FP4_FC2_HighThroughput_NonSwapAb" (config 15):
- "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB," should be "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB"
- "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," should be "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA"
All other required fields and arity checks pass validation.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json (1)
544-546: Remove the trailing delimiter in the grouped key at Line 544.
The key string ends with an extra comma (...,sfLayoutA,), which is parser-fragile for tuple/key arity mapping.
Proposed cleanup:
- "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA,": [
+ "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA": [
      [128, 128, 128, 256, 6, "128x4"]
  ]
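To illustrate why the trailing comma is more than cosmetic, here is a minimal C++ sketch of the arity check a validator like this presumably performs; splitKey is an invented helper, not the repo's code. A key ending in ',' yields an empty trailing field name, so six-element value tuples can no longer match the parsed names.

```cpp
#include <string>
#include <vector>

// Split a grouped key on ',' while preserving empty fields, the behavior
// that exposes the dangling delimiter.
std::vector<std::string> splitKey(std::string const& key)
{
    std::vector<std::string> names;
    size_t start = 0;
    while (true)
    {
        size_t pos = key.find(',', start);
        if (pos == std::string::npos)
        {
            // A trailing comma leaves an empty final name here, e.g.
            // "mmaN,...,sfLayoutA," parses into 7 names instead of 6.
            names.push_back(key.substr(start));
            break;
        }
        names.push_back(key.substr(start, pos - start));
        start = pos + 1;
    }
    return names;
}
```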
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp`:
- Around line 367-380: The new ptrPermutedIdxToBiasRowIdx pointer for
BiasType::Mn is not being forwarded into gemmData.mInputBuffers in
TrtllmGenBatchedGemmRunner::run(...), so the bias row remap never gets used;
update TrtllmGenBatchedGemmRunner::run to accept or obtain the
ptrPermutedIdxToBiasRowIdx value and assign it into gemmData.mInputBuffers
(i.e., set gemmData.mInputBuffers.mPtrPermutedIdxToBiasRowIdx =
ptrPermutedIdxToBiasRowIdx) before calling
KernelParamsSetup::setKernelParams(...), and ensure the same pointer is passed
through any intermediate calls so BatchedGemmInterface::InputBuffers receives
it.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: precommit-check.yml's checkout step uses actions/checkout@v6
without LFS enabled, which is inconsistent with blossom-ci.yml and could break
future checks for large files like the listed cubin artifacts; update the
checkout step in precommit-check.yml (the actions/checkout@v6 step) to include
lfs: 'true' so git-lfs files are fetched during the precommit workflow, matching
blossom-ci.yml and ensuring excluded large files remain resolvable if exclusion
rules change.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The translation unit
Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
currently contains only a Git LFS pointer stub which breaks parsing and builds;
replace the stub with the actual LFS-tracked binary cubin content (pull the LFS
object into the repo) or modify the build/static-analysis rules to skip/ignore
LFS pointer files for C++ translation units, and ensure the final .cpp file
contains the required NVIDIA copyright header at the top per project guidelines.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The file begins with Git LFS pointer metadata (line starting
with "version https://git-lfs.github.com/spec/v1") but there is no
.gitattributes rule to mark this binary as LFS-managed, and the file is missing
the required NVIDIA copyright header; add an entry to .gitattributes to track
the cubins (e.g., a pattern matching these generated .cpp/cubin files) so the
pointer metadata is not treated as source, and prepend the canonical NVIDIA
copyright header comment to the top of this file with the correct latest
modification year.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512u2_s4_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The build is pulling Git LFS pointer files named like
Bmm_..._cubin.cpp into compilation via the file(GLOB_RECURSE SRC_CPP *.cpp) in
the subdirectory CMakeLists, which causes CI to try compiling LFS metadata and
also prevents adding required NVIDIA headers; fix by excluding those cubin.cpp
files from the source glob or moving them off the .cpp extension and adding
lightweight wrapper .cpp stubs. Concretely, update each affected CMakeLists (the
ones that define SRC_CPP via file(GLOB_RECURSE SRC_CPP *.cpp) in
trtllmGenKernels subdirs batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe)
to filter out files matching "*_cubin.cpp" or "cubin.cpp" from SRC_CPP, or
rename/store binary cubin blobs with a non-.cpp extension and add minimal
wrapper translation units (e.g., loader functions) that include proper NVIDIA
copyright headers so real .cpp files compiled by CMake are valid source.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x64x256_s4_et128x64_m128x64x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: The build is picking up pointer-only cubin files tracked by
Git LFS (pattern *cubin.cpp in .gitattributes) because CMake uses
file(GLOB_RECURSE SRC_CPP *.cpp); either ensure CI materializes LFS blobs before
C++ parsing by adding a checkout step that runs git lfs pull --include="*.cpp"
--exclude="" or change the CMake glob to exclude cubin sources (e.g., stop using
*.cpp-only glob or add an exclusion for *cubin.cpp when populating SRC_CPP) so
the pointer-only cubin files are not passed to the compiler/static analyzers.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x256_s5_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm103a_cubin.cpp`:
- Around line 1-3: CMake is globbing LFS pointer ".cpp" files (via
file(GLOB_RECURSE SRC_CPP *.cpp)) which pulls in LFS placeholders under
batchedGemm/cubins/* and causes compiler errors; change the glob/exclusion so
SRC_CPP does not include cubin LFS files (e.g., restrict to src/**/*.cpp or add
an explicit exclude for batchedGemm/cubins/** and other cubin/* directories) or
rename those LFS artifacts to a non-.cpp extension and update .gitattributes
accordingly, and add CI steps to run git lfs fetch/pull before the build so real
cubin binaries are available (refer to CMakeLists.txt, the SRC_CPP glob,
batchedGemm/cubins directory and .gitattributes).
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: precommit-check.yml's checkout step currently omits LFS and
causes Git LFS pointers (e.g., the cubin C++ LFS pointer shown) to be checked
out as text; update the checkout action in precommit-check.yml (the step that
uses actions/checkout) to include lfs: 'true' like in blossom-ci.yml so LFS
files are materialized before running pre-commit hooks such as clang-format.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: CI is failing static analysis because tracked cubin binary
artifacts (files matching cubins/**/*.cpp) are stored as Git LFS pointers and
static analyzers choke on the pointer metadata; fix by ensuring the pipeline
either runs git lfs pull (or equivalent fetch of LFS objects) prior to any
compilation/lint steps or by excluding these generated cubin files from C++
linting (update CI job steps to call git lfs pull or update analyzer config to
skip pattern cubins/**/*.cpp and rely on .gitattributes entry for *cubin.cpp).
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp
contains an unresolved Git LFS pointer instead of real C++ and is breaking
static analysis; fix by adding a pre-build step that runs git lfs pull (or
equivalent) to materialize LFS objects before compilation/analysis and/or update
CI/build scripts to fetch LFS, and update repository rules (.gitattributes/CI)
to treat *_cubin.cpp as binary or exclude them from C++ tooling so tools like
clang/cppcheck skip that filename pattern; ensure the build scripts reference
this filename pattern when adding the fetch or exclusion.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm103a_cubin.cpp`:
- Around line 1-3: The CI is compiling a Git LFS pointer file (the cubin.cpp
pointer at top of the referenced file and any files matching *cubin.cpp in
.gitattributes) because workflows do not fetch LFS objects; update the CI/build
workflows that run static analysis/compilation (the job that invokes
clang/cppcheck/your build script) to run git lfs fetch --all (or git lfs pull)
after checkout but before any compile/analysis steps, and ensure any
materialized cubin.cpp binary source includes the required NVIDIA copyright
header with the latest modification year so the real file (not the pointer) is
present for compilation and policy checks.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x128x128_s5_et64x128_m64x128x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The file is a Git LFS pointer misnamed with a .cpp extension,
which can confuse C++ tooling; rename all files matching the _cubin.cpp pattern
(e.g., Bmm_*_cubin.cpp) to a non-C++ artifact extension such as .cubin.bin or
.bin, or alternatively add explicit exclusion rules (e.g., in build configs and
.gitattributes) to ensure this directory and pattern are never treated as C++
sources; if any of these files are intended to be committed as source artifacts,
add the required NVIDIA copyright header to the file content and update
repository docs to note the exclusion.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x16x128u2_s6_et64x16_m64x16x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: Multiple kernel cubin files contain unresolved Git LFS
pointer text (e.g., starting with "version https://git-lfs.github.com/spec/v1")
which breaks static analysis and compilation; update CI and build to: run git
lfs pull before any lint/compile stage (modify CI scripts and any prebuild
hooks), add a pre-compile check in the build scripts to scan .cubin.cpp files
for the LFS pointer header and either auto-run git lfs pull or fail fast with a
clear error telling contributors to run git lfs pull, and add a guard (or
exclusion) so unresolved LFS pointer files are skipped by static analyzers (or
listed in .gitignore/build exclusion) plus document the required Git LFS setup
in the repo README/CONTRIBUTING so contributors and CI get the real binary cubin
contents.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x32x128u2_s6_et64x32_m64x32x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schPd4x2x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The new LFS-tracked cubin artifact
Bmm_Bfloat16_E4m3E4m3_Fp32_t128x32x128u2_s6_et64x32_m64x32x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schPd4x2x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
lacks the required NVIDIA copyright header; either add the standard NVIDIA
header (with the year of latest meaningful modification) into this generated
.cpp or formally exempt generated/LFS-tracked cubin artifacts by updating the
repo linting and docs: modify the header-check configuration to exclude the
cubin/generated pattern and add a docs note describing the exemption and
justification, then include the chosen remediation (header addition or
exemption) in the PR description for audit.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x128_s4_et64x8_m64x8x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The build includes cubin pointer .cpp files because the
CMakeLists uses file(GLOB_RECURSE SRC_CPP *.cpp); update the build/CI to ensure
Git LFS artifacts are present before compilation by adding an explicit git lfs
fetch/pull step prior to invoking CMake (or change the CMake rule to exclude
patterns), and/or change the CMakeLists usage of file(GLOB_RECURSE SRC_CPP
*.cpp) to filter out "*cubin.cpp" so the compiler never sees LFS pointer files;
target the CMakeLists.txt entry with file(GLOB_RECURSE SRC_CPP *.cpp) and the CI
build steps that run CMake to implement the fix.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: Update CODING_GUIDELINES.md to explicitly exempt Git LFS
pointer files and pre-compiled binary artifacts in the cubins directory (e.g.,
the LFS pointer stub found in
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp)
from the "NVIDIA copyright headers" requirement, or alternatively add a clear
rule that the header mandate applies only to human-authored source files and not
generated/compiled binaries or their LFS references; make the language
unambiguous and add a short example sentence and a note in the file explaining
how LFS pointer stubs are treated.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp
is a Git LFS pointer, not real C++ source, and the cubins/ folder is implicitly
excluded from the build; make the exclusion explicit by either adding an exclude
rule for the cubins/ directory in your top-level CMakeLists (or the relevant
cmake module that collects sources) so files under cubins/ are ignored, and/or
renaming these artifacts from *.cpp to a non-source extension (e.g., .cubin or
.o) so build system and static analysis won’t treat them as C++ sources; update
any source-gathering patterns (e.g., add_glob or file(GLOB_RECURSE) filters)
that might pick up *_cubin.cpp and document the change in the build module that
references the generated kernels.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmOptions.h`:
- Around line 1281-1288: The validator checkAndUpdateGemmOptions currently
mutates options (setting options.mMmaTileK = options.mTileK and
options.mPerTokenSfDtype defaults) even when called with updateOptions == false;
change the logic so these assignments only occur when updateOptions is true, and
when updateOptions is false either return an error/false if those fields are
unset or explicitly reject the config; apply the same guard to the other
identical block later in the function (the second occurrence that compares
mMmaTileK vs mTileK and sets mPerTokenSfDtype) so validation remains side-effect
free.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h`:
- Around line 357-360: The parameter name maxNumCtas in the KernelParams.h
function signature is misleading because the implementation (see
paddedTokensInBatchDim calculation using cgaOffset and clusterDimInBatchDim)
expects a CGA count; change the API to make this explicit by renaming the
parameter to maxNumCgas (and update all call sites) or add an inline comment in
the signature like /* maxNumCgas = CGA-granularity count */ and/or update
KernelParams::MaxNumCtas to reflect CGA semantics so callers pass CGA counts;
ensure references to maxNumCtas and KernelParams::MaxNumCtas and the
paddedTokensInBatchDim / cgaOffset usage are updated consistently.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelTraits.h`:
- Around line 524-526: KernelTraits currently ignores the separate TMEM-A stage
count; thread the mNumStagesTmemA member through KernelTraits (add the member,
ensure constructors/builders/factory code initialize it from the configuration)
and use mNumStagesTmemA instead of numStagesA when computing numTmemColsA (the
expression that uses numStagesA * tileK / (numSlicesForSliceK * ... /
tg::dtypeGetNumBits(dtypeMmaA))). Also ensure useTmemA logic and any getters
that expose stage counts reflect mNumStagesTmemA so configs with mNumStagesTmemA
!= mNumStagesA produce the correct TMEM-A footprint.
---
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h`:
- Around line 1-15: The file header in KernelRunner.h still lists the NVIDIA
copyright range as "2020-2025" but the file was modified in 2026; update the
copyright year range in the top comment block (the header comment present at the
top of KernelRunner.h) to include 2026 (e.g., "2020-2026") so the file header
reflects the modification year.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json`:
- Around line 233-955: In the two configs for "FP4_FC2_HighThroughput" (template
BatchedGemmFp4HighThroughput) and "FP4_FC2_HighThroughput_NonSwapAb" (template
BatchedGemmFp4HighThroughputNonSwapAb) remove the stray trailing commas from the
grouped-key names: change "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB,"
to "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB" and change
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," to
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA" so the field names are
valid JSON keys for validation.
---
Nitpick comments:
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json`:
- Around line 544-546: The grouped key string
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," contains an extra trailing
comma which breaks tuple/key arity mapping; edit the config.json entry so the
key is "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA" (remove the trailing
comma) and verify the associated value array stays unchanged, and scan other
grouped keys for similar trailing delimiters to remove if present.
  setProblemDimensions(gemmData, transposeMmaOutput, m, n, k, batchedTokens, numTokens, numBatches,
      maxNumCtasInBatchDim, validM, validN, validK);

  // Inputs
- gemmData.mInputBuffers.mPtrA = mOptions.transposeMmaOutput ? b : a;
- gemmData.mInputBuffers.mPtrSfA = mOptions.transposeMmaOutput ? sfB : sfA;
- gemmData.mInputBuffers.mPtrB = mOptions.transposeMmaOutput ? a : b;
- gemmData.mInputBuffers.mPtrSfB = mOptions.transposeMmaOutput ? sfA : sfB;
+ gemmData.mInputBuffers.mPtrA = transposeMmaOutput ? b : a;
+ gemmData.mInputBuffers.mPtrSfA = transposeMmaOutput ? sfB : sfA;
+ gemmData.mInputBuffers.mPtrB = transposeMmaOutput ? a : b;
+ gemmData.mInputBuffers.mPtrSfB = transposeMmaOutput ? sfA : sfB;
  gemmData.mInputBuffers.mPtrScaleC = scaleC;
  gemmData.mInputBuffers.mPtrScaleGate = scaleGateC;
  gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
- gemmData.mInputBuffers.mPtrPerTokenSfA = mOptions.transposeMmaOutput ? perTokensSfB : perTokensSfA;
- gemmData.mInputBuffers.mPtrPerTokenSfB = mOptions.transposeMmaOutput ? perTokensSfA : perTokensSfB;
+ gemmData.mInputBuffers.mPtrPerTokenSfA = transposeMmaOutput ? perTokensSfB : perTokensSfA;
+ gemmData.mInputBuffers.mPtrPerTokenSfB = transposeMmaOutput ? perTokensSfA : perTokensSfB;
  gemmData.mInputBuffers.mPtrBias = ptrBias;
Plumb ptrPermutedIdxToBiasRowIdx through the runner path.
The new BiasType::Mn row remap never reaches gemmData.mInputBuffers, so this path always falls back to biasRow = permutedRow. BatchedGemmInterface::InputBuffers and KernelParamsSetup::setKernelParams(...) now support the pointer, but TrtllmGenBatchedGemmRunner::run(...) still has no way to populate it.
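A self-contained sketch of the suggested plumbing; the member names (mInputBuffers, mPtrBias, mPtrPermutedIdxToBiasRowIdx) follow the comment above, while the reduced structs and the helper name are invented scaffolding rather than the actual runner code.

```cpp
#include <cstdint>

// Stand-in for BatchedGemmInterface::InputBuffers; only the two members
// named in the comment are kept.
struct InputBuffersSketch
{
    void const* mPtrBias = nullptr;
    int32_t const* mPtrPermutedIdxToBiasRowIdx = nullptr;
};

struct GemmDataSketch
{
    InputBuffersSketch mInputBuffers;
};

// Illustrates the assignment run(...) would need to make before
// KernelParamsSetup::setKernelParams(...) is called.
void setBiasInputs(
    GemmDataSketch& gemmData, void const* ptrBias, int32_t const* ptrPermutedIdxToBiasRowIdx)
{
    gemmData.mInputBuffers.mPtrBias = ptrBias;
    // Without this line the kernel silently falls back to biasRow = permutedRow.
    gemmData.mInputBuffers.mPtrPermutedIdxToBiasRowIdx = ptrPermutedIdxToBiasRowIdx;
}
```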
version https://git-lfs.github.com/spec/v1
oid sha256:1a66a28af82abf6caa7e88701b3eeedceaabe952043e6a75ef1ca842cdd49e75
size 662339
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
TARGET='cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp'
echo "== Pointer header check =="
sed -n '1,5p' "$TARGET"
echo
echo "== CI/workflow LFS hydration checks =="
rg -n -C2 --iglob '*.yml' --iglob '*.yaml' 'git lfs (install|fetch|pull|checkout)'
echo
echo "== Build/static-analysis scripts mentioning cubins or this file type =="
rg -n -C2 '(_cubin\.cpp|trtllmGen_bmm_export/cubins|git lfs)'
🏁 Script executed:
# Check if .gitattributes exists and configures LFS
if [ -f .gitattributes ]; then
echo "=== .gitattributes LFS configuration ==="
cat .gitattributes | grep -E '(cubin|\.cpp|lfs)' | head -20
else
echo ".gitattributes not found"
fi
echo
echo "=== Workflow files in repository ==="
find . -type f \( -name '*.yml' -o -name '*.yaml' \) -path '*/.github/workflows/*' 2>/dev/null | head -20
echo
echo "=== LFS configuration in any workflow ==="
find . -type f \( -name '*.yml' -o -name '*.yaml' \) 2>/dev/null -exec grep -l 'lfs' {} \; 2>/dev/null | head -10
echo
echo "=== Build scripts or CMakeLists mentioning cubin or LFS ==="
rg -l '(cubin|lfs|git.*checkout)' --type cmake --type sh 2>/dev/null | head -20
🏁 Script executed:
echo "=== blossom-ci.yml (LFS workflow) ==="
head -100 ./.github/workflows/blossom-ci.yml
echo
echo "=== pr-check.yml (likely runs C++ checks) ==="
head -150 ./.github/workflows/pr-check.yml
echo
echo "=== precommit-check.yml (likely runs linting) ==="
head -150 ./.github/workflows/precommit-check.yml
🏁 Script executed:
echo "=== scripts/release_check.py (what does it check?) ==="
head -200 scripts/release_check.py
echo
echo "=== Full blossom-ci.yml (look for git lfs or checkout) ==="
grep -n -A5 -B5 'git lfs\|checkout\|cubin' ./.github/workflows/blossom-ci.yml | head -100
🏁 Script executed:
echo "=== .pre-commit-config.yaml (which hooks run on .cpp?) ==="
cat .pre-commit-config.yaml | head -200
🏁 Script executed:
echo "=== Continuing .pre-commit-config.yaml (actual hooks) ==="
tail -300 .pre-commit-config.yaml
Add lfs: 'true' to the checkout action in precommit-check.yml for consistency and robustness.
This file (*cubin.cpp) is already excluded from pre-commit hooks in .pre-commit-config.yaml, so it won't be processed by clang-format or other analysis tools. However, precommit-check.yml uses actions/checkout@v6 without the lfs: true parameter, while blossom-ci.yml correctly includes it. For consistency and to safeguard against future changes to the exclusion rules, add lfs: 'true' to the checkout step in precommit-check.yml.
🧰 Tools
🪛 Clang (14.0.6)
[error] 1-1: unknown type name 'version'
(clang-diagnostic-error)
[error] 1-1: expected ';' after top level declarator
(clang-diagnostic-error)
[error] 2-2: unknown type name 'oid'; did you mean 'void'?
(clang-diagnostic-error)
[error] 2-2: variable has incomplete type 'void'
(clang-diagnostic-error)
[error] 2-2: expected ';' after top level declarator
(clang-diagnostic-error)
[error] 2-2: expected unqualified-id
(clang-diagnostic-error)
🪛 Cppcheck (2.20.0)
[error] 3-3: syntax error
(syntaxError)
version https://git-lfs.github.com/spec/v1
oid sha256:97caf696273574e75d797260df22a9ce4ec3f837e1e48b90040ffa499d4753db
size 453560
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Find .cpp files that are actually Git LFS pointers.
echo "== LFS pointer files with .cpp extension =="
fd -e cpp | while read -r f; do
first_line="$(sed -n '1p' "$f" || true)"
if [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]; then
echo "$f"
fi
done
# 2) Check whether this specific file is referenced by build/config scripts.
echo
echo "== References to this cubin path in build/config files =="
rg -n "Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin\\.cpp|trtllmGen_bmm_export/cubins"
# 3) Show header lines for this file to confirm it's an LFS pointer (not a C++ source body/header banner).
echo
echo "== First 5 lines of the file under review =="
sed -n '1,5p' "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp"
Exclude Git LFS pointer stubs from C++ translation units
- cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp contains only Git LFS metadata (version/oid/size), causing parser and build failures; fetch LFS objects before compilation or adjust build/static-analysis rules to skip these stubs.
- Add the required NVIDIA copyright header to this .cpp file per coding guidelines.
version https://git-lfs.github.com/spec/v1
oid sha256:28198e842017dfd64847faa770eefca936de60a8c09e7ae70e00fe313d3b005b
size 423264
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
TARGET='cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp'
echo "== Pointer file header =="
sed -n '1,3p' "$TARGET" || true
echo
echo "== .gitattributes entries tracking this path (expect LFS rule) =="
attrs=$(fd -HI '^\.gitattributes$' || true)
if [ -n "${attrs:-}" ]; then
rg -n --fixed-strings 'trtllmGen_bmm_export/cubins/' $attrs || true
rg -n --fixed-strings 'Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp' $attrs || true
else
echo "No .gitattributes found"
fi
echo
echo "== Build/lint references to this file (expect not compiled as raw pointer text) =="
rg -n -C2 --fixed-strings 'Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp' || true
echo
echo "== CI/workflow LFS setup (expect explicit LFS fetch/pull before compile/lint) =="
rg -n -C2 -P 'git\s+lfs\s+(pull|fetch|checkout)|lfs:\s*true|actions/checkout@' || true
Add Git LFS tracking rule to .gitattributes and include NVIDIA copyright header
This .cpp file contains Git LFS pointer metadata but lacks a corresponding tracking rule in .gitattributes. Additionally, the file is missing the required NVIDIA copyright header.
To fix:
- Add a rule in .gitattributes to explicitly track this file (or the trtllmGen_bmm_export/cubins/ directory) as LFS-managed binary
- Include the NVIDIA copyright header comment at the top of the file with the year of latest modification
Without the .gitattributes rule, build/lint tools may attempt to process the pointer metadata as source code, causing failures.
version https://git-lfs.github.com/spec/v1
oid sha256:128c3261b254cdb66d2b786263044ac98147683feb60e526e0e58c5c863b0e38
size 436478
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Detect .cpp cubin files that are Git LFS pointers =="
fd --full-path '_cubin\.cpp$' | while read -r f; do
if sed -n '1p' "$f" | grep -qx 'version https://git-lfs.github.com/spec/v1'; then
echo "POINTER_CPP: $f"
sed -n '1,3p' "$f"
fi
done
echo
echo "== Check whether repo tooling/build scripts broadly pick up .cpp files =="
rg -n --hidden --glob '!**/.git/**' '(_cubin\.cpp|\*\.cpp|clang-tidy|cppcheck|target_sources|add_library|add_executable)'
🏁 Script executed:
# Check if these cubin files are referenced in build scripts
rg -l "cubin\.cpp|add_library|target_sources" --type cmake | head -20
# Check .gitattributes for LFS configuration
fd -e gitattributes | xargs cat 2>/dev/null | grep -A2 -B2 "cubin\|\.cpp"
# Check if there's a .clang-tidy or similar configuration that excludes these
fd -e clang-tidy -o -e clang-format | xargs cat 2>/dev/null
🏁 Script executed:
# Check .gitattributes
find . -name ".gitattributes" -type f | xargs cat 2>/dev/null | head -50
# Check if cubin files are referenced in CMakeLists
rg "_cubin\.cpp" cpp/tensorrt_llm/kernels/ | head -20
# Check for any pre-commit or CI configuration
find . -name ".pre-commit-config.yaml" -o -name ".clang-tidy" -o -name ".clang-format" | xargs cat 2>/dev/null | head -100
🏁 Script executed:
# Check how cubin files are referenced in the main kernels CMakeLists.txt
cat cpp/tensorrt_llm/kernels/CMakeLists.txt | grep -A 5 -B 5 "cubin\|*.cpp" | head -100
# Check if there's any exclusion pattern for cubin files in build scripts
rg "cubin|exclude|EXCLUDE" cpp/tensorrt_llm/kernels/CMakeLists.txt -A 3
# Check the trtllmGenKernels CMakeLists specifically
fd -path "*trtllmGenKernels/CMakeLists.txt" -exec cat {} \; | head -150
🏁 Script executed:
# Confirm the trtllmGenKernels exclusion applies to the cubin files
rg "trtllmGenKernels" cpp/tensorrt_llm/kernels/CMakeLists.txt -B 2 -A 2
# Check if there are any subdirectory declarations for trtllmGenKernels
rg "add_subdirectory.*trtllmGenKernels" cpp/tensorrt_llm/kernels/
# List cubin file locations
fd "_cubin\.cpp$" cpp/tensorrt_llm/kernels/trtllmGenKernels/ | head -5
🏁 Script executed:
# Check the trtllmGenKernels subdirectory CMakeLists.txt
find cpp/tensorrt_llm/kernels/trtllmGenKernels -name "CMakeLists.txt" -exec cat {} \;
Do not store Git LFS pointers in .cpp file paths; these will fail to compile when not checked out.
The cubin.cpp files in cpp/tensorrt_llm/kernels/trtllmGenKernels/ subdirectories (batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe) are configured in .gitattributes as Git LFS pointers but are picked up directly by subdirectory CMakeLists files via file(GLOB_RECURSE SRC_CPP *.cpp) and added as compilation sources. When LFS files are not materialized (common in CI without LFS configuration), the build will attempt to compile text containing LFS metadata as C++, causing syntax errors.
Additionally, these files violate the coding guideline requiring NVIDIA copyright headers on all .cpp files, since LFS pointers cannot contain such headers.
Either:
- Exclude cubin.cpp files from the glob in each subdirectory CMakeLists.txt (see the sketch below), or
- Store these binary artifacts with a non-source-code extension and generate/include wrapper .cpp files if compilation is needed
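For the first option, a two-line sketch of the exclusion; it assumes CMake 3.6+ for list(FILTER), reuses the SRC_CPP variable quoted in the comment, and the regex is an assumed match for the artifact naming:

```cmake
# Drop LFS-pointer cubin translation units from the glob result.
file(GLOB_RECURSE SRC_CPP *.cpp)
list(FILTER SRC_CPP EXCLUDE REGEX "_cubin\\.cpp$")
```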
version https://git-lfs.github.com/spec/v1
oid sha256:b2d5b6ff01e5f2d80833302e5c38ed8a663baa695bdae838c4266066229ceaed
size 413176
🧩 Analysis chain
🏁 Script executed:
# Check if file is truly new or just shows LFS pointer
git ls-files "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/"
# Find other .cpp files in the cubins directory to compare
fd -t f "\.cpp$" "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/" | head -10
🏁 Script executed:
# Check .gitattributes to see if cubins are configured for LFS
find . -name ".gitattributes" -type f -exec grep -l "cubins\|\.cubin\|LFS" {} \;
🏁 Script executed:
# Look at CODING_GUIDELINES.md for any LFS-specific guidance
find . -name "CODING_GUIDELINES.md" -type f | head -5
🏁 Script executed:
# Check if there are other similar .cpp files in the same directory
ls -la "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/" | head -20
🏁 Script executed:
cat CODING_GUIDELINES.md | grep -i -A 5 -B 5 "lfs\|binary\|cubin\|generated"
🏁 Script executed:
# Also check for any exemptions or exceptions in the guidelines
cat CODING_GUIDELINES.md | head -100
🏁 Script executed:
# Continue reading CODING_GUIDELINES.md to find any LFS or exemption guidance
cat CODING_GUIDELINES.md | grep -i "lfs\|binary\|generated\|exemption\|exclude\|copyright" | head -30
🏁 Script executed:
# Look for section about file headers and copyright
cat CODING_GUIDELINES.md | grep -i "copyright\|header\|license" -A 3 -B 1 | head -50
LFS pointer files in the cubins directory require explicit exemption documentation in CODING_GUIDELINES.md.
The coding guidelines state that "all source files which are compiled or interpreted" must contain NVIDIA copyright headers. However, the guidelines don't explicitly address Git LFS pointer files (like this 131-byte metadata stub) or pre-compiled binary artifacts stored in the cubins directory. To avoid future ambiguity, either:
- Add explicit exemption language to CODING_GUIDELINES.md for LFS pointers and binary artifacts, or
- Document that the copyright header requirement applies only to human-authored source code, not generated/compiled binaries or their LFS references
version https://git-lfs.github.com/spec/v1
oid sha256:33b5743610881ecfc742746f42648f4e054534a9ca1adb57634e1988c88cb969
size 316110
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify this file is an LFS pointer in the checked-out tree.
sed -n '1,5p' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp
# Check if cubin .cpp artifacts are referenced by CMake build inputs.
rg -n -C2 --iglob 'CMakeLists.txt' --iglob '*.cmake' '_cubin\.cpp|trtllmGen_bmm_export/cubins'
# Check if static-analysis configs explicitly exclude this folder/pattern.
rg -n -C2 --iglob '.clang-tidy' --iglob '.clangd' --iglob '*.yml' --iglob '*.yaml' 'cubins|trtllmGen_bmm_export|_cubin\.cpp'
.cpp file contains Git LFS pointer and is not referenced in build system
Confirmed: Lines 1–3 show Git LFS metadata, not C++ source. Verification found no CMakeLists.txt, .cmake files, or static analysis configs that reference this cubin directory or _cubin.cpp files, indicating the path is implicitly excluded from compilation.
However, this implicit exclusion through directory structure (no build references) is not explicit. To prevent accidental inclusion in future build rules, consider either:
- Adding an explicit exclusion pattern in CMakeLists.txt or build tooling for the cubins/ directory
- Moving cubin artifacts to a non-.cpp extension (e.g., .cubin or .o) to avoid confusion with C++ source files
if (options.mMmaTileK == 0)
{
    options.mMmaTileK = options.mTileK;
}
else
{
    TLLM_CHECK_ERROR(options.mLayoutB != MatrixLayout::BlockMajorK,
        "layoutB == MatrixLayout::BlockMajorK is not supported for now");
    if (options.mMmaTileK != options.mTileK)
    {
Keep updateOptions == false side-effect free.
checkAndUpdateGemmOptions(..., false) is used as a pure validator, but these blocks still fill in mMmaTileK and mPerTokenSfDtype. That makes an underspecified config look valid and hides whether the cubin metadata actually encoded those fields. Please guard these defaults with updateOptions (or reject when they are unset in validation mode).
Also applies to: 2175-2185
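A standalone sketch of the suggested guard, assuming the field names quoted above; the struct is reduced scaffolding and the bool return stands in for the real TLLM_CHECK_ERROR handling.

```cpp
struct GemmOptionsSketch
{
    int mMmaTileK = 0;
    int mTileK = 256;
};

bool checkAndUpdateMmaTileK(GemmOptionsSketch& options, bool updateOptions)
{
    if (options.mMmaTileK == 0)
    {
        // Validation mode must stay side-effect free: reject the
        // underspecified config instead of silently defaulting it.
        if (!updateOptions)
        {
            return false;
        }
        options.mMmaTileK = options.mTileK;
    }
    return true;
}
```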
uint32_t* rowMaxBars, int32_t const* ptrNumNonExitingCtas = nullptr,
int32_t const* ptrTotalNumPaddedTokens = nullptr, int32_t const* ptrCtaIdxXyToBatchIdx = nullptr,
int32_t const* ptrCtaIdxXyToMnLimit = nullptr, int32_t const maxNumCtas = KernelParams::MaxNumCtas,
uint32_t* ptrDynamicTileCounter = nullptr)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect all call sites of KernelParamsSetup::setKernelParams and the maxNumCtas plumbing.
rg -n -C3 '\bsetKernelParams\s*\(' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm
rg -n -C3 '\bmaxNumCtas(InBatchDim)?\b|\bnumCgas\b|\bcgaOffset\b|\bclusterDimInBatchDim\b' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm
🏁 Script executed:
# Trace where maxNumCtasInBatchDim originates and how it's computed
# Look for high-level callers that prepare this value
rg -n 'maxNumCtasInBatchDim\s*=' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm
rg -n -B5 'run\s*\(' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp | grep -A5 'maxNumCtasInBatchDim'
🏁 Script executed:
# Look for where maxNumCtasInBatchDim is actually computed or passed from external code
# Find the public API entry points and trace back
rg -n 'maxNumCtasInBatchDim' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm --type cpp -A2 -B2 | head -100
# Also check if there's any documentation or comments explaining the semantics
rg -n 'CGA|cga|cluster.*group' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h
🏁 Script executed:
# Look for the comment about CTA granularity in KernelParams.h
cat -n cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | sed -n '380,390p'
# Check if there are any comments or documentation about the parameter semantics
rg -n 'maxNumCtas|MaxNumCtas' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | head -20
# Look at the parameter definition and surrounding documentation
cat -n cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | sed -n '340,365p'
Parameter maxNumCtas uses CGA semantics but retains CTA naming, creating a misleading contract.
The code at line 386 explicitly acknowledges this: "The CTA names are kept for compatibility even though they are indexed at CGA granularity." The calculation at line 477 (paddedTokensInBatchDim = cgaOffset * tile * clusterDimInBatchDim) is correct only if callers pass CGA count, not CTA count. However, the parameter name and signature still advertise "CTA" semantics, risking misuse by callers unfamiliar with the compatibility note.
Rename the parameter or add clear inline documentation explaining it expects CGA count (e.g., maxNumCgas or /*maxNumCgas=*/ inline comment), or update KernelParams::MaxNumCtas to reflect CGA semantics.
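A minimal sketch of the rename option; the function is pared down to the padding math named in the comment and is not the real setKernelParams signature.

```cpp
#include <cstdint>

int32_t paddedTokensInBatchDim(int32_t cgaOffset, int32_t tile, int32_t clusterDimInBatchDim,
    int32_t maxNumCgas /* CGA-granularity count; formerly maxNumCtas */)
{
    // The padding math only holds when the caller passes a CGA count,
    // which the parameter name now states explicitly.
    (void) maxNumCgas;
    return cgaOffset * tile * clusterDimInBatchDim;
}
```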
auto const numTmemColsA = useTmemA ? numStagesA * tileK
        / (numSlicesForSliceK * tg::dtypeGetNumBits(tg::Dtype::UInt32) / tg::dtypeGetNumBits(dtypeMmaA))
    : 0;
mNumStagesTmemA is still ignored here.
tmemA sizing is derived from numStagesA, and KernelTraits still has no separate TMEM-A stage parameter. Any config with mNumStagesTmemA != mNumStagesA will be modeled with the wrong TMEM footprint and effectively loses the new knob. Please thread mNumStagesTmemA into KernelTraits and use it for tmemA.
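A standalone sketch of threading the new knob through the traits, assuming the member names from the comment; the arithmetic is simplified and bitsPerCol stands in for the dtype/slice divisor in KernelTraits.h.

```cpp
#include <cstdint>

struct KernelTraitsSketch
{
    int32_t mNumStagesA;
    int32_t mNumStagesTmemA; // new knob, may differ from mNumStagesA

    int32_t numTmemColsA(bool useTmemA, int32_t tileK, int32_t bitsPerCol) const
    {
        // Size TMEM-A from its own stage count so configs with
        // mNumStagesTmemA != mNumStagesA model the right footprint.
        return useTmemA ? mNumStagesTmemA * tileK / bitsPerCol : 0;
    }
};
```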
The PR description covers the changes in TRT-LLM Gen, but many of those features are not added in this MR (Slice-K, MN Bias, etc). Could you fix it please?
Force-pushed from 58e0fd8 to 8266478 (Compare)
/bot run
PR_Github #46202 [ run ] triggered by Bot. Commit:
PR_Github #46202 [ run ] completed with state
/bot run
PR_Github #46252 [ run ] triggered by Bot. Commit:
PR_Github #46252 [ run ] completed with state
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
experimental: TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION=random or balanced to alter the expert distribution during autotuning; affects kernel selection. Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Force-pushed from 8266478 to f345e85 (Compare)
/bot run
PR_Github #46350 [ run ] triggered by Bot. Commit:
PR_Github #46350 [ run ] completed with state
/bot run
PR_Github #46557 [ run ] triggered by Bot. Commit:
PR_Github #46557 [ run ] completed with state
Description
DEP8 DS-R1 FP4 1k/2k on B200
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.