
[None][feat] Update TRTLLM MoE cubins #12440

Open

rosenrodt wants to merge 3 commits into NVIDIA:main from rosenrodt:trtllm-moe-tput-kernels

Conversation

Collaborator

@rosenrodt rosenrodt commented Mar 23, 2026

Description

DEP8 DS-R1 FP4 1k/2k on B200

| bsz  | old    | new    | gain   |
|------|--------|--------|--------|
| 8    | 74.93  | 74.76  | −0.2%  |
| 64   | 504.8  | 506.1  | +0.2%  |
| 128  | 905.3  | 917.9  | +1.4%  |
| 512  | 2700.5 | 2737.8 | +1.4%  |
| 1024 | 3439.7 | 3979.0 | +15.7% |
| 2048 | 5244.4 | 5315.9 | +1.4%  |
| 4096 | 3375.3 | 3381.2 | +0.2%  |
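The gain column is the relative throughput change between the old and new cubins. A minimal sketch reproducing it (the `gainPercent` helper is hypothetical, not part of the PR):

```cpp
#include <cassert>
#include <cmath>

// Relative throughput change in percent; positive means the new cubins are
// faster. Hypothetical helper, used only to reproduce the "gain" column above.
inline double gainPercent(double oldTps, double newTps)
{
    return (newTps - oldTps) / oldTps * 100.0;
}
```

For example, `gainPercent(3439.7, 3979.0)` is about +15.7, matching the bsz=1024 row.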

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@rosenrodt rosenrodt changed the title [none][feat] Update TRTLLM MoE cubins [None][feat] Update TRTLLM MoE cubins Mar 23, 2026
@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch 3 times, most recently from bbe09ba to 7e8370b on April 28, 2026 02:53
@nv-guomingz nv-guomingz self-requested a review April 28, 2026 06:07
@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch from 7e8370b to 58e0fd8 on April 28, 2026 17:00
@rosenrodt rosenrodt marked this pull request as ready for review April 29, 2026 07:04
@rosenrodt rosenrodt requested a review from a team as a code owner April 29, 2026 07:04
@rosenrodt rosenrodt requested a review from liji-nv April 29, 2026 07:04
Contributor

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Refactors kernel selection logic in batched GEMM runner, introduces new bias/scheduling parameters and hybrid slice K optimization (Blackwell-only), adds bias row mapping support with CGA-granularity routing, and reorganizes configuration templates with new variants. Multiple cuBin artifacts updated via Git LFS.

Changes

Cohort / File(s) Summary
Kernel Selection & Parameter Logic
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp
Removes prior predefined-config prioritization, introduces skipQuirks() filtering for known hanging/crashing configs, updates dtype validation to be transpose-aware using canonical operand convention, refactors problem-dimension construction into setProblemDimensions() helper, adjusts pointer routing for per-token SFs, and strengthens epilogue tile validation.
Header Documentation
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h
Adds documentation comments to TrtllmGenBatchedGemmRunnerOptions struct canonically defining roles of dtypeA, dtypeB, dtypeC and clarifying transposeMmaOutput legacy behavior for MoE callers.
Bias & Scheduling Enums
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/Enums.h
Adds FusedBiasShuffleMode and SchedHostTask enum types with inline predicate helpers (usesFusedBiasShuffle, usesFusedBiasReorder, mode-specific checkers).
Batched GEMM Options & Validation
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/BatchedGemmOptions.h
Adds bias-related parameters (biasDtype, fusedBiasShuffleMode), scheduling (fuseLoadSfTask, schedHostTask), MMA customization (mmaTileK, useCustomizedMma3xNvFp4), and mPrefetchB option; updates validation for scheduler+routing incompatibility, fused-bias-reorder constraints, and extended bias type support (Mn).
GEMM Options & Hybrid Slice K
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmOptions.h
Major refactoring: replaces single numStages with per-path counts (numStagesA/B, numStagesSmemSfA/B, numStagesTmemA, numStagesTmemSfA/B); adds mBiasDtype, mFusedBiasShuffleMode, mPerTokenSfDtype, mSchedHostTask, mUseCustomizedMma3xNvFp4; introduces new mHybridSliceK feature (Blackwell-gated) with derived fields (mHybridHiddenDim, mHybridTileHidden, mReRunTimes, etc.) and validation block, includes tile dimension swapping under hybrid mode.
Gated Act Options
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmGatedActOptions.h
Removes unconditional BiasType::Mn rejection when mTransposeMmaOutput is disabled, formats function call on single line.
Kernel Traits & Stage Allocation
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelTraits.h
Updates constructor to accept per-path stage counts and new parameters (mmaTileK, useCustomizedMma3xNvFp4, fusedBiasShuffleMode); adjusts SMEM/TMEM allocation logic to use granular stage controls instead of unified numStages.
Batched GEMM Interface
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/BatchedGemmInterface.h
Extends InputBuffers with optional mPtrPermutedIdxToBiasRowIdx mapping pointer for BiasType::Mn; adds test-only cudaEvent parameter (guarded by TLLM_TEST); adjusts grid-dimension computation via clusterDimInBatchDim for MoE early-exit modes.
Kernel Parameters & CGA Semantics
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParamsDecl.h
Switches routing/padding bookkeeping from CTA to CGA granularity (ctaOffset → cgaOffset, numCtas → numCgas); adds ptrPermutedIdxToBiasRowIdx device pointer; introduces paddedTokensInBatchDim for TMA descriptor sizing; adjusts SF TMA to use mmaTileK instead of mTileK; halves B-side TMA tile dimension for 2CTA MMA.
Configuration Templates
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json
Removes defaults from shared templates, adds explicit batch/transposeMmaOutput/ctaSwizzleType/biasType to FP4/FP8 specialized templates; introduces new "NonSwapAb" templates for FP4, FP8, MXE variants with biasType: "n" and epilogue LDTM overrides; updates E2M1/E4M3 variant templates; changes one fusion config from "swiglu" to "none" for activation.
cuBin Artifacts (LFS Pointers)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/*
Adds ~100 new Git LFS pointer files for Bmm kernel variants (E2m1E2m1, E2m1E4m3, E4m3E4m3, MxE variants); removes ~80 existing LFS pointer files (obsolete 16dp256b variants and ldgsts routing versions).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • [None][feat] Trtllm-gen FMHA JIT support #12612: Refactors TMA-based trtllm-gen kernel parameter layout and CGA-granularity semantics for clustered batched-dimension routing, aligning with the main PR's CTA→CGA transition and parameter plumbing changes.

Suggested labels

VisualGen

Suggested reviewers

  • nv-guomingz
  • niukuo
  • zhenhuaw-me
  • PerkzZheng
  • yuxianq

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 20

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h (1)

1-15: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA copyright year in this modified header.

This file changed in 2026, but the header still ends at 2025.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h` around
lines 1 - 15, The file header in KernelRunner.h still lists the NVIDIA copyright
range as "2020-2025" but the file was modified in 2026; update the copyright
year range in the top comment block (the header comment present at the top of
KernelRunner.h) to include 2026 (e.g., "2020-2026") so the file header reflects
the modification year.
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h (1)

691-708: ⚠️ Potential issue | 🟠 Major

Finish the mMmaTileK migration for SF TMA descriptors.

The non-routed SF branches now use mMmaTileK, but the routed-SF TMA branches still size their per-tile K extent from mTileK / numEltsPerSf. When customized MMA tile-K is enabled (which requires mTileK=256 and mMmaTileK=192), the same SF buffer gets different descriptor shapes: 256/numEltsPerSf for routed TMA vs. 192 for non-routed. This inconsistency should be resolved by migrating the routed TMA branches (around lines 560-582 for matrix B and lines 660-681 for matrix A) to use mMmaTileK as well.

cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json (1)

233-955: ⚠️ Potential issue | 🟡 Minor

Remove trailing commas from grouped-key field names in two configs.

The validation script found dangling delimiters in the field names of configs "FP4_FC2_HighThroughput" (config 13) and "FP4_FC2_HighThroughput_NonSwapAb" (config 15):

  • "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB," should be "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB"
  • "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," should be "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA"

All other required fields and arity checks pass validation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json`
around lines 233 - 955, In the two configs for "FP4_FC2_HighThroughput"
(template BatchedGemmFp4HighThroughput) and "FP4_FC2_HighThroughput_NonSwapAb"
(template BatchedGemmFp4HighThroughputNonSwapAb) remove the stray trailing
commas from the grouped-key names: change
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB," to
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB" and change
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," to
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA" so the field names are
valid JSON keys for validation.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json (1)

544-546: Remove the trailing delimiter in the grouped key at Line 544.

The key string ends with an extra comma (...,sfLayoutA,), which is parser-fragile for tuple/key arity mapping.

Proposed cleanup
-            "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA,": [
+            "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA": [
                 [128, 128, 128, 256, 6, "128x4"]
             ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json`
around lines 544 - 546, The grouped key string
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," contains an extra trailing
comma which breaks tuple/key arity mapping; edit the config.json entry so the
key is "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA" (remove the trailing
comma) and verify the associated value array stays unchanged, and scan other
grouped keys for similar trailing delimiters to remove if present.
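The kind of arity check the validation script performs can be sketched as follows (helper names are hypothetical; the real script is not shown in this PR). A grouped key like `"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA"` must split into exactly as many non-empty field names as each value tuple has entries, so a trailing comma fails:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a grouped key such as "mmaN,tileN,...,sfLayoutA" into field names.
// A trailing comma is surfaced as an empty trailing field so the arity
// check below can reject it.
inline std::vector<std::string> splitFields(std::string const& key)
{
    std::vector<std::string> fields;
    std::stringstream ss(key);
    std::string item;
    while (std::getline(ss, item, ','))
    {
        fields.push_back(item);
    }
    if (!key.empty() && key.back() == ',')
    {
        fields.push_back(""); // dangling delimiter
    }
    return fields;
}

// A grouped key is valid iff every field name is non-empty and the number of
// names matches the arity of each value tuple (6 for the configs above).
inline bool keyMatchesArity(std::string const& key, size_t tupleLen)
{
    for (auto const& field : splitFields(key))
    {
        if (field.empty())
        {
            return false;
        }
    }
    return splitFields(key).size() == tupleLen;
}
```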
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp`:
- Around line 367-380: The new ptrPermutedIdxToBiasRowIdx pointer for
BiasType::Mn is not being forwarded into gemmData.mInputBuffers in
TrtllmGenBatchedGemmRunner::run(...), so the bias row remap never gets used;
update TrtllmGenBatchedGemmRunner::run to accept or obtain the
ptrPermutedIdxToBiasRowIdx value and assign it into gemmData.mInputBuffers
(i.e., set gemmData.mInputBuffers.mPtrPermutedIdxToBiasRowIdx =
ptrPermutedIdxToBiasRowIdx) before calling
KernelParamsSetup::setKernelParams(...), and ensure the same pointer is passed
through any intermediate calls so BatchedGemmInterface::InputBuffers receives
it.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: precommit-check.yml's checkout step uses actions/checkout@v6
without LFS enabled, which is inconsistent with blossom-ci.yml and could break
future checks for large files like the listed cubin artifacts; update the
checkout step in precommit-check.yml (the actions/checkout@v6 step) to include
lfs: 'true' so git-lfs files are fetched during the precommit workflow, matching
blossom-ci.yml and ensuring excluded large files remain resolvable if exclusion
rules change.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The translation unit
Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
currently contains only a Git LFS pointer stub which breaks parsing and builds;
replace the stub with the actual LFS-tracked binary cubin content (pull the LFS
object into the repo) or modify the build/static-analysis rules to skip/ignore
LFS pointer files for C++ translation units, and ensure the final .cpp file
contains the required NVIDIA copyright header at the top per project guidelines.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The file begins with Git LFS pointer metadata (line starting
with "version https://git-lfs.github.com/spec/v1") but there is no
.gitattributes rule to mark this binary as LFS-managed, and the file is missing
the required NVIDIA copyright header; add an entry to .gitattributes to track
the cubins (e.g., a pattern matching these generated .cpp/cubin files) so the
pointer metadata is not treated as source, and prepend the canonical NVIDIA
copyright header comment to the top of this file with the correct latest
modification year.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512u2_s4_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The build is pulling Git LFS pointer files named like
Bmm_..._cubin.cpp into compilation via the file(GLOB_RECURSE SRC_CPP *.cpp) in
the subdirectory CMakeLists, which causes CI to try compiling LFS metadata and
also prevents adding required NVIDIA headers; fix by excluding those cubin.cpp
files from the source glob or moving them off the .cpp extension and adding
lightweight wrapper .cpp stubs. Concretely, update each affected CMakeLists (the
ones that define SRC_CPP via file(GLOB_RECURSE SRC_CPP *.cpp) in
trtllmGenKernels subdirs batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe)
to filter out files matching "*_cubin.cpp" or "cubin.cpp" from SRC_CPP, or
rename/store binary cubin blobs with a non-.cpp extension and add minimal
wrapper translation units (e.g., loader functions) that include proper NVIDIA
copyright headers so real .cpp files compiled by CMake are valid source.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x64x256_s4_et128x64_m128x64x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: The build is picking up pointer-only cubin files tracked by
Git LFS (pattern *cubin.cpp in .gitattributes) because CMake uses
file(GLOB_RECURSE SRC_CPP *.cpp); either ensure CI materializes LFS blobs before
C++ parsing by adding a checkout step that runs git lfs pull --include="*.cpp"
--exclude="" or change the CMake glob to exclude cubin sources (e.g., stop using
*.cpp-only glob or add an exclusion for *cubin.cpp when populating SRC_CPP) so
the pointer-only cubin files are not passed to the compiler/static analyzers.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x256_s5_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm103a_cubin.cpp`:
- Around line 1-3: CMake is globbing LFS pointer ".cpp" files (via
file(GLOB_RECURSE SRC_CPP *.cpp)) which pulls in LFS placeholders under
batchedGemm/cubins/* and causes compiler errors; change the glob/exclusion so
SRC_CPP does not include cubin LFS files (e.g., restrict to src/**/*.cpp or add
an explicit exclude for batchedGemm/cubins/** and other cubin/* directories) or
rename those LFS artifacts to a non-.cpp extension and update .gitattributes
accordingly, and add CI steps to run git lfs fetch/pull before the build so real
cubin binaries are available (refer to CMakeLists.txt, the SRC_CPP glob,
batchedGemm/cubins directory and .gitattributes).

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: precommit-check.yml's checkout step currently omits LFS and
causes Git LFS pointers (e.g., the cubin C++ LFS pointer shown) to be checked
out as text; update the checkout action in precommit-check.yml (the step that
uses actions/checkout) to include lfs: 'true' like in blossom-ci.yml so LFS
files are materialized before running pre-commit hooks such as clang-format.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: CI is failing static analysis because tracked cubin binary
artifacts (files matching cubins/**/*.cpp) are stored as Git LFS pointers and
static analyzers choke on the pointer metadata; fix by ensuring the pipeline
either runs git lfs pull (or equivalent fetch of LFS objects) prior to any
compilation/lint steps or by excluding these generated cubin files from C++
linting (update CI job steps to call git lfs pull or update analyzer config to
skip pattern cubins/**/*.cpp and rely on .gitattributes entry for *cubin.cpp).

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp
contains an unresolved Git LFS pointer instead of real C++ and is breaking
static analysis; fix by adding a pre-build step that runs git lfs pull (or
equivalent) to materialize LFS objects before compilation/analysis and/or update
CI/build scripts to fetch LFS, and update repository rules (.gitattributes/CI)
to treat *_cubin.cpp as binary or exclude them from C++ tooling so tools like
clang/cppcheck skip that filename pattern; ensure the build scripts reference
this filename pattern when adding the fetch or exclusion.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm103a_cubin.cpp`:
- Around line 1-3: The CI is compiling a Git LFS pointer file (the cubin.cpp
pointer at top of the referenced file and any files matching *cubin.cpp in
.gitattributes) because workflows do not fetch LFS objects; update the CI/build
workflows that run static analysis/compilation (the job that invokes
clang/cppcheck/your build script) to run git lfs fetch --all (or git lfs pull)
after checkout but before any compile/analysis steps, and ensure any
materialized cubin.cpp binary source includes the required NVIDIA copyright
header with the latest modification year so the real file (not the pointer) is
present for compilation and policy checks.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x128x128_s5_et64x128_m64x128x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The file is a Git LFS pointer misnamed with a .cpp extension,
which can confuse C++ tooling; rename all files matching the _cubin.cpp pattern
(e.g., Bmm_*_cubin.cpp) to a non-C++ artifact extension such as .cubin.bin or
.bin, or alternatively add explicit exclusion rules (e.g., in build configs and
.gitattributes) to ensure this directory and pattern are never treated as C++
sources; if any of these files are intended to be committed as source artifacts,
add the required NVIDIA copyright header to the file content and update
repository docs to note the exclusion.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x16x128u2_s6_et64x16_m64x16x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: Multiple kernel cubin files contain unresolved Git LFS
pointer text (e.g., starting with "version https://git-lfs.github.com/spec/v1")
which breaks static analysis and compilation; update CI and build to: run git
lfs pull before any lint/compile stage (modify CI scripts and any prebuild
hooks), add a pre-compile check in the build scripts to scan .cubin.cpp files
for the LFS pointer header and either auto-run git lfs pull or fail fast with a
clear error telling contributors to run git lfs pull, and add a guard (or
exclusion) so unresolved LFS pointer files are skipped by static analyzers (or
listed in .gitignore/build exclusion) plus document the required Git LFS setup
in the repo README/CONTRIBUTING so contributors and CI get the real binary cubin
contents.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x32x128u2_s6_et64x32_m64x32x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schPd4x2x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The new LFS-tracked cubin artifact
Bmm_Bfloat16_E4m3E4m3_Fp32_t128x32x128u2_s6_et64x32_m64x32x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schPd4x2x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
lacks the required NVIDIA copyright header; either add the standard NVIDIA
header (with the year of latest meaningful modification) into this generated
.cpp or formally exempt generated/LFS-tracked cubin artifacts by updating the
repo linting and docs: modify the header-check configuration to exclude the
cubin/generated pattern and add a docs note describing the exemption and
justification, then include the chosen remediation (header addition or
exemption) in the PR description for audit.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x128_s4_et64x8_m64x8x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The build includes cubin pointer .cpp files because the
CMakeLists uses file(GLOB_RECURSE SRC_CPP *.cpp); update the build/CI to ensure
Git LFS artifacts are present before compilation by adding an explicit git lfs
fetch/pull step prior to invoking CMake (or change the CMake rule to exclude
patterns), and/or change the CMakeLists usage of file(GLOB_RECURSE SRC_CPP
*.cpp) to filter out "*cubin.cpp" so the compiler never sees LFS pointer files;
target the CMakeLists.txt entry with file(GLOB_RECURSE SRC_CPP *.cpp) and the CI
build steps that run CMake to implement the fix.
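One of the suggested remediations (filtering the glob) could look like this. This is a sketch against the `file(GLOB_RECURSE SRC_CPP *.cpp)` pattern the comment describes, not a verified excerpt of the repository's CMakeLists:

```cmake
# Collect sources as before, then drop the LFS-pointer cubin stubs so the
# compiler and static analyzers never see pointer metadata.
file(GLOB_RECURSE SRC_CPP *.cpp)
list(FILTER SRC_CPP EXCLUDE REGEX "_cubin\\.cpp$")
```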

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: Update CODING_GUIDELINES.md to explicitly exempt Git LFS
pointer files and pre-compiled binary artifacts in the cubins directory (e.g.,
the LFS pointer stub found in
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp)
from the "NVIDIA copyright headers" requirement, or alternatively add a clear
rule that the header mandate applies only to human-authored source files and not
generated/compiled binaries or their LFS references; make the language
unambiguous and add a short example sentence and a note in the file explaining
how LFS pointer stubs are treated.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp
is a Git LFS pointer, not real C++ source, and the cubins/ folder is implicitly
excluded from the build; make the exclusion explicit by either adding an exclude
rule for the cubins/ directory in your top-level CMakeLists (or the relevant
cmake module that collects sources) so files under cubins/ are ignored, and/or
renaming these artifacts from *.cpp to a non-source extension (e.g., .cubin or
.o) so build system and static analysis won’t treat them as C++ sources; update
any source-gathering patterns (e.g., add_glob or file(GLOB_RECURSE) filters)
that might pick up *_cubin.cpp and document the change in the build module that
references the generated kernels.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmOptions.h`:
- Around line 1281-1288: The validator checkAndUpdateGemmOptions currently
mutates options (setting options.mMmaTileK = options.mTileK and
options.mPerTokenSfDtype defaults) even when called with updateOptions == false;
change the logic so these assignments only occur when updateOptions is true, and
when updateOptions is false either return an error/false if those fields are
unset or explicitly reject the config; apply the same guard to the other
identical block later in the function (the second occurrence that compares
mMmaTileK vs mTileK and sets mPerTokenSfDtype) so validation remains side-effect
free.
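The guard the comment asks for can be illustrated with a reduced sketch (the struct and field names below mirror mTileK/mMmaTileK but are illustrative; this is not the real checkAndUpdateGemmOptions):

```cpp
#include <cassert>

// Reduced model of the issue: defaults may be written back only when
// updateOptions is true; with updateOptions == false, validation must be
// side-effect free and reject configs that rely on unset fields.
struct Opts
{
    int mTileK = 256;
    int mMmaTileK = -1; // -1 means "unset"
};

inline bool checkAndUpdate(Opts& o, bool updateOptions)
{
    if (o.mMmaTileK < 0)
    {
        if (!updateOptions)
        {
            return false; // reject instead of mutating
        }
        o.mMmaTileK = o.mTileK; // apply the default only when allowed
    }
    return o.mMmaTileK <= o.mTileK;
}
```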

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h`:
- Around line 357-360: The parameter name maxNumCtas in the KernelParams.h
function signature is misleading because the implementation (see
paddedTokensInBatchDim calculation using cgaOffset and clusterDimInBatchDim)
expects a CGA count; change the API to make this explicit by renaming the
parameter to maxNumCgas (and update all call sites) or add an inline comment in
the signature like /* maxNumCgas = CGA-granularity count */ and/or update
KernelParams::MaxNumCtas to reflect CGA semantics so callers pass CGA counts;
ensure references to maxNumCtas and KernelParams::MaxNumCtas and the
paddedTokensInBatchDim / cgaOffset usage are updated consistently.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelTraits.h`:
- Around line 524-526: KernelTraits currently ignores the separate TMEM-A stage
count; thread the mNumStagesTmemA member through KernelTraits (add the member,
ensure constructors/builders/factory code initialize it from the configuration)
and use mNumStagesTmemA instead of numStagesA when computing numTmemColsA (the
expression that uses numStagesA * tileK / (numSlicesForSliceK * ... /
tg::dtypeGetNumBits(dtypeMmaA))). Also ensure useTmemA logic and any getters
that expose stage counts reflect mNumStagesTmemA so configs with mNumStagesTmemA
!= mNumStagesA produce the correct TMEM-A footprint.


Comment on lines +367 to 380

```diff
 setProblemDimensions(gemmData, transposeMmaOutput, m, n, k, batchedTokens, numTokens, numBatches,
     maxNumCtasInBatchDim, validM, validN, validK);

 // Inputs
-gemmData.mInputBuffers.mPtrA = mOptions.transposeMmaOutput ? b : a;
-gemmData.mInputBuffers.mPtrSfA = mOptions.transposeMmaOutput ? sfB : sfA;
-gemmData.mInputBuffers.mPtrB = mOptions.transposeMmaOutput ? a : b;
-gemmData.mInputBuffers.mPtrSfB = mOptions.transposeMmaOutput ? sfA : sfB;
+gemmData.mInputBuffers.mPtrA = transposeMmaOutput ? b : a;
+gemmData.mInputBuffers.mPtrSfA = transposeMmaOutput ? sfB : sfA;
+gemmData.mInputBuffers.mPtrB = transposeMmaOutput ? a : b;
+gemmData.mInputBuffers.mPtrSfB = transposeMmaOutput ? sfA : sfB;
 gemmData.mInputBuffers.mPtrScaleC = scaleC;
 gemmData.mInputBuffers.mPtrScaleGate = scaleGateC;
 gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
-gemmData.mInputBuffers.mPtrPerTokenSfA = mOptions.transposeMmaOutput ? perTokensSfB : perTokensSfA;
-gemmData.mInputBuffers.mPtrPerTokenSfB = mOptions.transposeMmaOutput ? perTokensSfA : perTokensSfB;
+gemmData.mInputBuffers.mPtrPerTokenSfA = transposeMmaOutput ? perTokensSfB : perTokensSfA;
+gemmData.mInputBuffers.mPtrPerTokenSfB = transposeMmaOutput ? perTokensSfA : perTokensSfB;
 gemmData.mInputBuffers.mPtrBias = ptrBias;
```
Contributor

⚠️ Potential issue | 🟠 Major

Plumb ptrPermutedIdxToBiasRowIdx through the runner path.

The new BiasType::Mn row remap never reaches gemmData.mInputBuffers, so this path always falls back to biasRow = permutedRow. BatchedGemmInterface::InputBuffers and KernelParamsSetup::setKernelParams(...) now support the pointer, but TrtllmGenBatchedGemmRunner::run(...) still has no way to populate it.
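The fallback the comment describes can be illustrated with a small sketch. Names here are simplified stand-ins for the kernel-side lookup, not the actual implementation:

```python
from typing import Optional

def bias_row(permuted_row: int, remap: Optional[list[int]]) -> int:
    """Model of the BiasType::Mn lookup: when the remap table is never
    plumbed through (None), the kernel silently uses the identity mapping."""
    if remap is None:
        return permuted_row  # fallback the review warns about
    return remap[permuted_row]

table = [2, 0, 1]  # hypothetical ptrPermutedIdxToBiasRowIdx contents
print(bias_row(1, table))  # remapped row
print(bias_row(1, None))   # fallback: biasRow == permutedRow
```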

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp`
around lines 367 - 380, The new ptrPermutedIdxToBiasRowIdx pointer for
BiasType::Mn is not being forwarded into gemmData.mInputBuffers in
TrtllmGenBatchedGemmRunner::run(...), so the bias row remap never gets used;
update TrtllmGenBatchedGemmRunner::run to accept or obtain the
ptrPermutedIdxToBiasRowIdx value and assign it into gemmData.mInputBuffers
(i.e., set gemmData.mInputBuffers.mPtrPermutedIdxToBiasRowIdx =
ptrPermutedIdxToBiasRowIdx) before calling
KernelParamsSetup::setKernelParams(...), and ensure the same pointer is passed
through any intermediate calls so BatchedGemmInterface::InputBuffers receives
it.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:1a66a28af82abf6caa7e88701b3eeedceaabe952043e6a75ef1ca842cdd49e75
size 662339

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

TARGET='cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp'

echo "== Pointer header check =="
sed -n '1,5p' "$TARGET"

echo
echo "== CI/workflow LFS hydration checks =="
rg -n -C2 --iglob '*.yml' --iglob '*.yaml' 'git lfs (install|fetch|pull|checkout)'

echo
echo "== Build/static-analysis scripts mentioning cubins or this file type =="
rg -n -C2 '(_cubin\.cpp|trtllmGen_bmm_export/cubins|git lfs)'

Repository: NVIDIA/TensorRT-LLM

Length of output: 261


🏁 Script executed:

# Check if .gitattributes exists and configures LFS
if [ -f .gitattributes ]; then
  echo "=== .gitattributes LFS configuration ==="
  cat .gitattributes | grep -E '(cubin|\.cpp|lfs)' | head -20
else
  echo ".gitattributes not found"
fi

echo
echo "=== Workflow files in repository ==="
find . -type f \( -name '*.yml' -o -name '*.yaml' \) -path '*/.github/workflows/*' 2>/dev/null | head -20

echo
echo "=== LFS configuration in any workflow ==="
find . -type f \( -name '*.yml' -o -name '*.yaml' \) 2>/dev/null -exec grep -l 'lfs' {} \; 2>/dev/null | head -10

echo
echo "=== Build scripts or CMakeLists mentioning cubin or LFS ==="
rg -l '(cubin|lfs|git.*checkout)' --type cmake --type sh 2>/dev/null | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 1999


🏁 Script executed:

echo "=== blossom-ci.yml (LFS workflow) ==="
head -100 ./.github/workflows/blossom-ci.yml

echo
echo "=== pr-check.yml (likely runs C++ checks) ==="
head -150 ./.github/workflows/pr-check.yml

echo
echo "=== precommit-check.yml (likely runs linting) ==="
head -150 ./.github/workflows/precommit-check.yml

Repository: NVIDIA/TensorRT-LLM

Length of output: 7615


🏁 Script executed:

echo "=== scripts/release_check.py (what does it check?) ==="
head -200 scripts/release_check.py

echo
echo "=== Full blossom-ci.yml (look for git lfs or checkout) ==="
grep -n -A5 -B5 'git lfs\|checkout\|cubin' ./.github/workflows/blossom-ci.yml | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 7737


🏁 Script executed:

echo "=== .pre-commit-config.yaml (which hooks run on .cpp?) ==="
cat .pre-commit-config.yaml | head -200

Repository: NVIDIA/TensorRT-LLM

Length of output: 11191


🏁 Script executed:

echo "=== Continuing .pre-commit-config.yaml (actual hooks) ==="
tail -300 .pre-commit-config.yaml

Repository: NVIDIA/TensorRT-LLM

Length of output: 14794


Add lfs: 'true' to the checkout action in precommit-check.yml for consistency and robustness.

This file (*cubin.cpp) is already excluded from pre-commit hooks in .pre-commit-config.yaml, so it won't be processed by clang-format or other analysis tools. However, precommit-check.yml uses actions/checkout@v6 without the lfs: true parameter, while blossom-ci.yml correctly includes it. For consistency and to safeguard against future changes to the exclusion rules, add lfs: 'true' to the checkout step in precommit-check.yml.
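A sketch of the suggested change to the checkout step (step name and layout are illustrative; only the lfs input matters):

```yaml
steps:
  - name: Checkout repository
    uses: actions/checkout@v6
    with:
      lfs: 'true'  # hydrate Git LFS objects, matching blossom-ci.yml
```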

🧰 Tools
🪛 Clang (14.0.6)

[error] 1-1: unknown type name 'version'

(clang-diagnostic-error)


[error] 1-1: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: unknown type name 'oid'; did you mean 'void'?

(clang-diagnostic-error)


[error] 2-2: variable has incomplete type 'void'

(clang-diagnostic-error)


[error] 2-2: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: expected unqualified-id

(clang-diagnostic-error)

🪛 Cppcheck (2.20.0)

[error] 3-3: syntax error

(syntaxError)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, precommit-check.yml's checkout step uses actions/checkout@v6
without LFS enabled, which is inconsistent with blossom-ci.yml and could break
future checks for large files like the listed cubin artifacts; update the
checkout step in precommit-check.yml (the actions/checkout@v6 step) to include
lfs: 'true' so git-lfs files are fetched during the precommit workflow, matching
blossom-ci.yml and ensuring excluded large files remain resolvable if exclusion
rules change.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:97caf696273574e75d797260df22a9ce4ec3f837e1e48b90040ffa499d4753db
size 453560

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find .cpp files that are actually Git LFS pointers.
echo "== LFS pointer files with .cpp extension =="
fd -e cpp | while read -r f; do
  first_line="$(sed -n '1p' "$f" || true)"
  if [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]; then
    echo "$f"
  fi
done

# 2) Check whether this specific file is referenced by build/config scripts.
echo
echo "== References to this cubin path in build/config files =="
rg -n "Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin\\.cpp|trtllmGen_bmm_export/cubins"

# 3) Show header lines for this file to confirm it's an LFS pointer (not a C++ source body/header banner).
echo
echo "== First 5 lines of the file under review =="
sed -n '1,5p' "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp"

Repository: NVIDIA/TensorRT-LLM

Length of output: 50377


Exclude Git LFS pointer stubs from C++ translation units

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp contains only Git LFS metadata (version/oid/size), causing parser and build failures; fetch LFS objects before compilation or adjust build/static‐analysis rules to skip these stubs.
  • Add the required NVIDIA copyright header to this .cpp file per coding guidelines.
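Build or static-analysis tooling can skip such stubs by recognizing the Git LFS pointer header on the first line; a minimal sketch:

```python
LFS_HEADER = "version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(text: str) -> bool:
    """True when file content is a Git LFS pointer stub, not real source."""
    lines = text.splitlines()
    return bool(lines) and lines[0] == LFS_HEADER

stub = "version https://git-lfs.github.com/spec/v1\noid sha256:abc\nsize 453560\n"
src = "// Copyright (c) NVIDIA CORPORATION.\nunsigned char data[] = {0};\n"
print(is_lfs_pointer(stub), is_lfs_pointer(src))  # True False
```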

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, The translation unit
Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
currently contains only a Git LFS pointer stub which breaks parsing and builds;
replace the stub with the actual LFS-tracked binary cubin content (pull the LFS
object into the repo) or modify the build/static-analysis rules to skip/ignore
LFS pointer files for C++ translation units, and ensure the final .cpp file
contains the required NVIDIA copyright header at the top per project guidelines.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:28198e842017dfd64847faa770eefca936de60a8c09e7ae70e00fe313d3b005b
size 423264

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

TARGET='cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp'

echo "== Pointer file header =="
sed -n '1,3p' "$TARGET" || true

echo
echo "== .gitattributes entries tracking this path (expect LFS rule) =="
attrs=$(fd -HI '^\.gitattributes$' || true)
if [ -n "${attrs:-}" ]; then
  rg -n --fixed-strings 'trtllmGen_bmm_export/cubins/' $attrs || true
  rg -n --fixed-strings 'Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp' $attrs || true
else
  echo "No .gitattributes found"
fi

echo
echo "== Build/lint references to this file (expect not compiled as raw pointer text) =="
rg -n -C2 --fixed-strings 'Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp' || true

echo
echo "== CI/workflow LFS setup (expect explicit LFS fetch/pull before compile/lint) =="
rg -n -C2 -P 'git\s+lfs\s+(pull|fetch|checkout)|lfs:\s*true|actions/checkout@' || true

Repository: NVIDIA/TensorRT-LLM

Length of output: 4074


Add Git LFS tracking rule to .gitattributes and include NVIDIA copyright header

This .cpp file contains Git LFS pointer metadata but lacks a corresponding tracking rule in .gitattributes. Additionally, the file is missing the required NVIDIA copyright header.

To fix:

  1. Add a rule in .gitattributes to explicitly track this file (or the trtllmGen_bmm_export/cubins/ directory) as LFS-managed binary
  2. Include the NVIDIA copyright header comment at the top of the file with the year of latest modification

Without the .gitattributes rule, build/lint tools may attempt to process the pointer metadata as source code, causing failures.
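An illustrative .gitattributes rule covering the generated cubin stubs (the glob pattern is an assumption about the repo layout, not a verified path):

```
cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*_cubin.cpp filter=lfs diff=lfs merge=lfs -text
```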


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, The file begins with Git LFS pointer metadata (line starting
with "version https://git-lfs.github.com/spec/v1") but there is no
.gitattributes rule to mark this binary as LFS-managed, and the file is missing
the required NVIDIA copyright header; add an entry to .gitattributes to track
the cubins (e.g., a pattern matching these generated .cpp/cubin files) so the
pointer metadata is not treated as source, and prepend the canonical NVIDIA
copyright header comment to the top of this file with the correct latest
modification year.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:128c3261b254cdb66d2b786263044ac98147683feb60e526e0e58c5c863b0e38
size 436478

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Detect .cpp cubin files that are Git LFS pointers =="
fd --full-path '_cubin\.cpp$' | while read -r f; do
  if sed -n '1p' "$f" | grep -qx 'version https://git-lfs.github.com/spec/v1'; then
    echo "POINTER_CPP: $f"
    sed -n '1,3p' "$f"
  fi
done

echo
echo "== Check whether repo tooling/build scripts broadly pick up .cpp files =="
rg -n --hidden --glob '!**/.git/**' '(_cubin\.cpp|\*\.cpp|clang-tidy|cppcheck|target_sources|add_library|add_executable)'

Repository: NVIDIA/TensorRT-LLM

Length of output: 50377


🏁 Script executed:

# Check if these cubin files are referenced in build scripts
rg -l "cubin\.cpp|add_library|target_sources" --type cmake | head -20

# Check .gitattributes for LFS configuration
fd -e gitattributes | xargs cat 2>/dev/null | grep -A2 -B2 "cubin\|\.cpp"

# Check if there's a .clang-tidy or similar configuration that excludes these
fd -e clang-tidy -o -e clang-format | xargs cat 2>/dev/null

Repository: NVIDIA/TensorRT-LLM

Length of output: 1235


🏁 Script executed:

# Check .gitattributes
find . -name ".gitattributes" -type f | xargs cat 2>/dev/null | head -50

# Check if cubin files are referenced in CMakeLists
rg "_cubin\.cpp" cpp/tensorrt_llm/kernels/ | head -20

# Check for any pre-commit or CI configuration
find . -name ".pre-commit-config.yaml" -o -name ".clang-tidy" -o -name ".clang-format" | xargs cat 2>/dev/null | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 4449


🏁 Script executed:

# Check how cubin files are referenced in the main kernels CMakeLists.txt
cat cpp/tensorrt_llm/kernels/CMakeLists.txt | grep -A 5 -B 5 "cubin\|*.cpp" | head -100

# Check if there's any exclusion pattern for cubin files in build scripts
rg "cubin|exclude|EXCLUDE" cpp/tensorrt_llm/kernels/CMakeLists.txt -A 3

# Check the trtllmGenKernels CMakeLists specifically
fd -path "*trtllmGenKernels/CMakeLists.txt" -exec cat {} \; | head -150

Repository: NVIDIA/TensorRT-LLM

Length of output: 1916


🏁 Script executed:

# Confirm the trtllmGenKernels exclusion applies to the cubin files
rg "trtllmGenKernels" cpp/tensorrt_llm/kernels/CMakeLists.txt -B 2 -A 2

# Check if there are any subdirectory declarations for trtllmGenKernels
rg "add_subdirectory.*trtllmGenKernels" cpp/tensorrt_llm/kernels/

# List cubin file locations
fd "_cubin\.cpp$" cpp/tensorrt_llm/kernels/trtllmGenKernels/ | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 2036


🏁 Script executed:

# Check the trtllmGenKernels subdirectory CMakeLists.txt
find cpp/tensorrt_llm/kernels/trtllmGenKernels -name "CMakeLists.txt" -exec cat {} \;

Repository: NVIDIA/TensorRT-LLM

Length of output: 10372


Do not store Git LFS pointers in .cpp file paths; these will fail to compile when not checked out.

The cubin.cpp files in cpp/tensorrt_llm/kernels/trtllmGenKernels/ subdirectories (batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe) are configured in .gitattributes as Git LFS pointers but are picked up directly by subdirectory CMakeLists files via file(GLOB_RECURSE SRC_CPP *.cpp) and added as compilation sources. When LFS files are not materialized (common in CI without LFS configuration), the build will attempt to compile text containing LFS metadata as C++, causing syntax errors.

Additionally, these files violate the coding guideline requiring NVIDIA copyright headers on all .cpp files, since LFS pointers cannot contain such headers.

Either:

  1. Exclude cubin.cpp files from the glob in each subdirectory CMakeLists.txt, or
  2. Store these binary artifacts with a non-source-code extension and generate/include wrapper .cpp files if compilation is needed
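Option 1 could look like this in each affected CMakeLists.txt (a sketch, assuming SRC_CPP is populated via GLOB_RECURSE as described):

```cmake
file(GLOB_RECURSE SRC_CPP *.cpp)
# Drop Git LFS-managed cubin stubs so they are never compiled as C++.
list(FILTER SRC_CPP EXCLUDE REGEX "_cubin\\.cpp$")
```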

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512u2_s4_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_sm100f_cubin.cpp`
around lines 1 - 3, The build is pulling Git LFS pointer files named like
Bmm_..._cubin.cpp into compilation via the file(GLOB_RECURSE SRC_CPP *.cpp) in
the subdirectory CMakeLists, which causes CI to try compiling LFS metadata and
also prevents adding required NVIDIA headers; fix by excluding those cubin.cpp
files from the source glob or moving them off the .cpp extension and adding
lightweight wrapper .cpp stubs. Concretely, update each affected CMakeLists (the
ones that define SRC_CPP via file(GLOB_RECURSE SRC_CPP *.cpp) in
trtllmGenKernels subdirs batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe)
to filter out files matching "*_cubin.cpp" or "cubin.cpp" from SRC_CPP, or
rename/store binary cubin blobs with a non-.cpp extension and add minimal
wrapper translation units (e.g., loader functions) that include proper NVIDIA
copyright headers so real .cpp files compiled by CMake are valid source.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:b2d5b6ff01e5f2d80833302e5c38ed8a663baa695bdae838c4266066229ceaed
size 413176

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if file is truly new or just shows LFS pointer
git ls-files "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/"

# Find other .cpp files in the cubins directory to compare
fd -t f "\.cpp$" "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/" | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 50376


🏁 Script executed:

# Check .gitattributes to see if cubins are configured for LFS
find . -name ".gitattributes" -type f -exec grep -l "cubins\|\.cubin\|LFS" {} \;

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Look at CODING_GUIDELINES.md for any LFS-specific guidance
find . -name "CODING_GUIDELINES.md" -type f | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 86


🏁 Script executed:

# Check if there are other similar .cpp files in the same directory
ls -la "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 3631


🏁 Script executed:

cat CODING_GUIDELINES.md | grep -i -A 5 -B 5 "lfs\|binary\|cubin\|generated"

Repository: NVIDIA/TensorRT-LLM

Length of output: 1097


🏁 Script executed:

# Also check for any exemptions or exceptions in the guidelines
cat CODING_GUIDELINES.md | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 4962


🏁 Script executed:

# Continue reading CODING_GUIDELINES.md to find any LFS or exemption guidance
cat CODING_GUIDELINES.md | grep -i "lfs\|binary\|generated\|exemption\|exclude\|copyright" | head -30

Repository: NVIDIA/TensorRT-LLM

Length of output: 763


🏁 Script executed:

# Look for section about file headers and copyright
cat CODING_GUIDELINES.md | grep -i "copyright\|header\|license" -A 3 -B 1 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 2181


LFS pointer files in the cubins directory require explicit exemption documentation in CODING_GUIDELINES.md.

The coding guidelines state that "all source files which are compiled or interpreted" must contain NVIDIA copyright headers. However, the guidelines don't explicitly address Git LFS pointer files (like this 131-byte metadata stub) or pre-compiled binary artifacts stored in the cubins directory. To avoid future ambiguity, either:

  1. Add explicit exemption language to CODING_GUIDELINES.md for LFS pointers and binary artifacts, or
  2. Document that the copyright header requirement applies only to human-authored source code, not generated/compiled binaries or their LFS references
🧰 Tools
🪛 Clang (14.0.6)

[error] 1-1: unknown type name 'version'

(clang-diagnostic-error)


[error] 1-1: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: unknown type name 'oid'; did you mean 'void'?

(clang-diagnostic-error)


[error] 2-2: variable has incomplete type 'void'

(clang-diagnostic-error)


[error] 2-2: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: unknown type name 'b2d5b6ff01e5f2d80833302e5c38ed8a663baa695bdae838c4266066229ceaed'

(clang-diagnostic-error)


[error] 3-3: expected ';' after top level declarator

(clang-diagnostic-error)

🪛 Cppcheck (2.20.0)

[error] 3-3: syntax error

(syntaxError)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, Update CODING_GUIDELINES.md to explicitly exempt Git LFS
pointer files and pre-compiled binary artifacts in the cubins directory (e.g.,
the LFS pointer stub found in
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp)
from the "NVIDIA copyright headers" requirement, or alternatively add a clear
rule that the header mandate applies only to human-authored source files and not
generated/compiled binaries or their LFS references; make the language
unambiguous and add a short example sentence and a note in the file explaining
how LFS pointer stubs are treated.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:33b5743610881ecfc742746f42648f4e054534a9ca1adb57634e1988c88cb969
size 316110

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify this file is an LFS pointer in the checked-out tree.
sed -n '1,5p' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp

# Check if cubin .cpp artifacts are referenced by CMake build inputs.
rg -n -C2 --iglob 'CMakeLists.txt' --iglob '*.cmake' '_cubin\.cpp|trtllmGen_bmm_export/cubins'

# Check if static-analysis configs explicitly exclude this folder/pattern.
rg -n -C2 --iglob '.clang-tidy' --iglob '.clangd' --iglob '*.yml' --iglob '*.yaml' 'cubins|trtllmGen_bmm_export|_cubin\.cpp'

Repository: NVIDIA/TensorRT-LLM

Length of output: 194


.cpp file contains Git LFS pointer and is not referenced in build system

Confirmed: Lines 1–3 show Git LFS metadata, not C++ source. Verification found no CMakeLists.txt, .cmake files, or static analysis configs that reference this cubin directory or _cubin.cpp files, indicating the path is implicitly excluded from compilation.

However, this implicit exclusion through directory structure (no build references) is not explicit. To prevent accidental inclusion in future build rules, consider either:

  • Adding an explicit exclusion pattern in CMakeLists.txt or build tooling for the cubins/ directory
  • Moving cubin artifacts to a non-.cpp extension (e.g., .cubin or .o) to avoid confusion with C++ source files

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`
around lines 1 - 3, The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp
is a Git LFS pointer, not real C++ source, and the cubins/ folder is implicitly
excluded from the build; make the exclusion explicit by either adding an exclude
rule for the cubins/ directory in your top-level CMakeLists (or the relevant
cmake module that collects sources) so files under cubins/ are ignored, and/or
renaming these artifacts from *.cpp to a non-source extension (e.g., .cubin or
.o) so build system and static analysis won’t treat them as C++ sources; update
any source-gathering patterns (e.g., add_glob or file(GLOB_RECURSE) filters)
that might pick up *_cubin.cpp and document the change in the build module that
references the generated kernels.

Comment on lines +1281 to +1288
if (options.mMmaTileK == 0)
{
options.mMmaTileK = options.mTileK;
}
else
{
TLLM_CHECK_ERROR(options.mLayoutB != MatrixLayout::BlockMajorK,
"layoutB == MatrixLayout::BlockMajorK is not supported for now");
if (options.mMmaTileK != options.mTileK)
{

⚠️ Potential issue | 🟡 Minor

Keep updateOptions == false side-effect free.

checkAndUpdateGemmOptions(..., false) is used as a pure validator, but these blocks still fill in mMmaTileK and mPerTokenSfDtype. That makes an underspecified config look valid and hides whether the cubin metadata actually encoded those fields. Please guard these defaults with updateOptions (or reject when they are unset in validation mode).

Also applies to: 2175-2185
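The suggested guard, sketched in Python with a simplified options dict standing in for GemmOptions:

```python
def check_and_update_gemm_options(opts: dict, update_options: bool) -> bool:
    """Sketch of the suggested guard: defaults are filled in only when
    update_options is True; pure validation rejects unset fields instead."""
    if opts.get("mMmaTileK", 0) == 0:
        if not update_options:
            return False  # underspecified config must not look valid
        opts["mMmaTileK"] = opts["mTileK"]
    return True

opts = {"mTileK": 128}
print(check_and_update_gemm_options(dict(opts), update_options=False))  # rejected
filled = dict(opts)
print(check_and_update_gemm_options(filled, update_options=True), filled["mMmaTileK"])
```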

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmOptions.h`
around lines 1281 - 1288, The validator checkAndUpdateGemmOptions currently
mutates options (setting options.mMmaTileK = options.mTileK and
options.mPerTokenSfDtype defaults) even when called with updateOptions == false;
change the logic so these assignments only occur when updateOptions is true, and
when updateOptions is false either return an error/false if those fields are
unset or explicitly reject the config; apply the same guard to the other
identical block later in the function (the second occurrence that compares
mMmaTileK vs mTileK and sets mPerTokenSfDtype) so validation remains side-effect
free.

Comment on lines +357 to +360
uint32_t* rowMaxBars, int32_t const* ptrNumNonExitingCtas = nullptr,
int32_t const* ptrTotalNumPaddedTokens = nullptr, int32_t const* ptrCtaIdxXyToBatchIdx = nullptr,
int32_t const* ptrCtaIdxXyToMnLimit = nullptr, int32_t const maxNumCtas = KernelParams::MaxNumCtas,
uint32_t* ptrDynamicTileCounter = nullptr)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect all call sites of KernelParamsSetup::setKernelParams and the maxNumCtas plumbing.
rg -n -C3 '\bsetKernelParams\s*\(' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm
rg -n -C3 '\bmaxNumCtas(InBatchDim)?\b|\bnumCgas\b|\bcgaOffset\b|\bclusterDimInBatchDim\b' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm

Repository: NVIDIA/TensorRT-LLM

Length of output: 39791


🏁 Script executed:

# Trace where maxNumCtasInBatchDim originates and how it's computed
# Look for high-level callers that prepare this value
rg -n 'maxNumCtasInBatchDim\s*=' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm
rg -n -B5 'run\s*\(' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp | grep -A5 'maxNumCtasInBatchDim'

Repository: NVIDIA/TensorRT-LLM

Length of output: 1015


🏁 Script executed:

# Look for where maxNumCtasInBatchDim is actually computed or passed from external code
# Find the public API entry points and trace back
rg -n 'maxNumCtasInBatchDim' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm --type cpp -A2 -B2 | head -100

# Also check if there's any documentation or comments explaining the semantics
rg -n 'CGA|cga|cluster.*group' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h

Repository: NVIDIA/TensorRT-LLM

Length of output: 13747


🏁 Script executed:

# Look for the comment about CTA granularity in KernelParams.h
cat -n cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | sed -n '380,390p'

# Check if there are any comments or documentation about the parameter semantics
rg -n 'maxNumCtas|MaxNumCtas' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | head -20

# Look at the parameter definition and surrounding documentation
cat -n cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | sed -n '340,365p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 2528


Parameter maxNumCtas uses CGA semantics but retains CTA naming, creating a misleading contract.

The code at line 386 explicitly acknowledges this: "The CTA names are kept for compatibility even though they are indexed at CGA granularity." The calculation at line 477 (paddedTokensInBatchDim = cgaOffset * tile * clusterDimInBatchDim) is correct only if callers pass CGA count, not CTA count. However, the parameter name and signature still advertise "CTA" semantics, risking misuse by callers unfamiliar with the compatibility note.

Rename the parameter or add clear inline documentation stating that it expects a CGA count (e.g., rename it to maxNumCgas, or add a /*maxNumCgas=*/ inline comment at call sites), or update KernelParams::MaxNumCtas to reflect CGA semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h`
around lines 357 - 360, The parameter name maxNumCtas in the KernelParams.h
function signature is misleading because the implementation (see
paddedTokensInBatchDim calculation using cgaOffset and clusterDimInBatchDim)
expects a CGA count; change the API to make this explicit by renaming the
parameter to maxNumCgas (and update all call sites) or add an inline comment in
the signature like /* maxNumCgas = CGA-granularity count */ and/or update
KernelParams::MaxNumCtas to reflect CGA semantics so callers pass CGA counts;
ensure references to maxNumCtas and KernelParams::MaxNumCtas and the
paddedTokensInBatchDim / cgaOffset usage are updated consistently.

Comment on lines +524 to 526
auto const numTmemColsA = useTmemA ? numStagesA * tileK
/ (numSlicesForSliceK * tg::dtypeGetNumBits(tg::Dtype::UInt32) / tg::dtypeGetNumBits(dtypeMmaA))
: 0;
Contributor

⚠️ Potential issue | 🟠 Major

mNumStagesTmemA is still ignored here.

tmemA sizing is derived from numStagesA, and KernelTraits still has no separate TMEM-A stage parameter. Any config with mNumStagesTmemA != mNumStagesA will be modeled with the wrong TMEM footprint and effectively loses the new knob. Please thread mNumStagesTmemA into KernelTraits and use it for tmemA.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelTraits.h`
around lines 524 - 526, KernelTraits currently ignores the separate TMEM-A stage
count; thread the mNumStagesTmemA member through KernelTraits (add the member,
ensure constructors/builders/factory code initialize it from the configuration)
and use mNumStagesTmemA instead of numStagesA when computing numTmemColsA (the
expression that uses numStagesA * tileK / (numSlicesForSliceK * ... /
tg::dtypeGetNumBits(dtypeMmaA))). Also ensure useTmemA logic and any getters
that expose stage counts reflect mNumStagesTmemA so configs with mNumStagesTmemA
!= mNumStagesA produce the correct TMEM-A footprint.

@nekorobov
Collaborator

The PR description covers the changes in TRT-LLM Gen, but many of those features (Slice-K, MN Bias, etc.) are not added in this MR. Could you fix it, please?

@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch from 58e0fd8 to 8266478 Compare April 29, 2026 09:59
@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46202 [ run ] triggered by Bot. Commit: 8266478 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46202 [ run ] completed with state SUCCESS. Commit: 8266478
/LLM/main/L0_MergeRequest_PR pipeline #36315 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46252 [ run ] triggered by Bot. Commit: 8266478 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46252 [ run ] completed with state SUCCESS. Commit: 8266478
/LLM/main/L0_MergeRequest_PR pipeline #36358 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
experimental: TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION=random or balanced to alter the expert distribution during autotuning; affects kernel selection

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
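The experimental knob named in the commit message above can be exercised like this. The variable name and its two values come from the commit; the invocation shape around it is an assumption, so only the export itself is shown:

```shell
# Experimental knob: "random" or "balanced" alter the dummy expert
# distribution the autotuner sees, which in turn affects kernel selection.
export TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION=random
echo "$TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION"
```

Set it in the environment of whatever process runs the autotuner; unset, the default distribution is used.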
@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch from 8266478 to f345e85 Compare April 30, 2026 08:16
@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46350 [ run ] triggered by Bot. Commit: f345e85 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46350 [ run ] completed with state SUCCESS. Commit: f345e85
/LLM/main/L0_MergeRequest_PR pipeline #36438 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46557 [ run ] triggered by Bot. Commit: f345e85 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46557 [ run ] completed with state SUCCESS. Commit: f345e85
/LLM/main/L0_MergeRequest_PR pipeline #36612 completed with status: 'SUCCESS'

CI Report

Link to invocation

3 participants