
[None][feat] Update TRTLLM MoE cubins #12440

Open

rosenrodt wants to merge 3 commits into NVIDIA:main from rosenrodt:trtllm-moe-tput-kernels

Conversation

Collaborator

@rosenrodt rosenrodt commented Mar 23, 2026

Description

DEP8 DS-R1 FP4 1k/2k on B200

| bsz  | old    | new    | gain   |
|------|--------|--------|--------|
| 8    | 74.93  | 74.76  | −0.2%  |
| 64   | 504.8  | 506.1  | +0.2%  |
| 128  | 905.3  | 917.9  | +1.4%  |
| 512  | 2700.5 | 2737.8 | +1.4%  |
| 1024 | 3439.7 | 3979.0 | +15.7% |
| 2048 | 5244.4 | 5315.9 | +1.4%  |
| 4096 | 3375.3 | 3381.2 | +0.2%  |
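The gain column is the relative throughput change between the old and new cubins. A minimal sketch reproducing it (the `gainPercent` helper is hypothetical, not part of the PR):

```cpp
#include <cassert>
#include <cmath>

// Relative throughput change in percent; positive means the new cubins are
// faster. Hypothetical helper, used only to reproduce the "gain" column above.
inline double gainPercent(double oldTps, double newTps)
{
    return (newTps - oldTps) / oldTps * 100.0;
}
```

For example, `gainPercent(3439.7, 3979.0)` is about +15.7, matching the bsz=1024 row.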

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@rosenrodt rosenrodt changed the title [none][feat] Update TRTLLM MoE cubins [None][feat] Update TRTLLM MoE cubins Mar 23, 2026
@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch 3 times, most recently from bbe09ba to 7e8370b on April 28, 2026 02:53
@nv-guomingz nv-guomingz self-requested a review April 28, 2026 06:07
@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch from 7e8370b to 58e0fd8 on April 28, 2026 17:00
@rosenrodt rosenrodt marked this pull request as ready for review April 29, 2026 07:04
@rosenrodt rosenrodt requested a review from a team as a code owner April 29, 2026 07:04
@rosenrodt rosenrodt requested a review from liji-nv April 29, 2026 07:04
Contributor

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Refactors kernel selection logic in batched GEMM runner, introduces new bias/scheduling parameters and hybrid slice K optimization (Blackwell-only), adds bias row mapping support with CGA-granularity routing, and reorganizes configuration templates with new variants. Multiple cuBin artifacts updated via Git LFS.

Changes

Cohort / File(s) Summary
Kernel Selection & Parameter Logic
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp
Removes prior predefined-config prioritization, introduces skipQuirks() filtering for known hanging/crashing configs, updates dtype validation to be transpose-aware using canonical operand convention, refactors problem-dimension construction into setProblemDimensions() helper, adjusts pointer routing for per-token SFs, and strengthens epilogue tile validation.
Header Documentation
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h
Adds documentation comments to TrtllmGenBatchedGemmRunnerOptions struct canonically defining roles of dtypeA, dtypeB, dtypeC and clarifying transposeMmaOutput legacy behavior for MoE callers.
Bias & Scheduling Enums
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/Enums.h
Adds FusedBiasShuffleMode and SchedHostTask enum types with inline predicate helpers (usesFusedBiasShuffle, usesFusedBiasReorder, mode-specific checkers).
Batched GEMM Options & Validation
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/BatchedGemmOptions.h
Adds bias-related parameters (biasDtype, fusedBiasShuffleMode), scheduling (fuseLoadSfTask, schedHostTask), MMA customization (mmaTileK, useCustomizedMma3xNvFp4), and mPrefetchB option; updates validation for scheduler+routing incompatibility, fused-bias-reorder constraints, and extended bias type support (Mn).
GEMM Options & Hybrid Slice K
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmOptions.h
Major refactoring: replaces single numStages with per-path counts (numStagesA/B, numStagesSmemSfA/B, numStagesTmemA, numStagesTmemSfA/B); adds mBiasDtype, mFusedBiasShuffleMode, mPerTokenSfDtype, mSchedHostTask, mUseCustomizedMma3xNvFp4; introduces new mHybridSliceK feature (Blackwell-gated) with derived fields (mHybridHiddenDim, mHybridTileHidden, mReRunTimes, etc.) and validation block, includes tile dimension swapping under hybrid mode.
Gated Act Options
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmGatedActOptions.h
Removes unconditional BiasType::Mn rejection when mTransposeMmaOutput is disabled, formats function call on single line.
Kernel Traits & Stage Allocation
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelTraits.h
Updates constructor to accept per-path stage counts and new parameters (mmaTileK, useCustomizedMma3xNvFp4, fusedBiasShuffleMode); adjusts SMEM/TMEM allocation logic to use granular stage controls instead of unified numStages.
Batched GEMM Interface
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/BatchedGemmInterface.h
Extends InputBuffers with optional mPtrPermutedIdxToBiasRowIdx mapping pointer for BiasType::Mn; adds test-only cudaEvent parameter (guarded by TLLM_TEST); adjusts grid-dimension computation via clusterDimInBatchDim for MoE early-exit modes.
Kernel Parameters & CGA Semantics
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParamsDecl.h
Switches routing/padding bookkeeping from CTA to CGA granularity (ctaOffset → cgaOffset, numCtas → numCgas); adds ptrPermutedIdxToBiasRowIdx device pointer; introduces paddedTokensInBatchDim for TMA descriptor sizing; adjusts SF TMA to use mmaTileK instead of mTileK; halves B-side TMA tile dimension for 2CTA MMA.
Configuration Templates
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json
Removes defaults from shared templates, adds explicit batch/transposeMmaOutput/ctaSwizzleType/biasType to FP4/FP8 specialized templates; introduces new "NonSwapAb" templates for FP4, FP8, MXE variants with biasType: "n" and epilogue LDTM overrides; updates E2M1/E4M3 variant templates; changes one fusion config from "swiglu" to "none" for activation.
cuBin Artifacts (LFS Pointers)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/*
Adds ~100 new Git LFS pointer files for Bmm kernel variants (E2m1E2m1, E2m1E4m3, E4m3E4m3, MxE variants); removes ~80 existing LFS pointer files (obsolete 16dp256b variants and ldgsts routing versions).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • [None][feat] Trtllm-gen FMHA JIT support #12612: Refactors TMA-based trtllm-gen kernel parameter layout and CGA-granularity semantics for clustered batched-dimension routing, aligning with the main PR's CTA→CGA transition and parameter plumbing changes.

Suggested labels

VisualGen

Suggested reviewers

  • nv-guomingz
  • niukuo
  • zhenhuaw-me
  • PerkzZheng
  • yuxianq

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 20

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h (1)

1-15: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA copyright year in this modified header.

This file changed in 2026, but the header still ends at 2025.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.h` around
lines 1 - 15, The file header in KernelRunner.h still lists the NVIDIA copyright
range as "2020-2025" but the file was modified in 2026; update the copyright
year range in the top comment block (the header comment present at the top of
KernelRunner.h) to include 2026 (e.g., "2020-2026") so the file header reflects
the modification year.
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h (1)

691-708: ⚠️ Potential issue | 🟠 Major

Finish the mMmaTileK migration for SF TMA descriptors.

The non-routed SF branches now use mMmaTileK, but the routed-SF TMA branches still size their per-tile K extent from mTileK / numEltsPerSf. When customized MMA tile-K is enabled (which requires mTileK=256 and mMmaTileK=192), the same SF buffer gets different descriptor shapes: 256/numEltsPerSf for routed TMA vs. 192 for non-routed. This inconsistency should be resolved by migrating the routed TMA branches (around lines 560-582 for matrix B and lines 660-681 for matrix A) to use mMmaTileK as well.

cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json (1)

233-955: ⚠️ Potential issue | 🟡 Minor

Remove trailing commas from grouped-key field names in two configs.

The validation script found dangling delimiters in the field names of configs "FP4_FC2_HighThroughput" (config 13) and "FP4_FC2_HighThroughput_NonSwapAb" (config 15):

  • "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB," should be "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB"
  • "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," should be "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA"

All other required fields and arity checks pass validation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json`
around lines 233 - 955, In the two configs for "FP4_FC2_HighThroughput"
(template BatchedGemmFp4HighThroughput) and "FP4_FC2_HighThroughput_NonSwapAb"
(template BatchedGemmFp4HighThroughputNonSwapAb) remove the stray trailing
commas from the grouped-key names: change
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB," to
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutB" and change
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," to
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA" so the field names are
valid JSON keys for validation.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json (1)

544-546: Remove the trailing delimiter in the grouped key at Line 544.

The key string ends with an extra comma (...,sfLayoutA,), which is parser-fragile for tuple/key arity mapping.

Proposed cleanup
-            "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA,": [
+            "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA": [
                 [128, 128, 128, 256, 6, "128x4"]
             ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/config.json`
around lines 544 - 546, The grouped key string
"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA," contains an extra trailing
comma which breaks tuple/key arity mapping; edit the config.json entry so the
key is "mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA" (remove the trailing
comma) and verify the associated value array stays unchanged, and scan other
grouped keys for similar trailing delimiters to remove if present.
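The kind of arity check the validation script performs can be sketched as follows (helper names are hypothetical; the real script is not shown in this PR). A grouped key like `"mmaN,tileN,epilogueTileN,tileK,numStages,sfLayoutA"` must split into exactly as many non-empty field names as each value tuple has entries, so a trailing comma fails:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a grouped key such as "mmaN,tileN,...,sfLayoutA" into field names.
// A trailing comma is surfaced as an empty trailing field so the arity
// check below can reject it.
inline std::vector<std::string> splitFields(std::string const& key)
{
    std::vector<std::string> fields;
    std::stringstream ss(key);
    std::string item;
    while (std::getline(ss, item, ','))
    {
        fields.push_back(item);
    }
    if (!key.empty() && key.back() == ',')
    {
        fields.push_back(""); // dangling delimiter
    }
    return fields;
}

// A grouped key is valid iff every field name is non-empty and the number of
// names matches the arity of each value tuple (6 for the configs above).
inline bool keyMatchesArity(std::string const& key, size_t tupleLen)
{
    for (auto const& field : splitFields(key))
    {
        if (field.empty())
        {
            return false;
        }
    }
    return splitFields(key).size() == tupleLen;
}
```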
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp`:
- Around line 367-380: The new ptrPermutedIdxToBiasRowIdx pointer for
BiasType::Mn is not being forwarded into gemmData.mInputBuffers in
TrtllmGenBatchedGemmRunner::run(...), so the bias row remap never gets used;
update TrtllmGenBatchedGemmRunner::run to accept or obtain the
ptrPermutedIdxToBiasRowIdx value and assign it into gemmData.mInputBuffers
(i.e., set gemmData.mInputBuffers.mPtrPermutedIdxToBiasRowIdx =
ptrPermutedIdxToBiasRowIdx) before calling
KernelParamsSetup::setKernelParams(...), and ensure the same pointer is passed
through any intermediate calls so BatchedGemmInterface::InputBuffers receives
it.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: precommit-check.yml's checkout step uses actions/checkout@v6
without LFS enabled, which is inconsistent with blossom-ci.yml and could break
future checks for large files like the listed cubin artifacts; update the
checkout step in precommit-check.yml (the actions/checkout@v6 step) to include
lfs: 'true' so git-lfs files are fetched during the precommit workflow, matching
blossom-ci.yml and ensuring excluded large files remain resolvable if exclusion
rules change.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The translation unit
Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
currently contains only a Git LFS pointer stub which breaks parsing and builds;
replace the stub with the actual LFS-tracked binary cubin content (pull the LFS
object into the repo) or modify the build/static-analysis rules to skip/ignore
LFS pointer files for C++ translation units, and ensure the final .cpp file
contains the required NVIDIA copyright header at the top per project guidelines.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The file begins with Git LFS pointer metadata (line starting
with "version https://git-lfs.github.com/spec/v1") but there is no
.gitattributes rule to mark this binary as LFS-managed, and the file is missing
the required NVIDIA copyright header; add an entry to .gitattributes to track
the cubins (e.g., a pattern matching these generated .cpp/cubin files) so the
pointer metadata is not treated as source, and prepend the canonical NVIDIA
copyright header comment to the top of this file with the correct latest
modification year.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512u2_s4_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The build is pulling Git LFS pointer files named like
Bmm_..._cubin.cpp into compilation via the file(GLOB_RECURSE SRC_CPP *.cpp) in
the subdirectory CMakeLists, which causes CI to try compiling LFS metadata and
also prevents adding required NVIDIA headers; fix by excluding those cubin.cpp
files from the source glob or moving them off the .cpp extension and adding
lightweight wrapper .cpp stubs. Concretely, update each affected CMakeLists (the
ones that define SRC_CPP via file(GLOB_RECURSE SRC_CPP *.cpp) in
trtllmGenKernels subdirs batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe)
to filter out files matching "*_cubin.cpp" or "cubin.cpp" from SRC_CPP, or
rename/store binary cubin blobs with a non-.cpp extension and add minimal
wrapper translation units (e.g., loader functions) that include proper NVIDIA
copyright headers so real .cpp files compiled by CMake are valid source.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x64x256_s4_et128x64_m128x64x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: The build is picking up pointer-only cubin files tracked by
Git LFS (pattern *cubin.cpp in .gitattributes) because CMake uses
file(GLOB_RECURSE SRC_CPP *.cpp); either ensure CI materializes LFS blobs before
C++ parsing by adding a checkout step that runs git lfs pull --include="*.cpp"
--exclude="" or change the CMake glob to exclude cubin sources (e.g., stop using
*.cpp-only glob or add an exclusion for *cubin.cpp when populating SRC_CPP) so
the pointer-only cubin files are not passed to the compiler/static analyzers.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x256_s5_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm103a_cubin.cpp`:
- Around line 1-3: CMake is globbing LFS pointer ".cpp" files (via
file(GLOB_RECURSE SRC_CPP *.cpp)) which pulls in LFS placeholders under
batchedGemm/cubins/* and causes compiler errors; change the glob/exclusion so
SRC_CPP does not include cubin LFS files (e.g., restrict to src/**/*.cpp or add
an explicit exclude for batchedGemm/cubins/** and other cubin/* directories) or
rename those LFS artifacts to a non-.cpp extension and update .gitattributes
accordingly, and add CI steps to run git lfs fetch/pull before the build so real
cubin binaries are available (refer to CMakeLists.txt, the SRC_CPP glob,
batchedGemm/cubins directory and .gitattributes).

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: precommit-check.yml's checkout step currently omits LFS and
causes Git LFS pointers (e.g., the cubin C++ LFS pointer shown) to be checked
out as text; update the checkout action in precommit-check.yml (the step that
uses actions/checkout) to include lfs: 'true' like in blossom-ci.yml so LFS
files are materialized before running pre-commit hooks such as clang-format.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: CI is failing static analysis because tracked cubin binary
artifacts (files matching cubins/**/*.cpp) are stored as Git LFS pointers and
static analyzers choke on the pointer metadata; fix by ensuring the pipeline
either runs git lfs pull (or equivalent fetch of LFS objects) prior to any
compilation/lint steps or by excluding these generated cubin files from C++
linting (update CI job steps to call git lfs pull or update analyzer config to
skip pattern cubins/**/*.cpp and rely on .gitattributes entry for *cubin.cpp).

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp`:
- Around line 1-3: The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm100a_cubin.cpp
contains an unresolved Git LFS pointer instead of real C++ and is breaking
static analysis; fix by adding a pre-build step that runs git lfs pull (or
equivalent) to materialize LFS objects before compilation/analysis and/or update
CI/build scripts to fetch LFS, and update repository rules (.gitattributes/CI)
to treat *_cubin.cpp as binary or exclude them from C++ tooling so tools like
clang/cppcheck skip that filename pattern; ensure the build scripts reference
this filename pattern when adding the fetch or exclusion.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E4m3_castE4m3_Fp32_Ab32_t128x8x512u2_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_biasFp32M_bN_rgTma_clmp_dynB_sm103a_cubin.cpp`:
- Around line 1-3: The CI is compiling a Git LFS pointer file (the cubin.cpp
pointer at top of the referenced file and any files matching *cubin.cpp in
.gitattributes) because workflows do not fetch LFS objects; update the CI/build
workflows that run static analysis/compilation (the job that invokes
clang/cppcheck/your build script) to run git lfs fetch --all (or git lfs pull)
after checkout but before any compile/analysis steps, and ensure any
materialized cubin.cpp binary source includes the required NVIDIA copyright
header with the latest modification year so the real file (not the pointer) is
present for compilation and policy checks.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x128x128_s5_et64x128_m64x128x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The file is a Git LFS pointer misnamed with a .cpp extension,
which can confuse C++ tooling; rename all files matching the _cubin.cpp pattern
(e.g., Bmm_*_cubin.cpp) to a non-C++ artifact extension such as .cubin.bin or
.bin, or alternatively add explicit exclusion rules (e.g., in build configs and
.gitattributes) to ensure this directory and pattern are never treated as C++
sources; if any of these files are intended to be committed as source artifacts,
add the required NVIDIA copyright header to the file content and update
repository docs to note the exclusion.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x16x128u2_s6_et64x16_m64x16x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: Multiple kernel cubin files contain unresolved Git LFS
pointer text (e.g., starting with "version https://git-lfs.github.com/spec/v1")
which breaks static analysis and compilation; update CI and build to: run git
lfs pull before any lint/compile stage (modify CI scripts and any prebuild
hooks), add a pre-compile check in the build scripts to scan .cubin.cpp files
for the LFS pointer header and either auto-run git lfs pull or fail fast with a
clear error telling contributors to run git lfs pull, and add a guard (or
exclusion) so unresolved LFS pointer files are skipped by static analyzers (or
listed in .gitignore/build exclusion) plus document the required Git LFS setup
in the repo README/CONTRIBUTING so contributors and CI get the real binary cubin
contents.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x32x128u2_s6_et64x32_m64x32x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schPd4x2x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: The new LFS-tracked cubin artifact
Bmm_Bfloat16_E4m3E4m3_Fp32_t128x32x128u2_s6_et64x32_m64x32x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schPd4x2x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
lacks the required NVIDIA copyright header; either add the standard NVIDIA
header (with the year of latest meaningful modification) into this generated
.cpp or formally exempt generated/LFS-tracked cubin artifacts by updating the
repo linting and docs: modify the header-check configuration to exclude the
cubin/generated pattern and add a docs note describing the exemption and
justification, then include the chosen remediation (header addition or
exemption) in the PR description for audit.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x128_s4_et64x8_m64x8x32_c1x1x1_rM_TN_transOut_noShflA_dsFp8_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The build includes cubin pointer .cpp files because the
CMakeLists uses file(GLOB_RECURSE SRC_CPP *.cpp); update the build/CI to ensure
Git LFS artifacts are present before compilation by adding an explicit git lfs
fetch/pull step prior to invoking CMake (or change the CMake rule to exclude
patterns), and/or change the CMakeLists usage of file(GLOB_RECURSE SRC_CPP
*.cpp) to filter out "*cubin.cpp" so the compiler never sees LFS pointer files;
target the CMakeLists.txt entry with file(GLOB_RECURSE SRC_CPP *.cpp) and the CI
build steps that run CMake to implement the fix.
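One of the suggested remediations (filtering the glob) could look like this. This is a sketch against the `file(GLOB_RECURSE SRC_CPP *.cpp)` pattern the comment describes, not a verified excerpt of the repository's CMakeLists:

```cmake
# Collect sources as before, then drop the LFS-pointer cubin stubs so the
# compiler and static analyzers never see pointer metadata.
file(GLOB_RECURSE SRC_CPP *.cpp)
list(FILTER SRC_CPP EXCLUDE REGEX "_cubin\\.cpp$")
```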

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`:
- Around line 1-3: Update CODING_GUIDELINES.md to explicitly exempt Git LFS
pointer files and pre-compiled binary artifacts in the cubins directory (e.g.,
the LFS pointer stub found in
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp)
from the "NVIDIA copyright headers" requirement, or alternatively add a clear
rule that the header mandate applies only to human-authored source files and not
generated/compiled binaries or their LFS references; make the language
unambiguous and add a short example sentence and a note in the file explaining
how LFS pointer stubs are treated.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`:
- Around line 1-3: The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp
is a Git LFS pointer, not real C++ source, and the cubins/ folder is implicitly
excluded from the build; make the exclusion explicit by either adding an exclude
rule for the cubins/ directory in your top-level CMakeLists (or the relevant
cmake module that collects sources) so files under cubins/ are ignored, and/or
renaming these artifacts from *.cpp to a non-source extension (e.g., .cubin or
.o) so build system and static analysis won’t treat them as C++ sources; update
any source-gathering patterns (e.g., add_glob or file(GLOB_RECURSE) filters)
that might pick up *_cubin.cpp and document the change in the build module that
references the generated kernels.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmOptions.h`:
- Around line 1281-1288: The validator checkAndUpdateGemmOptions currently
mutates options (setting options.mMmaTileK = options.mTileK and
options.mPerTokenSfDtype defaults) even when called with updateOptions == false;
change the logic so these assignments only occur when updateOptions is true, and
when updateOptions is false either return an error/false if those fields are
unset or explicitly reject the config; apply the same guard to the other
identical block later in the function (the second occurrence that compares
mMmaTileK vs mTileK and sets mPerTokenSfDtype) so validation remains side-effect
free.
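The guard the comment asks for can be illustrated with a reduced sketch (the struct and field names below mirror mTileK/mMmaTileK but are illustrative; this is not the real checkAndUpdateGemmOptions):

```cpp
#include <cassert>

// Reduced model of the issue: defaults may be written back only when
// updateOptions is true; with updateOptions == false, validation must be
// side-effect free and reject configs that rely on unset fields.
struct Opts
{
    int mTileK = 256;
    int mMmaTileK = -1; // -1 means "unset"
};

inline bool checkAndUpdate(Opts& o, bool updateOptions)
{
    if (o.mMmaTileK < 0)
    {
        if (!updateOptions)
        {
            return false; // reject instead of mutating
        }
        o.mMmaTileK = o.mTileK; // apply the default only when allowed
    }
    return o.mMmaTileK <= o.mTileK;
}
```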

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h`:
- Around line 357-360: The parameter name maxNumCtas in the KernelParams.h
function signature is misleading because the implementation (see
paddedTokensInBatchDim calculation using cgaOffset and clusterDimInBatchDim)
expects a CGA count; change the API to make this explicit by renaming the
parameter to maxNumCgas (and update all call sites) or add an inline comment in
the signature like /* maxNumCgas = CGA-granularity count */ and/or update
KernelParams::MaxNumCtas to reflect CGA semantics so callers pass CGA counts;
ensure references to maxNumCtas and KernelParams::MaxNumCtas and the
paddedTokensInBatchDim / cgaOffset usage are updated consistently.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelTraits.h`:
- Around line 524-526: KernelTraits currently ignores the separate TMEM-A stage
count; thread the mNumStagesTmemA member through KernelTraits (add the member,
ensure constructors/builders/factory code initialize it from the configuration)
and use mNumStagesTmemA instead of numStagesA when computing numTmemColsA (the
expression that uses numStagesA * tileK / (numSlicesForSliceK * ... /
tg::dtypeGetNumBits(dtypeMmaA))). Also ensure useTmemA logic and any getters
that expose stage counts reflect mNumStagesTmemA so configs with mNumStagesTmemA
!= mNumStagesA produce the correct TMEM-A footprint.


Comment on lines +367 to 380

```diff
 setProblemDimensions(gemmData, transposeMmaOutput, m, n, k, batchedTokens, numTokens, numBatches,
     maxNumCtasInBatchDim, validM, validN, validK);

 // Inputs
-gemmData.mInputBuffers.mPtrA = mOptions.transposeMmaOutput ? b : a;
-gemmData.mInputBuffers.mPtrSfA = mOptions.transposeMmaOutput ? sfB : sfA;
-gemmData.mInputBuffers.mPtrB = mOptions.transposeMmaOutput ? a : b;
-gemmData.mInputBuffers.mPtrSfB = mOptions.transposeMmaOutput ? sfA : sfB;
+gemmData.mInputBuffers.mPtrA = transposeMmaOutput ? b : a;
+gemmData.mInputBuffers.mPtrSfA = transposeMmaOutput ? sfB : sfA;
+gemmData.mInputBuffers.mPtrB = transposeMmaOutput ? a : b;
+gemmData.mInputBuffers.mPtrSfB = transposeMmaOutput ? sfA : sfB;
 gemmData.mInputBuffers.mPtrScaleC = scaleC;
 gemmData.mInputBuffers.mPtrScaleGate = scaleGateC;
 gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
-gemmData.mInputBuffers.mPtrPerTokenSfA = mOptions.transposeMmaOutput ? perTokensSfB : perTokensSfA;
-gemmData.mInputBuffers.mPtrPerTokenSfB = mOptions.transposeMmaOutput ? perTokensSfA : perTokensSfB;
+gemmData.mInputBuffers.mPtrPerTokenSfA = transposeMmaOutput ? perTokensSfB : perTokensSfA;
+gemmData.mInputBuffers.mPtrPerTokenSfB = transposeMmaOutput ? perTokensSfA : perTokensSfB;
 gemmData.mInputBuffers.mPtrBias = ptrBias;
```
Contributor

⚠️ Potential issue | 🟠 Major

Plumb ptrPermutedIdxToBiasRowIdx through the runner path.

The new BiasType::Mn row remap never reaches gemmData.mInputBuffers, so this path always falls back to biasRow = permutedRow. BatchedGemmInterface::InputBuffers and KernelParamsSetup::setKernelParams(...) now support the pointer, but TrtllmGenBatchedGemmRunner::run(...) still has no way to populate it.
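The fallback the comment describes can be illustrated with a small sketch. Names here are simplified stand-ins for the kernel-side lookup, not the actual implementation:

```python
from typing import Optional

def bias_row(permuted_row: int, remap: Optional[list[int]]) -> int:
    """Model of the BiasType::Mn lookup: when the remap table is never
    plumbed through (None), the kernel silently uses the identity mapping."""
    if remap is None:
        return permuted_row  # fallback the review warns about
    return remap[permuted_row]

table = [2, 0, 1]  # hypothetical ptrPermutedIdxToBiasRowIdx contents
print(bias_row(1, table))  # remapped row
print(bias_row(1, None))   # fallback: biasRow == permutedRow
```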

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp`
around lines 367 - 380, The new ptrPermutedIdxToBiasRowIdx pointer for
BiasType::Mn is not being forwarded into gemmData.mInputBuffers in
TrtllmGenBatchedGemmRunner::run(...), so the bias row remap never gets used;
update TrtllmGenBatchedGemmRunner::run to accept or obtain the
ptrPermutedIdxToBiasRowIdx value and assign it into gemmData.mInputBuffers
(i.e., set gemmData.mInputBuffers.mPtrPermutedIdxToBiasRowIdx =
ptrPermutedIdxToBiasRowIdx) before calling
KernelParamsSetup::setKernelParams(...), and ensure the same pointer is passed
through any intermediate calls so BatchedGemmInterface::InputBuffers receives
it.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:1a66a28af82abf6caa7e88701b3eeedceaabe952043e6a75ef1ca842cdd49e75
size 662339

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

TARGET='cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp'

echo "== Pointer header check =="
sed -n '1,5p' "$TARGET"

echo
echo "== CI/workflow LFS hydration checks =="
rg -n -C2 --iglob '*.yml' --iglob '*.yaml' 'git lfs (install|fetch|pull|checkout)'

echo
echo "== Build/static-analysis scripts mentioning cubins or this file type =="
rg -n -C2 '(_cubin\.cpp|trtllmGen_bmm_export/cubins|git lfs)'

Repository: NVIDIA/TensorRT-LLM

Length of output: 261


🏁 Script executed:

# Check if .gitattributes exists and configures LFS
if [ -f .gitattributes ]; then
  echo "=== .gitattributes LFS configuration ==="
  cat .gitattributes | grep -E '(cubin|\.cpp|lfs)' | head -20
else
  echo ".gitattributes not found"
fi

echo
echo "=== Workflow files in repository ==="
find . -type f \( -name '*.yml' -o -name '*.yaml' \) -path '*/.github/workflows/*' 2>/dev/null | head -20

echo
echo "=== LFS configuration in any workflow ==="
find . -type f \( -name '*.yml' -o -name '*.yaml' \) 2>/dev/null -exec grep -l 'lfs' {} \; 2>/dev/null | head -10

echo
echo "=== Build scripts or CMakeLists mentioning cubin or LFS ==="
rg -l '(cubin|lfs|git.*checkout)' --type cmake --type sh 2>/dev/null | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 1999


🏁 Script executed:

echo "=== blossom-ci.yml (LFS workflow) ==="
head -100 ./.github/workflows/blossom-ci.yml

echo
echo "=== pr-check.yml (likely runs C++ checks) ==="
head -150 ./.github/workflows/pr-check.yml

echo
echo "=== precommit-check.yml (likely runs linting) ==="
head -150 ./.github/workflows/precommit-check.yml

Repository: NVIDIA/TensorRT-LLM

Length of output: 7615


🏁 Script executed:

echo "=== scripts/release_check.py (what does it check?) ==="
head -200 scripts/release_check.py

echo
echo "=== Full blossom-ci.yml (look for git lfs or checkout) ==="
grep -n -A5 -B5 'git lfs\|checkout\|cubin' ./.github/workflows/blossom-ci.yml | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 7737


🏁 Script executed:

echo "=== .pre-commit-config.yaml (which hooks run on .cpp?) ==="
cat .pre-commit-config.yaml | head -200

Repository: NVIDIA/TensorRT-LLM

Length of output: 11191


🏁 Script executed:

echo "=== Continuing .pre-commit-config.yaml (actual hooks) ==="
tail -300 .pre-commit-config.yaml

Repository: NVIDIA/TensorRT-LLM

Length of output: 14794


Add lfs: 'true' to the checkout action in precommit-check.yml for consistency and robustness.

This file (*cubin.cpp) is already excluded from pre-commit hooks in .pre-commit-config.yaml, so it won't be processed by clang-format or other analysis tools. However, precommit-check.yml uses actions/checkout@v6 without the lfs: true parameter, while blossom-ci.yml correctly includes it. For consistency and to safeguard against future changes to the exclusion rules, add lfs: 'true' to the checkout step in precommit-check.yml.
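A sketch of the suggested change to the checkout step (step name and layout are illustrative; only the lfs input matters):

```yaml
steps:
  - name: Checkout repository
    uses: actions/checkout@v6
    with:
      lfs: 'true'  # hydrate Git LFS objects, matching blossom-ci.yml
```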

🧰 Tools
🪛 Clang (14.0.6)

[error] 1-1: unknown type name 'version'

(clang-diagnostic-error)


[error] 1-1: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: unknown type name 'oid'; did you mean 'void'?

(clang-diagnostic-error)


[error] 2-2: variable has incomplete type 'void'

(clang-diagnostic-error)


[error] 2-2: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: expected unqualified-id

(clang-diagnostic-error)

🪛 Cppcheck (2.20.0)

[error] 3-3: syntax error

(syntaxError)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x128x256_s6_et128x128_m256x128x64_c2x1x1_rN_TN_schPd2x1x2x3_biasFp32N_bM_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, precommit-check.yml's checkout step uses actions/checkout@v6
without LFS enabled, which is inconsistent with blossom-ci.yml and could break
future checks for large files like the listed cubin artifacts; update the
checkout step in precommit-check.yml (the actions/checkout@v6 step) to include
lfs: 'true' so git-lfs files are fetched during the precommit workflow, matching
blossom-ci.yml and ensuring excluded large files remain resolvable if exclusion
rules change.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:97caf696273574e75d797260df22a9ce4ec3f837e1e48b90040ffa499d4753db
size 453560

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Find .cpp files that are actually Git LFS pointers.
echo "== LFS pointer files with .cpp extension =="
fd -e cpp | while read -r f; do
  first_line="$(sed -n '1p' "$f" || true)"
  if [ "$first_line" = "version https://git-lfs.github.com/spec/v1" ]; then
    echo "$f"
  fi
done

# 2) Check whether this specific file is referenced by build/config scripts.
echo
echo "== References to this cubin path in build/config files =="
rg -n "Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin\\.cpp|trtllmGen_bmm_export/cubins"

# 3) Show header lines for this file to confirm it's an LFS pointer (not a C++ source body/header banner).
echo
echo "== First 5 lines of the file under review =="
sed -n '1,5p' "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp"

Repository: NVIDIA/TensorRT-LLM

Length of output: 50377


Exclude Git LFS pointer stubs from C++ translation units

  • cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp contains only Git LFS metadata (version/oid/size), causing parser and build failures; fetch LFS objects before compilation or adjust build/static‐analysis rules to skip these stubs.
  • Add the required NVIDIA copyright header to this .cpp file per coding guidelines.
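Build or static-analysis tooling can skip such stubs by recognizing the Git LFS pointer header on the first line; a minimal sketch:

```python
LFS_HEADER = "version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(text: str) -> bool:
    """True when file content is a Git LFS pointer stub, not real source."""
    lines = text.splitlines()
    return bool(lines) and lines[0] == LFS_HEADER

stub = "version https://git-lfs.github.com/spec/v1\noid sha256:abc\nsize 453560\n"
src = "// Copyright (c) NVIDIA CORPORATION.\nunsigned char data[] = {0};\n"
print(is_lfs_pointer(stub), is_lfs_pointer(src))  # True False
```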

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, The translation unit
Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x32x256_s9_et128x32_m128x32x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp
currently contains only a Git LFS pointer stub which breaks parsing and builds;
replace the stub with the actual LFS-tracked binary cubin content (pull the LFS
object into the repo) or modify the build/static-analysis rules to skip/ignore
LFS pointer files for C++ translation units, and ensure the final .cpp file
contains the required NVIDIA copyright header at the top per project guidelines.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:28198e842017dfd64847faa770eefca936de60a8c09e7ae70e00fe313d3b005b
size 423264

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

TARGET='cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp'

echo "== Pointer file header =="
sed -n '1,3p' "$TARGET" || true

echo
echo "== .gitattributes entries tracking this path (expect LFS rule) =="
attrs=$(fd -HI '^\.gitattributes$' || true)
if [ -n "${attrs:-}" ]; then
  rg -n --fixed-strings 'trtllmGen_bmm_export/cubins/' $attrs || true
  rg -n --fixed-strings 'Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp' $attrs || true
else
  echo "No .gitattributes found"
fi

echo
echo "== Build/lint references to this file (expect not compiled as raw pointer text) =="
rg -n -C2 --fixed-strings 'Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp' || true

echo
echo "== CI/workflow LFS setup (expect explicit LFS fetch/pull before compile/lint) =="
rg -n -C2 -P 'git\s+lfs\s+(pull|fetch|checkout)|lfs:\s*true|actions/checkout@' || true

Repository: NVIDIA/TensorRT-LLM

Length of output: 4074


Add Git LFS tracking rule to .gitattributes and include NVIDIA copyright header

This .cpp file contains Git LFS pointer metadata but lacks a corresponding tracking rule in .gitattributes. Additionally, the file is missing the required NVIDIA copyright header.

To fix:

  1. Add a rule in .gitattributes to explicitly track this file (or the trtllmGen_bmm_export/cubins/ directory) as LFS-managed binary
  2. Include the NVIDIA copyright header comment at the top of the file with the year of latest modification

Without the .gitattributes rule, build/lint tools may attempt to process the pointer metadata as source code, causing failures.
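An illustrative .gitattributes rule covering the generated cubin stubs (the glob pattern is an assumption about the repo layout, not a verified path):

```
cpp/tensorrt_llm/kernels/trtllmGenKernels/**/cubins/*_cubin.cpp filter=lfs diff=lfs merge=lfs -text
```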


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512_s5_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, The file begins with Git LFS pointer metadata (line starting
with "version https://git-lfs.github.com/spec/v1") but there is no
.gitattributes rule to mark this binary as LFS-managed, and the file is missing
the required NVIDIA copyright header; add an entry to .gitattributes to track
the cubins (e.g., a pattern matching these generated .cpp/cubin files) so the
pointer metadata is not treated as source, and prepend the canonical NVIDIA
copyright header comment to the top of this file with the correct latest
modification year.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:128c3261b254cdb66d2b786263044ac98147683feb60e526e0e58c5c863b0e38
size 436478

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Detect .cpp cubin files that are Git LFS pointers =="
fd --full-path '_cubin\.cpp$' | while read -r f; do
  if sed -n '1p' "$f" | grep -qx 'version https://git-lfs.github.com/spec/v1'; then
    echo "POINTER_CPP: $f"
    sed -n '1,3p' "$f"
  fi
done

echo
echo "== Check whether repo tooling/build scripts broadly pick up .cpp files =="
rg -n --hidden --glob '!**/.git/**' '(_cubin\.cpp|\*\.cpp|clang-tidy|cppcheck|target_sources|add_library|add_executable)'

Repository: NVIDIA/TensorRT-LLM

Length of output: 50377


🏁 Script executed:

# Check if these cubin files are referenced in build scripts
rg -l "cubin\.cpp|add_library|target_sources" --type cmake | head -20

# Check .gitattributes for LFS configuration
fd -e gitattributes | xargs cat 2>/dev/null | grep -A2 -B2 "cubin\|\.cpp"

# Check if there's a .clang-tidy or similar configuration that excludes these
fd -e clang-tidy -o -e clang-format | xargs cat 2>/dev/null

Repository: NVIDIA/TensorRT-LLM

Length of output: 1235


🏁 Script executed:

# Check .gitattributes
find . -name ".gitattributes" -type f | xargs cat 2>/dev/null | head -50

# Check if cubin files are referenced in CMakeLists
rg "_cubin\.cpp" cpp/tensorrt_llm/kernels/ | head -20

# Check for any pre-commit or CI configuration
find . -name ".pre-commit-config.yaml" -o -name ".clang-tidy" -o -name ".clang-format" | xargs cat 2>/dev/null | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 4449


🏁 Script executed:

# Check how cubin files are referenced in the main kernels CMakeLists.txt
cat cpp/tensorrt_llm/kernels/CMakeLists.txt | grep -A 5 -B 5 "cubin\|*.cpp" | head -100

# Check if there's any exclusion pattern for cubin files in build scripts
rg "cubin|exclude|EXCLUDE" cpp/tensorrt_llm/kernels/CMakeLists.txt -A 3

# Check the trtllmGenKernels CMakeLists specifically
fd -path "*trtllmGenKernels/CMakeLists.txt" -exec cat {} \; | head -150

Repository: NVIDIA/TensorRT-LLM

Length of output: 1916


🏁 Script executed:

# Confirm the trtllmGenKernels exclusion applies to the cubin files
rg "trtllmGenKernels" cpp/tensorrt_llm/kernels/CMakeLists.txt -B 2 -A 2

# Check if there are any subdirectory declarations for trtllmGenKernels
rg "add_subdirectory.*trtllmGenKernels" cpp/tensorrt_llm/kernels/

# List cubin file locations
fd "_cubin\.cpp$" cpp/tensorrt_llm/kernels/trtllmGenKernels/ | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 2036


🏁 Script executed:

# Check the trtllmGenKernels subdirectory CMakeLists.txt
find cpp/tensorrt_llm/kernels/trtllmGenKernels -name "CMakeLists.txt" -exec cat {} \;

Repository: NVIDIA/TensorRT-LLM

Length of output: 10372


Do not store Git LFS pointers in .cpp file paths; these will fail to compile when not checked out.

The cubin.cpp files in cpp/tensorrt_llm/kernels/trtllmGenKernels/ subdirectories (batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe) are configured in .gitattributes as Git LFS pointers but are picked up directly by subdirectory CMakeLists files via file(GLOB_RECURSE SRC_CPP *.cpp) and added as compilation sources. When LFS files are not materialized (common in CI without LFS configuration), the build will attempt to compile text containing LFS metadata as C++, causing syntax errors.

Additionally, these files violate the coding guideline requiring NVIDIA copyright headers on all .cpp files, since LFS pointers cannot contain such headers.

Either:

  1. Exclude cubin.cpp files from the glob in each subdirectory CMakeLists.txt, or
  2. Store these binary artifacts with a non-source-code extension and generate/include wrapper .cpp files if compilation is needed
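Option 1 could look like this in each affected CMakeLists.txt (a sketch, assuming SRC_CPP is populated via GLOB_RECURSE as described):

```cmake
file(GLOB_RECURSE SRC_CPP *.cpp)
# Drop Git LFS-managed cubin stubs so they are never compiled as C++.
list(FILTER SRC_CPP EXCLUDE REGEX "_cubin\\.cpp$")
```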

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E2m1E2m1_Fp32_Ab16_Bb16_t128x8x512u2_s4_et128x8_m128x8x64_c1x1x1_rM_TN_transOut_schedS_biasFp32M_bN_rgTma_clmp_sm100f_cubin.cpp`
around lines 1 - 3, The build is pulling Git LFS pointer files named like
Bmm_..._cubin.cpp into compilation via the file(GLOB_RECURSE SRC_CPP *.cpp) in
the subdirectory CMakeLists, which causes CI to try compiling LFS metadata and
also prevents adding required NVIDIA headers; fix by excluding those cubin.cpp
files from the source glob or moving them off the .cpp extension and adding
lightweight wrapper .cpp stubs. Concretely, update each affected CMakeLists (the
ones that define SRC_CPP via file(GLOB_RECURSE SRC_CPP *.cpp) in
trtllmGenKernels subdirs batchedGemm, gemm, gemmGatedAct, fmha, blockScaleMoe)
to filter out files matching "*_cubin.cpp" or "cubin.cpp" from SRC_CPP, or
rename/store binary cubin blobs with a non-.cpp extension and add minimal
wrapper translation units (e.g., loader functions) that include proper NVIDIA
copyright headers so real .cpp files compiled by CMake are valid source.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:b2d5b6ff01e5f2d80833302e5c38ed8a663baa695bdae838c4266066229ceaed
size 413176

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if file is truly new or just shows LFS pointer
git ls-files "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/"

# Find other .cpp files in the cubins directory to compare
fd -t f "\.cpp$" "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/" | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 50376


🏁 Script executed:

# Check .gitattributes to see if cubins are configured for LFS
find . -name ".gitattributes" -type f -exec grep -l "cubins\|\.cubin\|LFS" {} \;

Repository: NVIDIA/TensorRT-LLM

Length of output: 45


🏁 Script executed:

# Look at CODING_GUIDELINES.md for any LFS-specific guidance
find . -name "CODING_GUIDELINES.md" -type f | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 86


🏁 Script executed:

# Check if there are other similar .cpp files in the same directory
ls -la "cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 3631


🏁 Script executed:

cat CODING_GUIDELINES.md | grep -i -A 5 -B 5 "lfs\|binary\|cubin\|generated"

Repository: NVIDIA/TensorRT-LLM

Length of output: 1097


🏁 Script executed:

# Also check for any exemptions or exceptions in the guidelines
cat CODING_GUIDELINES.md | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 4962


🏁 Script executed:

# Continue reading CODING_GUIDELINES.md to find any LFS or exemption guidance
cat CODING_GUIDELINES.md | grep -i "lfs\|binary\|generated\|exemption\|exclude\|copyright" | head -30

Repository: NVIDIA/TensorRT-LLM

Length of output: 763


🏁 Script executed:

# Look for section about file headers and copyright
cat CODING_GUIDELINES.md | grep -i "copyright\|header\|license" -A 3 -B 1 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 2181


LFS pointer files in the cubins directory require explicit exemption documentation in CODING_GUIDELINES.md.

The coding guidelines state that "all source files which are compiled or interpreted" must contain NVIDIA copyright headers. However, the guidelines don't explicitly address Git LFS pointer files (like this 131-byte metadata stub) or pre-compiled binary artifacts stored in the cubins directory. To avoid future ambiguity, either:

  1. Add explicit exemption language to CODING_GUIDELINES.md for LFS pointers and binary artifacts, or
  2. Document that the copyright header requirement applies only to human-authored source code, not generated/compiled binaries or their LFS references
🧰 Tools
🪛 Clang (14.0.6)

[error] 1-1: unknown type name 'version'

(clang-diagnostic-error)


[error] 1-1: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: unknown type name 'oid'; did you mean 'void'?

(clang-diagnostic-error)


[error] 2-2: variable has incomplete type 'void'

(clang-diagnostic-error)


[error] 2-2: expected ';' after top level declarator

(clang-diagnostic-error)


[error] 2-2: unknown type name 'b2d5b6ff01e5f2d80833302e5c38ed8a663baa695bdae838c4266066229ceaed'

(clang-diagnostic-error)


[error] 3-3: expected ';' after top level declarator

(clang-diagnostic-error)

🪛 Cppcheck (2.20.0)

[error] 3-3: syntax error

(syntaxError)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp`
around lines 1 - 3, Update CODING_GUIDELINES.md to explicitly exempt Git LFS
pointer files and pre-compiled binary artifacts in the cubins directory (e.g.,
the LFS pointer stub found in
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x256u2_s6_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schPd2x1x2x3_bN_rgTma_clmp_dynB_sm100f_cubin.cpp)
from the "NVIDIA copyright headers" requirement, or alternatively add a clear
rule that the header mandate applies only to human-authored source files and not
generated/compiled binaries or their LFS references; make the language
unambiguous and add a short example sentence and a note in the file explaining
how LFS pointer stubs are treated.

Comment on lines +1 to +3
version https://git-lfs.github.com/spec/v1
oid sha256:33b5743610881ecfc742746f42648f4e054534a9ca1adb57634e1988c88cb969
size 316110

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify this file is an LFS pointer in the checked-out tree.
sed -n '1,5p' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp

# Check if cubin .cpp artifacts are referenced by CMake build inputs.
rg -n -C2 --iglob 'CMakeLists.txt' --iglob '*.cmake' '_cubin\.cpp|trtllmGen_bmm_export/cubins'

# Check if static-analysis configs explicitly exclude this folder/pattern.
rg -n -C2 --iglob '.clang-tidy' --iglob '.clangd' --iglob '*.yml' --iglob '*.yaml' 'cubins|trtllmGen_bmm_export|_cubin\.cpp'

Repository: NVIDIA/TensorRT-LLM

Length of output: 194


.cpp file contains Git LFS pointer and is not referenced in build system

Confirmed: Lines 1–3 show Git LFS metadata, not C++ source. Verification found no CMakeLists.txt, .cmake files, or static analysis configs that reference this cubin directory or _cubin.cpp files, indicating the path is implicitly excluded from compilation.

However, this implicit exclusion through directory structure (no build references) is not explicit. To prevent accidental inclusion in future build rules, consider either:

  • Adding an explicit exclusion pattern in CMakeLists.txt or build tooling for the cubins/ directory
  • Moving cubin artifacts to a non-.cpp extension (e.g., .cubin or .o) to avoid confusion with C++ source files

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp`
around lines 1 - 3, The file
trtllmGen_bmm_export/cubins/Bmm_Bfloat16_E4m3E4m3_Fp32_t128x8x512_s3_et128x8_m128x8x32_c1x1x1_rM_TN_transOut_schedS_bN_rgTma_clmp_sm100f_cubin.cpp
is a Git LFS pointer, not real C++ source, and the cubins/ folder is implicitly
excluded from the build; make the exclusion explicit by either adding an exclude
rule for the cubins/ directory in your top-level CMakeLists (or the relevant
cmake module that collects sources) so files under cubins/ are ignored, and/or
renaming these artifacts from *.cpp to a non-source extension (e.g., .cubin or
.o) so build system and static analysis won’t treat them as C++ sources; update
any source-gathering patterns (e.g., add_glob or file(GLOB_RECURSE) filters)
that might pick up *_cubin.cpp and document the change in the build module that
references the generated kernels.

Comment on lines +1281 to +1288
if (options.mMmaTileK == 0)
{
options.mMmaTileK = options.mTileK;
}
else
{
TLLM_CHECK_ERROR(options.mLayoutB != MatrixLayout::BlockMajorK,
"layoutB == MatrixLayout::BlockMajorK is not supported for now");
if (options.mMmaTileK != options.mTileK)
{

⚠️ Potential issue | 🟡 Minor

Keep updateOptions == false side-effect free.

checkAndUpdateGemmOptions(..., false) is used as a pure validator, but these blocks still fill in mMmaTileK and mPerTokenSfDtype. That makes an underspecified config look valid and hides whether the cubin metadata actually encoded those fields. Please guard these defaults with updateOptions (or reject when they are unset in validation mode).

Also applies to: 2175-2185
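The suggested guard, sketched in Python with a simplified options dict standing in for GemmOptions:

```python
def check_and_update_gemm_options(opts: dict, update_options: bool) -> bool:
    """Sketch of the suggested guard: defaults are filled in only when
    update_options is True; pure validation rejects unset fields instead."""
    if opts.get("mMmaTileK", 0) == 0:
        if not update_options:
            return False  # underspecified config must not look valid
        opts["mMmaTileK"] = opts["mTileK"]
    return True

opts = {"mTileK": 128}
print(check_and_update_gemm_options(dict(opts), update_options=False))  # rejected
filled = dict(opts)
print(check_and_update_gemm_options(filled, update_options=True), filled["mMmaTileK"])
```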

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/GemmOptions.h`
around lines 1281 - 1288, The validator checkAndUpdateGemmOptions currently
mutates options (setting options.mMmaTileK = options.mTileK and
options.mPerTokenSfDtype defaults) even when called with updateOptions == false;
change the logic so these assignments only occur when updateOptions is true, and
when updateOptions is false either return an error/false if those fields are
unset or explicitly reject the config; apply the same guard to the other
identical block later in the function (the second occurrence that compares
mMmaTileK vs mTileK and sets mPerTokenSfDtype) so validation remains side-effect
free.

Comment on lines +357 to +360
uint32_t* rowMaxBars, int32_t const* ptrNumNonExitingCtas = nullptr,
int32_t const* ptrTotalNumPaddedTokens = nullptr, int32_t const* ptrCtaIdxXyToBatchIdx = nullptr,
int32_t const* ptrCtaIdxXyToMnLimit = nullptr, int32_t const maxNumCtas = KernelParams::MaxNumCtas,
uint32_t* ptrDynamicTileCounter = nullptr)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect all call sites of KernelParamsSetup::setKernelParams and the maxNumCtas plumbing.
rg -n -C3 '\bsetKernelParams\s*\(' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm
rg -n -C3 '\bmaxNumCtas(InBatchDim)?\b|\bnumCgas\b|\bcgaOffset\b|\bclusterDimInBatchDim\b' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm

Repository: NVIDIA/TensorRT-LLM

Length of output: 39791


🏁 Script executed:

# Trace where maxNumCtasInBatchDim originates and how it's computed
# Look for high-level callers that prepare this value
rg -n 'maxNumCtasInBatchDim\s*=' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm
rg -n -B5 'run\s*\(' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/KernelRunner.cpp | grep -A5 'maxNumCtasInBatchDim'

Repository: NVIDIA/TensorRT-LLM

Length of output: 1015


🏁 Script executed:

# Look for where maxNumCtasInBatchDim is actually computed or passed from external code
# Find the public API entry points and trace back
rg -n 'maxNumCtasInBatchDim' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm --type cpp -A2 -B2 | head -100

# Also check if there's any documentation or comments explaining the semantics
rg -n 'CGA|cga|cluster.*group' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h

Repository: NVIDIA/TensorRT-LLM

Length of output: 13747


🏁 Script executed:

# Look for the comment about CTA granularity in KernelParams.h
cat -n cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | sed -n '380,390p'

# Check if there are any comments or documentation about the parameter semantics
rg -n 'maxNumCtas|MaxNumCtas' cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | head -20

# Look at the parameter definition and surrounding documentation
cat -n cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h | sed -n '340,365p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 2528


Parameter maxNumCtas uses CGA semantics but retains CTA naming, creating a misleading contract.

The code at line 386 explicitly acknowledges this: "The CTA names are kept for compatibility even though they are indexed at CGA granularity." The calculation at line 477 (paddedTokensInBatchDim = cgaOffset * tile * clusterDimInBatchDim) is correct only if callers pass CGA count, not CTA count. However, the parameter name and signature still advertise "CTA" semantics, risking misuse by callers unfamiliar with the compatibility note.

Rename the parameter or add clear inline documentation stating that it expects a CGA count (e.g., rename it to maxNumCgas, or add a /*maxNumCgas=*/ inline comment at call sites), or update KernelParams::MaxNumCtas to reflect CGA semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelParams.h`
around lines 357 - 360, The parameter name maxNumCtas in the KernelParams.h
function signature is misleading because the implementation (see
paddedTokensInBatchDim calculation using cgaOffset and clusterDimInBatchDim)
expects a CGA count; change the API to make this explicit by renaming the
parameter to maxNumCgas (and update all call sites) or add an inline comment in
the signature like /* maxNumCgas = CGA-granularity count */ and/or update
KernelParams::MaxNumCtas to reflect CGA semantics so callers pass CGA counts;
ensure references to maxNumCtas and KernelParams::MaxNumCtas and the
paddedTokensInBatchDim / cgaOffset usage are updated consistently.

Comment on lines +524 to 526
auto const numTmemColsA = useTmemA ? numStagesA * tileK
/ (numSlicesForSliceK * tg::dtypeGetNumBits(tg::Dtype::UInt32) / tg::dtypeGetNumBits(dtypeMmaA))
: 0;
Contributor

⚠️ Potential issue | 🟠 Major

mNumStagesTmemA is still ignored here.

tmemA sizing is derived from numStagesA, and KernelTraits still has no separate TMEM-A stage parameter. Any config with mNumStagesTmemA != mNumStagesA will be modeled with the wrong TMEM footprint and effectively loses the new knob. Please thread mNumStagesTmemA into KernelTraits and use it for tmemA.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/batchedGemm/trtllmGen_bmm_export/KernelTraits.h`
around lines 524 - 526, KernelTraits currently ignores the separate TMEM-A stage
count; thread the mNumStagesTmemA member through KernelTraits (add the member,
ensure constructors/builders/factory code initialize it from the configuration)
and use mNumStagesTmemA instead of numStagesA when computing numTmemColsA (the
expression that uses numStagesA * tileK / (numSlicesForSliceK * ... /
tg::dtypeGetNumBits(dtypeMmaA))). Also ensure useTmemA logic and any getters
that expose stage counts reflect mNumStagesTmemA so configs with mNumStagesTmemA
!= mNumStagesA produce the correct TMEM-A footprint.

@nekorobov
Collaborator

The PR description covers the changes in TRT-LLM Gen, but many of those features (Slice-K, MN Bias, etc.) are not added in this MR. Could you fix it, please?

@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch from 58e0fd8 to 8266478 Compare April 29, 2026 09:59
@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46202 [ run ] triggered by Bot. Commit: 8266478 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46202 [ run ] completed with state SUCCESS. Commit: 8266478
/LLM/main/L0_MergeRequest_PR pipeline #36315 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46252 [ run ] triggered by Bot. Commit: 8266478 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46252 [ run ] completed with state SUCCESS. Commit: 8266478
/LLM/main/L0_MergeRequest_PR pipeline #36358 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
experimental: TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION=random or balanced to alter the expert distribution during autotuning; affects kernel selection

Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
Signed-off-by: Anthony Chang <27950904+rosenrodt@users.noreply.github.com>
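The experimental knob named in the commit message above can be exercised like this. The variable name and its two values come from the commit; the invocation shape around it is an assumption, so only the export itself is shown:

```shell
# Experimental knob: "random" or "balanced" alter the dummy expert
# distribution the autotuner sees, which in turn affects kernel selection.
export TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION=random
echo "$TRTLLM_GEN_MOE_AUTOTUNE_DUMMY_DISTRIBUTION"
```

Set it in the environment of whatever process runs the autotuner; unset, the default distribution is used.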
@rosenrodt rosenrodt force-pushed the trtllm-moe-tput-kernels branch from 8266478 to f345e85 Compare April 30, 2026 08:16
@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46350 [ run ] triggered by Bot. Commit: f345e85 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46350 [ run ] completed with state SUCCESS. Commit: f345e85
/LLM/main/L0_MergeRequest_PR pipeline #36438 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46557 [ run ] triggered by Bot. Commit: f345e85 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46557 [ run ] completed with state SUCCESS. Commit: f345e85
/LLM/main/L0_MergeRequest_PR pipeline #36612 completed with status: 'SUCCESS'

CI Report

Link to invocation

3 participants