
[None][feat] Trtllm-gen FMHA JIT support#12612

Merged
pengbowang-nv merged 7 commits into NVIDIA:main from yunruis:user/yunruis/add_fmha_interface_rebased_cubin_2_perf
Apr 7, 2026

Conversation

@yunruis
Contributor

@yunruis yunruis commented Mar 31, 2026

Description

This PR integrates the trtllm-gen FMHA JIT (NVRTC) compilation path into TensorRT-LLM alongside the existing cubin (pre-compiled) path, enabling runtime kernel generation for FMHA configurations not covered by pre-compiled cubins.

Key Changes

1. Dual kernel dispatch: cubin + NVRTC JIT

The TllmGenFmhaKernel class now supports two dispatch paths:

  • Cubin path (default): loads pre-compiled kernels from embedded cubin data, used for generation-phase MLA, E2M1 KV cache, and head_dim=64 kernels.
  • NVRTC path (new): uses FmhaInterface from trtllm-gen to generate and compile kernels at runtime via NVRTC for configurations where cubins are not available (e.g., SwapsMmaAbForGeneration with non-MLA, non-E2M1, head_dim != 64).

2. Unified kernel selection via FmhaAutoTuner — consistent with trtllm-gen

This is the most significant architectural change. Previously, TensorRT-LLM maintained its own hand-written heuristic kernel selection logic (selectGqGenerationKernel, selectMlaGenerationKernel, selectTileSizeQForGqaGeneration, etc.), which diverged from trtllm-gen's selection logic and was a frequent source of bugs: kernel hash mismatches, "TRTLLM-GEN kernels not found" errors, and silent fallbacks to unfused MHA.

This PR replaces all of that with trtllm-gen's FmhaAutoTuner, which automatically determines the optimal tile sizes, kernel types (Swaps/Keeps MMA AB), CTA configurations, and multi-CTA KV modes. This removes ~300 lines of manual selection code and ensures that TensorRT-LLM and trtllm-gen always agree on which kernel to select, eliminating the class of bugs caused by selection logic divergence between the two repos.

3. Simplified TllmGenFmhaRunner constructor

Removed VisualGen-specific parameters (maxNumHeadsQPerKvInCta, sageAttnBlk*, dataTypeQkReinterpret) from the runner/kernel constructor and hash key. These are now handled internally by the auto-tuner and kernel options system, reducing the API surface from 10 parameters to 3 (dtypeQ, dtypeKv, dtypeOut).

4. trtllm-gen export headers, static libraries, and cubin artifacts

  • Added 36 export headers from trtllm-gen under trtllmGen_fmha_export/ providing FmhaInterface, FmhaAutoTuner, FmhaOptions, KernelParams, and device-side runtime code.
  • Added prebuilt static libraries (libTrtLlmGenFmhaLib.a, libTrtLlmGen.a) for both x86_64 and aarch64 architectures.
  • Added cuda_ptx/cuda_ptx.h containing PTX intrinsic wrappers needed by NVRTC-generated kernels.
  • Regenerated ~3000 cubin files with updated kernel configurations and refreshed kernelMetaInfo.h metadata.
  • Extended pre-commit global exclude patterns to cover trtllmGen_fmha_export/ and cuda_ptx/ to prevent formatting tools from modifying external/generated code.

@yunruis yunruis requested review from a team as code owners March 31, 2026 06:40
@yunruis yunruis requested review from mzweilz and niukuo March 31, 2026 06:40
@yunruis
Contributor Author

yunruis commented Mar 31, 2026

/bot run --disable-fail-fast --stage-list "DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-1,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-2,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-3,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-2,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-3,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-5,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-6,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-5,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-6,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-
CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-3,GB200-16_GPUs-4_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE2-GPU8-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-3,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-4"

@tensorrt-cicd
Collaborator

PR_Github #40886 [ run ] triggered by Bot. Commit: 5e4f57b Link to invocation

@yunruis yunruis changed the title from "trtllm-gen attention JIT support" to "[None][feat] Trtllm-gen FMHA JIT support" on Mar 31, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Pre-commit configuration adds exclusion rules for FMHA kernel output directories. FmhaDispatcher populates additional runner parameters including processor count and layout-dependent settings. CMake now links prebuilt kernel archives. Hundreds of Git LFS pointers updated for cubin artifacts. Added floating-point header include.

Changes

Cohort / File(s) Summary
Pre-commit configuration
.pre-commit-config.yaml
Added per-hook exclude patterns for FMHA kernel output directories (trtllmGen_fmha_export/.*, cuda_ptx/.*, .*cubin\.(cpp|h)$) across multiple linting tools (isort, ruff, yapf, clang-format, cmake-format, codespell, autoflake, remove-crlf, end-of-file-fixer, trailing-whitespace).
FMHA dispatcher updates
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
Populates additional TllmGenFmhaRunnerParams fields: mNumHeadsQ, mNumHeadsKv, mHeadDimQkNope, fixed sizing fields (mBatchSize, mMaxSeqLen*, mSumOfSeqLens*, mMaxNumPagesPerSeqKv), mMultiProcessorCount. Conditionally sets mNumTokensPerPage based on qkvLayout instead of always using fixed params.
CMake build configuration
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/CMakeLists.txt
Creates build-time symlinks and copies kernel headers. Extends interface include directories with trtllmGen_fmha_export. Adds compile definitions including TRTLLM_FMHA_BUILD_DIR, TLLM_PUBLIC_RELEASE, TLLM_GEN_EXPORT_INTERFACE, TLLM_FMHA_TRTLLM_COMPAT. Links object library against prebuilt static archives (libTrtLlmGenFmhaLib.a, libTrtLlmGen.a) with fatal error checking.
Kernel include
cpp/tensorrt_llm/kernels/indexerTopK.cu
Added #include <cfloat> for floating-point constants.
FMHA cubin artifacts
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/*_cubin.cpp (600+ files)
Updated Git LFS pointers for compiled CUDA kernels across multiple kernel variants (Sm100a, Sm100f with various configurations: head dimensions, qkv layouts, sequence handling modes). Each file's oid (SHA-256) and size metadata changed, indicating binary payloads were regenerated.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

The change comprises predominantly homogeneous updates (600+ Git LFS pointer changes with identical patterns), which require minimal individual review. The non-repetitive portions (CMakeLists.txt build configuration, FmhaDispatcher parameter additions, pre-commit exclusions) are localized and straightforward, so the large file count reflects pattern repetition rather than diverse logic complexity.

Suggested reviewers

  • niukuo
  • PerkzZheng
  • yuxianq
  • Wanli-Jiang
  • byshiue

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update copyright year in modified file.

This file is modified but still shows 2020-2024; it should include the latest modification year.

🛠️ Proposed fix
- * Copyright (c) 2020-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp` at line 2, Update the file
header in cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp to reflect the latest
modification year by changing the copyright range from "2020-2024" to include
2026 (e.g., "2020-2026"); ensure the header format exactly matches other source
files' NVIDIA copyright header so the symbol fmhaDispatcher.cpp contains the
updated year.
🧹 Nitpick comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

1-3: Treat *_cubin.cpp LFS pointers as artifacts in static-analysis jobs.

These lines are Git LFS pointer metadata, so parsing them as C++ creates false hard errors (oid, size syntax failures). Please make CI either hydrate LFS before C++ analysis or exclude these artifact-pointer files from Clang/Cppcheck scans.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp`
around lines 1 - 3, The CI is treating Git LFS pointer files like real C++ and
failing; update the pipeline to either hydrate LFS before static analysis or
exclude these artifacts by glob. Specifically, in the static-analysis steps that
run Clang/Cppcheck/tidy, add a pre-step to run "git lfs pull" (or equivalent) so
files such as
FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp
are real C++ blobs, or add an exclusion glob for "*_cubin.cpp" (and any other
LFS pointer patterns) in the Clang/Cppcheck invocation so these LFS pointer
files are skipped. Ensure the change touches the CI job(s) that run
Clang/Cppcheck/tidy.
.pre-commit-config.yaml (1)

1391-1517: Consider deduplicating the repeated FMHA exclude regex via a YAML anchor.

The same block is repeated across many hooks, which makes it easy for the patterns to drift apart over time.

♻️ Suggested DRY refactor
+fmha-generated-exclude: &fmha_generated_exclude |
+    (?x)^(
+        cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/trtllmGen_fmha_export/.* |
+        cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cuda_ptx/.*
+    )$
@@
-        exclude: |
-            (?x)^(
-                cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/trtllmGen_fmha_export/.* |
-                cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cuda_ptx/.*
-            )$
+        exclude: *fmha_generated_exclude
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.pre-commit-config.yaml around lines 1391 - 1517, Define a single YAML
anchor for the repeated FMHA exclude regex and replace the repeated multi-line
exclude values with a reference to that anchor; locate the top-level repeated
block (the long exclude: | (?x)^( ... )$ pattern) and extract it into a named
anchor (e.g. &fmha_exclude) and then use the alias (*fmha_exclude) in each
hook's exclude field (examples to update include hooks with id: remove-crlf,
yapf, end-of-file-fixer, trailing-whitespace, clang-format, cmake-format,
codespell, ruff, ruff-format, autoflake, etc.); ensure whitespace/indentation
matches YAML nesting so pre-commit still parses the exclude entries correctly.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H128PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

1-3: Exclude Git LFS pointer stubs from C++ static-analysis passes.

This file is a valid Git LFS pointer update (oid/size), but clang/cppcheck will misparse it as C++ when LFS objects are not materialized. Please ensure these paths are excluded (or LFS pull is guaranteed) in static-analysis jobs to prevent false failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H128PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ16Kv128StaticSwapsAbForGen_cubin.cpp`
around lines 1 - 3, The CI/static-analysis is picking up Git LFS pointer stubs
(files starting with the line "version https://git-lfs.github.com/spec/v1" and
containing "oid sha256:"/ "size") as C++; update the
static-analysis/clang/cppcheck job configuration to exclude such files (or
ensure LFS objects are pulled) by adding a rule to skip files matching that
header or the specific cubin filename pattern (e.g.,
FmhaSm100fKernel_..._cubin.cpp) so the analyzer ignores pointer stubs; ensure
the exclusion references the pointer header string ("version
https://git-lfs.github.com/spec/v1") or the oid/size pattern to reliably detect
LFS pointer files.
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (1)

117-123: Extract repeated non-zero literals to a named constant.

Using raw 1 repeatedly in assignments violates the literal-usage rule and makes intent less clear.

♻️ Proposed refactor
+        int32_t const kKERNEL_SELECTION_PROBE_VALUE = 1;
-        tllmRunnerParams.mBatchSize = 1;
-        tllmRunnerParams.mMaxSeqLenQ = 1;
-        tllmRunnerParams.mMaxSeqLenKv = 1;
-        tllmRunnerParams.mMaxSeqLenCacheKv = 1;
-        tllmRunnerParams.mSumOfSeqLensQ = 1;
-        tllmRunnerParams.mSumOfSeqLensKv = 1;
-        tllmRunnerParams.mMaxNumPagesPerSeqKv = 1;
+        tllmRunnerParams.mBatchSize = kKERNEL_SELECTION_PROBE_VALUE;
+        tllmRunnerParams.mMaxSeqLenQ = kKERNEL_SELECTION_PROBE_VALUE;
+        tllmRunnerParams.mMaxSeqLenKv = kKERNEL_SELECTION_PROBE_VALUE;
+        tllmRunnerParams.mMaxSeqLenCacheKv = kKERNEL_SELECTION_PROBE_VALUE;
+        tllmRunnerParams.mSumOfSeqLensQ = kKERNEL_SELECTION_PROBE_VALUE;
+        tllmRunnerParams.mSumOfSeqLensKv = kKERNEL_SELECTION_PROBE_VALUE;
+        tllmRunnerParams.mMaxNumPagesPerSeqKv = kKERNEL_SELECTION_PROBE_VALUE;

As per coding guidelines, "Except 0, nullptr, true, false, all other literals in C++ should only be used for variable initialization; extract other literal usages to named constants."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp` around lines 117 - 123, Replace
the repeated magic literal 1 used to initialize fields on tllmRunnerParams with
a named constant: declare a constexpr (e.g., kDefaultOne = 1) and use it for
mBatchSize, mMaxSeqLenQ, mMaxSeqLenKv, mMaxSeqLenCacheKv, mSumOfSeqLensQ,
mSumOfSeqLensKv, and mMaxNumPagesPerSeqKv; update the assignments in
fmhaDispatcher.cpp so the intent is clear and the literal is not repeated
directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/CMakeLists.txt`:
- Around line 24-27: The computed CUDA_TARGETS_INCLUDE_DIR uses
CMAKE_SYSTEM_PROCESSOR directly which yields "aarch64" on Arm SBSA systems but
the CUDA toolkit places headers under "targets/sbsa-linux"; update the logic
around get_filename_component/CUDA_TARGETS_INCLUDE_DIR to map
CMAKE_SYSTEM_PROCESSOR "aarch64" to "sbsa" (or otherwise detect ARM SBSA) before
assembling the path, then build the include path as
"${CUDA_TOOLKIT_ROOT}/targets/${MAPPED_PROCESSOR}-linux/include" so the symbolic
link and AArch64 JIT path resolve correctly (refer to get_filename_component,
CUDA_BIN_PATH, CUDA_TOOLKIT_ROOT, CUDA_TARGETS_INCLUDE_DIR and
CMAKE_SYSTEM_PROCESSOR).
- Around line 39-42: Replace the file(COPY ...) usage for the FMHA exported
headers with configure_file so the generated CUDA/NVRTC build dir gets updated
whenever KernelParams.h or KernelParamsDecl.h change: locate the two file(COPY
${CMAKE_CURRENT_SOURCE_DIR}/trtllmGen_fmha_export/KernelParams.h ...) and
file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/trtllmGen_fmha_export/KernelParamsDecl.h
...) entries in CMakeLists.txt and change them to configure_file calls that copy
from ${CMAKE_CURRENT_SOURCE_DIR}/trtllmGen_fmha_export/KernelParams.h and
KernelParamsDecl.h to ${CMAKE_CURRENT_BINARY_DIR}/KernelParams.h and
KernelParamsDecl.h (using `@ONLY` if needed), ensuring the build system reruns and
NVRTC sees up-to-date headers referenced by TRTLLM_FMHA_BUILD_DIR.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100aKernel_QE4m3KvE2m1OE4m3H256PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp`:
- Around line 2-3: CMake currently globs all .cpp files with file(GLOB_RECURSE
SRC_CPP *.cpp) which pulls in Git LFS pointer .cpp files from the cubin
directory and causes build failures when LFS is not hydrated; update the
CMakeLists handling by excluding the cubin directory from the glob or by adding
an explicit exclusion filter for the cubin path when populating SRC_CPP (or
alternatively treat the cubin directory separately), ensuring the unique symbol
SRC_CPP and the existing file(GLOB_RECURSE ...) invocation are modified so
cubin/*.cpp files are not added to the compile list unless Git LFS has been
hydrated.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100aKernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp`:
- Around line 2-3: The CMake GLOB that populates SRC_CPP is including Git LFS
pointer files like the cubin.cpp artifact; update the CMakeLists handling around
the "file(GLOB_RECURSE SRC_CPP *.cpp)" and subsequent
"add_library(trtllm_gen_fmha OBJECT ${SRC_CPP} ${SRC_CU})" so those artifacts
are filtered out — e.g. after the glob, remove or filter entries matching the
LFS/binary patterns (like ".*cubin.cpp$" and ".*_cubin.h$") using list(FILTER
SRC_CPP EXCLUDE REGEX ...) or use a foreach loop to push_valid files into
SRC_CPP_CLEAN and use that in add_library; ensure the same exclusion is applied
to any SRC_CU or other globbed source lists so LFS pointer files never reach the
compiler.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqSkipsSoftmaxQ128Kv128StaticKeepsAbForGen_cubin.cpp`:
- Around line 1-3: The CI workflows pr-check.yml and precommit-check.yml are
missing LFS hydration which causes the CMake glob (file(GLOB_RECURSE SRC_CPP
*.cpp)) to pick up LFS pointer files; update the actions/checkout@v6 steps in
both workflows to include lfs: 'true' (matching blossom-ci.yml) so large-file
pointers are hydrated before static analysis runs and compilation.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqSkipsSoftmaxQ128Kv128StaticKeepsAbForGen_cubin.cpp`:
- Around line 1-3: The CI is attempting to compile Git LFS pointer files like
FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqSkipsSoftmaxQ128Kv128StaticKeepsAbForGen_cubin.cpp
before the LFS objects are materialized; update the pipeline to run a Git LFS
materialization step (e.g., git lfs pull or enabling git lfs smudge) as a
pre-build/pre-analysis step so the actual .cubin.cpp binaries/sources are
present before invoking the C++ compiler or static analyzers, and ensure this
step runs before any job that references these files.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ8Kv128PersistentSwapsAbForGen_cubin.cpp`:
- Around line 1-3: CI is running C++ analysis against LFS pointer stubs (e.g.,
files like
FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ8Kv128PersistentSwapsAbForGen_cubin.cpp);
ensure the pipeline materializes LFS objects early by adding a step that runs
git lfs pull --all (or git lfs checkout) before any C++
parsing/static-analysis/lint stages, and place this step at the start of the
job(s) that run the analyzers so the actual binary .cubin files are present
instead of pointer files.


@yunruis
Contributor Author

yunruis commented Mar 31, 2026

/bot run --disable-fail-fast --post-merge

@tensorrt-cicd
Collaborator

PR_Github #40902 [ run ] triggered by Bot. Commit: 5e4f57b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40886 [ run ] completed with state ABORTED. Commit: 5e4f57b

Link to invocation

@yunruis
Contributor Author

yunruis commented Mar 31, 2026

/bot run --disable-fail-fast --post-merge

@tensorrt-cicd
Collaborator

PR_Github #40936 [ run ] triggered by Bot. Commit: 5dca29d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40936 [ run ] completed with state FAILURE. Commit: 5dca29d
/LLM/main/L0_MergeRequest_PR pipeline #31929 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yunruis
Contributor Author

yunruis commented Mar 31, 2026

/bot run --post-merge --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40965 [ run ] triggered by Bot. Commit: 5dca29d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40965 [ run ] completed with state FAILURE. Commit: 5dca29d
/LLM/main/L0_MergeRequest_PR pipeline #31951 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@niukuo niukuo left a comment


jenkins/: LGTM

@yunruis
Contributor Author

yunruis commented Apr 1, 2026

/bot run --disable-fail-fast --post-merge

@tensorrt-cicd
Collaborator

PR_Github #41127 [ run ] triggered by Bot. Commit: 5e4c72b Link to invocation

yunruis and others added 5 commits April 1, 2026 17:04
trtllm-gen tag1: trtllm gen pass, trtllm-llm fail

fix several bugs

Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>

fix chunk prefill accuracy error

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

ignore trtllm-gen fmha release check

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

drop SBSA CI

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

wrap nvrtc path error on CI

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

1. change lib to release and do not dump file; 2. fix bug of nvrtc include file

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix ci check test list error

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

add print on auto tuner

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

drop internal instructions

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

pre-commit check ignore cpp/tensorrt_llm/kernels/trtllmGenKernels

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

skip CI Release Check stage

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

waive nvfp4 kv cache case

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix the bug when trying to enable CGA reduction

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

cancel print debug in autotuner

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix mla and mha bugs for nvrtc path

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix startTokenIdxSfO field loss

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

unify format to trtllm-gen

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix spec decoding error, and sync file with trtllm-gen, wave gptoss hang
error

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix chunk window size assert error

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

force using cubin path

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

add fix for cubin path

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix sparse mla on nvrtc path

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix fmha cubin path kernel select, to more sync with main branch

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix masktype selection

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

disable A10 and RTX5090 PackageSanityCheck to unblock multi-GPU testing

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

refactor cubin path kernel params to use TMA-based KernelParams struct, and more struct alignment with trtllm-gen

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix tokens per page==0 for not paged kvcache case

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: struct align with main branch

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix fmhaDispatch isSupported

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: fix illegal resource handle on multi-GPU

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

cancel waive GPTOSS case and layer_wise_benchmark

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: enable build for aarch64

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

waive test_layer_wise_benchmarks.py::test_qwen3_next_gen_tep[1]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

trtllm-gen:update exportCubin, and do not dump cu file for nvrtc path

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix generation phase TMA mismatch in cubin path

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: fix for qwen3_next and test rocky

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

fix wrong aarch64 lib

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

add cubin

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: add TLLM_FMHA_TRTLLM_COMPAT guards for TRT-LLM export compatibility. and adapt to rebased to trtllm-gen

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix debug build error

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: fix exportCubin on trtllm, and fix SmemTile.h

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: refine with perkz comment

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix sth

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix: allow E2M1 KV cache on cubin path by guarding with mIsTrtllmLayout in checkFmhaOptions

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Made-with: Cursor

fix NemotronNanoV3 by setting correct kvlen in context, and trivial sync trtllm-gen lib

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: WAR hang for dpsk v3 lite on B300

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

trtllm-gen: WAR hang for dpsk v3 lite on B300, export Cubin

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

refinement

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix custom mask ut

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

update trtllm-gen lib

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: Fix confidential/public scan: internal-release markers, comment scrub, cHigh rename in CutlassUtils

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

cancel pre-commit war

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix license check

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

fix: NVRTC path sanity test header path issue

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

fix warning

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

fix fastmath.h

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

add path selection

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>

trtllm-gen: fix compile warning on trtllm, and add rocky lib

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
@yunruis force-pushed the user/yunruis/add_fmha_interface_rebased_cubin_2_perf branch from 4e7fd7b to ba085c2 on April 1, 2026 09:04
@yunruis (Contributor, Author) commented Apr 1, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #41166 [ run ] triggered by Bot. Commit: ba085c2

@tensorrt-cicd (Collaborator):

PR_Github #41166 [ run ] completed with state SUCCESS. Commit: ba085c2
/LLM/main/L0_MergeRequest_PR pipeline #32136 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
@yunruis (Contributor, Author) commented Apr 2, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

1 similar comment
@yunruis (Contributor, Author) commented Apr 2, 2026

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #41442 [ run ] triggered by Bot. Commit: 9bcc613

@tensorrt-cicd (Collaborator):

PR_Github #41442 [ run ] completed with state SUCCESS. Commit: 9bcc613
/LLM/main/L0_MergeRequest_PR pipeline #32373 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


@yunruis (Contributor, Author) commented Apr 3, 2026

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #41543 [ run ] triggered by Bot. Commit: 9bcc613

@tensorrt-cicd (Collaborator):

PR_Github #41543 [ run ] completed with state SUCCESS. Commit: 9bcc613
/LLM/main/L0_MergeRequest_PR pipeline #32457 completed with status: 'SUCCESS'

CI Report

Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
@yunruis (Contributor, Author) commented Apr 3, 2026

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #41634 [ run ] triggered by Bot. Commit: e9f29d7

@tensorrt-cicd (Collaborator):

PR_Github #41634 [ run ] completed with state SUCCESS. Commit: e9f29d7
/LLM/main/L0_MergeRequest_PR pipeline #32542 completed with status: 'SUCCESS'

CI Report

@pengbowang-nv pengbowang-nv merged commit 88bbb4d into NVIDIA:main Apr 7, 2026
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
suyoggupta pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Apr 8, 2026
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Co-authored-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Co-authored-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
7 participants