[None][feat] Trtllm-gen FMHA JIT support #12612
|
/bot run --disable-fail-fast --stage-list "DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-1,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-2,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-3,DGX_B200-8_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-2,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-3,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-4,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-5,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-6,GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-5,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-6,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge-7,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU2-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-3,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU1-GEN1-NODE1-GPU4-Post-Merge-4,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-2,GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-3,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-
CTX1-NODE1-GPU1-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-1,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-2,GB200-12_GPUs-3_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE2-GPU8-Post-Merge-3,GB200-16_GPUs-4_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE2-GPU8-GEN1-NODE2-GPU8-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-1,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-2,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-3,GB200-24_GPUs-6_Nodes-PyTorch-Disagg-PerfSanity-CTX2-NODE1-GPU4-GEN1-NODE4-GPU16-Post-Merge-4" |
|
PR_Github #40886 [ run ] triggered by Bot. Commit: |
📝 Walkthrough
Pre-commit configuration adds exclusion rules for FMHA kernel output directories. FmhaDispatcher populates additional runner parameters, including processor count and layout-dependent settings. CMake now links prebuilt kernel archives. Hundreds of Git LFS pointers are updated for cubin artifacts. A floating-point header include is added.
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~12 minutes
The change consists predominantly of homogeneous updates (600+ Git LFS pointer metadata changes with identical patterns), which require minimal individual review. The non-repetitive portions (CMakeLists.txt build configuration, FmhaDispatcher parameter additions, pre-commit exclusions) are localized and straightforward, so the large file count reflects pattern repetition rather than diverse logic.
|
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update copyright year in modified file.
This file is modified but still shows `2020-2024`; it should include the latest modification year.
🛠️ Proposed fix
- * Copyright (c) 2020-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION. All rights reserved.
As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp` at line 2, Update the file header in cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp to reflect the latest modification year by changing the copyright range from "2020-2024" to include 2026 (e.g., "2020-2026"); ensure the header format exactly matches other source files' NVIDIA copyright header so the symbol fmhaDispatcher.cpp contains the updated year.
🧹 Nitpick comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
1-3: Treat `*_cubin.cpp` LFS pointers as artifacts in static-analysis jobs.
These lines are Git LFS pointer metadata, so parsing them as C++ creates false hard errors (`oid`, `size` syntax failures). Please make CI either hydrate LFS before C++ analysis or exclude these artifact-pointer files from Clang/Cppcheck scans.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp` around lines 1 - 3, The CI is treating Git LFS pointer files like real C++ and failing; update the pipeline to either hydrate LFS before static analysis or exclude these artifacts by glob. Specifically, in the static-analysis steps that run Clang/Cppcheck/tidy, add a pre-step to run "git lfs pull" (or equivalent) so files such as FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp are real C++ blobs, or add an exclusion glob for "*_cubin.cpp" (and any other LFS pointer patterns) in the Clang/Cppcheck invocation so these LFS pointer files are skipped. Ensure the change touches the CI job(s) that run Clang/Cppcheck/tidy.
.pre-commit-config.yaml (1)
1391-1517: Consider deduplicating the repeated FMHA exclude regex via a YAML anchor.
The same block is repeated in many hooks, so the copies can easily drift apart over time.
♻️ Suggested DRY refactor
+fmha-generated-exclude: &fmha_generated_exclude |
+  (?x)^(
+      cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/trtllmGen_fmha_export/.* |
+      cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cuda_ptx/.*
+  )$
@@
-        exclude: |
-          (?x)^(
-              cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/trtllmGen_fmha_export/.* |
-              cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cuda_ptx/.*
-          )$
+        exclude: *fmha_generated_exclude
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.pre-commit-config.yaml around lines 1391 - 1517, Define a single YAML anchor for the repeated FMHA exclude regex and replace the repeated multi-line exclude values with a reference to that anchor; locate the top-level repeated block (the long exclude: | (?x)^( ... )$ pattern) and extract it into a named anchor (e.g. &fmha_exclude) and then use the alias (*fmha_exclude) in each hook's exclude field (examples to update include hooks with id: remove-crlf, yapf, end-of-file-fixer, trailing-whitespace, clang-format, cmake-format, codespell, ruff, ruff-format, autoflake, etc.); ensure whitespace/indentation matches YAML nesting so pre-commit still parses the exclude entries correctly.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H128PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
1-3: Exclude Git LFS pointer stubs from C++ static-analysis passes.
This file is a valid Git LFS pointer update (`oid`/`size`), but clang/cppcheck will misparse it as C++ when LFS objects are not materialized. Please ensure these paths are excluded (or LFS pull is guaranteed) in static-analysis jobs to prevent false failures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H128PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ16Kv128StaticSwapsAbForGen_cubin.cpp` around lines 1 - 3, The CI/static-analysis is picking up Git LFS pointer stubs (files starting with the line "version https://git-lfs.github.com/spec/v1" and containing "oid sha256:"/ "size") as C++; update the static-analysis/clang/cppcheck job configuration to exclude such files (or ensure LFS objects are pulled) by adding a rule to skip files matching that header or the specific cubin filename pattern (e.g., FmhaSm100fKernel_..._cubin.cpp) so the analyzer ignores pointer stubs; ensure the exclusion references the pointer header string ("version https://git-lfs.github.com/spec/v1") or the oid/size pattern to reliably detect LFS pointer files.
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (1)
117-123: Extract repeated non-zero literals to a named constant.
Using raw `1` repeatedly in assignments violates the literal-usage rule and makes intent less clear.
♻️ Proposed refactor
+ int32_t const kKERNEL_SELECTION_PROBE_VALUE = 1;
- tllmRunnerParams.mBatchSize = 1;
- tllmRunnerParams.mMaxSeqLenQ = 1;
- tllmRunnerParams.mMaxSeqLenKv = 1;
- tllmRunnerParams.mMaxSeqLenCacheKv = 1;
- tllmRunnerParams.mSumOfSeqLensQ = 1;
- tllmRunnerParams.mSumOfSeqLensKv = 1;
- tllmRunnerParams.mMaxNumPagesPerSeqKv = 1;
+ tllmRunnerParams.mBatchSize = kKERNEL_SELECTION_PROBE_VALUE;
+ tllmRunnerParams.mMaxSeqLenQ = kKERNEL_SELECTION_PROBE_VALUE;
+ tllmRunnerParams.mMaxSeqLenKv = kKERNEL_SELECTION_PROBE_VALUE;
+ tllmRunnerParams.mMaxSeqLenCacheKv = kKERNEL_SELECTION_PROBE_VALUE;
+ tllmRunnerParams.mSumOfSeqLensQ = kKERNEL_SELECTION_PROBE_VALUE;
+ tllmRunnerParams.mSumOfSeqLensKv = kKERNEL_SELECTION_PROBE_VALUE;
+ tllmRunnerParams.mMaxNumPagesPerSeqKv = kKERNEL_SELECTION_PROBE_VALUE;
As per coding guidelines, "Except `0`, `nullptr`, `true`, `false`, all other literals in C++ should only be used for variable initialization; extract other literal usages to named constants."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp` around lines 117 - 123, Replace the repeated magic literal 1 used to initialize fields on tllmRunnerParams with a named constant: declare a constexpr (e.g., kDefaultOne = 1) and use it for mBatchSize, mMaxSeqLenQ, mMaxSeqLenKv, mMaxSeqLenCacheKv, mSumOfSeqLensQ, mSumOfSeqLensKv, and mMaxNumPagesPerSeqKv; update the assignments in fmhaDispatcher.cpp so the intent is clear and the literal is not repeated directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/CMakeLists.txt`:
- Around line 24-27: The computed CUDA_TARGETS_INCLUDE_DIR uses
CMAKE_SYSTEM_PROCESSOR directly which yields "aarch64" on Arm SBSA systems but
the CUDA toolkit places headers under "targets/sbsa-linux"; update the logic
around get_filename_component/CUDA_TARGETS_INCLUDE_DIR to map
CMAKE_SYSTEM_PROCESSOR "aarch64" to "sbsa" (or otherwise detect ARM SBSA) before
assembling the path, then build the include path as
"${CUDA_TOOLKIT_ROOT}/targets/${MAPPED_PROCESSOR}-linux/include" so the symbolic
link and AArch64 JIT path resolve correctly (refer to get_filename_component,
CUDA_BIN_PATH, CUDA_TOOLKIT_ROOT, CUDA_TARGETS_INCLUDE_DIR and
CMAKE_SYSTEM_PROCESSOR).
- Around line 39-42: Replace the file(COPY ...) usage for the FMHA exported
headers with configure_file so the generated CUDA/NVRTC build dir gets updated
whenever KernelParams.h or KernelParamsDecl.h change: locate the two file(COPY
${CMAKE_CURRENT_SOURCE_DIR}/trtllmGen_fmha_export/KernelParams.h ...) and
file(COPY ${CMAKE_CURRENT_SOURCE_DIR}/trtllmGen_fmha_export/KernelParamsDecl.h
...) entries in CMakeLists.txt and change them to configure_file calls that copy
from ${CMAKE_CURRENT_SOURCE_DIR}/trtllmGen_fmha_export/KernelParams.h and
KernelParamsDecl.h to ${CMAKE_CURRENT_BINARY_DIR}/KernelParams.h and
KernelParamsDecl.h (using `@ONLY` if needed), ensuring the build system reruns and
NVRTC sees up-to-date headers referenced by TRTLLM_FMHA_BUILD_DIR.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100aKernel_QE4m3KvE2m1OE4m3H256PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp`:
- Around line 2-3: CMake currently globs all .cpp files with file(GLOB_RECURSE
SRC_CPP *.cpp) which pulls in Git LFS pointer .cpp files from the cubin
directory and causes build failures when LFS is not hydrated; update the
CMakeLists handling by excluding the cubin directory from the glob or by adding
an explicit exclusion filter for the cubin path when populating SRC_CPP (or
alternatively treat the cubin directory separately), ensuring the unique symbol
SRC_CPP and the existing file(GLOB_RECURSE ...) invocation are modified so
cubin/*.cpp files are not added to the compile list unless Git LFS has been
hydrated.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100aKernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp`:
- Around line 2-3: The CMake GLOB that populates SRC_CPP is including Git LFS
pointer files like the cubin.cpp artifact; update the CMakeLists handling around
the "file(GLOB_RECURSE SRC_CPP *.cpp)" and subsequent
"add_library(trtllm_gen_fmha OBJECT ${SRC_CPP} ${SRC_CU})" so those artifacts
are filtered out — e.g. after the glob, remove or filter entries matching the
LFS/binary patterns (like ".*cubin.cpp$" and ".*_cubin.h$") using list(FILTER
SRC_CPP EXCLUDE REGEX ...) or use a foreach loop to push_valid files into
SRC_CPP_CLEAN and use that in add_library; ensure the same exclusion is applied
to any SRC_CU or other globbed source lists so LFS pointer files never reach the
compiler.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqSkipsSoftmaxQ128Kv128StaticKeepsAbForGen_cubin.cpp`:
- Around line 1-3: The CI workflows pr-check.yml and precommit-check.yml are
missing LFS hydration which causes the CMake glob (file(GLOB_RECURSE SRC_CPP
*.cpp)) to pick up LFS pointer files; update the actions/checkout@v6 steps in
both workflows to include lfs: 'true' (matching blossom-ci.yml) so large-file
pointers are hydrated before static analysis runs and compilation.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqSkipsSoftmaxQ128Kv128StaticKeepsAbForGen_cubin.cpp`:
- Around line 1-3: The CI is attempting to compile Git LFS pointer files like
FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqSkipsSoftmaxQ128Kv128StaticKeepsAbForGen_cubin.cpp
before the LFS objects are materialized; update the pipeline to run a Git LFS
materialization step (e.g., git lfs pull or enabling git lfs smudge) as a
pre-build/pre-analysis step so the actual .cubin.cpp binaries/sources are
present before invoking the C++ compiler or static analyzers, and ensure this
step runs before any job that references these files.
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ8Kv128PersistentSwapsAbForGen_cubin.cpp`:
- Around line 1-3: CI is running C++ analysis against LFS pointer stubs (e.g.,
files like
FmhaSm100fKernel_QkvBfloat16OBfloat16H256PagedKvSlidingOrChunkedCausalP32VarSeqSkipsSoftmaxQ8Kv128PersistentSwapsAbForGen_cubin.cpp);
ensure the pipeline materializes LFS objects early by adding a step that runs
git lfs pull --all (or git lfs checkout) before any C++
parsing/static-analysis/lint stages, and place this step at the start of the
job(s) that run the analyzers so the actual binary .cubin files are present
instead of pointer files.
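Several of the comments above turn on telling a Git LFS pointer stub apart from real C++ source. As a rough illustration only, the check could look like the sketch below. The function name `looksLikeLfsPointer` is hypothetical and is not part of this PR or of any CI tooling in the repository; the sketch assumes LF line endings and only recognizes `sha256` object IDs.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not from the PR): returns true if `content` looks like a
// Git LFS pointer stub rather than real source code.
bool looksLikeLfsPointer(std::string const& content)
{
    static std::string const kVersionLine = "version https://git-lfs.github.com/spec/v1";

    std::istringstream lines(content);
    std::string line;

    // A pointer file starts with the version line.
    if (!std::getline(lines, line) || line != kVersionLine)
    {
        return false;
    }

    // The remaining lines must include both an object ID and a size.
    bool hasOid = false;
    bool hasSize = false;
    while (std::getline(lines, line))
    {
        if (line.rfind("oid sha256:", 0) == 0)
        {
            hasOid = true;
        }
        else if (line.rfind("size ", 0) == 0)
        {
            hasSize = true;
        }
    }
    return hasOid && hasSize;
}
```

A build or analysis step could call a check like this on each `*_cubin.cpp` file and skip the ones that are still stubs; hydrating LFS before the step, as the comments suggest, avoids the need for it entirely.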
|
/bot run --disable-fail-fast --post-merge |
|
PR_Github #40902 [ run ] triggered by Bot. Commit: |
|
PR_Github #40886 [ run ] completed with state |
|
/bot run --disable-fail-fast --post-merge |
|
PR_Github #40936 [ run ] triggered by Bot. Commit: |
|
PR_Github #40936 [ run ] completed with state
|
|
/bot run --post-merge --disable-fail-fast |
|
PR_Github #40965 [ run ] triggered by Bot. Commit: |
|
PR_Github #40965 [ run ] completed with state
|
|
/bot run --disable-fail-fast --post-merge |
|
PR_Github #41127 [ run ] triggered by Bot. Commit: |
trtllm-gen tag1: trtllm gen pass, trtllm-llm fail fix several bugs Signed-off-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
fix chunk prefill accuracy error Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
ignore trtllm-gen fmha release check Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
drop SBSA CI Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
wrap nvrtc path error on CI Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
1. change lib to release and do not dump file; 2. fix bug of nvrtc include file Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix ci check test list error Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
add print on auto tuner Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
drop internal instructions Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
pre-commit check ignore cpp/tensorrt_llm/kernels/trtllmGenKernels Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
skip CI Release Check stage Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
waive nvfp4 kv cache case Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix the bug when trying to enable CGA reduction Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
cancel print debug in autotuner Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix mla and mha bugs for nvrtc path Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix startTokenIdxSfO field loss Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
unify format to trtllm-gen Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix spec decoding error, and sync file with trtllm-gen, wave gptoss hang error Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix chunk window size assert error Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
force using cubin path Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
add fix for cubin path Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix sparse mla on nvrtc path Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix fmha cubin path kernel select, to more sync with main branch Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix masktype selection Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
disable A10 and RTX5090 PackageSanityCheck to unblock multi-GPU testing Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
refactor cubin path kernel params to use TMA-based KernelParams struct, and more struct alignment with trtllm-gen Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix tokens per page==0 for not paged kvcache case Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: struct align with main branch Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix fmhaDispatch isSupported Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: fix illegal resource handle on multi-GPU Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
cancel waive GPTOSS case and layer_wise_benchmark Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: enable build for aarch64 Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
waive test_layer_wise_benchmarks.py::test_qwen3_next_gen_tep[1] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
trtllm-gen:update exportCubin, and do not dump cu file for nvrtc path Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix generation phase TMA mismatch in cubin path Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: fix for qwen3_next and test rocky Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
fix wrong aarch64 lib Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
add cubin Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: add TLLM_FMHA_TRTLLM_COMPAT guards for TRT-LLM export compatibility. and adapt to rebased to trtllm-gen Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix debug build error Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: fix exportCubin on trtllm, and fix SmemTile.h Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: refine with perkz comment Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix sth Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix: allow E2M1 KV cache on cubin path by guarding with mIsTrtllmLayout in checkFmhaOptions Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com> Made-with: Cursor
fix NemotronNanoV3 by setting correct kvlen in context, and trivial sync trtllm-gen lib Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: WAR hang for dpsk v3 lite on B300 Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
trtllm-gen: WAR hang for dpsk v3 lite on B300, export Cubin Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
refinement Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix custom mask ut Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
update trtllm-gen lib Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: Fix confidential/public scan: internal-release markers, comment scrub, cHigh rename in CutlassUtils Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
cancel pre-commit war Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix license check Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
fix: NVRTC path sanity test header path issue Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
fix warning Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
fix fastmath.h Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
add path selection Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
trtllm-gen: fix compile wanring on trtllm, and add rocky lib Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>
Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
4e7fd7b to ba085c2
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #41166 [ run ] triggered by Bot. Commit: |
|
PR_Github #41166 [ run ] completed with state |
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
|
/bot run --add-multi-gpu-test --disable-fail-fast |
|
PR_Github #41442 [ run ] triggered by Bot. Commit: |
|
PR_Github #41442 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41543 [ run ] triggered by Bot. Commit: |
|
PR_Github #41543 [ run ] completed with state |
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #41634 [ run ] triggered by Bot. Commit: |
|
PR_Github #41634 [ run ] completed with state |
Signed-off-by: yunruis <205571022+yunruis@users.noreply.github.com> Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com> Co-authored-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com> Co-authored-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
Description
This PR integrates the trtllm-gen FMHA JIT (NVRTC) compilation path into TensorRT-LLM alongside the existing cubin (pre-compiled) path, enabling runtime kernel generation for FMHA configurations not covered by pre-compiled cubins.
Key Changes
1. Dual kernel dispatch: cubin + NVRTC JIT
The `TllmGenFmhaKernel` class now supports two dispatch paths: the existing pre-compiled cubin path, and an NVRTC JIT path that uses `FmhaInterface` from trtllm-gen to generate and compile kernels at runtime for configurations where cubins are not available (e.g., `SwapsMmaAbForGeneration` with non-MLA, non-E2M1, head_dim != 64).
2. Unified kernel selection via `FmhaAutoTuner`, consistent with trtllm-gen
This is the most significant architectural change. Previously, TensorRT-LLM maintained its own hand-written heuristic kernel selection logic (`selectGqGenerationKernel`, `selectMlaGenerationKernel`, `selectTileSizeQForGqaGeneration`, etc.), which diverged from trtllm-gen's selection logic and was a frequent source of bugs: kernel hash mismatches, `TRTLLM-GEN kernels not found` errors, and silent fallbacks to unfused MHA.
This PR replaces all of that with trtllm-gen's `FmhaAutoTuner`, which automatically determines the optimal tile sizes, kernel types (Swaps/Keeps MMA AB), CTA configurations, and multi-CTA KV modes. This removes ~300 lines of manual selection code and ensures that TensorRT-LLM and trtllm-gen always agree on which kernel to select, eliminating the class of bugs caused by selection logic diverging between the two repos.
3. Simplified `TllmGenFmhaRunner` constructor
Removed VisualGen-specific parameters (`maxNumHeadsQPerKvInCta`, `sageAttnBlk*`, `dataTypeQkReinterpret`) from the runner/kernel constructor and hash key. These are now handled internally by the auto-tuner and kernel options system, reducing the API surface from 10 parameters to 3 (`dtypeQ`, `dtypeKv`, `dtypeOut`).
4. trtllm-gen export headers, static libraries, and cubin artifacts
- Export headers under `trtllmGen_fmha_export/` providing `FmhaInterface`, `FmhaAutoTuner`, `FmhaOptions`, `KernelParams`, and device-side runtime code.
- Static libraries (`libTrtLlmGenFmhaLib.a`, `libTrtLlmGen.a`) for both x86_64 and aarch64 architectures.
- `cuda_ptx/cuda_ptx.h`, containing PTX intrinsic wrappers needed by NVRTC-generated kernels.
- Cubin artifacts and `kernelMetaInfo.h` metadata.
- Pre-commit exclusions for `trtllmGen_fmha_export/` and `cuda_ptx/` to prevent formatting tools from modifying external/generated code.
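To make the dual dispatch in item 1 concrete, here is a minimal, host-only sketch of a lookup that tries a pre-compiled table first and falls back to a cached JIT compile. It is an illustration under stated assumptions, not the PR's implementation: `KernelHandle`, `DualPathDispatcher`, the string configuration key, and the cubin-first ordering are all hypothetical, and the JIT step is an injected callback rather than a real NVRTC call.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a loaded kernel; real code would hold a CUDA function handle.
struct KernelHandle
{
    std::string name;
    bool fromJit;
};

// Hypothetical dispatcher: pre-compiled table first, JIT fallback second.
class DualPathDispatcher
{
public:
    // Compiles a kernel for a configuration key, or returns std::nullopt on failure.
    using JitCompiler = std::function<std::optional<KernelHandle>(std::string const&)>;

    DualPathDispatcher(std::unordered_map<std::string, KernelHandle> cubins, JitCompiler jit)
        : mCubins(std::move(cubins))
        , mJit(std::move(jit))
    {
        assert(mJit != nullptr); // the JIT callback must be callable
    }

    // Returns the kernel for `config`, or std::nullopt if neither path can serve it.
    std::optional<KernelHandle> find(std::string const& config)
    {
        // 1. Pre-compiled path: a plain table lookup.
        if (auto it = mCubins.find(config); it != mCubins.end())
        {
            return it->second;
        }
        // 2. JIT path: reuse an earlier compilation if one exists.
        if (auto it = mJitCache.find(config); it != mJitCache.end())
        {
            return it->second;
        }
        // 3. Compile once and cache the result so later calls skip compilation.
        auto compiled = mJit(config);
        if (compiled)
        {
            mJitCache.emplace(config, *compiled);
        }
        return compiled;
    }

private:
    std::unordered_map<std::string, KernelHandle> mCubins;
    std::unordered_map<std::string, KernelHandle> mJitCache;
    JitCompiler mJit;
};
```

Caching the JIT result keeps the compilation cost to the first request for a given configuration, which is the usual reason to pair a JIT path with a lookup table. A production version would also need thread safety and a per-device cache, both omitted here.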