[https://nvbugs/6071081][fix] Fix exception in checkIfKernelExist when kernel param is illegal by pengbowang-nv · Pull Request #13019 · NVIDIA/TensorRT-LLM

pengbowang-nv · 2026-04-14T03:10:19Z

Summary by CodeRabbit

Bug Fixes
- Added exception handling during kernel validation to prevent crashes. The system now gracefully handles errors that occur during kernel checks, improving overall stability and reliability.

Description

checkIfKernelExist may receive illegal kernel parameters during warmup. This raises an exception and causes TRTLLM to abort. This PR wraps the check in a try-catch so that a missing kernel or invalid input no longer aborts TRTLLM.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

pengbowang-nv · 2026-04-14T03:11:45Z

/bot run --add-multi-gpu-test --gpu-type "B200_PCIe,DGX_B200,B300,DGX_B300,GB200"

tensorrt-cicd · 2026-04-14T03:17:30Z

PR_Github #43161 [ run ] triggered by Bot. Commit: 3f57dcc Link to invocation

pengbowang-nv · 2026-04-14T04:20:12Z

/bot kill

tensorrt-cicd · 2026-04-14T04:26:09Z

PR_Github #43172 [ kill ] triggered by Bot. Commit: 3f57dcc Link to invocation

tensorrt-cicd · 2026-04-14T04:26:55Z

PR_Github #43172 [ kill ] completed with state SUCCESS. Commit: 3f57dcc
Successfully killed previous jobs for commit 3f57dcc

Link to invocation

pengbowang-nv · 2026-04-15T05:30:53Z

/bot run --add-multi-gpu-test --gpu-type "B200_PCIe,DGX_B200,B300,DGX_B300,GB200" --disable-fail-fast

tensorrt-cicd · 2026-04-15T05:37:14Z

PR_Github #43393 [ run ] triggered by Bot. Commit: 24b0c52 Link to invocation

pengbowang-nv · 2026-04-15T08:59:07Z

/bot kill

pengbowang-nv · 2026-04-15T09:05:49Z

/bot run --add-multi-gpu-test --gpu-type "B200_PCIe,DGX_B200,B300,DGX_B300,GB200" --disable-fail-fast

tensorrt-cicd · 2026-04-15T09:06:03Z

PR_Github #43461 [ kill ] triggered by Bot. Commit: 43da3d4 Link to invocation

tensorrt-cicd · 2026-04-15T09:06:47Z

PR_Github #43461 [ kill ] completed with state SUCCESS. Commit: 43da3d4
Successfully killed previous jobs for commit 43da3d4

Link to invocation

tensorrt-cicd · 2026-04-15T09:11:12Z

PR_Github #43463 [ run ] triggered by Bot. Commit: 43da3d4 Link to invocation

tensorrt-cicd · 2026-04-15T18:23:42Z

PR_Github #43463 [ run ] completed with state SUCCESS. Commit: 43da3d4
/LLM/main/L0_MergeRequest_PR pipeline #33982 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

pengbowang-nv · 2026-04-16T03:32:32Z

/bot run --add-multi-gpu-test

tensorrt-cicd · 2026-04-16T03:40:04Z

PR_Github #43641 [ run ] triggered by Bot. Commit: f4e84a9 Link to invocation

coderabbitai · 2026-04-16T03:40:54Z

📝 Walkthrough

Walkthrough

The checkIfKernelExist function in the FMHA kernels header now wraps its autotuning, validation, and hash lookup logic inside a try-catch block to gracefully handle exceptions. If an exception occurs during the existence check, the function logs a warning and returns failure status instead of propagating the exception.
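
A minimal sketch of the pattern described above, not the actual code in `fmhaKernels.h`. The names `KernelParams`, `lookupKernel`, and `checkIfKernelExistSketch` are hypothetical stand-ins, the "only headDim 128 exists" rule is invented for illustration, and `std::cerr` takes the place of `TLLM_LOG_WARNING`. It assumes the existence check reports its result as a `{found, info}` pair.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the real kernel parameter struct.
struct KernelParams
{
    int headDim;
};

// Hypothetical stand-in for the autotuning, validation, and hash-lookup steps.
// Throws when the parameters are illegal.
bool lookupKernel(KernelParams const& params)
{
    if (params.headDim <= 0)
    {
        throw std::invalid_argument("headDim must be positive");
    }
    // Assumption for illustration: only headDim == 128 has a prebuilt kernel.
    return params.headDim == 128;
}

// Returns {found, info} and never lets an exception escape to the caller.
std::pair<bool, std::string> checkIfKernelExistSketch(KernelParams const& params)
{
    try
    {
        if (lookupKernel(params))
        {
            return std::make_pair(true, std::string{});
        }
        return std::make_pair(false, std::string{"no matching kernel"});
    }
    catch (std::exception const&)
    {
        std::string const errorInfo = "exception while checking kernel existence";
        std::cerr << "[WARNING] " << errorInfo << '\n'; // stand-in for TLLM_LOG_WARNING
        return std::make_pair(false, errorInfo);
    }
}
```

With this shape, a caller that probes kernels during warmup can treat an illegal parameter set the same way as a missing kernel: both come back as `false` plus a message, and the process keeps running.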

Changes

| Cohort / File(s) | Summary |
|---|---|
| Exception Handling `cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h` | Added try-catch exception handling around kernel option autotuning, validation, cubin-path filtering, and hash lookup logic in `checkIfKernelExist`. Now returns graceful failure with warning message on exception instead of propagating. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description explains the issue and solution, but Test Coverage section is empty and no specific tests are listed to validate the exception handling changes. | Add specific test cases or test scenarios that verify the try-catch handling works correctly for illegal kernel parameters during warmup. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the fix: wrapping checkIfKernelExist with exception handling for illegal kernel parameters. |


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h (1)
2-2: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA copyright year for this modified file.

Line 2 still ends at 2025, but this file has meaningful modifications in 2026.
Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION. All rights reserved.
As per coding guidelines: “Add NVIDIA copyright header on ALL new files and update year on modified files” and “All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h` at line 2,
Update the NVIDIA copyright header in the modified header file fmhaKernels.h to
reflect the latest meaningful modification year (change "2020-2025" to
"2020-2026"); locate the copyright comment at the top of fmhaKernels.h and
update the year range so the file header complies with the project guideline
requiring the year of latest meaningful modification.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h`:
- Around line 244-248: The catch block that currently builds errorInfo as a
constant string loses the exception details; change the construction of
errorInfo to include e.what(), log it via TLLM_LOG_WARNING with the combined
message, and return the new errorInfo in the std::make_pair(false, errorInfo) so
callers receive the exception text (update the path that catches std::exception
const& e and the use of TLLM_LOG_WARNING / std::make_pair to reference the new
errorInfo).

---

Outside diff comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h`:
- Line 2: Update the NVIDIA copyright header in the modified header file
fmhaKernels.h to reflect the latest meaningful modification year (change
"2020-2025" to "2020-2026"); locate the copyright comment at the top of
fmhaKernels.h and update the year range so the file header complies with the
project guideline requiring the year of latest meaningful modification.
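
The inline suggestion above (include `e.what()` in the returned `errorInfo`) could look roughly like the following. This is a sketch under assumptions, not the code in the PR: `runCheckKeepingReason` is a hypothetical helper, the message prefix is invented, and `std::cerr` stands in for `TLLM_LOG_WARNING`.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: runs a boolean check and converts any std::exception
// into a {false, message} result that keeps the exception text.
template <typename Check>
std::pair<bool, std::string> runCheckKeepingReason(Check&& check)
{
    try
    {
        return std::make_pair(check(), std::string{});
    }
    catch (std::exception const& e)
    {
        // Append e.what() so the caller sees why the check failed,
        // instead of only a constant string.
        std::string const errorInfo = std::string{"kernel existence check failed: "} + e.what();
        std::cerr << "[WARNING] " << errorInfo << '\n'; // stand-in for TLLM_LOG_WARNING
        return std::make_pair(false, errorInfo);
    }
}
```

Logging and returning the same `errorInfo` keeps the warning in the log and the message seen by the caller consistent, which is the point of the review comment.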


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6f9304e6-ef23-4b37-91c3-a021e4ad4861

📥 Commits

Reviewing files that changed from the base of the PR and between 17aee3f and f4e84a9.

📒 Files selected for processing (1)

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h

tensorrt-cicd · 2026-04-16T04:53:20Z

PR_Github #43641 [ run ] completed with state SUCCESS. Commit: f4e84a9
/LLM/main/L0_MergeRequest_PR pipeline #34130 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

pengbowang-nv · 2026-04-16T09:23:29Z

/bot run --add-multi-gpu-test

tensorrt-cicd · 2026-04-16T09:29:27Z

PR_Github #43751 [ run ] triggered by Bot. Commit: 1d060ec Link to invocation

tensorrt-cicd · 2026-04-16T11:25:14Z

PR_Github #43751 [ run ] completed with state SUCCESS. Commit: 1d060ec
/LLM/main/L0_MergeRequest_PR pipeline #34232 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation


pengbowang-nv · 2026-04-16T16:24:48Z

/bot run --add-multi-gpu-test --disable-fail-fast

tensorrt-cicd · 2026-04-16T16:31:26Z

PR_Github #43809 [ run ] triggered by Bot. Commit: 0f391d0 Link to invocation

github-actions bot assigned pengbowang-nv Apr 14, 2026

pengbowang-nv force-pushed the test-fix-tg-not-found branch from 43da3d4 to f4e84a9 Compare April 16, 2026 03:30

pengbowang-nv changed the title ~~[Draft][USE CI TO TEST] add back kernel selection logic for spec-dec tree~~ [https://nvbugs/6071081][fix] Fix exception in checkIfKernelExist when kernel param is illegal Apr 16, 2026

pengbowang-nv marked this pull request as ready for review April 16, 2026 03:36

coderabbitai bot reviewed Apr 16, 2026

View reviewed changes

Comment thread cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h

pengbowang-nv added 2 commits April 16, 2026 22:22

add try catch to avoid missing kernel interfere with normal process

53a2005

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

add NVRTC to checkIfKernelExist and extract shouldUseNvrtc utility fu…

0f391d0

…nction Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>

pengbowang-nv force-pushed the test-fix-tg-not-found branch from 1d060ec to 0f391d0 Compare April 16, 2026 14:22
