[https://nvbugs/6071081][fix] Fix exception in checkIfKernelExist when kernel param is illegal#13019

Open
pengbowang-nv wants to merge 2 commits into NVIDIA:main from pengbowang-nv:test-fix-tg-not-found
@pengbowang-nv
Collaborator

@pengbowang-nv pengbowang-nv commented Apr 14, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Added exception handling during kernel validation to prevent crashes. The system now gracefully handles errors that occur during kernel checks, improving overall stability and reliability.

Description

checkIfKernelExist may receive illegal kernel parameters during warmup. This raises an exception and causes TRTLLM to abort. This PR wraps the check in a try-catch so that a missing kernel or invalid input no longer aborts TRTLLM.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@pengbowang-nv
Collaborator Author

/bot run --add-multi-gpu-test --gpu-type "B200_PCIe,DGX_B200,B300,DGX_B300,GB200"

@tensorrt-cicd
Collaborator

PR_Github #43161 [ run ] triggered by Bot. Commit: 3f57dcc Link to invocation

@pengbowang-nv
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #43172 [ kill ] triggered by Bot. Commit: 3f57dcc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43172 [ kill ] completed with state SUCCESS. Commit: 3f57dcc
Successfully killed previous jobs for commit 3f57dcc

Link to invocation

@pengbowang-nv
Collaborator Author

/bot run --add-multi-gpu-test --gpu-type "B200_PCIe,DGX_B200,B300,DGX_B300,GB200" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43393 [ run ] triggered by Bot. Commit: 24b0c52 Link to invocation

@pengbowang-nv
Collaborator Author

/bot kill

@pengbowang-nv
Collaborator Author

/bot run --add-multi-gpu-test --gpu-type "B200_PCIe,DGX_B200,B300,DGX_B300,GB200" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43461 [ kill ] triggered by Bot. Commit: 43da3d4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43461 [ kill ] completed with state SUCCESS. Commit: 43da3d4
Successfully killed previous jobs for commit 43da3d4

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43463 [ run ] triggered by Bot. Commit: 43da3d4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43463 [ run ] completed with state SUCCESS. Commit: 43da3d4
/LLM/main/L0_MergeRequest_PR pipeline #33982 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@pengbowang-nv pengbowang-nv force-pushed the test-fix-tg-not-found branch from 43da3d4 to f4e84a9 on April 16, 2026 03:30
@pengbowang-nv
Collaborator Author

/bot run --add-multi-gpu-test

@pengbowang-nv pengbowang-nv changed the title from "[Draft][USE CI TO TEST] add back kernel selection logic for spec-dec tree" to "[https://nvbugs/6071081][fix] Fix exception in checkIfKernelExist when kernel param is illegal" on Apr 16, 2026
@pengbowang-nv pengbowang-nv marked this pull request as ready for review April 16, 2026 03:36
@tensorrt-cicd
Collaborator

PR_Github #43641 [ run ] triggered by Bot. Commit: f4e84a9 Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The checkIfKernelExist function in the FMHA kernels header now wraps its autotuning, validation, and hash lookup logic inside a try-catch block to gracefully handle exceptions. If an exception occurs during the existence check, the function logs a warning and returns failure status instead of propagating the exception.
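The pattern described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code from fmhaKernels.h: `KernelParams`, `lookupKernel`, and `checkIfKernelExistSketch` are hypothetical names, the head-dimension rule is invented for the example, and `std::cerr` stands in for the project's logging macro.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the kernel parameters; not the real TRT-LLM type.
struct KernelParams
{
    int headDim;
};

// Hypothetical lookup that throws on illegal input, mimicking a validation step.
inline bool lookupKernel(KernelParams const& params)
{
    if (params.headDim <= 0)
    {
        throw std::invalid_argument("headDim must be positive");
    }
    // Assumption for the sketch: only head dimensions 64 and 128 have a kernel.
    return params.headDim == 64 || params.headDim == 128;
}

// Returns (exists, info) and never lets an exception escape to the caller.
inline std::pair<bool, std::string> checkIfKernelExistSketch(KernelParams const& params)
{
    try
    {
        if (lookupKernel(params))
        {
            return std::make_pair(true, std::string{});
        }
        return std::make_pair(false, "no kernel found for headDim=" + std::to_string(params.headDim));
    }
    catch (std::exception const&)
    {
        // Log a warning and report failure instead of propagating the exception.
        std::string const errorInfo = "kernel existence check failed";
        std::cerr << "[WARNING] " << errorInfo << std::endl;
        return std::make_pair(false, errorInfo);
    }
}
```

With this shape, a caller during warmup can treat an illegal parameter the same way as a missing kernel: both come back as `false` with a message, and the process keeps running.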

Changes

Cohort / File(s) | Summary
Exception Handling: cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h | Added try-catch exception handling around kernel option autotuning, validation, cubin-path filtering, and hash lookup logic in checkIfKernelExist. Now returns graceful failure with warning message on exception instead of propagating.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check | ❓ Inconclusive | The description explains the issue and solution, but Test Coverage section is empty and no specific tests are listed to validate the exception handling changes. | Add specific test cases or test scenarios that verify the try-catch handling works correctly for illegal kernel parameters during warmup.

✅ Passed checks (1 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically describes the fix: wrapping checkIfKernelExist with exception handling for illegal kernel parameters.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA copyright year for this modified file.

Line 2 still ends at 2025, but this file has meaningful modifications in 2026.

Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION. All rights reserved.

As per coding guidelines: “Add NVIDIA copyright header on ALL new files and update year on modified files” and “All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h` at line 2,
Update the NVIDIA copyright header in the modified header file fmhaKernels.h to
reflect the latest meaningful modification year (change "2020-2025" to
"2020-2026"); locate the copyright comment at the top of fmhaKernels.h and
update the year range so the file header complies with the project guideline
requiring the year of latest meaningful modification.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h`:
- Around line 244-248: The catch block that currently builds errorInfo as a
constant string loses the exception details; change the construction of
errorInfo to include e.what(), log it via TLLM_LOG_WARNING with the combined
message, and return the new errorInfo in the std::make_pair(false, errorInfo) so
callers receive the exception text (update the path that catches std::exception
const& e and the use of TLLM_LOG_WARNING / std::make_pair to reference the new
errorInfo).

---

Outside diff comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h`:
- Line 2: Update the NVIDIA copyright header in the modified header file
fmhaKernels.h to reflect the latest meaningful modification year (change
"2020-2025" to "2020-2026"); locate the copyright comment at the top of
fmhaKernels.h and update the year range so the file header complies with the
project guideline requiring the year of latest meaningful modification.
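
The inline suggestion above (keep the exception text in the returned message) could look roughly like this. It is a hedged sketch rather than the actual patch: `validateOptions` and `checkWithDetails` are hypothetical names, the message prefix is invented, and `std::cerr` stands in for `TLLM_LOG_WARNING`.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical validation step that throws on illegal input.
inline void validateOptions(int numTokens)
{
    if (numTokens < 0)
    {
        throw std::runtime_error("numTokens is negative");
    }
}

// Returns (ok, errorInfo); on failure errorInfo carries the exception text.
inline std::pair<bool, std::string> checkWithDetails(int numTokens)
{
    try
    {
        validateOptions(numTokens);
        return std::make_pair(true, std::string{});
    }
    catch (std::exception const& e)
    {
        // Append e.what() so both the log line and the caller see the root cause.
        std::string const errorInfo = std::string("Exception during kernel existence check: ") + e.what();
        std::cerr << "[WARNING] " << errorInfo << std::endl;
        return std::make_pair(false, errorInfo);
    }
}
```

Building the message once and reusing it for both the log and the return value keeps the two from drifting apart.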
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6f9304e6-ef23-4b37-91c3-a021e4ad4861

📥 Commits

Reviewing files that changed from the base of the PR and between 17aee3f and f4e84a9.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h

Comment thread cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h
@tensorrt-cicd
Collaborator

PR_Github #43641 [ run ] completed with state SUCCESS. Commit: f4e84a9
/LLM/main/L0_MergeRequest_PR pipeline #34130 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@pengbowang-nv
Collaborator Author

/bot run --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #43751 [ run ] triggered by Bot. Commit: 1d060ec Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43751 [ run ] completed with state SUCCESS. Commit: 1d060ec
/LLM/main/L0_MergeRequest_PR pipeline #34232 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
…nction

Signed-off-by: Pengbo Wang <221450789+pengbowang-nv@users.noreply.github.com>
@pengbowang-nv pengbowang-nv force-pushed the test-fix-tg-not-found branch from 1d060ec to 0f391d0 on April 16, 2026 14:22
@pengbowang-nv
Collaborator Author

/bot run --add-multi-gpu-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43809 [ run ] triggered by Bot. Commit: 0f391d0 Link to invocation
