[None][test] add unit test and e2e test for gpt_oss_20b MHA kernel #12796

Open
ruodil wants to merge 1 commit into NVIDIA:main from ruodil:user/ruodil/rcca

Conversation

@ruodil
Collaborator

@ruodil ruodil commented Apr 7, 2026

Summary by CodeRabbit

  • Tests

    • Added performance test suite for GPT-OSS 20B FP4 model with multiple input/output scenarios.
    • Added CUDA attention kernel selection verification test to ensure optimal kernel usage on supported hardware.
  • Chores

    • Added performance configuration for GPT-OSS 20B FP4 model with optimized settings for CUDA execution.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
@ruodil ruodil requested review from a team as code owners April 7, 2026 07:25
@ruodil ruodil requested review from LarryXFly and yufeiwu-nv April 7, 2026 07:25
@ruodil
Collaborator Author

ruodil commented Apr 7, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42091 [ run ] triggered by Bot. Commit: 1f159ef Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

This change adds performance configurations and test cases for GPT-OSS 20B models with PyTorch float4 backend, along with a new unit test for CUDA kernel selection behavior using TensorRT-LLM integration.

Changes

  • Performance Configuration & Test Lists (tests/integration/defs/perf/pytorch_model_config.py, tests/integration/test_lists/qa/llm_perf_core.yml): Added a PyTorch performance configuration for gpt_oss_20b_fp4-bench-pytorch-float4 with CUDA graph padding, KV cache settings, and MOE backend configuration, plus three performance test cases with varying input/output lengths and concurrency parameters.
  • Unit Testing (tests/unittest/_torch/modeling/test_modeling_gpt_oss.py): Added test_gpt_oss_xqa_kernel_selection(), which profiles CUDA activity to verify XQA kernel dispatch while excluding the MMHA kernel; includes model configuration, KV cache setup, and profiling verification logic.
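The verification logic in such a kernel-selection test reduces to scanning profiled CUDA kernel names for the expected dispatch. A minimal, self-contained sketch of that check (the helper and the kernel-name substrings are illustrative; in the actual test the names come from torch.profiler CUDA events):

```python
def verify_kernel_selection(kernel_names, expected_substr, forbidden_substr):
    """Return True iff some profiled kernel matches the expected family
    and no kernel matches the forbidden one."""
    saw_expected = any(expected_substr in name for name in kernel_names)
    saw_forbidden = any(forbidden_substr in name for name in kernel_names)
    return saw_expected and not saw_forbidden

# Placeholder names standing in for profiler-reported CUDA kernels.
profiled = ["xqa_kernel_dispatch", "elementwise_kernel", "cutlass_gemm"]
assert verify_kernel_selection(profiled, "xqa", "mmha")
```

Substring matching is deliberate here: profiler-reported kernel names carry architecture and dtype suffixes, so matching on the kernel family is more robust than exact-name comparison.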

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Description check ⚠️ Warning: The pull request description is largely incomplete, following only the template structure without filling in the required sections. Resolution: fill in the Description section with what changed and why, add the specific test names and what they verify to the Test Coverage section, and complete the PR Checklist.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly identifies the main change, adding unit and e2e tests for the gpt_oss_20b MHA kernel, which aligns with the code changes.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/unittest/_torch/modeling/test_modeling_gpt_oss.py (1)

125-128: Consider adding .eval() as a best practice for inference models, though GPT-OSS has no training-sensitive branches.

Line 128 creates GptOssForCausalLM without calling .eval(). torch.inference_mode() already disables autograd for the profiling runs (lines 198-212), and the GPT-OSS model contains no Dropout layers or self.training conditional branches, so behavior is unaffected; adding .eval() is nonetheless good practice for clarity and consistency:

-        model = GptOssForCausalLM(model_config).cuda()
+        model = GptOssForCausalLM(model_config).cuda().eval()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_gpt_oss.py` around lines 125 -
128, The test constructs a GptOssForCausalLM instance without setting it to
evaluation mode; update the test to call .eval() on the model instance created
from GptOssForCausalLM(model_config) (e.g., call model.eval() before or after
.cuda()) so the model is explicitly in inference mode; this uses the existing
symbols ModelConfig, gpt_oss_config, attn_backend="TRTLLM", and
GptOssForCausalLM to locate the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dc769040-48cc-42e8-b9f6-32fe69cbdce5

📥 Commits

Reviewing files that changed from the base of the PR and between 740f110 and 1f159ef.

📒 Files selected for processing (3)
  • tests/integration/defs/perf/pytorch_model_config.py
  • tests/integration/test_lists/qa/llm_perf_core.yml
  • tests/unittest/_torch/modeling/test_modeling_gpt_oss.py

Comment on lines +6 to 23
import torch
from torch.profiler import ProfilerActivity
from transformers import AutoTokenizer, GptOssConfig
from utils.llm_data import llm_models_root
from utils.util import skip_no_hopper, skip_pre_hopper

import tensorrt_llm
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm._torch.attention_backend.utils import get_attention_backend
from tensorrt_llm._torch.metadata import KVCacheParams
from tensorrt_llm._torch.model_config import ModelConfig
from tensorrt_llm._torch.models.modeling_gpt_oss import GptOssForCausalLM
from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
from tensorrt_llm.bindings.executor import \
KvCacheConfig as BindingsKvCacheConfig
from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig, MoeConfig
from tensorrt_llm.mapping import Mapping

Contributor


⚠️ Potential issue | 🟠 Major

Add the required NVIDIA SPDX header before the import block.

This file was meaningfully modified in 2026 but still starts directly with imports, so the repository-mandated copyright/license header is missing.

As per coding guidelines, All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_gpt_oss.py` around lines 6 - 23,
Add the required NVIDIA SPDX copyright/header at the very top of the file
(before the first import, e.g., before the "import torch" line) using the
repository-mandated template and the latest meaningful modification year (2026);
ensure the header appears above the existing imports such as "import torch",
"from transformers import AutoTokenizer, GptOssConfig" and references to
"GptOssForCausalLM" so the file passes license checks.
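For reference, source files in this repository carry an SPDX-style NVIDIA header. A sketch of the expected shape (the exact wording is repository-mandated and should be copied from an existing source file rather than from this sketch; the license identifier below assumes the project's Apache-2.0 license):

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```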

Comment on lines +140 to +145
batch_size = 8
context_sequence_length = [] # no context (prefill) sequences
sequence_length = [1] * batch_size # all decode, 1 new token each
past_seen_tokens = [256] * batch_size # enough history for multi-block
request_ids = list(range(batch_size))
token_nums = [p + s for p, s in zip(past_seen_tokens, sequence_length)]
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
import ast
from pathlib import Path

path = Path("tests/unittest/_torch/modeling/test_modeling_gpt_oss.py")
tree = ast.parse(path.read_text(encoding="utf-8"))

for node in ast.walk(tree):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id == "zip":
        has_strict = any(keyword.arg == "strict" for keyword in node.keywords)
        print(f"Line {node.lineno}: zip(...), strict={has_strict}")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 95


Add strict=True to the zip() call on Line 145.

The zip() function on line 145 currently lacks the strict parameter, which allows silent truncation if past_seen_tokens and sequence_length lengths diverge. Since TensorRT-LLM requires Python 3.10+, use strict=True to detect such mismatches early. Ruff already flags this as B905.

Suggested fix
-    token_nums = [p + s for p, s in zip(past_seen_tokens, sequence_length)]
+    token_nums = [
+        p + s
+        for p, s in zip(past_seen_tokens, sequence_length, strict=True)
+    ]
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 145-145: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_gpt_oss.py` around lines 140 -
145, The zip() used to compute token_nums (token_nums = [p + s for p, s in
zip(past_seen_tokens, sequence_length)]) should include strict=True to avoid
silent truncation when past_seen_tokens and sequence_length lengths differ;
update that list comprehension to call zip(past_seen_tokens, sequence_length,
strict=True) so mismatched lengths raise an error during tests (references:
variables past_seen_tokens, sequence_length, token_nums).

@tensorrt-cicd
Collaborator

PR_Github #42091 [ run ] completed with state FAILURE. Commit: 1f159ef
/LLM/main/L0_MergeRequest_PR pipeline #32930 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ruodil
Collaborator Author

ruodil commented Apr 8, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42240 [ run ] triggered by Bot. Commit: 1f159ef Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42240 [ run ] completed with state SUCCESS. Commit: 1f159ef
/LLM/main/L0_MergeRequest_PR pipeline #33050 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
