[None][test] add unit test and e2e test for gpt_oss_20b MHA kernel #12796

Open
ruodil wants to merge 1 commit into NVIDIA:main from ruodil:user/ruodil/rcca

Conversation

@ruodil
Collaborator

@ruodil ruodil commented Apr 7, 2026

Summary by CodeRabbit

  • Tests

    • Added performance test suite for GPT-OSS 20B FP4 model with multiple input/output scenarios.
    • Added CUDA attention kernel selection verification test to ensure optimal kernel usage on supported hardware.
  • Chores

    • Added performance configuration for GPT-OSS 20B FP4 model with optimized settings for CUDA execution.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Ruodi Lu <ruodil@users.noreply.github.com>
@ruodil ruodil requested review from a team as code owners April 7, 2026 07:25
@ruodil ruodil requested review from LarryXFly and yufeiwu-nv April 7, 2026 07:25
@ruodil
Collaborator Author

ruodil commented Apr 7, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42091 [ run ] triggered by Bot. Commit: 1f159ef Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

This change adds performance configurations and test cases for GPT-OSS 20B models with PyTorch float4 backend, along with a new unit test for CUDA kernel selection behavior using TensorRT-LLM integration.

Changes

  • Performance Configuration & Test Lists (tests/integration/defs/perf/pytorch_model_config.py, tests/integration/test_lists/qa/llm_perf_core.yml): Added a PyTorch performance configuration for gpt_oss_20b_fp4-bench-pytorch-float4 with CUDA graph padding, KV cache settings, and MOE backend configuration, plus three performance test cases with varying input/output lengths and concurrency parameters.
  • Unit Testing (tests/unittest/_torch/modeling/test_modeling_gpt_oss.py): Added test_gpt_oss_xqa_kernel_selection(), which profiles CUDA activity to verify XQA kernel dispatch while excluding the MMHA kernel; includes model configuration, KV cache setup, and profiling verification logic.
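The verification logic in such a kernel-selection test reduces to scanning profiled CUDA kernel names for the expected dispatch. A minimal, self-contained sketch of that check (the helper and the kernel-name substrings are illustrative; in the actual test the names come from torch.profiler CUDA events):

```python
def verify_kernel_selection(kernel_names, expected_substr, forbidden_substr):
    """Return True iff some profiled kernel matches the expected family
    and no kernel matches the forbidden one."""
    saw_expected = any(expected_substr in name for name in kernel_names)
    saw_forbidden = any(forbidden_substr in name for name in kernel_names)
    return saw_expected and not saw_forbidden

# Placeholder names standing in for profiler-reported CUDA kernels.
profiled = ["xqa_kernel_dispatch", "elementwise_kernel", "cutlass_gemm"]
assert verify_kernel_selection(profiled, "xqa", "mmha")
```

Substring matching is deliberate here: profiler-reported kernel names carry architecture and dtype suffixes, so matching on the kernel family is more robust than exact-name comparison.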

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Description check ⚠️ Warning: The pull request description is largely incomplete, following only the template structure without filling in the required sections. Resolution: fill in the Description section with what changed and why, add the specific test names and what they verify to the Test Coverage section, and complete the PR Checklist.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The title clearly identifies the main change, adding unit and e2e tests for the gpt_oss_20b MHA kernel, which aligns with the code changes.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/unittest/_torch/modeling/test_modeling_gpt_oss.py (1)

125-128: Consider adding .eval() as a best practice for inference models, though GPT-OSS has no training-sensitive branches.

Line 128 creates GptOssForCausalLM without calling .eval(). torch.inference_mode() already disables autograd for the profiling runs (lines 198-212), and the GPT-OSS model contains no Dropout layers or self.training conditional branches, so behavior is unaffected; adding .eval() is nonetheless good practice for clarity and consistency:

-        model = GptOssForCausalLM(model_config).cuda()
+        model = GptOssForCausalLM(model_config).cuda().eval()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_gpt_oss.py` around lines 125 -
128, The test constructs a GptOssForCausalLM instance without setting it to
evaluation mode; update the test to call .eval() on the model instance created
from GptOssForCausalLM(model_config) (e.g., call model.eval() before or after
.cuda()) so the model is explicitly in inference mode; this uses the existing
symbols ModelConfig, gpt_oss_config, attn_backend="TRTLLM", and
GptOssForCausalLM to locate the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dc769040-48cc-42e8-b9f6-32fe69cbdce5

📥 Commits

Reviewing files that changed from the base of the PR and between 740f110 and 1f159ef.

📒 Files selected for processing (3)
  • tests/integration/defs/perf/pytorch_model_config.py
  • tests/integration/test_lists/qa/llm_perf_core.yml
  • tests/unittest/_torch/modeling/test_modeling_gpt_oss.py

Comment on lines +6 to 23
import torch
from torch.profiler import ProfilerActivity
from transformers import AutoTokenizer, GptOssConfig
from utils.llm_data import llm_models_root
from utils.util import skip_no_hopper, skip_pre_hopper

import tensorrt_llm
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm._torch.attention_backend.utils import get_attention_backend
from tensorrt_llm._torch.metadata import KVCacheParams
from tensorrt_llm._torch.model_config import ModelConfig
from tensorrt_llm._torch.models.modeling_gpt_oss import GptOssForCausalLM
from tensorrt_llm._torch.pyexecutor.resource_manager import KVCacheManager
from tensorrt_llm.bindings.executor import \
KvCacheConfig as BindingsKvCacheConfig
from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig, MoeConfig
from tensorrt_llm.mapping import Mapping

Contributor


⚠️ Potential issue | 🟠 Major

Add the required NVIDIA SPDX header before the import block.

This file was meaningfully modified in 2026 but still starts directly with imports, so the repository-mandated copyright/license header is missing.

As per coding guidelines, All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_gpt_oss.py` around lines 6 - 23,
Add the required NVIDIA SPDX copyright/header at the very top of the file
(before the first import, e.g., before the "import torch" line) using the
repository-mandated template and the latest meaningful modification year (2026);
ensure the header appears above the existing imports such as "import torch",
"from transformers import AutoTokenizer, GptOssConfig" and references to
"GptOssForCausalLM" so the file passes license checks.
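For reference, source files in this repository carry an SPDX-style NVIDIA header. A sketch of the expected shape (the exact wording is repository-mandated and should be copied from an existing source file rather than from this sketch; the license identifier below assumes the project's Apache-2.0 license):

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```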

Comment on lines +140 to +145
batch_size = 8
context_sequence_length = [] # no context (prefill) sequences
sequence_length = [1] * batch_size # all decode, 1 new token each
past_seen_tokens = [256] * batch_size # enough history for multi-block
request_ids = list(range(batch_size))
token_nums = [p + s for p, s in zip(past_seen_tokens, sequence_length)]
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
python - <<'PY'
import ast
from pathlib import Path

path = Path("tests/unittest/_torch/modeling/test_modeling_gpt_oss.py")
tree = ast.parse(path.read_text(encoding="utf-8"))

for node in ast.walk(tree):
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id == "zip":
        has_strict = any(keyword.arg == "strict" for keyword in node.keywords)
        print(f"Line {node.lineno}: zip(...), strict={has_strict}")
PY

Repository: NVIDIA/TensorRT-LLM

Length of output: 95


Add strict=True to the zip() call on Line 145.

The zip() function on line 145 currently lacks the strict parameter, which allows silent truncation if past_seen_tokens and sequence_length lengths diverge. Since TensorRT-LLM requires Python 3.10+, use strict=True to detect such mismatches early. Ruff already flags this as B905.

Suggested fix
-    token_nums = [p + s for p, s in zip(past_seen_tokens, sequence_length)]
+    token_nums = [
+        p + s
+        for p, s in zip(past_seen_tokens, sequence_length, strict=True)
+    ]
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 145-145: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_modeling_gpt_oss.py` around lines 140 -
145, The zip() used to compute token_nums (token_nums = [p + s for p, s in
zip(past_seen_tokens, sequence_length)]) should include strict=True to avoid
silent truncation when past_seen_tokens and sequence_length lengths differ;
update that list comprehension to call zip(past_seen_tokens, sequence_length,
strict=True) so mismatched lengths raise an error during tests (references:
variables past_seen_tokens, sequence_length, token_nums).

@tensorrt-cicd
Collaborator

PR_Github #42091 [ run ] completed with state FAILURE. Commit: 1f159ef
/LLM/main/L0_MergeRequest_PR pipeline #32930 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ruodil
Collaborator Author

ruodil commented Apr 8, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42240 [ run ] triggered by Bot. Commit: 1f159ef Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42240 [ run ] completed with state SUCCESS. Commit: 1f159ef
/LLM/main/L0_MergeRequest_PR pipeline #33050 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
