[None][fix] Fix GPQA Diamond filter_type mismatch in disagg accuracy … by yingguo-trt · Pull Request #13210 · NVIDIA/TensorRT-LLM

yingguo-trt · 2026-04-20T08:34:05Z

…test

The disagg YAML declared filter_type: flexible-extract, but the lm_eval config tests/integration/lm_eval_configs/gpqa_diamond_local.yaml only defines a strict-match filter. As a result, the accuracy checker could not find the expected row and marked the case as a "Benchmark run error" in aggregated JUnit, even though the underlying run succeeded.

GPQA is a multiple-choice task where strict-match is the correct filter; flexible-extract is semantically wrong here.
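The failure mode described above can be sketched in a few lines. This is a minimal illustration, not the actual TensorRT-LLM accuracy checker: it assumes results rows are keyed as `<metric>,<filter>` (the convention lm-evaluation-harness uses in its results output), and the function names `find_accuracy` and `classify`, the `exact_match` metric name, and the score values are all hypothetical.

```python
from typing import Optional

# Hypothetical lm_eval-style results. Only a strict-match row exists,
# mirroring a task config that defines only a strict-match filter.
# The numeric values are placeholders, not real measurements.
SAMPLE_RESULTS = {
    "gpqa_diamond_local": {
        "exact_match,strict-match": 0.71,
        "exact_match_stderr,strict-match": 0.03,
    }
}


def find_accuracy(results: dict, task: str, filter_type: str,
                  metric: str = "exact_match") -> Optional[float]:
    """Return the score for (metric, filter_type), or None if no such row exists."""
    task_rows = results.get(task, {})
    return task_rows.get(f"{metric},{filter_type}")


def classify(results: dict, task: str, filter_type: str) -> str:
    """Report an error when the expected row is missing, even if the run succeeded."""
    score = find_accuracy(results, task, filter_type)
    if score is None:
        return "Benchmark run error"
    return f"accuracy={score}"


if __name__ == "__main__":
    # Declared filter does not exist in the results: reported as an error.
    print(classify(SAMPLE_RESULTS, "gpqa_diamond_local", "flexible-extract"))
    # Declared filter matches the one the task defines: the score is found.
    print(classify(SAMPLE_RESULTS, "gpqa_diamond_local", "strict-match"))
```

The point of the sketch is that a missing row and a failed run are indistinguishable to a checker that only looks up one key, which is why the mismatch surfaced as a run error rather than a configuration error.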

Summary by CodeRabbit

Tests
- Updated accuracy validation testing to use stricter matching criteria for improved evaluation accuracy.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions)
- Any new dependencies have been scanned for license and vulnerabilities
- CODEOWNERS updated if ownership changes
- Documentation updated as needed
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…test

The disagg YAML declared filter_type: flexible-extract, but the lm_eval config tests/integration/lm_eval_configs/gpqa_diamond_local.yaml only defines a strict-match filter. As a result, the accuracy checker could not find the expected row and marked the case as a "Benchmark run error" in aggregated JUnit, even though the underlying run succeeded.

GPQA is a multiple-choice task where strict-match is the correct filter; flexible-extract is semantically wrong here.

Signed-off-by: yingguo-trt <244492186+yingguo-trt@users.noreply.github.com>

coderabbitai · 2026-04-20T08:36:03Z

📝 Walkthrough


A single YAML configuration file was modified to change the dataset accuracy validation filter type from flexible-extract to strict-match for the GPQA Diamond local dataset in a performance test script.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Performance Test Configuration `tests/scripts/perf/disaggregated/wideep_accuracy-deepseek-r1-fp4_gpqa_diamond_1k1k_ctx2_gen1_dep16_bs128_eplb288_mtp3_ccb-NIXL.yaml` | Updated `filter_type` for `gpqa_diamond_local` accuracy metadata from `flexible-extract` to `strict-match`. |
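A mismatch of this kind could be caught before a run with a small consistency check. The sketch below assumes the lm_eval task config has already been parsed into a dict and lists its filters under `filter_list`, each with a `name` field, as lm-evaluation-harness task YAMLs commonly do. The contents of `gpqa_diamond_local.yaml` are not shown in this PR, so the example config and the helper names `defined_filters` and `check_filter_type` are hypothetical.

```python
def defined_filters(task_config: dict) -> set:
    """Collect the filter names a parsed lm_eval task config defines."""
    return {
        entry["name"]
        for entry in task_config.get("filter_list", [])
        if "name" in entry
    }


def check_filter_type(declared: str, task_config: dict) -> None:
    """Raise ValueError if the declared filter_type is not defined by the task."""
    available = defined_filters(task_config)
    if declared not in available:
        raise ValueError(
            f"filter_type {declared!r} is not defined by the task config; "
            f"available filters: {sorted(available)}"
        )


# Hypothetical parsed task config with a single strict-match filter.
EXAMPLE_TASK_CONFIG = {
    "task": "gpqa_diamond_local",
    "filter_list": [{"name": "strict-match", "filter": []}],
}

if __name__ == "__main__":
    check_filter_type("strict-match", EXAMPLE_TASK_CONFIG)  # passes silently
    try:
        check_filter_type("flexible-extract", EXAMPLE_TASK_CONFIG)
    except ValueError as err:
        print(err)
```

Failing early with the list of available filters would point at the configuration rather than at the benchmark run.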

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | PR description explains the issue, cause, and solution clearly. However, required sections like 'Test Coverage' and 'Description' headers are not filled in, though the core information is provided. | Provide explicit details under the 'Description' and 'Test Coverage' sections. Clarify how this fix was validated and whether existing tests cover the accuracy checking behavior. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title follows the required format and accurately describes the fix: it identifies the issue (GPQA Diamond filter_type mismatch) and specifies the location (disagg accuracy config). |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


yingguo-trt · 2026-04-20T08:42:45Z

/bot skip --comment "Not cover in CI pipelines"

fredricz-20070104

Approved. This is a bug.

tensorrt-cicd · 2026-04-20T08:48:32Z

PR_Github #44407 [ skip ] triggered by Bot. Commit: c98f1c0 Link to invocation

tensorrt-cicd · 2026-04-20T08:55:38Z

PR_Github #44407 [ skip ] completed with state SUCCESS. Commit: c98f1c0
Skipping testing for commit c98f1c0

Link to invocation

github-actions Bot assigned yingguo-trt Apr 20, 2026

fredricz-20070104 approved these changes Apr 20, 2026

View reviewed changes

fredricz-20070104 merged commit 33b3fd3 into NVIDIA:main Apr 20, 2026
8 of 10 checks passed

fredricz-20070104 merged 1 commit into NVIDIA:main from yingguo-trt:fix/gpqa-diamond-filter-type
