[TRTLLM-11574][feat] Some updates on Perf Sanity System codes by chenfeiz0326 · Pull Request #12430 · NVIDIA/TensorRT-LLM

chenfeiz0326 · 2026-03-22T04:21:06Z

Refactor the perf regression system to use a three-layer architecture.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-03-22T04:29:48Z

📝 Walkthrough

This pull request removes the pre-merge perf sanity HTML report generation pipeline and replaces it with runtime regression checking. The Jenkins pipeline no longer generates or uploads perf reports, while the pytest execution now passes a stageName environment variable. Perf data records now include stage name and test list fields, and regression analysis is performed at test time rather than as a post-processing step.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Jenkins Pipeline Configuration** `jenkins/L0_MergeRequest.groovy`, `jenkins/L0_Test.groovy` | Removed the "Collect Perf Sanity Test Result" stage that generated and uploaded HTML reports. Added `stageName=${stageName}` to environment variables passed to pytest execution. |
| **Performance Report Generation** `jenkins/scripts/perf/get_pre_merge_html.py` | Removed entire script that loaded perf data YAML, queried OpenSearch history, and generated self-contained HTML reports with inline SVG charts. |
| **Perf Testing Infrastructure** `tests/integration/defs/perf/open_search_db_utils.py` | Replaced `generate_perf_yaml()` with regression checking logic; removed `SCENARIO_MATCH_FIELDS` constant; implemented `check_perf_regression()` to filter regression entries, write `regression_data.yaml` optionally, and print regression details. |
| **Perf Sanity Tests** `tests/integration/defs/perf/test_perf_sanity.py` | Switched postprocessing from YAML generation to regression checking with `fail_on_regression=not is_post_merge`. Removed scenario-specific match-field logic. Added `s_stage_name` and `s_test_list` fields to perf data records. |
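
As a rough illustration of the flow summarized above, the following sketch shows test-time regression checking on plain dict records. It assumes each record carries a `b_is_regression` flag and that the stage name arrives through the `stageName` environment variable. The helper names `build_record` and `check_perf_regression_sketch`, the `"unknown"` default, and the `RuntimeError` are hypothetical and are not taken from `open_search_db_utils.py`.

```python
import os


def build_record(metrics, test_list):
    """Attach stage and test-list fields to a perf record (hypothetical helper)."""
    record = dict(metrics)
    # The pipeline passes stageName=... in the environment; the default is assumed.
    record["s_stage_name"] = os.environ.get("stageName", "unknown")
    record["s_test_list"] = test_list
    return record


def check_perf_regression_sketch(new_data_dict, fail_on_regression=False,
                                 print_func=print):
    """Return the regressive records and optionally fail when any are found."""
    regressive = [
        record for record in new_data_dict.values()
        if record.get("b_is_regression", False)
    ]
    if not regressive:
        print_func("No regression data found.")
        return regressive
    for record in regressive:
        print_func(
            f"Regression in stage {record.get('s_stage_name')}: "
            f"{record.get('s_regression_info', '')}"
        )
    if fail_on_regression:
        # Assumed failure mechanism; the real code may fail the test differently.
        raise RuntimeError(f"{len(regressive)} perf regression(s) detected")
    return regressive
```

With `fail_on_regression=not is_post_merge`, a pre-merge run would raise on the first batch containing a regression, while a post-merge run would only print the details.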

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | The PR description is incomplete, containing only the repository template with empty sections for Description and Test Coverage. | Complete the Description section explaining what changes were made and why. Add a Test Coverage section listing relevant tests that validate the refactoring. |
| Title check | ❓ Inconclusive | The title uses vague language ('Some updates') that obscures the actual scope of changes, which involve removing perf sanity report generation and shifting to regression checking. | Consider using a more descriptive title like 'Replace perf sanity report generation with regression checking' or 'Integrate regression checking into perf validation workflow'. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/open_search_db_utils.py`:
- Around line 649-653: The code currently treats records with the default
b_is_regression=False as "no regression" even when history lookups were skipped;
update prepare_regressive_test_cases() to surface whether a history comparison
actually ran (e.g., set a per-record flag like b_history_checked or return an
overall comparisons_performed boolean) and then change the
regressive_data_list/filtering (the comprehension over new_data_dict.values()
and the logging that prints "No regression data found.") to only consider
records where history was checked (or only log "no regression" when
comparisons_performed is true). Ensure both locations (the regressive_data_list
filter and the similar branch around lines ~700-702) use that new flag/returned
boolean to avoid false-green reports during skipped history lookups.
- Around line 631-637: The loop that prints config entries (using config_keys,
print_func, and skipping s_regression_info) currently emits sensitive env var
fields like s_server_env_var and s_client_env_vars; update the printing logic to
redact any key that contains "env_var" or matches disaggregated variants (e.g.,
keys starting with "s_" and containing "env_var") before calling print_func, or
better yet replace the current full-dump approach with an explicit allowlist of
safe keys to print and skip everything else; ensure you reference and modify the
block that builds config_keys and the loop that iterates over it so keys
matching the env-var pattern are printed as redacted (e.g., "<REDACTED>") or
omitted.
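
The two findings above could be addressed along these lines. This is a minimal sketch on plain dict records, not the repository's code: `redact_config`, `summarize_regressions`, the `b_history_checked` flag (one of the options the review suggests), and the "skipped" message are all hypothetical.

```python
REDACTED = "<REDACTED>"


def redact_config(record):
    """Copy a record for printing, masking env-var fields.

    Any key containing "env_var" (for example s_server_env_var or
    s_client_env_vars) is masked; s_regression_info is skipped, as in
    the loop the review describes.
    """
    safe = {}
    for key, value in record.items():
        if key == "s_regression_info":
            continue
        safe[key] = REDACTED if "env_var" in key else value
    return safe


def summarize_regressions(records, print_func=print):
    """Report regressions only for records whose history was actually compared."""
    checked = [r for r in records if r.get("b_history_checked", False)]
    regressive = [r for r in checked if r.get("b_is_regression", False)]
    if not checked:
        # Avoid a false green: nothing was compared, so nothing is known.
        print_func("History comparison skipped; regression status unknown.")
    elif not regressive:
        print_func("No regression data found.")
    for record in regressive:
        print_func(str(redact_config(record)))
    return regressive
```

An explicit allowlist of printable keys, which the review also mentions, would be stricter than substring matching because newly added sensitive fields stay hidden by default.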

In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1525-1529: The aggregated-match branch currently always extends
match_keys with server_config.to_match_keys() and client_config.to_match_keys(),
ignoring any ServerConfig.match_mode/client_config.match_mode declarations;
change the logic so after seeding match_keys with ["s_gpu_type","s_runtime"] it
only extends server_config.to_match_keys() if server_config.match_mode !=
"scenario" (and likewise only extend client_config.to_match_keys() if
client_config.match_mode != "scenario"), so configs that declare match_mode:
scenario keep the scenario-only matching behavior; reference
ServerConfig.match_mode, match_keys, server_config.to_match_keys(), and
client_config.to_match_keys() when making the conditional change.
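
The conditional described in this comment might look like the sketch below. `ConfigSketch` is a hypothetical stand-in for the real `ServerConfig` and client config classes, and its default `match_mode` value of `"config"` is an assumption; only the `"scenario"` value, `to_match_keys()`, and the two seed keys come from the review text.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for a server or client config."""
    keys: list = field(default_factory=list)
    match_mode: str = "config"  # assumed default; "scenario" is named in the review

    def to_match_keys(self):
        return list(self.keys)


def build_match_keys(server_config, client_config):
    """Seed the match keys, then extend only for non-scenario configs."""
    match_keys = ["s_gpu_type", "s_runtime"]
    if server_config.match_mode != "scenario":
        match_keys.extend(server_config.to_match_keys())
    if client_config.match_mode != "scenario":
        match_keys.extend(client_config.to_match_keys())
    return match_keys
```

A config that declares `match_mode: scenario` then contributes no extra keys, so its history lookup is not narrowed by that config's own fields.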

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fec06597-6aa7-4870-a866-c3f3456d511f

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd98b8 and 14f412b.

📒 Files selected for processing (5)

jenkins/L0_MergeRequest.groovy
jenkins/L0_Test.groovy
jenkins/scripts/perf/get_pre_merge_html.py
tests/integration/defs/perf/open_search_db_utils.py
tests/integration/defs/perf/test_perf_sanity.py

💤 Files with no reviewable changes (2)

jenkins/L0_MergeRequest.groovy
jenkins/scripts/perf/get_pre_merge_html.py

chenfeiz0326 · 2026-03-23T14:26:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-23T14:32:22Z

PR_Github #39947 [ run ] triggered by Bot. Commit: cf59812 Link to invocation

tensorrt-cicd · 2026-03-23T23:27:41Z

PR_Github #39947 [ run ] completed with state SUCCESS. Commit: cf59812
/LLM/main/L0_MergeRequest_PR pipeline #31112 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

chenfeiz0326 · 2026-03-24T02:32:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-24T02:38:15Z

PR_Github #40025 [ run ] triggered by Bot. Commit: cf59812 Link to invocation

tensorrt-cicd · 2026-03-24T06:05:42Z

PR_Github #40025 [ run ] completed with state SUCCESS. Commit: cf59812
/LLM/main/L0_MergeRequest_PR pipeline #31182 completed with status: 'FAILURE'

chenfeiz0326 · 2026-03-24T06:16:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-24T06:23:12Z

PR_Github #40072 [ run ] triggered by Bot. Commit: cf59812 Link to invocation

tensorrt-cicd · 2026-03-24T09:05:19Z

PR_Github #40072 [ run ] completed with state SUCCESS. Commit: cf59812
/LLM/main/L0_MergeRequest_PR pipeline #31224 completed with status: 'FAILURE'

chenfeiz0326 · 2026-03-24T14:24:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-24T14:31:21Z

PR_Github #40129 [ run ] triggered by Bot. Commit: cf59812 Link to invocation

tensorrt-cicd · 2026-03-24T17:57:36Z

PR_Github #40129 [ run ] completed with state SUCCESS. Commit: cf59812
/LLM/main/L0_MergeRequest_PR pipeline #31276 completed with status: 'FAILURE'

chenfeiz0326 · 2026-03-25T02:14:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-25T02:20:01Z

PR_Github #40216 [ run ] triggered by Bot. Commit: cf59812 Link to invocation

tensorrt-cicd · 2026-03-25T08:50:51Z

PR_Github #40216 [ run ] completed with state SUCCESS. Commit: cf59812
/LLM/main/L0_MergeRequest_PR pipeline #31353 completed with status: 'FAILURE'

chenfeiz0326 · 2026-03-25T14:34:22Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-25T14:40:14Z

PR_Github #40342 [ run ] triggered by Bot. Commit: cf59812 Link to invocation

tensorrt-cicd · 2026-03-25T19:36:42Z

PR_Github #40342 [ run ] completed with state FAILURE. Commit: cf59812
/LLM/main/L0_MergeRequest_PR pipeline #31447 completed with status: 'FAILURE'

chenfeiz0326 · 2026-03-26T14:05:24Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-01T07:54:11Z

PR_Github #41155 [ run ] triggered by Bot. Commit: 5d5147a Link to invocation

tensorrt-cicd · 2026-04-01T13:04:43Z

PR_Github #41155 [ run ] completed with state SUCCESS. Commit: 5d5147a
/LLM/main/L0_MergeRequest_PR pipeline #32124 completed with status: 'SUCCESS'

CI Report

Link to invocation

chenfeiz0326 · 2026-04-01T15:32:01Z

/bot run --disable-fail-fast --post-merge

tensorrt-cicd · 2026-04-01T15:32:02Z

PR_Github #41217 [ run ] triggered by Bot. Commit: 3889600 Link to invocation

tensorrt-cicd · 2026-04-01T15:37:36Z

PR_Github #41220 [ run ] triggered by Bot. Commit: 3889600 Link to invocation

tensorrt-cicd · 2026-04-01T22:56:46Z

PR_Github #41220 [ run ] completed with state SUCCESS. Commit: 3889600
/LLM/main/L0_MergeRequest_PR pipeline #32180 completed with status: 'FAILURE'

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 · 2026-04-02T04:03:00Z

/bot run --disable-fail-fast --post-merge

tensorrt-cicd · 2026-04-02T04:09:04Z

PR_Github #41331 [ run ] triggered by Bot. Commit: fc5cd09 Link to invocation

tensorrt-cicd · 2026-04-02T10:59:27Z

PR_Github #41331 [ run ] completed with state SUCCESS. Commit: fc5cd09
/LLM/main/L0_MergeRequest_PR pipeline #32279 completed with status: 'FAILURE'

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 · 2026-04-03T14:55:46Z

/bot run --disable-fail-fast --post-merge

tensorrt-cicd · 2026-04-03T15:01:29Z

PR_Github #41688 [ run ] triggered by Bot. Commit: 3f1167e Link to invocation

tensorrt-cicd · 2026-04-04T13:50:51Z

PR_Github #41688 [ run ] completed with state SUCCESS. Commit: 3f1167e
/LLM/main/L0_MergeRequest_PR pipeline #32591 completed with status: 'ABORTED'

chenfeiz0326 · 2026-04-07T02:40:20Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-07T02:46:54Z

PR_Github #42036 [ run ] triggered by Bot. Commit: 3f1167e Link to invocation

chenfeiz0326 · 2026-04-07T09:34:22Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-07T09:37:14Z

PR_Github #42036 [ run ] completed with state SUCCESS. Commit: 3f1167e
/LLM/main/L0_MergeRequest_PR pipeline #32882 completed with status: 'FAILURE'

tensorrt-cicd · 2026-04-07T09:40:26Z

PR_Github #42121 [ run ] triggered by Bot. Commit: 3f1167e Link to invocation

tensorrt-cicd · 2026-04-07T10:28:23Z

PR_Github #42121 [ run ] completed with state SUCCESS. Commit: 3f1167e
/LLM/main/L0_MergeRequest_PR pipeline #32958 completed with status: 'SUCCESS'

CI Report

Link to invocation

…#12430) Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster> Co-authored-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 requested review from a team as code owners March 22, 2026 04:21

chenfeiz0326 requested review from dpitman-nvda and zeroepoch March 22, 2026 04:21

github-actions bot assigned chenfeiz0326 Mar 22, 2026

coderabbitai bot reviewed Mar 22, 2026

Comment thread tests/integration/defs/perf/open_search_db_utils.py Outdated

Comment thread tests/integration/defs/perf/open_search_db_utils.py Outdated

Comment thread tests/integration/defs/perf/test_perf_sanity.py

LarryXFly approved these changes Mar 23, 2026

dpitman-nvda approved these changes Mar 23, 2026

chenfeiz0326 enabled auto-merge (squash) March 24, 2026 06:24

chenfeiz0326 disabled auto-merge April 1, 2026 08:39

update

9dbe2fa

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 force-pushed the chenfeiz/integrate-pre-merge-dashboard-into-ci-report branch from 3889600 to 9dbe2fa Compare April 2, 2026 03:55

update

fc5cd09

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

update

22a6053

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 changed the title ~~[TRTLLM-11574][feat] Some updates for integrating Pre-Merge Report into CI Report~~ [TRTLLM-11574][feat] Some updates on Perf Sanity System codes Apr 3, 2026

Chenfei Zhang added 2 commits April 3, 2026 01:43

update

70f981a

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

update

3f1167e

Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>

chenfeiz0326 enabled auto-merge (squash) April 7, 2026 04:02

chenfeiz0326 merged commit 390c093 into NVIDIA:main Apr 7, 2026
5 checks passed
