[None][infra] Reenable GB300-4_GPUs-PyTorch-Post-Merge-1 by mlefeb01 · Pull Request #13097 · NVIDIA/TensorRT-LLM

mlefeb01 · 2026-04-16T01:50:20Z

Summary by CodeRabbit

Tests
- Enabled a 4-GPU PyTorch post-merge test stage with updated platform configuration.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

mlefeb01 · 2026-04-16T01:52:18Z

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

coderabbitai · 2026-04-16T01:53:06Z

📝 Walkthrough


The `GB300-4_GPUs-PyTorch-Post-Merge-1` stage entry in the Jenkins pipeline is uncommented, and its platform/cluster selector is changed from "gb300-x4" to "auto:gb300-flex". This enables the stage for post-merge PyTorch testing on four GPUs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Jenkins Pipeline Configuration `jenkins/L0_Test.groovy` | Modified platform/cluster selector for `GB300-4_GPUs-PyTorch-Post-Merge-1` stage from hardcoded `"gb300-x4"` to `"auto:gb300-flex"` and enabled the entry. |
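
For readers unfamiliar with the stage map, here is a minimal Python sketch of how an entry such as `["auto:gb300-flex", "l0_gb300_multi_gpus", 1, 1, 4]` might be interpreted. The field names (`platform`, `test_list`, `split_id`, `split_count`, `gpu_count`) and the helper `parse_stage_entry` are hypothetical labels chosen for illustration. They are inferred from the shape of the entries quoted in this PR and are not taken from `jenkins/L0_Test.groovy`, so treat the meaning of each position as an assumption.

```python
from typing import NamedTuple, Sequence, Union


class StageEntry(NamedTuple):
    """Positional values of one stage entry; all field names are hypothetical."""
    platform: str        # platform/cluster selector, e.g. "auto:gb300-flex"
    test_list: str       # test list name, e.g. "l0_gb300_multi_gpus"
    split_id: int        # assumed: index of this shard
    split_count: int     # assumed: total number of shards
    gpu_count: int = 1   # assumed: 1 when the fifth value is omitted


def parse_stage_entry(values: Sequence[Union[str, int]]) -> StageEntry:
    """Map a 4- or 5-element list onto named fields."""
    if len(values) not in (4, 5):
        raise ValueError(f"expected 4 or 5 values, got {len(values)}")
    return StageEntry(*values)


# The entry enabled by this PR, as quoted in the review.
stages = {
    "GB300-4_GPUs-PyTorch-Post-Merge-1": [
        "auto:gb300-flex", "l0_gb300_multi_gpus", 1, 1, 4,
    ],
}

for name, values in stages.items():
    entry = parse_stage_entry(values)
    print(f"{name}: platform={entry.platform}, gpus={entry.gpu_count}")
```

Under these assumptions, the only fields this PR touches are the platform selector and whether the entry is commented out; the test list and the remaining values are unchanged.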

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is largely empty, containing only the template structure with no actual content in the required sections (Description, Test Coverage). | Provide a detailed description explaining why the GB300-4_GPUs-PyTorch-Post-Merge-1 stage is being re-enabled and specify relevant tests that verify the changes. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly describes the main change: re-enabling a specific CI stage (GB300-4_GPUs-PyTorch-Post-Merge-1) which aligns with the changeset modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

🧹 Nitpick comments (1)

jenkins/L0_Test.groovy (1)

2970-2972: Refresh the GB300 disablement comment to match the now-enabled stage.

Line 2970 says GB300 stages are disabled, but Line 2972 re-enables one. Please update the comment to avoid operator confusion.

Suggested diff

-        // Disable GB300 stages due to nodes will be offline temporarily.
+        // Keep some GB300 stages disabled while re-enabling selected post-merge coverage.
         // "GB300-PyTorch-1": ["gb300-single", "l0_gb300", 1, 1],
         "GB300-4_GPUs-PyTorch-Post-Merge-1": ["auto:gb300-flex", "l0_gb300_multi_gpus", 1, 1, 4],

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 2970 - 2972, Update the misleading
comment that says "Disable GB300 stages due to nodes will be offline
temporarily." to reflect that the "GB300-4_GPUs-PyTorch-Post-Merge-1" stage is
now enabled; locate the comment above the GB300 stage entries and change it to
something like "Adjust GB300 stages: most disabled,
GB300-4_GPUs-PyTorch-Post-Merge-1 re-enabled" (referencing the exact stage
identifier "GB300-4_GPUs-PyTorch-Post-Merge-1") so operators won't be confused.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1cfe824e-1112-465e-bd6c-3e0d86479fa8

📥 Commits

Reviewing files that changed from the base of the PR and between ec34644 and e0198a9.

📒 Files selected for processing (1)

jenkins/L0_Test.groovy

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>

mlefeb01 · 2026-04-16T02:30:52Z

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2026-04-16T02:37:51Z

PR_Github #43619 [ run ] triggered by Bot. Commit: 2e85aec Link to invocation

tensorrt-cicd · 2026-04-16T03:23:31Z

PR_Github #43619 [ run ] completed with state FAILURE. Commit: 2e85aec
/LLM/main/L0_MergeRequest_PR pipeline #34108 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>

mlefeb01 · 2026-04-16T03:29:21Z

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2026-04-16T03:34:57Z

PR_Github #43639 [ run ] triggered by Bot. Commit: b67ad2c Link to invocation

tensorrt-cicd · 2026-04-16T07:46:26Z

PR_Github #43639 [ run ] completed with state SUCCESS. Commit: b67ad2c
/LLM/main/L0_MergeRequest_PR pipeline #34128 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

mlefeb01 · 2026-04-16T17:57:14Z

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2026-04-16T18:03:23Z

PR_Github #43821 [ run ] triggered by Bot. Commit: c32268b Link to invocation

tensorrt-cicd · 2026-04-17T01:10:43Z

PR_Github #43821 [ run ] completed with state FAILURE. Commit: c32268b
/LLM/main/L0_MergeRequest_PR pipeline #34296 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

mlefeb01 · 2026-04-17T01:56:35Z

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2026-04-17T02:03:46Z

PR_Github #43880 [ run ] triggered by Bot. Commit: c32268b Link to invocation

tensorrt-cicd · 2026-04-17T02:20:58Z

PR_Github #43880 [ run ] completed with state FAILURE. Commit: c32268b
/LLM/main/L0_MergeRequest_PR pipeline #34332 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

mlefeb01 · 2026-04-17T16:39:20Z

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

tensorrt-cicd · 2026-04-17T16:45:05Z

PR_Github #44056 [ run ] triggered by Bot. Commit: c32268b Link to invocation

tensorrt-cicd · 2026-04-18T08:47:00Z

PR_Github #44056 [ run ] completed with state SUCCESS. Commit: c32268b
/LLM/main/L0_MergeRequest_PR pipeline #34490 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

mlefeb01 · 2026-04-20T17:24:49Z

/bot skip --comment "Sufficient testing"

tensorrt-cicd · 2026-04-20T17:32:41Z

PR_Github #44485 [ skip ] triggered by Bot. Commit: 78ff0c2 Link to invocation

tensorrt-cicd · 2026-04-20T17:45:12Z

PR_Github #44485 [ skip ] completed with state SUCCESS. Commit: 78ff0c2
Skipping testing for commit 78ff0c2

Link to invocation

mlefeb01 self-assigned this Apr 16, 2026

mlefeb01 requested review from a team as code owners April 16, 2026 01:50

mlefeb01 requested review from niukuo and zeroepoch April 16, 2026 01:50

coderabbitai Bot reviewed Apr 16, 2026

Reenable GB300-4_GPUs-PyTorch-Post-Merge-1

2e85aec

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>

mlefeb01 force-pushed the gb300-slurm branch from e0198a9 to 2e85aec Compare April 16, 2026 02:30

Change platform to -x4

b67ad2c

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>

zeroepoch approved these changes Apr 16, 2026

mzweilz approved these changes Apr 16, 2026

Merge branch 'main' into gb300-slurm

c32268b

zeroepoch approved these changes Apr 17, 2026

Merge branch 'main' into gb300-slurm

78ff0c2

mlefeb01 merged commit 7b84136 into NVIDIA:main Apr 20, 2026
5 checks passed

mlefeb01 deleted the gb300-slurm branch April 20, 2026 17:53
