
[None][infra] Reenable GB300-4_GPUs-PyTorch-Post-Merge-1 #13097

Merged
mlefeb01 merged 4 commits into NVIDIA:main from mlefeb01:gb300-slurm
Apr 20, 2026


@mlefeb01
Collaborator

mlefeb01 commented Apr 16, 2026

Summary by CodeRabbit

  • Tests
    • Enabled a 4-GPU PyTorch post-merge test stage with updated platform configuration.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

mlefeb01 self-assigned this Apr 16, 2026
mlefeb01 requested review from a team as code owners April 16, 2026 01:50
mlefeb01 requested review from niukuo and zeroepoch April 16, 2026 01:50
@mlefeb01
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough


The Jenkins pipeline entry for the GB300-4_GPUs-PyTorch-Post-Merge-1 test stage is uncommented, and its platform/cluster selector changes from "gb300-x4" to "auto:gb300-flex". This enables the stage for post-merge PyTorch testing on four GPUs.
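As a rough sketch of the change described above, the stage entry plausibly looked like this before and after. The "after" line matches the entry quoted in the suggested diff of this review. The "before" line is reconstructed from the walkthrough text and is an assumption, as is the enclosing map name `stageConfigs`, which is purely illustrative and not taken from jenkins/L0_Test.groovy.

```groovy
// Hypothetical excerpt: the map name `stageConfigs` is an assumption for illustration.
def stageConfigs = [
    // Before (assumed form of the disabled entry, with the old selector):
    // "GB300-4_GPUs-PyTorch-Post-Merge-1": ["gb300-x4", "l0_gb300_multi_gpus", 1, 1, 4],

    // After (entry as quoted in the review's suggested diff):
    "GB300-4_GPUs-PyTorch-Post-Merge-1": ["auto:gb300-flex", "l0_gb300_multi_gpus", 1, 1, 4],
]
```

The meaning of the trailing numeric fields is not stated in this conversation, so they are left as they appear in the review.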

Changes

| Cohort / File(s) | Summary |
|---|---|
| Jenkins Pipeline Configuration: jenkins/L0_Test.groovy | Modified platform/cluster selector for GB300-4_GPUs-PyTorch-Post-Merge-1 stage from hardcoded "gb300-x4" to "auto:gb300-flex" and enabled the entry. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is largely empty, containing only the template structure with no actual content in the required sections (Description, Test Coverage). | Provide a detailed description explaining why the GB300-4_GPUs-PyTorch-Post-Merge-1 stage is being re-enabled and specify relevant tests that verify the changes. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly describes the main change: re-enabling a specific CI stage (GB300-4_GPUs-PyTorch-Post-Merge-1), which aligns with the changeset modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

coderabbitai Bot left a comment


🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

2970-2972: Refresh the GB300 disablement comment to match the now-enabled stage.

Line 2970 says GB300 stages are disabled, but Line 2972 re-enables one. Please update the comment to avoid operator confusion.

Suggested diff
-        // Disable GB300 stages due to nodes will be offline temporarily.
+        // Keep some GB300 stages disabled while re-enabling selected post-merge coverage.
         // "GB300-PyTorch-1": ["gb300-single", "l0_gb300", 1, 1],
         "GB300-4_GPUs-PyTorch-Post-Merge-1": ["auto:gb300-flex", "l0_gb300_multi_gpus", 1, 1, 4],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 2970 - 2972, Update the misleading
comment that says "Disable GB300 stages due to nodes will be offline
temporarily." to reflect that the "GB300-4_GPUs-PyTorch-Post-Merge-1" stage is
now enabled; locate the comment above the GB300 stage entries and change it to
something like "Adjust GB300 stages: most disabled,
GB300-4_GPUs-PyTorch-Post-Merge-1 re-enabled" (referencing the exact stage
identifier "GB300-4_GPUs-PyTorch-Post-Merge-1") so operators won't be confused.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 1cfe824e-1112-465e-bd6c-3e0d86479fa8

📥 Commits

Reviewing files that changed from the base of the PR and between ec34644 and e0198a9.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
@mlefeb01
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #43619 [ run ] triggered by Bot. Commit: 2e85aec Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43619 [ run ] completed with state FAILURE. Commit: 2e85aec
/LLM/main/L0_MergeRequest_PR pipeline #34108 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
@mlefeb01
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #43639 [ run ] triggered by Bot. Commit: b67ad2c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43639 [ run ] completed with state SUCCESS. Commit: b67ad2c
/LLM/main/L0_MergeRequest_PR pipeline #34128 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@mlefeb01
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #43821 [ run ] triggered by Bot. Commit: c32268b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43821 [ run ] completed with state FAILURE. Commit: c32268b
/LLM/main/L0_MergeRequest_PR pipeline #34296 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@mlefeb01
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #43880 [ run ] triggered by Bot. Commit: c32268b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43880 [ run ] completed with state FAILURE. Commit: c32268b
/LLM/main/L0_MergeRequest_PR pipeline #34332 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@mlefeb01
Collaborator Author

/bot run --disable-fail-fast --stage-list "GB300-4_GPUs-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #44056 [ run ] triggered by Bot. Commit: c32268b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44056 [ run ] completed with state SUCCESS. Commit: c32268b
/LLM/main/L0_MergeRequest_PR pipeline #34490 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@mlefeb01
Collaborator Author

/bot skip --comment "Sufficient testing"

@tensorrt-cicd
Collaborator

PR_Github #44485 [ skip ] triggered by Bot. Commit: 78ff0c2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44485 [ skip ] completed with state SUCCESS. Commit: 78ff0c2
Skipping testing for commit 78ff0c2

Link to invocation

mlefeb01 merged commit 7b84136 into NVIDIA:main Apr 20, 2026
5 checks passed
mlefeb01 deleted the gb300-slurm branch April 20, 2026 17:53


4 participants