Skip to content

[TRTINFRA-7698][infra] - Add SSH key authentication support for SLURM clusters#12172

Merged
mlefeb01 merged 6 commits intoNVIDIA:mainfrom
mlefeb01:slurm-ssh-auth
Mar 18, 2026
Merged

[TRTINFRA-7698][infra] - Add SSH key authentication support for SLURM clusters#12172
mlefeb01 merged 6 commits intoNVIDIA:mainfrom
mlefeb01:slurm-ssh-auth

Conversation

@mlefeb01
Copy link
Collaborator

@mlefeb01 mlefeb01 commented Mar 12, 2026

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@mlefeb01 mlefeb01 self-assigned this Mar 12, 2026
@mlefeb01 mlefeb01 requested review from a team as code owners March 12, 2026 23:03
@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 12, 2026

📝 Walkthrough

Walkthrough

The PR refactors SSH credential handling in the Jenkins test pipeline by introducing centralized credential management through CloudManager. Function signatures are updated to include cluster context and interruption tracking. A new SCP helper function is added, and result gathering is enhanced to include performance metrics and handle interruptions appropriately.

Changes

Cohort / File(s) Summary
SCP Helper & Credential Refactoring
jenkins/L0_Test.groovy
Added scpFromRemoteCmd function to build SCP commands. Refactored uploadResults, cleanUpSlurmResources, and cleanUpNodeResources to use CloudManager.withSlurmSshCredentials instead of ad-hoc credential handling. Updated function signatures to include clusterName and stageIsInterrupted flag for context-aware operations.
Result Gathering & Interruption Handling
jenkins/L0_Test.groovy
Enhanced result collection to download performance (perf) results alongside timeout and normal results. Added stageIsInterrupted flag tracking to conditionally generate timeout XML and influence result upload decisions. Updated runLLMTestlistWithAgent and runLLMTestlistWithSbatch to initialize and pass interruption state.
Host Type Exemption Logic
jenkins/L0_Test.groovy
Expanded exemption condition in Slurm-initiated test paths to include nsc-svg host type alongside existing logic for determining portal assignment.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is incomplete. Only the template is present with no actual description content, test coverage details, or explanation of changes. Add a detailed description explaining the SSH key authentication changes, list relevant test coverage, and fill in other required sections beyond the checklist.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding SSH key authentication support for SLURM clusters, which aligns with the detailed refactoring of SSH credential handling in the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can generate a title for your PR based on the changes.

Add @coderabbitai title placeholder anywhere in the title of your PR and CodeRabbit will replace it with a title based on the changes in the PR. You can change the placeholder by changing the reviews.auto_title_placeholder setting.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 848-850: The variable stageIsInterrupted is initialized but never
updated, so interrupted-stage upload logic never triggers; update the code to
set stageIsInterrupted = true when the stage is interrupted (e.g., inside the
interruption/exception handler or catch block that handles InterruptedException,
AbortException, or when you detect current build/thread is aborted) before
control proceeds to the upload section where stageIsInterrupted is read (the
variable initialized as stageIsInterrupted and later passed at the upload call
around line where it's used). Ensure the assignment occurs in the same scope as
the existing def stageIsInterrupted and before the upload invocation so the
upload branch for interrupted stages runs correctly.
- Around line 112-119: scpFromRemoteCmd currently builds scp commands without
recursion support; change its signature to def scpFromRemoteCmd(Map remote,
String remotePath, String localPath, boolean recursive=false) and, when
recursive is true, include the "-r " flag in the constructed command for both
key-auth and password-auth branches; then update the perf download caller(s)
that copy directories (the code that finds directories) to call
scpFromRemoteCmd(..., recursive:true) so directory copies succeed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d91571be-8183-4844-bac8-c05e40234f3d

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc0ccd and 15bfdf0.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38891 [ run ] triggered by Bot. Commit: 7732bc7 Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38891 [ run ] completed with state FAILURE. Commit: 7732bc7
/LLM/main/L0_MergeRequest_PR pipeline #30199 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38912 [ run ] triggered by Bot. Commit: f6a14c4 Link to invocation

@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38921 [ run ] triggered by Bot. Commit: 6b0e4c0 Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38921 [ run ] completed with state FAILURE. Commit: 6b0e4c0
/LLM/main/L0_MergeRequest_PR pipeline #30226 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39120 [ run ] triggered by Bot. Commit: 4e870c8 Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39120 [ run ] completed with state FAILURE. Commit: 4e870c8
/LLM/main/L0_MergeRequest_PR pipeline #30379 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

1 similar comment
@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39149 [ run ] triggered by Bot. Commit: a54785d Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39149 [ run ] completed with state FAILURE. Commit: a54785d
/LLM/main/L0_MergeRequest_PR pipeline #30407 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39164 [ run ] triggered by Bot. Commit: 3a4f1de Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39164 [ run ] completed with state SUCCESS. Commit: 3a4f1de
/LLM/main/L0_MergeRequest_PR pipeline #30422 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39283 [ run ] triggered by Bot. Commit: 3a4f1de Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39283 [ run ] completed with state SUCCESS. Commit: 3a4f1de
/LLM/main/L0_MergeRequest_PR pipeline #30532 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@mlefeb01
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39315 [ run ] triggered by Bot. Commit: f3a782b Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39315 [ run ] completed with state SUCCESS. Commit: f3a782b
/LLM/main/L0_MergeRequest_PR pipeline #30566 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@mlefeb01
Copy link
Collaborator Author

/bot skip --comment "Sufficient testing"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39477 [ skip ] triggered by Bot. Commit: 28e4b43 Link to invocation

@tensorrt-cicd
Copy link
Collaborator

PR_Github #39477 [ skip ] completed with state SUCCESS. Commit: 28e4b43
Skipping testing for commit 28e4b43

Link to invocation

@mlefeb01 mlefeb01 merged commit b095c62 into NVIDIA:main Mar 18, 2026
5 checks passed
@mlefeb01 mlefeb01 deleted the slurm-ssh-auth branch March 18, 2026 16:14
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
… clusters (NVIDIA#12172)

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants