[TRTLLMINF-113][infra] Add timeout protection to Setup/Initialize stages#14682

Open
ZhanruiSunCh wants to merge 1 commit into NVIDIA:main from ZhanruiSunCh:user/zhanruis/add_setup_timeout

@ZhanruiSunCh (Collaborator) commented May 28, 2026

Add independent timeouts so that network flakiness cannot hang the entire SLURM job until the 4-hour kill:

K8s "Setup Environment" (runLLMTestlistOnPlatformImpl):

  • wget artifact download + tar extract: 15 min
  • pip install (requirements-dev + wheel): 30 min
  • Overall setup stage: 45 min

SLURM "Initialize Test" (runLLMTestlistWithSbatch):

  • wget artifact download + tar extract: 15 min
  • Overall initialize stage: 30 min
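
As a rough illustration of how independent per-step limits nest inside an overall stage limit, the sketch below is written in Python rather than the Groovy of jenkins/L0_Test.groovy, and it is not the PR's code. The names `run_with_budget` and `setup_environment` and the command arguments are hypothetical; only the 15, 30, and 45 minute figures come from the lists above.

```python
import subprocess
import sys
import time


def run_with_budget(cmd, step_limit_s, stage_deadline):
    """Run cmd, bounded by its own limit and by the enclosing stage deadline."""
    remaining = stage_deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("stage budget exhausted before step started")
    # The effective bound is whichever limit expires first.
    effective = min(step_limit_s, remaining)
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        return subprocess.run(cmd, timeout=effective, check=True)
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(f"step exceeded {effective:.1f}s: {cmd}") from exc


def setup_environment(download_cmd, install_cmd,
                      stage_limit_s=45 * 60,
                      download_limit_s=15 * 60,
                      install_limit_s=30 * 60):
    """Hypothetical stand-in for a setup stage with two bounded steps."""
    stage_deadline = time.monotonic() + stage_limit_s
    run_with_budget(download_cmd, download_limit_s, stage_deadline)
    run_with_budget(install_cmd, install_limit_s, stage_deadline)
```

With this shape, a stalled download fails after its own 15 minutes instead of consuming the whole stage budget, and the stage limit still caps the total if several steps each run close to their individual limits.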

Coverage: 30 of 32 incidents (94%) are covered. The 2 uncovered incidents are Docker push TLS timeouts, which go through a different code path.

┌─────────────────────┬────────┬────────────┬────────────┬──────────┬───────────────┐
│ Timeout             │ Limit  │ Normal p95 │ Worst Case │ Margin   │ Risk          │
├─────────────────────┼────────┼────────────┼────────────┼──────────┼───────────────┤
│ wget + tar          │ 15 min │ 3.7 min    │ 9.2 min    │ 5.8 min  │ Low           │
├─────────────────────┼────────┼────────────┼────────────┼──────────┼───────────────┤
│ pip install         │ 30 min │ 19.1 min   │ 19.2 min   │ 10.8 min │ Low           │
├─────────────────────┼────────┼────────────┼────────────┼──────────┼───────────────┤
│ K8s overall setup   │ 45 min │ ~23 min    │ ~28 min    │ 17 min   │ Low           │
├─────────────────────┼────────┼────────────┼────────────┼──────────┼───────────────┤
│ Slurm overall setup │ 30 min │ ~13 min    │ ~23 min    │ 7 min    │ Monitor GB200 │
└─────────────────────┴────────┴────────────┴────────────┴──────────┴───────────────┘
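
The Margin column and the coverage percentage can be checked with a few lines of arithmetic. This is a minimal sketch assuming Margin is Limit minus Worst Case; the table does not state the formula, but all four rows are consistent with it. The approximate (~) entries are treated as exact, and the variable names are illustrative.

```python
# (timeout name, limit in minutes, worst case in minutes), copied from the table.
rows = [
    ("wget + tar", 15.0, 9.2),
    ("pip install", 30.0, 19.2),
    ("K8s overall setup", 45.0, 28.0),
    ("Slurm overall setup", 30.0, 23.0),
]

# Assumed definition: margin = limit - worst case, rounded to 0.1 min.
margins = {name: round(limit - worst, 1) for name, limit, worst in rows}

# Relative headroom: the fraction of each limit left over in the worst case.
headroom = {name: margins[name] / limit for name, limit, _ in rows}

# Coverage figure from the description: 30 of 32 incidents.
coverage_pct = round(100 * 30 / 32)

for name, limit, worst in rows:
    print(f"{name}: margin {margins[name]} min, headroom {headroom[name]:.0%}")
print(f"coverage: {coverage_pct}%")
```

Under that assumption the Slurm stage has the smallest relative headroom (7 of 30 minutes), which is consistent with its "Monitor GB200" risk note.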

Summary by CodeRabbit

  • Chores
    • Enhanced test pipeline reliability by adding timeout controls to long-running operations. This prevents test execution from hanging indefinitely and improves overall pipeline stability.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@ZhanruiSunCh
Collaborator Author

/bot run --post-merge --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #50752 [ run ] triggered by Bot. Commit: f932683 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50752 [ run ] completed with state SUCCESS. Commit: f932683
/LLM/main/L0_MergeRequest_PR pipeline #40231 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Add independent timeouts to prevent network flakiness from hanging
the entire slurm job until the 4h kill:

K8s "Setup Environment" (runLLMTestlistOnPlatformImpl):
  - wget artifact download + tar extract: 15 min
  - pip install (requirements-dev + wheel): 30 min
  - Overall setup stage: 45 min

SLURM "Initialize Test" (runLLMTestlistWithSbatch):
  - wget artifact download + tar extract: 15 min
  - Overall initialize stage: 30 min

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ZhanruiSunCh <184402041+ZhanruiSunCh@users.noreply.github.com>

@ZhanruiSunCh force-pushed the user/zhanruis/add_setup_timeout branch from f932683 to ee61e44 on May 29, 2026 07:52

@ZhanruiSunCh
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c7f72d54-5722-4832-81b9-0e52089429c1

📥 Commits

Reviewing files that changed from the base of the PR and between ecb1b44 and ee61e44.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

📝 Walkthrough

The Jenkins pipeline script adds nested timeout blocks around long-running operations in two test execution paths. SLURM sbatch initialization wraps the tar download and extraction in explicit time bounds, and the platform test Setup Environment stage bounds TRT-LLM extraction and Python package installation within a 45-minute setup timeout.

Changes

Pipeline timeout bounds

| Layer / File(s) | Summary |
| --- | --- |
| SLURM sbatch initialization timeouts (jenkins/L0_Test.groovy) | Tar download and extraction steps in the runLLMTestlistWithSbatch multi-node path are wrapped with nested timeout blocks: an outer initialization timeout and a 15-minute timeout around the wget+tar sequence. Explicit closing marker added. |
| Platform test environment setup timeouts (jenkins/L0_Test.groovy) | Setup Environment stage in runLLMTestlistOnPlatformImpl is bounded by a 45-minute timeout. TRT-LLM tarfile download/extraction gets a 15-minute timeout, and Python package installation (Ray and wheel installs) gets a 30-minute timeout. Explicit closing marker added. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • mzweilz
  • mlefeb01

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding timeout protection to Setup/Initialize stages in Jenkins pipeline scripts. |
| Description check | ✅ Passed | The PR description includes the main issue (network flakiness), the solution (independent timeouts with specific durations), coverage metrics (94%), and a detailed risk assessment table, but the 'Description' and 'Test Coverage' sections of the template are not explicitly filled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd
Collaborator

PR_Github #51010 [ run ] triggered by Bot. Commit: ee61e44 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #51010 [ run ] completed with state SUCCESS. Commit: ee61e44
/LLM/main/L0_MergeRequest_PR pipeline #40459 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
