[https://nvbugs/5996645][fix] Fix Pyxis Error in Disagg Perf Test #12575
/bot run --disable-fail-fast --post-merge
📝 Walkthrough
This PR makes three focused changes: adds explicit variable declaration in a Groovy test script to improve scoping, inserts timing delays in a shell script to mitigate race conditions during server initialization, and removes specific test waivers from the waiver list for disaggregated configurations.
Changes
Estimated code review effort
🎯 2 (Simple) | ⏱️ ~15 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (1)
jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh (1)
31-65: Replace hard-coded startup sleeps with configurable/readiness-based gating.
The added delays (Line 31, Line 47, Line 53, Line 65) reduce race frequency, but fixed `sleep 5` is still brittle under variable cluster load and adds avoidable runtime. Prefer readiness polling with timeout; at minimum, make the delay configurable.
Suggested incremental improvement (configurable delay)
+PYXIS_INIT_DELAY_SEC="${PYXIS_INIT_DELAY_SEC:-5}"
+
...
- sleep 5 # Wait for pyxis container namespace initialization to avoid race condition
+ sleep "${PYXIS_INIT_DELAY_SEC}" # Wait for pyxis container namespace initialization to avoid race condition
...
- sleep 5 # Wait for pyxis container namespace initialization to avoid race condition
+ sleep "${PYXIS_INIT_DELAY_SEC}" # Wait for pyxis container namespace initialization to avoid race condition
...
-sleep 5 # Wait for pyxis container namespace initialization to avoid race condition
+sleep "${PYXIS_INIT_DELAY_SEC}" # Wait for pyxis container namespace initialization to avoid race condition
...
-sleep 5 # Wait for pyxis container namespace initialization to avoid race condition
+sleep "${PYXIS_INIT_DELAY_SEC}" # Wait for pyxis container namespace initialization to avoid race condition
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh` around lines 31 - 65, Replace the fixed "sleep 5" calls used before/after starting servers with a configurable timeout and readiness-based polling: add a STARTUP_SLEEP_DEFAULT env var (or reuse TRTLLM_STARTUP_SLEEP) to set a fallback delay, then implement a loop that checks service readiness (e.g., poll the generated log files $jobWorkspace/ctx_server_$i.log and $jobWorkspace/disagg_server.log for a readiness string, or check srun task/process status) until the readiness indicator appears or the configured timeout elapses; update the code paths around the ctx server loop and the disagg server startup where DISAGG_SERVING_TYPE, pytestCommand, srun and runScript are used to wait via polling instead of unconditional sleep, and fall back to the configurable sleep only if polling is unsupported.
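The readiness-polling idea in this comment could look roughly like the sketch below. It is illustrative only and not the code merged in this PR: the helper names `wait_for_log_pattern` and `wait_ready_or_sleep`, the 60-second default timeout, and the readiness string itself are assumptions. `PYXIS_INIT_DELAY_SEC` is the variable name from the suggested diff. The PR does not confirm that the server logs contain a marker that reliably signals pyxis namespace initialization, so the configurable sleep is kept as the fallback.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, defaults, and the readiness string are assumptions.

# Fallback delay in seconds (variable name taken from the review suggestion).
PYXIS_INIT_DELAY_SEC="${PYXIS_INIT_DELAY_SEC:-5}"

# wait_for_log_pattern LOG_FILE PATTERN [TIMEOUT_SEC] [INTERVAL_SEC]
# Returns 0 once the fixed string PATTERN appears in LOG_FILE, 1 on timeout.
# TIMEOUT_SEC and INTERVAL_SEC must be whole seconds.
wait_for_log_pattern() {
    local log_file="$1"
    local pattern="$2"
    local timeout="${3:-60}"
    local interval="${4:-1}"
    local elapsed=0
    while :; do
        # The log file may not exist yet right after the server is launched.
        if [ -f "$log_file" ] && grep -qF -- "$pattern" "$log_file"; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# wait_ready_or_sleep LOG_FILE PATTERN [TIMEOUT_SEC]
# Polls when PATTERN is non-empty; otherwise falls back to the configurable delay.
wait_ready_or_sleep() {
    local log_file="$1"
    local pattern="$2"
    local timeout="${3:-60}"
    if [ -n "$pattern" ]; then
        wait_for_log_pattern "$log_file" "$pattern" "$timeout"
    else
        sleep "$PYXIS_INIT_DELAY_SEC"
    fi
}
```

In the launch script, a call such as `wait_ready_or_sleep "$jobWorkspace/disagg_server.log" "$READY_MARKER" 120` would replace an unconditional `sleep 5`. `READY_MARKER` is a hypothetical variable; leaving it empty reproduces the fixed-delay behavior with a configurable duration.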
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3df89f44-b074-4d3a-955a-2d3d814aa962
📒 Files selected for processing (3)
- jenkins/L0_Test.groovy
- jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh
- tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
PR_Github #40600 [ run ] triggered by Bot. Commit:
PR_Github #40600 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40617 [ run ] triggered by Bot. Commit:
PR_Github #40617 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40721 [ run ] triggered by Bot. Commit:
PR_Github #40721 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40847 [ run ] triggered by Bot. Commit:
PR_Github #40847 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #41058 [ run ] triggered by Bot. Commit:
PR_Github #41058 [ run ] completed with state
/bot run --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #41119 [ run ] triggered by Bot. Commit:
PR_Github #41120 [ run ] triggered by Bot. Commit:
PR_Github #41119 [ run ] completed with state
PR_Github #41120 [ run ] completed with state
…IDIA#12575) Signed-off-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster> Co-authored-by: Chenfei Zhang <chenfeiz@oci-hsg-cs-001-vscode-01.cm.cluster>
Summary by CodeRabbit
Bug Fixes
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.