vLLM, SGLang: cleanup fix for single-sbatch#880
Conversation
📝 WalkthroughWalkthroughReplaced unconditional SIGKILL-based cleanup with a two-phase shutdown: send SIGTERM to tracked PIDs, then poll with Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Line 447: Replace the map+lambda used to build pid_array with a generator
expression for readability: locate the pid_array assignment that references
pid_vars and change the join call to use a generator expression that formats
each p (e.g., f'"${p}"') instead of map(lambda p: ...); keep the surrounding
logic and variable names (pid_array, pid_vars) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: d22f1a73-53d3-4256-b03c-3872992a4d9c
📒 Files selected for processing (8)
src/cloudai/workloads/common/llm_serving.pytests/ref_data/sglang-disagg-2nodes.sbatchtests/ref_data/sglang-disagg.sbatchtests/ref_data/sglang.sbatchtests/ref_data/vllm-disagg-2nodes.sbatchtests/ref_data/vllm-disagg.sbatchtests/ref_data/vllm.sbatchtests/workloads/vllm/test_command_gen_strategy_slurm.py
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 430-466: generate_cleanup_function currently only returns failure
on timeout and can be run twice (trap EXIT plus explicit call). Fix it by making
the generated cleanup idempotent and escalating to SIGKILL on timeout: add a
guard variable (e.g. CLEANUP_DONE=0) at top of generated function and return
immediately if CLEANUP_DONE=1; set CLEANUP_DONE=1 at the start of cleanup so
repeated calls are no-ops. In both the single-pid branch (pid_var) and multi-pid
branch (for loops) change the timeout branch so that after the TERM-wait loop
fails you call kill -KILL on the stalled pid(s), optionally sleep/wait briefly
for process removal, then return non-zero; reference generate_cleanup_function
and the explicit cleanup invocation site in _gen_srun_command when locating and
updating the code.
- Around line 558-560: The appended explicit "cleanup" call in
_gen_llm_serving_srun_command causes cleanup to run twice because
generate_cleanup_function() already installs "trap cleanup EXIT"; modify the
script generation so that after running the benchmark commands you capture the
benchmark exit status (e.g., store "$?" into a variable), remove the EXIT trap
(trap - EXIT), invoke cleanup once, capture cleanup's exit status and then exit
with the original benchmark status unless cleanup failed (in which case
propagate cleanup's non‑zero code). Ensure references:
_gen_llm_serving_srun_command, generate_cleanup_function(), the cleanup function
name ("cleanup"), and the trap on EXIT are updated accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: ada2642f-bab3-4e7a-ba40-5dcfff00af20
📒 Files selected for processing (1)
src/cloudai/workloads/common/llm_serving.py
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/cloudai/workloads/common/llm_serving.py (1)
562-564:⚠️ Potential issue | 🟠 MajorPreserve the benchmark status when appending
cleanup.This unconditional trailing
cleanupbecomes the generated fragment’s last command, so a failed benchmark can be reported as success whenever cleanup returns0. Keep the explicit cleanup for single-sbatch, but restore the benchmark’s status after running it.♻️ Suggested adjustment
def _gen_srun_command(self) -> str: serve_commands = self.get_serve_commands() srun_command = self._gen_llm_serving_srun_command(serve_commands) - srun_command += "\n\ncleanup\n" + srun_command += """\ + +bench_status=$? +cleanup +cleanup_status=$? +if [ "$bench_status" -ne 0 ]; then + (exit "$bench_status") +else + (exit "$cleanup_status") +fi +""" return srun_command#!/bin/bash bash -lc 'cleanup(){ return 0; }; false; cleanup' echo "explicit_cleanup_exit=$?" bash -lc 'cleanup(){ return 0; }; trap cleanup EXIT; false' echo "trap_only_exit=$?"Expected result:
explicit_cleanup_exit=0andtrap_only_exit=1. Based on learnings, keep the explicitcleanupcall for single-sbatch support; the issue here is only the lost benchmark status.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/workloads/common/llm_serving.py` around lines 562 - 564, The generated srun fragment currently appends an unconditional "cleanup" which can overwrite the benchmark's exit status; in _gen_llm_serving_srun_command preserve the benchmark/serve fragment's exit code by saving "$?" (or equivalent shell variable) immediately after the serve commands finish, run the explicit cleanup for single-sbatch compatibility, and then exit/return using the saved status so failed benchmarks don't appear as success; update the block that sets srun_command (variable srun_command and the call site of _gen_llm_serving_srun_command) to inject this save-run-cleanup-exit pattern.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 562-564: The generated srun fragment currently appends an
unconditional "cleanup" which can overwrite the benchmark's exit status; in
_gen_llm_serving_srun_command preserve the benchmark/serve fragment's exit code
by saving "$?" (or equivalent shell variable) immediately after the serve
commands finish, run the explicit cleanup for single-sbatch compatibility, and
then exit/return using the saved status so failed benchmarks don't appear as
success; update the block that sets srun_command (variable srun_command and the
call site of _gen_llm_serving_srun_command) to inject this save-run-cleanup-exit
pattern.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 7159debe-ba96-4996-85fc-3f8ea16db62f
📒 Files selected for processing (8)
src/cloudai/systems/slurm/slurm_command_gen_strategy.pysrc/cloudai/workloads/common/llm_serving.pytests/ref_data/sglang-disagg-2nodes.sbatchtests/ref_data/sglang-disagg.sbatchtests/ref_data/sglang.sbatchtests/ref_data/vllm-disagg-2nodes.sbatchtests/ref_data/vllm-disagg.sbatchtests/ref_data/vllm.sbatch
Summary
Running vLLM/SGLang in single-sbatch mode failed starting on 2nd iteration because processes from previous iteration were not properly killed
Test Plan
Additional Notes