vLLM, SGLang: cleanup fix for single-sbatch #880

Merged
podkidyshev merged 7 commits into main from ipod/llm-1sbatch-ffix on Apr 23, 2026
@podkidyshev podkidyshev commented Apr 22, 2026

Summary

Running vLLM/SGLang in single-sbatch mode failed from the second iteration onward because processes from the previous iteration were not properly killed.

Test Plan

  • Automated CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this Apr 22, 2026
@podkidyshev podkidyshev added the bug Something isn't working label Apr 22, 2026

coderabbitai Bot commented Apr 22, 2026

Walkthrough

Replaced unconditional SIGKILL-based cleanup with a two-phase shutdown: send SIGTERM to tracked PIDs, then poll with `kill -0` for up to a configurable timeout (default 15s) until the processes exit. Generated Slurm scripts now explicitly invoke `cleanup` after the benchmark run, in addition to `trap cleanup EXIT`. Also introduced an `mpi` property on Slurm command generators and switched generated/test `srun` invocations from `--mpi=pmix` to `--mpi=none`.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core cleanup generator: `src/cloudai/workloads/common/llm_serving.py` | `generate_cleanup_function(pid_vars: list[str], timeout: int = 15)` signature added; generated cleanup implements two-phase shutdown (send `kill -TERM`, then per-PID `kill -0` polling up to `timeout`) with distinct single-PID vs multi-PID paths; generated srun payloads append an explicit `cleanup` invocation after the benchmark. |
| Slurm strategy: `src/cloudai/systems/slurm/slurm_command_gen_strategy.py` | Added `mpi` property and updated srun prefix generation to source `--mpi` from `self.mpi`, enabling `--mpi=none` in generated commands. |
| SGLang reference scripts: `tests/ref_data/sglang.sbatch`, `tests/ref_data/sglang-disagg.sbatch`, `tests/ref_data/sglang-disagg-2nodes.sbatch` | Switched `srun --mpi=pmix` → `srun --mpi=none`; replaced `kill -9` cleanup with `kill -TERM` + per-PID `kill -0` polling (15s timeout); scripts now explicitly call `cleanup` after benchmark (keeps `trap cleanup EXIT`). |
| vLLM reference scripts: `tests/ref_data/vllm.sbatch`, `tests/ref_data/vllm-disagg.sbatch`, `tests/ref_data/vllm-disagg-2nodes.sbatch` | Switched `srun --mpi=pmix` → `srun --mpi=none`; replaced `kill -9` cleanup with `kill -TERM` + per-PID `kill -0` polling (15s timeout); scripts now explicitly call `cleanup` after benchmark (keeps `trap cleanup EXIT`). |
| Command-gen tests: `tests/workloads/vllm/test_command_gen_strategy_slurm.py` | Updated expected generated Slurm content: SIGTERM + polling cleanup behavior, appended explicit `cleanup` call after benchmark, and MPI launcher expectation changed to `none`. |
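
For readers who want to picture the two-phase shutdown summarized above, here is a minimal Python sketch that renders a bash `cleanup` function: SIGTERM first, then per-PID `kill -0` polling. It is an illustration under assumptions, not the code merged in this PR. The name `sketch_cleanup_function`, the shell variable names inside the emitted script, and the PID variable names in the usage note are hypothetical, and the real `generate_cleanup_function` reportedly has separate single-PID and multi-PID paths that this sketch does not reproduce.

```python
def sketch_cleanup_function(pid_vars: list[str], timeout: int = 15) -> str:
    """Render bash source for a cleanup() that sends SIGTERM, then polls with kill -0.

    Hypothetical illustration only; not the generator shipped in llm_serving.py.
    """
    # Each tracked variable becomes a quoted shell expansion, e.g. "$SERVER_PID".
    pid_array = " ".join(f'"${p}"' for p in pid_vars)
    lines = [
        "cleanup() {",
        f"    local pids=({pid_array})",
        f"    local timeout={timeout}",
        "    local pid waited",
        "    # Phase 1: ask every tracked process to exit gracefully.",
        '    kill -TERM "${pids[@]}" 2>/dev/null',
        "    # Phase 2: poll each PID with kill -0 until it exits or the timeout expires.",
        '    for pid in "${pids[@]}"; do',
        "        waited=0",
        '        while kill -0 "$pid" 2>/dev/null; do',
        '            if [ "$waited" -ge "$timeout" ]; then',
        '                echo "cleanup: PID $pid still alive after ${timeout}s" >&2',
        "                return 1",
        "            fi",
        "            sleep 1",
        "            waited=$((waited + 1))",
        "        done",
        "    done",
        "}",
        "trap cleanup EXIT",
    ]
    return "\n".join(lines) + "\n"
```

Calling `print(sketch_cleanup_function(["SERVER_PID", "ROUTER_PID"]))` prints a script fragment that could be pasted into an sbatch file for experimentation. In this sketch the timeout applies to each PID separately, which matches the "per-PID polling" wording above but is otherwise a guess at the intended semantics.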

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I tap with TERM, not hammer's drum,
I count soft seconds, one to some.
Fifteen hops, a patient stare—
If you stay still, I'll say you dare.
Cleanup done, I nose the air.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main fix: improving cleanup behavior for single-sbatch mode in vLLM/SGLang, which aligns with the changeset's focus on updating cleanup logic across multiple scripts and the command generation strategy. |
| Description check | ✅ Passed | The description directly relates to the changeset, explaining that the fix addresses a process cleanup issue in single-sbatch mode where previous iteration processes were not properly killed, which matches the core changes in cleanup function behavior. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Line 447: Replace the map+lambda used to build pid_array with a generator
expression for readability: locate the pid_array assignment that references
pid_vars and change the join call to use a generator expression that formats
each p (e.g., f'"${p}"') instead of map(lambda p: ...); keep the surrounding
logic and variable names (pid_array, pid_vars) unchanged.
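
The requested change is purely stylistic. A small sketch of the before and after, assuming the assignment looks roughly like the one described in the comment (the PID variable names here are hypothetical):

```python
# Hypothetical PID variable names, for illustration only.
pid_vars = ["PREFILL_PID", "DECODE_PID"]

# Before: map + lambda builds the quoted shell expansions.
pid_array_map = " ".join(map(lambda p: f'"${p}"', pid_vars))

# After: a generator expression yields the same string and reads more directly.
pid_array = " ".join(f'"${p}"' for p in pid_vars)

# Both forms are equivalent.
assert pid_array == pid_array_map
print(pid_array)  # "$PREFILL_PID" "$DECODE_PID"
```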
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: d22f1a73-53d3-4256-b03c-3872992a4d9c

📥 Commits

Reviewing files that changed from the base of the PR and between 0064155 and b3fdc50.

📒 Files selected for processing (8)
  • src/cloudai/workloads/common/llm_serving.py
  • tests/ref_data/sglang-disagg-2nodes.sbatch
  • tests/ref_data/sglang-disagg.sbatch
  • tests/ref_data/sglang.sbatch
  • tests/ref_data/vllm-disagg-2nodes.sbatch
  • tests/ref_data/vllm-disagg.sbatch
  • tests/ref_data/vllm.sbatch
  • tests/workloads/vllm/test_command_gen_strategy_slurm.py

Comment thread src/cloudai/workloads/common/llm_serving.py Outdated
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 430-466: generate_cleanup_function currently only returns failure
on timeout and can be run twice (trap EXIT plus explicit call). Fix it by making
the generated cleanup idempotent and escalating to SIGKILL on timeout: add a
guard variable (e.g. CLEANUP_DONE=0) at top of generated function and return
immediately if CLEANUP_DONE=1; set CLEANUP_DONE=1 at the start of cleanup so
repeated calls are no-ops. In both the single-pid branch (pid_var) and multi-pid
branch (for loops) change the timeout branch so that after the TERM-wait loop
fails you call kill -KILL on the stalled pid(s), optionally sleep/wait briefly
for process removal, then return non-zero; reference generate_cleanup_function
and the explicit cleanup invocation site in _gen_srun_command when locating and
updating the code.
- Around line 558-560: The appended explicit "cleanup" call in
_gen_llm_serving_srun_command causes cleanup to run twice because
generate_cleanup_function() already installs "trap cleanup EXIT"; modify the
script generation so that after running the benchmark commands you capture the
benchmark exit status (e.g., store "$?" into a variable), remove the EXIT trap
(trap - EXIT), invoke cleanup once, capture cleanup's exit status and then exit
with the original benchmark status unless cleanup failed (in which case
propagate cleanup's non‑zero code). Ensure references:
_gen_llm_serving_srun_command, generate_cleanup_function(), the cleanup function
name ("cleanup"), and the trap on EXIT are updated accordingly.
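
The two ideas in the first comment, an idempotency guard and escalation from SIGTERM to SIGKILL on timeout, can be illustrated outside of bash. The sketch below mirrors them in Python with `subprocess`; it is not the generated shell code, and the class name `Cleanup` and its attributes are hypothetical. The `done` flag plays the role of the suggested `CLEANUP_DONE` variable.

```python
import subprocess
import time


class Cleanup:
    """Idempotent two-phase shutdown: SIGTERM, bounded wait, then SIGKILL.

    Hypothetical Python mirror of the reviewer's suggestion for the bash cleanup.
    """

    def __init__(self, procs, timeout: float = 15.0):
        self.procs = list(procs)
        self.timeout = timeout
        self.done = False    # guard so repeated calls are no-ops
        self.clean = True    # True if no process needed SIGKILL

    def __call__(self) -> bool:
        if self.done:
            # Second and later calls return the first result without signalling again.
            return self.clean
        self.done = True

        # Phase 1: ask every still-running process to exit.
        for proc in self.procs:
            if proc.poll() is None:
                proc.terminate()

        # Phase 2: wait up to the shared deadline, then escalate to SIGKILL.
        deadline = time.monotonic() + self.timeout
        for proc in self.procs:
            remaining = max(0.0, deadline - time.monotonic())
            try:
                proc.wait(timeout=remaining)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()  # reap the killed process
                self.clean = False
        return self.clean
```

A return value of `False` corresponds to the non-zero return the comment asks for after escalation. Unlike the per-PID timeout described in the walkthrough, this sketch uses one shared deadline for all processes; that is a design choice of the sketch, not something stated in the PR.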
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ada2642f-bab3-4e7a-ba40-5dcfff00af20

📥 Commits

Reviewing files that changed from the base of the PR and between b3fdc50 and 5921d5d.

📒 Files selected for processing (1)
  • src/cloudai/workloads/common/llm_serving.py

Comment thread src/cloudai/workloads/common/llm_serving.py
Comment thread src/cloudai/workloads/common/llm_serving.py
@podkidyshev podkidyshev requested a review from amaslenn April 22, 2026 20:22
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
src/cloudai/workloads/common/llm_serving.py (1)

562-564: ⚠️ Potential issue | 🟠 Major

Preserve the benchmark status when appending cleanup.

This unconditional trailing cleanup becomes the generated fragment’s last command, so a failed benchmark can be reported as success whenever cleanup returns 0. Keep the explicit cleanup for single-sbatch, but restore the benchmark’s status after running it.

♻️ Suggested adjustment

```diff
     def _gen_srun_command(self) -> str:
         serve_commands = self.get_serve_commands()
         srun_command = self._gen_llm_serving_srun_command(serve_commands)
-        srun_command += "\n\ncleanup\n"
+        srun_command += """\
+
+bench_status=$?
+cleanup
+cleanup_status=$?
+if [ "$bench_status" -ne 0 ]; then
+    (exit "$bench_status")
+else
+    (exit "$cleanup_status")
+fi
+"""
         return srun_command
```

```bash
#!/bin/bash
bash -lc 'cleanup(){ return 0; }; false; cleanup'
echo "explicit_cleanup_exit=$?"

bash -lc 'cleanup(){ return 0; }; trap cleanup EXIT; false'
echo "trap_only_exit=$?"
```

Expected result: explicit_cleanup_exit=0 and trap_only_exit=1. Based on learnings, keep the explicit cleanup call for single-sbatch support; the issue here is only the lost benchmark status.
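
The status rule in the suggested adjustment can be written down as a tiny decision function. The Python sketch below only mirrors the `if`/`else` from the proposed shell fragment; `final_exit_status` is a hypothetical name and nothing like it is claimed to exist in the codebase.

```python
def final_exit_status(bench_status: int, cleanup_status: int) -> int:
    """Mirror of the suggested shell logic: a failed benchmark always wins.

    Cleanup's status is reported only when the benchmark itself succeeded.
    """
    if bench_status != 0:
        return bench_status
    return cleanup_status


# Walk through the four interesting combinations (status codes are examples).
for bench, clean in [(0, 0), (0, 1), (2, 0), (2, 1)]:
    print(f"bench={bench} cleanup={clean} -> exit {final_exit_status(bench, clean)}")
```

The third case, `bench=2 cleanup=0`, is the one the review flags: with a bare trailing `cleanup` the fragment would end with status 0, whereas the rule above keeps the benchmark's 2. In shell, `$?` has to be saved immediately after the benchmark command, since any intervening command overwrites it.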

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/common/llm_serving.py` around lines 562 - 564, The
generated srun fragment currently appends an unconditional "cleanup" which can
overwrite the benchmark's exit status; in _gen_llm_serving_srun_command preserve
the benchmark/serve fragment's exit code by saving "$?" (or equivalent shell
variable) immediately after the serve commands finish, run the explicit cleanup
for single-sbatch compatibility, and then exit/return using the saved status so
failed benchmarks don't appear as success; update the block that sets
srun_command (variable srun_command and the call site of
_gen_llm_serving_srun_command) to inject this save-run-cleanup-exit pattern.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 7159debe-ba96-4996-85fc-3f8ea16db62f

📥 Commits

Reviewing files that changed from the base of the PR and between 99bc470 and c803674.

📒 Files selected for processing (8)
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • src/cloudai/workloads/common/llm_serving.py
  • tests/ref_data/sglang-disagg-2nodes.sbatch
  • tests/ref_data/sglang-disagg.sbatch
  • tests/ref_data/sglang.sbatch
  • tests/ref_data/vllm-disagg-2nodes.sbatch
  • tests/ref_data/vllm-disagg.sbatch
  • tests/ref_data/vllm.sbatch

@podkidyshev podkidyshev merged commit 37f566f into main Apr 23, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/llm-1sbatch-ffix branch April 23, 2026 08:29