vLLM, SGLang: cleanup fix for single-sbatch by podkidyshev · Pull Request #880 · NVIDIA/cloudai

podkidyshev · 2026-04-22T18:38:58Z

Summary

Running vLLM/SGLang in single-sbatch mode failed from the second iteration onward because processes from the previous iteration were not properly killed.

Test Plan

Automated CI
Manual runs

Additional Notes

Closes internal ticket 1 and ticket 2

coderabbitai · 2026-04-22T18:39:34Z

Walkthrough

Replaced the unconditional SIGKILL-based cleanup with a two-phase shutdown: the script sends SIGTERM to the tracked PIDs, then polls each one with `kill -0` until it exits or a configurable timeout (default 15s) expires. Generated Slurm scripts now invoke `cleanup` explicitly after the benchmark run, in addition to `trap cleanup EXIT`. The PR also introduces an `mpi` property on Slurm command generators and switches generated and reference `srun` invocations from `--mpi=pmix` to `--mpi=none`.
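The two-phase shutdown described in the walkthrough can be sketched as a small Python generator that emits the bash function. This is a minimal illustration, not the code from `llm_serving.py`: the name `sketch_cleanup_function`, the PID variable names, and the choice to apply the timeout per PID are assumptions. Only the `(pid_vars, timeout=15)` parameters and the `kill -TERM`, `kill -0`, and `trap cleanup EXIT` elements come from the walkthrough.

```python
def sketch_cleanup_function(pid_vars: list[str], timeout: int = 15) -> str:
    """Return bash source for a two-phase cleanup (illustrative, not the cloudai code)."""
    # Render each shell variable name as a quoted expansion, e.g. "$PREFILL_PID".
    pid_array = " ".join(f'"${p}"' for p in pid_vars)
    lines = [
        "cleanup() {",
        f"    local pids=({pid_array})",
        "    local pid waited",
        "    # Phase 1: ask every tracked process to exit gracefully.",
        '    kill -TERM "${pids[@]}" 2>/dev/null',
        "    # Phase 2: poll each PID with kill -0 until it is gone or the timeout expires.",
        '    for pid in "${pids[@]}"; do',
        "        waited=0",
        '        while kill -0 "$pid" 2>/dev/null; do',
        f'            if [ "$waited" -ge {timeout} ]; then',
        '                echo "cleanup: PID $pid still alive after timeout" >&2',
        "                return 1",
        "            fi",
        "            sleep 1",
        "            waited=$((waited + 1))",
        "        done",
        "    done",
        "}",
        "trap cleanup EXIT",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical PID variable names for a disaggregated setup.
script = sketch_cleanup_function(["PREFILL_PID", "DECODE_PID"], timeout=15)
print(script)
```

Because the sketch restarts the counter for each PID, the worst-case wait is the timeout multiplied by the number of tracked PIDs; a shared deadline would be an equally plausible reading of the walkthrough.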

Changes

Cohort / File(s)	Summary
Core cleanup generator `src/cloudai/workloads/common/llm_serving.py`	`generate_cleanup_function(pid_vars: list[str], timeout: int = 15)` signature added; generated cleanup implements two-phase shutdown (send `kill -TERM` then per-PID `kill -0` polling up to timeout) with distinct single-PID vs multi-PID paths; generated srun payloads append an explicit `cleanup` invocation after the benchmark.
Slurm strategy `src/cloudai/systems/slurm/slurm_command_gen_strategy.py`	Added `mpi` property and updated srun prefix generation to source `--mpi` from `self.mpi`, enabling `--mpi=none` in generated commands.
SGLang reference scripts `tests/ref_data/sglang.sbatch`, `tests/ref_data/sglang-disagg.sbatch`, `tests/ref_data/sglang-disagg-2nodes.sbatch`	Switched `srun --mpi=pmix` → `srun --mpi=none`; replaced `kill -9` cleanup with `kill -TERM` + per-PID `kill -0` polling (15s timeout); scripts now explicitly call `cleanup` after benchmark (keeps `trap cleanup EXIT`).
vLLM reference scripts `tests/ref_data/vllm.sbatch`, `tests/ref_data/vllm-disagg.sbatch`, `tests/ref_data/vllm-disagg-2nodes.sbatch`	Switched `srun --mpi=pmix` → `srun --mpi=none`; replaced `kill -9` cleanup with `kill -TERM` + per-PID `kill -0` polling (15s timeout); scripts now explicitly call `cleanup` after benchmark (keeps `trap cleanup EXIT`).
Command-gen tests `tests/workloads/vllm/test_command_gen_strategy_slurm.py`	Updated expected generated Slurm content: SIGTERM + polling cleanup behavior, appended explicit `cleanup` call after benchmark, and MPI launcher expectation changed to `none`.
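The `mpi` property change in the Slurm strategy row can be pictured with a short sketch. The class and method names below are hypothetical stand-ins, and the `pmix` default is an assumption inferred from the `--mpi=pmix` value the reference scripts used before this PR; the only detail taken from the table is that the srun prefix reads `--mpi` from `self.mpi`.

```python
class SketchSlurmStrategy:
    """Hypothetical stand-in for a Slurm command generator."""

    @property
    def mpi(self) -> str:
        # Assumed default, inferred from the previous --mpi=pmix in the reference scripts.
        return "pmix"

    def gen_srun_prefix(self) -> list[str]:
        # The flag is read from the property instead of being hard-coded.
        return ["srun", f"--mpi={self.mpi}"]


class SketchLLMServingStrategy(SketchSlurmStrategy):
    """Hypothetical serving workload that opts out of an MPI plugin."""

    @property
    def mpi(self) -> str:
        return "none"


base_prefix = SketchSlurmStrategy().gen_srun_prefix()
serving_prefix = SketchLLMServingStrategy().gen_srun_prefix()
print(" ".join(base_prefix))
print(" ".join(serving_prefix))
```

Exposing the value as an overridable property lets one workload family change the launcher without touching the shared prefix-building code.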

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I tap with TERM, not hammer's drum,
I count soft seconds, one to some.
Fifteen hops, a patient stare—
If you stay still, I'll say you dare.
Cleanup done, I nose the air.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main fix: improving cleanup behavior for single-sbatch mode in vLLM/SGLang, which aligns with the changeset's focus on updating cleanup logic across multiple scripts and the command generation strategy.
Description check	✅ Passed	The description directly relates to the changeset, explaining that the fix addresses a process cleanup issue in single-sbatch mode where previous iteration processes were not properly killed, which matches the core changes in cleanup function behavior.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Line 447: Replace the map+lambda used to build pid_array with a generator
expression for readability: locate the pid_array assignment that references
pid_vars and change the join call to use a generator expression that formats
each p (e.g., f'"${p}"') instead of map(lambda p: ...); keep the surrounding
logic and variable names (pid_array, pid_vars) unchanged.
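The readability change requested here amounts to the following equivalence. The exact original expression is not shown in this thread, so the separator and the sample variable names are assumptions; the sketch only demonstrates that both forms build the same string.

```python
# Hypothetical shell variable names holding server PIDs.
pid_vars = ["PREFILL_PID", "DECODE_PID"]

# map + lambda form, as flagged by the review.
with_map = " ".join(map(lambda p: f'"${p}"', pid_vars))

# Generator-expression form, as suggested by the review.
pid_array = " ".join(f'"${p}"' for p in pid_vars)

print(pid_array)
```

Both expressions yield one quoted shell expansion per variable, so the swap is purely stylistic.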

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: d22f1a73-53d3-4256-b03c-3872992a4d9c

📥 Commits

Reviewing files that changed from the base of the PR and between 0064155 and b3fdc50.

📒 Files selected for processing (8)

src/cloudai/workloads/common/llm_serving.py
tests/ref_data/sglang-disagg-2nodes.sbatch
tests/ref_data/sglang-disagg.sbatch
tests/ref_data/sglang.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm.sbatch
tests/workloads/vllm/test_command_gen_strategy_slurm.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cloudai/workloads/common/llm_serving.py`:
- Around line 430-466: generate_cleanup_function currently only returns failure
on timeout and can be run twice (trap EXIT plus explicit call). Fix it by making
the generated cleanup idempotent and escalating to SIGKILL on timeout: add a
guard variable (e.g. CLEANUP_DONE=0) at top of generated function and return
immediately if CLEANUP_DONE=1; set CLEANUP_DONE=1 at the start of cleanup so
repeated calls are no-ops. In both the single-pid branch (pid_var) and multi-pid
branch (for loops) change the timeout branch so that after the TERM-wait loop
fails you call kill -KILL on the stalled pid(s), optionally sleep/wait briefly
for process removal, then return non-zero; reference generate_cleanup_function
and the explicit cleanup invocation site in _gen_srun_command when locating and
updating the code.
- Around line 558-560: The appended explicit "cleanup" call in
_gen_llm_serving_srun_command causes cleanup to run twice because
generate_cleanup_function() already installs "trap cleanup EXIT"; modify the
script generation so that after running the benchmark commands you capture the
benchmark exit status (e.g., store "$?" into a variable), remove the EXIT trap
(trap - EXIT), invoke cleanup once, capture cleanup's exit status and then exit
with the original benchmark status unless cleanup failed (in which case
propagate cleanup's non‑zero code). Ensure references:
_gen_llm_serving_srun_command, generate_cleanup_function(), the cleanup function
name ("cleanup"), and the trap on EXIT are updated accordingly.
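A rough sketch of the first suggestion (an idempotency guard plus escalation to SIGKILL) might look like the generator below. It is an illustration under assumptions, not the change that was merged: the function name, the PID variable name, and the decision to keep polling the remaining PIDs after one of them stalls are all hypothetical. `CLEANUP_DONE` is the guard variable name proposed in the comment.

```python
def sketch_idempotent_cleanup(pid_vars: list[str], timeout: int = 15) -> str:
    """Return bash source for a guarded cleanup that escalates to SIGKILL (illustrative)."""
    pid_array = " ".join(f'"${p}"' for p in pid_vars)
    lines = [
        "# Reset per generated fragment so each iteration can clean up once.",
        "CLEANUP_DONE=0",
        "cleanup() {",
        "    # Guard: the EXIT trap and the explicit call may both reach this function.",
        '    if [ "$CLEANUP_DONE" -eq 1 ]; then return 0; fi',
        "    CLEANUP_DONE=1",
        f"    local pids=({pid_array})",
        "    local pid waited rc=0",
        '    kill -TERM "${pids[@]}" 2>/dev/null',
        '    for pid in "${pids[@]}"; do',
        "        waited=0",
        '        while kill -0 "$pid" 2>/dev/null; do',
        f'            if [ "$waited" -ge {timeout} ]; then',
        "                # Graceful shutdown timed out: force-kill and report failure.",
        '                kill -KILL "$pid" 2>/dev/null',
        "                rc=1",
        "                break",
        "            fi",
        "            sleep 1",
        "            waited=$((waited + 1))",
        "        done",
        "    done",
        '    return "$rc"',
        "}",
        "trap cleanup EXIT",
    ]
    return "\n".join(lines) + "\n"


guarded_script = sketch_idempotent_cleanup(["VLLM_PID"], timeout=15)
print(guarded_script)
```

Emitting `CLEANUP_DONE=0` at the top of every fragment matters in single-sbatch mode, where several fragments share one shell: without the reset, the guard set by the first iteration would turn every later cleanup into a no-op.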

ℹ️ Review info

Run ID: ada2642f-bab3-4e7a-ba40-5dcfff00af20

📥 Commits

Reviewing files that changed from the base of the PR and between b3fdc50 and 5921d5d.

📒 Files selected for processing (1)

src/cloudai/workloads/common/llm_serving.py

coderabbitai

♻️ Duplicate comments (1)

src/cloudai/workloads/common/llm_serving.py (1)

562-564: ⚠️ Potential issue | 🟠 Major

Preserve the benchmark status when appending cleanup.

This unconditional trailing cleanup becomes the generated fragment’s last command, so a failed benchmark can be reported as success whenever cleanup returns 0. Keep the explicit cleanup for single-sbatch, but restore the benchmark’s status after running it.

♻️ Suggested adjustment

     def _gen_srun_command(self) -> str:
         serve_commands = self.get_serve_commands()
         srun_command = self._gen_llm_serving_srun_command(serve_commands)
-        srun_command += "\n\ncleanup\n"
+        srun_command += """\
+
+bench_status=$?
+cleanup
+cleanup_status=$?
+if [ "$bench_status" -ne 0 ]; then
+    (exit "$bench_status")
+else
+    (exit "$cleanup_status")
+fi
+"""
         return srun_command

#!/bin/bash
bash -lc 'cleanup(){ return 0; }; false; cleanup'
echo "explicit_cleanup_exit=$?"

bash -lc 'cleanup(){ return 0; }; trap cleanup EXIT; false'
echo "trap_only_exit=$?"

Expected result: explicit_cleanup_exit=0 and trap_only_exit=1. Based on learnings, keep the explicit cleanup call for single-sbatch support; the issue here is only the lost benchmark status.
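The masking effect and the proposed remedy can be checked from Python by running small bash snippets. This is a sketch under assumptions: `false` stands in for a failing benchmark, `cleanup` is a stub that always succeeds, and `bash` is assumed to be on `PATH`. The status-restoring tail mirrors the suggested adjustment above.

```python
import subprocess


def exit_code(script: str) -> int:
    """Run a bash snippet and return its exit status."""
    return subprocess.run(["bash", "-c", script], check=False).returncode


# Stub cleanup that always succeeds; `false` plays the failing benchmark.
DEFS = "cleanup() { return 0; }\n"

# A trailing cleanup becomes the last command, so the failure is masked.
masked = exit_code(DEFS + "false\ncleanup\n")

# Save the benchmark status first, run cleanup, then restore the saved status.
PRESERVING = DEFS + """\
false
bench_status=$?
cleanup
cleanup_status=$?
if [ "$bench_status" -ne 0 ]; then
    (exit "$bench_status")
else
    (exit "$cleanup_status")
fi
"""
preserved = exit_code(PRESERVING)

print(f"masked={masked} preserved={preserved}")
```

The subshell form `(exit N)` sets the fragment's status without terminating the enclosing shell, which is what allows later iterations in a single sbatch script to keep running.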

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cloudai/workloads/common/llm_serving.py` around lines 562 - 564, The
generated srun fragment currently appends an unconditional "cleanup" which can
overwrite the benchmark's exit status; in _gen_llm_serving_srun_command preserve
the benchmark/serve fragment's exit code by saving "$?" (or equivalent shell
variable) immediately after the serve commands finish, run the explicit cleanup
for single-sbatch compatibility, and then exit/return using the saved status so
failed benchmarks don't appear as success; update the block that sets
srun_command (variable srun_command and the call site of
_gen_llm_serving_srun_command) to inject this save-run-cleanup-exit pattern.

ℹ️ Review info

Run ID: 7159debe-ba96-4996-85fc-3f8ea16db62f

📥 Commits

Reviewing files that changed from the base of the PR and between 99bc470 and c803674.

📒 Files selected for processing (8)

src/cloudai/systems/slurm/slurm_command_gen_strategy.py
src/cloudai/workloads/common/llm_serving.py
tests/ref_data/sglang-disagg-2nodes.sbatch
tests/ref_data/sglang-disagg.sbatch
tests/ref_data/sglang.sbatch
tests/ref_data/vllm-disagg-2nodes.sbatch
tests/ref_data/vllm-disagg.sbatch
tests/ref_data/vllm.sbatch

podkidyshev added 3 commits April 22, 2026 10:55

poc

10b3933

extending cleanup function

043b3c1

proper cleanup implementation

b3fdc50

podkidyshev self-assigned this Apr 22, 2026

podkidyshev added the bug Something isn't working label Apr 22, 2026

podkidyshev requested review from jeffnvidia and srivatsankrishnan as code owners April 22, 2026 18:38

coderabbitai Bot reviewed Apr 22, 2026

Comment thread src/cloudai/workloads/common/llm_serving.py Outdated

resolve ai feedback

5921d5d

coderabbitai Bot reviewed Apr 22, 2026

Comment thread src/cloudai/workloads/common/llm_serving.py

Comment thread src/cloudai/workloads/common/llm_serving.py

Merge branch 'main' into ipod/llm-1sbatch-ffix

99bc470

podkidyshev requested a review from amaslenn April 22, 2026 20:22

podkidyshev added 2 commits April 22, 2026 23:06

custom mpi for vllm for graceful shutdown

c850a28

also propagate to sglang

c803674

coderabbitai Bot reviewed Apr 22, 2026

amaslenn approved these changes Apr 23, 2026

podkidyshev merged commit 37f566f into main Apr 23, 2026
5 checks passed

podkidyshev deleted the ipod/llm-1sbatch-ffix branch April 23, 2026 08:29

podkidyshev merged 7 commits into main from ipod/llm-1sbatch-ffix
