Bump vLLM to 0.16.0 with required dep updates#1240
Conversation
- vllm: 0.13.0 -> 0.16.0
- torch: 2.9.0 -> 2.9.1 (required by vLLM 0.16.0)
- flashinfer-python: 0.5.3 -> 0.6.3 (required by vLLM 0.16.0)
- flashinfer-jit-cache: 0.5.3 -> 0.6.3
- numpy>=2.0.0 override (vLLM 0.16.0 -> opencv-python-headless>=4.13 -> numpy>=2, conflicting with megatron-core's <2 pin; tested compatible with megatron-core 0.15.0)

Migrates vLLM import paths (0.13 -> 0.16):

- `serving_chat` -> `chat_completion.serving`
- `serving_completion` -> `completion.serving`
- `serving_models` -> `models.serving`
- `protocol` split into `chat_completion`/`completion`/`engine.protocol`
- `ErrorInfo` moved to top-level import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
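The import-path migration above could also be handled version-tolerantly. This is a sketch of such a shim, not what this PR does (the PR migrates the imports directly); the helper name is illustrative, and the module paths are the ones listed above:

```python
import importlib

def resolve_serving_chat():
    """Locate OpenAIServingChat across the vllm 0.13 / 0.16 layouts (sketch)."""
    for module_path in (
        "vllm.entrypoints.openai.chat_completion.serving",  # vllm >= 0.16
        "vllm.entrypoints.openai.serving_chat",             # vllm <= 0.13
    ):
        try:
            return getattr(importlib.import_module(module_path), "OpenAIServingChat")
        except ImportError:
            continue
    raise ImportError("OpenAIServingChat not found in any known vllm layout")
```

Pinning vllm and updating call sites directly, as this PR does, avoids carrying such shims long-term.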
Code Review
This pull request upgrades vLLM to version 0.16.0 and updates its dependencies accordingly. The changes in pyproject.toml files correctly bump the versions. The Python code is updated to reflect the API changes in the new vLLM version, mainly around import paths.
My review has identified a couple of areas for improvement:
- There is an inconsistency in dependency pinning for `flashinfer-python`, which I've suggested fixing.
- I've found significant code duplication between `skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py` and `skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py`, which should be addressed to improve maintainability.
```python
from vllm.entrypoints.openai.chat_completion.serving import OpenAIServingChat
from vllm.entrypoints.openai.completion.serving import OpenAIServingCompletion
from vllm.entrypoints.openai.models.serving import BaseModelPath, OpenAIServingModels
from vllm.entrypoints.openai.chat_completion.protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ErrorResponse,
)
from vllm.entrypoints.openai.completion.protocol import (
    CompletionRequest,
    CompletionResponse,
)
from vllm.entrypoints.openai.engine.protocol import ErrorInfo, ErrorResponse
from vllm.lora.request import LoRARequest
from uuid import uuid4
from skyrl.backends.skyrl_train.inference_engines.base import (
```
This file appears to be an exact duplicate of skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py. The changes in this PR had to be applied to both files, which highlights a maintainability issue. Having duplicated code increases the maintenance burden and the risk of inconsistencies. It would be best to refactor this to eliminate the duplication, for example by making one a symlink to the other, or adjusting the project structure so that both parts of the codebase can use a single shared file.
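The symlink option mentioned above can be illustrated in a scratch directory (the directory names here are stand-ins; in the repo, the link and target would be the two `vllm_engine.py` paths named in this comment):

```python
import os
import tempfile

# Scratch layout standing in for the two duplicated module locations.
root = tempfile.mkdtemp()
canonical = os.path.join(root, "canonical")
mirror = os.path.join(root, "mirror")
os.makedirs(canonical)
os.makedirs(mirror)

# Write the single source of truth.
src = os.path.join(canonical, "vllm_engine.py")
with open(src, "w") as f:
    f.write("ENGINE = 'shared'\n")

# The duplicate path becomes a symlink; edits now land in one file.
link = os.path.join(mirror, "vllm_engine.py")
os.symlink(src, link)

with open(link) as f:
    content = f.read()
print(content)  # ENGINE = 'shared'
```

A shared package that both trees import, or a re-export stub module, would achieve the same thing without relying on filesystem symlinks.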
`pyproject.toml` (outdated diff)
```diff
 "flash-attn==2.8.3; sys_platform == 'linux'",
-"torch==2.9.0; sys_platform == 'linux'",
+"torch==2.9.1; sys_platform == 'linux'",
 "flashinfer-python; sys_platform == 'linux' and platform_machine == 'x86_64'",
```
For consistency and to avoid potential resolution issues, it's good practice to pin `flashinfer-python` to the same version as `flashinfer-jit-cache`. The `megatron` extra pins both to 0.6.3. I suggest doing the same here for the `fsdp` extra:

```toml
"flashinfer-python==0.6.3; sys_platform == 'linux' and platform_machine == 'x86_64'",
```
`skyrl-train/pyproject.toml` (outdated diff)
```diff
 "flash-attn==2.8.3; sys_platform == 'linux'",
-"torch==2.9.0; sys_platform == 'linux'",
+"torch==2.9.1; sys_platform == 'linux'",
 "flashinfer-python; sys_platform == 'linux'",
```
Match the `megatron`/`mcore` extras, which already pin both `flashinfer-python` and `flashinfer-jit-cache` to 0.6.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `mcore` extra combines `vllm==0.16.0` (which needs numpy>=2 transitively) with `megatron-core==0.15.0` (which declares numpy<2). Without this override, `uv sync --extra mcore` from `skyrl-train` fails to resolve.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
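For readers unfamiliar with uv overrides, the override described above would look roughly like this in `pyproject.toml` (a sketch using uv's `override-dependencies` setting; the exact placement in this repo's config is an assumption):

```toml
# Sketch: force numpy>=2 despite megatron-core 0.15.0's numpy<2 pin.
[tool.uv]
override-dependencies = [
    "numpy>=2.0.0",
]
```

Unlike a constraint, an override replaces the conflicting requirement entirely, so it only works because megatron-core 0.15.0 was tested to run correctly on numpy 2.x.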
This PR:

- bumps megatron-core from 0.15.0 to 0.16.0
- updates the default for `moe_grouped_gemm` to `True`

## Overview

After #1240, checkpointing was broken for megatron because megatron-core==0.15.0 uses `np.product` in its checkpointing validation code, which was removed in numpy >= 2.0.0 in favor of `np.prod`. `megatron-core==0.16.0` is compatible with `numpy>=2.0.0` for all Python versions, so we can drop the numpy dependency override and stop pinning the Python version to 3.12 for megatron.

This PR also sets the default for `moe_grouped_gemm` to `True`; this is the default for most major models ([megatron bridge search](https://github.com/search?q=repo%3ANVIDIA-NeMo%2FMegatron-Bridge+moe_grouped_gemm&type=code&p=1)), including Kimi, Deepseek, Qwen, Llama, and GLM. Setting it to `False` by default in #1213 broke the test `test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]`, which now passes.

All megatron GPU CI tests should now be passing. All tests except `test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]` passed after upgrading to 0.16.0:

<img width="445" height="156" alt="image" src="https://github.com/user-attachments/assets/94598f76-08ac-45d7-93bb-73ff78d171b5" />

`test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]` passes after updating `moe_grouped_gemm` to `True` by default:

<img width="547" height="65" alt="image" src="https://github.com/user-attachments/assets/78ef5461-cd4b-4465-8ded-e652bb145470" />
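For reference, a minimal illustration of the numpy change that broke megatron-core 0.15.0's checkpoint validation (the variable names are illustrative, not megatron-core's actual call site):

```python
import numpy as np

# np.product was removed in numpy 2.0 in favor of np.prod,
# so code computing element counts from a shape must switch:
shape = (4, 8, 16)
num_elements = np.prod(shape)  # works on both numpy 1.x and 2.x
print(num_elements)  # 512
```

On numpy 2.x, `np.product(shape)` raises `AttributeError`, which is why the old megatron-core failed only after the numpy>=2 override landed.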
…r 0.16.0 upgrade (#1248)

# What does this PR do?

## Summary

Part 1 in resolving #1243 and #242. Fixes all the generation-related tests (pause/resume and weight-syncing fixes are pending after this PR). Fixes context length error message detection and some generation tests after the upgrade to vllm 0.16.0 in #1240.

## Changes

### Context length errors

The error message when the model hits the maximum context length has changed a bit; vLLM now outputs something like:

> You passed 3148 input characters and requested 1000 output tokens. However, the model's context length is only 1024 tokens, resulting in a maximum input length of 24 tokens (at most 3072 characters). Please reduce the length of the input prompt. (parameter=input_text, value=3148)

I've changed the string matching logic in `vllm_engine.py` and the assertions in the tests.

### Import path errors

Some of the import paths were also incorrect in the tests:

```bash
FAILED tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py::test_chat_completions - ModuleNotFoundError: No module named 'vllm.entrypoints.openai.protocol'
FAILED tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py::test_completions - ModuleNotFoundError: No module named 'vllm.entrypoints.openai.protocol'
```

The new paths are `vllm.entrypoints.openai.completion` and `.chat_completion`.

The following tests pass after this PR:

- `tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py`
- `tests/backends/skyrl_train/inference_engines/test_inference_engine_client.py`

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
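A hedged sketch of what the updated string matching might look like, given the new error text quoted above (the regex and helper name are assumptions for illustration, not the repo's actual code in `vllm_engine.py`):

```python
import re

# The vllm 0.16 message reads "... the model's context length is only N tokens ...".
CONTEXT_LEN_RE = re.compile(r"context length is only \d+ tokens")

def is_context_length_error(message: str) -> bool:
    # Hypothetical helper: True if the engine rejected the request because
    # prompt + requested output tokens exceeded the model's context window.
    return bool(CONTEXT_LEN_RE.search(message))

msg = (
    "You passed 3148 input characters and requested 1000 output tokens. "
    "However, the model's context length is only 1024 tokens, resulting in "
    "a maximum input length of 24 tokens (at most 3072 characters)."
)
print(is_context_length_error(msg))  # True
```

Matching on a stable phrase rather than the full sentence keeps the check robust to minor wording changes across vllm releases.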
)

### Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently after the vllm 0.16.0 bump (#1240)
- Add backward-compatible `_get_unfinished_request_ids()` helper to resolve internal vs external request ID mismatch
- Fixes #1243

### Root Cause

In vllm 0.16.0, [`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py) now creates **internal** request IDs (with a random suffix) that are distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id  # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from `output_processor.request_states.keys()` (which are now **internal** IDs) and passing them to `engine.abort()` with `internal=False` (the default). The abort looked them up in the `external_req_ids` mapping, found nothing, and **silently did nothing**. Requests completed normally with `finish_reason="length"` instead of `"abort"`.

This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with `finish_reason="abort"` so the retry loop can re-submit with accumulated tokens.

Related vllm changes:

- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix

Add a `_get_unfinished_request_ids()` static method on `BaseVLLMInferenceEngine` that:

- Uses `output_processor.external_req_ids.keys()` when available (vllm 0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm versions

Applied to all three abort call sites:

1. `AsyncVLLMInferenceEngine.abort_generation()` (used by fully async pause/resume)
2. `AsyncVLLMInferenceEngine.sleep()` (cleanup before sleep)
3. `VLLMInferenceEngine.sleep()` (sync engine cleanup before sleep)

### Test plan

- [x] `test_abort_generation_vllm_engine`: passes (was failing with `assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion`: passes
- [x] `test_continue_generation_generate_vllm_engine_generation`: passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project): ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI): https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

<img width="2163" height="976" alt="image" src="https://github.com/user-attachments/assets/eaece0dc-ca53-4dd1-b3d1-2f6e308a8a47" />

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
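A minimal sketch of the backward-compatible helper described above, with stand-in objects in place of vllm's `OutputProcessor` (attribute names follow the PR text; treat this as an illustration, not the repo's exact code):

```python
from types import SimpleNamespace

def get_unfinished_request_ids(output_processor):
    """Return user-facing (external) request IDs across vllm versions."""
    # vllm >= 0.16.0: external_req_ids is keyed by the user-provided IDs,
    # which is what engine.abort(..., internal=False) expects.
    external = getattr(output_processor, "external_req_ids", None)
    if external is not None:
        return list(external.keys())
    # Older vllm: request_states is keyed by the user-provided IDs directly.
    return list(output_processor.request_states.keys())

# Stand-ins for the two vllm layouts:
new_style = SimpleNamespace(
    external_req_ids={"req-1": object()},          # external ID
    request_states={"req-1-abcd1234": object()},   # internal ID (suffixed)
)
old_style = SimpleNamespace(request_states={"req-1": object()})

print(get_unfinished_request_ids(new_style))  # ['req-1']
print(get_unfinished_request_ids(old_style))  # ['req-1']
```

Either way the caller gets IDs that `engine.abort()` can resolve with its default `internal=False` lookup, which is exactly the mismatch the root-cause section describes.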
Summary
Upgrades vLLM from 0.13.0 to 0.16.0 and updates all dependencies required by the new version.
Version bumps

| Package | Old | New |
| --- | --- | --- |
| vllm | 0.13.0 | 0.16.0 |
| torch | 2.9.0 | 2.9.1 |
| flashinfer-python | 0.5.3 | 0.6.3 |
| flashinfer-jit-cache | 0.5.3 | 0.6.3 |

Override-dependencies added

- `numpy>=2.0.0`: vLLM 0.16.0 transitively needs numpy>=2 (via opencv-python-headless>=4.13), conflicting with megatron-core's `<2` pin. Tested: megatron-core 0.15.0 works correctly with numpy 2.x.

vLLM API migration (0.13 -> 0.16)

The `vllm.entrypoints.openai` module was restructured:

- `serving_chat` -> `chat_completion.serving`
- `serving_completion` -> `completion.serving`
- `serving_models` -> `models.serving`
- `protocol` split into `chat_completion.protocol`, `completion.protocol`, and `engine.protocol`
- `ErrorInfo` moved to a top-level import

Not included (separate PR)

The transformers 5.x upgrade and `return_dict=False` migration are in a separate PR, to be merged when vLLM officially supports transformers>=5.

Test plan

- `uv sync --extra megatron` resolves successfully

🤖 Generated with Claude Code