Bump vLLM to 0.16.0 with required dep updates#1240
Conversation
- vllm: 0.13.0 -> 0.16.0
- torch: 2.9.0 -> 2.9.1 (required by vLLM 0.16.0)
- flashinfer-python: 0.5.3 -> 0.6.3 (required by vLLM 0.16.0)
- flashinfer-jit-cache: 0.5.3 -> 0.6.3
- numpy>=2.0.0 override (vLLM 0.16.0 -> opencv-python-headless>=4.13 -> numpy>=2, conflicting with megatron-core's <2 pin; tested compatible with megatron-core 0.15.0)

Migrates vLLM import paths (0.13 -> 0.16):

- `serving_chat` -> `chat_completion.serving`
- `serving_completion` -> `completion.serving`
- `serving_models` -> `models.serving`
- `protocol` split into `chat_completion`/`completion`/`engine.protocol`
- `ErrorInfo` moved to top-level import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
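The import-path migration above could also be handled version-tolerantly. This is a sketch of such a shim, not what this PR does (the PR migrates the imports directly); the helper name is illustrative, and the module paths are the ones listed above:

```python
import importlib

def resolve_serving_chat():
    """Locate OpenAIServingChat across the vllm 0.13 / 0.16 layouts (sketch)."""
    for module_path in (
        "vllm.entrypoints.openai.chat_completion.serving",  # vllm >= 0.16
        "vllm.entrypoints.openai.serving_chat",             # vllm <= 0.13
    ):
        try:
            return getattr(importlib.import_module(module_path), "OpenAIServingChat")
        except ImportError:
            continue
    raise ImportError("OpenAIServingChat not found in any known vllm layout")
```

Pinning vllm and updating call sites directly, as this PR does, avoids carrying such shims long-term.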
Code Review
This pull request upgrades vLLM to version 0.16.0 and updates its dependencies accordingly. The changes in pyproject.toml files correctly bump the versions. The Python code is updated to reflect the API changes in the new vLLM version, mainly around import paths.
My review has identified a couple of areas for improvement:
- There is an inconsistency in dependency pinning for `flashinfer-python`, which I've suggested fixing.
- I've found significant code duplication between `skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py` and `skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py`, which should be addressed to improve maintainability.
```python
from vllm.entrypoints.openai.chat_completion.serving import OpenAIServingChat
from vllm.entrypoints.openai.completion.serving import OpenAIServingCompletion
from vllm.entrypoints.openai.models.serving import BaseModelPath, OpenAIServingModels
from vllm.entrypoints.openai.chat_completion.protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ErrorResponse,
)
from vllm.entrypoints.openai.completion.protocol import (
    CompletionRequest,
    CompletionResponse,
)
from vllm.entrypoints.openai.engine.protocol import ErrorInfo, ErrorResponse
from vllm.lora.request import LoRARequest
from uuid import uuid4
from skyrl.backends.skyrl_train.inference_engines.base import (
```
This file appears to be an exact duplicate of skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py. The changes in this PR had to be applied to both files, which highlights a maintainability issue. Having duplicated code increases the maintenance burden and the risk of inconsistencies. It would be best to refactor this to eliminate the duplication, for example by making one a symlink to the other, or adjusting the project structure so that both parts of the codebase can use a single shared file.
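The symlink option mentioned above can be illustrated in a scratch directory (the directory names here are stand-ins; in the repo, the link and target would be the two `vllm_engine.py` paths named in this comment):

```python
import os
import tempfile

# Scratch layout standing in for the two duplicated module locations.
root = tempfile.mkdtemp()
canonical = os.path.join(root, "canonical")
mirror = os.path.join(root, "mirror")
os.makedirs(canonical)
os.makedirs(mirror)

# Write the single source of truth.
src = os.path.join(canonical, "vllm_engine.py")
with open(src, "w") as f:
    f.write("ENGINE = 'shared'\n")

# The duplicate path becomes a symlink; edits now land in one file.
link = os.path.join(mirror, "vllm_engine.py")
os.symlink(src, link)

with open(link) as f:
    content = f.read()
print(content)  # ENGINE = 'shared'
```

A shared package that both trees import, or a re-export stub module, would achieve the same thing without relying on filesystem symlinks.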
`pyproject.toml` (outdated diff)
```diff
 "flash-attn==2.8.3; sys_platform == 'linux'",
-"torch==2.9.0; sys_platform == 'linux'",
+"torch==2.9.1; sys_platform == 'linux'",
 "flashinfer-python; sys_platform == 'linux' and platform_machine == 'x86_64'",
```
For consistency and to avoid potential resolution issues, it's good practice to pin `flashinfer-python` to the same version as `flashinfer-jit-cache`. The `megatron` extra pins both to 0.6.3. I suggest doing the same here for the `fsdp` extra:

```toml
"flashinfer-python==0.6.3; sys_platform == 'linux' and platform_machine == 'x86_64'",
```
`skyrl-train/pyproject.toml` (outdated diff)
```diff
 "flash-attn==2.8.3; sys_platform == 'linux'",
-"torch==2.9.0; sys_platform == 'linux'",
+"torch==2.9.1; sys_platform == 'linux'",
 "flashinfer-python; sys_platform == 'linux'",
```
Match the `megatron`/`mcore` extras, which already pin both `flashinfer-python` and `flashinfer-jit-cache` to 0.6.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `mcore` extra combines `vllm==0.16.0` (which needs numpy>=2 transitively) with `megatron-core==0.15.0` (which declares numpy<2). Without this override, `uv sync --extra mcore` from `skyrl-train` fails to resolve.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
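For readers unfamiliar with uv overrides, the override described above would look roughly like this in `pyproject.toml` (a sketch using uv's `override-dependencies` setting; the exact placement in this repo's config is an assumption):

```toml
# Sketch: force numpy>=2 despite megatron-core 0.15.0's numpy<2 pin.
[tool.uv]
override-dependencies = [
    "numpy>=2.0.0",
]
```

Unlike a constraint, an override replaces the conflicting requirement entirely, so it only works because megatron-core 0.15.0 was tested to run correctly on numpy 2.x.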
This PR:

- bumps megatron-core from 0.15.0 to 0.16.0
- updates the default for `moe_grouped_gemm` to `True`

## Overview

After #1240, checkpointing was broken for megatron because megatron-core==0.15.0 uses `np.product` in its checkpointing validation code, which was removed in numpy >= 2.0.0 in favor of `np.prod`. `megatron-core==0.16.0` is compatible with `numpy>=2.0.0` for all Python versions, so we can drop the numpy dependency override and stop pinning the Python version to 3.12 for megatron.

This PR also sets the default for `moe_grouped_gemm` to `True`; this is the default for most major models ([megatron bridge search](https://github.com/search?q=repo%3ANVIDIA-NeMo%2FMegatron-Bridge+moe_grouped_gemm&type=code&p=1)), including Kimi, Deepseek, Qwen, Llama, and GLM. Setting it to `False` by default in #1213 broke the test `test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]`, which now passes.

All megatron GPU CI tests should now be passing. All tests except `test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]` passed after upgrading to 0.16.0:

<img width="445" height="156" alt="image" src="https://github.com/user-attachments/assets/94598f76-08ac-45d7-93bb-73ff78d171b5" />

`test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]` passes after updating `moe_grouped_gemm` to `True` by default:

<img width="547" height="65" alt="image" src="https://github.com/user-attachments/assets/78ef5461-cd4b-4465-8ded-e652bb145470" />
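For reference, a minimal illustration of the numpy change that broke megatron-core 0.15.0's checkpoint validation (the variable names are illustrative, not megatron-core's actual call site):

```python
import numpy as np

# np.product was removed in numpy 2.0 in favor of np.prod,
# so code computing element counts from a shape must switch:
shape = (4, 8, 16)
num_elements = np.prod(shape)  # works on both numpy 1.x and 2.x
print(num_elements)  # 512
```

On numpy 2.x, `np.product(shape)` raises `AttributeError`, which is why the old megatron-core failed only after the numpy>=2 override landed.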
…r 0.16.0 upgrade (#1248)

# What does this PR do?

## Summary

Part 1 in resolving #1243 and #242. Fixes all the generation-related tests (pause/resume and weight-syncing fixes are pending after this PR). Fixes context length error message detection and some generation tests after the upgrade to vllm 0.16.0 in #1240.

## Changes

### Context length errors

The error message when the model hits the maximum context length has changed a bit; vLLM now outputs something like:

> You passed 3148 input characters and requested 1000 output tokens. However, the model's context length is only 1024 tokens, resulting in a maximum input length of 24 tokens (at most 3072 characters). Please reduce the length of the input prompt. (parameter=input_text, value=3148)

I've changed the string matching logic in `vllm_engine.py` and the assertions in the tests.

### Import path errors

Some of the import paths were also incorrect in the tests:

```bash
FAILED tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py::test_chat_completions - ModuleNotFoundError: No module named 'vllm.entrypoints.openai.protocol'
FAILED tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py::test_completions - ModuleNotFoundError: No module named 'vllm.entrypoints.openai.protocol'
```

The new paths are `vllm.entrypoints.openai.completion` and `.chat_completion`.

The following tests pass after this PR:

- `tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py`
- `tests/backends/skyrl_train/inference_engines/test_inference_engine_client.py`

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
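A hedged sketch of what the updated string matching might look like, given the new error text quoted above (the regex and helper name are assumptions for illustration, not the repo's actual code in `vllm_engine.py`):

```python
import re

# The vllm 0.16 message reads "... the model's context length is only N tokens ...".
CONTEXT_LEN_RE = re.compile(r"context length is only \d+ tokens")

def is_context_length_error(message: str) -> bool:
    # Hypothetical helper: True if the engine rejected the request because
    # prompt + requested output tokens exceeded the model's context window.
    return bool(CONTEXT_LEN_RE.search(message))

msg = (
    "You passed 3148 input characters and requested 1000 output tokens. "
    "However, the model's context length is only 1024 tokens, resulting in "
    "a maximum input length of 24 tokens (at most 3072 characters)."
)
print(is_context_length_error(msg))  # True
```

Matching on a stable phrase rather than the full sentence keeps the check robust to minor wording changes across vllm releases.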
)

### Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently after the vllm 0.16.0 bump (#1240)
- Add backward-compatible `_get_unfinished_request_ids()` helper to resolve internal vs external request ID mismatch
- Fixes #1243

### Root Cause

In vllm 0.16.0, [`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py) now creates **internal** request IDs (with a random suffix) that are distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id  # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from `output_processor.request_states.keys()` (which are now **internal** IDs) and passing them to `engine.abort()` with `internal=False` (the default). The abort looked them up in the `external_req_ids` mapping, found nothing, and **silently did nothing**. Requests completed normally with `finish_reason="length"` instead of `"abort"`.

This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with `finish_reason="abort"` so the retry loop can re-submit with accumulated tokens.

Related vllm changes:

- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix

Add a `_get_unfinished_request_ids()` static method on `BaseVLLMInferenceEngine` that:

- Uses `output_processor.external_req_ids.keys()` when available (vllm 0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm versions

Applied to all three abort call sites:

1. `AsyncVLLMInferenceEngine.abort_generation()` (used by fully async pause/resume)
2. `AsyncVLLMInferenceEngine.sleep()` (cleanup before sleep)
3. `VLLMInferenceEngine.sleep()` (sync engine cleanup before sleep)

### Test plan

- [x] `test_abort_generation_vllm_engine`: passes (was failing with `assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion`: passes
- [x] `test_continue_generation_generate_vllm_engine_generation`: passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project): ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI): https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

<img width="2163" height="976" alt="image" src="https://github.com/user-attachments/assets/eaece0dc-ca53-4dd1-b3d1-2f6e308a8a47" />

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
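A minimal sketch of the backward-compatible helper described above, with stand-in objects in place of vllm's `OutputProcessor` (attribute names follow the PR text; treat this as an illustration, not the repo's exact code):

```python
from types import SimpleNamespace

def get_unfinished_request_ids(output_processor):
    """Return user-facing (external) request IDs across vllm versions."""
    # vllm >= 0.16.0: external_req_ids is keyed by the user-provided IDs,
    # which is what engine.abort(..., internal=False) expects.
    external = getattr(output_processor, "external_req_ids", None)
    if external is not None:
        return list(external.keys())
    # Older vllm: request_states is keyed by the user-provided IDs directly.
    return list(output_processor.request_states.keys())

# Stand-ins for the two vllm layouts:
new_style = SimpleNamespace(
    external_req_ids={"req-1": object()},          # external ID
    request_states={"req-1-abcd1234": object()},   # internal ID (suffixed)
)
old_style = SimpleNamespace(request_states={"req-1": object()})

print(get_unfinished_request_ids(new_style))  # ['req-1']
print(get_unfinished_request_ids(old_style))  # ['req-1']
```

Either way the caller gets IDs that `engine.abort()` can resolve with its default `internal=False` lookup, which is exactly the mismatch the root-cause section describes.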
Summary
Upgrades vLLM from 0.13.0 to 0.16.0 and updates all dependencies required by the new version.
Version bumps

| Package | Old | New |
| --- | --- | --- |
| vllm | 0.13.0 | 0.16.0 |
| torch | 2.9.0 | 2.9.1 |
| flashinfer-python | 0.5.3 | 0.6.3 |
| flashinfer-jit-cache | 0.5.3 | 0.6.3 |

Override-dependencies added

- `numpy>=2.0.0`: vLLM 0.16.0 transitively needs numpy>=2 (via opencv-python-headless>=4.13), conflicting with megatron-core's `<2` pin. Tested: megatron-core 0.15.0 works correctly with numpy 2.x.

vLLM API migration (0.13 -> 0.16)

The `vllm.entrypoints.openai` module was restructured:

- `serving_chat` -> `chat_completion.serving`
- `serving_completion` -> `completion.serving`
- `serving_models` -> `models.serving`
- `protocol` split into `chat_completion.protocol`, `completion.protocol`, and `engine.protocol`
- `ErrorInfo` moved to a top-level import

Not included (separate PR)

The transformers 5.x upgrade and `return_dict=False` migration are in a separate PR, to be merged when vLLM officially supports transformers>=5.

Test plan

- `uv sync --extra megatron` resolves successfully

🤖 Generated with Claude Code