
Bump vLLM to 0.16.0 with required dep updates#1240

Merged
tyler-griggs merged 3 commits into main from tgriggs/vllm-0.16-upgrade
Mar 2, 2026

Conversation

@tyler-griggs
Member

@tyler-griggs tyler-griggs commented Mar 2, 2026

Summary

Upgrades vLLM from 0.13.0 to 0.16.0 and updates all dependencies required by the new version.

Version bumps

| Package | Before | After |
| --- | --- | --- |
| vllm | 0.13.0 | 0.16.0 |
| torch | 2.9.0 | 2.9.1 |
| flashinfer-python | 0.5.3 | 0.6.3 |
| flashinfer-jit-cache | 0.5.3 | 0.6.3 |

Override-dependencies added

  • numpy>=2.0.0 — vLLM 0.16.0 transitively needs numpy>=2 (via opencv-python-headless>=4.13), conflicting with megatron-core's <2 pin. Tested: megatron-core 0.15.0 works correctly with numpy 2.x.
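As a sketch of how such an override can be expressed in a uv-managed project (the exact file and table location in this repo are assumptions; uv's `override-dependencies` setting is the mechanism being described):

```toml
[tool.uv]
# Force the resolver to accept numpy 2.x even though megatron-core pins numpy<2.
override-dependencies = [
    "numpy>=2.0.0",
]
```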

vLLM API migration (0.13 -> 0.16)

The vllm.entrypoints.openai module was restructured:

  • serving_chat -> chat_completion.serving
  • serving_completion -> completion.serving
  • serving_models -> models.serving
  • protocol split into chat_completion.protocol, completion.protocol, engine.protocol
  • ErrorInfo moved to top-level import

Not included (separate PR)

transformers 5.x upgrade and return_dict=False migration are in a separate PR, to be merged when vLLM officially supports transformers>=5.

Test plan

  • uv sync --extra megatron resolves successfully
  • Existing Megatron training examples still work
  • vLLM import paths resolve correctly

🤖 Generated with Claude Code



- vllm: 0.13.0 -> 0.16.0
- torch: 2.9.0 -> 2.9.1 (required by vLLM 0.16.0)
- flashinfer-python: 0.5.3 -> 0.6.3 (required by vLLM 0.16.0)
- flashinfer-jit-cache: 0.5.3 -> 0.6.3
- numpy>=2.0.0 override (vLLM 0.16.0 -> opencv-python-headless>=4.13
  -> numpy>=2, conflicting with megatron-core's <2 pin; tested
  compatible with megatron-core 0.15.0)

Migrates vLLM import paths (0.13 -> 0.16):
- serving_chat -> chat_completion.serving
- serving_completion -> completion.serving
- serving_models -> models.serving
- protocol split into chat_completion/completion/engine.protocol
- ErrorInfo moved to top-level import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request upgrades vLLM to version 0.16.0 and updates its dependencies accordingly. The changes in pyproject.toml files correctly bump the versions. The Python code is updated to reflect the API changes in the new vLLM version, mainly around import paths.

My review has identified a couple of areas for improvement:

  1. There's some inconsistency in dependency pinning for flashinfer-python, which I've suggested to fix.
  2. I've found significant code duplication between skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py and skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py, which should be addressed to improve maintainability.

Comment on lines +14 to 28
from vllm.entrypoints.openai.chat_completion.serving import OpenAIServingChat
from vllm.entrypoints.openai.completion.serving import OpenAIServingCompletion
from vllm.entrypoints.openai.models.serving import BaseModelPath, OpenAIServingModels
from vllm.entrypoints.openai.chat_completion.protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ErrorResponse,
)
from vllm.entrypoints.openai.completion.protocol import (
    CompletionRequest,
    CompletionResponse,
)
from vllm.entrypoints.openai.engine.protocol import ErrorInfo, ErrorResponse
from vllm.lora.request import LoRARequest
from uuid import uuid4
from skyrl.backends.skyrl_train.inference_engines.base import (

high

This file appears to be an exact duplicate of skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py. The changes in this PR had to be applied to both files, which highlights a maintainability issue. Having duplicated code increases the maintenance burden and the risk of inconsistencies. It would be best to refactor this to eliminate the duplication, for example by making one a symlink to the other, or adjusting the project structure so that both parts of the codebase can use a single shared file.

pyproject.toml Outdated
  "flash-attn==2.8.3; sys_platform == 'linux'",
- "torch==2.9.0; sys_platform == 'linux'",
+ "torch==2.9.1; sys_platform == 'linux'",
  "flashinfer-python; sys_platform == 'linux' and platform_machine == 'x86_64'",

medium

For consistency and to avoid potential resolution issues, it's a good practice to pin flashinfer-python to the same version as flashinfer-jit-cache. The megatron extra pins both to 0.6.3. I suggest doing the same here for the fsdp extra.

    "flashinfer-python==0.6.3; sys_platform == 'linux' and platform_machine == 'x86_64'",

  "flash-attn==2.8.3; sys_platform == 'linux'",
- "torch==2.9.0; sys_platform == 'linux'",
+ "torch==2.9.1; sys_platform == 'linux'",
  "flashinfer-python; sys_platform == 'linux'",

medium

For consistency with flashinfer-jit-cache and to prevent potential version mismatches, it's advisable to pin flashinfer-python here. The mcore extra in this file pins it to 0.6.3.

Suggested change
- "flashinfer-python; sys_platform == 'linux'",
+ "flashinfer-python==0.6.3; sys_platform == 'linux'",


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


Match the megatron/mcore extras which already pin both
flashinfer-python and flashinfer-jit-cache to 0.6.3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The mcore extra combines vllm==0.16.0 (needs numpy>=2 transitively)
with megatron-core==0.15.0 (declares numpy<2). Without this override,
uv sync --extra mcore from skyrl-train fails to resolve.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tyler-griggs tyler-griggs merged commit d4b0cb6 into main Mar 2, 2026
10 checks passed
@CharlieFRuan CharlieFRuan deleted the tgriggs/vllm-0.16-upgrade branch March 2, 2026 20:19
erictang000 added a commit that referenced this pull request Mar 2, 2026
This PR:
- bumps megatron-core from 0.15.0 to 0.16.0
- updates the default for `moe_grouped_gemm` to `True`

## Overview
After #1240, checkpointing was broken for Megatron because
megatron-core==0.15.0 uses `np.product` in its checkpointing validation
code, and `np.product` was removed in numpy 2.0 in favor of `np.prod`.
`megatron-core==0.16.0` is compatible with `numpy>=2.0.0` on all Python
versions, so we can drop the numpy dependency override and stop pinning
the Python version for megatron to 3.12.
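A minimal illustration of the breaking change (this uses only NumPy's public API, not the repo's code):

```python
import numpy as np

# np.product was removed in NumPy 2.0; np.prod is the drop-in replacement.
shape = (4, 8, 16)
n_elements = int(np.prod(shape))
print(n_elements)  # 512
```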

Also sets the default for `moe_grouped_gemm` to `True`. This is the
default for most major models ([megatron bridge
search](https://github.com/search?q=repo%3ANVIDIA-NeMo%2FMegatron-Bridge+moe_grouped_gemm&type=code&p=1)),
including Kimi, DeepSeek, Qwen, Llama, and GLM. Setting it to `False` by
default in #1213 broke the test
`test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]`, which
now passes.

All megatron GPU CI tests should now be passing.


All tests except
`test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]` passing
after upgrading to 0.16.0.
<img width="445" height="156" alt="image"
src="https://github.com/user-attachments/assets/94598f76-08ac-45d7-93bb-73ff78d171b5"
/>

`test_megatron_forward[tp4_pp1_cp1_ep4_etp1_policy_seq_packing]` passing
after updating `moe_grouped_gemm` to `True` by default
<img width="547" height="65" alt="image"
src="https://github.com/user-attachments/assets/78ef5461-cd4b-4465-8ded-e652bb145470"
/>

 
SumanthRH added a commit that referenced this pull request Mar 2, 2026
…r 0.16.0 upgrade (#1248)

# What does this PR do?

## Summary
Part 1 in resolving #1243 and #242

Fixes all the generation-related tests (pause/resume and weight-syncing
fixes are pending in a follow-up PR).

Fixes context-length error message detection and several generation
tests after the upgrade to vLLM 0.16.0 in #1240.

## Changes

### Context length errors

The error message when the model hits the maximum context length has
changed slightly; vLLM now outputs something like:

> You passed 3148 input characters and requested 1000 output tokens.
However, the model's context length is only 1024 tokens, resulting in a
maximum input length of 24 tokens (at most 3072 characters). Please
reduce the length of the input prompt. (parameter=input_text,
value=3148)

I've changed the string-matching logic in `vllm_engine.py` and the
assertions in the tests.
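A minimal sketch of what such detection can look like (the function name and regex here are illustrative, not the repo's actual `vllm_engine.py` code), matched against the new message format quoted above:

```python
import re

# Match the vLLM 0.16-style context-length error quoted above.
CONTEXT_LENGTH_PATTERN = re.compile(r"context length is only \d+ tokens")

def is_context_length_error(message: str) -> bool:
    """Return True if the message looks like a max-context-length error."""
    return bool(CONTEXT_LENGTH_PATTERN.search(message))

msg = (
    "You passed 3148 input characters and requested 1000 output tokens. "
    "However, the model's context length is only 1024 tokens, resulting in "
    "a maximum input length of 24 tokens (at most 3072 characters)."
)
print(is_context_length_error(msg))  # True
```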

### Import path errors

Also, some of the import paths were incorrect in the tests:

```bash
FAILED tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py::test_chat_completions - ModuleNotFoundError: No module named 'vllm.entrypoints.openai.protocol'
FAILED tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py::test_completions - ModuleNotFoundError: No module named 'vllm.entrypoints.openai.protocol'
```

The new paths are `vllm.entrypoints.openai.completion` and
`vllm.entrypoints.openai.chat_completion`.


The following tests pass after this PR:

-
`tests/backends/skyrl_train/gpu/gpu_ci/inference_servers/test_new_inference_generation.py`
-
`tests/backends/skyrl_train/inference_engines/test_inference_engine_client.py`

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
SumanthRH pushed a commit that referenced this pull request Mar 2, 2026

### Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently
after the vllm 0.16.0 bump (#1240)
- Add backward-compatible `_get_unfinished_request_ids()` helper to
resolve internal vs external request ID mismatch
- Fixes #1243

### Root Cause

In vllm 0.16.0,
[`InputProcessor.assign_request_id()`](https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/input_processor.py)
now creates **internal** request IDs (with a random suffix) that are
distinct from the user-provided **external** request IDs:

```python
request.external_req_id = request.request_id                       # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from
`output_processor.request_states.keys()` (which are now **internal**
IDs) and passing them to `engine.abort()` with `internal=False` (the
default). The abort looked them up in the `external_req_ids` mapping,
found nothing, and **silently did nothing**. Requests completed normally
with `finish_reason="length"` instead of `"abort"`.
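The mismatch can be sketched without vLLM at all (the dict names mirror the description above; they are stand-ins, not vLLM's real data structures):

```python
# external_req_ids is keyed by the user-provided ID; request_states is keyed
# by the new internal ID, which carries a random suffix.
external_req_ids = {"req-1": "req-1-a1b2c3d4"}
request_states = {"req-1-a1b2c3d4": "running"}

# The broken path read internal IDs from request_states...
ids_to_abort = list(request_states.keys())
# ...then abort(internal=False) looked them up among the external IDs:
matched = [rid for rid in ids_to_abort if rid in external_req_ids]
print(matched)  # [] -- no match, so the abort silently did nothing
```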

This broke fully async RL's pause/resume flow, which relies on abort
returning partial outputs with `finish_reason="abort"` so the retry loop
can re-submit with accumulated tokens.

Related vllm changes:
- vllm-project/vllm#32103
- vllm-project/vllm#32351
- vllm-project/vllm#34125
- vllm-project/vllm#34528

### Fix

Add a `_get_unfinished_request_ids()` static method on
`BaseVLLMInferenceEngine` that:
- Uses `output_processor.external_req_ids.keys()` when available (vllm
0.16.0+)
- Falls back to `output_processor.request_states.keys()` for older vllm
versions
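A self-contained sketch of the helper described above (mock objects stand in for vLLM's engine internals; the attribute names follow the PR description, not verified vLLM source):

```python
from types import SimpleNamespace

def get_unfinished_request_ids(output_processor) -> list:
    """Return user-facing IDs of unfinished requests across vLLM versions."""
    # vLLM 0.16.0+: external_req_ids is keyed by the user-provided IDs.
    external = getattr(output_processor, "external_req_ids", None)
    if external is not None:
        return list(external.keys())
    # Older vLLM: request_states is keyed directly by user-provided IDs.
    return list(output_processor.request_states.keys())

new_vllm = SimpleNamespace(
    external_req_ids={"req-1": "req-1-a1b2c3d4"},
    request_states={"req-1-a1b2c3d4": "running"},
)
old_vllm = SimpleNamespace(request_states={"req-1": "running"})

print(get_unfinished_request_ids(new_vllm))  # ['req-1']
print(get_unfinished_request_ids(old_vllm))  # ['req-1']
```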

Applied to all three abort call sites:
1. `AsyncVLLMInferenceEngine.abort_generation()` — used by fully async
pause/resume
2. `AsyncVLLMInferenceEngine.sleep()` — cleanup before sleep
3. `VLLMInferenceEngine.sleep()` — sync engine cleanup before sleep

### Test plan

- [x] `test_abort_generation_vllm_engine` — passes (was failing with
`assert 'length' == 'abort'`)
- [x] `test_continue_generation_vllm_engine_chat_completion` — passes
- [x] `test_continue_generation_generate_vllm_engine_generation` —
passes
- [x] E2E fully async gsm8k (`gsm8k_fully_async_ci` project) — ran ~12
training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI)
https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci

<img width="2163" height="976" alt="image"
src="https://github.com/user-attachments/assets/eaece0dc-ca53-4dd1-b3d1-2f6e308a8a47"
/>



Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>