vLLM V1 migration #137

Open
rafapi wants to merge 25 commits into main from vllm_v1

Conversation

rafapi (Collaborator) commented Apr 22, 2026

Upgrade to vLLM V1 and restore vLLM V0 behaviour on the vLLM V1 rollout path.

[Figures: reward and logprobs curves]

The logprobs plot shows the discrepancy with the initial vanilla vLLM upgrade (green line).

rafapi requested a review from ehsk on Apr 22, 2026 at 13:11
rafapi (Collaborator, Author) commented Apr 22, 2026

@bigximik FYI

chunk_probs = torch.exp(chunk_logprobs)
entropy -= (chunk_probs * chunk_logprobs).sum(dim=-1)

del logits, selected_logits, log_norm

rafapi (Collaborator, Author):
All the changes above are an old bugfix to ensure we don't materialise the vocab tensors by default and take a lot of VRAM for no reason.
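For context, the pattern the excerpt comes from looks roughly like the sketch below: the entropy reduction is accumulated one vocabulary slice at a time, so the full [num_tokens, vocab] probability tensor is never materialised. The function name, chunk size, and the single full-vocab logsumexp are illustrative assumptions, not the PR's actual code.

```python
import torch

def entropy_over_vocab_chunks(logits: torch.Tensor, chunk_size: int = 8192) -> torch.Tensor:
    # Normaliser over the full vocab, computed once; afterwards only small slices
    # are exponentiated, instead of a full [num_tokens, vocab] softmax.
    log_norm = torch.logsumexp(logits.float(), dim=-1, keepdim=True)
    entropy = torch.zeros(logits.shape[:-1], device=logits.device)
    for start in range(0, logits.shape[-1], chunk_size):
        chunk_logits = logits[..., start : start + chunk_size].float()
        chunk_logprobs = chunk_logits - log_norm       # log-probabilities of this vocab slice
        chunk_probs = torch.exp(chunk_logprobs)
        entropy -= (chunk_probs * chunk_logprobs).sum(dim=-1)
        del chunk_logits, chunk_logprobs, chunk_probs  # free the slice before the next one
    del log_norm
    return entropy
```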

Comment thread pipelinerl/vllm1.py
"NCCL_PROTO", "NCCL_ALGO", "NCCL_NTHREADS", "NCCL_SOCKET_NTHREADS",
):
os.environ.pop(_k, None)

rafapi (Collaborator, Author):
The code above is not strictly required, but since I tested batch invariance I decided to leave it in for completeness.

bigximik added a commit that referenced this pull request Apr 23, 2026
The merge added a pause_generation/resume_generation wrap to both HTTP and
fast-llm weight-update paths symmetrically. On the fast-llm path this
deadlocks the initial (step=0) weight broadcast: engine.pause_generation
blocks waiting for in-flight requests to drain from a generator that hasn't
started yet, so the NCCL broadcast send from fast-llm never gets a receiver.

Origin/fast-llm calls the worker RPC directly with no wrap, and the baseline
run on counting.yaml completed 10/10 iterations cleanly. This commit restores
that behavior: EngineManager no longer has a receive_weight_update_fast_llm
method, and start_fast_llm_monitoring calls collective_rpc_async directly
again. HTTP path keeps the pause/resume wrap (PR #137's intended fix).
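Purely as an illustration of the asymmetry described above, the two paths could be pictured as in the sketch below; every name here (EngineManager, pause_generation, collective_rpc_async, and the RPC strings) is assumed from the commit message, not taken from the repository.

```python
# Illustrative sketch only; names are assumed from the commit message above.
class EngineManager:
    def __init__(self, engine):
        self.engine = engine

    async def receive_weight_update_http(self, request) -> dict:
        # HTTP path: the generator is already serving requests, so draining
        # in-flight work before swapping weights is safe (PR #137's intended fix).
        await self.engine.pause_generation()
        try:
            return await self.engine.collective_rpc_async("update_weights", args=(request,))
        finally:
            await self.engine.resume_generation()

    def start_fast_llm_monitoring(self):
        # fast-llm path: call the worker RPC directly, with no pause/resume wrap.
        # Wrapping it deadlocks the step=0 broadcast: pause_generation waits for
        # in-flight requests to drain from a generator that has not started yet,
        # while the trainer's NCCL broadcast send waits for this receiver.
        return self.engine.collective_rpc_async("receive_weight_update_fast_llm")
```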
Comment thread pipelinerl/async_llm.py
response.raise_for_status()
data = await response.json()
response_data = None
for attempt in range(2):

Collaborator:
why 3 times?

rafapi (Collaborator, Author):
twice :), just a basic retry

Collaborator:
My bad ;) Can we make it configurable? And add a small delay between the retries?

rafapi (Collaborator, Author), Apr 23, 2026:
No, I’d keep this as a single immediate retry. This path should stay as fast as possible, and if these aborts happen at all, we don’t want to add extra delay here.

Collaborator:
This makes me wonder what specifically causes these aborts in this system. If it's weight-update pauses, an immediate retry might just hit another abort if the update hasn't finished yet. In that case, the retry only helps if the pause is shorter than the round-trip time of the HTTP request, which seems fragile to assume.
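If these aborts turn out to matter, the reviewer's suggestion (a configurable number of attempts with a short pause between them) could look roughly like the sketch below. This is not what the PR adopts, which keeps a single immediate retry; the aiohttp session and the max_retries/retry_delay_s parameters are assumptions.

```python
import asyncio
import aiohttp

async def post_with_retry(session: aiohttp.ClientSession, url: str, payload: dict,
                          max_retries: int = 2, retry_delay_s: float = 0.05) -> dict:
    # Retry a POST a configurable number of times, sleeping briefly between attempts.
    last_exc: Exception | None = None
    for attempt in range(max_retries):
        try:
            async with session.post(url, json=payload) as response:
                response.raise_for_status()
                return await response.json()
        except aiohttp.ClientError as exc:  # e.g. a request aborted during a weight-update pause
            last_exc = exc
            if attempt + 1 < max_retries:
                await asyncio.sleep(retry_delay_s)
    raise last_exc
```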

Comment thread pipelinerl/vllm1.py
# vLLM batch_invariant mode sets restrictive NCCL env vars (single channel,
# tree algo, simple proto, P2P disabled) that the trainer does not share.
# Clear them so the weight-update NCCL comm matches trainer defaults.
# Safe at tp=1 because no intra-engine NCCL comm has been created yet.
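To make the tp=1 assumption explicit, the cleanup could be guarded as in the sketch below, which folds in the assert suggested in the thread that follows. The function name and argument are hypothetical, not the file's actual code.

```python
import os

def clear_batch_invariant_nccl_env(tensor_parallel_size: int) -> None:
    # Only safe before any intra-engine NCCL communicator exists, i.e. at tp=1.
    assert tensor_parallel_size == 1, (
        "clearing NCCL env vars is only safe when tensor-parallel-size is 1"
    )
    for _k in (
        "NCCL_PROTO", "NCCL_ALGO", "NCCL_NTHREADS", "NCCL_SOCKET_NTHREADS",
    ):
        os.environ.pop(_k, None)
```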

Collaborator:
Shouldn't we add an assert to ensure this occurs only when tensor-parallel-size is 1?

rafapi (Collaborator, Author):
See my comment in the files; this mode should never be used. I left it there because it's a useful default for NCCL compatibility with the trainer.

Collaborator:
Do we know if this works when batch invariant is enabled and tp > 1?

rafapi (Collaborator, Author):
It makes things slightly worse: it adds about a 2x delay that compounds through the run.

Collaborator:
Great, can we also clarify it in the comments for tp > 1?
