[train] Add support for LoRA in the new inference codepath#1329
Conversation
Add LoRA adapter loading and generation support for the new inference stack using vLLM's native /v1/load_lora_adapter endpoint. Changes:
- RemoteInferenceClient: Add load_lora_adapter() method, handle LoraLoadRequest in update_named_weights(), track active LoRA name for generation
- VLLMServerActor: Set VLLM_ALLOW_RUNTIME_LORA_UPDATING=1 to enable the endpoint
- Test utils: Remove LoRA blocker for new inference backend
- Tests: Add new inference stack LoRA tests (non-colocated, fsdp/fsdp2)
- Conftest: Add ray_init_new_inference_fixture with _SKYRL_USE_NEW_INFERENCE=1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in remote_inference_client.py (TYPE_CHECKING imports) and test_lora.py (duplicate imports). Re-enable colocated test cases for new inference LoRA tests now that CudaIpcInitInfo.for_servers() is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…inference stack
- Remove duplicate test_policy_new_inference_lora_e2e (same body as existing test)
- Remove ray_init_new_inference_fixture (ray_init_fixture already reads _SKYRL_USE_NEW_INFERENCE from env)
- Auto-create tokenizer in run_inference when RemoteInferenceClient is used
- Add CI line to run test_lora.py with _SKYRL_USE_NEW_INFERENCE=1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in remote_inference_client.py: keep main's simplified _generate_single (no retry loop) and add LoRA effective_model logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set `_active_lora_name` on `RemoteInferenceClient` at construction time
in `main_base.py` using fixed constant `_SKYRL_LORA_ADAPTER_NAME` ("skyrl-lora"),
removing the need to propagate it back from FSDP workers after each sync
- Remove lora_name return value threading through `_save_lora_adapters_and_sync`,
`broadcast_to_inference_engines` (fsdp_worker), and `WorkerDispatch`
- Auto-set `enforce_eager=False` in `SkyRLTrainConfig.__post_init__` when LoRA
is enabled, matching the existing behavior in `create_ray_wrapped_inference_engines`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
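The `__post_init__` behavior described above can be sketched roughly as follows. This is a minimal illustration, not the real config class: the field `lora_rank` is a hypothetical stand-in for however `SkyRLTrainConfig` actually decides that LoRA is enabled.

```python
from dataclasses import dataclass


@dataclass
class SkyRLTrainConfig:
    # Illustrative subset of fields; `lora_rank > 0` stands in for the
    # real LoRA-enabled check.
    lora_rank: int = 0
    enforce_eager: bool = True

    def __post_init__(self):
        # Mirror the existing behavior in create_ray_wrapped_inference_engines:
        # when LoRA is enabled, auto-set enforce_eager=False so both codepaths
        # configure the engine consistently.
        if self.lora_rank > 0:
            self.enforce_eager = False
```
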
…inference stack

Previously, detokenize was called inside _generate_single while holding the gen_sem semaphore slot. This caused GPU underutilization: each slot was blocked on a second HTTP roundtrip to /detokenize after generation completed, creating waves where the GPU sat idle waiting for detokenization to finish.

Fix: split generate() into two pipelined stages with independent semaphores:
- gen_sem: released immediately when generation finishes, freeing the GPU slot
- detok_sem: controls concurrent detokenize calls independently

Generation time drops from ~252s to ~22-27s per step, matching the old stack.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
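The two-stage pipelining described in this commit can be sketched with a minimal asyncio example. The semaphore names `gen_sem` and `detok_sem` come from the commit message; the stub coroutines standing in for the HTTP calls to the vLLM server are illustrative only.

```python
import asyncio


async def fake_generate(prompt):
    # Stub for the generation request to the vLLM server.
    await asyncio.sleep(0)
    return [ord(c) for c in prompt]


async def fake_detokenize(token_ids):
    # Stub for the second HTTP roundtrip to /detokenize.
    await asyncio.sleep(0)
    return "".join(chr(t) for t in token_ids)


async def generate_batch(prompts, max_gen=2, max_detok=2):
    gen_sem = asyncio.Semaphore(max_gen)      # caps in-flight generation (GPU) slots
    detok_sem = asyncio.Semaphore(max_detok)  # caps concurrent detokenize calls

    async def _generate_single(prompt):
        async with gen_sem:
            token_ids = await fake_generate(prompt)
        # gen_sem is released here: the GPU slot frees up immediately, while
        # detokenization proceeds under its own semaphore. This pipelines the
        # two stages instead of holding one slot across both roundtrips.
        async with detok_sem:
            return await fake_detokenize(token_ids)

    return await asyncio.gather(*(_generate_single(p) for p in prompts))
```

Under the old single-semaphore scheme, the GPU slot stayed held through the detokenize roundtrip; with two semaphores the next generation can start as soon as the previous one finishes.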
…eEngineState

- Rename _active_lora_name -> active_lora_name in RemoteInferenceClient: it is intended to be set by callers, so it should not be a private attribute
- Update main_base.py to use the new public field name
- InferenceEngineState.create(): add active_lora_name param, pass it to RemoteInferenceClient; also set cli_args.enable_lora=True when enable_lora=True so LoRA is properly enabled in the new inference codepath

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ve_lora_name

load_lora_adapter now always loads under self.active_lora_name (raising if unset) instead of accepting a caller-supplied name. This prevents vLLM from accumulating multiple LoRA adapters across weight syncs (max_loras=1).

- fsdp_worker: remove _SKYRL_LORA_ADAPTER_NAME constant and pass no name to load_lora_adapter — the worker has no business knowing the adapter name
- InferenceEngineState.create(): default active_lora_name to "skyrl-lora" when enable_lora=True and no name is provided

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
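The invariant described in this commit can be sketched as follows. The attribute and method names (`active_lora_name`, `load_lora_adapter`) are from the commit; the class body is a simplified stand-in that records would-be server calls instead of issuing the real POST to /v1/load_lora_adapter.

```python
class RemoteInferenceClient:
    def __init__(self, active_lora_name=None):
        # Set by the caller (e.g. main_base.py at construction time);
        # FSDP workers never pass an adapter name.
        self.active_lora_name = active_lora_name
        self.loaded = []  # records (name, path) pairs sent to the server

    def load_lora_adapter(self, lora_path):
        # Always load under the single configured adapter name so repeated
        # weight syncs replace the adapter rather than accumulating new ones
        # (the server is assumed to run with max_loras=1).
        if self.active_lora_name is None:
            raise RuntimeError("active_lora_name must be set before loading a LoRA adapter")
        # The real client POSTs to /v1/load_lora_adapter here.
        self.loaded.append((self.active_lora_name, lora_path))
```

Because the name is fixed at construction, every sync reuses the same adapter slot on the server.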
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
vLLM's /v1/load_lora_adapter returns JSON on error (OpenAI-compatible), so the defensive try/except around resp.json() is unnecessary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    if cache_reset_task is not None:
        await cache_reset_task
    torch.cuda.empty_cache()
    torch.distributed.barrier()
Previously, we were not awaiting the cache_reset_task for LoRA
Code Review
This pull request introduces comprehensive LoRA support within the new inference codepath, significantly enhancing the system's capabilities. Key improvements include a new /v1/load_lora_adapter endpoint for loading LoRA weights, and an updated RemoteInferenceClient to manage active LoRA adapters. The concurrency model in the generate function has been refactored to use separate semaphores for generation and detokenization, resolving a performance bottleneck and bringing generation speeds in line with the old codepath. Additionally, the wake_up method now proactively handles stale HTTP connections, improving reliability. Necessary configuration adjustments and test cases have been added to ensure the stability and correctness of these new features.
hao-aaron left a comment
LGTM, but I have a design question. Do we want LoRA weight syncing to be on the same codepath as normal weight syncing instead of a separate function? Thinking about the future LoRA weight sync API for vLLM: is that also something we want to adopt?
Long term, we want Actually, currently the caller will directly use
…ead LoRA dispatch in update_named_weights

update_named_weights is for full parameter weight sync only — the lora_path dispatch branch was dead code since fsdp_worker now calls the LoRA method directly. Rename load_lora_from_disk to update_lora_from_disk to better reflect that it is the LoRA counterpart to update_named_weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Manual test: I've run the GPU CI test
# What does this PR do?

Adds LoRA support in the new inference codepath.

## Summary

- Adds a new `/v1/load_lora_adapter` endpoint to `VLLMServerActor` to support loading LoRA weights in the vLLM server
- Adds support for LoRA weights in `RemoteInferenceClient` with a new method `load_lora_from_disk`. The caller is expected to provide a `lora_path` argument pointing to the LoRA weights on disk.
- Improved concurrency in `generate`: while running E2E tests for LoRA in the new codepath, I noticed that generation speeds were 10x worse than in the old `inference_engines/` codepath in SkyRL. The root cause is that we currently use a single semaphore covering both the generate and detokenize stages, which let the detokenize stage starve GPUs once max concurrency was reached. The solution is to use separate semaphores so that generation can proceed independently. Generation speed is now at parity with the old codepath.
- Adds a LoRA weight sync test to CI

## Test Plan

- [x] Unit tests: `_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra fsdp pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py`
- [x] E2E LoRA test

## Results

I ran the LoRA example `examples/train/lora/run_qwen2_5_0.5b_gsm8k_grpo_lora.sh` with `_SKYRL_USE_NEW_INFERENCE=1` and compared it with the baseline. Some curves are below:

<img width="455" height="299" alt="Screenshot 2026-03-18 at 11 17 10 PM" src="https://github.com/user-attachments/assets/3ec1be26-6357-4eb0-a833-49366f19d135" />
<img width="448" height="291" alt="Screenshot 2026-03-18 at 11 17 23 PM" src="https://github.com/user-attachments/assets/f6c8f836-0cc0-4f46-87a9-d286d7d54f57" />
<img width="676" height="576" alt="image" src="https://github.com/user-attachments/assets/55126ef2-f1a3-48f4-9de2-83f7323d8fee" />

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
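A minimal sketch of calling the endpoint this PR exposes. The endpoint path and the `VLLM_ALLOW_RUNTIME_LORA_UPDATING=1` requirement are stated in the PR; the payload field names (`lora_name`, `lora_path`) follow vLLM's runtime LoRA API, and the helper function names here are hypothetical.

```python
import json
import urllib.request


def build_load_lora_request(base_url, lora_name, lora_path):
    # vLLM's runtime LoRA endpoint expects the adapter name and an
    # on-disk path to the adapter weights.
    url = f"{base_url}/v1/load_lora_adapter"
    payload = {"lora_name": lora_name, "lora_path": lora_path}
    return url, payload


def load_lora_adapter(base_url, lora_name, lora_path):
    # The server must be started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=1,
    # otherwise this endpoint is disabled.
    url, payload = build_load_lora_request(base_url, lora_name, lora_path)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Once loaded, generation requests can select the adapter by passing the LoRA name as the model name, which is what tracking an active LoRA name in `RemoteInferenceClient` enables.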