[train] Add support for LoRA in the new inference codepath#1329
Conversation
Add LoRA adapter loading and generation support for the new inference stack using vLLM's native /v1/load_lora_adapter endpoint. Changes:
- RemoteInferenceClient: Add load_lora_adapter() method, handle LoraLoadRequest in update_named_weights(), track active LoRA name for generation
- VLLMServerActor: Set VLLM_ALLOW_RUNTIME_LORA_UPDATING=1 to enable the endpoint
- Test utils: Remove LoRA blocker for new inference backend
- Tests: Add new inference stack LoRA tests (non-colocated, fsdp/fsdp2)
- Conftest: Add ray_init_new_inference_fixture with _SKYRL_USE_NEW_INFERENCE=1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in remote_inference_client.py (TYPE_CHECKING imports) and test_lora.py (duplicate imports). Re-enable colocated test cases for new inference LoRA tests now that CudaIpcInitInfo.for_servers() is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…inference stack
- Remove duplicate test_policy_new_inference_lora_e2e (same body as existing test)
- Remove ray_init_new_inference_fixture (ray_init_fixture already reads _SKYRL_USE_NEW_INFERENCE from env)
- Auto-create tokenizer in run_inference when RemoteInferenceClient is used
- Add CI line to run test_lora.py with _SKYRL_USE_NEW_INFERENCE=1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in remote_inference_client.py: keep main's simplified _generate_single (no retry loop) and add LoRA effective_model logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set `_active_lora_name` on `RemoteInferenceClient` at construction time
in `main_base.py` using fixed constant `_SKYRL_LORA_ADAPTER_NAME` ("skyrl-lora"),
removing the need to propagate it back from FSDP workers after each sync
- Remove lora_name return value threading through `_save_lora_adapters_and_sync`,
`broadcast_to_inference_engines` (fsdp_worker), and `WorkerDispatch`
- Auto-set `enforce_eager=False` in `SkyRLTrainConfig.__post_init__` when LoRA
is enabled, matching the existing behavior in `create_ray_wrapped_inference_engines`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
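The `__post_init__` behavior described above can be sketched roughly as follows. This is a minimal illustration, not the real config class: the field `lora_rank` is a hypothetical stand-in for however `SkyRLTrainConfig` actually decides that LoRA is enabled.

```python
from dataclasses import dataclass


@dataclass
class SkyRLTrainConfig:
    # Illustrative subset of fields; `lora_rank > 0` stands in for the
    # real LoRA-enabled check.
    lora_rank: int = 0
    enforce_eager: bool = True

    def __post_init__(self):
        # Mirror the existing behavior in create_ray_wrapped_inference_engines:
        # when LoRA is enabled, auto-set enforce_eager=False so both codepaths
        # configure the engine consistently.
        if self.lora_rank > 0:
            self.enforce_eager = False
```
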
…inference stack

Previously, detokenize was called inside _generate_single while holding the gen_sem semaphore slot. This caused GPU underutilization: each slot was blocked on a second HTTP roundtrip to /detokenize after generation completed, creating waves where the GPU sat idle waiting for detokenization to finish.

Fix: split generate() into two pipelined stages with independent semaphores:
- gen_sem: released immediately when generation finishes, freeing the GPU slot
- detok_sem: controls concurrent detokenize calls independently

Generation time drops from ~252s to ~22-27s per step, matching the old stack.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
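The two-stage pipelining described in this commit can be sketched with a minimal asyncio example. The semaphore names `gen_sem` and `detok_sem` come from the commit message; the stub coroutines standing in for the HTTP calls to the vLLM server are illustrative only.

```python
import asyncio


async def fake_generate(prompt):
    # Stub for the generation request to the vLLM server.
    await asyncio.sleep(0)
    return [ord(c) for c in prompt]


async def fake_detokenize(token_ids):
    # Stub for the second HTTP roundtrip to /detokenize.
    await asyncio.sleep(0)
    return "".join(chr(t) for t in token_ids)


async def generate_batch(prompts, max_gen=2, max_detok=2):
    gen_sem = asyncio.Semaphore(max_gen)      # caps in-flight generation (GPU) slots
    detok_sem = asyncio.Semaphore(max_detok)  # caps concurrent detokenize calls

    async def _generate_single(prompt):
        async with gen_sem:
            token_ids = await fake_generate(prompt)
        # gen_sem is released here: the GPU slot frees up immediately, while
        # detokenization proceeds under its own semaphore. This pipelines the
        # two stages instead of holding one slot across both roundtrips.
        async with detok_sem:
            return await fake_detokenize(token_ids)

    return await asyncio.gather(*(_generate_single(p) for p in prompts))
```

Under the old single-semaphore scheme, the GPU slot stayed held through the detokenize roundtrip; with two semaphores the next generation can start as soon as the previous one finishes.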
…eEngineState

- Rename _active_lora_name -> active_lora_name in RemoteInferenceClient: it is intended to be set by callers, so it should not be a private attribute
- Update main_base.py to use the new public field name
- InferenceEngineState.create(): add active_lora_name param, pass it to RemoteInferenceClient; also set cli_args.enable_lora=True when enable_lora=True so LoRA is properly enabled in the new inference codepath

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ve_lora_name

load_lora_adapter now always loads under self.active_lora_name (raising if unset) instead of accepting a caller-supplied name. This prevents vLLM from accumulating multiple LoRA adapters across weight syncs (max_loras=1).

- fsdp_worker: remove _SKYRL_LORA_ADAPTER_NAME constant and pass no name to load_lora_adapter — the worker has no business knowing the adapter name
- InferenceEngineState.create(): default active_lora_name to "skyrl-lora" when enable_lora=True and no name is provided

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
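The invariant described in this commit can be sketched as follows. The attribute and method names (`active_lora_name`, `load_lora_adapter`) are from the commit; the class body is a simplified stand-in that records would-be server calls instead of issuing the real POST to /v1/load_lora_adapter.

```python
class RemoteInferenceClient:
    def __init__(self, active_lora_name=None):
        # Set by the caller (e.g. main_base.py at construction time);
        # FSDP workers never pass an adapter name.
        self.active_lora_name = active_lora_name
        self.loaded = []  # records (name, path) pairs sent to the server

    def load_lora_adapter(self, lora_path):
        # Always load under the single configured adapter name so repeated
        # weight syncs replace the adapter rather than accumulating new ones
        # (the server is assumed to run with max_loras=1).
        if self.active_lora_name is None:
            raise RuntimeError("active_lora_name must be set before loading a LoRA adapter")
        # The real client POSTs to /v1/load_lora_adapter here.
        self.loaded.append((self.active_lora_name, lora_path))
```

Because the name is fixed at construction, every sync reuses the same adapter slot on the server.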
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
vLLM's /v1/load_lora_adapter returns JSON on error (OpenAI-compatible), so the defensive try/except around resp.json() is unnecessary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
    if cache_reset_task is not None:
        await cache_reset_task
    torch.cuda.empty_cache()
    torch.distributed.barrier()
Previously, we were not awaiting the cache_reset_task for LoRA
Code Review
This pull request introduces comprehensive LoRA support within the new inference codepath, significantly enhancing the system's capabilities. Key improvements include a new /v1/load_lora_adapter endpoint for loading LoRA weights, and an updated RemoteInferenceClient to manage active LoRA adapters. The concurrency model in the generate function has been refactored to use separate semaphores for generation and detokenization, resolving a performance bottleneck and bringing generation speeds in line with the old codepath. Additionally, the wake_up method now proactively handles stale HTTP connections, improving reliability. Necessary configuration adjustments and test cases have been added to ensure the stability and correctness of these new features.
hao-aaron left a comment
LGTM, but I have a design question. Do we want LoRA weight syncing to be on the same codepath as normal weight syncing instead of a separate function? Thinking about the future LoRA weight sync API for vLLM: is that also something we want to adopt?
Long term, we want Actually, currently the caller will directly use
…ead LoRA dispatch in update_named_weights

update_named_weights is for full parameter weight sync only — the lora_path dispatch branch was dead code since fsdp_worker now calls the LoRA method directly. Rename load_lora_from_disk to update_lora_from_disk to better reflect that it is the LoRA counterpart to update_named_weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Manual test: I've run the GPU CI test
# What does this PR do?

Adds LoRA support in the new inference codepath.

## Summary

- Adds a new `/v1/load_lora_adapter` endpoint to `VLLMServerActor` to support loading LoRA weights in the vLLM server
- Adds support for LoRA weights in `RemoteInferenceClient` with a new method `load_lora_from_disk`. The caller is expected to provide a `lora_path` argument pointing to the LoRA weights on disk.
- Improved concurrency in `generate`: while running E2E tests for LoRA in the new codepath, I noticed that generation speeds were 10x worse than in the old `inference_engines/` codepath in SkyRL. The root cause is that we currently use a single semaphore covering both the generate and detokenize stages, which let the detokenize stage starve GPUs once max concurrency was reached. The solution is to use separate semaphores so that generation can proceed independently. Generation speed is now at parity with the old codepath.
- Adds a LoRA weight sync test to CI

## Test Plan

- [x] Unit tests: `_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra fsdp pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py`
- [x] E2E LoRA test

## Results

I ran the LoRA example `examples/train/lora/run_qwen2_5_0.5b_gsm8k_grpo_lora.sh` with `_SKYRL_USE_NEW_INFERENCE=1` and compared it with the baseline. Some curves are below:

<img width="455" height="299" alt="Screenshot 2026-03-18 at 11 17 10 PM" src="https://github.com/user-attachments/assets/3ec1be26-6357-4eb0-a833-49366f19d135" />
<img width="448" height="291" alt="Screenshot 2026-03-18 at 11 17 23 PM" src="https://github.com/user-attachments/assets/f6c8f836-0cc0-4f46-87a9-d286d7d54f57" />
<img width="676" height="576" alt="image" src="https://github.com/user-attachments/assets/55126ef2-f1a3-48f4-9de2-83f7323d8fee" />

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
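A minimal sketch of calling the endpoint this PR exposes. The endpoint path and the `VLLM_ALLOW_RUNTIME_LORA_UPDATING=1` requirement are stated in the PR; the payload field names (`lora_name`, `lora_path`) follow vLLM's runtime LoRA API, and the helper function names here are hypothetical.

```python
import json
import urllib.request


def build_load_lora_request(base_url, lora_name, lora_path):
    # vLLM's runtime LoRA endpoint expects the adapter name and an
    # on-disk path to the adapter weights.
    url = f"{base_url}/v1/load_lora_adapter"
    payload = {"lora_name": lora_name, "lora_path": lora_path}
    return url, payload


def load_lora_adapter(base_url, lora_name, lora_path):
    # The server must be started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=1,
    # otherwise this endpoint is disabled.
    url, payload = build_load_lora_request(base_url, lora_name, lora_path)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Once loaded, generation requests can select the adapter by passing the LoRA name as the model name, which is what tracking an active LoRA name in `RemoteInferenceClient` enables.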