
[train] Add support for LoRA in the new inference codepath#1329

Merged
SumanthRH merged 16 commits into main from add-lora-support
Mar 20, 2026

Conversation

@SumanthRH
Member

@SumanthRH SumanthRH commented Mar 16, 2026

What does this PR do?

Adds LoRA support in the new inference codepath.

Summary

  • Adds a new /v1/load_lora_adapter endpoint to VLLMServerActor to support loading LoRA weights in the vLLM server
  • Adds support for LoRA weights in RemoteInferenceClient with a new method load_lora_from_disk. The caller is expected to provide a lora_path argument pointing to the LoRA weights on disk.
  • Improved concurrency in generate: while running E2E LoRA tests in the new codepath, I noticed that generation speeds were 10x worse than in the old inference_engines/ codepath in SkyRL. The root cause is that we currently use a single semaphore for the combined generate + detokenize stage, which led to the detokenize stage starving GPUs once max concurrency was reached. The fix is to use separate semaphores so that generation can proceed independently. Generation speed is now at parity with the old codepath.
  • Adds a LoRA weight sync test to CI
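For illustration, the request shape for vLLM's dynamic LoRA endpoint looks roughly like this. This is a sketch, not SkyRL's actual code: the helper name, base URL, and paths are hypothetical, and the vLLM server must be started with VLLM_ALLOW_RUNTIME_LORA_UPDATING=1 for the endpoint to be available.

```python
import json

# Hypothetical helper illustrating the request shape for vLLM's dynamic
# LoRA endpoint. The body fields follow vLLM's load_lora_adapter API:
# "lora_name" registers the adapter under a name, "lora_path" points at
# the adapter weights on disk.
def build_load_lora_request(base_url: str, adapter_name: str, lora_path: str):
    """Return the (url, JSON-serializable body) for /v1/load_lora_adapter."""
    url = f"{base_url}/v1/load_lora_adapter"
    body = {"lora_name": adapter_name, "lora_path": lora_path}
    return url, body

url, body = build_load_lora_request(
    "http://localhost:8000", "skyrl-lora", "/tmp/ckpt/lora_adapter"
)
print(url)
print(json.dumps(body, sort_keys=True))
```

In the real client the body would then be POSTed to the server; here only the request construction is shown so the shape is clear.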

Test Plan

  • Unit tests: _SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra dev --extra fsdp pytest -s tests/backends/skyrl_train/gpu/gpu_ci/test_lora.py
  • E2E lora test

Results

I ran the LoRA example examples/train/lora/run_qwen2_5_0.5b_gsm8k_grpo_lora.sh with _SKYRL_USE_NEW_INFERENCE=1 and compared it with the baseline.

Some curves are below:

[Screenshots: training curves for the new codepath vs. the baseline]

SumanthRH and others added 14 commits March 13, 2026 01:01
Add LoRA adapter loading and generation support for the new inference stack
using vLLM's native /v1/load_lora_adapter endpoint.

Changes:
- RemoteInferenceClient: Add load_lora_adapter() method, handle LoraLoadRequest
  in update_named_weights(), track active LoRA name for generation
- VLLMServerActor: Set VLLM_ALLOW_RUNTIME_LORA_UPDATING=1 to enable endpoint
- Test utils: Remove LoRA blocker for new inference backend
- Tests: Add new inference stack LoRA tests (non-colocated, fsdp/fsdp2)
- Conftest: Add ray_init_new_inference_fixture with _SKYRL_USE_NEW_INFERENCE=1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in remote_inference_client.py (TYPE_CHECKING imports)
and test_lora.py (duplicate imports). Re-enable colocated test cases
for new inference LoRA tests now that CudaIpcInitInfo.for_servers() is available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…inference stack

- Remove duplicate test_policy_new_inference_lora_e2e (same body as existing test)
- Remove ray_init_new_inference_fixture (ray_init_fixture already reads _SKYRL_USE_NEW_INFERENCE from env)
- Auto-create tokenizer in run_inference when RemoteInferenceClient is used
- Add CI line to run test_lora.py with _SKYRL_USE_NEW_INFERENCE=1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in remote_inference_client.py: keep main's simplified
_generate_single (no retry loop) and add LoRA effective_model logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
- Set `_active_lora_name` on `RemoteInferenceClient` at construction time
  in `main_base.py` using fixed constant `_SKYRL_LORA_ADAPTER_NAME` ("skyrl-lora"),
  removing the need to propagate it back from FSDP workers after each sync
- Remove lora_name return value threading through `_save_lora_adapters_and_sync`,
  `broadcast_to_inference_engines` (fsdp_worker), and `WorkerDispatch`
- Auto-set `enforce_eager=False` in `SkyRLTrainConfig.__post_init__` when LoRA
  is enabled, matching the existing behavior in `create_ray_wrapped_inference_engines`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
…inference stack

Previously, detokenize was called inside _generate_single while holding the
gen_sem semaphore slot. This caused GPU underutilization: each slot was blocked
on a second HTTP roundtrip to /detokenize after generation completed, creating
waves where the GPU sat idle waiting for detokenization to finish.

Fix: split generate() into two pipelined stages with independent semaphores:
- gen_sem: released immediately when generation finishes, freeing the GPU slot
- detok_sem: controls concurrent detokenize calls independently

Generation time drops from ~252s to ~22-27s per step, matching the old stack.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eEngineState

- Rename _active_lora_name -> active_lora_name in RemoteInferenceClient:
  it is intended to be set by callers, so it should not be a private attribute
- Update main_base.py to use the new public field name
- InferenceEngineState.create(): add active_lora_name param, pass it to
  RemoteInferenceClient; also set cli_args.enable_lora=True when enable_lora=True
  so LoRA is properly enabled in the new inference codepath

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ve_lora_name

load_lora_adapter now always loads under self.active_lora_name (raising if unset)
instead of accepting a caller-supplied name. This prevents vLLM from accumulating
multiple LoRA adapters across weight syncs (max_loras=1).

- fsdp_worker: remove _SKYRL_LORA_ADAPTER_NAME constant and pass no name to
  load_lora_adapter — the worker has no business knowing the adapter name
- InferenceEngineState.create(): default active_lora_name to "skyrl-lora" when
  enable_lora=True and no name is provided

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
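A minimal sketch of the single-adapter policy described above, assuming a client that always (re)loads under its configured active LoRA name and raises if the name was never set. The class and attribute names are hypothetical, not SkyRL's actual API.

```python
# Hypothetical sketch: at most one adapter name is ever registered, because
# every load reuses active_lora_name. Reloading under the same name replaces
# the old weights instead of accumulating adapters (the server runs with
# max_loras=1).
class LoraClientSketch:
    def __init__(self, active_lora_name=None):
        self.active_lora_name = active_lora_name
        self.loaded = {}  # adapter name -> adapter path

    def load_lora_adapter(self, lora_path: str) -> str:
        if self.active_lora_name is None:
            raise ValueError("active_lora_name must be set before loading LoRA weights")
        self.loaded[self.active_lora_name] = lora_path
        return self.active_lora_name
```

The key property is that callers supply only a path, never a name, so repeated weight syncs cannot register a growing set of adapters on the server.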
vLLM's /v1/load_lora_adapter returns JSON on error (OpenAI-compatible),
so the defensive try/except around resp.json() is unnecessary.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines 292 to 296

if cache_reset_task is not None:
    await cache_reset_task
torch.cuda.empty_cache()
torch.distributed.barrier()
Member Author

Previously, we were not awaiting the cache_reset_task for LoRA.

@SumanthRH SumanthRH marked this pull request as ready for review March 19, 2026 18:02
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces comprehensive LoRA support within the new inference codepath, significantly enhancing the system's capabilities. Key improvements include a new /v1/load_lora_adapter endpoint for loading LoRA weights, and an updated RemoteInferenceClient to manage active LoRA adapters. The concurrency model in the generate function has been refactored to use separate semaphores for generation and detokenization, resolving a performance bottleneck and bringing generation speeds in line with the old codepath. Additionally, the wake_up method now proactively handles stale HTTP connections, improving reliability. Necessary configuration adjustments and test cases have been added to ensure the stability and correctness of these new features.

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 7 additional findings.


Contributor

@hao-aaron hao-aaron left a comment


LGTM, but I have a design question: do we want LoRA weight syncing to be on the same codepath as normal weight syncing instead of in a separate function? Thinking about the future LoRA weight sync API for vLLM, is that also something we want to adopt?

@SumanthRH
Member Author

Long term, we want update_named_weights to be the single interface for the trainer that supports in-memory weight sync for lora and full param.

Actually, the caller currently uses load_lora_adapter directly, so the changes in update_named_weights are not needed. Let me get rid of that.

…ead LoRA dispatch in update_named_weights

update_named_weights is for full parameter weight sync only — the lora_path
dispatch branch was dead code since fsdp_worker now calls the LoRA method
directly. Rename load_lora_from_disk to update_lora_from_disk to better reflect
that it is the LoRA counterpart to update_named_weights.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@SumanthRH
Member Author

Manual test: I've run the GPU CI test test_lora.py again after the new changes. Tests pass.

@SumanthRH SumanthRH merged commit a91f6e5 into main Mar 20, 2026
5 of 6 checks passed
SumanthRH added a commit that referenced this pull request Mar 20, 2026
devpatelio pushed a commit that referenced this pull request Mar 20, 2026