[train] LoRA-only weight syncing for Megatron #1360
hao-aaron wants to merge 9 commits into NovaSky-AI:main
Conversation
Code Review
This pull request introduces support for LoRA (Low-Rank Adaptation) adapter export and synchronization within the Megatron training framework. Key changes include: updating `pyproject.toml` to use a newer megatron-bridge commit and adding the transformers library; implementing a new `_save_lora_adapters_and_sync` method in `megatron_worker.py` to export LoRA adapter weights and notify the inference engine; and modifying the `broadcast_to_inference_engines` method to use this new LoRA-specific synchronization. Additionally, `test_lora.py` has been updated with new test cases for the Megatron strategy, while a redundant LoRA test case was removed from `test_megatron_worker.py`.
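A rough, hedged sketch of what such a LoRA-only export step might look like. The helper names `extract_lora_state` and `build_export_payload` are invented for illustration; the actual PR uses `safetensors.torch.save_file` plus the `megatron.bridge` peft_bridge helpers (`build_adapter_config_dict`, `infer_target_modules_from_adapter_weights`):

```python
import json

# Illustrative sketch only: select adapter-only tensors and build a PEFT-style
# adapter config. Helper names are invented; they are not the SkyRL API.
def extract_lora_state(state_dict):
    """Keep only LoRA adapter tensors; base weights are not synced."""
    return {k: v for k, v in state_dict.items()
            if ".lora_A" in k or ".lora_B" in k}

def build_export_payload(state_dict, rank, alpha):
    adapters = extract_lora_state(state_dict)
    # Infer target module names from adapter weight keys (mirrors the spirit
    # of infer_target_modules_from_adapter_weights).
    targets = sorted({k.split(".lora_")[0].rsplit(".", 1)[-1] for k in adapters})
    config = {"peft_type": "LORA", "r": rank, "lora_alpha": alpha,
              "target_modules": targets}
    # The real implementation would save `adapters` with safetensors and write
    # the config as adapter_config.json for vLLM to hot-load.
    return adapters, json.dumps(config)
```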
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Code Review
This pull request introduces support for LoRA-only weight syncing for Megatron models, which is a valuable feature. The changes include updating dependencies, implementing the core logic for saving and syncing LoRA adapters, and adding corresponding tests. The implementation appears correct and robust. I have one suggestion regarding import statements to improve code maintainability.
```python
import json

from megatron.bridge.models.conversion.peft_bridge import (
    build_adapter_config_dict,
    infer_target_modules_from_adapter_weights,
)
from safetensors.torch import save_file
```
Imports should generally be at the top of the file as per PEP 8 style guidelines. This improves readability and makes dependencies clear. The imports for json, megatron.bridge, and safetensors.torch here, as well as for RemoteInferenceClient on line 846, are local to this method. Unless there's a specific reason for lazy loading (like avoiding circular dependencies, which doesn't seem to be the case here), please move these imports to the top of the file.
```python
    self.weight_extractor.extract_weights(generator_dtype),
    weight_metadata=weight_metadata,
)
if self._is_lora:
```
Hmm, I don't think we want to get rid of the option to merge LoRAs back into the base model for syncing (since LoRA serving throughput is slower).
Can we add a flag to gate this feature, something like `trainer.policy.megatron_config.lora_config.merge_lora`, and have the default still be `True`?
Then when we get to in-memory LoRA weight sync we can replace the disk syncing.
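The gating the reviewers describe could look roughly like this. The function and mode names are invented for illustration; the real flag would live at `trainer.policy.megatron_config.lora_config.merge_lora`:

```python
# Hedged sketch of the suggested gate: merged sync stays the default, and the
# LoRA-only disk-sync path is opt-in. Names here are illustrative only.
def choose_sync_mode(is_lora: bool, merge_lora: bool = True) -> str:
    if is_lora and not merge_lora:
        # Opt-in path: export adapters to disk and hot-load them in vLLM.
        return "lora_adapter_sync"
    # Default path: merge LoRA into the base weights and sync everything.
    return "merged_full_sync"
```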
+1.
We definitely want to support both modes. Default to merge for now.
Also, will this LoRA-only weight sync mode only support …
@hao-aaron can you add this to the megatron CI here and ensure that tests pass with _SKYRL_USE_NEW_INFERENCE=1 ?
```python
pytest.param(False, "nccl", "megatron", 2, marks=pytest.mark.megatron),
pytest.param(True, "nccl", "megatron", 2, marks=pytest.mark.megatron),
```
Let's also parameterize saving to disk vs in-memory weight sync here for LoRA (after addressing @erictang000 's comment)
Colocated and non-colocated both work; I have tests for both in test_lora.py.
SumanthRH left a comment
Only 2 minor comments left!
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
312a02a to 0bf7283
```python
# Override build_vllm_cli_args which auto-enables LoRA based
# on lora rank in the config. For merged weight sync the
# inference engine must NOT have LoRA wrapping enabled.
cli_args.enable_lora = False
```
@hao-aaron actually, isn't the same fix needed in build_vllm_cli_args itself? Otherwise this will fail outside of the tests.
We need to unset enable_lora in the vLLM CLI args if it's the Megatron backend and merge_lora is true.
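A minimal sketch of the fix being requested. The function name `adjust_vllm_cli_args` and its signature are assumptions for illustration, not the actual SkyRL API:

```python
from types import SimpleNamespace

# Sketch of the suggested fix (names assumed): when training with the Megatron
# backend and merge_lora is on, the vLLM engine receives fully merged weights,
# so its LoRA wrapping must stay disabled regardless of the configured rank.
def adjust_vllm_cli_args(cli_args, backend: str, merge_lora: bool):
    if backend == "megatron" and merge_lora:
        # Merged weight sync: the inference engine must NOT enable LoRA.
        cli_args.enable_lora = False
    return cli_args
```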
Wait, can we hold off on merging this? The megatron-bridge bump might cause issues for deepseek-v3 models:
`megatron-bridge = {git = "https://github.com/NVIDIA-NeMo/Megatron-Bridge", rev = "f78c65f9", marker = "sys_platform == 'linux'"}`
The megatron-bridge commit bump is a little scary for deepseek-v3 models; can we test that initializing glm-flash-4.7 still works at all?
I've previously seen this script stop running after bumping megatron-bridge to a newer commit.
let me test this out
How does it work for multi-node + non-colocated if it's writing to disk? I guess you can write to shared storage, but that's a little non-obvious; we should at least put up a warning for non-colocated + not merging LoRA?
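The warning suggested here could be sketched as follows. The function name and message are assumptions for illustration, not the PR's actual code:

```python
import warnings

# Sketch of the suggested guard (names assumed): disk-based LoRA-only sync in
# a non-colocated setup silently requires shared storage, so warn loudly.
def warn_if_disk_sync_unshared(colocated: bool, merge_lora: bool) -> bool:
    """Return True (and emit a warning) when the risky combination is hit."""
    if not colocated and not merge_lora:
        warnings.warn(
            "LoRA-only weight sync writes adapters to disk; with non-colocated "
            "inference the adapter path must live on storage shared with the "
            "inference engine."
        )
        return True
    return False
```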
closes #1336
Uses Megatron's new `export_adapter_weights` to save LoRAs to disk and load them in vLLM. Bumped the megatron-bridge version for the new API.
Added new Megatron tests to `test_lora.py`; run test_lora with both `_USE_SKYRL_NEW_INFERENCE=0` and `1`.
Added more robust shutdown handling to fix issues the test hit when using the new inference path.