[train] LoRA-only weight syncing for Megatron #1360

Open
hao-aaron wants to merge 9 commits into NovaSky-AI:main from hao-aaron:megatron-peft

Conversation

@hao-aaron (Contributor) commented Mar 20, 2026

closes #1336

Uses megatron-bridge's new export_adapter_weights API to save LoRA adapters to disk and load them in vLLM. Bumped the megatron-bridge version for the new API.

Added new Megatron tests to test_lora.py; run test_lora with both _USE_SKYRL_NEW_INFERENCE=0 and 1.

Added more robust shutdown handling to fix issues the test was hitting when using the new inference path.
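The disk-based flow described above can be sketched roughly as follows. This is an illustrative stand-in, not SkyRL's actual implementation: the helper names (`extract_lora_weights`, `save_adapter`) and the JSON serialization (in place of safetensors) are assumptions made to keep the example dependency-free.

```python
import json
import os
import tempfile

def extract_lora_weights(state_dict):
    # Keep only the adapter tensors, identified by the usual
    # lora_A / lora_B naming convention.
    return {k: v for k, v in state_dict.items() if "lora_A" in k or "lora_B" in k}

def save_adapter(state_dict, out_dir, rank=8, alpha=16):
    lora = extract_lora_weights(state_dict)
    # The real code path uses safetensors.torch.save_file; plain JSON
    # keeps this sketch dependency-free.
    with open(os.path.join(out_dir, "adapter_model.json"), "w") as f:
        json.dump(lora, f)
    config = {
        "r": rank,
        "lora_alpha": alpha,
        "target_modules": sorted({k.split(".lora_")[0] for k in lora}),
    }
    with open(os.path.join(out_dir, "adapter_config.json"), "w") as f:
        json.dump(config, f)
    return config

# Toy state dict standing in for the Megatron model's weights.
state = {
    "layers.0.attn.q_proj.lora_A": [0.1],
    "layers.0.attn.q_proj.lora_B": [0.2],
    "layers.0.attn.q_proj.weight": [1.0],
}
with tempfile.TemporaryDirectory() as d:
    cfg = save_adapter(state, d)
print(cfg["target_modules"])  # ['layers.0.attn.q_proj']
```

The inference engine would then be told to load the saved directory as a LoRA adapter rather than receiving a full merged checkpoint.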


@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for LoRA (Low-Rank Adaptation) adapter export and synchronization within the Megatron training framework. Key changes: pyproject.toml now pins a newer megatron-bridge commit and adds the transformers library; a new _save_lora_adapters_and_sync method in megatron_worker.py exports LoRA adapter weights and notifies the inference engine; and broadcast_to_inference_engines now uses this LoRA-specific synchronization. Additionally, test_lora.py gains new test cases for the Megatron strategy, and a redundant LoRA test case was removed from test_megatron_worker.py.

@hao-aaron hao-aaron marked this pull request as draft March 20, 2026 21:05
@hao-aaron hao-aaron marked this pull request as ready for review March 20, 2026 22:55
@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for LoRA-only weight syncing for Megatron models, which is a valuable feature. The changes include updating dependencies, implementing the core logic for saving and syncing LoRA adapters, and adding corresponding tests. The implementation appears correct and robust. I have one suggestion regarding import statements to improve code maintainability.

Comment on lines +816 to +822
```python
import json

from megatron.bridge.models.conversion.peft_bridge import (
    build_adapter_config_dict,
    infer_target_modules_from_adapter_weights,
)
from safetensors.torch import save_file
```
Severity: medium

Imports should generally be at the top of the file as per PEP 8 style guidelines. This improves readability and makes dependencies clear. The imports for json, megatron.bridge, and safetensors.torch here, as well as for RemoteInferenceClient on line 846, are local to this method. Unless there's a specific reason for lazy loading (like avoiding circular dependencies, which doesn't seem to be the case here), please move these imports to the top of the file.

```python
    self.weight_extractor.extract_weights(generator_dtype),
    weight_metadata=weight_metadata,
)
if self._is_lora:
```
Collaborator:

hmm i don't think we want to get rid of the option to merge loras back into the base model before syncing (since lora serving throughput is slower)

can we add a flag to gate this feature - something like trainer.policy.megatron_config.lora_config.merge_lora - and have the default still be True?

then when we get to in-memory lora weight sync we can replace the disk syncing?

Member:

+1.

We definitely want to support both modes. Default to merge for now
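A minimal sketch of the gating being requested, assuming a merge_lora flag as suggested above (the sync function and its return values are illustrative stand-ins, not SkyRL code):

```python
def sync_weights_to_inference(is_lora: bool, merge_lora: bool = True) -> str:
    # merge_lora defaults to True so existing behavior is unchanged.
    if is_lora and not merge_lora:
        # New path from this PR: export only the adapter weights and
        # have the inference engine load them as a LoRA adapter.
        return "lora_only_sync"
    # Default path: merge adapters into the base weights before syncing,
    # keeping full-throughput (non-LoRA) serving on the inference side.
    return "merged_sync"

print(sync_weights_to_inference(is_lora=True))                    # merged_sync
print(sync_weights_to_inference(is_lora=True, merge_lora=False))  # lora_only_sync
```

Keeping the merged path as the default also leaves room to swap the disk-based branch for in-memory adapter sync later without another behavior change.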

@erictang000 (Collaborator):

also will this lora only weight sync mode only support colocate_all=true for now? let's make sure to add that to config validation

Member:

@hao-aaron can you add this to the megatron CI here and ensure that tests pass with _SKYRL_USE_NEW_INFERENCE=1 ?

```shell
uv run --directory . --isolated --extra dev --extra megatron pytest -s tests/backends/skyrl_train/gpu/gpu_ci -m "megatron"
```

Comment on lines +70 to +71
```python
pytest.param(False, "nccl", "megatron", 2, marks=pytest.mark.megatron),
pytest.param(True, "nccl", "megatron", 2, marks=pytest.mark.megatron),
```
Member:

Let's also parameterize saving to disk vs in-memory weight sync here for LoRA (after addressing @erictang000 's comment)
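One way the suggested parameterization could look, adding a merge_lora axis to the megatron cases. Parameter names other than those in the snippet above are assumptions; the real test body lives in test_lora.py.

```python
import pytest

@pytest.mark.parametrize(
    "colocate_all, backend, strategy, tp_size, merge_lora",
    [
        pytest.param(False, "nccl", "megatron", 2, True, marks=pytest.mark.megatron),
        pytest.param(False, "nccl", "megatron", 2, False, marks=pytest.mark.megatron),
        pytest.param(True, "nccl", "megatron", 2, True, marks=pytest.mark.megatron),
        pytest.param(True, "nccl", "megatron", 2, False, marks=pytest.mark.megatron),
    ],
)
def test_lora_weight_sync(colocate_all, backend, strategy, tp_size, merge_lora):
    # Body elided; this only illustrates the parameter matrix.
    ...
```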

@hao-aaron (Contributor, Author) commented Mar 23, 2026

> also will this lora only weight sync mode only support colocate_all=true for now? let's make sure to add that to config validation

colocated and non colocated both work, i have tests for both in test_lora.py

@SumanthRH (Member) left a comment

Only 2 minor comments left!

```python
# Override build_vllm_cli_args which auto-enables LoRA based
# on lora rank in the config. For merged weight sync the
# inference engine must NOT have LoRA wrapping enabled.
cli_args.enable_lora = False
```
@SumanthRH (Member) commented Mar 23, 2026:

@hao-aaron actually the same fix is needed in build_vllm_cli_args itself? Otherwise this will fail outside of the tests?

We need to unset enable_lora in the vllm cli args if it's the megatron backend and merge_lora is true.
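The fix being requested could look roughly like this. Only the function name build_vllm_cli_args and the enable_lora flag come from the snippet above; the dict-based config and field names (backend, merge_lora, lora_rank) are illustrative assumptions.

```python
def build_vllm_cli_args(cfg: dict, cli_args: dict) -> dict:
    # Existing behavior (per the snippet above): a nonzero LoRA rank
    # auto-enables LoRA in the vLLM engine.
    if cfg.get("lora_rank", 0) > 0:
        cli_args["enable_lora"] = True
    # Proposed fix: merged weight sync ships fully merged weights, so
    # the engine must not be started with LoRA wrapping.
    if cfg.get("backend") == "megatron" and cfg.get("merge_lora", True):
        cli_args["enable_lora"] = False
    return cli_args

print(build_vllm_cli_args({"lora_rank": 16, "backend": "megatron"}, {}))
# {'enable_lora': False}
```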

@erictang000 (Collaborator):

wait can we hold off on merging this?

the megatron-bridge bump might cause issues for deepseek-v3 models:

```toml
megatron-bridge = {git = "https://github.com/NVIDIA-NeMo/Megatron-Bridge", rev = "f78c65f9", marker = "sys_platform == 'linux'"}
```

@erictang000 (Collaborator) left a comment

megatron-bridge commit bump is a little scary for deepseek-v3 models - can we test that initializing glm-flash-4.7 still works at all?

https://github.com/NovaSky-AI/SkyRL/blob/main/examples/train/megatron/run_megatron_grpo_glm4_7_30b.sh

i previously saw this script stop running after bumping megatron-bridge to a newer commit

let me test this out

@erictang000 (Collaborator):

> colocated and non colocated both work, i have tests for both in test_lora.py

how does it work for multi-node + non-colocated if it's writing to disk? i guess you can write to shared storage but that's a little non-obvious - we should at least put up a warning if non-colocated + not merging lora?
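The warning being suggested might be sketched as a small validation helper. The function and field names (colocate_all, merge_lora, num_nodes) are illustrative, not SkyRL's config schema:

```python
import warnings

def validate_lora_sync(colocate_all: bool, merge_lora: bool, num_nodes: int) -> None:
    # Disk-based LoRA-only sync relies on the inference engines being
    # able to read the adapter directory the trainer wrote.
    if not merge_lora and not colocate_all and num_nodes > 1:
        warnings.warn(
            "LoRA-only weight sync writes adapters to disk; in a "
            "non-colocated multi-node setup the save path must be on "
            "shared storage visible to the inference engines."
        )

validate_lora_sync(colocate_all=False, merge_lora=False, num_nodes=2)  # warns
validate_lora_sync(colocate_all=True, merge_lora=False, num_nodes=2)   # silent
```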

Development

Successfully merging this pull request may close these issues:

[train] Support LoRA-only weight syncing for Megatron backend