[megatron] improving weight syncing - bucketed param gather + cuda ipc flattening #487

erictang000 · 2025-10-15T22:48:46Z

Blocked on #453.

After the Megatron-Bridge migration, we can gather parameters in buckets, using precomputed size metadata, instead of iterating through them one by one. We then flatten each bucket into a single tensor, send the metadata alongside it, and recover the individual weights and shapes on the inference engine side.
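A minimal sketch of the flatten-and-recover idea described above, under stated assumptions: plain Python lists stand in for tensors so it runs without PyTorch, and `pack_bucket` / `unpack_bucket` are hypothetical names, not functions from this PR. In the actual change the receiving side slices a packed tensor and calls `.view(*shape)` on each slice.

```python
import math
from typing import List, Sequence, Tuple

# (name, shape, row-major values); a stand-in for a named tensor.
Param = Tuple[str, Sequence[int], List[float]]


def pack_bucket(bucket: List[Param]):
    """Flatten a bucket into one buffer plus the metadata needed to undo it."""
    names = [name for name, _, _ in bucket]
    shapes = [list(shape) for _, shape, _ in bucket]
    # Element count per parameter, computed once from the shape.
    sizes = [math.prod(shape) for shape in shapes]
    packed: List[float] = []
    for (_, _, values), size in zip(bucket, sizes):
        assert len(values) == size, "data does not match declared shape"
        packed.extend(values)
    return packed, names, shapes, sizes


def unpack_bucket(packed, names, shapes, sizes):
    """Recover per-parameter slices by walking the buffer with a running offset."""
    recovered = []
    offset = 0
    for name, shape, size in zip(names, shapes, sizes):
        recovered.append((name, shape, packed[offset : offset + size]))
        offset += size  # advance past this parameter's elements
    assert offset == len(packed), "metadata does not cover the whole buffer"
    return recovered
```

The point of the design is that only one buffer handle has to cross the process boundary per bucket; the names, shapes, and sizes are small and travel as ordinary metadata.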

For Qwen3-30B-A3B, tp=2, ep=8, etp=1, 8xH100:

Before (48s):

After bucketing gather (45s):

After flattening cuda ipc (and removing redundant torch.device() calls) - 6s:

SumanthRH · 2025-11-21T22:27:18Z

skyrl-train/skyrl_train/inference_engines/base.py

+    sizes: List[int]
    extras: Optional[List[Dict[str, Any]]]
+    packed: bool


The current PR will break compatibility for FSDP and DeepSpeed since they don't send these arguments.

You should probably make "sizes" a NotRequired field, and update the FSDP and DeepSpeed worker files to send packed=False.

erictang000: Good catch, updated and tested both.

SumanthRH · 2025-11-21T22:28:01Z

skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py

                    request["names"],
                    request["dtypes"],
                    request["shapes"],
+                    request["sizes"],


This should be an optional entry. Use the safer .get here.
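A rough sketch of what the `.get` suggestion could look like on the receiving side. `read_sync_fields` is a hypothetical helper, and defaulting `packed` to `False` when the key is missing is an assumption made here so that senders which omit it keep working.

```python
from typing import Any, Dict, List, Optional, Tuple


def read_sync_fields(request: Dict[str, Any]) -> Tuple[List[str], Optional[List[int]], bool]:
    """Read required and optional fields without raising KeyError on older senders."""
    names: List[str] = request["names"]  # always sent
    sizes: Optional[List[int]] = request.get("sizes")  # None when the key is absent
    packed: bool = request.get("packed", False)  # assumed default: not packed
    if packed and sizes is None:
        # A packed buffer cannot be split without per-parameter sizes.
        raise ValueError("packed requests must include 'sizes'")
    return names, sizes, packed
```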

SumanthRH · 2025-11-21T23:48:28Z

skyrl-train/skyrl_train/inference_engines/base.py

    names: List[str]
    dtypes: List[str]
    shapes: List[List[int]]
+    sizes: Optional[List[int]]


Nit: this should be NotRequired. Optional means that the field exists but may be None.

from typing import NotRequired
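To make the distinction concrete, here is a small sketch. The class name `WeightUpdateRequest` is hypothetical (the real class in base.py is not shown in this thread), and the fallback import assumes `typing_extensions` is available on Python versions older than 3.11.

```python
from typing import Any, Dict, List, Optional

try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:  # assumption: typing_extensions is installed on older versions
    from typing_extensions import NotRequired, TypedDict


class WeightUpdateRequest(TypedDict):  # hypothetical name, for illustration only
    names: List[str]
    dtypes: List[str]
    shapes: List[List[int]]
    sizes: NotRequired[List[int]]  # key may be absent entirely
    extras: Optional[List[Dict[str, Any]]]  # key must be present; value may be None
    packed: NotRequired[bool]  # key may be absent entirely


# A sender that does not pack can omit "sizes" and "packed" altogether.
unpacked_request: WeightUpdateRequest = {
    "names": ["w"],
    "dtypes": ["bfloat16"],
    "shapes": [[2, 3]],
    "extras": None,
}
```

With `Optional[...]` alone, a type checker would still flag `unpacked_request` for missing keys; `NotRequired[...]` is what lets the key be left out.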

SumanthRH · 2025-11-21T23:49:05Z

skyrl-train/skyrl_train/inference_engines/base.py

    shapes: List[List[int]]
+    sizes: Optional[List[int]]
    extras: Optional[List[Dict[str, Any]]]
+    packed: Optional[bool]


Same for this

SumanthRH · 2025-11-21T23:49:34Z

skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py

+
+            offset = 0
+            for name, shape, size in zip(names, shapes, sizes):
+                weight_list.append((name, packed_tensor[offset : offset + size].view(*shape)))


Should there be a dtype assert for this?
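One possible shape for such a check, as a sketch only: slices of a single packed buffer all share that buffer's dtype, so the per-parameter dtypes in the metadata should agree with it before unpacking. `check_packed_dtypes` is a hypothetical helper, and dtypes are compared as strings because the request carries them as `List[str]`.

```python
from typing import List


def check_packed_dtypes(dtypes: List[str], packed_dtype: str) -> None:
    """Fail early if any parameter's declared dtype differs from the packed buffer's."""
    mismatched = sorted({d for d in dtypes if d != packed_dtype})
    if mismatched:
        raise ValueError(
            f"packed buffer is {packed_dtype}, but metadata declares {mismatched}"
        )
```

This raises `ValueError` rather than using a bare `assert` so the check still runs when Python is started with `-O`.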

SumanthRH

Have left some minor comments. Please take a look

erictang000 added 12 commits October 8, 2025 22:29

x

f901243

getting issues w seq pack and pp fwd

60afc31

fix pp

3d87cc1

x

fad6f8d

Merge branch 'main' of https://github.com/erictang000/SkyRL into megatron_bridge

785ca3f

x

27d993c

x

3519b32

x

73cc288

x

89a3c54

x

6962904

working bucketed gather

7928a8a

bucketing/flattening during weight sync for cuda ipc + call torch.device less

b80daad

erictang000 changed the title ~~[megatron] improving weight syncing part 1 - bucketed parameter gathering~~ [megatron] improving weight syncing - bucketed param gather + cuda ipc flattening Oct 16, 2025

erictang000 added 9 commits November 7, 2025 22:54

Merge branch 'main' of https://github.com/erictang000/SkyRL into megatron_bridge

79e65c9

add ninja to build - most tests passing now

4f571d9

Merge branch 'main' of https://github.com/erictang000/SkyRL into megatron_bridge

adaff69

x

af3affb

x

5458a10

x

ae2e996

Merge branch 'megatron_bridge' of https://github.com/erictang000/SkyRL into weight_syncing

1b6a739

add tests for non colocated weight sync

c6b9115

Merge branch 'main' of https://github.com/erictang000/SkyRL into weight_syncing

5bbcfe8

erictang000 marked this pull request as ready for review November 21, 2025 18:28

erictang000 added 3 commits November 21, 2025 18:40

x

7a6faec

lint

f4e93a2

X

101f032

erictang000 requested a review from SumanthRH November 21, 2025 19:32

SumanthRH requested changes Nov 21, 2025

x

751b342

erictang000 requested a review from SumanthRH November 21, 2025 23:42

Merge branch 'main' of https://github.com/erictang000/SkyRL into weight_syncing

3d729d3

SumanthRH reviewed Nov 21, 2025

SumanthRH approved these changes Nov 21, 2025

address coments

0dbfa31

erictang000 merged commit 8c7ba7e into NovaSky-AI:main Nov 22, 2025
3 checks passed

erictang000 deleted the weight_syncing branch November 22, 2025 00:06

erictang000 mentioned this pull request Dec 1, 2025

[skyrl-train] Update FSDP weight syncing implementation to use cuda IPC bucketing #725

Closed

2 tasks
