
[megatron] rebuild weight conversion tasks per sync to prevent stale PP-collective caches with bucketing #1345

Merged
erictang000 merged 2 commits into NovaSky-AI:main from erictang000:fix_megatron_pp_weight_sync
Mar 19, 2026

Conversation

@erictang000 erictang000 commented Mar 19, 2026

Summary

  • Bucketed weight sync reused WeightConversionTask objects (and their mapping caches) across sync cycles, causing incorrect vLLM weight updates for DeepSeek-V3 style models (like Moonlight-16B-A3B, or GLM-4.7-Flash) with PP > 1 (for both LoRA and full finetuning). The mapping objects cache PP-collective metadata that becomes stale across train/offload/reload cycles.
  • Fix: store only the bucket index structure (which task indices go in which bucket) once, and rebuild fresh tasks with clean mapping objects on each extract_weights() call. This preserves packed-broadcast performance while ensuring correct PP collectives every sync.

This bug manifested as extremely unstable training and reward collapse for DeepSeek-V3 style models with Megatron.
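The fix pattern described above can be sketched as follows. This is a minimal illustration, not SkyRL's actual API: the class and field names (`BucketedWeightSync`, `mapping_cache`, `bucket_size`) are hypothetical stand-ins. The key idea is that only plain integer bucket indices are cached across syncs, while task objects (and their PP-collective mapping caches) are rebuilt fresh on every `extract_weights()` call.

```python
from dataclasses import dataclass, field


@dataclass
class WeightConversionTask:
    """Illustrative stand-in for a per-parameter conversion task.

    In the real code, the mapping object caches PP-collective metadata,
    which is what went stale across train/offload/reload cycles.
    """
    name: str
    mapping_cache: dict = field(default_factory=dict)


class BucketedWeightSync:
    def __init__(self, param_names, bucket_size):
        self._param_names = param_names
        # Computed ONCE: which task indices belong to which bucket.
        # Plain integers cannot go stale, unlike cached task/mapping objects.
        self._bucket_indices = [
            list(range(i, min(i + bucket_size, len(param_names))))
            for i in range(0, len(param_names), bucket_size)
        ]

    def extract_weights(self):
        # Rebuild fresh tasks (with clean mapping caches) on every sync,
        # preserving the bucket structure used for packed broadcasts.
        for indices in self._bucket_indices:
            yield [WeightConversionTask(self._param_names[i]) for i in indices]


sync = BucketedWeightSync(["w0", "w1", "w2"], bucket_size=2)
buckets = [[t.name for t in b] for b in sync.extract_weights()]
# buckets == [["w0", "w1"], ["w2"]]
```

Because bucketing is decided once, the packed-broadcast layout stays identical across syncs; only the cheap task/mapping objects are reconstructed, so correctness is restored without giving up the bucketing speedup.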

Test plan

  • Moonlight-16B with PP=2: reward increases without significant weight sync time regression

Results

Moonlight-16B-A3B GSM8k

Before in purple, after the fix in blue:
![reward curve](https://github.com/user-attachments/assets/9a7119af-aea8-4b81-888f-f05cd3865c99)
![reward curve](https://github.com/user-attachments/assets/b899766f-9a7e-43fc-98f9-2f856dc04c3c)

Weight sync timing (~10s after, ~8s before):
![weight sync timing](https://github.com/user-attachments/assets/ed46cdbc-9b95-4533-bb5f-416db56a2847)

GLM-4.7-Flash GSM8k

Before in red, after the fix in tan (GLM with LoRA):
![GLM reward curve](https://github.com/user-attachments/assets/e647d6f4-42a9-4317-99d4-e3d1d940b038)

Weight sync timing (~15s after, ~12s before):
![weight sync timing](https://github.com/user-attachments/assets/cc98f4e9-314c-462c-ab7f-80b29bc96f70)



gemini-code-assist[bot] left a comment (marked as resolved).

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

@SumanthRH SumanthRH (Member) left a comment

stamp

@erictang000 erictang000 merged commit 957640e into NovaSky-AI:main Mar 19, 2026
4 of 6 checks passed
@erictang000 erictang000 deleted the fix_megatron_pp_weight_sync branch March 19, 2026 17:22
devpatelio pushed a commit that referenced this pull request Mar 20, 2026
…PP-collective caches with bucketing (#1345)

