General scaling policy for HP transfer + Depth MuP recipe #4381
plugyawn wants to merge 28 commits into
Conversation
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process.
See the contribution guide for more details.
Adding the plots for depth-MuP soon.
Shifting the Depth-MuP implementation from TP VI (Yang's) to the ByteDance version, since it covers GPT-style transformers more comfortably. The paper is about MuonClip/Kimi-Muon, but it also covers Adam/AdamW.
Thanks for the review @janEbert! Could you take a look at the depth-MuP plots too? The optima seem to align well, and the curves seem bundled in the right way, but I'm not sure about the tail behaviour. Lower depths here prefer slightly higher LRs, which makes sense going by general wisdom.
Hey, this is a huge PR, so it will take some time to review. :)
Hey @plugyawn, we discussed in a smaller round how to tackle this PR and came to the conclusion that a design doc would be extremely helpful in order to better understand and review the PR. Ideally, you would include explanatory diagrams, considerations of edge cases, and explanations of why certain parts of the code needed to be touched. For example, the plots you already produced would also be perfect for this kind of document. This is also part of the PR template, so I hope you don't feel too thrown off by this request.
Fixed some bugs and redid the depth-MuP plots with width 2048, and verified across 4xA100s. Preparing the design doc, will share it soon!
@plugyawn really appreciate you taking time for the additional input! Just tagging @NVIDIA/mcore-oncall should be good enough (I'll update the PR template with this; "@mcore-oncall" seems to be outdated). Oncall is a rotating member of our team and, coincidentally, it's me right now, so feel free to just ask here/reach out via mail. :) Stacked PRs are not currently enabled in this repository.



What does this PR do?
Addresses part of #4088, introducing a scaling policy for transfer recipes.
This also subsumes and refactors #3058 and #3715 into a `mup` recipe. For now, I've kept the `--use-mup` flag connected (since it was part of the last release), but I'll add a deprecation warning for it, as sketched below.
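A minimal sketch of the planned deprecation path, assuming the legacy flag maps onto a new recipe selector (the `parameterization_recipe` argument name is an assumption, not this PR's actual API):

```python
import warnings


def resolve_recipe(args):
    """Map the legacy --use-mup flag onto the new recipe selector (sketch)."""
    if getattr(args, "use_mup", False):
        warnings.warn(
            "--use-mup is deprecated; select the recipe explicitly instead "
            "(e.g. via a hypothetical --parameterization-recipe mup).",
            DeprecationWarning,
        )
        return "mup"
    # Hypothetical argument name; defaults to the standard parameterization.
    return getattr(args, "parameterization_recipe", "none")
```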
Current recipes:
- `none` (default): the standard Megatron parameterization.
- `mup`: the refactored existing MuP behavior, preserved exactly.
- `depth_mup` (draft): supports Adam/AdamW with dense residual transformers, to demonstrate support for other scaling recipes.

For the `depth_mup` recipe, we apply depth-dependent multipliers of `depth_mult^-1`, `depth_mult^0`, and `depth_mult^-1`, and also `depth_mult^+0.5`; a sketch of how such a multiplier attaches to a residual branch follows below.
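To make the role of `depth_mult` concrete, here is a minimal sketch of a depth-scaled residual branch; the exponent, the `base_depth` reference point, and all names here are illustrative assumptions, not this PR's actual rule set:

```python
import torch
import torch.nn as nn


class DepthScaledResidual(nn.Module):
    """Residual add whose branch is scaled by depth_mult**alpha (sketch).

    depth_mult = depth / base_depth; alpha = -1.0 mirrors the depth_mult^-1
    multiplier mentioned above, but which tensors get which exponent is an
    assumption here, not the recipe's actual assignment.
    """

    def __init__(self, branch: nn.Module, depth: int, base_depth: int = 12,
                 alpha: float = -1.0):
        super().__init__()
        self.branch = branch
        self.scale = (depth / base_depth) ** alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x + depth_mult^alpha * f(x)
        return x + self.scale * self.branch(x)
```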
Chiefly, this PR adds a new `megatron.core.parameterization` package, which handles the recipe logic described above. For unsupported optimizers like SGD, we raise an error for now; some math discovery might be necessary here, since SGD depth transfer appears to need explicit hidden-weight and hidden-bias laws that would be complex to implement in one go.
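A sketch of the unsupported-optimizer guard, under the assumption of a single entry point that returns per-group multipliers (the function name, keys, and exponent assignments are hypothetical):

```python
SUPPORTED_OPTIMIZERS = {"adam", "adamw"}  # depth_mup laws exist for these only


def get_depth_mup_multipliers(optimizer_name: str, depth_mult: float) -> dict:
    """Return depth-dependent multipliers, failing loudly for SGD et al."""
    if optimizer_name.lower() not in SUPPORTED_OPTIMIZERS:
        # SGD depth transfer would need explicit hidden-weight/hidden-bias
        # laws that are not derived yet, so refuse rather than mis-scale.
        raise NotImplementedError(
            f"depth_mup has no scaling law for optimizer {optimizer_name!r}; "
            f"supported: {sorted(SUPPORTED_OPTIMIZERS)}"
        )
    # Hypothetical grouping; the exponents echo the multipliers listed above.
    return {
        "residual_branch": depth_mult ** -1.0,
        "learning_rate": depth_mult ** 0.0,
    }
```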
Plots for MuP: Adam, SGD, Muon respectively:



240 iterations, wikitext8, on an A100 80 GB.
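For reference, the kind of sweep behind such plots can be reproduced with a small harness like the following; `train_short` is a hypothetical stand-in for a 240-iteration Megatron run and is not part of this PR:

```python
import itertools

import matplotlib.pyplot as plt


def lr_transfer_sweep(train_short, depths=(2, 4, 8, 16),
                      lrs=(1e-4, 3e-4, 1e-3, 3e-3, 1e-2)):
    """Plot final loss vs. LR per depth to eyeball optimum alignment."""
    results = {d: [] for d in depths}
    for d, lr in itertools.product(depths, lrs):
        results[d].append(train_short(depth=d, lr=lr))
    for d, losses in results.items():
        plt.plot(lrs, losses, marker="o", label=f"depth={d}")
    plt.xscale("log")
    plt.xlabel("learning rate")
    plt.ylabel("final loss")
    plt.legend()
    plt.show()
```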
Needs some discussion on the YAML changes and the resume loop.
One more thought: I think a scaling policy and training recipes could be a great feature moving forward, since they let us do more through the CLI, and codex seems quite adept at using CLIs in general. I understand it's a more convenience-based idea, but it would be exciting to see first-class support for autonomous, principled pretraining runs.
Contribution process
Pre-checks
Code review
Feel free to message or mention @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.