feat(ckpt): add --async-ckpt-use-cpu-shm argument (#4355)
Conversation
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. See the contribution guide for more details.
Wire nvidia-resiliency-ext cpu_shm_mode / use_cpu_shm_for_gpu_tensors into Megatron's async checkpointing stack. When enabled, GPU tensors are copied into per-tensor CPU shared memory in the training process before handoff to the async worker, avoiding CUDA IPC / NVLink fabric handle exhaustion on MNNVL systems.

Backward-compatible: inspect.signature guards each call site, so older nvrx installs without the new parameters continue to work unchanged, with a warning if the flag was explicitly requested.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
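The inspect.signature guard described above can be sketched as follows. This is a minimal illustrative helper, not the actual Megatron or nvidia-resiliency-ext code; the function and parameter names below are assumptions chosen to mirror the PR description:

```python
import inspect
import warnings


def call_with_optional_kwarg(fn, *args, use_cpu_shm_for_gpu_tensors=False, **kwargs):
    """Call `fn`, passing the new kwarg only if `fn`'s signature accepts it.

    Hypothetical sketch of the backward-compat pattern: newer nvrx versions
    that expose the parameter receive it; older versions are called without
    it, with a warning when the caller explicitly requested the feature.
    """
    params = inspect.signature(fn).parameters
    if "use_cpu_shm_for_gpu_tensors" in params:
        return fn(*args, use_cpu_shm_for_gpu_tensors=use_cpu_shm_for_gpu_tensors, **kwargs)
    if use_cpu_shm_for_gpu_tensors:
        warnings.warn(
            "use_cpu_shm_for_gpu_tensors was requested, but the installed "
            "nvidia-resiliency-ext does not support it; proceeding without it."
        )
    return fn(*args, **kwargs)
```

With this guard, the same call site works against both old and new nvrx signatures, and the feature silently degrades to the previous behavior when unsupported.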
Add --async-ckpt-use-cpu-shm to the GPT vp1 dist-optimizer overlap test and the MoE 8-expert multi-dist-optimizer test, covering the new CPU shared-memory async checkpoint path under both ckpt-resume and regular test types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/ok to test 88334f7
Move --async-ckpt-use-cpu-shm to gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather and gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer; remove it from the two previous holders. Also add --ckpt-assume-constant-structure to the GPT tp4 test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/ok to test 0f74002
/ok to test e1fe228
@jaredcasper can you review the resolved changes and approve it?
/ok to test b95333a
Applied linting to the changed files and removed warn_rank_0, which is no longer used.
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24592192475 |
Merged origin/main (286 commits) into the PR branch; conflicts resolved in megatron/core/transformer/cuda_graphs.py.
It's recommended to use --async-ckpt-use-cpu-shm together with --ckpt-assume-constant-structure, because creating shm tensors on every checkpoint invocation is very costly. With constant structure, the set of shm tensors is cached, so both the trainer and the async checkpoint process keep those shm tensors for streaming at every checkpoint interval.
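The caching benefit of a constant checkpoint structure can be illustrated with a small sketch. This is not the Megatron implementation; it is a hypothetical cache using the Python standard library's shared memory to show why allocating shm buffers once and reusing them each interval is much cheaper than re-creating them per checkpoint:

```python
from multiprocessing import shared_memory


class ShmTensorCache:
    """Illustrative sketch: with a constant checkpoint structure, the
    shared-memory staging buffer for each tensor can be created once on the
    first checkpoint and reused on every later checkpoint interval."""

    def __init__(self):
        # key: (tensor_name, nbytes) -> SharedMemory block
        self._blocks = {}

    def get(self, name, nbytes):
        key = (name, nbytes)
        if key not in self._blocks:
            # Costly path: only taken on the first checkpoint for this tensor.
            self._blocks[key] = shared_memory.SharedMemory(create=True, size=nbytes)
        # Cheap path: the cached block is reused on every subsequent interval.
        return self._blocks[key]

    def close(self):
        for blk in self._blocks.values():
            blk.close()
            blk.unlink()
        self._blocks.clear()
```

If the checkpoint structure changed between invocations, the cache keys would miss and the expensive allocation path would run every time, which is why the two flags are recommended together.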
What does this PR do?
Contribution process
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned. For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into the `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.