feat(ckpt): add --async-ckpt-use-cpu-shm argument#4355

Merged
sbak5 merged 7 commits into NVIDIA:main from sbak5:sbak/async_shm
Apr 18, 2026
Conversation

@sbak5
Contributor

@sbak5 sbak5 commented Apr 17, 2026

Wire nvidia-resiliency-ext cpu_shm_mode / use_cpu_shm_for_gpu_tensors into Megatron's async checkpointing stack. When enabled, GPU tensors are copied to per-tensor CPU shared-memory in the training process before handoff to the async worker, avoiding CUDA IPC / NVLink fabric handle exhaustion on MNNVL systems.

Backward-compatible: inspect.signature guards each call site so older nvrx installs without the new parameters continue to work unchanged, with a warning if the flag was explicitly requested.
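The signature-guard pattern described above can be sketched as follows. This is a minimal illustration, not the actual Megatron code: the helper name `call_with_optional_kwargs` and the stand-in `save_checkpoint` callee are hypothetical, but the `inspect.signature` technique is the one the PR describes — drop any keyword the installed library version does not accept, and warn when a dropped keyword was explicitly requested.

```python
import inspect
import warnings

def call_with_optional_kwargs(fn, *args, **optional_kwargs):
    """Pass a kwarg only if the callee's signature accepts it.

    Illustrative sketch (not the Megatron implementation): older
    nvidia-resiliency-ext releases lack the new parameters, so any kwarg
    the installed version does not understand is dropped with a warning.
    """
    params = inspect.signature(fn).parameters
    accepted = {k: v for k, v in optional_kwargs.items() if k in params}
    dropped = set(optional_kwargs) - set(accepted)
    if dropped:
        warnings.warn(
            f"Installed library does not support {sorted(dropped)}; ignoring."
        )
    return fn(*args, **accepted)

# Stand-in callee that predates the new parameter:
def save_checkpoint(path):
    return f"saved to {path}"

# The unsupported kwarg is silently dropped (with a warning) instead of
# raising TypeError, so older installs keep working unchanged.
result = call_with_optional_kwargs(
    save_checkpoint, "ckpt", use_cpu_shm_for_gpu_tensors=True
)
```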

It's recommended to use --async-ckpt-use-cpu-shm together with --ckpt-assume-constant-structure, because creating shm tensors on every checkpoint invocation is very costly.

With a constant structure, the set of shm tensors is cached, so both the trainer and the async checkpoint process keep those shm tensors and reuse them for streaming at every checkpoint interval.
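The caching behavior described above can be illustrated with a small stdlib sketch. This is not Megatron's implementation (which stages GPU tensors via nvidia-resiliency-ext); the `ShmTensorCache` class is a hypothetical stand-in showing why constant structure helps: shared-memory blocks are allocated once, then reused at every checkpoint interval, avoiding the costly per-invocation allocation.

```python
from multiprocessing import shared_memory

class ShmTensorCache:
    """Illustrative sketch: with a constant checkpoint structure, allocate
    one shared-memory block per tensor once and reuse it on every
    checkpoint, instead of recreating blocks (the costly part) each time."""

    def __init__(self):
        self._blocks = {}

    def stage(self, name, data: bytes) -> shared_memory.SharedMemory:
        # Allocate on first use; later checkpoints reuse the cached block.
        blk = self._blocks.get(name)
        if blk is None:
            blk = shared_memory.SharedMemory(create=True, size=len(data))
            self._blocks[name] = blk
        blk.buf[: len(data)] = data  # copy into shared memory for the async worker
        return blk

    def close(self):
        for blk in self._blocks.values():
            blk.close()
            blk.unlink()

cache = ShmTensorCache()
first = cache.stage("layer0.weight", b"\x01\x02\x03\x04")   # allocates
second = cache.stage("layer0.weight", b"\x05\x06\x07\x08")  # reuses
reused = first is second          # same block across "checkpoint intervals"
payload = bytes(second.buf[:4])   # latest data visible to the async worker
cache.close()
```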

What does this PR do?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch: the proposed review process is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@sbak5 sbak5 requested review from a team as code owners April 17, 2026 02:59
@copy-pr-bot

copy-pr-bot Bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft April 17, 2026 02:59
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

sbak5 and others added 4 commits April 16, 2026 21:23
Wire nvidia-resiliency-ext cpu_shm_mode / use_cpu_shm_for_gpu_tensors
into Megatron's async checkpointing stack. When enabled, GPU tensors
are copied to per-tensor CPU shared-memory in the training process
before handoff to the async worker, avoiding CUDA IPC / NVLink fabric
handle exhaustion on MNNVL systems.

Backward-compatible: inspect.signature guards each call site so older
nvrx installs without the new parameters continue to work unchanged,
with a warning if the flag was explicitly requested.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add --async-ckpt-use-cpu-shm to the GPT vp1 dist-optimizer overlap test
and the MoE 8-expert multi-dist-optimizer test, covering the new CPU shared
memory async checkpoint path under both ckpt-resume and regular test types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sbak5 sbak5 marked this pull request as ready for review April 17, 2026 04:36
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 17, 2026 04:36
@sbak5
Contributor Author

sbak5 commented Apr 17, 2026

/ok to test 88334f7

Contributor

@dimapihtar dimapihtar left a comment

LGTM. Thank you!

Move --async-ckpt-use-cpu-shm to gpt3_mcore_te_tp4_pp1_resume_torch_dist_dist_optimizer_overlap_grad_reduce_param_gather
and gpt3_moe_mcore_te_tp4_ep2_etp2_pp2_resume_torch_dist_dist_optimizer; remove it from the two previous holders.
Also add --ckpt-assume-constant-structure to the GPT tp4 test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@sbak5
Contributor Author

sbak5 commented Apr 17, 2026

/ok to test 0f74002

Comment thread on megatron/training/async_utils.py (outdated)
@sbak5
Contributor Author

sbak5 commented Apr 17, 2026

/ok to test e1fe228

@sbak5
Contributor Author

sbak5 commented Apr 17, 2026

@jaredcasper can you review the resolved changes and approve it?

@Phlip79 Phlip79 requested a review from jaredcasper April 17, 2026 20:39
@sbak5
Contributor Author

sbak5 commented Apr 17, 2026

/ok to test b95333a

@sbak5
Contributor Author

sbak5 commented Apr 17, 2026

Applied linting to the changed files and removed warn_rank_0, which is no longer used.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review label (PR is in the "final review" stage) Apr 17, 2026
@sbak5 sbak5 enabled auto-merge April 17, 2026 21:40
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved label (all necessary approvals have been made) and removed the Final Review label Apr 18, 2026
@sbak5 sbak5 added this pull request to the merge queue Apr 18, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24592192475

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24592354485

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24593075620

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 18, 2026
@sbak5 sbak5 added this pull request to the merge queue Apr 18, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24594901231

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24595647331

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 18, 2026
@sbak5 sbak5 modified the milestone: Core 0.16 Apr 18, 2026
@sbak5 sbak5 added this pull request to the merge queue Apr 18, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24598392582

Merged via the queue into NVIDIA:main with commit 9978968 Apr 18, 2026
347 of 351 checks passed
@sbak5 sbak5 deleted the sbak/async_shm branch April 18, 2026 07:07
Victarry added a commit to yanring/Megatron-LM that referenced this pull request Apr 20, 2026
* origin/main: (286 commits)
  Rename MambaModel/MambaStack to HybridModel/HybridStack (NVIDIA#4099)
  Fix Megatron initialization with extra_args_provider (NVIDIA#4327)
  Fix RL to once again work with --skip-train (NVIDIA#4249)
  Add activation logging and tokens per expert logging (NVIDIA#3842)
  Make param_index_map always use unpacked (full numel) offsets (NVIDIA#4328)
  FA4 Inference (NVIDIA#4186)
  Fix RL reward due to stop token (NVIDIA#4096)
  cp: Fix UT timeout (NVIDIA#4310) (NVIDIA#4373)
  feat(ckpt): add --async-ckpt-use-cpu-shm argument (NVIDIA#4355)
  Update copy-pr-bot.yaml [skip ci]
  Docs: improve docstrings and comments in example training loop (NVIDIA#4041)
  Add QK layernorm support for dot-product attention in MambaModel (NVIDIA#4067)
  Fix bug with non-partial rollouts (NVIDIA#3964)
  [docs] ci: use parent-relative json_url for version picker (NVIDIA#4367)
  Add tables and histogram for RL staleness (NVIDIA#4097)
  Port DeepSeek Sparse Attention to `MambaModel` (NVIDIA#3553)
  docs: bump versions1.json to 0.17.0 (latest) (NVIDIA#4360)
  Fix potential coredump issue that occurs when saving a checkpoint (NVIDIA#1871)
  ci(gb200): add 1-node mr-github functional test variants (NVIDIA#4334)
  fix: wait for async P2P send before deallocating output tensor (NVIDIA#4047)
  ...

# Conflicts:
#	megatron/core/transformer/cuda_graphs.py

Labels

Approved (all necessary approvals have been made), complexity: medium, Run functional tests


7 participants