
Fix potential coredump issue that occurs when saving a checkpoint#1871

Merged
ko3n1g merged 4 commits into NVIDIA:main from ezioliao:main
Apr 17, 2026
Conversation

@ezioliao
Contributor

This PR attempts to fix issue #1870.
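The diff itself is not shown in this thread, so as general background only: a common cause of crashes during checkpoint saving is the process shutting down while a background writer is still in flight, or a partially written file replacing a good one. The sketch below is a minimal, hypothetical illustration of the join-before-exit and write-then-rename pattern using only the standard library — `AsyncCheckpointer` is an invented name and is not Megatron-LM's actual API or the fix in this PR.

```python
import os
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical sketch: save checkpoints on a background thread, atomically."""

    def __init__(self):
        self._worker = None

    def save(self, state: bytes, path: str) -> None:
        # Wait for any in-flight save first so two writers never race on the file.
        self.finalize()
        self._worker = threading.Thread(
            target=self._write_atomic, args=(state, path), daemon=False
        )
        self._worker.start()

    def finalize(self) -> None:
        # Joining the writer before interpreter teardown avoids crashes caused
        # by a thread touching already-freed resources during shutdown.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    @staticmethod
    def _write_atomic(state: bytes, path: str) -> None:
        # Write to a temp file in the same directory, then rename into place:
        # readers never observe a partially written checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            f.write(state)
        os.replace(tmp, path)
```

The non-daemon worker plus an explicit `finalize()` call at exit is the key design choice here: daemon threads are killed abruptly at interpreter shutdown, which is one way background checkpoint writers end up crashing.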

@copy-pr-bot

copy-pr-bot Bot commented Oct 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sbhavani added the "bug" (Something isn't working) label Oct 17, 2025
@ko3n1g requested review from a team as code owners February 18, 2026 09:18
@Phlip79 added the "Final Review" (PR is in the "final review" stage) label Mar 4, 2026
@ericharper enabled auto-merge April 14, 2026 05:06
@svcnvidia-nemo-ci removed the "Final Review" (PR is in the "final review" stage) label Apr 14, 2026
@svcnvidia-nemo-ci added the "Approved" (All necessary approvals have been made) label Apr 14, 2026
@ericharper
Contributor

/ok to test 346696a

Add comments to handle potential core dump issue.
@deepakn94
Contributor

/ok to test 7a518ee

@deepakn94 changed the title from "Fixed a potential coredump issue that occurred when saving a checkpoint." to "Fix potential coredump issue that occurs when saving a checkpoint" Apr 16, 2026
@ericharper added this pull request to the merge queue Apr 17, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24544545365

@github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 17, 2026
@Phlip79 added this pull request to the merge queue Apr 17, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24548920944

@chtruong814 added the "needs-follow-up" (Issue needs follow-up) label Apr 17, 2026
@github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 17, 2026
@ko3n1g added this pull request to the merge queue Apr 17, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24552029960

Merged via the queue into NVIDIA:main with commit ded22f4 Apr 17, 2026
63 checks passed
@chtruong814 removed the "needs-follow-up" (Issue needs follow-up) label Apr 17, 2026
Victarry added a commit to yanring/Megatron-LM that referenced this pull request Apr 20, 2026
* origin/main: (286 commits)
  Rename MambaModel/MambaStack to HybridModel/HybridStack (NVIDIA#4099)
  Fix Megatron initialization with extra_args_provider (NVIDIA#4327)
  Fix RL to once again work with --skip-train (NVIDIA#4249)
  Add activation logging and tokens per expert logging (NVIDIA#3842)
  Make param_index_map always use unpacked (full numel) offsets (NVIDIA#4328)
  FA4 Inference (NVIDIA#4186)
  Fix RL reward due to stop token (NVIDIA#4096)
  cp: Fix UT timeout (NVIDIA#4310) (NVIDIA#4373)
  feat(ckpt): add --async-ckpt-use-cpu-shm argument (NVIDIA#4355)
  Update copy-pr-bot.yaml [skip ci]
  Docs: improve docstrings and comments in example training loop (NVIDIA#4041)
  Add QK layernorm support for dot-product attention in MambaModel (NVIDIA#4067)
  Fix bug with non-partial rollouts (NVIDIA#3964)
  [docs] ci: use parent-relative json_url for version picker (NVIDIA#4367)
  Add tables and histogram for RL staleness (NVIDIA#4097)
  Port DeepSeek Sparse Attention to `MambaModel` (NVIDIA#3553)
  docs: bump versions1.json to 0.17.0 (latest) (NVIDIA#4360)
  Fix potential coredump issue that occurs when saving a checkpoint (NVIDIA#1871)
  ci(gb200): add 1-node mr-github functional test variants (NVIDIA#4334)
  fix: wait for async P2P send before deallocating output tensor (NVIDIA#4047)
  ...

# Conflicts:
#	megatron/core/transformer/cuda_graphs.py

Labels

Approved (All necessary approvals have been made), bug (Something isn't working), community-request


9 participants