
Fix potential coredump issue that occurs when saving a checkpoint#1871

Merged
ko3n1g merged 4 commits into NVIDIA:main from ezioliao:main
Apr 17, 2026
Conversation

@ezioliao
Contributor

This PR attempts to fix issue #1870.
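The diff itself is not shown in this thread, so as general background only: a common cause of crashes during checkpoint saving is the process shutting down while a background writer is still in flight, or a partially written file replacing a good one. The sketch below is a minimal, hypothetical illustration of the join-before-exit and write-then-rename pattern using only the standard library — `AsyncCheckpointer` is an invented name and is not Megatron-LM's actual API or the fix in this PR.

```python
import os
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical sketch: save checkpoints on a background thread, atomically."""

    def __init__(self):
        self._worker = None

    def save(self, state: bytes, path: str) -> None:
        # Wait for any in-flight save first so two writers never race on the file.
        self.finalize()
        self._worker = threading.Thread(
            target=self._write_atomic, args=(state, path), daemon=False
        )
        self._worker.start()

    def finalize(self) -> None:
        # Joining the writer before interpreter teardown avoids crashes caused
        # by a thread touching already-freed resources during shutdown.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    @staticmethod
    def _write_atomic(state: bytes, path: str) -> None:
        # Write to a temp file in the same directory, then rename into place:
        # readers never observe a partially written checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            f.write(state)
        os.replace(tmp, path)
```

The non-daemon worker plus an explicit `finalize()` call at exit is the key design choice here: daemon threads are killed abruptly at interpreter shutdown, which is one way background checkpoint writers end up crashing.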

@copy-pr-bot

copy-pr-bot Bot commented Oct 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sbhavani added the "bug" (Something isn't working) label Oct 17, 2025
@ko3n1g requested review from a team as code owners February 18, 2026 09:18
@Phlip79 added the "Final Review" (PR is in the "final review" stage) label Mar 4, 2026
@ericharper enabled auto-merge April 14, 2026 05:06
@svcnvidia-nemo-ci removed the "Final Review" (PR is in the "final review" stage) label Apr 14, 2026
@svcnvidia-nemo-ci added the "Approved" (All necessary approvals have been made) label Apr 14, 2026
@ericharper
Contributor

/ok to test 346696a

Add comments to handle potential core dump issue.
@deepakn94
Contributor

/ok to test 7a518ee

@deepakn94 changed the title from "Fixed a potential coredump issue that occurred when saving a checkpoint." to "Fix potential coredump issue that occurs when saving a checkpoint" Apr 16, 2026
@ericharper added this pull request to the merge queue Apr 17, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24544545365

@github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 17, 2026
@Phlip79 added this pull request to the merge queue Apr 17, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24548920944

@chtruong814 added the "needs-follow-up" (Issue needs follow-up) label Apr 17, 2026
@github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 17, 2026
@ko3n1g added this pull request to the merge queue Apr 17, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24552029960

Merged via the queue into NVIDIA:main with commit ded22f4 Apr 17, 2026
63 checks passed
@chtruong814 removed the "needs-follow-up" (Issue needs follow-up) label Apr 17, 2026
Victarry added a commit to yanring/Megatron-LM that referenced this pull request Apr 20, 2026
* origin/main: (286 commits)
  Rename MambaModel/MambaStack to HybridModel/HybridStack (NVIDIA#4099)
  Fix Megatron initialization with extra_args_provider (NVIDIA#4327)
  Fix RL to once again work with --skip-train (NVIDIA#4249)
  Add activation logging and tokens per expert logging (NVIDIA#3842)
  Make param_index_map always use unpacked (full numel) offsets (NVIDIA#4328)
  FA4 Inference (NVIDIA#4186)
  Fix RL reward due to stop token (NVIDIA#4096)
  cp: Fix UT timeout (NVIDIA#4310) (NVIDIA#4373)
  feat(ckpt): add --async-ckpt-use-cpu-shm argument (NVIDIA#4355)
  Update copy-pr-bot.yaml [skip ci]
  Docs: improve docstrings and comments in example training loop (NVIDIA#4041)
  Add QK layernorm support for dot-product attention in MambaModel (NVIDIA#4067)
  Fix bug with non-partial rollouts (NVIDIA#3964)
  [docs] ci: use parent-relative json_url for version picker (NVIDIA#4367)
  Add tables and histogram for RL staleness (NVIDIA#4097)
  Port DeepSeek Sparse Attention to `MambaModel` (NVIDIA#3553)
  docs: bump versions1.json to 0.17.0 (latest) (NVIDIA#4360)
  Fix potential coredump issue that occurs when saving a checkpoint (NVIDIA#1871)
  ci(gb200): add 1-node mr-github functional test variants (NVIDIA#4334)
  fix: wait for async P2P send before deallocating output tensor (NVIDIA#4047)
  ...

# Conflicts:
#	megatron/core/transformer/cuda_graphs.py

Labels

Approved (All necessary approvals have been made), bug (Something isn't working), community-request


9 participants