Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict #2864

Merged
asolergi-nv merged 7 commits into NVIDIA:main from asolergi-nv:no_dist_init on Apr 10, 2026

Conversation

@asolergi-nv
Contributor

  • Set no_dist=True when loading a checkpoint to remove cross-rank synchronization during the load.
  • Use torch.distributed.checkpoint.state_dict_loader.load instead of the deprecated torch.distributed.checkpoint.state_dict_loader.load_state_dict.

@asolergi-nv asolergi-nv requested review from a team as code owners January 8, 2026 09:40
@copy-pr-bot

copy-pr-bot bot commented Jan 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot requested a review from Phlip79 January 8, 2026 09:40
@asolergi-nv asolergi-nv marked this pull request as draft January 15, 2026 16:33
@asolergi-nv asolergi-nv added the Final Review PR is in the "final review" stage label Jan 16, 2026
@asolergi-nv asolergi-nv marked this pull request as ready for review January 16, 2026 17:27
@ko3n1g ko3n1g requested a review from a team January 16, 2026 17:27
@asolergi-nv asolergi-nv requested review from deepakn94 and removed request for a team January 16, 2026 17:27
@deepakn94
Contributor

Is load_state_dict deprecated?

@asolergi-nv
Contributor Author

Yes. The warning was added over two years ago in pytorch/pytorch#113867:

FutureWarning: `load_state_dict` is deprecated and will be removed in future versions. Please use `load` instead.

Both load_state_dict and load call _load_state_dict in the same way, but the latter adds some logic for Stateful objects.
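
To make the Stateful difference concrete, here is a simplified plain-Python emulation of the idea. It is not PyTorch's implementation: `FakeStateful`, `fake_core_load`, and `load_like` are hypothetical names, and the sketch assumes the extra logic amounts to unwrapping Stateful entries before the shared loader runs and handing the loaded values back afterwards.

```python
from typing import Any, Dict


class FakeStateful:
    """Hypothetical stand-in for an object with state_dict/load_state_dict methods."""

    def __init__(self, value: int) -> None:
        self.value = value

    def state_dict(self) -> Dict[str, Any]:
        return {"value": self.value}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.value = state_dict["value"]


def fake_core_load(target: Dict[str, Any], stored: Dict[str, Any]) -> None:
    """Stand-in for the shared internal loader: fills `target` in place from `stored`."""
    for key in list(target):
        if isinstance(target[key], dict):
            fake_core_load(target[key], stored[key])
        else:
            target[key] = stored[key]


def load_like(state_dict: Dict[str, Any], stored: Dict[str, Any]) -> None:
    """Emulates the wrapper: unwrap Stateful entries, load, then push values back."""
    # Step 1: replace Stateful entries with their plain state dicts.
    unwrapped = {
        key: elem.state_dict() if isinstance(elem, FakeStateful) else elem
        for key, elem in state_dict.items()
    }
    # Step 2: run the shared loader on plain data only.
    fake_core_load(unwrapped, stored)
    # Step 3: hand loaded values back to Stateful objects, or store them directly.
    for key in list(state_dict):
        elem = state_dict[key]
        if isinstance(elem, FakeStateful):
            elem.load_state_dict(unwrapped[key])
        else:
            state_dict[key] = unwrapped[key]


stored = {"step": 7, "model": {"value": 42}}
target = {"step": 0, "model": FakeStateful(0)}
load_like(target, stored)
```

A loader without steps 1 and 3 would have to be given plain tensors and values, which is why callers that already pass plain state dicts should see the same behavior from either entry point.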

Contributor

@dimapihtar dimapihtar left a comment

LGTM. Thank you!

@Phlip79 Phlip79 removed their request for review January 26, 2026 20:16
@yaox12 yaox12 enabled auto-merge February 2, 2026 06:34
@asolergi-nv
Contributor Author

/ok to test bd1980c

@asolergi-nv
Contributor Author

/ok to test dee26a0

@asolergi-nv
Contributor Author

/ok to test a37c65b

@asolergi-nv asolergi-nv marked this pull request as draft April 1, 2026 14:57
@asolergi-nv
Contributor Author

/ok to test a10ca97

@asolergi-nv
Contributor Author

/ok to test e835a8c

@asolergi-nv asolergi-nv marked this pull request as ready for review April 10, 2026 11:42
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 10, 2026 11:43
@asolergi-nv asolergi-nv enabled auto-merge April 10, 2026 11:43
@svcnvidia-nemo-ci svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels Apr 10, 2026
@asolergi-nv asolergi-nv disabled auto-merge April 10, 2026 13:56
@asolergi-nv asolergi-nv enabled auto-merge April 10, 2026 13:56
@asolergi-nv asolergi-nv added this pull request to the merge queue Apr 10, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24246552837

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 10, 2026
@asolergi-nv asolergi-nv added this pull request to the merge queue Apr 10, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24249665766

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24251643090

Merged via the queue into NVIDIA:main with commit 0602523 Apr 10, 2026
68 of 70 checks passed
@asolergi-nv asolergi-nv deleted the no_dist_init branch April 10, 2026 17:23
ananthsub pushed a commit to ananthsub/Megatron-LM that referenced this pull request Apr 10, 2026
…torch.distributed.checkpoint.state_dict_loader.load_state_dict (NVIDIA#2864)
ananthsub added a commit to ananthsub/Megatron-Bridge that referenced this pull request Apr 10, 2026
Cherry-pick of NVIDIA/Megatron-LM#2864 (commit 0602523):
- Use checkpoint.load instead of deprecated checkpoint.load_state_dict
- Set no_dist=True to remove cross-rank synchronization during checkpoint load

Made-with: Cursor
ananthsub added a commit to NVIDIA-NeMo/Megatron-Bridge that referenced this pull request Apr 10, 2026
Cherry-pick of NVIDIA/Megatron-LM#2864 (commit 0602523):
- Use checkpoint.load instead of deprecated checkpoint.load_state_dict
- Set no_dist=True to remove cross-rank synchronization during checkpoint load

Made-with: Cursor
ananthsub added a commit to ananthsub/Megatron-LM that referenced this pull request Apr 11, 2026
…load

The cherry-pick of NVIDIA#2864 incorrectly included the
async_strategy parameter in _get_filesystem_reader, which only exists
on upstream main but not on ultra-v3-posttraining.

Made-with: Cursor
Labels

Approved All necessary approvals have been made complexity: low

6 participants