
fix: resolve deadlock saving diffusion checkpoints in safetensors format#1601

Merged
akoumpa merged 1 commit into `main` from `fix/diffusion-safetensors-checkpoint-deadlock`
Mar 24, 2026

Conversation

@adil-a
Collaborator

@adil-a adil-a commented Mar 24, 2026

Summary

  • Sampler padding bug (sampler.py): indices[:padding_size] silently under-pads when padding_size > len(indices), causing uneven batch counts across ranks (e.g., ranks 0-3 get 15 batches, ranks 4-7 get 14). At epoch boundary, some ranks enter checkpoint save while others continue training — mismatched NCCL collectives deadlock. Fixed with modular cycling.
  • FrozenDict serialization (addons.py): Diffusers models use FrozenDict for config (no to_json_string()), causing AttributeError on rank 0 during pre_save while other ranks wait at barrier. Fixed by falling back to json.dump().

Closes #1574

Test plan

  • Run Flux T2I finetuning with model_save_format: safetensors and save_consolidated: false on 8 GPUs — checkpoint should save without deadlock
  • Run existing diffusion functional tests
  • Run pytest tests/unit_tests/ -vs -m "not pleasefixme" for unit test regression

🤖 Generated with Claude Code

Two bugs caused NCCL deadlocks when saving diffusion model checkpoints
with `model_save_format: safetensors`:

1. **SequentialBucketSampler padding bug** (`sampler.py`):
   When a bucket had fewer samples than `padding_size` (e.g., 2 samples
   needing 6 padding elements to reach 8), `indices[:padding_size]` only
   returned `len(indices)` elements instead of `padding_size`. This made
   ranks 0-3 get 15 batches and ranks 4-7 get 14 batches per epoch.
   At epoch boundary, ranks 0-3 entered checkpoint save (DCP/NCCL ops)
   while ranks 4-7 continued training (FSDP2 allgather), deadlocking on
   mismatched NCCL collectives. Fixed by cycling through indices with
   modular indexing.

2. **FrozenDict config serialization** (`addons.py`):
   Diffusers models use `FrozenDict` for their config, which lacks the
   `to_json_string()` method that HF `PretrainedConfig` provides. This
   caused an `AttributeError` on rank 0 during `pre_save`, while other
   ranks waited at a barrier — another deadlock. Fixed by falling back
   to `json.dump()` when `to_json_string` is unavailable.
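
The padding fix in item 1 can be sketched as follows. This is a minimal illustration of modular cycling, not the actual `sampler.py` code; `pad_to_multiple` and its signature are hypothetical names for this sketch.

```python
def pad_to_multiple(indices: list[int], multiple: int) -> list[int]:
    """Pad a bucket's index list so its length is a multiple of `multiple`.

    The buggy version used `indices + indices[:padding_size]`, which
    silently under-pads whenever padding_size > len(indices) (e.g. a
    bucket of 2 samples needing 6 padding elements to reach 8). Cycling
    with modular indexing always produces exactly `padding_size` elements.
    """
    if not indices:
        return indices
    padding_size = (-len(indices)) % multiple
    padding = [indices[i % len(indices)] for i in range(padding_size)]
    return indices + padding
```

With 2 samples and a target multiple of 8, the slice-based version would yield only 4 elements total, while modular cycling yields the full 8 — keeping every rank's batch count identical and the NCCL collectives matched.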
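
The fallback in item 2 amounts to feature-detecting `to_json_string` before serializing. A minimal sketch, assuming a hypothetical `save_model_config` helper (the real `addons.py` code may differ):

```python
import json

def save_model_config(config, path: str) -> None:
    """Write a model config to JSON.

    HF `PretrainedConfig` provides `to_json_string()`; diffusers'
    `FrozenDict` does not, so fall back to plain `json.dump()` on a
    dict copy. Without the fallback, rank 0 raises AttributeError in
    pre_save while other ranks wait at a barrier.
    """
    if hasattr(config, "to_json_string"):
        text = config.to_json_string()  # PretrainedConfig path
        with open(path, "w") as f:
            f.write(text)
    else:
        with open(path, "w") as f:
            json.dump(dict(config), f, indent=2)  # FrozenDict / dict path
```

Feature detection via `hasattr` (rather than type-checking for `FrozenDict`) keeps the helper working for any mapping-like config object.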

Closes #1574

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@adil-a
Collaborator Author

adil-a commented Mar 24, 2026

/ok to test cc63fca

@akoumpa akoumpa merged commit f09ff4c into main Mar 24, 2026
51 of 52 checks passed
@akoumpa akoumpa deleted the fix/diffusion-safetensors-checkpoint-deadlock branch March 24, 2026 22:03
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…mat (#1601)



Successfully merging this pull request may close these issues.

Deadlock while trying to save diffusion checkpoint in Safetensors format
