
fix: resolve deadlock saving diffusion checkpoints in safetensors format#1601

Merged
akoumpa merged 1 commit into `main` from `fix/diffusion-safetensors-checkpoint-deadlock`
Mar 24, 2026

Conversation

@adil-a
Collaborator

@adil-a adil-a commented Mar 24, 2026

Summary

  • Sampler padding bug (sampler.py): indices[:padding_size] silently under-pads when padding_size > len(indices), causing uneven batch counts across ranks (e.g., ranks 0-3 get 15 batches, ranks 4-7 get 14). At epoch boundary, some ranks enter checkpoint save while others continue training — mismatched NCCL collectives deadlock. Fixed with modular cycling.
  • FrozenDict serialization (addons.py): Diffusers models use FrozenDict for config (no to_json_string()), causing AttributeError on rank 0 during pre_save while other ranks wait at barrier. Fixed by falling back to json.dump().

Closes #1574

Test plan

  • Run Flux T2I finetuning with model_save_format: safetensors and save_consolidated: false on 8 GPUs — checkpoint should save without deadlock
  • Run existing diffusion functional tests
  • Run pytest tests/unit_tests/ -vs -m "not pleasefixme" for unit test regression

🤖 Generated with Claude Code

Two bugs caused NCCL deadlocks when saving diffusion model checkpoints
with `model_save_format: safetensors`:

1. **SequentialBucketSampler padding bug** (`sampler.py`):
   When a bucket had fewer samples than `padding_size` (e.g., 2 samples
   needing 6 padding elements to reach 8), `indices[:padding_size]` only
   returned `len(indices)` elements instead of `padding_size`. This made
   ranks 0-3 get 15 batches and ranks 4-7 get 14 batches per epoch.
   At epoch boundary, ranks 0-3 entered checkpoint save (DCP/NCCL ops)
   while ranks 4-7 continued training (FSDP2 allgather), deadlocking on
   mismatched NCCL collectives. Fixed by cycling through indices with
   modular indexing.

2. **FrozenDict config serialization** (`addons.py`):
   Diffusers models use `FrozenDict` for their config, which lacks the
   `to_json_string()` method that HF `PretrainedConfig` provides. This
   caused an `AttributeError` on rank 0 during `pre_save`, while other
   ranks waited at a barrier — another deadlock. Fixed by falling back
   to `json.dump()` when `to_json_string` is unavailable.
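
The padding fix in item 1 can be sketched as follows. This is a minimal illustration of modular cycling, not the actual `sampler.py` code; `pad_to_multiple` and its signature are hypothetical names for this sketch.

```python
def pad_to_multiple(indices: list[int], multiple: int) -> list[int]:
    """Pad a bucket's index list so its length is a multiple of `multiple`.

    The buggy version used `indices + indices[:padding_size]`, which
    silently under-pads whenever padding_size > len(indices) (e.g. a
    bucket of 2 samples needing 6 padding elements to reach 8). Cycling
    with modular indexing always produces exactly `padding_size` elements.
    """
    if not indices:
        return indices
    padding_size = (-len(indices)) % multiple
    padding = [indices[i % len(indices)] for i in range(padding_size)]
    return indices + padding
```

With 2 samples and a target multiple of 8, the slice-based version would yield only 4 elements total, while modular cycling yields the full 8 — keeping every rank's batch count identical and the NCCL collectives matched.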
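
The fallback in item 2 amounts to feature-detecting `to_json_string` before serializing. A minimal sketch, assuming a hypothetical `save_model_config` helper (the real `addons.py` code may differ):

```python
import json

def save_model_config(config, path: str) -> None:
    """Write a model config to JSON.

    HF `PretrainedConfig` provides `to_json_string()`; diffusers'
    `FrozenDict` does not, so fall back to plain `json.dump()` on a
    dict copy. Without the fallback, rank 0 raises AttributeError in
    pre_save while other ranks wait at a barrier.
    """
    if hasattr(config, "to_json_string"):
        text = config.to_json_string()  # PretrainedConfig path
        with open(path, "w") as f:
            f.write(text)
    else:
        with open(path, "w") as f:
            json.dump(dict(config), f, indent=2)  # FrozenDict / dict path
```

Feature detection via `hasattr` (rather than type-checking for `FrozenDict`) keeps the helper working for any mapping-like config object.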

Closes #1574

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Mar 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@adil-a
Collaborator Author

adil-a commented Mar 24, 2026

/ok to test cc63fca

@akoumpa akoumpa merged commit f09ff4c into main Mar 24, 2026
51 of 52 checks passed
@akoumpa akoumpa deleted the fix/diffusion-safetensors-checkpoint-deadlock branch March 24, 2026 22:03
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
…mat (#1601)



Successfully merging this pull request may close these issues.

Deadlock while trying to save diffusion checkpoint in Safetensors format
