feat: Ensure that diffusion training jobs use the safetensors checkpoint format #1627

Merged
akoumpa merged 5 commits into main from pranav/diffusion_safetensors
Mar 31, 2026

@pthombre commented Mar 30, 2026

What does this PR do?

Add safetensors checkpoint support for diffusion models, enabling consolidated checkpoints that are directly loadable via the standard diffusers from_pretrained() API.
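As a rough illustration of what "directly loadable" means here, the sketch below resolves the consolidated directory of a checkpoint and hands it to a model class's from_pretrained(). Only the model/consolidated/ subdirectory and the use of from_pretrained() come from this PR; the helper names, the checkpoint root argument, and the choice of model class are assumptions made for the example, not code from the repository.

```python
from pathlib import Path
from typing import Any


def consolidated_dir(checkpoint_root: str) -> Path:
    """Return the consolidated safetensors directory under a checkpoint root."""
    return Path(checkpoint_root) / "model" / "consolidated"


def load_finetuned(model_cls: Any, checkpoint_root: str, **kwargs: Any) -> Any:
    """Load weights by calling model_cls.from_pretrained() on the consolidated directory.

    model_cls is any class exposing a from_pretrained() classmethod, for
    example a diffusers transformer class. Extra keyword arguments are
    forwarded unchanged.
    """
    path = consolidated_dir(checkpoint_root)
    if not path.is_dir():
        # Hypothetical guard: fail early with a clear message.
        raise FileNotFoundError(f"No consolidated checkpoint at {path}")
    return model_cls.from_pretrained(str(path), **kwargs)
```

With diffusers installed, model_cls might be a class such as FluxTransformer2DModel, and checkpoint_root would be whatever step directory the training job wrote.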

Changelog

  • Added diffusers_compatible config option to CheckpointingConfig that renames the consolidated index file from model.safetensors.index.json to diffusion_pytorch_model.safetensors.index.json after consolidation, making checkpoints compatible with diffusers' from_pretrained().
  • Ensured the diffusers-compatible rename works in both consolidation paths: the all-ranks path (sync, non-single-rank) in Checkpointer.save_model() and the single-rank path (async or single_rank_consolidation) in _HuggingFaceStorageWriter.finish(). Extracted a shared _maybe_rename_index_for_diffusers() helper to avoid duplicating the rename logic.
  • Updated generate.py to load finetuned checkpoints using from_pretrained() on the consolidated safetensors directory (model/consolidated/), replacing the previous EMA, consolidated .bin, and sharded FSDP/DCP loading paths.
  • Switched all diffusion recipe configs (finetune and pretrain for Flux, Wan 2.1, HunyuanVideo) from torch_save to safetensors format with save_consolidated: true and diffusers_compatible: true.
  • Added unit tests for the rename logic covering the shared helper, the storage writer finish() path, and the Checkpointer.save_model() path.
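The rename described in the changelog can be pictured with a minimal sketch like the one below. It is not the PR's implementation: the function name (without the leading underscore), its signature, and its return value are assumptions for illustration. Only the two index filenames and the diffusers_compatible flag are taken from the description above.

```python
import os

# Filenames as stated in the changelog.
HF_INDEX = "model.safetensors.index.json"
DIFFUSERS_INDEX = "diffusion_pytorch_model.safetensors.index.json"


def maybe_rename_index_for_diffusers(consolidated_dir: str, diffusers_compatible: bool) -> bool:
    """Rename the consolidated index file if requested.

    Returns True if a rename happened, False otherwise. Hypothetical
    signature; the real helper may differ.
    """
    if not diffusers_compatible:
        return False
    src = os.path.join(consolidated_dir, HF_INDEX)
    dst = os.path.join(consolidated_dir, DIFFUSERS_INDEX)
    if not os.path.isfile(src):
        # Nothing to rename, e.g. consolidation did not write an index.
        return False
    os.replace(src, dst)
    return True
```

Keeping the logic in one function that both consolidation paths call means the flag check and the filename pair live in a single place.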

Before your PR is "Ready for review"

Pre checks:

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Files changed:

  • nemo_automodel/components/checkpoint/_backports/hf_storage.py — added _maybe_rename_index_for_diffusers() helper and diffusers_compatible param to _HuggingFaceStorageWriter; rename applied in finish() for the single-rank/async consolidation path
  • nemo_automodel/components/checkpoint/checkpointing.py — added diffusers_compatible field to CheckpointingConfig, post-consolidation rename on the all-ranks path, passes diffusers_compatible through to the storage writer
  • nemo_automodel/recipes/diffusion/train.py — pass diffusers_compatible from YAML config to CheckpointingConfig
  • examples/diffusion/generate/generate.py — rewrote checkpoint loading to use from_pretrained()
  • examples/diffusion/finetune/flux_t2i_flow.yaml
  • examples/diffusion/finetune/hunyuan_t2v_flow.yaml
  • examples/diffusion/finetune/wan2_1_t2v_flow.yaml
  • examples/diffusion/finetune/wan2_1_t2v_flow_multinode.yaml
  • examples/diffusion/pretrain/flux_t2i_flow.yaml
  • examples/diffusion/pretrain/wan2_1_t2v_flow.yaml
  • tests/unit_tests/checkpoint/test_consolidate_safetensors.py — tests for _maybe_rename_index_for_diffusers and _HuggingFaceStorageWriter.finish() diffusers rename
  • tests/unit_tests/checkpoint/test_checkpointing.py — tests for Checkpointer.save_model() diffusers rename on the all-ranks path
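Alongside the unit tests listed above, a finished checkpoint directory can be sanity-checked by hand. The sketch below is a standalone illustration, not part of the PR: the function name is hypothetical, and it assumes the renamed index follows the usual Hugging Face sharded layout, where a "weight_map" object maps tensor names to shard filenames.

```python
import json
from pathlib import Path
from typing import List

DIFFUSERS_INDEX = "diffusion_pytorch_model.safetensors.index.json"


def missing_shards(consolidated_dir: str) -> List[str]:
    """Return shard filenames referenced by the index but absent on disk."""
    root = Path(consolidated_dir)
    with (root / DIFFUSERS_INDEX).open() as f:
        # Assumed layout: {"weight_map": {"<tensor name>": "<shard file>", ...}}
        weight_map = json.load(f)["weight_map"]
    shard_names = sorted(set(weight_map.values()))
    return [name for name in shard_names if not (root / name).is_file()]
```

An empty list means every shard the index points to is present; a FileNotFoundError means the renamed index itself is missing.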

…int format

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>

copy-pr-bot Bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pthombre
Contributor Author

/ok to test b3f89f9

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre
Contributor Author

/ok to test 9174b3b

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
@pthombre marked this pull request as ready for review March 30, 2026 22:11
@pthombre
Contributor Author

/ok to test 83e9bee

@pthombre
Contributor Author

/ok to test 44d8d01

@akoumpa merged commit 45f0846 into main Mar 31, 2026
53 checks passed
@akoumpa deleted the pranav/diffusion_safetensors branch March 31, 2026 06:49