
[pull] master from deepspeedai:master #110

Merged
pull[bot] merged 2 commits into QSLee-Net:master from deepspeedai:master
Oct 22, 2025
Conversation


@pull pull bot commented Oct 22, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

This PR fixes the following error:

```
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 985, in grad_handling_hook
[rank0]:     self.process_gradients(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1524, in process_gradients
[rank0]:     self.reduce_ready_partitions_and_remove_grads(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1528, in reduce_ready_partitions_and_remove_grads
[rank0]:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/zero/stage_1_and_2.py", line 1006, in reduce_independent_p_g_buckets_and_remove_grads
[rank0]:     self.report_ipg_memory_usage("In ipg_remove_grads before reduce_ipg_grads", param.numel(), param.dtype)
[rank0]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/base_optimizer.py", line 70, in report_ipg_memory_usage
[rank0]:     bucket = self.ipg_buckets[dt]
[rank0]:              ~~~~~~~~~~~~~~~~^^^^
[rank0]: KeyError: torch.bfloat16
```

The problem doesn't occur when `seq_parallel_communication_data_type: bf16` is used, but it fails with `fp32` (or when the setting is omitted).

In this PR I'm syncing with the ZeRO stage 3 (z3) implementation, which doesn't pass the `dtype` arg and instead traverses only the dtypes that actually exist in the buckets.


https://github.com/deepspeedai/DeepSpeed/blob/407708cdb6e48dbff971b0f03ec4613d0f084a4b/deepspeed/runtime/base_optimizer.py#L66-L75
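The gist of the failure and the fix can be sketched as follows. This is a minimal standalone illustration, not DeepSpeed's actual code: the `Bucket` class, string dtype keys, and both reporting helpers are stand-ins for the real `ipg_buckets` structure in `base_optimizer.py`.

```python
# Sketch of why indexing the bucket dict by a caller-supplied dtype
# raises KeyError, while traversing only the existing dtypes does not.

class Bucket:
    """Stand-in for DeepSpeed's per-dtype IPG bucket."""

    def __init__(self, elements):
        self.elements = elements


# Buckets are keyed only by the gradient dtypes actually seen during
# training -- here, just bf16.
ipg_buckets = {"bf16": Bucket(elements=1024)}


def report_by_dtype(dt):
    # Fragile: raises KeyError when `dt` (e.g. "fp32" coming from the
    # seq_parallel_communication_data_type setting) was never bucketed.
    return ipg_buckets[dt].elements


def report_all():
    # Robust (z3-style): iterate only over the dtypes that have
    # buckets, so no lookup can miss.
    return {dt: bucket.elements for dt, bucket in ipg_buckets.items()}


print(report_all())  # {'bf16': 1024}

try:
    report_by_dtype("fp32")  # never bucketed -> KeyError: 'fp32'
except KeyError as e:
    print("KeyError:", e)
```

The fix mirrors the second helper: drop the `dtype` argument and let the traversal of existing buckets do the work.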

Fixes: #7607
Ulysses/ALST integration with HF Accelerate:
- Allow `UlyssesSPAttentionHF.register_with_transformers` to get a
`model` obj as an argument, to match HF accelerate's workflow
- Fix existing Ulysses tests to test z2 instead of z1
- Improve documentation
- Add a defensive check

The HF Accelerate PR that depends on this PR: huggingface/accelerate#3817

---------

Signed-off-by: Stas Bekman <stas@stason.org>
@pull pull bot locked and limited conversation to collaborators Oct 22, 2025
@pull pull bot added the ⤵️ pull label Oct 22, 2025
@pull pull bot merged commit 64c0052 into QSLee-Net:master Oct 22, 2025
