
sequence parallel default dtype #7364


Merged
stas00 merged 8 commits into master from stas/nccl-fp32-accum on Jun 19, 2025

Conversation

@stas00 (Collaborator) commented Jun 17, 2025

The newly released NCCL finally started to use fp32 accumulation for reduction ops!

  • Floating point summation is always done in fp32 accumulators (with the
    exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
    the accuracy with fp8 and fp16 data types should be much improved.
    NVIDIA/nccl@72d2432

So we should change the fp32 comms default for SP to the same dtype as the inputs when NCCL >= 2.27.3; the user can still override the default.
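A minimal sketch of how that default could be version-gated (illustrative only; the helper name and placement are assumptions, not the actual DeepSpeed change):

# Illustrative sketch, not the actual DeepSpeed implementation.
import torch

def default_sp_comm_dtype(input_dtype: torch.dtype) -> torch.dtype:
    """Pick the default dtype for sequence-parallel communication."""
    # NCCL >= 2.27.3 accumulates floating-point reductions in fp32 internally,
    # so communicating in the input dtype no longer costs accuracy.
    if torch.cuda.is_available() and torch.cuda.nccl.version() >= (2, 27, 3):
        return input_dtype
    # Older NCCL: keep the previous fp32-communication default.
    return torch.float32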

Signed-off-by: Stas Bekman <stas@stason.org>
@stas00 stas00 requested review from tjruwase and tohtana as code owners June 17, 2025 00:26
@stas00 (Collaborator, Author) commented Jun 19, 2025

So what's special about the A6000 GPU that makes it fail? I'm looking at the breakage.

Basically, it's this import that fails:

python -c "from torch._dynamo._trace_wrapped_higher_order_op import TransformGetItemToIndex"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'TransformGetItemToIndex' from 'torch._dynamo._trace_wrapped_higher_order_op' (/usr/local/lib/python3.10/dist-packages/torch/_dynamo/_trace_wrapped_higher_order_op.py)

It fails for older pytorch versions, like pt-2.4, but succeeds with pt-2.7.1.

Aha! This workflow uses:

torch                         2.6.0a0+df5bbc09d1.nv24.12

So it reports 2.6, but TransformGetItemToIndex isn't there yet - that's the culprit!

The caller is here:

https://github.com/huggingface/transformers/blob/0725cd6953803b8aacfc85288cbfb83dea30c469/src/transformers/masking_utils.py#L34-L37

Is there a reason why this workflow is locked onto this pytorch version? If so, we'll have to run it with an older transformers version to work around this issue.
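One workaround on the calling side would be to feature-detect the symbol instead of trusting the reported torch version (a sketch of the idea, not what transformers actually does):

# Sketch: probe for the symbol rather than gating on torch.__version__,
# since prerelease builds (e.g. 2.6.0a0 from the NGC image) report 2.6 without it.
try:
    from torch._dynamo._trace_wrapped_higher_order_op import TransformGetItemToIndex
    _has_transform_get_item_to_index = True
except ImportError:
    _has_transform_get_item_to_index = False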

Edit: the half-baked pt-2.6 comes from the image nvcr.io/nvidia/pytorch:24.12-py3 - let me try to bump that version.

@stas00 stas00 requested a review from loadams as a code owner June 19, 2025 17:52
Signed-off-by: Stas Bekman <stas@stason.org>
@stas00 stas00 enabled auto-merge (squash) June 19, 2025 18:22
@stas00 stas00 merged commit d3b9cb8 into master Jun 19, 2025
11 checks passed
@stas00 stas00 deleted the stas/nccl-fp32-accum branch June 19, 2025 18:32