
sequence parallel default dtype #7364


Merged
stas00 merged 8 commits into master from stas/nccl-fp32-accum on Jun 19, 2025

Conversation

@stas00 (Collaborator) commented Jun 17, 2025

The newly released NCCL finally started to use fp32 accumulation for reduction ops!

  • Floating point summation is always done in fp32 accumulators (with the
    exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
    the accuracy with fp8 and fp16 data types should be much improved.
    NVIDIA/nccl@72d2432

So we should change the fp32 comms default for SP to the same dtype as the inputs when NCCL >= 2.27.3; the user can still override the default.
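A minimal sketch of how that default could be version-gated (illustrative only; the helper name and placement are assumptions, not the actual DeepSpeed change):

# Illustrative sketch, not the actual DeepSpeed implementation.
import torch

def default_sp_comm_dtype(input_dtype: torch.dtype) -> torch.dtype:
    """Pick the default dtype for sequence-parallel communication."""
    # NCCL >= 2.27.3 accumulates floating-point reductions in fp32 internally,
    # so communicating in the input dtype no longer costs accuracy.
    if torch.cuda.is_available() and torch.cuda.nccl.version() >= (2, 27, 3):
        return input_dtype
    # Older NCCL: keep the previous fp32-communication default.
    return torch.float32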

Signed-off-by: Stas Bekman <stas@stason.org>
@stas00 stas00 requested review from tjruwase and tohtana as code owners June 17, 2025 00:26
@stas00 (Collaborator, Author) commented Jun 19, 2025

So what's special about the A6000 GPU that makes it fail? I'm looking at the breakage.

Basically, it's this import that fails:

python -c "from torch._dynamo._trace_wrapped_higher_order_op import TransformGetItemToIndex"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name 'TransformGetItemToIndex' from 'torch._dynamo._trace_wrapped_higher_order_op' (/usr/local/lib/python3.10/dist-packages/torch/_dynamo/_trace_wrapped_higher_order_op.py)

It fails for older pytorch versions, like pt-2.4, but succeeds with pt-2.7.1.

Aha! This workflow uses:

torch                         2.6.0a0+df5bbc09d1.nv24.12

So it reports 2.6, but TransformGetItemToIndex isn't there yet - that's the culprit!

The caller is here:

https://github.com/huggingface/transformers/blob/0725cd6953803b8aacfc85288cbfb83dea30c469/src/transformers/masking_utils.py#L34-L37

Is there a reason why this workflow is locked onto this pytorch version? If so, we'll have to run it with an older transformers version to work around this issue.
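One workaround on the calling side would be to feature-detect the symbol instead of trusting the reported torch version (a sketch of the idea, not what transformers actually does):

# Sketch: probe for the symbol rather than gating on torch.__version__,
# since prerelease builds (e.g. 2.6.0a0 from the NGC image) report 2.6 without it.
try:
    from torch._dynamo._trace_wrapped_higher_order_op import TransformGetItemToIndex
    _has_transform_get_item_to_index = True
except ImportError:
    _has_transform_get_item_to_index = False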

Edit: the half-baked pt-2.6 comes from the image nvcr.io/nvidia/pytorch:24.12-py3 - let me try to bump that version.

@stas00 stas00 requested a review from loadams as a code owner June 19, 2025 17:52
Signed-off-by: Stas Bekman <stas@stason.org>
@stas00 stas00 enabled auto-merge (squash) June 19, 2025 18:22
@stas00 stas00 merged commit d3b9cb8 into master Jun 19, 2025
11 checks passed
@stas00 stas00 deleted the stas/nccl-fp32-accum branch June 19, 2025 18:32