
Fix FP8 block scaling with sequence parallel#2637

Merged
ksivaman merged 8 commits into NVIDIA:main from cuichenx:chcui/fix_subchannel_fp8+sp
Mar 8, 2026

Conversation

@cuichenx
Contributor

@cuichenx cuichenx commented Jan 31, 2026

Description

Problem

Using `Float8BlockQuantizer` with sequence parallel fails with `AssertionError: All-gather requires quantizable tensor for quantizer Float8BlockQuantizer` when the local tensor dimensions aren't divisible by 128.

Solution

  • Skip the `assert_dim_for_all_gather` check for `Float8BlockQuantizer`, since `gather_along_first_dim` already has a fallback path.
  • Fix the fallback in `_start_all_gather_fp8_blockwise` to handle already-quantized inputs by dequantizing them before the high-precision all-gather.

Note

The fallback path (high-precision all-gather → quantize) may increase communication overhead.
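As a rough sketch of the fallback described above — this is a simplified, single-process stand-in, not the actual Transformer Engine code; `is_quantizable` and the 128-element block size mirror the real quantizer's divisibility requirement, and the quantize/dequantize steps are stubbed out:

```python
import torch

BLOCK = 128  # Float8BlockQuantizer 1D block size

def is_quantizable(t: torch.Tensor, block: int = BLOCK) -> bool:
    # Block scaling requires every tensor dimension to divide the block size.
    return all(d % block == 0 for d in t.shape)

def gather_with_fallback(local: torch.Tensor, world_size: int) -> torch.Tensor:
    """Toy stand-in for gather_along_first_dim with the high-precision fallback."""
    if is_quantizable(local):
        raise NotImplementedError("quantized all-gather fast path omitted in this sketch")
    # Fallback: (dequantize if needed,) all-gather in high precision, then
    # quantize the full gathered tensor (quantization omitted here).
    # torch.cat simulates all_gather_into_tensor across identical shards.
    return torch.cat([local] * world_size, dim=0)

x = torch.randn(100, 256)        # first dim 100 is not divisible by 128 -> fallback
out = gather_with_fallback(x, 4)
print(out.shape)                  # torch.Size([400, 256])
```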

Verification

The code change does not alter convergence behavior.

When SP is True, the previous code path did not run at all. When SP is False, this change doesn't affect anything.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Chen Cui <chcui@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented Jan 31, 2026

Greptile Summary

This PR successfully fixes FP8 block scaling with sequence parallel by adding fallback handling for non-quantizable tensors. The core changes are sound:

  • Adds high-precision all-gather fallback in _start_all_gather_fp8_blockwise, _all_gather_nvfp4, and _all_gather_mxfp8 for tensors whose dimensions aren't divisible by the quantization block size
  • Fixes .dequantize() calls to explicitly pass dtype=dtype, preserving high-precision types during fallback paths
  • Removes now-redundant assert_dim_for_all_gather checks and function from utils

The PR addresses the reported crash when using Float8BlockQuantizer with sequence parallelism and does not alter convergence behavior as verified in the test results.

Confidence Score: 5/5

  • PR is safe to merge. The fix correctly addresses the reported crash with FP8 block scaling and sequence parallelism through well-scoped fallback mechanisms.
  • All changes are focused and correct. The fallback paths handle non-quantizable tensors appropriately by dequantizing before high-precision all-gather. The explicit dtype parameter ensures correct precision preservation. Module-level cleanup delegating to gather helpers' own fallback logic is sound. No edge cases identified that would cause issues in practice.
  • No files require special attention

Last reviewed commit: ccd46cb


@cyanguwa cyanguwa requested a review from timmoon10 February 2, 2026 18:48



@timmoon10 timmoon10 self-requested a review February 2, 2026 19:38
Comment thread on transformer_engine/pytorch/utils.py (Outdated)
Perform all-gather in high-precision if the input tensor is too small to quantize.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


2 files reviewed, no comments


timmoon10
timmoon10 previously approved these changes Feb 2, 2026
Collaborator

@timmoon10 timmoon10 left a comment


I've tweaked the module logic so that it no longer checks that the local tensor can be quantized. In addition to FP8 block scaling, MXFP8 and NVFP4 will also support high-precision all-gathers. That said, this may not be fully robust for LayerNormLinear and LayerNormMLP, since the norm kernel may still expect to output to a quantized local tensor.


@timmoon10
Collaborator

/te-ci pytorch L1

Contributor

@greptile-apps greptile-apps Bot left a comment


2 files reviewed, 3 comments


out = torch.empty(out_shape, dtype=dtype, device=device)
warnings.warn("Cannot quantize input tensor. Performing all-gather in high precision.")
if isinstance(inp, QuantizedTensorStorage):
    inp = inp.dequantize()  # Dequantize if needed
Contributor


The dequantize() method defaults to dtype=torch.float32. Consider whether this is always appropriate for the fallback path, especially when the original tensor might have been in a different precision (e.g., bfloat16).
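A toy illustration of this dtype concern — `MockQuantizedTensor` is hypothetical and only mimics a `dequantize()` that defaults to `float32`, as the review comment describes:

```python
import torch

class MockQuantizedTensor:
    """Hypothetical stand-in for a quantized tensor whose dequantize()
    defaults to float32, mirroring the behavior flagged in this review."""
    def __init__(self, data: torch.Tensor):
        self._data = data

    def dequantize(self, dtype: torch.dtype = torch.float32) -> torch.Tensor:
        return self._data.to(dtype)

q = MockQuantizedTensor(torch.randn(4, 4, dtype=torch.bfloat16))
print(q.dequantize().dtype)                       # torch.float32 (silent upcast)
print(q.dequantize(dtype=torch.bfloat16).dtype)   # torch.bfloat16 (precision preserved)
```

Passing `dtype=dtype` explicitly, as the later commits do, keeps the fallback all-gather in the tensor's original high-precision type.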

):
    warnings.warn("Cannot quantize input tensor. Performing all-gather in high precision.")
    if isinstance(inp, QuantizedTensorStorage):
        inp = inp.dequantize()  # Dequantize if needed
Contributor


Same consideration as in _start_all_gather_fp8_blockwise: the dequantize() method defaults to dtype=torch.float32, which may not match the original tensor's precision.

):
    warnings.warn("Cannot quantize input tensor. Performing all-gather in high precision.")
    if isinstance(inp, QuantizedTensorStorage):
        inp = inp.dequantize()  # Dequantize if needed
Contributor


Same consideration as in _start_all_gather_fp8_blockwise: the dequantize() method defaults to dtype=torch.float32, which may not match the original tensor's precision.

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Contributor

@greptile-apps greptile-apps Bot left a comment


5 files reviewed, no comments


Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx
Member

ptrendx commented Feb 10, 2026

/te-ci pytorch L1

Contributor

@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 1 comment


Comment on lines 1102 to 1108
if not quantizer.is_quantizable(inp) or quantizer.block_scaling_dim != 1:
    warnings.warn("Cannot quantize input tensor. Performing all-gather in high precision.")
    if isinstance(inp, QuantizedTensorStorage):
        inp = inp.dequantize(dtype=dtype)  # Dequantize if needed
    out = torch.empty(out_shape, dtype=dtype, device=device)
    torch.distributed.all_gather_into_tensor(out, inp, group=process_group, async_op=False)
    out = quantizer(out)
Contributor


Non-contiguous gather input

In the new high-precision fallback (if not quantizer.is_quantizable(inp) ...), all_gather_into_tensor(out, inp, ...) passes inp directly. Elsewhere in this same module the plain-tensor path uses inp.contiguous() (distributed.py:1737-1742) and the FP8 path uses _data.contiguous() (distributed.py:1031-1035), which strongly suggests the collective expects contiguous inputs. If inp is a non-contiguous view (common after transpose/slicing), this fallback can raise at runtime. This same issue also appears in the NVFP4 and MXFP8 high-precision fallbacks (distributed.py:1353 and :1523).
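The contiguity issue can be demonstrated without a process group; here `.t()` stands in for any view-producing op such as a transpose or slice:

```python
import torch

x = torch.randn(64, 32)
view = x.t()                     # transpose produces a non-contiguous view
print(view.is_contiguous())      # False

# Collectives such as all_gather_into_tensor expect dense input buffers;
# .contiguous() materializes the view into a fresh dense tensor first.
safe = view.contiguous()
print(safe.is_contiguous())      # True
```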

@ptrendx
Member

ptrendx commented Feb 12, 2026

/te-ci pytorch

Contributor

@greptile-apps greptile-apps Bot left a comment


5 files reviewed, 3 comments


if isinstance(inp, QuantizedTensorStorage):
    inp = inp.dequantize(dtype=dtype)  # Dequantize if needed
out = torch.empty(out_shape, dtype=dtype, device=device)
torch.distributed.all_gather_into_tensor(out, inp, group=process_group, async_op=False)
Contributor


Missing .contiguous() call on inp before all-gather

Other all-gather paths in this file use .contiguous() (lines 1739, 1033). Non-contiguous tensors (from transpose/slicing) can cause runtime errors.

Suggested change
- torch.distributed.all_gather_into_tensor(out, inp, group=process_group, async_op=False)
+ torch.distributed.all_gather_into_tensor(out, inp.contiguous(), group=process_group, async_op=False)

@greptile-apps
Contributor

greptile-apps Bot commented Feb 12, 2026

Additional Comments (2)

transformer_engine/pytorch/distributed.py
Missing .contiguous() call on inp before all-gather

        torch.distributed.all_gather_into_tensor(out, inp.contiguous(), group=process_group)

transformer_engine/pytorch/distributed.py
Missing .contiguous() call on inp before all-gather

        torch.distributed.all_gather_into_tensor(out, inp.contiguous(), group=process_group)

@ksivaman
Member

ksivaman commented Mar 7, 2026

/te-ci pytorch

@ksivaman ksivaman merged commit 5fd5c35 into NVIDIA:main Mar 8, 2026
20 of 24 checks passed