
[JAX] Collective GEMM with FP8 and MXFP8 support #2740

Merged

phu0ngng merged 23 commits into NVIDIA:main from phu0ngng:cgemm_fp8 on Mar 13, 2026

Conversation

@phu0ngng (Collaborator) commented on Mar 5, 2026

Description

This PR extends JAX Collective GEMM support to the DelayedScalingFP8, CurrentScalingFP8, and MXFP8 quantization recipes.
Unit tests for these recipes are added, and the test infrastructure in the collective GEMM tests is cleaned up along the way.

Note that Collective GEMM + MXFP8 requires all dimensions of the GEMM operands to be divisible by 128.
Additionally, in the CGEMM + MXFP8 + AllGather case, the block scales are still all-gathered on the critical path, unlike the quantized data, whose gathering overlaps with the computation.
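
For context, a minimal sketch of how these recipes are constructed and how the 128-divisibility constraint could be checked up front. The recipe classes exist in transformer_engine.common.recipe; the shape check itself is a hypothetical illustration, not this PR's actual validation code:

# Hedged sketch: constructing the three supported recipes and pre-checking the
# MXFP8 collective-GEMM shape constraint. `check_mxfp8_cgemm_shapes` is a
# hypothetical helper, not part of the PR.
from transformer_engine.common.recipe import (
    DelayedScaling,        # DelayedScalingFP8
    Float8CurrentScaling,  # CurrentScalingFP8
    MXFP8BlockScaling,     # MXFP8
)

recipe = MXFP8BlockScaling()

def check_mxfp8_cgemm_shapes(*shapes):
    """Collective GEMM + MXFP8 needs every operand dim divisible by 128."""
    for shape in shapes:
        for dim in shape:
            assert dim % 128 == 0, f"dim {dim} is not divisible by 128"

check_mxfp8_cgemm_shapes((256, 512), (512, 1024))  # passes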

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@phu0ngng phu0ngng changed the title [JAX] CGEMM + FP8 [JAX] CGEMM + FP8MXFP8 Mar 10, 2026
@phu0ngng phu0ngng changed the title [JAX] CGEMM + FP8MXFP8 [JAX] CGEMM + FP8/MXFP8 Mar 10, 2026
@phu0ngng (Collaborator, Author) commented:

/te-ci JAX L1

@phu0ngng phu0ngng marked this pull request as ready for review March 10, 2026 23:23
@phu0ngng phu0ngng changed the title [JAX] CGEMM + FP8/MXFP8 [JAX] Collective GEMM with FP8 and MXFP8 support Mar 10, 2026
@greptile-apps (Contributor bot) commented on Mar 10, 2026

Greptile Summary

This PR extends the JAX Collective GEMM implementation to support DelayedScalingFP8, CurrentScalingFP8, and MXFP8 quantization recipes, and cleans up the collective GEMM test infrastructure. The core logic change in gemm.py adds MXFP8-aware scale tensor reordering (_reorder_tpsp_leading / _reorder_dp_leading) and updates the SPMD sharding specs to correctly distribute block scales alongside their operands during AllGather and ReduceScatter collectives.
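
As a rough mental model of that reordering (a hypothetical reconstruction; the PR's actual _reorder_tpsp_leading / _reorder_dp_leading may differ):

# Hypothetical sketch: move the TP/SP-sharded axis to the front before the
# collective so each rank's chunk is contiguous, and move the DP axis back to
# the front on the output. jnp.moveaxis is the real JAX op; the helper name
# and calling convention are assumptions.
import jax.numpy as jnp

def reorder_dim_leading(x, dim):
    return jnp.moveaxis(x, dim, 0)

x = jnp.zeros((8, 128, 64))    # (dp, seq, hidden)
y = reorder_dim_leading(x, 1)  # (128, 8, 64): seq axis now leading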

Key changes:

  • GemmPrimitive.impl gains an MXFP8 + Collective path that skips padding, validates 128-alignment, reorders scale tensors for ReduceScatter (both data and scale) and AllGather (scale only; data layout is handled by the kernel)
  • _parse_operand_output_specs is updated to propagate per-operand scale sharding specs, replacing the previous uniform lhs_sharding / none_sharding logic
  • helper.py gains get_quantization_recipe and is_quantize_recipe_supported utility functions for mapping string recipe names to objects, now used throughout the test suite (see the sketch after this list)
  • The test suite is refactored to use per-test-case pytest node IDs in run_test_cgemm.sh, and new test classes cover FP8 / MXFP8 for all three test modules
  • One remaining diagnostic bug: The assertion at gemm.py:704–708 checks lhs_scale_inv.shape[sequence_dim] but its error message says "RHS scale inv sequence dimension", which would mislead users debugging alignment issues on the LHS scale
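
A minimal sketch of the string-to-recipe mapping idea behind those helpers (the actual implementations in transformer_engine/jax/quantize/helper.py may differ in naming, coverage, and hardware-support checks):

from transformer_engine.common.recipe import (
    DelayedScaling,
    Float8CurrentScaling,
    MXFP8BlockScaling,
)

# Hypothetical reconstruction; the real helpers may also gate on device support.
_RECIPE_MAP = {
    "DelayedScaling": DelayedScaling,
    "Float8CurrentScaling": Float8CurrentScaling,
    "MXFP8BlockScaling": MXFP8BlockScaling,
}

def get_quantization_recipe(name: str):
    return _RECIPE_MAP[name]()

def is_quantize_recipe_supported(name: str) -> bool:
    return name in _RECIPE_MAP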

Confidence Score: 4/5

  • Safe to merge once the minor diagnostic copy-paste fix is addressed; no correctness or data-integrity issues found beyond what is already tracked in open review threads.
  • The core MXFP8 + Collective GEMM logic is well-structured: scale reordering is guarded by scaling_mode.is_1d_block_scaling() and need_reorder, the sharding spec updates correctly propagate block-scale axes, and the new helpers are clean. The main outstanding issues are misleading assertion messages (copy-paste errors already flagged in prior review threads plus one new one at line 704), none of which affect runtime correctness. The test coverage for the new paths is comprehensive. Score is 4 rather than 5 because of the cluster of diagnostic copy-paste errors in the new assertion messages.
  • transformer_engine/jax/cpp_extensions/gemm.py — the assertion message at lines 704–708 references "RHS scale inv" but checks lhs_scale_inv, consistent with the copy-paste pattern already flagged for lines 698–702.

Important Files Changed

  • transformer_engine/jax/cpp_extensions/gemm.py: Core GEMM primitive extended with MXFP8 + Collective GEMM support; introduces scale reordering helpers, updated sharding specs for block scales, and an NVFP4 guard — but contains multiple copy-paste errors in assertion messages (line 699 says "LHS" for the RHS check, line 705 says "RHS scale inv" for the LHS scale check) that would mislead users debugging alignment failures.
  • transformer_engine/jax/quantize/helper.py: Adds two clean utility functions (get_quantization_recipe and is_quantize_recipe_supported) that bridge string recipe names to recipe objects; straightforward and well-documented.
  • examples/jax/collective_gemm/test_layernorm_mlp_grad.py: Adds FP8/MXFP8 test cases for LayerNorm MLP gradient; uses QuantizerFactory.create_set(n_quantizer_sets=2) correctly to create independent sets per layer. Both the reference and collective calls share the same quantizer_sets object, which is acceptable for purely functional JAX quantizer pytrees.
  • examples/jax/collective_gemm/common.py: Clean refactor: moves shared distributed helpers and imports to the top of the file, adds FP8 tolerances, introduces get_tolerance_dtype helper, and updates cgemm_parser to use --quantize-recipe instead of --fp8-recipe.
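
For the common.py entry above, the tolerance-selection idea might look like the following sketch (get_tolerance_dtype is the PR's helper name; the function below and its concrete tolerance values are assumptions):

import numpy as np

# Hypothetical values: quantized (FP8/MXFP8) comparisons need looser bounds
# than plain bf16/fp16 runs.
_DTYPE_TOLS = {
    "bfloat16": {"rtol": 1e-2, "atol": 1e-2},
    "float16": {"rtol": 1e-3, "atol": 1e-3},
}

def get_tolerance(dtype, quantized=False):
    if quantized:
        return {"rtol": 5e-2, "atol": 5e-2}
    return _DTYPE_TOLS.get(np.dtype(dtype).name, {"rtol": 1e-5, "atol": 1e-8})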

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["tex.gemm() / layernorm_mlp()"] --> B{scaling_mode}
    B -- "NO_SCALING / TENSOR_SCALING" --> C[Standard GEMM path]
    B -- "MXFP8_1D_SCALING" --> D{collective_op?}
    D -- "NONE / is_outer" --> E[apply_padding_to_scale_inv\n+ swizzled_scale]
    D -- "ALL_GATHER / REDUCE_SCATTER" --> F[Assert dims % 128 == 0\nSkip padding]
    F --> G[swizzled_scale on lhs/rhs scale_inv]
    G --> H{need_reorder?}
    H -- "RS" --> I[_reorder_tpsp_leading on lhs\n+ lhs_scale_inv]
    H -- "AG" --> J[_reorder_tpsp_leading on lhs_scale_inv only\nlhs data stays as-is]
    I --> K[GemmPrimitive.inner_primitive.bind]
    J --> K
    C --> K
    E --> K
    K --> L{post-process output}
    L -- "AG + need_reorder" --> M[_reorder_dp_leading on output]
    L -- "other" --> N[return output as-is]
    M --> O[return output]
    N --> O

Last reviewed commit: 6f0c442

@phu0ngng phu0ngng requested a review from denera March 11, 2026 19:52
@phu0ngng phu0ngng force-pushed the cgemm_fp8 branch 2 times, most recently from 4f20d2d to f899fa2 on March 11, 2026 21:07
@phu0ngng (Collaborator, Author) commented:

/te-ci JAX L1

phu0ngng and others added 11 commits March 11, 2026 14:37
pre-commit-ci Bot and others added 6 commits March 11, 2026 14:37
pre-commit-ci Bot and others added 4 commits March 11, 2026 21:38
@phu0ngng (Collaborator, Author) commented:

/te-ci JAX L1

@jberchtold-nvidia (Collaborator) left a review:

LGTM, thanks!

@denera (Collaborator) left a review:

LGTM!

Comment on lines +994 to +1004
lhs_scale_specs = rhs_scale_specs = (None,)
if scaling_mode.is_1d_block_scaling():
    rhs_scale_specs = rhs_specs
    # Set the seq spec to None to trigger AG of the scales, as TE/Common CGEMM
    # does not handle scale collecting yet
    if collective_op.is_all_gather:
        lhs_scale_specs = tuple(
            None if i == sequence_dim else s for i, s in enumerate(lhs_specs)
        )
    else:
        lhs_scale_specs = lhs_specs
A collaborator commented:

This AG is only required for overlap with Userbuffers. We'll conditionally disable it whenever we're using the cuBLASMp backend instead.

No changes needed in this PR, just dropping a note for reference.

@phu0ngng phu0ngng merged commit 14c29da into NVIDIA:main Mar 13, 2026
9 of 12 checks passed
@phu0ngng phu0ngng deleted the cgemm_fp8 branch March 13, 2026 15:27
vthumbe1503 pushed a commit to ksivaman/TransformerEngine-1 that referenced this pull request Apr 1, 2026
* Enable cgemm + FP8 tests

* Implement CGEMM + MXFP8

* added size check for mxfp8

* added tols for assertions

* update tests with recipes

* enable tests + is_quantize_recipe_supported

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
