[MoE][PyTorch] Add prob permutation to mask-based MoE permutation; Fix FP8 related codes by hxbai · Pull Request #1468 · NVIDIA/TransformerEngine

hxbai · 2025-02-09T15:39:16Z

Description

Add probs permutation to the mask-based permutation. With this, the probs can be applied in the MoE expert MLP rather than in the unpermutation, which avoids saving the huge input tensor of the unpermute function.

Fix FP8 tensor usage in the permutation code, since TE 2.0 made breaking changes to the FP8 tensor interfaces.

Depiction for probs application:

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Add probs permutation to moe_permute and moe_sort_chunks_by_index
Fix FP8-related code
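
To make "probs permutation" concrete, here is a minimal pure-Python sketch of a mask-based permutation that reorders the router probs together with the tokens. It is an illustration only, not the TransformerEngine kernel: the function name `permute_with_probs`, the `source_index` bookkeeping, and the expert-major output order are assumptions chosen for readability.

```python
def permute_with_probs(tokens, routing_mask, probs):
    """Group tokens by expert according to a boolean routing mask.

    tokens:       list of num_tokens payloads
    routing_mask: routing_mask[t][e] is True when token t is routed to expert e
    probs:        probs[t][e] is the router weight for that (token, expert) pair

    Returns the permuted tokens, the probs in the same permuted order, and
    (hypothetical) source indices that record where each row came from.
    """
    num_experts = len(routing_mask[0])
    permuted_tokens, permuted_probs, source_index = [], [], []
    for e in range(num_experts):  # assumed expert-major output order
        for t, row in enumerate(routing_mask):
            if row[e]:
                permuted_tokens.append(tokens[t])
                permuted_probs.append(probs[t][e])
                source_index.append((t, e))
    return permuted_tokens, permuted_probs, source_index


tokens = ["a", "b", "c"]
routing_mask = [[True, False], [True, True], [False, True]]
probs = [[1.0, 0.0], [0.25, 0.75], [0.0, 1.0]]

out_tokens, out_probs, source_index = permute_with_probs(tokens, routing_mask, probs)
print(out_tokens)  # ['a', 'b', 'b', 'c']
print(out_probs)   # [1.0, 0.25, 0.75, 1.0]
```

Because each permuted row carries its own prob, the weighting can happen on the expert side and the unpermute step can reduce to a plain sum.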

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

yaox12 · 2025-02-13T01:54:21Z

/te-ci pytorch

phu0ngng · 2025-02-13T16:31:46Z

        print(f"chunk sort\t\tbwd: pytorch: {t1:.3f} ms,  TE: {t2:.3f} ms")


+def _test_permutation_mask_map_alongside_probs(


Hi,

I found this test almost identical to _test_permutation_mask_map, except that it involves TP and therefore calls te_sort_chunks_by_index.

Is it correct that _test_permutation_mask_map is the special case of _test_permutation_mask_map_alongside_probs with tp_size=1?

If so, I suggest we combine these two tests to eliminate the code duplication.

Hi, it is not exactly the same.

For _test_permutation_mask_map, the probs are applied in the unpermutation, and the results of permute and unpermute are verified separately.

_test_permutation_mask_map_alongside_probs is roughly an end-to-end test. The probs are applied in the unpermutation for the PyTorch version and to the permute output for the TE version. We want to make sure that the two methods produce the same final output.

I see. Thank you!
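
The end-to-end idea described in the reply above can be sketched in pure Python: weight on the expert side and unpermute with a plain sum, or leave the expert outputs unweighted and weight inside the unpermute, then compare the final outputs. This is only an illustration under stated assumptions; `permute`, `unpermute`, and the toy linear `expert` are hypothetical helpers, not the functions used in tests/pytorch/test_permutation.py.

```python
import math


def permute(tokens, routing_mask, probs=None):
    """Return rows (token_id, expert_id, value, prob) in expert-major order."""
    rows = []
    for e in range(len(routing_mask[0])):
        for t, mask_row in enumerate(routing_mask):
            if mask_row[e]:
                p = probs[t][e] if probs is not None else None
                rows.append((t, e, tokens[t], p))
    return rows


def unpermute(rows, num_tokens, merging_probs=None):
    """Sum rows (token_id, expert_id, value) back per token, optionally weighted."""
    out = [0.0] * num_tokens
    for t, e, value in rows:
        weight = merging_probs[t][e] if merging_probs is not None else 1.0
        out[t] += weight * value
    return out


def expert(e, x):
    # Toy stand-in for an expert MLP (assumption: a simple linear map).
    return (e + 2) * x


tokens = [1.0, 2.0, 3.0]
routing_mask = [[True, False], [True, True], [False, True]]
probs = [[1.0, 0.0], [0.25, 0.75], [0.0, 1.0]]

# Probs permuted with the tokens and applied on the expert side.
weighted_rows = [(t, e, p * expert(e, x)) for t, e, x, p in permute(tokens, routing_mask, probs)]
out_probs_in_permute = unpermute(weighted_rows, len(tokens))

# Probs passed to the unpermute step as merging probs.
plain_rows = [(t, e, expert(e, x)) for t, e, x, _ in permute(tokens, routing_mask)]
out_probs_in_unpermute = unpermute(plain_rows, len(tokens), merging_probs=probs)

assert all(math.isclose(a, b) for a, b in zip(out_probs_in_permute, out_probs_in_unpermute))
```

In the second variant the unpermute step needs the unweighted expert outputs to apply the weights, which is the tensor the PR description wants to avoid saving.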

phu0ngng · 2025-02-13T16:42:16Z

+                    merging_prob = tl.load(merging_probs_ptr + merging_prob_off).to(compute_type)
+                    inp *= merging_prob
                accumulator += inp
+            if PERMUTE_PROBS:


Hi,

I wonder in which case we can have PERMUTE_PROBS != WITH_MERGING_PROBS.
As I read the code, when WITH_MERGING_PROBS=True we do the accumulation and then reset the probs to zeros in the if PERMUTE_PROBS block. In which case would we not need to reset them?

Hi, Phuong,

Yes, this part is easy to misread. I just added a diagram to the PR description to depict the usage.

The previous code is for the right workflow, and this PR adds support for the left workflow. In the right workflow, we don't permute the probs tensor and instead pass it directly to the unpermute operation. In the left workflow, we permute the probs tensor and apply it in the GroupedGEMM; in this case, no probs are passed to the unpermute operation.

Left: moe_permute(probs=probs), moe_unpermute(merging_probs=None)

Right: moe_permute(probs=None), moe_unpermute(merging_probs=probs)

This kernel is used by both permute_bwd and unpermute_fwd:

permute_bwd in the left workflow: PERMUTE_PROBS=True and WITH_MERGING_PROBS=False

permute_bwd in the right workflow: PERMUTE_PROBS=False and WITH_MERGING_PROBS=False

unpermute_fwd in the left workflow: PERMUTE_PROBS=False and WITH_MERGING_PROBS=False

unpermute_fwd in the right workflow: PERMUTE_PROBS=False and WITH_MERGING_PROBS=True

So, these two args would not be True at the same time.
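
The flag combinations can be written down as a small lookup and checked mechanically. The sketch below is an illustration only: `kernel_flags` is a hypothetical helper, not part of TransformerEngine, and it assumes the "Left" and "Right" workflow definitions given earlier in this reply (probs passed to moe_permute versus probs passed to moe_unpermute as merging_probs).

```python
from itertools import product


def kernel_flags(caller, workflow):
    """Return (PERMUTE_PROBS, WITH_MERGING_PROBS) for the shared kernel.

    caller:   "permute_bwd" or "unpermute_fwd"
    workflow: "left"  -> moe_permute(probs=probs), moe_unpermute(merging_probs=None)
              "right" -> moe_permute(probs=None),  moe_unpermute(merging_probs=probs)
    """
    if caller not in ("permute_bwd", "unpermute_fwd"):
        raise ValueError(f"unknown caller: {caller}")
    if workflow not in ("left", "right"):
        raise ValueError(f"unknown workflow: {workflow}")
    # Probs were permuted in the forward pass only in the left workflow.
    permute_probs = caller == "permute_bwd" and workflow == "left"
    # Merging probs reach the unpermute forward pass only in the right workflow.
    with_merging_probs = caller == "unpermute_fwd" and workflow == "right"
    return permute_probs, with_merging_probs


for caller, workflow in product(("permute_bwd", "unpermute_fwd"), ("left", "right")):
    permute_probs, with_merging_probs = kernel_flags(caller, workflow)
    # The two flags are never True at the same time.
    assert not (permute_probs and with_merging_probs)
    print(f"{caller:14s} {workflow:5s} PERMUTE_PROBS={permute_probs} "
          f"WITH_MERGING_PROBS={with_merging_probs}")
```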

timmoon10

Mostly looks reasonable, but we should make sure the user-facing APIs are not too messy.

phu0ngng · 2025-02-16T04:14:47Z

/te-ci pytorch

timmoon10

LGTM. Test failures in pipeline 24020411 are unrelated.

…x FP8 related codes (#1468)

* add prob permute; fix fp8tensor
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* revert unnecessary changes in UT
* remove unnecessary probs dtype convert
* keep the output nums if probs is not provided
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* refine the doc string
* fix lint
* use fp32 compute type
* style fix
* fix empty input return
* separate prob related functions out
* [pre-commit.ci] auto fixes from pre-commit.com hooks

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <xiny@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>

hxbai and others added 9 commits February 9, 2025 07:17

add prob permute; fix fp8tensor

57c8f34

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b700aff

for more information, see https://pre-commit.ci

revert unnecessary changes in UT

175f964

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

remove unnecessary probs dtype convert

7e9d98c

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

keep the output nums if probs is not provided

6e26f26

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

bcf6721

for more information, see https://pre-commit.ci

refine the doc string

c15e676

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

fix lint

7f33620

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

use fp32 compute type

1f7bf3d

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

hxbai force-pushed the permute_probs branch from 57809db to 1f7bf3d Compare February 12, 2025 14:08

yaox12 mentioned this pull request Feb 13, 2025

Fix MOE tests #1476

Closed

13 tasks

Merge branch 'main' into permute_probs

1933e06

phu0ngng reviewed Feb 13, 2025

timmoon10 reviewed Feb 13, 2025

Comment thread transformer_engine/pytorch/permutation.py

Comment thread transformer_engine/pytorch/permutation.py

Comment thread transformer_engine/pytorch/permutation.py

Comment thread tests/pytorch/test_permutation.py Outdated

hxbai added 2 commits February 13, 2025 21:51

style fix

ba692d8

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

fix empty input return

5ef0f78

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

phu0ngng approved these changes Feb 14, 2025

ptrendx added the 2.1.0 label Feb 15, 2025

hxbai and others added 3 commits February 15, 2025 04:50

separate prob related functions out

f8492fe

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5148fc3

for more information, see https://pre-commit.ci

Merge branch 'main' into permute_probs

e2f1fe3

timmoon10 approved these changes Feb 18, 2025

timmoon10 merged commit eb9857d into NVIDIA:main Feb 18, 2025

This was referenced Mar 31, 2025

[Fix] permute fusion number of forward output and backward input is not match ROCm/TransformerEngine#160

Closed

[Fix] permute fusion number of forward output and backward input is not match ROCm/TransformerEngine#161

Merged

hungryGeek16 mentioned this pull request May 31, 2026

fix unfused padding causal sdpa #3063

Open

timmoon10 merged 15 commits into NVIDIA:main from hxbai:permute_probs
