base: main
Tongliu/router fusion #1883
Conversation
Force-pushed 60d0142 to b3c3633 (signed off by tongliu <tongliu@nvidia.com>; includes pre-commit.ci fixes)
Force-pushed 4377f92 to 752c351 (signed off by tongliu <tongliu@nvidia.com>)
tests/pytorch/test_fused_router.py (outdated)
```python
        expert_bias=expert_bias_clone,
    )

    assert torch.allclose(probs, probs_fused, atol=atol, rtol=rtol)
```
It would be nicer to replace `torch.allclose` with `torch.testing.assert_close`. It has a helpful error message and it automatically chooses the tolerances based on the dtype.
Suggested change:
```diff
-assert torch.allclose(probs, probs_fused, atol=atol, rtol=rtol)
+torch.testing.assert_close(probs, probs_fused)
```
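For reference, `torch.testing.assert_close` picks its default tolerances from the dtype and reports exactly where the tensors diverge. A minimal standalone sketch (hypothetical values, not taken from this PR's tests):

```python
import torch

actual = torch.tensor([1.0, 2.0, 3.001], dtype=torch.float32)
expected = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)

try:
    # Tolerances default per dtype (float32: rtol=1.3e-6, atol=1e-5),
    # so callers do not need to hand-pick atol/rtol for each test.
    torch.testing.assert_close(actual, expected)
except AssertionError as err:
    # Unlike a bare `assert torch.allclose(...)`, the error message lists
    # the number of mismatched elements and the greatest abs/rel difference.
    print(err)
```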
```python
@pytest.mark.parametrize("dtype", [torch.float32])
@pytest.mark.parametrize("num_tokens", [2048, 7168, 32111])
@pytest.mark.parametrize("num_experts", [128, 32])
@pytest.mark.parametrize("topk", [4, 8])
@pytest.mark.parametrize("group_topk", [None, 4])
@pytest.mark.parametrize("scaling_factor", [None, 1.2])
@pytest.mark.parametrize("enable_bias", [True, False])
def test_topk_sigmoid(
```
How long does this test suite take to run? The number of test cases grows very quickly if you have one test with many parameters (O(2^n) cases), so it may be better to split it up into multiple tests with only a few parameters (O(2*n) cases).
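To make the reviewer's point concrete, a hypothetical restructuring (the helper name and default values are illustrative, not from this PR): pin a default configuration and vary one axis per test, so case counts add instead of multiply.

```python
import pytest
import torch

# Hypothetical shared defaults; one axis varies per test, so this split runs
# 3 + 4 + 2 = 9 cases instead of the 1*3*2*2*2*2*2 = 96 combinations above.
DEFAULTS = dict(dtype=torch.float32, num_tokens=2048, num_experts=128,
                topk=4, group_topk=None, scaling_factor=None, enable_bias=False)

def run_topk_sigmoid_case(**kwargs):
    """Hypothetical helper holding the fused-vs-reference comparison that
    currently lives in the body of test_topk_sigmoid."""

@pytest.mark.parametrize("num_tokens", [2048, 7168, 32111])
def test_topk_sigmoid_num_tokens(num_tokens):
    run_topk_sigmoid_case(**{**DEFAULTS, "num_tokens": num_tokens})

@pytest.mark.parametrize("topk, group_topk", [(4, None), (4, 4), (8, None), (8, 4)])
def test_topk_sigmoid_grouping(topk, group_topk):
    run_topk_sigmoid_case(**{**DEFAULTS, "topk": topk, "group_topk": group_topk})

@pytest.mark.parametrize("enable_bias", [True, False])
def test_topk_sigmoid_bias(enable_bias):
    run_topk_sigmoid_case(**{**DEFAULTS, "enable_bias": enable_bias})
```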
```cpp
 *  \param[in]  intermediate_output  Intermediate output from the forward pass (softmax/sigmoid output).
 *  \param[in]  stream               CUDA stream used for the operation.
 */
void nvte_fused_scores_for_aux_loss_forward(const NVTETensor logits, int num_tokens,
```
What is the naming convention for this loss function in other MoE implementations? It feels like `aux_loss` is too general and it might get confusing if some other multi-objective training method becomes popular in the future. Maybe something like `moe_aux_loss` would be more specific.
/te-ci pytorch
Force-pushed d3197e0 to baed11c (signed off by tongliu <tongliu@nvidia.com>; includes pre-commit.ci fixes)
Description
Provides the functions used in the router fusion and the corresponding unit tests. All three parts include both the forward and backward passes.
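For orientation, a minimal unfused PyTorch sketch of the topk-sigmoid routing path that `test_topk_sigmoid` exercises. The exact semantics are defined by the fused kernels in this PR; the bias/scaling handling below is an assumption inferred from the test parameters, and some routers additionally renormalize the selected probabilities.

```python
import torch

def topk_sigmoid_reference(logits, topk, expert_bias=None, scaling_factor=None):
    # Assumed semantics, not the fused kernel: sigmoid scores, an optional
    # additive expert bias used only for top-k selection, and an optional
    # rescaling of the selected probabilities.
    scores = torch.sigmoid(logits)                     # [num_tokens, num_experts]
    select = scores + expert_bias if expert_bias is not None else scores
    _, idx = torch.topk(select, topk, dim=-1)
    # Keep the original (un-biased) scores of the selected experts; zero the rest.
    probs = torch.zeros_like(scores).scatter_(-1, idx, scores.gather(-1, idx))
    if scaling_factor is not None:
        probs = probs * scaling_factor
    return probs
```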
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: