Add back is_batch_shard_by_expert #3643
Reviewed code:

    return output
    ...
    batch_logical_axis = "activation_batch"
    # Currently, we support data, tensor, and expert parallelism with Megablox.
Longer term, this logic should be separated into its own function following go/small-functions, but for now this is probably safer as a rollback, and I am currently working on refactoring/cleaning up this file.
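As a hypothetical sketch of the kind of extraction being suggested (all names are illustrative, not the actual MaxText code), the axis-selection branch could become its own small function:

```python
# Hypothetical sketch of pulling the batch-axis selection into a small,
# testable helper (names illustrative, not the real file's identifiers).
def get_batch_logical_axis(is_batch_shard_by_expert: bool, is_decode: bool) -> str:
  """Pick the logical axis name used to shard MoE activations on batch."""
  if is_decode and is_batch_shard_by_expert:
    # Batch-size-1 decode: shard the batch dimension over the expert axis.
    return "decode_batch_moe"
  # Default training/prefill path keeps the usual activation-batch axis.
  return "activation_batch"
```

A helper like this would make the rollback easier to reason about and unit-test in isolation.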
RissyRan
left a comment
Thanks Nuojin for the change! I was comparing the before/after and noticed this change. Could you help cross-check?
The previous hint … This change of logic only impacts performance under explicit shard mode, which has not been included in benchmarking yet. One exception is batch split, which has its own code path.
Description
This PR recovers the is_batch_shard_by_expert logic, fixing the previous full EP decoding issue b/501537579. We introduced a decode_batch_moe logical rule to specifically handle the decoding case in the MoE component when the batch size is one.

Tests
Performance test: https://paste.googleplex.com/4690855089799168. We get the same training performance (~81 TFLOP/s/dev) as Update config correcting rule and log_config #3645.
Inference vllm test: https://paste.googleplex.com/5221471321456640
Inference decode test: https://paste.googleplex.com/6557626352664576
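To illustrate how a new logical rule like decode_batch_moe could work, here is a hedged sketch of a logical-axis-rules table mapping logical names onto mesh axes; the mesh axis names ("data", "fsdp", "expert") are assumptions for illustration, not the PR's actual config:

```python
# Illustrative sketch (assumed axis names): mapping the new logical rule
# "decode_batch_moe" onto the expert mesh axis so a batch of size one can
# still be split across experts during decode.
logical_axis_rules = [
    ("activation_batch", ("data", "fsdp")),  # usual training-batch sharding
    ("decode_batch_moe", ("expert",)),       # decode-time MoE batch rule
]

def mesh_axes_for(logical_name):
  """Look up the mesh axes a logical axis name maps to, or None."""
  for name, axes in logical_axis_rules:
    if name == logical_name:
      return axes
  return None
```

With a rule table like this, the decode path can request "decode_batch_moe" and have the batch dimension land on the expert axis while training paths keep the default mapping.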
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.