
Add back is_batch_shard_by_expert #3643

Merged
copybara-service[bot] merged 1 commit into main from chengnuojin-fix-moe on Apr 13, 2026

Conversation

Collaborator

@NuojCheng commented Apr 10, 2026

Description

This PR recovers the is_batch_shard_by_expert logic, fixing the previous full EP decoding issue b/501537579. We introduce a decode_batch_moe logical rule to specifically handle the decoding case in the MoE component when the batch size is one.
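As a rough sketch of how such a logical rule behaves under Flax's logical partitioning (the rule name decode_batch_moe comes from this PR; the mesh axis names and rule values below are illustrative assumptions, not MaxText's actual configuration):

```python
# Hedged sketch: "decode_batch_moe" is named in the PR; the rule values and
# mesh axis names here are assumptions for illustration only.
from flax.linen import logical_to_mesh_axes

logical_axis_rules = (
    # Training-style batch axis: may be split across the expert mesh axis too.
    ("activation_batch", ("data", "expert")),
    # Decode-time MoE batch axis: never split by expert, so batch size 1 works.
    ("decode_batch_moe", ("data",)),
)

# Map the logical names of a (batch, seq, hidden) activation to mesh axes.
train_spec = logical_to_mesh_axes(("activation_batch", None, None), logical_axis_rules)
decode_spec = logical_to_mesh_axes(("decode_batch_moe", None, None), logical_axis_rules)
print(train_spec)   # batch dim split over ('data', 'expert')
print(decode_spec)  # batch dim split over ('data',) only
```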

Tests

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have added necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


@codecov bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 7 lines in your changes missing coverage. Please review.

Files with missing lines       Patch %   Lines
src/maxtext/layers/moe.py      66.66%    4 Missing and 3 partials ⚠️


Comment thread on src/maxtext/layers/moe.py (outdated)
return output

batch_logical_axis = "activation_batch"
# Currently, we support data, tensor, and expert parallelism with Megablox.
Collaborator


Longer term, this logic should be separated into its own function (following go/small-functions), but for now this is probably safer as a rollback, and I am currently working on refactoring/cleaning up this file.
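For what it's worth, a hypothetical sketch of what pulling this out into a small helper could look like; the helper name and condition below are assumptions, not the actual code in moe.py:

```python
# Hypothetical helper, not the PR's actual code: choose the batch logical axis
# used for MoE activations depending on whether we are in single-sample decode.
def choose_batch_logical_axis(is_decoding: bool, batch_size: int) -> str:
    # In decode with batch size 1 the batch dim cannot be split across experts,
    # so use the decode-specific rule introduced by this PR.
    if is_decoding and batch_size == 1:
        return "decode_batch_moe"
    # Otherwise keep the default training-time batch axis.
    return "activation_batch"
```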

@NuojCheng force-pushed the chengnuojin-fix-moe branch 2 times, most recently from eef5d68 to 8ded7ff on April 13, 2026 at 16:32
Collaborator

@RissyRan left a comment


Thanks Nuojin for the change! I was comparing the before/after and noticed this change. Could you help cross-check?

@NuojCheng
Collaborator Author

Thanks Nuojin for the change! I was comparing the before/after and noticed this change. Could you help cross-check?

The previous hint ("activation_batch_no_exp", "activation_length_no_exp", None) was not correct: it always made the output replicated along the expert physical axis. The new implementation is more memory-efficient.

This change of logic only impacts performance under explicit shard mode, which has not been included in benchmarking yet. One exception is batch split, which has its own code path (deepseek_batchsplit.py) and should not be impacted.
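To make the memory argument concrete, a back-of-envelope sketch with made-up sizes (none of these numbers come from the PR): replicating the MoE output along the expert axis keeps a full copy on every expert-parallel device, while sharding it divides the per-device footprint by the expert-parallel degree.

```python
# Assumed sizes for illustration only; bf16 activations.
B, S, H = 8, 2048, 7168     # batch, sequence, hidden (hypothetical)
E = 8                       # expert-parallel degree (hypothetical)
bytes_per_elem = 2

full_bytes = B * S * H * bytes_per_elem
print(f"replicated along expert axis: {full_bytes / 2**20:.0f} MiB per device")
print(f"sharded along expert axis:    {full_bytes / E / 2**20:.0f} MiB per device")
```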

@NuojCheng force-pushed the chengnuojin-fix-moe branch from 8ded7ff to 8acc2b9 on April 13, 2026 at 17:11
@NuojCheng force-pushed the chengnuojin-fix-moe branch from 8acc2b9 to ff3b89c on April 13, 2026 at 19:16
Comment thread on tests/integration/decode_tests.py (outdated)
@NuojCheng force-pushed the chengnuojin-fix-moe branch from ff3b89c to c57e4ec on April 13, 2026 at 20:02
@copybara-service bot merged commit 50ba0a1 into main on Apr 13, 2026
43 checks passed
@copybara-service bot deleted the chengnuojin-fix-moe branch on April 13, 2026 at 22:10
