Add back is_batch_shard_by_expert #3643
Reviewed code:

    return output
    ...
    batch_logical_axis = "activation_batch"
    # Currently, we support data, tensor, and expert parallelism with Megablox.
Longer term, this logic should be separated into its own function following go/small-functions, but for now this is probably safer as a rollback, and I am currently working on refactoring/cleaning up this file.
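As a hypothetical sketch of the kind of extraction being suggested (all names are illustrative, not the actual MaxText code), the axis-selection branch could become its own small function:

```python
# Hypothetical sketch of pulling the batch-axis selection into a small,
# testable helper (names illustrative, not the real file's identifiers).
def get_batch_logical_axis(is_batch_shard_by_expert: bool, is_decode: bool) -> str:
  """Pick the logical axis name used to shard MoE activations on batch."""
  if is_decode and is_batch_shard_by_expert:
    # Batch-size-1 decode: shard the batch dimension over the expert axis.
    return "decode_batch_moe"
  # Default training/prefill path keeps the usual activation-batch axis.
  return "activation_batch"
```

A helper like this would make the rollback easier to reason about and unit-test in isolation.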
RissyRan
left a comment
Thanks Nuojin for the change! I was comparing the before/after and noticed this change. Could you help cross-check?
The previous hint … This change of logic only impacts performance under explicit shard mode, which has not been included in benchmarking yet. One exception is batch split, which has its own code path.
Description
This PR recovers the is_batch_shard_by_expert logic, fixing the previous full EP decoding issue b/501537579. We introduced a decode_batch_moe logical rule to specifically handle the decoding case in the MoE component when the batch size is one.

Tests
Performance test: https://paste.googleplex.com/4690855089799168. We get the same training performance (~81 TFLOP/s/dev) as Update config correcting rule and log_config #3645.
Inference vllm test: https://paste.googleplex.com/5221471321456640
Inference decode test: https://paste.googleplex.com/6557626352664576
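To illustrate how a new logical rule like decode_batch_moe could work, here is a hedged sketch of a logical-axis-rules table mapping logical names onto mesh axes; the mesh axis names ("data", "fsdp", "expert") are assumptions for illustration, not the PR's actual config:

```python
# Illustrative sketch (assumed axis names): mapping the new logical rule
# "decode_batch_moe" onto the expert mesh axis so a batch of size one can
# still be split across experts during decode.
logical_axis_rules = [
    ("activation_batch", ("data", "fsdp")),  # usual training-batch sharding
    ("decode_batch_moe", ("expert",)),       # decode-time MoE batch rule
]

def mesh_axes_for(logical_name):
  """Look up the mesh axes a logical axis name maps to, or None."""
  for name, axes in logical_axis_rules:
    if name == logical_name:
      return axes
  return None
```

With a rule table like this, the decode path can request "decode_batch_moe" and have the batch dimension land on the expert axis while training paths keep the default mapping.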
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.