
Megablocks-based MoE #1197

Closed

Conversation

DayOfThePenguin
Contributor

This PR adds dropless MoE support using the Grouped GEMM implementation in megablocks.

Features

Unlike the legacy DeepSpeed MoE implementation, which uses the data parallel groups for expert parallelism, this implementation uses the model parallel group to parallelize the experts (see the process-group sketch after this list). This avoids the following problems:

  • Distributing the experts across data parallel groups incurs inter-node communication for a forward pass through a single layer.
  • MoE + pipeline parallelism is very difficult to reason about when the MoE weights are distributed across data parallel groups, and DeepSpeed doesn't natively support it.
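For illustration, here is a minimal sketch (not code from this PR) of building process groups with torch.distributed so that the expert parallel group coincides with the model parallel group; the `mp_size` value and group layout are assumptions for the example.

```python
# Hypothetical sketch: reuse the model (tensor) parallel group as the expert
# parallel group, so the all-to-all traffic for a single MoE layer stays inside
# one model-parallel group instead of crossing data-parallel (often inter-node)
# boundaries. Group layout and sizes are illustrative only.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()
rank = dist.get_rank()
mp_size = 4  # model parallel degree (illustrative)

# Ranks [0..mp_size-1], [mp_size..2*mp_size-1], ... each form a model parallel group.
expert_parallel_group = None
for start in range(0, world_size, mp_size):
    ranks = list(range(start, start + mp_size))
    group = dist.new_group(ranks)  # every rank must call new_group for every group
    if rank in ranks:
        # The experts of each MoE layer are sharded over exactly these ranks.
        expert_parallel_group = group
```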

Clarified the arguments to make it clear which ones are only required for the token-dropping DeepSpeed MoE.

Uses sinkhorn routing by default and supports top-k routing with k >= 1.
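For reference, a minimal sketch of Sinkhorn normalization over router logits followed by top-k selection, in the spirit of Megatron-LM's sinkhorn router; this is illustrative and not the exact code added in the PR.

```python
# Sketch of sinkhorn routing: iteratively rescale exp(logits) so rows (tokens)
# and columns (experts) are approximately doubly stochastic, which balances
# expert assignment before the top-k selection.
import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 8) -> torch.Tensor:
    cost = torch.exp(logits)  # [num_tokens, num_experts]
    d0 = torch.ones(cost.size(0), device=cost.device, dtype=cost.dtype)
    d1 = torch.ones(cost.size(1), device=cost.device, dtype=cost.dtype)
    eps = 1e-8
    for _ in range(n_iters):
        d0 = 1.0 / (cost.size(0) * (torch.sum(d1 * cost, dim=1) + eps))
        d1 = 1.0 / (cost.size(1) * (torch.sum(d0.unsqueeze(1) * cost, dim=0) + eps))
    return d1 * cost * d0.unsqueeze(1)

# Top-k expert selection from the balanced scores (k >= 1).
logits = torch.randn(16, 8)  # 16 tokens, 8 experts (illustrative)
scores = sinkhorn(logits)
topk_vals, topk_experts = torch.topk(scores, k=2, dim=-1)
```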

Testing

Tested with pipeline parallel sizes [3, 2, 1] and model parallel sizes [1, 2, 4, 8] on Ampere GPUs.

Notes

Added megablocks and grouped_gemm to the dependencies. It might be desirable to pull some of the kernels in directly, as NVIDIA Megatron-Core does.
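As an illustration of what the new dependency provides, here is a pure-PyTorch reference of the grouped GEMM computation (a loop over per-expert token slices); the fused grouped_gemm/megablocks kernels perform the same computation in a single call. Shapes and names are assumptions for the example.

```python
# Reference (illustration only) of a dropless grouped GEMM: each expert
# multiplies its own contiguous slice of tokens by its own weight matrix,
# so no tokens are dropped or padded.
import torch

def grouped_gemm_reference(x: torch.Tensor,
                           weights: torch.Tensor,
                           tokens_per_expert: torch.Tensor) -> torch.Tensor:
    """x: [num_tokens, hidden], sorted by expert.
    weights: [num_experts, hidden, ffn_hidden].
    tokens_per_expert: [num_experts] counts summing to num_tokens."""
    outputs = []
    start = 0
    for e, count in enumerate(tokens_per_expert.tolist()):
        outputs.append(x[start:start + count] @ weights[e])
        start += count
    return torch.cat(outputs, dim=0)

x = torch.randn(32, 64)               # illustrative sizes
w = torch.randn(4, 64, 128)           # 4 experts
counts = torch.tensor([10, 6, 9, 7])  # dropless: every token is routed
y = grouped_gemm_reference(x, w, counts)  # [32, 128]
```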

Contributor

yang commented Mar 28, 2024

👍

Just wanted to jump in with some quick clarifications! DeepSpeed doesn't actually require using the DP groups as your expert parallel (EP) groups. You can choose to do so if you want; these are just different configurations of a parallelism scheme that generalizes over these arrangements.

So if you want to use the model/tensor parallel groups to parallelize your experts (and avoid shuffling across DP ranks), you can do so by setting the DEP size equal to your DP size (rather than DEP > DP) and then setting the EP groups equal to the TP groups. It's one option among the available degrees of freedom.

(You can furthermore choose whether or not you want expert tensor parallelism, which is another degree of freedom.)
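As a rough illustration of the configuration described above, here is a sketch assuming DeepSpeed's `deepspeed.moe.layer.MoE` API with its `ep_size` and `enable_expert_tensor_parallelism` arguments; all sizes are placeholders and not taken from this PR.

```python
# Sketch: the expert parallel degree is a configuration choice made when
# constructing the MoE layer, not something tied to the DP groups.
import torch
from deepspeed.moe.layer import MoE

expert = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

moe_layer = MoE(
    hidden_size=1024,
    expert=expert,
    num_experts=8,
    ep_size=2,                              # expert parallel degree (illustrative)
    k=1,
    enable_expert_tensor_parallelism=True,  # optional expert TP, per the comment above
)
```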
