
Support Mixtral 8*7B MOE #667

Open
matrixssy wants to merge 6 commits into main
Conversation

@matrixssy commented Jan 18, 2024

Support the Mixtral 8x7B MoE model structure and a weight converter from Hugging Face.

You can use the following command to convert the Hugging Face weights to Megatron:

python tools/checkpoint/util.py --model-type GPT \
    --loader mixtral_hf \
    --saver mixtral \
    --load-dir ../models/Mixtral-8x7B-Instruct-v0.1 \
    --save-dir ../models/Mixtral-8x7B-Instruct-v0.1-tp2-pp4 \
    --tokenizer-model ../models/Mixtral-8x7B-Instruct-v0.1/tokenizer.model \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 4

To enable Mixtral MoE during training, add:

--num-experts 8 \
--moe-type mixtral \

Note:
Implementing the Hugging Face load-balancing loss equivalently in Megatron would require many modifications to return the router logits. To keep the work simple, I therefore use the existing Sinkhorn algorithm to balance the routing probabilities across experts instead of Hugging Face's load_balancing_loss_func.
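For reference, a minimal standalone sketch of the Sinkhorn-style balancing applied to the router logits (plain PyTorch, not this PR's exact code; the fixed iteration count is illustrative):

import torch

def sinkhorn(logits: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    # Alternately rescale rows (tokens) and columns (experts) so the
    # routing probabilities end up roughly balanced across experts.
    cost = torch.exp(logits)                   # [num_tokens, num_experts]
    d0 = torch.ones(cost.size(0), device=cost.device, dtype=cost.dtype)
    d1 = torch.ones(cost.size(1), device=cost.device, dtype=cost.dtype)
    for _ in range(n_iters):
        d0 = 1.0 / (cost @ d1 + 1e-8)          # row (token) scaling
        d1 = 1.0 / (cost.t() @ d0 + 1e-8)      # column (expert) scaling
    return cost * d0.unsqueeze(1) * d1.unsqueeze(0)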

@matrixssy (Author)

#649

matrixssy marked this pull request as ready for review on January 18, 2024 12:31
@cdj0311 commented Jan 19, 2024

Great work! Could you provide a script to convert Megatron Mixtral back to HF?

@matrixssy (Author)

Great work! Could you provide a script to convert Megatron Mixtral back to HF?

Still working on it.

Fixed the bug where the b and s dimensions were mixed up.
@ftgreat commented Jan 22, 2024

Great work!

Great work! Looking forward to a script to convert Megatron Mixtral back to HF.

@TissueC commented Jan 28, 2024

Hi, I wonder whether the loss is normal after converting and training Mixtral with Megatron on your machine. I applied this PR and the initial loss is quite high, which seems to indicate that the forward step is not aligned with the Hugging Face version, especially when TP > 1.
@matrixssy

@matrixssy (Author)

Hi, I wonder whether the loss is normal after converting and training Mixtral with Megatron on your machine. I applied this PR and the initial loss is quite high, which seems to indicate that the forward step is not aligned with the Hugging Face version, especially when TP > 1. @matrixssy

Hello, first of all, thank you for giving it a try. In my example, I trained Mixtral 8x7B-v0.1 on the GPT-4 Alpaca zh dataset, and the loss decreased from 4.5 to around 1.0 after 300 iterations; training Mixtral 8x7B-Instruct-v0.1 on the same dataset, the loss decreases from 1.5 to around 1.0 after 300 iterations. Could you share your HF training loss curve? Although this PR does not implement the load-balancing loss (which is around the 1e-2 level), the loss should not differ significantly from HF.

@matrixssy (Author)

Hi, I wonder whether the loss is normal after converting and training Mixtral with Megatron on your machine. I applied this PR and the initial loss is quite high, which seems to indicate that the forward step is not aligned with the Hugging Face version, especially when TP > 1. @matrixssy

By the way, I have verified that the relative error of the average forward logits between the converted Megatron model and the HF (Hugging Face) model is within 1%, and the cosine similarity is 0.9999.
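For anyone who wants to reproduce that check, a rough sketch of the comparison (assuming you already have two logits tensors of identical shape from matched forward passes; the function name is illustrative):

import torch
import torch.nn.functional as F

def compare_logits(megatron_logits: torch.Tensor, hf_logits: torch.Tensor):
    # Mean relative error of the logits, treating HF as the reference.
    rel_err = (megatron_logits - hf_logits).abs().mean() / hf_logits.abs().mean()
    # Cosine similarity of the flattened logits tensors.
    cos_sim = F.cosine_similarity(
        megatron_logits.flatten(), hf_logits.flatten(), dim=0
    )
    return rel_err.item(), cos_sim.item()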

@TissueC commented Jan 29, 2024

Hi, I fixed a bug in my script and now the initial loss is normal (around 2.3 on an arXiv dataset). Thanks for your contribution!

Also, one extra question: since the gate linear is shared across TP groups, why not define it as a tensor_parallel.RowParallelLinear?

@matrixssy (Author) commented Jan 29, 2024

Hi, I fixed a bug in my script and now the initial loss is normal (around 2.3 on an arXiv dataset). Thanks for your contribution!

Also, one extra question: since the gate linear is shared across TP groups, why not define it as a tensor_parallel.RowParallelLinear?

Good question! My initial thinking was that the router weight has shape (hidden_size, n_experts), which is not particularly large, and there is only one per layer, so the benefit of parallelizing it is small. Additionally, when implementing the load-balancing loss in the future, obtaining the router logits would become difficult with a parallel layer.
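To make that concrete, a minimal sketch of a replicated per-layer router as described above (plain PyTorch, not this PR's exact code; MixtralRouter and the argument names are illustrative):

import torch
import torch.nn as nn

class MixtralRouter(nn.Module):
    # One small [hidden_size, num_experts] gate per layer, replicated on
    # every TP rank instead of sharded, so the full router logits stay
    # available for a future load-balancing loss.
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # [num_tokens, hidden_size] -> [num_tokens, num_experts] logits.
        return self.gate(hidden_states)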

@Victarry (Contributor) commented Feb 5, 2024

Hi, @matrixssy. Thanks for your contribution. There are some ongoing efforts internally at NVIDIA on a Mixtral 8x7B example. We will support converting HF checkpoints to MCore checkpoints with different EP/TP/PP sizes. The code will be released soon, after some refactoring.

@matrixssy (Author)

Hi, @matrixssy. Thanks for your contribution. There are some ongoing efforts internally at NVIDIA on a Mixtral 8x7B example. We will support converting HF checkpoints to MCore checkpoints with different EP/TP/PP sizes. The code will be released soon, after some refactoring.

Cool! How much longer will this take (converting HF checkpoints to MCore checkpoints with EP), and are there any preceding pull requests?

@Victarry (Contributor) commented Feb 6, 2024

Actually, the functionality for changing the EP size has already been implemented.

However, there is a preceding MR (on our internal GitLab) still under review that implements the MLM-legacy to MCore model converter. After that MR is finished, I will need some time for code refactoring. I think the MR for the Mixtral checkpoint converter will be released this month.

@ZhangEnmao

Hi, when I run your code, I get two errors. Could you help me and give some advice?
[error screenshots omitted]

@ZhangEnmao commented Feb 8, 2024

Hi, when I set target-tensor-parallel-size > 1, I get the following errors; only target-tensor-parallel-size = 1 works. Could it be related to the following warning? I am using the latest NVIDIA PyTorch Docker image. What can I do to resolve this missing-packages problem? Thanks very much.

[error and warning screenshots omitted]

@matrixssy (Author)

Hi, when I set target-tensor-parallel-size > 1, I get the following errors; only target-tensor-parallel-size = 1 works. Could it be related to the following warning? I am using the latest NVIDIA PyTorch Docker image. What can I do to resolve this missing-packages problem? Thanks very much.
[screenshots omitted]

Yes, you need to set --sequence-parallel
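For example, a TP > 1 run would add something like the following to the training arguments (values illustrative, matching the TP=2 conversion example above):

--tensor-model-parallel-size 2 \
--sequence-parallel \
--num-experts 8 \
--moe-type mixtral \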

@github-actions (bot)

Marking as stale. No activity in 60 days.

github-actions bot added the "stale" label (No activity in 60 days on issue or PR) on Apr 17, 2024
@shamanez

Any update on this?

github-actions bot removed the "stale" label on Apr 21, 2024
@passaglia

@Victarry
Have the plans to release a checkpoint converter that supports MoE (mentioned here as "Coming Soon") changed?
