Support Mixtral 8*7B MOE #667
base: main
Conversation
Great work! Could you provide a script to convert Megatron Mixtral to HF?
Still working on it.
Fixed the bug where the b and s dimensions were mixed up.
Great work! Looking forward to a script to convert Megatron Mixtral to HF.
deactivate sinkhorn by default
Hi, I wonder if the loss is normal after converting and training Mixtral with Megatron on your machine.
Hello, first of all, thank you for giving it a try. In my example, I trained Mixtral 8x7B-v0.1 on the gpt4 alpaca zh dataset, and the loss decreased from 4.5 to around 1.0 after 300 iterations; when I trained Mixtral 8x7B-Instruct-v0.1 on the same dataset, the loss decreased from 1.5 to around 1.0 after 300 iterations. Could you share your HF training loss curve? Although this PR has not implemented the load-balancing loss (which is around the 1e-2 level), the loss should not differ significantly from HF.
By the way, I have verified that the relative error of the average forward logits between the converted Megatron model and the HF (Hugging Face) model is within 1%, and the cosine similarity is 0.9999.
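A minimal sketch of that kind of comparison, for reference (function and tensor names are illustrative, not the actual verification script):

```python
import torch

@torch.no_grad()
def compare_logits(hf_logits: torch.Tensor, mg_logits: torch.Tensor):
    """hf_logits / mg_logits: [batch, seq, vocab] outputs from the two models
    on the same input batch."""
    hf = hf_logits.float().flatten()
    mg = mg_logits.float().flatten()
    rel_err = (hf - mg).abs().mean() / hf.abs().mean()            # average relative error
    cos_sim = torch.nn.functional.cosine_similarity(hf, mg, dim=0)  # overall direction match
    return rel_err.item(), cos_sim.item()
```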
Hi, I fixed a bug in my script and now the initial loss is normal (around 2.3 on the arXiv dataset). Thanks for your contribution! I also have an extra question: since the gate linear is shared across TP groups, why not define it as a tensor_parallel.RowParallelLinear?
Good question! My initial idea was that the router weight has shape (hidden_size, n_experts), which is not particularly large, and there is only one per layer, so the benefit of parallelizing it is not significant. Additionally, when implementing the load-balancing loss in the future, obtaining the router logits would become difficult if the gate were sharded.
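As a rough illustration of that design (a minimal sketch with assumed names, not the PR's actual code), the router can simply be a small linear layer kept replicated on every TP rank, so the full routing logits remain visible everywhere:

```python
import torch.nn as nn

class Router(nn.Module):
    """Illustrative router: a (hidden_size, n_experts) linear that is replicated
    on every tensor-parallel rank rather than sharded, keeping the full router
    logits available, e.g. for a future load-balancing loss."""

    def __init__(self, hidden_size: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, n_experts, bias=False)

    def forward(self, hidden_states):
        # [num_tokens, hidden_size] -> [num_tokens, n_experts] routing logits
        return self.gate(hidden_states)
```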
Hi, @matrixssy. Thanks for your contribution. There are ongoing efforts inside NVIDIA on a Mixtral 8x7B example. We will support converting the HF checkpoint to an MCore checkpoint with different EP/TP/PP sizes. The code will be released soon, along with some refactoring.
Cool! How much longer will this task take (converting the HF checkpoint to an MCore checkpoint with EP), and are there any preceding pull requests?
Actually, the functionality for changing the EP size has already been implemented. But there is a preceding MR (on our internal GitLab) still being reviewed which implements the MLM-legacy to MCore model converter.
Marking as stale. No activity in 60 days. |
Any update on this?
Support the Mixtral 8x7B MoE model structure and a weight converter from Hugging Face.
You can refer to this script to convert the Hugging Face weights to Megatron:
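Roughly speaking, such a converter loads the HF state dict and regroups the tensors under Megatron parameter names; a minimal sketch (the HF key names follow the public Mixtral checkpoint layout, while the Megatron-side names are assumptions, not this PR's actual mapping):

```python
import torch
from transformers import AutoModelForCausalLM

# Loads the full bf16 checkpoint; needs enough CPU memory for Mixtral 8x7B.
hf_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.bfloat16
)
hf_sd = hf_model.state_dict()
num_layers = hf_model.config.num_hidden_layers
num_experts = hf_model.config.num_local_experts

mg_sd = {}
for layer in range(num_layers):
    # Router / gate weight is copied as-is ([n_experts, hidden_size] in HF).
    mg_sd[f"decoder.layers.{layer}.mlp.router.weight"] = \
        hf_sd[f"model.layers.{layer}.block_sparse_moe.gate.weight"]
    for expert in range(num_experts):
        prefix = f"model.layers.{layer}.block_sparse_moe.experts.{expert}"
        # SwiGLU: HF keeps the gate (w1) and up (w3) projections separate;
        # Megatron typically concatenates them into one fc1-style weight.
        w1 = hf_sd[f"{prefix}.w1.weight"]
        w3 = hf_sd[f"{prefix}.w3.weight"]
        mg_sd[f"decoder.layers.{layer}.mlp.experts.{expert}.dense_h_to_4h.weight"] = \
            torch.cat([w1, w3], dim=0)
        mg_sd[f"decoder.layers.{layer}.mlp.experts.{expert}.dense_4h_to_h.weight"] = \
            hf_sd[f"{prefix}.w2.weight"]
```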
To activate Mixtral MoE in training:
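For reference, enabling the MoE path in Megatron-LM typically comes down to a couple of extra launch flags; in the sketch below, only `--num-experts` is a long-standing upstream Megatron-LM argument, and the top-k flag name is an assumption that may differ from what this PR adds.

```python
# Illustrative only: extra flags appended to the usual pretrain_gpt.py launch
# command to enable the MoE path for Mixtral 8x7B (8 experts, top-2 routing).
MIXTRAL_MOE_ARGS = [
    "--num-experts", "8",      # 8 experts per MoE layer
    "--moe-router-topk", "2",  # assumed flag name: route each token to 2 experts
]
```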
Note that:
Implementing Hugging Face's load-balancing loss equivalently on Megatron would require many modifications to return the router logits. Therefore, to simplify the work, I chose to use the original sinkhorn algorithm to balance the routing probability of each expert instead of using Hugging Face's load_balancing_loss_func.
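For context, the sinkhorn approach balances routing by iteratively rescaling the exponentiated router logits across tokens and experts so that no expert dominates the assignment; a minimal sketch along the lines of the routine shipped in Megatron-LM (simplified, not this PR's exact code):

```python
import torch

def sinkhorn(logits: torch.Tensor, tol: float = 1e-4, eps: float = 1e-8) -> torch.Tensor:
    """logits: [num_tokens, num_experts] router outputs.
    Alternately rescales rows (tokens) and columns (experts) of exp(logits)
    until the per-expert scales converge, yielding balanced routing scores."""
    cost = torch.exp(logits)
    d0 = torch.ones(cost.size(0), device=cost.device, dtype=cost.dtype)  # per-token scale
    d1 = torch.ones(cost.size(1), device=cost.device, dtype=cost.dtype)  # per-expert scale
    error, d1_old = float("inf"), d1
    while error > tol:
        d0 = (1.0 / d0.size(0)) / (torch.sum(d1 * cost, dim=1) + eps)
        d1 = (1.0 / d1.size(0)) / (torch.sum(d0.unsqueeze(1) * cost, dim=0) + eps)
        error = torch.mean(torch.abs(d1_old - d1))
        d1_old = d1
    return d1 * cost * d0.unsqueeze(1)  # balanced scores used to pick experts
```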