MoE vram #2110
Conversation
Will test shortly, this is a must-have feature.
Yes. And allow a larger calibration set. 128 is too low; 512 is a good sweet spot. But gpu0 has a small bug/regression where the more GPUs you have, the more memory pressure gpu0 is under, so for Balanced mode it's better to use as few GPUs as possible until that bug is solved.
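For reference, a minimal sketch of a run with a 512-row calibration set, following the load/quantize/save API from the GPTQModel README; the model id and output path are illustrative, not from this thread:

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Pull 512 calibration rows from c4 (512 instead of the too-low 128).
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))["text"]

qcfg = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("Qwen/Qwen3-Next-80B-A3B-Instruct", qcfg)
model.quantize(calibration, batch_size=4)  # a list[str] is accepted as calibration data
model.save("Qwen3-Next-gptq-4bit")
```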
@Qubitium I have a strange bug, maybe I used too many dynamic exclusions... This is on GLM-4.5-Air.
@avtc The updated LogBar dependency is not installed: pip install -r requirements.txt. Also, I don't think you need any of your dynamic rules. They are all auto-handled except for shared_experts, which I think are OK to quantize. We already skip mlp.gate.
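If a manual exclusion is still wanted, GPTQModel's `dynamic` config accepts regex rules where a `-:` prefix means "do not quantize matching modules". A sketch; the exact GLM-4.5-Air module path used in the pattern is an assumption:

```python
from gptqmodel import QuantizeConfig

# Hypothetical manual rule: skip the MoE router gate modules. Recent
# GPTQModel already skips mlp.gate automatically, so a rule like this is
# only needed for custom exclusions beyond the defaults.
qcfg = QuantizeConfig(
    bits=8,
    group_size=64,
    dynamic={
        r"-:.*\.mlp\.gate$": {},  # "-:" = exclude modules matching this regex
    },
)
```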
Also, with 8 GPUs and the same samples it fails. But 4 GPUs proceed.
I have successfully quantized GLM-4.5-Air on 4 GPUs in balanced mode with GPTQ v1 and 1024 samples from c4, to int8 with group size 64, padded for tp8 on both the intermediate and MoE dimensions. It took around 4.5 hours. So far the quality of the final quant is very good. I will check if excluding only mlp.gate, without up/down, will work with vllm.
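A hedged sketch of that run (GPTQ v1, int8, group size 64, 1024 c4 samples). The balanced-VRAM toggle and the tp8 padding knobs come from this PR and their exact spellings are not confirmed here, so they are omitted:

```python
from gptqmodel import GPTQModel, QuantizeConfig

# int8 with group size 64, as in the run described above.
qcfg = QuantizeConfig(bits=8, group_size=64)

model = GPTQModel.load("zai-org/GLM-4.5-Air", qcfg)
model.quantize(calibration_1024)  # calibration_1024: 1024 c4 text rows, built as in the earlier sketch
model.save("GLM-4.5-Air-int8-g64")
```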
@avtc Interesting! Glm 6 only has
GLM-4.5-Air is a smaller version of GLM-4.5, but I hope for a 4.6-Air soon.
@Qubitium |
@avtc This PR contains optimizations for VRAM at the expense of speed. Toggling `vram_strategy` to `Balanced` will allow 2 GPUs to use only 13GB of VRAM for Qwen3 Next. Before this PR, Qwen3 Next used 24GB in a 2x GPU instance. You only need 2x 24GB GPUs for even very large MoEs. The VRAM is balanced across the number of GPUs, so 2x GPUs would spread half the MoE to GPU 0 and GPU 1, and 3x GPUs would move 1/3 to each GPU.
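A sketch of how the toggle might look. `vram_strategy` and the `Balanced` value are named by this PR, but where the option is passed (quantize argument vs. config field) is an assumption here:

```python
from gptqmodel import GPTQModel, QuantizeConfig

qcfg = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load("Qwen/Qwen3-Next-80B-A3B-Instruct", qcfg)

# Balanced mode spreads the MoE expert weights evenly across GPUs during
# quantization: with N GPUs, each holds roughly 1/N of the experts.
# `calibration` is a list of text rows, built as in the first sketch.
model.quantize(calibration, vram_strategy="balanced")
```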