Conversation

@Qubitium
Collaborator

@Qubitium Qubitium commented Oct 24, 2025

@avtc This PR contains optimizations for vram at the expense of speed. Setting the vram_strategy toggle to Balanced lets 2 GPUs use only 13GB of vram for Qwen3 Next; before this PR, Qwen3 Next used 24GB in a 2x GPU instance. You only need 2x 24GB GPUs for even very large MoEs. The vram is balanced across the number of GPUs: with 2x GPU, half the MoE goes to gpu 0 and half to gpu 1; with 3x GPU, 1/3 goes to each GPU.
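
A minimal usage sketch of the toggle described above, assuming QuantizeConfig exposes a vram_strategy field and that VramStrategy is importable from gptqmodel (import path, model id, and calibration data are placeholders, not verified against the released API):

from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel import VramStrategy  # assumed export; adjust to the actual import path

calibration_dataset = ["example calibration text"] * 128  # replace with real samples

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    vram_strategy=VramStrategy.Balanced,  # spread MoE expert weights across all visible GPUs
)

model = GPTQModel.load("Qwen/Qwen3-Next-80B-A3B-Instruct", quant_config)  # placeholder model id
model.quantize(calibration_dataset)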

@avtc
Contributor

avtc commented Oct 24, 2025

Will test shortly, this is a must-have feature.

@Qubitium
Collaborator Author

Will test shortly, this is a must-have feature.

Yes. And it allows a larger calibration set. 128 is too low; 512 is a good sweet spot. But gpu0 has a small bug/regression where the more GPUs you have, the more memory pressure gpu0 is under, so for Balanced it's better to use as few GPUs as possible until that bug is solved.
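
A rough sketch of pulling a 512-sample calibration set, assuming the usual GPTQModel flow of passing a list of raw text samples (the c4 shard name is only an example):

from datasets import load_dataset

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",  # any shard works; this one is an example
    split="train",
).select(range(512))["text"]  # 512 samples, the sweet spot mentioned above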

@Qubitium Qubitium changed the title Moe vram → MoE vram Oct 24, 2025
@Qubitium
Collaborator Author

VramStrategy.Balanced mode is about 1.5x slower than the default Exclusive if you have enough vram, but if you don't, it works well without OOM and there is really no other choice but to use it.

@Qubitium Qubitium merged commit 1f76659 into main Oct 24, 2025
5 checks passed
@Qubitium Qubitium deleted the moe-vram branch October 24, 2025 15:33
@avtc
Contributor

avtc commented Oct 24, 2025

@Qubitium I have a strange bug, maybe I used too many dynamic exclusions...

  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 1016, in quantize
    result = module_looper.loop(
        backend=backend,
        fail_safe=self.quantize_config.fail_safe,
    )
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 1126, in loop
    return self._loop_impl(fail_safe=fail_safe, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 1633, in _loop_impl
    replay_source = f"{layer_descriptor}:subset{index + 1}/{subset_total}"
                                                            ^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'subset_total' where it is not associated with a value
Exception ignored in: <function ProgressBar.__del__ at 0x513297fad20>
Traceback (most recent call last):
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 876, in __del__
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 916, in close
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 594, in detach
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 495, in _render_lock_context
TypeError: 'NoneType' object is not callable
Exception ignored in: <function ProgressBar.__del__ at 0x513297fad20>
Traceback (most recent call last):
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 876, in __del__
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 916, in close
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 594, in detach
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 495, in _render_lock_context
TypeError: 'NoneType' object is not callable
dynamic = {
    r"-:model.embed_tokens.weight": {},
    r"-:.*shared_experts": {},
    r"-:.*shared_head": {},
    r"-:lm_head.weight": {},
    r"-:.*mlp.down": {},
    r"-:.*mlp.gate": {},
    r"-:.*mlp.up": {},
    r"-:.*post_attention_layernorm": {},
    r"-:.*self_attn": {},
    r"-:.*norm.weight": {},
    r"-:.*enorm": {},
    r"-:.*hnorm": {},
    r"-:.*eh_proj": {},
    r"-:.*input_layernorm": {},
    }

This is on GLM-4.5-Air.
Will try to debug later...
...
Maybe I need to reinstall requirements.

@Qubitium
Collaborator Author

Qubitium commented Oct 24, 2025

@avtc The updated LogBar dependency is not installed.

pip install -r req*txt

Also, I don't think you need to use any of your dynamic rules. They are all auto-handled except for shared_experts, which I think are ok to quantize. We already skip mlp.gate.

@avtc
Contributor

avtc commented Oct 24, 2025

@Qubitium it appears it is not caused by the dependency, it is caused by the dynamic config. Maybe all modules of the first layer were ignored for quantization, so the variable subset_total was never defined (current main 97025e8, line 1665).
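
A minimal illustration of that suspected failure mode (illustrative only, not the actual module_looper code): if the loop over quantizable subsets never runs because everything was excluded, subset_total is never bound before the f-string reads it.

def build_replay_source(layer_descriptor, subsets, index=0):
    for i, subset in enumerate(subsets):
        subset_total = len(subsets)  # only bound when at least one subset survives the dynamic filter
        index = i
    return f"{layer_descriptor}:subset{index + 1}/{subset_total}"

build_replay_source("layer0", subsets=[])
# UnboundLocalError: cannot access local variable 'subset_total' where it is not associated with a value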

@avtc
Contributor

avtc commented Oct 24, 2025

Also, with 8 GPUs and samples: 1, this error appears (probably similar to the one that was fixed by the Q.to lock):

Traceback (most recent call last):
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/utils/threadx.py", line 484, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 1608, in _process_on_worker
    proc.process(module=nm)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/gptq_processor.py", line 162, in process
    wq, q_scales, q_zeros, q_g_idx, duration, avg_loss, damp_percent, nsamples = g.quantize()
                                                                                 ~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 602, in quantize
    self.finalize_hessian(target_device=target_device)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 520, in finalize_hessian
    self._materialize_global_hessian(target_device=target_device)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 503, in _materialize_global_hessian
    tmp = partial.to(device=result_accum.device, dtype=torch.float32)
torch.AcceleratorError: CUDA error: invalid argument

But 4 GPUs proceed.
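
For context, a "to() lock" style fix usually looks something like the sketch below: serialize cross-device copies so concurrent worker threads do not race on the transfer. The names here are hypothetical, not the actual GPTQModel internals.

import threading
import torch

_copy_lock = threading.Lock()

def locked_to(tensor, device, dtype=torch.float32):
    # Only allow one cross-device .to() in flight at a time across worker threads.
    with _copy_lock:
        return tensor.to(device=device, dtype=dtype)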

@avtc
Contributor

avtc commented Oct 25, 2025

I have successfully quantized GLM-4.5-Air on 4 GPUs in balanced mode with GPTQ v1 and 1024 samples from c4, to int8 with group size 64 and padding for tp8 for both intermediate and moe. It took around 4.5 hours. With:

dynamic = {
    r"-:model.embed_tokens.weight": {},
    r"-:.*shared_experts": {},
    r"-:.*shared_head": {},
    r"-:lm_head.weight": {},
    # r"-:.*mlp.down": {},
    # r"-:.*mlp.gate": {},
    # r"-:.*mlp.up": {},
    r"-:.*post_attention_layernorm": {},
    r"-:.*self_attn": {},
    r"-:.*norm.weight": {},
    r"-:.*enorm": {},
    r"-:.*hnorm": {},
    r"-:.*eh_proj": {},
    r"-:.*input_layernorm": {},
    }
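
For completeness, a sketch of how a run like this wires together, assuming QuantizeConfig accepts the dynamic dict above plus the vram_strategy field from this PR (import paths, model id, and save path are placeholders):

from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel import VramStrategy  # assumed export

quant_config = QuantizeConfig(
    bits=8,                               # int8 as in this run
    group_size=64,
    dynamic=dynamic,                      # the exclusion rules listed above
    vram_strategy=VramStrategy.Balanced,
)

model = GPTQModel.load("zai-org/GLM-4.5-Air", quant_config)  # placeholder model id
model.quantize(calibration_dataset)       # 1024 c4 text samples in this run
model.save("GLM-4.5-Air-int8-gs64")       # assumed save API; adjust as needed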

So far the quality of the final quant is very good.

I will check if excluding only mlp.gate without up/down will work with vllm.

-- Update:
When only mlp.gate is excluded without mlp.up, vllm fails to load the model, so you need to quantize both or exclude both:

(Worker_TP3 pid=345961) ERROR 10-25 12:06:11 [multiproc_executor.py:597] ValueError: Detected some but not all shards of model.layers.0.mlp.gate_up_proj are quantized. All shards of fused layers to have the same precision.
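
If keeping those projections unquantized is still desired, one option is a rule that excludes both shards of the fused layer together. The pattern below is hypothetical and depends on how GPTQModel matches module names; verify before using.

dynamic = {
    # Exclude gate_proj and up_proj together so vLLM's fused gate_up_proj
    # never mixes quantized and unquantized shards; down_proj is unaffected.
    r"-:.*mlp\.(gate_proj|up_proj)": {},
}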

@Qubitium
Collaborator Author

@avtc Interesting! GLM 4.6 only has mlp.gate, which is used for the expert token routing, so we don't want to quantize it, but 4.5 Air has mlp.gate plus mlp.up/down, which get fused by vllm. I had previously thought 4.5 Air was just a smaller-layered version of GLM 4.6. It is not.

@avtc
Contributor

avtc commented Oct 25, 2025

GLM-4.5-Air is a smaller version of GLM-4.5, but hoping for a 4.6-Air soon.

@avtc
Contributor

avtc commented Oct 26, 2025

Yes. And it allows a larger calibration set. 128 is too low; 512 is a good sweet spot. But gpu0 has a small bug/regression where the more GPUs you have, the more memory pressure gpu0 is under, so for Balanced it's better to use as few GPUs as possible until that bug is solved.

@Qubitium
As a partial workaround, the balanced distribution for MoE modules can exclude cuda:0: avtc@5bffb4b
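
The idea behind that workaround, as a rough sketch (illustrative only, not the actual avtc@5bffb4b patch): assign MoE expert modules round-robin over every CUDA device except cuda:0, which already carries the most memory pressure.

import torch

def balanced_expert_devices(num_experts, skip_gpu0=True):
    gpu_ids = list(range(torch.cuda.device_count()))
    if skip_gpu0 and len(gpu_ids) > 1:
        gpu_ids = gpu_ids[1:]  # reserve cuda:0 for embeddings, attention, and hidden-state buffers
    return [torch.device(f"cuda:{gpu_ids[i % len(gpu_ids)]}") for i in range(num_experts)]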
