Conversation

@Qubitium
Collaborator

@Qubitium Qubitium commented Oct 24, 2025

@avtc This PR contains optimizations for vram at the expense of speed. Setting the vram_strategy toggle to Balanced lets 2 GPUs use only 13GB of vram for Qwen3 Next; before this PR, Qwen3 Next used 24GB in a 2x GPU instance. You only need 2x 24GB GPUs for even very large MoEs. The vram is balanced across the number of GPUs: with 2x GPU, half the MoE goes to gpu 0 and half to gpu 1; with 3x GPU, 1/3 goes to each GPU.
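
A minimal usage sketch of the toggle described above, assuming QuantizeConfig exposes a vram_strategy field and that VramStrategy is importable from gptqmodel (import path, model id, and calibration data are placeholders, not verified against the released API):

from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel import VramStrategy  # assumed export; adjust to the actual import path

calibration_dataset = ["example calibration text"] * 128  # replace with real samples

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    vram_strategy=VramStrategy.Balanced,  # spread MoE expert weights across all visible GPUs
)

model = GPTQModel.load("Qwen/Qwen3-Next-80B-A3B-Instruct", quant_config)  # placeholder model id
model.quantize(calibration_dataset)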

@avtc
Contributor

avtc commented Oct 24, 2025

Will test shortly, this is a must-have feature.

@Qubitium
Collaborator Author

Will test shortly, this is a must-have feature.

Yes. And it allows a larger calibration set. 128 is too low; 512 is a good sweet spot. But gpu0 has a small bug/regression where the more GPUs you have, the more memory pressure gpu0 is under, so for Balanced it's better to use as few GPUs as possible until that bug is solved.
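
A rough sketch of pulling a 512-sample calibration set, assuming the usual GPTQModel flow of passing a list of raw text samples (the c4 shard name is only an example):

from datasets import load_dataset

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",  # any shard works; this one is an example
    split="train",
).select(range(512))["text"]  # 512 samples, the sweet spot mentioned above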

@Qubitium Qubitium changed the title Moe vram → MoE vram Oct 24, 2025
@Qubitium
Collaborator Author

VramStrategy.Balanced mode is about 1.5x slower than the default Exclusive if you have enough vram, but if you don't, it works well without OOM and there is really no other choice but to use it.

@Qubitium Qubitium merged commit 1f76659 into main Oct 24, 2025
5 checks passed
@Qubitium Qubitium deleted the moe-vram branch October 24, 2025 15:33
@avtc
Contributor

avtc commented Oct 24, 2025

@Qubitium I have a strange bug, maybe I used too many dynamic exclusions...

  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 1016, in quantize
    result = module_looper.loop(
        backend=backend,
        fail_safe=self.quantize_config.fail_safe,
    )
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 1126, in loop
    return self._loop_impl(fail_safe=fail_safe, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 1633, in _loop_impl
    replay_source = f"{layer_descriptor}:subset{index + 1}/{subset_total}"
                                                            ^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'subset_total' where it is not associated with a value
Exception ignored in: <function ProgressBar.__del__ at 0x513297fad20>
Traceback (most recent call last):
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 876, in __del__
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 916, in close
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 594, in detach
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 495, in _render_lock_context
TypeError: 'NoneType' object is not callable
Exception ignored in: <function ProgressBar.__del__ at 0x513297fad20>
Traceback (most recent call last):
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 876, in __del__
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 916, in close
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 594, in detach
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/logbar/progress.py", line 495, in _render_lock_context
TypeError: 'NoneType' object is not callable
dynamic = {
    r"-:model.embed_tokens.weight": {},
    r"-:.*shared_experts": {},
    r"-:.*shared_head": {},
    r"-:lm_head.weight": {},
    r"-:.*mlp.down": {},
    r"-:.*mlp.gate": {},
    r"-:.*mlp.up": {},
    r"-:.*post_attention_layernorm": {},
    r"-:.*self_attn": {},
    r"-:.*norm.weight": {},
    r"-:.*enorm": {},
    r"-:.*hnorm": {},
    r"-:.*eh_proj": {},
    r"-:.*input_layernorm": {},
    }

This is on GLM-4.5-Air.
Will try to debug later...
...
Maybe I need to reinstall requirements.

@Qubitium
Collaborator Author

Qubitium commented Oct 24, 2025

@avtc The updated LogBar dependency is not installed.

pip install -r req*txt

Also, I don't think you need to use any of your dynamic rules. They are all auto-handled except for shared_experts, which I think are ok to quantize. We already skip mlp.gate.

@avtc
Contributor

avtc commented Oct 24, 2025

@Qubitium it appears it is not caused by the dependency, it is caused by the dynamic config. Maybe all modules of the first layer were ignored for quantization, so the variable subset_total was never defined (current main 97025e8, line 1665).
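
A minimal illustration of that suspected failure mode (illustrative only, not the actual module_looper code): if the loop over quantizable subsets never runs because everything was excluded, subset_total is never bound before the f-string reads it.

def build_replay_source(layer_descriptor, subsets, index=0):
    for i, subset in enumerate(subsets):
        subset_total = len(subsets)  # only bound when at least one subset survives the dynamic filter
        index = i
    return f"{layer_descriptor}:subset{index + 1}/{subset_total}"

build_replay_source("layer0", subsets=[])
# UnboundLocalError: cannot access local variable 'subset_total' where it is not associated with a value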

@avtc
Contributor

avtc commented Oct 24, 2025

Also, with 8 GPUs and samples: 1, this error appears (probably similar to the one that was fixed by the Q.to lock):

Traceback (most recent call last):
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/utils/threadx.py", line 484, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 1608, in _process_on_worker
    proc.process(module=nm)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/gptq_processor.py", line 162, in process
    wq, q_scales, q_zeros, q_g_idx, duration, avg_loss, damp_percent, nsamples = g.quantize()
                                                                                 ~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 602, in quantize
    self.finalize_hessian(target_device=target_device)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 520, in finalize_hessian
    self._materialize_global_hessian(target_device=target_device)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/quantization/gptq.py", line 503, in _materialize_global_hessian
    tmp = partial.to(device=result_accum.device, dtype=torch.float32)
torch.AcceleratorError: CUDA error: invalid argument

But 4 GPUs proceed.
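
For context, a "to() lock" style fix usually looks something like the sketch below: serialize cross-device copies so concurrent worker threads do not race on the transfer. The names here are hypothetical, not the actual GPTQModel internals.

import threading
import torch

_copy_lock = threading.Lock()

def locked_to(tensor, device, dtype=torch.float32):
    # Only allow one cross-device .to() in flight at a time across worker threads.
    with _copy_lock:
        return tensor.to(device=device, dtype=dtype)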

@avtc
Contributor

avtc commented Oct 25, 2025

I have successfully quantized GLM-4.5-Air on 4 GPUs in balanced mode with GPTQ v1 and 1024 samples from c4, to int8 with group size 64 and padding for tp8 for both intermediate and moe. It took around 4.5 hours. With:

dynamic = {
    r"-:model.embed_tokens.weight": {},
    r"-:.*shared_experts": {},
    r"-:.*shared_head": {},
    r"-:lm_head.weight": {},
    # r"-:.*mlp.down": {},
    # r"-:.*mlp.gate": {},
    # r"-:.*mlp.up": {},
    r"-:.*post_attention_layernorm": {},
    r"-:.*self_attn": {},
    r"-:.*norm.weight": {},
    r"-:.*enorm": {},
    r"-:.*hnorm": {},
    r"-:.*eh_proj": {},
    r"-:.*input_layernorm": {},
    }
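
For completeness, a sketch of how a run like this wires together, assuming QuantizeConfig accepts the dynamic dict above plus the vram_strategy field from this PR (import paths, model id, and save path are placeholders):

from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel import VramStrategy  # assumed export

quant_config = QuantizeConfig(
    bits=8,                               # int8 as in this run
    group_size=64,
    dynamic=dynamic,                      # the exclusion rules listed above
    vram_strategy=VramStrategy.Balanced,
)

model = GPTQModel.load("zai-org/GLM-4.5-Air", quant_config)  # placeholder model id
model.quantize(calibration_dataset)       # 1024 c4 text samples in this run
model.save("GLM-4.5-Air-int8-gs64")       # assumed save API; adjust as needed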

So far the quality of the final quant is very good.

I will check if excluding only mlp.gate without up/down will work with vllm.

-- Update:
When only mlp.gate is excluded without mlp.up, vllm fails to load the model, so you need to quantize both or exclude both:

(Worker_TP3 pid=345961) ERROR 10-25 12:06:11 [multiproc_executor.py:597] ValueError: Detected some but not all shards of model.layers.0.mlp.gate_up_proj are quantized. All shards of fused layers to have the same precision.
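
If keeping those projections unquantized is still desired, one option is a rule that excludes both shards of the fused layer together. The pattern below is hypothetical and depends on how GPTQModel matches module names; verify before using.

dynamic = {
    # Exclude gate_proj and up_proj together so vLLM's fused gate_up_proj
    # never mixes quantized and unquantized shards; down_proj is unaffected.
    r"-:.*mlp\.(gate_proj|up_proj)": {},
}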

@Qubitium
Collaborator Author

@avtc Interesting! GLM 4.6 only has mlp.gate, which is used for the expert token routing, so we don't want to quantize it, but 4.5 Air has mlp.gate plus mlp.up/down, which get fused by vllm. I had previously thought 4.5 Air was just a smaller-layered version of GLM 4.6. It is not.

@avtc
Contributor

avtc commented Oct 25, 2025

GLM-4.5-Air is a smaller version of GLM-4.5, but hoping for a 4.6-Air soon.

@avtc
Contributor

avtc commented Oct 26, 2025

Yes. And it allows a larger calibration set. 128 is too low; 512 is a good sweet spot. But gpu0 has a small bug/regression where the more GPUs you have, the more memory pressure gpu0 is under, so for Balanced it's better to use as few GPUs as possible until that bug is solved.

@Qubitium
As a partial workaround, the balanced distribution for MoE modules can exclude cuda:0: avtc@5bffb4b
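
The idea behind that workaround, as a rough sketch (illustrative only, not the actual avtc@5bffb4b patch): assign MoE expert modules round-robin over every CUDA device except cuda:0, which already carries the most memory pressure.

import torch

def balanced_expert_devices(num_experts, skip_gpu0=True):
    gpu_ids = list(range(torch.cuda.device_count()))
    if skip_gpu0 and len(gpu_ids) > 1:
        gpu_ids = gpu_ids[1:]  # reserve cuda:0 for embeddings, attention, and hidden-state buffers
    return [torch.device(f"cuda:{gpu_ids[i % len(gpu_ids)]}") for i in range(num_experts)]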
