feat: moe-router-bypass-batch-size (#2349)
Conversation
@avtc Do you have some rough numbers for the VRAM saving in your setup with and without the PR?
@Qubitium Hi. Without batching experts, the forward of the MoE up/gate modules on the … With …
Looks good! We will work on this PR after a newly …
Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
…ng object Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
@avtc 5.7.0 took much longer to push out due to small bugs we kept finding/fixing in CI regression tests. It should be out soon, so this can be merged.
@Qubitium Hi, this feature adds the ability to process MoE expert weights in chunks in MoeRoutingBypass mode, allowing large MoE models to be quantized with less VRAM and fewer devices.
Currently, GPTQ creates a Hessian accumulator for each expert weight during the forward pass; for GLM-4.5-Air, for example, this is around 17 GB for the up/gate MoE modules of a single layer. Processing expert weights in chunks keeps fewer Hessian accumulator matrices alive at once, and thus requires less VRAM.
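The memory argument above can be sketched numerically. This is a simplified illustration, not the PR's implementation: the expert count, sizes, and the `hessian` helper are made up, and the GPTQ scaling factors are omitted. The point is only that accumulating per-expert Hessians in chunks yields the same matrices while holding fewer of them in memory at a time.

```python
import numpy as np

hidden = 8        # toy hidden size (real models: thousands)
num_experts = 6   # toy expert count
chunk = 2         # experts processed per chunk
rng = np.random.default_rng(0)

# per-expert calibration activations captured during the forward pass
acts = [rng.standard_normal((16, hidden)) for _ in range(num_experts)]

def hessian(x):
    # GPTQ-style accumulator: proportional to X^T X (scale factors omitted)
    return x.T @ x

# all experts at once: num_experts accumulators alive simultaneously
full = [hessian(x) for x in acts]

# chunked: only `chunk` accumulators alive at a time; in practice each
# Hessian would be consumed by quantization and freed before the next chunk
chunked = []
for i in range(0, num_experts, chunk):
    chunked.extend(hessian(x) for x in acts[i:i + chunk])

# identical results, lower peak accumulator memory
assert all(np.allclose(a, b) for a, b in zip(full, chunked))
```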
Example option usage:
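The example in the original comment was cut off; the following is a hypothetical sketch only. The option name `moe_router_bypass_batch_size` is guessed from this PR's title and is not a confirmed GPTQModel parameter, and `calibration_dataset` is a placeholder; the `GPTQModel.load`/`quantize` call shape follows the library's usual pattern.

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    # ASSUMED kwarg, named after this PR: process MoE expert weights
    # in chunks of 8 while router bypass is active
    moe_router_bypass_batch_size=8,
)

model = GPTQModel.load("zai-org/GLM-4.5-Air", quant_config)
model.quantize(calibration_dataset)  # calibration_dataset: placeholder
```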