
Fix balanced MoE vram usage #2716

Merged
Qubitium merged 6 commits into main from optimize-device-placement on Apr 13, 2026

Conversation

@Qubitium Qubitium (Collaborator) commented Apr 12, 2026

Fix: #2326

@Qubitium Qubitium marked this pull request as draft April 12, 2026 11:09
4 comment threads on scripts/benchmark_qwen35_moe_ab.py resolved (Fixed)
@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

@avtc I will be doing heavy experiments here to resolve the device placement issue for VramStrategy. Right now my Qwen 3.5 MoE run using the new DENSE_HOME_MOE_BALANCED with 3 GPUs (first GPU for dense modules only, the rest balancing MoE experts) is showing massive VRAM savings at the expense of about 5% quantization time.

Not done yet.

Current DENSE_HOME_MOE_BALANCED vs origin/main BALANCED:

- pre_quant_forward_s: 2.021 vs 6.664
- process_quant_s: 47.553 vs 65.732
- final layer reserved spread: 0.098 GiB vs 8.854 GiB
- final layer peak reserved spread: 1.311 GiB vs 8.115 GiB
- coarse quant_wall_s: 422.186 vs 393.369 in that concurrent pair
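To make the placement rule concrete, here is a minimal sketch of the dense-home / MoE-balanced idea: dense modules pinned to one "home" device, experts round-robined over the rest. The function name and module lists are hypothetical; this is not GPTQModel's internal implementation.

```python
from typing import Dict, List

def dense_home_moe_balanced(
    dense_modules: List[str],
    expert_modules: List[str],
    devices: List[str],
) -> Dict[str, str]:
    """Pin dense modules to the first device ('home'); round-robin
    experts over the remaining devices so MoE weights stay balanced."""
    home, moe_pool = devices[0], devices[1:]
    placement = {name: home for name in dense_modules}
    for i, name in enumerate(expert_modules):
        placement[name] = moe_pool[i % len(moe_pool)]
    return placement

# Example: dense attention stays on cuda:0, 8 experts spread over cuda:1-2.
plan = dense_home_moe_balanced(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
    [f"mlp.experts.{i}" for i in range(8)],
    ["cuda:0", "cuda:1", "cuda:2"],
)
```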

@Qubitium Qubitium changed the title Fix balanced MoE forward device leakage Fix balanced MoE vram usage Apr 12, 2026
@avtc avtc (Contributor) commented Apr 12, 2026

As an idea: for models with large dense layers, the dense weights (i.e. q, k, v, o) could also be spread across GPUs. This should be possible since weights are already patched to move input params to the weight's GPU. It could help quantize models with abnormally large dense weights on consumer GPUs.
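For context, the "move input params to the weight's GPU" patch avtc refers to can be approximated with a standard PyTorch forward pre-hook. A hedged sketch, not the repo's actual patch:

```python
import torch
from torch import nn

def move_inputs_to_weight_device(module: nn.Module, args):
    """Forward pre-hook: relocate tensor inputs onto the device that
    holds this module's weight, so a layer spread to another GPU still
    receives correctly placed activations."""
    device = module.weight.device
    return tuple(
        a.to(device) if isinstance(a, torch.Tensor) else a for a in args
    )

# Example: a q_proj-style linear that could live on a secondary GPU.
layer = nn.Linear(4096, 4096)
layer.register_forward_pre_hook(move_inputs_to_weight_device)
```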

@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

> As an idea: for models with large dense layers, the dense weights (i.e. q, k, v, o) could also be spread across GPUs. This should be possible since weights are already patched to move input params to the weight's GPU. It could help quantize models with abnormally large dense weights on consumer GPUs.

Are you reading my mind? It's already designed, working, in testing, and will be in the next commit.

@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

Dense and MoE VRAM strategies are now split and controlled separately via per-pool device mapping, for fine-grained control and flexibility.

```python
from gptqmodel import GPTQModel
from gptqmodel.quantization import QuantizeConfig, FORMAT, METHOD
from gptqmodel.quantization.config import VramStrategy

qcfg = QuantizeConfig(
    quant_method=METHOD.GPTQ,
    format=FORMAT.GPTQ,
    bits=4,
    group_size=128,

    # Dense pool: use first 2 visible GPUs
    dense_vram_strategy=VramStrategy.BALANCED,
    dense_vram_strategy_devices=["cuda:0", "cuda:1"],

    # MoE pool: use last 4 visible GPUs
    moe_vram_strategy=VramStrategy.BALANCED,
    moe_vram_strategy_devices=["cuda:2", "cuda:3", "cuda:4", "cuda:5"],
)
```
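A hedged end-to-end usage sketch with this config: the model id, calibration texts, and output path below are placeholders, while the load/quantize/save calls follow GPTQModel's public API.

```python
# Placeholder calibration set, for illustration only.
calibration = ["gptqmodel is an llm quantization toolkit."] * 256

model = GPTQModel.load("Qwen/Qwen3.5-MoE", qcfg)  # hypothetical model id
model.quantize(calibration)
model.save("qwen35-moe-gptq-4bit")                # placeholder output path
```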
A/B vs origin/main:

- quant_wall_s: 431.571s current vs 437.933s baseline, -1.45%
- pre_quant_forward_s: 2.024s vs 5.421s, -62.66%
- process_quant_s: 54.190s vs 58.488s, -7.35%
- post_quant_forward_s: 5.866s vs 4.927s, +19.07%
- Final reserved VRAM spread: 0.098 GiB vs 8.818 GiB, -98.89%
- Final peak reserved spread: 1.295 GiB vs 8.123 GiB, -84.06%

2 comment threads on scripts/benchmark_qwen35_moe_ab.py resolved (Fixed)
@Qubitium Qubitium self-assigned this Apr 12, 2026
Comment thread tests/test_stage_modules.py Fixed
Comment thread tests/test_subset_plan.py Fixed
@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026
A/B

```
CURRENT: optimize-device-placement                MAIN: d5581070
git 8bb638bd                                      git d5581070
split pools: yes                                  split pools: no
dense pool: cuda:0                                legacy balanced over all visible GPUs
moe pool:   cuda:1,2,3,4,5
routing: all experts
GIL disabled: yes                                 GIL disabled: yes

Timing
------
quant wall          461.13s                       461.66s
pre-quant fwd         6.01s                        13.85s
process quant        97.68s                       144.49s
post-quant replay    18.35s                         5.45s
submodule finalize 2845.35s                      3011.02s

Delta vs main
-------------
wall time           -0.12%
pre-quant fwd      -56.60%
process quant      -32.39%
post-quant replay +236.93%
submodule finalize  -5.50%

VRAM
----
Final reserved VRAM after layer 1
CURRENT                                  MAIN
GPU0 4.201 GiB ####                      GPU0 2.061 GiB ##
GPU1 0.369 GiB .                         GPU1 9.783 GiB ##########
GPU2 0.436 GiB .                         GPU2 9.943 GiB ##########
GPU3 0.359 GiB .                         GPU3 9.766 GiB ##########
GPU4 0.379 GiB .                         GPU4 9.791 GiB ##########
GPU5 0.377 GiB .                         GPU5 9.822 GiB ##########

overall spread   3.842 GiB               overall spread   7.883 GiB
moe-pool spread  0.076 GiB               moe-pool spread  0.178 GiB
```
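For readers reproducing these numbers, the "spread" rows read as max-minus-min reserved VRAM across a device pool. A minimal sketch of that metric, assuming it is derived from torch.cuda.memory_reserved:

```python
import torch

def reserved_spread_gib(devices: list) -> float:
    """Max-minus-min reserved VRAM (GiB) across a pool of devices."""
    reserved = [torch.cuda.memory_reserved(torch.device(d)) for d in devices]
    return (max(reserved) - min(reserved)) / (1024 ** 3)

# Example: overall spread vs the MoE-pool-only spread from the table above.
overall = reserved_spread_gib([f"cuda:{i}" for i in range(6)])
moe_pool = reserved_spread_gib([f"cuda:{i}" for i in range(1, 6)])
```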

@Qubitium Qubitium marked this pull request as ready for review April 12, 2026 15:17
@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

@avtc When you have time, please test this PR to see if my numbers are reflected in real-world use. Current lab testing shows 0% overall speed degradation (faster in some parts, slower in others) but a huge VRAM improvement for the MoE layers.

@Qubitium Qubitium merged commit 1cd5ff1 into main Apr 13, 2026
6 checks passed
@Qubitium Qubitium deleted the optimize-device-placement branch April 13, 2026 07:44


Development

Successfully merging this pull request may close these issues:

- Module placement for BALANCED VramStrategy flawed (#2326)
