
Fix balanced MoE vram usage #2716

Merged
Qubitium merged 6 commits into main from optimize-device-placement on Apr 13, 2026

Conversation

@Qubitium Qubitium (Collaborator) commented Apr 12, 2026

Fix: #2326

@Qubitium Qubitium marked this pull request as draft April 12, 2026 11:09
4 comment threads on scripts/benchmark_qwen35_moe_ab.py resolved (Fixed)
@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

@avtc I will be doing heavy experiments here to resolve the device placement issue for VramStrategy. Right now my Qwen 3.5 MoE run using the new DENSE_HOME_MOE_BALANCED with 3 GPUs (first GPU for dense modules only, the rest balancing MoE experts) is showing massive VRAM savings at the expense of about 5% quantization time.

Not done yet.

Current DENSE_HOME_MOE_BALANCED vs origin/main BALANCED:

- pre_quant_forward_s: 2.021 vs 6.664
- process_quant_s: 47.553 vs 65.732
- final layer reserved spread: 0.098 GiB vs 8.854 GiB
- final layer peak reserved spread: 1.311 GiB vs 8.115 GiB
- coarse quant_wall_s: 422.186 vs 393.369 in that concurrent pair
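To make the placement rule concrete, here is a minimal sketch of the dense-home / MoE-balanced idea: dense modules pinned to one "home" device, experts round-robined over the rest. The function name and module lists are hypothetical; this is not GPTQModel's internal implementation.

```python
from typing import Dict, List

def dense_home_moe_balanced(
    dense_modules: List[str],
    expert_modules: List[str],
    devices: List[str],
) -> Dict[str, str]:
    """Pin dense modules to the first device ('home'); round-robin
    experts over the remaining devices so MoE weights stay balanced."""
    home, moe_pool = devices[0], devices[1:]
    placement = {name: home for name in dense_modules}
    for i, name in enumerate(expert_modules):
        placement[name] = moe_pool[i % len(moe_pool)]
    return placement

# Example: dense attention stays on cuda:0, 8 experts spread over cuda:1-2.
plan = dense_home_moe_balanced(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
    [f"mlp.experts.{i}" for i in range(8)],
    ["cuda:0", "cuda:1", "cuda:2"],
)
```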

@Qubitium Qubitium changed the title Fix balanced MoE forward device leakage Fix balanced MoE vram usage Apr 12, 2026
@avtc avtc (Contributor) commented Apr 12, 2026

As an idea: for models with large dense layers, the dense weights (i.e. q, k, v, o) could also be spread across GPUs. This should be possible since weights are already patched to move input params to the weight's GPU. It could help quantize models with abnormally large dense weights on consumer GPUs.
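For context, the "move input params to the weight's GPU" patch avtc refers to can be approximated with a standard PyTorch forward pre-hook. A hedged sketch, not the repo's actual patch:

```python
import torch
from torch import nn

def move_inputs_to_weight_device(module: nn.Module, args):
    """Forward pre-hook: relocate tensor inputs onto the device that
    holds this module's weight, so a layer spread to another GPU still
    receives correctly placed activations."""
    device = module.weight.device
    return tuple(
        a.to(device) if isinstance(a, torch.Tensor) else a for a in args
    )

# Example: a q_proj-style linear that could live on a secondary GPU.
layer = nn.Linear(4096, 4096)
layer.register_forward_pre_hook(move_inputs_to_weight_device)
```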

@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

> As an idea: for models with large dense layers, the dense weights (i.e. q, k, v, o) could also be spread across GPUs. This should be possible since weights are already patched to move input params to the weight's GPU. It could help quantize models with abnormally large dense weights on consumer GPUs.

Are you reading my mind? It's already designed, working, in testing, and will be in the next commit.

@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

Dense and MoE VRAM strategies are now split and controlled separately via per-pool device mapping, for fine-grained control and flexibility.

```python
from gptqmodel import GPTQModel
from gptqmodel.quantization import QuantizeConfig, FORMAT, METHOD
from gptqmodel.quantization.config import VramStrategy

qcfg = QuantizeConfig(
    quant_method=METHOD.GPTQ,
    format=FORMAT.GPTQ,
    bits=4,
    group_size=128,

    # Dense pool: use first 2 visible GPUs
    dense_vram_strategy=VramStrategy.BALANCED,
    dense_vram_strategy_devices=["cuda:0", "cuda:1"],

    # MoE pool: use last 4 visible GPUs
    moe_vram_strategy=VramStrategy.BALANCED,
    moe_vram_strategy_devices=["cuda:2", "cuda:3", "cuda:4", "cuda:5"],
)
```
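A hedged end-to-end usage sketch with this config: the model id, calibration texts, and output path below are placeholders, while the load/quantize/save calls follow GPTQModel's public API.

```python
# Placeholder calibration set, for illustration only.
calibration = ["gptqmodel is an llm quantization toolkit."] * 256

model = GPTQModel.load("Qwen/Qwen3.5-MoE", qcfg)  # hypothetical model id
model.quantize(calibration)
model.save("qwen35-moe-gptq-4bit")                # placeholder output path
```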
A/B vs origin/main:

- quant_wall_s: 431.571s current vs 437.933s baseline, -1.45%
- pre_quant_forward_s: 2.024s vs 5.421s, -62.66%
- process_quant_s: 54.190s vs 58.488s, -7.35%
- post_quant_forward_s: 5.866s vs 4.927s, +19.07%
- Final reserved VRAM spread: 0.098 GiB vs 8.818 GiB, -98.89%
- Final peak reserved spread: 1.295 GiB vs 8.123 GiB, -84.06%

2 comment threads on scripts/benchmark_qwen35_moe_ab.py resolved (Fixed)
@Qubitium Qubitium self-assigned this Apr 12, 2026
Comment thread tests/test_stage_modules.py Fixed
Comment thread tests/test_subset_plan.py Fixed
@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026
A/B

```
CURRENT: optimize-device-placement                MAIN: d5581070
git 8bb638bd                                      git d5581070
split pools: yes                                  split pools: no
dense pool: cuda:0                                legacy balanced over all visible GPUs
moe pool:   cuda:1,2,3,4,5
routing: all experts
GIL disabled: yes                                 GIL disabled: yes

Timing
------
quant wall          461.13s                       461.66s
pre-quant fwd         6.01s                        13.85s
process quant        97.68s                       144.49s
post-quant replay    18.35s                         5.45s
submodule finalize 2845.35s                      3011.02s

Delta vs main
-------------
wall time           -0.12%
pre-quant fwd      -56.60%
process quant      -32.39%
post-quant replay +236.93%
submodule finalize  -5.50%

VRAM
----
Final reserved VRAM after layer 1
CURRENT                                  MAIN
GPU0 4.201 GiB ####                      GPU0 2.061 GiB ##
GPU1 0.369 GiB .                         GPU1 9.783 GiB ##########
GPU2 0.436 GiB .                         GPU2 9.943 GiB ##########
GPU3 0.359 GiB .                         GPU3 9.766 GiB ##########
GPU4 0.379 GiB .                         GPU4 9.791 GiB ##########
GPU5 0.377 GiB .                         GPU5 9.822 GiB ##########

overall spread   3.842 GiB               overall spread   7.883 GiB
moe-pool spread  0.076 GiB               moe-pool spread  0.178 GiB
```
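For readers reproducing these numbers, the "spread" rows read as max-minus-min reserved VRAM across a device pool. A minimal sketch of that metric, assuming it is derived from torch.cuda.memory_reserved:

```python
import torch

def reserved_spread_gib(devices: list) -> float:
    """Max-minus-min reserved VRAM (GiB) across a pool of devices."""
    reserved = [torch.cuda.memory_reserved(torch.device(d)) for d in devices]
    return (max(reserved) - min(reserved)) / (1024 ** 3)

# Example: overall spread vs the MoE-pool-only spread from the table above.
overall = reserved_spread_gib([f"cuda:{i}" for i in range(6)])
moe_pool = reserved_spread_gib([f"cuda:{i}" for i in range(1, 6)])
```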

@Qubitium Qubitium marked this pull request as ready for review April 12, 2026 15:17
@Qubitium Qubitium (Collaborator, Author) commented Apr 12, 2026

@avtc When you have time, please test this PR to see if my numbers are reflected in real-world use. Current lab testing shows 0% overall speed degradation (faster in some parts, slower in others) but a huge VRAM improvement for the MoE layers.

@Qubitium Qubitium merged commit 1cd5ff1 into main Apr 13, 2026
6 checks passed
@Qubitium Qubitium deleted the optimize-device-placement branch April 13, 2026 07:44


Development

Successfully merging this pull request may close these issues:

- Module placement for BALANCED VramStrategy flawed (#2326)
