@avtc I will be doing heavy experiments here to resolve the device placement issue. Not done yet.

As an idea, I can suggest, for models with large dense layers, spreading even the dense weights (q, k, v, o) across GPUs; this should be possible since weights are already patched to move input params to the weight's GPU. This could help quantize models with abnormally large dense weights on consumer GPUs.
Are you reading my mind? It's already designed, working, in testing, and in the next commit.
Dense and MoE VRAM strategies are now split and controlled via device mapping for fine-grained control and flexibility:

```python
from gptqmodel import GPTQModel
from gptqmodel.quantization import QuantizeConfig, FORMAT, METHOD
from gptqmodel.quantization.config import VramStrategy

qcfg = QuantizeConfig(
    quant_method=METHOD.GPTQ,
    format=FORMAT.GPTQ,
    bits=4,
    group_size=128,
    # Dense pool: use first 2 visible GPUs
    dense_vram_strategy=VramStrategy.BALANCED,
    dense_vram_strategy_devices=["cuda:0", "cuda:1"],
    # MoE pool: use last 4 visible GPUs
    moe_vram_strategy=VramStrategy.BALANCED,
    moe_vram_strategy_devices=["cuda:2", "cuda:3", "cuda:4", "cuda:5"],
)
```

- quant_wall_s: 431.571s current vs 437.933s baseline, -1.45%
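The two-pool balanced placement above can be sketched as plain round-robin assignment of modules over each pool's devices. This is only an illustrative sketch; `assign_balanced` and the module names are hypothetical helpers, not the actual GPTQModel implementation or API.

```python
# Hypothetical sketch of per-pool balanced device assignment.
# Not the GPTQModel internals; module names are illustrative.

def assign_balanced(modules, devices):
    """Round-robin modules across a device pool so VRAM stays roughly even."""
    return {m: devices[i % len(devices)] for i, m in enumerate(modules)}

dense_pool = ["cuda:0", "cuda:1"]
moe_pool = ["cuda:2", "cuda:3", "cuda:4", "cuda:5"]

# Dense weights (q, k, v, o) spread over the dense pool only.
dense_map = assign_balanced(["q_proj", "k_proj", "v_proj", "o_proj"], dense_pool)
# Expert weights spread over the MoE pool only.
moe_map = assign_balanced([f"experts.{i}" for i in range(8)], moe_pool)

print(dense_map["q_proj"])  # cuda:0
print(dense_map["k_proj"])  # cuda:1
```

The point of the split is that dense and MoE modules never compete for the same pool, which is what keeps the per-pool spread small in the numbers below.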
- pre_quant_forward_s: 2.024s vs 5.421s, -62.66%
- process_quant_s: 54.190s vs 58.488s, -7.35%
- post_quant_forward_s: 5.866s vs 4.927s, +19.07%
- Final reserved VRAM spread: 0.098 GiB vs 8.818 GiB, -98.89%
- Final peak reserved spread: 1.295 GiB vs 8.123 GiB, -84.06%
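The "spread" figures above are the max-minus-min reserved VRAM across the GPUs in a pool. A minimal sketch of that metric (the per-GPU numbers below are illustrative, not values from the run):

```python
def vram_spread(reserved_gib):
    """Spread = max - min reserved VRAM (GiB) across a GPU pool."""
    return max(reserved_gib) - min(reserved_gib)

# Illustrative per-GPU reserved VRAM in GiB, not measured values.
current = [0.40, 0.37, 0.44, 0.36, 0.38, 0.41]
baseline = [2.06, 9.78, 9.94, 9.77, 9.79, 9.82]

print(round(vram_spread(current), 2))   # 0.08
print(round(vram_spread(baseline), 2))  # 7.88
```

A small spread means the pool's GPUs hold nearly equal amounts of reserved memory, so no single device becomes the VRAM bottleneck.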
A/B

```
CURRENT: optimize-device-placement       MAIN: d5581070
git 8bb638bd                             git d5581070
split pools: yes                         split pools: no
dense pool: cuda:0                       legacy balanced over all visible GPUs
moe pool: cuda:1,2,3,4,5
routing: all experts
GIL disabled: yes                        GIL disabled: yes

Timing
------
quant wall           461.13s     461.66s
pre-quant fwd          6.01s      13.85s
process quant         97.68s     144.49s
post-quant replay     18.35s       5.45s
submodule finalize  2845.35s    3011.02s

Delta vs main
-------------
wall time            -0.12%
pre-quant fwd       -56.60%
process quant       -32.39%
post-quant replay  +236.93%
submodule finalize   -5.50%

VRAM
Final reserved VRAM after layer 1
CURRENT                          MAIN
GPU0 4.201 GiB ####              GPU0 2.061 GiB ##
GPU1 0.369 GiB .                 GPU1 9.783 GiB ##########
GPU2 0.436 GiB .                 GPU2 9.943 GiB ##########
GPU3 0.359 GiB .                 GPU3 9.766 GiB ##########
GPU4 0.379 GiB .                 GPU4 9.791 GiB ##########
GPU5 0.377 GiB .                 GPU5 9.822 GiB ##########
overall spread 3.842 GiB         overall spread 7.883 GiB
moe-pool spread 0.076 GiB        moe-pool spread 0.178 GiB
```
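The delta column in the A/B table is plain percent change of the current branch vs main. A one-line sketch; note that recomputing from the rounded timings shown in the table can shift the last digit slightly versus the deltas reported above:

```python
def pct_delta(current, baseline):
    """Percent change of current vs baseline (negative = improvement)."""
    return (current - baseline) / baseline * 100.0

# process quant: 97.68s current vs 144.49s main
print(f"{pct_delta(97.68, 144.49):+.1f}%")  # -32.4%
```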
@avtc When you have time, please test this PR to see if my numbers are reflected in the real world. The current lab testing shows 0% speed degradation overall (it is faster in some parts, slower in others) but a huge VRAM improvement for the MoE layers.
Fix: #2326