feat: moe-router-bypass-batch-size (#2349)
Conversation
@avtc Do you have some rough numbers for the VRAM saving in your setup with and without the PR?
@Qubitium Hi. Without batching experts, the forward of the MoE up/gate modules on the … With …
Looks good! We will work on this PR after a newly …
Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
…ng object Signed-off-by: ZX-ModelCloud <zx@modelcloud.ai>
@avtc 5.7.0 took much longer to push out due to small bugs we kept finding/fixing in CI regression tests. It should be out soon, so this can be merged.
@Qubitium Hi, this feature adds the ability to process MoE expert weights in chunks in MoeRoutingBypass mode, allowing large MoE models to be quantized with less VRAM and fewer devices.
Currently, GPTQ creates a Hessian accumulator for each expert weight during the forward pass; for GLM-4.5-Air, for example, this is around 17 GB for the up/gate MoE modules of a single layer. Processing expert weights in chunks keeps fewer Hessian accumulator matrices alive at once, and thus requires less VRAM.
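The memory argument above can be sketched numerically. This is a simplified illustration, not the PR's implementation: the expert count, sizes, and the `hessian` helper are made up, and the GPTQ scaling factors are omitted. The point is only that accumulating per-expert Hessians in chunks yields the same matrices while holding fewer of them in memory at a time.

```python
import numpy as np

hidden = 8        # toy hidden size (real models: thousands)
num_experts = 6   # toy expert count
chunk = 2         # experts processed per chunk
rng = np.random.default_rng(0)

# per-expert calibration activations captured during the forward pass
acts = [rng.standard_normal((16, hidden)) for _ in range(num_experts)]

def hessian(x):
    # GPTQ-style accumulator: proportional to X^T X (scale factors omitted)
    return x.T @ x

# all experts at once: num_experts accumulators alive simultaneously
full = [hessian(x) for x in acts]

# chunked: only `chunk` accumulators alive at a time; in practice each
# Hessian would be consumed by quantization and freed before the next chunk
chunked = []
for i in range(0, num_experts, chunk):
    chunked.extend(hessian(x) for x in acts[i:i + chunk])

# identical results, lower peak accumulator memory
assert all(np.allclose(a, b) for a, b in zip(full, chunked))
```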
Example option usage:
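The example in the original comment was cut off; the following is a hypothetical sketch only. The option name `moe_router_bypass_batch_size` is guessed from this PR's title and is not a confirmed GPTQModel parameter, and `calibration_dataset` is a placeholder; the `GPTQModel.load`/`quantize` call shape follows the library's usual pattern.

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    # ASSUMED kwarg, named after this PR: process MoE expert weights
    # in chunks of 8 while router bypass is active
    moe_router_bypass_batch_size=8,
)

model = GPTQModel.load("zai-org/GLM-4.5-Air", quant_config)
model.quantize(calibration_dataset)  # calibration_dataset: placeholder
```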