fix problems in the FOEM data processing pipeline#2659

Merged
Qubitium merged 1 commit into ModelCloud:main from Xingyu-Zheng:main on Apr 2, 2026
Conversation

@Xingyu-Zheng
Contributor

When testing FOEM on Qwen3.5-35B-A3B, the error caused by reusing GPTAQ’s data processing pipeline occurs even earlier than the previous issue I encountered in gptqmodel/quantization/foem.py. In this case, simply setting alpha = 0 does not resolve the problem.

To address this, I added special handling of alpha in the processor to ensure that FOEM, when used alone, achieves the better generalization consistent with GPTQ.
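A minimal sketch of what such handling might look like. Note that the class and attribute names (`FoemProcessor`, `standalone`, `use_dual_stream`) are hypothetical, not the actual GPTQModel implementation: the idea is that when FOEM runs alone, the GPTAQ dual-stream path is disabled entirely rather than merely zeroing alpha.

```python
# Hypothetical sketch -- names are illustrative, not the actual GPTQModel API.
class FoemProcessor:
    def __init__(self, alpha: float = 0.0, standalone: bool = True):
        # When FOEM runs alone, skip the GPTAQ dual-stream path entirely,
        # rather than only setting alpha = 0, so no mismatched second-stream
        # inputs are ever captured.
        self.alpha = 0.0 if standalone else alpha
        self.use_dual_stream = (not standalone) and alpha != 0.0

    def process(self, x_fp, x_quant=None):
        if not self.use_dual_stream:
            # GPTQ-style single-stream update: ignore the quantized stream.
            return x_fp
        # GPTAQ-style update: blend the two streams, weighted by alpha.
        return x_fp + self.alpha * (x_quant - x_fp)
```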

@Qubitium
Collaborator

Qubitium commented Apr 2, 2026

@Xingyu-Zheng LGTM! Thanks.

@Qubitium Qubitium merged commit 1dfe865 into ModelCloud:main Apr 2, 2026
1 check passed
@Qubitium
Collaborator

Qubitium commented Apr 2, 2026

@Xingyu-Zheng I just remembered why GPTAQ had issues with MoE. Calibration data is fed to the model serially and becomes ordered input to each module, which generates output. GPTAQ processing assumed the captured input arrives in that same order. The problem with MoE routing is that an input of [a, b, c, e] may be seen by an MoE expert module as [b, e], but there was no safe way to match a captured input b in one stream to the captured input b in the other, if that makes any sense. I will check the code again, but this memory just came back about why GPTAQ never worked with MoE. I discovered this as soon as GPTAQ (previously called GPTQ v2) was merged, and after a short discussion with the author we both had no good solutions at that time.
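The mismatch can be seen in a toy example: the same top-k router applied to FP hidden states versus quantization-perturbed ones may send different token subsets to the same expert, so the per-expert captures from the two streams cannot be lined up index by index. NumPy stands in for the real model here; all shapes and the noise scale are illustrative.

```python
import numpy as np

def expert_tokens(hidden, router_w, expert_id, top_k=1):
    """Return indices of tokens routed to `expert_id` under top-k routing."""
    logits = hidden @ router_w                       # (tokens, experts)
    topk = np.argsort(logits, axis=-1)[:, -top_k:]   # per-token expert choices
    return [t for t in range(hidden.shape[0]) if expert_id in topk[t]]

rng = np.random.default_rng(0)
router_w = rng.normal(size=(8, 4))           # 8-dim hidden states, 4 experts
x_fp = rng.normal(size=(5, 8))               # tokens [a, b, c, d, e], FP stream
x_q = x_fp + 0.3 * rng.normal(size=(5, 8))   # same tokens with quantization error

for e in range(4):
    fp_set = expert_tokens(x_fp, router_w, e)
    q_set = expert_tokens(x_q, router_w, e)
    if fp_set != q_set:
        # The two streams disagree on which tokens this expert sees, so the
        # captured inputs for this expert cannot be matched positionally.
        print(f"expert {e}: FP stream sees tokens {fp_set}, quantized sees {q_set}")
```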

@Xingyu-Zheng
Contributor Author

@Qubitium I haven’t studied MoE models in depth, nor have I carefully gone through the implementation details in GPTQModel. However, here is my current hypothesis.

GPTAQ assumes a dual-stream data flow, where one stream corresponds to the FP model and the other to the progressively quantized model. As earlier layers become quantized, the routing decisions in later MoE layers may start to diverge between the two streams. For example, the FP model might route tokens {a, c} to expert 1, while the quantized model routes {b, d, e} to the same expert. As a result, when GPTAQ performs calibration on expert 1, the inputs $X$ and $\tilde{X}$ no longer match in either dimension or semantic meaning.

If this hypothesis is correct, there may be several possible solutions:

  1. Force the quantized branch to follow the FP router decisions, ignoring its own routing outputs. This seems like the most reliable approach, as it ensures strict alignment between the two data streams for each expert. Moreover, since the router itself does not directly modify the token representations, this should not affect semantic propagation.
  2. Disable top-k routing and send all tokens to the same expert (e.g., {a, b, c, d, e} to expert 1). However, I am unsure where expert weighting is applied in different MoE implementations. If the weighting differs between the FP and quantized branches, the representations may still be misaligned even if the dimensions match.
  3. Fallback to GPTQ for MoE layers, while continuing to apply GPTAQ to attention and other global components. This avoids dealing with routing inconsistencies altogether.

I should note that I am not deeply familiar with MoE mechanisms, so these are only preliminary thoughts. I hope they might still provide some useful insights.
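A rough sketch of option 1 (all names and shapes here are illustrative, not GPTQModel's actual API): the FP stream records its top-k routing decisions, and the quantized stream replays them instead of routing on its own logits, so each expert captures inputs for the same token set in both streams.

```python
import numpy as np

def route(hidden, router_w, top_k=1, forced_topk=None):
    """Top-k MoE routing; `forced_topk` overrides this stream's own choices."""
    logits = hidden @ router_w
    if forced_topk is not None:
        return forced_topk  # replay the FP stream's routing decisions
    return np.argsort(logits, axis=-1)[:, -top_k:]

rng = np.random.default_rng(1)
router_w = rng.normal(size=(8, 4))           # 8-dim hidden states, 4 experts
x_fp = rng.normal(size=(5, 8))               # FP stream
x_q = x_fp + 0.5 * rng.normal(size=(5, 8))   # quantized stream with error

fp_topk = route(x_fp, router_w)                     # FP stream routes freely
q_topk = route(x_q, router_w, forced_topk=fp_topk)  # quantized stream follows

# Per-expert token sets now align exactly between the two streams.
assert (fp_topk == q_topk).all()
```

Since the router does not itself transform the token representations, forcing the quantized stream onto the FP routing should only change which expert processes each token, not the captured activations themselves.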
