
Conversation

@Qubitium (Collaborator) commented Oct 16, 2025

@nbasyl During the forward-hook EoRA accumulation, the EoRA code is now synced with the GPTQ code path: accumulation is done per GPU and the partial results are merged at the end. A test was added and the atol diff is around ~3e-6, so I think it's good to use.

assert torch.allclose(full_xtx, chunked_xtx, atol=5e-6, rtol=5e-6)
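A minimal sketch of the per-device accumulation pattern described above (this is not the GPTQModel implementation; it assumes simple 2-D activation batches and synchronous execution): each GPU builds its own partial X^T X from the calibration activations routed to it, and the partial sums are merged onto one device at the end.

```python
import torch

def accumulate_xtx(chunks, devices):
    # chunks: list of [n_samples, in_features] activation batches
    # devices: the device each batch should be processed on
    partials = {}
    for x, dev in zip(chunks, devices):
        x = x.to(dev)
        xtx = x.T @ x  # per-device partial X^T X
        partials[dev] = xtx if dev not in partials else partials[dev] + xtx
    # merge all partial sums onto the first device
    return sum(p.to(devices[0]) for p in partials.values())

# sanity check mirroring the assert above (float64 here just to keep the
# toy example deterministic on CPU)
x = torch.randn(1024, 256, dtype=torch.float64)
full_xtx = x.T @ x
chunked_xtx = accumulate_xtx(list(x.split(256)), ["cpu"] * 4)
assert torch.allclose(full_xtx, chunked_xtx, atol=5e-6, rtol=5e-6)
```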

@nbasyl (Collaborator) commented Oct 16, 2025

This is awesome, appreciate the effort!

@Qubitium (Collaborator, Author) commented Oct 16, 2025

> This is awesome, appreciate the effort!

I am going to merge this for now. There is a slight regression in EoRA quality; I'm not sure whether this PR is at fault or another commit from the last 24 hours. I will backtrack and fix it in another PR. Once the EoRA regression (a slight quality drop where there should be a slight quality uplift) is resolved, EoRA will finally join the lower-VRAM, multi-GPU data-parallel quantization path. =)

@Qubitium merged commit 1a4f84b into main on Oct 16, 2025 (5 checks passed).
@Qubitium deleted the eora-optimize branch on October 16, 2025 at 08:22.
@Qubitium (Collaborator, Author) commented Oct 17, 2025

@nbasyl There appears to be no regression after all; the difference comes from the way I changed the test_quant_and_eora.py parameters.

Changes:

  1. act_group_aware is now True: Intel contributed GPTQ math adjustments that do group-size-aware activation reordering. I made it 16k+ times faster (not a typo), so it is now the default since there is zero downside, even for quant time. It has become the default on the 5.0 main branch because tests have shown it measurably improves post-quant recovery without using the desc_act trick, which slows down inference on all kernels.
  2. desc_act is now False: required for act_group_aware.
  3. The marlin kernel is used for lm-eval. Marlin has a very small deviation (measured by test_kernel_outputs) from the reference torch and triton kernels.
  4. The dataset is a different one, but this shouldn't matter since the dataset itself is a static environment. A rough config sketch of these settings follows.
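The sketch below is not a copy of test_quant_and_eora.py; it assumes act_group_aware is a plain QuantizeConfig field alongside desc_act (as the parameter names above suggest), and the model id and calibration set are placeholders. Check the shipped test for the exact signatures.

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    act_group_aware=True,   # item 1: group-size-aware activation reordering
    desc_act=False,         # item 2: must stay off when act_group_aware is on
)
# item 3 (marlin kernel for lm-eval) is a load-time backend choice, not shown here;
# the EoRA adapter setup from the test is also omitted.

calibration_dataset = ["a handful of plain-text calibration samples ..."]  # item 4: placeholder
model = GPTQModel.load("meta-llama/Llama-3.2-1B", quant_config)            # illustrative model id
model.quantize(calibration_dataset)
```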

I think act_group_aware is not playing too well with EoRA for unknown reasons. Here are two quick tests below. I did not run MMLU since it's so slow and my GPU is too slow as well. If you can run main and test_quant_and_eora with MMLU, that would be great, to see what granular effect act_group_aware is having on post-GPTQ EoRA.

# act_group_aware = True + desc_act = False
--------Eval METHOD.GPTQ Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3148|±  |0.0136|
|             |       |none  |     0|acc_norm|   |0.3370|±  |0.0138|

--------Eval METHOD.GPTQ + EoRA Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3123|±  |0.0135|
|             |       |none  |     0|acc_norm|   |0.3481|±  |0.0139|

# act_group_aware = False + desc_act = False
--------Eval METHOD.GPTQ Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3046|±  |0.0134|
|             |       |none  |     0|acc_norm|   |0.3404|±  |0.0138|

--------Eval METHOD.GPTQ + EoRA Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3166|±  |0.0136|
|             |       |none  |     0|acc_norm|   |0.3447|±  |0.0139|

Notice how act_group_aware has a positive effect on its own, but when it is enabled, EoRA's positive effect drifts slightly negative. Again, this is only a small ARC Challenge test, which is OK (fast) but not the best signal. Please run this on your faster H100/B200 and enable the MMLU benchmarks to see if you can reproduce the above, or whether what I am seeing is isolated to the ARC scores.
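For reference, here are the EoRA deltas implied by the two runs, computed straight from the table values above (all of them sit inside the reported ~0.013-0.014 stderr, which is part of why a larger benchmark like MMLU is needed):

```python
# Deltas implied by the tables above (values copied verbatim from the runs).
runs = {
    "act_group_aware=True":  {"gptq": (0.3148, 0.3370), "gptq_eora": (0.3123, 0.3481)},
    "act_group_aware=False": {"gptq": (0.3046, 0.3404), "gptq_eora": (0.3166, 0.3447)},
}
for name, r in runs.items():
    d_acc = r["gptq_eora"][0] - r["gptq"][0]
    d_norm = r["gptq_eora"][1] - r["gptq"][1]
    print(f"{name}: EoRA delta acc={d_acc:+.4f}, acc_norm={d_norm:+.4f}")
# act_group_aware=True:  acc -0.0025, acc_norm +0.0111
# act_group_aware=False: acc +0.0120, acc_norm +0.0043
```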

Also, the reference torch kernel is about 3-4x faster on main since it now optimistically uses the triton dequant instead of the slow torch dequant, with no accuracy drop, because dequant doesn't do any matmul and is only simple math.
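For illustration only (this is not GPTQModel's kernel code, and the shapes/4-bit layout below are assumptions): dequant is just per-group scale/zero arithmetic on the weights, and the matmul afterwards is unchanged, which is why swapping the dequant backend affects speed but not accuracy.

```python
import torch

def dequantize(qweight, scales, zeros, group_size=128):
    # qweight: [in_features, out_features] unpacked int weights (0..15 for 4-bit)
    # scales, zeros: [in_features // group_size, out_features] per-group params
    groups = torch.arange(qweight.shape[0]) // group_size  # group index per input row
    return (qweight.float() - zeros[groups].float()) * scales[groups]

in_f, out_f, gs = 256, 64, 128
qweight = torch.randint(0, 16, (in_f, out_f))
scales = torch.rand(in_f // gs, out_f) * 0.01
zeros = torch.full((in_f // gs, out_f), 8)
x = torch.randn(8, in_f)
y = x @ dequantize(qweight, scales, zeros, gs)  # the matmul itself is plain torch
```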

@nbasyl (Collaborator) commented Oct 17, 2025

Hi @Qubitium, thanks for the update! I’ll help run the MMLU test over the weekend. I have a quick question though — if the results show that EoRA + act_group_aware still degrade MMLU performance, how do we plan to address that? Do you think it’s more of an engineering issue or a methodological one? I’m asking since I’m not very familiar with GAR.

@Qubitium (Collaborator, Author) replied:

> Hi @Qubitium, thanks for the update! I’ll help run the MMLU test over the weekend. I have a quick question though — if the results show that EoRA + act_group_aware still degrade MMLU performance, how do we plan to address that? Do you think it’s more of an engineering issue or a methodological one? I’m asking since I’m not very familiar with GAR.

Make sure to run with the latest main. There were several bug fixes (multi-GPU + nogil) as well as more-deterministic-output fixes.

I have not tested the full range of lm-eval tests beyond ARC since my GPU is too slow to run them every time I make a small change. ARC is fast, so that's what I use. But I also know that ARC is not the best test since it can overfit, meaning a lower or higher ARC score doesn't really tell you whether you will score better on the more accurate tests such as MMLU or GSM8K. Both MMLU and GSM8K are very sensitive to quantization, whereas ARC appears to be not that sensitive.
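A hedged sketch of running the broader benchmarks via lm-eval's Python API (assuming lm-eval >= 0.4; the model path is a placeholder):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/quantized-eora-model",  # placeholder path
    tasks=["arc_challenge", "mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])  # per-task acc / acc_norm / stderr, as in the tables above
```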

Maybe once we get a full run of how act_group_aware affects scores we can go from there. Right now it is just speculation that EoRA and act_group_aware are cancelling each other out.

@nbasyl (Collaborator) commented Oct 24, 2025

Hi @Qubitium, apologies for the late response — I was completely swamped last week. I finally have some time to run the experiment, but I’m running into issues installing the latest version of GPTQModel. Do you happen to have a Docker image I could use directly?

@Qubitium (Collaborator, Author) replied:

@nbasyl I just released 5.0 to PyPI last night with wheels for PyTorch 2.8, 2.9, and 3.0. Can you install directly from PyPI?

pip install -U gptqmodel --no-build-isolation

If you get install errors, can you post them so we can see what's wrong? Unfortunately we don't have a Docker image created, but we should.
