
Conversation

@Qubitium (Collaborator) commented Oct 16, 2025

@nbasyl During the forward-hook EoRA accumulation, the EoRA code is now synced with the GPTQ code path: accumulation is done per GPU and the partial results are merged at the end. A test was added and the atol diff is around ~3e-6, so I think it's good to use.

assert torch.allclose(full_xtx, chunked_xtx, atol=5e-6, rtol=5e-6)
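A minimal sketch of the per-device accumulation pattern described above (this is not the GPTQModel implementation; it assumes simple 2-D activation batches and synchronous execution): each GPU builds its own partial X^T X from the calibration activations routed to it, and the partial sums are merged onto one device at the end.

```python
import torch

def accumulate_xtx(chunks, devices):
    # chunks: list of [n_samples, in_features] activation batches
    # devices: the device each batch should be processed on
    partials = {}
    for x, dev in zip(chunks, devices):
        x = x.to(dev)
        xtx = x.T @ x  # per-device partial X^T X
        partials[dev] = xtx if dev not in partials else partials[dev] + xtx
    # merge all partial sums onto the first device
    return sum(p.to(devices[0]) for p in partials.values())

# sanity check mirroring the assert above (float64 here just to keep the
# toy example deterministic on CPU)
x = torch.randn(1024, 256, dtype=torch.float64)
full_xtx = x.T @ x
chunked_xtx = accumulate_xtx(list(x.split(256)), ["cpu"] * 4)
assert torch.allclose(full_xtx, chunked_xtx, atol=5e-6, rtol=5e-6)
```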

@nbasyl (Collaborator) commented Oct 16, 2025

This is awesome, appreciate the effort!

@Qubitium (Collaborator, Author) commented Oct 16, 2025

> This is awesome, appreciate the effort!

I am going to merge this for now. There is a slight regression in EoRA quality; I'm not sure whether this PR is at fault or another commit from the last 24 hours. I will backtrack and fix it in another PR. Once the EoRA regression (a slight quality drop where there should be a slight quality uplift) is resolved, EoRA will finally join the lower-VRAM, multi-GPU data-parallel quantization path. =)

@Qubitium merged commit 1a4f84b into main on Oct 16, 2025 (5 checks passed).
@Qubitium deleted the eora-optimize branch on October 16, 2025 at 08:22.
@Qubitium (Collaborator, Author) commented Oct 17, 2025

@nbasyl There appears to be no regression after all; the difference comes from the way I changed the test_quant_and_eora.py parameters.

Changes:

  1. act_group_aware is now True: Intel contributed GPTQ math adjustments that do group-size-aware activation reordering. I made it 16k+ times faster (not a typo), so it is now the default since there is zero downside, even for quant time. It has become the default on the 5.0 main branch because tests have shown it measurably improves post-quant recovery without using the desc_act trick, which slows down inference on all kernels.
  2. desc_act is now False: required for act_group_aware.
  3. The marlin kernel is used for lm-eval. Marlin has a very small deviation (measured by test_kernel_outputs) from the reference torch and triton kernels.
  4. The dataset is a different one, but this shouldn't matter since the dataset itself is a static environment. A rough config sketch of these settings follows.
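The sketch below is not a copy of test_quant_and_eora.py; it assumes act_group_aware is a plain QuantizeConfig field alongside desc_act (as the parameter names above suggest), and the model id and calibration set are placeholders. Check the shipped test for the exact signatures.

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    act_group_aware=True,   # item 1: group-size-aware activation reordering
    desc_act=False,         # item 2: must stay off when act_group_aware is on
)
# item 3 (marlin kernel for lm-eval) is a load-time backend choice, not shown here;
# the EoRA adapter setup from the test is also omitted.

calibration_dataset = ["a handful of plain-text calibration samples ..."]  # item 4: placeholder
model = GPTQModel.load("meta-llama/Llama-3.2-1B", quant_config)            # illustrative model id
model.quantize(calibration_dataset)
```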

I think act_group_aware is not playing too well with EoRA for unknown reasons. Here are two quick tests below. I did not run MMLU since it's so slow and my GPU is too slow as well. If you can run main and test_quant_and_eora with MMLU, that would be great, to see what granular effect act_group_aware is having on post-GPTQ EoRA.

# act_group_aware = True + desc_act = False
--------Eval METHOD.GPTQ Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3148|±  |0.0136|
|             |       |none  |     0|acc_norm|   |0.3370|±  |0.0138|

--------Eval METHOD.GPTQ + EoRA Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3123|±  |0.0135|
|             |       |none  |     0|acc_norm|   |0.3481|±  |0.0139|

# act_group_aware = False + desc_act = False
--------Eval METHOD.GPTQ Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3046|±  |0.0134|
|             |       |none  |     0|acc_norm|   |0.3404|±  |0.0138|

--------Eval METHOD.GPTQ + EoRA Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |   |0.3166|±  |0.0136|
|             |       |none  |     0|acc_norm|   |0.3447|±  |0.0139|

Notice how act_group_aware has a positive effect on its own, but when it is enabled, EoRA's positive effect drifts slightly negative. Again, this is only a small ARC Challenge test, which is OK (fast) but not the best signal. Please run this on your faster H100/B200 and enable the MMLU benchmarks to see if you can reproduce the above, or whether what I am seeing is isolated to the ARC scores.
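For reference, here are the EoRA deltas implied by the two runs, computed straight from the table values above (all of them sit inside the reported ~0.013-0.014 stderr, which is part of why a larger benchmark like MMLU is needed):

```python
# Deltas implied by the tables above (values copied verbatim from the runs).
runs = {
    "act_group_aware=True":  {"gptq": (0.3148, 0.3370), "gptq_eora": (0.3123, 0.3481)},
    "act_group_aware=False": {"gptq": (0.3046, 0.3404), "gptq_eora": (0.3166, 0.3447)},
}
for name, r in runs.items():
    d_acc = r["gptq_eora"][0] - r["gptq"][0]
    d_norm = r["gptq_eora"][1] - r["gptq"][1]
    print(f"{name}: EoRA delta acc={d_acc:+.4f}, acc_norm={d_norm:+.4f}")
# act_group_aware=True:  acc -0.0025, acc_norm +0.0111
# act_group_aware=False: acc +0.0120, acc_norm +0.0043
```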

Also, the reference torch kernel is about 3-4x faster on main since it now optimistically uses the triton dequant instead of the slow torch dequant, with no accuracy drop, because dequant doesn't do any matmul and is only simple math.
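For illustration only (this is not GPTQModel's kernel code, and the shapes/4-bit layout below are assumptions): dequant is just per-group scale/zero arithmetic on the weights, and the matmul afterwards is unchanged, which is why swapping the dequant backend affects speed but not accuracy.

```python
import torch

def dequantize(qweight, scales, zeros, group_size=128):
    # qweight: [in_features, out_features] unpacked int weights (0..15 for 4-bit)
    # scales, zeros: [in_features // group_size, out_features] per-group params
    groups = torch.arange(qweight.shape[0]) // group_size  # group index per input row
    return (qweight.float() - zeros[groups].float()) * scales[groups]

in_f, out_f, gs = 256, 64, 128
qweight = torch.randint(0, 16, (in_f, out_f))
scales = torch.rand(in_f // gs, out_f) * 0.01
zeros = torch.full((in_f // gs, out_f), 8)
x = torch.randn(8, in_f)
y = x @ dequantize(qweight, scales, zeros, gs)  # the matmul itself is plain torch
```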

@nbasyl (Collaborator) commented Oct 17, 2025

Hi @Qubitium, thanks for the update! I’ll help run the MMLU test over the weekend. I have a quick question though — if the results show that EoRA + act_group_aware still degrade MMLU performance, how do we plan to address that? Do you think it’s more of an engineering issue or a methodological one? I’m asking since I’m not very familiar with GAR.

@Qubitium (Collaborator, Author) replied:

> Hi @Qubitium, thanks for the update! I’ll help run the MMLU test over the weekend. I have a quick question though — if the results show that EoRA + act_group_aware still degrade MMLU performance, how do we plan to address that? Do you think it’s more of an engineering issue or a methodological one? I’m asking since I’m not very familiar with GAR.

Make sure to run with the latest main. There were several bug fixes (multi-GPU + nogil) as well as more-deterministic-output fixes.

I have not tested the full range of lm-eval tests beyond ARC since my GPU is too slow to run them every time I make a small change. ARC is fast, so that's what I use. But I also know that ARC is not the best test since it can overfit, meaning a lower or higher ARC score doesn't really tell you whether you will score better on the more accurate tests such as MMLU or GSM8K. Both MMLU and GSM8K are very sensitive to quantization, whereas ARC appears to be not that sensitive.
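A hedged sketch of running the broader benchmarks via lm-eval's Python API (assuming lm-eval >= 0.4; the model path is a placeholder):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/quantized-eora-model",  # placeholder path
    tasks=["arc_challenge", "mmlu", "gsm8k"],
    batch_size=8,
)
print(results["results"])  # per-task acc / acc_norm / stderr, as in the tables above
```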

Maybe once we get a full run of how act_group_aware affects scores we can go from there. Right now it is just speculation that EoRA and act_group_aware are cancelling each other out.

@nbasyl (Collaborator) commented Oct 24, 2025

Hi @Qubitium, apologies for the late response — I was completely swamped last week. I finally have some time to run the experiment, but I’m running into issues installing the latest version of GPTQModel. Do you happen to have a Docker image I could use directly?

@Qubitium (Collaborator, Author) replied:

@nbasyl I just released 5.0 to PyPI last night with wheels for PyTorch 2.8, 2.9, and 3.0. Can you install directly from PyPI?

pip install -U gptqmodel --no-build-isolation

If you get install errors, can you post them so we can see what's wrong? Unfortunately we don't have a Docker image created, but we should.
