Scores #2049 (Merged)
README.md: 3 changes (2 additions, 1 deletion)

@@ -17,7 +17,8 @@
 </p>

 ## Latest News
-* 09/30/2025 5.0.0-dev `main`: 👀: New Data Parallel + Multi-GPU + Python 3.13g (PYTHON_GIL=0) equals 80%+ overall quant time reduction of large MoE models va v4.2.5.
+* 10/17/2025 5.0.0-dev `main`: 👀: EoRA now multi-gpu compatible. Fixed both quality stability of multi-gpu quants and vram usage. New LFM and Ling models support.
+* 09/30/2025 5.0.0-dev `main`: 👀: New Data Parallel + Multi-GPU + Python 3.13T (PYTHON_GIL=0) equals 80%+ overall quant time reduction of large MoE models vs v4.2.5.
 * 09/29/2025 5.0.0-dev `main`: 🎉 New Qwen3 Omni model support. AWQ Marlin kernel integrated + many disk offload, threading, and memory usage fixes.
 * 09/24/2025 5.0.0-dev `main`: 🎉 Up to 90% cpu mem saving for large MoE models with faster/inline packing! 26% quant time reduction for Qwen3 MoE! AWQ Marlin kernel added. AWQ Gemm loading bug fixes. `act_group_aware` now faster and auto enabled for GPTQ when `desc_act` is False for higher quality recovery.
 * 09/19/2025 5.0.0-dev `main`: 👀 Cpu memory saving of ~73.5% during quantization stage with new `offload_to_disk` quantization config property default to `True`.
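The 09/30 news item above relies on free-threaded Python 3.13 builds (`PYTHON_GIL=0`). As a minimal sketch (not part of this PR), the following shows one way to detect at runtime whether the interpreter is actually running without the GIL; `sys._is_gil_enabled()` only exists on Python 3.13+, so older interpreters fall back to assuming the GIL is on:

```python
import sys

def gil_enabled() -> bool:
    # sys._is_gil_enabled() was added in Python 3.13; on any older
    # interpreter the GIL is always active, so default to True.
    checker = getattr(sys, "_is_gil_enabled", None)
    return checker() if checker is not None else True

if __name__ == "__main__":
    print("GIL enabled:", gil_enabled())
```

On a free-threaded 3.13 build launched with `PYTHON_GIL=0`, this would report `False`; on a standard CPython build it reports `True`.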
gptqmodel/__init__.py: 2 changes (1 addition, 1 deletion)

@@ -16,7 +16,7 @@
 DEVICE_THREAD_POOL = DeviceThreadPool(
     inference_mode=True,
     workers={
-        "cuda:per": 2,
+        "cuda:per": 4,
         "xpu:per": 1,
         "mps": 8,
         "cpu": 8,
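`DeviceThreadPool` is internal to GPTQModel, but the worker spec above suggests a convention where `"<device>:per"` keys mean N workers per device of that type, while plain keys like `"cpu"` are absolute counts. A hypothetical sketch of such a resolver (names and semantics are assumptions, not the library's actual implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_workers(spec, device_counts):
    # Hypothetical: "cuda:per" -> one pool per visible CUDA device,
    # each with that many workers; plain keys get a single pool.
    pools = {}
    for key, n in spec.items():
        if key.endswith(":per"):
            dev = key.split(":")[0]
            for i in range(device_counts.get(dev, 0)):
                pools[f"{dev}:{i}"] = ThreadPoolExecutor(max_workers=n)
        else:
            pools[key] = ThreadPoolExecutor(max_workers=n)
    return pools
```

Under this reading, the diff's `"cuda:per": 2` to `4` change doubles the worker threads attached to each CUDA device rather than the total pool count.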
tests/models/test_qwen3_moe.py: 7 changes (4 additions, 3 deletions)

@@ -6,11 +6,12 @@
 from model_test import ModelTest


+
 class TestQwen3Moe(ModelTest):
     NATIVE_MODEL_ID = "/monster/data/model/Qwen3-30B-A3B"
-    QUANT_ARC_MAX_DELTA_FLOOR_PERCENT = 0.2
-    NATIVE_ARC_CHALLENGE_ACC = 0.3700
-    NATIVE_ARC_CHALLENGE_ACC_NORM = 0.3700
+    QUANT_ARC_MAX_DELTA_FLOOR_PERCENT = 0.04
+    NATIVE_ARC_CHALLENGE_ACC = 0.3788  # a100 4,5,6,7
+    NATIVE_ARC_CHALLENGE_ACC_NORM = 0.3899  # a100 4,5,6,7
     # TRUST_REMOTE_CODE = False
     APPLY_CHAT_TEMPLATE = True
     # EVAL_BATCH_SIZE = 6
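Tightening `QUANT_ARC_MAX_DELTA_FLOOR_PERCENT` from 0.2 to 0.04 shrinks how far the quantized model's ARC score may fall below the native baseline before the test fails. A minimal sketch of that kind of check, assuming the delta is measured as a relative drop from the native score (the actual formula lives in `ModelTest` and may differ):

```python
def within_floor(native_acc: float, quant_acc: float, max_delta_pct: float) -> bool:
    # Relative accuracy drop of the quantized model vs the native one;
    # a negative delta (quant beats native) always passes.
    delta = (native_acc - quant_acc) / native_acc
    return delta <= max_delta_pct
```

With the new baselines above, a quantized ARC accuracy of 0.3700 against a native 0.3788 is a ~2.3% relative drop and would pass a 0.04 floor, while 0.3500 (~7.6% drop) would fail.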
tests/models/test_qwen3_next.py: 14 changes (11 additions, 3 deletions)

@@ -8,13 +8,21 @@

 class TestQwen3Next(ModelTest):
     NATIVE_MODEL_ID = "/monster/data/model/Qwen3-Next-80B-A3B-Instruct"
-    QUANT_ARC_MAX_DELTA_FLOOR_PERCENT = 0.05
+    QUANT_ARC_MAX_DELTA_FLOOR_PERCENT = 0.04
     NATIVE_ARC_CHALLENGE_ACC = 0.3900
     NATIVE_ARC_CHALLENGE_ACC_NORM = 0.3900
     TRUST_REMOTE_CODE = True
     APPLY_CHAT_TEMPLATE = True
-    EVAL_BATCH_SIZE = 6
-    #DATASET_SIZE = 1024
+    EVAL_BATCH_SIZE = 4
+    V2 = False
+    DEBUG = True
+    ACT_GROUP_AWARE = True
+    DESC_ACT = False
+    DATASET_SIZE = 1024
+    DATASET_SORT = "desc"
+    QUANT_BATCH_SIZE = 4
+    CALIB_NOISE_MODE = "unseen"
+    CALIB_NOISE_PERCENT = 0.025

     def test_mimo(self):
         self.quant_lm_eval()
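The new `CALIB_NOISE_MODE = "unseen"` with `CALIB_NOISE_PERCENT = 0.025` suggests that a small fraction of the calibration set is swapped for samples the model has not seen. A hypothetical sketch of that idea (the helper name and exact mixing strategy are assumptions, not GPTQModel's implementation):

```python
import random

def mix_calibration(calib, unseen, noise_percent, seed=0):
    # Replace a noise_percent fraction of calibration samples with
    # draws from an "unseen" pool, at randomly chosen positions.
    rng = random.Random(seed)
    n_noise = int(len(calib) * noise_percent)
    idxs = rng.sample(range(len(calib)), n_noise)
    mixed = list(calib)
    for i, sample in zip(idxs, rng.choices(unseen, k=n_noise)):
        mixed[i] = sample
    return mixed
```

At 0.025, a 1024-sample calibration set (matching `DATASET_SIZE` above) would have roughly 25 samples replaced.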