
Optional CPU g_idx int64 cache for TorchQuantLinear dequant path #2431

Merged
Qubitium merged 4 commits into main from torch-gptq-cache-g_idx on Mar 3, 2026
Conversation

@Qubitium
Collaborator

@Qubitium Qubitium commented Mar 2, 2026

Reuse a cached int64 g_idx tensor in the non-Triton GPTQ torch dequant path to avoid repeated int32->int64 conversion and indexing overhead.

Observed speedups (dequant microbench):

  • up to ~20% speedup on Zen3 CPU with 96 threads.

Memory cost:

  • additional cache per module: numel(g_idx) * 8 bytes (int64)

Disabled by default, since microbenchmarks show a large gain only on CPU and only at specific thread counts. Still, ~21% is a pretty good boost.
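The core idea can be sketched as follows. This is a minimal illustration of the caching pattern, not the actual TorchQuantLinear code; the class and attribute names (`CachedGIdxDequant`, `_g_idx_long`) are hypothetical:

```python
import torch


class CachedGIdxDequant:
    """Sketch: cache an int64 copy of g_idx so repeated dequant calls
    skip the int32 -> int64 conversion that tensor indexing performs
    internally on every call."""

    def __init__(self, g_idx: torch.Tensor):
        self.g_idx = g_idx          # stored as int32, as GPTQ checkpoints ship it
        self._g_idx_long = None     # lazily built int64 cache

    def g_idx_long(self) -> torch.Tensor:
        # Build the cache once; cost is numel(g_idx) * 8 bytes of extra memory.
        if self._g_idx_long is None:
            self._g_idx_long = self.g_idx.to(torch.int64)
        return self._g_idx_long

    def gather_scales(self, scales: torch.Tensor) -> torch.Tensor:
        # Indexing with an int64 tensor avoids a per-call dtype conversion
        # of the index, which is where the observed overhead comes from.
        return scales[self.g_idx_long()]
```

The cache is built lazily on first use, so modules that never hit the torch dequant path pay no extra memory.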

 OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 CUDA_VISIBLE_DEVICES=0 GPTQ_TORCH_TRITON_DEQUANT=0 python /tmp/bench_threads_torch_trimmed.py 8 10
 
  +---------+-------------------+-------------------+-----------+
  | Threads | Baseline Trim Avg | Current Trim Avg  | Speedup % |
  +---------+-------------------+-------------------+-----------+
  | 1       | 120.500959 ms     | 120.839331 ms     | -0.281    |
  | 4       | 56.057461 ms      | 56.339439 ms      | -0.503    |
  | 8       | 41.733621 ms      | 41.203413 ms      | +1.270    |
  | 16      | 35.515538 ms      | 34.727764 ms      | +2.218    |
  | 32      | 31.957524 ms      | 32.027303 ms      | -0.218    |
  | 64      | 40.309505 ms      | 39.946576 ms      | +0.900    |
  | 96      | 95.567628 ms      | 75.090228 ms      | +21.427   |
  | 128     | 47.846520 ms      | 46.367916 ms      | +3.090    |
  +---------+-------------------+-------------------+-----------+

 Observed speedups (dequant microbench):
  - up to ~24% on Zen3 CPU
  - up to ~13% on A100 in non-Triton mode
  - up to ~5% on 4090 in non-Triton mode
@Qubitium Qubitium marked this pull request as draft March 2, 2026 06:24
@Qubitium Qubitium changed the title Add optional g_idx int64 cache for TorchQuantLinear dequant path Optional CPU g_idx int64 cache for TorchQuantLinear dequant path Mar 3, 2026
Qubitium added 2 commits March 3, 2026 01:14
Changed from column-only slicing to aligned row+column block slicing in the num_itr > 1 branch.
Fixed the same bug in the cached torch path (_dequantize_weight_cached_248).
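For illustration, the distinction between the two slicing schemes might look like this. This is a generic sketch of column-only versus aligned block slicing, not the repository's dequant code, and the function names are hypothetical:

```python
import torch


def blocks_cols_only(w: torch.Tensor, num_itr: int):
    # Column-only slicing: every block keeps ALL rows, so per-block
    # processing can pair rows with the wrong column group when rows
    # are supposed to be split across iterations too.
    cstep = w.shape[1] // num_itr
    return [w[:, i * cstep:(i + 1) * cstep] for i in range(num_itr)]


def blocks_aligned(w: torch.Tensor, num_itr: int):
    # Aligned row+column slicing: block i takes the matching row range
    # and column range, so each iteration sees a self-consistent sub-block.
    rstep = w.shape[0] // num_itr
    cstep = w.shape[1] // num_itr
    return [w[i * rstep:(i + 1) * rstep, i * cstep:(i + 1) * cstep]
            for i in range(num_itr)]
```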
@Qubitium
Collaborator Author

Qubitium commented Mar 3, 2026

128 core Zen3, sweet spot is 16 threads

  +---------+-------------------+-------------------+-----------+
  | Threads | Baseline Trim Avg | Current Trim Avg  | Speedup % |
  +---------+-------------------+-------------------+-----------+
  | 1       | 121.003356 ms     | 120.748541 ms     | +0.211    |
  | 4       | 55.440594 ms      | 55.199437 ms      | +0.435    |
  | 8       | 42.718230 ms      | 42.487563 ms      | +0.540    |
  | 16      | 34.404057 ms      | 33.476284 ms      | +2.697    |
  | 32      | 41.884904 ms      | 40.231748 ms      | +3.947    |
  | 64      | 41.095100 ms      | 42.297816 ms      | -2.927    |
  | 96      | 84.770926 ms      | 69.136020 ms      | +18.444   |
  | 128     | 44.845950 ms      | 42.405236 ms      | +5.442    |
  +---------+-------------------+-------------------+-----------+

@Qubitium
Collaborator Author

Qubitium commented Mar 3, 2026

Dequant performance is already sub-millisecond on GPU, so the improvement is much smaller.

  +------+---------------------------+---------------------+---------------------+-----------+------------+
  | GPU  | Device                    | Baseline Trimmed ms | Current Trimmed ms  | Speedup % | MaxAbsDiff |
  +------+---------------------------+---------------------+---------------------+-----------+------------+
  | 0    | NVIDIA PG506-230 (A100)   | 0.2963678566        | 0.2775771627        | +6.3403   | 0.0        |
  | 6    | NVIDIA GeForce RTX 4090   | 0.2911456395        | 0.2811155963        | +3.4450   | 0.0        |
  +------+---------------------------+---------------------+---------------------+-----------+------------+

@Qubitium Qubitium marked this pull request as ready for review March 3, 2026 06:40
@Qubitium Qubitium merged commit f2fee23 into main Mar 3, 2026
6 checks passed
@Qubitium Qubitium deleted the torch-gptq-cache-g_idx branch March 3, 2026 06:40