
Optional CPU g_idx int64 cache for TorchQuantLinear dequant path #2431

Merged
Qubitium merged 4 commits into main from torch-gptq-cache-g_idx on Mar 3, 2026
Conversation

@Qubitium
Collaborator

@Qubitium Qubitium commented Mar 2, 2026

Reuse a cached int64 g_idx tensor in the non-Triton GPTQ torch dequant path to avoid repeated int32->int64 conversion and indexing overhead.

Observed speedups (dequant microbench):

  • up to ~20% speedup on Zen3 CPU with 96 threads.

Memory cost:

  • additional cache per module: numel(g_idx) * 8 bytes (int64)

Disabled by default, since microbenchmarks show a large gain only on CPU and only at specific thread counts. Still, ~21% is a pretty good boost.
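The core idea can be sketched as follows. This is a minimal illustration of the caching pattern, not the actual TorchQuantLinear code; the class and attribute names (`CachedGIdxDequant`, `_g_idx_long`) are hypothetical:

```python
import torch


class CachedGIdxDequant:
    """Sketch: cache an int64 copy of g_idx so repeated dequant calls
    skip the int32 -> int64 conversion that tensor indexing performs
    internally on every call."""

    def __init__(self, g_idx: torch.Tensor):
        self.g_idx = g_idx          # stored as int32, as GPTQ checkpoints ship it
        self._g_idx_long = None     # lazily built int64 cache

    def g_idx_long(self) -> torch.Tensor:
        # Build the cache once; cost is numel(g_idx) * 8 bytes of extra memory.
        if self._g_idx_long is None:
            self._g_idx_long = self.g_idx.to(torch.int64)
        return self._g_idx_long

    def gather_scales(self, scales: torch.Tensor) -> torch.Tensor:
        # Indexing with an int64 tensor avoids a per-call dtype conversion
        # of the index, which is where the observed overhead comes from.
        return scales[self.g_idx_long()]
```

The cache is built lazily on first use, so modules that never hit the torch dequant path pay no extra memory.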

 OMP_NUM_THREADS=8 MKL_NUM_THREADS=8 CUDA_VISIBLE_DEVICES=0 GPTQ_TORCH_TRITON_DEQUANT=0 python /tmp/bench_threads_torch_trimmed.py 8 10
 
  +---------+-------------------+-------------------+-----------+
  | Threads | Baseline Trim Avg | Current Trim Avg  | Speedup % |
  +---------+-------------------+-------------------+-----------+
  | 1       | 120.500959 ms     | 120.839331 ms     | -0.281    |
  | 4       | 56.057461 ms      | 56.339439 ms      | -0.503    |
  | 8       | 41.733621 ms      | 41.203413 ms      | +1.270    |
  | 16      | 35.515538 ms      | 34.727764 ms      | +2.218    |
  | 32      | 31.957524 ms      | 32.027303 ms      | -0.218    |
  | 64      | 40.309505 ms      | 39.946576 ms      | +0.900    |
  | 96      | 95.567628 ms      | 75.090228 ms      | +21.427   |
  | 128     | 47.846520 ms      | 46.367916 ms      | +3.090    |
  +---------+-------------------+-------------------+-----------+

 Observed speedups (dequant microbench):
  - up to ~24% on Zen3 CPU
  - up to ~13% on A100 in non-Triton mode
  - up to ~5% on 4090 in non-Triton mode
@Qubitium Qubitium marked this pull request as draft March 2, 2026 06:24
@Qubitium Qubitium changed the title Add optional g_idx int64 cache for TorchQuantLinear dequant path Optional CPU g_idx int64 cache for TorchQuantLinear dequant path Mar 3, 2026
Qubitium added 2 commits March 3, 2026 01:14
Changed from column-only slicing to aligned row+column block slicing in the num_itr > 1 branch.
Fixed the same bug in the cached torch path (_dequantize_weight_cached_248).
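For illustration, the distinction between the two slicing schemes might look like this. This is a generic sketch of column-only versus aligned block slicing, not the repository's dequant code, and the function names are hypothetical:

```python
import torch


def blocks_cols_only(w: torch.Tensor, num_itr: int):
    # Column-only slicing: every block keeps ALL rows, so per-block
    # processing can pair rows with the wrong column group when rows
    # are supposed to be split across iterations too.
    cstep = w.shape[1] // num_itr
    return [w[:, i * cstep:(i + 1) * cstep] for i in range(num_itr)]


def blocks_aligned(w: torch.Tensor, num_itr: int):
    # Aligned row+column slicing: block i takes the matching row range
    # and column range, so each iteration sees a self-consistent sub-block.
    rstep = w.shape[0] // num_itr
    cstep = w.shape[1] // num_itr
    return [w[i * rstep:(i + 1) * rstep, i * cstep:(i + 1) * cstep]
            for i in range(num_itr)]
```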
@Qubitium
Collaborator Author

Qubitium commented Mar 3, 2026

128 core Zen3, sweet spot is 16 threads

  +---------+-------------------+-------------------+-----------+
  | Threads | Baseline Trim Avg | Current Trim Avg  | Speedup % |
  +---------+-------------------+-------------------+-----------+
  | 1       | 121.003356 ms     | 120.748541 ms     | +0.211    |
  | 4       | 55.440594 ms      | 55.199437 ms      | +0.435    |
  | 8       | 42.718230 ms      | 42.487563 ms      | +0.540    |
  | 16      | 34.404057 ms      | 33.476284 ms      | +2.697    |
  | 32      | 41.884904 ms      | 40.231748 ms      | +3.947    |
  | 64      | 41.095100 ms      | 42.297816 ms      | -2.927    |
  | 96      | 84.770926 ms      | 69.136020 ms      | +18.444   |
  | 128     | 44.845950 ms      | 42.405236 ms      | +5.442    |
  +---------+-------------------+-------------------+-----------+

@Qubitium
Collaborator Author

Qubitium commented Mar 3, 2026

Dequant performance is already sub-millisecond on GPU, so the improvement is much smaller.

  +------+---------------------------+---------------------+---------------------+-----------+------------+
  | GPU  | Device                    | Baseline Trimmed ms | Current Trimmed ms  | Speedup % | MaxAbsDiff |
  +------+---------------------------+---------------------+---------------------+-----------+------------+
  | 0    | NVIDIA PG506-230 (A100)   | 0.2963678566        | 0.2775771627        | +6.3403   | 0.0        |
  | 6    | NVIDIA GeForce RTX 4090   | 0.2911456395        | 0.2811155963        | +3.4450   | 0.0        |
  +------+---------------------------+---------------------+---------------------+-----------+------------+

@Qubitium Qubitium marked this pull request as ready for review March 3, 2026 06:40
@Qubitium Qubitium merged commit f2fee23 into main Mar 3, 2026
6 checks passed
@Qubitium Qubitium deleted the torch-gptq-cache-g_idx branch March 3, 2026 06:40