quant: extract FP→int quantization into its own sub-package #13
Move the Triton fused quant kernels (fused_quant_kernel,
fused_quant_bipolar_batched_kernel + host wrappers) and the PyTorch
grouped quant helpers out of sc/kernels.py into scmp_kernels/quant/,
so quant strategies can be explored independently of the SC matmul.
sc/kernels.py re-exports the moved symbols under their original names,
so existing callers (scmp_diffusion/optimization_workspace/sc_patch.py,
optimized_kernels.py, scmp_llm calibration) keep working unchanged.
Verified on RTX PRO 6000 Blackwell (gl1810):
* 8/8 golden sc_matmul cases bit-exact vs pre-refactor reference
* per_row 128×128×128 sc_prec=8 stoc_len=256: 0.406 → 0.413 ms/it
(+1.6%, within shared-GPU noise)
* scmp_llm/kernels/tests/test_sc_smoke.py: 13/13 PASS
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    total = rows * cols
    mask = offsets < total
    base = 0  # offsets already absolute
    from ..quant import (
        fused_quant_kernel,
        fused_quant_bipolar_batched_kernel,
        fused_quantize_bipolar,
        fused_quantize_bipolar_perrow,
        fused_quantize_unipolar,
        _quant_dummy,
        _grouped_symmetric_quant,
        _grouped_asymmetric_quant,
        _grouped_symmetric_quant_batched,
    )
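The re-export shown above can be illustrated with a minimal sketch. The module names `demo_quant` and `demo_sc_kernels` are hypothetical stand-ins built in memory (they are not the real `scmp_kernels` modules), and the function body is a placeholder. The point is only that a `from ... import` re-export binds the same object under both paths.

```python
import sys
import types

# Hypothetical stand-in for the new quant sub-package.
quant = types.ModuleType("demo_quant")


def fused_quantize_bipolar(x):
    """Placeholder body; the real wrapper launches a Triton kernel."""
    return x


quant.fused_quantize_bipolar = fused_quantize_bipolar
sys.modules["demo_quant"] = quant

# Hypothetical stand-in for the legacy module, which only re-exports the symbol.
legacy = types.ModuleType("demo_sc_kernels")
exec("from demo_quant import fused_quantize_bipolar", legacy.__dict__)
sys.modules["demo_sc_kernels"] = legacy

from demo_quant import fused_quantize_bipolar as new_path
from demo_sc_kernels import fused_quantize_bipolar as legacy_path

# Both import paths resolve to one and the same function object.
same_object = new_path is legacy_path
```

Because the legacy name is just another binding to the same object, callers that monkeypatch or compare by identity see no difference between the two paths.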
heroarmor
left a comment
Reviewed — clean mechanical move, safe to merge. ✅
Verified line-by-line: the moved Triton/PyTorch code is byte-identical to the deletions, and sc/kernels.py re-exports all 9 symbols under their original names, so both external callers and the in-file call sites resolve fine.
No circular-import risk despite quant/fused.py referencing sc.kernels._resolve_rng_levels — that import is deferred inside the function body, so it works in either import order. Good call isolating it.
Minor nit (non-blocking): after the move, sc/kernels.py has two now-unused imports:
- import functools (line 24): lru_cache left with _quant_dummy
- from triton.language.extra.cuda import libdevice (line 31): nearbyint only used in the moved kernels
Worth a one-line cleanup, but not worth holding the merge over.
Pull request overview
This PR refactors FP→int quantization code out of scmp_kernels.sc.kernels into a new scmp_kernels.quant subpackage, while keeping backward compatibility by re-exporting the moved symbols from the original module.
Changes:
- Added scmp_kernels/quant/ subpackage containing Triton fused quant kernels/wrappers and PyTorch grouped quant helpers.
- Updated scmp_kernels/sc/kernels.py to re-export quantization APIs from scmp_kernels.quant under their original names.
- Removed the inlined quantization implementations from scmp_kernels/sc/kernels.py (replaced with pointers to the new module locations).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scmp_kernels/sc/kernels.py | Re-exports quantization symbols from scmp_kernels.quant and removes the inlined implementations. |
| scmp_kernels/quant/fused.py | New home for Triton fused quant kernels and host wrappers (plus batched bipolar fused-transpose kernel). |
| scmp_kernels/quant/grouped.py | New home for grouped symmetric/asymmetric PyTorch quant helpers. |
| scmp_kernels/quant/__init__.py | Public re-export surface for the new quant subpackage. |
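For readers unfamiliar with grouped symmetric quantization, here is a generic textbook sketch in pure Python. It is not the implementation in quant/grouped.py: the function name, the flat-list layout, and the symmetric clamp to [-q_max, q_max] are all assumptions made for illustration.

```python
def grouped_symmetric_quant(values, group_size, n_bits=8):
    """Quantize a flat list group by group; returns (int codes, per-group scales)."""
    q_max = 2 ** (n_bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # One scale per group, chosen so the largest magnitude maps to q_max.
        scale = max_abs / q_max if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-q_max, min(q_max, round(v / scale))) for v in group)
    return codes, scales


# Second group has much larger magnitudes, so it gets its own, larger scale.
x = [0.1, -0.2, 0.05, 0.4, 10.0, -20.0, 5.0, 2.5]
codes, scales = grouped_symmetric_quant(x, group_size=4)
dequant = [c * scales[i // 4] for i, c in enumerate(codes)]
```

Giving each group its own scale keeps the small-magnitude group from being crushed to a handful of codes by the large-magnitude group.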
        sc_prec: int,
        compute_sum: bool = False,
        rng_levels: Optional[int] = None,
    ) -> tuple[torch.Tensor, float, float, torch.Tensor | None]:
    # Local import to avoid a circular dependency at module load. `_resolve_rng_levels`
    # is about SC's RNG grid size, not quantization — it stays in sc.kernels.
    def _resolve_rng_levels(sc_prec: int, rng_levels: Optional[int]) -> int:
        from ..sc.kernels import _resolve_rng_levels as _impl
        return _impl(sc_prec, rng_levels)
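The deferred-import pattern in the snippet above can be reproduced in a throwaway package. This is a sketch under stated assumptions: `demo_pkg`, `sc_kernels.py`, `quant_fused.py`, and `quantize` are hypothetical names, and the body of `_resolve_rng_levels` is a placeholder policy, not the repository's logic.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "__init__.py").write_text("")

# Legacy module: re-exports from the new module at import time.
(pkg / "sc_kernels.py").write_text(textwrap.dedent('''
    from .quant_fused import quantize  # re-export, runs at module load

    def _resolve_rng_levels(sc_prec, rng_levels):
        # Placeholder policy, assumed for this demo only.
        return rng_levels if rng_levels is not None else 2 ** sc_prec
'''))

# New module: needs a helper from the legacy module, but only at call time.
(pkg / "quant_fused.py").write_text(textwrap.dedent('''
    def _resolve_rng_levels(sc_prec, rng_levels):
        from .sc_kernels import _resolve_rng_levels as _impl  # deferred import
        return _impl(sc_prec, rng_levels)

    def quantize(sc_prec, rng_levels=None):
        return _resolve_rng_levels(sc_prec, rng_levels)
'''))

sys.path.insert(0, str(root))
quant_fused = importlib.import_module("demo_pkg.quant_fused")  # new module first
sc_kernels = importlib.import_module("demo_pkg.sc_kernels")    # legacy module second
levels = quant_fused.quantize(8)
```

Since `quant_fused.py` has no top-level import of `sc_kernels`, neither module needs the other to be fully initialized while it loads, so either import order works.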
    inv_scale_ptr,  # (BH,) float32 — per-head inv_scale
    q_max,          # int: 2^(sc_prec-1) - 1
    q_min,          # int: -(2^(sc_prec-1))
    max_rng_val,    # int: 2^sc_prec - 1
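As a worked example of the three integer bounds in the parameter comments above, the following sketch evaluates them for a given `sc_prec`. The helper name `bipolar_quant_range` is hypothetical; only the formulas come from the comments.

```python
def bipolar_quant_range(sc_prec: int) -> tuple[int, int, int]:
    """Return (q_min, q_max, max_rng_val) using the formulas from the kernel comments."""
    if sc_prec < 1:
        raise ValueError("sc_prec must be a positive integer")
    q_max = 2 ** (sc_prec - 1) - 1       # largest signed code
    q_min = -(2 ** (sc_prec - 1))        # smallest signed code
    max_rng_val = 2 ** sc_prec - 1       # largest unsigned value at this precision
    return q_min, q_max, max_rng_val


for sc_prec in (4, 8):
    print(sc_prec, bipolar_quant_range(sc_prec))
# 4 (-8, 7, 15)
# 8 (-128, 127, 255)
```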
heroarmor
left a comment
Approving — clean mechanical extraction, byte-identical move, no circular-import risk, re-exports keep all callers working. The two unused imports (functools, libdevice) in sc/kernels.py are a non-blocking cleanup nit.
The quant sub-package extraction described by this plan landed in #13. The plan is now historical context only — remove from the working tree to avoid drift between the doc and the actual layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* quant: add SmoothQuant pre-quantization transform
Mathematically equivalent diagonal rescale along D that migrates per-channel
activation outliers into the weight:
Y = A @ B.T = (A / s) @ (B * s).T, s_j = act_max[j]^a / w_max[j]^(1-a)
New helpers in scmp_kernels/quant/smoothquant.py:
accumulate_act_scales - per-channel max-abs aggregator for calibration
compute_smooth_scales - build s from calibrated stats + weight
apply_smoothing - apply diagonal rescale (2D and 3D)
apply_smoothing_offline- bake s into the weight once
Wired into sc_matmul as an optional smooth_scales kwarg; default None
preserves byte-for-byte legacy behavior. Works for all three granularities
(per_tensor, per_row, per_head).
Tests in tests/test_smoothquant.py cover the math identity, calibration
aggregator, alpha=0/1 closed forms, MSE improvement under simulated int8
quant for all three granularities (13-17x on synthetic outliers), and the
sc_matmul kwarg-vs-manual equivalence (CUDA-only).
Reference: Xiao et al., "SmoothQuant: Accurate and Efficient PTQ for LLMs",
ICML 2023.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: remove stale MIGRATION_PLAN.md
The quant sub-package extraction described by this plan landed in #13.
The plan is now historical context only — remove from the working tree to
avoid drift between the doc and the actual layout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
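The SmoothQuant identity quoted in the commit message, Y = A @ B.T = (A / s) @ (B * s).T with s_j = act_max[j]^a / w_max[j]^(1-a), can be checked numerically with a small pure-Python sketch. The helper names below are hypothetical and do not mirror the signatures in scmp_kernels/quant/smoothquant.py; the sample matrices are made up for illustration.

```python
def matmul_abt(a, b):
    """Compute A @ B.T for nested lists A (M x D) and B (N x D)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def col_abs_max(m):
    """Per-channel (column) max of absolute values."""
    return [max(abs(row[j]) for row in m) for j in range(len(m[0]))]


def smooth_scales(act_max, w_max, alpha=0.5):
    """s_j = act_max[j]^alpha / w_max[j]^(1 - alpha)."""
    return [am ** alpha / wm ** (1.0 - alpha) for am, wm in zip(act_max, w_max)]


A = [[0.5, 40.0, -1.0], [-0.25, -64.0, 2.0]]  # channel 1 carries activation outliers
B = [[1.0, 0.5, -2.0], [0.25, -1.0, 4.0]]

s = smooth_scales(col_abs_max(A), col_abs_max(B), alpha=0.5)
A_s = [[x / sj for x, sj in zip(row, s)] for row in A]  # A / s along D
B_s = [[x * sj for x, sj in zip(row, s)] for row in B]  # B * s along D

Y = matmul_abt(A, B)
Y_s = matmul_abt(A_s, B_s)
max_err = max(abs(p - q) for r1, r2 in zip(Y, Y_s) for p, q in zip(r1, r2))
```

The product is unchanged up to floating-point rounding, while the largest activation magnitude drops from 64 to 8 in this example, which is what makes the smoothed activations easier to quantize.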
Summary
Moves FP→int quantization out of scmp_kernels/sc/kernels.py into a new scmp_kernels/quant/ sub-package so quant strategies can be explored independently of the SC matmul.

- quant/fused.py: Triton fused_quant_kernel, fused_quant_bipolar_batched_kernel, host wrappers fused_quantize_bipolar / _perrow / _unipolar, plus _quant_dummy.
- quant/grouped.py: PyTorch _grouped_symmetric_quant, _grouped_asymmetric_quant, _grouped_symmetric_quant_batched.
- sc/kernels.py re-exports all moved symbols under their original names, so existing callers (scmp_diffusion/optimization_workspace/sc_patch.py, optimized_kernels.py, scmp_llm/calibrate_mp_thresholds.py) keep working unchanged.

No logic changes: strictly a mechanical move. The fused Triton kernels are byte-identical to before; only their module home changed.
Test plan
Verified on RTX PRO 6000 Blackwell (gl1810):

- 8/8 golden sc_matmul cases bit-exact vs pre-refactor reference (per_tensor/per_row/per_head × bipolar/unipolar × 2D/3D + chunk_d + low-prec)
- scmp_llm/kernels/tests/test_sc_smoke.py: 13/13 PASS
- per_row 128×128×128 sc_prec=8 stoc_len=256: 0.406 → 0.413 ms/it (+1.6%, within shared-GPU noise)
- New (from scmp_kernels.quant import ...) and legacy (from scmp_kernels.sc.kernels import ...) import paths return the same objects
- scmp_diffusion external import lists still resolve

🤖 Generated with Claude Code