quant: extract FP→int quantization into its own sub-package#13

Merged
heroarmor merged 1 commit into main from refactor/extract-quant-module on May 20, 2026
@Allenjin123
Contributor

Summary

  • Move FP→int quant out of scmp_kernels/sc/kernels.py into a new scmp_kernels/quant/ sub-package so quant strategies can be explored independently of the SC matmul.
    • quant/fused.py — Triton fused_quant_kernel, fused_quant_bipolar_batched_kernel, host wrappers fused_quantize_bipolar / _perrow / _unipolar, plus _quant_dummy.
    • quant/grouped.py — PyTorch _grouped_symmetric_quant, _grouped_asymmetric_quant, _grouped_symmetric_quant_batched.
  • sc/kernels.py re-exports all moved symbols under their original names — existing callers (scmp_diffusion/optimization_workspace/sc_patch.py, optimized_kernels.py, scmp_llm/calibrate_mp_thresholds.py) keep working unchanged.

No logic changes — strictly a mechanical move. The fused Triton kernels are byte-identical to before; only their module home changed.
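The re-export shim described above can be illustrated with a small stand-alone sketch. The module names `demo_quant` and `demo_sc_kernels` are hypothetical stand-ins for `scmp_kernels.quant` and `scmp_kernels.sc.kernels`, and the function body is a placeholder; the only point is that a name imported into the legacy module is the same object as the one in the new module.

```python
import sys
import types

# Hypothetical stand-in for the new home of the quant code.
quant = types.ModuleType("demo_quant")

def fused_quantize_bipolar(x):
    """Placeholder body; only the object's identity matters in this demo."""
    return x

quant.fused_quantize_bipolar = fused_quantize_bipolar
sys.modules["demo_quant"] = quant

# Hypothetical stand-in for the legacy module: it keeps the old name alive
# by importing it from the new module at load time.
legacy = types.ModuleType("demo_sc_kernels")
exec("from demo_quant import fused_quantize_bipolar", legacy.__dict__)
sys.modules["demo_sc_kernels"] = legacy

from demo_quant import fused_quantize_bipolar as new_path
from demo_sc_kernels import fused_quantize_bipolar as legacy_path

# A re-export binds a second name to the same function object.
same_object = new_path is legacy_path
print(same_object)
```

Because both names point at one object, callers that patch or compare the function see consistent behavior regardless of which import path they use.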

Test plan

Verified on RTX PRO 6000 Blackwell (gl1810):

  • 8/8 golden sc_matmul cases bit-exact vs pre-refactor reference (per_tensor/per_row/per_head × bipolar/unipolar × 2D/3D + chunk_d + low-prec)
  • scmp_llm/kernels/tests/test_sc_smoke.py: 13/13 PASS
  • per_row 128×128×128 sc_prec=8 stoc_len=256: 0.406 → 0.413 ms/it (+1.6%, within shared-GPU noise)
  • Both new (from scmp_kernels.quant import ...) and legacy (from scmp_kernels.sc.kernels import ...) import paths return the same objects
  • Simulated scmp_diffusion external import lists still resolve

🤖 Generated with Claude Code

Move the Triton fused quant kernels (fused_quant_kernel,
fused_quant_bipolar_batched_kernel + host wrappers) and the PyTorch
grouped quant helpers out of sc/kernels.py into scmp_kernels/quant/,
so quant strategies can be explored independently of the SC matmul.

sc/kernels.py re-exports the moved symbols under their original names,
so existing callers (scmp_diffusion/optimization_workspace/sc_patch.py,
optimized_kernels.py, scmp_llm calibration) keep working unchanged.

Verified on RTX PRO 6000 Blackwell (gl1810):
  * 8/8 golden sc_matmul cases bit-exact vs pre-refactor reference
  * per_row 128×128×128 sc_prec=8 stoc_len=256: 0.406 → 0.413 ms/it
    (+1.6%, within shared-GPU noise)
  * scmp_llm/kernels/tests/test_sc_smoke.py: 13/13 PASS

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
offsets = pid * BLOCK + tl.arange(0, BLOCK)
total = rows * cols
mask = offsets < total
base = 0 # offsets already absolute
Comment on lines +42 to +52
from ..quant import (
    fused_quant_kernel,
    fused_quant_bipolar_batched_kernel,
    fused_quantize_bipolar,
    fused_quantize_bipolar_perrow,
    fused_quantize_unipolar,
    _quant_dummy,
    _grouped_symmetric_quant,
    _grouped_asymmetric_quant,
    _grouped_symmetric_quant_batched,
)
Collaborator

@heroarmor heroarmor left a comment

Reviewed — clean mechanical move, safe to merge.

Verified line-by-line: the moved Triton/PyTorch code is byte-identical to the deletions, and sc/kernels.py re-exports all 9 symbols under their original names, so both external callers and the in-file call sites resolve fine.

No circular-import risk despite quant/fused.py referencing sc.kernels._resolve_rng_levels — that import is deferred inside the function body, so it works in either import order. Good call isolating it.
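A minimal sketch of why a deferred import sidesteps the cycle, using a throwaway package written to a temporary directory. The package name `demo_cycle`, the functions `resolve_levels` and `quantize`, and the placeholder default are all hypothetical; they only mirror the shape of the dependency described in this comment, not the real code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical "sc" module: re-exports from quant at module load time.
SC_SRC = """
from .quant import quantize

def resolve_levels(prec, levels=None):
    # Placeholder logic, not the real _resolve_rng_levels.
    return levels if levels is not None else 2 ** prec
"""

# Hypothetical "quant" module: needs a helper from sc, but imports it lazily.
QUANT_SRC = """
def _resolve(prec, levels):
    # Deferred import: runs at call time, after both modules have loaded.
    from .sc import resolve_levels
    return resolve_levels(prec, levels)

def quantize(prec, levels=None):
    return _resolve(prec, levels)
"""

def build_package(root: Path) -> None:
    pkg = root / "demo_cycle"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "sc.py").write_text(SC_SRC)
    (pkg / "quant.py").write_text(QUANT_SRC)

def load_starting_with(first: str) -> int:
    # Drop cached modules so each call exercises a fresh import order.
    for name in [m for m in sys.modules if m.startswith("demo_cycle")]:
        del sys.modules[name]
    importlib.import_module(f"demo_cycle.{first}")
    quant = importlib.import_module("demo_cycle.quant")
    return quant.quantize(8)

with tempfile.TemporaryDirectory() as tmp:
    build_package(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        results = [load_starting_with("sc"), load_starting_with("quant")]
    finally:
        sys.path.remove(tmp)

print(results)
```

If the `from .sc import resolve_levels` line were moved to the top of the hypothetical `quant.py`, importing `sc` first would fail, because `quant` would ask for `resolve_levels` before `sc` had finished defining it.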

Minor nit (non-blocking): after the move, sc/kernels.py has two now-unused imports:

  • import functools (line 24): lru_cache was used only by _quant_dummy, which moved
  • from triton.language.extra.cuda import libdevice (line 31): nearbyint is used only in the moved kernels

Worth a one-line cleanup, but not worth holding the merge over.
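For illustration, a rough way to spot imports like these with the standard-library `ast` module. This is a simplified sketch: the helper name `unused_imports` and the sample source are made up, and the check ignores `__all__`, star imports, and names referenced only in strings.

```python
import ast

def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # Attribute access such as `math.pi` still contains a Name node for `math`.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(imported - used)

SAMPLE = """
import functools
import math
from os import path

def area(r):
    return math.pi * r * r
"""

result = unused_imports(SAMPLE)
print(result)  # ['functools', 'path']
```

A naive check like this would also flag names that a module imports purely to re-export, so a compatibility shim needs an allowlist or an explicit `__all__`. In practice a linter rule such as Ruff's or Pyflakes' F401 covers the same ground.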


Copilot AI left a comment

Pull request overview

This PR refactors FP→int quantization code out of scmp_kernels.sc.kernels into a new scmp_kernels.quant subpackage, while keeping backward compatibility by re-exporting the moved symbols from the original module.

Changes:

  • Added scmp_kernels/quant/ subpackage containing Triton fused quant kernels/wrappers and PyTorch grouped quant helpers.
  • Updated scmp_kernels/sc/kernels.py to re-export quantization APIs from scmp_kernels.quant under their original names.
  • Removed the inlined quantization implementations from scmp_kernels/sc/kernels.py (replaced with pointers to the new module locations).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scmp_kernels/sc/kernels.py | Re-exports quantization symbols from scmp_kernels.quant and removes the inlined implementations. |
| scmp_kernels/quant/fused.py | New home for Triton fused quant kernels and host wrappers (plus batched bipolar fused-transpose kernel). |
| scmp_kernels/quant/grouped.py | New home for grouped symmetric/asymmetric PyTorch quant helpers. |
| scmp_kernels/quant/__init__.py | Public re-export surface for the new quant subpackage. |

sc_prec: int,
compute_sum: bool = False,
rng_levels: Optional[int] = None,
) -> tuple[torch.Tensor, float, float, torch.Tensor | None]:
Comment on lines +28 to +32
# Local import to avoid a circular dependency at module load. `_resolve_rng_levels`
# is about SC's RNG grid size, not quantization — it stays in sc.kernels.
def _resolve_rng_levels(sc_prec: int, rng_levels: Optional[int]) -> int:
    from ..sc.kernels import _resolve_rng_levels as _impl
    return _impl(sc_prec, rng_levels)
inv_scale_ptr, # (BH,) float32 — per-head inv_scale
q_max, # int: 2^(sc_prec-1) - 1
q_min, # int: -(2^(sc_prec-1))
max_rng_val, # int: 2^sc_prec - 1
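The integer bounds in the parameter comments above can be worked through in plain Python. The bounds formulas come from those comments; everything else here is an assumption for illustration: the helper names `signed_range` and `quantize_symmetric` are hypothetical, and the per-tensor max-abs scaling with `inv_scale = q_max / max_abs` is a common symmetric scheme, not necessarily what the Triton kernel does.

```python
def signed_range(sc_prec: int) -> tuple[int, int, int]:
    """Integer bounds as written in the kernel parameter comments."""
    q_max = 2 ** (sc_prec - 1) - 1
    q_min = -(2 ** (sc_prec - 1))
    max_rng_val = 2 ** sc_prec - 1
    return q_min, q_max, max_rng_val

def quantize_symmetric(values: list[float], sc_prec: int) -> tuple[list[int], float]:
    """Assumed scheme: scale by max-abs, round to nearest, clamp to [q_min, q_max]."""
    q_min, q_max, _ = signed_range(sc_prec)
    max_abs = max(abs(v) for v in values)
    inv_scale = q_max / max_abs if max_abs > 0 else 1.0
    # Python's round() rounds halves to even, matching nearbyint under the
    # default floating-point rounding mode.
    q = [min(q_max, max(q_min, round(v * inv_scale))) for v in values]
    return q, inv_scale

bounds = signed_range(8)
print(bounds)  # (-128, 127, 255)

q, inv_scale = quantize_symmetric([-1.0, -0.5, 0.0, 0.25, 1.0], sc_prec=8)
print(q)  # [-127, -64, 0, 32, 127]
```

Note that with max-abs scaling the most negative code (-128 at sc_prec=8) is never produced by the scaling itself; it is only reachable through the clamp.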
Collaborator

@heroarmor heroarmor left a comment

Approving — clean mechanical extraction, byte-identical move, no circular-import risk, re-exports keep all callers working. The two unused imports (functools, libdevice) in sc/kernels.py are a non-blocking cleanup nit.

@heroarmor heroarmor merged commit fa7387d into main May 20, 2026
2 checks passed
Allenjin123 added a commit that referenced this pull request May 20, 2026
The quant sub-package extraction described by this plan landed in #13.
The plan is now historical context only — remove from the working tree to
avoid drift between the doc and the actual layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
heroarmor pushed a commit that referenced this pull request May 20, 2026
* quant: add SmoothQuant pre-quantization transform

Mathematically equivalent diagonal rescale along D that migrates per-channel
activation outliers into the weight:

    Y = A @ B.T = (A / s) @ (B * s).T,   s_j = act_max[j]^alpha / w_max[j]^(1-alpha)

New helpers in scmp_kernels/quant/smoothquant.py:

  accumulate_act_scales  - per-channel max-abs aggregator for calibration
  compute_smooth_scales  - build s from calibrated stats + weight
  apply_smoothing        - apply diagonal rescale (2D and 3D)
  apply_smoothing_offline - bake s into the weight once

Wired into sc_matmul as an optional smooth_scales kwarg; default None
preserves byte-for-byte legacy behavior. Works for all three granularities
(per_tensor, per_row, per_head).

Tests in tests/test_smoothquant.py cover the math identity, calibration
aggregator, alpha=0/1 closed forms, MSE improvement under simulated int8
quant for all three granularities (13-17x on synthetic outliers), and the
sc_matmul kwarg-vs-manual equivalence (CUDA-only).

Reference: Xiao et al., "SmoothQuant: Accurate and Efficient Post-Training
Quantization for Large Language Models", ICML 2023.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
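The rescaling identity in the commit message above can be checked numerically with a small pure-Python sketch. This is not the code in scmp_kernels/quant/smoothquant.py: the helper names `matmul_bt` and `smooth_scales` and the sample matrices are made up for illustration, with one activation channel given deliberately large values.

```python
def matmul_bt(a, b):
    """Compute A @ B.T for nested lists: A is (M, D), B is (N, D)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]

def smooth_scales(act_max, w_max, alpha):
    """s_j = act_max[j]^alpha / w_max[j]^(1 - alpha), one scale per channel j."""
    return [am ** alpha / wm ** (1.0 - alpha) for am, wm in zip(act_max, w_max)]

# Channel 1 of the activations carries outliers; the weights are well behaved.
A = [[0.1, 40.0, -0.2], [0.3, -55.0, 0.1]]
B = [[0.5, 0.02, -0.4], [-0.3, 0.01, 0.6]]

act_max = [max(abs(row[j]) for row in A) for j in range(3)]
w_max = [max(abs(row[j]) for row in B) for j in range(3)]
s = smooth_scales(act_max, w_max, alpha=0.5)

# Divide activations and multiply weights by the same per-channel scale.
A_s = [[v / sj for v, sj in zip(row, s)] for row in A]
B_s = [[v * sj for v, sj in zip(row, s)] for row in B]

ref = matmul_bt(A, B)
out = matmul_bt(A_s, B_s)
max_err = max(abs(r - o) for rr, oo in zip(ref, out) for r, o in zip(rr, oo))
smoothed_act_max = [max(abs(row[j]) for row in A_s) for j in range(3)]

print(max_err)
print(act_max, smoothed_act_max)
```

The product is unchanged up to floating-point rounding, while the per-channel activation maxima become much more uniform, which is what makes the smoothed activations easier to quantize.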

* chore: remove stale MIGRATION_PLAN.md

The quant sub-package extraction described by this plan landed in #13.
The plan is now historical context only — remove from the working tree to
avoid drift between the doc and the actual layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>