Add MP module, drop quant-clipping margin, unify sc_matmul dispatcher #2
Replaces the mp/ placeholder __init__.py with a real public API.

scmp_kernels/mp/config.py (new, 762 lines):
- MPConfig: fixed-fraction quantile-based per-row stoc_len assignment.
- AdaptiveMPConfig: timestep-adaptive thresholds (per-operator, per-layer) with optional threshold_table_path JSON loader.
- RangeMPConfig + classify_groups_by_range: range-bucket assignment for per-group quant operators.
- RowAssignment, classify_rows_by_metric, adaptive_classify_rows: row classification primitives consumed by SC attention/MLP kernels.
- MPDistributionLogger, MetricProfiler: instrumentation.

scmp_kernels/mp/__init__.py:
- Re-exports the 9 public names above.

No SC, application, or evaluation changes; those are deliberately deferred per team discussion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
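For readers unfamiliar with the term, "fixed-fraction quantile-based per-row stoc_len assignment" can be pictured roughly as follows. This is a stdlib-only sketch, not the code in `mp/config.py`: the function name `assign_stoc_len_by_quantile`, its parameters, and the choice to give higher-metric rows the longer stream are all assumptions made for illustration.

```python
from typing import List, Sequence


def assign_stoc_len_by_quantile(
    metrics: Sequence[float],
    levels: Sequence[int],
    fractions: Sequence[float],
) -> List[int]:
    """Assign one stoc_len per row from fixed fractions of the metric ranking.

    levels    -- candidate stoc_len values, shortest first (hypothetical)
    fractions -- share of rows given each level, same order, summing to 1.0
    """
    if len(levels) != len(fractions):
        raise ValueError("levels and fractions must have the same length")
    if abs(sum(fractions) - 1.0) > 1e-6:
        raise ValueError("fractions must sum to 1.0")

    n = len(metrics)
    # Row indices sorted by ascending metric: low-metric rows come first.
    order = sorted(range(n), key=lambda i: metrics[i])

    # Default to the longest level so rounding never leaves a row unassigned.
    assignment = [levels[-1]] * n
    start = 0
    cumulative = 0.0
    for level, frac in zip(levels, fractions):
        cumulative += frac
        end = min(n, round(cumulative * n))
        for i in order[start:end]:
            assignment[i] = level
        start = end
    return assignment


row_metric = [0.1, 0.9, 0.4, 0.7]
result = assign_stoc_len_by_quantile(row_metric, levels=[64, 256], fractions=[0.5, 0.5])
print(result)
```

With two levels split 50/50, the two rows with the smallest metric get the short stream and the other two get the long one.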
Drops the safety-margin clipping in both bipolar and unipolar quant paths so the full quantization range is used:
- bipolar (8-bit): clamp/normalize at ±127 instead of ±125 (was q_clip = q_norm - 2). 254 → 256 distinct symmetric levels.
- unipolar (8-bit): map to [0, 255] instead of [2, 253] (was q_lo, q_hi = 2, q_max - 2). 252 → 256 distinct levels.

Mechanical changes in scmp_kernels/sc/kernels.py:
- fused_quant_bipolar_kernel: params (q_clip, q_clip_min, q_norm) -> (q_max); body uses ±q_max for clamp and q_max for boundary norm.
- fused_quant_bipolar_perrow_kernel: params (q_clip, q_norm) -> (q_max); same body simplification.
- fused_quantize_bipolar (Python wrapper): drops local q_norm/q_clip, uses q_max throughout; kernel-launch arg list shortened.
- fused_quantize_bipolar_perrow: drops `clip_margin=0` docstring noise, passes q_max once instead of (q_max, q_max).
- fused_quantize_unipolar: drops `q_lo, q_hi = 2, q_max - 2` margin; scale = range_fp / q_max; zp clamped to [0, q_max].
- _grouped_symmetric_quant: drops `clip_margin` kwarg and local `q_clip = q_max - clip_margin`; uses q_max for scale.
- _grouped_symmetric_quant_batched: same.
- 6 call sites: drop `, clip_margin=0`.

AST parses; no remaining q_clip / q_lo / q_hi / clip_margin / q_norm references in scmp_kernels/sc/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
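The effect of dropping the margin can be illustrated outside Triton. The sketch below is plain Python, not the fused kernels: `quantize_unipolar` and `quantize_bipolar` are hypothetical helpers, and the `margin` argument exists only to reproduce the old `q_lo, q_hi = 2, q_max - 2` behaviour for comparison. With `margin=0` the unipolar path reduces to `scale = range_fp / q_max` with the zero point clamped to `[0, q_max]`, as the commit describes.

```python
from typing import List, Sequence, Tuple


def quantize_unipolar(
    xs: Sequence[float], q_max: int = 255, margin: int = 0
) -> Tuple[List[int], float, int]:
    """Asymmetric quantization of xs onto the codes [margin, q_max - margin]."""
    q_lo, q_hi = margin, q_max - margin
    lo, hi = min(xs), max(xs)
    range_fp = (hi - lo) or 1.0            # avoid a zero scale on constant input
    scale = range_fp / (q_hi - q_lo)       # equals range_fp / q_max when margin == 0
    zp = round(q_lo - lo / scale)
    zp = max(q_lo, min(q_hi, zp))          # zero point clamped to the code range
    codes = [max(q_lo, min(q_hi, round(x / scale) + zp)) for x in xs]
    return codes, scale, zp


def quantize_bipolar(xs: Sequence[float], q_max: int = 127) -> Tuple[List[int], float]:
    """Symmetric quantization clamped at +/- q_max (no q_clip margin)."""
    absmax = max(abs(x) for x in xs) or 1.0
    scale = absmax / q_max
    codes = [max(-q_max, min(q_max, round(x / scale))) for x in xs]
    return codes, scale


full, _, _ = quantize_unipolar([0.0, 0.5, 1.0])             # new behaviour
old, _, _ = quantize_unipolar([0.0, 0.5, 1.0], margin=2)    # old 2-code margin
sym, _ = quantize_bipolar([-1.0, 0.25, 1.0])
print(full[0], full[-1], old[0], old[-1], sym[0], sym[-1])
```

The endpoints of the input now land on the extreme codes (0 and 255, or ±127) instead of stopping two codes short.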
Adds the historical sc_triton.* public names so application repos
(Q-DiT in scmp_diffusion, future ViT/WorldModel apps) can import
specialized SC matmul variants without using underscore-private helpers.
New public names in scmp_kernels/sc/kernels.py (re-exported from sc/__init__.py):
• sc_matmul_enable_triton: wrapper; 2D/3D enable-signal matmul, dispatches bipolar/unipolar.
• sc_matmul_grouped_enable_triton: wrapper; enable-signal + per-row-group quantization, dispatches by mode.
• sc_matmul_mlp: wrapper; non-enable MLP matmul, delegates to per-tensor sc_matmul (group_a/group_b/chunk_d accepted for API parity, unused).
• sc_matmul_per_tensor: alias of _sc_matmul_per_tensor.
• sc_matmul_grouped: alias of _sc_matmul_per_row.
• sc_matmul_enable_batched_bipolar: alias of _sc_matmul_per_head_bipolar.
• sc_matmul_enable_triton_mlp: alias of _sc_matmul_per_row_mlp (same body as historical wrapper).
• clear_rng_cache: re-exported (was internal).
The granularity-based sc_matmul dispatcher in matmul.py is unchanged.
The four aliases are byte-identical to their private counterparts;
only the three wrappers add real code (~180 lines).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atcher

Reverts the flat-API surface introduced in fbd7009; it duplicated the existing private helpers and conflated two different algorithms (the deprecated packed-XNOR/AND path vs the enable-signal table-lookup that now backs everything). Replaces it with one extended dispatcher.

scmp_kernels/sc/kernels.py
- Removes 3 wrappers (sc_matmul_enable_triton, sc_matmul_grouped_enable_triton, sc_matmul_mlp) and 4 aliases (sc_matmul_per_tensor, sc_matmul_grouped, sc_matmul_enable_batched_bipolar, sc_matmul_enable_triton_mlp). Total -222 lines. File returns to its post-clipping-removal state.

scmp_kernels/sc/matmul.py
- sc_matmul gains 3 kwargs: group_a, group_b, rng_levels.
- per_row dispatch now threads group_a/group_b/rng_levels through to _sc_matmul_per_row{,_mlp,_batched}.
- per_head dispatch threads rng_levels through to _sc_matmul_per_head_bipolar.

scmp_kernels/sc/__init__.py
- Public surface narrowed to: sc_matmul, clear_rng_cache, det_kernel_tuning.

Application code (Q-DiT in scmp_diffusion) is rewritten in a follow-up commit to call sc_matmul(..., granularity=...) instead of the deleted flat-API names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a first-class scmp_kernels.mp public API for mixed-precision (MP) row/group assignment and removes the remaining “quant clipping margin” headroom from SC quantization paths so 8-bit uses the full numeric range.
Changes:
- Introduces `scmp_kernels/mp/config.py` implementing `MPConfig`, `AdaptiveMPConfig`, `RangeMPConfig`, and related classification + instrumentation utilities; updates `scmp_kernels/mp/__init__.py` to re-export the public API.
- Removes SC quant clipping margins (bipolar and unipolar) by switching to full-range clamp/normalization and dropping `clip_margin` plumbing in grouped quant helpers.
- Extends `sc_matmul` API to accept `group_a`/`group_b` and `rng_levels`, and updates `scmp_kernels/sc/__init__.py` exports/docs.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `scmp_kernels/sc/matmul.py` | Extends `sc_matmul` API (grouping + `rng_levels`) and forwards new knobs into per-row/per-head paths. |
| `scmp_kernels/sc/kernels.py` | Removes quant clipping margin parameters and updates fused + grouped quantization helpers to use full quant range. |
| `scmp_kernels/sc/__init__.py` | Updates public surface docs and exports `clear_rng_cache`. |
| `scmp_kernels/mp/config.py` | New MP configuration + row/group classification primitives and logging/profiling utilities. |
| `scmp_kernels/mp/__init__.py` | Replaces placeholder with re-exporting public MP API. |
      if a.dim() == 3:
          return _sc_matmul_per_row_batched(
              a, b,
    -         group_a=1, group_b=1,
    +         group_a=group_a, group_b=group_b,
              mode=mode, sc_prec=sc_prec, config=config,
              stoc_len=stoc_len,
          )
@copilot explain this in more detail
        Args:
            weight: [out_features, in_features] weight tensor (already quantized).
            group_size: Number of output rows per group. Use 0 or out_features
                for per-row grouping.
            config: RangeMPConfig instance.
            operator: Operator name for per-op threshold lookup.

        Returns:
            List of stoc_len values, one per group.
        """
        out_features, in_features = weight.shape
        if group_size <= 0 or group_size >= out_features:
            group_size = out_features
@copilot explain with code why the docstring says a group_size of 0 or out_features means “per-row grouping”
        out_features, in_features = weight.shape
        if group_size <= 0 or group_size >= out_features:
            group_size = out_features

        num_groups = out_features // group_size
        levels = config.stoc_len_levels
        n_levels = len(levels)
        threshold = config.get_threshold(operator)
        threshold = min(threshold, 0.95)

        # Reshape to [num_groups, group_size * in_features]
        w = weight.reshape(num_groups, -1).float()
        group_max = w.amax(dim=-1)  # [num_groups]
@copilot how do I create an issue to fix this later?
…ntics (#6)

Reported by Copilot in PR #2 (review comment r3223399815). The docstring claimed "Use 0 or out_features for per-row grouping", but the implementation maps both of those values to ``group_size = out_features`` → ``num_groups = 1`` (a single per-tensor group, the *opposite* of per-row). Verified on gl1810: passing ``group_size=0`` against a ``[16, 8]`` weight returns 1 group, not 16.

Keeps the implementation as-is (Option 1 from the audit) and updates the docstring to describe the real contract:

* ``group_size == 1`` → per-row (num_groups = out_features)
* ``group_size <= 0`` or ``>= out_features`` → per-tensor (num_groups = 1)
* any other valid divisor of out_features → out_features // group_size groups

Also documents the existing reshape-divisibility constraint, which otherwise surfaces as an obscure ``RuntimeError: shape '[N, -1]' is invalid`` at the reshape on line 724.

No code change; no behavioral change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
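The contract listed in this commit can be checked without a GPU or torch. The sketch below mirrors only the two lines quoted in the review thread (`group_size` normalisation and the integer division); `resolve_num_groups` is a hypothetical helper name, and the explicit divisibility check is an addition for illustration, standing in for the reshape error the commit message mentions.

```python
def resolve_num_groups(out_features: int, group_size: int) -> int:
    """Return how many groups a given group_size produces (pure-Python sketch)."""
    if group_size <= 0 or group_size >= out_features:
        group_size = out_features          # collapses to one per-tensor group
    if out_features % group_size != 0:
        # Assumption: fail early with a clear message instead of letting a
        # later reshape raise an obscure shape error.
        raise ValueError(
            f"group_size={group_size} does not divide out_features={out_features}"
        )
    return out_features // group_size


for gs in (0, 1, 4, 16, 32):
    print(gs, resolve_num_groups(16, gs))
```

For a 16-row weight this gives 1 group for `group_size` 0, 16, or 32, 16 groups for `group_size=1`, and 4 groups for `group_size=4`, which matches the corrected docstring rather than the original "per-row" wording.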
Small, kernel-only follow-up to closed PR #1. Splits the work into focused commits per @Allenjin123's review feedback (no application code, no legacy archives, no rebase-style sync branch).
Commits (4)

1. `feat(mp): populate mp/ module with MPConfig and adaptive classification` (+790/−1)

   Replaces the `scmp_kernels/mp/__init__.py` placeholder with a real public API. Adds `scmp_kernels/mp/config.py` (762 lines, stdlib + `torch` only, no SC coupling).

   Public surface (re-exported from `scmp_kernels.mp`):
   - `MPConfig`, `AdaptiveMPConfig`, `RangeMPConfig`, `RowAssignment`, `MPDistributionLogger`, `MetricProfiler`
   - `classify_rows_by_metric`, `adaptive_classify_rows`, `classify_groups_by_range`

2. `remove quant clipping margin from SC kernels` (+34/−52)

   Drops the safety margin in both SC quantization paths so the full quant range is used:
   - Bipolar (8-bit): clamp/normalize at ±127 instead of ±125 (was `q_clip = q_norm - 2`). 254 → 256 distinct symmetric levels.
   - Unipolar (8-bit): map to `[0, 255]` instead of `[2, 253]` (was `q_lo, q_hi = 2, q_max - 2`). 252 → 256 distinct levels.

   Mechanical:
   - `fused_quant_bipolar_kernel` params `(q_clip, q_clip_min, q_norm)` → `(q_max)`.
   - `fused_quant_bipolar_perrow_kernel` params `(q_clip, q_norm)` → `(q_max)`.
   - `fused_quantize_bipolar` / `_perrow` wrappers: drop `q_norm` / `q_clip` locals, use `q_max` throughout.
   - `fused_quantize_unipolar`: drop `q_lo, q_hi = 2, q_max - 2`.
   - `_grouped_symmetric_quant` / `_grouped_symmetric_quant_batched`: drop `clip_margin` kwarg.

3. `expose flat-API public surface for application code` (+257/−9), intermediate, reverted by commit 4

   Briefly added flat-API names (`sc_matmul_per_tensor`, `sc_matmul_mlp`, `sc_matmul_grouped`, `sc_matmul_enable_triton`, `sc_matmul_enable_triton_mlp`, `sc_matmul_grouped_enable_triton`, `sc_matmul_enable_batched_bipolar`) to make Q-DiT migration easier. Kept in history for traceability; net effect zero after commit 4.

4. `revert flat-API duplicates; extend sc_matmul to a single elegant dispatcher` (+33/−264)

   Net change: replaces the flat-API zoo with a single extended `sc_matmul` dispatcher.
   - `scmp_kernels/sc/kernels.py`: removes 3 wrappers + 4 aliases from commit 3. File returns to its post-clipping-removal state.
   - `scmp_kernels/sc/matmul.py`: `sc_matmul` gains 3 kwargs: `group_a`, `group_b`, `rng_levels`. Per-row dispatch threads them through to `_sc_matmul_per_row{,_mlp,_batched}`; per-head dispatch threads `rng_levels` through to `_sc_matmul_per_head_bipolar`.
   - `scmp_kernels/sc/__init__.py`: public surface narrowed to: `sc_matmul`, `clear_rng_cache`, `det_kernel_tuning`.

Net public surface after all 4 commits
```python
from scmp_kernels.sc import sc_matmul, clear_rng_cache, det_kernel_tuning
from scmp_kernels.mp import (
    MPConfig, AdaptiveMPConfig, RangeMPConfig, RowAssignment,
    classify_rows_by_metric, adaptive_classify_rows, classify_groups_by_range,
    MPDistributionLogger, MetricProfiler,
)

# Signature sketch (not executable): "|" separates the accepted values.
# sc_matmul(a, b, granularity="per_tensor" | "per_row" | "per_head",
#           *, mode="bipolar" | "unipolar",
#           sc_prec=8, stoc_len=None,
#           chunk_d=0, group_a=1, group_b=1,
#           rng_levels=None, config=None)
```
A single dispatcher reaches every specialised internal path (per-tensor, per-row + group, per-row MLP + chunk_d, per-row batched, per-head bipolar) that the application repo uses.
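As a rough picture of how one entry point can fan out to those five paths, here is a stdlib-only sketch. It is not the code in `matmul.py`: the stub functions only report which path was chosen, tensors are represented by their shape tuples, and the routing conditions are assumptions, apart from the 3-D check for the batched per-row path, which appears in the reviewed diff. In particular, using `chunk_d > 0` to select the MLP path is a guess.

```python
from typing import Any, Dict, Optional, Tuple

Shape = Tuple[int, ...]
Route = Tuple[str, Dict[str, Any]]


# Stand-ins for the private kernels: each just reports which path was taken.
def _per_tensor(a: Shape, b: Shape, **kw: Any) -> Route:
    return "per_tensor", kw


def _per_row(a: Shape, b: Shape, **kw: Any) -> Route:
    return "per_row", kw


def _per_row_mlp(a: Shape, b: Shape, **kw: Any) -> Route:
    return "per_row_mlp", kw


def _per_row_batched(a: Shape, b: Shape, **kw: Any) -> Route:
    return "per_row_batched", kw


def _per_head_bipolar(a: Shape, b: Shape, **kw: Any) -> Route:
    return "per_head_bipolar", kw


def sc_matmul_sketch(
    a: Shape, b: Shape, granularity: str = "per_tensor", *,
    mode: str = "bipolar", sc_prec: int = 8, stoc_len: Optional[int] = None,
    chunk_d: int = 0, group_a: int = 1, group_b: int = 1,
    rng_levels: Optional[int] = None, config: Any = None,
) -> Route:
    """Route one call to a specialised path, forwarding the relevant kwargs."""
    common = dict(mode=mode, sc_prec=sc_prec, stoc_len=stoc_len, config=config)
    if granularity == "per_tensor":
        return _per_tensor(a, b, **common)
    if granularity == "per_row":
        grouped = dict(common, group_a=group_a, group_b=group_b, rng_levels=rng_levels)
        if len(a) == 3:                    # batched input, as in the reviewed diff
            return _per_row_batched(a, b, **grouped)
        if chunk_d > 0:                    # assumption: chunking selects the MLP path
            return _per_row_mlp(a, b, chunk_d=chunk_d, **grouped)
        return _per_row(a, b, **grouped)
    if granularity == "per_head":
        return _per_head_bipolar(a, b, rng_levels=rng_levels, **common)
    raise ValueError(f"unknown granularity: {granularity!r}")


path, kwargs = sc_matmul_sketch((2, 4, 8), (8, 4), "per_row", group_a=2, group_b=2)
print(path, kwargs["group_a"], kwargs["group_b"])
```

The point of the sketch is the kwarg threading: `group_a`/`group_b` reach the per-row paths with the caller's values instead of a hard-coded `1`, which is the behaviour change shown in the review diff.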
Scope guarantees
- … `scmp_kernels/{sc,mp}/`.
- `kernels.py` + `matmul.py` layout preserved (no rename to `sc_triton.py`).

Verified
- No remaining `q_clip` / `q_lo` / `q_hi` / `clip_margin` / `q_norm` references across `scmp_kernels/sc/`.
- … `mp/config.py`.

Companion PR
A separate PR will land on `CrucibleComputingGroup/scmp_diffusion` that consumes this branch via git submodule.
Not verified (no GPU on dev box)
Test plan
🤖 Generated with Claude Code