Add MP module, drop quant-clipping margin, unify sc_matmul dispatcher by heroarmor · Pull Request #2 · CrucibleComputingGroup/scmp_kernels

heroarmor · 2026-05-12T01:18:14Z

Small, kernel-only follow-up to closed PR #1. Splits the work into focused commits per @Allenjin123's review feedback (no application code, no legacy archives, no rebase-style sync branch).

Commits (4)

1. feat(mp): populate mp/ module with MPConfig and adaptive classification (+790/−1)

Replaces the scmp_kernels/mp/__init__.py placeholder with a real public API. Adds scmp_kernels/mp/config.py (762 lines, stdlib + torch only — no SC coupling).

Public surface (re-exported from scmp_kernels.mp):

MPConfig, AdaptiveMPConfig, RangeMPConfig, RowAssignment, MPDistributionLogger, MetricProfiler
classify_rows_by_metric, adaptive_classify_rows, classify_groups_by_range

2. remove quant clipping margin from SC kernels (+34/−52)

Drops the safety margin in both SC quantization paths so the full quant range is used:

Bipolar (8-bit): clamp/normalize at ±127 instead of ±125 (was q_clip = q_norm - 2). 251 → 255 distinct symmetric levels.
Unipolar (8-bit): map to [0, 255] instead of [2, 253] (was q_lo, q_hi = 2, q_max - 2). 252 → 256 distinct levels.

Mechanical:

fused_quant_bipolar_kernel params (q_clip, q_clip_min, q_norm) → (q_max).
fused_quant_bipolar_perrow_kernel params (q_clip, q_norm) → (q_max).
fused_quantize_bipolar / _perrow wrappers: drop q_norm/q_clip locals, use q_max throughout.
fused_quantize_unipolar: drop q_lo, q_hi = 2, q_max - 2.
_grouped_symmetric_quant / _grouped_symmetric_quant_batched: drop clip_margin kwarg.
All 6 call sites updated.
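As a rough illustration of what removing the margin means, the pure-Python sketch below counts the integer codes each clamp range admits and quantizes one value with and without a margin. It is not the Triton kernel: `count_levels` and `quantize_bipolar` are hypothetical helper names, and scaling by the tensor's absolute maximum is an assumption.

```python
def count_levels(lo: int, hi: int) -> int:
    """Number of distinct integer codes in the closed range [lo, hi]."""
    return hi - lo + 1


def quantize_bipolar(x: float, absmax: float, q_max: int = 127, margin: int = 0) -> int:
    """Symmetric quantization of x to [-(q_max - margin), +(q_max - margin)].

    margin=2 mimics the old behaviour (q_clip = q_max - 2); margin=0 uses
    the full range. Abs-max scaling is an assumption made for this sketch.
    """
    if absmax == 0.0:
        return 0  # all-zero input: nothing to scale
    q_clip = q_max - margin
    scale = absmax / q_clip  # real-valued step per integer code
    q = round(x / scale)
    return max(-q_clip, min(q_clip, q))


print(count_levels(-125, 125), count_levels(-127, 127))  # bipolar: 251 255
print(count_levels(2, 253), count_levels(0, 255))        # unipolar: 252 256
print(quantize_bipolar(1.0, absmax=1.0, margin=2))       # 125
print(quantize_bipolar(1.0, absmax=1.0, margin=0))       # 127
```

With the margin removed, the largest-magnitude input maps to the outermost code instead of stopping two codes short.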

3. expose flat-API public surface for application code (+257/−9) — intermediate, reverted by commit 4

Briefly added flat-API names (sc_matmul_per_tensor, sc_matmul_mlp, sc_matmul_grouped, sc_matmul_enable_triton, sc_matmul_enable_triton_mlp, sc_matmul_grouped_enable_triton, sc_matmul_enable_batched_bipolar) to make Q-DiT migration easier. Kept in history for traceability; net effect zero after commit 4.

4. revert flat-API duplicates; extend sc_matmul to a single elegant dispatcher (+33/−264)

Net change: replaces the flat-API zoo with a single extended sc_matmul dispatcher.

scmp_kernels/sc/kernels.py — removes 3 wrappers + 4 aliases from commit 3. File returns to its post-clipping-removal state.
scmp_kernels/sc/matmul.py — sc_matmul gains 3 kwargs: group_a, group_b, rng_levels. Per-row dispatch threads them through to _sc_matmul_per_row{,_mlp,_batched}; per-head dispatch threads rng_levels through to _sc_matmul_per_head_bipolar.
scmp_kernels/sc/__init__.py — public surface narrowed to: sc_matmul, clear_rng_cache, det_kernel_tuning.

Net public surface after all 4 commits

```python
from scmp_kernels.sc import sc_matmul, clear_rng_cache, det_kernel_tuning
from scmp_kernels.mp import (
    MPConfig, AdaptiveMPConfig, RangeMPConfig, RowAssignment,
    classify_rows_by_metric, adaptive_classify_rows, classify_groups_by_range,
    MPDistributionLogger, MetricProfiler,
)

# Signature summary (not a runnable call); "|" separates the accepted values.
# sc_matmul(a, b, granularity="per_tensor" | "per_row" | "per_head",
#           *, mode="bipolar" | "unipolar",
#           sc_prec=8, stoc_len=None,
#           chunk_d=0, group_a=1, group_b=1,
#           rng_levels=None, config=None)
```

A single dispatcher reaches every specialised internal path (per-tensor, per-row + group, per-row MLP + chunk_d, per-row batched, per-head bipolar) that the application repo uses.
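The routing idea can be sketched without Triton or torch. The stand-in below takes tensor shapes instead of tensors and returns the name of the path it would call plus the keyword arguments it would forward. `sc_matmul_sketch` is a hypothetical name; routing 3D inputs to the batched path follows the review diff in this PR, while using `chunk_d > 0` to select the MLP path and rejecting non-bipolar per-head calls are assumptions, not confirmed behaviour.

```python
def sc_matmul_sketch(a_shape, b_shape, granularity="per_tensor", *,
                     mode="bipolar", sc_prec=8, stoc_len=None,
                     chunk_d=0, group_a=1, group_b=1,
                     rng_levels=None, config=None):
    """Return (path_name, forwarded_kwargs) instead of running a kernel."""
    common = dict(mode=mode, sc_prec=sc_prec, stoc_len=stoc_len, config=config)

    if granularity == "per_tensor":
        return "per_tensor", common

    if granularity == "per_row":
        # The grouping knobs and rng_levels are threaded into every per-row path.
        grouped = dict(common, group_a=group_a, group_b=group_b,
                       rng_levels=rng_levels)
        if len(a_shape) == 3:              # batched input
            return "per_row_batched", grouped
        if chunk_d > 0:                    # assumed trigger for the MLP path
            return "per_row_mlp", dict(grouped, chunk_d=chunk_d)
        return "per_row", grouped

    if granularity == "per_head":
        if mode != "bipolar":              # assumption: per-head is bipolar-only
            raise ValueError("per_head supports mode='bipolar' only")
        return "per_head_bipolar", dict(common, rng_levels=rng_levels)

    raise ValueError(f"unknown granularity: {granularity!r}")


path, kwargs = sc_matmul_sketch((2, 16, 64), (64, 32), "per_row", group_a=4)
print(path, kwargs["group_a"])  # per_row_batched 4
```

Keeping one entry point means callers choose a path through arguments rather than by importing a differently named function for each variant.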

Scope guarantees

Only kernel-side files: scmp_kernels/{sc,mp}/.
No application code, no evaluation, no legacy CPU references, no rebase-style sync branch.
SC kernels.py + matmul.py layout preserved (no rename to sc_triton.py).

Verified

AST parses for every modified file.
Zero remaining q_clip / q_lo / q_hi / clip_margin / q_norm references across scmp_kernels/sc/.
All 9 exported MP names are defined in mp/config.py.

Companion PR

A separate PR will land on `CrucibleComputingGroup/scmp_diffusion` that consumes this branch via git submodule.

Not verified (no GPU on dev box)

Actual Triton kernel execution. Needs `pytest tests/test_sc_smoke.py` on a CUDA + Triton machine.

Test plan

`pytest tests/test_sc_smoke.py` on CUDA box.
Smoke-import `from scmp_kernels.sc import sc_matmul` and `from scmp_kernels.mp import AdaptiveMPConfig` on a box with torch.
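The smoke-import step could be scripted roughly as follows. This is a sketch, not part of the repository: `smoke_import` is a hypothetical helper that imports each module, checks that the listed names exist, and collects failures instead of stopping at the first one.

```python
import importlib


def smoke_import(targets):
    """Import each module and look up each name; return a list of failures."""
    failures = []
    for module_name, names in targets.items():
        try:
            module = importlib.import_module(module_name)
        except Exception as exc:  # ImportError, or any error raised at import time
            failures.append(f"{module_name}: {exc!r}")
            continue
        for name in names:
            if not hasattr(module, name):
                failures.append(f"{module_name}.{name}: missing")
    return failures


if __name__ == "__main__":
    problems = smoke_import({
        "scmp_kernels.sc": ["sc_matmul"],
        "scmp_kernels.mp": ["AdaptiveMPConfig"],
    })
    print("OK" if not problems else "\n".join(problems))
```

This only proves that the modules import and the names resolve; it does not exercise any Triton kernel.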

🤖 Generated with Claude Code

Replaces the mp/ placeholder __init__.py with a real public API.

scmp_kernels/mp/config.py (new, 762 lines):
- MPConfig: fixed-fraction quantile-based per-row stoc_len assignment.
- AdaptiveMPConfig: timestep-adaptive thresholds (per-operator, per-layer) with optional threshold_table_path JSON loader.
- RangeMPConfig + classify_groups_by_range: range-bucket assignment for per-group quant operators.
- RowAssignment, classify_rows_by_metric, adaptive_classify_rows: row classification primitives consumed by SC attention/MLP kernels.
- MPDistributionLogger, MetricProfiler: instrumentation.

scmp_kernels/mp/__init__.py:
- Re-exports the 9 public names above.

No SC, application, or evaluation changes; those are deliberately deferred per team discussion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
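To give a feel for "fixed-fraction quantile-based per-row stoc_len assignment", here is a minimal pure-Python sketch. It does not use the `MPConfig` API, whose fields are not shown in this PR: `assign_stoc_len_by_quantile`, its `fractions` and `levels` parameters, and the choice to give higher-metric rows longer streams are all illustrative assumptions.

```python
def assign_stoc_len_by_quantile(metric, fractions, levels):
    """Assign one stoc_len per row from fixed fractions of the sorted metric.

    metric:    one score per row (for example a row's max magnitude).
    fractions: share of rows per bucket, lowest-metric bucket first.
    levels:    stoc_len for each bucket, in the same order as fractions.
    """
    if len(fractions) != len(levels):
        raise ValueError("fractions and levels must have the same length")
    n = len(metric)
    order = sorted(range(n), key=lambda i: metric[i])  # row indices, ascending metric
    assignment = [levels[-1]] * n
    start, cumulative = 0, 0.0
    for bucket, (fraction, level) in enumerate(zip(fractions, levels)):
        cumulative += fraction
        # The last bucket always absorbs the remainder, so rounding never drops rows.
        end = n if bucket == len(levels) - 1 else min(n, round(cumulative * n))
        for row in order[start:end]:
            assignment[row] = level
        start = end
    return assignment


print(assign_stoc_len_by_quantile([0.1, 0.9, 0.5, 0.3], [0.5, 0.5], [64, 256]))
# [64, 256, 256, 64]
```

Because the fractions are fixed, the number of rows at each stream length is known in advance, whatever the metric's actual distribution.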

Drops the safety-margin clipping in both bipolar and unipolar quant paths so the full quantization range is used:
- bipolar (8-bit): clamp/normalize at ±127 instead of ±125 (was q_clip = q_norm - 2). 251 → 255 distinct symmetric levels.
- unipolar (8-bit): map to [0, 255] instead of [2, 253] (was q_lo, q_hi = 2, q_max - 2). 252 → 256 distinct levels.

Mechanical changes in scmp_kernels/sc/kernels.py:
- fused_quant_bipolar_kernel: params (q_clip, q_clip_min, q_norm) -> (q_max); body uses ±q_max for clamp and q_max for boundary norm.
- fused_quant_bipolar_perrow_kernel: params (q_clip, q_norm) -> (q_max); same body simplification.
- fused_quantize_bipolar (Python wrapper): drops local q_norm/q_clip, uses q_max throughout; kernel-launch arg list shortened.
- fused_quantize_bipolar_perrow: drops `clip_margin=0` docstring noise, passes q_max once instead of (q_max, q_max).
- fused_quantize_unipolar: drops `q_lo, q_hi = 2, q_max - 2` margin; scale = range_fp / q_max; zp clamped to [0, q_max].
- _grouped_symmetric_quant: drops `clip_margin` kwarg and local `q_clip = q_max - clip_margin`; uses q_max for scale.
- _grouped_symmetric_quant_batched: same.
- 6 call sites: drop `, clip_margin=0`.

AST parses; no remaining q_clip / q_lo / q_hi / clip_margin / q_norm references in scmp_kernels/sc/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
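The unipolar change (scale = range_fp / q_max, zero point clamped to [0, q_max]) can be sketched in plain Python. The function names are hypothetical, and deriving the zero point as round(-x_min / scale) is the usual affine-quantization convention, assumed here rather than taken from the kernel.

```python
def unipolar_quant_params(x_min: float, x_max: float, q_max: int = 255):
    """Return (scale, zero_point) mapping [x_min, x_max] onto [0, q_max]."""
    range_fp = x_max - x_min
    if range_fp <= 0.0:
        return 1.0, 0  # degenerate range: any positive scale works
    scale = range_fp / q_max            # full range, no 2-code margin at either end
    zero_point = round(-x_min / scale)  # assumed convention for the zero point
    return scale, max(0, min(q_max, zero_point))


def quantize_unipolar(x: float, scale: float, zero_point: int, q_max: int = 255) -> int:
    """Quantize x to an unsigned code in [0, q_max]."""
    q = round(x / scale) + zero_point
    return max(0, min(q_max, q))


scale, zp = unipolar_quant_params(-1.0, 3.0)
print(zp)                                 # 64
print(quantize_unipolar(-1.0, scale, zp))  # 0
print(quantize_unipolar(3.0, scale, zp))   # 255
```

The range endpoints land exactly on codes 0 and 255, which is what using the full range means for the unipolar path.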

Adds the historical sc_triton.* public names so application repos (Q-DiT in scmp_diffusion, future ViT/WorldModel apps) can import specialized SC matmul variants without using underscore-private helpers.

New public names in scmp_kernels/sc/kernels.py (re-exported from sc/__init__.py):
• sc_matmul_enable_triton (wrapper): 2D/3D enable-signal matmul, dispatches bipolar/unipolar.
• sc_matmul_grouped_enable_triton (wrapper): enable-signal + per-row-group quantization, dispatches by mode.
• sc_matmul_mlp (wrapper): non-enable MLP matmul, delegates to per-tensor sc_matmul (group_a/group_b/chunk_d accepted for API parity, unused).
• sc_matmul_per_tensor: alias of _sc_matmul_per_tensor.
• sc_matmul_grouped: alias of _sc_matmul_per_row.
• sc_matmul_enable_batched_bipolar: alias of _sc_matmul_per_head_bipolar.
• sc_matmul_enable_triton_mlp: alias of _sc_matmul_per_row_mlp (same body as historical wrapper).
• clear_rng_cache: re-exported (was internal).

The granularity-based sc_matmul dispatcher in matmul.py is unchanged. The four aliases are byte-identical to their private counterparts; only the three wrappers add real code (~180 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…atcher

Reverts the flat-API surface introduced in fbd7009. It duplicated the existing private helpers and conflated two different algorithms (the deprecated packed-XNOR/AND path vs the enable-signal table-lookup that now backs everything). Replaces it with one extended dispatcher.

scmp_kernels/sc/kernels.py
- Removes 3 wrappers (sc_matmul_enable_triton, sc_matmul_grouped_enable_triton, sc_matmul_mlp) and 4 aliases (sc_matmul_per_tensor, sc_matmul_grouped, sc_matmul_enable_batched_bipolar, sc_matmul_enable_triton_mlp). Total -222 lines. File returns to its post-clipping-removal state.

scmp_kernels/sc/matmul.py
- sc_matmul gains 3 kwargs: group_a, group_b, rng_levels.
- per_row dispatch now threads group_a/group_b/rng_levels through to _sc_matmul_per_row{,_mlp,_batched}.
- per_head dispatch threads rng_levels through to _sc_matmul_per_head_bipolar.

scmp_kernels/sc/__init__.py
- Public surface narrowed to: sc_matmul, clear_rng_cache, det_kernel_tuning.

Application code (Q-DiT in scmp_diffusion) is rewritten in a follow-up commit to call sc_matmul(..., granularity=...) instead of the deleted flat-API names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a first-class scmp_kernels.mp public API for mixed-precision (MP) row/group assignment and removes the remaining “quant clipping margin” headroom from SC quantization paths so 8-bit uses the full numeric range.

Changes:

Introduces scmp_kernels/mp/config.py implementing MPConfig, AdaptiveMPConfig, RangeMPConfig, and related classification + instrumentation utilities; updates scmp_kernels/mp/__init__.py to re-export the public API.
Removes SC quant clipping margins (bipolar and unipolar) by switching to full-range clamp/normalization and dropping clip_margin plumbing in grouped quant helpers.
Extends sc_matmul API to accept group_a/group_b and rng_levels, and updates scmp_kernels/sc/__init__.py exports/docs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `scmp_kernels/sc/matmul.py` | Extends `sc_matmul` API (grouping + `rng_levels`) and forwards new knobs into per-row/per-head paths. |
| `scmp_kernels/sc/kernels.py` | Removes quant clipping margin parameters and updates fused + grouped quantization helpers to use full quant range. |
| `scmp_kernels/sc/__init__.py` | Updates public surface docs and exports `clear_rng_cache`. |
| `scmp_kernels/mp/config.py` | New MP configuration + row/group classification primitives and logging/profiling utilities. |
| `scmp_kernels/mp/__init__.py` | Replaces placeholder with re-exporting public MP API. |

Allenjin123 · 2026-05-12T20:34:02Z

        if a.dim() == 3:
            return _sc_matmul_per_row_batched(
                a, b,
-                group_a=1, group_b=1,
+                group_a=group_a, group_b=group_b,
                mode=mode, sc_prec=sc_prec, config=config,
                stoc_len=stoc_len,
            )


@copilot explain this feedback in more detail.

Allenjin123 · 2026-05-12T20:35:07Z

+    Args:
+        weight: [out_features, in_features] weight tensor (already quantized).
+        group_size: Number of output rows per group. Use 0 or out_features
+            for per-row grouping.
+        config: RangeMPConfig instance.
+        operator: Operator name for per-op threshold lookup.
+
+    Returns:
+        List of stoc_len values, one per group.
+    """
+    out_features, in_features = weight.shape
+    if group_size <= 0 or group_size >= out_features:
+        group_size = out_features
+


@copilot explain, with code, the claim that a group_size of 0 or out_features means “per-row grouping”.

Allenjin123 · 2026-05-12T20:36:33Z

+    out_features, in_features = weight.shape
+    if group_size <= 0 or group_size >= out_features:
+        group_size = out_features
+
+    num_groups = out_features // group_size
+    levels = config.stoc_len_levels
+    n_levels = len(levels)
+    threshold = config.get_threshold(operator)
+    threshold = min(threshold, 0.95)
+
+    # Reshape to [num_groups, group_size * in_features]
+    w = weight.reshape(num_groups, -1).float()
+    group_max = w.amax(dim=-1)   # [num_groups]


@copilot how do I create an issue to fix this later?

…ntics (#6)

Reported by Copilot in PR #2 (review comment r3223399815). The docstring claimed "Use 0 or out_features for per-row grouping", but the implementation maps both of those values to ``group_size = out_features`` → ``num_groups = 1`` (a single per-tensor group, the *opposite* of per-row). Verified on gl1810: passing ``group_size=0`` against a ``[16, 8]`` weight returns 1 group, not 16.

Keeps the implementation as-is (Option 1 from the audit) and updates the docstring to describe the real contract:

* ``group_size == 1`` → per-row (num_groups = out_features)
* ``group_size <= 0`` or ``>= out_features`` → per-tensor (num_groups = 1)
* any other valid divisor of out_features → out_features // group_size groups

Also documents the existing reshape-divisibility constraint, which otherwise surfaces as an obscure ``RuntimeError: shape '[N, -1]' is invalid`` at the reshape on line 724.

No code change; no behavioral change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
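The contract described in this commit can be reproduced with a few lines of plain Python that mirror the normalization in `classify_groups_by_range`. `resolve_num_groups` is a hypothetical helper written for illustration, and the explicit divisibility check is an addition; per the commit message, the real function fails later, at the reshape.

```python
def resolve_num_groups(out_features: int, group_size: int) -> int:
    """Mirror the group_size normalization and return the number of groups."""
    if group_size <= 0 or group_size >= out_features:
        group_size = out_features  # one group spanning every row (per-tensor)
    if out_features % group_size != 0:
        # Added for clarity in this sketch; not present in the reviewed code.
        raise ValueError(
            f"group_size={group_size} does not divide out_features={out_features}"
        )
    return out_features // group_size


for group_size in (0, 16, 1, 4):
    print(group_size, resolve_num_groups(16, group_size))
# 0 1
# 16 1
# 1 16
# 4 4
```

Only `group_size=1` yields one group per row; 0 and `out_features` both collapse to a single group, matching the corrected docstring.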

heroarmor and others added 4 commits May 11, 2026 20:54

Allenjin123 requested review from Allenjin123 and Copilot and removed request for Allenjin123 May 12, 2026 02:47

Copilot started reviewing on behalf of Allenjin123 May 12, 2026 02:48

Copilot AI reviewed May 12, 2026

heroarmor changed the title ~~Add MP module + drop SC quant clipping margin~~ Add MP module, drop quant-clipping margin, unify sc_matmul dispatcher May 12, 2026

This was referenced May 12, 2026

Bootstrap scmp_diffusion from scmp_llm CrucibleComputingGroup/scmp_diffusion#1

Merged

Consolidate Triton kernels (18 → 7, -749 lines) #3

Merged

Allenjin123 merged commit fa6b5cd into CrucibleComputingGroup:main May 12, 2026
4 checks passed

This was referenced May 15, 2026

Fix: plumb rng_levels through sc_matmul's 3D per-row path #4

Merged

Docs: classify_groups_by_range — fix group_size docstring to match impl #6

Merged

Allenjin123 merged 4 commits into CrucibleComputingGroup:main from heroarmor:add-mp-module
