
Add MP module, drop quant-clipping margin, unify sc_matmul dispatcher #2

Merged
Allenjin123 merged 4 commits into CrucibleComputingGroup:main from heroarmor:add-mp-module on May 12, 2026

Conversation

Collaborator

@heroarmor heroarmor commented May 12, 2026

Small, kernel-only follow-up to closed PR #1. Splits the work into focused commits per @Allenjin123's review feedback (no application code, no legacy archives, no rebase-style sync branch).

Commits (4)

1. feat(mp): populate mp/ module with MPConfig and adaptive classification (+790/−1)

Replaces the scmp_kernels/mp/__init__.py placeholder with a real public API. Adds scmp_kernels/mp/config.py (762 lines, stdlib + torch only — no SC coupling).

Public surface (re-exported from scmp_kernels.mp):

  • MPConfig, AdaptiveMPConfig, RangeMPConfig, RowAssignment, MPDistributionLogger, MetricProfiler
  • classify_rows_by_metric, adaptive_classify_rows, classify_groups_by_range
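The first commit message describes MPConfig as fixed-fraction, quantile-based per-row stoc_len assignment. The idea can be sketched in plain Python as below. This is only an illustration: `assign_stoc_len_by_quantile`, its parameters, the level values, and the choice to give larger metrics longer streams are all assumptions, not the contents of mp/config.py.

```python
from typing import List, Sequence


def assign_stoc_len_by_quantile(
    row_metric: Sequence[float],
    levels: Sequence[int] = (64, 128, 256),
    fractions: Sequence[float] = (0.5, 0.3, 0.2),
) -> List[int]:
    """Assign a stoc_len to each row so fixed fractions of rows share a level.

    Hypothetical helper: rows with the smallest metric receive levels[0],
    rows with the largest metric receive levels[-1].
    """
    if len(levels) != len(fractions):
        raise ValueError("levels and fractions must have the same length")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    n = len(row_metric)
    # Rank row indices by metric, ascending; ties keep their original order.
    order = sorted(range(n), key=lambda i: row_metric[i])

    assignment = [levels[-1]] * n
    start = 0
    cumulative = 0.0
    for level, frac in zip(levels, fractions):
        cumulative += frac
        end = min(n, round(cumulative * n))  # quantile boundary as a row count
        for i in order[start:end]:
            assignment[i] = level
        start = end
    return assignment


metric = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
print(assign_stoc_len_by_quantile(metric))
```

With ten rows and fractions (0.5, 0.3, 0.2), five rows get the shortest level, three the middle one, and two the longest.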

2. remove quant clipping margin from SC kernels (+34/−52)

Drops the safety margin in both SC quantization paths so the full quant range is used:

  • Bipolar (8-bit): clamp/normalize at ±127 instead of ±125 (was q_clip = q_norm - 2). 254 → 256 distinct symmetric levels.
  • Unipolar (8-bit): map to [0, 255] instead of [2, 253] (was q_lo, q_hi = 2, q_max - 2). 252 → 256 distinct levels.

Mechanical:

  • fused_quant_bipolar_kernel params (q_clip, q_clip_min, q_norm) → (q_max).
  • fused_quant_bipolar_perrow_kernel params (q_clip, q_norm) → (q_max).
  • fused_quantize_bipolar / _perrow wrappers: drop q_norm/q_clip locals, use q_max throughout.
  • fused_quantize_unipolar: drop q_lo, q_hi = 2, q_max - 2.
  • _grouped_symmetric_quant / _grouped_symmetric_quant_batched: drop clip_margin kwarg.
  • All 6 call sites updated.
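The effect of dropping the margin can be illustrated with a small pure-Python sketch. It is not the Triton kernel code: `quantize_bipolar`, `quantize_unipolar`, and the `margin` parameter are hypothetical names, with `margin=2` standing in for the old behaviour and `margin=0` for the new one.

```python
from typing import List, Sequence, Tuple


def quantize_bipolar(x: Sequence[float], sc_prec: int = 8, margin: int = 0) -> List[int]:
    """Symmetric quantization to integers in [-(q_max - margin), q_max - margin]."""
    q_max = 2 ** (sc_prec - 1) - 1             # 127 for 8-bit
    q_lim = q_max - margin                     # 125 with the old margin of 2
    abs_max = max(abs(v) for v in x) or 1.0    # guard against an all-zero input
    scale = abs_max / q_lim
    return [max(-q_lim, min(q_lim, round(v / scale))) for v in x]


def quantize_unipolar(
    x: Sequence[float], sc_prec: int = 8, margin: int = 0
) -> Tuple[List[int], float, int]:
    """Affine quantization to integers in [margin, q_max - margin]."""
    q_max = 2 ** sc_prec - 1                   # 255 for 8-bit
    q_lo, q_hi = margin, q_max - margin        # [2, 253] with the old margin
    lo, hi = min(x), max(x)
    range_fp = (hi - lo) or 1.0                # guard against a constant input
    scale = range_fp / (q_hi - q_lo)           # range_fp / q_max when margin == 0
    zp = max(q_lo, min(q_hi, round(q_lo - lo / scale)))
    q = [max(q_lo, min(q_hi, round(v / scale) + zp)) for v in x]
    return q, scale, zp


x = [-1.0, -0.3, 0.0, 0.4, 1.0]
print(quantize_bipolar(x, margin=2)[-1], quantize_bipolar(x)[-1])   # 125 127

y = [0.0, 0.25, 0.5, 1.0]
q_old, _, _ = quantize_unipolar(y, margin=2)
q_new, _, _ = quantize_unipolar(y)
print((q_old[0], q_old[-1]), (q_new[0], q_new[-1]))                 # (2, 253) (0, 255)
```

In the unipolar case the usable codes grow from 252 to 256, which matches the figure quoted above.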

3. expose flat-API public surface for application code (+257/−9) — intermediate, reverted by commit 4

Briefly added flat-API names (sc_matmul_per_tensor, sc_matmul_mlp, sc_matmul_grouped, sc_matmul_enable_triton, sc_matmul_enable_triton_mlp, sc_matmul_grouped_enable_triton, sc_matmul_enable_batched_bipolar) to make Q-DiT migration easier. Kept in history for traceability; net effect zero after commit 4.

4. revert flat-API duplicates; extend sc_matmul to a single elegant dispatcher (+33/−264)

Net change: replaces the flat-API zoo with a single extended sc_matmul dispatcher.

  • scmp_kernels/sc/kernels.py: removes 3 wrappers + 4 aliases from commit 3. File returns to its post-clipping-removal state.
  • scmp_kernels/sc/matmul.py: sc_matmul gains 3 kwargs: group_a, group_b, rng_levels. Per-row dispatch threads them through to _sc_matmul_per_row{,_mlp,_batched}; per-head dispatch threads rng_levels through to _sc_matmul_per_head_bipolar.
  • scmp_kernels/sc/__init__.py: public surface narrowed to: sc_matmul, clear_rng_cache, det_kernel_tuning.

Net public surface after all 4 commits

```python
from scmp_kernels.sc import sc_matmul, clear_rng_cache, det_kernel_tuning
from scmp_kernels.mp import (
    MPConfig, AdaptiveMPConfig, RangeMPConfig, RowAssignment,
    classify_rows_by_metric, adaptive_classify_rows, classify_groups_by_range,
    MPDistributionLogger, MetricProfiler,
)

# Call signature (schematic; "|" separates the accepted string values):
# sc_matmul(a, b, granularity="per_tensor" | "per_row" | "per_head",
#           *, mode="bipolar" | "unipolar",
#           sc_prec=8, stoc_len=None,
#           chunk_d=0, group_a=1, group_b=1,
#           rng_levels=None, config=None)
```

A single dispatcher reaches every specialised internal path (per-tensor, per-row + group, per-row MLP + chunk_d, per-row batched, per-head bipolar) that the application repo uses.
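The routing described above can be sketched structurally as follows. The private helper names come from this PR, but their bodies here are stubs that only record which path was reached. `FakeTensor`, `sc_matmul_sketch`, the `chunk_d > 0` trigger for the MLP path, and the bipolar-only check for per-head are assumptions made for illustration; only the `a.dim() == 3` branch to the batched path is taken from the reviewed diff.

```python
from typing import Any, Callable, Dict


class FakeTensor:
    """Minimal stand-in for a tensor; only dim() is needed for dispatch."""

    def __init__(self, *shape: int) -> None:
        self.shape = shape

    def dim(self) -> int:
        return len(self.shape)


def _stub(path: str) -> Callable[..., Dict[str, Any]]:
    """Build a placeholder for a private kernel entry point (hypothetical body)."""
    def run(a, b, **kwargs) -> Dict[str, Any]:
        return {"path": path, **kwargs}
    return run


_sc_matmul_per_tensor = _stub("per_tensor")
_sc_matmul_per_row = _stub("per_row")
_sc_matmul_per_row_mlp = _stub("per_row_mlp")
_sc_matmul_per_row_batched = _stub("per_row_batched")
_sc_matmul_per_head_bipolar = _stub("per_head_bipolar")


def sc_matmul_sketch(a, b, granularity="per_tensor", *, mode="bipolar",
                     sc_prec=8, stoc_len=None, chunk_d=0,
                     group_a=1, group_b=1, rng_levels=None, config=None):
    common = dict(mode=mode, sc_prec=sc_prec, stoc_len=stoc_len, config=config)
    if granularity == "per_tensor":
        return _sc_matmul_per_tensor(a, b, **common)
    if granularity == "per_row":
        grouped = dict(common, group_a=group_a, group_b=group_b,
                       rng_levels=rng_levels)
        if a.dim() == 3:                      # branch shown in the reviewed diff
            return _sc_matmul_per_row_batched(a, b, **grouped)
        if chunk_d > 0:                       # assumed trigger for the MLP path
            return _sc_matmul_per_row_mlp(a, b, chunk_d=chunk_d, **grouped)
        return _sc_matmul_per_row(a, b, **grouped)
    if granularity == "per_head":
        if mode != "bipolar":                 # assumed: helper is bipolar-only
            raise ValueError("per_head supports only mode='bipolar'")
        return _sc_matmul_per_head_bipolar(a, b, rng_levels=rng_levels, **common)
    raise ValueError(f"unknown granularity: {granularity!r}")


out = sc_matmul_sketch(FakeTensor(2, 4, 8), FakeTensor(8, 2), "per_row",
                       group_a=4, group_b=2)
print(out["path"], out["group_a"], out["group_b"])
```

Keeping one keyword-only signature means callers select a path by `granularity` and tensor rank rather than by importing a differently named function.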

Scope guarantees

  • Only kernel-side files: scmp_kernels/{sc,mp}/.
  • No application code, no evaluation, no legacy CPU references, no rebase-style sync branch.
  • SC kernels.py + matmul.py layout preserved (no rename to sc_triton.py).

Verified

  • AST parses for every modified file.
  • Zero remaining q_clip / q_lo / q_hi / clip_margin / q_norm references across scmp_kernels/sc/.
  • All 9 exported MP names are defined in mp/config.py.
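Static checks of this kind can be scripted with the standard library alone. The sketch below works on source strings so it stays self-contained; the helper names and the sample source are hypothetical, and a real run would read the files under scmp_kernels/ instead.

```python
import ast
from typing import List, Set

REMOVED_NAMES = ("q_clip", "q_lo", "q_hi", "clip_margin", "q_norm")


def parses(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def remaining_references(source: str) -> List[str]:
    """Return removed names that still occur as substrings, like a plain grep."""
    return [name for name in REMOVED_NAMES if name in source]


def top_level_names(source: str) -> Set[str]:
    """Collect names bound at module level by def, class, or simple assignment."""
    names: Set[str] = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.Assign):
            names.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return names


sample = '''
class MPConfig:
    pass

def classify_rows_by_metric(metric):
    return metric

q_max = 127
'''

expected_exports = {"MPConfig", "classify_rows_by_metric"}
print(parses(sample))
print(remaining_references(sample))
print(expected_exports - top_level_names(sample))
```

None of these checks execute a kernel, which is why GPU execution is listed separately as not verified.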

Companion PR

A separate PR will land on `CrucibleComputingGroup/scmp_diffusion` that consumes this branch via git submodule.

Not verified (no GPU on dev box)

  • Actual Triton kernel execution. Needs `pytest tests/test_sc_smoke.py` on a CUDA + Triton machine.

Test plan

  • `pytest tests/test_sc_smoke.py` on CUDA box.
  • Smoke-import `from scmp_kernels.sc import sc_matmul` and `from scmp_kernels.mp import AdaptiveMPConfig` on a box with torch.

🤖 Generated with Claude Code

heroarmor and others added 4 commits May 11, 2026 20:54

Replaces the mp/ placeholder __init__.py with a real public API.

scmp_kernels/mp/config.py (new, 762 lines):
- MPConfig: fixed-fraction quantile-based per-row stoc_len assignment.
- AdaptiveMPConfig: timestep-adaptive thresholds (per-operator, per-layer)
  with optional threshold_table_path JSON loader.
- RangeMPConfig + classify_groups_by_range: range-bucket assignment for
  per-group quant operators.
- RowAssignment, classify_rows_by_metric, adaptive_classify_rows: row
  classification primitives consumed by SC attention/MLP kernels.
- MPDistributionLogger, MetricProfiler: instrumentation.

scmp_kernels/mp/__init__.py:
- Re-exports the 9 public names above.

No SC, application, or evaluation changes — those are deliberately
deferred per team discussion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the safety-margin clipping in both bipolar and unipolar quant
paths so the full quantization range is used:

- bipolar (8-bit): clamp/normalize at ±127 instead of ±125 (was
  q_clip = q_norm - 2). 254 → 256 distinct symmetric levels.
- unipolar (8-bit): map to [0, 255] instead of [2, 253] (was
  q_lo, q_hi = 2, q_max - 2). 252 → 256 distinct levels.

Mechanical changes in scmp_kernels/sc/kernels.py:

- fused_quant_bipolar_kernel: params (q_clip, q_clip_min, q_norm)
  -> (q_max); body uses ±q_max for clamp and q_max for boundary norm.
- fused_quant_bipolar_perrow_kernel: params (q_clip, q_norm) -> (q_max);
  same body simplification.
- fused_quantize_bipolar (Python wrapper): drops local q_norm/q_clip,
  uses q_max throughout; kernel-launch arg list shortened.
- fused_quantize_bipolar_perrow: drops `clip_margin=0` docstring noise,
  passes q_max once instead of (q_max, q_max).
- fused_quantize_unipolar: drops `q_lo, q_hi = 2, q_max - 2` margin;
  scale = range_fp / q_max; zp clamped to [0, q_max].
- _grouped_symmetric_quant: drops `clip_margin` kwarg and local
  `q_clip = q_max - clip_margin`; uses q_max for scale.
- _grouped_symmetric_quant_batched: same.
- 6 call sites: drop `, clip_margin=0`.

AST parses; no remaining q_clip / q_lo / q_hi / clip_margin / q_norm
references in scmp_kernels/sc/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the historical sc_triton.* public names so application repos
(Q-DiT in scmp_diffusion, future ViT/WorldModel apps) can import
specialized SC matmul variants without using underscore-private helpers.

New public names in scmp_kernels/sc/kernels.py (re-exported from sc/__init__.py):

  • sc_matmul_enable_triton            — wrapper: 2D/3D enable-signal matmul,
                                          dispatches bipolar/unipolar.
  • sc_matmul_grouped_enable_triton    — wrapper: enable-signal + per-row-group
                                          quantization, dispatches by mode.
  • sc_matmul_mlp                      — wrapper: non-enable MLP matmul,
                                          delegates to per-tensor sc_matmul
                                          (group_a/group_b/chunk_d accepted
                                          for API parity, unused).
  • sc_matmul_per_tensor               — alias of _sc_matmul_per_tensor.
  • sc_matmul_grouped                  — alias of _sc_matmul_per_row.
  • sc_matmul_enable_batched_bipolar   — alias of _sc_matmul_per_head_bipolar.
  • sc_matmul_enable_triton_mlp        — alias of _sc_matmul_per_row_mlp
                                          (same body as historical wrapper).
  • clear_rng_cache                    — re-exported (was internal).

The granularity-based sc_matmul dispatcher in matmul.py is unchanged.
The four aliases are byte-identical to their private counterparts;
only the three wrappers add real code (~180 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…atcher

Reverts the flat-API surface introduced in fbd7009 — it duplicated the
existing private helpers and conflated two different algorithms (the
deprecated packed-XNOR/AND path vs the enable-signal table-lookup that
now backs everything). Replaces it with one extended dispatcher.

scmp_kernels/sc/kernels.py
  - Removes 3 wrappers (sc_matmul_enable_triton,
    sc_matmul_grouped_enable_triton, sc_matmul_mlp) and 4 aliases
    (sc_matmul_per_tensor, sc_matmul_grouped,
    sc_matmul_enable_batched_bipolar, sc_matmul_enable_triton_mlp).
    Total -222 lines. File returns to its post-clipping-removal state.

scmp_kernels/sc/matmul.py
  - sc_matmul gains 3 kwargs: group_a, group_b, rng_levels.
  - per_row dispatch now threads group_a/group_b/rng_levels through
    to _sc_matmul_per_row{,_mlp,_batched}.
  - per_head dispatch threads rng_levels through to
    _sc_matmul_per_head_bipolar.

scmp_kernels/sc/__init__.py
  - Public surface narrowed to: sc_matmul, clear_rng_cache, det_kernel_tuning.

Application code (Q-DiT in scmp_diffusion) is rewritten in a follow-up
commit to call sc_matmul(..., granularity=...) instead of the deleted
flat-API names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Allenjin123 Allenjin123 requested review from Allenjin123 and Copilot and removed request for Allenjin123 May 12, 2026 02:47

Copilot AI left a comment


Pull request overview

Adds a first-class scmp_kernels.mp public API for mixed-precision (MP) row/group assignment and removes the remaining “quant clipping margin” headroom from SC quantization paths so 8-bit uses the full numeric range.

Changes:

  • Introduces scmp_kernels/mp/config.py implementing MPConfig, AdaptiveMPConfig, RangeMPConfig, and related classification + instrumentation utilities; updates scmp_kernels/mp/__init__.py to re-export the public API.
  • Removes SC quant clipping margins (bipolar and unipolar) by switching to full-range clamp/normalization and dropping clip_margin plumbing in grouped quant helpers.
  • Extends sc_matmul API to accept group_a/group_b and rng_levels, and updates scmp_kernels/sc/__init__.py exports/docs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scmp_kernels/sc/matmul.py | Extends sc_matmul API (grouping + rng_levels) and forwards new knobs into per-row/per-head paths. |
| scmp_kernels/sc/kernels.py | Removes quant clipping margin parameters and updates fused + grouped quantization helpers to use full quant range. |
| scmp_kernels/sc/__init__.py | Updates public surface docs and exports clear_rng_cache. |
| scmp_kernels/mp/config.py | New MP configuration + row/group classification primitives and logging/profiling utilities. |
| scmp_kernels/mp/__init__.py | Replaces placeholder with re-exporting public MP API. |


Comment thread scmp_kernels/sc/matmul.py
Comment on lines 162 to 168
     if a.dim() == 3:
         return _sc_matmul_per_row_batched(
             a, b,
    -        group_a=1, group_b=1,
    +        group_a=group_a, group_b=group_b,
             mode=mode, sc_prec=sc_prec, config=config,
             stoc_len=stoc_len,
         )
Contributor


@copilot explain this feedback in more detail

Comment thread scmp_kernels/mp/config.py
Comment on lines +703 to +716
        Args:
            weight: [out_features, in_features] weight tensor (already quantized).
            group_size: Number of output rows per group. Use 0 or out_features
                for per-row grouping.
            config: RangeMPConfig instance.
            operator: Operator name for per-op threshold lookup.

        Returns:
            List of stoc_len values, one per group.
        """
        out_features, in_features = weight.shape
        if group_size <= 0 or group_size >= out_features:
            group_size = out_features

Contributor


@copilot explain, with code, the point about the docstring saying a group_size of 0 or out_features means “per-row grouping”

Comment thread scmp_kernels/mp/config.py
Comment on lines +713 to +725
        out_features, in_features = weight.shape
        if group_size <= 0 or group_size >= out_features:
            group_size = out_features

        num_groups = out_features // group_size
        levels = config.stoc_len_levels
        n_levels = len(levels)
        threshold = config.get_threshold(operator)
        threshold = min(threshold, 0.95)

        # Reshape to [num_groups, group_size * in_features]
        w = weight.reshape(num_groups, -1).float()
        group_max = w.amax(dim=-1)  # [num_groups]
Contributor


@copilot how do I create an issue to fix this later?

@heroarmor heroarmor changed the title Add MP module + drop SC quant clipping margin Add MP module, drop quant-clipping margin, unify sc_matmul dispatcher May 12, 2026
@Allenjin123 Allenjin123 merged commit fa6b5cd into CrucibleComputingGroup:main May 12, 2026
4 checks passed
heroarmor pushed a commit that referenced this pull request May 17, 2026
…ntics (#6)

Reported by Copilot in PR #2 (review comment r3223399815).

The docstring claimed "Use 0 or out_features for per-row grouping",
but the implementation maps both of those values to
``group_size = out_features`` → ``num_groups = 1`` (a single per-tensor
group, the *opposite* of per-row). Verified on gl1810: passing
``group_size=0`` against a ``[16, 8]`` weight returns 1 group, not 16.

Keeps the implementation as-is (Option 1 from the audit) and updates
the docstring to describe the real contract:

  * ``group_size == 1``                      → per-row (num_groups = out_features)
  * ``group_size <= 0`` or ``>= out_features`` → per-tensor (num_groups = 1)
  * any other valid divisor of out_features  → out_features // group_size groups

Also documents the existing reshape-divisibility constraint, which
otherwise surfaces as an obscure ``RuntimeError: shape '[N, -1]' is
invalid`` at the reshape on line 724.
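The contract listed above can be reproduced without torch. `count_groups` is a hypothetical helper that mirrors the group_size normalization quoted in the review thread. The explicit divisibility check is an addition for illustration: the quoted code has no such check, and per this commit message a violation can surface at the reshape instead.

```python
def count_groups(out_features: int, group_size: int) -> int:
    """Number of groups implied by group_size, following the quoted normalization."""
    if group_size <= 0 or group_size >= out_features:
        group_size = out_features           # collapses to one per-tensor group
    if out_features % group_size != 0:
        # Illustrative guard only; not present in the quoted implementation.
        raise ValueError(
            f"group_size={group_size} does not divide out_features={out_features}"
        )
    return out_features // group_size


for gs in (0, 1, 4, 16, 32):
    print(gs, count_groups(16, gs))
```

For a weight with 16 output rows, only `group_size=1` yields 16 groups; 0, 16, and 32 all yield a single group.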

No code change; no behavioral change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 participants