sc/kernels: apply Owen scramble in the rescale branch (fixes halve x4.215 at 128 cycles) by heroarmor · Pull Request #16 · CrucibleComputingGroup/scmp_kernels

heroarmor · 2026-05-20T22:25:11Z

Problem

Investigating halve_bipolar_stoc_len on Llama-3.1-8B-Instruct / wikitext-2 test (ctx=1024, stride=512, per_row, sc_prec=8, 16384 tokens) surfaced two bugs.

1. The flag doesn't actually halve cycles when a caller passes `stoc_len`

The halve block only fills the arg that is None:

if halve_bipolar_stoc_len and mode == "bipolar":
    halved = 2 ** (sc_prec - 1)
    if stoc_len is None:  stoc_len = halved
    if rng_levels is None: rng_levels = halved

Callers that pass an explicit stoc_len (our SCLinear always passes 256) get only rng_levels halved — stoc_len stays 256, so the kernel still builds cum_indicator over 256 cycles. Verified by spying on _sc_matmul_bipolar_mlp_chunked:

[HALVE]    stoc_len=128 cycles, rng_levels=128   (stoc_len handed in as None)
[OLD WIRE] stoc_len=256 cycles, rng_levels=128   (explicit stoc_len=256 + halve)
[NO HALVE] stoc_len=256 cycles, rng_levels=256

So any measurement of "halve" taken via a caller that passes stoc_len was secretly still running 256 cycles.
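The fill-only behaviour can be reproduced in isolation. The sketch below is a minimal stand-in, not the kernel code: `resolve_fill_only` copies the logic of the quoted halve block, while `resolve_override` is a hypothetical variant (function names and signatures are assumptions made for illustration) in which the flag always sets `stoc_len`.

```python
from typing import Optional, Tuple


def resolve_fill_only(stoc_len: Optional[int], rng_levels: Optional[int],
                      sc_prec: int = 8, halve: bool = True,
                      mode: str = "bipolar") -> Tuple[Optional[int], Optional[int]]:
    """Same logic as the quoted halve block: only None arguments are filled."""
    if halve and mode == "bipolar":
        halved = 2 ** (sc_prec - 1)
        if stoc_len is None:
            stoc_len = halved
        if rng_levels is None:
            rng_levels = halved
    return stoc_len, rng_levels


def resolve_override(stoc_len: Optional[int], rng_levels: Optional[int],
                     sc_prec: int = 8, halve: bool = True,
                     mode: str = "bipolar") -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical variant: the flag overrides an explicit stoc_len."""
    if halve and mode == "bipolar":
        halved = 2 ** (sc_prec - 1)
        stoc_len = halved  # override even when the caller passed a value
        if rng_levels is None:
            rng_levels = halved
    return stoc_len, rng_levels


print(resolve_fill_only(256, None))   # (256, 128): the [OLD WIRE] case
print(resolve_fill_only(None, None))  # (128, 128): the [HALVE] case
print(resolve_override(256, None))    # (128, 128)
```

With the fill-only logic, an explicit `stoc_len=256` survives the flag, which matches the `[OLD WIRE]` line in the spy output.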

2. At a true 128 cycles, PPL collapses ×4.215

Once stoc_len actually reaches 128, PPL jumps to 26.65 (×4.215 vs fp16).

Root cause: halve forces rng_levels = 2^(sc_prec-1) ≠ base_levels, which routes _prepare_rng_prefix into the rescale branch — and that branch never applies the Owen scramble. With make_sobol_simple_config broadcasting one Sobol pair across all chunk_d dims, every dim shares an identical joint (rng_a, rng_b) trajectory, so SC error accumulates across the D-reduction instead of averaging out. (Same mechanism as #11 / #14.) Tolerable at 256 cycles; catastrophic at 128.
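The accumulate-versus-average effect can be seen in a deliberately extreme toy model. This is not the kernel: it uses unipolar AND-multiplication, a worst-case pair of perfectly correlated streams (both driven by the counter `t`), and per-dimension XOR masks `0..D-1` chosen only so the outcome is exact. All names here are hypothetical.

```python
def sc_and_product(a_level, b_level, n, mask_b=0):
    """Estimate (a_level/n) * (b_level/n) by AND-ing two length-n bit streams.

    Stream A compares the counter t to a_level; stream B compares the
    XOR-masked counter to b_level. n must be a power of two, mask_b < n.
    """
    hits = 0
    for t in range(n):
        bit_a = t < a_level
        bit_b = (t ^ mask_b) < b_level
        hits += int(bit_a and bit_b)
    return hits / n


N = 256          # stream length and number of levels
D = 256          # size of the reduction dimension
A = B = 128      # every dim multiplies 0.5 * 0.5

exact = D * (A / N) * (B / N)

# Shared trajectory: every dim sees the same streams, so every dim makes
# the same error and the errors add coherently across D.
shared = sum(sc_and_product(A, B, N) for _ in range(D))

# Per-dim XOR mask: the per-dim errors differ in sign and cancel in the sum.
scrambled = sum(sc_and_product(A, B, N, mask_b=d) for d in range(D))

print(exact, shared, scrambled)  # 64.0 128.0 64.0
```

In this toy the shared case overestimates every product by the same amount, so the reduction is off by a factor of two, while the masked case sums to the exact value. Real Sobol pairs are far less correlated, so the per-dim error is smaller, but it is still identical across dims when one pair is broadcast.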

Isolation (cycles fixed at 128, sc_prec=8)

| grid `rng_levels` | scramble | PPL | ×fp16 | SC time |
|---|---|---|---|---|
| 256 | on (`sl=128`) | 6.6797 | 1.056 | 674 s |
| 128 | off (`halve` as-is) | 26.6532 | 4.215 | 511 s |
| 128 | on (this PR) | 6.6749 | 1.056 | 518 s |

Toggling only the scramble (grid and cycle count held constant): ×4.215 → ×1.056. The grid coarsening (256→128) is harmless — the missing scramble is the entire problem.
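The figures in the isolation table can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers reported above; the row labels are shorthand introduced here.

```python
# (label, PPL, ratio vs fp16, SC time in seconds), copied from the table
rows = [
    ("grid=256, scramble on", 6.6797, 1.056, 674),
    ("grid=128, scramble off", 26.6532, 4.215, 511),
    ("grid=128, scramble on", 6.6749, 1.056, 518),
]

# Each row should imply the same fp16 baseline PPL (PPL / ratio).
implied_fp16 = [ppl / ratio for _, ppl, ratio, _ in rows]
for (label, *_), base in zip(rows, implied_fp16):
    print(f"{label}: implied fp16 PPL = {base:.3f}")

# Relative SC time saved by the 128-grid + scramble run vs the 256-grid run.
time_saving = 1 - rows[2][3] / rows[0][3]
print(f"SC time saving: {time_saving:.1%}")  # 23.1%
```

All three rows imply an fp16 baseline of about 6.32, so the ×fp16 column is internally consistent.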

Fix (this PR)

Apply _owen_scramble in the rescale branch, before rescaling onto the coarser grid. XOR is a bijection on [0, base_levels), so marginals (and the rescaled grid) are unchanged; SC_OWEN_MODE still selects the family.
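The bijection argument can be checked directly. The sketch below assumes the scramble acts as an XOR with a mask below `base_levels` and that rescaling is an integer floor onto the coarser grid; the internals of `_owen_scramble` and of the kernel's rescale are not shown in this PR, so both helpers are hypothetical stand-ins.

```python
from collections import Counter


def xor_scramble(levels, mask):
    """XOR every level with a fixed mask (stand-in for the scramble)."""
    return [v ^ mask for v in levels]


def rescale(levels, base_levels, grid_levels):
    """Map levels in [0, base_levels) onto [0, grid_levels) by integer floor."""
    return [v * grid_levels // base_levels for v in levels]


BASE_LEVELS = 256
GRID_LEVELS = 128
raw = list(range(BASE_LEVELS))

for mask in (0x00, 0x5A, 0xFF):
    scrambled = xor_scramble(raw, mask)
    # Bijection: the scrambled values are a permutation of the originals.
    assert sorted(scrambled) == raw
    # Marginals on the coarse grid: every coarse level is hit exactly twice.
    counts = Counter(rescale(scrambled, BASE_LEVELS, GRID_LEVELS))
    assert all(counts[g] == 2 for g in range(GRID_LEVELS))
```

Only the order in which levels are visited changes, which is what decorrelates the dims; the set of visited levels, and therefore the rescaled grid, stays the same.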

Gated behind SC_SCRAMBLE_RESCALE (default off → no behavior change) so it can be reviewed/benchmarked before becoming the default.
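A default-off environment gate of this kind might look like the sketch below. The helper name is hypothetical, and treating only the string `"1"` as "on" is an assumption; the PR states only that the default is off and that `SC_SCRAMBLE_RESCALE=1` enables the fix.

```python
import os


def scramble_rescale_enabled(environ=None):
    """Return True only when SC_SCRAMBLE_RESCALE is set to "1".

    An unset variable means off, so existing runs keep their behavior.
    """
    env = os.environ if environ is None else environ
    return env.get("SC_SCRAMBLE_RESCALE", "0") == "1"


print(scramble_rescale_enabled({}))                              # False
print(scramble_rescale_enabled({"SC_SCRAMBLE_RESCALE": "1"}))    # True
```

Passing the mapping explicitly keeps the check testable without mutating the process environment.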

Result: recovers ×1.056 with the cheap 128-entry enable table, i.e. 128 cycles, the small table, and full quality together, with about 23% less SC time than the 256-grid sl=128 configuration (518 s vs 674 s).

Status / proposed follow-ups

Gated fix verified: SC_SCRAMBLE_RESCALE=1 recovers ×1.056 at 128 cyc / 128 grid.
Decide whether to make scramble-in-rescale unconditional (data says it's strictly better; default-off is just conservative).
Fix bug 1 (the stoc_len issue above) separately: make halve_bipolar_stoc_len override stoc_len (not only fill None), or document that callers must pass stoc_len=None to get the cycle saving.

🤖 Generated with Claude Code

_prepare_rng_prefix's rescale path (grid_levels != base_levels, e.g. anything driven by halve_bipolar_stoc_len) never applied the Owen scramble, so every chunk_d dim shared one Sobol joint trajectory and SC error accumulated across D instead of averaging. Harmless at stoc_len=256 but catastrophic once the cycle count is actually halved to 128 (PPL x4.215 vs fp16 on Llama-3.1-8B / wikitext-2).

Add a gated (SC_SCRAMBLE_RESCALE, default off) _owen_scramble before the rescale; XOR is a bijection on [0, base_levels) so marginals and the rescaled grid are unchanged. Recovers x1.056 at the cheap 128-entry table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Allenjin123 requested review from Allenjin123 and Copilot May 20, 2026 22:48

Copilot started reviewing on behalf of Allenjin123 May 20, 2026 22:49

Allenjin123 approved these changes May 20, 2026

Allenjin123 merged commit a576b83 into main May 20, 2026
1 of 2 checks passed

heroarmor review requested due to automatic review settings May 20, 2026 23:09

Allenjin123 merged 1 commit into main from fix/scramble-in-rescale-branch
