sc/kernels: apply Owen scramble in the rescale branch (fixes halve x4.215 at 128 cycles) #16

Merged: Allenjin123 merged 1 commit into main from fix/scramble-in-rescale-branch on May 20, 2026

@heroarmor
Collaborator

Problem

Investigating halve_bipolar_stoc_len on Llama-3.1-8B-Instruct / wikitext-2 test (ctx=1024, stride=512, per_row, sc_prec=8, 16384 tokens) surfaced two bugs.

1. The flag doesn't actually halve cycles when a caller passes stoc_len

The halve block only fills the arg that is None:

# Only arguments left as None are filled; an explicit stoc_len passes through unchanged.
if halve_bipolar_stoc_len and mode == "bipolar":
    halved = 2 ** (sc_prec - 1)
    if stoc_len is None:
        stoc_len = halved
    if rng_levels is None:
        rng_levels = halved

Callers that pass an explicit stoc_len (our SCLinear always passes 256) get only rng_levels halved — stoc_len stays 256, so the kernel still builds cum_indicator over 256 cycles. Verified by spying on _sc_matmul_bipolar_mlp_chunked:

[HALVE]    stoc_len=128 cycles, rng_levels=128   (stoc_len handed in as None)
[OLD WIRE] stoc_len=256 cycles, rng_levels=128   (explicit stoc_len=256 + halve)
[NO HALVE] stoc_len=256 cycles, rng_levels=256

So any measurement of "halve" taken via a caller that passes stoc_len was secretly still running 256 cycles.
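
The three spy lines can be reproduced in isolation. This is a stand-alone mock, not the kernel code: `resolve_cycles` is a hypothetical helper, and the fallback to `2 ** sc_prec` when an argument is still None is an assumption inferred from the [NO HALVE] line.

```python
def resolve_cycles(sc_prec, mode, halve_bipolar_stoc_len,
                   stoc_len=None, rng_levels=None):
    """Hypothetical stand-in for the argument handling described in bug 1."""
    if halve_bipolar_stoc_len and mode == "bipolar":
        halved = 2 ** (sc_prec - 1)
        # Fill-only semantics: an explicit value is never touched.
        if stoc_len is None:
            stoc_len = halved
        if rng_levels is None:
            rng_levels = halved
    # Assumed default: the full 2 ** sc_prec grid when nothing else was set.
    full = 2 ** sc_prec
    if stoc_len is None:
        stoc_len = full
    if rng_levels is None:
        rng_levels = full
    return stoc_len, rng_levels


print(resolve_cycles(8, "bipolar", True))                 # HALVE
print(resolve_cycles(8, "bipolar", True, stoc_len=256))   # OLD WIRE
print(resolve_cycles(8, "bipolar", False, stoc_len=256))  # NO HALVE
```

Under these assumptions the second call keeps 256 cycles while only the grid is halved, which is the "OLD WIRE" situation.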

2. At a true 128 cycles, PPL collapses ×4.215

Once stoc_len actually reaches 128, PPL jumps to 26.65 (×4.215 vs fp16).

Root cause: halve forces rng_levels = 2^(sc_prec-1) ≠ base_levels, which routes _prepare_rng_prefix into the rescale branch — and that branch never applies the Owen scramble. With make_sobol_simple_config broadcasting one Sobol pair across all chunk_d dims, every dim shares an identical joint (rng_a, rng_b) trajectory, so SC error accumulates across the D-reduction instead of averaging out. (Same mechanism as #11 / #14.) Tolerable at 256 cycles; catastrophic at 128.
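
A toy illustration of the shared-trajectory mechanism, under loud assumptions: the RNG pair is a bit-reversed counter against a plain counter (a stand-in, not the output of make_sobol_simple_config), every dim multiplies the same two operands, and the per-dim XOR masks are arbitrary constants rather than what _owen_scramble produces. With real, differing operands the per-dim errors would be correlated rather than identical.

```python
def bit_reverse(t, bits):
    """Van der Corput index: reverse the low `bits` bits of t."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (t & 1)
        t >>= 1
    return out


def xnor_ones(a_level, b_level, rng_a, rng_b):
    """Bipolar SC multiply: count cycles where the two comparator bits agree."""
    return sum((ra < a_level) == (rb < b_level) for ra, rb in zip(rng_a, rng_b))


BITS, D = 7, 8
LEVELS = 1 << BITS                      # 128 levels, 128 cycles
base_a = [bit_reverse(t, BITS) for t in range(LEVELS)]
base_b = list(range(LEVELS))
a_level, b_level = 90, 45               # same operands in every dim (toy setup)

# One pair broadcast to all dims: every dim reproduces the same count,
# so the per-dim error is added D times instead of averaging out.
shared = [xnor_ones(a_level, b_level, base_a, base_b) for _ in range(D)]

# Per-dim XOR masks (illustrative constants) give each dim its own trajectory.
masks = [(37 * d + 11) % LEVELS for d in range(D)]
scrambled = [
    xnor_ones(a_level, b_level, [x ^ m for x in base_a], base_b) for m in masks
]

# Expected XNOR-ones count per dim for independent streams.
pa, pb = a_level / LEVELS, b_level / LEVELS
exact_ones = LEVELS * (pa * pb + (1 - pa) * (1 - pb))

print("shared per-dim counts:   ", shared)
print("scrambled per-dim counts:", scrambled)
print("shared total error:   ", sum(shared) - D * exact_ones)
print("scrambled total error:", sum(scrambled) - D * exact_ones)
```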

Isolation (cycles fixed at 128, sc_prec=8)

| grid rng_levels | scramble | PPL | ×fp16 | SC time |
|---|---|---|---|---|
| 256 | on (sl=128) | 6.6797 | 1.056 | 674 s |
| 128 | off (halve as-is) | 26.6532 | 4.215 | 511 s |
| 128 | on (this PR) | 6.6749 | 1.056 | 518 s |

Toggling only the scramble (grid and cycle count held constant): ×4.215 → ×1.056. The grid coarsening (256→128) is harmless — the missing scramble is the entire problem.
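
The reported ratios and timings can be cross-checked with a few lines of arithmetic. The fp16 baseline below is back-computed from the 26.6532 / ×4.215 row; it is an implied value, not a number reported in this PR, and the dictionary keys are labels made up for this check.

```python
# PPL and SC time as reported in the isolation runs (cycles fixed at 128).
ppl = {
    "grid256_scramble": 6.6797,
    "grid128_plain": 26.6532,
    "grid128_scramble": 6.6749,
}
sc_time_s = {
    "grid256_scramble": 674,
    "grid128_plain": 511,
    "grid128_scramble": 518,
}

# Implied fp16 perplexity, derived from the x4.215 row (not measured here).
fp16_ppl = ppl["grid128_plain"] / 4.215

ratios = {name: round(value / fp16_ppl, 3) for name, value in ppl.items()}

# Time comparison between the two scrambled configurations.
time_saved = 1 - sc_time_s["grid128_scramble"] / sc_time_s["grid256_scramble"]
speedup = sc_time_s["grid256_scramble"] / sc_time_s["grid128_scramble"]

print(ratios)
print(f"SC time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```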

Fix (this PR)

Apply _owen_scramble in the rescale branch, before rescaling onto the coarser grid. XOR is a bijection on [0, base_levels), so marginals (and the rescaled grid) are unchanged; SC_OWEN_MODE still selects the family.
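
The bijection argument can be checked directly. This sketch assumes the scramble acts as an XOR with a fixed mask and that rescaling 256 → 128 is an integer right shift; both are simplifications for illustration, and the actual _owen_scramble and _prepare_rng_prefix may differ. The mask values are arbitrary.

```python
from collections import Counter

BASE_LEVELS = 256
GRID_LEVELS = 128
# 256 -> 128 is a single right shift (assumed rescale rule).
SHIFT = (BASE_LEVELS // GRID_LEVELS).bit_length() - 1


def scramble_then_rescale(values, mask):
    """XOR on the base grid first, then coarsen onto the smaller grid."""
    return [(v ^ mask) >> SHIFT for v in values]


base = list(range(BASE_LEVELS))
plain_hist = Counter(v >> SHIFT for v in base)

for mask in (0x00, 0x5A, 0xFF):
    scrambled = [v ^ mask for v in base]
    # XOR with a mask below BASE_LEVELS permutes [0, BASE_LEVELS).
    assert sorted(scrambled) == base
    # The rescaled marginal is the same with or without the scramble.
    assert Counter(scramble_then_rescale(base, mask)) == plain_hist

print("marginals unchanged for all masks")
```

Only the order in which levels are visited changes, which is what decorrelates the per-dim trajectories.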

Gated behind SC_SCRAMBLE_RESCALE (default off → no behavior change) so it can be reviewed/benchmarked before becoming the default.
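
A minimal sketch of a default-off gate of this kind. How the kernel actually parses SC_SCRAMBLE_RESCALE is not shown in this PR; treating only the literal string "1" as on is an assumption, and `scramble_rescale_enabled` is a hypothetical helper name.

```python
import os


def scramble_rescale_enabled(environ=os.environ):
    """Assumed parsing: unset or anything other than "1" leaves the gate off."""
    return environ.get("SC_SCRAMBLE_RESCALE", "0") == "1"


print(scramble_rescale_enabled({}))                              # gate off by default
print(scramble_rescale_enabled({"SC_SCRAMBLE_RESCALE": "1"}))    # gate on
```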

Result: recovers ×1.056 at the cheap 128-entry enable table, i.e. 128 cycles, the small table, and full quality together, with about 23% less SC time than the 256-grid sl=128 run (518 s vs 674 s).

Status / proposed follow-ups

  • Gated fix verified: SC_SCRAMBLE_RESCALE=1 recovers ×1.056 at 128 cyc / 128 grid.
  • Decide whether to make scramble-in-rescale unconditional (data says it's strictly better; default-off is just conservative).
  • Fix bug 1 (the flag not halving cycles when a caller passes stoc_len) separately: make halve_bipolar_stoc_len override stoc_len (not only fill None), or document that callers must pass stoc_len=None to get the cycle saving.
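
One possible shape for the override option in the last follow-up, as a sketch only. `resolve_cycles_override` is a hypothetical helper, not code from this PR; it overrides stoc_len unconditionally and keeps the existing fill-only behavior for rng_levels, which is one reading of the proposal rather than a decided design.

```python
def resolve_cycles_override(sc_prec, mode, halve_bipolar_stoc_len,
                            stoc_len=None, rng_levels=None):
    """Hypothetical variant: the halve flag wins over an explicit stoc_len."""
    if halve_bipolar_stoc_len and mode == "bipolar":
        halved = 2 ** (sc_prec - 1)
        stoc_len = halved              # override, even if the caller passed 256
        if rng_levels is None:         # unchanged: still fill-only
            rng_levels = halved
    return stoc_len, rng_levels


print(resolve_cycles_override(8, "bipolar", True, stoc_len=256))
print(resolve_cycles_override(8, "unipolar", True, stoc_len=256))
```

The trade-off is that a silent override changes behavior for every caller that currently passes stoc_len, so the documentation-only alternative is the lower-risk choice.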

🤖 Generated with Claude Code

_prepare_rng_prefix's rescale path (grid_levels != base_levels, e.g. anything
driven by halve_bipolar_stoc_len) never applied the Owen scramble, so every
chunk_d dim shared one Sobol joint trajectory and SC error accumulated across
D instead of averaging. Harmless at stoc_len=256 but catastrophic once the
cycle count is actually halved to 128 (PPL x4.215 vs fp16 on Llama-3.1-8B /
wikitext-2). Add a gated (SC_SCRAMBLE_RESCALE, default off) _owen_scramble
before the rescale; XOR is a bijection on [0, base_levels) so marginals and the
rescaled grid are unchanged. Recovers x1.056 at the cheap 128-entry table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Allenjin123 merged commit a576b83 into main on May 20, 2026
1 of 2 checks passed
@heroarmor: review requested due to automatic review settings, May 20, 2026 23:09