sc/kernels: apply Owen scramble in the rescale branch (fixes halve x4.215 at 128 cycles) #16
Merged
_prepare_rng_prefix's rescale path (grid_levels != base_levels, e.g. anything driven by halve_bipolar_stoc_len) never applied the Owen scramble, so every chunk_d dim shared one Sobol joint trajectory and SC error accumulated across D instead of averaging. Harmless at stoc_len=256 but catastrophic once the cycle count is actually halved to 128 (PPL x4.215 vs fp16 on Llama-3.1-8B / wikitext-2). Add a gated (SC_SCRAMBLE_RESCALE, default off) _owen_scramble before the rescale; XOR is a bijection on [0, base_levels) so marginals and the rescaled grid are unchanged. Recovers x1.056 at the cheap 128-entry table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Allenjin123
approved these changes
May 20, 2026
Problem
Investigating halve_bipolar_stoc_len on Llama-3.1-8B-Instruct / wikitext-2 test (ctx=1024, stride=512, per_row, sc_prec=8, 16384 tokens) surfaced two bugs.
1. The flag doesn't actually halve cycles when a caller passes stoc_len
The halve block only fills the arg that is None. Callers that pass an explicit stoc_len (our SCLinear always passes 256) get only rng_levels halved: stoc_len stays 256, so the kernel still builds cum_indicator over 256 cycles. This was verified by spying on _sc_matmul_bipolar_mlp_chunked.
So any measurement of "halve" taken via a caller that passes stoc_len was secretly still running 256 cycles.
2. At a true 128 cycles, PPL collapses ×4.215
Once stoc_len actually reaches 128, PPL jumps to 26.65 (×4.215 vs fp16).
Root cause: halve forces rng_levels = 2^(sc_prec-1) ≠ base_levels, which routes _prepare_rng_prefix into the rescale branch, and that branch never applies the Owen scramble. With make_sobol_simple_config broadcasting one Sobol pair across all chunk_d dims, every dim shares an identical joint (rng_a, rng_b) trajectory, so SC error accumulates across the D-reduction instead of averaging out. (Same mechanism as #11 / #14.) Tolerable at 256 cycles; catastrophic at 128.
Isolation (cycles fixed at 128, sc_prec=8)
[Table: rng_levels settings, including an sl=128 row and a halve as-is row; cell contents not recoverable]
Toggling only the scramble (grid and cycle count held constant): ×4.215 → ×1.056. The grid coarsening (256→128) is harmless; the missing scramble is the entire problem.
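The accumulate-versus-average mechanism can be illustrated with a toy model. This is only a sketch under loud assumptions, not the kernel's code: a bit-reversed counter stands in for one Sobol dimension, a plain per-dim XOR mask stands in for _owen_scramble, a single unipolar stream replaces the joint (rng_a, rng_b) pair, and the thresholds are all odd because that is where this truncated sequence is biased. The names `estimate` and `summed_error` are hypothetical.

```python
import random

BASE_LEVELS = 256  # 2**sc_prec with sc_prec=8
CYCLES = 128       # the halved cycle count


def bit_reverse(t, bits=8):
    """Reverse the low `bits` bits of t (base-2 van der Corput on an integer grid)."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (t & 1)
        t >>= 1
    return out


# First 128 points of the 256-level sequence: only half a period is consumed.
SEQ = [bit_reverse(t) for t in range(CYCLES)]


def estimate(k, mask=0):
    """Fraction of cycles whose XOR-masked sample falls below threshold k."""
    return sum((r ^ mask) < k for r in SEQ) / CYCLES


def summed_error(thresholds, masks):
    """Signed estimation error summed over all dims (a stand-in for the D-reduction)."""
    return sum(estimate(k, m) - k / BASE_LEVELS for k, m in zip(thresholds, masks))


rng = random.Random(0)
D = 64
thresholds = [2 * rng.randrange(128) + 1 for _ in range(D)]  # odd thresholds only

# Every dim reuses the same trajectory (mask 0): errors share a sign and add up.
shared = summed_error(thresholds, [0] * D)

# Each dim gets its own XOR mask: error signs differ per dim and partly cancel.
scrambled = summed_error(thresholds, [rng.randrange(BASE_LEVELS) for _ in range(D)])

print(shared, scrambled)
```

In this toy, the shared sequence overestimates every dim by exactly 1/256, so the summed error is D/256 = 0.25. With per-dim masks, the sign of each dim's error depends on the mask's low bit, so the sum stays well below that. XOR with a fixed mask still permutes [0, BASE_LEVELS), which is why a full 256-cycle period is unaffected.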
Fix (this PR)
Apply _owen_scramble in the rescale branch, before rescaling onto the coarser grid. XOR is a bijection on [0, base_levels), so marginals (and the rescaled grid) are unchanged; SC_OWEN_MODE still selects the family.
Gated behind SC_SCRAMBLE_RESCALE (default off → no behavior change) so it can be reviewed/benchmarked before becoming the default.
Result: recovers ×1.056 at the cheap 128-entry enable table, i.e. 128 cycles, the small table, and full quality together, and ~24% faster than the 256-grid sl=128 (518 s vs 674 s).
Status / proposed follow-ups
SC_SCRAMBLE_RESCALE=1 recovers ×1.056 at 128 cyc / 128 grid.
halve_bipolar_stoc_len should override stoc_len (not only fill None), or the docs should state that callers must pass stoc_len=None to get the cycle saving.
🤖 Generated with Claude Code
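The stoc_len follow-up might look roughly like the sketch below. It is an illustration only: `halve_fill_only` and `halve_override` are hypothetical names, their signatures are assumptions, and the halved value 2^(sc_prec-1) is taken from the description above rather than from the repository's actual halve block.

```python
def halve_fill_only(sc_prec, stoc_len=None, rng_levels=None):
    """Behavior described in Problem 1: halve only the arguments left as None."""
    half = 2 ** (sc_prec - 1)
    if stoc_len is None:
        stoc_len = half
    if rng_levels is None:
        rng_levels = half
    return stoc_len, rng_levels


def halve_override(sc_prec, stoc_len=None, rng_levels=None):
    """Proposed follow-up: the flag always sets stoc_len, even when one was passed."""
    half = 2 ** (sc_prec - 1)
    if rng_levels is None:
        rng_levels = half
    return half, rng_levels


# A caller that always passes stoc_len=256, as SCLinear is said to do.
print(halve_fill_only(8, stoc_len=256))  # (256, 128): cycles are not halved
print(halve_override(8, stoc_len=256))   # (128, 128): cycles are halved
```

Overriding silently changes what an explicit stoc_len means, which is presumably why the alternative of documenting stoc_len=None is listed alongside it.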