Add halve_bipolar_stoc_len flag + repo housekeeping #12
Merged
Add a standard Python .gitignore (__pycache__, *.pyc, egg-info, build/dist, common editor/OS files) and git rm --cached the .pyc files that were previously committed under scmp_kernels/.
…halving
When True and mode='bipolar', overrides any unset stoc_len / rng_levels to 2^(sc_prec-1). Bipolar magnitudes only carry sc_prec-1 bits of info (q_max = 2^(sc_prec-1) - 1), so halving the stream and RNG grid preserves the quantization resolution while approximately halving cycles, matching the wu-hpca2022 sign-magnitude optimization. Default False preserves legacy behavior. Forwards rng_levels through all five dispatch paths (per_tensor, per_row, per_row+chunk_d, per_row+3D, per_head).
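The override rule in this commit message can be sketched as a small pure function. This is a minimal illustration, not the code in `scmp_kernels/sc/matmul.py`: the helper name `resolve_stream_params` is hypothetical, and the legacy default of `2 ** sc_prec` for an unset `rng_levels` is an assumption (the PR states that default only for `stoc_len`).

```python
from typing import Optional, Tuple


def resolve_stream_params(
    sc_prec: int,
    mode: str,
    stoc_len: Optional[int] = None,
    rng_levels: Optional[int] = None,
    halve_bipolar_stoc_len: bool = False,
) -> Tuple[int, int]:
    """Hypothetical helper: return the (stoc_len, rng_levels) that would be used."""
    if halve_bipolar_stoc_len and mode == "bipolar":
        halved = 2 ** (sc_prec - 1)
        # Only unset (None) values are overridden; explicit user values win.
        if stoc_len is None:
            stoc_len = halved
        if rng_levels is None:
            rng_levels = halved
    # Legacy defaults (ASSUMED to be 2 ** sc_prec for both parameters).
    full = 2 ** sc_prec
    if stoc_len is None:
        stoc_len = full
    if rng_levels is None:
        rng_levels = full
    return stoc_len, rng_levels


print(resolve_stream_params(8, "bipolar"))                               # (256, 256)
print(resolve_stream_params(8, "bipolar", halve_bipolar_stoc_len=True))  # (128, 128)
```

With the flag set, an explicit `stoc_len=64` is kept as given while the unset `rng_levels` still falls back to the halved value, and `mode="unipolar"` ignores the flag entirely.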
Pull request overview
Adds an opt-in API flag to reduce stochastic stream length/RNG grid in bipolar mode (cycle-halving optimization) while preserving existing default behavior, plus minor repository housekeeping.
Changes:
- Added `halve_bipolar_stoc_len: bool = False` to `sc_matmul`; when enabled in `mode="bipolar"`, it overrides unset `stoc_len`/`rng_levels` to `2 ** (sc_prec - 1)`.
- Fixed `per_tensor` dispatch to forward `rng_levels` into `_sc_matmul_per_tensor`.
- Added a standard Python `.gitignore` to prevent committing bytecode/caches and common build/editor artifacts.
Reviewed changes
Copilot reviewed 1 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `scmp_kernels/sc/matmul.py` | Adds the cycle-halving flag and forwards `rng_levels` through the per-tensor dispatch path. |
| `.gitignore` | Ignores Python cache/bytecode and common local build/editor/OS files. |
Comment on lines +121 to +125

    if halve_bipolar_stoc_len and mode == "bipolar":
        halved = 2 ** (sc_prec - 1)
        if stoc_len is None:
            stoc_len = halved
        if rng_levels is None:
heroarmor (Collaborator) approved these changes on May 20, 2026
LGTM — approving.
Verified statically (no torch/CUDA in my env, so I did not re-run the GPU suite — relying on static review + your stated manual results):
- Override correctly gated on `halve_bipolar_stoc_len and mode == "bipolar"`, only touches unset (`None`) values, preserves explicit user `stoc_len`/`rng_levels`, no-op for unipolar. Default `False` keeps the legacy path byte-identical.
- `rng_levels` is forwarded through all five dispatch entry points (per_tensor / per_row+chunk_d / per_row 3D batched / per_row / per_head).
- Confirmed the latent per_tensor bug: `_sc_matmul_per_tensor` already accepts `rng_levels` but the caller was dropping it. This was harmless before the flag, but would have silently broken halving for per_tensor users. Good catch.
- Math checks out: bipolar `q_max = 2^(sc_prec-1) - 1` ⇒ magnitude spans `2^(sc_prec-1)` levels, so the halved grid loses no resolution.
- `.gitignore` + removing the committed `.pyc` files is welcome housekeeping.
The ~3.5× MSE inflation is a clearly-documented, opt-in tradeoff behind a default-off flag, so no concern there. Nice clean change.
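The resolution argument in the review can be checked with a toy comparator-style stream generator. This is a sketch under stated assumptions: `encode_magnitude` and `distinct_decodable_levels` are hypothetical names (not the kernels in `sc/sng.py`), and a plain counter sequence stands in for the Sobol RNG.

```python
def encode_magnitude(m: int, stoc_len: int, rng_levels: int) -> list:
    """Toy comparator SNG: bit i is 1 when the i-th 'random' value is below m.

    The counter sequence i % rng_levels is an ASSUMED stand-in for a real RNG.
    """
    return [1 if (i % rng_levels) < m else 0 for i in range(stoc_len)]


def distinct_decodable_levels(sc_prec: int, stoc_len: int, rng_levels: int) -> int:
    """Count how many bipolar magnitudes 0..q_max map to distinct ones-counts."""
    q_max = 2 ** (sc_prec - 1) - 1
    counts = {
        sum(encode_magnitude(m, stoc_len, rng_levels)) for m in range(q_max + 1)
    }
    return len(counts)


sc_prec = 8
full = 2 ** sc_prec           # legacy stream length / RNG grid
halved = 2 ** (sc_prec - 1)   # value used when the flag is set
quartered = 2 ** (sc_prec - 2)

print(distinct_decodable_levels(sc_prec, full, full))            # 128
print(distinct_decodable_levels(sc_prec, halved, halved))        # 128
print(distinct_decodable_levels(sc_prec, quartered, quartered))  # 65
```

In this toy model the halved grid still separates all 128 magnitudes, while going one step further collapses the upper half of the range. It says nothing about variance, which is where the reported MSE inflation comes from.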
Summary
- `sc/matmul`: add opt-in `halve_bipolar_stoc_len` kwarg to `sc_matmul`. When `True` and `mode="bipolar"`, halves any unset `stoc_len`/`rng_levels` to `2**(sc_prec-1)`, matching the uSystolic / HUB sign-magnitude cycle-halving optimization (wu-hpca2022). Default `False` preserves all existing behavior.
- `repo`: add a standard Python `.gitignore` and `git rm --cached` 10 stale `.pyc` files that had been committed under `scmp_kernels/__pycache__/`.

Motivation
Bipolar mode is already sign-magnitude with `q_max = 2**(sc_prec-1) - 1`, so the magnitude only carries `sc_prec-1` bits. The default `stoc_len = 2**sc_prec` therefore runs ~2× more cycles than the magnitude grid needs. The flag exposes the paper's optimization as a single switch; legacy callers see zero behavior change.

Tradeoffs measured (per-tensor, sc_prec=8, Sobol RNG)
[Table comparing `halve=False` (default) with `halve=True`; the measured rows are not recoverable.]

The 2× speedup materializes at large problem sizes where the SC inner loop dominates kernel time. The MSE inflation is a real tradeoff (per-D-shared Sobol's variance scales worse than textbook MC); it's the cost the user opts into by setting the flag.
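The cycle-count argument from the Motivation can be written out explicitly. Here $p$ stands for `sc_prec` and $N$ for `stoc_len`; both symbols are introduced only for this derivation.

$$
q_{\max} = 2^{p-1} - 1
\;\Rightarrow\;
\bigl|\{0, 1, \dots, q_{\max}\}\bigr| = 2^{p-1},
\qquad
\frac{N_{\text{default}}}{N_{\text{halved}}} = \frac{2^{p}}{2^{p-1}} = 2 .
$$

For `sc_prec = 8` this gives $q_{\max} = 127$, 128 magnitude levels, and a stream length of 128 instead of 256. The factor of 2 is an upper bound on the speedup, reached only when the SC inner loop dominates kernel time.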
What it does NOT change
- `mode="unipolar"`: flag is a no-op.
- Explicitly set `stoc_len` or `rng_levels`: those user values are preserved.
- `sc/kernels.py`, `sc/sng.py`: unchanged.

Forwarding correctness
The flag's `rng_levels` override is forwarded through all five dispatch entry points: `per_tensor`, `per_row`, `per_row+chunk_d`, `per_row+3D`, `per_head`. (The `per_tensor` forwarding fixed a latent dispatch bug where the caller dropped `rng_levels`. Pre-flag this was harmless because the kwarg was always `None`, but it would have caused per_tensor users to silently get the wrong halved-cycle behavior.)

Test plan
- `tests/` suite unaffected (flag default-off; legacy path byte-identical).
- `halve=False` reproduces baseline MSE; `halve=True` halves `stoc_len` and gives ~2× wallclock speedup at 1024×2048.
- `rng_levels` reaches the kernel. … `False`).
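The forwarding fix described under "Forwarding correctness" can be illustrated with a stub dispatcher. This is a hypothetical sketch, not the code in `scmp_kernels/sc/matmul.py`: only the name `_sc_matmul_per_tensor` comes from the PR, while `dispatch_buggy`, `dispatch_fixed`, and the stub body are invented for illustration.

```python
def _sc_matmul_per_tensor(a, b, stoc_len=None, rng_levels=None):
    """Stub standing in for the real per-tensor path: report what it received."""
    return {"stoc_len": stoc_len, "rng_levels": rng_levels}


def dispatch_buggy(a, b, stoc_len=None, rng_levels=None):
    # Latent bug: the callee accepts rng_levels, but the caller never passes it.
    return _sc_matmul_per_tensor(a, b, stoc_len=stoc_len)


def dispatch_fixed(a, b, stoc_len=None, rng_levels=None):
    # Fix: forward rng_levels so an override actually reaches the kernel path.
    return _sc_matmul_per_tensor(a, b, stoc_len=stoc_len, rng_levels=rng_levels)


halved = 2 ** (8 - 1)  # sc_prec = 8

print(dispatch_buggy(None, None, stoc_len=halved, rng_levels=halved))
# {'stoc_len': 128, 'rng_levels': None}
print(dispatch_fixed(None, None, stoc_len=halved, rng_levels=halved))
# {'stoc_len': 128, 'rng_levels': 128}
```

A check of this shape, repeated for each of the five dispatch paths, is one way to confirm that `rng_levels` reaches the kernel without needing a GPU.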