fix(mid): seed BitMatrixSampler explicitly to restore test reproducibility #43

Merged: ivanbasov merged 8 commits into NVIDIA:main from ivanbasov:worktree-mid-running on Apr 7, 2026

Conversation

@ivanbasov
Member

Summary

  • torch.manual_seed() does not control cuQuantum's BitMatrixSampler internal RNG, so two mid-GPU tests that relied on it for cross-call reproducibility were failing non-deterministically.
  • Add an optional seed: int | None = None parameter to dem_sampling() and MemoryCircuitTorch.generate_batch(). When a seed is provided, a fresh BitMatrixSampler is always created with that seed (passed as BitMatrixSampler(..., seed=seed); the Options(seed=N) form first attempted in this PR is not accepted by Options.__init__()), which resets its internal RNG so that repeated calls with the same seed return identical outputs. Production paths (seed=None) are unaffected: the module-level cache is reused exactly as before.
  • Fix test_he_reduces_error_weight and test_full_pipeline_w2_reproducible to pass seed= explicitly instead of calling torch.manual_seed().

Root cause: Commit 5aeebdf removed the pure-torch fallback (which was controlled by torch.manual_seed()), making BitMatrixSampler the sole backend. The two mid tests were written when the torch path still existed and were never updated to account for cuST's independent RNG state.
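
The root cause can be reproduced in miniature without cuQuantum. The sketch below is only an analogy built on Python's standard `random` module: `random.seed()` stands in for `torch.manual_seed()`, and a private `random.Random` instance stands in for the sampler's internal RNG. `IndependentSampler` and `run_with_global_seed` are hypothetical names for illustration, not part of cuQuantum or this repository.

```python
import random


class IndependentSampler:
    """Hypothetical stand-in for a sampler that owns its own RNG."""

    def __init__(self, seed=None):
        # A private generator: seeding the global RNG never touches it.
        self._rng = random.Random(seed)

    def sample(self, n):
        return [self._rng.getrandbits(1) for _ in range(n)]


def run_with_global_seed(sampler, n=64):
    # Seeding the global RNG (the analogue of torch.manual_seed) ...
    random.seed(123)
    # ... has no effect on the sampler's private generator.
    return sampler.sample(n)


# A long-lived, unseeded sampler (like a cached one): its RNG state simply
# advances between calls, so the two results differ despite the global seed.
shared = IndependentSampler()
first = run_with_global_seed(shared)
second = run_with_global_seed(shared)

# A fresh sampler built with an explicit seed restarts from the same state.
seeded_a = IndependentSampler(seed=123).sample(64)
seeded_b = IndependentSampler(seed=123).sample(64)
```

Here `seeded_a` and `seeded_b` are identical, while `first` and `second` agree only by chance, which is the failure mode the two mid tests ran into.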

Test plan

  • Re-run the NVIDIA/Ising-Decoding CI run that failed (mid-gpu-tests / "HE compile tests"): both test_he_reduces_error_weight and test_full_pipeline_w2_reproducible should now pass.
  • Confirm all other mid-GPU tests (71 total) still pass.
  • Confirm no regression in other GPU/CPU test suites (sampler cache path unchanged when seed=None).

🤖 Generated with Claude Code

ivanbasov and others added 6 commits March 30, 2026 11:54
…fault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ility

torch.manual_seed() does not control cuQuantum's BitMatrixSampler internal
RNG, so the two mid-GPU tests that relied on it for reproducibility were
non-deterministic and intermittently failing.

Add an optional `seed` parameter to `dem_sampling()` and
`MemoryCircuitTorch.generate_batch()`. When a seed is provided a fresh
BitMatrixSampler is always created with `Options(seed=N)`, resetting its
internal RNG and guaranteeing identical outputs on every call with the same
seed. Production paths (seed=None) are unaffected — the cached sampler is
reused as before.

Update the two failing tests to use the explicit seed kwarg instead of
torch.manual_seed():
- test_he_reduces_error_weight: seed=123
- test_full_pipeline_w2_reproducible: seed=100

Fixes: NVIDIA/Ising-Decoding CI run 23963347042

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TestDEMSamplingReproducibility to test_dem_sampling.py with four cases:
- same seed on CPU produces bit-exact identical frames
- different seeds produce different frames
- unseeded calls still reuse the cached sampler (perf regression guard)
- same seed on GPU produces bit-exact identical frames (GPU-only)

These tests use stochastic p values (0.1–0.9) so they would have caught
the original regression: before the seed= fix, BitMatrixSampler's internal
RNG was not reset between calls, making "same seed" reproducibility
impossible regardless of torch.manual_seed().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
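
The point about stochastic p values can be illustrated with a small, self-contained sketch. It assumes nothing about the real test file: `sample_bits` is a hypothetical Bernoulli sampler built on the standard library, used here only to show why degenerate probabilities (0 or 1) would make a reproducibility test pass vacuously.

```python
import random


def sample_bits(probs, shots, seed):
    """Draw `shots` rows of Bernoulli bits, one column per probability."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so p=0.0 always gives 0 and p=1.0 always gives 1.
    return [[int(rng.random() < p) for p in probs] for _ in range(shots)]


degenerate = [0.0, 1.0, 0.0, 1.0]        # outcomes do not depend on the RNG
stochastic = [0.1, 0.3, 0.5, 0.7, 0.9]   # outcomes do depend on the RNG

# With degenerate probabilities any two seeds agree, so a "same seed gives
# same frames" check would pass even if seeding were completely broken.
same_despite_seeds = (
    sample_bits(degenerate, 32, seed=1) == sample_bits(degenerate, 32, seed=2)
)

# With stochastic probabilities the same seed reproduces the frames exactly,
# and different seeds give different frames.
reproducible = (
    sample_bits(stochastic, 32, seed=1) == sample_bits(stochastic, 32, seed=1)
)
seeds_differ = (
    sample_bits(stochastic, 32, seed=1) != sample_bits(stochastic, 32, seed=2)
)
```

Only the stochastic case can distinguish a working seed from an ignored one, which is why tests of this kind need probabilities strictly between 0 and 1.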
@ivanbasov ivanbasov requested review from bmhowe23 and kvmto April 6, 2026 15:58
ivanbasov and others added 2 commits April 6, 2026 09:49
… seedable

Options.__init__() does not accept a 'seed' keyword — the cuST
BitMatrixSampler's internal RNG is not exposed via the public API.

Replace the attempted Options(seed=N) approach with a small pure-torch
fallback (_torch_dem_sampling) that uses a local torch.Generator seeded
to the requested value.  This path is only taken when seed= is explicitly
passed (tests); the production BitMatrixSampler cache path is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BitMatrixSampler accepts seed as a constructor kwarg (not via Options).
Replace the torch fallback workaround with the correct cuST API:
pass seed= directly to BitMatrixSampler(..., seed=seed).

A fresh sampler is created on every seeded call so its internal RNG is
reset to the requested seed, guaranteeing identical outputs on repeated
calls with the same value.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@danlkv

danlkv commented Apr 6, 2026

The FrameSimulator docs provide an example of the different usages of the seed arg; BitMatrixSampler works the same way.

@ivanbasov
Member Author

> The FrameSimulator docs provide an example of the different usages of the seed arg; BitMatrixSampler works the same way.

Thank you, @danlkv! Fixed the way you suggested

@ivanbasov ivanbasov merged commit d09beb7 into NVIDIA:main Apr 7, 2026
17 checks passed
@ivanbasov ivanbasov deleted the worktree-mid-running branch April 7, 2026 17:41
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…ility (#43)

* fix(ci): disable torch.compile in orientation training to prevent segfault

torch.compile=on combined with DataLoader spawn workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the Train all orientations step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Revert "fix(ci): disable torch.compile in orientation training to prevent segfault"

This reverts commit 7f0f6c8.

* fix(mid): seed BitMatrixSampler explicitly to restore test reproducibility

torch.manual_seed() does not control cuQuantum's BitMatrixSampler internal
RNG, so the two mid-GPU tests that relied on it for reproducibility were
non-deterministic and intermittently failing.

Add an optional `seed` parameter to `dem_sampling()` and
`MemoryCircuitTorch.generate_batch()`. When a seed is provided a fresh
BitMatrixSampler is always created with `Options(seed=N)`, resetting its
internal RNG and guaranteeing identical outputs on every call with the same
seed. Production paths (seed=None) are unaffected — the cached sampler is
reused as before.

Update the two failing tests to use the explicit seed kwarg instead of
torch.manual_seed():
- test_he_reduces_error_weight: seed=123
- test_full_pipeline_w2_reproducible: seed=100

Fixes: NVIDIA/Ising-Decoding CI run 23963347042

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: fix yapf line-break position in need_new condition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test: add dem_sampling reproducibility tests for seed= parameter

Add TestDEMSamplingReproducibility to test_dem_sampling.py with four cases:
- same seed on CPU produces bit-exact identical frames
- different seeds produce different frames
- unseeded calls still reuse the cached sampler (perf regression guard)
- same seed on GPU produces bit-exact identical frames (GPU-only)

These tests use stochastic p values (0.1–0.9) so they would have caught
the original regression: before the seed= fix, BitMatrixSampler's internal
RNG was not reset between calls, making "same seed" reproducibility
impossible regardless of torch.manual_seed().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: use torch.Generator for seeded path; BitMatrixSampler RNG is not seedable

Options.__init__() does not accept a 'seed' keyword — the cuST
BitMatrixSampler's internal RNG is not exposed via the public API.

Replace the attempted Options(seed=N) approach with a small pure-torch
fallback (_torch_dem_sampling) that uses a local torch.Generator seeded
to the requested value.  This path is only taken when seed= is explicitly
passed (tests); the production BitMatrixSampler cache path is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: pass seed directly to BitMatrixSampler constructor

BitMatrixSampler accepts seed as a constructor kwarg (not via Options).
Replace the torch fallback workaround with the correct cuST API:
pass seed= directly to BitMatrixSampler(..., seed=seed).

A fresh sampler is created on every seeded call so its internal RNG is
reset to the requested seed, guaranteeing identical outputs on repeated
calls with the same value.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
ivanbasov added a commit that referenced this pull request Apr 10, 2026
…ility (#43)
