fix(test): stabilize flaky LER tests in CI by ivanbasov · Pull Request #6 · NVIDIA/Ising-Decoding

ivanbasov · 2026-03-05T18:53:05Z

Summary

* Increase the CI sample count for test_ler_improves_with_bd_noise_model from 2,000 to 20,000 so the boundary-detector LER comparison has enough statistical power.
* Widen the d=13 LER improvement tolerance from 1e-4 to 2e-4 in test_inference_d13_noise25p_ler_quality, raising the guard from ~2.5 sigma to ~5 sigma.

Root cause

test_ler_improves_with_bd_noise_model (test_boundary_detectors.py):
With only 2,000 CI samples at p=0.002 (d=5), the test produced ~5 logical errors. Discrete error counts frequently coincided, causing assertLess to fail spuriously:

AssertionError: np.float64(0.0025) not less than np.float64(0.0025)

Observed in PR #2 CI run. Increasing to 20,000 samples yields ~50 errors — enough for reliable separation while adding negligible CI time (Stim sampling is fast).
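The effect of the sample count on spurious failures can be sketched with exact binomial arithmetic. This is only an illustration: the PR does not state the true LERs of the two circuits, so `P_WITHOUT_BD` and `P_WITH_BD` below are hypothetical placeholders, and the two error counts are assumed to be independent.

```python
import math

# Hypothetical true logical error rates (NOT taken from the PR).
P_WITHOUT_BD = 0.0025
P_WITH_BD = 0.0015


def binom_pmf(k, n, p):
    """Binomial pmf, evaluated in log space so large n stays stable."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def prob_assert_less_fails(n, p_with, p_without, k_max=400):
    """P(count_with >= count_without), i.e. assertLess(ler_with, ler_without) fails.

    Counts above k_max are ignored; at these rates that tail is negligible.
    """
    k_max = min(k_max, n)
    fail = 0.0
    cdf_without = 0.0  # running P(count_without <= k)
    for k in range(k_max + 1):
        cdf_without += binom_pmf(k, n, p_without)
        fail += binom_pmf(k, n, p_with) * cdf_without
    return fail


def prob_tie(n, p_with, p_without, k_max=400):
    """P(count_with == count_without): the exact tie seen in the traceback."""
    k_max = min(k_max, n)
    return sum(
        binom_pmf(k, n, p_with) * binom_pmf(k, n, p_without)
        for k in range(k_max + 1)
    )


fail_2k = prob_assert_less_fails(2_000, P_WITH_BD, P_WITHOUT_BD)
fail_20k = prob_assert_less_fails(20_000, P_WITH_BD, P_WITHOUT_BD)
tie_2k = prob_tie(2_000, P_WITH_BD, P_WITHOUT_BD)
tie_20k = prob_tie(20_000, P_WITH_BD, P_WITHOUT_BD)

if __name__ == "__main__":
    print(f"n=2000:  P(fail)={fail_2k:.4f}  P(tie)={tie_2k:.4f}")
    print(f"n=20000: P(fail)={fail_20k:.4f}  P(tie)={tie_20k:.4f}")
```

How far the failure probability drops at 20,000 samples depends on the real gap between the two rates, which is why the numbers here should be read as a shape, not as a measurement of this test.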

test_inference_d13_noise25p_ler_quality (test_inference_public_model.py):
With d=13 at 262k shots and LER ~2e-4, the combined standard error for the before/after comparison is ~4e-5. The previous tolerance of 1e-4 (~2.5 sigma) gave a ~1.2% per-run flake probability:

AssertionError: 0.000320 not less than or equal to 0.000314 : Z: LER after (0.000320) > baseline (0.000214) + 0.0001

Observed in PR #6 CI run. Widening to 2e-4 (~5 sigma) effectively eliminates flakes.
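The sigma figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: both LERs are taken to be about 2e-4, both are estimated from 262,000 independent shots (the PR says "262k"), and the binomial standard error is used. The function names are illustrative and do not come from the repository.

```python
import math

SHOTS = 262_000   # "262k shots" from the PR; exact count assumed
LER = 2e-4        # approximate LER quoted in the PR


def combined_se(p_before, p_after, shots):
    """Standard error of the difference of two independent binomial rate estimates."""
    return math.sqrt(
        p_before * (1 - p_before) / shots + p_after * (1 - p_after) / shots
    )


def one_sided_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


se = combined_se(LER, LER, SHOTS)
z_old = 1e-4 / se   # headroom of the previous tolerance, in sigma
z_new = 2e-4 / se   # headroom of the widened tolerance, in sigma

if __name__ == "__main__":
    print(f"combined SE ~ {se:.2e}")
    for name, z in (("1e-4", z_old), ("2e-4", z_new)):
        print(f"tolerance {name}: {z:.2f} sigma, one-sided tail {one_sided_tail(z):.2e}")
```

Under the normal approximation, the two-sided tail at 2.5 sigma is about 1.2%, which matches the quoted flake probability. The PR does not say how that figure was derived, so a two-sided tail (or two independently checked bases) is only one plausible reading.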

Changed files

| File | Change |
| --- | --- |
| `code/tests/test_boundary_detectors.py` | CI sample count 2k → 20k for `test_ler_improves_with_bd_noise_model` |
| `code/tests/test_inference_public_model.py` | `LER_IMPROVEMENT_TOLERANCE` 1e-4 → 2e-4 |
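For orientation, the tolerance check could look roughly like the sketch below. Only the constant name `LER_IMPROVEMENT_TOLERANCE` and the shape of the failure message come from this PR. The helper `check_ler_quality` and its parameters are hypothetical and are not the repository's actual test code.

```python
import unittest

LER_IMPROVEMENT_TOLERANCE = 2e-4  # value after this PR (previously 1e-4)


def check_ler_quality(basis, ler_after, ler_baseline,
                      tolerance=LER_IMPROVEMENT_TOLERANCE):
    """Hypothetical helper: fail if the decoded LER exceeds baseline + tolerance."""
    case = unittest.TestCase()
    case.assertLessEqual(
        ler_after,
        ler_baseline + tolerance,
        f"{basis}: LER after ({ler_after:.6f}) > "
        f"baseline ({ler_baseline:.6f}) + {tolerance}",
    )


def fails_with(tolerance):
    """Return True if the values from the reported CI failure trip the check."""
    try:
        check_ler_quality("Z", 0.000320, 0.000214, tolerance=tolerance)
    except AssertionError:
        return True
    return False


if __name__ == "__main__":
    print("fails at 1e-4:", fails_with(1e-4))
    print("fails at 2e-4:", fails_with(2e-4))
```

Feeding in the values from the reported failure (0.000320 after, 0.000214 baseline) trips the check at the old tolerance and passes at the new one.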

Test plan

* test_ler_improves_with_bd_noise_model keeps the original assertLess; the increased sample count provides the statistical power.
* test_inference_d13_noise25p_ler_quality tolerance widened with ~5-sigma headroom.
* CI passes consistently.

* fix(test): relax flaky LER boundary-detector assertion to assertLessEqual
  With only 2000 CI samples at p=0.002, discrete error counts often coincide for "with BD" and "without BD" circuits, causing assertLess to fail spuriously. Use assertLessEqual, consistent with the sibling test_ler_improves_with_bd_all_orientations. Made-with: Cursor
* fix(test): increase CI sample count from 2k to 20k for LER comparison
  2000 samples at p=0.002 produced only ~5 logical errors, making ties and reversals likely. 20000 samples yields ~50 errors, enough for reliable statistical separation while adding negligible CI time. Made-with: Cursor
* fix(test): restore original assertLess assertion
  Keep the strict assertLess check; the increased sample count (20k) provides enough statistical power to reliably distinguish LER values. Made-with: Cursor
* fix(test): widen d=13 LER improvement tolerance from 1e-4 to 2e-4
  The previous 1e-4 tolerance gave only ~2.5-sigma headroom with combined SE ~4e-5, resulting in ~1.2% per-run flake probability. Widen to 2e-4 (~5 sigma) to eliminate CI flakes while still catching real regressions. Made-with: Cursor
* ci: trigger required status checks
  Made-with: Cursor

ivanbasov and others added 4 commits March 5, 2026 10:52

Merge branch 'main' into fix/flaky-ler-test

f8494ca

fix(test): restore original assertLess assertion

80acbab

Keep the strict assertLess check — the increased sample count (20k) provides enough statistical power to reliably distinguish LER values. Made-with: Cursor

ivanbasov force-pushed the fix/flaky-ler-test branch from cc0f38e to 80acbab on March 5, 2026 19:00

ivanbasov changed the title ~~fix(test): relax flaky LER boundary-detector assertion~~ fix(test): stabilize flaky LER tests in CI Mar 5, 2026

ci: trigger required status checks

f2e8714

Made-with: Cursor

ivanbasov requested a review from bmhowe23 March 5, 2026 22:14

bmhowe23 approved these changes Mar 5, 2026

ivanbasov merged commit cd578cc into NVIDIA:main Mar 5, 2026
22 checks passed

ivanbasov merged 6 commits into NVIDIA:main from ivanbasov:fix/flaky-ler-test
