fix(test): stabilize flaky LER tests in CI #6
Merged
…qual With only 2000 CI samples at p=0.002, discrete error counts often coincide for "with BD" and "without BD" circuits, causing assertLess to fail spuriously. Use assertLessEqual, consistent with the sibling test_ler_improves_with_bd_all_orientations. Made-with: Cursor
2000 samples at p=0.002 produced only ~5 logical errors, making ties and reversals likely. 20000 samples yields ~50 errors — enough for reliable statistical separation while adding negligible CI time. Made-with: Cursor
Keep the strict assertLess check — the increased sample count (20k) provides enough statistical power to reliably distinguish LER values. Made-with: Cursor
cc0f38e to 80acbab
The previous 1e-4 tolerance gave only ~2.5-sigma headroom with combined SE ~4e-5, resulting in ~1.2% per-run flake probability. Widen to 2e-4 (~5 sigma) to eliminate CI flakes while still catching real regressions. Made-with: Cursor
Made-with: Cursor
bmhowe23 approved these changes on Mar 5, 2026
ivanbasov added a commit that referenced this pull request on Apr 10, 2026

* fix(test): relax flaky LER boundary-detector assertion to assertLessEqual

  With only 2000 CI samples at p=0.002, discrete error counts often coincide for "with BD" and "without BD" circuits, causing assertLess to fail spuriously. Use assertLessEqual, consistent with the sibling test_ler_improves_with_bd_all_orientations.

  Made-with: Cursor

* fix(test): increase CI sample count from 2k to 20k for LER comparison

  2000 samples at p=0.002 produced only ~5 logical errors, making ties and reversals likely. 20000 samples yields ~50 errors, enough for reliable statistical separation while adding negligible CI time.

  Made-with: Cursor

* fix(test): restore original assertLess assertion

  Keep the strict assertLess check: the increased sample count (20k) provides enough statistical power to reliably distinguish LER values.

  Made-with: Cursor

* fix(test): widen d=13 LER improvement tolerance from 1e-4 to 2e-4

  The previous 1e-4 tolerance gave only ~2.5-sigma headroom with combined SE ~4e-5, resulting in ~1.2% per-run flake probability. Widen to 2e-4 (~5 sigma) to eliminate CI flakes while still catching real regressions.

  Made-with: Cursor

* ci: trigger required status checks

  Made-with: Cursor
Summary
* Increase the CI sample count in test_ler_improves_with_bd_noise_model from 2,000 to 20,000 so the boundary-detector LER comparison has enough statistical power
* Widen the LER improvement tolerance from 1e-4 to 2e-4 in test_inference_d13_noise25p_ler_quality, raising the guard from ~2.5 sigma to ~5 sigma

Root cause
test_ler_improves_with_bd_noise_model (test_boundary_detectors.py): With only 2,000 CI samples at p=0.002 (d=5), the test produced ~5 logical errors. Discrete error counts frequently coincided, causing assertLess to fail spuriously (observed in the PR #2 CI run). Increasing to 20,000 samples yields ~50 errors, enough for reliable separation while adding negligible CI time (Stim sampling is fast).
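
The tie-and-reversal argument above can be sketched numerically. The snippet below is a rough illustration, not the test's own code: it models the two error counts as independent Poisson variables and computes the probability that the "with BD" count is greater than or equal to the "without BD" count, which is the condition under which assertLess fails. The baseline rate of 5 errors per 2,000 shots is taken from the description, on the assumption that it refers to the "without BD" circuit; the "with BD" rate (half the baseline) and the function names are purely hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events, evaluated in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def prob_assert_less_fails(shots: int, ler_with: float, ler_without: float) -> float:
    """P(count_with >= count_without) for independent Poisson error counts.

    With equal shot counts, assertLess(ler_with, ler_without) fails exactly
    when the "with BD" count ties or exceeds the "without BD" count.
    """
    lam_with = shots * ler_with
    lam_without = shots * ler_without
    lam_max = max(lam_with, lam_without)
    # Truncate the sums far into the upper tail of the larger mean.
    k_max = int(lam_max + 12 * math.sqrt(lam_max) + 30)
    pmf_with = [poisson_pmf(k, lam_with) for k in range(k_max + 1)]
    pmf_without = [poisson_pmf(k, lam_without) for k in range(k_max + 1)]
    # surv_with[k] = P(count_with >= k), built by a reverse cumulative sum.
    surv_with = [0.0] * (k_max + 2)
    for k in range(k_max, -1, -1):
        surv_with[k] = surv_with[k + 1] + pmf_with[k]
    return sum(pmf_without[k] * surv_with[k] for k in range(k_max + 1))


# Baseline from the description: ~5 logical errors in 2,000 shots
# (assumed here to be the "without BD" circuit).
LER_WITHOUT_BD = 5 / 2000
# HYPOTHETICAL improvement factor, chosen only for illustration.
LER_WITH_BD = LER_WITHOUT_BD / 2

fail_2k = prob_assert_less_fails(2_000, LER_WITH_BD, LER_WITHOUT_BD)
fail_20k = prob_assert_less_fails(20_000, LER_WITH_BD, LER_WITHOUT_BD)
print(f"P(assertLess fails) at  2,000 shots: {fail_2k:.4f}")
print(f"P(assertLess fails) at 20,000 shots: {fail_20k:.4f}")
```

Under these assumed rates the failure probability drops sharply when the shot count rises tenfold; the exact figures depend on the true LER gap, which the description does not state.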
test_inference_d13_noise25p_ler_quality (test_inference_public_model.py): With d=13 at 262k shots and LER ~2e-4, the combined standard error for the before/after comparison is ~4e-5. The previous tolerance of 1e-4 (~2.5 sigma) gave a ~1.2% per-run flake probability (observed in the PR #6 CI run). Widening to 2e-4 (~5 sigma) effectively eliminates flakes.

Changed files
* code/tests/test_boundary_detectors.py: test_ler_improves_with_bd_noise_model
* code/tests/test_inference_public_model.py: LER_IMPROVEMENT_TOLERANCE 1e-4 → 2e-4

Test plan
* test_ler_improves_with_bd_noise_model keeps original assertLess; increased samples provide the statistical power
* test_inference_d13_noise25p_ler_quality tolerance widened with ~5-sigma headroom
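
The sigma headroom quoted under Root cause for test_inference_d13_noise25p_ler_quality can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes both LER estimates are independent binomial proportions with the same rate (~2e-4) and the same shot count (262k, used here as the rounded figure 262,000), and applies a normal approximation to the difference; none of these modelling choices are stated in the PR, and the helper names are hypothetical.

```python
import math


def binomial_se(ler: float, shots: int) -> float:
    """Standard error of a binomial proportion estimate."""
    return math.sqrt(ler * (1.0 - ler) / shots)


def upper_tail(z: float) -> float:
    """One-sided standard-normal tail probability P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


SHOTS = 262_000  # rounded "262k" from the description (assumption)
LER = 2e-4       # approximate LER from the description

# Standard error of the difference of two independent estimates.
combined_se = math.sqrt(2.0) * binomial_se(LER, SHOTS)


def headroom(tolerance: float, se: float = combined_se):
    """Return (sigma headroom, one-sided tail, two-sided tail) for a tolerance."""
    z = tolerance / se
    return z, upper_tail(z), 2.0 * upper_tail(z)


print(f"combined SE ~ {combined_se:.2e}")
for tol in (1e-4, 2e-4):
    z, one_sided, two_sided = headroom(tol)
    print(f"tolerance {tol:.0e}: {z:.2f} sigma, "
          f"one-sided {one_sided:.2e}, two-sided {two_sided:.2e}")
```

With the rounded SE of 4e-5, a tolerance of 1e-4 is 2.5 sigma, and the two-sided normal tail at 2.5 sigma is about 1.2%, which appears to be where the quoted flake probability comes from; a one-sided check would see roughly half that rate.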