
fix(test): stabilize flaky LER tests in CI #6

Merged
ivanbasov merged 6 commits into NVIDIA:main from ivanbasov:fix/flaky-ler-test
Mar 5, 2026


@ivanbasov ivanbasov commented Mar 5, 2026

Summary

  • Increase CI sample count for test_ler_improves_with_bd_noise_model from 2,000 to 20,000 so the boundary-detector LER comparison has enough statistical power
  • Widen d=13 LER improvement tolerance from 1e-4 to 2e-4 in test_inference_d13_noise25p_ler_quality, raising the guard from ~2.5 sigma to ~5 sigma

Root cause

test_ler_improves_with_bd_noise_model (test_boundary_detectors.py):
With only 2,000 CI samples at p=0.002 (d=5), the test produced ~5 logical errors. Discrete error counts frequently coincided, causing assertLess to fail spuriously:

AssertionError: np.float64(0.0025) not less than np.float64(0.0025)

Observed in the PR #2 CI run. Increasing to 20,000 samples yields ~50 errors, enough for reliable separation while adding negligible CI time (Stim sampling is fast).
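To illustrate why the larger sample count helps, here is a minimal sketch that estimates how often a strict `assertLess` on two sampled LERs would fail by chance. It is not the test code. It assumes the two runs are independent binomial samples, and the true LERs of 0.002 (with BD) and 0.003 (without BD) are hypothetical values chosen to bracket the observed 0.0025; the function names are also illustrative.

```python
import math


def binom_pmf(n: int, k: int, p: float) -> float:
    """Binomial probability mass, evaluated in log space to avoid overflow."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def spurious_failure_probability(shots: int, ler_with: float, ler_without: float) -> float:
    """P(count_with >= count_without): the chance that a strict
    assertLess(ler_with_estimate, ler_without_estimate) fails (tie or reversal).

    Assumes both counts are independent binomial draws with the same shot count.
    """
    # Truncate the sum well past both means; the omitted tail mass is negligible.
    k_max = min(shots, int(shots * max(ler_with, ler_without) * 5) + 50)
    pmf_with = [binom_pmf(shots, k, ler_with) for k in range(k_max + 1)]
    pmf_without = [binom_pmf(shots, k, ler_without) for k in range(k_max + 1)]

    prob = 0.0
    cdf_with_below = 0.0  # running P(count_with < b)
    for b in range(k_max + 1):
        prob += pmf_without[b] * (1.0 - cdf_with_below)
        cdf_with_below += pmf_with[b]
    return prob


# Hypothetical true LERs, not measured values from the PR.
LER_WITH_BD = 0.002
LER_WITHOUT_BD = 0.003

p_fail_2k = spurious_failure_probability(2_000, LER_WITH_BD, LER_WITHOUT_BD)
p_fail_20k = spurious_failure_probability(20_000, LER_WITH_BD, LER_WITHOUT_BD)

for shots, p_fail in ((2_000, p_fail_2k), (20_000, p_fail_20k)):
    print(f"shots={shots:>6}: P(assertLess fails by chance) = {p_fail:.3f}")
```

Under these assumed rates the failure probability falls sharply between 2,000 and 20,000 shots, but the exact figures depend on the true LER gap, which the PR does not state.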

test_inference_d13_noise25p_ler_quality (test_inference_public_model.py):
With d=13 at 262k shots and LER ~2e-4, the combined standard error for the before/after comparison is ~4e-5. The previous tolerance of 1e-4 (~2.5 sigma) gave a ~1.2% per-run flake probability:

AssertionError: 0.000320 not less than or equal to 0.000314 : Z: LER after (0.000320) > baseline (0.000214) + 0.0001

Observed in PR #6 CI run. Widening to 2e-4 (~5 sigma) effectively eliminates flakes.
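The sigma figures above can be roughly reproduced with a normal approximation. This is a sketch under stated assumptions, not the author's calculation: both the baseline and the "after" LER are taken as 2e-4, "262k" is rounded to 262,000 shots, and the helper names are illustrative.

```python
import math


def combined_standard_error(ler: float, shots: int) -> float:
    """Standard error of the difference of two independent LER estimates,
    assuming both have the same true LER and the same shot count."""
    se_single = math.sqrt(ler * (1.0 - ler) / shots)
    return math.sqrt(2.0) * se_single


def one_sided_tail(z: float) -> float:
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


ASSUMED_LER = 2e-4       # approximate LER quoted in the PR description
ASSUMED_SHOTS = 262_000  # "262k" rounded; the exact count is an assumption

se = combined_standard_error(ASSUMED_LER, ASSUMED_SHOTS)
tail_old = one_sided_tail(1e-4 / se)  # previous tolerance
tail_new = one_sided_tail(2e-4 / se)  # widened tolerance

print(f"combined SE ~ {se:.2e}")
for label, tol, tail in (("old", 1e-4, tail_old), ("new", 2e-4, tail_new)):
    print(f"{label} tolerance {tol:.0e}: {tol / se:.1f} sigma, one-sided tail {tail:.2e}")
```

For reference, the one-sided normal tail at exactly 2.5 sigma is about 0.6%. How the PR arrives at ~1.2% (for example, by checking two bases or using a two-sided tail) is not stated, so that mapping is an assumption.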

Changed files

| File | Change |
| --- | --- |
| code/tests/test_boundary_detectors.py | CI sample count 2k → 20k for test_ler_improves_with_bd_noise_model |
| code/tests/test_inference_public_model.py | LER_IMPROVEMENT_TOLERANCE 1e-4 → 2e-4 |

Test plan

  • test_ler_improves_with_bd_noise_model keeps original assertLess — increased samples provide the statistical power
  • test_inference_d13_noise25p_ler_quality tolerance widened with ~5-sigma headroom
  • CI passes consistently

ivanbasov and others added 4 commits March 5, 2026 10:52
…qual

With only 2000 CI samples at p=0.002, discrete error counts often
coincide for "with BD" and "without BD" circuits, causing assertLess
to fail spuriously. Use assertLessEqual, consistent with the sibling
test_ler_improves_with_bd_all_orientations.

Made-with: Cursor
2000 samples at p=0.002 produced only ~5 logical errors, making ties
and reversals likely. 20000 samples yields ~50 errors — enough for
reliable statistical separation while adding negligible CI time.

Made-with: Cursor
Keep the strict assertLess check — the increased sample count (20k)
provides enough statistical power to reliably distinguish LER values.

Made-with: Cursor
@ivanbasov ivanbasov force-pushed the fix/flaky-ler-test branch from cc0f38e to 80acbab March 5, 2026 19:00
The previous 1e-4 tolerance gave only ~2.5-sigma headroom with
combined SE ~4e-5, resulting in ~1.2% per-run flake probability.
Widen to 2e-4 (~5 sigma) to eliminate CI flakes while still catching
real regressions.

Made-with: Cursor
@ivanbasov ivanbasov changed the title from "fix(test): relax flaky LER boundary-detector assertion" to "fix(test): stabilize flaky LER tests in CI" Mar 5, 2026
@ivanbasov ivanbasov requested a review from bmhowe23 March 5, 2026 22:14
@ivanbasov ivanbasov merged commit cd578cc into NVIDIA:main Mar 5, 2026
22 checks passed
ivanbasov added a commit that referenced this pull request Apr 10, 2026
* fix(test): relax flaky LER boundary-detector assertion to assertLessEqual

With only 2000 CI samples at p=0.002, discrete error counts often
coincide for "with BD" and "without BD" circuits, causing assertLess
to fail spuriously. Use assertLessEqual, consistent with the sibling
test_ler_improves_with_bd_all_orientations.

Made-with: Cursor

* fix(test): increase CI sample count from 2k to 20k for LER comparison

2000 samples at p=0.002 produced only ~5 logical errors, making ties
and reversals likely. 20000 samples yields ~50 errors — enough for
reliable statistical separation while adding negligible CI time.

Made-with: Cursor

* fix(test): restore original assertLess assertion

Keep the strict assertLess check — the increased sample count (20k)
provides enough statistical power to reliably distinguish LER values.

Made-with: Cursor

* fix(test): widen d=13 LER improvement tolerance from 1e-4 to 2e-4

The previous 1e-4 tolerance gave only ~2.5-sigma headroom with
combined SE ~4e-5, resulting in ~1.2% per-run flake probability.
Widen to 2e-4 (~5 sigma) to eliminate CI flakes while still catching
real regressions.

Made-with: Cursor

* ci: trigger required status checks

Made-with: Cursor