### A4.4.3. Performance Regression Detection

> *A performance regression is a statistically significant increase in execution time between two versions of the code. Detecting regressions reliably requires comparing distributions of measurements, not single numbers, and choosing a significance threshold that balances false positives against missed regressions.*

**Explanation:**

Continuous integration (CI) pipelines can gate merges on performance by running benchmarks on both the baseline and candidate branches, then testing whether the candidate is significantly slower.

**Detection Pipeline:**

1. Run benchmark on **baseline** (N iterations) → distribution $B$.
2. Run benchmark on **candidate** (N iterations) → distribution $C$.
3. Apply a **statistical test** to decide if $C$ is slower than $B$.
4. If the test rejects the null hypothesis ("the candidate is not slower than the baseline"), flag a regression.
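
The four steps above can be sketched end to end with real wall-clock measurements. This is a minimal illustration, not a production harness: `collect_timings_ms`, `baseline_workload`, and `candidate_workload` are hypothetical names, the two workloads are stand-ins for the same benchmark built from two branches, and N = 50 and the 0.05 cutoff are assumed values.

```python
import time

import numpy as np
from scipy import stats


def collect_timings_ms(func, num_iterations):
    """Call `func` repeatedly and return per-call wall-clock times in milliseconds."""
    timings = np.empty(num_iterations)
    for i in range(num_iterations):
        start = time.perf_counter()
        func()
        timings[i] = (time.perf_counter() - start) * 1000.0
    return timings


# Hypothetical workloads standing in for the baseline and candidate builds.
def baseline_workload():
    return sum(range(20_000))


def candidate_workload():
    return sum(range(30_000))  # deliberately does about 50% more work


# Steps 1 and 2: collect N timings per version (N = 50 is an assumed value).
baseline_ms = collect_timings_ms(baseline_workload, 50)
candidate_ms = collect_timings_ms(candidate_workload, 50)

# Step 3: one-sided Mann-Whitney U test. alternative="less" asks whether the
# baseline timings are stochastically smaller, i.e. whether the candidate is slower.
_, p_value = stats.mannwhitneyu(baseline_ms, candidate_ms, alternative="less")

# Step 4: flag a regression if the null hypothesis is rejected at the 0.05 level.
flagged = bool(p_value < 0.05)
print(f"p-value: {p_value:.6f}, regression flagged: {flagged}")
```

On a real CI runner, the timings depend on machine load, so the exact p-value will vary from run to run.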

**Statistical Tests:**

| Test | Assumptions | Use When |
|------|-------------|----------|
| Welch's t-test | Approximate normality | Large N, roughly symmetric distributions |
| Mann–Whitney U | Independent samples; no distributional form assumed (non-parametric) | Small N, skewed distributions |
| Bootstrap CI on median difference | Independent samples; no distributional form assumed | Robust, flexible |
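
As a rough illustration of the table, the sketch below applies all three tests to the same synthetic, right-skewed timings. The log-normal parameters, the sample size of 200, and the helper `bootstrap_median_diff_ci` are assumptions made for this example; the bootstrap shown is a plain percentile bootstrap, one of several possible variants.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed timings in ms (assumed values for illustration).
rng = np.random.default_rng(0)
baseline_ms = rng.lognormal(mean=np.log(10.0), sigma=0.05, size=200)
candidate_ms = rng.lognormal(mean=np.log(10.5), sigma=0.05, size=200)

# Welch's t-test: equal_var=False drops the equal-variance assumption.
# alternative="less" tests whether the baseline mean is below the candidate mean.
welch = stats.ttest_ind(baseline_ms, candidate_ms, equal_var=False, alternative="less")

# Mann-Whitney U: rank-based, so it does not rely on normality.
mwu = stats.mannwhitneyu(baseline_ms, candidate_ms, alternative="less")


def bootstrap_median_diff_ci(baseline, candidate, num_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for median(candidate) - median(baseline)."""
    boot_rng = np.random.default_rng(seed)
    diffs = np.empty(num_resamples)
    for i in range(num_resamples):
        # Resample each group independently, with replacement.
        resampled_baseline = boot_rng.choice(baseline, size=baseline.size, replace=True)
        resampled_candidate = boot_rng.choice(candidate, size=candidate.size, replace=True)
        diffs[i] = np.median(resampled_candidate) - np.median(resampled_baseline)
    tail_percent = (1.0 - confidence) / 2.0 * 100.0
    return np.percentile(diffs, [tail_percent, 100.0 - tail_percent])


ci_low, ci_high = bootstrap_median_diff_ci(baseline_ms, candidate_ms)

print(f"Welch's t-test p-value:  {welch.pvalue:.3g}")
print(f"Mann-Whitney U p-value:  {mwu.pvalue:.3g}")
print(f"95% bootstrap CI on median difference: [{ci_low:.3f}, {ci_high:.3f}] ms")
```

A bootstrap interval that lies entirely above zero points to a slower candidate, and its width gives a direct sense of how precisely the slowdown is estimated.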

**Effect Size:**

Statistical significance alone is insufficient. A 0.01% slowdown can be "significant" with enough samples. Report **effect size**, here the relative change in median:

$$
\text{Relative Change} = \frac{\tilde{C} - \tilde{B}}{\tilde{B}} \times 100\%
$$

where $\tilde{B}$ and $\tilde{C}$ are the medians of baseline and candidate.
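
The sketch below illustrates why the effect size matters: with a very large sample, a tiny simulated slowdown is still statistically significant, while the relative change in median stays far below any sensible threshold. The 0.05% shift, the noise level, and the sample size are assumed values chosen so that the effect is visible at a modest cost; `relative_change_percent` is a hypothetical helper implementing the formula above.

```python
import numpy as np
from scipy import stats


def relative_change_percent(baseline, candidate):
    """Relative change in median, in percent: (C~ - B~) / B~ * 100."""
    median_baseline = np.median(baseline)
    return (np.median(candidate) - median_baseline) / median_baseline * 100.0


rng = np.random.default_rng(42)
num_samples = 400_000  # far more samples than a typical CI run would collect

# Simulated timings in ms: the candidate is slower by only 0.05% on average.
baseline_ms = rng.normal(loc=10.0, scale=0.3, size=num_samples)
candidate_ms = rng.normal(loc=10.005, scale=0.3, size=num_samples)

_, p_value = stats.mannwhitneyu(baseline_ms, candidate_ms, alternative="less")
change_percent = relative_change_percent(baseline_ms, candidate_ms)

print(f"p-value: {p_value:.3g}")
print(f"Relative change in median: {change_percent:+.4f}%")
```

Here the test rejects the null hypothesis, yet the measured change is a small fraction of a percent, which most teams would not treat as a regression worth blocking a merge.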

**Regression Gate Design:**

- Set a **threshold** (e.g., 2% slowdown) below which regressions are ignored.
- Require **both** statistical significance (p < 0.05) **and** effect size above threshold.
- Use **bisection** (binary search over commits) to locate the offending change when a regression is detected.
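
The bisection step can be sketched as a binary search over an ordered commit list. This is a simplified model: `find_first_bad_commit`, the commit names, and the `median_ms` table are hypothetical, and the sketch assumes the first commit is good, the last is bad, and every commit after the offending one stays regressed. In practice `is_regressed` would run the benchmark and apply the gate, and a tool such as `git bisect` can drive the search.

```python
def find_first_bad_commit(commits, is_regressed):
    """Return the index of the first regressed commit using binary search.

    Assumes commits[0] is good, commits[-1] is regressed, and that every
    commit after the first regressed one is also regressed.
    """
    low, high = 0, len(commits) - 1  # invariant: commits[low] good, commits[high] bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_regressed(commits[mid]):
            high = mid
        else:
            low = mid
    return high


# Hypothetical history: median benchmark time per commit, in ms.
commits = [f"c{i}" for i in range(10)]
median_ms = {commit: 10.0 for commit in commits[:6]}
median_ms.update({commit: 10.8 for commit in commits[6:]})

checked = []  # records which commits were benchmarked during the search


def is_regressed(commit, threshold_percent=2.0):
    """Stand-in for running the benchmark and applying the regression gate."""
    checked.append(commit)
    baseline = median_ms[commits[0]]
    return (median_ms[commit] - baseline) / baseline * 100.0 > threshold_percent


first_bad = find_first_bad_commit(commits, is_regressed)
print(f"First regressed commit: {commits[first_bad]}")
print(f"Commits benchmarked: {checked}")
```

Because each step halves the search range, only about $\log_2$ of the number of commits need to be benchmarked, which matters when a single benchmark run is expensive.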

**Example:**

Simulate baseline and regressed benchmark distributions, apply the Mann–Whitney U test, and check the regression gate.

```python
import numpy as np
from scipy import stats


def generate_benchmark_samples(base_time_ms, noise_std_ms, num_samples, seed):
    """Simulate benchmark timings as normally distributed noise around a base time."""
    rng = np.random.default_rng(seed=seed)
    return rng.normal(loc=base_time_ms, scale=noise_std_ms, size=num_samples)


def check_regression(baseline_samples, candidate_samples, significance_level=0.05, threshold_percent=2.0):
    """Apply the regression gate: statistical significance AND effect size above threshold."""
    # Effect size: relative change in median, in percent.
    median_baseline = np.median(baseline_samples)
    median_candidate = np.median(candidate_samples)
    relative_change_percent = ((median_candidate - median_baseline) / median_baseline) * 100

    # One-sided test: alternative="less" asks whether the baseline timings are
    # stochastically smaller than the candidate timings (candidate is slower).
    statistic, p_value = stats.mannwhitneyu(
        baseline_samples,
        candidate_samples,
        alternative="less",
    )

    is_significant = p_value < significance_level
    exceeds_threshold = relative_change_percent > threshold_percent
    is_regression = is_significant and exceeds_threshold

    return {
        "median_baseline_ms": median_baseline,
        "median_candidate_ms": median_candidate,
        "relative_change_percent": relative_change_percent,
        "p_value": p_value,
        "is_significant": is_significant,
        "exceeds_threshold": exceeds_threshold,
        "is_regression": is_regression,
    }


def print_regression_report(label, result):
    """Print a human-readable summary of one gate decision."""
    print(f"\n--- {label} ---")
    print(f"  Baseline median: {result['median_baseline_ms']:.3f} ms")
    print(f"  Candidate median: {result['median_candidate_ms']:.3f} ms")
    print(f"  Relative change: {result['relative_change_percent']:+.2f}%")
    print(f"  p-value: {result['p_value']:.6f}")
    print(f"  Statistically significant: {result['is_significant']}")
    print(f"  Exceeds threshold: {result['exceeds_threshold']}")
    verdict = "REGRESSION DETECTED" if result["is_regression"] else "PASS"
    print(f"  Verdict: {verdict}")


num_samples = 100

# Three candidates: no change, a 1.5% slowdown, and an 8% slowdown (true means).
baseline = generate_benchmark_samples(10.0, 0.3, num_samples, seed=1)
candidate_no_regression = generate_benchmark_samples(10.0, 0.3, num_samples, seed=2)
candidate_small_regression = generate_benchmark_samples(10.15, 0.3, num_samples, seed=3)
candidate_clear_regression = generate_benchmark_samples(10.8, 0.3, num_samples, seed=4)

result_none = check_regression(baseline, candidate_no_regression)
print_regression_report("No regression (same distribution)", result_none)

result_small = check_regression(baseline, candidate_small_regression)
print_regression_report("Small regression (1.5%, below threshold)", result_small)

result_clear = check_regression(baseline, candidate_clear_regression)
print_regression_report("Clear regression (8%, above threshold)", result_clear)
```

**References:**

[📘 Gregg, B. (2020). *Systems Performance: Enterprise and the Cloud (2nd ed.).* Addison-Wesley.](https://www.brendangregg.com/systems-performance-2nd-edition-book.html)

[📘 Chen, T. et al. (2016). *An Empirical Study of Performance Regression Introducing Code Changes.* IEEE ICSME.](https://doi.org/10.1109/ICSME.2016.13)

---

[⬅️ Previous: Noise Control](./02_noise_control.ipynb)