### A4.4.1. Benchmark Design

> *A well-designed benchmark isolates the code under test, controls for system noise, and produces statistically meaningful timing measurements by choosing appropriate iteration counts, warm-up phases, and metrics.*

**Explanation:**

Benchmarking measures how fast code runs. **Bad benchmarks** produce misleading numbers; **good benchmarks** produce actionable, reproducible measurements.

**Benchmark Structure:**

1. **Setup**: allocate data, initialize state. Not measured.
2. **Warm-up**: run the code N times to fill caches, trigger JIT compilation, stabilize frequency scaling. Not measured.
3. **Measurement loop**: run the code M times, record each timing.
4. **Reporting**: compute statistics (median, mean, std, min, percentiles).

**Key Principles:**

| Principle | Rationale |
|-----------|----------|
| Isolate the target | Don't measure setup, teardown, or I/O |
| Warm up | JIT, caches, CPU frequency need time to stabilize |
| Use median, not mean | Mean is skewed by outliers (GC pauses, OS interrupts) |
| Report distribution | Min shows best-case; p99 shows tail latency |
| Prevent dead code elimination | Compiler may remove code whose result is unused |
| Control input size | Benchmark must cover representative workload sizes |
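
To make the "median, not mean" row concrete, here is a minimal sketch using made-up timings (the values are hypothetical, not measurements): 99 steady runs at 10 ms plus one 500 ms run standing in for a GC pause or OS interrupt.

```python
import numpy as np

# Hypothetical timings in milliseconds: 99 steady runs plus a single outlier.
steady_ms = np.full(99, 10.0)
timings_ms = np.append(steady_ms, 500.0)

mean_ms = np.mean(timings_ms)      # pulled upward by the one slow run
median_ms = np.median(timings_ms)  # unaffected by the one slow run

print(f"Mean:   {mean_ms:.1f} ms")
print(f"Median: {median_ms:.1f} ms")
print(f"Min:    {np.min(timings_ms):.1f} ms")
print(f"Max:    {np.max(timings_ms):.1f} ms")
```

A single slow run out of 100 moves the mean from 10.0 ms to 14.9 ms while the median stays at 10.0 ms. That is why the median is the safer headline number, with min and max (or percentiles) reported alongside it.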

**Common Mistakes:**

- Timing with `time.time()`, which follows the adjustable system clock and may have coarse resolution, instead of the monotonic, high-resolution `time.perf_counter_ns()`.
- Too few iterations → high variance.
- Benchmarking with optimizations disabled.
- Comparing benchmarks run on different machines or under different load.
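
As a rough check on the first point above, the standard library can report what each clock advertises on the current platform. This is a sketch: the printed values differ between operating systems, and `time_once_ns` is a hypothetical helper introduced only for illustration.

```python
import time

# Ask the interpreter what each clock advertises on this platform.
for clock_name in ("time", "perf_counter"):
    info = time.get_clock_info(clock_name)
    print(
        f"{clock_name}: resolution={info.resolution:.1e} s, "
        f"monotonic={info.monotonic}, adjustable={info.adjustable}"
    )


def time_once_ns(fn):
    """Time a single call with the monotonic, integer-nanosecond counter."""
    start = time.perf_counter_ns()
    fn()
    return time.perf_counter_ns() - start


print(f"Empty call: {time_once_ns(lambda: None)} ns")
```

The `monotonic` flag matters as much as resolution: a clock that can be adjusted mid-run can yield negative or inflated intervals.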

**Example:**

Benchmarking NumPy `dot` on 512×512 `float32` matrices: 5 warm-up iterations, 50 measured iterations, report median and IQR.

In [None]:
import numpy as np
import time


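def run_benchmark(target_fn, warmup_iterations, measurement_iterations):
    # Warm-up: let caches, lazy initialization, and CPU frequency settle. Not measured.
    for _ in range(warmup_iterations):
        target_fn()

    # Measurement: time each call individually so the full distribution can be reported.
    timings_ns = []
    for _ in range(measurement_iterations):
        start = time.perf_counter_ns()
        target_fn()
        elapsed = time.perf_counter_ns() - start
        timings_ns.append(elapsed)

    return np.array(timings_ns)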


def report_statistics(label, timings_ns):
    timings_ms = timings_ns / 1e6
    print(f"{label}:")
    print(f"  Iterations: {len(timings_ms)}")
    print(f"  Median: {np.median(timings_ms):.3f} ms")
    print(f"  Mean:   {np.mean(timings_ms):.3f} ms")
    print(f"  Std:    {np.std(timings_ms):.3f} ms")
    print(f"  Min:    {np.min(timings_ms):.3f} ms")
    print(f"  Max:    {np.max(timings_ms):.3f} ms")
    print(f"  P25:    {np.percentile(timings_ms, 25):.3f} ms")
    print(f"  P75:    {np.percentile(timings_ms, 75):.3f} ms")
    print(f"  P99:    {np.percentile(timings_ms, 99):.3f} ms")
    print(f"  IQR:    {np.percentile(timings_ms, 75) - np.percentile(timings_ms, 25):.3f} ms")


matrix_size = 512
matrix_a = np.random.rand(matrix_size, matrix_size).astype(np.float32)
matrix_b = np.random.rand(matrix_size, matrix_size).astype(np.float32)

timings = run_benchmark(
    target_fn=lambda: np.dot(matrix_a, matrix_b),
    warmup_iterations=5,
    measurement_iterations=50,
)

report_statistics(f"np.dot ({matrix_size}x{matrix_size} float32)", timings)
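
The "Control input size" principle can be sketched by repeating the same measurement over several matrix sizes. The helper `median_time_ms`, the sizes, and the iteration counts below are illustrative assumptions rather than recommended settings. The block is self-contained, so it re-implements the timing loop instead of reusing `run_benchmark`.

```python
import time

import numpy as np


def median_time_ms(fn, warmup_iterations=3, measurement_iterations=20):
    """Return the median wall time of fn() in milliseconds (hypothetical helper)."""
    for _ in range(warmup_iterations):
        fn()  # warm-up, not measured
    timings_ns = []
    for _ in range(measurement_iterations):
        start = time.perf_counter_ns()
        fn()
        timings_ns.append(time.perf_counter_ns() - start)
    return float(np.median(timings_ns)) / 1e6


rng = np.random.default_rng(0)  # fixed seed so the inputs are reproducible
sweep_sizes = (64, 128, 256)    # illustrative sizes, kept small so the sweep runs quickly
median_by_size = {}

for size in sweep_sizes:
    # Setup happens outside the timed region.
    a = rng.random((size, size), dtype=np.float32)
    b = rng.random((size, size), dtype=np.float32)
    median_by_size[size] = median_time_ms(lambda: np.dot(a, b))

for size, median_ms in median_by_size.items():
    print(f"{size}x{size}: median {median_ms:.4f} ms")
```

A sweep like this shows whether a conclusion drawn at one size holds at others, since cache effects and threading overhead can change the picture as inputs grow.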

**References:**

[📘 Gregg, B. (2020). *Systems Performance: Enterprise and the Cloud (2nd ed.).* Addison-Wesley.](https://www.brendangregg.com/systems-performance-2nd-edition-book.html)

[📘 Google. *google/benchmark: A microbenchmark support library.*](https://github.com/google/benchmark)

---

[⬅️ Previous: Dispatch](../03_Runtime_Topics/03_dispatch.ipynb) | [Next: Noise Control ➡️](./02_noise_control.ipynb)