
Performance Baselines

Simon B. Stirling edited this page Mar 2, 2026 · 6 revisions

Performance Baseline and Regression Gates (v1)

I use this document as the frozen performance-baseline contract for the current bootstrap toolchain.

Contract version: perfbase.v1

What I Freeze in v1

I freeze deterministic throughput floors (ops/sec) for representative CLI workloads in the pinned Linux x86-64 bootstrap environment.

The gated workloads are:

  • verify on tests/valid_min.l0
  • build on tests/valid_min.l0
  • run on valid_min image (2-arg arithmetic kernel)
  • run on valid_sysv_abi_sum6_lowered image (6-arg SysV kernel)
  • build-elf on tests/valid_sysv_abi_sum6_lowered.l0
  • mapcat, schemacat, tracecat, tracejoin on deterministic trace artifacts
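The gated metric for each workload is throughput in ops/sec. As a minimal sketch (not the actual harness), a floor measurement for one workload could time a fixed number of CLI invocations; the `./l0` binary path and the iteration count here are assumptions:

```shell
#!/bin/sh
# Sketch: measure ops/sec for one gated workload by timing N invocations.
# "./l0" and ITERS=200 are assumptions, not taken from the real harness.
ITERS=200
start=$(date +%s%N)                 # nanoseconds since epoch (GNU date)
i=0
while [ "$i" -lt "$ITERS" ]; do
  ./l0 verify tests/valid_min.l0 >/dev/null 2>&1
  i=$((i + 1))
done
end=$(date +%s%N)
elapsed_ns=$((end - start))
[ "$elapsed_ns" -gt 0 ] || elapsed_ns=1   # guard against zero division
# ops/sec = iterations * 1e9 / elapsed nanoseconds (integer arithmetic)
ops=$((ITERS * 1000000000 / elapsed_ns))
echo "verify.valid_min: ${ops} ops/s"
```

Integer arithmetic is deliberate: the floors below are coarse enough that sub-op/s precision adds nothing.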

Enforcement

I enforce this contract in:

  • tests/performance_gates.sh

That harness is integrated into tests/run.sh and therefore enforced by make test.

I also run this automatically in GitHub Actions:

  • .github/workflows/performance.yml on nightly schedule and on version tags (v*).
  • CI uses a blocking smoke gate (tests/ci_smoke_bench.sh) plus a non-blocking full-gate observation run (tests/performance_gates.sh) because hosted runner throughput is less stable than my pinned local baseline environment.
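The blocking/non-blocking split can be sketched as two wrappers; `true` and `false` stand in for the real scripts (tests/ci_smoke_bench.sh, tests/performance_gates.sh), and these helper names are hypothetical:

```shell
#!/bin/sh
# Sketch of the CI gating shape: the smoke gate blocks, the full gate
# only observes. The commands below are placeholders for the real scripts.
run_blocking() { "$@"; }   # propagate failure: under set -e this fails the job
run_observed() { "$@" || echo "non-blocking failure observed: $*"; }

run_blocking true          # stands in for tests/ci_smoke_bench.sh
run_observed false         # stands in for tests/performance_gates.sh
```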

Pinned Throughput Floors in v1

  • verify.valid_min >= 1800 ops/s
  • build.valid_min >= 1300 ops/s
  • run.add >= 2400 ops/s
  • run.sum6 >= 2200 ops/s
  • build-elf.sum6 >= 1100 ops/s
  • mapcat.trace_map >= 2500 ops/s
  • schemacat.trace_schema >= 2500 ops/s
  • tracecat.trace_bin >= 2500 ops/s
  • tracejoin.trace_bin+map >= 2300 ops/s
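A generic floor check over these pinned values could look like the following sketch; the function name and the sample measured value (2105) are illustrative, not from tests/performance_gates.sh:

```shell
#!/bin/sh
# Sketch: fail the gate when a measured throughput is below its pinned floor.
# check_floor NAME MEASURED FLOOR
check_floor() {
  name=$1; measured=$2; floor=$3
  if [ "$measured" -lt "$floor" ]; then
    echo "FAIL ${name}: ${measured} ops/s < floor ${floor}" >&2
    return 1
  fi
  echo "PASS ${name}: ${measured} ops/s >= floor ${floor}"
}

# Example with an invented measurement against the v1 floor:
check_floor verify.valid_min 2105 1800
```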

CI Runner Profile

I support a CI profile for hosted runners by setting:

  • L0_PERF_PROFILE=ci

This profile uses conservative floors intended to absorb the run-to-run variance of shared GitHub-hosted runners. My local/pinned production profile remains the default when this variable is not set.
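Profile selection could be sketched as a simple branch; the pinned floor below matches this contract, while the `ci` value is an assumed placeholder, not the real CI floor:

```shell
#!/bin/sh
# Sketch: select a floor based on L0_PERF_PROFILE.
# The "ci" floor (900) is a hypothetical placeholder; 1800 is the pinned
# verify.valid_min floor from this contract.
if [ "${L0_PERF_PROFILE:-}" = "ci" ]; then
  FLOOR_VERIFY_VALID_MIN=900     # hypothetical conservative CI floor
else
  FLOOR_VERIFY_VALID_MIN=1800    # pinned default floor (perfbase.v1)
fi
echo "verify.valid_min floor: ${FLOOR_VERIFY_VALID_MIN} ops/s"
```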

Out of Scope in v1

  • Cross-machine or cross-CPU performance comparability guarantees.
  • Full benchmarking methodology for release marketing claims.
  • Auto-tuning of thresholds; v1 thresholds are pinned and explicitly versioned.

Reproducibility Notes for Comparison Reports

For apples-to-apples comparison snapshots, I now capture reproducibility metadata and confidence hints in:

  • docs/PERFORMANCE_COMPARISON_APPLES_TO_APPLES.md
  • docs/PERFORMANCE_COMPARISON_APPLES_TO_APPLES.json

I can control this benchmark behavior with:

  • L0_A2A_PIN_CPU (optional CPU affinity, e.g. 0)
  • L0_A2A_BUILD_ITERS
  • L0_A2A_BUILD_SAMPLES
  • L0_A2A_RUNTIME_ITERS
  • L0_A2A_RUNTIME_SAMPLES
  • L0_A2A_WARMUP_RUNS
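A usage sketch for these knobs: only the variable names come from this page; the values are examples, and the actual benchmark entry point is not specified here, so it is left as a comment:

```shell
#!/bin/sh
# Sketch: pin the A2A benchmark to CPU 0 and raise sample counts.
# Values are illustrative; the real driver invocation is not shown here.
export L0_A2A_PIN_CPU=0
export L0_A2A_BUILD_ITERS=100
export L0_A2A_BUILD_SAMPLES=10
export L0_A2A_RUNTIME_ITERS=500
export L0_A2A_RUNTIME_SAMPLES=30
export L0_A2A_WARMUP_RUNS=3
echo "A2A config: cpu=${L0_A2A_PIN_CPU} runtime_samples=${L0_A2A_RUNTIME_SAMPLES} warmup=${L0_A2A_WARMUP_RUNS}"
# ...then invoke the benchmark driver with this environment.
```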

The report includes:

  • CPU/model/topology metadata
  • sample counts and warmup configuration
  • median throughput values
  • CI95 on runtime sample means
  • outlier-trimming controls (L0_A2A_TRIM_COUNT) applied to the CI95 computation when enough samples are present
  • stability warning threshold control (L0_A2A_RUNTIME_CI95_PCT_WARN)
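The median and CI95 statistics the report lists can be sketched with sort/awk; the five sample values below are invented, and this is a normal-approximation half-width (1.96 · sd/√n), which may differ from the report's exact method:

```shell
#!/bin/sh
# Sketch: median and a CI95 half-width over runtime samples.
# Sample values are invented; CI95 here is the normal approximation
# 1.96 * sd / sqrt(n), one plausible reading of "CI95 on sample means".
samples="2400 2450 2380 2500 2420"
median=$(printf '%s\n' $samples | sort -n |
  awk '{a[NR]=$1} END{print (NR%2 ? a[(NR+1)/2] : (a[NR/2]+a[NR/2+1])/2)}')
ci95=$(printf '%s\n' $samples |
  awk '{s+=$1; ss+=$1*$1; n++}
       END{m=s/n; sd=sqrt((ss-n*m*m)/(n-1)); printf "%.1f", 1.96*sd/sqrt(n)}')
echo "median=${median} ops/s, CI95=+/-${ci95}"
```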
