feat: cross-validation harness and performance regression tracking by tbitcs · Pull Request #7 · BitConcepts/arbiter

tbitcs · 2026-06-02T18:34:07Z

Cross-validation harness and performance regression tracking

Cross-validation test suite (tests/python/test_cross_validation.py)

- 12 golden test vectors in tests/vectors/ covering all major evaluation paths: basic eval, safety guard ordering, expression opcodes (scale, accumulate, clamp with saturation), div-by-zero, staleness, mode transitions, fault raise/clear, delta operators, condition groups, and INT32 saturation
- Determinism test: runs 3 vectors 100 times each, asserts identical output
- Compile-to-C test: compiles each vector model to C source and validates required symbols (ARBITER_generated_model, ARBITER_MODEL_HASH)
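As a rough illustration of the golden-vector and determinism checks listed above, here is a minimal sketch. The `evaluate` function is a hypothetical stand-in for the Python evaluator (it only scales and saturates to INT32), and the vector layout (`model`, `inputs`, `expected`) is assumed; the real files in tests/vectors/ may be structured differently.

```python
import json

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def evaluate(model, inputs):
    """Hypothetical stand-in for the evaluator: scale, then saturate to INT32."""
    raw = inputs["x"] * model["scale"]
    return {"y": max(INT32_MIN, min(INT32_MAX, raw))}


# Hypothetical golden vector (assumed layout, not the project's actual schema).
VECTOR = json.loads(
    '{"model": {"scale": 4},'
    ' "inputs": {"x": 1000000000},'
    ' "expected": {"y": 2147483647}}'
)


def check_golden(vector):
    """Return True when the evaluator output matches the recorded expectation."""
    return evaluate(vector["model"], vector["inputs"]) == vector["expected"]


def check_deterministic(vector, runs=100):
    """Evaluate the same vector repeatedly and require one distinct output."""
    outputs = {
        json.dumps(evaluate(vector["model"], vector["inputs"]), sort_keys=True)
        for _ in range(runs)
    }
    return len(outputs) == 1
```

Serialising each output with `sort_keys=True` before comparing keeps the determinism check independent of dictionary key order.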

Performance regression tracking

- tools/parse_benchmark.py: parses Twister benchmark logs, extracts ns/tick timing for hand-coded and engine variants from PID and Kalman benchmarks, prints summary table
- CI workflow updated with "Benchmark timing summary" step after Twister benchmarks (log-only for now; no fail threshold until baseline data is collected)
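The log parsing described above could look roughly like the sketch below. The log line shape (`pid hand-coded: 123 ns/tick`) is an assumption made for illustration, as are the names `LINE_RE`, `parse_log`, and `format_summary`; the sample numbers are invented and are not measurements from this PR.

```python
import re

# Assumed line shape: "<benchmark> <variant>: <number> ns/tick".
# The real Twister log format may differ.
LINE_RE = re.compile(
    r"(?P<bench>pid|kalman)\s+(?P<variant>hand-coded|engine):\s*"
    r"(?P<ns>\d+(?:\.\d+)?)\s*ns/tick",
    re.IGNORECASE,
)


def parse_log(text):
    """Map (benchmark, variant) to ns/tick for every matching log line."""
    results = {}
    for match in LINE_RE.finditer(text):
        key = (match["bench"].lower(), match["variant"].lower())
        results[key] = float(match["ns"])
    return results


def format_summary(results):
    """Render the parsed timings as a fixed-width text table."""
    lines = [f"{'benchmark':<10}{'variant':<12}{'ns/tick':>10}"]
    for (bench, variant), ns in sorted(results.items()):
        lines.append(f"{bench:<10}{variant:<12}{ns:>10.1f}")
    return "\n".join(lines)


# Invented sample log, for illustration only.
SAMPLE = """\
*** Booting benchmark ***
PID hand-coded: 120 ns/tick
PID engine: 310.5 ns/tick
Kalman hand-coded: 480 ns/tick
Kalman engine: 905 ns/tick
"""

if __name__ == "__main__":
    print(format_summary(parse_log(SAMPLE)))
```

Because the step is log-only, the sketch prints a table and never exits non-zero; a fail threshold could later compare `results` against stored baseline values.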

Test results

- All 180 tests pass (python -m pytest tests/python/ -v)
- 27 new cross-validation tests (12 golden vectors + 3 determinism + 12 compile-to-C)
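For the compile-to-C tests counted above, the required-symbol check might be sketched as follows. Only the two symbol names come from this PR; the helper `missing_symbols` and the sample C text are hypothetical and do not reflect the actual generated output.

```python
import re

# Symbol names taken from the PR description.
REQUIRED_SYMBOLS = ("ARBITER_generated_model", "ARBITER_MODEL_HASH")


def missing_symbols(c_source, required=REQUIRED_SYMBOLS):
    """Return the required identifiers that do not appear as whole words."""
    return [
        symbol
        for symbol in required
        if not re.search(rf"\b{re.escape(symbol)}\b", c_source)
    ]


# Hypothetical generated source, for illustration only.
SAMPLE_C = """
#define ARBITER_MODEL_HASH 0x0u
const int ARBITER_generated_model = 0;
"""
```

A test would then assert that `missing_symbols(source)` is empty for each compiled vector model. Matching on word boundaries avoids accepting a longer identifier that merely contains the required name.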

Conversation: https://app.warp.dev/conversation/069a3382-5169-4a07-8e15-ea4862186401
Run: https://oz.warp.dev/runs/019e8993-e9a4-7058-9fea-abfbf925ac06

This PR was generated with Oz.

Add cross-validation test suite that validates the Python evaluator against 12 golden test vectors covering all major evaluation paths: basic eval, safety guard ordering, expression opcodes (scale, accumulate, clamp with saturation), div-by-zero, staleness, mode transitions, fault raise/clear, delta operators, condition groups, and INT32 saturation.

Tests include:
- Parametrised golden-vector evaluation (12 vectors)
- Determinism verification (100 identical runs)
- Compile-to-C validation (each vector model compiles to valid C)

Also adds:
- tools/parse_benchmark.py for extracting ns/tick timing from Twister benchmark logs (PID and Kalman)
- CI workflow step to print benchmark timing summary after Twister

Co-Authored-By: Oz <oz-agent@warp.dev>

tbitcs merged commit 94b0942 into main Jun 2, 2026
2 of 7 checks passed

tbitcs merged 1 commit into main from orchestrator/crossval