feat: cross-validation harness and performance regression tracking #7

Merged: tbitcs merged 1 commit into main from orchestrator/crossval on Jun 2, 2026

tbitcs commented Jun 2, 2026

Cross-validation harness and performance regression tracking

Cross-validation test suite (tests/python/test_cross_validation.py)

  • 12 golden test vectors in tests/vectors/ covering all major evaluation paths:
    basic eval, safety guard ordering, expression opcodes (scale, accumulate, clamp with saturation),
    div-by-zero, staleness, mode transitions, fault raise/clear, delta operators, condition groups,
    and INT32 saturation
  • Determinism test: runs 3 vectors 100 times each, asserts identical output
  • Compile-to-C test: compiles each vector model to C source and validates required symbols
    (ARBITER_generated_model, ARBITER_MODEL_HASH)
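A minimal sketch of the golden-vector and determinism checks described above. The evaluator, the vector schema, and every name here (`evaluate`, `check_golden`, `check_determinism`, `VECTOR`) are hypothetical stand-ins, not the project's actual API. The stand-in evaluator only scales inputs and saturates at the INT32 limits.

```python
import json

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def saturate_int32(value: int) -> int:
    """Clamp an integer into the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))


def evaluate(vector: dict) -> list[int]:
    """Hypothetical stand-in evaluator: scale each input, saturating at INT32."""
    return [saturate_int32(x * vector["scale"]) for x in vector["inputs"]]


def check_golden(vector: dict) -> bool:
    """Compare the evaluator output with the expected output stored in the vector."""
    return evaluate(vector) == vector["expected"]


def check_determinism(vector: dict, runs: int = 100) -> bool:
    """Evaluate repeatedly and require identical serialised output on every run."""
    first = json.dumps(evaluate(vector), sort_keys=True)
    return all(
        json.dumps(evaluate(vector), sort_keys=True) == first
        for _ in range(runs - 1)
    )


# Hypothetical vector; the real files in tests/vectors/ may use another schema.
VECTOR = {"scale": 3, "inputs": [1, -2, 2**30], "expected": [3, -6, INT32_MAX]}

if __name__ == "__main__":
    print("golden:", check_golden(VECTOR))
    print("deterministic:", check_determinism(VECTOR))
```

Comparing serialised output rather than Python objects is one possible design choice: it also catches differences in key ordering or formatting if the evaluator returns richer structures.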

Performance regression tracking

  • tools/parse_benchmark.py: parses Twister benchmark logs, extracts ns/tick timing for
    hand-coded and engine variants from PID and Kalman benchmarks, prints summary table
  • CI workflow updated with "Benchmark timing summary" step after Twister benchmarks
    (log-only for now; no fail threshold until baseline data is collected)
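As an illustration of the log-parsing step, here is a rough sketch of extracting ns/tick values and printing a summary table. The log line shape, the regular expression, the function names, and the sample numbers are all assumptions made for this example. The actual Twister output and tools/parse_benchmark.py may look quite different.

```python
import re

# Assumed (hypothetical) line shape: "<benchmark> <variant>: <number> ns/tick"
LINE_RE = re.compile(
    r"(?P<bench>pid|kalman)\s+(?P<variant>hand-coded|engine)\s*:\s*"
    r"(?P<ns>\d+(?:\.\d+)?)\s*ns/tick",
    re.IGNORECASE,
)


def parse_timings(log_text: str) -> dict[tuple[str, str], float]:
    """Map (benchmark, variant) to ns/tick for every matching log line."""
    timings: dict[tuple[str, str], float] = {}
    for match in LINE_RE.finditer(log_text):
        key = (match["bench"].lower(), match["variant"].lower())
        timings[key] = float(match["ns"])
    return timings


def format_summary(timings: dict[tuple[str, str], float]) -> str:
    """Render the timings as a fixed-width text table, sorted by key."""
    rows = ["benchmark  variant      ns/tick"]
    for (bench, variant), ns in sorted(timings.items()):
        rows.append(f"{bench:<10} {variant:<12} {ns:>7.1f}")
    return "\n".join(rows)


# Made-up sample log; the values are illustrative, not measured results.
SAMPLE_LOG = """\
I: PID hand-coded: 120 ns/tick
I: PID engine: 450.5 ns/tick
I: Kalman hand-coded: 900 ns/tick
unrelated line
"""

if __name__ == "__main__":
    print(format_summary(parse_timings(SAMPLE_LOG)))
```

Because the CI step is log-only, a parser of this shape would only print the table. A fail threshold could later compare each value against a stored baseline.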

Test results

  • All 180 tests pass (python -m pytest tests/python/ -v)
  • 27 new cross-validation tests (12 golden vectors + 3 determinism + 12 compile-to-C)

Conversation: https://app.warp.dev/conversation/069a3382-5169-4a07-8e15-ea4862186401
Run: https://oz.warp.dev/runs/019e8993-e9a4-7058-9fea-abfbf925ac06

This PR was generated with Oz.

Add cross-validation test suite that validates the Python evaluator
against 12 golden test vectors covering all major evaluation paths:
basic eval, safety guard ordering, expression opcodes (scale,
accumulate, clamp with saturation), div-by-zero, staleness, mode
transitions, fault raise/clear, delta operators, condition groups,
and INT32 saturation.

Tests include:
- Parametrised golden-vector evaluation (12 vectors)
- Determinism verification (100 identical runs)
- Compile-to-C validation (each vector model compiles to valid C)

Also adds:
- tools/parse_benchmark.py for extracting ns/tick timing from
  Twister benchmark logs (PID and Kalman)
- CI workflow step to print benchmark timing summary after Twister

Co-Authored-By: Oz <oz-agent@warp.dev>
@tbitcs tbitcs merged commit 94b0942 into main Jun 2, 2026
2 of 7 checks passed