feat(eval): foundations P0c — cost caps + wall-clock watchdog + save guards + cleanup#191

Merged
7xuanlu merged 9 commits into main from worktree-feature+eval-foundations-p0c
May 25, 2026

@7xuanlu 7xuanlu commented May 25, 2026

Summary

Phase 0c of the eval-foundations refactor. Additive types + new save path + loud error replacement.

Builds on P0a (#178, merged 46a9703) + P0b (#190, merged 032ce63).

What changed

Cost caps

  • parse_eval_max_usd(value: Option<&str>) -> anyhow::Result<Option<f64>> in eval/anthropic.rs. Replaces silent unwrap_or(0.0) with explicit bails: parse-error / non-finite / <= 0 / > $10 (unless EVAL_I_REALLY_MEAN_IT=1). Both submit_batch + submit_batch_with_tool call sites converted.
  • RunCostTracker in new eval/cost.rs: an AtomicU64 millicents accumulator with a soft-fence cap. record_usd refunds the increment on cap overage so total_usd() stays honest after a failed call.
  • reconcile_cost_usd(input_tokens, output_tokens) in eval/anthropic.rs: Haiku batch rates ($0.25/$1.25 per MTok, input/output). P1 will wire this into the batch judge path.
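
The validation order described for parse_eval_max_usd can be sketched as follows. This is a minimal illustration, not the code in eval/anthropic.rs: the `_sketch` name is hypothetical, the error type is a plain `String` where the PR returns `anyhow::Result`, the `allow_over_ceiling` parameter stands in for reading EVAL_I_REALLY_MEAN_IT=1 from the environment, and treating an unset value as "no cap" is an assumption.

```rust
/// Ceiling from the PR description: values above $10 need an explicit override.
const HARD_CEILING_USD: f64 = 10.0;

/// Sketch of the cap parser. `allow_over_ceiling` replaces the env lookup so
/// the function stays pure and easy to test (an assumption of this sketch).
fn parse_eval_max_usd_sketch(
    value: Option<&str>,
    allow_over_ceiling: bool,
) -> Result<Option<f64>, String> {
    // Assumption: an unset variable means no cap was requested.
    let Some(raw) = value else { return Ok(None) };

    // Reject garbage loudly; the old code fell back to 0.0 here.
    let usd: f64 = raw
        .trim()
        .parse()
        .map_err(|e| format!("cost cap {raw:?} is not a number: {e}"))?;

    // "NaN" and "inf" parse successfully as f64, so check them separately.
    if !usd.is_finite() {
        return Err(format!("cost cap {raw:?} must be finite"));
    }
    if usd <= 0.0 {
        return Err(format!("cost cap {raw:?} must be greater than 0"));
    }
    if usd > HARD_CEILING_USD && !allow_over_ceiling {
        return Err(format!(
            "cost cap {raw:?} exceeds ${HARD_CEILING_USD}; set the override to proceed"
        ));
    }
    Ok(Some(usd))
}
```

Passing the override as a parameter also sidesteps the process-env race between parallel tests that one of the commits in this PR works around with a lock.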

Wall-clock

  • WallClockWatchdog in new eval/wall_clock.rs. start(cap) / start_with_check_interval(cap, check) / disabled() / from_env(). Reads EVAL_MAX_WALL_SECS (default 14400 = 4h). Spawns tokio task that flips is_exceeded() atomic when cap elapses. Uses log crate (origin-core convention; tracing not in deps).
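
The watchdog idea (a background worker that flips an atomic flag once the cap elapses) can be sketched with the standard library alone. Note the swap: the PR spawns a tokio task, while this sketch uses a plain OS thread so it runs without dependencies. `WatchdogSketch` is a hypothetical name, and the env-driven `from_env()` constructor is left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Sketch of a wall-clock watchdog backed by an OS thread (not tokio).
struct WatchdogSketch {
    exceeded: Arc<AtomicBool>,
}

impl WatchdogSketch {
    /// Spawns a worker that wakes every `check` and sets the flag once
    /// `cap` has elapsed since the call.
    fn start_with_check_interval(cap: Duration, check: Duration) -> Self {
        let exceeded = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&exceeded);
        let started = Instant::now();
        // The worker is detached: like the tokio task in the PR, it keeps
        // running after the struct is dropped.
        thread::spawn(move || {
            while started.elapsed() < cap {
                thread::sleep(check);
            }
            flag.store(true, Ordering::SeqCst);
        });
        Self { exceeded }
    }

    /// A watchdog that never fires; no worker is spawned.
    fn disabled() -> Self {
        Self { exceeded: Arc::new(AtomicBool::new(false)) }
    }

    /// Cheap poll for callers to check between units of work.
    fn is_exceeded(&self) -> bool {
        self.exceeded.load(Ordering::SeqCst)
    }
}
```

A runner would poll `is_exceeded()` between scenarios and stop cleanly once it returns true. The flag can flip up to one check interval after the cap, so the interval bounds how late the stop is noticed.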

Save guards

  • save_full_report(&Path, &EvalReport) -> anyhow::Result<PathBuf> in eval/report.rs. Strict guards:
    • env: Some(...) required (panic-with-message otherwise)
    • All metric f64 fields finite (15 fields covered: 12 mandatory + 3 Option<f64>). Walks struct directly because serde_json silently maps NaN to null — a JSON-walk would miss the very value we're rejecting.
    • Skip rate ≤ 5% when total_scenarios > 0
    • enrichment_failures == 0 unless EVAL_ACCEPT_PARTIAL=1
  • Atomic write: <final>.tmp.<pid>.<nanos> in SAME directory as final, then rename. Same-filesystem guaranteed.
  • save_partial_report writes to partial/<runid>__<layer>__<task>__<variant>.json — never baselines/. Stamps truncated_reason on the report copy.
  • EvalReport extended with total_scenarios, skipped_scenarios: Vec<String>, enrichment_failures: usize, truncated_reason: Option<String> — all #[serde(default)].
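
The atomic-write step can be sketched as below, assuming only the standard library. The function name is hypothetical and the report is passed as raw bytes; the real save_full_report serializes an EvalReport and runs its guards first.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process;
use std::time::{SystemTime, UNIX_EPOCH};

/// Writes `bytes` to `<final>.tmp.<pid>.<nanos>` beside the destination,
/// then renames the temp file over the final path.
fn atomic_write_sketch(final_path: &Path, bytes: &[u8]) -> io::Result<PathBuf> {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);

    // Appending to the full path keeps the temp file in the same directory,
    // and therefore on the same filesystem, as the final file.
    let mut tmp_name = final_path.as_os_str().to_os_string();
    tmp_name.push(format!(".tmp.{}.{}", process::id(), nanos));
    let tmp_path = PathBuf::from(tmp_name);

    fs::write(&tmp_path, bytes)?;
    if let Err(e) = fs::rename(&tmp_path, final_path) {
        // best-effort: do not leave the temp file behind on failure
        let _ = fs::remove_file(&tmp_path);
        return Err(e);
    }
    Ok(final_path.to_path_buf())
}
```

Because the rename replaces the destination in one step on the same filesystem, a reader should see either the old file or the complete new one, never a half-written report.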

Cleanup

  • eval/shared.rs: 9 let _ = ... swallow sites converted to if let Err(e) = ... { log::warn!(...) } with context (memory_id / entity_id / source_id + error). 2 sites simplified from let _ = expr.await? → expr.await? (the let _ = was only discarding the success value). 6 best-effort filesystem I/O sites kept with explicit // best-effort: ... comments.
  • Behavioral fix bundled in cleanup: chunk_linked += 1 now increments only on successful update_memory_entity_id (previously it overcounted on silent failures).
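
The counting fix can be illustrated with a small sketch. Everything here is hypothetical scaffolding: `update_entity_id_sketch` stands in for the real store call, and `eprintln!` replaces `log::warn!` so the example needs no dependencies.

```rust
/// Hypothetical stand-in for a fallible store call such as
/// update_memory_entity_id; it fails for ids that start with "bad".
fn update_entity_id_sketch(memory_id: &str) -> Result<(), String> {
    if memory_id.starts_with("bad") {
        Err(format!("no such memory: {memory_id}"))
    } else {
        Ok(())
    }
}

/// Links each memory and returns how many updates actually succeeded.
fn link_chunks_sketch(memory_ids: &[&str]) -> usize {
    let mut chunk_linked = 0;
    for memory_id in memory_ids {
        // Old shape: `let _ = update(..); chunk_linked += 1;`
        // which counted failed updates as linked.
        match update_entity_id_sketch(memory_id) {
            Ok(()) => chunk_linked += 1,
            // The PR logs through log::warn!; eprintln! is a substitute here.
            Err(e) => eprintln!("warn: entity link failed memory_id={memory_id} error={e}"),
        }
    }
    chunk_linked
}
```

With inputs `["m1", "bad-2", "m3"]` the sketch reports 2 linked chunks and emits one warning, where the swallow-and-count shape would have reported 3.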

Adversarial review (3 NITs, all non-blocking)

  • RunCostTracker::record_usd saturating cast on pathological huge USD inputs — defensible per soft-fence semantics; parse_eval_max_usd caps inputs to $10 unless override.
  • WallClockWatchdog tokio task outlives struct drop — documented as acceptable per spec.
  • save_full_report tmp filename pid + nanos could collide on same-nanosecond concurrent calls from same process — practically impossible on macOS clock resolution.

Test plan

  • cargo clippy --workspace --all-targets --features origin-core/eval-harness -- -D warnings → clean
  • cargo test -p origin-core --lib --features eval-harness → 1160 lib tests pass
  • cargo test -p origin-core --test eval_cost_caps --features eval-harness → 15 tests pass (6 parse + 5 tracker + 1 reconcile + 3 watchdog)
  • cargo test -p origin-core --test eval_save_guards --features eval-harness → 7 tests pass (4 original + 3 Option regression)
  • Pre-existing failures unchanged: eval::retrieval::tests::test_multi_turn_eval (FastEmbed network) + cmd_backfill::tests::check_service_unloaded_returns_ok_when_no_service_installed (env-specific to dev machines with daemon installed)

Follow-ups for P1+

  • reconcile_cost_usd + RunCostTracker get wired through answer_quality.rs batch judge in P1 Task 4
  • save_full_report becomes the canonical save path for L1 baselines in P1 Task 4
  • WallClockWatchdog::from_env() invoked at L1 / L2 runner start in P1 / P2
  • EVAL_MAX_USD_RUN (cumulative) cap — declared via RunCostTracker::new(parse_eval_max_usd(env::var("EVAL_MAX_USD_RUN").ok().as_deref())?) in P1 orchestration

🤖 Generated with Claude Code

7xuanlu and others added 9 commits May 25, 2026 01:55
Add parse_eval_max_usd() with explicit failure modes: garbage input, non-finite,
<= 0, and > $10 without EVAL_I_REALLY_MEAN_IT=1. Replace both unwrap_or(0.0)
sites in submit_batch and submit_batch_with_tool. 6 new tests in eval_cost_caps.rs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mirrors the pattern from PR #160 (eval_harness.rs:3770). Without a lock,
parallel test execution races on the shared process env.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… cap

On cap-exceeded, fetch_sub refunds the increment so total_usd() reflects
only successful spend. Negative/non-finite cap_usd saturated to 0 via
.max(0.0) cast with debug_assert for visibility. Two regression tests added.
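
A minimal sketch of the refund-on-overage accounting, assuming 100,000 millicents per USD. `CostTrackerSketch` is a hypothetical simplification: it takes the cap as a plain `f64` and returns a `String` error, and it omits the `debug_assert` mentioned above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// 1 USD = 100 cents = 100_000 millicents.
const MILLICENTS_PER_USD: f64 = 100_000.0;

struct CostTrackerSketch {
    spent_millicents: AtomicU64,
    cap_millicents: u64,
}

impl CostTrackerSketch {
    fn new(cap_usd: f64) -> Self {
        // `.max(0.0)` maps negative and NaN caps to 0; the `as u64` cast
        // saturates, so an infinite cap becomes u64::MAX.
        let cap_millicents = (cap_usd.max(0.0) * MILLICENTS_PER_USD).round() as u64;
        Self { spent_millicents: AtomicU64::new(0), cap_millicents }
    }

    /// Adds `usd` to the running total, or refunds it and errors if the
    /// addition would push the total past the cap.
    fn record_usd(&self, usd: f64) -> Result<(), String> {
        let delta = (usd.max(0.0) * MILLICENTS_PER_USD).round() as u64;
        let before = self.spent_millicents.fetch_add(delta, Ordering::SeqCst);
        if before.saturating_add(delta) > self.cap_millicents {
            // Refund so total_usd() reflects only successful spend.
            self.spent_millicents.fetch_sub(delta, Ordering::SeqCst);
            return Err(format!("cost cap exceeded: attempted +${usd}"));
        }
        Ok(())
    }

    fn total_usd(&self) -> f64 {
        self.spent_millicents.load(Ordering::SeqCst) as f64 / MILLICENTS_PER_USD
    }
}
```

The add-then-refund shape is what makes this a soft fence: between `fetch_add` and `fetch_sub`, a concurrent reader can briefly observe a total above the cap.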

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add reconcile_cost_usd(input_tokens, output_tokens) -> f64 to
eval::anthropic using Claude 3.5 Haiku batch-discounted pricing
($0.25/MTok input, $1.25/MTok output). Companion test in
eval_cost_caps verifies all four cases (input-only, output-only,
mixed, zero).
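
The arithmetic is a per-token rate applied to each side. A sketch using the rates quoted in this commit message; the `u64` parameter types and the `_sketch` name are assumptions:

```rust
/// Batch-discounted rates as quoted in the commit message above.
const INPUT_USD_PER_MTOK: f64 = 0.25;
const OUTPUT_USD_PER_MTOK: f64 = 1.25;
const TOKENS_PER_MTOK: f64 = 1_000_000.0;

/// Converts token usage into USD: tokens times rate, divided by one million.
fn reconcile_cost_usd_sketch(input_tokens: u64, output_tokens: u64) -> f64 {
    let input_usd = input_tokens as f64 * INPUT_USD_PER_MTOK;
    let output_usd = output_tokens as f64 * OUTPUT_USD_PER_MTOK;
    (input_usd + output_usd) / TOKENS_PER_MTOK
}
```

For example, 2,000,000 input tokens and 400,000 output tokens cost 2 × $0.25 + 0.4 × $1.25 = $1.00.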

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds save_full_report() in eval/report.rs writing full EvalReport to layered
path via encode_baseline_path. Guards: env required, finite metrics, skip ≤5%,
enrichment_failures==0 unless EVAL_ACCEPT_PARTIAL=1. Atomic same-dir tmp+rename.

Adds save_partial_report() to partial/ dir with truncated_reason stamp.

EvalReport extended with total_scenarios, skipped_scenarios, enrichment_failures,
truncated_reason — all additive #[serde(default)].

NaN detection via first_non_finite_field() (serde_json maps NaN to null).
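
The finite-metrics and skip-rate guards can be sketched as follows. The struct and its field names are hypothetical and far smaller than the real EvalReport; the point is that the check walks typed fields directly instead of inspecting serialized JSON, where a NaN would already have become null.

```rust
/// Hypothetical, trimmed-down stand-in for the metric fields of EvalReport.
struct MetricsSketch {
    recall_at_5: f64,
    mrr: f64,
    judge_score: Option<f64>,
}

/// Returns the name of the first field holding NaN or an infinity.
/// Absent optional metrics are treated as fine.
fn first_non_finite_field_sketch(m: &MetricsSketch) -> Option<&'static str> {
    let fields = [
        ("recall_at_5", Some(m.recall_at_5)),
        ("mrr", Some(m.mrr)),
        ("judge_score", m.judge_score),
    ];
    for (name, value) in fields.iter() {
        if let Some(x) = value {
            if !x.is_finite() {
                return Some(*name);
            }
        }
    }
    None
}

/// Skip-rate guard: at most 5% skipped, checked only when there are scenarios.
fn skip_rate_ok_sketch(total_scenarios: usize, skipped: usize) -> bool {
    total_scenarios == 0 || (skipped as f64 / total_scenarios as f64) <= 0.05
}
```

Returning the field name lets the save path name the offending metric in its error message instead of failing with a generic "non-finite value".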

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
9 sites converted to if-let-Err with log::warn! and full context
(entity/memory IDs, error). 2 let _ = expr.await? reduced to plain
expr.await? (error already propagated via ?; let _ = was discarding
only the success usize). 6 filesystem I/O sites kept as let _ = with
explicit // best-effort: comments (create_dir_all, writeln, flush on
cache files).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@7xuanlu 7xuanlu merged commit ffbda8f into main May 25, 2026
8 checks passed
@7xuanlu 7xuanlu deleted the worktree-feature+eval-foundations-p0c branch May 25, 2026 16:26