feat(eval): foundations P0a — additive types + LatencySummary wiring #178
Merged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reviewer nit: `pub use layer::EvalLayer;` was visually attached to the backward-compat alias's doc comment. Add blank line.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `pub fn fixture_set_hash(dir: &Path) -> anyhow::Result<String>` that hashes a directory of TOML files into a stable 16-char hex digest. Files are sorted by path before hashing so the result is independent of filesystem ordering. Three TDD tests cover stability across invocations, sensitivity to content changes, and ordering independence.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
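The sort-then-hash scheme in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the name `fixture_set_hash_sketch` is hypothetical, it returns `io::Result` rather than `anyhow::Result`, and it swaps sha256 for a hand-written FNV-1a 64-bit hash so that the example needs nothing beyond the standard library (64 bits happen to print as 16 hex characters). Whether the real function also mixes file names into the outer hash is not stated, so the sketch hashes file contents only.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// FNV-1a 64-bit. A std-only STAND-IN for sha256, used purely to keep the
/// sketch dependency-free; it is not cryptographically strong.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Hypothetical sketch of a directory-wide fixture hash.
fn fixture_set_hash_sketch(dir: &Path) -> io::Result<String> {
    // Collect only the *.toml files directly inside `dir`.
    let mut files: Vec<PathBuf> = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_toml = path.extension().and_then(|e| e.to_str()) == Some("toml");
        if path.is_file() && is_toml {
            files.push(path);
        }
    }
    // Sort so the digest does not depend on filesystem iteration order.
    files.sort();

    // Inner hash per file, then one outer hash over the inner digests.
    let mut outer: Vec<u8> = Vec::new();
    for path in &files {
        let inner = fnv1a64(&fs::read(path)?);
        outer.extend_from_slice(&inner.to_be_bytes());
    }
    // A u64 formatted as zero-padded hex is always 16 characters.
    Ok(format!("{:016x}", fnv1a64(&outer)))
}
```

Calling the function twice on an unchanged directory should return the same string, and editing any `.toml` file should change it, which matches the three properties the commit's tests are said to cover.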
…, caps
Add 22 additive fields to `ReportEnv`: layer (`EvalLayer`), task/variant identifiers, embed_dim, similarity_fn_name, judge_model_id, mcp_schema_hash, skill_prompt_hash, schema_version, schema_db_version, migrations_hash, n_runs, is_single_run, run_id, timestamp_utc, git_sha, warmup_iterations, eval_max_usd_baseline_cap, eval_max_usd_run_cap, eval_max_wall_secs_cap, total_cost_usd, total_wall_secs. All new fields carry `#[serde(default)]` so existing baseline JSON files deserialize without change. Fix 9 struct literal call sites in locomo.rs and longmemeval.rs to use the `..Default::default()` spread.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…one)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `build_locomo_env()` helper that fills both the legacy 9 `ReportEnv` fields and all 22 new P0a additive fields (layer, task, variant, embed_dim, similarity_fn_name, judge_model_id, schema_db_version, migrations_hash, n_runs, is_single_run, run_id, timestamp_utc, git_sha, warmup_iterations). Replace the three inline `ReportEnv` literal blocks in `run_locomo_eval` / `run_locomo_eval_reranked` / `run_locomo_eval_expanded` with calls to the helper. Also add `pub const SCHEMA_VERSION: u32 = 51` to db.rs (highest user_version applied by `migrate()`); used as schema_db_version in env.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add `build_lme_env()` helper mirroring `build_locomo_env()` with task="lme". Replace 4 inline `ReportEnv` literal blocks in `run_longmemeval_eval`, `run_longmemeval_eval_reranked`, `run_longmemeval_eval_expanded`, and `run_longmemeval_eval_with_gate` with calls to the helper.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
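The helper pattern in the two commits above (one builder per benchmark instead of an inline literal per runner) might look roughly like this. Everything here is a reduced, hypothetical sketch: `EnvSketch`, the `*_sketch` function names, their parameters, the field types, the `"locomo"` task string, and the rule `is_single_run = (n_runs == 1)` are assumptions for illustration. Only `task="lme"`, the `"gated"` variant, `SCHEMA_VERSION = 51`, and the `option_env!("ORIGIN_MIGRATIONS_HASH")` lookup come from this PR.

```rust
/// Hypothetical, heavily reduced stand-in for the real `ReportEnv`.
#[derive(Debug, Clone, PartialEq)]
struct EnvSketch {
    task: String,
    variant: String,
    embed_dim: Option<usize>,
    schema_db_version: u32,
    migrations_hash: Option<String>,
    n_runs: u32,
    is_single_run: bool,
}

/// Value quoted in the commit message; the real constant lives in db.rs.
const SCHEMA_VERSION: u32 = 51;

/// Shared builder: runners pass only what differs between them.
fn build_env_sketch(task: &str, variant: &str, embed_dim: usize) -> EnvSketch {
    let n_runs = 1;
    EnvSketch {
        task: task.to_owned(),
        variant: variant.to_owned(),
        embed_dim: Some(embed_dim),
        schema_db_version: SCHEMA_VERSION,
        // Compile-time lookup: `None` unless the variable was set during the build.
        migrations_hash: option_env!("ORIGIN_MIGRATIONS_HASH").map(str::to_owned),
        n_runs,
        // Assumed rule for this sketch only.
        is_single_run: n_runs == 1,
    }
}

/// Assumed task string "locomo"; the source only confirms "lme" below.
fn build_locomo_env_sketch(variant: &str, embed_dim: usize) -> EnvSketch {
    build_env_sketch("locomo", variant, embed_dim)
}

fn build_lme_env_sketch(variant: &str, embed_dim: usize) -> EnvSketch {
    build_env_sketch("lme", variant, embed_dim)
}
```

With a single builder, extending the struct means touching one function rather than every runner, which is the failure mode the later `_with_gate` fix addresses.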
eval_harness.rs:3440 had an inline `ReportEnv` literal that fell out of date when Task 5 extended the struct (the file is feature-gated so `cargo check` without `--features` didn't catch it). Add `..ReportEnv::default()` spread. eval_report_roundtrip.rs: replace `assert_eq!(bool, literal)` with `assert!()` per clippy bool_assert_comparison lint that fires under `-D warnings`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…runner
Replace #[derive(Default)] on ReportEnv with a hand-rolled impl that mirrors
the serde default= attributes (similarity_fn_name="cosine", schema_version=1,
n_runs=1). The derived impl produced 0/"" for those fields, inconsistent with
deserialized-from-legacy-JSON state.
Convert run_locomo_eval_with_gate to call build_locomo_env("gated", ...)
instead of an inline ReportEnv literal with ..Default::default(), matching
the equivalent pattern in longmemeval's _with_gate runner.
Add report_env_default_matches_serde_defaults test to eval_report_roundtrip.rs
to catch any future drift.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
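The drift described in this commit can be reproduced in miniature. The sketch below is an assumption-laden excerpt, not the real struct: `ReportEnvSketch`, the `default_*` function names, and the `git_sha` field type are hypothetical, and the serde attributes appear only in comments so the example stays std-only. The three default values ("cosine", 1, 1) are the ones named in the commit message.

```rust
/// Hypothetical four-field excerpt of `ReportEnv`; the real struct is larger.
/// In the real code the first three fields would carry serde attributes such
/// as `#[serde(default = "default_similarity_fn_name")]` (names assumed).
#[derive(Debug, Clone, PartialEq)]
struct ReportEnvSketch {
    similarity_fn_name: String,
    schema_version: u32,
    n_runs: u32,
    git_sha: Option<String>,
}

// The functions a serde `default = "..."` attribute would point at.
fn default_similarity_fn_name() -> String {
    "cosine".to_owned()
}

fn default_schema_version() -> u32 {
    1
}

fn default_n_runs() -> u32 {
    1
}

// Hand-rolled so `Default` and the serde defaults share one source of truth.
impl Default for ReportEnvSketch {
    fn default() -> Self {
        Self {
            similarity_fn_name: default_similarity_fn_name(),
            schema_version: default_schema_version(),
            n_runs: default_n_runs(),
            git_sha: None,
        }
    }
}

/// What `#[derive(Default)]` would have produced: per-type zero values.
fn derived_style_default() -> ReportEnvSketch {
    ReportEnvSketch {
        similarity_fn_name: String::default(),
        schema_version: u32::default(),
        n_runs: u32::default(),
        git_sha: None,
    }
}
```

A call site written as `ReportEnvSketch { git_sha: Some(sha), ..Default::default() }` then inherits "cosine", 1, and 1 for the remaining fields instead of the empty string and zeros that the derived impl would have supplied.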
Resolves conflict in Cargo.toml [workspace.dependencies] where P0a's `fs2 = "0.4"` collided with main's new `[profile.release]` block from the CI-throughput tune (PRs #173/#179/#182/#184/#185/#186 wave). Brings in stale-test fix for `release_workflow_publishes_cli_and_mcp_npm_packages` via PR #173's distribution.rs update (drops origin-darwin-x64 needle).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 0a of the eval-foundations refactor. Pure additive type extension; zero behavior change for current callers. Foundation for P0b (paths + port discovery), P0c (cost caps + save guards), P1 (L1 baselines), P2 (L2 live-daemon), P3 (L3 MCP contract tests).
What changed
- `EvalLayer` enum (L1Db / L2Http / L3Mcp / L4Skill) in new `eval/layer.rs`. L4Skill reserved.
- `ReportEnv` extended with 22 additive fields: `layer`, `task`, `variant`, `embed_dim`, `similarity_fn_name`, `judge_model_id`, `mcp_schema_hash`, `skill_prompt_hash`, `schema_version`, `schema_db_version`, `migrations_hash`, `n_runs`, `is_single_run`, `run_id`, `timestamp_utc`, `git_sha`, `warmup_iterations`, `eval_max_usd_baseline_cap`, `eval_max_usd_run_cap`, `eval_max_wall_secs_cap`, `total_cost_usd`, `total_wall_secs`. All are `Option<T>` or `#[serde(default)]`, so legacy JSON still deserializes.
- `LatencySummary` wired in `eval/runner.rs` (was hardcoded `None`); collected via `std::time::Instant` per query, summarized post-loop. `from_micros()` helper added to `eval/latency.rs`.
- `fixture_set_hash()` in `eval/fixtures.rs` for directory-wide TOML hashing (sort + per-file inner sha256 + outer hash, `sha256[..16]`).
- `build_locomo_env()` + `build_lme_env()` helpers populate the new env fields in all locomo + lme runners (including `_with_gate` for both).
- The `eval-harness` feature gates `tests/eval_harness.rs` compilation so unrelated PRs don't break on eval refactors.
- `pub const SCHEMA_VERSION: u32 = 51` exposed in `db.rs` (eval cache invalidation key for P1).
- `fs2` workspace dep added (used by P0c/P1 for scenario.db file locks).
- `Default` for `ReportEnv` mirrors serde `default = "fn"` attributes (`#[derive(Default)]` doesn't honor those; caught by adversarial review).
Honest deviations from plan (all approved in review)
- `origin_version: Option<String>` as new: already exists as `String`. Omitted duplicate.
- `llm_provider_kind` / `llm_model_id` as MTEB-aligned new field names: these don't exist on actual `ReportEnv`. Used legacy names (`llm_provider_class`, `llm_model`). Note for P0b: `comparable_env_hash` must hash the legacy names.
- … `{ "fixture_revision": ... }` would've failed required-field deserialization).
- `_with_gate` runners converted to use `build_*_env()` helpers: adversarial review caught locomo `_with_gate` initially using `..Default::default()`, which silently produced inconsistent env state. Fixed in bed96dd9.
Test plan
- `cargo clippy --workspace --all-targets --features origin-core/eval-harness -- -D warnings` → clean
- `cargo test --workspace --features origin-core/eval-harness` → 1152 passed, 1 pre-existing failure (`eval::retrieval::tests::test_multi_turn_eval`: FastEmbed model network download, baseline carried, unrelated to P0a)
- `cargo test --workspace` (default features, no eval-harness) → eval_harness.rs gated out, no regression
- `#[serde(default)]` on new fields
- `ReportEnv::default()` returns same values as `serde_json::from_str` of a legacy-shape JSON; drift-catcher test added in bed96dd9
Follow-ups for P0b+
- `fs2` dep added but unused at P0a; P0c/P1 use it for `scenario.lock`.
- `migrations_hash` field reads `option_env!("ORIGIN_MIGRATIONS_HASH")`, which is `None` until P1's build.rs sets it.
- `LatencySummary` truncates sub-ms to 0ms (ms-resolution storage). Acceptable for current callers; revisit if sub-ms percentile needed.
Adversarial review
Full review of integrated impl ran before merge per CLAUDE.md "Code review before merge" rule. Two IMPORTANT findings fixed in bed96dd9:
- `_with_gate` runner converted to `build_locomo_env`
- `#[derive(Default)]` replaced with hand-rolled `impl Default` that mirrors serde defaults
🤖 Generated with Claude Code
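For reviewers who want a concrete picture of the latency wiring and the sub-ms truncation listed under follow-ups, here is a rough std-only sketch. It is not the code in `eval/runner.rs` or `eval/latency.rs`: `LatencySummarySketch`, its field names, `from_micros_sketch`, `run_and_summarize`, and the choice of nearest-rank percentiles are all assumptions. The parts taken from this PR are the per-query `Instant` timing, the post-loop summary, and millisecond-resolution storage.

```rust
use std::time::Instant;

/// Hypothetical stand-in for `LatencySummary`; real field names may differ.
#[derive(Debug, Clone, PartialEq)]
struct LatencySummarySketch {
    p50_ms: u64,
    p95_ms: u64,
    max_ms: u64,
}

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
fn percentile(sorted: &[u64], pct: usize) -> u64 {
    let rank = (pct * sorted.len() + 99) / 100; // ceil(pct / 100 * n)
    sorted[rank.max(1) - 1]
}

/// Summarize per-query samples given in microseconds.
/// Storage is ms-resolution, so integer division drops any sub-ms remainder.
fn from_micros_sketch(samples_us: &[u64]) -> Option<LatencySummarySketch> {
    if samples_us.is_empty() {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_unstable();
    Some(LatencySummarySketch {
        p50_ms: percentile(&sorted, 50) / 1000,
        p95_ms: percentile(&sorted, 95) / 1000,
        max_ms: sorted[sorted.len() - 1] / 1000,
    })
}

/// Time each query with `Instant`, then summarize once after the loop.
fn run_and_summarize<F: FnMut(&str)>(
    queries: &[&str],
    mut run_query: F,
) -> Option<LatencySummarySketch> {
    let mut samples_us = Vec::with_capacity(queries.len());
    for &q in queries {
        let start = Instant::now();
        run_query(q);
        samples_us.push(start.elapsed().as_micros() as u64);
    }
    from_micros_sketch(&samples_us)
}
```

Under these assumptions, samples of 400 µs and 900 µs both summarize to 0 ms, which is the truncation the follow-up note flags; keeping the raw microsecond samples until the final division is one way to leave room for sub-ms percentiles later.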