Auto-dedupe reran eval artifacts in reuse-ingest validator#1965
Conversation
A flaky eval retried several times leaves multiple raw eval_* dirs and multiple eval_results_all rows for one logical eval identity. The reuse validator rejected these duplicates, blocking artifact reuse and ingest recovery (e.g. PR #1931 -> failed main ingest run 28459613726). validate_reusable_sweep_artifacts.py now collapses reran eval duplicates in place before validating: it keeps the latest result per eval identity (by lm-eval result timestamp) and prunes the superseded raw dirs / aggregate rows. It only collapses identities that have a clear latest result; genuinely ambiguous duplicates (no result timestamp to order them by) are still rejected. No flag, no new files; eval-only, no-op when nothing to collapse. The InferenceX-app ingest upserts eval rows on (workflow_run_id, config_id, task, isl, osl, conc) with the per-config eval dir path overwriting the aggregate, so with duplicates present the ingested value is non-deterministic. Verified end-to-end on the real PR #1931 source artifacts + app ingest against a local Postgres: after dedupe, one eval_results row at the latest value (em_strict 0.9515); the raw pile-up otherwise ingests an arbitrary flaky value (0.7817). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3de1816 to
67766a6
Compare
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🟡
utils/dedupe_reusable_eval_artifacts.py:164-183— The dedupe's aggregate-row selector uses an unbounded substring check (if winner in str(data[idx].get("source") or "")) to match the winner's raw-dir name against the aggregate'ssourcepath. Real runner pools (e.g..github/configs/runners.yamldefinesh200-dgxc-slurm_0through_13) mix single- and double-digit indices, so if a flaky eval retries across both_1and_10for one identity and_1wins, the winner name is a prefix substring of the loser's source path — the loop can return the loser's row. Result: on-disk aggregate keeps the wrongem_strictand asourcefield pointing to a raw dir thatprune_raw_dirthen deletes. The final DB value is still correct (app ingest's per-config path overwrites last), so this is a provenance/on-disk-consistency defect rather than a data-loss one. Fix is trivial: match by path component, e.g.f"/{winner}/" in source.Extended reasoning...
What the bug is
In
utils/dedupe_reusable_eval_artifacts.py:_choose_aggregate_row(lines 164-183), the selector picks which aggregate row survives for a duplicate identity group with:if winner is not None: for idx in indices: if winner in str(data[idx].get("source") or ""): return idx
winneris a bare raw-dir name (e.g.eval_MyExp_..._h200-dgxc-slurm_1) andsourceis a full path (str(json_path)fromcollect_eval_results.py:168) shaped likeeval_results/<raw_dir_name>/results_<ts>.json. Theincheck is unbounded substring containment, not a boundary-aware match.The concrete trigger
RESULT_FILENAMEinbenchmark-tmpl.yml:180ends with_${{ runner.name }}, and the eval artifact name iseval_${EXP_NAME}_${RESULT_FILENAME}(line 297), so the last token of every raw dir name is literallyrunner.name..github/configs/runners.yamldefines pools with mixed-width numbering — e.g.h200-dgxc-slurm_0through_9and_10through_13, plush100-cw_00/_01. In those pools,..._slurm_1is a strict prefix of..._slurm_10.Step-by-step proof
- A flaky eval retries and lands raw dirs
eval_..._h200-dgxc-slurm_1andeval_..._h200-dgxc-slurm_10for the same identity, plus corresponding rows ineval_results_all. select_winnerspicks the_1attempt (later lm-eval timestamp) as the winner.dedupe_aggregatecollects both aggregate rows intoindicesfor the identity; ordering depends on filesystemiterdir(unpredictable)._choose_aggregate_rowiteratesindicesand callsif winner in str(data[idx].get("source")):- Winner =
eval_..._h200-dgxc-slurm_1. - Loser row's source =
eval_results/eval_..._h200-dgxc-slurm_10/results_....json— contains..._h200-dgxc-slurm_1as a prefix substring. Match. If this row is first inindices, the function returns the loser's idx.
- Winner =
- Dedupe writes back the loser's aggregate row (stale
em_strict, wrongmetadata) whosesourcepoints to..._slurm_10. prune_raw_diruses exact equality (winners.get(key) != name, line 199) — correct — so it deletes the_10raw dir, leaving the surviving aggregate row pointing to a directory that no longer exists.
Why existing code does not prevent it
The existing tests use runner suffixes
_15,_16,_12,_03— none are prefixes of the others, so the substring hazard is never exercised.raw_dir_contributionsusesiterdir, so index order in the aggregate is filesystem-dependent; the correct row may or may not come first.Impact
- On-disk
eval_results_all/*.json: aggregate row can hold the flaky (loser)em_strictand metadata for that identity. - Provenance: the surviving row's
sourcepoints to a raw dirprune_raw_dirdeletes, so any consumer reading the aggregate directly sees dangling provenance. - DB: the app ingest per-config path overwrites the aggregate path last, reading from the (correct) surviving winning raw dir, so the final
em_strictin the DB is still right. This limits the blast radius to on-disk consistency and the docstring's "latest result per eval identity" promise.
Suggested fix (one-line)
if f"/{winner}/" in str(data[idx].get("source") or ""): return idx
Or split
sourceon/and compare segments. Either bounds the match to a full path component. - A flaky eval retries and lands raw dirs
-
🟡
utils/dedupe_reusable_eval_artifacts.py:71-95—raw_dir_contributions(utils/dedupe_reusable_eval_artifacts.py:82) callsload_json(artifact_dir / "meta_env.json")with no exception handling. If any reused raw eval dir has a missing or truncatedmeta_env.json, the dedupe step now wired in front of the validator will die with an unhandledFileNotFoundError/json.JSONDecodeErrortraceback instead of the validator's structured 'raw eval artifact X is missing meta_env.json' listing that this step is supposed to defer to.dedupe_aggregatehas the same gap on aggregate JSONs. Mirror the validator'smeta_path.is_file()+try/except (OSError, json.JSONDecodeError)guards and skip un-loadable dirs so the validator still owns the error reporting.Extended reasoning...
What's wrong
raw_dir_contributionsunconditionally callsload_json(artifact_dir / "meta_env.json")at line 82.load_json(defined atvalidate_reusable_sweep_artifacts.py:30-33) is a bareopen()+json.load()— no exception handling. So:- Missing
meta_env.json→FileNotFoundError(anOSError) - Truncated / malformed JSON →
json.JSONDecodeError
Either propagates up through
select_winners→prune_raw_dir→dedupe→main, crashing the whole dedupe step.dedupe_aggregatehas the same shape on itsload_json(agg_path)call.Why the validator does not save us
The dedupe step is now wired to run before the validator in the
reuse-ingest-artifactsjob (run-sweep.yml:725-731) and in therecover-failed-ingest.mdpre-check. So a bad artifact makes the merge-time ingest fail with a raw Python traceback rather than the validator's own structured message. The validator (raw_eval_key_rows, validate_reusable_sweep_artifacts.py:358-378) explicitly:- Checks
meta_path.is_file()and appends"raw eval artifact X is missing meta_env.json" - Wraps
load_jsonintry/except (OSError, json.JSONDecodeError) - Also handles the non-dict case
The dedupe module only handles case (3). That contradicts its docstring's 'safe to run unconditionally in the reuse path' / 'no-op when there is nothing to collapse' claim.
Step-by-step proof
- Operator runs
/recover-failed-ingest.gh run downloadproducessource-artifacts/eval_minimaxm3_conc4096_b300-nv_15/with a truncatedmeta_env.json(e.g. because the artifact upload was interrupted). - The recovery script runs
python3 utils/dedupe_reusable_eval_artifacts.py --artifacts-dir source-artifacts(or the equivalent merge-time step). dedupe→select_winners→raw_eval_artifact_dirsyields that dir.raw_dir_contributionscallsload_json(<dir>/meta_env.json)→open(...)succeeds,json.load(f)raisesjson.JSONDecodeError.- No handler catches it; the process exits with a Python traceback pointing at line 82.
- The validator step never runs; the operator sees a traceback instead of
"raw eval artifact 'eval_minimaxm3_conc4096_b300-nv_15' has invalid meta_env.json: ...". If more than one dir is affected they get only the first error instead of the validator's full listing.
Fix
Mirror the validator's guards inline in
raw_dir_contributions(and add the sametry/exceptaroundload_json(agg_path)indedupe_aggregate): ifmeta_env.jsondoes not exist or fails to decode, return the same([], {}, False)no-contribution result. That leaves the dir untouched, lets dedupe finish cleanly, and defers the error tovalidate_reusable_sweep_artifacts.py— which is the step designed to produce a human-readable listing for exactly this case.Severity
nit — the ingest job must fail either way (a corrupted
meta_env.jsonis a real error). The only observable regression is diagnostic quality: a Python traceback replaces the validator's clean 'raw eval artifact X is missing meta_env.json' message, and the operator sees only the first failure instead of the validator's full listing. No wrong data is committed. The docstring's 'safe to run unconditionally' claim is weakened but not falsified for the common case. Worth fixing, but not a merge blocker. - Missing
Problem
When an eval is retried due to flakes, a sweep accumulates several raw
eval_*dirs and severaleval_results_allrows for one logical eval identity.validate_reusable_sweep_artifacts.pyhard-rejects these duplicates, which blocks artifact reuse and/recover-failed-ingest.Concrete case: PR #1931 ran evals 6× (flakes) → source run
28214046079carries 4 raw dirs + 3 aggregate rows for theminimaxm3 / fp4 / b300 / dynamo-vllmconc4096 config. The main-branch ingest for that merge (run28459613726) failed at the validator:Why it matters beyond the gate
The app ingest (
InferenceX-appadmin:db:ingest:ci) writes evals via two paths into the sameeval_resultsrows — theeval_results_allaggregate and the per-configeval_*dirs — upserting on(workflow_run_id, config_id, task, isl, osl, conc)with the per-config path overwriting last. With duplicates present the surviving value is non-deterministic (arbitraryreaddirSyncorder across the reran dirs). That is exactly why the validator blocks duplicates.Change
validate_reusable_sweep_artifacts.pynow collapses reran eval duplicates in place before validating — no flag, no new files:eval_key, so dedupe granularity matches the gate.Because it lives in the validator, the existing
reuse-ingest-artifactsstep and the/recover-failed-ingestpre-check get it automatically. Tests added totest_validate_reusable_sweep_artifacts.py(legacy reruns, batched partial-overlap, no-op, and ambiguous-still-rejected).Verification (real artifacts + real ingest)
Ran the actual
InferenceX-appingest against a local Postgres using PR #1931's source artifacts:em_strict(conc4096/dtp8)Follow-up / caveat
The validator's
eval_key(and therefore this dedupe) omitstask, i.e. it assumes single-task (gsm8k) evals — true today. Multi-task-per-config would need a key change.