
Auto-dedupe reran eval artifacts in reuse-ingest validator #1965

Merged: Oseltamivir merged 1 commit into main from klaud/dedupe-reusable-eval-artifacts on Jul 1, 2026


@Oseltamivir Oseltamivir commented Jul 1, 2026

Collaborator

Problem

When an eval is retried due to flakes, a sweep accumulates several raw eval_* dirs and several eval_results_all rows for one logical eval identity. validate_reusable_sweep_artifacts.py hard-rejects these duplicates, which blocks artifact reuse and /recover-failed-ingest.

Concrete case: PR #1931 ran evals 6× (flakes) → source run 28214046079 carries 4 raw dirs + 3 aggregate rows for the minimaxm3 / fp4 / b300 / dynamo-vllm conc4096 config. The main-branch ingest for that merge (run 28459613726) failed at the validator:

raw eval artifacts contain 3 duplicate row(s)
  duplicate x4: ('multi','b300','minimaxm3','dynamo-vllm','fp4','none',8192,1024,2,2,True,4,8,8,True,2,4096)
eval aggregate artifacts contain 2 duplicate row(s)

Why it matters beyond the gate

The app ingest (InferenceX-app admin:db:ingest:ci) writes evals via two paths into the same eval_results rows — the eval_results_all aggregate and the per-config eval_* dirs — upserting on (workflow_run_id, config_id, task, isl, osl, conc) with the per-config path overwriting last. With duplicates present the surviving value is non-deterministic (arbitrary readdirSync order across the reran dirs). That is exactly why the validator blocks duplicates.
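The non-determinism can be illustrated with a minimal sketch. It assumes the upsert behaves as plain last-writer-wins on the conflict key; the `ingest` function and the key tuple are hypothetical stand-ins, not the app's code, and only the two em_strict values come from the verification below.

```python
from itertools import permutations

# Hypothetical stand-in for the app's upsert: one row per conflict key,
# and whichever write arrives last wins.
def ingest(writes):
    table = {}
    for key, em_strict in writes:
        table[key] = em_strict  # upsert: overwrite on key conflict
    return table

# Two reran attempts for the same logical eval. The key tuple is illustrative,
# shaped like (workflow_run_id, config_id, task, isl, osl, conc).
key = (28214046079, "cfg", "gsm8k", 8192, 1024, 4096)
attempts = [(key, 0.7817), (key, 0.9515)]

# Every possible directory listing order is a possible ingest order.
outcomes = {ingest(order)[key] for order in permutations(attempts)}
print(sorted(outcomes))  # [0.7817, 0.9515]
```

Either value can end up in the single surviving row, depending only on the order in which the duplicate dirs are read.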

Change

validate_reusable_sweep_artifacts.py now collapses reran eval duplicates in place before validating — no flag, no new files:

  • Keeps the latest result per eval identity (by lm-eval result timestamp) and prunes the superseded raw dirs (and, for batched artifacts, superseded concurrencies) plus aggregate rows.
  • Only collapses identities that have a clear latest result; genuinely ambiguous duplicates (no result timestamp to order them by) are still rejected.
  • Eval-only (fixed-sequence/agentic untouched); no-op when nothing to collapse; idempotent.
  • Reuses the module's own eval_key, so dedupe granularity matches the gate.
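The selection rule in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's code: `select_latest` and the tuple layout are hypothetical, and treating tied timestamps as ambiguous is an assumption of this sketch (the PR only names the missing-timestamp case).

```python
from collections import defaultdict

def select_latest(attempts):
    """attempts: list of (identity, dir_name, timestamp_or_None).

    Returns (winners, ambiguous): winners maps identity -> dir_name to keep;
    ambiguous lists identities whose duplicates cannot be ordered.
    """
    groups = defaultdict(list)
    for identity, dir_name, ts in attempts:
        groups[identity].append((ts, dir_name))

    winners, ambiguous = {}, []
    for identity, group in groups.items():
        if len(group) == 1:
            continue  # nothing to collapse for this identity
        stamps = [ts for ts, _ in group]
        # No timestamp, or a tie for latest (assumption): leave for the validator.
        if any(ts is None for ts in stamps) or stamps.count(max(stamps)) > 1:
            ambiguous.append(identity)
            continue
        winners[identity] = max(group)[1]  # dir with the latest timestamp
    return winners, ambiguous

attempts = [
    ("conc4096", "eval_a_15", 100.0),
    ("conc4096", "eval_a_16", 200.0),
    ("conc64", "eval_b_12", None),
    ("conc64", "eval_b_03", 50.0),
    ("conc8", "eval_c_01", 10.0),
]
winners, ambiguous = select_latest(attempts)
print(winners)    # {'conc4096': 'eval_a_16'}
print(ambiguous)  # ['conc64']
```

Once the losing dirs are pruned, every identity has a single attempt, so a second pass finds nothing to collapse, which matches the idempotence claim.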

Because it lives in the validator, the existing reuse-ingest-artifacts step and the /recover-failed-ingest pre-check get it automatically. Tests added to test_validate_reusable_sweep_artifacts.py (legacy reruns, batched partial-overlap, no-op, and ambiguous-still-rejected).

Verification (real artifacts + real ingest)

Ran the actual InferenceX-app ingest against a local Postgres using PR #1931's source artifacts:

| | validator | ingested em_strict (conc4096/dtp8) | rows for key |
| --- | --- | --- | --- |
| raw 6-attempt pileup | ❌ blocked | 0.7817 (flaky, wrong) | arbitrary |
| after dedupe | ✅ passes | 0.9515 (latest run) | exactly 1 |

Follow-up / caveat

The validator's eval_key (and therefore this dedupe) omits task, i.e. it assumes single-task (gsm8k) evals — true today. Multi-task-per-config would need a key change.

@Oseltamivir Oseltamivir requested a review from a team July 1, 2026 00:31
A flaky eval retried several times leaves multiple raw eval_* dirs and
multiple eval_results_all rows for one logical eval identity. The reuse
validator rejected these duplicates, blocking artifact reuse and ingest
recovery (e.g. PR #1931 -> failed main ingest run 28459613726).

validate_reusable_sweep_artifacts.py now collapses reran eval duplicates
in place before validating: it keeps the latest result per eval identity
(by lm-eval result timestamp) and prunes the superseded raw dirs / aggregate
rows. It only collapses identities that have a clear latest result; genuinely
ambiguous duplicates (no result timestamp to order them by) are still
rejected. No flag, no new files; eval-only, no-op when nothing to collapse.

The InferenceX-app ingest upserts eval rows on
(workflow_run_id, config_id, task, isl, osl, conc) with the per-config eval
dir path overwriting the aggregate, so with duplicates present the ingested
value is non-deterministic. Verified end-to-end on the real PR #1931 source
artifacts + app ingest against a local Postgres: after dedupe, one eval_results
row at the latest value (em_strict 0.9515); the raw pile-up otherwise ingests
an arbitrary flaky value (0.7817).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Oseltamivir Oseltamivir force-pushed the klaud/dedupe-reusable-eval-artifacts branch from 3de1816 to 67766a6 Compare July 1, 2026 00:39
@Oseltamivir Oseltamivir changed the title from "Dedupe reran eval artifacts before reuse ingest" to "Auto-dedupe reran eval artifacts in reuse-ingest validator" Jul 1, 2026
@Oseltamivir Oseltamivir merged commit 493f781 into main Jul 1, 2026
5 checks passed
@Oseltamivir Oseltamivir deleted the klaud/dedupe-reusable-eval-artifacts branch July 1, 2026 00:45

@claude claude Bot left a comment

Contributor


Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 utils/dedupe_reusable_eval_artifacts.py:164-183 — The dedupe's aggregate-row selector uses an unbounded substring check (if winner in str(data[idx].get("source") or "")) to match the winner's raw-dir name against the aggregate's source path. Real runner pools (e.g. .github/configs/runners.yaml defines h200-dgxc-slurm_0 through _13) mix single- and double-digit indices, so if a flaky eval retries across both _1 and _10 for one identity and _1 wins, the winner name is a prefix substring of the loser's source path — the loop can return the loser's row. Result: on-disk aggregate keeps the wrong em_strict and a source field pointing to a raw dir that prune_raw_dir then deletes. The final DB value is still correct (app ingest's per-config path overwrites last), so this is a provenance/on-disk-consistency defect rather than a data-loss one. Fix is trivial: match by path component, e.g. f"/{winner}/" in source.

    Extended reasoning...

    What the bug is

    In utils/dedupe_reusable_eval_artifacts.py:_choose_aggregate_row (lines 164-183), the selector picks which aggregate row survives for a duplicate identity group with:

    if winner is not None:
        for idx in indices:
            if winner in str(data[idx].get("source") or ""):
                return idx

    winner is a bare raw-dir name (e.g. eval_MyExp_..._h200-dgxc-slurm_1) and source is a full path (str(json_path) from collect_eval_results.py:168) shaped like eval_results/<raw_dir_name>/results_<ts>.json. The in check is unbounded substring containment, not a boundary-aware match.

    The concrete trigger

    RESULT_FILENAME in benchmark-tmpl.yml:180 ends with _${{ runner.name }}, and the eval artifact name is eval_${EXP_NAME}_${RESULT_FILENAME} (line 297), so the last token of every raw dir name is literally runner.name. .github/configs/runners.yaml defines pools with mixed-width numbering — e.g. h200-dgxc-slurm_0 through _9 and _10 through _13, plus h100-cw_00/_01. In those pools, ..._slurm_1 is a strict prefix of ..._slurm_10.

    Step-by-step proof

    1. A flaky eval retries and lands raw dirs eval_..._h200-dgxc-slurm_1 and eval_..._h200-dgxc-slurm_10 for the same identity, plus corresponding rows in eval_results_all.
    2. select_winners picks the _1 attempt (later lm-eval timestamp) as the winner.
    3. dedupe_aggregate collects both aggregate rows into indices for the identity; ordering depends on filesystem iterdir (unpredictable).
    4. _choose_aggregate_row iterates indices and calls if winner in str(data[idx].get("source")):
      • Winner = eval_..._h200-dgxc-slurm_1.
      • Loser row's source = eval_results/eval_..._h200-dgxc-slurm_10/results_....json — contains ..._h200-dgxc-slurm_1 as a prefix substring. Match. If this row is first in indices, the function returns the loser's idx.
    5. Dedupe writes back the loser's aggregate row (stale em_strict, wrong metadata) whose source points to ..._slurm_10.
    6. prune_raw_dir uses exact equality (winners.get(key) != name, line 199) — correct — so it deletes the _10 raw dir, leaving the surviving aggregate row pointing to a directory that no longer exists.

    Why existing code does not prevent it

    The existing tests use runner suffixes _15, _16, _12, _03 — none are prefixes of the others, so the substring hazard is never exercised. raw_dir_contributions uses iterdir, so index order in the aggregate is filesystem-dependent; the correct row may or may not come first.

    Impact

    • On-disk eval_results_all/*.json: aggregate row can hold the flaky (loser) em_strict and metadata for that identity.
    • Provenance: the surviving row's source points to a raw dir prune_raw_dir deletes, so any consumer reading the aggregate directly sees dangling provenance.
    • DB: the app ingest per-config path overwrites the aggregate path last, reading from the (correct) surviving winning raw dir, so the final em_strict in the DB is still right. This limits the blast radius to on-disk consistency and the docstring's "latest result per eval identity" promise.

    Suggested fix (one-line)

    if f"/{winner}/" in str(data[idx].get("source") or ""):
        return idx

    Or split source on / and compare segments. Either bounds the match to a full path component.
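The prefix hazard and the segment-comparison variant of the fix can be shown in a few lines. The directory names and the helper `source_names_dir` are illustrative assumptions; only the `_1` / `_10` runner suffix pattern comes from the finding above.

```python
from pathlib import PurePosixPath

def source_names_dir(source, raw_dir):
    """True only if raw_dir is a whole path component of source."""
    return raw_dir in PurePosixPath(source).parts

# Illustrative names: the winner's dir name is a strict prefix of the loser's.
winner = "eval_exp_h200-dgxc-slurm_1"
loser_source = "eval_results/eval_exp_h200-dgxc-slurm_10/results_x.json"
winner_source = "eval_results/eval_exp_h200-dgxc-slurm_1/results_y.json"

print(winner in loser_source)                   # True (the false positive)
print(source_names_dir(loser_source, winner))   # False
print(source_names_dir(winner_source, winner))  # True
```

Comparing whole components also covers a source path that begins directly with the raw dir name, where a `"/name/"` substring test would miss the leading segment.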

  • 🟡 utils/dedupe_reusable_eval_artifacts.py:71-95: raw_dir_contributions (utils/dedupe_reusable_eval_artifacts.py:82) calls load_json(artifact_dir / "meta_env.json") with no exception handling. If any reused raw eval dir has a missing or truncated meta_env.json, the dedupe step now wired in front of the validator dies with an unhandled FileNotFoundError / json.JSONDecodeError traceback instead of the validator's structured "raw eval artifact X is missing meta_env.json" listing that this step is supposed to defer to. dedupe_aggregate has the same gap on aggregate JSONs. Mirror the validator's meta_path.is_file() + try/except (OSError, json.JSONDecodeError) guards and skip unloadable dirs so the validator still owns the error reporting.

    Extended reasoning...

    What's wrong

    raw_dir_contributions unconditionally calls load_json(artifact_dir / "meta_env.json") at line 82. load_json (defined at validate_reusable_sweep_artifacts.py:30-33) is a bare open() + json.load() — no exception handling. So:

    • Missing meta_env.json → FileNotFoundError (an OSError)
    • Truncated / malformed JSON → json.JSONDecodeError

    Either propagates up through select_winners → prune_raw_dir → dedupe → main, crashing the whole dedupe step. dedupe_aggregate has the same shape on its load_json(agg_path) call.

    Why the validator does not save us

    The dedupe step is now wired to run before the validator in the reuse-ingest-artifacts job (run-sweep.yml:725-731) and in the recover-failed-ingest.md pre-check. So a bad artifact makes the merge-time ingest fail with a raw Python traceback rather than the validator's own structured message. The validator (raw_eval_key_rows, validate_reusable_sweep_artifacts.py:358-378) explicitly:

    1. Checks meta_path.is_file() and appends "raw eval artifact X is missing meta_env.json"
    2. Wraps load_json in try/except (OSError, json.JSONDecodeError)
    3. Also handles the non-dict case

    The dedupe module only handles case (3). That contradicts its docstring's 'safe to run unconditionally in the reuse path' / 'no-op when there is nothing to collapse' claim.

    Step-by-step proof

    1. Operator runs /recover-failed-ingest. gh run download produces source-artifacts/eval_minimaxm3_conc4096_b300-nv_15/ with a truncated meta_env.json (e.g. because the artifact upload was interrupted).
    2. The recovery script runs python3 utils/dedupe_reusable_eval_artifacts.py --artifacts-dir source-artifacts (or the equivalent merge-time step).
    3. dedupe → select_winners → raw_eval_artifact_dirs yields that dir.
    4. raw_dir_contributions calls load_json(<dir>/meta_env.json) → open(...) succeeds, json.load(f) raises json.JSONDecodeError.
    5. No handler catches it; the process exits with a Python traceback pointing at line 82.
    6. The validator step never runs; the operator sees a traceback instead of "raw eval artifact 'eval_minimaxm3_conc4096_b300-nv_15' has invalid meta_env.json: ...". If more than one dir is affected they get only the first error instead of the validator's full listing.

    Fix

    Mirror the validator's guards inline in raw_dir_contributions (and add the same try/except around load_json(agg_path) in dedupe_aggregate): if meta_env.json does not exist or fails to decode, return the same ([], {}, False) no-contribution result. That leaves the dir untouched, lets dedupe finish cleanly, and defers the error to validate_reusable_sweep_artifacts.py — which is the step designed to produce a human-readable listing for exactly this case.
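A guard of the kind described could look roughly like this. It is a sketch, not the module's code: `load_meta_or_none` is a hypothetical helper, and a caller such as raw_dir_contributions would map a `None` result to its no-contribution return value.

```python
import json
import tempfile
from pathlib import Path

def load_meta_or_none(artifact_dir):
    """Return the parsed meta_env.json dict, or None if it cannot be used."""
    meta_path = Path(artifact_dir) / "meta_env.json"
    if not meta_path.is_file():
        return None  # missing file: leave reporting to the validator
    try:
        with meta_path.open() as f:
            meta = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None  # unreadable or truncated JSON
    return meta if isinstance(meta, dict) else None

# Exercise the three cases: valid file, truncated file, missing file.
with tempfile.TemporaryDirectory() as tmp:
    good, bad, empty = (Path(tmp) / name for name in ("good", "bad", "empty"))
    for d in (good, bad, empty):
        d.mkdir()
    (good / "meta_env.json").write_text('{"conc": 4096}')
    (bad / "meta_env.json").write_text('{"conc": 40')  # simulated truncated upload
    results = [load_meta_or_none(d) for d in (good, bad, empty)]

print(results)  # [{'conc': 4096}, None, None]
```

Returning a sentinel instead of raising keeps the dedupe pass silent about broken dirs, so the validator remains the only step that reports them.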

    Severity

    nit — the ingest job must fail either way (a corrupted meta_env.json is a real error). The only observable regression is diagnostic quality: a Python traceback replaces the validator's clean 'raw eval artifact X is missing meta_env.json' message, and the operator sees only the first failure instead of the validator's full listing. No wrong data is committed. The docstring's 'safe to run unconditionally' claim is weakened but not falsified for the common case. Worth fixing, but not a merge blocker.
