Auto-dedupe reran eval artifacts in reuse-ingest validator by Oseltamivir · Pull Request #1965 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-07-01T00:31:27Z

Problem

When an eval is retried due to flakes, a sweep accumulates several raw eval_* dirs and several eval_results_all rows for one logical eval identity. validate_reusable_sweep_artifacts.py hard-rejects these duplicates, which blocks artifact reuse and /recover-failed-ingest.

Concrete case: PR #1931 ran evals 6× (flakes) → source run 28214046079 carries 4 raw dirs + 3 aggregate rows for the minimaxm3 / fp4 / b300 / dynamo-vllm conc4096 config. The main-branch ingest for that merge (run 28459613726) failed at the validator:

raw eval artifacts contain 3 duplicate row(s)
  duplicate x4: ('multi','b300','minimaxm3','dynamo-vllm','fp4','none',8192,1024,2,2,True,4,8,8,True,2,4096)
eval aggregate artifacts contain 2 duplicate row(s)

Why it matters beyond the gate

The app ingest (InferenceX-app admin:db:ingest:ci) writes evals via two paths into the same eval_results rows — the eval_results_all aggregate and the per-config eval_* dirs — upserting on (workflow_run_id, config_id, task, isl, osl, conc) with the per-config path overwriting last. With duplicates present the surviving value is non-deterministic (arbitrary readdirSync order across the reran dirs). That is exactly why the validator blocks duplicates.
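The order dependence can be reproduced with a small last-writer-wins model. This is only a sketch: the real ingest is the InferenceX-app code writing to Postgres, and the `ingest` function, the `config_id` value and the row shape below are hypothetical. The key columns and the two `em_strict` values come from this PR.

```python
def ingest(rows):
    """Toy last-writer-wins upsert keyed on the ingest's conflict columns."""
    table = {}
    for row in rows:
        key = (
            row["workflow_run_id"], row["config_id"], row["task"],
            row["isl"], row["osl"], row["conc"],
        )
        table[key] = row["em_strict"]  # a later row silently replaces an earlier one
    return table


# config_id=1 is a made-up placeholder; the other values are taken from the PR text.
base = {
    "workflow_run_id": 28214046079, "config_id": 1, "task": "gsm8k",
    "isl": 8192, "osl": 1024, "conc": 4096,
}
flaky = {**base, "em_strict": 0.7817}
latest = {**base, "em_strict": 0.9515}

# Same two rows, two directory orders, two different surviving values.
order_a = ingest([flaky, latest])
order_b = ingest([latest, flaky])
```

Both orders leave exactly one row for the key, but which value survives depends only on iteration order, which is why the duplicates have to be resolved before ingest rather than during it.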

Change

validate_reusable_sweep_artifacts.py now collapses reran eval duplicates in place before validating — no flag, no new files:

- Keeps the latest result per eval identity (by lm-eval result timestamp) and prunes the superseded raw dirs (and, for batched artifacts, superseded concurrencies) plus aggregate rows.
- Only collapses identities that have a clear latest result; genuinely ambiguous duplicates (no result timestamp to order them by) are still rejected.
- Eval-only (fixed-sequence/agentic untouched); no-op when nothing to collapse; idempotent.
- Reuses the module's own eval_key, so dedupe granularity matches the gate.
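The selection rule in the list above can be sketched roughly as follows. This is not the module's code: `pick_latest`, the attempt dict shape and the directory names are hypothetical, and treating a tie on the newest timestamp as ambiguous is an assumption the PR text does not state.

```python
from collections import defaultdict


def pick_latest(attempts):
    """Group attempts by eval identity and keep the newest one per group.

    Each attempt is a dict with 'key', 'dir' and 'timestamp' (float or None).
    Returns (winners, ambiguous): winners maps key -> surviving dir name,
    ambiguous lists the keys that cannot be ordered and are left untouched.
    """
    groups = defaultdict(list)
    for attempt in attempts:
        groups[attempt["key"]].append(attempt)

    winners, ambiguous = {}, []
    for key, group in groups.items():
        if len(group) == 1:
            winners[key] = group[0]["dir"]  # nothing to collapse
            continue
        stamps = [a["timestamp"] for a in group]
        # Assumption: a missing timestamp or a tie means "no clear latest result".
        if any(t is None for t in stamps) or stamps.count(max(stamps)) > 1:
            ambiguous.append(key)
            continue
        winners[key] = max(group, key=lambda a: a["timestamp"])["dir"]
    return winners, ambiguous


attempts = [
    {"key": "cfg-a", "dir": "eval_a_15", "timestamp": 100.0},
    {"key": "cfg-a", "dir": "eval_a_16", "timestamp": 200.0},
    {"key": "cfg-b", "dir": "eval_b_12", "timestamp": None},
    {"key": "cfg-b", "dir": "eval_b_03", "timestamp": 50.0},
    {"key": "cfg-c", "dir": "eval_c_01", "timestamp": None},
]
winners, ambiguous = pick_latest(attempts)

# Dirs that lost to a newer attempt for the same identity.
superseded = [
    a["dir"] for a in attempts
    if a["key"] in winners and winners[a["key"]] != a["dir"]
]
```

Running the function again on the surviving attempts returns the same winners and an empty superseded list, which is the idempotence property the description claims.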

Because it lives in the validator, the existing reuse-ingest-artifacts step and the /recover-failed-ingest pre-check get it automatically. Tests added to test_validate_reusable_sweep_artifacts.py (legacy reruns, batched partial-overlap, no-op, and ambiguous-still-rejected).

Verification (real artifacts + real ingest)

Ran the actual InferenceX-app ingest against a local Postgres using PR #1931's source artifacts:

| Artifacts | Validator | Ingested `em_strict` (conc4096/dtp8) | Rows for key |
|---|---|---|---|
| raw 6-attempt pileup | ❌ blocked | 0.7817 (flaky, wrong) | arbitrary |
| after dedupe | ✅ passes | 0.9515 (latest run) | exactly 1 |

Follow-up / caveat

The validator's eval_key (and therefore this dedupe) omits task, i.e. it assumes single-task (gsm8k) evals — true today. Multi-task-per-config would need a key change.
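To illustrate the caveat, here is a sketch of how a key that omits task would merge two tasks run on the same config. The field subset and both key functions are hypothetical stand-ins, not the validator's real `eval_key`; `other_task` is a placeholder name.

```python
def key_without_task(row):
    # Hypothetical, shortened identity: the real eval_key has many more fields.
    return (row["model"], row["precision"], row["conc"])


def key_with_task(row):
    # The kind of key change multi-task evals would need.
    return key_without_task(row) + (row["task"],)


rows = [
    {"model": "minimaxm3", "precision": "fp4", "conc": 4096, "task": "gsm8k"},
    {"model": "minimaxm3", "precision": "fp4", "conc": 4096, "task": "other_task"},
]

collapsed = {key_without_task(r) for r in rows}  # both tasks share one identity
distinct = {key_with_task(r) for r in rows}      # tasks stay separate
```

With the task-less key, the dedupe would treat the second task as a rerun of the first and prune one of them.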

A flaky eval retried several times leaves multiple raw eval_* dirs and multiple eval_results_all rows for one logical eval identity. The reuse validator rejected these duplicates, blocking artifact reuse and ingest recovery (e.g. PR #1931 -> failed main ingest run 28459613726).

validate_reusable_sweep_artifacts.py now collapses reran eval duplicates in place before validating: it keeps the latest result per eval identity (by lm-eval result timestamp) and prunes the superseded raw dirs / aggregate rows. It only collapses identities that have a clear latest result; genuinely ambiguous duplicates (no result timestamp to order them by) are still rejected. No flag, no new files; eval-only, no-op when nothing to collapse.

The InferenceX-app ingest upserts eval rows on (workflow_run_id, config_id, task, isl, osl, conc) with the per-config eval dir path overwriting the aggregate, so with duplicates present the ingested value is non-deterministic. Verified end-to-end on the real PR #1931 source artifacts + app ingest against a local Postgres: after dedupe, one eval_results row at the latest value (em_strict 0.9515); the raw pile-up otherwise ingests an arbitrary flaky value (0.7817).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude

Additional findings (outside current diff — PR may have been updated during review):

🟡 utils/dedupe_reusable_eval_artifacts.py:164-183 — The dedupe's aggregate-row selector uses an unbounded substring check (if winner in str(data[idx].get("source") or "")) to match the winner's raw-dir name against the aggregate's source path. Real runner pools (e.g. .github/configs/runners.yaml defines h200-dgxc-slurm_0 through _13) mix single- and double-digit indices, so if a flaky eval retries across both _1 and _10 for one identity and _1 wins, the winner name is a prefix substring of the loser's source path — the loop can return the loser's row. Result: on-disk aggregate keeps the wrong em_strict and a source field pointing to a raw dir that prune_raw_dir then deletes. The final DB value is still correct (app ingest's per-config path overwrites last), so this is a provenance/on-disk-consistency defect rather than a data-loss one. Fix is trivial: match by path component, e.g. f"/{winner}/" in source.

What the bug is

In utils/dedupe_reusable_eval_artifacts.py:_choose_aggregate_row (lines 164-183), the selector picks which aggregate row survives for a duplicate identity group with:
```
if winner is not None:
    for idx in indices:
        if winner in str(data[idx].get("source") or ""):
            return idx
```
winner is a bare raw-dir name (e.g. eval_MyExp_..._h200-dgxc-slurm_1) and source is a full path (str(json_path) from collect_eval_results.py:168) shaped like eval_results/<raw_dir_name>/results_<ts>.json. The in check is unbounded substring containment, not a boundary-aware match.

The concrete trigger

RESULT_FILENAME in benchmark-tmpl.yml:180 ends with _${{ runner.name }}, and the eval artifact name is eval_${EXP_NAME}_${RESULT_FILENAME} (line 297), so the last token of every raw dir name is literally runner.name. .github/configs/runners.yaml defines pools with mixed-width numbering — e.g. h200-dgxc-slurm_0 through _9 and _10 through _13, plus h100-cw_00/_01. In those pools, ..._slurm_1 is a strict prefix of ..._slurm_10.

Step-by-step proof
1. A flaky eval retries and lands raw dirs eval_..._h200-dgxc-slurm_1 and eval_..._h200-dgxc-slurm_10 for the same identity, plus corresponding rows in eval_results_all.
2. select_winners picks the _1 attempt (later lm-eval timestamp) as the winner.
3. dedupe_aggregate collects both aggregate rows into indices for the identity; ordering depends on filesystem iterdir (unpredictable).
4. _choose_aggregate_row iterates indices and calls if winner in str(data[idx].get("source")):
  - Winner = eval_..._h200-dgxc-slurm_1.
  - Loser row's source = eval_results/eval_..._h200-dgxc-slurm_10/results_....json — contains ..._h200-dgxc-slurm_1 as a prefix substring. Match. If this row is first in indices, the function returns the loser's idx.
5. Dedupe writes back the loser's aggregate row (stale em_strict, wrong metadata) whose source points to ..._slurm_10.
6. prune_raw_dir uses exact equality (winners.get(key) != name, line 199) — correct — so it deletes the _10 raw dir, leaving the surviving aggregate row pointing to a directory that no longer exists.

Why existing code does not prevent it

The existing tests use runner suffixes _15, _16, _12, _03 — none are prefixes of the others, so the substring hazard is never exercised. raw_dir_contributions uses iterdir, so index order in the aggregate is filesystem-dependent; the correct row may or may not come first.

Impact
- On-disk eval_results_all/*.json: aggregate row can hold the flaky (loser) em_strict and metadata for that identity.
- Provenance: the surviving row's source points to a raw dir prune_raw_dir deletes, so any consumer reading the aggregate directly sees dangling provenance.
- DB: the app ingest per-config path overwrites the aggregate path last, reading from the (correct) surviving winning raw dir, so the final em_strict in the DB is still right. This limits the blast radius to on-disk consistency and the docstring's "latest result per eval identity" promise.

Suggested fix (one-line)

```
if f"/{winner}/" in str(data[idx].get("source") or ""):
    return idx
```
Or split source on / and compare segments. Either bounds the match to a full path component.
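The difference between the two matching strategies can be checked in isolation. The sketch below uses made-up directory names in the shape the review describes; `source_names_dir` is a hypothetical helper showing the "compare segments" variant, not code from the PR.

```python
from pathlib import PurePosixPath


def source_names_dir(source: str, raw_dir: str) -> bool:
    """True only if raw_dir is a whole path component of source."""
    return raw_dir in PurePosixPath(source).parts


# Illustrative names only; real raw dir names are longer.
winner = "eval_exp_h200-dgxc-slurm_1"
winner_source = "eval_results/eval_exp_h200-dgxc-slurm_1/results_0.json"
loser_source = "eval_results/eval_exp_h200-dgxc-slurm_10/results_0.json"

# Unbounded substring check: the loser's path also "contains" the winner's name.
substring_hits = [winner in s for s in (winner_source, loser_source)]

# Component-bounded check: only the winner's own path matches.
component_hits = [source_names_dir(s, winner) for s in (winner_source, loser_source)]

# The one-line f"/{winner}/" variant gives the same answer for these paths.
slash_hits = [f"/{winner}/" in s for s in (winner_source, loser_source)]
```

Comparing path components does not depend on the source having a leading directory, whereas the `f"/{winner}/"` form assumes the raw dir name is never the first segment of the path.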
🟡 utils/dedupe_reusable_eval_artifacts.py:71-95 — raw_dir_contributions (utils/dedupe_reusable_eval_artifacts.py:82) calls load_json(artifact_dir / "meta_env.json") with no exception handling. If any reused raw eval dir has a missing or truncated meta_env.json, the dedupe step now wired in front of the validator will die with an unhandled FileNotFoundError / json.JSONDecodeError traceback instead of the validator's structured 'raw eval artifact X is missing meta_env.json' listing that this step is supposed to defer to. dedupe_aggregate has the same gap on aggregate JSONs. Mirror the validator's meta_path.is_file() + try/except (OSError, json.JSONDecodeError) guards and skip un-loadable dirs so the validator still owns the error reporting.

What's wrong

raw_dir_contributions unconditionally calls load_json(artifact_dir / "meta_env.json") at line 82. load_json (defined at validate_reusable_sweep_artifacts.py:30-33) is a bare open() + json.load() — no exception handling. So:
- Missing meta_env.json → FileNotFoundError (an OSError)
- Truncated / malformed JSON → json.JSONDecodeError
Either propagates up through select_winners → prune_raw_dir → dedupe → main, crashing the whole dedupe step. dedupe_aggregate has the same shape on its load_json(agg_path) call.

Why the validator does not save us

The dedupe step is now wired to run before the validator in the reuse-ingest-artifacts job (run-sweep.yml:725-731) and in the recover-failed-ingest.md pre-check. So a bad artifact makes the merge-time ingest fail with a raw Python traceback rather than the validator's own structured message. The validator (raw_eval_key_rows, validate_reusable_sweep_artifacts.py:358-378) explicitly:
1. Checks meta_path.is_file() and appends "raw eval artifact X is missing meta_env.json"
2. Wraps load_json in try/except (OSError, json.JSONDecodeError)
3. Also handles the non-dict case
The dedupe module only handles case (3). That contradicts its docstring's 'safe to run unconditionally in the reuse path' / 'no-op when there is nothing to collapse' claim.

Step-by-step proof
1. Operator runs /recover-failed-ingest. gh run download produces source-artifacts/eval_minimaxm3_conc4096_b300-nv_15/ with a truncated meta_env.json (e.g. because the artifact upload was interrupted).
2. The recovery script runs python3 utils/dedupe_reusable_eval_artifacts.py --artifacts-dir source-artifacts (or the equivalent merge-time step).
3. dedupe → select_winners → raw_eval_artifact_dirs yields that dir.
4. raw_dir_contributions calls load_json(<dir>/meta_env.json) → open(...) succeeds, json.load(f) raises json.JSONDecodeError.
5. No handler catches it; the process exits with a Python traceback pointing at line 82.
6. The validator step never runs; the operator sees a traceback instead of "raw eval artifact 'eval_minimaxm3_conc4096_b300-nv_15' has invalid meta_env.json: ...". If more than one dir is affected they get only the first error instead of the validator's full listing.

Fix

Mirror the validator's guards inline in raw_dir_contributions (and add the same try/except around load_json(agg_path) in dedupe_aggregate): if meta_env.json does not exist or fails to decode, return the same ([], {}, False) no-contribution result. That leaves the dir untouched, lets dedupe finish cleanly, and defers the error to validate_reusable_sweep_artifacts.py — which is the step designed to produce a human-readable listing for exactly this case.
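One way the guard could look, as a minimal sketch: `load_meta_or_none` is a hypothetical helper, not a function from the PR, and a caller in raw_dir_contributions would return its ([], {}, False) no-contribution result whenever this returns None. The demo at the bottom builds throwaway directories to exercise the three cases.

```python
import json
import tempfile
from pathlib import Path


def load_meta_or_none(artifact_dir: Path):
    """Return the parsed meta_env.json dict, or None if it is missing or unusable."""
    meta_path = artifact_dir / "meta_env.json"
    if not meta_path.is_file():
        return None  # missing: leave the dir alone so the validator reports it
    try:
        with meta_path.open(encoding="utf-8") as f:
            meta = json.load(f)
    except (OSError, json.JSONDecodeError):
        return None  # unreadable or truncated: also deferred to the validator
    return meta if isinstance(meta, dict) else None  # non-dict payloads are skipped too


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    payloads = {"ok": '{"model": "m"}', "truncated": '{"model": '}
    for name, payload in payloads.items():
        (root / name).mkdir()
        (root / name / "meta_env.json").write_text(payload, encoding="utf-8")
    (root / "missing").mkdir()  # directory with no meta_env.json at all

    results = {
        name: load_meta_or_none(root / name)
        for name in ("ok", "truncated", "missing")
    }
```

Returning None instead of raising keeps the dedupe step a no-op for bad directories, so every affected dir still reaches the validator and shows up in its listing.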

Severity

nit — the ingest job must fail either way (a corrupted meta_env.json is a real error). The only observable regression is diagnostic quality: a Python traceback replaces the validator's clean 'raw eval artifact X is missing meta_env.json' message, and the operator sees only the first failure instead of the validator's full listing. No wrong data is committed. The docstring's 'safe to run unconditionally' claim is weakened but not falsified for the common case. Worth fixing, but not a merge blocker.

Oseltamivir requested a review from a team July 1, 2026 00:31

github-code-quality Bot found potential problems Jul 1, 2026

Comment thread utils/dedupe_reusable_eval_artifacts.py Fixed

github-project-automation Bot added this to InferenceMAX Board Jul 1, 2026

Oseltamivir force-pushed the klaud/dedupe-reusable-eval-artifacts branch from 3de1816 to 67766a6 Compare July 1, 2026 00:39

Oseltamivir changed the title ~~Dedupe reran eval artifacts before reuse ingest~~ Auto-dedupe reran eval artifacts in reuse-ingest validator Jul 1, 2026

Oseltamivir merged commit 493f781 into main Jul 1, 2026
5 checks passed

Oseltamivir deleted the klaud/dedupe-reusable-eval-artifacts branch July 1, 2026 00:45

claude Bot reviewed Jul 1, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Jul 1, 2026

Oseltamivir mentioned this pull request Jul 1, 2026

fix: recover PR 1931 ingest via sweep reuse #1966

Merged
