
fix(eval): deduplicate blocked ContextBench rows #122

Merged
PatrickSys merged 1 commit into master from fix/contextbench-baseline-reservations
Apr 30, 2026

Conversation

@PatrickSys
Owner

Summary

  • Make blocked missing-evidence baseline rows idempotent when a snapshot command is rerun.
  • Validate duplicate primary baseline rows and exact blocked-row/reservation coverage.
  • Add regressions for duplicate-row rejection and snapshot rerun behavior.

Verification

  • pnpm test -- tests/contextbench-baseline-runner.test.ts tests/contextbench-baseline-snapshot.test.ts tests/contextbench-baseline-schema-gate.test.ts
  • pnpm exec tsc --noEmit
  • pnpm run build
  • pnpm test -- tests/zombie-guard.test.ts tests/benchmark-comparators.test.ts

Notes

  • No live or paid ContextBench rows were run.
  • No benchmark claims are made.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request prevents the duplication of blocked missing-evidence rows in contextbench-runner.mjs by introducing a primary key tracking mechanism during snapshot creation. It also strengthens session validation by checking for duplicate primary baseline rows and ensuring a strict match between blocked reservations and manifest entries. New test cases verify these fixes. The review feedback recommends using a dedicated cleanup helper in the test suite to improve cross-platform reliability and maintain consistency.

Comment on lines +194 to +197
      rmSync(path.dirname(path.dirname(path.dirname(path.dirname(sessionRoot)))), {
        recursive: true,
        force: true
      });

medium

Use the cleanupSessionRoot helper function instead of calling rmSync directly. This ensures consistency across tests and leverages the retry logic implemented in the helper to avoid potential race conditions during cleanup, especially on Windows environments.

      cleanupSessionRoot(sessionRoot);

@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR makes writeBlockedRunRows idempotent by snapshotting existing primary-key entries before the write loop so that re-running --baseline-snapshot on the same session root never duplicates terminal_missing_evidence rows. validateBaselineSession is also tightened: it now detects duplicate primary baseline rows and verifies exact bijective coverage between blocked reservations and blocked manifest rows (replacing a fragile hardcoded lane-name list). Two regression tests cover both the duplicate-rejection and the rerun-idempotency paths.
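
The idempotent write described in this summary can be sketched as follows. This is a minimal in-memory illustration, not the code from scripts/contextbench-runner.mjs: the row and reservation shapes, the `appendBlockedRows` name, and the `fallbackReason` value are assumptions, and the real function also writes artifacts to disk. Only the `terminal_missing_evidence` status and the laneId :: taskId :: repeatIndex key form come from the review.

```typescript
// Hypothetical shapes; the real manifest rows and reservations carry more fields.
interface Reservation {
  laneId: string;
  taskId: string;
  repeatIndex: number;
  status: string;
}

interface ManifestRow {
  laneId: string;
  taskId: string;
  repeatIndex: number;
  fallbackReason?: string;
}

// Composite key in the laneId :: taskId :: repeatIndex form.
function primaryReservationKey(r: { laneId: string; taskId: string; repeatIndex: number }): string {
  return `${r.laneId}::${r.taskId}::${r.repeatIndex}`;
}

// Appends one blocked row per terminal_missing_evidence reservation and
// skips any key that is already present, so a rerun adds nothing.
function appendBlockedRows(manifest: ManifestRow[], reservations: Reservation[]): number {
  // Snapshot the existing keys once, before the write loop.
  const existingPrimaryKeys = new Set(manifest.map(primaryReservationKey));
  let written = 0;
  for (const r of reservations) {
    if (r.status !== "terminal_missing_evidence") continue;
    const key = primaryReservationKey(r);
    if (existingPrimaryKeys.has(key)) continue;
    manifest.push({
      laneId: r.laneId,
      taskId: r.taskId,
      repeatIndex: r.repeatIndex,
      fallbackReason: "missing_evidence", // assumed value, for illustration only
    });
    // Track the new key so duplicates within the same batch are also skipped.
    existingPrimaryKeys.add(key);
    written += 1;
  }
  return written;
}

const manifest: ManifestRow[] = [];
const reservations: Reservation[] = [
  { laneId: "lane-a", taskId: "task-1", repeatIndex: 0, status: "terminal_missing_evidence" },
  { laneId: "lane-a", taskId: "task-1", repeatIndex: 1, status: "terminal_missing_evidence" },
  { laneId: "lane-b", taskId: "task-1", repeatIndex: 0, status: "reserved" },
];
const firstRun = appendBlockedRows(manifest, reservations);
const secondRun = appendBlockedRows(manifest, reservations);
```

In this sketch the first call appends two rows and the second call appends none, which is the rerun behavior the regression test checks for.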

Confidence Score: 4/5

Safe to merge; only P2 style findings in the test files, core runner logic is correct.

No P0 or P1 issues found. The deduplication and validation logic in the runner script is sound. The two P2 findings are limited to tests: new tests bypass the existing cleanupSessionRoot helper (missing Windows retry options) and one test hardcodes a fixture-derived count without explanation.

tests/contextbench-baseline-runner.test.ts and tests/contextbench-baseline-snapshot.test.ts — both new tests should use cleanupSessionRoot for consistent cleanup.

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/contextbench-runner.mjs | Adds idempotent deduplication to writeBlockedRunRows via a pre-read primary-key set, a new primaryReservationKey helper, and strengthened validateBaselineSession coverage/duplicate checks. Logic is sound. |
| tests/contextbench-baseline-runner.test.ts | New regression for duplicate-row rejection; the test logic is correct, but cleanup skips the existing cleanupSessionRoot helper (missing maxRetries / ignoreWindowsTempCleanupRace). |
| tests/contextbench-baseline-snapshot.test.ts | New snapshot-rerun idempotency test; the hardcoded 20 * 2 * 3 row count is fragile, and cleanup also bypasses the retry helper. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[writeBlockedRunRows called] --> B[Read existing primary keys\nfrom run-manifest.jsonl]
    B --> C{For each terminal_missing_evidence\nreservation}
    C --> D[Look up laneCard, task, evidence]
    D --> E{Missing any?}
    E -- yes --> C
    E -- no --> F[Compute primaryReservationKey\nlaneId :: taskId :: repeatIndex]
    F --> G{Key already in\nexistingPrimaryKeys?}
    G -- yes --> C
    G -- no --> H[Write artifacts + append\nmanifest row]
    H --> I[Add key to existingPrimaryKeys]
    I --> C

    J[validateBaselineSession] --> K[Count primary rows by key]
    K --> L{Any key count > 1?}
    L -- yes --> M[Push duplicate error]
    L -- no --> N[Build blockedReservationKeys\nfrom reservations]
    N --> O[Build blockedRowKeys from\nfallbackReason rows]
    O --> P{missing or extra keys?}
    P -- yes --> Q[Push coverage mismatch error]
    P -- no --> R[Validation passed]
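
The validateBaselineSession branch of the flowchart can be sketched roughly as below. This is an assumption-laden illustration operating on plain key strings: the `validateCoverage` name, its parameters, and the error wording are hypothetical, and unlike the flowchart it reports duplicate and coverage errors in a single pass rather than stopping at the first kind.

```typescript
// Hypothetical helper: takes pre-computed keys rather than real manifest rows.
function validateCoverage(
  primaryRowKeys: string[],
  blockedReservationKeys: string[],
  blockedRowKeys: string[],
): string[] {
  const errors: string[] = [];

  // 1. Count primary rows by key and flag any key seen more than once.
  const counts = new Map<string, number>();
  for (const key of primaryRowKeys) {
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  counts.forEach((count, key) => {
    if (count > 1) errors.push(`duplicate primary baseline row: ${key} (x${count})`);
  });

  // 2. Require exact coverage: every blocked reservation has a blocked row
  //    and every blocked row has a blocked reservation.
  const reserved = new Set(blockedReservationKeys);
  const rows = new Set(blockedRowKeys);
  const missing = Array.from(reserved).filter((key) => !rows.has(key));
  const extra = Array.from(rows).filter((key) => !reserved.has(key));
  if (missing.length > 0 || extra.length > 0) {
    errors.push(
      `blocked coverage mismatch: missing=[${missing.join(", ")}] extra=[${extra.join(", ")}]`,
    );
  }
  return errors;
}

const cleanSession = validateCoverage(["a::t1::0", "b::t1::0"], ["b::t1::0"], ["b::t1::0"]);
const duplicatedSession = validateCoverage(["a::t1::0", "a::t1::0"], [], []);
const mismatchedSession = validateCoverage(["a::t1::0"], ["b::t1::0"], ["c::t1::0"]);
```

Comparing the two key sets in both directions is what makes the check exact: a one-directional check would accept stray blocked rows that no reservation accounts for.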

Reviews (1): Last reviewed commit: "fix(eval): deduplicate blocked ContextBe..."

Comment on lines +199 to 203
  });

  it('creates fake-executor baseline attempt artifacts without scripting agent decisions', () => {
    const sessionRoot = tempSessionRoot();
    const taskId = manifest.tasks[0].instance_id;

P2 Cleanup bypasses Windows retry/error handling

The new test uses a bare rmSync in finally without maxRetries, retryDelay, or a wrapping ignoreWindowsTempCleanupRace catch. The existing cleanupSessionRoot helper at line 59 already encapsulates all three — using the helper would keep cleanup consistent and avoid flaky failures on Windows where temp-dir handles can still be open when finally runs.

Suggested change
});
it('creates fake-executor baseline attempt artifacts without scripting agent decisions', () => {
const sessionRoot = tempSessionRoot();
const taskId = manifest.tasks[0].instance_id;
} finally {
cleanupSessionRoot(sessionRoot);
}
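
For readers unfamiliar with the pattern, a cleanup helper of the kind the reviewers describe might look like the sketch below. It is not the repository's cleanupSessionRoot: the `cleanupTempDir` name, the retry values, and the set of tolerated error codes are assumptions. The `maxRetries` and `retryDelay` options are standard `fs.rmSync` options in Node.js.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: removes a temp directory with retries, and tolerates
// the Windows case where a handle is still open when cleanup runs.
function cleanupTempDir(dir: string): void {
  try {
    rmSync(dir, {
      recursive: true,
      force: true,
      maxRetries: 5, // assumed value; retries apply to recursive removal
      retryDelay: 100, // milliseconds between retries (assumed value)
    });
  } catch (error) {
    const code = (error as { code?: string }).code;
    // Assumption: treat these codes as a benign temp-cleanup race on Windows only.
    const isWindowsRace =
      process.platform === "win32" &&
      (code === "EBUSY" || code === "EPERM" || code === "ENOTEMPTY");
    if (!isWindowsRace) throw error;
  }
}

// Usage: the same try/finally shape as the tests under review.
const scratchDir = mkdtempSync(join(tmpdir(), "cleanup-sketch-"));
try {
  writeFileSync(join(scratchDir, "run-manifest.jsonl"), "{}\n");
} finally {
  cleanupTempDir(scratchDir);
}
```

Routing every test through one helper keeps the retry policy in a single place, so a flaky platform only needs one fix.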

Comment on lines +162 to 173
      expect(secondBlockedRows).toHaveLength(firstBlockedRows.length);
      expect(validateOutput).toContain('baseline session validation passed');
    } finally {
      rmSync(path.dirname(path.dirname(path.dirname(path.dirname(sessionRoot)))), {
        recursive: true,
        force: true
      });
    }
  });

  it('refuses raw baseline artifacts outside the ignored benchmark-runs root', () => {
    const outDir = mkdtempSync(path.join(tmpdir(), 'contextbench-invalid-out-'));

P2 Hardcoded fixture count couples test to manifest config

20 * 2 * 3 embeds the exact task count, blocked-lane count, and repeat count from the phase41 fixture. Any fixture change will break this assertion, and the failure message will not say which factor changed. Adding a comment naming the source of each factor would make the intent explicit and the breakage diagnosable.

The finally block also repeats a bare rmSync without the maxRetries/retryDelay options present in the cleanupSessionRoot helper used by other tests in this file.
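
One way to address the hardcoded count is to name each factor and compute the product, as in this sketch. The `FixtureShape` interface and its field names are hypothetical; only the values 20, 2, and 3 and their meanings (tasks, blocked lanes, repeats) come from the comment above. In the real test the factors would presumably be read from the fixture rather than restated.

```typescript
// Hypothetical description of the fixture dimensions behind the row count.
interface FixtureShape {
  taskCount: number; // tasks in the fixture
  blockedLaneCount: number; // lanes whose rows are blocked for missing evidence
  repeatCount: number; // repeats reserved per task and lane
}

// One blocked row is expected per (task, blocked lane, repeat) combination.
function expectedBlockedRowCount(shape: FixtureShape): number {
  return shape.taskCount * shape.blockedLaneCount * shape.repeatCount;
}

// Builds a failure message that names each factor, so a fixture change is diagnosable.
function describeExpectation(shape: FixtureShape): string {
  return (
    `expected ${expectedBlockedRowCount(shape)} blocked rows = ` +
    `${shape.taskCount} tasks * ${shape.blockedLaneCount} blocked lanes * ` +
    `${shape.repeatCount} repeats`
  );
}

const phase41Shape: FixtureShape = { taskCount: 20, blockedLaneCount: 2, repeatCount: 3 };
const expectedRows = expectedBlockedRowCount(phase41Shape); // 120
const expectationMessage = describeExpectation(phase41Shape);
```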

@PatrickSys merged commit 99c9753 into master on Apr 30, 2026
4 checks passed
@PatrickSys deleted the fix/contextbench-baseline-reservations branch on April 30, 2026 19:48