fix(eval): deduplicate blocked ContextBench rows by PatrickSys · Pull Request #122 · PatrickSys/codebase-context

PatrickSys · 2026-04-30T15:50:48Z

Summary

Make blocked missing-evidence baseline rows idempotent when a snapshot command is rerun.
Validate duplicate primary baseline rows and exact blocked-row/reservation coverage.
Add regressions for duplicate-row rejection and snapshot rerun behavior.

Verification

pnpm test -- tests/contextbench-baseline-runner.test.ts tests/contextbench-baseline-snapshot.test.ts tests/contextbench-baseline-schema-gate.test.ts
pnpm exec tsc --noEmit
pnpm run build
pnpm test -- tests/zombie-guard.test.ts tests/benchmark-comparators.test.ts

Notes

No live or paid ContextBench rows were run.
No benchmark claims are made.

gemini-code-assist

Code Review

This pull request prevents the duplication of blocked missing-evidence rows in contextbench-runner.mjs by introducing a primary key tracking mechanism during snapshot creation. It also strengthens session validation by checking for duplicate primary baseline rows and ensuring a strict match between blocked reservations and manifest entries. New test cases verify these fixes. The review feedback recommends using a dedicated cleanup helper in the test suite to improve cross-platform reliability and maintain consistency.

gemini-code-assist · 2026-04-30T15:52:22Z

+      rmSync(path.dirname(path.dirname(path.dirname(path.dirname(sessionRoot)))), {
+        recursive: true,
+        force: true
+      });


Use the cleanupSessionRoot helper function instead of calling rmSync directly. This ensures consistency across tests and leverages the retry logic implemented in the helper to avoid potential race conditions during cleanup, especially on Windows environments.

cleanupSessionRoot(sessionRoot);

greptile-apps · 2026-04-30T15:53:35Z

Greptile Summary

This PR makes writeBlockedRunRows idempotent by snapshotting existing primary-key entries before the write loop so that re-running --baseline-snapshot on the same session root never duplicates terminal_missing_evidence rows. validateBaselineSession is also tightened: it now detects duplicate primary baseline rows and verifies exact bijective coverage between blocked reservations and blocked manifest rows (replacing a fragile hardcoded lane-name list). Two regression tests cover both the duplicate-rejection and the rerun-idempotency paths.
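
The idempotency idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from scripts/contextbench-runner.mjs: the row shape, the function names primaryKey and appendBlockedRows, and the in-memory array standing in for run-manifest.jsonl are all assumptions made for the example. Only the key layout (laneId, taskId, repeatIndex) is taken from the review's flowchart.

```javascript
// Hypothetical row shape: { laneId, taskId, repeatIndex, ... }.
// The key layout mirrors the flowchart (laneId :: taskId :: repeatIndex).
function primaryKey(row) {
  return `${row.laneId}::${row.taskId}::${row.repeatIndex}`;
}

// Appends one blocked row per reservation unless a row with the same
// primary key already exists. Returns the number of rows appended.
function appendBlockedRows(manifestRows, reservations) {
  // Snapshot the keys that exist before the write loop starts.
  const existingKeys = new Set(manifestRows.map(primaryKey));
  let appended = 0;
  for (const reservation of reservations) {
    const key = primaryKey(reservation);
    if (existingKeys.has(key)) continue; // already written by an earlier run
    manifestRows.push({ ...reservation, status: 'terminal_missing_evidence' });
    existingKeys.add(key); // also guards against repeats within this run
    appended += 1;
  }
  return appended;
}

// Running the same snapshot twice must not grow the manifest.
const manifest = [];
const reservations = [
  { laneId: 'lane-a', taskId: 'task-1', repeatIndex: 0 },
  { laneId: 'lane-a', taskId: 'task-1', repeatIndex: 1 }
];
const firstRun = appendBlockedRows(manifest, reservations);
const secondRun = appendBlockedRows(manifest, reservations);
console.log(firstRun, secondRun, manifest.length); // 2 0 2
```

Adding each key to the set as it is written means the same guard covers both a rerun and a reservation list that happens to repeat an entry.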

Confidence Score: 4/5

Safe to merge. The only findings are P2 style issues in the test files; the core runner logic is correct.

No P0 or P1 issues found. The deduplication and validation logic in the runner script is sound. The two P2 findings are limited to tests: new tests bypass the existing cleanupSessionRoot helper (missing Windows retry options) and one test hardcodes a fixture-derived count without explanation.

tests/contextbench-baseline-runner.test.ts and tests/contextbench-baseline-snapshot.test.ts — both new tests should use cleanupSessionRoot for consistent cleanup.

Important Files Changed

Filename	Overview
scripts/contextbench-runner.mjs	Adds idempotent deduplication to writeBlockedRunRows via a pre-read primary-key set, a new primaryReservationKey helper, and strengthened validateBaselineSession coverage/duplicate checks — logic is sound.
tests/contextbench-baseline-runner.test.ts	New regression for duplicate-row rejection; correct test logic but cleanup skips the existing cleanupSessionRoot helper (missing maxRetries / ignoreWindowsTempCleanupRace).
tests/contextbench-baseline-snapshot.test.ts	New snapshot-rerun idempotency test; hardcoded 20 * 2 * 3 row count is fragile, and cleanup also bypasses the retry helper.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[writeBlockedRunRows called] --> B[Read existing primary keys\nfrom run-manifest.jsonl]
    B --> C{For each terminal_missing_evidence\nreservation}
    C --> D[Look up laneCard, task, evidence]
    D --> E{Missing any?}
    E -- yes --> C
    E -- no --> F[Compute primaryReservationKey\nlaneId :: taskId :: repeatIndex]
    F --> G{Key already in\nexistingPrimaryKeys?}
    G -- yes --> C
    G -- no --> H[Write artifacts + append\nmanifest row]
    H --> I[Add key to existingPrimaryKeys]
    I --> C

    J[validateBaselineSession] --> K[Count primary rows by key]
    K --> L{Any key count > 1?}
    L -- yes --> M[Push duplicate error]
    L -- no --> N[Build blockedReservationKeys\nfrom reservations]
    N --> O[Build blockedRowKeys from\nfallbackReason rows]
    O --> P{missing or extra keys?}
    P -- yes --> Q[Push coverage mismatch error]
    P -- no --> R[Validation passed]
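
The validation half of the flowchart can be approximated with a count map plus two set differences. The sketch below assumes a hypothetical row shape and hypothetical names (keyOf, validateCoverage). Unlike the flowchart, which only reaches the coverage check when no duplicate is found, it collects both kinds of error in one pass. It is not the body of validateBaselineSession.

```javascript
// Hypothetical key helper; rows and reservations share { laneId, taskId, repeatIndex }.
function keyOf(row) {
  return `${row.laneId}::${row.taskId}::${row.repeatIndex}`;
}

// Returns a list of error strings; an empty list means validation passed.
function validateCoverage(primaryRows, blockedReservations, blockedRows) {
  const errors = [];

  // 1. Count primary rows per key and flag any key seen more than once.
  const counts = new Map();
  for (const row of primaryRows) {
    const key = keyOf(row);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  for (const [key, count] of counts) {
    if (count > 1) errors.push(`duplicate primary row ${key} (x${count})`);
  }

  // 2. Require a one-to-one match between reserved keys and written blocked rows.
  const reserved = new Set(blockedReservations.map(keyOf));
  const written = new Set(blockedRows.map(keyOf));
  const missing = [...reserved].filter((key) => !written.has(key));
  const extra = [...written].filter((key) => !reserved.has(key));
  if (missing.length > 0 || extra.length > 0) {
    errors.push(`coverage mismatch: missing=[${missing}] extra=[${extra}]`);
  }
  return errors;
}

const t1 = { laneId: 'lane-a', taskId: 'task-1', repeatIndex: 0 };
const t2 = { laneId: 'lane-a', taskId: 'task-2', repeatIndex: 0 };
const cleanErrors = validateCoverage([t1, t2], [t1, t2], [t1, t2]);
const brokenErrors = validateCoverage([t1, t1], [t1, t2], [t1, t1]);
console.log(cleanErrors.length, brokenErrors.length); // 0 2
```

Comparing key sets in both directions is what replaces a hardcoded lane-name list: the expected rows come from the reservations themselves.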

Reviews (1): Last reviewed commit: "fix(eval): deduplicate blocked ContextBe..."

greptile-apps · 2026-04-30T15:53:39Z

+  });
+
  it('creates fake-executor baseline attempt artifacts without scripting agent decisions', () => {
    const sessionRoot = tempSessionRoot();
    const taskId = manifest.tasks[0].instance_id;


Cleanup bypasses Windows retry/error handling

The new test uses a bare rmSync in finally without maxRetries, retryDelay, or a wrapping ignoreWindowsTempCleanupRace catch. The existing cleanupSessionRoot helper at line 59 already encapsulates all three — using the helper would keep cleanup consistent and avoid flaky failures on Windows where temp-dir handles can still be open when finally runs.

Suggested change

  });

  it('creates fake-executor baseline attempt artifacts without scripting agent decisions', () => {
    const sessionRoot = tempSessionRoot();
    const taskId = manifest.tasks[0].instance_id;
    } finally {
      cleanupSessionRoot(sessionRoot);
    }
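
For readers without the test file at hand, a cleanup helper of the kind the reviewers describe might look like the sketch below. The real cleanupSessionRoot is not shown in this thread, so the name cleanupTempRoot, the retry values, and the set of error codes treated as a harmless race are all assumptions. The rmSync options used (recursive, force, maxRetries, retryDelay) are standard Node.js fs options.

```javascript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import path from 'node:path';

// Error codes treated as a tolerable temp-dir race; this list is an assumption.
const TRANSIENT_CODES = new Set(['EBUSY', 'EPERM', 'ENOTEMPTY']);

// Hypothetical stand-in for the cleanupSessionRoot helper named in the review.
function cleanupTempRoot(root) {
  try {
    // maxRetries/retryDelay make Node retry on busy handles (common on Windows).
    rmSync(root, { recursive: true, force: true, maxRetries: 5, retryDelay: 100 });
  } catch (error) {
    // Swallow only the known race; anything else should still fail the test.
    if (!TRANSIENT_CODES.has(error?.code)) throw error;
  }
}

// Usage: create a throwaway directory, then remove it through the helper.
const root = mkdtempSync(path.join(tmpdir(), 'cleanup-sketch-'));
writeFileSync(path.join(root, 'run-manifest.jsonl'), '');
cleanupTempRoot(root);
const stillThere = existsSync(root);
console.log(stillThere);
```

Keeping the retry options and the tolerated-error check in one helper is the consistency argument both reviewers make: every test gets the same behavior from a single call in its finally block.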

greptile-apps · 2026-04-30T15:53:40Z

+      expect(secondBlockedRows).toHaveLength(firstBlockedRows.length);
+      expect(validateOutput).toContain('baseline session validation passed');
+    } finally {
+      rmSync(path.dirname(path.dirname(path.dirname(path.dirname(sessionRoot)))), {
+        recursive: true,
+        force: true
+      });
+    }
+  });
+
  it('refuses raw baseline artifacts outside the ignored benchmark-runs root', () => {
    const outDir = mkdtempSync(path.join(tmpdir(), 'contextbench-invalid-out-'));


Hardcoded fixture count couples test to manifest config

20 * 2 * 3 embeds the exact task count, blocked-lane count, and repeat count from the phase41 fixture. Any fixture change will silently break this assertion without a clear failure message. Adding a comment naming the source of each factor would make the intent explicit and the breakage diagnosable.

The finally block also repeats a bare rmSync without the maxRetries/retryDelay options present in the cleanupSessionRoot helper used by other tests in this file.
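
One way to make the count self-explanatory is to name each factor, or to derive the product from the fixture. The sketch below assumes the factors map to task count, blocked-lane count, and repeat count in the order the comment lists them. The fixture shape (tasks, blockedLanes, repeats) is hypothetical and is not the phase41 fixture's actual schema.

```javascript
// Named factors; the mapping 20 -> tasks, 2 -> blocked lanes, 3 -> repeats
// follows the order given in the review comment and is an assumption.
const TASK_COUNT = 20;
const BLOCKED_LANE_COUNT = 2;
const REPEAT_COUNT = 3;
const EXPECTED_BLOCKED_ROWS = TASK_COUNT * BLOCKED_LANE_COUNT * REPEAT_COUNT;

// Alternative: derive the expectation from a fixture object (hypothetical shape),
// so a fixture change updates the assertion instead of breaking it.
function expectedBlockedRows(fixture) {
  return fixture.tasks.length * fixture.blockedLanes.length * fixture.repeats;
}

const fixture = {
  tasks: Array.from({ length: 20 }, (_, index) => ({ instance_id: `task-${index}` })),
  blockedLanes: ['lane-x', 'lane-y'],
  repeats: 3
};
console.log(EXPECTED_BLOCKED_ROWS, expectedBlockedRows(fixture)); // 120 120
```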

fix(eval): deduplicate blocked ContextBench rows

c41e844

gemini-code-assist Bot reviewed Apr 30, 2026


greptile-apps Bot reviewed Apr 30, 2026


PatrickSys merged commit 99c9753 into master Apr 30, 2026
4 checks passed

PatrickSys deleted the fix/contextbench-baseline-reservations branch April 30, 2026 19:48
