fix(eval): deduplicate blocked ContextBench rows#122
Conversation
There was a problem hiding this comment.
Code Review
This pull request prevents the duplication of blocked missing-evidence rows in contextbench-runner.mjs by introducing a primary key tracking mechanism during snapshot creation. It also strengthens session validation by checking for duplicate primary baseline rows and ensuring a strict match between blocked reservations and manifest entries. New test cases verify these fixes. The review feedback recommends using a dedicated cleanup helper in the test suite to improve cross-platform reliability and maintain consistency.
| rmSync(path.dirname(path.dirname(path.dirname(path.dirname(sessionRoot)))), { | ||
| recursive: true, | ||
| force: true | ||
| }); |
There was a problem hiding this comment.
Greptile SummaryThis PR makes Confidence Score: 4/5Safe to merge; only P2 style findings in the test files, core runner logic is correct. No P0 or P1 issues found. The deduplication and validation logic in the runner script is sound. The two P2 findings are limited to tests: new tests bypass the existing cleanupSessionRoot helper (missing Windows retry options) and one test hardcodes a fixture-derived count without explanation. tests/contextbench-baseline-runner.test.ts and tests/contextbench-baseline-snapshot.test.ts — both new tests should use cleanupSessionRoot for consistent cleanup.
|
| Filename | Overview |
|---|---|
| scripts/contextbench-runner.mjs | Adds idempotent deduplication to writeBlockedRunRows via a pre-read primary-key set, a new primaryReservationKey helper, and strengthened validateBaselineSession coverage/duplicate checks — logic is sound. |
| tests/contextbench-baseline-runner.test.ts | New regression for duplicate-row rejection; correct test logic but cleanup skips the existing cleanupSessionRoot helper (missing maxRetries / ignoreWindowsTempCleanupRace). |
| tests/contextbench-baseline-snapshot.test.ts | New snapshot-rerun idempotency test; hardcoded 2023 row count is fragile, and cleanup also bypasses the retry helper. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[writeBlockedRunRows called] --> B[Read existing primary keys\nfrom run-manifest.jsonl]
B --> C{For each terminal_missing_evidence\nreservation}
C --> D[Look up laneCard, task, evidence]
D --> E{Missing any?}
E -- yes --> C
E -- no --> F[Compute primaryReservationKey\nlaneId :: taskId :: repeatIndex]
F --> G{Key already in\nexistingPrimaryKeys?}
G -- yes --> C
G -- no --> H[Write artifacts + append\nmanifest row]
H --> I[Add key to existingPrimaryKeys]
I --> C
J[validateBaselineSession] --> K[Count primary rows by key]
K --> L{Any key count > 1?}
L -- yes --> M[Push duplicate error]
L -- no --> N[Build blockedReservationKeys\nfrom reservations]
N --> O[Build blockedRowKeys from\nfallbackReason rows]
O --> P{missing or extra keys?}
P -- yes --> Q[Push coverage mismatch error]
P -- no --> R[Validation passed]
Reviews (1): Last reviewed commit: "fix(eval): deduplicate blocked ContextBe..." | Re-trigger Greptile
| }); | ||
|
|
||
| it('creates fake-executor baseline attempt artifacts without scripting agent decisions', () => { | ||
| const sessionRoot = tempSessionRoot(); | ||
| const taskId = manifest.tasks[0].instance_id; |
There was a problem hiding this comment.
Cleanup bypasses Windows retry/error handling
The new test uses a bare rmSync in finally without maxRetries, retryDelay, or a wrapping ignoreWindowsTempCleanupRace catch. The existing cleanupSessionRoot helper at line 59 already encapsulates all three — using the helper would keep cleanup consistent and avoid flaky failures on Windows where temp-dir handles can still be open when finally runs.
| }); | |
| it('creates fake-executor baseline attempt artifacts without scripting agent decisions', () => { | |
| const sessionRoot = tempSessionRoot(); | |
| const taskId = manifest.tasks[0].instance_id; | |
| } finally { | |
| cleanupSessionRoot(sessionRoot); | |
| } |
| expect(secondBlockedRows).toHaveLength(firstBlockedRows.length); | ||
| expect(validateOutput).toContain('baseline session validation passed'); | ||
| } finally { | ||
| rmSync(path.dirname(path.dirname(path.dirname(path.dirname(sessionRoot)))), { | ||
| recursive: true, | ||
| force: true | ||
| }); | ||
| } | ||
| }); | ||
|
|
||
| it('refuses raw baseline artifacts outside the ignored benchmark-runs root', () => { | ||
| const outDir = mkdtempSync(path.join(tmpdir(), 'contextbench-invalid-out-')); |
There was a problem hiding this comment.
Hardcoded fixture count couples test to manifest config
20 * 2 * 3 embeds the exact task count, blocked-lane count, and repeat count from the phase41 fixture. Any fixture change will silently break this assertion without a clear failure message. Adding a comment naming the source of each factor would make the intent explicit and the breakage diagnosable.
The finally block also repeats a bare rmSync without the maxRetries/retryDelay options present in the cleanupSessionRoot helper used by other tests in this file.
Summary
Verification
pnpm test -- tests/contextbench-baseline-runner.test.ts tests/contextbench-baseline-snapshot.test.ts tests/contextbench-baseline-schema-gate.test.tspnpm exec tsc --noEmitpnpm run buildpnpm test -- tests/zombie-guard.test.ts tests/benchmark-comparators.test.tsNotes