Conversation
[codex] bench: harden signal assessment and retry fairness
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Converges execution to Ollama (schema → 0.5.2), adds Ollama residency enforcement, enriches signal assessment (new taint reasons and merging), introduces model exclusions, refactors OpenCode into modular runner/CLI/artifacts/permissions/events, threads taint through harnesses and the runner, and adds many tests and fixtures.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Runner as Runner
  participant Plan as PlanBuilder
  participant Executor as ItemExecutor
  participant Runtime as OllamaRuntime
  participant Harness as HarnessAdapter
  participant OpenCode as OpenCodeRunner
  participant Events as EventParser
  participant Assessor as SignalAssessor
  participant Writer as ResultWriter
  Runner->>Plan: buildRunPlan(config)
  Runner->>Executor: executeItem(item)
  Executor->>Runtime: createRuntime(ollamaBaseUrl)
  Executor->>Harness: generate(prompt, opts)
  Note right of Harness: OpenCode path prepares artifacts/config and builds args
  Harness->>OpenCode: runOpenCodeCommand(args, env, cwd, timeout)
  OpenCode-->>Harness: stdout/stderr/exitCode
  Harness->>Events: parseOpenCodeEvents(stdout + stderr)
  Events-->>Harness: { output, method, hasToolUse, permissionDenied }
  Harness->>Assessor: getTranscriptOrInputTaintReasons(raw + normalized)
  Assessor-->>Harness: [taintReasons]
  Harness->>Assessor: mergeSignalAssessments(existing?, taintAssessment)
  Assessor-->>Executor: signalAssessment
  Executor->>Assessor: finalizeItemSignalAssessment({ existing?, automatedScore?, rowFailed?, output, outputSource })
  Assessor-->>Executor: finalSignalAssessment
  Executor->>Writer: writeResult(result)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/schemas/common.schema.ts (1)
174-182: ⚠️ Potential issue | 🟠 Major

Version this additive on-disk schema change.

These new `signalAssessment.reasons` literals will be emitted into artifacts, but `SCHEMA_VERSION` stays `0.5.0`. That leaves two different serialized shapes under the same version and prevents explicit compatibility or migration handling. As per coding guidelines: "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 174-182: This change adds new literals to signalAssessmentReasonTypes which alters on-disk artifacts, so bump the package schema version constant (SCHEMA_VERSION) to a new semver (e.g., 0.5.1) and ensure any serialized objects include the updated schemaVersion field where artifacts are created/serialized (search for usages that stamp schemaVersion in export/save code and update them to emit the new value); update tests/fixtures that assert schemaVersion or serialized shapes to the new version as well.

src/runner/item-executor.ts (1)
138-157: ⚠️ Potential issue | 🟠 Major

Carry harness-supplied failure metadata through these catch blocks.

Both paths rebuild the failure from `error.message` only, so the `output`, `signalAssessment`, and `durationMs` now attached by `src/harnesses/goose-adapter.ts` and `src/harnesses/opencode-adapter.ts` are dropped. A failed row can therefore lose the very taint reason this PR is adding and end up recorded as a generic trustworthy failure. Based on learnings: "For per-item runtime failures (timeouts, model errors, eval failures), record structured failure in run.json and continue the matrix".
Also applies to: 334-368
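The merge described above can be sketched as follows; the field names (`failureType`, `output`, `durationMs`, `signalAssessment`) mirror the review text, but the exact shapes are assumptions about this codebase, not its actual types:

```typescript
// Hypothetical shape of extra metadata a harness may attach to a thrown value.
interface ThrownWithMetadata {
  message?: string;
  output?: string;
  durationMs?: number;
  failureType?: string;
  signalAssessment?: unknown;
}

/**
 * Prefer harness-supplied fields on the thrown value; only fall back to
 * generic classification when the harness did not name a failure type.
 */
function mergeFailure(error: unknown, classify: (msg: string) => string) {
  const e = (error ?? {}) as ThrownWithMetadata;
  const message = typeof e.message === "string" ? e.message : String(error);
  return {
    message,
    type: e.failureType ?? classify(message),
    // Preserve harness-supplied output, duration, and taint assessment.
    output: e.output,
    durationMs: e.durationMs,
    signalAssessment: e.signalAssessment,
  };
}
```

Applying the same helper at both catch sites keeps the taint reason and duration intact instead of overwriting them.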
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/item-executor.ts` around lines 138 - 157, The catch blocks reconstruct failure metadata from error.message and drop harness-provided fields; update the catch handlers in item-executor.ts (the blocks that set generation, generationFailure, and call finalizeItemSignalAssessment) to merge/retain any harness-supplied metadata present on the thrown value (e.g. (error as any).output, .durationMs, .failureType, and any signalAssessment info) instead of overwriting them—use the existing classifyGenerationError(errorMessage) only if a failureType isn't present, set generation.durationMs to the harness-provided duration if available, preserve generation.output when provided, and build generationFailure.type/message from the harness fields when present; apply the same merging logic to both catch sites that currently call finalizeItemSignalAssessment so the harness-supplied taint reason and duration are carried through.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 515-529: The error handler currently only inspects
execaError.stderr, losing context printed to stdout; update the catch branch in
goose-adapter.ts (the block using execaError, getTranscriptOrInputTaintReasons,
and appendSignalAssessmentReasons) to aggregate stdout and stderr (e.g., const
combined = `${execaError.stdout || ''}\n${execaError.stderr || ''}`), run
getTranscriptOrInputTaintReasons(combined) and use combined for the Error
message and output property, and adjust the execaError type to include stdout so
both streams are preserved in the thrown Error's signalAssessment and output.
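A minimal sketch of that aggregation, assuming an execa-style error object that exposes `stdout` and `stderr`, with the error message as a fallback when both streams are empty (the interface name is a hypothetical stand-in):

```typescript
// Hypothetical stand-in for the execa error shape used in the adapter.
interface ExecaLikeError {
  message: string;
  stdout?: string;
  stderr?: string;
}

/** Combine both streams; fall back to the error message when both are empty. */
function combinedFailureOutput(error: ExecaLikeError): string {
  const combined = `${error.stdout ?? ""}\n${error.stderr ?? ""}`.trim();
  return combined.length > 0 ? combined : error.message;
}
```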
In `@src/lib/signal-assessment.ts`:
- Around line 22-31: The current INTERNAL_TOOL_TRANSCRIPT_PATTERNS is too
permissive and matches isolated tokens; tighten detection by replacing the
single-pattern test with a composite matcher that requires stronger transcript
shapes—e.g., check for JSON-like pairs appearing together (require both
"sessionID" and a nearby "type":"tool_|step_" within ~200 chars or same line),
or require tool-specific markers plus filePath/function signatures together
(e.g., "<function=bash>" with "<parameter=filePath>" or "filePath:" in the same
block); update the code that inspects failed rows to use this new composite
matcher (instead of matching any entry in INTERNAL_TOOL_TRANSCRIPT_PATTERNS
directly) and add a regression test that feeds a benign artifact containing only
"sessionID" or "step_start"/"tool_call" tokens and asserts it is NOT classified
as internal_tool_transcript while another test includes the stronger multi-token
transcript and asserts it IS classified.
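One way to realize the composite matcher described above; the 200-character window and exact token pairs are taken from the review text as assumptions, not the project's tuned values:

```typescript
/**
 * Treat text as an internal tool transcript only when transcript-shaped
 * tokens co-occur, never when a single token like "sessionID" appears alone.
 */
function looksLikeInternalToolTranscript(text: string): boolean {
  const sessionIdx = text.indexOf('"sessionID"');
  if (sessionIdx !== -1) {
    // Require a nearby tool_/step_ type marker within ~200 characters.
    const window = text.slice(sessionIdx, sessionIdx + 200);
    if (/"type"\s*:\s*"(tool_|step_)/.test(window)) return true;
  }
  // Tool-call markers must appear together with a parameter/file marker.
  return (
    text.includes("<function=bash>") &&
    (text.includes("<parameter=filePath>") || text.includes("filePath:"))
  );
}
```

A benign artifact containing only one of the tokens no longer trips the classifier, which is exactly the regression the suggested test pins down.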
---
Outside diff comments:
In `@src/runner/item-executor.ts`:
- Around line 138-157: The catch blocks reconstruct failure metadata from
error.message and drop harness-provided fields; update the catch handlers in
item-executor.ts (the blocks that set generation, generationFailure, and call
finalizeItemSignalAssessment) to merge/retain any harness-supplied metadata
present on the thrown value (e.g. (error as any).output, .durationMs,
.failureType, and any signalAssessment info) instead of overwriting them—use the
existing classifyGenerationError(errorMessage) only if a failureType isn't
present, set generation.durationMs to the harness-provided duration if
available, preserve generation.output when provided, and build
generationFailure.type/message from the harness fields when present; apply the
same merging logic to both catch sites that currently call
finalizeItemSignalAssessment so the harness-supplied taint reason and duration
are carried through.
In `@src/schemas/common.schema.ts`:
- Around line 174-182: This change adds new literals to
signalAssessmentReasonTypes which alters on-disk artifacts, so bump the package
schema version constant (SCHEMA_VERSION) to a new semver (e.g., 0.5.1) and
ensure any serialized objects include the updated schemaVersion field where
artifacts are created/serialized (search for usages that stamp schemaVersion in
export/save code and update them to emit the new value); update tests/fixtures
that assert schemaVersion or serialized shapes to the new version as well.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b8ce16e5-0258-4842-bdc5-f5d7e80b9ec9
📒 Files selected for processing (16)
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-output.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-output.ts
- src/lib/signal-assessment.ts
- src/runner/item-executor.ts
- src/schemas/common.schema.ts
- test/dashboard-schema-compat.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/item-retry.test.ts
- test/opencode-adapter.test.ts
- test/opencode-output.test.ts
- test/signal-assessment.test.ts
- keep stdout/stderr and signal assessments from Goose failures
- carry structured generation failure details through executeItem
- bump schema version and accept prior current-version artifacts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/harnesses/goose-adapter.ts (1)
323-367: ⚠️ Potential issue | 🟠 Major

Scan `stderr` for taint even when `stdout` is present.

`transcriptOrInputReasons` is currently derived from `rawOutput`/`output`, but `rawOutput` is assigned after the stdout→stderr fallback. That means a successful Goose run can print code on `stdout` and leak `agent_requested_input` or `internal_tool_transcript` only on `stderr`, and this adapter will still return no `signalAssessment`.

💡 Suggested fix
```diff
   let output = result.stdout;
   const stderr = result.stderr?.trim() || "";
+  const stdout = result.stdout?.trim() || "";
@@
-  const rawOutput = output;
+  const rawOutput = output;
   const normalized = normalizeGooseOutput(output);
@@
   const transcriptOrInputReasons = Array.from(
     new Set([
+      ...getTranscriptOrInputTaintReasons(stdout),
+      ...getTranscriptOrInputTaintReasons(stderr),
       ...getTranscriptOrInputTaintReasons(rawOutput),
       ...getTranscriptOrInputTaintReasons(output),
     ]),
   );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/goose-adapter.ts` around lines 323 - 367, The code only checks getTranscriptOrInputTaintReasons against rawOutput/output but rawOutput is set after the stdout→stderr fallback, so taint on stderr is missed when stdout exists; update the logic in the goose adapter (around variables rawOutput, output and calls to normalizeGooseOutput/getTranscriptOrInputTaintReasons/appendSignalAssessmentReasons) to always include stderr in the taint scan (for example compute rawOutputParts = [result.stdout, result.stderr] or call getTranscriptOrInputTaintReasons on stderr as well) before building transcriptOrInputReasons, so that signals like agent_requested_input or internal_tool_transcript found only on stderr are detected and passed into appendSignalAssessmentReasons.
🧹 Nitpick comments (1)
test/machine-profile-migration.test.ts (1)
131-159: Expand this schema-bump regression to cover run artifacts too.

Lines 131-159 only validate the `"0.5.0"` compatibility path for `parseKnownPlanPayload`. Add a sibling `parseKnownRunPayload` assertion so plan/run migration guarantees stay symmetric.

♻️ Suggested test extension
```diff
 it("accepts prior current-version artifacts after a schema bump", () => {
   const parsedPlan = parseKnownPlanPayload({
     schemaVersion: "0.5.0",
     runId: "run-current-minus-one",
     createdAt: "2026-03-05T21:51:18.583Z",
     runtimeEnvironment: {
       platform: "darwin",
       bunVersion: "1.3.3",
     },
     machine: LEGACY_MACHINE,
     config: {
       ollamaBaseUrl: "http://localhost:11434",
       vllmBaseUrl: "http://localhost:8000",
       generateTimeoutMs: 120_000,
       passTypes: ["blind"],
     },
     items: [],
     summary: {
       totalItems: 0,
       runtimes: 0,
       models: 0,
       harnesses: 0,
       tests: 0,
     },
   });
+  const parsedRun = parseKnownRunPayload({
+    schemaVersion: "0.5.0",
+    runId: "run-current-minus-one",
+    machine: LEGACY_MACHINE,
+    startedAt: "2026-03-05T21:51:18.583Z",
+    completedAt: "2026-03-05T21:52:18.583Z",
+    durationMs: 60_000,
+    summary: {
+      total: 0,
+      completed: 0,
+      failed: 0,
+      pending: 0,
+    },
+    items: [],
+  });
   expect(parsedPlan.schemaVersion).toBe(SCHEMA_VERSION);
   expect(parsedPlan.machine?.profileKey).toBe(LEGACY_PROFILE_KEY);
+  expect(parsedRun.schemaVersion).toBe(SCHEMA_VERSION);
+  expect(parsedRun.machine?.profileKey).toBe(LEGACY_PROFILE_KEY);
 });
```

Based on learnings: "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/machine-profile-migration.test.ts` around lines 131 - 159, Add a mirrored assertion for run artifacts: alongside the existing parseKnownPlanPayload test that feeds a "0.5.0" payload using LEGACY_MACHINE, call parseKnownRunPayload with the same payload shape (schemaVersion "0.5.0", runId, createdAt, runtimeEnvironment, machine: LEGACY_MACHINE, config, items, summary) and assert the returned object's schemaVersion equals SCHEMA_VERSION and machine?.profileKey equals LEGACY_PROFILE_KEY so plan/run migration behavior stays symmetric (look for parseKnownPlanPayload and add the sibling parseKnownRunPayload assertion near it).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 515-530: The error message and output are being built from
combined which becomes "\n" when both stdout and stderr are empty, causing the
real execaError.message to be lost; update the logic around the
execaError/combined variables in the Goose failure handling block so that you
first compute a trimmedCombined (or similar) from execaError.stdout and
execaError.stderr and if that result is empty use execaError.message as the
fallback for both the thrown Error message and the output property; keep the
existing calls to getTranscriptOrInputTaintReasons and
appendSignalAssessmentReasons unchanged but feed them the effective output
(trimmedCombined or execaError.message) so downstream signalAssessment and
output reflect the real failure.
---
Outside diff comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 323-367: The code only checks getTranscriptOrInputTaintReasons
against rawOutput/output but rawOutput is set after the stdout→stderr fallback,
so taint on stderr is missed when stdout exists; update the logic in the goose
adapter (around variables rawOutput, output and calls to
normalizeGooseOutput/getTranscriptOrInputTaintReasons/appendSignalAssessmentReasons)
to always include stderr in the taint scan (for example compute rawOutputParts =
[result.stdout, result.stderr] or call getTranscriptOrInputTaintReasons on
stderr as well) before building transcriptOrInputReasons, so that signals like
agent_requested_input or internal_tool_transcript found only on stderr are
detected and passed into appendSignalAssessmentReasons.
---
Nitpick comments:
In `@test/machine-profile-migration.test.ts`:
- Around line 131-159: Add a mirrored assertion for run artifacts: alongside the
existing parseKnownPlanPayload test that feeds a "0.5.0" payload using
LEGACY_MACHINE, call parseKnownRunPayload with the same payload shape
(schemaVersion "0.5.0", runId, createdAt, runtimeEnvironment, machine:
LEGACY_MACHINE, config, items, summary) and assert the returned object's
schemaVersion equals SCHEMA_VERSION and machine?.profileKey equals
LEGACY_PROFILE_KEY so plan/run migration behavior stays symmetric (look for
parseKnownPlanPayload and add the sibling parseKnownRunPayload assertion near
it).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 44b63a0b-5e91-4db2-bf85-a11925c28932
📒 Files selected for processing (11)
- src/harnesses/goose-adapter.ts
- src/lib/machine-profile/legacy.ts
- src/lib/signal-assessment.ts
- src/runner/item-executor.ts
- src/schemas/common.schema.ts
- test/dashboard-schema-compat.test.ts
- test/goose-adapter.test.ts
- test/item-executor.test.ts
- test/machine-profile-migration.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
✅ Files skipped from review due to trivial changes (3)
- test/schemas.test.ts
- src/lib/machine-profile/legacy.ts
- test/dashboard-schema-compat.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/goose-adapter.test.ts
- test/signal-assessment.test.ts
- src/schemas/common.schema.ts
- Scan stderr alongside stdout for taint reasons
- Use execa message when Goose emits no stream output
- Cover legacy run migration for machine profiles
- Remove vLLM and OpenAI-compatible runtime paths
- Simplify harness, config, and docs around Ollama-only runs
- Keep legacy vLLM artifacts readable for compare and history
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
182-205: ⚠️ Potential issue | 🟡 Minor

Update the example model-profile `schemaVersion`.

This example still shows `"0.5.0"`, but the current schema version in this PR is `0.5.1`. As written, a copy-pasted config will be rejected by the parser.

Proposed fix
```diff
-  "schemaVersion": "0.5.0",
+  "schemaVersion": "0.5.1",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 182 - 205, Update the example model-profile JSON to use the current schema version so pasted configs pass validation: change the "schemaVersion" value in the README example from "0.5.0" to "0.5.1" (the canonical model-profile example block that users load with --model-config should reflect the PR's schemaVersion).
♻️ Duplicate comments (1)
src/runner/item-executor.ts (1)
88-145: ⚠️ Potential issue | 🟡 Minor

Validate `signalAssessment` against the real schema before persisting it.

`isSignalAssessment()` is still only a loose shape check, so malformed payloads from thrown harness/runtime errors can flow straight into run.json as long as they look like `{ classification, reasons: string[] }`. This helper is normalizing unknown input at a boundary; it should parse against the actual schema/invariants, not just the outer shape.

```bash
#!/bin/bash
# Verify whether the persisted signal-assessment schema is stricter than the local guard.
rg -n -C3 'SignalAssessmentSchema|signalAssessmentReasons|SignalAssessmentReason|isSignalAssessment' src
```

Based on learnings, "Validate at boundaries with Zod and pass typed data inward" and "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/item-executor.ts` around lines 88 - 145, The current isSignalAssessment loosenly checks shape and lets malformed payloads pass into persisted run.json; replace that loose guard in extractGenerationFailureDetails with a strict parse against the canonical schema (use SignalAssessmentSchema.parse or safeParse) so only fully valid SignalAssessment objects are accepted; specifically, in extractGenerationFailureDetails use SignalAssessmentSchema.safeParse(errorRecord?.signalAssessment) and set signalAssessment to the parsed value on success or undefined on failure (remove or keep isSignalAssessment only for pre-checks but do not persist its result) so all boundary inputs are validated by the real schema before being written.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/schemas/model-profile.schema.ts`:
- Around line 109-112: The schema change narrowed
ConfiguredModelProfileSchema.variants to SupportedRuntimeNameSchema which breaks
loading legacy persisted model-profile files that contain legacy runtime keys
like "vllm"; revert or relax the variants constraint so loadModelProfiles(...)
accepts unknown/legacy runtime keys (e.g., accept string keys or union with
legacy enum) or add an explicit migration step: update
ConfiguredModelProfileSchema.variants to allow legacy names (including "vllm")
alongside SupportedRuntimeNameSchema OR implement a migration function invoked
by loadModelProfiles that maps legacy keys (like "vllm") to current runtime
names (e.g., Ollama) before validation, ensuring backward compatibility without
changing runtime behavior.
---
Outside diff comments:
In `@README.md`:
- Around line 182-205: Update the example model-profile JSON to use the current
schema version so pasted configs pass validation: change the "schemaVersion"
value in the README example from "0.5.0" to "0.5.1" (the canonical model-profile
example block that users load with --model-config should reflect the PR's
schemaVersion).
---
Duplicate comments:
In `@src/runner/item-executor.ts`:
- Around line 88-145: The current isSignalAssessment loosely checks shape and
lets malformed payloads pass into persisted run.json; replace that loose guard
in extractGenerationFailureDetails with a strict parse against the canonical
schema (use SignalAssessmentSchema.parse or safeParse) so only fully valid
SignalAssessment objects are accepted; specifically, in
extractGenerationFailureDetails use
SignalAssessmentSchema.safeParse(errorRecord?.signalAssessment) and set
signalAssessment to the parsed value on success or undefined on failure (remove
or keep isSignalAssessment only for pre-checks but do not persist its result) so
all boundary inputs are validated by the real schema before being written.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c0c41d59-1088-4aac-b6a9-34ef8ea71fed
📒 Files selected for processing (42)
- README.md
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-config.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/plan-builder.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/opencode-config.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/vllm-runtime.test.ts
💤 Files with no reviewable changes (11)
- src/runner/index.ts
- test/opencode-config.test.ts
- test/benchmark-checkpoint.test.ts
- package.json
- src/lib/benchmark-checkpoint.ts
- src/harnesses/goose-openai.ts
- test/build-index.test.ts
- test/vllm-runtime.test.ts
- src/runtimes/discovery.ts
- src/runtimes/vllm-runtime.ts
- src/lib/openai-compat-client.ts
✅ Files skipped from review due to trivial changes (4)
- src/lib/model-aliases.ts
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- src/schemas/model-alias.schema.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/schemas.test.ts
- test/item-executor.test.ts
- src/schemas/common.schema.ts
👮 Files not reviewed due to content moderation or server errors (8)
- src/runtimes/runtime.ts
- src/harnesses/opencode-config.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/opencode-adapter.ts
- src/cli/run-command.ts
- llm/project/project-rules.md
- src/schemas/result.schema.ts
- llm/project/user-flow.md
- Remove deprecated `vllmBaseUrl` from plan config
- Add optional Goose turn limits and test categories
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 426-430: Update the four goose turn-count fields (gooseMaxTurns,
gooseRetryMaxTurns, gooseWorkspaceMaxTurns, gooseWorkspaceRetryMaxTurns) to
enforce positive integers instead of any number: replace their
z.number().optional() validators with z.number().int().positive().optional() so
validation matches the core plan.schema.ts rules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c706018b-a626-4ccd-b42b-6afb48e87e8a
📒 Files selected for processing (2)
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- Accept legacy model profile files by dropping unsupported runtime variants
- Validate harness signal assessments with Zod before persisting failures
- Tighten dashboard schema bounds for Goose turn counts
@CodeRabbit full review

✅ Actions performed

Full review triggered.
Actionable comments posted: 6
🧹 Nitpick comments (10)
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx (1)
270-313: Consider adding TSDoc to the exported function.

The exported `BlindVsInformedChart` function lacks TSDoc documentation. Adding it would improve discoverability and align with the project's documentation standards.

📝 Example TSDoc addition
```diff
+/**
+ * Renders a chart comparing pass rates between blind and informed prompts.
+ * Provides model-level and harness-level breakdowns with delta summaries.
+ *
+ * @param items - Array of matrix item results to analyze
+ * @returns A card component containing tabbed charts and summaries
+ */
 export function BlindVsInformedChart({ items }: BlindVsInformedChartProps) {
```

As per coding guidelines: "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx` around lines 270 - 313, Add a TSDoc block above the exported BlindVsInformedChart function describing its purpose, parameters, and return value; document the items parameter (type BlindVsInformedChartProps) and what the component renders (chart tabs for model/harness breakdown), include `@param` for items and `@returns` React.ReactElement (or JSX.Element), and add any `@throws` if it can throw—place the doc comment immediately above the export function BlindVsInformedChart declaration to satisfy the project guideline that all exported functions have TSDoc.

apps/dashboard/src/lib/api.ts (1)
1-4: Move the module header back to the top of the file.

The new import block now appears before the required file header. Put the purpose/exports/invariants header first, then imports.
♻️ Proposed reorder
```diff
-import {
-  parseKnownPlanPayload,
-  parseKnownRunPayload,
-} from "../../../../src/lib/machine-profile/legacy.js";
 /**
  * Purpose: Fetch run data from static JSON files.
  * Exports: fetchRuns, fetchRun, fetchPlan, fetchRunWithPlan, fetchDashboardIndex, fetchLatestAggregate
@@
  * Data is loaded from the results directory via Vite's dev server.
  * All fetched JSON is validated with Zod schemas at this boundary.
  */
+import {
+  parseKnownPlanPayload,
+  parseKnownRunPayload,
+} from "../../../../src/lib/machine-profile/legacy.js";
```

As per coding guidelines `**/*.{ts,tsx,js,jsx}`: Every file begins with a short header documenting purpose, exports, and invariants.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/api.ts` around lines 1 - 4, The file header documenting purpose/exports/invariants must be the very first thing in the module; move the existing module header comment so it appears before the import block that currently brings in parseKnownPlanPayload and parseKnownRunPayload from "../../../../src/lib/machine-profile/legacy.js", and ensure the header briefly states the file's purpose, its exported items, and any invariants as per the coding guideline so the header precedes all imports and code.

llm/project/user-flow.md (1)
209-209: Clarify line 209 to avoid mixed messaging about runtime selection.

Line 209 says runtime is fixed, but "unless explicitly passed as `--runtimes ollama`" still reads like a selectable runtime path. Consider simplifying to state that runtime is fixed to `ollama` in MVP.

Suggested doc tweak
```diff
-- **Discovery**: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` unless explicitly passed as `--runtimes ollama`.
+- **Discovery**: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` in MVP.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/project/user-flow.md` at line 209, Edit the "Discovery" bullet so it clearly states the runtime decision for MVP: remove the conditional phrase "unless explicitly passed as `--runtimes ollama`" and reword to say "Discovery: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` for the MVP." Reference the "Discovery" bullet and the string "`--runtimes ollama`" when making the change to ensure the mixed message is eliminated.

llm/implementation/multi-runtime-mvp-implementation.md (1)
1-5: Consider making the historical framing even harder to miss.

Line 1 and Line 5 are clear, but a quick scan can still misread later sections as current behavior. A small follow-up (e.g., a "Historical Behavior (Feb 8, 2026)" heading) would reduce ambiguity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/implementation/multi-runtime-mvp-implementation.md` around lines 1 - 5, Add a very prominent historical warning header and timestamp to the top of the document so readers cannot mistake it for current behavior: change the existing opening lines in Multi-Runtime MVP Implementation to include a bold/ALL-CAPS heading like "HISTORICAL BEHAVIOR (Feb 8, 2026) — DO NOT USE FOR CURRENT RUNTIME" and add a one-line subheading such as "Archived for artifact/debugging context; active benchmarks are Ollama-only" (replace the current first paragraph). Ensure the new header appears before any other text so the historical status is immediately obvious on quick scans.

src/cli/run-command.ts (1)
83-107: Consider a temporary deprecated `--vllm-url` shim for smoother script migration.

With Ollama-only execution, this is directionally correct; however, existing CI/scripts using `--vllm-url` will fail hard. A one-release no-op flag that warns would reduce breakage.

Possible transition patch
```diff
 export const runCommand = new Command("run")
@@
   .option("--ollama-url <url>", "Ollama API base URL", "http://localhost:11434")
+  .option(
+    "--vllm-url <url>",
+    "Deprecated: ignored (runtime execution is Ollama-only)",
+  )
@@
   .action(async (options) => {
     try {
+      if (typeof options.vllmUrl === "string" && options.vllmUrl.trim().length > 0) {
+        logger.warn(
+          "Warning: --vllm-url is deprecated and ignored (Ollama-only execution)",
+        );
+      }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/run-command.ts` around lines 83 - 107, Add a temporary deprecated shim flag --vllm-url that accepts a URL and maps it to the existing --ollama-url behavior: update the CLI option definitions near the existing .option("--ollama-url <url>") to also declare .option("--vllm-url <url>") as a deprecated alias, when parsing prefer the explicit --ollama-url but if --vllm-url is present set the same internal variable (e.g., ollamaUrl) and emit a single-time warning message informing users the flag is deprecated and will be removed next release; ensure the shim does not change runtime semantics and that the warning uses the same logger/console path used elsewhere in this module.

test/model-profile-registry.test.ts (1)
96-97: Consider avoiding schema-version churn in this fixture. Line 96 hardcodes `"0.5.1"` even though this test targets unsupported runtime-key filtering, not version migration behavior. Using `SCHEMA_VERSION` would keep intent stable across future bumps.

♻️ Suggested small cleanup
```diff
 import {
   ConfiguredModelProfileSchema,
   ModelProfileFileSchema,
 } from "../src/schemas/model-profile.schema.js";
+import { SCHEMA_VERSION } from "../src/schemas/index.js";
 ...
-  schemaVersion: "0.5.1",
+  schemaVersion: SCHEMA_VERSION,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/model-profile-registry.test.ts` around lines 96 - 97, Replace the hardcoded schemaVersion string "0.5.1" in the test fixture with the canonical SCHEMA_VERSION constant: import SCHEMA_VERSION into the test and set schemaVersion: SCHEMA_VERSION in the object used by the model-profile registry test (the fixture that currently contains schemaVersion and models). This keeps the test focused on unsupported runtime-key filtering and avoids future churn when the schema version is bumped.

src/harnesses/opencode-config.ts (1)
128-162: Consider removing unused `runtimeName` parameter. Since `runtimeName` can only be `"ollama"` and isn't used to branch logic in the function body (no runtime-specific env tuning remains), consider removing it from the interface to simplify the API.

♻️ Optional simplification
```diff
 export function buildOpenCodeEnv(opts: {
   configPath: string;
   configJson: string;
-  runtimeName: "ollama";
 }): Record<string, string> {
-  const { configPath, configJson, runtimeName } = z
+  const { configPath, configJson } = z
     .object({
       configPath: z.string().min(1),
       configJson: z.string().min(1),
-      runtimeName: RuntimeNameSchema,
     })
     .parse(opts);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-config.ts` around lines 128 - 162, The function buildOpenCodeEnv currently accepts an opts parameter with a runtimeName property that is never used; remove runtimeName from the function signature and from the zod validation to simplify the API: update the opts type to only include configPath and configJson, remove runtimeName from the destructuring and from the z.object() schema (RuntimeNameSchema reference), and adjust any call sites that pass runtimeName to stop providing it or to only pass the two remaining fields so callers compile.

test/plan-builder.test.ts (1)
16-102: Consider extracting fallback helper to a shared test utility. The `fallbackCollectMachineProfile` function is substantial (~85 lines) and may be useful in other test files. Consider extracting it to a shared test fixtures module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/plan-builder.test.ts` around lines 16 - 102, The fallbackCollectMachineProfile helper is large and should be extracted to a shared test fixtures module: create a new test utility file (e.g., test/utils/fixtures.ts) exporting fallbackCollectMachineProfile as a named export, move the function body there (preserving its signature and any helper readNonEmpty logic and types), update test/plan-builder.test.ts to import { fallbackCollectMachineProfile } from the new module, and ensure any TypeScript types referenced by the function are exported or inlined so existing tests compile unchanged.

src/harnesses/opencode-adapter.ts (1)
546-564: Preserve `output`/`durationMs` in the execa-error path too. This branch still only inspects `stderr` and omits `output` plus `durationMs`. When `executeItem` catches this, it falls back to `durationMs: 0` and loses any failure text emitted on stdout, even though the non-zero-exit path already preserves that evidence.

Possible shape
```diff
-    if (error && typeof error === "object" && "stderr" in error) {
-      const execaError = error as { stderr: string; message: string };
-      const errorReasons = [
-        ...(hasPermissionDeniedStderr(execaError.stderr)
+    if (error && typeof error === "object" && "stderr" in error) {
+      const execaError = error as {
+        stdout?: string;
+        stderr?: string;
+        message: string;
+      };
+      const effectiveOutput =
+        [execaError.stdout, execaError.stderr]
+          .map((part) => part?.trim() ?? "")
+          .filter((part) => part.length > 0)
+          .join("\n") || execaError.message;
+      const durationMs = Math.round(performance.now() - startTime);
+      const errorReasons = [
+        ...(hasPermissionDeniedStderr(execaError.stderr ?? "")
           ? (["tool_permission_denied"] as const)
           : []),
-        ...getTranscriptOrInputTaintReasons(execaError.stderr),
+        ...getTranscriptOrInputTaintReasons(effectiveOutput),
       ];
       throw Object.assign(
-        new Error(
-          `OpenCode failed: ${execaError.stderr || execaError.message}`,
-        ),
+        new Error(`OpenCode failed: ${effectiveOutput}`),
         {
           signalAssessment:
             errorReasons.length > 0
               ? appendSignalAssessmentReasons(undefined, errorReasons)
               : undefined,
+          durationMs,
+          output: effectiveOutput,
         },
       );
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 546 - 564, The execa-error branch in the error handler for executeItem only uses execaError.stderr and message, dropping execa output and timing; update the throw to preserve execaError.output (or output/stdout) and durationMs so executeItem can see the real failure text and timing. Specifically, in the block that builds execaError and computes errorReasons (symbols: execaError, hasPermissionDeniedStderr, getTranscriptOrInputTaintReasons, appendSignalAssessmentReasons, signalAssessment), include the original output and durationMs properties when calling Object.assign on the new Error so the thrown object contains output and durationMs for downstream handling by executeItem.

src/schemas/common.schema.ts (1)
38-54: Encode the runtime-list relationship from a single source of truth. `artifactRuntimeNames` repeats `"ollama"` separately from `supportedRuntimeNames`, so the active-runtime and artifact-runtime lists can drift on a future edit. Building the artifact list from `supportedRuntimeNames` makes that invariant explicit.

♻️ Proposed refactor
```diff
 /** Valid runtime names for active benchmark execution. */
 export const supportedRuntimeNames = ["ollama"] as const;

 /** Zod schema for runtimes accepted by current config and execution flows. */
 export const SupportedRuntimeNameSchema = z.literal("ollama");

 /** Runtime name type accepted by current config and execution flows. */
 export type SupportedRuntimeName = z.infer<typeof SupportedRuntimeNameSchema>;

 /** Valid runtime names allowed when reading historical plan/result artifacts. */
-export const artifactRuntimeNames = ["ollama", "vllm"] as const;
+export const artifactRuntimeNames = [...supportedRuntimeNames, "vllm"] as const;
```

As per coding guidelines, "Favor iteration and modularization over code duplication".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 38 - 54, supportedRuntimeNames and artifactRuntimeNames duplicate "ollama", risking drift; change artifactRuntimeNames to derive from supportedRuntimeNames (e.g., create artifactRuntimeNames by spreading supportedRuntimeNames and adding "vllm" with a const assertion) so the runtime-list relationship is encoded in one source of truth, then keep ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName type as z.infer<typeof ArtifactRuntimeNameSchema>.
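The derivation the prompt describes can be sketched in plain TypeScript. This is a minimal, hypothetical illustration rather than the project's actual module: the real schemas wrap these lists in Zod (`z.enum`), which is omitted here so the sketch stays self-contained, and `isArtifactRuntimeName` is an illustrative helper, not an existing export.

```typescript
// Derive the artifact-runtime list from the supported list so the two
// cannot drift: adding a runtime to supportedRuntimeNames automatically
// makes it a valid artifact runtime as well.
const supportedRuntimeNames = ["ollama"] as const;
const artifactRuntimeNames = [...supportedRuntimeNames, "vllm"] as const;

type SupportedRuntimeName = (typeof supportedRuntimeNames)[number];
type ArtifactRuntimeName = (typeof artifactRuntimeNames)[number];

// Runtime guard for values read from historical plan/result artifacts.
function isArtifactRuntimeName(value: string): value is ArtifactRuntimeName {
  return (artifactRuntimeNames as readonly string[]).includes(value);
}
```

The `as const` assertions keep both lists as readonly literal tuples, so the derived types stay exact instead of widening to `string[]`.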
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/ui/section-heading.tsx`:
- Around line 12-15: The exported function SectionHeading lacks complete
JSDoc/TSDoc; update the comment block above the SectionHeading declaration to
include a short purpose description, `@param` tags for title and description, an
`@returns` tag describing the React element returned, and an `@throws` tag
(explicitly stating none if it never throws). Ensure the JSDoc sits immediately
above the SectionHeading function and uses the repo's standard TSDoc style.
In `@src/harnesses/goose-adapter.ts`:
- Around line 216-223: The code currently hard-codes provider = "ollama" and
builds env without using runtime.baseUrl; update the Goose configuration to
include the runtime base URL when present by adding an environment variable
(e.g., GOOSE_BASE_URL) to the env object derived from runtime.baseUrl, keeping
provider and GOOSE_MODEL as-is; in other words, when runtime.baseUrl is defined,
set env.GOOSE_BASE_URL = runtime.baseUrl so Ollama/custom endpoints are passed
through (reference the provider variable, the env object, and runtime.baseUrl),
or alternatively remove the file header invariant if you intentionally do not
support custom endpoints.
In `@src/harnesses/opencode-output.ts`:
- Around line 260-262: The early return that sets method "json" when parsedLines
> 0 is too aggressive and drops non-JSON content; change the condition so we
only short-circuit to the JSON-only return when the entire input was
successfully parsed as JSON (e.g., parsedLines === totalLineCount or a flag like
allLinesParsed is true). If mixed content exists, do not return an empty
output/method "json" — instead let the function fall through to the raw/fallback
handling or merge the non-JSON lines into the returned output. Update the logic
around parsedLines and the JSON-short-circuit (the block that currently returns
{ output: "", method: "json" }) to perform this full-stream check or preserve
non-JSON text.
In `@src/lib/model-profile/registry.ts`:
- Around line 148-178: normalizeLoadedModelProfileRegistry currently removes
unsupported runtime variants but leaves profiles whose variants become empty,
causing ModelProfileRegistrySchema.parse to fail; update the function to detect
and drop profiles whose filtered variants object is empty (e.g., track
droppedProfileKeys alongside droppedRuntimeNames), log a warning with the
dropped profile keys and their original runtimes, and build normalized only from
profiles that still have at least one supported variant before calling
ModelProfileRegistrySchema.parse so a single legacy profile doesn’t break the
whole load.
In `@src/lib/signal-assessment.ts`:
- Around line 32-40: The matcher list AGENT_REQUESTED_INPUT_PATTERNS is too
generic and causes UI/artifact text to be classified as agent_requested_input;
restrict its application by either (1) narrowing the regexes to harness-specific
phrasing/markers (e.g., include anchors or known harness tokens) or (2) guarding
the matching logic so AGENT_REQUESTED_INPUT_PATTERNS is only evaluated for
harness boundary/failure rows (e.g., check the row context like isHarnessPrompt
/ source === 'harness' / isBoundaryFailure before running matches in the
function that classifies agent_requested_input). Update the matching code where
AGENT_REQUESTED_INPUT_PATTERNS is used (and the same change at the other
occurrences) so legitimate UI/artifact strings aren’t misclassified.
In `@src/runner/plan-builder.ts`:
- Around line 97-101: Remove the unnecessary type assertion on config.runtimes
in the runtimes assignment: config.runtimes is already correctly typed by the
BenchConfig/Zod schema as SupportedRuntimeName[] (equivalent to RuntimeName[]),
so change the ternary to use config.runtimes directly instead of
(config.runtimes as RuntimeName[]); update the expression that sets runtimes
(which references config.runtimes, RuntimeName and RUNTIME_NAMES) to the simpler
form that returns config.runtimes when non-empty and [...RUNTIME_NAMES]
otherwise.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx`:
- Around line 270-313: Add a TSDoc block above the exported BlindVsInformedChart
function describing its purpose, parameters, and return value; document the
items parameter (type BlindVsInformedChartProps) and what the component renders
(chart tabs for model/harness breakdown), include `@param` for items and `@returns`
React.ReactElement (or JSX.Element), and add any `@throws` if it can throw—place
the doc comment immediately above the export function BlindVsInformedChart
declaration to satisfy the project guideline that all exported functions have
TSDoc.
In `@apps/dashboard/src/lib/api.ts`:
- Around line 1-4: The file header documenting purpose/exports/invariants must
be the very first thing in the module; move the existing module header comment
so it appears before the import block that currently brings in
parseKnownPlanPayload and parseKnownRunPayload from
"../../../../src/lib/machine-profile/legacy.js", and ensure the header briefly
states the file's purpose, its exported items, and any invariants as per the
coding guideline so the header precedes all imports and code.
In `@llm/implementation/multi-runtime-mvp-implementation.md`:
- Around line 1-5: Add a very prominent historical warning header and timestamp
to the top of the document so readers cannot mistake it for current behavior:
change the existing opening lines in Multi-Runtime MVP Implementation to include
a bold/ALL-CAPS heading like "HISTORICAL BEHAVIOR (Feb 8, 2026) — DO NOT USE FOR
CURRENT RUNTIME" and add a one-line subheading such as "Archived for
artifact/debugging context; active benchmarks are Ollama-only" (replace the
current first paragraph). Ensure the new header appears before any other text so
the historical status is immediately obvious on quick scans.
In `@llm/project/user-flow.md`:
- Line 209: Edit the "Discovery" bullet so it clearly states the runtime
decision for MVP: remove the conditional phrase "unless explicitly passed as
`--runtimes ollama`" and reword to say "Discovery: models/harnesses are
auto-discovered by default; runtime is fixed to `ollama` for the MVP." Reference
the "Discovery" bullet and the string "`--runtimes ollama`" when making the
change to ensure the mixed message is eliminated.
In `@src/cli/run-command.ts`:
- Around line 83-107: Add a temporary deprecated shim flag --vllm-url that
accepts a URL and maps it to the existing --ollama-url behavior: update the CLI
option definitions near the existing .option("--ollama-url <url>") to also
declare .option("--vllm-url <url>") as a deprecated alias, when parsing prefer
the explicit --ollama-url but if --vllm-url is present set the same internal
variable (e.g., ollamaUrl) and emit a single-time warning message informing
users the flag is deprecated and will be removed next release; ensure the shim
does not change runtime semantics and that the warning uses the same
logger/console path used elsewhere in this module.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 546-564: The execa-error branch in the error handler for
executeItem only uses execaError.stderr and message, dropping execa output and
timing; update the throw to preserve execaError.output (or output/stdout) and
durationMs so executeItem can see the real failure text and timing.
Specifically, in the block that builds execaError and computes errorReasons
(symbols: execaError, hasPermissionDeniedStderr,
getTranscriptOrInputTaintReasons, appendSignalAssessmentReasons,
signalAssessment), include the original output and durationMs properties when
calling Object.assign on the new Error so the thrown object contains output and
durationMs for downstream handling by executeItem.
In `@src/harnesses/opencode-config.ts`:
- Around line 128-162: The function buildOpenCodeEnv currently accepts an opts
parameter with a runtimeName property that is never used; remove runtimeName
from the function signature and from the zod validation to simplify the API:
update the opts type to only include configPath and configJson, remove
runtimeName from the destructuring and from the z.object() schema
(RuntimeNameSchema reference), and adjust any call sites that pass runtimeName
to stop providing it or to only pass the two remaining fields so callers
compile.
In `@src/schemas/common.schema.ts`:
- Around line 38-54: supportedRuntimeNames and artifactRuntimeNames duplicate
"ollama", risking drift; change artifactRuntimeNames to derive from
supportedRuntimeNames (e.g., create artifactRuntimeNames by spreading
supportedRuntimeNames and adding "vllm" with a const assertion) so the
runtime-list relationship is encoded in one source of truth, then keep
ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName
type as z.infer<typeof ArtifactRuntimeNameSchema>.
In `@test/model-profile-registry.test.ts`:
- Around line 96-97: Replace the hardcoded schemaVersion string "0.5.1" in the
test fixture with the canonical SCHEMA_VERSION constant: import SCHEMA_VERSION
into the test and set schemaVersion: SCHEMA_VERSION in the object used by the
model-profile registry test (the fixture that currently contains schemaVersion
and models). This keeps the test focused on unsupported runtime-key filtering
and avoids future churn when the schema version is bumped.
In `@test/plan-builder.test.ts`:
- Around line 16-102: The fallbackCollectMachineProfile helper is large and
should be extracted to a shared test fixtures module: create a new test utility
file (e.g., test/utils/fixtures.ts) exporting fallbackCollectMachineProfile as a
named export, move the function body there (preserving its signature and any
helper readNonEmpty logic and types), update test/plan-builder.test.ts to import
{ fallbackCollectMachineProfile } from the new module, and ensure any TypeScript
types referenced by the function are exported or inlined so existing tests
compile unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d585f805-fd92-49c5-ac58-25092ba73ba5
📒 Files selected for processing (116)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-output.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/hardware-profile.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/plan-builder.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-profile-registry.test.ts
- test/opencode-adapter.test.ts
- test/opencode-config.test.ts
- test/opencode-output.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (9)
- src/runner/index.ts
- test/benchmark-checkpoint.test.ts
- test/opencode-config.test.ts
- package.json
- src/harnesses/goose-openai.ts
- src/runtimes/discovery.ts
- test/vllm-runtime.test.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
[codex] rebuild opencode harness
@CodeRabbit full review

✅ Actions performed: full review triggered.
Actionable comments posted: 8
♻️ Duplicate comments (3)
apps/dashboard/src/components/about/about-content.ts (1)
99-101: ⚠️ Potential issue | 🟡 Minor

Keep the axis label aligned with “prompt mode”. This entry still surfaces the internal `passType` name, so the About copy mixes implementation terminology with the newer user-facing wording. Rename the label here to avoid conflicting language in the UI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/about/about-content.ts` around lines 99 - 101, The About copy exposes the internal key "passType" in the name field; change the displayed label to the user-facing term (e.g., "prompt mode") so UI language is consistent. Update the object entry where name: "passType" appears in about-content.ts to use the human-friendly label string and ensure any descriptive text still matches the two modes; leave internal keys/logic untouched and only modify the displayed name.

apps/dashboard/src/lib/types.ts (1)
286-295: ⚠️ Potential issue | 🟡 Minor

Keep the legacy `vllmBaseUrl` field in `PlanConfig` until the migration is explicit. `apps/dashboard/src/lib/schemas.ts` still accepts `config.vllmBaseUrl` for older `plan.json` payloads, so removing it here makes the handwritten type drift from the payload shape the dashboard can actually parse. Keep it optional until the legacy field is removed through a versioned migration.

🧩 Minimal compatibility fix
```diff
 export interface PlanConfig {
   ollamaBaseUrl: string;
+  vllmBaseUrl?: string;
   generateTimeoutMs: number;
   gooseMaxTurns?: number;
   gooseRetryMaxTurns?: number;
   gooseWorkspaceMaxTurns?: number;
   gooseWorkspaceRetryMaxTurns?: number;
```

Based on learnings, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` around lines 286 - 295, Keep the legacy optional field in the PlanConfig interface: add an optional vllmBaseUrl?: string property to the PlanConfig declaration (alongside ollamaBaseUrl, generateTimeoutMs, etc.) so the handwritten type matches the payload shape parsed by the schemas that still accept config.vllmBaseUrl; leave it optional until an explicit, versioned migration removes the legacy field.

src/harnesses/opencode-events.ts (1)
122-188: ⚠️ Potential issue | 🟠 Major

Handle all supported tool-call shapes here, and keep the last write in batched arrays. `extractFromToolCallObject()` still only recognizes `name`/`toolName`, and it short-circuits on the first array hit. That means structured payloads like `{ tool: "write_file", input: ... }` / `{ command: "edit", args: ... }` are missed, and a `toolCalls` array can still return stale code if a later entry rewrites the file. In those cases the parser falls through to `"json"`/`"raw"` even though a valid write payload was present.

♻️ Suggested fix
```diff
 function extractFromToolCallObject(obj: unknown): string | null {
   const maxDepth = 4;
   const visit = (value: unknown, depth: number): string | null => {
     if (depth > maxDepth || !value) return null;
     if (Array.isArray(value)) {
-      for (const item of value) {
-        const found = visit(item, depth + 1);
-        if (found) return found;
-      }
-      return null;
+      let latest: string | null = null;
+      for (const item of value) {
+        const found = visit(item, depth + 1);
+        if (found) latest = found;
+      }
+      return latest;
     }
     if (typeof value !== "object") return null;
     const record = value as Record<string, unknown>;
     const nameValue =
       typeof record.name === "string"
         ? record.name
         : typeof record.toolName === "string"
           ? record.toolName
+          : typeof record.tool === "string"
+            ? record.tool
+            : typeof record.tool_name === "string"
+              ? record.tool_name
+              : typeof record.command === "string"
+                ? record.command
           : undefined;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-events.ts` around lines 122 - 188, The extractor currently only recognizes name/toolName and returns the first hit for arrays, missing shapes like {tool: "...", input: ...} or {command: "...", args: ...} and losing later writes in batched arrays; update extractFromToolCallObject()/visit to also accept record.command and record.tool as alternate tool-name fields (in addition to name/toolName), and to treat args as record.arguments ?? record.args ?? record.parameters ?? record.input ?? record.raw; when value is an array, do not short-circuit on the first match—iterate all elements, call visit on each, and keep/return the last non-null content (so later writes override earlier ones); continue to use WRITE_TOOL_NAMES and extractContentFromArgs to validate and extract content from parsed args.
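The behavior the prompt describes can be sketched as a self-contained recursive visitor. This is a hypothetical illustration, not the adapter's actual code: `WRITE_TOOL_NAMES`, the field-fallback order, and the assumption that write payloads carry a string `content` field are all modeled on the review text rather than taken from the repository.

```typescript
// Illustrative set of tool names that count as file writes.
const WRITE_TOOL_NAMES = new Set(["write_file", "edit", "write"]);

// Recursively search a tool-call payload for write content, accepting
// several tool-name fields and keeping the LAST matching write in an
// array (later entries override earlier ones) instead of returning the
// first hit.
function extractLatestWrite(value: unknown, depth = 0): string | null {
  if (depth > 4 || value === null || value === undefined) return null;
  if (Array.isArray(value)) {
    let latest: string | null = null;
    for (const item of value) {
      const found = extractLatestWrite(item, depth + 1);
      if (found !== null) latest = found; // later entries win
    }
    return latest;
  }
  if (typeof value !== "object") return null;
  const record = value as Record<string, unknown>;
  const name = [record.name, record.toolName, record.tool, record.tool_name, record.command]
    .find((v): v is string => typeof v === "string");
  const args =
    record.arguments ?? record.args ?? record.parameters ?? record.input ?? record.raw;
  if (name && WRITE_TOOL_NAMES.has(name) && args && typeof args === "object") {
    const content = (args as Record<string, unknown>).content;
    if (typeof content === "string") return content;
  }
  // Recurse into nested containers (e.g., { toolCalls: [...] }).
  for (const nested of Object.values(record)) {
    const found = extractLatestWrite(nested, depth + 1);
    if (found !== null) return found;
  }
  return null;
}
```

With this shape, a batched `toolCalls` array where a later `edit` rewrites the file returns the later content, and payloads keyed by `tool` or `command` are no longer missed.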
🧹 Nitpick comments (10)
src/lib/stats-format.ts (1)
81-82: Add an invariant guard for inconsistent failure counters. If `generationFailureCount` ever exceeds `failed`, this currently degrades silently. That’s likely a programmer/data-contract mistake and should fail fast. As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks."

Suggested guard
```diff
 const generationFailureCount = stats.generationFailures?.total ?? 0;
+if (generationFailureCount > failed) {
+  throw new Error(
+    `Invalid stats: generationFailures.total (${generationFailureCount}) exceeds failed (${failed})`,
+  );
+}
 if (generationFailureCount > 0 || failed > generationFailureCount) {
   lines.push("  Failure breakdown:");
   for (const { type, count } of stats.generationFailures?.byType ?? []) {
     lines.push(`    ${type}: ${count}`);
   }
   const scoredRowFailures = failed - generationFailureCount;
   if (scoredRowFailures > 0) {
     lines.push(`    scored_row_failure: ${scoredRowFailures}`);
   }
 }
```

Also applies to: 87-90
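The invariant can be exercised in isolation. This standalone sketch assumes a simplified stats shape and hypothetical names; it is not the actual `stats-format.ts` code, only a demonstration of the fail-fast pattern the guard implements.

```typescript
// Simplified stand-in for the stats payload: generation failures must be
// a subset of total failures, so a larger count is a data-contract bug.
interface FailureStats {
  failed: number;
  generationFailures?: { total: number };
}

// Throws on inconsistent counters; otherwise returns the remainder
// (scored-row failures) that the breakdown would report.
function assertFailureCountsConsistent(stats: FailureStats): number {
  const generationFailureCount = stats.generationFailures?.total ?? 0;
  if (generationFailureCount > stats.failed) {
    throw new Error(
      `Invalid stats: generationFailures.total (${generationFailureCount}) exceeds failed (${stats.failed})`,
    );
  }
  return stats.failed - generationFailureCount;
}
```

Throwing here surfaces the bug at the producer instead of printing a misleading negative `scored_row_failure` count downstream.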
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/stats-format.ts` around lines 81 - 82, Add a fail-fast invariant check to detect inconsistent failure counters: after computing const generationFailureCount = stats.generationFailures?.total ?? 0 and before using it (e.g., where the code checks if (generationFailureCount > 0 || failed > generationFailureCount)), throw an Error if generationFailureCount > failed so the code fails fast on this programmer/data-contract violation; apply the same guard at the other similar spot around lines handling the alternative counters (the same check that spans the second use at the block covering the 87-90 logic) to ensure both places validate generationFailureCount against failed.

apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx (1)
24-31: Consider adding `as const` for readonly tuple type. Adding `as const` to `CARD_ACCENT_COLORS` would provide better type safety by creating a readonly tuple with literal types, preventing accidental mutations and improving type inference.

♻️ Optional refactor to add type safety
```diff
-const CARD_ACCENT_COLORS = [
+const CARD_ACCENT_COLORS = [
   "#34c759", // brand green — matched runs
   "#34c759", // brand green — profiles
   "hsl(212, 100%, 67%)", // info blue — deduped items
   "hsl(156, 67%, 55%)", // success green — pass rate
   "hsl(270, 60%, 60%)", // purple — frontier coverage
   "hsl(210, 85%, 60%)", // blue — median duration
-];
+] as const;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx` around lines 24 - 31, The constant CARD_ACCENT_COLORS should be made a readonly tuple to improve type safety; change its declaration to use "as const" (i.e., const CARD_ACCENT_COLORS = [ ... ] as const) so the array items become literal readonly types and cannot be mutated. Ensure any code that consumes CARD_ACCENT_COLORS still accepts readonly string literals (or update types to ReadonlyArray or typeof CARD_ACCENT_COLORS[number] where needed) so type checks continue to pass.

apps/dashboard/src/components/charts/model-comparison-chart.tsx (1)
37-39: Consider adding Zod validation at the component boundary. The coding guidelines specify that TypeScript files should "Validate at boundaries with Zod and pass typed data inward." Currently, the `items` prop is typed but not validated at runtime. Consider adding a Zod schema to validate the props at the component boundary.

🛡️ Example Zod validation pattern
Define a schema near your type definition and validate incoming props:
```diff
+import { z } from 'zod';
+
+// Define schema for runtime validation
+const MatrixItemResultSchema = z.array(
+  z.object({
+    // Define MatrixItemResult fields here
+  })
+);
+
 interface ModelComparisonChartProps {
   items: MatrixItemResult[];
 }

 export function ModelComparisonChart({ items }: ModelComparisonChartProps) {
+  // Validate at boundary
+  const validatedItems = MatrixItemResultSchema.parse(items);
+
   const allModels = useMemo(() => {
-    const groups = groupByModel(items);
+    const groups = groupByModel(validatedItems);
```

Alternatively, validate props at a higher level (e.g., API response) and pass validated data down.
As per coding guidelines: "Validate at boundaries with Zod and pass typed data inward" for `**/*.{ts,tsx}` files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 37-39: add runtime validation for the ModelComparisonChart component props by defining a Zod schema for the ModelComparisonChartProps (including the items array shape matching MatrixItemResult) near the type definition and call schema.parse (or safeParse and handle failure) at the component boundary inside ModelComparisonChart (or a thin wrapper) before using the data; on parse failure, return a fallback UI or throw/log an error so only validated, typed data is passed inward to the component logic.

apps/dashboard/src/components/charts/token-efficiency-chart.tsx (1)
103-105: Consider centralizing harness color lookup behind a helper.

The inline cast works, but moving it to a `readHarnessColor` helper (as done in the other chart) reduces repetition and keeps fallback logic in one place.

♻️ Suggested refactor
```diff
 const HARNESS_COLORS = {
   direct: "hsl(215, 70%, 62%)", // steel blue
   goose: "hsl(142, 60%, 49%)", // brand green
   opencode: "hsl(38, 80%, 58%)", // warm amber
 } as const;

+function readHarnessColor(harness: string): string {
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS]
+    : "hsl(210, 12%, 63%)";
+}
+
 const dataByHarness = harnesses.map((harness) => ({
   harness,
   points: data.filter((d) => d.harness === harness),
-  color:
-    HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] ||
-    "hsl(210, 12%, 63%)",
+  color: readHarnessColor(harness),
 }));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx` around lines 103-105: extract the inline HARNESS_COLORS lookup into a shared helper named readHarnessColor that accepts the harness key (preserve the existing type/cast) and returns HARNESS_COLORS[harness] || "hsl(210, 12%, 63%)"; then replace the inline expression in token-efficiency-chart (the color: HARNESS_COLORS[...] || "hsl(...)" usage) with a call to readHarnessColor(harness) so fallback logic is centralized and reusable across charts.

test/run-command.test.ts (1)
45-57: Assert `process.exit` arguments explicitly for stronger intent.

The thrown-message check works, but adding an explicit spy assertion makes the failure mode less brittle.
Suggested tightening
```diff
 ).rejects.toThrow("process.exit(1)");
+expect(exitSpy).toHaveBeenCalledWith(1);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/run-command.test.ts` around lines 45-57: the test currently only expects parseAsync to throw with "process.exit(1)"; instead add an explicit spy on process.exit to assert it was called with 1 for stronger intent: create a jest.spyOn(process, "exit").mockImplementation(() => { throw new Error("exit"); }) before calling runCommand.parseAsync (or mockImplementationOnce), then after the promise rejects assert that the spy was calledWith(1) and finally restore the spy; reference the test function and runCommand.parseAsync to locate where to add the spy and assertions.

src/harnesses/discovery.ts (1)
45-49: Consider lowering severity for expected OpenCode-unavailable probe failures.

When discovery runs on machines without OpenCode, this path is expected; logging it as `error` can produce noisy false alarms. `warn`/`debug` is usually a better fit unless the failure is truly unexpected.

Possible adjustment
```diff
-      logger.error(
+      logger.warn(
         { err: error, probe: "opencode", functionName: "isOpenCodeAvailable" },
         "OpenCode probe failed",
       );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/discovery.ts` around lines 45-49: the catch in isOpenCodeAvailable currently logs expected absence of OpenCode as an error; change the logger invocation to a lower severity (logger.warn or logger.debug) for expected probe failures by replacing logger.error(...) with logger.warn(...) or logger.debug(...), keeping the structured context ({ err: error, probe: "opencode", functionName: "isOpenCodeAvailable" }) and the message "OpenCode probe failed" so only truly unexpected failures are escalated elsewhere.

src/harnesses/goose-adapter.ts (1)
458-490: Use Execa's `timedOut` property instead of message matching for timeout detection.

The current approach of checking `error.message.includes("timed out")` is fragile. It can mislabel failures whose output happens to contain that phrase and couples timeout detection to Execa's message wording. Execa v9 exposes a dedicated `timedOut` boolean property on rejected errors; this is the official, reliable API for detecting timeouts.

Simplify the check to:
Suggested fix
```diff
-    const isTimeoutError =
-      error instanceof Error && error.message.includes("timed out");
+    const isTimeoutError = (error as any).timedOut === true;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/goose-adapter.ts` around lines 458-490: the timeout detection currently uses message.includes("timed out"); change it to rely on Execa's timedOut boolean: when you cast the error to the Execa-like shape in the block (the variable currently named execaError), check execaError.timedOut to set isTimeoutError (or replace the earlier isTimeoutError declaration with this check) and use that value where the code builds the timeout message (the throw that uses timeoutMs) and when constructing the Error with signalAssessment/output (where getTranscriptOrInputTaintReasons and appendSignalAssessmentReasons are used to compute errorReasons and effectiveOutput). Ensure the new check handles undefined typed errors safely (keep the existing guard that error is object and has stdout/stderr) and remove the fragile message.includes usage.

src/runtimes/ollama-runtime.ts (1)
73-96: Consider documenting the update process for these heuristics.

The `TEXT_GENERATION_ARCHITECTURES` and `TEXT_GENERATION_NAME_PATTERNS` sets are derived from observed metadata rather than a formal registry. When new model families appear in benchmark runs and fall through to `"unknown"`, maintainers need to know where to add them.

A brief inline comment pointing to documentation or a tracking issue would help future contributors extend these lists consistently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runtimes/ollama-runtime.ts` around lines 73-96: add an inline comment above the TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS declarations explaining that these heuristics are maintained manually from observed metadata, where to update them (e.g., link to the project's contributing doc or a specific tracking issue), and a short note on the criteria to add a new entry; reference the constants TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS so future contributors know exactly which lists to modify when new model families appear and fall through to "unknown".

src/schemas/plan.schema.ts (1)
1-7: Add the required `Invariants` section to the file header.

The header now includes Purpose/Exports, but it still omits the explicit Invariants block required for TS files.
♻️ Proposed header patch
```diff
 /**
  * Purpose: RunPlan schema capturing the expanded matrix before execution.
  * Exports: MatrixItemSchema, MatrixItem, ModelExclusionSchema,
  *          ModelExclusion, RunPlanSchema, RunPlan
+ *
+ * Invariants:
+ * - Plan payloads are validated via Zod before persistence/consumption.
+ * - `modelExclusions` is additive and optional for backward compatibility.
  *
  * The plan is written to results/<runId>/plan.json for reproducibility.
  */
```

As per coding guidelines, "`src/**/*.{ts,tsx}`: Every file must start with a descriptive header describing purpose, exports, and invariants".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/plan.schema.ts` around lines 1-7: the file header for the RunPlan schema is missing the required "Invariants" section; update the top-of-file header (near symbols MatrixItemSchema, MatrixItem, ModelExclusionSchema, ModelExclusion, RunPlanSchema, RunPlan) to include an "Invariants" block that documents any constraints or guarantees about the schema (e.g., immutability, required fields, allowed value ranges, serialization expectations) so the header includes Purpose, Exports, and Invariants per the TS file guideline.

src/harnesses/opencode-runner.ts (1)
156-167: Skip the fixed 2s delay once `SIGTERM` already reaped the child.

Every timeout/hang currently pays `FORCE_KILL_DELAY_MS` before returning, even when the process exits immediately after `SIGTERM`. Across a noisy matrix that adds avoidable wall-clock delay to failure handling. Race the delay against `proc` settling and only escalate to `SIGKILL` if the process is still alive.

♻️ Proposed refinement
```diff
 async function forceKillProcess(
   proc: ResultPromise,
   log: pino.Logger,
   reason: string,
 ): Promise<void> {
   const pid = proc.pid;
   log.warn({ pid, reason }, "Force killing OpenCode process");
   killOpenCodeProcess(proc, "SIGTERM", log);
-  await new Promise((resolve) => setTimeout(resolve, FORCE_KILL_DELAY_MS));
+  const exitedAfterSigterm = await Promise.race([
+    proc.then(
+      () => true,
+      () => true,
+    ),
+    new Promise<boolean>((resolve) =>
+      setTimeout(() => resolve(false), FORCE_KILL_DELAY_MS),
+    ),
+  ]);
+  if (exitedAfterSigterm) return;
   log.warn({ pid }, "Escalating OpenCode process kill");
   killOpenCodeProcess(proc, "SIGKILL", log);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-runner.ts` around lines 156 - 167, The forceKillProcess function always waits the full FORCE_KILL_DELAY_MS after sending SIGTERM even if proc (the ResultPromise) exits immediately; change it to race the delay against proc settling and only send the escalating SIGKILL if the proc is still alive after the delay. Specifically, in forceKillProcess, after calling killOpenCodeProcess(proc, "SIGTERM", log) await a Promise.race between a timeout Promise of FORCE_KILL_DELAY_MS and proc settling (or its completion/error), then check if the process is still running (using proc.pid or proc settling state) and only then log the escalation and call killOpenCodeProcess(proc, "SIGKILL", log).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 95-123: The code treats normalizedSelectedModels.length === 0 as
"all selected", which is wrong when selectedModels contains only stale ids;
update handleModelToggle to compute a true allSelected flag (e.g., const
allSelected = models.every(m => normalizedSelectedModels.includes(m))) and
replace the length check with that flag so toggling a model when all are
selected still returns models.filter(m => m !== model). Ensure you keep using
normalizedSelectedModels, models, onSelectionChange and the MODEL_SORT_COLLATOR
for the sorted results.
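The toggle fix described above can be sketched as a pure function. Names like `toggleModel` are illustrative, and the sorting via `MODEL_SORT_COLLATOR` is omitted to keep the sketch small; the real component works with `normalizedSelectedModels` and `onSelectionChange`:

```typescript
// Hypothetical extraction of the toggle logic described above. `models` is
// the full model list; `selected` is the stored (possibly stale) selection.
function toggleModel(
  models: string[],
  selected: string[],
  model: string,
): string[] {
  // Drop stale ids first, mirroring normalizedSelectedModels.
  const normalized = selected.filter((m) => models.includes(m));
  // Explicit "all selected" check instead of a length === 0 sentinel, so a
  // stale-only selection can no longer masquerade as "everything selected".
  const allSelected = models.every((m) => normalized.includes(m));
  if (allSelected) {
    return models.filter((m) => m !== model);
  }
  return normalized.includes(model)
    ? normalized.filter((m) => m !== model)
    : [...normalized, model];
}
```

With this shape, a selection containing only stale ids normalizes to an empty list and toggling simply selects the clicked model, rather than deselecting it from a phantom "all".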
In `@llm/implementation/harnesses-implementation.md`:
- Around line 259-263: The note currently implies that `permission` is the sole
gate for tool access but the generated provider model entry still depends on the
model-level `"tools": true`; update the text to explicitly state that tool
access requires both the top-level `permission` setting and a model-level
`"tools": true` flag (as used in the generated `models` provider entry), and
mention that `enabled_providers`, slash-safe transport keys for runtime model
IDs, and the denial of `external_directory` remain unchanged.
In `@src/cli/compare-utils.ts`:
- Around line 156-158: The truncate function must guard against non-positive
maxLen values: in truncate(str, maxLen) early-return an empty string when maxLen
<= 0, and when computing the slice use a non-negative end index (e.g. use
Math.max(0, maxLen - 1)) so slice never receives a negative end; keep the
existing behavior for positive maxLen (return str if str.length <= maxLen,
otherwise return the truncated slice plus the ellipsis).
In `@src/harnesses/direct-adapter.ts`:
- Around line 55-61: Add an explicit runtime-type guard before calling
generateOllama to preserve the Ollama-only invariant: check the runtime object
(e.g., runtime) to ensure it is an Ollama-compatible Runtime (inspect a unique
discriminator or property used to identify Ollama runtimes) and throw a clear
programmer/config error if it isn’t, instead of calling generateOllama with an
unsupported runtime; place this check immediately before the generateOllama(...)
invocation in the generate function so generateOllama, runtime.baseUrl, and
model are only used when the runtime is validated.
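A minimal sketch of such a guard, assuming the runtime type exposes a `name` discriminator (the actual discriminator in `src/runtimes/runtime.ts` may differ):

```typescript
// Sketch only: the real Runtime shape lives in src/runtimes/runtime.ts.
interface Runtime {
  name: string;
  baseUrl: string;
}

// Fail fast with a programmer/config error instead of letting generateOllama
// receive an unsupported runtime.
function assertOllamaRuntime(runtime: Runtime): void {
  if (runtime.name !== "ollama") {
    throw new Error(
      `direct adapter requires an Ollama runtime, got "${runtime.name}"`,
    );
  }
}
```

Placed immediately before the `generateOllama(...)` call, the guard turns a confusing downstream HTTP failure into an explicit invariant violation.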
In `@src/harnesses/opencode-adapter.ts`:
- Around line 353-374: The failure path currently uses rawOutput (computed via
selectProcessOutput(processResult.stdout, processResult.stderr)) which can drop
stderr when stdout is non-empty; instead, change the non-zero-exit branch to
build a combined structured failure payload that preserves both stdout and
stderr (e.g., construct an errorOutput object/string containing both
processResult.stdout and processResult.stderr or a merged preview) and pass that
into buildOpenCodeFailure and buildFailureSignalAssessment; keep
selectProcessOutput only for success/normal output handling or compute separate
successOutput vs failureOutput so buildOpenCodeFailure always receives both
stdout and stderr evidence (referencing selectProcessOutput,
buildOpenCodeFailure, buildFailureSignalAssessment, processResult, rawOutput,
stdoutPreview, stderrPreview).
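One way to sketch the combined payload the prompt asks for (the helper name and truncation length are assumptions, not the adapter's actual API):

```typescript
// Hypothetical helper illustrating the suggested failure payload: keep both
// streams instead of letting a non-empty stdout drop the stderr evidence.
function buildFailurePreview(
  stdout: string,
  stderr: string,
  maxLen = 2000,
): string {
  const parts: string[] = [];
  if (stdout.trim()) parts.push(`stdout:\n${stdout.slice(0, maxLen)}`);
  if (stderr.trim()) parts.push(`stderr:\n${stderr.slice(0, maxLen)}`);
  return parts.join("\n---\n") || "(no output)";
}
```

`selectProcessOutput` would stay on the success path; the non-zero-exit branch would feed this combined preview into `buildOpenCodeFailure` and `buildFailureSignalAssessment`.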
In `@src/harnesses/opencode-permissions.ts`:
- Around line 34-58: Add explicit `@throws` TSDoc entries for the exported
helpers: isOpenCodePermissionDeniedText, createOpenCodePermissionPolicy, and the
permission-detection function that follows (the one accepting texts). For each
function add a single `@throws` line describing that the function does not throw
under normal usage (e.g., "@throws Never throws; input is validated and errors
are not raised") or, if a function can throw on invalid input, document the
exact error/condition thrown; ensure the `@throws` text appears alongside the
existing `@param` and `@returns` tags in each exported function's comment block.
In `@src/runner/item-executor.ts`:
- Around line 172-175: The new call to getExecutableRuntimeName() (used when
constructing runtime via createRuntime()) can throw for unsupported runtime
labels, so update the TSDoc for the exported function executeItem to include an
`@throws` entry describing that executeItem will throw when the runtime label is
invalid/unsupported; locate the executeItem function declaration and add a
concise `@throws` line that references the runtime-name validation (e.g., "Throws
if getExecutableRuntimeName() rejects an unsupported runtime label") so the
public contract matches the implementation.
In `@src/runner/plan-builder.ts`:
- Around line 59-60: Update the TSDoc for buildRunPlan to accurately list all
failure modes it can throw: retain Ollama unreachable and no models/tests found,
and add missing model selectors errors, unavailable harness errors, "all models
excluded" (all-excluded model sets) errors, and zero matrix-expansion errors;
ensure the `@throws` section names each condition and the type/shape of error
thrown (e.g., Error with descriptive message or specific error class) so the
exported contract matches the actual runtime failure surface of buildRunPlan and
any internal helpers it calls.
---
Duplicate comments:
In `@apps/dashboard/src/components/about/about-content.ts`:
- Around line 99-101: The About copy exposes the internal key "passType" in the
name field; change the displayed label to the user-facing term (e.g., "prompt
mode") so UI language is consistent. Update the object entry where name:
"passType" appears in about-content.ts to use the human-friendly label string
and ensure any descriptive text still matches the two modes; leave internal
keys/logic untouched and only modify the displayed name.
In `@apps/dashboard/src/lib/types.ts`:
- Around line 286-295: Keep the legacy optional field in the PlanConfig
interface: add an optional vllmBaseUrl?: string property to the PlanConfig
declaration (alongside ollamaBaseUrl, generateTimeoutMs, etc.) so the
handwritten type matches the payload shape parsed by the schemas that still
accept config.vllmBaseUrl; leave it optional until an explicit, versioned
migration removes the legacy field.
In `@src/harnesses/opencode-events.ts`:
- Around line 122-188: The extractor currently only recognizes name/toolName and
returns the first hit for arrays, missing shapes like {tool: "...", input: ...}
or {command: "...", args: ...} and losing later writes in batched arrays; update
extractFromToolCallObject()/visit to also accept record.command and record.tool
as alternate tool-name fields (in addition to name/toolName), and to treat args
as record.arguments ?? record.args ?? record.parameters ?? record.input ??
record.raw; when value is an array, do not short-circuit on the first
match—iterate all elements, call visit on each, and keep/return the last
non-null content (so later writes override earlier ones); continue to use
WRITE_TOOL_NAMES and extractContentFromArgs to validate and extract content from
parsed args.
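The tolerant extractor this prompt describes might look like the sketch below; the real `extractFromToolCallObject`/`visit` and `WRITE_TOOL_NAMES` differ in detail, and the names here are simplified:

```typescript
// Assumed write-tool name set; the real WRITE_TOOL_NAMES is larger.
const WRITE_TOOL_NAMES = new Set(["write", "edit"]);

// Accept name/toolName/tool/command as the tool-name field, several aliases
// for args, and keep the LAST match in arrays (later writes win).
function extractWriteContent(value: unknown): string | null {
  if (Array.isArray(value)) {
    let last: string | null = null;
    for (const el of value) {
      const hit = extractWriteContent(el);
      if (hit !== null) last = hit; // do not short-circuit on the first hit
    }
    return last;
  }
  if (typeof value !== "object" || value === null) return null;
  const record = value as Record<string, unknown>;
  const name =
    record.name ?? record.toolName ?? record.tool ?? record.command;
  if (typeof name !== "string" || !WRITE_TOOL_NAMES.has(name)) return null;
  const args =
    record.arguments ??
    record.args ??
    record.parameters ??
    record.input ??
    record.raw;
  if (typeof args === "object" && args !== null) {
    const content = (args as Record<string, unknown>).content;
    if (typeof content === "string") return content;
  }
  return null;
}
```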
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 37-39: Add runtime validation for the ModelComparisonChart
component props by defining a Zod schema for the ModelComparisonChartProps
(including the items array shape matching MatrixItemResult) near the type
definition and call schema.parse (or safeParse and handle failure) at the
component boundary inside ModelComparisonChart (or a thin wrapper) before using
the data; on parse failure, return a fallback UI or throw/log an error so only
validated, typed data is passed inward to the component logic.
In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx`:
- Around line 103-105: Extract the inline HARNESS_COLORS lookup into a shared
helper named readHarnessColor that accepts the harness key (preserve the
existing type/cast) and returns HARNESS_COLORS[harness] || "hsl(210, 12%, 63%)";
then replace the inline expression in token-efficiency-chart (the color:
HARNESS_COLORS[...] || "hsl(...)" usage) with a call to
readHarnessColor(harness) so fallback logic is centralized and reusable across
charts.
In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`:
- Around line 24-31: The constant CARD_ACCENT_COLORS should be made a readonly
tuple to improve type safety; change its declaration to use "as const" (i.e.,
const CARD_ACCENT_COLORS = [ ... ] as const) so the array items become literal
readonly types and cannot be mutated. Ensure any code that consumes
CARD_ACCENT_COLORS still accepts readonly string literals (or update types to
ReadonlyArray or typeof CARD_ACCENT_COLORS[number] where needed) so type checks
continue to pass.
In `@src/harnesses/discovery.ts`:
- Around line 45-49: The catch in isOpenCodeAvailable currently logs expected
absence of OpenCode as an error; change the logger invocation to a lower
severity (logger.warn or logger.debug) for expected probe failures by replacing
logger.error(...) with logger.warn(...) or logger.debug(...), keeping the
structured context ({ err: error, probe: "opencode", functionName:
"isOpenCodeAvailable" }) and the message "OpenCode probe failed" so only truly
unexpected failures are escalated elsewhere.
In `@src/harnesses/goose-adapter.ts`:
- Around line 458-490: The timeout detection currently uses
message.includes("timed out"); change it to rely on Execa's timedOut boolean:
when you cast the error to the Execa-like shape in the block (the variable
currently named execaError), check execaError.timedOut to set isTimeoutError (or
replace the earlier isTimeoutError declaration with this check) and use that
value where the code builds the timeout message (the throw that uses timeoutMs)
and when constructing the Error with signalAssessment/output (where
getTranscriptOrInputTaintReasons and appendSignalAssessmentReasons are used to
compute errorReasons and effectiveOutput). Ensure the new check handles
undefined typed errors safely (keep the existing guard that error is object and
has stdout/stderr) and remove the fragile message.includes usage.
In `@src/harnesses/opencode-runner.ts`:
- Around line 156-167: The forceKillProcess function always waits the full
FORCE_KILL_DELAY_MS after sending SIGTERM even if proc (the ResultPromise) exits
immediately; change it to race the delay against proc settling and only send the
escalating SIGKILL if the proc is still alive after the delay. Specifically, in
forceKillProcess, after calling killOpenCodeProcess(proc, "SIGTERM", log) await
a Promise.race between a timeout Promise of FORCE_KILL_DELAY_MS and proc
settling (or its completion/error), then check if the process is still running
(using proc.pid or proc settling state) and only then log the escalation and
call killOpenCodeProcess(proc, "SIGKILL", log).
In `@src/lib/stats-format.ts`:
- Around line 81-82: Add a fail-fast invariant check to detect inconsistent
failure counters: after computing const generationFailureCount =
stats.generationFailures?.total ?? 0 and before using it (e.g., where the code
checks if (generationFailureCount > 0 || failed > generationFailureCount)),
throw an Error if generationFailureCount > failed so the code fails fast on this
programmer/data-contract violation; apply the same guard at the other similar
spot around lines handling the alternative counters (the same check that spans
the second use at the block covering the 87-90 logic) to ensure both places
validate generationFailureCount against failed.
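A sketch of the suggested guard (field names follow the prompt's wording; the real `stats` shape in `src/lib/stats-format.ts` may differ):

```typescript
// Illustrative stats shape for the fail-fast check described above.
interface GenerationStats {
  failed: number;
  generationFailures?: { total: number };
}

// Generation failures are a subset of failed rows; a larger count means the
// counters disagree, which is a programmer/data-contract violation.
function assertFailureCountersConsistent(stats: GenerationStats): number {
  const generationFailureCount = stats.generationFailures?.total ?? 0;
  if (generationFailureCount > stats.failed) {
    throw new Error(
      `generation failures (${generationFailureCount}) exceed failed rows (${stats.failed})`,
    );
  }
  return generationFailureCount;
}
```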
In `@src/runtimes/ollama-runtime.ts`:
- Around line 73-96: Add an inline comment above the
TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS declarations
explaining that these heuristics are maintained manually from observed metadata,
where to update them (e.g., link to the project's contributing doc or a specific
tracking issue), and a short note on the criteria to add a new entry; reference
the constants TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS so
future contributors know exactly which lists to modify when new model families
appear and fall through to "unknown".
In `@src/schemas/plan.schema.ts`:
- Around line 1-7: The file header for the RunPlan schema is missing the
required "Invariants" section; update the top-of-file header (near symbols
MatrixItemSchema, MatrixItem, ModelExclusionSchema, ModelExclusion,
RunPlanSchema, RunPlan) to include an "Invariants" block that documents any
constraints or guarantees about the schema (e.g., immutability, required fields,
allowed value ranges, serialization expectations) so the header includes
Purpose, Exports, and Invariants per the TS file guideline.
In `@test/run-command.test.ts`:
- Around line 45-57: The test currently only expects parseAsync to throw with
"process.exit(1)"; instead add an explicit spy on process.exit to assert it was
called with 1 for stronger intent: create a jest.spyOn(process,
"exit").mockImplementation(() => { throw new Error("exit"); }) before calling
runCommand.parseAsync (or mockImplementationOnce), then after the promise
rejects assert that the spy was calledWith(1) and finally restore the spy;
reference the test function and runCommand.parseAsync to locate where to add the
spy and assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 57a3450f-5472-46e1-ab1d-b63d3cf119f4
📒 Files selected for processing (157)
README.md, apps/dashboard/scripts/build-index.ts, apps/dashboard/src/components/about/about-content.ts, apps/dashboard/src/components/about/about-page.tsx, apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx, apps/dashboard/src/components/charts/chart-primitives.tsx, apps/dashboard/src/components/charts/composite-score-chart.tsx, apps/dashboard/src/components/charts/frontier-eval-scatter.tsx, apps/dashboard/src/components/charts/model-comparison-chart.tsx, apps/dashboard/src/components/charts/model-radar-chart.tsx, apps/dashboard/src/components/charts/model-test-heatmap.tsx, apps/dashboard/src/components/charts/test-difficulty-chart.tsx, apps/dashboard/src/components/charts/test-type-split-chart.tsx, apps/dashboard/src/components/charts/token-efficiency-chart.tsx, apps/dashboard/src/components/layout/header.tsx, apps/dashboard/src/components/leaderboard/leaderboard-filters.ts, apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx, apps/dashboard/src/components/leaderboard/leaderboard-page.tsx, apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx, apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx, apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx, apps/dashboard/src/components/run-list/run-list-page.tsx, apps/dashboard/src/components/ui/section-heading.tsx, apps/dashboard/src/index.css, apps/dashboard/src/lib/aggregations-charts.ts, apps/dashboard/src/lib/aggregations-compare.ts, apps/dashboard/src/lib/aggregations-diagnostics.ts, apps/dashboard/src/lib/aggregations-test-types.ts, apps/dashboard/src/lib/aggregations-tooling.ts, apps/dashboard/src/lib/aggregations.ts, apps/dashboard/src/lib/api.ts, apps/dashboard/src/lib/schemas.ts, apps/dashboard/src/lib/types.ts, llm/context/codebase-overview.md, llm/implementation/computer-use-hardening.md, llm/implementation/harnesses-implementation.md, llm/implementation/multi-runtime-mvp-implementation.md, llm/project/project-overview.md, llm/project/project-rules.md, llm/project/user-flow.md, package.json, src/cli/compare-command.ts, src/cli/compare-formatters.ts, src/cli/compare-utils.ts, src/cli/migrate-machine-command.ts, src/cli/run-command.ts, src/harnesses/direct-adapter.ts, src/harnesses/discovery.ts, src/harnesses/goose-adapter.ts, src/harnesses/goose-openai.ts, src/harnesses/goose-output.ts, src/harnesses/harness.ts, src/harnesses/opencode-adapter.ts, src/harnesses/opencode-artifacts.ts, src/harnesses/opencode-cli.ts, src/harnesses/opencode-config.ts, src/harnesses/opencode-events.ts, src/harnesses/opencode-model.ts, src/harnesses/opencode-output.ts, src/harnesses/opencode-permissions.ts, src/harnesses/opencode-process.ts, src/harnesses/opencode-provider.ts, src/harnesses/opencode-runner.ts, src/harnesses/tool-prompt.ts, src/lib/benchmark-checkpoint.ts, src/lib/failure-classifier.ts, src/lib/hardware-profile.ts, src/lib/logger.ts, src/lib/machine-profile/instance-id.ts, src/lib/machine-profile/legacy.ts, src/lib/machine-profile/normalization.ts, src/lib/machine-profile/probe-linux.ts, src/lib/machine-profile/probe-macos.ts, src/lib/machine-profile/probe-utils.ts, src/lib/machine-profile/probe-windows.ts, src/lib/model-aliases.ts, src/lib/model-profile/normalization.ts, src/lib/model-profile/registry-compat.ts, src/lib/model-profile/registry.ts, src/lib/openai-compat-client.ts, src/lib/signal-assessment.ts, src/lib/stats-format.ts, src/lib/test-catalog.ts, src/runner/index.ts, src/runner/item-executor.ts, src/runner/model-availability.ts, src/runner/model-eligibility.ts, src/runner/plan-builder.ts, src/runner/run-progress.ts, src/runtimes/discovery.ts, src/runtimes/index.ts, src/runtimes/ollama-residency.ts, src/runtimes/ollama-runtime.ts, src/runtimes/runtime.ts, src/runtimes/vllm-runtime.ts, src/schemas/common.schema.ts, src/schemas/config.schema.ts, src/schemas/index.ts, src/schemas/model-alias.schema.ts, src/schemas/model-profile.schema.ts, src/schemas/plan.schema.ts, src/schemas/result.schema.ts, src/schemas/test-catalog.schema.ts, src/tests/calculator-basic/test.meta.json, src/tests/calculator-stateful/test.meta.json, src/tests/event-emitter/test.meta.json, src/tests/rate-limiter/test.meta.json, src/tests/smoke/test.meta.json, src/tests/todo-app/test.meta.json, src/tests/ttl-cache/test.meta.json, src/tests/workspace-smoke/fixtures/logs/.gitkeep, src/tests/workspace-smoke/prompt.blind.md, src/tests/workspace-smoke/prompt.informed.md, src/tests/workspace-smoke/test.meta.json, test/aggregate.test.ts, test/benchmark-checkpoint.test.ts, test/build-index.test.ts, test/compare-checkpoint-guard.test.ts, test/compare-formatters.test.ts, test/dashboard-schema-compat.test.ts, test/dashboard-test-type.test.ts, test/failure-classifier.test.ts, test/generation-retry.test.ts, test/goose-adapter.test.ts, test/goose-output.test.ts, test/hardware-profile.test.ts, test/harness-compatibility.test.ts, test/item-executor.test.ts, test/item-retry.test.ts, test/machine-profile-migration.test.ts, test/model-alias-schema.test.ts, test/model-availability.test.ts, test/model-profile-registry.test.ts, test/ollama-adapter.test.ts, test/ollama-residency.test.ts, test/opencode-adapter.test.ts, test/opencode-artifacts.test.ts, test/opencode-cli-cache.test.ts, test/opencode-cli.test.ts, test/opencode-config.test.ts, test/opencode-model.test.ts, test/opencode-output.test.ts, test/opencode-runner-race.test.ts, test/opencode-runner.test.ts, test/plan-builder.test.ts, test/run-command.test.ts, test/runner-residency.test.ts, test/schema-regressions.test.ts, test/schemas.test.ts, test/signal-assessment.test.ts, test/stats-semantics.test.ts, test/test-catalog.test.ts, test/tool-prompt.test.ts, test/utils/fixtures.ts, test/vllm-runtime.test.ts, test/workspace-capability-parity.test.ts, test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- src/harnesses/goose-openai.ts
- test/benchmark-checkpoint.test.ts
- package.json
- src/harnesses/opencode-model.ts
- test/vllm-runtime.test.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-output.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
- src/runtimes/discovery.ts
@CodeRabbit full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 9
♻️ Duplicate comments (4)
apps/dashboard/src/components/about/about-content.ts (1)
56-57: ⚠️ Potential issue | 🟡 Minor

Use one canonical axis label across About copy (`passType` vs `prompt mode`).

The matrix fact string and benchmark-dimension label currently use different terms for the same dimension, which creates avoidable UI inconsistency.
Also applies to: 99-102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/about/about-content.ts` around lines 56-57: the About copy uses two different labels for the same axis ("passType" vs "prompt mode"); choose one canonical label (e.g., "prompt mode") and update all occurrences so they match: replace the string value that currently reads "runtime x harness x model x test x passType" and the benchmark-dimension label entries referenced around the other block (the entries containing "passType" at the other occurrence) to use the chosen canonical term, ensuring the keys/labels in the about-content data structure remain consistent across value and detail fields.

src/cli/compare-utils.ts (1)
156-159: ⚠️ Potential issue | 🟡 Minor

Fail fast on invalid `truncate()` widths.

Returning `""` for `maxLen <= 0` still turns a caller bug into display output, which makes broken column calculations harder to notice in the compare CLI. Throw here instead of silently degrading.

♻️ Proposed fix
```diff
 export function truncate(str: string, maxLen: number): string {
-  if (maxLen <= 0) return "";
+  if (maxLen <= 0) {
+    throw new RangeError(`truncate maxLen must be >= 1: ${maxLen}`);
+  }
   if (str.length <= maxLen) return str;
   return `${str.slice(0, Math.max(0, maxLen - 1))}…`;
 }
```

As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-utils.ts` around lines 156-159: the truncate function currently returns an empty string when maxLen <= 0, masking caller bugs; update the truncate(str: string, maxLen: number): string implementation to validate maxLen and throw a descriptive error (e.g., RangeError or TypeError) when maxLen is not a positive integer instead of returning ""; keep existing behavior for valid positive maxLen and unchanged slicing logic so callers get a hard failure on invalid widths.

src/harnesses/opencode-runner.ts (1)
156-178: ⚠️ Potential issue | 🟠 Major

Wait for the escalated kill path to settle before returning.

`forceKillProcess()` now waits through the `SIGTERM` grace period, but the `SIGKILL` branch still returns immediately after signaling. The timeout/stale-output paths can therefore reject and trigger workspace cleanup while `opencode` is still exiting.

🔧 Proposed fix

```diff
 async function forceKillProcess(
   proc: ResultPromise,
   log: pino.Logger,
   reason: string,
 ): Promise<void> {
   const pid = proc.pid;
   log.warn({ pid, reason }, "Force killing OpenCode process");
   killOpenCodeProcess(proc, "SIGTERM", log);
   const outcome = await Promise.race([
     new Promise<"delay">((resolve) =>
       setTimeout(() => resolve("delay"), FORCE_KILL_DELAY_MS),
     ),
     proc.then(
       () => "settled" as const,
       () => "settled" as const,
     ),
   ]);
   if (outcome === "settled") {
     return;
   }
   log.warn({ pid }, "Escalating OpenCode process kill");
   killOpenCodeProcess(proc, "SIGKILL", log);
+  await Promise.race([
+    proc.then(
+      () => undefined,
+      () => undefined,
+    ),
+    new Promise<void>((resolve) =>
+      setTimeout(resolve, FORCE_KILL_DELAY_MS),
+    ),
+  ]);
 }
```
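The race-then-escalate shape can be shown in isolation, without the execa/pino machinery. The helper name here is illustrative, not from the codebase; the point is that a rejection still counts as the process having exited:

```typescript
// Illustrative helper: resolve "settled" if the promise settles (fulfilled
// or rejected) within graceMs, otherwise resolve "timeout" so the caller
// can escalate. Mirrors the Promise.race shape in the proposed fix.
async function settleOrTimeout(
  p: Promise<unknown>,
  graceMs: number,
): Promise<"settled" | "timeout"> {
  return Promise.race([
    p.then(
      () => "settled" as const,
      () => "settled" as const, // a rejection still means the process exited
    ),
    new Promise<"timeout">((resolve) =>
      setTimeout(() => resolve("timeout"), graceMs),
    ),
  ]);
}

// A process promise that rejects quickly still reports "settled", so the
// caller does not escalate against an already-dead process.
const fakeProc = new Promise((_, reject) =>
  setTimeout(() => reject(new Error("killed")), 10),
);
const outcome = await settleOrTimeout(fakeProc, 200);
console.log(outcome); // "settled"
```

Attaching both fulfillment and rejection handlers inside the race also prevents an unhandled-rejection warning when the child exits with an error.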
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-runner.ts` around lines 156 - 178, The forceKillProcess implementation sends SIGTERM then waits, but after escalating to SIGKILL it returns immediately; update forceKillProcess (and reuse killOpenCodeProcess and FORCE_KILL_DELAY_MS) so that after calling killOpenCodeProcess(proc, "SIGKILL", log) you wait for the proc to settle (proc.then resolves/rejects) or for a second timeout (use Promise.race like the initial grace wait) before returning, and log appropriately (e.g., warn before escalating and optionally debug when settled) so callers cannot proceed while opencode is still exiting.

src/lib/signal-assessment.ts (1)
349-379: ⚠️ Potential issue | 🟠 Major

Don’t default failed-row output to `"artifact"` here.

`agent_requested_input` only matches harness output. With the current default, any caller that forgets to pass `outputSource` silently loses that taint reason and can over-count trustworthy rows.

💡 Minimal fix

```diff
 reasons.push(
   ...getTranscriptOrInputTaintReasons(input.output, {
-    source: input.outputSource ?? "artifact",
+    source: input.outputSource ?? "harness",
   }),
 );
```
Verify each finding against the current code and only fix it if needed. In `@src/lib/signal-assessment.ts` around lines 349 - 379, The code in finalizeItemSignalAssessment is defaulting outputSource to "artifact" when calling getTranscriptOrInputTaintReasons (source: input.outputSource ?? "artifact"), which hides harness-only taint reasons like "agent_requested_input"; change the call to pass input.outputSource directly (i.e., source: input.outputSource) or omit the source option when undefined so harness-specific taints are preserved, ensuring callers that forget to supply outputSource don't silently lose taint reasons.
🧹 Nitpick comments (7)
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx (1)
40-54: Consider adding TSDoc for consistency.

While not required by coding guidelines (since this function is not exported), adding TSDoc documentation would create consistency with `getRunMachineBadgeLabel` and improve maintainability.

📝 Suggested documentation

```diff
+/**
+ * Determines the aggregate badge label and variant for a run card.
+ *
+ * @param run - Published run summary item
+ * @param latestCheckpointId - Current checkpoint ID powering the leaderboard
+ * @returns Badge configuration with label and variant
+ */
 function getRunAggregateBadge(
   run: RunListItem,
   latestCheckpointId: string | null,
 ): { label: string; variant: "secondary" | "success" | "warning" } {
```
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx` around lines 40 - 54, Add a short TSDoc comment for the internal helper getRunAggregateBadge to match style used by getRunMachineBadgeLabel: describe the function's purpose (computes label and variant for a run based on legacy flag, checkpoint presence, and latestCheckpointId), document parameters (run: RunListItem, latestCheckpointId: string | null) and the returned shape ({ label: string; variant: "secondary" | "success" | "warning" }), and place it immediately above the getRunAggregateBadge declaration to maintain consistency and improve maintainability.

apps/dashboard/src/components/charts/token-efficiency-chart.tsx (1)
38-42: Use shared muted color constant instead of hardcoded HSL fallback.

This avoids palette drift and keeps chart theming consistent across components.
Proposed refactor
```diff
 import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card";
 import { WithInfoTooltip } from "@/components/ui/info-tooltip";
+import { CHART_COLORS } from "@/lib/chart-colors";
 import { computeTokenEfficiencyData } from "@/lib/aggregations";
 import { tokenEfficiency as tokenTooltips } from "@/lib/tooltip-content";
 import type { MatrixItemResult } from "@/lib/types";
 @@
 function readHarnessColor(harness: string): string {
-  return (
-    HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] ||
-    "hsl(210, 12%, 63%)"
-  );
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS]
+    : CHART_COLORS.muted;
 }
```
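The own-property check in the refactor above matters for more than style: on a plain record, `palette[key] || fallback` can leak `Object.prototype` members for keys like `"constructor"`. A self-contained sketch with a hypothetical palette (the real `HARNESS_COLORS`/`CHART_COLORS` live elsewhere):

```typescript
// Hypothetical palette; the real HARNESS_COLORS and CHART_COLORS constants
// are defined elsewhere in the dashboard.
const PALETTE: Record<string, string> = {
  opencode: "hsl(220, 70%, 55%)",
  goose: "hsl(140, 60%, 45%)",
};
const MUTED = "hsl(210, 12%, 63%)";

// `PALETTE[key] || MUTED` would return a truthy prototype member (a
// function) for keys like "constructor"; an own-property check keeps
// every unknown key on the muted fallback.
function readColor(key: string): string {
  return Object.prototype.hasOwnProperty.call(PALETTE, key)
    ? PALETTE[key]
    : MUTED;
}
```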
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx` around lines 38 - 42, The readHarnessColor function currently falls back to a hardcoded HSL string; replace that literal with the app's shared muted color constant by importing the shared constant (e.g., MUTED_COLOR or the project's theme.muted color) and use it as the fallback in HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] || MUTED_COLOR; update the import at the top of the file accordingly so token-efficiency-chart.tsx uses the centralized muted color instead of "hsl(210, 12%, 63%)".

test/aggregate.test.ts (2)
111-157: Consider enhancing the JSDoc for test helpers.

Similar to `createItem`, adding comprehensive parameter and return documentation would improve maintainability.

📝 Suggested JSDoc enhancement

```diff
 /**
  * Creates a run result for aggregation tests.
+ * @param runId - Unique identifier for the run
+ * @param checkpointId - Checkpoint identifier this run targets
+ * @param machineProfileKey - Machine profile key for this run
+ * @param instanceId - Instance identifier for the machine
+ * @param items - Array of matrix item results
+ * @param profile - Optional overrides for machine profile properties
+ * @returns A complete RunResult for testing
  */
 function createRun(
```
Verify each finding against the current code and only fix it if needed. In `@test/aggregate.test.ts` around lines 111 - 157, Add comprehensive JSDoc to the createRun function: document each parameter (runId, checkpointId, machineProfileKey, instanceId, items, profile) with types and purpose, note optional/override fields on profile (RunProfileOverrides), describe returned type (RunResult) and what the returned object represents (summary, machine, benchmarkCheckpoint, provenance, timestamps, items), and include any defaults used (e.g., TEST_PROFILE_LABEL, TEST_NORMALIZED_PROFILE, TEST_HARDWARE, default startedAt/completedAt and durationMs) so future readers understand inputs and produced structure.
79-109: Consider enhancing the JSDoc for test helpers.

While not exported, adding `@param` and `@returns` documentation would improve maintainability and make the helper function's contract clearer for future test authors.

📝 Suggested JSDoc enhancement

```diff
 /**
  * Creates a matrix item for aggregation tests.
+ * @param id - Unique identifier for the matrix item
+ * @param completedAt - ISO timestamp when the item completed
+ * @param overrides - Optional partial overrides for the item result
+ * @returns A complete MatrixItemResult for testing
  */
 function createItem(
```
Verify each finding against the current code and only fix it if needed. In `@test/aggregate.test.ts` around lines 79 - 109, Update the JSDoc for the test helper function createItem to include `@param` tags for id (string), completedAt (string), and overrides (Partial<MatrixItemResult> optional), and add an `@returns` tag describing that it returns a MatrixItemResult; locate the JSDoc above the createItem function and add concise descriptions of each parameter and the return value to clarify the helper's contract for future test authors.

test/ollama-adapter.test.ts (1)
259-274: Optional: tighten the test title for precision.

The case validates an Ollama-named runtime with a non-Ollama API format; a more specific title would reduce ambiguity.
Possible title tweak
```diff
-  it("throws when direct adapter receives a non-Ollama runtime", async () => {
+  it("throws when direct adapter receives a non-Ollama API format", async () => {
```
Verify each finding against the current code and only fix it if needed. In `@test/ollama-adapter.test.ts` around lines 259 - 274, Rename the test title to be more specific: update the it(...) description for the test using createDirectAdapter and adapter.generate to something like "throws when runtime is named 'ollama' but apiFormat is not 'ollama' (non-Ollama apiFormat)" so it clearly communicates that the runtime name is "ollama" while apiFormat is "openai-compatible" on the mockRuntime; leave the test body (createDirectAdapter, adapter.generate, mockRuntime, timeoutMs) unchanged.

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (2)
183-191: "Select all" button skips sorting.

Other callbacks (lines 112, 121, 128) sort via `MODEL_SORT_COLLATOR`, but line 188 passes `[...models]` unsorted. If the parent doesn't guarantee `models` is sorted the same way, the output order may differ from other selection paths.

Consistency fix

```diff
 <Button
   type="button"
   variant="ghost"
   size="sm"
   className="h-7 px-2"
-  onClick={() => onSelectionChange([...models])}
+  onClick={() =>
+    onSelectionChange(
+      [...models].sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
+    )
+  }
 >
   Select all
 </Button>
```
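The `[...sorted copy].sort(collator.compare)` shape can be tried standalone. The exact options behind `MODEL_SORT_COLLATOR` are not shown in this diff, so the collator below is an assumption; a numeric-aware collator is a common choice for model IDs with embedded sizes:

```typescript
// Assumed collator options; the real MODEL_SORT_COLLATOR is defined
// elsewhere in the dashboard.
const collator = new Intl.Collator("en", { numeric: true, sensitivity: "base" });

const models = ["llama3.1-70b", "llama3.1-8b", "Gemma-2b"];
// Sorted copy, matching the `[...models].sort(...)` shape in the fix:
// the original array is left untouched.
const sorted = [...models].sort((a, b) => collator.compare(a, b));
console.log(sorted); // ["Gemma-2b", "llama3.1-8b", "llama3.1-70b"]
```

With `numeric: true`, `8b` sorts before `70b`, which a plain lexicographic sort would reverse.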
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 183 - 191, The "Select all" Button's onClick currently calls onSelectionChange([...models]) without sorting, which can produce a different order than other selection paths; update the onClick handler to sort models with MODEL_SORT_COLLATOR before invoking onSelectionChange (e.g., create a sorted copy via [...models].sort(MODEL_SORT_COLLATOR) and pass that to onSelectionChange) so that the selection order matches the other callbacks that use MODEL_SORT_COLLATOR.
196-197: Checkbox visual state uses raw `selectedModels`.

The `isSelected` check uses `selectedModels.includes(model)` rather than the normalized set. If `selectedModels` contains only stale IDs, all checkboxes appear unchecked while the label (after applying the suggested fix above) would show "All models". This creates a visual mismatch.

Consider normalizing here as well for consistency:

Potential fix (requires lifting normalization)

```diff
+  const modelSet = useMemo(() => new Set(models), [models]);
+  const normalizedSelectedModels = useMemo(
+    () => selectedModels.filter((s) => modelSet.has(s)),
+    [selectedModels, modelSet],
+  );
+  const effectivelyAllSelected =
+    selectedModels.length === 0 || normalizedSelectedModels.length === 0;
 ...
 {models.map((model) => {
   const isSelected =
-    selectedModels.length === 0 || selectedModels.includes(model);
+    effectivelyAllSelected || normalizedSelectedModels.includes(model);
```
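The normalization itself is framework-free and easy to check in isolation. A sketch of the idea (helper name is illustrative; in the component it would live inside `useMemo` as above):

```typescript
// Drop selections that no longer exist in the current model list; an
// empty normalized selection is treated the same as "all models selected".
function normalizeSelection(
  selected: readonly string[],
  available: readonly string[],
): { normalized: string[]; allSelected: boolean } {
  const availableSet = new Set(available);
  const normalized = selected.filter((id) => availableSet.has(id));
  return { normalized, allSelected: normalized.length === 0 };
}
```

Feeding both the trigger label and the checkbox state from the same normalized result removes the mismatch described above.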
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 196 - 197, The checkbox visual state is using raw selectedModels (variable selectedModels) which may contain stale IDs; update the isSelected computation in the Leaderboard ModelFilterDropdown component so it checks membership against the same normalized set used for labels (e.g., the normalizedSelectedIds or selectedModelIdsNormalized you create/lift) instead of raw selectedModels, i.e., compute a normalized set of model IDs once (or lift the existing normalization) and change the isSelected logic (the isSelected variable that currently does selectedModels.includes(model)) to use that normalized set while preserving the existing "all models" empty-set behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 92-93: Replace the direct call to
ModelComparisonChartPropsSchema.parse inside the ModelComparisonChart render
with a safeParse flow: call ModelComparisonChartPropsSchema.safeParse(props),
check result.success and if false render a graceful fallback UI (e.g., error
placeholder or null) and optionally log the validation error; update variable
usage to use result.data when success is true. Alternatively, if you
intentionally want parse to throw, add full JSDoc/TSDoc to the exported
ModelComparisonChart (including `@throws`) to satisfy repo standards. Ensure
references to ModelComparisonChart and ModelComparisonChartPropsSchema are
updated accordingly.
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 29-47: The buildTriggerLabel function currently returns "0 models
selected" when selectedModels contains only stale IDs; update the logic in
buildTriggerLabel (which uses selectedModels, currentModels,
selectedCurrentModels, and hasAllCurrentModels) so that if
selectedCurrentModels.length === 0 it treats this the same as an empty selection
and returns "All models" (i.e., short-circuit to the "All models" branch when no
valid current selections remain), while preserving the existing behavior for
single and multiple valid selections.
In `@README.md`:
- Around line 142-143: The README example for "bun pb" understates what runs by
default; update the "bun pb" example text to explicitly state that the command
auto-discovers runtimes, models and harnesses (e.g., "Auto-discovers
runtimes/models/harnesses by default") so readers understand the default
behavior when invoking bun pb; locate the current example referencing "bun pb"
and expand its description to mention auto-discovery of available harnesses,
models and runtimes.
In `@src/harnesses/opencode-events.ts`:
- Around line 98-110: The function extractContentFromArgs currently treats
args.file as content which can return a filename instead of actual emitted code;
update extractContentFromArgs to stop using args.file as a fallback (keep
args.file_text and args.fileText which are actual content fields) so only real
content fields (content, contents, text, code, file_text, fileText) are
considered; locate the extractContentFromArgs function and remove args.file from
the chain of fallbacks, leaving the rest of the logic intact so filenames are no
longer returned as content.
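The fallback-chain change above can be sketched directly. This is a simplified stand-in, not the real `extractContentFromArgs` signature; the field list is taken from the prompt, and `args.file` is deliberately absent:

```typescript
// Only fields that hold actual emitted text are considered; `args.file`
// (a filename) is intentionally not in the candidate list.
function extractContentFromArgs(
  args: Record<string, unknown>,
): string | undefined {
  const candidates = [
    args.content,
    args.contents,
    args.text,
    args.code,
    args.file_text,
    args.fileText,
  ];
  for (const value of candidates) {
    if (typeof value === "string" && value.length > 0) return value;
  }
  return undefined;
}
```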
In `@src/lib/stats-format.ts`:
- Around line 81-85: Update the JSDoc for the exported function formatRunStats
to include a `@throws` tag describing the Error it can throw when failure
accounting is invalid (specifically when stats.generationFailures?.total exceeds
the computed failed count); locate the JSDoc above the formatRunStats function
and add a short sentence for the throw (e.g. "Throws Error if generation failure
count is greater than failed item count") and include when this happens and the
error type.
In `@src/runner/run-progress.ts`:
- Around line 113-115: The current readErrorMessage(error: unknown) may throw
when calling String(error) for objects with malicious/buggy toString/valueOf;
update readErrorMessage to wrap the coercion in a try/catch: if error is an
Error return error.message, else attempt String(error) inside try and return
that, and on failure fall back to a safe alternative such as
Object.prototype.toString.call(error) (also guarded) and finally return a fixed
sentinel like "<unserializable error>" if even that fails; apply these changes
inside the readErrorMessage function to ensure failure-path serialization cannot
throw.
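The layered fallback the prompt describes can be written as a small total function. A sketch under the names from the prompt (the exact sentinel string is the prompt's suggestion, not confirmed code):

```typescript
// Failure-path serialization must never throw, even for objects with a
// hostile toString/valueOf.
function readErrorMessage(error: unknown): string {
  if (error instanceof Error) return error.message;
  try {
    return String(error); // may invoke a user-defined toString that throws
  } catch {
    try {
      // Does not call the value's own toString, so it cannot be hijacked.
      return Object.prototype.toString.call(error);
    } catch {
      return "<unserializable error>";
    }
  }
}
```

`Object.prototype.toString.call` only consults `Symbol.toStringTag`, so a throwing `toString` on the object itself cannot reach it.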
In `@src/schemas/result.schema.ts`:
- Around line 117-149: The refinement on the metrics object omits
retryGenerationDurationMs, so partial retry payloads (e.g., { durationMs: 12,
retryGenerationDurationMs: 5 }) slip through; update the refine block that
references metrics.retryKind, metrics.retryReason, metrics.retryAttempted, and
metrics.retryPromoted to also include metrics.retryGenerationDurationMs in
hasAnyRetryField and in the branches: when retryAttempted === true require
retryGenerationDurationMs to be a number (and optionally non-negative) alongside
retryKind, non-empty string retryReason, and boolean retryPromoted; when
retryAttempted === false require retryGenerationDurationMs to be undefined
(absent) along with the other retry fields.
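The cross-field invariant above is independent of Zod and can be stated as a plain predicate. A sketch only; the real check is a Zod refinement in result.schema.ts, and field names are taken from the prompt:

```typescript
// Simplified retry-metrics shape; the real schema carries more fields.
interface RetryMetrics {
  retryAttempted?: boolean;
  retryKind?: string;
  retryReason?: string;
  retryPromoted?: boolean;
  retryGenerationDurationMs?: number;
}

// When a retry was attempted, every retry field must be present and valid;
// otherwise every retry field (including the duration) must be absent.
function retryFieldsConsistent(m: RetryMetrics): boolean {
  if (m.retryAttempted === true) {
    return (
      typeof m.retryKind === "string" &&
      typeof m.retryReason === "string" &&
      m.retryReason.length > 0 &&
      typeof m.retryPromoted === "boolean" &&
      typeof m.retryGenerationDurationMs === "number" &&
      m.retryGenerationDurationMs >= 0
    );
  }
  return (
    m.retryKind === undefined &&
    m.retryReason === undefined &&
    m.retryPromoted === undefined &&
    m.retryGenerationDurationMs === undefined
  );
}
```

The partial payload from the comment, `{ retryGenerationDurationMs: 5 }` with no `retryAttempted`, now fails instead of slipping through.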
---
Duplicate comments:
In `@apps/dashboard/src/components/about/about-content.ts`:
- Around line 56-57: The About copy uses two different labels for the same axis
("passType" vs "prompt mode"); choose one canonical label (e.g., "prompt mode")
and update all occurrences so they match: replace the string value that
currently reads "runtime x harness x model x test x passType" and the
benchmark-dimension label entries referenced around the other block (the entries
containing "passType" at the other occurrence) to use the chosen canonical term,
ensuring the keys/labels in the about-content data structure remain consistent
across value and detail fields.
In `@src/cli/compare-utils.ts`:
- Around line 156-159: The truncate function currently returns an empty string
when maxLen <= 0, masking caller bugs; update the truncate(str: string, maxLen:
number): string implementation to validate maxLen and throw a descriptive error
(e.g., RangeError or TypeError) when maxLen is not a positive integer instead of
returning ""; keep existing behavior for valid positive maxLen and unchanged
slicing logic (use truncate) so callers get a hard failure on invalid widths.
In `@src/harnesses/opencode-runner.ts`:
- Around line 156-178: The forceKillProcess implementation sends SIGTERM then
waits, but after escalating to SIGKILL it returns immediately; update
forceKillProcess (and reuse killOpenCodeProcess and FORCE_KILL_DELAY_MS) so that
after calling killOpenCodeProcess(proc, "SIGKILL", log) you wait for the proc to
settle (proc.then resolves/rejects) or for a second timeout (use Promise.race
like the initial grace wait) before returning, and log appropriately (e.g., warn
before escalating and optionally debug when settled) so callers cannot proceed
while opencode is still exiting.
In `@src/lib/signal-assessment.ts`:
- Around line 349-379: The code in finalizeItemSignalAssessment is defaulting
outputSource to "artifact" when calling getTranscriptOrInputTaintReasons
(source: input.outputSource ?? "artifact"), which hides harness-only taint
reasons like "agent_requested_input"; change the call to pass input.outputSource
directly (i.e., source: input.outputSource) or omit the source option when
undefined so harness-specific taints are preserved, ensuring callers that forget
to supply outputSource don't silently lose taint reasons.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx`:
- Around line 38-42: The readHarnessColor function currently falls back to a
hardcoded HSL string; replace that literal with the app's shared muted color
constant by importing the shared constant (e.g., MUTED_COLOR or the project's
theme.muted color) and use it as the fallback in HARNESS_COLORS[harness as keyof
typeof HARNESS_COLORS] || MUTED_COLOR; update the import at the top of the file
accordingly so token-efficiency-chart.tsx uses the centralized muted color
instead of "hsl(210, 12%, 63%)".
In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx`:
- Around line 40-54: Add a short TSDoc comment for the internal helper
getRunAggregateBadge to match style used by getRunMachineBadgeLabel: describe
the function's purpose (computes label and variant for a run based on legacy
flag, checkpoint presence, and latestCheckpointId), document parameters (run:
RunListItem, latestCheckpointId: string | null) and the returned shape ({ label:
string; variant: "secondary" | "success" | "warning" }), and place it
immediately above the getRunAggregateBadge declaration to maintain consistency
and improve maintainability.
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 183-191: The "Select all" Button's onClick currently calls
onSelectionChange([...models]) without sorting, which can produce a different
order than other selection paths; update the onClick handler to sort models with
MODEL_SORT_COLLATOR before invoking onSelectionChange (e.g., create a sorted
copy via [...models].sort(MODEL_SORT_COLLATOR) and pass that to
onSelectionChange) so that the selection order matches the other callbacks that
use MODEL_SORT_COLLATOR.
- Around line 196-197: The checkbox visual state is using raw selectedModels
(variable selectedModels) which may contain stale IDs; update the isSelected
computation in the Leaderboard ModelFilterDropdown component so it checks
membership against the same normalized set used for labels (e.g., the
normalizedSelectedIds or selectedModelIdsNormalized you create/lift) instead of
raw selectedModels, i.e., compute a normalized set of model IDs once (or lift
the existing normalization) and change the isSelected logic (the isSelected
variable that currently does selectedModels.includes(model)) to use that
normalized set while preserving the existing "all models" empty-set behavior.
In `@test/aggregate.test.ts`:
- Around line 111-157: Add comprehensive JSDoc to the createRun function:
document each parameter (runId, checkpointId, machineProfileKey, instanceId,
items, profile) with types and purpose, note optional/override fields on profile
(RunProfileOverrides), describe returned type (RunResult) and what the returned
object represents (summary, machine, benchmarkCheckpoint, provenance,
timestamps, items), and include any defaults used (e.g., TEST_PROFILE_LABEL,
TEST_NORMALIZED_PROFILE, TEST_HARDWARE, default startedAt/completedAt and
durationMs) so future readers understand inputs and produced structure.
- Around line 79-109: Update the JSDoc for the test helper function createItem
to include `@param` tags for id (string), completedAt (string), and overrides
(Partial<MatrixItemResult> optional), and add an `@returns` tag describing that it
returns a MatrixItemResult; locate the JSDoc above the createItem function and
add concise descriptions of each parameter and the return value to clarify the
helper's contract for future test authors.
In `@test/ollama-adapter.test.ts`:
- Around line 259-274: Rename the test title to be more specific: update the
it(...) description for the test using createDirectAdapter and adapter.generate
to something like "throws when runtime is named 'ollama' but apiFormat is not
'ollama' (non-Ollama apiFormat)" so it clearly communicates that the runtime
name is "ollama" while apiFormat is "openai-compatible" on the mockRuntime;
leave the test body (createDirectAdapter, adapter.generate, mockRuntime,
timeoutMs) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1261471f-88b6-47d8-a13d-14ba27cb408f
📒 Files selected for processing (158)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- test/vllm-runtime.test.ts
- src/harnesses/opencode-output.ts
- package.json
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-model.ts
- src/harnesses/goose-openai.ts
- src/runtimes/discovery.ts
- src/harnesses/opencode-process.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
@CodeRabbit full review
✅ Actions performed

Full review triggered.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lib/model-profile/registry.ts (1)
205-223: ⚠️ Potential issue | 🟡 Minor

Normalize legacy alias data through the same supported-runtime filter as profile data.

The legacy alias loading paths (lines 205–223) call `normalizeLegacyAliasMap()` directly, bypassing the filtering logic in `normalizeLoadedModelProfileRegistry()`. The alias schemas (`ModelAliasEntrySchema` accepts `z.record(z.string(), ...)`) allow any runtime key for backwards compatibility, meaning retired runtimes can survive load. While profile paths filter unsupported variants, alias paths do not—inconsistent handling of the same registry structure. Move alias data through `normalizeLoadedModelProfileRegistry()` or apply equivalent filtering within `normalizeLegacyAliasMap()` to ensure retired runtimes are dropped at the boundary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/registry.ts` around lines 205 - 223, The legacy alias loading paths currently call normalizeLegacyAliasMap(...) directly (after ModelAliasFileSchema and ModelAliasMapSchema parses), which bypasses the supported-runtime filtering in normalizeLoadedModelProfileRegistry(...); change the code to pass the normalized alias map through normalizeLoadedModelProfileRegistry (or alternatively implement the same supported-runtime filtering inside normalizeLegacyAliasMap) so retired runtimes are dropped consistently—i.e., after ModelAliasFileSchema.safeParse(parsed) and ModelAliasMapSchema.safeParse(parsed), feed the parsed alias map into normalizeLoadedModelProfileRegistry (referencing normalizeLegacyAliasMap and normalizeLoadedModelProfileRegistry to locate the logic) and return that filtered result instead of returning normalizeLegacyAliasMap(...) directly.

src/harnesses/opencode-adapter.ts (1)
455-488:⚠️ Potential issue | 🟠 MajorPreserve first-attempt taint when the retry attempt fails.
The success branch merges
firstAttemptAssessment, but the catch branch only restampsdurationMs. If the retry throws, the final error drops the first attempt's contract-violation evidence.Suggested fix
```diff
 } catch (error) {
   const totalDurationMs = Math.round(performance.now() - startTime);
   if (error !== null && typeof error === "object") {
-    Object.assign(error, { durationMs: totalDurationMs });
+    const existingSignalAssessment =
+      "signalAssessment" in error
+        ? (error as { signalAssessment?: SignalAssessment })
+            .signalAssessment
+        : undefined;
+    Object.assign(error, {
+      durationMs: totalDurationMs,
+      signalAssessment: appendSignalAssessmentReasons(
+        existingSignalAssessment,
+        firstAttemptAssessment?.classification === "tainted"
+          ? firstAttemptAssessment.reasons
+          : [],
+      ),
+    });
   }
   throw error;
 }
```

Based on learnings, "Never implicitly fix up results after a run; capture enough evidence to explain outcomes (test failures, eval reasoning, durations, best-effort metrics)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 455 - 488, The catch block after the retry (inside the generate flow that created firstAttemptAssessment via buildSignalAssessment and then called createOpenCodeAdapter().generate) must preserve the firstAttemptAssessment taint evidence when the retry throws: in the catch, besides stamping durationMs onto the thrown error, merge or append firstAttemptAssessment.reasons (or the full firstAttemptAssessment) into the error (e.g., attach/merge into error.signalAssessment or error.reasons) using the same helper appendSignalAssessmentReasons so downstream handlers see the original contract-violation evidence; ensure you only attach when firstAttemptAssessment exists and its classification is "tainted" to avoid adding empty data.
♻️ Duplicate comments (3)
src/runtimes/ollama-runtime.ts (1)
294-304: ⚠️ Potential issue | 🟠 Major
Emit explicit negative capabilities for non-generative models.
Right now `"embedding"` and `"unknown"` leave `generateText` undefined. Any downstream check that treats `generateText !== false` as eligible will still benchmark those models as generators. Populate both booleans explicitly at this boundary so unclassified models stay unambiguous.

Suggested patch

```diff
   modelKind,
-  capabilities: {
-    ...(modelKind === "text-generation" ? { generateText: true } : {}),
-    ...(modelKind === "embedding" ? { embedText: true } : {}),
-  },
+  capabilities: {
+    generateText: modelKind === "text-generation",
+    embedText: modelKind === "embedding",
+  },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runtimes/ollama-runtime.ts` around lines 294 - 304, The returned model capabilities object from the function that uses inferModelKind (where name/modelKind/sizeBytes/parametersBillions are returned) must explicitly set both booleans so non-generative or unclassified models don't leave generateText/embedText undefined; update the capabilities construction in the return block to always include generateText and embedText (true for the matching kind, false otherwise) instead of conditionally spreading only the positive flag so that models with kind "embedding" or "unknown" have generateText: false and embedText: false/true as appropriate.

src/runner/index.ts (1)
231-255: ⚠️ Potential issue | 🟡 Minor
Skipped rows still hit the residency teardown path.
After `buildPreflightSkipResult(...)`, the `finally` block can still call `runPostItemResidencyTeardown(item)` on the last skipped row. That means a deterministic preflight skip still touches Ollama, can unload models, and can add unrelated warning noise. Keep this branch side-effect free and only run teardown for rows that actually executed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/index.ts` around lines 231 - 255, The finally block unconditionally calls runPostItemResidencyTeardown(item) even for preflight-skipped rows; change the flow so teardown only runs for items that actually executed. Introduce a local flag (e.g., executed or didRun) around the per-item run logic: set it false when you create a preflight skip via buildPreflightSkipResult(...) and true for real executions, then replace the unconditional teardown call with if (isLastForModel && executed) await runPostItemResidencyTeardown(item). Keep the existing progress checkpoint calls (writeProgressCheckpoint) unchanged.

src/harnesses/opencode-events.ts (1)
249-250: ⚠️ Potential issue | 🟠 Major
Preserve original stream order when rebuilding `mixedOutput`.

This still buffers parsed text separately from raw lines and then appends every unparsed line at the end. If non-JSON output arrives between deltas, the reconstructed payload changes order and can corrupt fallback tool-call salvage.
Suggested fix
```diff
-  const textParts: string[] = [];
+  const orderedOutputFragments: string[] = [];
   const unparsedLines: string[] = [];
@@
   } catch {
     permissionDenied ||= isOpenCodePermissionDeniedText(line);
     unparsedLines.push(line);
+    orderedOutputFragments.push(`${line}\n`);
     continue;
   }
@@
   const text = readEventText(event);
   if (text) {
     permissionDenied ||= isOpenCodePermissionDeniedText(text);
-    textParts.push(text);
+    orderedOutputFragments.push(text);
   }
 }
@@
-  if (parsedLines > 0 && textParts.length > 0) {
-    const combined = textParts.join("");
-    const mixedOutput =
-      unparsedLines.length > 0
-        ? [combined, ...unparsedLines].join("\n")
-        : combined;
+  if (parsedLines > 0 && orderedOutputFragments.length > 0) {
+    const mixedOutput = orderedOutputFragments.join("").trimEnd();
     const mixedPermissionDenied = isOpenCodePermissionDeniedText(mixedOutput);
     const toolCallCode = extractFromToolCallText(mixedOutput);
```

Also applies to: 335-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-events.ts` around lines 249 - 250, The current rebuild logic separates parsed text into textParts and raw lines into unparsedLines and then appends unparsedLines at the end, which reorders interleaved non-JSON output; instead, preserve stream order by replacing the two buffers with a single ordered buffer (e.g., an array of items noting type and content) and push entries as they arrive; update the code that pushes into textParts/unparsedLines to instead push into this ordered buffer, and change the mixedOutput assembly (where mixedOutput is created) to iterate the ordered buffer and concatenate entries in sequence so original order is preserved (apply same change to the other occurrence around lines 335-340).
🧹 Nitpick comments (5)
apps/dashboard/src/components/charts/token-efficiency-chart.tsx (1)
39-42: Apply defensive guard against prototype key pollution in `readHarnessColor`.

The function receives `harness` as a plain `string` type (not constrained to known literal values at the dashboard level, as `z.string()` schemas in both src/schemas and apps/dashboard schemas lack enum validation). The type assertion `harness as keyof typeof HARNESS_COLORS` can resolve inherited object properties (e.g., `toString`, `constructor`) before reaching the muted fallback.

Suggested fix
```diff
 const HARNESS_COLORS = {
   direct: "hsl(215, 70%, 62%)", // steel blue
   goose: "hsl(142, 60%, 49%)", // brand green
   opencode: "hsl(38, 80%, 58%)", // warm amber
 } as const;

+type KnownHarnessName = keyof typeof HARNESS_COLORS;
+
 function readHarnessColor(harness: string): string {
-  return (
-    HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] || CHART_COLORS.muted
-  );
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as KnownHarnessName]
+    : CHART_COLORS.muted;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx` around lines 39 - 42, readHarnessColor currently indexes HARNESS_COLORS with an unconstrained string and a type assertion, which can resolve inherited prototype keys; update readHarnessColor to defensively check that the incoming harness is a direct own key of HARNESS_COLORS before returning its value (e.g., use Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness) or a Map lookup) and otherwise return CHART_COLORS.muted; reference HARNESS_COLORS and CHART_COLORS.muted in your change and keep the function signature readHarnessColor(harness: string): string.

apps/dashboard/src/lib/types.ts (1)
86-86: Apply `ArtifactRuntimeName` consistently across plan/result types.

Tightening `ModelVariant.runtime` is good, but `MatrixItem.runtime` and `MatchedItem.runtime` are still plain `string`. That leaves this file out of sync with the shared plan schema and weakens runtime-label validation in the dashboard layer.
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` at line 86, Update the dashboard types so runtime uses the shared ArtifactRuntimeName type everywhere: change the runtime property on MatrixItem and MatchedItem to ArtifactRuntimeName (matching the already-tightened ModelVariant.runtime) in apps/dashboard/src/lib/types.ts; ensure any related imports include ArtifactRuntimeName and run type-checking to fix any downstream type mismatches.

test/opencode-config.test.ts (1)
104-120: Also lock in `OPENCODE_DISABLE_WEBFETCH` here.

This regression test asserts the web-search flag, but the env builder is also supposed to force web fetch off. Adding `expect(env.OPENCODE_DISABLE_WEBFETCH).toBe("true")` would keep a one-sided regression from passing.
Verify each finding against the current code and only fix it if needed. In `@test/opencode-config.test.ts` around lines 104 - 120, The test for "exports config dir/path/content for isolated headless runs" is missing an assertion for the web fetch flag; update the test that calls buildOpenCodeEnv to also assert expect(env.OPENCODE_DISABLE_WEBFETCH).toBe("true") so OPENCODE_DISABLE_WEBFETCH is locked off alongside OPENCODE_DISABLE_WEBSEARCH (refer to buildOpenCodeEnv and the existing OPENCODE_DISABLE_WEBSEARCH/OPENCODE_DISABLE_AUTOUPDATE assertions to place the new expect).

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (2)
116-116: Use `normalizedSelectedModelSet.has()` instead of `includes()` for O(1) lookup.

Since `normalizedSelectedModelSet` is already computed at line 108, prefer using the Set's `has()` method for constant-time lookup rather than a linear array search.

♻️ Suggested fix
```diff
- if (normalizedSelectedModels.includes(model)) {
+ if (normalizedSelectedModelSet.has(model)) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` at line 116, The current check uses normalizedSelectedModels.includes(model) which performs an O(n) array scan; change it to use the precomputed Set normalizedSelectedModelSet.has(model) for O(1) lookup—update the condition in the component where normalizedSelectedModels and normalizedSelectedModelSet are used (replace includes() on normalizedSelectedModels with normalizedSelectedModelSet.has(model)).
207-209: Visual inconsistency when only stale IDs are selected.

When `selectedModels` contains only stale IDs (no longer in `models`), the trigger label correctly shows "All models" (line 42 handles this), but the checkbox visual logic at line 209 will show all items as unchecked because `selectedModels.length !== 0` and `normalizedSelectedIds` is empty.

Consider aligning the checkbox checked state with the trigger label semantics:
♻️ Suggested fix
```diff
+ const representsAllModels =
+   selectedModels.length === 0 || normalizedSelectedIds.size === 0;
  // ...
  {models.map((model) => {
    const isSelected =
-     selectedModels.length === 0 || normalizedSelectedIds.has(model);
+     representsAllModels || normalizedSelectedIds.has(model);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 207 - 209, The checkbox state logic in the model-filter-dropdown component is inconsistent when selectedModels contains only stale IDs: update the isSelected calculation used in models.map so that an empty normalizedSelectedIds is treated the same as no selection (i.e., show all checked to match the trigger label). Specifically, change the condition for isSelected (used when rendering each model checkbox) to treat normalizedSelectedIds.size === 0 as a truthy "all selected" case in addition to selectedModels.length === 0 and normalizedSelectedIds.has(model) so the visual checkboxes align with the trigger label semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@llm/implementation/harnesses-implementation.md`:
- Around line 340-345: The doc table in harnesses-implementation.md is out of
sync with the Goose adapter: either update the Goose row to describe the current
mechanism (that src/harnesses/goose-adapter.ts invokes "goose run" without
--with-builtin developer and relies on per-item provider config/permission
rules), or change the adapter to pass the CLI flag again; specifically, locate
the Goose documentation entry in harnesses-implementation.md and replace the CLI
flag description with the current provider-config/permission-based method (or
alternatively modify the goose adapter code to reintroduce the --with-builtin
developer flag where goose is invoked if you prefer restoring the original
behavior).
In `@src/cli/compare-formatters.ts`:
- Around line 209-214: The table output pads harness names but never truncates
them, causing column overflow; update each place where you format harness values
(e.g., inside the regressions loop and the other formatting loops that call
pad(item.harness, columnWidths.harness)) to wrap the harness value with truncate
before padding—i.e., replace pad(item.harness, columnWidths.harness) with
pad(truncate(item.harness, columnWidths.harness), columnWidths.harness) so long
harness strings are clipped to columnWidths.harness and table alignment is
preserved.
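The truncate-then-pad pattern requested above can be sketched as follows. This is a hypothetical, simplified version: the names `truncate`, `pad`, and `formatCell` mirror the helpers the comment refers to, not the actual compare-formatters implementation.

```typescript
// Hypothetical sketch: clip a cell value to the column width first,
// then pad, so long harness names can never break table alignment.

function truncate(value: string, width: number): string {
  if (value.length <= width) return value;
  // Reserve one character for the ellipsis marker.
  return `${value.slice(0, Math.max(0, width - 1))}…`;
}

function pad(value: string, width: number): string {
  return value.padEnd(width, " ");
}

function formatCell(value: string, width: number): string {
  // Truncate before padding so the result is always exactly `width` wide
  // for over-long inputs and right-padded for short ones.
  return pad(truncate(value, width), width);
}
```

Padding alone only guarantees a minimum width; truncating first also guarantees the maximum, which is what keeps the columns aligned.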
In `@src/cli/migrate-machine-command.ts`:
- Around line 362-364: Remove the duplicate console.error call in the catch
block and rely solely on the structured logger; specifically, delete the
console.error(...) invocation and keep the logger.error({ error },
"Machine-profile migration failed") followed by process.exit(1) so errors are
emitted only via Pino (remove the console.error line that precedes logger.error
in the catch handling around the migration logic).
In `@src/harnesses/opencode-adapter.ts`:
- Around line 493-498: The throw reuses the pre-decision signalAssessment so
failures that never wrote SOLUTION_FILENAME miss output-contract taint; rebuild
the assessment from decision before calling buildOpenCodeFailure and pass that
rebuilt assessment instead of signalAssessment. Specifically, compute a new
assessment object based on decision (marking output_contract_violation or
tool_call_not_executed as appropriate), then call buildOpenCodeFailure with
durationMs, parsed.output and the new assessment so the terminal failure path
records the correct output-contract taint.
In `@src/harnesses/opencode-artifacts.ts`:
- Around line 172-186: In readUsableOpenCodeSolution validate the
minOutputLength parameter before attempting to read the file: check that
minOutputLength is a finite number and not negative (e.g.,
Number.isFinite(minOutputLength) && minOutputLength >= 0); if the check fails
throw a clear TypeError/RangeError describing the invalid threshold. Place this
validation at the top of readUsableOpenCodeSolution so you fail fast on
programmer/config errors (refer to the function name readUsableOpenCodeSolution
and the minOutputLength parameter when making the change).
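The fail-fast guard described above can be sketched like this. The function name is illustrative; only the validation logic (finite, non-negative) is the point, placed before any file I/O.

```typescript
// Illustrative guard for a minOutputLength-style threshold parameter.
// Rejects NaN, ±Infinity, and negative values with a clear RangeError
// so programmer/config errors surface before any file is read.

function assertValidMinOutputLength(minOutputLength: number): void {
  if (!Number.isFinite(minOutputLength) || minOutputLength < 0) {
    throw new RangeError(
      `minOutputLength must be a finite, non-negative number; got ${minOutputLength}`,
    );
  }
}
```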
In `@src/harnesses/opencode-cli.ts`:
- Around line 1-11: Update the file header to include parseOpenCodeRunFeatures
in the exported list and add explicit `@throws` {never} TSDoc entries to both
parseOpenCodeRunFeatures() and isOpenCodeRunCompatible() function docs; locate
the functions by name (parseOpenCodeRunFeatures, isOpenCodeRunCompatible) and
add the throws tag to their existing JSDoc/TSDoc blocks, and update the
top-of-file export list to mention parseOpenCodeRunFeatures alongside the other
exports.
In `@src/harnesses/opencode-runner.ts`:
- Around line 248-255: The stdout/stderr handlers call chunk.toString() on each
Buffer, which can corrupt multi-byte UTF-8 characters; before attaching
listeners on proc.stdout and proc.stderr in the opencode runner, call
proc.stdout.setEncoding("utf8") and proc.stderr.setEncoding("utf8") so Node
reconstructs multi-byte sequences, then push the received string chunks into
stdoutChunks/stderrChunks and update lastOutputTime as before (refer to
proc.stdout, proc.stderr, lastOutputTime, stdoutChunks, stderrChunks).
In `@src/runner/item-executor.ts`:
- Around line 177-178: The call to loadPrompt(item.test, item.passType) is
currently executed before the outer try in executeItem, causing a
prompt_not_found error to reject executeItem instead of producing a per-item
failure row in run.json; move the loadPrompt call inside the recoverable item
execution path (inside the try/catch that handles per-item failures) so missing
prompts are converted into a structured failure entry (prompt_not_found) and
execution continues, or catch errors from loadPrompt immediately and normalize
them into the same failure-handling logic used for timeouts/model/eval errors;
update references to loadPrompt, executeItem and the run.json failure recording
logic to ensure prompt_not_found is treated like other per-item failures.
---
Outside diff comments:
In `@src/harnesses/opencode-adapter.ts`:
- Around line 455-488: The catch block after the retry (inside the generate flow
that created firstAttemptAssessment via buildSignalAssessment and then called
createOpenCodeAdapter().generate) must preserve the firstAttemptAssessment taint
evidence when the retry throws: in the catch, besides stamping durationMs onto
the thrown error, merge or append firstAttemptAssessment.reasons (or the full
firstAttemptAssessment) into the error (e.g., attach/merge into
error.signalAssessment or error.reasons) using the same helper
appendSignalAssessmentReasons so downstream handlers see the original
contract-violation evidence; ensure you only attach when firstAttemptAssessment
exists and its classification is "tainted" to avoid adding empty data.
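The merge-onto-thrown-error pattern requested above can be sketched with a simplified assessment shape. Everything here is illustrative: the `SignalAssessment` interface and `attachTaint` helper are stand-ins, and the real adapter's `appendSignalAssessmentReasons` is only referenced by name.

```typescript
// Simplified stand-in for the adapter's signal-assessment shape.
interface SignalAssessment {
  classification: "clean" | "tainted";
  reasons: string[];
}

// Attach first-attempt taint evidence to a thrown error object, merging
// with any assessment already stamped on it. Only attaches when the
// first attempt exists and was actually tainted, so no empty data is added.
function attachTaint(error: unknown, first?: SignalAssessment): void {
  if (error === null || typeof error !== "object") return;
  if (first === undefined || first.classification !== "tainted") return;
  const carrier = error as { signalAssessment?: SignalAssessment };
  const existing = carrier.signalAssessment ?? {
    classification: "tainted" as const,
    reasons: [],
  };
  carrier.signalAssessment = {
    classification: "tainted",
    // Deduplicate so a retried attempt does not double-report a reason.
    reasons: [...new Set([...existing.reasons, ...first.reasons])],
  };
}
```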
In `@src/lib/model-profile/registry.ts`:
- Around line 205-223: The legacy alias loading paths currently call
normalizeLegacyAliasMap(...) directly (after ModelAliasFileSchema and
ModelAliasMapSchema parses), which bypasses the supported-runtime filtering in
normalizeLoadedModelProfileRegistry(...); change the code to pass the normalized
alias map through normalizeLoadedModelProfileRegistry (or alternatively
implement the same supported-runtime filtering inside normalizeLegacyAliasMap)
so retired runtimes are dropped consistently—i.e., after
ModelAliasFileSchema.safeParse(parsed) and
ModelAliasMapSchema.safeParse(parsed), feed the parsed alias map into
normalizeLoadedModelProfileRegistry (referencing normalizeLegacyAliasMap and
normalizeLoadedModelProfileRegistry to locate the logic) and return that
filtered result instead of returning normalizeLegacyAliasMap(...) directly.
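The boundary filtering this item asks for can be sketched generically. The names below (`SUPPORTED_RUNTIMES`, `dropUnsupportedRuntimes`, the alias-map shape) are assumptions for illustration, not the real registry API.

```typescript
// Hypothetical sketch: drop alias entries keyed by runtimes that are no
// longer supported, so retired runtimes cannot survive a registry load.

const SUPPORTED_RUNTIMES = new Set(["ollama"]);

type AliasMap = Record<string, Record<string, string>>;

function dropUnsupportedRuntimes(aliases: AliasMap): AliasMap {
  const filtered: AliasMap = {};
  for (const [alias, byRuntime] of Object.entries(aliases)) {
    const kept = Object.fromEntries(
      Object.entries(byRuntime).filter(([runtime]) =>
        SUPPORTED_RUNTIMES.has(runtime),
      ),
    );
    // Keep the alias only if at least one supported runtime remains.
    if (Object.keys(kept).length > 0) filtered[alias] = kept;
  }
  return filtered;
}
```

Routing both the profile and the legacy-alias load paths through one such filter is what makes the handling consistent at the boundary.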
---
Duplicate comments:
In `@src/harnesses/opencode-events.ts`:
- Around line 249-250: The current rebuild logic separates parsed text into
textParts and raw lines into unparsedLines and then appends unparsedLines at the
end, which reorders interleaved non-JSON output; instead, preserve stream order
by replacing the two buffers with a single ordered buffer (e.g., an array of
items noting type and content) and push entries as they arrive; update the code
that pushes into textParts/unparsedLines to instead push into this ordered
buffer, and change the mixedOutput assembly (where mixedOutput is created) to
iterate the ordered buffer and concatenate entries in sequence so original order
is preserved (apply same change to the other occurrence around lines 335-340).
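The single-ordered-buffer approach can be sketched as below. The event shape (`{ text?: string }`) is a simplification of the real OpenCode event stream; only the ordering technique is the point.

```typescript
// Sketch: parsed event text and unparsed raw lines go into ONE array in
// arrival order, so rebuilding the mixed output preserves interleaving
// instead of appending all non-JSON lines at the end.

function rebuildMixedOutput(lines: string[]): string {
  const orderedFragments: string[] = [];
  for (const line of lines) {
    try {
      const event = JSON.parse(line) as { text?: string };
      if (typeof event.text === "string") orderedFragments.push(event.text);
    } catch {
      // Non-JSON output keeps its position in the stream.
      orderedFragments.push(`${line}\n`);
    }
  }
  return orderedFragments.join("");
}
```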
In `@src/runner/index.ts`:
- Around line 231-255: The finally block unconditionally calls
runPostItemResidencyTeardown(item) even for preflight-skipped rows; change the
flow so teardown only runs for items that actually executed. Introduce a local
flag (e.g., executed or didRun) around the per-item run logic: set it false when
you create a preflight skip via buildPreflightSkipResult(...) and true for real
executions, then replace the unconditional teardown call with if (isLastForModel
&& executed) await runPostItemResidencyTeardown(item). Keep the existing
progress checkpoint calls (writeProgressCheckpoint) unchanged.
In `@src/runtimes/ollama-runtime.ts`:
- Around line 294-304: The returned model capabilities object from the function
that uses inferModelKind (where name/modelKind/sizeBytes/parametersBillions are
returned) must explicitly set both booleans so non-generative or unclassified
models don't leave generateText/embedText undefined; update the capabilities
construction in the return block to always include generateText and embedText
(true for the matching kind, false otherwise) instead of conditionally spreading
only the positive flag so that models with kind "embedding" or "unknown" have
generateText: false and embedText: false/true as appropriate.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx`:
- Around line 39-42: readHarnessColor currently indexes HARNESS_COLORS with an
unconstrained string and a type assertion, which can resolve inherited prototype
keys; update readHarnessColor to defensively check that the incoming harness is
a direct own key of HARNESS_COLORS before returning its value (e.g., use
Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness) or a Map lookup)
and otherwise return CHART_COLORS.muted; reference HARNESS_COLORS and
CHART_COLORS.muted in your change and keep the function signature
readHarnessColor(harness: string): string.
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Line 116: The current check uses normalizedSelectedModels.includes(model)
which performs an O(n) array scan; change it to use the precomputed Set
normalizedSelectedModelSet.has(model) for O(1) lookup—update the condition in
the component where normalizedSelectedModels and normalizedSelectedModelSet are
used (replace includes() on normalizedSelectedModels with
normalizedSelectedModelSet.has(model)).
- Around line 207-209: The checkbox state logic in the model-filter-dropdown
component is inconsistent when selectedModels contains only stale IDs: update
the isSelected calculation used in models.map so that an empty
normalizedSelectedIds is treated the same as no selection (i.e., show all
checked to match the trigger label). Specifically, change the condition for
isSelected (used when rendering each model checkbox) to treat
normalizedSelectedIds.size === 0 as a truthy "all selected" case in addition to
selectedModels.length === 0 and normalizedSelectedIds.has(model) so the visual
checkboxes align with the trigger label semantics.
In `@apps/dashboard/src/lib/types.ts`:
- Line 86: Update the dashboard types so runtime uses the shared
ArtifactRuntimeName type everywhere: change the runtime property on MatrixItem
and MatchedItem to ArtifactRuntimeName (matching the already-tightened
ModelVariant.runtime) in apps/dashboard/src/lib/types.ts; ensure any related
imports include ArtifactRuntimeName and run type-checking to fix any downstream
type mismatches.
In `@test/opencode-config.test.ts`:
- Around line 104-120: The test for "exports config dir/path/content for
isolated headless runs" is missing an assertion for the web fetch flag; update
the test that calls buildOpenCodeEnv to also assert
expect(env.OPENCODE_DISABLE_WEBFETCH).toBe("true") so OPENCODE_DISABLE_WEBFETCH
is locked off alongside OPENCODE_DISABLE_WEBSEARCH (refer to buildOpenCodeEnv
and the existing OPENCODE_DISABLE_WEBSEARCH/OPENCODE_DISABLE_AUTOUPDATE
assertions to place the new expect).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c7fbff1b-b7e1-4f60-a1a8-2342d99b54cc
📒 Files selected for processing (158)
README.md, apps/dashboard/scripts/build-index.ts, apps/dashboard/src/components/about/about-content.ts, apps/dashboard/src/components/about/about-page.tsx, apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx, apps/dashboard/src/components/charts/chart-primitives.tsx, apps/dashboard/src/components/charts/composite-score-chart.tsx, apps/dashboard/src/components/charts/frontier-eval-scatter.tsx, apps/dashboard/src/components/charts/model-comparison-chart.tsx, apps/dashboard/src/components/charts/model-radar-chart.tsx, apps/dashboard/src/components/charts/model-test-heatmap.tsx, apps/dashboard/src/components/charts/test-difficulty-chart.tsx, apps/dashboard/src/components/charts/test-type-split-chart.tsx, apps/dashboard/src/components/charts/token-efficiency-chart.tsx, apps/dashboard/src/components/layout/header.tsx, apps/dashboard/src/components/leaderboard/leaderboard-filters.ts, apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx, apps/dashboard/src/components/leaderboard/leaderboard-page.tsx, apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx, apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx, apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx, apps/dashboard/src/components/run-list/run-list-page.tsx, apps/dashboard/src/components/ui/section-heading.tsx, apps/dashboard/src/index.css, apps/dashboard/src/lib/aggregations-charts.ts, apps/dashboard/src/lib/aggregations-compare.ts, apps/dashboard/src/lib/aggregations-diagnostics.ts, apps/dashboard/src/lib/aggregations-test-types.ts, apps/dashboard/src/lib/aggregations-tooling.ts, apps/dashboard/src/lib/aggregations.ts, apps/dashboard/src/lib/api.ts, apps/dashboard/src/lib/schemas.ts, apps/dashboard/src/lib/types.ts, llm/context/codebase-overview.md, llm/implementation/computer-use-hardening.md, llm/implementation/harnesses-implementation.md, llm/implementation/multi-runtime-mvp-implementation.md, llm/project/project-overview.md, llm/project/project-rules.md, llm/project/user-flow.md, package.json, src/cli/compare-command.ts, src/cli/compare-formatters.ts, src/cli/compare-utils.ts, src/cli/migrate-machine-command.ts, src/cli/run-command.ts, src/harnesses/direct-adapter.ts, src/harnesses/discovery.ts, src/harnesses/goose-adapter.ts, src/harnesses/goose-openai.ts, src/harnesses/goose-output.ts, src/harnesses/harness.ts, src/harnesses/opencode-adapter.ts, src/harnesses/opencode-artifacts.ts, src/harnesses/opencode-cli.ts, src/harnesses/opencode-config.ts, src/harnesses/opencode-events.ts, src/harnesses/opencode-model.ts, src/harnesses/opencode-output.ts, src/harnesses/opencode-permissions.ts, src/harnesses/opencode-process.ts, src/harnesses/opencode-provider.ts, src/harnesses/opencode-runner.ts, src/harnesses/tool-prompt.ts, src/lib/benchmark-checkpoint.ts, src/lib/failure-classifier.ts, src/lib/hardware-profile.ts, src/lib/logger.ts, src/lib/machine-profile/instance-id.ts, src/lib/machine-profile/legacy.ts, src/lib/machine-profile/normalization.ts, src/lib/machine-profile/probe-linux.ts, src/lib/machine-profile/probe-macos.ts, src/lib/machine-profile/probe-utils.ts, src/lib/machine-profile/probe-windows.ts, src/lib/model-aliases.ts, src/lib/model-profile/normalization.ts, src/lib/model-profile/registry-compat.ts, src/lib/model-profile/registry.ts, src/lib/openai-compat-client.ts, src/lib/signal-assessment.ts, src/lib/stats-format.ts, src/lib/test-catalog.ts, src/runner/index.ts, src/runner/item-executor.ts, src/runner/model-availability.ts, src/runner/model-eligibility.ts, src/runner/plan-builder.ts, src/runner/run-progress.ts, src/runtimes/discovery.ts, src/runtimes/index.ts, src/runtimes/ollama-residency.ts, src/runtimes/ollama-runtime.ts, src/runtimes/runtime.ts, src/runtimes/vllm-runtime.ts, src/schemas/common.schema.ts, src/schemas/config.schema.ts, src/schemas/index.ts, src/schemas/model-alias.schema.ts, src/schemas/model-profile.schema.ts, src/schemas/plan.schema.ts, src/schemas/result.schema.ts, src/schemas/test-catalog.schema.ts, src/tests/calculator-basic/test.meta.json, src/tests/calculator-stateful/test.meta.json, src/tests/event-emitter/test.meta.json, src/tests/rate-limiter/test.meta.json, src/tests/smoke/test.meta.json, src/tests/todo-app/test.meta.json, src/tests/ttl-cache/test.meta.json, src/tests/workspace-smoke/fixtures/logs/.gitkeep, src/tests/workspace-smoke/prompt.blind.md, src/tests/workspace-smoke/prompt.informed.md, src/tests/workspace-smoke/test.meta.json, test/aggregate.test.ts, test/benchmark-checkpoint.test.ts, test/build-index.test.ts, test/compare-checkpoint-guard.test.ts, test/compare-formatters.test.ts, test/dashboard-schema-compat.test.ts, test/dashboard-test-type.test.ts, test/failure-classifier.test.ts, test/generation-retry.test.ts, test/goose-adapter.test.ts, test/goose-output.test.ts, test/hardware-profile.test.ts, test/harness-compatibility.test.ts, test/item-executor.test.ts, test/item-retry.test.ts, test/machine-profile-migration.test.ts, test/model-alias-schema.test.ts, test/model-availability.test.ts, test/model-profile-registry.test.ts, test/ollama-adapter.test.ts, test/ollama-residency.test.ts, test/opencode-adapter.test.ts, test/opencode-artifacts.test.ts, test/opencode-cli-cache.test.ts, test/opencode-cli.test.ts, test/opencode-config.test.ts, test/opencode-model.test.ts, test/opencode-output.test.ts, test/opencode-runner-race.test.ts, test/opencode-runner.test.ts, test/plan-builder.test.ts, test/run-command.test.ts, test/runner-residency.test.ts, test/schema-regressions.test.ts, test/schemas.test.ts, test/signal-assessment.test.ts, test/stats-semantics.test.ts, test/stats.test.ts, test/test-catalog.test.ts, test/tool-prompt.test.ts, test/utils/fixtures.ts, test/vllm-runtime.test.ts, test/workspace-capability-parity.test.ts, test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- package.json
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/runtimes/discovery.ts
- src/harnesses/goose-openai.ts
- test/vllm-runtime.test.ts
- src/harnesses/opencode-process.ts
- src/runtimes/vllm-runtime.ts
- src/lib/openai-compat-client.ts
@CodeRabbit full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 4
♻️ Duplicate comments (2)
apps/dashboard/src/components/charts/model-comparison-chart.tsx (2)
42-44: ⚠️ Potential issue | 🟠 Major
`z.custom<MatrixItemResult>()` is effectively non-validating here.

On Line 43, this schema accepts arbitrary values, so malformed `items` can still flow into chart aggregation logic.

🔧 Proposed fix
```diff
+import { MatrixItemResultSchema } from "@/lib/schemas";
 import { z } from "zod";

 const ModelComparisonChartPropsSchema = z.object({
-  items: z.array(z.custom<MatrixItemResult>()),
+  items: z.array(MatrixItemResultSchema),
 });
```

```bash
#!/bin/bash
# Verify current runtime validation strength and presence/export of MatrixItemResultSchema.
rg -n --type=ts -C3 'ModelComparisonChartPropsSchema|z\.custom<MatrixItemResult>|MatrixItemResultSchema' apps/dashboard/src/components/charts/model-comparison-chart.tsx
rg -n --type=ts -C3 'MatrixItemResultSchema' apps/dashboard/src/lib/schemas.ts
rg -n --type=ts -C3 'export.*MatrixItemResultSchema|export\s*\{[^}]*MatrixItemResultSchema' apps/dashboard/src/lib/schemas.ts
```

Based on learnings: "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)" and "Validate at boundaries with Zod and pass typed data inward".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 42 - 44, The prop schema currently uses z.custom<MatrixItemResult>() which bypasses validation; import or locate the existing MatrixItemResultSchema and replace z.array(z.custom<MatrixItemResult>()) with z.array(MatrixItemResultSchema) in ModelComparisonChartPropsSchema (and export MatrixItemResultSchema if missing) so incoming items are strictly validated before reaching chart aggregation logic; ensure the prop type and any callers use the validated output (parse/parseAsync) of ModelComparisonChartPropsSchema when receiving external data.
85-91: ⚠️ Potential issue | 🟡 Minor

Add `@throws` to the exported function JSDoc.

`ModelComparisonChart` is exported on line 92, but the doc block does not include a `@throws` entry required by repo standards.

As per coding guidelines: "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 85 - 91, The JSDoc for the exported function ModelComparisonChart is missing a required `@throws` tag; update the doc block above the ModelComparisonChart export to include a `@throws` description that explains what error(s) the component may throw (e.g., invalid props, missing items, or data parsing errors) and under what conditions, using the function name ModelComparisonChart and any related props (props.items) in the description so the doc meets repo standards.
🧹 Nitpick comments (4)
src/lib/failure-classifier.ts (1)
48-49: Add regression coverage for the new `prompt missing` branch.

Line 49 adds behavior, but there's no matching test case yet; this can regress silently. Please add a `classifyGenerationError("Prompt missing ...") -> "prompt_not_found"` assertion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/failure-classifier.ts` around lines 48 - 49, Add a unit test that covers the new "prompt missing" branch by calling classifyGenerationError with a message containing "Prompt missing" (e.g., "Prompt missing ..." or similar) and asserting it returns the "prompt_not_found" classification; target the classifyGenerationError function and ensure the test mirrors existing tests for the "prompt not found" case so the new branch cannot regress.

apps/dashboard/src/components/charts/frontier-eval-scatter.tsx (1)
29-41: Consider centralizing harness color resolution in a shared chart utility.

This resolver pattern is good, but it now appears duplicated across chart components. A shared helper would reduce drift risk.

Proposed refactor sketch

```diff
+// apps/dashboard/src/components/charts/harness-colors.ts
+import { CHART_COLORS } from "@/lib/chart-colors";
+
+export const HARNESS_COLORS = {
+  direct: "hsl(215, 70%, 62%)",
+  goose: "hsl(142, 60%, 49%)",
+  opencode: "hsl(38, 80%, 58%)",
+} as const;
+
+export function readHarnessColor(harness: string): string {
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS]
+    : CHART_COLORS.muted;
+}
```

```diff
-const HARNESS_COLORS = { ... } as const;
-function readHarnessColor(harness: string): string { ... }
+import { readHarnessColor } from "./harness-colors";
```

Also applies to: 115-115
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/frontier-eval-scatter.tsx` around lines 29 - 41, Centralize the harness color resolver into a shared chart utility: extract the HARNESS_COLORS map, KnownHarnessName type, and readHarnessColor function into a new exported helper (e.g., getHarnessColor) in the shared charts util and have components import and use that helper instead of duplicating the logic; ensure the helper returns CHART_COLORS.muted for unknown keys and keep the exact behavior of readHarnessColor so callers (including frontier-eval-scatter.tsx) remain unchanged.

apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx (1)
33-35: Consider adding Zod validation for props.

While the TypeScript typing is present, the coding guidelines recommend validating at boundaries with Zod. You could add a Zod schema for `BlindVsInformedChartProps` to provide runtime validation of the `items` array.

Example Zod schema

```ts
import { z } from 'zod';

const BlindVsInformedChartPropsSchema = z.object({
  items: z.array(MatrixItemResultSchema), // assuming MatrixItemResultSchema exists
});

// Then validate at component entry:
export function BlindVsInformedChart(props: BlindVsInformedChartProps) {
  const { items } = BlindVsInformedChartPropsSchema.parse(props);
  // ... rest of implementation
}
```

As per coding guidelines: "Validate at boundaries with Zod and pass typed data inward".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx` around lines 33 - 35, Add runtime Zod validation for the component props by defining a BlindVsInformedChartPropsSchema (z.object with items: z.array(MatrixItemResultSchema)) and use it at the start of the BlindVsInformedChart function to parse/validate incoming props (replace direct destructuring of props with const { items } = BlindVsInformedChartPropsSchema.parse(props)); keep the existing TypeScript interface for compile-time typing but ensure MatrixItemResultSchema is imported/available and surface parse errors at the boundary.

src/cli/compare-utils.ts (1)
79-84: Redundant type check after `instanceof Error`.

At line 79, `error && typeof error === 'object'` is unnecessary since line 75 already confirmed `error instanceof Error`, which guarantees it's a truthy object.

♻️ Proposed simplification

```diff
-  if (error && typeof error === "object" && "code" in error) {
+  if ("code" in error) {
     const code = (error as { code?: unknown }).code;
     if (code === "ENOENT" || code === "ENOTDIR") {
       return true;
     }
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-utils.ts` around lines 79 - 84, Remove the redundant runtime checks after the earlier instanceof Error check: since the code already verifies error instanceof Error, drop the leading "error && typeof error === 'object' &&" and simplify the conditional that detects filesystem codes to just check for "code" in error (and cast as needed), then extract const code = (error as { code?: unknown }).code and test code === "ENOENT" || code === "ENOTDIR"; update the if block around the existing code variable extraction accordingly.
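The simplified narrowing above can be sketched as a standalone helper. This is illustrative only: `isMissingPathError` is a hypothetical name, not the repo's actual function, but it shows why the extra `typeof` guard is redundant once `instanceof Error` has been established.

```typescript
// Returns true for Node-style filesystem "path missing" errors.
function isMissingPathError(error: unknown): boolean {
  // instanceof Error already guarantees a truthy object,
  // so no separate `typeof error === "object"` check is needed.
  if (!(error instanceof Error)) {
    return false;
  }
  if (!("code" in error)) {
    return false;
  }
  const code = (error as { code?: unknown }).code;
  return code === "ENOENT" || code === "ENOTDIR";
}
```

Node's filesystem APIs attach a string `code` property to thrown errors, which is why the `"code" in error` check plus a string comparison is sufficient here.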
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 109-114: allSelected currently doesn't treat the "stale-only"
(normalizedSelectedModelSet.size === 0) case the same as isSelected, causing a
single click to collapse selections incorrectly; update the allSelected
computation (the expression that sets allSelected using selectedModels,
models.every, and normalizedSelectedModelSet.has) to also treat
normalizedSelectedModelSet.size === 0 as "all selected" (i.e., include a
condition like normalizedSelectedModelSet.size === 0 in the OR chain), and make
the same change to the duplicate logic around the other occurrence (the block
referenced at lines 209-211).
In `@src/harnesses/tool-prompt.ts`:
- Around line 103-131: The code silently treats any pathMode other than
"relative-only" as the anchored branch; add an explicit validation for pathMode
(e.g., only allow "relative-only" or "absolute-anchor") before the existing
branching and trimmedWorkspaceRootPath checks: if pathMode is not one of the
allowed values, throw a clear Error. Update the validation near the top where
pathMode and trimmedWorkspaceRootPath are established (the same block that
checks toolNames and the absolute-anchor requirement) so subsequent logic using
pathMode and the pathLines/lines construction can assume a valid value.
In `@src/runner/index.ts`:
- Around line 219-257: didRun is being reset per-iteration so when earlier items
in the same runtime+model group ran but later ones are skipped the final skipped
iteration sees didRun === false and skips runPostItemResidencyTeardown; change
the tracking to persist at the runtime+model group level (e.g., move didRun out
of the per-item loop or replace it with a map keyed by
preflightKey/runtime+model) so that the flag reflects whether any item in the
current group executed, then use that group-level flag when evaluating
isLastForModel to decide whether to call runPostItemResidencyTeardown(item);
ensure the same fix is applied to the other block around lines 317-348 and that
writeProgressCheckpoint and results handling remain unchanged.
In `@src/runtimes/ollama-runtime.ts`:
- Around line 300-304: The code currently sets capabilities.generateText to
false for modelKind === "unknown", which excludes potentially valid generative
models; update the capabilities block (the modelKind / capabilities object) so
generateText is true for "text-generation" and also for "unknown" (i.e.,
generateText = modelKind === "text-generation" || modelKind === "unknown"),
while keeping embedText = modelKind === "embedding", so unknown models are not
prematurely classified as non-generative.
---
Duplicate comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 42-44: The prop schema currently uses z.custom<MatrixItemResult>()
which bypasses validation; import or locate the existing MatrixItemResultSchema
and replace z.array(z.custom<MatrixItemResult>()) with
z.array(MatrixItemResultSchema) in ModelComparisonChartPropsSchema (and export
MatrixItemResultSchema if missing) so incoming items are strictly validated
before reaching chart aggregation logic; ensure the prop type and any callers
use the validated output (parse/parseAsync) of ModelComparisonChartPropsSchema
when receiving external data.
- Around line 85-91: The JSDoc for the exported function ModelComparisonChart is
missing a required `@throws` tag; update the doc block above the
ModelComparisonChart export to include a `@throws` description that explains what
error(s) the component may throw (e.g., invalid props, missing items, or data
parsing errors) and under what conditions, using the function name
ModelComparisonChart and any related props (props.items) in the description so
the doc meets repo standards.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx`:
- Around line 33-35: Add runtime Zod validation for the component props by
defining a BlindVsInformedChartPropsSchema (z.object with items:
z.array(MatrixItemResultSchema)) and use it at the start of the
BlindVsInformedChart function to parse/validate incoming props (replace direct
destructuring of props with const { items } =
BlindVsInformedChartPropsSchema.parse(props)); keep the existing TypeScript
interface for compile-time typing but ensure MatrixItemResultSchema is
imported/available and surface parse errors at the boundary.
In `@apps/dashboard/src/components/charts/frontier-eval-scatter.tsx`:
- Around line 29-41: Centralize the harness color resolver into a shared chart
utility: extract the HARNESS_COLORS map, KnownHarnessName type, and
readHarnessColor function into a new exported helper (e.g., getHarnessColor) in
the shared charts util and have components import and use that helper instead of
duplicating the logic; ensure the helper returns CHART_COLORS.muted for unknown
keys and keep the exact behavior of readHarnessColor so callers (including
frontier-eval-scatter.tsx) remain unchanged.
In `@src/cli/compare-utils.ts`:
- Around line 79-84: Remove the redundant runtime checks after the earlier
instanceof Error check: since the code already verifies error instanceof Error,
drop the leading "error && typeof error === 'object' &&" and simplify the
conditional that detects filesystem codes to just check for "code" in error (and
cast as needed), then extract const code = (error as { code?: unknown }).code
and test code === "ENOENT" || code === "ENOTDIR"; update the if block around the
existing code variable extraction accordingly.
In `@src/lib/failure-classifier.ts`:
- Around line 48-49: Add a unit test that covers the new "prompt missing" branch
by calling classifyGenerationError with a message containing "Prompt missing"
(e.g., "Prompt missing ..." or similar) and asserting it returns the
"prompt_not_found" classification; target the classifyGenerationError function
and ensure the test mirrors existing tests for the "prompt not found" case so
the new branch cannot regress.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ce480d55-c86a-40c3-b9cf-1df193adf646
📒 Files selected for processing (158)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- src/harnesses/opencode-model.ts
- package.json
- src/harnesses/goose-openai.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-output.ts
- src/lib/openai-compat-client.ts
- test/benchmark-checkpoint.test.ts
- src/runtimes/discovery.ts
- test/vllm-runtime.test.ts
- src/runtimes/vllm-runtime.ts
@CodeRabbit full review
✅ Actions performed: full review triggered.
Actionable comments posted: 4
🧹 Nitpick comments (9)
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx (1)
24-31: Excellent use of `as const` — aligns with coding guidelines.

The addition of `as const` strengthens type safety and prevents mutations, which follows the guideline to avoid enums and use `as const` instead. The whitespace cleanup in comments also improves consistency.
Optional: Consider merging colors into the cards array.
Currently, `CARD_ACCENT_COLORS` and the `cards` array are coupled through index-based access (line 113). This creates a maintenance burden where both arrays must stay in sync. Consider merging the color into each card object:

♻️ Optional refactor to reduce coupling

```diff
-/** Border accent colors — brand green for primary, semantic colors for specific KPIs. */
-const CARD_ACCENT_COLORS = [
-  "#34c759", // brand green — matched runs
-  "#34c759", // brand green — profiles
-  "hsl(212, 100%, 67%)", // info blue — deduped items
-  "hsl(156, 67%, 55%)", // success green — pass rate
-  "hsl(270, 60%, 60%)", // purple — frontier coverage
-  "hsl(210, 85%, 60%)", // blue — median duration
-] as const;
 /**
  * Renders top-level leaderboard KPI cards.
  * ...
  */
 export function LeaderboardSummaryCards({
   aggregate,
   filteredItemCount,
   filteredItems,
   passRate,
 }: LeaderboardSummaryCardsProps) {
   // ... existing calculations ...
   const cards = [
     {
       title: "Matched Runs",
       value: String(aggregate?.summary.runsMatched ?? 0),
       sub: null,
+      color: "#34c759", // brand green
     },
     {
       title: "Profiles",
       value: String(aggregate?.summary.machines ?? 0),
       sub:
         aggregate?.summary.instances !== undefined
           ? `${aggregate.summary.instances} ${
               aggregate.summary.instances === 1 ? "instance" : "instances"
             }`
           : null,
+      color: "#34c759", // brand green
     },
     {
       title: "Deduped Items",
       value: String(filteredItemCount),
       sub: `of ${aggregate?.summary.dedupedItems ?? 0} total`,
+      color: "hsl(212, 100%, 67%)", // info blue
     },
     {
       title: "Pass Rate",
       value: formatPercent(passRate.passRate),
       sub: `${passRate.passed}/${passRate.total} tests`,
+      color: "hsl(156, 67%, 55%)", // success green
     },
     {
       title: "Frontier Coverage",
       value: formatPercent(frontierCoverage),
       sub: `${frontierCount} of ${filteredItemCount} items`,
+      color: "hsl(270, 60%, 60%)", // purple
     },
     {
       title: "Median Duration",
       value: medianDuration !== null ? formatDuration(medianDuration) : "—",
       sub: durations.length > 0 ? `${durations.length} items` : "no data",
+      color: "hsl(210, 85%, 60%)", // blue
     },
-  ];
+  ] as const;

   return (
     <div className="grid gap-4 md:grid-cols-3 lg:grid-cols-6">
-      {cards.map((card, i) => (
+      {cards.map((card) => (
         <Card
           key={card.title}
           glow
-          className={`border-l-2 animate-fade-slide-up animate-stagger-${i + 1}`}
-          style={{ borderLeftColor: CARD_ACCENT_COLORS[i] }}
+          className="border-l-2 animate-fade-slide-up"
+          style={{ borderLeftColor: card.color }}
         >
```

Note: This refactor also removes the animation stagger classes, which would need to be addressed separately if staggered animations are desired.
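The design choice above can be reduced to a tiny sketch: carrying the color on each card removes the parallel-array coupling entirely. The names and values here are illustrative placeholders, not the component's real types or data.

```typescript
// Illustrative card shape: the accent color lives on the card itself,
// so there is no parallel CARD_ACCENT_COLORS array to keep in sync.
interface SummaryCard {
  title: string;
  value: string;
  color: string;
}

const cards: SummaryCard[] = [
  { title: "Matched Runs", value: "12", color: "#34c759" },
  { title: "Pass Rate", value: "83%", color: "hsl(156, 67%, 55%)" },
];

// Rendering reads card.color directly — reordering or removing a card
// can no longer desynchronize titles from colors.
const borderColors = cards.map((card) => card.color);
```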
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx` around lines 24 - 31, The CARD_ACCENT_COLORS array is coupled to the cards array via index-based access (CARD_ACCENT_COLORS[i]), which is brittle; update the card definitions in the cards array to include a color property for each card (e.g., color: "#34c759" or "hsl(...)"), replace all uses of CARD_ACCENT_COLORS[i] with card.color, and remove the standalone CARD_ACCENT_COLORS constant; ensure the render logic (the component that maps over cards and applies accent color) reads the new color property and preserves existing animation/stagger classes.

src/cli/compare-command.ts (1)
34-42: Consider `.default(false)` instead of `z.boolean().optional()` for boolean options.

The schema defines `json` and `allowCrossCheckpoint` as `z.boolean().optional()`, but Commander already provides defaults (false). This works, but an explicit `.default(false)` in the schema would make the post-parse type narrower (`boolean` instead of `boolean | undefined`), removing the need for `=== true` checks at lines 82-83 and 96.

♻️ Optional refinement

```diff
 const RunCompareSchema = z.object({
   runA: z.string().min(1),
   runB: z.string().min(1),
   options: z.object({
     output: z.string().min(1),
-    json: z.boolean().optional(),
-    allowCrossCheckpoint: z.boolean().optional(),
+    json: z.boolean().default(false),
+    allowCrossCheckpoint: z.boolean().default(false),
   }),
 });
```

Then simplify usage:

```diff
-  const allowCrossCheckpoint =
-    parsed.options.allowCrossCheckpoint === true;
+  const allowCrossCheckpoint = parsed.options.allowCrossCheckpoint;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-command.ts` around lines 34 - 42, The schema RunCompareSchema currently uses z.boolean().optional() for options.json and options.allowCrossCheckpoint; change those to z.boolean().default(false) so parsed types are boolean (not boolean | undefined) and you can remove redundant checks like `=== true` when reading options in the compare command (e.g., where compare logic inspects options.json and options.allowCrossCheckpoint). Update RunCompareSchema's options object to use .default(false) for both booleans so downstream code (compare command handlers) can treat them as plain booleans.

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (1)
118-139: Canonicalize emitted selections to avoid duplicate IDs.

At the line 119 and line 136 paths, `normalizedSelectedModels` preserves any upstream duplicates, so callback payloads can stay non-canonical. Dedupe before sort in all emit paths to keep state stable.

♻️ Suggested patch

```diff
 function handleModelToggle(model: string) {
   const modelSet = new Set(models);
   const normalizedSelectedModels = selectedModels.filter((selected) =>
     modelSet.has(selected),
   );
   const normalizedSelectedModelSet = new Set(normalizedSelectedModels);
+  const sortAndDedupe = (values: string[]) =>
+    [...new Set(values)].sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b));
   const allSelected =
     selectedModels.length === 0 ||
     normalizedSelectedModelSet.size === 0 ||
     (models.length > 0 &&
       models.every((availableModel) =>
         normalizedSelectedModelSet.has(availableModel),
       ));
   if (normalizedSelectedModelSet.has(model)) {
-    onSelectionChange(
-      normalizedSelectedModels
-        .filter((selected) => selected !== model)
-        .sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
-    );
+    onSelectionChange(
+      sortAndDedupe(
+        normalizedSelectedModels.filter((selected) => selected !== model),
+      ),
+    );
     return;
   }
   if (allSelected) {
-    onSelectionChange(
-      models
-        .filter((availableModel) => availableModel !== model)
-        .sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
-    );
+    onSelectionChange(
+      sortAndDedupe(models.filter((availableModel) => availableModel !== model)),
+    );
     return;
   }
-  onSelectionChange(
-    [...normalizedSelectedModels, model].sort((a, b) =>
-      MODEL_SORT_COLLATOR.compare(a, b),
-    ),
-  );
+  onSelectionChange(sortAndDedupe([...normalizedSelectedModels, model]));
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 118 - 139, The emitted selections from onSelectionChange can contain duplicates because normalizedSelectedModels is used directly; update the three emit paths (the remove branch that filters normalizedSelectedModels, the allSelected branch that filters models, and the add branch that spreads normalizedSelectedModels) to deduplicate before sorting and emitting—e.g., build a Set or use Array.from(new Set(...)) on the array results (references: normalizedSelectedModels, models, model, allSelected, onSelectionChange, MODEL_SORT_COLLATOR) and then sort the deduped array with MODEL_SORT_COLLATOR prior to calling onSelectionChange so payloads are canonical.

llm/project/user-flow.md (1)
53-56: Optional: vary repeated "Which …" bullets for readability.

This is purely editorial, but slight rewording improves scanability.

Suggested copy tweak

```diff
-  - Which test(s) to run?
-  - Which category/categories to run (`coding`, `computer-use`)?
-  - Which harness(es) to run through?
-  - Which model(s) to benchmark?
+  - Select test(s) to run.
+  - Choose category/categories (`coding`, `computer-use`).
+  - Pick harness(es) to run through.
+  - Choose model(s) to benchmark.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/project/user-flow.md` around lines 53 - 56, Edit the repeated bulleted questions in user-flow.md to improve readability by varying phrasing instead of repeating "Which …"; specifically update the bullets "Which test(s) to run?", "Which category/categories to run (`coding`, `computer-use`)?", "Which harness(es) to run through?", and "Which model(s) to benchmark?" to shorter/varied forms (e.g., "Select test(s) to run", "Choose category/categories (`coding`, `computer-use`)", "Pick harness(es) to execute", "Select model(s) to benchmark") so the list scans better while preserving the same options and meaning.apps/dashboard/src/components/charts/model-comparison-chart.tsx (1)
92-94: Misleading `@throws` clause for a `safeParse`-guarded function.

The JSDoc at lines 92–93 states the function throws when "validated props.items contain a shape accepted by the schema but unsupported by downstream chart aggregation or rendering." However, the function now uses `safeParse` with a graceful fallback (lines 96–131), so validation failures won't throw. If downstream aggregation errors are still possible, the `@throws` should clarify that scenario specifically, or be removed if the intention is purely defensive.

Suggested JSDoc update

```diff
 /**
  * Renders head-to-head model comparison with diverging bars.
  *
  * @param props - Component props
  * @param props.items - Filtered matrix items
  * @returns Card with model selectors and diverging bar chart
- * @throws {Error} If validated props.items contain a shape accepted by the schema
- *   but unsupported by downstream chart aggregation or rendering.
  */
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 92 - 94, The JSDoc `@throws` is misleading because the code now uses safeParse (so validation failures are handled gracefully) — update the comment near the safeParse call to either remove the `@throws` entirely or change it to explicitly state that validation errors do not throw and that only downstream aggregation/rendering (e.g., where props.items is passed into the chart aggregation/render pipeline) might still throw; reference the safeParse usage and any aggregation function that consumes the parsed data when editing the text.src/harnesses/direct-adapter.ts (1)
103-110: Conditional taint reason merge may lose retry-generated reasons.

The current logic only merges `signalAssessment` when `decision.taintReasons.length > 0`. If the initial attempt produces taint reasons but the retry succeeds cleanly (no taint reasons from `evaluateCodeOnlyOutput`), the initial taint reasons are lost. Consider whether the initial attempt's taint indicators should always propagate to the final result.

Suggested fix to always propagate initial taint reasons

```diff
 return {
   ...retryResult,
   // ... token merging ...
   durationMs: Math.round(performance.now() - startTime),
-  ...(decision.taintReasons.length > 0
-    ? {
-        signalAssessment: appendSignalAssessmentReasons(
-          retryResult.signalAssessment,
-          decision.taintReasons,
-        ),
-      }
-    : {}),
+  signalAssessment: appendSignalAssessmentReasons(
+    retryResult.signalAssessment,
+    decision.taintReasons,
+  ),
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/direct-adapter.ts` around lines 103 - 110, The current conditional only adds signalAssessment when decision.taintReasons.length > 0, which can drop initial taint reasons if the retry returns none; change the logic around decision.taintReasons and retryResult.signalAssessment to always propagate the initial taint reasons into the final signalAssessment by calling appendSignalAssessmentReasons(retryResult.signalAssessment, decision.taintReasons) unconditionally (or whenever decision.taintReasons is non-empty OR retryResult.signalAssessment exists), and assign that merged result to signalAssessment so initial taint indicators from decision.taintReasons are not lost.

src/runner/model-eligibility.ts (1)
95-97: Make throw-mode errors runtime-aware for faster debugging.

Consider including runtime and detected model kind in the thrown message to reduce triage time when explicit model selection fails.

💡 Suggested refinement

```diff
 if (input.mode === "throw") {
+  const detectedKind = info.modelKind ?? "unknown";
   throw new Error(
-    `Requested model is not supported for generative benchmarks: ${model} (embedding-only)`,
+    `Requested model "${model}" on runtime "${input.runtimeName}" is not supported for generative benchmarks (modelKind=${detectedKind})`,
   );
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/model-eligibility.ts` around lines 95 - 97, Update the thrown Error in model-eligibility.ts (the throw that currently uses ${model} (embedding-only)) to include runtime and detected model kind for quicker debugging: gather the runtime identifier (e.g., runtime or env/runtimeName) and the detected kind (e.g., detectedModelKind or modelKind) immediately before the throw and append them to the message (for example: `Requested model ${model} is not supported for generative benchmarks (embedding-only); runtime=${runtime}; detectedKind=${detectedModelKind}`). Ensure those variables are computed or available in the scope of the throw site and include them in the process/error message.

apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx (1)
39-41: Validate `items` before render instead of re-parsing them in the component.

`BlindVsInformedChart` now does a full Zod parse on every render. That duplicates validation work for large result sets and turns malformed data into a render-time crash rather than failing once at the data-loading boundary.

♻️ Proposed cleanup
```diff
-import { MatrixItemResultSchema } from "@/lib/schemas";
 import { blindInformed as blindInformedTooltips } from "@/lib/tooltip-content";
 import type { MatrixItemResult } from "@/lib/types";
@@
-import { z } from "zod";
-
 interface BlindVsInformedChartProps {
   items: MatrixItemResult[];
 }
-
-const BlindVsInformedChartPropsSchema = z.object({
-  items: z.array(MatrixItemResultSchema),
-});
@@
- * @param items - Benchmark rows used to compute model and harness breakdowns.
+ * @param props - Typed chart props.
  * @returns React element containing chart tabs for model and harness breakdowns.
- * @throws {z.ZodError} If props.items do not match MatrixItemResultSchema.
+ * @throws none
  */
 export function BlindVsInformedChart(props: BlindVsInformedChartProps) {
-  const { items } = BlindVsInformedChartPropsSchema.parse(props);
+  const { items } = props;
```

Also applies to: 281-289
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx` around lines 39 - 41, BlindVsInformedChart is re-parsing props.items with Zod on every render (causing duplicated work and render-time crashes); validate/parse items once at the data-loading or prop boundary and pass already-validated data into the component instead of calling BlindVsInformedChartPropsSchema.parse inside the render. Update the call sites that construct props for BlindVsInformedChart to run BlindVsInformedChartPropsSchema.parse (or use z.infer to type the validated shape) and change the component to accept the validated type directly; also apply the same change for the similar validation usage around the code referenced at 281-289 (remove in-component parse and move validation upstream).

test/plan-builder.test.ts (1)
449-483: Assert the exclusion payload here, not only the fallback error text.

This is the only changed regression for the new model-exclusion path, but it currently passes as long as the generic rejection string stays the same. If the builder stops attaching the excluded `runtime`/`model`/`reason` details, this test will not catch it.

If `buildRunPlan` exposes the exclusions on the rejection or on an intermediate result, assert those fields here alongside the message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/signal-assessment.ts`:
- Around line 304-317: The helper isAgentRequestedInputOutput currently defaults
missing options.source to "harness", causing accidental tainting; update the
function so it does NOT treat an omitted source as "harness" — require explicit
opt-in by checking options.source === "harness" (or default to
"artifact"/non-tainting) instead of using (options.source ?? "harness") !==
"harness"; ensure the logic uses the TranscriptOrInputTaintOptions.source value
explicitly and adjust any related type/defaults to avoid a silent harness
fallback.
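The explicit opt-in that this prompt asks for can be sketched in plain TypeScript. The option type, function name, and the marker regex below are illustrative assumptions based on the review text, not the project's actual definitions:

```typescript
// Hypothetical sketch: an omitted `source` must NOT behave like "harness".
type TaintSource = "harness" | "artifact";

interface TranscriptOrInputTaintOptions {
  source?: TaintSource;
}

// Illustrative marker for "the agent asked the user for input".
const AGENT_INPUT_PATTERN = /awaiting (?:user )?input/i;

function isAgentRequestedInputOutput(
  output: string,
  options: TranscriptOrInputTaintOptions = {},
): boolean {
  // Require explicit opt-in: only an explicit "harness" source can taint,
  // so an omitted source no longer falls back to the tainting path.
  if (options.source !== "harness") {
    return false;
  }
  return AGENT_INPUT_PATTERN.test(output);
}
```

With this shape, callers that never set `source` cannot taint a row by accident; harness-origin transcripts must say so explicitly.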
In `@src/runner/item-executor.ts`:
- Around line 397-405: The call to finalizeItemSignalAssessment is incorrectly
tagging outputs from failed generations as outputSource: "artifact"; update the
invocation in item-executor where signalAssessment is re-finalized (the
finalizeItemSignalAssessment call that uses existing: signalAssessment,
automatedScore, rowFailed, output: generation.output, outputSource: "artifact")
to choose the source based on generation.success (e.g., outputSource:
generation.success ? "artifact" : "harness-failure" or preserve the original
harness/source), so failed harness outputs keep their harness-origin taint
reasons instead of being re-marked as artifacts.
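A minimal sketch of that success-dependent source selection; the `Generation` shape and the `"harness-failure"` label are stand-ins for whatever the real result schema uses:

```typescript
// Hypothetical types; the project's actual schema may differ.
type OutputSource = "artifact" | "harness-failure";

interface Generation {
  success: boolean;
  output: string;
}

function selectOutputSource(generation: Generation): OutputSource {
  // Only successful generations produce genuine artifacts; failed harness
  // runs keep their harness-origin provenance so taint reasons survive
  // re-finalization instead of being re-marked as artifacts.
  return generation.success ? "artifact" : "harness-failure";
}
```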
In `@src/schemas/common.schema.ts`:
- Around line 13-31: The change removed "vllm" from supportedRuntimeNames which
makes SupportedRuntimeNameSchema reject existing configs without doing a
breaking schema bump or migration; revert to additive behavior by adding "vllm"
back into supportedRuntimeNames (so SupportedRuntimeNameSchema continues to
accept it) and leave artifactRuntimeNames/ArtifactRuntimeNameSchema as-is, and
add a TODO comment near SCHEMA_VERSION/supportedRuntimeNames to implement an
explicit migration or a breaking-version bump if you intend to remove "vllm"
later.
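The additive shape can be sketched with plain literal arrays in place of the Zod enums; `supportedRuntimeNames` and `artifactRuntimeNames` are the names from the review, and the TODO mirrors the requested migration note:

```typescript
// Keep "vllm" accepted until an explicit migration or breaking version bump.
// TODO(schema): bump SCHEMA_VERSION or add a migration before removing "vllm".
const supportedRuntimeNames = ["ollama", "vllm"] as const;

// Historical plan/result artifacts may reference either runtime.
const artifactRuntimeNames = ["ollama", "vllm"] as const;

type SupportedRuntimeName = (typeof supportedRuntimeNames)[number];

function isSupportedRuntimeName(value: string): value is SupportedRuntimeName {
  return (supportedRuntimeNames as readonly string[]).includes(value);
}
```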
In `@test/ollama-adapter.test.ts`:
- Around line 126-152: The test title is contradictory: update the it(...)
description string for the test that calls createOllamaRuntime(...), awaits
runtime.getModelInfo("mystery-model:8b"), and asserts info.modelKind ===
"unknown" and info.capabilities.generateText === true so the title reflects that
the unrecognized non-embedding model is considered generative (e.g., change "not
generative" to "generative" or similar) to match the expectations checked on
info.modelKind and info.capabilities.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx`:
- Around line 39-41: BlindVsInformedChart is re-parsing props.items with Zod on
every render (causing duplicated work and render-time crashes); validate/parse
items once at the data-loading or prop boundary and pass already-validated data
into the component instead of calling BlindVsInformedChartPropsSchema.parse
inside the render. Update the call sites that construct props for
BlindVsInformedChart to run BlindVsInformedChartPropsSchema.parse (or use
z.infer to type the validated shape) and change the component to accept the
validated type directly; also apply the same change for the similar validation
usage around the code referenced at 281-289 (remove in-component parse and move
validation upstream).
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 92-94: The JSDoc `@throws` is misleading because the code now uses
safeParse (so validation failures are handled gracefully) — update the comment
near the safeParse call to either remove the `@throws` entirely or change it to
explicitly state that validation errors do not throw and that only downstream
aggregation/rendering (e.g., where props.items is passed into the chart
aggregation/render pipeline) might still throw; reference the safeParse usage
and any aggregation function that consumes the parsed data when editing the
text.
In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`:
- Around line 24-31: The CARD_ACCENT_COLORS array is coupled to the cards array
via index-based access (CARD_ACCENT_COLORS[i]), which is brittle; update the
card definitions in the cards array to include a color property for each card
(e.g., color: "#34c759" or "hsl(...)"), replace all uses of
CARD_ACCENT_COLORS[i] with card.color, and remove the standalone
CARD_ACCENT_COLORS constant; ensure the render logic (the component that maps
over cards and applies accent color) reads the new color property and preserves
existing animation/stagger classes.
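The colocated-color idea can be sketched as follows; the card titles and hex values are placeholders, not the dashboard's actual cards or palette:

```typescript
// Each card carries its own accent color, replacing the parallel
// CARD_ACCENT_COLORS[i] array that can drift out of sync.
interface SummaryCard {
  title: string;
  color: string;
}

const cards: SummaryCard[] = [
  { title: "Pass rate", color: "#34c759" },
  { title: "Models", color: "#0a84ff" },
];

function accentFor(card: SummaryCard): string {
  // Read the color from the card itself instead of an index-based lookup.
  return card.color;
}
```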
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 118-139: The emitted selections from onSelectionChange can contain
duplicates because normalizedSelectedModels is used directly; update the three
emit paths (the remove branch that filters normalizedSelectedModels, the
allSelected branch that filters models, and the add branch that spreads
normalizedSelectedModels) to deduplicate before sorting and emitting—e.g., build
a Set or use Array.from(new Set(...)) on the array results (references:
normalizedSelectedModels, models, model, allSelected, onSelectionChange,
MODEL_SORT_COLLATOR) and then sort the deduped array with MODEL_SORT_COLLATOR
prior to calling onSelectionChange so payloads are canonical.
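A small sketch of the canonicalization step, assuming a collator like `MODEL_SORT_COLLATOR` (the locale and options here are illustrative):

```typescript
// Assumed collator; the component's actual MODEL_SORT_COLLATOR may differ.
const MODEL_SORT_COLLATOR = new Intl.Collator("en", { numeric: true });

function dedupeAndSortModels(models: readonly string[]): string[] {
  // Deduplicate first so repeated entries never reach onSelectionChange,
  // then sort with the shared collator so every emitted payload is canonical.
  return Array.from(new Set(models)).sort((a, b) =>
    MODEL_SORT_COLLATOR.compare(a, b),
  );
}
```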
In `@llm/project/user-flow.md`:
- Around line 53-56: Edit the repeated bulleted questions in user-flow.md to
improve readability by varying phrasing instead of repeating "Which …";
specifically update the bullets "Which test(s) to run?", "Which
category/categories to run (`coding`, `computer-use`)?", "Which harness(es) to
run through?", and "Which model(s) to benchmark?" to shorter/varied forms (e.g.,
"Select test(s) to run", "Choose category/categories (`coding`,
`computer-use`)", "Pick harness(es) to execute", "Select model(s) to benchmark")
so the list scans better while preserving the same options and meaning.
In `@src/cli/compare-command.ts`:
- Around line 34-42: The schema RunCompareSchema currently uses
z.boolean().optional() for options.json and options.allowCrossCheckpoint; change
those to z.boolean().default(false) so parsed types are boolean (not boolean |
undefined) and you can remove redundant checks like `=== true` when reading
options in the compare command (e.g., where compare logic inspects options.json
and options.allowCrossCheckpoint). Update RunCompareSchema's options object to
use .default(false) for both booleans so downstream code (compare command
handlers) can treat them as plain booleans.
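The effect of `.default(false)` can be sketched without Zod: after parsing, the option fields are plain booleans, so `=== true` checks become unnecessary. The names below are illustrative:

```typescript
// Raw CLI options before defaulting (flags may be omitted entirely).
interface RawCompareOptions {
  json?: boolean;
  allowCrossCheckpoint?: boolean;
}

// Parsed options: always concrete booleans, mirroring z.boolean().default(false).
interface CompareOptions {
  json: boolean;
  allowCrossCheckpoint: boolean;
}

function parseCompareOptions(raw: RawCompareOptions): CompareOptions {
  // Defaulting here lets downstream code test `if (options.json)`
  // instead of `options.json === true`.
  return {
    json: raw.json ?? false,
    allowCrossCheckpoint: raw.allowCrossCheckpoint ?? false,
  };
}
```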
In `@src/harnesses/direct-adapter.ts`:
- Around line 103-110: The current conditional only adds signalAssessment when
decision.taintReasons.length > 0, which can drop initial taint reasons if the
retry returns none; change the logic around decision.taintReasons and
retryResult.signalAssessment to always propagate the initial taint reasons into
the final signalAssessment by calling
appendSignalAssessmentReasons(retryResult.signalAssessment,
decision.taintReasons) unconditionally (or whenever decision.taintReasons is
non-empty OR retryResult.signalAssessment exists), and assign that merged result
to signalAssessment so initial taint indicators from decision.taintReasons are
not lost.
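A hypothetical sketch of the unconditional merge; the real `appendSignalAssessmentReasons` and `SignalAssessment` types in the codebase may differ:

```typescript
// Illustrative assessment shape.
interface SignalAssessment {
  classification: "clean" | "tainted";
  reasons: string[];
}

function appendSignalAssessmentReasons(
  existing: SignalAssessment | undefined,
  reasons: readonly string[],
): SignalAssessment | undefined {
  // Nothing to merge and nothing pre-existing: stay unset.
  if (existing === undefined && reasons.length === 0) {
    return undefined;
  }
  // Union the reason sets so initial taint indicators survive a clean retry.
  const merged = Array.from(new Set([...(existing?.reasons ?? []), ...reasons]));
  return {
    classification: merged.length > 0 ? "tainted" : "clean",
    reasons: merged,
  };
}
```

Calling this unconditionally after a retry means `decision.taintReasons` is preserved even when `retryResult.signalAssessment` comes back empty.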
In `@src/runner/model-eligibility.ts`:
- Around line 95-97: Update the thrown Error in model-eligibility.ts (the throw
that currently uses ${model} (embedding-only)) to include runtime and detected
model kind for quicker debugging: gather the runtime identifier (e.g., runtime
or env/runtimeName) and the detected kind (e.g., detectedModelKind or modelKind)
immediately before the throw and append them to the message (for example:
`Requested model ${model} is not supported for generative benchmarks
(embedding-only); runtime=${runtime}; detectedKind=${detectedModelKind}`).
Ensure those variables are computed or available in the scope of the throw site
and include them in the process/error message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec4ad437-bf48-48ac-80a9-efb8eb6f79f1
📒 Files selected for processing (159)

- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/chart-colors.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- package.json
- src/harnesses/goose-openai.ts
- test/vllm-runtime.test.ts
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-process.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
- src/runtimes/discovery.ts
- src/harnesses/opencode-output.ts
@CodeRabbit full review

✅ Actions performed

Full review triggered.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cli/migrate-machine-command.ts (1)
334-337: ⚠️ Potential issue | 🟠 Major

Do not force non-zero exit for user validation errors.
Line 336 can throw for invalid user options, and Line 363 currently exits with status 1 for all caught errors. That treats non-crash usage errors as crash exits.
💡 Suggested adjustment
```diff
 .action(async (options) => {
   try {
     const parsedOptions = MigrateMachineCommandOptionsSchema.parse(options);
     const resultsDir = path.resolve(parsedOptions.dir);
     const migrated = await migrateResultsDirectory(resultsDir);
     logger.info(
       { resultsDir, ...migrated },
       "Migrated machine-profile artifacts",
     );
@@
-  } catch (error) {
-    logger.error({ error }, "Machine-profile migration failed");
-    process.exit(1);
+  } catch (error) {
+    if (error instanceof z.ZodError) {
+      logger.warn(
+        { issues: error.issues },
+        "Invalid options for migrate-machine-profiles; skipping migration",
+      );
+      return;
+    }
+    logger.error({ error }, "Machine-profile migration failed");
+    process.exit(1);
   }
 });
```

As per coding guidelines "CLI should be non-interactive by default; exit non-zero only on crashes (MVP)".
Also applies to: 361-364
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 334 - 337, the action handler currently calls MigrateMachineCommandOptionsSchema.parse(options) and then unconditionally calls process.exit(1) for any caught error; change this so Zod/validation errors from MigrateMachineCommandOptionsSchema.parse (or other user-input validation) are handled separately: detect the validation error (e.g., instanceof ZodError or error.name === 'ZodError'), print a helpful validation message to the user and return/exit with success (no non-zero exit), while only calling process.exit(1) for unexpected runtime/crash errors; update the .action(async (options) => { ... }) catch block to branch on validation vs unexpected errors and avoid forcing a non-zero exit for user validation issues.

apps/dashboard/src/lib/schemas.ts (1)
363-384: ⚠️ Potential issue | 🟠 Major

Mirror the new `scoringMetrics` fields here before exporting this result schema.

`MatrixItemResultSchema` now validates live `run.json` payloads for the dashboard, but the nested `ScoringMetricsSchema` still only keeps `durationMs`. Current results can include `retryGenerationDurationMs`, `retryKind`, `retryReason`, `retryAttempted`, and `retryPromoted`, and Zod will silently strip all of them at this boundary.

📦 Suggested schema sync
```diff
-const ScoringMetricsSchema = z.object({
-  durationMs: z.number(),
-});
+const ScoringMetricsSchema = z
+  .object({
+    durationMs: z.number(),
+    scoringDurationMs: z.number().optional(),
+    retryGenerationDurationMs: z.number().optional(),
+    retryKind: z.enum(["compile-feedback", "opencode-workspace"]).optional(),
+    retryReason: z.string().optional(),
+    retryAttempted: z.boolean().optional(),
+    retryPromoted: z.boolean().optional(),
+  })
+  .refine(
+    (metrics) => {
+      const hasAnyRetryField =
+        metrics.retryKind !== undefined ||
+        metrics.retryReason !== undefined ||
+        metrics.retryAttempted !== undefined ||
+        metrics.retryPromoted !== undefined ||
+        metrics.retryGenerationDurationMs !== undefined;
+      if (!hasAnyRetryField) {
+        return true;
+      }
+      if (metrics.retryAttempted === true) {
+        return (
+          metrics.retryKind !== undefined &&
+          typeof metrics.retryReason === "string" &&
+          metrics.retryReason.trim().length > 0 &&
+          typeof metrics.retryPromoted === "boolean" &&
+          typeof metrics.retryGenerationDurationMs === "number" &&
+          metrics.retryGenerationDurationMs >= 0
+        );
+      }
+      if (metrics.retryAttempted === false) {
+        return (
+          metrics.retryKind === undefined &&
+          metrics.retryReason === undefined &&
+          metrics.retryPromoted === undefined &&
+          metrics.retryGenerationDurationMs === undefined
+        );
+      }
+      return false;
+    },
+    {
+      message:
+        "retry metrics must be fully absent, or when retryAttempted is true include retryKind, non-empty retryReason, retryPromoted, and non-negative retryGenerationDurationMs; when retryAttempted is false the other retry fields must be absent",
+      path: ["retryKind"],
+    },
+  );
```

Based on learnings, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit" and "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/schemas.ts` around lines 363 - 384, MatrixItemResultSchema currently references ScoringMetricsSchema but that schema only includes durationMs, causing fields like retryGenerationDurationMs, retryKind, retryReason, retryAttempted, and retryPromoted to be stripped; update ScoringMetricsSchema to include these additional optional fields and then ensure MatrixItemResultSchema continues to reference ScoringMetricsSchema (and export remains unchanged) so live run.json payloads retain those properties; locate ScoringMetricsSchema in the same file (or its import) and add the five new optional fields with appropriate types (numbers/strings/booleans) to match existing usages.
♻️ Duplicate comments (3)
README.md (1)
142-143: ⚠️ Potential issue | 🟡 Minor

Don’t describe `bun pb` as auto-discovering runtimes anymore.

The README now says new runs are Ollama-only, so this example overstates the default behavior. Keep the discovery note to models, harnesses, and tests.
📝 Suggested doc fix
```diff
-# Run benchmarks (auto-discovers available runtimes, models, harnesses, and tests)
+# Run benchmarks (Ollama runtime; auto-discovers models, harnesses, and tests)
 bun pb
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 142 - 143, update the README line that currently claims "auto-discovers available runtimes, models, harnesses, and tests" for the `bun pb` example so it no longer states runtimes are auto-discovered; change the wording to indicate `bun pb` auto-discovers only models, harnesses, and tests (and that new runs are Ollama-only) so the example matches the current behavior.

llm/implementation/harnesses-implementation.md (1)
340-345: ⚠️ Potential issue | 🟡 Minor

This doc still points to two different Goose tool-enablement mechanisms.
The table now says Goose tool access comes from generated config/permission rules, but the earlier “Tool-Calling Mode (Goose)” section still tells readers to use
`--with-builtin developer`. Please reconcile both sections in the same pass or the troubleshooting guidance stays ambiguous.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/implementation/harnesses-implementation.md` around lines 340 - 345, the doc is inconsistent about how Goose enables tools: reconcile the "Tool-Calling Mode (Goose)" section with the table entry "Goose | Per-item provider config + permission rules via `goose run`" by choosing the intended mechanism and updating the other place to match; either remove the reference to the CLI flag `--with-builtin developer` from "Tool-Calling Mode (Goose)" and replace it with a clear description of per-item provider config + permission rules (including mention of using `goose run`), or conversely, change the table to reflect the CLI flag approach—ensure all references to `--with-builtin developer`, "Tool-Calling Mode (Goose)", and the table entry are consistent and use the same terminology and steps.

src/harnesses/opencode-events.ts (1)
240-249: ⚠️ Potential issue | 🟠 Major

Preserve the separator after raw fragments.

`joinOrderedOutputParts()` still collapses a raw line into the next parsed text chunk. A sequence like `raw -> text` becomes `rawtext` instead of preserving the newline boundary, which can corrupt mixed-output salvage and scorer-facing transcripts.

♻️ Suggested fix
```diff
 function joinOrderedOutputParts(parts: OrderedOutputPart[]): string {
   let output = "";
   for (const part of parts) {
     if (part.type === "text") {
       output += part.content;
       continue;
     }
-    output += output.length > 0 ? `\n${part.content}` : part.content;
+    if (output.length > 0 && !output.endsWith("\n")) {
+      output += "\n";
+    }
+    output += `${part.content}\n`;
   }
-  return output;
+  return output.trimEnd();
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-events.ts` around lines 240 - 249, joinOrderedOutputParts currently collapses a raw (non-"text") part into the following text part (e.g., raw -> text becomes rawtext); update the function (joinOrderedOutputParts and references to OrderedOutputPart) so that when appending a non-"text" part you always preserve its boundary by ensuring a newline separates it from subsequent text—append the raw part.content and then, if the output does not already end with '\n', append a single '\n' (avoid duplicating newlines), rather than conditionally prefixing the raw piece with a newline based on output length.
🧹 Nitpick comments (6)
test/item-retry.test.ts (1)
69-186: Optional: extract shared retry-case setup to reduce duplication.

Both tests are solid, but they repeat a large common arrangement block. A tiny helper would improve maintainability without changing behavior.
♻️ Refactor sketch
```diff
+function createRetryArgs(params: {
+  testName: string;
+  initialOutput: string;
+  prompt: string;
+}) {
+  return {
+    item: createMatrixItem(params.testName),
+    generation: { success: true, output: params.initialOutput, durationMs: 300 },
+    harnessForRetry: createHarness(generateMock),
+    runtimeForRetry: createRuntime(),
+    promptForRetry: params.prompt,
+    timeoutMs: 5_000,
+    unloadAfter: true,
+    log: { info: vi.fn(), warn: vi.fn() },
+    supportsCompileRetry: true,
+  };
+}
 ...
-  const outcome = await runScoringWithCompileRetry({
-    item: createMatrixItem("todo-app"),
-    generation: { ... },
-    ...
-  });
+  const outcome = await runScoringWithCompileRetry(
+    createRetryArgs({
+      testName: "todo-app",
+      initialOutput: "export const notTheRightThing = true;",
+      prompt: "Implement createTodoApp().",
+    }),
+  );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/item-retry.test.ts` around lines 69 - 186, Extract the repeated compile-retry test setup into a small helper to remove duplication: create a helper (e.g., runCompileRetryScenario or setupCompileRetry) in test/item-retry.test.ts that accepts the varying bits (item name, initial generation.output/durationMs, generateMock resolved output/durationMs, first scoring failure object, promptForRetry) and internally imports runScoringWithCompileRetry, configures generateMock and scoreGenerationMock, constructs harnessForRetry/runtimeForRetry, calls runScoringWithCompileRetry with the shared options (timeoutMs, unloadAfter, log, supportsCompileRetry) and returns the outcome; then update both tests to call this helper and assert on the returned outcome and on generateMock call expectations.test/build-index.test.ts (1)
28-48: Consider centralizing required benchmark lib asset fixture data.

This list is manually mirrored from `src/lib/benchmark-checkpoint.ts`, so asset changes require edits in multiple places. A shared export/helper would reduce drift risk.

As per coding guidelines, "Favor iteration and modularization over code duplication."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/build-index.test.ts` around lines 28 - 48, the test's local REQUIRED_LIB_ASSETS duplicates the asset list from src/lib/benchmark-checkpoint.ts; remove the hardcoded array and import a single exported constant from the benchmark-checkpoint module (or a new shared helper module) to centralize the fixture data. Update test/build-index.test.ts to import the exported symbol (e.g., BENCHMARK_REQUIRED_ASSETS or REQUIRES_ASSETS) and use it in place of the local REQUIRED_LIB_ASSETS, and add the export to src/lib/benchmark-checkpoint.ts (or a new shared file) so both production code and tests reference the same source of truth.

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (1)
199-203: Reuse `dedupeAndSortModels` in “Select all” for consistency.

This path currently reimplements sorting and skips dedupe. Reusing the helper keeps one canonical normalization path.
Proposed patch
```diff
 onClick={() =>
-  onSelectionChange(
-    [...models].sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
-  )
+  onSelectionChange(dedupeAndSortModels(models))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 199 - 203, the "Select all" onClick currently builds a new array and manually sorts via MODEL_SORT_COLLATOR without deduplicating; replace that logic to call the existing helper dedupeAndSortModels(models) and pass its result to onSelectionChange to ensure canonical dedupe+sort behavior (update the onClick that currently calls onSelectionChange([...models].sort(...)) to use dedupeAndSortModels(models) instead).

src/cli/compare-formatters.ts (1)
26-36: Pin the display-row collation instead of inheriting the process locale.

`localeCompare()` without an explicit locale uses the runtime’s default collation, so these sort orders can still drift across machines when model, harness, or test names contain locale-sensitive characters. If the goal is deterministic CLI output, make the comparator locale-independent or at least pin it to a fixed locale.

♻️ Suggested fix
```diff
+function compareText(left: string, right: string): number {
+  if (left < right) return -1;
+  if (left > right) return 1;
+  return 0;
+}
+
 function compareDisplayRows(
   left: ComparableDisplayRow,
   right: ComparableDisplayRow,
 ): number {
-  const modelCompare = left.model.localeCompare(right.model);
+  const modelCompare = compareText(left.model, right.model);
   if (modelCompare !== 0) return modelCompare;
-  const harnessCompare = left.harness.localeCompare(right.harness);
+  const harnessCompare = compareText(left.harness, right.harness);
   if (harnessCompare !== 0) return harnessCompare;
-  const testCompare = left.test.localeCompare(right.test);
+  const testCompare = compareText(left.test, right.test);
   if (testCompare !== 0) return testCompare;
-  return left.passType.localeCompare(right.passType);
+  return compareText(left.passType, right.passType);
 }
```

Based on learnings, "Output should be table/diff oriented, concise, and deterministic; never rely on color alone".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-formatters.ts` around lines 26 - 36, the comparator compareDisplayRows currently uses localeCompare() with the environment default; make sorting deterministic by pinning the locale and options, e.g. call localeCompare on model, harness, test, and passType with a fixed locale (like "en" or "en-US") and explicit options (for example { numeric: true, sensitivity: "variant" }) so comparisons do not vary across machines; update the calls in compareDisplayRows (referencing left.model, left.harness, left.test, left.passType and their right.* counterparts) to include the chosen locale and options.

test/goose-adapter.test.ts (1)
238-245: Don’t make these regressions depend on taint-reason ordering.
`signalAssessment.reasons` is effectively set-like metadata. Matching the exact array order makes these tests fail on harmless merge/refactor changes even when the same reasons are present. Prefer membership-based assertions or compare sorted arrays.

♻️ Suggested assertion pattern
```diff
-        signalAssessment: {
-          classification: "tainted",
-          reasons: ["internal_tool_transcript", "agent_requested_input"],
-        },
+        signalAssessment: expect.objectContaining({
+          classification: "tainted",
+          reasons: expect.arrayContaining([
+            "internal_tool_transcript",
+            "agent_requested_input",
+          ]),
+        }),
```

Also applies to: 279-285
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/goose-adapter.test.ts` around lines 238 - 245, The test currently asserts exact ordering on signalAssessment.reasons which is brittle; update the assertion in the failing expectation (the block containing signalAssessment: { classification: "tainted", reasons: [...] }) to assert membership rather than order — e.g., use expect.arrayContaining([...]) for signalAssessment.reasons or compare a sorted copy (sort both actual and expected) so the test verifies the presence of "internal_tool_transcript" and "agent_requested_input" regardless of order; apply the same change to the other occurrence around lines 279-285 that also asserts signalAssessment.reasons.

src/schemas/common.schema.ts (1)
25-32: Keep artifact runtimes independent from the active runtime list.
`artifactRuntimeNames = supportedRuntimeNames` means the next active-runtime removal will also narrow artifact parsing unless someone remembers to split the arrays in the same edit. Since this file is introducing separate “active” vs “persisted” runtime surfaces, make the artifact list its own literal set now.

♻️ Suggested refactor
```diff
 /** Valid runtime names allowed when reading historical plan/result artifacts. */
-export const artifactRuntimeNames = supportedRuntimeNames;
+export const artifactRuntimeNames = ["ollama", "vllm"] as const;
```

As per coding guidelines, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 25 - 32, artifactRuntimeNames currently aliases supportedRuntimeNames which couples persisted artifact parsing to the active runtime list; replace that alias with an explicit, independent literal array of allowed persisted runtimes (assign a new array to artifactRuntimeNames rather than referencing supportedRuntimeNames), then keep ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName = z.infer<typeof ArtifactRuntimeNameSchema> so the schema/type use the new independent list; ensure you update only artifactRuntimeNames (not supportedRuntimeNames) so future active-runtime removals won't change artifact parsing and remember to increment schemaVersion/migration if this is a breaking change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 69-76: The ModelFilterDropdown component accepts props without
runtime checking; add a Zod schema for the expected shape (matching
ModelFilterDropdownProps) and validate props at the top of the exported
ModelFilterDropdown function, throwing a TypeError when validation fails; after
validation use the parsed/validated values (models, selectedModels,
onSelectionChange) for the rest of the component (e.g., for
normalizedSelectedIds and containerRef) so internal logic relies on
runtime-validated data and the documented contract is enforced.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 149-152: The protocolOnlyReasons computation uses stdout directly,
causing taint from the actual parsed stream (the result of
selectProcessOutput(stdout, stderr) passed into parseOpenCodeEvents()) to be
ignored; update the protocolOnlyReasons assignment in opencode-adapter.ts so it
calls getTranscriptOrInputTaintReasons(...) on the exact stream/string that was
passed into parseOpenCodeEvents/selectProcessOutput (the parsed stream variable)
instead of always using stdout—i.e., locate the protocolOnlyReasons declaration
and replace getTranscriptOrInputTaintReasons(stdout) with
getTranscriptOrInputTaintReasons(parsedStream) or the actual variable holding
selectProcessOutput(...) so stderr-only protocol JSONL preserves its
internal_tool_transcript taint.
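The fix above can be sketched as follows. This is a minimal stand-in, not the real adapter: `selectProcessOutput` and `getTranscriptOrInputTaintReasons` are simplified versions of the names referenced in the comment, and the JSONL detection heuristic is invented for illustration.

```typescript
// Sketch: taint-check the same stream that was parsed, not stdout
// unconditionally. All bodies here are simplified stand-ins.
function selectProcessOutput(stdout: string, stderr: string): string {
  // Fall back to stderr when stdout carries no content.
  return stdout.trim().length > 0 ? stdout : stderr;
}

function getTranscriptOrInputTaintReasons(stream: string): string[] {
  // Stand-in detector: flag protocol JSONL transcripts.
  return stream.includes('"type":"tool_call"')
    ? ["internal_tool_transcript"]
    : [];
}

function protocolOnlyReasons(stdout: string, stderr: string): string[] {
  const parsedStream = selectProcessOutput(stdout, stderr);
  // Previously this passed stdout, dropping stderr-only taint.
  return getTranscriptOrInputTaintReasons(parsedStream);
}
```

With this shape, a stderr-only protocol transcript still yields the `internal_tool_transcript` reason.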
In `@src/harnesses/opencode-events.ts`:
- Around line 311-317: When extractFromToolCallObject(event) returns a non-empty
string and you assign it to toolCallOutput, also set hasToolUse = true so the
parsed output reflects that tool content was recovered; update both locations
where this pattern appears (the block using extractedToolCallOutput /
toolCallOutput around extractFromToolCallObject and the similar block at lines
~347-348) to flip hasToolUse when assigning the recovered tool-call string. This
change should reference extractedToolCallOutput, toolCallOutput, hasToolUse, and
extractFromToolCallObject in the affected function(s).
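The described change can be illustrated with a small sketch; `extractFromToolCallObject` and the state fields here are hypothetical stand-ins for the parser internals.

```typescript
// Sketch: when tool-call content is recovered, also flip hasToolUse.
function extractFromToolCallObject(event: { output?: string }): string {
  return event.output ?? "";
}

function applyToolCallOutput(
  event: { output?: string },
  state: { toolCallOutput: string; hasToolUse: boolean },
): void {
  const extracted = extractFromToolCallObject(event);
  if (extracted.length > 0) {
    state.toolCallOutput = extracted;
    state.hasToolUse = true; // previously left false on this path
  }
}
```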
In `@src/runner/index.ts`:
- Around line 280-310: The residency-guard failure branch currently continues
without running post-item teardown, which can leave a previous model resident
and poison subsequent blocks; before the early continue in the catch block,
invoke runPostItemResidencyTeardown with the same context used for successful
items (pass item, plan, config, log, and any residency state) to ensure the
model is unloaded/cleaned up even on failure, then proceed to record the failure
via buildResidencyGuardFailureResult and write the progress checkpoint as
before.
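The control flow of that fix can be sketched as below. All names are hypothetical stand-ins for the runner internals; the point is only that teardown runs on both the success and failure paths before the failure is recorded.

```typescript
// Sketch: run post-item teardown even when the residency guard fails,
// so a failed item cannot leave the previous model resident.
function executeItemWithGuard(
  runItem: () => void,
  runPostItemResidencyTeardown: () => void,
  recordFailure: (error: unknown) => void,
): "ok" | "failed" {
  let failure: unknown;
  let failed = false;
  try {
    runItem();
  } catch (error) {
    failure = error;
    failed = true;
  }
  // Teardown runs on both paths (previously skipped on failure).
  runPostItemResidencyTeardown();
  if (failed) {
    recordFailure(failure);
    return "failed";
  }
  return "ok";
}
```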
In `@src/runner/model-eligibility.ts`:
- Around line 97-99: The thrown error message always says "(embedding-only)"
even when the real cause is that the model's capabilities.generateText is false;
update the throw in model-eligibility.ts (the branch referencing model, runtime,
detectedModelKind, and capabilities.generateText) to compute an accurate reason
string (e.g., if detectedModelKind === 'embedding' use "embedding-only", else if
!capabilities.generateText use "does not support text generation" or include the
capability flag) and include that reason in the Error message so operators see
the true exclusion cause.
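A minimal sketch of the reason computation described above; the field names mirror the review comment and are assumptions about the real module.

```typescript
// Sketch: report the actual exclusion cause instead of always
// claiming "embedding-only".
interface ModelCapabilities {
  generateText: boolean;
}

function exclusionReason(
  detectedModelKind: string,
  capabilities: ModelCapabilities,
): string {
  if (detectedModelKind === "embedding") return "embedding-only";
  if (!capabilities.generateText) return "does not support text generation";
  return "unknown exclusion";
}
```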
In `@src/runner/plan-builder.ts`:
- Around line 102-110: Currently non-executable entries in config.runtimes are
silently filtered out (configuredRuntimes → runtimes using
isExecutableRuntimeName); instead, validate configuredRuntimes up-front: compute
the set of invalid names = configuredRuntimes.filter(r =>
!isExecutableRuntimeName(r)) and if invalids.length > 0 throw an Error listing
the invalids and the allowed RUNTIME_NAMES; only after that set runtimes =
configuredRuntimes as RuntimeName[] (or keep the existing filter as a sanity
check) so we fail loudly on bad config rather than masking it.
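The fail-loud validation can be sketched as follows. `RUNTIME_NAMES` and `isExecutableRuntimeName` mirror names from the comment and are assumptions about the real module.

```typescript
// Sketch: reject unknown runtime names up-front instead of silently
// filtering them out of the plan.
const RUNTIME_NAMES = ["ollama"] as const;
type RuntimeName = (typeof RUNTIME_NAMES)[number];

function isExecutableRuntimeName(name: string): name is RuntimeName {
  return (RUNTIME_NAMES as readonly string[]).includes(name);
}

function resolveRuntimes(configuredRuntimes: string[]): RuntimeName[] {
  const invalid = configuredRuntimes.filter(
    (name) => !isExecutableRuntimeName(name),
  );
  if (invalid.length > 0) {
    throw new Error(
      `Unsupported runtime(s): ${invalid.join(", ")}. ` +
        `Allowed: ${RUNTIME_NAMES.join(", ")}`,
    );
  }
  return configuredRuntimes as RuntimeName[];
}
```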
In `@src/runtimes/ollama-residency.ts`:
- Around line 261-266: The residency check in ensureOnlyOllamaModelLoaded uses
isSameOllamaModel but does not trim whitespace from config.allowedModel, so
values like " qwen3.6 " are treated as different; update
ensureOnlyOllamaModelLoaded (or its config validation) to trim() the
allowedModel string from EnsureOnlyOllamaModelLoadedConfig before any calls to
isSameOllamaModel or residency comparisons, and throw if allowedModel is empty
after trimming to avoid silent fallbacks.
- Around line 203-245: Both listRunningOllamaModels and unloadOllamaModel must
validate and normalize incoming config strings: trim config.baseUrl and
config.model (where present) and throw a clear Error if baseUrl or model is
empty/whitespace-only after trimming; also normalize baseUrl by removing any
trailing slash before constructing endpoint (the code that builds endpoint uses
`${config.baseUrl}/api/...`). Update
OllamaResidencyBaseConfig/UnloadOllamaModelConfig consumers in
listRunningOllamaModels (endpoint) and unloadOllamaModel (endpoint and request
body.model) to perform this validation/normalization at the top of each function
and reject invalid input immediately.
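The trim-and-normalize behavior asked for above can be sketched with a hypothetical helper (the real config types live in `src/runtimes/ollama-residency.ts`):

```typescript
// Sketch: validate and normalize a base URL before building endpoints.
function normalizeBaseUrl(baseUrl: string): string {
  const trimmed = baseUrl.trim();
  if (trimmed.length === 0) {
    throw new Error("baseUrl must be a non-empty string");
  }
  // Strip trailing slashes so `${base}/api/...` never doubles up.
  return trimmed.replace(/\/+$/, "");
}
```

Usage: `` `${normalizeBaseUrl(" http://localhost:11434/ ")}/api/ps` `` yields `http://localhost:11434/api/ps`. The same trim-then-reject-empty pattern applies to `config.model` and `allowedModel`.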
In `@src/runtimes/ollama-runtime.ts`:
- Around line 37-42: Normalize and validate the 'family' and 'families' fields
at the HTTP boundary by updating the zod 'details' schema: trim whitespace and
reject empty/blank strings for the single 'family' field and for each element of
the 'families' array so downstream functions like inferModelKind() and returned
metadata only see normalized values; specifically, change the 'family' schema to
apply a trim transform and a non-empty check and change 'families' to be an
array of trimmed/non-empty strings (keep both optional semantics), so invalid or
blank entries are rejected during parsing.
In `@test/opencode-output.test.ts`:
- Around line 88-106: The test currently uses a non-write second entry (command:
"edit") so it doesn't actually verify that a later write beats an earlier one;
update the fixture in the "extracts the latest batched write from alternate
tool-call shapes" test that calls parseOpenCodeEvents so the second toolCalls
element is a write-shaped entry (e.g., { tool: "write", raw: { content: "export
const second = 2;" } }) instead of command: "edit", leaving the expectations
(parsed.output and parsed.method) unchanged so the test truly exercises the
"latest write wins" behavior.
In `@test/schemas.test.ts`:
- Around line 223-244: The test currently allows runtimes: ["vllm"] but
BenchConfigSchema should reject that; update the second assertion so parsing {
runtimes: ["vllm"] } throws instead of returning a value. Specifically, change
the expectation around BenchConfigSchema.parse({ runtimes: ["vllm"] }) in the
test to use expect(() => BenchConfigSchema.parse(...)).toThrow(...) (matching
"runtimes" or "vllm") while keeping the other assertions (e.g., the
modelProfiles variant check for vllm) unchanged.
- Around line 69-74: The test is incorrectly asserting that
SupportedRuntimeNameSchema accepts "vllm", collapsing the supported/artifact
split; update the test so SupportedRuntimeNameSchema.parse("vllm") throws (or
remove that assertion) while keeping ArtifactRuntimeNameSchema.parse("vllm")
asserting it returns "vllm", i.e., ensure SupportedRuntimeNameSchema only
accepts supported names (e.g., "ollama") and legacy "vllm" remains covered by
ArtifactRuntimeNameSchema.
---
Outside diff comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 363-384: MatrixItemResultSchema currently references
ScoringMetricsSchema but that schema only includes durationMs, causing fields
like retryGenerationDurationMs, retryKind, retryReason, retryAttempted, and
retryPromoted to be stripped; update ScoringMetricsSchema to include these
additional optional fields and then ensure MatrixItemResultSchema continues to
reference ScoringMetricsSchema (and export remains unchanged) so live run.json
payloads retain those properties; locate ScoringMetricsSchema in the same file
(or its import) and add the five new optional fields with appropriate types
(numbers/strings/booleans) to match existing usages.
In `@src/cli/migrate-machine-command.ts`:
- Around line 334-337: The action handler currently calls
MigrateMachineCommandOptionsSchema.parse(options) and then unconditionally calls
process.exit(1) for any caught error; change this so Zod/validation errors from
MigrateMachineCommandOptionsSchema.parse (or other user-input validation) are
handled separately: detect the validation error (e.g., instanceof ZodError or
error.name === 'ZodError'), print a helpful validation message to the user and
return/exit with success (no non-zero exit), while only calling process.exit(1)
for unexpected runtime/crash errors; update the .action(async (options) => { ...
}) catch block to branch on validation vs unexpected errors and avoid forcing a
non-zero exit for user validation issues.
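The validation-vs-crash branching can be sketched like this; `handleCommandError` is a hypothetical helper, and the ZodError detection uses the `error.name` check mentioned in the comment so the sketch has no hard zod dependency.

```typescript
// Sketch: return the exit code instead of calling process.exit inline,
// treating validation errors as user feedback rather than crashes.
function handleCommandError(error: unknown): number {
  const isValidationError =
    error instanceof Error && error.name === "ZodError";
  if (isValidationError) {
    console.error(`Invalid options: ${error.message}`);
    return 0; // user-input problem: explain, do not fail the process
  }
  console.error("Unexpected failure:", error);
  return 1; // genuine runtime error
}
```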
---
Duplicate comments:
In `@llm/implementation/harnesses-implementation.md`:
- Around line 340-345: The doc is inconsistent about how Goose enables tools:
reconcile the "Tool-Calling Mode (Goose)" section with the table entry "Goose |
Per-item provider config + permission rules via `goose run`" by choosing the
intended mechanism and updating the other place to match; either remove the
reference to the CLI flag `--with-builtin developer` from "Tool-Calling Mode
(Goose)" and replace it with a clear description of per-item provider config +
permission rules (including mention of using `goose run`), or conversely, change
the table to reflect the CLI flag approach—ensure all references to
`--with-builtin developer`, "Tool-Calling Mode (Goose)", and the table entry are
consistent and use the same terminology and steps.
In `@README.md`:
- Around line 142-143: Update the README line that currently claims
"auto-discovers available runtimes, models, harnesses, and tests" for the `bun
pb` example so it no longer states runtimes are auto-discovered; change the
wording to indicate `bun pb` auto-discovers only models, harnesses, and tests
(and that new runs are Ollama-only) so the example matches the current behavior.
In `@src/harnesses/opencode-events.ts`:
- Around line 240-249: joinOrderedOutputParts currently collapses a raw
(non-"text") part into the following text part (e.g., raw -> text becomes
rawtext); update the function (joinOrderedOutputParts and references to
OrderedOutputPart) so that when appending a non-"text" part you always preserve
its boundary by ensuring a newline separates it from subsequent text—append the
raw part.content and then, if the output does not already end with '\n', append
a single '\n' (avoid duplicating newlines), rather than conditionally prefixing
the raw piece with a newline based on output length.
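The boundary rule can be sketched as below; `OrderedOutputPart` mirrors the name in the comment, and the shape is a simplified assumption about the real type.

```typescript
// Sketch: after appending a non-"text" part, guarantee exactly one
// trailing newline so raw and text content never fuse (raw -> "rawtext").
type OrderedOutputPart = { kind: "text" | "raw"; content: string };

function joinOrderedOutputParts(parts: OrderedOutputPart[]): string {
  let output = "";
  for (const part of parts) {
    output += part.content;
    if (part.kind !== "text" && !output.endsWith("\n")) {
      output += "\n"; // preserve the boundary without duplicating newlines
    }
  }
  return output;
}
```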
---
Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 199-203: The "Select all" onClick currently builds a new array and
manually sorts via MODEL_SORT_COLLATOR without deduplicating; replace that logic
to call the existing helper dedupeAndSortModels(models) and pass its result to
onSelectionChange to ensure canonical dedupe+sort behavior (update the onClick
that currently calls onSelectionChange([...models].sort(...)) to use
dedupeAndSortModels(models) instead).
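A sketch of the canonical helper referenced above; the collator options are assumptions, since the real `MODEL_SORT_COLLATOR` lives in the dashboard code.

```typescript
// Sketch: dedupe then sort with one shared collator so every selection
// path produces the same canonical list.
const MODEL_SORT_COLLATOR = new Intl.Collator("en-US", { numeric: true });

function dedupeAndSortModels(models: string[]): string[] {
  return [...new Set(models)].sort(MODEL_SORT_COLLATOR.compare);
}
```

The "Select all" handler would then be `onSelectionChange(dedupeAndSortModels(models))`.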
In `@src/cli/compare-formatters.ts`:
- Around line 26-36: The comparator compareDisplayRows currently uses
localeCompare() with the environment default; make sorting deterministic by
pinning the locale and options, e.g. call localeCompare on model, harness, test,
and passType with a fixed locale (like "en" or "en-US") and explicit options
(for example { numeric: true, sensitivity: "variant" }) so comparisons do not
vary across machines; update the calls in compareDisplayRows (referencing
left.model, left.harness, left.test, left.passType and their right.*
counterparts) to include the chosen locale and options.
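The pinned-locale variant suggested above can be sketched as a tiny comparator; the chosen locale and options follow the comment's example.

```typescript
// Sketch: pin locale and options so localeCompare no longer depends on
// the machine's default collation.
function compareDeterministic(a: string, b: string): number {
  return a.localeCompare(b, "en-US", {
    numeric: true,
    sensitivity: "variant",
  });
}
```

With `numeric: true`, `"a2"` sorts before `"a10"`, which is usually the intent for model and test names containing version numbers.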
In `@src/schemas/common.schema.ts`:
- Around line 25-32: artifactRuntimeNames currently aliases
supportedRuntimeNames which couples persisted artifact parsing to the active
runtime list; replace that alias with an explicit, independent literal array of
allowed persisted runtimes (assign a new array to artifactRuntimeNames rather
than referencing supportedRuntimeNames), then keep ArtifactRuntimeNameSchema =
z.enum(artifactRuntimeNames) and ArtifactRuntimeName = z.infer<typeof
ArtifactRuntimeNameSchema> so the schema/type use the new independent list;
ensure you update only artifactRuntimeNames (not supportedRuntimeNames) so
future active-runtime removals won't change artifact parsing and remember to
increment schemaVersion/migration if this is a breaking change.
In `@test/build-index.test.ts`:
- Around line 28-48: The test's local REQUIRED_LIB_ASSETS duplicates the asset
list from src/lib/benchmark-checkpoint.ts; remove the hardcoded array and import
a single exported constant from the benchmark-checkpoint module (or a new shared
helper module) to centralize the fixture data. Update test/build-index.test.ts
to import the exported symbol (e.g., BENCHMARK_REQUIRED_ASSETS or
REQUIRES_ASSETS) and use it in place of the local REQUIRED_LIB_ASSETS, and add
the export to src/lib/benchmark-checkpoint.ts (or a new shared file) so both
production code and tests reference the same source of truth.
In `@test/goose-adapter.test.ts`:
- Around line 238-245: The test currently asserts exact ordering on
signalAssessment.reasons which is brittle; update the assertion in the failing
expectation (the block containing signalAssessment: { classification: "tainted",
reasons: [...] }) to assert membership rather than order — e.g., use
expect.arrayContaining([...]) for signalAssessment.reasons or compare a sorted
copy (sort both actual and expected) so the test verifies the presence of
"internal_tool_transcript" and "agent_requested_input" regardless of order;
apply the same change to the other occurrence around lines 279-285 that also
asserts signalAssessment.reasons.
In `@test/item-retry.test.ts`:
- Around line 69-186: Extract the repeated compile-retry test setup into a small
helper to remove duplication: create a helper (e.g., runCompileRetryScenario or
setupCompileRetry) in test/item-retry.test.ts that accepts the varying bits
(item name, initial generation.output/durationMs, generateMock resolved
output/durationMs, first scoring failure object, promptForRetry) and internally
imports runScoringWithCompileRetry, configures generateMock and
scoreGenerationMock, constructs harnessForRetry/runtimeForRetry, calls
runScoringWithCompileRetry with the shared options (timeoutMs, unloadAfter, log,
supportsCompileRetry) and returns the outcome; then update both tests to call
this helper and assert on the returned outcome and on generateMock call
expectations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e139aef-83e9-4aad-98b3-76391c93f0ab
📒 Files selected for processing (159)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/chart-colors.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- package.json
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-process.ts
- src/harnesses/goose-openai.ts
- test/vllm-runtime.test.ts
- src/runtimes/discovery.ts
- src/harnesses/opencode-model.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
- src/harnesses/opencode-output.ts