Conversation
[codex] bench: harden signal assessment and retry fairness
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Converges execution to Ollama (schema → 0.5.2), adds Ollama residency enforcement, enriches signal assessment (new taint reasons and merging), introduces model exclusions, refactors OpenCode into modular runner/CLI/artifacts/permissions/events, threads taint through harnesses and the runner, and adds many tests and fixtures.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Runner as Runner
  participant Plan as PlanBuilder
  participant Executor as ItemExecutor
  participant Runtime as OllamaRuntime
  participant Harness as HarnessAdapter
  participant OpenCode as OpenCodeRunner
  participant Events as EventParser
  participant Assessor as SignalAssessor
  participant Writer as ResultWriter
  Runner->>Plan: buildRunPlan(config)
  Runner->>Executor: executeItem(item)
  Executor->>Runtime: createRuntime(ollamaBaseUrl)
  Executor->>Harness: generate(prompt, opts)
  Note right of Harness: OpenCode path prepares artifacts/config and builds args
  Harness->>OpenCode: runOpenCodeCommand(args, env, cwd, timeout)
  OpenCode-->>Harness: stdout/stderr/exitCode
  Harness->>Events: parseOpenCodeEvents(stdout + stderr)
  Events-->>Harness: { output, method, hasToolUse, permissionDenied }
  Harness->>Assessor: getTranscriptOrInputTaintReasons(raw + normalized)
  Assessor-->>Harness: [taintReasons]
  Harness->>Assessor: mergeSignalAssessments(existing?, taintAssessment)
  Assessor-->>Executor: signalAssessment
  Executor->>Assessor: finalizeItemSignalAssessment({ existing?, automatedScore?, rowFailed?, output, outputSource })
  Assessor-->>Executor: finalSignalAssessment
  Executor->>Writer: writeResult(result)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/schemas/common.schema.ts (1)
174-182: ⚠️ Potential issue | 🟠 Major

Version this additive on-disk schema change.

These new `signalAssessment.reasons` literals will be emitted into artifacts, but `SCHEMA_VERSION` stays `0.5.0`. That leaves two different serialized shapes under the same version and prevents explicit compatibility or migration handling. As per coding guidelines: "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 174-182: This change adds new literals to signalAssessmentReasonTypes which alters on-disk artifacts, so bump the package schema version constant (SCHEMA_VERSION) to a new semver (e.g., 0.5.1) and ensure any serialized objects include the updated schemaVersion field where artifacts are created/serialized (search for usages that stamp schemaVersion in export/save code and update them to emit the new value); update tests/fixtures that assert schemaVersion or serialized shapes to the new version as well.

src/runner/item-executor.ts (1)
138-157: ⚠️ Potential issue | 🟠 Major

Carry harness-supplied failure metadata through these catch blocks.

Both paths rebuild the failure from `error.message` only, so the `output`, `signalAssessment`, and `durationMs` now attached by `src/harnesses/goose-adapter.ts` and `src/harnesses/opencode-adapter.ts` are dropped. A failed row can therefore lose the very taint reason this PR is adding and end up recorded as a generic trustworthy failure. Based on learnings: "For per-item runtime failures (timeouts, model errors, eval failures), record structured failure in run.json and continue the matrix".
Also applies to: 334-368
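The merge described above can be sketched as follows; the field names (`failureType`, `output`, `durationMs`, `signalAssessment`) mirror the review text, but the exact shapes are assumptions about this codebase, not its actual types:

```typescript
// Hypothetical shape of extra metadata a harness may attach to a thrown value.
interface ThrownWithMetadata {
  message?: string;
  output?: string;
  durationMs?: number;
  failureType?: string;
  signalAssessment?: unknown;
}

/**
 * Prefer harness-supplied fields on the thrown value; only fall back to
 * generic classification when the harness did not name a failure type.
 */
function mergeFailure(error: unknown, classify: (msg: string) => string) {
  const e = (error ?? {}) as ThrownWithMetadata;
  const message = typeof e.message === "string" ? e.message : String(error);
  return {
    message,
    type: e.failureType ?? classify(message),
    // Preserve harness-supplied output, duration, and taint assessment.
    output: e.output,
    durationMs: e.durationMs,
    signalAssessment: e.signalAssessment,
  };
}
```

Applying the same helper at both catch sites keeps the taint reason and duration intact instead of overwriting them.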
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/item-executor.ts` around lines 138 - 157, The catch blocks reconstruct failure metadata from error.message and drop harness-provided fields; update the catch handlers in item-executor.ts (the blocks that set generation, generationFailure, and call finalizeItemSignalAssessment) to merge/retain any harness-supplied metadata present on the thrown value (e.g. (error as any).output, .durationMs, .failureType, and any signalAssessment info) instead of overwriting them—use the existing classifyGenerationError(errorMessage) only if a failureType isn't present, set generation.durationMs to the harness-provided duration if available, preserve generation.output when provided, and build generationFailure.type/message from the harness fields when present; apply the same merging logic to both catch sites that currently call finalizeItemSignalAssessment so the harness-supplied taint reason and duration are carried through.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 515-529: The error handler currently only inspects
execaError.stderr, losing context printed to stdout; update the catch branch in
goose-adapter.ts (the block using execaError, getTranscriptOrInputTaintReasons,
and appendSignalAssessmentReasons) to aggregate stdout and stderr (e.g., const
combined = `${execaError.stdout || ''}\n${execaError.stderr || ''}`), run
getTranscriptOrInputTaintReasons(combined) and use combined for the Error
message and output property, and adjust the execaError type to include stdout so
both streams are preserved in the thrown Error's signalAssessment and output.
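A minimal sketch of that aggregation, assuming an execa-style error object that exposes `stdout` and `stderr`, with the error message as a fallback when both streams are empty (the interface name is a hypothetical stand-in):

```typescript
// Hypothetical stand-in for the execa error shape used in the adapter.
interface ExecaLikeError {
  message: string;
  stdout?: string;
  stderr?: string;
}

/** Combine both streams; fall back to the error message when both are empty. */
function combinedFailureOutput(error: ExecaLikeError): string {
  const combined = `${error.stdout ?? ""}\n${error.stderr ?? ""}`.trim();
  return combined.length > 0 ? combined : error.message;
}
```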
In `@src/lib/signal-assessment.ts`:
- Around line 22-31: The current INTERNAL_TOOL_TRANSCRIPT_PATTERNS is too
permissive and matches isolated tokens; tighten detection by replacing the
single-pattern test with a composite matcher that requires stronger transcript
shapes—e.g., check for JSON-like pairs appearing together (require both
"sessionID" and a nearby "type":"tool_|step_" within ~200 chars or same line),
or require tool-specific markers plus filePath/function signatures together
(e.g., "<function=bash>" with "<parameter=filePath>" or "filePath:" in the same
block); update the code that inspects failed rows to use this new composite
matcher (instead of matching any entry in INTERNAL_TOOL_TRANSCRIPT_PATTERNS
directly) and add a regression test that feeds a benign artifact containing only
"sessionID" or "step_start"/"tool_call" tokens and asserts it is NOT classified
as internal_tool_transcript while another test includes the stronger multi-token
transcript and asserts it IS classified.
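One way to realize the composite matcher described above; the 200-character window and exact token pairs are taken from the review text as assumptions, not the project's tuned values:

```typescript
/**
 * Treat text as an internal tool transcript only when transcript-shaped
 * tokens co-occur, never when a single token like "sessionID" appears alone.
 */
function looksLikeInternalToolTranscript(text: string): boolean {
  const sessionIdx = text.indexOf('"sessionID"');
  if (sessionIdx !== -1) {
    // Require a nearby tool_/step_ type marker within ~200 characters.
    const window = text.slice(sessionIdx, sessionIdx + 200);
    if (/"type"\s*:\s*"(tool_|step_)/.test(window)) return true;
  }
  // Tool-call markers must appear together with a parameter/file marker.
  return (
    text.includes("<function=bash>") &&
    (text.includes("<parameter=filePath>") || text.includes("filePath:"))
  );
}
```

A benign artifact containing only one of the tokens no longer trips the classifier, which is exactly the regression the suggested test pins down.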
---
Outside diff comments:
In `@src/runner/item-executor.ts`:
- Around line 138-157: The catch blocks reconstruct failure metadata from
error.message and drop harness-provided fields; update the catch handlers in
item-executor.ts (the blocks that set generation, generationFailure, and call
finalizeItemSignalAssessment) to merge/retain any harness-supplied metadata
present on the thrown value (e.g. (error as any).output, .durationMs,
.failureType, and any signalAssessment info) instead of overwriting them—use the
existing classifyGenerationError(errorMessage) only if a failureType isn't
present, set generation.durationMs to the harness-provided duration if
available, preserve generation.output when provided, and build
generationFailure.type/message from the harness fields when present; apply the
same merging logic to both catch sites that currently call
finalizeItemSignalAssessment so the harness-supplied taint reason and duration
are carried through.
In `@src/schemas/common.schema.ts`:
- Around line 174-182: This change adds new literals to
signalAssessmentReasonTypes which alters on-disk artifacts, so bump the package
schema version constant (SCHEMA_VERSION) to a new semver (e.g., 0.5.1) and
ensure any serialized objects include the updated schemaVersion field where
artifacts are created/serialized (search for usages that stamp schemaVersion in
export/save code and update them to emit the new value); update tests/fixtures
that assert schemaVersion or serialized shapes to the new version as well.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b8ce16e5-0258-4842-bdc5-f5d7e80b9ec9
📒 Files selected for processing (16)
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-output.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-output.ts
- src/lib/signal-assessment.ts
- src/runner/item-executor.ts
- src/schemas/common.schema.ts
- test/dashboard-schema-compat.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/item-retry.test.ts
- test/opencode-adapter.test.ts
- test/opencode-output.test.ts
- test/signal-assessment.test.ts
- keep stdout/stderr and signal assessments from Goose failures
- carry structured generation failure details through executeItem
- bump schema version and accept prior current-version artifacts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/harnesses/goose-adapter.ts (1)
323-367: ⚠️ Potential issue | 🟠 Major

Scan `stderr` for taint even when `stdout` is present.

`transcriptOrInputReasons` is currently derived from `rawOutput`/`output`, but `rawOutput` is assigned after the stdout→stderr fallback. That means a successful Goose run can print code on `stdout` and leak `agent_requested_input` or `internal_tool_transcript` only on `stderr`, and this adapter will still return no `signalAssessment`.

💡 Suggested fix
```diff
   let output = result.stdout;
   const stderr = result.stderr?.trim() || "";
+  const stdout = result.stdout?.trim() || "";
@@
-  const rawOutput = output;
+  const rawOutput = output;
   const normalized = normalizeGooseOutput(output);
@@
   const transcriptOrInputReasons = Array.from(
     new Set([
+      ...getTranscriptOrInputTaintReasons(stdout),
+      ...getTranscriptOrInputTaintReasons(stderr),
       ...getTranscriptOrInputTaintReasons(rawOutput),
       ...getTranscriptOrInputTaintReasons(output),
     ]),
   );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/goose-adapter.ts` around lines 323 - 367, The code only checks getTranscriptOrInputTaintReasons against rawOutput/output but rawOutput is set after the stdout→stderr fallback, so taint on stderr is missed when stdout exists; update the logic in the goose adapter (around variables rawOutput, output and calls to normalizeGooseOutput/getTranscriptOrInputTaintReasons/appendSignalAssessmentReasons) to always include stderr in the taint scan (for example compute rawOutputParts = [result.stdout, result.stderr] or call getTranscriptOrInputTaintReasons on stderr as well) before building transcriptOrInputReasons, so that signals like agent_requested_input or internal_tool_transcript found only on stderr are detected and passed into appendSignalAssessmentReasons.
🧹 Nitpick comments (1)
test/machine-profile-migration.test.ts (1)
131-159: Expand this schema-bump regression to cover run artifacts too.

Lines 131-159 only validate the `"0.5.0"` compatibility path for `parseKnownPlanPayload`. Add a sibling `parseKnownRunPayload` assertion so plan/run migration guarantees stay symmetric.

♻️ Suggested test extension
```diff
 it("accepts prior current-version artifacts after a schema bump", () => {
   const parsedPlan = parseKnownPlanPayload({
     schemaVersion: "0.5.0",
     runId: "run-current-minus-one",
     createdAt: "2026-03-05T21:51:18.583Z",
     runtimeEnvironment: {
       platform: "darwin",
       bunVersion: "1.3.3",
     },
     machine: LEGACY_MACHINE,
     config: {
       ollamaBaseUrl: "http://localhost:11434",
       vllmBaseUrl: "http://localhost:8000",
       generateTimeoutMs: 120_000,
       passTypes: ["blind"],
     },
     items: [],
     summary: {
       totalItems: 0,
       runtimes: 0,
       models: 0,
       harnesses: 0,
       tests: 0,
     },
   });
+  const parsedRun = parseKnownRunPayload({
+    schemaVersion: "0.5.0",
+    runId: "run-current-minus-one",
+    machine: LEGACY_MACHINE,
+    startedAt: "2026-03-05T21:51:18.583Z",
+    completedAt: "2026-03-05T21:52:18.583Z",
+    durationMs: 60_000,
+    summary: {
+      total: 0,
+      completed: 0,
+      failed: 0,
+      pending: 0,
+    },
+    items: [],
+  });
   expect(parsedPlan.schemaVersion).toBe(SCHEMA_VERSION);
   expect(parsedPlan.machine?.profileKey).toBe(LEGACY_PROFILE_KEY);
+  expect(parsedRun.schemaVersion).toBe(SCHEMA_VERSION);
+  expect(parsedRun.machine?.profileKey).toBe(LEGACY_PROFILE_KEY);
 });
```

Based on learnings: "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/machine-profile-migration.test.ts` around lines 131 - 159, Add a mirrored assertion for run artifacts: alongside the existing parseKnownPlanPayload test that feeds a "0.5.0" payload using LEGACY_MACHINE, call parseKnownRunPayload with the same payload shape (schemaVersion "0.5.0", runId, createdAt, runtimeEnvironment, machine: LEGACY_MACHINE, config, items, summary) and assert the returned object's schemaVersion equals SCHEMA_VERSION and machine?.profileKey equals LEGACY_PROFILE_KEY so plan/run migration behavior stays symmetric (look for parseKnownPlanPayload and add the sibling parseKnownRunPayload assertion near it).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 515-530: The error message and output are being built from
combined which becomes "\n" when both stdout and stderr are empty, causing the
real execaError.message to be lost; update the logic around the
execaError/combined variables in the Goose failure handling block so that you
first compute a trimmedCombined (or similar) from execaError.stdout and
execaError.stderr and if that result is empty use execaError.message as the
fallback for both the thrown Error message and the output property; keep the
existing calls to getTranscriptOrInputTaintReasons and
appendSignalAssessmentReasons unchanged but feed them the effective output
(trimmedCombined or execaError.message) so downstream signalAssessment and
output reflect the real failure.
---
Outside diff comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 323-367: The code only checks getTranscriptOrInputTaintReasons
against rawOutput/output but rawOutput is set after the stdout→stderr fallback,
so taint on stderr is missed when stdout exists; update the logic in the goose
adapter (around variables rawOutput, output and calls to
normalizeGooseOutput/getTranscriptOrInputTaintReasons/appendSignalAssessmentReasons)
to always include stderr in the taint scan (for example compute rawOutputParts =
[result.stdout, result.stderr] or call getTranscriptOrInputTaintReasons on
stderr as well) before building transcriptOrInputReasons, so that signals like
agent_requested_input or internal_tool_transcript found only on stderr are
detected and passed into appendSignalAssessmentReasons.
---
Nitpick comments:
In `@test/machine-profile-migration.test.ts`:
- Around line 131-159: Add a mirrored assertion for run artifacts: alongside the
existing parseKnownPlanPayload test that feeds a "0.5.0" payload using
LEGACY_MACHINE, call parseKnownRunPayload with the same payload shape
(schemaVersion "0.5.0", runId, createdAt, runtimeEnvironment, machine:
LEGACY_MACHINE, config, items, summary) and assert the returned object's
schemaVersion equals SCHEMA_VERSION and machine?.profileKey equals
LEGACY_PROFILE_KEY so plan/run migration behavior stays symmetric (look for
parseKnownPlanPayload and add the sibling parseKnownRunPayload assertion near
it).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 44b63a0b-5e91-4db2-bf85-a11925c28932
📒 Files selected for processing (11)
- src/harnesses/goose-adapter.ts
- src/lib/machine-profile/legacy.ts
- src/lib/signal-assessment.ts
- src/runner/item-executor.ts
- src/schemas/common.schema.ts
- test/dashboard-schema-compat.test.ts
- test/goose-adapter.test.ts
- test/item-executor.test.ts
- test/machine-profile-migration.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
✅ Files skipped from review due to trivial changes (3)
- test/schemas.test.ts
- src/lib/machine-profile/legacy.ts
- test/dashboard-schema-compat.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/goose-adapter.test.ts
- test/signal-assessment.test.ts
- src/schemas/common.schema.ts
- Scan stderr alongside stdout for taint reasons
- Use execa message when Goose emits no stream output
- Cover legacy run migration for machine profiles
- Remove vLLM and OpenAI-compatible runtime paths
- Simplify harness, config, and docs around Ollama-only runs
- Keep legacy vLLM artifacts readable for compare and history
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
README.md (1)
182-205: ⚠️ Potential issue | 🟡 Minor

Update the example model-profile `schemaVersion`.

This example still shows `"0.5.0"`, but the current schema version in this PR is `0.5.1`. As written, a copy-pasted config will be rejected by the parser.

Proposed fix
```diff
-  "schemaVersion": "0.5.0",
+  "schemaVersion": "0.5.1",
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 182 - 205, Update the example model-profile JSON to use the current schema version so pasted configs pass validation: change the "schemaVersion" value in the README example from "0.5.0" to "0.5.1" (the canonical model-profile example block that users load with --model-config should reflect the PR's schemaVersion).
♻️ Duplicate comments (1)
src/runner/item-executor.ts (1)
88-145: ⚠️ Potential issue | 🟡 Minor

Validate `signalAssessment` against the real schema before persisting it.

`isSignalAssessment()` is still only a loose shape check, so malformed payloads from thrown harness/runtime errors can flow straight into run.json as long as they look like `{ classification, reasons: string[] }`. This helper is normalizing unknown input at a boundary; it should parse against the actual schema/invariants, not just the outer shape.

```bash
#!/bin/bash
# Verify whether the persisted signal-assessment schema is stricter than the local guard.
rg -n -C3 'SignalAssessmentSchema|signalAssessmentReasons|SignalAssessmentReason|isSignalAssessment' src
```

Based on learnings, "Validate at boundaries with Zod and pass typed data inward" and "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/item-executor.ts` around lines 88 - 145, The current isSignalAssessment loosenly checks shape and lets malformed payloads pass into persisted run.json; replace that loose guard in extractGenerationFailureDetails with a strict parse against the canonical schema (use SignalAssessmentSchema.parse or safeParse) so only fully valid SignalAssessment objects are accepted; specifically, in extractGenerationFailureDetails use SignalAssessmentSchema.safeParse(errorRecord?.signalAssessment) and set signalAssessment to the parsed value on success or undefined on failure (remove or keep isSignalAssessment only for pre-checks but do not persist its result) so all boundary inputs are validated by the real schema before being written.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/schemas/model-profile.schema.ts`:
- Around line 109-112: The schema change narrowed
ConfiguredModelProfileSchema.variants to SupportedRuntimeNameSchema which breaks
loading legacy persisted model-profile files that contain legacy runtime keys
like "vllm"; revert or relax the variants constraint so loadModelProfiles(...)
accepts unknown/legacy runtime keys (e.g., accept string keys or union with
legacy enum) or add an explicit migration step: update
ConfiguredModelProfileSchema.variants to allow legacy names (including "vllm")
alongside SupportedRuntimeNameSchema OR implement a migration function invoked
by loadModelProfiles that maps legacy keys (like "vllm") to current runtime
names (e.g., Ollama) before validation, ensuring backward compatibility without
changing runtime behavior.
---
Outside diff comments:
In `@README.md`:
- Around line 182-205: Update the example model-profile JSON to use the current
schema version so pasted configs pass validation: change the "schemaVersion"
value in the README example from "0.5.0" to "0.5.1" (the canonical model-profile
example block that users load with --model-config should reflect the PR's
schemaVersion).
---
Duplicate comments:
In `@src/runner/item-executor.ts`:
- Around line 88-145: The current isSignalAssessment loosely checks shape and
lets malformed payloads pass into persisted run.json; replace that loose guard
in extractGenerationFailureDetails with a strict parse against the canonical
schema (use SignalAssessmentSchema.parse or safeParse) so only fully valid
SignalAssessment objects are accepted; specifically, in
extractGenerationFailureDetails use
SignalAssessmentSchema.safeParse(errorRecord?.signalAssessment) and set
signalAssessment to the parsed value on success or undefined on failure (remove
or keep isSignalAssessment only for pre-checks but do not persist its result) so
all boundary inputs are validated by the real schema before being written.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c0c41d59-1088-4aac-b6a9-34ef8ea71fed
📒 Files selected for processing (42)
- README.md
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-config.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/plan-builder.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/opencode-config.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/vllm-runtime.test.ts
💤 Files with no reviewable changes (11)
- src/runner/index.ts
- test/opencode-config.test.ts
- test/benchmark-checkpoint.test.ts
- package.json
- src/lib/benchmark-checkpoint.ts
- src/harnesses/goose-openai.ts
- test/build-index.test.ts
- test/vllm-runtime.test.ts
- src/runtimes/discovery.ts
- src/runtimes/vllm-runtime.ts
- src/lib/openai-compat-client.ts
✅ Files skipped from review due to trivial changes (4)
- src/lib/model-aliases.ts
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- src/schemas/model-alias.schema.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/schemas.test.ts
- test/item-executor.test.ts
- src/schemas/common.schema.ts
👮 Files not reviewed due to content moderation or server errors (8)
- src/runtimes/runtime.ts
- src/harnesses/opencode-config.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/opencode-adapter.ts
- src/cli/run-command.ts
- llm/project/project-rules.md
- src/schemas/result.schema.ts
- llm/project/user-flow.md
- Remove deprecated `vllmBaseUrl` from plan config
- Add optional Goose turn limits and test categories
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 426-430: Update the four goose turn-count fields (gooseMaxTurns,
gooseRetryMaxTurns, gooseWorkspaceMaxTurns, gooseWorkspaceRetryMaxTurns) to
enforce positive integers instead of any number: replace their
z.number().optional() validators with z.number().int().positive().optional() so
validation matches the core plan.schema.ts rules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c706018b-a626-4ccd-b42b-6afb48e87e8a
📒 Files selected for processing (2)
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- Accept legacy model profile files by dropping unsupported runtime variants
- Validate harness signal assessments with Zod before persisting failures
- Tighten dashboard schema bounds for Goose turn counts
@CodeRabbit full review

✅ Actions performed

Full review triggered.
Actionable comments posted: 6
🧹 Nitpick comments (10)
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx (1)
270-313: Consider adding TSDoc to the exported function.

The exported `BlindVsInformedChart` function lacks TSDoc documentation. Adding it would improve discoverability and align with the project's documentation standards.

📝 Example TSDoc addition
```diff
+/**
+ * Renders a chart comparing pass rates between blind and informed prompts.
+ * Provides model-level and harness-level breakdowns with delta summaries.
+ *
+ * @param items - Array of matrix item results to analyze
+ * @returns A card component containing tabbed charts and summaries
+ */
 export function BlindVsInformedChart({ items }: BlindVsInformedChartProps) {
```

As per coding guidelines: "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx` around lines 270 - 313, Add a TSDoc block above the exported BlindVsInformedChart function describing its purpose, parameters, and return value; document the items parameter (type BlindVsInformedChartProps) and what the component renders (chart tabs for model/harness breakdown), include `@param` for items and `@returns` React.ReactElement (or JSX.Element), and add any `@throws` if it can throw—place the doc comment immediately above the export function BlindVsInformedChart declaration to satisfy the project guideline that all exported functions have TSDoc.

apps/dashboard/src/lib/api.ts (1)
1-4: Move the module header back to the top of the file.

The new import block now appears before the required file header. Put the purpose/exports/invariants header first, then imports.
♻️ Proposed reorder
```diff
-import {
-  parseKnownPlanPayload,
-  parseKnownRunPayload,
-} from "../../../../src/lib/machine-profile/legacy.js";
 /**
  * Purpose: Fetch run data from static JSON files.
  * Exports: fetchRuns, fetchRun, fetchPlan, fetchRunWithPlan, fetchDashboardIndex, fetchLatestAggregate
@@
  * Data is loaded from the results directory via Vite's dev server.
  * All fetched JSON is validated with Zod schemas at this boundary.
  */
+import {
+  parseKnownPlanPayload,
+  parseKnownRunPayload,
+} from "../../../../src/lib/machine-profile/legacy.js";
```

As per coding guidelines `**/*.{ts,tsx,js,jsx}`: Every file begins with a short header documenting purpose, exports, and invariants.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/api.ts` around lines 1 - 4, The file header documenting purpose/exports/invariants must be the very first thing in the module; move the existing module header comment so it appears before the import block that currently brings in parseKnownPlanPayload and parseKnownRunPayload from "../../../../src/lib/machine-profile/legacy.js", and ensure the header briefly states the file's purpose, its exported items, and any invariants as per the coding guideline so the header precedes all imports and code.

llm/project/user-flow.md (1)
209-209: Clarify line 209 to avoid mixed messaging about runtime selection.

Line 209 says runtime is fixed, but "unless explicitly passed as `--runtimes ollama`" still reads like a selectable runtime path. Consider simplifying to state that runtime is fixed to `ollama` in MVP.

Suggested doc tweak
```diff
-- **Discovery**: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` unless explicitly passed as `--runtimes ollama`.
+- **Discovery**: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` in MVP.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/project/user-flow.md` at line 209, Edit the "Discovery" bullet so it clearly states the runtime decision for MVP: remove the conditional phrase "unless explicitly passed as `--runtimes ollama`" and reword to say "Discovery: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` for the MVP." Reference the "Discovery" bullet and the string "`--runtimes ollama`" when making the change to ensure the mixed message is eliminated.

llm/implementation/multi-runtime-mvp-implementation.md (1)
1-5: Consider making the historical framing even harder to miss.

Line 1 and Line 5 are clear, but a quick scan can still misread later sections as current behavior. A small follow-up (e.g., a "Historical Behavior (Feb 8, 2026)" heading) would reduce ambiguity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/implementation/multi-runtime-mvp-implementation.md` around lines 1 - 5, Add a very prominent historical warning header and timestamp to the top of the document so readers cannot mistake it for current behavior: change the existing opening lines in Multi-Runtime MVP Implementation to include a bold/ALL-CAPS heading like "HISTORICAL BEHAVIOR (Feb 8, 2026) — DO NOT USE FOR CURRENT RUNTIME" and add a one-line subheading such as "Archived for artifact/debugging context; active benchmarks are Ollama-only" (replace the current first paragraph). Ensure the new header appears before any other text so the historical status is immediately obvious on quick scans.

src/cli/run-command.ts (1)
83-107: Consider a temporary deprecated `--vllm-url` shim for smoother script migration.

With Ollama-only execution, this is directionally correct; however, existing CI/scripts using `--vllm-url` will fail hard. A one-release no-op flag that warns would reduce breakage.

Possible transition patch
```diff
 export const runCommand = new Command("run")
@@
   .option("--ollama-url <url>", "Ollama API base URL", "http://localhost:11434")
+  .option(
+    "--vllm-url <url>",
+    "Deprecated: ignored (runtime execution is Ollama-only)",
+  )
@@
   .action(async (options) => {
     try {
+      if (typeof options.vllmUrl === "string" && options.vllmUrl.trim().length > 0) {
+        logger.warn(
+          "Warning: --vllm-url is deprecated and ignored (Ollama-only execution)",
+        );
+      }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/run-command.ts` around lines 83 - 107, Add a temporary deprecated shim flag --vllm-url that accepts a URL and maps it to the existing --ollama-url behavior: update the CLI option definitions near the existing .option("--ollama-url <url>") to also declare .option("--vllm-url <url>") as a deprecated alias, when parsing prefer the explicit --ollama-url but if --vllm-url is present set the same internal variable (e.g., ollamaUrl) and emit a single-time warning message informing users the flag is deprecated and will be removed next release; ensure the shim does not change runtime semantics and that the warning uses the same logger/console path used elsewhere in this module.

test/model-profile-registry.test.ts (1)
96-97: Consider avoiding schema-version churn in this fixture. Line 96 hardcodes `"0.5.1"` even though this test targets unsupported runtime-key filtering, not version migration behavior. Using `SCHEMA_VERSION` would keep intent stable across future bumps.

♻️ Suggested small cleanup
```diff
 import {
   ConfiguredModelProfileSchema,
   ModelProfileFileSchema,
 } from "../src/schemas/model-profile.schema.js";
+import { SCHEMA_VERSION } from "../src/schemas/index.js";
 ...
-  schemaVersion: "0.5.1",
+  schemaVersion: SCHEMA_VERSION,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/model-profile-registry.test.ts` around lines 96 - 97, Replace the hardcoded schemaVersion string "0.5.1" in the test fixture with the canonical SCHEMA_VERSION constant: import SCHEMA_VERSION into the test and set schemaVersion: SCHEMA_VERSION in the object used by the model-profile registry test (the fixture that currently contains schemaVersion and models). This keeps the test focused on unsupported runtime-key filtering and avoids future churn when the schema version is bumped.

src/harnesses/opencode-config.ts (1)
128-162: Consider removing unused `runtimeName` parameter. Since `runtimeName` can only be `"ollama"` and isn't used to branch logic in the function body (no runtime-specific env tuning remains), consider removing it from the interface to simplify the API.

♻️ Optional simplification
```diff
 export function buildOpenCodeEnv(opts: {
   configPath: string;
   configJson: string;
-  runtimeName: "ollama";
 }): Record<string, string> {
-  const { configPath, configJson, runtimeName } = z
+  const { configPath, configJson } = z
     .object({
       configPath: z.string().min(1),
       configJson: z.string().min(1),
-      runtimeName: RuntimeNameSchema,
     })
     .parse(opts);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-config.ts` around lines 128 - 162, The function buildOpenCodeEnv currently accepts an opts parameter with a runtimeName property that is never used; remove runtimeName from the function signature and from the zod validation to simplify the API: update the opts type to only include configPath and configJson, remove runtimeName from the destructuring and from the z.object() schema (RuntimeNameSchema reference), and adjust any call sites that pass runtimeName to stop providing it or to only pass the two remaining fields so callers compile.

test/plan-builder.test.ts (1)
16-102: Consider extracting fallback helper to a shared test utility. The `fallbackCollectMachineProfile` function is substantial (~85 lines) and may be useful in other test files. Consider extracting it to a shared test fixtures module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/plan-builder.test.ts` around lines 16 - 102, The fallbackCollectMachineProfile helper is large and should be extracted to a shared test fixtures module: create a new test utility file (e.g., test/utils/fixtures.ts) exporting fallbackCollectMachineProfile as a named export, move the function body there (preserving its signature and any helper readNonEmpty logic and types), update test/plan-builder.test.ts to import { fallbackCollectMachineProfile } from the new module, and ensure any TypeScript types referenced by the function are exported or inlined so existing tests compile unchanged.

src/harnesses/opencode-adapter.ts (1)
546-564: Preserve `output`/`durationMs` in the execa-error path too. This branch still only inspects `stderr` and omits `output` plus `durationMs`. When `executeItem` catches this, it falls back to `durationMs: 0` and loses any failure text emitted on stdout, even though the non-zero-exit path already preserves that evidence.

Possible shape
```diff
-    if (error && typeof error === "object" && "stderr" in error) {
-      const execaError = error as { stderr: string; message: string };
-      const errorReasons = [
-        ...(hasPermissionDeniedStderr(execaError.stderr)
+    if (error && typeof error === "object" && "stderr" in error) {
+      const execaError = error as {
+        stdout?: string;
+        stderr?: string;
+        message: string;
+      };
+      const effectiveOutput =
+        [execaError.stdout, execaError.stderr]
+          .map((part) => part?.trim() ?? "")
+          .filter((part) => part.length > 0)
+          .join("\n") || execaError.message;
+      const durationMs = Math.round(performance.now() - startTime);
+      const errorReasons = [
+        ...(hasPermissionDeniedStderr(execaError.stderr ?? "")
           ? (["tool_permission_denied"] as const)
           : []),
-        ...getTranscriptOrInputTaintReasons(execaError.stderr),
+        ...getTranscriptOrInputTaintReasons(effectiveOutput),
       ];
       throw Object.assign(
-        new Error(
-          `OpenCode failed: ${execaError.stderr || execaError.message}`,
-        ),
+        new Error(`OpenCode failed: ${effectiveOutput}`),
         {
           signalAssessment:
             errorReasons.length > 0
               ? appendSignalAssessmentReasons(undefined, errorReasons)
               : undefined,
+          durationMs,
+          output: effectiveOutput,
         },
       );
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 546 - 564, The execa-error branch in the error handler for executeItem only uses execaError.stderr and message, dropping execa output and timing; update the throw to preserve execaError.output (or output/stdout) and durationMs so executeItem can see the real failure text and timing. Specifically, in the block that builds execaError and computes errorReasons (symbols: execaError, hasPermissionDeniedStderr, getTranscriptOrInputTaintReasons, appendSignalAssessmentReasons, signalAssessment), include the original output and durationMs properties when calling Object.assign on the new Error so the thrown object contains output and durationMs for downstream handling by executeItem.

src/schemas/common.schema.ts (1)
38-54: Encode the runtime-list relationship from a single source of truth. `artifactRuntimeNames` repeats `"ollama"` separately from `supportedRuntimeNames`, so the active-runtime and artifact-runtime lists can drift on a future edit. Building the artifact list from `supportedRuntimeNames` makes that invariant explicit.

♻️ Proposed refactor
```diff
 /** Valid runtime names for active benchmark execution. */
 export const supportedRuntimeNames = ["ollama"] as const;

 /** Zod schema for runtimes accepted by current config and execution flows. */
 export const SupportedRuntimeNameSchema = z.literal("ollama");

 /** Runtime name type accepted by current config and execution flows. */
 export type SupportedRuntimeName = z.infer<typeof SupportedRuntimeNameSchema>;

 /** Valid runtime names allowed when reading historical plan/result artifacts. */
-export const artifactRuntimeNames = ["ollama", "vllm"] as const;
+export const artifactRuntimeNames = [...supportedRuntimeNames, "vllm"] as const;
```

As per coding guidelines, "Favor iteration and modularization over code duplication".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 38 - 54, supportedRuntimeNames and artifactRuntimeNames duplicate "ollama", risking drift; change artifactRuntimeNames to derive from supportedRuntimeNames (e.g., create artifactRuntimeNames by spreading supportedRuntimeNames and adding "vllm" with a const assertion) so the runtime-list relationship is encoded in one source of truth, then keep ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName type as z.infer<typeof ArtifactRuntimeNameSchema>.
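The derivation the prompt describes can be sketched in plain TypeScript. This is a minimal, hypothetical illustration rather than the project's actual module: the real schemas wrap these lists in Zod (`z.enum`), which is omitted here so the sketch stays self-contained, and `isArtifactRuntimeName` is an illustrative helper, not an existing export.

```typescript
// Derive the artifact-runtime list from the supported list so the two
// cannot drift: adding a runtime to supportedRuntimeNames automatically
// makes it a valid artifact runtime as well.
const supportedRuntimeNames = ["ollama"] as const;
const artifactRuntimeNames = [...supportedRuntimeNames, "vllm"] as const;

type SupportedRuntimeName = (typeof supportedRuntimeNames)[number];
type ArtifactRuntimeName = (typeof artifactRuntimeNames)[number];

// Runtime guard for values read from historical plan/result artifacts.
function isArtifactRuntimeName(value: string): value is ArtifactRuntimeName {
  return (artifactRuntimeNames as readonly string[]).includes(value);
}
```

The `as const` assertions keep both lists as readonly literal tuples, so the derived types stay exact instead of widening to `string[]`.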
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/ui/section-heading.tsx`:
- Around line 12-15: The exported function SectionHeading lacks complete
JSDoc/TSDoc; update the comment block above the SectionHeading declaration to
include a short purpose description, `@param` tags for title and description, an
`@returns` tag describing the React element returned, and an `@throws` tag
(explicitly stating none if it never throws). Ensure the JSDoc sits immediately
above the SectionHeading function and uses the repo's standard TSDoc style.
In `@src/harnesses/goose-adapter.ts`:
- Around line 216-223: The code currently hard-codes provider = "ollama" and
builds env without using runtime.baseUrl; update the Goose configuration to
include the runtime base URL when present by adding an environment variable
(e.g., GOOSE_BASE_URL) to the env object derived from runtime.baseUrl, keeping
provider and GOOSE_MODEL as-is; in other words, when runtime.baseUrl is defined,
set env.GOOSE_BASE_URL = runtime.baseUrl so Ollama/custom endpoints are passed
through (reference the provider variable, the env object, and runtime.baseUrl),
or alternatively remove the file header invariant if you intentionally do not
support custom endpoints.
In `@src/harnesses/opencode-output.ts`:
- Around line 260-262: The early return that sets method "json" when parsedLines
> 0 is too aggressive and drops non-JSON content; change the condition so we
only short-circuit to the JSON-only return when the entire input was
successfully parsed as JSON (e.g., parsedLines === totalLineCount or a flag like
allLinesParsed is true). If mixed content exists, do not return an empty
output/method "json" — instead let the function fall through to the raw/fallback
handling or merge the non-JSON lines into the returned output. Update the logic
around parsedLines and the JSON-short-circuit (the block that currently returns
{ output: "", method: "json" }) to perform this full-stream check or preserve
non-JSON text.
In `@src/lib/model-profile/registry.ts`:
- Around line 148-178: normalizeLoadedModelProfileRegistry currently removes
unsupported runtime variants but leaves profiles whose variants become empty,
causing ModelProfileRegistrySchema.parse to fail; update the function to detect
and drop profiles whose filtered variants object is empty (e.g., track
droppedProfileKeys alongside droppedRuntimeNames), log a warning with the
dropped profile keys and their original runtimes, and build normalized only from
profiles that still have at least one supported variant before calling
ModelProfileRegistrySchema.parse so a single legacy profile doesn’t break the
whole load.
In `@src/lib/signal-assessment.ts`:
- Around line 32-40: The matcher list AGENT_REQUESTED_INPUT_PATTERNS is too
generic and causes UI/artifact text to be classified as agent_requested_input;
restrict its application by either (1) narrowing the regexes to harness-specific
phrasing/markers (e.g., include anchors or known harness tokens) or (2) guarding
the matching logic so AGENT_REQUESTED_INPUT_PATTERNS is only evaluated for
harness boundary/failure rows (e.g., check the row context like isHarnessPrompt
/ source === 'harness' / isBoundaryFailure before running matches in the
function that classifies agent_requested_input). Update the matching code where
AGENT_REQUESTED_INPUT_PATTERNS is used (and the same change at the other
occurrences) so legitimate UI/artifact strings aren’t misclassified.
In `@src/runner/plan-builder.ts`:
- Around line 97-101: Remove the unnecessary type assertion on config.runtimes
in the runtimes assignment: config.runtimes is already correctly typed by the
BenchConfig/Zod schema as SupportedRuntimeName[] (equivalent to RuntimeName[]),
so change the ternary to use config.runtimes directly instead of
(config.runtimes as RuntimeName[]); update the expression that sets runtimes
(which references config.runtimes, RuntimeName and RUNTIME_NAMES) to the simpler
form that returns config.runtimes when non-empty and [...RUNTIME_NAMES]
otherwise.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx`:
- Around line 270-313: Add a TSDoc block above the exported BlindVsInformedChart
function describing its purpose, parameters, and return value; document the
items parameter (type BlindVsInformedChartProps) and what the component renders
(chart tabs for model/harness breakdown), include `@param` for items and `@returns`
React.ReactElement (or JSX.Element), and add any `@throws` if it can throw—place
the doc comment immediately above the export function BlindVsInformedChart
declaration to satisfy the project guideline that all exported functions have
TSDoc.
In `@apps/dashboard/src/lib/api.ts`:
- Around line 1-4: The file header documenting purpose/exports/invariants must
be the very first thing in the module; move the existing module header comment
so it appears before the import block that currently brings in
parseKnownPlanPayload and parseKnownRunPayload from
"../../../../src/lib/machine-profile/legacy.js", and ensure the header briefly
states the file's purpose, its exported items, and any invariants as per the
coding guideline so the header precedes all imports and code.
In `@llm/implementation/multi-runtime-mvp-implementation.md`:
- Around line 1-5: Add a very prominent historical warning header and timestamp
to the top of the document so readers cannot mistake it for current behavior:
change the existing opening lines in Multi-Runtime MVP Implementation to include
a bold/ALL-CAPS heading like "HISTORICAL BEHAVIOR (Feb 8, 2026) — DO NOT USE FOR
CURRENT RUNTIME" and add a one-line subheading such as "Archived for
artifact/debugging context; active benchmarks are Ollama-only" (replace the
current first paragraph). Ensure the new header appears before any other text so
the historical status is immediately obvious on quick scans.
In `@llm/project/user-flow.md`:
- Line 209: Edit the "Discovery" bullet so it clearly states the runtime
decision for MVP: remove the conditional phrase "unless explicitly passed as
`--runtimes ollama`" and reword to say "Discovery: models/harnesses are
auto-discovered by default; runtime is fixed to `ollama` for the MVP." Reference
the "Discovery" bullet and the string "`--runtimes ollama`" when making the
change to ensure the mixed message is eliminated.
In `@src/cli/run-command.ts`:
- Around line 83-107: Add a temporary deprecated shim flag --vllm-url that
accepts a URL and maps it to the existing --ollama-url behavior: update the CLI
option definitions near the existing .option("--ollama-url <url>") to also
declare .option("--vllm-url <url>") as a deprecated alias, when parsing prefer
the explicit --ollama-url but if --vllm-url is present set the same internal
variable (e.g., ollamaUrl) and emit a single-time warning message informing
users the flag is deprecated and will be removed next release; ensure the shim
does not change runtime semantics and that the warning uses the same
logger/console path used elsewhere in this module.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 546-564: The execa-error branch in the error handler for
executeItem only uses execaError.stderr and message, dropping execa output and
timing; update the throw to preserve execaError.output (or output/stdout) and
durationMs so executeItem can see the real failure text and timing.
Specifically, in the block that builds execaError and computes errorReasons
(symbols: execaError, hasPermissionDeniedStderr,
getTranscriptOrInputTaintReasons, appendSignalAssessmentReasons,
signalAssessment), include the original output and durationMs properties when
calling Object.assign on the new Error so the thrown object contains output and
durationMs for downstream handling by executeItem.
In `@src/harnesses/opencode-config.ts`:
- Around line 128-162: The function buildOpenCodeEnv currently accepts an opts
parameter with a runtimeName property that is never used; remove runtimeName
from the function signature and from the zod validation to simplify the API:
update the opts type to only include configPath and configJson, remove
runtimeName from the destructuring and from the z.object() schema
(RuntimeNameSchema reference), and adjust any call sites that pass runtimeName
to stop providing it or to only pass the two remaining fields so callers
compile.
In `@src/schemas/common.schema.ts`:
- Around line 38-54: supportedRuntimeNames and artifactRuntimeNames duplicate
"ollama", risking drift; change artifactRuntimeNames to derive from
supportedRuntimeNames (e.g., create artifactRuntimeNames by spreading
supportedRuntimeNames and adding "vllm" with a const assertion) so the
runtime-list relationship is encoded in one source of truth, then keep
ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName
type as z.infer<typeof ArtifactRuntimeNameSchema>.
In `@test/model-profile-registry.test.ts`:
- Around line 96-97: Replace the hardcoded schemaVersion string "0.5.1" in the
test fixture with the canonical SCHEMA_VERSION constant: import SCHEMA_VERSION
into the test and set schemaVersion: SCHEMA_VERSION in the object used by the
model-profile registry test (the fixture that currently contains schemaVersion
and models). This keeps the test focused on unsupported runtime-key filtering
and avoids future churn when the schema version is bumped.
In `@test/plan-builder.test.ts`:
- Around line 16-102: The fallbackCollectMachineProfile helper is large and
should be extracted to a shared test fixtures module: create a new test utility
file (e.g., test/utils/fixtures.ts) exporting fallbackCollectMachineProfile as a
named export, move the function body there (preserving its signature and any
helper readNonEmpty logic and types), update test/plan-builder.test.ts to import
{ fallbackCollectMachineProfile } from the new module, and ensure any TypeScript
types referenced by the function are exported or inlined so existing tests
compile unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d585f805-fd92-49c5-ac58-25092ba73ba5
📒 Files selected for processing (116)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-output.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/hardware-profile.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/plan-builder.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-profile-registry.test.ts
- test/opencode-adapter.test.ts
- test/opencode-config.test.ts
- test/opencode-output.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (9)
- src/runner/index.ts
- test/benchmark-checkpoint.test.ts
- test/opencode-config.test.ts
- package.json
- src/harnesses/goose-openai.ts
- src/runtimes/discovery.ts
- test/vllm-runtime.test.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
[codex] rebuild opencode harness
@CodeRabbit full review

✅ Actions performed: full review triggered.
Actionable comments posted: 8
♻️ Duplicate comments (3)
apps/dashboard/src/components/about/about-content.ts (1)
99-101: ⚠️ Potential issue | 🟡 Minor

Keep the axis label aligned with “prompt mode”. This entry still surfaces the internal `passType` name, so the About copy mixes implementation terminology with the newer user-facing wording. Rename the label here to avoid conflicting language in the UI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/about/about-content.ts` around lines 99 - 101, The About copy exposes the internal key "passType" in the name field; change the displayed label to the user-facing term (e.g., "prompt mode") so UI language is consistent. Update the object entry where name: "passType" appears in about-content.ts to use the human-friendly label string and ensure any descriptive text still matches the two modes; leave internal keys/logic untouched and only modify the displayed name.

apps/dashboard/src/lib/types.ts (1)
286-295: ⚠️ Potential issue | 🟡 Minor

Keep the legacy `vllmBaseUrl` field in `PlanConfig` until the migration is explicit. `apps/dashboard/src/lib/schemas.ts` still accepts `config.vllmBaseUrl` for older `plan.json` payloads, so removing it here makes the handwritten type drift from the payload shape the dashboard can actually parse. Keep it optional until the legacy field is removed through a versioned migration.

🧩 Minimal compatibility fix
```diff
 export interface PlanConfig {
   ollamaBaseUrl: string;
+  vllmBaseUrl?: string;
   generateTimeoutMs: number;
   gooseMaxTurns?: number;
   gooseRetryMaxTurns?: number;
   gooseWorkspaceMaxTurns?: number;
   gooseWorkspaceRetryMaxTurns?: number;
```

Based on learnings, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` around lines 286 - 295, Keep the legacy optional field in the PlanConfig interface: add an optional vllmBaseUrl?: string property to the PlanConfig declaration (alongside ollamaBaseUrl, generateTimeoutMs, etc.) so the handwritten type matches the payload shape parsed by the schemas that still accept config.vllmBaseUrl; leave it optional until an explicit, versioned migration removes the legacy field.

src/harnesses/opencode-events.ts (1)
122-188: ⚠️ Potential issue | 🟠 Major

Handle all supported tool-call shapes here, and keep the last write in batched arrays. `extractFromToolCallObject()` still only recognizes `name`/`toolName`, and it short-circuits on the first array hit. That means structured payloads like `{ tool: "write_file", input: ... }` / `{ command: "edit", args: ... }` are missed, and a `toolCalls` array can still return stale code if a later entry rewrites the file. In those cases the parser falls through to `"json"`/`"raw"` even though a valid write payload was present.

♻️ Suggested fix
```diff
 function extractFromToolCallObject(obj: unknown): string | null {
   const maxDepth = 4;
   const visit = (value: unknown, depth: number): string | null => {
     if (depth > maxDepth || !value) return null;
     if (Array.isArray(value)) {
-      for (const item of value) {
-        const found = visit(item, depth + 1);
-        if (found) return found;
-      }
-      return null;
+      let latest: string | null = null;
+      for (const item of value) {
+        const found = visit(item, depth + 1);
+        if (found) latest = found;
+      }
+      return latest;
     }
     if (typeof value !== "object") return null;
     const record = value as Record<string, unknown>;
     const nameValue =
       typeof record.name === "string"
         ? record.name
         : typeof record.toolName === "string"
           ? record.toolName
+          : typeof record.tool === "string"
+            ? record.tool
+            : typeof record.tool_name === "string"
+              ? record.tool_name
+              : typeof record.command === "string"
+                ? record.command
           : undefined;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-events.ts` around lines 122 - 188, The extractor currently only recognizes name/toolName and returns the first hit for arrays, missing shapes like {tool: "...", input: ...} or {command: "...", args: ...} and losing later writes in batched arrays; update extractFromToolCallObject()/visit to also accept record.command and record.tool as alternate tool-name fields (in addition to name/toolName), and to treat args as record.arguments ?? record.args ?? record.parameters ?? record.input ?? record.raw; when value is an array, do not short-circuit on the first match—iterate all elements, call visit on each, and keep/return the last non-null content (so later writes override earlier ones); continue to use WRITE_TOOL_NAMES and extractContentFromArgs to validate and extract content from parsed args.
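The behavior the prompt describes can be sketched as a self-contained recursive visitor. This is a hypothetical illustration, not the adapter's actual code: `WRITE_TOOL_NAMES`, the field-fallback order, and the assumption that write payloads carry a string `content` field are all modeled on the review text rather than taken from the repository.

```typescript
// Illustrative set of tool names that count as file writes.
const WRITE_TOOL_NAMES = new Set(["write_file", "edit", "write"]);

// Recursively search a tool-call payload for write content, accepting
// several tool-name fields and keeping the LAST matching write in an
// array (later entries override earlier ones) instead of returning the
// first hit.
function extractLatestWrite(value: unknown, depth = 0): string | null {
  if (depth > 4 || value === null || value === undefined) return null;
  if (Array.isArray(value)) {
    let latest: string | null = null;
    for (const item of value) {
      const found = extractLatestWrite(item, depth + 1);
      if (found !== null) latest = found; // later entries win
    }
    return latest;
  }
  if (typeof value !== "object") return null;
  const record = value as Record<string, unknown>;
  const name = [record.name, record.toolName, record.tool, record.tool_name, record.command]
    .find((v): v is string => typeof v === "string");
  const args =
    record.arguments ?? record.args ?? record.parameters ?? record.input ?? record.raw;
  if (name && WRITE_TOOL_NAMES.has(name) && args && typeof args === "object") {
    const content = (args as Record<string, unknown>).content;
    if (typeof content === "string") return content;
  }
  // Recurse into nested containers (e.g., { toolCalls: [...] }).
  for (const nested of Object.values(record)) {
    const found = extractLatestWrite(nested, depth + 1);
    if (found !== null) return found;
  }
  return null;
}
```

With this shape, a batched `toolCalls` array where a later `edit` rewrites the file returns the later content, and payloads keyed by `tool` or `command` are no longer missed.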
🧹 Nitpick comments (10)
src/lib/stats-format.ts (1)
81-82: Add an invariant guard for inconsistent failure counters. If `generationFailureCount` ever exceeds `failed`, this currently degrades silently. That’s likely a programmer/data-contract mistake and should fail fast. As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks."

Suggested guard
```diff
 const generationFailureCount = stats.generationFailures?.total ?? 0;
+if (generationFailureCount > failed) {
+  throw new Error(
+    `Invalid stats: generationFailures.total (${generationFailureCount}) exceeds failed (${failed})`,
+  );
+}
 if (generationFailureCount > 0 || failed > generationFailureCount) {
   lines.push("  Failure breakdown:");
   for (const { type, count } of stats.generationFailures?.byType ?? []) {
     lines.push(`    ${type}: ${count}`);
   }
   const scoredRowFailures = failed - generationFailureCount;
   if (scoredRowFailures > 0) {
     lines.push(`    scored_row_failure: ${scoredRowFailures}`);
   }
 }
```

Also applies to: 87-90
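The invariant can be exercised in isolation. This standalone sketch assumes a simplified stats shape and hypothetical names; it is not the actual `stats-format.ts` code, only a demonstration of the fail-fast pattern the guard implements.

```typescript
// Simplified stand-in for the stats payload: generation failures must be
// a subset of total failures, so a larger count is a data-contract bug.
interface FailureStats {
  failed: number;
  generationFailures?: { total: number };
}

// Throws on inconsistent counters; otherwise returns the remainder
// (scored-row failures) that the breakdown would report.
function assertFailureCountsConsistent(stats: FailureStats): number {
  const generationFailureCount = stats.generationFailures?.total ?? 0;
  if (generationFailureCount > stats.failed) {
    throw new Error(
      `Invalid stats: generationFailures.total (${generationFailureCount}) exceeds failed (${stats.failed})`,
    );
  }
  return stats.failed - generationFailureCount;
}
```

Throwing here surfaces the bug at the producer instead of printing a misleading negative `scored_row_failure` count downstream.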
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/stats-format.ts` around lines 81 - 82, Add a fail-fast invariant check to detect inconsistent failure counters: after computing const generationFailureCount = stats.generationFailures?.total ?? 0 and before using it (e.g., where the code checks if (generationFailureCount > 0 || failed > generationFailureCount)), throw an Error if generationFailureCount > failed so the code fails fast on this programmer/data-contract violation; apply the same guard at the other similar spot around lines handling the alternative counters (the same check that spans the second use at the block covering the 87-90 logic) to ensure both places validate generationFailureCount against failed.

apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx (1)
24-31: Consider adding `as const` for readonly tuple type. Adding `as const` to `CARD_ACCENT_COLORS` would provide better type safety by creating a readonly tuple with literal types, preventing accidental mutations and improving type inference.

♻️ Optional refactor to add type safety
```diff
-const CARD_ACCENT_COLORS = [
+const CARD_ACCENT_COLORS = [
   "#34c759", // brand green — matched runs
   "#34c759", // brand green — profiles
   "hsl(212, 100%, 67%)", // info blue — deduped items
   "hsl(156, 67%, 55%)", // success green — pass rate
   "hsl(270, 60%, 60%)", // purple — frontier coverage
   "hsl(210, 85%, 60%)", // blue — median duration
-];
+] as const;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx` around lines 24 - 31, The constant CARD_ACCENT_COLORS should be made a readonly tuple to improve type safety; change its declaration to use "as const" (i.e., const CARD_ACCENT_COLORS = [ ... ] as const) so the array items become literal readonly types and cannot be mutated. Ensure any code that consumes CARD_ACCENT_COLORS still accepts readonly string literals (or update types to ReadonlyArray or typeof CARD_ACCENT_COLORS[number] where needed) so type checks continue to pass.

apps/dashboard/src/components/charts/model-comparison-chart.tsx (1)
37-39: Consider adding Zod validation at the component boundary. The coding guidelines specify that TypeScript files should "Validate at boundaries with Zod and pass typed data inward." Currently, the `items` prop is typed but not validated at runtime. Consider adding a Zod schema to validate the props at the component boundary.

🛡️ Example Zod validation pattern
Define a schema near your type definition and validate incoming props:
```diff
+import { z } from 'zod';
+
+// Define schema for runtime validation
+const MatrixItemResultSchema = z.array(
+  z.object({
+    // Define MatrixItemResult fields here
+  })
+);
+
 interface ModelComparisonChartProps {
   items: MatrixItemResult[];
 }

 export function ModelComparisonChart({ items }: ModelComparisonChartProps) {
+  // Validate at boundary
+  const validatedItems = MatrixItemResultSchema.parse(items);
+
   const allModels = useMemo(() => {
-    const groups = groupByModel(items);
+    const groups = groupByModel(validatedItems);
```

Alternatively, validate props at a higher level (e.g., API response) and pass validated data down.
As per coding guidelines: "Validate at boundaries with Zod and pass typed data inward" for `**/*.{ts,tsx}` files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 37-39: add runtime validation for the ModelComparisonChart component props by defining a Zod schema for the ModelComparisonChartProps (including the items array shape matching MatrixItemResult) near the type definition and call schema.parse (or safeParse and handle failure) at the component boundary inside ModelComparisonChart (or a thin wrapper) before using the data; on parse failure, return a fallback UI or throw/log an error so only validated, typed data is passed inward to the component logic.

apps/dashboard/src/components/charts/token-efficiency-chart.tsx (1)
103-105: Consider centralizing harness color lookup behind a helper.

The inline cast works, but moving it to a `readHarnessColor` helper (as done in the other chart) reduces repetition and keeps fallback logic in one place.

♻️ Suggested refactor
```diff
 const HARNESS_COLORS = {
   direct: "hsl(215, 70%, 62%)", // steel blue
   goose: "hsl(142, 60%, 49%)", // brand green
   opencode: "hsl(38, 80%, 58%)", // warm amber
 } as const;

+function readHarnessColor(harness: string): string {
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS]
+    : "hsl(210, 12%, 63%)";
+}
+
 const dataByHarness = harnesses.map((harness) => ({
   harness,
   points: data.filter((d) => d.harness === harness),
-  color:
-    HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] ||
-    "hsl(210, 12%, 63%)",
+  color: readHarnessColor(harness),
 }));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx` around lines 103-105: extract the inline HARNESS_COLORS lookup into a shared helper named readHarnessColor that accepts the harness key (preserve the existing type/cast) and returns HARNESS_COLORS[harness] || "hsl(210, 12%, 63%)"; then replace the inline expression in token-efficiency-chart (the color: HARNESS_COLORS[...] || "hsl(...)" usage) with a call to readHarnessColor(harness) so fallback logic is centralized and reusable across charts.

test/run-command.test.ts (1)
45-57: Assert `process.exit` arguments explicitly for stronger intent.

The thrown-message check works, but adding an explicit spy assertion makes the failure mode less brittle.
Suggested tightening
```diff
 ).rejects.toThrow("process.exit(1)");
+expect(exitSpy).toHaveBeenCalledWith(1);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/run-command.test.ts` around lines 45-57: the test currently only expects parseAsync to throw with "process.exit(1)"; instead add an explicit spy on process.exit to assert it was called with 1 for stronger intent: create a jest.spyOn(process, "exit").mockImplementation(() => { throw new Error("exit"); }) before calling runCommand.parseAsync (or mockImplementationOnce), then after the promise rejects assert that the spy was calledWith(1) and finally restore the spy; reference the test function and runCommand.parseAsync to locate where to add the spy and assertions.

src/harnesses/discovery.ts (1)
45-49: Consider lowering severity for expected OpenCode-unavailable probe failures.

When discovery runs on machines without OpenCode, this path is expected; logging it as `error` can produce noisy false alarms. `warn`/`debug` is usually a better fit unless the failure is truly unexpected.

Possible adjustment
```diff
-      logger.error(
+      logger.warn(
         { err: error, probe: "opencode", functionName: "isOpenCodeAvailable" },
         "OpenCode probe failed",
       );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/discovery.ts` around lines 45-49: the catch in isOpenCodeAvailable currently logs expected absence of OpenCode as an error; change the logger invocation to a lower severity (logger.warn or logger.debug) for expected probe failures by replacing logger.error(...) with logger.warn(...) or logger.debug(...), keeping the structured context ({ err: error, probe: "opencode", functionName: "isOpenCodeAvailable" }) and the message "OpenCode probe failed" so only truly unexpected failures are escalated elsewhere.

src/harnesses/goose-adapter.ts (1)
458-490: Use Execa's `timedOut` property instead of message matching for timeout detection.

The current approach of checking `error.message.includes("timed out")` is fragile. It can mislabel failures whose output happens to contain that phrase and couples timeout detection to Execa's message wording. Execa v9 exposes a dedicated `timedOut` boolean property on rejected errors; this is the official, reliable API for detecting timeouts.

Simplify the check to:
Suggested fix
```diff
-    const isTimeoutError =
-      error instanceof Error && error.message.includes("timed out");
+    const isTimeoutError = (error as any).timedOut === true;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/goose-adapter.ts` around lines 458-490: the timeout detection currently uses message.includes("timed out"); change it to rely on Execa's timedOut boolean: when you cast the error to the Execa-like shape in the block (the variable currently named execaError), check execaError.timedOut to set isTimeoutError (or replace the earlier isTimeoutError declaration with this check) and use that value where the code builds the timeout message (the throw that uses timeoutMs) and when constructing the Error with signalAssessment/output (where getTranscriptOrInputTaintReasons and appendSignalAssessmentReasons are used to compute errorReasons and effectiveOutput). Ensure the new check handles undefined typed errors safely (keep the existing guard that error is object and has stdout/stderr) and remove the fragile message.includes usage.

src/runtimes/ollama-runtime.ts (1)
73-96: Consider documenting the update process for these heuristics.

The `TEXT_GENERATION_ARCHITECTURES` and `TEXT_GENERATION_NAME_PATTERNS` sets are derived from observed metadata rather than a formal registry. When new model families appear in benchmark runs and fall through to `"unknown"`, maintainers need to know where to add them.

A brief inline comment pointing to documentation or a tracking issue would help future contributors extend these lists consistently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runtimes/ollama-runtime.ts` around lines 73-96: add an inline comment above the TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS declarations explaining that these heuristics are maintained manually from observed metadata, where to update them (e.g., link to the project's contributing doc or a specific tracking issue), and a short note on the criteria to add a new entry; reference the constants TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS so future contributors know exactly which lists to modify when new model families appear and fall through to "unknown".

src/schemas/plan.schema.ts (1)
1-7: Add the required `Invariants` section to the file header.

The header now includes Purpose/Exports, but it still omits the explicit Invariants block required for TS files.
♻️ Proposed header patch
```diff
 /**
  * Purpose: RunPlan schema capturing the expanded matrix before execution.
  * Exports: MatrixItemSchema, MatrixItem, ModelExclusionSchema,
  *          ModelExclusion, RunPlanSchema, RunPlan
+ *
+ * Invariants:
+ * - Plan payloads are validated via Zod before persistence/consumption.
+ * - `modelExclusions` is additive and optional for backward compatibility.
  *
  * The plan is written to results/<runId>/plan.json for reproducibility.
  */
```

As per coding guidelines, "`src/**/*.{ts,tsx}`: Every file must start with a descriptive header describing purpose, exports, and invariants".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/plan.schema.ts` around lines 1-7: the file header for the RunPlan schema is missing the required "Invariants" section; update the top-of-file header (near symbols MatrixItemSchema, MatrixItem, ModelExclusionSchema, ModelExclusion, RunPlanSchema, RunPlan) to include an "Invariants" block that documents any constraints or guarantees about the schema (e.g., immutability, required fields, allowed value ranges, serialization expectations) so the header includes Purpose, Exports, and Invariants per the TS file guideline.

src/harnesses/opencode-runner.ts (1)
156-167: Skip the fixed 2s delay once `SIGTERM` already reaped the child.

Every timeout/hang currently pays `FORCE_KILL_DELAY_MS` before returning, even when the process exits immediately after `SIGTERM`. Across a noisy matrix that adds avoidable wall-clock delay to failure handling. Race the delay against `proc` settling and only escalate to `SIGKILL` if the process is still alive.

♻️ Proposed refinement
```diff
 async function forceKillProcess(
   proc: ResultPromise,
   log: pino.Logger,
   reason: string,
 ): Promise<void> {
   const pid = proc.pid;
   log.warn({ pid, reason }, "Force killing OpenCode process");
   killOpenCodeProcess(proc, "SIGTERM", log);
-  await new Promise((resolve) => setTimeout(resolve, FORCE_KILL_DELAY_MS));
+  const exitedAfterSigterm = await Promise.race([
+    proc.then(
+      () => true,
+      () => true,
+    ),
+    new Promise<boolean>((resolve) =>
+      setTimeout(() => resolve(false), FORCE_KILL_DELAY_MS),
+    ),
+  ]);
+  if (exitedAfterSigterm) return;
   log.warn({ pid }, "Escalating OpenCode process kill");
   killOpenCodeProcess(proc, "SIGKILL", log);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-runner.ts` around lines 156 - 167, The forceKillProcess function always waits the full FORCE_KILL_DELAY_MS after sending SIGTERM even if proc (the ResultPromise) exits immediately; change it to race the delay against proc settling and only send the escalating SIGKILL if the proc is still alive after the delay. Specifically, in forceKillProcess, after calling killOpenCodeProcess(proc, "SIGTERM", log) await a Promise.race between a timeout Promise of FORCE_KILL_DELAY_MS and proc settling (or its completion/error), then check if the process is still running (using proc.pid or proc settling state) and only then log the escalation and call killOpenCodeProcess(proc, "SIGKILL", log).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 95-123: The code treats normalizedSelectedModels.length === 0 as
"all selected", which is wrong when selectedModels contains only stale ids;
update handleModelToggle to compute a true allSelected flag (e.g., const
allSelected = models.every(m => normalizedSelectedModels.includes(m))) and
replace the length check with that flag so toggling a model when all are
selected still returns models.filter(m => m !== model). Ensure you keep using
normalizedSelectedModels, models, onSelectionChange and the MODEL_SORT_COLLATOR
for the sorted results.
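The toggle fix described above can be sketched as a pure function. Names like `toggleModel` are illustrative, and the sorting via `MODEL_SORT_COLLATOR` is omitted to keep the sketch small; the real component works with `normalizedSelectedModels` and `onSelectionChange`:

```typescript
// Hypothetical extraction of the toggle logic described above. `models` is
// the full model list; `selected` is the stored (possibly stale) selection.
function toggleModel(
  models: string[],
  selected: string[],
  model: string,
): string[] {
  // Drop stale ids first, mirroring normalizedSelectedModels.
  const normalized = selected.filter((m) => models.includes(m));
  // Explicit "all selected" check instead of a length === 0 sentinel, so a
  // stale-only selection can no longer masquerade as "everything selected".
  const allSelected = models.every((m) => normalized.includes(m));
  if (allSelected) {
    return models.filter((m) => m !== model);
  }
  return normalized.includes(model)
    ? normalized.filter((m) => m !== model)
    : [...normalized, model];
}
```

With this shape, a selection containing only stale ids normalizes to an empty list and toggling simply selects the clicked model, rather than deselecting it from a phantom "all".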
In `@llm/implementation/harnesses-implementation.md`:
- Around line 259-263: The note currently implies that `permission` is the sole
gate for tool access but the generated provider model entry still depends on the
model-level `"tools": true`; update the text to explicitly state that tool
access requires both the top-level `permission` setting and a model-level
`"tools": true` flag (as used in the generated `models` provider entry), and
mention that `enabled_providers`, slash-safe transport keys for runtime model
IDs, and the denial of `external_directory` remain unchanged.
In `@src/cli/compare-utils.ts`:
- Around line 156-158: The truncate function must guard against non-positive
maxLen values: in truncate(str, maxLen) early-return an empty string when maxLen
<= 0, and when computing the slice use a non-negative end index (e.g. use
Math.max(0, maxLen - 1)) so slice never receives a negative end; keep the
existing behavior for positive maxLen (return str if str.length <= maxLen,
otherwise return the truncated slice plus the ellipsis).
In `@src/harnesses/direct-adapter.ts`:
- Around line 55-61: Add an explicit runtime-type guard before calling
generateOllama to preserve the Ollama-only invariant: check the runtime object
(e.g., runtime) to ensure it is an Ollama-compatible Runtime (inspect a unique
discriminator or property used to identify Ollama runtimes) and throw a clear
programmer/config error if it isn’t, instead of calling generateOllama with an
unsupported runtime; place this check immediately before the generateOllama(...)
invocation in the generate function so generateOllama, runtime.baseUrl, and
model are only used when the runtime is validated.
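A minimal sketch of such a guard, assuming the runtime type exposes a `name` discriminator (the actual discriminator in `src/runtimes/runtime.ts` may differ):

```typescript
// Sketch only: the real Runtime shape lives in src/runtimes/runtime.ts.
interface Runtime {
  name: string;
  baseUrl: string;
}

// Fail fast with a programmer/config error instead of letting generateOllama
// receive an unsupported runtime.
function assertOllamaRuntime(runtime: Runtime): void {
  if (runtime.name !== "ollama") {
    throw new Error(
      `direct adapter requires an Ollama runtime, got "${runtime.name}"`,
    );
  }
}
```

Placed immediately before the `generateOllama(...)` call, the guard turns a confusing downstream HTTP failure into an explicit invariant violation.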
In `@src/harnesses/opencode-adapter.ts`:
- Around line 353-374: The failure path currently uses rawOutput (computed via
selectProcessOutput(processResult.stdout, processResult.stderr)) which can drop
stderr when stdout is non-empty; instead, change the non-zero-exit branch to
build a combined structured failure payload that preserves both stdout and
stderr (e.g., construct an errorOutput object/string containing both
processResult.stdout and processResult.stderr or a merged preview) and pass that
into buildOpenCodeFailure and buildFailureSignalAssessment; keep
selectProcessOutput only for success/normal output handling or compute separate
successOutput vs failureOutput so buildOpenCodeFailure always receives both
stdout and stderr evidence (referencing selectProcessOutput,
buildOpenCodeFailure, buildFailureSignalAssessment, processResult, rawOutput,
stdoutPreview, stderrPreview).
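One way to sketch the combined payload the prompt asks for (the helper name and truncation length are assumptions, not the adapter's actual API):

```typescript
// Hypothetical helper illustrating the suggested failure payload: keep both
// streams instead of letting a non-empty stdout drop the stderr evidence.
function buildFailurePreview(
  stdout: string,
  stderr: string,
  maxLen = 2000,
): string {
  const parts: string[] = [];
  if (stdout.trim()) parts.push(`stdout:\n${stdout.slice(0, maxLen)}`);
  if (stderr.trim()) parts.push(`stderr:\n${stderr.slice(0, maxLen)}`);
  return parts.join("\n---\n") || "(no output)";
}
```

`selectProcessOutput` would stay on the success path; the non-zero-exit branch would feed this combined preview into `buildOpenCodeFailure` and `buildFailureSignalAssessment`.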
In `@src/harnesses/opencode-permissions.ts`:
- Around line 34-58: Add explicit `@throws` TSDoc entries for the exported
helpers: isOpenCodePermissionDeniedText, createOpenCodePermissionPolicy, and the
permission-detection function that follows (the one accepting texts). For each
function add a single `@throws` line describing that the function does not throw
under normal usage (e.g., "@throws Never throws; input is validated and errors
are not raised") or, if a function can throw on invalid input, document the
exact error/condition thrown; ensure the `@throws` text appears alongside the
existing `@param` and `@returns` tags in each exported function's comment block.
In `@src/runner/item-executor.ts`:
- Around line 172-175: The new call to getExecutableRuntimeName() (used when
constructing runtime via createRuntime()) can throw for unsupported runtime
labels, so update the TSDoc for the exported function executeItem to include an
`@throws` entry describing that executeItem will throw when the runtime label is
invalid/unsupported; locate the executeItem function declaration and add a
concise `@throws` line that references the runtime-name validation (e.g., "Throws
if getExecutableRuntimeName() rejects an unsupported runtime label") so the
public contract matches the implementation.
In `@src/runner/plan-builder.ts`:
- Around line 59-60: Update the TSDoc for buildRunPlan to accurately list all
failure modes it can throw: retain Ollama unreachable and no models/tests found,
and add missing model selectors errors, unavailable harness errors, "all models
excluded" (all-excluded model sets) errors, and zero matrix-expansion errors;
ensure the `@throws` section names each condition and the type/shape of error
thrown (e.g., Error with descriptive message or specific error class) so the
exported contract matches the actual runtime failure surface of buildRunPlan and
any internal helpers it calls.
---
Duplicate comments:
In `@apps/dashboard/src/components/about/about-content.ts`:
- Around line 99-101: The About copy exposes the internal key "passType" in the
name field; change the displayed label to the user-facing term (e.g., "prompt
mode") so UI language is consistent. Update the object entry where name:
"passType" appears in about-content.ts to use the human-friendly label string
and ensure any descriptive text still matches the two modes; leave internal
keys/logic untouched and only modify the displayed name.
In `@apps/dashboard/src/lib/types.ts`:
- Around line 286-295: Keep the legacy optional field in the PlanConfig
interface: add an optional vllmBaseUrl?: string property to the PlanConfig
declaration (alongside ollamaBaseUrl, generateTimeoutMs, etc.) so the
handwritten type matches the payload shape parsed by the schemas that still
accept config.vllmBaseUrl; leave it optional until an explicit, versioned
migration removes the legacy field.
In `@src/harnesses/opencode-events.ts`:
- Around line 122-188: The extractor currently only recognizes name/toolName and
returns the first hit for arrays, missing shapes like {tool: "...", input: ...}
or {command: "...", args: ...} and losing later writes in batched arrays; update
extractFromToolCallObject()/visit to also accept record.command and record.tool
as alternate tool-name fields (in addition to name/toolName), and to treat args
as record.arguments ?? record.args ?? record.parameters ?? record.input ??
record.raw; when value is an array, do not short-circuit on the first
match—iterate all elements, call visit on each, and keep/return the last
non-null content (so later writes override earlier ones); continue to use
WRITE_TOOL_NAMES and extractContentFromArgs to validate and extract content from
parsed args.
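The tolerant extractor this prompt describes might look like the sketch below; the real `extractFromToolCallObject`/`visit` and `WRITE_TOOL_NAMES` differ in detail, and the names here are simplified:

```typescript
// Assumed write-tool name set; the real WRITE_TOOL_NAMES is larger.
const WRITE_TOOL_NAMES = new Set(["write", "edit"]);

// Accept name/toolName/tool/command as the tool-name field, several aliases
// for args, and keep the LAST match in arrays (later writes win).
function extractWriteContent(value: unknown): string | null {
  if (Array.isArray(value)) {
    let last: string | null = null;
    for (const el of value) {
      const hit = extractWriteContent(el);
      if (hit !== null) last = hit; // do not short-circuit on the first hit
    }
    return last;
  }
  if (typeof value !== "object" || value === null) return null;
  const record = value as Record<string, unknown>;
  const name =
    record.name ?? record.toolName ?? record.tool ?? record.command;
  if (typeof name !== "string" || !WRITE_TOOL_NAMES.has(name)) return null;
  const args =
    record.arguments ??
    record.args ??
    record.parameters ??
    record.input ??
    record.raw;
  if (typeof args === "object" && args !== null) {
    const content = (args as Record<string, unknown>).content;
    if (typeof content === "string") return content;
  }
  return null;
}
```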
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 37-39: Add runtime validation for the ModelComparisonChart
component props by defining a Zod schema for the ModelComparisonChartProps
(including the items array shape matching MatrixItemResult) near the type
definition and call schema.parse (or safeParse and handle failure) at the
component boundary inside ModelComparisonChart (or a thin wrapper) before using
the data; on parse failure, return a fallback UI or throw/log an error so only
validated, typed data is passed inward to the component logic.
In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx`:
- Around line 103-105: Extract the inline HARNESS_COLORS lookup into a shared
helper named readHarnessColor that accepts the harness key (preserve the
existing type/cast) and returns HARNESS_COLORS[harness] || "hsl(210, 12%, 63%)";
then replace the inline expression in token-efficiency-chart (the color:
HARNESS_COLORS[...] || "hsl(...)" usage) with a call to
readHarnessColor(harness) so fallback logic is centralized and reusable across
charts.
In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`:
- Around line 24-31: The constant CARD_ACCENT_COLORS should be made a readonly
tuple to improve type safety; change its declaration to use "as const" (i.e.,
const CARD_ACCENT_COLORS = [ ... ] as const) so the array items become literal
readonly types and cannot be mutated. Ensure any code that consumes
CARD_ACCENT_COLORS still accepts readonly string literals (or update types to
ReadonlyArray or typeof CARD_ACCENT_COLORS[number] where needed) so type checks
continue to pass.
In `@src/harnesses/discovery.ts`:
- Around line 45-49: The catch in isOpenCodeAvailable currently logs expected
absence of OpenCode as an error; change the logger invocation to a lower
severity (logger.warn or logger.debug) for expected probe failures by replacing
logger.error(...) with logger.warn(...) or logger.debug(...), keeping the
structured context ({ err: error, probe: "opencode", functionName:
"isOpenCodeAvailable" }) and the message "OpenCode probe failed" so only truly
unexpected failures are escalated elsewhere.
In `@src/harnesses/goose-adapter.ts`:
- Around line 458-490: The timeout detection currently uses
message.includes("timed out"); change it to rely on Execa's timedOut boolean:
when you cast the error to the Execa-like shape in the block (the variable
currently named execaError), check execaError.timedOut to set isTimeoutError (or
replace the earlier isTimeoutError declaration with this check) and use that
value where the code builds the timeout message (the throw that uses timeoutMs)
and when constructing the Error with signalAssessment/output (where
getTranscriptOrInputTaintReasons and appendSignalAssessmentReasons are used to
compute errorReasons and effectiveOutput). Ensure the new check handles
undefined typed errors safely (keep the existing guard that error is object and
has stdout/stderr) and remove the fragile message.includes usage.
In `@src/harnesses/opencode-runner.ts`:
- Around line 156-167: The forceKillProcess function always waits the full
FORCE_KILL_DELAY_MS after sending SIGTERM even if proc (the ResultPromise) exits
immediately; change it to race the delay against proc settling and only send the
escalating SIGKILL if the proc is still alive after the delay. Specifically, in
forceKillProcess, after calling killOpenCodeProcess(proc, "SIGTERM", log) await
a Promise.race between a timeout Promise of FORCE_KILL_DELAY_MS and proc
settling (or its completion/error), then check if the process is still running
(using proc.pid or proc settling state) and only then log the escalation and
call killOpenCodeProcess(proc, "SIGKILL", log).
In `@src/lib/stats-format.ts`:
- Around line 81-82: Add a fail-fast invariant check to detect inconsistent
failure counters: after computing const generationFailureCount =
stats.generationFailures?.total ?? 0 and before using it (e.g., where the code
checks if (generationFailureCount > 0 || failed > generationFailureCount)),
throw an Error if generationFailureCount > failed so the code fails fast on this
programmer/data-contract violation; apply the same guard at the other similar
spot around lines handling the alternative counters (the same check that spans
the second use at the block covering the 87-90 logic) to ensure both places
validate generationFailureCount against failed.
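A sketch of the suggested guard (field names follow the prompt's wording; the real `stats` shape in `src/lib/stats-format.ts` may differ):

```typescript
// Illustrative stats shape for the fail-fast check described above.
interface GenerationStats {
  failed: number;
  generationFailures?: { total: number };
}

// Generation failures are a subset of failed rows; a larger count means the
// counters disagree, which is a programmer/data-contract violation.
function assertFailureCountersConsistent(stats: GenerationStats): number {
  const generationFailureCount = stats.generationFailures?.total ?? 0;
  if (generationFailureCount > stats.failed) {
    throw new Error(
      `generation failures (${generationFailureCount}) exceed failed rows (${stats.failed})`,
    );
  }
  return generationFailureCount;
}
```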
In `@src/runtimes/ollama-runtime.ts`:
- Around line 73-96: Add an inline comment above the
TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS declarations
explaining that these heuristics are maintained manually from observed metadata,
where to update them (e.g., link to the project's contributing doc or a specific
tracking issue), and a short note on the criteria to add a new entry; reference
the constants TEXT_GENERATION_ARCHITECTURES and TEXT_GENERATION_NAME_PATTERNS so
future contributors know exactly which lists to modify when new model families
appear and fall through to "unknown".
In `@src/schemas/plan.schema.ts`:
- Around line 1-7: The file header for the RunPlan schema is missing the
required "Invariants" section; update the top-of-file header (near symbols
MatrixItemSchema, MatrixItem, ModelExclusionSchema, ModelExclusion,
RunPlanSchema, RunPlan) to include an "Invariants" block that documents any
constraints or guarantees about the schema (e.g., immutability, required fields,
allowed value ranges, serialization expectations) so the header includes
Purpose, Exports, and Invariants per the TS file guideline.
In `@test/run-command.test.ts`:
- Around line 45-57: The test currently only expects parseAsync to throw with
"process.exit(1)"; instead add an explicit spy on process.exit to assert it was
called with 1 for stronger intent: create a jest.spyOn(process,
"exit").mockImplementation(() => { throw new Error("exit"); }) before calling
runCommand.parseAsync (or mockImplementationOnce), then after the promise
rejects assert that the spy was calledWith(1) and finally restore the spy;
reference the test function and runCommand.parseAsync to locate where to add the
spy and assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 57a3450f-5472-46e1-ab1d-b63d3cf119f4
📒 Files selected for processing (157)
README.md, apps/dashboard/scripts/build-index.ts, apps/dashboard/src/components/about/about-content.ts, apps/dashboard/src/components/about/about-page.tsx, apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx, apps/dashboard/src/components/charts/chart-primitives.tsx, apps/dashboard/src/components/charts/composite-score-chart.tsx, apps/dashboard/src/components/charts/frontier-eval-scatter.tsx, apps/dashboard/src/components/charts/model-comparison-chart.tsx, apps/dashboard/src/components/charts/model-radar-chart.tsx, apps/dashboard/src/components/charts/model-test-heatmap.tsx, apps/dashboard/src/components/charts/test-difficulty-chart.tsx, apps/dashboard/src/components/charts/test-type-split-chart.tsx, apps/dashboard/src/components/charts/token-efficiency-chart.tsx, apps/dashboard/src/components/layout/header.tsx, apps/dashboard/src/components/leaderboard/leaderboard-filters.ts, apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx, apps/dashboard/src/components/leaderboard/leaderboard-page.tsx, apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx, apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx, apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx, apps/dashboard/src/components/run-list/run-list-page.tsx, apps/dashboard/src/components/ui/section-heading.tsx, apps/dashboard/src/index.css, apps/dashboard/src/lib/aggregations-charts.ts, apps/dashboard/src/lib/aggregations-compare.ts, apps/dashboard/src/lib/aggregations-diagnostics.ts, apps/dashboard/src/lib/aggregations-test-types.ts, apps/dashboard/src/lib/aggregations-tooling.ts, apps/dashboard/src/lib/aggregations.ts, apps/dashboard/src/lib/api.ts, apps/dashboard/src/lib/schemas.ts, apps/dashboard/src/lib/types.ts, llm/context/codebase-overview.md, llm/implementation/computer-use-hardening.md, llm/implementation/harnesses-implementation.md, llm/implementation/multi-runtime-mvp-implementation.md, llm/project/project-overview.md, llm/project/project-rules.md, llm/project/user-flow.md, package.json, src/cli/compare-command.ts, src/cli/compare-formatters.ts, src/cli/compare-utils.ts, src/cli/migrate-machine-command.ts, src/cli/run-command.ts, src/harnesses/direct-adapter.ts, src/harnesses/discovery.ts, src/harnesses/goose-adapter.ts, src/harnesses/goose-openai.ts, src/harnesses/goose-output.ts, src/harnesses/harness.ts, src/harnesses/opencode-adapter.ts, src/harnesses/opencode-artifacts.ts, src/harnesses/opencode-cli.ts, src/harnesses/opencode-config.ts, src/harnesses/opencode-events.ts, src/harnesses/opencode-model.ts, src/harnesses/opencode-output.ts, src/harnesses/opencode-permissions.ts, src/harnesses/opencode-process.ts, src/harnesses/opencode-provider.ts, src/harnesses/opencode-runner.ts, src/harnesses/tool-prompt.ts, src/lib/benchmark-checkpoint.ts, src/lib/failure-classifier.ts, src/lib/hardware-profile.ts, src/lib/logger.ts, src/lib/machine-profile/instance-id.ts, src/lib/machine-profile/legacy.ts, src/lib/machine-profile/normalization.ts, src/lib/machine-profile/probe-linux.ts, src/lib/machine-profile/probe-macos.ts, src/lib/machine-profile/probe-utils.ts, src/lib/machine-profile/probe-windows.ts, src/lib/model-aliases.ts, src/lib/model-profile/normalization.ts, src/lib/model-profile/registry-compat.ts, src/lib/model-profile/registry.ts, src/lib/openai-compat-client.ts, src/lib/signal-assessment.ts, src/lib/stats-format.ts, src/lib/test-catalog.ts, src/runner/index.ts, src/runner/item-executor.ts, src/runner/model-availability.ts, src/runner/model-eligibility.ts, src/runner/plan-builder.ts, src/runner/run-progress.ts, src/runtimes/discovery.ts, src/runtimes/index.ts, src/runtimes/ollama-residency.ts, src/runtimes/ollama-runtime.ts, src/runtimes/runtime.ts, src/runtimes/vllm-runtime.ts, src/schemas/common.schema.ts, src/schemas/config.schema.ts, src/schemas/index.ts, src/schemas/model-alias.schema.ts, src/schemas/model-profile.schema.ts, src/schemas/plan.schema.ts, src/schemas/result.schema.ts, src/schemas/test-catalog.schema.ts, src/tests/calculator-basic/test.meta.json, src/tests/calculator-stateful/test.meta.json, src/tests/event-emitter/test.meta.json, src/tests/rate-limiter/test.meta.json, src/tests/smoke/test.meta.json, src/tests/todo-app/test.meta.json, src/tests/ttl-cache/test.meta.json, src/tests/workspace-smoke/fixtures/logs/.gitkeep, src/tests/workspace-smoke/prompt.blind.md, src/tests/workspace-smoke/prompt.informed.md, src/tests/workspace-smoke/test.meta.json, test/aggregate.test.ts, test/benchmark-checkpoint.test.ts, test/build-index.test.ts, test/compare-checkpoint-guard.test.ts, test/compare-formatters.test.ts, test/dashboard-schema-compat.test.ts, test/dashboard-test-type.test.ts, test/failure-classifier.test.ts, test/generation-retry.test.ts, test/goose-adapter.test.ts, test/goose-output.test.ts, test/hardware-profile.test.ts, test/harness-compatibility.test.ts, test/item-executor.test.ts, test/item-retry.test.ts, test/machine-profile-migration.test.ts, test/model-alias-schema.test.ts, test/model-availability.test.ts, test/model-profile-registry.test.ts, test/ollama-adapter.test.ts, test/ollama-residency.test.ts, test/opencode-adapter.test.ts, test/opencode-artifacts.test.ts, test/opencode-cli-cache.test.ts, test/opencode-cli.test.ts, test/opencode-config.test.ts, test/opencode-model.test.ts, test/opencode-output.test.ts, test/opencode-runner-race.test.ts, test/opencode-runner.test.ts, test/plan-builder.test.ts, test/run-command.test.ts, test/runner-residency.test.ts, test/schema-regressions.test.ts, test/schemas.test.ts, test/signal-assessment.test.ts, test/stats-semantics.test.ts, test/test-catalog.test.ts, test/tool-prompt.test.ts, test/utils/fixtures.ts, test/vllm-runtime.test.ts, test/workspace-capability-parity.test.ts, test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- src/harnesses/goose-openai.ts
- test/benchmark-checkpoint.test.ts
- package.json
- src/harnesses/opencode-model.ts
- test/vllm-runtime.test.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-output.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
- src/runtimes/discovery.ts
@CodeRabbit full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 9
♻️ Duplicate comments (4)
apps/dashboard/src/components/about/about-content.ts (1)
56-57: ⚠️ Potential issue | 🟡 Minor

Use one canonical axis label across About copy (`passType` vs `prompt mode`).

The matrix fact string and benchmark-dimension label currently use different terms for the same dimension, which creates avoidable UI inconsistency.
Also applies to: 99-102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/about/about-content.ts` around lines 56-57: the About copy uses two different labels for the same axis ("passType" vs "prompt mode"); choose one canonical label (e.g., "prompt mode") and update all occurrences so they match: replace the string value that currently reads "runtime x harness x model x test x passType" and the benchmark-dimension label entries referenced around the other block (the entries containing "passType" at the other occurrence) to use the chosen canonical term, ensuring the keys/labels in the about-content data structure remain consistent across value and detail fields.

src/cli/compare-utils.ts (1)
156-159: ⚠️ Potential issue | 🟡 Minor

Fail fast on invalid `truncate()` widths.

Returning `""` for `maxLen <= 0` still turns a caller bug into display output, which makes broken column calculations harder to notice in the compare CLI. Throw here instead of silently degrading.

♻️ Proposed fix
```diff
 export function truncate(str: string, maxLen: number): string {
-  if (maxLen <= 0) return "";
+  if (maxLen <= 0) {
+    throw new RangeError(`truncate maxLen must be >= 1: ${maxLen}`);
+  }
   if (str.length <= maxLen) return str;
   return `${str.slice(0, Math.max(0, maxLen - 1))}…`;
 }
```

As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-utils.ts` around lines 156-159: the truncate function currently returns an empty string when maxLen <= 0, masking caller bugs; update the truncate(str: string, maxLen: number): string implementation to validate maxLen and throw a descriptive error (e.g., RangeError or TypeError) when maxLen is not a positive integer instead of returning ""; keep existing behavior for valid positive maxLen and unchanged slicing logic so callers get a hard failure on invalid widths.

src/harnesses/opencode-runner.ts (1)
156-178: ⚠️ Potential issue | 🟠 Major

Wait for the escalated kill path to settle before returning.

`forceKillProcess()` now waits through the `SIGTERM` grace period, but the `SIGKILL` branch still returns immediately after signaling. The timeout/stale-output paths can therefore reject and trigger workspace cleanup while `opencode` is still exiting.

🔧 Proposed fix

```diff
 async function forceKillProcess(
   proc: ResultPromise,
   log: pino.Logger,
   reason: string,
 ): Promise<void> {
   const pid = proc.pid;
   log.warn({ pid, reason }, "Force killing OpenCode process");
   killOpenCodeProcess(proc, "SIGTERM", log);
   const outcome = await Promise.race([
     new Promise<"delay">((resolve) =>
       setTimeout(() => resolve("delay"), FORCE_KILL_DELAY_MS),
     ),
     proc.then(
       () => "settled" as const,
       () => "settled" as const,
     ),
   ]);
   if (outcome === "settled") {
     return;
   }
   log.warn({ pid }, "Escalating OpenCode process kill");
   killOpenCodeProcess(proc, "SIGKILL", log);
+  await Promise.race([
+    proc.then(
+      () => undefined,
+      () => undefined,
+    ),
+    new Promise<void>((resolve) =>
+      setTimeout(resolve, FORCE_KILL_DELAY_MS),
+    ),
+  ]);
 }
```
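The race-then-escalate shape can be shown in isolation, without the execa/pino machinery. The helper name here is illustrative, not from the codebase; the point is that a rejection still counts as the process having exited:

```typescript
// Illustrative helper: resolve "settled" if the promise settles (fulfilled
// or rejected) within graceMs, otherwise resolve "timeout" so the caller
// can escalate. Mirrors the Promise.race shape in the proposed fix.
async function settleOrTimeout(
  p: Promise<unknown>,
  graceMs: number,
): Promise<"settled" | "timeout"> {
  return Promise.race([
    p.then(
      () => "settled" as const,
      () => "settled" as const, // a rejection still means the process exited
    ),
    new Promise<"timeout">((resolve) =>
      setTimeout(() => resolve("timeout"), graceMs),
    ),
  ]);
}

// A process promise that rejects quickly still reports "settled", so the
// caller does not escalate against an already-dead process.
const fakeProc = new Promise((_, reject) =>
  setTimeout(() => reject(new Error("killed")), 10),
);
const outcome = await settleOrTimeout(fakeProc, 200);
console.log(outcome); // "settled"
```

Attaching both fulfillment and rejection handlers inside the race also prevents an unhandled-rejection warning when the child exits with an error.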
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-runner.ts` around lines 156 - 178, The forceKillProcess implementation sends SIGTERM then waits, but after escalating to SIGKILL it returns immediately; update forceKillProcess (and reuse killOpenCodeProcess and FORCE_KILL_DELAY_MS) so that after calling killOpenCodeProcess(proc, "SIGKILL", log) you wait for the proc to settle (proc.then resolves/rejects) or for a second timeout (use Promise.race like the initial grace wait) before returning, and log appropriately (e.g., warn before escalating and optionally debug when settled) so callers cannot proceed while opencode is still exiting.

src/lib/signal-assessment.ts (1)
349-379: ⚠️ Potential issue | 🟠 Major

Don’t default failed-row output to `"artifact"` here.

`agent_requested_input` only matches harness output. With the current default, any caller that forgets to pass `outputSource` silently loses that taint reason and can over-count trustworthy rows.

💡 Minimal fix

```diff
 reasons.push(
   ...getTranscriptOrInputTaintReasons(input.output, {
-    source: input.outputSource ?? "artifact",
+    source: input.outputSource ?? "harness",
   }),
 );
```
Verify each finding against the current code and only fix it if needed. In `@src/lib/signal-assessment.ts` around lines 349 - 379, The code in finalizeItemSignalAssessment is defaulting outputSource to "artifact" when calling getTranscriptOrInputTaintReasons (source: input.outputSource ?? "artifact"), which hides harness-only taint reasons like "agent_requested_input"; change the call to pass input.outputSource directly (i.e., source: input.outputSource) or omit the source option when undefined so harness-specific taints are preserved, ensuring callers that forget to supply outputSource don't silently lose taint reasons.
🧹 Nitpick comments (7)
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx (1)
40-54: Consider adding TSDoc for consistency.

While not required by coding guidelines (since this function is not exported), adding TSDoc documentation would create consistency with `getRunMachineBadgeLabel` and improve maintainability.

📝 Suggested documentation

```diff
+/**
+ * Determines the aggregate badge label and variant for a run card.
+ *
+ * @param run - Published run summary item
+ * @param latestCheckpointId - Current checkpoint ID powering the leaderboard
+ * @returns Badge configuration with label and variant
+ */
 function getRunAggregateBadge(
   run: RunListItem,
   latestCheckpointId: string | null,
 ): { label: string; variant: "secondary" | "success" | "warning" } {
```
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx` around lines 40 - 54, Add a short TSDoc comment for the internal helper getRunAggregateBadge to match style used by getRunMachineBadgeLabel: describe the function's purpose (computes label and variant for a run based on legacy flag, checkpoint presence, and latestCheckpointId), document parameters (run: RunListItem, latestCheckpointId: string | null) and the returned shape ({ label: string; variant: "secondary" | "success" | "warning" }), and place it immediately above the getRunAggregateBadge declaration to maintain consistency and improve maintainability.

apps/dashboard/src/components/charts/token-efficiency-chart.tsx (1)
38-42: Use shared muted color constant instead of hardcoded HSL fallback.

This avoids palette drift and keeps chart theming consistent across components.
Proposed refactor
```diff
 import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card";
 import { WithInfoTooltip } from "@/components/ui/info-tooltip";
+import { CHART_COLORS } from "@/lib/chart-colors";
 import { computeTokenEfficiencyData } from "@/lib/aggregations";
 import { tokenEfficiency as tokenTooltips } from "@/lib/tooltip-content";
 import type { MatrixItemResult } from "@/lib/types";
 @@
 function readHarnessColor(harness: string): string {
-  return (
-    HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] ||
-    "hsl(210, 12%, 63%)"
-  );
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS]
+    : CHART_COLORS.muted;
 }
```
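The own-property check in the refactor above matters for more than style: on a plain record, `palette[key] || fallback` can leak `Object.prototype` members for keys like `"constructor"`. A self-contained sketch with a hypothetical palette (the real `HARNESS_COLORS`/`CHART_COLORS` live elsewhere):

```typescript
// Hypothetical palette; the real HARNESS_COLORS and CHART_COLORS constants
// are defined elsewhere in the dashboard.
const PALETTE: Record<string, string> = {
  opencode: "hsl(220, 70%, 55%)",
  goose: "hsl(140, 60%, 45%)",
};
const MUTED = "hsl(210, 12%, 63%)";

// `PALETTE[key] || MUTED` would return a truthy prototype member (a
// function) for keys like "constructor"; an own-property check keeps
// every unknown key on the muted fallback.
function readColor(key: string): string {
  return Object.prototype.hasOwnProperty.call(PALETTE, key)
    ? PALETTE[key]
    : MUTED;
}
```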
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx` around lines 38 - 42, The readHarnessColor function currently falls back to a hardcoded HSL string; replace that literal with the app's shared muted color constant by importing the shared constant (e.g., MUTED_COLOR or the project's theme.muted color) and use it as the fallback in HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] || MUTED_COLOR; update the import at the top of the file accordingly so token-efficiency-chart.tsx uses the centralized muted color instead of "hsl(210, 12%, 63%)".

test/aggregate.test.ts (2)
111-157: Consider enhancing the JSDoc for test helpers.

Similar to `createItem`, adding comprehensive parameter and return documentation would improve maintainability.

📝 Suggested JSDoc enhancement

```diff
 /**
  * Creates a run result for aggregation tests.
+ * @param runId - Unique identifier for the run
+ * @param checkpointId - Checkpoint identifier this run targets
+ * @param machineProfileKey - Machine profile key for this run
+ * @param instanceId - Instance identifier for the machine
+ * @param items - Array of matrix item results
+ * @param profile - Optional overrides for machine profile properties
+ * @returns A complete RunResult for testing
  */
 function createRun(
```
Verify each finding against the current code and only fix it if needed. In `@test/aggregate.test.ts` around lines 111 - 157, Add comprehensive JSDoc to the createRun function: document each parameter (runId, checkpointId, machineProfileKey, instanceId, items, profile) with types and purpose, note optional/override fields on profile (RunProfileOverrides), describe returned type (RunResult) and what the returned object represents (summary, machine, benchmarkCheckpoint, provenance, timestamps, items), and include any defaults used (e.g., TEST_PROFILE_LABEL, TEST_NORMALIZED_PROFILE, TEST_HARDWARE, default startedAt/completedAt and durationMs) so future readers understand inputs and produced structure.
79-109: Consider enhancing the JSDoc for test helpers.

While not exported, adding `@param` and `@returns` documentation would improve maintainability and make the helper function's contract clearer for future test authors.

📝 Suggested JSDoc enhancement

```diff
 /**
  * Creates a matrix item for aggregation tests.
+ * @param id - Unique identifier for the matrix item
+ * @param completedAt - ISO timestamp when the item completed
+ * @param overrides - Optional partial overrides for the item result
+ * @returns A complete MatrixItemResult for testing
  */
 function createItem(
```
Verify each finding against the current code and only fix it if needed. In `@test/aggregate.test.ts` around lines 79 - 109, Update the JSDoc for the test helper function createItem to include `@param` tags for id (string), completedAt (string), and overrides (Partial<MatrixItemResult> optional), and add an `@returns` tag describing that it returns a MatrixItemResult; locate the JSDoc above the createItem function and add concise descriptions of each parameter and the return value to clarify the helper's contract for future test authors.

test/ollama-adapter.test.ts (1)
259-274: Optional: tighten the test title for precision.

The case validates an Ollama-named runtime with a non-Ollama API format; a more specific title would reduce ambiguity.
Possible title tweak
```diff
-  it("throws when direct adapter receives a non-Ollama runtime", async () => {
+  it("throws when direct adapter receives a non-Ollama API format", async () => {
```
Verify each finding against the current code and only fix it if needed. In `@test/ollama-adapter.test.ts` around lines 259 - 274, Rename the test title to be more specific: update the it(...) description for the test using createDirectAdapter and adapter.generate to something like "throws when runtime is named 'ollama' but apiFormat is not 'ollama' (non-Ollama apiFormat)" so it clearly communicates that the runtime name is "ollama" while apiFormat is "openai-compatible" on the mockRuntime; leave the test body (createDirectAdapter, adapter.generate, mockRuntime, timeoutMs) unchanged.

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (2)
183-191: "Select all" button skips sorting.

Other callbacks (lines 112, 121, 128) sort via `MODEL_SORT_COLLATOR`, but line 188 passes `[...models]` unsorted. If the parent doesn't guarantee `models` is sorted the same way, the output order may differ from other selection paths.

Consistency fix

```diff
 <Button
   type="button"
   variant="ghost"
   size="sm"
   className="h-7 px-2"
-  onClick={() => onSelectionChange([...models])}
+  onClick={() =>
+    onSelectionChange(
+      [...models].sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
+    )
+  }
 >
   Select all
 </Button>
```
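The `[...sorted copy].sort(collator.compare)` shape can be tried standalone. The exact options behind `MODEL_SORT_COLLATOR` are not shown in this diff, so the collator below is an assumption; a numeric-aware collator is a common choice for model IDs with embedded sizes:

```typescript
// Assumed collator options; the real MODEL_SORT_COLLATOR is defined
// elsewhere in the dashboard.
const collator = new Intl.Collator("en", { numeric: true, sensitivity: "base" });

const models = ["llama3.1-70b", "llama3.1-8b", "Gemma-2b"];
// Sorted copy, matching the `[...models].sort(...)` shape in the fix:
// the original array is left untouched.
const sorted = [...models].sort((a, b) => collator.compare(a, b));
console.log(sorted); // ["Gemma-2b", "llama3.1-8b", "llama3.1-70b"]
```

With `numeric: true`, `8b` sorts before `70b`, which a plain lexicographic sort would reverse.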
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 183 - 191, The "Select all" Button's onClick currently calls onSelectionChange([...models]) without sorting, which can produce a different order than other selection paths; update the onClick handler to sort models with MODEL_SORT_COLLATOR before invoking onSelectionChange (e.g., create a sorted copy via [...models].sort(MODEL_SORT_COLLATOR) and pass that to onSelectionChange) so that the selection order matches the other callbacks that use MODEL_SORT_COLLATOR.
196-197: Checkbox visual state uses raw `selectedModels`.

The `isSelected` check uses `selectedModels.includes(model)` rather than the normalized set. If `selectedModels` contains only stale IDs, all checkboxes appear unchecked while the label (after applying the suggested fix above) would show "All models". This creates a visual mismatch.

Consider normalizing here as well for consistency:

Potential fix (requires lifting normalization)

```diff
+  const modelSet = useMemo(() => new Set(models), [models]);
+  const normalizedSelectedModels = useMemo(
+    () => selectedModels.filter((s) => modelSet.has(s)),
+    [selectedModels, modelSet],
+  );
+  const effectivelyAllSelected =
+    selectedModels.length === 0 || normalizedSelectedModels.length === 0;
 ...
 {models.map((model) => {
   const isSelected =
-    selectedModels.length === 0 || selectedModels.includes(model);
+    effectivelyAllSelected || normalizedSelectedModels.includes(model);
```
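The normalization itself is framework-free and easy to check in isolation. A sketch of the idea (helper name is illustrative; in the component it would live inside `useMemo` as above):

```typescript
// Drop selections that no longer exist in the current model list; an
// empty normalized selection is treated the same as "all models selected".
function normalizeSelection(
  selected: readonly string[],
  available: readonly string[],
): { normalized: string[]; allSelected: boolean } {
  const availableSet = new Set(available);
  const normalized = selected.filter((id) => availableSet.has(id));
  return { normalized, allSelected: normalized.length === 0 };
}
```

Feeding both the trigger label and the checkbox state from the same normalized result removes the mismatch described above.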
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 196 - 197, The checkbox visual state is using raw selectedModels (variable selectedModels) which may contain stale IDs; update the isSelected computation in the Leaderboard ModelFilterDropdown component so it checks membership against the same normalized set used for labels (e.g., the normalizedSelectedIds or selectedModelIdsNormalized you create/lift) instead of raw selectedModels, i.e., compute a normalized set of model IDs once (or lift the existing normalization) and change the isSelected logic (the isSelected variable that currently does selectedModels.includes(model)) to use that normalized set while preserving the existing "all models" empty-set behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 92-93: Replace the direct call to
ModelComparisonChartPropsSchema.parse inside the ModelComparisonChart render
with a safeParse flow: call ModelComparisonChartPropsSchema.safeParse(props),
check result.success and if false render a graceful fallback UI (e.g., error
placeholder or null) and optionally log the validation error; update variable
usage to use result.data when success is true. Alternatively, if you
intentionally want parse to throw, add full JSDoc/TSDoc to the exported
ModelComparisonChart (including `@throws`) to satisfy repo standards. Ensure
references to ModelComparisonChart and ModelComparisonChartPropsSchema are
updated accordingly.
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 29-47: The buildTriggerLabel function currently returns "0 models
selected" when selectedModels contains only stale IDs; update the logic in
buildTriggerLabel (which uses selectedModels, currentModels,
selectedCurrentModels, and hasAllCurrentModels) so that if
selectedCurrentModels.length === 0 it treats this the same as an empty selection
and returns "All models" (i.e., short-circuit to the "All models" branch when no
valid current selections remain), while preserving the existing behavior for
single and multiple valid selections.
In `@README.md`:
- Around line 142-143: The README example for "bun pb" understates what runs by
default; update the "bun pb" example text to explicitly state that the command
auto-discovers runtimes, models and harnesses (e.g., "Auto-discovers
runtimes/models/harnesses by default") so readers understand the default
behavior when invoking bun pb; locate the current example referencing "bun pb"
and expand its description to mention auto-discovery of available harnesses,
models and runtimes.
In `@src/harnesses/opencode-events.ts`:
- Around line 98-110: The function extractContentFromArgs currently treats
args.file as content which can return a filename instead of actual emitted code;
update extractContentFromArgs to stop using args.file as a fallback (keep
args.file_text and args.fileText which are actual content fields) so only real
content fields (content, contents, text, code, file_text, fileText) are
considered; locate the extractContentFromArgs function and remove args.file from
the chain of fallbacks, leaving the rest of the logic intact so filenames are no
longer returned as content.
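The fallback-chain change above can be sketched directly. This is a simplified stand-in, not the real `extractContentFromArgs` signature; the field list is taken from the prompt, and `args.file` is deliberately absent:

```typescript
// Only fields that hold actual emitted text are considered; `args.file`
// (a filename) is intentionally not in the candidate list.
function extractContentFromArgs(
  args: Record<string, unknown>,
): string | undefined {
  const candidates = [
    args.content,
    args.contents,
    args.text,
    args.code,
    args.file_text,
    args.fileText,
  ];
  for (const value of candidates) {
    if (typeof value === "string" && value.length > 0) return value;
  }
  return undefined;
}
```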
In `@src/lib/stats-format.ts`:
- Around line 81-85: Update the JSDoc for the exported function formatRunStats
to include a `@throws` tag describing the Error it can throw when failure
accounting is invalid (specifically when stats.generationFailures?.total exceeds
the computed failed count); locate the JSDoc above the formatRunStats function
and add a short sentence for the throw (e.g. "Throws Error if generation failure
count is greater than failed item count") and include when this happens and the
error type.
In `@src/runner/run-progress.ts`:
- Around line 113-115: The current readErrorMessage(error: unknown) may throw
when calling String(error) for objects with malicious/buggy toString/valueOf;
update readErrorMessage to wrap the coercion in a try/catch: if error is an
Error return error.message, else attempt String(error) inside try and return
that, and on failure fall back to a safe alternative such as
Object.prototype.toString.call(error) (also guarded) and finally return a fixed
sentinel like "<unserializable error>" if even that fails; apply these changes
inside the readErrorMessage function to ensure failure-path serialization cannot
throw.
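The layered fallback the prompt describes can be written as a small total function. A sketch under the names from the prompt (the exact sentinel string is the prompt's suggestion, not confirmed code):

```typescript
// Failure-path serialization must never throw, even for objects with a
// hostile toString/valueOf.
function readErrorMessage(error: unknown): string {
  if (error instanceof Error) return error.message;
  try {
    return String(error); // may invoke a user-defined toString that throws
  } catch {
    try {
      // Does not call the value's own toString, so it cannot be hijacked.
      return Object.prototype.toString.call(error);
    } catch {
      return "<unserializable error>";
    }
  }
}
```

`Object.prototype.toString.call` only consults `Symbol.toStringTag`, so a throwing `toString` on the object itself cannot reach it.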
In `@src/schemas/result.schema.ts`:
- Around line 117-149: The refinement on the metrics object omits
retryGenerationDurationMs, so partial retry payloads (e.g., { durationMs: 12,
retryGenerationDurationMs: 5 }) slip through; update the refine block that
references metrics.retryKind, metrics.retryReason, metrics.retryAttempted, and
metrics.retryPromoted to also include metrics.retryGenerationDurationMs in
hasAnyRetryField and in the branches: when retryAttempted === true require
retryGenerationDurationMs to be a number (and optionally non-negative) alongside
retryKind, non-empty string retryReason, and boolean retryPromoted; when
retryAttempted === false require retryGenerationDurationMs to be undefined
(absent) along with the other retry fields.
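The cross-field invariant above is independent of Zod and can be stated as a plain predicate. A sketch only; the real check is a Zod refinement in result.schema.ts, and field names are taken from the prompt:

```typescript
// Simplified retry-metrics shape; the real schema carries more fields.
interface RetryMetrics {
  retryAttempted?: boolean;
  retryKind?: string;
  retryReason?: string;
  retryPromoted?: boolean;
  retryGenerationDurationMs?: number;
}

// When a retry was attempted, every retry field must be present and valid;
// otherwise every retry field (including the duration) must be absent.
function retryFieldsConsistent(m: RetryMetrics): boolean {
  if (m.retryAttempted === true) {
    return (
      typeof m.retryKind === "string" &&
      typeof m.retryReason === "string" &&
      m.retryReason.length > 0 &&
      typeof m.retryPromoted === "boolean" &&
      typeof m.retryGenerationDurationMs === "number" &&
      m.retryGenerationDurationMs >= 0
    );
  }
  return (
    m.retryKind === undefined &&
    m.retryReason === undefined &&
    m.retryPromoted === undefined &&
    m.retryGenerationDurationMs === undefined
  );
}
```

The partial payload from the comment, `{ retryGenerationDurationMs: 5 }` with no `retryAttempted`, now fails instead of slipping through.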
---
Duplicate comments:
In `@apps/dashboard/src/components/about/about-content.ts`:
- Around line 56-57: The About copy uses two different labels for the same axis
("passType" vs "prompt mode"); choose one canonical label (e.g., "prompt mode")
and update all occurrences so they match: replace the string value that
currently reads "runtime x harness x model x test x passType" and the
benchmark-dimension label entries referenced around the other block (the entries
containing "passType" at the other occurrence) to use the chosen canonical term,
ensuring the keys/labels in the about-content data structure remain consistent
across value and detail fields.
In `@src/cli/compare-utils.ts`:
- Around line 156-159: The truncate function currently returns an empty string
when maxLen <= 0, masking caller bugs; update the truncate(str: string, maxLen:
number): string implementation to validate maxLen and throw a descriptive error
(e.g., RangeError or TypeError) when maxLen is not a positive integer instead of
returning ""; keep existing behavior for valid positive maxLen and unchanged
slicing logic (use truncate) so callers get a hard failure on invalid widths.
In `@src/harnesses/opencode-runner.ts`:
- Around line 156-178: The forceKillProcess implementation sends SIGTERM then
waits, but after escalating to SIGKILL it returns immediately; update
forceKillProcess (and reuse killOpenCodeProcess and FORCE_KILL_DELAY_MS) so that
after calling killOpenCodeProcess(proc, "SIGKILL", log) you wait for the proc to
settle (proc.then resolves/rejects) or for a second timeout (use Promise.race
like the initial grace wait) before returning, and log appropriately (e.g., warn
before escalating and optionally debug when settled) so callers cannot proceed
while opencode is still exiting.
In `@src/lib/signal-assessment.ts`:
- Around line 349-379: The code in finalizeItemSignalAssessment is defaulting
outputSource to "artifact" when calling getTranscriptOrInputTaintReasons
(source: input.outputSource ?? "artifact"), which hides harness-only taint
reasons like "agent_requested_input"; change the call to pass input.outputSource
directly (i.e., source: input.outputSource) or omit the source option when
undefined so harness-specific taints are preserved, ensuring callers that forget
to supply outputSource don't silently lose taint reasons.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx`:
- Around line 38-42: The readHarnessColor function currently falls back to a
hardcoded HSL string; replace that literal with the app's shared muted color
constant by importing the shared constant (e.g., MUTED_COLOR or the project's
theme.muted color) and use it as the fallback in HARNESS_COLORS[harness as keyof
typeof HARNESS_COLORS] || MUTED_COLOR; update the import at the top of the file
accordingly so token-efficiency-chart.tsx uses the centralized muted color
instead of "hsl(210, 12%, 63%)".
In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx`:
- Around line 40-54: Add a short TSDoc comment for the internal helper
getRunAggregateBadge to match style used by getRunMachineBadgeLabel: describe
the function's purpose (computes label and variant for a run based on legacy
flag, checkpoint presence, and latestCheckpointId), document parameters (run:
RunListItem, latestCheckpointId: string | null) and the returned shape ({ label:
string; variant: "secondary" | "success" | "warning" }), and place it
immediately above the getRunAggregateBadge declaration to maintain consistency
and improve maintainability.
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 183-191: The "Select all" Button's onClick currently calls
onSelectionChange([...models]) without sorting, which can produce a different
order than other selection paths; update the onClick handler to sort models with
MODEL_SORT_COLLATOR before invoking onSelectionChange (e.g., create a sorted
copy via [...models].sort(MODEL_SORT_COLLATOR) and pass that to
onSelectionChange) so that the selection order matches the other callbacks that
use MODEL_SORT_COLLATOR.
- Around line 196-197: The checkbox visual state is using raw selectedModels
(variable selectedModels) which may contain stale IDs; update the isSelected
computation in the Leaderboard ModelFilterDropdown component so it checks
membership against the same normalized set used for labels (e.g., the
normalizedSelectedIds or selectedModelIdsNormalized you create/lift) instead of
raw selectedModels, i.e., compute a normalized set of model IDs once (or lift
the existing normalization) and change the isSelected logic (the isSelected
variable that currently does selectedModels.includes(model)) to use that
normalized set while preserving the existing "all models" empty-set behavior.
In `@test/aggregate.test.ts`:
- Around line 111-157: Add comprehensive JSDoc to the createRun function:
document each parameter (runId, checkpointId, machineProfileKey, instanceId,
items, profile) with types and purpose, note optional/override fields on profile
(RunProfileOverrides), describe returned type (RunResult) and what the returned
object represents (summary, machine, benchmarkCheckpoint, provenance,
timestamps, items), and include any defaults used (e.g., TEST_PROFILE_LABEL,
TEST_NORMALIZED_PROFILE, TEST_HARDWARE, default startedAt/completedAt and
durationMs) so future readers understand inputs and produced structure.
- Around line 79-109: Update the JSDoc for the test helper function createItem
to include `@param` tags for id (string), completedAt (string), and overrides
(Partial<MatrixItemResult> optional), and add an `@returns` tag describing that it
returns a MatrixItemResult; locate the JSDoc above the createItem function and
add concise descriptions of each parameter and the return value to clarify the
helper's contract for future test authors.
In `@test/ollama-adapter.test.ts`:
- Around line 259-274: Rename the test title to be more specific: update the
it(...) description for the test using createDirectAdapter and adapter.generate
to something like "throws when runtime is named 'ollama' but apiFormat is not
'ollama' (non-Ollama apiFormat)" so it clearly communicates that the runtime
name is "ollama" while apiFormat is "openai-compatible" on the mockRuntime;
leave the test body (createDirectAdapter, adapter.generate, mockRuntime,
timeoutMs) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1261471f-88b6-47d8-a13d-14ba27cb408f
📒 Files selected for processing (158)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- test/vllm-runtime.test.ts
- src/harnesses/opencode-output.ts
- package.json
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-model.ts
- src/harnesses/goose-openai.ts
- src/runtimes/discovery.ts
- src/harnesses/opencode-process.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
@CodeRabbit full review
✅ Actions performed

Full review triggered.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lib/model-profile/registry.ts (1)
205-223: ⚠️ Potential issue | 🟡 Minor

Normalize legacy alias data through the same supported-runtime filter as profile data.

The legacy alias loading paths (lines 205–223) call `normalizeLegacyAliasMap()` directly, bypassing the filtering logic in `normalizeLoadedModelProfileRegistry()`. The alias schemas (`ModelAliasEntrySchema` accepts `z.record(z.string(), ...)`) allow any runtime key for backwards compatibility, meaning retired runtimes can survive load. While profile paths filter unsupported variants, alias paths do not—inconsistent handling of the same registry structure. Move alias data through `normalizeLoadedModelProfileRegistry()` or apply equivalent filtering within `normalizeLegacyAliasMap()` to ensure retired runtimes are dropped at the boundary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/registry.ts` around lines 205 - 223, The legacy alias loading paths currently call normalizeLegacyAliasMap(...) directly (after ModelAliasFileSchema and ModelAliasMapSchema parses), which bypasses the supported-runtime filtering in normalizeLoadedModelProfileRegistry(...); change the code to pass the normalized alias map through normalizeLoadedModelProfileRegistry (or alternatively implement the same supported-runtime filtering inside normalizeLegacyAliasMap) so retired runtimes are dropped consistently—i.e., after ModelAliasFileSchema.safeParse(parsed) and ModelAliasMapSchema.safeParse(parsed), feed the parsed alias map into normalizeLoadedModelProfileRegistry (referencing normalizeLegacyAliasMap and normalizeLoadedModelProfileRegistry to locate the logic) and return that filtered result instead of returning normalizeLegacyAliasMap(...) directly.

src/harnesses/opencode-adapter.ts (1)
455-488:⚠️ Potential issue | 🟠 MajorPreserve first-attempt taint when the retry attempt fails.
The success branch merges
firstAttemptAssessment, but the catch branch only restampsdurationMs. If the retry throws, the final error drops the first attempt's contract-violation evidence.Suggested fix
```diff
 } catch (error) {
   const totalDurationMs = Math.round(performance.now() - startTime);
   if (error !== null && typeof error === "object") {
-    Object.assign(error, { durationMs: totalDurationMs });
+    const existingSignalAssessment =
+      "signalAssessment" in error
+        ? (error as { signalAssessment?: SignalAssessment })
+            .signalAssessment
+        : undefined;
+    Object.assign(error, {
+      durationMs: totalDurationMs,
+      signalAssessment: appendSignalAssessmentReasons(
+        existingSignalAssessment,
+        firstAttemptAssessment?.classification === "tainted"
+          ? firstAttemptAssessment.reasons
+          : [],
+      ),
+    });
   }
   throw error;
 }
```

Based on learnings, "Never implicitly fix up results after a run; capture enough evidence to explain outcomes (test failures, eval reasoning, durations, best-effort metrics)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 455 - 488, The catch block after the retry (inside the generate flow that created firstAttemptAssessment via buildSignalAssessment and then called createOpenCodeAdapter().generate) must preserve the firstAttemptAssessment taint evidence when the retry throws: in the catch, besides stamping durationMs onto the thrown error, merge or append firstAttemptAssessment.reasons (or the full firstAttemptAssessment) into the error (e.g., attach/merge into error.signalAssessment or error.reasons) using the same helper appendSignalAssessmentReasons so downstream handlers see the original contract-violation evidence; ensure you only attach when firstAttemptAssessment exists and its classification is "tainted" to avoid adding empty data.
♻️ Duplicate comments (3)
src/runtimes/ollama-runtime.ts (1)
294-304: ⚠️ Potential issue | 🟠 Major
Emit explicit negative capabilities for non-generative models.
Right now `"embedding"` and `"unknown"` leave `generateText` undefined. Any downstream check that treats `generateText !== false` as eligible will still benchmark those models as generators. Populate both booleans explicitly at this boundary so unclassified models stay unambiguous.

Suggested patch

```diff
   modelKind,
-  capabilities: {
-    ...(modelKind === "text-generation" ? { generateText: true } : {}),
-    ...(modelKind === "embedding" ? { embedText: true } : {}),
-  },
+  capabilities: {
+    generateText: modelKind === "text-generation",
+    embedText: modelKind === "embedding",
+  },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runtimes/ollama-runtime.ts` around lines 294 - 304, The returned model capabilities object from the function that uses inferModelKind (where name/modelKind/sizeBytes/parametersBillions are returned) must explicitly set both booleans so non-generative or unclassified models don't leave generateText/embedText undefined; update the capabilities construction in the return block to always include generateText and embedText (true for the matching kind, false otherwise) instead of conditionally spreading only the positive flag so that models with kind "embedding" or "unknown" have generateText: false and embedText: false/true as appropriate.

src/runner/index.ts (1)
231-255: ⚠️ Potential issue | 🟡 Minor
Skipped rows still hit the residency teardown path.
After `buildPreflightSkipResult(...)`, the `finally` block can still call `runPostItemResidencyTeardown(item)` on the last skipped row. That means a deterministic preflight skip still touches Ollama, can unload models, and can add unrelated warning noise. Keep this branch side-effect free and only run teardown for rows that actually executed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/index.ts` around lines 231 - 255, The finally block unconditionally calls runPostItemResidencyTeardown(item) even for preflight-skipped rows; change the flow so teardown only runs for items that actually executed. Introduce a local flag (e.g., executed or didRun) around the per-item run logic: set it false when you create a preflight skip via buildPreflightSkipResult(...) and true for real executions, then replace the unconditional teardown call with if (isLastForModel && executed) await runPostItemResidencyTeardown(item). Keep the existing progress checkpoint calls (writeProgressCheckpoint) unchanged.

src/harnesses/opencode-events.ts (1)
249-250: ⚠️ Potential issue | 🟠 Major
Preserve original stream order when rebuilding `mixedOutput`.

This still buffers parsed text separately from raw lines and then appends every unparsed line at the end. If non-JSON output arrives between deltas, the reconstructed payload changes order and can corrupt fallback tool-call salvage.
Suggested fix
```diff
-  const textParts: string[] = [];
+  const orderedOutputFragments: string[] = [];
   const unparsedLines: string[] = [];
@@
   } catch {
     permissionDenied ||= isOpenCodePermissionDeniedText(line);
     unparsedLines.push(line);
+    orderedOutputFragments.push(`${line}\n`);
     continue;
   }
@@
   const text = readEventText(event);
   if (text) {
     permissionDenied ||= isOpenCodePermissionDeniedText(text);
-    textParts.push(text);
+    orderedOutputFragments.push(text);
   }
 }
@@
-  if (parsedLines > 0 && textParts.length > 0) {
-    const combined = textParts.join("");
-    const mixedOutput =
-      unparsedLines.length > 0
-        ? [combined, ...unparsedLines].join("\n")
-        : combined;
+  if (parsedLines > 0 && orderedOutputFragments.length > 0) {
+    const mixedOutput = orderedOutputFragments.join("").trimEnd();
     const mixedPermissionDenied = isOpenCodePermissionDeniedText(mixedOutput);
     const toolCallCode = extractFromToolCallText(mixedOutput);
```

Also applies to: 335-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-events.ts` around lines 249 - 250, The current rebuild logic separates parsed text into textParts and raw lines into unparsedLines and then appends unparsedLines at the end, which reorders interleaved non-JSON output; instead, preserve stream order by replacing the two buffers with a single ordered buffer (e.g., an array of items noting type and content) and push entries as they arrive; update the code that pushes into textParts/unparsedLines to instead push into this ordered buffer, and change the mixedOutput assembly (where mixedOutput is created) to iterate the ordered buffer and concatenate entries in sequence so original order is preserved (apply same change to the other occurrence around lines 335-340).
🧹 Nitpick comments (5)
apps/dashboard/src/components/charts/token-efficiency-chart.tsx (1)
39-42: Apply defensive guard against prototype key pollution in `readHarnessColor`.

The function receives `harness` as a plain `string` type (not constrained to known literal values at the dashboard level, as `z.string()` schemas in both src/schemas and apps/dashboard schemas lack enum validation). The type assertion `harness as keyof typeof HARNESS_COLORS` can resolve inherited object properties (e.g., `toString`, `constructor`) before reaching the muted fallback.

Suggested fix
```diff
 const HARNESS_COLORS = {
   direct: "hsl(215, 70%, 62%)", // steel blue
   goose: "hsl(142, 60%, 49%)", // brand green
   opencode: "hsl(38, 80%, 58%)", // warm amber
 } as const;

+type KnownHarnessName = keyof typeof HARNESS_COLORS;
+
 function readHarnessColor(harness: string): string {
-  return (
-    HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS] || CHART_COLORS.muted
-  );
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as KnownHarnessName]
+    : CHART_COLORS.muted;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx` around lines 39 - 42, readHarnessColor currently indexes HARNESS_COLORS with an unconstrained string and a type assertion, which can resolve inherited prototype keys; update readHarnessColor to defensively check that the incoming harness is a direct own key of HARNESS_COLORS before returning its value (e.g., use Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness) or a Map lookup) and otherwise return CHART_COLORS.muted; reference HARNESS_COLORS and CHART_COLORS.muted in your change and keep the function signature readHarnessColor(harness: string): string.

apps/dashboard/src/lib/types.ts (1)
86-86: Apply `ArtifactRuntimeName` consistently across plan/result types.

Tightening `ModelVariant.runtime` is good, but `MatrixItem.runtime` and `MatchedItem.runtime` are still plain `string`. That leaves this file out of sync with the shared plan schema and weakens runtime-label validation in the dashboard layer.
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` at line 86, Update the dashboard types so runtime uses the shared ArtifactRuntimeName type everywhere: change the runtime property on MatrixItem and MatchedItem to ArtifactRuntimeName (matching the already-tightened ModelVariant.runtime) in apps/dashboard/src/lib/types.ts; ensure any related imports include ArtifactRuntimeName and run type-checking to fix any downstream type mismatches.

test/opencode-config.test.ts (1)
104-120: Also lock in `OPENCODE_DISABLE_WEBFETCH` here.

This regression test asserts the web-search flag, but the env builder is also supposed to force web fetch off. Adding `expect(env.OPENCODE_DISABLE_WEBFETCH).toBe("true")` would keep a one-sided regression from passing.
Verify each finding against the current code and only fix it if needed. In `@test/opencode-config.test.ts` around lines 104 - 120, The test for "exports config dir/path/content for isolated headless runs" is missing an assertion for the web fetch flag; update the test that calls buildOpenCodeEnv to also assert expect(env.OPENCODE_DISABLE_WEBFETCH).toBe("true") so OPENCODE_DISABLE_WEBFETCH is locked off alongside OPENCODE_DISABLE_WEBSEARCH (refer to buildOpenCodeEnv and the existing OPENCODE_DISABLE_WEBSEARCH/OPENCODE_DISABLE_AUTOUPDATE assertions to place the new expect).

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (2)
116-116: Use `normalizedSelectedModelSet.has()` instead of `includes()` for O(1) lookup.

Since `normalizedSelectedModelSet` is already computed at line 108, prefer using the Set's `has()` method for constant-time lookup rather than a linear array search.

♻️ Suggested fix
```diff
- if (normalizedSelectedModels.includes(model)) {
+ if (normalizedSelectedModelSet.has(model)) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` at line 116, The current check uses normalizedSelectedModels.includes(model) which performs an O(n) array scan; change it to use the precomputed Set normalizedSelectedModelSet.has(model) for O(1) lookup—update the condition in the component where normalizedSelectedModels and normalizedSelectedModelSet are used (replace includes() on normalizedSelectedModels with normalizedSelectedModelSet.has(model)).
207-209: Visual inconsistency when only stale IDs are selected.

When `selectedModels` contains only stale IDs (no longer in `models`), the trigger label correctly shows "All models" (line 42 handles this), but the checkbox visual logic at line 209 will show all items as unchecked because `selectedModels.length !== 0` and `normalizedSelectedIds` is empty.

Consider aligning the checkbox checked state with the trigger label semantics:
♻️ Suggested fix
```diff
+ const representsAllModels =
+   selectedModels.length === 0 || normalizedSelectedIds.size === 0;
  // ...
  {models.map((model) => {
    const isSelected =
-     selectedModels.length === 0 || normalizedSelectedIds.has(model);
+     representsAllModels || normalizedSelectedIds.has(model);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 207 - 209, The checkbox state logic in the model-filter-dropdown component is inconsistent when selectedModels contains only stale IDs: update the isSelected calculation used in models.map so that an empty normalizedSelectedIds is treated the same as no selection (i.e., show all checked to match the trigger label). Specifically, change the condition for isSelected (used when rendering each model checkbox) to treat normalizedSelectedIds.size === 0 as a truthy "all selected" case in addition to selectedModels.length === 0 and normalizedSelectedIds.has(model) so the visual checkboxes align with the trigger label semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@llm/implementation/harnesses-implementation.md`:
- Around line 340-345: The doc table in harnesses-implementation.md is out of
sync with the Goose adapter: either update the Goose row to describe the current
mechanism (that src/harnesses/goose-adapter.ts invokes "goose run" without
--with-builtin developer and relies on per-item provider config/permission
rules), or change the adapter to pass the CLI flag again; specifically, locate
the Goose documentation entry in harnesses-implementation.md and replace the CLI
flag description with the current provider-config/permission-based method (or
alternatively modify the goose adapter code to reintroduce the --with-builtin
developer flag where goose is invoked if you prefer restoring the original
behavior).
In `@src/cli/compare-formatters.ts`:
- Around line 209-214: The table output pads harness names but never truncates
them, causing column overflow; update each place where you format harness values
(e.g., inside the regressions loop and the other formatting loops that call
pad(item.harness, columnWidths.harness)) to wrap the harness value with truncate
before padding—i.e., replace pad(item.harness, columnWidths.harness) with
pad(truncate(item.harness, columnWidths.harness), columnWidths.harness) so long
harness strings are clipped to columnWidths.harness and table alignment is
preserved.
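The truncate-then-pad pattern requested above can be sketched as follows. This is a hypothetical, simplified version: the names `truncate`, `pad`, and `formatCell` mirror the helpers the comment refers to, not the actual compare-formatters implementation.

```typescript
// Hypothetical sketch: clip a cell value to the column width first,
// then pad, so long harness names can never break table alignment.

function truncate(value: string, width: number): string {
  if (value.length <= width) return value;
  // Reserve one character for the ellipsis marker.
  return `${value.slice(0, Math.max(0, width - 1))}…`;
}

function pad(value: string, width: number): string {
  return value.padEnd(width, " ");
}

function formatCell(value: string, width: number): string {
  // Truncate before padding so the result is always exactly `width` wide
  // for over-long inputs and right-padded for short ones.
  return pad(truncate(value, width), width);
}
```

Padding alone only guarantees a minimum width; truncating first also guarantees the maximum, which is what keeps the columns aligned.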
In `@src/cli/migrate-machine-command.ts`:
- Around line 362-364: Remove the duplicate console.error call in the catch
block and rely solely on the structured logger; specifically, delete the
console.error(...) invocation and keep the logger.error({ error },
"Machine-profile migration failed") followed by process.exit(1) so errors are
emitted only via Pino (remove the console.error line that precedes logger.error
in the catch handling around the migration logic).
In `@src/harnesses/opencode-adapter.ts`:
- Around line 493-498: The throw reuses the pre-decision signalAssessment so
failures that never wrote SOLUTION_FILENAME miss output-contract taint; rebuild
the assessment from decision before calling buildOpenCodeFailure and pass that
rebuilt assessment instead of signalAssessment. Specifically, compute a new
assessment object based on decision (marking output_contract_violation or
tool_call_not_executed as appropriate), then call buildOpenCodeFailure with
durationMs, parsed.output and the new assessment so the terminal failure path
records the correct output-contract taint.
In `@src/harnesses/opencode-artifacts.ts`:
- Around line 172-186: In readUsableOpenCodeSolution validate the
minOutputLength parameter before attempting to read the file: check that
minOutputLength is a finite number and not negative (e.g.,
Number.isFinite(minOutputLength) && minOutputLength >= 0); if the check fails
throw a clear TypeError/RangeError describing the invalid threshold. Place this
validation at the top of readUsableOpenCodeSolution so you fail fast on
programmer/config errors (refer to the function name readUsableOpenCodeSolution
and the minOutputLength parameter when making the change).
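The fail-fast guard described above can be sketched like this. The function name is illustrative; only the validation logic (finite, non-negative) is the point, placed before any file I/O.

```typescript
// Illustrative guard for a minOutputLength-style threshold parameter.
// Rejects NaN, ±Infinity, and negative values with a clear RangeError
// so programmer/config errors surface before any file is read.

function assertValidMinOutputLength(minOutputLength: number): void {
  if (!Number.isFinite(minOutputLength) || minOutputLength < 0) {
    throw new RangeError(
      `minOutputLength must be a finite, non-negative number; got ${minOutputLength}`,
    );
  }
}
```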
In `@src/harnesses/opencode-cli.ts`:
- Around line 1-11: Update the file header to include parseOpenCodeRunFeatures
in the exported list and add explicit `@throws` {never} TSDoc entries to both
parseOpenCodeRunFeatures() and isOpenCodeRunCompatible() function docs; locate
the functions by name (parseOpenCodeRunFeatures, isOpenCodeRunCompatible) and
add the throws tag to their existing JSDoc/TSDoc blocks, and update the
top-of-file export list to mention parseOpenCodeRunFeatures alongside the other
exports.
In `@src/harnesses/opencode-runner.ts`:
- Around line 248-255: The stdout/stderr handlers call chunk.toString() on each
Buffer, which can corrupt multi-byte UTF-8 characters; before attaching
listeners on proc.stdout and proc.stderr in the opencode runner, call
proc.stdout.setEncoding("utf8") and proc.stderr.setEncoding("utf8") so Node
reconstructs multi-byte sequences, then push the received string chunks into
stdoutChunks/stderrChunks and update lastOutputTime as before (refer to
proc.stdout, proc.stderr, lastOutputTime, stdoutChunks, stderrChunks).
In `@src/runner/item-executor.ts`:
- Around line 177-178: The call to loadPrompt(item.test, item.passType) is
currently executed before the outer try in executeItem, causing a
prompt_not_found error to reject executeItem instead of producing a per-item
failure row in run.json; move the loadPrompt call inside the recoverable item
execution path (inside the try/catch that handles per-item failures) so missing
prompts are converted into a structured failure entry (prompt_not_found) and
execution continues, or catch errors from loadPrompt immediately and normalize
them into the same failure-handling logic used for timeouts/model/eval errors;
update references to loadPrompt, executeItem and the run.json failure recording
logic to ensure prompt_not_found is treated like other per-item failures.
---
Outside diff comments:
In `@src/harnesses/opencode-adapter.ts`:
- Around line 455-488: The catch block after the retry (inside the generate flow
that created firstAttemptAssessment via buildSignalAssessment and then called
createOpenCodeAdapter().generate) must preserve the firstAttemptAssessment taint
evidence when the retry throws: in the catch, besides stamping durationMs onto
the thrown error, merge or append firstAttemptAssessment.reasons (or the full
firstAttemptAssessment) into the error (e.g., attach/merge into
error.signalAssessment or error.reasons) using the same helper
appendSignalAssessmentReasons so downstream handlers see the original
contract-violation evidence; ensure you only attach when firstAttemptAssessment
exists and its classification is "tainted" to avoid adding empty data.
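The merge-onto-thrown-error pattern requested above can be sketched with a simplified assessment shape. Everything here is illustrative: the `SignalAssessment` interface and `attachTaint` helper are stand-ins, and the real adapter's `appendSignalAssessmentReasons` is only referenced by name.

```typescript
// Simplified stand-in for the adapter's signal-assessment shape.
interface SignalAssessment {
  classification: "clean" | "tainted";
  reasons: string[];
}

// Attach first-attempt taint evidence to a thrown error object, merging
// with any assessment already stamped on it. Only attaches when the
// first attempt exists and was actually tainted, so no empty data is added.
function attachTaint(error: unknown, first?: SignalAssessment): void {
  if (error === null || typeof error !== "object") return;
  if (first === undefined || first.classification !== "tainted") return;
  const carrier = error as { signalAssessment?: SignalAssessment };
  const existing = carrier.signalAssessment ?? {
    classification: "tainted" as const,
    reasons: [],
  };
  carrier.signalAssessment = {
    classification: "tainted",
    // Deduplicate so a retried attempt does not double-report a reason.
    reasons: [...new Set([...existing.reasons, ...first.reasons])],
  };
}
```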
In `@src/lib/model-profile/registry.ts`:
- Around line 205-223: The legacy alias loading paths currently call
normalizeLegacyAliasMap(...) directly (after ModelAliasFileSchema and
ModelAliasMapSchema parses), which bypasses the supported-runtime filtering in
normalizeLoadedModelProfileRegistry(...); change the code to pass the normalized
alias map through normalizeLoadedModelProfileRegistry (or alternatively
implement the same supported-runtime filtering inside normalizeLegacyAliasMap)
so retired runtimes are dropped consistently—i.e., after
ModelAliasFileSchema.safeParse(parsed) and
ModelAliasMapSchema.safeParse(parsed), feed the parsed alias map into
normalizeLoadedModelProfileRegistry (referencing normalizeLegacyAliasMap and
normalizeLoadedModelProfileRegistry to locate the logic) and return that
filtered result instead of returning normalizeLegacyAliasMap(...) directly.
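The boundary filtering this item asks for can be sketched generically. The names below (`SUPPORTED_RUNTIMES`, `dropUnsupportedRuntimes`, the alias-map shape) are assumptions for illustration, not the real registry API.

```typescript
// Hypothetical sketch: drop alias entries keyed by runtimes that are no
// longer supported, so retired runtimes cannot survive a registry load.

const SUPPORTED_RUNTIMES = new Set(["ollama"]);

type AliasMap = Record<string, Record<string, string>>;

function dropUnsupportedRuntimes(aliases: AliasMap): AliasMap {
  const filtered: AliasMap = {};
  for (const [alias, byRuntime] of Object.entries(aliases)) {
    const kept = Object.fromEntries(
      Object.entries(byRuntime).filter(([runtime]) =>
        SUPPORTED_RUNTIMES.has(runtime),
      ),
    );
    // Keep the alias only if at least one supported runtime remains.
    if (Object.keys(kept).length > 0) filtered[alias] = kept;
  }
  return filtered;
}
```

Routing both the profile and the legacy-alias load paths through one such filter is what makes the handling consistent at the boundary.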
---
Duplicate comments:
In `@src/harnesses/opencode-events.ts`:
- Around line 249-250: The current rebuild logic separates parsed text into
textParts and raw lines into unparsedLines and then appends unparsedLines at the
end, which reorders interleaved non-JSON output; instead, preserve stream order
by replacing the two buffers with a single ordered buffer (e.g., an array of
items noting type and content) and push entries as they arrive; update the code
that pushes into textParts/unparsedLines to instead push into this ordered
buffer, and change the mixedOutput assembly (where mixedOutput is created) to
iterate the ordered buffer and concatenate entries in sequence so original order
is preserved (apply same change to the other occurrence around lines 335-340).
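The single-ordered-buffer approach can be sketched as below. The event shape (`{ text?: string }`) is a simplification of the real OpenCode event stream; only the ordering technique is the point.

```typescript
// Sketch: parsed event text and unparsed raw lines go into ONE array in
// arrival order, so rebuilding the mixed output preserves interleaving
// instead of appending all non-JSON lines at the end.

function rebuildMixedOutput(lines: string[]): string {
  const orderedFragments: string[] = [];
  for (const line of lines) {
    try {
      const event = JSON.parse(line) as { text?: string };
      if (typeof event.text === "string") orderedFragments.push(event.text);
    } catch {
      // Non-JSON output keeps its position in the stream.
      orderedFragments.push(`${line}\n`);
    }
  }
  return orderedFragments.join("");
}
```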
In `@src/runner/index.ts`:
- Around line 231-255: The finally block unconditionally calls
runPostItemResidencyTeardown(item) even for preflight-skipped rows; change the
flow so teardown only runs for items that actually executed. Introduce a local
flag (e.g., executed or didRun) around the per-item run logic: set it false when
you create a preflight skip via buildPreflightSkipResult(...) and true for real
executions, then replace the unconditional teardown call with if (isLastForModel
&& executed) await runPostItemResidencyTeardown(item). Keep the existing
progress checkpoint calls (writeProgressCheckpoint) unchanged.
In `@src/runtimes/ollama-runtime.ts`:
- Around line 294-304: The returned model capabilities object from the function
that uses inferModelKind (where name/modelKind/sizeBytes/parametersBillions are
returned) must explicitly set both booleans so non-generative or unclassified
models don't leave generateText/embedText undefined; update the capabilities
construction in the return block to always include generateText and embedText
(true for the matching kind, false otherwise) instead of conditionally spreading
only the positive flag so that models with kind "embedding" or "unknown" have
generateText: false and embedText: false/true as appropriate.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/token-efficiency-chart.tsx`:
- Around line 39-42: readHarnessColor currently indexes HARNESS_COLORS with an
unconstrained string and a type assertion, which can resolve inherited prototype
keys; update readHarnessColor to defensively check that the incoming harness is
a direct own key of HARNESS_COLORS before returning its value (e.g., use
Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness) or a Map lookup)
and otherwise return CHART_COLORS.muted; reference HARNESS_COLORS and
CHART_COLORS.muted in your change and keep the function signature
readHarnessColor(harness: string): string.
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Line 116: The current check uses normalizedSelectedModels.includes(model)
which performs an O(n) array scan; change it to use the precomputed Set
normalizedSelectedModelSet.has(model) for O(1) lookup—update the condition in
the component where normalizedSelectedModels and normalizedSelectedModelSet are
used (replace includes() on normalizedSelectedModels with
normalizedSelectedModelSet.has(model)).
- Around line 207-209: The checkbox state logic in the model-filter-dropdown
component is inconsistent when selectedModels contains only stale IDs: update
the isSelected calculation used in models.map so that an empty
normalizedSelectedIds is treated the same as no selection (i.e., show all
checked to match the trigger label). Specifically, change the condition for
isSelected (used when rendering each model checkbox) to treat
normalizedSelectedIds.size === 0 as a truthy "all selected" case in addition to
selectedModels.length === 0 and normalizedSelectedIds.has(model) so the visual
checkboxes align with the trigger label semantics.
In `@apps/dashboard/src/lib/types.ts`:
- Line 86: Update the dashboard types so runtime uses the shared
ArtifactRuntimeName type everywhere: change the runtime property on MatrixItem
and MatchedItem to ArtifactRuntimeName (matching the already-tightened
ModelVariant.runtime) in apps/dashboard/src/lib/types.ts; ensure any related
imports include ArtifactRuntimeName and run type-checking to fix any downstream
type mismatches.
In `@test/opencode-config.test.ts`:
- Around line 104-120: The test for "exports config dir/path/content for
isolated headless runs" is missing an assertion for the web fetch flag; update
the test that calls buildOpenCodeEnv to also assert
expect(env.OPENCODE_DISABLE_WEBFETCH).toBe("true") so OPENCODE_DISABLE_WEBFETCH
is locked off alongside OPENCODE_DISABLE_WEBSEARCH (refer to buildOpenCodeEnv
and the existing OPENCODE_DISABLE_WEBSEARCH/OPENCODE_DISABLE_AUTOUPDATE
assertions to place the new expect).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c7fbff1b-b7e1-4f60-a1a8-2342d99b54cc
📒 Files selected for processing (158)
README.md, apps/dashboard/scripts/build-index.ts, apps/dashboard/src/components/about/about-content.ts, apps/dashboard/src/components/about/about-page.tsx, apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx, apps/dashboard/src/components/charts/chart-primitives.tsx, apps/dashboard/src/components/charts/composite-score-chart.tsx, apps/dashboard/src/components/charts/frontier-eval-scatter.tsx, apps/dashboard/src/components/charts/model-comparison-chart.tsx, apps/dashboard/src/components/charts/model-radar-chart.tsx, apps/dashboard/src/components/charts/model-test-heatmap.tsx, apps/dashboard/src/components/charts/test-difficulty-chart.tsx, apps/dashboard/src/components/charts/test-type-split-chart.tsx, apps/dashboard/src/components/charts/token-efficiency-chart.tsx, apps/dashboard/src/components/layout/header.tsx, apps/dashboard/src/components/leaderboard/leaderboard-filters.ts, apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx, apps/dashboard/src/components/leaderboard/leaderboard-page.tsx, apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx, apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx, apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx, apps/dashboard/src/components/run-list/run-list-page.tsx, apps/dashboard/src/components/ui/section-heading.tsx, apps/dashboard/src/index.css, apps/dashboard/src/lib/aggregations-charts.ts, apps/dashboard/src/lib/aggregations-compare.ts, apps/dashboard/src/lib/aggregations-diagnostics.ts, apps/dashboard/src/lib/aggregations-test-types.ts, apps/dashboard/src/lib/aggregations-tooling.ts, apps/dashboard/src/lib/aggregations.ts, apps/dashboard/src/lib/api.ts, apps/dashboard/src/lib/schemas.ts, apps/dashboard/src/lib/types.ts, llm/context/codebase-overview.md, llm/implementation/computer-use-hardening.md, llm/implementation/harnesses-implementation.md, llm/implementation/multi-runtime-mvp-implementation.md, llm/project/project-overview.md, llm/project/project-rules.md, llm/project/user-flow.md, package.json, src/cli/compare-command.ts, src/cli/compare-formatters.ts, src/cli/compare-utils.ts, src/cli/migrate-machine-command.ts, src/cli/run-command.ts, src/harnesses/direct-adapter.ts, src/harnesses/discovery.ts, src/harnesses/goose-adapter.ts, src/harnesses/goose-openai.ts, src/harnesses/goose-output.ts, src/harnesses/harness.ts, src/harnesses/opencode-adapter.ts, src/harnesses/opencode-artifacts.ts, src/harnesses/opencode-cli.ts, src/harnesses/opencode-config.ts, src/harnesses/opencode-events.ts, src/harnesses/opencode-model.ts, src/harnesses/opencode-output.ts, src/harnesses/opencode-permissions.ts, src/harnesses/opencode-process.ts, src/harnesses/opencode-provider.ts, src/harnesses/opencode-runner.ts, src/harnesses/tool-prompt.ts, src/lib/benchmark-checkpoint.ts, src/lib/failure-classifier.ts, src/lib/hardware-profile.ts, src/lib/logger.ts, src/lib/machine-profile/instance-id.ts, src/lib/machine-profile/legacy.ts, src/lib/machine-profile/normalization.ts, src/lib/machine-profile/probe-linux.ts, src/lib/machine-profile/probe-macos.ts, src/lib/machine-profile/probe-utils.ts, src/lib/machine-profile/probe-windows.ts, src/lib/model-aliases.ts, src/lib/model-profile/normalization.ts, src/lib/model-profile/registry-compat.ts, src/lib/model-profile/registry.ts, src/lib/openai-compat-client.ts, src/lib/signal-assessment.ts, src/lib/stats-format.ts, src/lib/test-catalog.ts, src/runner/index.ts, src/runner/item-executor.ts, src/runner/model-availability.ts, src/runner/model-eligibility.ts, src/runner/plan-builder.ts, src/runner/run-progress.ts, src/runtimes/discovery.ts, src/runtimes/index.ts, src/runtimes/ollama-residency.ts, src/runtimes/ollama-runtime.ts, src/runtimes/runtime.ts, src/runtimes/vllm-runtime.ts, src/schemas/common.schema.ts, src/schemas/config.schema.ts, src/schemas/index.ts, src/schemas/model-alias.schema.ts, src/schemas/model-profile.schema.ts, src/schemas/plan.schema.ts, src/schemas/result.schema.ts, src/schemas/test-catalog.schema.ts, src/tests/calculator-basic/test.meta.json, src/tests/calculator-stateful/test.meta.json, src/tests/event-emitter/test.meta.json, src/tests/rate-limiter/test.meta.json, src/tests/smoke/test.meta.json, src/tests/todo-app/test.meta.json, src/tests/ttl-cache/test.meta.json, src/tests/workspace-smoke/fixtures/logs/.gitkeep, src/tests/workspace-smoke/prompt.blind.md, src/tests/workspace-smoke/prompt.informed.md, src/tests/workspace-smoke/test.meta.json, test/aggregate.test.ts, test/benchmark-checkpoint.test.ts, test/build-index.test.ts, test/compare-checkpoint-guard.test.ts, test/compare-formatters.test.ts, test/dashboard-schema-compat.test.ts, test/dashboard-test-type.test.ts, test/failure-classifier.test.ts, test/generation-retry.test.ts, test/goose-adapter.test.ts, test/goose-output.test.ts, test/hardware-profile.test.ts, test/harness-compatibility.test.ts, test/item-executor.test.ts, test/item-retry.test.ts, test/machine-profile-migration.test.ts, test/model-alias-schema.test.ts, test/model-availability.test.ts, test/model-profile-registry.test.ts, test/ollama-adapter.test.ts, test/ollama-residency.test.ts, test/opencode-adapter.test.ts, test/opencode-artifacts.test.ts, test/opencode-cli-cache.test.ts, test/opencode-cli.test.ts, test/opencode-config.test.ts, test/opencode-model.test.ts, test/opencode-output.test.ts, test/opencode-runner-race.test.ts, test/opencode-runner.test.ts, test/plan-builder.test.ts, test/run-command.test.ts, test/runner-residency.test.ts, test/schema-regressions.test.ts, test/schemas.test.ts, test/signal-assessment.test.ts, test/stats-semantics.test.ts, test/stats.test.ts, test/test-catalog.test.ts, test/tool-prompt.test.ts, test/utils/fixtures.ts, test/vllm-runtime.test.ts, test/workspace-capability-parity.test.ts, test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- package.json
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/runtimes/discovery.ts
- src/harnesses/goose-openai.ts
- test/vllm-runtime.test.ts
- src/harnesses/opencode-process.ts
- src/runtimes/vllm-runtime.ts
- src/lib/openai-compat-client.ts
@CodeRabbit full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 4
♻️ Duplicate comments (2)
apps/dashboard/src/components/charts/model-comparison-chart.tsx (2)
42-44: ⚠️ Potential issue | 🟠 Major
`z.custom<MatrixItemResult>()` is effectively non-validating here.

On Line 43, this schema accepts arbitrary values, so malformed `items` can still flow into chart aggregation logic.

🔧 Proposed fix
```diff
+import { MatrixItemResultSchema } from "@/lib/schemas";
 import { z } from "zod";

 const ModelComparisonChartPropsSchema = z.object({
-  items: z.array(z.custom<MatrixItemResult>()),
+  items: z.array(MatrixItemResultSchema),
 });
```

```bash
#!/bin/bash
# Verify current runtime validation strength and presence/export of MatrixItemResultSchema.
rg -n --type=ts -C3 'ModelComparisonChartPropsSchema|z\.custom<MatrixItemResult>|MatrixItemResultSchema' apps/dashboard/src/components/charts/model-comparison-chart.tsx
rg -n --type=ts -C3 'MatrixItemResultSchema' apps/dashboard/src/lib/schemas.ts
rg -n --type=ts -C3 'export.*MatrixItemResultSchema|export\s*\{[^}]*MatrixItemResultSchema' apps/dashboard/src/lib/schemas.ts
```

Based on learnings: "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)" and "Validate at boundaries with Zod and pass typed data inward".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 42 - 44, The prop schema currently uses z.custom<MatrixItemResult>() which bypasses validation; import or locate the existing MatrixItemResultSchema and replace z.array(z.custom<MatrixItemResult>()) with z.array(MatrixItemResultSchema) in ModelComparisonChartPropsSchema (and export MatrixItemResultSchema if missing) so incoming items are strictly validated before reaching chart aggregation logic; ensure the prop type and any callers use the validated output (parse/parseAsync) of ModelComparisonChartPropsSchema when receiving external data.
85-91: ⚠️ Potential issue | 🟡 Minor

Add `@throws` to the exported function JSDoc.

`ModelComparisonChart` is exported on line 92, but the doc block does not include a `@throws` entry required by repo standards.

As per coding guidelines: "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 85 - 91, The JSDoc for the exported function ModelComparisonChart is missing a required `@throws` tag; update the doc block above the ModelComparisonChart export to include a `@throws` description that explains what error(s) the component may throw (e.g., invalid props, missing items, or data parsing errors) and under what conditions, using the function name ModelComparisonChart and any related props (props.items) in the description so the doc meets repo standards.
🧹 Nitpick comments (4)
src/lib/failure-classifier.ts (1)
48-49: Add regression coverage for the new `prompt missing` branch.

Line 49 adds behavior, but there's no matching test case yet; this can regress silently. Please add a `classifyGenerationError("Prompt missing ...") -> "prompt_not_found"` assertion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/failure-classifier.ts` around lines 48 - 49, Add a unit test that covers the new "prompt missing" branch by calling classifyGenerationError with a message containing "Prompt missing" (e.g., "Prompt missing ..." or similar) and asserting it returns the "prompt_not_found" classification; target the classifyGenerationError function and ensure the test mirrors existing tests for the "prompt not found" case so the new branch cannot regress.

apps/dashboard/src/components/charts/frontier-eval-scatter.tsx (1)
29-41: Consider centralizing harness color resolution in a shared chart utility.

This resolver pattern is good, but it now appears duplicated across chart components. A shared helper would reduce drift risk.

Proposed refactor sketch

```diff
+// apps/dashboard/src/components/charts/harness-colors.ts
+import { CHART_COLORS } from "@/lib/chart-colors";
+
+export const HARNESS_COLORS = {
+  direct: "hsl(215, 70%, 62%)",
+  goose: "hsl(142, 60%, 49%)",
+  opencode: "hsl(38, 80%, 58%)",
+} as const;
+
+export function readHarnessColor(harness: string): string {
+  return Object.prototype.hasOwnProperty.call(HARNESS_COLORS, harness)
+    ? HARNESS_COLORS[harness as keyof typeof HARNESS_COLORS]
+    : CHART_COLORS.muted;
+}
```

```diff
-const HARNESS_COLORS = { ... } as const;
-function readHarnessColor(harness: string): string { ... }
+import { readHarnessColor } from "./harness-colors";
```

Also applies to: 115-115
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/frontier-eval-scatter.tsx` around lines 29 - 41, Centralize the harness color resolver into a shared chart utility: extract the HARNESS_COLORS map, KnownHarnessName type, and readHarnessColor function into a new exported helper (e.g., getHarnessColor) in the shared charts util and have components import and use that helper instead of duplicating the logic; ensure the helper returns CHART_COLORS.muted for unknown keys and keep the exact behavior of readHarnessColor so callers (including frontier-eval-scatter.tsx) remain unchanged.

apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx (1)
33-35: Consider adding Zod validation for props.

While the TypeScript typing is present, the coding guidelines recommend validating at boundaries with Zod. You could add a Zod schema for `BlindVsInformedChartProps` to provide runtime validation of the `items` array.

Example Zod schema

```ts
import { z } from 'zod';

const BlindVsInformedChartPropsSchema = z.object({
  items: z.array(MatrixItemResultSchema), // assuming MatrixItemResultSchema exists
});

// Then validate at component entry:
export function BlindVsInformedChart(props: BlindVsInformedChartProps) {
  const { items } = BlindVsInformedChartPropsSchema.parse(props);
  // ... rest of implementation
}
```

As per coding guidelines: "Validate at boundaries with Zod and pass typed data inward".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx` around lines 33 - 35, Add runtime Zod validation for the component props by defining a BlindVsInformedChartPropsSchema (z.object with items: z.array(MatrixItemResultSchema)) and use it at the start of the BlindVsInformedChart function to parse/validate incoming props (replace direct destructuring of props with const { items } = BlindVsInformedChartPropsSchema.parse(props)); keep the existing TypeScript interface for compile-time typing but ensure MatrixItemResultSchema is imported/available and surface parse errors at the boundary.

src/cli/compare-utils.ts (1)
79-84: Redundant type check after `instanceof Error`.

At line 79, `error && typeof error === 'object'` is unnecessary since line 75 already confirmed `error instanceof Error`, which guarantees it's a truthy object.

♻️ Proposed simplification

```diff
-  if (error && typeof error === "object" && "code" in error) {
+  if ("code" in error) {
     const code = (error as { code?: unknown }).code;
     if (code === "ENOENT" || code === "ENOTDIR") {
       return true;
     }
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-utils.ts` around lines 79 - 84, Remove the redundant runtime checks after the earlier instanceof Error check: since the code already verifies error instanceof Error, drop the leading "error && typeof error === 'object' &&" and simplify the conditional that detects filesystem codes to just check for "code" in error (and cast as needed), then extract const code = (error as { code?: unknown }).code and test code === "ENOENT" || code === "ENOTDIR"; update the if block around the existing code variable extraction accordingly.
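The simplified narrowing above can be sketched as a standalone helper. This is illustrative only: `isMissingPathError` is a hypothetical name, not the repo's actual function, but it shows why the extra `typeof` guard is redundant once `instanceof Error` has been established.

```typescript
// Returns true for Node-style filesystem "path missing" errors.
function isMissingPathError(error: unknown): boolean {
  // instanceof Error already guarantees a truthy object,
  // so no separate `typeof error === "object"` check is needed.
  if (!(error instanceof Error)) {
    return false;
  }
  if (!("code" in error)) {
    return false;
  }
  const code = (error as { code?: unknown }).code;
  return code === "ENOENT" || code === "ENOTDIR";
}
```

Node's filesystem APIs attach a string `code` property to thrown errors, which is why the `"code" in error` check plus a string comparison is sufficient here.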
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 109-114: allSelected currently doesn't treat the "stale-only"
(normalizedSelectedModelSet.size === 0) case the same as isSelected, causing a
single click to collapse selections incorrectly; update the allSelected
computation (the expression that sets allSelected using selectedModels,
models.every, and normalizedSelectedModelSet.has) to also treat
normalizedSelectedModelSet.size === 0 as "all selected" (i.e., include a
condition like normalizedSelectedModelSet.size === 0 in the OR chain), and make
the same change to the duplicate logic around the other occurrence (the block
referenced at lines 209-211).
In `@src/harnesses/tool-prompt.ts`:
- Around line 103-131: The code silently treats any pathMode other than
"relative-only" as the anchored branch; add an explicit validation for pathMode
(e.g., only allow "relative-only" or "absolute-anchor") before the existing
branching and trimmedWorkspaceRootPath checks: if pathMode is not one of the
allowed values, throw a clear Error. Update the validation near the top where
pathMode and trimmedWorkspaceRootPath are established (the same block that
checks toolNames and the absolute-anchor requirement) so subsequent logic using
pathMode and the pathLines/lines construction can assume a valid value.
In `@src/runner/index.ts`:
- Around line 219-257: didRun is being reset per-iteration so when earlier items
in the same runtime+model group ran but later ones are skipped the final skipped
iteration sees didRun === false and skips runPostItemResidencyTeardown; change
the tracking to persist at the runtime+model group level (e.g., move didRun out
of the per-item loop or replace it with a map keyed by
preflightKey/runtime+model) so that the flag reflects whether any item in the
current group executed, then use that group-level flag when evaluating
isLastForModel to decide whether to call runPostItemResidencyTeardown(item);
ensure the same fix is applied to the other block around lines 317-348 and that
writeProgressCheckpoint and results handling remain unchanged.
In `@src/runtimes/ollama-runtime.ts`:
- Around line 300-304: The code currently sets capabilities.generateText to
false for modelKind === "unknown", which excludes potentially valid generative
models; update the capabilities block (the modelKind / capabilities object) so
generateText is true for "text-generation" and also for "unknown" (i.e.,
generateText = modelKind === "text-generation" || modelKind === "unknown"),
while keeping embedText = modelKind === "embedding", so unknown models are not
prematurely classified as non-generative.
---
Duplicate comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 42-44: The prop schema currently uses z.custom<MatrixItemResult>()
which bypasses validation; import or locate the existing MatrixItemResultSchema
and replace z.array(z.custom<MatrixItemResult>()) with
z.array(MatrixItemResultSchema) in ModelComparisonChartPropsSchema (and export
MatrixItemResultSchema if missing) so incoming items are strictly validated
before reaching chart aggregation logic; ensure the prop type and any callers
use the validated output (parse/parseAsync) of ModelComparisonChartPropsSchema
when receiving external data.
- Around line 85-91: The JSDoc for the exported function ModelComparisonChart is
missing a required `@throws` tag; update the doc block above the
ModelComparisonChart export to include a `@throws` description that explains what
error(s) the component may throw (e.g., invalid props, missing items, or data
parsing errors) and under what conditions, using the function name
ModelComparisonChart and any related props (props.items) in the description so
the doc meets repo standards.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx`:
- Around line 33-35: Add runtime Zod validation for the component props by
defining a BlindVsInformedChartPropsSchema (z.object with items:
z.array(MatrixItemResultSchema)) and use it at the start of the
BlindVsInformedChart function to parse/validate incoming props (replace direct
destructuring of props with const { items } =
BlindVsInformedChartPropsSchema.parse(props)); keep the existing TypeScript
interface for compile-time typing but ensure MatrixItemResultSchema is
imported/available and surface parse errors at the boundary.
In `@apps/dashboard/src/components/charts/frontier-eval-scatter.tsx`:
- Around line 29-41: Centralize the harness color resolver into a shared chart
utility: extract the HARNESS_COLORS map, KnownHarnessName type, and
readHarnessColor function into a new exported helper (e.g., getHarnessColor) in
the shared charts util and have components import and use that helper instead of
duplicating the logic; ensure the helper returns CHART_COLORS.muted for unknown
keys and keep the exact behavior of readHarnessColor so callers (including
frontier-eval-scatter.tsx) remain unchanged.
In `@src/cli/compare-utils.ts`:
- Around line 79-84: Remove the redundant runtime checks after the earlier
instanceof Error check: since the code already verifies error instanceof Error,
drop the leading "error && typeof error === 'object' &&" and simplify the
conditional that detects filesystem codes to just check for "code" in error (and
cast as needed), then extract const code = (error as { code?: unknown }).code
and test code === "ENOENT" || code === "ENOTDIR"; update the if block around the
existing code variable extraction accordingly.
In `@src/lib/failure-classifier.ts`:
- Around line 48-49: Add a unit test that covers the new "prompt missing" branch
by calling classifyGenerationError with a message containing "Prompt missing"
(e.g., "Prompt missing ..." or similar) and asserting it returns the
"prompt_not_found" classification; target the classifyGenerationError function
and ensure the test mirrors existing tests for the "prompt not found" case so
the new branch cannot regress.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ce480d55-c86a-40c3-b9cf-1df193adf646
📒 Files selected for processing (158)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- src/harnesses/opencode-model.ts
- package.json
- src/harnesses/goose-openai.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-output.ts
- src/lib/openai-compat-client.ts
- test/benchmark-checkpoint.test.ts
- src/runtimes/discovery.ts
- test/vllm-runtime.test.ts
- src/runtimes/vllm-runtime.ts
@CodeRabbit full review
✅ Actions performed: full review triggered.
Actionable comments posted: 4
🧹 Nitpick comments (9)
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx (1)
24-31: Excellent use of `as const` — aligns with coding guidelines.

The addition of `as const` strengthens type safety and prevents mutations, which follows the guideline to avoid enums and use `as const` instead. The whitespace cleanup in comments also improves consistency.
Optional: Consider merging colors into the cards array.
Currently, `CARD_ACCENT_COLORS` and the `cards` array are coupled through index-based access (line 113). This creates a maintenance burden where both arrays must stay in sync. Consider merging the color into each card object:

♻️ Optional refactor to reduce coupling

```diff
-/** Border accent colors — brand green for primary, semantic colors for specific KPIs. */
-const CARD_ACCENT_COLORS = [
-  "#34c759", // brand green — matched runs
-  "#34c759", // brand green — profiles
-  "hsl(212, 100%, 67%)", // info blue — deduped items
-  "hsl(156, 67%, 55%)", // success green — pass rate
-  "hsl(270, 60%, 60%)", // purple — frontier coverage
-  "hsl(210, 85%, 60%)", // blue — median duration
-] as const;
 /**
  * Renders top-level leaderboard KPI cards.
  * ...
  */
 export function LeaderboardSummaryCards({
   aggregate,
   filteredItemCount,
   filteredItems,
   passRate,
 }: LeaderboardSummaryCardsProps) {
   // ... existing calculations ...
   const cards = [
     {
       title: "Matched Runs",
       value: String(aggregate?.summary.runsMatched ?? 0),
       sub: null,
+      color: "#34c759", // brand green
     },
     {
       title: "Profiles",
       value: String(aggregate?.summary.machines ?? 0),
       sub:
         aggregate?.summary.instances !== undefined
           ? `${aggregate.summary.instances} ${
               aggregate.summary.instances === 1 ? "instance" : "instances"
             }`
           : null,
+      color: "#34c759", // brand green
     },
     {
       title: "Deduped Items",
       value: String(filteredItemCount),
       sub: `of ${aggregate?.summary.dedupedItems ?? 0} total`,
+      color: "hsl(212, 100%, 67%)", // info blue
     },
     {
       title: "Pass Rate",
       value: formatPercent(passRate.passRate),
       sub: `${passRate.passed}/${passRate.total} tests`,
+      color: "hsl(156, 67%, 55%)", // success green
     },
     {
       title: "Frontier Coverage",
       value: formatPercent(frontierCoverage),
       sub: `${frontierCount} of ${filteredItemCount} items`,
+      color: "hsl(270, 60%, 60%)", // purple
     },
     {
       title: "Median Duration",
       value: medianDuration !== null ? formatDuration(medianDuration) : "—",
       sub: durations.length > 0 ? `${durations.length} items` : "no data",
+      color: "hsl(210, 85%, 60%)", // blue
     },
-  ];
+  ] as const;

   return (
     <div className="grid gap-4 md:grid-cols-3 lg:grid-cols-6">
-      {cards.map((card, i) => (
+      {cards.map((card) => (
         <Card
           key={card.title}
           glow
-          className={`border-l-2 animate-fade-slide-up animate-stagger-${i + 1}`}
-          style={{ borderLeftColor: CARD_ACCENT_COLORS[i] }}
+          className="border-l-2 animate-fade-slide-up"
+          style={{ borderLeftColor: card.color }}
         >
```

Note: This refactor also removes the animation stagger classes, which would need to be addressed separately if staggered animations are desired.
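The design choice above can be reduced to a tiny sketch: carrying the color on each card removes the parallel-array coupling entirely. The names and values here are illustrative placeholders, not the component's real types or data.

```typescript
// Illustrative card shape: the accent color lives on the card itself,
// so there is no parallel CARD_ACCENT_COLORS array to keep in sync.
interface SummaryCard {
  title: string;
  value: string;
  color: string;
}

const cards: SummaryCard[] = [
  { title: "Matched Runs", value: "12", color: "#34c759" },
  { title: "Pass Rate", value: "83%", color: "hsl(156, 67%, 55%)" },
];

// Rendering reads card.color directly — reordering or removing a card
// can no longer desynchronize titles from colors.
const borderColors = cards.map((card) => card.color);
```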
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx` around lines 24 - 31, The CARD_ACCENT_COLORS array is coupled to the cards array via index-based access (CARD_ACCENT_COLORS[i]), which is brittle; update the card definitions in the cards array to include a color property for each card (e.g., color: "#34c759" or "hsl(...)"), replace all uses of CARD_ACCENT_COLORS[i] with card.color, and remove the standalone CARD_ACCENT_COLORS constant; ensure the render logic (the component that maps over cards and applies accent color) reads the new color property and preserves existing animation/stagger classes.

src/cli/compare-command.ts (1)
34-42: Consider `.default(false)` instead of `z.boolean().optional()` for boolean options.

The schema defines `json` and `allowCrossCheckpoint` as `z.boolean().optional()`, but Commander already provides defaults (false). This works, but an explicit `.default(false)` in the schema would make the post-parse type narrower (`boolean` instead of `boolean | undefined`), removing the need for `=== true` checks at lines 82-83 and 96.

♻️ Optional refinement

```diff
 const RunCompareSchema = z.object({
   runA: z.string().min(1),
   runB: z.string().min(1),
   options: z.object({
     output: z.string().min(1),
-    json: z.boolean().optional(),
-    allowCrossCheckpoint: z.boolean().optional(),
+    json: z.boolean().default(false),
+    allowCrossCheckpoint: z.boolean().default(false),
   }),
 });
```

Then simplify usage:

```diff
-  const allowCrossCheckpoint =
-    parsed.options.allowCrossCheckpoint === true;
+  const allowCrossCheckpoint = parsed.options.allowCrossCheckpoint;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-command.ts` around lines 34 - 42, The schema RunCompareSchema currently uses z.boolean().optional() for options.json and options.allowCrossCheckpoint; change those to z.boolean().default(false) so parsed types are boolean (not boolean | undefined) and you can remove redundant checks like `=== true` when reading options in the compare command (e.g., where compare logic inspects options.json and options.allowCrossCheckpoint). Update RunCompareSchema's options object to use .default(false) for both booleans so downstream code (compare command handlers) can treat them as plain booleans.

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (1)
118-139: Canonicalize emitted selections to avoid duplicate IDs.

At the line 119 and line 136 paths, `normalizedSelectedModels` preserves any upstream duplicates, so callback payloads can stay non-canonical. Dedupe before sort in all emit paths to keep state stable.

♻️ Suggested patch

```diff
 function handleModelToggle(model: string) {
   const modelSet = new Set(models);
   const normalizedSelectedModels = selectedModels.filter((selected) =>
     modelSet.has(selected),
   );
   const normalizedSelectedModelSet = new Set(normalizedSelectedModels);
+  const sortAndDedupe = (values: string[]) =>
+    [...new Set(values)].sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b));
   const allSelected =
     selectedModels.length === 0 ||
     normalizedSelectedModelSet.size === 0 ||
     (models.length > 0 &&
       models.every((availableModel) =>
         normalizedSelectedModelSet.has(availableModel),
       ));
   if (normalizedSelectedModelSet.has(model)) {
-    onSelectionChange(
-      normalizedSelectedModels
-        .filter((selected) => selected !== model)
-        .sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
-    );
+    onSelectionChange(
+      sortAndDedupe(
+        normalizedSelectedModels.filter((selected) => selected !== model),
+      ),
+    );
     return;
   }
   if (allSelected) {
-    onSelectionChange(
-      models
-        .filter((availableModel) => availableModel !== model)
-        .sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
-    );
+    onSelectionChange(
+      sortAndDedupe(models.filter((availableModel) => availableModel !== model)),
+    );
     return;
   }
-  onSelectionChange(
-    [...normalizedSelectedModels, model].sort((a, b) =>
-      MODEL_SORT_COLLATOR.compare(a, b),
-    ),
-  );
+  onSelectionChange(sortAndDedupe([...normalizedSelectedModels, model]));
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 118 - 139, The emitted selections from onSelectionChange can contain duplicates because normalizedSelectedModels is used directly; update the three emit paths (the remove branch that filters normalizedSelectedModels, the allSelected branch that filters models, and the add branch that spreads normalizedSelectedModels) to deduplicate before sorting and emitting—e.g., build a Set or use Array.from(new Set(...)) on the array results (references: normalizedSelectedModels, models, model, allSelected, onSelectionChange, MODEL_SORT_COLLATOR) and then sort the deduped array with MODEL_SORT_COLLATOR prior to calling onSelectionChange so payloads are canonical.

llm/project/user-flow.md (1)
53-56: Optional: vary repeated "Which …" bullets for readability.

This is purely editorial, but slight rewording improves scanability.

Suggested copy tweak

```diff
-  - Which test(s) to run?
-  - Which category/categories to run (`coding`, `computer-use`)?
-  - Which harness(es) to run through?
-  - Which model(s) to benchmark?
+  - Select test(s) to run.
+  - Choose category/categories (`coding`, `computer-use`).
+  - Pick harness(es) to run through.
+  - Choose model(s) to benchmark.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/project/user-flow.md` around lines 53 - 56, Edit the repeated bulleted questions in user-flow.md to improve readability by varying phrasing instead of repeating "Which …"; specifically update the bullets "Which test(s) to run?", "Which category/categories to run (`coding`, `computer-use`)?", "Which harness(es) to run through?", and "Which model(s) to benchmark?" to shorter/varied forms (e.g., "Select test(s) to run", "Choose category/categories (`coding`, `computer-use`)", "Pick harness(es) to execute", "Select model(s) to benchmark") so the list scans better while preserving the same options and meaning.apps/dashboard/src/components/charts/model-comparison-chart.tsx (1)
92-94: Misleading `@throws` clause for a `safeParse`-guarded function.

The JSDoc at lines 92–93 states the function throws when "validated props.items contain a shape accepted by the schema but unsupported by downstream chart aggregation or rendering." However, the function now uses `safeParse` with a graceful fallback (lines 96–131), so validation failures won't throw. If downstream aggregation errors are still possible, the `@throws` should clarify that scenario specifically, or be removed if the intention is purely defensive.

Suggested JSDoc update

```diff
 /**
  * Renders head-to-head model comparison with diverging bars.
  *
  * @param props - Component props
  * @param props.items - Filtered matrix items
  * @returns Card with model selectors and diverging bar chart
- * @throws {Error} If validated props.items contain a shape accepted by the schema
- *   but unsupported by downstream chart aggregation or rendering.
  */
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines 92 - 94, The JSDoc `@throws` is misleading because the code now uses safeParse (so validation failures are handled gracefully) — update the comment near the safeParse call to either remove the `@throws` entirely or change it to explicitly state that validation errors do not throw and that only downstream aggregation/rendering (e.g., where props.items is passed into the chart aggregation/render pipeline) might still throw; reference the safeParse usage and any aggregation function that consumes the parsed data when editing the text.src/harnesses/direct-adapter.ts (1)
103-110: Conditional taint reason merge may lose retry-generated reasons.

The current logic only merges `signalAssessment` when `decision.taintReasons.length > 0`. If the initial attempt produces taint reasons but the retry succeeds cleanly (no taint reasons from `evaluateCodeOnlyOutput`), the initial taint reasons are lost. Consider whether the initial attempt's taint indicators should always propagate to the final result.

Suggested fix to always propagate initial taint reasons

```diff
 return {
   ...retryResult,
   // ... token merging ...
   durationMs: Math.round(performance.now() - startTime),
-  ...(decision.taintReasons.length > 0
-    ? {
-        signalAssessment: appendSignalAssessmentReasons(
-          retryResult.signalAssessment,
-          decision.taintReasons,
-        ),
-      }
-    : {}),
+  signalAssessment: appendSignalAssessmentReasons(
+    retryResult.signalAssessment,
+    decision.taintReasons,
+  ),
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/direct-adapter.ts` around lines 103 - 110, The current conditional only adds signalAssessment when decision.taintReasons.length > 0, which can drop initial taint reasons if the retry returns none; change the logic around decision.taintReasons and retryResult.signalAssessment to always propagate the initial taint reasons into the final signalAssessment by calling appendSignalAssessmentReasons(retryResult.signalAssessment, decision.taintReasons) unconditionally (or whenever decision.taintReasons is non-empty OR retryResult.signalAssessment exists), and assign that merged result to signalAssessment so initial taint indicators from decision.taintReasons are not lost.

src/runner/model-eligibility.ts (1)
95-97: Make throw-mode errors runtime-aware for faster debugging.

Consider including runtime and detected model kind in the thrown message to reduce triage time when explicit model selection fails.

💡 Suggested refinement

```diff
 if (input.mode === "throw") {
+  const detectedKind = info.modelKind ?? "unknown";
   throw new Error(
-    `Requested model is not supported for generative benchmarks: ${model} (embedding-only)`,
+    `Requested model "${model}" on runtime "${input.runtimeName}" is not supported for generative benchmarks (modelKind=${detectedKind})`,
   );
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/model-eligibility.ts` around lines 95 - 97, Update the thrown Error in model-eligibility.ts (the throw that currently uses ${model} (embedding-only)) to include runtime and detected model kind for quicker debugging: gather the runtime identifier (e.g., runtime or env/runtimeName) and the detected kind (e.g., detectedModelKind or modelKind) immediately before the throw and append them to the message (for example: `Requested model ${model} is not supported for generative benchmarks (embedding-only); runtime=${runtime}; detectedKind=${detectedModelKind}`). Ensure those variables are computed or available in the scope of the throw site and include them in the process/error message.

apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx (1)
39-41: Validate `items` before render instead of re-parsing them in the component.

`BlindVsInformedChart` now does a full Zod parse on every render. That duplicates validation work for large result sets and turns malformed data into a render-time crash rather than failing once at the data-loading boundary.

♻️ Proposed cleanup
```diff
-import { MatrixItemResultSchema } from "@/lib/schemas";
 import { blindInformed as blindInformedTooltips } from "@/lib/tooltip-content";
 import type { MatrixItemResult } from "@/lib/types";
@@
-import { z } from "zod";
-
 interface BlindVsInformedChartProps {
   items: MatrixItemResult[];
 }
-
-const BlindVsInformedChartPropsSchema = z.object({
-  items: z.array(MatrixItemResultSchema),
-});
@@
- * @param items - Benchmark rows used to compute model and harness breakdowns.
+ * @param props - Typed chart props.
  * @returns React element containing chart tabs for model and harness breakdowns.
- * @throws {z.ZodError} If props.items do not match MatrixItemResultSchema.
+ * @throws none
  */
 export function BlindVsInformedChart(props: BlindVsInformedChartProps) {
-  const { items } = BlindVsInformedChartPropsSchema.parse(props);
+  const { items } = props;
```

Also applies to: 281-289
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx` around lines 39 - 41, BlindVsInformedChart is re-parsing props.items with Zod on every render (causing duplicated work and render-time crashes); validate/parse items once at the data-loading or prop boundary and pass already-validated data into the component instead of calling BlindVsInformedChartPropsSchema.parse inside the render. Update the call sites that construct props for BlindVsInformedChart to run BlindVsInformedChartPropsSchema.parse (or use z.infer to type the validated shape) and change the component to accept the validated type directly; also apply the same change for the similar validation usage around the code referenced at 281-289 (remove in-component parse and move validation upstream).

test/plan-builder.test.ts (1)
449-483: Assert the exclusion payload here, not only the fallback error text.

This is the only changed regression for the new model-exclusion path, but it currently passes as long as the generic rejection string stays the same. If the builder stops attaching the excluded `runtime`/`model`/`reason` details, this test will not catch it.

If `buildRunPlan` exposes the exclusions on the rejection or on an intermediate result, assert those fields here alongside the message.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/signal-assessment.ts`:
- Around line 304-317: The helper isAgentRequestedInputOutput currently defaults
missing options.source to "harness", causing accidental tainting; update the
function so it does NOT treat an omitted source as "harness" — require explicit
opt-in by checking options.source === "harness" (or default to
"artifact"/non-tainting) instead of using (options.source ?? "harness") !==
"harness"; ensure the logic uses the TranscriptOrInputTaintOptions.source value
explicitly and adjust any related type/defaults to avoid a silent harness
fallback.
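The explicit opt-in that this prompt asks for can be sketched in plain TypeScript. The option type, function name, and the marker regex below are illustrative assumptions based on the review text, not the project's actual definitions:

```typescript
// Hypothetical sketch: an omitted `source` must NOT behave like "harness".
type TaintSource = "harness" | "artifact";

interface TranscriptOrInputTaintOptions {
  source?: TaintSource;
}

// Illustrative marker for "the agent asked the user for input".
const AGENT_INPUT_PATTERN = /awaiting (?:user )?input/i;

function isAgentRequestedInputOutput(
  output: string,
  options: TranscriptOrInputTaintOptions = {},
): boolean {
  // Require explicit opt-in: only an explicit "harness" source can taint,
  // so an omitted source no longer falls back to the tainting path.
  if (options.source !== "harness") {
    return false;
  }
  return AGENT_INPUT_PATTERN.test(output);
}
```

With this shape, callers that never set `source` cannot taint a row by accident; harness-origin transcripts must say so explicitly.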
In `@src/runner/item-executor.ts`:
- Around line 397-405: The call to finalizeItemSignalAssessment is incorrectly
tagging outputs from failed generations as outputSource: "artifact"; update the
invocation in item-executor where signalAssessment is re-finalized (the
finalizeItemSignalAssessment call that uses existing: signalAssessment,
automatedScore, rowFailed, output: generation.output, outputSource: "artifact")
to choose the source based on generation.success (e.g., outputSource:
generation.success ? "artifact" : "harness-failure" or preserve the original
harness/source), so failed harness outputs keep their harness-origin taint
reasons instead of being re-marked as artifacts.
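A minimal sketch of that success-dependent source selection; the `Generation` shape and the `"harness-failure"` label are stand-ins for whatever the real result schema uses:

```typescript
// Hypothetical types; the project's actual schema may differ.
type OutputSource = "artifact" | "harness-failure";

interface Generation {
  success: boolean;
  output: string;
}

function selectOutputSource(generation: Generation): OutputSource {
  // Only successful generations produce genuine artifacts; failed harness
  // runs keep their harness-origin provenance so taint reasons survive
  // re-finalization instead of being re-marked as artifacts.
  return generation.success ? "artifact" : "harness-failure";
}
```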
In `@src/schemas/common.schema.ts`:
- Around line 13-31: The change removed "vllm" from supportedRuntimeNames which
makes SupportedRuntimeNameSchema reject existing configs without doing a
breaking schema bump or migration; revert to additive behavior by adding "vllm"
back into supportedRuntimeNames (so SupportedRuntimeNameSchema continues to
accept it) and leave artifactRuntimeNames/ArtifactRuntimeNameSchema as-is, and
add a TODO comment near SCHEMA_VERSION/supportedRuntimeNames to implement an
explicit migration or a breaking-version bump if you intend to remove "vllm"
later.
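The additive shape can be sketched with plain literal arrays in place of the Zod enums; `supportedRuntimeNames` and `artifactRuntimeNames` are the names from the review, and the TODO mirrors the requested migration note:

```typescript
// Keep "vllm" accepted until an explicit migration or breaking version bump.
// TODO(schema): bump SCHEMA_VERSION or add a migration before removing "vllm".
const supportedRuntimeNames = ["ollama", "vllm"] as const;

// Historical plan/result artifacts may reference either runtime.
const artifactRuntimeNames = ["ollama", "vllm"] as const;

type SupportedRuntimeName = (typeof supportedRuntimeNames)[number];

function isSupportedRuntimeName(value: string): value is SupportedRuntimeName {
  return (supportedRuntimeNames as readonly string[]).includes(value);
}
```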
In `@test/ollama-adapter.test.ts`:
- Around line 126-152: The test title is contradictory: update the it(...)
description string for the test that calls createOllamaRuntime(...), awaits
runtime.getModelInfo("mystery-model:8b"), and asserts info.modelKind ===
"unknown" and info.capabilities.generateText === true so the title reflects that
the unrecognized non-embedding model is considered generative (e.g., change "not
generative" to "generative" or similar) to match the expectations checked on
info.modelKind and info.capabilities.
---
Nitpick comments:
In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx`:
- Around line 39-41: BlindVsInformedChart is re-parsing props.items with Zod on
every render (causing duplicated work and render-time crashes); validate/parse
items once at the data-loading or prop boundary and pass already-validated data
into the component instead of calling BlindVsInformedChartPropsSchema.parse
inside the render. Update the call sites that construct props for
BlindVsInformedChart to run BlindVsInformedChartPropsSchema.parse (or use
z.infer to type the validated shape) and change the component to accept the
validated type directly; also apply the same change for the similar validation
usage around the code referenced at 281-289 (remove in-component parse and move
validation upstream).
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 92-94: The JSDoc `@throws` is misleading because the code now uses
safeParse (so validation failures are handled gracefully) — update the comment
near the safeParse call to either remove the `@throws` entirely or change it to
explicitly state that validation errors do not throw and that only downstream
aggregation/rendering (e.g., where props.items is passed into the chart
aggregation/render pipeline) might still throw; reference the safeParse usage
and any aggregation function that consumes the parsed data when editing the
text.
In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`:
- Around line 24-31: The CARD_ACCENT_COLORS array is coupled to the cards array
via index-based access (CARD_ACCENT_COLORS[i]), which is brittle; update the
card definitions in the cards array to include a color property for each card
(e.g., color: "#34c759" or "hsl(...)"), replace all uses of
CARD_ACCENT_COLORS[i] with card.color, and remove the standalone
CARD_ACCENT_COLORS constant; ensure the render logic (the component that maps
over cards and applies accent color) reads the new color property and preserves
existing animation/stagger classes.
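The colocated-color idea can be sketched as follows; the card titles and hex values are placeholders, not the dashboard's actual cards or palette:

```typescript
// Each card carries its own accent color, replacing the parallel
// CARD_ACCENT_COLORS[i] array that can drift out of sync.
interface SummaryCard {
  title: string;
  color: string;
}

const cards: SummaryCard[] = [
  { title: "Pass rate", color: "#34c759" },
  { title: "Models", color: "#0a84ff" },
];

function accentFor(card: SummaryCard): string {
  // Read the color from the card itself instead of an index-based lookup.
  return card.color;
}
```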
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 118-139: The emitted selections from onSelectionChange can contain
duplicates because normalizedSelectedModels is used directly; update the three
emit paths (the remove branch that filters normalizedSelectedModels, the
allSelected branch that filters models, and the add branch that spreads
normalizedSelectedModels) to deduplicate before sorting and emitting—e.g., build
a Set or use Array.from(new Set(...)) on the array results (references:
normalizedSelectedModels, models, model, allSelected, onSelectionChange,
MODEL_SORT_COLLATOR) and then sort the deduped array with MODEL_SORT_COLLATOR
prior to calling onSelectionChange so payloads are canonical.
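A small sketch of the canonicalization step, assuming a collator like `MODEL_SORT_COLLATOR` (the locale and options here are illustrative):

```typescript
// Assumed collator; the component's actual MODEL_SORT_COLLATOR may differ.
const MODEL_SORT_COLLATOR = new Intl.Collator("en", { numeric: true });

function dedupeAndSortModels(models: readonly string[]): string[] {
  // Deduplicate first so repeated entries never reach onSelectionChange,
  // then sort with the shared collator so every emitted payload is canonical.
  return Array.from(new Set(models)).sort((a, b) =>
    MODEL_SORT_COLLATOR.compare(a, b),
  );
}
```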
In `@llm/project/user-flow.md`:
- Around line 53-56: Edit the repeated bulleted questions in user-flow.md to
improve readability by varying phrasing instead of repeating "Which …";
specifically update the bullets "Which test(s) to run?", "Which
category/categories to run (`coding`, `computer-use`)?", "Which harness(es) to
run through?", and "Which model(s) to benchmark?" to shorter/varied forms (e.g.,
"Select test(s) to run", "Choose category/categories (`coding`,
`computer-use`)", "Pick harness(es) to execute", "Select model(s) to benchmark")
so the list scans better while preserving the same options and meaning.
In `@src/cli/compare-command.ts`:
- Around line 34-42: The schema RunCompareSchema currently uses
z.boolean().optional() for options.json and options.allowCrossCheckpoint; change
those to z.boolean().default(false) so parsed types are boolean (not boolean |
undefined) and you can remove redundant checks like `=== true` when reading
options in the compare command (e.g., where compare logic inspects options.json
and options.allowCrossCheckpoint). Update RunCompareSchema's options object to
use .default(false) for both booleans so downstream code (compare command
handlers) can treat them as plain booleans.
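The effect of `.default(false)` can be sketched without Zod: after parsing, the option fields are plain booleans, so `=== true` checks become unnecessary. The names below are illustrative:

```typescript
// Raw CLI options before defaulting (flags may be omitted entirely).
interface RawCompareOptions {
  json?: boolean;
  allowCrossCheckpoint?: boolean;
}

// Parsed options: always concrete booleans, mirroring z.boolean().default(false).
interface CompareOptions {
  json: boolean;
  allowCrossCheckpoint: boolean;
}

function parseCompareOptions(raw: RawCompareOptions): CompareOptions {
  // Defaulting here lets downstream code test `if (options.json)`
  // instead of `options.json === true`.
  return {
    json: raw.json ?? false,
    allowCrossCheckpoint: raw.allowCrossCheckpoint ?? false,
  };
}
```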
In `@src/harnesses/direct-adapter.ts`:
- Around line 103-110: The current conditional only adds signalAssessment when
decision.taintReasons.length > 0, which can drop initial taint reasons if the
retry returns none; change the logic around decision.taintReasons and
retryResult.signalAssessment to always propagate the initial taint reasons into
the final signalAssessment by calling
appendSignalAssessmentReasons(retryResult.signalAssessment,
decision.taintReasons) unconditionally (or whenever decision.taintReasons is
non-empty OR retryResult.signalAssessment exists), and assign that merged result
to signalAssessment so initial taint indicators from decision.taintReasons are
not lost.
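A hypothetical sketch of the unconditional merge; the real `appendSignalAssessmentReasons` and `SignalAssessment` types in the codebase may differ:

```typescript
// Illustrative assessment shape.
interface SignalAssessment {
  classification: "clean" | "tainted";
  reasons: string[];
}

function appendSignalAssessmentReasons(
  existing: SignalAssessment | undefined,
  reasons: readonly string[],
): SignalAssessment | undefined {
  // Nothing to merge and nothing pre-existing: stay unset.
  if (existing === undefined && reasons.length === 0) {
    return undefined;
  }
  // Union the reason sets so initial taint indicators survive a clean retry.
  const merged = Array.from(new Set([...(existing?.reasons ?? []), ...reasons]));
  return {
    classification: merged.length > 0 ? "tainted" : "clean",
    reasons: merged,
  };
}
```

Calling this unconditionally after a retry means `decision.taintReasons` is preserved even when `retryResult.signalAssessment` comes back empty.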
In `@src/runner/model-eligibility.ts`:
- Around line 95-97: Update the thrown Error in model-eligibility.ts (the throw
that currently uses ${model} (embedding-only)) to include runtime and detected
model kind for quicker debugging: gather the runtime identifier (e.g., runtime
or env/runtimeName) and the detected kind (e.g., detectedModelKind or modelKind)
immediately before the throw and append them to the message (for example:
`Requested model ${model} is not supported for generative benchmarks
(embedding-only); runtime=${runtime}; detectedKind=${detectedModelKind}`).
Ensure those variables are computed or available in the scope of the throw site
and include them in the process/error message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec4ad437-bf48-48ac-80a9-efb8eb6f79f1
📒 Files selected for processing (159)

- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/chart-colors.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- package.json
- src/harnesses/goose-openai.ts
- test/vllm-runtime.test.ts
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-process.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
- src/runtimes/discovery.ts
- src/harnesses/opencode-output.ts
@CodeRabbit full review

✅ Actions performed

Full review triggered.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/cli/migrate-machine-command.ts (1)
334-337: ⚠️ Potential issue | 🟠 Major

Do not force non-zero exit for user validation errors.
Line 336 can throw for invalid user options, and Line 363 currently exits with status 1 for all caught errors. That treats non-crash usage errors as crash exits.
💡 Suggested adjustment
```diff
 .action(async (options) => {
   try {
     const parsedOptions = MigrateMachineCommandOptionsSchema.parse(options);
     const resultsDir = path.resolve(parsedOptions.dir);
     const migrated = await migrateResultsDirectory(resultsDir);
     logger.info(
       { resultsDir, ...migrated },
       "Migrated machine-profile artifacts",
     );
@@
-  } catch (error) {
-    logger.error({ error }, "Machine-profile migration failed");
-    process.exit(1);
+  } catch (error) {
+    if (error instanceof z.ZodError) {
+      logger.warn(
+        { issues: error.issues },
+        "Invalid options for migrate-machine-profiles; skipping migration",
+      );
+      return;
+    }
+    logger.error({ error }, "Machine-profile migration failed");
+    process.exit(1);
   }
 });
```

As per coding guidelines "CLI should be non-interactive by default; exit non-zero only on crashes (MVP)".
Also applies to: 361-364
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 334 - 337, the action handler currently calls MigrateMachineCommandOptionsSchema.parse(options) and then unconditionally calls process.exit(1) for any caught error; change this so Zod/validation errors from MigrateMachineCommandOptionsSchema.parse (or other user-input validation) are handled separately: detect the validation error (e.g., instanceof ZodError or error.name === 'ZodError'), print a helpful validation message to the user and return/exit with success (no non-zero exit), while only calling process.exit(1) for unexpected runtime/crash errors; update the .action(async (options) => { ... }) catch block to branch on validation vs unexpected errors and avoid forcing a non-zero exit for user validation issues.

apps/dashboard/src/lib/schemas.ts (1)
363-384: ⚠️ Potential issue | 🟠 Major

Mirror the new `scoringMetrics` fields here before exporting this result schema.

`MatrixItemResultSchema` now validates live `run.json` payloads for the dashboard, but the nested `ScoringMetricsSchema` still only keeps `durationMs`. Current results can include `retryGenerationDurationMs`, `retryKind`, `retryReason`, `retryAttempted`, and `retryPromoted`, and Zod will silently strip all of them at this boundary.

📦 Suggested schema sync
```diff
-const ScoringMetricsSchema = z.object({
-  durationMs: z.number(),
-});
+const ScoringMetricsSchema = z
+  .object({
+    durationMs: z.number(),
+    scoringDurationMs: z.number().optional(),
+    retryGenerationDurationMs: z.number().optional(),
+    retryKind: z.enum(["compile-feedback", "opencode-workspace"]).optional(),
+    retryReason: z.string().optional(),
+    retryAttempted: z.boolean().optional(),
+    retryPromoted: z.boolean().optional(),
+  })
+  .refine(
+    (metrics) => {
+      const hasAnyRetryField =
+        metrics.retryKind !== undefined ||
+        metrics.retryReason !== undefined ||
+        metrics.retryAttempted !== undefined ||
+        metrics.retryPromoted !== undefined ||
+        metrics.retryGenerationDurationMs !== undefined;
+      if (!hasAnyRetryField) {
+        return true;
+      }
+      if (metrics.retryAttempted === true) {
+        return (
+          metrics.retryKind !== undefined &&
+          typeof metrics.retryReason === "string" &&
+          metrics.retryReason.trim().length > 0 &&
+          typeof metrics.retryPromoted === "boolean" &&
+          typeof metrics.retryGenerationDurationMs === "number" &&
+          metrics.retryGenerationDurationMs >= 0
+        );
+      }
+      if (metrics.retryAttempted === false) {
+        return (
+          metrics.retryKind === undefined &&
+          metrics.retryReason === undefined &&
+          metrics.retryPromoted === undefined &&
+          metrics.retryGenerationDurationMs === undefined
+        );
+      }
+      return false;
+    },
+    {
+      message:
+        "retry metrics must be fully absent, or when retryAttempted is true include retryKind, non-empty retryReason, retryPromoted, and non-negative retryGenerationDurationMs; when retryAttempted is false the other retry fields must be absent",
+      path: ["retryKind"],
+    },
+  );
```

Based on learnings, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit" and "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/schemas.ts` around lines 363 - 384, MatrixItemResultSchema currently references ScoringMetricsSchema but that schema only includes durationMs, causing fields like retryGenerationDurationMs, retryKind, retryReason, retryAttempted, and retryPromoted to be stripped; update ScoringMetricsSchema to include these additional optional fields and then ensure MatrixItemResultSchema continues to reference ScoringMetricsSchema (and export remains unchanged) so live run.json payloads retain those properties; locate ScoringMetricsSchema in the same file (or its import) and add the five new optional fields with appropriate types (numbers/strings/booleans) to match existing usages.
♻️ Duplicate comments (3)
README.md (1)
142-143: ⚠️ Potential issue | 🟡 Minor

Don’t describe `bun pb` as auto-discovering runtimes anymore.

The README now says new runs are Ollama-only, so this example overstates the default behavior. Keep the discovery note to models, harnesses, and tests.
📝 Suggested doc fix
```diff
-# Run benchmarks (auto-discovers available runtimes, models, harnesses, and tests)
+# Run benchmarks (Ollama runtime; auto-discovers models, harnesses, and tests)
 bun pb
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 142 - 143, update the README line that currently claims "auto-discovers available runtimes, models, harnesses, and tests" for the `bun pb` example so it no longer states runtimes are auto-discovered; change the wording to indicate `bun pb` auto-discovers only models, harnesses, and tests (and that new runs are Ollama-only) so the example matches the current behavior.

llm/implementation/harnesses-implementation.md (1)
340-345: ⚠️ Potential issue | 🟡 Minor

This doc still points to two different Goose tool-enablement mechanisms.
The table now says Goose tool access comes from generated config/permission rules, but the earlier “Tool-Calling Mode (Goose)” section still tells readers to use
`--with-builtin developer`. Please reconcile both sections in the same pass or the troubleshooting guidance stays ambiguous.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/implementation/harnesses-implementation.md` around lines 340 - 345, the doc is inconsistent about how Goose enables tools: reconcile the "Tool-Calling Mode (Goose)" section with the table entry "Goose | Per-item provider config + permission rules via `goose run`" by choosing the intended mechanism and updating the other place to match; either remove the reference to the CLI flag `--with-builtin developer` from "Tool-Calling Mode (Goose)" and replace it with a clear description of per-item provider config + permission rules (including mention of using `goose run`), or conversely, change the table to reflect the CLI flag approach—ensure all references to `--with-builtin developer`, "Tool-Calling Mode (Goose)", and the table entry are consistent and use the same terminology and steps.

src/harnesses/opencode-events.ts (1)
240-249: ⚠️ Potential issue | 🟠 Major

Preserve the separator after raw fragments.

`joinOrderedOutputParts()` still collapses a raw line into the next parsed text chunk. A sequence like `raw -> text` becomes `rawtext` instead of preserving the newline boundary, which can corrupt mixed-output salvage and scorer-facing transcripts.

♻️ Suggested fix
```diff
 function joinOrderedOutputParts(parts: OrderedOutputPart[]): string {
   let output = "";
   for (const part of parts) {
     if (part.type === "text") {
       output += part.content;
       continue;
     }
-    output += output.length > 0 ? `\n${part.content}` : part.content;
+    if (output.length > 0 && !output.endsWith("\n")) {
+      output += "\n";
+    }
+    output += `${part.content}\n`;
   }
-  return output;
+  return output.trimEnd();
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-events.ts` around lines 240 - 249, joinOrderedOutputParts currently collapses a raw (non-"text") part into the following text part (e.g., raw -> text becomes rawtext); update the function (joinOrderedOutputParts and references to OrderedOutputPart) so that when appending a non-"text" part you always preserve its boundary by ensuring a newline separates it from subsequent text—append the raw part.content and then, if the output does not already end with '\n', append a single '\n' (avoid duplicating newlines), rather than conditionally prefixing the raw piece with a newline based on output length.
🧹 Nitpick comments (6)
test/item-retry.test.ts (1)
69-186: Optional: extract shared retry-case setup to reduce duplication.

Both tests are solid, but they repeat a large common arrangement block. A tiny helper would improve maintainability without changing behavior.
♻️ Refactor sketch
```diff
+function createRetryArgs(params: {
+  testName: string;
+  initialOutput: string;
+  prompt: string;
+}) {
+  return {
+    item: createMatrixItem(params.testName),
+    generation: { success: true, output: params.initialOutput, durationMs: 300 },
+    harnessForRetry: createHarness(generateMock),
+    runtimeForRetry: createRuntime(),
+    promptForRetry: params.prompt,
+    timeoutMs: 5_000,
+    unloadAfter: true,
+    log: { info: vi.fn(), warn: vi.fn() },
+    supportsCompileRetry: true,
+  };
+}
 ...
-  const outcome = await runScoringWithCompileRetry({
-    item: createMatrixItem("todo-app"),
-    generation: { ... },
-    ...
-  });
+  const outcome = await runScoringWithCompileRetry(
+    createRetryArgs({
+      testName: "todo-app",
+      initialOutput: "export const notTheRightThing = true;",
+      prompt: "Implement createTodoApp().",
+    }),
+  );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/item-retry.test.ts` around lines 69 - 186, Extract the repeated compile-retry test setup into a small helper to remove duplication: create a helper (e.g., runCompileRetryScenario or setupCompileRetry) in test/item-retry.test.ts that accepts the varying bits (item name, initial generation.output/durationMs, generateMock resolved output/durationMs, first scoring failure object, promptForRetry) and internally imports runScoringWithCompileRetry, configures generateMock and scoreGenerationMock, constructs harnessForRetry/runtimeForRetry, calls runScoringWithCompileRetry with the shared options (timeoutMs, unloadAfter, log, supportsCompileRetry) and returns the outcome; then update both tests to call this helper and assert on the returned outcome and on generateMock call expectations.test/build-index.test.ts (1)
28-48: Consider centralizing required benchmark lib asset fixture data.

This list is manually mirrored from `src/lib/benchmark-checkpoint.ts`, so asset changes require edits in multiple places. A shared export/helper would reduce drift risk.

As per coding guidelines, "Favor iteration and modularization over code duplication."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/build-index.test.ts` around lines 28 - 48, the test's local REQUIRED_LIB_ASSETS duplicates the asset list from src/lib/benchmark-checkpoint.ts; remove the hardcoded array and import a single exported constant from the benchmark-checkpoint module (or a new shared helper module) to centralize the fixture data. Update test/build-index.test.ts to import the exported symbol (e.g., BENCHMARK_REQUIRED_ASSETS or REQUIRES_ASSETS) and use it in place of the local REQUIRED_LIB_ASSETS, and add the export to src/lib/benchmark-checkpoint.ts (or a new shared file) so both production code and tests reference the same source of truth.

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (1)
199-203: Reuse `dedupeAndSortModels` in “Select all” for consistency.

This path currently reimplements sorting and skips dedupe. Reusing the helper keeps one canonical normalization path.
Proposed patch
```diff
 onClick={() =>
-  onSelectionChange(
-    [...models].sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
-  )
+  onSelectionChange(dedupeAndSortModels(models))
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around lines 199 - 203, the "Select all" onClick currently builds a new array and manually sorts via MODEL_SORT_COLLATOR without deduplicating; replace that logic to call the existing helper dedupeAndSortModels(models) and pass its result to onSelectionChange to ensure canonical dedupe+sort behavior (update the onClick that currently calls onSelectionChange([...models].sort(...)) to use dedupeAndSortModels(models) instead).

src/cli/compare-formatters.ts (1)
26-36: Pin the display-row collation instead of inheriting the process locale.

`localeCompare()` without an explicit locale uses the runtime’s default collation, so these sort orders can still drift across machines when model, harness, or test names contain locale-sensitive characters. If the goal is deterministic CLI output, make the comparator locale-independent or at least pin it to a fixed locale.

♻️ Suggested fix
```diff
+function compareText(left: string, right: string): number {
+  if (left < right) return -1;
+  if (left > right) return 1;
+  return 0;
+}
+
 function compareDisplayRows(
   left: ComparableDisplayRow,
   right: ComparableDisplayRow,
 ): number {
-  const modelCompare = left.model.localeCompare(right.model);
+  const modelCompare = compareText(left.model, right.model);
   if (modelCompare !== 0) return modelCompare;
-  const harnessCompare = left.harness.localeCompare(right.harness);
+  const harnessCompare = compareText(left.harness, right.harness);
   if (harnessCompare !== 0) return harnessCompare;
-  const testCompare = left.test.localeCompare(right.test);
+  const testCompare = compareText(left.test, right.test);
   if (testCompare !== 0) return testCompare;
-  return left.passType.localeCompare(right.passType);
+  return compareText(left.passType, right.passType);
 }
```

Based on learnings, "Output should be table/diff oriented, concise, and deterministic; never rely on color alone".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/compare-formatters.ts` around lines 26 - 36, the comparator compareDisplayRows currently uses localeCompare() with the environment default; make sorting deterministic by pinning the locale and options, e.g. call localeCompare on model, harness, test, and passType with a fixed locale (like "en" or "en-US") and explicit options (for example { numeric: true, sensitivity: "variant" }) so comparisons do not vary across machines; update the calls in compareDisplayRows (referencing left.model, left.harness, left.test, left.passType and their right.* counterparts) to include the chosen locale and options.

test/goose-adapter.test.ts (1)
238-245: Don’t make these regressions depend on taint-reason ordering.
`signalAssessment.reasons` is effectively set-like metadata. Matching the exact array order makes these tests fail on harmless merge/refactor changes even when the same reasons are present. Prefer membership-based assertions or compare sorted arrays.

♻️ Suggested assertion pattern
```diff
-        signalAssessment: {
-          classification: "tainted",
-          reasons: ["internal_tool_transcript", "agent_requested_input"],
-        },
+        signalAssessment: expect.objectContaining({
+          classification: "tainted",
+          reasons: expect.arrayContaining([
+            "internal_tool_transcript",
+            "agent_requested_input",
+          ]),
+        }),
```

Also applies to: 279-285
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/goose-adapter.test.ts` around lines 238 - 245, The test currently asserts exact ordering on signalAssessment.reasons which is brittle; update the assertion in the failing expectation (the block containing signalAssessment: { classification: "tainted", reasons: [...] }) to assert membership rather than order — e.g., use expect.arrayContaining([...]) for signalAssessment.reasons or compare a sorted copy (sort both actual and expected) so the test verifies the presence of "internal_tool_transcript" and "agent_requested_input" regardless of order; apply the same change to the other occurrence around lines 279-285 that also asserts signalAssessment.reasons.

src/schemas/common.schema.ts (1)
25-32: Keep artifact runtimes independent from the active runtime list.
`artifactRuntimeNames = supportedRuntimeNames` means the next active-runtime removal will also narrow artifact parsing unless someone remembers to split the arrays in the same edit. Since this file is introducing separate “active” vs “persisted” runtime surfaces, make the artifact list its own literal set now.

♻️ Suggested refactor
```diff
 /** Valid runtime names allowed when reading historical plan/result artifacts. */
-export const artifactRuntimeNames = supportedRuntimeNames;
+export const artifactRuntimeNames = ["ollama", "vllm"] as const;
```

As per coding guidelines, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 25 - 32, artifactRuntimeNames currently aliases supportedRuntimeNames which couples persisted artifact parsing to the active runtime list; replace that alias with an explicit, independent literal array of allowed persisted runtimes (assign a new array to artifactRuntimeNames rather than referencing supportedRuntimeNames), then keep ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName = z.infer<typeof ArtifactRuntimeNameSchema> so the schema/type use the new independent list; ensure you update only artifactRuntimeNames (not supportedRuntimeNames) so future active-runtime removals won't change artifact parsing and remember to increment schemaVersion/migration if this is a breaking change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 69-76: The ModelFilterDropdown component accepts props without
runtime checking; add a Zod schema for the expected shape (matching
ModelFilterDropdownProps) and validate props at the top of the exported
ModelFilterDropdown function, throwing a TypeError when validation fails; after
validation use the parsed/validated values (models, selectedModels,
onSelectionChange) for the rest of the component (e.g., for
normalizedSelectedIds and containerRef) so internal logic relies on
runtime-validated data and the documented contract is enforced.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 149-152: The protocolOnlyReasons computation uses stdout directly,
causing taint from the actual parsed stream (the result of
selectProcessOutput(stdout, stderr) passed into parseOpenCodeEvents()) to be
ignored; update the protocolOnlyReasons assignment in opencode-adapter.ts so it
calls getTranscriptOrInputTaintReasons(...) on the exact stream/string that was
passed into parseOpenCodeEvents/selectProcessOutput (the parsed stream variable)
instead of always using stdout—i.e., locate the protocolOnlyReasons declaration
and replace getTranscriptOrInputTaintReasons(stdout) with
getTranscriptOrInputTaintReasons(parsedStream) or the actual variable holding
selectProcessOutput(...) so stderr-only protocol JSONL preserves its
internal_tool_transcript taint.
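The fix above can be sketched as follows. This is a minimal stand-in, not the real adapter: `selectProcessOutput` and `getTranscriptOrInputTaintReasons` are simplified versions of the names referenced in the comment, and the JSONL detection heuristic is invented for illustration.

```typescript
// Sketch: taint-check the same stream that was parsed, not stdout
// unconditionally. All bodies here are simplified stand-ins.
function selectProcessOutput(stdout: string, stderr: string): string {
  // Fall back to stderr when stdout carries no content.
  return stdout.trim().length > 0 ? stdout : stderr;
}

function getTranscriptOrInputTaintReasons(stream: string): string[] {
  // Stand-in detector: flag protocol JSONL transcripts.
  return stream.includes('"type":"tool_call"')
    ? ["internal_tool_transcript"]
    : [];
}

function protocolOnlyReasons(stdout: string, stderr: string): string[] {
  const parsedStream = selectProcessOutput(stdout, stderr);
  // Previously this passed stdout, dropping stderr-only taint.
  return getTranscriptOrInputTaintReasons(parsedStream);
}
```

With this shape, a stderr-only protocol transcript still yields the `internal_tool_transcript` reason.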
In `@src/harnesses/opencode-events.ts`:
- Around line 311-317: When extractFromToolCallObject(event) returns a non-empty
string and you assign it to toolCallOutput, also set hasToolUse = true so the
parsed output reflects that tool content was recovered; update both locations
where this pattern appears (the block using extractedToolCallOutput /
toolCallOutput around extractFromToolCallObject and the similar block at lines
~347-348) to flip hasToolUse when assigning the recovered tool-call string. This
change should reference extractedToolCallOutput, toolCallOutput, hasToolUse, and
extractFromToolCallObject in the affected function(s).
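The described change can be illustrated with a small sketch; `extractFromToolCallObject` and the state fields here are hypothetical stand-ins for the parser internals.

```typescript
// Sketch: when tool-call content is recovered, also flip hasToolUse.
function extractFromToolCallObject(event: { output?: string }): string {
  return event.output ?? "";
}

function applyToolCallOutput(
  event: { output?: string },
  state: { toolCallOutput: string; hasToolUse: boolean },
): void {
  const extracted = extractFromToolCallObject(event);
  if (extracted.length > 0) {
    state.toolCallOutput = extracted;
    state.hasToolUse = true; // previously left false on this path
  }
}
```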
In `@src/runner/index.ts`:
- Around line 280-310: The residency-guard failure branch currently continues
without running post-item teardown, which can leave a previous model resident
and poison subsequent blocks; before the early continue in the catch block,
invoke runPostItemResidencyTeardown with the same context used for successful
items (pass item, plan, config, log, and any residency state) to ensure the
model is unloaded/cleaned up even on failure, then proceed to record the failure
via buildResidencyGuardFailureResult and write the progress checkpoint as
before.
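The control flow of that fix can be sketched as below. All names are hypothetical stand-ins for the runner internals; the point is only that teardown runs on both the success and failure paths before the failure is recorded.

```typescript
// Sketch: run post-item teardown even when the residency guard fails,
// so a failed item cannot leave the previous model resident.
function executeItemWithGuard(
  runItem: () => void,
  runPostItemResidencyTeardown: () => void,
  recordFailure: (error: unknown) => void,
): "ok" | "failed" {
  let failure: unknown;
  let failed = false;
  try {
    runItem();
  } catch (error) {
    failure = error;
    failed = true;
  }
  // Teardown runs on both paths (previously skipped on failure).
  runPostItemResidencyTeardown();
  if (failed) {
    recordFailure(failure);
    return "failed";
  }
  return "ok";
}
```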
In `@src/runner/model-eligibility.ts`:
- Around line 97-99: The thrown error message always says "(embedding-only)"
even when the real cause is that the model's capabilities.generateText is false;
update the throw in model-eligibility.ts (the branch referencing model, runtime,
detectedModelKind, and capabilities.generateText) to compute an accurate reason
string (e.g., if detectedModelKind === 'embedding' use "embedding-only", else if
!capabilities.generateText use "does not support text generation" or include the
capability flag) and include that reason in the Error message so operators see
the true exclusion cause.
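A minimal sketch of the reason computation described above; the field names mirror the review comment and are assumptions about the real module.

```typescript
// Sketch: report the actual exclusion cause instead of always
// claiming "embedding-only".
interface ModelCapabilities {
  generateText: boolean;
}

function exclusionReason(
  detectedModelKind: string,
  capabilities: ModelCapabilities,
): string {
  if (detectedModelKind === "embedding") return "embedding-only";
  if (!capabilities.generateText) return "does not support text generation";
  return "unknown exclusion";
}
```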
In `@src/runner/plan-builder.ts`:
- Around line 102-110: Currently non-executable entries in config.runtimes are
silently filtered out (configuredRuntimes → runtimes using
isExecutableRuntimeName); instead, validate configuredRuntimes up-front: compute
the set of invalid names = configuredRuntimes.filter(r =>
!isExecutableRuntimeName(r)) and if invalids.length > 0 throw an Error listing
the invalids and the allowed RUNTIME_NAMES; only after that set runtimes =
configuredRuntimes as RuntimeName[] (or keep the existing filter as a sanity
check) so we fail loudly on bad config rather than masking it.
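The fail-loud validation can be sketched as follows. `RUNTIME_NAMES` and `isExecutableRuntimeName` mirror names from the comment and are assumptions about the real module.

```typescript
// Sketch: reject unknown runtime names up-front instead of silently
// filtering them out of the plan.
const RUNTIME_NAMES = ["ollama"] as const;
type RuntimeName = (typeof RUNTIME_NAMES)[number];

function isExecutableRuntimeName(name: string): name is RuntimeName {
  return (RUNTIME_NAMES as readonly string[]).includes(name);
}

function resolveRuntimes(configuredRuntimes: string[]): RuntimeName[] {
  const invalid = configuredRuntimes.filter(
    (name) => !isExecutableRuntimeName(name),
  );
  if (invalid.length > 0) {
    throw new Error(
      `Unsupported runtime(s): ${invalid.join(", ")}. ` +
        `Allowed: ${RUNTIME_NAMES.join(", ")}`,
    );
  }
  return configuredRuntimes as RuntimeName[];
}
```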
In `@src/runtimes/ollama-residency.ts`:
- Around line 261-266: The residency check in ensureOnlyOllamaModelLoaded uses
isSameOllamaModel but does not trim whitespace from config.allowedModel, so
values like " qwen3.6 " are treated as different; update
ensureOnlyOllamaModelLoaded (or its config validation) to trim() the
allowedModel string from EnsureOnlyOllamaModelLoadedConfig before any calls to
isSameOllamaModel or residency comparisons, and throw if allowedModel is empty
after trimming to avoid silent fallbacks.
- Around line 203-245: Both listRunningOllamaModels and unloadOllamaModel must
validate and normalize incoming config strings: trim config.baseUrl and
config.model (where present) and throw a clear Error if baseUrl or model is
empty/whitespace-only after trimming; also normalize baseUrl by removing any
trailing slash before constructing endpoint (the code that builds endpoint uses
`${config.baseUrl}/api/...`). Update
OllamaResidencyBaseConfig/UnloadOllamaModelConfig consumers in
listRunningOllamaModels (endpoint) and unloadOllamaModel (endpoint and request
body.model) to perform this validation/normalization at the top of each function
and reject invalid input immediately.
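The trim-and-normalize behavior asked for above can be sketched with a hypothetical helper (the real config types live in `src/runtimes/ollama-residency.ts`):

```typescript
// Sketch: validate and normalize a base URL before building endpoints.
function normalizeBaseUrl(baseUrl: string): string {
  const trimmed = baseUrl.trim();
  if (trimmed.length === 0) {
    throw new Error("baseUrl must be a non-empty string");
  }
  // Strip trailing slashes so `${base}/api/...` never doubles up.
  return trimmed.replace(/\/+$/, "");
}
```

Usage: `` `${normalizeBaseUrl(" http://localhost:11434/ ")}/api/ps` `` yields `http://localhost:11434/api/ps`. The same trim-then-reject-empty pattern applies to `config.model` and `allowedModel`.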
In `@src/runtimes/ollama-runtime.ts`:
- Around line 37-42: Normalize and validate the 'family' and 'families' fields
at the HTTP boundary by updating the zod 'details' schema: trim whitespace and
reject empty/blank strings for the single 'family' field and for each element of
the 'families' array so downstream functions like inferModelKind() and returned
metadata only see normalized values; specifically, change the 'family' schema to
apply a trim transform and a non-empty check and change 'families' to be an
array of trimmed/non-empty strings (keep both optional semantics), so invalid or
blank entries are rejected during parsing.
In `@test/opencode-output.test.ts`:
- Around line 88-106: The test currently uses a non-write second entry (command:
"edit") so it doesn't actually verify that a later write beats an earlier one;
update the fixture in the "extracts the latest batched write from alternate
tool-call shapes" test that calls parseOpenCodeEvents so the second toolCalls
element is a write-shaped entry (e.g., { tool: "write", raw: { content: "export
const second = 2;" } }) instead of command: "edit", leaving the expectations
(parsed.output and parsed.method) unchanged so the test truly exercises the
"latest write wins" behavior.
In `@test/schemas.test.ts`:
- Around line 223-244: The test currently allows runtimes: ["vllm"] but
BenchConfigSchema should reject that; update the second assertion so parsing {
runtimes: ["vllm"] } throws instead of returning a value. Specifically, change
the expectation around BenchConfigSchema.parse({ runtimes: ["vllm"] }) in the
test to use expect(() => BenchConfigSchema.parse(...)).toThrow(...) (matching
"runtimes" or "vllm") while keeping the other assertions (e.g., the
modelProfiles variant check for vllm) unchanged.
- Around line 69-74: The test is incorrectly asserting that
SupportedRuntimeNameSchema accepts "vllm", collapsing the supported/artifact
split; update the test so SupportedRuntimeNameSchema.parse("vllm") throws (or
remove that assertion) while keeping ArtifactRuntimeNameSchema.parse("vllm")
asserting it returns "vllm", i.e., ensure SupportedRuntimeNameSchema only
accepts supported names (e.g., "ollama") and legacy "vllm" remains covered by
ArtifactRuntimeNameSchema.
---
Outside diff comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 363-384: MatrixItemResultSchema currently references
ScoringMetricsSchema but that schema only includes durationMs, causing fields
like retryGenerationDurationMs, retryKind, retryReason, retryAttempted, and
retryPromoted to be stripped; update ScoringMetricsSchema to include these
additional optional fields and then ensure MatrixItemResultSchema continues to
reference ScoringMetricsSchema (and export remains unchanged) so live run.json
payloads retain those properties; locate ScoringMetricsSchema in the same file
(or its import) and add the five new optional fields with appropriate types
(numbers/strings/booleans) to match existing usages.
In `@src/cli/migrate-machine-command.ts`:
- Around line 334-337: The action handler currently calls
MigrateMachineCommandOptionsSchema.parse(options) and then unconditionally calls
process.exit(1) for any caught error; change this so Zod/validation errors from
MigrateMachineCommandOptionsSchema.parse (or other user-input validation) are
handled separately: detect the validation error (e.g., instanceof ZodError or
error.name === 'ZodError'), print a helpful validation message to the user and
return/exit with success (no non-zero exit), while only calling process.exit(1)
for unexpected runtime/crash errors; update the .action(async (options) => { ...
}) catch block to branch on validation vs unexpected errors and avoid forcing a
non-zero exit for user validation issues.
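The validation-vs-crash branching can be sketched like this; `handleCommandError` is a hypothetical helper, and the ZodError detection uses the `error.name` check mentioned in the comment so the sketch has no hard zod dependency.

```typescript
// Sketch: return the exit code instead of calling process.exit inline,
// treating validation errors as user feedback rather than crashes.
function handleCommandError(error: unknown): number {
  const isValidationError =
    error instanceof Error && error.name === "ZodError";
  if (isValidationError) {
    console.error(`Invalid options: ${error.message}`);
    return 0; // user-input problem: explain, do not fail the process
  }
  console.error("Unexpected failure:", error);
  return 1; // genuine runtime error
}
```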
---
Duplicate comments:
In `@llm/implementation/harnesses-implementation.md`:
- Around line 340-345: The doc is inconsistent about how Goose enables tools:
reconcile the "Tool-Calling Mode (Goose)" section with the table entry "Goose |
Per-item provider config + permission rules via `goose run`" by choosing the
intended mechanism and updating the other place to match; either remove the
reference to the CLI flag `--with-builtin developer` from "Tool-Calling Mode
(Goose)" and replace it with a clear description of per-item provider config +
permission rules (including mention of using `goose run`), or conversely, change
the table to reflect the CLI flag approach—ensure all references to
`--with-builtin developer`, "Tool-Calling Mode (Goose)", and the table entry are
consistent and use the same terminology and steps.
In `@README.md`:
- Around line 142-143: Update the README line that currently claims
"auto-discovers available runtimes, models, harnesses, and tests" for the `bun
pb` example so it no longer states runtimes are auto-discovered; change the
wording to indicate `bun pb` auto-discovers only models, harnesses, and tests
(and that new runs are Ollama-only) so the example matches the current behavior.
In `@src/harnesses/opencode-events.ts`:
- Around line 240-249: joinOrderedOutputParts currently collapses a raw
(non-"text") part into the following text part (e.g., raw -> text becomes
rawtext); update the function (joinOrderedOutputParts and references to
OrderedOutputPart) so that when appending a non-"text" part you always preserve
its boundary by ensuring a newline separates it from subsequent text—append the
raw part.content and then, if the output does not already end with '\n', append
a single '\n' (avoid duplicating newlines), rather than conditionally prefixing
the raw piece with a newline based on output length.
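The boundary rule can be sketched as below; `OrderedOutputPart` mirrors the name in the comment, and the shape is a simplified assumption about the real type.

```typescript
// Sketch: after appending a non-"text" part, guarantee exactly one
// trailing newline so raw and text content never fuse (raw -> "rawtext").
type OrderedOutputPart = { kind: "text" | "raw"; content: string };

function joinOrderedOutputParts(parts: OrderedOutputPart[]): string {
  let output = "";
  for (const part of parts) {
    output += part.content;
    if (part.kind !== "text" && !output.endsWith("\n")) {
      output += "\n"; // preserve the boundary without duplicating newlines
    }
  }
  return output;
}
```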
---
Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 199-203: The "Select all" onClick currently builds a new array and
manually sorts via MODEL_SORT_COLLATOR without deduplicating; replace that logic
to call the existing helper dedupeAndSortModels(models) and pass its result to
onSelectionChange to ensure canonical dedupe+sort behavior (update the onClick
that currently calls onSelectionChange([...models].sort(...)) to use
dedupeAndSortModels(models) instead).
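A sketch of the canonical helper referenced above; the collator options are assumptions, since the real `MODEL_SORT_COLLATOR` lives in the dashboard code.

```typescript
// Sketch: dedupe then sort with one shared collator so every selection
// path produces the same canonical list.
const MODEL_SORT_COLLATOR = new Intl.Collator("en-US", { numeric: true });

function dedupeAndSortModels(models: string[]): string[] {
  return [...new Set(models)].sort(MODEL_SORT_COLLATOR.compare);
}
```

The "Select all" handler would then be `onSelectionChange(dedupeAndSortModels(models))`.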
In `@src/cli/compare-formatters.ts`:
- Around line 26-36: The comparator compareDisplayRows currently uses
localeCompare() with the environment default; make sorting deterministic by
pinning the locale and options, e.g. call localeCompare on model, harness, test,
and passType with a fixed locale (like "en" or "en-US") and explicit options
(for example { numeric: true, sensitivity: "variant" }) so comparisons do not
vary across machines; update the calls in compareDisplayRows (referencing
left.model, left.harness, left.test, left.passType and their right.*
counterparts) to include the chosen locale and options.
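The pinned-locale variant suggested above can be sketched as a tiny comparator; the chosen locale and options follow the comment's example.

```typescript
// Sketch: pin locale and options so localeCompare no longer depends on
// the machine's default collation.
function compareDeterministic(a: string, b: string): number {
  return a.localeCompare(b, "en-US", {
    numeric: true,
    sensitivity: "variant",
  });
}
```

With `numeric: true`, `"a2"` sorts before `"a10"`, which is usually the intent for model and test names containing version numbers.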
In `@src/schemas/common.schema.ts`:
- Around line 25-32: artifactRuntimeNames currently aliases
supportedRuntimeNames which couples persisted artifact parsing to the active
runtime list; replace that alias with an explicit, independent literal array of
allowed persisted runtimes (assign a new array to artifactRuntimeNames rather
than referencing supportedRuntimeNames), then keep ArtifactRuntimeNameSchema =
z.enum(artifactRuntimeNames) and ArtifactRuntimeName = z.infer<typeof
ArtifactRuntimeNameSchema> so the schema/type use the new independent list;
ensure you update only artifactRuntimeNames (not supportedRuntimeNames) so
future active-runtime removals won't change artifact parsing and remember to
increment schemaVersion/migration if this is a breaking change.
In `@test/build-index.test.ts`:
- Around line 28-48: The test's local REQUIRED_LIB_ASSETS duplicates the asset
list from src/lib/benchmark-checkpoint.ts; remove the hardcoded array and import
a single exported constant from the benchmark-checkpoint module (or a new shared
helper module) to centralize the fixture data. Update test/build-index.test.ts
to import the exported symbol (e.g., BENCHMARK_REQUIRED_ASSETS or
REQUIRES_ASSETS) and use it in place of the local REQUIRED_LIB_ASSETS, and add
the export to src/lib/benchmark-checkpoint.ts (or a new shared file) so both
production code and tests reference the same source of truth.
In `@test/goose-adapter.test.ts`:
- Around line 238-245: The test currently asserts exact ordering on
signalAssessment.reasons which is brittle; update the assertion in the failing
expectation (the block containing signalAssessment: { classification: "tainted",
reasons: [...] }) to assert membership rather than order — e.g., use
expect.arrayContaining([...]) for signalAssessment.reasons or compare a sorted
copy (sort both actual and expected) so the test verifies the presence of
"internal_tool_transcript" and "agent_requested_input" regardless of order;
apply the same change to the other occurrence around lines 279-285 that also
asserts signalAssessment.reasons.
In `@test/item-retry.test.ts`:
- Around line 69-186: Extract the repeated compile-retry test setup into a small
helper to remove duplication: create a helper (e.g., runCompileRetryScenario or
setupCompileRetry) in test/item-retry.test.ts that accepts the varying bits
(item name, initial generation.output/durationMs, generateMock resolved
output/durationMs, first scoring failure object, promptForRetry) and internally
imports runScoringWithCompileRetry, configures generateMock and
scoreGenerationMock, constructs harnessForRetry/runtimeForRetry, calls
runScoringWithCompileRetry with the shared options (timeoutMs, unloadAfter, log,
supportsCompileRetry) and returns the outcome; then update both tests to call
this helper and assert on the returned outcome and on generateMock call
expectations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e139aef-83e9-4aad-98b3-76391c93f0ab
📒 Files selected for processing (159)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/about/about-content.ts
- apps/dashboard/src/components/about/about-page.tsx
- apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
- apps/dashboard/src/components/charts/chart-primitives.tsx
- apps/dashboard/src/components/charts/composite-score-chart.tsx
- apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
- apps/dashboard/src/components/charts/model-comparison-chart.tsx
- apps/dashboard/src/components/charts/model-radar-chart.tsx
- apps/dashboard/src/components/charts/model-test-heatmap.tsx
- apps/dashboard/src/components/charts/test-difficulty-chart.tsx
- apps/dashboard/src/components/charts/test-type-split-chart.tsx
- apps/dashboard/src/components/charts/token-efficiency-chart.tsx
- apps/dashboard/src/components/layout/header.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/components/ui/section-heading.tsx
- apps/dashboard/src/index.css
- apps/dashboard/src/lib/aggregations-charts.ts
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/aggregations-diagnostics.ts
- apps/dashboard/src/lib/aggregations-test-types.ts
- apps/dashboard/src/lib/aggregations-tooling.ts
- apps/dashboard/src/lib/aggregations.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/chart-colors.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/context/codebase-overview.md
- llm/implementation/computer-use-hardening.md
- llm/implementation/harnesses-implementation.md
- llm/implementation/multi-runtime-mvp-implementation.md
- llm/project/project-overview.md
- llm/project/project-rules.md
- llm/project/user-flow.md
- package.json
- src/cli/compare-command.ts
- src/cli/compare-formatters.ts
- src/cli/compare-utils.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/discovery.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-openai.ts
- src/harnesses/goose-output.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-artifacts.ts
- src/harnesses/opencode-cli.ts
- src/harnesses/opencode-config.ts
- src/harnesses/opencode-events.ts
- src/harnesses/opencode-model.ts
- src/harnesses/opencode-output.ts
- src/harnesses/opencode-permissions.ts
- src/harnesses/opencode-process.ts
- src/harnesses/opencode-provider.ts
- src/harnesses/opencode-runner.ts
- src/harnesses/tool-prompt.ts
- src/lib/benchmark-checkpoint.ts
- src/lib/failure-classifier.ts
- src/lib/hardware-profile.ts
- src/lib/logger.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/model-aliases.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry-compat.ts
- src/lib/model-profile/registry.ts
- src/lib/openai-compat-client.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/test-catalog.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/model-availability.ts
- src/runner/model-eligibility.ts
- src/runner/plan-builder.ts
- src/runner/run-progress.ts
- src/runtimes/discovery.ts
- src/runtimes/index.ts
- src/runtimes/ollama-residency.ts
- src/runtimes/ollama-runtime.ts
- src/runtimes/runtime.ts
- src/runtimes/vllm-runtime.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-alias.schema.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/schemas/test-catalog.schema.ts
- src/tests/calculator-basic/test.meta.json
- src/tests/calculator-stateful/test.meta.json
- src/tests/event-emitter/test.meta.json
- src/tests/rate-limiter/test.meta.json
- src/tests/smoke/test.meta.json
- src/tests/todo-app/test.meta.json
- src/tests/ttl-cache/test.meta.json
- src/tests/workspace-smoke/fixtures/logs/.gitkeep
- src/tests/workspace-smoke/prompt.blind.md
- src/tests/workspace-smoke/prompt.informed.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/benchmark-checkpoint.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare-formatters.test.ts
- test/dashboard-schema-compat.test.ts
- test/dashboard-test-type.test.ts
- test/failure-classifier.test.ts
- test/generation-retry.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/hardware-profile.test.ts
- test/harness-compatibility.test.ts
- test/item-executor.test.ts
- test/item-retry.test.ts
- test/machine-profile-migration.test.ts
- test/model-alias-schema.test.ts
- test/model-availability.test.ts
- test/model-profile-registry.test.ts
- test/ollama-adapter.test.ts
- test/ollama-residency.test.ts
- test/opencode-adapter.test.ts
- test/opencode-artifacts.test.ts
- test/opencode-cli-cache.test.ts
- test/opencode-cli.test.ts
- test/opencode-config.test.ts
- test/opencode-model.test.ts
- test/opencode-output.test.ts
- test/opencode-runner-race.test.ts
- test/opencode-runner.test.ts
- test/plan-builder.test.ts
- test/run-command.test.ts
- test/runner-residency.test.ts
- test/schema-regressions.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats-semantics.test.ts
- test/stats.test.ts
- test/test-catalog.test.ts
- test/tool-prompt.test.ts
- test/utils/fixtures.ts
- test/vllm-runtime.test.ts
- test/workspace-capability-parity.test.ts
- test/workspace-prompt-parity.test.ts
💤 Files with no reviewable changes (10)
- package.json
- test/benchmark-checkpoint.test.ts
- src/harnesses/opencode-process.ts
- src/harnesses/goose-openai.ts
- test/vllm-runtime.test.ts
- src/runtimes/discovery.ts
- src/harnesses/opencode-model.ts
- src/lib/openai-compat-client.ts
- src/runtimes/vllm-runtime.ts
- src/harnesses/opencode-output.ts