
test(evals): add Ricky doc-derived eval suite #74

Merged
khaliqgant merged 6 commits into main from codex/ricky-evals-sweep
May 8, 2026

Conversation


@khaliqgant khaliqgant commented May 8, 2026

Summary

  • add Ricky eval scripts backed by @agent-assistant/telemetry human eval helpers
  • add 44 doc-derived eval cases across CLI, workflow authoring, runtime recovery, surfaces/ingress, generation quality, and Agent Assistant boundary suites
  • add a simple local OpenCode provider path (npm run evals:opencode) for running manual cases via opencode run -m <model> <prompt> without OpenRouter credentials
  • update @agent-assistant/turn-context to ^0.4.31 and add @agent-assistant/telemetry as a dev dependency

Validation

  • npm run evals:compile
  • npm run evals (passed=2 needs-human=42 failed=0)
  • npm run evals -- --suite cli-behavior (passed=2 needs-human=4 failed=0)
  • npm run evals:opencode -- --list --suite workflow-authoring
  • opencode models opencode | rg "^opencode/minimax-m2.5-free$|^opencode/nemotron-3-super-free$"
  • npm run typecheck

Notes


coderabbitai Bot commented May 8, 2026

Walkthrough

This PR adds a comprehensive human-authored evaluation framework and tooling for Ricky: suite case definitions and rubrics, generated JSONL test artifacts, Node.js CLI scripts to compile/run/summarize/compare evals (supporting manual, OpenCode, and ricky-cli executors), new npm scripts and dependency bumps, and CLI/local best-judgement flag propagation with spec-intake/normalizer updates and tests.

Changes

Ricky Evaluation Suite Framework

Layer / File(s) Summary
Evaluation System Documentation
evals/README.md, evals/fixtures/transcripts/.gitkeep
Adds top-level eval documentation: case formats (manual with ### Message/### Must/### Must Not, deterministic ricky-cli with ### Mock argv), compilation/run flow, OpenCode executor guidance, run-history location (.ricky/) and a source map linking suites to repo specs.
Agent Assistant Boundary Suite
evals/suites/agent-assistant-boundary/cases.md, cases.jsonl, rubric.md
Adds five human-review-focused cases enforcing Agent Assistant adoption boundaries: real import/runtime grounding, preserving Ricky turn-envelope metadata, keeping product-core wording/ownership, single-slice incremental adoption, and future-surface design constraints.
CLI Behavior Suite
evals/suites/cli-behavior/cases.md, cases.jsonl, rubric.md
Adds deterministic CLI regression/capability cases: --help surface completeness, version formatting, generation default (no --run) semantics, compact first-run onboarding, recovery guidance without stack traces, and honest ricky status provider reporting.
Generation Quality Suite
evals/suites/generation-quality/cases.md, cases.jsonl, rubric.md
Adds generation-quality cases (including unanswered-question vs --best-judgement behaviors and --mode local override): skill/tool provenance, honoring spec tool/model hints with audit metadata, opt-in bounded refinement and re-validation, behavior-grounded acceptance gates, pattern selection discipline, and proof/review requirements for generated artifacts.
Runtime Recovery Suite
evals/suites/runtime-recovery/cases.md, cases.jsonl, rubric.md
Adds runtime-recovery cases: classify-before-retry, stale local runtime-state handling, run-marker conflict reporting, bounded auto-fix with resumability, single-attempt behavior when auto-fix disabled, in-process Node/SDK execution preference, escalation with preserved evidence, and analytics from structured WorkflowRunEvidence.
Surfaces and Ingress Suite
evals/suites/surfaces-ingress/cases.md, cases.jsonl, rubric.md
Adds multi-surface integration cases: Slack parity and normalization, web handoff normalization, MCP/Claude context-as-metadata, Cloud API versioning and JSON stdout contract, Linear readiness fail-fast and PR-link completion, and strict CLI Cloud onboarding guidance.
Workflow Authoring Suite
evals/suites/workflow-authoring/cases.md, cases.jsonl, rubric.md
Adds human-review-heavy workflow-authoring cases: deterministic verification gates, distinct writer/reviewer separation, no-silent Cloud↔Local fallback, Agent Assistant boundary reuse, evidence-trail preservation, wave folder placement/naming standards, required runtime wrapper shape and error handling, env loading fail-fast, GitHub PR primitives, and dry-run/structural validation requirements.
Eval Compilation and Execution
scripts/evals/compile-ricky-evals.mjs, scripts/evals/run-ricky-evals.mjs, scripts/evals/summarize-ricky-evals.mjs, scripts/evals/compare-ricky-evals.mjs
Adds four CLI scripts: compile markdown case specs to JSONL; run evals via pluggable executors (manual, OpenCode, ricky-cli); summarize recent runs; compare the two most recent runs and report per-test transitions.
CLI & Best-Judgement Integration
src/surfaces/cli/..., src/local/request-normalizer.ts, src/local/entrypoint.ts, src/product/spec-intake/..., tests (src/**/test.ts)
Adds --best-judgement CLI flag and threads it into handoffs; extends BaseHandoff/LocalInvocationRequest with bestJudgement?; local entrypoint can synthesize best-judgement answers, append them to the spec, re-run spec intake, and include best_judgement_clarifications in generation decisions; spec-intake normalizer now recognizes explicit execution-mode metadata and answered clarifications to suppress execution-mode conflict questions; tests added/updated to cover parsing, help text, propagation, and behavior.
Dependencies and NPM Scripts
package.json
Adds npm scripts: evals:compile, evals, evals:opencode, evals:list, evals:summary, evals:compare; bumps @agent-assistant/turn-context to ^0.4.31; adds @agent-assistant/telemetry ^0.4.31.
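The manual case layout named in the README summary (### Message / ### Must / ### Must Not) can be sketched like this; the case content below is illustrative, not taken from the repo:

```markdown
### Message
Ask Ricky to explain what `ricky --help` should print.

### Must
- Mention every top-level command surface.

### Must Not
- Include a stack trace or internal error dump.
```

Deterministic ricky-cli cases follow the same shape but add a `### Mock argv` section, per the README description.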

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 Hop, hop — I read the suites with glee,
Markdown seeds sprout JSON for all to see,
Run scripts hum, compare shows the tale,
CLI flags threaded, tests on the trail,
Evidence trails help the product be.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 6.98%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main addition: a new Ricky documentation-derived evaluation suite with test cases.
  • Description check: ✅ Passed. The description is directly related to the changeset, detailing the eval scripts, 44 test cases, OpenCode provider support, dependency updates, and validation steps performed.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

Comment thread: scripts/evals/compare-ricky-evals.mjs (Outdated)
Comment on lines +50 to +51:

```js
disappeared += 1;
regressed += 1;
```


🟡 Compare summary double-counts disappeared tests in 'Regressed' total

In compare-ricky-evals.mjs, when a test disappears between runs, both regressed and disappeared are incremented (lines 50-51). The summary on line 56 then displays both as separate fields: Improved: X | Regressed: Y | Unchanged: Z | Disappeared: W. Since Y already includes W, the totals are misleading. For example, if 3 tests are unchanged and 1 disappeared, the output would be Regressed: 1 | Unchanged: 3 | Disappeared: 1 — a reader naturally interprets this as 5 total outcomes when there are only 4 unique tests. The Regressed count double-reports the disappeared test.

Suggested change:

```diff
-disappeared += 1;
-regressed += 1;
+disappeared += 1;
```
Open in Devin Review
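The mutually exclusive bucket accounting this finding asks for can be sketched as a small standalone function; the statuses and test ids below are illustrative, not the real run schema:

```javascript
// Each test lands in exactly one bucket, so the four counts always sum to
// the number of unique tests in the previous run — no double-counting.
function compareRuns(previous, current) {
  let improved = 0, regressed = 0, unchanged = 0, disappeared = 0;
  for (const [id, prevStatus] of Object.entries(previous)) {
    const currStatus = current[id];
    if (currStatus === undefined) {
      disappeared += 1; // not also counted as regressed
      continue;
    }
    if (prevStatus === 'failed' && currStatus === 'passed') improved += 1;
    else if (prevStatus === 'passed' && currStatus === 'failed') regressed += 1;
    else unchanged += 1;
  }
  return { improved, regressed, unchanged, disappeared };
}

console.log(compareRuns(
  { a: 'passed', b: 'passed', c: 'failed', d: 'passed' },
  { a: 'passed', b: 'failed', c: 'passed' }
));
```

With four unique tests, each bucket here reports exactly one test, and the summary line totals match the input size.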


Base automatically changed from chore/agents-md-pr-primitive to main May 8, 2026 15:05
@khaliqgant
Member Author

Addressed the Devin review feedback in 2254f74:

  • scripts/evals/compare-ricky-evals.mjs no longer increments regressed for disappeared cases.
  • Disappeared cases remain reported in the dedicated Disappeared bucket, so summary counts are no longer double-counted.

Validated with:

  • npm run evals:compare
  • npm run evals:compile
  • npm run typecheck


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (9)
evals/suites/generation-quality/cases.md (1)

159-159: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 - Skip skill-aware workflow authoring guidance for serious workflows.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/generation-quality/cases.md` at line 159, The file
evals/suites/generation-quality/cases.md is missing a trailing newline; update
the file so it ends with a single newline character (POSIX-compliant EOF
newline) by adding a newline at the end of cases.md and saving the file so
version control shows the change.
evals/suites/workflow-authoring/rubric.md (1)

22-22: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the output is specific enough to execute and review, protects
 Ricky's local execution contract, and leaves a durable evidence trail.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/workflow-authoring/rubric.md` at line 22, The file rubric.md is
missing a trailing newline; open the Markdown file (rubric.md) and ensure the
very last character is a newline character (add an empty line at EOF) so the
file ends with a single trailing newline for POSIX compliance and cleaner diffs.
evals/suites/cli-behavior/rubric.md (1)

11-11: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 plan or generated workflow.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/cli-behavior/rubric.md` at line 11, The file rubric.md is
missing a trailing newline; open evals/suites/cli-behavior/rubric.md (the
rubric.md document) and add a single newline character at the end of the file so
the file ends with a trailing newline for POSIX compliance and cleaner diffs.
evals/suites/runtime-recovery/rubric.md (2)

8-11: 💤 Low value

Consider varying question structure for better readability.

Questions 1-3 all begin with "Did", which slightly reduces readability. While the content is clear, varying the sentence structure can improve flow.

✍️ Optional rewording suggestion
 1. Did Ricky classify before retrying or repairing?
-2. Did the answer preserve exact evidence and uncertainty?
-3. Did it separate environment blockers from product or workflow failures?
+2. Was exact evidence and uncertainty preserved in the answer?
+3. Were environment blockers separated from product or workflow failures?
 4. Were repair attempts bounded, resumable, and artifact-scoped?
 5. Would an operator know the next safe action?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/runtime-recovery/rubric.md` around lines 8 - 11, Reword the
first three checklist items to vary sentence structure while preserving meaning:
replace "Did Ricky classify before retrying or repairing?" (item 1), "Did the
answer preserve exact evidence and uncertainty?" (item 2), and "Did it separate
environment blockers from product or workflow failures?" (item 3) with
alternative phrasings that avoid starting each with "Did" (e.g., convert to
passive/questions like "Was Ricky's classification performed before retrying or
repairing?", "Does the answer preserve exact evidence and uncertainty?", "Are
environment blockers clearly separated from product or workflow failures?"),
keeping the original intent and specificity intact.

18-18: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the response is evidence-backed, bounded, and honest about what
 was fixed, retried, skipped, or escalated.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/runtime-recovery/rubric.md` at line 18, The file rubric.md is
missing a trailing newline at EOF; open rubric.md and add a single newline
character at the end of the file (ensure the file ends with '\n') so the file
ends with a newline for POSIX compliance and cleaner diffs.
evals/suites/generation-quality/rubric.md (1)

18-18: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the generated workflow is reviewable, auditable, and has proof
 steps tied to the requested behavior.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/generation-quality/rubric.md` at line 18, The file rubric.md is
missing a trailing newline; open rubric.md and add a single newline character at
the end of the file so it ends with a POSIX-compliant newline (ensure the final
line break is committed).
evals/suites/agent-assistant-boundary/rubric.md (1)

18-18: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the boundary is honest, specific, and grounded in actual Ricky
 runtime behavior.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/agent-assistant-boundary/rubric.md` at line 18, Add a single
trailing newline character at the end of the markdown file rubric.md so the file
ends with a POSIX-compliant newline; open rubric.md (the
agent-assistant-boundary/rubric.md content) and ensure the last character is a
newline, then save the file so version control shows the newline-terminated
file.
evals/suites/cli-behavior/cases.md (1)

152-152: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 - Show empty fields with no recovery guidance when config is missing.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/cli-behavior/cases.md` at line 152, The file cases.md in the
cli-behavior suite is missing a trailing newline; open the Markdown file
(cases.md) and ensure the file ends with a single '\n' character (add an empty
final line), then save and commit so the file is POSIX-compliant and diffs are
clean.
evals/suites/agent-assistant-boundary/cases.md (1)

115-115: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 - Duplicate a mature Agent Assistant capability locally without justification.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/agent-assistant-boundary/cases.md` at line 115, The file
cases.md is missing a trailing newline; open cases.md and add a single newline
character at the end of the file so the file ends with a newline
(POSIX-compliant), then save and commit the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 11671f7e-8900-41a2-b97b-d6052be5f6e7

📥 Commits

Reviewing files that changed from the base of the PR and between 7e163c5 and 2254f74.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (25)
  • evals/README.md
  • evals/fixtures/transcripts/.gitkeep
  • evals/suites/agent-assistant-boundary/cases.jsonl
  • evals/suites/agent-assistant-boundary/cases.md
  • evals/suites/agent-assistant-boundary/rubric.md
  • evals/suites/cli-behavior/cases.jsonl
  • evals/suites/cli-behavior/cases.md
  • evals/suites/cli-behavior/rubric.md
  • evals/suites/generation-quality/cases.jsonl
  • evals/suites/generation-quality/cases.md
  • evals/suites/generation-quality/rubric.md
  • evals/suites/runtime-recovery/cases.jsonl
  • evals/suites/runtime-recovery/cases.md
  • evals/suites/runtime-recovery/rubric.md
  • evals/suites/surfaces-ingress/cases.jsonl
  • evals/suites/surfaces-ingress/cases.md
  • evals/suites/surfaces-ingress/rubric.md
  • evals/suites/workflow-authoring/cases.jsonl
  • evals/suites/workflow-authoring/cases.md
  • evals/suites/workflow-authoring/rubric.md
  • package.json
  • scripts/evals/compare-ricky-evals.mjs
  • scripts/evals/compile-ricky-evals.mjs
  • scripts/evals/run-ricky-evals.mjs
  • scripts/evals/summarize-ricky-evals.mjs

Comment on lines +83 to +86:

```js
const stdout = result.stdout?.trimEnd() ?? '';
const stderr = result.stderr?.trimEnd() ?? '';
const content = stdout || stderr || '';
```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve stderr alongside stdout in OpenCode results.

Line 85 picks stdout first and drops stderr when both exist, which hides useful diagnostics in human review worksheets.

Suggested fix:

```diff
   const stdout = result.stdout?.trimEnd() ?? '';
   const stderr = result.stderr?.trimEnd() ?? '';
-  const content = stdout || stderr || '';
+  const content = [stdout, stderr].filter(Boolean).join('\n');
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/evals/run-ricky-evals.mjs` around lines 83 - 86, The current
assignment for content drops stderr when stdout exists; change it so both
trimmed stdout and stderr are preserved by combining them (e.g., keeping
non-empty parts and joining with a separator) instead of choosing one or the
other; update the variables stdout, stderr, and content in run-ricky-evals.mjs
so content is built from both stdout and stderr (trimmed) using a simple
filter-and-join strategy to retain diagnostics while keeping stdout first.
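The filter-and-join strategy this prompt describes can be exercised in isolation; `mergeStreams` is an illustrative helper name, not a function from the repo:

```javascript
// Keep the non-empty trimmed streams, stdout first, so stderr diagnostics
// survive into the human-review worksheet instead of being dropped.
function mergeStreams(result) {
  const stdout = result.stdout?.trimEnd() ?? '';
  const stderr = result.stderr?.trimEnd() ?? '';
  return [stdout, stderr].filter(Boolean).join('\n');
}

console.log(mergeStreams({ stdout: 'ok\n', stderr: 'warning: model slow\n' }));
```

When only stderr is present, the filter drops the empty stdout entry and the stderr text is returned alone.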

Comment on lines +198 to +203:

```js
if (arg === '--executor') {
  executorOverride = argv[index + 1];
  index += 1;
  continue;
}
passthrough.push(arg);
```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not swallow non-OpenCode --executor values.

Line 198 currently removes every --executor argument from passthrough, but only opencode is acted on later. Using --executor ricky-cli (or any other value) silently falls back to normal behavior and can mislead run results.

Suggested fix:

```diff
 function parseRickyEvalArgs(argv) {
   const passthrough = [];
   let executorOverride;
   for (let index = 0; index < argv.length; index += 1) {
     const arg = argv[index];
     if (arg === '--executor') {
-      executorOverride = argv[index + 1];
-      index += 1;
-      continue;
+      const value = argv[index + 1];
+      if (value === 'opencode') {
+        executorOverride = value;
+        index += 1;
+        continue;
+      }
+      passthrough.push(arg);
+      if (value !== undefined) {
+        passthrough.push(value);
+        index += 1;
+      }
+      continue;
     }
     passthrough.push(arg);
   }
   return { argv: passthrough, executorOverride };
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/evals/run-ricky-evals.mjs` around lines 198 - 203, The current loop
always removes a `--executor` and its value (setting `executorOverride`) which
swallows non-opencode values; change the logic in the loop that handles `arg ===
'--executor'` so it first peeks `argv[index + 1]` into a local (currentValue),
and if currentValue === 'opencode' set `executorOverride = currentValue`,
increment `index` to skip the value and do not push either token to
`passthrough`; otherwise do not set `executorOverride` and instead push both
`--executor` and currentValue (or just `--executor` if no next value) into
`passthrough` and only increment `index` if you consumed the value for
passthrough, preserving original behavior for non-opencode executors; update
references to `executorOverride`, `passthrough`, `argv`, and `index` in that
block accordingly.
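Run standalone, the parser behavior the review suggests looks like this sketch; the argv values in the demo are illustrative:

```javascript
// Only `--executor opencode` is intercepted; any other executor value is
// passed through untouched so downstream tooling can reject or handle it.
function parseRickyEvalArgs(argv) {
  const passthrough = [];
  let executorOverride;
  for (let index = 0; index < argv.length; index += 1) {
    const arg = argv[index];
    if (arg === '--executor') {
      const value = argv[index + 1];
      if (value === 'opencode') {
        executorOverride = value;
        index += 1; // consume the value; neither token reaches passthrough
        continue;
      }
      passthrough.push(arg);
      if (value !== undefined) {
        passthrough.push(value);
        index += 1;
      }
      continue;
    }
    passthrough.push(arg);
  }
  return { argv: passthrough, executorOverride };
}

console.log(parseRickyEvalArgs(['--suite', 'cli-behavior', '--executor', 'opencode']));
console.log(parseRickyEvalArgs(['--executor', 'ricky-cli']));
```

The first call strips the executor pair and sets the override; the second leaves `--executor ricky-cli` in passthrough with no override set.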

Comment on lines +35 to +39:

```js
return readdirSync(RUNS_DIR)
  .map((dir) => path.join(RUNS_DIR, dir, 'result.json'))
  .filter((file) => existsSync(file))
  .map((file) => JSON.parse(readFileSync(file, 'utf8')))
  .sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle malformed run artifacts without crashing summary output.

On Line 38, JSON.parse(...) is unguarded. One corrupted result.json will terminate evals:summary and hide all other valid history entries.

Proposed fix:

```diff
 function loadRuns() {
   if (!existsSync(RUNS_DIR)) return [];
-  return readdirSync(RUNS_DIR)
-    .map((dir) => path.join(RUNS_DIR, dir, 'result.json'))
-    .filter((file) => existsSync(file))
-    .map((file) => JSON.parse(readFileSync(file, 'utf8')))
-    .sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
+  const files = readdirSync(RUNS_DIR)
+    .map((dir) => path.join(RUNS_DIR, dir, 'result.json'))
+    .filter((file) => existsSync(file));
+
+  const runs = [];
+  for (const file of files) {
+    try {
+      runs.push(JSON.parse(readFileSync(file, 'utf8')));
+    } catch {
+      console.warn(`Skipping invalid eval run artifact: ${path.relative(ROOT, file)}`);
+    }
+  }
+
+  return runs.sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/evals/summarize-ricky-evals.mjs` around lines 35 - 39, The code
currently calls JSON.parse(readFileSync(file, 'utf8')) directly which will throw
on a malformed result.json and stop summarization; update the pipeline that maps
over files (the chain using readdirSync, path.join(...,'result.json'),
existsSync, readFileSync, JSON.parse, and sort) to wrap the read+parse of each
result.json in a try/catch (or a helper like safeParseResult) so that on JSON
parse/read errors you skip that file (optionally console.warn with the filename
and error) and continue returning only successfully parsed entries before
sorting by timestamp.
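A minimal standalone sketch of the guarded-parse pattern proposed above. The `<run-dir>/result.json` layout and the `loadRuns` shape come from the review; the temp-directory setup and sample payloads are illustrative only.

```javascript
// Sketch of the guarded read+parse loop suggested in the review.
// Directory layout (<run>/result.json) mirrors the repo; data is made up.
import { existsSync, mkdirSync, mkdtempSync, readFileSync, readdirSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import path from 'node:path';

function loadRuns(runsDir) {
  if (!existsSync(runsDir)) return [];
  const files = readdirSync(runsDir)
    .map((dir) => path.join(runsDir, dir, 'result.json'))
    .filter((file) => existsSync(file));

  const runs = [];
  for (const file of files) {
    try {
      runs.push(JSON.parse(readFileSync(file, 'utf8')));
    } catch {
      // One corrupted artifact no longer aborts the whole summary.
      console.warn(`Skipping invalid eval run artifact: ${file}`);
    }
  }
  return runs.sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
}

// One valid artifact, one corrupted one: the bad run is skipped, not fatal.
const runsDir = mkdtempSync(path.join(tmpdir(), 'ricky-evals-'));
const samples = [
  ['run-a', '{"timestamp":"2026-05-08T10:00:00Z"}'],
  ['run-b', '{not json'],
];
for (const [name, body] of samples) {
  mkdirSync(path.join(runsDir, name));
  writeFileSync(path.join(runsDir, name, 'result.json'), body);
}
console.log(loadRuns(runsDir).length); // 1
```

The same shape works whether the loop stays inline or moves into a `safeParseResult` helper; the key point is that the parse failure is contained per file.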

@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/local/entrypoint.ts (1)

961-1002: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rebuild assistantTurnContext after the --best-judgement rewrite.

Line 966 captures turn context from the original request, but Lines 989-1002 can replace both spec and metadata before generation/run. That makes assistant_turn_context describe a different request than the one that actually produced the artifact and runtime launch.

Suggested fix
-      const assistantTurnContext = await observeRickyTurnContext(request, logs);
+      let assistantTurnContext: LocalAssistantTurnContextDecision | undefined;
...
       if (shouldApplyBestJudgement(activeRequest, intakeResult.clarificationQuestions)) {
         bestJudgementClarifications = answerClarificationsWithBestJudgement(intakeResult.clarificationQuestions);
         activeRequest = {
           ...activeRequest,
           spec: appendBestJudgementClarificationAnswers(activeRequest.spec, bestJudgementClarifications),
           metadata: {
             ...activeRequest.metadata,
             bestJudgement: true,
             bestJudgementClarifications,
           },
         };
         specDigest = digestSpec(activeRequest.spec);
         logs.push(`[local] --best-judgement answered ${bestJudgementClarifications.length} clarification question(s) as ${BEST_JUDGEMENT_IMPLEMENTER}`);
         warnings.push('--best-judgement resolved blocking clarifications with implementer assumptions; review them in the generated workflow context.');
         warnings.push(...bestJudgementClarifications.map((answer) => `--best-judgement ${answer.question}: ${answer.answer}`));
         intakeResult = intake(toRawSpecPayload(activeRequest));
       }
+      assistantTurnContext = await observeRickyTurnContext(activeRequest, logs);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/local/entrypoint.ts` around lines 961 - 1002, assistantTurnContext is
captured before the code may rewrite activeRequest with best-judgement answers,
so it can be out-of-date for subsequent generation/launch; after you assign
activeRequest with appendBestJudgementClarificationAnswers and update specDigest
(the block that sets bestJudgementClarifications, activeRequest, and
specDigest), recompute assistantTurnContext by re-calling
observeRickyTurnContext(request, logs) or an equivalent helper using the updated
activeRequest (reference symbols: assistantTurnContext, activeRequest,
bestJudgementClarifications, appendBestJudgementClarificationAnswers,
specDigest, observeRickyTurnContext) so the turn context reflects the final
request before proceeding.
🧹 Nitpick comments (1)
src/local/entrypoint.ts (1)

47-47: ⚡ Quick win

Keep best-judgement fallback answers provider-agnostic.

These answers inject impl-primary-codex, reviewer-claude, and validator-claude into the rewritten spec. On the new BYOH/OpenCode path, that can steer generation toward agent identities that the current workspace does not actually have. Prefer neutral role labels or derive them from the active runtime/provider config.

Possible direction
-const BEST_JUDGEMENT_IMPLEMENTER = 'impl-primary-codex';
+const BEST_JUDGEMENT_IMPLEMENTER = 'implementing-agent';
...
-    return `Answered by implementing agent ${BEST_JUDGEMENT_IMPLEMENTER} using --best-judgement: ${BEST_JUDGEMENT_IMPLEMENTER} owns the implementation assumption, reviewer-claude reviews it, and validator-claude performs final validation signoff.`;
+    return `Answered by implementing agent ${BEST_JUDGEMENT_IMPLEMENTER} using --best-judgement: the implementing agent owns the assumption, a reviewer validates it, and a final validator signs off.`;

Also applies to: 1419-1430

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/local/entrypoint.ts` at line 47, Replace the hardcoded
BEST_JUDGEMENT_IMPLEMENTER = 'impl-primary-codex' with a provider-agnostic value
or derive it from the active runtime/provider config; locate
BEST_JUDGEMENT_IMPLEMENTER (and similar hardcoded strings like
'reviewer-claude'/'validator-claude' around the same area) and either use a
neutral label (e.g., 'implementer'/'reviewer'/'validator') or call the
runtime/provider helper (e.g., a function like getActiveAgentName or reading
from runtime.providerConfig) to produce the correct agent name for the current
workspace so rewritten specs do not inject provider-specific identities.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a266a086-87b8-41a3-9cea-e7f9ef961085

📥 Commits

Reviewing files that changed from the base of the PR and between 2254f74 and 1df8588.

📒 Files selected for processing (13)
  • evals/suites/generation-quality/cases.jsonl
  • evals/suites/generation-quality/cases.md
  • scripts/evals/run-ricky-evals.mjs
  • src/local/entrypoint.test.ts
  • src/local/entrypoint.ts
  • src/local/request-normalizer.ts
  • src/product/spec-intake/clarifications.ts
  • src/product/spec-intake/normalizer.ts
  • src/product/spec-intake/parser.test.ts
  • src/surfaces/cli/commands/cli-main.test.ts
  • src/surfaces/cli/commands/cli-main.ts
  • src/surfaces/cli/flows/power-user-parser.test.ts
  • src/surfaces/cli/flows/power-user-parser.ts
✅ Files skipped from review due to trivial changes (1)
  • evals/suites/generation-quality/cases.jsonl
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/evals/run-ricky-evals.mjs

Comment on lines +211 to +214
- generated; run when ready
- Warning: --best-judgement Who owns final rollout signoff?
- Answered by implementing agent impl-primary-codex using --best-judgement
- Workflow: workflows/generated/

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid pinning this eval to a specific implementing-agent id.

impl-primary-codex is not a stable contract and can change with persona/provider selection, so this deterministic check can false-fail valid runs. Checking for the attribution pattern is enough.

Suggested fix
 contentIncludes:
 - generated; run when ready
 - Warning: --best-judgement Who owns final rollout signoff?
-- Answered by implementing agent impl-primary-codex using --best-judgement
+- Answered by implementing agent
+- using --best-judgement
 - Workflow: workflows/generated/

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/generation-quality/cases.md` around lines 211 - 214, The eval is
deterministically checking for the implementing-agent id "impl-primary-codex"
(and usage of "--best-judgement"), which can change and cause false failures;
update the check in the generation-quality case to stop matching the literal
implementing-agent id and instead validate the attribution/output by matching
the expected attribution pattern (e.g., presence of a rollout signoff line or
agent attribution token) and any flags like "--best-judgement" only by pattern,
not by the exact agent id; remove or replace the hardcoded "impl-primary-codex"
assertion and ensure the workflow reference "workflows/generated/" still passes
when the attribution pattern is present.

Comment on lines +240 to +248
function hasExplicitExecutionModeChoice(spec: NormalizedWorkflowSpec): boolean {
const mode = metadataString(spec.providerContext.metadata, 'mode');
const preference =
metadataString(spec.providerContext.metadata, 'executionPreference') ??
metadataString(spec.providerContext.metadata, 'execution_preference');
return [mode, preference].some((value) => (
value !== undefined &&
/^(local|byoh|cloud|hosted|remote|both)$/i.test(value.trim())
));

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat auto as an explicit execution choice here.

explicitExecutionPreference() already treats metadata executionPreference: 'auto' as user intent, but this whitelist does not. That leaves an explicitly chosen mixed-mode request vulnerable to the execution-mode-conflict blocker.

Suggested fix
   return [mode, preference].some((value) => (
     value !== undefined &&
-    /^(local|byoh|cloud|hosted|remote|both)$/i.test(value.trim())
+    /^(local|byoh|cloud|hosted|remote|both|auto)$/i.test(value.trim())
   ));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/product/spec-intake/clarifications.ts` around lines 240 - 248, The
hasExplicitExecutionModeChoice function is missing 'auto' in its allowed values,
causing explicit executionPreference: 'auto' to be treated as non-explicit;
update the whitelist check inside hasExplicitExecutionModeChoice (the regex
currently testing /^(local|byoh|cloud|hosted|remote|both)$/i) to include 'auto'
(e.g. add |auto) so that metadataString(...) values of 'auto' are recognized as
an explicit choice consistent with explicitExecutionPreference().
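The widened whitelist can be sketched in isolation. The regex is taken from the review's suggested fix; the surrounding `metadataString`/spec plumbing is not reproduced here, so `isExplicit` is an illustrative stand-in for the check inside `hasExplicitExecutionModeChoice`.

```javascript
// Illustrative version of the whitelist check with 'auto' included,
// matching the trim + case-insensitive behavior of the original.
const EXPLICIT_MODE = /^(local|byoh|cloud|hosted|remote|both|auto)$/i;

const isExplicit = (value) =>
  value !== undefined && EXPLICIT_MODE.test(value.trim());

console.log(isExplicit('auto'));    // true with the fix; false before it
console.log(isExplicit(' Cloud ')); // true (trimmed, case-insensitive)
console.log(isExplicit('maybe'));   // false
```

With this change, `executionPreference: 'auto'` counts as an explicit choice, consistent with `explicitExecutionPreference()`.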

Comment on lines +133 to +156
function executionPreferenceFromClarificationAnswers(description: string): ExecutionPreference | undefined {
let inAnswerSection = false;
for (const rawLine of description.split(/\r?\n/)) {
const line = rawLine.trim().replace(/^[-*+]\s+/, '');
if (!line) {
inAnswerSection = false;
continue;
}
if (/^(#{1,6}\s*)?(clarification answers?|resolved clarifications?)\s*:?\s*$/i.test(line)) {
inAnswerSection = true;
continue;
}
if (/^(#{1,6}\s*)?[A-Z][\w\s/-]{2,80}:$/.test(line)) {
inAnswerSection = false;
}
if (!inAnswerSection || !/should this workflow run locally\/byoh, in cloud, or generate artifacts for both paths\?/i.test(line)) {
continue;
}
const answer = line.split(/:\s*/).slice(1).join(': ').toLowerCase();
if (/\b(local|locally|byoh|on this machine)\b/.test(answer)) return 'local';
if (/\b(cloud|hosted|remote)\b/.test(answer) && !/\bboth\b/.test(answer)) return 'cloud';
if (/\b(both|auto|both paths)\b/.test(answer)) return 'auto';
}
return undefined;

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep clarification-answer parsing aligned with the clarification scanner.

This helper only strips unordered bullets. A numbered or checkbox answer like 1. Should this workflow run locally/BYOH...: locally can be recognized as answered elsewhere, but still be missed here, leaving executionPreference as auto.

Suggested fix
   for (const rawLine of description.split(/\r?\n/)) {
-    const line = rawLine.trim().replace(/^[-*+]\s+/, '');
+    const line = rawLine
+      .trim()
+      .replace(/^[-*+]\s+/, '')
+      .replace(/^\d+[.)]\s+/, '')
+      .replace(/^\[[ xX]\]\s+/, '');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/product/spec-intake/normalizer.ts` around lines 133 - 156, The parsing in
executionPreferenceFromClarificationAnswers only strips unordered bullets and
thus misses numbered lists and checkbox-style lines (e.g., "1. ..." or "- [ ]
..."), causing answers to be overlooked; update the line normalization step to
also remove ordered list markers and checkbox tokens before further tests
(handle patterns like /^\s*\d+\.\s+/, /^\s*[-*+]\s*\[[ xX]\]\s+/, and combined
forms), keeping existing trimming and unordered-bullet removal behavior so the
subsequent header detection and answer extraction logic still works unchanged.
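The broadened line normalization can be sketched on its own. The three replace patterns come from the review's suggested fix; the sample lines are illustrative, and the rest of the answer-section scanning logic is assumed to stay unchanged.

```javascript
// Sketch of the list-marker stripping suggested in the review:
// unordered bullets, numbered lists, and checkbox tokens are all removed
// before the line is matched against the clarification question.
const stripListMarkers = (rawLine) =>
  rawLine
    .trim()
    .replace(/^[-*+]\s+/, '')      // unordered bullets: "- ", "* ", "+ "
    .replace(/^\d+[.)]\s+/, '')    // ordered lists: "1. " or "1) "
    .replace(/^\[[ xX]\]\s+/, ''); // checkboxes: "[ ] ", "[x] " (after a bullet)

console.log(stripListMarkers('- [x] Run locally'));
console.log(stripListMarkers('1. Should this workflow run locally/BYOH...: locally'));
```

Because the bullet is stripped first, a combined form like `- [ ] ...` reduces to the bare answer text in two passes, so the header-detection and answer-extraction steps that follow see the same shape regardless of list style.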

@khaliqgant khaliqgant merged commit 538e74b into main May 8, 2026
1 check passed
@khaliqgant khaliqgant deleted the codex/ricky-evals-sweep branch May 8, 2026 15:41