
test(evals): add Ricky doc-derived eval suite #74

Merged
khaliqgant merged 6 commits into main from codex/ricky-evals-sweep
May 8, 2026

Conversation


@khaliqgant khaliqgant commented May 8, 2026

Summary

  • add Ricky eval scripts backed by @agent-assistant/telemetry human eval helpers
  • add 44 doc-derived eval cases across CLI, workflow authoring, runtime recovery, surfaces/ingress, generation quality, and Agent Assistant boundary suites
  • add a simple local OpenCode provider path (npm run evals:opencode) for running manual cases via opencode run -m <model> <prompt> without OpenRouter credentials
  • update @agent-assistant/turn-context to ^0.4.31 and add @agent-assistant/telemetry as a dev dependency

Validation

  • npm run evals:compile
  • npm run evals (passed=2 needs-human=42 failed=0)
  • npm run evals -- --suite cli-behavior (passed=2 needs-human=4 failed=0)
  • npm run evals:opencode -- --list --suite workflow-authoring
  • opencode models opencode | rg "^opencode/minimax-m2.5-free$|^opencode/nemotron-3-super-free$"
  • npm run typecheck

Notes


coderabbitai Bot commented May 8, 2026

Walkthrough

This PR adds a comprehensive human-authored evaluation framework and tooling for Ricky: suite case definitions and rubrics, generated JSONL test artifacts, Node.js CLI scripts to compile/run/summarize/compare evals (supporting manual, OpenCode, and ricky-cli executors), new npm scripts and dependency bumps, and CLI/local best-judgement flag propagation with spec-intake/normalizer updates and tests.

Changes

Ricky Evaluation Suite Framework

Layer / File(s) Summary
Evaluation System Documentation
evals/README.md, evals/fixtures/transcripts/.gitkeep
Adds top-level eval documentation: case formats (manual with ### Message/### Must/### Must Not, deterministic ricky-cli with ### Mock argv), compilation/run flow, OpenCode executor guidance, run-history location (.ricky/) and a source map linking suites to repo specs.
Agent Assistant Boundary Suite
evals/suites/agent-assistant-boundary/cases.md, cases.jsonl, rubric.md
Adds five human-review-focused cases enforcing Agent Assistant adoption boundaries: real import/runtime grounding, preserving Ricky turn-envelope metadata, keeping product-core wording/ownership, single-slice incremental adoption, and future-surface design constraints.
CLI Behavior Suite
evals/suites/cli-behavior/cases.md, cases.jsonl, rubric.md
Adds deterministic CLI regression/capability cases: --help surface completeness, version formatting, generation default (no --run) semantics, compact first-run onboarding, recovery guidance without stack traces, and honest ricky status provider reporting.
Generation Quality Suite
evals/suites/generation-quality/cases.md, cases.jsonl, rubric.md
Adds generation-quality cases (including unanswered-question vs --best-judgement behaviors and --mode local override): skill/tool provenance, honoring spec tool/model hints with audit metadata, opt-in bounded refinement and re-validation, behavior-grounded acceptance gates, pattern selection discipline, and proof/review requirements for generated artifacts.
Runtime Recovery Suite
evals/suites/runtime-recovery/cases.md, cases.jsonl, rubric.md
Adds runtime-recovery cases: classify-before-retry, stale local runtime-state handling, run-marker conflict reporting, bounded auto-fix with resumability, single-attempt behavior when auto-fix disabled, in-process Node/SDK execution preference, escalation with preserved evidence, and analytics from structured WorkflowRunEvidence.
Surfaces and Ingress Suite
evals/suites/surfaces-ingress/cases.md, cases.jsonl, rubric.md
Adds multi-surface integration cases: Slack parity and normalization, web handoff normalization, MCP/Claude context-as-metadata, Cloud API versioning and JSON stdout contract, Linear readiness fail-fast and PR-link completion, and strict CLI Cloud onboarding guidance.
Workflow Authoring Suite
evals/suites/workflow-authoring/cases.md, cases.jsonl, rubric.md
Adds human-review-heavy workflow-authoring cases: deterministic verification gates, distinct writer/reviewer separation, no-silent Cloud↔Local fallback, Agent Assistant boundary reuse, evidence-trail preservation, wave folder placement/naming standards, required runtime wrapper shape and error handling, env loading fail-fast, GitHub PR primitives, and dry-run/structural validation requirements.
Eval Compilation and Execution
scripts/evals/compile-ricky-evals.mjs, scripts/evals/run-ricky-evals.mjs, scripts/evals/summarize-ricky-evals.mjs, scripts/evals/compare-ricky-evals.mjs
Adds four CLI scripts: compile markdown case specs to JSONL; run evals via pluggable executors (manual, OpenCode, ricky-cli); summarize recent runs; compare the two most recent runs and report per-test transitions.
CLI & Best-Judgement Integration
src/surfaces/cli/..., src/local/request-normalizer.ts, src/local/entrypoint.ts, src/product/spec-intake/..., tests (src/**/test.ts)
Adds --best-judgement CLI flag and threads it into handoffs; extends BaseHandoff/LocalInvocationRequest with bestJudgement?; local entrypoint can synthesize best-judgement answers, append them to the spec, re-run spec intake, and include best_judgement_clarifications in generation decisions; spec-intake normalizer now recognizes explicit execution-mode metadata and answered clarifications to suppress execution-mode conflict questions; tests added/updated to cover parsing, help text, propagation, and behavior.
Dependencies and NPM Scripts
package.json
Adds npm scripts: evals:compile, evals, evals:opencode, evals:list, evals:summary, evals:compare; bumps @agent-assistant/turn-context to ^0.4.31; adds @agent-assistant/telemetry ^0.4.31.
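The manual case layout named in the README summary (### Message / ### Must / ### Must Not) can be sketched like this; the case content below is illustrative, not taken from the repo:

```markdown
### Message
Ask Ricky to explain what `ricky --help` should print.

### Must
- Mention every top-level command surface.

### Must Not
- Include a stack trace or internal error dump.
```

Deterministic ricky-cli cases follow the same shape but add a `### Mock argv` section, per the README description.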

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 Hop, hop — I read the suites with glee,
Markdown seeds sprout JSON for all to see,
Run scripts hum, compare shows the tale,
CLI flags threaded, tests on the trail,
Evidence trails help the product be.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 6.98%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main addition: a new Ricky documentation-derived evaluation suite with test cases.
  • Description check: ✅ Passed. The description is directly related to the changeset, detailing the eval scripts, 44 test cases, OpenCode provider support, dependency updates, and validation steps performed.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@devin-ai-integration devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Open in Devin Review

Comment thread: scripts/evals/compare-ricky-evals.mjs (Outdated)
Comment on lines +50 to +51:

```js
disappeared += 1;
regressed += 1;
```


🟡 Compare summary double-counts disappeared tests in 'Regressed' total

In compare-ricky-evals.mjs, when a test disappears between runs, both regressed and disappeared are incremented (lines 50-51). The summary on line 56 then displays both as separate fields: Improved: X | Regressed: Y | Unchanged: Z | Disappeared: W. Since Y already includes W, the totals are misleading. For example, if 3 tests are unchanged and 1 disappeared, the output would be Regressed: 1 | Unchanged: 3 | Disappeared: 1 — a reader naturally interprets this as 5 total outcomes when there are only 4 unique tests. The Regressed count double-reports the disappeared test.

Suggested change:

```diff
-disappeared += 1;
-regressed += 1;
+disappeared += 1;
```
Open in Devin Review
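The mutually exclusive bucket accounting this finding asks for can be sketched as a small standalone function; the statuses and test ids below are illustrative, not the real run schema:

```javascript
// Each test lands in exactly one bucket, so the four counts always sum to
// the number of unique tests in the previous run — no double-counting.
function compareRuns(previous, current) {
  let improved = 0, regressed = 0, unchanged = 0, disappeared = 0;
  for (const [id, prevStatus] of Object.entries(previous)) {
    const currStatus = current[id];
    if (currStatus === undefined) {
      disappeared += 1; // not also counted as regressed
      continue;
    }
    if (prevStatus === 'failed' && currStatus === 'passed') improved += 1;
    else if (prevStatus === 'passed' && currStatus === 'failed') regressed += 1;
    else unchanged += 1;
  }
  return { improved, regressed, unchanged, disappeared };
}

console.log(compareRuns(
  { a: 'passed', b: 'passed', c: 'failed', d: 'passed' },
  { a: 'passed', b: 'failed', c: 'passed' }
));
```

With four unique tests, each bucket here reports exactly one test, and the summary line totals match the input size.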


Base automatically changed from chore/agents-md-pr-primitive to main May 8, 2026 15:05
@khaliqgant
Member Author

Addressed the Devin review feedback in 2254f74:

  • scripts/evals/compare-ricky-evals.mjs no longer increments regressed for disappeared cases.
  • Disappeared cases remain reported in the dedicated Disappeared bucket, so summary counts are no longer double-counted.

Validated with:

  • npm run evals:compare
  • npm run evals:compile
  • npm run typecheck


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (9)
evals/suites/generation-quality/cases.md (1)

159-159: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 - Skip skill-aware workflow authoring guidance for serious workflows.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/generation-quality/cases.md` at line 159, The file
evals/suites/generation-quality/cases.md is missing a trailing newline; update
the file so it ends with a single newline character (POSIX-compliant EOF
newline) by adding a newline at the end of cases.md and saving the file so
version control shows the change.
evals/suites/workflow-authoring/rubric.md (1)

22-22: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the output is specific enough to execute and review, protects
 Ricky's local execution contract, and leaves a durable evidence trail.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/workflow-authoring/rubric.md` at line 22, The file rubric.md is
missing a trailing newline; open the Markdown file (rubric.md) and ensure the
very last character is a newline character (add an empty line at EOF) so the
file ends with a single trailing newline for POSIX compliance and cleaner diffs.
evals/suites/cli-behavior/rubric.md (1)

11-11: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 plan or generated workflow.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/cli-behavior/rubric.md` at line 11, The file rubric.md is
missing a trailing newline; open evals/suites/cli-behavior/rubric.md (the
rubric.md document) and add a single newline character at the end of the file so
the file ends with a trailing newline for POSIX compliance and cleaner diffs.
evals/suites/runtime-recovery/rubric.md (2)

8-11: 💤 Low value

Consider varying question structure for better readability.

Questions 1-3 all begin with "Did", which slightly reduces readability. While the content is clear, varying the sentence structure can improve flow.

✍️ Optional rewording suggestion
 1. Did Ricky classify before retrying or repairing?
-2. Did the answer preserve exact evidence and uncertainty?
-3. Did it separate environment blockers from product or workflow failures?
+2. Was exact evidence and uncertainty preserved in the answer?
+3. Were environment blockers separated from product or workflow failures?
 4. Were repair attempts bounded, resumable, and artifact-scoped?
 5. Would an operator know the next safe action?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/runtime-recovery/rubric.md` around lines 8 - 11, Reword the
first three checklist items to vary sentence structure while preserving meaning:
replace "Did Ricky classify before retrying or repairing?" (item 1), "Did the
answer preserve exact evidence and uncertainty?" (item 2), and "Did it separate
environment blockers from product or workflow failures?" (item 3) with
alternative phrasings that avoid starting each with "Did" (e.g., convert to
passive/questions like "Was Ricky's classification performed before retrying or
repairing?", "Does the answer preserve exact evidence and uncertainty?", "Are
environment blockers clearly separated from product or workflow failures?"),
keeping the original intent and specificity intact.

18-18: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the response is evidence-backed, bounded, and honest about what
 was fixed, retried, skipped, or escalated.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/runtime-recovery/rubric.md` at line 18, The file rubric.md is
missing a trailing newline at EOF; open rubric.md and add a single newline
character at the end of the file (ensure the file ends with '\n') so the file
ends with a newline for POSIX compliance and cleaner diffs.
evals/suites/generation-quality/rubric.md (1)

18-18: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the generated workflow is reviewable, auditable, and has proof
 steps tied to the requested behavior.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/generation-quality/rubric.md` at line 18, The file rubric.md is
missing a trailing newline; open rubric.md and add a single newline character at
the end of the file so it ends with a POSIX-compliant newline (ensure the final
line break is committed).
evals/suites/agent-assistant-boundary/rubric.md (1)

18-18: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 Pass only when the boundary is honest, specific, and grounded in actual Ricky
 runtime behavior.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/agent-assistant-boundary/rubric.md` at line 18, Add a single
trailing newline character at the end of the markdown file rubric.md so the file
ends with a POSIX-compliant newline; open rubric.md (the
agent-assistant-boundary/rubric.md content) and ensure the last character is a
newline, then save the file so version control shows the newline-terminated
file.
evals/suites/cli-behavior/cases.md (1)

152-152: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 - Show empty fields with no recovery guidance when config is missing.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/cli-behavior/cases.md` at line 152, The file cases.md in the
cli-behavior suite is missing a trailing newline; open the Markdown file
(cases.md) and ensure the file ends with a single '\n' character (add an empty
final line), then save and commit so the file is POSIX-compliant and diffs are
clean.
evals/suites/agent-assistant-boundary/cases.md (1)

115-115: ⚡ Quick win

Add trailing newline at end of file.

Markdown files should end with a newline character for POSIX compliance and better version control diffs.

📝 Proposed fix
 - Duplicate a mature Agent Assistant capability locally without justification.
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/agent-assistant-boundary/cases.md` at line 115, The file
cases.md is missing a trailing newline; open cases.md and add a single newline
character at the end of the file so the file ends with a newline
(POSIX-compliant), then save and commit the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 11671f7e-8900-41a2-b97b-d6052be5f6e7

📥 Commits

Reviewing files that changed from the base of the PR and between 7e163c5 and 2254f74.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (25)
  • evals/README.md
  • evals/fixtures/transcripts/.gitkeep
  • evals/suites/agent-assistant-boundary/cases.jsonl
  • evals/suites/agent-assistant-boundary/cases.md
  • evals/suites/agent-assistant-boundary/rubric.md
  • evals/suites/cli-behavior/cases.jsonl
  • evals/suites/cli-behavior/cases.md
  • evals/suites/cli-behavior/rubric.md
  • evals/suites/generation-quality/cases.jsonl
  • evals/suites/generation-quality/cases.md
  • evals/suites/generation-quality/rubric.md
  • evals/suites/runtime-recovery/cases.jsonl
  • evals/suites/runtime-recovery/cases.md
  • evals/suites/runtime-recovery/rubric.md
  • evals/suites/surfaces-ingress/cases.jsonl
  • evals/suites/surfaces-ingress/cases.md
  • evals/suites/surfaces-ingress/rubric.md
  • evals/suites/workflow-authoring/cases.jsonl
  • evals/suites/workflow-authoring/cases.md
  • evals/suites/workflow-authoring/rubric.md
  • package.json
  • scripts/evals/compare-ricky-evals.mjs
  • scripts/evals/compile-ricky-evals.mjs
  • scripts/evals/run-ricky-evals.mjs
  • scripts/evals/summarize-ricky-evals.mjs

Comment on lines +83 to +86:

```js
const stdout = result.stdout?.trimEnd() ?? '';
const stderr = result.stderr?.trimEnd() ?? '';
const content = stdout || stderr || '';
```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve stderr alongside stdout in OpenCode results.

Line 85 picks stdout first and drops stderr when both exist, which hides useful diagnostics in human review worksheets.

Suggested fix:

```diff
   const stdout = result.stdout?.trimEnd() ?? '';
   const stderr = result.stderr?.trimEnd() ?? '';
-  const content = stdout || stderr || '';
+  const content = [stdout, stderr].filter(Boolean).join('\n');
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/evals/run-ricky-evals.mjs` around lines 83 - 86, The current
assignment for content drops stderr when stdout exists; change it so both
trimmed stdout and stderr are preserved by combining them (e.g., keeping
non-empty parts and joining with a separator) instead of choosing one or the
other; update the variables stdout, stderr, and content in run-ricky-evals.mjs
so content is built from both stdout and stderr (trimmed) using a simple
filter-and-join strategy to retain diagnostics while keeping stdout first.
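The filter-and-join strategy this prompt describes can be exercised in isolation; `mergeStreams` is an illustrative helper name, not a function from the repo:

```javascript
// Keep the non-empty trimmed streams, stdout first, so stderr diagnostics
// survive into the human-review worksheet instead of being dropped.
function mergeStreams(result) {
  const stdout = result.stdout?.trimEnd() ?? '';
  const stderr = result.stderr?.trimEnd() ?? '';
  return [stdout, stderr].filter(Boolean).join('\n');
}

console.log(mergeStreams({ stdout: 'ok\n', stderr: 'warning: model slow\n' }));
```

When only stderr is present, the filter drops the empty stdout entry and the stderr text is returned alone.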

Comment on lines +198 to +203:

```js
if (arg === '--executor') {
  executorOverride = argv[index + 1];
  index += 1;
  continue;
}
passthrough.push(arg);
```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not swallow non-OpenCode --executor values.

Line 198 currently removes every --executor argument from passthrough, but only opencode is acted on later. Using --executor ricky-cli (or any other value) silently falls back to normal behavior and can mislead run results.

Suggested fix:

```diff
 function parseRickyEvalArgs(argv) {
   const passthrough = [];
   let executorOverride;
   for (let index = 0; index < argv.length; index += 1) {
     const arg = argv[index];
     if (arg === '--executor') {
-      executorOverride = argv[index + 1];
-      index += 1;
-      continue;
+      const value = argv[index + 1];
+      if (value === 'opencode') {
+        executorOverride = value;
+        index += 1;
+        continue;
+      }
+      passthrough.push(arg);
+      if (value !== undefined) {
+        passthrough.push(value);
+        index += 1;
+      }
+      continue;
     }
     passthrough.push(arg);
   }
   return { argv: passthrough, executorOverride };
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/evals/run-ricky-evals.mjs` around lines 198 - 203, The current loop
always removes a `--executor` and its value (setting `executorOverride`) which
swallows non-opencode values; change the logic in the loop that handles `arg ===
'--executor'` so it first peeks `argv[index + 1]` into a local (currentValue),
and if currentValue === 'opencode' set `executorOverride = currentValue`,
increment `index` to skip the value and do not push either token to
`passthrough`; otherwise do not set `executorOverride` and instead push both
`--executor` and currentValue (or just `--executor` if no next value) into
`passthrough` and only increment `index` if you consumed the value for
passthrough, preserving original behavior for non-opencode executors; update
references to `executorOverride`, `passthrough`, `argv`, and `index` in that
block accordingly.
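Run standalone, the parser behavior the review suggests looks like this sketch; the argv values in the demo are illustrative:

```javascript
// Only `--executor opencode` is intercepted; any other executor value is
// passed through untouched so downstream tooling can reject or handle it.
function parseRickyEvalArgs(argv) {
  const passthrough = [];
  let executorOverride;
  for (let index = 0; index < argv.length; index += 1) {
    const arg = argv[index];
    if (arg === '--executor') {
      const value = argv[index + 1];
      if (value === 'opencode') {
        executorOverride = value;
        index += 1; // consume the value; neither token reaches passthrough
        continue;
      }
      passthrough.push(arg);
      if (value !== undefined) {
        passthrough.push(value);
        index += 1;
      }
      continue;
    }
    passthrough.push(arg);
  }
  return { argv: passthrough, executorOverride };
}

console.log(parseRickyEvalArgs(['--suite', 'cli-behavior', '--executor', 'opencode']));
console.log(parseRickyEvalArgs(['--executor', 'ricky-cli']));
```

The first call strips the executor pair and sets the override; the second leaves `--executor ricky-cli` in passthrough with no override set.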

Comment on lines +35 to +39:

```js
return readdirSync(RUNS_DIR)
  .map((dir) => path.join(RUNS_DIR, dir, 'result.json'))
  .filter((file) => existsSync(file))
  .map((file) => JSON.parse(readFileSync(file, 'utf8')))
  .sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle malformed run artifacts without crashing summary output.

On Line 38, JSON.parse(...) is unguarded. One corrupted result.json will terminate evals:summary and hide all other valid history entries.

Proposed fix:

```diff
 function loadRuns() {
   if (!existsSync(RUNS_DIR)) return [];
-  return readdirSync(RUNS_DIR)
-    .map((dir) => path.join(RUNS_DIR, dir, 'result.json'))
-    .filter((file) => existsSync(file))
-    .map((file) => JSON.parse(readFileSync(file, 'utf8')))
-    .sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
+  const files = readdirSync(RUNS_DIR)
+    .map((dir) => path.join(RUNS_DIR, dir, 'result.json'))
+    .filter((file) => existsSync(file));
+
+  const runs = [];
+  for (const file of files) {
+    try {
+      runs.push(JSON.parse(readFileSync(file, 'utf8')));
+    } catch {
+      console.warn(`Skipping invalid eval run artifact: ${path.relative(ROOT, file)}`);
+    }
+  }
+
+  return runs.sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/evals/summarize-ricky-evals.mjs` around lines 35 - 39, The code
currently calls JSON.parse(readFileSync(file, 'utf8')) directly which will throw
on a malformed result.json and stop summarization; update the pipeline that maps
over files (the chain using readdirSync, path.join(...,'result.json'),
existsSync, readFileSync, JSON.parse, and sort) to wrap the read+parse of each
result.json in a try/catch (or a helper like safeParseResult) so that on JSON
parse/read errors you skip that file (optionally console.warn with the filename
and error) and continue returning only successfully parsed entries before
sorting by timestamp.
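A minimal standalone sketch of the guarded-parse pattern proposed above. The `<run-dir>/result.json` layout and the `loadRuns` shape come from the review; the temp-directory setup and sample payloads are illustrative only.

```javascript
// Sketch of the guarded read+parse loop suggested in the review.
// Directory layout (<run>/result.json) mirrors the repo; data is made up.
import { existsSync, mkdirSync, mkdtempSync, readFileSync, readdirSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import path from 'node:path';

function loadRuns(runsDir) {
  if (!existsSync(runsDir)) return [];
  const files = readdirSync(runsDir)
    .map((dir) => path.join(runsDir, dir, 'result.json'))
    .filter((file) => existsSync(file));

  const runs = [];
  for (const file of files) {
    try {
      runs.push(JSON.parse(readFileSync(file, 'utf8')));
    } catch {
      // One corrupted artifact no longer aborts the whole summary.
      console.warn(`Skipping invalid eval run artifact: ${file}`);
    }
  }
  return runs.sort((a, b) => String(b.timestamp).localeCompare(String(a.timestamp)));
}

// One valid artifact, one corrupted one: the bad run is skipped, not fatal.
const runsDir = mkdtempSync(path.join(tmpdir(), 'ricky-evals-'));
const samples = [
  ['run-a', '{"timestamp":"2026-05-08T10:00:00Z"}'],
  ['run-b', '{not json'],
];
for (const [name, body] of samples) {
  mkdirSync(path.join(runsDir, name));
  writeFileSync(path.join(runsDir, name, 'result.json'), body);
}
console.log(loadRuns(runsDir).length); // 1
```

The same shape works whether the loop stays inline or moves into a `safeParseResult` helper; the key point is that the parse failure is contained per file.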

@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/local/entrypoint.ts (1)

961-1002: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Rebuild assistantTurnContext after the --best-judgement rewrite.

Line 966 captures turn context from the original request, but Lines 989-1002 can replace both spec and metadata before generation/run. That makes assistant_turn_context describe a different request than the one that actually produced the artifact and runtime launch.

Suggested fix
-      const assistantTurnContext = await observeRickyTurnContext(request, logs);
+      let assistantTurnContext: LocalAssistantTurnContextDecision | undefined;
...
       if (shouldApplyBestJudgement(activeRequest, intakeResult.clarificationQuestions)) {
         bestJudgementClarifications = answerClarificationsWithBestJudgement(intakeResult.clarificationQuestions);
         activeRequest = {
           ...activeRequest,
           spec: appendBestJudgementClarificationAnswers(activeRequest.spec, bestJudgementClarifications),
           metadata: {
             ...activeRequest.metadata,
             bestJudgement: true,
             bestJudgementClarifications,
           },
         };
         specDigest = digestSpec(activeRequest.spec);
         logs.push(`[local] --best-judgement answered ${bestJudgementClarifications.length} clarification question(s) as ${BEST_JUDGEMENT_IMPLEMENTER}`);
         warnings.push('--best-judgement resolved blocking clarifications with implementer assumptions; review them in the generated workflow context.');
         warnings.push(...bestJudgementClarifications.map((answer) => `--best-judgement ${answer.question}: ${answer.answer}`));
         intakeResult = intake(toRawSpecPayload(activeRequest));
       }
+      assistantTurnContext = await observeRickyTurnContext(activeRequest, logs);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/local/entrypoint.ts` around lines 961 - 1002, assistantTurnContext is
captured before the code may rewrite activeRequest with best-judgement answers,
so it can be out-of-date for subsequent generation/launch; after you assign
activeRequest with appendBestJudgementClarificationAnswers and update specDigest
(the block that sets bestJudgementClarifications, activeRequest, and
specDigest), recompute assistantTurnContext by re-calling
observeRickyTurnContext(request, logs) or an equivalent helper using the updated
activeRequest (reference symbols: assistantTurnContext, activeRequest,
bestJudgementClarifications, appendBestJudgementClarificationAnswers,
specDigest, observeRickyTurnContext) so the turn context reflects the final
request before proceeding.
🧹 Nitpick comments (1)
src/local/entrypoint.ts (1)

47-47: ⚡ Quick win

Keep best-judgement fallback answers provider-agnostic.

These answers inject impl-primary-codex, reviewer-claude, and validator-claude into the rewritten spec. On the new BYOH/OpenCode path, that can steer generation toward agent identities that the current workspace does not actually have. Prefer neutral role labels or derive them from the active runtime/provider config.

Possible direction
-const BEST_JUDGEMENT_IMPLEMENTER = 'impl-primary-codex';
+const BEST_JUDGEMENT_IMPLEMENTER = 'implementing-agent';
...
-    return `Answered by implementing agent ${BEST_JUDGEMENT_IMPLEMENTER} using --best-judgement: ${BEST_JUDGEMENT_IMPLEMENTER} owns the implementation assumption, reviewer-claude reviews it, and validator-claude performs final validation signoff.`;
+    return `Answered by implementing agent ${BEST_JUDGEMENT_IMPLEMENTER} using --best-judgement: the implementing agent owns the assumption, a reviewer validates it, and a final validator signs off.`;

Also applies to: 1419-1430

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/local/entrypoint.ts` at line 47, Replace the hardcoded
BEST_JUDGEMENT_IMPLEMENTER = 'impl-primary-codex' with a provider-agnostic value
or derive it from the active runtime/provider config; locate
BEST_JUDGEMENT_IMPLEMENTER (and similar hardcoded strings like
'reviewer-claude'/'validator-claude' around the same area) and either use a
neutral label (e.g., 'implementer'/'reviewer'/'validator') or call the
runtime/provider helper (e.g., a function like getActiveAgentName or reading
from runtime.providerConfig) to produce the correct agent name for the current
workspace so rewritten specs do not inject provider-specific identities.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a266a086-87b8-41a3-9cea-e7f9ef961085

📥 Commits

Reviewing files that changed from the base of the PR and between 2254f74 and 1df8588.

📒 Files selected for processing (13)
  • evals/suites/generation-quality/cases.jsonl
  • evals/suites/generation-quality/cases.md
  • scripts/evals/run-ricky-evals.mjs
  • src/local/entrypoint.test.ts
  • src/local/entrypoint.ts
  • src/local/request-normalizer.ts
  • src/product/spec-intake/clarifications.ts
  • src/product/spec-intake/normalizer.ts
  • src/product/spec-intake/parser.test.ts
  • src/surfaces/cli/commands/cli-main.test.ts
  • src/surfaces/cli/commands/cli-main.ts
  • src/surfaces/cli/flows/power-user-parser.test.ts
  • src/surfaces/cli/flows/power-user-parser.ts
✅ Files skipped from review due to trivial changes (1)
  • evals/suites/generation-quality/cases.jsonl
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/evals/run-ricky-evals.mjs

Comment on lines +211 to +214
- generated; run when ready
- Warning: --best-judgement Who owns final rollout signoff?
- Answered by implementing agent impl-primary-codex using --best-judgement
- Workflow: workflows/generated/

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid pinning this eval to a specific implementing-agent id.

impl-primary-codex is not a stable contract and can change with persona/provider selection, so this deterministic check can false-fail valid runs. Checking for the attribution pattern is enough.

Suggested fix
 contentIncludes:
 - generated; run when ready
 - Warning: --best-judgement Who owns final rollout signoff?
-- Answered by implementing agent impl-primary-codex using --best-judgement
+- Answered by implementing agent
+- using --best-judgement
 - Workflow: workflows/generated/

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/suites/generation-quality/cases.md` around lines 211 - 214, The eval is
deterministically checking for the implementing-agent id "impl-primary-codex"
(and usage of "--best-judgement"), which can change and cause false failures;
update the check in the generation-quality case to stop matching the literal
implementing-agent id and instead validate the attribution/output by matching
the expected attribution pattern (e.g., presence of a rollout signoff line or
agent attribution token) and any flags like "--best-judgement" only by pattern,
not by the exact agent id; remove or replace the hardcoded "impl-primary-codex"
assertion and ensure the workflow reference "workflows/generated/" still passes
when the attribution pattern is present.

Comment on lines +240 to +248
function hasExplicitExecutionModeChoice(spec: NormalizedWorkflowSpec): boolean {
const mode = metadataString(spec.providerContext.metadata, 'mode');
const preference =
metadataString(spec.providerContext.metadata, 'executionPreference') ??
metadataString(spec.providerContext.metadata, 'execution_preference');
return [mode, preference].some((value) => (
value !== undefined &&
/^(local|byoh|cloud|hosted|remote|both)$/i.test(value.trim())
));

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat auto as an explicit execution choice here.

explicitExecutionPreference() already treats metadata executionPreference: 'auto' as user intent, but this whitelist does not. That leaves an explicitly chosen mixed-mode request vulnerable to the execution-mode-conflict blocker.

Suggested fix
   return [mode, preference].some((value) => (
     value !== undefined &&
-    /^(local|byoh|cloud|hosted|remote|both)$/i.test(value.trim())
+    /^(local|byoh|cloud|hosted|remote|both|auto)$/i.test(value.trim())
   ));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/product/spec-intake/clarifications.ts` around lines 240 - 248, The
hasExplicitExecutionModeChoice function is missing 'auto' in its allowed values,
causing explicit executionPreference: 'auto' to be treated as non-explicit;
update the whitelist check inside hasExplicitExecutionModeChoice (the regex
currently testing /^(local|byoh|cloud|hosted|remote|both)$/i) to include 'auto'
(e.g. add |auto) so that metadataString(...) values of 'auto' are recognized as
an explicit choice consistent with explicitExecutionPreference().
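The widened whitelist can be sketched in isolation. The regex is taken from the review's suggested fix; the surrounding `metadataString`/spec plumbing is not reproduced here, so `isExplicit` is an illustrative stand-in for the check inside `hasExplicitExecutionModeChoice`.

```javascript
// Illustrative version of the whitelist check with 'auto' included,
// matching the trim + case-insensitive behavior of the original.
const EXPLICIT_MODE = /^(local|byoh|cloud|hosted|remote|both|auto)$/i;

const isExplicit = (value) =>
  value !== undefined && EXPLICIT_MODE.test(value.trim());

console.log(isExplicit('auto'));    // true with the fix; false before it
console.log(isExplicit(' Cloud ')); // true (trimmed, case-insensitive)
console.log(isExplicit('maybe'));   // false
```

With this change, `executionPreference: 'auto'` counts as an explicit choice, consistent with `explicitExecutionPreference()`.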

Comment on lines +133 to +156
function executionPreferenceFromClarificationAnswers(description: string): ExecutionPreference | undefined {
let inAnswerSection = false;
for (const rawLine of description.split(/\r?\n/)) {
const line = rawLine.trim().replace(/^[-*+]\s+/, '');
if (!line) {
inAnswerSection = false;
continue;
}
if (/^(#{1,6}\s*)?(clarification answers?|resolved clarifications?)\s*:?\s*$/i.test(line)) {
inAnswerSection = true;
continue;
}
if (/^(#{1,6}\s*)?[A-Z][\w\s/-]{2,80}:$/.test(line)) {
inAnswerSection = false;
}
if (!inAnswerSection || !/should this workflow run locally\/byoh, in cloud, or generate artifacts for both paths\?/i.test(line)) {
continue;
}
const answer = line.split(/:\s*/).slice(1).join(': ').toLowerCase();
if (/\b(local|locally|byoh|on this machine)\b/.test(answer)) return 'local';
if (/\b(cloud|hosted|remote)\b/.test(answer) && !/\bboth\b/.test(answer)) return 'cloud';
if (/\b(both|auto|both paths)\b/.test(answer)) return 'auto';
}
return undefined;

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep clarification-answer parsing aligned with the clarification scanner.

This helper only strips unordered bullets. A numbered or checkbox answer like 1. Should this workflow run locally/BYOH...: locally can be recognized as answered elsewhere, but still be missed here, leaving executionPreference as auto.

Suggested fix
   for (const rawLine of description.split(/\r?\n/)) {
-    const line = rawLine.trim().replace(/^[-*+]\s+/, '');
+    const line = rawLine
+      .trim()
+      .replace(/^[-*+]\s+/, '')
+      .replace(/^\d+[.)]\s+/, '')
+      .replace(/^\[[ xX]\]\s+/, '');
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/product/spec-intake/normalizer.ts` around lines 133 - 156, The parsing in
executionPreferenceFromClarificationAnswers only strips unordered bullets and
thus misses numbered lists and checkbox-style lines (e.g., "1. ..." or "- [ ]
..."), causing answers to be overlooked; update the line normalization step to
also remove ordered list markers and checkbox tokens before further tests
(handle patterns like /^\s*\d+\.\s+/, /^\s*[-*+]\s*\[[ xX]\]\s+/, and combined
forms), keeping existing trimming and unordered-bullet removal behavior so the
subsequent header detection and answer extraction logic still works unchanged.
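The broadened line normalization can be sketched on its own. The three replace patterns come from the review's suggested fix; the sample lines are illustrative, and the rest of the answer-section scanning logic is assumed to stay unchanged.

```javascript
// Sketch of the list-marker stripping suggested in the review:
// unordered bullets, numbered lists, and checkbox tokens are all removed
// before the line is matched against the clarification question.
const stripListMarkers = (rawLine) =>
  rawLine
    .trim()
    .replace(/^[-*+]\s+/, '')      // unordered bullets: "- ", "* ", "+ "
    .replace(/^\d+[.)]\s+/, '')    // ordered lists: "1. " or "1) "
    .replace(/^\[[ xX]\]\s+/, ''); // checkboxes: "[ ] ", "[x] " (after a bullet)

console.log(stripListMarkers('- [x] Run locally'));
console.log(stripListMarkers('1. Should this workflow run locally/BYOH...: locally'));
```

Because the bullet is stripped first, a combined form like `- [ ] ...` reduces to the bare answer text in two passes, so the header-detection and answer-extraction steps that follow see the same shape regardless of list style.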

@khaliqgant khaliqgant merged commit 538e74b into main May 8, 2026
1 check passed
@khaliqgant khaliqgant deleted the codex/ricky-evals-sweep branch May 8, 2026 15:41