feat(experiments): rule engine POC for OODA Decide quadrant #525
Terminal-only TypeScript POC of the "LLM extracts, rules decide" pattern from the AI fact-checking community: the LLM is constrained to producing structured flags from raw signals, and a deterministic rule engine maps those flags to a verdict tier with a fully replayable audit trail. The POC lives under experiments/rule-engine-poc/ as a sandbox (not formal Stage 1-7) and demonstrates the pattern against the repo's own quality framework: each rule encodes a Definition of Done item from docs/quality-framework.md.

What's included:
- src/: hand-rolled engine (~250 LOC) split into types, hash, engine, loader, cli, and html-report. Pure functions; severity-first verdict; canonical JSON + SHA-256 provenance hashes for replay.
- rules/quality-gates.yaml: DoD-as-rules example set.
- fixtures/*.json: 5 mock Orient-quadrant extractions covering ready, blocked, and needs-attention verdicts.
- test/: 24 passing tests (vitest), including a dedicated reproducibility suite (strategist-recommended North Star: byte-identical replay).
- HTML reporter: self-contained, inline CSS, no JS, no external assets.
- docs/: architecture, DSL reference, audit trail + EU AI Act mapping, extension guide, OODA integration.
- research/: five-angle research wave (technical landscape, regulatory auditability, positioning/JTBD, design alternatives, risks/critique).
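Byte-identical replay depends on serialising the same data to the same bytes before hashing. A minimal sketch of one common canonicalisation approach (object keys sorted at every depth), assuming the POC's own src/hash may differ in details; canonicalJson is a hypothetical name, and the SHA-256 step is omitted so the sketch needs no Node APIs:

```typescript
// Hypothetical sketch, not the POC's src/hash implementation.
// JSON values that can appear in an extraction or a rule file.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialise with object keys sorted at every depth, so two semantically equal
// objects always produce byte-identical output (and therefore the same hash).
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") {
    // Primitives: JSON.stringify is already deterministic.
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Array order is meaningful, so it is preserved as-is.
    return "[" + value.map(canonicalJson).join(",") + "]";
  }
  const obj = value as { [key: string]: Json };
  const entries = Object.keys(obj)
    .sort()
    .map((key) => JSON.stringify(key) + ":" + canonicalJson(obj[key] as Json));
  return "{" + entries.join(",") + "}";
}

console.log(canonicalJson({ b: 1, a: [{ d: true, c: null }] }));
// prints {"a":[{"c":null,"d":true}],"b":1}
```

One caveat of leaning on JSON.stringify: NaN and Infinity serialise as null, which is one reason to reject non-finite numbers before they reach the hash.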
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32d02f485f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Validate that then.verdict is one of the four known tiers. Previously a typo like 'blokced' would load successfully and silently degrade the rule into a no-op via tally[<unknown-key>] (#525 P1).
- Validate that when.all / when.any / when.not are arrays at load time. Previously 'any: true' would load and then crash at evaluation with a TypeError on .map (#525 P2).
- Export VERDICTS as a runtime constant from types.ts so the schema check shares one source of truth with the type.
- Three new loader tests cover the two failure modes plus a typo case.
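The two load-time checks can be sketched as follows. This is illustrative only: the tier names are placeholders loosely based on verdicts mentioned in this PR (the real list lives in types.ts), and checkRule is a hypothetical name.

```typescript
// Hypothetical sketch of the loader checks; tier names are placeholders.
const VERDICTS = ["ready-to-progress", "needs-attention", "blocked", "placeholder-tier"] as const;
type Verdict = (typeof VERDICTS)[number];

function isVerdict(value: unknown): value is Verdict {
  return typeof value === "string" && (VERDICTS as readonly string[]).includes(value);
}

interface RawRule {
  id: string;
  when?: Record<string, unknown>;
  then?: { verdict?: unknown };
}

// Returns load-time errors instead of letting a bad rule reach evaluation.
function checkRule(rule: RawRule): string[] {
  const errors: string[] = [];
  // A typo such as 'blokced' would otherwise index an unknown tally key.
  if (!isVerdict(rule.then?.verdict)) {
    errors.push(`${rule.id}: unknown then.verdict ${JSON.stringify(rule.then?.verdict)}`);
  }
  // 'any: true' would otherwise crash later when the engine calls .map().
  for (const key of ["all", "any", "not"]) {
    const clause = rule.when?.[key];
    if (clause !== undefined && !Array.isArray(clause)) {
      errors.push(`${rule.id}: when.${key} must be an array`);
    }
  }
  return errors;
}
```

Deriving the Verdict type from the runtime array keeps the type and the schema check from drifting apart, which is the point of exporting VERDICTS as a constant.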
@codex review
Addressed both findings from the previous round:
Three new tests in
Generated by Claude Code
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6625ee1108
- engine: 'exists' now participates in the AND-chain instead of short-circuiting, so 'exists: true' combined with 'eq'/'ne'/'gt'/'lt'/'in'/'regex' correctly requires every operator to match (#525 P2). 'exists: false' still tolerates flag absence without surfacing the 'flag missing' reason; this is the only short-circuit retained.
- loader: empty 'when.all' / 'when.any' / 'when.not' arrays are now rejected at load time. Previously 'any: []' was vacuously satisfied by the length>0 guard in evaluateWhen, allowing a typo to fire a blocking rule unintentionally (#525 P1).
- ENGINE_VERSION bumped 0.1.0 -> 0.2.0 because the exists + value-operator interaction is a semantic change. Per docs/extending.md, a version bump is the auditor's signal that prior verdicts may not replay.
- Six new tests: empty when.any / when.all rejection, plus four exists-AND cases including exists:false standalone.
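A minimal sketch of the revised condition semantics, assuming a cut-down condition shape with only exists, eq and gt (the real engine also handles ne/lt/in/regex, and its signatures may differ). The reason string matches the one quoted elsewhere in this PR:

```typescript
// Hypothetical, cut-down condition shape; the real DSL has more operators.
interface Condition {
  flag: string;
  exists?: boolean;
  eq?: unknown;
  gt?: number;
}

interface ConditionResult {
  matched: boolean;
  reason?: string;
}

function evaluateCondition(c: Condition, flags: Record<string, unknown>): ConditionResult {
  const present = Object.prototype.hasOwnProperty.call(flags, c.flag);
  // The one retained short-circuit: exists:false matches on absence and never
  // reports 'flag missing'.
  if (c.exists === false) return { matched: !present };
  if (!present) return { matched: false, reason: "flag missing in extraction" };

  const value = flags[c.flag];
  // AND-chain: exists:true is just one more link, so it cannot mask eq/gt.
  const links: boolean[] = [];
  if (c.exists === true) links.push(true);
  if ("eq" in c) links.push(value === c.eq);
  if (c.gt !== undefined) links.push(typeof value === "number" && value > c.gt);
  // every() on an empty list is true, so operator-less conditions must be
  // rejected by the loader rather than here.
  return { matched: links.every(Boolean) };
}
```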
@codex review
Round 2 findings addressed in
Six new tests; suite is 33/33 green.
Generated by Claude Code
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00a43e7630
- loader: each condition must declare at least one supported operator.
Previously a typo like { flag: 'x', eqq: true } would load and then
always-match at runtime, silently flipping verdicts (#525 round 3 P2).
- loader: 'exists: false' combined with value operators is rejected at
load time. The combination has no meaningful semantics: an absent
flag has no value for eq/ne/gt/lt/in/regex to apply to (#525 round 3 P1).
- loader: condition objects without a 'flag' string are rejected.
- engine: code comment clarified to reflect the now-enforced invariant
that exists:false is only valid alone.
- Three new loader tests cover the three rejection paths.
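The three rejection paths can be sketched as a single validator. The operator list comes from this PR; the function name and error strings are hypothetical:

```typescript
// Operators named in this PR's DSL.
const OPERATORS = ["exists", "eq", "ne", "gt", "lt", "in", "regex"] as const;

// Returns an error message, or null when the condition is well-formed.
function validateCondition(raw: unknown): string | null {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return "condition must be an object";
  }
  const c = raw as Record<string, unknown>;
  if (typeof c.flag !== "string") return "condition is missing a 'flag' string";

  const ops = OPERATORS.filter((op) => op in c);
  // A typo such as 'eqq' leaves no recognised operator; without this check the
  // condition would always match at runtime.
  if (ops.length === 0) return `condition on '${c.flag}' declares no supported operator`;

  // An absent flag has no value for eq/ne/gt/lt/in/regex to apply to.
  if (c.exists === false && ops.some((op) => op !== "exists")) {
    return `condition on '${c.flag}' combines exists:false with a value operator`;
  }
  return null;
}
```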
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 179b883418
- Reject 'then.weight' values that are <=0, infinite, or NaN. A 'blocked' rule with weight 0 would contribute nothing to the tally and silently bypass the gate (#525 round 4 P1).
- Reject non-array 'in' operators at load time. Previously a typo like 'in: foo' would load and then crash at evaluation when .some() is called on a non-array (#525 round 4 P2).
- Validate regex patterns at load time. Previously a malformed regex like 'regex: "["' would load and then abort the entire decision run with a SyntaxError when new RegExp() throws during evaluation (#525 round 4 P2).
- Updated the existing 'missing then.weight' test to match the new 'invalid then.weight' error message.
- Four new loader tests cover the three rejection paths plus a negative-weight case.
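A sketch of the three checks as small predicates. The names are hypothetical; the point is that each failure now surfaces at load time instead of mid-run:

```typescript
// then.weight must be a finite, strictly positive number; weight 0 would let a
// blocking rule contribute nothing to the tally.
function isValidWeight(weight: unknown): boolean {
  return typeof weight === "number" && Number.isFinite(weight) && weight > 0;
}

// 'in' must be an array, or evaluation would call .some() on a non-array.
function isValidIn(operand: unknown): boolean {
  return Array.isArray(operand);
}

// Compile the regex once at load time so a malformed pattern fails here,
// not halfway through a decision run.
function compileRegex(pattern: unknown): RegExp | string {
  if (typeof pattern !== "string") return "regex must be a string";
  try {
    return new RegExp(pattern);
  } catch (err) {
    return `invalid regex ${JSON.stringify(pattern)}: ${(err as Error).message}`;
  }
}
```

Returning the compiled RegExp also lets a loader cache it, so evaluation never constructs patterns on the hot path.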
End-to-end flow now drives the POC: the user adds content to the project, runs npm run plan to generate AI extraction prompts, pastes a prompt into Claude/ChatGPT, saves the JSON to extractions/, and runs npm run report to render HTML and open it in the browser.

Architecture:
- rule-engine.config.json declares targets, each with id + label + paths (files or directories, walked deterministically).
- rules/flag-schema.yaml documents every flag the rule set may reference (type + description + example); it is the contract between the AI extractor and the engine.
- src/plan.ts walks target paths, collects file contents with 8 KB truncation per file, and bundles role + schema + rules + source into a single prompt per target.
- src/report.ts loads extractions per target, runs the engine, renders the existing HTML reporter, and makes a best-effort attempt to open the first report in the OS default browser. Exit codes 0/1/2 = ok/blocked/missing.
- The prompt builder follows analyst research (research/10): XML-tag structure with markdown redundancy, an explicit forbidden-fields list (verdict, assessment, conclusion, summary, recommendation, rationale, analysis), and an open <output> tag as a forcing function.
- The original single-shot src/cli.ts is preserved as a fixture-testing escape hatch.

20 new tests cover the new modules (config, flag-schema, context, prompt-builder). Suite total: 60 tests, all passing.

Research wave 2 (5 background agents) wrote 5 new artifacts under research/ covering independent review, workflow failure modes, workflow architecture, UX friction, and extraction prompt patterns.

The POC stays sandbox-scoped under experiments/rule-engine-poc/. No wiring into specs/, /spec:status, plugins/, or the main repo.
@codex review
Two updates pushed: Codex round 4 (
Plan/report workflow (
Five new research artifacts ( POC stays self-contained under
Generated by Claude Code
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 271702eb54
…Codex round 5)
- open-browser: openInBrowser is now async and waits briefly for the spawn or error event before resolving. Previously it returned true immediately, so report.ts printed "opened in browser" even when xdg-open was missing in a headless container, misleading users during the primary plan->report flow (#525 round 5 P2).
- report.ts: awaits openInBrowser and prints the correct status line. Verified in this sandbox: it now prints "could not spawn a browser; open manually: file://..." when no browser is installed.
- loader: the 'exists' operator must be a boolean. Previously a typo like 'exists: "false"' would load and then silently never match, because evaluateCondition compares a boolean to a string (#525 round 5 P2).
- loader: 'gt' and 'lt' operators must be numbers at load time, for consistency with the other type checks (the engine already failed the match at runtime, but failing at load is preferred).
- Two new loader tests cover the exists-boolean and gt-number paths.
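The operand type checks can be sketched as one helper over a raw condition object. checkOperandTypes and its messages are hypothetical:

```typescript
// Type checks on operator operands at load time (hypothetical helper).
function checkOperandTypes(c: Record<string, unknown>): string[] {
  const errors: string[] = [];
  // 'exists: "false"' is a string, which would silently never match.
  if ("exists" in c && typeof c.exists !== "boolean") {
    errors.push(`exists must be a boolean, got ${typeof c.exists}`);
  }
  // gt/lt compare numerically, so a quoted number is a load-time error.
  for (const op of ["gt", "lt"]) {
    if (op in c && typeof c[op] !== "number") {
      errors.push(`${op} must be a number, got ${typeof c[op]}`);
    }
  }
  return errors;
}
```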
@codex review
Round 5 findings addressed in
Two new loader tests; suite is 62/62 green.
Generated by Claude Code
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4e0d0cef09
…und 6)
- open-browser: spawn cmd /c start "" on Windows. 'start' is a cmd.exe
built-in, not a standalone exe, so spawn('start', ...) raised ENOENT
and browser open always failed for Windows users (#525 round 6 P2).
- report: validate the parsed extraction is a plain object before
passing to evaluate. Previously valid JSON like null / [] / "text"
would crash inside hasOwnProperty.call on null instead of producing
a controlled error (#525 round 6 P2).
- prompt-builder: pick a fence length longer than any backtick run in
the source content. Many repo markdown files contain ``` blocks
which would prematurely close the prompt's outer fence and corrupt
the AI extraction prompt (#525 round 6 P2).
- New pickFence helper is exported and unit-tested; prompt-builder
test asserts a 5-tick fence is emitted for content with a 4-tick
run.
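The fence-length rule relies on CommonMark's requirement that a closing fence be at least as long as the opening fence, so any fence longer than every backtick run in the content cannot be closed early. A sketch, assuming the helper takes the content and returns the fence string (the exported pickFence may have a different signature):

```typescript
// Pick a backtick fence that no run inside `content` can prematurely close.
function pickFence(content: string): string {
  // Every maximal run of consecutive backticks in the content.
  const runs: string[] = content.match(/`+/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  // One tick longer than the longest inner run, and never shorter than 3.
  return "`".repeat(Math.max(3, longest + 1));
}
```

For content with a 4-tick run this yields a 5-tick fence, matching the prompt-builder test described above.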
Closes the schema-miss laundering failure mode flagged by the critic (research/07) and analyst (research/10): bad LLM output now fails loudly instead of becoming a reproducible-looking verdict.

What's new:
- src/validate.ts: validateExtraction(flags, schema, options) returns errors + warnings. Checks: forbidden fields (verdict, assessment, conclusion, summary, recommendation, rationale, analysis), unknown fields (warning), type mismatches (boolean/number/string/string[]), non-finite numbers, disallowed_values violations, and prompt-hash mismatch (when expectedPromptHash is provided; wired up in the next commit).
- src/validate-cli.ts: 'npm run validate' surfaces issues per target and exits 0 (clean) / 1 (errors) / 2 (missing/unparseable extraction).
- src/report.ts: validates each extraction before evaluating and refuses to render when validation fails. The --skip-validate flag is the escape hatch.
- prompt-builder imports FORBIDDEN_FIELDS from validate.ts so the forbidden list lives in exactly one place.
- 12 new validate tests; suite total 77 passing.

Verified end-to-end: a polluted extraction with verdict + type-mismatch + unknown-flag is caught by both validate and report.
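A reduced sketch of the validator's core loop, covering only forbidden fields, unknown fields and type mismatches. The real validateExtraction also checks disallowed_values and the prompt hash and takes an options argument; the flat schema shape here is a simplification assumed for illustration:

```typescript
type FlagType = "boolean" | "number" | "string" | "string[]";

interface Issue {
  code: string;
  flag: string;
}

// Judgement words the extractor must never emit (list from this PR).
const FORBIDDEN_FIELDS = [
  "verdict", "assessment", "conclusion", "summary",
  "recommendation", "rationale", "analysis",
];

function matchesType(value: unknown, type: FlagType): boolean {
  switch (type) {
    case "boolean":
      return typeof value === "boolean";
    case "number":
      // NaN and Infinity count as mismatches.
      return typeof value === "number" && Number.isFinite(value);
    case "string":
      return typeof value === "string";
    case "string[]":
      return Array.isArray(value) && value.every((v) => typeof v === "string");
    default:
      return false;
  }
}

function validateExtraction(
  flags: Record<string, unknown>,
  schema: Record<string, FlagType>,
): { errors: Issue[]; warnings: Issue[] } {
  const errors: Issue[] = [];
  const warnings: Issue[] = [];
  for (const [flag, value] of Object.entries(flags)) {
    if (FORBIDDEN_FIELDS.includes(flag)) {
      errors.push({ code: "forbidden-field", flag });
    } else if (!Object.prototype.hasOwnProperty.call(schema, flag)) {
      warnings.push({ code: "unknown-field", flag }); // warning, not error
    } else if (!matchesType(value, schema[flag] as FlagType)) {
      errors.push({ code: "type-mismatch", flag });
    }
  }
  return { errors, warnings };
}
```

Using hasOwnProperty rather than `in` for the schema lookup keeps inherited names such as "constructor" from passing as known flags.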
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0aaa114eb6
…ard) Closes the stale-extraction failure mode flagged by the critic (research/07): users edit source files between plan and report, and the old JSON still renders a confident verdict. The report now refuses extractions produced against a different prompt.

What's new:
- src/prompt-hash.ts: computePromptHash hashes the LOAD-BEARING inputs (target id, per-file sha, rule hashes, schema content), not the rendered prompt text. Cosmetic edits to the prompt template don't invalidate extractions; real source changes do.
- src/plan.ts: emits the sidecar prompts/<id>.hash.txt and embeds the hash into the prompt as (a) a top-of-file HTML comment, (b) an explicit rule asking the LLM to copy it into __prompt_hash, and (c) the first key of the response template.
- src/validate.ts: enforces expectedPromptHash when provided. Surfaces missing-prompt-hash and stale-extraction error codes with re-run instructions.
- src/report.ts + src/validate-cli.ts: read the sidecar hash if it exists; its absence falls back to the pre-binding behaviour for backwards compatibility with fixtures.
- 6 new prompt-hash tests; suite total 83 passing.

End-to-end verified in this sandbox:
- matching hash -> exit 0 (ready-to-progress)
- stale hash -> exit 2 with explicit error
- missing field -> exit 2 with explicit error
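The key design choice is what goes into the hash. A sketch of the preimage construction, assuming the inputs are the four listed above; promptHashPreimage and its field names are hypothetical, and the real computePromptHash would additionally pass such a string through SHA-256:

```typescript
// Hypothetical input shape for the prompt-hash binding.
interface PromptInputs {
  targetId: string;
  fileShas: Record<string, string>; // source path -> sha of its contents
  ruleHashes: string[]; // in load order
  schemaContent: string;
  renderedTemplate?: string; // cosmetic prompt wording: deliberately ignored
}

// Build the deterministic string that gets hashed. Only load-bearing inputs
// participate, so rewording the prompt template cannot invalidate extractions.
function promptHashPreimage(inputs: PromptInputs): string {
  const files = Object.keys(inputs.fileShas)
    .sort() // key insertion order must not affect the hash
    .map((path) => [path, inputs.fileShas[path]]);
  return JSON.stringify({
    target: inputs.targetId,
    files,
    rules: inputs.ruleHashes,
    schema: inputs.schemaContent,
  });
}
```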
Closes a cluster of small findings from Codex round 7 (cli, context) and reviewer S2/S3 (engine, loader audit-trail honesty):
- cli.ts: validate that the JSON root is a plain object before evaluate. Same guard as report.ts; previously valid-JSON-but-not-object input would crash inside hasOwnProperty.call (#525 round 7 P2).
- context.ts: use lstat instead of stat and skip symlinks entirely. Previously a symlink cycle (a/sub/loop -> a/) would recurse until stack overflow during plan (#525 round 7 P2).
- engine.ts evaluateCondition: gt/lt against a non-number and regex against a non-string now set an explicit reason ('expected number for gt, got string') so the audit trail explains *why* the condition didn't match. Reviewer S2; previously these set matched=false with no reason.
- engine.ts evaluateWhen: when.not against a missing flag no longer silently fires. The inner condition's 'flag missing in extraction' reason is preserved through the not clause; the rule fails to match rather than inverting absence into success. Reviewer S2.
- loader.ts: duplicate rule ids in a single rule file are rejected at load time. Previously a second rule with the same id loaded silently and the engine evaluated it independently. Reviewer S3.
- Five new tests: gt-non-number reason, regex-non-string reason, not-missing flag, duplicate ids, symlink cycle handling. Suite: 88/88 passing.
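The when.not change is the subtle one: absence should propagate as "unknown" rather than invert into a match. A sketch, assuming when.not means "none of the inner conditions match" and that inner results carry the reason string quoted above; evaluateNot is a hypothetical name:

```typescript
interface CondResult {
  matched: boolean;
  reason?: string;
}

const MISSING = "flag missing in extraction";

// Combine inner results for a when.not clause (assumed: "none may match").
function evaluateNot(inner: CondResult[]): CondResult {
  // Absence is unknown, not false: keep the reason and refuse to fire,
  // instead of inverting a missing flag into success.
  if (inner.some((r) => r.reason === MISSING)) {
    return { matched: false, reason: MISSING };
  }
  return { matched: !inner.some((r) => r.matched) };
}
```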
@codex review
Big push since the last review. Three commits:
Plus an adoption re-research ( Suite: 88/88 green across 10 files. Adds tests for everything above plus the new modules (validate, prompt-hash).
Generated by Claude Code
Closes the 'paste the sidecar to bypass staleness' cheat that the critic flagged as the highest-leverage fix in the post-validate workflow.

Previously: report.ts and validate-cli.ts read prompts/<id>.hash.txt (plain text) and trusted its value. An operator under deadline pressure could open the sidecar, copy the hash into the extraction's __prompt_hash field, and silently re-render a stale verdict.

Now: report and validate-cli recompute the hash from current source files + rules + schema (the same code path as plan.ts). The sidecar is still written for diagnostic / debugging purposes, but it is never the authority for whether an extraction is stale. A real change to any source file invalidates the extraction automatically.

Smoke-tested in this sandbox:
- Source unchanged, paste-the-sidecar cheat -> exit 0 (correct; the extraction is still valid against current source).
- Source mutated, same paste-the-sidecar cheat -> stale-extraction error with both the pasted hash and the recomputed hash printed.
- research/14 (critic): three new failure modes opened by the validate gate, ranked: --skip-validate, the sidecar-paste cheat, and reproducibility theatre. The highest-leverage fix already landed in the previous commit.
- research/15 (sre): CI integration sketch with concrete cost math (~$0.56/target, $1,700/month at 20 PRs/day on Opus 4.7) and a Day-1/30/90 operational milestones path.
- research/16 (user-researcher): 5-segment JTBD switch interview plan with sequencing (mine demand signal first, S1 indie devs next, fail fast before S2-S5), a full sample script, and RAT integration.
- research/12 (reviewer): independent re-review at HEAD. Verdict: pass-with-findings. S2-1 (docs drift: workflow.md still lists the validate gate as 'not yet here' despite shipping), S2-2 (sidecar deletion bypasses prompt-hash binding entirely), S2-3 (--skip-validate is undocumented), and an S3 cluster on test-count drift, HTML provenance, and type-mismatch error messages.
- Fix two typos caught by the CI spell check (typos v1.46.0): research/16 'pre-empted' -> 'confirmed' (reads more clearly anyway), research/12 'ci_passsing' -> 'ci_passingx' (illustrative typo recast to avoid a typos-tool false positive).
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0509c11445
…grams Replaces the engine-internals-focused architecture.md with a comprehensive system view covering:
- System overview (component flowchart)
- User flow (sequence diagram across plan/AI/validate/report)
- Data flow (annotated with data shapes at each seam)
- Engine internals (evaluate algorithm + per-condition + severity picker)
- Validate gate + prompt-hash binding (sequence)
- OODA mapping (Observe/Orient/Decide/Act with stochasticity boundary)
- Module dependency graph (16 src/ modules)
- Why these shapes (design choices + research refs)

Seven Mermaid diagrams. docs/README.md now points to architecture.md as the start-here entry.
- config.ts: target ids must match /^[A-Za-z0-9][A-Za-z0-9_-]*$/. Previously a target id like '../escape' or 'foo/bar' was accepted and later interpolated into prompts/<id>.md, extractions/<id>.json, and reports/<id>.html: at best ENOENT, at worst a write outside the workspace (#525 round 8 P2).
- context.ts: extract truncateToBytes(), which walks back to a UTF-8 code point boundary. Previously slice(0, maxBytes) counted UTF-16 code units rather than UTF-8 bytes, so CJK / emoji-heavy markdown could emit prompt blocks up to 3x the advertised 8 KB cap (#525 round 8 P2).
- 7 new tests cover the three slug rejection paths and the multibyte truncation invariant. Suite: 95/95 passing.
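Both fixes can be sketched without Node APIs. The slug regex is the one quoted above; truncateToBytes here walks forward and stops before the first code point that would exceed the budget, a swapped-in but equivalent approach to the POC's walk-back, and its signature is assumed:

```typescript
// Slug rule quoted from this commit: ids become file names, so no '/' or '..'.
const TARGET_ID = /^[A-Za-z0-9][A-Za-z0-9_-]*$/;

function isValidTargetId(id: string): boolean {
  return TARGET_ID.test(id);
}

// Truncate to at most maxBytes of UTF-8 without splitting a code point.
// Walks UTF-16 code units and charges each code point its UTF-8 size.
function truncateToBytes(text: string, maxBytes: number): string {
  let bytes = 0;
  let i = 0;
  while (i < text.length) {
    const unit = text.charCodeAt(i);
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff &&
      i + 1 < text.length &&
      (text.charCodeAt(i + 1) & 0xfc00) === 0xdc00;
    // Surrogate pair = 4 bytes; other units take 1, 2 or 3 bytes (a lone
    // surrogate is counted as 3, the size of its U+FFFD replacement).
    const size = isPair ? 4 : unit < 0x80 ? 1 : unit < 0x800 ? 2 : 3;
    if (bytes + size > maxBytes) break;
    bytes += size;
    i += isPair ? 2 : 1;
  }
  return text.slice(0, i);
}
```

A BMP CJK character is 1 UTF-16 unit but 3 UTF-8 bytes, which is where the old over-run came from.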
Codex Review: Didn't find any major issues. Already looking forward to the next diff.
Three rendered reports (ready, blocked, needs-attention) for the agents to inspect when reviewing report readability.
The sample reports under experiments/rule-engine-poc/research/sample-reports/ embed 12-char rule content-hash prefixes that randomly trip typos rules (e.g., 'afe...' -> 'safe'). Excluding the folder is consistent with the existing pattern that allow-lists specific commit-SHA fragments.

Also commits research/17 (ux-designer pass on the rendered HTML reports). The top finding is that the audit trail buries the matched rules among ~21 'did not match' siblings; it recommends a 'What fired' section + collapse-by-default for skipped rules.
Product-strategist pass on the HTML report as a downstream-shared
artifact. Three findings:
- The report is one artifact serving six first-fields (PR reviewer,
PM, EM, QA, compliance, auditor). Recommend one HTML with
re-stacked sections rather than reader-specific exports — keep
the 'one artifact, many destinations' moat.
- Highest-leverage change: expand action slugs ('kick-ci',
'request-reviewer') to human sentences via an actions[].human
field on the rule schema. Promote the 'verified' prompt-hash
badge next to the verdict.
- Introduce label_set config (default 'dev'; 'pm', 'qa',
'compliance' as presentational overrides) so headline labels
match the reader's vocabulary.
Brand-reviewer pass on the rendered HTML report. Verdict: pass-with-findings; not S1-blocking while the POC stays under experiments/, but it would block the promotion-to-skill step flagged in research/13.

Findings:
- On-temperament (no emoji / gradients / icons; ASCII [+]/[-]/[?] markers are correctly monospace-as-iconography; restrained density).
- Off-token: 18 distinct literal hex values, literal -apple-system / SFMono-Regular font stacks, and a near-white page background instead of Specorator cream var(--paper).
- Voice is close, but section headers are bare labels rather than sentence-case-with-period declaratives; 'Suggested actions' is passive against Specorator's imperative voice.
- Open decision: Specorator has no red token. The blocked tier currently uses literal #fdecea / #d8281b / #7a160d. This is an ADR-shaped choice before graduation: extend colors_and_type.css, repurpose --soft-orange and rename the tier 'at-risk', or stay literal until packaged.
…h/21) Critic pass on the rendered HTML report as a communication artifact. Three findings:
- Visual hierarchy contradicts the semantic model: severity-first is invisible in the weighted-tally widget (it reads as a horse race), the alphabetically sorted action list silently asserts a priority the engine refuses to give, and cond--missing vs cond--miss are visually distinguished but never named (colour-blind readers lose the signal).
- The 'verified' badge is a trust-calibration trap: a green pill reads as 'extraction verified' to an auditor when it only means 'bound to current inputs'. This is compounded by --skip-validate runs producing HTML indistinguishable from validated ones (research/14 risk 1 leaks into the report layer).
- Most dangerous skim path: blocker-by-absence. A high-priority blocker rule whose input flag is missing from the extraction simply doesn't fire; neither the verdict tile nor any header-level summary tells the reader 'N higher-priority rules were un-evaluable'.
- 3 RATs proposed (verdict-tile-alone, action-list-as-priority, 'verified' interpretation). Default no-go if any fail.
Analyst pass on the HTML report from a regulator's reading perspective + 2026 benchmark against LangSmith / Inspect / W&B Weave / sklearn / model-card conventions. Closes the open item from research/02 (human-readable rationale presentation).
New sidecar mapping action slugs to imperative human sentences, so the HTML report can render readable guidance instead of bare slugs.
- rules/action-glossary.yaml: 28 entries covering every action used in rules/quality-gates.yaml, with optional urgency + category metadata. Imperative voice per the Specorator brand.
- src/action-glossary.ts: loader + diff-coverage helper, mirroring src/flag-schema.ts conventions.

Wiring into config.ts and the HTML renderer happens in subsequent commits when Agents A2/B finish their slices. 123/123 tests still passing; no behaviour change yet.
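The diff-coverage idea reduces to a set difference between the slugs the rules reference and the slugs the glossary defines. A hedged sketch; the function name and entry shape are assumptions, not the API of src/action-glossary.ts:

```typescript
// Assumed glossary entry shape; urgency/category are optional per this commit.
interface GlossaryEntry {
  human: string;
  urgency?: string;
  category?: string;
}

// Slugs referenced by rules (then.actions) but absent from the glossary,
// sorted for stable output.
function missingGlossaryEntries(
  ruleActions: string[][],
  glossary: Record<string, GlossaryEntry>,
): string[] {
  const missing = new Set<string>();
  for (const actions of ruleActions) {
    for (const slug of actions) {
      if (!Object.prototype.hasOwnProperty.call(glossary, slug)) missing.add(slug);
    }
  }
  return Array.from(missing).sort();
}
```

An empty result is the "every slug mapped" invariant that a real-file coverage test can assert against rules/quality-gates.yaml.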
… A complete) Agent A's RALPH loop completed. The action glossary is now reachable via the config (still optional; no behaviour change for callers that don't set actionGlossary):
- src/config.ts: optional 'actionGlossary' string field on RawConfig, resolved to 'actionGlossaryPath' on ResolvedConfig, following the same pattern as flagSchema.
- rule-engine.config.json: points at rules/action-glossary.yaml.
- test/action-glossary.test.ts: 12 tests covering loader validation, diff-coverage, real-file coverage of rules/quality-gates.yaml, and sentence-shape invariants.

Also picks up Agent C's in-flight architecture.md updates: the system overview + data flow diagrams now show the glossary node (read only by the renderer) and the new HTML report sections (system-identity header, what fired, reproduce block, audit trail with non-matched rules collapsed). The renderer itself (Agent B) is still in flight.

Suite: 135/135 passing (123 baseline + 12 new). 21 unique action slugs in rules/quality-gates.yaml, all mapped in the glossary.
…xtending
- audit-trail.md: the Mapping to EU AI Act table is updated to credit the HTML report (what-fired with human sentences, system-identity header, tier glossary, reproduce block) as the Art. 13 human-readable rationale surface. Closes the research/02 open item about explainability presentation.
- workflow.md: still in flight by agent C; this commit picks up partial edits.
- extending.md: new 'Authoring action human sentences' section documents rules/action-glossary.yaml as a render-only sidecar (the engine never reads it; editing sentences cannot change a verdict).
Agent B (HTML report rebuild) and Agent C (README sync) are still running their RALPH loops. This commit snapshots the current on-disk state so the working tree stays clean between iterations:
- src/html-report.ts: partial changes from agent B (rebuild for research wave 4 findings). 135/135 tests still passing; the partial state is internally consistent even if not yet feature-complete.
- src/report.ts: corresponding plumbing changes from agent B.
- README.md: agent C's in-flight test-count + file-map sync.

Will be superseded by the next commit when both agents report final.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a479ada85d
…t B complete) Agent B's RALPH loop completed. Twelve convergent findings from research wave 4 are now realised in the renderer:
1. 'What fired' section above the full audit trail (UX/17 + critic/21 + auditor/20). The verdict-card stats line now reads 'N rule(s) fired · M action(s) to take'.
2. Non-matched rules are collapsed via <details class='rule-collapsed'>; matched rules stay inline (UX/17).
3. Blocker-by-absence banner adjacent to the verdict card when any rule's condition reports 'flag missing in extraction' (critic/21 + UX/17). Yellow palette; names the missing flags.
4. Suggested actions are now sorted by priority-of-cause (walk evaluations in priority-desc order, dedup preserving first-seen) instead of alphabetically. result.actions is unchanged for machine consumers (UX/17).
5. Action human-sentence rendering via rules/action-glossary.yaml; falls back to the bare slug when an entry is missing (stakeholder/18).
6. Provenance section: a preamble explaining the hashes + a 'How to reproduce' block + 12-char hash truncation (UX/17 + auditor/20 + stakeholder/18).
7. System-identity header above the verdict card: engine version + a prominent timestamp moved out of the footer (auditor/20).
8. Verdict-tier glossary + [+]/[-]/[?] glyph legend in a collapsed <details class='legend'> block (auditor/20 + UX/17).
9. cond--miss now has a faint red row-wash matching cond--missing's amber, so the visual distinction isn't conveyed by glyph colour alone (UX/17).
10. @media (max-width: 540px) single-column fallback (UX/17).
11. Trust calibration: a --skip-validate banner is shown prominently when validationStatus='skipped'; the verified-badge tooltip explains that it only means 'extraction is bound to current inputs', not 'flags are correct' (stakeholder/18 + critic/21).
12. Section headers in sentence case with periods and imperative voice: 'Take these actions.' not 'Suggested actions' (brand/19).

28 new tests in test/html-report.test.ts; suite total 163/163.
Three sample reports regenerated under research/sample-reports/ so reviewers see the new shape.
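Item 4's ordering rule can be sketched as a stable sort followed by first-seen de-duplication. The Evaluation shape, and the assumption that only matched rules contribute actions, are illustrative; per item 4, result.actions itself stays unchanged and this ordering is for display only:

```typescript
// Hypothetical per-rule evaluation record.
interface Evaluation {
  ruleId: string;
  priority: number;
  matched: boolean;
  actions: string[];
}

function orderActions(evaluations: Evaluation[]): string[] {
  // Array.prototype.sort is stable (ES2019+), so equal priorities keep input order.
  const byPriority = [...evaluations].sort((a, b) => b.priority - a.priority);
  const seen = new Set<string>();
  const ordered: string[] = [];
  for (const e of byPriority) {
    if (!e.matched) continue; // only rules that fired contribute actions
    for (const action of e.actions) {
      if (!seen.has(action)) {
        seen.add(action);
        ordered.push(action);
      }
    }
  }
  return ordered;
}
```

Deduplicating after the sort means a slug shared by a high- and a low-priority rule is listed at the position of its most urgent cause.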
…x round 11)
- open-browser.ts: success now requires the opener process to exit with code 0 (or null, if signal-terminated), not just spawn. On Linux headless / CI, xdg-open spawns successfully and then immediately exits non-zero because no browser handler is registered; previously report.ts printed 'opened in browser' even though the file was never opened. The safety-net timeout is bumped to 1s for opener daemons that never deliver an exit event (#525 round 11 P2).
- cli-shared.ts: takeOpt now throws 'Missing value for option <flag>' when the flag is the last argv entry or is followed by another flag. Previously 'npm run report -- --target' would silently fall back to 'all targets', and 'npm run report -- --target --quiet' would interpret '--quiet' as the target id (#525 round 11 P2).
- Two new takeOpt tests cover the missing-value rejection path. Suite: 165/165 passing.
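A sketch of the stricter option parsing. The real takeOpt's signature is not shown in this PR, so the (argv, flag) shape and non-mutating behaviour are assumptions; only the error message follows the commit text:

```typescript
// Returns the value following `flag`, undefined when the flag is absent, and
// throws when the flag is present but has no usable value.
function takeOpt(argv: readonly string[], flag: string): string | undefined {
  const i = argv.indexOf(flag);
  if (i === -1) return undefined;
  const value = argv[i + 1];
  // Last argv entry, or immediately followed by another flag.
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`Missing value for option ${flag}`);
  }
  return value;
}
```

The same shape covers the single-shot cli.ts takeOption() fix in round 13, where a missing --html value should exit 2 instead of silently skipping the artifact.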
@codex review
Three commits since last review:
28 new tests in
Suite: 165/165 green. Typos clean.
Generated by Claude Code
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90f3fe10f5
…odex round 12)
- html-report.ts missingFlagNames: only count rules whose final outcome was determined by the missing flag (matched === false). Previously a 'when.any' rule with one matched branch + one missing branch was counted as un-evaluable in the banner, even though it contributed to the verdict (#525 round 12 P2).
- html-report.ts reproCmd: paths are now single-quoted via a small shellQuote helper that escapes each embedded ' as the standard four-character sequence '\''. Paths with spaces (e.g., 'My Projects/rules.yaml') no longer break the copy-pasted reproduce command (#525 round 12 P2).
- Two new html-report tests: a when.any rule with a missing branch is NOT counted in the banner, and reproCmd contains HTML-escaped quoted paths. Suite: 167/167 passing.
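The quoting rule is small enough to show in full. A sketch in the POSIX sh convention; this shellQuote is a stand-in and may differ from the helper in html-report.ts, which also HTML-escapes the result:

```typescript
// POSIX single-quote a string for copy-paste into sh/bash/zsh.
// Inside single quotes nothing is special except ' itself, so each embedded '
// becomes '\'' : close the quote, emit an escaped quote, reopen.
function shellQuote(s: string): string {
  return "'" + s.split("'").join("'\\''") + "'";
}

console.log(shellQuote("My Projects/rules.yaml")); // prints 'My Projects/rules.yaml'
```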
@codex review
Round 12 P2s addressed in
Two new
Generated by Claude Code
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb01077fc6
…odex round 13) src/cli.ts has its own takeOption() (the single-shot fixture flow doesn't use src/cli-shared.ts). Same bug as round 11 P2 in cli-shared: when --html had no value the helper returned undefined and the CLI silently proceeded with no HTML output, breaking automation that relies on the artifact being written. Now fail fast with a clear stderr message and exit code 2 when the option is the last argv entry or is followed by another flag. Suite: 167/167 still passing (no test exercised the silent-skip path).
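The fail-fast behaviour can be sketched as an option reader that distinguishes "flag absent" from "flag present without a value". This is a hypothetical illustration rather than the PR's cli.ts: it returns a result object instead of throwing (the round-11 cli-shared.ts variant throws), and it assumes any following token that starts with -- is another flag:

```typescript
// Hypothetical sketch of fail-fast option parsing; not the PR's cli.ts.
type OptionResult =
  | { ok: true; value: string | undefined }
  | { ok: false; error: string };

// Absent flag -> ok with undefined. Present flag without a value (last argv
// entry, or followed by another "--" flag) -> error instead of a silent skip.
export function takeOption(argv: readonly string[], flag: string): OptionResult {
  const index = argv.indexOf(flag);
  if (index === -1) return { ok: true, value: undefined };
  const value = argv[index + 1];
  if (value === undefined || value.startsWith("--")) {
    return { ok: false, error: `Missing value for option ${flag}` };
  }
  return { ok: true, value };
}

// Returns the exit code: 2 for usage errors (message on stderr), else 0.
export function runCli(argv: readonly string[]): number {
  const html = takeOption(argv, "--html");
  if (!html.ok) {
    process.stderr.write(`${html.error}\n`);
    return 2;
  }
  // ... evaluate the fixture; write the HTML report when html.value is set ...
  return 0;
}
```

A caller would set process.exitCode = runCli(process.argv.slice(2)), so automation that expects the HTML artifact sees a non-zero exit instead of a missing file.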
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 003a05ed50
… 14)
- loader.ts: every entry of 'then.actions' must be a non-empty string slug. Previously numbers / objects / empty strings passed load-time validation and flowed into the HTML reporter as unrecognised tokens that couldn't map to a glossary entry, breaking the remediation guidance the verdict is meant to provide (#525 round 14 P2).
- loader.ts: 'gt' and 'lt' now reject NaN and Infinity at load time. Both are technically 'number' but silently corrupt comparisons at runtime (NaN comparisons are always false), so a typo could make a gating rule unexpectedly never fire (#525 round 14 P2).
- Four new loader tests cover non-string action elements, empty-string action elements, NaN gt, and Infinity lt.
Suite: 171/171 passing (167 + 4).
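A sketch of the two round-14 load-time checks, assuming the loader sees raw YAML values as unknown. validateActions and validateThreshold are hypothetical helper names, and treating a missing actions list as empty is an assumption:

```typescript
// Hypothetical sketch of the round-14 checks; not the PR's loader.ts API.

// Every then.actions entry must be a non-empty string slug.
export function validateActions(ruleId: string, actions: unknown): string[] {
  // Assumption: omitting 'actions' entirely is allowed.
  if (actions === undefined) return [];
  if (!Array.isArray(actions)) {
    throw new Error(`Rule '${ruleId}': 'then.actions' must be an array`);
  }
  actions.forEach((action: unknown, i: number) => {
    // Rejects numbers, objects, "" and (an extra assumption) whitespace-only strings.
    if (typeof action !== "string" || action.trim() === "") {
      throw new Error(`Rule '${ruleId}': then.actions[${i}] must be a non-empty string slug`);
    }
  });
  return actions as string[];
}

// 'gt' / 'lt' thresholds must be finite: NaN makes every comparison false,
// and Infinity makes the comparison constant, so either can silently
// disable a gating rule.
export function validateThreshold(ruleId: string, op: "gt" | "lt", value: unknown): number {
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new Error(`Rule '${ruleId}': '${op}' must be a finite number, got ${String(value)}`);
  }
  return value;
}
```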
New docs/compliance.md walks through the standards/regulations an adopter is most likely to be asked about in 2026 and catalogues what the POC contributes vs what stays the adopter's job. It synthesises the two prior research passes (research/02 regulatory + research/20 auditor readability) into reference material for scoping conversations.
Covered:
- EU AI Act Art. 11-14 + Art. 72 with a per-article table.
- ISO/IEC 42001 AIMS clauses 6-10.
- ISO/IEC 23894 AI risk management.
- NIST AI RMF Govern / Map / Measure / Manage.
- GDPR Art. 22 (when it applies vs when it doesn't).
- OECD AI Principles (1-paragraph summary).
Plus:
- 'What the POC ticks natively': per-artifact provenance.
- 'What is NOT in this POC': honest gap analysis.
- Maturity checklist before production with rough effort estimates.
Leads with a disclaimer that this is engineering reference material, not legal advice or certification. docs/README.md updated to index the new doc.
…/git/Luis85/agentic-workflow into claude/rule-engine-poc-gO5yq
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb23228592
    if (typeof rule.priority !== "number") {
      throw new Error(`Rule '${rule.id}' missing numeric 'priority'`);
    }
Reject non-finite rule priorities during load
The loader only validates priority with typeof === "number", so YAML values like .nan or .inf are accepted. In evaluate, these values feed the sort comparator, and NaN in particular causes b.priority - a.priority to become NaN, which is treated as an equal comparison and bypasses the intended priority ordering. This silently turns a malformed rule into different rule/action ordering behavior instead of failing fast like other numeric fields.
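One way to apply the suggestion is to give priority the same finiteness check that round 14 added for gt and lt. The sketch is hypothetical (validatePriority and byPriorityDesc are illustrative names). Infinity is as harmful as NaN here, because Infinity - Infinity is NaN when two rules share an infinite priority:

```typescript
interface RuleLike {
  id: string;
  priority: unknown; // raw value as loaded from YAML
}

interface Rule {
  id: string;
  priority: number;
}

// Sketch of the suggested fix: reject NaN/Infinity, not just non-numbers.
export function validatePriority(rule: RuleLike): Rule {
  const { id, priority } = rule;
  if (typeof priority !== "number" || !Number.isFinite(priority)) {
    throw new Error(`Rule '${id}' must have a finite numeric 'priority', got ${String(priority)}`);
  }
  return { id, priority };
}

// With finite priorities, b - a is always a real number, so this descending
// comparator is consistent. (A NaN result is treated as "equal" by sort.)
export function byPriorityDesc(a: Rule, b: Rule): number {
  return b.priority - a.priority;
}
```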
    for (const raw of paths) {
      const abs = resolve(options.baseDir, raw);
      walk(abs, extensions, collected);
    }
Deduplicate files when collecting overlapping target paths
collectFiles walks each configured path independently and appends every discovered file without deduplication. If a target includes overlapping entries (for example both a directory and a file inside it, or the same path twice), the same source file is injected multiple times into the extraction prompt. That needlessly increases token usage and can skew extraction outcomes because duplicated evidence is overrepresented.
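A sketch of the suggested deduplication that keeps the per-path walk intact and removes repeats afterwards. The Walker signature mirrors the walk(abs, extensions, collected) call in the snippet above but is an assumption. Deduplicating by resolved path will not catch two different symlinked paths to one file; keying on fs.realpathSync would:

```typescript
import { resolve } from "node:path";

// Hypothetical signature: walk() appends every matching file under absPath.
type Walker = (absPath: string, extensions: readonly string[], collected: string[]) => void;

interface CollectOptions {
  baseDir: string;
  extensions: readonly string[];
}

// Walk each configured path as before, then keep only the first occurrence
// of each absolute file path, preserving discovery order.
export function collectFiles(
  paths: readonly string[],
  options: CollectOptions,
  walk: Walker,
): string[] {
  const collected: string[] = [];
  for (const raw of paths) {
    walk(resolve(options.baseDir, raw), options.extensions, collected);
  }
  // A Set iterates in insertion order, so first-seen order is kept.
  return Array.from(new Set(collected));
}
```

Deduplicating after the walk keeps the change local to collectFiles; passing a shared Set into walk would also work and avoids building the duplicate list.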
Summary
Terminal-only TypeScript proof-of-concept of the "LLM extracts, rules decide" pattern: the LLM is constrained to producing structured flags from raw signals, and a deterministic rule engine maps those flags to a verdict tier with a fully replayable audit trail. The pattern comes from the AI fact-checking community (see experiments/rule-engine-poc/README.md for the source quote) and fits the OODA orchestrator concept (#502) cleanly: stochasticity is confined to the Orient quadrant; Decide becomes reproducible.
The example rule set encodes the per-stage Definition of Done from docs/quality-framework.md as machine-checkable rules. One DoD bullet, one rule, one audit-trail entry.
Lives under experiments/rule-engine-poc/ as a sandbox, not a formal /spec:start feature folder.
What's included
- src/ - hand-rolled engine (~250 LOC): types, hash, engine, loader, cli, html-report. Pure functions; severity-first verdict (blocked > needs-attention > ready-to-progress > unknown); canonical-JSON + SHA-256 provenance hashes for replay.
- rules/quality-gates.yaml - DoD-as-rules example set covering every stage from Idea through Review.
- fixtures/*.json - five mock Orient-quadrant extractions covering ready, blocked, and needs-attention verdicts.
- test/ - 24 passing vitest tests, including a dedicated reproducibility suite (the strategist-recommended North Star metric: byte-identical replay across runs).
- docs/ - architecture, DSL reference, audit trail (with EU AI Act / ISO 42001 mapping), extension guide, OODA integration.
- research/ - five-angle research wave dispatched in parallel: technical landscape, regulatory auditability, positioning / JTBD / North Star, alternative technical design, risks / critique.
Try it
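The two replay-critical behaviours listed above, severity-first verdicts and canonical-JSON provenance hashes, can be tried in isolation with a sketch like this. It is a hypothetical illustration (the function names are made up, not taken from the POC's src/engine.ts or src/hash.ts), assuming the engine has already reduced the matched rules to a list of verdicts and that hashed inputs are plain parsed JSON:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch, not the POC's src/engine.ts or src/hash.ts.
type Verdict = "blocked" | "needs-attention" | "ready-to-progress" | "unknown";

// Most severe first, in the order the PR description lists.
const SEVERITY_ORDER: readonly Verdict[] = ["blocked", "needs-attention", "ready-to-progress", "unknown"];

// Severity-first: the most severe verdict among matched rules wins,
// regardless of how many less severe rules also matched.
export function decideVerdict(matchedVerdicts: readonly Verdict[]): Verdict {
  for (const tier of SEVERITY_ORDER) {
    if (matchedVerdicts.includes(tier)) return tier;
  }
  return "unknown"; // assumption: no matched rule means no decision
}

// CI contract from the PR description: exit 1 on blocked, 0 otherwise.
export function exitCodeFor(verdict: Verdict): number {
  return verdict === "blocked" ? 1 : 0;
}

// Canonical JSON for plain JSON data: object keys sorted recursively and
// undefined-valued keys dropped (as JSON.stringify does), so logically equal
// inputs serialise to identical bytes.
export function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const entries = Object.keys(record)
      .filter((key) => record[key] !== undefined)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`);
    return `{${entries.join(",")}}`;
  }
  // Primitives; undefined at the top level or inside arrays becomes null.
  return JSON.stringify(value) ?? "null";
}

// SHA-256 over the canonical bytes: the hash changes only when content changes.
export function provenanceHash(value: unknown): string {
  return createHash("sha256").update(canonicalJson(value), "utf8").digest("hex");
}
```

Sorting keys before hashing is what makes replay byte-identical: two extractions that differ only in key order produce the same hash.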
The CLI exits 1 on blocked, 0 otherwise, so it can serve as a CI gate.
What this is not
json-rules-engine behind a thin adapter when this graduates from POC.
research/05-risks-critique.md for the three Riskiest Assumption Tests the critic recommends running before any production wiring.
Test plan
- npm test - 24/24 vitest tests pass (unit, reproducibility, operator coverage, loader validation)
- npx tsc --noEmit - typecheck clean
- 1 for blocked, 0 for non-blocked
- experiments/rule-engine-poc/research/
Related
- docs/quality-framework.md - source of the example rule set
Generated by Claude Code