feat(experiments): rule engine POC for OODA Decide quadrant #525

Merged
Luis85 merged 41 commits into develop from claude/rule-engine-poc-gO5yq
May 17, 2026
Conversation

@Luis85
Owner

@Luis85 Luis85 commented May 17, 2026

Summary

Terminal-only TypeScript proof-of-concept of the "LLM extracts, rules decide" pattern — the LLM is constrained to producing structured flags from raw signals, and a deterministic rule engine maps those flags to a verdict tier with a fully replayable audit trail. The pattern comes from the AI fact-checking community (see experiments/rule-engine-poc/README.md for the source quote) and fits the OODA orchestrator concept (#502) cleanly: stochasticity is confined to the Orient quadrant; Decide becomes reproducible.

The example rule set encodes the per-stage Definition of Done from docs/quality-framework.md as machine-checkable rules. One DoD bullet, one rule, one audit-trail entry.

Lives under experiments/rule-engine-poc/ as a sandbox — not a formal /spec:start feature folder.

What's included

  • src/ — hand-rolled engine (~250 LOC): types, hash, engine, loader, cli, html-report. Pure functions; severity-first verdict (blocked > needs-attention > ready-to-progress > unknown); canonical-JSON + SHA-256 provenance hashes for replay.
  • rules/quality-gates.yaml — DoD-as-rules example set covering every stage from Idea through Review.
  • fixtures/*.json — five mock Orient-quadrant extractions covering ready, blocked, and needs-attention verdicts.
  • test/ — 24 passing vitest tests, including a dedicated reproducibility suite (the strategist-recommended North Star metric: byte-identical replay across runs).
  • HTML reporter — self-contained, inline CSS, no JS, no external assets. Drop into Slack, attach to a PR, or pipe to a static-site directory.
  • docs/ — architecture, DSL reference, audit trail (with EU AI Act / ISO 42001 mapping), extension guide, OODA integration.
  • research/ — five-angle research wave dispatched in parallel: technical landscape, regulatory auditability, positioning / JTBD / North Star, alternative technical design, risks / critique.
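
A minimal sketch of the canonical-JSON plus SHA-256 provenance idea from the src/ bullet, assuming that "canonical" means recursively sorted object keys. The names `canonicalJson` and `provenanceHash` are illustrative and are not taken from the POC's source.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level, so that key order in the
// input never changes the bytes that get hashed. (Assumed canonical form.)
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`).join(",")}}`;
}

// Hex-encoded SHA-256 of the canonical form (hypothetical helper name).
function provenanceHash(value: Json): string {
  return createHash("sha256").update(canonicalJson(value), "utf8").digest("hex");
}
```

Two extractions that differ only in key order hash identically under this scheme, which is the property a byte-identical replay check relies on.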

Try it

cd experiments/rule-engine-poc
npm install
npm test                  # 24 passing tests in <1s
npm run demo              # ready-implementation fixture (text report)
npm run demo:html:all     # one HTML report per fixture into reports/

The CLI exits 1 on blocked, 0 otherwise — usable as a CI gate.
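
As a rough illustration of the severity-first ordering and the exit-code mapping, here is a sketch that assumes the engine keeps a per-verdict tally of matched rule weights and picks the most severe tier with a positive total. `pickVerdict` and `exitCodeFor` are hypothetical names.

```typescript
// Tiers listed from most to least severe, as in the PR description.
const VERDICTS = ["blocked", "needs-attention", "ready-to-progress", "unknown"] as const;
type Verdict = (typeof VERDICTS)[number];

// Assumption: the tally maps each tier to the summed weight of matched rules.
function pickVerdict(tally: Record<Verdict, number>): Verdict {
  for (const verdict of VERDICTS) {
    if (tally[verdict] > 0) return verdict; // first (most severe) tier wins
  }
  return "unknown";
}

// Exit 1 on blocked, 0 otherwise, so the CLI can act as a CI gate.
function exitCodeFor(verdict: Verdict): number {
  return verdict === "blocked" ? 1 : 0;
}
```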

What this is not

  • Not a production rule engine. Intentionally minimal. The technical-landscape research recommends json-rules-engine behind a thin adapter when this graduates from POC.
  • Not opinionated about extraction. The LLM half of the pipeline is mocked by fixture JSON. Wiring a real Claude / GPT extraction call is a separate POC.
  • Not yet calibrated. Weights and severity ordering are placeholders. See research/05-risks-critique.md for the three Riskiest Assumption Tests the critic recommends running before any production wiring.

Test plan

  • npm test — 24/24 vitest tests pass (unit, reproducibility, operator coverage, loader validation)
  • npx tsc --noEmit — typecheck clean
  • Every fixture produces the expected verdict (text mode)
  • HTML reporter generates self-contained 27 KB reports for every fixture
  • CLI exit code maps correctly: 1 for blocked, 0 for non-blocked
  • Manual review of the research artifacts under experiments/rule-engine-poc/research/
  • Decide whether to evolve into a Specorator skill / plugin or keep as experiment

Related


Generated by Claude Code

Terminal-only TypeScript POC of the "LLM extracts, rules decide" pattern
from the AI fact-checking community: the LLM is constrained to producing
structured flags from raw signals, and a deterministic rule engine maps
those flags to a verdict tier with a fully replayable audit trail.

Lives under experiments/rule-engine-poc/ as a sandbox (not formal Stage
1-7) and demonstrates the pattern against the repo's own quality
framework. Each rule encodes a Definition of Done item from
docs/quality-framework.md.

What's included:
- src/ - hand-rolled engine (~250 LOC): types, hash, engine, loader,
  cli, html-report. Pure functions; severity-first verdict; canonical
  JSON + SHA-256 provenance hashes for replay.
- rules/quality-gates.yaml - DoD-as-rules example set.
- fixtures/*.json - 5 mock Orient-quadrant extractions covering ready,
  blocked, and needs-attention verdicts.
- test/ - 24 passing tests (vitest), including dedicated reproducibility
  suite (strategist-recommended North Star: byte-identical replay).
- HTML reporter - self-contained, inline CSS, no JS, no external assets.
- docs/ - architecture, DSL reference, audit trail + EU AI Act mapping,
  extension guide, OODA integration.
- research/ - five-angle research wave (technical landscape, regulatory
  auditability, positioning/JTBD, design alternatives, risks/critique).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32d02f485f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread experiments/rule-engine-poc/src/loader.ts
Comment thread experiments/rule-engine-poc/src/loader.ts
- Validate that then.verdict is one of the four known tiers; previously
  a typo like 'blokced' would load successfully and silently degrade
  the rule into a no-op via tally[<unknown-key>] (#525 P1).
- Validate that when.all / when.any / when.not are arrays at load time;
  previously 'any: true' would load and crash at evaluation with
  TypeError on .map (#525 P2).
- Export VERDICTS as a runtime constant from types.ts so the schema
  check has one source of truth alongside the type.
- Three new loader tests cover the two failure modes plus a typo case.
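
The two load-time checks in this commit could look roughly like the sketch below. It assumes a rule object with `when` and `then` keys; `validateRuleShape` and the error strings are hypothetical, and only the `VERDICTS` constant name comes from the commit message.

```typescript
// Runtime constant shared by the type and the schema check.
const VERDICTS = ["blocked", "needs-attention", "ready-to-progress", "unknown"] as const;
type Verdict = (typeof VERDICTS)[number];

function isVerdict(value: unknown): value is Verdict {
  return typeof value === "string" && (VERDICTS as readonly string[]).includes(value);
}

interface RawRule {
  when?: Record<string, unknown>;
  then?: { verdict?: unknown };
}

// Returns a list of human-readable problems; an empty list means the shape is OK.
function validateRuleShape(rule: RawRule): string[] {
  const errors: string[] = [];
  if (!isVerdict(rule.then?.verdict)) {
    errors.push(`unknown then.verdict: ${String(rule.then?.verdict)}`);
  }
  for (const group of ["all", "any", "not"] as const) {
    const value = rule.when?.[group];
    if (value !== undefined && !Array.isArray(value)) {
      errors.push(`when.${group} must be an array`);
    }
  }
  return errors;
}
```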
Owner Author

Luis85 commented May 17, 2026

@codex review

Addressed both findings from the previous round:

  • P1 — loader now rejects unknown then.verdict values against the VERDICTS runtime constant (loader.ts).
  • P2 — loader now rejects non-array when.all / when.any / when.not groups at load time (loader.ts).

Three new tests in test/loader.test.ts cover the two failure modes plus the typo case. npm test is 27/27 green.


Generated by Claude Code


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6625ee1108

Comment thread experiments/rule-engine-poc/src/engine.ts
Comment thread experiments/rule-engine-poc/src/engine.ts
- engine: 'exists' now participates in the AND-chain instead of
  short-circuiting, so 'exists: true' combined with 'eq'/'ne'/'gt'/'lt'/
  'in'/'regex' correctly requires every operator to match (#525 P2).
  'exists: false' still tolerates flag absence without surfacing the
  'flag missing' reason; this is the only short-circuit retained.
- loader: empty 'when.all' / 'when.any' / 'when.not' arrays are now
  rejected at load time. Previously 'any: []' was vacuously satisfied
  by the length>0 guard in evaluateWhen, allowing a typo to fire a
  blocking rule unintentionally (#525 P1).
- ENGINE_VERSION bumped 0.1.0 -> 0.2.0 because the exists+value-op
  interaction is a semantic change. Per docs/extending.md, a version
  bump is the auditor's signal that prior verdicts may not replay.
- Six new tests: empty when.any / when.all rejection, four exists-AND
  cases including exists:false standalone.
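
A simplified sketch of the AND-chain semantics described in this commit, covering only `exists`, `eq`, `gt`, and `lt`. The `Condition` shape and the function body are assumptions for illustration and are not the engine's actual code.

```typescript
interface Condition {
  flag: string;
  exists?: boolean;
  eq?: unknown;
  gt?: number;
  lt?: number;
}

interface ConditionResult {
  matched: boolean;
  reason?: string;
}

function evaluateCondition(cond: Condition, flags: Record<string, unknown>): ConditionResult {
  const present = Object.prototype.hasOwnProperty.call(flags, cond.flag);
  // The one retained short-circuit: exists:false matches on absence and does
  // not surface a "flag missing" reason.
  if (cond.exists === false) return { matched: !present };
  if (!present) return { matched: false, reason: "flag missing" };

  const value = flags[cond.flag];
  const checks: boolean[] = [];
  if (cond.exists === true) checks.push(true); // presence already established
  if ("eq" in cond) checks.push(value === cond.eq);
  if (cond.gt !== undefined) checks.push(typeof value === "number" && value > cond.gt);
  if (cond.lt !== undefined) checks.push(typeof value === "number" && value < cond.lt);
  // Every declared operator must hold. Note that an empty list passes
  // vacuously, which is why a loader has to reject operator-less conditions.
  return { matched: checks.every(Boolean) };
}
```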
Owner Author

Luis85 commented May 17, 2026

@codex review

Round 2 findings addressed in 00a43e7:

  • P1 — empty when.all / when.any / when.not arrays now rejected at load time; the "omit the group instead" path forces the author to express intent explicitly.
  • P2: exists no longer short-circuits when combined with value operators. It participates in the AND-chain alongside eq / ne / gt / lt / in / regex. exists: false retains its tolerance for flag absence (the only remaining short-circuit, and only when no value-op coexists).

ENGINE_VERSION bumped 0.1.0 → 0.2.0 because the exists + value-op interaction is a behavioural change — per docs/extending.md, the version bump is the auditor's signal that prior verdicts may not replay.

Six new tests; suite is 33/33 green.


Generated by Claude Code


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00a43e7630

Comment thread experiments/rule-engine-poc/src/engine.ts
Comment thread experiments/rule-engine-poc/src/loader.ts
- loader: each condition must declare at least one supported operator.
  Previously a typo like { flag: 'x', eqq: true } would load and then
  always-match at runtime, silently flipping verdicts (#525 round 3 P2).
- loader: 'exists: false' combined with value operators is rejected at
  load time. The combination has no meaningful semantics — an absent
  flag has no value for eq/ne/gt/lt/in/regex to apply to (#525 round 3 P1).
- loader: condition objects without a 'flag' string are rejected.
- engine: code comment clarified to reflect the now-enforced invariant
  that exists:false is only valid alone.
- Three new loader tests cover the three rejection paths.
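
The three rejection paths above might be expressed as in the following sketch. The operator list is taken from the commit messages; `validateCondition` and its error strings are hypothetical.

```typescript
// Operators named in the PR: exists plus the six value operators.
const OPERATORS = ["exists", "eq", "ne", "gt", "lt", "in", "regex"] as const;

function validateCondition(cond: Record<string, unknown>): string[] {
  const errors: string[] = [];
  if (typeof cond.flag !== "string") {
    errors.push("condition needs a 'flag' string");
  }
  const used = OPERATORS.filter((op) => op in cond);
  if (used.length === 0) {
    // Catches typos such as { flag: 'x', eqq: true }.
    errors.push("condition declares no supported operator");
  }
  if (cond.exists === false && used.some((op) => op !== "exists")) {
    // An absent flag has no value for eq/ne/gt/lt/in/regex to apply to.
    errors.push("exists: false cannot be combined with value operators");
  }
  return errors;
}
```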

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 179b883418

Comment thread experiments/rule-engine-poc/src/loader.ts Outdated
Comment thread experiments/rule-engine-poc/src/engine.ts
Comment thread experiments/rule-engine-poc/src/engine.ts Outdated
claude added 2 commits May 17, 2026 11:56
- Reject 'then.weight' values that are <=0, infinite, or NaN. A
  'blocked' rule with weight 0 would contribute nothing to the tally
  and silently bypass the gate (#525 round 4 P1).
- Reject non-array 'in' operators at load time. Previously a typo like
  'in: foo' would load and crash at evaluation when .some() is called
  on a non-array (#525 round 4 P2).
- Validate regex patterns at load time. Previously a malformed regex
  like 'regex: "["' would load and abort the entire decision run with
  a SyntaxError when new RegExp() throws during evaluation (#525 round
  4 P2).
- Updated existing 'missing then.weight' test to match the new
  'invalid then.weight' error message.
- Four new loader tests cover the three rejection paths plus a
  negative-weight case.
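
A small sketch of the weight, `in`, and regex checks from the loader-hardening commit. The helper names and return conventions are assumptions; only the rules being enforced come from the commit message.

```typescript
// then.weight must be a positive, finite number.
function isValidWeight(weight: unknown): boolean {
  return typeof weight === "number" && Number.isFinite(weight) && weight > 0;
}

// The 'in' operator must be an array so that .some() is safe at evaluation.
function isValidInOperand(operand: unknown): boolean {
  return Array.isArray(operand);
}

// Compile the pattern once at load time; return an error message or null.
function regexError(pattern: unknown): string | null {
  if (typeof pattern !== "string") return "regex must be a string";
  try {
    new RegExp(pattern);
    return null;
  } catch (err) {
    return `invalid regex: ${(err as Error).message}`;
  }
}
```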
End-to-end flow now drives the POC: user adds content to the project,
runs npm run plan to generate AI extraction prompts, pastes a prompt
into Claude/ChatGPT, saves the JSON to extractions/, runs npm run
report to render HTML and open it in the browser.

Architecture:
- rule-engine.config.json declares targets, each with id + label +
  paths (files or directories, walked deterministically).
- rules/flag-schema.yaml documents every flag the rule set may
  reference (type + description + example); the contract between AI
  extractor and engine.
- src/plan.ts walks target paths, collects file contents with 8 KB
  truncation per file, bundles role + schema + rules + source into
  a single prompt per target.
- src/report.ts loads extractions per target, runs the engine, renders
  the existing HTML reporter, best-effort opens the first report in
  the OS default browser. Exit 0/1/2 = ok/blocked/missing.
- Prompt-builder follows analyst research (research/10): XML-tag
  structure with markdown redundancy, explicit forbidden-fields list
  (verdict, assessment, conclusion, summary, recommendation,
  rationale, analysis), open <output> tag as a forcing function.
- Original single-shot src/cli.ts preserved as a fixture-testing
  escape hatch.

20 new tests cover the new modules (config, flag-schema, context,
prompt-builder). Suite total: 60 tests, all passing.

Research wave 2 (5 background agents) wrote 5 new artifacts under
research/ covering independent review, workflow failure modes,
workflow architecture, UX friction, and extraction prompt patterns.

POC stays sandbox-scoped under experiments/rule-engine-poc/. No wiring
into specs/, /spec:status, plugins/, or the main repo.
Owner Author

Luis85 commented May 17, 2026

@codex review

Two updates pushed:

Codex round 4 (45577c4) — loader hardening:

  • then.weight must be a positive finite number (rejects 0, negative, NaN). A blocked rule with weight: 0 would have contributed nothing to the tally and silently bypassed the gate.
  • in operator must be an array at load time.
  • Regex patterns are compile-validated at load time so a malformed regex like regex: "[" fails fast instead of aborting evaluation mid-run.

Plan/report workflow (271702e) — the POC's primary flow:

  • rule-engine.config.json declares targets with named paths.
  • npm run plan walks paths, bundles content + flag schema + rules into per-target extraction prompts under prompts/.
  • User pastes prompt into Claude/ChatGPT, saves JSON to extractions/<id>.json.
  • npm run report runs the engine and writes HTML to reports/<id>.html, best-effort opens browser.
  • Exit codes: 0 no blockers, 1 at least one blocked, 2 missing/malformed extraction.
  • rules/flag-schema.yaml documents every flag the rule set may reference; the contract between extractor and engine.
  • Prompt-builder follows research recommendations: XML structure with markdown redundancy, explicit forbidden-fields list, open <output> tag as a forcing function.

Five new research artifacts (research/06–10) cover the independent review, workflow failure modes, alternative architecture, UX audit, and prompt-engineering patterns. Suite is 60/60 green.

POC stays self-contained under experiments/rule-engine-poc/. No wiring into specs/, /spec:status, or plugins/.


Generated by Claude Code


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 271702eb54

Comment thread experiments/rule-engine-poc/src/open-browser.ts Outdated
Comment thread experiments/rule-engine-poc/src/loader.ts
…Codex round 5)

- open-browser: openInBrowser is now async and waits briefly for the
  spawn or error event before resolving. Previously it returned true
  immediately, so report.ts printed "opened in browser" even when
  xdg-open was missing in a headless container — misleading users
  during the primary plan->report flow (#525 round 5 P2).
- report.ts: awaits openInBrowser and prints the correct status line.
  Verified in this sandbox: now prints "could not spawn a browser;
  open manually: file://..." when no browser is installed.
- loader: 'exists' operator must be a boolean. Previously a typo like
  'exists: "false"' would load and then silently never match because
  evaluateCondition compares boolean to string (#525 round 5 P2).
- loader: 'gt' and 'lt' operators must be numbers at load time, for
  consistency with the other type checks (engine already failed
  matching at runtime, but failing at load is preferred).
- Two new loader tests cover the exists-boolean and gt-number paths.
Owner Author

Luis85 commented May 17, 2026

@codex review

Round 5 findings addressed in 4e0d0ce:

  • P2 (open-browser): openInBrowser is now async. It waits for either the spawn success event or the error event before resolving, with a 250 ms safety-net timeout. report.ts awaits it. Verified in this sandbox: now correctly prints could not spawn a browser; open manually: file://… when no browser is installed, instead of falsely claiming success.
  • P2 (loader exists): exists operator must be a boolean at load time. Previously a typo like exists: "false" (string) would load and then silently never match at runtime because evaluateCondition compares boolean to string.
  • Bonus consistency: gt and lt operators must be numbers at load time. The engine already failed matching at runtime for these, but failing at load is preferred.

Two new loader tests; suite is 62/62 green.
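
The wait-for-spawn-or-error behaviour could be sketched as follows. `trySpawn` is a hypothetical name, and resolving to `false` when the 250 ms timeout fires is an assumption, since the comment does not say what the timeout resolves to.

```typescript
import { spawn } from "node:child_process";

// Resolve true once the child process has actually spawned, false on a spawn
// error (for example ENOENT when the opener binary is missing).
function trySpawn(command: string, args: string[], timeoutMs = 250): Promise<boolean> {
  return new Promise((resolve) => {
    let settled = false;
    const finish = (ok: boolean): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve(ok);
    };
    // Safety net so the caller never waits indefinitely (assumed to mean failure).
    const timer = setTimeout(() => finish(false), timeoutMs);
    try {
      const child = spawn(command, args, { stdio: "ignore", detached: true });
      child.once("spawn", () => {
        child.unref(); // do not keep the CLI alive for the browser process
        finish(true);
      });
      child.once("error", () => finish(false));
    } catch {
      finish(false);
    }
  });
}
```

A caller would `await trySpawn(...)` and choose between the "opened in browser" and "open manually" status lines based on the result.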


Generated by Claude Code


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4e0d0cef09

Comment thread experiments/rule-engine-poc/src/open-browser.ts Outdated
Comment thread experiments/rule-engine-poc/src/report.ts Outdated
Comment thread experiments/rule-engine-poc/src/prompt-builder.ts Outdated
claude added 2 commits May 17, 2026 12:11
…und 6)

- open-browser: spawn cmd /c start "" on Windows. 'start' is a cmd.exe
  built-in, not a standalone exe, so spawn('start', ...) raised ENOENT
  and browser open always failed for Windows users (#525 round 6 P2).
- report: validate the parsed extraction is a plain object before
  passing to evaluate. Previously valid JSON like null / [] / "text"
  would crash inside hasOwnProperty.call on null instead of producing
  a controlled error (#525 round 6 P2).
- prompt-builder: pick a fence length longer than any backtick run in
  the source content. Many repo markdown files contain ``` blocks
  which would prematurely close the prompt's outer fence and corrupt
  the AI extraction prompt (#525 round 6 P2).
- New pickFence helper is exported and unit-tested; prompt-builder
  test asserts a 5-tick fence is emitted for content with a 4-tick
  run.
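
One way a `pickFence` helper could work is sketched below. The commit names the helper and its 4-tick to 5-tick behaviour; the implementation here, including the minimum of three backticks, is an assumption.

```typescript
// Return a backtick fence one longer than the longest backtick run in the
// content, and never shorter than the usual three.
function pickFence(content: string): string {
  const runs = content.match(/`+/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return "`".repeat(Math.max(3, longest + 1));
}
```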
Closes the schema-miss laundering failure mode flagged by the
critic (research/07) and analyst (research/10): bad LLM output now
fails loudly instead of becoming a reproducible-looking verdict.

What's new:
- src/validate.ts: validateExtraction(flags, schema, options) returns
  errors + warnings. Checks: forbidden fields (verdict, assessment,
  conclusion, summary, recommendation, rationale, analysis), unknown
  fields (warning), type mismatches (boolean/number/string/string[]),
  non-finite numbers, disallowed_values violations, prompt-hash
  mismatch (when expectedPromptHash provided — wired up in the next
  commit).
- src/validate-cli.ts: 'npm run validate' surfaces issues per target,
  exits 0 (clean) / 1 (errors) / 2 (missing/unparseable extraction).
- src/report.ts: validates each extraction before evaluating. Refuses
  to render when validation fails. --skip-validate flag for escape.
- prompt-builder imports FORBIDDEN_FIELDS from validate.ts so the
  forbidden list lives in exactly one place.
- 12 new validate tests; suite total 77 passing.

Verified end-to-end: a polluted extraction with verdict+type-mismatch
+unknown-flag is caught by both validate and report.
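
A reduced sketch of the validate gate, covering forbidden fields, unknown fields, and type mismatches only. The forbidden list is copied from the commit message; `checkExtraction`, the `Issue` shape, and the issue codes are hypothetical and simpler than the real `validateExtraction(flags, schema, options)`.

```typescript
const FORBIDDEN_FIELDS = [
  "verdict", "assessment", "conclusion", "summary",
  "recommendation", "rationale", "analysis",
] as const;

type FlagType = "boolean" | "number" | "string" | "string[]";

interface Issue {
  level: "error" | "warning";
  code: string; // illustrative codes, not the POC's
  field: string;
}

function matchesType(value: unknown, type: FlagType): boolean {
  if (type === "string[]") return Array.isArray(value) && value.every((v) => typeof v === "string");
  if (type === "number") return typeof value === "number" && Number.isFinite(value);
  return typeof value === type;
}

function checkExtraction(flags: Record<string, unknown>, schema: Record<string, FlagType>): Issue[] {
  const issues: Issue[] = [];
  for (const [field, value] of Object.entries(flags)) {
    const expected = schema[field];
    if ((FORBIDDEN_FIELDS as readonly string[]).includes(field)) {
      issues.push({ level: "error", code: "forbidden-field", field });
    } else if (expected === undefined) {
      issues.push({ level: "warning", code: "unknown-field", field });
    } else if (!matchesType(value, expected)) {
      issues.push({ level: "error", code: "type-mismatch", field });
    }
  }
  return issues;
}
```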

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0aaa114eb6

Comment thread experiments/rule-engine-poc/src/cli.ts Outdated
Comment thread experiments/rule-engine-poc/src/context.ts
claude added 2 commits May 17, 2026 12:16
…ard)

Closes the stale-extraction failure mode flagged by the critic
(research/07): users edit source files between plan and report and
the old JSON still renders a confident verdict. The report now
refuses extractions produced against a different prompt.

What's new:
- src/prompt-hash.ts: computePromptHash hashes the LOAD-BEARING
  inputs (target id, per-file sha, rule hashes, schema content),
  not the rendered prompt text. Cosmetic edits to the prompt
  template don't invalidate extractions; real source changes do.
- src/plan.ts: emits sidecar prompts/<id>.hash.txt and embeds the
  hash into the prompt as (a) a top-of-file HTML comment, (b) an
  explicit rule asking the LLM to copy it into __prompt_hash, and
  (c) the response template's first key.
- src/validate.ts: enforces expectedPromptHash when provided.
  Surfaces missing-prompt-hash and stale-extraction error codes
  with re-run instructions.
- src/report.ts + src/validate-cli.ts: read the sidecar hash if it
  exists; absence falls back to the pre-binding behaviour for
  backwards-compat with fixtures.
- 6 new prompt-hash tests; suite total 83 passing.

End-to-end verified in this sandbox:
  matching hash    -> exit 0 (ready-to-progress)
  stale hash       -> exit 2 with explicit error
  missing field    -> exit 2 with explicit error
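
The binding could be sketched like this, assuming the load-bearing inputs are passed in as already-computed hashes. `PromptInputs`, the serialization layout, and `checkBinding` are illustrative; `__prompt_hash` and the two error codes come from the commit message.

```typescript
import { createHash } from "node:crypto";

interface PromptInputs {
  targetId: string;
  fileHashes: Record<string, string>; // path -> per-file sha (assumed shape)
  ruleHashes: string[];
  schemaContent: string;
}

const sha256 = (text: string): string =>
  createHash("sha256").update(text, "utf8").digest("hex");

// Hash the inputs that matter, not the rendered prompt text, so cosmetic
// template edits leave the hash unchanged.
function computePromptHash(inputs: PromptInputs): string {
  const files = Object.entries(inputs.fileHashes).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return sha256(JSON.stringify([inputs.targetId, files, inputs.ruleHashes, sha256(inputs.schemaContent)]));
}

type BindingStatus = "ok" | "missing-prompt-hash" | "stale-extraction";

function checkBinding(extraction: Record<string, unknown>, expectedHash: string): BindingStatus {
  const claimed = extraction["__prompt_hash"];
  if (typeof claimed !== "string") return "missing-prompt-hash";
  return claimed === expectedHash ? "ok" : "stale-extraction";
}
```

Passing a freshly recomputed hash as `expectedHash`, rather than a value read from disk, is what makes an edited source file invalidate an old extraction.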
Closes a cluster of small findings from Codex round 7 (cli, context)
and reviewer S2/S3 (engine, loader audit-trail honesty):

- cli.ts: validate JSON root is a plain object before evaluate. Same
  guard as report.ts; previously valid-JSON-but-not-object input would
  crash inside hasOwnProperty.call (#525 round 7 P2).
- context.ts: use lstat instead of stat and skip symlinks entirely.
  Previously a symlink cycle (a/sub/loop -> a/) would recurse until
  stack overflow during plan (#525 round 7 P2).
- engine.ts evaluateCondition: gt/lt against a non-number and regex
  against a non-string now set an explicit reason ('expected number
  for gt, got string') so the audit trail explains *why* the
  condition didn't match. Reviewer S2 — previously these set
  matched=false with no reason.
- engine.ts evaluateWhen: when.not against a missing flag no longer
  silently fires. The inner condition's 'flag missing in extraction'
  reason is preserved through the not clause; the rule fails to
  match rather than inverting absence into success. Reviewer S2.
- loader.ts: duplicate rule ids in a single rule file are rejected
  at load time. Previously a second rule with the same id loaded
  silently and the engine evaluated it independently. Reviewer S3.
- Five new tests: gt-non-number reason, regex-non-string reason,
  not-missing flag, duplicate ids, symlink cycle handling.

Suite: 88/88 passing.
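
The `when.not` change can be illustrated with a sketch that operates on already-evaluated inner results. It assumes a `not` group matches only when none of its inner conditions match; `evaluateNot` is a hypothetical name, and the reason string is the one quoted in the commit message.

```typescript
interface ConditionResult {
  matched: boolean;
  reason?: string;
}

const MISSING = "flag missing in extraction";

// A missing flag is not evidence of anything, so it must not be inverted
// into a match. The inner reason is carried through to the audit trail.
function evaluateNot(inner: ConditionResult[]): ConditionResult {
  const missing = inner.find((result) => result.reason === MISSING);
  if (missing) return { matched: false, reason: MISSING };
  return { matched: inner.every((result) => !result.matched) };
}
```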
Owner Author

Luis85 commented May 17, 2026

@codex review

Big push since the last review. Three commits:

3dc1034 — validate gate between plan and report:

  • npm run validate parses pasted JSON, checks the flag schema, refuses forbidden fields (verdict, assessment, etc.).
  • report runs validation before evaluate and refuses bad input; --skip-validate escape.
  • Closes the schema-miss laundering failure mode flagged by the critic/analyst research.

a9b1db9 — prompt-extraction binding (stale-extraction guard):

  • plan hashes load-bearing inputs (target id, per-file sha, rule hashes, schema), embeds the hash in the prompt and a sidecar.
  • LLM is asked to copy the hash into __prompt_hash.
  • validate and report enforce the hash; stale extractions are rejected with a clear re-run message.
  • Verified end-to-end: matching=0, stale=2, missing=2.

cc00478 — polish bundle (Codex round 7 + reviewer findings):

  • cli.ts validates JSON root is an object.
  • context.ts uses lstat + skips symlinks so symlink cycles don't blow the stack.
  • engine.ts: gt/lt/regex on wrong-typed observations now set an explicit reason in the audit trail.
  • engine.ts: when.not against a missing flag no longer silently inverts absence into success.
  • loader.ts: duplicate rule ids rejected at load time.

Plus an adoption re-research (research/11) concluding stay hand-rolled — every candidate fails ≥3 of the 6 locked-in constraints; the wrapper would be the engine. Recommendation to record an ADR at POC graduation.

Suite: 88/88 green across 10 files. Adds tests for everything above plus the new modules (validate, prompt-hash).


Generated by Claude Code

claude added 4 commits May 17, 2026 12:22
Closes the 'paste the sidecar to bypass staleness' cheat the critic
flagged as the highest-leverage fix in the post-validate workflow.

Previously: report.ts and validate-cli.ts read prompts/<id>.hash.txt
(plain text) and trusted its value. An operator under deadline
pressure could open the sidecar, copy the hash into the extraction's
__prompt_hash field, and silently re-render a stale verdict.

Now: report and validate-cli recompute the hash from current source
files + rules + schema (same code path as plan.ts). The sidecar still
gets written for diagnostic / debugging purposes, but it is never the
authority for whether an extraction is stale. A real change to any
source file invalidates the extraction automatically.

Smoke-tested in this sandbox:
- Source unchanged, paste-the-sidecar cheat -> exit 0 (correct;
  extraction is still valid against current source).
- Source mutated, same paste-the-sidecar cheat -> stale-extraction
  error with both the pasted hash and the recomputed hash printed.
- research/14 (critic): three new failure modes the validate gate
  opened; ranked --skip-validate, sidecar-paste cheat, and
  reproducibility theatre. Highest-leverage fix already landed in
  the previous commit.
- research/15 (sre): CI integration sketch with concrete cost math
  (~$0.56/target, $1,700/month at 20 PRs/day on Opus 4.7) and a
  Day-1/30/90 operational milestones path.
- research/16 (user-researcher): 5-segment JTBD switch interview
  plan with sequencing (mine demand signal first, S1 indie devs
  next, fail fast before S2-S5), full sample script, RAT integration.
- research/12 (reviewer): independent re-review at HEAD. Verdict
  pass-with-findings. S2-1 (docs drift: workflow.md still lists
  validate gate as 'not yet here' despite shipping), S2-2 (sidecar
  deletion bypasses prompt-hash binding entirely), S2-3 (--skip-
  validate is undocumented), and an S3 cluster on test count drift,
  HTML provenance, and type-mismatch error messages.
- Fix two typos caught by CI spell check (typos v1.46.0):
  research/16 'pre-empted' -> 'confirmed' (reads more clearly anyway),
  research/12 'ci_passsing' -> 'ci_passingx' (illustrative typo
  recast to avoid typos-tool false positive).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0509c11445

Comment thread experiments/rule-engine-poc/src/config.ts
Comment thread experiments/rule-engine-poc/src/context.ts Outdated
claude added 2 commits May 17, 2026 12:34
…grams

Replaces the engine-internals-focused architecture.md with a
comprehensive system view covering:
- System overview (component flowchart)
- User flow (sequence diagram across plan/AI/validate/report)
- Data flow (annotated with data shapes at each seam)
- Engine internals (evaluate algorithm + per-condition + severity picker)
- Validate gate + prompt-hash binding (sequence)
- OODA mapping (Observe/Orient/Decide/Act with stochasticity boundary)
- Module dependency graph (16 src/ modules)
- Why these shapes (design choices + research refs)

Seven Mermaid diagrams. docs/README.md now points to architecture.md
as the start-here entry.
- config.ts: target ids must match /^[A-Za-z0-9][A-Za-z0-9_-]*$/.
  Previously a target id like '../escape' or 'foo/bar' was accepted and
  later interpolated into prompts/<id>.md, extractions/<id>.json,
  reports/<id>.html — at best ENOENT, at worst write outside the
  workspace (#525 round 8 P2).
- context.ts: extract truncateToBytes() that walks back to a UTF-8
  codepoint boundary. Previously slice(0, maxBytes) counted UTF-16
  code units, so CJK / emoji-heavy markdown could emit prompt blocks
  up to 3x the advertised 8 KB cap (#525 round 8 P2).
- 7 new tests cover the three slug rejection paths and the multibyte
  truncation invariant.

Suite: 95/95 passing.
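
A sketch of the two fixes in this commit. The regular expression is the one quoted above; the body of `truncateToBytes` is an assumed implementation of "walk back to a UTF-8 codepoint boundary", not the POC's code.

```typescript
// Target ids are later interpolated into file paths, so only slug-like ids pass.
const TARGET_ID = /^[A-Za-z0-9][A-Za-z0-9_-]*$/;

function truncateToBytes(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let end = maxBytes;
  // UTF-8 continuation bytes have the form 10xxxxxx. If the first excluded
  // byte is a continuation byte, the cut is mid-codepoint, so step back until
  // the cut lands on a lead byte.
  while (end > 0 && (bytes[end]! & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}
```

Each of the three characters in "日本語" encodes to three bytes, so a 4-byte cap keeps only the first character rather than emitting a broken sequence.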
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

claude added 11 commits May 17, 2026 12:57
Three rendered reports (ready, blocked, needs-attention) for the agents to inspect when reviewing report readability.

The sample reports under
experiments/rule-engine-poc/research/sample-reports/ embed 12-char
rule content-hash prefixes that randomly trip typos rules (e.g.,
'afe...' -> 'safe'). Excluding the folder is consistent with the
existing pattern that allow-lists specific commit-SHA fragments.

Also commits research/17 (ux-designer pass on the rendered HTML
reports) — top finding is that the audit trail buries the matched
rules among ~21 'did not match' siblings; recommended a 'What fired'
section + collapse-by-default for skipped rules.

Product-strategist pass on the HTML report as a downstream-shared
artifact. Three findings:
- The report is one artifact serving six first-fields (PR reviewer,
  PM, EM, QA, compliance, auditor). Recommend one HTML with
  re-stacked sections rather than reader-specific exports — keep
  the 'one artifact, many destinations' moat.
- Highest-leverage change: expand action slugs ('kick-ci',
  'request-reviewer') to human sentences via an actions[].human
  field on the rule schema. Promote the 'verified' prompt-hash
  badge next to the verdict.
- Introduce label_set config (default 'dev'; 'pm', 'qa',
  'compliance' as presentational overrides) so headline labels
  match the reader's vocabulary.

Brand-reviewer pass on the rendered HTML report. Verdict:
pass-with-findings; not S1-blocking while the POC stays under
experiments/, but would block on the promotion-to-skill step
flagged in research/13.

Findings:
- On-temperament (no emoji / gradients / icons; ASCII [+]/[-]/[?]
  markers are correctly monospace-as-iconography; restrained density).
- Off-token: 18 distinct literal hex values, literal -apple-system /
  SFMono-Regular font stacks, page background near-white instead of
  Specorator cream var(--paper).
- Voice close but section headers are bare labels rather than
  sentence-case-with-period declaratives; 'Suggested actions' is
  passive against Specorator's imperative voice.
- Open decision: Specorator has no red token. blocked tier currently
  uses literal #fdecea / #d8281b / #7a160d. ADR-shaped choice before
  graduation: extend colors_and_type.css, repurpose --soft-orange and
  rename the tier 'at-risk', or stay literal until packaged.
…h/21)

Critic pass on the rendered HTML report as a communication artifact.
Three findings:

- Visual hierarchy contradicts semantic model: severity-first is
  invisible in the weighted-tally widget (reads as a horse race),
  alphabetically-sorted action list silently asserts a priority the
  engine refuses to give, cond--missing vs cond--miss are visually
  distinguished but never named (colour-blind readers lose the signal).
- 'verified' badge is a trust-calibration trap — green pill reads as
  'extraction verified' to an auditor when it only means 'bound to
  current inputs'. Compounded by --skip-validate runs producing HTML
  indistinguishable from validated ones (research/14 risk 1 leaks
  into the report layer).
- Most dangerous skim path: blocker-by-absence. A high-priority
  blocker rule whose input flag is missing from the extraction simply
  doesn't fire; neither verdict tile nor any header-level summary
  tells the reader 'N higher-priority rules were un-evaluable'.
- 3 RATs proposed (verdict-tile-alone, action-list-as-priority,
  'verified' interpretation). Default no-go if any fail.

Analyst pass on the HTML report from a regulator's reading perspective + 2026 benchmark against LangSmith / Inspect / W&B Weave / sklearn / model-card conventions. Closes the open item from research/02 (human-readable rationale presentation).

New sidecar mapping action slugs to imperative human sentences so
the HTML report can render readable guidance instead of bare slugs.

- rules/action-glossary.yaml: 28 entries covering every action used
  in rules/quality-gates.yaml, with optional urgency + category
  metadata. Imperative voice per Specorator brand.
- src/action-glossary.ts: loader + diff-coverage helper, mirroring
  src/flag-schema.ts conventions.

Wiring into config.ts and the HTML renderer happens in subsequent
commits when Agents A2/B finish their slices. 123/123 tests still
passing — no behaviour change yet.
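A rough sketch of what a glossary coverage check could look like. The entry shape (`human`, `urgency`, `category`), the `Map` representation, and the function names are assumptions for illustration; src/action-glossary.ts may differ.

```typescript
// Hypothetical glossary entry shape; the field names are assumptions.
interface GlossaryEntry {
  human: string;
  urgency?: string;
  category?: string;
}

interface Coverage {
  missing: string[]; // slugs used by rules but absent from the glossary
  unused: string[];  // glossary slugs that no rule references
}

// Compare the action slugs used by the rule set against the glossary keys.
function diffCoverage(
  ruleActions: Iterable<string>,
  glossary: ReadonlyMap<string, GlossaryEntry>,
): Coverage {
  const used = new Set<string>(ruleActions);
  const missing = Array.from(used).filter((slug) => !glossary.has(slug)).sort();
  const unused = Array.from(glossary.keys()).filter((slug) => !used.has(slug)).sort();
  return { missing, unused };
}

// Render-side lookup: show the bare slug when no sentence is available.
function humanFor(slug: string, glossary: ReadonlyMap<string, GlossaryEntry>): string {
  return glossary.get(slug)?.human ?? slug;
}
```

Keeping the lookup on the render side means editing a sentence cannot alter what the engine computes.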
… A complete)

Agent A's RALPH loop completed. Action glossary is now reachable via
the config (still optional — no behaviour change for callers that
don't set actionGlossary):

- src/config.ts: optional 'actionGlossary' string field on RawConfig
  resolved to 'actionGlossaryPath' on ResolvedConfig, same pattern as
  flagSchema.
- rule-engine.config.json: points at rules/action-glossary.yaml.
- test/action-glossary.test.ts: 12 tests covering loader validation,
  diff-coverage, real-file coverage of rules/quality-gates.yaml, and
  sentence-shape invariants.

Also picks up Agent C's in-flight architecture.md updates: system
overview + data flow diagrams now show the glossary node (read only
by the renderer) and the new HTML report sections (system-identity
header, what fired, reproduce block, audit trail with non-matched
collapsed). The renderer itself (Agent B) is still in flight.

Suite: 135/135 passing (123 baseline + 12 new). 21 unique action
slugs in rules/quality-gates.yaml — all mapped in the glossary.
…xtending

- audit-trail.md: Mapping to EU AI Act table updated to credit the
  HTML report (what-fired with human sentences, system-identity
  header, tier glossary, reproduce block) as the Art. 13
  human-readable rationale surface. Closes research/02 open item
  about explainability presentation.
- workflow.md: still in flight by agent C — current commit picks
  up partial edits.
- extending.md: new 'Authoring action human sentences' section
  documents rules/action-glossary.yaml as a render-only sidecar
  (engine never reads it; editing sentences cannot change a verdict).

Agent B (HTML report rebuild) and Agent C (README sync) are still
running their RALPH loops. This commit snapshots the current
on-disk state so the working tree stays clean between iterations:

- src/html-report.ts: partial changes from agent B (rebuild for
  research wave 4 findings). 135/135 tests still passing — the
  partial state is internally consistent even if not yet feature
  complete.
- src/report.ts: corresponding plumbing changes from agent B.
- README.md: agent C in-flight test-count + file-map sync.

Will be superseded by the next commit when both agents report
final.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a479ada85d

Comment thread experiments/rule-engine-poc/src/cli-shared.ts
Comment thread experiments/rule-engine-poc/src/open-browser.ts Outdated
claude added 2 commits May 17, 2026 13:15
…t B complete)

Agent B's RALPH loop completed. Twelve convergent findings from
research wave 4 now realised in the renderer:

1. 'What fired' section above the full audit trail (UX/17 + critic/21 +
   auditor/20). Verdict-card stats line now reads 'N rule(s) fired ·
   M action(s) to take'.
2. Non-matched rules collapsed via <details class='rule-collapsed'>;
   matched rules stay inline (UX/17).
3. Blocker-by-absence banner adjacent to the verdict card when any
   rule's condition reports 'flag missing in extraction' (critic/21 +
   UX/17). Yellow palette, names the missing flags.
4. Suggested actions now sorted by priority-of-cause (walk
   evaluations in priority-desc order, dedup preserving first-seen)
   instead of alphabetic. result.actions unchanged for machine
   consumers (UX/17).
5. Action human-sentence rendering via rules/action-glossary.yaml;
   falls back to bare slug when entry missing (stakeholder/18).
6. Provenance section: preamble explaining the hashes + 'How to
   reproduce' block + 12-char hash truncation (UX/17 + auditor/20 +
   stakeholder/18).
7. System-identity header above the verdict card: engine version +
   prominent timestamp moved out of the footer (auditor/20).
8. Verdict-tier glossary + [+]/[-]/[?] glyph legend in a collapsed
   <details class='legend'> block (auditor/20 + UX/17).
9. cond--miss now has a faint red row-wash matching cond--missing's
   amber, so the visual distinction isn't glyph-color-only (UX/17).
10. @media (max-width: 540px) single-column fallback (UX/17).
11. Trust calibration: --skip-validate banner shown prominently when
    validationStatus='skipped'; verified-badge tooltip explains it
    only means 'extraction is bound to current inputs', not 'flags
    are correct' (stakeholder/18 + critic/21).
12. Section headers in sentence-case with periods, imperative voice:
    'Take these actions.' not 'Suggested actions' (brand/19).

28 new tests in test/html-report.test.ts; suite total 163/163.
Three sample reports regenerated under research/sample-reports/ so
reviewers see the new shape.
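Item 4 above (priority-of-cause ordering) can be illustrated with a short sketch. The `Evaluation` shape and the function name are hypothetical, and restricting the walk to matched rules is an assumption; the commit message only says evaluations are walked in priority-descending order with first-seen dedup.

```typescript
// Hypothetical evaluation record; the field names are assumptions.
interface Evaluation {
  ruleId: string;
  priority: number;
  matched: boolean;
  actions: string[];
}

// Order actions by the priority of the rule that caused them:
// walk matched evaluations from highest to lowest priority and keep
// only the first occurrence of each action slug.
function actionsByPriorityOfCause(evaluations: readonly Evaluation[]): string[] {
  const ordered = evaluations
    .filter((e) => e.matched)
    .sort((a, b) => b.priority - a.priority); // filter() returned a copy, so the input is untouched

  const seen = new Set<string>();
  const result: string[] = [];
  for (const evaluation of ordered) {
    for (const action of evaluation.actions) {
      if (!seen.has(action)) {
        seen.add(action);
        result.push(action);
      }
    }
  }
  return result;
}
```

Because this ordering is computed separately for display, a machine consumer reading result.actions is unaffected by it.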
…x round 11)

- open-browser.ts: success now requires the opener process to exit
  with code 0 (or null, signal-terminated) — not just spawn. On Linux
  headless / CI, xdg-open spawns successfully and then immediately
  exits non-zero because no browser handler is registered; previously
  report.ts printed 'opened in browser' even though the file was
  never opened. Safety-net timeout bumped to 1s for opener daemons
  that never deliver an exit event (#525 round 11 P2).
- cli-shared.ts: takeOpt now throws 'Missing value for option <flag>'
  when the flag is the last argv entry or is followed by another
  flag. Previously 'npm run report -- --target' would silently fall
  back to 'all targets' and 'npm run report -- --target --quiet'
  would interpret '--quiet' as the target id (#525 round 11 P2).
- Two new takeOpt tests cover the missing-value rejection path.

Suite: 165/165 passing.
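A minimal sketch of the missing-value check for `takeOpt`. The signature and the non-mutating behaviour are assumptions, as is treating any token that starts with `--` as "another flag"; only the error message wording comes from the commit message.

```typescript
// Sketch: return the value following `flag`, undefined when the flag is
// absent, and throw when the flag is present but has no usable value.
function takeOpt(argv: readonly string[], flag: string): string | undefined {
  const index = argv.indexOf(flag);
  if (index === -1) return undefined; // flag not supplied at all

  const value = argv[index + 1];
  // Assumption: a following token that starts with "--" is another flag.
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`Missing value for option ${flag}`);
  }
  return value;
}
```

With this shape, `takeOpt(["--target", "--quiet"], "--target")` throws rather than returning `"--quiet"` as the target id.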

Luis85 commented May 17, 2026

@codex review

Three commits since last review:

28c84e9 — action glossary (sidecar): rules/action-glossary.yaml (21 entries) + src/action-glossary.ts + 12 tests; wired into config as optional actionGlossary field.

ded7400 — HTML report v3 (the wave-4 implementer pass): 12 convergent findings from research/17–21 realised:

  1. What fired section above the audit trail (matched rules first, in priority order)
  2. Audit-trail collapse for non-matched rules via <details> (default closed)
  3. Blocker-by-absence banner adjacent to verdict card naming the missing flags
  4. Suggested actions sorted by priority-of-cause (not alphabetic)
  5. Action human sentences rendered from the glossary (graceful slug fallback)
  6. Provenance reframe: preamble + "How to reproduce" block + 12-char hash truncation
  7. System-identity header above the verdict card with prominent timestamp
  8. Verdict-tier + glyph legend in a collapsed <details>
  9. cond--miss row wash so the miss-vs-missing distinction isn't glyph-color-only
  10. Mobile fallback @media (max-width: 540px)
  11. Trust calibration: --skip-validate banner + verified badge tooltip
  12. Sentence-case section headers, imperative voice ("Take these actions.")

28 new tests in test/html-report.test.ts; three sample reports regenerated under research/sample-reports/.

90f3fe1 — Codex round 11 fixes:

  • open-browser.ts: success now requires exit code 0 (or null), not just spawn. Headless / CI environments where xdg-open spawns then exits non-zero are now correctly reported as failed-to-open.
  • cli-shared.ts::takeOpt: throws Missing value for option '<flag>' when the next argv entry is missing or another flag. Previously npm run report -- --target would silently run against all targets.

Suite: 165/165 green. Typos clean.
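The open-browser fix can be sketched roughly as below. The function names are hypothetical, and treating the safety-net timeout as success is an assumption about what open-browser.ts does; the exit-code rule (0 or null counts as opened) is the one described above.

```typescript
import { spawn } from "node:child_process";

// Exit code 0, or null (terminated by a signal), counts as success.
function openerSucceeded(code: number | null): boolean {
  return code === 0 || code === null;
}

// Hypothetical wrapper: resolve true only when the opener exits cleanly.
// Assumption: if no exit event arrives within timeoutMs, the opener is a
// long-lived daemon and the open is treated as successful.
function openInBrowser(command: string, args: string[], timeoutMs = 1000): Promise<boolean> {
  return new Promise((resolve) => {
    let settled = false;
    const finish = (ok: boolean): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve(ok);
    };

    const child = spawn(command, args, { stdio: "ignore", detached: true });
    const timer = setTimeout(() => finish(true), timeoutMs);

    child.once("error", () => finish(false)); // e.g. the opener binary is not installed
    child.once("exit", (code) => finish(openerSucceeded(code)));
    child.unref(); // do not keep the CLI alive for the opener's sake
  });
}
```

On a headless machine where the opener starts and then exits non-zero, the promise resolves to false, so the caller can avoid printing 'opened in browser'.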


Generated by Claude Code


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90f3fe10f5

Comment thread experiments/rule-engine-poc/src/html-report.ts
Comment thread experiments/rule-engine-poc/src/html-report.ts Outdated
…odex round 12)

- html-report.ts missingFlagNames: only count rules whose final
  outcome was determined by the missing flag (matched === false).
  Previously a 'when.any' rule with one matched branch + one missing
  branch counted as un-evaluable in the banner, even though it
  contributed to the verdict (#525 round 12 P2).
- html-report.ts reproCmd: paths are now single-quoted via a small
  shellQuote helper that escapes embedded ' as the standard
  '\'' four-char sequence. Paths with spaces (e.g.,
  'My Projects/rules.yaml') no longer break the copy-pasted
  reproduce command (#525 round 12 P2).
- Two new html-report tests: when.any-with-missing-branch is NOT
  counted in the banner, and reproCmd contains HTML-escaped quoted
  paths.

Suite: 167/167 passing.
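The corrected banner logic might look like the sketch below. The record shapes and the `"hit" | "miss" | "missing"` status values are hypothetical stand-ins for whatever the engine emits; only the name `missingFlagNames` and the `matched === false` rule come from the commit message.

```typescript
// Hypothetical per-condition and per-rule result shapes.
interface ConditionResult {
  flag: string;
  status: "hit" | "miss" | "missing";
}

interface RuleEvaluation {
  ruleId: string;
  matched: boolean;
  conditions: ConditionResult[];
}

// Collect flags that were absent from the extraction, but only for rules
// that did not match. A matched rule (for example a when.any rule with one
// satisfied branch) already contributed to the verdict, so its missing
// branch is not reported as un-evaluable.
function missingFlagNames(evaluations: readonly RuleEvaluation[]): string[] {
  const names = new Set<string>();
  for (const evaluation of evaluations) {
    if (evaluation.matched) continue;
    for (const condition of evaluation.conditions) {
      if (condition.status === "missing") names.add(condition.flag);
    }
  }
  return Array.from(names).sort();
}
```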

Luis85 commented May 17, 2026

@codex review

Round 12 P2s addressed in eb01077:

  • html-report.ts::missingFlagNames: now only counts rules whose final outcome was determined by the missing flag (matched === false). A when.any rule with one matched branch + one missing branch is no longer claimed as un-evaluable since it contributed to the verdict.
  • html-report.ts::reproCmd: paths shell-quoted via a small shellQuote helper that escapes embedded single quotes as the standard four-char '\'' sequence. Paths with spaces no longer break the copy-pasted reproduce command.

Two new html-report.test.ts tests pin the corrected behaviours. Suite: 167/167 green.
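For reference, POSIX single-quote escaping can be sketched in a few lines. The name `shellQuote` matches the helper mentioned above, but this body and the `quoteArgs` helper are illustrations rather than the code in html-report.ts.

```typescript
// Wrap a value in single quotes for a POSIX shell. An embedded single
// quote is written as '\'' : close the quoted run, emit an escaped
// quote, then reopen the quoted run.
function shellQuote(value: string): string {
  return "'" + value.split("'").join("'\\''") + "'";
}

// Hypothetical helper: quote every argument and join into one command line.
function quoteArgs(args: readonly string[]): string {
  return args.map(shellQuote).join(" ");
}
```

A path such as `My Projects/rules.yaml` becomes `'My Projects/rules.yaml'`, which a shell reads as a single argument. When the command is embedded in the report, the quoted string still needs HTML escaping as a separate step.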


Generated by Claude Code


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb01077fc6

Comment thread experiments/rule-engine-poc/src/cli.ts
…odex round 13)

src/cli.ts has its own takeOption() (the single-shot fixture flow
doesn't use src/cli-shared.ts). Same bug as round 11 P2 in cli-shared:
when --html had no value the helper returned undefined and the CLI
silently proceeded with no HTML output, breaking automation that
relies on the artifact being written.

Now fail fast with a clear stderr message and exit code 2 when the
option is the last argv entry or is followed by another flag.

Suite: 167/167 still passing (no test exercised the silent-skip path).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 003a05ed50

Comment thread experiments/rule-engine-poc/src/loader.ts Outdated
Comment thread experiments/rule-engine-poc/src/loader.ts
claude and others added 4 commits May 17, 2026 13:37
… 14)

- loader.ts: every entry of 'then.actions' must be a non-empty string
  slug. Previously numbers / objects / empty strings passed load-time
  validation and flowed into the HTML reporter as unrecognised tokens
  that couldn't map to a glossary entry, breaking the remediation
  guidance the verdict is meant to provide (#525 round 14 P2).
- loader.ts: 'gt' and 'lt' now reject NaN and Infinity at load time.
  Both are technically 'number' but silently corrupt comparisons at
  runtime (NaN comparisons always false), so a typo could make a
  gating rule unexpectedly never fire (#525 round 14 P2).
- Four new loader tests cover non-string action elements,
  empty-string action elements, NaN gt, and Infinity lt.

Suite: 171/171 passing (167 + 4).
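The two load-time checks can be sketched as below. The function names, parameters, and error wording are hypothetical; loader.ts may validate these fields inline instead.

```typescript
// Every entry of then.actions must be a non-empty string slug.
function validateActions(ruleId: string, actions: unknown): string[] {
  if (!Array.isArray(actions)) {
    throw new Error(`Rule '${ruleId}': 'then.actions' must be an array`);
  }
  const slugs: string[] = [];
  for (const entry of actions) {
    if (typeof entry !== "string" || entry.length === 0) {
      throw new Error(`Rule '${ruleId}': every action must be a non-empty string`);
    }
    slugs.push(entry);
  }
  return slugs;
}

// gt / lt operands must be finite numbers. NaN and Infinity both pass a
// typeof check, and any comparison against NaN evaluates to false.
function validateThreshold(ruleId: string, op: "gt" | "lt", value: unknown): number {
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new Error(`Rule '${ruleId}': '${op}' needs a finite number`);
  }
  return value;
}
```

Failing at load time surfaces a typo as an immediate error instead of a gating rule that never fires.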
New docs/compliance.md walks the standards/regulations an adopter is
most likely to be asked about in 2026 and catalogues what the POC
contributes vs what stays the adopter's job. Synthesises the two
prior research passes (research/02 regulatory + research/20 auditor
readability) into reference material for scoping conversations.

Covered:
- EU AI Act Art. 11-14 + Art. 72 with a per-article table.
- ISO/IEC 42001 AIMS clauses 6-10.
- ISO/IEC 23894 AI risk management.
- NIST AI RMF Govern / Map / Measure / Manage.
- GDPR Art. 22 (when it applies vs when it doesn't).
- OECD AI Principles (1-paragraph summary).

Plus:
- 'What the POC ticks natively' — per-artifact provenance.
- 'What is NOT in this POC' — honest gap analysis.
- Maturity checklist before production with rough effort estimates.

Leads with a disclaimer that this is engineering reference material,
not legal advice or certification. docs/README.md updated to index
the new doc.
@Luis85 Luis85 merged commit dd9fcab into develop May 17, 2026
6 checks passed
@Luis85 Luis85 deleted the claude/rule-engine-poc-gO5yq branch May 17, 2026 13:50

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb23228592

Comment on lines +137 to +139
  if (typeof rule.priority !== "number") {
    throw new Error(`Rule '${rule.id}' missing numeric 'priority'`);
  }

P2: Reject non-finite rule priorities during load

The loader only validates priority with typeof === "number", so YAML values like .nan or .inf are accepted. In evaluate, these values feed the sort comparator, and NaN in particular causes b.priority - a.priority to become NaN, which is treated as an equal comparison and bypasses the intended priority ordering. This silently turns a malformed rule into different rule/action ordering behavior instead of failing fast like other numeric fields.

Useful? React with 👍 / 👎.
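One possible shape for the suggested guard, shown next to the comparator the comment refers to. `RuleLike` and `assertFinitePriority` are hypothetical names; this is a sketch of the idea, not a patch against loader.ts.

```typescript
// Hypothetical minimal rule shape as it arrives from the YAML parser.
interface RuleLike {
  id: string;
  priority: unknown;
}

// typeof NaN and typeof Infinity are both "number", so the typeof check
// alone accepts them; Number.isFinite closes that gap.
function assertFinitePriority(rule: RuleLike): number {
  const { id, priority } = rule;
  if (typeof priority !== "number" || !Number.isFinite(priority)) {
    throw new Error(`Rule '${id}' needs a finite numeric 'priority'`);
  }
  return priority;
}

// Comparator shape quoted in the review: with a NaN operand the
// subtraction yields NaN, which gives the sort no usable ordering signal.
const byPriorityDesc = (a: { priority: number }, b: { priority: number }): number =>
  b.priority - a.priority;
```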

Comment on lines +38 to +41
  for (const raw of paths) {
    const abs = resolve(options.baseDir, raw);
    walk(abs, extensions, collected);
  }

P2: Deduplicate files when collecting overlapping target paths

collectFiles walks each configured path independently and appends every discovered file without deduplication. If a target includes overlapping entries (for example both a directory and a file inside it, or the same path twice), the same source file is injected multiple times into the extraction prompt. That needlessly increases token usage and can skew extraction outcomes because duplicated evidence is overrepresented.

Useful? React with 👍 / 👎.
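A sketch of the deduplication idea, assuming `walk` is reshaped to return the absolute file paths it finds (the snippet under review passes a shared `collected` array instead). `collectUniqueFiles` is a hypothetical name.

```typescript
import { resolve } from "node:path";

// Collect files from every configured path, keeping each absolute path
// once, in first-seen order. `walk` is injected so the traversal
// strategy stays out of the sketch.
function collectUniqueFiles(
  paths: readonly string[],
  baseDir: string,
  walk: (absolutePath: string) => string[],
): string[] {
  const seen = new Set<string>();
  const collected: string[] = [];
  for (const raw of paths) {
    for (const file of walk(resolve(baseDir, raw))) {
      if (!seen.has(file)) {
        seen.add(file);
        collected.push(file);
      }
    }
  }
  return collected;
}
```

Keying the set on the resolved absolute path means a directory entry and a file inside it, or the same path listed twice, contribute each file only once. Symlinked duplicates would still need a real-path lookup, which this sketch leaves out.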
