Skip to content

Harness synthesis + property-test + 5 existing-loop refinements#1

Merged
TSavo merged 24 commits into
mainfrom
refinements
Apr 21, 2026
Merged

Harness synthesis + property-test + 5 existing-loop refinements#1
TSavo merged 24 commits into
mainfrom
refinements

Conversation

@TSavo
Copy link
Copy Markdown
Owner

@TSavo TSavo commented Apr 21, 2026

Summary

  • Five improvements on the existing derivation/verification loop: REASON field + LLM-as-judge, template-use telemetry with auto-retirement, strength checker (drops asserts to detect vacuous proofs), CEGAR witness refinement, and one-shot retry on Z3 error/unknown.
  • New verification layer: property-test. Three paths — primitive model-to-param mapping, complex-type harness synthesis (LLM writes the JS test), and violation replay against Z3 witnesses. Boundary-value augmentation per numeric parameter (guarded by precondition consistency). LLM judge layer on top flips and persists confidence on the underlying contract with an auditable judge_note.
  • Harness synthesis specifically: prompts/harness_synthesis.md is a 650-line teaching prompt (two rounds of Opus critique applied). src/harness.ts handles synthesis + vm.createContext sandboxed execution + content-addressed caching. src/moduleLoader.ts exposes private module-scope bindings via tree-sitter + ts.transpileModule + Module._compile, so private top-level functions become reachable.
  • First tool-driven bug fix in the wild: src/cli.ts/runDiff used || where the SMT encoder assumed ?? semantics. The harness flagged the divergence; a one-character fix closed the loop (tool now reports pass).

Env gates for the new paths (all opt-in, off by default):

  • NEURALLOG_PROPERTY_TEST=1 — enable primitive property test
  • NEURALLOG_PROPERTY_TEST_JUDGE=1 — enable judge layer
  • NEURALLOG_HARNESS_SYNTHESIS=1 — enable harness synthesis path
  • NEURALLOG_HARNESS_LIMIT=N / NEURALLOG_HARNESS_TIMEOUT_MS=N / NEURALLOG_HARNESS_MODEL=... — tune

Test plan

  • npm run build clean on every commit
  • neurallog analyze examples/inventory.ts — no regressions in existing proven/violation counts
  • NEURALLOG_PROPERTY_TEST=1 NEURALLOG_PROPERTY_TEST_JUDGE=1 node dist/cli.js verify against the full 939-contract snapshot — 6 passes, judge annotations persist to contract JSON
  • scripts/harness-probe.ts end-to-end on 8 contracts spanning: exported class methods (EntailmentChecker.check), private module-scope functions (cli.ts/runDiff, printSummary, resolveProjectRoot), untestable control-flow claims (TemplateChecker.checkStateMutation)
  • Before/after demonstration on runDiff: harness reports encoding-gap on pre-fix source, pass on post-fix source, with cache invalidation via content hash
  • Reviewer smoke: run analyze on a new file and observe the full 5-phase pipeline including the new harness path

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added proof strength analysis to evaluate assertion load-bearing in verification results.
    • Added property-based testing framework with LLM-driven harness generation and execution.
    • Added automated proof refinement and judging capabilities for improved validation accuracy.
    • Added example arithmetic functions demonstrating verification patterns.
  • Documentation

    • Added comprehensive harness synthesis guidelines.
    • Updated invariant derivation requirements with code-cite reasoning tags.
  • Chores

    • Enhanced verification pipeline with principle usage tracking and retirement management.
    • Improved contract path normalization and file handling.

TSavo and others added 17 commits April 20, 2026 14:29
Lifts from prior art in the LLM+SMT verification genre.

1. REASON field + judge (Lemur)
   - prompts/invariant_derivation.md: every smt2 block must emit a ; REASON: tag
     citing specific code, not a restatement of the claim.
   - verifier.ts: extractReason parses ; REASON: comments; VerificationResult
     carries reason, judgeRejected, judgeNote.
   - judge.ts: judgeReasoning runs an LLM check that the stated reason
     justifies the claim given the code and expected z3 verdict.
     judgeTeachingExample wraps it for principle teaching-example SMT.
   - principles.ts classifyAndGeneralize: gated behind judge — if rejected,
     principle returns validated: false with judge note.
   - DerivationPhase pattern-gap synthesis: newPrinciple only added to store
     if judge validates its teaching example.
   - contracts.ts: ProvenProperty / Violation carry reason, confidence,
     judge_note for persistence.

2. Template telemetry (Code2Inv)
   - principles.ts: Principle.stats { uses, proven, violations, errors,
     lastUsedAt, retiredAt, retiredReason }.
   - PrincipleStore.recordUse(tag, verdict), persistStats, evaluateRetirements,
     retire(id, reason) moves file to principles/retired/.
   - DerivationPhase records a use for every templateVerification and
     persists / evaluates retirement at end of run.
   - Retirement rule: uses >= 20 AND errors/uses > 0.5 AND proven == 0.

3. Strength checker (SpecGen)
   - checkers/StrengthChecker.ts: for each proven property, drop each assert
     one at a time, re-run z3; count load-bearing asserts.
     verdict = proven if any assert is load-bearing, violation (vacuous) if
     the claim holds after removing every premise.
   - Wired into AxiomPhase.

4. CEGAR witness refinement (Lemur)
   - DerivationPhase.cegarRefineViolations runs after the main DAG.
     For up to 10 violations with a Z3 witness, asks the LLM whether the
     counterexample is reachable or spurious:
       REACHABLE  -> mark confidence: high, attach attack-path judge note.
       revised smt2 block -> re-verify. unsat flips to proven.contract;
                             sat updates violation with tightened encoding.
   - ContractStore.put persists any contract that mutated.

5. One-shot retry on Z3 error / unknown
   - refiner.ts: refineErrorBlock asks the LLM to fix a block Z3 couldn't
     evaluate, preserving semantics; accepts revision only if Z3 returns
     sat or unsat.
   - DerivationPhase: per-function budget of 2 retries on template
     verifications with z3Result error / unknown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- principles.ts: import unlinkSync at top instead of require("fs") inline in
  retire().

- checkers/PropertyTestChecker.ts (new, opt-in via NEURALLOG_PROPERTY_TEST=1):
  For each proven contract, strip the goal assert, ask Z3 (get-model) for a
  concrete input satisfying the preconditions, require() the function via
  ts-node/register, match SMT variable names to tree-sitter-extracted
  parameter names, call, and record the outcome.
    - returned cleanly -> verdict: proven (Z3 and runtime agree).
    - threw -> verdict: violation (suspected encoding drift; Z3 said unsat
      but the function crashed on Z3's own witness input).
  Gated by env var. NEURALLOG_PROPERTY_TEST_LIMIT caps total calls (default 10).
  Best-effort; skips silently on unmapped params, load failures, non-primitive
  args, or Z3 errors.

No LLM judge step in this checker — outcomes are mechanical only. A followup
pass could feed (code, claim, inputs, outcome) to an LLM for semantic
divergence checks, but that's a separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Differential test now has both halves: mechanical runtime capture and
semantic judgment of whether the runtime behaviour is consistent with
Z3's proof claim.

- judge.ts: judgeRuntimeOutcome takes {functionSource, claim, smt2,
  inputsSummary, outcome}, asks the LLM to reply VALID or INVALID.
  INVALID means Z3 said "proven" but the runtime result contradicts
  the claim — encoding inconsistency.

- PropertyTestChecker:
    - runOne now returns a PropertyTestContext (CheckResult plus the
      structured inputs/outcome/source needed for judging).
    - lastRun keeps each context so the async judge pass can annotate it.
    - judgeResults(model?) runs only when NEURALLOG_PROPERTY_TEST_JUDGE=1.
      Uses createProvider() directly; model defaults to
      NEURALLOG_JUDGE_MODEL or claude-haiku-4-5.
    - Flips verdict on disagreement:
        proven + judge INVALID   -> violation (encoding-inconsistent)
        violation + judge VALID  -> proven (runtime threw but the throw
                                             was consistent with the claim)

- AxiomPhase.execute is now async. After the synchronous checker loop,
  it finds PropertyTestChecker and awaits judgeResults() when the env
  gate is set. Recomputes proven/violation/error totals.

- Pipeline.runVerifyOnly and all axiomPhase.execute call sites awaited.
  cli.ts awaits pipeline.runVerifyOnly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contracts stored file field as an absolute path, which made PropertyTestChecker
unable to require() them from a different checkout and silently walked all 932
stored contracts, taking 9m to produce zero results.

- contracts.ts: new normalizeContractFile(file, projectRoot). On load, absolute
  file fields are rewritten to projectRoot-relative, with a fallback that strips
  everything before /src/, /examples/, /lib/, /app/, or /packages/ so contracts
  authored in a different checkout still resolve.
- DerivationPhase.buildContracts: new contracts now store fn.relativePath,
  not fn.filePath. Over time all contracts migrate to relative on rewrite.

PropertyTestChecker fixes so a fully-failing run no longer burns minutes:

- Takes projectRoot in constructor (passed from AxiomPhase.execute options).
- resolvePath resolves contract.file against projectRoot when relative.
- fileRequireFailures: Set<string> — once require() throws for a file, every
  subsequent contract from that file skips before Z3, not after. Previously
  we'd extract a Z3 model for each contract before discovering the file
  couldn't be loaded.
- Reordered: loadFunction happens before extractModel. No Z3 call when the
  file can't be loaded.
- Limit now caps attempts, not successes. Previously this.used only
  incremented on successful runs, so a 100% failure rate walked every
  contract regardless of LIMIT.
- Summary log at end of check(): attempted / ran / skipped counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the path fix, LIMIT=5 attempts produced 0 successful runs because
every proven contract was for a method on a class. loadFunction only
looked for module[fnName] and missed them.

- PropertyTestChecker.findMethodOwner scans the required module for
  constructors whose prototype carries the target method. If the
  constructor takes no required args, instantiate and bind. Classes
  whose ctors throw (e.g. ContractStore needing projectRoot) are
  silently skipped.

- cli.ts: guard main() behind `if (require.main === module)`. Before
  this, PropertyTestChecker's require() of transitively-cli-importing
  files re-entered main() and started a second full verify run from
  inside the first — the recursion caused verify to hang at 4000+ log
  lines.

- Skip reasons are now collected and printed when a run yields zero
  results, so the operator can see why.

Empirical result on this repo (20 proven contracts, LIMIT=20):
  attempted 20, ran 1, skipped 19
  [division-by-zero] formatDuration:152 — with ms=0 → returned "0ms"
  property-test judge: 1 judged, 0 verdict flips, 0 violations confirmed
  judge note: "The function returns '0ms' for input ms=0 without reaching
    any division operation, consistent with the claim that division by
    zero is prevented at formatDuration."

19 skipped because the proven-contract methods live on classes with
required constructor args (ContractStore(projectRoot), TemplateEngine,
etc). That's a real architectural limit — testing those needs a
synthetic-instance protocol; left as future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ance args

Extended PropertyTestChecker to replay violations against the code, not
just proven claims. For each violation that carries a Z3 witness, the
checker executes the function with the witness values and judges the
outcome:
  threw   -> violation reproduced (bug confirmed)
  clean   -> possible false positive; judge decides

runOne now takes a source flag ("proven" | "violation") and branches on
it. Model extraction uses the stored witness for violations, and runs
z3 (get-model) for proven blocks.

resolveCallable now tries several constructor-arg patterns when
instantiating a class whose method we need:
  new Cls()                                  (zero-arg)
  new Cls(projectRoot)                       (common in this codebase)
  new Cls({})
  new Cls({ projectRoot })
  new Cls(projectRoot, false)
  new Cls({ projectRoot, verbose: false })

Static methods are now detected via tree-sitter and invoked via
Cls[fnName] without instantiation.

Iteration order: pass 1 iterates all proven entries across all
contracts, pass 2 iterates violations. This prevents violation-heavy
contracts from exhausting the LIMIT before proven entries get a chance.

Skip reasons are printed for every distinct bucket when a run ends
(not just top 8).

examples/arithmetic.ts: pure-primitive demo functions (safeDivide,
divide, absDiff, clamp, addPositive). Every parameter is Int/Real so
Z3 models map directly to runtime calls. Added specific-file entries
to .neurallogignore instead of the catch-all examples/** so this file
is analyzed and verified.

Empirical run (NEURALLOG_PROPERTY_TEST=1 NEURALLOG_PROPERTY_TEST_JUDGE=1
LIMIT=200, 939 contracts):

  attempted 200, ran 5, skipped 195
  1. [subtraction-underflow] absDiff:18 — x=0 y=0 → 0   (judge VALID)
  2. [division-by-zero]      safeDivide:8 — a=0 b=1 → 0 (judge VALID)
  3. [division-by-zero]      formatDuration:152 — ms=0 → "0ms" (VALID)
  4. [division-by-zero]      formatDuration:166 — ms=0 → "0ms" (VALID)
  5. [division-by-zero] VIOLATION-REPLAY divide:13 — a=0 b=0 → null
     judge INVALID:
     "JavaScript division by zero (including 0/0) returns NaN rather
     than causing a runtime error, so the claimed violation doesn't
     fire at runtime"
     -> verdict flipped to encoding-inconsistent

The fifth result is the payload: Z3 proved divide has a bug; runtime
disagrees; judge correctly labels it as an encoding/semantics mismatch
between the Z3 abstraction and JavaScript's NaN semantics. That's the
entire differential-test premise working end-to-end.

195 skips are all legitimate — methods taking Contract[]/Map<> that
SMT can't model as primitives, private CLI functions, or arithmetic
functions where the SMT encoding used different variable names than
the parameter list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New inferConstructorArgs method extracts class ctor param types via tree-sitter
and synthesizes mock values per param (string defaults, number 0, boolean false,
arrays [], Map/Set new, object/alias {}). Used as first-attempt path in
resolveCallable; hardcoded ctor fallback patterns kept as safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three signal-improvement layers on top of ctor-introspection:

1. matchParamToModel — if the SMT model lacks an exact key for a parameter,
   try case-fold equality, suffix-strip (_condition, _value, _returns,
   _guard, _consequent, _alternate, _old, _new, _var), then substring
   match. Preference order: exact > normalized > stripped > substring.

2. isControlFlowModel — if every key in the SMT model looks like a
   control-flow abstraction (guard_condition, code_after_reached,
   result_consequent, branch_taken, etc.) AND none match a parameter
   name, classify as control-flow-only and skip with a specific reason
   rather than the generic "SMT model missing param". Principles like
   guard-narrowing and ternary-branch-collapse encode control state,
   not data values — they're inherently un-property-testable via
   input-value mapping.

3. constructor skip — contract.function === "constructor" now short-
   circuits. Calling a class constructor as a regular function throws
   ("cannot be invoked without 'new'"), producing noise. Property-
   testing constructors would need a new-invocation protocol.

Empirical run (NEURALLOG_PROPERTY_TEST=1 JUDGE=1 LIMIT=200):
  Before fuzzy/tagging: attempted 200, ran 5
  After:                attempted 200, ran 7
  3 verdict flips by judge, 1 violation confirmed

New hits:
  [null-assertion] findSimilarByAST:77 with obs=0 → [] — judge flagged
    encoding-inconsistent (code actually guards the ! operations).
  [subtraction-underflow] absDiff:18 — already running, now flagged
    encoding-inconsistent by judge (claim text says VIOLATION but Z3
    proved unsat).

Skip reasons now bucket control-flow contracts clearly — ~60 of the
195 skips are flagged as principle-is-control-flow, which is an
architectural limit not a bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Boundary augmentation (#1)

runBoundarySamples runs each function not only with Z3's primary witness
but with boundary-variant args (candidates: 0, 1, -1) substituted per
numeric parameter. Each variant is only tried when Z3 says the variant
still satisfies the contract's preconditions (isConsistentWithPreconditions
pins the param value via an extra (assert (= name lit)) and re-checks sat).

A proven contract whose primary run is clean BUT any boundary variant
threw flips verdict to "violation" with a specific error. Conversely,
"all N boundary cases clean" is a confidence signal on the primary run.

Description line now includes a summary: [+N boundary cases all clean]
or [+N boundary cases: M clean, K threw].

Cap: 8 boundary runs per contract so we don't blow up on N-param functions.

## Judge-driven contract confidence (#2)

judgeResults now persists verdict-derived confidence to the underlying
contract via ContractStore.put, keyed off smt2 equality:

  Z3 proven + judge VALID     -> confidence: "high"   judge_note: VALID
  Z3 proven + judge INVALID   -> confidence: "low"    judge_note: INVALID
  Z3 violation + judge INVALID -> confidence: "high"   judge_note: confirmed bug
  Z3 violation + judge VALID  -> confidence: "low"    judge_note: false positive

These get written back to .neurallog/contracts/*.json. Downstream tooling
(CI gate, report summary) can now distinguish judge-corroborated proofs
and violations from raw Z3 verdicts.

## Empirical result

Against the same 939-contract snapshot:
  Before: attempted 200, ran 7
  After:  attempted 200, ran 6 (constructor-skip change retired one)

More importantly, the 6 runs now carry boundary annotations, and on disk:
  divide[13] violation: confidence=high (judge confirmed bug)
  safeDivide[8] proven: confidence=high (judge VALID)
  absDiff[18] proven: confidence=low (judge INVALID — encoding-inconsistent)
  cegar-flipped contracts: confidence=high (propagated from earlier CEGAR)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sn't confused by the literal 'VIOLATION' label

Before: CEGAR-flipped claim kept 'VIOLATION: Subtraction Underflow' text, which
the property-test judge read as asserting the bug exists — it kept flagging
proven contracts as 'encoding-inconsistent' because the label said VIOLATION
but Z3 proved unsat.

After: 'PROVEN: Subtraction Underflow is prevented (CEGAR-refined encoding)'.
Judge now correctly reads the claim as stating prevention and accepts the
unsat proof as consistent.

Migration: stale on-disk proven contracts with 'VIOLATION:' prefix rewritten
once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opus agent reviewed v1 and flagged specific defects. Applied surgical fixes:

Must-fix correctness
- Example 3 was broken — Number.MIN_VALUE / 4 evaluates to 0 at expression
  time so the fixture guard fired before exercising the function. Replaced
  with a correct IEEE 754 ulp-collapse exhibit (distinct-positive-reals
  claim, (1 + EPSILON/2) === 1). Also models the harness-error vs
  encoding-gap prefix distinction in-code.

Critical teaching gaps
- Added Step 1 subsection on SMT negation direction: unsat on (assert (not
  goal)) means goal is proven; harness must probe the goal direction not
  the negated one. Not taught anywhere in v1. Opus predicted this as the
  single biggest gap.
- Added Step 1.5 "Write the falsification hypothesis": force the LLM to
  commit, in a comment, to the specific runtime observation that would
  falsify the claim, before constructing any fixture. Operationalizes the
  honest-falsification disposition v1 asserted but did not cultivate.
- Expanded Step 5 with explicit harness-error: vs encoding-gap: prefix
  distinction. Two categorically different failure kinds — the judge treats
  them opposite ways, and a mislabeled prefix poisons the signal.
- Expanded A2 with counter-teaching: three legitimate sources for
  expected values (first-principle reimplementation, mathematical invariant,
  independent oracle) and one to avoid (the SMT claim itself).

Anti-pattern sharpening
- A7 rewritten to draw the crisp distinction from A2 (expected-from-claim
  vs fixture-contains-answer).
- A10 replaced: silent-TS-coercion was not an LLM-gravitational failure
  mode. New A10 is "Wrong thing passed where a value was expected"
  (Promise-wrapping, callback confusion) — the actual LLM failure mode.

Trap additions and cuts
- Added: JS integer division (5/2 = 2.5 disagrees with SMT Int div), ulp
  collapse, sparse arrays, JSON.stringify drops undefined/BigInt/symbols,
  precondition vacuity, non-determinism (clocks/randomness/scheduling).
- Cut: prototype pollution (production-security not encoding-faithfulness).
- Demoted: cross-realm instanceof now a footnote under Prototype section.

Decoration cuts (per Opus L5/L40/L470/L572)
- Removed "Read that paragraph again" and drumroll meta-instructions.
- Removed "steps below walk through this in the right order" connective.
- Removed self-advertising "densest concentration of leverage" sentence.
- Removed final moralizing paragraph duplicating the temptation warning
  from the thesis.

539 → 647 lines. Added content outweighs trimmed decoration by ~110 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Opus audited v2 and found v2 ready-with-fixes. Applied the four items:

1. Example 5 prefix regression (L351) — 'fixture error:' → 'harness-error:'.
   The regression contradicted the Step 5 teaching.
2. L54 conflation — tightened wording so the vacuity source is properly
   named (reproducing a counterexample Z3 ruled out, not 'preconditions
   rule out the negated case'). The preconditions/claim distinction was
   blurred.
3. Step 1.5 judge-audit claim softened — v2 said 'the judge reads and
   audits hypothesis vs assertions.' The judge in the runner doesn't yet
   do that specific audit, so the prompt was asserting unearned capability.
   Rewrote to frame the hypothesis as discipline for the author and the
   human reader, with self-inconsistency as the diagnostic signal.
4. New Step 2.5-equivalent — Precondition checks in-line. Opus's new top
   prediction was 'precondition-violating fixtures that still produce
   throw-free passes.' Added a short paragraph requiring inline
   harness-error: guards for each verifiable precondition.

Opus's readiness assessment: ready-with-one-trivial-fix. v3 addresses
that fix plus the structural concerns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
src/harness.ts (new)
- synthesizeHarness(input, provider, model, projectRoot): assembles the
  prompt (loads prompts/harness_synthesis.md + appends task-specific
  claim/SMT/source) and calls the LLM. Parses response for one of:
  (a) a single ```javascript code block -> returns harness string
  (b) a // UNTESTABLE: <reason> line -> returns untestable reason
  (c) neither -> returns null on both (synthesis failed).
- runHarness(harnessCode, fn, fnClass, timeoutMs): executes the harness
  in vm.createContext with functionUnderTest and functionUnderTestClass
  bound, plus a curated set of globals (Promise, Error, Array/Object/
  Number/String/Boolean, JSON, Math, Date, Map/Set/WeakMap/WeakSet,
  Reflect, BigInt, timers). Wraps harness body in async IIFE so top-
  level await works. Races against a timeout. Classifies throws by
  message prefix: encoding-gap: / harness-error: / __HARNESS_TIMEOUT__ /
  unclassified.
- HarnessCache: content-addressed at .neurallog/harnesses/<hash>.json by
  sha256(smt2 + '\n---\n' + functionSource). Avoids re-synthesis across
  runs when nothing changed.

PropertyTestChecker
- Track harnessCandidates during sync check(): when primitive param-
  mapping fails with "SMT model missing param" (complex-type case), the
  contract becomes a candidate for harness synthesis. Control-flow
  principles and require-failures do NOT become candidates — they're
  fundamentally un-fixable by harness synthesis.
- New async synthesizeAndRunHarnesses() method: iterates candidates up
  to NEURALLOG_HARNESS_LIMIT (default 5), checks cache, synthesizes,
  runs. Records outcomes into harnessResults and mutates the contract's
  ProvenProperty confidence via the same channel as the judge:
    pass (proven) -> confidence: high
    encoding-gap -> confidence: low, judge_note: harness-encoding-gap
    untestable -> judge_note: harness-untestable
    other -> judge_note: harness-<kind>
  Persists via ContractStore.put on touched contracts.

AxiomPhase
- After judgeResults, awaits synthesizeAndRunHarnesses() when
  NEURALLOG_HARNESS_SYNTHESIS=1. Merges harnessResults into the
  property-test CheckResult array and recomputes proven/violation/error
  counts. Final summary line now distinguishes judge and harness LLM
  usage separately.

Env gates
  NEURALLOG_HARNESS_SYNTHESIS=1     enable the synthesis path
  NEURALLOG_HARNESS_LIMIT=N         cap synthesis calls (default 5)
  NEURALLOG_HARNESS_TIMEOUT_MS=N    per-harness execution timeout (3000)
  NEURALLOG_HARNESS_MODEL=name      synthesis model (default sonnet-4-6)

tsc clean. Not yet run empirically against arithmetic.ts + full contract
set — that's the next step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first probe run hit ts-node strict-mode typecheck failures (TS7006)
on EntailmentChecker.ts when loading via require() — because ts-node's
default config tries to fully typecheck every loaded file, which fails
when sibling modules aren't found in isolation. Transpile-only mode
strips types, skips typechecking, and just emits runtime JS — which is
exactly what the sandbox needs.

- PropertyTestChecker.ensureTsNode now registers ts-node with
  transpileOnly: true (falling back to ts-node/register/transpile-only
  if the explicit register call fails).
- scripts/harness-probe.ts: dev tool that runs the full synthesis +
  execution pipeline on a single contract. Usage:
    npx ts-node scripts/harness-probe.ts '<contract-key>' [--violation]
  Prints SMT, function source, raw LLM response, parsed harness, and
  execution outcome. Supports --no-cache and --no-synth flags.

Empirical first run:
  contract: src/checkers/EntailmentChecker.ts/check[11]
  claim: VIOLATION: Empty Collection Loop at check (empty contracts
         array should not execute loop body)
  function takes: (contracts: Contract[], callGraph: Map<string,
                   string[]>) — complex types primitive-mapping cannot
                   touch
  synthesis: 26s, sonnet-4-6 emitted a 60-line harness that:
    - named the falsification hypothesis in a comment
    - wrote an inline harness-error: precondition guard
    - stubbed the this-binding (fakeThis) to satisfy the method's
      access to this.name
    - used two assertions with correct encoding-gap: vs harness-error:
      prefixes
  execution: 3ms in vm sandbox, verdict pass
  interpretation: the function's empty-collection path is empirically
  consistent with the SMT claim — Z3's unsat was faithful, runtime
  agrees, proof upgraded to observable corroboration.

This is the class of contracts the whole property-test layer could not
touch before today. It works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Private top-level functions in a CommonJS file (e.g. non-exported helpers
in cli.ts — runDiff, printSummary, resolveProjectRoot) are invisible to
require() because only module.exports survives the CJS wrapper closure.
The harness runner therefore could not execute them even though the LLM
synthesized perfectly reasonable harnesses for them.

Fix: new src/moduleLoader.ts provides loadModuleWithPrivates(filePath).

How it works:
  1. Read the source file.
  2. Tree-sitter enumerates every top-level declaration: function,
     generator, class, lexical/variable declarations, type/interface/enum.
     export_statement wrappers are unwrapped.
  3. Append one exporter shim per name at end of source:
       try { if (typeof N !== 'undefined') exports['N'] = N } catch {}
  4. Transpile the augmented source via ts.transpileModule (CJS/ES2022,
     esModuleInterop, skipLibCheck).
  5. Instantiate a fresh Module, wire its filename and paths, then
     _compile the transpiled JS directly. This executes the file outside
     the normal require pipeline but with a real Module parent, so
     nested require() calls in the source resolve correctly.
  6. Return mod.exports — which now contains every top-level binding
     the original source declared, not just those it chose to export.

PropertyTestChecker.loadFunction and scripts/harness-probe.ts both try
ordinary require() first (fast path, normal case), then fall back to
loadModuleWithPrivates when the target binding isn't reachable via
exports. The fallback is transparent to the rest of the pipeline —
once a function is resolvable, the harness runner treats it identically
regardless of how it was obtained.

Empirical result — three previously-unloadable cli.ts contracts:
  printSummary[522]      pass            (claim corroborated)
  resolveProjectRoot[600] pass           (claim corroborated)
  runDiff[265]           encoding-gap   (REAL FINDING)

The runDiff result is substantive: the encoder modelled a default-value
expression as ?? semantics (preserves falsy) while the code actually
uses || (replaces "" with the default). The harness passed args=[""],
observed ref="HEAD~1" where the claim said ref=="", and threw
encoding-gap with full diagnostic context. This is the first genuine
encoder-vs-runtime semantic gap the synthesis layer has caught on a
non-toy function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness synthesis pipeline flagged an encoding-faithfulness gap on
src/cli.ts/runDiff[265]. Z3 proved the falsy-default-trap claim under
the assumption that the default expression used nullish coalescing
semantics (?? — only replaces on null/undefined), but the code actually
used || (replaces on any falsy value, including ""). The harness
passed args=[""], observed the runtime produce ref="HEAD~1" where the
claim expected ref="" (empty preserved), and threw encoding-gap with
full context.

Minimum-change fix: use ?? instead of ||. Matches the encoder's
assumed semantic. Empty-string args are now preserved — if a user
really passes "" as a ref, git will surface the error itself rather
than the CLI silently substituting HEAD~1.

Post-fix probe (cache invalidated by source hash change, LLM re-
synthesized a harness against the new code):
  VERDICT: pass

First real-world fix driven by the harness synthesis layer. The
before/after round-trip demonstrates the tool: broken state produces
encoding-gap, code change produces pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 03:42
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 21, 2026

Important

Review skipped

Too many files!

This PR contains 284 files, which is 134 over the limit of 150.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 006d8133-57a1-4e5a-94e5-5fb1766b7154

📥 Commits

Reviewing files that changed from the base of the PR and between 655ee1e and f59e528.

📒 Files selected for processing (284)
  • .gitignore
  • .neurallog/classification.json
  • .neurallog/contexts/bundles.json
  • .neurallog/contracts/src/checkers/Checker.ts/extractDeclaredVars[52].json
  • .neurallog/contracts/src/checkers/Checker.ts/extractDeclaredVars[54].json
  • .neurallog/contracts/src/checkers/Checker.ts/extractDeclaredVars[56].json
  • .neurallog/contracts/src/checkers/Checker.ts/extractPreconditions[41].json
  • .neurallog/contracts/src/checkers/Checker.ts/extractPreconditions[46].json
  • .neurallog/contracts/src/checkers/Checker.ts/namespaceBlock[63].json
  • .neurallog/contracts/src/checkers/Checker.ts/runCheck[33].json
  • .neurallog/contracts/src/checkers/Checker.ts/runCheck[35].json
  • .neurallog/contracts/src/checkers/ConsistencyChecker.ts/check[12].json
  • .neurallog/contracts/src/checkers/ConsistencyChecker.ts/check[13].json
  • .neurallog/contracts/src/checkers/ConsistencyChecker.ts/check[18].json
  • .neurallog/contracts/src/checkers/ConsistencyChecker.ts/check[22].json
  • .neurallog/contracts/src/checkers/ConsistencyChecker.ts/check[34].json
  • .neurallog/contracts/src/checkers/ConsistencyChecker.ts/check[43].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[11].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[13].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[14].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[17].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[19].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[21].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[25].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[27].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[28].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[30].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[32].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[33].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[35].json
  • .neurallog/contracts/src/checkers/EntailmentChecker.ts/check[54].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[12].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[13].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[14].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[18].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[21].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[23].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[35].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[43].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[53].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[58].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[60].json
  • .neurallog/contracts/src/checkers/IndependenceChecker.ts/check[76].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[11].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[13].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[14].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[18].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[19].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[20].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[21].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[25].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[26].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[30].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[32].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[38].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[40].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[42].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[47].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[48].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[50].json
  • .neurallog/contracts/src/checkers/ReachabilityChecker.ts/check[60].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[11].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[13].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[14].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[18].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[19].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[20].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[21].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[25].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[26].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[29].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[33].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[35].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[38].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[39].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[41].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[44].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[49].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[51].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[57].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[59].json
  • .neurallog/contracts/src/checkers/StrengtheningChecker.ts/check[63].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[55].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[57].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[58].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[61].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[62].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[65].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[66].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[67].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[69].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/checkStateMutation[71].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[10].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[11].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[12].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[14].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[26].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[27].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[29].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[31].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[32].json
  • .neurallog/contracts/src/checkers/TemplateChecker.ts/check[33].json
  • .neurallog/contracts/src/cli.ts/buildSignalRegistry[551].json
  • .neurallog/contracts/src/cli.ts/buildSignalRegistry[557].json
  • .neurallog/contracts/src/cli.ts/fileIssues[543].json
  • .neurallog/contracts/src/cli.ts/fileIssues[544].json
  • .neurallog/contracts/src/cli.ts/fileIssues[546].json
  • .neurallog/contracts/src/cli.ts/findProjectRoot[608].json
  • .neurallog/contracts/src/cli.ts/findProjectRoot[610].json
  • .neurallog/contracts/src/cli.ts/getConfidence[418].json
  • .neurallog/contracts/src/cli.ts/getConfidence[419].json
  • .neurallog/contracts/src/cli.ts/getFlag[596].json
  • .neurallog/contracts/src/cli.ts/printSummary[521].json
  • .neurallog/contracts/src/cli.ts/printSummary[522].json
  • .neurallog/contracts/src/cli.ts/resolveProjectRoot[600].json
  • .neurallog/contracts/src/cli.ts/resolveProjectRoot[601].json
  • .neurallog/contracts/src/cli.ts/runDiff[265].json
  • .neurallog/contracts/src/cli.ts/runDiff[273].json
  • .neurallog/contracts/src/cli.ts/runDiff[294].json
  • .neurallog/contracts/src/cli.ts/runDiff[297].json
  • .neurallog/contracts/src/cli.ts/runDiff[308].json
  • .neurallog/contracts/src/cli.ts/runDiff[310].json
  • .neurallog/contracts/src/cli.ts/runExplain[319].json
  • .neurallog/contracts/src/cli.ts/runExplain[320].json
  • .neurallog/contracts/src/cli.ts/runExplain[326].json
  • .neurallog/contracts/src/cli.ts/runExplain[343].json
  • .neurallog/contracts/src/cli.ts/runExplain[356].json
  • .neurallog/contracts/src/cli.ts/runExplain[358].json
  • .neurallog/contracts/src/cli.ts/runExplain[359].json
  • .neurallog/contracts/src/cli.ts/runExplain[364].json
  • .neurallog/contracts/src/cli.ts/runExplain[375].json
  • .neurallog/contracts/src/cli.ts/runExplain[377].json
  • .neurallog/contracts/src/cli.ts/runExplain[378].json
  • .neurallog/contracts/src/cli.ts/runExplain[383].json
  • .neurallog/contracts/src/cli.ts/runHook[487].json
  • .neurallog/contracts/src/cli.ts/runHook[491].json
  • .neurallog/contracts/src/cli.ts/runHook[495].json
  • .neurallog/contracts/src/cli.ts/runOverride[505].json
  • .neurallog/contracts/src/cli.ts/runReport[405].json
  • .neurallog/contracts/src/cli.ts/runReport[413].json
  • .neurallog/contracts/src/cli.ts/runReport[418].json
  • .neurallog/contracts/src/cli.ts/runReport[419].json
  • .neurallog/contracts/src/cli.ts/runReport[434].json
  • .neurallog/contracts/src/cli.ts/runReport[438].json
  • .neurallog/contracts/src/cli.ts/runReport[440].json
  • .neurallog/contracts/src/cli.ts/runReport[450].json
  • .neurallog/contracts/src/cli.ts/runReport[455].json
  • .neurallog/contracts/src/cli.ts/runReport[468].json
  • .neurallog/contracts/src/cli.ts/runReport[471].json
  • .neurallog/contracts/src/cli.ts/runReport[472].json
  • .neurallog/contracts/src/cli.ts/runReport[473].json
  • .neurallog/contracts/src/contracts.ts/contractHash[50].json
  • .neurallog/contracts/src/contracts.ts/contractHash[51].json
  • .neurallog/contracts/src/contracts.ts/contractHash[52].json
  • .neurallog/contracts/src/contracts.ts/contractHash[53].json
  • .neurallog/contracts/src/contracts.ts/findStale[156].json
  • .neurallog/contracts/src/contracts.ts/findStale[157].json
  • .neurallog/contracts/src/contracts.ts/findStale[158].json
  • .neurallog/contracts/src/contracts.ts/findStale[169].json
  • .neurallog/contracts/src/contracts.ts/findStale[170].json
  • .neurallog/contracts/src/contracts.ts/findStale[171].json
  • .neurallog/contracts/src/contracts.ts/findStale[172].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[186].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[190].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[195].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[197].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[198].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[202].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[204].json
  • .neurallog/contracts/src/contracts.ts/formatForPrompt[205].json
  • .neurallog/contracts/src/contracts.ts/recordWitness[222].json
  • .neurallog/contracts/src/contracts.ts/recordWitness[224].json
  • .neurallog/contracts/src/contracts.ts/recordWitness[225].json
  • .neurallog/contracts/src/contracts.ts/remove[137].json
  • .neurallog/contracts/src/contracts.ts/signalKeyToPath[46].json
  • .neurallog/contracts/src/contracts.ts/signalKeyToPath[48].json
  • .neurallog/contracts/src/contracts.ts/signalKey[41].json
  • .neurallog/contracts/src/contracts.ts/signalKey[43].json
  • .neurallog/contracts/src/contracts.ts/writeSingle[214].json
  • .neurallog/contracts/src/contracts.ts/writeSingle[215].json
  • .neurallog/contracts/src/contracts.ts/writeSingle[216].json
  • .neurallog/contracts/src/contracts.ts/writeSingle[217].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/constructor[15].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/getHead[70].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/isGitRepo[57].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/parseDiff[100].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/parseDiff[102].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/parseDiff[104].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/parseDiff[105].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/parseDiff[106].json
  • .neurallog/contracts/src/git/DiffAnalyzer.ts/parseDiff[111].json
  • .neurallog/contracts/src/git/HookInstaller.ts/constructor[30].json
  • .neurallog/contracts/src/git/HookInstaller.ts/getHooksDir[104].json
  • .neurallog/contracts/src/git/HookInstaller.ts/install[35].json
  • .neurallog/contracts/src/git/HookInstaller.ts/install[42].json
  • .neurallog/contracts/src/git/HookInstaller.ts/install[44].json
  • .neurallog/contracts/src/git/HookInstaller.ts/isInstalled[95].json
  • .neurallog/contracts/src/git/HookInstaller.ts/isInstalled[98].json
  • .neurallog/contracts/src/git/HookInstaller.ts/uninstall[61].json
  • .neurallog/contracts/src/git/HookInstaller.ts/uninstall[66].json
  • .neurallog/contracts/src/git/HookInstaller.ts/uninstall[71].json
  • .neurallog/contracts/src/git/HookInstaller.ts/uninstall[78].json
  • .neurallog/contracts/src/git/HookInstaller.ts/uninstall[86].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/constructor[9].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/isIgnored[29].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/isIgnored[31].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/isIgnored[35].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/isIgnored[36].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/isIgnored[37].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/load[15].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/load[23].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/matches[44].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/matches[51].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/matches[52].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/matches[53].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/matches[54].json
  • .neurallog/contracts/src/git/IgnoreFilter.ts/matches[55].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[37].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[41].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[42].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[45].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[51].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[52].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[58].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[64].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[65].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[74].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[75].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[84].json
  • .neurallog/contracts/src/git/ProofDiff.ts/computeDiff[85].json
  • .neurallog/contracts/src/git/ProofDiff.ts/constructor[19].json
  • .neurallog/contracts/src/git/ProofDiff.ts/indexContracts[97].json
  • .neurallog/contracts/src/git/ProofDiff.ts/loadContractsAtRef[107].json
  • .neurallog/contracts/src/git/ProofDiff.ts/loadContractsAtRef[113].json
  • .neurallog/contracts/src/git/ProofDiff.ts/loadContractsAtRef[114].json
  • .neurallog/contracts/src/git/ProofDiff.ts/loadContractsAtRef[122].json
  • .neurallog/contracts/src/issues.ts/buildIssue[37].json
  • .neurallog/contracts/src/issues.ts/buildIssue[45].json
  • .neurallog/contracts/src/issues.ts/collectViolationIssues[94].json
  • .neurallog/contracts/src/issues.ts/collectViolationIssues[96].json
  • .neurallog/contracts/src/issues.ts/extractClaim[15].json
  • .neurallog/contracts/src/issues.ts/fileIssue[129].json
  • .neurallog/contracts/src/issues.ts/fileViolationIssues[150].json
  • .neurallog/contracts/src/issues.ts/fileViolationIssues[151].json
  • .neurallog/contracts/src/issues.ts/fileViolationIssues[157].json
  • .neurallog/contracts/src/issues.ts/fileViolationIssues[168].json
  • .neurallog/contracts/src/issues.ts/issueExists[116].json
  • .neurallog/contracts/src/issues.ts/issueExists[121].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/complete[12].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/complete[19].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/complete[21].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/complete[28].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/stream[38].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/stream[46].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/stream[48].json
  • .neurallog/contracts/src/llm/ClaudeAgentProvider.ts/stream[52].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/complete[32].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/complete[40].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/complete[41].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/complete[45].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/constructor[16].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/constructor[17].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/post[56].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/post[70].json
  • .neurallog/contracts/src/llm/OpenAIProvider.ts/post[71].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[22].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[23].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[24].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[25].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[37].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[38].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[39].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[40].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[42].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[47].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[48].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[58].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[61].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/complete[70].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/constructor[14].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/ensureClient[117].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/ensureClient[119].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/ensureSession[125].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/ensureSession[127].json

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

The pull request introduces a comprehensive property-testing and verification refinement framework. It adds new checker implementations, LLM-driven harness synthesis for runtime validation, CEGAR-based violation refinement, principle lifecycle management with retirement logic, and supporting infrastructure for dynamic module loading and SMT proof refinement. Multiple pipeline phases are updated to support async execution and post-processing workflows.

Changes

Cohort / File(s) Summary
Configuration
.gitignore, .neurallogignore
Updated ignore patterns to exclude *.worktrees/ directories and refine example file filtering to include specific example files explicitly.
Metadata
.neurallog/derivation.json, .neurallog/graph.json, .neurallog/principles/addition-overflow.json
Updated derivation timestamps, contract counts, and project root references; added new principle configuration for overflow detection.
Contract Artifacts
.neurallog/contracts/examples/arithmetic.ts/*, .neurallog/contracts/examples/inventory.ts/*, .neurallog/contracts/src/.../...json
Added 20+ contract JSON artifacts recording proof/violation analysis results, SMT encodings, and witness models for arithmetic and inventory functions, plus updated existing contract metadata with normalized file paths and confidence annotations.
Example Code
examples/arithmetic.ts
New module exporting five pure arithmetic functions: safeDivide, divide, absDiff, clamp, and addPositive with guards/edge-case handling for demonstration.
Documentation
prompts/harness_synthesis.md, prompts/invariant_derivation.md
Added comprehensive harness synthesis guide defining empirical falsification methodology, error classification, and runtime validation patterns; updated invariant derivation to mandate REASON: tags in SMT-LIB blocks.
Checker Implementations
src/checkers/StrengthChecker.ts, src/checkers/PropertyTestChecker.ts, src/checkers/index.ts
Introduced StrengthChecker for load-bearing assertion analysis and comprehensive PropertyTestChecker supporting property-based runtime validation, LLM judging, and harness synthesis with timeout isolation; updated exports.
Harness & VM Infrastructure
src/harness.ts, scripts/harness-probe.ts
New harness synthesis and VM-sandboxed execution framework with timeout handling, cache support, and structured outcome classification; added probe script for contract validation and harness testing.
Core Utilities
src/judge.ts, src/refiner.ts, src/moduleLoader.ts
Added LLM-driven verdict judges for reasoning/teaching-examples/runtime outcomes; introduced SMT block refinement via LLM correction; implemented dynamic TypeScript module loading with private export extraction.
Contract & Type Definitions
src/contracts.ts, src/verifier.ts
Extended ProvenProperty and Violation with optional reason/confidence/judge_note fields; added path normalization; enhanced VerificationResult with reason extraction from SMT comments; added extractReason helper.
Principle Management
src/principles.ts
Introduced PrincipleStats tracking usage/proof/violation counts; added stat persistence and principle retirement lifecycle; integrated teaching-example judging into classification.
Pipeline Updates
src/pipeline/AxiomPhase.ts, src/pipeline/DerivationPhase.ts, src/pipeline/Pipeline.ts, src/cli.ts
Converted axiom phase to async; integrated StrengthChecker and PropertyTestChecker with conditional post-processing (LLM judging, harness synthesis); implemented CEGAR-based violation refinement; added principle validation and stats maintenance; propagated async execution throughout pipeline; updated CLI entrypoint guard.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Poem

🐰 Hops through contracts, checking proofs with care,
Harnesses woven from Z3's thorough air,
CEGAR refines each violation found,
While judges and principles dance all around,
Runtime whispers truth where logic once paled!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main changes: harness synthesis, property-test checker, and five refinement features in the existing verification loop.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refinements

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands the verification pipeline with (opt-in) runtime property testing plus harness synthesis, and strengthens the existing LLM↔Z3 derivation loop with reasoning/judge metadata, CEGAR-style witness refinement, principle telemetry/auto-retirement, and error/unknown retries.

Changes:

  • Add REASON extraction + judge metadata plumbing across verifier/contracts, and integrate LLM-judge checks for teaching examples and runtime outcomes.
  • Introduce property-test + harness synthesis execution (sandboxed + cached) and a module loader that exposes private top-level bindings for testing.
  • Improve pipeline robustness/telemetry: principle usage stats + retirement, CEGAR refinement of violations, and one-shot refinement retry on Z3 error/unknown; fix runDiff ||?? semantics.

Reviewed changes

Copilot reviewed 41 out of 44 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
src/verifier.ts Adds REASON parsing and carries reason into VerificationResult.
src/refiner.ts New LLM-based SMT block “repair” path for Z3 error/unknown blocks.
src/principles.ts Adds principle usage stats + retirement and applies teaching-example judging.
src/pipeline/Pipeline.ts Awaits Phase 5 execution; makes runVerifyOnly async.
src/pipeline/DerivationPhase.ts Adds retry/refinement on Z3 error/unknown, principle telemetry, CEGAR refinement, and normalizes contract file to relative path.
src/pipeline/AxiomPhase.ts Makes Phase 5 async; adds StrengthChecker + PropertyTestChecker and optional judge/harness flows.
src/moduleLoader.ts New loader that transpiles and exports top-level private bindings for testing.
src/judge.ts New judge layer for reasoning, teaching examples, and runtime outcomes.
src/harness.ts New harness synthesis + vm execution + content-addressed cache.
src/contracts.ts Normalizes stored contract file paths and adds confidence/judge metadata fields.
src/cli.ts Awaits verify-only pipeline; fixes runDiff default ref semantics; guards main() for import use.
src/checkers/index.ts Exports new StrengthChecker and PropertyTestChecker.
src/checkers/StrengthChecker.ts New “load-bearing assert” strength check (but currently doesn’t match Checker interface).
src/checkers/PropertyTestChecker.ts New property-test checker + harness synthesis plumbing (but currently doesn’t match Checker interface).
scripts/harness-probe.ts Adds an end-to-end harness synthesis/execution probe script.
prompts/invariant_derivation.md Requires ; REASON: in SMT blocks.
prompts/harness_synthesis.md Adds the large harness synthesis teaching prompt.
examples/arithmetic.ts Adds primitive examples to exercise property testing.
.neurallogignore Stops ignoring all examples/**; keeps most examples ignored except arithmetic demo.
.neurallog/principles/addition-overflow.json Adds a new principle definition.
.neurallog/graph.json Updates graph artifact (currently includes machine-local absolute paths).
.neurallog/derivation.json Updates derivation artifact.
.neurallog/contracts/** Updates/adds generated contract artifacts (some include absolute paths).
.gitignore Ignores .worktrees/.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -0,0 +1,11 @@
{
"key": "examples/inventory.ts/releaseStock[33]",
"file": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts",
Comment on lines +5 to +10
export class StrengthChecker implements Checker {
readonly name = "strength";

check(contracts: Contract[]): CheckResult[] {
const results: CheckResult[] = [];

Comment thread src/harness.ts
Comment on lines +124 to +133
export async function runHarness(
harnessCode: string,
fn: (...args: any[]) => any,
fnClass: any,
timeoutMs: number = 3000
): Promise<HarnessOutcome> {
const sandbox: any = {
functionUnderTest: fn,
functionUnderTestClass: fnClass,
console: {
Comment on lines 68 to +76
const checkers: Checker[] = [
new TemplateChecker(),
new ConsistencyChecker(),
new EntailmentChecker(),
new ReachabilityChecker(),
new StrengtheningChecker(),
new IndependenceChecker(),
new StrengthChecker(),
new PropertyTestChecker(options.projectRoot),
@@ -0,0 +1,11 @@
{
"key": "examples/inventory.ts/reserveStock[24]",
"file": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts",
@@ -0,0 +1,30 @@
{
"key": "examples/inventory.ts/reserveStock[20]",
"file": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts",
@@ -0,0 +1,11 @@
{
"key": "examples/inventory.ts/reserveStock[21]",
"file": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts",
Comment on lines +51 to +77
export class PropertyTestChecker implements Checker {
readonly name = "property-test";

private tsNodeLoaded = false;
private fnCache = new Map<string, ExtractedFn | null>();
private fileRequireFailures = new Set<string>();
private ctorParamsCache = new Map<string, CtorParamInfo[] | null>();
private ctorArgsCache = new Map<string, any[] | null>();
private limit: number;
private attempted = 0;
lastRun: PropertyTestContext[] = [];
private harnessCandidates: Array<{
contract: Contract;
prop: ProvenProperty | Violation;
source: "proven" | "violation";
skipReason: string;
}> = [];
harnessResults: CheckResult[] = [];

constructor(private projectRoot: string = process.cwd()) {
const raw = process.env.NEURALLOG_PROPERTY_TEST_LIMIT;
this.limit = raw ? Math.max(1, parseInt(raw, 10) || 10) : 10;
}

check(contracts: Contract[]): CheckResult[] {
if (process.env.NEURALLOG_PROPERTY_TEST !== "1") return [];

Comment thread .neurallog/graph.json Outdated
Comment on lines 2 to 4
"root": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts",
"projectRoot": "/Users/tsavo/provekit/.worktrees/refinements",
"files": [
@@ -0,0 +1,30 @@
{
"key": "examples/inventory.ts/reserveStock[23]",
"file": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts",
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 655ee1e306

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

contracts.push({
key,
file: fn.filePath,
file: fn.relativePath,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep contract file path consistent with stale-prune lookup

This change stores new contracts with file: fn.relativePath, but stale-contract cleanup still filters with c.file === fn.filePath (absolute path) in the same phase. After this, previously stored entries for a function no longer match the pruning predicate, so obsolete contracts are not removed when signals move/disappear, and stale proofs/violations can continue to influence later verification runs.

Useful? React with 👍 / 👎.

Comment thread src/checkers/PropertyTestChecker.ts Outdated
Comment on lines +238 to +241
verdict = outcome.kind === "threw" ? "violation" : "proven";
error = outcome.kind === "threw"
? "violation reproduced: function threw on the Z3 witness — bug confirmed"
: "violation did not reproduce: function ran cleanly on the Z3 witness — possible false positive";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Derive violation replay verdict from claim, not exception state

In the violation replay path, any non-throwing execution is marked proven, regardless of whether the violated property is value-based (e.g., arithmetic underflow/overflow) rather than exception-based. That means real counterexamples that return a bad value but do not throw will be reported as false positives, which can incorrectly flip violations to proven and corrupt confidence/judge annotations.

Useful? React with 👍 / 👎.

Comment thread src/checkers/PropertyTestChecker.ts Outdated

private parseZ3Model(text: string): Record<string, number | boolean> {
const model: Record<string, number | boolean> = {};
const modelRegex = /\(define-fun\s+(\S+)\s+\(\)\s+(Int|Real|Bool)\s+([^\s)]+(?:\s+[\d.\-]+)?)\s*\)/g;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Parse parenthesized Z3 Real literals in model extraction

The model parser regex/value handling only accepts very simple scalar tokens and a/b-style text, but Z3 commonly emits Real values as parenthesized expressions such as (/ 1 2) (and negative variants). Those entries are dropped, so Real parameters go unmapped and property-test execution is skipped or misclassified for contracts with real-valued models.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
prompts/invariant_derivation.md (1)

507-513: ⚠️ Potential issue | 🟠 Major

Update the worked examples to include ; REASON: too.

Line 512 makes ; REASON: mandatory, but the SMT-LIB examples above still omit it. Because this is a production prompt and src/verifier.ts only extracts reason text from ; REASON: comments, the examples should model the required format or the LLM may keep emitting reason-less blocks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@prompts/invariant_derivation.md` around lines 507 - 513, The worked SMT-LIB
examples in invariant_derivation.md are missing the required `; REASON:` comment
that src/verifier.ts expects; update each example SMT-LIB block (the fenced
```smt2 examples and their accompanying explanatory comments) to include a `;
REASON: <one or two short sentences>` line that cites specific variables/lines
in the code (e.g., reference the signal/variable names used in the example and
the corresponding source line numbers mentioned in the example) and explains
concisely why the invariant should hold; ensure every example block also keeps
the existing tags `; PRINCIPLE: ...`, `; LINE: <number>`, includes
`(check-sat)`, and a brief comment about what sat/unsat means so verifier.ts can
extract the reason text correctly.
src/contracts.ts (1)

147-150: ⚠️ Potential issue | 🟠 Major

Root cause: normalize contract.file on write, not only on load.

normalizeContractFile is applied in loadFromDisk but put/writeSingle persist contract.file verbatim. That's why the inventory artifacts committed in this PR (reserveStock[20].json, releaseStock[36].json, and likely others) contain absolute /Users/tsavo/... paths while the arithmetic ones — presumably written from a clean checkout — are relative. New contracts created on any contributor's machine will keep leaking local paths into git. Normalize before writing so behavior is consistent regardless of how the in-memory contract.file got populated.

Proposed diff
   put(contract: Contract): void {
+    contract.file = normalizeContractFile(contract.file, this.projectRoot);
+    contract.key = signalKey(contract.file, contract.function, contract.line);
     this.contracts.set(contract.key, contract);
     this.writeSingle(contract);
   }

Optionally also re-derive contract.key here (as above) so a contract constructed with an absolute path can't get stored under a differently-shaped key than the one load produces.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/contracts.ts` around lines 147 - 150, The put/write path currently
persists contract.file verbatim; normalizeContractFile is only used in
loadFromDisk so local absolute paths get committed—update the put method (and/or
writeSingle) to call normalizeContractFile(contract) before writing and persist
the normalized contract.file, and also re-derive contract.key from the
normalized file (e.g., assign contract.key = deriveKeyFrom(contract.file) or the
existing key derivation logic) so stored keys match loadFromDisk behavior;
ensure you reference the normalizeContractFile, put, writeSingle, and
contract.key symbols when making the change.
src/pipeline/DerivationPhase.ts (2)

596-605: ⚠️ Potential issue | 🟠 Major

Persist reason and judge notes into newly built contracts.

verifyAll() and retry refinement now populate VerificationResult.reason/judgeNote, but buildContracts() drops them. That prevents the new REASON/judge metadata from reaching stored proven and violations entries.

Suggested fix
         if (v.z3Result === "unsat") {
-          proven.push({ principle: v.principle, principle_hash: pHash, claim, smt2: v.smt2 });
+          proven.push({
+            principle: v.principle,
+            principle_hash: pHash,
+            claim,
+            smt2: v.smt2,
+            reason: v.reason,
+            confidence: v.confidence,
+            judge_note: v.judgeNote,
+          });
         } else if (v.z3Result === "sat") {
-          violations.push({ principle: v.principle, principle_hash: pHash, claim, smt2: v.smt2, witness: v.witness, complexity: v.complexity, confidence: v.confidence });
+          violations.push({
+            principle: v.principle,
+            principle_hash: pHash,
+            claim,
+            smt2: v.smt2,
+            witness: v.witness,
+            complexity: v.complexity,
+            confidence: v.confidence,
+            reason: v.reason,
+            judge_note: v.judgeNote,
+          });
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pipeline/DerivationPhase.ts` around lines 596 - 605, The buildContracts()
loop drops VerificationResult.reason and judgeNote when converting `toUse` items
into `proven`/`violations`; update the push objects inside the loop in
DerivationPhase.buildContracts (the for...of over `toUse`) to also copy
`v.reason` -> `reason` and `v.judgeNote` -> `judgeNote` (or use
`v?.reason`/`v?.judgeNote` if optional) so that the resulting arrays `proven`
and `violations` include those fields and the metadata from verifyAll()/retry
refinement is preserved.

282-285: ⚠️ Potential issue | 🟠 Major

Compare stale contracts using the same normalized path form.

New contracts are stored with fn.relativePath on Line 610, but stale cleanup still compares existing c.file to absolute fn.filePath. Previously generated contracts for removed signals in the same function can survive indefinitely.

Suggested fix
       const existingForFn = store.getAll().filter((c) =>
-        c.file === fn.filePath && c.function === fn.functionName && !newKeys.has(c.key)
+        c.file === fn.relativePath && c.function === fn.functionName && !newKeys.has(c.key)
       );

Also applies to: 610-610

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pipeline/DerivationPhase.ts` around lines 282 - 285, The stale-contracts
filter compares existing contract.c.file to fn.filePath (absolute) while new
contracts use fn.relativePath; update the comparison in the existingForFn
computation to use the same normalized/relative path as new contracts (e.g.,
compare c.file to fn.relativePath or normalize both sides) so that contracts are
correctly detected as stale; adjust the predicate in the store.getAll().filter
that builds existingForFn (and any other similar cleanup that references
fn.filePath) to use fn.relativePath or a normalized path form consistently with
where newKeys/contracts are stored.
🟠 Major comments (18)
.neurallog/principles/addition-overflow.json-14-20 (1)

14-20: ⚠️ Potential issue | 🟠 Major

Guard patterns are too broad and can hide real violations.

Using generic tokens like "overflow" and especially "capacity" as guard indicators can mark unsafe additions as “guarded” even when no bound check exists. This risks systematic false negatives.

Proposed tightening
       "guardPatterns": [
-        "< MAX",
-        "<= MAX",
-        "Number.MAX_SAFE_INTEGER",
-        "overflow",
-        "capacity"
+        "< MAX",
+        "<= MAX",
+        "< totalCapacity",
+        "<= totalCapacity",
+        "Number.MAX_SAFE_INTEGER"
       ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/principles/addition-overflow.json around lines 14 - 20, The
guardPatterns list in addition-overflow.json is too broad: remove or tighten the
generic tokens "overflow" and "capacity" (and any similarly generic words) so
they don't cause false negatives; instead keep explicit, unambiguous checks such
as "Number.MAX_SAFE_INTEGER", "< MAX", "<= MAX" and add more precise patterns if
needed (e.g. function names like "checkOverflow", "ensureCapacity" or regexes
matching explicit bound checks like "\\b<=\\s*MAX\\b" or
"Number\\.MAX_SAFE_INTEGER") so the detector only treats genuine bound checks as
guards.
.neurallog/principles/addition-overflow.json-23-27 (1)

23-27: ⚠️ Potential issue | 🟠 Major

SMT template and teaching example model different bug classes.

The template only proves result > 9007199254740991, while the teaching example is about exceeding domain capacity (new_total > total_capacity). This inconsistency can make validation/judging unreliable for the intended rule.

Suggested direction
-  "smt2Template": "(declare-const {{base}} Int)\n(declare-const {{delta}} Int)\n(declare-const {{max_safe}} Int)\n(declare-const {{result}} Int)\n(assert (= {{result}} (+ {{base}} {{delta}})))\n(assert (> {{delta}} 0))\n(assert (> {{base}} 0))\n(assert (= {{max_safe}} 9007199254740991))\n(assert (> {{result}} {{max_safe}}))\n(check-sat)",
+  "smt2Template": "(declare-const {{base}} Int)\n(declare-const {{delta}} Int)\n(declare-const {{limit}} Int)\n(declare-const {{result}} Int)\n(assert (= {{result}} (+ {{base}} {{delta}})))\n(assert (> {{delta}} 0))\n(assert (>= {{base}} 0))\n(assert (> {{limit}} 0))\n(assert (> {{result}} {{limit}}))\n(check-sat)",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/principles/addition-overflow.json around lines 23 - 27, The SMT
template and the teachingExample target different properties; align them by
making the smt2Template assert overflow against a domain capacity like the
example (use symbols such as booked_seats, batch_size, total_capacity,
new_total) or alternatively change the teachingExample to match the current
large-number overflow template (result and max_safe = 9007199254740991); update
either smt2Template or teachingExample so both assert the same bug condition
(e.g., replace smt2Template to declare booked_seats, batch_size, total_capacity,
new_total and assert (= new_total (+ booked_seats batch_size)) and (> new_total
total_capacity) to match the teachingExample).
.neurallog/principles/addition-overflow.json-10-13 (1)

10-13: ⚠️ Potential issue | 🟠 Major

parentTypes field is unsupported and will be silently ignored at runtime.

The ASTPattern interface in src/principles.ts:8-16 does not include a parentTypes field. The pattern matcher in src/templates/TemplateEngine.ts checks only nodeType, operator, method, requiresParamRef, guardPatterns, pairMethod, and checkPaths—never parent types. This means the constraint scoping intended here (expression_statement or lexical_declaration parents only) will have no effect, causing the principle to match all binary_expression nodes with operator + regardless of parent context, broadening matches unexpectedly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/principles/addition-overflow.json around lines 10 - 13, The JSON
uses a non-supported "parentTypes" constraint which is ignored because
ASTPattern (src/principles.ts) and the TemplateEngine matcher
(src/templates/TemplateEngine.ts) never handle it; either remove the field from
the JSON or add support: extend the ASTPattern interface to include parentTypes:
string[] and update the TemplateEngine's pattern-matching routine (the function
that currently checks
nodeType/operator/method/requiresParamRef/guardPatterns/pairMethod/checkPaths)
to also obtain the AST node's parent type and require parentTypes to contain it
(treat absent parent as failing the constraint unless parentTypes includes a
sentinel). Ensure you reference ASTPattern and the matching function in
TemplateEngine when implementing the check so parent-scoped patterns only match
when the immediate parent node type is in parentTypes.
.neurallog/graph.json-2-25 (1)

2-25: ⚠️ Potential issue | 🟠 Major

Do not commit a graph rooted in a private local worktree.

Lines 2, 3, 6, 14, and 20 point to /Users/tsavo/provekit/.worktrees/refinements, which is non-portable, leaks local path information, and will break consumers that try to resolve the graph outside that machine. Please regenerate this artifact from the repository checkout root, use repository-relative paths if supported, or omit this local run artifact from the PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/graph.json around lines 2 - 25, The graph.json file contains
non-portable absolute local worktree paths in the "root", "projectRoot", and
"files[].path" fields (and possibly "files[].relativePath"); regenerate or
replace this artifact so it uses repository-relative paths (or omits the local
run artifact entirely) and ensure "root" and "projectRoot" no longer reference
/Users/tsavo/.worktrees/refinements; update or regenerate the graph from the
repository checkout root so all file paths are relative (or remove the file from
the PR if this artifact should not be committed).
prompts/harness_synthesis.md-234-263 (1)

234-263: ⚠️ Potential issue | 🟠 Major

Remove or correct this flawed example: it violates its own precondition, creating a template for false encoding-gap reports.

The harness assigns const a = 1 + Number.EPSILON / 2 and const b = 1. At runtime, both evaluate to 1 (verified: a === b), so the fixture passes equal values to areDistinct(a, b). This violates the stated precondition a ≠ b. Per the prompt's own guidance on precondition-vacuity (lines 95, 601-604), a fixture that fails its precondition should throw harness-error or UNTESTABLE, not unconditionally report encoding-gap. As written, this example teaches LLM-synthesized harnesses to misclassify precondition violations as encoding gaps, which falsely downgrades valid proofs.

Either remove this example, or rewrite it to:

  • Use a fixture that actually satisfies the precondition a ≠ b at runtime, or
  • Explicitly check the precondition at the start and classify a violation as harness-error/UNTESTABLE.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@prompts/harness_synthesis.md` around lines 234 - 263, The harness assigns a =
1 + Number.EPSILON / 2 which rounds to 1 at runtime, violating the precondition
a ≠ b and causing a false "encoding-gap" report; fix by either (A) ensuring
runtime distinctness (e.g., set a to 1 + Number.EPSILON or any pair that
preserves a !== b at runtime) or (B) explicitly check the precondition up-front
and classify violations as harness-error/UNTESTABLE (e.g., if (a === b) throw
harness-error/return UNTESTABLE) before calling functionUnderTest or asserting
areDistinct, so the harness does not mislabel precondition failures as encoding
gaps.
.neurallog/contracts/examples/inventory.ts/releaseStock[34].json-3-3 (1)

3-3: ⚠️ Potential issue | 🟠 Major

Normalize contract file paths before writing to disk.

Multiple contract JSON artifacts in .neurallog/contracts/ carry absolute local paths, including at least 7 entries in inventory.ts/ (releaseStock[33,34,36], reserveStock[20,21,23,24]) and additional entries under src/checkers/Checker.ts/. The root cause: writeSingle() in src/contracts.ts:235 serializes contracts without calling normalizeContractFile(), while the constructor's load path (lines 111, 121) does normalize.

Impact: Breaks reproducibility across machines, pollutes diffs, and persists maintainer-specific absolute paths in committed artifacts.

🔧 Fix

At src/contracts.ts:235, normalize the file path before write:

  private writeSingle(contract: Contract): void {
    const filePath = join(this.contractsDir, signalKeyToPath(contract.key));
    const dir = dirname(filePath);
    mkdirSync(dir, { recursive: true });
+   contract.file = normalizeContractFile(contract.file, this.projectRoot);
    writeFileSync(filePath, JSON.stringify(contract, null, 2));
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/contracts/examples/inventory.ts/releaseStock[34].json at line 3,
The serialized contract artifacts are being written with absolute local paths
because writeSingle() in src/contracts.ts does not call normalizeContractFile()
before writing; update writeSingle() to run the contract's file path through
normalizeContractFile() (the same normalization used in the class
constructor/load path) and use that normalized path when serializing/writing the
JSON so persisted artifacts contain only normalized, reproducible file paths.
.neurallog/contracts/examples/inventory.ts/reserveStock[20].json-3-3 (1)

3-3: ⚠️ Potential issue | 🟠 Major

Absolute user-specific path committed to the repo.

"file": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts" leaks a local filesystem path and will not resolve on any other machine or in CI. The arithmetic contracts in this same PR correctly store "file": "examples/arithmetic.ts". Root cause is in src/contracts.ts (ContractStore.put doesn't normalize before write) — see comment there. Once that's fixed, regenerate these inventory.ts/* artifacts so the committed files use repo-relative paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/contracts/examples/inventory.ts/reserveStock[20].json at line 3,
ContractStore.put is writing absolute, user-specific file paths into contract
artifacts; update src/contracts.ts so ContractStore.put normalizes the stored
path to a repo-relative path (e.g., use path.relative(projectRoot, originalPath)
or a similar normalization function) before writing the JSON, and ensure any
helper that computes the stored "file" value (inside ContractStore.put) handles
undefined/empty inputs robustly. After changing ContractStore.put, regenerate
the inventory.ts artifacts so the committed files contain repo-relative paths
like "examples/inventory.ts" instead of an absolute user path.
src/pipeline/DerivationPhase.ts-485-498 (1)

485-498: ⚠️ Potential issue | 🟠 Major

Recompute newViolations after CEGAR mutates contracts.

allNewViolations is collected before cegarRefineViolations(). If CEGAR flips or refines a [NEW] violation, the returned DerivationOutput.newViolations can still reference the pre-CEGAR violation state.

Suggested fix
-    const output: DerivationOutput = {
+    const currentNewViolations = allContracts.flatMap((c) =>
+      c.violations
+        .filter((v) => v.principle?.toUpperCase().includes("NEW"))
+        .map((v) => ({
+          violation: {
+            smt2: v.smt2,
+            z3Result: "sat" as const,
+            principle: v.principle,
+            error: undefined,
+            witness: v.witness,
+            complexity: v.complexity,
+            reason: v.reason,
+          },
+          context: c.key,
+        }))
+    );
+
+    const output: DerivationOutput = {
       contracts: allContracts,
-      newViolations: allNewViolations,
+      newViolations: currentNewViolations,
       derivedAt: new Date().toISOString(),
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pipeline/DerivationPhase.ts` around lines 485 - 498, The newViolations
value is computed before calling cegarRefineViolations so it can contain stale
violation objects that CEGAR later flips/refines; after calling
this.cegarRefineViolations(...) recompute the "new" violations from the mutated
contracts (allContracts) and assign that fresh array to
DerivationOutput.newViolations instead of using the precomputed
allNewViolations. Specifically, after cegarRefineViolations returns, iterate
over allContracts (or use the existing helper that produced allNewViolations) to
collect violations whose status is still NEW and use that result when building
the output object.
src/pipeline/DerivationPhase.ts-210-229 (1)

210-229: ⚠️ Potential issue | 🟠 Major

Fix metadata filtering when extracting the retry claim.

Line 215 strips ; before testing a regex that still expects ;, so PRINCIPLE, LINE, or REASON comments can be selected as the claim sent to the refiner.

Suggested fix
-        const claim = tv.smt2.split("\n").find((l) => l.trim().startsWith(";") && !/^;\s*(PRINCIPLE|LINE|REASON):/i.test(l.trim().replace(/^;\s*/, "")))?.trim().replace(/^;\s*/, "") || "(no claim)";
+        const claim = tv.smt2
+          .split("\n")
+          .map((l) => l.trim().replace(/^;\s*/, ""))
+          .find((l) => l && !/^(PRINCIPLE|LINE|REASON):/i.test(l)) || "(no claim)";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pipeline/DerivationPhase.ts` around lines 210 - 229, The claim-extraction
logic in the templateVerifications loop mistakenly removes the leading ";"
before testing the metadata regex, so lines like "; PRINCIPLE:" pass the
negative test and become claims; fix by applying the metadata regex to the
original trimmed line (with the leading ";") instead of the semicolon-stripped
string — e.g., in the find callback use l.trim().startsWith(";") &&
!/^;\s*(PRINCIPLE|LINE|REASON):/i.test(l.trim()) to filter metadata, then only
call .replace(/^;\s*/, "") when building the claim to strip the semicolon before
storing/sending to refineErrorBlock; update references around
templateVerifications, claim and refineErrorBlock accordingly.
src/principles.ts-209-210 (1)

209-210: ⚠️ Potential issue | 🟠 Major

Avoid reusing active IDs after retirement.

Removing retired principles from this.principles makes the length-based nextId() unsafe. For example, active P1,P3 after retiring P2 gives length + 1 === P3, colliding with an existing active principle.

Suggested fix
   nextId(): string {
-    const existing = this.principles.length + 1;
-    return `P${existing}`;
+    const maxNumericId = this.principles.reduce((max, p) => {
+      const match = p.id.match(/^P(\d+)$/i);
+      return match ? Math.max(max, parseInt(match[1]!, 10)) : max;
+    }, 0);
+    return `P${maxNumericId + 1}`;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/principles.ts` around lines 209 - 210, Retiring a principle by filtering
it out breaks the length-based nextId() allocation and can cause ID reuse
collisions; instead, stop removing entries from this.principles and mark them
retired (e.g., add/ensure a retired boolean on the principle object) or
implement a monotonic id generator property (e.g., this.nextIdCounter) that
nextId() increments and returns; update the method that currently does
this.principles = this.principles.filter((x) => x.id !== id) to set the retired
flag on the matching principle (by id) and ensure nextId() uses the counter or
computes max existing numeric id + 1 to guarantee uniqueness.
src/principles.ts-126-130 (1)

126-130: ⚠️ Potential issue | 🟠 Major

Track all principle IDs, not only P<N> IDs.

recordUse() drops any non-P<number> principle IDs, but DerivationPhase can add slug IDs for pattern-gap principles. Those principles will never accumulate stats, so telemetry and auto-retirement silently miss them.

Suggested fix
-      .split(/[,+&\s]+/)
+      .split(/[,+&\s]+/)
       .map((s) => s.trim())
-      .filter((s) => /^P\d+$/i.test(s));
+      .filter((s) => this.principles.some((p) => p.id === s));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/principles.ts` around lines 126 - 130, The code in the principleTag
parsing drops any non-`P<number>` IDs so slug IDs produced by DerivationPhase
never get recorded; update the parsing in the principleTag handling (the code
that builds `ids`) so it does not filter out non-`P...` tokens — e.g. accept any
non-empty trimmed token (or explicitly allow both /^P\d+$/i and slug patterns)
instead of `.filter((s) => /^P\d+$/i.test(s))`, so that `recordUse()` receives
and accumulates stats for slug IDs as well.
src/checkers/PropertyTestChecker.ts-414-422 (1)

414-422: ⚠️ Potential issue | 🟠 Major

Don’t let the judge erase boundary failures it did not see.

judgeRuntimeOutcome receives only the primary outcome, but Line 436 can flip a violation caused solely by boundaryRuns back to proven. Include boundary failures in the judge input or guard this flip.

🐛 Proposed guard until the judge prompt supports boundary runs
     for (const ctx of this.lastRun) {
+      const boundaryFailures = ctx.boundaryRuns.filter((b) => b.outcome.kind === "threw");
       const verdict = await judgeRuntimeOutcome(
         {
           functionSource: ctx.functionSource,
           claim: ctx.claim,
@@
-      } else if (verdict.valid && ctx.result.verdict === "violation") {
+      } else if (verdict.valid && ctx.result.verdict === "violation") {
+        if (boundaryFailures.length > 0) {
+          ctx.result.error = ctx.result.error
+            ? `${ctx.result.error}; property-judge did not evaluate ${boundaryFailures.length} boundary failure(s)`
+            : `property-judge did not evaluate ${boundaryFailures.length} boundary failure(s)`;
+          confirmed++;
+          continue;
+        }
         ctx.result.verdict = "proven";
         ctx.result.error = `runtime threw but judge considered outcome consistent with claim: ${verdict.note}`;
         flipped++;

Also applies to: 436-439

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/PropertyTestChecker.ts` around lines 414 - 422, The judge call
is only given the primary outcome (via judgeRuntimeOutcome) so verdicts can
wrongly flip to proven when failures exist only in boundaryRuns; update the code
that iterates this.lastRun to either include boundaryRuns (or their aggregated
failure info) in the object passed to judgeRuntimeOutcome (e.g., add a
boundaryRuns or boundaryFailures field alongside functionSource, claim, smt2,
inputsSummary, outcome) or add a guard before accepting a "proven" verdict that
checks ctx.boundaryRuns (or ctx.boundaryFailures) for any failures and prevents
flipping to proven if any boundary failure is present; target the call to
judgeRuntimeOutcome and the logic that assigns/accepts verdict to ensure
boundary failures are considered.
src/checkers/PropertyTestChecker.ts-730-743 (1)

730-743: ⚠️ Potential issue | 🟠 Major

Avoid caching bound instance methods across property runs.

For instance methods, resolveCallable binds a method to one constructed object, and fnCache stores that bound function. Stateful methods can leak mutations between the primary run, boundary runs, and later contracts.

Cache FunctionInfo/constructor metadata instead, and create a fresh instance per test invocation. If a quick mitigation is needed, skip caching instance-bound methods:

🐛 Minimal mitigation
     if (fn) {
       const result = { fn, paramNames: info.paramNames, source: info.source };
-      this.fnCache.set(cacheKey, result);
+      if (!info.className || info.isStatic) {
+        this.fnCache.set(cacheKey, result);
+      }
       return result;
     }
@@
       if (fn) {
         const result = { fn, paramNames: info.paramNames, source: info.source };
-        this.fnCache.set(cacheKey, result);
+        if (!info.className || info.isStatic) {
+          this.fnCache.set(cacheKey, result);
+        }
         return result;
       }

Also applies to: 774-782, 798-801

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/PropertyTestChecker.ts` around lines 730 - 743, The code is
caching instance-bound methods (via resolveCallable) into fnCache, which can
leak state between runs; change the caching strategy so you do NOT cache bound
instance methods: in the resolution logic around resolveCallable,
loadModuleWithPrivates, cacheKey and fnCache.set, detect when the returned fn is
an instance-bound method (or when FunctionInfo indicates a constructor/instance
method) and either (a) store only the FunctionInfo/constructor metadata
(paramNames, source and constructor reference) in the cache and instantiate a
fresh object per test invocation before calling the method, or (b) as a minimal
mitigation skip fnCache.set entirely for instance-bound methods so each run
resolves a new bound instance; update all three spots that currently call
fnCache.set (the blocks using resolveCallable and loadModuleWithPrivates) to
follow this approach.
src/checkers/PropertyTestChecker.ts-506-506 (1)

506-506: ⚠️ Potential issue | 🟠 Major

Resolve the callable by contract line, not only by name.

contract.line is available but not used when extracting/loading the target. In files with duplicate method/function names, Line 1127 selects the first match, and Line 710 caches all same-name targets together.

🐛 Proposed direction
-      const info = this.extractFunctionInfo(absPath, cand.contract.function);
+      const info = this.extractFunctionInfo(absPath, cand.contract.function, cand.contract.line);
-      const extracted = this.loadFunction(absPath, cand.contract.function);
+      const extracted = this.loadFunction(absPath, cand.contract.function, cand.contract.line);
-  private loadFunction(filePath: string, fnName: string): ExtractedFn | null {
-    const cacheKey = `${filePath}::${fnName}`;
+  private loadFunction(filePath: string, fnName: string, line?: number): ExtractedFn | null {
+    const cacheKey = `${filePath}::${fnName}::${line ?? "any"}`;
-    const info = this.extractFunctionInfo(filePath, fnName);
+    const info = this.extractFunctionInfo(filePath, fnName, line);
-  private extractFunctionInfo(filePath: string, fnName: string): FunctionInfo | null {
+  private extractFunctionInfo(filePath: string, fnName: string, line?: number): FunctionInfo | null {

Inside extractFunctionInfo, only accept a name match when the target node spans the contract line:

-          if (name === fnName) target = node;
+          const spansContractLine =
+            line === undefined ||
+            (node.startPosition.row + 1 <= line && node.endPosition.row + 1 >= line);
+          if (name === fnName && spansContractLine) target = node;

Also applies to: 548-548, 709-713, 1111-1138

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/PropertyTestChecker.ts` at line 506, The code currently resolves
a callable only by name (calling extractFunctionInfo(absPath,
cand.contract.function)) which breaks when duplicates exist; modify
extractFunctionInfo to require that a name match also spans the contract.line
(use cand.contract.line) by checking the node's source location
(node.loc.start.line <= contract.line <= node.loc.end.line) before accepting it,
and update the caching/lookup that groups same-name targets to include the line
in its key (e.g. `${name}@${line}`) so callers that currently use
extractFunctionInfo (and any find-by-name helpers) return the specific node for
that contract.line instead of the first same-name match.
src/checkers/PropertyTestChecker.ts-388-394 (1)

388-394: ⚠️ Potential issue | 🟠 Major

Constrain fuzzy substring parameter matching.

Line 391 can map short params like id, i, or n to unrelated model keys such as valid or condition. That feeds the runtime checker arbitrary values and can flip verdicts incorrectly.

🐛 Proposed guard
     const substringMatch = keys.find(
       (k) => {
         const n = k.toLowerCase().replace(/_/g, "");
-        return n.includes(target) || target.includes(n);
+        if (target.length < 3 || n.length < 3) return false;
+        return (
+          n.startsWith(target) ||
+          n.endsWith(target) ||
+          target.startsWith(n) ||
+          target.endsWith(n)
+        );
       }
     );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/PropertyTestChecker.ts` around lines 388 - 394, The fuzzy
substring match is too permissive and maps short targets like "id" or "n" to
unrelated keys; tighten the predicate in the keys.find used to produce
substringMatch (the block that computes n and returns n.includes(target) ||
target.includes(n)) so it only considers fuzzy matches when both the normalized
target and the normalized candidate key are reasonably long (e.g., length >= 3)
or when the target matches a whole token boundary; update the predicate to check
target.length >= 3 && n.length >= 3 (or a whole-word boundary test) before
returning true, then keep the existing return of model[substringMatch].
src/checkers/PropertyTestChecker.ts-269-275 (1)

269-275: ⚠️ Potential issue | 🟠 Major

Don't treat unresolved Promises as successful returns.

Line 271 records an async function's returned Promise as a normal value, and rejected Promises bypass this try/catch. This can mark async runtime failures as proven.

🐛 Minimal guard if async execution is not supported yet
     try {
       const value = fn(...args);
+      if (value && typeof value.then === "function") {
+        return {
+          kind: "threw",
+          error: "async function returned a Promise; property-test does not await async targets",
+        };
+      }
       return { kind: "returned", value: this.formatValue(value) };

Longer term, make check/runOne/callFunction async and await thenables.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/PropertyTestChecker.ts` around lines 269 - 275, callFunction
currently treats returned Promise objects as successful returns and doesn't
catch rejected promises; update callFunction to handle thenables: either
(preferred) make callFunction async and await the result (and propagate/format
rejections), and update callers check and runOne to be async as well, or
(minimal) detect thenables after calling fn(...args) (e.g., check typeof
value?.then === "function") and return a { kind: "threw", error: "unresolved
Promise" } (or similar) instead of treating it as a normal return; reference the
callFunction function and the check/runOne call sites so all async flows
properly await and handle promise rejections.
src/checkers/PropertyTestChecker.ts-326-333 (1)

326-333: ⚠️ Potential issue | 🟠 Major

Insert boundary pinnings before (check-sat).

The prop.smt2 always ends with (check-sat) (see extractSmt2Blocks in verifier.ts, lines 83–84). When the code appends pinnings after filtering out the goal assertion, the boundary assertions appear after (check-sat), causing Z3 to evaluate satisfiability before those pinnings are visible. This makes invalid boundary variants appear precondition-consistent.

Fix: Insert pinnings before the first (check-sat):

Proposed implementation
-    const block = lines
-      .filter((_, i) => i !== goalIdx)
-      .concat(pinnings)
-      .join("\n");
+    const preconditionLines = lines.filter((_, i) => i !== goalIdx);
+    const checkSatIdx = preconditionLines.findIndex((line) => line.trim() === "(check-sat)");
+    const blockLines = checkSatIdx >= 0
+      ? [
+          ...preconditionLines.slice(0, checkSatIdx),
+          ...pinnings,
+          ...preconditionLines.slice(checkSatIdx),
+        ]
+      : [...preconditionLines, ...pinnings, "(check-sat)"];
+    const block = blockLines.join("\n");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/PropertyTestChecker.ts` around lines 326 - 333, In
PropertyTestChecker (the block that builds `block` from `lines`, `goalIdx`, and
`pinnings`), the code currently appends `pinnings` after filtering out the goal
which can place them after the file's final `(check-sat)`; change the assembly
so that `pinnings` are inserted before the first `(check-sat)` token in the
resulting SMT2 block (i.e., find the index of the first line that matches
`/^\s*\(check-sat\)/` and splice `pinnings` there, or if none exists append at
the end), then run Z3 with that reordered `block` so boundary assertions are
visible to the solver when it evaluates `check-sat` (references: variables
`lines`, `goalIdx`, `pinnings` in PropertyTestChecker).
src/checkers/PropertyTestChecker.ts-671-699 (1)

671-699: ⚠️ Potential issue | 🟠 Major

Handle Z3 Real values emitted as S-expressions.

Z3 commonly emits Real model values as S-expressions like (/ 1.0 2.0) and (- (/ 1.0 2.0)) for exact representation. The current modelRegex and negRegex patterns explicitly exclude parentheses via [^\s)]+, so they miss these nested forms. Real-valued parameters then vanish from the model, causing property tests to skip or misclassify.

Replace regex-only parsing with a small S-expression parser for the value portion of define-fun bodies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/PropertyTestChecker.ts` around lines 671 - 699, The current
parseZ3Model uses regexes that fail to capture Real values emitted as
S-expressions (e.g. "(/ 1 2)" or "(- (/ 1 2))"), so update parseZ3Model and
parseValue to handle S-expression bodies instead of relying on `[^\s)]+`
matches: in parseZ3Model locate the define-fun name and sort, then extract the
full parenthesized body (handling nested parens) and pass that string to
parseValue; in parseValue accept both atomic numerics and S-exprs and implement
a tiny S-expression evaluator that recognizes (/ a b) -> a/b and unary - applied
to atoms or nested S-exprs, returning a number for Int/Real and boolean for Bool
(or undefined on parse errors) so Real rationals and negations are correctly
parsed into the model.
🟡 Minor comments (4)
.neurallog/contracts/examples/arithmetic.ts/safeDivide[7].json-11-13 (1)

11-13: ⚠️ Potential issue | 🟡 Minor

Keep reason code-cited and move CEGAR details to judge_note.

Line 13 describes the prior spurious encoding rather than why safeDivide satisfies the guard property. Since judge_note already captures cegar-flipped, the reason should be the short code-level explanation, e.g. that Line 7 returns immediately on b === 0, so Line 8 is unreachable on that branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/contracts/examples/arithmetic.ts/safeDivide[7].json around lines
11 - 13, Update the JSON so the "reason" field contains a short, code-level
explanation that Line 7 in safeDivide returns immediately when b===0 (so Line
8/code_after is unreachable on that branch) and keep the existing CEGAR
debugging details (the spurious encoding, missing constraints, and cegar-flipped
note) moved into or preserved in "judge_note"; reference the symbols
guard_condition, guard_returns and code_after_reached when explaining the
immediate return and unreachable code so it’s clear which assertions correspond
to the early-return semantics.
.neurallog/contracts/src/observations.ts/findSimilarByAST[77].json-14-14 (1)

14-14: ⚠️ Potential issue | 🟡 Minor

judge_note is truncated to 200 characters without a sentinel, causing incomplete reasoning.

The judge pipeline (src/judge.ts lines 68 and 169) truncates notes via .slice(0, 200), and the CEGAR phase (src/pipeline/DerivationPhase.ts line 855) applies the same limit. No truncation indicator (e.g., [...] or ) is appended, so downstream readers encounter incomplete sentences appearing as valid text. This degrades the diagnostic value of judge notes when developers investigate rejected proofs or reconciliation decisions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/contracts/src/observations.ts/findSimilarByAST[77].json at line
14, The judge notes are being truncated with .slice(0,200) in the judge pipeline
(judge.ts) and in the CEGAR phase (DerivationPhase.ts) without a sentinel,
producing incomplete sentences; replace those inline slices with a single
utility (e.g., truncateWithEllipsis(text, max=200)) and use it from both places
so that when text.length > max it returns text.slice(0, max - 1) + '…' (or '
[...]') otherwise returns text unchanged; add/adjust unit tests for judge note
formatting to assert the sentinel is present when truncation occurs.
.neurallog/contracts/examples/inventory.ts/reserveStock[20].json-17-17 (1)

17-17: ⚠️ Potential issue | 🟡 Minor

judge_note is truncated mid-word.

The note ends with "no guard c" — looks like the judge output was sliced at a hard character limit. If there's a .slice(0, N) upstream on judge text, consider either raising the cap or slicing at a word boundary / appending "…" so persisted notes stay readable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/contracts/examples/inventory.ts/reserveStock[20].json at line 17,
The persisted judge_note for reserveStock is being cut mid-word; locate where
the judge output is truncated into the "judge_note" field (the code that slices
the judge text before saving, used for entries like reserveStock/judge_note) and
change it to either increase the cap or perform safe truncation: slice at the
last word boundary under the limit and append an ellipsis ("…") if truncated.
Ensure the truncation logic always preserves whole words and uses a clear
ellipsis indicator when shortening the judge output.
src/refiner.ts-57-64 (1)

57-64: ⚠️ Potential issue | 🟡 Minor

Tighten the "no-op refinement" and fence checks.

A few small correctness/robustness gaps in this block:

  1. revised === smt2 compares the trimmed revised block against the raw original. If the original has any leading/trailing whitespace (common when the block came from a template or a previous LLM response), a truly-identical refinement will slip through and trigger a redundant Z3 call. Compare trimmed-to-trimmed.
  2. The fence regex accepts smt-lib / smtlib2 but the system prompt only instructs the model to use smt2 fences. That's fine, but also consider rejecting blocks that contain additional prose/backticks before the fence gets picked up — match returns the first, so a commentary fenced block earlier in the response would be taken as the refinement.
  3. No guard against empty revised (regex can match with whitespace-only body if the model replies with an empty fence); (check-sat) check catches most cases but a single literal (check-sat) with no decls will also get handed to Z3. Cheap to add a minimum-length / declare-const sanity check.
Proposed diff
-  const m = response.text.match(/```(?:smt2|smt-lib|smtlib2)?\s*\n([\s\S]*?)```/i);
-  if (!m) return null;
-  const revised = m[1]!.trim();
-  if (!revised.includes("(check-sat)")) return null;
-  if (revised === smt2) return null;
+  const m = response.text.match(/```(?:smt2|smt-lib|smtlib2)?\s*\n([\s\S]*?)```/i);
+  if (!m) return null;
+  const revised = m[1]!.trim();
+  if (!revised.includes("(check-sat)")) return null;
+  if (!/\(declare-(const|fun)\b/.test(revised)) return null;
+  if (revised === smt2.trim()) return null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/refiner.ts` around lines 57 - 64, The current refinement extraction can
wrongly accept non-matching or empty fences and performs a raw-string equality
check; update the logic around response.text matching and the revised/smt2
checks: ensure the regex for the code fence still targets smt2 but guard against
earlier commentary by verifying the matched fence body is non-empty and contains
meaningful declarations (e.g., require a declare-const or declare-fun token)
before proceeding, trim both sides when comparing revised and smt2 (use revised
=== smt2.trim()), and keep calling verifyBlock(revised) only after these guards
pass; search for the variables/expressions response.text, revised, smt2 and the
verifyBlock call to make the changes.
🧹 Nitpick comments (5)
src/moduleLoader.ts (2)

67-107: Consider memoizing by filePath to avoid repeated parse/transpile/compile.

loadModuleWithPrivates is invoked from PropertyTestChecker and the harness probe, potentially many times against the same target file. Each call re-reads the file, re-runs tree-sitter + ts.transpileModule, and re-executes the module's top-level code via _compile. In addition to being slow, re-executing top-level side effects (logger init, DB connection setup, etc.) per contract can perturb subsequent harness runs and inflate cost.

♻️ Suggested cache
+const moduleCache = new Map<string, any>();
+
 export function loadModuleWithPrivates(filePath: string, parentModule?: NodeModule): any {
+  const cached = moduleCache.get(filePath);
+  if (cached) return cached;
   const ts = getTs();
   if (!ts) throw new Error("typescript not available");
@@
   mod._compile(transpiled.outputText, filePath);
+  moduleCache.set(filePath, mod.exports);
   return mod.exports;
 }

If cache invalidation on file mtime matters for long-running processes, key on filePath + mtimeMs instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/moduleLoader.ts` around lines 67 - 107, Add a simple in-memory cache in
loadModuleWithPrivates keyed by filePath (optionally file mtime via
fs.statSync(filePath).mtimeMs for invalidation) to avoid repeated
read/parse/transpile/compile; on entry check the cache and return cached.exports
if present, otherwise proceed to
readFileSync/collectTopLevelNames/transpileModule, compile into a new Module
(mod) as currently done, store mod.exports (and relevant meta like mtimeMs if
used) into the cache, and return it—ensure keys reference loadModuleWithPrivates
and the cache lookup happens before calling getTs/ts.transpileModule and the
cache set happens after mod._compile completes.

40-60: Add abstract_class_declaration case and remove invalid variable_statement case from the switch statement.

handleDeclaration currently misses abstract_class_declaration and includes variable_statement, which is not a valid tree-sitter-typescript node type. Variables are parsed as lexical_declaration, not variable_statement, so that case will never match. The abstract_class_declaration type is distinct from class_declaration in tree-sitter-typescript and must be explicitly handled to avoid silently missing top-level abstract class declarations.

Suggested changes:
       case "function_declaration":
       case "generator_function_declaration":
       case "class_declaration":
+      case "abstract_class_declaration":
       case "interface_declaration":
       case "type_alias_declaration":
       case "enum_declaration":
         addName(node.childForFieldName("name"));
         break;
       case "lexical_declaration":
       case "variable_declaration":
-      case "variable_statement": {

Consider also adding cases for module_declaration and internal_module to capture TypeScript namespace and module declarations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/moduleLoader.ts` around lines 40 - 60, Update handleDeclaration in
moduleLoader.ts: remove the invalid "variable_statement" case, add
"abstract_class_declaration" alongside "class_declaration" so top-level abstract
classes are captured by addName(node.childForFieldName("name")), and ensure
variable handling remains under "lexical_declaration" and "variable_declaration"
with the existing variable_declarator loop. Optionally add "module_declaration"
and "internal_module" cases to capture TypeScript namespaces/modules using
addName on their name field as well.
src/refiner.ts (1)

9-64: Consider a budget / telemetry hook.

refineErrorBlock is now called from DerivationPhase on every error/unknown verification. There's no caller-visible retry cap or logging of the refined-vs-original SMT, so a persistent Z3 bug can silently cost one LLM call per contract per run, and when refinement succeeds there's no trace of what the model changed. Even a console.log behind verbose, or returning the original z3Error alongside the result, would make debugging much easier. Deferrable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/refiner.ts` around lines 9 - 64, refineErrorBlock currently makes an LLM
call on every Z3 error/unknown with no telemetry, retry cap, or visibility into
what changed; update refineErrorBlock to (1) accept or create a simple
budget/telemetry hook (e.g., a retry limit or a callback) and enforce a max
number of refinement attempts, (2) log or surface both the original smt2/z3Error
and the revised smt2 when a refinement succeeds (use the existing
provider.complete call and the extracted revised variable), and (3) propagate
the original z3Error alongside the returned RefineResult (or return a richer
object) so callers like DerivationPhase can record the LLM call and compare
changes; reference refineErrorBlock, provider.complete, the revised variable,
and verifyBlock to locate where to add the retry counter, logging/telemetry
callback, and modified return shape.
src/contracts.ts (1)

5-14: Normalization edge cases.

A couple of quiet footguns in normalizeContractFile:

  • The marker loop uses POSIX separators (/src/, /examples/, …). On Windows, an absolute path like C:\Users\...\examples\foo.ts will miss every marker and fall through to return file, leaving an absolute, backslash-separated path. signalKey only strips leading /, not drive letters or backslashes, so the resulting key will be broken. Normalize separators before the marker scan (file.replace(/\\/g, "/")).
  • When relative(projectRoot, file) returns a ..-prefixed path (file lives outside the project root), the fallback marker list is hardcoded. For monorepos with non-listed top-level dirs (e.g. services/, apps/, crates/), you silently keep the absolute path. Consider making the markers configurable, or at a minimum documenting the assumption.
  • isAbsolute(rel) on the result of relative() can never be true on POSIX; the check is harmless but dead. Minor.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/contracts.ts` around lines 5 - 14, The normalizeContractFile function
fails on Windows paths and can miss monorepo top-level dirs: first, normalize
separators at the top of normalizeContractFile by replacing backslashes with
forward slashes (e.g., work on a local normalized variable like fileNorm =
file.replace(/\\/g, "/")) and use that for the marker scan and for computing
relative paths; second, make the marker list configurable (e.g., extract the
hardcoded array into a shared DEFAULT_MARKERS constant and add an optional
markers parameter to normalizeContractFile or accept injected markers) so
non-listed top-level dirs (services/apps/crates) are handled; lastly, drop the
dead isAbsolute(rel) check since relative() won’t return an absolute path.
Ensure you update all uses of normalizeContractFile and any downstream logic
that expects the original path format to handle normalized forward-slash keys.
src/checkers/StrengthChecker.ts (1)

47-67: Parse full assertion ranges to handle potential multi-line assertions.

The code only detects single-line (assert ...) forms by checking parenthesis balance on individual lines. Multi-line assertions would be silently skipped, potentially causing incorrect load-bearing counts. While all current stored contracts use single-line assertions, the code should robustly handle multi-line cases since nothing enforces the single-line constraint.

Suggested fix direction
-    const assertIndices: number[] = [];
+    const assertRanges: Array<{ start: number; end: number }> = [];
     for (let i = 0; i < lines.length; i++) {
       const trimmed = lines[i]!.trim();
       if (!trimmed.startsWith("(assert")) continue;
       let depth = 0;
-      for (const ch of lines[i]!) {
-        if (ch === "(") depth++;
-        else if (ch === ")") depth--;
+      for (let j = i; j < lines.length; j++) {
+        for (const ch of lines[j]!) {
+          if (ch === "(") depth++;
+          else if (ch === ")") depth--;
+        }
+        if (depth === 0) {
+          assertRanges.push({ start: i, end: j });
+          i = j;
+          break;
+        }
       }
-      if (depth === 0) assertIndices.push(i);
     }
 
-    if (assertIndices.length === 0) {
+    if (assertRanges.length === 0) {
       return { loadBearing: 0, totalAsserts: 0, status: "ok" };
     }
@@
-    for (const idx of assertIndices) {
-      const without = lines.filter((_, i) => i !== idx).join("\n");
+    for (const range of assertRanges) {
+      const without = lines.filter((_, i) => i < range.start || i > range.end).join("\n");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/checkers/StrengthChecker.ts` around lines 47 - 67, The current loop in
StrengthChecker that builds assertIndices by checking balanced parentheses per
line misses multi-line assertions; update the logic to scan lines for assertion
start tokens (e.g., trimmed.startsWith("(assert")) then walk forward across
subsequent lines accumulating parenthesis depth until depth returns to zero to
record the full start and end line range (store ranges instead of single
indices), and when testing each assertion use those ranges to create the
`without` variant (filter or splice out all lines in the recorded range) before
calling `verifyBlock`; refer to the existing symbols `lines`, `assertIndices`
(convert to ranges), and `verifyBlock` to locate and replace the per-line
balance check and removal logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6febe2a1-2898-4574-a94e-8e88e2ff38f8

📥 Commits

Reviewing files that changed from the base of the PR and between a814845 and 655ee1e.

📒 Files selected for processing (44)
  • .gitignore
  • .neurallog/contexts/bundles.json
  • .neurallog/contracts/examples/arithmetic.ts/absDiff[18].json
  • .neurallog/contracts/examples/arithmetic.ts/absDiff[19].json
  • .neurallog/contracts/examples/arithmetic.ts/addPositive[31].json
  • .neurallog/contracts/examples/arithmetic.ts/addPositive[32].json
  • .neurallog/contracts/examples/arithmetic.ts/clamp[24].json
  • .neurallog/contracts/examples/arithmetic.ts/clamp[25].json
  • .neurallog/contracts/examples/arithmetic.ts/divide[13].json
  • .neurallog/contracts/examples/arithmetic.ts/safeDivide[7].json
  • .neurallog/contracts/examples/arithmetic.ts/safeDivide[8].json
  • .neurallog/contracts/examples/inventory.ts/releaseStock[33].json
  • .neurallog/contracts/examples/inventory.ts/releaseStock[34].json
  • .neurallog/contracts/examples/inventory.ts/releaseStock[36].json
  • .neurallog/contracts/examples/inventory.ts/reserveStock[20].json
  • .neurallog/contracts/examples/inventory.ts/reserveStock[21].json
  • .neurallog/contracts/examples/inventory.ts/reserveStock[23].json
  • .neurallog/contracts/examples/inventory.ts/reserveStock[24].json
  • .neurallog/contracts/src/llm/OpenCodeProvider.ts/formatDuration[152].json
  • .neurallog/contracts/src/observations.ts/findSimilarByAST[77].json
  • .neurallog/contracts/src/pipeline/AxiomPhase.ts/formatDuration[166].json
  • .neurallog/derivation.json
  • .neurallog/graph.json
  • .neurallog/principles/addition-overflow.json
  • .neurallog/report.json
  • .neurallogignore
  • examples/arithmetic.ts
  • prompts/harness_synthesis.md
  • prompts/invariant_derivation.md
  • scripts/harness-probe.ts
  • src/checkers/PropertyTestChecker.ts
  • src/checkers/StrengthChecker.ts
  • src/checkers/index.ts
  • src/cli.ts
  • src/contracts.ts
  • src/harness.ts
  • src/judge.ts
  • src/moduleLoader.ts
  • src/pipeline/AxiomPhase.ts
  • src/pipeline/DerivationPhase.ts
  • src/pipeline/Pipeline.ts
  • src/principles.ts
  • src/refiner.ts
  • src/verifier.ts

Comment on lines +3 to +6
"file": "/Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts",
"function": "reserveStock",
"line": 21,
"signal_hash": "0e15c5038ab1561c7cf41e475816378338b60c7d27747ac6cd99fe6eff1ec8be",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find every contract artifact in the PR that stores an absolute 'file' path.
fd -e json . .neurallog/contracts --exec sh -c '
  f="$1"
  path=$(jq -r ".file // empty" "$f" 2>/dev/null)
  case "$path" in
    /*|[A-Za-z]:*) echo "ABSOLUTE: $f -> $path" ;;
  esac
' _ {}

Repository: TSavo/provekit

Length of output: 50371


🏁 Script executed:

fd -e json . .neurallog/contracts/examples/arithmetic.ts --exec sh -c '
  f="$1"
  file_path=$(jq -r ".file // empty" "$f" 2>/dev/null)
  echo "File: $f"
  echo "  Path: $file_path"
' _ {} | head -20

Repository: TSavo/provekit

Length of output: 946


Contract artifacts across the entire .neurallog/contracts/ directory contain absolute, machine-specific paths—a systemic issue affecting 256+ artifacts and poisoning signal_hash reproducibility.

This is not localized to inventory.ts. The initial verification script found absolute paths in over 256 committed contract artifacts spanning:

  • .neurallog/contracts/src/verifier.ts/ (60+)
  • .neurallog/contracts/src/principles.ts/ (20+)
  • .neurallog/contracts/src/signals/ (multiple)
  • .neurallog/contracts/src/templates/ (40+)
  • .neurallog/contracts/src/parser.ts/, observations.ts, pipeline/ (10+ each)
  • .neurallog/contracts/examples/inventory.ts/ (6, with path /Users/tsavo/provekit/.worktrees/refinements/examples/inventory.ts)

In contrast, .neurallog/contracts/examples/arithmetic.ts/ correctly stores relative paths like examples/arithmetic.ts.

Two concrete problems:

  1. Non-portable & leaks filesystem layout. Any other checkout, CI system, or developer will have mismatching absolute paths and cannot locate or re-verify these artifacts.
  2. signal_hash is not reproducible across checkouts. Per src/signals/Signal.ts (lines 27–42), computeSignalHash incorporates signal.file into the SHA256. These absolute paths will produce different hashes on any other machine, defeating hash-based dedup and change detection.

The root cause: contract path normalization was not applied when these artifacts were generated (likely across different checkouts/worktrees with different base paths).

Remediation options:

  • Regenerate all affected contract artifacts so file stores relative paths (repo-root-relative), or
  • Delete all stale artifacts and let the pipeline regenerate them post-normalization fix.

Prevent regression: Add a guard in the artifact writer (likely DerivationPhase or ContextPhase) that asserts !isAbsolute(file) before persisting any contract, to catch future violations early.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.neurallog/contracts/examples/inventory.ts/reserveStock[21].json around
lines 3 - 6, The committed contract artifacts contain absolute machine-specific
"file" paths which break Signal.computeSignalHash reproducibility; fix by
normalizing contract file paths to repo-root-relative before persisting
artifacts (e.g., convert signal.file to a relative path) in the artifact writer
used by DerivationPhase/ContextPhase, then regenerate or delete stale artifacts
so new ones use relative paths; additionally add an assertion/guard in the
writer (in DerivationPhase or ContextPhase) that throws or fails the write if
path.isAbsolute(signal.file) to prevent future commits of absolute paths.

Comment thread src/harness.ts
Comment on lines +130 to +195
const sandbox: any = {
functionUnderTest: fn,
functionUnderTestClass: fnClass,
console: {
log: () => {},
warn: () => {},
error: () => {},
},
setTimeout,
clearTimeout,
setInterval,
clearInterval,
Promise,
Error,
TypeError,
RangeError,
SyntaxError,
Array,
Object,
Number,
String,
Boolean,
JSON,
Math,
Date,
Symbol,
Map,
Set,
WeakMap,
WeakSet,
Reflect,
BigInt,
isNaN,
isFinite,
parseInt,
parseFloat,
encodeURIComponent,
decodeURIComponent,
globalThis: {},
};
sandbox.globalThis = sandbox;

vm.createContext(sandbox);

const wrapped = `(async () => {\n${harnessCode}\n})()`;

let promise: Promise<any>;
try {
const script = new vm.Script(wrapped, { filename: "<harness>" });
promise = script.runInContext(sandbox);
} catch (err: any) {
return {
kind: "harness-error",
message: `harness failed to parse/initialize: ${err?.message?.slice(0, 200) || "unknown"}`,
harnessCode,
};
}

let timer: NodeJS.Timeout | null = null;
try {
await Promise.race([
promise,
new Promise((_, reject) => {
timer = setTimeout(() => reject(new Error("__HARNESS_TIMEOUT__")), timeoutMs);
}),
]);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Node.js vm module not a security mechanism Script.runInContext timeout synchronous code documentation

💡 Result:

The Node.js vm module is explicitly not a security mechanism and should not be used to run untrusted code. This is stated at the top of the official documentation. The timeout option in methods like vm.runInContext(code, context, options) or script.runInContext(context, options) specifies milliseconds before terminating execution, throwing ERR_SCRIPT_EXECUTION_TIMEOUT if exceeded. It applies to the synchronous execution of the script. For synchronous code (e.g., infinite while loop), timeout works reliably and terminates the execution. However, timeout does not cover asynchronous tasks. Promises, async functions, process.nextTick, queueMicrotask, setTimeout, etc., schedule tasks that run after the synchronous script execution completes, escaping the timeout. Example from docs: vm.runInNewContext( 'Promise.resolve.then( => { while(1); });', { console }, { timeout: 5 } ); The promise .then callback runs after timeout, indefinitely. To include microtasks in timeout scope, use microtaskMode: 'afterEvaluate' in options (added v14.6.0). Microtasks (Promise.then, async/await) run immediately after script evaluation, under timeout: vm.runInNewContext( 'Promise.resolve.then( => { while(1); });', { console }, { timeout: 5, microtaskMode: 'afterEvaluate' } ); This terminates the microtask loop. Note: Sharing promises across contexts with this mode can cause issues, as inner microtasks may not drain properly. Macro-tasks (setTimeout, etc.) use global queues shared across contexts, uncontrollable by vm timeout. Using timeout incurs performance overhead (new event loops/threads). Official Node.js docs (v25.9.0) confirm these behaviors and limitations.

Citations:


Move harness execution to a separate, killable process with strict timeout and minimal environment.

The Node.js vm module is not a security mechanism. The code does not use the timeout option in script.runInContext(sandbox), leaving synchronous code unprotected. Any synchronous infinite loop or blocking operation before the first await will hang the host process—the Promise.race timeout is installed after runInContext() completes, so it cannot catch synchronous attacks.

Additionally, the sandbox exposes dangerous constructors (Error, Array, Object, Function via Reflect, Symbol, BigInt, etc.) and timers (setTimeout, setInterval), enabling escape vectors and potentially preventing process termination.

If keeping vm temporarily, use runInContext(..., { timeout, microtaskMode: 'afterEvaluate' }) to enforce synchronous bounds and cover microtasks. However, this does not protect against exposed constructors or macro-task abuse. The correct fix is spawning harnesses in isolated worker threads or subprocesses with a strict time limit and IPC-only communication for results.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harness.ts` around lines 130 - 195, The current harness runs untrusted
code via vm.Script.runInContext(sandbox) (wrapped/harnessCode) leaving
synchronous execution unbounded and exposing dangerous globals (sandbox,
setTimeout, setInterval, Error, Array, Object, Reflect, Symbol, BigInt, etc.);
replace this by executing the harness in a separate, killable worker/subprocess
that receives harnessCode via IPC, runs it in a minimal global environment,
returns results/errors over IPC, and the parent enforces timeoutMs by forcibly
terminating the worker if it exceeds the deadline; if you must keep vm for a
short-term patch, change the run to script.runInContext(sandbox, { timeout:
<small ms>, microtaskMode: 'afterEvaluate' }) and remove/omit dangerous entries
from sandbox (setTimeout, setInterval, Error, Function/Reflect accessors,
Symbol, BigInt, globalThis shims) so synchronous/microtask work is bounded until
you implement the isolated worker/subprocess approach using the same timeoutMs
and promise cancellation logic.

Comment on lines +861 to +879
const smtMatch = text.match(/```(?:smt2|smt-lib|smtlib2)?\s*\n([\s\S]*?)```/i);
if (!smtMatch) continue;
const revisedSmt = smtMatch[1]!.trim();
if (!revisedSmt.includes("(check-sat)")) continue;
if (revisedSmt === v.smt2) continue;

const { result, witness: newWitness } = verifyBlock(revisedSmt);
if (result === "unsat") {
const bare = v.claim.replace(/^VIOLATION:\s*/i, "").trim();
const flippedClaim = `PROVEN: ${bare} is prevented (CEGAR-refined encoding)`;
contract.proven.push({
principle: v.principle,
principle_hash: v.principle_hash,
claim: flippedClaim,
smt2: revisedSmt,
reason: extractReason(revisedSmt) || "cegar-refined precondition added",
confidence: "high",
judge_note: "cegar-flipped: violation became proof after tightening encoding",
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Do not flip violations to proven on an unchecked LLM-refined block.

A revised block can become unsat by adding an over-tight or contradictory precondition. Since this path immediately moves the violation into contract.proven, a bad refinement can corrupt stored verification results.

At minimum, reject refinements that change the principle/line tags, are vacuous/trivial, or lack an auditable REASON; ideally keep these as low-confidence until a second mechanical/judge check validates the added precondition against source code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pipeline/DerivationPhase.ts` around lines 861 - 879, The code currently
flips an LLM-refined SMT block to proven immediately after verifyBlock returns
"unsat"; instead, add validation before pushing to contract.proven: after
obtaining revisedSmt and newWitness from verifyBlock, ensure the refinement did
not change principle/line tags (compare v.principle and v.principle_hash against
any tags in revisedSmt or metadata), reject vacuous/trivial refinements (e.g.,
precondition that trivially contradicts or always false, or that removes
original assertions), and require a non-empty auditable reason from
extractReason(revisedSmt); if any check fails, do not push into
contract.proven—either append to a separate low-confidence queue (e.g.,
contract.refinements or set confidence:"low" and judge_note indicating "awaiting
second check") or mark for mechanical/judge re-validation instead of marking as
proven.

TSavo and others added 7 commits April 20, 2026 21:13
… harness synthesis

The full-run at LIMIT=10 exposed a pipeline bug: primitive property-test
was calling functions with wrong-typed arguments. SMT can only declare
Int/Real/Bool, so for a function like runDiff(args: string[]), the SMT
would declare (declare-const args Int) and the matcher would obediently
pass a JavaScript number to args.find(...), which then threw
"args.find is not a function." That TypeError was labeled as a violation
and entered the verdict counts as a false-positive finding.

Fix:
- extractFunctionInfo now returns paramTypes alongside paramNames, pulled
  from each parameter's type annotation via tree-sitter (child field
  "type"). Unannotated params get "unknown" and pass through.
- New isCompatibleWithTsType(tsType, jsValue) helper. Allows the JS
  value if the declared TS type is:
    - any / unknown / missing
    - a primitive keyword (number, boolean, string) matching typeof
    - a numeric literal type and value is number
    - a string literal type and value is string
    - a union type where at least one member is compatible
  Rejects everything else (Contract[], Map<>, custom types, etc).
- In runOne, after matchParamToModel resolves a value, we run
  isCompatibleWithTsType. On mismatch: push a harness-synthesis
  candidate with the specific reason and skip the primitive path.

Empirical: the four functions that produced false positives in the
LIMIT=10 run (runDiff, runHook, buildIssue, buildCallSiteContext) now
skip with the new reason and become harness-synthesis candidates, where
the LLM constructs proper fixtures for their actual types. The false
positives are gone from the primitive-path verdicts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## CEGAR soundness — validate premises before flipping (CodeRabbit critical)

DerivationPhase.cegarRefineViolations: when the LLM-refined block
returns unsat, previously we accepted the flip (violation -> proven)
unconditionally. A refined block can go unsat two ways:
  (a) genuine refinement — new preconditions make the code path
      inconsistent with the bug, which is what we want;
  (b) vacuous refinement — new preconditions are self-contradictory,
      so the block is unsat for reasons unrelated to the claim.
Both look identical to Z3. The fix: after getting unsat on the refined
block, strip the negated goal and re-check Z3 on the premises alone.
If the premises are not sat, the refinement is vacuous and we keep the
original violation. Only accept the flip when premises are independently
consistent.

Added private stripNegatedGoal(smt2) helper that drops the last (assert)
line. Used here and available for future reuse.

## Checker interface signatures (Copilot)

StrengthChecker.check and PropertyTestChecker.check now match the
Checker interface:
  check(contracts: Contract[], _callGraph: Map<string, string[]>)
TypeScript structural typing was letting this slide; the explicit
signature is clearer and guards against future interface changes.

## Gitignore the build-derived .neurallog artifacts (Copilot, CodeRabbit)

Contract JSONs, graph.json, report.json, derivation.json, contexts/,
observations/, harnesses/, classification.json are all machine-specific
artifacts (absolute paths, timestamps, per-machine hash indices) and
should never have been versioned. Moved to .gitignore. Kept
.neurallog/principles/ in git — the learned knowledge base is the one
artifact intended to be versioned.

962 files untracked from the working tree.

## Violation-replay verdict on clean runs (Codex P1)

Previously: violation replay + clean runtime -> verdict "proven"
with error: "possible false positive." That auto-corroborates a
proof on evidence that was neither a corroboration nor a refutation.

Now: violation replay + clean runtime -> verdict "error" with a clear
message saying the replay did not reproduce and the judge must decide
whether the claim is a false positive (flip to proven) or the
encoding misrepresents the semantic (flip to violation). The judge
already handles both directions; only the initial verdict changes.

## vm sandbox trust-boundary comment (CodeRabbit)

Added a doc block above runHarness explaining that vm.createContext is
a containment mechanism, not a security boundary, and that the trust
boundary is "same as running the tool on this repo" — acceptable for
the intended use case of testing the user's own source.

## Parenthesized Z3 Real literal parsing (Codex P2)

parseZ3Model previously missed sexp-form values like
  (define-fun x () Real (/ 5 2))
  (define-fun x () Real (- (/ 5 2)))
because the regex only handled plain literals and (- N). Added
parseSexpValue recursive handler for (- x) and (/ a b), and a
splitSexpArgs helper for parenthesis-balanced tokenization.

## StrengthChecker perf gate (Copilot)

StrengthChecker runs Z3 once per top-level (assert) per proven
property — O(N × K) calls. Gated behind NEURALLOG_STRENGTH_CHECK=1
so AxiomPhase doesn't pay the cost unless the operator opts in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ti-model Z3

Four improvements landed together because they share test surface.

## #17 Harness code audit before execution

New judgeHarnessCode in judge.ts: audits the synthesized JavaScript
harness *before* it runs. Single LLM call that reads the harness, the
claim, the SMT block, and the function source, then decides whether
the harness is a genuine empirical test or rigged by construction.

The audit prompt names six specific rigging mechanisms:
  1. Function re-implementation (testing a copy instead of functionUnderTest)
  2. Tautological assertion (expect(true).toBe(true) shape)
  3. Claim-restated expected value (transliterated from the claim)
  4. Fixture contains the answer (function reads .expected and passes through)
  5. encoding-gap thrown on orthogonal conditions (TypeErrors, null derefs)
  6. Hypothesis-assertion drift (comment and checks don't match)

Response format: VALID or INVALID with a one-line critique. Bias toward
INVALID — a falsely-validated rigged harness poisons the whole
corroboration signal.

Wired into synthesizeAndRunHarnesses between synthesis and execution.
Rejected harnesses become harness-error outcomes with the auditor's
critique. Cached via HarnessCache.putAudit so the audit isn't re-run
on subsequent verify invocations (same source → same audit).

Env: NEURALLOG_HARNESS_AUDIT=0 disables (default on). Model via
NEURALLOG_AUDIT_MODEL or NEURALLOG_JUDGE_MODEL (haiku by default).

This addresses Opus's top prediction: that the LLM would emit
harnesses technically executing but not actually testing the claim.
9/10 pass rate on the previous full run couldn't distinguish
corroboration from rigging. Now it can.

## #18 Judge verdict cache

New src/judgeCache.ts persists judgeRuntimeOutcome verdicts at
.neurallog/judge-cache/<hash>.json keyed by sha256 of
(smt2, functionSource, inputsSummary, outcomeSig, source).

judgeResults now checks the cache before calling the LLM. Cache hits
skip the ~20-30s per-entry LLM round. Stats surface the hit count in
the AxiomPhase summary line:
  property-test judge: 10 judged (7 cache hits), 2 flips, 3 confirmed

Gitignored like the harness cache — build-derived, not versioned.

## #19 Parallel harness synthesis

synthesizeAndRunHarnesses previously looped candidates sequentially:
~75-95s per LLM call × 10 candidates ≈ 15 minutes. No serial dependency
between candidates — synthesis of contract A doesn't block contract B.

Refactored the loop into a two-phase pipeline:
  Phase 1 (parallel, LLM-bound): synthesize + audit each candidate
    concurrently via a parallelMap helper. Produces a prepared record
    per candidate (either "ready" with harness and info, or "skip"
    with the outcome kind and message).
  Phase 2 (sequential, in-process): iterate prepared records, run the
    harness in the vm sandbox one at a time. Execution is fast (<100ms
    typical) and shares the event loop; parallelism here would only
    add contention.

NEURALLOG_HARNESS_CONCURRENCY=N gates phase 1 (default 3). With N=3,
10 candidates drop from ~15 min to ~5 min.

## #20 Multi-model Z3 sampling

For each proven contract, previously we took one Z3 (get-model) witness.
Adversarial bugs often hide in corners the solver's first-satisfying
input doesn't touch.

New extractModels(smt2, k) asks Z3 for k distinct witnesses by
iteratively adding
  (assert (or (not (= x v_prev1)) (not (= y v_prev2)) ...))
for each prior witness in the query. Stops early if Z3 returns unsat
(no more distinct models exist) or produces a duplicate.

In check(), the proven loop now iterates up to
NEURALLOG_Z3_SAMPLES models per proven property (default 1, back-
compatible). Each sample is a separate attempt counted against the
LIMIT. runOne accepts a modelOverride arg so the sampling happens in
the outer loop, not inside runOne.

Combined with harness synthesis and the judge, each proven claim now
gets K independent empirical checks instead of one. A claim that holds
under Z3's first witness but fails at a different corner of the input
space is now detectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nvalidation

## #22 Non-determinism handling

Two-part fix.

Primitive path (src/checkers/PropertyTestChecker.ts):
  detectsNonDeterminism() scans the function source for references to
  Math.random, Date.now, performance.now, new Date(), process.hrtime,
  crypto.randomUUID/randomBytes/getRandomValues. When any of these are
  found, the contract is skipped with a specific reason and routed to
  the harness synthesis candidate queue — the LLM can consciously stub
  the nondeterminism source, the primitive path cannot.

Harness sandbox (src/harness.ts):
  runHarness now installs deterministic default stubs:
    Math.random -> constant 0.5 (via Object.create(Math) + defineProperty)
    Date.now    -> constant 0    (via Proxy on Date with custom get)
  Both preserve constructor behaviour and all non-stubbed members;
  only the time/randomness surface is pinned. Harnesses that need real
  time or real randomness can override inside their own code.

The rationale: a claim proven by Z3 is about the function's pure logic.
Any runtime-specific value the function pulls from outside that logic
(clock, RNG) is a variable the proof didn't constrain. Pinning it to a
constant makes the empirical check deterministic without loss of
generality for the claim's scope.

## #21 Transitive-source cache invalidation

HarnessCache key was sha256(smt2 + functionSource). If the function
calls a helper whose implementation changes, the harness becomes
stale but the cache hits.

Fix: cache key now optionally includes depsSource — the concatenated
source of the function's containing file plus depth-1 relative
imports. New moduleLoader.collectTransitiveSource(filePath,
projectRoot, depth=1) handles the collection.

PropertyTestChecker memoizes depsSource per absPath within a single
synthesizeAndRunHarnesses invocation so each file is read at most
once. Pass the depsSource to cache.get / cache.put / cache.putAudit.
Existing caches keyed without depsSource still resolve (same hash
when depsSource is absent) — no migration needed, next write attaches
dep info.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added vitest (one dev dep) and 61 tests across two files. Scope:

src/checkers/PropertyTestChecker.test.ts (53 tests)
  - isCompatibleWithTsType: 27 case-table entries covering primitives,
    literal types, unions with null/undefined, Contract[]/Map<> rejection,
    any/unknown/empty pass-through.
  - matchParamToModel: exact, case-fold, underscore-strip, suffix-strip
    (_condition, _guard), substring fallback, preference ordering.
  - isControlFlowModel: all-control-keys → true, mixed with param → false,
    empty → false, result_consequent/result_alternate recognition.
  - parseZ3Model: plain Int, negative via (- N), Real fraction via (/ a b),
    negated fraction via (- (/ a b)), Bool, multiple bindings.
  - detectsNonDeterminism: Math.random / Date.now / performance.now /
    new Date() / crypto detection + clean case.

src/moduleLoader.test.ts (8 tests)
  - collectTopLevelNames across function_declaration, async functions,
    class_declaration, lexical declarations (const/let/var), type aliases
    and interfaces, export_statement wrappers, dedup, empty source.

Bug found + fixed while writing tests: addName() only accepted nodes of
type 'identifier' or 'property_identifier', but tree-sitter-typescript
emits 'type_identifier' for class / interface / type_alias name fields.
classes/types/interfaces were silently skipped by collectTopLevelNames
— which meant a fallback loadModuleWithPrivates on a file containing
classes would fail to expose them. Now fixed: type_identifier added to
the accepted node types.

vitest.config.ts restricts test collection to src/**/*.test.ts so the
compiled dist/ output isn't picked up (vitest can't import itself in
CJS-compiled .js).

test script wired into package.json: `npm test`.

Note: larger integration-style tests for runHarness (vm sandbox
behavior, timeout enforcement, prefix classification) and for the
CEGAR premise-consistency check are not included in this pass —
they require Z3 / LLM mocks or integration harness and are a separate
investment. The helpers tested here are the pure-logic surface where
regressions silently slip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New src/lessons.ts: LessonStore persists each property-judge-flagged
encoding-gap as a lesson entry at .neurallog/lessons/<hash>.json with:
  - contractKey: where the gap was observed
  - claim: the textual claim the encoder wrote
  - judgeNote: the property-judge's critique describing the mismatch
  - principleId: which principle the encoder was applying (if derivable)
  - addedAt: timestamp

formatForPrompt(limit) renders the most recent N lessons as a "Known
Encoding Gaps — Avoid These Mistakes" section suitable for injection
into encoder prompts.

Wired into two points in the encoder pipeline:

1. DerivationPhase pattern-gap synthesis (the prompt that asks the LLM
   to produce a new atomic principle with smt2Template + astPatterns
   when N unverified signals cluster). LessonStore.formatForPrompt() is
   appended to the prompt so the encoder sees specifically which SMT
   encodings it got wrong before.

2. principles.classifyAndGeneralize (the prompt that generalises a
   [NEW]-tagged violation into a reusable principle). Same injection;
   projectRoot now an optional parameter so the lesson store can be
   opened.

Feedback loop flow:
  harness runs -> judge flags encoding-gap -> lesson stored ->
  next encoder invocation reads lessons -> encoder avoids the mistake

Storage gitignored. On projects that regularly encounter new encoding
patterns, the lesson store accumulates into a de-facto institutional
memory about which SMT abstractions have been empirically unfaithful.

61 tests still passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## #24 Runtime observation seeding (Daikon-style)

New src/runtime.ts exposes two APIs for users who want to seed the
verification pipeline with real values their code sees in production or
under tests.

observe(signalKey, values, projectRoot?)
  Instruments a call site. Values get sanitized (depth-limited,
  non-finite numbers tagged, references flattened, Maps/Sets/Errors
  serialized to tagged wrappers) and appended as NDJSON to
  .neurallog/witnesses/<slug>.ndjson. Hard-capped at 16KB per line
  and 10MB per file. Opt-in via NEURALLOG_OBSERVE=1 — zero-cost
  no-op in production without the gate.

readObservations(signalKey, projectRoot?)
  Returns the persisted Observation array for a signal.

listObservedSignals(projectRoot?)
  Enumerates signals with recorded witnesses.

PropertyTestChecker uses readObservations as an alternative input source
when NEURALLOG_USE_RUNTIME_OBSERVATIONS=1. For each proven contract,
before falling back to Z3-model sampling, it queries the runtime
observation store for the function's signal. If real values exist, it
extracts their primitive subset (number/boolean only — SMT's domain)
and tests the function with those directly. Falls through to Z3
models when no observations are available.

The architectural shift: inputs now come from three sources, in order
of precedence:
  1. NEURALLOG_USE_RUNTIME_OBSERVATIONS=1 → real values the code saw
  2. Z3 (get-model) with optional multi-sampling
  3. LLM-synthesized harness for contracts that need complex types

Gives operators a dial from "abstract proof" to "empirical proof
conditioned on real usage." Daikon's core insight, adapted.

## #26 Triangle oracle — test suite as prompt context

New src/testOracle.ts detects the project's test framework and finds
existing tests referencing the target function. Used as advisory
context in harness synthesis — the LLM sees how the project's own
tests exercise the function and can align its harness accordingly.

Detection (detectTestFramework):
  Recognizes vitest, jest, mocha, ava, tap, node-test. First from
  dependencies/devDependencies in package.json, then from
  scripts.test patterns. Returns null when no test infrastructure
  exists.

Discovery (findTestsForFunction):
  Walks __tests__/, test/, tests/, spec/ directories plus project
  root, matching .test.ts/.spec.ts/.test.js/.spec.js files. For each
  file containing a whole-word match on the function name, extracts
  test blocks (describe/it/test) that reference the function. Returns
  structured TestReference records with file, line range, block
  snippet, and test name.

Integration (SynthesisInput.testOracleContext):
  When harness synthesis runs for a candidate, PropertyTestChecker
  calls findTestsForFunction and appends the formatted references
  to the synthesis prompt under "Existing tests for this function
  (oracle context)". The synthesis prompt now instructs the LLM to
  use the tests as calibration without copying assertions verbatim
  (the anti-A2 guardrail is preserved).

No test execution — the tests are read-only oracle context. Running
the project's test runner with specific inputs and cross-referencing
outcomes is a separate architectural step; this is the minimum
surface needed to get harnesses conditioning on existing evidence.

Unit tests: 9 new tests in src/testOracle.test.ts. Uses tmp dirs to
exercise framework detection and test discovery without polluting
the worktree.

70 tests total pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TSavo added a commit that referenced this pull request May 13, 2026
Step #1 of the substrate-types wiring sequence (per admissibility-spine
#796 next phase). Adapts every Rust caller of compose_chain_contracts to
the new Result<_, CompositionRefusalMemento> signature landed in #827.

Call sites updated:
- libprovekit/src/ffi.rs:390 — FfiError::ComposeRefused carries the
  refusal memento, formats CID + serialized refusal artifact
- provekit-cli/src/cmd_compose.rs:343 — compose RPC maps refusal into
  refusal_cid + full refusal JSON payload
- provekit-walk/src/chain.rs:85 — walk method-chain wrapper now returns
  Result<Option<_>, CompositionRefusalMemento>

Integration tests added:
- ffi_smoke::rust_jcs_entrypoint_impure_input_returns_stable_refusal_cid
- compose_rpc_smoke::compose_rpc_impure_input_returns_refusal_memento_with_stable_cid
- compose_method_chain_refusal::compose_method_chain_impure_input_returns_stable_refusal_cid

Each test exercises an impure-input failure and asserts the refusal CID
is non-empty AND deterministic across two runs (per spec §0.1
determinism guarantee).

Pre-existing failures (NOT introduced here):
- provekit-bridgeworks::smoke::checked_add_exhibit_passes (also fails on
  clean main without this PR applied)
- provekit-cli::cli_surface::lift_zig_shows_production_composes_but_unit_tests_conflict
  (Zig .zig-cache PermissionDenied, environment flake; matches the
  ETXTBSY/zig flake notes in memory #72/#82)

Co-authored-by: T Savo <agentwopr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TSavo added a commit that referenced this pull request May 13, 2026
…838)

Step #2 of the substrate-types wiring sequence. provekit prove now
persists content-addressed ProofRunMemento + StageReceipt artifacts for
each verifier run under .provekit/runs.

Per stage: captures input_cids (sorted ascending), output_cids (sorted),
refusal_cids (sorted), diagnostics (preserve order), started_at/finished_at
ISO-8601 timestamps, verdict in {ok, warned, refused, skipped},
stage_name, kind="stage-receipt", schemaVersion="1". header.cid via
StageReceipt::recompute_header_cid().

Per run: aggregates stage_receipt_cids in execution order (preserve order
in CID input per spec §2.1), input_artifact_cids sorted, output_artifact_cids
sorted (set semantics), proof_envelope_cid + link_bundle_cid +
plugin_registry_cid + verifier_pipeline_cid (deterministic placeholder
blake3_512(JCS(stage_vocabulary)) until VerifierPipelineMemento #799
follow-up lands), verdict in {admissible, refused, partial}.

Integration test: prove_run_emits_durable_content_addressed_run_and_stage_receipts
asserts the .proof bundle contains a ProofRunMemento + stage receipts,
stage count matches VERIFIER_STAGE_VOCABULARY, receipt CIDs recompute,
JCS round-trips, run CID recomputes, input artifact CIDs recorded, and
the bundle reloads cleanly through load_all_proofs.

cargo test -p provekit-verifier -p provekit-ir-types -p libprovekit:
427 passed, 0 failed.

Note: cargo check -p provekit-cli currently blocked by a pre-existing
provekit-walk/src/chain.rs Result vs Option mismatch — that's PR #837
(step #1, currently open) which adapts the CCP caller in provekit-walk.

Co-authored-by: T Savo <agentwopr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TSavo pushed a commit that referenced this pull request May 13, 2026
…pperMemento, validate_against, ParametricRealizationMemento

Blocker #1: Mint RealizationPlanMemento after each kit dispatch in
realize_function / apply_canonical_rewrite.

Blocker #2: Call plan.validate_against(&realization) after construction;
propagate RealizationPlanError.

Blocker #3: Build synthetic ParametricRealizationMemento inline (one slot
per param) via mint_realization_artifacts; catalog lookup path acknowledged
as future enhancement per cmd_bind.rs gap comment.

Blocker #4: Mint ObservationWrapperMemento when mode in {witness, monitor,
dispatcher} and kit returns observation_wrapper_emission_record; call
wrapper.validate before persisting; set plan.observation_wrapper_cid.

Nit: used_sugars subset check in invoke_realize returns
ext:unauthorized-sugar error when kit returns a CID not in the cited set.

Add observation_wrapper_emission_record to RealizedSource (both
kit_dispatch and cmd_transport layers) extracted from kit JSON response.

Extend EngineResult with realization_plan_mementos and
observation_wrapper_mementos; write to .provekit/bindings/
realization-plans/ and observation-wrappers/.

Add three unit tests in cmd_transport for mint_realization_artifacts
asserting: plan IS minted, validate_against passes, and wrapper IS minted
for mode=witness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TSavo added a commit that referenced this pull request May 13, 2026
* Route contract sugar through realize kits

* Add four substrate-side mints: RealizationPlanMemento, ObservationWrapperMemento, validate_against, ParametricRealizationMemento

Blocker #1: Mint RealizationPlanMemento after each kit dispatch in
realize_function / apply_canonical_rewrite.

Blocker #2: Call plan.validate_against(&realization) after construction;
propagate RealizationPlanError.

Blocker #3: Build synthetic ParametricRealizationMemento inline (one slot
per param) via mint_realization_artifacts; catalog lookup path acknowledged
as future enhancement per cmd_bind.rs gap comment.

Blocker #4: Mint ObservationWrapperMemento when mode in {witness, monitor,
dispatcher} and kit returns observation_wrapper_emission_record; call
wrapper.validate before persisting; set plan.observation_wrapper_cid.

Nit: used_sugars subset check in invoke_realize returns
ext:unauthorized-sugar error when kit returns a CID not in the cited set.

Add observation_wrapper_emission_record to RealizedSource (both
kit_dispatch and cmd_transport layers) extracted from kit JSON response.

Extend EngineResult with realization_plan_mementos and
observation_wrapper_mementos; write to .provekit/bindings/
realization-plans/ and observation-wrappers/.

Add three unit tests in cmd_transport for mint_realization_artifacts
asserting: plan IS minted, validate_against passes, and wrapper IS minted
for mode=witness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Code <agentwopr@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
TSavo added a commit that referenced this pull request May 23, 2026
…e metadata via @substrate-signature

The honest path #1: java lift produces what the rust lower needs.
No more injection of rust-side metadata at integration time.

LIFT (provekit-lift-java-term-shape):
- Computes term_shape_cid as blake3-512 of JCS(body_shape) via
  BouncyCastle Blake3Digest. Stable substrate-canonical identity.
- Output IR entry contains: visibility, generic_params,
  original_param_types, param_sort_cids, return_sort_cid,
  source_return_type, term_shape_cid — all recoverable from the
  @substrate-signature comment alone.
- Bouncycastle dep added to pom.

LOWER (provekit-realize-java-core):
- @substrate-signature comment emitted FIRST, concept header
  IMMEDIATELY before the method (so JavaParser's getComment() returns
  the concept header for substrate identity recognition).
- New concept handlers: concept:while-let, concept:if-let,
  concept:destructure-tuple, concept:destructure-struct, concept:try,
  concept:item-decl. Substrate principle satisfied: every target's
  realize plugin handles every catalog concept.
- extractPatternBinding helper parses rust patterns (Some(x), Ok(v))
  to extract the java binding variable.

HONEST CYCLE RESULT on libprovekit-rpc-cross-platform:
  rust → ProofIR_A → java (with @substrate-signature) → ProofIR_B
  (full metadata recovery) → rust' (NO injection)

  10/10 signatures recovered with correct visibility/generics/refs
  7 functions emit with loss (java body_shape has different
   structural detail than rust source's term_shape — substrate-
   honest gap: the lift through java preserves CONCEPT identity but
   not all rust-side structural form choices)
  Zero loss in the java emission step (rust→java lossless via
   citations + carriers).

The substrate's true scorecard for this cycle: identity-preserving,
metadata-recovering, NO external injection. The 7 functions with
loss are honest evidence of where rust→java→rust loses structural
detail — concrete substrate-side gaps to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TSavo added a commit that referenced this pull request May 26, 2026
…tive instead

The rust kit's self-contract now comes from LIFTING native source + tests (the
config-driven `provekit mint` → rust-lift), not from a bespoke .invariant DSL run
by an orchestrator. Proven byte-substantial: 2.2 MB proof, contractSetCid
blake3-512:e6c68587...; deterministic; pin test green.

- Delete all 14 `*.invariant.rs` files (the DSL surface).
- Gut `provekit-self-contracts/src/lib.rs`: remove the `#[path]` includes,
  `author_all_invariants`, `mint_self_proof`, MintResult/AuthoredSlab/SelfBridge,
  and the orchestrator-only tests. KEEP `lift_plugin_protocol` + `catalog_format`
  (cmd_prove's C1-C8 path) and the independent parse/serialize/marshal tests.
- Delete the dead `mint-self-contracts` bin + `scan_self` (orchestrator-dependent) +
  the `rust-self-contracts` lift manifest; drop the `[[bin]]` entries.
- Repoint `--kit=rust` (KIT_TABLE) off the orchestrator surface to the lifter; rewire
  `make mint-rust` to the config-driven lift.
- Re-pin `rust.json` + the mint_kit_integration constant to the lifted CID
  (e6c68587; was the DSL CID). The DSL files are byte-neutral to the lift — deleting
  them leaves the contract set identical-in-kind (only self-contracts' own contracts drop).

Kit #1 of the .invariant elimination. The substrate proves itself by lifting its own
native tests, not by a private contract language.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TSavo added a commit that referenced this pull request May 26, 2026
…tive instead

The rust kit's self-contract now comes from LIFTING native source + tests (the
config-driven `provekit mint` → rust-lift), not from a bespoke .invariant DSL run
by an orchestrator. Proven byte-substantial: 2.2 MB proof, contractSetCid
blake3-512:e6c68587...; deterministic; pin test green.

- Delete all 14 `*.invariant.rs` files (the DSL surface).
- Gut `provekit-self-contracts/src/lib.rs`: remove the `#[path]` includes,
  `author_all_invariants`, `mint_self_proof`, MintResult/AuthoredSlab/SelfBridge,
  and the orchestrator-only tests. KEEP `lift_plugin_protocol` + `catalog_format`
  (cmd_prove's C1-C8 path) and the independent parse/serialize/marshal tests.
- Delete the dead `mint-self-contracts` bin + `scan_self` (orchestrator-dependent) +
  the `rust-self-contracts` lift manifest; drop the `[[bin]]` entries.
- Repoint `--kit=rust` (KIT_TABLE) off the orchestrator surface to the lifter; rewire
  `make mint-rust` to the config-driven lift.
- Re-pin `rust.json` + the mint_kit_integration constant to the lifted CID
  (e6c68587; was the DSL CID). The DSL files are byte-neutral to the lift — deleting
  them leaves the contract set identical-in-kind (only self-contracts' own contracts drop).

Kit #1 of the .invariant elimination. The substrate proves itself by lifting its own
native tests, not by a private contract language.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants