v0.4.3: RAB v0.1.1 (Replayable-Audit Benchmark) + Cycle γ multi-hop arc closure

@Hashevolution Hashevolution released this 10 Jun 10:36
72e95b3

Released: 2026-06-10 (R1 trace: PR sequence #758 → #760 → #761 →
#762 → #763 → #764 → #766 → #767; Cycle γ multi-hop arc trace: PRs
#752 → #753 → #754 → #755 → #756 → #757).

Theme: v0.4.3 ships RAB v0.1.1, the first replayable-audit
benchmark for RAG / agent systems whose 3 metrics (AC / RF / PC) are
operationalisations of EU AI Act Articles 10, 12, 19 (in force
2026-08-02). The same cycle closes the Cycle γ multi-hop arc (7 probes,
6 honest nulls, 2 self-corrections; multi-hop improvement reframed
out of the JAMES roadmap; graph build O(N²) finding lifted into RAB
as the RF-cost axis).

Two tracks closed; no production runtime change. JAMES production
audit / lifecycle / graph code paths are unchanged byte-for-byte:
RAB measures what was already there; it does not modify it. The
adapter (eval/rab/adapters/james.py) calls the existing
core.lifecycle.replay_audit.emit_lifecycle_event and
core.lifecycle.replay_graph.reconstruct_graph_at against a
workspace-scoped audit.db; production audit.db is untouched.

RAB v0.1.1: the headline

A frozen benchmark spec + scenario fixture + deterministic scorer +
reference / JAMES / Baseline-0 adapters + the first gap-table
measurement.

Authority axes externalised (4 axes, locked before any measurement)

  1. Anchors: EU AI Act Art. 10(2)(b) / 12(1)(2)(b) / 19 verbatim
    + W3C PROV wasDerivedFrom + NIST AI RMF + Mathkar et al. agent
    trace survey (arXiv 2606.04990), which explicitly names
    "realistic execution-trace benchmarks" as an open challenge. RAB
    therefore responds to a published gap, not a self-defined one.
  2. Scoring: deterministic only. No LLM judge anywhere in the
    scorer (SPEC §6.1, pinned by test).
  3. Application: generic audit-log JSONL interface. Any system
    that can export an append-only log per SPEC §1 is scoreable.
    Baseline-0 (vanilla quickstart + Python-logging) is scored
    alongside JAMES; the gap structure across SUTs is the headline
    (SPEC §5 / §6.5).
  4. Verification: every result ships with the SPEC §4
    re-verification triple (result.json, exported audit log
    JSONL, mapping table JSON), committed to reports/rab/. Re-running
    the deterministic scorer on the same artifacts must reproduce the
    numbers bit-for-bit.
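
To make the append-only JSONL interface in axis 3 more concrete, here is a minimal sketch that writes a tiny log and walks a parent_id chain back to its root. Only parent_id is named in these notes; the other field names (event_id, op, doc_id) and the helper names are assumptions for illustration and are not the SPEC §1 schema.

```python
import io
import json

# Hypothetical event shape: only `parent_id` is named in these notes;
# `event_id`, `op`, and `doc_id` are illustrative assumptions.
EVENTS = [
    {"event_id": "e1", "op": "INGEST", "doc_id": "d1", "parent_id": None},
    {"event_id": "e2", "op": "UPDATE", "doc_id": "d1", "parent_id": "e1"},
    {"event_id": "e3", "op": "SUPERSEDE", "doc_id": "d2", "parent_id": "e2"},
]


def write_jsonl(events, stream):
    """Append one JSON object per line; earlier lines are never rewritten."""
    for event in events:
        stream.write(json.dumps(event, sort_keys=True) + "\n")


def read_jsonl(stream):
    """Parse a JSONL stream back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


def chain_to_root(events, event_id):
    """Follow parent_id links from event_id back to a root event.

    Returns the list of visited event ids, or None if a link is missing
    or the chain loops.
    """
    by_id = {e["event_id"]: e for e in events}
    path, seen = [], set()
    current = event_id
    while current is not None:
        if current not in by_id or current in seen:
            return None
        seen.add(current)
        path.append(current)
        current = by_id[current]["parent_id"]
    return path


buffer = io.StringIO()
write_jsonl(EVENTS, buffer)
buffer.seek(0)
loaded = read_jsonl(buffer)
print(chain_to_root(loaded, "e3"))  # ['e3', 'e2', 'e1']
```

A log that carries only free-text strings has no such links to follow, which is the situation the Baseline-0 floor is meant to expose.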

Honest framing: what RAB is not

  • Not a regulatory compliance certification. RAB operationalises
    Art. 10 / 12 / 19 concepts; SPEC §6.3 says this wherever scores
    are published.
  • Not an architecture novelty claim. ActiveGraph (arXiv 2605.21997,
    2026-05-21) independently published event-sourced log +
    deterministic replay + log-only reconstruction. The contribution
    is the benchmark, not the audit-native runtime (R1.0 prior-art
    finding, memory project_r1_replayable_audit_benchmark).
  • Not a JAMES-wins announcement. JAMES = reference on scenario-S1 is
    expected per SPEC §6.5. The bolt-on-vs-audit-native gap is the
    finding.

R1.0 → R1.5 sequence

  • R1.0 prior-art + EU AI Act anchors (#758/#760/#761):
    benchmark-vacancy confirmed; Art. 12/10/19 mapping verbatim; AI
    Act effective date 2026-08-02 confirmed via Art. 113.
  • R1.1 SPEC v0.1.1 FROZEN (#762): eval/rab/SPEC-v0.1.md. The
    abstract log interface (SPEC §1), three metrics (§2), scenario
    contract (§3), reporting format (§4), 5 baselines (§5), 6
    honesty clauses (§6). Spec changes never retro-apply; results
    carry their spec version (SPEC §6.6).
  • R1.2 scenario-S1 fixture (#763):
    eval/rab/scenarios/s1_lifecycle_small.json. 40 ops (11 INGEST /
    4 UPDATE / 3 SUPERSEDE / 2 DELETE / 20 QUERY) + 10 checkpoints +
    Northbridge Labs synthetic prose. Public-domain content,
    deterministic ids.
  • R1.3 driver + scorer + reference adapter (#764):
    eval/rab/{driver,scorer}.py + eval/rab/adapters/reference.py +
    tests/test_rab_benchmark.py (14 tests). The reference adapter
    caught a SPEC v0.1.0 defect during implementation (supersede-born
    docs untraceable under INGEST-only PC rule), which was corrected in
    v0.1.1 before any measurement was taken; adapter-as-spec-validation
    worked.
  • R1.4 pre-registration + JAMES adapter + Baseline-0 adapter +
    scenario-S1 measurement (#766 / #767): see gap table below.
    31 tests green (reference 14 + Baseline-0 8 + JAMES 9).
  • R1.5 = this release, packaged for external review.

Gap table (RAB SPEC v0.1.1 / scenario-S1, deterministic, re-runnable from reports/rab/)

| SUT | AC overall | AC INGEST | AC UPDATE | AC SUPERSEDE | AC DELETE | AC ANSWER | RF-exact | RF-graded | PC | log events |
|---|---|---|---|---|---|---|---|---|---|---|
| reference (self-verify gate) | 1.000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.000 | 1.000 | 1.000 | 80 |
| JAMES | 1.000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.000 | 1.000 | 1.000 | 80 |
| Baseline-0 (floor) | 0.275 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.000 | 0.000 | 0.000 | 40 |

  • AC INGEST = 1.0 across all three (default logging covers add-doc).
  • AC UPDATE / SUPERSEDE / DELETE / ANSWER = 0 for Baseline-0 (vanilla
    logger doesn't distinguish them, has no supersede taxonomy, often
    doesn't log deletes, emits no ANSWER event).
  • RF = 0 for Baseline-0 (logs carry strings, not payload, so replay
    has nothing to fold).
  • PC = 0 for Baseline-0 (no parent_id chain).
  • JAMES = reference on S1 is expected (SPEC §6.5); the headline is
    the audit-native vs default-logging gap.
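
The Baseline-0 AC overall of 0.275 is consistent with an average over the scenario-S1 op mix: 11 covered INGEST ops out of 40. The sketch below checks that arithmetic. It assumes AC overall is an op-count-weighted mean of the per-type scores and that the 20 QUERY ops are the ones scored in the AC ANSWER column; neither assumption is confirmed by these notes, and SPEC §2 remains the authority on the metric definition.

```python
from fractions import Fraction

# Scenario-S1 op mix as stated in these notes (40 ops total).
OP_COUNTS = {"INGEST": 11, "UPDATE": 4, "SUPERSEDE": 3, "DELETE": 2, "QUERY": 20}

# Per-type AC for Baseline-0, taken from the gap table.
# Assumption: QUERY ops are the ones scored in the "AC ANSWER" column.
BASELINE0_AC = {"INGEST": 1, "UPDATE": 0, "SUPERSEDE": 0, "DELETE": 0, "QUERY": 0}


def weighted_ac(op_counts, per_type_ac):
    """Op-count-weighted mean of per-type scores (assumed aggregation rule)."""
    total = sum(op_counts.values())
    covered = sum(Fraction(op_counts[op]) * per_type_ac[op] for op in op_counts)
    return covered / total


overall = weighted_ac(OP_COUNTS, BASELINE0_AC)
print(overall, float(overall))  # 11/40 0.275
```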

Honest tier: ⭐⭐ scenario-S1 audit-native vs floor gap confirmed
(deterministic, re-runnable from artifacts). ⭐⭐⭐ cross-scenario
remains ungated until R1.5+ ships additional scenarios.

Reproduce the table

git checkout v0.4.3
python -m pytest tests/test_rab_benchmark.py \
                 tests/test_rab_baseline0.py \
                 tests/test_rab_james_adapter.py    # 31 tests, all green
python scripts/research/rab_run.py --sut reference  # AC 1.0 / RF 1.0,1.0 / PC 1.0 (gate)
python scripts/research/rab_run.py --sut james      # AC 1.0 / RF 1.0,1.0 / PC 1.0
python scripts/research/rab_run.py --sut baseline0  # AC 0.275 / RF 0.0,0.0 / PC 0.0
# inspect reports/rab/*.result.json; values match
# the committed v0.4.3 artifacts bit-for-bit (SPEC §4 determinism).

The driver and scorer are deterministic over a fixed artifact set. The
adapter step writes wall-clock-dependent timestamps, but the AC/RF/PC
values are stable across re-runs (test_baseline0_deterministic,
test_james_deterministic_mode_repeat).

Cycle γ multi-hop arc closure (companion track)

Cycle γ's MuSiQue probes concluded that "multi-hop improvement"
is not a JAMES roadmap item. The wall is unsupervised
supporting-selection, not retrieval breadth and not model ceiling.
Retrieval recall on the top-8 set is 0.76 (both R0 and ablated); the
gap to oracle-grounded performance is the selection step. Memory:
project_cycle_gamma_phase_c2_retrieval_bottleneck.

7-probe trace (PRs #752 → #757):

  • #752 MuSiQue → retrieval-bottleneck hypothesis (later corrected
    by #754)
  • #753 D1 static decompose → INSUFFICIENT (hop-2 anaphora)
  • #754 D1b iterative → INSUFFICIENT + sources-top3 artifact
    isolation: retrieve(top-8) 13/25 both arms, the diagnosis pivot.
  • #755 D2 rerank-OFF → synth-noise is the actual lever
    (oracle 72% vs noisy 12%, a 9× jump when supporting paragraphs are
    given directly).
  • #756 D3 evidence-select → INSUFFICIENT → multi-hop arc closure.
  • #757 D4 graph: feature already verified (250+ tests); Path-D
    over-investment PASS; ★ graph build O(N²) finding, folded
    into RAB as the RF-cost axis (SPEC §2.2 cost_s_per_1k_events).
    Lifted out of MuSiQue scope where it had no leverage, lifted into
    RAB scope where it has structural meaning.

Operator-facing artifacts: env gates JAMES_RETRIEVE_TOP_K /
JAMES_RERANK_TOP_K (defaults 8 / 5; byte-identical when unset).
Default-off invariant preserved.
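
A minimal sketch of how an env gate with these defaults could be read. The variable names and the defaults (8 / 5) come from these notes; the helper name env_int, the validation, and the fallback behaviour are assumptions, not the JAMES implementation.

```python
import os


def env_int(name, default, environ=None):
    """Return the integer value of an env var, or `default` when unset or empty.

    `environ` defaults to os.environ; passing a dict keeps the example testable.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip()
    if not raw:
        return default  # unset: fall back to the documented default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError(f"{name} must be >= 1, got {value}")
    return value


# Unset variables fall back to the defaults stated in these notes.
unset = {}
print(env_int("JAMES_RETRIEVE_TOP_K", 8, unset))  # 8
print(env_int("JAMES_RERANK_TOP_K", 5, unset))    # 5

# An operator override takes effect only when the variable is set.
override = {"JAMES_RETRIEVE_TOP_K": "16"}
print(env_int("JAMES_RETRIEVE_TOP_K", 8, override))  # 16
```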

R0 P0 (pre-cycle disciplinary + security)

  • #750 R5 pre-registration rules + R2 measurement-discipline rules
    (docs/rules/) checked in as repo-side audit trail of CLAUDE.md
    rules previously memory-only. Bus-factor + auditability fix
    (R2 external review action item).
  • #751 starlette 1.0.1 (CVE) + chromadb risk-accept note
    (docs/security/).

Verification

  • RAB test suite (this release): 31/31 PASS
    • tests/test_rab_benchmark.py: 14 (reference adapter + scorer +
      fault injection)
    • tests/test_rab_baseline0.py: 8 (floor pinned)
    • tests/test_rab_james_adapter.py: 9 (audit-native demo +
      workspace isolation + lifecycle bridge + reconstruct_graph_at
      agreement + log-only replay invariant)
  • RAB measurement artifacts committed under reports/rab/:
    • reference-S1-*.{result.json,log.jsonl,mapping.json}
    • james-S1-*.{result.json,log.jsonl,mapping.json}
    • baseline0-S1-*.{result.json,log.jsonl,mapping.json}
      Each result.json carries scenario_sha, log_sha,
      mapping_table_sha, sut_version, and runner_env for SPEC §4
      re-verification.
  • JAMES core test suite: no regression (no core/ change).
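
The re-verification triple can be spot-checked by hashing the exported artifacts and comparing the digests with those recorded in result.json. The sketch below assumes SHA-256 hex digests and a flat result.json containing the log_sha and mapping_table_sha keys named above; the hash algorithm, the JSON layout, and the helper names are assumptions, so check SPEC §4 before relying on it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path):
    """Hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_triple(result_path, log_path, mapping_path):
    """Compare digests recorded in result.json with recomputed ones.

    Returns {field_name: True/False}. Field names follow these notes;
    the flat JSON layout is an assumption.
    """
    result = json.loads(Path(result_path).read_text(encoding="utf-8"))
    return {
        "log_sha": result.get("log_sha") == sha256_file(log_path),
        "mapping_table_sha": result.get("mapping_table_sha") == sha256_file(mapping_path),
    }


# Self-contained demo on throwaway files (not the committed artifacts).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    log = tmp / "demo.log.jsonl"
    mapping = tmp / "demo.mapping.json"
    result = tmp / "demo.result.json"
    log.write_text('{"op": "INGEST"}\n', encoding="utf-8")
    mapping.write_text("{}", encoding="utf-8")
    result.write_text(
        json.dumps({
            "log_sha": sha256_file(log),
            "mapping_table_sha": sha256_file(mapping),
        }),
        encoding="utf-8",
    )
    checks = verify_triple(result, log, mapping)
    log.write_text('{"op": "DELETE"}\n', encoding="utf-8")  # tamper with the log
    tampered = verify_triple(result, log, mapping)

print(checks)
print(tampered)
```

Against the committed artifacts, the same check would be pointed at one result.json, log.jsonl, and mapping.json set under reports/rab/.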

What v0.4.3 does NOT do

  • No production runtime change in JAMES core. The audit / lifecycle
    / graph paths are unchanged byte-for-byte. RAB measures the existing
    paths via a workspace-scoped adapter; production audit.db is never
    touched.
  • No cross-scenario RAB result. S1 is the only scenario in v0.1.1.
    Additional scenarios (S2 cross-lingual / S3 larger graph / etc.)
    unlock the ⭐⭐⭐ cross-scenario ceiling in a future cycle.
  • No Baseline-1. SPEC §5 names Baseline-1 as
    "default-quickstart + LangSmith/OTel tracing mapped to SPEC §1".
    It is a separate SUT in a future cycle; this release is the
    default-vs-native gap only.
  • No mutation-site wiring follow-up. v0.4.2's audit-only invariant
    (I4 strong form) still depends on the mutation-site wiring
    cross-cutting cycle (T1/T2/T2.D/T6/T7 call sites → emit_lifecycle_event).
    Out of scope for v0.4.3.
  • No multi-hop retrieval improvement. Cycle γ closure explicitly
    re-framed this as NOT a JAMES roadmap item (memory
    project_cycle_gamma_phase_c2_retrieval_bottleneck).
  • No regulatory compliance certification claim. SPEC §6.3.

Out of scope (deferred)

  • Baseline-1 (LangSmith / OTel adapter).
  • RAB scenario-S2 + cross-scenario gating.
  • Replication invites to ActiveGraph + other audit-native runtimes
    (R1.6, separate collab scope;
    feedback_eval_cycle_vs_collab_arc_separation).
  • Mutation-site wiring (T1/T2/T2.D/T6/T7 → emit_lifecycle_event)
    carried forward from v0.4.2.

Sources

  • RAB SPEC: eval/rab/SPEC-v0.1.md
  • Pre-registration: docs/research/r1-4-preregistration-2026-06-10.md
  • Gap-table handover: docs/handovers/v0.4-r1-4-gap-table-2026-06-10.md
  • Cycle γ closure handover (companion track):
    docs/handovers/v0.4-cycle-gamma-phase-c2-musique-retrieval-bottleneck-2026-06-10.md
  • Session entry: docs/handovers/v0.4-next-session-entry-2026-06-10-r1-rab.md