docs(accuracy): move benchmark scores to docs/benchmarks single source of truth; keep invariants (hallucination 0, read-only boundary, needle-in-haystack)
docs(wiki): fix remaining stale CFReDS case path and v1.0.0-as-current in Glossary
docs(wiki): align Accuracy/Home with canonical evidence and tiered cases
Remove the public --variant / sample-evidence-realistic concept from Accuracy
(single canonical evidence_root + CI fixture), retier the case tables to
self-evaluation/external-evaluation, fix case links to the new index-only
paths, rename ground-truth.json to truth.json, and drop a stale tool-count.
Dated historical roadmap entries in Phase-1 keep their original case numbers.
docs: align wiki with current live-mode scope
Document live mode through ANTHROPIC_API_KEY and --dry-run, remove public zero-cost/OAuth setup claims, and update Claude MCP registration to dart_mcp.server_stdio.
Refresh accuracy evidence counts to 62 reference files and 67 realistic files, clarify that the measured identical result applies to case-01 F-001/F-013, and remove stale 50-file language.
Update operator, SIFT, macOS, roadmap, and Phase 1 pages to the 72-tool surface and current full-suite validation model without stale 35-tool or 75-test guidance.
Fix the Home architecture link and describe external entries as case-study slots instead of fully measured benchmark rows.
QA: git diff --check passed for the wiki.
docs(wiki): fix remaining stale 'the 60' (Live-mode) and sonnet-4 model name (Accuracy)
docs(wiki): align Accuracy and Glossary with the realistic-variant enrichment design
- Accuracy.md: the realistic row claimed the generator synthesizes
"security events 516" -- it does not; the security EventLog is hand-curated
at ~11,530 lines. Only the two IOC-only logs (web access, unix auth) are
noise-injected. Dropped the "production-shape / production-noise-injected"
overstatement (the enriched ratio is ~1:30).
- Glossary.md: the MCP surface is 47 native + 25 SIFT = 72, not "72 native".
wiki QA pass: file count 49->50, test count 31->75 (current snapshots only)
post-v0.7.1 QA audit caught two latent drifts:
evidence file count:
- Accuracy.md L64 sample-evidence-realistic '49 files' was correct as of
the v0.7.0 evidence-fidelity enrichment, but v0.7.1 added the
linux/cron/sample.crontab fixture, raising the count to 50.
measure_accuracy --variant realistic now reports
evidence_files_measured: 50 against ground truth F-001 + F-013, which
matches the actual repo state.
test count:
- Operator-guide.md L55 step-by-step quick-start
- Phase-1.md L50 Empirical-validation 'fresh clone' summary
- Roadmap.md L60 Phase-1 validation summary
- Running-on-macOS.md L57 step header + L134 Apple Silicon notes
all said '31 tests' (the v0.5.2 snapshot baseline). v0.7.1 ships
'75 of 75 tests passing'. Updated only the present-tense fresh-clone
claims; the historical v0.5.2 release row in Phase-1.md L109
('-> 31 tests passing') is preserved verbatim as a dated milestone.
wiki: sync to v0.7.1 — 11 cases, 72 MCP functions, case-11 highlight
- Accuracy.md: '61 files' -> '49 files'; new v0.7.0 section covering
case-11 supply-chain attack class; new v0.7.0 case-library summary
table (11 cases / 99 findings split 69 layer-1 + 30 layer-2 + 32/36
function coverage)
- Glossary.md: 'As of v0.6.0' -> 'As of v0.7.1: 72 native MCP tools'
- Home.md: case-studies section rewritten to mention 11 cases / 99
findings plus case-11 as recommended judge walkthrough
- MCP-function-catalog.md: previously missed v0.6.1 functions
(parse_macos_quarantine, parse_linux_cron_jobs, detect_dns_tunneling)
+ v0.7.1 functions (parse_linux_text_log, parse_linux_shell_history)
now properly documented with MITRE technique mappings and references
- Phase-1.md: timeline extended with v0.5.4, v0.6.0, v0.6.1, v0.7.0,
v0.7.1 milestones
deliberately not touched — these are version-anchored historical
records: v0.5.4 CFReDS section (locked at first external benchmark),
playbook 'target_case_classes: 10 case classes' (playbook scenario
classes, not evidence cases), v0.4 / v0.5 release rows.
wiki(qa-r18): v0.5.4 CFReDS Hacking Case section + 36-function bypass test
Paired with main repo r18 commit (3b69129).
== Updated ==
### Accuracy.md
- Bypass test: 'documented 35-function set' -> 'documented 36-function set'
- New section: 'v0.5.4 — External benchmark: NIST CFReDS Hacking Case'
with strict/lenient recall comparison (v0.5.3 0.10/0.40 -> v0.5.4
0.50/0.80) and the paradigm-gap explanation
- New 'See also' subsection linking to case-08 README + closed issue
#52 + open Phase 2 issues #53/#54/#55
== Why this is in wiki not just main repo ==
Reviewers reach the wiki via the GitHub right-rail link first, often
before they read the README. The wiki Accuracy page has been the
'source of truth for measurement claims' since v0.5.0, so keeping it
current with the v0.5.4 CFReDS results is non-negotiable under the
4-surface sync rule.
== Verified ==
- All cross-references resolve: case-08 README path, issue #52/53/54/55
links, parse_registry_hive wiki anchor
- No drift between this Accuracy page and docs/accuracy-report.md (main
repo) or the social-surface advertising (profile + pages)
wiki(qa-r17): two evidence variants + corrected ground-truth count + methodology disclosure
Pairs with main repo commit 58b3e5c (v0.5.3).
== Why ==
Round 17 addresses a fair reviewer concern: 'recall=1.0 measured on a
30-line file is not strong evidence — every line is an IOC.' That is
correct. The fix is to ship two evidence variants and disclose
methodology explicitly, then point at Phase 2 for third-party
benchmarking.
== Fixed ==
### Accuracy.md — '12/12 ground-truth findings' over-claim
Bundled find-evil-ref-01 case has exactly TWO ground-truth findings
(F-001 amcache anomaly + F-013 USB persistence), not twelve. The
'12/12' was a typo or stale value from an earlier draft. Verified
against scripts/measure_accuracy.py output: 'true_positives':
['F-001', 'F-013']. Fixed.
### Accuracy.md — '8 files' SHA-256 evidence integrity claim
The actual file count walked by measure_accuracy.py's evidence
digest map is 61, not 8. Fixed in the table.
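For reference, a count like this can be reproduced by walking the evidence directory and hashing every file. The sketch below is a hedged illustration only: `evidence_digest_map` and `integrity_preserved` are hypothetical names, not the functions inside measure_accuracy.py, and the relative-path keying is an assumption.

```python
import hashlib
from pathlib import Path


def evidence_digest_map(root):
    """Map each file under root (relative POSIX path) to its SHA-256 hex digest.

    Hypothetical helper for illustration; not the measure_accuracy.py code.
    """
    base = Path(root)
    digests = {}
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        hasher = hashlib.sha256()
        with path.open("rb") as fh:
            # Read in chunks so large evidence files are not loaded whole.
            for chunk in iter(lambda: fh.read(65536), b""):
                hasher.update(chunk)
        digests[path.relative_to(base).as_posix()] = hasher.hexdigest()
    return digests


def integrity_preserved(before, after):
    """True when no file was added, removed, or modified between two walks."""
    return before == after
```

Taking `len(evidence_digest_map(...))` before and after a run gives the file count to quote in the table, and comparing the two maps is one way to back an integrity claim.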
== Added ==
### Two evidence variants — section explaining why both ship
Variant A — examples/sample-evidence/ (deterministic)
Variant B — examples/sample-evidence-realistic/ (~1:30 IOC:benign)
Both score the same ground truth and produce identical headline
numbers (recall=1.0 / FPR=0.0 / hallucination=0). The realistic
variant rules out the 'small-input over-fit' failure mode by
demonstrating the same recall on web-log 1027 lines (37× noise),
security events 516 (32× noise), unix auth 517 (29× noise).
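As a rough illustration of how the headline numbers relate to the finding sets, here is a minimal scoring sketch. It assumes recall = TP / |truth| and FPR = FP / (number of benign candidates); `score` and `benign_total` are hypothetical names, and the real measure_accuracy.py may define its denominators differently.

```python
def score(reported, truth, benign_total):
    """Score reported finding IDs against ground-truth IDs.

    benign_total is the assumed number of benign candidates (FP + TN).
    Hypothetical helper; not the project's scoring code.
    """
    reported, truth = set(reported), set(truth)
    true_positives = sorted(reported & truth)
    false_positives = sorted(reported - truth)
    recall = len(true_positives) / len(truth) if truth else 0.0
    fpr = len(false_positives) / benign_total if benign_total else 0.0
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "recall": recall,
        "fpr": fpr,
    }


# Both variants share the same ground truth, so only benign_total differs.
result = score({"F-001", "F-013"}, {"F-001", "F-013"}, benign_total=1000)
```

Under these definitions, adding benign noise changes only the FPR denominator, which is why a noisier variant can confirm the same recall without changing the ground truth.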
### Phase 2 third-party dataset benchmarking
Linked to issue #47 (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS).
Phase 1 establishes the methodology; Phase 2 operationalizes it on
community-trusted datasets.
== Verified ==
- Both variant invocations work end-to-end and produce the
headline numbers as reported (rerun: 2026-05-09)
- 61-file count matches measure_accuracy.py output
- Issue #47 exists with workstream split and acceptance criteria
wiki(qa-r10): kill function-signature + file-existence hallucinations across 6 pages
Pairs with main repo commit 8a1917b. Round 10 was a 'judge follows
every advertised command line by line' pass. It surfaced 6 distinct
hallucinations a SANS judge would have hit if they tried to
reproduce anything from the wiki.
== Defects fixed ==
### Accuracy.md — broken script reference
Advertised 'bash scripts/run-accuracy-suite.sh'. That script
doesn't exist and never has. The actual reproducer is
'python3 scripts/measure_accuracy.py' with the standard
PYTHONPATH export. A judge running the README's accuracy claim
through this page would have hit:
bash: scripts/run-accuracy-suite.sh: No such file or directory
Replaced with the real measure_accuracy.py invocation, which
was verified end-to-end (recall=1.0, FPR=0.0,
hallucination_count=0, evidence_integrity_preserved=true).
### Case-PtH-Timestomp.md — 3 function-signature errors
All three are the same class of mistake: the wiki cited
positional/keyword args that don't exist on the actual MCP tools.
The advertised CLI invocation (first line below) was wrong as well:
'dart-agent --hunt' → 'python3 -m dart_agent --case ... --out ... --mode deterministic'
'get_process_tree(host=...)' → 'get_process_tree(process_csv=...)'
'analyze_windows_logons(host=...)' → 'analyze_windows_logons(security_events_json=...)'
'parse_prefetch(target=...)' → 'parse_prefetch(prefetch_path=...)'
These same mistakes live in docs/case-pth-timestomp.md (fixed
in the paired repo commit). Verified by pulling live
inputSchema.required from list_tools() for each tool.
### dart-agent.md — run_loop() and 4 fictional files
The page advertised:
- 'run_loop() in dart_agent/src/dart_agent/__init__.py'
- A file inventory citing loop.py, decision.py, hypothesis.py,
serializer.py — none of which exist.
The actual structure is __init__.py + __main__.py + live.py.
The senior-analyst loop is the DeterministicAnalyst class's
.run() method (4 internal phases: _phase_timeline →
_phase_hypothesis → _phase_validate_usb → _phase_finalize).
Rewrote both the 'What it owns' bullet and the Files block to
match reality. Added an explanatory note that the agent is
small enough to keep its control flow in __init__.py.
### dart-audit.md — 3 hallucinations in one example
The advertised AuditLogger.log() example used:
- outputs={...} — actual kwarg is 'output' (singular)
- cpu_ms=42 — no such kwarg
- bytes_read=1024 — no such kwarg
Real signature is:
log(tool_name, inputs, output, iteration, token_count_in,
token_count_out, finding_ids=None)
Same page advertised audit_id type as 'UUID4' — actual is
8-character hex (secrets.token_hex(4)). Same page advertised
'output/<run_id>/<audit_id>.json' as the per-call output
storage location — that directory layout doesn't exist; outputs
are referenced by SHA-256 digest only in deterministic mode.
Fixed all three. Verified the corrected example works as a
copy-paste — wrote a test audit log, verified the chain, ran
CLI (verify + trace) all green.
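To make the corrected shape concrete, here is a hedged, in-memory sketch of a SHA-256 chained log that mirrors the log() signature quoted above. `SketchAuditLog` is a hypothetical stand-in, not the real AuditLogger: the entry field names, the genesis hash, returning the audit_id, and the verify() logic are all assumptions made for illustration.

```python
import hashlib
import json
import secrets

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def _digest(obj):
    """SHA-256 over a canonical (sorted-key) JSON encoding."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


class SketchAuditLog:
    """Hypothetical stand-in for illustration; not the dart-audit implementation."""

    def __init__(self):
        self.entries = []

    def log(self, tool_name, inputs, output, iteration,
            token_count_in, token_count_out, finding_ids=None):
        entry = {
            "audit_id": secrets.token_hex(4),  # 4 random bytes -> 8 hex characters
            "tool_name": tool_name,
            "inputs": inputs,
            "output_sha256": _digest(output),  # output referenced by digest only
            "iteration": iteration,
            "token_count_in": token_count_in,
            "token_count_out": token_count_out,
            "finding_ids": list(finding_ids or []),
            "prev_hash": self.entries[-1]["entry_hash"] if self.entries else GENESIS,
        }
        entry["entry_hash"] = _digest(entry)  # hashed before this key is added
        self.entries.append(entry)
        return entry["audit_id"]

    def verify(self):
        """Recompute every hash and check each entry links to its predecessor."""
        prev = GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True
```

Because each entry's hash covers the previous entry's hash, editing any earlier record makes verify() fail from that point on.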
### dart-corr.md — serializer.py hallucination
Page claimed UNRESOLVED contradictions are blocked by 'the
serializer (dart_agent/serializer.py)'. There is no
serializer.py file. The blocking happens inside
DeterministicAnalyst's finding emission path in __init__.py.
Rewrote the sentence to point at the real location.
### Live-mode.md — 2 hallucinations in the headline example
- '--evidence /mnt/case-evidence' — no such CLI flag. Real
pattern is 'export DART_EVIDENCE_ROOT=/path' before invoking
the agent.
- 'Claude sees exactly 35 typed forensic functions' — should
be 60 (35 native + 25 SIFT adapters). Stale from the v0.4
surface, missed in earlier rounds because Live-mode.md
wasn't part of the surface-count grep targets.
Fixed both. Added an explicit '(Add --dry-run to use a scripted
mock Claude with no API key)' line for CI / offline reproduction.
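One way to avoid missing a page in future surface-count sweeps is to scan every wiki page rather than a fixed target list. The sketch below assumes the wiki is a flat directory of .md files and that counts appear in the phrase 'N typed forensic functions'; `find_stale_counts` and the pattern are hypothetical, not an existing project script.

```python
import re
from pathlib import Path

# Assumed phrasing, taken from the Live-mode.md sentence quoted above.
COUNT_PATTERN = re.compile(r"\b(\d+)\s+typed forensic functions\b")


def find_stale_counts(wiki_dir, expected):
    """Return (page, line number, count) for every count that differs from expected."""
    hits = []
    for page in sorted(Path(wiki_dir).glob("*.md")):
        lines = page.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            for match in COUNT_PATTERN.finditer(line):
                count = int(match.group(1))
                if count != expected:
                    hits.append((page.name, lineno, count))
    return hits
```

Dated historical rows that intentionally keep an old count would need an explicit exemption list on top of this.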
== Verification approach ==
For each defect:
1. Read the wiki claim
2. Pulled the actual code/schema (inputSchema, argparse output,
filesystem ls, AuditLogger signature via inspect)
3. Compared advertised ↔ actual
4. Fixed the wiki, then re-verified the fixed example by either
running it (Accuracy.md, dart-audit.md) or by checking
it would no longer raise on a copy-paste
== Pattern internalised ==
Round 9 caught output-key hallucinations in code examples. Round 10
caught argument-name hallucinations and file-path hallucinations
in tutorial prose — a different surface that print-output dry-runs
don't cover. Going forward, any wiki/docs page that references a
function by name + signature should be diff-checked against the
live inputSchema.required list whenever the underlying code changes.
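A minimal sketch of such a diff-check follows. It assumes the required-argument lists have already been pulled from list_tools() into a plain dict, and it only understands simple `name(kwarg=...)` calls in prose; `check_documented_calls` is a hypothetical helper, not part of the repo.

```python
import re

CALL_PATTERN = re.compile(r"\b(\w+)\(([^()]*)\)")  # name(args) with no nested parens
KWARG_PATTERN = re.compile(r"(\w+)\s*=")           # keyword names inside the args


def check_documented_calls(text, required_by_tool):
    """Return (tool, missing required args) for each documented call that omits any.

    required_by_tool maps tool name -> list of required argument names,
    assumed to mirror each tool's inputSchema.required.
    """
    problems = []
    for name, args in CALL_PATTERN.findall(text):
        if name not in required_by_tool:
            continue  # not an MCP tool we know about
        used = set(KWARG_PATTERN.findall(args))
        missing = sorted(set(required_by_tool[name]) - used)
        if missing:
            problems.append((name, missing))
    return problems


# Example with one of the stale signatures from this round.
required = {"get_process_tree": ["process_csv"]}
stale = check_documented_calls("get_process_tree(host=...)", required)
```

This catches a documented call that omits a required argument; flagging unknown arguments as well would need the full inputSchema properties, which this sketch does not assume.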
wiki: add 13 missing pages, fix all 32 broken links
The wiki sidebar and Home page referenced 13 pages that didn't exist,
producing the GitHub 'create new page' UI when clicked. Adds:
Concepts:
Glossary — DFIR / agent / MCP terms
The 5 packages:
dart-agent — senior-analyst wrapper loop
dart-corr — cross-artifact correlation engine
dart-audit — SHA-256 chained audit log
dart-playbook — YAML sequencing rules
(dart-mcp already existed)
Reference:
Comparison — vs Velociraptor / Plaso / EZ tools / SOAR / vanilla LLMs
Running it:
Running-on-SIFT — SANS SIFT VM 5-minute setup
Running-on-macOS — macOS-specific mount conventions
Live-mode — real Claude API + MCP stdio integration
Case studies:
Case-PtH-Timestomp — Pass-the-Hash + timestomp pre-existence
Case-IP-KVM — IP-KVM remote-hands insider scenario
Writing-case-studies — guide for contributing new case studies
Project:
Accuracy — reproducible accuracy methodology + numbers
The Roadmap-Phase-2/3/4 links in Home.md were repointed to the
existing Roadmap page's anchors (those were never separate pages).
The Contributing link in dart-mcp.md now points to CONTRIBUTING.md
in the main repo.
_Sidebar.md restructured into 6 named sections so the 25-page wiki
is navigable. Final broken-link count: 0.
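A broken-link count like this can be re-derived with a small scan. The sketch below assumes a flat wiki checkout where each page is `<Name>.md` and internal links use the markdown form `[text](Page)` or `[text](Page#anchor)`; it does not handle `[[Page]]` syntax or validate anchors, and `broken_links` is a hypothetical helper.

```python
import re
from pathlib import Path

LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")  # [text](target)


def broken_links(wiki_dir):
    """Return (page, target) for internal links whose target page does not exist."""
    root = Path(wiki_dir)
    pages = {p.stem for p in root.glob("*.md")}
    broken = []
    for page in sorted(root.glob("*.md")):
        for target in LINK_PATTERN.findall(page.read_text(encoding="utf-8")):
            # Skip external URLs, mail links, and same-page anchors.
            if "://" in target or target.startswith(("#", "mailto:")):
                continue
            name = target.split("#", 1)[0]
            if name and name not in pages:
                broken.append((page.name, target))
    return broken
```

An empty list from `broken_links(...)` corresponds to a broken-link count of 0 for the link forms this sketch covers.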