Accuracy

Reproducible accuracy measurement

How we measure detection accuracy, and how you can reproduce every number we publish.

The four metrics

Metric	What it measures
Recall	% of ground-truth findings the agent surfaced
False positive rate	share of agent claims unsupported by the bundled evidence
Hallucination count	facts not present in the source artifacts at all
Evidence integrity	SHA-256 of every input file, before and after the run
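As a rough illustration of how the first three metrics could be computed from sets of finding IDs, here is a minimal sketch. The name `score_run`, its arguments, and the choice to divide unsupported claims by total claims are assumptions made for this example; they are not taken from scripts/measure_accuracy.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    recall: float
    false_positive_rate: float
    hallucination_count: int


def score_run(ground_truth: set[str], supported: set[str],
              unsupported: set[str], hallucinated: set[str]) -> Score:
    """Score one run from sets of IDs (hypothetical layout).

    ground_truth : finding IDs the case is known to contain
    supported    : agent claims backed by the bundled evidence
    unsupported  : agent claims with no backing in the bundled evidence
    hallucinated : agent claims citing facts absent from every source artifact
    """
    total_claims = len(supported) + len(unsupported)
    recall = len(ground_truth & supported) / len(ground_truth) if ground_truth else 0.0
    # Assumption: the rate is unsupported claims over all claims made.
    fpr = len(unsupported) / total_claims if total_claims else 0.0
    return Score(round(recall, 3), round(fpr, 3), len(hallucinated))


# Shaped like the bundled case: two ground-truth findings, both surfaced.
result = score_run({"F-001", "F-013"}, {"F-001", "F-013"}, set(), set())
```

With these inputs `result` carries recall 1.0, a false positive rate of 0.0, and a hallucination count of 0, matching the shape of the numbers reported for the bundled case.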

Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer.
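The write-time block can be pictured as a serializer that refuses any claim lacking a known audit_id. The sketch below is an assumption about the general shape only: `Claim`, `MissingAuditId`, and `serialize_claims` are hypothetical names, not dart_agent's actual API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


class MissingAuditId(ValueError):
    """Raised when a claim cannot be tied to a recorded MCP call."""


@dataclass
class Claim:
    finding_id: str
    statement: str
    audit_id: Optional[str] = None


def serialize_claims(claims: list[Claim], known_audit_ids: set[str]) -> str:
    """Serialize claims to JSON, rejecting any without a recorded audit_id."""
    for claim in claims:
        if not claim.audit_id:
            raise MissingAuditId(f"{claim.finding_id}: no audit_id cited")
        if claim.audit_id not in known_audit_ids:
            raise MissingAuditId(f"{claim.finding_id}: unknown audit_id {claim.audit_id!r}")
    # Nothing is written unless every claim passed both checks.
    return json.dumps([asdict(c) for c in claims], sort_keys=True)
```

Validating the whole batch before producing any output means a single uncited claim stops the write, rather than leaving a partially serialized report behind.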

How to reproduce

git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"

# Variant A — deterministic reference (CI baseline; ≤30 lines/file)
python3 scripts/measure_accuracy.py

# Variant B — noise-injected realistic (~1:30 IOC:benign)
python3 scripts/measure_accuracy.py --variant realistic

Both variants score the same ground truth (F-001, F-013 for the bundled find-evil-ref-01 case) and verify the audit-chain integrity end-to-end. CI runs both variants on every commit. The realistic tree is hand-curated at production volume on most surfaces. Only the two IOC-only logs (web access, unix auth) are enriched with deterministic benign noise by scripts/generate_realistic_evidence.py before scoring, which tests needle-in-haystack recall at an IOC-to-benign ratio of about 1:30.

Two evidence variants — why both ship

A reasonable reviewer asks: "Recall=1.0 on a 30-line file is nearly meaningless — every line is an IOC." That is fair. The repository ships two evidence sets so reviewers can score the agent on both:

Variant	Path	Shape	Purpose
Reference	`examples/sample-evidence/`	Web log 27 lines, security events 16, unix auth 17 — fully IOC-loaded	Deterministic CI baseline. Stable hashes. Easy to debug.
Realistic	`examples/sample-evidence-realistic/`	Hand-curated production volume (security EventLog ~11,530 lines, supply-chain, RDP brute, USB setupapi, etc.); the two IOC-only logs are enriched with benign noise — web access 1027 (1:37), unix auth 517 (1:29)	Demonstrates needle-in-haystack recall at ~1:30. Benign noise on the two IOC-only logs generated deterministically by `scripts/generate_realistic_evidence.py`; all other evidence is committed hand-curated.
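To make "deterministic benign noise" concrete, here is a minimal sketch of seeded noise injection that keeps the IOC lines in their original order. It is not the content of scripts/generate_realistic_evidence.py; `inject_noise`, `noise_ratio`, the seed value, and the placeholder benign lines are all assumptions for illustration.

```python
import random


def inject_noise(ioc_lines: list[str], benign_count: int, seed: int = 1405) -> list[str]:
    """Interleave benign lines with IOC lines, reproducibly.

    A fixed seed makes the output identical on every run, which keeps
    file hashes stable. IOC lines keep their relative order.
    """
    rng = random.Random(seed)
    total = len(ioc_lines) + benign_count
    # Pick which output slots hold IOC lines; all other slots get noise.
    ioc_slots = set(rng.sample(range(total), len(ioc_lines)))
    ioc_iter = iter(ioc_lines)
    merged, benign_index = [], 0
    for slot in range(total):
        if slot in ioc_slots:
            merged.append(next(ioc_iter))
        else:
            merged.append(f"benign-noise-{benign_index:04d}")  # placeholder line
            benign_index += 1
    return merged


def noise_ratio(total_lines: int, ioc_lines: int) -> float:
    """Benign lines per IOC line."""
    return (total_lines - ioc_lines) / ioc_lines
```

Applied to the line counts in the table, `noise_ratio(1027, 27)` is about 37 and `noise_ratio(517, 17)` is about 29, which is where the 1:37 and 1:29 figures come from if every line of the reference logs is an IOC.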

Both variants score identically on the case-01 ground-truth findings (F-001, F-013): recall=1.000, FPR=0.000, hallucination=0. For the measured bundled case, this rules out the "small-input over-fit" failure mode, in which the agent scores well only because nearly every line of the toy data is an IOC.

External, third-party benchmarking (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS) is tracked as Phase 2 work — see issue #47 for the methodology and timeline. Phase 1 establishes the measurement methodology; Phase 2 operationalizes it on community-trusted datasets.

Datasets currently measured

1. Bundled IP-KVM remote-hands case (`examples/sample-evidence/`)

The headline accuracy run for the hackathon submission. Both variants score:

Metric	Reference	Realistic
Recall	1.000 (2/2 ground-truth findings F-001, F-013)	1.000 (same)
False positive rate	0.000	0.000
Hallucination count	0	0
Evidence integrity	preserved (62 files, all SHA-256 pre/post match)	preserved (67 files, all SHA-256 pre/post match)
Iterations to verdict	5	5
Total MCP calls	14
Audit-chain verified	yes (SHA-256 chain unbroken)

Reproduces in seconds on a local SIFT VM. Live-mode runtime depends on model, artifact volume, network latency, and max-iteration settings.
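The evidence-integrity row rests on hashing every input file before and after the run and comparing the two manifests. A minimal sketch of that check follows; `hash_tree` and `changed_files` are hypothetical helper names, and the real measurement script may organize this differently.

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file under root (relative POSIX path) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in chunks so large evidence files do not load into memory at once.
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Paths that were added, removed, or modified between two manifests."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))
```

Taking one manifest before the agent starts and one after it finishes, an empty `changed_files` result corresponds to "all SHA-256 pre/post match".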

2. Bundled sample evidence (synthetic, in `examples/sample-evidence/`)

Smaller scope — primarily exercises the dart-corr contradiction handling.

Metric	Value
Recall	1.000 (3/3 — USB insertion, exec, contradiction)
False positive rate	0.000
Self-correction fired	yes (initial single-host hypothesis revised)
Confidence delta	0.34 → 0.91 over 5 iterations

Datasets in flight (Phase 1 polish)

Dataset	Status
NIST CFReDS Hacking Case	Issue #17 — methodology approved, run pending
Ali Hadi Memory Forensic Challenge #1	Issue #16
Digital Corpora M57 Patents	Issue #18
Real auditd corpus (v0.4 Linux validation)	Issue #28
Patrick Wardle macOS malware persistence corpus	Issue #29

When each lands, the accuracy report grows a new section with the same four metrics + the chain tail hash.
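For readers unfamiliar with the idea of a chain tail hash, the sketch below shows one common way to build and verify a SHA-256 hash chain, where each entry's hash covers the previous entry's hash. The record layout (`payload`, `prev`, `hash`) and the all-zero genesis value are assumptions for this example, not dart_audit's on-disk format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every link; any edited, dropped, or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


def tail_hash(chain: list[dict]) -> str:
    """The single value that summarizes the whole chain."""
    return chain[-1]["hash"] if chain else GENESIS
```

Because each hash depends on everything before it, publishing only the tail hash is enough for a reader to confirm that a replayed run produced the same sequence of audited calls.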

What this report is not claiming

Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.

Bypass test results — the architectural guarantee

Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. Re-run on every commit:

Test	Result
Surface is exactly the documented 36-function set	PASS
`execute_shell` raises `ToolNotFound`	PASS
`eval`, `exec`, `subprocess_run` all raise `ToolNotFound`	PASS
`_safe_resolve` rejects `..` traversal	PASS
`_safe_resolve` rejects absolute paths outside `DART_EVIDENCE_ROOT`	PASS
`_safe_resolve` rejects null-byte truncation attacks	PASS
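The three `_safe_resolve` rows describe a standard path-containment check. A minimal sketch of such a check is shown here under stated assumptions: `safe_resolve` and `PathRejected` are stand-in names, the root is passed as an argument rather than read from `DART_EVIDENCE_ROOT`, and the project's real function may differ in detail.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Raised when a requested path would escape the evidence root."""


def safe_resolve(evidence_root: Path, user_path: str) -> Path:
    """Resolve user_path under evidence_root, or raise PathRejected."""
    # Null bytes can truncate a path in lower-level APIs, so refuse them outright.
    if "\x00" in user_path:
        raise PathRejected("null byte in path")
    root = evidence_root.resolve()
    # Joining with an absolute path discards the root, so "/etc/passwd"
    # resolves outside and fails the containment test below.
    candidate = (root / user_path).resolve()
    # resolve() collapses ".." segments, so any traversal that escapes the
    # root ends up with a path that is not beneath it.
    if candidate != root and root not in candidate.parents:
        raise PathRejected(f"{user_path!r} resolves outside the evidence root")
    return candidate
```

Resolving first and checking containment afterwards covers `..` traversal and absolute paths with one comparison, instead of pattern-matching on the raw string.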

1000-attempt random-fuzz against destructive function names: 1000/1000 blocked (Architecture-first guarantee verified). See Architecture deep dive.
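The fuzz result follows from dispatching through an allowlist: a name that was never registered has nothing to call. The sketch below illustrates that pattern and a seeded fuzz loop against it. `ReadOnlyRegistry`, `fuzz_blocked`, the verb list, and the locally defined `ToolNotFound` are assumptions for this example, not dart_mcp's code.

```python
import random
import string


class ToolNotFound(LookupError):
    """Stand-in for the exception raised on an unregistered tool name."""


class ReadOnlyRegistry:
    """Dispatch only to explicitly registered, read-only tools."""

    def __init__(self, tools: dict):
        self._tools = dict(tools)

    def surface(self) -> frozenset:
        return frozenset(self._tools)

    def call(self, name: str, *args, **kwargs):
        try:
            tool = self._tools[name]
        except KeyError:
            raise ToolNotFound(name) from None
        return tool(*args, **kwargs)


def fuzz_blocked(registry: ReadOnlyRegistry, attempts: int = 1000, seed: int = 0) -> int:
    """Count how many randomly generated destructive-looking names are refused."""
    rng = random.Random(seed)
    verbs = ["execute", "delete", "write", "rm", "eval", "exec", "subprocess", "kill"]
    blocked = 0
    for _ in range(attempts):
        name = rng.choice(verbs) + "_" + "".join(rng.choices(string.ascii_lowercase, k=6))
        try:
            registry.call(name)
        except ToolNotFound:
            blocked += 1
    return blocked


# One illustrative read-only tool; the real surface is the documented function set.
registry = ReadOnlyRegistry({"get_process_tree": lambda: []})
```

The design point is that the guarantee does not depend on recognizing dangerous names: anything outside the registered surface fails the same way.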

v0.7.0 — case-11 supply-chain + ADCS ESC8 (newest Layer-1)

case-11 covers the SolarWinds-era attack class: a trojanized signed vendor binary entering as a routine update, then escalating via ADCS template misconfig (ESC8 — PetitPotam coercion → NTLM relay → certificate for DC01$ → PKINIT TGT → S4U2self DA impersonation → DCSync of KRBTGT → AdminSDHolder ACL persistence → Golden Ticket).

Aspect	Value
Layer	1 (synthetic, noise-injected ~1:30)
Findings	12
Functions exercised	7 (get_process_tree, analyze_windows_logons, analyze_kerberos_events, detect_lateral_movement, detect_credential_access, detect_defense_evasion, detect_exfiltration)
Provenance	RFC1918/RFC5737/RFC2606 synthetic; chain composed from public references (CISA AA20-352A, SpecterOps "Certified Pre-Owned", CVE-2021-36942); no cross-reference to any real environment
Differentiation from case-05	case-05 = stolen credential brute force + Kerberoasting; case-11 = supply-chain entry + ESC8 + PKINIT
Byte-stable expected output	yes — locked in case README "How to invoke" section

Reproduce with the bash block in examples/case-studies/case-11-supplychain-ad-zeroday/README.md.

v0.7.0 — case library at a glance

Layer	Cases	Findings	Notes
1	case-01 to case-07 + case-11	69	Synthetic, noise-injected ~1:30. `examples/sample-evidence-realistic/`
2	case-08 to case-10	30	External: NIST CFReDS, Ali Hadi Challenge 1, Digital Corpora M57
Total	11 cases	99 findings

After v0.7.1 ground-truth function-name reconciliation, 32 of 36 expected MCP functions are implemented (89% coverage). The remaining 4 are Phase 2 (parse_recycle_bin_metadata, parse_ie_history, parse_outlook_dbx, parse_usn_journal).

v0.5.4 — External benchmark: NIST CFReDS Hacking Case

The synthetic accuracy measured above is necessary but not sufficient. To answer the fair reviewer question — "what does it score on a dataset you didn't author?" — v0.5.4 integrates the NIST CFReDS Hacking Case (Greg Schardt / "Mr. Evil", image MD5 AEE4FCD9301C03B3B054623CA261959A) as examples/case-studies/case-08-cfreds-hacking-case/.

Honest accuracy disclosure

Of 10 sampled NIST ground-truth findings:

Version	Strict (full detection)	Lenient (full + partial)
v0.5.3	0.10 (1 of 10)	0.40 (4 of 10)
v0.5.4	0.50 (5 of 10)	0.80 (8 of 10)
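The two columns differ only in whether partial detections count. A minimal sketch of that arithmetic follows; the per-finding status labels are an assumed representation, and the counts of full, partial, and missed findings are derived from the table rather than from per-finding data.

```python
from collections import Counter


def strict_and_lenient(statuses: list[str]) -> tuple[float, float]:
    """Return (strict, lenient) recall for a list of per-finding statuses.

    strict  counts only "full" detections
    lenient counts "full" and "partial" detections
    """
    counts = Counter(statuses)
    total = len(statuses)
    strict = counts["full"] / total
    lenient = (counts["full"] + counts["partial"]) / total
    return strict, lenient


# Counts implied by the table: strict = full / 10, lenient = (full + partial) / 10.
v053 = ["full"] * 1 + ["partial"] * 3 + ["missed"] * 6
v054 = ["full"] * 5 + ["partial"] * 3 + ["missed"] * 2
```

Here `strict_and_lenient(v053)` gives (0.1, 0.4) and `strict_and_lenient(v054)` gives (0.5, 0.8), so the number of partial detections is 3 in both versions while full detections rise from 1 to 5.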

v0.5.4 added parse_registry_hive — a generic SOFTWARE/SYSTEM/SAM hive value extraction primitive built on python-registry. This single primitive unlocks 4 CFReDS findings (F-CFR-001 RegisteredOwner, F-CFR-004 NetworkCards, F-CFR-007 SAM\Domains\Users\Names, F-CFR-010 ShutdownTime) that were Phase 2 roadmap items in v0.5.3.

Reproduce with python3 scripts/measure_cfreds.py. Remaining gaps (F-CFR-006 IE6 index.dat, F-CFR-008 Recycle Bin, F-CFR-009 YARA bundling) are tracked as Phase 2 issues #53, #54, #55.

Why this matters more than the synthetic numbers

The drop from recall=1.0 (synthetic) to 0.50/0.80 (CFReDS) is not a regression. It is a paradigm gap honestly disclosed:

Synthetic accuracy measures correctness of the detection logic against IOCs the system claims to detect.
CFReDS accuracy measures how far dart-mcp still has to expand to cover a content-centric paradigm (registry values, browser history, Recycle Bin contents) that it is still building out.

External benchmarking converted "we should add registry parsing someday" into "registry parsing unblocks 4 of 10 measured findings, ship next." This is the value of third-party data — it changes which Phase 2 deliverable lands first.
