
Accuracy

Juwon1405 edited this page Jun 10, 2026 · 12 revisions

Reproducible accuracy measurement

How we measure detection accuracy, and how you can reproduce every number we publish.


The four metrics

| Metric | What it measures |
| --- | --- |
| Recall | % of ground-truth findings the agent surfaced |
| False positive rate | claims unsupported by the bundled evidence |
| Hallucination count | facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
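As a rough illustration of how the first three metrics could be computed from a list of agent claims, here is a minimal sketch. The `Claim` type, its field names, and the choice of "all claims" as the false-positive denominator are assumptions made for this example; scripts/measure_accuracy.py may define them differently.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; field names are illustrative only."""
    finding_id: str      # e.g. "F-001"
    supported: bool      # True if the bundled evidence backs the claim
    in_artifacts: bool   # False if the stated fact appears nowhere in the artifacts


def score(claims: List[Claim], ground_truth: Set[str]) -> dict:
    """Compute recall, false positive rate, and hallucination count."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one finding")
    surfaced = {c.finding_id for c in claims if c.supported}
    recall = len(surfaced & ground_truth) / len(ground_truth)
    unsupported = sum(1 for c in claims if not c.supported)
    # Assumption: the rate is taken over all claims the agent made.
    fpr = unsupported / len(claims) if claims else 0.0
    hallucinations = sum(1 for c in claims if not c.in_artifacts)
    return {
        "recall": recall,
        "false_positive_rate": fpr,
        "hallucination_count": hallucinations,
    }
```

With two supported claims for F-001 and F-013 scored against that same ground truth, this returns recall 1.0, false positive rate 0.0, and hallucination count 0.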

Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. dart_agent's serializer blocks claims without an audit_id at write time.
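The write-time check described above can be pictured as a guard in front of the serializer. This is a minimal sketch under assumptions: `serialize_claims`, `MissingAuditId`, and the dict-shaped claim are hypothetical names for illustration, not dart_agent's actual API.

```python
import json
from typing import List


class MissingAuditId(ValueError):
    """Raised when a claim does not cite the MCP call behind its evidence."""


def serialize_claims(claims: List[dict]) -> str:
    """Serialize claims to JSON, refusing any claim without an audit_id."""
    for index, claim in enumerate(claims):
        audit_id = claim.get("audit_id")
        # Reject missing, non-string, and blank identifiers alike.
        if not isinstance(audit_id, str) or not audit_id.strip():
            raise MissingAuditId(f"claim {index} has no audit_id; refusing to write")
    return json.dumps(claims, indent=2, sort_keys=True)
```

Failing before anything is written means an uncited claim never reaches the report, rather than being filtered out afterwards.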


How to reproduce

```shell
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"

# Variant A: deterministic reference (CI baseline; ≤30 lines/file)
python3 scripts/measure_accuracy.py

# Variant B: noise-injected realistic (~1:30 IOC:benign)
python3 scripts/measure_accuracy.py --variant realistic
```

Both variants score the same ground truth (F-001 and F-013 for the bundled find-evil-ref-01 case) and verify audit-chain integrity end to end. CI runs both variants on every commit. Most surfaces of the realistic tree are hand-curated at production volume. Only the two IOC-only logs (web access, unix auth) are enriched before scoring, with deterministic benign noise from scripts/generate_realistic_evidence.py, to demonstrate needle-in-haystack recall at ~1:30.
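Deterministic noise injection can be sketched with a seeded random generator. This is not the code in scripts/generate_realistic_evidence.py; the function name, the seed value, and the benign-line pool are assumptions for illustration.

```python
import random
from typing import List, Sequence


def inject_benign_noise(
    ioc_lines: Sequence[str],
    benign_pool: Sequence[str],
    ratio: int = 30,
    seed: int = 1405,
) -> List[str]:
    """Interleave IOC lines with `ratio` benign lines each, reproducibly.

    IOC lines keep their original relative order; only their positions
    within the enlarged file are chosen at random.
    """
    rng = random.Random(seed)  # fixed seed: same output on every run
    total = len(ioc_lines) * (ratio + 1)
    ioc_slots = set(rng.sample(range(total), len(ioc_lines)))
    remaining = iter(ioc_lines)
    return [
        next(remaining) if i in ioc_slots else rng.choice(benign_pool)
        for i in range(total)
    ]
```

A real generator would also need to keep timestamps monotonic; that detail is left out here. Note that Python only guarantees identical sequences for a given seed within the same interpreter version, so byte-stable output assumes a pinned Python.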


Two evidence variants — why both ship

A reasonable reviewer objects: "Recall=1.0 on a 30-line file is nearly meaningless, because every line is an IOC." That is fair. The repository ships two evidence sets so reviewers can score the agent on both:

| Variant | Path | Shape | Purpose |
| --- | --- | --- | --- |
| Reference | examples/sample-evidence/ | Web log 27 lines, security events 16, unix auth 17; fully IOC-loaded | Deterministic CI baseline. Stable hashes. Easy to debug. |
| Realistic | examples/sample-evidence-realistic/ | Hand-curated production volume (security EventLog ~11,530 lines, supply-chain, RDP brute, USB setupapi, etc.); the two IOC-only logs are enriched with benign noise: web access 1027 lines (1:37), unix auth 517 lines (1:29) | Demonstrates needle-in-haystack recall at ~1:30. Benign noise on the two IOC-only logs is generated deterministically by scripts/generate_realistic_evidence.py; all other evidence is committed hand-curated. |

Both variants score identically on the case-01 ground-truth findings (F-001, F-013) — recall=1.000, FPR=0.000, hallucination=0. This rules out the "small-input over-fit" failure mode for the measured bundled case (the agent did not simply match anything in the toy data).

External, third-party benchmarking (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS) is tracked as Phase 2 work — see issue #47 for the methodology and timeline. Phase 1 establishes the measurement methodology; Phase 2 operationalizes it on community-trusted datasets.


Datasets currently measured

1. Bundled IP-KVM remote-hands case (examples/sample-evidence/)

The headline accuracy run for the hackathon submission. Both variants score:

| Metric | Reference | Realistic |
| --- | --- | --- |
| Recall | 1.000 (2/2 ground-truth findings F-001, F-013) | 1.000 (same) |
| False positive rate | 0.000 | 0.000 |
| Hallucination count | 0 | 0 |
| Evidence integrity | preserved (62 files, all SHA-256 pre/post match) | preserved (67 files, all SHA-256 pre/post match) |
| Iterations to verdict | 5 | 5 |
| Total MCP calls | 14 | |
| Audit-chain verified | yes (SHA-256 chain unbroken) | |
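The evidence-integrity row boils down to hashing every input file before and after the run and comparing the two snapshots. A minimal sketch, with `hash_tree` and `changed_files` as hypothetical helper names rather than functions from the repository:

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def hash_tree(root: Path) -> Dict[str, str]:
    """Map each file under `root` (by relative path) to its SHA-256 digest."""
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digests[path.relative_to(root).as_posix()] = hashlib.sha256(
            path.read_bytes()
        ).hexdigest()
    return digests


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return paths that were modified, added, or removed between snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Taking one snapshot before invoking the agent and one after, an empty `changed_files` result corresponds to "all SHA-256 pre/post match". Reading whole files into memory is fine for small logs; large disk images would call for chunked hashing.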

Reproduces in seconds on a local SIFT VM. Live-mode runtime depends on model, artifact volume, network latency, and max-iteration settings.

2. Bundled sample evidence (synthetic, in examples/sample-evidence/)

Smaller scope — primarily exercises the dart-corr contradiction handling.

| Metric | Value |
| --- | --- |
| Recall | 1.000 (3/3: USB insertion, exec, contradiction) |
| False positive rate | 0.000 |
| Self-correction fired | yes (initial single-host hypothesis revised) |
| Confidence delta | 0.34 → 0.91 over 5 iterations |

Datasets in flight (Phase 1 polish)

| Dataset | Status |
| --- | --- |
| NIST CFReDS Hacking Case | Issue #17: methodology approved, run pending |
| Ali Hadi Memory Forensic Challenge #1 | Issue #16 |
| Digital Corpora M57 Patents | Issue #18 |
| Real auditd corpus (v0.4 Linux validation) | Issue #28 |
| Patrick Wardle macOS malware persistence corpus | Issue #29 |

When each lands, the accuracy report grows a new section with the same four metrics + the chain tail hash.
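For readers unfamiliar with the "chain tail hash", the general idea of a SHA-256 audit chain is that each entry's hash covers the previous entry's hash, so the final (tail) hash commits to the whole history. The sketch below shows that generic technique; the record layout, the all-zero genesis value, and the function names are assumptions, not the dart_audit format.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: List[dict], record: dict) -> None:
    """Append a record, linking it to the current tail."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify(chain: List[dict]) -> Optional[str]:
    """Return the tail hash if every link is intact, otherwise None."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return None
        prev = entry["hash"]
    return prev
```

Publishing only the tail hash alongside the metrics is enough for a reviewer to detect any later edit to an earlier audit entry, since such an edit changes every hash after it.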


What this report is not claiming

  • Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
  • Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
  • Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.

Bypass test results — the architectural guarantee

Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. Re-run on every commit:

| Test | Result |
| --- | --- |
| Surface is exactly the documented 36-function set | PASS |
| execute_shell raises ToolNotFound | PASS |
| eval, exec, subprocess_run all raise ToolNotFound | PASS |
| _safe_resolve rejects .. traversal | PASS |
| _safe_resolve rejects absolute paths outside DART_EVIDENCE_ROOT | PASS |
| _safe_resolve rejects null-byte truncation attacks | PASS |
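The three path checks in the table can be illustrated with a small containment test. This is a sketch of the general technique, not the project's `_safe_resolve`; the name `safe_resolve`, the `PathRejected` exception, and passing the evidence root as a parameter (rather than reading DART_EVIDENCE_ROOT) are assumptions.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Raised when a requested path would escape the evidence root."""


def safe_resolve(user_path: str, evidence_root: str) -> Path:
    """Resolve `user_path` under `evidence_root`, rejecting escapes."""
    if "\x00" in user_path:
        raise PathRejected("null byte in path")
    root = Path(evidence_root).resolve()
    # Joining with an absolute path discards the root, and resolve()
    # collapses ".." segments, so one containment check covers both cases.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathRejected(f"{user_path!r} resolves outside the evidence root")
    return candidate
```

Resolving before checking matters: comparing raw strings would let `logs/../../etc/passwd` through, while the resolved path makes the escape visible.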

1000-attempt random-fuzz against destructive function names: 1000/1000 blocked (Architecture-first guarantee verified). See Architecture deep dive.
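A fuzz test of this kind can be sketched against an allowlist-style registry, where any name that is not registered raises ToolNotFound. The registry contents, the stub tools, and the verb/target word lists below are invented for illustration and do not reflect the real 36-function surface or the project's fuzz corpus.

```python
import random


class ToolNotFound(LookupError):
    """Raised when a caller asks for a tool that is not on the allowlist."""


# Hypothetical registry: only read-only tools are ever registered.
READ_ONLY_TOOLS = {
    "get_process_tree": lambda: "stub",
    "analyze_windows_logons": lambda: "stub",
}


def call_tool(name: str, *args, **kwargs):
    """Dispatch to a registered tool; unknown names never execute anything."""
    try:
        tool = READ_ONLY_TOOLS[name]
    except KeyError:
        raise ToolNotFound(name) from None
    return tool(*args, **kwargs)


def fuzz_destructive_names(attempts: int = 1000, seed: int = 0) -> int:
    """Try random destructive-sounding names and count how many are blocked."""
    rng = random.Random(seed)
    verbs = ["execute", "delete", "write", "rm", "kill", "format"]
    targets = ["shell", "file", "disk", "process", "registry"]
    blocked = 0
    for _ in range(attempts):
        name = f"{rng.choice(verbs)}_{rng.choice(targets)}"
        try:
            call_tool(name)
        except ToolNotFound:
            blocked += 1
    return blocked
```

Because the boundary is an allowlist, a destructive name is blocked by absence rather than by a denylist that could miss a variant.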


v0.7.0 — case-11 supply-chain + ADCS ESC8 (newest Layer-1)

case-11 covers the SolarWinds-era attack class: a trojanized signed vendor binary enters as a routine update, then the attacker escalates through AD CS via ESC8 (NTLM relay to the certificate enrollment endpoint). The chain is: PetitPotam coercion → NTLM relay → certificate for DC01$ → PKINIT TGT → S4U2self DA impersonation → DCSync of KRBTGT → AdminSDHolder ACL persistence → Golden Ticket.

| Aspect | Value |
| --- | --- |
| Layer | 1 (synthetic, noise-injected ~1:30) |
| Findings | 12 |
| Functions exercised | 7 (get_process_tree, analyze_windows_logons, analyze_kerberos_events, detect_lateral_movement, detect_credential_access, detect_defense_evasion, detect_exfiltration) |
| Provenance | RFC1918/RFC5737/RFC2606 synthetic; chain composed from public references (CISA AA20-352A, SpecterOps "Certified Pre-Owned", CVE-2021-36942); zero cross-reference to real environment |
| Differentiation from case-05 | case-05 = stolen credential brute force + Kerberoasting; case-11 = supply-chain entry + ESC8 + PKINIT |
| Byte-stable expected output | yes, locked in case README "How to invoke" section |

Reproduce with the bash block in examples/case-studies/case-11-supplychain-ad-zeroday/README.md.


v0.7.0 — case library at a glance

| Layer | Cases | Findings | Notes |
| --- | --- | --- | --- |
| 1 | case-01 to case-07 + case-11 | 69 | Synthetic, noise-injected ~1:30. examples/sample-evidence-realistic/ |
| 2 | case-08 to case-10 | 30 | External: NIST CFReDS, Ali Hadi Challenge 1, Digital Corpora M57 |
| Total | 11 cases | 99 findings | |

After v0.7.1 ground-truth function-name reconciliation, 32 of 36 expected MCP functions are implemented (89% coverage). The remaining 4 are Phase 2 (parse_recycle_bin_metadata, parse_ie_history, parse_outlook_dbx, parse_usn_journal).


v0.5.4 — External benchmark: NIST CFReDS Hacking Case

The synthetic accuracy measured above is necessary but not sufficient. To answer the fair reviewer question — "what does it score on a dataset you didn't author?" — v0.5.4 integrates the NIST CFReDS Hacking Case (Greg Schardt / "Mr. Evil", image MD5 AEE4FCD9301C03B3B054623CA261959A) as examples/case-studies/case-08-cfreds-hacking-case/.

Honest accuracy disclosure

Of 10 sampled NIST ground-truth findings:

| Version | Strict (full detection) | Lenient (full + partial) |
| --- | --- | --- |
| v0.5.3 | 0.10 (1 of 10) | 0.40 (4 of 10) |
| v0.5.4 | 0.50 (5 of 10) | 0.80 (8 of 10) |
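The two columns differ only in whether partial detections count. A minimal sketch of that arithmetic follows; the status labels and the placeholder finding names are assumptions for illustration and do not say which CFReDS finding received which status in scripts/measure_cfreds.py.

```python
from collections import Counter
from typing import Dict, Tuple


def strict_and_lenient(statuses: Dict[str, str]) -> Tuple[float, float]:
    """Return (strict, lenient) recall from per-finding detection statuses.

    Each status is one of "full", "partial", or "missed". Strict counts
    only full detections; lenient also counts partial ones.
    """
    if not statuses:
        raise ValueError("at least one finding is required")
    counts = Counter(statuses.values())
    total = len(statuses)
    strict = counts["full"] / total
    lenient = (counts["full"] + counts["partial"]) / total
    return strict, lenient


# Placeholder sample shaped like the v0.5.4 row: 5 full, 3 partial, 2 missed.
sample = {f"finding-{i:02d}": "full" for i in range(1, 6)}
sample.update({f"finding-{i:02d}": "partial" for i in range(6, 9)})
sample.update({f"finding-{i:02d}": "missed" for i in range(9, 11)})
print(strict_and_lenient(sample))  # (0.5, 0.8)
```

Reporting both numbers keeps a partial detection from being silently rounded up to a hit or down to a miss.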

v0.5.4 added parse_registry_hive — a generic SOFTWARE/SYSTEM/SAM hive value extraction primitive built on python-registry. This single primitive unlocks 4 CFReDS findings (F-CFR-001 RegisteredOwner, F-CFR-004 NetworkCards, F-CFR-007 SAM\Domains\Users\Names, F-CFR-010 ShutdownTime) that were Phase 2 roadmap items in v0.5.3.

Reproduce with python3 scripts/measure_cfreds.py. Remaining gaps (F-CFR-006 IE6 index.dat, F-CFR-008 Recycle Bin, F-CFR-009 YARA bundling) are tracked as Phase 2 issues #53, #54, #55.

Why this matters more than the synthetic numbers

The drop from recall=1.0 (synthetic) to 0.50/0.80 (CFReDS) is not a regression. It is a paradigm gap honestly disclosed:

  • Synthetic accuracy measures correctness of the detection logic against IOCs the system claims to detect.
  • CFReDS accuracy measures expansion potential against a content-centric paradigm dart-mcp is still building out.

External benchmarking converted "we should add registry parsing someday" into "registry parsing unblocks 4 of 10 measured findings, ship next." This is the value of third-party data — it changes which Phase 2 deliverable lands first.

