Accuracy
How we measure detection accuracy, and how you can reproduce every number we publish.
| Metric | What it measures |
|---|---|
| Recall | % of ground-truth findings the agent surfaced |
| False positive rate | claims unsupported by the bundled evidence |
| Hallucination count | facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
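The first three metrics can be sketched in a few lines of Python. This is a minimal illustration, not the logic of scripts/measure_accuracy.py: the `Claim` type, its fields, and the choice of total claims as the false-positive denominator are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; field names are assumptions for this sketch."""
    finding_id: str   # e.g. "F-001"
    supported: bool   # backed by the bundled evidence?
    in_source: bool   # stated facts appear somewhere in the source artifacts?


def score(claims, ground_truth):
    """Compute recall, false positive rate, and hallucination count."""
    surfaced = {c.finding_id for c in claims if c.supported}
    recall = len(surfaced & ground_truth) / len(ground_truth) if ground_truth else 0.0
    # Assumption: the rate is unsupported claims divided by all claims made.
    unsupported = sum(1 for c in claims if not c.supported)
    fpr = unsupported / len(claims) if claims else 0.0
    hallucinations = sum(1 for c in claims if not c.in_source)
    return {
        "recall": recall,
        "false_positive_rate": fpr,
        "hallucination_count": hallucinations,
    }


claims = [Claim("F-001", True, True), Claim("F-013", True, True)]
result = score(claims, {"F-001", "F-013"})
print(result)
```

With both ground-truth findings surfaced and no unsupported claims, the sketch reports recall 1.0, a false positive rate of 0.0, and zero hallucinations.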
Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer.
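A write-time gate of this kind could look like the sketch below. The function name `serialize_claims`, the `MissingAuditId` exception, and the dictionary shape of a claim are hypothetical and chosen for illustration; the real dart_agent serializer may be structured differently.

```python
import json


class MissingAuditId(ValueError):
    """Raised when a claim carries no audit_id (hypothetical exception name)."""


def serialize_claims(claims):
    """Serialize claims to JSON, refusing any claim that lacks an audit_id."""
    for index, claim in enumerate(claims):
        if not claim.get("audit_id"):
            raise MissingAuditId(f"claim {index} has no audit_id; refusing to write")
    return json.dumps(claims, sort_keys=True)
```

Rejecting the whole batch instead of silently dropping the offending claim keeps the failure visible to whoever reviews the run.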
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"
# Variant A — deterministic reference (CI baseline; ≤30 lines/file)
python3 scripts/measure_accuracy.py
# Variant B — noise-injected realistic (~1:30 IOC:benign)
python3 scripts/measure_accuracy.py --variant realistic

Both variants score the same ground truth (F-001, F-013 for the bundled find-evil-ref-01 case) and verify the audit-chain integrity end-to-end. CI runs the reference variant on every commit; the realistic variant is regenerated by scripts/generate_realistic_evidence.py (deterministic seed) and demonstrates needle-in-haystack recall on production-shape volume.
A reasonable reviewer asks: "Recall=1.0 on a 30-line file is nearly meaningless — every line is an IOC." That is fair. The repository ships two evidence sets so reviewers can score the agent on both:
| Variant | Path | Shape | Purpose |
|---|---|---|---|
| Reference | examples/sample-evidence/ | Web log 27 lines, security events 16, unix auth 17; fully IOC-loaded | Deterministic CI baseline. Stable hashes. Easy to debug. |
| Realistic | examples/sample-evidence-realistic/ | Same IOCs + synthetic benign noise: web log 1027 lines (1:37), security events 516 (1:31), unix auth 517 (1:29) | Demonstrates needle-in-haystack recall on production-shape data. Generated deterministically by scripts/generate_realistic_evidence.py. |
Both variants score identically on the ground-truth findings: recall=1.000, FPR=0.000, hallucination=0. This rules out the "small-input over-fit" failure mode, in which the agent would score well only because nearly every line of the toy data is an IOC.
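Deterministic noise injection can be illustrated as follows. This is a sketch of the general idea and not the contents of scripts/generate_realistic_evidence.py; the function name `inject_noise`, the seed value, and the sample log lines are assumptions.

```python
import random


def inject_noise(ioc_lines, benign_pool, noise_count, seed=0):
    """Interleave benign lines with IOC lines, keeping IOC order intact.

    A fixed seed makes the output identical on every run, which is what
    keeps the file hashes of the generated evidence stable.
    """
    rng = random.Random(seed)
    total = len(ioc_lines) + noise_count
    # Pick which output positions hold IOC lines; the rest are noise.
    ioc_slots = set(rng.sample(range(total), len(ioc_lines)))
    iocs = iter(ioc_lines)
    return [
        next(iocs) if position in ioc_slots else rng.choice(benign_pool)
        for position in range(total)
    ]


ioc_lines = [f"IOC line {i}" for i in range(27)]
benign_pool = ["GET /index.html 200", "GET /favicon.ico 200"]
first = inject_noise(ioc_lines, benign_pool, noise_count=1000, seed=7)
second = inject_noise(ioc_lines, benign_pool, noise_count=1000, seed=7)
print(len(first), first == second)
```

Adding 1000 benign lines to the 27-line web log gives the 1027-line shape listed in the table, roughly one IOC line per 37 benign lines.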
External, third-party benchmarking (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS) is tracked as Phase 2 work — see issue #47 for the methodology and timeline. Phase 1 establishes the measurement methodology; Phase 2 operationalizes it on community-trusted datasets.
This is the headline accuracy run for the hackathon submission. Both variants score:
| Metric | Reference | Realistic |
|---|---|---|
| Recall | 1.000 (2/2 ground-truth findings F-001, F-013) | 1.000 (same) |
| False positive rate | 0.000 | 0.000 |
| Hallucination count | 0 | 0 |
| Evidence integrity | preserved (50 files, all SHA-256 pre/post match) | preserved (50 files) |
| Iterations to verdict | 5 | 5 |
| Total MCP calls | 14 | |
| Audit-chain verified | yes (SHA-256 chain unbroken) |
Reproduces in ~14 seconds on a SIFT VM, ~90 seconds in live mode with claude-sonnet-4.
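The evidence-integrity row compares SHA-256 digests of every input file taken before and after the run. A minimal sketch of that check is shown here; the helper names `snapshot` and `changed_files` are hypothetical, and the temporary directory stands in for the evidence root.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root):
    """Map each file under root (by relative path) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """Return paths that were added, removed, or modified between snapshots."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


with tempfile.TemporaryDirectory() as evidence_root:
    log = Path(evidence_root) / "auth.log"
    log.write_text("sshd: accepted publickey\n")
    before = snapshot(evidence_root)
    # The agent run would happen here; a read-only run leaves files untouched.
    untouched = changed_files(before, snapshot(evidence_root))
    log.write_text("tampered\n")
    tampered = changed_files(before, snapshot(evidence_root))
```

An empty list from `changed_files` corresponds to "preserved" in the table; any entry names a file whose digest no longer matches.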
Smaller scope — primarily exercises the dart-corr contradiction handling.
| Metric | Value |
|---|---|
| Recall | 1.000 (3/3 — USB insertion, exec, contradiction) |
| False positive rate | 0.000 |
| Self-correction fired | yes (initial single-host hypothesis revised) |
| Confidence delta | 0.34 → 0.91 over 5 iterations |
| Dataset | Status |
|---|---|
| NIST CFReDS Hacking Case | Issue #17 — methodology approved, run pending |
| Ali Hadi Memory Forensic Challenge #1 | Issue #16 |
| Digital Corpora M57 Patents | Issue #18 |
| Real auditd corpus (v0.4 Linux validation) | Issue #28 |
| Patrick Wardle macOS malware persistence corpus | Issue #29 |
When each lands, the accuracy report grows a new section with the same four metrics + the chain tail hash.
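For readers unfamiliar with the chain tail hash: in a SHA-256 chained log each entry's hash covers the previous entry's hash, so the final hash commits to the whole history. The sketch below shows the principle only. The entry layout, the all-zero genesis value, and the function names are assumptions, not the dart-audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def entry_hash(prev_hash, record):
    """Hash the previous entry's hash together with this record."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(chain, record):
    """Append a record, linking it to the current tail."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify(chain):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


def tail_hash(chain):
    """The single value that commits to every entry in the chain."""
    return chain[-1]["hash"] if chain else GENESIS
```

Publishing the tail hash alongside the metrics lets a reviewer confirm that a reproduced run produced the same sequence of audited calls.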
- Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
- Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
- Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.
Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. Re-run on every commit:
| Test | Result |
|---|---|
| Surface is exactly the documented 36-function set | PASS |
| execute_shell raises ToolNotFound | PASS |
| eval, exec, subprocess_run all raise ToolNotFound | PASS |
| _safe_resolve rejects .. traversal | PASS |
| _safe_resolve rejects absolute paths outside DART_EVIDENCE_ROOT | PASS |
| _safe_resolve rejects null-byte truncation attacks | PASS |
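The three path checks in the table can be illustrated with a short sketch. It is an independent example written for this page and not the project's _safe_resolve; the name `safe_resolve_sketch` and the `PathRejected` exception are hypothetical.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Raised when a requested path escapes the evidence root (hypothetical)."""


def safe_resolve_sketch(evidence_root, user_path):
    """Resolve user_path under evidence_root, rejecting anything that escapes it."""
    # Null bytes can truncate paths in lower layers, so reject them up front.
    if "\x00" in user_path:
        raise PathRejected("null byte in path")
    root = Path(evidence_root).resolve()
    # Joining with an absolute user_path discards root, so the containment
    # check below also catches absolute paths outside the root.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathRejected(f"{user_path!r} resolves outside the evidence root")
    return candidate
```

Resolving first and checking containment afterwards means that `..` segments are judged by where they actually lead instead of by string matching.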
1000-attempt random-fuzz against destructive function names: 1000/1000 blocked (Architecture-first guarantee verified). See Architecture deep dive.
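The idea behind the fuzz result is that a closed allowlist has nothing to dispatch for a name it never registered. The sketch below shows that property under stated assumptions: `ReadOnlyRegistry`, the destructive-verb list, and the seed are invented for the example, and only the `ToolNotFound` name and the get_process_tree function name come from this page.

```python
import random
import string


class ToolNotFound(KeyError):
    """Raised for any name that is not on the registered surface."""


class ReadOnlyRegistry:
    """Hypothetical allowlist: only explicitly registered tools are callable."""

    def __init__(self, tools):
        self._tools = dict(tools)

    def surface(self):
        return sorted(self._tools)

    def call(self, name, *args, **kwargs):
        try:
            tool = self._tools[name]
        except KeyError:
            raise ToolNotFound(name) from None
        return tool(*args, **kwargs)


def fuzz_destructive_names(registry, attempts=1000, seed=0):
    """Call randomly generated destructive-looking names; count the blocks."""
    rng = random.Random(seed)
    verbs = ["execute", "delete", "write", "rm", "exec", "eval", "subprocess"]
    blocked = 0
    for _ in range(attempts):
        suffix = "".join(rng.choices(string.ascii_lowercase, k=6))
        try:
            registry.call(f"{rng.choice(verbs)}_{suffix}")
        except ToolNotFound:
            blocked += 1
    return blocked


registry = ReadOnlyRegistry({"get_process_tree": lambda: []})
print(fuzz_destructive_names(registry))
```

Because none of the generated names is on the registered surface, every attempt raises `ToolNotFound`, independent of how the caller phrases the request.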
case-11 covers the SolarWinds-era attack class: a trojanized signed vendor binary entering as a routine update, then escalating via ADCS template misconfig (ESC8 — PetitPotam coercion → NTLM relay → certificate for DC01$ → PKINIT TGT → S4U2self DA impersonation → DCSync of KRBTGT → AdminSDHolder ACL persistence → Golden Ticket).
| Aspect | Value |
|---|---|
| Layer | 1 (synthetic, production-noise-injected) |
| Findings | 12 |
| Functions exercised | 7 (get_process_tree, analyze_windows_logons, analyze_kerberos_events, detect_lateral_movement, detect_credential_access, detect_defense_evasion, detect_exfiltration) |
| Provenance | RFC1918/RFC5737/RFC2606 synthetic; chain composed from public references (CISA AA20-352A, SpecterOps "Certified Pre-Owned", CVE-2021-36942) — zero cross-reference to real environment |
| Differentiation from case-05 | case-05 = stolen credential brute force + Kerberoasting; case-11 = supply-chain entry + ESC8 + PKINIT |
| Byte-stable expected output | yes — locked in case README "How to invoke" section |
Reproduce with the bash block in examples/case-studies/case-11-supplychain-ad-zeroday/README.md.
| Layer | Cases | Findings | Notes |
|---|---|---|---|
| 1 | case-01 to case-07 + case-11 | 69 | Synthetic, production-noise-injected. examples/sample-evidence-realistic/ |
| 2 | case-08 to case-10 | 30 | External: NIST CFReDS, Ali Hadi Challenge 1, Digital Corpora M57 |
| Total | 11 cases | 99 findings | |
After v0.7.1 ground-truth function-name reconciliation, 32 of 36 expected MCP functions are implemented (89% coverage). The remaining 4 are Phase 2 (parse_recycle_bin_metadata, parse_ie_history, parse_outlook_dbx, parse_usn_journal).
The synthetic accuracy measured above is necessary but not sufficient. To answer the fair reviewer question — "what does it score on a dataset you didn't author?" — v0.5.4 integrates the NIST CFReDS Hacking Case (Greg Schardt / "Mr. Evil", image MD5 AEE4FCD9301C03B3B054623CA261959A) as examples/case-studies/case-08-cfreds-hacking-case/.
Of 10 sampled NIST ground-truth findings:
| Version | Strict (full detection) | Lenient (full + partial) |
|---|---|---|
| v0.5.3 | 0.10 (1 of 10) | 0.40 (4 of 10) |
| v0.5.4 | 0.50 (5 of 10) | 0.80 (8 of 10) |
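The two columns differ only in whether partial detections count. A small sketch of that arithmetic follows; the status labels and the function name are assumptions, and the per-version breakdowns are derived from the table (strict counts full detections, lenient counts full plus partial).

```python
def recall_scores(statuses):
    """Return strict and lenient recall for a list of per-finding statuses.

    Each status is "full", "partial", or "missed" (labels assumed here).
    """
    total = len(statuses)
    full = sum(1 for status in statuses if status == "full")
    partial = sum(1 for status in statuses if status == "partial")
    return {"strict": full / total, "lenient": (full + partial) / total}


# Breakdown implied by the table: 1 full + 3 partial, then 5 full + 3 partial.
v053 = recall_scores(["full"] + ["partial"] * 3 + ["missed"] * 6)
v054 = recall_scores(["full"] * 5 + ["partial"] * 3 + ["missed"] * 2)
print(v053, v054)
```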
v0.5.4 added parse_registry_hive — a generic SOFTWARE/SYSTEM/SAM hive value extraction primitive built on python-registry. This single primitive unlocks 4 CFReDS findings (F-CFR-001 RegisteredOwner, F-CFR-004 NetworkCards, F-CFR-007 SAM\Domains\Users\Names, F-CFR-010 ShutdownTime) that were Phase 2 roadmap items in v0.5.3.
Reproduce with python3 scripts/measure_cfreds.py. Remaining gaps (F-CFR-006 IE6 index.dat, F-CFR-008 Recycle Bin, F-CFR-009 YARA bundling) are tracked as Phase 2 issues #53, #54, #55.
The drop from recall=1.0 (synthetic) to 0.50/0.80 (CFReDS) is not a regression. It is a paradigm gap honestly disclosed:
- Synthetic accuracy measures correctness of the detection logic against IOCs the system claims to detect.
- CFReDS accuracy measures expansion potential against a content-centric paradigm dart-mcp is still building out.
External benchmarking converted "we should add registry parsing someday" into "registry parsing unblocks 4 of 10 measured findings, ship next." This is the value of third-party data — it changes which Phase 2 deliverable lands first.
- examples/case-studies/case-08-cfreds-hacking-case/README.md: full case study
- Issue #52 closed: parse_registry_hive shipped
- Issues #53 / #54 / #55: remaining CFReDS gaps
Agentic-DART — autonomous DFIR agent · architecture-first, not prompt-first · MIT license · github.com/Juwon1405/agentic-dart