Accuracy
How we measure detection accuracy, and how you can reproduce every number we publish.
| Metric | What it measures |
|---|---|
| Recall | % of ground-truth findings the agent surfaced |
| False positive rate | claims unsupported by the bundled evidence |
| Hallucination count | facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
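The first three metrics in the table can be sketched as plain set arithmetic. This is a minimal illustration, not the logic in `scripts/measure_accuracy.py`: the `Claim` record, its fields, and the choice to express the false positive rate as a fraction of all claims are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; field names are illustrative only."""
    finding_id: str   # e.g. "F-001"
    supported: bool   # True if the bundled evidence supports the claim
    in_source: bool   # False if the fact appears in no source artifact


def score(claims, ground_truth):
    """Return recall, false positive rate, and hallucination count."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    # Recall: share of ground-truth findings surfaced by a supported claim.
    surfaced = {c.finding_id for c in claims if c.supported}
    recall = len(surfaced & set(ground_truth)) / len(ground_truth)
    # False positive rate (assumed denominator: all claims made).
    unsupported = sum(1 for c in claims if not c.supported)
    fpr = unsupported / len(claims) if claims else 0.0
    # Hallucinations: facts absent from every source artifact.
    hallucinations = sum(1 for c in claims if not c.in_source)
    return {
        "recall": recall,
        "false_positive_rate": fpr,
        "hallucination_count": hallucinations,
    }


if __name__ == "__main__":
    claims = [Claim("F-001", True, True), Claim("F-013", True, True)]
    print(score(claims, {"F-001", "F-013"}))
```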
For every claim the agent makes, it must cite the audit_id of the MCP call that produced the supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer.
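A write-time gate of this kind could look like the sketch below. It is not dart_agent's serializer: `serialize_claims`, `MissingAuditId`, and the dictionary shape of a claim are hypothetical, and the extra check that the cited `audit_id` belongs to a recorded MCP call is an assumption added for the example.

```python
import json


class MissingAuditId(ValueError):
    """Hypothetical error raised when a claim lacks a usable audit_id."""


def serialize_claims(claims, known_audit_ids):
    """Serialize claims to JSON, refusing any claim without a valid audit_id."""
    for claim in claims:
        audit_id = claim.get("audit_id")
        if not audit_id:
            raise MissingAuditId(f"claim {claim.get('text')!r} cites no audit_id")
        # Assumption: the serializer can see the ids of recorded MCP calls.
        if audit_id not in known_audit_ids:
            raise MissingAuditId(f"audit_id {audit_id!r} matches no recorded MCP call")
    return json.dumps(claims, sort_keys=True)


if __name__ == "__main__":
    recorded = {"a-1"}
    print(serialize_claims([{"text": "RDP brute force", "audit_id": "a-1"}], recorded))
    try:
        serialize_claims([{"text": "uncited claim"}], recorded)
    except MissingAuditId as err:
        print("blocked:", err)
```

Raising before any output is produced means a single uncited claim blocks the whole write, which is the stricter of the two reasonable designs.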
```bash
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"
# Score the canonical bundled evidence (no variant selector any more)
python3 scripts/measure_accuracy.py
```

The harness scores the ground truth (F-001, F-013 for the bundled
self-evaluation/case-01 case) and verifies audit-chain integrity end-to-end.
The canonical evidence root is hand-curated at production volume on most
surfaces; only the two IOC-only logs (web access, unix auth) are enriched with
deterministic benign noise by scripts/generate_realistic_evidence.py before
scoring, which demonstrates needle-in-haystack recall at ~1:30.
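Deterministic enrichment of an IOC-only log can be sketched with a seeded random generator. This is an illustration of the idea only, not the contents of `scripts/generate_realistic_evidence.py`: the function name, the seed value, and the simplified log line format (no timestamps) are all assumptions.

```python
import random


def enrich_with_noise(ioc_lines, noise_per_ioc=30, seed=1405):
    """Mix each IOC line with `noise_per_ioc` benign lines, reproducibly."""
    rng = random.Random(seed)  # fixed seed: same input gives same output
    paths = ["/", "/index.html", "/static/app.css", "/favicon.ico"]
    noise = [
        f'10.0.0.{rng.randint(2, 254)} - - "GET {rng.choice(paths)} HTTP/1.1" 200'
        for _ in range(len(ioc_lines) * noise_per_ioc)
    ]
    merged = noise + list(ioc_lines)
    rng.shuffle(merged)  # scatter the needles through the haystack
    return merged


if __name__ == "__main__":
    iocs = ['203.0.113.7 - - "GET /shell.php HTTP/1.1" 200']
    lines = enrich_with_noise(iocs)
    print(len(lines), "lines, IOC at index", lines.index(iocs[0]))
```

Because the generator is seeded rather than drawing from system entropy, two runs over the same IOC lines produce the same output, which is what keeps a noise-enriched log scoreable against fixed ground truth.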
A reasonable reviewer asks: "Recall=1.0 on a 30-line file is nearly meaningless — every line is an IOC." That is fair. The canonical evidence root is hand-curated at production volume so the measurement is not a small-input over-fit:
| Tree | Path | Shape | Purpose |
|---|---|---|---|
| Canonical bundled | `examples/case-studies/self-evaluation/case-01/evidence_root/` | Hand-curated production volume (security EventLog ~11,530 lines, supply-chain, RDP brute, USB setupapi, etc.); the two IOC-only logs are enriched with benign noise: web access 1027 (1:37), unix auth 517 (1:29) | The scored evidence. Needle-in-haystack recall at ~1:30. Benign noise generated deterministically by scripts/generate_realistic_evidence.py; all other evidence is committed hand-curated. |
| CI fixture | `examples/sample-evidence/` | Web log 27 lines, security events 16, unix auth 17, fully IOC-loaded | Small, byte-stable fixture used by the unit tests. Not a user-selectable evidence set. |
The canonical evidence scores recall=1.000, FPR=0.000, hallucination=0 on the
case-01 ground-truth findings (F-001, F-013) — ruling out the
"small-input over-fit" failure mode (the agent did not simply match anything in
toy data).
External, third-party benchmarking (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS) is tracked as Phase 2 work — see issue #47 for the methodology and timeline. Phase 1 establishes the measurement methodology; Phase 2 operationalizes it on community-trusted datasets.
The headline accuracy run for the hackathon submission, against the canonical bundled evidence:
| Metric | Value |
|---|---|
| Recall | 1.000 (2/2 ground-truth findings F-001, F-013) |
| False positive rate | 0.000 |
| Hallucination count | 0 |
| Evidence integrity | preserved (67 files, all SHA-256 pre/post match) |
| Iterations to verdict | 5 |
| Total MCP calls | 14 |
| Audit-chain verified | yes (SHA-256 chain unbroken) |
Reproduces in seconds on a local SIFT VM. Live-mode runtime depends on model, artifact volume, network latency, and max-iteration settings.
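The evidence-integrity row amounts to hashing every input file before and after the run and comparing the two snapshots. A minimal sketch, assuming the evidence lives under a single directory; `snapshot` and `changed_files` are hypothetical helper names, not functions from the repository.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digests[path.relative_to(root).as_posix()] = hashlib.sha256(
            path.read_bytes()
        ).hexdigest()
    return digests


def changed_files(before, after):
    """Return files that were added, removed, or modified between snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    before = snapshot(root)
    # ... the agent run would happen here ...
    after = snapshot(root)
    print(f"{len(before)} files, {len(changed_files(before, after))} changed")
```

Evidence is "preserved" in this sketch exactly when `changed_files` returns an empty list.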
Smaller scope — primarily exercises the dart-corr contradiction handling.
| Metric | Value |
|---|---|
| Recall | 1.000 (3/3 — USB insertion, exec, contradiction) |
| False positive rate | 0.000 |
| Self-correction fired | yes (initial single-host hypothesis revised) |
| Confidence delta | 0.34 → 0.91 over 5 iterations |
| Dataset | Status |
|---|---|
| NIST CFReDS Hacking Case | Issue #17 — methodology approved, run pending |
| Ali Hadi Memory Forensic Challenge #1 | Issue #16 |
| Digital Corpora M57 Patents | Issue #18 |
| Real auditd corpus (v0.4 Linux validation) | Issue #28 |
| Patrick Wardle macOS malware persistence corpus | Issue #29 |
When each lands, the accuracy report grows a new section with the same four metrics + the chain tail hash.
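For readers unfamiliar with hash-chained logs, the sketch below shows why a single tail hash can stand in for the whole audit trail: each entry's hash covers the previous entry's hash, so altering any earlier entry invalidates everything after it. The entry layout, the all-zero genesis value, and the function names are assumptions; dart-audit's actual record format may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def _digest(prev_hash, payload):
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain, payload):
    """Append a payload, link it to the previous entry, return the new tail hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    chain.append(entry)
    return entry["hash"]


def verify_chain(chain):
    """Recompute every link; any edited payload or broken link fails."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    chain = []
    append_entry(chain, {"audit_id": "a-1", "tool": "get_process_tree"})
    tail = append_entry(chain, {"audit_id": "a-2", "tool": "analyze_windows_logons"})
    print("tail hash:", tail, "verified:", verify_chain(chain))
```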
- Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
- Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
- Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.
Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. Re-run on every commit:
| Test | Result |
|---|---|
| Surface is exactly the documented read-only function set | PASS |
| `execute_shell` raises `ToolNotFound` | PASS |
| `eval`, `exec`, `subprocess_run` all raise `ToolNotFound` | PASS |
| `_safe_resolve` rejects `..` traversal | PASS |
| `_safe_resolve` rejects absolute paths outside `DART_EVIDENCE_ROOT` | PASS |
| `_safe_resolve` rejects null-byte truncation attacks | PASS |
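The three path checks in the table can be illustrated with a resolver along the following lines. This is a sketch of the technique, not the project's `_safe_resolve`: the function name `safe_resolve`, the `PathRejected` exception, and the explicit `evidence_root` parameter (standing in for `DART_EVIDENCE_ROOT`) are assumptions.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Hypothetical error for paths that escape the evidence root."""


def safe_resolve(user_path, evidence_root):
    """Resolve user_path under evidence_root, rejecting any escape."""
    # Null bytes can truncate paths in lower layers; refuse them outright.
    if "\x00" in user_path:
        raise PathRejected("null byte in path")
    root = Path(evidence_root).resolve()
    # Joining with an absolute user_path discards root, so the containment
    # check below also catches absolute paths outside the root.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathRejected(f"{user_path!r} resolves outside the evidence root")
    return candidate


if __name__ == "__main__":
    for attempt in ["logs/auth.log", "../../etc/passwd", "/etc/passwd", "a\x00.log"]:
        try:
            print("ok:", safe_resolve(attempt, "."))
        except PathRejected as err:
            print("rejected:", err)
```

Checking containment after `resolve()` rather than scanning the string for `..` means the same test covers traversal, absolute paths, and symlinks that point outside the root.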
A 1000-attempt random fuzz against destructive function names had 1000/1000 attempts blocked, which verifies the architecture-first guarantee. See Architecture deep dive.
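The property being fuzzed follows from dispatching through an allow-list: a name that is not registered cannot be called, however it is spelled. The sketch below assumes a dictionary registry and generates names from a hypothetical list of destructive verbs; the two registered names come from this page, while the stub handlers, `call_tool`, and the name generator are illustrative only.

```python
import random
import string


class ToolNotFound(LookupError):
    """Raised when a requested tool is not on the allow-list."""


def _stub(name):
    def handler(**kwargs):
        return {"tool": name, "args": kwargs}
    return handler


# Assumed registry: only read-only functions are ever registered.
READ_ONLY_TOOLS = {
    name: _stub(name) for name in ("get_process_tree", "analyze_windows_logons")
}


def call_tool(name, **kwargs):
    """Dispatch by exact name; anything unregistered raises ToolNotFound."""
    handler = READ_ONLY_TOOLS.get(name)
    if handler is None:
        raise ToolNotFound(name)
    return handler(**kwargs)


def fuzz_destructive_names(attempts=1000, seed=0):
    """Try randomly generated destructive-looking names; count how many are blocked."""
    rng = random.Random(seed)
    verbs = ["execute", "delete", "write", "rm", "eval", "exec", "subprocess"]
    blocked = 0
    for _ in range(attempts):
        suffix = "".join(
            rng.choice(string.ascii_lowercase) for _ in range(rng.randint(0, 6))
        )
        try:
            call_tool(f"{rng.choice(verbs)}_{suffix}")
        except ToolNotFound:
            blocked += 1
    return blocked


if __name__ == "__main__":
    print(f"{fuzz_destructive_names()}/1000 blocked")
```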
self-evaluation/case-08 covers the SolarWinds-era attack class: a trojanized signed vendor binary enters as a routine update, then escalates via an AD CS misconfiguration (ESC8: PetitPotam coercion → NTLM relay → certificate for DC01$ → PKINIT TGT → S4U2self DA impersonation → DCSync of KRBTGT → AdminSDHolder ACL persistence → Golden Ticket).
| Aspect | Value |
|---|---|
| Layer | 1 (synthetic, noise-injected ~1:30) |
| Findings | 12 |
| Functions exercised | 7 (get_process_tree, analyze_windows_logons, analyze_kerberos_events, detect_lateral_movement, detect_credential_access, detect_defense_evasion, detect_exfiltration) |
| Provenance | RFC1918/RFC5737/RFC2606 synthetic; chain composed from public references (CISA AA20-352A, SpecterOps "Certified Pre-Owned", CVE-2021-36942) — zero cross-reference to real environment |
| Differentiation from case-05 | case-05 = stolen credential brute force + Kerberoasting; case-08 = supply-chain entry + ESC8 + PKINIT |
| Byte-stable expected output | yes — locked in case README "How to invoke" section |
Reproduce with the bash block in examples/case-studies/self-evaluation/case-08/README.md.
| Layer | Cases | Findings | Notes |
|---|---|---|---|
| 1 | self-evaluation/case-01 to case-08 | 69 | Synthetic; case-01 ships the canonical `evidence_root/` |
| 2 | external-evaluation/case-01 to case-03 | 30 | External: NIST CFReDS, Ali Hadi Challenge 1, Digital Corpora M57 (Jo) |
| Total | 11 cases | 99 findings | |
After v0.7.1 ground-truth function-name reconciliation, 32 of 36 expected MCP functions are implemented (89% coverage). The remaining 4 are Phase 2 (parse_recycle_bin_metadata, parse_ie_history, parse_outlook_dbx, parse_usn_journal).
The synthetic accuracy measured above is necessary but not sufficient. To answer the fair reviewer question — "what does it score on a dataset you didn't author?" — v0.5.4 integrates the NIST CFReDS Hacking Case (Greg Schardt / "Mr. Evil", image MD5 AEE4FCD9301C03B3B054623CA261959A) as examples/case-studies/external-evaluation/case-01/.
Of 10 sampled NIST ground-truth findings:
| Version | Strict (full detection) | Lenient (full + partial) |
|---|---|---|
| v0.5.3 | 0.10 (1 of 10) | 0.40 (4 of 10) |
| v0.5.4 | 0.50 (5 of 10) | 0.80 (8 of 10) |
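The two columns differ only in whether partial detections count toward the numerator. A minimal sketch of that arithmetic, assuming each sampled finding is labelled `"full"`, `"partial"`, or `"missed"` (the labels and the function name are illustrative, not taken from `scripts/measure_cfreds.py`):

```python
def recall_scores(outcomes):
    """Compute strict and lenient recall from per-finding outcome labels."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no findings to score")
    full = sum(1 for o in outcomes if o == "full")
    partial = sum(1 for o in outcomes if o == "partial")
    return {
        "strict": full / total,               # full detections only
        "lenient": (full + partial) / total,  # full plus partial detections
    }


if __name__ == "__main__":
    # Counts matching the v0.5.4 row: 5 full, 3 partial, 2 missed of 10.
    v054 = ["full"] * 5 + ["partial"] * 3 + ["missed"] * 2
    print(recall_scores(v054))
```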
v0.5.4 added parse_registry_hive — a generic SOFTWARE/SYSTEM/SAM hive value extraction primitive built on python-registry. This single primitive unlocks 4 CFReDS findings (F-CFR-001 RegisteredOwner, F-CFR-004 NetworkCards, F-CFR-007 SAM\Domains\Users\Names, F-CFR-010 ShutdownTime) that were Phase 2 roadmap items in v0.5.3.
Reproduce with python3 scripts/measure_cfreds.py. Remaining gaps (F-CFR-006 IE6 index.dat, F-CFR-008 Recycle Bin, F-CFR-009 YARA bundling) are tracked as Phase 2 issues #53, #54, #55.
The drop from recall=1.0 (synthetic) to 0.50/0.80 (CFReDS) is not a regression. It is a paradigm gap honestly disclosed:
- Synthetic accuracy measures correctness of the detection logic against IOCs the system claims to detect.
- CFReDS accuracy measures expansion potential against a content-centric paradigm dart-mcp is still building out.
External benchmarking converted "we should add registry parsing someday" into "registry parsing unblocks 4 of 10 measured findings, ship next." This is the value of third-party data — it changes which Phase 2 deliverable lands first.
- `examples/case-studies/external-evaluation/case-01/README.md`: full case study
- Issue #52 closed: `parse_registry_hive` shipped
- Issues #53 / #54 / #55: remaining CFReDS gaps
Agentic-DART — autonomous DFIR agent · architecture-first, not prompt-first · MIT license · github.com/Juwon1405/agentic-dart