Accuracy

Juwon1405 edited this page May 9, 2026 · 12 revisions

Reproducible accuracy measurement

How we measure detection accuracy, and how you can reproduce every number we publish.


The four metrics

| Metric | What it measures |
|---|---|
| Recall | Fraction of ground-truth findings the agent surfaced |
| False positive rate | Share of the agent's claims that are unsupported by the bundled evidence |
| Hallucination count | Number of stated facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
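
The first three metrics reduce to simple counting once each claim has been labeled. The following is a minimal sketch of that arithmetic, not the code in scripts/measure_accuracy.py. The `Claim` type, its field names, and the choice of "all claims" as the false-positive denominator are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """Hypothetical shape of one labeled agent claim (illustrative only)."""
    finding_id: Optional[str]  # ground-truth ID the claim maps to, if any
    supported: bool            # backed by the bundled evidence
    in_artifacts: bool         # the stated fact appears in the source artifacts


def score(claims: list[Claim], ground_truth: set[str]) -> dict:
    """Compute recall, false positive rate, and hallucination count."""
    surfaced = {c.finding_id for c in claims if c.finding_id in ground_truth}
    unsupported = sum(1 for c in claims if not c.supported)
    return {
        "recall": len(surfaced) / len(ground_truth) if ground_truth else 0.0,
        # Assumed denominator: every claim the agent made.
        "false_positive_rate": unsupported / len(claims) if claims else 0.0,
        "hallucination_count": sum(1 for c in claims if not c.in_artifacts),
    }


claims = [
    Claim("F-001", supported=True, in_artifacts=True),
    Claim("F-013", supported=True, in_artifacts=True),
]
print(score(claims, {"F-001", "F-013"}))
```

With two supported claims that map to the two ground-truth IDs, the sketch reports a recall of 1.0, a false positive rate of 0.0, and zero hallucinations.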

Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. dart_agent's serializer blocks claims without an audit_id at write time.
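
A write-time gate of this kind could look roughly like the sketch below. It is an illustration under assumptions, not dart_agent's serializer: `MissingAuditId`, `serialize_claims`, and the dict layout are hypothetical names, and the extra check that the cited audit_id is one the audit log actually issued is an assumption beyond what the text states.

```python
import json


class MissingAuditId(ValueError):
    """Raised when a claim cannot be tied to an MCP call (hypothetical name)."""


def serialize_claims(claims: list[dict], known_audit_ids: set[str]) -> str:
    """Serialize claims to JSON, refusing any claim without a valid audit_id."""
    for index, claim in enumerate(claims):
        audit_id = claim.get("audit_id")
        if not audit_id:
            raise MissingAuditId(f"claim {index} has no audit_id")
        # Assumed extra check: the ID must come from a recorded MCP call.
        if audit_id not in known_audit_ids:
            raise MissingAuditId(f"claim {index} cites unknown audit_id {audit_id!r}")
    return json.dumps(claims, sort_keys=True)


report = serialize_claims(
    [{"text": "example claim", "audit_id": "a-1"}],
    known_audit_ids={"a-1"},
)
print(report)
```

Raising before anything is written means an uncited claim never reaches the report, rather than being filtered out afterwards.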


How to reproduce

```bash
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"

# Variant A: deterministic reference (CI baseline; ≤30 lines/file)
python3 scripts/measure_accuracy.py

# Variant B: noise-injected realistic (~1:30 IOC:benign)
python3 scripts/measure_accuracy.py --variant realistic
```

Both variants are scored against the same ground truth (F-001 and F-013 for the bundled find-evil-ref-01 case), and both verify audit-chain integrity end to end. CI runs the reference variant on every commit. The realistic variant is regenerated by scripts/generate_realistic_evidence.py with a deterministic seed and demonstrates needle-in-haystack recall at production-shape volume.


Two evidence variants — why both ship

A reasonable reviewer might object: "Recall=1.0 on a 30-line file is nearly meaningless, because every line is an IOC." That is fair. The repository therefore ships two evidence sets so reviewers can score the agent on both:

| Variant | Path | Shape | Purpose |
|---|---|---|---|
| Reference | examples/sample-evidence/ | Web log 27 lines, security events 16, unix auth 17; fully IOC-loaded | Deterministic CI baseline. Stable hashes. Easy to debug. |
| Realistic | examples/sample-evidence-realistic/ | Same IOCs + synthetic benign noise: web log 1027 lines (1:37), security events 516 (1:31), unix auth 517 (1:29) | Demonstrates needle-in-haystack recall on production-shape data. Generated deterministically by scripts/generate_realistic_evidence.py. |

Both variants score identically on the ground-truth findings: recall=1.000, FPR=0.000, hallucination=0. This rules out the "small-input over-fit" failure mode, in which the agent would score well only because nearly every line of the toy data is an IOC.
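
Deterministic noise injection of this kind can be sketched with a locally seeded random generator. This is not the implementation in scripts/generate_realistic_evidence.py. The function name `inject_noise`, the seed value, and the placeholder benign lines are all assumptions. The 27 + 1000 line counts mirror the web-log row in the table above.

```python
import random


def inject_noise(ioc_lines: list[str], benign_count: int, seed: int = 1405) -> list[str]:
    """Interleave IOC lines with synthetic benign lines, reproducibly.

    IOC lines keep their relative order, as they would in a real log.
    """
    rng = random.Random(seed)  # local generator: same seed, same output
    total = len(ioc_lines) + benign_count
    ioc_slots = set(rng.sample(range(total), len(ioc_lines)))
    remaining_iocs = iter(ioc_lines)
    mixed = []
    for position in range(total):
        if position in ioc_slots:
            mixed.append(next(remaining_iocs))
        else:
            # Placeholder benign content; a real generator would emit
            # plausible log lines in the target format.
            mixed.append(f"benign-event-{rng.randrange(10_000)}")
    return mixed


iocs = [f"ioc-{i}" for i in range(27)]
web_log = inject_noise(iocs, benign_count=1000)
print(len(web_log))
```

Because the generator is seeded and local to the function, repeated runs produce identical files. That is what lets a noisy variant keep stable hashes.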

External, third-party benchmarking (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS) is tracked as Phase 2 work — see issue #47 for the methodology and timeline. Phase 1 establishes the measurement methodology; Phase 2 operationalizes it on community-trusted datasets.


Datasets currently measured

1. Bundled IP-KVM remote-hands case (examples/sample-evidence/)

The headline accuracy run for the hackathon submission. Both variants score:

| Metric | Reference | Realistic |
|---|---|---|
| Recall | 1.000 (2/2 ground-truth findings F-001, F-013) | 1.000 (same) |
| False positive rate | 0.000 | 0.000 |
| Hallucination count | 0 | 0 |
| Evidence integrity | preserved (61 files, all SHA-256 pre/post match) | preserved |
| Iterations to verdict | 5 | 5 |
| Total MCP calls | 14 | |
| Audit-chain verified | yes (SHA-256 chain unbroken) | |

Reproduces in ~14 seconds on a SIFT VM, ~90 seconds in live mode with claude-sonnet-4.
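
The evidence-integrity row amounts to hashing every input file before the run and again afterwards, then comparing the two manifests. A minimal sketch follows, assuming hypothetical helpers `hash_tree` and `integrity_preserved`. The project's own check may be organized differently.

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        hasher = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large evidence files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        digests[path.relative_to(root).as_posix()] = hasher.hexdigest()
    return digests


def integrity_preserved(before: dict[str, str], after: dict[str, str]) -> bool:
    """True only if no file was added, removed, or modified."""
    return before == after
```

In use, `hash_tree(Path("examples/sample-evidence"))` would be called once before the agent runs and once after. Comparing whole manifests also catches added or deleted files, which a per-file comparison would miss.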

2. Bundled sample evidence (synthetic, in examples/sample-evidence/)

Smaller scope — primarily exercises the dart-corr contradiction handling.

| Metric | Value |
|---|---|
| Recall | 1.000 (3/3: USB insertion, exec, contradiction) |
| False positive rate | 0.000 |
| Self-correction fired | yes (initial single-host hypothesis revised) |
| Confidence delta | 0.34 → 0.91 over 5 iterations |

Datasets in flight (Phase 1 polish)

| Dataset | Status |
|---|---|
| NIST CFReDS Hacking Case | Issue #17: methodology approved, run pending |
| Ali Hadi Memory Forensic Challenge #1 | Issue #16 |
| Digital Corpora M57 Patents | Issue #18 |
| Real auditd corpus (v0.4 Linux validation) | Issue #28 |
| Patrick Wardle macOS malware persistence corpus | Issue #29 |

When each lands, the accuracy report gains a new section with the same four metrics plus the chain tail hash.
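
A chain tail hash is the last link of a hash chain, in which each record's hash covers the hash of the record before it. The sketch below illustrates the general technique only. The record layout, the `GENESIS` value, and the function names are assumptions, not the dart_audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous link together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    """Add a record whose hash commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": record_hash(prev, payload)})


def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edit to an earlier record breaks the chain."""
    prev = GENESIS
    for record in chain:
        if record["prev_hash"] != prev:
            return False
        if record["hash"] != record_hash(prev, record["payload"]):
            return False
        prev = record["hash"]
    return True


chain: list[dict] = []
for i in range(3):
    append(chain, {"audit_id": f"a-{i}", "tool": "read_log"})
print(verify(chain), chain[-1]["hash"][:16])
```

Publishing only the tail hash is enough for a reviewer to confirm they replayed the same sequence of calls, since that value depends on every earlier record.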


What this report is not claiming

  • Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
  • Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
  • Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.

Bypass test results — the architectural guarantee

Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. Re-run on every commit:

| Test | Result |
|---|---|
| Surface is exactly the documented 35-function set | PASS |
| `execute_shell` raises `ToolNotFound` | PASS |
| `eval`, `exec`, `subprocess_run` all raise `ToolNotFound` | PASS |
| `_safe_resolve` rejects `..` traversal | PASS |
| `_safe_resolve` rejects absolute paths outside `DART_EVIDENCE_ROOT` | PASS |
| `_safe_resolve` rejects null-byte truncation attacks | PASS |
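
The three `_safe_resolve` checks can be illustrated with a short path-confinement sketch. This is not the project's implementation. `safe_resolve` and `PathRejected` are hypothetical names, and the evidence root is passed as an argument here rather than read from `DART_EVIDENCE_ROOT`.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Raised when a requested path would leave the evidence root."""


def safe_resolve(evidence_root: Path, user_path: str) -> Path:
    """Resolve user_path under evidence_root, or raise PathRejected."""
    # Null bytes can truncate paths in lower-level APIs; refuse them outright.
    if "\x00" in user_path:
        raise PathRejected("null byte in path")
    root = evidence_root.resolve()
    # Joining with an absolute user_path discards root, and ".." segments are
    # collapsed by resolve(), so both cases fail the containment check below.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):  # requires Python 3.9+
        raise PathRejected(f"{user_path!r} escapes the evidence root")
    return candidate
```

Checking containment after resolution, rather than scanning the input string for `..`, also covers symlinks that point outside the root.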

A 1000-attempt random fuzz against destructive function names: 1000/1000 blocked, which verifies the architecture-first guarantee for the names tested. See Architecture deep dive.
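
A property like this follows from dispatching every call through an allowlist, so that an unregistered name has nothing to reach. The sketch below shows the idea with a seeded fuzz loop. `ToolNotFound` is the exception named in the table above, but `ToolRegistry`, `call`, `fuzz_blocked`, the verb list, and the single `read_log` tool are hypothetical and stand in for the documented 35-function surface.

```python
import random
import string


class ToolNotFound(KeyError):
    """Raised for any name outside the registered tool set."""


class ToolRegistry:
    def __init__(self, tools: dict):
        self._tools = dict(tools)  # fixed allowlist; nothing is resolved dynamically

    def call(self, name: str, *args, **kwargs):
        try:
            tool = self._tools[name]
        except KeyError:
            raise ToolNotFound(name) from None
        return tool(*args, **kwargs)


def fuzz_blocked(registry: ToolRegistry, attempts: int = 1000, seed: int = 0) -> int:
    """Call randomly generated destructive-sounding names; count rejections."""
    rng = random.Random(seed)
    verbs = ["delete", "write", "exec", "eval", "rm", "shell", "kill", "format"]
    blocked = 0
    for _ in range(attempts):
        suffix = "".join(rng.choices(string.ascii_lowercase, k=6))
        try:
            registry.call(f"{rng.choice(verbs)}_{suffix}")
        except ToolNotFound:
            blocked += 1
    return blocked


registry = ToolRegistry({"read_log": lambda path: f"contents of {path}"})
print(fuzz_blocked(registry), "of 1000 blocked")
```

None of the generated names can match the one registered tool, so every attempt in this sketch is rejected. The guarantee comes from the lookup table, not from a pattern filter on names.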

