Accuracy

Reproducible accuracy measurement

How we measure detection accuracy, and how you can reproduce every number we publish.

The four metrics

Metric	What it measures
Recall	% of ground-truth findings the agent surfaced
False positive rate	claims unsupported by the bundled evidence
Hallucination count	facts not present in the source artifacts at all
Evidence integrity	SHA-256 of every input file, before and after the run
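
As a rough sketch of how the first three metrics could be computed from a scored run: the `Claim` shape and the function names below are hypothetical and are not taken from `scripts/measure_accuracy.py`. The false positive rate is assumed here to be unsupported claims divided by all claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One agent claim (hypothetical shape, for illustration only)."""
    finding_id: str   # e.g. "F-001"
    supported: bool   # is the claim backed by the bundled evidence?
    in_source: bool   # does the fact appear anywhere in the source artifacts?


def recall(ground_truth: set[str], claims: list[Claim]) -> float:
    """Fraction of ground-truth findings the agent surfaced."""
    surfaced = {c.finding_id for c in claims if c.supported}
    return len(ground_truth & surfaced) / len(ground_truth)


def false_positive_rate(claims: list[Claim]) -> float:
    """Assumed definition: unsupported claims divided by all claims."""
    if not claims:
        return 0.0
    return sum(not c.supported for c in claims) / len(claims)


def hallucination_count(claims: list[Claim]) -> int:
    """Number of claims whose facts appear nowhere in the source artifacts."""
    return sum(not c.in_source for c in claims)


claims = [
    Claim("F-001", supported=True, in_source=True),
    Claim("F-013", supported=True, in_source=True),
]
truth = {"F-001", "F-013"}
print(recall(truth, claims), false_positive_rate(claims), hallucination_count(claims))
```

With both ground-truth findings surfaced and no unsupported claims, this prints `1.0 0.0 0`.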

Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer.
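
The write-time rule can be illustrated with a minimal sketch. This is not dart_agent's actual serializer: `serialize_claims` and `MissingAuditId` are hypothetical names, and claims are assumed to be plain dictionaries.

```python
import json


class MissingAuditId(ValueError):
    """Raised when a claim has no audit_id to cite (hypothetical name)."""


def serialize_claims(claims: list[dict]) -> str:
    """Serialize claims to JSON, refusing any claim that lacks an audit_id."""
    for index, claim in enumerate(claims):
        audit_id = claim.get("audit_id")
        # An empty or non-string audit_id is treated the same as a missing one.
        if not isinstance(audit_id, str) or not audit_id.strip():
            raise MissingAuditId(f"claim {index} has no audit_id; refusing to write")
    return json.dumps(claims, sort_keys=True)
```

Validating every claim before anything is written means a single uncited claim stops the whole report, rather than being silently dropped.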

How to reproduce

git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"

# Variant A — deterministic reference (CI baseline; ≤30 lines/file)
python3 scripts/measure_accuracy.py

# Variant B — noise-injected realistic (~1:30 IOC:benign)
python3 scripts/measure_accuracy.py --variant realistic

Both variants are scored against the same ground truth (F-001 and F-013 for the bundled find-evil-ref-01 case) and verify audit-chain integrity end to end. CI runs the reference variant on every commit. The realistic variant is regenerated by scripts/generate_realistic_evidence.py with a deterministic seed and demonstrates needle-in-haystack recall at production-shape volume.
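
Deterministic noise injection can be sketched as follows. This is an illustration of the idea only, not the contents of `scripts/generate_realistic_evidence.py`: the function name, the placeholder benign lines, and the seed value are all assumptions.

```python
import random


def inject_noise(ioc_lines: list[str], benign_count: int, seed: int = 0) -> list[str]:
    """Mix IOC lines with synthetic benign lines in a reproducible order."""
    # A local, seeded RNG gives the same output on every run
    # and leaves the global random state untouched.
    rng = random.Random(seed)
    benign = [f"benign-event-{i:04d}" for i in range(benign_count)]
    mixed = ioc_lines + benign  # new list; the caller's input is not mutated
    rng.shuffle(mixed)
    return mixed
```

Because the seed is fixed, regenerating the evidence yields identical files, so their hashes stay stable between runs.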

Two evidence variants — why both ship

A reasonable reviewer asks: "Recall=1.0 on a 30-line file is nearly meaningless — every line is an IOC." That is fair. The repository ships two evidence sets so reviewers can score the agent on both:

Variant	Path	Shape	Purpose
Reference	`examples/sample-evidence/`	Web log 27 lines, security events 16, unix auth 17 — fully IOC-loaded	Deterministic CI baseline. Stable hashes. Easy to debug.
Realistic	`examples/sample-evidence-realistic/`	Same IOCs + synthetic benign noise: web log 1027 lines (1:37), security events 516 (1:31), unix auth 517 (1:29)	Demonstrates needle-in-haystack recall on production-shape data. Generated deterministically by `scripts/generate_realistic_evidence.py`.
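
The ratios in the Realistic row follow from the line counts in the table, assuming every line of the reference file counts as an IOC line (the reference set is described as fully IOC-loaded). A short check, with a hypothetical helper name:

```python
# (IOC lines in the reference file, total lines in the realistic file)
shapes = {
    "web log": (27, 1027),
    "security events": (16, 516),
    "unix auth": (17, 517),
}


def benign_per_ioc(ioc_lines: int, total_lines: int) -> int:
    """Benign lines per IOC line, rounded to the nearest whole number."""
    return round((total_lines - ioc_lines) / ioc_lines)


for name, (ioc, total) in shapes.items():
    print(f"{name}: 1:{benign_per_ioc(ioc, total)}")
```

This prints 1:37, 1:31 and 1:29, matching the table.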

Both variants score identically on the ground-truth findings: recall=1.000, FPR=0.000, hallucination=0. This rules out the "small-input over-fit" failure mode, where the agent would score well simply by matching anything in the toy data.

External, third-party benchmarking (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS) is tracked as Phase 2 work — see issue #47 for the methodology and timeline. Phase 1 establishes the measurement methodology; Phase 2 operationalizes it on community-trusted datasets.

Datasets currently measured

1. Bundled IP-KVM remote-hands case (`examples/sample-evidence/`)

The headline accuracy run for the hackathon submission. Both variants score:

Metric	Reference	Realistic
Recall	1.000 (2/2 ground-truth findings F-001, F-013)	1.000 (same)
False positive rate	0.000	0.000
Hallucination count	0	0
Evidence integrity	preserved (61 files, all SHA-256 pre/post match)	preserved
Iterations to verdict	5	5
Total MCP calls	14
Audit-chain verified	yes (SHA-256 chain unbroken)

The run reproduces in ~14 seconds on a SIFT VM, and in ~90 seconds in live mode with claude-sonnet-4.
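
The evidence-integrity row compares SHA-256 digests taken before and after the run. A minimal sketch of that comparison, with hypothetical function names and no claim to match the project's own implementation:

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Files that were added, removed, or modified between two snapshots."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Taking `hash_tree` once before the agent runs and once after, an empty `changed_files` result corresponds to "preserved" in the table.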

2. Bundled sample evidence (synthetic, in `examples/sample-evidence/`)

Smaller scope — primarily exercises the dart-corr contradiction handling.

Metric	Value
Recall	1.000 (3/3 — USB insertion, exec, contradiction)
False positive rate	0.000
Self-correction fired	yes (initial single-host hypothesis revised)
Confidence delta	0.34 → 0.91 over 5 iterations

Datasets in flight (Phase 1 polish)

Dataset	Status
NIST CFReDS Hacking Case	Issue #17 — methodology approved, run pending
Ali Hadi Memory Forensic Challenge #1	Issue #16
Digital Corpora M57 Patents	Issue #18
Real auditd corpus (v0.4 Linux validation)	Issue #28
Patrick Wardle macOS malware persistence corpus	Issue #29

When each lands, the accuracy report grows a new section with the same four metrics plus the chain tail hash.
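
For readers unfamiliar with hash chains, the idea behind a "chain tail hash" can be sketched like this. The record layout, the genesis value, and the function names are assumptions for illustration; the actual dart_audit format may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first link


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def build_chain(records: list[dict]) -> list[dict]:
    """Link records so that each entry commits to everything before it."""
    chain, prev = [], GENESIS
    for record in records:
        prev = entry_hash(prev, record)
        chain.append({"record": record, "hash": prev})
    return chain


def verify_chain(chain: list[dict]):
    """Return the tail hash if every link checks out, otherwise None."""
    prev = GENESIS
    for entry in chain:
        if entry_hash(prev, entry["record"]) != entry["hash"]:
            return None
        prev = entry["hash"]
    return prev
```

Because each hash covers the previous one, publishing only the tail hash is enough to detect an edit to any earlier entry.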

What this report is not claiming

Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.

Bypass test results — the architectural guarantee

Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. It is re-run on every commit:

Test	Result
Surface is exactly the documented 35-function set	PASS
`execute_shell` raises `ToolNotFound`	PASS
`eval`, `exec`, `subprocess_run` all raise `ToolNotFound`	PASS
`_safe_resolve` rejects `..` traversal	PASS
`_safe_resolve` rejects absolute paths outside `DART_EVIDENCE_ROOT`	PASS
`_safe_resolve` rejects null-byte truncation attacks	PASS
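
The three `_safe_resolve` checks can be illustrated with a small stand-in. This is a sketch of the technique under stated assumptions, not the project's `_safe_resolve`: the function signature and the `PathRejected` exception are hypothetical, and the root is passed as an argument rather than read from `DART_EVIDENCE_ROOT`.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Raised when a requested path escapes the evidence root (hypothetical name)."""


def safe_resolve(evidence_root: Path, requested: str) -> Path:
    """Resolve a requested path and confirm it stays inside the evidence root."""
    # Null bytes can truncate paths in lower-level APIs, so refuse them outright.
    if "\x00" in requested:
        raise PathRejected("null byte in path")
    root = evidence_root.resolve()
    # Joining an absolute path discards the root, so the containment check
    # below also covers absolute paths outside the evidence root.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):  # requires Python 3.9+
        raise PathRejected(f"{requested!r} resolves outside the evidence root")
    return candidate
```

Resolving before checking containment is what defeats `..` traversal: the comparison is made on the final normalized path, not on the string the caller supplied.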

1000-attempt random-fuzz against destructive function names: 1000/1000 blocked (Architecture-first guarantee verified). See Architecture deep dive.
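
A fuzz run of this kind can be approximated with an allowlist registry. Everything below is a hypothetical stand-in: the registry contents, `call_tool`, the name generator, and this local `ToolNotFound` class are assumptions, and the real surface is the documented 35-function set.

```python
import random
import string


class ToolNotFound(LookupError):
    """Stand-in for the error raised when a tool name is not on the surface."""


def count_lines(text: str) -> int:
    """Example read-only tool: count the lines in a string."""
    return len(text.splitlines())


# Allowlist: only names registered here are callable at all.
READ_ONLY_TOOLS = {"count_lines": count_lines}


def call_tool(name: str, *args, **kwargs):
    """Dispatch by exact name; anything off the allowlist raises ToolNotFound."""
    try:
        tool = READ_ONLY_TOOLS[name]
    except KeyError:
        raise ToolNotFound(name) from None
    return tool(*args, **kwargs)


def fuzz_blocked(attempts: int = 1000, seed: int = 0) -> int:
    """Count how many randomly generated destructive-looking names are blocked."""
    rng = random.Random(seed)
    verbs = ["execute", "eval", "exec", "delete", "write", "subprocess", "rm"]
    blocked = 0
    for _ in range(attempts):
        suffix = "".join(rng.choices(string.ascii_lowercase, k=6))
        try:
            call_tool(f"{rng.choice(verbs)}_{suffix}")
        except ToolNotFound:
            blocked += 1
    return blocked
```

The design point is that destructive functions are not filtered out at call time. They are simply never registered, so there is nothing to bypass.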
