Accuracy

Reproducible accuracy measurement

How we measure detection accuracy, and how you can reproduce every number we publish.

The four metrics

Metric	What it measures
Recall	Share of ground-truth findings the agent surfaced
False positive rate	Share of agent claims unsupported by the bundled evidence
Hallucination count	facts not present in the source artifacts at all
Evidence integrity	SHA-256 of every input file, before and after the run
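
As a rough illustration of how the first three metrics could be computed, here is a minimal sketch. It is not the logic in `scripts/measure_accuracy.py`; the `Claim` fields, the `score` function, and the choice of "all claims" as the false-positive denominator are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    # All field names are hypothetical, chosen for this sketch only.
    finding_id: str   # identifier such as "F-001"
    supported: bool   # True if the cited evidence backs the claim
    in_source: bool   # False if the fact appears in no source artifact


def score(claims: list, ground_truth: set) -> dict:
    """Compute recall, false positive rate, and hallucination count."""
    # Only supported claims count as surfaced findings (an assumption).
    surfaced = {c.finding_id for c in claims if c.supported}
    recall = len(surfaced & ground_truth) / len(ground_truth) if ground_truth else 0.0

    # Assumed denominator: every claim the agent made.
    unsupported = sum(1 for c in claims if not c.supported)
    fp_rate = unsupported / len(claims) if claims else 0.0

    # Hallucinations are reported as a raw count, not a rate.
    hallucinations = sum(1 for c in claims if not c.in_source)

    return {
        "recall": recall,
        "false_positive_rate": fp_rate,
        "hallucination_count": hallucinations,
    }
```

Under these assumptions, a run that surfaces every ground-truth finding with no unsupported or invented claims scores 1.0, 0.0, and 0.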

Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer.
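
The write-time check can be pictured with a small sketch like the one below. It is not dart_agent's actual serializer: `serialize_claims`, `MissingAuditId`, and the dict-shaped claims are hypothetical names used only to show the idea of refusing to write a claim that lacks an `audit_id`.

```python
import json


class MissingAuditId(ValueError):
    """Raised when a claim carries no usable audit_id (hypothetical name)."""


def serialize_claims(claims: list) -> str:
    """Serialize claims to JSON, refusing any claim without an audit_id."""
    for index, claim in enumerate(claims):
        audit_id = claim.get("audit_id")
        # Reject missing, non-string, and blank identifiers alike.
        if not isinstance(audit_id, str) or not audit_id.strip():
            raise MissingAuditId(f"claim {index} has no audit_id; refusing to write")
    # Nothing is emitted unless every claim passed the check.
    return json.dumps(claims, indent=2, sort_keys=True)
```

Validating the whole batch before producing any output means a single uncited claim blocks the write rather than being silently dropped.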

How to reproduce

git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"
python3 scripts/measure_accuracy.py

The script re-runs the bundled find-evil-ref-01 case, regenerates the metrics in docs/accuracy-report.md, and verifies the audit-chain integrity against the committed reference. If the agent's findings drift from the ground truth (F-001, F-013 for the bundled case), CI fails.
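
For readers unfamiliar with hash-chained audit logs, the following is a minimal sketch of how such a chain can be built and verified. The entry layout, the all-zero genesis value, and the function names are assumptions for illustration; the project's real chain format may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a payload, linking it to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


def tail_hash(chain: list) -> str:
    """The last hash summarizes the whole chain; compare it to a reference."""
    return chain[-1]["hash"] if chain else GENESIS
```

Because each hash covers its predecessor, comparing only the tail hash against a committed reference is enough to detect a change anywhere in the log.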

Datasets currently measured

1. Bundled IP-KVM remote-hands case (in `examples/`)

The headline accuracy run for the hackathon submission.

Metric	Value
Recall	1.000 (12/12 ground-truth findings)
False positive rate	0.000 (0 unsupported claims)
Hallucination count	0
Evidence integrity	preserved (8 files, all SHA-256 hashes match pre/post)
Iterations to verdict	5
Total MCP calls	14
Audit-chain verified	yes (SHA-256 chain unbroken)

The run reproduces in ~14 seconds on a SIFT VM, or ~90 seconds in live mode with claude-sonnet-4.
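
The evidence-integrity row amounts to hashing every input file before the run and again afterwards. A minimal sketch of that comparison follows; the function names are hypothetical, and this is not the project's own implementation.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large evidence files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before: dict, after: dict) -> list:
    """List files that were added, removed, or modified between snapshots."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
```

Taking `snapshot` before and after a run and checking that `changed_files` returns an empty list corresponds to the "all SHA-256 hashes match pre/post" result above.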

2. Bundled sample evidence (synthetic, in `examples/sample-evidence/`)

Smaller in scope; primarily exercises the dart-corr contradiction handling.

Metric	Value
Recall	1.000 (3/3: USB insertion, exec, contradiction)
False positive rate	0.000
Self-correction fired	yes (initial single-host hypothesis revised)
Confidence delta	0.34 → 0.91 over 5 iterations

Datasets in flight (Phase 1 polish)

Dataset	Status
NIST CFReDS Hacking Case	Issue #17 — methodology approved, run pending
Ali Hadi Memory Forensic Challenge #1	Issue #16
Digital Corpora M57 Patents	Issue #18
Real auditd corpus (v0.4 Linux validation)	Issue #28
Patrick Wardle macOS malware persistence corpus	Issue #29

As each dataset lands, the accuracy report gains a new section with the same four metrics plus the chain tail hash.

What this report is not claiming

Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches only on cases with mechanically verifiable ground truth.
Not claiming zero false negatives in adversarial settings, only against the documented test corpus.
Not claiming production-readiness. This is a hackathon submission demonstrating that the architecture is correct and the loop is sound. Hardening for production is on the Phase 2-3 roadmap.

Bypass test results — the architectural guarantee

Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. It is re-run on every commit:

Test	Result
Surface is exactly the documented 35-function set	PASS
`execute_shell` raises `ToolNotFound`	PASS
`eval`, `exec`, `subprocess_run` all raise `ToolNotFound`	PASS
`_safe_resolve` rejects `..` traversal	PASS
`_safe_resolve` rejects absolute paths outside `DART_EVIDENCE_ROOT`	PASS
`_safe_resolve` rejects null-byte truncation attacks	PASS
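
The three path checks in the table can be sketched as follows. This is an illustrative stand-in, not the project's `_safe_resolve`: the function name `safe_resolve`, the `PathRejected` exception, and passing the evidence root as an argument (rather than reading `DART_EVIDENCE_ROOT`) are all assumptions.

```python
from pathlib import Path


class PathRejected(ValueError):
    """Raised when a requested path escapes the evidence root (hypothetical)."""


def safe_resolve(evidence_root: Path, user_path: str) -> Path:
    """Resolve user_path and require the result to stay under evidence_root."""
    # Null bytes can truncate paths in lower-level APIs, so reject them first.
    if "\x00" in user_path:
        raise PathRejected("null byte in path")

    root = evidence_root.resolve()
    # Joining with an absolute user_path discards root, so the containment
    # check below also covers absolute paths outside the evidence root.
    candidate = (root / user_path).resolve()

    # resolve() has already collapsed any ".." segments and symlinks.
    if candidate != root and root not in candidate.parents:
        raise PathRejected(f"path escapes evidence root: {user_path!r}")
    return candidate
```

Checking containment after resolution, rather than scanning the string for `..`, is the design choice that lets one test cover traversal, absolute paths, and symlink escapes together.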

A 1000-attempt random fuzz against destructive function names: 1000/1000 blocked (architecture-first guarantee verified). See Architecture deep dive.
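
One way to picture why the fuzz result holds is an allow-list dispatcher: a name that was never registered cannot be called at all. The sketch below assumes a hypothetical `ToolRegistry` class, a stand-in `ToolNotFound` exception, and a hypothetical `fuzz_blocked` helper; it does not reproduce the project's MCP surface or its fuzz harness.

```python
import random
import string


class ToolNotFound(KeyError):
    """Stand-in for the error raised on any name outside the allow-list."""


class ToolRegistry:
    """Dispatch only to explicitly registered, read-only functions."""

    def __init__(self, tools: dict):
        self._tools = dict(tools)

    def surface(self) -> frozenset:
        """The exact set of callable names, for comparison with the docs."""
        return frozenset(self._tools)

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotFound(name)
        return self._tools[name](*args, **kwargs)


def fuzz_blocked(registry: ToolRegistry, attempts: int = 1000, seed: int = 0) -> int:
    """Call randomly generated destructive-sounding names; count rejections."""
    rng = random.Random(seed)
    verbs = ["delete", "write", "exec", "rm", "format", "kill", "eval", "shell"]
    blocked = 0
    for _ in range(attempts):
        suffix = "".join(rng.choices(string.ascii_lowercase, k=6))
        try:
            registry.call(f"{rng.choice(verbs)}_{suffix}")
        except ToolNotFound:
            blocked += 1
    return blocked
```

With this shape, the guarantee comes from the absence of a code path to unregistered functions rather than from filtering known-bad names.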
