
Accuracy

Juwon1405 edited this page May 8, 2026 · 12 revisions

Reproducible accuracy measurement

How we measure detection accuracy, and how you can reproduce every number we publish.


The four metrics

| Metric | What it measures |
| --- | --- |
| Recall | Fraction of ground-truth findings the agent surfaced (reported on a 0 to 1 scale) |
| False positive rate | Claims unsupported by the bundled evidence |
| Hallucination count | Facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
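As a rough illustration of how the first three metrics could be computed, here is a minimal sketch. It is not the code in scripts/measure_accuracy.py. The `Claim` record, its fields, and the choice of "all claims" as the false-positive denominator are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    # All fields are hypothetical; the real claim schema may differ.
    finding_id: str   # e.g. "F-001"
    supported: bool   # True if the bundled evidence backs the claim
    in_source: bool   # False if the fact appears in no source artifact


def score(claims, ground_truth):
    """Return recall, false positive rate, and hallucination count."""
    surfaced = {c.finding_id for c in claims if c.supported}
    recall = len(surfaced & ground_truth) / len(ground_truth)
    unsupported = [c for c in claims if not c.supported]
    # Assumption: the rate is unsupported claims over all claims made.
    fp_rate = len(unsupported) / len(claims) if claims else 0.0
    hallucinations = sum(1 for c in claims if not c.in_source)
    return {
        "recall": recall,
        "false_positive_rate": fp_rate,
        "hallucination_count": hallucinations,
    }
```

Under these assumptions, a run that surfaces every ground-truth finding and makes no unsupported claims scores recall 1.0, false positive rate 0.0, and hallucination count 0.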

Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. dart_agent's serializer blocks any claim without an audit_id at write time.
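The write-time check can be pictured with a small sketch. This is not dart_agent's actual serializer: the function name, the exception type, and the dict-based claim shape are all assumptions chosen for illustration.

```python
import json


class MissingAuditId(ValueError):
    """Hypothetical error raised when a claim cites no audit_id."""


def serialize_claims(claims):
    """Serialize claims to JSON, refusing any claim without an audit_id."""
    for index, claim in enumerate(claims):
        audit_id = claim.get("audit_id")
        # Reject missing, non-string, or blank identifiers before writing.
        if not isinstance(audit_id, str) or not audit_id.strip():
            raise MissingAuditId(f"claim {index} has no audit_id; refusing to write")
    return json.dumps(claims, indent=2, sort_keys=True)
```

Because the check runs before any output is produced, one uncited claim prevents the whole report from being written rather than being silently dropped.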


How to reproduce

```shell
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"
python3 scripts/measure_accuracy.py
```

The script re-runs the bundled find-evil-ref-01 case, regenerates the metrics in docs/accuracy-report.md, and verifies the audit-chain integrity against the committed reference. If the agent's findings drift from the ground truth (F-001, F-013 for the bundled case), CI fails.
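To make "audit-chain integrity" concrete, here is a minimal sketch of a SHA-256 hash chain and its verifier. The entry layout (`prev_hash`, `payload`, `hash`), the all-zero genesis value, and the canonical JSON encoding are assumptions for this example. The project's real chain format may differ.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def entry_hash(prev_hash, payload):
    """Hash the previous link together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain, payload):
    """Append a new entry linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": entry_hash(prev, payload)})


def verify_chain(chain):
    """Recompute every link; return the tail hash or raise on a break."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        expected = entry_hash(prev, entry["payload"])
        if entry["prev_hash"] != prev or entry["hash"] != expected:
            raise ValueError(f"audit chain broken at entry {index}")
        prev = entry["hash"]
    return prev
```

Comparing the returned tail hash against a committed reference value is one way a CI job could detect that any earlier entry was altered, since a change anywhere propagates to the tail.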


Datasets currently measured

1. Bundled IP-KVM remote-hands case (in examples/)

The headline accuracy run for the hackathon submission.

| Metric | Value |
| --- | --- |
| Recall | 1.000 (12/12 ground-truth findings) |
| False positive rate | 0.000 (0 unsupported claims) |
| Hallucination count | 0 |
| Evidence integrity | preserved (8 files, all SHA-256 hashes match pre/post) |
| Iterations to verdict | 5 |
| Total MCP calls | 14 |
| Audit-chain verified | yes (SHA-256 chain unbroken) |

Reproduces in ~14 seconds on a SIFT VM, ~90 seconds in live mode with claude-sonnet-4.
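The evidence-integrity row can be checked independently with a short script. The sketch below hashes every file under an evidence directory before and after a run and compares the two snapshots. The helper names are hypothetical and it is not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(evidence_root):
    """Map each file's path (relative to the root) to its SHA-256."""
    root = Path(evidence_root)
    return {
        str(path.relative_to(root)): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def integrity_preserved(before, after):
    """True only if no file was added, removed, or modified."""
    return before == after
```

Take one snapshot before invoking the agent and another afterwards. Any difference in keys or digests means the run touched the evidence.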

2. Bundled sample evidence (synthetic, in examples/sample-evidence/)

Smaller scope — primarily exercises the dart-corr contradiction handling.

| Metric | Value |
| --- | --- |
| Recall | 1.000 (3/3: USB insertion, exec, contradiction) |
| False positive rate | 0.000 |
| Self-correction fired | yes (initial single-host hypothesis revised) |
| Confidence delta | 0.34 → 0.91 over 5 iterations |

Datasets in flight (Phase 1 polish)

| Dataset | Status |
| --- | --- |
| NIST CFReDS Hacking Case | Issue #17 (methodology approved, run pending) |
| Ali Hadi Memory Forensic Challenge #1 | Issue #16 |
| Digital Corpora M57 Patents | Issue #18 |
| Real auditd corpus (v0.4 Linux validation) | Issue #28 |
| Patrick Wardle macOS malware persistence corpus | Issue #29 |

As each dataset lands, the accuracy report gains a new section with the same four metrics plus the chain tail hash.


What this report is not claiming

  • Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
  • Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
  • Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.

Bypass test results — the architectural guarantee

Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. Re-run on every commit:

| Test | Result |
| --- | --- |
| Surface is exactly the documented 35-function set | PASS |
| execute_shell raises ToolNotFound | PASS |
| eval, exec, subprocess_run all raise ToolNotFound | PASS |
| _safe_resolve rejects .. traversal | PASS |
| _safe_resolve rejects absolute paths outside DART_EVIDENCE_ROOT | PASS |
| _safe_resolve rejects null-byte truncation attacks | PASS |
A 1000-attempt random fuzz against destructive function names was blocked 1000/1000 times, which verifies the architecture-first guarantee. See Architecture deep dive.
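A fuzz run of this kind might look like the sketch below. It assumes an allowlist-style tool registry in which anything not registered raises ToolNotFound. The registry contents, the `dispatch` helper, and the name generator are all hypothetical stand-ins, not the project's test harness.

```python
import random


class ToolNotFound(LookupError):
    """Stand-in for the error raised on unregistered tool names."""


# Hypothetical two-function read-only surface; the documented set has 35.
ALLOWED_TOOLS = {
    "list_files": lambda: [],
    "read_hash": lambda: "",
}


def dispatch(name, *args, **kwargs):
    """Call a registered tool, or raise ToolNotFound for anything else."""
    try:
        tool = ALLOWED_TOOLS[name]
    except KeyError:
        raise ToolNotFound(name) from None
    return tool(*args, **kwargs)


def fuzz_destructive_names(attempts=1000, seed=0):
    """Try random destructive-sounding names; count how many are blocked."""
    rng = random.Random(seed)
    verbs = ["delete", "rm", "write", "exec", "eval", "execute", "format", "kill"]
    nouns = ["shell", "file", "disk", "evidence", "process", "run"]
    blocked = 0
    for _ in range(attempts):
        name = f"{rng.choice(verbs)}_{rng.choice(nouns)}{rng.randint(0, 99)}"
        try:
            dispatch(name)
        except ToolNotFound:
            blocked += 1
    return blocked
```

The point of the allowlist design is that the fuzz cannot succeed by guessing: a destructive call is blocked because no such function exists on the surface, not because a filter recognized its name.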

