
Accuracy

Juwon1405 edited this page Jun 15, 2026 · 12 revisions

Reproducible accuracy measurement

How we measure detection accuracy — and why the scores live in the repository, not on this page.


The four metrics

| Metric | What it measures |
| --- | --- |
| Recall | share of ground-truth findings the agent surfaced |
| False positive rate | claims unsupported by the bundled evidence |
| Hallucination count | facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
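
As a rough illustration of the four metrics, the sketch below computes each one from plain Python sets and a directory of files. It is not the logic of scripts/measure_accuracy.py: the function names, the set-based representation of findings, and the choice of "all claims" as the false-positive denominator are assumptions made for this example.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def recall(ground_truth: set[str], surfaced: set[str]) -> float:
    """Share of ground-truth findings that the agent surfaced."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return len(ground_truth & surfaced) / len(ground_truth)


def false_positive_rate(claims: set[str], supported: set[str]) -> float:
    """Share of claims the bundled evidence does not support.

    Assumption: the denominator is the total number of claims.
    """
    if not claims:
        return 0.0
    return len(claims - supported) / len(claims)


def hallucination_count(claimed_facts: set[str], artifact_facts: set[str]) -> int:
    """Number of claimed facts that appear nowhere in the source artifacts."""
    return len(claimed_facts - artifact_facts)


def digest_tree(root: Path) -> dict[str, str]:
    """Map every file under root to the SHA-256 of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def evidence_intact(before: dict[str, str], after: dict[str, str]) -> bool:
    """True when no input file was added, removed, or modified by the run."""
    return before == after
```

In use, digest_tree would be called once before the run and once after, and the two mappings compared with evidence_intact.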

For every claim the agent makes, it must cite the audit_id of the MCP call that produced the supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer — so a missing finding lowers recall, but a fabricated one cannot reach the report at all.
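
The write-time gate described above can be sketched as follows. This is a minimal illustration, not dart_agent's actual serializer: Finding, MissingAuditId, and serialize_findings are hypothetical names, and the extra check that each audit_id matches a recorded MCP call is an assumption added for the example.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Iterable, Optional


class MissingAuditId(ValueError):
    """Raised when a finding cannot be tied to a recorded MCP call."""


@dataclass(frozen=True)
class Finding:
    claim: str
    audit_id: Optional[str] = None


def serialize_findings(findings: Iterable[Finding], known_audit_ids: set[str]) -> str:
    """Serialize findings to JSON, refusing any claim without a valid audit_id."""
    rows = []
    for finding in findings:
        if not finding.audit_id:
            raise MissingAuditId(f"claim has no audit_id: {finding.claim!r}")
        # Assumption: the gate also rejects ids that no recorded call produced.
        if finding.audit_id not in known_audit_ids:
            raise MissingAuditId(f"unknown audit_id: {finding.audit_id!r}")
        rows.append({"claim": finding.claim, "audit_id": finding.audit_id})
    return json.dumps(rows, indent=2)
```

Because the check raises before any output is produced, a report containing an uncited claim is never written at all, rather than being written and filtered later.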


Where the numbers live

Benchmark scores are deliberately kept in one place: the repository's docs/benchmarks/ directory, regenerated from a live run rather than transcribed by hand. This page documents how accuracy is measured and what holds regardless of the score. For the latest measured recall across models and cases, treat docs/benchmarks/ as the source of truth.

Recall varies by case difficulty and by model — that variation is the honest signal. Pinning a single figure on a wiki page would only invite it to drift out of sync with the harness. Reproduce any published number locally:

git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"
python3 scripts/measure_accuracy.py

What does not change — the invariants

Whichever case or model you run, these hold by construction, not by tuning:

  • Hallucination is structurally prevented. Every finding carries the audit_id of the MCP call that produced it; the serializer blocks any claim without one. A low recall therefore means missed coverage — never an invented fact. The hallucination count is not a number we chase: it is zero by design.
  • Evidence integrity is sealed. SHA-256 of every input file is recorded before and after each run, and the audit trail is hash-linked into an unbroken chain.
  • The read-only boundary holds. The MCP surface exposes only typed, read-only forensic functions — no shell, no eval, no write path. Asserted on every commit:

| Test | Result |
| --- | --- |
| Surface is exactly the documented read-only function set | PASS |
| execute_shell raises ToolNotFound | PASS |
| eval, exec, subprocess_run all raise ToolNotFound | PASS |
| _safe_resolve rejects .. traversal | PASS |
| _safe_resolve rejects absolute paths outside DART_EVIDENCE_ROOT | PASS |
| _safe_resolve rejects null-byte truncation attacks | PASS |
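
The three path checks asserted for _safe_resolve can be illustrated with a short sketch. This is not the project's implementation: safe_resolve and PathRejected are hypothetical names, and the evidence root is passed as a parameter here rather than read from DART_EVIDENCE_ROOT.

```python
from __future__ import annotations

from pathlib import Path


class PathRejected(ValueError):
    """Raised when a requested path would escape the evidence root."""


def safe_resolve(evidence_root: Path, user_path: str) -> Path:
    """Resolve user_path under evidence_root, or raise PathRejected."""
    # Null bytes can truncate a path in lower-level APIs, so reject them first.
    if "\x00" in user_path:
        raise PathRejected("null byte in path")
    root = evidence_root.resolve()
    # Joining with an absolute user_path discards root, so the containment
    # check below covers absolute paths as well as ".." traversal.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathRejected(f"path escapes evidence root: {user_path!r}")
    return candidate
```

Resolving before checking containment matters: a purely textual check for ".." would miss symlinks and mixed separators, whereas comparing resolved paths tests where the path actually points.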

Random fuzzing with destructive function names is blocked on every attempt. This is an architectural guarantee, not a prompt instruction. See Architecture deep dive.
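
One way to picture why the fuzz cannot succeed is an allowlist dispatcher: a name that is not registered has no code path to run. The sketch below is an assumption-laden illustration, not dart-mcp's surface. The list_files tool, the READ_ONLY_TOOLS registry, call_tool, and the verb and noun lists are all hypothetical, and ToolNotFound here is a stand-in class named after the exception in the test table.

```python
from __future__ import annotations

import random
from typing import Callable


class ToolNotFound(LookupError):
    """Raised when a requested tool name is not on the allowlist."""


def list_files(path: str = ".") -> list[str]:
    """Hypothetical read-only tool; returns nothing in this sketch."""
    return []


# The whole callable surface: anything absent from this dict cannot be invoked.
READ_ONLY_TOOLS: dict[str, Callable[..., object]] = {"list_files": list_files}


def call_tool(name: str, **kwargs: object) -> object:
    """Dispatch by exact name against the allowlist."""
    try:
        tool = READ_ONLY_TOOLS[name]
    except KeyError:
        raise ToolNotFound(name) from None
    return tool(**kwargs)


def fuzz_destructive_names(trials: int = 200, seed: int = 0) -> int:
    """Try randomly composed destructive names; return how many were blocked."""
    rng = random.Random(seed)
    verbs = ["execute", "eval", "exec", "delete", "write", "subprocess"]
    nouns = ["shell", "run", "file", "registry", "cmd"]
    blocked = 0
    for _ in range(trials):
        name = f"{rng.choice(verbs)}_{rng.choice(nouns)}"
        try:
            call_tool(name)
        except ToolNotFound:
            blocked += 1
    return blocked
```

A test built on this would assert that fuzz_destructive_names(n) equals n, which holds regardless of what the model is prompted to do.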


Needle-in-a-haystack, not toy data

A fair reviewer asks: "recall on a 30-line file is meaningless, because every line is an IOC." Correct. The canonical bundled evidence root (examples/case-studies/self-evaluation/case-01/evidence_root/) is hand-curated at production volume: it contains a security EventLog of roughly eleven thousand lines, supply-chain artifacts, RDP brute-force, and USB setupapi records. The two IOC-only logs are enriched with deterministic benign noise (scripts/generate_realistic_evidence.py) until the signal-to-noise ratio is low. The measurement is not a small-input over-fit; the agent finds the needle in production-scale hay.
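
Deterministic noise enrichment of this kind can be sketched with a seeded random generator, so that every run produces the same haystack. The code below does not reproduce scripts/generate_realistic_evidence.py: the function name, the seed, and the benign line templates are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import random


def enrich_with_noise(ioc_lines: list[str], noise_per_ioc: int, seed: int = 1405) -> list[str]:
    """Mix IOC lines into seeded benign noise; same inputs give the same output."""
    rng = random.Random(seed)  # fixed seed makes the result reproducible
    # Hypothetical benign templates, not the project's real log formats.
    templates = [
        "INFO service heartbeat ok id={n}",
        "INFO user session refreshed id={n}",
        "INFO scheduled task completed id={n}",
    ]
    lines = list(ioc_lines)
    for _ in range(len(ioc_lines) * noise_per_ioc):
        lines.append(rng.choice(templates).format(n=rng.randint(1000, 9999)))
    rng.shuffle(lines)  # scatter the IOC lines through the noise
    return lines
```

With noise_per_ioc set to 99, for example, each IOC line makes up one percent of the output, which is the kind of ratio that makes recall a meaningful measurement.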


What this report is not claiming

  • Not that the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
  • Not zero false negatives in adversarial settings — only against the documented corpus.
  • Not production-readiness. This is a hackathon submission demonstrating that the architecture is correct and the loop is sound; hardening is the Phase 2–3 roadmap.

External benchmarking — the paradigm gap, honestly

The synthetic measurement is necessary but not sufficient. The honest reviewer question — "what does it score on a dataset you didn't author?" — is answered by integrating external corpora (NIST CFReDS Hacking Case, Ali Hadi Challenge 1, Digital Corpora M57). Those scores live in docs/benchmarks/ alongside the synthetic ones.

The point worth making here is why external recall sits below synthetic recall — and it is not a regression:

  • Synthetic accuracy measures correctness of the detection logic against IOCs the system claims to detect.
  • External accuracy measures coverage against content-centric corpora, a paradigm dart-mcp is still building out, so a lower score there indicates expansion potential rather than faulty detection logic.

External benchmarking is what converted "we should add registry parsing someday" into "registry parsing unblocks several measured findings — ship it next." That is the real value of third-party data: it reorders the Phase 2 backlog by evidence, not by guess.



