Accuracy
How we measure detection accuracy — and why the scores live in the repository, not on this page.
| Metric | What it measures |
|---|---|
| Recall | share of ground-truth findings the agent surfaced |
| False positive rate | claims unsupported by the bundled evidence |
| Hallucination count | facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
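
The four metrics in the table can be sketched in a few lines. This is a minimal illustration, not the code in scripts/measure_accuracy.py: the `Claim` record, its field names, and the choice of "unsupported claims divided by all claims" as the false positive rate are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Claim:
    finding_id: str   # hypothetical ID, matched against the ground-truth set
    supported: bool   # True if the bundled evidence backs the claim
    in_source: bool   # True if the stated fact appears in the source artifacts


def score(ground_truth: set[str], claims: list[Claim]) -> dict[str, float]:
    """Compute recall, false positive rate, and hallucination count."""
    surfaced = {c.finding_id for c in claims if c.supported}
    unsupported = sum(1 for c in claims if not c.supported)
    return {
        "recall": len(ground_truth & surfaced) / len(ground_truth) if ground_truth else 0.0,
        # Assumed definition: unsupported claims over all claims made.
        "false_positive_rate": unsupported / len(claims) if claims else 0.0,
        "hallucination_count": sum(1 for c in claims if not c.in_source),
    }


def sha256_snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its SHA-256 hex digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def evidence_intact(before: dict[str, str], after: dict[str, str]) -> bool:
    """True only if no input file was added, removed, or modified by the run."""
    return before == after
```

Taking one snapshot before the run and one after, then comparing them with `evidence_intact`, mirrors the "before and after" wording of the evidence-integrity row.
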
For every claim the agent makes, it must cite the audit_id of the MCP call that produced the supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer — so a missing finding lowers recall, but a fabricated one cannot reach the report at all.
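
The write-time gate described above might look roughly like the sketch below. It is not dart_agent's serializer: `Finding`, `MissingAuditId`, and `serialize_findings` are hypothetical names, and both raising an exception on a bad claim and checking the audit_id against a set of known MCP call IDs are assumptions about how "blocked" could be implemented.

```python
import json
from dataclasses import dataclass
from typing import Optional


class MissingAuditId(ValueError):
    """Raised when a finding has no audit_id, or one the audit log never issued."""


@dataclass(frozen=True)
class Finding:
    claim: str
    audit_id: Optional[str] = None  # ID of the MCP call that produced the evidence


def serialize_findings(findings: list[Finding], known_audit_ids: set[str]) -> str:
    """Serialize findings to JSON, refusing any claim without a valid audit_id."""
    report = []
    for finding in findings:
        # Assumed policy: an absent or unknown audit_id aborts the write entirely.
        if not finding.audit_id or finding.audit_id not in known_audit_ids:
            raise MissingAuditId(f"claim blocked, no valid audit_id: {finding.claim!r}")
        report.append({"claim": finding.claim, "audit_id": finding.audit_id})
    return json.dumps(report, indent=2)
```

Because the check sits in the only path that writes the report, an uncited claim fails before it is written rather than being filtered afterwards.
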
Benchmark scores are deliberately kept in one place — the repository's docs/benchmarks/, regenerated from a live run rather than transcribed by hand. This page documents how accuracy is measured and what holds regardless of the score. For the latest measured recall across models and cases, read the source of truth:
- docs/benchmarks/SUMMARY.md — combined recall per model
- docs/benchmarks/MODEL-COMPARISON.md — per-case, per-model breakdown
- docs/benchmarks/HISTORY.md — run-over-run history
Recall varies by case difficulty and by model — that variation is the honest signal. Pinning a single figure on a wiki page would only invite it to drift out of sync with the harness. Reproduce any published number locally:
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"
python3 scripts/measure_accuracy.py

Whichever case or model you run, these hold by construction, not by tuning:
- Hallucination is structurally prevented. Every finding carries the audit_id of the MCP call that produced it; the serializer blocks any claim without one. A low recall therefore means missed coverage, never an invented fact. The hallucination count is not a number we chase: it is zero by design.
- Evidence integrity is sealed. SHA-256 of every input file is recorded before and after each run, and the audit trail is hash-linked into an unbroken chain.
- The read-only boundary holds. The MCP surface exposes only typed, read-only forensic functions — no shell, no eval, no write path. Asserted on every commit:
| Test | Result |
|---|---|
| Surface is exactly the documented read-only function set | PASS |
| execute_shell raises ToolNotFound | PASS |
| eval, exec, subprocess_run all raise ToolNotFound | PASS |
| _safe_resolve rejects .. traversal | PASS |
| _safe_resolve rejects absolute paths outside DART_EVIDENCE_ROOT | PASS |
| _safe_resolve rejects null-byte truncation attacks | PASS |
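
The three _safe_resolve rows in the table correspond to a standard path-containment check. The sketch below is an illustration of that technique, not the project's implementation: `safe_resolve_sketch` and `UnsafePath` are hypothetical names, and the evidence root is passed as a parameter here rather than read from DART_EVIDENCE_ROOT.

```python
from pathlib import Path


class UnsafePath(ValueError):
    """Raised when a requested path would escape the evidence root."""


def safe_resolve_sketch(evidence_root: Path, user_path: str) -> Path:
    """Resolve user_path under evidence_root, or raise UnsafePath."""
    # Null bytes can truncate a path in lower-level APIs, so reject them up front.
    if "\x00" in user_path:
        raise UnsafePath("null byte in path")
    root = evidence_root.resolve()
    # Joining with an absolute user_path discards root; resolve() collapses "..".
    candidate = (root / user_path).resolve()
    # Containment is checked on the fully resolved path (Python 3.9+).
    if not candidate.is_relative_to(root):
        raise UnsafePath(f"path escapes evidence root: {user_path!r}")
    return candidate
```

Checking containment after resolution, rather than scanning the input string for "..", means traversal and absolute-path attempts are caught by the same test.
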
A random fuzz against destructive function names is blocked on every attempt — the architecture-first guarantee, not a prompt instruction. See Architecture deep dive.
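
The idea behind the fuzz can be shown with a closed tool registry: a name that is not registered cannot be dispatched, whatever the prompt says. Everything in this sketch is assumed for illustration, including the two registered function names, the `call_tool` dispatcher, the local `ToolNotFound` stand-in class, and the word lists used to generate destructive names.

```python
import hashlib
import os
import random


class ToolNotFound(KeyError):
    """Stand-in for the error raised when a tool name is not on the surface."""


def list_files(root: str) -> list[str]:
    """Read-only: list directory entries."""
    return sorted(os.listdir(root))


def hash_file(path: str) -> str:
    """Read-only: SHA-256 of a file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


# The whole surface. There is no shell, eval, or write function to find.
REGISTRY = {"list_files": list_files, "hash_file": hash_file}


def call_tool(name: str, **kwargs):
    """Dispatch by exact name; anything unregistered raises ToolNotFound."""
    try:
        fn = REGISTRY[name]
    except KeyError:
        raise ToolNotFound(name) from None
    return fn(**kwargs)


def fuzz_destructive_names(rounds: int, seed: int) -> int:
    """Try random destructive-sounding names; return how many were blocked."""
    rng = random.Random(seed)
    verbs = ["delete", "write", "exec", "rm", "format", "kill"]
    nouns = ["file", "shell", "disk", "evidence"]
    blocked = 0
    for _ in range(rounds):
        try:
            call_tool(f"{rng.choice(verbs)}_{rng.choice(nouns)}")
        except ToolNotFound:
            blocked += 1
    return blocked
```

A passing run is one where `fuzz_destructive_names` returns the same number as `rounds`, since none of the generated names exist in the registry.
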
A fair reviewer asks: "recall on a 30-line file is meaningless; every line is an IOC." Correct. The canonical bundled evidence root (examples/case-studies/self-evaluation/case-01/evidence_root/) is hand-curated at production volume: a security EventLog of roughly eleven thousand lines, supply-chain artifacts, RDP brute-force, and USB setupapi. The two IOC-only logs are padded with deterministic benign noise (scripts/generate_realistic_evidence.py) so that noise heavily outweighs signal. The measurement is therefore not a small-input over-fit; the agent finds the needle in production-scale hay.
What these measurements do not claim:
- Not that the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
- Not zero false negatives in adversarial settings, only against the documented corpus.
- Not production-readiness. This is a hackathon submission demonstrating that the architecture is correct and the loop is sound; hardening is the Phase 2–3 roadmap.
The synthetic measurement is necessary but not sufficient. The honest reviewer question — "what does it score on a dataset you didn't author?" — is answered by integrating external corpora (NIST CFReDS Hacking Case, Ali Hadi Challenge 1, Digital Corpora M57). Those scores live in docs/benchmarks/ alongside the synthetic ones.
The point worth making here is why external recall sits below synthetic recall — and it is not a regression:
- Synthetic accuracy measures correctness of the detection logic against IOCs the system claims to detect.
- External accuracy measures expansion potential against a content-centric paradigm dart-mcp is still building out.
External benchmarking is what converted "we should add registry parsing someday" into "registry parsing unblocks several measured findings — ship it next." That is the real value of third-party data: it reorders the Phase 2 backlog by evidence, not by guess.
- examples/case-studies/external-evaluation/case-01/README.md: external case study
- docs/accuracy-report.md: full accuracy report with current measured scores
Agentic-DART — autonomous DFIR agent · architecture-first, not prompt-first · MIT license · github.com/Juwon1405/agentic-dart