Accuracy
How we measure detection accuracy, and how you can reproduce every number we publish.
| Metric | What it measures |
|---|---|
| Recall | % of ground-truth findings the agent surfaced |
| False positive rate | claims unsupported by the bundled evidence |
| Hallucination count | facts not present in the source artifacts at all |
| Evidence integrity | SHA-256 of every input file, before and after the run |
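As a rough illustration of how the four metrics above could be computed, here is a minimal Python sketch. It is not the code in scripts/measure_accuracy.py. The function names, the dictionary shape of a claim, and the choice of "all claims" as the denominator for the false positive rate are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def recall(ground_truth: set, surfaced: set) -> float:
    """Fraction of ground-truth finding IDs that the agent surfaced."""
    if not ground_truth:
        return 1.0
    return len(ground_truth & surfaced) / len(ground_truth)


def false_positive_rate(claims: list, evidence_ids: set) -> float:
    """Fraction of claims whose cited audit_id is not backed by evidence.

    Assumption: a claim is a dict with an "audit_id" key, and the
    denominator is the total number of claims.
    """
    if not claims:
        return 0.0
    unsupported = [c for c in claims if c.get("audit_id") not in evidence_ids]
    return len(unsupported) / len(claims)


def hallucination_count(claimed_facts: set, source_facts: set) -> int:
    """Number of claimed facts that appear nowhere in the source artifacts."""
    return len(claimed_facts - source_facts)


def snapshot_hashes(root: Path) -> dict:
    """Map each file under root (relative path) to its SHA-256 hex digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Under these assumptions, evidence integrity is preserved when `snapshot_hashes(root)` taken before the run equals the snapshot taken after it.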
Every claim the agent makes must cite the audit_id of the MCP call that produced its supporting evidence. Claims without an audit_id are blocked at write time by dart_agent's serializer.
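The write-time gate described above could look roughly like the following. This is a hedged sketch, not dart_agent's actual serializer: the `Claim` dataclass, the `MissingAuditId` exception, and the `serialize_claims` function are hypothetical names introduced for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


class MissingAuditId(ValueError):
    """Raised when a claim carries no audit_id (hypothetical exception)."""


@dataclass(frozen=True)
class Claim:
    text: str
    audit_id: Optional[str] = None  # ID of the MCP call backing the claim


def serialize_claims(claims: list) -> str:
    """Serialize claims to JSON, refusing any claim without an audit_id."""
    for claim in claims:
        if not claim.audit_id:
            # Block at write time: nothing is emitted if any claim is uncited.
            raise MissingAuditId(f"claim has no audit_id: {claim.text!r}")
    return json.dumps([asdict(c) for c in claims], sort_keys=True)
```

Validating every claim before producing any output means a single uncited claim prevents the whole report from being written, rather than being silently dropped.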
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
export PYTHONPATH="$PWD/dart_audit/src:$PWD/dart_mcp/src:$PWD/dart_agent/src"
python3 scripts/measure_accuracy.py

The script re-runs the bundled find-evil-ref-01 case, regenerates the metrics in docs/accuracy-report.md, and verifies the audit-chain integrity against the committed reference. If the agent's findings drift from the ground truth (F-001, F-013 for the bundled case), CI fails.
The headline accuracy run for the hackathon submission.
| Metric | Value |
|---|---|
| Recall | 1.000 (12/12 ground-truth findings) |
| False positive rate | 0.000 (0 unsupported claims) |
| Hallucination count | 0 |
| Evidence integrity | preserved (8 files, all SHA-256 hashes match pre/post) |
| Iterations to verdict | 5 |
| Total MCP calls | 14 |
| Audit-chain verified | yes (SHA-256 chain unbroken) |
Reproduces in ~14 seconds on a SIFT VM, ~90 seconds in live mode with claude-sonnet-4.
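To make "SHA-256 chain unbroken" concrete, the sketch below shows one common way a hash-chained log can be built and verified. The entry layout (`prev`, `payload`, `hash`), the all-zero genesis value, and the JSON canonicalization are assumptions for this example and may differ from what dart-audit actually stores.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    """Append a payload, linking it to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    )


def verify_chain(chain: list) -> bool:
    """Return True only if every entry links to and matches its predecessor."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

In a layout like this, the hash of the last entry (the chain tail) commits to every earlier entry, so comparing a single tail hash against a committed reference is enough to detect any edit to the log.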
Smaller scope — primarily exercises the dart-corr contradiction handling.
| Metric | Value |
|---|---|
| Recall | 1.000 (3/3 — USB insertion, exec, contradiction) |
| False positive rate | 0.000 |
| Self-correction fired | yes (initial single-host hypothesis revised) |
| Confidence delta | 0.34 → 0.91 over 5 iterations |
| Dataset | Status |
|---|---|
| NIST CFReDS Hacking Case | Issue #17 — methodology approved, run pending |
| Ali Hadi Memory Forensic Challenge #1 | Issue #16 |
| Digital Corpora M57 Patents | Issue #18 |
| Real auditd corpus (v0.4 Linux validation) | Issue #28 |
| Patrick Wardle macOS malware persistence corpus | Issue #29 |
When each lands, the accuracy report gains a new section with the same four metrics plus the chain tail hash.
- Not claiming the agent matches a senior human analyst on open-ended novel cases. It matches on cases with mechanically verifiable ground truth.
- Not claiming zero false negatives in adversarial settings — only against the documented test corpus.
- Not claiming production-readiness. This is a hackathon submission demonstrating the architecture is correct and the loop is sound. Hardening for production is the Phase 2-3 roadmap.
Separate from accuracy, the bypass test asserts that the read-only MCP boundary holds. Re-run on every commit:
| Test | Result |
|---|---|
| Surface is exactly the documented 35-function set | PASS |
| execute_shell raises ToolNotFound | PASS |
| eval, exec, subprocess_run all raise ToolNotFound | PASS |
| _safe_resolve rejects .. traversal | PASS |
| _safe_resolve rejects absolute paths outside DART_EVIDENCE_ROOT | PASS |
| _safe_resolve rejects null-byte truncation attacks | PASS |
1000-attempt random-fuzz against destructive function names: 1000/1000 blocked (Architecture-first guarantee verified). See Architecture deep dive.
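The checks in the bypass test can be pictured with a small sketch like the one below. It is an assumption-laden illustration rather than the project's `_safe_resolve` or MCP dispatcher: `safe_resolve`, `PathRejected`, `dispatch`, and the `list_evidence` tool are hypothetical names, and the evidence root is passed as an argument instead of being read from DART_EVIDENCE_ROOT.

```python
from pathlib import Path


class ToolNotFound(KeyError):
    """Raised for any tool name outside the read-only surface."""


class PathRejected(ValueError):
    """Raised when a requested path escapes the evidence root."""


def safe_resolve(evidence_root: Path, user_path: str) -> Path:
    """Resolve user_path under evidence_root, rejecting any escape."""
    if "\x00" in user_path:
        raise PathRejected("null byte in path")
    root = evidence_root.resolve()
    # Joining with an absolute user_path discards root, so the containment
    # check below also covers absolute paths outside the evidence root.
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PathRejected(f"path escapes evidence root: {user_path!r}")
    return candidate


# Hypothetical read-only surface: only names registered here are callable.
READ_ONLY_TOOLS = {"list_evidence": lambda: []}


def dispatch(name: str, *args, **kwargs):
    """Call a registered tool; unknown names never reach any implementation."""
    try:
        tool = READ_ONLY_TOOLS[name]
    except KeyError:
        raise ToolNotFound(name) from None
    return tool(*args, **kwargs)
```

The design point is that destructive names such as execute_shell are not filtered out by a deny-list. They simply do not exist in the registry, so a lookup for them can only fail.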
- docs/accuracy-report.md: the source-of-truth document
- Threat model: what the bypass test actually defends
- Architecture deep dive: why we measure chain integrity at all
Agentic-DART — autonomous DFIR agent · architecture-first, not prompt-first · MIT license · github.com/Juwon1405/agentic-dart