Releases: Juwon1405/agentic-dart

v1.2.0 — SANS Find Evil! 2026

15 Jun 21:07

Agentic-DART v1.2.0 — SANS Find Evil! 2026 submission build.

Autonomous DFIR agent on the SANS SIFT Workstation. The language model analyzes evidence in read-only mode and seals every inference into a SHA-256 audit chain. 73 typed, read-only MCP tools (48 native pure-Python + 25 SIFT-tool adapters) — destructive operations are absent from the tool registry and CI-enforced, so even a fully successful prompt injection has no destructive function to call. Architecture-first, not prompt-first.
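
As an illustration of what sealing inferences into a SHA-256 audit chain can look like, here is a minimal sketch. It is not the project's implementation: the entry layout, the `GENESIS` value, and the function names `append_entry` / `verify_chain` are assumptions made for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash, payload):
    # Canonical JSON (sorted keys) so the same payload always hashes identically.
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain, payload):
    """Append a payload, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Return True only if every link and every entry hash still matches."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if _digest(entry["prev"], entry["payload"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

Because each entry's hash covers the previous hash, editing any earlier payload invalidates every later link, which is what makes the chain tamper-evident.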

This release

  • Sigma detection pack v2 — 11 rules (DCSync, Golden Ticket, ransomware shadow-copy deletion, web-shell creation, local account creation, Kerberoasting, AS-REP roasting, HID insertion, remote exec, event-log clearing).
  • Model-aware authentication — Haiku resolves to an OAuth subscription token; Sonnet/Opus to a metered API key. New dart-auth command.
  • Persistent install aliases: dart-pull, dart-auth.
  • Unified per-case ledger — append-only, per-case timestamps.
  • case-02 ground-truth fix — Hadi Challenge #1 is Windows XAMPP, not Linux; recall 0% -> 60%.
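
The model-aware authentication rule in this list can be pictured as a small lookup from model family to credential type. The sketch below is hypothetical: the function name `credential_kind` and the returned labels are invented for illustration and do not describe how dart-auth is actually implemented.

```python
def credential_kind(model_id: str) -> str:
    """Map a model identifier to the credential type named in the release notes.

    Hypothetical helper: matching on substrings of the model ID is an
    assumption made for this sketch.
    """
    name = model_id.lower()
    if "haiku" in name:
        return "oauth_subscription_token"
    if "sonnet" in name or "opus" in name:
        return "metered_api_key"
    raise ValueError(f"no credential rule for model: {model_id}")
```

Raising on an unknown model family, rather than silently falling back to one credential, keeps a misconfigured model name from being billed against the wrong account.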

142 tests passing. Full history in CHANGELOG.md.

v1.1.0 — Stable release (SANS FIND EVIL! 2026)

15 Jun 05:36

Agentic-DART is an autonomous DFIR agent on the SANS SIFT Workstation. It runs a senior-analyst reasoning loop over a custom MCP server of 73 typed, read-only forensic tools (48 native pure-Python + 25 SIFT adapters) and produces a courtroom-traceable report. Evidence integrity is enforced by the shape of the system — destructive operations (execute_shell, write_file, mount) simply do not exist on the wire — not by asking the model to behave.

This is the first genuinely stable release, verified end-to-end from a clean clone.

Why 1.1.0 supersedes everything before it

Earlier tags that claimed "stable" did not actually run clean in a fresh environment. 1.1.0 is the result of a full correctness pass — install, benchmark, scoring, external disk-image handling, and the test suite all fixed and re-verified. The prior 1.0.2 "stable" tag has been removed to avoid confusion.

  • Tests are green from anywhere. 156 tests pass. The Phase-2 placeholder suite is now explicitly skipped (not failed) wherever it's collected, so pytest is clean whether you run it from the repo root or any subdirectory.
  • No version is pinned in docs or tests. The release number lives in pyproject.toml only; READMEs, the wiki, the site, and the version test were all genericized, so a future bump touches one file.

Highlights

  • 73 typed read-only MCP tools — 48 native forensic functions + 25 SIFT adapters (Volatility 3, MFTECmd, EvtxECmd, PECmd, RECmd, AmcacheParser, YARA, Plaso), plus a versioned Sigma detection-rule matcher.
  • 11 case studies / 99 ground-truth findings across two tiers: 8 internal self-evaluation cases (ready evidence) + 3 external full-disk public images (NIST CFReDS Hacking Case, Ali Hadi DFIR Challenge #1, Digital Corpora M57).
  • External is a first-class tier. Full-disk images are adapted via ewfmount + mmls + tsk_recover (partition-offset aware) into an evidence tree, then analyzed. Run the tiers as separate processes (scripts.eval.demo / scripts.eval.self / scripts.eval.external), each independently debuggable; an append-only docs/benchmarks/HISTORY.md records every self/external run.
  • Linux-only host, hardened installer — refuses to run under sudo, stages the full toolchain, verifies it, and offers to fetch the external images at the end.
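
"Partition-offset aware" means reading the partition table before recovering files, so each file system is addressed at its real start sector. The sketch below parses typical mmls text output and derives offsets; the function name `partition_offsets` and the returned field names are assumptions for illustration, not the project's adapter code.

```python
import re

SECTOR_SIZE_RE = re.compile(r"Units are in (\d+)-byte sectors")
# Columns: index, slot, start, end, length, description
ROW_RE = re.compile(r"^(\d+):\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.*)$")


def partition_offsets(mmls_output):
    """Return allocated partitions with their sector and byte offsets."""
    sector_size = 512  # fallback if the "Units are in" line is absent
    partitions = []
    for line in mmls_output.splitlines():
        unit = SECTOR_SIZE_RE.search(line)
        if unit:
            sector_size = int(unit.group(1))
            continue
        row = ROW_RE.match(line.strip())
        if not row:
            continue
        slot = row.group(2)
        # Skip partition-table metadata and unallocated gaps.
        if slot == "Meta" or slot.startswith("-"):
            continue
        start = int(row.group(3))
        partitions.append({
            "start_sector": start,
            "byte_offset": start * sector_size,
            "description": row.group(6).strip(),
        })
    return partitions


SAMPLE = """DOS Partition Table
Offset Sector: 0
Units are in 512-byte sectors

      Slot      Start        End          Length       Description
000:  Meta      0000000000   0000000000   0000000001   Primary Table (#0)
001:  -------   0000000000   0000000062   0000000063   Unallocated
002:  000:000   0000000063   0009510479   0009510417   NTFS / exFAT (0x07)
"""
print(partition_offsets(SAMPLE))
```

Sleuth Kit tools such as tsk_recover take the offset in sectors through their -o option, while loop-mount style tooling generally expects the byte offset, so the sketch returns both.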

Requirements / dependencies

Host OS — Linux only. Verified on the SANS SIFT Workstation (Ubuntu 22.04); RHEL / Rocky / AlmaLinux 8+ and Fedora work via dnf/yum. macOS and Windows are not supported as the host — the Plaso / libyal toolchain does not build cleanly there. Default shell is bash.

| Requirement | Version | Verified |
|---|---|---|
| Python | 3.10+ (CI: 3.10 – 3.13) | 3.10, 3.12 |
| OS | Ubuntu 22.04 (SANS SIFT) primary; RHEL/Rocky/Alma 8+, Fedora | SIFT |

Python libraries (lower bounds; installed by scripts/install.sh):

| Library | Minimum | Role |
|---|---|---|
| anthropic | ≥ 0.40 | Claude API client (live mode) |
| mcp | ≥ 1.0 | MCP client/server transport |
| duckdb | ≥ 1.5.3, < 2.0 | in-memory correlation store |
| python-registry | ≥ 1.3 | Windows registry hive parsing |
| PyYAML | ≥ 6.0 | playbook / Sigma rule loading |
| requests | ≥ 2.25 | dataset download (benchmarks) |

External forensic tools (staged by the installer; SIFT ships most): sleuthkit (mmls, tsk_recover), ewfmount (ewf-tools / libewf), Volatility 3, Plaso (log2timeline.py, psort.py), EZ Tools, YARA, Velociraptor.

Install

git clone https://github.com/Juwon1405/agentic-dart.git
cd agentic-dart
bash scripts/install.sh          # Linux only; refuses sudo. Offers to fetch external images (~13 GB).
export ANTHROPIC_API_KEY='sk-ant-...'
python3 -m scripts.eval.demo                                           # deterministic, no key
python3 -m scripts.eval.self     --models claude-haiku-4-5-20251001   # 8 bundled cases
python3 -m scripts.eval.external --models claude-haiku-4-5-20251001   # public disk images

License: MIT. SANS FIND EVIL! 2026 submission.

v1.0.1 — Platform overhaul: run_eval CLI, tiered case layout, OS-aware installer

11 Jun 00:01

Highlights

  • run_eval.py — the new primary user-facing command. Live mode only: fails fast with an actionable message when ANTHROPIC_API_KEY is unset; discovers cases dynamically from both tiers; writes out/<tier>/<case-id>/<timestamp>/{findings,report,summary}.json.
  • Tiered, self-contained case studies: examples/case-studies/self-evaluation/case-01..08 and external-evaluation/case-01..03 (NIST CFReDS, Ali Hadi, Digital Corpora M57-Patents/Jo). Index-only folder names, truth.json per case, canonical bundled evidence at self-evaluation/case-01/evidence_root/. The public --variant selector is gone.
  • OS-aware installer: scripts/install.sh --os auto|ubuntu|centos|macos, venv-first, clones+installs the collector adapter, optional SIFT (--install-sift, via cast) and Eric Zimmerman Tools (--install-eztools, .NET 9 builds, URLs validated before download). Plus root requirements.txt and an API-free scripts/healthcheck.py.
  • Downloader hardening — browser-like headers on every request (incl. resumed range requests), pure-Python streaming split-image reassembly, --dry-run / --check-urls.
  • Hardening (earlier in this line) — MCP call_tool() schema validation before dispatch, Plaso outputs isolated to DART_DERIVED_ROOT, benchmark summary no longer fabricates rows, hallucination scoring requires resolvable audit IDs.
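
To make "schema validation before dispatch" concrete, the following sketch checks a tool call against a declared schema and only then invokes the handler. The registry layout, the simplified schema format, and the arguments shown for parse_linux_text_log are all hypothetical; the real MCP server's schemas are not reproduced here.

```python
TYPE_MAP = {"string": str, "boolean": bool}

# Hypothetical registry: one read-only tool with a simplified schema.
REGISTRY = {
    "parse_linux_text_log": {
        "schema": {
            "required": ["path"],
            "properties": {"path": "string", "flagged_only": "boolean"},
        },
        "handler": lambda args: {"tool": "parse_linux_text_log", "args": args},
    },
}


def call_tool(name, arguments):
    """Validate the call against the tool's schema, then dispatch."""
    tool = REGISTRY.get(name)
    if tool is None:
        # Tools that are not registered cannot be called at all.
        raise KeyError(f"unknown tool: {name}")
    schema = tool["schema"]
    missing = [key for key in schema["required"] if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    for key, value in arguments.items():
        expected = schema["properties"].get(key)
        if expected is None:
            raise ValueError(f"unexpected argument: {key}")
        if not isinstance(value, TYPE_MAP[expected]):
            raise TypeError(f"argument {key!r} must be of type {expected}")
    return tool["handler"](arguments)
```

Validating first means a malformed or injected call fails with a clear error before any handler touches evidence, and a name such as execute_shell fails at the lookup step because it was never registered.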

Measured QA at this tag

  • Full pytest suite green (tests/ + dart_corr/tests/); benchmark-integrity and CI workflows green on this commit.
  • scripts/measure_accuracy.py: recall 1.0, FPR 0.0, hallucinations 0, evidence integrity preserved (67 files).
  • validate_ground_truth.py: FAIL 0 (6 documented external-tier warnings).
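
For readers unfamiliar with these metrics, the sketch below shows one plausible way to compute recall and a hallucination count, where a finding counts as a hallucination if its audit ID does not resolve to an audit-chain entry. The field names `finding_id` and `audit_id` and the function `score` are assumptions for this example; scripts/measure_accuracy.py may define the metrics differently.

```python
def score(findings, truth_ids, audit_ids):
    """Compute recall against ground truth and count unresolvable findings.

    findings:  list of dicts with hypothetical keys "finding_id" and "audit_id"
    truth_ids: set of expected finding IDs
    audit_ids: set of IDs present in the audit chain
    """
    reported = {f["finding_id"] for f in findings}
    matched = reported & truth_ids
    recall = len(matched) / len(truth_ids) if truth_ids else 0.0
    hallucinations = [
        f["finding_id"] for f in findings
        if f.get("audit_id") not in audit_ids
    ]
    return {"recall": recall, "hallucination_count": len(hallucinations)}


example = score(
    findings=[
        {"finding_id": "F-001", "audit_id": "a1"},
        {"finding_id": "F-013", "audit_id": "a2"},
    ],
    truth_ids={"F-001", "F-013"},
    audit_ids={"a1", "a2"},
)
print(example)
```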

Known limitations

  • The adapter's --source image (Velociraptor dead-disk) path is covered by mocked end-to-end tests and has not been exercised against a live Velociraptor binary in CI.
  • External-tier evaluations require a one-time multi-GB dataset download; no external-dataset accuracy numbers are claimed at this tag.

Full details: CHANGELOG.md

v0.7.1 — Linux DFIR triplet + ground-truth function reconciliation

16 May 10:36

Highlights

Closed 6 of 10 missing-function gaps identified by a post-release MCP surface audit against the 11-case ground-truth library.

Added — Linux DFIR triplet (2 new MCP functions)

  • parse_linux_text_log — parses Apache/nginx combined access logs, syslog (RFC3164), /var/log/messages, /var/log/secure, and auditd dispatcher text mode. Returns parsed records plus suspicious-content tags across 10 patterns covering T1003.008 shadow read, T1190 path traversal + SQLi, T1505.003 webshell patterns, T1105 remote download to shell, T1071.001 netcat, T1046 scanner invocation, T1222.002 dangerous chmod, T1059.004 reverse-shell oneliners, T1213.002 database credential use, plus a scanner-user-agent meta-rule (T1595.002).
  • parse_linux_shell_history — parses bash/zsh history with HISTTIMEFORMAT awareness (epoch comment lines). Detects 11 attacker patterns including T1098.004 SSH key persistence, T1070.003 history clear, T1053.003 cron mutation, T1027 base64 obfuscation.
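
When HISTTIMEFORMAT is set, bash writes a `#<epoch>` comment line before each command in the history file. The sketch below pairs those lines with commands and tags two of the attacker patterns named above. It is an illustration only: the function name, the record layout, and both regular expressions are assumptions, not the shipped parse_linux_shell_history logic.

```python
import re
from datetime import datetime, timezone

EPOCH_LINE = re.compile(r"^#(\d{9,11})$")

# Illustrative patterns only; the real detector covers 11 patterns.
PATTERNS = {
    "T1070.003": re.compile(r"\bhistory\s+-c\b"),          # history clear
    "T1027": re.compile(r"\bbase64\s+(-d|--decode)\b"),    # base64 obfuscation
}


def parse_history(text):
    """Return one record per command, with its timestamp if one preceded it."""
    records = []
    pending_ts = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        stamp = EPOCH_LINE.match(line)
        if stamp:
            pending_ts = datetime.fromtimestamp(int(stamp.group(1)), tz=timezone.utc)
            continue
        tags = [tid for tid, rx in PATTERNS.items() if rx.search(line)]
        records.append({"timestamp": pending_ts, "command": line, "tags": tags})
        pending_ts = None  # a timestamp applies only to the next command
    return records
```

Commands with no preceding epoch line keep a `None` timestamp, so history written before HISTTIMEFORMAT was enabled is still returned rather than dropped.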

(parse_linux_cron_jobs already existed in v0.6.1 — exposed via evidence_root + flagged_only schema. Not duplicated.)

Changed — case-09 ground-truth function names reconciled

Pre-v0.7.1 case-09 (Ali Hadi Challenge 1) referenced three functions that did not exist in the MCP surface. Now mapped to actual capabilities:

| Finding | Pre-v0.7.1 (missing) | v0.7.1 (implemented) |
|---|---|---|
| F-HADI1-002 | detect_web_shell_indicators | detect_webshell |
| F-HADI1-007 | enumerate_filesystem_anomalies | parse_linux_text_log |
| F-HADI1-009 | detect_log_tampering_indicators | detect_defense_evasion |

Ground-truth coverage post-reconciliation

Of 36 expected functions referenced across all 11 cases:

  • 32 implemented (89%)
  • 4 remain as tracked Phase 2 gaps: parse_recycle_bin_metadata (#54), parse_ie_history (#53), parse_outlook_dbx (#55), parse_usn_journal (post-release issue)

Added — test coverage

tests/test_parse_linux_dfir.py — 7 new tests covering auditd dispatcher format, http access combined format (Nikto UA + path traversal + shadow read), HISTTIMEFORMAT epoch parsing, per-hit required-keys contract, missing-file error contract, path traversal rejection. Total suite: 75 green (up from 68).

Added — sample evidence

examples/sample-evidence-realistic/linux/cron/sample.crontab — fixture exercising v0.6.1 parse_linux_cron_jobs with 4 suspicious patterns (remote-pipe-shell, exec from world-writable, reverse-shell oneliner, base64 obfuscation) plus benign baseline jobs.

Post-release counts

| Surface | Value |
|---|---|
| Native MCP functions | 72 (was 67) |
| Total ground-truth findings | 99 |
| Ground-truth coverage (implemented / expected) | 32 / 36 (89%) |
| Bundled case studies | 11 |
| Unit tests | 75 green (was 68) |

Verification

recall:                       1.000   (F-001 + F-013)
false_positive_rate:          0.000
hallucination_count:          0
evidence_integrity_preserved: true
self_correction_observed:     true

Compare: v0.7.0...v0.7.1

v0.7.0 — case-11 supply-chain/ESC8 + evidence schema fidelity

16 May 07:52

Highlights

Two major additions targeted at SANS FIND EVIL! 2026 submission.

case-11 supply-chain entry → AD certificate-services abuse

examples/case-studies/case-11-supplychain-ad-zeroday/ ships 12 ground-truth findings reproduced deterministically by seven MCP functions on bundled evidence. The chain:

  • Trojanized signed vendor binary (SolarWinds SUNBURST class entry, T1195.002)
  • Low-and-slow C2 beaconing with calibrated sub-SIEM-threshold cadence
  • PetitPotam (CVE-2021-36942) coercion of DC01$ (T1187)
  • ntlmrelayx --adcs relay to CA01 Web Enrollment endpoint (T1557.001)
  • Certificate issued for DC01$ under DomainController template (ESC8, T1649)
  • Rubeus asktgt /certificate + s4u /impersonateuser:domadmin (T1550.003)
  • 4624 type-9 NewCredentials on DC (S4U2self DA impersonation)
  • PsExec / wmiexec overpass-the-hash lateral to DC, file server, endpoint (T1021.002, T1021.006, T1550.002)
  • ntdsutil ifm create full (T1003.003) + mimikatz dcsync /user:krbtgt (T1003.006)
  • AdminSDHolder ACL modification (T1098.005 — self-healing privileged persistence via SDProp)
  • Golden Ticket forged with KRBTGT hash (T1558.001) used next morning
  • Three sequential wevtutil cl + EventID 1102 self-emission (T1070.001)

Chain composed entirely from public references (CISA AA20-352A, SpecterOps "Certified Pre-Owned", MS-EFSRPC CVE, MITRE T1098.005/T1003.006/T1558.001). All hosts/IPs/domain (ent.example.local)/SIDs are RFC1918/RFC5737/RFC2606 synthetic with zero cross-reference to any real environment.

Every sample evidence file enriched to native forensic-tool dump fidelity

Prior versions of sample-evidence-realistic/ files were too sparse to look like genuine forensic-tool captures. This release replaces every file with the on-disk schema produced by the corresponding real tool — without breaking any detection.

| Surface | Now matches output of |
|---|---|
| Windows event logs | EvtxECmd (full EVTX field set, ms timestamps, consistent SIDs) |
| Network flows | Zeek conn.log (uid, ja3, ja3s, tls_version, http_method, user_agent) |
| $MFT | MFTECmd 25-column (both 0x10 SI and 0x30 FN timestamps, USN, LSN, SecurityId) |
| Shellbags | SBECmd (BagPath, NodeSlot, AbsolutePath, LastInteracted, HasExplored) |
| Run keys / services / shimcache | RECmd / AppCompatCacheParser |
| Prefetch | PECmd JSON (Volumes, FilesLoaded, run times) |
| Chrome History | Hindsight (transition, danger_type, opened, referrer, etag) |
| Linux journal | systemd-journald (__REALTIME_TIMESTAMP, _BOOT_ID, _MACHINE_ID, _AUDIT_LOGINUID) |
| Linux auditd | SYSCALL+EXECVE+CWD+PATH+PROCTITLE+USER_LOGIN+CRED_ACQ+USER_CMD+USER_AUTH |
| macOS unified log | log show (thread, type, subsystem, category, sender) |
| macOS FSEvents | FSEventsParser (id, mask, flags, inode, node_id, sha256_at_event) |
| Memory image info | winpmem metadata (kernel_base, KDBG offset, physical layout, yara hits) |

Fixed

  • setupapi.dev.log was missing from realistic variant — agent F-013 IP-KVM detection silently failed and dropped recall to 0.5 on --variant realistic. Restored with full setupapi log fidelity around the IP-KVM (VID 0557 PID 2419 ATEN) signal.

Post-release counts

| Surface | Value |
|---|---|
| Native MCP functions | 67 |
| Total ground-truth findings | 99 |
| ↳ Layer 1 (8 cases: 01–07 + 11) | 69 |
| ↳ Layer 2 (3 cases: 08 CFReDS, 09 Hadi, 10 M57) | 30 |
| Bundled case studies | 11 |
| Evidence files in realistic variant | 49 |
| MITRE ATT&CK tactic coverage | 11 of 12 |
| Unit tests | 68 green |

Verification

recall:                      1.000   (F-001 + F-013)
false_positive_rate:         0.000
hallucination_count:         0
evidence_integrity_preserved: true
self_correction_observed:    true
audit_chain_length:          3 entries, SHA-256-linked

Full Changelog

See CHANGELOG.md for the complete diff.

Compare: v0.6.1...v0.7.0

v0.6.1 — macOS quarantine + Linux cron + DNS tunneling

14 May 09:07

Three new native MCP functions, plus the Single-Source-of-Truth cleanup that closes the v0.6.0 drift loop.

Added

| Function | Purpose | MITRE |
|---|---|---|
| parse_macos_quarantine | macOS LSQuarantineEvent reader: download provenance, non-browser downloader flagging, pastesite/raw-IP/darknet origin detection | T1204, T1566.002, T1105 |
| parse_linux_cron_jobs | Enumerate /etc/crontab, cron.d/, cron.{hourly,daily,weekly,monthly}/, /var/spool/cron/ and flag curl-pipe-shell, base64 decode, @reboot triggers, /tmp/*.sh, netcat listeners | T1053.003, T1059.004, T1546 |
| detect_dns_tunneling | DNS query log analysis (BIND9/dnsmasq/generic): Shannon entropy + long-label + rare-qtype + volume + Iodine/dnscat2 signatures. Opens TA0011 (Command-and-Control) coverage | T1071.004, T1568.002, T1572 |
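
Two of the signals listed for detect_dns_tunneling, Shannon entropy and long labels, are easy to show in isolation. The sketch below scores a single query name; the thresholds (3.5 bits per character, 40 characters) and the function names are assumed values chosen for illustration, not the detector's actual configuration.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Entropy in bits per character of the character distribution in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def score_query(qname, entropy_threshold=3.5, max_label_len=40):
    """Flag a DNS name whose longest label looks encoded or oversized.

    Both thresholds are hypothetical defaults for this sketch.
    """
    labels = qname.rstrip(".").split(".")
    longest = max(labels, key=len)
    entropy = shannon_entropy(longest)
    return {
        "longest_label": longest,
        "entropy": entropy,
        "high_entropy": entropy > entropy_threshold,
        "long_label": len(longest) > max_label_len,
    }
```

Encoded payloads tend to spread evenly over a large alphabet, which pushes per-label entropy well above that of ordinary hostnames; combining it with label length and volume reduces false positives from CDN-style names.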

17 new unit tests in test_v06_macos_linux.py. Full test suite passes on a clean clone.

Fixed

  • CI workflow (ci.yml), examples/sift-adapter-demo.sh, and scripts/install.sh no longer hardcode native/total counts. Drift-safe invariant checks (count > 0, native + sift == total, no forbidden tool names) replaced exact-count assertions.
  • This was the root cause of ten consecutive failed CI runs between v0.6.0 (2026-05-13) and the SoT cleanup commit on 2026-05-14.
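
The drift-safe invariants described above can be expressed without any hardcoded totals. The following sketch is one way to write them; the function name `check_surface` is hypothetical, and the forbidden names are the destructive operations mentioned in the v1.1.0 notes.

```python
# Destructive operations named in the v1.1.0 notes as absent from the wire.
FORBIDDEN = {"execute_shell", "write_file", "mount"}


def check_surface(native, sift, all_tools):
    """Return a list of invariant violations (empty list means the surface is sane)."""
    errors = []
    if len(all_tools) == 0:
        errors.append("tool count must be > 0")
    if len(native) + len(sift) != len(all_tools):
        errors.append("native + sift must equal total")
    leaked = sorted(FORBIDDEN & set(all_tools))
    if leaked:
        errors.append(f"forbidden tool names registered: {leaked}")
    return errors
```

Because nothing in the check depends on an exact number, adding a tool changes no assertion, while registering a destructive tool or miscounting a tier still fails the run.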

Changed

  • Companion repo agentic-dart-collector-adapter flipped from Apache-2.0 to MIT for ecosystem consistency.
  • Hardcoded counts removed from ~25 locations across README body, docs, wiki, and profile surfaces. Numbers now live only in: README L92+L259 Hero, DEVPOST_SUBMISSION.md, DEMO_STORYBOARD.md, and tests/test_mcp_surface.py canonical name set.

Surface

Runtime list_tools() returns the typed read-only MCP surface (45 native pure-Python forensic functions + 25 SIFT Workstation adapters). The canonical name set is asserted in tests/test_mcp_surface.py::test_registered_tools_are_exact_set.


Full changelog: CHANGELOG.md

v0.5.4 — NIST CFReDS Hacking Case integration

12 May 00:24

NIST CFReDS Hacking Case integration — external benchmark validation

This release adds external benchmark validation against the NIST CFReDS "Hacking Case" (Greg Schardt / Mr. Evil) — a community-trusted forensic dataset with published ground-truth answers.

Highlights

  • 🆕 New primitive: parse_registry_hive (general native registry hive parser)
  • 🆕 New case study: case-08 (CFReDS Hacking Case full traversal)
  • 📊 3-tier accuracy evaluation now documented in docs/accuracy-report.md:
    | Tier | Dataset | Recall (v0.5.4) |
    |---|---|---|
    | 1 | Synthetic reference (CI baseline) | 1.000 / FPR=0.000 |
    | 2 | Noise-injected realistic (~1:30 IOC:benign) | 1.000 / FPR=0.000 |
    | 3 | NIST CFReDS Hacking Case | 0.50 strict / 0.80 lenient |
  • 🚀 5× jump in strict CFReDS recall over v0.5.3 (0.10 strict / 0.40 lenient) after parse_registry_hive shipped, which unlocked 4 findings at once (closes #52)
  • 43/43 tests pass on Python 3.10/3.11/3.12/3.13 matrix
  • 📦 61 MCP tools (36 native + 25 SIFT adapters), all read-only

Why this matters

Synthetic recall=1.000 by itself looks too good to be true. v0.5.4 lets us state honestly that external benchmark recall is 0.50/0.80, and trace the remaining gap to specific paradigm differences — turning "registry parsing is on the wishlist" into "registry parsing unlocks 4 measured findings, ship next."

What's next (Phase 2)

  • #53 IE6 index.dat parser
  • #54 Recycle Bin INFO2 parser
  • #55 Bundled YARA rule library
  • #47 Additional external datasets (Ali Hadi, DFRWS, BOTS)

Reference