v1.0.1 — Platform overhaul: run_eval CLI, tiered case layout, OS-aware installer

Juwon1405 released this 11 Jun 00:01
· 135 commits to main since this release

Highlights

  • run_eval.py — the new primary user-facing command. Live mode only: fails fast with an actionable message when ANTHROPIC_API_KEY is unset; discovers cases dynamically from both tiers; writes out/<tier>/<case-id>/<timestamp>/{findings,report,summary}.json.
  • Tiered, self-contained case studies: examples/case-studies/self-evaluation/case-01..08 and external-evaluation/case-01..03 (NIST CFReDS, Ali Hadi, Digital Corpora M57-Patents/Jo). Index-only folder names, truth.json per case, canonical bundled evidence at self-evaluation/case-01/evidence_root/. The public --variant selector is gone.
  • OS-aware installer: scripts/install.sh --os auto|ubuntu|centos|macos, venv-first, clones and installs the collector adapter, with optional SIFT (--install-sift, via cast) and Eric Zimmerman Tools (--install-eztools, .NET 9 builds, URLs validated before download). Also adds a root requirements.txt and an API-free scripts/healthcheck.py.
  • Downloader hardening — browser-like headers on every request (incl. resumed range requests), pure-Python streaming split-image reassembly, --dry-run / --check-urls.
  • Hardening (earlier in this line) — MCP call_tool() schema validation before dispatch, Plaso outputs isolated to DART_DERIVED_ROOT, benchmark summary no longer fabricates rows, hallucination scoring requires resolvable audit IDs.
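The fail-fast check and output layout described for run_eval.py can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names (require_api_key, run_dir, artifact_paths), the error wording, and the timestamp format are all assumptions; only the ANTHROPIC_API_KEY variable and the out/<tier>/<case-id>/<timestamp>/{findings,report,summary}.json layout come from the notes above.

```python
import os
import sys
from datetime import datetime, timezone
from pathlib import Path


def require_api_key(env=os.environ):
    """Return the API key, or exit with an actionable message if it is missing.

    Hypothetical helper: the real run_eval.py may word or structure this differently.
    """
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    if not key:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(
            "ANTHROPIC_API_KEY is not set. "
            "Export it in your shell and re-run run_eval.py."
        )
    return key


def run_dir(out_root, tier, case_id, now=None):
    """Build out/<tier>/<case-id>/<timestamp>/ for one evaluation run."""
    now = now or datetime.now(timezone.utc)
    # Assumed timestamp format; the release notes do not specify one.
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return Path(out_root) / tier / case_id / stamp


def artifact_paths(directory):
    """Map each of the three artifact names to its JSON path in the run directory."""
    return {name: directory / f"{name}.json" for name in ("findings", "report", "summary")}


if __name__ == "__main__":
    require_api_key()
    target = run_dir("out", "self-evaluation", "case-01")
    for name, path in artifact_paths(target).items():
        print(name, path)
```

Checking the key before any case discovery or evidence processing keeps a misconfigured environment from producing partial output directories.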

Measured QA at this tag

  • Full pytest suite green (tests/ + dart_corr/tests/); benchmark-integrity and CI workflows green on this commit.
  • scripts/measure_accuracy.py: recall 1.0, FPR 0.0, hallucinations 0, evidence integrity preserved (67 files).
  • validate_ground_truth.py: FAIL 0 (6 documented external-tier warnings).
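For readers unfamiliar with the metrics reported above, the standard definitions of recall and false-positive rate are sketched below. How scripts/measure_accuracy.py actually derives its numbers is not described in these notes, so the function names and the counts in the usage lines are purely hypothetical.

```python
def recall(true_positives, false_negatives):
    """Fraction of ground-truth findings that were reported: TP / (TP + FN)."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def false_positive_rate(false_positives, true_negatives):
    """Fraction of benign items wrongly flagged: FP / (FP + TN)."""
    total = false_positives + true_negatives
    return false_positives / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not taken from the project.
    print(recall(true_positives=10, false_negatives=0))
    print(false_positive_rate(false_positives=0, true_negatives=25))
```

Under these definitions, a recall of 1.0 means no ground-truth finding was missed, and an FPR of 0.0 means no benign item was flagged.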

Known limitations

  • The adapter's --source image (Velociraptor dead-disk) path is covered by mocked end-to-end tests and has not been exercised against a live Velociraptor binary in CI.
  • External-tier evaluations require a one-time multi-GB dataset download; no external-dataset accuracy numbers are claimed at this tag.

Full details: CHANGELOG.md