History / Accuracy

Revisions

docs(accuracy): move benchmark scores to docs/benchmarks single source of truth; keep invariants (hallucination 0, read-only boundary, needle-in-haystack)

Juwon1405 committed Jun 15, 2026

57aa9fb
docs(wiki): fix remaining stale CFReDS case path and v1.0.0-as-current in Glossary

Juwon1405 committed Jun 11, 2026

3a0ca09
docs(wiki): align Accuracy/Home with canonical evidence and tiered cases Remove the public --variant / sample-evidence-realistic concept from Accuracy (single canonical evidence_root + CI fixture), retier the case tables to self-evaluation/external-evaluation, fix case links to the new index-only paths, rename ground-truth.json to truth.json, and drop a stale tool-count. Dated historical roadmap entries in Phase-1 keep their original case numbers.

Juwon1405 committed Jun 10, 2026

37bd1cc
docs: align wiki with current live-mode scope Document live mode through ANTHROPIC_API_KEY and --dry-run, remove public zero-cost/OAuth setup claims, and update Claude MCP registration to dart_mcp.server_stdio. Refresh accuracy evidence counts to 62 reference files and 67 realistic files, clarify that the measured identical result applies to case-01 F-001/F-013, and remove stale 50-file language. Update operator, SIFT, macOS, roadmap, and Phase 1 pages to the 72-tool surface and current full-suite validation model without stale 35-tool or 75-test guidance. Fix the Home architecture link and describe external entries as case-study slots instead of fully measured benchmark rows. QA: git diff --check passed for the wiki.

Juwon1405 committed Jun 10, 2026

f9dc340
docs(wiki): fix remaining stale 'the 60' (Live-mode) and sonnet-4 model name (Accuracy)

Juwon1405 committed Jun 5, 2026

c6b8958
docs(wiki): align Accuracy and Glossary with the realistic-variant enrichment design - Accuracy.md: the realistic row claimed the generator synthesizes "security events 516" -- it does not; the security EventLog is hand-curated at ~11,530 lines. Only the two IOC-only logs (web access, unix auth) are noise-injected. Dropped the "production-shape / production-noise-injected" overstatement (the enriched ratio is ~1:30). - Glossary.md: the MCP surface is 47 native + 25 SIFT = 72, not "72 native".

Juwon1405 committed Jun 5, 2026

2ff2513
wiki QA pass: file count 49->50, test count 31->75 (current snapshots only) post-v0.7.1 QA audit caught two latent drifts: evidence file count: - Accuracy.md L64 sample-evidence-realistic '49 files' was correct at the v0.7.0 evidence-fidelity enrichment time but v0.7.1 added linux/cron/sample.crontab fixture, raising the count to 50. measure_accuracy --variant realistic now reports evidence_files_measured: 50 against ground truth F-001 + F-013, which matches the actual repo state. test count: - Operator-guide.md L55 step-by-step quick-start - Phase-1.md L50 Empirical-validation 'fresh clone' summary - Roadmap.md L60 Phase-1 validation summary - Running-on-macOS.md L57 step header + L134 Apple Silicon notes all said '31 tests' (the v0.5.2 snapshot baseline). v0.7.1 ships '75 of 75 tests passing'. updated only the present-tense fresh-clone claims; the historical v0.5.2 release row in Phase-1.md L109 ('-> 31 tests passing') is preserved verbatim as a dated milestone.

Juwon1405 committed May 16, 2026

af7ec2b
wiki: sync to v0.7.1 — 11 cases, 72 MCP functions, case-11 highlight - Accuracy.md: '61 files' -> '49 files'; new v0.7.0 section covering case-11 supply-chain attack class; new v0.7.0 case-library summary table (11 cases / 99 findings split 69 layer-1 + 30 layer-2 + 32/36 function coverage) - Glossary.md: 'As of v0.6.0' -> 'As of v0.7.1: 72 native MCP tools' - Home.md: case-studies section rewritten to mention 11 cases / 99 findings plus case-11 as recommended judge walkthrough - MCP-function-catalog.md: previously missed v0.6.1 functions (parse_macos_quarantine, parse_linux_cron_jobs, detect_dns_tunneling) + v0.7.1 functions (parse_linux_text_log, parse_linux_shell_history) now properly documented with MITRE technique mappings and references - Phase-1.md: timeline extended with v0.5.4, v0.6.0, v0.6.1, v0.7.0, v0.7.1 milestones deliberately not touched — these are version-anchored historical records: v0.5.4 CFReDS section (locked at first external benchmark), playbook 'target_case_classes: 10 case classes' (playbook scenario classes, not evidence cases), v0.4 / v0.5 release rows.

Juwon1405 committed May 16, 2026

141623f
wiki(qa-r18): v0.5.4 CFReDS Hacking Case section + 36-function bypass test Paired with main repo r18 commit (3b69129). == Updated == ### Accuracy.md - Bypass test: 'documented 35-function set' -> 'documented 36-function set' - New section: 'v0.5.4 — External benchmark: NIST CFReDS Hacking Case' with strict/lenient recall comparison (v0.5.3 0.10/0.40 -> v0.5.4 0.50/0.80) and the paradigm-gap explanation - New 'See also' subsection linking to case-08 README + closed issue #52 + open Phase 2 issues #53/#54/#55 == Why this is in wiki not just main repo == Reviewers reach the wiki via the GitHub right-rail link first, often before they read the README. The wiki Accuracy page has been the 'source of truth for measurement claims' since v0.5.0; v0.5.4 keeping it current with the CFReDS results is non-negotiable per the 4-surface sync rule. == Verified == - All cross-references resolve: case-08 README path, issue #52/53/54/55 links, parse_registry_hive wiki anchor - No drift between this Accuracy page and docs/accuracy-report.md (main repo) or the social-surface advertising (profile + pages)

Juwon1405 committed May 9, 2026

94cb6f3
wiki(qa-r17): two evidence variants + corrected ground-truth count + methodology disclosure Pairs with main repo commit 58b3e5c (v0.5.3). == Why == Round 17 addresses a fair reviewer concern: 'recall=1.0 measured on a 30-line file is not strong evidence — every line is an IOC.' That is correct. The fix is to ship two evidence variants and disclose methodology explicitly, then point at Phase 2 for third-party benchmarking. == Fixed == ### Accuracy.md — '12/12 ground-truth findings' over-claim Bundled find-evil-ref-01 case has exactly TWO ground-truth findings (F-001 amcache anomaly + F-013 USB persistence), not twelve. The '12/12' was a typo or stale value from an earlier draft. Verified against scripts/measure_accuracy.py output: 'true_positives': ['F-001', 'F-013']. Fixed. ### Accuracy.md — '8 files' SHA-256 evidence integrity claim The actual file count walked by measure_accuracy.py's evidence digest map is 61, not 8. Fixed in the table. == Added == ### Two evidence variants — section explaining why both ship Variant A — examples/sample-evidence/ (deterministic) Variant B — examples/sample-evidence-realistic/ (~1:30 IOC:benign) Both score the same ground truth and produce identical headline numbers (recall=1.0 / FPR=0.0 / hallucination=0). The realistic variant rules out the 'small-input over-fit' failure mode by demonstrating the same recall on web-log 1027 lines (37× noise), security events 516 (32× noise), unix auth 517 (29× noise). ### Phase 2 third-party dataset benchmarking Linked to issue #47 (NIST CFReDS, Ali Hadi, DFRWS, Splunk BOTS). Phase 1 establishes the methodology; Phase 2 operationalizes it on community-trusted datasets. == Verified == - Both variant invocations work end-to-end and produce the headline numbers as reported (rerun: 2026-05-09) - 61-file count matches measure_accuracy.py output - Issue #47 exists with workstream split and acceptance criteria

Juwon1405 committed May 9, 2026

8f2b35e
wiki(qa-r10): kill function-signature + file-existence hallucinations across 6 pages Pairs with main repo commit 8a1917b. Round 10 was a 'judge follows every advertised command line by line' pass — surfaced 6 distinct hallucinations a SANS judge would have hit if they tried to reproduce anything from the wiki. == Defects fixed == ### Accuracy.md — broken script reference Advertised 'bash scripts/run-accuracy-suite.sh'. That script doesn't exist and never has. The actual reproducer is 'python3 scripts/measure_accuracy.py' with the standard PYTHONPATH export. A judge running the README's accuracy claim through this page would have hit: bash: scripts/run-accuracy-suite.sh: No such file or directory Replaced with the real measure_accuracy.py invocation, which was verified end-to-end (recall=1.0, FPR=0.0, hallucination_count=0, evidence_integrity_preserved=true). ### Case-PtH-Timestomp.md — 3 function-signature errors All three are the same class of mistake — the wiki cited positional/keyword args that don't exist on the actual MCP tools: 'dart-agent --hunt' → 'python3 -m dart_agent --case ... --out ... --mode deterministic' 'get_process_tree(host=...)' → 'get_process_tree(process_csv=...)' 'analyze_windows_logons(host=...)' → 'analyze_windows_logons(security_events_json=...)' 'parse_prefetch(target=...)' → 'parse_prefetch(prefetch_path=...)' These same mistakes live in docs/case-pth-timestomp.md (fixed in the paired repo commit). Verified by pulling live inputSchema.required from list_tools() for each tool. ### dart-agent.md — run_loop() and 4 fictional files The page advertised: - 'run_loop() in dart_agent/src/dart_agent/__init__.py' - A file inventory citing loop.py, decision.py, hypothesis.py, serializer.py — none of which exist. The actual structure is __init__.py + __main__.py + live.py. The senior-analyst loop is the DeterministicAnalyst class's .run() method (4 internal phases: _phase_timeline → _phase_hypothesis → _phase_validate_usb → _phase_finalize). Rewrote both the 'What it owns' bullet and the Files block to match reality. Added an explanatory note that the agent is small enough to keep its control flow in __init__.py. ### dart-audit.md — 3 hallucinations in one example The advertised AuditLogger.log() example used: - outputs={...} — actual kwarg is 'output' (singular) - cpu_ms=42 — no such kwarg - bytes_read=1024 — no such kwarg Real signature is: log(tool_name, inputs, output, iteration, token_count_in, token_count_out, finding_ids=None) Same page advertised audit_id type as 'UUID4' — actual is 8-character hex (secrets.token_hex(4)). Same page advertised 'output/<run_id>/<audit_id>.json' as the per-call output storage location — that directory layout doesn't exist; outputs are referenced by SHA-256 digest only in deterministic mode. Fixed all three. Verified the corrected example works as a copy-paste — wrote a test audit log, verified the chain, ran CLI (verify + trace) all green. ### dart-corr.md — serializer.py hallucination Page claimed UNRESOLVED contradictions are blocked by 'the serializer (dart_agent/serializer.py)'. There is no serializer.py file. The blocking happens inside DeterministicAnalyst's finding emission path in __init__.py. Rewrote the sentence to point at the real location. ### Live-mode.md — 2 hallucinations in the headline example - '--evidence /mnt/case-evidence' — no such CLI flag. Real pattern is 'export DART_EVIDENCE_ROOT=/path' before invoking the agent. - 'Claude sees exactly 35 typed forensic functions' — should be 60 (35 native + 25 SIFT adapters). Stale from the v0.4 surface, missed in earlier rounds because Live-mode.md wasn't part of the surface-count grep targets. Fixed both. Added an explicit '(Add --dry-run to use a scripted mock Claude with no API key)' line for CI / offline reproduction. == Verification approach == For each defect: 1. Read the wiki claim 2. Pulled the actual code/schema (inputSchema, argparse output, filesystem ls, AuditLogger signature via inspect) 3. Compared advertised ↔ actual 4. Fixed the wiki, then re-verified the fixed example by either running it (Accuracy.md, dart-audit.md) or by checking it would no longer raise on a copy-paste == Pattern internalised == Round 9 caught output-key hallucinations in code examples. Round 10 caught argument-name hallucinations and file-path hallucinations in tutorial prose — a different surface that print-output dry-runs don't cover. Going forward, any wiki/docs page that references a function by name + signature should be diff-checked against the live inputSchema.required list whenever the underlying code changes.

Juwon1405 committed May 8, 2026

1c089f4
wiki: add 12 missing pages, fix all 32 broken links The wiki sidebar and Home page referenced 13 pages that didn't exist, producing the GitHub 'create new page' UI when clicked. Adds: Concepts: Glossary — DFIR / agent / MCP terms The 5 packages: dart-agent — senior-analyst wrapper loop dart-corr — cross-artifact correlation engine dart-audit — SHA-256 chained audit log dart-playbook — YAML sequencing rules (dart-mcp already existed) Reference: Comparison — vs Velociraptor / Plaso / EZ tools / SOAR / vanilla LLMs Running it: Running-on-SIFT — SANS SIFT VM 5-minute setup Running-on-macOS — macOS-specific mount conventions Live-mode — real Claude API + MCP stdio integration Case studies: Case-PtH-Timestomp — Pass-the-Hash + timestomp pre-existence Case-IP-KVM — IP-KVM remote-hands insider scenario Writing-case-studies — guide for contributing new case studies Project: Accuracy — reproducible accuracy methodology + numbers The Roadmap-Phase-2/3/4 links in Home.md were repointed to the existing Roadmap page's anchors (those were never separate pages). The Contributing link in dart-mcp.md now points to CONTRIBUTING.md in the main repo. _Sidebar.md restructured into 6 named sections so the 25-page wiki is navigable. Final broken-link count: 0.

Juwon1405 committed Apr 30, 2026

b73bb8e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

History / Accuracy

Revisions