Self-learning loop
How Agentic-DART improves its own analysis quality from its own execution traces — without fine-tuning, and without ever loosening the read-only guarantee.
This is a Phase 2 design note, not shipped code. It documents the next architectural bet: that an agent which already records a perfect, structured trace of everything it did can be made to compound — to get measurably better at the cases it just got wrong.
A static agent starts every run from the same playbook. It never sharpens on the case it just missed. A senior analyst is the opposite: each investigation tightens their instincts, and that compounding is the expertise.
The bet here is that the compounding can be mechanized as in-context learning over the agent's own audit trail. The SHA-256 ledger is already a complete, machine-readable record of what the agent did; scored against ground truth, it also shows whether the agent was right. That is exactly the substrate a learning loop needs.
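To illustrate why a hash-chained ledger is a trustworthy substrate, here is a minimal sketch of a SHA-256 chain in which each entry commits to the one before it. The entry layout (`prev`, `payload`, `hash`), the all-zero genesis value, and the example tool names are assumptions made for illustration; the actual dart-audit record format is not described in this note.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a new entry that commits to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


ledger = []
append(ledger, {"tool": "list_artifacts", "ok": True})  # hypothetical tool name
append(ledger, {"tool": "read_artifact", "ok": True})   # hypothetical tool name
print(verify(ledger))
```

Because every hash depends on all earlier entries, a reflection step that reads the ledger can first confirm it has not been altered since the run.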
- Run: the agent works a case (or the whole benchmark) and writes its audit chain, progress log, and findings.
- Reflect: score the findings against ground truth (self-eval cases ship truth.json; real cases use the analyst's verdict) for recall, precision, and hallucination, per case. Then locate where it missed or went down a wrong path, and which tool sequence would have caught it.
- Extract: turn those reflections into durable, human-readable heuristics (new dart-playbook entries, refined dart-corr thresholds, case-class hints). These are text, not weights. The artifact is a reviewable diff.
- Loop: re-run the benchmark with the updated heuristics. Keep the change only if it improves the aggregate without regressing any case.
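The Reflect step can be sketched as a per-case scoring function. This is a minimal sketch under stated assumptions: findings and ground truth are modeled as flat sets of identifiers, and a "hallucination" is counted as any reported finding absent from ground truth. The real truth.json schema and the project's exact hallucination definition are not given here, so both are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScore:
    recall: float        # share of ground-truth items the agent found
    precision: float     # share of reported findings that are in ground truth
    hallucinations: int  # assumed definition: findings with no ground-truth match


def score_case(findings: set, truth: set) -> CaseScore:
    """Score one case; empty denominators are treated as a perfect score."""
    hits = findings & truth
    recall = len(hits) / len(truth) if truth else 1.0
    precision = len(hits) / len(findings) if findings else 1.0
    return CaseScore(recall, precision, len(findings - truth))


def missed(findings: set, truth: set) -> list:
    """Ground-truth items the agent did not report: the input to Extract."""
    return sorted(truth - findings)


# Illustrative identifiers only, not taken from any real case.
reported = {"persistence-run-key", "lateral-psexec", "bogus-c2-beacon"}
expected = {"persistence-run-key", "lateral-psexec", "log-clearing", "exfil-rclone"}
print(score_case(reported, expected))
print(missed(reported, expected))
```

The `missed` list is what a reflection would reason over when asking which tool sequence would have surfaced the overlooked items.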
No model weights change. The learning lives in the playbooks and configs the agent reads at the start of each run.
- Every behavior change is a text diff, attributable to the reflection that produced it. Fine-tuning would bury the lesson in opaque weights — the opposite of this project's thesis that DFIR reasoning belongs in inspectable architecture, not hidden state.
- Evidence-Driven Development (EDD): the benchmark is the test suite. A heuristic ships only if it improves the measured benchmark score.
Every loop iteration is a commit.
- After re-running the benchmark, a guard compares per-case recall to the previous commit.
- If any case regresses (or aggregate recall drops), the guard reverts the commit automatically and logs why.
- Net effect: the playbook can only ratchet upward. A bad "lesson" can never silently degrade the agent.
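The guard described above can be sketched as a pure comparison of two per-case recall tables. The function name, the dict-of-floats input shape, and the use of an unweighted mean as the "aggregate" are assumptions for illustration; the actual guard and its revert mechanics are not specified in this note.

```python
from typing import Dict, List, Tuple


def check_regression(
    previous: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 1e-9,
) -> Tuple[bool, List[str]]:
    """Return (keep, reasons). keep is False when the commit should be reverted."""
    reasons: List[str] = []
    for case in sorted(previous):
        if case not in current:
            reasons.append(f"{case}: no result in current run")
        elif current[case] < previous[case] - tolerance:
            reasons.append(f"{case}: recall {previous[case]:.3f} -> {current[case]:.3f}")
    if previous and current:
        # Assumed aggregate: unweighted mean recall across cases.
        old_mean = sum(previous.values()) / len(previous)
        new_mean = sum(current.values()) / len(current)
        if new_mean < old_mean - tolerance:
            reasons.append(f"aggregate: mean recall {old_mean:.3f} -> {new_mean:.3f}")
    return (not reasons, reasons)


# Illustrative numbers only.
before = {"case-a": 0.8, "case-b": 0.5}
after = {"case-a": 0.7, "case-b": 0.9}
keep, why = check_regression(before, after)
print(keep, why)
```

Note that the second run has a higher mean but is still rejected, because one case got worse. That per-case check is what makes the ratchet strict.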
The loop is meant to run continuously (overnight, or on every push), which would be expensive on a metered API. So the workload is split by model and credential:
| Workload | Model | Credential |
|---|---|---|
| The always-on learning loop | Haiku | OAuth subscription token — flat cost, runs as long as you let it |
| Real-evidence analysis + headline benchmark numbers | Sonnet / Opus | Metered API key — reserved for top-tier reasoning |
dart-auth (v1.2.0) already resolves the right credential per model, so the loop and the benchmark coexist without manual token juggling.
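The table can be read as a small routing rule from model family to credential type. The sketch below is not dart-auth's API, which is not shown in this note; the function name, the credential labels, and the substring matching on the model name are all hypothetical.

```python
# Assumed mapping, transcribed from the table above.
ROUTES = {
    "haiku": "oauth_subscription",   # always-on learning loop, flat cost
    "sonnet": "metered_api_key",     # real-evidence analysis, headline numbers
    "opus": "metered_api_key",
}


def credential_for(model: str) -> str:
    """Pick a credential type from the model family found in the model name."""
    name = model.lower()
    for family, credential in ROUTES.items():
        if family in name:
            return credential
    raise ValueError(f"no credential route for model {model!r}")


print(credential_for("haiku"))
print(credential_for("opus"))
```

Failing loudly on an unknown model, rather than falling back to a default, keeps a typo from silently sending loop traffic to the metered key.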
A single long-running supervised process holds the loop: it wakes on a trigger (timer or push), runs one Run → Reflect → Extract → Loop cycle, commits or reverts, then sleeps.
- It is supervised: every cycle is logged to the same append-only ledger, and a human reviews the accumulated diffs before they reach main.
- It has no new privileges: it calls the same read-only MCP surface the agent always uses. It can edit playbooks and configs (text); it can never touch evidence.
The loop changes how the agent reasons (playbooks, thresholds), never what it can do to evidence.
- The destructive-function exclusion (asserted by CI as an exact set) still holds: a self-written heuristic cannot add a destructive tool, because the registry is asserted, not appended to.
- The worst case of a bad lesson is a regression the guard catches and reverts.
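The phrase "asserted, not appended to" can be illustrated with an exact-set check: the registered tools must equal a fixed allowlist, so an extra tool fails the check just as a missing one does. The tool names and the function below are hypothetical; the project's real CI assertion and MCP function catalog are not reproduced here.

```python
# Hypothetical read-only tool names, for illustration only.
ALLOWED_TOOLS = frozenset({"list_artifacts", "read_artifact", "hash_file"})


def assert_registry_exact(registered) -> None:
    """Fail unless the registry equals the allowlist exactly (no extras, no gaps)."""
    registered = frozenset(registered)
    extra = sorted(registered - ALLOWED_TOOLS)
    missing = sorted(ALLOWED_TOOLS - registered)
    if extra or missing:
        raise AssertionError(f"registry drift: extra={extra} missing={missing}")


# Passes: the registry matches the allowlist exactly.
assert_registry_exact(["hash_file", "list_artifacts", "read_artifact"])

# Fails: a heuristic that tried to register a destructive tool is rejected.
try:
    assert_registry_exact(["hash_file", "list_artifacts", "read_artifact", "delete_file"])
except AssertionError as err:
    print(err)
```

An equality check differs from a subset check in one important way: nothing the loop writes can widen the surface, because any addition is itself a test failure.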
- dart-audit: the structured record the loop reflects on.
- dart-playbook: where extracted heuristics land.
- dart-corr: where refined correlation thresholds land.
- The benchmark (scripts/eval/): the test that gates every lesson.
Design for Phase 2 — agentic detection engineering. Phase 1 already ships the parts it needs — the audit chain, the benchmark harness, model-aware authentication, and the YAML playbooks — so the loop is an assembly of existing pieces, not new infrastructure.
See also: Architecture-first vs prompt-first · The Memex Bet
Agentic-DART — autonomous DFIR agent · architecture-first, not prompt-first · MIT license · github.com/Juwon1405/agentic-dart