Self learning loop

The self-learning loop (Phase 2 design)

How Agentic-DART improves its own analysis quality from its own execution traces — without fine-tuning, and without ever loosening the read-only guarantee.

This is a Phase 2 design note, not shipped code. It documents the next architectural bet: that an agent which already records a complete, structured trace of everything it did can be made to compound, that is, to get measurably better at the cases it just got wrong.

Why

A static agent starts every run from the same playbook. It never sharpens on the case it just missed. A senior analyst is the opposite: each investigation tightens their instincts, and that compounding is the expertise.

The bet here is that the compounding can be mechanized as in-context learning over the agent's own audit trail. The SHA-256 ledger is already a complete, machine-readable record of what the agent did; scored against ground truth, it also shows whether the agent was right. That is exactly the substrate a learning loop needs.

The loop: Run → Reflect → Extract → Loop

Run: the agent works a case (or the whole benchmark) and writes its audit chain, progress log, and findings.
Reflect: score the findings against ground truth (self-eval cases ship truth.json; real cases use the analyst's verdict), computing recall, precision, and hallucination rate per case. Then locate where the agent missed a finding or went down a wrong path, and which tool sequence would have caught it.
Extract: turn those reflections into durable, human-readable heuristics: new dart-playbook entries, refined dart-corr thresholds, case-class hints. These are text, not weights, so the artifact is a reviewable diff.
Loop: re-run the benchmark with the updated heuristics. Keep the change only if it improves the aggregate without regressing any case.
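
The scoring half of the Reflect step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's scorer: it assumes findings and ground truth both reduce to sets of comparable identifiers (the real truth.json schema is not described in this note), and it treats a hallucination as any reported finding absent from ground truth, which makes the rate the complement of precision. The project may define hallucination more narrowly. The names CaseScore and score_case are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScore:
    case_id: str
    recall: float
    precision: float
    hallucination_rate: float
    missed: frozenset  # ground-truth items the agent never reported


def score_case(case_id, findings, truth):
    """Score one case; `findings` and `truth` are iterables of finding IDs (assumed)."""
    findings, truth = set(findings), set(truth)
    true_positives = findings & truth
    # With no expected findings, nothing could be missed, so recall is 1.0.
    recall = len(true_positives) / len(truth) if truth else 1.0
    # With no reported findings, nothing reported was wrong, so precision is 1.0.
    precision = len(true_positives) / len(findings) if findings else 1.0
    # Assumed definition: a hallucination is a reported finding absent from truth.
    hallucination_rate = len(findings - truth) / len(findings) if findings else 0.0
    return CaseScore(
        case_id, recall, precision, hallucination_rate, frozenset(truth - findings)
    )


score = score_case("case-01", {"a", "b", "x"}, {"a", "b", "c", "d"})
print(score.recall, sorted(score.missed))
```

The missed set is what the second half of Reflect would work from: each missed item is a prompt to ask which tool sequence would have surfaced it.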

In-context, not fine-tuned

No model weights change. The learning lives in the playbooks and configs the agent reads at the start of each run.

Every behavior change is a text diff, attributable to the reflection that produced it. Fine-tuning would bury the lesson in opaque weights — the opposite of this project's thesis that DFIR reasoning belongs in inspectable architecture, not hidden state.
Evidence-Driven Development (EDD): the benchmark is the test suite. A heuristic only ships if it moves the measured number.
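
To make "every behavior change is a text diff" concrete, here is a small sketch using Python's standard difflib module. The playbook content and step names (parse_logons, correlate_logons_with_usb_events) are invented for illustration and do not come from the real dart-playbook files; heuristic_diff is a hypothetical helper.

```python
import difflib


def heuristic_diff(before: str, after: str, path: str = "playbook.yaml") -> str:
    """Return a unified diff between two versions of a playbook's text."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical playbook text before and after one extracted lesson.
BEFORE = "steps:\n  - parse_logons\n"
AFTER = "steps:\n  - parse_logons\n  - correlate_logons_with_usb_events\n"

print(heuristic_diff(BEFORE, AFTER))
```

A reviewer sees one added line and can trace it back to the reflection that proposed it, which is the property fine-tuning would lose.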

The recall-regression guard (Git rollback)

Every loop iteration is a commit.

After re-running the benchmark, a guard compares per-case recall to the previous commit.
If any case regresses (or aggregate recall drops), the guard reverts the commit automatically and logs why.
Net effect: on the benchmark, the playbook can only ratchet upward. A bad "lesson" cannot silently degrade the agent's measured recall.
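
The guard's decision rule might look like the sketch below. It assumes per-case recall is available as a mapping from case ID to a float, and that the aggregate is the unweighted mean; both are assumptions, as are the function names. The Git side is left to the caller.

```python
def regressions(previous: dict, current: dict, tolerance: float = 1e-9) -> list:
    """List case IDs whose recall dropped, or that vanished from the current run."""
    dropped = []
    for case_id, old_recall in previous.items():
        new_recall = current.get(case_id)
        if new_recall is None or new_recall < old_recall - tolerance:
            dropped.append(case_id)
    return sorted(dropped)


def aggregate(recalls: dict) -> float:
    """Mean per-case recall (assumed aggregate; the project may weight cases)."""
    return sum(recalls.values()) / len(recalls) if recalls else 0.0


def guard_decision(previous: dict, current: dict) -> tuple:
    """Return ("keep" or "revert", reason) for one loop iteration."""
    dropped = regressions(previous, current)
    if dropped:
        return "revert", f"recall regressed on: {', '.join(dropped)}"
    if aggregate(current) < aggregate(previous) - 1e-9:
        return "revert", "aggregate recall dropped"
    return "keep", "no case regressed"


print(guard_decision({"case-01": 0.5, "case-02": 1.0},
                     {"case-01": 1.0, "case-02": 0.5}))
```

On a "revert" decision, one plausible implementation would run git revert --no-edit HEAD and write the reason to the ledger. A case missing from the current run is treated as a regression here so that a lesson cannot pass the guard by dropping a hard case.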

Cost model — always-on without burning API spend

The loop is meant to run continuously (overnight, or on every push), which would be expensive on a metered API. The split:

Workload	Model	Credential
The always-on learning loop	Haiku	OAuth subscription token — flat cost, runs as long as you let it
Real-evidence analysis + headline benchmark numbers	Sonnet / Opus	Metered API key — reserved for top-tier reasoning

dart-auth (v1.2.0) already resolves the right credential per model, so the loop and the benchmark coexist without manual token juggling.
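
As an illustration of per-model credential routing, consider the sketch below. It is not dart-auth's API, which this note does not describe: the environment variable names (DART_OAUTH_TOKEN, DART_API_KEY), the routing table, and both functions are hypothetical.

```python
import os

# Hypothetical routing table mirroring the split above; not dart-auth's real config.
CREDENTIAL_ENV_BY_TIER = {
    "haiku": "DART_OAUTH_TOKEN",  # flat-cost subscription token (name assumed)
    "sonnet": "DART_API_KEY",     # metered API key (name assumed)
    "opus": "DART_API_KEY",
}


def credential_env_for(model: str) -> str:
    """Return the name of the environment variable holding this model's credential."""
    name = model.lower()
    for tier, env_var in CREDENTIAL_ENV_BY_TIER.items():
        if tier in name:
            return env_var
    raise ValueError(f"no credential route for model {model!r}")


def resolve_credential(model: str, environ=os.environ) -> str:
    """Look up the credential, failing loudly rather than falling back to another one."""
    env_var = credential_env_for(model)
    try:
        return environ[env_var]
    except KeyError:
        raise RuntimeError(f"{env_var} is not set (needed for {model})") from None


print(credential_env_for("Haiku"))
```

Failing loudly on a missing credential, rather than falling back, keeps the always-on loop from quietly spending against the metered key.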

Keep-alive architecture

A single long-running supervised process holds the loop: it wakes on a trigger (timer or push), runs one Run → Reflect → Extract → Loop cycle, commits or reverts, then sleeps.

It is supervised — every cycle is logged to the same append-only ledger, and a human reviews the accumulated diffs before they reach main.
It has no new privileges: it calls the same read-only MCP surface the agent always uses. It can edit playbooks and configs (text); it can never touch evidence.
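
The wake, cycle, commit-or-revert, sleep shape can be sketched as a small supervisor. Everything here is an assumption made for illustration: the callback names, the boolean "improved" result, and the timer-based trigger. The real process would plug in the benchmark run, the regression guard, Git, and the ledger.

```python
import time
from typing import Callable


def supervise(
    wait_for_trigger: Callable[[], bool],
    run_cycle: Callable[[], bool],
    commit: Callable[[], None],
    revert: Callable[[], None],
    log: Callable[[str], None],
) -> int:
    """Run cycles until the trigger source says stop; return the cycle count."""
    cycles = 0
    while wait_for_trigger():        # blocks (sleeps) until a timer tick or push
        cycles += 1
        try:
            improved = run_cycle()   # one Run -> Reflect -> Extract -> Loop pass
        except Exception as exc:     # a failed cycle must not kill the supervisor
            log(f"cycle {cycles}: error {exc!r}; reverting")
            revert()
            continue
        if improved:
            commit()
            log(f"cycle {cycles}: kept")
        else:
            revert()
            log(f"cycle {cycles}: reverted")
    return cycles


def timer_trigger(interval_seconds: float, max_cycles: int) -> Callable[[], bool]:
    """Trigger that fires every `interval_seconds`, `max_cycles` times in total."""
    remaining = max_cycles

    def wait() -> bool:
        nonlocal remaining
        if remaining <= 0:
            return False
        remaining -= 1
        time.sleep(interval_seconds)
        return True

    return wait
```

Passing commit, revert, and log in as callbacks keeps the supervisor itself free of privileges: it can only do what the injected functions allow, which matches the constraint that the loop edits text and never touches evidence.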

Safety — the read-only guarantee is untouched

The loop changes how the agent reasons (playbooks, thresholds), never what it can do to evidence.

The destructive-function exclusion (asserted by CI as an exact set) still holds: a self-written heuristic cannot add a destructive tool, because the registry is asserted, not appended to.
The worst case of a bad lesson is a regression the guard catches and reverts.
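
The difference between asserting a registry and appending to one can be shown in a few lines. The tool names below are invented for illustration (the real catalog lives in dart-mcp), and check_registry is a hypothetical stand-in for the CI assertion.

```python
# Hypothetical tool names; the real registry is defined by dart-mcp.
EXPECTED_TOOLS = frozenset({"list_artifacts", "read_artifact", "hash_artifact"})


def check_registry(registered) -> None:
    """Fail unless the registered tool names equal the expected set exactly."""
    registered = frozenset(registered)
    unexpected = registered - EXPECTED_TOOLS
    missing = EXPECTED_TOOLS - registered
    if unexpected or missing:
        raise AssertionError(
            f"registry drift: unexpected={sorted(unexpected)} missing={sorted(missing)}"
        )


check_registry(["list_artifacts", "read_artifact", "hash_artifact"])  # passes
```

An equality check is stricter than a denylist: a denylist only catches destructive names someone anticipated, while an exact set fails on any addition, whatever it is called.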

How it ties together

dart-audit — the structured record the loop reflects on.
dart-playbook — where extracted heuristics land.
dart-corr — where refined correlation thresholds land.
The benchmark (scripts/eval/) — the test that gates every lesson.

Status

Design for Phase 2 — agentic detection engineering. Phase 1 already ships the parts it needs — the audit chain, the benchmark harness, model-aware authentication, and the YAML playbooks — so the loop is an assembly of existing pieces, not new infrastructure.

See also: Architecture-first vs prompt-first · The Memex Bet

Agentic-DART: autonomous DFIR agent · architecture-first, not prompt-first · MIT license · github.com/Juwon1405/agentic-dart



