Skip to content

Architecture first vs prompt first

Bang Juwon edited this page May 14, 2026 · 6 revisions

Architecture-first vs prompt-first

This is the central design claim of the project. It deserves its own page.

The claim

Safety properties of an LLM agent should be enforced by the shape of what it can call, not by what we tell it not to do.

What "prompt-first" looks like (and why it fails)

A prompt-first agent looks like this:

You are a forensic analyst. You MUST NOT modify any evidence.
You MUST NOT execute shell commands. You MUST NOT write to disk.
You have access to a tool called `query_evidence(sql)` that
runs arbitrary SQL against the case database.

The failure mode is obvious to anyone who has watched an LLM under deadline pressure:

  1. The agent sees a hard problem.
  2. The system prompt is far away in the context.
  3. The MCP surface offers query_evidence(sql).
  4. The agent emits query_evidence("UPDATE findings SET ...") because, mechanically, that is a thing the surface can do.
  5. The "MUST NOT" was a guideline. The MCP server didn't actually stop it.

Prompt-first safety is convention. Any sufficiently motivated agent — or a confused one, or one that got bad training data, or a prompt-injected one — will defeat it.

What "architecture-first" looks like

An architecture-first agent looks like this:

  • The MCP surface is exactly the typed MCP surface, by name. They are read-only by construction.
  • execute_shell, write_file, mount, eval, exec, spawn, system, os.system, subprocess.run — none of them exist on the surface.
  • Trying to call any of them produces ToolNotFound. Verified by an explicit bypass test (tests/test_mcp_bypass.py).
  • The evidence directory is mounted read-only at the OS level. Even if a function in dart-mcp had a bug, the mount would refuse the write.
  • Every call goes through _safe_resolve which rejects path-traversal attempts (.., absolute paths outside EVIDENCE_ROOT, NUL byte truncation).

The agent cannot modify evidence — not because the prompt told it not to, but because the function does not exist and the filesystem is mounted ro.

That's the difference. A guarantee, not a guideline.

The bypass test, concretely

tests/test_mcp_bypass.py actively tries to bypass each of the architectural guarantees:

def test_unregistered_destructive_function_raises_ToolNotFound():
    """Even if the agent emits 'execute_shell' or similar,
    dart-mcp must refuse with a hard error."""
    for forbidden in ["execute_shell", "eval"]:
        try:
            call_tool(forbidden, {"cmd": "rm -rf /"})
        except KeyError as e:
            assert "ToolNotFound" in str(e)
            continue
        raise AssertionError(f"forbidden function {forbidden} is somehow exposed")

This is not a code-coverage test. It is the architectural claim made executable. Every PR that lands on main must pass this. Any contribution that adds a function to the MCP surface must add a corresponding bypass test.

Why this matters for hackathon judges

Anyone can write a system prompt that says "be safe". Architectural guarantees are testable. The judges don't have to take the project's word that the agent is read-only — they can clone the repo, run bash examples/demo-run.sh, and watch the bypass test fire on the way through.

What architecture cannot do

Honest accounting:

  • Architecture cannot prevent the agent from drawing wrong conclusions. That's an accuracy concern, not a safety one. dart-corr and the playbook help, but the agent can still be confidently wrong.
  • Architecture cannot prevent leakage. If a tool legitimately reads data and the agent puts it in the report, the data is in the report. Confidentiality is a separate concern, addressed by what evidence you choose to mount.
  • Architecture cannot self-update. New attack patterns require new tools. Phase 2 — Sigma synthesis — is partly about closing this gap by giving the agent a way to propose new detections without granting it write access to existing ones.

Further reading


← Back to Home

Agentic-DART

Concepts

The 5 packages

Reference

Running it

Case studies

Project


Project links

Clone this wiki locally