
Graph Agents

Alex Clarke edited this page May 18, 2026 · 11 revisions

Graph-based agents are a declarative, YAML-driven workflow engine layered on top of Loki's existing agent system. Where a normal agent runs as a single LLM loop driven by tool calls, a graph agent is a directed graph of typed nodes. Each node performs one well-defined step (call an LLM, run a script, ask the user a question, spawn a child agent, etc.) and routes to the next node based on its result.

Graph agents are best for workflows that:

  • Have a fixed shape (e.g. parse -> query -> grade -> synthesize -> verify)
  • Mix LLM calls with deterministic steps (scripts, user prompts)
  • Need explicit human-in-the-loop checkpoints
  • Benefit from per-step model / tool / temperature overrides

If you just want an agent that takes a goal and figures out the steps on its own, stick with a regular agent.


Directory Structure

A graph agent is defined by a single graph.yaml. It holds both the agent-level config (model, tools, MCP servers) and the workflow:

<loki-config-dir>/agents
    └── my-graph-agent
        ├── graph.yaml           # agent config + workflow definition
        ├── tools.sh             # optional custom tools
        ├── <rag-node-id>.yaml   # auto-built knowledge base for a rag node
        └── scripts/             # optional script-node implementations
            ├── decide.py
            └── verify.py

<rag-node-id>.yaml files are generated by Loki at agent load time - one per rag node - and should not be hand-edited.

An agent directory must contain either a config.yaml (a normal LLM-loop agent; see Agents) or a graph.yaml (a graph agent), never both. The presence of graph.yaml is what marks an agent as a graph agent; when Loki runs it, execution is driven entirely by the graph.

Having both files present is an error: if an agent directory contains both config.yaml and graph.yaml, Loki refuses to load it and tells you to remove one. Pick the model that fits: config.yaml for an open-ended LLM-loop agent, graph.yaml for a fixed-shape workflow.


graph.yaml Top-Level Fields

name: my-graph-agent
description: |
  Plain prose describing what the workflow does.
version: "1.0"

# --- agent-level config ---
model: anthropic:claude-sonnet-4-6   # default model for llm nodes
temperature: 0.0                     # default sampling temperature
top_p: null                          # default sampling top-p
global_tools:                        # global tools available to nodes
  - web_search_loki.sh
mcp_servers:                         # MCP servers available to nodes
  - pubmed-search
conversation_starters:               # suggested prompts in the UI
  - "Research WebAssembly outside of the browser"

settings:
  max_loop_iterations: 100     # PER-NODE visit cap; default 100 (see below)
  log_state_snapshots: true    # log state JSON before each node executes
  validate_before_run: true    # run the graph validator on startup
  timeout: 600                 # optional overall timeout in seconds

initial_state:                 # optional seed state for the run
  topic: "auth"

start: parse_input             # required: ID of the first node to run

nodes:
  parse_input: { ... }
  ...
  • version: Currently only "1.0" is accepted by the parser. Anything else fails at startup. This is the graph schema version, not your agent's version.
  • Agent-level config fields (model, temperature, top_p, global_tools, mcp_servers, conversation_starters) are all optional. These are the same fields a normal agent's config.yaml carries; in a graph agent they live at the top of graph.yaml instead. model / temperature / top_p act as the defaults for llm nodes that don't set their own. global_tools and mcp_servers define the tool universe that an llm node's tools: whitelist selects from (a node with no tools: field gets none of them).
  • can_spawn_agents is derived, not declared. A graph agent can spawn child agents iff its graph contains at least one agent node. You don't set a flag. The agent node's presence is the declaration.
  • max_loop_iterations: This is a per-node visit cap, not a total graph-step cap. If the same node id is entered more than this many times, execution aborts with Node 'X' visited N times (max_loop_iterations=...). Default: 100.
  • timeout: Wall-clock cap on the entire graph run. The executor checks this between every node transition; nodes that block longer than the timeout will still finish before the check fires.
  • initial_state: A JSON-compatible object. Values are seeded into state before any node runs and are referenced from any node via {{key}} templates.
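
The difference between a per-node visit cap and a whole-run step budget, and the fact that the timeout is only checked between transitions, can be sketched as follows. This is a minimal illustration and not Loki's executor: run, GraphAbort, and the step callback are hypothetical names, and the timeout message is invented for the example.

```python
import time
from collections import Counter


class GraphAbort(Exception):
    """Raised when this hypothetical executor stops the run."""


def run(start, step, max_loop_iterations=100, timeout=None, clock=time.monotonic):
    """Walk nodes from `start`; `step(node_id)` returns the next id, or None to stop."""
    visits = Counter()
    started = clock()
    node, path = start, []
    while node is not None:
        # The timeout is only checked between transitions, so a node that is
        # already running is never interrupted by this check.
        if timeout is not None and clock() - started > timeout:
            raise GraphAbort(f"graph timed out after {timeout}s")
        visits[node] += 1
        # The cap applies per node id; it is not a budget for the whole run.
        if visits[node] > max_loop_iterations:
            raise GraphAbort(
                f"Node '{node}' visited {visits[node]} times "
                f"(max_loop_iterations={max_loop_iterations})"
            )
        path.append(node)
        node = step(node)
    return path
```

With max_loop_iterations: 3, a two-node loop a -> b -> a aborts when a is entered for the fourth time, even though the run has taken only seven steps in total.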

{{initial_prompt}}: Automatically Seeded

When Loki invokes a graph agent with a user prompt (whether from the command line loki -a my-agent "what is X?", from the REPL, or from a parent agent that spawned it as a sub-agent), the dispatcher automatically seeds the prompt text into state under the key initial_prompt before any node runs.

This means every graph agent's first node can reference the user's request via {{initial_prompt}}:

parse_input:
  id: parse_input
  type: llm
  prompt: "{{initial_prompt}}"     # the user's command-line / REPL text
  ...

You do not need to (and should not) put initial_prompt in initial_state as it is overwritten by the dispatcher.


Node Types

There are seven node types: agent, script, approval, input, llm, rag, and end. Every node has these common fields:

my_node:
  id: my_node               # must match the map key
  type: <one of the seven>
  description: optional      # free-form
  next: another_node         # optional default next node; semantics vary per type

The next field defines the default routing edge. Node types interpret it differently (some types ignore it in favor of internal routing; see each type below).


agent

Spawns a Loki sub-agent and waits for it to finish. This is how a graph agent delegates a sub-goal to a fully autonomous Loki agent (with its own tool loop and configuration).

research_topic:
  id: research_topic
  type: agent
  agent: deep-researcher          # name of an existing Loki agent
  prompt: "Research {{topic}}"    # interpolated against state
  timeout: 600                    # optional, in seconds (default 300)
  state_updates:
    findings: "{{output}}"
  output_schema: { ... }          # optional, see "Structured Output" below
  next: render
  • agent: Name of the child agent to spawn. Must exist in <loki-config-dir>/agents/.
  • prompt: The user message sent to the child agent. Templated against the current graph state.
  • timeout: Hard wall-clock cap. If the child agent exceeds it, the whole graph fails (no built-in fallback path on agent nodes).
  • state_updates: Map of state_key: "{{template}}". The child agent's final text is available inside this map as {{output}}.

script

Runs a Bash, Python, or TypeScript script and merges its JSON-object stdout into state. Script files live under the agent's scripts/ directory.

Supported extensions and runtimes:

Extension   Runtime invoked     Notes
.sh         bash <script>
.py         python3 <script>    Not python; must be Python 3
.ts         npx tsx <script>    Requires Node and tsx available on PATH

.js / .mjs / other extensions are not supported. The shebang line inside the script is not used for script-node dispatch (it is for normal custom-tools); the file extension is the source of truth.
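
Extension-based dispatch as described above can be sketched like this. The helper name script_command is hypothetical, and treating the extension match as case-sensitive is an assumption; only the three extension-to-runtime pairs come from the table.

```python
from pathlib import Path

# Extension -> command prefix, taken from the table above.
RUNTIMES = {
    ".sh": ["bash"],
    ".py": ["python3"],
    ".ts": ["npx", "tsx"],
}


def script_command(script_path):
    """Return the argv for a script node (hypothetical helper, not Loki's API)."""
    ext = Path(script_path).suffix  # assumption: matched case-sensitively
    if ext not in RUNTIMES:
        # The shebang is never consulted; an unknown extension is simply rejected.
        raise ValueError(f"unsupported script extension: {ext or '(none)'}")
    return RUNTIMES[ext] + [script_path]
```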

route_after_parse:
  id: route_after_parse
  type: script
  script: scripts/route_after_parse.py
  timeout: 30                     # seconds, default 30
  fallback: handle_error          # optional: where to route on script failure
  state_updates:                  # applied after stdout merge
    last_run: "{{some_value}}"

The script receives the current state through one of two environment variables, depending on the state's serialized size:

Env var            Contents
GRAPH_STATE        Inline JSON when serialized state is <= 32 KiB
GRAPH_STATE_FILE   Path to a temp JSON file when serialized state exceeds 32 KiB

Exactly one of the two is set per script invocation; always check both. The temp file (when used) is cleaned up automatically after the graph finishes.

The script must print a single JSON object on stdout. All keys merge into state; the reserved _next key is extracted and overrides the default next routing.

#!/usr/bin/env python3
import json, os

def load_state():
    if path := os.environ.get("GRAPH_STATE_FILE"):
        with open(path) as f:
            return json.load(f)
    return json.loads(os.environ.get("GRAPH_STATE", "{}"))

state = load_state()
codes = (state.get("web_search_results") or "").strip()
next_node = "query_db" if codes else "ask_for_code"
print(json.dumps({"_next": next_node, "trimmed_codes": codes}))

Tolerant-fail: if the script exits non-zero or produces invalid JSON, the node routes to fallback (if set) or to next (if set). Without either, the graph errors.


approval

Prompts the user with a question and a list of options, then routes based on their answer. This is the human-in-the-loop checkpoint.

approve:
  id: approve
  type: approval
  question: |
    Final report:
    {{report}}

    Approve?
  options:
    - "yes"
    - "no"
  routes:
    "yes": end_accepted
    "no": end_rejected
  on_other: clarify                # Required - see below
  state_updates:
    decision: "{{choice}}"

The on_other field

This field is required and easy to miss. Loki's user__ask tool always gives the user a "type your own answer" option in addition to the listed options. There is no way to disable this. Without on_other, a user who types something other than the listed options would crash the graph at runtime.

on_other says where to route when the user's answer does not match any routes key. The free-form text they typed is available downstream via the {{choice}} template variable inside state_updates.

Common patterns:

  • Free-form means "I want to clarify" -> on_other: clarify_node where clarify_node is an input or llm node that processes their text.
  • Free-form means "rejection by default" -> on_other: end_rejected.

input

Collects a free-form string from the user.

ask_for_code:
  id: ask_for_code
  type: input
  question: "Enter a search term:"
  default: "{{last_used_code}}"   # optional, interpolated against state
  validation: "len(input) > 0"    # optional, see below
  state_updates:
    web_search_result: "{{input}}"
  next: query_db
  • default: If the user submits an empty response, this template is used. It is interpolated against state on its own, independently of question (which is also templated).
  • validation: A length predicate of the form len(input) <op> <integer>, where <op> is >, >=, <, <=, or ==. This is a deliberately narrow grammar; regex / type / range validation are not yet supported. If validation fails, the node fails (no fallback).
  • The user's text is exposed to state_updates as {{input}}.
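
To make the validation grammar concrete, here is a minimal sketch of a checker for len(input) <op> <integer>. It is not Loki's parser: check_length is a hypothetical name, tolerating extra whitespace is an assumption, and so is measuring length in characters (Python's len) instead of bytes.

```python
import operator
import re

_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}
# Two-character operators come first so ">=" is not read as ">" followed by "=".
_RULE = re.compile(r"^\s*len\(input\)\s*(>=|<=|==|>|<)\s*(\d+)\s*$")


def check_length(rule, text):
    """Evaluate a length predicate against the user's text."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"unsupported validation rule: {rule!r}")
    op, limit = match.groups()
    return _OPS[op](len(text), int(limit))
```

Anything outside the grammar (a regex, a type check, a range) is rejected by this sketch, which mirrors why richer checks belong in a follow-up script node.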

llm

A one-shot LLM call with an optional bounded tool-call loop. Unlike agent nodes, this does NOT spawn a sub-agent; it runs in a fresh isolated context with a caller-supplied system prompt and user prompt. Tool access is strictly opt-in: an llm node gets no tools at all unless its tools field explicitly lists them (see below).

grade_research:
  id: grade_research
  type: llm
  instructions: |               # optional system prompt
    You decide whether research is needed for {{topic}}.
  prompt: |                     # required user prompt
    Research context:
    {{research_text}}

    Reply with YES or NO.
  tools: []                     # see below
  model: anthropic:haiku        # optional override
  temperature: 0.0
  top_p: null
  max_attempts: 1               # transient-error retries (default 1)
  max_iterations: 10            # tool-call-loop turn cap (default 10)
  fallback: skip                # routes here if all attempts fail
  state_updates:
    grade: "{{output}}"
  output_schema: { ... }        # optional, see "Structured Output" below
  timeout: 120                  # optional; node wall-clock cap in seconds (unset = no timeout)
  next: synthesize

The tools field (whitelist)

The tools field is a strict opt-in whitelist: an llm node receives only the tools it explicitly lists, never the agent's full tool set. Three modes:

  • Unset (field omitted) -> no tools. The LLM produces output but cannot make any tool calls. This is identical to tools: []. Leaving the field out does not inherit the agent's tools.
  • tools: [] -> no tools. Same as unset.
  • tools: [a, b, mcp:server-name] -> only those specific tools, and nothing else. Entries are either exact tool names (matching global_tools, agent custom tools, or individual MCP function names) or the shorthand mcp:<server-name> (which enables all functions for that MCP server).

Even when tools lists entries, the LLM receives exactly that set. The whitelist is enforced against global tools, agent custom tools, and MCP alike. Each entry is validated at startup against the active agent's tool list; an unknown entry is a startup error.
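
The three whitelist modes and the mcp:<server-name> shorthand can be sketched as a small resolver. This is an illustration under stated assumptions: resolve_tools and its argument shapes are hypothetical, and the MCP function names used in the usage note are invented.

```python
def resolve_tools(whitelist, tool_names, mcp_servers):
    """Return the exact tools an llm node may call.

    whitelist:   the node's tools field (None when the field is omitted)
    tool_names:  set of exact names (global tools, custom tools, MCP functions)
    mcp_servers: dict of server name -> list of that server's function names
    """
    allowed = []
    for entry in whitelist or []:  # unset and [] both mean "no tools"
        if entry.startswith("mcp:"):
            server = entry[len("mcp:"):]
            if server not in mcp_servers:
                raise ValueError(f"unknown MCP server in tools whitelist: {entry}")
            allowed.extend(mcp_servers[server])  # shorthand enables every function
        elif entry in tool_names:
            allowed.append(entry)
        else:
            raise ValueError(f"unknown tool in tools whitelist: {entry}")
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(allowed))
```

For example, with a server pubmed-search exposing two made-up functions search_articles and get_abstract, tools: [web_search_loki.sh, mcp:pubmed-search] would resolve to those three tools and nothing else.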

Tolerant-fail routing

Outcome                      Routes to
Success                      next
Failure WITH fallback set    fallback
Failure WITHOUT fallback     next (output is "LLM node failed: ...")

state_updates are always applied (success or failure). On failure, {{output}} resolves to an error description so downstream nodes can detect it.

Retries (max_attempts)

max_attempts retries the LLM call only on transient errors, that is, when the failure message contains one of: timed out, rate limit, 429, Connection reset, Connection refused, or produced no output. Any other error fails immediately without consuming further attempts. The default is 1 (no retries).
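
The retry rule can be sketched as a substring check plus a bounded loop. The marker strings are the ones listed above; the function names are hypothetical, and matching them case-sensitively is an assumption about behavior the page does not specify.

```python
TRANSIENT_MARKERS = (
    "timed out",
    "rate limit",
    "429",
    "Connection reset",
    "Connection refused",
    "produced no output",
)


def is_transient(message):
    """True when the failure message contains one of the listed markers."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


def call_with_retries(call, max_attempts=1):
    """Run `call`, retrying only transient failures, up to max_attempts total tries."""
    last_error = None
    for _ in range(max(1, max_attempts)):
        try:
            return call()
        except Exception as err:
            last_error = err
            if not is_transient(str(err)):
                break  # non-transient: fail now, do not use further attempts
    raise last_error
```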


rag

Runs a hybrid (vector + keyword) retrieval against a per-node knowledge base and writes the result into state. This is how a graph agent does Retrieval-Augmented Generation: the rag node retrieves context, downstream llm/agent nodes inject it into their prompts via normal templating.

research_context:
  id: research_context
  type: rag
  documents:                    # required; the knowledge sources
    - ./knowledge/
    - https://example.com/spec
  query: "{{initial_prompt}}"   # templated; defaults to "{{initial_prompt}}"
  top_k: 5                      # optional; default = the knowledge base's own top_k
  timeout: 120                  # optional; retrieval timeout in seconds (default 120)
  state_updates:                # required in practice (see below)
    rag_context: "{{output.context}}"
    rag_sources: "{{output.sources}}"
  next: answer

answer:
  id: answer
  type: llm
  prompt: |
    Use this context to answer:
    {{rag_context}}

    Question: {{initial_prompt}}
  • documents: Knowledge sources: files, directories, URLs, or loader-protocol paths. Required. It's what makes the node a rag node. Relative paths resolve against the agent's directory.
  • query: The retrieval query, templated against state. Defaults to {{initial_prompt}}. Set it to {{refined_query}} to retrieve against a query an upstream llm node produced.
  • top_k: Number of chunks to retrieve. Defaults to the knowledge base's own configured top_k.
  • timeout: Retrieval timeout in seconds. Default 120.
  • state_updates: Where the result goes. A rag node with no state_updates discards its result (the validator warns).

Knowledge-base build config (all optional; used only when the knowledge base is first built):

  • embedding_model: Embedding model for the corpus.
  • chunk_size: Document chunk size.
  • chunk_overlap: Overlap between chunks.
  • reranker_model: Reranker applied to hybrid-search results.
  • batch_size: Embedding-request batch size.

Each falls back to the app-level rag_* config when omitted. When embedding_model, chunk_size, and chunk_overlap are all set, the knowledge base builds with no interactive prompts, so a fully-specified rag node works in non-interactive runs.

{{output}} shape

Inside state_updates, {{output}} is a JSON object:

{
  "context": "[Source: ./knowledge/a.md]\n...chunk...",
  "sources": ["./knowledge/a.md", "https://example.com/spec"]
}
  • {{output.context}}: The retrieved context block, ready to inject into a prompt.
  • {{output.sources}}: An array of source paths; {{output.sources[0]}} indexes individual sources (useful for downstream citation/verification nodes).

Knowledge base lifecycle

Each rag node's knowledge base is built once, at agent load time, into <agent-dir>/<node-id>.yaml:

  • If that file exists -> it is loaded (no prompt; works non-interactively).
  • If it's missing and the node is fully specified (embedding_model + chunk_size + chunk_overlap all set) -> it is built directly, no prompts. Works in non-interactive runs.
  • If it's missing, not fully specified, and Loki is interactive -> you are asked to initialize it, then prompted for the missing build values; declining is a hard error.
  • If it's missing, not fully specified, and Loki is non-interactive (no TTY) -> hard error, with a hint to set the build-config fields or run the agent once interactively.

A graph with a rag node whose knowledge base isn't built cannot run. This is deliberate fail-fast behavior. (In --info mode the agent is only inspected, not run, so knowledge-base building is skipped entirely.)
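
The four lifecycle cases reduce to a small decision function. The sketch below is illustrative only: kb_action and its return labels are hypothetical, and it leaves out the --info shortcut.

```python
REQUIRED_BUILD_FIELDS = ("embedding_model", "chunk_size", "chunk_overlap")


def kb_action(kb_file_exists, build_config, interactive):
    """Decide what happens to a rag node's knowledge base at agent load time."""
    if kb_file_exists:
        return "load"      # existing file: no prompt, works non-interactively
    fully_specified = all(
        build_config.get(field) is not None for field in REQUIRED_BUILD_FIELDS
    )
    if fully_specified:
        return "build"     # built directly, no prompts
    if interactive:
        return "prompt"    # ask to initialize; declining is a hard error
    return "error"         # no TTY and missing build values: fail fast
```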

Retrieval

Retrieval at execution time is fast (no re-embedding of the corpus). It's the same hybrid vector + keyword search normal Loki RAG uses. The corpus embedding/chunking cost is paid once, at load time.


end

Terminates execution and returns a final result.

end_accepted:
  id: end_accepted
  type: end
  output: |
    Approved report:
    {{report}}
  state_updates:                # optional last state mutations
    completed_at: "now"
  • output: Templated against state, printed as the graph's final result.
  • Multiple end nodes are fine; upstream routing determines which one a given run reaches.

State and Template Syntax

Graph state is a serde_json::Value map. Templates use {{path}} syntax inside any string field.

Form                     Resolves to
{{key}}                  top-level value
{{a.b.c}}                nested object path
{{arr[0]}}               array index
{{matrix[0][1]}}         nested array indices
{{users[0].name}}        object field via index
{{a.b.arr[2].field}}     mixed path

Rendering rules per value type:

  • String -> as-is
  • Number / bool / null -> stringified (true, 42, null)
  • Array / Object -> JSON-encoded compactly (["a","b"], {"k":"v"})

Missing keys / paths behave differently per template-evaluation site:

  • Inside a node's primary fields (prompt, instructions, question, output) -> strict mode, missing keys raise an error.
  • Inside state_updates values -> lenient mode, missing keys become empty strings.
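
The path forms, rendering rules, and strict versus lenient modes above can be sketched in a few lines of Python. This is a reading of the documented behavior, not Loki's implementation (which is built on serde_json): lookup, render_value, and interpolate are hypothetical names, and tolerating whitespace inside the braces is an assumption.

```python
import json
import re

_TEMPLATE = re.compile(r"\{\{\s*(.*?)\s*\}\}")
_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")  # an object key, or an [index]


def lookup(state, path):
    """Resolve a path such as a.b.arr[2].field; raise KeyError if any step is missing."""
    segments = _SEGMENT.findall(path)
    if not segments:
        raise KeyError(path)
    current = state
    for key, index in segments:
        if key:
            if not isinstance(current, dict) or key not in current:
                raise KeyError(path)
            current = current[key]
        else:
            i = int(index)
            if not isinstance(current, list) or i >= len(current):
                raise KeyError(path)
            current = current[i]
    return current


def render_value(value):
    """Strings pass through; everything else is rendered as compact JSON."""
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)


def interpolate(template, state, strict=True):
    """Strict mode raises on a missing path; lenient mode substitutes an empty string."""
    def replace(match):
        try:
            return render_value(lookup(state, match.group(1)))
        except KeyError:
            if strict:
                raise
            return ""
    return _TEMPLATE.sub(replace, template)
```

In this sketch, strict=True corresponds to primary fields such as prompt or question, and strict=False to state_updates values.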

state_updates

Every node type (except end, which has a slightly different shape) accepts an optional state_updates map:

state_updates:
  some_key: "{{template}}"
  other_key: "literal text with {{var}}"

After the node body executes, each template is interpolated against state and the result is stored under the corresponding key. Three scoped variables are available only inside state_updates:

Variable     Available in      Resolves to
{{output}}   agent, llm, rag   The node's primary text output (or the parsed JSON value if output_schema is set); for rag nodes, the object with context and sources
{{choice}}   approval          The option the user picked, or their free-form text
{{input}}    input             The user's text (or interpolated default if they submitted empty)

These variables are cleared after state_updates runs, so they don't leak into the next node's templates.

End nodes are different. An end node's state_updates runs with plain lenient interpolation. There is no scoped {{output}} because there is no node-body output to scope. After state_updates are applied, the end node's own output template is interpolated against the resulting state and returned as the graph's final result.


Routing & Tolerant-Fail

Nodes route via three mechanisms in priority order:

  1. Script _next override: script nodes can set "_next": "node_id" in their stdout JSON to dynamically choose the next node.
  2. Internal routing: approval routes via its routes map (or on_other when the answer matches no listed option).
  3. Default next edge: the next field on the node.
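
The priority order can be written as a single function. This is a sketch with hypothetical names (choose_next, and nodes represented as plain dicts mirroring the YAML fields), not the executor's actual code.

```python
def choose_next(node, script_next=None, answer=None):
    """Pick the next node id using the three mechanisms in priority order."""
    # 1. A script node's "_next" key overrides everything else.
    if node["type"] == "script" and script_next is not None:
        return script_next
    # 2. Approval nodes route internally via routes / on_other and ignore "next".
    if node["type"] == "approval":
        return node["routes"].get(answer, node["on_other"])
    # 3. Otherwise follow the static default edge.
    target = node.get("next")
    if target is None and node["type"] != "end":
        raise RuntimeError(f"node '{node['id']}' has no route to a next node")
    return target  # None for an end node: the graph terminates
```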

Routing requirements per node type

Node type   Needs next?
agent       Yes - next is required (unless the agent node is unreachable). Error at runtime if missing.
script      Either _next from script output OR static next (or fallback on failure). Error if neither.
approval    No - routing is via routes and on_other. next is ignored.
input       Yes - next is the success route.
llm         Yes - next is the success route (and the default for failures without fallback).
rag         Yes - next is required. Error at runtime if missing.
end         No - terminal.

Tolerant-fail contract

Currently honored by script and llm nodes:

  • Success -> default routing
  • Failure with fallback set -> fallback target
  • Failure without fallback -> default routing, with the error description exposed in state so the next node can react

agent and input nodes do NOT have a tolerant-fail fallback path; their failures propagate as graph failures.
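
As a rough sketch of the contract, assuming nodes are plain dicts mirroring the YAML fields (route_after and the error messages are hypothetical):

```python
TOLERANT_TYPES = {"script", "llm"}


def route_after(node, succeeded):
    """Return the next node id after a node finishes, honoring tolerant-fail."""
    if succeeded:
        return node.get("next")
    if node["type"] not in TOLERANT_TYPES:
        # agent and input failures propagate as graph failures.
        raise RuntimeError(f"node '{node['id']}' failed; graph aborted")
    # Failure: prefer fallback, else continue along the default edge.
    target = node.get("fallback") or node.get("next")
    if target is None:
        raise RuntimeError(f"node '{node['id']}' failed with no fallback or next")
    return target
```

The sketch omits dynamic _next routing and the error text that a failed llm node exposes through {{output}}.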


Structured Output (output_schema)

Both llm and agent nodes can specify an output_schema field: a JSON Schema (written inline in YAML) describing the expected shape of the node's output:

extract_task:
  id: extract_task
  type: llm
  prompt: 'Parse: "{{raw_task}}"'
  output_schema:
    type: object
    properties:
      action: { type: string }
      items:
        type: array
        items: { type: string }
      time_minutes: { type: ["integer", "null"] }
      priority:
        type: string
        enum: [low, medium, high]
    required: [action, items, priority]

When output_schema is set:

  1. The node body runs normally.
  2. Fast path: the raw text output is first parsed as JSON (after light cleanup of markdown code fences). If parsing succeeds, that's the structured output.
  3. Otherwise Loki invokes a built-in __structured_output__ role (constructed inline; not visible in the user's role list) to extract a JSON object matching the schema. If the extractor fails, it gets one repair retry.
  4. When the parsed value is a JSON object, its top-level keys auto-merge into state permanently (a non-object result is still reachable via {{output}} but has no top-level keys to merge).
  5. {{output}} (inside state_updates) resolves to the full parsed value.
  6. Explicit state_updates win over auto-merge if the same key is set in both.

After the example above, downstream nodes can use {{action}}, {{items}}, {{items[0]}}, {{priority}}, etc. directly.
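
The fast path (step 2), the auto-merge (step 4), and the precedence rule (step 6) can be sketched as below. This is an approximation: the function names are hypothetical, the exact fence cleanup Loki performs is not documented here, and state_updates is passed as already-interpolated values instead of templates.

```python
import json
import re

FENCE = "`" * 3  # built indirectly so the sketch does not embed a literal fence
_FENCED = re.compile(rf"^{FENCE}[\w-]*[ \t]*\n(.*?)\n?{FENCE}$", re.DOTALL)


def parse_fast_path(raw):
    """Try to read the raw model output as JSON. Returns (ok, value)."""
    text = raw.strip()
    fenced = _FENCED.match(text)
    if fenced:
        text = fenced.group(1)  # drop a surrounding markdown code fence
    try:
        return True, json.loads(text)
    except json.JSONDecodeError:
        return False, None  # the extractor role would take over here


def apply_structured_output(state, raw, state_updates=None):
    """Merge a parsed object's top-level keys into state, then apply explicit updates."""
    ok, value = parse_fast_path(raw)
    if not ok:
        raise ValueError("fast path failed; extraction would be attempted next")
    if isinstance(value, dict):
        state.update(value)            # top-level keys auto-merge into state
    state.update(state_updates or {})  # explicit state_updates win on conflicts
    return value
```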

LLM nodes vs Agent nodes: schema-hint injection

This is the most important behavioral difference between the two node types when output_schema is set:

  • LLM nodes: Loki automatically appends a schema hint to the prompt (to the system prompt if instructions is set, otherwise to the user prompt). The hint tells the model to respond with JSON matching the schema. This means the main LLM call usually emits valid JSON directly -> the fast path succeeds -> the extractor LLM call is skipped entirely (cheaper, faster, more reliable).
  • Agent nodes: Loki does NOT inject any schema hint. Agents are multi-turn with their own tool-use loop; stuffing a schema into the initial prompt risks the agent fixating on JSON output instead of doing its actual work. The agent runs to completion freely, and the extractor converts its final text to JSON afterward.

If you need an agent to emit JSON-shaped output, include schema language in its prompt yourself. The auto-injected hint for LLM nodes uses this form:

Respond with a JSON object that matches this schema. Output ONLY the JSON
object with no surrounding prose or markdown fences.

Schema:
{...}

Tolerant-fail for extraction

  • LLM node: extraction failure = node failure -> routes via fallback or next.
  • Agent node: extraction failure propagates as a graph error (agent nodes have no fallback).

Worked Example

A compact illustrative graph (input -> llm with output_schema -> end) exercising structured output and several template-path forms. For a full-featured reference covering every node type and field, see the heavily-commented graph.example.yaml at the root of the Loki repository.

Illustrative graph.yaml:

name: structured-test
version: "1.0"
start: ask_task

nodes:
  ask_task:
    id: ask_task
    type: input
    question: "Describe a task in free-form text."
    validation: "len(input) > 0"
    state_updates:
      raw_task: "{{input}}"
    next: extract_task

  extract_task:
    id: extract_task
    type: llm
    instructions: |
      You are a task parser. If a field cannot be determined, use a sensible
      default (empty array, null, or "medium" for priority).
    prompt: 'Parse this task description: "{{raw_task}}"'
    tools: []
    output_schema:
      type: object
      properties:
        action: { type: string }
        items:
          type: array
          items: { type: string }
        time_minutes: { type: ["integer", "null"] }
        priority:
          type: string
          enum: [low, medium, high]
        details:
          type: object
          properties:
            urgent: { type: boolean }
            deadline: { type: ["string", "null"] }
          required: [urgent]
      required: [action, items, priority, details]
    next: done

  done:
    id: done
    type: end
    output: |
      Action:        {{action}}
      Priority:      {{priority}}
      Time:          {{time_minutes}} min
      Urgent?        {{details.urgent}}
      First item:    {{items[0]}}
      All items:     {{items}}

With the sample input "Buy groceries: milk, eggs, bread. About 15 minutes. Urgent.", the state after extract_task might look like this:

{
  "raw_task": "Buy groceries: milk, eggs, bread. About 15 minutes. Urgent.",
  "action": "buy",
  "items": ["milk", "eggs", "bread"],
  "time_minutes": 15,
  "priority": "high",
  "details": { "urgent": true, "deadline": null }
}

Validation

When validate_before_run: true (the default), Loki validates the graph at startup.

Errors (abort startup):

  • Start node missing or pointing to a non-existent node
  • Any next / routes / fallback / on_other target pointing to a non-existent node
  • Any cycle in declared static edges. Cycles are always errors: the per-node max_loop_iterations is a runtime safety net for dynamically-routed loops, not a license for static cycles
  • Graph has zero end nodes (execution would never terminate)
  • approval option without a matching routes entry
  • script file path does not exist relative to the agent's directory
  • agent node references an agent name that doesn't exist in the loki agents directory, or that exists but has neither a config.yaml nor a graph.yaml
  • rag node with no documents (at least one knowledge source is required)
  • llm node referencing an unknown tool or mcp:<server> in its tools whitelist, or an unknown model. Validated against the agent's tool, MCP-server, and model sets

Warnings (printed, execution continues):

  • Any node unreachable from the start via declared static edges
  • No end node reachable from the start via declared static edges
  • approval routes entry without a matching option
  • rag node with no state_updates (its retrieval result goes nowhere)

Why some of these are warnings and not errors: the validator only follows declared static edges (next, routes, fallback, on_other). Script nodes can also route dynamically at runtime via _next in their JSON output, and those edges are invisible to static analysis. To avoid false positives against dynamically-routed graphs, "unreachable" and "no reachable end" are reported as warnings, not errors.
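
A subset of these checks (dangling targets, static cycles, missing end nodes, and the two reachability warnings) can be sketched as a walk over declared static edges. This is an illustrative reimplementation, not Loki's validator; the function names and message strings are hypothetical, and nodes are plain dicts mirroring the YAML.

```python
def static_edges(node):
    """Targets declared via next, fallback, on_other, and routes."""
    targets = [node[f] for f in ("next", "fallback", "on_other") if node.get(f)]
    targets.extend((node.get("routes") or {}).values())
    return targets


def validate(nodes, start):
    """Return (errors, warnings) for a dict of node id -> node definition."""
    errors, warnings = [], []
    if start not in nodes:
        return [f"start node '{start}' does not exist"], warnings
    for node_id, node in nodes.items():
        for target in static_edges(node):
            if target not in nodes:
                errors.append(f"'{node_id}' points to unknown node '{target}'")
    if not any(n.get("type") == "end" for n in nodes.values()):
        errors.append("graph has no end node")

    # Cycle detection: depth-first search with three colors over static edges.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def has_cycle(n):
        color[n] = GREY
        for t in static_edges(nodes[n]):
            if t not in nodes:
                continue
            if color[t] == GREY or (color[t] == WHITE and has_cycle(t)):
                return True
        color[n] = BLACK
        return False

    if any(color[n] == WHITE and has_cycle(n) for n in nodes):
        errors.append("cycle in declared static edges")

    # Reachability is only a warning: script nodes may add edges via _next.
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen or n not in nodes:
            continue
        seen.add(n)
        stack.extend(static_edges(nodes[n]))
    for n in nodes:
        if n not in seen:
            warnings.append(f"node '{n}' is unreachable from start")
    if not any(nodes[n].get("type") == "end" for n in seen):
        warnings.append("no end node reachable from start")
    return errors, warnings
```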


Invocation Entry Points

A graph agent can be entered from three places, all of which seed the caller's prompt into state as {{initial_prompt}}:

  1. Top-level CLI: loki -a my-graph-agent "user prompt here"
  2. REPL: When the active agent has a graph.yaml, every user message in the REPL runs the graph fresh; the message becomes {{initial_prompt}}
  3. Child-agent spawn: When another (graph or normal) agent invokes this one via Loki's sub-agent mechanism, the parent's request becomes {{initial_prompt}} for the child graph

After the graph finishes, any sub-agents this graph spawned via agent-type nodes are cancelled, so a graph cannot leak background tool loops. The graph's final end node output is what's returned to the caller.


Streaming and Observability

Graph execution has two observability channels:

1. stderr narration: Dimmed lines you follow along with in real time, regardless of log level:

▸ graph: my-agent (start: extract_task)
▸ extract_task (llm)
▸   llm call: model=<active> tools=<none>
▸ extract_task -> done
▸ done (end)
▸ graph done in 2.41s

2. tracing logs: Structured info!/debug!/warn!/error! records gated by RUST_LOG (see Configuration below). This is the developer-facing channel and includes:

  • Graph start / completion / failure
  • Per-node entry and routing decisions (debug)
  • A performance summary at completion, listing every node's visit count and total/avg/max wall-clock time, slowest first:
    [graph:my-agent] performance summary (slowest first):
    [graph:my-agent]   deep_research: 1 visit(s), total 8200ms, avg 8200ms, max 8200ms
    [graph:my-agent]   extract_task: 1 visit(s), total 1400ms, avg 1400ms, max 1400ms
    

State snapshots: when log_state_snapshots: true (the default), Loki logs the state's byte size and key list at debug level, and the full state at trace level, before each node runs. The full state is deliberately kept at trace because graph state can contain secrets, so be careful when sharing trace-level logs.

Configuration

Control the tracing channel with RUST_LOG:

RUST_LOG=loki::graph=debug    loki -a my-agent "..."   # graph debug logs
RUST_LOG=loki::graph=trace    loki -a my-agent "..."   # + full state snapshots
RUST_LOG=loki::graph=info     loki -a my-agent "..."   # start/end/perf summary

The stderr narration is always shown and is not affected by RUST_LOG.


Limitations / Gotchas

A short, honest list of things that bite people:

  • A graph agent is graph.yaml-only. It must not also have a config.yaml. Both files present is a hard load error.
  • Graph agents do not support sessions. A graph manages its own state (GraphState), so there is no conversational history to persist. Explicitly requesting a session is a hard error, whether via --session on the CLI, a session name passed to .agent in the REPL, or running .session while inside a graph agent. Any app-level agent_session default is silently skipped for graph agents rather than applied.
  • RAG is per-node, not agent-wide. Graph agents do RAG via rag nodes (each with its own knowledge base); there is no agent-wide documents field at the graph.yaml top level.
  • A rag node's knowledge base is built once, at load time. Changing a rag node's documents does not rebuild it. Delete <agent-dir>/<node-id>.yaml to force a fresh build on next run.
  • on_other is required on every approval node because user__ask always permits free-form responses (see the approval section).
  • validation on input nodes is length-only. The grammar is len(input) <op> <integer> with <op> in > >= < <= ==. No regex, no type coercion, no range checks. Use a follow-up script node for richer validation.
  • An input node's default is not re-validated. When the user submits an empty response and the default is substituted in, that substituted value is not checked against validation. Make sure any default you set would itself satisfy the validation predicate.
  • Tool whitelist is llm-only. agent nodes always use the child agent's full tool universe. They ignore any tools: field. This is by design: child agents own their tool surface.
  • {{output}}, {{choice}}, {{input}} are scoped to state_updates. Outside state_updates (e.g. in another node's prompt), these scoped variables are not available unless the previous node explicitly stored them via state_updates. end nodes do NOT get a scoped {{output}}. They have no node body output to scope.
  • Schema-hint auto-injection happens for llm nodes only, not agent nodes (see Structured Output).
  • Script-output JSON must be an object, not an array or primitive, even if you only want to set _next.
  • Cycles in declared static edges are always errors. The per-node max_loop_iterations is a runtime safety net for cycles built via dynamic script._next routing, not permission to write static cycles.
  • Schema version is fixed at "1.0" today. Any other value is a startup error.
  • Script extensions are exactly .sh, .py, .ts. No JavaScript, no Ruby, no Lua. Python must be available as python3 and TypeScript requires npx tsx on PATH.

See Also

  • graph.example.yaml - A fully-commented, full-featured reference graph agent at the root of the Loki repository (every top-level field, every node type).
  • Agents - non-graph agent system (config.yaml + LLM loop)
  • Custom Tools - building tools.sh / tools.py / tools.ts files for use in graph nodes
  • Roles - note that the built-in __structured_output__ role used by output_schema is intentionally internal and is not user-visible
  • MCP Servers - mcp:<server> shorthand inside an llm node's tools: whitelist
