Deep-CodeAI · Skobeltsyn · May 30, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a
 
 ## [Unreleased]
 
+### Added — Multimodal foundation (#2465 epic, Stage 1)
+
+- **Typed `Content` hierarchy (#2466)** — `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document` in package `agents_engine.content`. Each non-text variant carries a `ContentRef` plus a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). Mime types are closed sealed interfaces with `wireMime: String` accessors — no `String` mime in any public API. Extension property `Content.modality: String` is the audit-stable per-variant name. Stage 1 wires Image + Document end-to-end (the modalities the 0.8 spec → product loop consumes); Audio + Video are modelled now and exercised through provider adapters in Stage 2 (#2470, deferred).
+- **`ContentRef` + `BlobStore` (#2467)** — content-addressed reference (`hash: String` SHA-256 hex, `sizeBytes: Long`, `wireMime: String`). `BlobStore` interface with `InMemoryBlobStore` (defensive byte-array copies on put + get) and `FileBlobStore(dir)` (one file per blob, filename = hash, atomic tmp + rename, survives process restart, idempotent put). Hash family matches the manifest hash (#1912) and snapshot filename hash (#2753) — single algorithm across the audit surface. Public top-level `computeContentHash(bytes): String` for byte-level comparison without a store.
+- **`ToolResult` (#2469)** — `data class ToolResult(parts: List<Content>)` for tools that return mixed content (a screenshot tool returns text + image; OCR returns extracted text + the source PDF ref). Just another `Any?` the tool executor returns — no `ToolDef` signature change; existing tools that return strings keep working byte-for-byte. AgenticLoop renders multipart returns as text + `[modality: <wireMime>] (<hash-prefix>, <size>B)` placeholders for the LLM tool-result message; provider-specific multipart rendering (vision-capable Claude/OpenAI/Gemini) is sibling #2470 (deferred). JSONL audit exporter gains an `outputParts: List<String>?` column on audit rows — for `ToolResult` returns it emits one entry per part as `<modality>:<hash-prefix>:<sizeBytes>:<wireMime>` (text parts as `text:inline:<charCount>:text/plain`); blob bytes never enter the audit row. Field is null for non-multimodal returns — legacy audit rows unchanged. `EXPECTED_FIELDS` schema-pin updated to include the new column. Composes with snapshot/resume (refs serialise, blobs stay external) and `untrustedOutput` (the text-summary rendering goes through the existing JSON envelope). See [docs/multimodal.md](docs/multimodal.md).
+
 ### Added — Eval harness (#2491 epic, feature-complete)
 
 - **`DeterministicModelClient` (#2492)** — `agents_engine.testing.DeterministicModelClient(scripted: List<LlmResponse>)` (or vararg ctor) hands back pre-scripted responses one per `chat` call. No network, byte-deterministic. `requests` records every message list the agent built up; `remaining()` reports unconsumed responses. Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)`. Streaming uses the default `ModelClient.chatStream` wrap. Out of scope for v1: record-from-live HTTP capture (mentioned in the ticket — needs an HTTP-fixture story we'll write when there's demand) and per-token chunk replay.

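The `ContentRef` entry above describes a content-addressed reference keyed by a SHA-256 hex digest. As a rough illustration of that idea, and not the library's `computeContentHash` (whose body this diff does not show), the sketch below derives a reference from raw bytes. `SketchRef`, `sha256Hex`, and `refFor` are hypothetical names introduced only for this example.

```kotlin
import java.security.MessageDigest

// Hypothetical stand-in for ContentRef; field names follow the changelog entry.
data class SketchRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// SHA-256 over the raw bytes, rendered as 64 lowercase hex characters.
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Assumed helper: builds a reference without touching any store.
fun refFor(bytes: ByteArray, wireMime: String): SketchRef =
    SketchRef(sha256Hex(bytes), bytes.size.toLong(), wireMime)

fun main() {
    val a = refFor("hello".toByteArray(), "text/plain")
    val b = refFor("hello".toByteArray(), "text/plain")
    println(a == b)          // true: identical bytes yield an identical reference
    println(a.hash.length)   // 64
}
```

Because the key depends only on the bytes, two callers that store the same payload end up with the same reference, which is what makes an idempotent `put` possible.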
diff --git a/README.md b/README.md
@@ -155,6 +155,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
 - **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
 - **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
 - **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
+- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end; Audio + Video modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). See [docs/multimodal.md](docs/multimodal.md).
 - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
 - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
 - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).

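The README bullet mentions an atomic tmp+rename write that is idempotent and survives a process restart. One way such a store could look is sketched here; this is an assumption-laden illustration, not the actual `FileBlobStore`. `SketchFileStore` and its methods are hypothetical, and the behaviour of `ATOMIC_MOVE` when the target already exists is platform-dependent.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption
import java.security.MessageDigest

// Hypothetical sketch: one file per blob, filename = SHA-256 hex of the bytes.
class SketchFileStore(private val dir: Path) {
    init { Files.createDirectories(dir) }

    fun put(bytes: ByteArray): String {
        val hash = MessageDigest.getInstance("SHA-256").digest(bytes)
            .joinToString("") { "%02x".format(it) }
        val target = dir.resolve(hash)
        if (Files.exists(target)) return hash   // idempotent: blob already stored
        // Write to a temp file in the same directory, then rename into place,
        // so readers never observe a partially written blob.
        val tmp = Files.createTempFile(dir, hash, ".tmp")
        try {
            Files.write(tmp, bytes)
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
        } finally {
            Files.deleteIfExists(tmp)           // no-op after a successful move
        }
        return hash
    }

    fun get(hash: String): ByteArray? =
        dir.resolve(hash).takeIf { Files.exists(it) }?.let { Files.readAllBytes(it) }
}

fun main() {
    val dir = Files.createTempDirectory("sketch-blobs")
    val hash = SketchFileStore(dir).put(byteArrayOf(1, 2, 3))
    // A fresh instance over the same directory still finds the blob.
    println(SketchFileStore(dir).get(hash)?.contentEquals(byteArrayOf(1, 2, 3)))
}
```

Keeping the temp file in the target directory matters: a rename is only atomic within a single filesystem.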
diff --git a/agents-kt-observability/src/main/kotlin/agents_engine/observability/JsonlAuditExporter.kt b/agents-kt-observability/src/main/kotlin/agents_engine/observability/JsonlAuditExporter.kt
@@ -1,5 +1,6 @@
 package agents_engine.observability
 
+import agents_engine.content.modality
 import agents_engine.core.Agent
 import agents_engine.core.AgentRuntimeContext
 import agents_engine.core.PipelineEvent
@@ -214,6 +215,14 @@ class JsonlAuditExporter(
                 is PipelineEvent.ToolCalled -> typeName(event.result)
                 else -> null
             },
+            // #2469 — multimodal tool results record one summary per part:
+            // "<modality>:<hash-prefix>:<size>:<mime>". No bytes; the
+            // ContentRef + modality is the auditable surface. Null on
+            // non-ToolResult returns so legacy audit rows are unchanged.
+            outputParts = when (event) {
+                is PipelineEvent.ToolCalled -> partsSummary(event.result)
+                else -> null
+            },
             // #2395 — record blocked tool calls in the audit log via the
             // guardrailDecision column. Only the decision *type* is written:
             // the free-text reason can embed offending arg values (e.g. a
@@ -265,6 +274,11 @@ class JsonlAuditExporter(
                 is AgentEvent.ToolCallFinished -> typeName(event.result)
                 else -> null
             },
+            outputParts = when (event) {
+                is AgentEvent.Completed<*> -> partsSummary(event.output)
+                is AgentEvent.ToolCallFinished -> partsSummary(event.result)
+                else -> null
+            },
             toolPolicyRisk = null,
             usedDeclaredCapability = null,
             usage = usage,
@@ -291,6 +305,7 @@ class JsonlAuditExporter(
         usedDeclaredCapability: Boolean?,
         usage: TokenUsage?,
         guardrailDecision: String? = null,
+        outputParts: List<String>? = null,
     ): Map<String, Any?> =
         linkedMapOf(
             "requestId" to context.requestId,
@@ -303,6 +318,7 @@ class JsonlAuditExporter(
             "timestamp" to timestamp,
             "inputType" to inputType,
             "outputType" to outputType,
+            "outputParts" to outputParts,
             "budgetState" to null,
             "guardrailDecision" to guardrailDecision,
             "mcpClientId" to null,
@@ -317,6 +333,35 @@ class JsonlAuditExporter(
     private fun typeName(value: Any?): String? =
         value?.javaClass?.name
 
+    /**
+     * #2469 — for [agents_engine.content.ToolResult] return values,
+     * render one summary string per part:
+     * `"<modality>:<hash-prefix>:<sizeBytes>:<wireMime>"`. Hash prefix is the first
+     * 12 hex chars (enough to disambiguate in audit grep, short enough to read).
+     * Returns `null` when [value] is not a `ToolResult` — keeps legacy
+     * audit rows byte-identical for non-multimodal returns.
+     *
+     * Crucially: no blob bytes enter the audit row. Modality + ref is
+     * the auditable surface.
+     */
+    private fun partsSummary(value: Any?): List<String>? {
+        val toolResult = value as? agents_engine.content.ToolResult ?: return null
+        return toolResult.parts.map { part ->
+            when (part) {
+                is agents_engine.content.Content.Text ->
+                    "${part.modality}:inline:${part.text.length}:text/plain"
+                is agents_engine.content.Content.Image ->
+                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
+                is agents_engine.content.Content.Audio ->
+                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
+                is agents_engine.content.Content.Video ->
+                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
+                is agents_engine.content.Content.Document ->
+                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
+            }
+        }
+    }
+
     private fun encodeJson(value: Any?): String =
         when (value) {
             null -> "null"

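The per-part summary format used by `partsSummary` can be tried in isolation. The sketch below uses a deliberately reduced, hypothetical model (`Part`, `TextPart`, `BlobPart`, `summarize`) instead of the real `Content` hierarchy, so it only illustrates the string shape, not the library types.

```kotlin
// Hypothetical miniature of the content model: just enough to show the format.
sealed interface Part
data class TextPart(val text: String) : Part
data class BlobPart(
    val modality: String,
    val hash: String,
    val sizeBytes: Long,
    val wireMime: String,
) : Part

// "<modality>:<hash-prefix>:<sizeBytes>:<wireMime>", with text rendered inline
// as its character count. Blob bytes are never part of the summary.
fun summarize(part: Part): String = when (part) {
    is TextPart -> "text:inline:${part.text.length}:text/plain"
    is BlobPart -> "${part.modality}:${part.hash.take(12)}:${part.sizeBytes}:${part.wireMime}"
}

fun main() {
    val parts = listOf(
        TextPart("captured"),
        BlobPart("image", "2cf24dba5fb0a30e26e83b2ac5b9e29e", 7, "image/png"),
    )
    parts.map(::summarize).forEach(::println)
    // text:inline:8:text/plain
    // image:2cf24dba5fb0:7:image/png
}
```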
diff --git a/...ts-kt-observability/src/test/kotlin/agents_engine/observability/JsonlAuditExporterTest.kt b/...ts-kt-observability/src/test/kotlin/agents_engine/observability/JsonlAuditExporterTest.kt
@@ -115,6 +115,47 @@ class JsonlAuditExporterTest {
         assertEquals(false, row["usedDeclaredCapability"])
     }
 
+    @Test
+    fun `multimodal ToolResult writes outputParts with modality plus hash plus size — no bytes`() {
+        // #2469 — audit-row support for typed multimodal tool returns.
+        val store = agents_engine.content.InMemoryBlobStore()
+        val pngBytes = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 1, 2, 3)
+        val imgRef = store.put(pngBytes, agents_engine.content.ImageMime.Png.wireMime)
+
+        val dir = Files.createTempDirectory("agents-jsonl-audit-multimodal")
+        val auditFile = dir.resolve("audit.jsonl")
+        val exporter = JsonlAuditExporter(auditFile, clock = fixedClock)
+        exporter.write(
+            PipelineEvent.ToolCalled(
+                agentName = "agent",
+                timestamp = Instant.EPOCH,
+                toolName = "screenshot",
+                arguments = mapOf("url" to "https://example.com"),
+                result = agents_engine.content.ToolResult(
+                    agents_engine.content.Content.Text("captured"),
+                    agents_engine.content.Content.Image(imgRef, agents_engine.content.ImageMime.Png),
+                ),
+                runtimeContext = AgentRuntimeContext(requestId = "req-multimodal"),
+            ),
+        )
+        exporter.close()
+
+        val line = Files.readAllLines(auditFile).single()
+        // No bytes anywhere — neither the URL arg nor the image bytes
+        assertFalse(line.contains("example.com"), "argument values must not be serialized: $line")
+        // PNG magic (0x50 0x4E 0x47 = "PNG") would surface if raw bytes were written as text
+        assertFalse(line.contains("PNG"), "image bytes must not be serialized: $line")
+
+        // Substring assertions on the rendered JSON (the test parser doesn't
+        // model arrays, and the column contents are stable enough to grep).
+        assertTrue("\"outputParts\":[" in line, "outputParts is emitted as an array: $line")
+        assertTrue("\"text:inline:8:text/plain\"" in line, "text-part shape in array: $line")
+        assertTrue("\"image:${imgRef.hash.take(12)}:${pngBytes.size}:image/png\"" in line,
+            "image-part shape modality:hashPrefix:size:mime — $line")
+        assertTrue("\"agents_engine.content.ToolResult\"" in line,
+            "outputType still names the wrapper type: $line")
+    }
+
     @Test
     fun `denied tool calls are recorded as ToolDenied rows without leaking the reason text`() {
         // #2395 — blocked calls must appear in the audit log. The PII-safe
@@ -309,6 +350,10 @@ class JsonlAuditExporterTest {
             "timestamp",
             "inputType",
             "outputType",
+            // #2469 — per-part summary for multimodal ToolResult returns.
+            // Null on non-multimodal rows; field is always present so
+            // schema-pinning consumers see a stable column set.
+            "outputParts",
             "budgetState",
             "guardrailDecision",
             "mcpClientId",
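For consumers of the new `outputParts` column, each entry can be split back into its four fields. The following is a minimal sketch under the assumption that entries always follow the `<modality>:<hash-prefix>:<sizeBytes>:<wireMime>` shape described in this change; `PartSummary` and `parsePartSummary` are hypothetical names and are not part of the exporter.

```kotlin
// Hypothetical consumer-side view of one outputParts entry.
data class PartSummary(
    val modality: String,
    val ref: String,      // 12-char hash prefix, or "inline" for text parts
    val size: Long,       // sizeBytes for blobs, character count for text
    val wireMime: String,
)

// Returns null for entries that do not match the four-field shape.
fun parsePartSummary(entry: String): PartSummary? {
    // limit = 4 keeps everything after the third colon in the mime field.
    val fields = entry.split(":", limit = 4)
    if (fields.size != 4) return null
    val size = fields[2].toLongOrNull() ?: return null
    return PartSummary(fields[0], fields[1], size, fields[3])
}

fun main() {
    println(parsePartSummary("text:inline:8:text/plain"))
    // PartSummary(modality=text, ref=inline, size=8, wireMime=text/plain)
    println(parsePartSummary("image:abc"))
    // null
}
```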