diff --git a/CHANGELOG.md b/CHANGELOG.md index 86f9065..03af14a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,12 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a ## [Unreleased] +### Added — Multimodal foundation (#2465 epic, Stage 1) + +- **Typed `Content` hierarchy (#2466)** — `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document` in package `agents_engine.content`. Each non-text variant carries a `ContentRef` plus a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). Mime types are closed sealed interfaces with `wireMime: String` accessors — no `String` mime in any public API. Extension property `Content.modality: String` is the audit-stable per-variant name. Stage 1 wires Image + Document end-to-end (the modalities the 0.8 spec → product loop consumes); Audio + Video are modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). +- **`ContentRef` + `BlobStore` (#2467)** — content-addressed reference (`hash: String` SHA-256 hex, `sizeBytes: Long`, `wireMime: String`). `BlobStore` interface with `InMemoryBlobStore` (defensive byte-array copies on put + get) and `FileBlobStore(dir)` (one file per blob, filename = hash, atomic tmp + rename, survives process restart, idempotent put). Hash family matches the manifest hash (#1912) and snapshot filename hash (#2753) — single algorithm across the audit surface. Public top-level `computeContentHash(bytes): String` for byte-level comparison without a store. +- **`ToolResult` (#2469)** — `data class ToolResult(parts: List<Content>)` for tools that return mixed content (a screenshot tool returns text + image; OCR returns extracted text + the source PDF ref). Just another `Any?` the tool executor returns — no `ToolDef` signature change; existing tools that return strings keep working byte-for-byte. 
AgenticLoop renders multipart returns as text + `[<modality>: <mime>] (<hash>, <size>B)` placeholders for the LLM tool-result message; provider-specific multipart rendering (vision-capable Claude/OpenAI/Gemini) is sibling #2470 (deferred). JSONL audit exporter gains an `outputParts: List<String>?` column on audit rows — for `ToolResult` returns it emits one entry per part as `<modality>:<hashPrefix>:<sizeBytes>:<wireMime>` (text parts as `text:inline:<length>:text/plain`); blob bytes never enter the audit row. Field is null for non-multimodal returns — legacy audit rows unchanged. `EXPECTED_FIELDS` schema-pin updated to include the new column. Composes with snapshot/resume (refs serialise, blobs stay external) and `untrustedOutput` (the text-summary rendering goes through the existing JSON envelope). See [docs/multimodal.md](docs/multimodal.md). + ### Added — Eval harness (#2491 epic, feature-complete) - **`DeterministicModelClient` (#2492)** — `agents_engine.testing.DeterministicModelClient(scripted: List<LlmResponse>)` (or vararg ctor) hands back pre-scripted responses one per `chat` call. No network, byte-deterministic. `requests` records every message list the agent built up; `remaining()` reports unconsumed responses. Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)`. Streaming uses the default `ModelClient.chatStream` wrap. Out of scope for v1: record-from-live HTTP capture (mentioned in the ticket — needs an HTTP-fixture story we'll write when there's demand) and per-token chunk replay. 
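Reviewer sketch for the `ContentRef` + `BlobStore` changelog entry: a minimal, self-contained illustration of content addressing, assuming only the JDK's `MessageDigest`. The names `sha256Hex`, `SketchRef`, and `refFor` are hypothetical stand-ins chosen for this example; they are not the library's `computeContentHash` or `ContentRef`.

```kotlin
import java.security.MessageDigest

// Hypothetical stand-in for the library's ContentRef; the field names mirror the changelog entry.
data class SketchRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// SHA-256 of the bytes, rendered as lowercase hex (64 characters).
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Derive a ref from the bytes alone: identical bytes always yield an identical ref.
fun refFor(bytes: ByteArray, wireMime: String): SketchRef =
    SketchRef(sha256Hex(bytes), bytes.size.toLong(), wireMime)

fun main() {
    val first = refFor("abc".toByteArray(), "text/plain")
    val second = refFor("abc".toByteArray(), "text/plain")
    println(first.hash)
    println(first == second)
}
```

Because the ref is a pure function of the bytes and the mime, a second put of the same payload can be treated as a no-op, which is the idempotency property the entry describes.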
diff --git a/README.md b/README.md index ed99b54..e0eea6a 100644 --- a/README.md +++ b/README.md @@ -155,6 +155,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes - **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay). - **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up. 
- **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md). +- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hashPrefix>:<sizeBytes>:<wireMime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end; Audio + Video modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). See [docs/multimodal.md](docs/multimodal.md). - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. 
Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md). - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md). - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md). 
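Reviewer sketch for the Multimodal foundation README entry: the per-part `outputParts` summary can be pictured roughly as below. This is a simplified illustration, not the exporter's code; `Part`, `Part.Blob`, and `summarize` are hypothetical names, and the hash is a made-up 64-character value.

```kotlin
// Hypothetical, simplified stand-ins for the library's Content variants.
sealed interface Part {
    data class Text(val text: String) : Part
    data class Blob(
        val modality: String,
        val hash: String,
        val sizeBytes: Long,
        val wireMime: String,
    ) : Part
}

// One summary string per part: modality, 12-char hash prefix (or "inline"), size, mime.
// Only metadata is rendered; no payload bytes are ever touched here.
fun summarize(parts: List<Part>): List<String> = parts.map { part ->
    when (part) {
        is Part.Text -> "text:inline:${part.text.length}:text/plain"
        is Part.Blob -> "${part.modality}:${part.hash.take(12)}:${part.sizeBytes}:${part.wireMime}"
    }
}

fun main() {
    val parts = listOf(
        Part.Text("captured"),
        Part.Blob("image", "a3f9b2c41205" + "0".repeat(52), 524288L, "image/png"),
    )
    summarize(parts).forEach(::println)
}
```

Keeping the summary to metadata is what lets the audit row stay compact regardless of blob size.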
diff --git a/agents-kt-observability/src/main/kotlin/agents_engine/observability/JsonlAuditExporter.kt b/agents-kt-observability/src/main/kotlin/agents_engine/observability/JsonlAuditExporter.kt index 69720b1..ed4b2ae 100644 --- a/agents-kt-observability/src/main/kotlin/agents_engine/observability/JsonlAuditExporter.kt +++ b/agents-kt-observability/src/main/kotlin/agents_engine/observability/JsonlAuditExporter.kt @@ -1,5 +1,6 @@ package agents_engine.observability +import agents_engine.content.modality import agents_engine.core.Agent import agents_engine.core.AgentRuntimeContext import agents_engine.core.PipelineEvent @@ -214,6 +215,14 @@ class JsonlAuditExporter( is PipelineEvent.ToolCalled -> typeName(event.result) else -> null }, + // #2469 — multimodal tool results record one summary per part: + // "<modality>:<hashPrefix>:<sizeBytes>:<wireMime>". No bytes; the + // ContentRef + modality is the auditable surface. Null on + // non-ToolResult returns so legacy audit rows are unchanged. + outputParts = when (event) { + is PipelineEvent.ToolCalled -> partsSummary(event.result) + else -> null + }, // #2395 — record blocked tool calls in the audit log via the // guardrailDecision column. Only the decision *type* is written: // the free-text reason can embed offending arg values (e.g. a @@ -265,6 +274,11 @@ class JsonlAuditExporter( is AgentEvent.ToolCallFinished -> typeName(event.result) else -> null }, + outputParts = when (event) { + is AgentEvent.Completed<*> -> partsSummary(event.output) + is AgentEvent.ToolCallFinished -> partsSummary(event.result) + else -> null + }, toolPolicyRisk = null, usedDeclaredCapability = null, usage = usage, @@ -291,6 +305,7 @@ class JsonlAuditExporter( usedDeclaredCapability: Boolean?, usage: TokenUsage?, guardrailDecision: String? = null, + outputParts: List<String>? 
= null, ): Map<String, Any?> = linkedMapOf( "requestId" to context.requestId, @@ -303,6 +318,7 @@ class JsonlAuditExporter( "timestamp" to timestamp, "inputType" to inputType, "outputType" to outputType, + "outputParts" to outputParts, "budgetState" to null, "guardrailDecision" to guardrailDecision, "mcpClientId" to null, @@ -317,6 +333,35 @@ class JsonlAuditExporter( private fun typeName(value: Any?): String? = value?.javaClass?.name + /** + * #2469 — for [agents_engine.content.ToolResult] return values, + * render one summary string per part: `"<modality>:<hashPrefix>: + * <sizeBytes>:<wireMime>"`. Hash prefix is the first 12 hex chars + * (enough to disambiguate in audit grep, short enough to read). + * Returns `null` when [value] is not a `ToolResult` — keeps legacy + * audit rows byte-identical for non-multimodal returns. + * + * Crucially: no blob bytes enter the audit row. Modality + ref is + * the auditable surface. + */ + private fun partsSummary(value: Any?): List<String>? { + val toolResult = value as? agents_engine.content.ToolResult ?: return null + return toolResult.parts.map { part -> + when (part) { + is agents_engine.content.Content.Text -> + "${part.modality}:inline:${part.text.length}:text/plain" + is agents_engine.content.Content.Image -> + "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}" + is agents_engine.content.Content.Audio -> + "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}" + is agents_engine.content.Content.Video -> + "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}" + is agents_engine.content.Content.Document -> + "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}" + } + } + } + private fun encodeJson(value: Any?): String = when (value) { null -> "null" diff --git a/agents-kt-observability/src/test/kotlin/agents_engine/observability/JsonlAuditExporterTest.kt 
b/agents-kt-observability/src/test/kotlin/agents_engine/observability/JsonlAuditExporterTest.kt index b7b5693..86046b2 100644 --- a/agents-kt-observability/src/test/kotlin/agents_engine/observability/JsonlAuditExporterTest.kt +++ b/agents-kt-observability/src/test/kotlin/agents_engine/observability/JsonlAuditExporterTest.kt @@ -115,6 +115,47 @@ class JsonlAuditExporterTest { assertEquals(false, row["usedDeclaredCapability"]) } + @Test + fun `multimodal ToolResult writes outputParts with modality plus hash plus size — no bytes`() { + // #2469 — audit-row support for typed multimodal tool returns. + val store = agents_engine.content.InMemoryBlobStore() + val pngBytes = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 1, 2, 3) + val imgRef = store.put(pngBytes, agents_engine.content.ImageMime.Png.wireMime) + + val dir = Files.createTempDirectory("agents-jsonl-audit-multimodal") + val auditFile = dir.resolve("audit.jsonl") + val exporter = JsonlAuditExporter(auditFile, clock = fixedClock) + exporter.write( + PipelineEvent.ToolCalled( + agentName = "agent", + timestamp = Instant.EPOCH, + toolName = "screenshot", + arguments = mapOf("url" to "https://example.com"), + result = agents_engine.content.ToolResult( + agents_engine.content.Content.Text("captured"), + agents_engine.content.Content.Image(imgRef, agents_engine.content.ImageMime.Png), + ), + runtimeContext = AgentRuntimeContext(requestId = "req-multimodal"), + ), + ) + exporter.close() + + val line = Files.readAllLines(auditFile).single() + // No bytes anywhere — neither the URL arg nor the image bytes + assertFalse(line.contains("example.com"), "argument values must not be serialized: $line") + // No PNG magic — image bytes definitely never enter the audit row + assertFalse(line.contains("0x89"), "image bytes must not be serialized: $line") + + // Substring assertions on the rendered JSON (the test parser doesn't + // model arrays, and the column contents are stable enough to grep). 
+ assertTrue("\"outputParts\":[" in line, "outputParts is emitted as an array: $line") + assertTrue("\"text:inline:8:text/plain\"" in line, "text-part shape in array: $line") + assertTrue("\"image:${imgRef.hash.take(12)}:${pngBytes.size}:image/png\"" in line, + "image-part shape modality:hashPrefix:size:mime — $line") + assertTrue("\"agents_engine.content.ToolResult\"" in line, + "outputType still names the wrapper type: $line") + } + @Test fun `denied tool calls are recorded as ToolDenied rows without leaking the reason text`() { // #2395 — blocked calls must appear in the audit log. The PII-safe @@ -309,6 +350,10 @@ class JsonlAuditExporterTest { "timestamp", "inputType", "outputType", + // #2469 — per-part summary for multimodal ToolResult returns. + // Null on non-multimodal rows; field is always present so + // schema-pinning consumers see a stable column set. + "outputParts", "budgetState", "guardrailDecision", "mcpClientId", diff --git a/docs/multimodal.md b/docs/multimodal.md new file mode 100644 index 0000000..f46ba66 --- /dev/null +++ b/docs/multimodal.md @@ -0,0 +1,164 @@ +[← Back to README](../README.md) + +# Multimodal content + +The first three pieces of the 0.8 multimodal epic ship today. The rest of the epic (provider adapters, KSP routing, manifest-anchored capability checks) is staged on top of this foundation. + +## What ships + +- **`sealed interface Content`** (#2466) — `Text`, `Image`, `Audio`, `Video`, `Document`. Each non-text variant carries a `ContentRef` + a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). No `String` mime anywhere in the public API. +- **`ContentRef` + `BlobStore`** (#2467) — content-addressed reference (SHA-256 hex + size + wire mime). `InMemoryBlobStore` for tests, `FileBlobStore(dir)` for on-disk persistence that survives process restart. +- **`ToolResult(parts: List<Content>)`** (#2469) — tools can return mixed content. 
JSONL audit exporter records per-part modality + ref summary; **no blob bytes ever enter the audit row**. + +## Design hinges + +1. **Modality + format live in the type, never in a String.** Adding a new mime is a new variant; mistyping it is a compile error. +2. **Content-addressed payload, not inlined bytes.** `Content` carries a `ContentRef`, blobs live in a `BlobStore`. Snapshot files stay small. Audit rows stay small. +3. **No `data class` holding `ByteArray`.** Kotlin data-class equals/hashCode treat arrays by identity, not content — would break every downstream consumer's assumption. The ref pattern sidesteps it entirely. + +## Quick start + +```kotlin +import agents_engine.content.* + +val store = FileBlobStore(Path.of("snapshots", "blobs")) + +val screenshotAgent = agent("screenshot") { + model { ollama("test") } + tools { + tool("capture", "Capture a screenshot of the URL") { args -> + val bytes = captureBytes(args["url"] as String) + val ref = store.put(bytes, ImageMime.Png.wireMime) + ToolResult( + Content.Text("Captured ${ref.sizeBytes}B image."), + Content.Image(ref, ImageMime.Png), + ) + } + } + skills { skill("respond", "") { tools("capture") } } +} +``` + +## `Content` variants + +```kotlin +sealed interface Content { + data class Text(val text: String) : Content + data class Image(val ref: ContentRef, val mime: ImageMime) : Content + data class Audio(val ref: ContentRef, val mime: AudioMime) : Content + data class Video(val ref: ContentRef, val mime: VideoMime) : Content + data class Document(val ref: ContentRef, val mime: DocMime) : Content +} + +val Content.modality: String // "text" | "image" | "audio" | "video" | "document" +``` + +Stage 1 (this release): **Image + Document** wired end-to-end through the audit pipeline and the agentic loop's tool-result rendering. **Audio + Video** modelled now and exercised through provider adapters in Stage 2. 
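Reviewer sketch for design hinge 3 above: the identity-versus-content equality problem is easy to reproduce in plain Kotlin. `InlineImage` and `RefImage` are hypothetical types invented for this illustration and are not part of the library.

```kotlin
// A data class that inlines bytes: the generated equals() compares the array by reference.
data class InlineImage(val bytes: ByteArray)

// A data class that holds only a content hash: equals() compares the string value.
data class RefImage(val hash: String)

fun main() {
    val a = InlineImage(byteArrayOf(1, 2, 3))
    val b = InlineImage(byteArrayOf(1, 2, 3))

    println(a == b)                                    // false: different array instances
    println(a.bytes.contentEquals(b.bytes))            // true: same byte content
    println(RefImage("abc123") == RefImage("abc123"))  // true: value equality on the hash
}
```

Holding a hash instead of the array gives the value semantics a consumer would expect from `content1 == content2`, without hand-written `equals` and `hashCode` overrides.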
+ +### Closed mime types + +```kotlin +sealed interface ImageMime { Png, Jpeg, Gif, Webp } +sealed interface AudioMime { Mp3, Wav, Flac, Ogg } +sealed interface VideoMime { Mp4, WebM, Mov } +sealed interface DocMime { Pdf, Docx, Markdown, Html, PlainText } +``` + +Each variant exposes `wireMime: String` for adapter serialisation. The public API never accepts `String` mime — extending is adding a variant. + +## `ContentRef` + `BlobStore` + +```kotlin +data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String) + +interface BlobStore { + fun put(bytes: ByteArray, wireMime: String): ContentRef + fun get(ref: ContentRef): ByteArray? + fun open(ref: ContentRef): InputStream? + fun exists(ref: ContentRef): Boolean + fun delete(ref: ContentRef) +} + +class InMemoryBlobStore : BlobStore +class FileBlobStore(dir: Path) : BlobStore +``` + +**Hash:** SHA-256 hex. Same family as manifest hash (#1912) and snapshot filename hash (#2753) — single hash algorithm across the audit surface. + +**Idempotent put:** putting the same bytes twice returns the same `ContentRef`. `FileBlobStore` writes the file once and is a no-op on the second put — pinned by a test. + +**Persistence:** `FileBlobStore` survives process restart. A fresh instance on the same directory sees blobs from prior puts. Atomic via tmp + rename, matching the `FileSnapshotStore` pattern (#2753). + +**Custom backends:** an internal artifact registry, S3, GCS, etc. implement `BlobStore` and plug in via the same interface. + +## `ToolResult` — multimodal tool returns + +```kotlin +tool("ocr", "Extract text + return source") { args -> + val text = ocrText(args["pdf"] as String) + val sourceRef = store.put(args["pdf-bytes"] as ByteArray, DocMime.Pdf.wireMime) + ToolResult( + Content.Text(text), + Content.Document(sourceRef, DocMime.Pdf), + ) +} +``` + +Tools can return `ToolResult` instead of a String. The agentic loop renders it for the LLM's tool-result message in v1: + +``` +Extracted spec text. 
+[document: application/pdf] (a3f9b2c4...12, 524288B) +``` + +Provider-specific multipart rendering — the model actually seeing the image / document — lands in **#2470** (Provider normalization adapters, deferred). + +`untrustedOutput = true` still wraps the rendered text summary in the JSON envelope (#642 composes with #2469). + +## Audit discipline + +`JsonlAuditExporter` (#1914) gains a new `outputParts` column. For `ToolResult` returns: + +```json +{ + "...": "...", + "outputType": "agents_engine.content.ToolResult", + "outputParts": [ + "text:inline:18:text/plain", + "image:a3f9b2c41205:524288:image/png" + ], + "...": "..." +} +``` + +Format: `<modality>:<hashPrefix>:<sizeBytes>:<wireMime>` per part. Hash prefix is the first 12 hex chars — enough to disambiguate in audit grep, short enough to keep audit rows compact. + +**Critical:** blob bytes never enter the audit row. The `ContentRef` is the auditable surface. Pinned by a dedicated test that asserts the PNG magic byte sequence does not appear anywhere in the audit JSON. + +The same discipline applies (when wired) to the OTel / LangSmith / Langfuse bridges — sibling tickets will plumb `outputParts` onto span events / run events / observations. + +## Snapshot composition + +`Content` carrying a `ContentRef` (not bytes) means `SessionSnapshot` (#2386 / #2754) stays small regardless of how much image / audio / video flowed through the agent. The snapshot serialises the ref; the blob lives in the `BlobStore`. A snapshot resume against the same `BlobStore` dereferences the ref normally. + +Pairs with the #2754 manifest-hash restore guard: resume across an agent rebuild that changed tools (including the `BlobStore` wiring) fails closed unless the caller opts in. + +## What's coming (the rest of #2465) + +- **#2468** Compile-time modality routing — `Agent` becomes a real type; cross-modality miswiring is a compile error. Multi-part `@Generable` inputs via KSP. +- **#2470** Provider adapters — Claude vision, OpenAI vision, Gemini, Ollama multimodal. 
Translates `Content → provider-specific payload` at the wire. +- **#2471** Manifest-anchored modality capability — declared per-agent modalities recorded in the permission manifest, validated against provider capabilities at build time. +- **#2472** Multimodal memory — `MemoryBank` entries carry `ContentRef` for image/audio/video state. +- **#2473** Testing fixtures + snapshot + mutation coverage. + +Stage 2 (Audio + Video) lights up when a concrete use case lands. + +## Related docs + +- [`docs/permission-manifest.md`](permission-manifest.md) — the manifest-hash family the BlobStore reuses. +- [`docs/observability.md`](observability.md) — the audit exporter that now carries `outputParts`. +- [`docs/hitl.md`](hitl.md) — `Content` will appear in human-approval bodies once Stage 2 lands. + +Sources: `agents_engine/content/Content.kt`, `agents_engine/content/ContentRef.kt`, `agents_engine/content/ToolResult.kt`, audit wiring in `agents-kt-observability/.../JsonlAuditExporter.kt`. + +Tests: `ContentAndRefTest.kt`, `ToolResultIntegrationTest.kt`, JsonlAuditExporterTest's "multimodal ToolResult writes outputParts" case. diff --git a/src/main/kotlin/agents_engine/content/Content.kt b/src/main/kotlin/agents_engine/content/Content.kt new file mode 100644 index 0000000..86a8a92 --- /dev/null +++ b/src/main/kotlin/agents_engine/content/Content.kt @@ -0,0 +1,139 @@ +package agents_engine.content + +/** + * `agents_engine/content/Content.kt` — typed multimodal content hierarchy + * (#2466, part of the #2465 0.8 multimodal epic). + * + * **Design hinges:** + * + * 1. **Modality + format live in the type, never in a String.** `Content` is + * a sealed interface; each non-text variant carries a `ContentRef` plus a + * closed (sealed-or-enum) mime type. No `mimeType: String` appears in any + * public API. + * 2. **Content-addressed payload, not inlined bytes.** Non-text variants + * hold a [ContentRef], not a `ByteArray`. The actual bytes live in a + * [BlobStore]. 
This keeps `Content` immutable, equatable, snapshot-safe + * (the #2386 / #2754 snapshot machinery never inlines blobs), and audit- + * safe (audit rows record refs + modalities, never blob contents). + * 3. **No `data class` with `ByteArray`.** Kotlin data-class equals/hashCode + * use the object identity of arrays, not their content. That breaks + * every assumption a downstream consumer would make about + * `content1 == content2`. The ref pattern sidesteps it entirely. + * + * **Staging (per the #2465 epic):** + * + * - Stage 1 (this commit): all five variants modelled. Image + Document + * are the modalities wired through the rest of the stack in 0.8 — they + * match the spec → product loop the runtime actually serves (spec + * ingestion, screenshot/UI-QA, architecture-diagram review). + * - Stage 2: Audio + Video exercised end-to-end when a concrete use + * case lands. + * + * **Composition with existing surfaces:** + * + * - Tools can return [Content] inside a [ToolResult]; audit bridges + * record modalities + refs. + * - Snapshot/resume holds refs (not bytes), so a snapshot file stays + * small regardless of how much image/audio/video the agent processed. + * - Provider adapters (sibling #2470, deferred) translate `Content → + * provider-specific payload` at the wire. + */ +sealed interface Content { + /** + * Plain text content. The one variant that holds its payload inline, + * because text is small, structural, and the lingua franca of LLM + * messages. Stays unchanged from the pre-#2466 string-only world. + */ + data class Text(val text: String) : Content + + /** + * An image. [ref] points at the bytes in a [BlobStore]; [mime] is a + * typed [ImageMime] (never a `String`). Use this for screenshots, + * UI captures, architecture diagrams, photographs. + */ + data class Image(val ref: ContentRef, val mime: ImageMime) : Content + + /** + * Audio — speech, ambient capture, telephony record. 
Modelled in + * Stage 1 but only wired end-to-end through provider adapters in + * Stage 2. + */ + data class Audio(val ref: ContentRef, val mime: AudioMime) : Content + + /** + * Video. Same Stage 1/2 split as [Audio] — type ships now, provider + * rendering ships when a concrete use case lands. + */ + data class Video(val ref: ContentRef, val mime: VideoMime) : Content + + /** + * A document — PDF, DOCX, Markdown. The other modality the 0.8 + * spec → product loop consumes (spec ingestion, regulatory PDFs). + */ + data class Document(val ref: ContentRef, val mime: DocMime) : Content +} + +/** + * Closed mime type for [Content.Image]. Variants cover the modalities + * the production providers (Anthropic Vision, OpenAI Vision, Ollama + * multimodal models) accept today. Extend by adding a variant — string + * mime types are intentionally not exposed. + */ +sealed interface ImageMime { + /** RFC 7-style mime form, returned by adapters when serialising to the wire. */ + val wireMime: String + + object Png : ImageMime { override val wireMime: String = "image/png" } + object Jpeg : ImageMime { override val wireMime: String = "image/jpeg" } + object Gif : ImageMime { override val wireMime: String = "image/gif" } + object Webp : ImageMime { override val wireMime: String = "image/webp" } +} + +/** Closed mime type for [Content.Audio]. */ +sealed interface AudioMime { + val wireMime: String + + object Mp3 : AudioMime { override val wireMime: String = "audio/mpeg" } + object Wav : AudioMime { override val wireMime: String = "audio/wav" } + object Flac : AudioMime { override val wireMime: String = "audio/flac" } + object Ogg : AudioMime { override val wireMime: String = "audio/ogg" } +} + +/** Closed mime type for [Content.Video]. 
*/ +sealed interface VideoMime { + val wireMime: String + + object Mp4 : VideoMime { override val wireMime: String = "video/mp4" } + object WebM : VideoMime { override val wireMime: String = "video/webm" } + object Mov : VideoMime { override val wireMime: String = "video/quicktime" } +} + +/** Closed mime type for [Content.Document]. */ +sealed interface DocMime { + val wireMime: String + + object Pdf : DocMime { override val wireMime: String = "application/pdf" } + object Docx : DocMime { + override val wireMime: String = + "application/vnd.openxmlformats-officedocument.wordprocessingml.document" + } + object Markdown : DocMime { override val wireMime: String = "text/markdown" } + object Html : DocMime { override val wireMime: String = "text/html" } + object PlainText : DocMime { override val wireMime: String = "text/plain" } +} + +/** + * The runtime-stable name of a content's modality. Used by audit + * bridges to write a per-part `modality` field without exposing + * type-checker concerns. Stable across releases — adding a new + * [Content] variant adds a new modality string here, never repurposes + * an existing one. 
+ */ +val Content.modality: String + get() = when (this) { + is Content.Text -> "text" + is Content.Image -> "image" + is Content.Audio -> "audio" + is Content.Video -> "video" + is Content.Document -> "document" + } diff --git a/src/main/kotlin/agents_engine/content/ContentRef.kt b/src/main/kotlin/agents_engine/content/ContentRef.kt new file mode 100644 index 0000000..1dc0936 --- /dev/null +++ b/src/main/kotlin/agents_engine/content/ContentRef.kt @@ -0,0 +1,173 @@ +package agents_engine.content + +import java.io.InputStream +import java.nio.file.Files +import java.nio.file.Path +import java.nio.file.StandardOpenOption +import java.security.MessageDigest +import java.util.concurrent.ConcurrentHashMap + +/** + * `agents_engine/content/ContentRef.kt` — content-addressed blob reference + * + [BlobStore] backend (#2467, part of the #2465 multimodal epic). + * + * **Why content-addressed:** + * + * 1. **Snapshot-safe.** `SessionSnapshot` (#2386 / #2754) serialises through + * plain JSON. Inlining image / audio / video bytes would explode the + * snapshot file and slow every resume. With refs, the snapshot stays + * small (just hash + size + mime); blobs live in the [BlobStore] and + * are addressable by hash across restarts. + * 2. **Audit-safe.** Audit bridges record `ContentRef` (hash + size + + * modality) but never blob bytes. Audit logs stay compact and PII-safe + * by construction. + * 3. **Deduplicated.** Two identical images produce identical refs — the + * store keeps one copy. Useful in eval suites where the same fixture + * image flows through many cases. + * + * **Hash algorithm:** SHA-256 hex. Matches the manifest hash used + * elsewhere (#1912 permission manifest, #2754 restore guard) so the + * audit story has a single hash family. Collisions are not a practical + * concern. + * + * **Mime + size on the ref:** the mime travels with the ref so a + * caller can introspect "what is this blob?" without dereferencing the + * store. 
Size is convenience metadata for audit rows; trustworthy + * because computed at `put` time from the actual byte count. + */ +data class ContentRef( + /** SHA-256 hex of the blob bytes. Stable across processes and JVM versions. */ + val hash: String, + /** Blob length in bytes. Audit-friendly; never the bytes themselves. */ + val sizeBytes: Long, + /** + * Wire-form mime ("image/png", "application/pdf", …). Pulled from the + * corresponding [ImageMime] / [DocMime] / etc. when the ref is created + * by a typed [Content] put; freeform when a caller constructs a ref + * directly (e.g. ingesting an unknown blob from disk). Adapters that + * round-trip through typed `Content` enforce the closed mime types at + * that layer. + */ + val wireMime: String, +) + +/** + * Persistence backend for content-addressed blobs. + * + * Implementations: [InMemoryBlobStore] for tests and single-JVM use, + * [FileBlobStore] for on-disk persistence. Custom backends (S3, GCS, + * an internal artifact registry) implement this interface and plug in + * via dependency injection at agent construction. + * + * **Idempotency:** `put` is deterministic on byte content. Putting the + * same bytes twice returns the same [ContentRef]; the second `put` is + * a no-op on disk. + */ +interface BlobStore { + /** + * Store [bytes] under their SHA-256 hash. Returns the resulting + * [ContentRef] carrying that hash, the byte length, and [wireMime]. + */ + fun put(bytes: ByteArray, wireMime: String): ContentRef + + /** + * Look up the blob for [ref]. Returns `null` when the store has no + * entry — callers handle the absence (re-fetch, fail closed, etc.). + */ + fun get(ref: ContentRef): ByteArray? + + /** + * Stream the blob for [ref] — for large payloads where loading the + * full bytes into memory is wasteful. `null` when missing. + */ + fun open(ref: ContentRef): InputStream? + + /** True if the store currently holds [ref]'s blob. 
*/
+    fun exists(ref: ContentRef): Boolean
+
+    /** Remove the blob for [ref] from the store. Idempotent. */
+    fun delete(ref: ContentRef)
+}
+
+/**
+ * Compute the [ContentRef] hash for [bytes] without storing anything.
+ * Useful when comparing two byte arrays without a [BlobStore] handy.
+ */
+fun computeContentHash(bytes: ByteArray): String {
+    val digest = MessageDigest.getInstance("SHA-256").digest(bytes)
+    return buildString(digest.size * 2) { for (b in digest) append("%02x".format(b)) }
+}
+
+/**
+ * In-process [BlobStore] — tests + single-JVM agents that don't need
+ * persistence across restarts. Backed by a `ConcurrentHashMap` keyed
+ * by hash; bytes are stored as defensive copies on `put` and returned
+ * as copies on `get` so consumer mutation can't corrupt the store.
+ */
+class InMemoryBlobStore : BlobStore {
+    private val store = ConcurrentHashMap<String, Entry>()
+
+    private data class Entry(val bytes: ByteArray, val wireMime: String)
+
+    override fun put(bytes: ByteArray, wireMime: String): ContentRef {
+        val hash = computeContentHash(bytes)
+        store[hash] = Entry(bytes.copyOf(), wireMime)
+        return ContentRef(hash = hash, sizeBytes = bytes.size.toLong(), wireMime = wireMime)
+    }
+
+    override fun get(ref: ContentRef): ByteArray? = store[ref.hash]?.bytes?.copyOf()
+
+    override fun open(ref: ContentRef): InputStream? = get(ref)?.inputStream()
+
+    override fun exists(ref: ContentRef): Boolean = store.containsKey(ref.hash)
+
+    override fun delete(ref: ContentRef) {
+        store.remove(ref.hash)
+    }
+}
+
+/**
+ * On-disk [BlobStore] — one file per blob, filename = hash. Survives
+ * process restarts so refs in a persisted [SessionSnapshot]
+ * dereference after a restart. Atomic via tmp + rename.
+ *
+ * Filename is the raw SHA-256 hex — hashes are filesystem-safe by
+ * construction. No suffix is appended; mime travels on the ref, not
+ * in the path. (Files can be tagged with extension by a deployer's
+ * out-of-band tooling if needed.)
+ * + * Composes with the #2753 filename-hashing pattern from + * `FileSnapshotStore` — both use hashes for filename safety. Here the + * hash is intrinsic (SHA-256 of blob content); there it was derived + * (SHA-256 of session id). + */ +class FileBlobStore(private val dir: Path) : BlobStore { + init { Files.createDirectories(dir) } + + override fun put(bytes: ByteArray, wireMime: String): ContentRef { + val hash = computeContentHash(bytes) + val target = dir.resolve(hash) + if (!Files.exists(target)) { + val tmp = dir.resolve("$hash.tmp") + Files.write(tmp, bytes, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING) + Files.move(tmp, target, java.nio.file.StandardCopyOption.REPLACE_EXISTING, java.nio.file.StandardCopyOption.ATOMIC_MOVE) + } + return ContentRef(hash = hash, sizeBytes = bytes.size.toLong(), wireMime = wireMime) + } + + override fun get(ref: ContentRef): ByteArray? { + val target = dir.resolve(ref.hash) + return if (Files.exists(target)) Files.readAllBytes(target) else null + } + + override fun open(ref: ContentRef): InputStream? { + val target = dir.resolve(ref.hash) + return if (Files.exists(target)) Files.newInputStream(target) else null + } + + override fun exists(ref: ContentRef): Boolean = Files.exists(dir.resolve(ref.hash)) + + override fun delete(ref: ContentRef) { + Files.deleteIfExists(dir.resolve(ref.hash)) + } +} diff --git a/src/main/kotlin/agents_engine/content/ToolResult.kt b/src/main/kotlin/agents_engine/content/ToolResult.kt new file mode 100644 index 0000000..71cab2b --- /dev/null +++ b/src/main/kotlin/agents_engine/content/ToolResult.kt @@ -0,0 +1,88 @@ +package agents_engine.content + +/** + * `agents_engine/content/ToolResult.kt` — multimodal tool return value + * (#2469, part of the #2465 multimodal epic). + * + * Tools have historically returned `Any?` — typically a `String`. 
When
+ * a tool wants to return mixed content (a screenshot tool returns
+ * description text + an image; OCR returns extracted text + the
+ * source PDF reference; a browser-capture tool returns the page
+ * markup + a screenshot), it can return a [ToolResult] carrying a
+ * list of typed [Content] parts.
+ *
+ * ```kotlin
+ * tool("screenshot", "Take a screenshot of the page") { args ->
+ *     val bytes = takeScreenshot(args["url"] as String)
+ *     val ref = blobStore.put(bytes, ImageMime.Png.wireMime)
+ *     ToolResult(
+ *         Content.Text("Screenshot captured."),
+ *         Content.Image(ref, ImageMime.Png),
+ *     )
+ * }
+ * ```
+ *
+ * **Audit discipline.** Observability bridges that surface tool results
+ * detect `ToolResult` and emit per-part metadata: the part's
+ * [Content.modality] and the [ContentRef] (hash + size + mime).
+ * **Blob bytes never enter the audit log.** Text parts inline their
+ * content as before (text is small, structural, and already part of
+ * the audit story).
+ *
+ * **Provider rendering.** Translating a `ToolResult` into the next
+ * LLM turn's tool-result message is the provider adapter's job
+ * (sibling #2470, deferred). For now, when a tool returns a
+ * `ToolResult`, the agentic loop renders the text parts to the
+ * model and notes the non-text parts as "[modality: <mime>]"
+ * placeholders. Vision-capable adapters fill these in end-to-end
+ * when #2470 ships.
+ *
+ * **Composition with existing surfaces:**
+ *
+ * - [ToolResult] is just another `Any?` the tool executor returns —
+ *   no `ToolDef` signature change. Existing tools that return
+ *   strings keep working byte-for-byte.
+ * - Snapshot/resume (#2386 / #2754) serialises through plain JSON
+ *   and refs travel with the snapshot; blobs live in the
+ *   [BlobStore]. A resumed snapshot dereferences refs against the
+ *   same store.
+ * - `untrustedOutput` (#642) still applies — wrap the rendered text
+ *   summary of a multi-part result in the untrusted envelope when
+ *   the tool declares it.
+ */
+data class ToolResult(val parts: List<Content>) {
+    constructor(vararg parts: Content) : this(parts.toList())
+
+    /**
+     * Convenience: extract the text parts as a single concatenated
+     * string. Useful when a tool returns mixed content but the model
+     * primarily consumes the textual summary; non-text parts surface
+     * via the audit + the provider adapter rendering.
+     */
+    val textSummary: String
+        get() = parts.filterIsInstance<Content.Text>().joinToString("\n") { it.text }
+
+    init { require(parts.isNotEmpty()) { "ToolResult requires at least one Content part." } }
+}
+
+/**
+ * Render a [ToolResult] into the placeholder text the agentic loop
+ * uses for the tool-result LLM message in v1 (#2470 will replace
+ * this with provider-specific multipart rendering). Text parts
+ * inline verbatim; non-text parts surface as `[modality: <mime>]
+ * (<hash-prefix>, <size>B)`.
+ *
+ * Audit bridges and the JSONL exporter call [ToolResult.parts]
+ * directly for per-part metadata writeouts; this placeholder is
+ * only the in-context model rendering.
+ */
+internal fun renderToolResultPlaceholder(result: ToolResult): String =
+    result.parts.joinToString("\n") { part ->
+        when (part) {
+            is Content.Text -> part.text
+            is Content.Image -> "[image: ${part.mime.wireMime}] (${part.ref.hash.take(12)}, ${part.ref.sizeBytes}B)"
+            is Content.Audio -> "[audio: ${part.mime.wireMime}] (${part.ref.hash.take(12)}, ${part.ref.sizeBytes}B)"
+            is Content.Video -> "[video: ${part.mime.wireMime}] (${part.ref.hash.take(12)}, ${part.ref.sizeBytes}B)"
+            is Content.Document -> "[document: ${part.mime.wireMime}] (${part.ref.hash.take(12)}, ${part.ref.sizeBytes}B)"
+        }
+    }
diff --git a/src/main/kotlin/agents_engine/model/AgenticLoop.kt b/src/main/kotlin/agents_engine/model/AgenticLoop.kt
index ffefb09..566460d 100644
--- a/src/main/kotlin/agents_engine/model/AgenticLoop.kt
+++ b/src/main/kotlin/agents_engine/model/AgenticLoop.kt
@@ -795,9 +795,9 @@ internal suspend fun executeAgentic(
                 }
             }
             val toolMessage = if (!denied && tool.untrustedOutput) {
-                wrapUntrustedToolResult(tool.name, result)
+                wrapUntrustedToolResult(tool.name, renderToolResultForLlm(result))
             } else {
-                result?.toString() ?: "null"
+                renderToolResultForLlm(result)
             }
             messages.add(LlmMessage("tool", toolMessage))
         }
@@ -1162,6 +1162,20 @@ private fun wrapUntrustedToolResult(toolName: String, result: Any?): String {
     return """{"tool":${toolName.toJsonString()},"trusted":false,"value":${value.toJsonString()}}"""
 }
 
+/**
+ * #2469 — render a tool's return value into the text the LLM sees as
+ * the tool-result message. For a [agents_engine.content.ToolResult]
+ * (multimodal), non-text parts surface as `[modality: <mime>]`
+ * placeholders — the actual provider-specific multipart rendering is
+ * the sibling #2470 ticket, deferred. For non-multimodal returns,
+ * `toString()` (or `"null"`) — byte-for-byte the pre-#2469 behaviour.
+ */
+private fun renderToolResultForLlm(result: Any?): String = when (result) {
+    is agents_engine.content.ToolResult -> agents_engine.content.renderToolResultPlaceholder(result)
+    null -> "null"
+    else -> result.toString()
+}
+
 private fun parseOutput(text: String, outType: KClass<*>): Any? = when {
     outType == String::class -> text
     else -> @Suppress("UNCHECKED_CAST") (outType as KClass<Any>).fromLlmOutput(text)
diff --git a/src/main/resources/internals-agent/content/Multimodal.md b/src/main/resources/internals-agent/content/Multimodal.md
new file mode 100644
index 0000000..14dcbbd
--- /dev/null
+++ b/src/main/resources/internals-agent/content/Multimodal.md
@@ -0,0 +1,84 @@
+---
+description: Source-file knowledge for agents_engine/content/* — multimodal foundation (#2465 epic, #2466 + #2467 + #2469 shipped). sealed Content { Text, Image(ref, ImageMime), Audio(ref, AudioMime), Video(ref, VideoMime), Document(ref, DocMime) }. Mime types are CLOSED sealed interfaces per modality with wireMime: String accessor — no String mime in the public API. ContentRef(hash, sizeBytes, wireMime) is content-addressed via SHA-256 hex; non-text Content variants hold a ref, never bytes. BlobStore interface + InMemoryBlobStore + FileBlobStore (atomic tmp+rename, survives process restart, dedupe via hash). ToolResult(parts: List<Content>) is the multimodal tool return type — just an Any? value the executor returns. AgenticLoop renders ToolResult as text + per-part placeholders in v1 (provider-specific multipart rendering is #2470, deferred). JsonlAuditExporter records outputParts column with "<modality>:<hash>:<size>:<mime>" per part; blob bytes never enter the audit row. Composes with snapshot/resume (refs serialize, blobs stay external) and the manifest-hash restore guard (#2754). Audio/Video modelled but Stage 1 wires Image + Document end-to-end. Call when reasoning about multimodal tool returns, content-addressed blob persistence, or audit discipline around binary content.
+--- + +# `agents_engine/content/*` — multimodal foundation + +Three cooperating pieces in package `agents_engine.content`: + +## `Content` sealed hierarchy (#2466) + +```kotlin +sealed interface Content { + data class Text(val text: String) : Content + data class Image(val ref: ContentRef, val mime: ImageMime) : Content + data class Audio(val ref: ContentRef, val mime: AudioMime) : Content + data class Video(val ref: ContentRef, val mime: VideoMime) : Content + data class Document(val ref: ContentRef, val mime: DocMime) : Content +} + +val Content.modality: String // stable per-variant name for audit rows +``` + +Closed mime types per modality — `ImageMime { Png, Jpeg, Gif, Webp }`, `AudioMime { Mp3, Wav, Flac, Ogg }`, `VideoMime { Mp4, WebM, Mov }`, `DocMime { Pdf, Docx, Markdown, Html, PlainText }`. Each exposes `wireMime: String` for adapter serialisation; the public API never accepts `String`. + +Stage 1 wires Image + Document end-to-end (the modalities the spec→product loop consumes). Audio + Video modelled now, exercised through provider adapters in Stage 2. + +## `ContentRef` + `BlobStore` (#2467) + +```kotlin +data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String) + +interface BlobStore { + fun put(bytes: ByteArray, wireMime: String): ContentRef + fun get(ref: ContentRef): ByteArray? + fun open(ref: ContentRef): InputStream? + fun exists(ref: ContentRef): Boolean + fun delete(ref: ContentRef) +} + +class InMemoryBlobStore : BlobStore +class FileBlobStore(dir: Path) : BlobStore + +fun computeContentHash(bytes: ByteArray): String // SHA-256 hex +``` + +Hash family: SHA-256 hex — matches the manifest hash (#1912) and snapshot filename hash (#2753). Single hash algorithm across the audit surface. + +Idempotent put: same bytes → same ref; same file on disk. InMemory uses defensive byte-array copies on put/get to protect against consumer mutation. FileBlobStore writes via tmp + atomic rename and survives process restart. 
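The idempotent-put and defensive-copy behaviour described above can be sketched standalone, assuming nothing beyond the JDK. `SketchRef`, `SketchStore`, and `sha256Hex` are hypothetical stand-ins written for illustration; they are not the module's `ContentRef`, `InMemoryBlobStore`, or `computeContentHash`, and the real classes may differ in detail.

```kotlin
import java.security.MessageDigest
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for ContentRef; field names mirror the signature shown above.
data class SketchRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// SHA-256 of the bytes, rendered as 64 lowercase hex characters.
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Hypothetical in-memory store keyed by content hash (illustration only).
class SketchStore {
    private val blobs = ConcurrentHashMap<String, ByteArray>()

    fun put(bytes: ByteArray, wireMime: String): SketchRef {
        val hash = sha256Hex(bytes)
        // putIfAbsent keeps the first copy, so a repeat put of the same bytes stores nothing new.
        blobs.putIfAbsent(hash, bytes.copyOf())
        return SketchRef(hash, bytes.size.toLong(), wireMime)
    }

    // Hand back a copy so a caller mutating the result cannot corrupt the stored bytes.
    fun get(ref: SketchRef): ByteArray? = blobs[ref.hash]?.copyOf()

    // Number of distinct blobs currently held.
    fun count(): Int = blobs.size
}
```

Under these assumptions, calling `put` twice with the same bytes should return equal refs and leave `count()` at 1, and zeroing a byte in an array returned by `get` should not change what a later `get` returns.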
+
+## `ToolResult` (#2469)
+
+```kotlin
+data class ToolResult(val parts: List<Content>) {
+    constructor(vararg parts: Content)
+    val textSummary: String  // concatenated Text parts only
+}
+```
+
+Tools can return `ToolResult` instead of `String`. No `ToolDef` signature change — `ToolResult` is just another `Any?`.
+
+AgenticLoop renders multipart returns for the LLM's tool-result message as: text parts inline + `[modality: <mime>] (<hash-prefix>, <size>B)` placeholders for non-text parts. Provider-specific multipart rendering (vision-capable Claude/OpenAI/Gemini) is the sibling #2470 ticket.
+
+JsonlAuditExporter detects `ToolResult` returns and writes a new `outputParts: List<String>?` column: one entry per part as `<modality>:<hash>:<size>:<mime>` for non-text parts, or `text:inline::text/plain` for text parts. Blob bytes never enter the audit row — pinned by a test.
+
+## Composition
+
+- **Snapshot / resume (#2386 / #2754):** Content carries `ContentRef`, not bytes. Snapshot files stay small; blobs live in the `BlobStore`. Resume against the same store dereferences refs normally. Manifest-hash restore guard applies unchanged.
+- **untrustedOutput (#642):** wraps the text-summary rendering. Multimodal results compose with the trust boundary.
+- **JSONL audit (#1914):** new column `outputParts` is null for non-multimodal returns — legacy rows unchanged. EXPECTED_FIELDS schema test updated.
+
+## v1 deferrals (carried as sibling tickets in #2465)
+
+- **#2468** Compile-time modality routing — `Agent` typed input, KSP multi-part `@Generable`
+- **#2470** Provider normalization adapters (Claude / OpenAI / Gemini / Ollama)
+- **#2471** Manifest-anchored modality capability validation at build time
+- **#2472** Multimodal memory — `ContentRef`-backed MemoryBank entries
+- **#2473** Multimodal testing fixtures + snapshot + mutation coverage
+
+## Related files
+
+- `core/Snapshot.kt` — `SessionSnapshot` carries refs through serialisation; no inlined blobs.
+- `model/AgenticLoop.kt` — `renderToolResultForLlm` placeholder rendering for the LLM tool-result message. +- `agents-kt-observability/.../JsonlAuditExporter.kt` — `outputParts` audit-row column + `partsSummary` helper. +- `agents-kt-manifest/.../PermissionManifest.kt` — modality capability declaration TBD (#2471). diff --git a/src/test/kotlin/agents_engine/content/ContentAndRefTest.kt b/src/test/kotlin/agents_engine/content/ContentAndRefTest.kt new file mode 100644 index 0000000..618c5e8 --- /dev/null +++ b/src/test/kotlin/agents_engine/content/ContentAndRefTest.kt @@ -0,0 +1,117 @@ +package agents_engine.content + +import org.junit.jupiter.api.io.TempDir +import java.nio.file.Path +import kotlin.test.Test +import kotlin.test.assertEquals +import kotlin.test.assertFalse +import kotlin.test.assertNotNull +import kotlin.test.assertNull +import kotlin.test.assertTrue + +/** + * #2466 + #2467 — typed Content hierarchy + ContentRef + BlobStore. Pins: + * + * 1. `Content` sealed; mime types are closed per modality (no String). + * 2. `ContentRef` carries hash + size + wire mime; equatable. + * 3. `computeContentHash` is deterministic and matches the store's + * `put` outcome. + * 4. `InMemoryBlobStore` round-trips bytes; identical bytes → same ref + * (dedupe); defensive copies on put/get protect against mutation. + * 5. `FileBlobStore` survives process restart (i.e. a fresh instance + * on the same dir sees prior puts). + * 6. The `modality` extension property is stable per variant. 
+ */ +class ContentAndRefTest { + + private val sampleBytes = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 1, 2, 3) + + @Test + fun `computeContentHash is deterministic`() { + val h1 = computeContentHash(sampleBytes) + val h2 = computeContentHash(sampleBytes.copyOf()) + assertEquals(h1, h2, "same bytes → same hash") + assertEquals(64, h1.length, "SHA-256 hex is 64 chars") + assertTrue(h1.matches(Regex("[0-9a-f]{64}"))) + } + + @Test + fun `InMemoryBlobStore round-trips bytes and produces equatable refs`() { + val store = InMemoryBlobStore() + val ref1 = store.put(sampleBytes, ImageMime.Png.wireMime) + val ref2 = store.put(sampleBytes.copyOf(), ImageMime.Png.wireMime) + assertEquals(ref1, ref2, "same bytes → same ContentRef (dedupe)") + val read = store.get(ref1) + assertNotNull(read) + assertTrue(sampleBytes.contentEquals(read), "bytes round-trip") + assertEquals(sampleBytes.size.toLong(), ref1.sizeBytes) + assertEquals(ImageMime.Png.wireMime, ref1.wireMime) + } + + @Test + fun `InMemoryBlobStore returns defensive copies (consumer mutation can't corrupt)`() { + val store = InMemoryBlobStore() + val mutable = sampleBytes.copyOf() + val ref = store.put(mutable, ImageMime.Png.wireMime) + // Mutate the input array — store's internal copy must be unaffected. + mutable[0] = 0x00 + val read = store.get(ref)!! + assertEquals(sampleBytes.first(), read.first(), "internal copy not affected by external mutation") + // Mutate the returned array — subsequent gets must see clean state. + read[0] = 0x00 + val readAgain = store.get(ref)!! 
+ assertEquals(sampleBytes.first(), readAgain.first(), "returned copies don't share storage") + } + + @Test + fun `InMemoryBlobStore exists and delete work as advertised`() { + val store = InMemoryBlobStore() + val ref = store.put(sampleBytes, ImageMime.Png.wireMime) + assertTrue(store.exists(ref)) + store.delete(ref) + assertFalse(store.exists(ref)) + assertNull(store.get(ref)) + } + + @Test + fun `FileBlobStore survives a fresh instance on the same dir (process-restart safety)`(@TempDir tmp: Path) { + val ref = FileBlobStore(tmp).put(sampleBytes, DocMime.Pdf.wireMime) + + // "Restart" — fresh instance reads the same dir. + val resumed = FileBlobStore(tmp) + assertTrue(resumed.exists(ref)) + val read = resumed.get(ref) + assertNotNull(read) + assertTrue(sampleBytes.contentEquals(read)) + } + + @Test + fun `FileBlobStore deduplicates identical puts`(@TempDir tmp: Path) { + val store = FileBlobStore(tmp) + val ref1 = store.put(sampleBytes, ImageMime.Png.wireMime) + val ref2 = store.put(sampleBytes, ImageMime.Png.wireMime) + assertEquals(ref1, ref2) + // Single file on disk — dedupe is real, not just on the ref. 
+        val files = java.nio.file.Files.list(tmp).use { it.toList() }
+        assertEquals(1, files.size, "second put must not write a second file")
+    }
+
+    @Test
+    fun `Content modality is stable per variant`() {
+        val img = Content.Image(ContentRef("abc", 1, "image/png"), ImageMime.Png)
+        val doc = Content.Document(ContentRef("def", 1, "application/pdf"), DocMime.Pdf)
+        val text = Content.Text("hello")
+        assertEquals("image", img.modality)
+        assertEquals("document", doc.modality)
+        assertEquals("text", text.modality)
+    }
+
+    @Test
+    fun `closed mime types expose stable wire forms`() {
+        assertEquals("image/png", ImageMime.Png.wireMime)
+        assertEquals("image/jpeg", ImageMime.Jpeg.wireMime)
+        assertEquals("application/pdf", DocMime.Pdf.wireMime)
+        assertEquals("audio/mpeg", AudioMime.Mp3.wireMime)
+        assertEquals("video/mp4", VideoMime.Mp4.wireMime)
+    }
+}
diff --git a/src/test/kotlin/agents_engine/content/ToolResultIntegrationTest.kt b/src/test/kotlin/agents_engine/content/ToolResultIntegrationTest.kt
new file mode 100644
index 0000000..83f4130
--- /dev/null
+++ b/src/test/kotlin/agents_engine/content/ToolResultIntegrationTest.kt
@@ -0,0 +1,104 @@
+package agents_engine.content
+
+import agents_engine.core.PipelineEvent
+import agents_engine.core.agent
+import agents_engine.core.observe
+import agents_engine.model.LlmMessage
+import agents_engine.model.LlmResponse
+import agents_engine.model.ModelClient
+import agents_engine.model.Tool
+import agents_engine.model.ToolCall
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertNotNull
+import kotlin.test.assertTrue
+
+/**
+ * #2469 — multimodal ToolResult end-to-end. Pins:
+ *
+ * 1. A tool returning a `ToolResult` works through the agentic loop;
+ *    the text-summary placeholder reaches the model on the next turn.
+ * 2. Non-text parts surface as `[modality: <mime>]` placeholders in v1.
+ *    The provider adapter rendering (#2470) replaces this end-to-end.
+ * 3.
`ToolResult` requires at least one part — empty list fails fast.
+ * 4. The placeholder text encodes hash prefix + size for traceability.
+ */
+class ToolResultIntegrationTest {
+
+    @Test
+    fun `tool returning ToolResult flows through the loop with placeholder text reaching the model`() {
+        val store = InMemoryBlobStore()
+        val imageBytes = byteArrayOf(1, 2, 3, 4, 5)
+        val imageRef = store.put(imageBytes, ImageMime.Png.wireMime)
+
+        val responses = ArrayDeque<LlmResponse>()
+        responses.add(LlmResponse.ToolCalls(listOf(ToolCall("screenshot", emptyMap()))))
+        responses.add(LlmResponse.Text("got the screenshot"))
+        val sawMessages = mutableListOf<List<LlmMessage>>()
+        val mock = ModelClient { msgs -> sawMessages += msgs.toList(); responses.removeFirst() }
+
+        val a = agent("snap") {
+            lateinit var screenshot: Tool<Map<String, Any?>, Any?>
+            model { ollama("t"); client = mock }
+            tools {
+                screenshot = tool("screenshot", "Take a screenshot") { _ ->
+                    ToolResult(
+                        Content.Text("Captured page."),
+                        Content.Image(imageRef, ImageMime.Png),
+                    )
+                }
+            }
+            skills { skill("s", "") { tools(screenshot) } }
+        }
+
+        val out = a("go")
+        assertEquals("got the screenshot", out)
+
+        val resumeMsgs = sawMessages[1]
+        val toolMsg = resumeMsgs.last { it.role == "tool" }
+        assertTrue("Captured page."
in toolMsg.content, "text part inlines in tool message")
+        assertTrue("[image: image/png]" in toolMsg.content, "image part surfaces as a typed placeholder")
+        // Hash prefix appears for traceability (full hash omitted to keep audit rows compact)
+        assertTrue(imageRef.hash.take(12) in toolMsg.content, "hash prefix surfaces in placeholder")
+    }
+
+    @Test
+    fun `empty ToolResult fails fast`() {
+        val ex = kotlin.runCatching { ToolResult(emptyList()) }.exceptionOrNull()
+        assertNotNull(ex)
+        assertTrue(ex is IllegalArgumentException)
+    }
+
+    @Test
+    fun `PipelineEvent ToolCalled carries the ToolResult as event_result for audit consumers`() {
+        val store = InMemoryBlobStore()
+        val ref = store.put(byteArrayOf(1, 2, 3), DocMime.Pdf.wireMime)
+        val responses = ArrayDeque<LlmResponse>()
+        responses.add(LlmResponse.ToolCalls(listOf(ToolCall("read_doc", emptyMap()))))
+        responses.add(LlmResponse.Text("done"))
+        val mock = ModelClient { _ -> responses.removeFirst() }
+
+        val events = mutableListOf<PipelineEvent>()
+        val a = agent("doc-reader") {
+            lateinit var readDoc: Tool<Map<String, Any?>, Any?>
+            model { ollama("t"); client = mock }
+            tools {
+                readDoc = tool("read_doc", "Read a document") { _ ->
+                    ToolResult(
+                        Content.Text("Spec summary: 12 pages"),
+                        Content.Document(ref, DocMime.Pdf),
+                    )
+                }
+            }
+            skills { skill("s", "") { tools(readDoc) } }
+        }
+        a.observe { events += it }
+        a("read it")
+
+        val toolEvent = events.filterIsInstance<PipelineEvent.ToolCalled>().single()
+        val result = toolEvent.result as ToolResult
+        assertEquals(2, result.parts.size, "both parts in the event for the bridge to walk")
+        assertTrue(result.parts.any { it is Content.Text })
+        assertTrue(result.parts.any { it is Content.Document })
+    }
+}