Feat/2465 multimodal foundation by Skobeltsyn · Pull Request #67 · Deep-CodeAI/Agents.KT

Skobeltsyn · 2026-05-30T09:50:04Z

Multimodal

First three subtickets of the 0.8 multimodal epic (#2465), shipped together as a coherent foundation. This commit contains no provider rendering and no KSP routing; those are the sibling tickets (#2470, #2468), which depend on what this commit establishes.

```kotlin
// Blob stores hand back a ContentRef (hash, size, wire mime), never the bytes themselves.
val store = InMemoryBlobStore() // or FileBlobStore(snapshotsDir / "blobs")
val pngRef = store.put(pngBytes, ImageMime.Png.wireMime)

tool("screenshot", "Take a screenshot") { args ->
    // takeScreenshot stands in for application-side capture code.
    val bytes = takeScreenshot(args["url"] as String)
    val ref = store.put(bytes, ImageMime.Png.wireMime)
    // A tool may return a ToolResult instead of a String; each part is rendered and audited.
    ToolResult(
        Content.Text("Captured page."),
        Content.Image(ref, ImageMime.Png),
    )
}
```

**#2466: Typed Content hierarchy + typed mime**

- `agents_engine/content/Content.kt`: `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document`. Stage 1 wires Image + Document through the rest of the stack (the modalities the 0.8 spec→product loop actually consumes); Audio + Video are modelled now and exercised end-to-end through provider adapters in Stage 2.
- Mime types are CLOSED sealed interfaces per modality: `ImageMime`, `AudioMime`, `VideoMime`, `DocMime`. Each variant exposes a `wireMime: String` for adapter serialisation, but the public API never accepts a `String` mime. To extend, add a variant.
- Non-text variants carry a `ContentRef`, not a `ByteArray`. This avoids the data-class `equals`/`hashCode` gotcha with byte arrays AND keeps `Content` snapshot-safe (the #2386 / #2754 snapshot machinery never inlines blobs).
- The extension property `Content.modality: String` is the audit-stable per-variant name. The JSONL audit exporter uses it to write per-part rows.

**#2467: ContentRef + BlobStore + persistence**

- `agents_engine/content/ContentRef.kt`: `data class ContentRef(hash, sizeBytes, wireMime)`. The hash is SHA-256 hex, matching the manifest-hash family used elsewhere (#1912, #2754), so the audit story has a single hash algorithm.
- `interface BlobStore { put, get, open, exists, delete }`.
  Idempotent put: putting the same bytes twice returns the same `ContentRef`, and the store keeps one copy on disk.
- `InMemoryBlobStore`: test / single-JVM. Defensive byte-array copies on put and get, so consumer mutation can't corrupt the store.
- `FileBlobStore(dir)`: one file per blob, filename = SHA-256 hex. Survives process restart (a fresh instance on the same dir sees prior puts). Writes are atomic via tmp + rename, matching the #2753 pattern from `FileSnapshotStore`.
- Public `computeContentHash(bytes): String` for byte-level comparison without a store.

**#2469: Multimodal ToolResult + audit wiring**

- `agents_engine/content/ToolResult.kt`: `data class ToolResult(parts: List<Content>)`. It is just another `Any?` the tool executor returns, so there is no `ToolDef` signature change; tools that return strings keep working byte-for-byte. At least one part is required (an empty list fails fast).
- AgenticLoop's tool-message rendering detects `ToolResult` and renders parts as `<text>\n[modality: <wireMime>] (<hash-prefix>, <size>B)` placeholders for the LLM context. Provider-specific multipart rendering is #2470 (deferred); the placeholder is good enough until vision-capable adapters land.
- `untrustedOutput` (#642) still wraps the rendered text summary in the JSON envelope, so multimodal results compose with the trust boundary.
- The JSONL audit exporter (#1914) gains a new `outputParts` field on audit rows. For `ToolResult` returns, it emits one summary string per part: `<modality>:<hash-prefix>:<sizeBytes>:<wireMime>`. Text parts surface as `text:inline:<charCount>:text/plain`. **Blob bytes never enter the audit row.** `outputType` still names the wrapper type, so column-positioned consumers see a stable shape. The field is null for non-multimodal returns; legacy audit rows are unchanged.

Composition with existing surfaces:

- Snapshot/resume (#2386 / #2754): refs travel with snapshots; blobs stay in the `BlobStore`. A resumed snapshot dereferences refs against the same store. No inlined-blob explosion.
- Manifest-hash restore guard (#2754): applies unchanged.
- `untrustedOutput` (#642): applies to the text-summary rendering.

Tests:

- `ContentAndRefTest.kt` (8 cases): hash determinism, InMemoryBlobStore round-trip + dedupe, defensive copies, exists/delete, FileBlobStore process-restart safety + dedupe (one file on disk), modality stability, mime wire forms.
- `ToolResultIntegrationTest.kt` (3 cases): a tool returning ToolResult end-to-end with text + image; an empty ToolResult fails fast; `PipelineEvent.ToolCalled.result` carries the typed `ToolResult` for bridge consumers.
- `JsonlAuditExporterTest.kt`: the schema-pinning EXPECTED_FIELDS is updated to include `outputParts`; a new test, "multimodal ToolResult writes outputParts", pins the per-part summary format AND asserts that no argument values and no image bytes ever enter the audit row.

Deferred (carried as siblings, not this commit's scope):

- #2468 Compile-time modality routing via KSP
- #2470 Provider adapters (Claude/OpenAI/Gemini/Ollama) for multipart `Content` → provider payload
- #2471 Manifest-anchored modality capability validation
- #2472 Multimodal memory (ContentRef-backed MemoryBank entries)
- #2473 Multimodal testing fixtures

Full suite: 1792 tests across 7 modules, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
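The #2466 and #2469 rules (closed mime types, refs instead of byte arrays, a non-empty `ToolResult`, and the placeholder and audit-summary formats) can be sketched in standalone Kotlin. This is a hypothetical, trimmed mirror rather than the Agents.KT source: it models only `Text`, `Image` and `Document`; the `Png`/`Jpeg`/`Pdf` variants and the `"image"`/`"document"` modality names are assumptions; `refOrNull`, `auditSummary` and `renderPlaceholder` are illustrative helper names; and the 8-character hash prefix is a guess because the PR does not state its length.

```kotlin
// Hypothetical, trimmed mirror of the #2466 / #2469 shapes; not the Agents.KT source.
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// Closed mime types: callers pick a variant, never pass a raw String.
sealed interface ImageMime {
    val wireMime: String
    object Png : ImageMime { override val wireMime = "image/png" }
    object Jpeg : ImageMime { override val wireMime = "image/jpeg" }
}

sealed interface DocMime {
    val wireMime: String
    object Pdf : DocMime { override val wireMime = "application/pdf" }
}

sealed interface Content {
    data class Text(val text: String) : Content
    // Non-text parts hold a ContentRef, so data-class equality stays structural.
    data class Image(val ref: ContentRef, val mime: ImageMime) : Content
    data class Document(val ref: ContentRef, val mime: DocMime) : Content
}

// Audit-stable per-variant name ("text" matches the PR's audit format; the others are assumed).
val Content.modality: String
    get() = when (this) {
        is Content.Text -> "text"
        is Content.Image -> "image"
        is Content.Document -> "document"
    }

// The blob reference of a non-text part, or null for inline text.
val Content.refOrNull: ContentRef?
    get() = when (this) {
        is Content.Text -> null
        is Content.Image -> ref
        is Content.Document -> ref
    }

data class ToolResult(val parts: List<Content>) {
    constructor(vararg parts: Content) : this(parts.toList())

    init {
        // "Requires at least one part (empty list fails fast)."
        require(parts.isNotEmpty()) { "ToolResult needs at least one part" }
    }
}

// Audit row entry: <modality>:<hash-prefix>:<sizeBytes>:<wireMime>; bytes never appear.
fun Content.auditSummary(prefixLen: Int = 8): String {
    if (this is Content.Text) return "text:inline:${text.length}:text/plain"
    val ref = refOrNull ?: error("non-text part without a ContentRef")
    return "$modality:${ref.hash.take(prefixLen)}:${ref.sizeBytes}:${ref.wireMime}"
}

// LLM-context placeholder: text verbatim, other parts as "[modality: wireMime] (prefix, sizeB)".
fun ToolResult.renderPlaceholder(prefixLen: Int = 8): String =
    parts.joinToString("\n") { part ->
        if (part is Content.Text) part.text
        else {
            val ref = part.refOrNull ?: error("non-text part without a ContentRef")
            "[${part.modality}: ${ref.wireMime}] (${ref.hash.take(prefixLen)}, ${ref.sizeBytes}B)"
        }
    }
```

With a `Content.Text("Captured page.")` part followed by an image part whose ref is `ContentRef("3f2a9c1b00ff", 2048, "image/png")`, the sketch's audit summaries are `text:inline:14:text/plain` and `image:3f2a9c1b:2048:image/png`, and the placeholder is `Captured page.` followed by `[image: image/png] (3f2a9c1b, 2048B)` on the next line.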

…ct, README, CHANGELOG

- `docs/multimodal.md` (new): user-facing multimodal doc. It walks through three pieces: typed Content variants + closed mime types; ContentRef + BlobStore (InMemory + File) with the hash-family rationale and process-restart safety; and ToolResult with the v1 placeholder rendering + audit-row discipline. A "What's coming" section names the five sibling tickets (#2468 KSP routing, #2470 provider adapters, #2471 manifest-anchored capability, #2472 multimodal memory, #2473 testing fixtures). The Stage 1 vs Stage 2 split is explicit.
- `src/main/resources/internals-agent/content/Multimodal.md` (new): IDE-side LLM adjunct covering all three pieces (signatures, hash-family rationale, idempotent put semantics, audit-row column format, snapshot composition, deferral list).
- `README.md`: adds a "Multimodal foundation" bullet under "Implemented today", right after the eval harness bullet. Names all three sub-tickets and the Stage 1 / Stage 2 split.
- `CHANGELOG.md` `## [Unreleased]`: opens with three paragraph entries under "Multimodal foundation (#2465 epic, Stage 1)" covering #2466 / #2467 / #2469 with their AC and composition story. Calls out the EXPECTED_FIELDS schema-pin update so audit-row consumers see the wire-format change. The eval harness section is preserved below.

No source changes. Full suite stays at 1792 / 0 failures from the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
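The blob-store semantics described for #2467 (content-hash keying, idempotent put, defensive copies, and one file per blob written via tmp + rename) can be sketched with JDK APIs only. This is a hypothetical approximation, not the Agents.KT implementation: the `BlobStore` method signatures (for example `get` returning a nullable `ByteArray`, `open` returning an `InputStream`, `delete` returning `Boolean`) are assumptions, and the `blob-` temp-file prefix is invented for the sketch.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption
import java.security.MessageDigest
import java.util.concurrent.ConcurrentHashMap

// Hypothetical mirror of #2467: field names follow the PR, signatures are assumed.
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// SHA-256 of the bytes as lowercase hex (64 characters).
fun computeContentHash(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes).joinToString("") { "%02x".format(it) }

interface BlobStore {
    fun put(bytes: ByteArray, wireMime: String): ContentRef
    fun get(ref: ContentRef): ByteArray?
    fun open(ref: ContentRef): InputStream?
    fun exists(ref: ContentRef): Boolean
    fun delete(ref: ContentRef): Boolean
}

class InMemoryBlobStore : BlobStore {
    // Keyed by content hash, so identical bytes occupy a single entry.
    private val blobs = ConcurrentHashMap<String, ByteArray>()

    override fun put(bytes: ByteArray, wireMime: String): ContentRef {
        val hash = computeContentHash(bytes)
        blobs.putIfAbsent(hash, bytes.copyOf()) // private copy: caller mutation can't reach the store
        return ContentRef(hash, bytes.size.toLong(), wireMime)
    }

    override fun get(ref: ContentRef): ByteArray? = blobs[ref.hash]?.copyOf() // copy out as well
    override fun open(ref: ContentRef): InputStream? = blobs[ref.hash]?.let { ByteArrayInputStream(it) }
    override fun exists(ref: ContentRef): Boolean = blobs.containsKey(ref.hash)
    override fun delete(ref: ContentRef): Boolean = blobs.remove(ref.hash) != null
}

class FileBlobStore(private val dir: Path) : BlobStore {
    init {
        Files.createDirectories(dir)
    }

    // One file per blob, named by its SHA-256 hex, so a fresh instance on the same dir sees prior puts.
    private fun pathOf(ref: ContentRef): Path = dir.resolve(ref.hash)

    override fun put(bytes: ByteArray, wireMime: String): ContentRef {
        val ref = ContentRef(computeContentHash(bytes), bytes.size.toLong(), wireMime)
        val target = pathOf(ref)
        if (!Files.exists(target)) {
            // Write to a temp file in the same directory, then rename it into place atomically.
            val tmp = Files.createTempFile(dir, "blob-", ".tmp")
            Files.write(tmp, bytes)
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
        }
        return ref
    }

    override fun get(ref: ContentRef): ByteArray? =
        pathOf(ref).takeIf { Files.exists(it) }?.let { Files.readAllBytes(it) }

    override fun open(ref: ContentRef): InputStream? =
        pathOf(ref).takeIf { Files.exists(it) }?.let { Files.newInputStream(it) }

    override fun exists(ref: ContentRef): Boolean = Files.exists(pathOf(ref))
    override fun delete(ref: ContentRef): Boolean = Files.deleteIfExists(pathOf(ref))
}
```

Keying by the SHA-256 hex is what makes `put` idempotent here: a second `put` of identical bytes finds the existing entry or file and just returns an equal `ContentRef`. The atomic rename means the hash-named file only appears once fully written; if two writers race, whether the second `ATOMIC_MOVE` replaces the existing file or throws is platform-specific.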

Skobeltsyn and others added 2 commits May 30, 2026 12:33

Skobeltsyn merged commit becd67f into main May 30, 2026
3 checks passed

Skobeltsyn merged 2 commits into `main` from `feat/2465-multimodal-foundation`

Skobeltsyn commented May 30, 2026
