Feat/2465 multimodal foundation#67

Merged
Skobeltsyn merged 2 commits into main from feat/2465-multimodal-foundation on May 30, 2026

Multimodal

Skobeltsyn and others added 2 commits May 30, 2026 12:33
First three subtickets of the 0.8 multimodal epic (#2465), shipped
together as a coherent foundation. No provider rendering and no KSP
routing in this commit — those are the sibling tickets (#2470,
#2468) and depend on what this commit establishes.

```kotlin
val store = InMemoryBlobStore()  // or FileBlobStore(snapshotsDir / "blobs")
val pngRef = store.put(pngBytes, ImageMime.Png.wireMime)

tool("screenshot", "Take a screenshot") { args ->
    val bytes = takeScreenshot(args["url"] as String)
    val ref = store.put(bytes, ImageMime.Png.wireMime)
    ToolResult(
        Content.Text("Captured page."),
        Content.Image(ref, ImageMime.Png),
    )
}
```

`#2466 — Typed Content hierarchy + typed mime`:

- `agents_engine/content/Content.kt`. `sealed interface Content` with
  variants `Text`, `Image`, `Audio`, `Video`, `Document`. Stage 1
  wires Image + Document through the rest of the stack (the modalities
  the 0.8 spec→product loop actually consumes); Audio + Video are
  modelled now and exercised end-to-end through provider adapters in
  Stage 2.
- Mime types are CLOSED sealed interfaces per modality — `ImageMime`,
  `AudioMime`, `VideoMime`, `DocMime`. Each variant exposes a
  `wireMime: String` for adapter serialisation but the public API
  never accepts `String` mime. Extend by adding a variant.
- Non-text variants carry a `ContentRef`, not `ByteArray`. Avoids the
  data-class equals/hashCode gotcha with byte arrays AND keeps
  `Content` snapshot-safe (the #2386 / #2754 snapshot machinery
  never inlines blobs).
- Extension property `Content.modality: String` is the audit-stable
  per-variant name. Used by the JSONL audit exporter to write
  per-part rows.
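
The shape described above can be sketched roughly as follows. This is an illustrative reconstruction, not the contents of `Content.kt`: the `Jpeg` and `Pdf` variants, the concrete wire strings, the `Long` type of `sizeBytes`, the non-text modality names, and the `Inline` class are all assumptions, and Audio/Video are left out for brevity.

```kotlin
// Hypothetical sketch of the typed hierarchy; real declarations may differ.

// Closed mime set per modality: callers pick a variant, never a raw String.
sealed interface ImageMime {
    val wireMime: String
    object Png : ImageMime { override val wireMime = "image/png" }
    object Jpeg : ImageMime { override val wireMime = "image/jpeg" } // assumed variant
}

sealed interface DocMime {
    val wireMime: String
    object Pdf : DocMime { override val wireMime = "application/pdf" } // assumed variant
}

// Follows the ContentRef(hash, sizeBytes, wireMime) shape; field types are assumed.
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

sealed interface Content {
    data class Text(val text: String) : Content
    data class Image(val ref: ContentRef, val mime: ImageMime) : Content
    data class Document(val ref: ContentRef, val mime: DocMime) : Content
}

// Stable per-variant name; the exact strings for non-text variants are assumed.
val Content.modality: String
    get() = when (this) {
        is Content.Text -> "text"
        is Content.Image -> "image"
        is Content.Document -> "document"
    }

// Counter-example: a data class holding raw bytes compares arrays by identity.
data class Inline(val bytes: ByteArray)

fun main() {
    val ref = ContentRef("ab".repeat(32), 3, ImageMime.Png.wireMime)
    println(Content.Image(ref, ImageMime.Png).modality)                             // image
    println(Content.Image(ref, ImageMime.Png) == Content.Image(ref, ImageMime.Png)) // true
    println(Inline(byteArrayOf(1)) == Inline(byteArrayOf(1)))                       // false
}
```

The last two lines of `main` show why a ref-based variant is easier to compare than one that embeds a `ByteArray`: generated `equals` works on the ref's fields, while arrays fall back to reference equality.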

`#2467 — ContentRef + BlobStore + persistence`:

- `agents_engine/content/ContentRef.kt`. `data class ContentRef(hash,
  sizeBytes, wireMime)`. Hash is SHA-256 hex — matches the
  manifest-hash family used elsewhere (#1912, #2754), so the audit
  story has a single hash algorithm.
- `interface BlobStore { put, get, open, exists, delete }`. `put` is
  idempotent: putting the same bytes twice returns the same
  `ContentRef`, and the store keeps a single copy of the blob.
- `InMemoryBlobStore` — test / single-JVM. Defensive byte-array
  copies on put + get so consumer mutation can't corrupt the store.
- `FileBlobStore(dir)` — one file per blob, filename = SHA-256 hex.
  Survives process restart (fresh instance on the same dir sees
  prior puts). Atomic via tmp + rename, matching the #2753 pattern
  from `FileSnapshotStore`.
- Public `computeContentHash(bytes): String` for byte-level comparison
  without a store.
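
A minimal sketch of the content-addressed, idempotent put described above, assuming a `java.nio.file` based layout. `SketchFileBlobStore` and its method signatures are hypothetical stand-ins, not the shipped `FileBlobStore`; only `computeContentHash(bytes): String` and the `ContentRef` field names come from the notes above.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption
import java.security.MessageDigest

// Field types are assumed; names follow ContentRef(hash, sizeBytes, wireMime).
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// SHA-256 of the bytes, rendered as lowercase hex (64 characters).
fun computeContentHash(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Hypothetical store: one file per blob, filename = hash.
class SketchFileBlobStore(private val dir: Path) {
    init {
        Files.createDirectories(dir)
    }

    fun put(bytes: ByteArray, wireMime: String): ContentRef {
        val hash = computeContentHash(bytes)
        val target = dir.resolve(hash)
        if (!Files.exists(target)) {
            // Write to a temp file in the same directory, then rename into
            // place so readers never observe a half-written blob.
            val tmp = Files.createTempFile(dir, hash, ".tmp")
            Files.write(tmp, bytes)
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
        }
        return ContentRef(hash, bytes.size.toLong(), wireMime)
    }

    // Returns null when the blob is unknown (an assumed convention).
    fun get(ref: ContentRef): ByteArray? {
        val target = dir.resolve(ref.hash)
        return if (Files.exists(target)) Files.readAllBytes(target) else null
    }

    fun exists(ref: ContentRef): Boolean = Files.exists(dir.resolve(ref.hash))
}
```

Because the filename is derived from the bytes, a second `put` of the same content finds the file already present and skips the write, and a fresh instance pointed at the same directory can resolve refs produced before a restart.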

`#2469 — Multimodal ToolResult + audit wiring`:

- `agents_engine/content/ToolResult.kt`. `data class ToolResult(parts:
  List<Content>)`. Just another `Any?` the tool executor returns — no
  ToolDef signature change; tools that return strings keep working
  byte-for-byte. Requires at least one part (empty list fails fast).
- AgenticLoop's tool-message rendering detects `ToolResult` and
  renders parts as `<text>\n[modality: <wireMime>] (<hash-prefix>,
  <size>B)` placeholders for the LLM context. Provider-specific
  multipart rendering is #2470 (deferred); the placeholder is good
  enough until vision-capable adapters land.
- `untrustedOutput` (#642) still wraps the rendered text summary
  in the JSON envelope — multimodal results compose with the
  trust boundary.
- JSONL audit exporter (#1914) gains a new `outputParts` field on
  audit rows. For `ToolResult` returns, emits one summary string per
  part: `<modality>:<hash-prefix>:<sizeBytes>:<wireMime>`. Text parts
  surface as `text:inline:<charCount>:text/plain`. **Blob bytes
  never enter the audit row.** `outputType` still names the wrapper
  type so column-positioned consumers see a stable shape. Field is
  null for non-multimodal returns — legacy audit rows unchanged.
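
The two string formats above (LLM placeholder and audit summary) could be produced along these lines. This is a sketch under stated assumptions: `Part` is a simplified stand-in for `Content`, the 12-character hash prefix is a guess (the notes do not state the prefix length), and the function names are hypothetical.

```kotlin
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// Simplified stand-in for the Content hierarchy (assumption).
sealed interface Part {
    data class Text(val text: String) : Part
    data class Blob(val modality: String, val ref: ContentRef) : Part
}

// Assumed prefix length; the real value is not given in the notes.
const val HASH_PREFIX_LEN = 12

// LLM-context placeholder: text inline, blobs as "[modality: mime] (prefix, sizeB)".
fun renderForLlm(parts: List<Part>): String {
    require(parts.isNotEmpty()) { "ToolResult requires at least one part" }
    return parts.joinToString("\n") { part ->
        when (part) {
            is Part.Text -> part.text
            is Part.Blob ->
                "[${part.modality}: ${part.ref.wireMime}] " +
                    "(${part.ref.hash.take(HASH_PREFIX_LEN)}, ${part.ref.sizeBytes}B)"
        }
    }
}

// Audit summaries: metadata only, never the blob bytes.
fun auditSummaries(parts: List<Part>): List<String> = parts.map { part ->
    when (part) {
        is Part.Text -> "text:inline:${part.text.length}:text/plain"
        is Part.Blob ->
            "${part.modality}:${part.ref.hash.take(HASH_PREFIX_LEN)}:" +
                "${part.ref.sizeBytes}:${part.ref.wireMime}"
    }
}

fun main() {
    val ref = ContentRef("0123456789abcdef".repeat(4), 2048, "image/png")
    val parts = listOf(Part.Text("Captured page."), Part.Blob("image", ref))
    println(renderForLlm(parts))
    // Captured page.
    // [image: image/png] (0123456789ab, 2048B)
    println(auditSummaries(parts))
    // [text:inline:14:text/plain, image:0123456789ab:2048:image/png]
}
```

Both functions read only the ref's metadata, which is what keeps blob bytes out of the LLM context and the audit row alike.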

Composition with existing surfaces:
- Snapshot/resume (#2386 / #2754) — refs travel with snapshots; blobs
  stay in the `BlobStore`. A resumed snapshot dereferences refs
  against the same store. No inlined-blob explosion.
- Manifest-hash restore guard (#2754) — applies unchanged.
- `untrustedOutput` (#642) — applies to the text-summary rendering.

Tests:
- ContentAndRefTest.kt (8 cases): hash determinism, InMemoryBlobStore
  round-trip + dedupe, defensive copies, exists/delete,
  FileBlobStore process-restart safety + dedupe (one file on disk),
  modality stability, mime wire forms.
- ToolResultIntegrationTest.kt (3 cases): tool returning ToolResult
  end-to-end with text + image; empty ToolResult fails fast;
  `PipelineEvent.ToolCalled.result` carries the typed `ToolResult`
  for bridge consumers.
- JsonlAuditExporterTest.kt: schema-pinning EXPECTED_FIELDS updated
  to include `outputParts`; new test "multimodal ToolResult writes
  outputParts" pins the per-part summary format AND asserts that
  neither argument values nor image bytes ever enter the audit row.

Deferred (carried as siblings, not this commit's scope):
- #2468 Compile-time modality routing via KSP
- #2470 Provider adapters (Claude/OpenAI/Gemini/Ollama) for
  multipart `Content` → provider payload
- #2471 Manifest-anchored modality capability validation
- #2472 Multimodal memory (ContentRef-backed MemoryBank entries)
- #2473 Multimodal testing fixtures

Full suite: 1792 tests across 7 modules, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ct, README, CHANGELOG

- docs/multimodal.md (new) — user-facing multimodal doc. Three pieces
  walked through: typed Content variants + closed mime types,
  ContentRef + BlobStore (InMemory + File) with hash-family rationale
  and process-restart safety, ToolResult with the v1 placeholder
  rendering + audit-row discipline. The "What's coming" section names
  the five sibling tickets (#2468 KSP routing, #2470 provider adapters,
  #2471 manifest-anchored capability, #2472 multimodal memory,
  #2473 testing fixtures). The Stage 1 vs Stage 2 split is explicit.
- src/main/resources/internals-agent/content/Multimodal.md (new) —
  IDE-side LLM adjunct covering all three pieces. Signatures, hash
  family rationale, idempotent put semantics, audit-row column
  format, snapshot composition, deferral list.
- README.md — adds a "Multimodal foundation" bullet under "Implemented
  today" right after the eval harness bullet. Names all three
  sub-tickets and the Stage 1 / Stage 2 split.
- CHANGELOG.md `## [Unreleased]` — opens with three paragraph entries
  under "Multimodal foundation (#2465 epic, Stage 1)" covering
  #2466 / #2467 / #2469 with their AC and composition story. Calls
  out the EXPECTED_FIELDS schema-pin update so audit-row consumers
  see the wire-format change. Eval harness section preserved below.

No source changes. Full suite stays at 1792 / 0 failures from the
prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skobeltsyn merged commit becd67f into main on May 30, 2026
3 checks passed