Feat/2470a ollama vision by Skobeltsyn · Pull Request #68 · Deep-CodeAI/Agents.KT

Skobeltsyn · 2026-05-30T10:28:27Z

No description provided.

#2470 (slice a): vision-input path for the four built-in adapters, with programmatic image fixtures and per-provider live tests. Sibling work (`Content` → `LlmMessage` translation, multipart `@Generable` input via KSP, manifest-anchored capability validation) is the rest of #2470 / #2468 / #2471, layered on top of this.

```kotlin
val png = VisionFixtures.threeSquaresPng()
val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
client.chat(listOf(
    LlmMessage(
        role = "user",
        content = "How many squares?",
        images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
    ),
))
// → LlmResponse.Text("3")
```

Implementation:

- `LlmMessage.images: List<ImagePart>? = null`: optional, back-compat default. Adapters translate to the per-provider wire format when it is non-null AND the role is "user"; otherwise zero diff vs pre-#2470.
- `ImagePart(base64: String, wireMime: ImagePart.WireMime)`: closed `WireMime` sealed type (Png / Jpeg / Gif / Webp). A String mime is not accepted in the public ctor. Base64 is stored as `String` so structural equals/hashCode work (the `ByteArray` data-class trap, avoided).

Per-provider wire shapes (pinned by VisionWireFormatTest):

| Provider | User-message shape |
|----------|--------------------|
| Ollama   | `{role:"user", content:"text", images:["<b64>", ...]}` |
| Claude   | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]}` |
| OpenAI   | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]}` |
| DeepSeek | inherits OpenAI; most DeepSeek models lack vision and silently ignore the field (shape-tested, no live call) |

Each adapter's vision path is gated:

- role must be "user": system/assistant/tool messages with non-null `images` ignore the field on the wire (no provider's API carries images on those roles).
- `images = null` or empty → exact pre-#2470 wire shape (back-compat pinned by dedicated tests).

`VisionFixtures` (test source set): 256×256 PNGs generated via `BufferedImage` + `ImageIO`. Two fixtures: `threeSquaresPng()` (red/blue/green squares, well-separated, thick black outlines so counting is unambiguous) and `housePng()` (triangle roof + body + door + two windows, terracotta + beige colour scheme). Reproducible byte-for-byte; ships in source, no external assets.

Tests:

- VisionWireFormatTest.kt (8 cases): per-provider wire shape for both the vision path and the no-images back-compat path; multiple images in one message; non-user-role images filtered; PNG fixture sanity (magic bytes + reasonable size).
- VisionLiveTest.kt (6 cases): per-provider end-to-end against:
  * Ollama qwen3-vl:8b: tagged `live-llm`, runs via `./gradlew integrationTest`
  * Claude Haiku 4.5: tagged `live-cloud-api`, runs in default `:test`, assumeTrue skips when no key
  * OpenAI gpt-4o-mini: same pattern

Cost discipline per call: 256×256 PNG (~5KB), temperature=0, maxTokens=80, single-turn. Each test sends a fixture image with a short text prompt, parses the text response, and asserts a loose keyword match (3 / three for the squares; house / home / cottage / building / cabin / barn for the house). Model names are overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, etc.) for CI flexibility.

Full unit suite: 1794 tests, 0 failures.

To run the live vision tests:

- `./gradlew integrationTest --tests "*VisionLiveTest*"`: Ollama (requires `qwen3-vl:8b` pulled in local or Ollama Cloud)
- `./gradlew test --tests "*VisionLiveTest*"`: Claude + OpenAI (run in default :test under the live-cloud-api tag; assumeTrue skips per provider when no key)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
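The `ByteArray` data-class trap and the role gating described above can be illustrated with a minimal, self-contained sketch. The types below are hypothetical stand-ins written for this example, not the definitions in Agents.KT: the `value` property on `WireMime`, the `RawImage` counter-example, and the helpers `toOllamaWire` and `toDataUrl` are all assumptions, and the wire message is modelled as a plain `Map` rather than whatever JSON layer the adapters actually use.

```kotlin
// Hypothetical stand-ins for the PR's types; names and shapes are assumptions.
sealed class WireMime(val value: String) {
    object Png : WireMime("image/png")
    object Jpeg : WireMime("image/jpeg")
    object Gif : WireMime("image/gif")
    object Webp : WireMime("image/webp")
}

// Base64 is a String, so the generated equals/hashCode compare by content.
data class ImagePart(val base64: String, val wireMime: WireMime)

// Counter-example: an array property in a data class is compared by reference.
data class RawImage(val bytes: ByteArray)

data class LlmMessage(
    val role: String,
    val content: String,
    val images: List<ImagePart>? = null,
)

// Ollama-style shape: "images" is emitted only for a user message with a
// non-empty list; every other case keeps the plain role/content shape.
fun toOllamaWire(msg: LlmMessage): Map<String, Any> {
    val wire = linkedMapOf<String, Any>("role" to msg.role, "content" to msg.content)
    val images = msg.images
    if (msg.role == "user" && images != null && images.isNotEmpty()) {
        wire["images"] = images.map { it.base64 }
    }
    return wire
}

// OpenAI-style image entries carry the payload as a data URL.
fun toDataUrl(part: ImagePart): String =
    "data:${part.wireMime.value};base64,${part.base64}"

fun main() {
    val a = ImagePart("aGVsbG8=", WireMime.Png)
    val b = ImagePart("aGVsbG8=", WireMime.Png)
    println(a == b) // true: String content equality

    // false: two arrays with equal contents are still different references
    println(RawImage(byteArrayOf(1, 2)) == RawImage(byteArrayOf(1, 2)))

    println(toOllamaWire(LlmMessage("user", "How many squares?", listOf(a))))
    println(toOllamaWire(LlmMessage("system", "Be brief.", listOf(a))))
    println(toDataUrl(a))
}
```

In this sketch the system message prints without an `images` key even though the field was set, which mirrors the gating rule stated in the commit message; the real adapters may implement the check differently.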

- docs/multimodal.md: new "Vision input — talking to the model (#2470 slice a)" section between the existing foundation content and the "What's still coming" list. Walks through the LlmMessage.images field, ImagePart shape, per-provider wire-format table, back-compat + role-gating guarantees, programmatic VisionFixtures, and the per-provider live test how-to-run. "What's coming" list updated to flag the #2470 slice-a/slice-b split (this commit is slice a; the Content → LlmMessage.images loop translation is slice b).
- README.md: new "Vision input to models" bullet right after the multimodal foundation bullet. Names all four providers and their default test models with the cost-discipline notes.
- CHANGELOG.md `## [Unreleased]`: new "Added — Vision input across all providers (#2470 slice a)" section ABOVE the existing multimodal foundation section. Covers LlmMessage.images + ImagePart, per-provider adapter rows, role-gating, fixtures, live test setup + cost discipline, wire-format unit-test count.

No source changes. Full suite stays at 1794 / 0 failures from the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
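For readers who want a feel for the programmatic `VisionFixtures` that the new docs section walks through, here is a rough sketch of generating a 256×256 PNG with `BufferedImage` + `ImageIO` and checking its magic bytes. It is not the fixture code from this PR: the function names `threeSquaresPngSketch`, `hasPngMagic`, and `toBase64Sketch`, as well as the geometry, colours, and stroke width, are assumptions chosen for illustration.

```kotlin
import java.awt.BasicStroke
import java.awt.Color
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import java.util.Base64
import javax.imageio.ImageIO

// Hypothetical fixture: three outlined squares on a white 256x256 canvas.
fun threeSquaresPngSketch(): ByteArray {
    val img = BufferedImage(256, 256, BufferedImage.TYPE_INT_RGB)
    val g = img.createGraphics()
    try {
        g.color = Color.WHITE
        g.fillRect(0, 0, 256, 256)
        listOf(Color.RED, Color.BLUE, Color.GREEN).forEachIndexed { i, fill ->
            val x = 20 + i * 80 // squares at x = 20, 100, 180 with 20px gaps
            g.color = fill
            g.fillRect(x, 98, 60, 60)
            g.color = Color.BLACK
            g.stroke = BasicStroke(4f) // thick outline so edges stay distinct
            g.drawRect(x, 98, 60, 60)
        }
    } finally {
        g.dispose()
    }
    val out = ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    return out.toByteArray()
}

// Every PNG file starts with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
fun hasPngMagic(bytes: ByteArray): Boolean {
    val magic = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)
    return bytes.size >= magic.size &&
        bytes.copyOfRange(0, magic.size).contentEquals(magic)
}

// Standard (non-URL-safe) Base64, suitable for data URLs and JSON payloads.
fun toBase64Sketch(bytes: ByteArray): String =
    Base64.getEncoder().encodeToString(bytes)

fun main() {
    val png = threeSquaresPngSketch()
    println("magic ok: ${hasPngMagic(png)}, size: ${png.size} bytes")
    println("base64 prefix: ${toBase64Sketch(png).take(16)}")
}
```

Drawing the image in code keeps the fixture in the source tree with no binary assets; the exact byte size will depend on the shapes drawn and the JDK's PNG encoder.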

Skobeltsyn and others added 2 commits May 30, 2026 12:57

Skobeltsyn merged commit 89037b1 into main May 30, 2026
3 checks passed

Skobeltsyn merged 2 commits into main from feat/2470a-ollama-vision
