Feat/2470a ollama vision#68

Merged
Skobeltsyn merged 2 commits into main from feat/2470a-ollama-vision on May 30, 2026
@Skobeltsyn
Contributor

No description provided.

Skobeltsyn and others added 2 commits May 30, 2026 12:57
#2470 (slice a) — vision-input path for the four built-in adapters,
with programmatic image fixtures and per-provider live tests. Sibling
work (`Content` → `LlmMessage` translation, multipart `@Generable`
input via KSP, manifest-anchored capability validation) is the rest
of #2470 / #2468 / #2471, layered on top of this.

```kotlin
val png = VisionFixtures.threeSquaresPng()
val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
client.chat(listOf(
    LlmMessage(
        role = "user",
        content = "How many squares?",
        images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
    ),
))
// → LlmResponse.Text("3")
```

Implementation:

- `LlmMessage.images: List<ImagePart>? = null` — optional, back-compat
  default. Adapters translate to per-provider wire when non-null AND
  role is "user"; otherwise zero diff vs pre-#2470.
- `ImagePart(base64: String, wireMime: ImagePart.WireMime)` — closed
  WireMime sealed type (Png / Jpeg / Gif / Webp). String mime is not
  accepted in the public ctor. Base64 stored as `String` so structural
  equals/hashCode work (the `ByteArray` data-class trap, avoided).
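
The `ByteArray` point above can be illustrated with a small sketch. The types below are hypothetical stand-ins written for this example, not the project's actual source; the MIME strings other than `image/png` and the `RawImagePart` counter-example are assumptions added for illustration.

```kotlin
// Hypothetical stand-in for the shape described above, not the real class.
data class ImagePart(val base64: String, val wireMime: ImagePart.WireMime) {
    // Closed set of MIME types: callers cannot pass an arbitrary string.
    sealed class WireMime(val value: String) {
        object Png : WireMime("image/png")
        object Jpeg : WireMime("image/jpeg")
        object Gif : WireMime("image/gif")
        object Webp : WireMime("image/webp")
    }
}

// Counter-example (assumed, for contrast): a ByteArray property in a data
// class is compared by reference in the generated equals().
data class RawImagePart(val bytes: ByteArray, val wireMime: ImagePart.WireMime)

fun main() {
    val a = ImagePart("aGVsbG8=", ImagePart.WireMime.Png)
    val b = ImagePart("aGVsbG8=", ImagePart.WireMime.Png)
    println(a == b) // true: String content is compared

    val x = RawImagePart(byteArrayOf(1, 2, 3), ImagePart.WireMime.Png)
    val y = RawImagePart(byteArrayOf(1, 2, 3), ImagePart.WireMime.Png)
    println(x == y) // false: two distinct arrays with equal content
}
```

Storing the payload as a `String` keeps the generated `equals`/`hashCode` usable in test assertions and as map keys without a hand-written override.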

Per-provider wire shapes (pinned by VisionWireFormatTest):

| Provider  | User-message shape                                    |
|-----------|-------------------------------------------------------|
| Ollama    | `{role:"user", content:"text", images:["<b64>", ...]}` |
| Claude    | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]}` |
| OpenAI    | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]}` |
| DeepSeek  | inherits OpenAI; most DeepSeek models lack vision and silently ignore the field (shape-tested, no live call) |
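
A rough sketch of how the three shapes in the table differ, using plain maps and lists as stand-ins for JSON objects. The function names are hypothetical and this is not the adapters' code. The `text` key inside the text block follows the providers' public APIs (the table abbreviates it).

```kotlin
// Illustrative only: Map/List values stand in for JSON objects and arrays.

// Ollama: text stays a plain string, images ride in a sibling array.
fun ollamaUserMessage(text: String, imagesB64: List<String>): Map<String, Any> =
    mapOf("role" to "user", "content" to text, "images" to imagesB64)

// Claude: content becomes a list of typed blocks with a base64 source.
fun claudeUserMessage(
    text: String,
    imagesB64: List<String>,
    mediaType: String = "image/png",
): Map<String, Any> {
    val blocks = mutableListOf<Map<String, Any>>()
    blocks.add(mapOf("type" to "text", "text" to text))
    for (b64 in imagesB64) {
        val source = mapOf("type" to "base64", "media_type" to mediaType, "data" to b64)
        blocks.add(mapOf("type" to "image", "source" to source))
    }
    return mapOf("role" to "user", "content" to blocks)
}

// OpenAI: content is also a list of blocks, but the image is a data URL.
fun openAiUserMessage(
    text: String,
    imagesB64: List<String>,
    mediaType: String = "image/png",
): Map<String, Any> {
    val blocks = mutableListOf<Map<String, Any>>()
    blocks.add(mapOf("type" to "text", "text" to text))
    for (b64 in imagesB64) {
        val url = "data:$mediaType;base64,$b64"
        blocks.add(mapOf("type" to "image_url", "image_url" to mapOf("url" to url)))
    }
    return mapOf("role" to "user", "content" to blocks)
}

fun main() {
    println(ollamaUserMessage("How many squares?", listOf("AAAA")))
    println(claudeUserMessage("How many squares?", listOf("AAAA")))
    println(openAiUserMessage("How many squares?", listOf("AAAA")))
}
```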

Each adapter's vision path is gated:
- role must be "user": for system/assistant/tool messages the adapters
  drop a non-null `images` field from the wire (no provider's API
  carries images on those roles).
- `images = null` or empty → exact pre-#2470 wire shape (back-compat
  pinned by dedicated tests).
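
The two gates can be read as one predicate. A minimal sketch, assuming a simplified message type (`Msg` and `imagesForWire` are hypothetical names, not the adapters' API):

```kotlin
// Simplified stand-in for a chat message; images are base64 strings here.
data class Msg(val role: String, val content: String, val images: List<String>? = null)

// Returns the images that should reach the wire, or null when the adapter
// should emit the plain pre-vision message shape.
fun imagesForWire(msg: Msg): List<String>? =
    msg.images?.takeIf { msg.role == "user" && it.isNotEmpty() }

fun main() {
    println(imagesForWire(Msg("user", "hi", listOf("AAAA"))))      // [AAAA]
    println(imagesForWire(Msg("assistant", "hi", listOf("AAAA")))) // null
    println(imagesForWire(Msg("user", "hi", emptyList())))         // null
    println(imagesForWire(Msg("user", "hi")))                      // null
}
```

Collapsing "null", "empty", and "wrong role" into a single `null` result means each adapter needs only one branch to choose between the vision shape and the back-compat shape.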

`VisionFixtures` (test source set): 256×256 PNGs generated via
`BufferedImage` + `ImageIO`. Two fixtures —
`threeSquaresPng()` (red/blue/green squares, well-separated, thick
black outlines so counting is unambiguous) and `housePng()` (triangle
roof + body + door + two windows, terracotta + beige colour scheme).
Reproducible byte-for-byte; ships in source, no external assets.
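
For readers unfamiliar with the `BufferedImage` + `ImageIO` approach, a fixture along these lines could look roughly as follows. This is a sketch under assumed coordinates, sizes, and stroke width, not the actual `VisionFixtures` code; the function names carry a `Sketch` suffix to mark them as hypothetical.

```kotlin
import java.awt.BasicStroke
import java.awt.Color
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import java.util.Base64
import javax.imageio.ImageIO

// Draws three well-separated, outlined squares on a white 256x256 canvas
// and encodes the result as PNG. Layout values are illustrative guesses.
fun threeSquaresPngSketch(): ByteArray {
    val img = BufferedImage(256, 256, BufferedImage.TYPE_INT_RGB)
    val g = img.createGraphics()
    try {
        g.color = Color.WHITE
        g.fillRect(0, 0, 256, 256)
        g.stroke = BasicStroke(4f) // thick outline so shapes are easy to count
        listOf(Color.RED, Color.BLUE, Color.GREEN).forEachIndexed { i, fill ->
            val x = 20 + i * 80 // 20, 100, 180: 20px gaps between squares
            g.color = fill
            g.fillRect(x, 98, 60, 60)
            g.color = Color.BLACK
            g.drawRect(x, 98, 60, 60)
        }
    } finally {
        g.dispose()
    }
    val out = ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    return out.toByteArray()
}

// Standard (non-URL-safe) base64, suitable for a data URL or JSON field.
fun toBase64Sketch(bytes: ByteArray): String = Base64.getEncoder().encodeToString(bytes)

fun main() {
    val png = threeSquaresPngSketch()
    println("bytes=${png.size}, base64 length=${toBase64Sketch(png).length}")
}
```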

Tests:

- VisionWireFormatTest.kt (8 cases): per-provider wire shape for both
  the vision path and the no-images back-compat path; multiple images
  in one message; non-user-role images filtered; PNG fixture sanity
  (magic bytes + reasonable size).
- VisionLiveTest.kt (6 cases): per-provider end-to-end against:
  * Ollama qwen3-vl:8b — tagged `live-llm`, runs via
    `./gradlew integrationTest`
  * Claude Haiku 4.5 — tagged `live-cloud-api`, runs in default
    `:test`, assumeTrue skips when no key
  * OpenAI gpt-4o-mini — same pattern
  Cost discipline per call: 256×256 PNG (~5KB), temperature=0,
  maxTokens=80, single-turn. Each test sends a fixture image with a
  short text prompt, parses the text response, asserts loose keyword
  match (3 / three for the squares; house / home / cottage / building
  / cabin / barn for the house). Model names overridable via env
  (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, etc.) for CI flexibility.
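
The env override and the loose keyword assertion could be sketched like this. The helper names `envOr` and `mentionsAny` are hypothetical; only the environment variable name, the default model, and the keyword lists come from the description above.

```kotlin
// Reads an override from the environment, falling back to a default model.
fun envOr(name: String, default: String): String =
    System.getenv(name)?.takeIf { it.isNotBlank() } ?: default

// Loose match: true if the response contains any keyword, case-insensitively.
// Substring matching is deliberately forgiving, so "3" also matches "13".
fun mentionsAny(response: String, keywords: List<String>): Boolean {
    val lower = response.lowercase()
    return keywords.any { it.lowercase() in lower }
}

fun main() {
    val model = envOr("AGENTSKT_TEST_OLLAMA_VISION_MODEL", "qwen3-vl:8b")
    println("vision model: $model")

    println(mentionsAny("There are Three squares.", listOf("3", "three")))
    println(mentionsAny("A small cottage with two windows", listOf("house", "home", "cottage")))
}
```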

Full unit suite: 1794 tests, 0 failures.

To run the live vision tests:
- `./gradlew integrationTest --tests "*VisionLiveTest*"`: Ollama
  (requires `qwen3-vl:8b` pulled locally or available through Ollama
  Cloud)
- `./gradlew test --tests "*VisionLiveTest*"`: Claude + OpenAI (runs
  in the default `:test` task under the `live-cloud-api` tag;
  `assumeTrue` skips each provider when its key is not set)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- docs/multimodal.md — new "Vision input — talking to the model (#2470
  slice a)" section between the existing foundation content and the
  "What's still coming" list. Walks through the LlmMessage.images
  field, ImagePart shape, per-provider wire-format table, back-compat
  + role-gating guarantees, programmatic VisionFixtures, and the
  per-provider live test how-to-run. "What's coming" list updated to
  flag the #2470 slice-a/slice-b split (this commit is slice a; the
  Content → LlmMessage.images loop translation is slice b).
- README.md — new "Vision input to models" bullet right after the
  multimodal foundation bullet. Names all four providers and their
  default test models with the cost-discipline notes.
- CHANGELOG.md `## [Unreleased]` — new "Added — Vision input across
  all providers (#2470 slice a)" section ABOVE the existing multimodal
  foundation section. Covers LlmMessage.images + ImagePart, per-
  provider adapter rows, role-gating, fixtures, live test setup +
  cost discipline, wire-format unit-test count.

No source changes. Full suite stays at 1794 / 0 failures from the
prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Skobeltsyn Skobeltsyn merged commit 89037b1 into main May 30, 2026
3 checks passed