v2.4.0
Highlights
Vision feedback lands. Before v2.4.0, ScreenshotAction captured PNG bytes via ActionResult::with_image(b64), but the next prompt's renderer returned a plain String and discarded the image. Models received only text narration and fabricated everything visual; runs with modern Claude/GPT-4o against real sites produced no observable network traffic past the initial GET. v2.4.0 closes that single broken link end-to-end.
This is Phase A of the broader grounding fix. Phase B (numbered clickable index map) and Phase C (full CDP DomExtractor impl) ship in later releases.
Surprise findings during scope review
The bug report assumed ChatMessage was text-only and that all 14 provider DTOs needed image support added. Actuals:
- ContentPart::{ImageBase64, ImageUrl} already existed in ras-llm.
- ras-llm-openai DTO already serialized both to OpenAI vision format ({"type":"image_url","image_url":{"url":"data:..."}}).
- ras-llm-anthropic DTO already mapped ImageBase64 to native Anthropic image source.
- 6 OpenAI-compatible providers (cerebras, deepseek, groq, mistral, openrouter, vercel) inherit serialization via ChatOpenAICompatible.
- 6 scaffold providers (bedrock, cloud, google, langchain, oci, ollama) have empty mod.rs: no LlmClient impl yet, nothing to update.
The single broken link was in ras-agent. Scope of actual code change collapsed to one new module + one constructor + one render call site swap.
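For readers unfamiliar with the wire format mentioned above, here is a minimal sketch of how both image variants can map onto the OpenAI image_url part. The ContentPart enum below is a local stand-in (the real ras-llm variants may carry different fields), and to_openai_part and json_escape are hypothetical helper names, not the ras-llm-openai implementation.

```rust
// Stand-in for the ras-llm content type; field shapes are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
    ImageUrl(String),
}

/// Escapes the characters that JSON requires to be escaped inside a string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders one content part as an OpenAI chat "content" array element.
/// Base64 images are wrapped in a data URL; remote URLs pass through as-is.
pub fn to_openai_part(part: &ContentPart) -> String {
    match part {
        ContentPart::Text(t) => {
            format!(r#"{{"type":"text","text":"{}"}}"#, json_escape(t))
        }
        ContentPart::ImageBase64 { media_type, data } => {
            let url = format!("data:{};base64,{}", media_type, data);
            format!(
                r#"{{"type":"image_url","image_url":{{"url":"{}"}}}}"#,
                json_escape(&url)
            )
        }
        ContentPart::ImageUrl(url) => format!(
            r#"{{"type":"image_url","image_url":{{"url":"{}"}}}}"#,
            json_escape(url)
        ),
    }
}
```

Because both image variants collapse into the same image_url shape, a provider that reuses this serialization needs no image-specific code of its own, which is consistent with the OpenAI-compatible providers needing no changes.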
What's new
ras-llm — multipart constructor
- New ChatMessage::user_parts(parts: Vec<ContentPart>) constructor for emitting mixed-content user messages directly.
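A rough sketch of what a multipart constructor of this shape looks like next to the text-only one. Only the names ChatMessage, ContentPart, user_parts, and user_text come from these notes; the Role enum, the field names, and the idea that user_text delegates to user_parts are assumptions made for illustration.

```rust
// Local stand-ins; the real ras-llm definitions may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
}

// Hypothetical role enum, reduced to the one variant this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    User,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: Role,
    pub content: Vec<ContentPart>,
}

impl ChatMessage {
    /// Text-only user message: a single Text part.
    pub fn user_text(text: impl Into<String>) -> Self {
        Self::user_parts(vec![ContentPart::Text(text.into())])
    }

    /// Mixed-content user message: parts are kept in the order given.
    pub fn user_parts(parts: Vec<ContentPart>) -> Self {
        Self { role: Role::User, content: parts }
    }
}

/// Builds a user turn carrying a caption followed by a screenshot.
pub fn caption_with_screenshot(caption: &str, png_b64: &str) -> ChatMessage {
    ChatMessage::user_parts(vec![
        ContentPart::Text(caption.to_string()),
        ContentPart::ImageBase64 {
            media_type: "image/png".to_string(),
            data: png_b64.to_string(),
        },
    ])
}
```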
ras-agent — image-aware result message
- New module ras_agent::application::render_step_message:
  - Returns Option<ChatMessage> (None when results are empty so no spurious user turns).
  - One ContentPart::Text part with step header (Step N result:), url: line, and per-action result summaries (truncated to 480 chars, errors to 240).
  - One ContentPart::ImageBase64 { media_type: "image/png", data } part per ActionResult.images entry.
- run_agent::build_prompt now calls render_step_message and pushes the returned ChatMessage. Old text-only render_step_results removed.
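The rendering rules listed above can be sketched roughly as follows. This is not the shipped module: the ActionResult fields (summary, error, images), the function signature, and the exact line layout of the text part are assumptions; only the None-on-empty rule, the 480/240 character limits, and the one-image-part-per-entry rule come from these notes.

```rust
// Local stand-ins for the ras-llm / ras-agent types; shapes are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub parts: Vec<ContentPart>,
}

#[derive(Debug, Clone, Default)]
pub struct ActionResult {
    pub summary: Option<String>,
    pub error: Option<String>,
    pub images: Vec<String>, // base64-encoded PNG payloads
}

const MAX_SUMMARY_CHARS: usize = 480;
const MAX_ERROR_CHARS: usize = 240;

/// Truncates on character boundaries, so multi-byte text is never split.
fn truncate_chars(s: &str, max: usize) -> String {
    s.chars().take(max).collect()
}

pub fn render_step_message(
    step: usize,
    url: &str,
    results: &[ActionResult],
) -> Option<ChatMessage> {
    // No results means no user turn at all.
    if results.is_empty() {
        return None;
    }

    let mut text = format!("Step {} result:\nurl: {}\n", step, url);
    let mut images = Vec::new();

    for (i, r) in results.iter().enumerate() {
        if let Some(s) = &r.summary {
            text.push_str(&format!("- action {}: {}\n", i, truncate_chars(s, MAX_SUMMARY_CHARS)));
        }
        if let Some(e) = &r.error {
            text.push_str(&format!("- action {} error: {}\n", i, truncate_chars(e, MAX_ERROR_CHARS)));
        }
        // One image part per captured image, in result order.
        for data in &r.images {
            images.push(ContentPart::ImageBase64 {
                media_type: "image/png".to_string(),
                data: data.clone(),
            });
        }
    }

    // Text part first, then every image part.
    let mut parts = vec![ContentPart::Text(text)];
    parts.extend(images);
    Some(ChatMessage { parts })
}
```

Returning Option rather than an empty message lets the caller skip the push entirely, which keeps the conversation free of empty user turns.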
Prompt — unchanged
The prompt still tells the model to emit one JSON object matching the AgentOutput shape, lists the action catalog with parameter schemas, and warns that empty action lists are treated as failure. The new image part rides alongside that contract; no prompt rewrite was needed.
Reproduction
Before v2.4.0:
$ RAS_MODEL=anthropic/claude-haiku-4.5 cargo run --example claude_code_oauth_cosmium
[step 0] screenshot → captured (b64 discarded by renderer)
[step 1] LLM narrates "I see the login form" ← fabrication; received only text
[mitmproxy] only GET / and GET /favicon.ico across N calls
After v2.4.0: each step that produces a screenshot attaches the image to the next user turn. Vision-capable models receive the actual page bytes.
Migration
No code changes required for callers. RunAgent::new signature unchanged. To upgrade, run cargo update -p ras-agent --precise 2.4.0 (or target any workspace crate; the workspace bumps together).
If you previously consumed render_step_results or any other internal renderer, note that it has been removed. Its replacement is render_step_message, which is crate-visible (pub(crate)) and returns Option<ChatMessage>.
Compatibility
- New constructor ChatMessage::user_parts is purely additive.
- Existing text-only ChatMessage::user_text unchanged.
- No breaking changes to public APIs.
- Workspace MSRV unchanged.
Tests
5 new unit tests in render_step_message:
- empty results → no message
- text-only result → text part only
- screenshot result → text + 1 image part
- multiple images across results → all attached in order
- error result → error included in text
1 new integration test screenshot_image_reaches_next_prompt_as_image_part:
- ScriptedLlm records every received Vec<ChatMessage>.
- After step 1's screenshot action, asserts step 2's prompt contains ContentPart::ImageBase64 with media_type = "image/png" and non-empty data.
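The recording approach behind that integration test might look something like the sketch below. It is synchronous and uses local stand-in types; the real ScriptedLlm presumably implements the crate's LlmClient trait, and the complete method and prompt_has_png helper are hypothetical names.

```rust
use std::collections::VecDeque;

// Local stand-ins for the ras-llm types; shapes are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub parts: Vec<ContentPart>,
}

/// Test double: replays canned replies and records every prompt it is sent.
pub struct ScriptedLlm {
    replies: VecDeque<String>,
    pub received: Vec<Vec<ChatMessage>>,
}

impl ScriptedLlm {
    pub fn new(replies: Vec<&str>) -> Self {
        Self {
            replies: replies.into_iter().map(String::from).collect(),
            received: Vec::new(),
        }
    }

    /// Records the full prompt, then returns the next scripted reply (if any).
    pub fn complete(&mut self, messages: &[ChatMessage]) -> Option<String> {
        self.received.push(messages.to_vec());
        self.replies.pop_front()
    }
}

/// True if any message in the prompt carries a non-empty PNG image part.
pub fn prompt_has_png(prompt: &[ChatMessage]) -> bool {
    prompt.iter().flat_map(|m| m.parts.iter()).any(|p| {
        matches!(
            p,
            ContentPart::ImageBase64 { media_type, data }
                if media_type.as_str() == "image/png" && !data.is_empty()
        )
    })
}
```

With a double like this, the assertion reduces to checking that the second recorded prompt satisfies prompt_has_png while the first does not.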
Verification
- cargo test --workspace --no-fail-fast → all suites pass (13 unit + 4 executor + 5 + 4 + 3 integration)
- cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro → clean
- cargo fmt --all -- --check → clean
- cargo doc --workspace --no-deps → clean
Artifacts
- Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
- macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
- crates.io: all ras-* workspace crates published at 2.4.0 once publish.yml finishes
Follow-ups
- #24 (ras-llm-anthropic): ContentPart::ImageUrl currently degrades to plaintext URL instead of using Anthropic's native source.type = url image format. Planned 2.4.1 patch (or fold into next minor; publish.yml skips patch bumps).
- Phase B: JS-eval clickable extractor (querySelectorAll via BrowserPort::evaluate) producing a numbered index map for click_element parameters. No full DomExtractor.
- Phase C: Full CDP DomExtractor impl with paint-order occlusion + stable hashing across snapshots. The trait exists in ras-dom with zero implementations today.
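To make the #24 follow-up concrete, here is one possible shape for the mapping, assuming the fix routes ImageUrl to Anthropic's url image source instead of plain text. The ImageSource enum and to_image_source function are hypothetical and not taken from ras-llm-anthropic; the two source.type values (base64 and url) follow Anthropic's Messages API.

```rust
// Local stand-in for the ras-llm content type; field shapes are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
    ImageUrl(String),
}

/// Mirrors the "source" object of an Anthropic image content block, e.g.
/// {"type":"image","source":{"type":"url","url":"https://..."}}.
#[derive(Debug, Clone, PartialEq)]
pub enum ImageSource {
    Base64 { media_type: String, data: String },
    Url { url: String },
}

impl ImageSource {
    /// Value of the source.type field on the wire.
    pub fn type_tag(&self) -> &'static str {
        match self {
            ImageSource::Base64 { .. } => "base64",
            ImageSource::Url { .. } => "url",
        }
    }
}

/// Maps image parts to a native image source; text parts yield None.
pub fn to_image_source(part: &ContentPart) -> Option<ImageSource> {
    match part {
        ContentPart::Text(_) => None,
        ContentPart::ImageBase64 { media_type, data } => Some(ImageSource::Base64 {
            media_type: media_type.clone(),
            data: data.clone(),
        }),
        ContentPart::ImageUrl(url) => Some(ImageSource::Url { url: url.clone() }),
    }
}
```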
Pull requests
- #25 feat(agent): feed screenshot images to LLM as ImageBase64 (v2.4.0)
- #26 release: v2.4.0 (vision feedback)
Full changelog: v2.3.0...v2.4.0