Release v2.4.0 · KeyCode17/rust-ai-surfer

Highlights

Vision feedback lands. Before v2.4.0, ScreenshotAction captured PNG bytes via ActionResult::with_image(b64) but the next prompt's renderer returned a plain String and discarded the image. Models received only text narration, fabricated everything visual, and modern Claude/GPT-4o runs against real sites produced no observable network traffic past the initial GET. v2.4.0 closes that single broken link end-to-end.

This is Phase A of the broader grounding fix. Phase B (numbered clickable index map) and Phase C (full CDP DomExtractor impl) ship in later releases.

Surprise findings during scope review

The bug report assumed ChatMessage was text-only and that all 14 provider DTOs needed image support added. Actuals:

- ContentPart::{ImageBase64, ImageUrl} already existed in ras-llm.
- The ras-llm-openai DTO already serialized both to the OpenAI vision format ({"type":"image_url","image_url":{"url":"data:..."}}).
- The ras-llm-anthropic DTO already mapped ImageBase64 to the native Anthropic image source.
- 6 OpenAI-compatible providers (cerebras, deepseek, groq, mistral, openrouter, vercel) inherit that serialization via ChatOpenAICompatible.
- 6 scaffold providers (bedrock, cloud, google, langchain, oci, ollama) have an empty mod.rs: there is no LlmClient impl yet, so nothing to update.
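
For readers unfamiliar with the two wire shapes mentioned above, here is a minimal sketch of what a base64 image part looks like once serialized for each provider family. The helper names `openai_image_part` and `anthropic_image_part` are hypothetical and are not part of ras-llm; the real DTOs presumably use a serializer rather than string formatting. The sketch only illustrates the target JSON.

```rust
/// Hypothetical helper: OpenAI-style `image_url` part carrying an inline data URL.
/// Assumes both inputs contain no characters that need JSON escaping, which
/// holds for standard base64 and for MIME types such as "image/png".
pub fn openai_image_part(media_type: &str, data_b64: &str) -> String {
    format!(
        r#"{{"type":"image_url","image_url":{{"url":"data:{media_type};base64,{data_b64}"}}}}"#
    )
}

/// Hypothetical helper: Anthropic-style image block with a base64 source.
/// Same escaping assumption as above.
pub fn anthropic_image_part(media_type: &str, data_b64: &str) -> String {
    format!(
        r#"{{"type":"image","source":{{"type":"base64","media_type":"{media_type}","data":"{data_b64}"}}}}"#
    )
}
```

The OpenAI-compatible providers would share the first shape, which is why no per-provider change was needed.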

The single broken link was in ras-agent. Scope of actual code change collapsed to one new module + one constructor + one render call site swap.

What's new

`ras-llm` — multipart constructor

New ChatMessage::user_parts(parts: Vec<ContentPart>) constructor for emitting mixed-content user messages directly.
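
A rough sketch of how the new constructor sits next to the existing text-only one. The type definitions below are simplified stand-ins written for this example: only the names ChatMessage, ContentPart, user_text, user_parts, and the ImageBase64 fields come from these notes. The `Role` enum, the `parts` field, and the shapes of the `Text` and `ImageUrl` variants are assumptions and may not match ras-llm.

```rust
// Simplified stand-ins for the ras-llm types; the real definitions may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String), // assumed shape
    ImageBase64 { media_type: String, data: String },
    ImageUrl(String), // assumed shape
}

#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    User,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: Role,
    pub parts: Vec<ContentPart>,
}

impl ChatMessage {
    /// Text-only user message, mirroring the existing `user_text`.
    pub fn user_text(text: impl Into<String>) -> Self {
        Self { role: Role::User, parts: vec![ContentPart::Text(text.into())] }
    }

    /// Mixed-content user message, mirroring the new `user_parts`.
    pub fn user_parts(parts: Vec<ContentPart>) -> Self {
        Self { role: Role::User, parts }
    }
}
```

Under these assumptions, `user_text("hi")` is equivalent to `user_parts(vec![ContentPart::Text("hi".into())])`, so the new constructor adds capability without changing existing call sites.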

`ras-agent` — image-aware result message

New module ras_agent::application::render_step_message:
- Returns Option<ChatMessage> (None when results are empty so no spurious user turns).
- One ContentPart::Text part with step header (Step N result:), url: line, and per-action result summaries (truncated to 480 chars, errors to 240).
- One ContentPart::ImageBase64 { media_type: "image/png", data } part per ActionResult.images entry.
run_agent::build_prompt now calls render_step_message and pushes the returned ChatMessage. Old text-only render_step_results removed.
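
The behaviour described above can be sketched roughly as follows. This is not the shipped module: the function signature, the `ActionResult` fields (`summary`, `error`, `images`), the exact text layout, and the `truncate_chars` helper are assumptions made for illustration. Only the Option return, the "Step N result:" header, the url: line, the 480/240 character limits, and the one-image-part-per-entry rule are taken from these notes.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String), // assumed shape
    ImageBase64 { media_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub parts: Vec<ContentPart>, // assumed field
}

/// Assumed shape of an action result; the real struct may differ.
#[derive(Debug, Clone, Default)]
pub struct ActionResult {
    pub summary: String,
    pub error: Option<String>,
    pub images: Vec<String>, // base64-encoded PNG payloads
}

/// Hypothetical helper: keep at most `max` characters (not bytes).
fn truncate_chars(s: &str, max: usize) -> String {
    s.chars().take(max).collect()
}

pub fn render_step_message(step: usize, url: &str, results: &[ActionResult]) -> Option<ChatMessage> {
    // No results means no message, so no spurious user turn is emitted.
    if results.is_empty() {
        return None;
    }
    let mut text = format!("Step {step} result:\nurl: {url}\n");
    let mut images = Vec::new();
    for result in results {
        match &result.error {
            Some(err) => text.push_str(&format!("- error: {}\n", truncate_chars(err, 240))),
            None => text.push_str(&format!("- {}\n", truncate_chars(&result.summary, 480))),
        }
        // One image part per captured image, in the order the results arrived.
        for data in &result.images {
            images.push(ContentPart::ImageBase64 {
                media_type: "image/png".to_string(),
                data: data.clone(),
            });
        }
    }
    let mut parts = vec![ContentPart::Text(text)];
    parts.extend(images);
    Some(ChatMessage { parts })
}
```

Truncating by characters rather than bytes avoids splitting a multi-byte UTF-8 sequence. Whether the real renderer counts characters or bytes is not stated here.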

Prompt — unchanged

Still tells the model to emit one JSON object matching AgentOutput shape, lists the action catalog with parameter schemas, warns that empty action lists are treated as failure. The new image part rides alongside that contract; no prompt rewrite was needed.

Reproduction

Before v2.4.0:

$ RAS_MODEL=anthropic/claude-haiku-4.5 cargo run --example claude_code_oauth_cosmium
[step 0] screenshot → captured (b64 discarded by renderer)
[step 1] LLM narrates "I see the login form"  ← fabrication; received only text
[mitmproxy] only GET / and GET /favicon.ico across N calls

After v2.4.0: each step that produces a screenshot attaches the image to the next user turn. Vision-capable models receive the actual page bytes.

Migration

No code changes are required for callers; the RunAgent::new signature is unchanged. To upgrade, run cargo update -p ras-agent --precise 2.4.0 (or name any other workspace crate; the workspace bumps together).

If you previously consumed render_step_results or any other internal renderer, note that it has been removed. Its replacement is render_step_message, which returns Option<ChatMessage> and is crate-visible only (pub(crate)), not part of the public API.

Compatibility

- New constructor ChatMessage::user_parts is purely additive.
- Existing text-only ChatMessage::user_text unchanged.
- No breaking changes to public APIs.
- Workspace MSRV unchanged.

Tests

5 new unit tests in render_step_message:

- empty results → no message
- text-only result → text part only
- screenshot result → text + 1 image part
- multiple images across results → all attached in order
- error result → error included in text

1 new integration test screenshot_image_reaches_next_prompt_as_image_part:

- ScriptedLlm records every received Vec<ChatMessage>.
- After step 1's screenshot action, asserts step 2's prompt contains ContentPart::ImageBase64 with media_type = "image/png" and non-empty data.
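
The recording pattern behind this test might look something like the sketch below. It is a synchronous simplification written for illustration: the real ScriptedLlm presumably implements the crate's LlmClient trait (likely async), and the `chat` method, the `received` field, and the `has_png_image` helper are hypothetical names.

```rust
use std::cell::RefCell;

#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String), // assumed shape
    ImageBase64 { media_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub parts: Vec<ContentPart>, // assumed field
}

/// Test double: returns canned replies and records every prompt it receives.
pub struct ScriptedLlm {
    replies: RefCell<Vec<String>>,
    pub received: RefCell<Vec<Vec<ChatMessage>>>,
}

impl ScriptedLlm {
    pub fn new(replies: Vec<&str>) -> Self {
        Self {
            // Stored reversed so that `pop` yields replies in scripted order.
            replies: RefCell::new(replies.into_iter().rev().map(String::from).collect()),
            received: RefCell::new(Vec::new()),
        }
    }

    /// Hypothetical synchronous stand-in for the real client call.
    pub fn chat(&self, messages: &[ChatMessage]) -> Option<String> {
        self.received.borrow_mut().push(messages.to_vec());
        self.replies.borrow_mut().pop()
    }
}

/// True if any message in the prompt carries a non-empty PNG image part.
pub fn has_png_image(prompt: &[ChatMessage]) -> bool {
    prompt.iter().flat_map(|m| m.parts.iter()).any(|part| {
        matches!(
            part,
            ContentPart::ImageBase64 { media_type, data }
                if media_type.as_str() == "image/png" && !data.is_empty()
        )
    })
}
```

Asserting on the recorded prompts, rather than on the renderer's return value alone, checks the whole path from the screenshot action to what the model actually receives.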

Verification

- cargo test --workspace --no-fail-fast: all suites pass (13 unit + 4 executor + 5 + 4 + 3 integration)
- cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro: clean
- cargo fmt --all -- --check: clean
- cargo doc --workspace --no-deps: clean

Artifacts

Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
crates.io: all ras-* workspace crates published at 2.4.0 once publish.yml finishes

Follow-ups

- #24 (ras-llm-anthropic): ContentPart::ImageUrl currently degrades to a plaintext URL instead of using Anthropic's native source.type = url image format. Planned as a 2.4.1 patch, or folded into the next minor release, since publish.yml skips patch bumps.
- Phase B: JS-eval clickable extractor (querySelectorAll via BrowserPort::evaluate) producing a numbered index map for click_element parameters. No full DomExtractor.
- Phase C: full CDP DomExtractor impl with paint-order occlusion and stable hashing across snapshots. The trait exists in ras-dom with zero implementations today.

Pull requests

#25 — feat(agent): feed screenshot images to LLM as ImageBase64 (v2.4.0)
#26 — release: v2.4.0 (vision feedback)

Full changelog: v2.3.0...v2.4.0
