v2.4.0

github-actions released this 10 May 05:23
· 75 commits to main since this release
6e7aa8c

Highlights

Vision feedback lands. Before v2.4.0, ScreenshotAction captured PNG bytes via ActionResult::with_image(b64), but the renderer for the next prompt returned a plain String and discarded the image. Models received only text narration and fabricated everything visual; modern Claude/GPT-4o runs against real sites produced no observable network traffic past the initial GET. v2.4.0 closes that single broken link end-to-end.

This is Phase A of the broader grounding fix. Phase B (numbered clickable index map) and Phase C (full CDP DomExtractor impl) ship in later releases.

Surprise findings during scope review

The bug report assumed ChatMessage was text-only and that all 14 provider DTOs needed image support added. Actuals:

  • ContentPart::{ImageBase64, ImageUrl} already existed in ras-llm.
  • ras-llm-openai DTO already serialized both to OpenAI vision format ({"type":"image_url","image_url":{"url":"data:..."}}).
  • ras-llm-anthropic DTO already mapped ImageBase64 to native Anthropic image source.
  • 6 OpenAI-compatible providers (cerebras, deepseek, groq, mistral, openrouter, vercel) inherit serialization via ChatOpenAICompatible.
  • 6 scaffold providers (bedrock, cloud, google, langchain, oci, ollama) have empty mod.rs — no LlmClient impl yet, nothing to update.
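
For readers unfamiliar with the OpenAI vision wire format quoted above, here is a minimal sketch of how an inline base64 image maps onto it. The helper names data_url and openai_image_part are hypothetical and are not the ras-llm-openai API; the real DTO presumably goes through a serializer rather than hand-built strings.

```rust
/// Hypothetical helper (not from ras-llm-openai): builds the data URL that
/// carries an inline base64 image.
pub fn data_url(media_type: &str, b64: &str) -> String {
    format!("data:{media_type};base64,{b64}")
}

/// Hand-rolled JSON for illustration only. Assumes the inputs contain no
/// characters that need JSON escaping, which holds for base64 payloads and
/// ordinary MIME types.
pub fn openai_image_part(media_type: &str, b64: &str) -> String {
    format!(
        r#"{{"type":"image_url","image_url":{{"url":"{}"}}}}"#,
        data_url(media_type, b64)
    )
}
```

Calling openai_image_part("image/png", "aGk=") yields a content part of the same shape as the one quoted in the list above, with the data URL filled in.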

The single broken link was in ras-agent. Scope of actual code change collapsed to one new module + one constructor + one render call site swap.

What's new

ras-llm — multipart constructor

  • New ChatMessage::user_parts(parts: Vec<ContentPart>) constructor for emitting mixed-content user messages directly.
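
A minimal sketch of how the new constructor might be used. The type definitions below are simplified stand-ins written for this example; only the names ChatMessage::user_parts, ChatMessage::user_text, ContentPart::Text, and ContentPart::ImageBase64 { media_type, data } come from these notes. The field layout, the Role enum, and the screenshot_turn helper are assumptions.

```rust
// Simplified stand-ins for the ras-llm types; the real definitions may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
    ImageUrl(String),
}

// Assumed role enum; only the user role matters for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Role {
    User,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: Role,
    pub parts: Vec<ContentPart>,
}

impl ChatMessage {
    /// Text-only user message, expressed here in terms of user_parts.
    pub fn user_text(text: impl Into<String>) -> Self {
        Self::user_parts(vec![ContentPart::Text(text.into())])
    }

    /// Mixed-content user message: any sequence of text and image parts.
    pub fn user_parts(parts: Vec<ContentPart>) -> Self {
        Self { role: Role::User, parts }
    }
}

/// Hypothetical helper: a user turn carrying a caption plus one PNG.
pub fn screenshot_turn(caption: &str, png_b64: &str) -> ChatMessage {
    ChatMessage::user_parts(vec![
        ContentPart::Text(caption.to_string()),
        ContentPart::ImageBase64 {
            media_type: "image/png".to_string(),
            data: png_b64.to_string(),
        },
    ])
}
```

Keeping user_text as a thin wrapper over user_parts is one way the two constructors could coexist without changing text-only callers.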

ras-agent — image-aware result message

  • New module ras_agent::application::render_step_message:
    • Returns Option<ChatMessage> (None when results are empty, so no spurious user turns are emitted).
    • One ContentPart::Text part with step header (Step N result:), url: line, and per-action result summaries (truncated to 480 chars, errors to 240).
    • One ContentPart::ImageBase64 { media_type: "image/png", data } part per ActionResult.images entry.
  • run_agent::build_prompt now calls render_step_message and pushes the returned ChatMessage. Old text-only render_step_results removed.
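
The behaviour described above can be sketched roughly as follows. This is not the shipped implementation: the ActionResult fields (summary, error, images), the function signature, the exact line layout of the text part, and the choice to truncate by characters rather than bytes are all assumptions made for illustration. Only the 480/240 limits, the "Step N result:" header, the url: line, and the one-image-part-per-entry rule come from these notes.

```rust
// Simplified stand-ins for the real ras-llm / ras-agent types.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub parts: Vec<ContentPart>,
}

// Assumed shape of an action result; field names are hypothetical.
#[derive(Debug, Clone, Default)]
pub struct ActionResult {
    pub summary: Option<String>,
    pub error: Option<String>,
    pub images: Vec<String>, // base64-encoded PNG payloads
}

const MAX_SUMMARY_CHARS: usize = 480;
const MAX_ERROR_CHARS: usize = 240;

/// Truncates on character boundaries so multi-byte text is never split.
fn truncate_chars(s: &str, max: usize) -> String {
    s.chars().take(max).collect()
}

pub fn render_step_message(
    step: usize,
    url: &str,
    results: &[ActionResult],
) -> Option<ChatMessage> {
    // No results means no user turn at all.
    if results.is_empty() {
        return None;
    }

    let mut text = format!("Step {step} result:\nurl: {url}\n");
    let mut images = Vec::new();

    for (i, r) in results.iter().enumerate() {
        if let Some(s) = &r.summary {
            text.push_str(&format!("- action {i}: {}\n", truncate_chars(s, MAX_SUMMARY_CHARS)));
        }
        if let Some(e) = &r.error {
            text.push_str(&format!("- action {i} error: {}\n", truncate_chars(e, MAX_ERROR_CHARS)));
        }
        // One image part per entry, preserving order across results.
        for data in &r.images {
            images.push(ContentPart::ImageBase64 {
                media_type: "image/png".to_string(),
                data: data.clone(),
            });
        }
    }

    // A single text part first, followed by every image part.
    let mut parts = vec![ContentPart::Text(text)];
    parts.extend(images);
    Some(ChatMessage { parts })
}
```

Returning Option lets the caller skip the push entirely when a step produced nothing, instead of special-casing an empty message downstream.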

Prompt — unchanged

The prompt still tells the model to emit one JSON object matching the AgentOutput shape, lists the action catalog with parameter schemas, and warns that empty action lists are treated as failure. The new image part rides alongside that contract; no prompt rewrite was needed.

Reproduction

Before v2.4.0:

$ RAS_MODEL=anthropic/claude-haiku-4.5 cargo run --example claude_code_oauth_cosmium
[step 0] screenshot → captured (b64 discarded by renderer)
[step 1] LLM narrates "I see the login form"  ← fabrication; received only text
[mitmproxy] only GET / and GET /favicon.ico across N calls

After v2.4.0: each step that produces a screenshot attaches the image to the next user turn. Vision-capable models receive the actual page bytes.

Migration

No code changes are required for callers. The RunAgent::new signature is unchanged. To upgrade, run cargo update -p ras-agent --precise 2.4.0 (or name any other workspace crate; the workspace bumps together).

If you previously consumed render_step_results or any other internal renderer, note that it has been removed. Its replacement is render_step_message, which is visible crate-wide via pub(crate) and returns Option<ChatMessage>.

Compatibility

  • New constructor ChatMessage::user_parts is purely additive.
  • Existing text-only ChatMessage::user_text unchanged.
  • No breaking changes to public APIs.
  • Workspace MSRV unchanged.

Tests

5 new unit tests in render_step_message:

  • empty results → no message
  • text-only result → text part only
  • screenshot result → text + 1 image part
  • multiple images across results → all attached in order
  • error result → error included in text

1 new integration test screenshot_image_reaches_next_prompt_as_image_part:

  • ScriptedLlm records every received Vec<ChatMessage>.
  • After step 1's screenshot action, asserts step 2's prompt contains ContentPart::ImageBase64 with media_type = "image/png" and non-empty data.
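
The recording pattern behind that integration test might look like the sketch below. It is a simplified, synchronous illustration: the LlmClient trait shown here, its complete method, the ScriptedLlm fields, and the has_png_part helper are assumptions, and the real trait is likely async with a richer return type.

```rust
use std::collections::VecDeque;

// Simplified stand-ins for the ras-llm types.
#[derive(Debug, Clone, PartialEq)]
pub enum ContentPart {
    Text(String),
    ImageBase64 { media_type: String, data: String },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub parts: Vec<ContentPart>,
}

/// Assumed, synchronous stand-in for the real client trait.
pub trait LlmClient {
    fn complete(&mut self, messages: &[ChatMessage]) -> String;
}

/// Test double: replays canned replies and records every prompt it receives.
pub struct ScriptedLlm {
    replies: VecDeque<String>,
    pub received: Vec<Vec<ChatMessage>>,
}

impl ScriptedLlm {
    pub fn new(replies: Vec<&str>) -> Self {
        Self {
            replies: replies.into_iter().map(String::from).collect(),
            received: Vec::new(),
        }
    }
}

impl LlmClient for ScriptedLlm {
    fn complete(&mut self, messages: &[ChatMessage]) -> String {
        // Record the full prompt before answering so tests can inspect it.
        self.received.push(messages.to_vec());
        // Returns an empty string once the script is exhausted.
        self.replies.pop_front().unwrap_or_default()
    }
}

/// True if any message in the prompt carries a non-empty PNG image part.
pub fn has_png_part(prompt: &[ChatMessage]) -> bool {
    prompt.iter().flat_map(|m| m.parts.iter()).any(|p| {
        matches!(
            p,
            ContentPart::ImageBase64 { media_type, data }
                if media_type == "image/png" && !data.is_empty()
        )
    })
}
```

A test built on this would drive the agent for two steps and then check has_png_part against the second recorded prompt, which is the assertion the integration test above describes.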

Verification

  • cargo test --workspace --no-fail-fast — all suites pass (13 unit + 4 executor + 5 + 4 + 3 integration)
  • cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro — clean
  • cargo fmt --all -- --check — clean
  • cargo doc --workspace --no-deps — clean

Artifacts

  • Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
  • macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
  • crates.io: all ras-* workspace crates published at 2.4.0 once publish.yml finishes

Follow-ups

  • #24 ras-llm-anthropic: ContentPart::ImageUrl currently degrades to a plaintext URL instead of using Anthropic's native source.type = url image format. Planned for a 2.4.1 patch, or folded into the next minor, since publish.yml skips patch bumps.
  • Phase B — JS-eval clickable extractor (querySelectorAll via BrowserPort::evaluate) producing a numbered index map for click_element parameters. No full DomExtractor.
  • Phase C — Full CDP DomExtractor impl with paint-order occlusion + stable hashing across snapshots. The trait exists in ras-dom with zero implementations today.

Pull requests

  • #25 feat(agent): feed screenshot images to LLM as ImageBase64 (v2.4.0)
  • #26 release: v2.4.0 (vision feedback)

Full changelog: v2.3.0...v2.4.0