v2.6.0

github-actions released this 10 May 06:37 · 58 commits to main since this release · 02a8099

Highlights

Agent DOM grounding lands. v2.5.0 shipped ChromiumoxideDomExtractor but nothing called it. v2.6.0 closes the loop: every step captures a fresh DOM snapshot, the next prompt carries a numbered clickable map, and the LLM operates on real page state instead of fabricating from text narration.

This is Phase B of the agent grounding fix. Combined with v2.4.0 (vision feedback) and v2.5.0 (extractor implementation), the agent now sees and references the DOM the model claims to interact with.

Closes #31.

What's new

ras-cdp: ChromiumoxideAdapter::browser_arc() (closes #31)

Returns a clone of the adapter's Arc<Mutex<Browser>> so the extractor and adapter share one CDP connection / target space:
```rust
let adapter = ChromiumoxideAdapter::connect(ws, timeout).await?;
let extractor = ChromiumoxideDomExtractor::new(adapter.browser_arc(), timeout);
```
The v2.5.0 release notes referenced this accessor before it existed. It's real now.
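
A minimal sketch of the sharing pattern, not the actual ras-cdp code. `Browser`, `Adapter`, and `Extractor` here are simplified stand-ins, and `std::sync::Mutex` is an assumption chosen to keep the example dependency-free (the real adapter may use an async mutex). The point is only that cloning the `Arc` hands out a second handle to the same connection rather than opening a new one:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for the real browser handle; the field is hypothetical.
struct Browser {
    ws_url: String,
}

// Simplified adapter that owns the shared browser handle.
struct Adapter {
    browser: Arc<Mutex<Browser>>,
}

impl Adapter {
    fn new(ws_url: &str) -> Self {
        Self {
            browser: Arc::new(Mutex::new(Browser {
                ws_url: ws_url.to_string(),
            })),
        }
    }

    // Cloning the Arc bumps a reference count; no second connection is created.
    fn browser_arc(&self) -> Arc<Mutex<Browser>> {
        Arc::clone(&self.browser)
    }
}

// Simplified extractor that reuses the adapter's handle.
struct Extractor {
    browser: Arc<Mutex<Browser>>,
}

impl Extractor {
    fn new(browser: Arc<Mutex<Browser>>) -> Self {
        Self { browser }
    }

    // Reads through the shared handle; returns None if the lock is poisoned.
    fn ws_url(&self) -> Option<String> {
        let guard = self.browser.lock().ok()?;
        Some(guard.ws_url.clone())
    }
}
```

Both structs end up pointing at the same allocation, which is what "one CDP connection / target space" amounts to at the type level.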

ras-agent — DOM extractor wired through the loop

  • RunAgent::with_dom_extractor(Arc<dyn DomExtractor>) builder. Leaving the extractor unset (None) preserves pre-2.6.0 behavior.
  • RunStep calls extractor.snapshot(target) after every step and stores the result on the new StepRecord.summary: Option<BrowserStateSummary>.
  • Failures degrade gracefully: tracing::warn + summary: None. Grounding is auxiliary, not required.
  • Field is #[serde(default, skip_serializing_if = "Option::is_none")] — old serialized history files still deserialize cleanly.
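
The degrade-gracefully rule above can be sketched roughly as follows. This is an illustration under assumptions, not the ras-agent implementation: the trait is synchronous here, `BrowserStateSummary` is trimmed to one hypothetical field, `record_step`, `Working`, and `Failing` are invented names, and `eprintln!` stands in for `tracing::warn` so the example needs no dependencies.

```rust
// Trimmed-down, hypothetical summary type; the real one carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct BrowserStateSummary {
    url: String,
}

// Synchronous stand-in for the DomExtractor trait.
trait DomExtractor {
    fn snapshot(&self, target: &str) -> Result<BrowserStateSummary, String>;
}

struct StepRecord {
    step: usize,
    summary: Option<BrowserStateSummary>,
}

// Capture a snapshot if an extractor is configured; a failed snapshot
// never fails the step, it only leaves `summary` empty.
fn record_step(step: usize, target: &str, extractor: Option<&dyn DomExtractor>) -> StepRecord {
    let summary = match extractor {
        // No extractor configured: same behavior as before 2.6.0.
        None => None,
        Some(ex) => match ex.snapshot(target) {
            Ok(s) => Some(s),
            Err(err) => {
                // Stand-in for tracing::warn.
                eprintln!("warn: DOM snapshot failed at step {step}: {err}");
                None
            }
        },
    };
    StepRecord { step, summary }
}

// Extractor that always succeeds.
struct Working;
impl DomExtractor for Working {
    fn snapshot(&self, target: &str) -> Result<BrowserStateSummary, String> {
        Ok(BrowserStateSummary { url: target.to_string() })
    }
}

// Extractor that always fails.
struct Failing;
impl DomExtractor for Failing {
    fn snapshot(&self, _target: &str) -> Result<BrowserStateSummary, String> {
        Err("target closed".to_string())
    }
}
```

A missing extractor and a failing extractor both produce `summary: None`, so downstream prompt rendering only has to handle one "no grounding" case.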

Prompt format — numbered clickable map

```
Step 4 result:
url: https://example.com/login
action results:
[0] clicked login button
clickable_elements:
[0] button "Sign in"
[1] input "Email"
[2] input "Password"
…and 5 more (truncated)
```
Indexes match ClickableElement.index from the snapshot. click_element(index=N) is now grounded in real DOM state instead of model guesses. ax_name takes precedence over label. The list is capped at 80 elements, with an overflow marker for the rest.
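
For illustration, the rendering rules could be sketched like this. It is a guess at the shape, not the contents of clickable_map.rs: the `tag` field, the `el` helper, and the function name `render_clickable_map` are assumptions, and the output format is inferred from the example prompt above.

```rust
const CLICKABLE_LIMIT: usize = 80;

// Hypothetical subset of ClickableElement; `tag` is an assumed field name.
struct ClickableElement {
    index: usize,
    tag: String,
    ax_name: Option<String>,
    label: Option<String>,
}

// Convenience constructor used only to keep the usage short.
fn el(index: usize, tag: &str, ax_name: Option<&str>, label: Option<&str>) -> ClickableElement {
    ClickableElement {
        index,
        tag: tag.to_string(),
        ax_name: ax_name.map(str::to_string),
        label: label.map(str::to_string),
    }
}

fn render_clickable_map(elements: &[ClickableElement]) -> String {
    // No clickables: emit nothing at all, not even the header.
    if elements.is_empty() {
        return String::new();
    }
    let mut out = String::from("clickable_elements:\n");
    for e in elements.iter().take(CLICKABLE_LIMIT) {
        // ax_name wins over label; with neither, no quoted name is printed.
        match e.ax_name.as_deref().or(e.label.as_deref()) {
            Some(name) => out.push_str(&format!("[{}] {} \"{}\"\n", e.index, e.tag, name)),
            None => out.push_str(&format!("[{}] {}\n", e.index, e.tag)),
        }
    }
    // Overflow marker for everything past the cap.
    if elements.len() > CLICKABLE_LIMIT {
        let hidden = elements.len() - CLICKABLE_LIMIT;
        out.push_str(&format!("…and {} more (truncated)\n", hidden));
    }
    out
}
```

Printing the element's own `index` rather than its position in the loop keeps the number the model sees identical to the one click_element expects.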

Screenshot precedence

If summary.screenshot_b64 is present, it is the sole image part attached to the user message. Otherwise the legacy path (one part per ActionResult.images entry) kicks in. Steps that explicitly screenshot no longer end up with two image parts in the prompt.
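
The precedence rule reads naturally as a small selection function. The sketch below assumes images are carried as base64 strings and uses trimmed, hypothetical versions of `BrowserStateSummary` and `ActionResult`; `select_image_parts` is an invented name, not an API from the crate.

```rust
// Trimmed, hypothetical versions of the real types.
struct BrowserStateSummary {
    screenshot_b64: Option<String>,
}

struct ActionResult {
    images: Vec<String>,
}

// Returns the image payloads to attach to the next user message.
fn select_image_parts(
    summary: Option<&BrowserStateSummary>,
    results: &[ActionResult],
) -> Vec<String> {
    // Extractor screenshot present: it is the only image part.
    if let Some(shot) = summary.and_then(|s| s.screenshot_b64.as_ref()) {
        return vec![shot.clone()];
    }
    // Legacy path: one part per ActionResult.images entry, in order.
    results
        .iter()
        .flat_map(|r| r.images.iter().cloned())
        .collect()
}
```

Because the extractor branch returns early, a step whose action also took a screenshot contributes one image, not two.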

ras-llm::ChatMessage::user_parts (carried context)

The constructor for mixed-content user messages introduced in v2.4.0 is the load-bearing primitive Phase B uses to attach text + screenshot in one turn.

Architecture decisions

  • ras-agent already depended on ras-dom; no new crate dep.
  • clickable_map extracted to its own module to keep render_step_message.rs under the 200-LOC cap. Split is also semantic — clickable rendering is independent of step message assembly.
  • ScriptedDomExtractor mock in the integration test bypasses real Chrome for fast deterministic coverage. Real CDP testing requires cosmium and remains a manual smoke step.

Tests

5 new clickable_map unit tests:

  • empty clickables → empty string
  • ax_name precedence over label
  • label fallback when no ax_name
  • no quotes when neither
  • truncation past CLICKABLE_LIMIT (80) with "…and N more" marker

New integration test dom_extractor_grounding_reaches_next_prompt:

  • ScriptedDomExtractor returns a canned BrowserStateSummary with two clickables (button "Sign in" + input "Email") and a known screenshot byte marker.
  • Asserts step 2's prompt contains clickable_elements: text with both rendered indexes AND the extractor's screenshot bytes.
  • Asserts extractor.snapshot was invoked at least once across the run.
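
A scripted extractor of this kind might look roughly like the following. The real `ScriptedDomExtractor` lives in the integration test and is not reproduced in these notes, so everything here is an assumption: the trait is synchronous, the summary type is reduced to two hypothetical fields, and the call counter uses an `AtomicUsize` so it can be read through a shared reference.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Reduced, hypothetical summary type for the sketch.
#[derive(Debug, Clone, PartialEq)]
struct BrowserStateSummary {
    clickables: Vec<String>,
    screenshot_b64: Option<String>,
}

// Synchronous stand-in for the DomExtractor trait.
trait DomExtractor {
    fn snapshot(&self, target: &str) -> Result<BrowserStateSummary, String>;
}

// Returns the same canned summary on every call and counts invocations.
struct ScriptedDomExtractor {
    canned: BrowserStateSummary,
    calls: AtomicUsize,
}

impl ScriptedDomExtractor {
    fn new(canned: BrowserStateSummary) -> Self {
        Self {
            canned,
            calls: AtomicUsize::new(0),
        }
    }

    // Number of snapshot calls observed so far.
    fn call_count(&self) -> usize {
        self.calls.load(Ordering::SeqCst)
    }
}

impl DomExtractor for ScriptedDomExtractor {
    fn snapshot(&self, _target: &str) -> Result<BrowserStateSummary, String> {
        self.calls.fetch_add(1, Ordering::SeqCst);
        Ok(self.canned.clone())
    }
}
```

Since no browser is involved, the test stays deterministic and the "invoked at least once" assertion reduces to checking `call_count()`.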

Total ras-agent: 18 unit + 5 integration. Workspace: 97 test groups all green.

Verification

  • cargo test --workspace --no-fail-fast — clean
  • cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro — clean
  • cargo fmt --all -- --check — clean
  • cargo doc --workspace --no-deps — clean

LOC per new/modified file (200 cap):

  • clickable_map.rs 115
  • render_step_message.rs 173
  • run_step.rs 161
  • run_agent.rs 168

Compatibility

  • All new APIs additive (browser_arc(), with_dom_extractor, StepRecord.summary).
  • No public-API breaks anywhere in the workspace.
  • Workspace MSRV unchanged.
  • Old serialized AgentHistoryList files deserialize cleanly (summary defaults to None).

Migration

```rust
// before (v2.5.0): no DOM grounding
let agent = RunAgent::new(task, llm, registry, browser, events);

// after (v2.6.0): wire extractor for real grounding
use std::sync::Arc;
use std::time::Duration;
use ras_dom::{ChromiumoxideDomExtractor, DomExtractor};

let adapter = ChromiumoxideAdapter::connect(ws_url, Duration::from_secs(60)).await?;
let extractor: Arc<dyn DomExtractor> = Arc::new(
    ChromiumoxideDomExtractor::new(adapter.browser_arc(), Duration::from_secs(30)),
);
let browser = Arc::new(adapter); // relies on unsized coercion to the trait object RunAgent::new takes

let agent = RunAgent::new(task, llm, registry, browser, events)
    .with_dom_extractor(extractor);
```

Deferred (still open after this release)

  • Full EnhancedDomTreeNode tree — currently tree: None in BrowserStateSummary.
  • stable_hash — currently empty in ClickableElement.stable_hash.
  • Real AX tree via Accessibility.getFullAXTree — current ax_name from attributes is a sound MVP.
  • Paint-order occlusion: paint_orders is requested but not yet used to drop covered elements.
  • Per-action snapshot: the current snapshot fires once per step, after all actions complete. Per-click feedback is plausible if models need finer-grained grounding; deferred until there is evidence for it.

Artifacts

  • Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
  • macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
  • crates.io: all ras-* workspace crates published at 2.6.0 once publish.yml finishes

Pull requests

  • #32 feat(agent): DOM grounding via ChromiumoxideDomExtractor (v2.6.0)
  • #33 release: v2.6.0 (agent DOM grounding)

Sub-phase commits

  • B1 fix(cdp): add ChromiumoxideAdapter::browser_arc() accessor — 2.5.1 (closes #31)
  • B2 feat(agent): capture DOM snapshot per step via Option<Arc<dyn DomExtractor>> — 2.5.2
  • B3 feat(agent): inject numbered clickable map + prefer extractor screenshot — 2.5.3
  • B4 feat(examples): wire ChromiumoxideDomExtractor into claude_code_oauth_cosmium — 2.5.4
  • B5 test(agent): integration test for end-to-end DOM grounding flow — 2.5.5
  • chore: bump to 2.6.0

Closes: #31

Full changelog: v2.5.0...v2.6.0