v2.6.0
Highlights
Agent DOM grounding lands. v2.5.0 shipped ChromiumoxideDomExtractor but nothing called it. v2.6.0 closes the loop: every step captures a fresh DOM snapshot, the next prompt carries a numbered clickable map, and the LLM operates on real page state instead of fabricating from text narration.
This is Phase B of the agent grounding fix. Combined with v2.4.0 (vision feedback) and v2.5.0 (extractor implementation), the agent now sees and references the DOM the model claims to interact with.
Closes #31.
What's new
ras-cdp — ChromiumoxideAdapter::browser_arc() (closes #31)
Returns a clone of the adapter's Arc<Mutex<Browser>> so the extractor and adapter share one CDP connection / target space:
```rust
let adapter = ChromiumoxideAdapter::connect(ws, timeout).await?;
let extractor = ChromiumoxideDomExtractor::new(adapter.browser_arc(), timeout);
```
The v2.5.0 release notes referenced this accessor before it existed. It's real now.
ras-agent — DOM extractor wired through the loop
- RunAgent::with_dom_extractor(Arc<dyn DomExtractor>) builder. None preserves pre-2.6.0 behavior.
- RunStep calls extractor.snapshot(target) after every step and stores the result on the new StepRecord.summary: Option<BrowserStateSummary>.
- Failures degrade gracefully: tracing::warn + summary: None. Grounding is auxiliary, not required.
- Field is #[serde(default, skip_serializing_if = "Option::is_none")], so old serialized history files still deserialize cleanly.
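The graceful-degradation rule above could look roughly like the following. This is a minimal sketch, not the actual ras-agent code: the trait is simplified to a synchronous call, the struct fields are placeholders, `capture_summary` is a hypothetical helper name, and `eprintln!` stands in for `tracing::warn`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real ras-dom type; the field is an assumption.
#[derive(Debug, Clone, PartialEq)]
struct BrowserStateSummary {
    url: String,
}

// Simplified synchronous trait (assumption); the real DomExtractor may be async.
trait DomExtractor {
    fn snapshot(&self, target: &str) -> Result<BrowserStateSummary, String>;
}

#[derive(Debug, Default)]
struct StepRecord {
    summary: Option<BrowserStateSummary>,
}

// Capture a snapshot when an extractor is configured. A missing extractor or a
// failed snapshot both yield `summary: None`; the step itself never fails.
fn capture_summary(extractor: Option<&Arc<dyn DomExtractor>>, target: &str) -> StepRecord {
    let summary = match extractor {
        None => None,
        Some(ex) => match ex.snapshot(target) {
            Ok(s) => Some(s),
            Err(err) => {
                // Stand-in for tracing::warn in this sketch.
                eprintln!("warn: DOM snapshot failed: {err}");
                None
            }
        },
    };
    StepRecord { summary }
}

// Extractor that always fails, to exercise the degraded path.
struct Failing;
impl DomExtractor for Failing {
    fn snapshot(&self, _target: &str) -> Result<BrowserStateSummary, String> {
        Err("target closed".to_string())
    }
}

// Extractor that always succeeds with a URL derived from the target.
struct Working;
impl DomExtractor for Working {
    fn snapshot(&self, target: &str) -> Result<BrowserStateSummary, String> {
        Ok(BrowserStateSummary { url: format!("https://example.com/{target}") })
    }
}
```

Passing `None` for the extractor takes the same path as pre-2.6.0 behavior: the record is produced with no summary and no warning is logged.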
Prompt format — numbered clickable map
```
Step 4 result:
url: https://example.com/login
action results:
[0] clicked login button
clickable_elements:
[0] button "Sign in"
[1] input "Email"
[2] input "Password"
…and 5 more (truncated)
```
Indexes match ClickableElement.index from the snapshot. click_element(index=N) is now grounded in real DOM state instead of model guesses. ax_name takes precedence over label. List capped at 80 elements with overflow marker.
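As an illustration of the rendering rules just described, here is a small sketch under stated assumptions: the struct mirrors only the fields mentioned in these notes (the `tag` field name is a guess), `render_clickable_map` is a hypothetical function name, and the exact line layout is inferred from the sample prompt rather than taken from clickable_map.rs.

```rust
// Hypothetical mirror of ClickableElement with only the fields used here.
struct ClickableElement {
    index: usize,
    tag: String, // assumed field name for the element kind ("button", "input", ...)
    ax_name: Option<String>,
    label: Option<String>,
}

// Cap stated in the release notes.
const CLICKABLE_LIMIT: usize = 80;

// Render the numbered clickable map for the next prompt.
fn render_clickable_map(elements: &[ClickableElement]) -> String {
    // No clickables: emit nothing at all, not even the header.
    if elements.is_empty() {
        return String::new();
    }
    let mut out = String::from("clickable_elements:\n");
    for el in elements.iter().take(CLICKABLE_LIMIT) {
        // ax_name wins over label; with neither, no quoted name is emitted.
        let name = el.ax_name.as_deref().or(el.label.as_deref());
        match name {
            Some(n) => out.push_str(&format!("[{}] {} \"{}\"\n", el.index, el.tag, n)),
            None => out.push_str(&format!("[{}] {}\n", el.index, el.tag)),
        }
    }
    // Overflow marker for everything past the cap.
    if elements.len() > CLICKABLE_LIMIT {
        let hidden = elements.len() - CLICKABLE_LIMIT;
        out.push_str(&format!("…and {} more (truncated)\n", hidden));
    }
    out
}
```

The printed index is the element's own `index` field rather than its loop position, so the number the model sends back in click_element(index=N) refers to the same element in the snapshot.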
Screenshot precedence
If summary.screenshot_b64 is present, it is the sole image part attached to the user message. Otherwise the legacy path (one part per ActionResult.images entry) kicks in. Steps that explicitly screenshot no longer end up with two image parts in the prompt.
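The precedence rule can be sketched as a single selection function. The types below are simplified placeholders and `select_image_parts` is a hypothetical name; the real code builds message parts for ras-llm rather than plain strings.

```rust
// Hypothetical, simplified types carrying only the fields this rule touches.
struct BrowserStateSummary {
    screenshot_b64: Option<String>,
}

struct ActionResult {
    images: Vec<String>,
}

// Choose which images to attach to the user message for a step.
fn select_image_parts(
    summary: Option<&BrowserStateSummary>,
    results: &[ActionResult],
) -> Vec<String> {
    // Extractor screenshot present: it is the only image attached.
    if let Some(shot) = summary.and_then(|s| s.screenshot_b64.as_ref()) {
        return vec![shot.clone()];
    }
    // Legacy path: one image part per ActionResult.images entry, in order.
    results
        .iter()
        .flat_map(|r| r.images.iter().cloned())
        .collect()
}
```

Returning early is what prevents the duplicate: a step whose action took its own screenshot still contributes only the extractor's image when one exists.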
ras-llm::ChatMessage::user_parts (carried context)
The constructor for mixed-content user messages introduced in v2.4.0 is the load-bearing primitive Phase B uses to attach text + screenshot in one turn.
Architecture decisions
- ras-agent already depended on ras-dom; no new crate dep.
- clickable_map extracted to its own module to keep render_step_message.rs under the 200-LOC cap. Split is also semantic: clickable rendering is independent of step message assembly.
- ScriptedDomExtractor mock in the integration test bypasses real Chrome for fast deterministic coverage. Real CDP testing requires cosmium and remains a manual smoke step.
Tests
5 new clickable_map unit tests:
- empty clickables → empty string
- ax_name precedence over label
- label fallback when no ax_name
- no quotes when neither
- truncation past CLICKABLE_LIMIT (80) with "…and N more" marker
New integration test dom_extractor_grounding_reaches_next_prompt:
- ScriptedDomExtractor returns a canned BrowserStateSummary with two clickables (button "Sign in" + input "Email") and a known screenshot byte marker.
- Asserts step 2's prompt contains clickable_elements: text with both rendered indexes AND the extractor's screenshot bytes.
- Asserts extractor.snapshot was invoked at least once across the run.
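A scripted extractor of the kind described above might be shaped like this. It is a sketch only: the trait is a simplified synchronous stand-in, the summary fields are placeholders, and the call counter is one plausible way to support the "invoked at least once" assertion, not necessarily how the actual test mock does it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical, simplified summary; the real BrowserStateSummary has more fields.
#[derive(Debug, Clone, PartialEq)]
struct BrowserStateSummary {
    clickables: Vec<String>,
    screenshot_b64: Option<String>,
}

// Simplified synchronous trait (assumption); the real DomExtractor may be async.
trait DomExtractor {
    fn snapshot(&self, target: &str) -> Result<BrowserStateSummary, String>;
}

// Mock that returns the same canned summary on every call and counts invocations.
struct ScriptedDomExtractor {
    canned: BrowserStateSummary,
    calls: AtomicUsize,
}

impl ScriptedDomExtractor {
    fn new(canned: BrowserStateSummary) -> Self {
        Self { canned, calls: AtomicUsize::new(0) }
    }

    // Number of snapshot() calls observed so far.
    fn call_count(&self) -> usize {
        self.calls.load(Ordering::SeqCst)
    }
}

impl DomExtractor for ScriptedDomExtractor {
    fn snapshot(&self, _target: &str) -> Result<BrowserStateSummary, String> {
        // An atomic counter keeps the mock usable behind a shared Arc.
        self.calls.fetch_add(1, Ordering::SeqCst);
        Ok(self.canned.clone())
    }
}
```

Because the mock never touches a browser, a test built on it runs without Chrome and produces the same prompt content on every run.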
Total ras-agent: 18 unit + 5 integration. Workspace: 97 test groups all green.
Verification
- cargo test --workspace --no-fail-fast: clean
- cargo clippy --workspace --all-targets -- -D clippy::unwrap_used -D clippy::dbg_macro: clean
- cargo fmt --all -- --check: clean
- cargo doc --workspace --no-deps: clean
LOC per new/modified file (200 cap):
- clickable_map.rs: 115
- render_step_message.rs: 173
- run_step.rs: 161
- run_agent.rs: 168
Compatibility
- All new APIs additive (browser_arc(), with_dom_extractor, StepRecord.summary).
- No public-API breaks anywhere in the workspace.
- Workspace MSRV unchanged.
- Old serialized AgentHistoryList values deserialize cleanly (summary defaults to None).
Migration
```rust
// before (v2.5.0): no DOM grounding
let agent = RunAgent::new(task, llm, registry, browser, events);
// after (v2.6.0): wire extractor for real grounding
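use std::sync::Arc;
use std::time::Duration;
use ras_dom::{ChromiumoxideDomExtractor, DomExtractor};
// ChromiumoxideAdapter (from ras-cdp) must also be in scope.
let adapter = ChromiumoxideAdapter::connect(ws_url, Duration::from_secs(60)).await?;
// Share the adapter's browser handle so both sides use one CDP connection.
let extractor: Arc<dyn DomExtractor> = Arc::new(
    ChromiumoxideDomExtractor::new(adapter.browser_arc(), Duration::from_secs(30)),
);
// Move the adapter into its Arc only after browser_arc() has been cloned out.
let browser = Arc::new(adapter);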
let agent = RunAgent::new(task, llm, registry, browser, events)
.with_dom_extractor(extractor);
```
Deferred (still after this release)
- Full EnhancedDomTreeNode tree: currently tree: None in BrowserStateSummary.
- stable_hash: currently empty in ClickableElement.stable_hash.
- Real AX tree via Accessibility.getFullAXTree: current ax_name from attributes is a sound MVP.
- Paint-order occlusion: paint_orders requested but not yet used to drop covered elements.
- Per-action snapshot: current snapshot fires once per step after all actions complete. Per-click feedback is plausible if models need finer-grained grounding; defer until evidence.
Artifacts
- Linux x86_64: ras-x86_64-unknown-linux-gnu, ras-daemon-x86_64-unknown-linux-gnu
- macOS arm64: ras-aarch64-apple-darwin, ras-daemon-aarch64-apple-darwin
- crates.io: all ras-* workspace crates published at 2.6.0 once publish.yml finishes
Pull requests
- #32: feat(agent): DOM grounding via ChromiumoxideDomExtractor (v2.6.0)
- #33: release: v2.6.0 (agent DOM grounding)
Sub-phase commits
- B1 fix(cdp): add ChromiumoxideAdapter::browser_arc() accessor, 2.5.1 (closes #31)
- B2 feat(agent): capture DOM snapshot per step via Option<Arc<dyn DomExtractor>>, 2.5.2
- B3 feat(agent): inject numbered clickable map + prefer extractor screenshot, 2.5.3
- B4 feat(examples): wire ChromiumoxideDomExtractor into claude_code_oauth_cosmium, 2.5.4
- B5 test(agent): integration test for end-to-end DOM grounding flow, 2.5.5
- chore: bump to 2.6.0
Closes: #31
Full changelog: v2.5.0...v2.6.0