talk2dom is a focused utility that solves one of the hardest problems in browser automation and UI testing:
Finding the correct UI element on a page.
In most automated testing or LLM-driven web navigation tasks, the real challenge is not how to click or type; it's how to locate the right element.
Think about it:
- Clicking a button is easy, if you know its selector.
- Typing into a field is trivial, if you've already located the right input.
- But finding the correct element among hundreds of `<div>`, `<span>`, or deeply nested Shadow DOM trees? That's the hard part.
`talk2dom` is built to solve exactly that.
`talk2dom` helps you locate elements. It:
- Understands natural language instructions and turns them into browser actions
- Supports single-command execution or persistent interactive sessions
- Uses LLMs (like GPT-4 or Claude) to analyze live HTML and intent
- Returns flexible output (actions, selectors, or both) depending on the instruction and model response
- Is compatible with both desktop and mobile browsers via Selenium
To avoid recomputing selectors every time, `talk2dom` can cache results in a PostgreSQL database.
- For each `instruction + url` pair, a unique SHA256 hash is generated.
- If a previous result exists, `talk2dom` reuses it and skips the LLM call.
- This greatly improves performance and reduces token usage.
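The caching idea above can be sketched in a few lines. This is a minimal illustration, not talk2dom's actual implementation: the key format (instruction and URL joined by a newline), the in-memory dict standing in for PostgreSQL, and the names `cache_key`, `locate_with_cache`, and `fake_llm_locator` are all assumptions made for the example.

```python
import hashlib


def cache_key(instruction: str, url: str) -> str:
    """Return a SHA-256 hex digest for an instruction + URL pair.

    Assumption: the two parts are joined with a newline; the real
    library may combine them differently.
    """
    payload = f"{instruction}\n{url}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def locate_with_cache(instruction, url, cache, llm_locator):
    """Reuse a cached selector if present; otherwise call the locator once."""
    key = cache_key(instruction, url)
    if key in cache:
        return cache[key]  # cache hit: no LLM call
    selector = llm_locator(instruction, url)
    cache[key] = selector
    return selector


calls = []


def fake_llm_locator(instruction, url):
    # Hypothetical stand-in for an LLM call; records each invocation.
    calls.append((instruction, url))
    return "input#id-search-field"


cache = {}  # stand-in for the PostgreSQL table
first = locate_with_cache("Find the Search box", "http://www.python.org", cache, fake_llm_locator)
second = locate_with_cache("Find the Search box", "http://www.python.org", cache, fake_llm_locator)
```

The second call returns the stored selector without invoking the locator again, which is where the token savings come from.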
Set the `TALK2DOM_DB_URI` environment variable:
export TALK2DOM_DB_URI="postgresql+psycopg2://user:password@localhost:5432/dbname"
If `TALK2DOM_DB_URI` is not set, caching is automatically disabled, and all requests use LLM inference in real time.
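A script that wraps talk2dom may want to know up front whether caching will be active. The helper below is a hypothetical sketch, not part of the library: `caching_enabled` is an invented name, and treating an empty or whitespace-only value as "not set" is an assumption.

```python
import os


def caching_enabled(environ=None) -> bool:
    """Return True when TALK2DOM_DB_URI is set to a non-empty value.

    Assumption: an empty or whitespace-only value counts as unset.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("TALK2DOM_DB_URI", "").strip())


if caching_enabled():
    print("Selector cache: enabled")
else:
    print("Selector cache: disabled, every request will call the LLM")
```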
While there are many modern tools for controlling browsers (like Playwright or Puppeteer), Selenium remains the most robust and cross-platform solution, especially when dealing with:
- Safari (WebKit)
- Firefox
- Mobile browsers
- Cross-browser testing grids
The newer tools often have more limited support outside Chromium-based browsers. Selenium, by contrast, has battle-tested support across all major platforms and continues to be the industry standard in enterprise and CI/CD environments.
That's why `talk2dom` is designed to integrate directly with Selenium: it works where the real-world complexity lives.
pip install talk2dom
For developers and testers who prefer structured Python control, `ActionChain` lets you drive the browser step by step.
By default, talk2dom uses gpt-4o-mini to balance performance and cost. However, during testing, gpt-4o has shown the best performance for this task.
export OPENAI_API_KEY="..."
Note: All models must support chat completion APIs and follow an OpenAI-compatible schema.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from talk2dom import ActionChain
driver = webdriver.Chrome()
ActionChain(driver) \
    .open("http://www.python.org") \
    .find("Find the Search box") \
    .type("pycon") \
    .wait(2) \
    .type(Keys.RETURN) \
    .assert_page_not_contains("No results found.") \
    .valid("the 'PSF PyCon Trademark Usage Policy' is exist") \
    .close()
You can also use `talk2dom` with free models like `llama-3.3-70b-versatile` from Groq.
The `find()` function can be used to query the entire page or a specific element.
You can pass either a full Selenium `driver` or a specific `WebElement` to scope the locator to part of the page.
- Reduce Token Usage: Passing a smaller HTML subtree (like a modal or container) instead of the full page saves LLM tokens, reducing latency and cost.
- Improve Locator Accuracy: Scoping the query helps the LLM focus on relevant content, which is especially helpful for nested or isolated components like popups, drawers, and cards.
You don't need to extract HTML manually; `talk2dom` will automatically use `outerHTML` from any `WebElement` you pass in.
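To make the scoping benefit concrete, here is a rough sketch of how a helper could choose which HTML to send to the model. It is not talk2dom's internal code: `html_for_scope` and the two stub classes are hypothetical, and the stubs only mimic the Selenium members the sketch relies on (`page_source` on a driver, `get_attribute("outerHTML")` on a `WebElement`) so the example runs without a browser.

```python
def html_for_scope(target) -> str:
    """Return the HTML an LLM would see for a driver or a WebElement.

    A Selenium WebDriver exposes `page_source`; a WebElement exposes
    `get_attribute("outerHTML")`. Anything else is rejected.
    """
    if hasattr(target, "page_source"):
        return target.page_source
    if hasattr(target, "get_attribute"):
        return target.get_attribute("outerHTML")
    raise TypeError("expected a Selenium WebDriver or WebElement")


class FakeElement:
    """Stub with the one WebElement method this sketch relies on."""

    def __init__(self, outer_html):
        self._outer_html = outer_html

    def get_attribute(self, name):
        return self._outer_html if name == "outerHTML" else None


class FakeDriver:
    """Stub with the one WebDriver attribute this sketch relies on."""

    def __init__(self, page_source):
        self.page_source = page_source


modal_html = '<div class="modal"><button>Confirm</button></div>'
page_html = "<html><body>" + "<div>filler</div>" * 200 + modal_html + "</body></html>"

full = html_for_scope(FakeDriver(page_html))
scoped = html_for_scope(FakeElement(modal_html))

# The scoped HTML is a small fraction of the full page.
print(len(full), len(scoped))
```

With a real browser session, the same comparison would use an actual driver and an element returned by `driver.find_element(...)` in place of the stubs.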
Our goal is not to control the browser; you still control your browser. Our goal is to find the right DOM element, so you can tell the browser what to do.
- Natural language interface to control the browser
- Persistent session for multi-step interactions
- LLM-powered understanding of high-level intent
- Outputs: actionable XPath/CSS selectors or ready-to-run browser steps
- Built-in assertions and step validations
- Works with both CLI scripts and interactive chat
Apache 2.0
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
We'd love to hear how you're using `talk2dom` in your AI agents or testing flows.
Feel free to open issues or discussions!
You can also tag us on GitHub if you're building something interesting with `talk2dom`!
If you find this project useful, please consider giving it a star!