Commit ffcd8c8

Authored Feb 27, 2025

Merge pull request #29 from askui/ML-649-ask-ui-integration-add-ai-element-command-via-askui-inference

ML-649 AskUI integration: add AI element command via AskUI inference

2 parents 015bf35 + 0fc757e · commit ffcd8c8

File tree

13 files changed: +475 -125 lines
 

README.md

+66-9
@@ -75,9 +75,9 @@ pip install askui
 
 **Note:** Requires Python version >=3.10.
 
-### 3a. Authenticate with an Automation Model Provider
+### 3a. Authenticate with an **AI Model** Provider
 
-| | AskUI [INFO](https://app.askui.com/) | Anthropic [INFO](https://console.anthropic.com/settings/keys) |
+| | AskUI [INFO](https://hub.askui.com/) | Anthropic [INFO](https://console.anthropic.com/settings/keys) |
 |----------|----------|----------|
 | ENV Variables | `ASKUI_WORKSPACE_ID`, `ASKUI_TOKEN` | `ANTHROPIC_API_KEY` |
 | Supported Commands | `click()` | `click()`, `get()`, `act()` |
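In practice this means exporting the variables for your chosen provider before the agent starts. A minimal Python sketch with placeholder values (assuming `VisionAgent` is importable from the package root; setting the variables in your shell works just as well):

```python
import os

# Placeholders: use the credentials from your AskUI workspace or Anthropic console.
os.environ["ASKUI_WORKSPACE_ID"] = "<your-workspace-id>"
os.environ["ASKUI_TOKEN"] = "<your-access-token>"
# or, for Anthropic:
os.environ["ANTHROPIC_API_KEY"] = "<your-api-key>"

from askui import VisionAgent  # assumption: the package exposes VisionAgent at the top level

with VisionAgent() as agent:
    agent.click("Login button")
```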
@@ -107,25 +107,25 @@ To get started, set the environment variables required to authenticate with your
 </details>
 
 
-### 3b. Test with 🤗 Hugging Face Spaces API
+### 3b. Test with 🤗 Hugging Face **AI Models** (Spaces API)
 
 You can test the Vision Agent with Hugging Face models via their Spaces API. Please note that the API is rate-limited, so for production use cases it is recommended to choose step 3a.
 
 **Note:** Hugging Face Spaces host model demos provided by individuals not associated with Hugging Face or AskUI. Don't use these models on screens with sensitive information.
 
 **Supported Models:**
-- `AskUI/PTA-1`
-- `OS-Copilot/OS-Atlas-Base-7B`
-- `showlab/ShowUI-2B`
-- `Qwen/Qwen2-VL-2B-Instruct`
-- `Qwen/Qwen2-VL-7B-Instruct`
+- [`AskUI/PTA-1`](https://huggingface.co/spaces/AskUI/PTA-1)
+- [`OS-Copilot/OS-Atlas-Base-7B`](https://huggingface.co/spaces/maxiw/OS-ATLAS)
+- [`showlab/ShowUI-2B`](https://huggingface.co/spaces/showlab/ShowUI)
+- [`Qwen/Qwen2-VL-2B-Instruct`](https://huggingface.co/spaces/maxiw/Qwen2-VL-Detection)
+- [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/spaces/maxiw/Qwen2-VL-Detection)
 
 **Example Code:**
 ```python
 agent.click("search field", model_name="OS-Copilot/OS-Atlas-Base-7B")
 ```
 
-### 3c. Custom Model Implementations
+### 3c. Host your own **AI Models**
 
 #### UI-TARS
 
@@ -171,6 +171,63 @@ Instead of relying on the default model for the entire automation script, you ca
 
 **Example:** `agent.click("Preview", model_name="askui-combo")`
 
+<details>
+<summary>Anthropic AI Models</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
+
+| Model Name | Info | Production Ready? | Enterprise? |
+|-------------|--------------------|--------------|--------------|
+| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Book me a flight from Berlin to Rome"` | ❌ | ❌ |
+
+> **Note:** Configure your Anthropic Model Provider [here]()
+
+
+</details>
+
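For instance, the `act()` command can hand a whole goal to this model. A minimal sketch, assuming the Anthropic provider is configured as in step 3a (the goal text is just an illustration):

```python
with VisionAgent() as agent:
    # Hand the whole goal to the Computer Use model and let it act autonomously.
    agent.act("Book me a flight from Berlin to Rome", model_name="anthropic-claude-3-5-sonnet-20241022")
```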
+<details>
+<summary>AskUI AI Models</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`
+
+| Model Name | Info | Production Ready? | Enterprise? | Teachable? |
+|-------------|--------------------|--------------|--------------|--------------|
+| `askui-pta` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI]() to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | | | |
+| `askui-ocr` | `AskUI OCR` is an OCR model trained to address texts on UI screens, e.g. "`Login`", "`Search`" | | | |
+| `askui-combo` | `AskUI Combo` is a combination of the `askui-pta` and the `askui-ocr` models to improve accuracy. | | | |
+| `askui-ai-element` | [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you are looking for. Therefore, you have to crop out the element and give it a name. | | | |
+
+> **Note:** Configure your AskUI Model Provider [here]()
+
+</details>
+
+
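A brief sketch of picking these models per command. The AI element name `"company_logo"` is a placeholder for an element you cropped and named beforehand, and passing that name as the locator string is an assumption rather than documented behaviour:

```python
with VisionAgent() as agent:
    agent.click("Login button", model_name="askui-pta")          # locate by textual description
    agent.click("company_logo", model_name="askui-ai-element")   # locate a previously cropped AI element by its name
```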
+<details>
+<summary>Hugging Face AI Models (Spaces API)</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`
+
+| Model Name | Info | Production Ready? | Enterprise? |
+|-------------|--------------------|--------------|--------------|
+| `AskUI/PTA-1` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI]() to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | | |
+| `OS-Copilot/OS-Atlas-Base-7B` | [`OS-Atlas-Base-7B`](https://github.com/OS-Copilot/OS-Atlas) is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Please help me modify VS Code settings to hide all folders in the explorer view"`. This model is not available in the `act()` command. | | |
+| `showlab/ShowUI-2B` | [`showlab/ShowUI-2B`](https://huggingface.co/showlab/ShowUI-2B) is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Search in Google Maps for Nahant"`. This model is not available in the `act()` command. | | |
+| `Qwen/Qwen2-VL-2B-Instruct` | [`Qwen/Qwen2-VL-2B-Instruct`](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the `act()` command. | | |
+| `Qwen/Qwen2-VL-7B-Instruct` | [`Qwen/Qwen2-VL-7B-Instruct`](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the `act()` command. | | |
+
+> **Note:** No authentication required, but rate-limited!
+
+</details>
+
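These Spaces-hosted models can also drive the newly added `mouse_move()` command, for example:

```python
with VisionAgent() as agent:
    agent.mouse_move("search field", model_name="showlab/ShowUI-2B")
```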
+<details>
+<summary>Self-Hosted UI Models</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
+
+| Model Name | Info | Production Ready? | Enterprise? |
+|-------------|--------------------|--------------|--------------|
+| `tars` | [`UI-TARS`](https://github.com/bytedance/UI-TARS) is a Large Action Model (LAM) based on Qwen2 and fine-tuned by [ByteDance](https://www.bytedance.com/) on UI data, e.g. "`Book me a flight to Rome`" | | |
+
+
+> **Note:** These models need to be self-hosted. (See [here]())
+
+</details>
+
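The same `model_name` switch applies here; a sketch assuming a UI-TARS endpoint has been configured as described in the UI-TARS section above:

```python
with VisionAgent() as agent:
    agent.act("Book me a flight to Rome", model_name="tars")
```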
 ### 🛠️ Direct Tool Use
 
 Under the hood, agents are using a set of tools. You can also directly access these tools.

src/askui/agent.py

+30-5
@@ -88,13 +88,36 @@ def click(self, instruction: Optional[str] = None, button: Literal['left', 'midd
             self.report.add_message("User", msg)
         if instruction is not None:
             logger.debug("VisionAgent received instruction to click '%s'", instruction)
-            screenshot = self.client.screenshot() # type: ignore
-            x, y = self.model_router.click(screenshot, instruction, model_name)
-            if self.report is not None:
-                self.report.add_message("ModelRouter", f"click: ({x}, {y})")
-            self.client.mouse(x, y) # type: ignore
+            self.__mouse_move(instruction, model_name)
         self.client.click(button, repeat) # type: ignore
 
+    def __mouse_move(self, instruction: str, model_name: Optional[str] = None) -> None:
+        self._check_askui_controller_enabled()
+        screenshot = self.client.screenshot() # type: ignore
+        x, y = self.model_router.locate(screenshot, instruction, model_name)
+        if self.report is not None:
+            self.report.add_message("ModelRouter", f"locate: ({x}, {y})")
+        self.client.mouse(x, y) # type: ignore
+
+    def mouse_move(self, instruction: str, model_name: Optional[str] = None) -> None:
+        """
+        Moves the mouse cursor to the UI element identified by the provided instruction.
+
+        Parameters:
+            instruction (str): The identifier or description of the element to move to.
+            model_name (str | None): The model name to be used for element detection. Optional.
+
+        Example:
+            >>> with VisionAgent() as agent:
+            >>>     agent.mouse_move("Submit button")  # Moves cursor to submit button
+            >>>     agent.mouse_move("Close")  # Moves cursor to close element
+            >>>     agent.mouse_move("Profile picture", model_name="custom_model")  # Uses specific model
+        """
+        if self.report is not None:
+            self.report.add_message("User", f'mouse_move: "{instruction}"')
+        logger.debug("VisionAgent received instruction to mouse_move '%s'", instruction)
+        self.__mouse_move(instruction, model_name)
+
     def type(self, text: str) -> None:
         self._check_askui_controller_enabled()
         if self.report is not None:
@@ -145,6 +168,7 @@ def key_up(self, key: PC_AND_MODIFIER_KEY):
         self._check_askui_controller_enabled()
         if self.report is not None:
             self.report.add_message("User", f'key_up "{key}"')
+        logger.debug("VisionAgent received in key_up '%s'", key)
         self.client.keyboard_release(key)
 
     def key_down(self, key: PC_AND_MODIFIER_KEY):
@@ -161,6 +185,7 @@ def key_down(self, key: PC_AND_MODIFIER_KEY):
         self._check_askui_controller_enabled()
         if self.report is not None:
             self.report.add_message("User", f'key_down "{key}"')
+        logger.debug("VisionAgent received in key_down '%s'", key)
        self.client.keyboard_pressed(key)
 
     def act(self, goal: str, model_name: Optional[str] = None) -> None:
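With this refactor both `click()` and the new public `mouse_move()` go through the private `__mouse_move()` helper, so hovering and clicking share the same locate-and-move path. A usage sketch of the new method (assuming `VisionAgent` is exported at the package root; locator texts are illustrative):

```python
from askui import VisionAgent  # assumption: VisionAgent is importable from the package root

with VisionAgent() as agent:
    # Hover first so a hover-only menu is rendered, then click the revealed entry.
    agent.mouse_move("Profile picture")
    agent.click("Sign out")
```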

src/askui/models/anthropic/claude.py

+4-4
@@ -45,18 +45,18 @@ def inference(self, base64_image, prompt, system_prompt) -> list[anthropic.types
         )
         return message.content
 
-    def click_inference(self, image: Image.Image, instruction: str) -> tuple[int, int]:
-        prompt = f"Click on {instruction}"
+    def locate_inference(self, image: Image.Image, locator: str) -> tuple[int, int]:
+        prompt = f"Click on {locator}"
         screen_width, screen_height = self.resolution[0], self.resolution[1]
         system_prompt = f"Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try taking another screenshot.\n* The screen's resolution is {screen_width}x{screen_height}.\n* The display number is 0\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\n"
         scaled_image = scale_image_with_padding(image, screen_width, screen_height)
         response = self.inference(image_to_base64(scaled_image), prompt, system_prompt)
         response = response[0].text
-        logger.debug("ClaudeHandler received instruction: %s", response)
+        logger.debug("ClaudeHandler received locator: %s", response)
         try:
             scaled_x, scaled_y = extract_click_coordinates(response)
         except Exception as e:
-            raise AutomationError(f"Couldn't locate '{instruction}' on the screen.")
+            raise AutomationError(f"Couldn't locate '{locator}' on the screen.")
         x, y = scale_coordinates_back(scaled_x, scaled_y, image.width, image.height, screen_width, screen_height)
         return int(x), int(y)
 
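`locate_inference()` sends a letterboxed screenshot at the model's working resolution and then maps the returned coordinates back to the original image. The helpers `scale_image_with_padding` and `scale_coordinates_back` are not shown in this diff; the sketch below is an illustrative stand-in for the scale-back step, assuming centered letterbox padding (the real implementation may differ):

```python
def scale_coordinates_back_sketch(
    scaled_x: float, scaled_y: float,
    orig_width: int, orig_height: int,
    target_width: int, target_height: int,
) -> tuple[float, float]:
    # Assume the image was scaled to fit (target_width, target_height) while
    # keeping its aspect ratio, with the remainder filled by centered padding.
    scale = min(target_width / orig_width, target_height / orig_height)
    pad_x = (target_width - orig_width * scale) / 2
    pad_y = (target_height - orig_height * scale) / 2
    # Remove the padding offset, then undo the scaling.
    return (scaled_x - pad_x) / scale, (scaled_y - pad_y) / scale
```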

src/askui/models/anthropic/claude_agent.py

+2-2
@@ -129,8 +129,8 @@ def step(self, messages: list):
         return messages
 
 
-    def run(self, instruction: str):
-        messages = [{"role": "user", "content": instruction}]
+    def run(self, goal: str):
+        messages = [{"role": "user", "content": goal}]
         logger.debug(messages[0])
         while messages[-1]["role"] == "user":
             messages = self.step(messages)
+108
@@ -0,0 +1,108 @@
+from datetime import datetime
+import os
+import pathlib
+from typing import List, Optional
+from pydantic import UUID4, BaseModel, ConfigDict, Field
+from pydantic.alias_generators import to_camel
+from PIL import Image
+from askui.logging import logger
+
+
+class Rectangle(BaseModel):
+    xmin: int
+    ymin: int
+    xmax: int
+    ymax: int
+
+
+class Annotation(BaseModel):
+    id: UUID4
+    rectangle: Rectangle
+
+
+class Size(BaseModel):
+    width: int
+    height: int
+
+
+class AskUIImageMetadata(BaseModel):
+    size: Size
+
+
+class AiElementMetadata(BaseModel):
+    model_config = ConfigDict(
+        alias_generator=to_camel,
+        serialization_alias=to_camel,
+    )
+
+    version: int
+    id: UUID4
+    name: str
+    creation_date_time: datetime
+    image_metadata: AskUIImageMetadata = Field(alias="image")
+
+
+class AiElement():
+    image_path: pathlib.Path
+    image: Image.Image
+    metadata_path: pathlib.Path
+    metadata: AiElementMetadata
+
+    def __init__(self, image_path: pathlib.Path, image: Image.Image, metadata_path: pathlib.Path, metadata: AiElementMetadata):
+        self.image_path = image_path
+        self.image = image
+        self.metadata_path = metadata_path
+        self.metadata = metadata
+
+    @classmethod
+    def from_json_file(cls, json_file_path: pathlib.Path) -> "AiElement":
+        image_path = json_file_path.parent / (json_file_path.stem + ".png")
+        with open(json_file_path) as f:
+            return cls(
+                metadata_path=json_file_path,
+                image_path=image_path,
+                metadata=AiElementMetadata.model_validate_json(f.read()),
+                image=Image.open(image_path))
+
+
+class AiElementNotFound(Exception):
+    pass
+
+
+class AiElementCollection:
+
+    def __init__(self, additional_ai_element_locations: Optional[List[pathlib.Path]] = None):
+        workspace_id = os.getenv("ASKUI_WORKSPACE_ID")
+        if workspace_id is None:
+            raise ValueError("ASKUI_WORKSPACE_ID is not set")
+
+        if additional_ai_element_locations is None:
+            additional_ai_element_locations = []
+
+        additional_ai_element_from_env = []
+        if os.getenv("ASKUI_AI_ELEMENT_LOCATIONS", "") != "":
+            additional_ai_element_from_env = [pathlib.Path(ai_element_loc) for ai_element_loc in os.getenv("ASKUI_AI_ELEMENT_LOCATIONS", "").split(",")]
+
+        self.ai_element_locations = [
+            pathlib.Path.home() / ".askui" / "SnippingTool" / "AIElement" / workspace_id,
+            *additional_ai_element_from_env,
+            *additional_ai_element_locations
+        ]
+
+        logger.debug("AI Element locations: %s", self.ai_element_locations)
+
+    def find(self, name: str):
+        ai_elements = []
+
+        for location in self.ai_element_locations:
+            path = pathlib.Path(location)
+
+            json_files = list(path.glob("*.json"))
+
+            if not json_files:
+                logger.warning(f"No JSON files found in: {location}")
+                continue
+
+            for json_file in json_files:
+                ai_element = AiElement.from_json_file(json_file)
+
+                if ai_element.metadata.name == name:
+                    ai_elements.append(ai_element)
+
+        return ai_elements
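A minimal usage sketch of the new collection (the workspace ID and element name are placeholders; the module's import path is not shown in this diff, so it is omitted here):

```python
import os

os.environ.setdefault("ASKUI_WORKSPACE_ID", "<your-workspace-id>")  # required by the constructor

collection = AiElementCollection()
matches = collection.find("github_icon")  # every AiElement whose metadata name equals "github_icon"
for element in matches:
    print(element.image_path, element.metadata.creation_date_time)
```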
