@@ -107,25 +107,25 @@ To get started, set the environment variables required to authenticate with your
</details>

-### 3b. Test with 🤗 Hugging Face Spaces API
+### 3b. Test with 🤗 Hugging Face **AI Models** (Spaces API)

You can test the Vision Agent with Hugging Face models via their Spaces API. Note that the API is rate-limited, so for production use cases we recommend step 3a.

**Note:** Hugging Face Spaces host model demos provided by individuals not associated with Hugging Face or AskUI. Don't use these models on screens with sensitive information.
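
Because the Spaces API is rate-limited, a test script may want to pause and retry when a call is rejected. The sketch below is a generic retry-with-backoff helper and is not part of AskUI. `call_with_backoff` and `RateLimitedError` are hypothetical names used only for illustration, and the actual exception raised by a rate-limited call is an assumption you would need to check.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical stand-in for whatever error a rate-limited call raises."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, pausing with exponentially growing delays on RateLimitedError."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

In a test script you would wrap the agent call in a zero-argument function or lambda and pass it as `fn`. The `sleep` parameter exists so the helper can be exercised without real waiting.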
| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM) that can autonomously achieve goals, e.g. `"Book me a flight from Berlin to Rome"` | ❌ | ❌
+> **Note:** Configure your Anthropic Model Provider [here]()
|`askui-pta`|[`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI]() to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | ✅ | ✅ | ✅ |
+|`askui-ocr`|`AskUI OCR` is an OCR model trained to address text on UI screens, e.g. "`Login`", "`Search`" | ✅ | ✅ | ✅ |
+|`askui-combo`| AskUI Combo is a combination of the `askui-pta` and `askui-ocr` models to improve accuracy. | ✅ | ✅ | ✅ |
+|`askui-ai-element`|[AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you are looking for. To do so, crop out the element and give it a name. | ✅ | ✅ | ✅ |
+
+> **Note:** Configure your AskUI Model Provider [here]()
+
+</details>
+
+
+<details>
+<summary>Hugging Face AI Models (Spaces API)</summary>
|`AskUI/PTA-1`|[`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI]() to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | ❌ | ❌ |
+|`OS-Copilot/OS-Atlas-Base-7B`|[`OS-Atlas-Base-7B`](https://github.com/OS-Copilot/OS-Atlas) is a Large Action Model (LAM) that can autonomously achieve goals, e.g. `"Please help me modify VS Code setting to hide all folders in the explorer view"`. This model is not available in the `act()` command | ❌ | ❌ |
+|`showlab/ShowUI-2B`|[`showlab/ShowUI-2B`](https://huggingface.co/showlab/ShowUI-2B) is a Large Action Model (LAM) that can autonomously achieve goals, e.g. `"Search in google maps for Nahant"`. This model is not available in the `act()` command | ❌ | ❌ |
+|`Qwen/Qwen2-VL-2B-Instruct`|[`Qwen/Qwen2-VL-2B-Instruct`](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets, including UI data. This model is not available in the `act()` command | ❌ | ❌ |
+|`Qwen/Qwen2-VL-7B-Instruct`|[`Qwen/Qwen2-VL-7B-Instruct`](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets, including UI data. This model is not available in the `act()` command | ❌ | ❌ |
+
+> **Note:** No authentication is required, but the API is rate-limited.
|`tars`|[`UI-Tars`](https://github.com/bytedance/UI-TARS) is a Large Action Model (LAM) based on Qwen2 and fine-tuned by [ByteDance](https://www.bytedance.com/) on UI data, e.g. "`Book me a flight to Rome`" | ❌ | ❌ |
+
+
+> **Note:** These models need to be self-hosted. (See [here]())
+
+</details>
+
### 🛠️ Direct Tool Use

Under the hood, agents use a set of tools. You can also access these tools directly.
system_prompt=f"Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try taking another screenshot.\n* The screen's resolution is {screen_width}x{screen_height}.\n* The display number is 0\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\n"
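
The `{screen_width}` and `{screen_height}` placeholders in the prompt above are filled in when the f-string is evaluated, so the model is told the actual resolution it is working with. A minimal sketch of that substitution, using a shortened prompt and a hypothetical `build_system_prompt` helper (neither the helper name nor the input check comes from the repository):

```python
def build_system_prompt(screen_width: int, screen_height: int) -> str:
    """Return a shortened system prompt with the screen resolution filled in."""
    # Hypothetical guard, added for illustration only.
    if screen_width <= 0 or screen_height <= 0:
        raise ValueError("screen dimensions must be positive")
    return (
        "Use a mouse and keyboard to interact with a computer, and take screenshots.\n"
        f"* The screen's resolution is {screen_width}x{screen_height}.\n"
        "* The display number is 0\n"
    )


if __name__ == "__main__":
    # Example: a Full HD display.
    print(build_system_prompt(1920, 1080))
```

Building the prompt in a function rather than inline makes it easy to check that the resolution text matches the screenshots the agent actually receives.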