Commit ffcd8c8

Authored Feb 27, 2025

Merge pull request #29 from askui/ML-649-ask-ui-integration-add-ai-element-command-via-askui-inference

ML-649 AskUI integration: add AI element command via AskUI inference

2 parents 015bf35 + 0fc757e · commit ffcd8c8

File tree

13 files changed: +475 -125 lines
 

README.md

+66-9
@@ -75,9 +75,9 @@ pip install askui
 
 **Note:** Requires Python version >=3.10.
 
-### 3a. Authenticate with an Automation Model Provider
+### 3a. Authenticate with an **AI Model** Provider
 
-| | AskUI [INFO](https://app.askui.com/) | Anthropic [INFO](https://console.anthropic.com/settings/keys) |
+| | AskUI [INFO](https://hub.askui.com/) | Anthropic [INFO](https://console.anthropic.com/settings/keys) |
 |----------|----------|----------|
 | ENV Variables | `ASKUI_WORKSPACE_ID`, `ASKUI_TOKEN` | `ANTHROPIC_API_KEY` |
 | Supported Commands | `click()` | `click()`, `get()`, `act()` |
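In practice this means exporting the variables for your chosen provider before the agent starts. A minimal Python sketch with placeholder values (assuming `VisionAgent` is importable from the package root; setting the variables in your shell works just as well):

```python
import os

# Placeholders: use the credentials from your AskUI workspace or Anthropic console.
os.environ["ASKUI_WORKSPACE_ID"] = "<your-workspace-id>"
os.environ["ASKUI_TOKEN"] = "<your-access-token>"
# or, for Anthropic:
os.environ["ANTHROPIC_API_KEY"] = "<your-api-key>"

from askui import VisionAgent  # assumption: the package exposes VisionAgent at the top level

with VisionAgent() as agent:
    agent.click("Login button")
```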
@@ -107,25 +107,25 @@ To get started, set the environment variables required to authenticate with your
 </details>
 
 
-### 3b. Test with 🤗 Hugging Face Spaces API
+### 3b. Test with 🤗 Hugging Face **AI Models** (Spaces API)
 
 You can test the Vision Agent with Hugging Face models via their Spaces API. Please note that the API is rate-limited, so for production use cases it is recommended to choose step 3a.
 
 **Note:** Hugging Face Spaces host model demos provided by individuals not associated with Hugging Face or AskUI. Don't use these models on screens with sensitive information.
 
 **Supported Models:**
-- `AskUI/PTA-1`
-- `OS-Copilot/OS-Atlas-Base-7B`
-- `showlab/ShowUI-2B`
-- `Qwen/Qwen2-VL-2B-Instruct`
-- `Qwen/Qwen2-VL-7B-Instruct`
+- [`AskUI/PTA-1`](https://huggingface.co/spaces/AskUI/PTA-1)
+- [`OS-Copilot/OS-Atlas-Base-7B`](https://huggingface.co/spaces/maxiw/OS-ATLAS)
+- [`showlab/ShowUI-2B`](https://huggingface.co/spaces/showlab/ShowUI)
+- [`Qwen/Qwen2-VL-2B-Instruct`](https://huggingface.co/spaces/maxiw/Qwen2-VL-Detection)
+- [`Qwen/Qwen2-VL-7B-Instruct`](https://huggingface.co/spaces/maxiw/Qwen2-VL-Detection)
 
 **Example Code:**
 ```python
 agent.click("search field", model_name="OS-Copilot/OS-Atlas-Base-7B")
 ```
 
-### 3c. Custom Model Implementations
+### 3c. Host your own **AI Models**
 
 #### UI-TARS
 
@@ -171,6 +171,63 @@ Instead of relying on the default model for the entire automation script, you ca
 
 **Example:** `agent.click("Preview", model_name="askui-combo")`
 
+<details>
+<summary>Anthropic AI Models</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
+
+| Model Name | Info | Production Ready? | Enterprise? |
+|-------------|--------------------|--------------|--------------|
+| `anthropic-claude-3-5-sonnet-20241022` | The [Computer Use](https://docs.anthropic.com/en/docs/agents-and-tools/computer-use) model from Anthropic is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Book me a flight from Berlin to Rome"` | ❌ | ❌ |
+
+> **Note:** Configure your Anthropic Model Provider [here]()
+
+
+</details>
+
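For instance, the `act()` command can hand a whole goal to this model. A minimal sketch, assuming the Anthropic provider is configured as in step 3a (the goal text is just an illustration):

```python
with VisionAgent() as agent:
    # Hand the whole goal to the Computer Use model and let it act autonomously.
    agent.act("Book me a flight from Berlin to Rome", model_name="anthropic-claude-3-5-sonnet-20241022")
```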
+<details>
+<summary>AskUI AI Models</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`
+
+| Model Name | Info | Production Ready? | Enterprise? | Teachable? |
+|-------------|--------------------|--------------|--------------|--------------|
+| `askui-pta` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI]() to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | | | |
+| `askui-ocr` | `AskUI OCR` is an OCR model trained to address texts on UI screens, e.g. "`Login`", "`Search`" | | | |
+| `askui-combo` | `AskUI Combo` is a combination of the `askui-pta` and the `askui-ocr` models to improve accuracy. | | | |
+| `askui-ai-element` | [AskUI AI Element](https://docs.askui.com/docs/general/Element%20Selection/aielement) allows you to address visual elements like icons or images by demonstrating what you are looking for. Therefore, you have to crop out the element and give it a name. | | | |
+
+> **Note:** Configure your AskUI Model Provider [here]()
+
+</details>
+
+
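A brief sketch of picking these models per command. The AI element name `"company_logo"` is a placeholder for an element you cropped and named beforehand, and passing that name as the locator string is an assumption rather than documented behaviour:

```python
with VisionAgent() as agent:
    agent.click("Login button", model_name="askui-pta")          # locate by textual description
    agent.click("company_logo", model_name="askui-ai-element")   # locate a previously cropped AI element by its name
```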
+<details>
+<summary>Hugging Face AI Models (Spaces API)</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`
+
+| Model Name | Info | Production Ready? | Enterprise? |
+|-------------|--------------------|--------------|--------------|
+| `AskUI/PTA-1` | [`PTA-1`](https://huggingface.co/AskUI/PTA-1) (Prompt-to-Automation) is a vision language model (VLM) trained by [AskUI]() to address all kinds of UI elements by a textual description, e.g. "`Login button`", "`Text login`" | | |
+| `OS-Copilot/OS-Atlas-Base-7B` | [`OS-Atlas-Base-7B`](https://github.com/OS-Copilot/OS-Atlas) is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Please help me modify VS Code settings to hide all folders in the explorer view"`. This model is not available in the `act()` command. | | |
+| `showlab/ShowUI-2B` | [`showlab/ShowUI-2B`](https://huggingface.co/showlab/ShowUI-2B) is a Large Action Model (LAM), which can autonomously achieve goals, e.g. `"Search in Google Maps for Nahant"`. This model is not available in the `act()` command. | | |
+| `Qwen/Qwen2-VL-2B-Instruct` | [`Qwen/Qwen2-VL-2B-Instruct`](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the `act()` command. | | |
+| `Qwen/Qwen2-VL-7B-Instruct` | [`Qwen/Qwen2-VL-7B-Instruct`](https://github.com/QwenLM/Qwen2.5-VLB) is a Visual Language Model (VLM) pre-trained on multiple datasets including UI data. This model is not available in the `act()` command. | | |
+
+> **Note:** No authentication required, but rate-limited!
+
+</details>
+
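These Spaces-hosted models can also drive the newly added `mouse_move()` command, for example:

```python
with VisionAgent() as agent:
    agent.mouse_move("search field", model_name="showlab/ShowUI-2B")
```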
+<details>
+<summary>Self-Hosted UI Models</summary>
+
+Supported commands are: `click()`, `type()`, `mouse_move()`, `get()`, `act()`
+
+| Model Name | Info | Production Ready? | Enterprise? |
+|-------------|--------------------|--------------|--------------|
+| `tars` | [`UI-TARS`](https://github.com/bytedance/UI-TARS) is a Large Action Model (LAM) based on Qwen2 and fine-tuned by [ByteDance](https://www.bytedance.com/) on UI data, e.g. "`Book me a flight to Rome`" | | |
+
+
+> **Note:** These models need to be self-hosted. (See [here]())
+
+</details>
+
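The same `model_name` switch applies here; a sketch assuming a UI-TARS endpoint has been configured as described in the UI-TARS section above:

```python
with VisionAgent() as agent:
    agent.act("Book me a flight to Rome", model_name="tars")
```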
 ### 🛠️ Direct Tool Use
 
 Under the hood, agents are using a set of tools. You can also directly access these tools.

src/askui/agent.py

+30-5
@@ -88,13 +88,36 @@ def click(self, instruction: Optional[str] = None, button: Literal['left', 'midd
             self.report.add_message("User", msg)
         if instruction is not None:
             logger.debug("VisionAgent received instruction to click '%s'", instruction)
-            screenshot = self.client.screenshot() # type: ignore
-            x, y = self.model_router.click(screenshot, instruction, model_name)
-            if self.report is not None:
-                self.report.add_message("ModelRouter", f"click: ({x}, {y})")
-            self.client.mouse(x, y) # type: ignore
+            self.__mouse_move(instruction, model_name)
         self.client.click(button, repeat) # type: ignore
 
+    def __mouse_move(self, instruction: str, model_name: Optional[str] = None) -> None:
+        self._check_askui_controller_enabled()
+        screenshot = self.client.screenshot() # type: ignore
+        x, y = self.model_router.locate(screenshot, instruction, model_name)
+        if self.report is not None:
+            self.report.add_message("ModelRouter", f"locate: ({x}, {y})")
+        self.client.mouse(x, y) # type: ignore
+
+    def mouse_move(self, instruction: str, model_name: Optional[str] = None) -> None:
+        """
+        Moves the mouse cursor to the UI element identified by the provided instruction.
+
+        Parameters:
+            instruction (str): The identifier or description of the element to move to.
+            model_name (str | None): The model name to be used for element detection. Optional.
+
+        Example:
+            >>> with VisionAgent() as agent:
+            >>>     agent.mouse_move("Submit button")  # Moves cursor to submit button
+            >>>     agent.mouse_move("Close")  # Moves cursor to close element
+            >>>     agent.mouse_move("Profile picture", model_name="custom_model")  # Uses specific model
+        """
+        if self.report is not None:
+            self.report.add_message("User", f'mouse_move: "{instruction}"')
+        logger.debug("VisionAgent received instruction to mouse_move '%s'", instruction)
+        self.__mouse_move(instruction, model_name)
+
     def type(self, text: str) -> None:
         self._check_askui_controller_enabled()
         if self.report is not None:
@@ -145,6 +168,7 @@ def key_up(self, key: PC_AND_MODIFIER_KEY):
         self._check_askui_controller_enabled()
         if self.report is not None:
             self.report.add_message("User", f'key_up "{key}"')
+        logger.debug("VisionAgent received in key_up '%s'", key)
         self.client.keyboard_release(key)
 
     def key_down(self, key: PC_AND_MODIFIER_KEY):
@@ -161,6 +185,7 @@ def key_down(self, key: PC_AND_MODIFIER_KEY):
         self._check_askui_controller_enabled()
         if self.report is not None:
             self.report.add_message("User", f'key_down "{key}"')
+        logger.debug("VisionAgent received in key_down '%s'", key)
        self.client.keyboard_pressed(key)
 
     def act(self, goal: str, model_name: Optional[str] = None) -> None:
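With this refactor both `click()` and the new public `mouse_move()` go through the private `__mouse_move()` helper, so hovering and clicking share the same locate-and-move path. A usage sketch of the new method (assuming `VisionAgent` is exported at the package root; locator texts are illustrative):

```python
from askui import VisionAgent  # assumption: VisionAgent is importable from the package root

with VisionAgent() as agent:
    # Hover first so a hover-only menu is rendered, then click the revealed entry.
    agent.mouse_move("Profile picture")
    agent.click("Sign out")
```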

src/askui/models/anthropic/claude.py

+4-4
@@ -45,18 +45,18 @@ def inference(self, base64_image, prompt, system_prompt) -> list[anthropic.types
         )
         return message.content
 
-    def click_inference(self, image: Image.Image, instruction: str) -> tuple[int, int]:
-        prompt = f"Click on {instruction}"
+    def locate_inference(self, image: Image.Image, locator: str) -> tuple[int, int]:
+        prompt = f"Click on {locator}"
         screen_width, screen_height = self.resolution[0], self.resolution[1]
         system_prompt = f"Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try taking another screenshot.\n* The screen's resolution is {screen_width}x{screen_height}.\n* The display number is 0\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\n"
         scaled_image = scale_image_with_padding(image, screen_width, screen_height)
         response = self.inference(image_to_base64(scaled_image), prompt, system_prompt)
         response = response[0].text
-        logger.debug("ClaudeHandler received instruction: %s", response)
+        logger.debug("ClaudeHandler received locator: %s", response)
         try:
             scaled_x, scaled_y = extract_click_coordinates(response)
         except Exception as e:
-            raise AutomationError(f"Couldn't locate '{instruction}' on the screen.")
+            raise AutomationError(f"Couldn't locate '{locator}' on the screen.")
         x, y = scale_coordinates_back(scaled_x, scaled_y, image.width, image.height, screen_width, screen_height)
         return int(x), int(y)
 
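`locate_inference()` sends a letterboxed screenshot at the model's working resolution and then maps the returned coordinates back to the original image. The helpers `scale_image_with_padding` and `scale_coordinates_back` are not shown in this diff; the sketch below is an illustrative stand-in for the scale-back step, assuming centered letterbox padding (the real implementation may differ):

```python
def scale_coordinates_back_sketch(
    scaled_x: float, scaled_y: float,
    orig_width: int, orig_height: int,
    target_width: int, target_height: int,
) -> tuple[float, float]:
    # Assume the image was scaled to fit (target_width, target_height) while
    # keeping its aspect ratio, with the remainder filled by centered padding.
    scale = min(target_width / orig_width, target_height / orig_height)
    pad_x = (target_width - orig_width * scale) / 2
    pad_y = (target_height - orig_height * scale) / 2
    # Remove the padding offset, then undo the scaling.
    return (scaled_x - pad_x) / scale, (scaled_y - pad_y) / scale
```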

src/askui/models/anthropic/claude_agent.py

+2-2
@@ -129,8 +129,8 @@ def step(self, messages: list):
         return messages
 
 
-    def run(self, instruction: str):
-        messages = [{"role": "user", "content": instruction}]
+    def run(self, goal: str):
+        messages = [{"role": "user", "content": goal}]
         logger.debug(messages[0])
         while messages[-1]["role"] == "user":
             messages = self.step(messages)
+108
@@ -0,0 +1,108 @@
+from datetime import datetime
+import os
+import pathlib
+from typing import List, Optional
+from pydantic import UUID4, BaseModel, ConfigDict, Field
+from pydantic.alias_generators import to_camel
+from PIL import Image
+from askui.logging import logger
+
+
+class Rectangle(BaseModel):
+    xmin: int
+    ymin: int
+    xmax: int
+    ymax: int
+
+
+class Annotation(BaseModel):
+    id: UUID4
+    rectangle: Rectangle
+
+
+class Size(BaseModel):
+    width: int
+    height: int
+
+
+class AskUIImageMetadata(BaseModel):
+    size: Size
+
+
+class AiElementMetadata(BaseModel):
+    model_config = ConfigDict(
+        alias_generator=to_camel,
+        serialization_alias=to_camel,
+    )
+
+    version: int
+    id: UUID4
+    name: str
+    creation_date_time: datetime
+    image_metadata: AskUIImageMetadata = Field(alias="image")
+
+
+class AiElement():
+    image_path: pathlib.Path
+    image: Image.Image
+    metadata_path: pathlib.Path
+    metadata: AiElementMetadata
+
+    def __init__(self, image_path: pathlib.Path, image: Image.Image, metadata_path: pathlib.Path, metadata: AiElementMetadata):
+        self.image_path = image_path
+        self.image = image
+        self.metadata_path = metadata_path
+        self.metadata = metadata
+
+    @classmethod
+    def from_json_file(cls, json_file_path: pathlib.Path) -> "AiElement":
+        image_path = json_file_path.parent / (json_file_path.stem + ".png")
+        with open(json_file_path) as f:
+            return cls(
+                metadata_path=json_file_path,
+                image_path=image_path,
+                metadata=AiElementMetadata.model_validate_json(f.read()),
+                image=Image.open(image_path))
+
+
+class AiElementNotFound(Exception):
+    pass
+
+
+class AiElementCollection:
+
+    def __init__(self, additional_ai_element_locations: Optional[List[pathlib.Path]] = None):
+        workspace_id = os.getenv("ASKUI_WORKSPACE_ID")
+        if workspace_id is None:
+            raise ValueError("ASKUI_WORKSPACE_ID is not set")
+
+        if additional_ai_element_locations is None:
+            additional_ai_element_locations = []
+
+        additional_ai_element_from_env = []
+        if os.getenv("ASKUI_AI_ELEMENT_LOCATIONS", "") != "":
+            additional_ai_element_from_env = [pathlib.Path(ai_element_loc) for ai_element_loc in os.getenv("ASKUI_AI_ELEMENT_LOCATIONS", "").split(",")]
+
+        self.ai_element_locations = [
+            pathlib.Path.home() / ".askui" / "SnippingTool" / "AIElement" / workspace_id,
+            *additional_ai_element_from_env,
+            *additional_ai_element_locations
+        ]
+
+        logger.debug("AI Element locations: %s", self.ai_element_locations)
+
+    def find(self, name: str):
+        ai_elements = []
+
+        for location in self.ai_element_locations:
+            path = pathlib.Path(location)
+
+            json_files = list(path.glob("*.json"))
+
+            if not json_files:
+                logger.warning(f"No JSON files found in: {location}")
+                continue
+
+            for json_file in json_files:
+                ai_element = AiElement.from_json_file(json_file)
+
+                if ai_element.metadata.name == name:
+                    ai_elements.append(ai_element)
+
+        return ai_elements
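A minimal usage sketch of the new collection (the workspace ID and element name are placeholders; the module's import path is not shown in this diff, so it is omitted here):

```python
import os

os.environ.setdefault("ASKUI_WORKSPACE_ID", "<your-workspace-id>")  # required by the constructor

collection = AiElementCollection()
matches = collection.find("github_icon")  # every AiElement whose metadata name equals "github_icon"
for element in matches:
    print(element.image_path, element.metadata.creation_date_time)
```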
