Skip to content

EuniAI/Argus

Repository files navigation

Argus

Python License LLM Docker

A multimodal code agent that works together with web use agent. Give an frontier agent a reliable tool to visually verify its code changes in a live browser environment.

  • Multimodal — captures browser screenshots and feeds them directly into the agent's context for visual verification
  • Dual-agent collaboration — a Code Agent fixes issues in a sandboxed Docker environment while a Web Agent visually verifies the result in a live browser
  • Extensible tool system — drop in any Tool subclass; built-in tools cover shell execution, browser control, and cross-agent delegation

Argus


Setup

Prerequisites

Install

pip install -e .
pip install -e ".[dev]"      # includes pytest, pytest-mock, ruff
playwright install chromium  # required for WebAgent

Run tests

pytest                  # run all tests
pytest -v               # verbose output
pytest -k test_name     # run a single test by name

Running the General-Purpose Agent

run_agent.py runs Argus on any task you describe. The agent executes inside a Docker container with a randomly assigned (or user-specified) host port forwarded, and optionally accepts images alongside the task description.

# Basic usage
python run_agent.py --task "Fix the off-by-one error in src/parser.js"

# Mount a custom Docker image and working directory
python run_agent.py --docker myproject:latest --workdir /app --task "Add dark mode support"

# Attach images (local files or URLs)
python run_agent.py --task "Reproduce the layout issue shown in the screenshot" --images bug.png

# Pin the forwarded port
python run_agent.py --task "..." --port 3000

Use configs/examples/general_purpose.yaml as a starting config:

cp configs/examples/general_purpose.yaml config.yaml  # then fill in api_key

JSON trajectory logs are saved under logs/ by default.


Running SWE-bench Multimodal

run_swebench_multimodal.py evaluates Argus on the SWE-bench Multimodal dataset.

# Run all test instances
python run_swebench_multimodal.py --config config.yaml

# Run specific instances
python run_swebench_multimodal.py --config config.yaml --instance-ids django__django-1234 flask__flask-5678

# Run on the dev split
python run_swebench_multimodal.py --config config.yaml --split dev

Use configs/examples/swebench_multimodal.yaml as a starting config:

cp configs/examples/swebench_multimodal.yaml config.yaml  # then fill in api_key

JSON trajectory logs are saved under logs/<instance_id>/ by default.


Long-term Memory

Argus optionally integrates with EverMemOS for memory across runs. When enabled, the agent retrieves relevant past interactions and injects them into the system prompt before each run, then stores the new interaction afterward.

Enable with enable_memory: true under agent: in config.yaml.

Architecture

argus/
├── agent.py          # Agent loop: maintains history, dispatches tool calls
├── web_agent.py      # WebAgent: browser-controlling verification agent
├── config.py         # Config dataclasses: AgentConfig, WebAgentConfig (each with a nested LLMConfig and MemoryConfig)
├── data/
│   ├── message.py    # Provider-agnostic types: SystemMessage, UserMessage, AssistantMessage, ToolMessage, ToolCall, ToolResult
│   └── content.py    # Content dataclass (text + base64 images)
├── llm/
│   ├── base.py       # LLMClient ABC with exponential-backoff retry
│   ├── openai.py     # OpenAI function-calling implementation
│   └── anthropic.py  # Anthropic tool_use implementation
├── tools/
│   ├── base.py           # Tool ABC
│   ├── shell.py          # Persistent bash session inside a Docker container
│   ├── browser.py        # BrowserTool: Playwright-based headless browser; actions: navigate, screenshot (grid overlay), click, dblclick, hover, drag_and_drop, type, press, scroll, reload, get_text, get_console_logs, get_element_bounds
│   ├── ask_web_agent.py  # AskWebAgentTool: lets the code agent delegate to WebAgent
│   └── checklist.py      # ChecklistTool: stateful in-run task planner
└── utils/
    └── evermind.py   # EverMind long-term memory client

Agent.run() calls LLMClient.chat() each step and dispatches any ToolCall objects to the matching Tool.execute(). The loop ends when the LLM replies without tool calls or max_steps is reached. ShellTool maintains a persistent bash session via pexpect — environment variables and working directory changes persist across calls. A host port is forwarded from the container and injected into the task prompt so the agent knows where to bind services.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages