A multimodal code agent that works together with web use agent. Give an frontier agent a reliable tool to visually verify its code changes in a live browser environment.
- Multimodal — captures browser screenshots and feeds them directly into the agent's context for visual verification
- Dual-agent collaboration — a Code Agent fixes issues in a sandboxed Docker environment while a Web Agent visually verifies the result in a live browser
- Extensible tool system — drop in any
Toolsubclass; built-in tools cover shell execution, browser control, and cross-agent delegation
- Python 3.10+
- Docker
pip install -e .
pip install -e ".[dev]" # includes pytest, pytest-mock, ruff
playwright install chromium # required for WebAgentpytest # run all tests
pytest -v # verbose output
pytest -k test_name # run a single test by namerun_agent.py runs Argus on any task you describe. The agent executes inside a Docker container with a randomly assigned (or user-specified) host port forwarded, and optionally accepts images alongside the task description.
# Basic usage
python run_agent.py --task "Fix the off-by-one error in src/parser.js"
# Mount a custom Docker image and working directory
python run_agent.py --docker myproject:latest --workdir /app --task "Add dark mode support"
# Attach images (local files or URLs)
python run_agent.py --task "Reproduce the layout issue shown in the screenshot" --images bug.png
# Pin the forwarded port
python run_agent.py --task "..." --port 3000Use configs/examples/general_purpose.yaml as a starting config:
cp configs/examples/general_purpose.yaml config.yaml # then fill in api_keyJSON trajectory logs are saved under logs/ by default.
run_swebench_multimodal.py evaluates Argus on the SWE-bench Multimodal dataset.
# Run all test instances
python run_swebench_multimodal.py --config config.yaml
# Run specific instances
python run_swebench_multimodal.py --config config.yaml --instance-ids django__django-1234 flask__flask-5678
# Run on the dev split
python run_swebench_multimodal.py --config config.yaml --split devUse configs/examples/swebench_multimodal.yaml as a starting config:
cp configs/examples/swebench_multimodal.yaml config.yaml # then fill in api_keyJSON trajectory logs are saved under logs/<instance_id>/ by default.
Argus optionally integrates with EverMemOS for memory across runs. When enabled, the agent retrieves relevant past interactions and injects them into the system prompt before each run, then stores the new interaction afterward.
Enable with enable_memory: true under agent: in config.yaml.
argus/
├── agent.py # Agent loop: maintains history, dispatches tool calls
├── web_agent.py # WebAgent: browser-controlling verification agent
├── config.py # Config dataclasses: AgentConfig, WebAgentConfig (each with a nested LLMConfig and MemoryConfig)
├── data/
│ ├── message.py # Provider-agnostic types: SystemMessage, UserMessage, AssistantMessage, ToolMessage, ToolCall, ToolResult
│ └── content.py # Content dataclass (text + base64 images)
├── llm/
│ ├── base.py # LLMClient ABC with exponential-backoff retry
│ ├── openai.py # OpenAI function-calling implementation
│ └── anthropic.py # Anthropic tool_use implementation
├── tools/
│ ├── base.py # Tool ABC
│ ├── shell.py # Persistent bash session inside a Docker container
│ ├── browser.py # BrowserTool: Playwright-based headless browser; actions: navigate, screenshot (grid overlay), click, dblclick, hover, drag_and_drop, type, press, scroll, reload, get_text, get_console_logs, get_element_bounds
│ ├── ask_web_agent.py # AskWebAgentTool: lets the code agent delegate to WebAgent
│ └── checklist.py # ChecklistTool: stateful in-run task planner
└── utils/
└── evermind.py # EverMind long-term memory client
Agent.run() calls LLMClient.chat() each step and dispatches any ToolCall objects to the matching Tool.execute(). The loop ends when the LLM replies without tool calls or max_steps is reached. ShellTool maintains a persistent bash session via pexpect — environment variables and working directory changes persist across calls. A host port is forwarded from the container and injected into the task prompt so the agent knows where to bind services.
