A Chrome extension that turns a large language model into a hands-on browser agent. The agent observes the current page, plans a sequence of actions, executes them through Chrome DevTools Protocol, and repeats until the task is complete or the user takes over.
This repository (crab-ts) is the source code. The user facing distribution lives at crab-agent-extension.
- Overview
- Architecture
- Component Map
- Agent Loop
- Message Flow
- LLM Providers
- Tool Registry
- Memory System
- Permission Model
- Scheduler
- Workflows
- Tab Groups
- Quick Mode
- Configuration
- Tech Stack
- Build and Install
- Adding New Features
Crab-Agent is a Chrome MV3 extension. The user opens a side panel, types a task in natural language, and the agent executes it autonomously. The agent can navigate pages, click elements, fill forms, read page content, open and close tabs, upload files, run JavaScript, generate documents, record and replay workflows, schedule future tasks, and maintain persistent memory across sessions.
The execution model is a tool-use loop. The LLM receives the current conversation (including screenshots and page context), selects a tool, the extension executes it, and the result is appended to the conversation for the next LLM call. This continues until the agent calls done.
┌─────────────────────────────────────────────────────┐
│ Side Panel (React) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ ChatPanel│ │Workflows │ │ Schedule │ │Settings│ │
│ └──────────┘ └──────────┘ └──────────┘ └────────┘ │
│ Zustand stores: taskStore, uiStore, settingsStore, │
│ workflowStore, memoryStore │
│ useBgMessage hook via chrome.runtime.Port │
└────────────────────┬────────────────────────────────┘
│ Port: 'side-panel'
│ postMessage / onMessage
┌────────────────────▼────────────────────────────────┐
│ Background Service Worker (src/background/index.ts)│
│ Receives new_task / follow_up_task / cancel etc. │
│ Manages tab group sessions │
│ Triggers memory dream cycles │
│ Runs chrome.alarms for scheduled tasks │
│ Calls agent-loop.handleNewTask() │
└────────────────────┬────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────┐
│ Agent Loop (src/core/agent-loop.ts) │
│ ┌──────────────────────────────────────────────┐ │
│ │ 1. Build system prompt │ │
│ │ 2. Take initial screenshot │ │
│ │ 3. Run read_page for element refs │ │
│ │ 4. Loop: │ │
│ │ callLLM -> tool_use response │ │
│ │ executeTool(name, params) │ │
│ │ append tool_result │ │
│ │ repeat until done() │ │
│ └──────────────────────────────────────────────┘ │
│ MessageManager handles compaction and history │
└────────────────────┬────────────────────────────────┘
│
┌──────────┴──────────┐
│ │
┌─────────▼──────┐ ┌─────────▼──────────────┐
│ LLM Client │ │ Tool Executors │
│ llm-client.ts │ │ src/tools/*.ts │
│ │ │ │
│ Anthropic │ │ CDP: screenshots, │
│ OpenAI │ │ clicks, typing │
│ Google Gemini │ │ Browser: tabs, nav │
│ OpenRouter │ │ Page: read, find, JS │
│ Ollama │ │ Files: upload, download │
│ OpenAI-compat │ │ Docs, GIF, workflows │
│ Codex OAuth │ │ Canvas, code editor │
└────────────────┘ └─────────────────────────┘
The single service worker that orchestrates the extension.
| Responsibility | Detail |
|---|---|
| Port management | Maintains a persistent port connection to the side panel |
new_task |
Resolves the active tab, creates a tab group, triggers memory dream if due, builds memory context, calls handleNewTask |
follow_up_task |
Queues the message if a task is running, otherwise starts a fresh task with restored history |
cancel_task / pause_task / resume_task |
Delegates to agent-loop controls |
| Memory hooks | Intercepts TASK_OK events to save memory entries automatically |
| Tab groups | Collapses the group on ASK_USER and expands on resume |
| Workflows | Manages recording, playback, and analysis |
| Scheduler | Drives chrome.alarms for scheduled tasks |
| Keep-alive | Self-ping loop during execution to avoid service worker termination |
The core execution engine.
handleNewTask(
task, settings, images, sendToPanel,
llmHistory?, preferredTabId?, workflows?,
isFollowUp?, memoryContext?
)runExecutor(exec) is the main loop. It takes a screenshot, injects page context, then iterates: call LLM, parse tool use, execute tool, append result, check for done.
Special tool results handled inline:
| Tool result | Behavior |
|---|---|
done |
End task or enter monitor mode |
ask_user |
Pause for input |
suggest_rule |
Forward to user for accept or skip |
memory |
Run CRUD on persistent memory directly |
update_plan |
Plan approval flow |
The loop also detects stagnation and injects interrupt messages when the agent loops without progress, tracks active tab changes when tools open or switch tabs, and exports conversation to chrome.storage.local after each task for follow-up continuity.
A single entry point callLLM(messages, settings, useVision, toolSchemas, extraOptions) for every provider.
Highlights:
- Translates tool schemas to provider native formats (Anthropic
input_schema, OpenAI function calling, Googlefunction_declarations). - Auto-detects an Anthropic endpoint when an
openai-compatibleURL ends in/v1/messages. - Streams responses (SSE) for OpenAI, OpenRouter, openai-compatible, and Codex providers.
- On HTTP 400 with
content.strerror (vision rejected), retries automatically without images. - Strips images from all but the last image-bearing message to avoid proxy rejections from accumulated screenshots.
- Quick Mode disables tool schemas and parses compact text commands instead.
Conversation state for a single task.
| Method | Purpose |
|---|---|
compactIfNeeded() |
Strip images from old messages (soft limit 15 MB), then drop oldest non-system messages (hard limit 100 KB) |
exportForStorage() |
Remove base64 images, flatten text-only arrays, preserve tool_use and tool_result pairs |
importFromStorage(history, systemPrompt) |
Restore history and normalize stale content shape |
Each tool is a module with name, description, parameters (JSON Schema), and execute(params, context). Tools return a ToolResult with success, error, and optional flags like isDone, isAskUser, isSuggestRule, isMemoryOp.
The executeTool(name, params, context) function in src/tools/index.ts resolves internal tools first, then external tools, and runs execute.
| File | Responsibility |
|---|---|
permission-manager.ts |
Intercepts tool execution to enforce user-controlled safety policies |
memory-manager.ts |
Persistent cross-session storage for facts, preferences, and rules |
dream-engine.ts |
LLM-driven consolidation of memory entries |
task-scheduler.ts |
chrome.alarms-based scheduling engine |
tab-group-manager.ts |
Chrome Tab Group lifecycle for agent sessions |
quick-mode.ts |
Low-latency compact command execution path |
cdp-manager.ts |
Chrome DevTools Protocol client |
ax-tree-manager.ts |
Accessibility tree builder and ref ID generator |
state-manager.ts |
Stagnation detection and prompt warnings |
codex-auth.ts |
OAuth PKCE flow for ChatGPT/Codex |
codex-usage-tracker.ts |
Rate-limit tracking from Codex API responses |
A React application running in the Chrome side panel.
| Component | Role |
|---|---|
ChatPanel |
Message thread, input box, image attachment, live action indicator, suggest-rule accept/reject UI |
SettingsPanel |
Provider, model, API key, base URL, permission mode, system prompt override, tab grouping toggle |
WorkflowList, WorkflowSaveModal |
Manage recorded workflows |
ScheduledTaskList |
View and cancel scheduled tasks |
MemoryPanel |
View, delete, and trigger dream consolidation for memory entries |
useBgMessage |
Manages the persistent port connection, heartbeat, and dispatches background messages to stores |
Zustand stores: taskStore, uiStore, settingsStore, workflowStore, memoryStore, contextRulesStore.
handleNewTask called
│
├── Normalize user images (ImageItem[] to data URLs)
├── Build system prompt (memory + context rules + viewport + warnings)
├── Restore conversation history from storage (follow-up) or init fresh
│
├── Take initial screenshot via CDP
├── Run read_page(interactive) and inject element refs into first message
│
└── runExecutor loop (max steps configurable)
│
├── getMessages() -> send to callLLM
├── LLM returns tool_use (or text in non-native mode)
├── Append assistant tool_use block to conversation
│
├── executeTool(name, params, context)
│ ├── Internal: done -> exit loop (or enter monitor mode)
│ ├── Internal: ask_user -> pause, send ASK_USER event to panel
│ ├── Special: suggest_rule -> forward to user, await accept/skip
│ ├── Special: memory -> execute CRUD on memoryManager directly
│ └── External: computer, navigate, find, read_page, tabs_*, etc.
│
├── Append tool_result to conversation
├── Stagnation check -> inject interrupt message if looping
├── Tab tracking -> update exec.tabId if tool changed active tab
│
└── Repeat
On done:
- Conversation is exported to storage.
TASK_OKevent is sent to the panel with the final answer.- Background intercepts
TASK_OKto extract and save memory entries and increment the session counter. - The tab group is ungrouped.
Side Panel Background Agent Loop
─────────────────────────────────────────────────────────────────
sendToBackground({type:'new_task'})
───────────────────────────>
handleNewTask()
──────────────────────>
TASK_START
<──────────────────────
sendToPanel(TASK_START)
<──────────────────────────
THINKING /
ACTION / STEP
<──────────────────────
sendToPanel(execution_event)
<──────────────────────────
[loop continues]
TASK_OK /
TASK_FAIL
<──────────────────────
save memory entries
ungroup tabs
sendToPanel(TASK_OK)
<──────────────────────────
| State | Meaning |
|---|---|
| TASK_START | Task execution began |
| TASK_OK | Task completed successfully |
| TASK_FAIL | Task failed after max retries |
| TASK_CANCEL | Task cancelled by user |
| TASK_PAUSE | Task paused (ask_user) |
| STEP_START | New LLM call cycle started |
| STEP_FAIL | LLM call failed (will retry) |
| ACTION | Agent is executing a tool |
| THINKING | LLM is generating |
| PLANNING | Agent is in planning phase |
| ASK_USER | Agent needs user input |
| SUGGEST_RULE | Agent suggests a context rule |
| COMPACTION | Conversation was compacted |
| MONITOR_START | Monitoring mode entered |
| MONITORING | Monitor loop tick |
| MONITOR_WAKE | Monitor condition triggered |
| MONITOR_REPORT | Monitor result sent |
| MEMORY_OP | Memory tool called (add/update/delete/list) |
| DREAM_START | Dream consolidation started |
| DREAM_DONE | Dream consolidation finished |
| Provider | provider value |
Tool format | Streaming | Notes |
|---|---|---|---|---|
| Anthropic | anthropic |
Native tool_use |
Optional | Default; uses tool_choice: auto; thinking adds interleaved-thinking-2025-05-14 beta header |
| OpenAI | openai |
Function calling | Optional | Chat Completions endpoint |
| Google Gemini | google |
function_declarations |
No | Uses tool_config: function_calling_config.mode = AUTO |
| OpenRouter | openrouter |
Function calling | Optional | Multi-model router; forwards thinking param for anthropic/* models |
| Ollama | ollama |
None | No | Local server; /api/chat endpoint |
| OpenAI-compatible | openai-compatible |
Function calling or native tool_use |
Optional | Auto-detects Anthropic Messages endpoint when URL ends in /v1/messages |
| Codex OAuth | codex-oauth |
Function calling | Always | ChatGPT sign-in; PKCE flow; Responses API; tracks rate limits |
All providers go through the same callLLM interface. Tool schemas are translated per provider. Vision (image) support is available for every provider and is disabled per request if the endpoint rejects image content.
Tools are registered in src/tools/index.ts and split into external tools (exposed to the LLM) and internal tools (handled directly by the agent loop).
| Tool | Purpose |
|---|---|
computer |
Mouse, keyboard, and DOM interaction via CDP. Supports click, type, scroll, drag, screenshot, key press, hover. Ref-based targeting resolves live DOM coordinates |
navigate |
Navigate the active tab to a URL, or go back, forward, or Google search |
read_page |
Read the page DOM and return interactive elements with ref IDs and coordinates. Supports iframe traversal |
find |
Find an element by natural language description. Searches the accessibility tree |
form_input |
Fill form fields, select options, toggle checkboxes |
get_page_text |
Extract readable text content from the page |
tabs_context |
List all open tabs with URLs and titles |
tabs_create |
Open a new tab, optionally at a URL |
switch_tab |
Switch the active tab |
close_tab |
Close a tab |
read_console_messages |
Read browser console output |
read_network_requests |
Read recent network requests and responses |
resize_window |
Change the browser viewport dimensions |
update_plan |
Present a multi-step plan to the user for approval before proceeding |
file_upload |
Upload a file via a file input element |
upload_image |
Upload an image file |
gif_creator |
Record and export a GIF or replay of the task |
suggest_rule |
Propose saving a site-specific interaction rule for future sessions |
memory |
Manage persistent memory: list, add, update, delete |
shortcuts_list |
List available keyboard shortcuts on the current page |
shortcuts_execute |
Execute a keyboard shortcut by name |
javascript_tool |
Execute arbitrary JavaScript in the page context |
canvas_toolkit |
Canvas and image manipulation helpers (Figma, Miro, Excalidraw) |
code_editor |
Open an in-panel code editor |
document_generator |
Generate DOCX or HTML documents from task output |
set_of_mark |
Visual overlay for element labeling |
visualize |
Render SVG diagrams or HTML charts inline in the chat |
schedule_task |
Schedule a task for future execution via chrome.alarms |
download_file |
Trigger a file download |
run_workflow |
Execute a saved recorded workflow |
| Tool | Purpose |
|---|---|
done |
Signal task completion. Accepts finalAnswer (text), monitor (boolean for watch mode), and monitoring parameters |
ask_user |
Pause execution and prompt the user for input. The agent resumes when the user replies |
Persistent memory lets the agent remember user preferences, personal information, and project conventions across sessions.
- Key:
crab_memoryinchrome.storage.local - Maximum entries: 60 (pruned by least recently used)
interface MemoryEntry {
id: string
content: string
type: 'rule' | 'fact' | 'summary'
domain?: string // per-domain entry if set, cross-domain if undefined
createdAt: number
lastUsed: number
useCount: number
source: 'suggest_rule' | 'dream' | 'manual' | 'auto'
}The agent uses the memory tool to manage its own memory during a task.
| Command | Effect |
|---|---|
memory(command="list") |
Returns all entries with IDs |
memory(command="add", content, type) |
Saves new information (deduplicated by content similarity) |
memory(command="update", id, content) |
Corrects an existing entry |
memory(command="delete", id) |
Removes an entry |
The system prompt instructs the agent to use this selectively, only for information the user explicitly shares and that is genuinely useful in future sessions.
Before each task, memoryManager.formatForPrompt(currentDomain) produces a ## Memory section injected into the system prompt. Domain-specific entries appear first, followed by general entries, capped at roughly 1200 characters.
After a threshold of sessions (default 5) and elapsed time (default 24 hours), the background triggers a non-blocking dream cycle:
- All memory entries are sent to the LLM with instructions to deduplicate, merge near-identical entries, remove overly generic entries, and return a cleaned JSON array (max 50 entries).
- On success,
memoryManager.replaceEntries(cleaned)replaces the current entries. - The cycle is fire-and-forget. If the LLM call fails or times out (45 seconds), the original entries are kept unchanged.
The permission manager controls what actions the agent can perform on which domains, preventing unintended modifications to sensitive pages.
Set via AgentSettings.permissionMode:
| Mode | Behavior |
|---|---|
ask (default) |
Request user approval before acting on new domains or sensitive pages. Read-only actions are allowed without prompting |
auto |
Approve all actions automatically |
strict |
Require explicit approval for every action |
NAVIGATE, READ_PAGE_CONTENT, CLICK, TYPE, UPLOAD_IMAGE, PLAN_APPROVAL, DOMAIN_TRANSITION.
| Duration | Meaning |
|---|---|
once |
Granted for a single tool use (expires immediately after use) |
always |
Granted for the domain until the session ends or the user revokes it |
- Sensitive domains (login pages, financial services, government sites) always require explicit user approval and cannot be granted
always. - A blocklist of patterns prevents the agent from accessing known malicious domains.
verifyUrlDomain(tabId, expectedDomain)is called before executing actions to detect unexpected navigation mid-step.
When the agent calls update_plan, it presents a list of planned domains and actions to the user. If approved, all listed domains receive pre-authorization for the duration of the plan.
The task scheduler uses Chrome's chrome.alarms API to execute tasks at a future time or on a recurring schedule.
- Tasks are stored in
chrome.storage.localunder the keyscheduledTasksand alarms are re-registered on service worker startup. - Supports one-time tasks (absolute timestamp or relative delay in seconds) and recurring tasks (5-field cron).
- When an alarm fires, the scheduler calls
handleNewTaskdirectly (headless if the side panel is not open). - The
schedule_tasktool allows the agent to schedule follow-up tasks from within a task execution. - The side panel shows a
ScheduledTaskListwith options to cancel pending tasks.
Workflows are recorded sequences of browser interactions that can be replayed on demand.
- The user triggers recording from the side panel.
- lib/workflowRecorder.js is injected into the active tab and captures DOM events (clicks, inputs, navigation) as structured steps.
- When recording stops, the background receives the action list and the side panel shows a save modal.
- Workflows are saved to
chrome.storage.localundercrab_workflows. - Each workflow has a name, description, and an optional set of parameterized inputs (such as email address or search query).
- lib/workflowPlayer.js replays steps on the page.
- The
run_workflowtool lets the agent invoke a saved workflow as part of a larger task. - If the task matches a saved workflow, the agent is instructed to call
run_workflowimmediately.
The background can send a workflow to the LLM for semantic description extraction, which enriches the saved workflow with human-readable step summaries.
The tab group manager uses the Chrome Tab Groups API to visually group tabs that belong to an agent session.
| Behavior | Detail |
|---|---|
| Session start | The current tab is registered as the session main tab (not grouped yet) |
| First new tab | A Chrome Tab Group is created with the session name derived from the first words of the task |
| Subsequent tabs | Added to the same group |
| Task completion | Tabs are ungrouped |
| Group title | Reflects the task state: task hint while running, check on completion, X on failure |
| Collapse behavior | Group collapses on ask_user and expands on user reply |
| Toggle | Settings > Group tabs during tasks |
chrome.tabs.onUpdated marks tab context dirty when a session tab navigates, so the agent receives updated page context on the next step.
Quick Mode is an alternative execution path for lower latency control.
- Enabled when
settings.quickModeis true. - The LLM receives a compact system prompt and returns single-line commands instead of structured tool calls.
- Tool schemas are not sent to the LLM in Quick Mode.
| Command | Action |
|---|---|
CR ref / RCR ref / DCR ref / TCR ref |
Click, right-click, double-click, triple-click by ref ID |
C x y / RC x y / DC x y / TC x y |
Click actions by coordinate |
H x y |
Hover |
T text |
Type text |
K keys |
Press a key or chord (e.g. K Enter, K ctrl+a) |
S dir amt x y |
Scroll (direction, amount 1-10, position) |
D x1 y1 x2 y2 |
Drag from start to end |
N url / N back / N forward |
Navigate |
J code |
Execute JavaScript |
W |
Wait one second for the page to settle |
ST tabId / NT url / LT |
Switch tab, new tab, list tabs |
DONE text |
Complete task |
ASK question |
Ask user |
Responses end with <<END>> on their own line. The parser extracts an optional <thinking> block and the command lines, then executeQuickModeCommands runs them in sequence.
All settings are stored in chrome.storage.local via the settingsStore Zustand store.
| Setting | Type | Default | Description |
|---|---|---|---|
provider |
string | anthropic |
LLM provider |
model |
string | (provider default) | Model identifier |
apiKey |
string | "" |
API key for the provider |
baseUrl |
string | (none) | Custom endpoint (Ollama, OpenAI-compatible) |
customModel |
string | (none) | Model name when provider is openai-compatible or ollama |
maxTokens |
number | (none) | Override max tokens per LLM call |
temperature |
number | (none) | Override temperature |
systemPrompt |
string | (none) | Replace the default system prompt |
permissionMode |
ask / auto / strict |
ask |
Action approval policy |
enableWorkflowRecording |
boolean | true | Show workflow recording controls |
enableScheduledTasks |
boolean | true | Show scheduled task controls |
theme |
dark / light |
(system) | UI theme |
enableMemory |
boolean | true | Enable persistent memory and dream cycles |
enableTabGrouping |
boolean | true | Group agent tabs into a Chrome Tab Group |
codexPlan |
free / plus / pro |
(none) | Codex usage tier when provider is codex-oauth |
| Layer | Technology |
|---|---|
| UI framework | React 18 |
| Language | TypeScript 5.7 |
| Build | Vite 6 with @crxjs/vite-plugin |
| Styling | Tailwind CSS and CSS variables |
| State management | Zustand 5 |
| Markdown rendering | react-markdown, remark-gfm, rehype-highlight |
| Extension platform | Chrome MV3 |
| Chrome APIs | tabs, tabGroups, storage, debugger, scripting, sidePanel, alarms, webNavigation, downloads, notifications, contextMenus, offscreen, system.display, identity |
| LLM communication | Fetch (REST) with SSE streaming support |
- Node.js 18 or later
- Google Chrome 114 or later
# Install dependencies
npm install
# Development (Vite watch mode for UI iteration)
npm run dev
# Production build (tsc type check, then vite build)
npm run buildThe production build outputs to dist/.
- Run
npm run build. - Open
chrome://extensions. - Enable Developer mode.
- Click Load unpacked and select the
dist/directory. - Open any web page, then click the extension icon to open the side panel.
- Enter your API key in Settings, or pick ChatGPT (Sign in) to authenticate with OAuth.
The extension requests these permissions at install time:
tabs, tabGroups, activeTab, scripting, storage, debugger,
webNavigation, sidePanel, clipboardWrite, clipboardRead,
offscreen, downloads, notifications, system.display,
alarms, contextMenus, identity
Host permissions: <all_urls> (required for CDP-based page interaction and content script injection on any site).
- Create
src/tools/my-tool.tsexportingmyToolSchemaandexecuteMyTool. - Register it in
TOOL_REGISTRYandgetToolSchemas()in src/tools/index.ts. - Add a
casetoexecuteTool()in the same file. - Add a description to prompts/system-prompt.js when the provider does not support native tool-use.
- Add a
caseto theswitch (provider)block in src/core/llm-client.ts. - Implement
_buildXxxRequest(). - Implement or reuse
_extractXxxResponse()and_extractXxxToolCall(). - Add the provider to
_isStreamableProvider()if streaming is supported. - Add it to the
PROVIDERSlist in src/sidepanel/components/settings/SettingsPanel.tsx. - Add models to
MODELS_BY_PROVIDER.
- Add it to the
ToBackgroundorToSidepanelunion in src/sidepanel/lib/types.ts. - Handle it in
port.onMessage.addListenerin src/background/index.ts. - Send or receive it in
useBgMessageor the relevant store.
MIT. See LICENSE for details.