Skip to content

Hert4/crab-ts

Repository files navigation

Crab-Agent

Version Manifest TypeScript React Vite Tailwind Zustand License

A Chrome extension that turns a large language model into a hands-on browser agent. The agent observes the current page, plans a sequence of actions, executes them through Chrome DevTools Protocol, and repeats until the task is complete or the user takes over.

This repository (crab-ts) is the source code. The user facing distribution lives at crab-agent-extension.

Table of Contents

Overview

Crab-Agent is a Chrome MV3 extension. The user opens a side panel, types a task in natural language, and the agent executes it autonomously. The agent can navigate pages, click elements, fill forms, read page content, open and close tabs, upload files, run JavaScript, generate documents, record and replay workflows, schedule future tasks, and maintain persistent memory across sessions.

The execution model is a tool-use loop. The LLM receives the current conversation (including screenshots and page context), selects a tool, the extension executes it, and the result is appended to the conversation for the next LLM call. This continues until the agent calls done.

Architecture

┌─────────────────────────────────────────────────────┐
│  Side Panel (React)                                 │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐  │
│  │ ChatPanel│ │Workflows │ │ Schedule │ │Settings│  │
│  └──────────┘ └──────────┘ └──────────┘ └────────┘  │
│  Zustand stores: taskStore, uiStore, settingsStore, │
│                  workflowStore, memoryStore         │
│  useBgMessage hook via chrome.runtime.Port          │
└────────────────────┬────────────────────────────────┘
                     │  Port: 'side-panel'
                     │  postMessage / onMessage
┌────────────────────▼────────────────────────────────┐
│  Background Service Worker (src/background/index.ts)│
│  Receives new_task / follow_up_task / cancel etc.   │
│  Manages tab group sessions                         │
│  Triggers memory dream cycles                       │
│  Runs chrome.alarms for scheduled tasks             │
│  Calls agent-loop.handleNewTask()                   │
└────────────────────┬────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────┐
│  Agent Loop (src/core/agent-loop.ts)                │
│  ┌──────────────────────────────────────────────┐   │
│  │  1. Build system prompt                      │   │
│  │  2. Take initial screenshot                  │   │
│  │  3. Run read_page for element refs           │   │
│  │  4. Loop:                                    │   │
│  │     callLLM -> tool_use response             │   │
│  │     executeTool(name, params)                │   │
│  │     append tool_result                       │   │
│  │     repeat until done()                      │   │
│  └──────────────────────────────────────────────┘   │
│  MessageManager handles compaction and history      │
└────────────────────┬────────────────────────────────┘
                     │
          ┌──────────┴──────────┐
          │                     │
┌─────────▼──────┐   ┌─────────▼──────────────┐
│ LLM Client     │   │ Tool Executors          │
│ llm-client.ts  │   │ src/tools/*.ts          │
│                │   │                         │
│ Anthropic      │   │ CDP: screenshots,       │
│ OpenAI         │   │      clicks, typing     │
│ Google Gemini  │   │ Browser: tabs, nav      │
│ OpenRouter     │   │ Page: read, find, JS    │
│ Ollama         │   │ Files: upload, download │
│ OpenAI-compat  │   │ Docs, GIF, workflows    │
│ Codex OAuth    │   │ Canvas, code editor     │
└────────────────┘   └─────────────────────────┘

Component Map

Background (src/background/index.ts)

The single service worker that orchestrates the extension.

Responsibility Detail
Port management Maintains a persistent port connection to the side panel
new_task Resolves the active tab, creates a tab group, triggers memory dream if due, builds memory context, calls handleNewTask
follow_up_task Queues the message if a task is running, otherwise starts a fresh task with restored history
cancel_task / pause_task / resume_task Delegates to agent-loop controls
Memory hooks Intercepts TASK_OK events to save memory entries automatically
Tab groups Collapses the group on ASK_USER and expands on resume
Workflows Manages recording, playback, and analysis
Scheduler Drives chrome.alarms for scheduled tasks
Keep-alive Self-ping loop during execution to avoid service worker termination

Agent Loop (src/core/agent-loop.ts)

The core execution engine.

handleNewTask(
  task, settings, images, sendToPanel,
  llmHistory?, preferredTabId?, workflows?,
  isFollowUp?, memoryContext?
)

runExecutor(exec) is the main loop. It takes a screenshot, injects page context, then iterates: call LLM, parse tool use, execute tool, append result, check for done.

Special tool results handled inline:

Tool result Behavior
done End task or enter monitor mode
ask_user Pause for input
suggest_rule Forward to user for accept or skip
memory Run CRUD on persistent memory directly
update_plan Plan approval flow

The loop also detects stagnation and injects interrupt messages when the agent loops without progress, tracks active tab changes when tools open or switch tabs, and exports conversation to chrome.storage.local after each task for follow-up continuity.

LLM Client (src/core/llm-client.ts)

A single entry point callLLM(messages, settings, useVision, toolSchemas, extraOptions) for every provider.

Highlights:

  • Translates tool schemas to provider native formats (Anthropic input_schema, OpenAI function calling, Google function_declarations).
  • Auto-detects an Anthropic endpoint when an openai-compatible URL ends in /v1/messages.
  • Streams responses (SSE) for OpenAI, OpenRouter, openai-compatible, and Codex providers.
  • On HTTP 400 with content.str error (vision rejected), retries automatically without images.
  • Strips images from all but the last image-bearing message to avoid proxy rejections from accumulated screenshots.
  • Quick Mode disables tool schemas and parses compact text commands instead.

Message Manager (src/core/message-manager.ts)

Conversation state for a single task.

Method Purpose
compactIfNeeded() Strip images from old messages (soft limit 15 MB), then drop oldest non-system messages (hard limit 100 KB)
exportForStorage() Remove base64 images, flatten text-only arrays, preserve tool_use and tool_result pairs
importFromStorage(history, systemPrompt) Restore history and normalize stale content shape

Tools (src/tools/)

Each tool is a module with name, description, parameters (JSON Schema), and execute(params, context). Tools return a ToolResult with success, error, and optional flags like isDone, isAskUser, isSuggestRule, isMemoryOp.

The executeTool(name, params, context) function in src/tools/index.ts resolves internal tools first, then external tools, and runs execute.

Other core modules

File Responsibility
permission-manager.ts Intercepts tool execution to enforce user-controlled safety policies
memory-manager.ts Persistent cross-session storage for facts, preferences, and rules
dream-engine.ts LLM-driven consolidation of memory entries
task-scheduler.ts chrome.alarms-based scheduling engine
tab-group-manager.ts Chrome Tab Group lifecycle for agent sessions
quick-mode.ts Low-latency compact command execution path
cdp-manager.ts Chrome DevTools Protocol client
ax-tree-manager.ts Accessibility tree builder and ref ID generator
state-manager.ts Stagnation detection and prompt warnings
codex-auth.ts OAuth PKCE flow for ChatGPT/Codex
codex-usage-tracker.ts Rate-limit tracking from Codex API responses

Side Panel (src/sidepanel/)

A React application running in the Chrome side panel.

Component Role
ChatPanel Message thread, input box, image attachment, live action indicator, suggest-rule accept/reject UI
SettingsPanel Provider, model, API key, base URL, permission mode, system prompt override, tab grouping toggle
WorkflowList, WorkflowSaveModal Manage recorded workflows
ScheduledTaskList View and cancel scheduled tasks
MemoryPanel View, delete, and trigger dream consolidation for memory entries
useBgMessage Manages the persistent port connection, heartbeat, and dispatches background messages to stores

Zustand stores: taskStore, uiStore, settingsStore, workflowStore, memoryStore, contextRulesStore.

Agent Loop

handleNewTask called
│
├── Normalize user images (ImageItem[] to data URLs)
├── Build system prompt (memory + context rules + viewport + warnings)
├── Restore conversation history from storage (follow-up) or init fresh
│
├── Take initial screenshot via CDP
├── Run read_page(interactive) and inject element refs into first message
│
└── runExecutor loop (max steps configurable)
    │
    ├── getMessages() -> send to callLLM
    ├── LLM returns tool_use (or text in non-native mode)
    ├── Append assistant tool_use block to conversation
    │
    ├── executeTool(name, params, context)
    │   ├── Internal: done -> exit loop (or enter monitor mode)
    │   ├── Internal: ask_user -> pause, send ASK_USER event to panel
    │   ├── Special: suggest_rule -> forward to user, await accept/skip
    │   ├── Special: memory -> execute CRUD on memoryManager directly
    │   └── External: computer, navigate, find, read_page, tabs_*, etc.
    │
    ├── Append tool_result to conversation
    ├── Stagnation check -> inject interrupt message if looping
    ├── Tab tracking -> update exec.tabId if tool changed active tab
    │
    └── Repeat

On done:

  1. Conversation is exported to storage.
  2. TASK_OK event is sent to the panel with the final answer.
  3. Background intercepts TASK_OK to extract and save memory entries and increment the session counter.
  4. The tab group is ungrouped.

Message Flow

Side Panel                  Background              Agent Loop
─────────────────────────────────────────────────────────────────
sendToBackground({type:'new_task'})
───────────────────────────>
                            handleNewTask()
                            ──────────────────────>
                                                    TASK_START
                            <──────────────────────
sendToPanel(TASK_START)
<──────────────────────────
                                                    THINKING /
                                                    ACTION / STEP
                            <──────────────────────
sendToPanel(execution_event)
<──────────────────────────
                                [loop continues]

                                                    TASK_OK /
                                                    TASK_FAIL
                            <──────────────────────
                            save memory entries
                            ungroup tabs
sendToPanel(TASK_OK)
<──────────────────────────

execution_event states

State Meaning
TASK_START Task execution began
TASK_OK Task completed successfully
TASK_FAIL Task failed after max retries
TASK_CANCEL Task cancelled by user
TASK_PAUSE Task paused (ask_user)
STEP_START New LLM call cycle started
STEP_FAIL LLM call failed (will retry)
ACTION Agent is executing a tool
THINKING LLM is generating
PLANNING Agent is in planning phase
ASK_USER Agent needs user input
SUGGEST_RULE Agent suggests a context rule
COMPACTION Conversation was compacted
MONITOR_START Monitoring mode entered
MONITORING Monitor loop tick
MONITOR_WAKE Monitor condition triggered
MONITOR_REPORT Monitor result sent
MEMORY_OP Memory tool called (add/update/delete/list)
DREAM_START Dream consolidation started
DREAM_DONE Dream consolidation finished

LLM Providers

Provider provider value Tool format Streaming Notes
Anthropic anthropic Native tool_use Optional Default; uses tool_choice: auto; thinking adds interleaved-thinking-2025-05-14 beta header
OpenAI openai Function calling Optional Chat Completions endpoint
Google Gemini google function_declarations No Uses tool_config: function_calling_config.mode = AUTO
OpenRouter openrouter Function calling Optional Multi-model router; forwards thinking param for anthropic/* models
Ollama ollama None No Local server; /api/chat endpoint
OpenAI-compatible openai-compatible Function calling or native tool_use Optional Auto-detects Anthropic Messages endpoint when URL ends in /v1/messages
Codex OAuth codex-oauth Function calling Always ChatGPT sign-in; PKCE flow; Responses API; tracks rate limits

All providers go through the same callLLM interface. Tool schemas are translated per provider. Vision (image) support is available for every provider and is disabled per request if the endpoint rejects image content.

Tool Registry

Tools are registered in src/tools/index.ts and split into external tools (exposed to the LLM) and internal tools (handled directly by the agent loop).

External Tools

Tool Purpose
computer Mouse, keyboard, and DOM interaction via CDP. Supports click, type, scroll, drag, screenshot, key press, hover. Ref-based targeting resolves live DOM coordinates
navigate Navigate the active tab to a URL, or go back, forward, or Google search
read_page Read the page DOM and return interactive elements with ref IDs and coordinates. Supports iframe traversal
find Find an element by natural language description. Searches the accessibility tree
form_input Fill form fields, select options, toggle checkboxes
get_page_text Extract readable text content from the page
tabs_context List all open tabs with URLs and titles
tabs_create Open a new tab, optionally at a URL
switch_tab Switch the active tab
close_tab Close a tab
read_console_messages Read browser console output
read_network_requests Read recent network requests and responses
resize_window Change the browser viewport dimensions
update_plan Present a multi-step plan to the user for approval before proceeding
file_upload Upload a file via a file input element
upload_image Upload an image file
gif_creator Record and export a GIF or replay of the task
suggest_rule Propose saving a site-specific interaction rule for future sessions
memory Manage persistent memory: list, add, update, delete
shortcuts_list List available keyboard shortcuts on the current page
shortcuts_execute Execute a keyboard shortcut by name
javascript_tool Execute arbitrary JavaScript in the page context
canvas_toolkit Canvas and image manipulation helpers (Figma, Miro, Excalidraw)
code_editor Open an in-panel code editor
document_generator Generate DOCX or HTML documents from task output
set_of_mark Visual overlay for element labeling
visualize Render SVG diagrams or HTML charts inline in the chat
schedule_task Schedule a task for future execution via chrome.alarms
download_file Trigger a file download
run_workflow Execute a saved recorded workflow

Internal Tools

Tool Purpose
done Signal task completion. Accepts finalAnswer (text), monitor (boolean for watch mode), and monitoring parameters
ask_user Pause execution and prompt the user for input. The agent resumes when the user replies

Memory System

Persistent memory lets the agent remember user preferences, personal information, and project conventions across sessions.

Storage

  • Key: crab_memory in chrome.storage.local
  • Maximum entries: 60 (pruned by least recently used)

Entry Structure

interface MemoryEntry {
  id: string
  content: string
  type: 'rule' | 'fact' | 'summary'
  domain?: string         // per-domain entry if set, cross-domain if undefined
  createdAt: number
  lastUsed: number
  useCount: number
  source: 'suggest_rule' | 'dream' | 'manual' | 'auto'
}

memory Tool

The agent uses the memory tool to manage its own memory during a task.

Command Effect
memory(command="list") Returns all entries with IDs
memory(command="add", content, type) Saves new information (deduplicated by content similarity)
memory(command="update", id, content) Corrects an existing entry
memory(command="delete", id) Removes an entry

The system prompt instructs the agent to use this selectively, only for information the user explicitly shares and that is genuinely useful in future sessions.

System Prompt Injection

Before each task, memoryManager.formatForPrompt(currentDomain) produces a ## Memory section injected into the system prompt. Domain-specific entries appear first, followed by general entries, capped at roughly 1200 characters.

Dream Consolidation

After a threshold of sessions (default 5) and elapsed time (default 24 hours), the background triggers a non-blocking dream cycle:

  1. All memory entries are sent to the LLM with instructions to deduplicate, merge near-identical entries, remove overly generic entries, and return a cleaned JSON array (max 50 entries).
  2. On success, memoryManager.replaceEntries(cleaned) replaces the current entries.
  3. The cycle is fire-and-forget. If the LLM call fails or times out (45 seconds), the original entries are kept unchanged.

Permission Model

The permission manager controls what actions the agent can perform on which domains, preventing unintended modifications to sensitive pages.

Modes

Set via AgentSettings.permissionMode:

Mode Behavior
ask (default) Request user approval before acting on new domains or sensitive pages. Read-only actions are allowed without prompting
auto Approve all actions automatically
strict Require explicit approval for every action

Permission Types

NAVIGATE, READ_PAGE_CONTENT, CLICK, TYPE, UPLOAD_IMAGE, PLAN_APPROVAL, DOMAIN_TRANSITION.

Grant Durations

Duration Meaning
once Granted for a single tool use (expires immediately after use)
always Granted for the domain until the session ends or the user revokes it

Domain Safety

  • Sensitive domains (login pages, financial services, government sites) always require explicit user approval and cannot be granted always.
  • A blocklist of patterns prevents the agent from accessing known malicious domains.
  • verifyUrlDomain(tabId, expectedDomain) is called before executing actions to detect unexpected navigation mid-step.

Plan Approval

When the agent calls update_plan, it presents a list of planned domains and actions to the user. If approved, all listed domains receive pre-authorization for the duration of the plan.

Scheduler

The task scheduler uses Chrome's chrome.alarms API to execute tasks at a future time or on a recurring schedule.

  • Tasks are stored in chrome.storage.local under the key scheduledTasks and alarms are re-registered on service worker startup.
  • Supports one-time tasks (absolute timestamp or relative delay in seconds) and recurring tasks (5-field cron).
  • When an alarm fires, the scheduler calls handleNewTask directly (headless if the side panel is not open).
  • The schedule_task tool allows the agent to schedule follow-up tasks from within a task execution.
  • The side panel shows a ScheduledTaskList with options to cancel pending tasks.

Workflows

Workflows are recorded sequences of browser interactions that can be replayed on demand.

Recording

  • The user triggers recording from the side panel.
  • lib/workflowRecorder.js is injected into the active tab and captures DOM events (clicks, inputs, navigation) as structured steps.
  • When recording stops, the background receives the action list and the side panel shows a save modal.

Saving

  • Workflows are saved to chrome.storage.local under crab_workflows.
  • Each workflow has a name, description, and an optional set of parameterized inputs (such as email address or search query).

Playback

  • lib/workflowPlayer.js replays steps on the page.
  • The run_workflow tool lets the agent invoke a saved workflow as part of a larger task.
  • If the task matches a saved workflow, the agent is instructed to call run_workflow immediately.

Analysis

The background can send a workflow to the LLM for semantic description extraction, which enriches the saved workflow with human-readable step summaries.

Tab Groups

The tab group manager uses the Chrome Tab Groups API to visually group tabs that belong to an agent session.

Behavior Detail
Session start The current tab is registered as the session main tab (not grouped yet)
First new tab A Chrome Tab Group is created with the session name derived from the first words of the task
Subsequent tabs Added to the same group
Task completion Tabs are ungrouped
Group title Reflects the task state: task hint while running, check on completion, X on failure
Collapse behavior Group collapses on ask_user and expands on user reply
Toggle Settings > Group tabs during tasks

chrome.tabs.onUpdated marks tab context dirty when a session tab navigates, so the agent receives updated page context on the next step.

Quick Mode

Quick Mode is an alternative execution path for lower latency control.

  • Enabled when settings.quickMode is true.
  • The LLM receives a compact system prompt and returns single-line commands instead of structured tool calls.
  • Tool schemas are not sent to the LLM in Quick Mode.

Commands

Command Action
CR ref / RCR ref / DCR ref / TCR ref Click, right-click, double-click, triple-click by ref ID
C x y / RC x y / DC x y / TC x y Click actions by coordinate
H x y Hover
T text Type text
K keys Press a key or chord (e.g. K Enter, K ctrl+a)
S dir amt x y Scroll (direction, amount 1-10, position)
D x1 y1 x2 y2 Drag from start to end
N url / N back / N forward Navigate
J code Execute JavaScript
W Wait one second for the page to settle
ST tabId / NT url / LT Switch tab, new tab, list tabs
DONE text Complete task
ASK question Ask user

Responses end with <<END>> on their own line. The parser extracts an optional <thinking> block and the command lines, then executeQuickModeCommands runs them in sequence.

Configuration

All settings are stored in chrome.storage.local via the settingsStore Zustand store.

Setting Type Default Description
provider string anthropic LLM provider
model string (provider default) Model identifier
apiKey string "" API key for the provider
baseUrl string (none) Custom endpoint (Ollama, OpenAI-compatible)
customModel string (none) Model name when provider is openai-compatible or ollama
maxTokens number (none) Override max tokens per LLM call
temperature number (none) Override temperature
systemPrompt string (none) Replace the default system prompt
permissionMode ask / auto / strict ask Action approval policy
enableWorkflowRecording boolean true Show workflow recording controls
enableScheduledTasks boolean true Show scheduled task controls
theme dark / light (system) UI theme
enableMemory boolean true Enable persistent memory and dream cycles
enableTabGrouping boolean true Group agent tabs into a Chrome Tab Group
codexPlan free / plus / pro (none) Codex usage tier when provider is codex-oauth

Tech Stack

Layer Technology
UI framework React 18
Language TypeScript 5.7
Build Vite 6 with @crxjs/vite-plugin
Styling Tailwind CSS and CSS variables
State management Zustand 5
Markdown rendering react-markdown, remark-gfm, rehype-highlight
Extension platform Chrome MV3
Chrome APIs tabs, tabGroups, storage, debugger, scripting, sidePanel, alarms, webNavigation, downloads, notifications, contextMenus, offscreen, system.display, identity
LLM communication Fetch (REST) with SSE streaming support

Build and Install

Prerequisites

  • Node.js 18 or later
  • Google Chrome 114 or later

Commands

# Install dependencies
npm install

# Development (Vite watch mode for UI iteration)
npm run dev

# Production build (tsc type check, then vite build)
npm run build

The production build outputs to dist/.

Loading into Chrome

  1. Run npm run build.
  2. Open chrome://extensions.
  3. Enable Developer mode.
  4. Click Load unpacked and select the dist/ directory.
  5. Open any web page, then click the extension icon to open the side panel.
  6. Enter your API key in Settings, or pick ChatGPT (Sign in) to authenticate with OAuth.

Chrome Permissions

The extension requests these permissions at install time:

tabs, tabGroups, activeTab, scripting, storage, debugger,
webNavigation, sidePanel, clipboardWrite, clipboardRead,
offscreen, downloads, notifications, system.display,
alarms, contextMenus, identity

Host permissions: <all_urls> (required for CDP-based page interaction and content script injection on any site).

Adding New Features

Adding a new tool

  1. Create src/tools/my-tool.ts exporting myToolSchema and executeMyTool.
  2. Register it in TOOL_REGISTRY and getToolSchemas() in src/tools/index.ts.
  3. Add a case to executeTool() in the same file.
  4. Add a description to prompts/system-prompt.js when the provider does not support native tool-use.

Adding a new LLM provider

  1. Add a case to the switch (provider) block in src/core/llm-client.ts.
  2. Implement _buildXxxRequest().
  3. Implement or reuse _extractXxxResponse() and _extractXxxToolCall().
  4. Add the provider to _isStreamableProvider() if streaming is supported.
  5. Add it to the PROVIDERS list in src/sidepanel/components/settings/SettingsPanel.tsx.
  6. Add models to MODELS_BY_PROVIDER.

Adding a new message type (panel ↔ background)

  1. Add it to the ToBackground or ToSidepanel union in src/sidepanel/lib/types.ts.
  2. Handle it in port.onMessage.addListener in src/background/index.ts.
  3. Send or receive it in useBgMessage or the relevant store.

License

MIT. See LICENSE for details.

About

Crab-Agent is an LLM-powered Chrome extension that automates browser tasks using natural language commands. Inspired by Claude for Chrome

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors