Browser AI Agent

A production-grade AI agent that automates repetitive browser workflows on tools like Linear and Notion using natural language commands.

Instead of clicking through the same menus every day, you type a plain English instruction and the agent opens the browser, navigates to the right page, fills every field, generates relevant content using AI, and submits — all on its own.

Demo

"Create a project named Backend API with status Planned, priority Urgent, and write a relevant description"

The agent:

1. Navigates to the Projects page
2. Opens the New Project modal
3. Types the project name
4. Sets status and priority via dropdowns
5. Writes a meaningful description generated by Gemini
6. Submits the form

Tech Stack

Layer	Technology
Language	Python 3.12
Orchestration	LangGraph (state machine)
Vision + Reasoning	Google Gemini 2.5 Flash / Pro
Browser Automation	Playwright (Chrome)
Vector Store	ChromaDB
Embeddings	Gemini Embedding API
LLM Wrapper	LangChain
Observability	LangSmith
Reasoning Pattern	ReAct (Reason + Act)

Architecture

User Prompt
    │
    ▼
┌─────────────────────────────────────────────────────┐
│                  LangGraph Agent                    │
│                                                     │
│  parse_task → decompose_goals → loop:               │
│    extract_page_state                               │
│        │  (screenshot + accessibility tree)         │
│        ▼                                            │
│    decide_action  ←── RAG workflow hints            │
│        │  (Gemini vision + ReAct reasoning)         │
│        ▼                                            │
│    match_element                                    │
│        │  (primary-keyword element detection)       │
│        ▼                                            │
│    execute_action                                   │
│        │  (Playwright click / type / navigate)      │
│        ▼                                            │
│    validate_action  ──► error_recovery              │
│        │  (6-check system, confidence gate)         │
│        ▼                                            │
│    [advance sub-goal or retry]                      │
│                                                     │
│  task_complete → RAG update → teardown              │
└─────────────────────────────────────────────────────┘
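
The loop in the diagram can be pictured as a small control loop. The following is a minimal sketch, not the repository's LangGraph graph: `AgentState`, `run_loop`, the injected `decide` / `execute` / `validate` callables, and the `max_retries` limit are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentState:
    """Hypothetical loop state: the sub-goal list plus progress counters."""
    sub_goals: list[str]
    current: int = 0
    retries: int = 0
    log: list[str] = field(default_factory=list)


def run_loop(
    state: AgentState,
    decide: Callable[[str], str],
    execute: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_retries: int = 2,  # assumed limit; the README does not state one
) -> AgentState:
    """Repeat decide -> execute -> validate until the sub-goals run out."""
    while state.current < len(state.sub_goals):
        goal = state.sub_goals[state.current]
        action = decide(goal)        # stands in for extract_page_state + decide_action
        result = execute(action)     # stands in for match_element + execute_action
        if validate(goal, result):   # stands in for validate_action
            state.log.append(f"done: {goal}")
            state.current += 1
            state.retries = 0
        elif state.retries < max_retries:
            state.retries += 1
            state.log.append(f"retry: {goal}")
        else:
            state.log.append(f"failed: {goal}")
            break
    return state
```

Passing the three stages in as callables keeps the sketch runnable without a browser or an LLM; in the real agent each stage is a graph node.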

Installation

git clone https://github.com/Atharva9281/BrowserAgent.git
cd BrowserAgent

python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
playwright install chromium

Configuration

cp .env.example .env
# Add your API key:
# GEMINI_API_KEY=your_key_here

Get a free Gemini API key at Google AI Studio.

Authentication

python3 src/setup_auth.py

Run

python3 src/agent.py

Then type any task:

📋 Enter task: Create a project named Q3 Planning with status Planned in Linear
📋 Enter task: Filter projects by status In Progress in Linear
📋 Enter task: Create an issue titled Fix login bug, set priority to High in Linear

Key Design Decisions

Sub-goal Validation Gating

The agent breaks every task into sub-goals. A sub-goal advances only when the action's validation confidence exceeds 0.6, measured across 6 checks (including URL changed, modal opened, expected keywords present, no errors, and click had effect). The absence of a Playwright exception on a click is not, by itself, sufficient evidence of success.
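
As a rough illustration of the gate, the sketch below scores an action as the unweighted fraction of checks that passed and advances only above the 0.6 threshold. The scoring rule is an assumption (the README does not say how the checks are combined or weighted), and `CheckResult`, `confidence`, and `should_advance` are hypothetical names.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.6  # from the text: confidence must exceed 0.6


@dataclass
class CheckResult:
    name: str
    passed: bool


def confidence(checks: list[CheckResult]) -> float:
    """Assumed scoring: unweighted fraction of checks that passed."""
    if not checks:
        return 0.0
    return sum(c.passed for c in checks) / len(checks)


def should_advance(checks: list[CheckResult]) -> bool:
    """Strictly greater than the threshold, matching 'exceeds 0.6'."""
    return confidence(checks) > CONFIDENCE_THRESHOLD


# Example using the checks named in the text: the click raised no error,
# but no modal opened and the click had no visible effect.
checks = [
    CheckResult("url_changed", True),
    CheckResult("modal_opened", False),
    CheckResult("expected_keywords_present", True),
    CheckResult("no_errors", True),
    CheckResult("click_had_effect", False),
]
print(should_advance(checks))
```

With three of five checks passing, the score sits exactly at 0.6 and the sub-goal does not advance, which is the case the gate exists for: a click that raised no exception but changed nothing.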

Primary-Keyword Element Detection

Gemini describes elements with trailing context: "Projects link in the left sidebar under the TrialAgent team section". A naive keyword extractor picks up "TrialAgent team" as a 2-word phrase and clicks the wrong button. The detector extracts the primary name — words before the first positional preposition — and tries that first.
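
A minimal sketch of the primary-name step, assuming a small hand-picked set of positional prepositions (the detector's actual list is not given here). `POSITIONAL_PREPOSITIONS` and `primary_name` are hypothetical names.

```python
# Assumed preposition list; the real detector may use a different one.
POSITIONAL_PREPOSITIONS = {
    "in", "on", "at", "under", "above", "below",
    "inside", "near", "beside", "within",
}


def primary_name(description: str) -> str:
    """Return the words before the first positional preposition."""
    words = description.split()
    for i, word in enumerate(words):
        if word.lower() in POSITIONAL_PREPOSITIONS:
            # If the description starts with a preposition (e.g. "In Progress"),
            # there is nothing before it, so fall back to the full text.
            return " ".join(words[:i]) if i > 0 else description
    return description


print(primary_name(
    "Projects link in the left sidebar under the TrialAgent team section"
))
```

For the description quoted above this yields "Projects link", so "TrialAgent team" never becomes a candidate on the first attempt.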

RAG That Learns

Every successful run is embedded and stored in ChromaDB. On similar future tasks, the agent retrieves past workflows as hints. Bad runs (wrong page, failed validation) are detected and excluded before they can corrupt the knowledge base.
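
The store-and-retrieve cycle can be sketched without external services. In the sketch below a bag-of-words vector stands in for the Gemini Embedding API and an in-memory list stands in for the ChromaDB collection; both are deliberate substitutions to keep the example self-contained. `WorkflowStore`, `add_run`, and `hints` are hypothetical names, and the two rejection flags are assumptions based on the "wrong page, failed validation" criteria above.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class WorkflowStore:
    """In-memory stand-in for the vector store."""
    records: list = field(default_factory=list)

    def add_run(self, task: str, steps: list[str],
                validated: bool, on_expected_page: bool) -> bool:
        # Reject bad runs before they reach the store.
        if not (validated and on_expected_page):
            return False
        self.records.append((embed(task), task, steps))
        return True

    def hints(self, task: str, k: int = 1) -> list[tuple[str, list[str]]]:
        """Return the k stored workflows most similar to the new task."""
        query = embed(task)
        ranked = sorted(self.records,
                        key=lambda r: cosine(query, r[0]), reverse=True)
        return [(t, s) for _, t, s in ranked[:k]]
```

Filtering at write time, rather than at query time, means a failed run can never be returned as a hint later.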

task_complete Always Terminates

The agent signals completion with task_complete. This always halts execution immediately — it does not advance a sub-goal. This prevents the agent from reopening modals or restarting workflows after a task is already done.
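
One way to express this rule is to check for task_complete before any sub-goal bookkeeping. The sketch below is an illustration only; `LoopState` and `apply_action` are hypothetical names, not the agent's actual state or node functions.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    sub_goal_index: int
    done: bool = False


def apply_action(state: LoopState, action: str, validated: bool) -> LoopState:
    """Update loop state after one action."""
    if action == "task_complete":
        # Terminal signal: stop now and leave the sub-goal index untouched,
        # even if the action also happened to validate.
        return LoopState(state.sub_goal_index, done=True)
    if validated:
        return LoopState(state.sub_goal_index + 1)
    return state  # not validated: stay on the same sub-goal


print(apply_action(LoopState(2), "task_complete", validated=True))
```

Because the terminal check comes first, a completion signal can never be counted as progress toward the next sub-goal.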

Supported Apps

App	Status
Linear	✅ Full support
Notion	🔧 In progress

Performance

Latest benchmark (Linear, 5 actions):

Phase	Time	Share
LLM decision	~36s	28%
Page state extraction	~13s	10%
Action execution	~2s	2%
Element finding	~0.5s	<1%
Total	~128s	—
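
The Share column is each phase's time divided by the total. The quick check below reproduces it from the table's approximate figures and also shows how much of the total the four listed phases cover; the remaining time is not broken down in the table.

```python
# Approximate timings copied from the table above (seconds).
phases = {
    "LLM decision": 36.0,
    "Page state extraction": 13.0,
    "Action execution": 2.0,
    "Element finding": 0.5,
}
total = 128.0

# Share of total per phase, as a whole-number percentage.
shares = {name: round(100 * t / total) for name, t in phases.items()}

accounted = sum(phases.values())
accounted_share = round(100 * accounted / total)

print(shares)
print(f"listed phases: {accounted}s of {total}s ({accounted_share}%)")
```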

License

MIT
