Atharva9281/BrowserAgent

Browser AI Agent

A production-grade AI agent that automates repetitive browser workflows on tools like Linear and Notion using natural language commands.

Instead of clicking through the same menus every day, you type a plain English instruction and the agent opens the browser, navigates to the right page, fills every field, generates relevant content using AI, and submits — all on its own.


Demo

"Create a project named Backend API with status Planned, priority Urgent, and write a relevant description"

The agent:

  1. Navigates to the Projects page
  2. Opens the New Project modal
  3. Types the project name
  4. Sets status and priority via dropdowns
  5. Writes a meaningful description generated by Gemini
  6. Submits the form

Tech Stack

Layer                 Technology
Language              Python 3.12
Orchestration         LangGraph (state machine)
Vision + Reasoning    Google Gemini 2.5 Flash / Pro
Browser Automation    Playwright (Chrome)
Vector Store          ChromaDB
Embeddings            Gemini Embedding API
LLM Wrapper           LangChain
Observability         LangSmith
Reasoning Pattern     ReAct (Reason + Act)

Architecture

User Prompt
    │
    ▼
┌─────────────────────────────────────────────────────┐
│                  LangGraph Agent                    │
│                                                     │
│  parse_task → decompose_goals → loop:               │
│    extract_page_state                               │
│        │  (screenshot + accessibility tree)         │
│        ▼                                            │
│    decide_action  ←── RAG workflow hints            │
│        │  (Gemini vision + ReAct reasoning)         │
│        ▼                                            │
│    match_element                                    │
│        │  (primary-keyword element detection)       │
│        ▼                                            │
│    execute_action                                   │
│        │  (Playwright click / type / navigate)      │
│        ▼                                            │
│    validate_action  ──► error_recovery              │
│        │  (6-check system, confidence gate)         │
│        ▼                                            │
│    [advance sub-goal or retry]                      │
│                                                     │
│  task_complete → RAG update → teardown              │
└─────────────────────────────────────────────────────┘
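
The control flow in the diagram can be sketched as a plain Python loop. This is a dependency-free stand-in, not the project's LangGraph graph: the function name `run_agent`, the `attempt` callback, the returned dictionary keys, and the retry limit are all assumptions made for illustration. Only the ordering of steps and the 0.6 confidence gate come from this README.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def run_agent(sub_goals: List[str],
              attempt: Callable[[str], float],
              threshold: float = 0.6,
              max_retries: int = 2) -> State:
    """Walk sub-goals in order, advancing only on a confident validation.

    `attempt` (hypothetical) stands in for one pass through
    extract_page_state -> decide_action -> match_element ->
    execute_action -> validate_action and returns a confidence in [0, 1].
    """
    completed: List[str] = []
    for goal in sub_goals:
        for _ in range(max_retries + 1):
            confidence = attempt(goal)
            if confidence > threshold:
                completed.append(goal)  # advance to the next sub-goal
                break
        else:
            # Retries exhausted: the real agent routes to error_recovery;
            # this sketch simply stops and reports the failing sub-goal.
            return {"status": "failed", "completed": completed, "failed_goal": goal}
    return {"status": "done", "completed": completed}
```

Passing the per-step pipeline in as a callback keeps the sketch testable without a browser or an LLM.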

Installation

git clone https://github.com/Atharva9281/BrowserAgent.git
cd BrowserAgent

python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
playwright install chromium

Configuration

cp .env.example .env
# Add your API key:
# GEMINI_API_KEY=your_key_here

Get a free Gemini API key at Google AI Studio.
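
Before running the agent, it can help to confirm that the key is actually set. The check below is a standard-library sketch; how the project itself loads `.env` is not shown in this README, and the helper names `read_env_file` and `gemini_key_configured` are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def gemini_key_configured(path: str = ".env") -> bool:
    """True if GEMINI_API_KEY is set to something other than the placeholder."""
    key = os.environ.get("GEMINI_API_KEY") or read_env_file(path).get("GEMINI_API_KEY", "")
    return bool(key) and key != "your_key_here"
```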

Authentication

python3 src/setup_auth.py

Run

python3 src/agent.py

Then type any task:

📋 Enter task: Create a project named Q3 Planning with status Planned in Linear
📋 Enter task: Filter projects by status In Progress in Linear
📋 Enter task: Create an issue titled Fix login bug, set priority to High in Linear

Key Design Decisions

Sub-goal Validation Gating

The agent breaks every task into sub-goals. A sub-goal advances only when the action's validation confidence exceeds 0.6, measured across 6 checks (including URL changed, modal opened, expected keywords present, no errors, and click had effect). The absence of a Playwright exception on a click is not treated as sufficient evidence of success.
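
A minimal sketch of such a gate follows. It uses only the five checks named above and weights them equally; the equal weighting, the `ValidationResult` class, and the `should_advance` helper are assumptions, since the README does not describe how the real confidence score is computed.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.6  # value stated in this README


@dataclass
class ValidationResult:
    """Outcome of the post-action checks (hypothetical structure)."""
    url_changed: bool
    modal_opened: bool
    keywords_present: bool
    no_errors: bool
    click_had_effect: bool

    def confidence(self) -> float:
        # Assumption: every check counts equally toward the score.
        checks = [
            self.url_changed,
            self.modal_opened,
            self.keywords_present,
            self.no_errors,
            self.click_had_effect,
        ]
        return sum(checks) / len(checks)


def should_advance(result: ValidationResult) -> bool:
    """Advance the sub-goal only when confidence strictly exceeds the threshold."""
    return result.confidence() > CONFIDENCE_THRESHOLD
```

Under this weighting, a click that raises no error but changes nothing else on the page scores well below the threshold, which matches the point that a clean Playwright call alone does not count as success.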

Primary-Keyword Element Detection

Gemini describes elements with trailing context: "Projects link in the left sidebar under the TrialAgent team section". A naive keyword extractor picks up "TrialAgent team" as a 2-word phrase and clicks the wrong button. The detector extracts the primary name — words before the first positional preposition — and tries that first.
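
The extraction step can be illustrated as follows. This is a sketch under assumptions: the preposition list and the function name `primary_name` are hypothetical, and the project's actual detector may use a different list or matching strategy.

```python
# Assumed set of positional prepositions; the project's real list is not shown.
POSITIONAL_PREPOSITIONS = {
    "in", "on", "at", "under", "above", "below",
    "inside", "near", "beside", "within",
}


def primary_name(description: str) -> str:
    """Return the words before the first positional preposition."""
    kept = []
    for word in description.split():
        if word.lower().strip(",.") in POSITIONAL_PREPOSITIONS:
            break
        kept.append(word)
    # Fall back to the full description if nothing precedes the preposition.
    return " ".join(kept) if kept else description


print(primary_name(
    "Projects link in the left sidebar under the TrialAgent team section"
))
```

For the description quoted above, the function keeps "Projects link" and discards the trailing location context, so "TrialAgent team" never becomes a candidate.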

RAG That Learns

Every successful run is embedded and stored in ChromaDB. On similar future tasks, the agent retrieves past workflows as hints. Bad runs (wrong page, failed validation) are detected and excluded before they can corrupt the knowledge base.
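
The store-then-retrieve cycle can be sketched without external services. The example below swaps in a bag-of-words vector and cosine similarity for the Gemini embeddings and ChromaDB used by the project, purely so that it runs anywhere; the `WorkflowMemory` class and its method names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts (stand-in for the Gemini Embedding API)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class WorkflowMemory:
    """In-memory stand-in for the ChromaDB collection of past runs."""

    def __init__(self) -> None:
        self._runs: List[Tuple[Counter, List[str]]] = []

    def record(self, task: str, steps: List[str],
               validated: bool, on_expected_page: bool) -> bool:
        # Bad runs are rejected before they reach the store.
        if not (validated and on_expected_page):
            return False
        self._runs.append((embed(task), steps))
        return True

    def hints(self, task: str, k: int = 1) -> List[List[str]]:
        """Return the steps of the k most similar stored runs."""
        query = embed(task)
        ranked = sorted(self._runs, key=lambda run: cosine(query, run[0]), reverse=True)
        return [steps for _, steps in ranked[:k]]
```

Filtering at write time rather than at query time means a failed run can never surface as a hint.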

task_complete Always Terminates

The agent signals completion with task_complete. This always halts execution immediately — it does not advance a sub-goal. This prevents the agent from reopening modals or restarting workflows after a task is already done.
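
One way to express this rule is a routing function that checks for task_complete before anything else. The function name `route` and the route labels are assumptions; the README states only the behavior.

```python
from typing import Literal

Route = Literal["end", "advance", "retry"]


def route(action: str, confidence: float, threshold: float = 0.6) -> Route:
    """Decide what happens after an action has been validated."""
    # task_complete is checked first so it can never advance a sub-goal.
    if action == "task_complete":
        return "end"
    return "advance" if confidence > threshold else "retry"
```

Because the completion check precedes the confidence check, a task_complete signal halts the run regardless of the validation score.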


Supported Apps

App       Status
Linear    ✅ Full support
Notion    🔧 In progress

Performance

Latest benchmark (Linear, 5 actions):

Phase                    Time     Share
LLM decision             ~36s     28%
Page state extraction    ~13s     10%
Action execution         ~2s      2%
Element finding          ~0.5s    <1%
Total                    ~128s
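
The shares follow from dividing each phase time by the total. The quick check below uses only the figures in the table. It also shows that the four listed phases do not add up to the total; the README does not say where the remaining time goes.

```python
# Approximate timings copied from the benchmark table (seconds).
phases = {
    "LLM decision": 36.0,
    "Page state extraction": 13.0,
    "Action execution": 2.0,
    "Element finding": 0.5,
}
total = 128.0

# Share of total run time per phase, in percent.
shares = {name: seconds / total * 100 for name, seconds in phases.items()}

accounted = sum(phases.values())
unaccounted = total - accounted

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"Listed phases: {accounted}s, not broken down: {unaccounted}s")
```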

License

MIT
