⚡ VoiceNav: Universal Web Automation Agent

UI Navigator Category

"Your AI hands on the internet. Speak any task. Watch it happen."


πŸ† Alignment

Challenge Requirement VoiceNav Implementation Status
Gemini Multimodal Screenshots sent to Gemini Vision for UI understanding βœ…
Gemini Live API Real-time voice input and spoken responses βœ…
Google ADK Agent orchestration and planning loop βœ…
Google GenAI SDK All Gemini API calls via official SDK βœ…
Google Cloud Hosting Backend on Google Cloud Run βœ…
No DOM/API dependency Pure visual screenshot analysis only βœ…
Executable actions CLICK, TYPE, SCROLL, PRESS commands βœ…
Breaks text-box paradigm Voice in β†’ screen actions out βœ…
Live & context-aware Continuous screenshot-verify loop βœ…

🔴 The Problem

Every day, millions of people waste hours on repetitive, manual web tasks:

  • Searching multiple sites to compare prices
  • Filling the same forms over and over
  • Copying data between websites and apps
  • Navigating complex software they don't fully understand

Existing solutions are broken:

  • Browser extensions only work on specific sites
  • Selenium/Playwright requires coding knowledge
  • RPA tools cost thousands and break when websites update
  • Regular AI can talk about tasks but cannot do them
  • Most of these tools rely on DOM access, so they break whenever a site is redesigned

Result: Non-technical users have zero access to automation. Businesses pay huge amounts for tools that constantly break.


✅ The Solution

VoiceNav is your AI hands on the internet.

You speak. The agent sees your screen, plans the steps, and executes them, entirely through visual understanding of screenshots. No DOM access. No website-specific APIs. Works on any website, any app, any platform.

User speaks → Gemini hears → Agent plans → Gemini sees screen → Agent acts → Verifies result

πŸ— System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        USER'S DEVICE                            β”‚
β”‚                                                                 β”‚
β”‚   🎀 Microphone                        πŸ–₯️  Screen               β”‚
β”‚       β”‚                                    β”‚                    β”‚
β”‚       β–Ό                                    β–Ό                    β”‚
β”‚   Web Speech API                    Screenshot Capture          β”‚
β”‚   (voice β†’ text)                    (PyAutoGUI / Playwright)    β”‚
β”‚       β”‚                                    β”‚                    β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
β”‚                      β”‚                                          β”‚
β”‚              WebSocket / REST API                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   GOOGLE CLOUD RUN (Backend)                    β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                   ADK AGENT LOOP                        β”‚   β”‚
β”‚  β”‚                                                         β”‚   β”‚
β”‚  β”‚  LISTEN β†’ PLAN β†’ OBSERVE β†’ DECIDE β†’ ACT β†’ VERIFY       β”‚   β”‚
β”‚  β”‚     β”‚        β”‚       β”‚        β”‚       β”‚       β”‚         β”‚   β”‚
β”‚  β”‚     β–Ό        β–Ό       β–Ό        β–Ό       β–Ό       β–Ό         β”‚   β”‚
β”‚  β”‚  Voice    Break   Take    Choose   Run    Check          β”‚   β”‚
β”‚  β”‚  Input    into   Screen   Next   Action  Result         β”‚   β”‚
β”‚  β”‚         Steps   Shot    Action                          β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚            β”‚                    β”‚                               β”‚
β”‚            β–Ό                    β–Ό                               β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚
β”‚   β”‚ Gemini Live API β”‚  β”‚  Gemini Vision API  β”‚                 β”‚
β”‚   β”‚ (voice I/O)     β”‚  β”‚  (screenshot β†’ UI   β”‚                 β”‚
β”‚   β”‚                 β”‚  β”‚   understanding)    β”‚                 β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     β”‚
β”‚   β”‚  Cloud Storage  β”‚  β”‚ Cloud Logging   β”‚                     β”‚
β”‚   β”‚  (screenshots)  β”‚  β”‚ (action audit)  β”‚                     β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β”‚
                       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      FRONTEND (React)                           β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Voice Button β”‚  β”‚ Screen View  β”‚  β”‚    Action Log        β”‚  β”‚
β”‚  β”‚ (mic input + β”‚  β”‚ (live agent  β”‚  β”‚ (step-by-step trace) β”‚  β”‚
β”‚  β”‚  TTS output) β”‚  β”‚  view)       β”‚  β”‚                      β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ” Agent Execution Loop

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 1     β”‚  User speaks command via microphone
β”‚  LISTEN     β”‚  Gemini Live API transcribes in real-time
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 2     β”‚  Break task into ordered sub-steps
β”‚  PLAN       β”‚  Identify info needed from user upfront
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 3     β”‚  Take screenshot of current screen
β”‚  OBSERVE    β”‚  Gemini Vision identifies all UI elements
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 4     β”‚  Pick the best next single action
β”‚  DECIDE     β”‚  Handle blockers: CAPTCHA, login walls, errors
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 5     β”‚  Execute: CLICK(x,y) / TYPE(text) /
β”‚  ACT        β”‚  SCROLL(dir) / PRESS(key) / WAIT(sec)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 6     β”‚  Take new screenshot
β”‚  VERIFY     β”‚  Did the action work? If yes β†’ next step
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  If no β†’ retry or re-plan
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 7     β”‚  Before irreversible actions:
β”‚  CONFIRM    β”‚  Speak summary β†’ wait for "yes"
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  STEP 8     β”‚  Speak task summary to user
β”‚  DONE       β”‚  Show results with clickable links
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
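
The observe, decide, act, and verify steps of the loop above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the project's code: `observe`, `decide`, `act`, and `verify` are hypothetical callables standing in for screenshot capture, the Gemini Vision call, the action executor, and the result check, and the limit of three consecutive failures is taken from the Safety Rules section.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: mirrors the "3 failed actions in a row" rule in Safety Rules.
MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class LoopResult:
    completed: bool
    log: List[str] = field(default_factory=list)


def run_agent_loop(
    steps: List[str],
    observe: Callable[[], bytes],
    decide: Callable[[str, bytes], str],
    act: Callable[[str], None],
    verify: Callable[[str, bytes], bool],
) -> LoopResult:
    """Run OBSERVE -> DECIDE -> ACT -> VERIFY for each planned sub-step."""
    result = LoopResult(completed=False)
    for step in steps:
        failures = 0
        while True:
            screenshot = observe()             # OBSERVE: capture the screen
            action = decide(step, screenshot)  # DECIDE: pick one next action
            act(action)                        # ACT: execute it
            result.log.append(action)
            if verify(step, observe()):        # VERIFY: check a fresh screenshot
                break
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                # Stop and hand control back to the user.
                result.log.append(f"stopped: {step}")
                return result
    result.completed = True
    return result
```

Passing the four stages in as callables keeps the control flow testable without a screen, a microphone, or network access; the real stages would be swapped in at the call site.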

📦 Tech Stack

Frontend

| Technology | Purpose |
|---|---|
| React | UI framework |
| Web Speech API | Voice input (microphone) |
| SpeechSynthesis API | Voice output (TTS) |
| WebSocket | Real-time communication with backend |

Backend (Google Cloud)

| Technology | Purpose |
|---|---|
| Google Cloud Run | Hosts the agent backend (serverless, auto-scaling) |
| Gemini Live API | Real-time voice input and spoken response output |
| Gemini Vision (gemini-2.0-flash) | Screenshot analysis and UI element identification |
| Google ADK | Agent orchestration, planning loop, tool management |
| Google GenAI SDK | All Gemini API calls |
| Google Cloud Storage | Temporary screenshot storage during task execution |
| Google Cloud Logging | Action audit trail and debugging |
| Secret Manager | Secure API key storage |
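
To make the screenshot-analysis row concrete, here is a rough sketch of sending a PNG screenshot to `gemini-2.0-flash` through the Google GenAI SDK. It is not the project's code: the prompt wording and the function names `build_observe_prompt` and `describe_screenshot` are hypothetical, and the sketch assumes the `google-genai` package is installed and `GEMINI_API_KEY` is set.

```python
import os

# Hypothetical prompt; the wording used by VoiceNav is not documented here.
OBSERVE_PROMPT = (
    "List the visible UI elements in this screenshot "
    "with their approximate pixel coordinates."
)


def build_observe_prompt(task_step: str) -> str:
    """Combine the current sub-step with the generic observation prompt."""
    return f"Current step: {task_step}\n{OBSERVE_PROMPT}"


def describe_screenshot(
    png_bytes: bytes, task_step: str, model: str = "gemini-2.0-flash"
) -> str:
    """Send one screenshot plus a text prompt and return the model's text reply."""
    # Imported lazily so the prompt helper works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model=model,
        contents=[
            types.Part.from_bytes(data=png_bytes, mime_type="image/png"),
            build_observe_prompt(task_step),
        ],
    )
    return response.text or ""
```

How the reply is turned into coordinates is left open; a structured (for example JSON) response format would be one option.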

Action Execution

| Technology | Purpose |
|---|---|
| PyAutoGUI | Mouse clicks and keyboard input |
| Playwright | Browser-level automation fallback |
| PIL / Pillow | Screenshot capture and processing |
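
The action commands named in the execution loop (CLICK, TYPE, SCROLL, PRESS, WAIT) need to be parsed before they can be executed. The sketch below shows one possible parser. The textual syntax, the `Action` type, and the restriction of SCROLL to `up`/`down` are assumptions made for illustration, not the project's actual format.

```python
import re
from typing import NamedTuple, Tuple


class Action(NamedTuple):
    name: str
    args: Tuple


# Assumed syntax: NAME(arguments), e.g. CLICK(120, 340) or TYPE("hello").
_ACTION_RE = re.compile(r"^\s*(CLICK|TYPE|SCROLL|PRESS|WAIT)\((.*)\)\s*$", re.DOTALL)


def parse_action(command: str) -> Action:
    """Parse one action string into a name and typed arguments."""
    match = _ACTION_RE.match(command)
    if match is None:
        raise ValueError(f"Unrecognized action: {command!r}")
    name, raw = match.group(1), match.group(2).strip()
    if name == "CLICK":
        # Exactly two integer pixel coordinates are expected.
        x, y = (int(part) for part in raw.split(","))
        return Action(name, (x, y))
    if name == "WAIT":
        return Action(name, (float(raw),))
    if name == "SCROLL":
        direction = raw.lower()
        if direction not in {"up", "down"}:
            raise ValueError(f"Unsupported scroll direction: {raw!r}")
        return Action(name, (direction,))
    # TYPE and PRESS keep their argument as text, minus surrounding quotes.
    return Action(name, (raw.strip("\"'"),))
```

A dispatcher could then map each parsed action onto PyAutoGUI, for example `pyautogui.click(x, y)` for CLICK, `pyautogui.write(text)` for TYPE, and `pyautogui.press(key)` for PRESS.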

🎯 Judging Criteria Breakdown

Innovation & Multimodal UX (40%)

VoiceNav completely eliminates the text box. The entire interaction is:

  • Input: User's real voice (microphone)
  • Processing: Gemini sees the screen like a human would
  • Output: Real physical actions on screen + spoken narration

The agent has a distinct persona, narrates every step aloud, handles interruptions gracefully, and is context-aware: it re-plans dynamically when something unexpected appears on screen.

Technical Implementation (30%)

  • Uses Google ADK for the full agent orchestration loop
  • Uses Gemini Live API for real-time bidirectional voice
  • Uses Gemini multimodal for pure visual screen understanding
  • Backend deployed on Google Cloud Run with auto-scaling
  • Screenshots stored in Cloud Storage, actions logged in Cloud Logging
  • Error handling: retries, re-planning, CAPTCHA detection, login wall handling
  • Safe mode: confirms before any irreversible action

Demo & Presentation (30%)

  • 4-minute demo video showing real task execution end-to-end
  • Architecture diagram included in this README
  • Cloud deployment proof via Cloud Run console recording
  • Clear problem → solution narrative

πŸ—£οΈ Example Voice Commands

"Go to Amazon and find the best wireless headphones under $100"

"Open Gmail and find the latest invoice from my supplier"

"Search flights from Mumbai to London in April under $800"

"Go to LinkedIn and find remote Python developer jobs posted this week"

"Find hotels in Goa under β‚Ή3000 per night on Booking.com"

"Go to Flipkart and compare iPhone 15 vs Samsung S24 prices"

"Open YouTube and search for 'learn React in 2025'"

πŸ›‘οΈ Safety Rules

VoiceNav is built with safety as a core principle:

  • NEVER completes a purchase without explicit voice confirmation
  • NEVER submits any form without showing the user what will be sent
  • NEVER stores, logs, or repeats passwords or sensitive data
  • NEVER clicks ads or suspicious popups unless explicitly asked
  • STOPS and asks for help after 3 failed actions in a row
  • SAFE MODE toggle, on by default
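
The confirmation rule could be enforced with a small guard around each step, sketched below. The keyword list, `needs_confirmation`, and `guarded_execute` are hypothetical names introduced for illustration; how VoiceNav actually classifies a step as irreversible is not described in this README.

```python
import re
from typing import Callable

# Hypothetical keyword list for flagging irreversible steps.
IRREVERSIBLE_KEYWORDS = ("buy", "purchase", "pay", "submit", "delete", "send")


def needs_confirmation(step_description: str) -> bool:
    """Return True when the step mentions an irreversible-looking verb."""
    words = re.findall(r"[a-z]+", step_description.lower())
    return any(keyword in words for keyword in IRREVERSIBLE_KEYWORDS)


def guarded_execute(
    step_description: str,
    execute: Callable[[], None],
    ask_user: Callable[[str], str],
    safe_mode: bool = True,
) -> bool:
    """Run `execute` only if safe mode allows it; return whether it ran."""
    if safe_mode and needs_confirmation(step_description):
        answer = ask_user(f"About to: {step_description}. Say 'yes' to continue.")
        if answer.strip().lower() != "yes":
            return False  # user declined or said something else
    execute()
    return True
```

In a voice-driven setup, `ask_user` would speak the summary and return the transcribed reply, so anything other than a clear "yes" leaves the step unexecuted.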

🚀 Setup & Spin-up Instructions

Prerequisites

  • Google Cloud account
  • Gemini API key from Google AI Studio
  • Node.js 18+
  • Python 3.10+

1. Clone the repository

git clone https://github.com/SimranShaikh20/voicenav
cd voicenav

2. Install dependencies

# Frontend
cd frontend
npm install

# Backend
cd ../backend
pip install -r requirements.txt

# Return to the repository root; the remaining steps assume it
cd ..

3. Set environment variables

export GEMINI_API_KEY=your_api_key_here
export GOOGLE_CLOUD_PROJECT=your_project_id
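
On the Python side, the backend could read and validate these two variables at startup so that a missing key fails fast. The sketch below is illustrative only; `Settings` and `load_settings` are hypothetical names, not part of the repository.

```python
import os
from typing import Mapping, NamedTuple


class Settings(NamedTuple):
    gemini_api_key: str
    project_id: str


REQUIRED_VARS = ("GEMINI_API_KEY", "GOOGLE_CLOUD_PROJECT")


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the required variables, raising a clear error if any is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(env["GEMINI_API_KEY"], env["GOOGLE_CLOUD_PROJECT"])
```

Accepting the environment as a parameter makes the check easy to exercise with a plain dictionary in tests.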

4. Deploy backend to Google Cloud Run

gcloud run deploy voicenav-backend \
  --source ./backend \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars GEMINI_API_KEY=$GEMINI_API_KEY

5. Run frontend locally

cd frontend
npm start

6. Open in Chrome

http://localhost:3000

⚠️ Voice features work best on Chrome or Edge browsers.


πŸ— Google Cloud Architecture Proof

The following Google Cloud services are used and verifiable in the GCP console:

  • Cloud Run β€” voicenav-backend service in us-central1
  • Cloud Storage β€” bucket voicenav-screenshots for temp screenshot storage
  • Cloud Logging β€” log name voicenav-actions for full action audit trail
  • Secret Manager β€” secret gemini-api-key for secure key storage
  • Artifact Registry β€” Docker image voicenav-backend:latest

📊 Bonus Points Completed

  • Published blog post about building with Google AI (#GeminiLiveAgentChallenge)
  • Cloud deployment automated with cloudbuild.yaml (infrastructure-as-code)
  • Google Developer Group profile: [link]

📄 License

MIT License. See the LICENSE file for details.
