UI Navigator Category
"Your AI hands on the internet. Speak any task. Watch it happen."
| Challenge Requirement | VoiceNav Implementation | Status |
|---|---|---|
| Gemini Multimodal | Screenshots sent to Gemini Vision for UI understanding | ✅ |
| Gemini Live API | Real-time voice input and spoken responses | ✅ |
| Google ADK | Agent orchestration and planning loop | ✅ |
| Google GenAI SDK | All Gemini API calls via official SDK | ✅ |
| Google Cloud Hosting | Backend on Google Cloud Run | ✅ |
| No DOM/API dependency | Pure visual screenshot analysis only | ✅ |
| Executable actions | CLICK, TYPE, SCROLL, PRESS commands | ✅ |
| Breaks text-box paradigm | Voice in → screen actions out | ✅ |
| Live & context-aware | Continuous screenshot-verify loop | ✅ |
Every day, millions of people waste hours on repetitive, manual web tasks:
- Searching multiple sites to compare prices
- Filling the same forms over and over
- Copying data between websites and apps
- Navigating complex software they don't fully understand
Existing solutions are broken:
- Browser extensions only work on specific sites
- Selenium and Playwright require coding knowledge
- RPA tools cost thousands and break when websites update
- Chat-based AI assistants can describe tasks but cannot perform them
- Most of these tools depend on DOM access, so they break whenever a site is redesigned
Result: non-technical users are effectively locked out of automation, and businesses pay heavily for tools that break whenever a site changes.
VoiceNav is your AI hands on the internet.
You speak. The agent sees your screen, plans the steps, and executes, entirely through visual understanding of screenshots. No DOM access. No website-specific APIs. Works on any website, any app, any platform.
User speaks → Gemini hears → Agent plans → Gemini sees screen → Agent acts → Verifies result

    ┌─────────────────────────────────────────────────────────────────┐
    │                          USER'S DEVICE                          │
    │                                                                 │
    │          🎤 Microphone                   🖥️ Screen              │
    │                │                             │                  │
    │                ▼                             ▼                  │
    │         Web Speech API              Screenshot Capture          │
    │         (voice → text)           (PyAutoGUI / Playwright)       │
    │                │                             │                  │
    │                └──────────────┬──────────────┘                  │
    │                               │                                 │
    │                     WebSocket / REST API                        │
    └───────────────────────────────┼─────────────────────────────────┘
                                    │
                                    ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │                   GOOGLE CLOUD RUN (Backend)                    │
    │                                                                 │
    │  ┌───────────────────────────────────────────────────────────┐  │
    │  │                      ADK AGENT LOOP                       │  │
    │  │                                                           │  │
    │  │ LISTEN  →  PLAN   → OBSERVE → DECIDE  →   ACT   → VERIFY  │  │
    │  │    │         │         │         │         │         │    │  │
    │  │    ▼         ▼         ▼         ▼         ▼         ▼    │  │
    │  │  Voice     Break     Take     Choose      Run      Check  │  │
    │  │  Input     into     Screen     Next     Action    Result  │  │
    │  │            Steps     Shot     Action                      │  │
    │  └───────────────────────────────────────────────────────────┘  │
    │              │                                 │                │
    │              ▼                                 ▼                │
    │    ┌───────────────────┐           ┌───────────────────────┐    │
    │    │  Gemini Live API  │           │   Gemini Vision API   │    │
    │    │    (voice I/O)    │           │   (screenshot → UI    │    │
    │    │                   │           │    understanding)     │    │
    │    └───────────────────┘           └───────────────────────┘    │
    │                                                                 │
    │    ┌───────────────────┐           ┌───────────────────┐        │
    │    │   Cloud Storage   │           │   Cloud Logging   │        │
    │    │   (screenshots)   │           │  (action audit)   │        │
    │    └───────────────────┘           └───────────────────┘        │
    └───────────────────────────────┬─────────────────────────────────┘
                                    │
                                    ▼
    ┌─────────────────────────────────────────────────────────────────┐
    │                        FRONTEND (React)                         │
    │                                                                 │
    │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
    │  │ Voice Button │  │ Screen View  │  │      Action Log      │   │
    │  │ (mic input + │  │ (live agent  │  │ (step-by-step trace) │   │
    │  │  TTS output) │  │  view)       │  │                      │   │
    │  └──────────────┘  └──────────────┘  └──────────────────────┘   │
    └─────────────────────────────────────────────────────────────────┘


    ┌─────────────┐
    │   STEP 1    │  User speaks command via microphone
    │   LISTEN    │  Gemini Live API transcribes in real-time
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │   STEP 2    │  Break task into ordered sub-steps
    │    PLAN     │  Identify info needed from user upfront
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │   STEP 3    │  Take screenshot of current screen
    │   OBSERVE   │  Gemini Vision identifies all UI elements
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │   STEP 4    │  Pick the best next single action
    │   DECIDE    │  Handle blockers: CAPTCHA, login walls, errors
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │   STEP 5    │  Execute: CLICK(x,y) / TYPE(text) /
    │     ACT     │  SCROLL(dir) / PRESS(key) / WAIT(sec)
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │   STEP 6    │  Take new screenshot
    │   VERIFY    │  Did the action work? If yes → next step
    └──────┬──────┘  If no → retry or re-plan
           │
           ▼
    ┌─────────────┐
    │   STEP 7    │  Before irreversible actions:
    │   CONFIRM   │  Speak summary → wait for "yes"
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │   STEP 8    │  Speak task summary to user
    │    DONE     │  Show results with clickable links
    └─────────────┘

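
The OBSERVE, DECIDE, ACT, VERIFY portion of the flow above can be sketched as a small retry loop. This is an illustrative outline under stated assumptions, not the project's ADK implementation: the functions `run_step` and `run_plan`, their callable parameters, and the default of three attempts are hypothetical.

```python
from typing import Callable, List


def run_step(
    goal: str,
    observe: Callable[[], bytes],
    decide: Callable[[str, bytes], str],
    act: Callable[[str], None],
    verify: Callable[[str, bytes, bytes], bool],
    max_attempts: int = 3,
) -> bool:
    """Run one plan step: observe, decide, act, then verify on a new screenshot.

    Returns True once verify() accepts the result, or False after
    max_attempts failures so the caller can re-plan or ask the user.
    """
    for _ in range(max_attempts):
        before = observe()             # OBSERVE: screenshot of the current screen
        action = decide(goal, before)  # DECIDE: single best next action
        act(action)                    # ACT: e.g. "CLICK(120,340)"
        after = observe()              # VERIFY: take a new screenshot
        if verify(goal, before, after):
            return True
    return False


def run_plan(steps: List[str], step_runner: Callable[[str], bool]) -> List[str]:
    """Run steps in order and return those completed before the first failure."""
    done = []
    for step in steps:
        if not step_runner(step):
            break
        done.append(step)
    return done
```

Passing the four stages in as callables keeps the loop independent of any particular screenshot or model library, so it can be exercised with fakes before wiring in real ones.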
| Technology | Purpose |
|---|---|
| React | UI framework |
| Web Speech API | Voice input (microphone) |
| SpeechSynthesis API | Voice output (TTS) |
| WebSocket | Real-time communication with backend |
| Technology | Purpose |
|---|---|
| Google Cloud Run | Hosts the agent backend (serverless, auto-scaling) |
| Gemini Live API | Real-time voice input and spoken response output |
| Gemini Vision (gemini-2.0-flash) | Screenshot analysis and UI element identification |
| Google ADK | Agent orchestration, planning loop, tool management |
| Google GenAI SDK | All Gemini API calls |
| Google Cloud Storage | Temporary screenshot storage during task execution |
| Google Cloud Logging | Action audit trail and debugging |
| Secret Manager | Secure API key storage |
| Technology | Purpose |
|---|---|
| PyAutoGUI | Mouse clicks and keyboard input |
| Playwright | Browser-level automation fallback |
| PIL / Pillow | Screenshot capture and processing |
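
Since the agent emits commands such as `CLICK(x,y)` or `TYPE(text)` and PyAutoGUI performs the mouse and keyboard input, one way to connect the two is a small parser plus a dispatcher. The exact command grammar, the names `Action`, `parse_action`, and `execute`, and the `scroll_clicks` default are assumptions; the README does not show the real executor.

```python
import re
from typing import NamedTuple


class Action(NamedTuple):
    name: str
    args: tuple


_PATTERN = re.compile(r"^(CLICK|TYPE|SCROLL|PRESS|WAIT)\((.*)\)$", re.DOTALL)


def parse_action(command: str) -> Action:
    """Parse a command such as 'CLICK(120, 340)' into a typed Action."""
    match = _PATTERN.match(command.strip())
    if match is None:
        raise ValueError(f"Unrecognised action: {command!r}")
    name, raw = match.groups()
    if name == "CLICK":
        x, y = (int(part) for part in raw.split(","))
        return Action(name, (x, y))
    if name == "WAIT":
        return Action(name, (float(raw),))
    if name == "SCROLL":
        direction = raw.strip().lower()
        if direction not in ("up", "down"):
            raise ValueError(f"Unknown scroll direction: {raw!r}")
        return Action(name, (direction,))
    # TYPE and PRESS carry their argument through unchanged.
    return Action(name, (raw,))


def execute(action: Action, scroll_clicks: int = 5) -> None:
    """Perform an Action with PyAutoGUI (requires a desktop session)."""
    import time
    import pyautogui  # imported lazily: it needs a display to load

    if action.name == "CLICK":
        pyautogui.click(*action.args)
    elif action.name == "TYPE":
        pyautogui.write(action.args[0])
    elif action.name == "PRESS":
        pyautogui.press(action.args[0])
    elif action.name == "SCROLL":
        # PyAutoGUI scrolls up for positive values and down for negative ones.
        sign = 1 if action.args[0] == "up" else -1
        pyautogui.scroll(sign * scroll_clicks)
    elif action.name == "WAIT":
        time.sleep(action.args[0])
```

Validating the command before touching the mouse means a malformed model reply raises `ValueError` instead of producing a stray click.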
VoiceNav completely eliminates the text box. The entire interaction is:
- Input: User's real voice (microphone)
- Processing: Gemini sees the screen like a human would
- Output: Real physical actions on screen + spoken narration
The agent has a distinct persona, narrates every step aloud, handles interruptions gracefully, and is context-aware: it re-plans dynamically when something unexpected appears on screen.
- Uses Google ADK for the full agent orchestration loop
- Uses Gemini Live API for real-time bidirectional voice
- Uses Gemini multimodal for pure visual screen understanding
- Backend deployed on Google Cloud Run with auto-scaling
- Screenshots stored in Cloud Storage, actions logged in Cloud Logging
- Error handling: retries, re-planning, CAPTCHA detection, login wall handling
- Safe mode: confirms before any irreversible action
- 4-minute demo video showing real task execution end-to-end
- Architecture diagram included in this README
- Cloud deployment proof via Cloud Run console recording
- Clear problem → solution narrative
"Go to Amazon and find the best wireless headphones under $100"
"Open Gmail and find the latest invoice from my supplier"
"Search flights from Mumbai to London in April under $800"
"Go to LinkedIn and find remote Python developer jobs posted this week"
"Find hotels in Goa under ₹3000 per night on Booking.com"
"Go to Flipkart and compare iPhone 15 vs Samsung S24 prices"
"Open YouTube and search for 'learn React in 2025'"
VoiceNav is built with safety as a core principle:
- NEVER completes a purchase without explicit voice confirmation
- NEVER submits any form without showing the user what will be sent
- NEVER stores, logs, or repeats passwords or sensitive data
- NEVER clicks ads or suspicious popups unless explicitly asked
- STOPS and asks for help after 3 failed actions in a row
- SAFE MODE toggle: on by default
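
The rules above lend themselves to a small guard that sits between the DECIDE and ACT steps. The sketch below is one assumed shape for such a guard; the keyword list, the class name `SafetyGate`, and the redaction pattern are illustrative and not taken from the project.

```python
import re
from dataclasses import dataclass

# Hypothetical keywords hinting that using an element is irreversible.
IRREVERSIBLE_HINTS = ("buy", "pay", "place order", "submit", "delete", "send")


@dataclass
class SafetyGate:
    safe_mode: bool = True   # on by default, as the list above states
    max_failures: int = 3    # stop after 3 failed actions in a row
    _failures: int = 0

    def needs_confirmation(self, target_label: str) -> bool:
        """True when the element about to be used looks irreversible."""
        if not self.safe_mode:
            return False
        label = target_label.lower()
        return any(hint in label for hint in IRREVERSIBLE_HINTS)

    def record(self, succeeded: bool) -> None:
        """Track consecutive failures; any success resets the counter."""
        self._failures = 0 if succeeded else self._failures + 1

    @property
    def must_stop(self) -> bool:
        """True once the failure streak reaches the limit."""
        return self._failures >= self.max_failures


def redact(text: str) -> str:
    """Mask the value following 'password' before text is logged or spoken."""
    return re.sub(r"(?i)(password\s*[:=]\s*)\S+", r"\1***", text)
```

A keyword match is a crude heuristic; a real guard would more likely ask the vision model whether the next action is reversible and treat any doubt as needing confirmation.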
- Google Cloud account
- Gemini API key from Google AI Studio
- Node.js 18+
- Python 3.10+
Run the following from a terminal:

    # Clone the repository
    git clone https://github.com/SimranShaikh20/voicenav
    cd voicenav

    # Frontend
    cd frontend
    npm install

    # Backend
    cd ../backend
    pip install -r requirements.txt

    # Environment variables
    export GEMINI_API_KEY=your_api_key_here
    export GOOGLE_CLOUD_PROJECT=your_project_id

    # Deploy the backend to Cloud Run (run from the repository root)
    cd ..
    gcloud run deploy voicenav-backend \
      --source ./backend \
      --region us-central1 \
      --allow-unauthenticated \
      --set-env-vars "GEMINI_API_KEY=$GEMINI_API_KEY"

    # Start the frontend
    cd frontend
    npm start

Open http://localhost:3000 in your browser.
⚠️ Voice features work best on Chrome or Edge browsers.
The following Google Cloud services are used and verifiable in the GCP console:
- Cloud Run: `voicenav-backend` service in `us-central1`
- Cloud Storage: bucket `voicenav-screenshots` for temp screenshot storage
- Cloud Logging: log name `voicenav-actions` for full action audit trail
- Secret Manager: secret `gemini-api-key` for secure key storage
- Artifact Registry: Docker image `voicenav-backend:latest`
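
For the action audit trail, one low-dependency option on Cloud Run is to print one JSON object per line to stdout, which Cloud Logging ingests as a structured entry and which uses the `severity` and `message` fields specially. The remaining field names and the helpers `audit_entry` and `log_action` are assumptions. This sketch does not write to a named log such as `voicenav-actions`; that would need the Cloud Logging client library instead.

```python
import json
import sys
from datetime import datetime, timezone


def audit_entry(task_id: str, step: int, action: str, ok: bool) -> str:
    """Build one JSON log line describing an executed action."""
    entry = {
        "severity": "INFO" if ok else "WARNING",
        "message": f"step {step}: {action}",
        # The fields below are illustrative payload fields, not a fixed schema.
        "task_id": task_id,
        "step": step,
        "action": action,
        "ok": ok,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(entry)


def log_action(task_id: str, step: int, action: str, ok: bool) -> None:
    """Write the entry to stdout, where Cloud Run collects container logs."""
    print(audit_entry(task_id, step, action, ok), file=sys.stdout, flush=True)
```

Typed text should be redacted before it reaches `action`, so that the audit trail stays consistent with the rule against logging passwords or sensitive data.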
- Published blog post about building with Google AI (#GeminiLiveAgentChallenge)
- Cloud deployment automated with `cloudbuild.yaml` (infrastructure-as-code)
- Google Developer Group profile: [link]
MIT License. See the LICENSE file for details.