Hackathon Tracks: 🏆 Accessibility Track | 🏆 Overall Hack | 🗣️ Best Use of ElevenLabs | 🚀 Best Use of Featherless AI | 🔍 Best Use of Gemini
Empowering the visually impaired, elderly, and motor-impaired to operate any Windows computer entirely by voice — no typing, no clicking, no mouse.
For millions of people, using a computer is not a given.
- Visually impaired users cannot read a screen, locate UI elements, or navigate graphical interfaces without costly, often-incomplete screen-reader software.
- Elderly users struggle with the growing complexity of modern operating systems — nested menus, tiny click targets, and constantly changing interfaces.
- Motor-impaired users, for whom holding a mouse or typing is painful or impossible, are forced to rely on slow, inflexible assistive tooling.
Existing solutions (Windows Narrator, JAWS, Dragon NaturallySpeaking) are rigid and brittle — they require precise commands, flat scripted workflows, and collapse entirely when a website redesigns its layout or an app changes its UI.
Orbit is different. It actually sees the screen the same way a sighted person does, understands what is on it, and takes real actions — adapting to whatever it finds, step by step.
Hold Ctrl+Shift+Space and speak naturally:
- "Open Chrome and search for today's weather in Toronto"
- "Go to Gmail and send a message to Mom saying I'll call her tonight"
- "Open Notepad and type my shopping list: milk, eggs, bread"
- "Find and open the most recent Excel file on my desktop"
Orbit listens, understands, and executes — autonomously operating your mouse, keyboard, and browser until the task is complete. Along the way, it handles any pop-ups, dynamically searches for missing UI elements, and speaks the result back to you in a natural, human-like voice.
No scripting. No voice-command memorization. Just natural speech.
Orbit was engineered to win across multiple tracks by pushing the boundaries of what is possible with multimodal AI, low-latency reasoning, and ultra-lifelike speech synthesis.
Orbit reimagines the entire human-computer interface. It doesn't rely on accessibility trees, which break on legacy software and modern web apps. It uses a vision AI model that perceives the screen as a pixel image.
- The Visually Impaired: Orbit can interact with the full breadth of the Windows ecosystem without compromise. It literally reads the screen and figures out how to navigate for them.
- The Elderly: No syntax to learn, no manual to read. You speak as you would to a friend.
- The Motor-Impaired: Orbit reduces an entire complex workflow (clicks, typing, scrolling, form-filling) down to a single press-and-speak gesture.
For visually impaired or elderly users, the "voice" of the assistant is the entire interface. We integrated ElevenLabs (Multilingual v2 Model) to completely transform the interaction loop.
- Warm & Natural: Replaced generic, robotic OS TTS with ultra-lifelike speech that feels empathetic and natural.
- Multilingual Support: Orbit automatically detects the language the user speaks (e.g., Spanish). It uses DeepSeek to translate the objective to English for internal OS reasoning, takes actions, and then uses ElevenLabs to synthesize the final response back into the user's native tongue.
- Blocking & Async Playback: Seamlessly integrated with `pygame` to lock audio playback only when necessary, preventing overlapping system sounds.
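A minimal sketch of the serialized-playback idea described above. Orbit uses `pygame` for the actual audio; here `play_fn` stands in for that call (the function names are illustrative, not Orbit's real API) so the locking behaviour is visible on its own.

```python
import threading

# A single lock guards the audio device so TTS clips never overlap
# system sounds. (Sketch; Orbit's real playback goes through pygame.)
_audio_lock = threading.Lock()

def speak_blocking(clip, play_fn):
    """Play one clip at a time; concurrent callers wait for the device."""
    with _audio_lock:
        play_fn(clip)

def speak_async(clip, play_fn):
    """Fire-and-forget playback that still respects the lock."""
    t = threading.Thread(target=speak_blocking, args=(clip, play_fn), daemon=True)
    t.start()
    return t
```

Blocking playback is used when the spoken result must finish before the loop continues; the async variant covers short status chimes.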
Orbit's accessibility mission depends entirely on seeing the screen — and that vision layer is powered by Google Gemini (gemini-2.0-flash-preview via OpenRouter).
- Pixel-Level Screen Perception: On every iteration of the agentic loop, Orbit captures a full-resolution screenshot and sends it to Gemini. Gemini parses the raw pixels and returns a structured natural-language description of every visible UI element — buttons, text fields, menus, dialogs — along with normalized `[ymin, xmin, ymax, xmax]` bounding boxes on a `[0, 1000]` coordinate scale. This means Orbit never relies on fragile accessibility trees or DOM inspection; it sees the screen the same way a human does.
- Context-Aware Classification: Gemini identifies whether the active window is a BROWSER or a DESKTOP_APP, allowing the downstream decision model to route actions correctly — Playwright for web content, PyAutoGUI for native OS elements.
- Goal-Completion Signalling: Gemini evaluates whether the current screen satisfies the user's objective and emits a `GOAL STATUS: COMPLETE` signal, enabling the agent to terminate confidently rather than running to the step limit.
- Enabling True Accessibility: Because Gemini can describe any screen — legacy software, custom desktop apps, dynamic web UIs — Orbit works everywhere, not just on applications with built-in accessibility support.
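For illustration, a screenshot can be packaged for a vision model behind OpenRouter's OpenAI-compatible chat API roughly as follows. This is a hedged sketch: the model id and prompt wording are assumptions, not Orbit's exact strings.

```python
import base64

def build_vision_request(png_bytes, model="google/gemini-2.0-flash-001"):
    """Build an OpenAI-style multimodal request carrying a screenshot.
    (Sketch; model id and prompt text are illustrative assumptions.)"""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe every UI element with normalized "
                         "[ymin, xmin, ymax, xmax] boxes on a 0-1000 scale."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

The resulting dict is what would be POSTed to the chat-completions endpoint each loop iteration.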
Orbit's core is a hyper-reactive loop that perceives the screen, reasons, and acts every 1-2 seconds. Latency is life or death.
- DeepSeek-V3 Infrastructure: We use Featherless AI to run `DeepSeek-V3-0324` at lightning speed. Featherless's serverless AI infrastructure provides the ultra-low-latency inference required to keep the agentic loop running fluidly.
- Complex JSON Routing: The decision model relies on strict JSON boundaries to evaluate bounding boxes, DOM context, and error states. Featherless handles these high-throughput structured requests reliably, and when the model does emit broken JSON, Orbit recovers instantly via our custom reflection-retry loop.
Orbit is a multi-threaded Python application built around a dual-model agentic perception-action loop. It does not plan every step upfront — it perceives the current state, decides the single best next action, executes it, and re-perceives.
┌─────────────────────────────────────────────────────────────────────┐
│ widget.py (UI Thread) │
│ PyQt6 Glass Widget ── Hold-to-Talk Hotkey ── State Machine │
│ │ │ │ │
│ Glow Overlay Audio Capture ElevenLabs TTS │
└────────────────────────────┬────────────────────────────────────────┘
│ queue.Queue (thread-safe message bus)
▼
┌─────────────────────────────────────────────────────────────────────┐
│ agent.py (Agent Thread) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Agentic Perception-Action Loop │ │
│ │ │ │
│ │ Screenshot ──► Vision Model ──► Screen Description │ │
│  │       ▲         (Gemini 2.0 Flash Preview via OpenRouter)      │  │
│ │ │ │ │ │
│ │ │ Decision Model │ │
│ │ │ (DeepSeek-V3 via Featherless AI) │ │
│ │ │ │ │ │
│ │ │ JSON Action ◄────────────────────────── │ │
│ │ │ │ │ │
│ │ │ ┌─────────────┴──────────────┐ │ │
│ │ │ ▼ ▼ │ │
│ │ Wait/Diff OS Context Browser Context │ │
│ │ │ (PyAutoGUI) (Playwright/Edge) │ │
│ │ └──────────────────────────────────────── │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
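The thread-safe bus in the diagram can be sketched with the standard library alone; the message shape here is illustrative, since the real payloads carry transcripts, TTS requests, and state updates.

```python
import queue
import threading

# A single Queue decouples the PyQt6 UI thread from the agent thread.
bus = queue.Queue()

def ui_thread():
    """Runs in widget.py: post the user's spoken objective."""
    bus.put({"type": "objective", "text": "open notepad"})

def agent_thread(results):
    """Runs in agent.py: block until the UI posts work, then act on it."""
    msg = bus.get(timeout=2)  # blocks without burning CPU
    results.append(msg["text"])
```

`queue.Queue` handles its own locking, so neither thread needs explicit synchronization to exchange messages.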
- Plan & Translate: Detect user language, translate to English, and formulate a multi-step execution plan with strictly observable success criteria.
- Perceive: Capture a full-resolution screenshot of the Windows desktop.
- Describe: Send the screenshot to Gemini 2.0 Flash (via OpenRouter). Gemini returns a natural-language structure of UI elements annotated with normalized `[ymin, xmin, ymax, xmax]` bounding boxes on a `[0, 1000]` coordinate scale.
- Decide: Feed the screen description, DOM affordances (if in a browser), and error history to DeepSeek-V3 (via Featherless AI). DeepSeek returns a single structured JSON action.
- Act: Execute the action via:
  - `"context": "os"` → PyAutoGUI (mouse moves, clicks, keyboard input, hotkeys).
  - `"context": "browser"` → Playwright (URL navigation, DOM tracking on an active Edge session).
- Verify: Has the screen structurally changed? (Computed via `imagehash` perceptual diffs.) Did the action loop 3x? If so, prompt DeepSeek to aggressively correct course.
- Terminate & Speak: Exit when criteria are met or max steps reached; translate the final status back to the user's language and speak via ElevenLabs Multilingual v2.
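The Verify step's stuck-detection can be sketched as below. The real system derives the hashes with `imagehash`; here they are treated as opaque values, and the class name is illustrative.

```python
from collections import deque

class StuckDetector:
    """Flag when the screen hash has been identical `limit` times in a row."""

    def __init__(self, limit=3):
        self.limit = limit
        self.history = deque(maxlen=limit)

    def update(self, screen_hash):
        """Record the latest hash; return True if the agent appears stuck."""
        self.history.append(screen_hash)
        return (len(self.history) == self.limit
                and len(set(self.history)) == 1)
```

When `update` returns True, the agent would inject a corrective warning into the decision model's prompt instead of repeating the same click.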
When the agent encounters a login page or a complex CAPTCHA it cannot bypass, it emits a `request_user_input` action. It suspends the agentic loop, switches the UI to a "waiting" state, and blocks execution on a `threading.Event`. The user's next spoken reply is injected directly into the prompt context, resuming the loop seamlessly without losing browser state.
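The handoff can be sketched with a small gate object (the class and method names are illustrative; the real agent also drives the UI's "waiting" state from here).

```python
import threading

class UserInputGate:
    """Pause the agent thread until the UI thread delivers a spoken reply."""

    def __init__(self):
        self._event = threading.Event()
        self._reply = None

    def request(self, prompt, timeout=None):
        """Agent thread: block on the Event until the user's reply arrives."""
        self._event.clear()
        # (Real code would surface `prompt` in the widget UI here.)
        if self._event.wait(timeout):
            return self._reply
        return None  # timed out with no reply

    def provide(self, reply):
        """UI thread: inject the transcribed reply and resume the agent."""
        self._reply = reply
        self._event.set()
```

Because only the agent thread blocks, the browser session and UI stay live while the user responds.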
- Why Dual-Models instead of one? Vision models are great at drawing bounding boxes; reasoning models (like DeepSeek-V3) excel at logic and JSON adherence. Splitting perception and reasoning allows each model to do what it does best, significantly dropping error rates.
- Why normalized `0-1000` coordinates? A fixed coordinate space prevents the vision model from hallucinating resolution-specific pixel values, making Orbit completely agnostic to 1080p, 1440p, or 4K monitors.
- Why perceptual hashes? Early versions would get stuck retrying failed clicks forever. Orbit now computes a perceptual `imagehash` of the screen after every action. If the screen doesn't change, Orbit dynamically injects a warning into DeepSeek-V3's prompt context to force a new approach.
- Advanced JSON Reflection: If DeepSeek-V3 outputs malformed JSON, rather than crashing, the agent catches the exception, attempts an `ast.literal_eval` fallback, and automatically feeds the parser error back to the model as a prompt to self-correct.
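As a concrete example of the normalized coordinate scheme, a `[0, 1000]` bounding box converts to a pixel click point like this (function name is illustrative):

```python
def box_to_click_point(box, screen_w, screen_h):
    """Map a normalized [ymin, xmin, ymax, xmax] box on the 0-1000 scale
    to the (x, y) pixel centre for a PyAutoGUI click."""
    ymin, xmin, ymax, xmax = box
    x = (xmin + xmax) / 2 / 1000 * screen_w
    y = (ymin + ymax) / 2 / 1000 * screen_h
    return round(x), round(y)

# Example: the same normalized box lands correctly on any resolution.
# box_to_click_point([100, 200, 300, 400], 1920, 1080) -> (576, 216)
```

The division by 1000 is the only place resolution enters, which is why the vision model never needs to know the monitor size.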
- Windows 10 or 11
- Python 3.10+
- Featherless AI API key (for DeepSeek-V3 reasoning)
- OpenRouter API key (for Gemini vision)
- ElevenLabs API key (for human-like TTS)
pip install -r requirements.txt
playwright install chromium
Copy .env.example to .env:
FEATHERLESS_API_KEY=your_key_here
OPENROUTER_API_KEY=your_key_here
ELEVENLABS_API_KEY=your_key_here
python widget.py
Hold Ctrl+Shift+Space to speak. Release to execute.
- Mobile companion app — a phone-based microphone that pairs wirelessly with Orbit on the desktop, minimizing the need to reach a keyboard hotkey at all.
- Scheduled accessibility chains — "Every morning, open my email and read me the subject lines" without the user needing to initiate anything.
- Memory layer — allow the agent to remember frequently used workflows and login preferences to cut latency on repetitive task clusters.
- Cross-platform — port the OS control and hotkey layers to macOS and Linux.
Built at Quackhacks 2026 by Orbit.
- Om Rana
- Sharanya Raj