A desktop voice companion that uses Gemini Live for natural realtime conversation and Hermes Agent for long-running work.
The app is designed as a voice-first front-end: you speak naturally, Gemini Live responds in realtime, and when the request needs tools or autonomous work, Gemini hands it to Hermes in the background.
- Captures your microphone through Electron/Chromium with WebRTC audio cleanup.
- Streams cleaned audio to Gemini Live as 16 kHz PCM.
- Plays Gemini Live audio responses through the app using the browser `AudioContext`.
- Lets Gemini use built-in Google Search for quick current facts.
- Lets Gemini hand serious work to Hermes through the Hermes local API server.
- Shows conversation in the Comms panel and Hermes jobs in the Hermes Tasks panel.
- Proactively announces Hermes results when a background task finishes.
- Supports interruption/barge-in: when you speak over Gemini, playback is flushed.
- Uses a dark-only "Orbital Deck" UI with an animated voice orb, keyboard shortcuts, Comms, Camera/Gesture, and Work Stream columns.
- Adds camera hand-gesture control (MediaPipe) after wake so you can drive the UI in the air: point to move a cursor, dwell to open a task, open-palm to scroll, and make a fist to dismiss.
- Uses a simple polished reader open/close animation for expanded Hermes results.
```mermaid
flowchart TD
User["User speaks"] --> ElectronRenderer["Electron Renderer UI"]
ElectronRenderer -->|"getUserMedia with echoCancellation, noiseSuppression, autoGainControl"| WebRTCAudio["WebRTC Audio Capture"]
WebRTCAudio -->|"Downsample to 16k PCM chunks"| ElectronMain["Electron Main Process"]
ElectronMain -->|"sendRealtimeInput audio/text"| GeminiLive["Gemini Live API"]
GeminiLive -->|"Voice response: 24k PCM audio chunks"| ElectronMain
ElectronMain -->|"live:audio IPC"| ElectronRenderer
ElectronRenderer -->|"AudioContext playback"| Speaker["Laptop Speaker"]
GeminiLive -->|"Transcripts and state events"| ElectronMain
ElectronMain -->|"sidecar:event IPC"| ElectronRenderer
ElectronRenderer --> Comms["Comms Panel"]
GeminiLive -->|"Quick current fact or lightweight search"| GoogleSearch["Gemini Built-in Google Search"]
GoogleSearch --> GeminiLive
GeminiLive -->|"Function call: submit_hermes_task"| HermesTool["Hermes Tool Bridge in Electron Main"]
HermesTool -->|"POST /v1/runs"| HermesAPI["Hermes Local API Server"]
HermesAPI --> HermesAgent["Hermes Agent Worker"]
HermesAgent -->|"Uses terminal, files, browser, web, MCP, memory"| HermesTools["Hermes Tool Ecosystem"]
HermesTool -->|"Poll GET /v1/runs/run_id"| HermesAPI
HermesAPI -->|"Run status/result"| HermesTool
HermesTool -->|"Task status updates"| ElectronRenderer
ElectronRenderer --> HermesTasks["Hermes Tasks Panel"]
HermesTool -->|"SYSTEM_EVENT_HERMES_COMPLETE"| GeminiLive
GeminiLive -->|"Proactive spoken summary"| ElectronMain
ElectronMain -->|"Audio chunks"| ElectronRenderer
ElectronRenderer --> Speaker
User -->|"Interrupts while Gemini speaks"| WebRTCAudio
WebRTCAudio -->|"Cleaned mic audio with browser AEC"| GeminiLive
GeminiLive -->|"serverContent.interrupted"| ElectronMain
ElectronMain -->|"Flush playback"| ElectronRenderer
User -->|"After wake: hand in front of webcam"| Camera["Webcam getUserMedia"]
Camera --> MediaPipe["MediaPipe GestureRecognizer (on-device)"]
MediaPipe -->|"Landmarks + gesture class"| HandHook["useHandControl hook"]
HandHook -->|"Smoothed pointer + gesture state"| ElectronRenderer
```
- You speak to the app.
  Electron captures your microphone using Chromium's WebRTC audio path with `echoCancellation: true`, `noiseSuppression: true`, and `autoGainControl: true`. This gives the app laptop-speaker echo cancellation similar to browser/mobile voice apps.
- The renderer streams audio to Electron main.
  The renderer downsamples microphone audio to 16 kHz PCM chunks and sends them over Electron IPC.
- Electron main streams to Gemini Live.
  Electron main owns the Gemini Live session using `@google/genai` and sends audio via `sendRealtimeInput`.
- Gemini decides the route.
  Gemini has two tool paths:
  - Google Search for quick current facts and simple web lookups.
  - Hermes tools for real work: deals, research, coding, files, terminal work, email checks, browser tasks, automation, and anything that should continue in the background.
- Hermes runs work in the background.
  When Gemini calls `submit_hermes_task`, Electron main submits the task to Hermes using `POST /v1/runs`. Hermes returns a `run_id` immediately, so Gemini can keep talking instead of waiting.
- The app tracks Hermes.
  Electron polls Hermes run status and updates the Hermes Tasks panel.
- Hermes completion is fed back to Gemini.
  When a run completes, Electron sends Gemini an internal `SYSTEM_EVENT_HERMES_COMPLETE` message. Gemini then proactively tells you Hermes has returned, summarizes the result, and asks whether you want to go through the details before continuing.
- You can interrupt Gemini.
  If you speak while Gemini is talking, Gemini sends an interruption event. The app flushes queued playback so Gemini stops talking over you.
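
The barge-in step above can be illustrated with a small scheduler that queues audio chunks back to back and drops them all when an interruption arrives. This is a minimal sketch, not the app's actual code: `PlaybackScheduler`, `ScheduledChunk`, and the injected clock are hypothetical names, and in a real renderer the clock would be `AudioContext.currentTime` and `stop` would call `AudioBufferSourceNode.stop()`.

```typescript
// Hypothetical sketch of gapless scheduling plus flush-on-interrupt.
// The clock is injected so the logic runs without a real AudioContext.
type Clock = () => number; // current time in seconds

interface ScheduledChunk {
  startAt: number;  // when the chunk starts playing, in clock seconds
  duration: number; // chunk length in seconds
  stop: () => void; // cancels playback of this chunk
}

class PlaybackScheduler {
  private queue: ScheduledChunk[] = [];
  private nextStart = 0;
  private readonly now: Clock;

  constructor(now: Clock) {
    this.now = now;
  }

  // Schedule a chunk right after the previous one, or immediately if idle.
  enqueue(sampleCount: number, sampleRate: number, stop: () => void): ScheduledChunk {
    const startAt = Math.max(this.now(), this.nextStart);
    const chunk: ScheduledChunk = { startAt, duration: sampleCount / sampleRate, stop };
    this.nextStart = startAt + chunk.duration;
    this.queue.push(chunk);
    return chunk;
  }

  // Called on an interruption event: stop every queued chunk and reset the timeline.
  flush(): number {
    const dropped = this.queue.length;
    for (const chunk of this.queue) chunk.stop();
    this.queue = [];
    this.nextStart = 0;
    return dropped;
  }
}
```

A fuller version would also remove chunks from the queue once they finish playing (for example in an `onended` handler), so that `flush()` only touches audio that is still pending.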
File: electron/main.mjs
Responsibilities:
- Loads `.env`.
- Creates the Gemini Live session.
- Defines Gemini tools.
- Bridges Gemini tool calls to Hermes.
- Sends/receives Gemini audio.
- Polls Hermes runs.
- Announces Hermes completion back into Gemini.
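
The submit-then-poll responsibility can be sketched as two small functions. Only the `POST /v1/runs` and `GET /v1/runs/<run_id>` paths and the `run_id` name come from this README; the request body field (`input`), the `status`/`result` fields, the status strings, and the Bearer header are assumptions made for illustration, and `submitRun`/`pollRun` are hypothetical names.

```typescript
// Sketch only: body field `input`, fields `status`/`result`, the status strings,
// and the Bearer auth header are assumptions, not confirmed Hermes API details.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type HttpFetch = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<HttpResponse>;

interface HermesRun {
  run_id: string;
  status: string;
  result?: string;
}

function authHeaders(apiKey: string): Record<string, string> {
  return { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` };
}

// Submit a task and return its run id without waiting for the work to finish.
async function submitRun(http: HttpFetch, baseUrl: string, apiKey: string, task: string): Promise<string> {
  const res = await http(`${baseUrl}/v1/runs`, {
    method: "POST",
    headers: authHeaders(apiKey),
    body: JSON.stringify({ input: task }),
  });
  if (!res.ok) throw new Error(`Hermes submit failed: HTTP ${res.status}`);
  const run = (await res.json()) as HermesRun;
  return run.run_id;
}

// Poll until the run reaches a terminal status or the attempt budget runs out.
async function pollRun(
  http: HttpFetch,
  baseUrl: string,
  apiKey: string,
  runId: string,
  intervalMs = 2000,
  maxAttempts = 300,
): Promise<HermesRun> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await http(`${baseUrl}/v1/runs/${runId}`, { headers: authHeaders(apiKey) });
    if (!res.ok) throw new Error(`Hermes poll failed: HTTP ${res.status}`);
    const run = (await res.json()) as HermesRun;
    if (run.status === "completed" || run.status === "failed") return run;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Hermes run ${runId} did not finish within ${maxAttempts} polls`);
}
```

Injecting the HTTP function keeps the sketch testable; in Electron main on Node.js 20+ the global `fetch` could be passed in directly.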
File: electron/preload.cjs
Responsibilities:
- Exposes safe IPC APIs to the renderer.
- Sends microphone PCM chunks to Electron main.
- Receives Gemini audio chunks and interruption events.
- Receives app state events.
Files:
- `src/App.tsx`
- `src/App.css`
- `src/deck.css`
- `src/ReactorCore.tsx`
- `src/BootSequence.tsx`
- `src/useHandControl.ts` (MediaPipe hand/gesture hook)
Responsibilities:
- Renders the UI.
- Captures microphone with WebRTC audio cleanup.
- Downsamples mic audio to 16 kHz PCM.
- Plays Gemini audio through `AudioContext`.
- Shows Comms and Hermes Tasks.
- Renders the dark-only Orbital Deck layout.
- Provides keyboard shortcuts.
- Runs camera hand-gesture control after wake and simple reader open/close animation.
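
The downsampling responsibility could look roughly like the function below. It is a sketch under assumptions rather than the renderer's actual code: `downsampleToPcm16` is a hypothetical name, and block averaging is just one simple way to reduce the sample rate (the real app may use a different resampler).

```typescript
// Hypothetical sketch: convert Float32 mic samples (range -1..1) at the capture
// rate into 16-bit PCM at 16 kHz by averaging each block of input samples.
function downsampleToPcm16(input: Float32Array, inputRate: number, outputRate = 16000): Int16Array {
  if (outputRate > inputRate) throw new Error("outputRate must not exceed inputRate");
  const ratio = inputRate / outputRate;
  const outLength = Math.floor((input.length * outputRate) / inputRate);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Average the input samples covered by this output sample (a crude low-pass).
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.max(start + 1, Math.floor((i + 1) * ratio)));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    const sample = Math.max(-1, Math.min(1, sum / (end - start)));
    // Scale to the signed 16-bit range, which is asymmetric (-32768..32767).
    out[i] = sample < 0 ? Math.round(sample * 0x8000) : Math.round(sample * 0x7fff);
  }
  return out;
}
```

For example, a 48 kHz capture yields one output sample for every three input samples; the resulting `Int16Array` buffer is what would be sent over IPC.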
Files under sidecar/
This was the original Gemini Live/PyAudio prototype. The current app now uses Electron-native audio for better laptop-speaker echo cancellation, but the Python sidecar remains useful as a reference and for future experiments.
The app can be driven in the air with your webcam. The camera does not start
on app boot; it is enabled automatically after wake, once Gemini Live and mic
capture are initialized. Hand tracking and gesture
classification run fully on-device using Google's
MediaPipe Tasks Vision
GestureRecognizer. No camera frames ever leave your machine — only the derived
pointer position and gesture label are used by the UI.
File: src/useHandControl.ts (consumed by src/App.tsx).
- Package: `@mediapipe/tasks-vision` (the WebAssembly "Tasks Vision" runtime).
- Task: `GestureRecognizer`, a pre-trained model that returns both hand landmarks and a classified gesture in one pass.
- Model asset: `gesture_recognizer.task` (Google's canned-gesture classifier).
- WASM runtime: loaded via `FilesetResolver.forVisionTasks(...)` from the MediaPipe CDN.
```typescript
import { FilesetResolver, GestureRecognizer } from "@mediapipe/tasks-vision";

// WASM_URL, MODEL_URL, and `recognizer` are declared elsewhere in src/useHandControl.ts.
const fileset = await FilesetResolver.forVisionTasks(WASM_URL);
recognizer = await GestureRecognizer.createFromOptions(fileset, {
  baseOptions: { modelAssetPath: MODEL_URL, delegate: "GPU" },
  runningMode: "VIDEO",
  numHands: 1,
  minHandDetectionConfidence: 0.6,
  minHandPresenceConfidence: 0.6,
  minTrackingConfidence: 0.6,
  cannedGesturesClassifierOptions: { scoreThreshold: 0.55 },
});
```

- GPU delegate for low-latency inference, VIDEO running mode for a live webcam stream.
- One hand is tracked to keep the interaction unambiguous.
- Confidence floors (`0.6`) and a canned-gesture score threshold (`0.55`) reject weak/uncertain frames.
- After wake, `navigator.mediaDevices.getUserMedia` opens the front camera at 640×480 into a hidden `<video>` element.
- A `requestAnimationFrame` loop calls `recognizer.recognizeForVideo(video, performance.now())` each frame.
- From the result we read the first hand's landmarks and the top gesture.
- Pointer: we take the index-fingertip landmark (`hand[8]`), mirror X (`1 - x`) for a natural selfie view, then remap a comfortable center region of the frame to the full screen (so you don't have to reach the physical edges):

  ```typescript
  const INPUT_RANGE = { xMin: 0.18, xMax: 0.82, yMin: 0.12, yMax: 0.82 };
  ```

  The mapped point is then exponentially smoothed (factor `0.5`) to remove jitter.
- Gesture stabilization: a raw gesture must persist for 3 frames before it becomes the "stable" gesture, which prevents flicker between classes.
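
The pointer step above (mirror, remap, smooth) can be sketched as two pure functions. `INPUT_RANGE` and the `0.5` factor are taken from the text; the function names, the clamping to the screen edges, and the exact form of the smoothing update are assumptions for illustration.

```typescript
// Hypothetical sketch of the pointer mapping; names and clamping are assumptions.
interface Point {
  x: number;
  y: number;
}

const INPUT_RANGE = { xMin: 0.18, xMax: 0.82, yMin: 0.12, yMax: 0.82 };
const SMOOTHING = 0.5;

const clamp01 = (v: number): number => Math.min(1, Math.max(0, v));

// Map a normalized fingertip landmark (0..1 in camera space) to screen pixels.
function mapToScreen(tip: Point, width: number, height: number): Point {
  const mirroredX = 1 - tip.x; // selfie view: moving the hand right moves the cursor right
  const nx = clamp01((mirroredX - INPUT_RANGE.xMin) / (INPUT_RANGE.xMax - INPUT_RANGE.xMin));
  const ny = clamp01((tip.y - INPUT_RANGE.yMin) / (INPUT_RANGE.yMax - INPUT_RANGE.yMin));
  return { x: nx * width, y: ny * height };
}

// Exponential smoothing: move a fraction of the way from the previous point to the new one.
function smoothPoint(prev: Point | null, next: Point, factor = SMOOTHING): Point {
  if (!prev) return next;
  return {
    x: prev.x + (next.x - prev.x) * factor,
    y: prev.y + (next.y - prev.y) * factor,
  };
}
```

With these ranges, a fingertip at the center of the comfortable region lands at the center of the screen, and anything outside the region pins to the nearest edge.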
| Gesture (MediaPipe class) | Action in the app |
|---|---|
| `Pointing_Up` | Move the on-screen cursor; dwell ~850 ms over a task card to open it |
| `Open_Palm` | Hold-to-scroll the open reader (joystick: hold high = scroll up, low = scroll down, middle = neutral; speed scales with distance) |
| `Closed_Fist` | Close the expanded reader |
| `None` / other | Idle; pointer hidden |
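
The joystick-style scroll in the table can be sketched as a function from hand height to scroll velocity. The dead-zone size and maximum speed below are illustrative assumptions, not values from the app, and `scrollVelocity` is a hypothetical name.

```typescript
// Hypothetical sketch of hold-to-scroll; both constants are assumptions.
const DEAD_ZONE = 0.15; // half-height of the neutral band around the vertical center
const MAX_SPEED = 900;  // pixels per second when the hand is at the top or bottom edge

// handY is the normalized vertical hand position: 0 = top of frame, 1 = bottom.
// Returns pixels per second; negative scrolls up, positive scrolls down.
function scrollVelocity(handY: number): number {
  const offset = handY - 0.5;
  const magnitude = Math.abs(offset);
  if (magnitude <= DEAD_ZONE) return 0; // middle = neutral
  // Speed grows linearly with distance beyond the dead zone.
  const strength = Math.min(1, (magnitude - DEAD_ZONE) / (0.5 - DEAD_ZONE));
  return Math.sign(offset) * strength * MAX_SPEED;
}
```

A frame loop would multiply the returned velocity by the elapsed time and add the result to the reader's `scrollTop`.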
```mermaid
flowchart TD
Webcam["Webcam 640x480"] --> Video["Hidden video element"]
Video --> Loop["requestAnimationFrame loop"]
Loop --> Recognize["GestureRecognizer.recognizeForVideo"]
Recognize --> Landmarks["Hand landmarks - index fingertip"]
Recognize --> GestureClass["Top gesture + score"]
Landmarks --> Mirror["Mirror X + remap center region to screen"]
Mirror --> Smooth["Exponential smoothing (0.5)"]
Smooth --> Pointer["Smoothed screen pointer"]
GestureClass --> Stabilize["Stabilize: hold 3 frames"]
Stabilize --> StableGesture["Stable gesture"]
Pointer --> HandState["HandState"]
StableGesture --> HandState
HandState --> AppUI["App.tsx interactions"]
AppUI -->|"Pointing_Up + brief dwell"| OpenCard["Open task card"]
AppUI -->|"Open_Palm"| Scroll["Hold-to-scroll reader"]
AppUI -->|"Closed_Fist"| Close["Close reader"]
```
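
The "hold 3 frames" stabilization stage can be sketched as a small state machine. This is one plausible reading of the rule, assuming the frames must be consecutive; `GestureStabilizer` is a hypothetical name, not the hook's actual implementation.

```typescript
// Hypothetical sketch: a raw gesture becomes "stable" only after it has been
// seen on `requiredFrames` consecutive frames.
class GestureStabilizer {
  private candidate = "None";
  private count = 0;
  private stable = "None";
  private readonly requiredFrames: number;

  constructor(requiredFrames = 3) {
    this.requiredFrames = requiredFrames;
  }

  // Feed the top gesture of the current frame; returns the current stable gesture.
  update(raw: string): string {
    if (raw === this.candidate) {
      this.count++;
    } else {
      this.candidate = raw; // a different class restarts the streak
      this.count = 1;
    }
    if (this.count >= this.requiredFrames) this.stable = this.candidate;
    return this.stable;
  }
}
```

A single misclassified frame therefore neither changes the stable gesture nor survives long enough to trigger an action.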
Expanded Hermes task results open with a simple scale/fade pop and close with a short fade/scale animation. The intentionally simple animation keeps the UI clean and avoids expensive DOM rasterization.
Gemini Live is configured with:

```js
tools: [
  { googleSearch: {} },
  {
    functionDeclarations: [
      check_hermes_status,
      submit_hermes_task,
      get_hermes_task_status,
      stop_hermes_task,
      approve_hermes_action,
    ]
  }
]
```

Routing behavior:
- Quick answer or current fact: Gemini Search.
- Multi-step work or background task: Hermes.
- Hermes completion: Gemini proactively announces result.
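
On the Electron side, answering a Hermes tool call without blocking might look like the sketch below. The five tool names come from the configuration above; the `ToolCall`/`ToolResponse` shapes, the `task` argument name, and the `answerToolCall` helper are assumptions for illustration and are not taken from the app or the SDK.

```typescript
// Sketch only: call/response shapes and the `task` argument name are assumptions.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}
interface ToolResponse {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

const HERMES_TOOL_NAMES = [
  "check_hermes_status",
  "submit_hermes_task",
  "get_hermes_task_status",
  "stop_hermes_task",
  "approve_hermes_action",
];

async function answerToolCall(
  call: ToolCall,
  submit: (task: string) => Promise<string>,
): Promise<ToolResponse> {
  const reply = (response: Record<string, unknown>): ToolResponse => ({
    id: call.id,
    name: call.name,
    response,
  });
  if (!HERMES_TOOL_NAMES.includes(call.name)) {
    return reply({ error: `Unknown tool: ${call.name}` });
  }
  if (call.name === "submit_hermes_task") {
    // Await only the submission, which returns a run id right away; never the run itself.
    const runId = await submit(String(call.args["task"] ?? ""));
    return reply({ run_id: runId, status: "submitted" });
  }
  return reply({ error: `${call.name} is not covered by this sketch` });
}
```

The point of the design is that the response goes back to Gemini as soon as a run id exists, while completion is reported later through a separate message.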
The app expects the Hermes API server to be reachable at:

```
http://127.0.0.1:8642
```

Your Hermes config should enable the local API server. Add this to `~/.hermes/.env` (macOS/Linux) or the equivalent Hermes env file on Windows:

```
API_SERVER_ENABLED=true
API_SERVER_KEY=iris-local-dev
```

Restart the Hermes gateway after changing this:

```bash
hermes gateway restart
```

Verify:

```bash
curl -s http://127.0.0.1:8642/health
```

Expected output:

```json
{"status":"ok"}
```

Iris reads environment values from:

- `.env` in this repo (development and `npm start`).
- `~/.iris/.env` (packaged app on macOS/Linux).
- `%USERPROFILE%\.iris\.env` (packaged app on Windows).
- `.env` bundled next to app resources (optional packaging flow).
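
The lookup locations above can be expressed as a small path-resolution function. This is a sketch under assumptions: `envFileCandidates` and `EnvLookup` are hypothetical names, and the order in which the packaged app tries the home-directory file and the bundled file is a guess, not documented behavior.

```typescript
// Hypothetical sketch: list the .env paths worth checking for a given launch mode.
interface EnvLookup {
  packaged: boolean;     // true for the packaged app, false for dev and `npm start`
  platform: string;      // e.g. "win32", "darwin", "linux"
  homeDir: string;       // the user's home directory (%USERPROFILE% on Windows)
  repoDir: string;       // repository root, used in development
  resourcesDir?: string; // app resources directory, if a bundled .env is shipped
}

function envFileCandidates(lookup: EnvLookup): string[] {
  const sep = lookup.platform === "win32" ? "\\" : "/";
  const join = (...parts: string[]): string => parts.join(sep);
  if (!lookup.packaged) return [join(lookup.repoDir, ".env")];
  // Assumed order: per-user file first, then the optional bundled file.
  const candidates = [join(lookup.homeDir, ".iris", ".env")];
  if (lookup.resourcesDir) candidates.push(join(lookup.resourcesDir, ".env"));
  return candidates;
}
```

In Electron main, the inputs could come from `app.isPackaged`, `process.platform`, `os.homedir()`, and `process.resourcesPath`.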
Copy the example file:

```bash
cp .env.example .env
```

On Windows PowerShell:

```powershell
Copy-Item .env.example .env
```

Minimum required:

```
GEMINI_API_KEY=your_google_ai_studio_key
```

Recommended example:

```
GEMINI_API_KEY=your_google_ai_studio_key
IRIS_USER_NAME=there
GEMINI_LIVE_MODEL=models/gemini-3.1-flash-live-preview
GEMINI_LIVE_VOICE=Zephyr
HERMES_API_URL=http://127.0.0.1:8642
API_SERVER_KEY=iris-local-dev
```

`HERMES_BIN` is optional. Set it only if the packaged GUI app cannot find the Hermes binary on `PATH`. If the Hermes gateway is already running, Iris only needs `HERMES_API_URL` and `API_SERVER_KEY`.
Use this table as the single source of truth for which Google pieces we use, so future changes don't reintroduce wrong/deprecated names or version drift.
| Purpose | Exact identifier we use | Where it's set | Source |
|---|---|---|---|
| Gemini Live model | `models/gemini-3.1-flash-live-preview` | `electron/main.mjs` (`GEMINI_LIVE_MODEL` env override) | Google AI Studio / Gemini API |
| Gemini voice | `Zephyr` | `electron/main.mjs` (`GEMINI_LIVE_VOICE` env override) | Gemini Live prebuilt voices |
| Gemini SDK | `@google/genai` `^2.10.0` | `package.json` | npm |
| Gemini built-in search tool | `{ googleSearch: {} }` | `electron/main.mjs` tools | Gemini Live tools |
| Gesture/hand ML runtime | `@mediapipe/tasks-vision` `^0.10.35` | `package.json` | npm |
| MediaPipe WASM fileset | `https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.35/wasm` | `src/useHandControl.ts` (`WASM_URL`) | jsDelivr CDN |
| MediaPipe model asset | `https://storage.googleapis.com/mediapipe-tasks/gesture_recognizer/gesture_recognizer.task` | `src/useHandControl.ts` (`MODEL_URL`) | Google Cloud Storage |
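
Because the npm package and the WASM URL carry the same version number in two places, a small guard could catch drift between them. The helpers below are a hypothetical sketch (`wasmVersionFromUrl` and `versionsMatch` are not part of the app); they only parse the jsDelivr URL shape shown in the table.

```typescript
// Hypothetical sketch: extract the version embedded in the WASM URL and compare
// it with the installed @mediapipe/tasks-vision version.
function wasmVersionFromUrl(url: string): string | null {
  const match = /@mediapipe\/tasks-vision@(\d+\.\d+\.\d+)\//.exec(url);
  return match ? match[1] : null;
}

function versionsMatch(installedVersion: string, wasmUrl: string): boolean {
  return wasmVersionFromUrl(wasmUrl) === installedVersion;
}
```

Such a check could run in a unit test or at startup, with the installed version read from the package's own `package.json`.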
- Use the exact Live model name `gemini-3.1-flash-live-preview`. Live models are a distinct family from regular `gemini-*` chat models; a normal chat model name will fail to open a Live session. Keep the `models/` prefix.
- Keep the MediaPipe WASM URL version equal to the installed npm version. Both are pinned to `0.10.35` today. A mismatch between the JS API (`@mediapipe/tasks-vision`) and the WASM fileset can cause subtle runtime/ABI breakage, so update the `@x.y.z` in `WASM_URL` whenever you bump the package (or self-host the WASM from the installed package instead of a CDN).
- MediaPipe WASM + model are fetched from Google/jsDelivr at first load, so gesture control needs network access on first run. Vendor both locally if you need fully offline startup.
- Gemini Live audio formats are fixed: send 16 kHz PCM, receive 24 kHz PCM. Don't assume a single sample rate for both directions.
- Gemini 3.1 Live function calls are synchronous, so never block a tool call on long Hermes work; return a `run_id` immediately and track completion separately.
- Send realtime input with `sendRealtimeInput` (not the deprecated `media_chunks` path) for audio/text streaming.
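
On the receive side, each 24 kHz PCM chunk has to become floating-point samples before an `AudioContext` can play it. The sketch below assumes 16-bit little-endian mono samples; `pcm16ToFloat32` and `chunkDurationSeconds` are hypothetical helper names, not the app's own.

```typescript
// Hypothetical sketch: decode 16-bit little-endian PCM bytes into Float32 samples.
const OUTPUT_SAMPLE_RATE = 24000; // receive rate stated above; the send rate is 16000

function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const sampleCount = Math.floor(bytes.byteLength / 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Divide by 32768 so the signed 16-bit range maps into [-1, 1).
    out[i] = view.getInt16(i * 2, true) / 0x8000;
  }
  return out;
}

function chunkDurationSeconds(sampleCount: number, sampleRate = OUTPUT_SAMPLE_RATE): number {
  return sampleCount / sampleRate;
}
```

Keeping the two sample rates as separate named constants is one way to avoid accidentally reusing the 16 kHz capture rate when building playback buffers.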
- Node.js 20+ (LTS recommended).
- npm.
- Hermes Agent installed and able to run `hermes gateway`.
- A Gemini API key for the Live model (`GEMINI_API_KEY`).
- macOS, Windows, or Linux with microphone permission available.
```bash
npm install
```

Create `.env` from `.env.example` and set at least `GEMINI_API_KEY`.

Make sure the Hermes API is enabled:

```bash
echo 'API_SERVER_ENABLED=true' >> ~/.hermes/.env
echo 'API_SERVER_KEY=iris-local-dev' >> ~/.hermes/.env
hermes gateway restart
```

On Windows, edit your Hermes env file manually and add the same two values:

```
API_SERVER_ENABLED=true
API_SERVER_KEY=iris-local-dev
```

Then restart the Hermes gateway.

Verify:

```bash
curl http://127.0.0.1:8642/health
```

Expected:

```json
{"status":"ok"}
```

```bash
npm run dev
```

This starts Vite and Electron with hot reload. In dev mode the macOS Dock may show the generic Electron app name, but the packaged app is named Iris.

```bash
npm start
```

This builds `dist/` and launches Electron from the built files.

If you already built once:

```bash
npm run start:prod
```

To build only:

```bash
npm run build
```

To package for macOS:

```bash
npm run package:mac
open release/mac-arm64/Iris.app
```

The app is unsigned by default. If macOS blocks it, right-click the app and choose Open once.
From Windows:

```powershell
npm install
Copy-Item .env.example .env
# edit .env and set GEMINI_API_KEY, HERMES_API_URL, API_SERVER_KEY
npm run dev
```

To create an unpacked Windows app directory:

```powershell
npm run package:win
```

To create a distributable Windows build:

```powershell
npm run dist:win
```

For the packaged Windows app, copy `.env.example` to:

```
%USERPROFILE%\.iris\.env
```

Then set `GEMINI_API_KEY` and the Hermes values there.
- W: Wake
- S: Sleep
- Top-right signal icon: live connection indicator
- Top-right hand icon: manually enables/disables camera gesture tracking
Camera/gesture behavior:
- App boot: camera is off.
- Wake (`W`): Gemini Live starts, mic capture starts, then camera/gesture control starts automatically.
- Sleep (`S`): Gemini, mic, and camera/gesture control stop.
- Point (index up): move the cursor; hold over a task card briefly to open it
- Open palm: hold-to-scroll inside Comms, Work Stream, and the open reader (high = up, low = down)
- Closed fist: close the reader
The first launch will prompt for camera permission. Frames are processed on-device by MediaPipe and never uploaded.
- The app now uses Electron/Chromium microphone capture instead of Python `pyaudio` for the main Gemini Live path. This gives better echo cancellation on laptop speakers.
- Gemini Live model: `gemini-3.1-flash-live-preview`.
- Gemini 3.1 Live function calls are synchronous, so Hermes tasks return a `run_id` immediately and finish in the background.
- Hermes remains your actual worker agent for tool-heavy tasks.
- Hand tracking uses `@mediapipe/tasks-vision` (`GestureRecognizer`) entirely on-device and starts only after wake unless manually enabled.
- `.env` is ignored by Git. Do not commit real Gemini keys or Hermes API keys.
- The default `API_SERVER_KEY=iris-local-dev` is for local development only; choose your own local key if you share the app broadly.
- The packaged app is unsigned unless you add your own Apple/Windows signing certificates.
- Licensed under the MIT License. See `LICENSE`.