Voxate

Voxate for macOS

Local, Whisper-powered dictation for any text field on macOS. Hold one key, speak, release — text appears at the cursor. No cloud, no per-app integration.

License: GPL-3.0 · Platform: macOS 13+ · Backend: whisper.cpp


Highlights

  • One key, system-wide. Works in Notes, Slack, Safari address bars, terminals — anywhere you can type.
  • Hold-to-talk or toggle. Both modes are first-class.
  • 100% local. Audio never leaves your machine.
  • Backend is swappable. whisper.cpp ships in the bundle; MLX, faster-whisper, or an HTTP server work via one config field.
  • Safe insertion. Pasteboard is snapshotted and restored.

Install

Voxate is built from source. Clone the repo, install a Whisper backend, build the menu-bar app, and grant the macOS permissions on first launch.

git clone https://github.com/Gent8/voxate.git
cd voxate

1. Install a Whisper backend

Option A — whisper.cpp (recommended; fastest, no Python)
brew install whisper-cpp
mkdir -p ~/.config/voxate/models
# base = good speed/quality default; large-v3 = most accurate.
curl -L -o ~/.config/voxate/models/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
Option B — MLX (Apple Silicon only; slightly higher quality per MB)
pip3 install --user mlx-whisper

Then in config.json:

"transcribeCommand": [
  "/usr/bin/env", "python3",
  "/ABSOLUTE/PATH/TO/voxate/scripts/transcribe_mlx.py",
  "{audio}",
  "--model", "mlx-community/whisper-base.en-mlx"
]

2. Build the menu-bar app

Create a stable local signing identity once. This stops macOS from treating every rebuild as a different app for Microphone + Accessibility:

bash scripts/setup-signing.sh

Then build and open the app:

bash scripts/bundle-app.sh
open build/Voxate.app

By default, bundle-app.sh requires the stable local signing identity. For a throwaway build that resets permissions on each rebuild:

VOXATE_SIGNING=adhoc bash scripts/bundle-app.sh

First launch will prompt for:

  • Microphone — capture audio.
  • Accessibility — listen to global keys and synthesize ⌘V.

The app opens Setup… automatically if anything important is missing — it checks the mic, Accessibility, the configured Whisper executable, and the model path. Grant permissions, install the backend/model, then click Refresh.
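
Those checks boil down to a few system calls. A minimal Swift sketch of the same idea (illustrative only, not the app's Setup code; the whisper-cli and model paths are just examples, adjust them to your install):

import AVFoundation
import ApplicationServices

// Illustrative: roughly what Setup… verifies before dictation can work.
let micGranted   = AVCaptureDevice.authorizationStatus(for: .audio) == .authorized
let axTrusted    = AXIsProcessTrusted()
let backendFound = FileManager.default.isExecutableFile(atPath: "/opt/homebrew/bin/whisper-cli")
let modelPath    = ("~/.config/voxate/models/ggml-base.bin" as NSString).expandingTildeInPath
let modelFound   = FileManager.default.fileExists(atPath: modelPath)
print("mic:", micGranted, "accessibility:", axTrusted,
      "backend:", backendFound, "model:", modelFound)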

Configure

Click the menu-bar icon → Settings… for the everyday options:

  • Trigger behavior — Hold-to-talk or toggle recording.
  • Trigger key — fn / globe, F1, F5, F6, F7, or F8.
  • Language — Auto-detect, English, Dutch, French, German, Spanish, or a custom code.
  • Sounds — Enable/disable the start/finish cues.
  • Appearance — Branded or system recording indicator (system/light/dark).
  • Insertion — Clipboard restore, smart spacing, focus-change safety.

For advanced backend changes, click Open Advanced Config… to edit ~/.config/voxate/config.json directly:

{
  "keyCode": 63,
  "modifierFlags": 0,
  "triggerMode": "hold",
  "transcribeCommand": [...],
  "language": "auto",
  "restoreClipboard": true,
  "insertionPrefix": "",
  "playSounds": true,
  "smartSpacing": true,
  "focusSafetyCheck": true,
  "recordingIndicatorStyle": "branded",
  "appearanceMode": "system"
}
Field reference
  • keyCode — macOS virtual key code. Defaults to 63 (fn / globe). Common others: 49 = Space, 122 = F1, 53 = Esc, 96 = F5.
  • modifierFlags — required modifier bitmask. 0 for none. 0x800000 = fn, 0x20000 = shift, 0x40000 = control, 0x80000 = option, 0x100000 = command (combine with |).
  • triggerMode — "hold" (push-to-talk) or "toggle" (press to start, press to stop).
  • language — "auto", "en", "nl", "fr", … Appended to the backend command as --language <lang> unless "auto".
  • restoreClipboard — keep your clipboard intact across dictations.
  • insertionPrefix — string to prepend to each insertion. Set to " " if words run into the previous one.
  • playSounds — enable/disable start, stop, and completion sounds.
  • recordingIndicatorStyle — "branded" by default, or "system" for a quieter native overlay.
  • appearanceMode — "system", "light", or "dark" for the recording indicator.

After editing, click the menu-bar icon → Reload config.
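
If you want to see how these fields are consumed, the config is plain Codable JSON. A rough Swift sketch of the shape (field names mirror the JSON keys above; the actual type in Sources/Voxate/Config.swift may differ):

import Foundation

// Illustrative only: a Codable mirror of config.json's keys.
struct VoxateConfigSketch: Codable {
    var keyCode: Int
    var modifierFlags: UInt64
    var triggerMode: String            // "hold" or "toggle"
    var transcribeCommand: [String]
    var language: String
    var restoreClipboard: Bool
    var insertionPrefix: String
    var playSounds: Bool
    var smartSpacing: Bool
    var focusSafetyCheck: Bool
    var recordingIndicatorStyle: String
    var appearanceMode: String
}

let configURL = FileManager.default.homeDirectoryForCurrentUser
    .appendingPathComponent(".config/voxate/config.json")
let data = try! Data(contentsOf: configURL)                       // fails loudly if missing
let parsed = try! JSONDecoder().decode(VoxateConfigSketch.self, from: data)
print(parsed.triggerMode, parsed.keyCode)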

Note

Language selection depends on the model. .en whisper.cpp models are English-only — use a multilingual model like ggml-base.bin or ggml-small.bin for other languages or auto-detect. If you prefer ggml-base.en.bin, set Language to English.

Use

  1. Place the cursor wherever you want text.
  2. Hold the trigger key (fn / globe by default).
  3. Speak.
  4. Release — text appears at the cursor.

In toggle mode: press once to start, press again to stop.
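
The press/release handling behind both modes is conceptually tiny. A sketch of how hold and toggle map key events to recording state (illustrative, not the app's actual state machine):

// Illustrative: how hold vs. toggle interpret the trigger key.
enum TriggerMode { case hold, toggle }

func handleTrigger(isKeyDown: Bool, mode: TriggerMode, isRecording: inout Bool) {
    switch mode {
    case .hold:
        isRecording = isKeyDown                // press starts, release stops
    case .toggle:
        if isKeyDown { isRecording.toggle() }  // each press flips the state
    }
}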

Architecture at a glance

┌──────────────────────────────────────────────────────────────────┐
│  Swift menu-bar agent  (Sources/Voxate)                          │
│  ─────────────────────────────────────────────────────────────   │
│  HotkeyManager   ── CGEventTap → press/release of trigger key    │
│  AudioRecorder   ── AVAudioEngine → 16 kHz mono WAV (temp file)  │
│  WhisperEngine   ── subprocess → user-configurable transcribe CLI│
│  TextInserter    ── pasteboard + synthesized ⌘V                  │
│  AppDelegate     ── status item, state machine, config hot-reload│
└──────────────────────────────────────────────────────────────────┘
                          ▲                              ▲
                          │                              │
                 config.json                  whisper-cli (whisper.cpp)
                                              or transcribe_mlx.py (MLX)

The Whisper backend is a subprocess, not a linked library. That lets you swap engines (whisper.cpp, mlx-whisper, faster-whisper, an HTTP server, …) by editing one array in config.json — no recompile.
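
A simplified sketch of that subprocess call, assuming the backend prints the transcript on stdout (the real WhisperEngine also supports a .out.txt file and differs in detail):

import Foundation

// Run the configured backend, substituting the temp WAV path for "{audio}",
// and return whatever the process prints on stdout.
func runTranscriber(command: [String], audioPath: String) throws -> String {
    let argv = command.map { $0 == "{audio}" ? audioPath : $0 }
    let process = Process()
    process.executableURL = URL(fileURLWithPath: argv[0])
    process.arguments = Array(argv.dropFirst())

    let stdout = Pipe()
    process.standardOutput = stdout
    try process.run()
    process.waitUntilExit()

    let output = stdout.fileHandleForReading.readDataToEndOfFile()
    return String(decoding: output, as: UTF8.self)
        .trimmingCharacters(in: .whitespacesAndNewlines)
}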

Why this design

  • Swift menu-bar app vs. pure Python — We need reliable global hotkeys, microphone capture, and synthesized keystrokes. The Cocoa APIs make this clean; Python via pyobjc is fragile around Accessibility/CGEventTap.
  • CGEventTap vs. Carbon RegisterEventHotKey — We need both press and release for hold-to-talk, and we need to listen to modifier-only keys (fn/globe = keyCode 63). Carbon hotkeys can't do either.
  • Pasteboard + ⌘V vs. AXUIElement insertion — Direct AX writes break in Electron, web views, and some Cocoa controls. Paste works everywhere a user can type. We snapshot and restore the clipboard (sketched below).
  • Subprocess Whisper vs. embedded library — Lets us defer the model/runtime choice to the user, ship no native ML dependencies, and keep the app bundle simple to sign.
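
The paste-based insertion amounts to a snapshot, a write, a synthesized ⌘V, and a delayed restore. A sketch under those assumptions (the function name and the 0.2 s delay are illustrative, not Voxate's actual TextInserter):

import Cocoa

// Illustrative sketch of paste-based insertion with clipboard restore.
func pasteAtCursor(_ text: String) {
    let pasteboard = NSPasteboard.general
    let previous = pasteboard.string(forType: .string)   // snapshot current contents

    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Synthesize ⌘V (virtual key 9 = 'v').
    let source = CGEventSource(stateID: .combinedSessionState)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: 9, keyDown: true)
    let keyUp   = CGEvent(keyboardEventSource: source, virtualKey: 9, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)

    // Let the frontmost app read the pasteboard, then put the old contents back.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.2) {
        pasteboard.clearContents()
        if let previous { pasteboard.setString(previous, forType: .string) }
    }
}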

Inspired by StageWhisper (menu-bar shape and cursor-insertion intent) and whisper-shortcut (lightweight shortcut→Whisper plumbing), but neither is imported directly — the hotkey layer here uses a CGEventTap so we get genuine key-up events for hold-to-talk, including for the fn/globe key.
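
For the curious, the core of that hotkey layer fits in a few lines. A standalone sketch (not the app's HotkeyManager) that watches the fn/globe key, keyCode 63, through a listen-only event tap; it needs the same Accessibility grant as the app:

import Cocoa

// Listen-only tap: fn/globe produces flagsChanged events, which carry
// both the press (flag set) and the release (flag cleared).
let mask = CGEventMask(1 << CGEventType.flagsChanged.rawValue)
guard let tap = CGEvent.tapCreate(
    tap: .cgSessionEventTap,
    place: .headInsertEventTap,
    options: .listenOnly,
    eventsOfInterest: mask,
    callback: { _, _, event, _ in
        if event.getIntegerValueField(.keyboardEventKeycode) == 63 {
            let pressed = event.flags.contains(.maskSecondaryFn)
            print(pressed ? "fn down: start recording" : "fn up: stop + transcribe")
        }
        return Unmanaged.passUnretained(event)
    },
    userInfo: nil
) else {
    fatalError("Could not create event tap; check the Accessibility grant")
}

let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
CFRunLoopAddSource(CFRunLoopGetCurrent(), source, .commonModes)
CGEvent.tapEnable(tap: tap, enable: true)
CFRunLoopRun()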

File layout

Package.swift                  SPM manifest (macOS 13+, single executable target)
Sources/Voxate/
  main.swift                   App entry, sets .accessory activation policy
  AppDelegate.swift            Status item, state machine, config hot-reload
  Config.swift                 Codable JSON config + ~/.config bootstrap
  HotkeyManager.swift          CGEventTap → key press/release callbacks
  AudioRecorder.swift          AVAudioEngine → 16 kHz mono int16 WAV
  WhisperEngine.swift          Subprocess transcription, stdout or .out.txt
  TextInserter.swift           Pasteboard snapshot + ⌘V + restore
scripts/
  bundle-app.sh                Builds release binary, wraps in .app w/ Info.plist
  package-release.sh           Creates local DMG/zip artifacts + SHA-256 sums
  transcribe_mlx.py            Optional MLX backend
Resources/
  config.example.json          Drop-in starter config

Limitations / known sharp edges

  • Latency = backend latency. With ggml-base.en on M-series silicon, a 3-second utterance transcribes in ~0.4–0.8s. With large-v3 expect a few seconds. There's no streaming yet — Whisper transcribes only after release.
  • Paste-based insertion means a brief pasteboard touch. We snapshot/restore, but a clipboard manager logging every change will see one entry.
  • fn / globe key is special on Apple silicon: macOS sometimes reserves it for the emoji picker or input switching. If your fn key fires the system picker first, switch to a function key (e.g., keyCode: 122 for F1) in config.
  • Signing and sandboxing. Local development builds use a self-signed identity in ~/.config/voxate/signing.keychain-db so macOS can keep Accessibility + Microphone grants stable across rebuilds. The app is still unsandboxed. Public distribution needs Developer ID signing, notarization, and explicit entitlements.
  • No streaming partial results. Easy future addition: switch the backend to whisper-stream and pipe partials into TextInserter.

Troubleshooting

The recording indicator (HUD) does not appear when you press the trigger key

Usually the global hotkey path isn't active yet. Open Setup… from the menu-bar app and check Accessibility first. If System Settings shows the app enabled but the trigger still does nothing, reset the stale grant and reopen the stable-signed local build:

tccutil reset Accessibility dev.local.voxate
open build/Voxate.app

Then enable Voxate again in System Settings → Privacy & Security → Accessibility. This often happens after switching from ad-hoc to the local stable signer.

For deeper hotkey debugging, quit the app and launch it directly:

VOXATE_DEBUG=1 build/Voxate.app/Contents/MacOS/Voxate

The debug logs include Accessibility trust, configured key code, event-tap startup, and raw key events received by the tap.

Local release artifacts

bash scripts/package-release.sh

Artifacts land in dist/:

  • Voxate-0.1.0.dmg
  • Voxate-0.1.0.zip
  • SHA256SUMS

These are for local / open-source testing. Public macOS distribution still needs Developer ID signing and notarization.

Closeness to the "native dictation" feel

What works today:

  • One configurable global key, system-wide, in any text-input app.
  • Hold-to-talk and toggle-to-talk modes.
  • Text arrives at the cursor with no app-specific integration.
  • Clipboard is preserved by default.
  • Backend is swappable; multiple languages supported via the language field.

What's missing vs. macOS built-in dictation:

  • No live partial transcription while you speak.
  • No on-the-fly punctuation commands ("new line", "comma").
  • No per-app text formatting heuristics (capitalize-after-period is whatever Whisper produces).

The first two are tractable follow-ups against the same architecture — both live in WhisperEngine + TextInserter and need no changes elsewhere.

Contributing

Issues and PRs are welcome. A few ground rules to keep things smooth:

  • Bug reports — include macOS version, Apple silicon vs Intel, the Whisper backend you're using, and the relevant lines from VOXATE_DEBUG=1 build/Voxate.app/Contents/MacOS/Voxate.
  • PRs — keep them small and focused; one logical change per PR. Run a clean build (bash scripts/bundle-app.sh) before opening.
  • Scope — Voxate is intentionally narrow: hold-to-talk → paste-at-cursor. Features outside that loop (LLM rewriting, command modes, per-app heuristics) are unlikely to land in core.

Security

If you find a vulnerability, please follow the disclosure process in SECURITY.md rather than opening a public issue.

License

Released under the GPL-3.0 License.
