Voxate

Voxate for macOS

Local, Whisper-powered dictation for any text field on macOS. Hold one key, speak, release — text appears at the cursor. No cloud, no per-app integration.

License: GPL-3.0 · Platform: macOS 13+ · Backend: whisper.cpp


Highlights

  • One key, system-wide. Works in Notes, Slack, Safari address bars, terminals — anywhere you can type.
  • Hold-to-talk or toggle. Both modes are first-class.
  • 100% local. Audio never leaves your machine.
  • Backend is swappable. whisper.cpp ships in the bundle; MLX, faster-whisper, or an HTTP server work via one config field.
  • Safe insertion. Pasteboard is snapshotted and restored.

Install

Voxate is built from source. Clone the repo, install a Whisper backend, build the menu-bar app, and grant the macOS permissions on first launch.

git clone https://github.com/Gent8/voxate.git
cd voxate

1. Install a Whisper backend

Option A — whisper.cpp (recommended; fastest, no Python)
brew install whisper-cpp
mkdir -p ~/.config/voxate/models
# base = good speed/quality default; large-v3 = most accurate.
curl -L -o ~/.config/voxate/models/ggml-base.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
Option B — MLX (Apple Silicon only; slightly higher quality per MB)
pip3 install --user mlx-whisper

Then in config.json:

"transcribeCommand": [
  "/usr/bin/env", "python3",
  "/ABSOLUTE/PATH/TO/voxate/scripts/transcribe_mlx.py",
  "{audio}",
  "--model", "mlx-community/whisper-base.en-mlx"
]

2. Build the menu-bar app

Create a stable local signing identity once. This stops macOS from treating every rebuild as a different app for Microphone + Accessibility:

bash scripts/setup-signing.sh

Then build and open the app:

bash scripts/bundle-app.sh
open build/Voxate.app

By default, bundle-app.sh requires the stable local signing identity. For a throwaway build that resets permissions on each rebuild:

VOXATE_SIGNING=adhoc bash scripts/bundle-app.sh

First launch will prompt for:

  • Microphone — capture audio.
  • Accessibility — listen to global keys and synthesize ⌘V.

The app opens Setup… automatically if anything important is missing — it checks the mic, Accessibility, the configured Whisper executable, and the model path. Grant permissions, install the backend/model, then click Refresh.
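
Those checks boil down to a few system calls. A minimal Swift sketch of the same idea (illustrative only, not the app's Setup code; the whisper-cli and model paths are just examples, adjust them to your install):

import AVFoundation
import ApplicationServices

// Illustrative: roughly what Setup… verifies before dictation can work.
let micGranted   = AVCaptureDevice.authorizationStatus(for: .audio) == .authorized
let axTrusted    = AXIsProcessTrusted()
let backendFound = FileManager.default.isExecutableFile(atPath: "/opt/homebrew/bin/whisper-cli")
let modelPath    = ("~/.config/voxate/models/ggml-base.bin" as NSString).expandingTildeInPath
let modelFound   = FileManager.default.fileExists(atPath: modelPath)
print("mic:", micGranted, "accessibility:", axTrusted,
      "backend:", backendFound, "model:", modelFound)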

Configure

Click the menu-bar icon → Settings… for the everyday options:

  • Trigger behavior — Hold-to-talk or toggle recording.
  • Trigger key — fn / globe, F1, F5, F6, F7, or F8.
  • Language — Auto-detect, English, Dutch, French, German, Spanish, or a custom code.
  • Sounds — Enable/disable the start/finish cues.
  • Appearance — Branded or system recording indicator (system/light/dark).
  • Insertion — Clipboard restore, smart spacing, focus-change safety.

For advanced backend changes, click Open Advanced Config… to edit ~/.config/voxate/config.json directly:

{
  "keyCode": 63,
  "modifierFlags": 0,
  "triggerMode": "hold",
  "transcribeCommand": [...],
  "language": "auto",
  "restoreClipboard": true,
  "insertionPrefix": "",
  "playSounds": true,
  "smartSpacing": true,
  "focusSafetyCheck": true,
  "recordingIndicatorStyle": "branded",
  "appearanceMode": "system"
}
Field reference
  • keyCode — macOS virtual key code. Defaults to 63 (fn / globe). Common others: 49 = Space, 122 = F1, 53 = Esc, 96 = F5.
  • modifierFlags — required modifier bitmask. 0 for none. 0x800000 = fn, 0x20000 = shift, 0x40000 = control, 0x80000 = option, 0x100000 = command (combine with |).
  • triggerMode — "hold" (push-to-talk) or "toggle" (press to start, press to stop).
  • language — "auto", "en", "nl", "fr", … Appended to the backend command as --language <lang> unless "auto".
  • restoreClipboard — keep your clipboard intact across dictations.
  • insertionPrefix — string to prepend to each insertion. Set to " " if words run into the previous one.
  • playSounds — enable/disable start, stop, and completion sounds.
  • recordingIndicatorStyle — "branded" by default, or "system" for a quieter native overlay.
  • appearanceMode — "system", "light", or "dark" for the recording indicator.

After editing, click the menu-bar icon → Reload config.
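
If you want to see how these fields are consumed, the config is plain Codable JSON. A rough Swift sketch of the shape (field names mirror the JSON keys above; the actual type in Sources/Voxate/Config.swift may differ):

import Foundation

// Illustrative only: a Codable mirror of config.json's keys.
struct VoxateConfigSketch: Codable {
    var keyCode: Int
    var modifierFlags: UInt64
    var triggerMode: String            // "hold" or "toggle"
    var transcribeCommand: [String]
    var language: String
    var restoreClipboard: Bool
    var insertionPrefix: String
    var playSounds: Bool
    var smartSpacing: Bool
    var focusSafetyCheck: Bool
    var recordingIndicatorStyle: String
    var appearanceMode: String
}

let configURL = FileManager.default.homeDirectoryForCurrentUser
    .appendingPathComponent(".config/voxate/config.json")
let data = try! Data(contentsOf: configURL)                       // fails loudly if missing
let parsed = try! JSONDecoder().decode(VoxateConfigSketch.self, from: data)
print(parsed.triggerMode, parsed.keyCode)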

Note

Language selection depends on the model. .en whisper.cpp models are English-only — use a multilingual model like ggml-base.bin or ggml-small.bin for other languages or auto-detect. If you prefer ggml-base.en.bin, set Language to English.

Use

  1. Place the cursor wherever you want text.
  2. Hold the trigger key (fn / globe by default).
  3. Speak.
  4. Release — text appears at the cursor.

In toggle mode: press once to start, press again to stop.
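
The press/release handling behind both modes is conceptually tiny. A sketch of how hold and toggle map key events to recording state (illustrative, not the app's actual state machine):

// Illustrative: how hold vs. toggle interpret the trigger key.
enum TriggerMode { case hold, toggle }

func handleTrigger(isKeyDown: Bool, mode: TriggerMode, isRecording: inout Bool) {
    switch mode {
    case .hold:
        isRecording = isKeyDown                // press starts, release stops
    case .toggle:
        if isKeyDown { isRecording.toggle() }  // each press flips the state
    }
}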

Architecture at a glance

┌──────────────────────────────────────────────────────────────────┐
│  Swift menu-bar agent  (Sources/Voxate)                          │
│  ─────────────────────────────────────────────────────────────   │
│  HotkeyManager   ── CGEventTap → press/release of trigger key    │
│  AudioRecorder   ── AVAudioEngine → 16 kHz mono WAV (temp file)  │
│  WhisperEngine   ── subprocess → user-configurable transcribe CLI│
│  TextInserter    ── pasteboard + synthesized ⌘V                  │
│  AppDelegate     ── status item, state machine, config hot-reload│
└──────────────────────────────────────────────────────────────────┘
                          ▲                              ▲
                          │                              │
                 config.json                  whisper-cli (whisper.cpp)
                                              or transcribe_mlx.py (MLX)

The Whisper backend is a subprocess, not a linked library. That lets you swap engines (whisper.cpp, mlx-whisper, faster-whisper, an HTTP server, …) by editing one array in config.json — no recompile.
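
A simplified sketch of that subprocess call, assuming the backend prints the transcript on stdout (the real WhisperEngine also supports a .out.txt file and differs in detail):

import Foundation

// Run the configured backend, substituting the temp WAV path for "{audio}",
// and return whatever the process prints on stdout.
func runTranscriber(command: [String], audioPath: String) throws -> String {
    let argv = command.map { $0 == "{audio}" ? audioPath : $0 }
    let process = Process()
    process.executableURL = URL(fileURLWithPath: argv[0])
    process.arguments = Array(argv.dropFirst())

    let stdout = Pipe()
    process.standardOutput = stdout
    try process.run()
    process.waitUntilExit()

    let output = stdout.fileHandleForReading.readDataToEndOfFile()
    return String(decoding: output, as: UTF8.self)
        .trimmingCharacters(in: .whitespacesAndNewlines)
}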

Why this design

  • Swift menu-bar app vs. pure Python — We need reliable global hotkeys, microphone capture, and synthesized keystrokes. The Cocoa APIs make this clean; Python via pyobjc is fragile around Accessibility/CGEventTap.
  • CGEventTap vs. Carbon RegisterEventHotKey — We need both press and release for hold-to-talk, and we need to listen to modifier-only keys (fn/globe = keyCode 63). Carbon hotkeys can't do either.
  • Pasteboard + ⌘V vs. AXUIElement insertion — Direct AX writes break in Electron, web views, and some Cocoa controls. Paste works everywhere a user can type. We snapshot and restore the clipboard (sketched below).
  • Subprocess Whisper vs. embedded library — Lets us defer the model/runtime choice to the user, ship no native ML dependencies, and keep the app bundle simple to sign.
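
The paste-based insertion amounts to a snapshot, a write, a synthesized ⌘V, and a delayed restore. A sketch under those assumptions (the function name and the 0.2 s delay are illustrative, not Voxate's actual TextInserter):

import Cocoa

// Illustrative sketch of paste-based insertion with clipboard restore.
func pasteAtCursor(_ text: String) {
    let pasteboard = NSPasteboard.general
    let previous = pasteboard.string(forType: .string)   // snapshot current contents

    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Synthesize ⌘V (virtual key 9 = 'v').
    let source = CGEventSource(stateID: .combinedSessionState)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: 9, keyDown: true)
    let keyUp   = CGEvent(keyboardEventSource: source, virtualKey: 9, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)

    // Let the frontmost app read the pasteboard, then put the old contents back.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.2) {
        pasteboard.clearContents()
        if let previous { pasteboard.setString(previous, forType: .string) }
    }
}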

Inspired by StageWhisper (menu-bar shape and cursor-insertion intent) and whisper-shortcut (lightweight shortcut→Whisper plumbing), but neither is imported directly — the hotkey layer here uses a CGEventTap so we get genuine key-up events for hold-to-talk, including for the fn/globe key.
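
For the curious, the core of that hotkey layer fits in a few lines. A standalone sketch (not the app's HotkeyManager) that watches the fn/globe key, keyCode 63, through a listen-only event tap; it needs the same Accessibility grant as the app:

import Cocoa

// Listen-only tap: fn/globe produces flagsChanged events, which carry
// both the press (flag set) and the release (flag cleared).
let mask = CGEventMask(1 << CGEventType.flagsChanged.rawValue)
guard let tap = CGEvent.tapCreate(
    tap: .cgSessionEventTap,
    place: .headInsertEventTap,
    options: .listenOnly,
    eventsOfInterest: mask,
    callback: { _, _, event, _ in
        if event.getIntegerValueField(.keyboardEventKeycode) == 63 {
            let pressed = event.flags.contains(.maskSecondaryFn)
            print(pressed ? "fn down: start recording" : "fn up: stop + transcribe")
        }
        return Unmanaged.passUnretained(event)
    },
    userInfo: nil
) else {
    fatalError("Could not create event tap; check the Accessibility grant")
}

let source = CFMachPortCreateRunLoopSource(kCFAllocatorDefault, tap, 0)
CFRunLoopAddSource(CFRunLoopGetCurrent(), source, .commonModes)
CGEvent.tapEnable(tap: tap, enable: true)
CFRunLoopRun()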

File layout

Package.swift                  SPM manifest (macOS 13+, single executable target)
Sources/Voxate/
  main.swift                   App entry, sets .accessory activation policy
  AppDelegate.swift            Status item, state machine, config hot-reload
  Config.swift                 Codable JSON config + ~/.config bootstrap
  HotkeyManager.swift          CGEventTap → key press/release callbacks
  AudioRecorder.swift          AVAudioEngine → 16 kHz mono int16 WAV
  WhisperEngine.swift          Subprocess transcription, stdout or .out.txt
  TextInserter.swift           Pasteboard snapshot + ⌘V + restore
scripts/
  bundle-app.sh                Builds release binary, wraps in .app w/ Info.plist
  package-release.sh           Creates local DMG/zip artifacts + SHA-256 sums
  transcribe_mlx.py            Optional MLX backend
Resources/
  config.example.json          Drop-in starter config

Limitations / known sharp edges

  • Latency = backend latency. With ggml-base.en on M-series silicon, a 3-second utterance transcribes in ~0.4–0.8s. With large-v3 expect a few seconds. There's no streaming yet — Whisper transcribes only after release.
  • Paste-based insertion means a brief pasteboard touch. We snapshot/restore, but a clipboard manager logging every change will see one entry.
  • fn / globe key is special on Apple silicon: macOS sometimes reserves it for the emoji picker or input switching. If your fn key fires the system picker first, switch to a function key (e.g., keyCode: 122 for F1) in config.
  • Signing and sandboxing. Local development builds use a self-signed identity in ~/.config/voxate/signing.keychain-db so macOS can keep Accessibility + Microphone grants stable across rebuilds. The app is still unsandboxed. Public distribution needs Developer ID signing, notarization, and explicit entitlements.
  • No streaming partial results. Easy future addition: switch the backend to whisper-stream and pipe partials into TextInserter.

Troubleshooting

The recording indicator (HUD) does not appear when you press the trigger key

Usually the global hotkey path isn't active yet. Open Setup… from the menu-bar app and check Accessibility first. If System Settings shows the app enabled but the trigger still does nothing, reset the stale grant and reopen the stable-signed local build:

tccutil reset Accessibility dev.local.voxate
open build/Voxate.app

Then enable Voxate again in System Settings → Privacy & Security → Accessibility. This often happens after switching from ad-hoc to the local stable signer.

For deeper hotkey debugging, quit the app and launch it directly:

VOXATE_DEBUG=1 build/Voxate.app/Contents/MacOS/Voxate

The debug logs include Accessibility trust, configured key code, event-tap startup, and raw key events received by the tap.

Local release artifacts

bash scripts/package-release.sh

Artifacts land in dist/:

  • Voxate-0.1.0.dmg
  • Voxate-0.1.0.zip
  • SHA256SUMS

These are for local / open-source testing. Public macOS distribution still needs Developer ID signing and notarization.

Closeness to the "native dictation" feel

What works today:

  • One configurable global key, system-wide, in any text-input app.
  • Hold-to-talk and toggle-to-talk modes.
  • Text arrives at the cursor with no app-specific integration.
  • Clipboard is preserved by default.
  • Backend is swappable; multiple languages supported via the language field.

What's missing vs. macOS built-in dictation:

  • No live partial transcription while you speak.
  • No on-the-fly punctuation commands ("new line", "comma").
  • No per-app text formatting heuristics (capitalize-after-period is whatever Whisper produces).

The first two are tractable follow-ups against the same architecture — both live in WhisperEngine + TextInserter and need no changes elsewhere.

Contributing

Issues and PRs are welcome. A few ground rules to keep things smooth:

  • Bug reports — include macOS version, Apple silicon vs Intel, the Whisper backend you're using, and the relevant lines from VOXATE_DEBUG=1 build/Voxate.app/Contents/MacOS/Voxate.
  • PRs — keep them small and focused; one logical change per PR. Run a clean build (bash scripts/bundle-app.sh) before opening.
  • Scope — Voxate is intentionally narrow: hold-to-talk → paste-at-cursor. Features outside that loop (LLM rewriting, command modes, per-app heuristics) are unlikely to land in core.

Security

If you find a vulnerability, please follow the disclosure process in SECURITY.md rather than opening a public issue.

License

Released under the GPL-3.0 License.
