A voice-driven teleprompter that runs entirely on your laptop.
Open a script, hit start and read out loud! The page scrolls under your voice, word by word, driven by NVIDIA's multilingual 600M-parameter speech model running locally. Read in English, Spanish, Japanese, or any of 35+ languages — the model detects the language automatically (or pin one with STT_LANGUAGE). Your voice never leaves the machine.

    git clone <this repo>
    cd local-teleprompter
    ./start.sh

That's it. Then open http://localhost:3000, pick a script, hit start, and read. Press Ctrl-C in the terminal to bring everything down.
On first run, start.sh will install dependencies — pulling torch and NVIDIA's NeMo toolkit takes a few minutes and several GB. Subsequent runs come up in seconds.
- Zero cloud bill. A teleprompter is a "20 minutes of mic" workload. At $0.01–0.05/min cloud STT pricing, that is $0.20–$1.00 per session, which adds up quickly with regular use. Local is free.
- No latency floor. End-of-utterance to final transcript is consistently <150ms on Apple Silicon, around 100ms on a real GPU. Word-level streaming starts within ~80ms of speech.
- Privacy. Your voice and your script never leave the machine.
- It's fun.
- macOS or Linux (Apple Silicon, Intel Mac, or NVIDIA Linux)
- Python 3.10+
- uv: install with curl -LsSf https://astral.sh/uv/install.sh | sh
- Node 20+ and pnpm: npm i -g pnpm or brew install pnpm
- Homebrew on macOS (used to install one binary if it isn't already on PATH)
A CUDA GPU is not required. On Apple Silicon the model runs on MPS automatically. CPU also works, just slower. GPU does work best though.
Scripts are stored in a local SQLite database at frontend/data/scripts.db. The DB is created on first launch and seeded with a sample script. You can:
- Create, edit, and delete scripts in the library view.
- Import any Markdown or plain text file (.md, .markdown, .txt). The first # heading becomes the title; the rest becomes the body.
- Export any script as Markdown.
The DB is just a regular SQLite file — open it with any SQLite tool, back it up by copying it, or move it elsewhere with SCRIPTS_DB_PATH=/path/to/scripts.db.
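
The import rule described above (first # heading becomes the title, the rest becomes the body) could look roughly like the sketch below. This is an illustration, not the project's actual importer: the names parseImportedScript and ImportedScript, the "Untitled" fallback for files without a heading, and the choice to treat only level-1 headings as titles are all assumptions.

```typescript
// Hypothetical sketch of the import rule; not the repository's real code.
interface ImportedScript {
  title: string;
  body: string;
}

function parseImportedScript(text: string, fallbackTitle = "Untitled"): ImportedScript {
  const lines = text.split(/\r?\n/);
  // Assumption: only a level-1 heading ("# Title") counts; "## ..." does not.
  const headingIndex = lines.findIndex((line) => /^#\s+\S/.test(line));
  if (headingIndex === -1) {
    // Assumption: a file with no heading keeps all of its text as the body.
    return { title: fallbackTitle, body: text.trim() };
  }
  const title = (lines[headingIndex] ?? "").replace(/^#\s+/, "").trim();
  // Everything except the heading line becomes the body.
  const body = lines
    .filter((_, i) => i !== headingIndex)
    .join("\n")
    .trim();
  return { title, body };
}
```

A plain .txt file with no heading would come through with the fallback title and its full text as the body under these assumptions.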
Speech recognition gives you words. Turning those into the right script position — even when you stumble, skip a sentence, or re-read a paragraph — is the interesting part. The matcher lives in frontend/lib/teleprompter/position-tracker.ts:
- Forward-only under normal flow. Each new spoken word scans an 18-word lookahead from the current cursor.
- Bigram confirmation for non-local jumps. A far match (3–17 words ahead) only commits if the previous spoken word also confirmed nearby — stops stopwords ("the", "is", "of") from yanking the cursor ten words ahead.
- Tightly-scoped fuzzy matching. Levenshtein-1, but only for words ≥ 5 characters. Short words must match exactly, so the ↔ then doesn't sneak through.
- Auto re-anchor. After 4 unmatched words in a row, scan a 6-word trailing window globally; commit a jump if ≥ 3 words align.
- Double-click a word to manually snap the cursor there. Pause freezes it; resume picks up wherever you start reading again.
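
To make the first three rules concrete, here is a minimal sketch of a forward scan with fuzzy matching and bigram confirmation. It is one reading of the bullets above, not the code in position-tracker.ts: the function names, the punctuation-only normalization, applying the 5-character rule to the shorter of the two words, and interpreting "confirmed nearby" as "the previous spoken word matches a script word just before the candidate" are all assumptions.

```typescript
// Hypothetical sketch; names and details are illustrative only.
const DEFAULT_LOOKAHEAD = 18; // script words scanned from the cursor
const NEAR_JUMP = 2; // jumps this small need no bigram confirmation
const BIGRAM_WINDOW = 3; // how far back the prior-word confirmation looks
const FUZZY_MIN_LENGTH = 5; // shorter words must match exactly

function normalize(word: string): string {
  // Simplification: lowercase and drop common punctuation only.
  return word.toLowerCase().replace(/[.,!?;:"()]/g, "");
}

/** True if a and b differ by at most one insertion, deletion, or substitution. */
function withinOneEdit(a: string, b: string): boolean {
  if (a === b) return true;
  if (Math.abs(a.length - b.length) > 1) return false;
  const short = a.length <= b.length ? a : b;
  const long = a.length <= b.length ? b : a;
  let i = 0;
  while (i < short.length && short[i] === long[i]) i++;
  // Skip the differing character: in both words for a substitution,
  // only in the longer word for an insertion or deletion.
  const restShort = short.slice(short.length === long.length ? i + 1 : i);
  return restShort === long.slice(i + 1);
}

function wordsMatch(spoken: string, scriptWord: string): boolean {
  const a = normalize(spoken);
  const b = normalize(scriptWord);
  if (a === b) return true;
  if (Math.min(a.length, b.length) < FUZZY_MIN_LENGTH) return false;
  return withinOneEdit(a, b);
}

/** Returns the new cursor (index of the next expected script word). */
function advanceCursor(
  script: string[],
  cursor: number,
  spoken: string,
  prevSpoken: string | null,
): number {
  const end = Math.min(script.length, cursor + DEFAULT_LOOKAHEAD);
  for (let j = cursor; j < end; j++) {
    if (!wordsMatch(spoken, script[j] ?? "")) continue;
    if (j - cursor <= NEAR_JUMP) return j + 1; // near match commits directly
    if (prevSpoken !== null) {
      // Far match: require the previous spoken word just before the candidate.
      for (let k = Math.max(0, j - BIGRAM_WINDOW); k < j; k++) {
        if (wordsMatch(prevSpoken, script[k] ?? "")) return j + 1;
      }
    }
    // Unconfirmed far match: ignore it and keep scanning.
  }
  return cursor; // no acceptable match, so the cursor stays put
}
```

With the cursor near the start of "the quick brown fox jumps over the lazy dog", a lone spoken "the" does not pull the cursor to the second "the", but "over" followed by "the" does, because the previous word confirms the jump.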
If the matcher feels off for your style, the constants at the top of position-tracker.ts are the levers:
| Constant | Default | What it does |
|---|---|---|
| DEFAULT_LOOKAHEAD | 18 | How many script words ahead a single spoken word can jump |
| NEAR_JUMP | 2 | Small jumps that don't need bigram confirmation |
| BIGRAM_WINDOW | 3 | How far back the prior-word confirmation looks |
| REANCHOR_MISS_THRESHOLD | 4 | Unmatched words in a row before re-anchor kicks in |
| REANCHOR_WINDOW | 6 | Trailing spoken-word window for re-anchoring |
| REANCHOR_MIN_MATCHES | 3 | Matches required to commit a re-anchor |
If you stall on mumbled words, bump NEAR_JUMP to 3. If the cursor jumps too aggressively, lower DEFAULT_LOOKAHEAD or raise REANCHOR_MIN_MATCHES.
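
For a feel of how the three REANCHOR_* constants interact, here is a rough sketch of a re-anchor pass. It is an assumption-heavy illustration rather than the project's implementation: the function name reanchor, the exact-match comparison, and reading "align" as "the same word at the same offset within the window" are all guesses about details the description leaves open.

```typescript
// Hypothetical sketch of re-anchoring; not the code in position-tracker.ts.
const REANCHOR_MISS_THRESHOLD = 4; // unmatched words in a row before re-anchoring
const REANCHOR_WINDOW = 6; // trailing spoken words to align against the script
const REANCHOR_MIN_MATCHES = 3; // aligned words needed to commit the jump

/**
 * Returns the new cursor (index of the next expected script word),
 * or null if re-anchoring should not happen yet or no position qualifies.
 */
function reanchor(
  script: string[],
  recentSpoken: string[],
  missStreak: number,
): number | null {
  if (missStreak < REANCHOR_MISS_THRESHOLD) return null;
  const tail = recentSpoken.slice(-REANCHOR_WINDOW);
  let bestStart = -1;
  let bestCount = 0;
  // Slide the spoken tail over the whole script and count position-wise hits.
  for (let start = 0; start + tail.length <= script.length; start++) {
    let count = 0;
    for (let i = 0; i < tail.length; i++) {
      if (script[start + i] === tail[i]) count++;
    }
    if (count > bestCount) {
      bestCount = count;
      bestStart = start;
    }
  }
  // Commit only when enough words line up; the cursor lands after the tail.
  return bestCount >= REANCHOR_MIN_MATCHES ? bestStart + tail.length : null;
}
```

In this reading, raising REANCHOR_MIN_MATCHES makes global jumps rarer, because more of the trailing words must line up before the cursor moves.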
Most of the interesting parts:
- frontend/lib/teleprompter/position-tracker.ts: the matching algorithm.
- frontend/components/teleprompter/: library, editor, and prompter UI.
- stt-server/server.py: the speech-to-text server.
- agent/src/: the glue that hands audio to the model and transcripts to the browser.
- Speech model: NVIDIA's nemotron-3.5-asr-streaming-0.6b, a multilingual (35+ languages) cache-aware FastConformer-RNNT.
- STT server scaffold: fastapi-nemotron-speech-streaming.
- Frontend starter: livekit-examples/agent-starter-react.
MIT — do whatever you want with it.

