A local MCP server that transcribes audio files using faster-whisper and optionally identifies speakers using pyannote.audio. All processing runs on your own machine — no audio is sent to any external service.
| Tool | Description |
|---|---|
| `transcribe_audio` | Transcribe a single file → plain text |
| `transcribe_detailed` | Transcribe a single file → timestamped segments + detected language |
| `transcribe_with_speakers` | Transcribe a single file → timestamped segments with speaker labels |
| `transcribe_folder` | Transcribe all audio files in a folder → filename → transcript map |
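As a rough sketch of how a tool like `transcribe_with_speakers` can combine the two models' outputs, a common approach is to label each transcript segment with the diarization speaker whose turn overlaps it most. This is a hypothetical helper, not the server's actual code:

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of (start, end, text) tuples from the transcriber
    turns:    list of (start, end, speaker) tuples from the diarizer
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Overlap of the two time intervals; negative means no overlap.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled

segments = [(0.0, 2.0, "Hello."), (2.0, 5.0, "Hi, how are you?")]
turns = [(0.0, 1.9, "SPEAKER_00"), (1.9, 5.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

Greedy maximum-overlap assignment is simple and works well when turns and segments roughly align; boundary disagreements of a fraction of a second simply go to whichever speaker held more of the segment.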
Supported audio formats: mp3, mp4, wav, m4a, flac, ogg, wma, aac
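A folder scan like the one `transcribe_folder` performs can be sketched with the standard library alone (the function name is illustrative, not the server's):

```python
from pathlib import Path

# Extensions listed as supported above.
AUDIO_EXTS = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".wma", ".aac"}

def audio_files(folder):
    """Return supported audio files in a folder, sorted by name.

    Extension matching is case-insensitive, so "clip.WAV" is included.
    """
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in AUDIO_EXTS)
```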
- Python 3.12+
- uv
- CUDA-capable GPU (optional, but recommended)
Install dependencies:

```bash
uv sync
```

Create a `.env` file in the project root:

```bash
cp .env.example .env   # or just create .env manually
```

| Variable | Description | Default |
|---|---|---|
| `WHISPER_DEVICE` | `cpu`, `cuda` (NVIDIA), or `mps` (Apple Silicon) | `cpu` |
| `WHISPER_DEFAULT_MODEL` | Whisper model size (see table below) | `small` |
| `DIARIZATION_MODEL` | HuggingFace model ID or local path to the diarization model | `pyannote/speaker-diarization-3.1` |
| `HF_TOKEN` | HuggingFace token, only needed to download the diarization model | (empty) |
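Putting the variables together, a complete `.env` for an NVIDIA GPU might look like this (values are illustrative):

```
WHISPER_DEVICE=cuda
WHISPER_DEFAULT_MODEL=small
DIARIZATION_MODEL=pyannote/speaker-diarization-3.1
HF_TOKEN=
```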
| Model | VRAM | Speed | Accuracy |
|---|---|---|---|
| `tiny` | ~0.5 GB | Fastest | Low |
| `base` | ~0.5 GB | Fast | Low |
| `small` | ~1 GB | Fast | Good |
| `medium` | ~2.5 GB | Moderate | Better |
| `large-v2` / `large-v3` | ~6 GB | Slow | Best |
GTX 1050 Ti (4 GB): use `small` or `medium`. The `large` models will run out of VRAM.
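One way the server might resolve these settings is to read the environment variables with the documented defaults, and pair the device with a quantization level (float16 on GPU, int8 on CPU is a common faster-whisper pairing). The helper name and the compute-type choice are assumptions, not the server's actual code:

```python
import os

def whisper_config():
    """Resolve device, model size, and compute type from the environment.

    Defaults mirror the configuration table above. The compute type is a
    typical faster-whisper choice: float16 on CUDA, int8 elsewhere.
    """
    device = os.environ.get("WHISPER_DEVICE", "cpu")           # cpu | cuda | mps
    model = os.environ.get("WHISPER_DEFAULT_MODEL", "small")   # see model table
    compute_type = "float16" if device == "cuda" else "int8"
    return device, model, compute_type
```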
`transcribe_with_speakers` requires the pyannote diarization model. You can either let it download on first use (requires internet and a token) or download it once up front for fully offline use.
- Create a free account at huggingface.co
- Go to Settings → Access Tokens → New token (Role: Read)
- While logged in, accept the terms for `pyannote/speaker-diarization-3.1`
- Add your token to `.env`: `HF_TOKEN=hf_xxxxxxxxxxxxxxxxxx`
The model downloads automatically on the first call to transcribe_with_speakers.
After completing steps 1–4 above, run:
```bash
uv run python download_model.py
```

Then update `.env`:

```
DIARIZATION_MODEL=./models/speaker-diarization-3.1
HF_TOKEN=
```
From this point on, no internet connection or token is needed.
```bash
uv run python main.py
```
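To use the tools from an MCP client, register the server in the client's configuration. A sketch for a client that launches local servers from a JSON config (the server name and repo path are placeholders; adjust both to your setup):

```json
{
  "mcpServers": {
    "whisper": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/this/repo", "python", "main.py"]
    }
  }
}
```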
```
├── main.py              # MCP server and all tools
├── download_model.py    # One-time script to download diarization model for offline use
├── .env                 # Local config (git-ignored)
├── models/              # Downloaded model weights (git-ignored)
└── pyproject.toml
```