A local MCP server that transcribes audio files using faster-whisper and optionally identifies speakers using pyannote.audio. All processing runs on your own machine — no audio is sent to any external service.
| Tool | Description |
|---|---|
| `transcribe_audio` | Transcribe a single file → plain text |
| `transcribe_detailed` | Transcribe a single file → timestamped segments + detected language |
| `transcribe_with_speakers` | Transcribe a single file → timestamped segments with speaker labels |
| `transcribe_folder` | Transcribe all audio files in a folder → filename → transcript map |
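As a rough sketch of how a tool like `transcribe_with_speakers` can combine the two models' outputs, a common approach is to label each transcript segment with the diarization speaker whose turn overlaps it most. This is a hypothetical helper, not the server's actual code:

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of (start, end, text) tuples from the transcriber
    turns:    list of (start, end, speaker) tuples from the diarizer
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Overlap of the two time intervals; negative means no overlap.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled

segments = [(0.0, 2.0, "Hello."), (2.0, 5.0, "Hi, how are you?")]
turns = [(0.0, 1.9, "SPEAKER_00"), (1.9, 5.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

Greedy maximum-overlap assignment is simple and works well when turns and segments roughly align; boundary disagreements of a fraction of a second simply go to whichever speaker held more of the segment.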
Supported audio formats: mp3, mp4, wav, m4a, flac, ogg, wma, aac
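A folder scan like the one `transcribe_folder` performs can be sketched with the standard library alone (the function name is illustrative, not the server's):

```python
from pathlib import Path

# Extensions listed as supported above.
AUDIO_EXTS = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".wma", ".aac"}

def audio_files(folder):
    """Return supported audio files in a folder, sorted by name.

    Extension matching is case-insensitive, so "clip.WAV" is included.
    """
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in AUDIO_EXTS)
```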
- Python 3.12+
- uv
- CUDA-capable GPU (optional, but recommended)
Install dependencies:

```bash
uv sync
```

Create a `.env` file in the project root:

```bash
cp .env.example .env   # or just create .env manually
```

| Variable | Description | Default |
|---|---|---|
| `WHISPER_DEVICE` | `cpu`, `cuda` (NVIDIA), or `mps` (Apple Silicon) | `cpu` |
| `WHISPER_DEFAULT_MODEL` | Whisper model size (see table below) | `small` |
| `DIARIZATION_MODEL` | HuggingFace model ID or local path to the diarization model | `pyannote/speaker-diarization-3.1` |
| `HF_TOKEN` | HuggingFace token, only needed to download the diarization model | (empty) |
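Putting the variables together, a complete `.env` for an NVIDIA GPU might look like this (values are illustrative):

```
WHISPER_DEVICE=cuda
WHISPER_DEFAULT_MODEL=small
DIARIZATION_MODEL=pyannote/speaker-diarization-3.1
HF_TOKEN=
```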
| Model | VRAM | Speed | Accuracy |
|---|---|---|---|
| `tiny` | ~0.5 GB | Fastest | Low |
| `base` | ~0.5 GB | Fast | Low |
| `small` | ~1 GB | Fast | Good |
| `medium` | ~2.5 GB | Moderate | Better |
| `large-v2` / `large-v3` | ~6 GB | Slow | Best |
GTX 1050 Ti (4 GB): use `small` or `medium`. The `large` models will run out of VRAM.
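One way the server might resolve these settings is to read the environment variables with the documented defaults, and pair the device with a quantization level (float16 on GPU, int8 on CPU is a common faster-whisper pairing). The helper name and the compute-type choice are assumptions, not the server's actual code:

```python
import os

def whisper_config():
    """Resolve device, model size, and compute type from the environment.

    Defaults mirror the configuration table above. The compute type is a
    typical faster-whisper choice: float16 on CUDA, int8 elsewhere.
    """
    device = os.environ.get("WHISPER_DEVICE", "cpu")           # cpu | cuda | mps
    model = os.environ.get("WHISPER_DEFAULT_MODEL", "small")   # see model table
    compute_type = "float16" if device == "cuda" else "int8"
    return device, model, compute_type
```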
`transcribe_with_speakers` requires the pyannote diarization model. You can either let it download on first use (requires internet and a token) or download it once up front for fully offline use.
- Create a free account at huggingface.co
- Go to Settings → Access Tokens → New token (Role: Read)
- While logged in, accept the terms for `pyannote/speaker-diarization-3.1`
- Add your token to `.env`: `HF_TOKEN=hf_xxxxxxxxxxxxxxxxxx`
The model downloads automatically on the first call to transcribe_with_speakers.
After completing steps 1–4 above, run:
```bash
uv run python download_model.py
```

Then update `.env`:

```
DIARIZATION_MODEL=./models/speaker-diarization-3.1
HF_TOKEN=
```
From this point on, no internet connection or token is needed.
```bash
uv run python main.py
```
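To use the tools from an MCP client, register the server in the client's configuration. A sketch for a client that launches local servers from a JSON config (the server name and repo path are placeholders; adjust both to your setup):

```json
{
  "mcpServers": {
    "whisper": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/this/repo", "python", "main.py"]
    }
  }
}
```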
```
├── main.py              # MCP server and all tools
├── download_model.py    # One-time script to download diarization model for offline use
├── .env                 # Local config (git-ignored)
├── models/              # Downloaded model weights (git-ignored)
└── pyproject.toml
```