
Audio Transcriber MCP Server

A local MCP server that transcribes audio files using faster-whisper and optionally identifies speakers using pyannote.audio. All processing runs on your own machine — no audio is sent to any external service.

Tools

| Tool | Description |
|---|---|
| `transcribe_audio` | Transcribe a single file → plain text |
| `transcribe_detailed` | Transcribe a single file → timestamped segments + detected language |
| `transcribe_with_speakers` | Transcribe a single file → timestamped segments with speaker labels |
| `transcribe_folder` | Transcribe all audio files in a folder → filename → transcript map |

Supported audio formats: mp3, mp4, wav, m4a, flac, ogg, wma, aac
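`transcribe_folder` presumably filters on these extensions; a minimal sketch of that filtering (the names below are illustrative, not the server's actual internals):

```python
from pathlib import Path

# Extensions listed above; matching is case-insensitive.
AUDIO_EXTS = {".mp3", ".mp4", ".wav", ".m4a", ".flac", ".ogg", ".wma", ".aac"}

def find_audio_files(folder: str) -> list[Path]:
    """Return every supported audio file in `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() in AUDIO_EXTS)
```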

Requirements

  • Python 3.12+
  • uv
  • CUDA-capable GPU (optional, but recommended)

Installation

uv sync

Configuration

Create a .env file in the project root:

cp .env.example .env   # or just create .env manually
| Variable | Description | Default |
|---|---|---|
| `WHISPER_DEVICE` | `cpu`, `cuda` (NVIDIA), or `mps` (Apple Silicon) | `cpu` |
| `WHISPER_DEFAULT_MODEL` | Whisper model size (see table below) | `small` |
| `DIARIZATION_MODEL` | HuggingFace model ID or local path to the diarization model | `pyannote/speaker-diarization-3.1` |
| `HF_TOKEN` | HuggingFace token — only needed to download the diarization model | (empty) |
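The server presumably reads these variables with a pattern like the standard-library sketch below (the real `main.py` may use python-dotenv instead; the names and defaults are copied from the table above):

```python
import os

# Defaults mirror the Configuration table; unset variables fall back to these.
WHISPER_DEVICE = os.getenv("WHISPER_DEVICE", "cpu")
WHISPER_DEFAULT_MODEL = os.getenv("WHISPER_DEFAULT_MODEL", "small")
DIARIZATION_MODEL = os.getenv("DIARIZATION_MODEL",
                              "pyannote/speaker-diarization-3.1")
HF_TOKEN = os.getenv("HF_TOKEN", "")  # empty string means "no token"
```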

Whisper model sizes

| Model | VRAM | Speed | Accuracy |
|---|---|---|---|
| `tiny` | ~0.5 GB | Fastest | Low |
| `base` | ~0.5 GB | Fast | Low |
| `small` | ~1 GB | Fast | Good |
| `medium` | ~2.5 GB | Moderate | Better |
| `large-v2` / `large-v3` | ~6 GB | Slow | Best |

GTX 1050 Ti (4 GB): use `small` or `medium`; the `large` models will run out of VRAM.
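In faster-whisper, the model size and device pair with a `compute_type`; a hedged sketch of that choice (`WhisperModel` and its `compute_type` argument are real faster-whisper API, but the helper names here are ours, not the server's):

```python
def pick_compute_type(device: str) -> str:
    # float16 roughly halves GPU VRAM use; int8 keeps CPU inference fast.
    return "float16" if device == "cuda" else "int8"

def load_whisper(model_size: str = "small", device: str = "cpu"):
    from faster_whisper import WhisperModel  # heavy dependency, kept lazy
    return WhisperModel(model_size, device=device,
                        compute_type=pick_compute_type(device))
```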

Speaker Diarization Setup

transcribe_with_speakers requires the pyannote diarization model. You can either download it on first use (needs internet access and a HuggingFace token) or download it once up front for fully offline use.

Option A — Download on first use

  1. Create a free account at huggingface.co
  2. Go to Settings → Access Tokens → New token (Role: Read)
  3. While logged in, accept the model terms on the pyannote/speaker-diarization-3.1 model page
  4. Add your token to .env:
    HF_TOKEN=hf_xxxxxxxxxxxxxxxxxx
    

The model downloads automatically on the first call to transcribe_with_speakers.
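Under the hood, loading the gated pipeline likely looks like the sketch below (`Pipeline.from_pretrained` and `use_auth_token` are real pyannote.audio API; the wrapper function itself is illustrative):

```python
def load_diarization_pipeline(model_id: str, hf_token: str = ""):
    """Load the diarization pipeline; triggers a download on first use."""
    from pyannote.audio import Pipeline  # lazy: pyannote is a heavy dependency
    # An empty token is passed as None so cached or local models still load.
    return Pipeline.from_pretrained(model_id, use_auth_token=hf_token or None)
```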

Option B — Download once, run offline

After completing steps 1–4 above, run:

uv run python download_model.py

Then update .env:

DIARIZATION_MODEL=./models/speaker-diarization-3.1
HF_TOKEN=

From this point on, no internet connection or token is needed.
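The contents of download_model.py aren't shown here; a minimal equivalent could mirror the hub repo into `./models` with `huggingface_hub.snapshot_download` (the repo ID and target path are taken from the `.env` values above — treat this as a sketch, since pyannote pipelines can reference additional gated repos):

```python
def download_diarization_model(token: str,
                               target: str = "./models/speaker-diarization-3.1"):
    from huggingface_hub import snapshot_download  # installed with pyannote
    # Mirrors pyannote/speaker-diarization-3.1 into `target` for offline loading.
    return snapshot_download("pyannote/speaker-diarization-3.1",
                             token=token, local_dir=target)
```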

Running the Server

uv run python main.py
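To call the tools from an MCP client (e.g., Claude Desktop), you would typically register the server in the client's configuration. An illustrative entry, assuming the server speaks MCP over stdio and using a placeholder absolute path:

```json
{
  "mcpServers": {
    "audio-transcriber": {
      "command": "uv",
      "args": ["--directory", "/absolute/path/to/Audio_Transcriber_MCP_Server",
               "run", "python", "main.py"]
    }
  }
}
```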

Project Structure

.
├── main.py              # MCP server and all tools
├── download_model.py    # One-time script to download diarization model for offline use
├── .env                 # Local config (git-ignored)
├── models/              # Downloaded model weights (git-ignored)
└── pyproject.toml
