# Spectrum

Spectrum is a Python microservice for speaker diarization and transcription, built on OpenAI's Whisper model (via WhisperX) and pyannote.audio. It provides RESTful APIs to transcribe audio files and identify the different speakers in a recording.
## Features

- **Audio Transcription**: Convert audio files to text using WhisperX (OpenAI Whisper)
- **Speaker Diarization**: Identify and separate different speakers in audio recordings
- **Multiple Input Sources**: Accept base64-encoded files or presigned URLs
- **Language Detection**: Automatic language detection for transcribed audio
- **RESTful API**: Simple HTTP endpoints for easy integration
- **Docker Support**: Containerized deployment with Docker
## Prerequisites

- Python 3.8 or higher
- CUDA-capable GPU (recommended for better performance)
- Hugging Face account with an access token
- AWS credentials (if using S3 presigned URLs)
## Installation

1. Clone the repository:

```bash
git clone <repository-url>
cd spectrum
```

2. Create and activate a virtual environment:

```bash
python3 -m venv spectrumEnv
source spectrumEnv/bin/activate  # On Windows: spectrumEnv\Scripts\activate
```

3. Install the dependencies:

```bash
pip install -r requirements.txt
```

Note: Installation may take some time, as the requirements include PyTorch and WhisperX.
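Since performance differs sharply between GPU and CPU, it is worth confirming after installation that PyTorch can see a CUDA device; a quick check:

```python
import torch

# The models fall back to CPU if no CUDA device is visible.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```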
### FFmpeg

FFmpeg is required for audio processing.

**macOS:**

```bash
brew install ffmpeg
```

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

**Windows:** Download a build from the [FFmpeg official website](https://ffmpeg.org/download.html).
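If audio processing fails later, a common cause is FFmpeg missing from the PATH (see Troubleshooting below); you can verify it from Python:

```python
import shutil

# Prints the resolved ffmpeg path, or a warning if it is not on PATH.
print(shutil.which("ffmpeg") or "ffmpeg not found on PATH")
```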
## Configuration

1. Navigate to the `config` directory
2. Edit `config.json` with your credentials:

```json
{
    "HF_TOKEN": "your_huggingface_token_here",
    "aws_access_key_id": "your_aws_access_key",
    "aws_secret_access_key": "your_aws_secret_key",
    "region_name": "your_aws_region"
}
```

### Getting a Hugging Face Token

1. Create an account at [Hugging Face](https://huggingface.co)
2. Go to Settings → Access Tokens
3. Create a new token with read permissions
4. Accept the terms of use for the pyannote/speaker-diarization model
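With `config.json` in place, here is a minimal sketch of how the credentials can be loaded. The actual helper lives in `utils/configUtil.py`; the function below is illustrative, not the real API:

```python
import json
from pathlib import Path

def load_config(path: str = "config/config.json") -> dict:
    """Read the JSON credentials file shown above (hypothetical helper)."""
    with Path(path).open() as f:
        return json.load(f)

config = load_config()
hf_token = config["HF_TOKEN"]  # needed to download the pyannote models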
## Running the Service

### Development

```bash
python wsgi.py
```

The service will start on http://localhost:5000.

### Production

```bash
gunicorn wsgi:app -b 0.0.0.0:8081
```

The service will start on http://0.0.0.0:8081.

Note: A GPU instance is recommended; the models are computationally intensive and will be slow on CPU.
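Once the service is up, a quick smoke test against the health endpoint (the port here matches the gunicorn setup above; adjust to match yours):

```python
import requests

# Expect HTTP 200 with the body "I am healthy!"
resp = requests.get("http://localhost:8081/healthCheck")
print(resp.status_code, resp.text)
```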
## API Endpoints

### POST /transcription/transcribe

Transcribe an audio file to text.

**Request Body:**
```json
{
    "source": "base64_encoded_file" | "presigned_url",
    "base64_encoded_file": "base64_encoded_audio_string",  // Required if source is "base64_encoded_file"
    "presigned_url": "https://s3.amazonaws.com/..."        // Required if source is "presigned_url"
}
```

**Response:**
```json
{
    "success": 1,
    "transcription": "Transcribed text here...",
    "language": "en"
}
```

**Error Response:**
```json
{
    "success": 0,
    "errorMessage": "Error message here"
}
```
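Because failures are signaled through the `success` field, callers should check it before reading `transcription`; a minimal sketch (the URL is a placeholder):

```python
import requests

resp = requests.post(
    "http://localhost:8081/transcription/transcribe",
    json={"source": "presigned_url", "presigned_url": "https://example.com/audio.mp3"},
)
body = resp.json()
if body.get("success") == 1:
    print(body["language"], body["transcription"])
else:
    raise RuntimeError(body.get("errorMessage", "transcription failed"))
```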
### POST /diarization/diarize

Transcribe audio and identify different speakers.

**Request Body:**
```json
{
    "source": "base64_encoded_file" | "presigned_url",
    "base64_encoded_file": "base64_encoded_audio_string",  // Required if source is "base64_encoded_file"
    "presigned_url": "https://s3.amazonaws.com/...",       // Required if source is "presigned_url"
    "min_speakers": 2,  // Optional, default: 2
    "max_speakers": 2   // Optional, default: 2
}
```

**Response:**
```json
{
    "success": 1,
    "diarization": [
        {
            "start": 0.0,
            "end": 5.2,
            "text": "Hello, this is speaker one.",
            "speaker": "SPEAKER_00"
        },
        {
            "start": 5.2,
            "end": 10.5,
            "text": "And this is speaker two.",
            "speaker": "SPEAKER_01"
        }
    ],
    "language": "en"
}
```

**Error Response:**
```json
{
    "success": 0,
    "errorMessage": "Error message here"
}
```
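The segment list is easy to post-process; for example, this sketch renders the response above as a speaker-labeled transcript:

```python
# Segments as returned in the "diarization" field of the response above.
segments = [
    {"start": 0.0, "end": 5.2, "text": "Hello, this is speaker one.", "speaker": "SPEAKER_00"},
    {"start": 5.2, "end": 10.5, "text": "And this is speaker two.", "speaker": "SPEAKER_01"},
]

for seg in segments:
    print(f"[{seg['start']:5.1f}s-{seg['end']:5.1f}s] {seg['speaker']}: {seg['text']}")
```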
### GET /healthCheck

Check if the service is running.

**Response:**

```
I am healthy!
```
## Docker Deployment

Build the image:

```bash
docker build -t spectrum:latest .
```

Run the container:

```bash
docker run -d -p 8081:8081 \
  -v $(pwd)/config:/app/config \
  spectrum:latest
```

The service will be available at http://localhost:8081.

Note: For GPU support, ensure Docker has access to the NVIDIA GPU runtime. You may need to install nvidia-docker and pass the `--gpus all` flag.
## Usage Examples

### cURL

**Transcription:**

```bash
curl -X POST http://localhost:8081/transcription/transcribe \
  -H "Content-Type: application/json" \
  -d '{
    "source": "presigned_url",
    "presigned_url": "https://your-s3-bucket.s3.amazonaws.com/audio.mp3"
  }'
```

**Diarization:**
```bash
curl -X POST http://localhost:8081/diarization/diarize \
  -H "Content-Type: application/json" \
  -d '{
    "source": "presigned_url",
    "presigned_url": "https://your-s3-bucket.s3.amazonaws.com/audio.mp3",
    "min_speakers": 2,
    "max_speakers": 3
  }'
```

### Python

**Transcription:**

```python
import base64

import requests

# Read and base64-encode the audio file
with open('audio.mp3', 'rb') as f:
    audio_base64 = base64.b64encode(f.read()).decode('utf-8')

# Transcribe
response = requests.post(
    'http://localhost:8081/transcription/transcribe',
    json={
        'source': 'base64_encoded_file',
        'base64_encoded_file': audio_base64
    }
)
print(response.json())
```
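**Diarization:** The diarization endpoint follows the same pattern; a sketch reusing the encoding above (the speaker bounds are illustrative):

```python
import base64

import requests

# Read and base64-encode the audio file
with open('audio.mp3', 'rb') as f:
    audio_base64 = base64.b64encode(f.read()).decode('utf-8')

# Diarize, hinting the expected number of speakers
response = requests.post(
    'http://localhost:8081/diarization/diarize',
    json={
        'source': 'base64_encoded_file',
        'base64_encoded_file': audio_base64,
        'min_speakers': 2,
        'max_speakers': 3
    }
)
print(response.json())
```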
## Project Structure

```
spectrum/
├── config/
│   ├── __init__.py
│   └── config.json              # Configuration file
├── routes/
│   ├── diarization_routes.py    # Speaker diarization endpoints
│   └── transcription_routes.py  # Transcription endpoints
├── utils/
│   ├── audioUtil.py             # Audio processing utilities
│   ├── configUtil.py            # Configuration utilities
│   └── s3Util.py                # AWS S3 utilities
├── main.py                      # Flask application entry point
├── wsgi.py                      # WSGI entry point
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Docker configuration
└── README.md                    # This file
```
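For orientation, here is a hypothetical sketch of the wiring this layout implies; the blueprint and function names are assumptions, not the actual source:

```python
# Hypothetical reconstruction for illustration only; the real code
# lives in main.py and routes/.
from flask import Blueprint, Flask

transcription_bp = Blueprint("transcription", __name__)
diarization_bp = Blueprint("diarization", __name__)

@transcription_bp.route("/transcribe", methods=["POST"])
def transcribe():
    ...  # see routes/transcription_routes.py

@diarization_bp.route("/diarize", methods=["POST"])
def diarize():
    ...  # see routes/diarization_routes.py

app = Flask(__name__)
app.register_blueprint(transcription_bp, url_prefix="/transcription")
app.register_blueprint(diarization_bp, url_prefix="/diarization")

@app.route("/healthCheck")
def health_check():
    return "I am healthy!"
```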
## Performance Notes

- **GPU Recommended**: The WhisperX and pyannote.audio models are computationally intensive; GPU instances are strongly recommended for production use.
- **Batch Processing**: The service automatically adjusts the batch size based on the available hardware (GPU: 8, CPU: 32).
- **Memory Usage**: Large audio files may require significant memory; monitor resource usage accordingly.
## Troubleshooting

- **Hugging Face Token Error**: Ensure your token is valid and has access to the pyannote models.
- **CUDA Out of Memory**: Reduce the batch size or use a smaller Whisper model.
- **FFmpeg Not Found**: Ensure FFmpeg is installed and available on your PATH.
- **Model Download Issues**: The first run may take time to download the models; ensure a stable internet connection (see the warm-up sketch below).
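If the first-run download is a problem, the models can be warmed up ahead of time. A sketch assuming the service uses the standard WhisperX and pyannote.audio entry points; the model names here are assumptions:

```python
import whisperx
from pyannote.audio import Pipeline

# Both libraries cache downloads locally, so a one-off warm-up
# avoids the delay on the service's first request.
whisperx.load_model("large-v2", device="cpu", compute_type="int8")
Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="your_huggingface_token_here",
)
```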
## License

See the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.