Spectrum

Spectrum is a Python microservice for speaker diarization and transcription using OpenAI's Whisper model and pyannote.audio. It provides RESTful APIs to transcribe audio files and identify different speakers in audio recordings.

Features

  • Audio Transcription: Convert audio files to text using WhisperX (OpenAI Whisper)
  • Speaker Diarization: Identify and separate different speakers in audio recordings
  • Multiple Input Sources: Support for base64-encoded files and presigned URLs
  • Language Detection: Automatic language detection for transcribed audio
  • RESTful API: Simple HTTP endpoints for easy integration
  • Docker Support: Containerized deployment with Docker

Prerequisites

  • Python 3.8 or higher
  • CUDA-capable GPU (recommended for better performance)
  • Hugging Face account with access token
  • AWS credentials (if using S3 presigned URLs)

Installation

1. Clone the Repository

git clone <repository-url>
cd spectrum

2. Create Virtual Environment

python3 -m venv spectrumEnv
source spectrumEnv/bin/activate  # On Windows: spectrumEnv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

Note: The installation may take some time as it includes PyTorch and WhisperX dependencies.

4. Install FFmpeg

FFmpeg is required for audio processing:

macOS:

brew install ffmpeg

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y ffmpeg

Windows: Download a build from the official FFmpeg website (https://ffmpeg.org/download.html)
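
If you want to confirm the install from Python, a quick sanity check that FFmpeg is on your PATH:

import shutil
import subprocess

# Locate the ffmpeg binary; fail fast if it is missing
ffmpeg = shutil.which("ffmpeg")
assert ffmpeg is not None, "FFmpeg not found on PATH"
subprocess.run([ffmpeg, "-version"], check=True)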

Configuration

  1. Navigate to the config directory
  2. Edit config.json with your credentials:
{
    "HF_TOKEN": "your_huggingface_token_here",
    "aws_access_key_id": "your_aws_access_key",
    "aws_secret_access_key": "your_aws_secret_key",
    "region_name": "your_aws_region"
}

Getting a Hugging Face Token

  1. Create an account at Hugging Face (https://huggingface.co)
  2. Go to Settings → Access Tokens
  3. Create a new token with read permissions
  4. Accept the terms for the pyannote/speaker-diarization model
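
To verify the token before starting the service, you can ask the Hugging Face API who it belongs to. A minimal sketch, assuming the token sits in config/config.json as shown above (huggingface_hub is pulled in as a dependency of pyannote.audio):

import json
from huggingface_hub import HfApi

# Read the token from config.json and confirm it resolves to your account
with open("config/config.json") as f:
    token = json.load(f)["HF_TOKEN"]
print(HfApi(token=token).whoami()["name"])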

Running the Service

Development Mode

python wsgi.py

The service will start on http://localhost:5000

Production Mode (using Gunicorn)

gunicorn wsgi:app -b 0.0.0.0:8081

The service will start on http://0.0.0.0:8081

Note: A GPU instance is recommended for production; the models are computationally intensive and run slowly on CPU.
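
Long recordings can also exceed Gunicorn's default 30-second worker timeout. One way to handle this is a gunicorn.conf.py; the filename and values below are suggestions, not part of the repo:

# gunicorn.conf.py (example settings, not shipped with the repo)
bind = "0.0.0.0:8081"
workers = 1    # keep low if each worker loads its own copy of the models
timeout = 600  # allow long transcriptions to finish

Start the service with it via gunicorn wsgi:app -c gunicorn.conf.py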

API Endpoints

Transcription

POST /transcription/transcribe

Transcribe an audio file to text.

Request Body:

{
    "source": "base64_encoded_file" | "presigned_url",
    "base64_encoded_file": "base64_encoded_audio_string"  // Required if source is "base64_encoded_file"
    "presigned_url": "https://s3.amazonaws.com/..."  // Required if source is "presigned_url"
}

Response:

{
    "success": 1,
    "transcription": "Transcribed text here...",
    "language": "en"
}

Error Response:

{
    "success": 0,
    "errorMessage": "Error message here"
}

Speaker Diarization

POST /diarization/diarize

Transcribe audio and identify different speakers.

Request Body:

{
    "source": "base64_encoded_file" | "presigned_url",
    "base64_encoded_file": "base64_encoded_audio_string",  // Required if source is "base64_encoded_file"
    "presigned_url": "https://s3.amazonaws.com/...",  // Required if source is "presigned_url"
    "min_speakers": 2,  // Optional, default: 2
    "max_speakers": 2   // Optional, default: 2
}

Response:

{
    "success": 1,
    "diarization": [
        {
            "start": 0.0,
            "end": 5.2,
            "text": "Hello, this is speaker one.",
            "speaker": "SPEAKER_00"
        },
        {
            "start": 5.2,
            "end": 10.5,
            "text": "And this is speaker two.",
            "speaker": "SPEAKER_01"
        }
    ],
    "language": "en"
}

Error Response:

{
    "success": 0,
    "errorMessage": "Error message here"
}
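
The diarization array is straightforward to post-process. As a hypothetical example, a small helper that totals speaking time per speaker:

def speaking_time(segments):
    """Sum the seconds spoken per speaker from the diarization array."""
    totals = {}
    for seg in segments:
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + (seg["end"] - seg["start"])
    return totals

# With the example response above:
# speaking_time(response["diarization"]) -> {'SPEAKER_00': 5.2, 'SPEAKER_01': 5.3}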

Health Check

GET /healthCheck

Check if the service is running.

Response:

I am healthy!
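
For automated readiness checks, a minimal probe (adjust host and port to how you started the service):

import requests

# The endpoint returns plain text, not JSON
resp = requests.get("http://localhost:8081/healthCheck", timeout=5)
print(resp.status_code, resp.text)  # expected: 200 I am healthy!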

Docker Deployment

Build the Docker Image

docker build -t spectrum:latest .

Run the Container

docker run -d -p 8081:8081 \
  -v $(pwd)/config:/app/config \
  spectrum:latest

The service will be available at http://localhost:8081

Note: For GPU support, ensure Docker has access to the NVIDIA GPU runtime. You may need to install the NVIDIA Container Toolkit and pass the --gpus all flag.
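
To confirm the container actually sees a GPU, a quick check with PyTorch (already installed as a service dependency):

import torch

# Prints True inside a container started with --gpus all on a GPU host
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))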

Example Usage

Using cURL

Transcription:

curl -X POST http://localhost:8081/transcription/transcribe \
  -H "Content-Type: application/json" \
  -d '{
    "source": "presigned_url",
    "presigned_url": "https://your-s3-bucket.s3.amazonaws.com/audio.mp3"
  }'

Diarization:

curl -X POST http://localhost:8081/diarization/diarize \
  -H "Content-Type: application/json" \
  -d '{
    "source": "presigned_url",
    "presigned_url": "https://your-s3-bucket.s3.amazonaws.com/audio.mp3",
    "min_speakers": 2,
    "max_speakers": 3
  }'

Using Python

import requests
import base64

# Read audio file
with open('audio.mp3', 'rb') as f:
    audio_base64 = base64.b64encode(f.read()).decode('utf-8')

# Transcribe
response = requests.post(
    'http://localhost:8081/transcription/transcribe',
    json={
        'source': 'base64_encoded_file',
        'base64_encoded_file': audio_base64
    }
)

print(response.json())
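
A diarization request follows the same pattern; here the optional speaker bounds are set explicitly:

# Diarize the same base64 payload, hinting at 2-3 speakers
response = requests.post(
    'http://localhost:8081/diarization/diarize',
    json={
        'source': 'base64_encoded_file',
        'base64_encoded_file': audio_base64,
        'min_speakers': 2,
        'max_speakers': 3
    }
)

result = response.json()
if result['success']:
    for segment in result['diarization']:
        print(f"{segment['speaker']}: {segment['text']}")
else:
    print(result['errorMessage'])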

Project Structure

spectrum/
├── config/
│   ├── __init__.py
│   └── config.json          # Configuration file
├── routes/
│   ├── diarization_routes.py  # Speaker diarization endpoints
│   └── transcription_routes.py # Transcription endpoints
├── utils/
│   ├── audioUtil.py         # Audio processing utilities
│   ├── configUtil.py        # Configuration utilities
│   └── s3Util.py            # AWS S3 utilities
├── main.py                  # Flask application entry point
├── wsgi.py                  # WSGI entry point
├── requirements.txt         # Python dependencies
├── Dockerfile              # Docker configuration
└── README.md               # This file

Performance Notes

  • GPU Recommended: The service uses WhisperX and pyannote.audio models, which are computationally intensive. GPU instances are strongly recommended for production use.
  • Batch Processing: The service automatically adjusts batch size based on available hardware (GPU: 8, CPU: 32); see the sketch after this list.
  • Memory Usage: Large audio files may require significant memory. Monitor resource usage accordingly.
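
The batch-size selection amounts to something like the following sketch of the documented behaviour, not the actual service code; the values 8 and 32 are taken from the note above:

import torch

# Choose the device, then the batch size this README documents for it
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 8 if device == "cuda" else 32
print(device, batch_size)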

Troubleshooting

Common Issues

  1. Hugging Face Token Error: Ensure your token is valid and that you have accepted the terms for the gated pyannote models
  2. CUDA Out of Memory: Reduce the batch size or use a smaller Whisper model
  3. FFmpeg Not Found: Ensure FFmpeg is installed and available on your PATH
  4. Model Download Issues: The first run may take a while to download models; ensure a stable internet connection

License

See LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
