# Spectrum

Spectrum is a Python microservice for speaker diarization and transcription, built on OpenAI's Whisper model (via WhisperX) and pyannote.audio. It provides RESTful APIs to transcribe audio files and identify the different speakers in a recording.
## Features

- **Audio Transcription**: Convert audio files to text using WhisperX (OpenAI Whisper)
- **Speaker Diarization**: Identify and separate different speakers in audio recordings
- **Multiple Input Sources**: Accept base64-encoded files or presigned URLs
- **Language Detection**: Automatic language detection for transcribed audio
- **RESTful API**: Simple HTTP endpoints for easy integration
- **Docker Support**: Containerized deployment with Docker
## Prerequisites

- Python 3.8 or higher
- CUDA-capable GPU (recommended for better performance)
- Hugging Face account with an access token
- AWS credentials (if using S3 presigned URLs)
## Installation

1. Clone the repository:

```bash
git clone <repository-url>
cd spectrum
```

2. Create and activate a virtual environment:

```bash
python3 -m venv spectrumEnv
source spectrumEnv/bin/activate  # On Windows: spectrumEnv\Scripts\activate
```

3. Install the dependencies:

```bash
pip install -r requirements.txt
```

Note: Installation may take some time, as the requirements include PyTorch and WhisperX.
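Since performance differs sharply between GPU and CPU, it is worth confirming after installation that PyTorch can see a CUDA device; a quick check:

```python
import torch

# The models fall back to CPU if no CUDA device is visible.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```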
### FFmpeg

FFmpeg is required for audio processing.

**macOS:**

```bash
brew install ffmpeg
```

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg
```

**Windows:** Download a build from the [FFmpeg official website](https://ffmpeg.org/download.html).
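If audio processing fails later, a common cause is FFmpeg missing from the PATH (see Troubleshooting below); you can verify it from Python:

```python
import shutil

# Prints the resolved ffmpeg path, or a warning if it is not on PATH.
print(shutil.which("ffmpeg") or "ffmpeg not found on PATH")
```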
## Configuration

1. Navigate to the `config` directory
2. Edit `config.json` with your credentials:

```json
{
    "HF_TOKEN": "your_huggingface_token_here",
    "aws_access_key_id": "your_aws_access_key",
    "aws_secret_access_key": "your_aws_secret_key",
    "region_name": "your_aws_region"
}
```

### Getting a Hugging Face Token

1. Create an account at [Hugging Face](https://huggingface.co)
2. Go to Settings → Access Tokens
3. Create a new token with read permissions
4. Accept the terms of use for the pyannote/speaker-diarization model
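With `config.json` in place, here is a minimal sketch of how the credentials can be loaded. The actual helper lives in `utils/configUtil.py`; the function below is illustrative, not the real API:

```python
import json
from pathlib import Path

def load_config(path: str = "config/config.json") -> dict:
    """Read the JSON credentials file shown above (hypothetical helper)."""
    with Path(path).open() as f:
        return json.load(f)

config = load_config()
hf_token = config["HF_TOKEN"]  # needed to download the pyannote models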
## Running the Service

### Development

```bash
python wsgi.py
```

The service will start on http://localhost:5000.

### Production

```bash
gunicorn wsgi:app -b 0.0.0.0:8081
```

The service will start on http://0.0.0.0:8081.

Note: A GPU instance is recommended; the models are computationally intensive and will be slow on CPU.
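Once the service is up, a quick smoke test against the health endpoint (the port here matches the gunicorn setup above; adjust to match yours):

```python
import requests

# Expect HTTP 200 with the body "I am healthy!"
resp = requests.get("http://localhost:8081/healthCheck")
print(resp.status_code, resp.text)
```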
## API Endpoints

### POST /transcription/transcribe

Transcribe an audio file to text.

**Request Body:**
```json
{
    "source": "base64_encoded_file" | "presigned_url",
    "base64_encoded_file": "base64_encoded_audio_string",  // Required if source is "base64_encoded_file"
    "presigned_url": "https://s3.amazonaws.com/..."        // Required if source is "presigned_url"
}
```

**Response:**
```json
{
    "success": 1,
    "transcription": "Transcribed text here...",
    "language": "en"
}
```

**Error Response:**
```json
{
    "success": 0,
    "errorMessage": "Error message here"
}
```
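Because failures are signaled through the `success` field, callers should check it before reading `transcription`; a minimal sketch (the URL is a placeholder):

```python
import requests

resp = requests.post(
    "http://localhost:8081/transcription/transcribe",
    json={"source": "presigned_url", "presigned_url": "https://example.com/audio.mp3"},
)
body = resp.json()
if body.get("success") == 1:
    print(body["language"], body["transcription"])
else:
    raise RuntimeError(body.get("errorMessage", "transcription failed"))
```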
### POST /diarization/diarize

Transcribe audio and identify different speakers.

**Request Body:**
```json
{
    "source": "base64_encoded_file" | "presigned_url",
    "base64_encoded_file": "base64_encoded_audio_string",  // Required if source is "base64_encoded_file"
    "presigned_url": "https://s3.amazonaws.com/...",       // Required if source is "presigned_url"
    "min_speakers": 2,  // Optional, default: 2
    "max_speakers": 2   // Optional, default: 2
}
```

**Response:**
```json
{
    "success": 1,
    "diarization": [
        {
            "start": 0.0,
            "end": 5.2,
            "text": "Hello, this is speaker one.",
            "speaker": "SPEAKER_00"
        },
        {
            "start": 5.2,
            "end": 10.5,
            "text": "And this is speaker two.",
            "speaker": "SPEAKER_01"
        }
    ],
    "language": "en"
}
```

**Error Response:**
```json
{
    "success": 0,
    "errorMessage": "Error message here"
}
```
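The segment list is easy to post-process; for example, this sketch renders the response above as a speaker-labeled transcript:

```python
# Segments as returned in the "diarization" field of the response above.
segments = [
    {"start": 0.0, "end": 5.2, "text": "Hello, this is speaker one.", "speaker": "SPEAKER_00"},
    {"start": 5.2, "end": 10.5, "text": "And this is speaker two.", "speaker": "SPEAKER_01"},
]

for seg in segments:
    print(f"[{seg['start']:5.1f}s-{seg['end']:5.1f}s] {seg['speaker']}: {seg['text']}")
```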
### GET /healthCheck

Check if the service is running.

**Response:**

```
I am healthy!
```
## Docker Deployment

Build the image:

```bash
docker build -t spectrum:latest .
```

Run the container:

```bash
docker run -d -p 8081:8081 \
  -v $(pwd)/config:/app/config \
  spectrum:latest
```

The service will be available at http://localhost:8081.

Note: For GPU support, ensure Docker has access to the NVIDIA GPU runtime. You may need to install nvidia-docker and pass the `--gpus all` flag.
## Usage Examples

### cURL

**Transcription:**

```bash
curl -X POST http://localhost:8081/transcription/transcribe \
  -H "Content-Type: application/json" \
  -d '{
    "source": "presigned_url",
    "presigned_url": "https://your-s3-bucket.s3.amazonaws.com/audio.mp3"
  }'
```

**Diarization:**
```bash
curl -X POST http://localhost:8081/diarization/diarize \
  -H "Content-Type: application/json" \
  -d '{
    "source": "presigned_url",
    "presigned_url": "https://your-s3-bucket.s3.amazonaws.com/audio.mp3",
    "min_speakers": 2,
    "max_speakers": 3
  }'
```

### Python

**Transcription:**

```python
import base64

import requests

# Read and base64-encode the audio file
with open('audio.mp3', 'rb') as f:
    audio_base64 = base64.b64encode(f.read()).decode('utf-8')

# Transcribe
response = requests.post(
    'http://localhost:8081/transcription/transcribe',
    json={
        'source': 'base64_encoded_file',
        'base64_encoded_file': audio_base64
    }
)
print(response.json())
```
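**Diarization:** The diarization endpoint follows the same pattern; a sketch reusing the encoding above (the speaker bounds are illustrative):

```python
import base64

import requests

# Read and base64-encode the audio file
with open('audio.mp3', 'rb') as f:
    audio_base64 = base64.b64encode(f.read()).decode('utf-8')

# Diarize, hinting the expected number of speakers
response = requests.post(
    'http://localhost:8081/diarization/diarize',
    json={
        'source': 'base64_encoded_file',
        'base64_encoded_file': audio_base64,
        'min_speakers': 2,
        'max_speakers': 3
    }
)
print(response.json())
```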
## Project Structure

```
spectrum/
├── config/
│   ├── __init__.py
│   └── config.json              # Configuration file
├── routes/
│   ├── diarization_routes.py    # Speaker diarization endpoints
│   └── transcription_routes.py  # Transcription endpoints
├── utils/
│   ├── audioUtil.py             # Audio processing utilities
│   ├── configUtil.py            # Configuration utilities
│   └── s3Util.py                # AWS S3 utilities
├── main.py                      # Flask application entry point
├── wsgi.py                      # WSGI entry point
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Docker configuration
└── README.md                    # This file
```
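For orientation, here is a hypothetical sketch of the wiring this layout implies; the blueprint and function names are assumptions, not the actual source:

```python
# Hypothetical reconstruction for illustration only; the real code
# lives in main.py and routes/.
from flask import Blueprint, Flask

transcription_bp = Blueprint("transcription", __name__)
diarization_bp = Blueprint("diarization", __name__)

@transcription_bp.route("/transcribe", methods=["POST"])
def transcribe():
    ...  # see routes/transcription_routes.py

@diarization_bp.route("/diarize", methods=["POST"])
def diarize():
    ...  # see routes/diarization_routes.py

app = Flask(__name__)
app.register_blueprint(transcription_bp, url_prefix="/transcription")
app.register_blueprint(diarization_bp, url_prefix="/diarization")

@app.route("/healthCheck")
def health_check():
    return "I am healthy!"
```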
## Performance Notes

- **GPU Recommended**: The WhisperX and pyannote.audio models are computationally intensive; GPU instances are strongly recommended for production use.
- **Batch Processing**: The service automatically adjusts the batch size based on the available hardware (GPU: 8, CPU: 32).
- **Memory Usage**: Large audio files may require significant memory; monitor resource usage accordingly.
## Troubleshooting

- **Hugging Face Token Error**: Ensure your token is valid and has access to the pyannote models.
- **CUDA Out of Memory**: Reduce the batch size or use a smaller Whisper model.
- **FFmpeg Not Found**: Ensure FFmpeg is installed and available on your PATH.
- **Model Download Issues**: The first run may take time to download the models; ensure a stable internet connection (see the warm-up sketch below).
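If the first-run download is a problem, the models can be warmed up ahead of time. A sketch assuming the service uses the standard WhisperX and pyannote.audio entry points; the model names here are assumptions:

```python
import whisperx
from pyannote.audio import Pipeline

# Both libraries cache downloads locally, so a one-off warm-up
# avoids the delay on the service's first request.
whisperx.load_model("large-v2", device="cpu", compute_type="int8")
Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="your_huggingface_token_here",
)
```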
## License

See the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.