Generate transcriptions and subtitles using OpenAI whisper as the base ASR model, stable-ts or whisperx as timestamp stabilizers on top of the whisper output, and pyannote / NeMo models to identify the different speakers.
demo_smaller.mp4
- Read and preprocess the input file (video or audio)
- Load models: pyannote Voice Activity Detection (VAD), pyannote diarization, whisper, align-model (wav2vec2.0)
- Transcribe input using VAD and whisper
- Run alignment model to stabilize the timestamps from whisper
- Perform diarization
- Form final results and output.
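The steps above roughly map onto the whisperx API. A minimal sketch, assuming whisperx is installed and a Hugging Face token is available for the gated pyannote models; the model name, device, and exact location of `DiarizationPipeline` vary between whisperx versions:

```python
def run_pipeline(media_path: str, hf_token: str, device: str = "cpu"):
    """Sketch of the VAD -> whisper -> alignment -> diarization pipeline."""
    # Import locally so the sketch can be read without whisperx installed.
    import whisperx

    # 1. Read and preprocess the input file (ffmpeg decodes video or audio).
    audio = whisperx.load_audio(media_path)

    # 2-3. Transcribe with whisper; whisperx applies VAD internally.
    model = whisperx.load_model("small", device)
    result = model.transcribe(audio)

    # 4. Stabilize timestamps with a wav2vec2.0 alignment model.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # 5-6. Diarize and attach speaker labels to the aligned output.
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diarize_segments = diarize_model(audio)
    return whisperx.assign_word_speakers(diarize_segments, result)
```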
- Read and preprocess the input file (video or audio)
- Load whisper + timestamp stabilizer and inference with the input file
- Load diarization model and run model (if selected by the user) with the input file
- Form final results with post processing (fix output by punctuation, etc.)
- Output results
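The "fix output by punctuation" post-processing can be sketched as regrouping word-level entries into sentence-level segments, splitting whenever a word ends with terminating punctuation. A self-contained example; the dict layout (`word`/`start`/`end` keys) is illustrative, not the exact structure the pipeline uses:

```python
def regroup_by_punctuation(words, terminators=".?!"):
    """Merge word-level entries into sentence-level segments, starting a
    new segment after any word that ends with terminating punctuation."""
    segments, current = [], []
    for w in words:
        current.append(w)
        if w["word"].rstrip().endswith(tuple(terminators)):
            segments.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": " ".join(x["word"] for x in current),
            })
            current = []
    if current:  # flush a trailing segment that lacks final punctuation
        segments.append({
            "start": current[0]["start"],
            "end": current[-1]["end"],
            "text": " ".join(x["word"] for x in current),
        })
    return segments

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "there.", "start": 0.5, "end": 0.9},
    {"word": "How", "start": 1.1, "end": 1.3},
    {"word": "are", "start": 1.4, "end": 1.5},
    {"word": "you?", "start": 1.6, "end": 1.9},
]
print(regroup_by_punctuation(words))  # two segments, split after "there."
```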
- The main limitation, besides inference time (which depends on the selected model), is overlapping speakers. When two or more speakers speak at the same time, whisper outputs the transcription of only one of them. In addition, the diarization models can output sequences that overlap with each other, while whisper emits only one token for a given timestamp, so you can end up with one token and multiple possible speakers.
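The one-token-many-speakers situation can be made concrete: when diarization turns overlap, a token's timestamp interval can fall inside several turns at once, and a common (lossy) resolution is to pick the speaker with the largest temporal overlap. A minimal sketch with made-up turns; the `(speaker, start, end)` tuples are illustrative:

```python
def candidate_speakers(token_start, token_end, turns):
    """Return (speaker, overlap_seconds) for every diarization turn that
    overlaps the token's timestamp interval."""
    out = []
    for spk, ts, te in turns:
        overlap = min(token_end, te) - max(token_start, ts)
        if overlap > 0:
            out.append((spk, overlap))
    return out

def resolve(token_start, token_end, turns):
    """Pick the speaker with maximum overlap; lossy when speech overlaps."""
    cands = candidate_speakers(token_start, token_end, turns)
    return max(cands, key=lambda c: c[1])[0] if cands else None

# Two diarization turns that overlap each other between 3.0 s and 5.0 s.
turns = [("SPEAKER_00", 0.0, 5.0), ("SPEAKER_01", 3.0, 8.0)]
print(candidate_speakers(4.5, 5.5, turns))  # both speakers are candidates
print(resolve(4.5, 5.5, turns))             # SPEAKER_01 (larger overlap)
```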
- Untested in languages other than English, but it should work.
- Untested on longer audio/video files. It may be a good idea to split files into smaller chunks if this becomes a problem.
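If long files do turn out to be a problem, the chunk boundaries are straightforward to compute. A sketch, where the chunk length and overlap are arbitrary choices (the overlap exists so that words at a boundary land fully inside at least one chunk):

```python
def chunk_spans(total_s, chunk_s=600.0, overlap_s=5.0):
    """Return (start, end) spans in seconds covering a file of length
    total_s, with consecutive chunks overlapping by overlap_s."""
    spans, start = [], 0.0
    step = chunk_s - overlap_s
    while start < total_s:
        spans.append((start, min(start + chunk_s, total_s)))
        start += step
    return spans

# A 25-minute file split into 10-minute chunks with 5 s of overlap.
print(chunk_spans(1500.0))
```

Each span can then be decoded and transcribed independently, with the overlapping regions deduplicated when the per-chunk results are merged.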
- First approach with stable-ts whisper and pyannote
- User interface using streamlit
- Alternative pipeline with whisperx and pyannote
- Alternative pipeline with whisperx (timestamp stabilizer + custom diarization)
- Pipeline with whisperx as timestamp stabilizer + VAD + diarization
- Include audio processing (Silero VAD, audio enhancer with espnet or similar)
- Try pipeline with Nvidia NEMO
- Implement speech separation
- Alternative pipeline with stable-ts/whisperx and speech separation (instead of diarization)
- Include translation in the pipeline
- Add demo video