Generate transcriptions and subtitles using OpenAI whisper as the base ASR model, stable-ts or whisperx as timestamp stabilizers on top of the whisper output, and pyannote / NeMo models to identify the different speakers.
demo_smaller.mp4
- Read and preprocess the input file (video or audio)
- Load models: pyannote Voice Activity Detection (VAD), pyannote diarization, whisper, align-model (wav2vec2.0)
- Transcribe input using VAD and whisper
- Run alignment model to stabilize the timestamps from whisper
- Perform diarization
- Form final results and output.
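The steps above roughly map onto the whisperx API. A minimal sketch, assuming whisperx is installed and a Hugging Face token is available for the gated pyannote models; the model name, device, and exact location of `DiarizationPipeline` vary between whisperx versions:

```python
def run_pipeline(media_path: str, hf_token: str, device: str = "cpu"):
    """Sketch of the VAD -> whisper -> alignment -> diarization pipeline."""
    # Import locally so the sketch can be read without whisperx installed.
    import whisperx

    # 1. Read and preprocess the input file (ffmpeg decodes video or audio).
    audio = whisperx.load_audio(media_path)

    # 2-3. Transcribe with whisper; whisperx applies VAD internally.
    model = whisperx.load_model("small", device)
    result = model.transcribe(audio)

    # 4. Stabilize timestamps with a wav2vec2.0 alignment model.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # 5-6. Diarize and attach speaker labels to the aligned output.
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diarize_segments = diarize_model(audio)
    return whisperx.assign_word_speakers(diarize_segments, result)
```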
- Read and preprocess the input file (video or audio)
- Load whisper + timestamp stabilizer and inference with the input file
- Load diarization model and run model (if selected by the user) with the input file
- Form final results with post processing (fix output by punctuation, etc.)
- Output results
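The "fix output by punctuation" post-processing can be sketched as regrouping word-level entries into sentence-level segments, splitting whenever a word ends with terminating punctuation. A self-contained example; the dict layout (`word`/`start`/`end` keys) is illustrative, not the exact structure the pipeline uses:

```python
def regroup_by_punctuation(words, terminators=".?!"):
    """Merge word-level entries into sentence-level segments, starting a
    new segment after any word that ends with terminating punctuation."""
    segments, current = [], []
    for w in words:
        current.append(w)
        if w["word"].rstrip().endswith(tuple(terminators)):
            segments.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": " ".join(x["word"] for x in current),
            })
            current = []
    if current:  # flush a trailing segment that lacks final punctuation
        segments.append({
            "start": current[0]["start"],
            "end": current[-1]["end"],
            "text": " ".join(x["word"] for x in current),
        })
    return segments

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "there.", "start": 0.5, "end": 0.9},
    {"word": "How", "start": 1.1, "end": 1.3},
    {"word": "are", "start": 1.4, "end": 1.5},
    {"word": "you?", "start": 1.6, "end": 1.9},
]
print(regroup_by_punctuation(words))  # two segments, split after "there."
```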
- The main limitation, besides inference time (which depends on the selected model), is overlapping speakers. When two or more speakers speak at the same time, whisper outputs the transcription of only one of them. In addition, the diarization models can output sequences that overlap with each other, while whisper emits only one token for a given timestamp, so you can end up with one token and multiple possible speakers.
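The one-token-many-speakers situation can be made concrete: when diarization turns overlap, a token's timestamp interval can fall inside several turns at once, and a common (lossy) resolution is to pick the speaker with the largest temporal overlap. A minimal sketch with made-up turns; the `(speaker, start, end)` tuples are illustrative:

```python
def candidate_speakers(token_start, token_end, turns):
    """Return (speaker, overlap_seconds) for every diarization turn that
    overlaps the token's timestamp interval."""
    out = []
    for spk, ts, te in turns:
        overlap = min(token_end, te) - max(token_start, ts)
        if overlap > 0:
            out.append((spk, overlap))
    return out

def resolve(token_start, token_end, turns):
    """Pick the speaker with maximum overlap; lossy when speech overlaps."""
    cands = candidate_speakers(token_start, token_end, turns)
    return max(cands, key=lambda c: c[1])[0] if cands else None

# Two diarization turns that overlap each other between 3.0 s and 5.0 s.
turns = [("SPEAKER_00", 0.0, 5.0), ("SPEAKER_01", 3.0, 8.0)]
print(candidate_speakers(4.5, 5.5, turns))  # both speakers are candidates
print(resolve(4.5, 5.5, turns))             # SPEAKER_01 (larger overlap)
```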
- Untested in languages other than English, but it should work.
- Untested on longer audio/video files. It may be a good idea to split files into smaller chunks if this becomes a problem.
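If long files do turn out to be a problem, the chunk boundaries are straightforward to compute. A sketch, where the chunk length and overlap are arbitrary choices (the overlap exists so that words at a boundary land fully inside at least one chunk):

```python
def chunk_spans(total_s, chunk_s=600.0, overlap_s=5.0):
    """Return (start, end) spans in seconds covering a file of length
    total_s, with consecutive chunks overlapping by overlap_s."""
    spans, start = [], 0.0
    step = chunk_s - overlap_s
    while start < total_s:
        spans.append((start, min(start + chunk_s, total_s)))
        start += step
    return spans

# A 25-minute file split into 10-minute chunks with 5 s of overlap.
print(chunk_spans(1500.0))
```

Each span can then be decoded and transcribed independently, with the overlapping regions deduplicated when the per-chunk results are merged.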
- First approach with stable-ts whisper and pyannote
- User interface using streamlit
- Alternative pipeline with whisperx and pyannote
- Alternative pipeline with whisperx (timestamp stabilizer + custom diarization)
- Pipeline with whisperx as timestamp stabilizer + VAD + diarization
- Include audio processing (Silero VAD, audio enhancer with espnet or similar)
- Try pipeline with Nvidia NEMO
- Implement speech separation
- Alternative pipeline with stable-ts/whisperx and speech separation (instead of diarization)
- Include translation in the pipeline
- Add demo video