
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Merged 145 commits into SYSTRAN:master on Jul 18, 2024

Conversation

@Jiltseb (Contributor) commented May 24, 2024

Hello everyone,

This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!

Speed improvements:

  • Batching support: Inspired by whisper-x, this update introduces batching support, allowing for a roughly 3x speed increase. The implementation builds on whisper-x and supports additional run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.

  • Faster feature extraction: We've incorporated torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed. On average, this is up to 12.5x faster than the OpenAI implementation!

Using the batched version is straightforward:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# load the faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")

# apply the batched pipeline
batched_model = BatchedInferencePipeline(model=model)

# predict using the batched model
result = batched_model.transcribe("audio.mp3", batch_size=16)

for segment, info in result:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Quality improvements:

  1. Consistency across runs: Setting the model seed improves consistency across runs.
  2. Reducing hallucinations: Stricter checks in the inference pipeline reduce unstructured or repeated phrases.
  3. Reliable language detection: A new function detects language more reliably by considering both highly confident and randomly sampled segments, breaking ties to determine the dominant language.
  4. Code-switching support: Handles audio with multiple languages by detecting the language every 30 seconds and dynamically directing the data flow. Since the exact switching position is unknown, the detected boundary can be off by up to one 30-second segment.

Language detection usage:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
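
The structure of language_info is not spelled out in this description; assuming it is a dictionary holding a detected language code and a confidence score (the key names below are illustrative, check the actual return value), inspecting it might look like this:

# hypothetical keys; verify against the actual return value of detect_language_multi_segment
print(language_info.get("language_code"), language_info.get("language_confidence"))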

Benchmarking:

A. Open source benchmarking:

Open_asr_eval consists solely of short-form audio, with an average duration of less than 10 seconds. Hence, we also tested more complex, long-form use cases using a subset of the YouTube-Commons dataset. The Whisper-medium model is used for the experiments (with batch size = 8 for the batched versions). The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.

Speed (x real-time):

| System | GPU | CPU |
| --- | --- | --- |
| OpenAI Whisper | 8.2x | 4.5x |
| faster-whisper | 20.1x | 5.6x |
| HF Whisper (batched) | 59.3x | 8.4x |
| Batched Faster-Whisper | 104x | 14.6x |

WER:

| System | WER |
| --- | --- |
| OpenAI Whisper | 15.1 |
| faster-whisper | 14.6 |
| HF Whisper (batched) | 16.8 |
| Batched Faster-Whisper | 13.1 |

B. Internal dataset:

Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. It contains 9 audio files ranging from 3 to 13 minutes and covering various audio types.

| System | WER | Speed |
| --- | --- | --- |
| OpenAI Whisper | 6.8 | 9.1x |
| faster-whisper | 6.1 | 17.4x |
| HF Whisper (batched) | 8.2 | 42.8x |
| Batched Faster-Whisper | 6.5 | 86.6x |

Batched processing speeds up long-form audio without causing an increase in WER. Users can easily switch between sequential and batched Faster Whisper versions based on specific requirements.
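
For reference, the sequential (non-batched) path keeps the standard faster-whisper call, so switching between the two only changes the call site:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# sequential transcription: segments is a generator, info holds metadata
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))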

Thank you in advance!

Acknowledgements

This work was done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.

Jiltseb added 30 commits June 9, 2023 13:52
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with latest numpy version (1.26.0) and faster whisper won't work if numpy <1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
@hobodrifterdavid

With the regular faster-whisper, I got very nice segment lengths for movie subtitles by setting length_penalty=1.1

With batched transcribe, there's a nice speed boost, but the segments are generally much longer, even with length_penalty=1.1

Could somebody explain this for me?

@Jiltseb (Contributor, Author) commented Jul 12, 2024

> With the regular faster-whisper, I got very nice segment lengths for movie subtitles by setting length_penalty=1.1
>
> With batched transcribe, there's a nice speed boost, but the segments are generally much longer, even with length_penalty=1.1
>
> Could somebody explain this for me?

Have a look at the arguments: without_timestamps=True by default. It should give you similar results if you set without_timestamps=False. Also, to get word timestamps, you can set word_timestamps=True. Note that the focus of the batched version is speed, hence these are not computed by default.
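
For example, building on the usage snippet from the PR description (exact parameter placement assumed):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# re-enable segment timestamps and word-level timestamps,
# both skipped by default in the batched pipeline for speed
result = batched_model.transcribe(
    "audio.mp3",
    batch_size=16,
    without_timestamps=False,
    word_timestamps=True,
)
for segment, info in result:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))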

@MahmoudAshraf97 (Contributor)

> With the regular faster-whisper, I got very nice segment lengths for movie subtitles by setting length_penalty=1.1
>
> With batched transcribe, there's a nice speed boost, but the segments are generally much longer, even with length_penalty=1.1
>
> Could somebody explain this for me?

length_penalty doesn't affect segment length. What you are looking for is the without_timestamps option, which is turned off by default in the batched version; this will give you the segment division you want, although it's better to use something like this to divide the segments into meaningful segmentation.

@MahmoudAshraf97 (Contributor)

@trungkienbkhn any estimate of when this PR will be merged? I suppose there's no more work to do.

@kalradivyanshu

@Jiltseb is there a way to use this to pass multiple files, and process them in parallel, rather than 1 file? Something like:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# load the faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")

# apply the batched pipeline
batched_model = BatchedInferencePipeline(model=model)

# predict using the batched model
results = batched_model.transcribe(["audio0.mp3", "audio1.mp3", "audio2.mp3", "audio3.mp3"])

for result in results:
    for segment, info in result:
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

If not, how difficult will it be to add?

This would be really helpful for batch processing lots of smaller files. Thank you for this PR!

@Jiltseb (Contributor, Author) commented Jul 18, 2024

> @Jiltseb is there a way to use this to pass multiple files, and process them in parallel, rather than 1 file? Something like:
>
> from faster_whisper import WhisperModel, BatchedInferencePipeline
>
> # load the faster-whisper model in the usual way
> model = WhisperModel("medium", device="cuda", compute_type="float16")
>
> # apply the batched pipeline
> batched_model = BatchedInferencePipeline(model=model)
>
> # predict using the batched model
> results = batched_model.transcribe(["audio0.mp3", "audio1.mp3", "audio2.mp3", "audio3.mp3"])
>
> for result in results:
>     for segment, info in result:
>         print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
>
> If not, how difficult will it be to add?
>
> This would be really helpful for batch processing lots of smaller files. Thank you for this PR!

We are not adding this functionality in this PR, but it is definitely possible in a future PR and it's on our TODO list.
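
In the meantime, a simple workaround is to loop over the files and call the batched pipeline once per file; batching then happens within each file, not across files. A rough sketch (file names are placeholders):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# placeholder file list; each file is transcribed in turn,
# with batching applied inside each file
for audio_file in ["audio0.mp3", "audio1.mp3", "audio2.mp3", "audio3.mp3"]:
    result = batched_model.transcribe(audio_file, batch_size=16)
    for segment, info in result:
        print("[%s] [%.2fs -> %.2fs] %s" % (audio_file, segment.start, segment.end, segment.text))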

@trungkienbkhn merged commit eb83902 into SYSTRAN:master on Jul 18, 2024
3 checks passed
@benniekiss

A huge appeal of faster-whisper is that it has a small requirements footprint and doesn't depend on torch. Is it possible to make these dependencies optional?

@hahazei commented Jul 18, 2024

It is great to see this merged. Thank you.

@hobodrifterdavid commented Jul 20, 2024

@Jiltseb @MahmoudAshraf97 Thanks for the tips. I wrote a bit of code to split segments using word timings. It seems almost 2x faster when using without_timestamps=True.
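
Roughly, the idea looks like this (a simplified sketch, not the exact code; it assumes word_timestamps=True so each segment carries a words list with start, end, and word attributes):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

def split_words(words, max_duration=6.0):
    # group word timings into chunks no longer than max_duration seconds
    chunks, current = [], []
    for word in words:
        if current and word.end - current[0].start > max_duration:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks

result = batched_model.transcribe("audio.mp3", batch_size=16, word_timestamps=True)
for segment, info in result:
    for chunk in split_words(segment.words):
        text = "".join(w.word for w in chunk).strip()
        print("[%.2fs -> %.2fs] %s" % (chunk[0].start, chunk[-1].end, text))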

By the way, I have noticed some overlapping segments in the output in batched mode; I don't recall seeing these previously, but I may be wrong. In fact, the first segment here starts after the second one:

[screenshot of overlapping segment timestamps]

@BBC-Esq (Contributor) commented Jul 20, 2024

Congratulations on the successful merge! I'll be doing benchmarks with WhisperX and WhisperS2T for everyone's edification, like I did before.

@reasv commented Jul 20, 2024

Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching.
So, results seem to suffer.

I experienced the same with whisperx:
m-bain/whisperX#844

With some videos, it seems like the original OpenAI Whisper, as well as faster_whisper without batching, give correct results while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly.
I'm not the only one with this issue:
m-bain/whisperX#828

It's weird because the videos I have issues with are too short for batching anyway (~8 seconds, for example), while I believe batching works on 30-second segments by default?

@stri8ed commented Jul 25, 2024

> Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching. So, results seem to suffer.
>
> I experienced the same with whisperx: m-bain/whisperX#844
>
> With some videos, it seems like the original OpenAI Whisper, as well as faster_whisper without batching, give correct results while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly. I'm not the only one with this issue: m-bain/whisperX#828
>
> It's weird because the videos I have issues with are too short for batching anyway (~8 seconds, for example), while I believe batching works on 30-second segments by default?

The batching implementation uses a different VAD model. That is likely the cause.

@Jiltseb (Contributor, Author) commented Jul 26, 2024

The time-stamp-related issue will be solved in the follow-up PR: #921

Batching works on 30-second segments as well (the batch size will be 1), but the VAD model is different, hence a different set of parameters might be needed for your use case. We are comparing the performance against Silero VAD; there will be a PR to replace the VAD if the trade-offs go in its favour.

@jordimas (Contributor)

> The time-stamp-related issue will be solved in the follow-up PR: #921
>
> Batching works on 30-second segments as well (the batch size will be 1), but the VAD model is different, hence a different set of parameters might be needed for your use case. We are comparing the performance against Silero VAD; there will be a PR to replace the VAD if the trade-offs go in its favour.

Thanks, this is an important topic. Having two VAD libraries for different code paths should be avoided, for consistency and to limit the number of external models we depend on. Should we open a separate ticket for this?

@Jiltseb (Contributor, Author) commented Jul 26, 2024

@MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.

@jordimas (Contributor)

> @MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.

PR better than ticket, thanks!

aligokalppeker added a commit to aligokalppeker/faster-whisper that referenced this pull request Jul 29, 2024
@mru4913 commented Aug 8, 2024

Is this released yet? I checked version 1.0.3, which does not have batched inference yet.

@Jiltseb (Contributor, Author) commented Aug 8, 2024

No, it's not released. You can use it with `pip install git+https://github.com/SYSTRAN/faster-whisper.git` for the time being.
