
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Merged 145 commits into SYSTRAN:master on Jul 18, 2024

Conversation

@Jiltseb (Contributor) commented May 24, 2024

Hello everyone,

This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!

Speed improvements:

  • Batching support: Inspired by whisper-x, this update introduces batching support, allowing for a roughly 3x speed increase. The implementation builds on whisper-x and supports additional run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.

  • Faster feature extraction: We've incorporated torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed. On average, this is up to 12.5x faster than the OpenAI implementation!

Using the batched version is straightforward:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# load the faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")

# apply the batched pipeline
batched_model = BatchedInferencePipeline(model=model)

# predict using the batched model
result = batched_model.transcribe("audio.mp3", batch_size=16)

for segment, info in result:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Quality improvements:

  1. Consistency across runs: Setting the model seed improves consistency across runs.
  2. Reducing hallucinations: Stricter checks in the inference pipeline reduce unstructured or repeated phrases.
  3. Reliable language detection: A new function detects language more reliably by considering both highly confident and randomly sampled segments, breaking ties to determine the dominant language.
  4. Code-switching support: Handles audio with multiple languages by detecting the language every 30 seconds and dynamically directing the data flow. Since the exact switching position is unknown, the detected boundary can be off by up to one 30-second segment.

Language detection usage:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
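
The structure of language_info is not spelled out in this description; assuming it is a dictionary holding a detected language code and a confidence score (the key names below are illustrative, check the actual return value), inspecting it might look like this:

# hypothetical keys; verify against the actual return value of detect_language_multi_segment
print(language_info.get("language_code"), language_info.get("language_confidence"))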

Benchmarking:

A. Open source benchmarking:

Open_asr_eval consists solely of short-form audio, with an average duration of less than 10 seconds. Hence, we also tested more complex, long-form use cases using a subset of the YouTube-Commons dataset. The Whisper-medium model is used for the experiments (with batch size = 8 for the batched versions). The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.

Speed (x real-time):

| System | GPU | CPU |
| --- | --- | --- |
| OpenAI Whisper | 8.2x | 4.5x |
| faster-whisper | 20.1x | 5.6x |
| HF Whisper (batched) | 59.3x | 8.4x |
| Batched Faster-Whisper | 104x | 14.6x |

WER:

| System | WER |
| --- | --- |
| OpenAI Whisper | 15.1 |
| faster-whisper | 14.6 |
| HF Whisper (batched) | 16.8 |
| Batched Faster-Whisper | 13.1 |

B. Internal dataset:

Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. It contains 9 audio files ranging from 3 to 13 minutes and covering various audio types.

| System | WER | Speed |
| --- | --- | --- |
| OpenAI Whisper | 6.8 | 9.1x |
| faster-whisper | 6.1 | 17.4x |
| HF Whisper (batched) | 8.2 | 42.8x |
| Batched Faster-Whisper | 6.5 | 86.6x |

Batched processing speeds up long-form audio without causing an increase in WER. Users can easily switch between sequential and batched Faster Whisper versions based on specific requirements.
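
For reference, the sequential (non-batched) path keeps the standard faster-whisper call, so switching between the two only changes the call site:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# sequential transcription: segments is a generator, info holds metadata
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))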

Thank you in advance!

Acknowledgements

This work was done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.

Jiltseb added 30 commits June 9, 2023 13:52
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with latest numpy version (1.26.0) and faster whisper won't work if numpy <1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
@hobodrifterdavid

With the regular faster-whisper, I got very nice segment lengths for movie subtitles by setting length_penalty=1.1

With batched transcribe, there's a nice speed boost, but the segments are generally much longer, even with length_penalty=1.1

Could somebody explain this for me?

@Jiltseb (Contributor, Author) commented Jul 12, 2024

> With the regular faster-whisper, I got very nice segment lengths for movie subtitles by setting length_penalty=1.1
>
> With batched transcribe, there's a nice speed boost, but the segments are generally much longer, even with length_penalty=1.1
>
> Could somebody explain this for me?

Have a look at the arguments: without_timestamps=True by default. It should give you similar results if you set without_timestamps=False. Also, to get word timestamps, you can set word_timestamps=True. Note that the focus of the batched version is speed, hence these are not computed by default.
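
For example, building on the usage snippet from the PR description (exact parameter placement assumed):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# re-enable segment timestamps and word-level timestamps,
# both skipped by default in the batched pipeline for speed
result = batched_model.transcribe(
    "audio.mp3",
    batch_size=16,
    without_timestamps=False,
    word_timestamps=True,
)
for segment, info in result:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))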

@MahmoudAshraf97 (Contributor)

> With the regular faster-whisper, I got very nice segment lengths for movie subtitles by setting length_penalty=1.1
>
> With batched transcribe, there's a nice speed boost, but the segments are generally much longer, even with length_penalty=1.1
>
> Could somebody explain this for me?

length_penalty doesn't affect segment length. What you are looking for is the without_timestamps option, which is turned off by default in the batched version; this will give you the segment division you want, although it's better to use something like this to divide the segments into meaningful segmentation.

@MahmoudAshraf97 (Contributor)

@trungkienbkhn any estimate of when this PR will be merged? I suppose there's no more work to do.

@kalradivyanshu

@Jiltseb is there a way to use this to pass multiple files, and process them in parallel, rather than 1 file? Something like:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# load the faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")

# apply the batched pipeline
batched_model = BatchedInferencePipeline(model=model)

# predict using the batched model
results = batched_model.transcribe(["audio0.mp3", "audio1.mp3", "audio2.mp3", "audio3.mp3"])

for result in results:
    for segment, info in result:
        print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

If not, how difficult will it be to add?

This would be really helpful for batch processing lots of smaller files. Thank you for this PR!

@Jiltseb (Contributor, Author) commented Jul 18, 2024

> @Jiltseb is there a way to use this to pass multiple files, and process them in parallel, rather than 1 file? Something like:
>
> from faster_whisper import WhisperModel, BatchedInferencePipeline
>
> # load the faster-whisper model in the usual way
> model = WhisperModel("medium", device="cuda", compute_type="float16")
>
> # apply the batched pipeline
> batched_model = BatchedInferencePipeline(model=model)
>
> # predict using the batched model
> results = batched_model.transcribe(["audio0.mp3", "audio1.mp3", "audio2.mp3", "audio3.mp3"])
>
> for result in results:
>     for segment, info in result:
>         print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
>
> If not, how difficult will it be to add?
>
> This would be really helpful for batch processing lots of smaller files. Thank you for this PR!

We are not adding this functionality in this PR, but it is definitely possible in a future PR and it's on our TODO list.
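
In the meantime, a simple workaround is to loop over the files and call the batched pipeline once per file; batching then happens within each file, not across files. A rough sketch (file names are placeholders):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# placeholder file list; each file is transcribed in turn,
# with batching applied inside each file
for audio_file in ["audio0.mp3", "audio1.mp3", "audio2.mp3", "audio3.mp3"]:
    result = batched_model.transcribe(audio_file, batch_size=16)
    for segment, info in result:
        print("[%s] [%.2fs -> %.2fs] %s" % (audio_file, segment.start, segment.end, segment.text))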

@trungkienbkhn merged commit eb83902 into SYSTRAN:master on Jul 18, 2024
3 checks passed
@benniekiss

A huge appeal of faster-whisper is that it has a small requirements footprint and doesn't depend on torch. Is it possible to make these dependencies optional?

@hahazei commented Jul 18, 2024

It is great to see this merged. Thank you.

@hobodrifterdavid commented Jul 20, 2024

@Jiltseb @MahmoudAshraf97 Thanks for the tips. I wrote a bit of code to split segments using word timings. It seems almost 2x faster when using without_timestamps=True.
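
Roughly, the idea looks like this (a simplified sketch, not the exact code; it assumes word_timestamps=True so each segment carries a words list with start, end, and word attributes):

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

def split_words(words, max_duration=6.0):
    # group word timings into chunks no longer than max_duration seconds
    chunks, current = [], []
    for word in words:
        if current and word.end - current[0].start > max_duration:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks

result = batched_model.transcribe("audio.mp3", batch_size=16, word_timestamps=True)
for segment, info in result:
    for chunk in split_words(segment.words):
        text = "".join(w.word for w in chunk).strip()
        print("[%.2fs -> %.2fs] %s" % (chunk[0].start, chunk[-1].end, text))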

By the way, I have noticed some overlapping segments in the output in batched mode; I don't recall seeing these previously, but I may be wrong. In fact, the first segment here starts after the second one:

[screenshot of overlapping segment timestamps]

@BBC-Esq (Contributor) commented Jul 20, 2024

Congratulations on the successful merge! I'll be doing benchmarks with WhisperX and WhisperS2T for everyone's edification, like I did before.

@reasv commented Jul 20, 2024

Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching.
So, results seem to suffer.

I experienced the same with whisperx:
m-bain/whisperX#844

With some videos, it seems like the original OpenAI Whisper, as well as faster_whisper without batching, give correct results while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly.
I'm not the only one with this issue:
m-bain/whisperX#828

It's weird because the videos I have issues with are too short for batching anyway (~8 seconds, for example), while I believe batching works on 30-second segments by default?

@stri8ed commented Jul 25, 2024

> Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching. So, results seem to suffer.
>
> I experienced the same with whisperx: m-bain/whisperX#844
>
> With some videos, it seems like the original OpenAI Whisper, as well as faster_whisper without batching, give correct results while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly. I'm not the only one with this issue: m-bain/whisperX#828
>
> It's weird because the videos I have issues with are too short for batching anyway (~8 seconds, for example), while I believe batching works on 30-second segments by default?

The batching implementation uses a different VAD model. That is likely the cause.

@Jiltseb (Contributor, Author) commented Jul 26, 2024

The time-stamp-related issue will be solved in the follow-up PR: #921

Batching works on 30-second segments as well (the batch size will be 1), but the VAD model is different, hence a different set of parameters might be needed for your use case. We are comparing the performance against Silero VAD; there will be a PR to replace the VAD if the trade-offs go in its favour.

@jordimas (Contributor)

> The time-stamp-related issue will be solved in the follow-up PR: #921
>
> Batching works on 30-second segments as well (the batch size will be 1), but the VAD model is different, hence a different set of parameters might be needed for your use case. We are comparing the performance against Silero VAD; there will be a PR to replace the VAD if the trade-offs go in its favour.

Thanks, this is an important topic. Having two VAD libraries for different code paths should be avoided, for consistency and to limit the number of external models we depend on. Should we open a separate ticket for this?

@Jiltseb (Contributor, Author) commented Jul 26, 2024

@MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.

@jordimas (Contributor)

> @MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.

PR better than ticket, thanks!

aligokalppeker added a commit to aligokalppeker/faster-whisper that referenced this pull request Jul 29, 2024
@mru4913 commented Aug 8, 2024

Is this released yet? I checked version 1.0.3, which does not have batched inference yet.

@Jiltseb (Contributor, Author) commented Aug 8, 2024

No, it's not released. You can use it with `pip install git+https://github.com/SYSTRAN/faster-whisper.git` for the time being.
