
Revert the batched process updates for the future of the faster-whisper #940

Open
wants to merge 3 commits into base: master

Conversation


@aligokalppeker aligokalppeker commented Jul 29, 2024

Concerning issue #937, I would suggest reverting the related commits. The PR in question is huge and can be split up and merged as independent PRs.

This is a serious concern and is important for the project's future.

The injected changes:

  • should be done in incremental PRs, so they can be integrated and tested in a more controllable and timely fashion.
  • should be cleaned of PyTorch dependencies, considering other lightweight, effective feature-extraction implementations such as numba and cupy.
  • change the decoding framework from PyAV to ffmpeg, which is a compatibility issue.
  • are too big and complex for anyone outside the author group to maintain, affecting VAD, transcription logic, audio decoding, and so on.
  • may introduce bugs and performance slowdowns due to their complexity and code bloat.

Even if the authors downvote this PR (I understand that it is harsh), let's revert the related commit, partition it into several sub-functional PRs, and merge them incrementally and in a timely fashion so the community can adapt iteratively.

Thanks, @trungkienbkhn

@aligokalppeker changed the title from "Revert the batched process updates." to "Revert the batched process updates for the future of the faster-whisper" on Jul 29, 2024
@aligokalppeker
Author

aligokalppeker commented Jul 29, 2024

I believe this is of utmost importance. @guillaumekln

We need to keep faster-whisper elegant and agile to make it more usable and maintainable. Otherwise, it will be transformed into something unmaintainable and unusable, like whisper-x (some of the batched-inference authors are also contributors to that project).

I will keep this PR up to date with the other changes on the master branch.

@aboutmedicine

While I agree with @ozancaglayan that your approach was initially suboptimal, I agree with this PR, having spent the last 2 hours reversing the compatibility issues that this update caused.

@aligokalppeker
Author

While I agree with @ozancaglayan that your approach was initially suboptimal, I agree with this PR, having spent the last 2 hours reversing the compatibility issues that this update caused.

This is really an important point, as the community is not yet aware of the effect of these changes. Users will only understand the impact once they use the new version.

We can partition the whole PR into parts and let the community adapt to them over time.

That is critical for this project's survival.

@doublex

doublex commented Aug 10, 2024

@trungkienbkhn
Please consider reverting these patches

@aligokalppeker
Author

@trungkienbkhn Please consider reverting these patches

Yes, let's keep faster-whisper from becoming bulky-whisper.

There are healthier ways to use the contributions from the PR in question without affecting the community or this repo:

  • Batched inference can live in a fork instead of being part of this repo.
  • Changes can be merged into this repo incrementally and in a timely fashion.

@BBC-Esq
Contributor

BBC-Esq commented Aug 12, 2024

Has anyone taken a look at the approach used by WhisperS2T? There seems to be some kind of unexplained loyalty and/or camaraderie between faster-whisper and WhisperX when there's a completely viable alternative. Thanks, my 2 cents.

@MahmoudAshraf97
Contributor

Has anyone taken a look at the approach used by WhisperS2T? There seems to be some kind of unexplained loyalty and/or camaraderie between faster-whisper and WhisperX when there's a completely viable alternative. Thanks, my 2 cents.

It was already implemented in #921

@ozancaglayan
Contributor

I really don't like the current situation in this repository. Some contributors merged these changes to main, and there have been more and more negative concerns and comments about the disruptive nature of these changes.

Then there's this total silence: the contributors are reluctant to revert those changes, and the organization hosting this repository does not comment at all in the various places where it has been tagged.

I don't know what's going on, to be honest.

@doublex

doublex commented Aug 12, 2024

@trungkienbkhn has greatly improved this repository.
Thanks for all your efforts!
It's just these 3 patches. Something stinks.

@BBC-Esq
Contributor

BBC-Esq commented Aug 12, 2024

Has anyone taken a look at the approach used by WhisperS2T? There seems to be some kind of unexplained loyalty and/or camaraderie between faster-whisper and WhisperX when there's a completely viable alternative. Thanks, my 2 cents.

It was already implemented in #921

Thanks, I'll take a second look. My comment was based on a comment - perhaps in a different discussion, as I don't see it here - about whether WhisperX's batch approach or WhisperS2T's should be used, and WhisperX's was chosen... but I might have been mistaken. Consider my comment temporarily withdrawn...

@BBC-Esq
Contributor

BBC-Esq commented Aug 12, 2024

I really don't like the current situation in this repository. Some contributors merged these changes to main, and there have been more and more negative concerns and comments about the disruptive nature of these changes.

Then there's this total silence: the contributors are reluctant to revert those changes, and the organization hosting this repository does not comment at all in the various places where it has been tagged.

I don't know what's going on, to be honest.

I don't think the people who maintain this repo do this as their primary job. I think they all have day jobs and work on this in their spare time, unlike well-known repositories like transformers and torch and such... That, and I don't think they have the manpower. I don't hold this against them, actually, because not everyone is graced with a Hugging Face level of employees. In fact, they're probably doing the best they can with the manpower they have... Anyways, shout out to @MahmoudAshraf97 for being the one who's actually trying to implement the batch processing. Even @guillaumekln didn't, so... hoping for the best, bug-free implementation eventually!

@aligokalppeker
Author

I really don't like the current situation in this repository. Some contributors merged these changes to main, and there have been more and more negative concerns and comments about the disruptive nature of these changes.

Then there's this total silence: the contributors are reluctant to revert those changes, and the organization hosting this repository does not comment at all in the various places where it has been tagged.

I don't know what's going on, to be honest.

That is unfortunate but true. If the organization did its work well, it would not allow this kind of big, bulky change into the main repository, turning it into something quite different from "faster-whisper".

What happens next is a big unknown until the organization steps up and listens to community feedback.

@aligokalppeker
Author

I really don't like the current situation in this repository. Some contributors merged these changes to main, and there have been more and more negative concerns and comments about the disruptive nature of these changes.
Then there's this total silence: the contributors are reluctant to revert those changes, and the organization hosting this repository does not comment at all in the various places where it has been tagged.
I don't know what's going on, to be honest.

I don't think the people who maintain this repo do this as their primary job. I think they all have day jobs and work on this in their spare time, unlike well-known repositories like transformers and torch and such... That, and I don't think they have the manpower. I don't hold this against them, actually, because not everyone is graced with a Hugging Face level of employees. In fact, they're probably doing the best they can with the manpower they have... Anyways, shout out to @MahmoudAshraf97 for being the one who's actually trying to implement the batch processing. Even @guillaumekln didn't, so... hoping for the best, bug-free implementation eventually!

Manpower is important, that's for sure, but spending effort and time does not always mean the result will be better. As is apparent in this scenario, bringing in something this big and blocking the community from using faster-whisper according to its design philosophy is something that will kill the project.

@guillaumekln was right not to bring batch processing into faster-whisper; it is something that can be added as a separate project, like whisper-x. We do need manpower, but paired with design thinking and proper software engineering capability.

@aligokalppeker
Author

Has anyone taken a look at the approach used by WhisperS2T? There seems to be some kind of unexplained loyalty and/or camaraderie between faster-whisper and WhisperX when there's a completely viable alternative. Thanks, my 2 cents.

It was already implemented in #921

@MahmoudAshraf97 Can you elaborate on what exactly is implemented in this PR in the WhisperS2T context?

As I see it, WhisperS2T's performance comes from its TensorRT-LLM backend implementation.

https://github.com/shashikg/WhisperS2T/releases/tag/v1.3.0

@MahmoudAshraf97
Contributor

WhisperS2T contains multiple backends, including CTranslate2 and TRT-LLM. What @BBC-Esq meant here is the difference in how batching was implemented in the CT2 backend, which is what faster-whisper uses. We initially used the WhisperX approach with plans to simplify it later, once it was merged; the simplified approach replaces the HF pipeline with just a simple loop over batches, which resembles what WhisperS2T does.
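
To make that comparison concrete, here is a minimal, hypothetical sketch of what a "simple loop over batches" looks like in place of an HF pipeline; the function name, feature layout, and decode_batch callable are illustrative assumptions, not code from the PR.

def transcribe_in_batches(feature_chunks, decode_batch, batch_size=16):
    # feature_chunks: a sequence of per-chunk feature arrays (one 30-second chunk each)
    # decode_batch: callable performing one batched generate/decode call (stands in for the CT2 backend call)
    results = []
    for start in range(0, len(feature_chunks), batch_size):
        batch = feature_chunks[start : start + batch_size]  # one batch of chunks
        results.extend(decode_batch(batch))                  # a single batched decode per iteration, no HF pipeline
    return results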

@aligokalppeker
Author

WhisperS2T contains multiple backends, including CTranslate2 and TRT-LLM. What @BBC-Esq meant here is the difference in how batching was implemented in the CT2 backend, which is what faster-whisper uses. We initially used the WhisperX approach with plans to simplify it later, once it was merged; the simplified approach replaces the HF pipeline with just a simple loop over batches, which resembles what WhisperS2T does.

I see, thanks for the info. So if the same logic is implemented in WhisperS2T, why not use that one for batched inference? It is not right to spend effort implementing the same batching logic in multiple projects.

I think this is another reason to revert all this batched-inference work and create it as a separate project.

@MahmoudAshraf97
Contributor

Because neither WhisperS2T nor WhisperX is maintained any longer, it doesn't make sense to implement every new feature in a separate package, especially since this is an independent feature that honours the old usage with absolutely no change to the behaviour (except for the requirements, which are halfway done). Bugs that appear in issues are fixed within a week, and bugs are to be expected on the master branch because this was never advertised as a stable release. You don't like the batching implementation? Don't use it, or keep using 1.0.3; and if the next stable release has issues, report them to be fixed, or even better, fix them yourself and open a PR.
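
For anyone taking that advice, pinning to the last pre-batching release is a one-liner (1.0.3 is simply the version cited above; adjust if a different version applies to your setup):

pip install faster-whisper==1.0.3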

@aligokalppeker
Author

aligokalppeker commented Aug 13, 2024

Because neither WhisperS2T nor WhisperX is maintained any longer, it doesn't make sense to implement every new feature in a separate package, especially since this is an independent feature that honours the old usage with absolutely no change to the behaviour (except for the requirements, which are halfway done). Bugs that appear in issues are fixed within a week, and bugs are to be expected on the master branch because this was never advertised as a stable release. You don't like the batching implementation? Don't use it, or keep using 1.0.3; and if the next stable release has issues, report them to be fixed, or even better, fix them yourself and open a PR.

You will do the same to faster-whisper: make it unmaintained by making it complex and dependency-heavy. Why don't you fork the project and maintain all of this in your own repo? Why do you prefer to make the faster-whisper repo bulky and poorly designed?

Another question: who will mark this version stable, and how will that be done? Can you see and solve all the issues that the increased complexity will surface through community usage? Will you fix those, or will you abandon that part of the community? Given the bulky design, will you be able to satisfy the community's demands of the original faster-whisper?

Will you advise everyone to use 1.0.3, or to fork the repo and live with the issues and incompatibilities all the time? As I said, you are not the owner of this repository, and spending time on it does not make you the owner either.

I will pursue this and do my best to return faster-whisper to its origins. It would help if you did not steer the community like this, and I believe the organization should not allow such irresponsible behaviour.

@BBC-Esq
Contributor

BBC-Esq commented Aug 13, 2024

I don't think that's entirely accurate @aligokalppeker. I mean, I respect your opinion and your right to express it, but I actually think that @MahmoudAshraf97 is doing a hell of a job, especially considering it'd be the biggest improvement to faster-whisper in a long time. IMHO, I personally am NOT a fan of 90k+ character-long scripts like transcribe.py; I feel there needs to be far more modularization. Second, there are so many moving parts... but I'll get into that later.

...but I'll get into that later... for now... can someone/anyone summarize the current issues with the code - i.e., any and all issues? I'll go through and see if I can't contribute to a solution. If anyone has any sample scripts using the new API, and especially SIMPLE benchmarking scripts, I'd appreciate it. I use benchmarking scripts internally; I'm attaching one here that we can build off of.

Also, @MahmoudAshraf97 you say that it's now "resembling" WhisperS2T? Can you explain a little more how your implementation works in this regard and the difference between "mirroring" WhisperS2T and "resembling" it? I'd like to start comparing...and possibly contributing.

HERE'S THE SCRIPT

import os
from faster_whisper import WhisperModel
import pynvml
import threading
import time
from multiprocessing import Process, Queue
import torch

"""
pip install nvidia-ml-py
pip install [YOUR VERSION OF TORCH]
pip install faster-whisper
"""

class WhisperTranscriber:
    def __init__(self, model_size=r"[INSERT PATH TO FOLDER CONTAINING SPECIFIC WHISPER MODEL]", compute_type="bfloat16"):
        self.model_size = model_size
        self.compute_type = compute_type
        self.baseline_vram_usage = 0.0  # fallback in case the baseline reading never arrives

    def start_transcription_process(self, audio_file):
        self.audio_file = audio_file
        start_time = time.time()
        vram_queue = Queue()
        baseline_vram_queue = Queue()
        process = Process(target=self.transcribe_and_benchmark, args=(vram_queue, baseline_vram_queue))
        process.start()

        max_vram_readings = []
        while process.is_alive():
            while not vram_queue.empty():
                vram_reading = vram_queue.get()
                max_vram_readings.append(vram_reading)
            time.sleep(0.1)  # avoid busy-waiting while the worker process runs
        while not vram_queue.empty():  # drain any readings left after the process exits
            max_vram_readings.append(vram_queue.get())
        process.join()

        transcription_time = time.time() - start_time
        if not baseline_vram_queue.empty():
            self.baseline_vram_usage = baseline_vram_queue.get()
        if max_vram_readings:
            max_vram_usage = max(max_vram_readings) - self.baseline_vram_usage
            print(f"Max VRAM usage during processing: {max_vram_usage:.2f} MB")
        print(f"Total transcription time: {transcription_time:.2f} seconds")

    @torch.inference_mode()
    def transcribe_and_benchmark(self, vram_queue, baseline_vram_queue):
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        baseline_vram_usage = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
        baseline_vram_queue.put(baseline_vram_usage)
        print(f"Baseline VRAM usage before loading the model: {baseline_vram_usage:.2f} MB")

        model = WhisperModel(self.model_size, device="cuda", compute_type=self.compute_type)
        vram_usage_after_loading = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
        print(f"VRAM usage after loading the model: {vram_usage_after_loading:.2f} MB")

        stop_event = threading.Event()

        def poll_vram_usage():
            while not stop_event.is_set():
                vram_usage = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
                vram_queue.put(vram_usage)
                time.sleep(0.5)

        vram_thread = threading.Thread(target=poll_vram_usage)
        vram_thread.start()

        segments, info = model.transcribe(self.audio_file, beam_size=5, chunk_length=29)
        stop_event.set()
        vram_thread.join()

        for segment in segments:
            print(segment.text)

        pynvml.nvmlShutdown()

if __name__ == "__main__":
    audio_file_path = r"[INSERT PATH TO TEST FILE TO PROCESS]"
    transcriber = WhisperTranscriber()
    transcriber.start_transcription_process(audio_file_path)
    time.sleep(2)

@ozancaglayan
Contributor

ozancaglayan commented Aug 13, 2024

@MahmoudAshraf97 Where does the benefit of batching come from? I'm doing some benchmarking (only encoding of 30-sec chunks) for now; the benchmark below does not do any decoding.

The numbers below were obtained on a Tesla T4 GPU. You can see that dividing the measured times by the batch size consistently gives ~150 ms per sample.

Any idea what I am missing?

In [107]: feats.shape
Out[107]: (80, 19620)

In [108]: f30 = feats[None, :, :3000] # 30sec

In [110]: for bs in (1, 2, 4, 8):
     ...:     # repeat the same data multiple times to simulate batching
     ...:     batch = f30.repeat(bs, axis=0)[..., :3000]
     ...:     print(batch.shape)
     ...:     batch = get_ctranslate2_storage(batch).to_device(ct2.Device.cuda)
     ...:     %timeit model.model.encode(batch)
     ...:
(1, 80, 3000)
149 ms ± 761 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
(2, 80, 3000)
287 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(4, 80, 3000)
568 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(8, 80, 3000)
1.16 s ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@ozancaglayan
Contributor

ozancaglayan commented Aug 13, 2024

I agree with @aligokalppeker's last points (although not the tone; @aligokalppeker, no need to be angry, we can discuss this calmly).

Lots of Whisper ASR repos appeared and disappeared over the past year(s), and many of them were later abandoned. Unless we have a clear plan and roadmap for how this repository will be maintained going forward, I don't see the benefit of merging a bulky PR and then not being cooperative given the amount of complaints from the community. In the original PR, @Purfview also wrote that the PR contains multiple unrelated changes. Lots of people reacted with thumbs up, including the contributors themselves (@MahmoudAshraf97), but this comment was not addressed. Why did we add another VAD model? No answer and no clear justification, while adding ~100s of new package dependencies, etc.

The whole selling point of the PR was massive speed improvements, even claiming 100x speed-ups, which is unrealistic. Moreover, people running on CPUs experienced slowdowns because the number of CPU threads was defaulted to a non-portable value.
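
For readers hitting that CPU slowdown, a minimal workaround sketch: pin the thread count explicitly instead of relying on the default (cpu_threads is a standard WhisperModel parameter; the model name, audio file, and thread count below are placeholders, and the exact default that changed is not shown in this thread).

from faster_whisper import WhisperModel

# Pin CPU threads explicitly rather than relying on the library default;
# a value around the number of physical cores is a reasonable starting point.
model = WhisperModel("small", device="cpu", compute_type="int8", cpu_threads=4)
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(segment.text)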

I don't know why we can't just revert it and then properly review and merge it atomically. Nobody wants to block contributions here; we just want to ensure quality.

Also, I don't understand who the person approving the PR is. They don't seem to be a contributor and/or maintainer of this repo. I don't understand how somebody who hasn't contributed to this repo could approve that PR.

[screenshot of the PR approval]

Please, we're not asking for rocket science here:

  • Let's designate a couple of people who know this repo as reviewers for future PRs
  • Let's keep PRs atomic and well tested
  • Let's TRY to write unit tests

Thanks!

@Purfview
Contributor

Purfview commented Aug 13, 2024

Purfview also wrote that the PR contains multiple unrelated changes.

That was before the PR was cleaned up.
And there is PR #936 to remove the redundant VAD, and maybe the torch dependency later.

@MahmoudAshraf97
Contributor

@BBC-Esq I guess it means the same thing. The main difference between the two is how the process is handled, and our implementation is now closer to WhisperS2T than to WhisperX. We can't exactly mirror WhisperS2T because it was designed with multiple backends in mind, while we only use one backend, so our approach is simpler and adheres more closely to the original design of faster-whisper.

@aligokalppeker
Batching shows the greatest performance increase in decoding because decoding is done autoregressively. You can test it using this code:

import torch
from faster_whisper.transcribe import get_ctranslate2_storage  # assumed import path for this helper

batch_size = 1  # increase to e.g. 32 to see the batched speedup

# `features` is a log-Mel array of shape (75, 128, 3000), i.e. 75 thirty-second chunks,
# and `ct2_model` is a loaded faster_whisper.WhisperModel (ct2_model.model is the CTranslate2 Whisper model).
for i in range(0, len(features), batch_size):
    inputs = features[i : i + batch_size]
    outputs = ct2_model.model.generate(
        get_ctranslate2_storage(inputs),
        prompts=[[50258, 50259, 50360, 50364]] * len(inputs),  # standard Whisper start-of-transcript prompt tokens
    )
    torch.cuda.empty_cache()

These are the results on a single T4, showing a ~2x speedup:

batch_size=1
1min 44s ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
batch_size=32
53.1 s ± 58.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I used a large batch size to demonstrate that batching yields increasing speedups until we become compute-bound; low batch sizes are mostly memory-bound.

We already addressed most of the problems regarding VAD and additional package dependencies in #936, where the number of dependencies is now almost back to the original count. We are almost there regarding the correct implementation. I know the approach could've been better, but that has already happened; how about we try to correct the approach from now on instead of starting all over again?

@ozancaglayan
Contributor

Thanks @MahmoudAshraf97, so do you confirm that for encoding, batching wouldn't have much effect?

@MahmoudAshraf97
Contributor

It does have an effect, as you demonstrated. A 5% speedup is considerable in cloud deployment scenarios where time is literally money, but YMMV depending on the hardware specifications, as the T4 is considered old nowadays. The largest effect is in decoding, but that doesn't mean the encoding doesn't benefit from it.

@aligokalppeker
Author

aligokalppeker commented Aug 14, 2024

Thanks for the comments @ozancaglayan and @BBC-Esq. I am furious about how this kind of PR was allowed to butcher the project despite all the community feedback. Performance is important, but cleanliness, maintainability, and compatibility take higher priority in software projects, especially open-source ones. This PR does not achieve a performance improvement in most scenarios (it needs specific hardware and data conditions), yet it makes the code complex out of proportion to the gains.

I am looking at patch #936, and it is still far from what needs to be done for the cleanup. One big example: the torch dependency is still there and should be removed; there is no place for the bulky torch dependency if you do not have a training pipeline. Also, most of the complex transcription code remains in there along with the batching code, and it is still bulky.

The effect on the code is massive, and everyone gets different performance benchmarks. One cannot disregard old or low-end hardware. In the cloud, many different CPU/GPU configurations are used for power/efficiency considerations.

@ozancaglayan I agree with your approach on the PR, as I think similarly. Starting over is critical and necessary for a change of this size; if development does not start from a clean, non-opinionated feature context with a proper modular design, it will be much harder to get back to a clean state.

@doublex

doublex commented Aug 14, 2024

5% speedup is considerable in cloud deployment scenarios where time is literally money ...

We use faster-whisper on mobile devices - and the small models work exceptionally well.
You can't just pip install torch (4GB) on a mobile device.

@MahmoudAshraf97
Contributor

5% speedup is considerable in cloud deployment scenarios where time is literally money ...

We use faster-whisper on mobile devices - and the small models work exceptionally well. You can't just pip install torch (4GB) on a mobile device.

We're already removing the torch dependency; it's just a matter of time. But the speedup I'm talking about here comes from batching, which does not depend on torch.

@heimoshuiyu

heimoshuiyu commented Sep 5, 2024

Just to share my story: after the batch PR, my https://github.com/heimoshuiyu/whisper-fastapi Docker image size grew from 5 GB to 10 GB (without the model file).

Edit: and the VRAM usage increased from 4 GB to 10 GB (30-minute audio on a 4060 Ti, using only WhisperModel, not BatchedInferencePipeline; the model is large-v2).
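
For context, this is roughly how the disputed batched API is invoked alongside the classic path, based on the project's documented usage; exact constructor arguments on the then-current master may differ, and the model name and audio path below are placeholders.

from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Classic, unbatched path (unchanged by the PR):
segments, info = model.transcribe("audio.wav", beam_size=5)

# Batched path added by the PR under discussion; batch_size trades VRAM for throughput:
batched = BatchedInferencePipeline(model=model)
batched_segments, batched_info = batched.transcribe("audio.wav", batch_size=16)

for segment in batched_segments:
    print(segment.text)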
