
Added multiprocessing for cpu processing #648

Open

joiemoie wants to merge 14 commits into master from multicore
Conversation

joiemoie

Because of the Python GIL, the preprocessing doesn't fully utilize all the CPU cores. By running the CPU-bound tasks in their own worker processes, requests that arrive on different threads can make full use of the available cores.
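
For context, a minimal sketch of the idea (illustrative only; the pool and helper below are not this PR's actual code): the CPU-bound preprocessing is submitted to a process pool, so work from different request threads is not serialized by the main process's GIL.

```python
# Illustrative sketch only, not this PR's implementation.
from concurrent.futures import ProcessPoolExecutor

from faster_whisper import decode_audio

# A shared pool reused across requests (the size here is arbitrary).
_preprocess_pool = ProcessPoolExecutor(max_workers=4)


def preprocess_in_subprocess(path, sampling_rate=16000):
    # decode_audio runs in a separate process, so it does not hold this
    # process's GIL while decoding and other request threads keep running.
    return _preprocess_pool.submit(
        decode_audio, path, sampling_rate=sampling_rate
    ).result()
```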

joiemoie force-pushed the multicore branch 2 times, most recently from dd68247 to 47e14c8 on January 19, 2024 04:45
@Purfview
Contributor

Does this have any actual impact on performance? Do you have benchmarks?

@joiemoie
Author

joiemoie commented Jan 19, 2024 via email

@joiemoie
Author

> Does this have any actual impact on performance? Do you have benchmarks?

Testing code:

```python
from io import BytesIO
import threading
import time
from concurrent.futures import ThreadPoolExecutor

import nvtx
from fastapi import FastAPI, Request, UploadFile

from faster_whisper import WhisperModel, decode_audio


def preprocess_audio(filename):
    with nvtx.annotate("Decode audio"):
        return decode_audio(filename)


model = WhisperModel(
    "large-v3",
    device="cuda",
    device_index=[0],
    compute_type="bfloat16",
    cpu_threads=2,
    num_workers=2,
)


def transcribe(model_to_use):
    start_time = time.time()
    with nvtx.annotate("Transcribe"):
        segments, info = model_to_use.transcribe(
            "test.wav",
            language=None,
            vad_filter=True,
            word_timestamps=False,
            vad_parameters={"window_size_samples": 1024},
            preprocess_on_multiple_cores=True,
        )
    print(
        f"Single Request Elapsed time: {time.time() - start_time}. "
        f"Audio duration: {info.duration}"
    )


# This is to clear out memory from the GPUs (warm-up runs)
transcribe(model)
transcribe(model)

if __name__ == "__main__":
    threads = []
    for i in range(20):
        threads.append(threading.Thread(target=transcribe, args=(model,)))

    start_time = time.time()

    for thread in threads:
        thread.start()

    for thread in threads:
        thread.join()

    print(f"Total Elapsed time: {time.time() - start_time}")
```

Results:

Overall time to pre-process 20 requests without multicore: 2.7506766319274902 seconds

Overall time to pre-process 20 requests with multicore: 1.9269721508026123 seconds

Now to test the overhead for a single request.

Overall time to pre-process 1 request without multicore: 0.21215391159057617 seconds

Overall time to pre-process 1 request with multicore: 0.21257996559143066 seconds

So there is a tradeoff: spawning the worker processes adds some overhead for a single request, but the difference is negligible, and concurrent requests see a clear speedup.

@trungkienbkhn
Copy link
Collaborator

@joiemoie, hello. Thanks for an interesting pull request.
From my test (20 requests, device=cpu, model=tiny, cpu_threads=8), I measured the overall times below:

  • without multicore: 14.506s
  • with multicore: 10.917s

That's a pretty significant improvement!
But I think we can improve further. I tried adding this logic to the cpu_preprocessing function:

```python
if not isinstance(audio, np.ndarray):
    audio = decode_audio(
        audio, sampling_rate=feature_extractor.sampling_rate
    )

if vad_filter:
    if vad_parameters is None:
        vad_parameters = VadOptions()
    elif isinstance(vad_parameters, dict):
        vad_parameters = VadOptions(**vad_parameters)
```

After this change, the overall time was 9.633s. I think the logic in the decode_audio function also takes up a significant share of the computation time.
What do you think about this idea? And should we move more logic into the cpu_preprocessing function?
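
For reference, a self-contained sketch of what that extended routine could look like (the signature below is assumed for illustration, not copied from the PR):

```python
# Sketch of the suggested extension; the signature is assumed,
# not the PR's exact cpu_preprocessing definition.
import numpy as np

from faster_whisper import decode_audio
from faster_whisper.vad import VadOptions


def cpu_preprocessing(audio, feature_extractor, vad_filter, vad_parameters):
    # Decode inside the worker process so the caller's thread is not tied up.
    if not isinstance(audio, np.ndarray):
        audio = decode_audio(audio, sampling_rate=feature_extractor.sampling_rate)

    # Normalize the VAD parameters in the worker as well.
    if vad_filter:
        if vad_parameters is None:
            vad_parameters = VadOptions()
        elif isinstance(vad_parameters, dict):
            vad_parameters = VadOptions(**vad_parameters)

    return audio, vad_parameters
```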

@joiemoie
Author

Nice, that's not a bad idea. Please don't merge this in for now: I noticed some memory inefficiency, and the pool size needs to be capped or made configurable via a parameter. I'm still investigating the memory issue.
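
For illustration, capping the pool could look something like this (the max_preprocess_workers parameter is hypothetical, not something this PR currently exposes):

```python
# Hypothetical sketch: max_preprocess_workers is an illustrative
# parameter name, not part of this PR.
import os
from concurrent.futures import ProcessPoolExecutor


def make_preprocess_pool(max_preprocess_workers=None):
    # Cap the pool so many concurrent requests do not each spawn a process.
    n_workers = max_preprocess_workers or min(4, os.cpu_count() or 1)
    return ProcessPoolExecutor(max_workers=n_workers)
```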

@trungkienbkhn
Collaborator

@joiemoie, hello. Have you finished your work yet? 😃
