
Multi threading #100

Closed
Joepetey opened this issue Mar 31, 2023 · 11 comments

Comments

@Joepetey

I saw a few people talking about using multiple threads. Is there any documentation or code examples I can see to accomplish this?

@guillaumekln
Contributor

guillaumekln commented Mar 31, 2023

There are 2 levels of multithreading:

  • Running one transcription on CPU with multiple threads
  • Running multiple transcriptions in parallel

Running one transcription on CPU with multiple threads

This number of threads can be configured with the argument cpu_threads (4 by default):

model = WhisperModel("large-v2", device="cpu", cpu_threads=8)

This is the number of threads used by the model itself (usually the number of OpenMP threads). The input is not split and processed in multiple parts.
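For completeness, a minimal single-file transcription with such a model could look like the following (the audio file name is a placeholder):

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cpu", cpu_threads=8)

# The segments are generated lazily: the transcription runs while iterating.
segments, info = model.transcribe("audio.mp3")
print("Detected language:", info.language)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))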

Running multiple transcriptions in parallel

Multiple transcriptions can run in parallel when the model is using multiple workers or running on multiple GPUs:

# Create a model running on CPU with 4 workers each using 2 threads:
model = WhisperModel("large-v2", device="cpu", num_workers=4, cpu_threads=2)

# Create a model running on multiple GPUs:
model = WhisperModel("large-v2", device="cuda", device_index=[0, 1, 2, 3])

# Using multiple workers on a single GPU is also possible but will not increase the throughput by much:
# model = WhisperModel("large-v2", device="cuda", num_workers=2)

Then you can call model.transcribe from multiple Python threads. Of course there are multiple ways to do that. If you are using this library in a webserver, it may already use multiple threads so there is nothing to do.

Just as an example, here's how you can submit multiple transcriptions using a ThreadPoolExecutor. If there are enough files you will see num_workers * cpu_threads active CPU threads.

import concurrent.futures

from faster_whisper import WhisperModel

num_workers = 4
model = WhisperModel("large-v2", device="cpu", num_workers=num_workers, cpu_threads=2)

files = [
    "audio1.mp3",
    "audio2.mp3",
    "audio3.mp3",
    "audio4.mp3",
]


def transcribe_file(file_path):
    segments, info = model.transcribe(file_path)
    segments = list(segments)
    return segments


with concurrent.futures.ThreadPoolExecutor(num_workers) as executor:
    results = executor.map(transcribe_file, files)

    for path, segments in zip(files, results):
        print(
            "Transcription for %s:%s"
            % (path, "".join(segment.text for segment in segments))
        )

@Joepetey
Author

Thank you @guillaumekln for your help!

@supratim1121992

(Quoting the reply above about the two levels of multithreading.)

I am running the model on an AWS p3.16xlarge SageMaker instance with 8 GPUs (16 GB each), and I am looking to achieve parallelization. Would ThreadPoolExecutor work in this case as well, after I create the model on multiple GPUs using device_index?

@guillaumekln
Contributor

Yes, the ThreadPoolExecutor example would also work to transcribe multiple files on multiple GPUs.

Another approach is to launch multiple Python processes (e.g. using multiprocessing) and load a model on a different GPU in each process.

@supratim1121992

(Quoting the reply above about ThreadPoolExecutor and multiprocessing.)

Could you please share a code snippet implementing the multiprocessing route with the above code, instead of using ThreadPoolExecutor?

@guillaumekln
Contributor

When using multiprocessing there is nothing specific to faster-whisper. You can look at the dozens of multiprocessing examples on the Web. You just want to make sure to load a single model in each process.
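For illustration, here is a minimal sketch of that pattern, assuming 4 GPUs and placeholder audio file names: one process is started per GPU, and each process loads its own model and pulls files from a shared queue.

import multiprocessing as mp

from faster_whisper import WhisperModel


def worker(gpu_id, file_queue, result_queue):
    # Each process loads a single model on its own GPU.
    model = WhisperModel("large-v2", device="cuda", device_index=gpu_id)
    while True:
        path = file_queue.get()
        if path is None:  # sentinel: no more work
            break
        segments, info = model.transcribe(path)
        result_queue.put((path, "".join(segment.text for segment in segments)))


if __name__ == "__main__":
    # "spawn" avoids issues with CUDA and the default "fork" start method on Linux.
    mp.set_start_method("spawn")

    gpu_ids = [0, 1, 2, 3]
    files = ["audio1.mp3", "audio2.mp3", "audio3.mp3", "audio4.mp3"]

    file_queue = mp.Queue()
    result_queue = mp.Queue()

    processes = [
        mp.Process(target=worker, args=(gpu_id, file_queue, result_queue))
        for gpu_id in gpu_ids
    ]
    for p in processes:
        p.start()

    for path in files:
        file_queue.put(path)
    for _ in processes:
        file_queue.put(None)  # one sentinel per worker

    for _ in files:
        path, text = result_queue.get()
        print("Transcription for %s:%s" % (path, text))

    for p in processes:
        p.join()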

@brajeshvisio01

brajeshvisio01 commented Aug 10, 2023

@guillaumekln I am using the model on GPU with the line model = WhisperModel("large-v2", device="cuda", device_index=[0, 1, 2, 3]), but the response time keeps adding up. It is not handling all the requests at once; it processes them like a queue. Below is my Flask app code:
import base64
import concurrent.futures
from flask import *
from flask_cors import CORS, cross_origin
import os
import time
import wave
import threading
from faster_whisper import WhisperModel

model_size = "large-v2"
app = Flask(__name__)

os.environ["OMP_NUM_THREADS"] = "6"

model = WhisperModel("large-v2", device="cuda", device_index=[0, 1, 2, 3], num_workers=4, compute_type="int8")


@app.route("/transcribe", methods=["POST"])
@cross_origin()
def transcribe():
    start_time = time.time()
    try:
        audio_file = request.files["audio"]
        audio_file.save(audio_file.filename)
        segments, info = model.transcribe(
            audio_file.filename,
            language="en",
            task="transcribe",
            beam_size=5,
            temperature=0.2,
            vad_filter=True,
        )
        result = ""
        for segment in segments:
            result = result + " " + segment.text
        end_time = time.time()
        dur = end_time - start_time
        return {"start_time": start_time, "end_time": end_time, "duration": round(dur, 3), "text": result}
    except Exception as e:
        print("Error::", str(e))
        return str(e)


if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

@guillaumekln
Contributor

Try adding threaded=True when calling app.run.

@brajeshvisio01

brajeshvisio01 commented Aug 10, 2023

@guillaumekln
I added threaded=True:
app.run(debug=True, threaded=True, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

The time is still increasing, and I am getting this in the log:
Screenshot (14)
This is showing the increasing response time:
Screenshot (16)

@brajeshvisio01

@guillaumekln
Just for your information, the code above (https://github.com/guillaumekln/faster-whisper/issues/100#issuecomment-1672616398) ran earlier (07-Aug-2023) with an average response time of around 600 ms, but now it is giving the result above (https://github.com/guillaumekln/faster-whisper/issues/100#issuecomment-1672652559).
Kindly help.
Thanks and regards

@wwfcnu

wwfcnu commented Nov 16, 2023

I ran 4 processes on a single GPU at the same time, but the speed did not improve.
