Ideas for better performance #1140

Open
hobodrifterdavid opened this issue Mar 20, 2023 · 12 comments

@hobodrifterdavid

hobodrifterdavid commented Mar 20, 2023

Hello.
So, I want to run the NLLB-200 (3.3B) model on a server with 4x RTX 3090 and, say, a 16-core AMD Epyc CPU.
I wrapped CTranslate2 in FastAPI, running with uvicorn, inside a Docker container with GPU support.

All code is here, feel free to do whatever with it:
https://github.com/hobodrifterdavid/nllb-docker-rest

I want to handle requests with between 1 and 1000 sentences, with a reasonable balance between latency and throughput.

Here are a few things I did, from reading the documentation:

for ctranslate2.Translator:

device='auto', # May use CPU for very small translations?
compute_type="float16",
device_index=[0, 1, 2, 3]

for translator.translate_batch:

max_batch_size=256 # Bigger than this I get CUDA OOM errors.

I tried to use translate_batch with asynchronous=True, but couldn't easily figure out how to await the results (EDIT: figured it out, added results below).
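Putting those options together, the setup looks roughly like this (a sketch only: the model directory name and the pre-tokenized `source_tokens` are placeholders, and the target language token is just an example):

```python
import ctranslate2

# Sketch of the translator setup described above (model path is a placeholder).
translator = ctranslate2.Translator(
    "nllb-200-3.3B-ct2-float16",   # converted model directory
    device="auto",                 # may use CPU for very small translations?
    compute_type="float16",
    device_index=[0, 1, 2, 3],
)

# source_tokens: list of SentencePiece token lists, one per sentence (placeholder).
results = translator.translate_batch(
    source_tokens,
    target_prefix=[["spa_Latn"]] * len(source_tokens),  # NLLB target language token
    max_batch_size=256,            # bigger than this I get CUDA OOM errors
)
```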

uvicorn is run without the --workers flag, so it defaults to a single Python process with a single model loaded into GPU RAM. FastAPI accepts up to 40 concurrent requests.

Anyway, I'll carry on trying to improve this setup and will post further results. If there are suggestions for anything I missed, they would be appreciated. Python is not my first language, please excuse naive errors.

@hobodrifterdavid
Author

hobodrifterdavid commented Mar 21, 2023

I made a load-testing script, translating the FLORES dataset EN => ES. FLORES sentences are pretty long, newspaper-style sentences.

Here the server handles a request, translate_batch is called with "Req. Batchsize" sentences, the response is sent, and then the server receives the next request.

Req. Batchsize: 1, sents/s: 1.89
Req. Batchsize: 4, sents/s: 6.08
Req. Batchsize: 16, sents/s: 15.27
Req. Batchsize: 64, sents/s: 47.00
Req. Batchsize: 256, sents/s: 64.82
Req. Batchsize: 1000, sents/s: 165.06

Here the requesting machine does 4 concurrent translation requests:

Batchsize: 1, sents/s: 2.89
Batchsize: 4, sents/s: 8.53
Batchsize: 16, sents/s: 25.45
Batchsize: 64, sents/s: 52.98
Batchsize: 256, sents/s: 67.36
Batchsize: 1000, sents/s: 175.07

And 16:

Batchsize: 1, sents/s: 2.90
Batchsize: 4, sents/s: 8.54
Batchsize: 16, sents/s: 25.49
Batchsize: 64, sents/s: 52.67
Batchsize: 256, sents/s: 65.13
Batchsize: 1000, sents/s: 174.77

I think the code may benefit from figuring out translate_batch with asynchronous=True, or else certainly from making a task queue and batching across requests. Does this code exist already?

EDIT: going to --workers 2, I'm seeing GPU RAM at 16GB+ (it's 8GB+ with a single worker), so definitely two models loaded in. Running again with 16 concurrent requests, it's a bit faster:

Batchsize: 1, sents/s: 3.67
Batchsize: 4, sents/s: 11.88
Batchsize: 16, sents/s: 33.95
Batchsize: 64, sents/s: 53.02
Batchsize: 256, sents/s: 87.28
Batchsize: 1000, sents/s: OOM

EDIT2: I found the code snippet for asynchronous=True. It doesn't help performance here though; after a bit of reading, it turns out FastAPI already runs sync request handlers in a thread pool anyway.
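For reference, the pattern looks roughly like this (a sketch, not the exact snippet from the screenshot): with asynchronous=True, translate_batch returns AsyncTranslationResult objects, and .result() blocks until each translation is done.

```python
# Sketch of the asynchronous=True pattern (variables reused from the setup above).
async_results = translator.translate_batch(
    source_tokens,
    target_prefix=[["spa_Latn"]] * len(source_tokens),
    asynchronous=True,
    max_batch_size=256,
)
translations = [res.result() for res in async_results]  # .result() blocks until ready
```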
Numbers with 16 concurrent requests, 1 uvicorn worker:
Batchsize: 1, sents/s: 2.50
Batchsize: 4, sents/s: 8.22
Batchsize: 16, sents/s: 24.26
Batchsize: 64, sents/s: 50.99
Batchsize: 256, sents/s: 66.62
Batchsize: 1000, sents/s: 172.69

EDIT3: Same as above (16 concurrent requests, 1 uvicorn worker), with 1 GPU only (device_index=[0]):

Batchsize: 1, sents/s: 2.61
Batchsize: 4, sents/s: 7.69
Batchsize: 16, sents/s: 21.74
Batchsize: 64, sents/s: 52.34
Batchsize: 256, sents/s: 58.01
Batchsize: 1000, sents/s: 73.07

With 2 GPU (device_index=[0,1]):

Batchsize: 1, sents/s: 2.71
Batchsize: 4, sents/s: 8.51
Batchsize: 16, sents/s: 25.33
Batchsize: 64, sents/s: 52.62
Batchsize: 256, sents/s: 67.17
Batchsize: 1000, sents/s: 138.76

Seems like multi-GPU starts to give a benefit above batch sizes of 256.

EDIT: I made code that collects requests and batches them together; big improvement when there are lots of small requests. Will post the code tomorrow.

@guillaumekln
Collaborator

Concurrent requests are processed sequentially in the translate function, which is not ideal. Ideally, translate should be called from multiple Python threads, which would automatically enable multi-GPU translations. Some webservers allow using multiple worker threads (not processes!), but that does not seem to be the case for uvicorn.

Note that in your example, multiple GPUs are only used in the case "Batchsize: 1000": the request will be rebatched with max_batch_size=256 and each sub-batch will be executed by a different GPU.

For all other cases only a single GPU is working at a time because translate is executed sequentially AND the request size is smaller or equal to max_batch_size.
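For illustration, a minimal sketch of the multi-threaded pattern (translator and request_batches are assumed to exist; the worker count is arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor

# Each thread calls translate_batch concurrently; with device_index=[0, 1, 2, 3]
# the concurrent calls can be dispatched to different GPU workers.
def translate_one_request(batch_tokens):
    return translator.translate_batch(batch_tokens, max_batch_size=256)

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(translate_one_request, batch) for batch in request_batches]
    results = [future.result() for future in futures]
```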

@hobodrifterdavid
Author

hobodrifterdavid commented Mar 22, 2023

Hmm. OK, so it seems I had an 'async' FastAPI handler, but then I was calling blocking functions inside it (translate_batch). If you don't have an 'async' request handler, FastAPI spawns a thread to run it, so it doesn't block the handling of other requests. (I am explaining to myself, I am figuring this stuff out.)

Solution one is just to remove the async keyword from the request handler, and FastAPI runs the handler in a thread:
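(The screenshot isn't reproduced here; a minimal sketch of that pattern, with a hypothetical endpoint, request model, and tokenize/detokenize helpers, looks like this:)

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TranslateRequest(BaseModel):
    sentences: list[str]
    src_lang: str
    tgt_lang: str

# No "async" keyword: FastAPI runs this handler in its thread pool,
# so the blocking translate_batch call doesn't stall the event loop.
@app.post("/translate")
def translate(req: TranslateRequest):
    tokens = tokenize(req.sentences, req.src_lang)        # hypothetical helper
    results = translator.translate_batch(
        tokens,
        target_prefix=[[req.tgt_lang]] * len(tokens),
        max_batch_size=128,
    )
    return {"translations": detokenize(results)}          # hypothetical helper
```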

Performance (16 concurrent requests, 4 gpus, 1 uvicorn worker, switched to max_batch_size 128 to avoid potential OOM errors):

Batchsize: 1, sents/s: 7.82
Batchsize: 4, sents/s: 25.04
Batchsize: 16, sents/s: 94.00
Batchsize: 64, sents/s: 197.08
Batchsize: 256, sents/s: 205.18
Batchsize: 1000, sents/s: 251.80

Solution two is to keep an 'async' request handler but wrap the blocking call; this also runs it in a thread and doesn't block the event loop:
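(Again a sketch rather than the screenshot: the same hypothetical endpoint, but async, with the blocking call pushed onto the thread pool via run_in_threadpool.)

```python
from fastapi.concurrency import run_in_threadpool

# Async handler: the blocking translate_batch call runs in the thread pool,
# so the event loop stays responsive.
@app.post("/translate")
async def translate(req: TranslateRequest):
    tokens = tokenize(req.sentences, req.src_lang)        # hypothetical helper
    results = await run_in_threadpool(
        translator.translate_batch,
        tokens,
        target_prefix=[[req.tgt_lang]] * len(tokens),
        max_batch_size=128,
    )
    return {"translations": detokenize(results)}          # hypothetical helper
```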

Batchsize: 1, sents/s: 6.98
Batchsize: 4, sents/s: 25.10
Batchsize: 16, sents/s: 97.21
Batchsize: 64, sents/s: 194.80
Batchsize: 256, sents/s: 217.99
Batchsize: 1000, sents/s: 249.32

Performance is essentially the same.

Okay, now I switch on the code that collects translations from many requests and processes them together in a batch (16 concurrent requests):

Batchsize: 1, sents/s: 11.89
Batchsize: 4, sents/s: 31.72
Batchsize: 16, sents/s: 74.01
Batchsize: 64, sents/s: 195.33
Batchsize: 256, sents/s: 176.32
Batchsize: 1000, sents/s: 236.62

Increasing concurrent requests to 128 (FastAPI only allows 40 to be handled at a time, IIRC):

Batchsize: 1, sents/s: 48.86
Batchsize: 4, sents/s: 93.88
Batchsize: 16, sents/s: 190.34
Batchsize: 64, sents/s: 177.50
Batchsize: 256, sents/s: 169.45
Batchsize: 1000, sents/s: 234.16

The code is here, it's a bit ugly: https://github.com/hobodrifterdavid/nllb-docker-rest/blob/main/app.py

There's a potential concern that the single Python process is bottlenecking the throughput; it might be better to start two containers or something and split the GPUs between them. Didn't check this. Or you might have two sets of GPUs handling requests independently in one Python process. From here I need to make a decision about how to balance latency and throughput, possibly prioritising shorter translations.
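The core idea of the collect-and-batch code, stripped down (the real app.py does more, e.g. per-sentence language pairs and error handling; all names here are illustrative):

```python
import asyncio
from fastapi.concurrency import run_in_threadpool

queue: asyncio.Queue = asyncio.Queue()

# Request handlers put (tokens, future) pairs on the queue and await the futures.
async def enqueue_sentences(token_lists):
    loop = asyncio.get_running_loop()
    futures = [loop.create_future() for _ in token_lists]
    for tokens, fut in zip(token_lists, futures):
        await queue.put((tokens, fut))
    return await asyncio.gather(*futures)

# Background task (started on app startup): drain whatever has accumulated,
# translate it with a single translate_batch call, and resolve the futures.
async def batch_worker():
    while True:
        item = await queue.get()
        batch = [item]
        while not queue.empty() and len(batch) < 1000:
            batch.append(queue.get_nowait())
        inputs = [tokens for tokens, _ in batch]
        results = await run_in_threadpool(
            translator.translate_batch, inputs, max_batch_size=128
        )
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
```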

@guillaumekln
Collaborator

Batchsize: 1, sents/s: 7.82
Batchsize: 4, sents/s: 25.04
Batchsize: 16, sents/s: 94.00
Batchsize: 64, sents/s: 197.08
Batchsize: 256, sents/s: 205.18
Batchsize: 1000, sents/s: 251.80

These numbers look good to me. The performance increase is almost linear with the number of GPUs, e.g. for "Batchsize: 64" going from 52.34 to 197.08 is a ~3.8x speedup.

@vince62s
Member

@hobodrifterdavid why don't you make a PR to replace our old-style translation_server based on Flask with FastAPI + uvicorn?

see here: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/bin/server.py
and for your info, I made a tutorial on finetuning NLLB-200 here: https://forum.opennmt.net/t/finetuning-and-curating-nllb-200-with-opennmt-py/5238

It would be helpful to have a robust and fast server solution for OpenNMT.
Cheers.

@hobodrifterdavid
Author

hobodrifterdavid commented Mar 23, 2023

@vince62s Hi Vince. One advantage of the sketch I made (there's no error handling etc. yet) is that one Python process handles multiple concurrent requests. This means you can translate sentences from multiple requests together, calling translate_batch once, which helps a lot when you are handling a lot of small translations with NLLB. With NLLB, a single batch can contain translations with different language pairs; you can't do that with Marian models etc. Actually, I'm not certain that CTranslate2 doesn't have some kind of combine-smaller-translations-into-a-bigger-batch code internally (@guillaumekln); I could run the code without the request-batching stuff, with say 128 concurrent requests, to check. I'd like to contribute, I'm just a little crushed currently, as we're deploying a chat feature on Language Reactor. Actually the 'sketch' is already handling translations for 'Text Mode' (you import a website/paste a text).
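To illustrate the mixed-language-pair point, a sketch of how one translate_batch call could mix pairs, assuming the usual NLLB convention of a source language token on the source side plus a per-sentence target_prefix (sp_encode is a hypothetical SentencePiece helper):

```python
# Each example carries its own source language token and target prefix,
# so one batch can mix e.g. eng_Latn->spa_Latn and deu_Latn->fra_Latn.
examples = [
    ("eng_Latn", "spa_Latn", "Hello world."),
    ("deu_Latn", "fra_Latn", "Guten Morgen."),
]

source_tokens = [
    [src] + sp_encode(text) + ["</s>"]    # sp_encode: hypothetical SentencePiece helper
    for src, _, text in examples
]
target_prefix = [[tgt] for _, tgt, _ in examples]

results = translator.translate_batch(
    source_tokens,
    target_prefix=target_prefix,
    max_batch_size=128,
)
```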

I noticed that if you give NLLB subtitles to translate with multiple sentences, it only translates one sentence (seemingly the longest). The OPUS models do this too, I think. So a sentence tokeniser that handles 100+ languages is needed. UDPipe can handle maybe 50, but it's an expensive way to break sentences. I'll check your repo to see what you are doing.

EDIT: I'm probably missing something simple, but this web demo also translates only a single sentence and ignores the second: (https://huggingface.co/spaces/Geonmo/nllb-translation-demo)

Finetuning NLLB-200 - this is very cool and I will certainly take a look, thanks for that. Can you also use, say, 4x 24GB GPUs for training the 1.3B model? Anyway, getting off topic. :)

@guillaumekln
Collaborator

Actually, I'm not certain that CTranslate2 doesn't have some kind of combine-smaller-translations-into-a-bigger-batch code internally

There is a C++ class that does this, but it's not currently used and there is no equivalent in the Python API.

@hobodrifterdavid
Author

hobodrifterdavid commented Mar 28, 2023

Thanks Guillaume.

It was recommended to me to use the split-sentences.perl logic from Moses. Sacremoses (a Python reimplementation of some Moses scripts) doesn't have a complete reimplementation of split-sentences.perl, so I used mosestokenizer (https://github.com/luismsgomes/mosestokenizer), which bridges from Python to the original Perl code using pipes. The code (some version) is also in the Moses repo (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/mosestokenizer/sentsplitter.py).
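Usage is roughly like this (a sketch; the splitter shells out to the Perl script under the hood):

```python
from mosestokenizer import MosesSentenceSplitter

# Split a block of text into sentences with the Moses splitter.
with MosesSentenceSplitter("en") as split_sents:
    sentences = split_sents(["This is one sentence. This is another one."])
# -> ["This is one sentence.", "This is another one."]
```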

EDIT: In the NLLB paper, they mention the stopes data prep pipeline; my repo now has the relevant code for sentence segmenting and cleaning from there (stopes_snippet folder).


A few notes:

  1. The API accepts FLORES language codes (not Google codes, as earlier).
  2. The sentence splitter, as used here, swallows newlines when breaking strings into sentences. This is OK for our use, but newlines between sentences could at least be preserved inside a string for translation with some slightly tricky code.
  3. TODO: When the translations are reassembled into strings, my code joins them with spaces, which is probably not what you want for Asian languages.
  4. TODO: Need to check language codes passed to moses splitter are all appropriate.
  5. TODO: Check that moses sentence splitter is appropriate for Asian etc. languages with NLLB model.

@ArtanisTheOne

I have a custom implementation using a Flask REST API server which can help with preserving the structure of sentences with regard to newlines or custom strings/characters you may want to exclude from translation requests to the model, if needed.

@hobodrifterdavid
Author

hobodrifterdavid commented May 20, 2023

There was a nasty bug when batches were processed with multiple languages; pushed a fix. The code seems pretty stable otherwise.


@arnavmehta7

Hi, were you able to figure out how to translate everything in one batch if the requests arrive within a DELTA of each other?

@nickchomey
Contributor

nickchomey commented Sep 14, 2023

I just stumbled upon this issue as I want to do something extremely similar (FastAPI + NLLB, though without any GPUs). It'll be at least a few months till I have time to look into any of this again, but I wonder if gunicorn could be helpful here with the multithreaded stuff?

https://fastapi.tiangolo.com/deployment/server-workers/

Edit: on second thought, I don't think this is what you're looking for as it would surely just multiply the Python processes (and ram requirements). It seems like what you did with async batching requests is a nice way to handle it all.

Perhaps it could be combined with an effort such as the one mentioned in this issue for continuous batching? (Though it was already confirmed there that it isn't and won't really be possible, and a batching mechanism like the one already implemented here was recommended.) #1333
