Code for chat inference server #1329
Comments
I edited the code above to remove the problematic 'repetition_penalty=1.5' parameter.
This is insanely cool and exactly what we were also planning to do; checking it out!
Okay, I'll add some tips later that may be helpful. :) It's currently running this feature: https://www.languagereactor.com/chatbot
NICE!!!!!!!!!!!!!! which model(s) are you running?
have you added any more to this?
Cleaned up the code a bit and made streaming possible, but the code is currently buggy, with the wrong response coming back on parallel requests. There might be some un-threadsafe parts of the code that are causing this.
THIS IS TOTALLY AWESOME!!!!!!!!!!!!!!!!!! WILL TEST IT OUT!!!!!!!!!!!!!!!!!!
Is there any way I can get a copy of the python wrapper you referred to in your first message? I'm also curious whether you have any other resources on Python as it relates to CTranslate2 specifically. I've been struggling to write my own scripts utilizing both technologies, although I still think CTranslate2 is awesome.
@BBC-Esq For a chat usage, you can see the Llama 2 example: https://github.com/OpenNMT/CTranslate2/tree/master/examples/llama2
I just cloned your repo and am going to start studying it a little more to understand it better. I was already aware of that chat script and actually have something based off of it, but I'm still struggling. Will keep trying, though, because if it lives up to what I've seen comparatively with Faster Whisper and WhisperX, it'll beat ggml/gguf/gptq and those guys. Appreciate the advice.
this package is the best. also, you should check out faster-whisper.
I tried to solve the bug. My solution is not very elegant, but it should solve the problem: process the requests with the most tokens first. You only need to modify `handle_stream_request`:

```python
async def handle_stream_request(self):
    global BUFFER
    processing_queue = BUFFER.flush(MAX_BATCH_SIZE)
    # Sort the processing queue by token length, longest prompts first.
    processing_queue.sort(
        key=lambda item: len(self.sp.encode(item[0], out_type=str)),
        reverse=True,
    )
    if len(processing_queue) == 0:
        return
    try:
        self.batch_count += 1
        print(
            f"[GPU-{self.device_number}] handle_stream_request: "
            f"processing {len(processing_queue)} requests"
        )
        # print(processing_queue)
        await self.generate_stream(processing_queue)
    finally:
        self.batch_count -= 1
```
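For the underlying race itself (wrong responses coming back on parallel requests, as noted above), one possible guard is to serialize access to the shared buffer. This is purely a hypothetical sketch, not a fix from this thread: it assumes the handlers run as coroutines in one event loop (use a threading.Lock instead if they run on multiple threads), and `BUFFER`, `MAX_BATCH_SIZE`, and `generate_stream` are placeholders mirroring the snippet above:

```python
import asyncio

BUFFER_LOCK = asyncio.Lock()  # hypothetical: one lock guarding the shared BUFFER

async def handle_stream_request(self):
    global BUFFER
    # Flush atomically so two concurrent handlers can never grab the same request.
    async with BUFFER_LOCK:
        processing_queue = BUFFER.flush(MAX_BATCH_SIZE)
    if not processing_queue:
        return
    try:
        self.batch_count += 1
        await self.generate_stream(processing_queue)
    finally:
        self.batch_count -= 1
```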
very cool, checking it out.
Can you provide the code regarding faster whisper? I'd love to see it!
hey guys, the following is my working code:
I have 4 A6000s, and during inference only around 5-6 GB of VRAM is used per card while the rest just sits idle. I was wondering if there is a way to have multiple replicas of the same model on each card, to maximize resources and get higher throughput/concurrency with minimal latency. I tried reducing the batch size and adding the inter_threads argument, but there isn't much of a difference in load testing. Could someone please let me know what exactly to do?
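For what it's worth, the CTranslate2 docs on parallel execution describe replicating a model by passing a list of device IDs, and as I recall a repeated device ID loads an extra replica on that GPU (worth verifying against the current docs). A minimal sketch under that assumption; the model path, tokenizer path, and generation settings are placeholders:

```python
import ctranslate2
import sentencepiece as spm

# Hypothetical paths; substitute your converted model and tokenizer.
MODEL_DIR = "llama-2-13b-ct2"
SP_MODEL = "tokenizer.model"

# Two replicas per GPU across 4 GPUs (each device ID repeated twice).
# Incoming batches are dispatched to whichever replica is idle.
generator = ctranslate2.Generator(
    MODEL_DIR,
    device="cuda",
    device_index=[0, 0, 1, 1, 2, 2, 3, 3],
)

sp = spm.SentencePieceProcessor(model_file=SP_MODEL)

def generate(prompt: str) -> str:
    tokens = sp.encode(prompt, out_type=str)
    results = generator.generate_batch(
        [tokens],
        max_length=80,
        sampling_temperature=0.8,
    )
    return sp.decode(results[0].sequences_ids[0])
```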
Hey, would love to check out the similar wrapper for faster-whisper.
THIS IS TOTALLY COOL!!!!!!!!!!!!!!!!!!
This looks nice; any plans to put this on Docker Hub or the GitHub Container Registry? Or other integration plans?
As great as it is, the original poster said they probably won't have time to support it, so we can't commit things that won't have support. But it stays here. Closing.
I made a wrapper around CTranslate2: an API server. It batches requests, supports running these batches on multiple GPUs (sort of round-robin), requests can have different priority levels, a request is immediately rejected when more than 100 requests are ahead of it in the queue, and there is an endpoint that returns the approximate load. It doesn't do streaming, just continuation for up to 80 tokens; this fits with how I am using CTranslate2. It avoids a websocket (which can create networking issues): from the front end you do requests in a loop until you see the string `</s>` in the continuation string (a client-side sketch of this polling pattern follows the file list below). It's responsive enough with the 13B model at least; the performance overhead could be tested. @guillaumekln If you want to use it as an example, or put it in a separate repo, please feel free. Knowing myself, I probably won't be able to find time to provide support, but it may be useful for someone. I also have a Python wrapper for faster-whisper. If you prefer to just delete the issue, I won't have hurt feelings. :) If there is interest, I can document the code better. What you do on the front-end etc.:
app.py
Dockerfile
docker-dioco-ct2-chat.service [systemd unit file]
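To illustrate the request-loop pattern described above (polling for continuations instead of holding a websocket), here is a rough client-side sketch; the /generate endpoint, the JSON fields, and the priority values are hypothetical, not taken from app.py:

```python
import requests

SERVER = "http://localhost:8000"  # hypothetical address

def complete(prompt: str, priority: int = 0) -> str:
    """Request continuations in a loop until the </s> marker appears."""
    text = prompt
    while True:
        resp = requests.post(
            f"{SERVER}/generate",  # hypothetical endpoint name
            json={"text": text, "priority": priority},  # hypothetical payload
            timeout=60,
        )
        # The server described above rejects requests outright when the
        # queue is too deep, so surface HTTP errors to the caller.
        resp.raise_for_status()
        continuation = resp.json()["continuation"]  # hypothetical field
        text += continuation
        if "</s>" in continuation:
            # Strip the prompt and everything after the end-of-sequence marker.
            return text.split("</s>")[0][len(prompt):]
```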