possible quantization (e.g. ctranslate2, llama.cpp, bitsandbytes, gptq, etc.?) #68
Hey, regarding gguf or other quantized formats – I don't have much practical experience with these. A conversion would certainly be possible, since we use the same layers as llama and whisper. See also #56.
Thanks. Is there a way to choose between different voices – a variety of English voices, for example – or is there only one that's hardcoded? I'm familiar with the naming misnomer: tiny, base, small go from smallest to largest for Whisper models, and there's also medium/medium.en and large-v2, which apparently you haven't trained on yet, if I understand correctly.
You can use the voice cloning feature to get different voices:

```python
generate_to_file("""
This is the first demo of WhisperSpeech, a fully open source text-to-speech model trained by Collabora and LAION on the Juwels supercomputer.
""", lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')
```

See the `speaker` parameter.
For those interested, I tested the same text with my program located at https://github.com/BBC-Esq/ChromaDB-Plugin-for-LM-Studio. Specifically, I ran the program, obtained a response from the LLM, and clicked the "Bark Response" button within my GUI, which uses a Bark model to speak the text of the LLM's response. My program can use both the "normal" and "small" Bark models, in either fp32 or fp16. Here are the results:

- Bark / normal / fp32 = 8 GB
- WhisperSpeech / small-en / float32 (hardcoded by default) = 5.8 GB

CAVEAT: these numbers are RELATIVE; I did not subtract my system overhead, which fluctuates between 1.5 and 2.5 GB depending on what my computer is doing. However, I ran both tests with the same overhead as far as I could control it.

In terms of audio quality, both stumbled in the same places (e.g. with non-English strings like "(Ga. Juv. Prac. & Proc.) § 6:29", a citation to a legal treatise), and WhisperSpeech actually seemed smoother overall – and that's using the "small-en" model! I think it'd be exciting to see the medium.en and large-v2 models (avoid large-v3, IMHO, due to regressions!), but quantized in float16 or other formats.

Overall, very impressed/excited. I'd recommend looking at ctranslate2, llama.cpp, Transformers, or other quantization options. Transformers is updated more often and now supports BetterTransformer, bitsandbytes, and Flash Attention 2. Just my humble suggestion!
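A side note on measurement: relative readings like the above can be made independent of system overhead by querying PyTorch's allocator directly instead of watching overall memory usage. A minimal sketch (the helper name `report_vram` is mine, not part of either project):

```python
import torch

def report_vram(tag: str) -> float:
    """Print and return the VRAM currently held by PyTorch tensors, in GB.

    memory_allocated() counts live tensors; memory_reserved() also counts
    blocks the caching allocator is holding on to. On a CPU-only machine
    this simply returns 0.0.
    """
    if not torch.cuda.is_available():
        return 0.0
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")
    return allocated
```

Calling it once before and once after loading a model gives a difference that excludes everything except the model itself.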
I would also add that I noticed you mention "whisperx" in fetch_models.py, which is based on ctranslate2, so perhaps that's an inroad if you already plan on using other quantized models. I didn't have a chance to analyze what the script actually does, however. In any event, I'll be following this repository and would love to include it in my program.
For the GPU RAM usage, I think we may be quite suboptimal – right now we always load the FP32 weights and then convert to FP16. From my quick tests, it seems the FP32 memory is never released, so in FP16 mode we actually use more GPU memory than in FP32. ;) That would definitely be worth fixing.
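For what it's worth, that symptom is consistent with the FP32 blocks staying in PyTorch's caching allocator (or alive via a lingering Python reference) after the conversion. A minimal sketch of converting and then explicitly releasing the cached memory – the helper name is hypothetical, not repo code:

```python
import gc
import torch
import torch.nn as nn

def to_fp16_and_release(model: nn.Module) -> nn.Module:
    """Convert a model to FP16 and release the cached FP32 weight memory.

    .half() swaps the parameters to FP16, but the old FP32 blocks stay in
    PyTorch's caching allocator (and alive, if anything still references
    them) until gc and empty_cache run.
    """
    model = model.half()
    gc.collect()                      # drop any unreferenced FP32 tensors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # hand cached blocks back to the driver
    return model

# Toy demo: a single Linear layer standing in for a real model.
m = to_fp16_and_release(nn.Linear(8, 8))
print(next(m.parameters()).dtype)  # torch.float16
```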
We are aware of CTranslate2, but we don't have the resources to work on these conversions at the moment. The fetch_models script downloads the models we use for data preprocessing during training, because the cluster we are using has no internet access on the compute nodes.
If that's the case, point me to the script(s) that do the conversion and I'll take a quick look at whether/why the memory isn't being released. It's possible it's already superior to Bark right now (probably marginally), even though small-en is the largest model available at the moment. I'm assuming you're using dynamic quantization via PyTorch at runtime?
We are using fp16 and torch.compile, no weight quantization at all at the moment. The conversion test should be pretty simple. I think the way to go would be to load the models, convert to fp16 ( |
I'm sorry, you said in one message that they're float32 and then you said float16... can you please clarify?
Sorry, there are a couple of steps, so it may not be clear: the models are uploaded to Hugging Face in FP32; we download them in FP32, convert to FP16 after loading, and run inference in FP16.
Where exactly does this happen? It seems like it's done in multiple places; for example, here:
What I am evaluating is possibly using Further, it's my understanding that under the hood it includes some custom CUDA kernels... Here's a short summary from their GitHub: It's officially only available on Linux, but I've been successfully using a custom wheel FROM HERE for the last few months, no issues whatsoever:
The fp16 conversion is done in For 8-bit, I'd start by replacing the Linear layers in the source code on a temporary branch and seeing how well it works. If it works and the performance is good, we can later implement switching the layers to 8-bit in a manner similar to the existing convert_to_eval functions.
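One way to prototype the suggested Linear-layer swap without editing any model code is PyTorch's built-in dynamic int8 quantization, which replaces every `nn.Linear` automatically. This is a generic sketch on a toy block, not the repo's `convert_to_eval` mechanism:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Hypothetical stand-in for one transformer feed-forward block.
model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)).eval()

# Replace every nn.Linear with a dynamically quantized int8 version:
# weights are stored as int8, activations are quantized on the fly at runtime.
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    y = qmodel(torch.randn(2, 256))
print(y.shape)  # torch.Size([2, 256])
```

Dynamic quantization runs on CPU, so it is a cheap way to check the quality impact of int8 Linear layers before investing in a bitsandbytes or CTranslate2 port.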
You mentioned that the architecture is the same as Whisper, but is there any way you could release it? Is it exactly the same? Even a slight variation would alter the quantization options. It'd help me understand the model layers as I begin working on this after implementing other GPU acceleration. Hugging Face has all of the models in the same repository, so I don't know if this would require restructuring it, but here's an example from OpenAI's whisper-large-v2: In one of the other issues you posted a link to some charts and graphs from when it was trained, but I can't locate that again and don't remember whether it showed the architecture...
I am pretty sure you can view a table like this if you open the model file in Netron. We don't have any diagrams right now; the only documentation of the architecture is the model code in Python. It is an encoder-decoder transformer like Whisper, with some additions: mainly RoPE positional embeddings and MusicGen-like multiple heads for efficient EnCodec token prediction.
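For readers unfamiliar with the RoPE addition mentioned above, here is a minimal, self-contained sketch of rotary positional embeddings (my own illustration – the repo's implementation details may differ):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq, dim), dim even.

    Each channel pair is rotated by an angle that grows with position, so
    relative offsets between positions show up directly in dot products,
    with no learned positional parameters.
    """
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1_i, x2_i) pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because it is a pure rotation, position 0 is left unchanged and every position keeps its vector norm – two cheap sanity checks for any implementation.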
That Netron is awesome, thanks. It worked for most of the models, and I was automatically directed to their GitHub, where I submitted an issue. Overall, it loaded all of the models except those beginning with "s2a", with one exception: it did load s2a_up.model.
My script located here has a basic test now...
#67
I'm excited about the concept of reverse-engineering Whisper models (so to speak) for TTS, but am wondering if it's possible to use quantized versions of the models. There's no information on the model architecture on Hugging Face in the usual config.json file, at least that I found. Therefore, it's impossible to tell whether the models could be converted to quantized versions using, for example, ctranslate2, llama.cpp (GGUF format), GPTQ, Transformers (relying on BetterTransformer and/or bitsandbytes and/or Flash Attention 2), or whatever other options are out there.
I'm looking for a lighter alternative to Bark, since it's so heavy, but I noticed that even the "small" version of your models is fairly heavy. If quantized, however, I wonder whether it would be better overall. The audio quality is so close to Bark's that any speedup would make this far superior.