
possible quantization (e.g. ctranslate2, llama.cpp, bitsandbytes, gptq, etc.?) #68

Open
BBC-Esq opened this issue Jan 30, 2024 · 16 comments

@BBC-Esq
Contributor

BBC-Esq commented Jan 30, 2024

My script located here has a basic test now...

#67

I'm excited about the concept of reverse-engineering Whisper models (so to speak) for TTS, but I'm wondering whether it's possible to use quantized versions of the models. There's no information about the model architecture on Hugging Face in the usual config.json file, at least none that I found. That makes it impossible to tell whether the models could be converted to a quantized version using, for example, ctranslate2, llama.cpp (GGUF format), GPTQ, "Transformers" (relying on BetterTransformer and/or bitsandbytes and/or flash attention 2), or whatever other options are out there...

I'm looking for a lighter alternative to Bark since it's so heavy, but I noticed that even the "small" version of your models is fairly heavy. If quantized, however, I'm wondering whether it would come out ahead overall. The audio quality is so close to Bark's that any speedup would make this far superior.

@jpc
Contributor

jpc commented Jan 30, 2024

Hey, small is a bit of a misnomer. We inherited the naming from Whisper, but these are the biggest models we trained. You can try tiny or base, which are 5 and 2 times smaller, respectively, and work pretty well. All the models are currently distributed as fp32, so the files are 2x bigger than they could be.

Regarding GGUF or other quantized formats – I don't have much practical experience with these. A conversion would certainly be possible since we use the same layers as LLaMA and Whisper. See also #56.

@BBC-Esq
Contributor Author

BBC-Esq commented Jan 30, 2024

Thanks! Is there a way to choose between different voices (a variety of English voices, for example), or is there only one that's hardcoded?

I'm familiar with the misnomer... tiny, base, and small run from smallest to largest for Whisper models... and there are also medium/medium.en and large-v2, which apparently you guys haven't trained yet, if I understand correctly.

@zoq
Contributor

zoq commented Jan 30, 2024

You can use the voice cloning feature to get different voices:

generate_to_file("""
This is the first demo of Whisper Speech, a fully open source text-to-speech model trained by Collabora and Lion on the Juwels supercomputer.
""", lang='en', speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg')

See the speaker parameter.

@BBC-Esq
Contributor Author

BBC-Esq commented Jan 30, 2024

For those interested, I tested the same text with my program located at https://github.com/BBC-Esq/ChromaDB-Plugin-for-LM-Studio

Specifically, I ran the program, obtained a response from the LLM, and clicked the "Bark Response" button within my GUI, which uses a Bark model to speak the text of the LLM's response. My program can use both the "normal" and "small" bark models, in either fp32 or fp16. Here are the results:

Bark/normal/fp32 = 8 GB
Bark/normal/fp16 = 5.7 GB
Bark/small/fp32 = 5.3 GB
Bark/small/fp16 = 4.2 GB

whisperspeech/small-en/float32 (hardcoded by default) = 5.8 GB
whisperspeech/base-en/float32 = 5.3 GB
whisperspeech/tiny-en/float32 = 5.2 GB

CAVEAT - these numbers are RELATIVE, i.e. I did not subtract my system overhead, which can fluctuate between 1.5-2.5 GB depending on what my computer is doing. However, I ran the tests with the same overhead as much as I could control it...

In terms of audio quality, both messed up in the same places (e.g. with non-English stuff like "(Ga. Juv. Prac. & Proc.) § 6:29", which is a citation to a legal treatise) and whisperspeech actually seemed smoother overall...and that's using the "small-en" model!

I think it'd be exciting to see the medium.en and large-v2 models (AVOID large-v3 IMHO due to regressions!), but quantized in float16 or other formats.

Overall, very impressed/excited. I'd recommend looking at ctranslate2, llama.cpp, Transformers, or other quantization options. Transformers is updated more often and now supports BetterTransformer, bitsandbytes, and flash attention 2, so... Just my humble suggestion!

@BBC-Esq
Contributor Author

BBC-Esq commented Jan 30, 2024

I would also add that I noticed in fetch_models.py you mention "whisperx," which is based on ctranslate2 so...perhaps that's an inroad if you already plan on using other quantized models...

# AUTOGENERATED! DO NOT EDIT! File to edit: ../nbs/0. Download models.ipynb.

# %% auto 0
__all__ = []

# %% ../nbs/0. Download models.ipynb 1
from fastcore.script import call_parse
import whisperx
import whisper
from speechbrain.pretrained import EncoderClassifier

# %% ../nbs/0. Download models.ipynb 3
def load_whisperx(model, lang):
    try:
        whisperx.asr.load_model(model, "cpu", compute_type="float16", language=lang)
    except ValueError as exc:
        print(exc.args[0])
        if exc.args[0] != "Requested float16 compute type, but the target device or backend do not support efficient float16 computation.":
            raise

@call_parse
def main():
    whisper.load_model('base.en')
    whisper.load_model('small.en')
    whisperx.vad.load_vad_model('cpu')
    load_whisperx('medium.en', 'en')
    load_whisperx('medium', 'en')
    EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", savedir="~/.cache/speechbrain/")

I didn't have a chance to analyze what this script actually does, however. In any event, I'll be following this repository and would love to include it in my program.

@jpc
Contributor

jpc commented Jan 30, 2024

For the GPU RAM usage I think we may be very suboptimal – right now we always load the FP32 weights and then convert to FP16. From my quick tests it seemed like it never releases the GPU memory so in FP16 mode we actually use more GPU memory than in FP32. ;)

That would definitely be worth fixing.
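
For context, here is a minimal, untested sketch of what such a fix might look like, assuming the model in question behaves like an ordinary torch.nn.Module (the helper name below is made up, not part of WhisperSpeech):

import gc
import torch

def to_fp16_and_release(model: torch.nn.Module) -> torch.nn.Module:
    """Convert a model to fp16 and try to hand the old fp32 blocks back to the driver."""
    model = model.half()       # converts parameters and buffers to fp16
    gc.collect()               # drop lingering Python references to the fp32 tensors
    torch.cuda.empty_cache()   # release the now-unused cached GPU memory
    return model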

@jpc
Contributor

jpc commented Jan 30, 2024

> I would also add that I noticed in fetch_models.py you mention "whisperx," which is based on ctranslate2 so...perhaps that's an inroad if you already plan on using other quantized models...

We are aware of CTranslate2 but we don't have the resources to work on these conversions at the moment. The fetch_models script is used to download the models we use for data preprocessing for training, because the cluster we are using does not have internet access on its compute nodes.

@BBC-Esq
Contributor Author

BBC-Esq commented Jan 30, 2024

> For the GPU RAM usage I think we may be very suboptimal – right now we always load the FP32 weights and then convert to FP16. From my quick tests it seemed like it never releases the GPU memory so in FP16 mode we actually use more GPU memory than in FP32. ;)
>
> That would definitely be worth fixing.

If that's the case, point me to the script(s) that do the conversion and I'll take a quick look at whether/why the memory isn't being released... It's possible that it's already superior to Bark right now (if only marginally), even though small-en is the largest model available at the moment. I'm assuming you're using dynamic quantization via PyTorch at runtime?

@jpc
Contributor

jpc commented Jan 31, 2024

We are using fp16 and torch.compile, no weight quantization at all at the moment.

The conversion test should be pretty simple. I think the way to go would be to load the models, convert to fp16 (pipe.s2a.switch_dtypes(torch.float16) and pipe.t2s.switch_dtypes(torch.float16)), and save each model to a new file with save_model. Afterwards, load the new files (you can give a local path to s2a_ref) and I think it should stop copying the model in memory.
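
A minimal sketch of that workflow, assuming the Pipeline class from whisperspeech.pipeline, that switch_dtypes/save_model behave as described above, and that save_model accepts a local file name (the output file names and the t2s_ref argument are guesses):

import torch
from whisperspeech.pipeline import Pipeline

# Load the default (fp32) checkpoints.
pipe = Pipeline()

# Convert both models to fp16 in place and write them out to new files.
pipe.t2s.switch_dtypes(torch.float16)
pipe.s2a.switch_dtypes(torch.float16)
pipe.t2s.save_model('t2s-small-en-fp16.model')
pipe.s2a.save_model('s2a-small-en-fp16.model')

# Later, point the pipeline at the local fp16 files so no fp32 copy is loaded.
pipe_fp16 = Pipeline(t2s_ref='t2s-small-en-fp16.model',
                     s2a_ref='s2a-small-en-fp16.model')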

@BBC-Esq
Contributor Author

BBC-Esq commented Jan 31, 2024

I'm sorry, you said in one message that they're float32 and then you said float16...can you please clarify?

@jpc
Contributor

jpc commented Jan 31, 2024

Sorry, there are a couple of steps so it may not be clear: the models are uploaded to Hugging Face in FP32, we download them in FP32, convert to FP16 after loading, and do inference at FP16.

@BBC-Esq
Contributor Author

BBC-Esq commented Feb 3, 2024

Where exactly does this happen? It seems like it's done in multiple places; for example, here:

def optimize(self, max_batch_size=1, dtype=torch.float16, torch_compile=True):
    for emb in [self.embeddings.embedding, self.embeddings.embedding]:
        emb.convert_for_eval()
    for l in self.encoder.layers:
        l.attn.convert_for_eval()
    for l in self.decoder.layers:
        l.attn.convert_for_eval()
        l.cross_attn.convert_for_eval()
        l.setup_kv_cache(max_batch_size, self.stoks_len, self.ttoks_len)
    self.switch_dtypes(dtype)
    if torch_compile:
        self.generate_next = torch.compile(self.generate_next, mode="reduce-overhead", fullgraph=True)

What I am evaluating is possibly using bitsandbytes. Are you familiar with this library, and would you be open to pull requests regarding it? Bitsandbytes doesn't require creating an entirely new "quantized" model the way llama.cpp or ctranslate2 do, at least that's my understanding. It seems to simply replace certain PyTorch modules to achieve this effect at runtime instead.

Further, it's my understanding that under the hood it uses some custom CUDA kernels...

Here's a short summary from their GitHub:
[screenshot of the bitsandbytes README summary]

It's officially only available on Linux, but I've been successfully using a custom wheel FROM HERE for the last few months with no issues whatsoever.

@BBC-Esq BBC-Esq changed the title is quantization available for these models - e.g. ctranslate2, llama.cpp, gptq, etc.? is quantization available - e.g. ctranslate2, llama.cpp, bitsandbytes, gptq, etc.? Feb 3, 2024
@jpc
Contributor

jpc commented Feb 4, 2024

The fp16 conversion is done in self.switch_dtypes(dtype).

For 8-bit I'd start by replacing the Linear layers in the source code on a temporary branch and seeing how well it works. If it works and the performance is good, we can later implement switching the layers to 8-bit in a similar manner to the existing convert_for_eval functions.
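
A rough sketch of that temporary-branch experiment using the standard bitsandbytes module-swap pattern; the recursive helper below is hypothetical and not part of WhisperSpeech:

import torch.nn as nn
import bitsandbytes as bnb

def replace_linear_with_8bit(module: nn.Module) -> None:
    """Recursively swap every nn.Linear for a bitsandbytes 8-bit linear layer."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            int8 = bnb.nn.Linear8bitLt(
                child.in_features, child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,   # use the int8 inference path
            )
            int8.load_state_dict(child.state_dict())
            setattr(module, name, int8)
        else:
            replace_linear_with_8bit(child)

# e.g. replace_linear_with_8bit(pipe.t2s); the actual quantization happens
# when the model is moved to the GPU with .cuda() / .to('cuda').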

@BBC-Esq
Contributor Author

BBC-Esq commented Feb 4, 2024

You mentioned that the architecture is the same as Whisper's, but is there any way you guys could release it? Is it exactly the same, or...? Even a slight variation would alter the quantization options. It'd help me understand the model layers as I begin working on this after implementing other GPU acceleration. Hugging Face has all of the models in the same repository, so I don't know if this'd require restructuring it, but here's an example from OpenAI's whisper-large-v2:

[screenshots of the openai/whisper-large-v2 config files on Hugging Face]

In one of the other issues you actually posted a link to some charts and graphs from when it was trained, but I can't locate that again and don't remember whether it included the architectures...

@jpc
Contributor

jpc commented Feb 4, 2024

I am pretty sure you can view a table like this if you open the model file in Netron. We don't have any diagrams right now; the only documentation of the architecture is the model code in Python…

It is an encoder-decoder transformer like Whisper with some additions, mainly RoPE positional embeddings and MusicGen-like multiple heads for efficient EnCodec token prediction.
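
Not the actual WhisperSpeech code, just a toy illustration of the multiple-heads idea: one linear head per EnCodec quantizer level, all predicting in parallel from the same decoder hidden state (all names and shapes below are invented):

import torch
import torch.nn as nn

class MultiHeadCodePredictor(nn.Module):
    """One classification head per EnCodec quantizer, applied to the same hidden state."""
    def __init__(self, d_model: int, n_quantizers: int, codebook_size: int):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_quantizers)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> logits: (batch, seq, n_quantizers, codebook_size)
        return torch.stack([head(h) for head in self.heads], dim=2)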

@BBC-Esq
Contributor Author

BBC-Esq commented Feb 4, 2024

Netron is awesome, thanks. It worked for most of the models, and I was automatically directed to their GitHub, where I submitted an issue. Overall, it loaded all of the models except the ones beginning with "s2a", with one exception: it did load

s2a_up.model

@BBC-Esq BBC-Esq changed the title is quantization available - e.g. ctranslate2, llama.cpp, bitsandbytes, gptq, etc.? possible quantization (e.g. ctranslate2, llama.cpp, bitsandbytes, gptq, etc.?) Feb 5, 2024