
Add 4bit inference code #2

Open
WithNoAsterisk opened this issue Apr 10, 2024 · 2 comments

Comments

@WithNoAsterisk

Hello, I'm really interested in your project, but it takes a lot of VRAM to run. Could you please add inference code for 4-bit quantization using bitsandbytes or something similar?
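For reference, roughly what I have in mind is the sketch below, using the BitsAndBytesConfig path in transformers. The checkpoint name is just a placeholder; it would have to be the LLM that OmniFusion actually uses, and the adapter and visual encoder would still be loaded separately:

```python
# A minimal sketch of 4-bit loading with bitsandbytes via transformers.
# The checkpoint name is a placeholder, not the exact model from this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```

With nf4 weights the 7B language model itself should take only a few GB of VRAM, though the adapter and visual encoder would of course still need their own memory.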

@susanin1970 commented Apr 25, 2024

I would also suggest adding the ability to run inference on multiple GPUs, if it is not already implemented.

Because when I tried other variants of the device setting ('auto', 'balanced', 'balanced_low_0', 'sequential'), I got the following error:

RuntimeError: don't know how to restore data location of torch.storage.UntypedStorage (tagged with balanced_low_0)

When I chose cuda:0, I got a CUDA out of memory error.

Nevertheless, thanks for this repo :)

@MatthewMih
Collaborator

Thank you for your interest in the OmniFusion model!
At the moment we do not have a 4-bit version of the model, but we plan to publish a light version based on a 1B or 3B LLM. We will also consider 8-bit/4-bit versions of the model; thanks for the ideas and suggestions!

If you have multiple GPUs, you can try using 'auto' as the device map for the Mistral language model in the example code from the readme. The remaining parts (the adapter and the visual encoder) do not use much video memory and can be placed on a single GPU.
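For example, along these lines (a minimal sketch assuming a readme-style setup with a separate Mistral LLM; the checkpoint name and the adapter/encoder identifiers are placeholders, not the exact names from the repo):

```python
# Shard only the language model across the available GPUs;
# keep the lighter parts (adapter, visual encoder) on a single card.
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",                # splits LLM layers across GPUs
)

# Hypothetical names for the remaining parts, kept on one device:
# projection = projection.to("cuda:0")
# clip_encoder = clip_encoder.to("cuda:0")
```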
