
Add 4bit inference code #2

Open
WithNoAsterisk opened this issue Apr 10, 2024 · 2 comments

Comments

@WithNoAsterisk

Hello, I'm really interested in your project, but it takes a lot of VRAM to run. Could you please add inference code for 4-bit quantization using bitsandbytes or something similar?
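For reference, roughly what I have in mind is the sketch below, using the BitsAndBytesConfig path in transformers. The checkpoint name is just a placeholder; it would have to be the LLM that OmniFusion actually uses, and the adapter and visual encoder would still be loaded separately:

```python
# A minimal sketch of 4-bit loading with bitsandbytes via transformers.
# The checkpoint name is a placeholder, not the exact model from this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```

With nf4 weights the 7B language model itself should take only a few GB of VRAM, though the adapter and visual encoder would of course still need their own memory.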

@susanin1970 commented Apr 25, 2024

I would also suggest adding the ability to run inference on multiple GPUs, if it is not already implemented.

Because when I tried other variants of the device setting ('auto', 'balanced', 'balanced_low_0', 'sequential'), I got the following error:

RuntimeError: don't know how to restore data location of torch.storage.UntypedStorage (tagged with balanced_low_0)

When I chose cuda:0, I got a CUDA out of memory error.

Nevertheless, thanks for this repo :)

@MatthewMih
Collaborator

Thank you for your interest in the OmniFusion model!
At the moment we do not have a 4-bit version of the model, but we plan to publish a light version based on a 1B or 3B LLM. We will also consider 8-bit/4-bit versions of the model; thanks for the ideas and suggestions!

If you have multiple GPUs, you can try using 'auto' as the device map for the Mistral language model in the example code from the readme. The remaining parts (the adapter and the visual encoder) do not use much video memory and can be placed on a single GPU.
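For example, along these lines (a minimal sketch assuming a readme-style setup with a separate Mistral LLM; the checkpoint name and the adapter/encoder identifiers are placeholders, not the exact names from the repo):

```python
# Shard only the language model across the available GPUs;
# keep the lighter parts (adapter, visual encoder) on a single card.
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",      # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",                # splits LLM layers across GPUs
)

# Hypothetical names for the remaining parts, kept on one device:
# projection = projection.to("cuda:0")
# clip_encoder = clip_encoder.to("cuda:0")
```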
