Is FlexGen+GPTQ 4bit possible? #101

BarfingLemurs · 2023-03-19T12:49:04Z

Just a curious question I suppose!
GPTQ 4bit - https://github.com/qwopqwop200/GPTQ-for-LLaMa
Suppose someone eventually fine-tunes the 175B OPT model, with LoRAs or regular fine-tuning, or perhaps the BLOOM or BLOOMZ model. Would running inference be possible with GPTQ, so that the model can run on 4 GB of VRAM and 50 GB of DRAM?

Ying1123 · 2023-03-21T23:19:14Z

FlexGen has support for 4-bit compression; see Section 5 of the paper.
Weight compression: https://github.com/FMInference/FlexGen/blob/bbc9ea9670c496cd31dbb2c4b04e9a1337d82d53/flexgen/flex_opt.py#L1306
Cache compression: https://github.com/FMInference/FlexGen/blob/bbc9ea9670c496cd31dbb2c4b04e9a1337d82d53/flexgen/flex_opt.py#L1308
The compression in FlexGen has computation overhead, so it is not always better to turn it on. For large models like 175B, which involve disk swapping, it is usually better to turn on both weight and cache compression.
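To make the overhead concrete, here is a minimal sketch of group-wise min-max 4-bit quantization, the general kind of scheme this sort of compression relies on. It is an illustration only and not FlexGen's code: the function names, the pure-Python lists, and the default group size are assumptions chosen for readability. The point is that every compressed group has to be dequantized before it can be used in a computation, which is where the extra work comes from.

```python
# Hypothetical helper names; a sketch of group-wise min-max quantization,
# not the FlexGen implementation.

def quantize_group(values, bits=4):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Recover approximate floats from the codes (the step that costs compute)."""
    return [lo + c * scale for c in codes]


def quantize(values, group_size=64, bits=4):
    """Split a flat list into contiguous groups and quantize each one."""
    return [
        quantize_group(values[i:i + group_size], bits)
        for i in range(0, len(values), group_size)
    ]


def dequantize(groups):
    """Concatenate the dequantized groups back into one flat list."""
    out = []
    for codes, lo, scale in groups:
        out.extend(dequantize_group(codes, lo, scale))
    return out


if __name__ == "__main__":
    weights = [i * 0.01 - 0.5 for i in range(128)]
    restored = dequantize(quantize(weights))
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max absolute error after round trip: {worst:.4f}")
```

Each group also stores its own minimum and scale, so the real storage cost is slightly above 4 bits per value.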
GPTQ 4bit has not been implemented in FlexGen.
Even if you use 4-bit, the weights of a 175B model still occupy ~90 GB of memory. 4 GB of VRAM and 50 GB of DRAM are not sufficient.
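The arithmetic behind that estimate can be checked with a short sketch. It counts only the raw weight bits and assumes 1 GB = 10^9 bytes; quantization metadata, the KV cache, and activations are ignored, so the real requirement is higher. The function names are hypothetical.

```python
# Back-of-the-envelope weight memory estimate (raw weight bits only).

GB = 1e9  # assumption: decimal gigabytes


def weight_gb(num_params, bits_per_weight):
    """Size of the weights alone, in GB."""
    return num_params * bits_per_weight / 8 / GB


def fits_in_memory(num_params, bits_per_weight, vram_gb, dram_gb):
    """True if the weights alone fit in VRAM plus DRAM (no disk offload)."""
    return weight_gb(num_params, bits_per_weight) <= vram_gb + dram_gb


if __name__ == "__main__":
    params = 175e9  # 175B-parameter model
    for bits in (16, 4):
        print(f"{bits}-bit weights: {weight_gb(params, bits):.1f} GB")
    print("fits in 4 GB VRAM + 50 GB DRAM at 4-bit:",
          fits_in_memory(params, 4, vram_gb=4, dram_gb=50))
```

At 4 bits per weight, 175B parameters come to 87.5 GB, which is well above the 54 GB combined budget in the question, so the remainder would have to live on disk.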
