Improve CPU offload #100
Conversation
I have tried loading
I added this quantize_config.json to the model folder:
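A minimal sketch of one way to generate such a file with auto_gptq's BaseQuantizeConfig, assuming a 4-bit, group-size-128 model (the values here are illustrative, not the exact file used):

```python
from auto_gptq import BaseQuantizeConfig

# Illustrative values for a 4-bit, group-size-128 LLaMA model; adjust them to
# match how the checkpoint was actually quantized.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)
# Writes quantize_config.json into the model folder.
quantize_config.save_pretrained('models/llama-7b-4bit-128g')
```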
And used this code:

```python
from auto_gptq import AutoGPTQForCausalLM

path_to_model = 'models/llama-7b-4bit-128g'
params = {
    'model_basename': 'llama-7b-4bit-128g',
    'use_triton': False,
    'use_safetensors': True,
    'max_memory': {0: '2GiB', 'cpu': '99GiB'}
}
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
input()
```
Yeah, I see the same with that Llama torrent, and I tested a few more of my recent models and am getting ValueError with some. Recent example, which I quantised today with the ooba CUDA GPTQ-for-LLaMa fork: https://huggingface.co/TheBloke/manticore-13b-chat-pyg-GPTQ

So yeah, this PR hasn't fixed it. Haven't yet figured out why it happens on some but not others, but it's definitely still an issue. And it needs to be a priority, because now we can't fix it by adding…
…ized from other frameworks
I've fixed the problem where models quantized using other frameworks became incompatible with auto-gptq after removing…
I confirm that the model loads and generates successfully now. These were the results of my performance tests using text-generation-webui:
For small context size, the performance in my tests is consistently a bit slower, taking 12-14% longer to generate the same number of tokens. Interestingly, for large context size, the performance seems to be the same:
@oobabooga it seems it's because auto-gptq moves offloaded layers between CPU and GPU frequently. I'll look deeper and try to find a way to optimize it.
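As a rough illustration of that cost (not AutoGPTQ's actual offload code, just a sketch using accelerate's generic cpu_offload): the offloaded parameters live in CPU RAM and are copied to the GPU on every forward call.

```python
import torch
from torch import nn
from accelerate import cpu_offload

# Toy stand-in for an offloaded transformer block.
block = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Keep the weights in CPU RAM and stream them to GPU 0 for each forward pass.
block = cpu_offload(block, execution_device=torch.device("cuda:0"))

x = torch.randn(1, 4096, device="cuda:0")
y = block(x)  # host->device copies of the weights happen inside this call
```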
So far for 65B I get 1/3 less performance than the pipelining merged into ooba GPTQ: 1.3 it/s vs 1.8 at around 500-600 context. I will check if I can do full context soon.
OOM for full context :(
What does this PR do?
This PR mainly improves CPU offload, together with some API changes and bug fixes.
Change Description

API Change
- add `trust_remote_code` argument to `from_pretrained()`, defaults to `False`
- remove `strict` argument from `from_quantized()`
- new argument in `from_quantized()`:
  - `low_cpu_mem_usage`: defaults to `False`, same as in hf transformers
- changed default values in `from_quantized()`:
  - `inject_fused_attention`: now the default value is `True`
  - `inject_fused_mlp`: now the default value is `True`
  - `warmup_triton`: now the default value is `False`
- add `warmup_triton()` to `BaseGPTQForCausalLM` so that users can execute warmup and model loading separately (see the usage sketch below)
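A minimal sketch of loading with the new argument and updated defaults (the model path is a placeholder, not part of this PR):

```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    'models/llama-7b-4bit-128g',   # placeholder path
    use_triton=True,
    use_safetensors=True,
    low_cpu_mem_usage=False,       # new argument, defaults to False
    inject_fused_attention=True,   # default changed to True
    inject_fused_mlp=True,         # default changed to True
    warmup_triton=False,           # default changed to False
)

# warmup can now be executed separately from model loading
model.warmup_triton()
```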
Bug Fix
Appendix
These benchmark results are just to show the influence of different argument combinations in `from_quantized()`; unless explicitly stated, all arguments use their default values.

hardware and software
model and generation config
Result
- load the whole model into GPU: ~33 tokens/s, ~5.5GB peak VRAM
- cpu offload with `max_memory={0: "1GIB", "cpu": "10GIB"}`: ~1 tokens/s, ~2GB peak VRAM
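A sketch of the two configurations compared above (the model path is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM

# whole model on GPU 0
gpu_model = AutoGPTQForCausalLM.from_quantized(
    'path/to/quantized-model',
    device='cuda:0',
)

# CPU offload: cap GPU 0 at 1 GiB and spill the remaining layers to CPU RAM
offload_model = AutoGPTQForCausalLM.from_quantized(
    'path/to/quantized-model',
    max_memory={0: "1GIB", "cpu": "10GIB"},
)
```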