Improve CPU offload #100

Merged
merged 24 commits into main from improve_cpu_offload on May 24, 2023

Conversation

@PanQiWei (Collaborator) commented May 23, 2023

What does this PR do

This PR mainly improves CPU offload, together with some API changes and bug fixes.

Change Description

API Change

  • add trust_remote_code argument to from_pretrained(), defaults to False
  • remove strict argument from from_quantized()
  • add the following arguments to from_quantized()
    • low_cpu_mem_usage: defaults to False, with the same meaning as in Hugging Face transformers
  • change the following arguments' default value in from_quantized()
    • inject_fused_attention: now the default value is True
    • inject_fused_mlp: now the default value is True
    • warmup_triton: now the default value is False
  • add new method warmup_triton() to BaseGPTQForCausalLM so that users can execute warmup and model loading separately (see the usage sketch after this list)
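
Below is a minimal usage sketch of the changed API; the model paths and quantize config are placeholders, and the argument values shown for from_quantized() are simply the new defaults listed above.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# from_pretrained() now accepts trust_remote_code (defaults to False)
model_to_quantize = AutoGPTQForCausalLM.from_pretrained(
    "path/to/fp16-model",  # placeholder path
    quantize_config=BaseQuantizeConfig(bits=4, group_size=128),
    trust_remote_code=False,
)

# from_quantized() drops `strict`, gains `low_cpu_mem_usage`,
# and the values below are the new defaults
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model",  # placeholder path
    use_triton=True,
    low_cpu_mem_usage=False,
    inject_fused_attention=True,
    inject_fused_mlp=True,
    warmup_triton=False,
)

# triton warmup can now be executed separately from model loading
quantized_model.warmup_triton()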

Bug Fix

  • fix a bug where a quantized model couldn't be saved when CPU offload was used to load the pretrained model (a sketch of the affected path follows)
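
A minimal sketch of the previously broken path, with hypothetical paths, memory limits, and a toy calibration example: load the pretrained model with CPU offload via max_memory, quantize it, then save it.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/fp16-model")  # placeholder path

# capping GPU 0 with max_memory makes part of the model get offloaded to CPU RAM
model = AutoGPTQForCausalLM.from_pretrained(
    "path/to/fp16-model",
    quantize_config=BaseQuantizeConfig(bits=4, group_size=128),
    max_memory={0: "4GIB", "cpu": "30GIB"},  # placeholder limits
)

# a single toy calibration example; real quantization needs a proper calibration set
examples = [tokenizer("auto-gptq is an easy-to-use quantization package.", return_tensors="pt")]
model.quantize(examples)

# saving the quantized model no longer fails when it was loaded with CPU offload
model.save_quantized("path/to/save-dir")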

Appendix

These benchmark results are only meant to show the influence of different argument combinations in from_quantized(); unless explicitly stated, all arguments use their default values.

Hardware and Software

  • OS: Win10
  • CPU: Intel Core i7 12700K
  • GPU: 1xNvidia RTX3060-12G
  • CUDA: 11.8.0
  • python: 3.9.16
  • pytorch: 2.0.0
  • transformers: 4.29.2
  • accelerate: 0.19.0

Model and Generation Config

  • model: gptj 4bit (group_size=128, desc_act=False)
  • min_new_tokens & max_new_tokens: 64
  • num_beams: 1
  • do_sample: False
  • num_return_sequences: 1
  • use TextGenerationPipeline: True
  • batch_size: 1
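
A rough sketch of how one benchmark generation might be run with the settings above; the model path and prompt are placeholders, not the exact benchmark script.

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, TextGenerationPipeline

model_dir = "path/to/gptj-4bit-128g"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0")

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
output = pipeline(
    "auto-gptq is",  # placeholder prompt
    min_new_tokens=64,
    max_new_tokens=64,
    num_beams=1,
    do_sample=False,
    num_return_sequences=1,
    batch_size=1,
)
print(output[0]["generated_text"])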

Result

load the whole model into GPU
~33 tokens/s, ~5.5GB peak VRAM

cpu offload with max_memory={0: "1GIB", "cpu": "10GIB"}
~1 token/s, ~2GB peak VRAM
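
For reference, a hedged sketch of how the two cases above may differ in the from_quantized() call; the path and device argument are assumptions.

from auto_gptq import AutoGPTQForCausalLM

# case 1: load the whole model into GPU
model = AutoGPTQForCausalLM.from_quantized("path/to/gptj-4bit-128g", device="cuda:0")

# case 2: CPU offload, capping GPU 0 at 1GiB so most layers are offloaded to CPU RAM
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/gptj-4bit-128g",
    max_memory={0: "1GIB", "cpu": "10GIB"},
)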

@oobabooga (Contributor) commented May 23, 2023

I have tried loading llama-7b-4bit-128g (obtained here) to run some benchmarks and I got this error:

   132         tensor_name = splits[-1]
   133
   134     if tensor_name not in module._parameters and tensor_name not in module._buffers:
❱  135         raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_
   136     is_buffer = tensor_name in module._buffers
   137     old_value = getattr(module, tensor_name)
   138
ValueError: QuantLinear() does not have a parameter or a buffer named bias.

I added this quantize_config.json to the model folder:

{
  "bits": 4,
  "group_size": 128
}

And used this code:

from auto_gptq import AutoGPTQForCausalLM

path_to_model = 'models/llama-7b-4bit-128g'
params = {
    'model_basename': 'llama-7b-4bit-128g',
    'use_triton': False,
    'use_safetensors': True,
    'max_memory': {0: '2GiB', 'cpu': '99GiB'}
}

model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)

input()

For TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g the offloading worked as expected and seemed fast at first glance.

@TheBloke (Contributor)

Yeah I see the same with that Llama torrent, and I tested a few more of my recent models and am getting ValueError with some.

A recent example, which I quantised today with ooba's CUDA GPTQ-for-LLaMa fork: https://huggingface.co/TheBloke/manticore-13b-chat-pyg-GPTQ

So yeah, this PR hasn't fixed ValueError: QuantLinear() does not have a parameter or a buffer named bias.

Haven't yet figured out why it happens on some but not others, but it's definitely still an issue. And it needs to be a priority, because now we can't work around it by adding strict=False.

@PanQiWei (Collaborator, Author)

I've fixed the problem where models quantized with other frameworks became incompatible with auto-gptq after strict was removed.

@oobabooga (Contributor)

I confirm that the model loads and generates successfully now.

These were the results of my performance tests using text-generation-webui:

| Case | Performance |
| --- | --- |
| --gpu-memory 2380MiB --autogptq (19 layers go to device :0) | Output generated in 82.58 seconds (2.41 tokens/s, 199 tokens, context 16, seed 1988446545) |
| --gpu-memory 2500MiB --autogptq (20 layers go to device :0) | Output generated in 78.00 seconds (2.55 tokens/s, 199 tokens, context 16, seed 1179169760) |
| --pre_layer 20 using GPTQ-for-LLaMa | Output generated in 68.44 seconds (2.91 tokens/s, 199 tokens, context 16, seed 581396535) |

For small context size, the performance in my tests is consistently a bit slower, taking 12-14% longer to generate the same number of tokens.

Interestingly, for large context size, the performance seems to be the same:

| Case | Performance |
| --- | --- |
| --gpu-memory 2500MiB --autogptq (20 layers go to device :0) | Output generated in 21.85 seconds (2.24 tokens/s, 49 tokens, context 1921, seed 667621066) |
| --pre_layer 20 using GPTQ-for-LLaMa | Output generated in 21.91 seconds (2.24 tokens/s, 49 tokens, context 1921, seed 1565634474) |

@PanQiWei (Collaborator, Author)

> For small context size, the performance in my tests is consistently a bit slower, taking 12-14% longer to generate the same number of tokens.

@oobabooga it seems to be because auto-gptq moves offloaded layers between CPU and GPU frequently; I'll look deeper and try to find a way to optimize it.

PanQiWei merged commit 18c7ce5 into main on May 24, 2023
@Ph0rk0z (Contributor) commented May 24, 2023

So far, for 65B I get about 1/3 less performance than the pipelining merged into ooba's GPTQ: 1.3 it/s vs 1.8 at around 500-600 context. I will check if I can do full context soon.

PanQiWei deleted the improve_cpu_offload branch on May 24, 2023 17:24
@Ph0rk0z (Contributor) commented May 24, 2023

65b autogptq xformers

Output generated in 26.45 seconds (3.55 tokens/s, 94 tokens, context 30, seed 1959592305)
Output generated in 24.82 seconds (3.63 tokens/s, 90 tokens, context 30, seed 1848171511)
Output generated in 31.34 seconds (3.64 tokens/s, 114 tokens, context 31, seed 752106982)

65b autogptq no xformers
Output generated in 24.47 seconds (3.56 tokens/s, 87 tokens, context 31, seed 983386653)
Output generated in 29.75 seconds (3.80 tokens/s, 113 tokens, context 31, seed 296496704)
Output generated in 22.57 seconds (3.77 tokens/s, 85 tokens, context 30, seed 1738875643)

65b fused attn

Output generated in 27.65 seconds (3.69 tokens/s, 102 tokens, context 30, seed 607409177)
Output generated in 17.02 seconds (3.70 tokens/s, 63 tokens, context 30, seed 72486718)
Output generated in 27.17 seconds (3.86 tokens/s, 105 tokens, context 30, seed 3575823)
OOM on full context.

Regular GPTQ with pipeline
Output generated in 135.06 seconds (1.84 tokens/s, 249 tokens, context 1761, seed 2014304440)
Output generated in 17.43 seconds (4.36 tokens/s, 76 tokens, context 29, seed 695209492)
Output generated in 17.64 seconds (4.31 tokens/s, 76 tokens, context 29, seed 1268251468)
Output generated in 17.57 seconds (4.32 tokens/s, 76 tokens, context 29, seed 1632724734)


OOM for full context :(
