Improve CPU offload #100
Conversation
I have tried loading
I added this quantize_config.json to the model folder:
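A minimal sketch of one way to generate such a file with auto_gptq's BaseQuantizeConfig, assuming a 4-bit, group-size-128 model (the values here are illustrative, not the exact file used):

```python
from auto_gptq import BaseQuantizeConfig

# Illustrative values for a 4-bit, group-size-128 LLaMA model; adjust them to
# match how the checkpoint was actually quantized.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)
# Writes quantize_config.json into the model folder.
quantize_config.save_pretrained('models/llama-7b-4bit-128g')
```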
And used this code:

```python
from auto_gptq import AutoGPTQForCausalLM

path_to_model = 'models/llama-7b-4bit-128g'
params = {
    'model_basename': 'llama-7b-4bit-128g',
    'use_triton': False,
    'use_safetensors': True,
    'max_memory': {0: '2GiB', 'cpu': '99GiB'}
}
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
input()
```
Yeah, I see the same with that Llama torrent, and I tested a few more of my recent models and am getting ValueError with some. Recent example, which I quantised today with the ooba CUDA GPTQ-for-LLaMa fork: https://huggingface.co/TheBloke/manticore-13b-chat-pyg-GPTQ

So yeah, this PR hasn't fixed it. Haven't yet figured out why it happens on some but not others, but it's definitely still an issue. And it needs to be a priority, because now we can't fix it by adding…
…ized from other frameworks
I've fixed the problem where models quantized using other frameworks became incompatible with auto-gptq after removing…
I confirm that the model loads and generates successfully now. These were the results of my performance tests using text-generation-webui:
For small context size, the performance in my tests is consistently a bit slower, taking 12-14% longer to generate the same number of tokens. Interestingly, for large context size, the performance seems to be the same:
@oobabooga it seems it's because auto-gptq moves offloaded layers between CPU and GPU frequently. I'll look deeper and try to find a way to optimize it.
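As a rough illustration of that cost (not AutoGPTQ's actual offload code, just a sketch using accelerate's generic cpu_offload): the offloaded parameters live in CPU RAM and are copied to the GPU on every forward call.

```python
import torch
from torch import nn
from accelerate import cpu_offload

# Toy stand-in for an offloaded transformer block.
block = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Keep the weights in CPU RAM and stream them to GPU 0 for each forward pass.
block = cpu_offload(block, execution_device=torch.device("cuda:0"))

x = torch.randn(1, 4096, device="cuda:0")
y = block(x)  # host->device copies of the weights happen inside this call
```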
So far for 65B I get 1/3 less performance than the pipelining merged into ooba GPTQ: 1.3 it/s vs 1.8 at around 500-600 context. I will check if I can do full context soon.
OOM for full context :(
What does this PR do?
This PR mainly improves CPU offload, together with some API changes and bug fixes.
Change Description

API Change
- add `trust_remote_code` argument to `from_pretrained()`, defaults to `False`
- remove `strict` argument from `from_quantized()`
- new argument in `from_quantized()`:
  - `low_cpu_mem_usage`: defaults to `False`, same as in hf transformers
- changed default values in `from_quantized()`:
  - `inject_fused_attention`: now the default value is `True`
  - `inject_fused_mlp`: now the default value is `True`
  - `warmup_triton`: now the default value is `False`
- add `warmup_triton()` to `BaseGPTQForCausalLM` so that users can execute warmup and model loading separately (see the usage sketch below)
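A minimal sketch of loading with the new argument and updated defaults (the model path is a placeholder, not part of this PR):

```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    'models/llama-7b-4bit-128g',   # placeholder path
    use_triton=True,
    use_safetensors=True,
    low_cpu_mem_usage=False,       # new argument, defaults to False
    inject_fused_attention=True,   # default changed to True
    inject_fused_mlp=True,         # default changed to True
    warmup_triton=False,           # default changed to False
)

# warmup can now be executed separately from model loading
model.warmup_triton()
```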
Bug Fix
Appendix
These benchmark results are just to show the influence of different argument combinations in `from_quantized()`; unless explicitly stated, all arguments use their default values.

hardware and software
model and generation config
Result
- load the whole model into GPU: ~33 tokens/s, ~5.5GB peak VRAM
- cpu offload with `max_memory={0: "1GIB", "cpu": "10GIB"}`: ~1 tokens/s, ~2GB peak VRAM
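A sketch of the two configurations compared above (the model path is a placeholder):

```python
from auto_gptq import AutoGPTQForCausalLM

# whole model on GPU 0
gpu_model = AutoGPTQForCausalLM.from_quantized(
    'path/to/quantized-model',
    device='cuda:0',
)

# CPU offload: cap GPU 0 at 1 GiB and spill the remaining layers to CPU RAM
offload_model = AutoGPTQForCausalLM.from_quantized(
    'path/to/quantized-model',
    max_memory={0: "1GIB", "cpu": "10GIB"},
)
```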