
Question: FlexGen seems slower than simple CPU code, am I missing something? [see discussion] #24

Closed
justheuristic opened this issue Feb 21, 2023 · 19 comments


@justheuristic

justheuristic commented Feb 21, 2023

Hi!
I'm trying to reproduce the FlexGen results and compare them with more naive methods, and I'm getting weird results. Can you please help me?

edit: added benchmark details and a minimalistic code to reproduce my claims.

library versions (click to expand)

ubuntu-server 18.04
PyTorch and dependencies installed via anaconda 4.9.1, package versions:
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0
numpy 1.23.4 py38h14f4228_0
numpy-base 1.23.4 py38h31eccc5_0

Transformers and tqdm installed via pip:
transformers 4.25.1
tqdm 4.64.1
bitsandbytes 0.37.0

I ran OPT-175B on a similar machine:

  • dual Xeon 6426Y (mid-range server CPU) and 256GB RAM, which is slightly more than in the benchmark, ~~but the code never uses more than 200GB~~ (the benchmark setup has 208 GB)

  • using prefix length 512 and output length 32, similar to the README benchmark, and a batch size of 8 (edited; thanks to @merrymercy for pointing out the discrepancy).

I am using standard Hugging Face code, with transformers.models.opt.modeling_opt.OPTForCausalLM.
The model was quantized to 8-bit using PyTorch PTDQ on linear layers with all default parameters.
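
For reference, here is a minimal sketch of the PTDQ setup described above (not my exact benchmark script; it uses the small facebook/opt-125m checkpoint as a stand-in, since OPT-175B obviously needs the full set of weights):

```python
import torch
from transformers import OPTForCausalLM

# Load an OPT checkpoint in fp32 (opt-125m is a small stand-in here).
model = OPTForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.float32)
model.eval()

# PyTorch post-training dynamic quantization (PTDQ) with default parameters:
# nn.Linear modules are replaced with dynamically quantized int8 equivalents,
# everything else (embeddings, layer norms, softmax) stays in fp32.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```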

Based on my measurements, I am getting 2.06 tokens per second in a basic CPU setup for an 8-bit model, or about 3.9 seconds per batch-step. This is basic HuggingFace + PyTorch PTDQ, no deepspeed / accelerate. Note: this does not account for prefill, so it is not a fair comparison; see the adjusted figures below.
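
To clarify what "decoding throughput" means here, a rough sketch of how such a decode-only figure can be estimated with the public generate() API (an approximation on my part, not the exact benchmark code): the single-new-token run approximates the prefill cost, and the difference to the full run approximates the pure decoding time.

```python
import time
import torch

@torch.no_grad()
def decode_throughput(model, input_ids, new_tokens=32):
    # Run with one new token: roughly the prefill cost.
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1, do_sample=False)
    prefill_time = time.perf_counter() - t0

    # Full run: prefill + decoding of `new_tokens` tokens.
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    total_time = time.perf_counter() - t0

    # Decode-only tokens per second across the whole batch.
    return input_ids.shape[0] * new_tokens / (total_time - prefill_time)
```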

In turn, FlexGen reports 1.12 tokens per second for a 4-bit OPT-175B model:

[screenshot of the FlexGen README benchmark table]

And, weirdly, __simple 8-bit CPU inference beats both FlexGen and FlexGen(c)__ -- given the large-batch setup in question.

Did I understand the evaluation setup correctly? If not, can you please tell me what I am missing?

Summary and corrections from the discussion below:

Based on the suggestions by @merrymercy, it is inappropriate to compare against CPU with batch size 64 since it does not fit in the original testing environment. I have updated the metrics with batch size 8 (to be safe); the decoding throughput fell from 3.66 to 2.06 tokens/second.

Based on the discussion with @Ying1123: In Section 6.0, the generative throughput is defined as "the number of generated tokens / (prefill time + decoding time)".

Here, prefill time stands for encoding the input sequence in parallel, layer-by-layer.
If the baseline algorithm prefills naively on CPU, FlexGen(c)-4-bit does indeed outperform the CPU-8bit baseline. For CPU, most of the time is spent on prefilling. For GPU, the situation is the opposite: prefill is quick since it can be done with one offloading cycle; in turn, generation requires multiple offloading cycles and takes longer.

In further discussion, we consider the option of running prefill on GPU (using simple offloading and streaming KVs to CPU), then running inference on CPU.

On a single T4 GPU, you can prefill 8 samples of 512 tokens with the OPT-175B model in 8-bit precision (the CUDA 8-bit code runs Linear8bitLt from bitsandbytes 0.37.0 with threshold=6) in 91.2 seconds using naive overlapped offloading. The CPU decoding time is, in turn, 124.3 seconds on 2x 6426Y. The aggregate baseline throughput is 8 * 32 / (91.18 + 124.277) ~= 1.19 tokens / second.
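
For clarity, here is an illustrative sketch of that baseline's prefill stage (my own simplification, not the benchmark code: attention masks and the compute/transfer overlap via a second CUDA stream are omitted, and each `layer(hidden, use_cache=True)` is assumed to return `(hidden, (key, value))`, roughly like a Hugging Face decoder block):

```python
import torch

@torch.no_grad()
def offloaded_prefill(layers, hidden_states, device="cuda"):
    """Run the prompt through the model layer by layer on the GPU,
    streaming each layer's KV cache back to CPU RAM for later CPU decoding."""
    kv_cache = []
    for layer in layers:                      # weights live in CPU RAM
        layer.to(device)                      # move one transformer block to the GPU
        hidden_states = hidden_states.to(device)
        hidden_states, kv = layer(hidden_states, use_cache=True)
        kv_cache.append(tuple(t.to("cpu") for t in kv))  # stream KVs back to RAM
        layer.to("cpu")                       # free GPU memory for the next block
    return hidden_states.to("cpu"), kv_cache
```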

While the naive code is still faster, the difference between FlexGen and the baseline is not as significant as I originally thought.
**Important:** later in this thread, @Ying1123 provides their own evaluation on a somewhat weaker CPU (2GHz, fewer cores, virtualized). For that setup, FlexGen-4bit on GPU is indeed 1.6x faster than the 8-bit CPU baseline, even if we account for GPU prefill. I thank @Ying1123 and @merrymercy for pointing out the differences and apologize for taking up their time.

(Expand) Limitations that I left unaddressed

  • the baseline algorithm uses 8-bit compression, while FlexGen(c) uses a 4-bit compression algorithm; it would be better to evaluate with the same compression level. If the baseline is switched to 4-bit compression, it would also make sense to increase the batch size.
  • the throughput comparison depends on the chosen sequence length and CPU type. I have a hunch that shorter sequence lengths would benefit from GPU-side decoding while longer sequence lengths favour CPU to avoid transferring the attention cache. @Ying1123 correctly points out that it would be best to compare the two approaches more systematically.
  • the GPU prefill was measured separately on a different machine, because the original 6426Y machine has no GPU attached. In turn, the machine with the T4 has a more powerful CPU (EPYC 7742) that decodes faster (1.67 t/s final throughput) but is significantly more expensive. For a purely academic comparison, it would be best to evaluate both setups on a number of identical machines with different CPU/GPU balance.
@BojanFaletic

I think you are missing a GPU

@justheuristic
Author

justheuristic commented Feb 21, 2023

Thing is - the basic code without a GPU seems to work faster than FlexGen with a GPU.

(edit: clarified the title and replaced "slower than running on CPU" with "slower than CPU code" to make it less ambiguous)

@justheuristic justheuristic changed the title Question: FlexGen seems slower than running on CPU, what am I doing wrong? Question: FlexGen seems slower than simple CPU code, am I missing something? Feb 21, 2023
@min-xu-ai

@justheuristic, have you tried other model sizes as well? I wonder how the result might change with model size on the X-axis.

Also, CPU or GPU might not be the most important matter here. The dual-CPU setup might be very costly (I haven't done a price estimate), but I think tokens/second/dollar might be an interesting metric to look at? (Ignoring the cost of electricity is fine, just the equipment cost.)

@justheuristic
Author

justheuristic commented Feb 21, 2023

Hi! Thanks for the answer.

Tried other model sizes?

Agreed. I will benchmark all the setups and reply here shortly, should take about a day.
Also, I suppose it was rude of me to make performance claims without publishing the exact code. I will do so shortly.

Token/second/dollar:

That's a good point - but they are about the same price :)

The dual Xeon 6426Y setup costs marginally more than the T4 GPU by itself (i.e. without the computer into which you plug it).

Xeon MSRP is about $1500 per CPU, so $3000 in total -- and T4 MSRP is about $2300 (plus whatever CPU you're using).
Both can be bought used for cheaper, and both have more cost-efficient analogues in the desktop segment.

My point is, the 6426Y is about what you typically get on a server that has enough RAM to load OPT-175B - and you gotta have a CPU there anyways. And even if you go out of your way to put a weak CPU in there, the rest of the platform (200GB RAM, mobo, PSU) will take up most of the price.

I have a hunch - but am not 100% certain - that if a dual CPU gives you 3.66 tokens/sec, a single CPU would be about 50% of that, and it would still be faster than 1.12 (FlexGen table).

@min-xu-ai

Also, keeping an eye out for new EPYC & Apple M2 silicon, which will have sizeable unified memory. I certainly think there are potentially more efficient ways than GPUs. We need software that can take advantage of those.

@min-xu-ai

Another important factor is the minimal system requirement. While $3K can get you the dual-socket CPUs and additional money can get 256GB of memory, that is not common for the majority of desktops. Gaming GPUs like RTXs, on the other hand, should be much more common. IMHO, a good target for CPU inference might be less beefy and more common machine specs.

I tried FlexGen's chatbot.py + OPT-30B on my 64GB machine with a 24GB GPU. Unfortunately, it crashed the machine after the "converting the weights to numpy format" step... :-(

@justheuristic
Author

justheuristic commented Feb 22, 2023

That sounds surprising. As in, "it should only convert these weights once, shouldn't it?"

Maybe you can configure your system with a large (e.g. 64GB) swap? It will take a while, but theoretically it should eventually convert (and save) numpy weights, and then the model would work normally.

@min-xu-ai

Thanks for the suggestion. I did have 64GB of swap, and my NVMe drive should be pretty fast. I think it crashed after the conversion, but the output msg didn't indicate what it was computing. It must have brought down the system with some intense swapping. I will try to figure it out later.

@justheuristic
Author

justheuristic commented Feb 22, 2023

[Added the benchmark code to the first message; I've also added the library versions, just in case]

I tried the 30B model as requested, but it is still faster on CPU, albeit with a smaller gap (2.2x instead of 3.2x compared to FlexGen; 18.1 tokens/second).
So far, there is only one setup where FlexGen outperforms the naive CPU baseline: when the model fits entirely on GPU (e.g. 30B in 4-bit on an RTX Titan is faster than CPU; 8-bit 30B is slower), i.e. when model parameters are not offloaded - and offloading is kind of the point of FlexGen. If parameters are not offloaded, you can likely use FasterTransformer or deepspeed.inference to get in-memory-optimized inference.

Also tested on an older system with dual Xeon 6149 CPUs that were bought in 2017 - it still yields 1.42 tokens per second, i.e. still faster than FlexGen - even though the CPU benchmark uses 8-bit instead of 4-bit. Based on my understanding, if we were to switch FlexGen(c) to 8-bit compression, it would get slower, not faster.

@Ying1123
Collaborator

Hi @justheuristic, thanks for bringing this up. I think this is a very reasonable baseline that we should discuss.

At first glance, there are two issues with your script. I will follow up with more detailed experiments later.

  • It seems you only count the time for decoding. In FlexGen, the metric "generation throughput" counts both the time for encoding the prompts and the time for decoding. See the definition in the README. However, your script does not count the time for encoding. For encoding the prompts, the GPU will be much faster than the CPU, because GPU utilization during prefill is very high (see Table 4 in the paper). This means that for longer prompts, the GPU will definitely have its advantage.
  • A naive int8 quantization may have some accuracy loss for large models, according to the LLM.int8 and SmoothQuant papers. I am not familiar with PTDQ. Can the basic PTDQ preserve accuracy for large models? More advanced quantization methods will have other overheads.

I think this is a very interesting discussion. To systematically study this, I will try to do more experiments and analysis later, as there are a lot of factors and metrics.

@justheuristic
Author

justheuristic commented Feb 22, 2023

@Ying1123 thank you for the detailed response.

  • On encoding time: encoding time can indeed be faster on GPU. Assuming that encoding is faster on GPU and decoding is faster on CPU, there's another simple baseline you may want to consider:

    1. run encoding on GPU, loading layer by layer (i.e. standard offloading); dump each layer's KVs into RAM on the fly
    2. continue decoding on CPU, without GPU involvement
      Naturally, the optimal algorithm would depend on the CPU: an extremely weak CPU (e.g. 8c/16t) can still be faster with standard FlexGen, whereas a mid-range server CPU (e.g. 6426Y) or high-end consumer CPU (e.g. Ryzen 7900X) seems likely to be faster with CPU-side decoding.
  • On quantization: The naive int8 quantization may indeed lose precision - it was chosen for simplicity, since more advanced quantization methods would have the same compute (see details later). In practice, you can run the same quantization method as in your paper or something related, e.g. GPTQ or LLM.int8, and fill in a more accurate qint8 matrix with the same compute complexity (for GPTQ). Outlier-based quantization methods (e.g. LLM.int8) are not strictly equivalent and introduce an additional overhead of about 2.5% extra compute for the recommended outlier threshold (6) on OPT-175B. In other words, the benchmark is indeed simplified, but more sophisticated methods would do the exact same computation with different input numbers.

Reproducibility notes: for the 2.5% overhead in PyTorch, I cast the chosen outliers to float32 instead of float16, as in Dettmers et al. This is because the CPU kernels for 16-bit precision are slower than for 32-bit. Since outliers correspond to under 0.1% of all model weights (see Table 4 and Section 3.2 here), the RAM usage does not change significantly.
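
To make the outlier handling concrete, here is a minimal sketch of the decomposition I mean (an illustration written for this thread, not the benchmark script; a real kernel would run an actual int8 GEMM instead of dequantizing, and the weights would be quantized once, offline):

```python
import torch

def outlier_aware_linear(x, weight, threshold=6.0):
    # Input features whose activations exceed the threshold take the fp32 path.
    outliers = (x.abs() > threshold).any(dim=0)
    y_fp32 = x[:, outliers] @ weight[:, outliers].t()

    # The remaining features use symmetric per-output-channel int8 weights.
    w_rest = weight[:, ~outliers]
    scale = w_rest.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    w_int8 = torch.round(w_rest / scale).to(torch.int8)
    y_int8 = x[:, ~outliers] @ (w_int8.float() * scale).t()

    return y_fp32 + y_int8
```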

Related note: [first suggested by Dettmers et al. in BigScience transactions] there is one more setup where decoding on GPU would be faster than on any CPU: when generating a very large number of extremely short sequences (i.e. with no prefix, such that there is no need to swap inputs).
Likewise, for any weak CPU, there exists a sufficiently long sequence where computing on CPU is faster than swapping the (large) past KV tensors to the GPU.

I appreciate your intent to study this more systematically. I am curious how FlexGen vs. CPU scales across different prefix lengths and different modern CPU types (e.g. cheap, like a Ryzen 7950X, vs. mid-range, like 2x Xeon 6426Y or 2x EPYC 9124, vs. high-end, like 2x EPYC 9654) to figure out the optimal niche.

@merrymercy
Collaborator

Thanks for your suggestions and notes! Will look into them.
Could you edit your main post to reflect the mistake about the metric in your script?

Also, for the memory usage part, correct me if I'm wrong: in your setting, the model weights take 325/2 = 162.5 GB and the KV cache takes 76.5 GB, so the total memory usage is 239 GB. How did you verify that "the code never uses more than 200GB"?

@justheuristic
Author

justheuristic commented Feb 22, 2023

Thank you for noting that. I mistakenly measured the memory with an earlier version of the code with a smaller batch size - which did fit into memory. I will update the first message shortly (in under 30 minutes) once I verify the throughput and memory measurements. Updated - and set it to batch size 8 to be safe. The throughput is noticeably lower.

I also clarified that the decoding throughput is not the same as the total aggregate throughput.

@Ying1123
Collaborator

Ying1123 commented Feb 22, 2023

Hi @justheuristic, the update looks good to me! We also tried to run your scripts on our GCP instance, the same one used in the paper.
Here is what I got. We will keep updating the results when we get more.

Setup

The CPU is an Intel(R) Xeon(R) CPU @ 2.00GHz with 32 vCPUs and 208GB of DRAM.
We tried to run the provided script with batch size = 8 on the GCP instance. It runs out of memory during weight initialization: we saw a "Killed" during the initialization of the 69th layer.
To make it runnable, we changed the number of layers from 96 to 48. We then ran the script and scaled the latency printed by the script by 2 to get an estimate for the 175B model.

OPT-175B Results

PTDQ (int8)

prefill latency: 718.5s, prefill throughput: 5.7 token/s
decode latency: 235.2s, decode throughput: 1.08 token/s
total latency: 953.7 s, total throughput: 0.269 token/s

FlexGen (int4)

prefill latency: 1468.57 s, prefill throughput: 50.20 token/s
decode latency: 2603.31 s, decode throughput: 1.71 token/s
total latency: 4071.88 s, total throughput: 1.12 token/s

If we take your approach of using the GPU for prefill and the CPU for decoding, we get a throughput of 8 * 32 / (91.18 + 235.2) = 0.78 token/s.

On our setup, the throughput of your proposed approaches is still lower than FlexGen's for both prefill and decoding, although they are definitely more efficient than the default offloading strategies in HF Accelerate and DeepSpeed. Note that the int4 path in FlexGen is not well-optimized: we did not use any specialized CUDA kernel for quantization, we just compose the quantization from normal PyTorch operators without fusion. I guess the int8 kernels in PTDQ are better optimized with FBGEMM.

Conclusion

I believe the results also depend on the capabilities of CPUs (both memory and compute). We mainly tested on normal cloud CPUs and desktop CPUs and found them too slow, so we ignored these CPU-only options. Another reason is that there is no good available implementation (as opposed to a simplified benchmark script) of the methods you propose. I think in the future we should study the whole space. For example, when the CPU memory is not enough for PTDQ to hold the whole model, we need offloading to disk anyway. In this case, the techniques in FlexGen will show a bigger win.

This is the point of FlexGen: it provides infrastructure that lets you easily try different offloading strategies and approximation methods.

@justheuristic
Author

justheuristic commented Feb 22, 2023

Thank you for running the benchmarks so quickly. If I read it correctly, this evaluation is roughly proportional to the relative CPU performance. In other words, "Yay! It all makes sense now!!"

For instance, if we compare the CPUs and their CPU decoding throughput, we get:

  • GCP GPU-optimized instance: 32 virtual / 16 physical cores, 2GHz per core -> 1.08 token/s
  • General-purpose instance: 64 virtual / 32 physical cores, 2.5GHz per core -> 2.06 tokens/s
Boring CPU stuff (click to expand)

Based on my (limited) understanding of GCP infrastructure, the "Intel(R) Xeon(R) CPU @ 2.00GHz" is a slice of a virtualized Xeon CPU from a physical server that has multiple T4 GPUs and more cores. I couldn't find the exact CPU model from GCP, but similar AWS instances have 2nd-gen Xeon Gold CPUs - and I'm using a newer 4th gen of the same Xeon Gold line.

If we assume that the GCP instance is also a 2nd-gen Xeon, there is a slight IPC (instructions per cycle) improvement from using a 4th-gen Xeon in my case, but it should be at most a few tens of percent in total.

In other words, it makes total sense that the FlexGen 4bit wins against this baseline on the first CPU.

We just compose the quantization with normal pytorch operators without fusion. I guess the int8 kernels in PTDQ are better optimized with FBGEMM.

True enough, but the FBGEMM code only affects linear layers, so the CPU baseline is not even fusing operators - not that fusing would help on CPU anyway :)

It is also curious that all other evaluated "baseline" algorithms (accelerate, deepspeed, petals) are hopelessly slower than both FlexGen and the naive CPU baseline on the same machine.

I was particularly surprised that DeepSpeed Inference - which uses inference-optimized kernels, according to the cited paper - hopelessly lost to both FlexGen and CPU.

// If I read the results correctly, the DeepSpeed-Inference baseline is not just 112x slower than FlexGen 4-bit, it is also ~27x slower than simply running on CPU, even with CPU-only prefill on the same test hardware. Please correct me if I misunderstood something.

p.s. To avoid misleading the casual reader, I added the "Important! Authors evaluated..." note in bold to the first message in the thread.

@Eric-Wallace-WebHost

Thanks for this discussion, I learned a lot from this.

@merrymercy
Collaborator

merrymercy commented Feb 23, 2023

Happy to see that we are gradually reaching some common conclusions. Here are some thoughts on your questions.

Q1

For instance, if we compare the CPUs and their CPU decoding throughput, we get:
GCP GPU-optimized instance: 32 virtual / 16 physical cores, 2GHz per core -> 1.08 token/s
General-purpose instance: 64 virtual / 32 physical cores, 2.5GHz per core -> 2.06 tokens/s

The scaling cannot be as perfect as you calculated. We also ran your script on a GCP memory-optimized instance with 768 GB of memory. The CPU is an Intel(R) Xeon(R) CPU @ 2.60GHz with 96 vCPUs. The decoding throughput is 1.42 token/s, still less than the 1.71 token/s we got with FlexGen 4-bit.
An instance with such a high number of vCPUs is not cheap on GCP -- this instance is 2x more expensive than the T4 instance we used.

Edited: The 96-vCPU instance has AVX512 and VNNI support.

Q2

It is also curious that all other evaluated "baseline" algorithms (accelerate, deepspeed, petals) are hopelessly slower than both FlexGen and the naive CPU baseline on the same machine.

You are right. We mentioned in the paper that these baseline systems optimize more for latency or directly use suboptimal strategies inherited from the training systems. If we just use them out of the box, they cannot use a batch size larger than 2 on our setup. This is the best we can get with these systems. Please see the benchmark scripts for baselines here.

One goal of the paper is to point out that they missed a huge room for improvement in terms of throughput.

Q3

I was particularly surprised that DeepSpeed Inference - which uses inference-optimized kernels, according to the cited paper - hopelessly lost to both FlexGen and CPU.

Your surprise makes sense because that paper packed in a lot of stuff. Specifically, in our paper we use the "DeepSpeed ZeRO-Inference" feature, as suggested by this blog.

Note that "DeepSpeed Inference" and "DeepSpeed ZeRO-Inference" are two different things, as suggested by this huggingface repo.
"DeepSpeed Inference" utilizes inference-optimized kernels but does not support offloading, so we cannot use it.
"DeepSpeed ZeRO-Inference" is actually the one that uses the "ZeRO-3" technique to do offloading, which is used by us.

Q4

I want to mention an additional note about the CPU-only baseline. We did think about this at the very beginning. However, as we have shown, it is slower than FlexGen even with a 96-vCPU instance on GCP, so we didn't spend too much time working on it.

Why didn't we include a CPU-only baseline?
There is no paper or code that we could use or reference. No one did accuracy-preserving int8 LLM inference on the CPU before. Your benchmark script is awesome, but it did not exist before we released this repo, and it does not work as an end-to-end example with correct outputs that we can verify either. We thank you for your contribution and would like to see follow-ups in this direction.

We really appreciate these insightful discussions! Could we add you to the acknowledgment of our paper?

@justheuristic
Author

justheuristic commented Feb 23, 2023

Q1 The scaling cannot be as perfect as you calculated.

I see. It is indeed more complicated. I cannot be sure, but there's another thing that could be at play: the high-memory instances could have CPUs without VNNI support. GCP has older-gen special CPUs, just like it still has old GPUs. You can think of VNNI as "tensor cores, but smaller and for CPU" - they had already seen mass adoption by the time the T4 rolled out, so both your and my benchmarks likely have them. To repeat, this is a wild guess and I have no way of verifying it.

Q2 One goal of the paper is to point out that they missed a huge room for improvement in terms of throughput.

Agreed. This checks out with deepspeed-inference's abstract and intro, which emphasize latency over throughput.

Q3 Note that "DeepSpeed Inference" and "DeepSpeed ZeRO-Inference" are two different things,

Thank you for the clarification; I somehow missed that when I was reading :)

Q4 No one did accuracy-preserving int8 LLM inference on the CPU before.

True enough. It's curious, because the quantization library I refer to has been around since 2019 and CPU 8-bit acceleration has been around since 2018 - but indeed there is no dedicated library for running OPT/BLOOM (specifically) on CPU in 8-bit.

From my understanding, LLM CPU inference is "pathologically underhyped", even though many non-academic practitioners have asked about CPU inference (in general) in the past (example).

Could we add you to the acknowledgment of our paper?

I appreciate the gesture, but, unfortunately, I cannot accept for personal reasons. Please don't read the wrong implication into this: you guys have taken a great step towards making LLM inference affordable, which is, like, the most important problem in today's DL engineering. I also appreciate our discussion -- I had a few (more constructive) follow-up points in mind, but I will need a few hours to check the numbers.

@Eric-Wallace-WebHost Thanks for this discussion, I learned a lot from this.

Since others may be viewing this, let me make it clear: I have nothing but respect for the authors and even more respect for the problem that the authors are trying to solve. We need a way to make running large models accessible, and this is a significant step in that direction. The way I see it, most pre-prints published today have oversights - and I am "guilty" of such oversights myself. What defines the few good papers is that the authors are willing to work on fixing these oversights, which is the case here.

@justheuristic justheuristic changed the title Question: FlexGen seems slower than simple CPU code, am I missing something? Question: FlexGen seems slower than simple CPU code, am I missing something? [see discussion] Feb 23, 2023
@min-xu-ai

you guys have taken a great step towards making LLM inference affordable, which is, like, the most important problem in today's DL engineering

Can't agree with you more on this. Only thing I'd add is that it is not just inference, but also fine-tuning. Making those large models accessible will unleash a lot of human creativity.
