Question: FlexGen seems slower than simple CPU code, am I missing something? [see discussion] #24
Comments
I think you are missing a GPU
Thing is, the basic code without a GPU seems to run faster than FlexGen with a GPU (edit: clarified the title and replaced "slower than running on CPU" with "slower than CPU code" to make it less ambiguous)
@justheuristic, have you tried other model sizes as well? I wonder how the result might change with model size on the X-axis. Also, CPU or GPU might not be the most important matter here. The dual-CPU setup might be very costly (I haven't done a price estimate), but I think tokens/second/dollar might be an interesting metric to look at (ignoring the cost of electricity is fine, just the equipment cost).
Hi! Thanks for the answer.
Tried other model sizes? Agreed. I will benchmark all the setups and reply here shortly; it should take about a day.
Tokens/second/dollar: That's a good point - but they are about the same price :) The dual Xeon 6426Y setup costs marginally more than the T4 GPU by itself (i.e. without the computer you plug it into). Xeon MSRP is about $1500 per CPU, so $3000 in total - and T4 MSRP is about $2300 (plus whatever CPU you're using). My point is, a 6426Y is about what you typically get on a server with enough RAM to load OPT-175B - and you have to have a CPU there anyway. Even if you go out of your way to put a weak CPU in there, the rest of the platform (200GB RAM, motherboard, PSU) will make up most of the price. I have a hunch - but am not 100% certain - that if a dual CPU gives you 3.66 tokens/sec, a single CPU would be about 50% of that, which would still be faster than 1.12 (FlexGen table).
Also, keeping an eye out for the new EPYC and Apple M2 silicon, which will have sizeable unified memory. I think there are potentially more efficient ways than GPUs; we need software that can take advantage of them.
Another important factor is the minimum requirement for a system. While $3K can get you the dual-socket CPUs and additional money can get 256GB of memory, that is not common for the majority of desktops. Gaming GPUs like RTXs, on the other hand, should be much more common. IMHO, a good target for CPU inference might be less beefy, more common machine specs. I tried flexgen's chatbot.py + OPT-30B on my 64GB machine with a 24GB GPU. Unfortunately, it crashed the machine after the "converting the weights to numpy format" step... :-(
That sounds surprising, as in "it should only convert these weights once, shouldn't it?" Maybe you can configure your system with a large (e.g. 64GB) swap? It will take a while, but theoretically it should eventually convert (and save) the numpy weights, and then the model should work normally.
Thanks for the suggestion. I did have 64GB of swap and my NVMe drive should be pretty fast. I think it crashed after the conversion, but the output message didn't indicate what it was computing. It must have brought down the system with some intense swapping. I will try to figure it out later.
[Added the benchmark code to the first message; I've also added the library versions, just in case] I tried the 30B model as requested, and it is still faster on CPU, albeit with a smaller gap (2.2x instead of 3.2x compared to FlexGen; 18.1 tokens/second). I also tested on an older system with dual Xeon 6149 CPUs bought in 2017, and it still yields 1.42 tokens per second, i.e. still faster than FlexGen - even though the CPU benchmark uses 8-bit instead of 4-bit. Based on my understanding, if we were to switch FlexGen(c) to 8-bit compression, it would get slower, not faster.
Hi @justheuristic, thanks for bringing this up. I think this is a very reasonable baseline that we should discuss. At first glance, there are two issues with your script. I will follow up with more detailed experiments later.
I think this is a very interesting discussion. To systematically study this, I will try to do more experiments and analysis later, as there are a lot of factors and metrics.
@Ying1123 thank you for the detailed response.
Reproducibility notes: for the 2.5% overhead in PyTorch, I cast the chosen outliers to float32 instead of float16, as in Dettmers et al. This is because the CPU kernels for 16-bit precision are slower than those for 32-bit. Since outliers correspond to under 0.1% of all model weights (see Table 4 and Section 3.2 here), the RAM usage does not change significantly. A toy sketch of this outlier handling is given below. Related note: [first suggested by Dettmers et al. in the BigScience transactions] there is one more setup where decoding on GPU would be faster than any CPU: generating a very large number of extremely short sequences (i.e. with no prefix, such that there is no need to swap inputs). I appreciate your intent to study this more systematically. I am curious how FlexGen vs. CPU scales across different prefix lengths and different modern CPU types (e.g. cheap, like a Ryzen 7950X, vs. mid-range, like 2x Xeon 6426Y or 2x EPYC 9124, vs. high-end, like 2x EPYC 9654) to figure out the optimal niche.
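To make the outlier handling above concrete, here is a minimal toy sketch of a linear layer whose outlier input features are computed in float32 while the rest use int8 weights. It is only an illustration of the general idea, not the exact benchmark code; the threshold value and the per-tensor quantization scheme are assumptions.

```python
import torch

def int8_linear_with_fp32_outliers(x: torch.Tensor, weight: torch.Tensor,
                                   threshold: float = 6.0) -> torch.Tensor:
    # x: (batch, in_features), weight: (out_features, in_features), both float32.
    # Input feature columns whose activations exceed the threshold are outliers.
    outlier_cols = (x.abs() > threshold).any(dim=0)

    # float32 path for the rare outlier features (well under 1% of columns)
    y_outliers = x[:, outlier_cols] @ weight[:, outlier_cols].t()

    # int8 path for everything else (symmetric per-tensor quantization)
    w_rest = weight[:, ~outlier_cols]
    scale = w_rest.abs().max().clamp(min=1e-8) / 127.0
    w_int8 = torch.round(w_rest / scale).clamp(-127, 127).to(torch.int8)
    y_rest = x[:, ~outlier_cols] @ (w_int8.float() * scale).t()

    return y_outliers + y_rest
```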
Thanks for your suggestions and notes! Will look into them. Also, on the memory usage part - correct me if I'm wrong. In your setting, the model weights take 325/2 = 162.5 GB and the KV cache takes 76.5 GB, so the total memory usage is 239 GB. How did you verify that "the code never uses more than 200GB"?
Thank you for noting that. I had mistakenly measured the memory with an earlier version of the code with a smaller batch size, which did fit into memory. I will update the first message shortly (in under 30 minutes) once I verify the throughput and memory measurements. Updated - and set it to batch size 8 to be safe. The throughput is noticeably lower. I also clarified that the decoding throughput is not the same as the total aggregate throughput.
Hi @justheuristic, the update looks good to me! We also tried to run your scripts on our GCP instance, the same one used in the paper.
Setup
The CPU is an Intel(R) Xeon(R) CPU @ 2.00GHz with 32 vCPUs and 208GB of DRAM.
OPT-175B Results
PTDQ (int8)
FlexGen (int4)
If we take your approach of using the GPU for prefill and the CPU for decoding, we get a throughput of 8 * 32 / (91.18 + 235.2) = 0.78 token/s. On our setup, the throughput of your proposed approaches is still lower than FlexGen for both prefill and decoding, although they are definitely more efficient than the default offloading strategies in HF Accelerate and DeepSpeed. Note that the int4 path in FlexGen is not well optimized: we did not use any specialized CUDA kernel for quantization, we just compose the quantization from normal PyTorch operators without fusion. I guess the int8 kernels in PTDQ are better optimized with FBGEMM.
Conclusion
I believe the results also depend on the capabilities of the CPUs (both memory and compute). We mainly tested on normal cloud CPUs and desktop CPUs and found them too slow, so we ignored the CPU-only options. Another reason is that there is no good available implementation (as opposed to a simplified benchmark script) of the methods you propose. I think in the future we should study the whole space. For example, when the CPU memory is not enough for PTDQ to hold the whole model, we need offloading to disk anyway; in that case, the techniques in FlexGen will show a bigger win. This is the point of FlexGen: it provides infrastructure that lets you easily try different offloading strategies and approximation methods.
Thank you for running the benchmarks so quickly. If I read it correctly, the measured throughput is roughly proportional to the relative CPU performance. In other words, "Yay! It all makes sense now!!" For instance, if we compare the CPUs and their CPU decoding throughput, we get:
Boring CPU stuff
Based on my (limited) understanding of GCP infrastructure, the "Intel(R) Xeon(R) CPU @ 2.00GHz" is a slice of a virtualized Xeon CPU from a physical server that has multiple T4 GPUs and more cores. I couldn't find the exact CPU model from GCP, but similar AWS instances have 2nd-gen Xeon Gold CPUs - and I'm using a newer 4th gen of the same Xeon Gold line. If we assume that the GCP instance is also a 2nd-gen Xeon, there is also a slight IPC (instructions per cycle) improvement from using a 4th-gen Xeon in my case, but it should be modest at most. In other words, it makes total sense that FlexGen 4-bit wins against this baseline on the first CPU.
True enough, but the FBGEMM code only affects linear layers, so the CPU baseline is not even fusing operators - not that fusing would help much on CPU anyway :) It is also curious that all the other evaluated "baseline" algorithms (accelerate, deepspeed, petals) are hopelessly slower than both FlexGen and the naive CPU baseline on the same machine. I was particularly surprised that DeepSpeed Inference - which uses inference-optimized kernels according to the cited paper - hopelessly lost to both FlexGen and the CPU. If I read the results correctly, the DeepSpeed Inference baseline is not just 112x slower than FlexGen 4-bit, it is also ~27x slower than simply running on CPU, even with CPU-only prefill, on the same test hardware. Please correct me if I misunderstood something. p.s. to avoid misleading the accidental reader, I added the "Important! Authors evaluated..." note in bold to the first message in the thread.
Thanks for this discussion, I learned a lot from this.
Happy to see that we are gradually reaching some common conclusions. Here are some thoughts on your questions.
Q1
The scaling cannot be as perfect as you calculated. We also ran your script on a GCP memory-optimized instance with 768 GB of memory. The CPU is an Intel(R) Xeon(R) CPU @ 2.60GHz with 96 vCPUs. The decoding throughput is 1.42 token/s, still less than the 1.71 token/s we got with FlexGen 4-bit. Edited: the 96-vCPU instance has AVX512 and VNNI support.
Q2
You are right. We mentioned in the paper that these baseline systems optimize more for latency or directly use suboptimal strategies inherited from training systems. If we just use them out of the box, they cannot use a batch size larger than 2 on our setup; this is the best we can get with these systems. Please see the benchmark scripts for baselines here. One goal of the paper is to point out that they missed a huge room for improvement in terms of throughput.
Q3
Your surprise makes sense because this paper packed in a lot of stuff. Specifically, we use the "DeepSpeed ZeRO-Inference" feature in this paper, as suggested by this blog. Note that "DeepSpeed Inference" and "DeepSpeed ZeRO-Inference" are two different things, as suggested by this huggingface repo.
Q4
I want to mention an additional note about the CPU-only baseline. We did think about this at the very beginning. However, as we have shown, it is slower than FlexGen even with a 96-core CPU on GCP, so we didn't spend too much time working on it - that is why we don't include a CPU-only baseline. We really appreciate these insightful discussions! Could we add you to the acknowledgment of our paper?
I see. It is indeed more complicated. I cannot be sure, but there is another thing that could be at play: the high-memory instances could have CPUs without VNNI support. GCP has older-gen special CPUs, just like it still has old GPUs. You can think of VNNI as "tensor cores, but smaller and for CPU" - they had already seen mass adoption by the time the T4 rolled out, so both your and my benchmarks likely have them. To repeat, this is a wild guess and I have no way of verifying it myself (though a quick flag check like the one sketched below could confirm it on the actual instance).
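For anyone with access to the instance in question, here is a small sketch (an assumption on my part, not something run in this thread) that checks whether a Linux host advertises AVX-512 / VNNI support:

```python
# Rough check (Linux only): does the host CPU advertise AVX-512 / VNNI?
# These are the instruction-set flags that fast int8 CPU kernels rely on.
flags = open("/proc/cpuinfo").read()
for feature in ("avx512f", "avx512_vnni", "avx_vnni"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```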
Agreed. This checks out with DeepSpeed-Inference's abstract and intro, which emphasize latency over throughput.
Thank you for the clarification, I somehow missed that when I was reading :)
True enough. It's curious because the quantization library I refer to has been around since 2019 and CPU 8-bit acceleration has been around since 2018 - but indeed there is no dedicated library for running OPT/BLOOM (specifically) on CPU in 8-bit. From my understanding, LLM CPU inference is "pathologically underhyped", even though many non-academic practitioners have been asking about CPU inference (in general) in the past (example).
I appreciate the gesture, but, unfortunately, I cannot accept for personal reasons. Please don't draw the wrong conclusion from this: you guys have made a great step towards making LLM inference affordable, which is, like, the most important problem in today's DL engineering. I also appreciate our discussion - I had a few (more constructive) follow-up points in mind, but I will need a few hours to check the numbers.
Since others may be viewing this, let me make it clear: I have nothing but respect for the authors and even more respect for the problem the authors are trying to solve. We need a way to make running large models accessible, and this is a significant step in that direction. The way I see it, most preprints published today have oversights - and I am "guilty" of such oversights myself. What defines the few good papers is that the authors are willing to fix these oversights, which is the case here.
Can't agree with you more on this. The only thing I'd add is that it is not just inference, but also fine-tuning. Making these large models accessible will unleash a lot of human creativity.
Hi!
I'm trying to reproduce the FlexGen results and compare them with more naive methods, and I'm getting weird results. Can you please help me?
edit: added benchmark details and minimal code to reproduce my claims.
Library versions:
ubuntu-server 18.04
PyTorch and dependencies installed via anaconda 4.9.1, package versions:
pytorch 1.13.1 py3.8_cuda11.7_cudnn8.5.0_0
numpy 1.23.4 py38h14f4228_0
numpy-base 1.23.4 py38h31eccc5_0
Transformers and tqdm installed via pip:
transformers 4.25.1
tqdm 4.64.1
bitsandbytes 0.37.0
I ran OPT-175B on a similar machine:
dual Xeon 6426Y (mid-range server CPUs) and 256GB RAM, which is slightly more than in the benchmark, ~~but the code never uses more than 200GB~~ (the benchmark setup has 208 GB).
using prefix length 512 and output length 32, similar to the README benchmark, and a batch size of 8 (edited; thanks to @merrymercy for pointing out the discrepancy).
I am using standard Hugging Face code with transformers.models.opt.modeling_opt.OPTForCausalLM. The model was quantized to 8-bit using PyTorch PTDQ on linear layers with all default parameters (a minimal sketch follows below).
Based on my measurements, I am getting 2.06 tokens per second in a basic CPU setup for the 8-bit model, or about 3.9 seconds per batch-step. This is basic HuggingFace + PyTorch PTDQ, no deepspeed / accelerate. Note: this does not account for prefill, so it is not a fair comparison; see the adjusted figures below.
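For reference, here is a minimal sketch of such a PTDQ baseline. It is not the exact benchmark script: the checkpoint name is a smaller stand-in, the prompt contents are arbitrary, and the timing here includes prefill.

```python
import time
import torch
from transformers import AutoTokenizer, OPTForCausalLM

# Stand-in checkpoint for illustration; the benchmark in this thread uses OPT-175B.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = OPTForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32).eval()

# PyTorch post-training dynamic quantization (PTDQ) on all Linear layers,
# default parameters: int8 weights, float activations.
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

batch_size, prefix_len, new_tokens = 8, 512, 32
prompts = ["hello world " * prefix_len] * batch_size
inputs = tokenizer(prompts, return_tensors="pt", truncation=True, max_length=prefix_len)

with torch.inference_mode():
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.time() - start

print(f"{batch_size * new_tokens / elapsed:.2f} tokens/s (prefill + decode)")
```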
In turn, FlexGen reports 1.12 tokens per second for a 4-bit OPT-175B model [tricky:
And, weirdly, **simple 8-bit CPU inference beats both FlexGen and FlexGen(c)** -- given the large-batch setup in question.
Did I understand the evaluation setup correctly? If not, can you please tell me what I am missing?
Summary and corrections from the discussion below:
Based on the suggestions by @merrymercy, it is inappropriate to compare against the CPU with batch size 64, since it does not fit in the original testing environment. I have updated the metrics with batch size 8 (to be safe); the decoding throughput fell from 3.66 to 2.06 tokens/second.
Based on the discussion with @Ying1123: in Section 6.0, the generative throughput is defined as "the number of generated tokens / (prefill time + decoding time)".
Here, prefill time stands for encoding the input sequence in parallel, layer-by-layer.
If the baseline algorithm prefills naively on the CPU, FlexGen(c) 4-bit does indeed outperform the CPU 8-bit baseline: on CPU, most of the time is spent on prefill. On GPU, the situation is the opposite: prefill is quick since it can be done with one offloading cycle, while generation requires multiple offloading cycles and takes longer.
In further discussion, we consider an option of running prefill on GPU (using simple offloading, streaming KVs to CPU), then running inference on a CPU.
On a single T4 GPU, you can prefill 8 samples of 512 tokens with the OPT-175B model in 8-bit precision (the CUDA 8-bit code uses Linear8bitLt from bitsandbytes 0.37.0 with threshold=6) in 91.2 seconds using naive overlapped offloading. The CPU decoding time is, in turn, 124.3 seconds on 2x 6426Y. The aggregate baseline throughput is 8 * 32 / (91.18 + 124.277) ~= 1.19 tokens / second.
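For clarity, the aggregate-throughput arithmetic above is just the following (the timings are the ones quoted in this thread):

```python
# Aggregate throughput = generated tokens / (prefill time + decoding time)
batch_size, new_tokens = 8, 32
gpu_prefill_s = 91.18     # T4, 8-bit prefill with naive overlapped offloading
cpu_decode_s = 124.277    # 2x Xeon 6426Y, 8-bit decoding
print(batch_size * new_tokens / (gpu_prefill_s + cpu_decode_s))  # ~1.19 tokens/s
```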
While the naive code is still faster, the difference between FlexGen and the baseline is not as significant as I originally thought.
Important: later in this thread, @Ying1123 provides their own evaluation on a somewhat weaker CPU (2GHz, fewer cores, virtualized). For that setup, FlexGen 4-bit on GPU is indeed 1.6x faster than the 8-bit CPU baseline, even if we account for GPU prefill. I thank @Ying1123 and @merrymercy for pointing out the differences and apologize for taking up their time.
Limitations that I left unaddressed