# FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Jiaao He Tsinghua University Jidong Zhai Tsinghua University

# **Abstract**

Cost of serving large language models (LLM) is high, but the expensive and scarce GPUs are poorly efficient when generating tokens sequentially, unless the batch of sequences is enlarged. However, the batch size is limited by some constantly reused intermediate results, namely KV-Cache. They occupy too much memory to fit more sequences into a GPU simultaneously. While they could be offloaded to host memory, the CPU-GPU bandwidth is an inevitable bottleneck.

We find a way to decompose the transformer models into two parts of different characteristics, one of which includes the memory-bound KV-Cache accessing. Our key insight is that the aggregated memory capacity, bandwidth, and computing power of CPUs across multiple nodes is an efficient option to process this part. Performance improvement comes from reduced data transmission overhead and boosted GPU throughput to process the other model part. Moreover, we address efficiency challenges brought by heterogeneity at both temporal and inter-device scopes using scheduling and performance modeling techniques. Evaluation results show that our system achieves  $1.88 \times -5.04 \times$  the throughput of vLLM when serving modern LLMs with the same GPU.

#### 1 Introduction

The large language models (LLM) are gaining high attention. These transformer-based models are very hardware-friendly when training and evaluating [18, 20], because the main computation workload is matrix multiplication, a highly optimized operation to run on accelerators, e.g., GPUs. However, when using the models, the auto-regressive procedure, i.e., decoding, is inefficient. Because tokens in a sequence are generated one-by-one, one of the operand matrices is in fact a vector. Multiplying a vector with a matrix achieves much lower throughput due to poor utilization of GPUs.

Although employing an arbitrary number of GPUs can always increase the throughput, in most cases, GPU is one of the most scarce resources. So, we should increase the utilization of single GPUs for many reasons, including affordability and being eco-friendly.

Enlarging batch size, i.e., generating tokens for multiple requests simultaneously, is the most feasible way without

changing the model. Different from prior use cases of neural networks, there is a larger opportunity for LLMs to run generation in batches, as an LLM usually serves many users online. The latency requirement to generate a token is much looser than other user cases of NN models, e.g., object detection in autonomous driving. Keeping up with reading speed of the user is the most strict latency requirement for text generation.

However, generating a new token depends on huge intermediate results of generating the previous tokens, namely *KV-cache* [23]. Processing batched requests results in a much larger memory footprint, far beyond the capacity of GPU memory. Figure 1 shows the dilemma instantiated on a common 7b model and several different GPUs. Increasing batch size makes the GPUs significantly better utilized, but the memory footprint of the *KV-cache* is much larger than the GPU memory. To make it worse, the *KV-cache* becomes even larger as more tokens are generated and the sequences get longer.



Figure 1: Memory footprint of *KV-cache* stops increasing GPU utilization by enlarging batch size

Host memory has naturally become the place to offload the KV-cache [14], as it is larger and cheaper than GPU memory. However, the KV-cache is not cold data. The complete KV-

*cache* is loaded into GPU memory to generate every token. Considering the enormous size of *KV-cache* as suggested in Figure 1, transmitting it between GPU and host memory frequently is the bottleneck of the offloading design. Essentially, the bandwidth of PCIe is always much lower than the memory bandwidth of GPUs and even CPUs.



<sup>\*</sup> Accurate numbers are in Section 2.3

Figure 2: Performance characteristics of typical GPUs and CPUs, matching the need of two parts of the model

We study the performance characteristics, including compute throughput and memory bandwidth, of both GPUs and CPUs, as shown in Figure 2. We find that compared with the huge gap in compute power, the two types of hardware have a much closer gap in memory bandwidth.

Fortunately, we find a way to partition the transformer model into two parts, namely *R-Part* and *S-Part*. *KV-cache* is included by the former. Little performance loss is introduced when completely moving the memory-bound part to the host side, as the ratio of memory bandwidth and compute throughput can fulfill its requirement. Therefore, we get **our key insight** that we should **compute near KV-cache** on CPUs. Instead of transmitting KV-cache data over any interdevice connection, we transmit the activation tensors, which are orders of magnitudes smaller than KV-cache.

Our approach totally removes intermediate data of sequences, the *KV-cache*, from GPU memory. Therefore, the batch size can be greatly increased, and the GPUs can be optimally utilized. However, such a heterogeneous approach faces three challenges to achieve high overall throughput.

Challenge 1: The CPU is busy but slow. It runs multiple tasks, including batch gathering, tokenization, and coordinating the GPUs. Performing extra computation interferes with these tasks, slowing down all of them. To add to the difficulty, the memory bandwidth of a CPU is lower than GPU.

Challenge 2: The pattern of workload variation, as the generated sequences get longer, differs between the two parts. In our solution, the CPU and the GPU take turns to perform computation, and pass the results to each other. A basic pipeline of multiple batches of requests is used to utilize both of them. However, computation on the CPU takes longer time as the generated sequence gets longer, while the latency of its counterpart on GPU does not change at all. This makes it hard to always utilize both CPU and GPU.

Challenge 3: Careful orchestration is needed to balance the

performance of both types of hardware. Bottleneck may be either of the GPU or CPU, because they are tightly coupled. We need to balance the two considering the heterogeneous hardware and token generation workload. We seek for a minimum CPU requirement that can fully exploit the compute power of the GPU, aiming at minimizing the overall cost.

Our system, FASTDECODE, is a CPU-GPU heterogeneous pipeline for LLM inference that addresses the challenges by the following innovations.

Innovation 1: We employ multiple out-of-chassis remote CPUs for KV-cache and the related computation. The aggregated memory capacity and bandwidth of the system are scaled up. The distributed CPUs can achieve sufficient throughput to saturate the GPU, with moderate communication overhead.

Innovation 2: We invent a sequence-level load-stabilizing schedule to minimize idling and better utilize both types of hardware. The workload on a CPU is proportional to the total length of sequences it maintains. To keep the latency stable, sequences are fed into the system following a workload control algorithm. Short and long sequences are simultaneously processed by CPU workers, leaving the total length of sequences stable. As a result, the overall latency of CPUs changes more gently, and both types of hardware are better utilized.

Innovation 3: We adopt a model-guided approach to orchestrate the GPU with CPUs. It quantitatively characterizes the performance bottleneck considering different aspects of the LLM inference tasks. Aggregated memory bandwidth is identified as the key metric in selecting the CPUs. For a given model and GPU setup, based on profiling result of a micro-benchmark, we can estimate the minimum required aggregated CPU memory bandwidth for different batch sizes.

Overall, the throughput of a single GPU is saturated with a significantly larger batch size. Thanks to the scalability and aggregated power of CPUs across nodes, high overall token generation throughput is achieved with affordable GPU resources. In our evaluation, up to  $5\times$  throughput of vLLM is achieved on the same GPU with acceptable latency.

Contributions of this paper are summarized as follows.

- We find an unconventional way to decompose the autoregressive transformer model, with high potential of performance improvement.
- We propose a near-memory processing system over the *KV-cache* that exploits the aggregated memory bandwidth out-of-chassis CPUs for higher throughput.
- We invent a sequence-level pipeline schedule to balance the growing-with-time workload and the fixed workload in token generation using LLMs.
- We create a performance model that can provide optimal hardware configuration for different models and requirements using our system.

This paper is organized as follows. Section 2 provides background information of LLMs and hardware options to serve them. Section 3 shows our way to decompose the model, and illustrates our key insight that processing near *KV-cache* can boost overall throughput. Section 4 introduces the design of our system, with techniques to resolve challenges brought by heterogeneity in both terms of workload and hardware. Section 5 includes more details in our implementation. Section 6 compares the performance of our system with other systems, and Section 7 shows more experiment results for analyzing our performance. Section 8 includes discussion with more related works, and Section 9 concludes our work.

# 2 Background and Motivation

### 2.1 Transformer Model and KV-Cache

The auto-regressive models are based on transformer structure [28]. The key module of these models is the *attention layer*. We briefly illustrate its process as follows. Denote the feature vector of the i-th token in a sequence as  $X_i$ .

First,  $X_i$  is mapped to three different linear spaces, which is implemented by three fully-connected layers.

$$Q_{i} = W_{q}X_{i}$$

$$K_{i} = W_{k}X_{i}$$

$$V_{i} = W_{\nu}X_{i}$$
(1)

For the i-th token, inner production is applied between its feature vector and the  $K_j$  feature vectors of all tokens before it. This is actually the attention process, and it generates an attention vector for the current token.

$$A_i = \text{Normalize}\{Q_i \cdot K_i (j = 1, \dots, i - 1)\}$$
 (2)

The attention vector is normalized, commonly using *soft-max* operation. Then, it is used as a weight to gather information, i.e., add up the  $V_j$  vectors of all preceding tokens.

$$O_i = \sum_{i=1}^{i-1} A_{ij} V_j$$
 (3)

The output is applied with the final linear transformation using another fully-connected layer.

$$Y_i = W_o O_i \tag{4}$$

A *transformer block* consists of one attention layer followed by a multi-layer perceptron (*MLP*) module, which includes multiple large fully-connected layers and non-linear activation functions in between. Connecting tens of such transformer blocks sequentially makes a complete decoder model, which is the backbone of most well-known LLMs, including GPT [3, 21], Llama [26, 27], and many more.

When using such models in real-world tasks including chat and text generation, tokens are produced one-by-one. To get the next token, only the latest token needs to be processed by the model, because the fully-connected layers and MLPs process each token independently. However, eq. (2) and eq. (3) of the attention layer involve reaction between the latest token and all previous tokens. Instead of re-computing them,  $K_j$  and  $V_j$  can be saved in memory and reused for the newly generated tokens. KV-cache [23] refers to these saved intermediate tensors. For a sequence of length S, this technique reduces the total amount of inner production computation between feature vectors from  $O(S^3)$  to  $O(S^2)$ . So, using KV-cache is mandatory in LLM inference.

## 2.2 Accelerating Decoding

There are two steps to use an LLM to respond to a request. First, during the **prefilling** stage, the entire input sequence from the user is processed by the model, where all the tokens in the sequence can be processed as a batch in the MLP layers. Then, in the **decoding** stage, the model use the feature vector of the last known token to predict the next token to be appended to the sequence. So, each new token of the generated sequence goes through the model one-by-one.

Computation efficiency is extremely important, as it is directly related to the cost of serving LLMs. Unfortunately, using GPUs, generating a single sequence is poorly efficient, because the main computation workload is applying fully connected layers to one feature vector in the decoding stage. In other words, the main computation task is multiplying matrices with vectors (GeMV). There is little chance to reuse the matrix data in near-processor memory, so accessing the global memory bounds the workload. The numerous floating point number units on GPUs are underutilized.

To leverage the computation throughput of GPUs, enlarging batch size is the most feasible approach. A batch of sequences are generated simultaneously, so multiple tokens are processed at the same time. The feature vectors are stacked together to be a matrix, and the GeMV computation becomes multiplying the weight matrix with the feature matrix (GeMM). GeMM is a highly optimized operation on GPUs. As long as the batch size is large enough, the computation power of GPUs is fully exploited.

Specifically, for the auto-regressive transformer-based models, Orca [30] points out that the granularity of batching can be reduced to improve performance. Instead of batching complete sequences together, it is more effective to batch the generation task of single tokens. This technique greatly improves the throughput of serving LLMs by introducing more chances of batching.

Unfortunately, beside leaving the memory issue of *KV-cache* unaddressed, the flexible batching mechanism in Orca introduces significant memory fragmentation. Paged attention technique is adapted by vLLM [14] to address the memory issue. GPU and host memory of *KV-cache* is managed by pages, so that the GPU memory can be better utilized without

fragmentation. Also, host memory can be used to store more *KV-cache* for more sequences, so the chance for batching token generation tasks is increased.

Chance of batching in vLLM is still limited, because swapping the large *KV-cache* over PCIe between GPU and host memory introduces high overhead. Therefore, vLLM has to reduce the swapping frequency. So, as the sequences get longer, *KV-cache* of few sequences can reside in the GPU memory, resulting in a small batch size. The system achieves high throughput in less common cases, e.g., generating tokens with shared prefix or wide beam searching, where multiple independent new tokens share the same *KV-cache*.

FlexGen [24] studies on finding an optimal offloading order of both model weights and *KV-cache*. Still, the *KV-cache* is orders of magnitudes larger than the model weights. Transmitting them over the PCIe link, which is much slower than the memory bandwidth, is inherently inefficient.

In summary, it is a consensus that increasing batch size is the most effective way for higher throughput. However, because *KV-cache* has to be in the GPU memory for computation, few works can achieve actual speed up due to its large memory footprint.

Differently, we find that *KV-cache* does not need to present in GPU memory. In this paper, we show challenges and our solutions that unleashes the power of CPUs to handle the *KV-cache* and achieve high token generating throughput by enabling a significantly larger batch size.

# 2.3 Memory-bound Workload Fits CPU

While the slower but larger host memory is used to compensate for the lack of GPU memory capacity, the CPUs are barely used to perform computation. They have up to TFLOPs of floating point computation throughput, negligible compared with hundreds of TFLOPs achieved by specialized tensor processing units on GPUs.

Table 1: Performance and Power Comparison

| Tuna | M - J -1 | TDP   | Compute |         | Memory |         |
|------|----------|-------|---------|---------|--------|---------|
| Type | Model    | IDP   | FLOPs   | W. per. | GB/s   | W. per. |
| CDII | Xeon*    | 125 W | 1.3 T   | 96.15   | 128    | 0.97    |
| CPU  | Epyc*    | 155 W | 1.2 T   | 129.2   | 205    | 0.76    |
| GPU  | A10      | 150 W | 125 T   | 1.2     | 600    | 0.25    |
|      | V100     | 250 W | 112 T   | 2.2     | 900    | 0.27    |

<sup>\*</sup> Using Intel Xeon Gold 5218 and AMD Epyc 7452 CPUs.

However, in terms of memory access bandwidth, the gap between CPU and GPU is smaller. Table 1 lists the compute throughput and memory bandwidth of several common ones. Modern server-class CPUs can achieve hundreds of GB/s. Memory bandwidth of mid-range GPUs, e.g. NVIDIA A10, is only a few times larger than the CPUs. A dual-socket AMD Epyc server can achieve 68% of its memory bandwidth. Even

the top-level GPUs can barely have more than  $10 \times$  bandwidth. Additionally, different from the huge gap between GPUs of different levels, the memory bandwidth remain similar from entry-level to high-end in each generation of CPUs.

CPUs are more attractive, considering the cost. As a general purpose processor that exists in every computer, they are much more widely deployed than GPUs. It is much easy to acquire a large number of CPUs with relatively low cost. We can easily enlarge the memory capacity and bandwidth by adding standard DIMMs to the servers. On the contrary, GPU memory is not only expensive but also hard to extend its capacity as the memory is soldered on the circuit.

Table 1 includes power consumption of the hardware as a metric of efficiency. Maximum power consumption of CPUs is a few times lower than GPUs. Besides, when memory access is the major workload, the hardware usually does not consume as much power as the TDP. So, the actual efficiency gap is even smaller than our estimation.

In conclusion, CPU is an appealing option for the memorybound jobs.

# **3** Observation and Insights

# 3.1 Performance Dilemma and Decomposition



Figure 3: Performance dilemma in auto-regressive generation

Throughput of the fully connection layers in the transformer blocks increases significantly with a larger batch size. However, the attention operation benefits little when enlarging the batch size, because each sequence has different *K* and *V*. When generating sequences in a batch, instead of GeMM, the GeMV becomes batched GeMV, which is still memorybound. To make it worse, the size of *KV-cache* is proportional to the batch size. As shown in Figure 3, when using a larger batch size to better saturate the computation power of GPU, the memory footprint of *KV-cache* is much larger than the capacity of GPU memory. In fact, the key difficulty is handling

the memory-intensive operations with *KV-cache* when using only GPUs for computation.

We categorize the computation workload during generating a token into two parts. They are denoted using dashed and solid lines, respectively, in Figure 3.

- *R-Part*: The auto-**R**egressive computation related to preceding tokens in the sequence, as formulated in eq. (2) and eq. (3). Each sequence is processed independently with its own *KV-cache*. It benefits little from enlarging batch size, but introduces huge memory footprint. Notably, no model parameter is involved in *R-Part*.
- S-Part: Rest of the model where sequences Share the same parameters. It mainly consists of fully connected layers. GPU utilization can be significantly increased by batching tokens in more sequences together in S-Part.

### 3.2 CPUs can Undertake More in LLM

In the LLM inference workload, the *KV-cache* takes enormous memory, ideal to be placed in the larger CPU-side DRAM, despite the bottleneck to move the data from host memory to GPU memory for computation.

However, the *R-Part* is inherently memory-bound work-load, where using a GPU gets little benefit over using CPUs. Therefore, we get our **key insight**: not only should we store *KV-cache* in CPU memory, but also process them with CPUs. In other words, *R-Part* should be **processed near data**.

As a result, the *KV-cache* is removed from GPU memory, and the batch size can be as large as 1024 or more. Recall the throughput curve in Figure 1. The computation in *S-Part* is able to utilize the GPUs with much higher efficiency. The overall token generation throughput is significantly increased.

Two concerns are intuitively raised on such design. (1) CPUs may not be fast enough to match the throughput of GPUs. (2) Transmitting the intermediate data between *S-Part* and *R-Part* across devices may be slow.

Table 2: Latency of Computation Operations

| Omenation                         | Batch Size | Lat. / ms |       | TFLOPs |       |
|-----------------------------------|------------|-----------|-------|--------|-------|
| Operation                         |            | GPU       | CPU   | GPU    | CPU   |
| R-Part                            | 1          | 0.084     | 0.287 | 0.050  | 0.015 |
| (eq. (2) & eq. (3))               | 1024       | 8.32      | 8.12  | 0.516  | 0.529 |
| S-Part                            | 1          | 1.46      | 49.5  | 0.366  | 0.011 |
| $(\sim 16 \times \text{eq.} (4))$ | 1024       | 7.08      | 611   | 77.5   | 0.899 |

Table 2 compares latency of computation tasks in generating tokens on two CPU nodes with those on an A10 GPU. We highlight the latency of our selected mapping between device and model part using underlines. When generating tokens using a widely-used 7B foundation model, the latency of *R-Part* is almost identical between using the GPU or CPUs, as the total hardware memory bandwidths are similar.

When computing on GPUs, the latency of *S-Part* is at a similar magnitude with *R-Part*. However, the *S-Part* cannot

be moved to CPUs, because it involves much heavier computation that can be greatly accelerated by the GPU. The poor throughput of CPUs leads to very high latency. It is also notable as the batch size is  $1024 \times$  larger, the latency is only  $5 \times$  larger. There is more than  $100 \times$  potential throughput gain. Overall, with similar latency and throughput of *R-Part*, efficiency is increased in *S-Part* and thus the whole system.

Table 3: Size of Data and Communication Latency

| Data           | Batch Size | Data Size | Latency / ms |       |
|----------------|------------|-----------|--------------|-------|
| Data           |            | Data Size | PCIe*        | RoCE* |
| Model Weight   | N/A        | 402 MB    | 12.6         | 32.2  |
| KV-Cache       | 1          | 4.19 MB   | 0.131        | 0.335 |
| K v-Cacile     | 1024       | 4.29 GB   | 134          | 343   |
| Intermediate   | 1          | 32.7 KB   | < 0.01       | 0.03  |
| Vectors (ours) | 1024       | 33.5 MB   | 1.04         | 2.68  |

<sup>\*</sup> Calculated using 32 GB/s PCIe 4.0 x16 and 100 Gbps RoCE

Sizes of data within a transformer block of a typical 7B model and the latency to send them across different types interdevice connection are shown in Table 3. Instead of previous works that may send the huge model or KV-cache across the links, we only communicate intermediate vectors, i.e.,  $Q_i, K_i, V_i, O_i$  in eq. (2) and eq. (3). These vectors are orders of magnitudes smaller than others. The estimated latency to send a large batch of them from 1024 sequences over the network is only a few milliseconds. The latency is a moderate portion of the computation latency presented in Table 2. Compared with existing systems, communication overhead of our near-KV-cache processing design is much smaller.

In brief, our approach enables serving LLMs with a much larger batch size. It brings huge potential of throughput gain, despite the minor overhead that is minimized in our system.

# 4 Methodology

## 4.1 System Overview

The local CPU in a server equipped with GPU may be too busy and too slow to provide sufficient throughput of processing the *R-Part*. Our approach uses out-of-chassis CPUs, whose aggregated compute power is exploited.

Figure 4 shows the basic design of FASTDECODE, which consists of two types of workers.

An *S-worker* computes *S-Part* of an LLM. It may use one or multiple GPUs. All weights of the model are on the S-worker, and partitioned by a certain way of model parallelism if using multiple GPUs. It acts as a typical token generation worker simply using GPUs, except for its much larger batch size and the behavior of computing *R-Part*. To generate a new token, it goes through the transformer blocks. After  $Q_i, K_i, V_i$  are produced by fully connected layers in *S-Part*, instead of computing *R-Part* locally, the S-worker sends different parts of them related to different sequences to the R-workers,



Figure 4: Workers of FASTDECODE

and retrieve the output,  $O_i$ , from them. Then, it feeds  $O_i$  to succeeding layers in *S-Part* on the GPU.

The *R-workers* use CPUs on remote nodes to compute *R-Part*. These R-workers are light-weight, because no model parameter is involved in *R-Part* of LLMs. The functionality of a R-worker is simple. It receives  $Q_i, K_i, V_i$  of a batch of tokens.  $K_i$  and  $V_i$  are appended to the existing KV-cache.  $Q_i$  is used to in attention computation with the local KV-cache data, and the output is returned. The R-workers may also drop KV-cache of a certain sequence upon its generation ends.

As GPU is the most scarce resource, the system maximizes the throughput of the S-worker via maximizing the batch size. As *KV-cache* is excluded from the S-worker, there is little tension on GPU memory capacity. Beside the model weight, it only needs memory for a small scratchpad of the current layer. The possible batch size can be up to millions of sequences. While a larger batch leads to larger latency, we enable the possibility to choose one that best fits the use case.

In this system, the S-worker and the R-workers work in turns to generate a token. When one type of worker is working, the other idles, as shown in Figure 5(a).

Therefore, a basic two-stage pipeline at token level is applied. As Figure 5(b) shows, the S-worker starts with two separate mini-batches: A and B. After it finishes the first *S-Part* of mini-batch A, it starts working on the first *S-Part* of mini-batch B. At the same time, the R-workers are working on the *R-Part* of mini-batch A. Overall, the two mini-batches are processed by each type of worker in turns.

This token-level pipeline only achieves optimal utilization of all workers if computation latency of *S-Part* and *R-Part* are equal. Otherwise, there are still bubbles in the pipeline due to idling workers, as Figure 5(c) suggests. In fact, the pipeline can barely be free of bubbles, because the workload and the

| S-worker                                     | S-Part 1   |            | S-Part 2   |            |  |  |  |
|----------------------------------------------|------------|------------|------------|------------|--|--|--|
| R-worker 1                                   |            | R-Part 1   |            | R-Part 2   |  |  |  |
| R-worker 2                                   |            | R-Part 1   |            | R-Part 2   |  |  |  |
| (a) No pipeline                              |            |            |            |            |  |  |  |
| S-worker                                     | S-Part A-1 | S-Part B-1 | S-Part A-2 | S-Part B-2 |  |  |  |
| R-worker 1                                   |            | R-Part A-1 | R-Part B-1 | R-Part A-2 |  |  |  |
| R-worker 2                                   |            | R-Part A-1 | R-Part B-1 | R-Part A-2 |  |  |  |
| (b) Ideal case of the basic 2-stage pipeline |            |            |            |            |  |  |  |
| S-worker                                     | S-Part A-1 | S-Part B-1 | S-Part A-2 | S-Part B-2 |  |  |  |
| R-worker 1                                   |            |            | R-Part A   |            |  |  |  |
| R-worker 2                                   |            |            |            | R-Part A-2 |  |  |  |
| (c) Possible bubbles in a real pipeline      |            |            |            |            |  |  |  |

Figure 5: Temporal view of FASTDECODE

hardware are both of high heterogeneity. The following two techniques addresses the challenges at temporal and interdevice scopes, respectively.

# 4.2 Sequence-level Load-Stabilizing Schedule

There is a significant amount of bubbles in the basic two-stage pipeline, because the workloads of *R-Part* and *S-Part* change differently depending on the length of the generated sequence. Latency of *S-Part* is only related to the batch size, because each new token involves fixed amount of computation in the fully-connected layers. In contrast, latency of *R-Part* is related to the total length of the sequences, because each new token interacts with all previous tokens in the sequence. When processing a batch of sequences, the generated sequences get longer over time. Because the overall latency of FASTDECODE is dominated by the larger latency of *S-Part* and *R-Part*, either S-worker or R-worker idles a lot, as shown in Figure 6.



Figure 6: Hardware idling due to workload heterogeneity

While changing the workload of *S-Part* may lead to decreased GPU utilization, the overall performance can be improved by reducing the latency fluctuation of *R-Part* without much efficiency issue. Ideally, as indicated in Figure 6, we

may move the workload of *R-Part*, keeping its total amount unchanged. The triangular area of idling S-worker is moved to the area of idling R-workers. As the total overall latency can be indicated by the area under the latency curve, this can reduce as much as 20% the total overall latency. In other words, the overall throughput is increased by 20%. Besides, the maximum latency to generate a token is reduced by 50%, indicated by the highest point of the latency curve.

We identify that the long latency of *R-Part* is caused by all sequences being long. As previous works [30] indicates, sequences of different lengths can be batched together in *S-Part* to increase throughput. And in *R-Part*, processing the sequences separately on different workers introduce no extra overhead. Therefore, we schedule the sequences in a pipeline to control the total length being processed at each step, as shown in Figure 7.



Figure 7: The sequence-level load-stabilizing schedule

Instead of starting a large batch of sequences together, smaller micro-batches are started with a fixed interval, F steps. As a sequence of target length S is generated in S steps, multiple micro-batches are being processed together in each step. The total size of the micro-batches being concurrently processed roughly equals to the total batch size. S-Part of these micro-batches are processed together as a large batch, so the GPUs are well-utilized in the same way as serving the sequences in a large batch.

For example, in Figure 7, size of the micro-batch is 2. After 4 steps of cold starting, the total number of sequences to be processed at each step equals to 6, the original large batch size. The sum of the numbers in a column indicates the total load in *R-Part* of the step. In the original large-batch schedule, it can be as much as 36 in a step, while the number is 24 in the load-stabilizing schedule, indicating 1/3 reduction of the maximum latency. In fact, the reduction of latency can be nearly halved, following the formal deduction below.

This schedule reduces the total length of all sequences by mixing sequences of different lengths together. To be more specific, assume that there are originally B sequences of length S. The total length of sequences can be  $W_{\rm max} = BS$  in the final step if all sequences are started together. In our schedule, the size of micro-batches is defined as follows.

$$M = \frac{BF}{S} \tag{5}$$

So, in the final step of generating a micro-batch, we have the maximum total sequence length.

$$W'_{\text{max}} = \sum_{k=1}^{S/F} MkF = \frac{B(S+F)}{2} \approx \frac{BS}{2} = \frac{W_{\text{max}}}{2}$$
 (6)

Although  $S = \frac{1}{3}F$  in the example in Figure 7 leads to  $W'_{\text{max}} = \frac{2}{3}W_{\text{max}}$ , S is usually much larger than F in real cases (thousands compared to tens). So, (S+F) is closer to S. As the approximation indicates, the maximum total length of sequences is reduced by 50%.

The total workload remains near  $W'_{\text{max}}$  throughout the rest contiguous serving process. Because there are almost infinite number of micro-batches to be processed, after the cold start of the pipeline schedule, there are always sequences of different lengths being processed.

An extra benefit is the reduction of waiting time for incoming sequence generating requests in online serving. It has to wait for up to S steps before being served in a large batch, but only waits for F steps in the load-stabilizing schedule.

```
Algorithm 1 Load-control Algorithm
```

**Require:** *M*: array of batch sizes of all current micro-batches **Require:** *E*: ending step index of all current micro-batches

**Require:** W: array of workload

**Require:** t: starting step index of the micro-batch

**Require:** *m*: size of the micro-batch 1: **function** ADDMICROBATCH

2: M.append(m)3: E.append(t+S)

4: W.append(m\*S)

5: **for all** i **in** current micro-batches **do**6:  $W[i] \leftarrow W[i] + (E[i] - t) * m$ 

 $w[\iota] \leftarrow w[\iota] +$ end for

8: end function

9: function GETEARILIESTSTEP

10:  $r \leftarrow t$ 11: **for al** 

12:

for all i in current micro-batches do

 $x \leftarrow \lfloor \frac{W_{\lim} - W[i]}{m} \rfloor \qquad \triangleright \text{Maximum allowed length}$  $r \leftarrow \max(r, E[i] - x + 1)$ 

13:  $r \leftarrow \text{ma}$ 14: **end for** 

14: end for

15: **return** *r* 

16: end function

The sequence-level load-stabilizing schedule can be generalized to a load control algorithm that dynamically determines when a new micro-batch starts. Given a maximum load limit  $W_{\text{lim}}$ , the earliest starting step index for a micro-batch of M sequences can be calculated based on the current micro-batches being processed. Algorithm 1 shows the algorithm that figures out the earliest step index, with a few more information to maintain when a micro-batch actually starts. As maximum total length is reached at the last step of each micro-batches, the algorithm maintains the workload at these steps. The margin between the maximum workload and the limit is used to get the maximum length of the new micro-batch at the specific step, Then, we get the earliest step index constrained by a certain peak of workload.

Notably, during cold starting of the schedule, there may be an issue to set  $W_{\text{lim}}$  to  $W'_{\text{max}}$ . With sufficient input to be processed, a large number of sequences are started at step 0. Then, step S becomes the peak, and the rest sequences can only launch at step S+1. Instead, we need to gradually increase  $W_{\text{lim}}$ , or use a fixed F in the beginning.

## 4.3 Workload-balanced Hardware Selection

FASTDECODE introduces hardware heterogeneity between the GPU and CPUs. To optimally utilize both of them, beside stabilizing the workload, selection of hardware also makes significant impact. Specifically, we need to determine the number of CPUs to use in our system. If we have too many CPUs, it is a waste because they have to wait for the GPU. If we have too few CPUs, it is the GPU that idles.

Also, the expected service latency to LLM users should be considered. In some cases, the acceptable latency to generate a single sequence is large, so we have more space to optimize the throughput. In other cases, the batch size should be reduced to fulfill a stricter latency limit.

We introduce a quantitative approach to determine the two most important parameters of our system: the batch size,  $\mathcal{B}$ , and the number of CPUs,  $\mathcal{P}$ .

To start with, there are two given conditions: the LLM and the GPU we use. Then, we need a few reference metrics.

Throughput to compute *S-Part* of the model on the GPU can be measured by a micro-benchmark. As shown in Figure 1, the throughput varies significantly as the batch size,  $\mathcal{B}$ , changes. Therefore, we use a function  $\mathbb{T}(\mathcal{B})$  to indicate the latency to compute *S-Part* of one transformer block on the GPU. Also, we have the user-specified expected maximum length of sequences  $\mathcal{S}$ .

We assume that perfect efficiency is achieved by the pipelines. As the latency of *S-Part* is fixed, and the latency of *R-Part* equals to the latency of *S-Part* in such a pipeline, the latency to generate a token using a model of *N* layers is calculated as follows.

$$2NS \cdot \mathbb{T}(\mathcal{B}) \le L \tag{7}$$

L indicates the expected latency to generate a sequence. Larger  $\mathcal{B}$  leads to larger  $\mathbb{T}(\mathcal{B})$ , as well as better overall throughput. A maximum possible  $\mathcal{B}$  is selected as the above constraint is fulfilled.

If there is no constraint on L,  $\mathcal{B}$  is selected based on the overall throughput on GPU.

$$\mathbb{E}(\mathcal{B}) = \frac{\mathcal{B}}{\mathbb{T}(\mathcal{B})} \tag{8}$$

 $\mathbb{E}(\mathcal{B})$  is proportional to the GPU throughput that is shown in Figure 1. It increases sharply when  $\mathcal{B}$  is small, indicating that increasing  $\mathcal{B}$  brings much benefit. When it becomes more stable as  $\mathcal{B}$  is large enough, the performance gains little. In this case, we should select a  $\mathcal{B}$  that further increasing it only brings marginal throughput improvement.

Another constraint of  $\mathcal{B}$  is the host-side memory capacity. Assume that each CPU has memory for K and V vectors for C tokens, which can be calculated from the size of the memory and specifications of the model.

$$\frac{1}{2}\mathcal{B}S \le C\mathcal{P} \tag{9}$$

In fact, this constraint is barely the actual limitation, because CPUs commonly have abundant memory.

After having  $\mathcal{B}$  determined,  $\mathcal{P}$  is minimized by the constraint of computing R-Part in similar time to S-Part. Assume that we are using the same model of CPUs in the system. We use another micro-benchmark to get R, which indicates the latency that one CPU processes one token for R-Part. So, the CPU takes time Rk in R-Part when generating a token appending to a sequence of k existing tokens. We get the constraint for  $\mathcal{P}$  using R.

$$\frac{\mathcal{B}S}{2\mathcal{P}}R \approx \mathbb{T}(\mathcal{B}) \tag{10}$$

Then, we get a direct approximation for the optimal number of CPUs to work with a GPU.

$$\mathcal{P} \approx \frac{\mathcal{B}SR}{2\mathbb{T}(\mathcal{B})} = \frac{1}{2}SR\mathbb{E}(\mathcal{B})$$
 (11)

Briefly, to cope with increased GPU efficiency  $\mathbb{E}(\mathcal{B})$  thanks to increased  $\mathcal{B}$ , more CPUs are needed. However, as we select the  $\mathcal{B}$  where increasing it brings marginal  $\mathbb{E}(\mathcal{B})$ , and  $\mathcal{P}$  has to be an integer, it has little impact when tweaking  $\mathcal{B}$ . Also, longer expected sequence length S makes the CPUs more heavily loaded, so more of them are needed.

In summary, given specifications of the hardware and the model, we first measure  $\mathbb{T}(\mathcal{B})$  and R with micro-benchmarks. Then, a definite optimal choice of  $\mathcal{B}$  and  $\mathcal{P}$  is given by Equation (7), Equation (9), and Equation (11).

Furthermore, assume that the feature dimension of the model is h. The workload of *S-Part*, reflected in  $\mathbb{T}(\mathcal{B})$ , is proportional to  $h^2$ . Meanwhile, R, the per-token workload of

*R-Part*, is proportional to *h*. So,  $\mathcal{P}$  is approximately proportional to  $\frac{1}{h}$ . The optimal number of CPUs tends to be smaller for larger *h*, which commonly appears in larger models.

## 5 Implementation

The S-worker of FASTDECODE is implemented using Py-Torch, for ease of adapting to various models and serving APIs. *R-Part* is stripped from the model and the token generation scheduling is taken over by our system. The R-worker is implemented using C++. As a light-weight service, it receives data from S-worker. We find that the performance the R-worker is more critical but understudied.

# 5.1 Mix-precision CPU Attention

Optimizing the performance of R-worker is critical to the overall throughput of the system. Compared to the well-established neural network libraries on GPUs, there lacks existing high performance neural network libraries on CPUs that can be used out-of-box. Most current LLMs use 16-bit floating point numbers (fp16), which is not supported by most CPU libraries. However, using fp32 libraries doubles the volume of memory access, which means doubling the latency.

We develop a mixed-precision attention operator that reads fp16 data from memory, convert them to fp32 in registers, and compute. Luckily, we find intrinsics in AVX-2 instruction set to perform the vectorized fp16-fp32 conversion in one instruction. Although fp16 floating point multiply and add (FMA) instruction is included in AVX-512 instruction set, we exclude it for compatibility to a wider range of CPUs.

# 5.2 Supporting Quantization

The above fp16-fp32 mix-precision implementation is lossless comparing with the original fp16 computation on GPUs. We also support more aggressive performance optimization if model accuracy degradation is tolerated. Model quantization is widely used and welcomed to boost our performance.

Various quantization algorithms are supported with a few extra functions to implement. Given  $Q_i$ ,  $K_i$ ,  $V_i$  vectors in fp16, the user function adds  $K_i$  and  $V_i$  to the KV-cache after quantization.  $Q_i$  is transformed as the quantization algorithm requires to produce  $O_i$ .

Our throughput benefits from storing the *KV-cache* data in a quantized format. Suppose that 4-bit integers are used to store K and V, the memory access size is quartered, and we are likely to get  $4 \times$  speed up, or save  $4 \times$  CPUs.

### 5.3 Model Parallelism

For larger LLMs, model parallelism at either tensor or layer level is a common technique that reduces the computation latency. It is mandatory in many cases as the model cannot fit in the memory of a single GPU.

FASTDECODE naturally have good support for such techniques. A separate group of R-workers are assigned to every S-worker that plays the part of a worker in the parallel groups.

For inter-layer model parallelism, i.e., pipeline parallelism, different transformer blocks are processed by each worker. So, *R-Part* related to each worker are totally independent.

For intra-layer model parallelism, i.e., the tensor-model parallelism, the fully-connected layers before and after *R-Part* are commonly partitioned across attention heads [20]. Therefore, each group of R-workers maintains independent *KV-cache* for different attention heads.

In both types of model parallelism, the workloads of *S-Part* and *R-Part* are divided by a same factor. Therefore, the number of R-workers to work with each S-worker remain unchanged, as revealed in Equation (11). Latency of FASTDE-CODE directly benefits from the resulting latency reduction.

## 6 Evaluation

# 6.1 Setup

**Models and tasks** All the auto-regressive models use the same transformer backbone. We choose a state-of-the-art open-source LLMs, Llama [27] and OPT [32]. We evaluate system performance over different sizes of the models, including Llama-7b, Llama-13b, and Opt-175b.

We reduce the number of layers in the models to reduce evaluation cost. The estimated throughput and latency of the original model is reported. The number of layers is strongly proportional to the overall latency, and inversely proportional to the throughput. So, throughput and latency of the real original model can be directly calculated. Fairness of comparing systems is not lost by using models of reduced number of layers, because little chance is found across layers, and no system have done optimizations at the layer dimension. This is justified by Figure 8, showing the latency of Opt-175b model using different number of layers, i.e., the number of transformer blocks. When keeping other settings unchanged, they are almost linearly related.



Figure 8: Latency of FASTDECODE serving Opt-175b

To measure the token generation throughput, the models

are used to generate a sequence over a short prompt. The total length of the generated sequences is 1024 for both models.

The models run over fp16 data format, without any quantization or pruning technique. Therefore, output of different systems should identical except for floating point errors.

**Hardware** We use a NVIDIA A10 GPU with 24 GB device memory as the S-worker. The node has 256 GB host memory as swap space of vLLM. Up to 4 additional nodes with dual sockets of AMD Epyc CPUs are used as the R-workers of FASTDECODE. The cluster is connected via Infiniband network.

As no existing system exploits out-of-chassis CPUs, all the baseline systems run on the GPU node only. While it may not look fair because FASTDECODE introduces extra hardware, existing approaches can only use the CPU nodes as a standalone CPU worker for the text generation task, contributing less than 1% to the total throughput beside the GPU.

**Baselines** vLLM [14] uses paged attention technique to manage the *KV-cache*, and efficiently swapping its parts to host memory. As vLLM is reported to totally outperform **Orca** [30], we omit Orca in our experiments.

**TensorRT-LLM**<sup>1</sup> is the newest generation of **FasterTransformer**<sup>2</sup>. Both systems are developed by NVIDIA with state-of-the-art performance optimizations that can best utilize the GPUs with hardware-specific tuning.

**FastLLM** <sup>3</sup> is an accelerated LLM serving system crafted by experts. It adopts a pure C++ implementation that targets on fast deployment, low latency, and high throughput with various accelerators.

A **Vanilla**<sup>4</sup> implementation of Llama [26, 27] is released with the model. The pure PyTorch [22] implementation and its derived versions are widely used in both academia and industry. It includes a simple *KV-cache* implementation on GPUs, and achieves competitive throughput thanks to the optimized PyTorch library.

# 6.2 Maximum Throughput

Figure 9 shows the measured throughput of all the systems. The number in brackets after *ours* indicates the batch size of FASTDECODE. The possible batch size is enormous in our system, because the distributed host memory is large enough for thousands of sequences. Increasing the batch size can increase the utilization of GPUs, and thus the overall throughput. However, as there may be constraint on the latency, the batch size should be set properly. Also, we observe that the performance gain of increasing batch size gets less when the



Figure 9: Token generating throughput

batch size is large enough. When the batch size increases by  $8 \times$  from 128 to 1024, we only get  $2 \times$  throughput.

Comparing with the most powerful baseline, vLLM, in the generation task of the 7b model, FASTDECODE achieves a maximum throughput of more than 2k tokens per second,  $4\times$  more than vLLM, and  $8.7\times$  more than TensorRT-LLM. For the 13b model, our maximum throughput is  $4.12\times$  the throughput of vLLM. Even when reducing the batch size to 128 for lower latency, we achieve  $2.32\times/1.88\times$  the throughput of vLLM.

When running vLLM, we observe that it can achieve the batch size of 1024 in the beginning, because the sequences are short, and the KV-cache can be all stored in GPU memory. However, as the sequences get longer, it finds less batching opportunity, and can only use a similar batch size with other GPU-only systems. TensorRT-LLM performs better than fastllm and vanilla because of its more efficient CUDA kernels. However, the maximum possible batch size of these systems is barely more than 16, limited by the GPU memory. Thus, they have much lower throughput than ours. The average throughput of our system is  $6.71 \times$  and  $6.04 \times$  the average throughput of all baseline systems, respectively.

### **6.3** Token Generating Latency

Figure 10 shows the measured latency to generate a new token by all the systems. The wide bar indicates the average latency between generating two adjacent token, and the three narrow bars show P = 0.01/0.5/0.99 latency, respectively.

When we maximize our batch size to target on highest throughput, the latency is about  $3.5 \times$  the latency when using  $8 \times$  smaller batch size. This also implies the GPU utilization improvement of increased batch size.

<sup>1</sup>https://github.com/NVIDIA/TensorRT-LLM

<sup>2</sup>https://github.com/NVIDIA/FasterTransformer

https://github.com/ztxz16/fastllm

<sup>4</sup>https://github.com/facebookresearch/llama



Figure 10: Token generating latency

TensorRT-LLM achieves the minimum average latency of 34.2 ms and 77.0 ms per token, respectively in the two models. Using a batch size of 128, the average latency of FASTDE-CODE is 120.8 ms and 191.6 ms. With at most  $2.5 \times$  larger latency of the 7b model, we have  $4.5 \times$  throughput. Potentially, given  $4 \times$  GPUs, we are able to retain the throughput improvement while reducing the per-token average latency to the same level as TensorRT-LLM.

The latency of vLLM is as low as other systems when generating most tokens, because it uses a similar small batch size in most steps. However, the average latency of vLLM is higher than all setups of FASTDECODE. This is because a few steps that swaps the *KV-cache* between host and GPU memory are significantly slow, a key bottleneck of all systems that offloads the *KV-cache*.

# 7 Performance Analysis

## 7.1 Coping with Heterogeneity

The curve of per-step latency in Figure 11 clearly shows the difference between FASTDECODE with or without the sequence-level load-stabilizing schedule (SLS).

As a baseline, the vanilla implementation runs both *R-Part* and *S-Part* on the same GPU. So, its latency grows linearly with the workload of *R-Part*, which is proportional to the length of the current sequences.

Without the load-stabilizing schedule, the latency first grows slightly. While most part of *S-Part* and *R-Part* are overlapped by the basic two-stage pipeline, the starting and ending overhead is exposed, and leads to the gentle increase of overall latency. Then, after the length of generated sequences exceeds a certain point, the latency grows sharply with time,



Figure 11: Latency at each step

because the latency becomes dominated by the increasing latency of *R-Part*, and the GPU gets underutilized.

After a cold start process of latency and low throughput due to smaller batch size, the sequence-level load-stabilizing schedule provides a stable latency at 66% - 70% the maximum latency without it. The sustainable throughput is increased by 8% - 11% by the technique. Overhead of the pipeline stops the system from achieving the ideal benefit of 50% maximum latency reduction and 20% throughput improvement indicated by Figure 6,

The smaller improvement on the 7b model compared with the 13b model is also caused by overload of the R-workers. Feature dimension h of the 13b model is larger than the 7b model. The workload of fully-connected layers in *S-Part* is proportional to  $O(h^2)$ , while it is O(h) in *R-Part*. Therefore, it is expected that *R-Part* have more workload than *S-Part* in smaller models.



Figure 12: Latency at each step for Llama-7b model with reduced sequence length

As indicated by Equation (11), the sequence length S is proportional to the required number of R-workers. To justify the performance model, as shown in Figure 12, we reduce the

length of generated sequences to 768. The latency of *S-Part* gets closer to the sustainable latency of the sequence-level load-stabilizing schedule, indicating more balanced workloads between the S-worker and the R-workers. The throughput improvement increases from 8% to 13%.

# 7.2 Scalability

Employing more CPUs is a basic requirement of FASTDE-CODE for certain workload. So, scalability of the R-workers is important to the overall efficiency. We use a fixed workload, generating tokens after 1024 sequences of length 1024 or 128. Each worker is bound to a socket. We evaluate the scalability of FASTDECODE on up to 8 sockets on 4 nodes.



Figure 13: Strong scalability of FASTDECODE

Figure 13 shows the strong scalability experiment results of FASTDECODE over the 7b and 13b model. When the length of sequences is 1024, FASTDECODE achieves 72.8% and 84.1% efficiency when scaling up from 1 socket to 8 sockets, on the 7b and 13b models, respectively. As the total latency is smaller in the 7b model, overhead of the pipeline is more significant, leading to lower efficiency with 8 sockets. When sequence is as short as 128, the efficiency is 37.6% for the 13b model. Using 8 sockets achieves even lower throughput than using 4 sockets with 75.9% efficiency. This is implied by our performance model. Shorter sequences require less R-workers. Employing more R-workers does not increase the performance when the S-worker is the bottleneck.



Figure 14: Using more workers in FASTDECODE

We show the ability of scaling up FASTDECODE by using more S-workers with model parallelism. In this case, we use the Opt-175b model which requires less R-workers to work with a S-worker, because the Opt model is larger, so less R-workers are needed, as Equation (11) suggests. As a baseline setting, we use one A10 GPU with two Epyc CPUs in a node. Both hardware are well utilized, while the R-workers are slightly overloaded. Figure 14 shows the results of introducing more hardware. When only using 2× CPUs as R-workers, the overall throughput is only slightly increased. For the bar on the right, we double the number of both R-workers and S-workers. The two S-workers work in model parallelism by partitioning all the parameter and activation tensors. FAST-DECODE achieves 1.84× throughput using double hardware.

# 7.3 Latency Break-down



Figure 15: Latency of two layers in a 13b model

To see the detailed utilization of different workers, we trace different operations of the workers, as shown in Figure 15. The R-workers are busy with computation in more than 75% of the time. The performance variance across nodes makes some of the workers wait for others.

Copying the QKV data from GPU to CPU takes 3 ms, during the iteration of 43 ms to generate a new token. Sending QKV across the network takes another 7.4 ms. In total, the distributed design of FASTDECODE introduces about 25% overhead to transmit the feature vectors. Notably, we change the asynchronous communication to synchronous mode for profiling. In production, the asynchronous communication can overlap part of the communication overhead.

The S-worker is actually working in less than 50% of time in the profiled case, due to overloading and performance variance of the R-workers. However, the overall throughput is

still competitive, as the efficiency to compute *S-Part* is significantly increased because of the much larger batch size.

## 8 Related Works and Discussion

**Optimizing the Attention Operator** For training LLMs, FlashAttention [5, 6] is a widely-used optimized fused attention operator. It achieves performance improvement by eliminating the need to store the memory-consuming intermediate *A* matrix. The idea is ported to the token generation scenario by FlashDecoding [9]. However, as *A* is a vector of much smaller size in decoding, it has less impact than that in training. These techniques can be ported to CPUs and accelerate our computation.

The idea of window attention [1], which is further extended in StreamingLLM [29], is a variation of the original attention algorithm that reduces the number of tokens to interact with for each new token. Our system benefits from these techniques in the same way as quantization [2,7,17] and pruning [13,25]. They reduce the workload of *R-Part*, while the user has to be are of the potential change of the model quality.

Speculative token generation [19] is currently the only approach that essentially increases the efficiency of attention operation, which depends on accurate prediction of tokens.

**Distributed and Heterogeneous LLM Serving** Typical model partitioning systems [11, 15, 33] can barely handle the token generation case where hardware and workload of the model are both heterogeneous. Beside offloading *KV-cache* to host memory [10, 12, 14, 24], peer GPUs [16] may also be the place to offload, despite the expense. In fact, FPGAs [4, 31] may be a better choice to store and process *KV-cache*.

The idea of this paper can be generalized as using heterogeneous hardware for the different parts of LLMs for better efficiency of the whole system. Beside CPU, possible selection of hardware for the memory-intensive part includes cheaper GPUs, FPGAs, and domain-specific chips. A memory pool directly connected to the GPU by CXL [8] would be a even more reasonable option for the *KV-cache*. Among the various possibilities, our approach that unleashes the power of CPUs for *R-Part* is an immediately feasible approach that uses existing hardware with high affordability.

### 9 Conclusion

In this paper, we propose FASTDECODE, a system that achieves high throughput of generating tokens with LLMs using affordable GPU resources. Different from typical solutions that fully use GPUs for computation, we decompose the model into two parts, and move both storage and computation of the memory-bound part to distributed out-of-chassis CPUs, utilizing their aggregated compute power. Performance challenges brought by heterogeneity in both temporally varying workload and hardware are addressed by a sequence-level

load-stabilizing schedule and a performance model. Finally, as the GPU is better utilized thanks to greatly enlarged batch size, and the overall throughput is competitive.

#### References

- [1] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. *CoRR*, abs/2004.05150, 2020.
- [2] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the challenges of efficient transformer quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7947–7969. Association for Computational Linguistics, 2021.
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
- [4] Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, and Zhiru Zhang. Understanding the potential of fpga-based spatial acceleration for large language model inference, 2023.
- [5] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. *CoRR*, abs/2307.08691, 2023.
- [6] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 December 9, 2022, 2022.
- [7] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: accurate post-training quantization for generative pre-trained transformers. *CoRR*, abs/2210.17323, 2022.

- [8] Donghyun Gouk, Sangwon Lee, Miryeong Kwon, and Myoungsoo Jung. Direct access, high-performance memory disaggregation with directcxl. In 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 287– 294. USENIX Association, 2022.
- [9] Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model inference on gpus. *CoRR*, abs/2311.01282, 2023.
- [10] Chien-Chin Huang, Gu Jin, and Jinyang Li. Swapadvisor: Pushing deep learning beyond the GPU memory limit via smart swapping. In ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, pages 1341–1355. ACM, 2020.
- [11] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model parallelism for deep neural networks. In *Proceedings of Machine Learning and Systems 2019*, *MLSys 2019*, *Stanford*, *CA*, *USA*, *March 31 April 2*, 2019. mlsys.org, 2019.
- [12] Jaehoon Jung, Jinpyo Kim, and Jaejin Lee. Deepum: Tensor migration and prefetching in unified memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2023, Vancouver, BC, Canada, March 25-29, 2023, pages 207–221. ACM, 2023.
- [13] Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
- [14] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In *Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023*, pages 611–626. ACM, 2023.
- [15] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Alpaserve: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2023, Boston, MA, USA, July 10-12, 2023, pages 663–679. USENIX Association, 2023.

- [16] Bin Lin, Tao Peng, Chen Zhang, Minmin Sun, Lanbo Li, Hanyu Zhao, Wencong Xiao, Qi Xu, Xiafei Qiu, Shen Li, Zhigang Ji, Yong Li, and Wei Lin. Infinite-Ilm: Efficient Ilm service for long context with distattention and distributed kycache, 2024.
- [17] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: activation-aware weight quantization for LLM compression and acceleration. *CoRR*, abs/2306.00978, 2023.
- [18] Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi Tang, Tianyu Zheng, Junyang Lin, Guanyu Feng, Zeqiang Huang, Jie Gao, Aohan Zeng, Jianwei Zhang, Runxin Zhong, Tianhui Shi, Sha Liu, Weimin Zheng, Jie Tang, Hongxia Yang, Xin Liu, Jidong Zhai, and Wenguang Chen. Bagualu: targeting brain scale pretrained models with over 37 million cores. In PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2 6, 2022, pages 192–204. ACM, 2022.
- [19] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative LLM serving with speculative inference and token tree verification. *CoRR*, abs/2305.09781, 2023.
- [20] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on GPU clusters using megatron-lm. In *International Conference* for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 58. ACM, 2021.
- [21] OpenAI. GPT-4 technical report. *CoRR*, abs/2303.08774, 2023.
- [22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024–8035, 2019.

- [23] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. *Proceedings of Machine Learning and Systems*, 5, 2023.
- [24] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: Highthroughput generative inference of large language models with a single GPU. In *International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA*, volume 202 of *Proceedings of Machine Learning Research*, pages 31094–31116. PMLR, 2023.
- [25] Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu, 2023.
- [26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
- [27] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranian Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023.
- [28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing

- Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
- [29] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. *CoRR*, abs/2309.17453, 2023.
- [30] Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 521–538. USENIX Association, 2022.
- [31] Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, and Yu Wang. Flightllm: Efficient large language model inference with a complete mapping flow on fpgas, 2024.
- [32] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.
- [33] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, and Ion Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 559–578. USENIX Association, 2022.