# Scaling up HBM Efficiency of Top-K SpMV for Approximate Embedding Similarity on FPGAs

Alberto Parravicini\*, Luca Giuseppe Cellamare<sup>†</sup>, Marco Siracusa<sup>†</sup> Marco D. Santambrogio\* Politecnico di Milano, DEIB, Milan, Italy

\*{alberto.parravicini, marco.santambrogio}@polimi.it, †{lucagiuseppe.cellamare, marco.siracusa}@mail.polimi.it

Abstract—Top-K SpMV is a key component of similarity-search on sparse embeddings. This sparse workload does not perform well on general-purpose NUMA systems that employ traditional caching strategies. Instead, modern FPGA accelerator cards have a few tricks up their sleeve. We introduce a Top-K SpMV FPGA design that leverages reduced precision and a novel packet-wise CSR matrix compression, enabling custom data layouts and delivering bandwidth efficiency often unreachable even in architectures with higher peak bandwidth. With HBM-based boards, we are 100x faster than a multi-threaded CPU implementation and 2x faster than a GPU with 20% higher bandwidth, with 14.2x higher power-efficiency.

Index Terms—FPGA, SpMV, Approximate Computing, Hardware Acceleration, HBM

#### I. Introduction

Information Retrieval (IR) and recommender system process an always-increasing amount of data, often with strong realtime constraint, to suggest products, movies, news articles to billions of users. Top-K Sparse matrix-vector multiplication (Top-K SpMV) is a key building block of high-performance similarity-search applications found in recommender systems that store items or documents as sparse embeddings, short numerical representations usually obtained through a neural network [1], [2]. Sparse embeddings guarantee low memory footprint and reduced execution time, often without decreasing accuracy, as they capture important information while filtering out noise [3]. Top-K SpMV computes the highest K values of the product between a sparse matrix (a matrix where only non-zero entries are stored) and a dense vector. In our case, it matches an input embedding against a collection of sparse embeddings and finds the K most similar ones (Figure 1).

The computation of Top-K SpMV is extremely memory intensive and presents sequential, indirect, and fully random memory accesses, making it unsuitable for general-purpose architectures with traditional caching policies. Improving its performance through a custom hardware design demands a theoretically sound approach to optimize the data-access strategy. FPGAs can fully leverage application-specific data representations thanks to reduced-precision arithmetic and a customizable memory subsystem. In agreement with the Roofline methodology [4], this approach is key to improving operational intensity and performance in memory-starved computations. Exact numerical accuracy for large values of K is not critical, as long as the most similar embeddings are retrieved: compared to CPUs or GPUs, FPGAs can leverage optimized arbitrary-precision arithmetic and offer fine-grained control over the



Fig. 1: Top-K SpMV between a sparse matrix A (in our case, a collection of sparse embeddings) and a dense vector  $\mathbf{x}$  (a dense embedding), with notation as in Section III.

desired target accuracy, providing better performance, lower resource utilization, and lower power consumption. To fully utilize off-chip bandwidth, we devise an approximation scheme based on matrix partitioning that enables a flexible multi-core design with a single HBM channel per core and URAM for caching high-reuse or random access data (section III-A). On an Alveo U280 accelerator card, our design accesses 32 HBM2 pseudo-channels with a total bandwidth of 460 GB/s. Then, we maximize bandwidth efficiency through a novel streaming sparse matrix format that leverages coalesced HBM accesses and reduced numerical precision, and is oblivious to the matrix non-zero entries distribution (section III-B). We propose a novel multi-core streaming FPGA design to approximate the computation of Top-K SpMV<sup>1</sup> and present the following contributions:

- A multi-core Top-K SpMV FPGA design that leverages a novel compressed matrix representation and algorithmic approximations (section III) and to effectively exploit the available HBM bandwidth (Section IV).
- A performance evaluation of our FPGA design, showing 100x faster execution versus a state-of-the-art CPU implementation. We are also 2x faster than a GPU with 20% higher bandwidth, with 14.2x higher power-efficiency and no significant loss of Top-K precision (Section V).

# II. RELATED WORK

To the best of our knowledge, no prior work optimizes the computation of Top-K SpMV on FPGA or GPU, although existing research covers approximation techniques on FPGA for dense matrix multiplications and deep learning [5], [6]. On

<sup>1</sup>Code is available at github.com/AlbertoParravicini/approximate-spmv-topk



Fig. 2: Simplified approximation scheme for Top-K SpMV. No errors occur if all partitions contain less than k Top-K values.

CPUs, sparse\_dot\_topn is a C++ multi-threaded implementation that employs the Compressed Sparse Row (CSR) format and is specialized for sparse embeddings and documents similarity [1]. That said, the CPU performance in this computation is inherently held back by their limited memory bandwidth and the inability to consistently perform quick random accesses, as there are no guarantees that requested values have not been evicted from cache. Although no existing work specifically optimizes Top-K SpMV on GPU, there exist multiple highlyoptimized implementations of SpMV [7]-[10]. While modern GPUs provide higher memory bandwidth than FPGA accelerator cards (from 549 GB/s of the Nvidia P100 to the 1555 GB/s of the Nvidia A100), existing SpMV implementations are often unable to utilize the available bandwidth fully [11]. Moreover, optimizations for reduced precision data-types are currently limited to half-precision floating-point, with no support for reduced-precision fixed-point arithmetic [12], [13] although recent work has explored heuristics to mix single and double precision floating-point arithmetic [14].

Existing research on SpMV optimization on FPGA focuses on sparse matrix compression [15], moving the bottleneck from memory accesses to the input data decompression. Umuroglu et al. [16] maximize the time in which values are kept in a fast local cache hierarchy. The efficient use of HBM on FPGA also has received the academic community's attention. Sadi et al. [17] propose an SpMV FPGA implementation that achieves significant speedup leveraging HBM and a data-compression scheme to reduce off-chip traffic. However, directly employing off-the-shelf SpMV FPGA designs (e.g. [18]) is undesirable: it would waste memory to store for the entire output vector (which can contain millions of values, Section III) and demand unnecessary computation to sort the output afterward.

## III. THEORETICAL CONTRIBUTIONS

Two key theoretical contributions enable our FPGA hardware design. First, we introduce a partitioning scheme that splits and approximates Top-K SpMV on independent cores, giving access to the entire HBM bandwidth (Section III-A). Then, we present Block-Streaming CSR (BS-CSR), a sparse matrix format optimized for streaming coalesced memory transactions and reduced-precision arithmetic (Section III-B).

Given a sparse matrix  $\mathbf{A}$  of size  $N \times M$ , and a dense vector  $\mathbf{x}$  of size  $M \times 1$ , the result  $\mathbf{y}$  of the SpMV  $\mathbf{y} = \mathbf{A}\mathbf{x}$  is a vector  $N \times 1$ . Top-K SpMV does not track the entire  $\mathbf{y}$ , but only its highest K values (and the corresponding indices). If  $\mathbf{A}$ 

TABLE I: Estimated precision of Top-K indices for increasing number of partitions. We average results of 1000 tests.

| Number of    | Number of         | Top-K Value, with $k = 8$ |    |       |       |       |       |  |
|--------------|-------------------|---------------------------|----|-------|-------|-------|-------|--|
| matrix rows  | partitions        | 8                         | 16 | 32    | 50    | 75    | 100   |  |
|              | c = 16            | 1                         | 1  | 0.999 | 0.998 | 0.983 | 0.942 |  |
| $N = 10^{6}$ | c = 28            | 1                         | 1  | 1     | 0.999 | 0.999 | 0.996 |  |
|              | c = 32            | 1                         | 1  | 1     | 0.999 | 0.999 | 0.997 |  |
| $N = 10^7$   | c = 16            | 1                         | 1  | 1     | 0.999 | 0.986 | 0.947 |  |
|              | c = 28            | 1                         | 1  | 1     | 0.999 | 0.999 | 0.995 |  |
|              | $\mathbf{c} = 32$ | 1                         | 1  | 1     | 0.999 | 0.998 | 0.998 |  |

and  $\mathbf{x}$  are L2-normalized embeddings, Top-K SpMV retrieves the rows of  $\mathbf{A}$  with highest cosine similarity to  $\mathbf{x}$ . In our task, N is the size of the embedding collection (usually millions of entries), while M, the dense embedding size, has only a few hundred values [19].  $\mathbf{x}$  is stored in URAM to perform random accesses in a single clock cycle, which is infeasible with an unconstrained M. Similarly, we track just the Top-K entries of  $\mathbf{y}$  on a LUT scratchpad instead of performing data-dependent write-backs on HBM and sharing bandwidth with read-channels, which would undermine the maximum memory throughput.  $\mathbf{x}$  is dense as many sparsification algorithms operate on dense matrices (e.g. batches of  $\mathbf{A}$ ) and cannot efficiently sparsify a single vector [20], [21].

# A. Top-K SPMV Approximation

In Top-K similarity-search applications, it is often acceptable to trade maximum accuracy on large values of K for lower execution time and power consumption. Instead of computing the total Top-K values on the matrix A, we partition it in c submatrices with N/c rows each. Each sub-matrix is processed by an independent FPGA core, which computes the top  $\mathbf{k} < K$ results for the partition (with  $\mathbf{k} \cdot \mathbf{c} > K$ ). We obtain  $\mathbf{k} \cdot \mathbf{c}$  results, which represents an approximation of the original Top-K (Figure 2). As c increases, so does the approximation accuracy. As we always retrieve the top k values, the approximation does not affect the best-ranked rows. Equation (1) presents a closed-form expression of the expected value of correctly retrieved Top-K values (i.e. the precision); intuitively, Equation (1) compares the number of permutations with more than k Top-K values in a partition with the total number of Top-K permutations. We estimate  $\mathbb{E}[P]$  through a Monte Carlo simulation for different K and number of partitions, with K from 8 to 100, common thresholds in IR [22]. Having at least 16 partitions guarantees a minimal loss of precision, even for large matrices.

$$\mathbb{E}[P] \approx \frac{1}{K-1} \sum_{K_i=1}^{K} 1 - \frac{\mathbf{c} \cdot \sum_{\mathbf{k}_i = \mathbf{k}+1}^{\min(K_i, \lfloor N/\mathbf{c} \rfloor)} {\binom{\lfloor N/\mathbf{c} \rfloor}{\mathbf{k}_i}}}{\binom{N}{K_i}}$$
(1)

## B. The Block-Streaming CSR Matrix Layout

The matrix **A** is stored in a *sparse format* to save only non-zero values and lower its memory footprint. Different storage techniques are possible, depending on the non-zero elements distribution and the desired type of access patterns.



Fig. 3: Naïve COO allows only 5 non-zero entries in a 512-bits packet. Using fewer bits for y and val (assuming y <1024 and val stored as 20-bits fixed-point) permits only 3 additional non-zeros per packet. BS-CSR drastically reduces the space required for ptr, enabling 15 non-zeros per packet.

The common CSR format [23] is unsuitable for fully-pipelined streaming FPGA designs as it contains data dependencies to access matrix values, and hiding the memory controller latency is not trivial, given the lack of a built-in hardware prefetcher. Instead, the Coordinate (COO) layout uses three equally sized arrays to store, for each matrix entry, its two coordinates and the entry itself's value. COO allows streaming iterations with burst memory transactions on non-zero matrix entries to saturate bandwidth, as the architecture does not have to perform data-dependent memory accesses determined by the number of non-zero values of each row, as in the CSR format. COO allows coalesced memory transactions of multiple non-zero entries, but requires redundant storage of coordinate values, limiting the overall operational intensity.

We propose BS-CSR, a new sparse matrix storage format that combines the low memory footprint of CSR with the streaming properties of COO, and is enabled by reduced precision datatypes. As the architecture of the Alveo U280 HBM memory controllers incentivizes memory transactions of data packets (or blocks) between 256 and 512 bits [24], we build each packet in BS-CSR as an independent CSR partition (Figure 3): values of idx and val are identical to standard CSR, using less than 32 bits for each value if possible (in our case, 10 bits are enough for idx entries). The ptr vector tracks the cumulative nonzero entries with respect to the packet itself: each entry of ptr is just  $\lfloor log_2(B) \rfloor$  bits for a packet with B non-zero entries (4 bits for a packet B = 15), instead of 32 bits like in a naïve CSR or COO (as the number of rows is not bound). The new\_row bit tracks if the first row of a packet continues the last row of the previous packet. Missing rows are handled with placeholder 0 values, although rows are never fully empty in our application domain. BS-CSR does not provide an explicit reference to row indices: this is not a limitation in streaming algorithms such as SpMV and its variations, as we process the entire matrix sequentially and track the current rows by counting non-zero entries in each ptr packet, plus the new\_row bit. We can fit 2 to 3 times as many non-zero entries in each 512-bits packet and improve the operational intensity by the same amount. Although significant work has been made to optimize SpMV on matrices with irregular row density distribution [17], we avoid



Fig. 4: Block diagram of our multi-core Top-K SpMV, simplified for BS-CSR with 5 non-zero y and val entries per packet.

the problem entirely through a fully-streaming data format that is oblivious to the number of non-zero matrix entries per row. In spite of its simplicity, BS-CSR can be a life-saver in streaming memory-bound computations (Section V-C).

## IV. THE PROPOSED FPGA HARDWARE DESIGN

Our Top-K SpMV FPGA design offers multiple low-profile cores that operate independently. Each core processes a partition of the input matrix, as explained in Section III-A. To guarantee maximum flexibility in terms of the number of cores placed on the FPGA, each core uses a single HBM channel to read the input matrix and store its output at the very end of the computation. As cores read 512 bits per clock cycle (at 225 MHz) from their respective HBM memory port and memory transactions happen in continuous maximum length AXI4 bursts (256 beats), our FPGA design can theoretically operate at the maximum bandwidth offered by HBM, by coalescing transactions with BS-CSR packets containing multiple non-zero entries, and without expensive distributed memory controllers. Each core reads A in packets of size B. B ranges from 7 to 15, depending on the desired level of numerical precision and the embedding size: at worst we use 32 bits for idx and val, but realistic size bounds (e.g. idx < 1024) allow much greater coalescing and operational intensity (Figure 6).

## A. Leveraging URAM for Fast Random Access

The input vector  $\mathbf{x}$  is stored in URAM. Our core performs B random accesses per cycle on  $\mathbf{x}$ . As each URAM bank only has 2 read ports, we replicate  $\mathbf{x} \lceil B/2 \rceil$  times to allow all random accesses. This replication does not constitute a limitation of our design:  $\mathbf{x}$  represents a dense embedding whose size does not go beyond a few hundred values in real applications. With our implementation,  $\mathbf{x}$  can have size up to 80000 (assuming a worst-case scenario with 32 bits values, 32 cores, and 8 replicas of  $\mathbf{x}$  per core), given a URAM size of around 90MB.

#### B. Top-K SpMV Algorithm Design

The main Top-K SpMV algorithm (alg. 1) is implemented as a 4-stage pipelined data-flow computation, decoupling memory transfer and computation. Each stage processes the input matrix as packets of B elements. The first stage retrieves B entries from A per cycle from HBM and URAM and computes pointwise products. Values corresponding to the same matrix row are aggregated in the second stage. The third stage performs book-keeping for rows that span more than one packet and finds which rows have been fully processed in the current packet. The final stage updates the current Top-K values for each complete row if the row output value is in the Top-K. Our data-flow design completely hides the Top-K update cost, which would otherwise be an expensive part of the computation.

Each core retrieves the top  $\mathbf{k}=8$  values of each matrix partition. Higher  $\mathbf{k}$  results in lower clock speed due to RAW data-dependencies in the argmin computation, while lower  $\mathbf{k}$  decreases accuracy. It is unlikely that many rows are entirely contained in a single packet B: we avoid consumption of unnecessary logic by tracking results for at most  $\mathbf{r}$  rows per packet. In our experiments, using  $B/4 < \mathbf{r} < B/2$  provided resource savings up to 50% with no accuracy loss.

### C. Lower Precision, More Cores, Better Performance

Our hardware design processes  $\mathbf{c} \cdot B$  non-zeros per clock cycle. Maximizing this performance equation is not trivial, as increasing B increments FPGA resource consumption and possibly prevents the placement of 32 cores, a requirement to achieve full HBM bandwidth utilization. Indeed, c is a nonlinear function of k, r and B, with B depending from Mand from the number of bits V used for entries of val, as in  $B \cdot (\lceil log_2 B \rceil + \lceil log_2 M \rceil + V) + 1 = 512$ . M can be safely assumed in the order of thousands. Choosing V, k and rrequires a performance-accuracy trade-off: based on theoretical bounds in Table I and the experiments in Section V-D, we observe that k = 8 and V = 20 are sufficient to provide great accuracy. Combined with r between 4 and 8 we guarantee a 32-cores design that can exploit all HBM channels without significant accuracy loss. Table II summarizes characteristics and resource usage of the designs in our evaluation.

# V. EXPERIMENTAL EVALUATION

We employ a Xilinx Alveo U280 Accelerator Card equipped with 8 GB of HBM2 memory (460 GB/s of total bandwidth over 32 channels) and a xcu280-fsvh2892-2L-e FPGA whose available resources are reported in Table II. Our CPU baseline is sparse\_dot\_topn, a multi-threaded C++ implementation of Top-K SpMV [1], running on two Intel Xeon Gold 6248 and 384 GB of DRAM. We also compare our design against GPUs, using a Tesla P100 (549 GB/s of HBM bandwidth): although we are not aware of any GPU implementation of Top-K SpMV, we can combine a fast GPU implementation of SpMV (such as cuSPARSE [7]) with the GPU radix sort of Thrust [25], using both single and half-precision floating-point arithmetic. To provide a worst-case comparison, we also assume a zero-cost GPU sorting, as if cuSPARSE already retrieved Top-K values at no cost.

## Algorithm 1 Approximate BS-CSR Top-K SpMV

```
1: function TOP-K-SPMV(bscsr_matrix, vec)
         x_{s \ old} \leftarrow 0; \ last\_packet\_output \leftarrow 0
 2:
        for i \leftarrow 0..NNZ/B do \triangleright For each packet of the matrix
 3:
            \triangleright 1. Process BS-CSR in packets of size B
 4:
 5:
            for j \leftarrow 0..B do \triangleright All loops of size B are Unrolled
                bscsr\_packet \leftarrow bscsr\_matrix[i]
 6:
                x_{loc}[j] \leftarrow bscsr\_packet.x[j]
 7:
                val_{loc}[j] \leftarrow bscsr\_packet.val[j]
 8:
                res_{tmp}[j] \leftarrow val_{loc}[j] \cdot \mathbf{vec}[j][bscsr\_packet.y[j]]
 9:
            \triangleright 2. Aggregate partial res_{loc} values
10:
            for b1 \leftarrow 0..B do
                                           \triangleright x_{loc}[-1] = 0 by convention
11:
12:
                row_{curr} += (x_{loc}[b1] \neq x_{loc}[b1-1])
                for b2 \leftarrow x_{loc}[b1-1]..x_{loc}[b1] do
13:
                    res_{aqq}[b1] += res_{tmp}[b2]
14:
            row_{curr} += bscsr \ packet.new \ row - 1
15:
            > 3. Check if first value was split among packets
16:
                                                       ▶ Find finished rows
            for j \leftarrow 1..B do
17:
                rows_{fin}[j] \leftarrow x_{loc}[j-1] \neq 0
18:
            if bscsr\_packet.new\_row then
19:
                res_{aqq}[0] \leftarrow last\_packet\_output
20:
                rows_{fin}[0] \leftarrow true
21:
                              > This packet continues a previous row
22:
                \begin{array}{l} res_{agg}[1] \mathrel{+}{=} last\_packet\_output \\ res_{agg}[0] \leftarrow 0; \ rows_{fin}[0] \leftarrow false \end{array}
23:
24:
            ▶ 4. Update Top-K values
25:
26:
            for j \leftarrow 0..B do
                                            ▶ Process only finished rows
                if res_{agg}[j] \ge worst_{curr}[j] && rows_{fin}[j] then
27:
                    res_{loc}[j] \leftarrow res_{agg}[j]
res_{idx}[j] \leftarrow row_{curr} + j - 1
28:
29:
                argmin(res_{loc}, worst_{curr}[j], worst_{idx}[j])
30:
31:
            reset(res_{aqq})
```

We analyze 4 FPGA design: 32 bits unsigned fixed-point (Q1.31), 25 bits (Q1.24), 20 bits (Q1.19), and a 32-bit floating-point version (F32) (Table II). The number of HBM channels limits the maximum number of cores to 32, although we could easily place more cores given our design's low resource footprint. The CPU baseline uses floating-point arithmetic: our CPU does not support arbitrary reduced-precision, and simulated reduced-precision fixed-point resulted in lower performance. The experimental setup comprises 19 synthetic and real sparse embedding matrices (Table III), with different sizes and distributions. Sizes are reported using BS-CSR as in Figure 3: if stored as a naïve COO, they would take 3 times as much space. We simulate different non-zero distributions (uniform and left-skewed  $\Gamma$ ), with 20 or 40 average nonzero entries per row (a sparsity factor of 2-8%). The use of synthetic matrices provides full control over the desired datadistribution, as the non-zero distribution in real embedding matrices is influenced by the chosen sparsification procedure. To the best of our knowledge, no public data-set of sparse embedding with sizes comparable to ours (millions of rows) is available. Instead, we sparsify the GloVe [26] embedding

TABLE II: Resource usage, clock frequency and power consumption of our architecture. Best values in bold.

| Bit-width      | Cores | LUT     | FF      | BRAM | URAM | DSP  | Clock<br>(MHz) |      |
|----------------|-------|---------|---------|------|------|------|----------------|------|
| 20 bits        | 32    | 38%     | 35%     | 20%  | 33%  | 7%   | 253            | 34 W |
| 25 bits        | 32    | 38%     | 36%     | 20%  | 30%  | 11%  | 240            | 35 W |
| 32 bits        | 32    | 35%     | 33%     | 20%  | 27%  | 17%  | 249            | 35 W |
| 32 bits, float | 32    | 44%     | 37%     | 20%  | 26%  | 19%  | 204            | 45 W |
| Available      |       | 1097419 | 2180971 | 1812 | 960  | 9020 |                |      |

TABLE III: Matrices in the evaluation, with M=512 and 1024, and with memory occupation using BS-CSR as in Figure 3. *Distribution* controls the number of non-zeros per row.

| Distribution             | Rows             | Non-zeros (min-max)                                                                                                                     | Size (min-max, GB)                                                                                                      |
|--------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| Uniform                  | $1.0\cdot 10^7$  | $10^8 - 2 \cdot 10^8$ $2 \cdot 10^8 - 4 \cdot 10^8$ $3 \cdot 10^8 - 6 \cdot 10^8$                                                       | $0.4  \mathrm{GB} - 0.8  \mathrm{GB}$<br>$0.8  \mathrm{GB} - 1.7  \mathrm{GB}$<br>$1.2  \mathrm{GB} - 2.5  \mathrm{GB}$ |
| $\Gamma(k=3,\theta=4/3)$ | $1.0\cdot 10^7$  | $\begin{array}{c} 9.7 \cdot 10^7 - 1.97 \cdot 10^8 \\ 1.9 \cdot 10^8 - 3.95 \cdot 10^8 \\ 2.9 \cdot 10^8 - 5.92 \cdot 10^8 \end{array}$ | $0.4\mathrm{GB} - 0.8\mathrm{GB}$<br>$0.8\mathrm{GB} - 1.7\mathrm{GB}$<br>$1.2\mathrm{GB} - 2.5\mathrm{GB}$             |
| Sparsified GloVe         | $0.2 \cdot 10^7$ | $2.4 \cdot 10^7 - 4.6 \cdot 10^7$                                                                                                       | $0.1{\rm GB} - 0.3{\rm GB}$                                                                                             |

corpus with the technique in [21]. Our performance is mostly unaffected by the underlying data-distribution (Section V-A): as our FPGA design processes the matrix in a streaming fashion, density does not affect performance. We perform each test 30 times, with different random vertices  $\mathbf{x}$ .

#### A. Execution Time

For each design and matrix, we measure the speedup against CPU and GPU implementations and report results in Figure 5, for K = 100. While both GPU and FPGA implementations are significantly faster than the CPU baseline, our FPGA design achieves a 2x speedup over an idealized GPU Top-K SpMV implementation with zero-cost sorting, despite the 20% lower peak memory bandwidth. When accounting for sorting costs on the GPU, this speedup can be as large as 7x. We expect to provide competitive performance even against a GPU with significantly higher memory bandwidth (e.g. Tesla A100, 1.5TB/s), and always provide greater power-efficiency (Section V-B). Fixedpoint arithmetic guarantees higher speedups than floating-point, thanks to the pipelines' lower initiation interval. Reducedprecision enables packing more non-zeros per transfer (higher B), resulting in increased operational intensity and better performance. Our 32-cores design can find the Top-K values of a matrix with 10<sup>7</sup> rows and 200 million non-zero entries in less than 4 ms and is suitable for real-time applications.

#### B. Power Efficiency

Our FPGA hardware design consumes about 35W during execution, plus 40W for the host server, measured with an external power meter monitor. The CPU implementation consumes around 300W during execution, while the GPU requires 250W (plus 40W for the host). Our fixed-point FPGA design provides 400x higher Performance/Watt ratio than the CPU,



Fig. 5: Execution time speedup on Top-K SpMV of our FPGA design and GPU vs. CPU. We can process over 57 billion non-zeros per second, 2x more than the GPU SpMV.



Fig. 6: Roofline model of our Top-K SpMV architecture. (a) Operational intensity increase with BS-CSR. (b) Comparison of FPGA vs. CPU and GPU. Thanks to BS-CSR, we provide the highest operational intensity and the best performance.

and 14.2x compared to the idealized GPU implementation (7.7x when accounting for an equal host machine): we provide higher performance without any sacrifice of power efficiency.

## C. Roofline Model Analysis

From the Roofline Model in Figure 6, built following the methodology in [4], we observe how the performance (expressed as non-zeros per second) of our FPGA design scales linearly to the total bandwidth of the available HBM channels: this result is significant as it entails predictable performance when deploying our design on an FPGA board with fewer (or more) HBM channels. Most importantly, BS-CSR increases the operational intensity up to 3x compared to a naïve COO (B=15 vs. B=5), which immediately translates to an equivalent performance improvement. When compared against CPU and GPU, our FPGA design is the best in terms of operational intensity and bandwidth usage, both in absolute and percentage terms. Despite having access to 20% less bandwidth than the GPU and considering a Top-K SpMV GPU implementation with zero-cost sorting, we still provide the highest performance, thanks to the increased operational intensity of BS-CSR.

#### D. Approximation Accuracy Analysis

We compare the accuracy of our approximated Top-K SpMV with the exact CPU results and with the approximated results of the GPU with half-precision floating-point. We look at values of K from 8 to 100 and evaluate accuracy using common Recommender System evaluation metrics such as Precision,



Fig. 7: Top-K SpMV accuracy (higher is better) for different architectures and types of reduced-precision arithmetic.

Kendall's  $\tau$ , and NDCG [27]. Precision does not penalize out-of-order results; the other two metrics do. Our results are in line with the theoretical estimations in Table I, showing only a minor accuracy dip for large K. Even 20-bit fixed-point designs provide excellent accuracy across the board, with Precision above 97% (Figure 7). Moreover, 32-bits fixed-point designs provide accuracy above the half-precision floating-point GPU implementation, despite our algorithmic approximation.

#### VI. CONCLUSION AND FUTURE WORK

Top-K SpMV is a cornerstone of IR and recommender systems based on sparse embeddings similarity and must be computed guaranteeing real-time latencies and power efficiency. As these constraints are hardly met on general-purpose architectures, we presented a novel approximate FPGA multicore design for Top-K SpMV that leverages HBM, fixed-point arithmetic, and our new BS-CSR matrix layout to achieve high-performance and modularity over the FPGA resources. Our 32-cores 20-bit FPGA design achieves 100x the performance of a state-of-the-art CPU baseline and 2x the GPU performance with 20% higher bandwidth. We deliver real-time results on sparse matrices with hundreds of millions of values while keeping 14.2x higher power efficiency than the GPU.

Future work will focus on adaptive compressed matrix representations by reconfiguring the FPGA in terms of numerical precision to guarantee desired targets of accuracy or performance. We will also apply our design to smaller FPGA accelerator cards: with similar memory bandwidth, the computation can be cheaper and even more power-efficient, with no performance loss. Finally, we will study how to apply BS-CSR to other computations and possibly other architectures.

#### REFERENCES

- [1] 2017. [Online]. Available: https://github.com/ing-bank/sparse\_dot\_topn
- [2] M. Aumüller, E. Bernhardsson, and A. Faithfull, "Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms," 2018.
- [3] G. Berend, "Sparse coding of neural word embeddings for multilingual sequence labeling," *Transactions of the Association for Computational Linguistics*, vol. 5, pp. 247–261, 2017.

- [4] M. Siracusa, M. Rabozzi, E. Del Sozzo, L. Di Tucci, S. Williams, and M. D. Santambrogio, "A cad-based methodology to optimize hls code via the roofline model," in 2020 IEEE/ACM ICCAD, 2020.
- [5] H. Waris, C. Wang, W. Liu, and F. Lombardi, "Axsa: On the design of high-performance and power-efficient approximate systolic arrays for matrix multiplication," *Journal of Signal Processing Systems*, 8 2020.
- [6] E. Adams, S. Venkatachalam, and S. Ko, "Energy-efficient approximate mac unit," in 2019 IEEE ISCAS, 2019, pp. 1–4.
- [7] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, "Cusparse library," in GPU Technology Conference, 2010.
- [8] Y. Liu and B. Schmidt, "Lightspmv: Faster csr-based sparse matrix-vector multiplication on cuda-enabled gpus," in 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2015, pp. 82–89.
- [9] M. Steinberger, A. Derlery, R. Zayer, and H.-P. Seidel, "How naive is naive spmv on the gpu?" in 2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2016, pp. 1–8.
- [10] W. Liu and B. Vinter, "Csr5: An efficient storage format for crossplatform sparse matrix-vector multiplication," in *Proceedings of the 29th* ACM on International Conference on Supercomputing, 2015.
- [11] T. Nguyen, S. Williams, M. Siracusa, C. MacLean, and N. W. D. Doerfler, "The performance and energy efficiency potential of fpgas in scientific computing," in *IEEE/ACM PMBS*, 2020.
- [12] A. Abdelfattah, H. Anzt, E. G. Boman, E. Carson, T. Cojean, J. Dongarra, M. Gates, T. Grützmacher, N. J. Higham, S. Li et al., "A survey of numerical methods utilizing mixed precision arithmetic," arXiv preprint arXiv:2007.06674, 2020.
- [13] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing gpu tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers," in SC18. IEEE, 2018.
- [14] K. Ahmad, H. Sundar, and M. Hall, "Data-driven mixed precision sparse matrix vector multiplication for gpus," ACM Transactions on Architecture and Code Optimization (TACO), vol. 16, no. 4, pp. 1–24, 2019.
- [15] P. Grigoras, P. Burovskiy, E. Hung, and W. Luk, "Accelerating spmv on fpgas by compressing nonzero values," in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 2015, pp. 64–67.
- [16] Y. Umuroglu and M. Jahre, "A vector caching scheme for streaming fpga spmv accelerators," in *International Symposium on Applied Recon*figurable Computing. Springer, 2015, pp. 15–26.
- [17] F. Sadi, J. Sweeney, T. M. Low, J. C. Hoe, L. Pileggi, and F. Franchetti, "Efficient spmv operation for large and highly sparse matrices using scalable multi-way merge parallelization," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*. New York, NY, USA: Association for Computing Machinery, 2019, p. 347–358.
- [18] A. Parravicini, F. Sgherzi, and M. D. Santambrogio, "A reduced-precision streaming spmv architecture for personalized pagerank on fpga," in *Proceedings of the 26th Asia and South Pacific Design Automation Conference*, 2021, pp. 378–383.
  [19] Z. Yin and Y. Shen, "On the dimensionality of word embedding," in
- [19] Z. Yin and Y. Shen, "On the dimensionality of word embedding," in Advances in Neural Information Processing Systems, 2018, pp. 887–898.
- [20] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," *IEEE Transactions on informa*tion theory, vol. 53, no. 12, pp. 4655–4666, 2007.
- [21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in *Proceedings of the 26th annual international* conference on machine learning, 2009, pp. 689–696.
- [22] M. F. Dacrema, P. Cremonesi, and D. Jannach, "Are we really making much progress? a worrying analysis of recent neural recommendation approaches," in *Proceedings of the 13th ACM Conference on Recommender* Systems, 2019, pp. 101–109.
- [23] C. Yang, A. Buluç, and J. D. Owens, "Design principles for sparse matrix multiplication on the gpu," in *European Conference on Parallel Processing*. Springer, 2018, pp. 672–687.
- [24] Z. Wang, H. Huang, J. Zhang, and G. Alonso, "Shuhai: Benchmarking high bandwidth memory on fpgas," in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020, pp. 111–119.
- [25] N. Bell and J. Hoberock, "Thrust: A productivity-oriented library for cuda," in GPU computing gems Jade edition. Elsevier, 2012.
- [26] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in *Empirical Methods in Natural Language Processing (EMNLP)*, 2014, pp. 1532–1543.
- [27] G. Shani and A. Gunawardana, "Evaluating recommendation systems," in *Recommender systems handbook*. Springer, 2011, pp. 257–297.