# Hardware and Software Co-design for Sparse Linear Algebra Operations Acceleration

Aleksey Tyurin

Saint Petersburg State University

Saint Petersburg, Russia

alekseytyurinspb@gmail.com

Daniil Berezun

Saint Petersburg State University

JetBrains Research

Saint Petersburg, Russia
daniil.berezun@jetbrains.com

Semyon Grigorev

Saint Petersburg State University

JetBrains Research

Saint Petersburg, Russia

s.v.grigoriev@spbu.ru

Abstract—In the era of big data, computations are expected to be faster and less power-consuming in order to become more effective and affordable.

#### Introduction

Linear algebra is a great instrument for solving a wide variety of problems utilizing matrices and vectors for data representation and analysis with the help of highly optimized routines. And whilst the matrices involved in a vast diversity of modern applications, e.g., recommender systems [1], [2] and graph analysis [3], [4], consist of a large number of elements, the major part of them are zeros. Such a high sparsity incurs both computational and storage inefficiencies, requiring an unnecessarily large storage, occupied by zero elements, and a large number of operations on zeroes, where the result is obviously known beforehand. The traditional approach to address these inefficiencies is to compress the matrix and store only the non-zero elements. Thus, the effect of matrices tending to be sparse in many applications makes the techniques of matrix compressed representation and sparse linear algebra to be the effective way of tackling problems in areas including but not limited to graph analysis [5], computational biology [6] and machine learning [7].

Sparse linear algebra defines primitives for expressing algorithms for the mentioned areas in a uniform way in terms of sparse matrix and vector operations parameterized by a semiring. Such uniform representation allows to tune the whole bunch of expressible algorithms through optimizing the primitives solely. One of the most used primitive is a sparse matrix-sparse matrix multiplication (spMspM) operation. It has finely-tuned implementations for both CPU and GPU, which, however, are proven to be underutilized due to the memory-bound nature of sparse computations induced by compressed representation [8]–[11]. Further, the pipeline of spMspM is patchy, which makes some of the computational units to be idle from time to time as it could be seen in figure 1, while the peak FLOPS is less than 1% of maximum available<sup>1</sup>.

This makes the typical CPUs and GPUs not well-suited hardware for sparse computations and gives a rise to specialized hardware accelerators, which are primarily concerned



Fig. 1. GPU's SM utilization for spMspM pipeline from cuSPARSE<sup>2</sup>

-- A, B, C, D are sparse matrices
-- M is a mask
D<M> = A eWiseAdd B eWiseMult C

Listing 1: Sequence of sparse operations example

with spMspM. However, for a sparse framework to be useful, it should incorporate not only spMspM, but also other sparse operations like in listing 1, where masking, which filters the matrix elements, and element-wise operations (possibly parameterized by a semiring) needed for, e.g., PageRank and bread-first search (BFS) algorithms [12] are presented. And when such operations are chained explicitly or implicitly, via a loop body, certain optimizations could be applied, like the one that eliminates intermediate matrices in sequence from listting 1. Unfortunately, some of such optimizations (e.g., the one mentioned) are only expressible at software level, i.e., in programming language, hence modern spMspM accelerators could be impractical for accelerating the whole program representing a linear algebra based algorithm like PageRank or BFS, due to the lack of a software part and to a too narrow hardware specialization. Thus a co-design of dedicated hardware and software components, i.e., domain-

<sup>&</sup>lt;sup>1</sup>https://hanlab.mit.edu/projects/sparch/ (Accessed 09.02.2021)

<sup>&</sup>lt;sup>2</sup>https://developer.nvidia.com/cusparse (Accessed 09.02.2021)

specific processor (DSP) and a corresponding domain-specific language (DSL), could provide a system which is not more effecient for spMspM than present hardware accelerators, but appear to be more effective in terms of speed and power consumption for holistic pieces of program, i.e., for chained operations, than current CPUs and GPUs implementations. The ongoing work is devoted to the design of respective DSL, DSP, and an optimizing compiler, and this work in particular gives a brief overview of the filed, discusses the ideas and challenges behind the design.

## I. PROBLEM STATEMENT

Since sparse linear algebra applications are mostly concerned with graph problems, some of the optimizations are graph-specific [12], [13], e.g., direction optimization. However, in memory-bound applications optimizations that minimize data transfer are essential. For example, *kernel fusion* is a wide addressed optimization that fuses multiple operations into one, avoiding intermediate memory accesses, utilizing, e.g., registers to pass the data between the operations. In the case of sparse linear algebra frameworks kernel fusion is not yet widely implemented, but most often related to fusing chained operations like from listing 1 to avoid intermediate matrices construction and reduce memory accesses with masking [12].

The problem of intermediate data structures is common for functional programming and there have been developed a number of optimization techniques that try to reduce intermediate data structures or computations, namely partial evaluation [14], deforestation [15], supercompilation [16], and distillation [17].

These fusion family optimizations are implementationdependent, meaning that the optimizer should have an access to the source code, which is impossible when the function is implemented solely in hardware, hence a software part is inevitable in the design of a systems that tries to put together fusion and hardware. Thus, present hardware accelerators are not general enough to implement and accelerate a whole sparse linear algebra-based program, e.g., do not provide arbitrary semiring support, and lack a software part hence leaving out essential optimizations. General purpose devices such as GPUs are hard to perform some optimizations, e.g., fusion requires two GPU kernels to have the same memory access pattern and partial evaluation could induce some thread divergence due to SIMD nature of a GPU. Further, present systems that support fusion are domain-specific<sup>3</sup>, or perform the optimization on top of data structures that may not be suitable for sparse operations (mainly for effective compressed representation), e.g., streams and lists [18], [19], while array-based fusion systems does not perform fusion with index arithmetic [20]. Finally, some data needed for optimizations [14] is only available in runtime, thus there should be support for JIT optimizations. The design of the proposed hardware-software system should take these challenges into consideration. Next, we discuss hardware accelerators from adjacent areas and elaborate about the software optimization we want to exploit.

## II. GRAPH PROCESSORS

Sparse linear algebra is often concerned with graph problems, and since graph processing is mature enough to be considered as a distinct computation paradigm, there have been emerged a number of graph processors [21] to mitigate the issues of graph processing on general-purpose processors. However, many proposed accelerator models are not generic, in the sense they are optimized for specific classes of graph algorithms and lack scalability. Further, sparse linear algebra makes graph computations more amenable to effective parallelization, ease scalability and even spreads beyond graph computations [6]. Presently, there is no sparse linear algebrabased instruction set processor [10], while other sparse linear algebra accelerators [11], [22], [23] are applicable only to certain operations like spMspM. Notably, in [24] the way of compressed representation of a sparse matrix is identified as the main bottleneck and an approach for hardware acceleration for compressed representation indexing is proposed, which is able to employ existing optimizations and could be integrated into existing software frameworks. Hence, the underlying hardware is better to support the indexing needed by a compressed representation of choice to be effective. However, we believe that to fully leverage fusion the indexing should be transparent, i.e., constructive, in the software, which is not the case in [24].

## III. PARTIAL EVALUATION AND SUPERCOMPILATION

Since intermediate data structures are incurred by the concept of functional programming, a lot of techniques have been developed to reduce the number of such structures and the number of computations in general. Supercompilation [16] could be thought as a combination of many such optimizations [14], [15], [17]. In particular, partial evaluation [14] stands for precomputing the parts of a program that depend on statically known data. interestingly, some data may become static in runtime, e.g., it could remain fixed between different iterations of a loop, thus it may be profitable to perform such optimization even in runtime. Despite such optimizations are common in the world of functional programming, they are hard to perform in case of imperative languages widely used in high performance computing. However, a fully working supercompiler at the moment is developed only for a small functional language, since being too hard to implement for a rich language like Haksell. But we believe that such a small language is enough to provide a DSL for sparse linear algebra. Also, such optimization may produce a burden of case statements, which slow down the hardware that exploits SIMD model, hence they could not be executed in parallel. Finally, fusion performed, e.g., by a supercompiler heavily depends on the compressed representation, i.e. presently quad tree representation seems to fit more for fusion than, e.g., coordinate format. Some operations could not be fused in principle, e.g. sorting, but the need for some of such operation could be induced by the compressed representation once again, e.g., quad tree representation does not require sorting, so it is reasonable to implement non-fusable operations in hardware.

<sup>&</sup>lt;sup>3</sup>https://www.tensorflow.org/xla (Accessed 09.02.2021)

## IV. FUNCTIONAL LANGUAGE PROCESSORS

The functional DSL is not of much use until effectively compiled for DSP. There are a number of works that aim to add a hardware support for functional programming [25]-[27]. Where [25] leverages the parallelization of independent function calls, which do not appear much after supercompilation True or not? More likely False, and also provides an optimized support for lists and map operation specifically, which may not be the desired property due to the compressed representation of choice. Further, the general-purpose nature of such processors could hurt sparse operations performance due to non-specialized cache. They also focus on hardware support for graph reduction via parallelization, specialized memory, and pipelining, thus even more parallelism could be extracted from compressed representation. However, present functional processors could be too complex to integrate with, e.g., to add hardware support for indexing or another computation that cannot be optimized effectively in software. Instead, we plan to leverage the approach of dataflow representation of functional programs with indirect memory accesses [28], which then could be effectively represented in hardware in a highly parallel way. The approach currently generates applicationspecific RTL code, but we hope to adapt it for dataflow processor architecture, e.g., transport-triggered-architecture (TTA), in order to have highly scalable and parallel processor with hardware support for a compressed representation, able to run functional programs and easily extensible with custom operations.

## V. ROADMAP

Library + language + compiler + hardware co-design. Integration with application level.

Runtime code generation, staged compilation. Allows one to exploit dynamic data as static. For example, first round of supercompilation in compile time to fuse functions, and the second round in running time to specialize on some data.

#### REFERENCES

- U. Gupta, C.-J. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, K. Hazelwood, B. Jia, H.-H. S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, and X. Zhang, "The architectural implications of facebook's dnn-based personalized recommendation," 2020.
- [2] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: Item-to-item collaborative filtering," *IEEE Internet Computing*, vol. 7, p. 76–80, Jan. 2003.
- [3] M. Besta, F. Marending, E. Solomonik, and T. Hoefler, "Slimsell: A vectorizable graph representation for breadth-first search," in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 32–41, 2017.
- [4] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," *Computer Networks and ISDN Systems*, vol. 30, no. 1, pp. 107 – 117, 1998. Proceedings of the Seventh International World Wide Web Conference.
- [5] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. USA: Society for Industrial and Applied Mathematics, 2011.
- [6] O. Selvitopi, S. Ekanayake, G. Guidi, G. Pavlopoulos, A. Azad, and A. Buluc, "Distributed many-to-many protein sequence alignment using sparse matrices," 2020.
- [7] J. Kepner, M. Kumar, J. Moreira, P. Pattnaik, M. Serrano, and H. Tufo, "Enabling massive deep neural networks with the graphblas," 2017 IEEE High Performance Extreme Computing Conference (HPEC), Sep 2017.

- [8] T. A. Davis and Y. Hu, "The university of florida sparse matrix collection," ACM Trans. Math. Softw., vol. 38, Dec. 2011.
- [9] J. Leskovec and R. Sosic, "Snap: A general purpose network analysis and graph mining library," 2016.
- [10] W. S. Song, V. Gleyzer, A. Lomakin, and J. Kepner, "Novel graph processor architecture, prototype system, and results," 2016 IEEE High Performance Extreme Computing Conference (HPEC), Sep 2016.
- [11] Z. Zhang, H. Wang, S. Han, and W. J. Dally, "Sparch: Efficient architecture for sparse matrix multiplication," in 26th IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020.
- [12] C. Yang, A. Buluc, and J. D. Owens, "Graphblast: A high-performance linear algebra-based graph framework on the gpu," 2020.
- [13] Y. Zhang, M. Yang, R. Baghdadi, S. Kamil, J. Shun, and S. Amarasinghe, "Graphit: A high-performance graph dsl," *Proc. ACM Program. Lang.*, vol. 2, Oct. 2018.
- [14] N. D. Jones, C. K. Gomard, and P. Sestoft, Partial Evaluation and Automatic Program Generation. USA: Prentice-Hall, Inc., 1993.
- [15] P. Wadler, "Deforestation: transforming programs to eliminate trees," Theoretical Computer Science, vol. 73, no. 2, pp. 231–248, 1990.
- [16] M. Sørensen, R. Glück, and N. Jones, "A positive supercompiler," Journal of Functional Programming, vol. 6, pp. 811 – 838, 11 1996.
- [17] G. Hamilton, "Extracting the essence of distillation," pp. 151–164, 06 2009.
- [18] O. Kiselyov, A. Biboudis, N. Palladinos, and Y. Smaragdakis, "Stream fusion, to completeness," SIGPLAN Not., vol. 52, p. 285–299, Jan. 2017.
- [19] D. Coutts, R. Leshchinskiy, and D. Stewart, "Stream fusion. from lists to streams to nothing at all," vol. 42, pp. 315–326, 09 2007.
- [20] T. Henriksen, N. G. W. Serup, M. Elsman, F. Henglein, and C. E. Oancea, "Futhark: Purely functional gpu-programming with nested parallelism and in-place array updates," SIGPLAN Not., vol. 52, p. 556–571, June 2017.
- [21] Y. Horawalavithana, "On the design of an efficient hardware accelerator for large scale graph analytics," 2016.
- [22] M. Soltaniyeh, R. P. Martin, and S. Nagarakatte, "Synergistic cpu-fpga acceleration of sparse linear algebra," 2020.
- [23] S. Pal, J. Beaumont, D. Park, A. Amarnath, S. Feng, C. Chakrabarti, H. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, "Outerspace: An outer product based sparse matrix multiplication accelerator," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 724–736, 2018.
- [24] K. Kanellopoulos, N. Vijaykumar, C. Giannoula, R. Azizi, S. Koppula, N. M. Ghiasi, T. Shahroodi, J. G. Luna, and O. Mutlu, "Smash: Codesigning software compression and hardware-accelerated indexing for efficient sparse matrix operations," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO '52, (New York, NY, USA), p. 600–614, Association for Computing Machinery, 2019.
- [25] R. Coelho, F. Tanus, A. Moreira, and G. Nazar, "Acqua: A parallel accelerator architecture for pure functional programs," in 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 346–351, 2020.
- [26] M. Naylor and C. Runciman, "The reduceron: Widening the von neumann bottleneck for graph reduction using an fpga," in *Implementation and Application of Functional Languages* (O. Chitil, Z. Horváth, and V. Zsók, eds.), (Berlin, Heidelberg), pp. 129–146, Springer Berlin Heidelberg, 2008.
- [27] A. Boeijink, P. K. F. Hölzenspies, and J. Kuper, "Introducing the pilgrim: A processor for executing lazy functional languages," in *Implementation and Application of Functional Languages* (J. Hage and M. T. Morazán, eds.), (Berlin, Heidelberg), pp. 54–71, Springer Berlin Heidelberg, 2011.
- [28] R. Townsend, "Compiling irregular software to specialized hardware," 2019