# Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights

Shail Dave, Student Member, IEEE, Riyadh Baghdadi, Member, IEEE, Tony Nowatzki, Member, IEEE, Sasikanth Avancha, Member, IEEE, Aviral Shrivastava, Senior Member, IEEE, and Baoxin Li, Senior Member, IEEE

Abstract-Machine learning (ML) models are widely used in many domains including media processing and generation, computer vision, medical diagnosis, embedded systems, high-performance and scientific computing, and recommendation systems. For efficiently processing these computational- and memory-intensive applications, tensors of these overparameterized models are compressed by leveraging sparsity, size reduction, and quantization of tensors. Unstructured sparsity and tensors with varying dimensions yield irregular-shaped computation, communication, and memory access patterns; processing them on hardware accelerators in a conventional manner does not inherently leverage acceleration opportunities. This paper provides a comprehensive survey on how to efficiently execute sparse and irregular tensor computations of ML models on hardware accelerators. In particular, it discusses additional enhancement modules in architecture design and software support; categorizes different hardware designs and acceleration techniques and analyzes them in terms of hardware and execution costs; highlights further opportunities in terms of hardware/software/algorithm co-design optimizations and joint optimizations among described hardware and software enhancement modules. The takeaways from this paper include: understanding the key challenges in accelerating sparse, irregular-shaped, and quantized tensors; understanding enhancements in acceleration systems for supporting their efficient computations; analyzing trade-offs in opting for a specific type of design enhancement; understanding how to map and compile models with sparse tensors on the accelerators; understanding recent design trends for efficient accelerations and further opportunities.

Index Terms—Machine learning, deep neural networks, spatial architecture, dataflow, sparsity, compact models, model pruning, quantization, tensor decomposition, energy efficiency, hardware/software/algorithm co-design, compiler optimizations, reconfigurable computing, dimension reduction.

# I. INTRODUCTION

Machine learning (ML) models are becoming the way to implement the "intelligence" in computing systems. Different ML models are widely used in several important domains including computer vision (object classification [1]–[6] and detection [7]–[13]), media processing [14]–[22] and generation [23], [24], recommendation systems [25], [26], large-scale scientific computing [27], embedded systems [28], edge processing [29], and even for optimizing system executions [30], [31] and designing the hardware [32] and software systems [33]. These models are compute-intensive and their executions are often memory-bound. Domain-customized accelerators can significantly speed up their execution in an energy-efficient manner [34]–[38]. However, the computational and memory requirements for processing these models have surged drastically [39]. Moreover, ML models can be deeper and larger, which improves learning accuracy [3], [4], but significant redundancy may exist in these often overparameterized models [40]–[42]. Therefore, recent techniques for efficient learning and inference have proposed compressing tensors of ML models. Tensors are compressed by inducing and leveraging: (a) sparsity (zero values in tensors) [43]–[47], (b) quantization (precision lowering and value sharing) [43], [47]–[50], and (c) size reduction (tensor decomposition, dimension reduction, and shape reduction) [6], [51]–[56]. With significantly lowered computational, storage, and communication requirements, efficient processing of compressed tensors (sparse, size-reduced, and quantized) offers notable acceleration and energy efficiency opportunities [57]–[60].

Shail Dave, Aviral Shrivastava, and Baoxin Li are with the School of Computing Informatics and Decision Systems Engineering at Arizona State University, Tempe, AZ, 85281 USA (e-mail: shail.dave@asu.edu; aviral.shrivastava@asu.edu; baoxin.li@asu.edu).

Riyadh Baghdadi is with the Computer Science and Artificial Intelligence Laboratory at Massachusetts Institute of Technology, Cambridge, MA 02139 USA (e-mail: baghdadi@mit.edu).

Tony Nowatzki is with the School of Computer Science at University of California, Los Angeles, CA 90095 USA (e-mail: tjn@cs.ucla.edu).

Sasikanth Avancha is with the Parallel Computing Lab at Intel Labs, Bangalore, India (e-mail: sasikanth.avancha@intel.com).

Hardware accelerators can efficiently process tensor computations of ML models. In particular, coarse-grain spatial architectures are a common choice for hardware accelerator designs. They comprise an array of processing elements (PEs) with local registers/memory and shared memory. These accelerators feature interconnects like mesh or multicast for communicating data to PEs and for spatial reuse of the data, which reduces accesses to the memory hierarchy. With simple PE designs and effective spatial and temporal management of the data and computations, such architectures achieve high speedups and energy efficiency [34]–[36], [61].

However, special mechanisms are needed to exploit the acceleration benefits of tensor sparsity, size reduction, and quantization. This is because, while hardware accelerators for ML can process low-precision tensors, they cannot inherently benefit from sparsity of tensors [61], [62]. Without special support for sparse tensors, accelerators still fetch all the data, including zero values, from the memory and feed it into the PEs, thereby wasting execution time. Additionally, unstructured zeros in tensors make exploiting sparsity hard, since accelerators are conventionally designed for performing structured computations with regular memory access and communication patterns. The goal is to exploit all possible forms of sparsity to reduce computation and memory requirements while avoiding additional performance, power, and area overheads. Exploiting sparsity effectively depends on tailoring the data encoding and extraction, dataflow, memory banking structure, interconnect design, and write-back mechanisms. Further, exploiting sparsity requires new representations and enables new opportunities for hardware/software/algorithm co-design. In this survey, we discuss different accelerator designs that have leveraged sparsity of different tensors and different opportunities for performance gains and energy efficiency.

Tensor decomposition and dimension reduction yield tensors of varying sizes in different layers of various models [6], [63]. Dataflow mechanisms for executing layers of a model are usually optimized for a fixed set of commonly used layers (regular-shaped and symmetric dimensions), and hence they often become ill-suited for processing tensors with reduced dimensions or reduced shapes [63]. So, this survey describes characteristics of different layers of the models and how explorations of optimized dataflow mappings can effectively map sparse and irregular-shaped tensors. Further, tensors quantized with value sharing require additional support to look up a dictionary and obtain the quantized value. The survey also discusses such techniques and accelerators that leverage value similarity and support variable bit-widths of the tensors.

**Contributions:** This paper provides a comprehensive survey of different techniques for efficiently executing sparse and irregular tensor computations of compact ML models on hardware accelerators. It describes the required enhancements in the hardware architecture and the required software support. Specifically,

- For training and inference of different ML models, we summarize various sources of sparsity of the tensors.
- We highlight challenges in accelerating computations of sparse and irregular-shaped tensors (e.g., dot product, convolution, and matrix multiplication) on spatial-architecture-based hardware accelerators that execute with dataflow mechanisms.
- We present an overview of the accelerator system along with the different hardware/software modules for sparse and irregular computations, their interfacing, and the execution flow. We provide an in-depth discussion of the need for each module, the different design choices, and a qualitative analysis of those choices.
- We survey different accelerator systems and execution techniques for sparse tensors of ML models and provide taxonomies to categorize them based on the various hardware/software aspects of the designs.
- We analyze how various sparsity-levels and tensor shapes of different models impact the storage efficiency of different sparsity-encodings and the reuse of tensors.
- For designing these accelerator modules and overall accelerator system, we discuss recent trends and outline further opportunities for hardware/software/algorithm co-designs.

#### Paper organization:

- Section II provides a brief background on domain-specific ML models.
- Section III provides an overview of hardware accelerators for tensor computations of ML models.
- Section IV discusses the need for efficient execution of ML models by reducing computation and storage requirements.
- Section V discusses tensor compression techniques and opportunities offered by sparse, size-reduced, and quantized tensors due to significantly reduced computation, storage, and communication requirements.
- Section VI describes why efficient hardware acceleration of the sparse and irregular tensor computations requires a special support.
- Section VII provides an overview of the accelerator system design with enhanced architectural modules and software support for efficient sparse and irregular tensor computations. In-depth discussions of these modules follow through Sections VIII–XV. Section VII also lists the accelerators that have leveraged sparsity of different tensors of various models and various opportunities for performance gains and energy efficiency.
- Section VIII illustrates common sparse data encodings, analyzes their implications in terms of storage and coding overheads, and describes group-wise encoding of tensors.
- Section IX discusses various techniques to decode tensors and extract non-zeros for computations. It analyzes advantages and limitations of the centralized and in-PE indexing mechanisms and describes optimization opportunities.
- Section X provides an overview of managing non-coherent, multi-banked, global scratchpad and hiding of the memory access latency behind computations. It also discusses data reuse of the sparse tensors and cross-layer reuse opportunities due to intermediate output tensors.
- Section XI discusses common interconnect designs for inter-PE communication and distributing data from memory, along with their bandwidth requirements, spatial data reuse opportunities, and their configurability to support multiple dataflow mechanisms for execution.
- Section XII describes sparsity-aware dataflows and pipelined architecture of PEs, including scalar or vector function units, leveraging value similarity, and local memory management.
- Section XIII discusses sources of the inter-PE and intra-PE imbalance due to sparsity of tensors and their impact, followed by different software-directed balancing schemes and special hardware structures for dynamic balancing.
- Section XIV describes different write-back mechanisms for collecting data from PEs and assembling the data locally in PEs or on a central module. It also discusses data layout transformations and on-the-fly encoding of output tensors.
- Section XV discusses compiler support for targeting hardware accelerators including intermediate representations for deep learning models, compiler optimizations and their automation, and ISAs and code generation for accelerators.
- Section XVI describes recent trends and future directions in terms of developing tools and techniques for systematic exploration of hardware/software/algorithm co-designs.

- Section XVII discusses relevant surveys that describe additional details (domain-specific models, tensor compression techniques, etc.) and can be useful to non-expert readers.

# II. MACHINE LEARNING MODELS

This section provides a brief overview of different domain-specific ML models. The learning through ML models can be supervised (where labeled data is available), unsupervised (training samples are unlabeled), or semi-supervised. We refer non-expert readers to surveys [64], [65], and [66] for a detailed discussion on different learning approaches and on inference and training of various models. In our discussions through this survey, we focus on accelerating different neural networks that are commonly used in supervised learning.

Convolutional neural networks (CNNs) are commonly used for object classification and detection in image processing, video analysis, and autonomous vehicle systems; these models are usually trained through the supervised learning approach. CNNs mainly consist of many *convolution layers* (CONV) and a few *fully-connected* (FC) layers [1], [2], [4]. Early layers are convolution layers that capture low-level features (e.g., edges and corners in images) from the feature map tensors of images, which are used to construct high-level features (e.g., shapes) by subsequent layers. Finally, the classifier layers (aka FC layers) determine the type of the object [64].

**Sequence-to-sequence models**, including recurrent neural networks (RNNs), gated recurrent units (GRU), long short-term memory (LSTM) [14]–[17], [28], and their variants, are used in natural language processing and media processing tasks. These models essentially use unidirectional or bidirectional recurrent cells at their core with processing of *multi-layer perceptrons* (*MLP*) aka FC structures. Surveys [28], [64]–[66] provide additional details about various layers used by different *deep neural networks* (*DNNs*) [67].

Generative adversarial networks (GANs) [24] are used by media generation applications. GANs use generator and discriminator networks that consist of convolution layers.

Models for semantic segmentation and language translation often use encoder-decoder structures with convolutions and/or recurrent layers [12], [13], [27].

Graph neural networks (GNNs) and other graph learning models [68], [69] are becoming popular for a variety of applications such as text classification and translation, node classification and link predictions in large social graphs, etc. They employ machine learning models on graph structures to learn the graph properties and to infer unforeseen information. To achieve this objective, each node of the graph may contain an embedding feature vector that mixes information about its own features and those of its neighborhood. The nodes then recurrently aggregate the features of local neighbors, perform neural network computations on the aggregated data (e.g., MLP for down-scaling intermediate embedding vectors), and then update their own embeddings. Primarily, the compute-intensive operations of such models are matrix or vector multiplications.
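The aggregate, compute, and update steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the four-node graph, the two-element embeddings, the sum aggregation, and the single-layer MLP weights `W` are all hypothetical and are not taken from any surveyed model.

```python
# Hypothetical 4-node graph (adjacency list) with 2-element embeddings.
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
embeddings = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [3.0, 1.0], 3: [0.0, 0.0]}

# Assumed 2x2 weight matrix of a single-layer MLP (no bias).
W = [[1.0, -1.0],
     [0.5, 0.5]]

def aggregate(node):
    """Sum the node's own embedding with the embeddings of its neighbors."""
    acc = list(embeddings[node])
    for n in neighbors[node]:
        acc = [a + b for a, b in zip(acc, embeddings[n])]
    return acc

def mlp(vec):
    """Matrix-vector multiplication followed by ReLU."""
    out = [sum(w * x for w, x in zip(row, vec)) for row in W]
    return [max(0.0, v) for v in out]

def gnn_layer():
    """Compute the updated embedding of every node from the current embeddings."""
    return {node: mlp(aggregate(node)) for node in neighbors}

updated = gnn_layer()
```

In this sketch the matrix-vector product inside `mlp` is the compute-intensive part, while `aggregate` is where the irregular, graph-dependent memory accesses occur.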

Fig. 1. Executing applications on hardware accelerators for machine learning requires explicit management of computational, communication, and memory resources (Figure adopted from [74]).

**Recommendation system models** consist of embedding layers (look-ups and matrix operations), CNNs for object detection and video understanding, and RNNs for processing language models [25].

Thus, primitives like MLP (matrix-matrix or matrix-vector multiplication) and CONV are at the core of many ML models. Tensor computations of these primitives dominate the execution of several ML models. So, many ML frameworks and libraries, such as PyTorch [70], TensorFlow [71], and Intel MKL [72], provide optimized implementations of these primitives for executing their tensor computations on commodity hardware (CPUs, GPUs, and FPGAs) or even on specialized accelerators (e.g., with TVM [73]). Therefore, our discussions mainly focus on efficiently accelerating tensor computations of MLPs, CONVs, and RNNs.

# III. HARDWARE ACCELERATORS FOR MACHINE LEARNING

In the "new golden age of computer architecture", recent research efforts and commercial solutions have extensively demonstrated that domain-customized hardware accelerators significantly speed up the execution of ML models in an energy-efficient way [34]–[36], [75]–[82]. Typically, these specialized solutions feature spatial architectures, which are those that expose low-level aspects of the hardware's interconnect and storage to the hardware-software interface. Spatial architectures can be coarse-grained or fine-grained. Coarse-grained architectures feature arrays of interconnected PEs, and fine-grained designs are realized by programming FPGAs. Coarse-grained spatial architectures are a common implementation choice for designing hardware accelerators for ML [35]–[38], [83]–[85]. As Fig. 1 illustrates, these accelerators comprise an array of PEs that contain private register files (RFs) and shared buffers or a scratchpad memory. PEs are much simpler in design (function units with little local control), and the shared scratchpad is non-coherent with software-directed execution. Therefore, these accelerators are a few orders of magnitude more power-efficient than out-of-order CPU or GPU cores [34]–[36]. They lead to highly energy-efficient execution of ML models that are compute-intensive and memory-intensive. Since performance-critical tensor computations of ML models are relatively simple operations like element-wise or tensor additions and multiplications, they can be efficiently processed with structured computations on the PE array. Moreover, private and shared memories of PEs enable very high temporal reuse of the data, and with efficient data management, PEs can be continuously engaged in tensor computations while the data is communicated via memories [34]. Additionally, interconnects like mesh or multicast enable data communication among PEs and spatial reuse of the data, lowering the accesses to the memory hierarchy and off-chip memory. Thus, with minimized execution time, spatial-architecture-based hardware accelerators yield very high throughput and low latency for ML models.

Performance-critical layers of ML models are typically nested loops, which are compute-intensive and whose executions can be memory-bound [36], [86]. Spatial architectures can efficiently accelerate them with optimized dataflow mechanisms [83] (Section XII-B provides details), which require explicit management of the computational, communication, and memory resources. For example, as shown in Fig. 1, we need to determine: which PEs execute what subset of the computational graph and in which sequence, which chunk of data is accessed from memory, and when to communicate data among PEs via the interconnect and between the PE array and the memory hierarchy. During accelerator execution with a dataflow, such management is achieved by the control logic (instructions or state machines) in PEs and a central controller. However, for the explicit and efficient management of accelerator resources, the sequence of execution is determined and optimized beforehand [34]. For a given accelerator design, there exist different ways of spatiotemporal execution (execution methods [74]). They yield various computational, communication, and memory access patterns, thereby having a dramatic impact on energy consumption and execution time. The search space of optimizing execution methods for ML models is usually vast [87], [88]. Therefore, for automated and quick explorations of efficient accelerator designs and execution methods, accelerator system designers use analytical models [74], [80], [81], [89]. These analytical models can effectively capture costs associated with different computational and data movement patterns of various execution methods and help guide the mapping optimization and design space exploration. Tools like Timeloop [87], MAESTRO [89], dMazeRunner [88], and Interstellar [38] can achieve optimized mappings and efficient accelerator designs to execute DNNs.
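To make the "nested loops" concrete, the following minimal sketch writes a CONV layer as a six-deep loop nest and shows one of many possible ways to distribute work across PEs. The tensor layout, the stride of 1 without padding, and the round-robin helper `assign_output_channels` are illustrative assumptions, not the mapping used by any particular accelerator.

```python
def conv2d(ifmap, weights):
    """Direct convolution (stride 1, no padding) as a loop nest.

    ifmap:   [C][H][W]     input feature maps
    weights: [M][C][R][S]  M filters
    returns: ([M][E][F] output feature maps, number of MACs performed)
    """
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    M, R, S = len(weights), len(weights[0][0]), len(weights[0][0][0])
    E, F = H - R + 1, W - S + 1
    ofmap = [[[0.0] * F for _ in range(E)] for _ in range(M)]
    macs = 0
    for m in range(M):                  # output channels
        for e in range(E):              # output rows
            for f in range(F):          # output columns
                for c in range(C):      # input channels
                    for r in range(R):  # filter rows
                        for s in range(S):  # filter columns
                            ofmap[m][e][f] += ifmap[c][e + r][f + s] * weights[m][c][r][s]
                            macs += 1
    return ofmap, macs

def assign_output_channels(num_channels, num_pes):
    """Hypothetical spatial mapping: round-robin output channels over PEs."""
    return {pe: [m for m in range(num_channels) if m % num_pes == pe]
            for pe in range(num_pes)}

ifmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]          # C=1, H=W=3
weights = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]   # M=2, C=1, R=S=2
ofmap, macs = conv2d(ifmap, weights)
mapping = assign_output_channels(num_channels=5, num_pes=2)
```

Reordering or tiling these loops, and choosing which loop is spread over the PE array, changes the data reuse and communication pattern; this is the kind of choice that the mapping tools cited above explore.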

# IV. NEED FOR EVEN MORE EFFICIENT EXECUTION OF ML MODELS ON HARDWARE ACCELERATORS

Specialized hardware accelerators can significantly speed up computations of ML models in an energy-efficient manner [34]–[36], [61]. However, with recent advances in the development of ML models for various domains, the computational and memory requirements for processing these models have increased drastically [32], [39]. Fig. 2 provides an overview of this dramatic surge, showing that the computational requirements for training the ML models have almost doubled every few months [39]. One major reason for the increase in model size and computation requirements is the rise of deeper models. For example, for processing ImageNet images [90], AlexNet [2] featured five CONV and three FC layers (total eight parameter layers) with the model size of 61M parameters (weights and bias) and computation of 724 MFLOPs. VGG-19 [91] improved classification accuracy on the ImageNet dataset [90] further with 19 parameter layers, but it required processing 144M parameters (576 MB) per image. Similarly, deep CNN models like DenseNet-201 [92] and ResNet-101 [4] consisted of more than 200 and 100 parameter layers for achieving more than 77.5% top-1 classification accuracy on ImageNet, but required processing about 4 GFLOPs and 7.6 GFLOPs per image, respectively.

Fig. 2. Computation requirements for the training of AI algorithms almost double every few months (Figure adopted from [39]).

While deeper and larger models can improve the learning accuracy, previous studies have shown that significant data redundancy exists in these often over-parameterized models [40]–[42]. Therefore, researchers have recently focused on different techniques that obtain compact models after compressing the tensors, thereby significantly reducing computational and storage requirements.

# V. OPPORTUNITIES FOR EVEN MORE EFFICIENT EXECUTION OF ML MODELS

Even more efficient execution of ML models is achieved by drastically reducing computation and memory requirements. Various techniques provide such opportunities by compressing the tensors of ML models. Tensors are compressed by inducing and leveraging: (a) sparsity (zero values) [43]–[45], (b) size reduction (tensor decomposition, dimension reduction, and shape reduction) [6], [45], [51]–[56], and (c) quantization (precision lowering and value sharing) [43], [48]–[50], [93].

Prior compression techniques have demonstrated that they can successfully achieve highly compact models without incurring accuracy loss. For example, after applying pruning, quantization, and Huffman encoding, Deep Compression [43] reduced the model size of AlexNet and VGG-16 by 35× (from 240 MB to 6.9 MB) and 49× (from 552 MB to 11.3 MB), respectively. Similarly, accelerator-aware algorithms [58], [60] compress the model further. For AlexNet and GoogLeNet models, energy-aware pruning [60] pruned 91% and 66% of weights and reduced computational requirements by 6.63× and 3.43×, respectively. ADMM-NN [58] applied weight pruning and quantization, thereby reducing the model size of AlexNet, VGG-16, and ResNet-50 (with up to 0.2% accuracy loss) by 99×, 66.5× and 25.3×, respectively.
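As a quick sanity check of the figures quoted above, the compression ratios and model sizes can be recomputed with a few lines of arithmetic. The sketch assumes 32-bit (4-byte) parameters and 1 MB = 10^6 bytes; both are assumptions made here for illustration, and the helper names are hypothetical.

```python
def compression_ratio(original_mb, compressed_mb):
    """Ratio of the original model size to the compressed model size."""
    return original_mb / compressed_mb

def fp32_size_mb(num_params):
    """Model size in MB, assuming 4 bytes per parameter and 1 MB = 10**6 bytes."""
    return num_params * 4 / 10**6

alexnet_ratio = compression_ratio(240, 6.9)    # about 34.8, reported as 35x
vgg16_ratio = compression_ratio(552, 11.3)     # about 48.8, reported as 49x

alexnet_mb = fp32_size_mb(61_000_000)          # 244 MB, close to the quoted 240 MB
vgg19_mb = fp32_size_mb(144_000_000)           # 576 MB, as quoted for VGG-19
```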

This section describes various sources of tensor sparsity, which is either naturally present in the data (e.g., due to model architecture) or can be induced for model regularization. It describes how sparsity offers opportunities for eliminating ineffectual computations and for reducing storage and communication requirements. Then, it discusses techniques for reducing the size and quantization of tensors, and how they offer advantages in terms of storage, performance, and energy efficiency.

#### A. Opportunities Due to Sparse Tensors

Tensors of different ML models can be sparse due to multiple reasons:

- CNNs typically use the ReLU activation function [2], which clamps negative values to zero. Thus, feature maps can be on average more than 40% sparse in CNNs [94], with increased sparsity in later layers (up to about 70% [40], [94]–[96]). For instance, feature maps for CONV4 and CONV5 layers of VGG-16 [91] are more than 50% and 70% sparse, respectively [40], [94], [95]. Moreover, Cao et al. [94] reported that max-pooling can further amplify the sparsity of feature maps, e.g., up to 80% for VGG-16 layers. Lee et al. [97] showed that sparsity of feature maps can eliminate about 40% and 55% of the multiply-and-accumulate (MAC) operations during CNN training and inference, respectively.
- Neural networks use drop-out layers to avoid overfitting. After applying the drop-out on a layer, only partial activations are retained [42], [67], and the dropped-out activations are removed. This automatically yields sparse tensors [42], [98].
- Several pruning techniques, including [41], [99], [100], remove unimportant weights and alleviate the overfitting of the model while maintaining the classification accuracy. Typically, weights with the least significant values can be safely pruned [40], [43], [47] (in training or post-training). It is also observed that such weight pruning techniques bring regularity in the learning of the model and can often slightly increase accuracy [57], [101]. Pruning algorithms introduce significant sparsity, e.g., more than 60% weights of CONV and more than 90% of weights of FC layers can be removed [40]. Similarly, techniques can prune more than 80% weights of RNN, GRU, or LSTM networks [59], [102], [103], especially for medium or large models, without significantly increasing error rate. Besides, regularization of the models e.g., L1 or group-lasso based regularization can lead to unstructured or structured sparsity of weights [44], [67].

  Sparsity can be induced in a structure [44], [104]–[106] for hardware-friendly execution [57], [101], [107], [108], or can be fine-grained and exploited by the special hardware/software support [57], [61], [63], [96], [109], [110].

- Pruning of activations has also been shown effective [109], [111]–[114]. For example, DasNet [111] recently showed that for ImageNet classification with AlexNet and MobileNet models, activation sparsification eliminated about 27% and 12% MACs respectively. It eliminated about 79% activations of AlexNet FC layers along with pruning 81% weights, without dropping top-1 classification accuracy. Similarly, MASR [115] applied the refactoring of the batch normalization operation and achieved about 60% sparsity of activations for RNNs.

- Some CNNs use shift convolutions [116] or Atrous (dilated) convolutions [11], [117] which process sparse filters.
- GANs use transposed convolution (aka fractionally strided convolution [118]) in a generator network. In such networks, the input data is first transformed (expanded or upscaled) by inserting zeros between the input values in the tensors, and then a conventional convolution operator is applied. A recent study [119] has shown that such a transformation can yield about 60% zeros in the MAC operations of transposed convolution layers. Additional sparsity is introduced when GANs are forced to forget drawing/generating specific objects/features while generating the media [118].

- Input data for several object detection tasks can be inherently sparse, as only specific regions of image frames are valid [120]. For example, object detection models of autonomous driving systems process 3D LiDAR data by constructing point clouds and projecting them from the bird's eye view (top view) [121], [122]. The resultant images are then fed to object detection algorithms for locating the regions of interest. Recent techniques have demonstrated that the sparsity of the input data for object detection can be 80% or more [120], [121].
- Input data for the tasks of recommendation systems (e.g., user-item matrix) can be inherently very sparse, e.g., from 95% [26] to 99% [25].
- Graph neural networks process large graphs, e.g., with thousands of vertices. Depending on the real-world interactions of the objects (modeled as vertices of graphs), the data corresponding to these graphs can inherently consist of unstructured sparsity [123], [124]. For example, in processing large graphs with GCNs, many features of the vertices are local features that lead to zero values in adjacency matrices for remote nodes [124]. Geng et al. [124] showed that matrices of GCNs can be more than 90% sparse.
- Text corpus in text analytics applications leads to high sparsity since each document contains only a fraction of the words from the vocabulary. Such analytics applications include PCA for dimensionality reduction of the sparse data, support vector machines and regression for classification, collaborative filtering for the recommendation, and k-means for clustering the data [45]. These operations involve multiplications of sparse matrices with dense or sparse vectors, where the sparsity of the input matrix can vary from 67%–99% [45].
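Two of the sources listed above, ReLU and the zero insertion of transposed convolutions, are easy to reproduce on toy data. The sketch below is a minimal illustration; the input values, the stride of 2, and the helper names are assumptions made here. Note that it measures zeros in the tensors themselves, which is not the same quantity as the MAC-level figure reported in [119].

```python
def relu(values):
    """Clamp negative values to zero."""
    return [max(0.0, v) for v in values]

def sparsity(values):
    """Fraction of elements that are exactly zero."""
    flat = list(values)
    return sum(1 for v in flat if v == 0) / len(flat)

def zero_insert_2d(fmap, stride=2):
    """Insert (stride - 1) zeros between neighboring elements of a 2D map,
    as done to the input before the convolution in a transposed convolution."""
    h, w = len(fmap), len(fmap[0])
    out_h, out_w = (h - 1) * stride + 1, (w - 1) * stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(h):
        for j in range(w):
            out[i * stride][j * stride] = fmap[i][j]
    return out

# Hypothetical pre-activation values: half of them are negative.
pre_activations = [-1.5, 0.3, -0.2, 2.0, -0.7, 1.1, -3.0, 0.4]
relu_sparsity = sparsity(relu(pre_activations))

# A fully dense 2x2 map becomes a 3x3 map with five inserted zeros.
dense = [[1.0, 2.0], [3.0, 4.0]]
upscaled = zero_insert_2d(dense)
upscaled_sparsity = sparsity(v for row in upscaled for v in row)
```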

Sparsity allows (i) *eliminating ineffectual computations*, i.e., saving execution time and energy consumption by processing only non-zero (NZ) data, (ii) *reducing storage* by encoding tensors with only NZ values, which allows managing more data in on-chip memory and reduces DRAM accesses, which are extremely energy-consuming [38], [61], [83], [125], and (iii) improving speedup due to *reduced communication requirements* for memory-bound ML models.
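The first two benefits can be illustrated with a compressed encoding that stores only NZ values. The sketch below uses the compressed sparse row (CSR) format purely as an example; the choice of CSR, the toy matrix, and the helper names are assumptions made here and not a recommendation over other encodings.

```python
def to_csr(matrix):
    """Encode a 2D list as CSR: NZ values, their column indices, and row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product that touches only the stored NZ values."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

matrix = [[0, 0, 3, 0],
          [1, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 2, 0, 4]]   # 75% sparse
values, col_idx, row_ptr = to_csr(matrix)
y = csr_matvec(values, col_idx, row_ptr, [1, 2, 3, 4])

dense_macs = len(matrix) * len(matrix[0])   # MACs of a sparsity-unaware product
sparse_macs = len(values)                   # MACs when only NZs are processed
encoded_words = len(values) + len(col_idx) + len(row_ptr)
```

For this toy matrix the MAC count drops from 16 to 4, while the encoded form needs 13 words against 16 for the dense form. The gap between the two savings is the metadata overhead of the encoding, which is why the storage benefit depends on the sparsity level and the chosen format.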

#### B. Opportunities Due to Size-Reduced Tensors

Symmetric or high-dimensional tensors have large sizes, and processing them requires more computation (FLOPs) and memory (on-chip and off-chip storage). So, several ML models have been designed to reduce such requirements by using group or parallel operators [2], [126], [127], 1×1 convolutions [3], [53], [128], or dimensionality reduction with PCA [45], [54]. Moreover, tensors can be decomposed with spatial factorization [54], [129]–[131], depth-wise separation for convolutions [51], [52], [116], or low-rank approximations [54], [55]. Further, tensors can be ragged [56] to obviate the need for structured or rectangular shapes. While these transformations significantly reduce storage and computations, they make tensors asymmetric or irregular-shaped.
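The savings from two of these transformations can be estimated with simple parameter and MAC counts. The following sketch is a first-order model under stated assumptions: biases are ignored, the layer dimensions are hypothetical, and the low-rank case assumes a matrix of size rows × cols factored into two matrices of rank `rank`.

```python
def standard_conv_cost(c_in, c_out, k, out_h, out_w):
    """Parameters and MACs of a standard k x k convolution."""
    params = c_out * c_in * k * k
    return params, params * out_h * out_w

def depthwise_separable_cost(c_in, c_out, k, out_h, out_w):
    """Parameters and MACs of a depth-wise k x k convolution followed by a
    1 x 1 (point-wise) convolution."""
    depthwise = c_in * k * k     # one k x k filter per input channel
    pointwise = c_out * c_in     # 1 x 1 filters that mix channels
    params = depthwise + pointwise
    return params, params * out_h * out_w

def low_rank_params(rows, cols, rank):
    """Parameters of a (rows x rank) times (rank x cols) factorization."""
    return rank * (rows + cols)

# Hypothetical layer: 64 input channels, 128 output channels, 3x3 filters, 56x56 outputs.
std_params, std_macs = standard_conv_cost(64, 128, 3, 56, 56)
ds_params, ds_macs = depthwise_separable_cost(64, 128, 3, 56, 56)
reduction = std_params / ds_params          # roughly 8.4x fewer parameters and MACs

fc_dense = 1024 * 1024
fc_low_rank = low_rank_params(1024, 1024, 32)
```

The decomposed layers have very different shapes from the original one (per-channel k × k filters and 1 × 1 filters, or tall and wide factor matrices), which is the shape irregularity the paragraph above refers to.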

#### C. Opportunities Due to Quantized Tensors

Tensor quantization includes precision lowering [47]–[50] and leveraging value sharing or value similarity [43], [132]–[135]. Precision lowering allows representing tensors (weights, activations, gradients, weight updates) at much lower bit-width (e.g., 8b or lower for inference and 8b or 16b for learning). Moreover, tensor elements with similar values can be clustered and approximated by sharing common values (centroids of clusters). In general, significant data redundancy exists in the tensor elements (particularly in the parameters of large models), and a successfully trained model is generalized and immune to noisy data. So, when the precision of a tensor is lowered, the error induced by quantization may often be tolerated by a well-trained model [136], and it can also obviate over-fitting caused otherwise by excessive precision, thereby bringing generality in learning [137]. For compensating any accuracy drop due to quantization, learning algorithms fine-tune the model [79], [93] or use quantization-aware training [48]. Thus, quantization techniques typically do not degrade inference accuracy [18], [97] or can trade off some accuracy under aggressive quantization [50], [94], [138]–[141].
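Both forms of quantization can be sketched on a toy weight vector. The code below is only an illustration under stated assumptions: precision lowering is shown as symmetric linear quantization to 8-bit integers, and value sharing uses a fixed, hypothetical codebook of centroids rather than centroids learned by clustering.

```python
def quantize_int8(values):
    """Precision lowering: symmetric linear quantization to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate real values."""
    return [qi * scale for qi in q]

def share_values(values, centroids):
    """Value sharing: replace each value by the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            for v in values]

weights = [0.51, -0.49, 0.02, 1.27, -1.0, 0.48]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Assumed centroids; in practice they would be obtained by clustering the weights.
codebook = [-1.0, -0.5, 0.0, 0.5, 1.25]
indices = share_values(weights, codebook)          # 5 centroids fit in 3-bit indices
shared = [codebook[i] for i in indices]            # dictionary look-up at use time
```

With value sharing, only the small indices and the codebook are stored, and the accelerator needs a look-up step to recover each value before computing with it.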

Quantization provides significant storage benefits and enables lower accesses to off-chip memory. It offers additional advantages in terms of *reduced area and power consumption*. This is because, for quantized tensors, accelerators can feature simpler and energy-efficient function units [61], [125] (e.g., for a 45 nm process, an int8 multiplier consumes $20\times$ less energy than an FP32 multiplier [125]), and bandwidth requirements are reduced, i.e., bus sizes can be smaller.

Thus, with sparse, size-reduced, and quantized tensors, compact models can achieve accuracy similar to that of models with uncompressed tensors, while becoming amenable to deployment on edge, mobile, or online learning platforms [43], [142] due to increased scope for low-latency and energy-efficient execution (much lower computational and storage requirements). Therefore, effectively leveraging such opportunities is crucial for further efficient acceleration.

# VI. NEED FOR SPECIAL SUPPORT TO ACCELERATE SPARSE AND IRREGULAR TENSOR COMPUTATIONS

Many hardware accelerators efficiently process different models [35], [36], [80], [143]. However, they inherently cannot benefit from sparsity because they lack special support. All the data, including the zero values of activations and weights, still have to be fetched from memory and fed into the PEs of the accelerator; the PEs are also unable to skip ineffectual computations, which wastes execution time. Therefore, hardware acceleration of sparse tensor computations requires additional hardware and software support [62], [144]. Moreover, unstructured zeros in the data make exploiting the sparsity hard, since accelerators conventionally perform structured computations with regular memory access patterns. The many sources of tensor sparsity and their sparsity levels lead to unique challenges and solutions in hardware/software co-design. Therefore, our discussions throughout this survey focus mainly on exploiting tensor sparsity for accelerating compact models.

Tensor dimension reduction and tensor decomposition make tensors irregular-shaped; such tensors deviate from the conventional tensors with symmetric dimensions, which are well optimized with specific dataflow mechanisms for execution on hardware accelerators. Moreover, such shape transformations may also modify the functionality of the computational primitives, which affects optimizations for mapping these primitives onto hardware accelerators. Efficient processing of irregular-shaped tensors with high utilization of the accelerator's architectural resources needs additional support. Therefore, hardware architectures and mapping optimizations need to be flexible enough to support irregular-shaped tensors of various models.

Hardware accelerators have supported precision lowering by storing, communicating, and computing on tensors of low bit-widths and, more recently, by supporting tensors with mixed precision [97]. However, value sharing requires processing indices of tensor elements, followed by a look-up in the codebook for the approximated values of the indices [61], thereby requiring special support. Moreover, leveraging the varying bit-widths of various tensors of different ML models on a hardware accelerator requires bit-adaptive computing. Further, value similarity leads to reuse of computations on accelerators, which can be achieved by memoization of the outputs.
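The index-plus-codebook processing mentioned above can be illustrated with a minimal Python sketch. The codebook contents, the function names, and the nearest-centroid assignment are illustrative assumptions, not the scheme of any cited accelerator; the point is only that the stored tensor holds small indices and the approximated values are recovered by a look-up.

```python
def encode_with_codebook(values, codebook):
    """Replace each value by the index of its nearest codebook entry (shared value)."""
    return [min(range(len(codebook)), key=lambda k: abs(v - codebook[k]))
            for v in values]

def decode_with_codebook(indices, codebook):
    """Look up the approximated value for each stored index."""
    return [codebook[k] for k in indices]

# Hypothetical codebook of four centroids, so each index needs only 2 bits.
codebook = [-1.0, 0.0, 0.5, 2.0]
weights = [0.45, -0.9, 1.8, 0.05, 0.6]

indices = encode_with_codebook(weights, codebook)
approx = decode_with_codebook(indices, codebook)
print(indices, approx)
```

With a four-entry codebook, each weight is stored as a 2-bit index instead of a full-precision value, at the cost of one extra look-up per element.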

Compressed tensors of ML models lead to sparse and irregular computations, and accelerating them efficiently requires special support, which is described in the next section. The appendix describes that exploiting acceleration opportunities for sparse and irregular tensor computations is relatively hard on CPUs and GPUs; with special support, accelerators can effectively capitalize on acceleration gains.

# VII. OVERVIEW OF THE ACCELERATOR SYSTEM FOR EFFICIENTLY PROCESSING SPARSE AND IRREGULAR TENSOR COMPUTATIONS

To efficiently process sparse and irregular tensor computations, designers of accelerator systems can integrate special hardware or software modules. These modules enable orchestration of structured computations while processing the tensors in compressed formats. Consequently, they lead to efficient utilization of the accelerator resources and make it convenient to exploit acceleration opportunities. Fig. 3 provides an overview of the accelerator system equipped with such modules. This section briefly describes these system modules.

Since sparse, size-reduced, and quantized tensors of ML models offer various opportunities for storage, performance, and energy efficiency, several accelerators have provided marginal or comprehensive support and leveraged some or all of the opportunities. Table I lists such common objectives and the corresponding accelerator solutions that meet these objectives. Now, we describe different hardware and software aspects of the accelerator system that help achieve such objectives.

**Sparsity encodings:** To efficiently process sparse tensors on accelerators, they are encoded using different formats, where only NZ values of a sparse tensor are stored in a "data" tensor and one or more "metadata" tensors encode the location of NZs. Section VIII discusses different coding formats and, for different sparsity levels, it analyzes their effectiveness in terms of storage efficiency. It also discusses costs for encoding and decoding the data using these formats. For accelerating ML models, sparse tensors are also quantized, i.e., their precisions are lowered (to typically int8 or int16 for inference [63], [96], [110] and FP16 for learning [97], [144]) and often approximated by clustering data of similar values [57], [61], [132]. Therefore, when sparse data is encoded, a value tensor contains quantized values of NZ elements.

Fig. 3. Overview of the accelerator system for processing sparse, irregular-shaped, and quantized tensors.

TABLE I

Accelerators for Processing Sparse Tensors.

| Objective | Techniques |
|-----------|------------|
| Compress data in off-chip memory (storage) | [34], [45], [57], [61]–[63], [95]–[97], [101], [102], [132], [144]–[160] |
| Compress data in on-chip memory (storage) | [45], [57], [61]–[63], [95]–[97], [101], [102], [109], [132], [144]–[148], [151]–[153], [155]–[158], [160], [161] |
| Skip processing zeros (energy-efficiency) | [34], [45], [57], [61]–[63], [95]–[97], [101], [102], [107], [109], [110], [144], [146]–[153], [155]–[168] |
| Reduce ineffectual computation cycles (performance & energy efficiency) | [45], [57], [61]–[63], [95]–[97], [101], [102], [109], [110], [144]–[148], [150]–[153], [155]–[158], [160]–[162], [165] |
| Load balancing (performance) | [57], [61], [63], [95], [97], [101], [102], [107], [110], [151], [155], [160], [163] |

**NZ detection and data extraction:** In processing sparse tensors of different primitives, corresponding elements of the weight and activation tensors are multiplied and accumulated. Different accelerators for learning and inference exploit the sparsity of one or both tensors, which impacts acceleration gains [110]. Several accelerators, including Cambricon-X [62], exploit only static sparsity, i.e., the locations of zeros in the tensor are known beforehand. Recent accelerator designs, including SCNN [96], ZENA [110], SNAP [148], and Eyeriss V2 [63], efficiently execute tensors with dynamic sparsity as well. This requires determining the locations of intersecting NZs in both tensors at run time and on-the-fly decoding (encoding) of NZs from (in) sparsity-coded tensors. Table II lists different accelerators that support static and dynamic sparsity of tensors. Depending on the sparsity level of tensors, accelerators need to use data extraction logic that decodes compressed tensors, searches within a window of NZs, and extracts matching pairs of NZs for multiplication and other following operations. Section IX provides a taxonomy of different data extraction mechanisms and analyzes their implications for varying sparsity levels. It also discusses sharing the data extraction mechanism among PEs or employing it within PEs. Then, it discusses opportunities for further optimizations.
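To make the notion of "intersecting NZs" concrete, the following Python sketch computes a dot product of two sparse vectors that are each given as NZ values plus sorted positions. It is a software analogy under stated assumptions (sorted position lists, a simple two-pointer search; the function name `sparse_dot` is hypothetical), not a model of the extraction logic of any particular accelerator.

```python
def sparse_dot(val_a, pos_a, val_b, pos_b):
    """Dot product of two sparse vectors given as (values, sorted positions).

    Only positions where BOTH operands are non-zero produce a multiplication.
    Returns the result and the number of effectual multiply-accumulates.
    """
    i = j = 0
    acc = 0
    macs = 0
    while i < len(pos_a) and j < len(pos_b):
        if pos_a[i] == pos_b[j]:        # matching pair of NZs
            acc += val_a[i] * val_b[j]
            macs += 1
            i += 1
            j += 1
        elif pos_a[i] < pos_b[j]:       # NZ in a meets a zero in b: skip it
            i += 1
        else:                           # NZ in b meets a zero in a: skip it
            j += 1
    return acc, macs

# Dense activations: [0, 3, 0, 0, 2, 1, 0, 4]; dense weights: [5, 0, 0, 1, 2, 0, 0, 3]
acc, macs = sparse_dot([3, 2, 1, 4], [1, 4, 5, 7], [5, 1, 2, 3], [0, 3, 4, 7])
print(acc, macs)
```

In this example only two of the eight positions hold NZs in both operands, so a dense datapath would spend six of its eight multiplications on ineffectual work.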

**Shared memory management:** Compressed tensors are stored in the shared on-chip memory, which is non-coherent, multi-banked, and often non-unified. For a pre-determined sequence of execution, the control logic of the accelerator initiates the accesses between off-chip and on-chip memory. For energy-efficient acceleration, such memory accesses need to be interleaved with computations so that the memory access latency can be hidden. Section X discusses shared memory architectures of accelerators and describes how the miss penalty is hidden when processing sparse tensors. It also describes the data reuse opportunities for varying sparsity levels and dimensions of tensors of common DNNs and how sparsity can lower the potential data reuse. Apart from temporal reuse of the tensors of a single layer from on-chip memory, sparse and quantized tensors (low storage requirements) offer cross-layer reuse of intermediate output tensors. Section X discusses techniques that leverage such opportunities.

**Communication networks:** Once tensor elements are accessed from the shared memory, they are grouped and communicated to appropriate PEs via interconnect networks. Accelerators use one or more networks to distribute the data and metadata of compressed tensors to PEs. Depending on the data reuse opportunities available, multiple PEs are often programmed to process the same set of elements of a tensor. Such spatial reuse is leveraged by accelerators with multicast or mesh networks. However, with varying sparsity and depending on data extraction and mapping mechanism, spatial reuse opportunities during data communication vary considerably [63]. Section XI discusses different communication networks used by accelerators to process sparse and quantized tensors. It also describes challenges in executing inter-PE communications that become unstructured due to sparsity and mechanisms to accumulate intermediate/partial outputs.

TABLE II

Accelerator Systems Leveraging Sparsity of Different Tensors for Different ML Models.

| Aspect | Category | Accelerators |
|--------|----------|--------------|
| Dynamicity of Sparsity | Static | [62], [101], [102], [107], [145], [150], [156], [158]–[160], [163] |
| Dynamicity of Sparsity | Dynamic | [34], [45], [57], [61], [63], [95]–[97], [109], [110], [132], [144], [146]–[149], [151]–[153], [155], [157], [161], [162], [164]–[168] |
| Tensors Treated as Sparse | Weight | [62], [101], [102], [107], [145], [158]–[160], [163] |
| Tensors Treated as Sparse | Activation | [34], [97], [109], [132], [150], [157], [161], [162], [164], [168] |
| Tensors Treated as Sparse | Both | [45], [57], [61], [63], [95], [96], [110], [144], [146]–[149], [151]–[153], [155], [156], [165], [167] |
| Primitive Operation | Matrix-Vector Multiply | [45], [57], [61]–[63], [97], [101], [147], [148], [151], [152], [156], [165], [168] |
| Primitive Operation | Matrix-Matrix Multiply | [62], [107], [144], [151], [152], [163], [167] |
| Primitive Operation | Convolution | [34], [57], [62], [63], [95]–[97], [101], [109], [132], [145]–[151], [153], [155], [157]–[160], [162], [165] |
| Primitive Operation | Recurrent Layer | [102], [115], [161], [164] |
| Accelerators for Learning | | [97], [144] |

**PE architecture:** Several accelerators consist of scalar PEs with fused MAC units (e.g., EIE [61], LNPU [97], and Envision [149]). Other accelerators feature SIMD PEs (multiple function units) (e.g., Eyeriss V2 [63]) or vector PEs that consist of multiplier arrays and adder trees (e.g., Cambricon-X [62] and SNAP [148]). PE architectures either directly process pairs of matching NZs extracted from tensors or use hardware logic for data extraction or coordinate computation. PEs usually feature large RFs or SRAMs for temporally reusing the compressed data several times without accessing shared or off-chip memory, thereby improving energy efficiency and performance. Section XII provides corresponding discussions and describes sparsity-aware dataflow mechanisms (mapping of tensor computations on accelerator resources) used by different accelerators. It also describes different accelerators that have leveraged value similarity of tensors and the corresponding modifications in PE architecture.

**Load balancing:** Allocating computations over tensor blocks to different PEs is tightly integrated with the choice of the dataflow mechanism and the data extraction logic. For efficient acceleration, such work allocation needs to further take into account the distribution of zeros in the tensors. Otherwise, different PEs end up with different numbers of NZs, which creates inter-PE and intra-PE load imbalance. Section XIII analyzes such sources of imbalance and introduces a taxonomy of different load balancing techniques. Different accelerators achieve load balance either through software techniques or by providing a hardware module for dynamic work balancing, which provides further acceleration. For example, ZENA [110] leveraged the sparsity of both activation and weight tensors and achieved about 32% additional performance gains through load balancing. Load balancing increased the total speedup from about $3.3\times$ to $4.4\times$ for AlexNet and from about $4\times$ to $5.6\times$ for VGG-16, when compared to the accelerator processing dense tensors.

**Write-back and post-processing:** Tensor elements produced by PEs need to be collected, post-processed for further operations, and written back to the memory. PEs in different accelerators either write back sequentially or asynchronously through a shared bus, or communicate with the memory via point-to-point links. In addition, accelerators usually feature a post-processing unit that reorganizes the data (as per the dataflow mechanism of the current and next layer of the model) and performs on-the-fly encoding. Section XIV discusses such write-back and post-processing schemes used by different accelerator designs.

**Compilation support:** It is important to support the execution of a variety of ML models on hardware accelerators and to make it easier for users to program models from ML libraries. Section XV describes the compiler support for executing sparse models on hardware accelerators. It discusses polyhedral and non-polyhedral intermediate representations and their implications for the compiler's ability to represent the code and for the feasibility of different code transformations. Moreover, it describes challenges in supporting sparse tensors and describes different DNN compilers that facilitate optimized executions of sparse tensor computations. Then, it discusses compiler optimizations, including common loop optimizations such as loop tiling, loop reordering, and loop unrolling, and other optimizations specific to hardware intrinsics. It also describes semi-automatic optimizations for transforming the loops and data layout as well as automatic optimizations using cost models. Finally, it discusses ISAs for accelerators and code generation, either by using a library of high-level primitives or by lowering the optimized mappings to accelerator-specific code.

# VIII. SPARSE DATA REPRESENTATIONS

Different encoding formats store a sparse tensor in a compressed manner. In each format, actual data (NZ values of a tensor) is stored along with metadata (information about the positions of NZs in the tensor). While processing tensors in compressed format, the accelerator's data indexing logic uses metadata to locate and extract appropriate NZs. Fig. 5 shows an example of an uncompressed tensor T and its encoding in different formats. For instance, NZ value '5' is located with indices (y,x)=(1,0). Fig. 5(b) shows the encoding of T in coordinate format, where vector *val* contains all NZ values and *coord* vectors represent the corresponding metadata. This section discusses commonly used formats for encoding sparse tensors through a common example (Fig. 5) and their implications on the storage and processing requirements. For different formats, Fig. 4 introduces a taxonomy for processing on metadata during data extraction, and Table III lists the corresponding storage and coding overhead. Depending on the mapping of the layer of a model onto the accelerator, tensors are divided into blocks (per-PE work) and processed separately. We refer to such processing as group-wise encoding, which is discussed later. Finally, this section briefly describes on-the-fly encoding and further opportunities.

Fig. 4. A taxonomy for the required processing on the metadata during data extraction when a sparse tensor is encoded using different formats.

## A. Coding Formats and Implications

1) Coordinate (COO): It stores the absolute positions (coordinates) of NZ values in a tensor [169]. As Fig. 5(b) shows, all NZs of tensor T (total S elements) are stored in the data vector val, while coordinate vectors  $coord_y$  and  $coord_x$  indicate the location of each NZ value. Thus, COO is a natural (and usually convenient) way to express sparse tensors and is commonly used for that purpose [170]. Formats adopted by FROSTT [171] and Matrix Market [172] closely resemble COO.

The COO format compresses only data, i.e., if NNZ is the total number of NZ elements in a tensor T, COO occupies only NNZ elements for data, but for each NZ, it stores coordinates of all n dimensions in the uncompressed format [152], [173]. Therefore, if a vector d contains tensor dimensions, then the overhead for storing n coordinates per NZ value is about  $\sum_{i=1}^{n} \lceil \log_2 d_i \rceil$  bits. As Fig. 5(b) shows, the metadata for values '2' and '3' (same row) or '2' and '5' (same column) are not compressed, i.e., duplicate values corresponding to the row and column indices exist in coordinate vectors.

COO can minimize pre-processing costs for inserting data and metadata of an NZ element, as new values can be appended to *val* and *coord* vectors. However, for executing ML models on accelerators, such pre-processing is typically done off-line (e.g., for weights). Even in the case of on-the-fly processing (e.g., activations), the data reformatting and layout reorganization are typically handled by a separate unit for post-processing (data write-back, assembling, and encoding) or a co-processor core. Therefore, assembling the output data and encoding it in other coding formats may be equally effective for processing sparse tensors on accelerators.

Note that due to storing coordinates in uncompressed format, COO may not be a very efficient format for storing data with low or moderate sparsity. For example, Fig. 6 shows that for a 2 MB matrix, COO can reduce storage requirements when sparsity is 70% or higher. However, COO may yield simple indexing logic for decoding, since both the data and metadata can be directly extracted and processed (i.e., no further arithmetic and logical processing is required on either the val or coord vectors for determining the positions of NZs).
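The COO encoding and its per-NZ metadata cost can be sketched in a few lines of Python. The 3×3 tensor `T` below is reconstructed from the examples given in the text (it is an assumption, since Fig. 5 itself is not reproduced here), and the function names are hypothetical.

```python
from math import ceil, log2

def coo_encode(tensor):
    """Encode a 2-D tensor as COO: NZ values plus absolute (y, x) coordinates."""
    val, coord_y, coord_x = [], [], []
    for y, row in enumerate(tensor):
        for x, v in enumerate(row):
            if v != 0:
                val.append(v)
                coord_y.append(y)
                coord_x.append(x)
    return val, coord_y, coord_x

def coo_metadata_bits(dims, nnz):
    """Metadata size: NNZ * sum_i ceil(log2 d_i) bits."""
    return nnz * sum(ceil(log2(d)) for d in dims)

# Assumed example tensor, consistent with the NZ positions quoted in the text.
T = [[2, 0, 3],
     [5, 0, 0],
     [0, 0, 7]]

val, coord_y, coord_x = coo_encode(T)
print(val, coord_y, coord_x)
print(coo_metadata_bits((512, 2048), 1))  # bits of metadata per NZ for a 512 x 2048 matrix
```

For the 512×2048 matrix of Fig. 6, each NZ carries 9 + 11 = 20 bits of coordinates on top of its 16b value, which is why COO only pays off once a sufficiently large fraction of the elements is zero.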

2) COO-1D: Accelerators often process only a block of NZs from an encoded sparse tensor, where elements in each block vary in terms of only one or two dimensions of the tensor. For example, some accelerators, including Cnvlutin [109], process the feature maps and filters in a point-wise manner by extracting the data of each 2D point across the channel direction. Such processing of the blocks requires determining the location of a point across the dimensions that are varying. Therefore, the data block is encoded with the COO-1D format, which is just like COO, but along with the val vector, there is only one pos vector that contains the absolute position of each NZ value in the flattened data block. For example, if all nine values in tensor T belong to a block that we consider for processing a dot product on the accelerator, then NZ value '5' is indexed by position '3'. Similarly, if we consider only the third row of the tensor T as a tensor block, then the encoded vectors are val: (7) and pos: (2).

Because extracting the NZ value and position is simple, several accelerators, including Cnvlutin [109], CoNNA [147], and SNAP [148], used the COO-1D format. For a given n-dimensional tensor T, the storage overhead for COO-1D is similar to that for COO. Specifically, if a vector d contains dimensions of the tensor block, then the overhead for storing a coordinate per NZ value is about  $\lceil \log_2 \left( \prod_{i=1}^n d_i \right) \rceil$  bits.
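The two COO-1D examples above (the whole tensor as one block, and the third row as a block) can be reproduced with the short sketch below. The tensor `T` is again the 3×3 example reconstructed from the text, and `coo1d_encode` is a hypothetical helper name.

```python
def coo1d_encode(block):
    """Encode a 2-D block as COO-1D: NZ values plus positions in the flattened block."""
    flat = [v for row in block for v in row]   # row-major flattening
    val = [v for v in flat if v != 0]
    pos = [p for p, v in enumerate(flat) if v != 0]
    return val, pos

# Assumed example tensor, consistent with the NZ positions quoted in the text.
T = [[2, 0, 3],
     [5, 0, 0],
     [0, 0, 7]]

val, pos = coo1d_encode(T)               # whole tensor as one block
row_val, row_pos = coo1d_encode([T[2]])  # only the third row as a block
print(val, pos)
print(row_val, row_pos)
```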

3) Run-Length Coding (RLC): It compresses a given sequence of values by replacing each run of consecutive identical values with a single value that is appended (prepended) to the number of repetitions (aka run) [174]. For sparse tensors, designers usually focus on value '0', and in an RLC-coded tensor, each 'run' indicates the total number of zeros before (after) an NZ value. Thus, a sparse tensor is compressed with RLC by storing NZ data and run values, where each element in run corresponds to the total number of leading zeros before each NZ in val. For the uncompressed tensor T of Fig. 5(a), Fig. 5(d) shows the tensor compressed using the RLC format. Run values for NZs '2' and '3' are '0' and '1', respectively. A few accelerator systems, including Eyeriss [34], encode both the NZs and runs in the same vector val, where consecutive values in val (each virtual pair) contain a run (leading zeros) and an NZ value. For example, tensor T can be encoded as val: (0, 2, 1, 3, 0, 5, 4, 7).

Fig. 5. Encodings to store sparse tensors in different formats. Elements with green shade encode the same NZ element. (Figure inspired by [173].)

Fig. 6. Storage benefits for encoding a sparse tensor ($512 \times 2048$ matrix with 16b elements) in different formats, normalized to the size of the fully dense tensor (Figure inspired by [144]).

Note that RLC requires single-step processing of metadata, because a run-length variable accumulates the runs and the number of NZs for determining the position of an NZ in the tensor. The storage overhead for RLC-B is $NNZ \times B$ bits, where B is the bit-width of the run for each NZ. If a vector d contains tensor dimensions, then B can be set as up to $\lceil \log_2\left(\prod_{i=1}^{n} d_i\right) \rceil$ bits for accommodating the number of leading zeros in a highly sparse tensor. In general, setting B as $\lfloor \log_2\left(\frac{sparsity}{density}\right) \rfloor + 1$ bits can effectively compress the tensors with a feasible bit-width for indicating leading zeros, where sparsity and density are fractional numbers indicating the actual or anticipated number of zeros and non-zeros in the tensor T, respectively (e.g., for 25% NZs, 0.75 and 0.25, respectively). For accelerating CNNs with 30%–90% sparsity of tensors, accelerator system designers typically opt for a low bit-width of the run, e.g., setting B as 2 or 4 bits. Fig. 6 shows that for a 2 MB matrix, RLC can reduce storage requirements when sparsity is 30% or higher and achieves efficient compression for 70% or higher sparsity. A low value of B such as 1 or 2 bits is more effective for encoding tensors with low or moderate sparsity, and setting B as 4 or 7 bits can achieve better storage benefits for high sparsity. Fig. 6 also shows the effectiveness of RLC-B where B is obtained by $\min\left(\lceil \log_2\left(\prod_{i=1}^{n} d_i\right) \rceil, \lfloor \log_2\left(\frac{sparsity}{density}\right) \rfloor + 1\right)$. The first term in the min function truncates B to the bit-width required for representing indices of the flattened tensor, and the second term approximates the size of the offset field for accommodating the sparsity level. As depicted in the figure, setting B as 1, 1, 1, 2, 4, and 7 efficiently encodes the tensor for sparsity of 10%, 30%, 50%, 70%, 90%, and 99%, respectively.
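The bit-width rule for RLC-B discussed above can be evaluated directly. In the sketch below, `rlc_run_bits` is a hypothetical helper; the lower clamp to 1 bit is our assumption, added because the second term of the formula becomes non-positive when sparsity is below 50%, whereas the text reports B = 1 for those cases.

```python
from math import ceil, floor, log2, prod

def rlc_run_bits(dims, sparsity):
    """B = min(ceil(log2(prod d_i)), floor(log2(sparsity/density)) + 1), clamped to >= 1."""
    density = 1.0 - sparsity
    index_bits = ceil(log2(prod(dims)))                  # enough to index the flattened tensor
    sparsity_bits = floor(log2(sparsity / density)) + 1  # sized for the expected run of zeros
    return max(1, min(index_bits, sparsity_bits))        # clamp to 1 bit is an assumption

dims = (512, 2048)  # the matrix used in Fig. 6
bits = [rlc_run_bits(dims, s) for s in (0.10, 0.30, 0.50, 0.70, 0.90, 0.99)]
print(bits)
```

Under this assumption the helper returns the same sequence of bit-widths that the text lists for sparsity levels of 10% through 99%.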

A low bit-width B cannot always capture the number of leading zeros as *run*. For example, Fig. 5(d) shows RLC-2b encoding. For value '7', there are four leading zeros, which cannot be represented in 2 bits (the maximum run is three). As a work-around, the encoding mechanism inserts one or more padding zeros [61] in vector *val*, which are treated as NZs. Fig. 5(d) shows that a padding zero is inserted between '5' and '7'; run values corresponding to the newly inserted padding zero and '7' (an actual NZ) are '3' and '0', which together contribute to the total run of four.
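A minimal Python sketch of RLC-B with padding zeros follows. It is an illustration under stated assumptions: the tensor is the flattened 3×3 example reconstructed from the text, trailing zeros are not encoded (so the decoder needs the original length), and the function names are hypothetical.

```python
def rlc_encode(flat, run_bits):
    """RLC-B encoding: each entry of val has a run of leading zeros stored in run_bits bits."""
    max_run = (1 << run_bits) - 1
    val, run = [], []
    zeros = 0
    for v in flat:
        if v == 0:
            if zeros == max_run:
                # Run field is saturated: emit a padding zero that is treated as an NZ.
                val.append(0)
                run.append(max_run)
                zeros = 0
            else:
                zeros += 1
        else:
            val.append(v)
            run.append(zeros)
            zeros = 0
    return val, run

def rlc_decode(val, run, length):
    """Single-step decoding: accumulate runs and entry count to locate each value."""
    flat = [0] * length
    pos = 0
    for v, r in zip(val, run):
        pos += r          # skip the leading zeros
        flat[pos] = v     # a padding zero simply writes 0
        pos += 1
    return flat

flat_T = [2, 0, 3, 5, 0, 0, 0, 0, 7]    # assumed example tensor, flattened row-wise
val2, run2 = rlc_encode(flat_T, 2)      # RLC-2b needs one padding zero
val4, run4 = rlc_encode(flat_T, 4)      # RLC-4b does not
print(val2, run2)
print(val4, run4)
```

Interleaving `run4` and `val4` pairwise gives the single-vector layout (0, 2, 1, 3, 0, 5, 4, 7) mentioned earlier for Eyeriss-style storage.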

TABLE III STORAGE AND CODING OVERHEAD FOR COMMON SPARSITY ENCODINGS OF TENSORS. VECTOR d STORES n DIMENSIONS OF A TENSOR THAT CONTAINS A TOTAL OF NNZ NON-ZERO ELEMENTS.

| Format | Storage Overhead (bits)                                                                   | Coding Overhead |
|--------|-------------------------------------------------------------------------------------------|-----------------|
| COO    | $NNZ \times \sum_{i=1}^{n} \lceil \log_2 d_i \rceil$                                      | low             |
| COO-1D | $NNZ \times \lceil \log_2 \prod_{i=1}^{n} d_i \rceil$                                     | low             |
| RLC    | $NNZ \times B$                                                                            | moderate        |
| Bitmap | $\prod_{i=1}^{n} d_i$                                                                     | moderate        |
| CSR    | $NNZ \times \lceil \log_2 d_1 \rceil + (d_0 + 1) \times \lfloor \log_2 NNZ + 1 \rfloor$   | high            |
| CSC    | $NNZ \times \lceil \log_2 d_0 \rceil + (d_1 + 1) \times \lfloor \log_2 NNZ + 1 \rfloor$   | high            |

Unlike COO, RLC requires step-wise processing of metadata. Data extraction logic needs an accumulator to determine the run-length, i.e., the position of an NZ within a tensor. If the tensor is not flattened and is indexed by n dimensions, then the decoding logic requires performing division and modulo operations on the metadata. Alternatively, for a multi-dimensional representation, the run for the coordinates of each dimension can be calculated and stored separately. The overall computational cost (arithmetic and logical operations realized in hardware) for such single-step decoding is low. Therefore, several accelerator designs used RLC encoding for processing sparse tensors. For example, Eyeriss [34] stored RLC-coded feature maps in the DRAM, which reduced off-chip accesses by about 30% in AlexNet CONV1 and 75% in CONV5 layer. CompAct [132] used an enhanced RLC format for encoding both the sparse and similar-value activations; like conventional RLC, each run element indicated the number of times the previous activation value was repeated. For example, 5 5 5 0 0 0 0 7 was encoded as 5 2 0 3 7, where '2' was the run (repetition count) for value '5' and '3' was the run for '0'. However, the RLC used by CompAct did not store a run for non-repeated values. Therefore, its encoding also required a bitmap (0, 1, 0, 1, 0), which indicated whether an entry in the val vector corresponded to a value (flag=0) or a run (flag=1).
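The repetition-based variant described for CompAct can be mimicked with the following sketch, which reproduces the example sequence and bitmap from the text. It is only a behavioral illustration: the function names are hypothetical, runs are unbounded Python integers, and none of the bit-widths or hardware details of the actual design are modeled.

```python
def repeat_rlc_encode(seq):
    """Encode repeated values as (value, repetition count); omit the count if no repeat.

    Returns val and a flag bitmap where 0 marks a value and 1 marks a run.
    """
    val, flag = [], []
    i = 0
    while i < len(seq):
        j = i
        while j + 1 < len(seq) and seq[j + 1] == seq[i]:
            j += 1                      # extend over identical consecutive values
        val.append(seq[i])
        flag.append(0)
        repeats = j - i                 # how many times the value is repeated after itself
        if repeats > 0:
            val.append(repeats)
            flag.append(1)
        i = j + 1
    return val, flag

def repeat_rlc_decode(val, flag):
    out = []
    for v, f in zip(val, flag):
        if f:
            out.extend([out[-1]] * v)   # repeat the previous value v more times
        else:
            out.append(v)
    return out

seq = [5, 5, 5, 0, 0, 0, 0, 7]
val, flag = repeat_rlc_encode(seq)
print(val, flag)
```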

4) Bitmap: Bitmap encoding [175] compresses sparse tensors by storing all NZs in a tensor val along with an additional tensor flag which contains 1-bit flags for all elements in tensor T. Each flag value corresponds to a tensor element and indicates whether the element is NZ (flag=1) or not (flag=0). Fig. 5(e) shows an example encoding for uncompressed tensor T of Fig. 5(a). The bitmap (aka bit-mask) yields a low storage overhead, e.g., a total of  $\prod_{i=1}^{n} d_i$  bits (where vector d stores n dimensions of tensor T) [157]. Therefore, bitmap is effective for compressing the tensors of low or moderate sparsity (e.g., tensors in several DNN layers). For example, Fig. 6 shows that the bitmap results in the least storage overhead for low sparsity (e.g., < 30%). Prior techniques [144], [146], [157] studied the storage overhead of compressing tensors with different encoding formats. For instance, Aimar et al. [157] analyzed the effectiveness of the bitmap format for compressing sparse feature maps and determined that for 16b precision, it can achieve storage efficiency for 6.25% or higher sparsity.

We analyzed storage benefits when matrices of varying sizes and sparsity are encoded with different formats. Table IV presents the analysis. For determining storage benefits, we calculated the storage requirements for encoded tensors with the analysis presented in Table III and normalized them to the size of matrices in the dense storage format. Sizes of matrices were selected based on tensors of different layers in commonly used DNNs. We used the SciPy library [176] for generating matrices of varying sparsity and for encoding them in COO, CSR, and CSC formats. Table IV shows that bitmap achieves better compression than other formats when the sparsity of the tensor is 50% or lower. Since bitmap stores the metadata for all elements in a dense format, its benefits diminish as sparsity increases, where RLC is more effective. Both bitmap and RLC are commonly used by DNN accelerators.
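The kind of calculation behind such an analysis can be sketched from the Table III formulas alone. The snippet below is not the script used for Table IV (which relied on SciPy-generated matrices); it is a closed-form estimate under the assumptions that NNZ is at least 1, that data elements are 16b, and that RLC is left out because its size depends on how many padding zeros a particular matrix needs. The function name `normalized_size` is hypothetical.

```python
from math import ceil, floor, log2

def normalized_size(fmt, rows, cols, sparsity, data_bits=16):
    """Encoded size (data + metadata) divided by the dense size, per the Table III formulas."""
    total = rows * cols
    nnz = round(total * (1.0 - sparsity))
    if fmt == "coo":
        meta = nnz * (ceil(log2(rows)) + ceil(log2(cols)))
    elif fmt == "bitmap":
        meta = total                                   # one flag bit per element
    elif fmt == "csr":
        meta = nnz * ceil(log2(cols)) + (rows + 1) * floor(log2(nnz) + 1)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return (nnz * data_bits + meta) / (total * data_bits)

for s in (0.3, 0.5, 0.7, 0.9):
    sizes = {f: round(normalized_size(f, 512, 2048, s), 3) for f in ("coo", "bitmap", "csr")}
    print(f"sparsity {s:.0%}: {sizes}")
```

For the 512×2048 matrix, this estimate agrees with the break-even points quoted in the text: bitmap breaks even at 6.25% sparsity, CSR saves storage at 50% but not at 30%, and COO saves storage at 70% but not at 50%.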

Like the RLC format, decoding the bitmap also requires single-step processing of metadata. The data extraction logic to process a single element typically consists of at least an adder and a comparator or a logical AND [62]. Due to the overall low storage overhead and low hardware cost of coding, the bitmap format is used by several accelerator designs including Cambricon-X [62], SparTen [151], and SIGMA [144].
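A small Python sketch of bitmap encoding and look-up is given below. It is a software analogy with hypothetical function names: counting the flags that precede a position plays the role of the adder, and the element-wise AND of two flag vectors shows how positions where both operands are non-zero could be identified.

```python
def bitmap_encode(flat):
    """Store NZ values in val and a 1-bit flag per element (1 = NZ, 0 = zero)."""
    flag = [1 if v != 0 else 0 for v in flat]
    val = [v for v in flat if v != 0]
    return val, flag

def bitmap_lookup(val, flag, pos):
    """Return the element at flattened position pos."""
    if not flag[pos]:
        return 0
    return val[sum(flag[:pos])]   # number of NZs before pos is the index into val

flat_T = [2, 0, 3, 5, 0, 0, 0, 0, 7]      # assumed example tensor, flattened row-wise
val, flag = bitmap_encode(flat_T)

# Positions where two operands are both non-zero: element-wise AND of the flags.
other_flag = [1, 1, 0, 1, 0, 0, 0, 0, 0]  # hypothetical flags of a second tensor
both = [a & b for a, b in zip(flag, other_flag)]
print(val, flag, both)
```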

5) Compressed Sparse Row (CSR): It compresses the tensor by processing each row as a sparse vector. In a CSR-coded tensor, an array *val* contains all NZ values (ordered row-wise) and an array idx stores the column index of each NZ value [177]–[179]. For example, Fig. 5(f) shows that for tensor T of Fig. 5(a), val contains NZ values (row-wise) and idx contains column indices '0' and '2' for NZs '2' and '3'. For a CSR-coded tensor, an array ptr encodes the number of NZs in each row i, which is obtained by calculating ptr[i+1] - ptr[i]. The last element of ptr equals the total number of NZs in tensor T. Since CSR compresses the tensor row-wise, it enables random access to any row [180].

While the COO format redundantly stores the row coordinates for NZs in the same row of the tensor, CSR compresses such metadata by storing the sparse tensor row-wise [173]. For example, in Fig. 5(b) (COO),  $coord_y$  stores row index '0' for both NZs '2' and '3'. Such redundancy is removed in the CSR coding of Fig. 5(f), where array ptr contains information about the total NZs in each row. For compressing an  $M \times N$  matrix using CSR, the total storage overhead is  $NNZ \times \lceil \log_2 N \rceil$  bits (for idx) plus  $(M + 1) \times \lfloor \log_2 NNZ + 1 \rfloor$  bits (for ptr). Due to its high storage overhead (proportional to the number of NZs and size of the row of the tensor), CSR coding is considered effective for high sparsity [62], [144]. For example, Fig. 6 shows that for compressing a 2 MB matrix, CSR can reduce storage requirements when sparsity is 50% or higher and achieves efficient compression for 90% or higher sparsity. Therefore, accelerators typically use CSR or CSC formats for high-sparsity tensors. For example, Mishra et al. [45] encoded tensors with 70% or more sparsity in CSR format and performed the sparse matrix-vector multiplication (SpMV) operation.

Unlike the RLC or bitmap format, decoding a CSR-coded tensor can require two-step processing of metadata. This is because the first processing step locates a row by iterating over ptr, which determines the region of the corresponding NZs [61], and the next step searches among the NZs of the row and determines the desired NZ element with the matching column index. Thus, decoding an NZ from val can require iterating over ptr, a subtraction of ptr values, and an additional look-up in idx for matching the column index, which can lead to two-step decoding. Note that accelerators can efficiently process CSR-coded tensors with row-wise processing, where ptr is accessed once for fetching each row of NZs and then a decoding mechanism repeatedly iterates over idx (to locate column positions). The ability of flexible row-wise indexing makes CSR a preferred choice over COO for operations on a sparse matrix, since iterating over different rows of a COO-coded matrix can be difficult.
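The two-step decoding of a CSR-coded tensor can be sketched as follows. The tensor `T` is the 3×3 example reconstructed from the text, the function names are hypothetical, and the linear search over idx is one simple choice for the second step.

```python
def csr_encode(matrix):
    """Encode a 2-D matrix as CSR: NZ values, their column indices, and row pointers."""
    val, idx, ptr = [], [], [0]
    for row in matrix:
        for x, v in enumerate(row):
            if v != 0:
                val.append(v)
                idx.append(x)
        ptr.append(len(val))      # ptr[i+1] - ptr[i] = number of NZs in row i
    return val, idx, ptr

def csr_lookup(val, idx, ptr, y, x):
    """Two-step look-up of element (y, x)."""
    start, end = ptr[y], ptr[y + 1]   # step 1: ptr bounds the region of NZs of row y
    for k in range(start, end):       # step 2: search idx in that region for column x
        if idx[k] == x:
            return val[k]
    return 0

# Assumed example tensor, consistent with the NZ positions quoted in the text.
T = [[2, 0, 3],
     [5, 0, 0],
     [0, 0, 7]]

val, idx, ptr = csr_encode(T)
nz_per_row = [ptr[i + 1] - ptr[i] for i in range(len(T))]
print(val, idx, ptr, nz_per_row)
```

Row-wise processing only needs step 1 once per row: after reading `ptr[y]` and `ptr[y + 1]`, all NZs of the row are a contiguous slice of `val` and `idx`.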

A few other variants of CSR are also used. For example, contiguous elements in ptr array need to store the same values, when consecutive rows contain all zeros. This redundancy is eliminated by the doubly compressed sparse row (DCSR) format [184] which achieves additional compression for hypersparse matrices by storing metadata for only those rows which contain NZs. Unlike CSR, the block CSR (BCSR) [185] stores a dense block of tensor elements in val, if any block (with predetermined block-size) contains an NZ element. BCSR format improves storage efficiency by avoiding storing the blocks with all zeros while providing opportunities for accelerating the data as dense regions. Thus, BCSR-coded tensors can be efficiently executed not only on conventional processors

TABLE IV

Storage benefits achieved with different encoding formats for varying sparsity of matrices. Matrix sizes are selected based on tensors of different layers in commonly used DNNs. Memory sizes of encoded tensors are calculated with the analysis presented in Table III and normalized to the size of matrices in dense storage format. Columns corresponding to the rows 'only NZs' present theoretical upper bounds for storage savings.

| Matrix   |      | $64 \times 27$    |      |      |      |       | 128 × 1152          |      |      |      |      |       | 84 × 4096 |                                   |      |      |      |       |  |
|----------|------|-------------------|------|------|------|-------|---------------------|------|------|------|------|-------|-----------|-----------------------------------|------|------|------|-------|--|
| Layer    |      | VGG CONV 1_1 [91] |      |      |      |       | ResNet CONV 3_2 [4] |      |      |      |      |       |           | Transformer [20], [181] TF1 [144] |      |      |      |       |  |
| Sparsity | 10%  | 30%               | 50%  | 70%  | 90%  | 99%   | 10%                 | 30%  | 50%  | 70%  | 90%  | 99%   | 10%       | 30%                               | 50%  | 70%  | 90%  | 99%   |  |
| Only NZs | 1.11 | 1.43              | 2    | 3.33 | 10   | 100   | 1.11                | 1.43 | 2    | 3.33 | 10   | 100   | 1.11      | 1.43                              | 2    | 3.33 | 10   | 100   |  |
| RLC-B    | 1.05 | 1.34              | 1.85 | 2.87 | 7.64 | 62.20 | 1.05                | 1.34 | 1.85 | 2.86 | 7.65 | 62.16 | 1.05      | 1.34                              | 1.85 | 2.86 | 7.66 | 62.41 |  |
| RLC-2    | 0.99 | 1.27              | 1.77 | 2.87 | 7.33 | 23.95 | 0.99                | 1.27 | 1.77 | 2.86 | 7.34 | 23.96 | 0.99      | 1.27                              | 1.77 | 2.86 | 7.34 | 23.97 |  |
| RLC-4    | 0.89 | 1.14              | 1.60 | 2.67 | 7.64 | 37.14 | 0.89                | 1.14 | 1.60 | 2.67 | 7.65 | 37.28 | 0.89      | 1.14                              | 1.60 | 2.67 | 7.66 | 37.30 |  |
| Bitmap   | 1.04 | 1.31              | 1.78 | 2.76 | 6.15 | 13.79 | 1.04                | 1.31 | 1.78 | 2.76 | 6.15 | 13.79 | 1.04      | 1.31                              | 1.78 | 2.76 | 6.15 | 13.79 |  |
| CSC      | 0.80 | 1.03              | 1.44 | 2.37 | 6.88 | 53.33 | 0.77                | 0.99 | 1.38 | 2.28 | 6.65 | 50.66 | 0.77      | 0.98                              | 1.37 | 2.26 | 6.43 | 42.94 |  |
| CSR      | 0.83 | 1.06              | 1.47 | 2.40 | 6.67 | 40.28 | 0.66                | 0.85 | 1.19 | 1.98 | 5.91 | 57.27 | 0.64      | 0.82                              | 1.14 | 1.91 | 5.72 | 56.65 |  |
| COO      | 0.66 | 0.85              | 1.19 | 1.98 | 5.94 | 59.65 | 0.52                | 0.67 | 0.94 | 1.57 | 4.71 | 47.12 | 0.51      | 0.65                              | 0.92 | 1.53 | 4.58 | 45.82 |  |

| Matrix   |      | 256 × 128               |      |      |      |       | 512 × 2048          |      |      |      |      |       | 600 × 4096 |                             |      |      |      |       |  |
|----------|------|-------------------------|------|------|------|-------|---------------------|------|------|------|------|-------|------------|-----------------------------|------|------|------|-------|--|
| Layer    |      | MobileNet CONV 1×1 [52] |      |      |      |       | ResNet CONV 5_1 [4] |      |      |      |      |       |            | NeuralTalk [182] NT-We [61] |      |      |      |       |  |
| Sparsity | 10%  | 30%                     | 50%  | 70%  | 90%  | 99%   | 10%                 | 30%  | 50%  | 70%  | 90%  | 99%   | 10%        | 30%                         | 50%  | 70%  | 90%  | 99%   |  |
| Only NZs | 1.11 | 1.43                    | 2    | 3.33 | 10   | 100   | 1.11                | 1.43 | 2    | 3.33 | 10   | 100   | 1.11       | 1.43                        | 2    | 3.33 | 10   | 100   |  |
| RLC-B    | 1.05 | 1.34                    | 1.85 | 2.86 | 7.64 | 62.62 | 1.05                | 1.34 | 1.85 | 2.86 | 7.66 | 62.37 | 1.05       | 1.34                        | 1.85 | 2.86 | 7.66 | 62.38 |  |
| RLC-2    | 0.99 | 1.27                    | 1.77 | 2.86 | 7.33 | 23.96 | 0.99                | 1.27 | 1.77 | 2.86 | 7.34 | 23.97 | 0.99       | 1.27                        | 1.77 | 2.86 | 7.33 | 23.97 |  |
| RLC-4    | 0.89 | 1.14                    | 1.60 | 2.67 | 7.64 | 37.30 | 0.89                | 1.14 | 1.60 | 2.67 | 7.66 | 37.29 | 0.89       | 1.14                        | 1.60 | 2.67 | 7.66 | 37.28 |  |
| Bitmap   | 1.04 | 1.31                    | 1.78 | 2.76 | 6.15 | 13.79 | 1.04                | 1.31 | 1.78 | 2.76 | 6.15 | 13.79 | 1.04       | 1.31                        | 1.78 | 2.76 | 6.15 | 13.79 |  |
| CSC      | 0.74 | 0.95                    | 1.33 | 2.21 | 6.55 | 58.23 | 0.71                | 0.91 | 1.28 | 2.13 | 6.32 | 57.77 | 0.68       | 0.88                        | 1.23 | 2.05 | 6.09 | 56.22 |  |
| CSR      | 0.77 | 0.99                    | 1.38 | 2.29 | 6.69 | 53.33 | 0.66                | 0.85 | 1.19 | 1.98 | 5.92 | 57.89 | 0.64       | 0.82                        | 1.14 | 1.91 | 5.71 | 56.50 |  |
| COO      | 0.57 | 0.74                    | 1.03 | 1.72 | 5.17 | 51.82 | 0.49                | 0.64 | 0.89 | 1.48 | 4.45 | 44.55 | 0.47       | 0.60                        | 0.84 | 1.41 | 4.22 | 42.20 |  |

| Matrix   |      | 8791 × 600 |          |          |         |       | 3584 × 1500     |      |      |      |      |       | $4096 \times 9216$ |                 |      |      |      |       |  |
|----------|------|------------|----------|----------|---------|-------|-----------------|------|------|------|------|-------|--------------------|-----------------|------|------|------|-------|--|
| Layer    |      | NeuralTalk [182] NT-Wd [61] |  |  |  |       | DeepBench [183] |      |      |      |      |       |                    | AlexNet FC1 [2] |      |      |      |       |  |
| Sparsity | 10%  | 30%        | 50%      | 70%      | 90%     | 99%   | 10%             | 30%  | 50%  | 70%  | 90%  | 99%   | 10%                | 30%             | 50%  | 70%  | 90%  | 99%   |  |
| Only NZs | 1.11 | 1.43       | 2        | 3.33     | 10      | 100   | 1.11            | 1.43 | 2    | 3.33 | 10   | 100   | 1.11               | 1.43            | 2    | 3.33 | 10   | 100   |  |
| RLC-B    | 1.05 | 1.34       | 1.85     | 2.86     | 7.66    | 62.37 | 1.05            | 1.34 | 1.85 | 2.86 | 7.66 | 62.34 | 1.05               | 1.34            | 1.85 | 2.86 | 7.66 | 62.38 |  |
| RLC-2    | 0.99 | 1.27       | 1.77     | 2.86     | 7.34    | 23.97 | 0.99            | 1.27 | 1.77 | 2.86 | 7.34 | 23.97 | 0.99               | 1.27            | 1.77 | 2.86 | 7.34 | 23.97 |  |
| RLC-4    | 0.89 | 1.14       | 1.60     | 2.67     | 7.66    | 37.28 | 0.89            | 1.14 | 1.60 | 2.67 | 7.66 | 37.28 | 0.89               | 1.14            | 1.60 | 2.67 | 7.66 | 37.28 |  |
| Bitmap   | 1.04 | 1.31       | 1.78     | 2.76     | 6.15    | 13.79 | 1.04            | 1.31 | 1.78 | 2.76 | 6.15 | 13.79 | 1.04               | 1.31            | 1.78 | 2.76 | 6.15 | 13.79 |  |
| CSC      | 0.59 | 0.76       | 1.07     | 1.78     | 5.34    | 53.11 | 0.64            | 0.82 | 1.14 | 1.91 | 5.71 | 56.34 | 0.64               | 0.82            | 1.14 | 1.91 | 5.71 | 56.30 |  |
| CSR      | 0.68 | 0.88       | 1.23     | 2.05     | 6.08    | 55.89 | 0.66            | 0.85 | 1.19 | 1.98 | 5.91 | 57.09 | 0.59               | 0.76            | 1.07 | 1.78 | 5.34 | 53.06 |  |
| COO      | 0.45 | 0.57       | 0.80     | 1.34     | 4.01    | 40.10 | 0.46            | 0.59 | 0.82 | 1.37 | 4.11 | 41.12 | 0.42               | 0.55            | 0.76 | 1.27 | 3.82 | 38.19 |  |

but also on hardware accelerators (which requires additional support for appropriately indexing dense regions e.g., [107]).
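The two CSR variants can be sketched as below. This is a simplified illustration with hypothetical function names; it assumes matrix dimensions divisible by the block size and does not reproduce the exact layouts of [184] or [185].

```python
def csr_to_dcsr(ptr):
    """Keep pointer entries only for rows that contain at least one NZ."""
    row_ids, dptr = [], [0]
    for r in range(len(ptr) - 1):
        if ptr[r + 1] > ptr[r]:      # row r is non-empty
            row_ids.append(r)
            dptr.append(ptr[r + 1])
    return row_ids, dptr

def bcsr_encode(dense, bh, bw):
    """Store every bh x bw block that contains an NZ as a dense block."""
    blocks, bidx, bptr = [], [], [0]
    for br in range(0, len(dense), bh):
        for bc in range(0, len(dense[0]), bw):
            block = [dense[r][bc:bc + bw] for r in range(br, br + bh)]
            if any(x != 0 for row in block for x in row):
                blocks.append(block)
                bidx.append(bc // bw)    # block-column index
        bptr.append(len(blocks))         # cumulative block count per block-row
    return blocks, bidx, bptr

# CSR ptr of a 5-row matrix whose rows 1..3 are empty.
row_ids, dptr = csr_to_dcsr([0, 2, 2, 2, 2, 3])

matrix = [[1, 2, 0, 0],
          [0, 3, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 4]]
blocks, bidx, bptr = bcsr_encode(matrix, 2, 2)
```

In the DCSR case, the repeated pointer value for the empty rows disappears at the cost of an extra `row_ids` array; in the BCSR case, zeros inside a retained block are stored explicitly so that each block can be processed as a dense region.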

6) Compressed Sparse Column (CSC): CSC is similar to CSR, except that the NZ elements in tensors are stored column-wise [177]–[179]. For a CSC-coded tensor, an array val contains all NZ values (ordered column-wise); array idx stores the row indices of each NZ value; array ptr contains information about total NZs in each column. Fig. 5(g) shows CSC coding for tensor T of Fig. 5(a). The storage overhead and hardware cost for encoding/decoding tensors in CSC format are similar to those for CSR. Like CSR, processing tensors in CSC format can be efficient when tensors exhibit high sparsity. Therefore, accelerators including EIE [61] and Sticker [153] process high-sparsity tensors in CSC format.
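A minimal sketch of CSC encoding, together with a column-wise matrix-vector product that skips zero activations, is given below. The matrix, the vector, and the function names are hypothetical; the loop only mirrors the general idea of processing one column of NZ weights per NZ activation and is not a model of any specific accelerator.

```python
def csc_encode(dense):
    """Encode a dense 2-D list column-wise into (val, idx, ptr)."""
    rows, cols = len(dense), len(dense[0])
    val, idx, ptr = [], [], [0]
    for c in range(cols):
        for r in range(rows):
            if dense[r][c] != 0:
                val.append(dense[r][c])
                idx.append(r)        # row index of this NZ
        ptr.append(len(val))         # cumulative NZ count after this column
    return val, idx, ptr

def csc_spmv(val, idx, ptr, x, num_rows):
    """Compute y = W x, visiting only NZ weights of columns with NZ activations."""
    y = [0] * num_rows
    for c, xc in enumerate(x):
        if xc == 0:                  # zero activation: skip the whole column
            continue
        for k in range(ptr[c], ptr[c + 1]):
            y[idx[k]] += val[k] * xc
    return y

W = [[2, 0, 3],
     [0, 0, 0],
     [0, 5, 0]]
val, idx, ptr = csc_encode(W)
y = csc_spmv(val, idx, ptr, [1, 0, 2], num_rows=3)
```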

To alleviate the high storage overhead of CSR or CSC formats due to storing idx and ptr arrays, a few accelerator designs use additional encoding formats for the metadata idx or ptr. For example, EIE [61] and EyerissV2 [63] encode idx in RLC such that elements in idx indicate zeros between column indices of NZs (similar to run in RLC for NZ values). Fig. 5(h) provides an example for such RLC-coded CSC with additional step-processing on row index array. Values '2' and '5' have column index of '0' and '1', respectively, which can be encoded as '0' and '0' because there are no leading zeros before NZs '2' and '5'. If the first column of T had been (0, 2, 0, 0, 5), then the row indices for '2' and '5' would be encoded as '1' and '2'. Similarly, for CSR or CSC formats, ptr can be relatively indexed. For example, when populating ptr, instead of storing the cumulative number of NZs up to each column, we can store the number of NZs of each column individually, so that ptr[i] holds the number of NZs in column i. Thus, for a large tensor with moderate or high sparsity, such relative indexing can reduce storage requirements considerably. Note that such relative indexing requires additional step-processing on the metadata. Therefore, compressing a sparse tensor in CSR or CSC format with such relatively-indexed or RLC-coded metadata can result in triple-step processing on the metadata during decoding, and therefore additional hardware cost (due to further arithmetic and logical operations).
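Both forms of relative indexing can be sketched as follows. This is an illustrative model with hypothetical function names; it ignores the limited bit-width that a hardware run field would have.

```python
def rlc_encode_row_indices(idx, ptr):
    """Replace each row index by the number of zeros preceding the NZ in its column."""
    runs = []
    for c in range(len(ptr) - 1):
        prev = -1                        # position of the previous NZ in this column
        for k in range(ptr[c], ptr[c + 1]):
            runs.append(idx[k] - prev - 1)
            prev = idx[k]
    return runs

def rlc_decode_row_indices(runs, ptr):
    """Recover absolute row indices; this is the extra decoding step."""
    idx = []
    for c in range(len(ptr) - 1):
        prev = -1
        for k in range(ptr[c], ptr[c + 1]):
            prev = prev + runs[k] + 1
            idx.append(prev)
    return idx

def ptr_to_counts(ptr):
    """Relative ptr: number of NZs of each column instead of the running total."""
    return [ptr[i + 1] - ptr[i] for i in range(len(ptr) - 1)]

def counts_to_ptr(counts):
    """Recover cumulative pointers with a running sum."""
    ptr = [0]
    for count in counts:
        ptr.append(ptr[-1] + count)
    return ptr

# Column (0, 2, 0, 0, 5): NZs at rows 1 and 4 are encoded as runs 1 and 2.
runs = rlc_encode_row_indices([1, 4], [0, 2])
```

The decode functions make the extra cost visible: every absolute index or pointer now requires an addition on top of the regular CSC look-up.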

7) Compressed Sparse Fiber (CSF): CSF [186] provides a generalization of CSR for higher-order tensors. For an n-dimensional tensor, CSF extends CSR or CSC formats by forming a tree with n levels. Nodes in each level l contain indices for the l-th mode (dimension) of the uncompressed tensor T. Paths from a root to a leaf node encode different coordinates for an NZ value, which are stored in different parent nodes throughout the path; each leaf node stores an NZ value. Therefore, the height of the tree is the total dimensions of the tensor T; the width of the tree is NNZ in the tensor T. Fig. 5(i) illustrates a mode-0 tree of CSF tensor for the tensor T of Fig. 5(a). In this mode-0 tree, root nodes represent the major mode (dimension 0 or y), and their leaf nodes represent the consecutive dimension (dimension 1 or x). For a CSF tensor for T, the metadata in Fig. 5(i) shows a generalization of CSR, where the initial ptr and idx arrays correspond to mode 0 (nodes at the top of the tree) and the later ptr and idx arrays correspond to mode 1 (leaves of the tree).

CSF generalizes the compression format for high-dimensional tensors and represents a tree structure by layering arrays of index pointers [187]. Like CSR, ptr informs about a group of indices corresponding to a dimension. For instance, the ptr array at the beginning informs that there is one group with 3 coordinates for the major mode 0; the corresponding idx array stores the coordinate values for mode 0. Similarly, the next ptr array informs about three different groups of coordinates for the next mode (dimension 1) for the three parent nodes (corresponding to the previous mode 0). The corresponding idx array stores the coordinate values for mode 1, separated into three groups. These different groups are indicated with thick vertical borders outside the group of values. In general, the number of groups within ptr or idx arrays at level l corresponds to the number of parents p; the ptr array contains p+1 values. Therefore, with p=1 for level 1 (mode 0), there is one group for idx and two values in the ptr array. The three nodes at level 1 (mode 0) are parents for level 2 (mode 1). Therefore, they form three groups in the next idx array (mode 1), pointed to by a total of four values in the ptr array.

Layering the arrays of index pointers in CSF compresses tensor dimensions, hence reducing the duplication of index values [186], [187]. Each time when a node directs to children, it eliminates duplicating indices for the corresponding coordinate of the parent node. For instance, Fig. 5(i) shows that unlike COO, CSF avoids storing redundant row coordinates for NZs. The storage benefits increase with more tensor dimensions and increased redundancy among coordinate values of NZs. The order of compression can also significantly impact the benefits. For example, Fig. 5(j) shows the other possible ordering, which eliminates storing redundant column (mode 1) coordinates, resulting in a lesser number of nodes. Parker et al. [187] and Smith et al. [186] provide more details about managing higher-order tensors with CSF format.

For an n-mode CSF tensor, the storage overhead corresponds to storing more than NNZ + n - 1 coordinates and typically much less than $n \times NNZ$ coordinates or $\sum_{i=1}^{n} \lceil \log_2 d_i \rceil$ bits. However, the hierarchical structure of CSF makes it difficult (costly in terms of hardware logic) to assemble or traverse the encoded tensor. In fact, processing the metadata arrays at each dimension-level requires two-step processing (just like processing ptr and idx arrays in CSR or CSC format), thereby requiring up to 2n-step processing for an n-dimensional tensor. With very high coding overhead and moderate storage overhead, accelerator system designers may opt for CSF format when processing high-dimensional tensors exhibiting high sparsity. For example, Hegde et al. [152] recently proposed ExTensor for accelerating sparse tensor algebra. It processed CSF-coded tensors for high (80%) and very high (99% or more) sparsity.
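The layered ptr and idx arrays of CSF can be built from a coordinate list as sketched below. This is a simplified illustration, assuming unique coordinates and the natural mode order; the function name and the example tensor are hypothetical and do not correspond to Fig. 5.

```python
def csf_encode(coords, vals):
    """Build per-level idx/ptr arrays from a list of n-mode coordinates."""
    order = sorted(range(len(coords)), key=lambda i: tuple(coords[i]))
    coords = [tuple(coords[i]) for i in order]
    vals = [vals[i] for i in order]
    num_modes = len(coords[0])
    idx_levels, ptr_levels = [], []
    parents = [()]                       # virtual root with an empty prefix
    for level in range(num_modes):
        # Nodes at this level are the unique coordinate prefixes of length level + 1.
        prefixes = []
        for c in coords:
            prefix = c[:level + 1]
            if not prefixes or prefixes[-1] != prefix:
                prefixes.append(prefix)
        idx, ptr, k = [], [0], 0
        for parent in parents:           # one group of children per parent
            while k < len(prefixes) and prefixes[k][:level] == parent:
                idx.append(prefixes[k][level])
                k += 1
            ptr.append(len(idx))         # p parents give p + 1 pointer values
        idx_levels.append(idx)
        ptr_levels.append(ptr)
        parents = prefixes
    return idx_levels, ptr_levels, vals

# Four NZs of a 2-mode tensor, given as (mode-0, mode-1) coordinates.
idx_levels, ptr_levels, leaf_vals = csf_encode(
    [(0, 0), (0, 2), (1, 1), (3, 0)], [2, 3, 5, 7])
```

In this example, the tree stores 3 + 4 = 7 coordinates, whereas COO would store 2 × 4 = 8, because the two NZs that share mode-0 coordinate 0 keep that coordinate only once.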

TABLE V
COMMONLY USED SPARSITY ENCODINGS BY ACCELERATORS

| COO    | [152], [153]                                                            |
|--------|-------------------------------------------------------------------------|
| COO-1D | [101], [109], [147], [148], [158], [161], [165], [167]                  |
| RLC    | [34], [95]–[97], [132], [145], [188]                                    |
| Bitmap | [57], [62], [110], [132], [144], [146], [149]–[151], [153]–[155], [157] |
| CSR    | [45]                                                                    |
| CSC    | [45], [61], [63], [102], [156], [160], [189]                            |
| CSF    | [152]                                                                   |

**8) Huffman coding:** It is typically applied for compressing sparse tensors once they are quantized using precision lowering or value sharing. Quantization results in a limited set of values because (i) precision lowering typically reduces the range and precision with which the data can be represented and (ii) value sharing among tensor elements leads to a set of few common values. The different values in the codebook can have different frequencies (the total number of times a shared value is used to represent different tensor elements) and therefore, they can be further compressed with Huffman encoding [43], [57]. For example, Han et al. [43] showed that pruning and quantization (8b or 5b encoding for weight indices with a codebook of 256 or 32 values for CONV or FC layers) significantly compressed weights of the models, e.g., by 27× for AlexNet [2] and by 31× for VGG-16 [91]; Huffman coding shrank the encoded weights by another 22% and 36%, raising the overall compression to 35× and 49×, respectively.
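The benefit of Huffman coding for skewed codebook usage can be illustrated with a small sketch. The index stream below is made up for illustration and the function name is hypothetical; the code only computes code lengths with the standard Huffman construction and is not the pipeline of [43].

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the stream."""
    freq = Counter(symbols)
    if len(freq) == 1:                   # degenerate stream: one symbol, one bit
        return {next(iter(freq)): 1}
    # Heap entries: (frequency, tie-breaker, symbols in this subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:        # merging adds one bit to every member
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, tie, group1 + group2))
        tie += 1
    return lengths

# Hypothetical stream of codebook indices for a 4-entry codebook (2 bits fixed).
indices = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[s] for s in indices)
fixed_bits = 2 * len(indices)
```

For this stream, frequent indices receive shorter codes, so the total drops from 32 bits with fixed-length indices to 28 bits; the gain grows as the frequency distribution becomes more skewed.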

9) Other formats: For compressing sparse tensors (especially for high-performance and scientific computing), previous techniques have proposed a variety of encoding formats, which can help achieve better storage efficiency or incur low overheads of accessing (or dynamically inserting) NZs from (into) the compressed tensors. For example, previous techniques proposed different formats including compressed sparse blocks (CSB) [190], libsvm [191], ELLPACK [192], doubly compressed sparse column (DCSC) [184], dynamic CSR [193], diagonal (DIA) [194], delta-coded CSR [195], and, mode-generic and mode-specific formats [196].

Prior works including SPARSKIT [179], Chou et al. [173], Bader et al. [169], Vuduc et al. [180], [197], and [198] have surveyed different storage formats and discussed implications of different encoding formats on storage or coding requirements. Different libraries for sparse tensor computations on CPU or GPU platforms including MATLAB tensor toolbox [199], Intel MKL [72], SciPy [176], and cuSPARSE [200] provide support for storing tensors in different formats.

While various compression formats have been used for executing sparse tensors on CPUs or GPUs, as Table V shows, most accelerator systems use a few encoding formats like COO, RLC, bitmap, CSC, or their variants. For instance, accelerators have used RLC or bitmap format for efficiently encoding the tensors with low or moderate sparsity and CSR or CSC for tensors with very high sparsity. In general, depending on the sparsity levels of tensors for domain-specific models, designers of the accelerator systems need to determine the appropriate format that can incur low overhead for storage and encoding and decoding of tensors.

#### B. Group-wise Encoding

One way for encoding a tensor is to process the whole tensor, e.g., flattening it into a vector before encoding in a specific format. In such a scenario, during execution on the accelerator, the data management logic extracts the appropriate data block, (optionally) decodes it, and communicates (encoded) decoded data to different PEs. In contrast, with group-wise encoding, the accelerator system encodes per-PE tensor tiles separately, based on pre-determined per-PE work. Depending on the obtained mapping, each tile is typically communicated to a unique PE during execution. Thus, group-wise encoding technique considers mapping of tensor blocks to PEs and combines the encoding step with the allocation/storage of the tensors for their executions on different PEs. Accelerators including EIE [61], Cambricon-X [62], Cnvlutin [109], CompAct [132], and ESE [102] used group-wise encoding.

Group-wise encoding makes the decoding and data extraction easier, as each block can correspond to execution on a distinct PE (or a PE-group). For example, in the EIE accelerator [61], each PE processed different non-contiguous rows of the output column-vector and the weight matrix. Therefore, for processing on 64 PEs, alternating rows of the weight matrix with a step-size of 64 were grouped together. For instance, if a weight matrix consists of 9216 rows, it is divided among 64 groups; each group contains the data of 144 non-contiguous rows. Each group is then encoded separately using CSC format (with RLC encoding of the row-index pointer vector, as shown in Fig. 5(h)). Similarly, PEs in Cambricon-X [62] processed multiple output activations by storing corresponding columns of the weight matrix. Therefore, the weights corresponding to the execution of each PE were grouped together and encoded using bitmap format. A centralized indexing module decoded the bitmaps for distributing matching activations to each PE. Thus, the group-wise coding mechanism goes in tandem with the mapping of the tensors for their executions on PEs, i.e., it is closely integrated with the dataflow mechanism (that determines which PEs process what data, discussed in section XII-B), fetching of tensors from memory, and communicating tensors between PEs and memory via interconnect.
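The interleaved row grouping described for EIE can be sketched as follows. This is a software illustration under stated assumptions: the function names are hypothetical, each group is encoded in plain CSC (without the RLC-coded indices), and the small weight matrix is made up.

```python
def csc_encode(tile):
    """Encode a dense 2-D list column-wise into (val, idx, ptr)."""
    val, idx, ptr = [], [], [0]
    for c in range(len(tile[0])):
        for r in range(len(tile)):
            if tile[r][c] != 0:
                val.append(tile[r][c])
                idx.append(r)            # row index local to this group
        ptr.append(len(val))
    return val, idx, ptr

def rows_for_pe(num_rows, num_pes, pe):
    """Rows pe, pe + num_pes, pe + 2 * num_pes, ... belong to the same group."""
    return list(range(pe, num_rows, num_pes))

def groupwise_encode(weights, num_pes):
    """Encode the rows mapped to each PE as a separate CSC tile."""
    encoded = []
    for pe in range(num_pes):
        tile = [weights[r] for r in rows_for_pe(len(weights), num_pes, pe)]
        encoded.append(csc_encode(tile))
    return encoded

rows_pe0 = rows_for_pe(9216, 64, 0)      # 144 non-contiguous rows per PE
tiles = groupwise_encode([[1, 0], [0, 2], [0, 0], [3, 0]], num_pes=2)
```

Because each tile is self-contained, a PE can decode its own metadata without consulting the indices of any other group.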

#### C. On-the-fly Encoding

Often accelerator designers target static sparsity of tensors and encode them during off-line processing. For example, accelerators including EIE [61], Cambricon-X [62], and [145] dynamically decoded sparse weight tensors (section IX) but encoded them off-line prior to execution. However, both on-the-fly encoding and decoding are required for efficiently processing tensors that are dynamically sparsified, i.e., sparse activations in the inference of DNNs and tensor computations in the training of the models. Therefore, accelerators including CompAct [132], SCNN [96], NullHop [157], and [95], [109], [147], [201] feature units for on-the-fly encoding. Typically, before encoding tensors on-the-fly, the data is re-organized as per requirements of the group-wise encoding and dataflow mechanism for processing the subsequent layer. So, on-the-fly encoding is often combined with assembling the output data from PEs (section XIV-D provides further details).
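As a simple illustration of encoding dynamically sparsified data, the sketch below applies ReLU to a vector of outputs and emits a bitmap-coded result in the same pass. The function name and the choice of bitmap format are assumptions made for illustration; the cited accelerators use their own formats and hardware units.

```python
def relu_and_encode_bitmap(pre_activations):
    """Apply ReLU and produce (bitmap, NZ values) in a single pass."""
    bitmap, nz_values = [], []
    for x in pre_activations:
        y = x if x > 0 else 0            # ReLU creates the dynamic sparsity
        bitmap.append(1 if y != 0 else 0)
        if y != 0:
            nz_values.append(y)          # only NZs are written back
    return bitmap, nz_values

bitmap, nz_values = relu_and_encode_bitmap([-1, 3, 0, 2])
```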

#### D. Optimization Opportunities

(i) Selecting coding formats that are tailored for sparsity levels: Various DNN layers exhibit a wide range of sparsity for weight/activation tensors (inter-layer and intra-tensor sparsity variation). Moreover, even within a DNN layer, sparsity among tensors can be different (intra-layer inter-tensor sparsity variation). For efficient executions, accelerators need to support such varying sparsity-levels effectively without incurring significant overheads for storage and coding. It is possible that the selected coding format and the corresponding data extraction mechanism may not serve such varying sparsity-levels and may increase the decoding latency or metadata size for storage. This can lead to a considerable performance drop or higher energy consumption. To achieve efficient processing that is customized to application-needs, accelerator designers can first explore the impact of different coding formats on the storage efficiency and computational overheads (mechanisms for lookups and intersections of NZs). Then, coding formats tailored to sparsity levels of tensors can be selected. Moreover, when sparsity of multiple tensors falls into diverse ranges, designers can opt for the separate encoding of different tensors (e.g., [153]). These different sparsity-codings can be utilized for off-chip storage, zero-guarding the PEs, or for reducing the latency of on-chip decoding and obviating the ineffectual cycles to locate intersecting NZs. When using different formats for performance improvements, the accelerator should provide hardware logic for decoding of the tensors that are stored in varying formats (and support for any on-the-fly encoding). Such decoding logic may use existing data intersection mechanisms (e.g., comparators or skip mechanisms), but it will require separate logic for each of the multiple formats to fetch the absolute position of an NZ in a vector/tensor (e.g., different circuitry for decoding tensors encoded with RLC or bitmap).

# IX. EXTRACTION OF INTERSECTING NON-ZEROS FROM TENSORS

Tensors in the on-chip accelerator memory are typically in the compressed format. Therefore, the location of the NZ elements that need to be multiplied should be determined from the corresponding metadata of tensors. Once the matching pair is extracted, the PE can proceed with tensor computations. Identifying effective NZs is the primary step towards eliminating ineffectual computations due to sparsity. CNN accelerators, including ZENA [110], Cambricon-X [62], and Cnvlutin [109], demonstrated that leveraging sparsity of only activations or weights (i.e., skipping the execution cycles spent on processing zero values) can provide about $1.6 \times -2.4 \times$ speedup, and processing only NZs of both the weights and activations can provide more than 3× speedup. For example, by processing unstructured NZs of both the tensors (16b precisions), ZENA [110] achieved about 3.3× speed-up for AlexNet [2] and $4.0\times$ for VGG-16 [91], when compared to a baseline architecture processing dense tensors. This section describes different data extraction mechanisms (Table VI provides a taxonomy), their management in PEs or as central modules, and their trade-offs. Then, it discusses further acceleration opportunities to effectively leverage varying levels of sparsity.

| Target Sparsity | PE Architecture | Function Unit Operation | Accelerators                                  |
|-----------------|-----------------|-------------------------|-----------------------------------------------|
| One Tensor      | Scalar          | MAC                     | [45], [102], [145], [157], [162]              |
|                 | SIMD/Vector     | Sc-Vec-Mul              | [97], [101], [109]                            |
|                 |                 | Vec-Vec-Mul             | [62], [101], [160]                            |
| Both Tensors    | Scalar          | MAC                     | [45], [61], [110], [147], [151], [152], [156] |
|                 | SIMD/Vector     | Sc-Vec-Mul              | [63], [146]                                   |
|                 |                 | Vec-Vec-Mul             | [57], [144], [148]                            |

| Location of Unit for<br>NZ Detection and<br>Data Extraction | Accelerators                                                                           |
|-------------------------------------------------------------|----------------------------------------------------------------------------------------|
| Centralized                                                 | [45], [57], [61], [62], [95], [97], [144], [146], [148], [156], [157], [162]           |
| In-PE                                                       | [61], [63], [96], [101], [102], [109], [110], [145], [147], [151], [152], [155], [160] |

#### A. NZ Detection and Extraction Mechanisms

Depending on whether the function units of PEs multiply a NZ scalar or a vector of NZs with another scalar or vector, Table VI categorizes the corresponding designs for NZ data extraction as (i) MAC operation on scalars, (ii) scalar-vector multiplication, and (iii) vector-vector multiplication.

#### 1) Processing indices of NZ elements of single tensor:

Sc-Vec Mul: Depending on the sparsity level of tensors, only one tensor may be treated as sparse and processed in the compressed format (e.g., activations in Cnvlutin [109] and NullHop [157] or weights in Cambricon-X [62]). In such scenarios, the position of an NZ element in the sparse tensor (offset) can be used to index the other (i.e., dense) tensor to extract the corresponding value. For example, Fig. 7 shows one such mechanism used in Cnvlutin PEs [109] for indexing matching weights of filters. Each Cnvlutin PE consisted of 16 subunits; each subunit featured an activation lane (corresponding to an input channel in convolutions) and 16 multipliers that process the data of 16 different filters from 16 synapse lanes. Each multiplier in a subunit processed a common NZ activation from the neuron lane, obtained the corresponding offset, and used this offset to look up the corresponding weight value in the synapse lane for multiplication. Thus, the execution corresponded to only NZ activations. We categorize such processing through SIMD lanes of a subunit as a scalar-vector multiplication (Sc-Vec Mul) on the function unit. This is because multipliers of a function unit obtain a common NZ scalar and different elements of a vector after look-up from a matrix/tensor. Similar NZ extraction schemes are used by different architectures [101], [145].
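The offset-based look-up can be modeled in a few lines. The sketch below is a functional illustration with hypothetical names and made-up data, not a cycle-accurate model of a Cnvlutin subunit: each NZ activation carries its offset, and every filter lane uses that offset to pick its weight.

```python
def sc_vec_mac(nz_activations, weights_per_filter):
    """Accumulate one output per filter using only NZ activations.

    nz_activations: list of (value, offset) pairs from the sparse tensor.
    weights_per_filter: dense weights, indexed as [filter][offset].
    """
    acc = [0] * len(weights_per_filter)
    for value, offset in nz_activations:         # common scalar per step
        for f, lane in enumerate(weights_per_filter):
            acc[f] += value * lane[offset]       # the offset indexes the dense tensor
    return acc

# Dense activations [0, 3, 0, 2] stored as (value, offset) pairs.
outputs = sc_vec_mac([(3, 1), (2, 3)],
                     [[1, 2, 3, 4],
                      [0, 1, 0, 1]])
```

The inner loop over filters corresponds to the SIMD lanes: all lanes share the same activation and offset, and differ only in the weight they fetch.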

Sc-Sc MAC: Scalar PEs (e.g., in [145]) perform MAC operations in a way similar to SIMD processing. For example, consider the activation lane and filter lane 0 of subunit 0 in Fig. 7, which can be visualized as processing on a scalar PE. When the activation lane provides an NZ value, the matching weight can be looked up and provided to the multiplier or MAC unit.

Fig. 7. Data extraction in a subunit of Cnvlutin PE [109] (Figure adopted from [109]).

If a tensor is compressed using a COO-like format (e.g., COO-1D described in section VIII), then the absolute position of NZ values can be directly decoded from the metadata. Other formats (e.g., RLC, bitmap vector) can aggressively compress the metadata (e.g., reduce its bit-width) by relatively indexing the position of NZs. But then, the absolute position of an NZ element may not be directly available from metadata, for indexing another tensor, and it needs to be computed explicitly. For example, architectures like Cambricon-X [62] or [145] used similar NZ extraction schemes but required decoding of the metadata (encoded in bitmap vector, CSC, or RLC format) through simple combinational logic consisting of AND gates, multiplexers, and adders. Section VIII summarized such overheads for decoding different compression formats.
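The explicit computation of absolute positions can be sketched as follows. The helper names and the data are hypothetical; the running addition stands in for the adders mentioned above, and the final gather stands in for the multiplexer-based look-up into the dense tensor.

```python
def positions_from_bitmap(bitmap):
    """Absolute positions of the set bits in a bitmap."""
    return [i for i, bit in enumerate(bitmap) if bit]

def positions_from_runs(runs):
    """Absolute positions from run lengths (zeros preceding each NZ)."""
    positions, pos = [], -1
    for run in runs:
        pos += run + 1                   # skip `run` zeros, then land on the NZ
        positions.append(pos)
    return positions

def gather(dense, positions):
    """Select the dense-tensor elements that pair with the NZs."""
    return [dense[p] for p in positions]

# The same sparse vector (NZs at positions 1, 4, 5) in two relative encodings.
from_bitmap = positions_from_bitmap([0, 1, 0, 0, 1, 1])
from_runs = positions_from_runs([1, 2, 0])
matched = gather([10, 11, 12, 13, 14, 15], from_bitmap)
```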

Fig. 8. Data extraction via central indexing module in Cambricon-X [62] accelerator. The indexing module decodes weights encoded in step-indexed COO-1D format to obtain the absolute positions of NZs. Then, it extracts the activations via a parallel look-up, which are later communicated to a PE via fat-tree NOC for a vector-vector multiplication. (Figure adopted from [62].)

Vec-Vec Mul: In some accelerators (e.g., Cambricon-X [62]), each PE houses several multipliers and an adder tree, for performing a vector-vector multiplication at every cycle. Such architectures use data extraction logic (either in each PE or on/alongside a global controller), with multiplexers for a parallel look-up [62], [101]. Based on offsets of NZs (positions in a sparse tensor), a combinational logic with multiplexers can select data elements corresponding to multiple NZs, which are then fed to the multipliers. Fig. 8 shows one such mechanism used by Cambricon-X. One of the challenges in acceleration through such data extraction designs is the overhead of parallel look-up. In particular, when sparsity of a tensor is high, larger multiplexers need to be used for indexing the dense tensor, since the positions of NZs are distant in the scattered data. Cambricon-X performed a sensitivity analysis for area and power costs when the search length for multiplexers was varied from 32 (50% sparsity for fetching 16 elements) to 512 (96.88% sparsity). With the length set as 256, the indexing module in Cambricon-X [62] occupied about 31% and 35% of total on-chip area and power, respectively, and its power consumption exceeded the total power consumption of all 16 PEs (each with 16 multipliers and an adder tree).
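The quoted pairs of search length and sparsity are consistent with a simple relation. As a reading of these numbers (an assumption, not a formula stated in [62]), if NZs are spread evenly at sparsity $s$, then fetching $n$ NZs requires a window of about $L$ elements:

$$L \approx \frac{n}{1 - s}, \qquad \frac{16}{1 - 0.5} = 32, \qquad \frac{16}{1 - 0.9688} \approx 512.$$

Under this reading, the multiplexer width grows inversely with the density of NZs, which explains why the look-up cost rises sharply for highly sparse tensors.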

#### 2) Compare metadata of both tensors for extracting the matching pair of NZs:

Sc-Sc MAC: Tensors with moderate or higher sparsity can be altogether processed on-chip in compressed format. Therefore, accelerator architectures use a data extraction logic which processes the compressed data being streamed in and obtains pairs of NZs (intersections) that need to be multiplied and accumulated. This NZ extraction circuit consists of one or more comparators (or AND gates for boolean metadata) and an additional indexing logic. For example, architectures like ZENA [110], CoNNA [147], and SparTen [151] use NZ extraction logic in the PE pipeline. In their NZ extraction logic, one or more comparators match the positions of NZs in both the tensors. The NZ indexing logic utilizes the comparison outputs to extract the leading pair of intersecting data. Due to varying levels of the sparsity of both tensors, the indices of NZs may not match for the elements being compared. Therefore, the detection logic typically features several comparators for searching within a large window (e.g., search length of 16 elements in CoNNA). Data extraction via a large search-window or multiple comparators usually provides at least one pair at every cycle for performing multiplication on a multiplier, even when there are some mismatches between indices of NZ values due to varying sparsity of tensors. When using multiple comparators, priority encoders are used to obtain the leading n-pairs of matching data for feeding a total of n multipliers (n=1 for scalar PEs). Note that when intersection mechanisms in the PEs determine the matching pairs by searching in the metadata lanes, they can use skip mechanisms (e.g., in ExTensor [152]) to quickly navigate through the lanes.
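The index comparison can be sketched in software as an intersection of two sorted index lists. The sketch below is a sequential analogue with hypothetical names and made-up data: the first function compares one pair of indices per step, and the second one jumps ahead with a binary search, loosely mirroring the idea of skipping through a metadata lane rather than reproducing the hardware of [152].

```python
from bisect import bisect_left

def intersect_sorted(idx_a, val_a, idx_b, val_b):
    """Return value pairs whose indices match; one comparison per step."""
    i = j = 0
    pairs = []
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            pairs.append((val_a[i], val_b[j]))
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1                       # mismatch: advance the lagging lane
        else:
            j += 1
    return pairs

def intersect_with_skip(idx_a, val_a, idx_b, val_b):
    """Same result, but jump directly to the first index that can still match."""
    i = j = 0
    pairs = []
    while i < len(idx_a) and j < len(idx_b):
        a, b = idx_a[i], idx_b[j]
        if a == b:
            pairs.append((val_a[i], val_b[j]))
            i += 1
            j += 1
        elif a < b:
            i = bisect_left(idx_a, b, i + 1)
        else:
            j = bisect_left(idx_b, a, j + 1)
    return pairs

pairs = intersect_sorted([0, 2, 5], [1, 2, 3], [2, 3, 5], [4, 5, 6])
dot = sum(x * y for x, y in pairs)
```

Every mismatching step in `intersect_sorted` corresponds to a comparison that yields no multiplication, which is the inefficiency that wider search windows and skip mechanisms aim to reduce.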

To obtain intersecting NZs from the two sparse and encoded tensors, PEs are often designed with multi-stage logic to decode the tensors and extract the data. The first stage obtains the index of an NZ from one tensor. The later stage checks whether there is a corresponding NZ in the other tensor and extracts the NZ value when the indices match. Depending on the sparsity and distribution of zeros in tensors, the later stage occasionally cannot find the matching data, which wastes execution cycles (e.g., subsequent cycles) of the function units in the pipeline. In other words, although accelerators can reduce ineffectual computations and achieve the majority of the performance gain by processing only NZs, their pipeline design can still lead to a fraction of the execution time in which PEs do not perform useful computations for producing outputs. For example, in EIE [61], each PE loads an activation value from its activation queue; when a PE does not have any weights to process corresponding to the activation, it fetches the next activation value from the queue in the next cycle.

Sc-Vec Mul: PEs in EyerissV2 [63] use multi-stage logic for data extraction. Each SIMD PE fetches a CSC-coded activation, extracts the location of the NZ, and checks for matching indices of NZ weights. Upon a match, it forwards the NZ activation and two NZ weights to two MAC units.

Fig. 9. Associative index matching in SNAP (Figure adopted from [148]).

Vec-Vec Mul: Several architectures including Cambricon-S [57] and SNAP [148] feature vector PEs. Each PE consists of a multiplier array and an adder tree. Data extraction logic in such accelerators uses a comparator array (similar to scalar PEs). Additionally, it uses priority encoders or multiplexers that extract and supply multiple intersecting pairs of NZs to feed the multipliers. For example, in the SNAP architecture [148], an associative index matching module (AIM, Fig. 9) determines the positions of NZs in case of valid matches. Each PE of a row is interfaced with a shared AIM. By using comparison outcomes from the AIM, a sequencer in each PE determines the leading pairs of intersecting data, which are then fed to three multipliers within the PE. Similar hardware logic is used by Cambricon-S [57]. It processes bitmaps of both tensors to index the data, extracts pairs of intersecting NZs, and executes vector-vector multiplication on PEs.

3) Eliminating hardware logic for detecting the intersecting NZs: Different accelerators leverage structured or fine-grained sparsity, but do not require either detection of NZs or extraction of intersecting NZs. Now we describe such techniques based on their design objectives.

Orchestrating structured computations: A few techniques focused on the high sparsity of a single tensor (e.g., weights in CNNs) and proposed data pruning or data combining approaches such that each PE received a dense region consisting of NZ elements. For example, ERIDANUS [107] proposed a pruning algorithm that clustered weights into locally dense regions. Then, the groups of dense elements were dispatched to PEs of systolic arrays for processing in a conventional manner. With high regularity achieved in processing MACs on PEs (about 90% of the throughput for processing dense tensors), ERIDANUS eliminated the need for processing weights in compressed format. Similarly, adaptive tiling [163] proposed a column-combining approach. NZ weights to be stored in PEs of the systolic array were statically combined such that each column of the systolic array could process multiple columns of input activations. Thus, it obviated the need for detecting NZ weights and reduced the total invocations of the systolic array by $2\times$ to $3\times$ for processing point-wise convolution layers of MobileNet [52]. Techniques like CirCNN [202] and C-LSTM [203] proposed accelerating tensors of DNN layers as FFT (Fast Fourier Transform) operations on smaller block-circulant matrices.

Coordinate computation unit: Accelerator architectures including SCNN [96] and SqueezeFlow [95] use coordinate computation units (within PEs and shared among PEs, respectively). These accelerators perform a Cartesian product, so all elements of both matrices (tensor blocks) are multiplied with each other. Index computation is still required to determine which partial-sum values should be accumulated with the partial products. This calculation is performed in a "coordinate computation unit" that processes the metadata of tensors (indices of NZ values) and determines the indices for accumulation. These approaches require conflict detection in hardware, since it cannot be pre-determined which accumulators will be accessed in any cycle. The coordinate computation unit facilitates the processing of both tensors directly in compressed format. For example, SCNN PEs feature multiplier arrays that perform the Cartesian product of NZ data from both input tensors. Due to the all-to-all multiplication mechanism, no special hardware support is required for extracting intersecting pairs of NZs. For performing convolutions, Cartesian products result in correct functionality due to appropriate accumulations directed by coordinate computation units.
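To make the role of coordinate computation concrete, the following is a minimal sketch of a Cartesian-product convolution over NZ operands. It assumes a single-channel, stride-1, unpadded 2D convolution and coordinate-keyed dictionaries as the compressed format; the function name `cartesian_conv2d` is hypothetical, and the sketch does not model SCNN's actual encoding, tiling, or accumulator banking.

```python
from collections import defaultdict


def cartesian_conv2d(acts, wgts, H, W, R, S):
    """Convolve NZ activations with NZ weights via an all-to-all product.

    acts: {(y, x): value} for NZ activations of an H x W input.
    wgts: {(r, s): value} for NZ weights of an R x S filter.
    Returns {(oy, ox): partial sum} for outputs that received a product.
    """
    out = defaultdict(float)
    for (y, x), a in acts.items():
        for (r, s), w in wgts.items():  # every NZ pair is multiplied
            oy, ox = y - r, x - s  # coordinate computation for accumulation
            if 0 <= oy <= H - R and 0 <= ox <= W - S:  # drop out-of-range products
                out[(oy, ox)] += a * w
    return dict(out)


acts = {(0, 0): 1.0, (1, 2): 2.0, (2, 1): 3.0}
wgts = {(0, 0): 1.0, (1, 1): -1.0}
print(cartesian_conv2d(acts, wgts, H=3, W=3, R=2, S=2))
```

No index matching is needed because every product is valid for some output position (or is discarded at the boundary); the cost moves to computing `(oy, ox)` and to routing products that target the same accumulator.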

Skipping processing of zeros for energy-efficiency: Energy-efficient accelerator designs including Eyeriss [34], Thinker [164], and Minerva [112] use zero detection mechanisms for clock gating the datapath in the PE pipeline. PEs in these accelerators check whether the tensor element being read is zero (or compare the value with a threshold), and based on the comparator output, their clock gating logic prevents the MAC datapath from switching in the following cycle. Skipping the execution of MAC operations can provide significant energy savings. For example, in the Eyeriss accelerator [34], data gating logic reduced the PE power consumption by 45% compared to a PE design without gating logic.
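A software analogue of zero gating is sketched below. It assumes that gating is decided on the activation operand alone and that a gated MAC still occupies its cycle (energy is saved, not time); the function name `gated_dot` and the `threshold` parameter are illustrative, not taken from the cited designs.

```python
def gated_dot(acts, wgts, threshold=0.0):
    """Dot product that skips the MAC when |activation| <= threshold.

    Returns (result, number_of_gated_macs). The gated count is a proxy for
    datapath switching that clock gating would avoid.
    """
    total = 0.0
    gated = 0
    for a, w in zip(acts, wgts):
        if abs(a) <= threshold:  # zero (or near-zero) detection
            gated += 1           # MAC datapath stays idle for this operand
            continue
        total += a * w
    return total, gated


print(gated_dot([0, 2, 0, 3], [5, 1, 7, 2]))
```

With a nonzero threshold the result becomes approximate, which is the trade-off implied by comparing values against a threshold rather than against zero.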

#### B. Centralized vs. Distributed Management

1) Centralized: Hardware logic for NZ detection and extraction of the matching data can be either centralized and shared among many PEs or accommodated within the PE pipeline. Centralized management architectures have been used by Cambricon-X [62], Cnvlutin2 [146], and CASA [162]. The advantages of such shared mechanisms are: (i) PEs can be directly provided effective NZ data; PEs are engaged only in performing useful computations [61], [62]. Such a data extraction mechanism can be beneficial as a pre-processing unit appended to the PE-array for efficiently processing structured computations, e.g., systolic arrays or near-data accelerator designs. (ii) Although some architectures use a centralized mechanism, they duplicate the hardware logic for decoding compressed tensors or extracting the data, e.g., Cambricon-X [62] and CASA [162], which typically brings higher area and power consumption costs. However, if the same data is multicast to PEs or additional logic is used for time-sharing the data extraction module, then a single data extraction module can be used by multiple PEs, e.g., in Cambricon-S [57]. Such a time-shared implementation requires less area and power than duplicating the hardware logic in the pipeline of all PEs. (iii) Since centralized data extraction logic is coupled with the controller that orchestrates the allocation of tensors among PEs, it is more amenable to enabling runtime load balancing. However, one major challenge in the central data extraction mechanism is to maintain spatial data reuse. This is because, in most cases, the central mechanism extracts the data on a per-PE basis, and the extracted data are then off-loaded to PEs (typically via unicast networks). Since the matching data pairs are extracted per PE beforehand, the data values that are common among scattered PEs can no longer be multicast, thereby missing the opportunity for spatial reuse. Section XI provides a detailed discussion on a variety of communication networks and their trade-offs.

2) In-PE: NZ detection mechanisms or data extraction logic in the pipeline of PEs are also widely used in many accelerator architectures including ZENA [110], Cnvlutin [109], and EyerissV2 [63]. Such in-PE hardware logic allows the central controller to multicast or broadcast elements of some tensor, exploiting the spatial data reuse. This is because hardware logic in the PE can then match the data with appropriate values. For example, in the Cnvlutin architecture [109], the same set of activations is broadcast to different PEs, which are then matched with corresponding weights by the data extraction logic. Similarly, the EyerissV2 accelerator [63] exploits spatial reuse of both activations and weights via row-stationary dataflow [83] and processes the computations on appropriate data by using in-PE data extraction logic. However, the challenges in such mechanisms are: (i) in-PE logic may cause ineffectual computation cycles for extracting NZ values, and (ii) employing inter-PE load-balancing in the hardware may be infeasible, since the compressed tensors are mostly treated as dense data, and the actual work carried out by different PEs is unknown at the time of offloading computations to PEs (in fact, unknown until the data values are processed in the PE datapath).

#### C. Optimization Opportunities

Based on the sparsity levels in application domains, in-PE or central data extraction mechanisms, and the PE pipeline architecture, a few optimization opportunities may be explored to achieve more efficient acceleration:

(i) Sparsity-adaptive low-cost data extraction mechanisms: Encoding formats for sparse tensors are often selected with a focus on storage benefits. However, the computational overhead and hardware cost of encoding and decoding tensors should also be reduced, since they can considerably increase execution latency and energy consumption. For instance, the decoding logic, at every cycle, must locate at least n pairs of intersecting NZs from two tensors and feed them to function units (n multipliers) for dataflow-driven execution of PEs. Otherwise, the speedup or the utilization of function units in PEs drops. Varying levels of sparsity of tensors present additional challenges for sustaining the acceleration because different data extraction schemes may be cost-effective for only a certain range of sparsity. For example, when both tensors exhibit similar (e.g., moderate) sparsity, an extraction circuit that uses a few comparators would be effective, because it may easily locate a pair of intersecting NZs. However, when one tensor exhibits high sparsity and the other is dense, different hardware logic, which fetches the indices of NZs from the compressed (sparse) tensor and indexes the dense tensor with the appropriate metadata, may be more effective. Moreover, when sparsity levels among tensors vary considerably, the data extraction logic needs to use a large comparator array or several multiplexers for parallel lookup, so that it can extract at least one pair of intersecting NZs for each multiplier in a PE. Therefore, the data extraction module needs to be configurable or use multiple extraction mechanisms so that it can efficiently process tensors with varying sparsity at modest hardware cost.

Fig. 10. Data reuse opportunities for executing different CNN layers (dense tensors) on hardware accelerators (Figure inspired by [63]).

To achieve such objectives, designers can explore decoding modules that use combinations of different features: (i) multiple comparators for locating more intersections, (ii) both indexing-based and intersection-based logic for leveraging sparsity of one or more tensors, and (iii) skip-index mechanisms (e.g., in ExTensor [152]) along with the intersections through comparator arrays. Whenever possible, the module can be configurable to select a subset of features for the desired sparsity levels, with the remaining logic power-gated for energy efficiency. While multiple features can sustain or improve performance or energy efficiency, they lead to high area costs. Therefore, such trade-offs should be explored systematically to meet the target hardware budget while effectively leveraging varying sparsity.

(ii) Tighter integration with load balancing mechanisms: Several accelerator architectures use a central data extraction module as a pre-processing unit before communicating data to PEs. Such a centralized module offers an additional opportunity for dynamic load balancing of work among PEs. As section XIII discusses, inter-PE imbalance can be severe due to the irregular distribution of NZs in the tensor blocks that are allocated to PEs. Consequently, accelerators attain low speedup and ineffective utilization of PEs. Although a few accelerators recently used hardware modules to dynamically balance the work among PEs, more efficient designs may be achieved by enhancing the central/shared data extraction module to balance the work. This is because such a module already keeps track of the data provided to PEs, which reveals the number of operations to be performed by different PEs. Therefore, it may be feasible to use additional low-cost hardware logic in such central modules for run-time balancing of compressed tensor computations (e.g., data-driven dynamic work dispatch in GraphDynS [204]).
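As an illustration of how NZ counts known to a central module could drive work dispatch, the following sketch applies a generic greedy heuristic (largest block first, to the least-loaded PE). This is an assumption made for illustration; it is not the dispatch mechanism of GraphDynS [204] or of any accelerator discussed here, and the function name `dispatch` is hypothetical.

```python
import heapq


def dispatch(nz_counts, num_pes):
    """Assign tensor blocks to PEs so that NZ work is roughly balanced.

    nz_counts[i] is the number of NZs (i.e., effectual work) in block i.
    Returns (assignment, loads): block indices per PE and total NZs per PE.
    """
    heap = [(0, pe) for pe in range(num_pes)]  # (current load, PE id)
    heapq.heapify(heap)
    assignment = {pe: [] for pe in range(num_pes)}
    # Place the heaviest blocks first, each on the currently least-loaded PE.
    for block, nz in sorted(enumerate(nz_counts), key=lambda t: -t[1]):
        load, pe = heapq.heappop(heap)
        assignment[pe].append(block)
        heapq.heappush(heap, (load + nz, pe))
    loads = [sum(nz_counts[b] for b in assignment[pe]) for pe in range(num_pes)]
    return assignment, loads


assignment, loads = dispatch([9, 1, 1, 1, 8, 2, 2, 2], num_pes=2)
print(assignment, loads)
```

For this example, a static split of the first four blocks to one PE and the last four to the other gives loads of 12 and 14 NZs, whereas the greedy dispatch gives 13 and 13.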

#### X. MANAGING DATA IN SHARED MEMORY

Hardware accelerators feature multi-banked scratchpad memories or buffers (SRAMs), which are shared among PEs [34], [62], [110]. The scratchpad can be either unified (i.e., contains data of multiple tensors) [38] or divided into separate buffers to store various tensors [110], [132]. For data management, accelerators typically employ on-chip memories with sizes varying from several tens of KBs [57], [205], [206] (e.g., 192 KB in EyerissV2 [63]) to a few MBs [36], [75]. Effective data management is required to maximize the reuse of data from the scratchpads/buffers, which reduces costly accesses to lower-level memory, and to hide the memory access latency behind computations on PEs. This section discusses how sparsity and reduced dimensions of tensors lower the data reuse. Sparsity lowers the reuse because many activations and weights are zero and not processed for updating output activations. However, these compressed tensors help achieve better speedups and energy efficiency because more tensor blocks (e.g., tensors of the entire layer) fit on-chip and the latency of accessing the compact tensor blocks is significantly reduced. Additionally, sparsity can make bank management of the memory challenging due to unstructured accesses, e.g., for arbitrating output activations. This section also discusses the opportunity for reusing intermediate output tensors via fused-layer executions and how sparsity affects it.

#### A. Leveraging Data Reuse Opportunities

1) Reuse characteristics: Depending on the functionality of model layers, different tensors (input activations, weights, partial summations) exhibit significant reuse. Fig. 10 depicts such reuse opportunities for processing different CNN layers (early convolution layers, last convolution layers, MLPs, depth-wise layers, point-wise layers, etc.). For each tensor, the data reuse is calculated as the total number of MACs per data element [63]. Reuse factors are plotted on a logarithmic scale to show values that vary across a wide range. Layers are also plotted on a logarithmic scale to visualize them better for both shallow and deep models. Input activations: The figure shows that, in general, the reuse of input activations increases with depth in the model because the total number of filters increases significantly (e.g., from 64 to 2048 in ResNet-50). Depth-wise layers are an exception and present very low reuse because they process only a single filter. FC layers or MLPs present high reuse of input activations, which depends on the sizes of weight matrices (i.e., on the sizes of the output tensors). Weights: Since 2D feature maps (height/width) are usually much larger than 2D weights, weight reuse can be higher by an order of magnitude (e.g., in VGG-16, weight reuse of layer #1 vs. high input reuse in later layers). Deeper in CNNs, feature maps shrink spatially, which lowers the reuse of weights. FC layers or MLPs do not present any weight reuse, as each weight value is used only once when processing a vector of input activations (no batching). Increasing the batch size linearly improves weight reuse. Partial summations: Going deeper in CNNs increases the input channels of layers, which improves the reuse of partial summations. FC or MLP layers also feature high reuse of partial summations due to larger input vectors. Depth-wise convolution layers show very low reuse because partial summations are not accumulated across the input channels. Video processing applications use 3D CNNs (e.g., c3d [22]), which can further increase the reuse opportunities [207] for input activations and weights due to additional processing steps on consecutive frames.

Fig. 11. Impact of sparsity on data reuse opportunities for accelerating CNN layers.
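The reuse metric (MACs per data element) can be computed directly from layer dimensions. The sketch below assumes a standard dense convolution with a batch size of one; the function name `conv_reuse` and the example layer shapes are hypothetical and are not the data behind Fig. 10.

```python
def conv_reuse(C, M, H, W, R, S, stride=1, pad=0):
    """Reuse factors (MACs per element) for a dense convolution layer.

    C/M: input channels / filters, H x W: input feature map,
    R x S: filter size. An FC layer can be modeled with H = W = R = S = 1.
    """
    E = (H + 2 * pad - R) // stride + 1  # output height
    F = (W + 2 * pad - S) // stride + 1  # output width
    macs = M * C * R * S * E * F
    return {
        "macs": macs,
        "input": macs / (C * H * W),       # MACs per input activation
        "weight": macs / (M * C * R * S),  # equals E * F
        "psum": macs / (M * E * F),        # equals C * R * S
    }


print(conv_reuse(C=64, M=64, H=56, W=56, R=3, S=3, pad=1))
print(conv_reuse(C=4096, M=1000, H=1, W=1, R=1, S=1))  # FC layer, no batching
```

The second call shows the FC case from the text: weight reuse is exactly 1, while input reuse equals the number of outputs and partial-sum reuse equals the input vector length.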

2) Impact of sparsity on reuse: Increasing sparsity of tensors can lead to lower reuse of the data operands. To determine the impact of sparsity, we considered evaluations by Han et al. [40] and obtained effectual MACs (%) and NZs in input activations and pruned weights (%) for AlexNet [2] and VGG-16 [91]. Then, we calculated the reuse as NZ MACs per NZ in a tensor. Fig. 11 plots the reuse opportunities for both dense and sparse tensors of CNN layers. It shows that while the reuse characteristics discussed above are preserved (e.g., increase/decrease in the reuse of activations/weights as we go deeper), the reuse factor decreases for almost all layers and all three tensors, as compared to processing dense tensors. Primarily, this is due to the reduced number of effectual MACs. For example, for FC layers or MLPs, weight reuse can drop below one. This means that even though a weight matrix contains NZs, some of them are never used due to the unavailability of matching NZs in the input activation tensor. Similarly, the reuse of partial summations decreases because the number of effectual MACs per partial summation decreases with sparsity. Note that each output activation element still needs to be populated or assembled before ReLU/encoding. It is important to note that even if sparsity reduces the reuse, the available reuse can still be high (e.g., from $10^2$ to $10^4$) and should be leveraged for efficient accelerations.
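The relation between dense and sparse reuse can be written compactly. In the expression below, $\rho_{\text{MAC}}$ (fraction of MACs that remain effectual) and $\rho_{\text{NZ}}$ (fraction of elements of the tensor that are NZ) are symbols introduced here for illustration only:

$$
\text{reuse}_{\text{sparse}} = \frac{\text{NZ MACs}}{\text{NZs in the tensor}} = \text{reuse}_{\text{dense}} \times \frac{\rho_{\text{MAC}}}{\rho_{\text{NZ}}}
$$

As a hypothetical example, an FC layer without batching has a dense weight reuse of 1; if 10% of its weights are NZ but only 3% of the MACs are effectual, the weight reuse becomes $1 \times 0.03 / 0.10 = 0.3$, i.e., below one.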

3) Temporally reusing data through shared on-chip memory: Data reuse can be leveraged temporally (repeatedly accessing the data from memory without accessing lower-level memory) and spatially (providing the same data to multiple PEs without repeatedly accessing memory). Like CPUs, accelerators have memory hierarchies because applications have different working set sizes. With high temporal reuse, the largest share of energy is spent in upper-level buffers [38], [83].

#### B. Hiding Miss Latency Behind Computations

1) Management of tiled data in double-buffered memory: On-chip buffers are typically not large enough to accommodate all tensors, e.g., larger feature maps or weights of deep models. Therefore, execution of the loops is tiled for reusing certain tensors from the scratchpad, while repeatedly accessing other tensors from the off-chip memory [34], [38], [96]. Since scratchpads are non-coherent and their management is software-directed, data needs to be transferred between off-chip memory and scratchpads by programming the direct memory access (DMA) controller [36], [74].

For effective processing on the accelerator, it is crucial to keep PEs engaged in useful computations by interleaving computations with memory accesses. This objective is usually achieved by double-buffering the scratchpads (aka ping-pong buffers) [110], [208], [209]. Loop optimization techniques for dataflow execution, like tiling and ordering of the loops, can determine the sizes of tensor blocks to be managed in the memory and the sequence of memory accesses such that tensors can be highly reused from the memory with reduced accesses to lower-level memory [38], [74], [80], [83], [96].
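The benefit of double buffering can be estimated with a simple timing model. The sketch below assumes that the DMA load of tile i+1 fully overlaps with the computation on tile i and that write-back time is ignored; the function name `total_cycles` and the cycle counts are hypothetical.

```python
def total_cycles(load, compute, double_buffered=True):
    """Estimate execution cycles for a sequence of tiles.

    load[i] / compute[i]: cycles to fetch / process tile i.
    With double buffering, the fetch of tile i+1 overlaps with the
    computation on tile i, so each step costs the slower of the two.
    """
    if not double_buffered:
        return sum(load) + sum(compute)
    cycles = load[0]  # the first tile cannot be hidden
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        cycles += max(compute[i], next_load)
    return cycles


print(total_cycles([4, 4, 4], [6, 6, 6], double_buffered=False))  # serialized
print(total_cycles([4, 4, 4], [6, 6, 6]))  # compute-bound: loads are hidden
print(total_cycles([8, 8, 8], [2, 2, 2]))  # memory-bound: loads dominate
```

In the compute-bound case only the first load is exposed, whereas in the memory-bound case the total is dominated by the load time, which is the situation described for FC layers in the next paragraph.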

2) Impact of sparsity on access latency and speedup: For memory-bounded layers (e.g., FC layers), even with effective prefetching schemes, the miss penalty may be significant, which prevents accelerators from achieving peak performance [36]. When tensors are sparse, the amount of data that needs to be transferred from off-chip memory reduces significantly, leading to substantial performance gains. For example, Zhou et al. [57] performed a sensitivity analysis by varying the sparsity of weights or activations and measuring the speedup of the Cambricon-S accelerator over executions with dense tensors. On convolution layers, the maximum speedup of Cambricon-S approached the ideal speedup ($15.5\times$ vs. $16\times$ for weights, and at most $3.9\times$ for activations). The upper bound stems from the ability of the modules to extract up to 16 NZ activations out of 256 (leveraging up to 93.75% sparsity of weights) and 16 NZ weights out of 64 (leveraging up to 75% sparsity of activations), respectively, at every cycle. While compressed tensors lowered memory traffic, double buffering helped Cambricon-S to achieve peak performance by hiding the DMA latency behind computations [57]. In fact, for executing fully-connected layers with sparse tensors, accelerators can achieve even super-linear speedup, as compared to execution with dense tensors. For example, for very high sparsity (99% zero weights) of FC layers, Cambricon-S [57] achieved about $59.6\times$ speedup. It was about $3.7\times$ higher than the ideal speedup, i.e., the bound of $16\times$ for processing weights with 93.75% or higher sparsity. However, higher sparsity of activations did not bring a performance gain for FC layers (saturated at about $14\times$), since the total execution time was dominated by the latency of accessing weights.
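The bounds quoted above follow from the ratio between the search window and the number of elements extracted per cycle. The short sketch below reproduces that arithmetic; the function name `extraction_bound` is hypothetical, and the model assumes that compute throughput is limited only by the extraction module.

```python
def extraction_bound(window, lanes):
    """Return (max exploitable sparsity, ideal speedup) for an extraction
    module that selects `lanes` NZ elements out of a `window` of elements
    per cycle."""
    max_sparsity = 1 - lanes / window
    ideal_speedup = window / lanes
    return max_sparsity, ideal_speedup


print(extraction_bound(256, 16))  # 16 out of 256 elements per cycle
print(extraction_bound(64, 16))   # 16 out of 64 elements per cycle
print(round(59.6 / 16, 1))        # reported FC speedup relative to the 16x bound
```

The first two calls give the 93.75% / 16x and 75% / 4x figures, and the last line gives the ratio of about 3.7 between the reported FC-layer speedup and the compute-side bound; the excess comes from reduced off-chip traffic rather than from skipped computations.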

3) Asynchronous communication: Some accelerators use asynchronous communication mechanisms without strictly enforcing double-buffering. In other words, without separate partitioning of the memory, the PE-array and DMA controller may simultaneously produce/consume the data from the same physical buffer, either through different memory banks or at the granularity of small memory blocks in the same bank. For example, the Eyeriss accelerator [34] uses an asynchronous interface for communicating the data between the global buffer and DRAM. It facilitates pre-loading of the tensors in unused banks of the buffer for the next execution pass. The Cambricon-X [62] accelerator uses an asynchronous communication mechanism for pre-loading the compressed weights from DRAM into the memory of each PE. For a short assigned period, the memory access port is allocated to only one PE for fetching several chunks of weights via DMA transfer. Such a mechanism allows PEs to load some NZ weights which can be used later for performing computations. Depending on the prefetching interval and unstructured sparsity, each PE may be engaged in (asynchronous) computations throughout most of the execution cycles. Note that for the prefetching/writeback mechanisms, especially when done asynchronously in hardware/software, designers need to ensure that (i) the PEs stall computations until the data movement for the dependent data completes, and (ii) new data overwrites upper-level memory only after the PEs consume the previous data; likewise, previous outputs are read out of upper-level memory before PEs overwrite them with new data.

#### C. Management of Multi-Bank Memory

1) Multi-bank memory: On-chip memories are typically partitioned into smaller banks [38], [208], [209]. After obtaining the mapping of the layer onto the accelerator, each bank is usually allocated to only one tensor. Banked buffers provide multiple read and write ports, allowing simultaneous accesses to different tensors stored in different banks [34], [208]. Although single-bank memory can be easier to manage, it is typically infeasible to provide multiple ports for the PE-array with just one bank [210]. Moreover, multi-port unified memory incurs very high power consumption and longer latency. This is because the single large memory array requires long word-lines and bit-lines that have higher capacitance, which yields long delay and high power consumption [211].

2) Concurrent accesses to memory banks: The banked organization can yield low-latency and energy-efficient data transfers, while providing sufficient bandwidth for concurrently accessing multiple tensors via different ports. For example, in the Eyeriss architecture [34], the global buffer (108 KB) consists of 27 banks, where each bank is a $512b \times 64b$ SRAM (4 KB); two banks are allocated to filters and each of the remaining 25 banks can be allocated to either input feature maps or partial summations [34]. In some accelerator designs [36], [132], tensor elements need to be accessed simultaneously from multiple banks for feeding them into different rows/columns of PEs. Such an arrangement requires a data layout reorganization after loading the data from DRAM (or before writing the data back to DRAM) [212], which can incur additional overhead in terms of execution time and energy. However, with sparse and compressed tensors, such overhead becomes low, since the memory traffic for off-chip accesses is reduced notably. Moreover, a few accelerator designs use data encoding logic to transfer the compressed data between on-chip memory and DRAM [34], [132]. In such scenarios, overheads arising from data reorganization can be reduced by integrating the corresponding logic with the hardware logic for data encoding.

3) Arbitration and conflict management: Depending on the design of the interconnect between the memory and PEs, managing the application data may require additional compilation support or hardware logic for data arbitration and conflict management [96], [201]. When memory access patterns are regular (e.g., for structured computations on dense tensors), bank allocation and accesses to the banks can be determined after obtaining optimized mappings (dataflows) of layers onto the accelerator hardware. However, processing sparse computations can lead to arbitrary accesses to the banks and require special support. For example, in processing sparse tensors, PEs may produce unstructured outputs that need to be written to different banks of the memory. Moreover, accelerators can feature accumulator buffers [96]. So, PEs or their function units (e.g., multiplier array) are connected with the memory banks via a crossbar that arbitrates the write-back of output values to the appropriate bank [95], [96]. The crossbar can require higher bandwidth and significant on-chip area. For example, in the SCNN architecture [96], a $16\times32$ crossbar occupied about 21% of the total on-chip area for connecting 16 multipliers of each PE with 32 banks in the PE's local memory. Further, as PEs or their multipliers process different numbers of NZs, they may generate partial outcomes corresponding to non-contiguous elements in the output tensors. Therefore, when these partial outcomes are directed to memory banks, bank conflicts are possible, i.e., multiple outputs correspond to the same memory bank [96], [201]. To reduce bank conflicts, accelerators employ more banks in the on-chip buffer (e.g., $2 \times N$ banks for storing the output values from N sources [96]). This helps alleviate collisions in hashing the irregular output values into different memory banks.

Depending on the look-up mechanism for obtaining NZ values for PEs and the interconnect between PEs and memory, requests for indexing the data can be irregular and may require arbitration logic. In the NullHop accelerator, eight controllers are shared among 128 MAC units that have dedicated memory banks for weights. When a controller receives an NZ activation value, it submits requests to the banks for fetching weights matching the activation index. It also provides the position value to the MACs for updating the partial summations for output elements.
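The effect of the bank count on write-back conflicts can be illustrated with a small model. The sketch below assumes a simple modulo mapping from output address to bank and one write per bank per cycle; both assumptions, the function name `writeback_cycles`, and the addresses are hypothetical and do not model the hashing used in SCNN.

```python
from collections import Counter


def writeback_cycles(addr_groups, num_banks):
    """Count cycles needed to write groups of irregular output addresses.

    addr_groups: list of address lists, one list per group of N sources
    that issue writes simultaneously. Writes to the same bank serialize,
    so each group takes as many cycles as its most contended bank.
    """
    cycles = 0
    for addrs in addr_groups:
        per_bank = Counter(a % num_banks for a in addrs)  # bank = address mod banks
        cycles += max(per_bank.values())
    return cycles


groups = [[0, 4, 9, 13], [2, 6, 7, 11]]  # N = 4 sources per group
print(writeback_cycles(groups, num_banks=4))  # N banks
print(writeback_cycles(groups, num_banks=8))  # 2N banks
```

For these addresses, N banks require four cycles because pairs of writes collide, while 2N banks complete both groups in two cycles; more banks reduce, but do not in general eliminate, collisions.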

#### D. Reusing Intermediate Tensors

1) Reusing intermediate tensors from large on-chip memory: An intermediate feature map in DNNs is the output of a layer that serves as input to subsequent layers. It can be kept stationary and reused from the on-chip memory to reduce off-chip memory traffic. Encoding the sparse and quantized data significantly reduces the storage requirements, which makes such opportunities more feasible. Many accelerators feature large on-chip memories with sizes of hundreds of KBs (e.g., [36], [75], [208]). They can accommodate the compressed tensors for different layers of models [2], [4], [53]. For example, accelerators like SCNN [96] and EIE [61] processed data in the compressed format and employed large on-chip memory, thereby enabling the reuse of intermediate activation tensors from on-chip memory. Leveraging the high reuse of the intermediate tensors can be important for accelerating latency-bounded real-time applications (e.g., when weight reuse is low without batching and feature maps need to be highly reused).

2) Addressing static bank assignment problem: Many accelerators process models layer-by-layer and do not leverage cross-layer reuse. For example, with dataflows optimized for each layer, accelerators typically write the output tensors back to the DRAM at the end of the layer L and fetch the inputs for the next layer L+1 from the memory. This is more prevalent among accelerators that feature smaller on-chip memories. Moreover, even though the activation tensor (output from layer L) can fit within the banks (e.g., #7-#10) of a reasonably large on-chip memory, computations for the next layer L+1 may require accessing the input tensor from different banks (e.g., banks #1-#4). Such a problem occurs due to static bank assignment [208], where the assignments of the banks for each tensor are determined statically at design time and cannot be changed after implementation. The static bank assignment enforces the write-back of the output tensor and reloading it into other banks for processing the next layer. Thus, in both of these cases, the activation tensors (outputs from layers) are not reused on-chip, which causes excessive off-chip memory traffic, i.e., higher processing latency and energy consumption. For efficiently reusing the intermediate data on-chip, shortcut-mining [208] used a flexible architecture with decoupled physical-logical buffers to address the static bank assignment problem and to exploit the cross-layer data reuse.

When tensors are sparse, prior techniques for statically determining the data allocation to memory may work well by incorporating estimates of the sparsity of tensors and the sizes of sparsity-encoded tensors. However, conservative estimates of dynamic sparsity levels may lead to inefficient utilization of memory banks. Efficient banking for non-conflicting accesses can also be challenging while leveraging dynamic sparsity (e.g., in accelerating training of the model, where sparse weights of layers are evolved and pruned).

3) Fused-layer execution: To leverage reuse of intermediate outputs across layers, Alwani et al. [213] proposed fused-layer CNNs. Their approach processed a smaller tile of activations such that tiles of output activations for the next few layers were computed alongside while retaining the corresponding data in the on-chip memory. Fig. 12 shows an example, where a tile of $5\times 5$ activations ($C_L$ input channels) of layer L is processed with $M_L$ filters of $3\times3$ weights and produces $3\times3$ output activations ($M_L$ channels). Then, the accelerator uses the intermediate output and performs convolution (for layer L+1) with $3\times 3$ weights ($M_L$ channels) of $M_{L+1}$ filters and produces $1\times 1$ output activations with $M_{L+1}$ channels. In [213], these activation tiles and filters were maintained in on-chip memory. Then, such tiles were partially reused while performing the convolution of newer tiles through striding execution in the spatial direction. As a result, fused-layers reduced off-chip data transfer of input feature maps by 28% for the first two layers of AlexNet and by 95% for the first five layers of the VGG-19 model. However, such cascading yields high memory requirements because of storing all the filters of targeted layers and all the input channels of the corresponding spatial tiles of feature maps. Therefore, when processing dense tensors, [213] applied the approach to only the early layers of models [208]. Sparse tensors can be efficiently compressed with different encoding formats, which allows larger portions of tensors to fit in the smaller on-chip memory, making such fusion opportunities more feasible for reusing intermediate tensors. Selection of the tile size and the number of layers that can be fused is bounded by on-chip memory capacity. So, fusion parameters depend on the actual/anticipated sparsity levels (for weights/activations). For efficient executions, the fusion parameters should be explored systematically with other sparsity-aware dataflows (section XII-B).

Fig. 12. Fusing the execution of layers can significantly reuse intermediate activations [213] (Figure adopted from [213]).
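The tile sizes in the example above follow from working backward through the fused layers. The sketch below assumes square tiles and unpadded convolutions; the function name `fused_input_tiles` is hypothetical and is not part of the method in [213].

```python
def fused_input_tiles(out_tile, layers):
    """Spatial tile size required at the input of each fused layer.

    out_tile: side length of the final output tile.
    layers: list of (kernel, stride) tuples, first fused layer first.
    Returns tile side lengths from the first layer's input to the output.
    """
    sizes = [out_tile]
    for kernel, stride in reversed(layers):
        # An output tile of side t needs an input of side (t - 1) * stride + kernel.
        sizes.append((sizes[-1] - 1) * stride + kernel)
    return sizes[::-1]


print(fused_input_tiles(1, [(3, 1), (3, 1)]))  # two fused 3x3, stride-1 layers
```

For two fused $3\times3$ layers with stride 1, a $1\times1$ output requires a $3\times3$ intermediate tile and a $5\times5$ input tile, matching the example of Fig. 12; fusing more layers or using larger kernels grows the input tile and hence the on-chip storage.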

#### E. Techniques for Further Improving Energy-Efficiency

1) Look-ahead snoozing: Depending on the sparsity of tensors, sparsity-encoding, and mapping of the layer onto the accelerator, it is possible that several banks can be unused or inactive for certain time intervals. Therefore, hardware accelerators can achieve further energy efficiency by power gating unused or inactive banks. For example, for energy-efficient execution, CompAct [132] proposed look-ahead snoozing of the on-chip memory. It targeted reducing the leakage power, which consumes a significant portion of the energy for large on-chip SRAMs (typically shared among PEs). CompAct employed a banked activation SRAM architecture with two power-gating transistors for each bank. Through power gating, these banks were snoozed or put in the deep sleep mode. Since sparse tensors are sparsity-coded (e.g., RLC) and stored in a compressed format, a few banks in the shared buffer can remain unutilized. CompAct determined such empty banks for the execution of each layer of the model and put them into deep sleep mode (maximal savings in leakage power, while not preserving any data in unused banks). Since the sequence of execution can be pre-determined for executing the layer (i.e., the order in which banks are accessed), CompAct considered the data movement schedule and determined the period of active cycles for each bank. It snoozed inactive banks during execution (i.e., connecting the bank to the data retention voltage for consuming lower leakage power).
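The scheduling idea can be sketched as below. This is a simplified illustration under stated assumptions, not the policy of [132]: the access schedule is assumed to be known ahead of time as (cycle, bank) pairs, each bank is given a single active window, and the function names are hypothetical.

```python
def plan_bank_power(num_banks, schedule):
    """schedule: list of (cycle, bank) accesses known before execution.

    Returns one entry per bank: "deep_sleep" if the bank is never accessed
    for this layer, else its (first_cycle, last_cycle) active window.
    """
    windows = {}
    for cycle, bank in schedule:
        first, last = windows.get(bank, (cycle, cycle))
        windows[bank] = (min(first, cycle), max(last, cycle))
    return [windows.get(bank, "deep_sleep") for bank in range(num_banks)]


def snoozable_cycles(plan, total_cycles):
    """Cycles each used bank could spend at the data retention voltage."""
    return [None if w == "deep_sleep" else total_cycles - (w[1] - w[0] + 1)
            for w in plan]


schedule = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0)]
plan = plan_bank_power(4, schedule)
print(plan)                        # banks 2 and 3 hold no data: deep sleep
print(snoozable_cycles(plan, 8))   # idle cycles outside each active window
```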

2) Skipping memory hierarchy: Some ML models or their specific layers (e.g., FC layers without batch processing) may not provide significant opportunities for reusing tensors. Moreover, moderate data reuse can become very low due to sparsity of tensors and architectural design for extracting or communicating NZs. For efficiently handling such scenarios (e.g., with low latency and energy consumption), it becomes very important to directly feed the data to PEs, by obviating the storage of non-reusable data in the shared memory. We describe techniques used by recent hardware accelerators for efficiently handling tensors with low reuse.

Tetris [214] is a neural network (NN) accelerator using 3D memory that associates an NN engine (an array of $14 \times 14$ PEs and a scratchpad) with each of the 16 vaults of an eight-die hybrid memory cube (HMC) stack. The shared 133 kB scratchpad reuses or stores the tensors that are not managed in the RFs of PEs (each PE houses a 512B RF). For utilizing the shared memory for storing and reusing data blocks of only one tensor (e.g., input feature maps), Tetris enabled bypassing of the shared memory such that data blocks of non-reused tensors (e.g., filters and output feature maps) can be directly streamed between DRAM and RFs of PEs. Gao et al. [214] proposed an algorithm for managing the data movement of tensors with bypassing and explored the efficiency of bypass-aware dataflow mechanisms with an analytical model. Data movement that bypasses a level of the memory hierarchy can be controlled by implementing additional hardware logic [214] (e.g., as additional logic in the finite state machine of the accelerator controller that manages data movement between on-chip and off-chip memory). Similarly, Timeloop [87] optimizes dataflow mechanisms for executing DNN layers by using an analytical model and enables design space exploration of DNN accelerators. For efficiently managing low-reuse data, it supports memory-level bypass through a directive that specifies which blocks of the tensors can reside at which memory level. Then, the constraints for level bypassing can restrict the search-space accordingly for mapping optimization.

In contrast, a few accelerator designs including EIE [61] and Cambricon-X [62] architectures do not provide a shared memory but rather employ large local memories in PEs. Thus, PEs can directly access the data (with no or low reuse) from off-chip memory via DMA transfers. These PEs may operate asynchronously in a demand-based fashion or, PEs can be iteratively provided dedicated access to the memory bus for a short time interval in order to obtain the new work.

#### F. Optimization Opportunities

(i) Hardware/software mechanisms for managing unified memory for both data and metadata: Accelerators for processing sparse tensor computations often employ separate buffers for storing metadata (e.g., NZ locations). Similarly, for processing tensors quantized with value-similarity, the codebook (containing shared values) or indirection table can be stored in a separate buffer. Although such designs are easy to manage when executing some models with tensors encoded in a specific format, they may not scale well across different levels of sparsity and value-similarity. Depending on the sparsity-level and the encoding format, storage requirements for metadata of tensor blocks can vary significantly. For instance, separate on-chip buffers for metadata may be unused for processing tensors with low sparsity or low value-similarity. So, designers can explore unified memory architectures (partitioning, banking, port requirements) for managing both data and metadata and their trade-offs. Moreover, such trade-off analysis can help develop tailored coarse-grain designs for FPGA-based accelerators.

(ii) Sparsity-aware dataflows for fused-layer executions to leverage reuse of intermediate tensors from on-chip memory: An intermediate output tensor, which serves as an input to the next layer, is the input for multiple layers due to residual connections in the models (e.g., ResNet [4]) or parallel path execution due to high cardinality of the model blocks (e.g., ResNeXt [126]). Layer-wise execution in such scenarios can significantly increase the accesses to off-chip memory because the data blocks of the previously computed intermediate tensors need to be brought back into on-chip memory. Consequently, it incurs higher execution latency and energy consumption. Leveraging the cross-layer reuse opportunities can improve acceleration efficiency, e.g., by executing fused layers (multiple consecutive layers as in [213]) and concurrent execution of the parallel paths of high-cardinality blocks. The fusion of the layers requires determining the sizes of the data tiles that need to be processed in each of the fused layers and the sequence of processing these tiles. Such calculation depends on the collective size of the data that can fit in the on-chip memory. So, the sparsity of the tensors can make the layer-fusion opportunities more feasible. However, the fusion parameters still need to be determined from the actual/anticipated sparsity levels (for weights/activations). Apart from leveraging sparsity, systematic layer-fusion techniques need to be developed that can work together with multiple dataflow mechanisms. This is crucial because, depending on the shape of the tensors of layers, multiple dataflow mechanisms are often required for attaining efficient accelerations. Thus, sparsity-aware dataflows can be explored for fused-layer executions.
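To illustrate the point in item (i) that metadata storage varies with sparsity and encoding, the sketch below counts metadata bits for one 1-D tensor block under three encodings. It is an illustration under assumptions: the function name `metadata_bits`, the 4-bit run field, and the exact field layouts are hypothetical and do not correspond to a specific accelerator.

```python
import math


def metadata_bits(block, run_bits=4):
    """Metadata size in bits for one 1-D block under three encodings."""
    n = len(block)
    nnz = sum(1 for v in block if v != 0)
    bitmap = n                                    # one presence bit per element
    coo = nnz * max(1, math.ceil(math.log2(n)))   # one absolute index per NZ
    # Run-length coding: every stored entry carries the count of preceding
    # zeros. A run longer than the field can express forces a padding entry
    # whose value is zero. Trailing zeros need no entry (block length known).
    max_run = (1 << run_bits) - 1
    entries, run = 0, 0
    for v in block:
        if v != 0:
            entries += 1
            run = 0
        elif run == max_run:
            entries += 1                          # padding zero entry
            run = 0
        else:
            run += 1
    return {"bitmap": bitmap, "coo": coo, "rlc": entries * run_bits}


dense = list(range(1, 17))                        # no zeros
sparse = [0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0]
print(metadata_bits(dense))
print(metadata_bits(sparse))
```

For the dense block, index-based metadata costs more than the bitmap, while for the sparse block it costs less, so a metadata buffer sized for one case is over- or under-provisioned for the other.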



Fig. 13. Common NOC designs (Figure adopted from [63]).

#### XI. COMMUNICATION NETWORKS

While on-chip memory prefetches and maintains the data for all PEs, on-chip communication networks (network-on-chip or NOC) play a key role in efficiently distributing the data to the PEs [63], [83], exchanging the data between PEs [34], [36], and writing the data back from PEs to the shared memory [62]. Machine learning applications are data-intensive, and therefore, accelerators employ multiple high-bandwidth interconnects to deliver different tensors simultaneously to the PEs. NOCs are crucial for achieving high-throughput and energy-efficient acceleration. This is because the execution of PEs gets stalled if the desired set of the data is not supplied in time by the interconnect. Thus, inefficient NOCs can easily become the bottleneck, leading to high-latency execution. In an efficient execution, PEs process the data from input FIFOs and/or local memory, which is interleaved with the filling/draining of new data into PEs via NOCs.

Several NOCs for data distribution enable spatial reuse by communicating the same tensor element to multiple PEs. This helps to efficiently utilize the available bandwidth, thereby yielding low communication latency and high energy efficiency. Opportunities for leveraging the spatial reuse become low for sparse tensors (section X-A). This is because a considerable fraction of tensor elements are zeros, which do not need to be stored, communicated, or computed. Moreover, for detecting NZs and extracting the matching pairs of NZs, accelerators often employ centralized modules. These modules usually extract the data on a per-PE basis and feed unique blocks of extracted data to each PE. Since the data is extracted beforehand (in contrast to multicasting data to PEs followed by data extraction), such designs cannot leverage the data reuse and require high-bandwidth NOCs to deliver the extracted per-PE work. However, recent accelerators including ZENA [110] and EyerissV2 [63] obviate the early centralized indexing and use in-PE data extraction, thereby leveraging the spatial reuse.

The bandwidth requirements and spatial reuse opportunities can vary considerably for processing irregular-shaped and sparse tensors. Therefore, recent designs have proposed configurable NOCs that can adapt to varying communication requirements. Lastly, when processing sparse tensors, unstructured communication of partial outputs among PEs can be challenging; accelerators handle it through temporal accumulation on PEs, in-PE spatial accumulation via an adder-tree, or an interconnect for arbitration or global accumulation.

#### A. Common Interconnect Designs

Fig. 13 shows some of the common NOC designs along with their bandwidth and capabilities of spatially reusing

TABLE VII
COMMUNICATION NETWORKS OF ACCELERATORS FOR DISTRIBUTION OF SPARSE TENSORS

| Category      | Type                | Accelerators |
|---------------|---------------------|--------------|
| Topology      | Fat-width / Unicast | [62], [95], [96], [109], [145], [147], [152], [157], [159], [160], [168] |
| Topology      | Multicast           | [148], [152], [157] |
| Topology      | Broadcast           | [57], [61], [95]–[97], [101], [102], [109], [110], [145]–[148], [151], [153], [159], [160], [168] |
| Topology      | Mesh                | [107], [132], [163] |
| Topology      | Configurable        | [63], [144], [215] |
| Spatial Reuse | Activations         | [57], [61], [63], [97], [101], [102], [109], [110], [144]–[148], [151], [153], [155], [157], [159], [160], [168], [201] |
| Spatial Reuse | Weights             | [63], [95], [96], [110], [144], [148], [153], [201] |

the data [63]. For layers with high reuse opportunities (Fig. 10), leveraging spatial reuse is important, since it allows distributing the same data of a tensor to multiple PEs. With a lowered communication requirement, it can enable efficient interleaving of the communication latency with computations on PEs [88], [215]. Most accelerators use interconnect designs such as multicast or broadcast NOC to leverage spatial reuse. Accelerator designs with a 2D (1D) bus or mesh interconnect can achieve high (moderate) spatial reuse through multicasting or broadcasting the data [216]. The NOCs consisting of configurable buses or tree topology (used in spatial architectures with scalar or vector PEs) can multicast the data to PEs usually in a single cycle [34], [148], [215]. In contrast, the mesh interconnect (used in systolic arrays) spatially reuses and communicates the data with the store-and-forward mechanism [36], [132], [144]. For distributing tensors that are reused less, some accelerators use unicast NOCs.

The communication requirements vary significantly depending on the sparsity of tensors as well as the adopted dataflow mechanism, an aspect that has been studied relatively little. However, recent studies including [215] characterize the NOC bandwidth required for different dataflow mechanisms. Similarly, analytical tools including [89] model implications of different dataflows on communication requirements and execution time. Now, we discuss common interconnect topologies (listed in Table VII) used by previous accelerators for distributing compressed data of sparse and quantized tensors.

1) Broadcast NOC: Accelerator designs, including Cnvlutin [109], SCNN [96], EIE [61], and Cambricon-S [57], use broadcast NOC to leverage high reuse of activations or weight tensors in convolutions and activations in MLPs or FC layers. When one or more tensors are sparse, their NZs can be broadcast for leveraging data reuse spatially, as long as the mechanism for indexing or intersection operates on NZs after the broadcast. For example, Cnvlutin [109] processed sparse activations; NZ activation values were broadcast to all 16 PEs. Then, based upon the position of each NZ activation, each SIMD PE indexed corresponding weights from multiple synapse lanes for executions on multiplication array and adder tree. Similarly, Cambricon-S [57] broadcast activations to PEs with their positions; each PE indexed the matching NZ weights from its local buffer for execution on its function unit. Accelerators often employ multiple broadcast NOCs for simultaneously broadcasting different data elements to the corresponding PE-groups. For instance, Sticker [201] fetches multiple blocks of activation tensor and broadcasts them to different columns of PEs. Similarly, it broadcasts multiple blocks of weight tensor to different rows of PEs.
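The broadcast-then-index pattern described above can be sketched functionally as follows. This is a simplified model and not the datapath of [109] or [57]: each PE is assumed to own the weights of one output neuron of an FC layer, and the function name `broadcast_fc` is hypothetical.

```python
def broadcast_fc(activations, weights):
    """Functional model of broadcasting NZ activations to all PEs.

    weights[pe] is the weight row held locally by PE `pe`.
    Returns the per-PE outputs and the number of broadcasts issued.
    """
    acc = [0] * len(weights)
    broadcasts = 0
    for pos, value in enumerate(activations):
        if value == 0:
            continue                      # zeros never enter the network
        broadcasts += 1                   # one (value, position) pair is sent
        for pe, row in enumerate(weights):
            acc[pe] += value * row[pos]   # in-PE indexing by the position
    return acc, broadcasts


acts = [0, 2, 0, 0, 3]
weights = [[1, 2, 3, 4, 5],
           [5, 4, 3, 2, 1],
           [1, 0, 1, 0, 1]]
print(broadcast_fc(acts, weights))
```

Because indexing happens after the broadcast, each NZ activation is sent once and reused by every PE, regardless of which weights it is paired with.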

2) Multicast NOC: Often the dataflow mechanism reuses multiple operands spatially. E.g., in a 2-D PE array, PEs in the same row need to obtain the same elements of the first tensor and PEs in the same column should receive the same elements of the second tensor. Therefore, instead of broadcast NOC, accelerators [34], [110], [148] use multicast NOC to spatially reuse multiple operands. For example, Eyeriss [34] featured configurable multicast NOCs for executing dense tensors with the row-stationary dataflow mechanism. PEs in the same row processed the same spatial rows of filters and PEs diagonally processed the same spatial row of feature maps. Eyeriss [34] facilitated such multicasting of packets (tensor elements) through its configurable NOCs, which consisted of row-wise and column-wise interconnect controllers. Each controller could be configured with a pre-determined tag value, which was compared at run-time with the row-wise or column-wise tag of a packet. Upon matching the tags, the corresponding row-wise controller forwarded the packet to associated column-wise controllers, and the column-wise controller forwarded the data to the associated PE. Similarly, for processing bitmap-coded tensors of CONV layers, ZENA [110] fetched a block of activation tensor and broadcast it to a row of PEs, and multicast a block of weights to the PEs with the same index in each row (i.e., PEs of the same column).

3) Mesh NOC: A few accelerators including CompAct [132], ERIDANUS [107], and [163] employ systolic arrays with mesh-style interconnects. They reuse the data spatially with store-and-forward execution. Similarly, in the MAERI accelerator [206], while the data is communicated to a group of multipliers via a fat-tree, adjacent multipliers spatially reuse the data through the store-and-forward mechanism. For PEs connected via mesh, since the same data is forwarded among PEs in the same row or column, such interconnects achieve the same amount of spatial reuse as multicast NOCs. However, structured computations with the streaming data make it difficult to execute sparse tensors. Therefore, pre-processing modules are required that can cluster the appropriate NZs before feeding the PEs of the systolic array [107], [144].

4) Unicast NOC: Accelerators including SCNN [96], Cambricon-X [62], and SqueezeFlow [95] use unicast NOC or point-to-point links. Such NOCs concurrently feed different data elements of the same tensor to various PEs. They are required for communication when the data operand does not exhibit reuse (e.g., weights in MLPs), or the tensor is not reused spatially, or output needs to be collected simultaneously (discussed in section XIV-A). Unicast NOC provides high bandwidth, alleviates the network congestion, and reduces the overall latency of communicating tensor elements [62]. Moreover, for executing sparse tensors, accelerators use unicast NOC to communicate different blocks of a tensor PE-wise, each of which consists of extracted NZs for execution on a PE. However, point-to-point interconnects can incur higher costs



Fig. 14. EyerissV2 accelerator architecture [63] (Figure adopted from [63]).

in terms of area and power. Vainbrand et al. [217] provide a detailed analysis of different NOC topologies, and Kwon et al. [215] analyze the costs of executing convolution layers on accelerators with different NOC designs.
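The tag-matching multicast described for Eyeriss [34] under the multicast NOC can be modeled functionally as follows. This is a sketch under assumptions: the packet layout, the per-PE column tags, and the name `multicast` are hypothetical simplifications of the row-wise and column-wise controllers.

```python
def multicast(packet, row_tags, col_tags):
    """Deliver a packet to every PE whose row and column tags both match.

    packet: (row_tag, col_tag, data)
    row_tags[r]: tag configured in the row-wise controller of PE row r
    col_tags[r][c]: tag configured in the column-wise controller of PE (r, c)
    Returns {(row, col): data} for the receiving PEs.
    """
    row_tag, col_tag, data = packet
    delivered = {}
    for r, tag in enumerate(row_tags):
        if tag != row_tag:
            continue                       # row controller drops the packet
        for c, ctag in enumerate(col_tags[r]):
            if ctag == col_tag:
                delivered[(r, c)] = data   # column controller forwards to PE
    return delivered


# Row-wise multicast: only row 1 accepts, and every PE in it receives.
print(multicast((1, 0, "w"), [0, 1, 2], [[0] * 3 for _ in range(3)]))
# Diagonal multicast: all rows accept; column tags select one diagonal.
diag = [[r + c for c in range(3)] for r in range(3)]
print(multicast((0, 2, "x"), [0, 0, 0], diag))
```

Reprogramming the tag values, rather than the wiring, is what lets the same NOC serve row-wise and diagonal reuse patterns.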

#### B. Configurable Interconnects

Accelerators may not employ a mix of different NOCs that vary in terms of bandwidth and data reuse capabilities. Such communication capabilities are often needed for efficiently executing different layers of various ML models with varying levels of sparsity. This is because, although the accelerators typically use NOCs that provide sufficient bandwidth and exploit spatial data reuse (to an extent), they may not work well for different dataflows [215]. Communication requirements vary with different dataflows that effectively accelerate some of the DNN layers (section XII-B and Table 21). Moreover, while communication requirements for executing many ML models can be visualized as gather, scatter, or reduction patterns [215], [218], efficient execution may demand a combination of these patterns or even non-uniform patterns including multi-hop communications among PEs [38]. Therefore, configurable NOC designs are required, which can be flexibly programmed for executing various communication patterns. Recent designs including EyerissV2 [63], microswitch-NOC [215], and SIGMA [144] address some of these challenges.

EyerissV2 [63] employs a novel hierarchical-mesh NOC, which is illustrated in Fig. 14. The EyerissV2 accelerator features 16 clusters of PEs and 16 clusters of global buffers (GLBs), which are arranged in an $8\times 2$ array. Each PE-cluster exhibits a $3\times 4$ PE-array, and each 12 kB GLB-cluster consists of a total of seven banks for input and output activations. In the hierarchical NOC, router clusters connect different PE-clusters and GLB-clusters. At the top-level, router clusters are connected through a 2D mesh, which enables communication among different PE-clusters and GLB-clusters. For local communication among each PE-cluster and GLB-cluster, the router-cluster contains a total of ten routers. Each router connects PEs with a port of the GLB cluster for reading (writing) from (to) a GLB bank or off-chip memory (three routers each for managing input activations and weights, and four for partial summations). As Fig. 14 shows, an all-to-all NOC connects all the PEs



Fig. 15. Different configuration modes of hierarchical mesh network in EyerissV2 architecture [63] (Figure adopted from [63]).

of a PE-cluster to all routers of the adjacent router-cluster. As Fig. 15(a)–(d) illustrates, the all-to-all NOC for local communication facilitates multiple communication patterns including multicast, broadcast, and unicast of the tensors. 2D mesh topology enables inter-cluster communications in EyerissV2, allowing an interleaved-multicast or broadcasting the data to a different cluster.

Kwon et al. [215] have proposed an NOC with an array of microswitches. As Fig. 16(a) illustrates, for an N-PE accelerator, the microswitch array consists of $\log_2 N + 1$ levels and each level contains N microswitches. Each microswitch contains a small combinational logic for configuring different communication patterns and up to two FIFOs for buffering the data during routing conflicts. With smaller logic and storage components, data traverses through multiple microswitches within each cycle (e.g., 24 switches in one ns for a 15 nm technology [215]), which enables configurable yet low-latency communication. All microswitches contain gather and scatter units; in addition, bottom microswitches (level $\log_2 N$) contain local units for inter-PE communication. In top microswitches (level 0), the scatter unit connects to the banks in the global buffer, and the gather unit uses round-robin-based priority logic for arbitrating the incoming data to the global buffer in a pipelined manner. Similarly, in middle microswitches (level 1 to $\log_2 N - 1$), the scatter unit forwards the data to desired links (lower-level branches of the scatter-tree), and the gather unit streams the data back towards the top microswitches. Finally, in bottom microswitches, the scatter unit and the gather unit stream the data, and the local unit connects the adjacent PEs. Fig. 16(b)–(d) shows how configurable microswitches can enable various communication patterns including multicast,



Fig. 16. (a) Microswitch network [215]. NOC configurations: (b) multicast (c) gather (d) local communication. (Figure adopted from [215].)



Fig. 17. (a) Flexible dot product engine in SIGMA accelerator [144] features a data distribution NOC with configurable switches interconnected via Benes topology. (b)–(c): Configuration of the interconnect facilitates different unicast and multicast communication patterns. (Figure adopted from [144].)

broadcast, gather, and inter-PE communication.

For distributing the data to PEs, SIGMA [144] used Benes topology [219] with configurable switches (Fig. 17a). For an N-PE accelerator, the interconnect exhibits $2\log_2 N + 1$ levels, and each level consists of N $2\times 2$ switches. Each switch receives two control signals, which determine whether the switch forwards the data vertically (configuration value 0) and/or diagonally (configuration value 1). As Fig. 17(b)–(c) shows, different switches can be configured for communicating the elements to desired multipliers. Fig. 17(c) shows that after combining the communication requirements for distributing all data elements, the switches can be configured to forward the data both vertically and diagonally (e.g., to multicast q to multipliers #0 and #3). Thus, such a configurable interconnect design can enable unicast, multicast, or broadcast of different tensor elements to the multipliers.
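The level counts quoted above for the microswitch array and the Benes-based interconnect can be turned into a quick switch-count comparison. The sketch simply evaluates the two expressions from the text for a power-of-two N; the function names are hypothetical, and it says nothing about the area or power of an individual switch.

```python
def _log2_exact(n):
    """Return log2(n) for a power-of-two n; reject anything else."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n.bit_length() - 1


def microswitch_noc(n_pes):
    """(levels, switches): log2(N) + 1 levels with N microswitches each."""
    levels = _log2_exact(n_pes) + 1
    return levels, levels * n_pes


def benes_noc(n_pes):
    """(levels, switches): 2*log2(N) + 1 levels with N 2x2 switches each."""
    levels = 2 * _log2_exact(n_pes) + 1
    return levels, levels * n_pes


for n in (16, 64, 256):
    print(n, microswitch_noc(n), benes_noc(n))
```

Both counts grow as N log N, which is the cost paid for configurability relative to a fixed bus or tree.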

#### C. Mechanisms for Communication of Partial Outputs

Computation primitives in ML models usually require a reduction operation (e.g., a summation of the values), which can be performed by PEs in different ways (Table VIII). **Temporal:** all the summations required for computing an output scalar are performed on a single PE during different cycles; different PEs compute distinct output values [61], [145]. **Spatial:** multiple function units within a PE (or multiple PEs) calculate partial outputs for the same output scalar. For example, the multiplier-array in a PE calculates partial products for a dot product, which are then reduced by an adder tree. **Spatiotemporal:** different PEs compute the partial outputs and locally accumulate them over time. Partial outputs from PEs are later collected and accumulated spatially via the interconnect before further processing (e.g., write-back or other arithmetic operations). Whether PEs accumulate outputs temporally/spatially/spatiotemporally depends on the mapping of the computation graph onto the PE-array [74]. Inter-PE communication for spatial accumulation (e.g., in weight stationary dataflow) or spatiotemporal accumulation (e.g., in row stationary dataflow) is typically achieved through a mesh network [34], [132], [163]. This is because a mesh or similar topology enables near-neighbor communications among PEs.
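The three accumulation styles can be contrasted on a single dot product. The following is a functional sketch only: it ignores cycle timing and interconnect details, the function names are hypothetical, and the interleaved partitioning across PEs is an arbitrary choice made for illustration.

```python
def temporal(a, b):
    """One PE, one MAC at a time; the running sum stays in its register."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def spatial(a, b):
    """A multiplier array forms all products; an adder tree reduces them."""
    level = [x * y for x, y in zip(a, b)]
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)               # pad an odd level with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def spatiotemporal(a, b, num_pes):
    """Each PE accumulates its slice over time; partials are then reduced."""
    partial = [temporal(a[p::num_pes], b[p::num_pes]) for p in range(num_pes)]
    return sum(partial)                   # reduction over the interconnect


a, b = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
print(temporal(a, b), spatial(a, b), spatiotemporal(a, b, num_pes=2))
```

All three produce the same value; they differ in where partial sums live (a PE register, adder-tree wires, or both), which is what determines the communication pattern.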

TABLE VIII
MECHANISMS FOR ACCUMULATIONS OF PARTIAL OUTPUTS

| Temporal           | [61], [97], [102], [110], [145], [147], [151], [152], [155], [157], [159], [160], [168] |
|--------------------|-----------------------------------------------------------------------------------------|
| Spatial (intra-PE) | [57], [62], [144]                                                                       |
| Spatial (inter-PE) | [95], [97], [132], [144]                                                                |
| Spatiotemporal     | [63], [96], [148], [165]                                                                |

For processing sparse tensors, temporal accumulation makes execution simple as compared to spatial or spatiotemporal accumulation. This is because, in temporal accumulation, just like other arithmetic or logical operations, PEs process data from their private memory or registers, without any inter-PE communications via an additional interconnect. Therefore, it is adopted by several architectures including EIE [61], CoNNA [147], and [145]. However, temporal accumulation is done by reading/writing the data to local memory [110] or by accumulating the computations in the output register of the adder [145]. It requires register read and write operations, which consume higher energy as compared to combinational logic for integer arithmetic [38], [83], [165]. Therefore, some accelerators either use spatial/spatiotemporal accumulations with inter-PE communication (e.g., SNAP [148]) or intra-PE spatial accumulation of the outputs from the multiplier-array via an adder-tree (e.g., Cambricon-X [62]). Lee et al. [165] analyzed the energy efficiency of spatial and temporal accumulation and showed that for an accelerator with vector PEs (multipliers + adder-trees), spatial reductions were about $2\times$ to $3\times$ more energy-efficient, as compared to the accelerator performing temporal reduction on scalar PEs.

Although spatially accumulating data among PEs can be energy-efficient, the spatial or spatiotemporal accumulation for sparse tensor computations runs into the following challenges. Firstly, depending on the frequency of spatial accumulation, different PEs communicate accumulated outputs of different sizes because they process blocks that consist of a different number of NZs. For example, some PEs may not encounter any NZs in computation and do not need to communicate any data for accumulation. This not only results in poor utilization of the function units (e.g., fused MAC units in PEs), but it also makes execution challenging. This is because PEs typically execute pre-determined instructions or state machines for communicating the partial outputs and accumulating them. Such communication must instead be triggered by glue logic, only after it determines that the PE computed NZs and generated some partial output. Secondly, in-PE spatial accumulation requires an adder-tree, which is fed by an array of multipliers. The multiplier-array and adder-tree collectively execute a dot product. Depending on the sparsity of tensors, the PE may suffer from intra-PE work imbalance (section XIII), since these multipliers and adders within each PE are often fed with zeros and not utilized effectively.

For efficient handling of spatial or spatiotemporal accumulations, a few accelerators use global accumulators [148] or accumulator-buffers [96]. However, the accumulators in such global reduction units or in the accumulation buffers require the address of the output element, which needs to be explicitly determined every time from the metadata of NZ inputs. Then, the partial outputs from PEs or their function units (e.g., multiplier-array) are provided to the global accumulator or accumulator-buffers, which accumulate them with the appropriate output elements and then write them back to the appropriate memory bank. Arbitration of outputs to memory banks via crossbar interconnects and conflict management for accumulator buffers are discussed in section X-C.
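How an output address is derived from the metadata of NZ inputs can be sketched with a 1-D convolution (stride 1, no padding) over coordinate-coded operands. This is an illustrative model, not the design of [96] or [148]: the function names are hypothetical, and bank arbitration and conflict handling are left out.

```python
def sparse_conv1d(act_nz, wgt_nz, act_len, wgt_len):
    """act_nz, wgt_nz: lists of (index, value) pairs for the NZs only."""
    out_len = act_len - wgt_len + 1
    acc_buffer = [0] * out_len
    for x, a in act_nz:
        for k, w in wgt_nz:
            o = x - k                       # output address from metadata
            if 0 <= o < out_len:            # discard out-of-range products
                acc_buffer[o] += a * w      # scatter-accumulate
    return acc_buffer


def dense_conv1d(act, wgt):
    """Dense reference: out[o] = sum_k act[o + k] * wgt[k]."""
    n = len(act) - len(wgt) + 1
    return [sum(act[o + k] * wgt[k] for k in range(len(wgt)))
            for o in range(n)]


act = [0, 2, 0, 0, 3, 1]
wgt = [1, 0, -1]
act_nz = [(i, v) for i, v in enumerate(act) if v != 0]
wgt_nz = [(i, v) for i, v in enumerate(wgt) if v != 0]
print(sparse_conv1d(act_nz, wgt_nz, len(act), len(wgt)))
print(dense_conv1d(act, wgt))
```

Only NZ pairs are multiplied, but every product needs an address computation and a read-modify-write on the accumulator, which is the overhead the text refers to.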

#### D. Optimization Opportunities

i) Low-cost flexible interconnects for accommodating spatial reuse opportunities, dynamic communication requirements, and varying levels of sparsity and precision: Sparse tensors of different layers exhibit varying levels of reuse (Fig. 11), which is affected by sizes of tensors, the functionality of the layer (e.g., stride, separable execution), batch size, and sparsity of tensors. The communication mechanism needs to leverage such reuse by supporting various multicast and unicast patterns [63], [215]. Moreover, depending on the load balancing and inter-PE synchronization employed, distribution of unique chunks of sparse data to PEs, inter-PE communication for accumulating sparse tensors, and writing the outputs back from PEs to shared memory can be done asynchronously and concurrently. This requires the interconnect switches to support dynamic communication (including low-cost logic for priority and congestion management). Furthermore, communication among distant PEs may be required (e.g., for store-and-forward or exchange of outputs during sparse computations). Finally, depending on the sparsity and precision of tensors, the bit-width of the metadata and NZ value can differ significantly. Communication of varying numbers of data and metadata can be facilitated by designing the configurable interconnect buses and their interfacing with PEs and memory. For instance, in EyerissV2 [63], a 24-bit bus can supply PEs either three 8b uncompressed values or 8b NZs and 4b metadata. Thus, various configurable interconnect topologies should be explored for effectively serving such communication requirements with varying reuse, dynamicity, and bit-width of the NZ data and metadata. When execution of target models necessitates supporting a wide range of communication patterns and bit-widths, FPGAs can be leveraged for designing accelerators with tailored interconnect topologies.

ii) Programming support for configurable interconnects and design exploration: While configurable interconnects can support varying communication patterns and dynamic data movement of sparse computations, compilation support is required for programming the configurable switches of the interconnect to facilitate such execution for various dataflows. This is because configurable interconnects often employ parameterized multi-level switches and switches with many-to-many links between source and destination switches (e.g., [144], [215]). Depending on the interconnect topology and optimized dataflow, the compiler may need to select efficient paths for distributing data from source to destination switches. Additionally, the underlying topology (e.g., lack of multi-hop connectivity) may not support some dataflows (e.g., spatiotemporal accumulation of partial outputs that are



Fig. 18. Overview of the PE pipeline for processing sparse and value-approximated tensors [61], [63], [148] (Figure adopted from [148]).

produced by distant PEs). Further, a systematic methodology for mapping the communication flow onto the interconnect topology can enable design space exploration of efficient interconnects for accelerating the target ML models, with minimum overhead of run-time reconfiguration of the interconnect switches to support varying dataflows.

#### XII. PE ARCHITECTURE

PE architecture consists of function units (typically MAC units), local memory (large register files or SRAMs), and local control (instruction memory or finite state machine) [62], [83], [205]. Fig. 18 shows an overview of the PE pipeline stages for processing sparse and value-approximated tensors. Depending upon the PE's interface, it either gets data from the interconnect network (typical design), or directly accesses shared on-chip memory (often), or accesses off-chip memory via DMA transfer. Every cycle or every few cycles, a PE processes an instruction or events based on the current state [164], [201], [205], fetches data from the local memory (or the interconnect), processes tensor elements via a function unit, stores the intermediate result into the local memory, or writes the data back (to lower memory or the interconnect). Typically, a significant fraction of the tensor data in the PE's local memory is temporally reused at least a few times [63], [96], [148]. Such temporal data reuse notably reduces the accesses to the memory hierarchy, improving energy efficiency significantly. Application-specific PE designs may also exhibit additional function units for supporting other arithmetic or logical operations, or even customized functionality, e.g., ReLU or sigmoid computations [96], [102].

Processing compressed tensors (particularly sparse and value-approximated data) requires significant additional effort in the PE architecture design. For example, the PE may require additional hardware logic for detecting NZs and extracting compressed data (section IX), or glue logic for computing the coordinates from metadata, which are then used for correctly performing the arithmetic on the input data and storing intermediate results. Such mechanisms may result in ineffectual computation cycles (less speed-up) when the PE passes some zero values of a tensor to the function units or cannot determine pairs of NZ values to feed the multipliers. Moreover, the computation graph corresponding to a layer of an ML model is executed on the hardware accelerator with a dataflow mechanism; each PE concurrently processes a different subset of the computation graph [74], [83]. For the varying dimensions of the layers and corresponding tensors of one or more ML models [6], a single dataflow may not be effective in all scenarios [63], [88], leading to significant acceleration loss due to poor PE-array utilization or ineffective interleaving of PE computations with data communication latency. Therefore, the datapath logic of the PEs needs to be adaptive for flexibly supporting multiple dataflows optimized for different layers and varying levels of sparsity. Finally, PEs may require post-processing on the produced outputs or generate additional metadata, before writing the output back to the interconnect network or accessing the memory. Efficient pipeline design is required to effectively hide pre-processing and post-processing latency [61], [96].
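The step of determining pairs of NZ values to feed the multipliers can be illustrated with a minimal Python sketch. It assumes a bitmap encoding of each operand and uses a population count over lower positions to index the packed value arrays; the helper names (`compress`, `matched_pairs`, `sparse_dot`) are hypothetical and do not reproduce the logic of any specific accelerator.

```python
def compress(dense):
    """Return (bitmap, values): bit i of bitmap is set when dense[i] != 0."""
    bitmap, values = 0, []
    for i, v in enumerate(dense):
        if v != 0:
            bitmap |= 1 << i
            values.append(v)
    return bitmap, values


def matched_pairs(a_bitmap, a_values, w_bitmap, w_values):
    """Yield (activation, weight) pairs that are NZ at the same position."""
    both = a_bitmap & w_bitmap          # positions where both operands are NZ
    while both:
        low = both & -both              # isolate the lowest set bit
        below = low - 1                 # mask of all lower positions
        # Offset into a packed array = number of NZs at lower positions.
        a_idx = bin(a_bitmap & below).count("1")
        w_idx = bin(w_bitmap & below).count("1")
        yield a_values[a_idx], w_values[w_idx]
        both ^= low


def sparse_dot(acts, weights):
    """Dot product that multiplies only matching NZ pairs.

    Returns (result, number_of_effectual_multiplications).
    """
    a_bm, a_vals = compress(acts)
    w_bm, w_vals = compress(weights)
    pairs = list(matched_pairs(a_bm, a_vals, w_bm, w_vals))
    return sum(a * w for a, w in pairs), len(pairs)
```

In this sketch, only positions that are NZ in both operands reach the multiplier, so the number of yielded pairs is the number of effectual multiplications.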

#### A. Function Units

1) Scalar PEs: Table IX lists accelerators based on their function units for scalar, SIMD, or vector processing. Several accelerator architectures including Eyeriss [34], EIE [61], and SparTen [151] contain an array of scalar PEs; the PE datapath features a MAC unit (pipelined multiplier and adder). Such a design makes it easier to obtain a data pair at every cycle from the PE's memory or via the interconnect and to keep the MAC unit engaged in performing useful computations.

2) SIMD/vector PEs: PEs in other architectures including Cnvlutin [109], Cambricon-X [62], and Cambricon-S [57] feature multiplier arrays and adder trees to perform a vector-vector multiplication at every cycle. With multiple MAC operations at every cycle, designs with vector/SIMD PEs can significantly improve the throughput, as compared to an architecture consisting of the same number of scalar PEs. Moreover, tensor accumulation through adder-trees spatially reuses the data within each PE, which lowers energy consumption (e.g., by $2\times-3\times$ [165]), as compared to temporal accumulation of the data on a scalar PE by reading and writing the partial summation via the PE's local memory.
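The difference between spatial accumulation through an adder tree and temporal accumulation via local memory can be sketched as follows. This is an illustrative Python model only: the function names are hypothetical, and the count of partial-sum accesses is a toy proxy that is not meant to reproduce the energy figures reported in [165].

```python
def adder_tree(values):
    """Pairwise reduction, mimicking the levels of a hardware adder tree."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)             # pad odd-sized levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def vector_pe_dot(acts, weights, lanes):
    """Dot product on a PE with `lanes` multipliers feeding an adder tree.

    Returns (result, partial_sum_accesses). With lanes=1 this degenerates to
    a scalar PE that reads and writes the partial sum for every MAC.
    """
    psum, psum_accesses = 0, 0
    for start in range(0, len(acts), lanes):
        products = [a * w for a, w in zip(acts[start:start + lanes],
                                          weights[start:start + lanes])]
        psum += adder_tree(products)    # spatial accumulation inside the PE
        psum_accesses += 2              # one read and one write of the partial sum
    return psum, psum_accesses
```

Under these assumptions, a PE with four lanes touches the stored partial sum four times less often than a scalar PE for the same dot product.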

However, a major challenge in processing sparse data on a PE design with a multiplier array is the less effective utilization of multipliers and adders for useful tensor computations, which often leads to intra-PE load imbalance and ineffectual computation cycles, and consequently, acceleration loss. For example, in the Cambricon-X architecture [62], the array of 16 multipliers and an adder-tree in each PE processes an output activation by selecting input activations from a dense tensor with bitmap-coded NZ weights. Often, there may not be enough NZ values (weights in this example) to feed all multipliers. This is either due to high sparsity of weights or the insertion of padding zeros with NZ weights during the last invocation of multipliers for producing an output activation value. Such inefficiencies may lower acceleration opportunities. For example, a sparsity sensitivity analysis by [62] determined that for high weight-sparsity, Cambricon-X accelerated convolutions by about $8\times$ (vs. $16\times$ ideal speedup), as compared to processing dense tensors. Similarly, for the SNAP architecture [148], multiplier array utilization can fall below 80% for moderate sparsity and to as low as 20% for high (90%) sparsity of tensors. Higher utilization of multipliers and adder trees can be achieved by employing either larger indexing modules or comparator arrays for parallel lookup; however, this may increase on-chip area and power considerably. Alternatively, the PE can be designed with a smaller number of multipliers for maintaining PE-array scalability and efficiency over a wide sparsity range.
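The effect of padding in the last multiplier invocation can be made concrete with a small Python sketch. It assumes a simplified model in which the multiplier array serves one output activation at a time and pads the final invocation with zeros; `multiplier_utilization` is a hypothetical helper, and the numbers it produces are not those reported for Cambricon-X or SNAP.

```python
import math


def multiplier_utilization(nz_per_output, lanes=16):
    """Fraction of multiplier slots doing useful work.

    nz_per_output: number of NZ weights needed for each output activation.
    lanes: number of multipliers in the PE's multiplier array.
    Assumes each invocation serves a single output and the last invocation
    for an output is padded with zeros when fewer than `lanes` NZs remain.
    """
    useful = sum(nz_per_output)
    invocations = sum(math.ceil(n / lanes) for n in nz_per_output)
    if invocations == 0:
        return 0.0
    return useful / (invocations * lanes)
```

For instance, under this model an output that needs 20 NZ weights on a 16-lane array takes two invocations and fills 20 of 32 slots, while outputs with only 4 NZ weights each fill a quarter of the slots.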

TABLE IX
PE ARCHITECTURES FOR ACCELERATING SPARSE TENSOR COMPUTATIONS

| Function units | Accelerators |
|----------------|--------------|
| Scalar         | [34], [45], [61], [95], [97], [102], [110], [145], [147], [149], [151]–[153], [156], [157], [159], [164], [168] |
| SIMD / Vector  | [57], [62], [63], [96], [101], [109], [144], [146], [148], [155], [160], [161], [165] |

Some architectures employ multiplier arrays that are connected to accumulation units by crossbars [95], [96]. For each Cartesian product, the coordinate computation unit determines indices of accumulator banks that are supplied with the multiplier outputs. The coordinate computation unit and Cartesian product execution eliminate the need for data extraction (of matching pairs), and function units can spatially reuse data within the multiplier array for performing all-to-all multiplications. However, such computation may not be feasible (or may incur high computation costs) in several scenarios, including the processing of FC layers (with no batch processing), computations with non-unit stride, etc. [151]. Processing $1 \times 1$ convolutions on SCNN [96] resulted in intra-PE fragmentation (not enough useful work to feed the vectorized arithmetic units), resulting in 20% multiplier utilization. Moreover, a high-bandwidth crossbar between multipliers and accumulators (16×32) within each PE can introduce considerable costs in the total on-chip area (e.g., more than 20% in SCNN [96]) and power consumption. PEs of the SNAP architecture [148] use a configurable adder-tree design, which processes inputs from three multipliers and computes different combinations of partial summations. With multiple adders and multiplexers, such a PE design can concurrently process different partial summations (vs. gather in adder-tree), while eliminating the need for high-bandwidth crossbars (in the scatter networks). Such configurable designs make the architecture amenable to flexibly supporting layers or tensors of different sizes, e.g., depth-wise separable convolutions.
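The Cartesian-product execution with coordinate computation can be sketched in Python for a single-channel 2D convolution. The sketch assumes unit stride and no padding, and all function names are hypothetical; it illustrates the principle rather than the SCNN datapath, whose accumulator banking and crossbar are not modeled.

```python
def nonzeros(matrix):
    """List (row, col, value) for every NZ element of a 2D list."""
    return [(i, j, v) for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]


def cartesian_conv2d(acts, weights):
    """Unit-stride, unpadded 2D convolution via all-to-all NZ products.

    Returns (output, number_of_multiplications).
    """
    H, W = len(acts), len(acts[0])
    R, S = len(weights), len(weights[0])
    out_h, out_w = H - R + 1, W - S + 1
    out = [[0] * out_w for _ in range(out_h)]
    products = 0
    for ay, ax, a in nonzeros(acts):
        for r, s, f in nonzeros(weights):
            products += 1
            y, x = ay - r, ax - s       # coordinate computation for the product
            if 0 <= y < out_h and 0 <= x < out_w:
                out[y][x] += a * f      # accumulate at the computed coordinate
    return out, products


def dense_conv2d(acts, weights):
    """Reference dense convolution (same stride and padding assumptions)."""
    H, W = len(acts), len(acts[0])
    R, S = len(weights), len(weights[0])
    return [[sum(acts[y + r][x + s] * weights[r][s]
                 for r in range(R) for s in range(S))
             for x in range(W - S + 1)]
            for y in range(H - R + 1)]
```

Every NZ activation is multiplied with every NZ weight without any matching step; products whose computed coordinates fall outside the output are discarded, which hints at why such an approach can waste work in some layer configurations.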

3) Multiplier-free PEs: Previous accelerator designs including YodaNN [220], LogNet [141], [145], [221], and [222] have proposed multiplier-free PEs that significantly improve the energy efficiency of the hardware. To do so, the PEs process very low-precision data such as binary or ternary values [138], [139] or tensors with logarithmic quantization [223]. Such designs replace multipliers with simpler arithmetic operations like 2's complement (inverters and adders or subtraction units) [145], [220] or bit-wise shifts and additions [141], [221], [224]. However, one challenge in executing different DNNs on such accelerators is to maintain the accuracy, since aggressive quantization often drops the top-1 and top-5 accuracy, e.g., by 0.1% [221], [225] to 5% [141], [223]. Because flexibility is traded off for simpler hardware with limited arithmetic, it can become challenging to support executions of various ML models on such accelerators.

4) Bit-adaptive computing: Precision requirements of tensors for maintaining high accuracy can vary for different models. PE designs with precision-adaptive computing can achieve higher acceleration efficiency in such scenarios.
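Returning briefly to the multiplier-free PEs of item 3), the replacement of a multiplication by a shift under logarithmic quantization can be sketched as follows. This Python sketch assumes power-of-two weights of the form sign × 2^exp and integer activations; the rounding rule and the function names are assumptions for illustration, not the scheme of any cited design.

```python
import math


def log2_quantize(weight):
    """Round a nonzero weight to sign * 2**exp and return (sign, exp).

    A zero weight is not representable in this toy scheme and raises
    ValueError (from math.log2).
    """
    sign = -1 if weight < 0 else 1
    exp = round(math.log2(abs(weight)))
    return sign, exp


def shift_multiply(activation, sign, exp):
    """Multiply an integer activation by sign * 2**exp without a multiplier.

    Negative exponents use an arithmetic right shift, which floors the result.
    """
    shifted = activation << exp if exp >= 0 else activation >> -exp
    return -shifted if sign < 0 else shifted
```

For example, a weight of 3 is approximated as 4 in this scheme, which shows where the accuracy loss of aggressive quantization comes from.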

*Bit-serial computing:* Stripes [226] proposed bit-serial computing on SIMD PEs. Each PE contains an array of serial

TABLE X
PRECISION OF SPARSE TENSORS SUPPORTED BY ACCELERATORS

| Precision      | Accelerators                                                                             |
|----------------|------------------------------------------------------------------------------------------|
| binary/ternary | [145]                                                                                    |
| int8           | [132], [151]                                                                             |
| int16          | [34], [57], [61], [62], [95], [96], [102], [109], [146]–[148], [157], [160]–[163], [168] |
| logarithmic    | [110], [167]                                                                             |
| bit-adaptive   | [149], [227], [229]–[232]                                                                |
| FP8            | [97]                                                                                     |
| FP16           | [97], [159]                                                                              |
| FP32           | [45], [144]                                                                              |
| FP64           | [152], [156]                                                                             |

inner product units. Each unit produces an output activation with an array of AND gates, an adder tree, and a bit-wise shift of the partial output. The AND gates are fed a 1b input for bit-serial processing of activations and bit-parallel 16b weights. Thus, with bit-serial processing, Stripes supports a fixed bit-width for one operand and varying widths for the other operand for executing tensors of various DNNs with different precisions. Albericio et al. [227] showed that zero bits in NZ activations (quantized with 8b or 16b precision) can account for more than 50% of the bits and proposed the Pragmatic accelerator for leveraging sparsity of activation bits. Loom [228] enabled support for variable widths of both activations and weights with bit-serial processing on an execution engine that comprised a $2\times 2$ array of bit-serial subunits. Each subunit accepted 2b input activations and weights per cycle and produced two 1b×1b products. In the execution engine, subunits in the same row got the same weight bits and subunits in the same column got the same activation bits. Laconic [229] further improved acceleration efficiency by leveraging sparsity and processing only NZ bits of both activations and weights. UNPU [230] performed matrix multiplications by bit-serial processing of weights with arbitrary bit-widths between 1 and 16. It employed look-up table (LUT) based PEs followed by adders and shift logic. The LUTs contained predetermined partial products for multiplying three activations with corresponding weight bits.
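A bit-serial inner product with bit-serial activations and bit-parallel weights can be sketched as below. The Python sketch assumes unsigned activations and processes one activation bit per loop iteration (standing in for a cycle), LSB first; `bit_serial_dot` is a hypothetical name, and the sketch abstracts away the pipelining and weight width of the actual Stripes hardware.

```python
def bit_serial_dot(acts, weights, act_bits):
    """Inner product with bit-serial (unsigned) activations.

    Each iteration models one cycle: an AND of every weight with the current
    activation bit, an adder tree over the gated weights, and a shift of the
    partial output by the bit position.
    """
    acc = 0
    for b in range(act_bits):                     # LSB first
        gated = [w if (a >> b) & 1 else 0         # AND gates
                 for a, w in zip(acts, weights)]
        acc += sum(gated) << b                    # adder tree, then shift
    return acc
```

The number of loop iterations equals `act_bits`, so activations that need fewer bits finish in proportionally fewer cycles in this model, which is the source of the precision-dependent speedup.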

*Bit-decomposable computing:* Thinker architecture [164] consisted of PEs with two 8b×16b multipliers, which supported two 8b×16b multiplications in parallel or a single 16b×16b multiplication. Bit-fusion [233] employed fusion units consisting of an array of BitBricks. The fusion units can be configured for processing multiplications of 2b, 4b, 8b, or 16b operands. For processing NZs of sparse tensors of CNNs, Envision [149] used a single-cycle N-subword-parallel multiplier, followed by an N×48b/N reconfigurable adder. The subword-parallel design of PEs allowed configuration of MAC units for processing data of 4b, 8b, or 16b. Sparse processing unit architecture [232] employed DGRA, a decomposable-granularity systolic-style coarse-grained reconfigurable array (CGRA), for efficiently processing stream-join accesses. The DGRA PE and interconnect switches enabled decomposition for processing up to four 16b sub-word operands. DGRA also featured support for accessing sub-word data from the scratchpad memory. For DNN training with mixed-precision and sparse tensors, LNPU employed PEs with MAC units that can be configured to process either FP8 or FP16 tensors. Table X lists precisions of sparse tensors that are supported by different accelerators. Note that these precisions usually indicate the bit-width of input operands (activations and weights). For MAC operations, accumulators usually produce high-precision output (e.g., 32b in NullHop [157] for 16b weights and activations), which can be down-scaled or truncated afterward.
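The idea of composing one wide multiplication from two narrower ones can be sketched in Python as follows. The sketch assumes unsigned operands and models a pair of 8b×16b multipliers used either independently or fused; the function names are hypothetical, and the actual circuit of any cited design (including sign handling) is not reproduced here.

```python
MASK8 = 0xFF


def mul_8x16(a8, b16):
    """Model of one 8b x 16b unsigned multiplier."""
    assert 0 <= a8 <= 0xFF and 0 <= b16 <= 0xFFFF
    return a8 * b16


def two_parallel_8x16(a8_pair, b16_pair):
    """Low-precision mode: two independent 8b x 16b products."""
    return tuple(mul_8x16(a, b) for a, b in zip(a8_pair, b16_pair))


def mul_16x16_fused(a16, b16):
    """High-precision mode: one 16b x 16b product from the same two multipliers.

    a16 is split into high and low bytes; the high partial product is shifted
    by 8 bits before the final addition.
    """
    lo = mul_8x16(a16 & MASK8, b16)
    hi = mul_8x16(a16 >> 8, b16)
    return (hi << 8) + lo
```

In this model the same two multipliers deliver either two low-precision products or one high-precision product, which is the trade-off that bit-decomposable function units expose.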

5) Clock-gated PEs: PEs of some architectures including Eyeriss [34] and Thinker [164] are clock-gated, which yields energy savings when PEs are not in use for execution of the layer. Moreover, when some accelerator designs do not target obviating ineffectual computations for performance gains, they improve energy efficiency by skipping computations with zero values. The gating logic in the datapath of PEs ensures skipping the execution of function units upon encountering zero values [34], [167] or when the input value falls below some threshold [166]. For example, in the Eyeriss architecture, the PE fetches a data element of the input feature map from the local memory and compares it with zero. Upon a match, the gating logic disables reading a value of the filter from the local memory and prevents the MAC datapath from switching. Similarly, in the Envision architecture [149], the execution of the PE-rows and PE-columns is controlled by flags. When either a weight or an activation (which is shared by an entire row or a column) is zero, the datapath in the corresponding (row or column of) PEs is turned off. This prevented switching activity and improved energy consumption by about 60% for 30%-60% zeros in CNN tensors. For accelerating tensors with 15%-50% sparsity on the Sticker architecture [201], the gated PEs led to 65% savings of the power consumption with 12.5% memory overhead due to storing bitmaps in on-chip memory for zero-guarding.
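The behavior of zero-gating can be sketched with a small Python model of a scalar PE. It assumes that gating is triggered by zero activations only and that a gated cycle still elapses; `gated_mac` and its returned counters are hypothetical and serve only to separate energy-related activity from cycle count.

```python
def gated_mac(acts, weights):
    """Accumulate acts[i] * weights[i] while gating on zero activations.

    Returns (result, active_macs, weight_reads, cycles). A zero activation
    suppresses the weight read and the MAC, but the cycle is still spent,
    so gating saves switching activity rather than execution time.
    """
    acc, active_macs, weight_reads = 0, 0, 0
    for a, i in zip(acts, range(len(weights))):
        if a == 0:
            continue                    # gated: no weight read, no MAC switching
        w = weights[i]                  # local-memory read of the filter value
        weight_reads += 1
        acc += a * w
        active_macs += 1
    cycles = len(acts)                  # every element still occupies a cycle
    return acc, active_macs, weight_reads, cycles
```

With three zero activations out of five, this model performs two MACs and two weight reads but still reports five cycles, which mirrors the distinction between energy savings and speedup drawn above.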

6) Optimization opportunities:

(i) Exploring efficient designs of SIMD/vector function units for sparse computations: While PEs with multiple function units or multiplier arrays can effectively leverage spatial reuse [63], [165], the utilization of such function units can drop significantly due to unstructured sparsity [148]. Therefore, design space explorations of hardware accelerators [38], [88], [89], [218] need to accommodate the impact of SIMD/vector PEs and unstructured sparsity on the performance and energy efficiency. Moreover, for low sparsity, designs should deliver performance on par with accelerators for dense tensors. For example, for processing dense tensors, SCNN [96] achieved 79% of the performance and consumed 33% higher energy as compared to the baseline accelerator that processed only dense tensors. For consistent performance or energy efficiency with different sparsity levels, the design of the PE pipeline should ensure that the additional design features, e.g., lookup and intersection mechanisms (for codebook/NZs), do not increase the critical path latency and can be power-gated if not used.

(ii) Supporting various tensor precisions through bit-adaptive FUs, without increasing programming and compilation complexity: Precision requirements for tensors of different layers can vary significantly during training and inference of various models [226], [234]. Such varying precisions are explored to achieve further efficiency [46], [93]. Therefore, PEs, interconnect communication, and memory management need to support bit-adaptive computing, e.g., with bit-decomposable or bit-serial FUs for PEs. However, such designs should be



Fig. 19. Commonly used dataflow mechanisms for executing convolution layers on hardware accelerators.

generic enough to support multiple precisions of different tensors for the same or different layers, while retaining the support for different arithmetic and logic operations (e.g., multiplication, addition, shift). Moreover, such designs should not expose further details to the compiler or mapping optimizations. In other words, the design flow for programming, mapping, and code generation should not encounter significant modifications to support bit-adaptive computing.

#### B. Dataflow Mechanisms

1) Background: The efficiency of executing a layer of a model on a hardware accelerator depends on the computational, communication, and memory access patterns, which are commonly referred to as dataflow mechanisms [38], [74], [83]. A dataflow refers to the spatiotemporal execution of a model layer (nested loop) on the architectural resources [38], [74]. Here, spatial execution corresponds to how PEs exploit parallelism in the computation graph and process different subsets of the tensors. Temporal execution drives the data accesses throughout the memory hierarchy and data communication via the interconnect. Thus, depending on the layer's configuration and tensor dimensions, the dataflow can significantly impact the spatial execution, data reuse, and hiding of communication latency, and consequently, the execution time and energy consumption [63], [74], [83], [87], [89].

One way to classify dataflows is by what data is kept "stationary" in registers or the local memory of each PE (and reused fully before eviction), while other data is being iterated over. Some of the commonly used dataflow mechanisms are output stationary, weight stationary, input stationary, row stationary, no local reuse, etc. Fig. 19 shows an example of convolution and the layout of the stationary data, for a fine-grained mapping of the convolution with these dataflows. For example, in weight stationary dataflow, each weight value (from a 2D spatial filter) remains stationary on a unique PE and is reused many times while processing different activations from the input feature maps (corresponding to the same input channel C). By processing a unique weight from a 2D filter, each PE produces partial summations for output feature maps. Therefore, PEs need to communicate for accumulating partial summations, before outputs are written back to the memory hierarchy. Thus, in a weight stationary approach,



Fig. 20. Low utilization of a  $16 \times 16$  PE-array in (a) coarse weight stationary dataflow when executing depth-wise layers and (b) input stationary dataflow for executing later layers of deep CNN models (Figure inspired by [63]).

input and output activations are fetched from off-chip memory, shared scratchpad, and the PE's local memory several times, while weights are continuously reused from the PE's local memory. After exploiting the weight reuse to an extent, a new set of weights is loaded from the memory in a new execution pass, and the execution repeats. Weight reuse in CNNs is higher when processing larger input feature maps and is multiplied further when images are processed in a batch [83]. Fig. 10 provides an overview of different data reuse opportunities possible in different types of layers of models; Fig. 21 lists different characteristics of the layers.

Note that these dataflows can be applied at a coarser level, where each PE can process a block or planar data (1D/2D/3D) of the tensors. For example, in a coarse weight stationary approach [38], each PE processes unique weights of an entire 2D filter. In other words, tensor dimensions C and M are laid out spatially on PEs. So, during an execution pass, each row of PEs receives tensors corresponding to a unique input channel and each column of PEs processes tensors corresponding to a unique output channel. Therefore, the same activations need to be multicast to each row of PEs, different weights need to be provided to each PE, and partial summations that correspond to unique output channels can be accumulated vertically [36]. Similarly, in an input stationary dataflow, unique activations (or blocks of input feature maps) remain stationary in the PE's memory till their maximum reuse. In an output stationary dataflow, each PE produces a unique output value (corresponding to the feature map of the same or different output channel) [83]. To facilitate such execution, PEs can process spatial data and input channels first, so that partial summations can be accumulated in the memory of each PE. Thus, different dataflows uniquely exploit the spatial parallelism and data reuse of different tensors.
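The distinction between output stationary and weight stationary dataflows can be sketched as two loop orderings of the same 1D convolution. This Python sketch is a simplification: the outer loop stands in for the spatial assignment to PEs, and the counter `psum_traffic` (an assumed proxy) counts how often a partial sum is updated outside the PE that holds the stationary operand.

```python
def conv1d_output_stationary(acts, weights):
    """Each output stays in one PE until complete; it is written out once."""
    n_out = len(acts) - len(weights) + 1
    out, psum_traffic = [], 0
    for x in range(n_out):                  # one output per PE (spatial)
        psum = 0                            # stationary in the PE's register
        for r, w in enumerate(weights):     # temporal iteration over weights
            psum += acts[x + r] * w
        out.append(psum)
        psum_traffic += 1                   # single write-back of the output
    return out, psum_traffic


def conv1d_weight_stationary(acts, weights):
    """Each weight stays in one PE; partial sums move between PEs or memory."""
    n_out = len(acts) - len(weights) + 1
    out, psum_traffic = [0] * n_out, 0
    for r, w in enumerate(weights):         # one weight per PE (spatial)
        for x in range(n_out):              # temporal iteration over outputs
            out[x] += acts[x + r] * w       # update a non-local partial sum
            psum_traffic += 1
    return out, psum_traffic
```

Both orderings compute the same outputs, but they trade weight reuse against partial-sum movement, which is why the preferred dataflow depends on the layer dimensions.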

Dataflow optimization: Model layers and corresponding tensors usually have large dimensions, and therefore, many ways exist for spatiotemporally executing each layer on the computational and memory resources of a given accelerator. It is crucial to optimize the dataflow (mapping of a layer onto the accelerator) since it can significantly impact the power and performance of the execution [38], [74], [89]. For example, Parashar et al. [87] analyzed executing the VGG CONV3\_2 layer on an NVDLA-like architecture [154] with 1024 MAC units. They reported that 480k mappings were within 5% of the peak performance, but varied by about 19× in energy efficiency. Among these

| Layers | Examples | Characteristics | Implications on Execution |
|--------|----------|-----------------|---------------------------|
| Early CONV | AlexNet CONV1,<br>MobileNet CONV1 | Large H/W of feature maps<br>Fewer filters<br>Fewer channels<br>Low sparsity of weights and activations | High weight reuse<br>Low input reuse<br>Low reuse of partial summations |
| Last CONV | ResNet-50 CONV4_x & CONV5_x,<br>VGG-16 CONV4 & CONV5 | Small H/W of feature maps<br>More filters<br>More channels<br>High weight sparsity and moderate activation sparsity | Low weight reuse<br>High input reuse<br>High reuse of partial summations |
| MLP | AlexNet FC,<br>VGG-16 FC,<br>NeuralTalk | Smaller activation vectors<br>Large weight matrix<br>High weight sparsity and low/moderate activation sparsity | No weight reuse (batch size=1)<br>Execution is memory-bounded or communication-bounded |
| Depth-wise CONV | MobileNet d/w,<br>Xception | Single filter<br>No channel-wise reduction<br>No/low sparsity due to non-pruning of few parameters | Low input reuse<br>Low reuse of partial summations<br>Low computation and high communication requirements |
| Point-wise CONV | MobileNet s/1,<br>Inception module in GoogLeNet,<br>Fire module in SqueezeNet | 1x1 convolution kernel<br>Low/moderate sparsity due to few parameters | Reduced reuse of weights & activations<br>Limits structured sparsity to channel and filter direction |
| Residual Layers | ResNet-50,<br>U-Net | Concatenation of outputs<br>Additional inputs from previous layers | Additional accesses to off-chip memory<br>Opportunity for on-chip reuse of intermediate feature maps |
| Group CONV | ResNeXt Aggregation Blocks | Parallel paths due to cardinality | Opportunity for input reuse with fused executions |
| 3D CNN | C3D | Temporal processing across frames | Heavy computations, increased data reuse |

Fig. 21. Characteristics of different DNN layers pertaining to hardware execution (Figure inspired by [63], [89]).

mappings, only one was energy-optimal, while nine were within 1% of the optimal energy efficiency. Further, as Fig. 21 shows, reuse characteristics and tensor dimensions can vary significantly for different DNN layers. Hence, a single dataflow may not always be effective for accelerating different layers of a DNN model and for varying DNNs. Fig. 20 provides two such examples that lead to low utilization of the PE-array. For example, the coarse weight-stationary dataflow mechanism processes different 2D filters on different PEs. Therefore, it cannot efficiently process the depth-wise layer [52] of separable convolutions. Similarly, mappings are often obtained with output-stationary or input-stationary dataflow mechanisms for processing the spatial dimensions of feature maps on different PEs (Fig. 20b). Such mappings result in low utilization of PEs for processing later layers of deep CNNs [96]. Therefore, dataflows need to be optimized for adapting to the tensor dimensions and layer characteristics.
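The utilization loss in such mappings can be estimated with a back-of-the-envelope Python sketch. It assumes that two tensor dimensions are laid out spatially on the rows and columns of the PE-array and that oversized dimensions are tiled into multiple passes; `pe_array_utilization` is a hypothetical helper, and the example sizes (a depth-wise layer modeled with one input channel per filter, and a 7×7 feature map) are illustrative assumptions rather than values taken from Fig. 20.

```python
import math


def pe_array_utilization(dim_rows, dim_cols, rows=16, cols=16):
    """Average fraction of busy PEs when two dimensions are mapped spatially.

    dim_rows: extent of the tensor dimension mapped onto PE rows.
    dim_cols: extent of the tensor dimension mapped onto PE columns.
    Dimensions larger than the array are tiled into several passes.
    """
    passes = math.ceil(dim_rows / rows) * math.ceil(dim_cols / cols)
    return (dim_rows * dim_cols) / (passes * rows * cols)


# Coarse weight stationary (C on rows, M on columns) with C=1 and M=32:
depthwise_util = pe_array_utilization(1, 32)
# Spatial mapping of a small 7x7 feature map onto the 16x16 array:
late_layer_util = pe_array_utilization(7, 7)
```

Under these assumptions, the depth-wise example keeps about 6% of the PEs busy and the 7×7 example about 19%, whereas a layer with 64 input and 64 output channels would fill the array completely.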

TABLE XI
DATAFLOW MECHANISMS OF ACCELERATORS

| Dataflow                 | Accelerators |
|--------------------------|--------------|
| Input Stationary         | [96], [144], [145], [148], [165] |
| Output Stationary        | [45], [57], [61], [62], [95], [97], [101], [102], [109], [146], [149], [151]–[153], [155], [159], [160], [168] |
| Weight Stationary        | [163] |
| Coarse Weight Stationary | [144], [147], [148] |
| Row Stationary           | [34], [63] |

With the vast space of execution methods and the growing development of new models (with varying dimensions of tensors), it becomes hard for non-expert programmers to figure out optimized execution methods for efficient accelerations. Therefore, many mapping optimization tools have been proposed recently including Timeloop [87], Interstellar [38], MAESTRO [89], and dMazeRunner [74]. These tools analytically model the accelerator execution to estimate execution metrics and evaluate the set of mappings from the pruned mapping space for various dataflows. For example, Yang et al. [38] presented a framework for optimizing energy efficiency for executions on DNN accelerators. Dave et al. [88] presented the dMazeRunner framework to optimize DNN models from MXNet and other libraries for performance, energy consumption, or energy-delay product. MAESTRO [89] provided data-centric directives to specify the execution methods and estimated the accelerator efficiency for dataflow mechanisms through an analytical model.
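The general idea of enumerating mappings and ranking them with an analytical model can be sketched on a toy problem. The Python sketch below maps an M×K by K×N matrix multiplication onto a fixed number of PEs by choosing spatial unrolling factors for M and N; the cost model (cycles and shared-buffer fetches) and all names are assumptions made for illustration and do not correspond to the models or interfaces of Timeloop, Interstellar, MAESTRO, or dMazeRunner.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_mappings(M, N, K, num_pes):
    """Yield candidate mappings with toy estimates of cycles and fetches.

    pm, pn: spatial unrolling factors of the M and N dimensions.
    Per cycle, pm values of the first operand and pn values of the second
    are fetched from a shared buffer and multicast across the PE-array.
    """
    for pm, pn in product(divisors(M), divisors(N)):
        if pm * pn > num_pes:
            continue                        # mapping does not fit on the array
        cycles = (M // pm) * (N // pn) * K
        fetches = cycles * (pm + pn)        # proxy for buffer-access energy
        yield {"pm": pm, "pn": pn, "cycles": cycles, "fetches": fetches}


def best_mapping(M, N, K, num_pes):
    """Pick the fastest mapping, breaking ties by fewest buffer fetches."""
    return min(enumerate_mappings(M, N, K, num_pes),
               key=lambda m: (m["cycles"], m["fetches"]))
```

Even in this toy setting, several mappings reach the same cycle count while differing in buffer fetches, which echoes the observation that mappings with near-identical performance can differ widely in energy efficiency.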

2) Sparsity-aware dataflows: When the data is sparse and stored in a compressed format, computation patterns and data communication can become highly irregular [110]. Therefore, dataflows need to adapt to the sparse and approximated data. Typically, dataflows for sparse tensor computations are similar to the dataflows optimized for dense tensors, but they process the (sparse and approximated) data in the compressed format (treating it as zero-free dense tensors). For correct functionality, such dataflow executions are facilitated by NZ detection and data extraction logic in the PE architecture [63], [96]. For example, SCNN [96] used a novel PT-IS-CP dataflow. It processed planar tiles of input activations with an input-stationary dataflow, producing a Cartesian product and the appropriate accumulations on each PE. The PT-IS-CP-sparse dataflow in SCNN extended the PT-IS-CP (dense) dataflow and processed only NZ activations and weights in their compressed format, both for accessing the data from memory and for performing tensor computations on PEs. The coordinate computation module in each PE ensured that the all-to-all multiplications generated by the Cartesian product of NZ inputs and weights were accumulated correctly and stored in the appropriate accumulation buffers. Table XI lists different sparsity-aware dataflow mechanisms used by accelerators.
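The Cartesian-product idea described above can be illustrated with a minimal software sketch. It is not SCNN's implementation: it assumes a single input channel, a single filter, unit stride, and no tiling across PEs, and all function names are hypothetical. Every NZ activation is multiplied with every NZ weight, and a coordinate computation decides which accumulator (if any) receives the product.

```python
def compress(dense_2d):
    """Return (row, col, value) triples for the non-zero (NZ) entries."""
    return [(r, c, v)
            for r, row in enumerate(dense_2d)
            for c, v in enumerate(row) if v != 0]


def cartesian_product_conv(nz_acts, nz_weights, out_h, out_w):
    """Valid 2D convolution (cross-correlation) computed from NZ operands only."""
    out = [[0] * out_w for _ in range(out_h)]
    for (ar, ac, a) in nz_acts:          # every NZ activation ...
        for (wr, wc, w) in nz_weights:   # ... meets every NZ weight
            # Coordinate computation: which output does this product belong to?
            orow, ocol = ar - wr, ac - wc
            if 0 <= orow < out_h and 0 <= ocol < out_w:
                out[orow][ocol] += a * w  # scatter into the accumulator buffer
    return out


def dense_conv(x, k):
    """Reference dense computation, used only to check the sketch."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * k[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(ow)] for r in range(oh)]


# Hypothetical 4x4 activation tile and 2x2 filter, both sparse.
x = [[0, 2, 0, 1],
     [3, 0, 0, 0],
     [0, 0, 4, 0],
     [0, 5, 0, 0]]
k = [[1, 0],
     [0, -1]]
sparse_out = cartesian_product_conv(compress(x), compress(k), 3, 3)
```

In this toy case, 5 NZ activations and 2 NZ weights yield 10 products, compared with 36 multiplications for the dense loop nest; products whose coordinates fall outside the output tile are discarded in this sketch.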

EyerissV2 [63] used the row-stationary dataflow (originally introduced by Eyeriss [83] for dense tensor computations), while processing compressed data of sparse tensors that fit in the local and shared memories of PEs. Using statically known sparsity of weights, more NZ weights were allocated in the memory. For example, each PE in EyerissV2 can store up to 192 NZ weights. The mapping of different CONV and FC layers of AlexNet with the row-stationary dataflow allocated 64–174 NZ weights, which corresponded to a total of 132–480 weights in the dense format (including zeros in the tensor). With in-PE data extraction logic, each PE only processed NZ values from CSC-encoded data. Thus, sparsity-aware dataflow optimizations or explorations can be performed with the information about the pre-known (or expected bounds of) sparsity levels and the degree of value sharing. Consequently, optimized mappings of dataflows can be achieved for executing different model topologies and varying tensor sizes. For example, EyerissV2 accelerated CONV, FC, and depth-wise separable layers of varying sparsity with the row-stationary dataflow. While sparsity-aware dataflows bring regularity to the processing, they may result in considerable load imbalance, since different PEs may have different numbers of NZ values to process [110]. Addressing such imbalance can speed up the execution by more than 32% [110], which is discussed later in section XIII.

#### 3) Optimization opportunities:

(i) Dataflow optimizations accounting for storage and computational overheads due to metadata of sparse tensors and lookup tables of value-shared tensors: Sparse and value-shared tensors are processed along with the metadata (which indicates the positions of NZ values in a tensor) and a codebook (e.g., for looking up the shared value corresponding to the position of a tensor element), respectively. This requires additional processing, e.g., buffer management, interconnect communication, and indexing of appropriate values. Depending on the dataflow, such additional management can considerably increase the execution cost.

Existing tools for dataflow optimizations (e.g., [38], [87], [88]) mainly focus on improving the performance of dense tensor computations. Similarly, recent accelerators such as SCNN [96] and EyerissV2 [63] optimize sparse tensor computations, but they target specific sparsity-aware dataflows. Therefore, dataflow exploration frameworks need to consider the sparsity and value sharing of tensors and their variations across layers for achieving efficient mappings and hardware designs. To facilitate such support, these tools can include the costs of the corresponding storage, communication, and computational overheads (due to metadata, codebook, etc.) in their analytical cost models. For instance, after exploring dataflow mappings, an optimized dataflow for a specific layer of a model may choose to offload distinct fractions of the codebook or memoized values to different PEs, which then remain stationary throughout the execution. Moreover, design space explorations of accelerator hardware for processing sparse and irregular-shaped tensors can be flexible enough to support multiple dataflows. Further, the execution modeling of accelerators can be made adaptive to the variations in bit-widths of tensors and the implications of bit-decomposable computing in the hardware. Then, the obtained estimations of execution time or energy consumption can more accurately reflect the implications of sparsity, variable precisions, and value sharing, leading to explorations of efficient dataflow mappings and hardware designs.

(ii) Sparsity-aware mappings for resource partitioning: Accelerators can efficiently execute multiple layers simultaneously by partitioning the architectural resources [209]. Moreover, DNN executions can be further accelerated with multiple accelerators by leveraging model-parallelism or data-parallelism [235], [236]. Techniques for resource partitioning aim to reduce data movement and maximize reuse of the data from the on-chip memory of accelerators. Optimization of the resource partitioning (many-to-many mappings between layers and accelerators) can be crucial for several applications that require low-latency execution for real-time processing or high frame rates (e.g., accelerating multiple object detection models in autonomous vehicle systems that process the same frames). Sparsity of tensors provides further acceleration opportunities due to reduced communication latency and lower storage requirements. Therefore, sparsity-aware resource partitioning techniques can be developed that consider the anticipated sparsity levels of tensors and any overheads due to decoding, storing, or communicating metadata of the encoded tensors.

#### C. Local Memory Management

1) Temporal data reuse: To further exploit temporal reuse of the tensors and to minimize accesses to the shared on-chip memory, PEs employ private RFs or SRAMs (e.g., in SNAP [148], EyerissV2 [63], LNPU [97]). Data reuse opportunities can diminish for sparse tensors (section X-A). However, DNNs still offer significant opportunities for reusing the sparse tensors, which can be exploited for improving energy efficiency and performance. Reusing tensors temporally through local memory becomes feasible when accelerators use in-PE NZ detection and data extraction. This is because PEs can iterate over the data stored in the local memory, obtain matching operands, and perform computations. When a central module is used for data extraction before communicating the data to PEs, it usually extracts matching operands to be streamed to each PE. In contrast, housing data locally allows PEs to maintain the tensor blocks for later reuse; their data extraction or coordinate computation logic helps in extracting the correct operands. For example, accelerators including EyerissV2 [63], SCNN [96], and Cambricon-S [57] leverage temporal reuse of different tensors. Similarly, PEs in the SNAP architecture [148] buffer NZ tensors in local RFs. However, for extracting matching operands, SNAP PEs communicate data and metadata to a shared (row-wise) index matching module, which provides sequences of extracted operands back to PEs for further computations. Thus, temporal reuse of tensors enables accelerators to alleviate the overheads of repeatedly accessing compressed tensors, decoding them, and then communicating the data and metadata to PEs. However, such benefits are traded off against either the cost of communicating the data or metadata between PEs and a shared module, or the storage costs (area and power) of tensors that see low reuse due to high sparsity.

2) Hiding communication latency: Just as it is important to hide the miss penalty of communicating data between on-chip memory and DRAM [57], [62], [96], [110], it is important to hide the latency of communicating data from the shared on-chip memory to PEs. Accelerators achieve this objective either by double-buffering the local memory of PEs [151], [152] or by providing an asynchronous communication mechanism that can refill the PE's memory after some of the data in the local memory has been consumed for useful computations [34], [63]. For instance, a configurable communication network allows PEs to

TABLE XII
LEVERAGING VALUE SIMILARITY AND REDUCED COMPUTATIONS.

| Technique                                 | Scope       | References                                      |
|-------------------------------------------|-------------|-------------------------------------------------|
| Value similarity                          | Weights     | [57], [61], [133], [153], [241]                 |
| Value similarity                          | Activations | [134], [237]                                    |
| Computation reuse and memoization         | Partial     | [133], [134], [161], [237], [239], [241], [242] |
| Computation reuse and memoization         | Full        | [188], [238], [240]                             |
| Computation reduction / early termination |             | [243]–[246]                                     |

execute in a dataflow fashion; PEs can request partial refills of their buffers with new data. EyerissV2 [63] proposed a hierarchical mesh interconnect with configurable router nodes, which can be configured to communicate data between the source (e.g., shared memory) and destination (e.g., PEs) ports via broadcast, multicast, or unicast.

#### D. Leveraging Value-Similarity and Computation Reuse

Several recent techniques have explored leveraging value similarity and computation reuse for accelerating DNNs. This is because the data of images and videos exhibit high similarity spatially (e.g., among neighboring pixels in images) and temporally (e.g., similar input data in consecutive image frames) [134], [237], [238]. Moreover, after lowering the precision of tensors, many values in the feature maps and filters are the same and repeat frequently [43], [133]. Therefore, these values can be compressed further by maintaining a codebook that only contains unique values [61]. Furthermore, due to the repetition of similar values, computation of many values can be reused, either partially while processing a layer of the model [133], [134], [239] or entirely by skipping the processing of a layer [188], [238], [240]. Table XII lists various techniques that have leveraged such reuse of values and computations to reduce the storage and computation requirements. This subsection describes such techniques and how accelerators leverage them, e.g., with additional support in the PE datapath.

1) Weight similarity: Prior studies have shown that elements of weight tensors can be effectively approximated with a smaller set of unique values. This is because lowering the precision (e.g., to 8b) of large tensors (e.g., 2.3M weights in the ResNet CONV5\_2 layer) leads to a small set of unique values (up to 256), which leads to repetition of the values [133]. Pruning of the elements also leaves behind relatively few values, which can contribute to high quantization. Hegde et al. [133] showed that for quantized weights of DNNs, each NZ value was mostly repeated more than 10 times, and more than 100 times in later layers of the AlexNet and ResNet-50 models. Han et al. [43] showed that after pruning weight tensors of DNNs, the data can be quantized with k-means clustering for value sharing. Then, the unique values can be represented with 4 or 5 bits and shared among all elements, without dropping classification accuracy. Similarly, for obtaining higher compression of models, Cambricon-S [57] employed local quantization of weights by dividing the weights into sub-matrices and then applying clustering on the sub-matrices. Local quantization helped achieve smaller indices of shared weights, with a separate codebook for each sub-matrix. For example, for the FC6 layer of AlexNet, global quantization achieved 5b indices of shared weights, a 128 B codebook, and 2 MB of weights, whereas local quantization achieved 4b indices, 4 kB of codebooks for 64 sub-matrices, and 1.6 MB of weights [57]. RAPIDNN [135] used a multi-level clustering of tensors with a tree-based codebook generation to achieve efficient compression at high accuracy. Thus, quantization techniques leveraging the weight similarity can compress the pruned models further by up to an order of magnitude [43], [57].
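As an illustration of value sharing, the following minimal sketch clusters a handful of weights into a codebook with plain one-dimensional k-means and estimates codebook storage. It is not the procedure of [43] or [57]: the linearly spaced initial centroids, the function names, and the 4-byte codebook entries are assumptions made here for illustration. Under the 4-byte assumption, the arithmetic happens to reproduce the 128 B (5b indices) and 4 kB (4b indices, 64 sub-matrices) codebook sizes quoted above.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's k-means on scalars (k >= 2), with linearly spaced initial centroids."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda j: abs(v - centroids[j]))

    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(v)].append(v)
        # Keep the old centroid if a cluster received no values.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    indices = [nearest(v) for v in values]
    return centroids, indices


def codebook_bytes(index_bits, bytes_per_entry=4, num_codebooks=1):
    """Codebook storage, assuming 2**index_bits entries per codebook."""
    return (2 ** index_bits) * bytes_per_entry * num_codebooks


# Hypothetical weights that cluster around three shared values.
weights = [-1.02, -0.98, -1.0, 0.01, -0.01, 0.0, 0.99, 1.0, 1.01]
codebook, indices = kmeans_1d(weights, k=3)
shared = [codebook[i] for i in indices]   # decoded (value-shared) weights

global_cb = codebook_bytes(5)                    # one codebook, 5b indices
local_cb = codebook_bytes(4, num_codebooks=64)   # 64 sub-matrices, 4b indices
```

Each weight is then stored as a short index into the codebook rather than as a full-precision value, which is where the compression comes from.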

For processing sparse and value-shared weights, each PE of the EIE accelerator [61] fetched an activation from its work queue and accessed the local memory to obtain matching NZ weights. Upon finding a match, the PE looked up the weight decoder with the encoded index of the weight to obtain the shared value. Then, the MAC unit in the datapath processed the NZ activation and weight. Similarly, Cambricon-S [57] processed weights compressed with local quantization. It used a weight decoder module with a LUT in each PE to extract the shared value of the weight. Depending on the lookup mechanism and the number of bits that need to be extracted from the decoder at every cycle, the decoding module can incur considerable area and power costs. For example, the weight decoding module in Cambricon-S consumed about 32.56% of the total on-chip area and 3.98% of the on-chip power.
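The processing steps described above for sparse, value-shared weights can be sketched in software as follows. This is a simplified model under stated assumptions, not EIE's actual datapath: it models a single PE, stores absolute row indices instead of a compact relative encoding, and uses hypothetical function names. NZ weights are kept column-wise (CSC) as codebook indices, zero activations are skipped, and each MAC is preceded by a codebook lookup.

```python
def to_csc_shared(dense, codebook):
    """Encode a weight matrix column by column, storing codebook indices."""
    col_ptr, row_idx, code_idx = [0], [], []
    for c in range(len(dense[0])):
        for r in range(len(dense)):
            w = dense[r][c]
            if w != 0:
                row_idx.append(r)
                code_idx.append(codebook.index(w))  # encoded weight index
        col_ptr.append(len(row_idx))
    return col_ptr, row_idx, code_idx


def sparse_matvec(col_ptr, row_idx, code_idx, codebook, acts, num_rows):
    """Compute W @ acts using only NZ activations and NZ weights."""
    out = [0.0] * num_rows
    macs = 0
    for j, a in enumerate(acts):
        if a == 0:
            continue                       # zero activations are never issued
        for p in range(col_ptr[j], col_ptr[j + 1]):
            w = codebook[code_idx[p]]      # weight decoder lookup
            out[row_idx[p]] += w * a       # MAC on the NZ pair
            macs += 1
    return out, macs


# Hypothetical 3x4 weight matrix whose NZ entries share three values.
codebook = [0.5, -1.0, 2.0]
W = [[0.5, 0.0, 0.0, 2.0],
     [0.0, -1.0, 0.0, 0.0],
     [2.0, 0.0, 0.0, 0.5]]
acts = [1.0, 0.0, 3.0, 2.0]

csc = to_csc_shared(W, codebook)
out, macs = sparse_matvec(*csc, codebook, acts, num_rows=3)
```

In this toy case only 4 MACs are issued, compared with 12 for the dense matrix-vector product.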

2) Input similarity: Frames of speech signals or video data can exhibit high similarity spatially or temporally. This is because a speech signal can be quasi-stationary for a short interval. Moreover, successive executions of DNNs process overlapping windows of frames for extracting the context information [134]. Similarly, feature maps in DNNs for computational imaging can exhibit high spatial correlation [237]. The high similarity of inputs enables storing only the unique input values and reusing the computations, along with differential processing of the non-similar data.

For example, Riera et al. [134] have shown that after uniform linear quantization of input tensors of different DNNs (e.g., C3D [22], EESEN [247], CNN for self-driving cars [248]), on average 61% of input activations are the same as in the previous DNN execution and 66% of the computations can be avoided. Their accelerator [134] maintains centroids of quantized inputs $(c_i^p)$ along with the quantized index $(idx_i^p)$ of the corresponding centroid (i.e., shared value) for each input element. Then, while processing subsequent frames, it processes frames layer-wise with differential computing. For example, for each activation of an FC layer (of the new frame), it calculates the centroid $(c_i)$ and index $(idx_i)$ for quantization and then compares it to the memoized centroid $(c_i^p = centroid(idx_i^p))$. If the difference $(d_i)$ is zero, then the next activation is processed while reusing the output from the previous execution. Otherwise, the index in the buffer is updated with the new value $(idx_i)$, and new values for output activations are computed by executing multiplications of the difference value $(d_i)$ with the corresponding weights $(W_{io})$ on 128 multipliers. Then, 128 adders can add partial outputs from multipliers to the previously computed output activations. Finally, updated outputs are written back to the on-chip output buffer. For processing all input activations with the differential computing, the calculation of output activations



Fig. 22. (a) Leveraging weight similarity and reuse of partial outputs [133]. (b) Modifications in UCNN PE architecture (shaded blocks) for buffering indirection tables, partial summations of activation groups, and memoization of partial outputs. (Figure adopted from [133].)

can be visualized as: $z_o' = z_o + \sum_{i=1}^{K} (c_i - c_i^p) \cdot W_{io}$.
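The differential update of an FC layer can be sketched as below. This is a minimal software model, not the accelerator of [134]: the uniform quantizer with a fixed step, the function names, and the toy dimensions are assumptions. Only inputs whose quantized index changed since the previous execution trigger multiplications, and the previous outputs are corrected by $d_i \cdot W_{io}$.

```python
def quantize(x, step):
    """Uniform linear quantization: return (index, centroid)."""
    idx = round(x / step)
    return idx, idx * step


def full_fc(centroids, weights):
    """Reference FC layer: z_o = sum_i c_i * W_io."""
    num_out = len(weights[0])
    return [sum(c * weights[i][o] for i, c in enumerate(centroids))
            for o in range(num_out)]


def differential_fc(acts, prev_idx, prev_out, weights, step):
    """Update prev_out in a new list; prev_idx is updated in place."""
    out = list(prev_out)
    macs = 0
    for i, x in enumerate(acts):
        idx, c = quantize(x, step)
        if idx == prev_idx[i]:
            continue                      # d_i == 0: reuse previous result
        d = c - prev_idx[i] * step        # d_i = c_i - c_i^p
        prev_idx[i] = idx                 # memoize the new index
        for o in range(len(out)):
            out[o] += d * weights[i][o]   # z_o' = z_o + d_i * W_io
            macs += 1
    return out, macs


# Hypothetical layer with K = 3 inputs and 2 outputs.
W = [[1, 2], [3, 4], [5, 6]]
step = 0.5
frame1 = [1.0, 2.0, 0.5]
frame2 = [1.1, 2.0, 1.6]   # only the third input changes after quantization

idx1 = [quantize(x, step)[0] for x in frame1]
out1 = full_fc([quantize(x, step)[1] for x in frame1], W)
out2, macs2 = differential_fc(frame2, idx1, out1, W, step)
```

For the second frame, the sketch issues 2 multiplications instead of the 6 needed to recompute the layer, and the result matches a full recomputation on the quantized inputs.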

3) Computation reuse (partial during processing a model layer): UCNN [133] leverages repetition of weights by forming activation groups that share the same weight and then multiplying the summation of activations by the shared value of the weight. Moreover, it also leverages reuse of the activation sub-groups, i.e., memoization of the partial summations of activations that can repeatedly appear across different filters. For example, as Fig. 22(a) depicts, weight values A and C can be shared among the corresponding activation groups and, after summing the corresponding activations, it takes only one multiplication each for processing the partial product. In producing activation groups (local summations), sub-groups like (r+s) can be reused with memoization. Since each filter m can have U unique values, it requires indexing different numbers of unique weights and activation groups (summations) for calculating each output activation. Indirection tables can map locations of each activation that correspond to the unique weight of a filter. For example, the indirection table for input activations provides indices iiT[m,i,0]...iiT[m,i,grpsize(m,i)-1] that correspond to the activations to be multiplied by a shared weight with index wiT[m,i] (the $i^{th}$ unique weight in filter m). Fig. 22(b) shows modifications in the PE datapath for buffering such indices of the shared weights and grouped activations. After iteratively accumulating activations within a group, the summation is fed into a multiplier for calculating the partial output for an output activation. For calculating the summation of activations within a group, streamed inputs can also be accumulated with memoized activation sub-groups. UCNN reported an area overhead of up to 17%–24% for PE enhancements and a 1.8× speedup for CNNs as compared to the baseline accelerator that stored tensors in RLC-5b format and processed tensors in a dense format without exploiting weight repetition.
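The factorization behind activation groups can be shown with a small sketch: a dot product is regrouped so that activations sharing a weight value are summed first and multiplied once. This is only an illustration under simplifying assumptions (a single filter, no memoization of sub-groups across filters, hypothetical function names); the dictionary plays the role of the indirection table.

```python
from collections import defaultdict


def dot_dense(weights, acts):
    """Baseline: one multiplication per weight position."""
    return sum(w * a for w, a in zip(weights, acts)), len(weights)


def dot_grouped(weights, acts):
    """Sum activations per unique NZ weight, then multiply once per group."""
    groups = defaultdict(list)       # unique weight -> activation positions
    for pos, w in enumerate(weights):
        if w != 0:
            groups[w].append(pos)
    total, mults = 0, 0
    for w, positions in groups.items():
        group_sum = sum(acts[p] for p in positions)  # additions only
        total += w * group_sum                       # one multiplication
        mults += 1
    return total, mults


# Hypothetical filter with two unique NZ values.
weights = [3, 0, 5, 3, 5, 3]
acts = [1, 2, 3, 4, 5, 6]
dense_result, dense_mults = dot_dense(weights, acts)
grouped_result, grouped_mults = dot_grouped(weights, acts)
```

Here both variants produce the same output, but the grouped form needs 2 multiplications instead of 6, at the cost of storing the position lists.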

Silfa et al. [239] showed that the relative difference between the current and previous output of activations of common RNNs (e.g., DeepSpeech2 [249], EESEN [247]) was on average less than 23%; leveraging the temporal similarity of outputs saved more than 24% computations with negligible accuracy loss. To predict whether an output activation leads to a similar value as previously computed output, their technique extended each RNN layer with a binary neural network (BNN).

With BNN outputs correlating to actual outputs, execution of much smaller BNN layers led to an effective and hardware-efficient prediction of the temporal output-similarity. Their fuzzy memoization technique dynamically cached outputs of activations from the previous layer $(y_m)$ in a memoization table, along with the output from the corresponding BNN $(y_m^b)$. For a new input sequence, it compared the newly computed output from the BNN of the new layer $(y_t^b)$ with the previously cached BNN output $(y_m^b)$. If the accumulation of their difference $(\delta_t^b)$ through the consecutive time steps (from m to t) was smaller than a threshold ($\theta$, about 30%–50%), then the memoized output $(y_m)$ was reused. Otherwise, the output activation $(y_t)$ was calculated from the actual values of input activations and weights (high-precision) and was memoized $(y_m)$. Also, the cached BNN output $(y_m^b)$ was updated with the new BNN output $(y_t^b)$ and the difference $(\delta_t^b)$ was reset to zero. The fuzzy memoization unit was implemented with a memoization buffer, a binary dot product unit consisting of XNOR gates (for binary multiplications) and an adder tree, and a comparison unit. The fuzzy memoization unit was added to each of the computation units for processing RNN gates. The computation units also contained buffers for weights and activations, a dot product unit consisting of multipliers and an adder tree, and a multi-function unit. Upon getting the input, the memoization unit evaluated the partial BNN (for a gate) and determined whether the memoized output can be reused. If not, then it triggered the dot product unit of the computation unit for computing the actual output.
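The control flow of fuzzy memoization can be sketched for a single neuron as follows. Several details are assumptions made for illustration and are not taken from [239]: the BNN is obtained here by simply taking the signs of the full-precision weights and inputs, the difference is accumulated as a relative change of the BNN output, and the class and method names are hypothetical.

```python
def sign(v):
    return 1 if v >= 0 else -1


def bnn_dot(weights, inputs):
    """Binary dot product on +/-1 values (what XNOR gates plus an adder tree compute)."""
    return sum(sign(w) * sign(x) for w, x in zip(weights, inputs))


class FuzzyMemoNeuron:
    def __init__(self, weights, theta):
        self.weights = weights
        self.theta = theta      # reuse threshold
        self.y_m = None         # memoized full-precision output
        self.yb_m = None        # memoized BNN output
        self.delta = 0.0        # accumulated BNN difference since memoization
        self.full_evals = 0

    def step(self, inputs):
        yb_t = bnn_dot(self.weights, inputs)
        if self.y_m is not None:
            # Assumed metric: relative change of the BNN output.
            self.delta += abs(yb_t - self.yb_m) / max(abs(yb_t), 1)
            if self.delta < self.theta:
                return self.y_m             # reuse the memoized output
        # Otherwise compute the actual output and refresh the memoized state.
        y_t = sum(w * x for w, x in zip(self.weights, inputs))
        self.y_m, self.yb_m, self.delta = y_t, yb_t, 0.0
        self.full_evals += 1
        return y_t


neuron = FuzzyMemoNeuron(weights=[1.0, -2.0, 0.5, 1.5], theta=0.4)
sequence = [[1.0, -0.5, 1.0, 2.0],   # first step: always computed
            [0.9, -0.6, 1.1, 2.0],   # same signs: BNN output unchanged
            [0.9, 0.6, 1.1, 2.0]]    # one sign flips: BNN output changes
outputs = [neuron.step(x) for x in sequence]
```

In this toy sequence the second step reuses the memoized value (an approximation of the exact result), while the third step exceeds the threshold and triggers the full dot product.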

4) Computation reuse (completely skipping layer execution): To leverage temporal redundancy of the data (e.g., in computer vision applications), a few techniques predict outputs based on previous computations and skip heavy computations of some layers of the model. For example, Gonçalves et al. [238] showed that about 18%–81% of computations in AlexNet CONV layers can be reused due to spatial (intra-frame) and temporal (inter-frame) redundancy of the input data. They leveraged such reuse by replacing CONV layers with memory look-ups. After profiling the data offline, their technique generated output tables and corresponding mapping functions for each filter. Then, while processing an input frame, the mapping function generated a table index. If a matching entry was retrieved, the corresponding output feature map was updated with memoized values through a look-up. Otherwise, the output was calculated by performing the full convolution. To avoid large LUTs, their technique sorted the output values and clustered them according to their proximity. It also used motion prediction with block matching for skipping execution of some consecutive frames until the accumulated estimation of the error reached a threshold. Consequently, for DNNs like YOLO-v3 [8], it processed only 22%–32% of the frames with a negligible accuracy loss.

Buckler et al. [188] proposed skipping heavy processing of some CNN layers for several (predicted) frames and executing precise computations periodically for the remaining (key) frames. For predicted frames, their activation motion compensation algorithm estimated motion in the input frame and used the result to incrementally update the saved output from the last key frame. Note that unlike some of the value similarity techniques that incur changes in the PE datapath, such techniques that identify the computation reuse with marginal computations can be efficiently executed on a separate module, while the remaining modules in the hardware accelerator (e.g., memory, PE-array engine) compute the sparse tensors of DNN layers. For example, the activation motion compensation algorithm was implemented on the EVA<sup>2</sup> accelerator module [188], which determined the frames that need to be predicted and the corresponding warping of the activations based on the motion estimation. Tasks of processing key frames or remaining (suffix) CNN layers of predicted frames can be executed on accelerators such as EIE [61] or Eyeriss [34]. EVA<sup>2</sup> identified 78%–96% of the frames for AlexNet [2] and 40%–71% of the frames for Faster-RCNN [7] as predicted frames while processing the YouTube-BoundingBoxes dataset [250], without incurring considerable accuracy loss.

5) Early termination of computations by predicting outputs: Several accelerators including SnaPEA [243], SparseNN [245], and CompEND [246] have proposed to reduce ineffectual computations by early prediction of the usefulness of the output values. These accelerators check whether the computations (MAC operations) contribute to the effective inputs for the subsequent layer. If not, PEs of these accelerators can terminate such computations for the output elements. For example, DNNs commonly use ReLU as the activation function, which clamps negative values to zero. Therefore, computations can be considerably reduced by monitoring the negative values of the outputs and terminating the computation early, which avoids the MAC operations that are likely to produce negative values [243], [244]. A similar strategy can be adopted for efficiently processing DNN layers that precede the max-pooling function [244]. Song et al. [244] analyzed ineffectual output activations for CONV layers of AlexNet and VGG-19 models and reported that about 80% of the multiplications led to ineffectual outputs. To leverage computation reduction due to negative values, SnaPEA [243] statically re-ordered the weights based on their signs. PEs of the SnaPEA architecture employed a prediction activation unit with each MAC. The prediction activation unit checked the sign-bit of the partial summation and raised the termination signal once the sign-bit became one. The termination signal notified the controller to terminate the rest of the computations on the PE for the corresponding output element.
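The sign-based early termination described above can be sketched as follows. This is a minimal model of the exact (non-speculative) case under stated assumptions, not SnaPEA's hardware: input activations are assumed to be non-negative (e.g., produced by a preceding ReLU), and the function names are hypothetical. With non-negative weights processed first, the partial sum can only decrease afterwards, so a negative partial sum guarantees that ReLU will output zero.

```python
def reorder_by_sign(weights):
    """Static re-ordering: indices of non-negative weights first, negative last."""
    return sorted(range(len(weights)), key=lambda i: weights[i] < 0)


def relu_dot_early_exit(weights, acts, order):
    """ReLU(dot(weights, acts)) with termination once the partial sum is negative."""
    if any(a < 0 for a in acts):
        raise ValueError("sketch assumes non-negative input activations")
    partial, macs = 0.0, 0
    for i in order:
        partial += weights[i] * acts[i]
        macs += 1
        if partial < 0:          # sign bit set: the output will be clamped to zero
            return 0.0, macs     # skip the remaining MACs
    return partial, macs         # partial is non-negative here


# Hypothetical output element with five weights.
weights = [-2.0, 1.0, -3.0, 0.5, -1.0]
acts = [1.0, 2.0, 1.0, 2.0, 4.0]
order = reorder_by_sign(weights)
out, macs = relu_dot_early_exit(weights, acts, order)

# A case where the output stays positive and no MAC is skipped.
out2, macs2 = relu_dot_early_exit([1.0, -1.0], [3.0, 1.0],
                                  reorder_by_sign([1.0, -1.0]))
```

In the first case the computation stops after 4 of the 5 MACs and still returns the same value as the full ReLU computation.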

#### 6) Optimization opportunities:

(i) Joint exploration of spatial and temporal similarity of inputs, weights, and outputs: Depending on the model configurations (depth, layer dimensions, cardinality, etc.) and domain-specific data, opportunities for value sharing and the resultant computation reuse can vary considerably. For example, for a given DNN and dataset, leveraging value-similarity for a specific tensor and computation reuse may lead to better storage-efficiency and acceleration, with no or negligible accuracy loss. Selecting a different tensor for value approximation may yield considerably different design requirements and processing overheads. Therefore, a joint exploration of the value-similarity for DNNs can help to identify the storage-Ops-accuracy trade-offs for different tensors (input and output activations and weights) and the implications of the structured or unstructured value sharing. The framework for the joint exploration can also indicate opportunities for hardware-friendly execution. For instance, for an image frame or audio sample, it can opt for skipping computations of specific layers entirely by reusing memoized outputs.

(ii) Separate pre-processing or post-processing module for determining data similarity and orchestrating structured computations on PE-array: Many techniques that leverage value similarity process DNN layers with differential computing [134], [188], [239]. For example, they find out whether the input data has changed among consecutive frames or predict whether the outputs will be similar to memoized values. Hardware-friendly algorithms can be devised that perform small computations on input data and can skip entire computations for some DNN layers. Such additional processing may induce considerable changes in the PE-array design and increase the accelerator area, latency, or energy consumption. Designers may avoid such challenges by providing a separate and compact accelerator module for differential computing. It can handle the necessary pre-processing or post-processing of the data and can be interfaced with the PE-array and memory of the accelerator. Such a module can also obviate the need for modifying the datapaths of scalar/vector PEs and extending PE instructions or control logic. When required, it can trigger execution on the PE-array for structured computations. For such an accelerator, the designers should ensure an efficient execution pipeline, e.g., by overlapping the latency of pre-processing or post-processing with the execution of the PE-array.

(iii) Hardware/algorithm co-design for differential processing of the tensors: The approximation of outputs can be achieved with differential processing of inputs. Output elements of a tensor that need accurate computing due to a mismatch in the consecutive input segments (e.g., temporal image frames) can be unstructured and require fine-grained processing. Such processing may be achieved through special support in the PE datapath and additional memory management for the corresponding metadata and memoized values. However, fine-grained computations of outputs can lead to imbalanced computations on different PEs. Therefore, algorithms expressing the functionality of layers (e.g., convolutions) may need to be defined in terms of differential computing (i.e., execution conditional on input mismatch). Then, the techniques can systematically target such functionality and explore efficient dataflows and hardware designs for optimizing the execution during training and inference. Moreover, the hardware/algorithm co-design can utilize analytical modeling of the hardware architecture. It can induce structured value-similarity or choose to leverage the value-similarity of a tensor that is more amenable to hardware-friendly accelerations. Consequently, accelerators can attain more structured and balanced computations with lower overheads of metadata or memoization.

# XIII. LOAD BALANCING

Depending on the distribution of zeros in tensors, the inter-PE or intra-PE imbalance can cause significantly low utilization of PEs or their functional units, which increases execution time and energy consumption. This section first summarizes sources of such imbalance and then discusses different software-directed techniques or hardware designs for balanced computations. Table XIII categorizes these techniques. Software-based techniques facilitate structured computations by forming local regions of dense elements, sorting the data by combining same-sparsity tensor blocks for PEs, or regularizing models with structured pruning. While requiring low or no additional hardware cost, these techniques are often limited to static sparsity. Accelerators dynamically balance computations by prefetching work in FIFOs or memory, which obviates fine-grained synchronization of computations on PEs. Some accelerators achieve further run-time balance across PEs with a central hardware module that enables work-stealing.

#### A. Sources and Impact of Imbalance

1) Inter-PE imbalance: Zero values in different tensors are typically scattered and their positions may not be deterministic statically. Even if one tensor is dense or exhibits coarse-grained sparsity (due to pruning techniques for structured sparsity), another tensor to be multiplied can exhibit scattered zeros (fine-grained sparsity). Typically, work allocation for each PE is determined statically for most accelerator architectures. In other words, the spatial execution of the dataflow (i.e., which PE would process which portion of the computation graph) is determined beforehand. Therefore, data extraction and communication mechanisms provide a fixed set of computations for each PE. Hence, with unstructured NZs in the tensors, the computations to be carried out by each PE can vary drastically, which is commonly referred to as inter-PE load imbalance. In such scenarios, from a few to many PEs finish their computations early, get stalled, and wait for the next set of data (provided by the communication network), while other PEs still process the previously allocated data. Such imbalance lowers average PE utilization and results in higher energy consumption due to idle cycles of PEs. Typically, executions with conventional dataflows yield synchronous processing of PEs. Synchronization or lock-stepping among PEs (e.g., in SCNN [96], Cnvlutin [109], and Cnvlutin2 [146]) is achieved by barriers implemented in software via instructions or in hardware via PE architecture or controller logic. Therefore, with imbalanced computations, the trailing PE becomes the bottleneck of an execution pass, limiting the acceleration opportunities. For example, Kim et al. [110] analyzed the distribution of NZ weights in AlexNet CONV3 filters (for allocating them to distinct PEs). They showed that in an execution pass, NZs to be processed by the leading and trailing PEs differed by up to 6.5×. Parashar et al. [96] reported up to 40% idle cycles for executions of PEs in the SCNN architecture.

EIE [61] accelerated FC layers of different models. In the EIE architecture, at every cycle, an NZ activation was broadcast to all PEs, and each PE processed the activation along with the appropriate NZ weights (accessed from its local memory). Their sensitivity analysis showed that without any support for load balance, the accelerator suffered from severe load imbalance, i.e., on average, about 47% of the cycles were idle

TABLE XIII
CLASSIFICATION OF LOAD BALANCING TECHNIQUES

| Category          | Technique            | Accelerators              |
|-------------------|----------------------|---------------------------|
| Software Directed | Data Clustering      | [95], [107], [163]        |
| Software Directed | Data Reorganization  | [110], [151], [160]       |
| Software Directed | Model Regularization | [57], [101], [102], [155] |
| Hardware Module   | Prefetching of Work  | [61], [62], [102]         |
| Hardware Module   | Work Stealing        | [97], [110]               |

(bubbles due to starvation) for a 64-PE accelerator. Therefore, the accelerator systems for sparse tensors need to effectively balance the computations among PEs.

2) Intra-PE imbalance: For SIMD or vector PEs, apart from inter-PE load imbalance, intra-PE load imbalance may also contribute to a significant acceleration loss. Several architectures feature PEs consisting of multiplier arrays, e.g., Cambricon-X [62], SCNN [96], SNAP [148], Cnvlutin [109], and Cambricon-S [57]. With unstructured sparsity of one or both tensors, there may not be enough NZs to feed some multipliers within PEs, which causes intra-PE load imbalance. Intra-PE load imbalance results in lower utilization of multipliers and accumulators within each PE, which reduces the achievable acceleration. For example, Zhang et al. [148] analyzed the utilization of multipliers in the SNAP accelerator by varying the sparsity of activations and weights from 0% to 90%. Their analysis showed that with moderate sparsity, multiplier utilization can fall below 80%, and can drop to as low as 20% for 90% sparsity. Similarly, SCNN [96] reported low multiplier utilization due to intra-PE imbalance (e.g., less than 80% for all GoogLeNet [3] layers, and 20% for the last two inception modules). Moreover, a few architectures use PE designs with multiple subunits in each PE; each subunit consists of multiplier arrays and may need to be in synchronization with other subunits of the same PE, e.g., in Cnvlutin [109] and [101], [146]. With unstructured sparsity of either tensor, multipliers and accumulators in some subunits can often be idle while the trailing subunits process useful computations.
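As a rough illustration of how imbalance lowers utilization, the following Python sketch estimates PE utilization under lock-stepped execution, where every pass lasts as long as its slowest PE. The function name and the NZ counts are hypothetical, and the model assumes one MAC per NZ per cycle; it does not reproduce the measurements reported by the cited accelerators.

```python
def lockstep_utilization(nz_per_pass):
    """Fraction of PE-cycles doing useful work under lock-step.

    nz_per_pass[p][i] is the (hypothetical) number of NZ MACs that
    PE i must perform in execution pass p. Each pass takes as many
    cycles as its most heavily loaded PE.
    """
    useful = sum(sum(pass_work) for pass_work in nz_per_pass)
    total = sum(max(pass_work) * len(pass_work) for pass_work in nz_per_pass)
    return useful / total


if __name__ == "__main__":
    # Two passes on two PEs: the first pass is imbalanced (2 vs. 4 NZs).
    print(lockstep_utilization([[2, 4], [4, 4]]))
```

The same calculation applies to intra-PE imbalance if each inner list is read as the work assigned to the multipliers of a single PE.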

#### B. Software Directed Load Balance

1) Clustering NZ tensor elements for populating dense data regions: As described in section IX-A, a few techniques focused on high sparsity of weights. They proposed structured pruning or data combining approaches for clustering the tensor elements in locally dense regions that are dispatched to PEs for processing in a conventional manner [107], [163]. Such techniques improve the regularity of processing, considerably increase the PE utilization, and lower the number of execution passes (invocations of accelerator resources). However, these techniques may not be effective when the algorithms cannot control data generation or do not achieve the structured sparsity for other tensors (e.g., activations).

Li et al. [95] proposed concise convolution rules (CCR) for efficiently partitioning sparse convolutions into effective and ineffective sub-convolutions. CCR decomposed 2D convolutions into matrix multiplication for processing locally dense regions (subsets) of filters and input feature maps. Li et al. [95] showed that CCR can eliminate most of the ineffective computations and their storage (for VGG-16, achieving reduction of about 79% and 51%, respectively). Sub-convolutions after CCR transformation were executed on the SqueezeFlow accelerator [95]. SqueezeFlow executes a scalar-matrix product when only one tensor is sparse, and a Cartesian product of vectors when both tensors are sparse. However, this architecture may not support generic tensor operations, since each PE only performs multiplication for a Cartesian product. Extending the CCR methodology to other algorithms (beyond convolutions) can be challenging.

Fig. 23. Distribution of non-zero weights for executing AlexNet [251] CONV1 with coarse weight stationary dataflow on a  $4\times3$  PE-array. Distribution is shown for NZ weights in different workgroups (each workgroup contains NZs for 12 PEs): (a) without load balance (b) after applying zero-aware allocation [110]. (Figure inspired by [110].)

2) Data reorganization prior to work allocation: In the ZENA accelerator [110], each PE of a sub-workgroup processed a different set of filters. Kim et al. [110] recognized inefficiencies caused by load imbalance during the execution of convolutions on the ZENA accelerator. To balance the load on each PE, they sorted filters by sparsity and allocated them such that all PEs executed filters of similar sparsity during the processing of a sub-workgroup.

To determine the efficacy of such sorting of the tensor blocks for load balancing, we considered AlexNet [251] for ImageNet classification. We obtained the pruned model through neural network distiller [252] with a pruning algorithm similar to that of Han et al. [40]. For accelerating the AlexNet [251] CONV1 layer with coarse weight stationary dataflow, Fig. 23 presents distributions of NZs in filters before and after reorganization. For processing 64 filters of size  $3\times11\times11$  on  $4\times3$  PEs, we consider execution through 16 different workgroups or execution passes. Each workgroup contains NZ weights (up to  $11 \times 11$ ) for a total of 12 PEs that concurrently process four different filters and three different channels. In this example, we consider that NZ weights are reused from each PE's local memory to leverage the high weight reuse of the CONV1 layer. Only once all PEs in a workgroup have processed their weights completely can a new set of weights (the next workgroup) be offloaded to PEs. Fig. 23(a) shows that before data reorganization, the NZ weights allocated to PEs within the workgroups differed by up to  $21.4\times$  (5 vs. 107 NZ weights in an  $11 \times 11$  filter) and on average by about  $6.09\times$ . Different numbers of NZs in the tensors lead to the allocation of imbalanced computations to PEs. Fig. 23(b) shows that sorting the weights (both filter-wise and input channel-wise) leads to an almost equal number of NZs for computations on the 12 PEs during each workgroup. After sorting, NZ weights allocated to PEs within the workgroups differed on average by only  $1.36\times$ .

After applying similar static sorting of the tensors, ZENA [110] achieved about 20%–32% additional acceleration for convolution layers of AlexNet and VGG-16. Note that depending on the sparsity of the tensor and distribution of NZs in the tensors, NZ weights allocated to PEs for an execution pass may differ considerably even after sorting the data. Moreover, such transformation is feasible only statically (i.e., before execution begins). Therefore, ZENA also used dynamic work allocation, which we discuss later in this section.
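The effect of sorting filters by sparsity can be sketched in Python under a simplified model: each PE receives one filter per pass, and a lock-stepped pass costs as many cycles as its largest NZ count. The helper names and NZ counts below are illustrative assumptions and do not reproduce ZENA's allocation logic or the data of Fig. 23.

```python
def lockstep_cost(nz_counts, pes_per_pass):
    """Total cycles when consecutive filters are grouped into passes
    and each pass waits for its most heavily loaded PE."""
    passes = [nz_counts[i:i + pes_per_pass]
              for i in range(0, len(nz_counts), pes_per_pass)]
    return sum(max(p) for p in passes)


def sort_filters_by_nz(nz_counts):
    """Return filter ids ordered from most to fewest NZ weights, so that
    filters of similar sparsity end up in the same pass."""
    return sorted(range(len(nz_counts)), key=lambda f: nz_counts[f], reverse=True)


if __name__ == "__main__":
    nz = [5, 107, 20, 90, 100, 10, 95, 15]   # hypothetical NZs per filter
    order = sort_filters_by_nz(nz)
    print("unsorted cost:", lockstep_cost(nz, 4))
    print("sorted cost:  ", lockstep_cost([nz[f] for f in order], 4))
```

Because the sort needs the NZ counts of all filters up front, this kind of reorganization only applies to statically known tensors such as pruned weights.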

3) Accelerator-aware regularization of the model: Some accelerators have recently leveraged structured sparsity of weights for balancing computations on PEs. For example, a pruning algorithm can target blocks of n elements in the tensor and prune k out of n elements in each block. Then, during execution, all PEs or compute units can process the same number of NZ weights. Recent accelerators, including SparseCore [155], sparse Tensor Core [253], and [101], execute models pruned with similar strategies.

For k:n block-sparsity, appropriate values of k for the model can be selected based on the sensitivity of the pruning (i.e., impact on accuracy). Analysis by Kang et al. [101] showed that for different models including VGG-16 and ResNet-50, about 12 out of 16 elements were safely pruned for 16-element blocks, without any accuracy loss. Similarly, about 10 out of 16 elements were pruned for compact models like MobileNet v1 and SqueezeNet v1. To efficiently execute such pruned models, accelerators employ PEs with multiplexers that can fetch k elements from the blocks of n values. These multiplexers can use indices of NZ weights to select the appropriate values from the dense activations. Extracted activations and NZ weights can then be provided to the functional unit. Like k:n block-sparsity, ESE [102] used a load-balance aware pruning that considered the sub-matrices to be processed by PEs for RNN executions and induced the same sparsity into all sub-matrices.
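The block-wise pruning and the index-driven selection of activations can be sketched as follows. This is a minimal Python illustration that assumes magnitude-based pruning and uses a hypothetical parameter `keep` for the number of weights retained per block; the function names are not taken from any cited design.

```python
def prune_block(block, keep):
    """Keep the `keep` largest-magnitude weights of one block.

    Returns (indices, values) of the retained weights, with indices in
    ascending position order. Every block yields exactly `keep` weights,
    so every PE gets the same amount of work.
    """
    ranked = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
    indices = sorted(ranked[:keep])
    return indices, [block[i] for i in indices]


def block_dot(indices, values, activations):
    """Mimic the multiplexers: use the stored weight indices to pick the
    matching dense activations, then multiply and accumulate."""
    return sum(w * activations[i] for i, w in zip(indices, values))


if __name__ == "__main__":
    idx, vals = prune_block([1, -9, 0, 7], keep=2)
    print(idx, vals, block_dot(idx, vals, [1, 2, 3, 4]))
```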

In many accelerator designs, all PEs receive the same set of NZ activations, process them with their unique weights, and produce distinct output activations. One such architecture is Cambricon-S [57], which used coarse-grained model pruning for structured sparsity. Such pruning considers the total number of multipliers within accelerator PEs. The coarse-grained pruning of weights (connections) in local regions was done such that all connections between an input activation and all output activations were pruned together. So, when each PE processed an output neuron, it performed the same number of MACs. The same strategy was effective even when an input activation was dynamically zero, because the data extraction logic disregarded the corresponding weights for all PEs. While such an approach naturally achieves inter-PE load balance, it may not be applicable for exploiting sparsity during training of the models, or when algorithms cannot control the data generation or perform coarse-grained pruning.

#### C. Load Balancing with Hardware Structures

1) Facilitating prefetching of allocated work to PEs: One way to improve PE utilization (in the presence of load imbalance) is to prefetch the allocated work for PEs and avoid fine-grain synchronization between PEs. In this way, even if there is a different amount of work (e.g., different number of MACs per input activation), all the PEs may perform effectual computations (e.g., work on different activations) at the same time. Thus, each PE can be engaged in performing some computations before it runs out of the available tensor data. This can be achieved by offloading more data into the FIFO or memory of each PE. For example, in the EIE accelerator [61], activations are broadcast to the FIFOs of all PEs, and each PE later processes them along with compressed weights loaded from its memory. Once a PE finishes multiplying an activation with all appropriate weights or does not find any matching weights for multiplication, it proceeds to process the next activation from its queue. For the EIE architecture, a FIFO size of 8 or higher ensured that each PE almost always had an NZ activation to process (i.e., during 87%–91% of computation cycles) [61]. Thus, the activation queue mechanism significantly improved PE utilization and lowered the idle time of PEs (from 47% to 13%).
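The benefit of per-PE activation queues over lock-stepped execution can be sketched with a toy cycle-level model in Python. It assumes that one activation is broadcast per cycle whenever no FIFO is full and that a PE spends one cycle per matching NZ weight; the function names and workloads are hypothetical, and the model is far simpler than EIE's actual pipeline.

```python
from collections import deque


def lockstep_cycles(work):
    """Cycles when all PEs synchronize after every activation.
    work[a][p] = cycles PE p needs for activation a."""
    return sum(max(row) for row in work)


def fifo_cycles(work, fifo_depth):
    """Cycles and PE utilization when each PE drains its own FIFO."""
    n_pe = len(work[0])
    queues = [deque() for _ in range(n_pe)]
    in_flight = [0] * n_pe          # cycles left on the current activation
    next_act, busy, cycles = 0, 0, 0
    while next_act < len(work) or any(queues) or any(in_flight):
        # Broadcast the next activation only if every FIFO has room.
        if next_act < len(work) and all(len(q) < fifo_depth for q in queues):
            for pe in range(n_pe):
                if work[next_act][pe] > 0:   # PEs skip activations with no matching weights
                    queues[pe].append(work[next_act][pe])
            next_act += 1
        for pe in range(n_pe):
            if in_flight[pe] == 0 and queues[pe]:
                in_flight[pe] = queues[pe].popleft()
            if in_flight[pe] > 0:
                in_flight[pe] -= 1
                busy += 1
        cycles += 1
    return cycles, busy / (cycles * n_pe)


if __name__ == "__main__":
    work = [[3, 1], [1, 3]]          # two activations, two PEs
    print(lockstep_cycles(work), fifo_cycles(work, fifo_depth=8))
```

In this small example the queues let the two PEs overlap their heavy and light activations, whereas lock-stepping pays for the slower PE on every activation.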

Cambricon-X [62] allows asynchronous communication of weights to PEs to improve execution efficiency. A centralized data extraction mechanism provides NZ activations to each PE via a unicast network, and compressed weights are loaded in the memory of each PE (2 KB). To facilitate asynchronous computations of PEs, the memory access port is assigned to only one PE at a time. Within its short assigned period, each PE fetches several chunks of weights via DMA transfer. This allows PEs to load some weights that are used later for performing computations. Depending on the prefetching interval and unstructured sparsity, each PE may be engaged in (asynchronous) computations throughout most of the execution cycles.

While mechanisms that allocate more work through prefetching increase the engagement of PEs in computations, the work allocated to PEs is still fixed. Moreover, data fetching mechanisms usually reside in PEs, which prevents them from obtaining information about pending work in other PEs and from work-stealing. So, such techniques cannot balance the computations across PEs by dynamically transferring work to free PEs. When computations are highly imbalanced, straggling PEs can still be the bottleneck, resulting in acceleration loss.

2) Centralized load balancer: In some accelerators, data is multicast to one or more rows (or columns) of PEs. Due to load imbalance, one or more leading rows finish processing the assigned data early. For efficient work distribution, the shared load balance logic first processes the metadata (indices) of the tensor tile to be distributed, along with the control signals from PE-rows. Then, it produces appropriate metadata and feeds the fast-acting rows/lanes of PEs. Thus, a shared control facilitates work-stealing for fast-acting PEs (or PE-rows) and helps efficient processing of an execution pass.

Fig. 24. Load balance mechanism in LNPU [97] (Figure adapted from [97]).

For example, ZENA [110] supports dynamic work allocation through down counters. Different PE-groups (e.g., PE-rows) process the same set of filters with different workgroups of activation tiles. The central distribution mechanism features down counters that store the number of remaining activation tiles within each workgroup. When a leading PE-group finishes its workgroup (counter value zero), it steals an activation tile from a straggling group (the one with the largest count value) and then continues processing output activations. With dynamic work-stealing, ZENA achieved about 10% additional acceleration for CONV layers of AlexNet and VGG-16. For processing AlexNet CONV layers with 16-bit data and 5-bit (logarithmic quantization) data, the speed-up of ZENA improved from 4× and 4.16× to 4.4× and 4.52×, respectively, when compared to processing dense tensors with the same number of PEs. In such dynamic work allocation, memory port contention may occur when multiple leading groups simultaneously attempt to fetch the same set of input activation tiles. ZENA's execution mechanism overcomes this problem by reassigning only one activation tile at a time (to a leading group) and performing any reassignments only during the bus idle time.
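The down-counter scheme can be sketched in Python as a toy simulation. It assumes each PE-group holds a list of activation tiles with hypothetical per-tile cycle costs, models each down counter as the number of not-yet-started tiles, and allows at most one reassignment per cycle; it ignores bus timing and is not ZENA's actual controller.

```python
def simulate(groups, steal=True):
    """Return total cycles to process all tiles.

    groups[g] lists the cycle cost of each activation tile initially
    assigned to PE-group g (costs are assumed to be >= 1).
    """
    pending = [list(reversed(tiles)) for tiles in groups]  # pop() yields the next tile
    remaining = [0] * len(groups)   # cycles left on the tile being processed
    cycles = 0
    while any(pending) or any(remaining):
        stolen = False              # at most one reassignment per cycle
        for g in range(len(groups)):
            if remaining[g] != 0:
                continue
            if pending[g]:          # own down counter > 0: take own tile
                remaining[g] = pending[g].pop()
            elif steal and not stolen:
                # Counter reached zero: steal from the group with the largest count.
                victim = max(range(len(groups)), key=lambda v: len(pending[v]))
                if pending[victim]:
                    remaining[g] = pending[victim].pop()
                    stolen = True
        for g in range(len(groups)):
            if remaining[g] > 0:
                remaining[g] -= 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    groups = [[4, 4, 4, 4], [1, 1, 1, 1]]   # group 0 has dense (slow) tiles
    print(simulate(groups, steal=False), simulate(groups, steal=True))
```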

LNPU [97] uses an input load balancer (ILB) which is shared among PE-rows. As shown in Fig. 24, ILB contains address generator units to determine the indices of the compressed elements that need to be fetched. Based on these addresses, ILB obtains tensor elements from the on-chip memory along with other metadata. Then, the skip-index decoder unit in ILB determines the appropriate indices for data extraction and pushes them along with the tensor values into the FIFO of a PE-row. It also calculates bitmaps, which are used to push the data (indices and NZ values) selectively into FIFOs of PE-rows. Thus, by selectively pushing the data at runtime, ILB balances the data among rows of PEs. Due to ILB, PE utilization in LNPU was increased by 2%-26% for 10%-90% sparsity of the inputs (activations or their gradients) [97]. Thus, centralized load balancing mechanisms can leverage the information about data allocation for PEs, and then, they can provide equal work to PEs or feed the fast-acting PEs, whenever necessary during run-time.

#### D. Optimization Opportunities

(i) Software-level or hardware/software/algorithm co-design optimizations to achieve low-cost load balance: Many accelerators have lacked special support to balance the computations among PEs. Depending on the distribution of zeros in tensors, the inter-PE or intra-PE imbalance can cause significantly low utilization of PEs or their function units, which increases execution time and energy consumption. Due to high hardware costs (on-chip area and power), designs may avoid special hardware logic for dynamic load balance. One software-directed technique is to reorganize the data [110], [151] (e.g., statically before beginning DNN inference). Such an approach can exploit static sparsity (weight tensors during inference) at no or low hardware cost, but it may not be feasible for leveraging dynamic sparsity (e.g., activations). Therefore, we may require additional co-design optimizations for regularizing/leveraging dynamic sparsity. If the level of sparsity or the distribution of zeros is known beforehand (or estimated statically), sparsity-aware mapping optimizations for accelerators can identify dataflows that sustain higher PE utilization while exploiting data reuse and parallelism. Even though such mappings may not achieve ideal acceleration (e.g., due to unstructured zeros in activations), they may still achieve higher acceleration. Moreover, when algorithms or hardware logic can control the sparsity of tensors (e.g., for many DNNs), hardware/algorithm co-designs can induce the balance. This can be done either by structurally pruning the activation elements (in hardware or during training) or by refactoring the functions for nonlinear activations, batch normalization [115], or quantization of outputs. Consequently, the co-designs can achieve the desired structured sparsity for both activations and weights to an extent, leading to further acceleration benefits.

#### XIV. WRITE-BACK AND POST-PROCESSING

Once PEs process the allocated blocks of tensors, they write (partially computed) outputs back via the interconnect. Managing such write-backs (WBs) for sparse data can be challenging because different PEs can produce different amounts of output values. Moreover, operations like ReLU or non-linear activation functions, pooling, batch-normalization, etc. need to be performed on the obtained output tensor. These operations are not as performance-critical as CONV or MLP layers, and they can either be executed on PEs before they WB outputs (e.g., in SCNN [96], Cambricon-S [57], and EIE [61]) or be post-processed on central modules before WB to memory (e.g., in MAERI [206], ZENA [110], and SqueezeFlow [95]). Depending on the dataflow mechanism, different PEs produce contiguous or non-contiguous blocks of output, which often need to be assembled in a central module and reorganized. Further, for efficient processing of convolutions, some accelerators require additional data layout transformations. For example, activation and weight tensors need to be reorganized for striding execution, or the activation tensor needs to be transformed into a Toeplitz matrix [64], [254]. Lastly, before processing the next layer of the model, sparse outputs may need to be encoded again on-the-fly, which is handled by a central encoding unit once the data is collected from PEs and reorganized.

## A. Write-Back from PEs

1) Simultaneous WB: A few accelerator designs including Cambricon-X [62] and SCNN [96] use fat-tree networks or point-to-point links, which allow multiple PEs to perform simultaneous WBs. Such networks eliminate the requirement of managing the WBs from PEs through a common bus (which otherwise incurs additional hardware/software overhead); PEs can execute in a dataflow manner and write the outputs back as soon as their computations are done. This is important for processing sparse tensors because, with unstructured sparsity, different PEs can have a different number of NZs to process and they produce different amounts of output values for WB. These PEs either WB to a central module for post-processing (e.g., in Cambricon-X [62]), or directly to the on-chip memory [62] or off-chip memory (e.g., in SCNN [96]).

Although simultaneous WBs are faster, such a fat-tree network can incur considerable overhead due to increased bandwidth and inefficient utilization of bandwidth. In such scenarios, accelerator designs use a common bus that is time-shared among multiple PEs; PEs can write the data back turn-wise or asynchronously. Besides, architecture designs using systolic arrays inherently offer simultaneous WB for the bottom (rightmost) row (column) of PEs, since tensors are distributed and streamed via store-and-forward mechanism and partial outputs are accumulated through each column (row) of PEs.

2) Sequential WB: PEs in several accelerator designs operate in a lock-stepped manner, i.e., data blocks of one or more tensors are multicast or broadcast to PEs for computations on their scalar or vector function units [109], [147]. With lock-stepped execution, the execution time of each pass is driven by the slowest PE; regardless of the different number of NZs provided to different PEs, these PEs spend an equal amount of time on their data chunks (idling once done). After processing one or more passes, these PEs write the output back to the memory. Synchronized execution allows PEs to WB in a specific sequence (e.g., the PE with the lowest PE-index writes the data first and so forth). Such sequential WB makes programming of the accelerator easier. It also obviates the need for specialized hardware/software support for asynchronous WB, which would otherwise be needed to handle dynamic and simultaneous WB requests from different PEs.

3) Asynchronous WB: With unstructured sparsity, PEs process a different amount of data and can asynchronously request WB. To facilitate such support, accelerator designs can employ additional hardware logic. For example, ZENA [110] used a common bus both to multicast/broadcast blocks of filters or feature maps to PEs and to collect the output. In ZENA, the output buffer was flushed to the memory during the idle period of the bus, which avoided bus contention between broadcasting activations from memory and WB of partial summations. To prioritize the requests from PEs to access the bus for WB, it determined the PE groups with a higher number of pending output tiles.

## B. Data Assembling

PEs in a few accelerator designs process different output tiles (e.g., of output feature maps). These PEs perform fine-grained assembling of the outputs locally while managing them within the memory of PEs. The processing units in PEs of some designs (e.g., SCNN [96] and EIE [61]) provide support for such local assembling of the outputs. For example, SCNN [96] PEs use a coordinate computation unit. It determines the appropriate indices for accumulating outputs and storing them in the PE's memory. In contrast, PEs in other accelerators produce metadata and supply it with the outputs for correctly indexing the shared memory (e.g., in ZENA [110]) or assembling the output on a central module (e.g., in CoNNA [147] and Cambricon-X [62]). For instance, in accelerator designs including Cambricon-X [62] and SparTen [151], PEs provide outputs to a central module which handles data assembling before encoding it or storing it back to the global scratchpad or DRAM. The central module can assemble the data based on the metadata (e.g., output coordinate) provided by PEs or pre-known indices of PEs. In some designs, data assembling logic is integrated with accumulators [95], [148]. It performs a reduction of the partial summation before writing the data back to the appropriate bank of the memory. The data assembling logic typically also handles data layout transformations such as reorganization of the collected output [132], [147], which is required later for processing the subsequent layer on the PE-array.

Fig. 25. Data layout transformation for executing convolution. (a) Convolution of two  $2\times3\times3$  feature maps with two  $2\times2\times2$  filters. (b) Reorganizing data for striding execution. (c) Transforming feature map tensors into Toeplitz matrix.

## C. Data Layout Transformations

1) Data reorganization (NHWC): Several accelerators including Cambricon-X [62], CompAct [132], TPU [36], ERIDANUS [107], and [163] have been designed to efficiently process vector or matrix multiplications on dense or sparse tensors. So, for processing convolutions, accelerators such as CompAct [132], Cnvlutin [109], and SCNN [96] require data layout transformations, e.g., processing data in NHWC format [255]. Note that the NHWC format is also used for accelerating multiplications of dense tensors on CPU and GPU platforms [256], [257]. For processing the convolution of Fig. 25(a), Fig. 25(b) shows data reorganization for striding executions. It shows that weights and activation tensors are stored in the channel-first (NHWC) format. For example, for processing an output activation 1A, a block containing all channels is fetched for four spatial locations of the first filter and ifmap. Similarly, in the next execution pass, output activation 1B can be processed on the accelerator by fetching a block containing both channels for another set of four spatial locations of the first filter and ifmap. Thus, the data reorganization allows accelerators to fetch a block of data corresponding to different channels and distribute the blocks to a single vector-PE or multiple PEs for efficiently performing vector-vector or vector-matrix multiplications. To process blocks of sparse data, the data extraction module extracts pairs of matching NZs (discussed in section IX).

2) Transformations to Toeplitz matrix: While processing data in the channel-first (NHWC) format allows accelerators to execute convolutions with vector-vector multiplications on PEs, it still requires additional hardware support to extract appropriate blocks from the memory in a certain manner. Some accelerators have been designed to efficiently process only matrix-vector or matrix-matrix multiplications on dense/sparse data and cannot perform striding window execution for convolutions. So, to support convolutions of sparse tensors and obviate the need for striding executions, a few accelerators including ERIDANUS [107] and [163] transform feature maps into the Toeplitz matrix [254] by using the im2col transformation [258], [259]. Once sparse feature map tensors are transformed into sparse matrices and then sparsity-encoded, accelerators can process them with a sparse matrix-multiplication operation. Fig. 25(c) illustrates the transformation for the tensors of Fig. 25(a). It shows that for each channel and feature map (or filter), the neighborhood values for computing a spatial output of 2D convolution are combined into a vector. For instance, values (1A, 1B, 1C, 1D) correspond to the output 1A and form a column vector. Similarly, the corresponding filter values form a row vector. For multiple channels, corresponding tensor elements are stacked in these column-vectors or row-vectors. For example, ifmap values (2A, 2B, 2C, 2D) also correspond to the output 1A and are appended to the first column. Thus, the convolution of an ifmap with each filter can be processed as a vector-matrix multiplication [64]. Similarly, the convolution of multiple feature maps with multiple filters can be processed as matrix-matrix multiplications. However, because the neighborhood data is duplicated, transforming the ifmap tensor into the Toeplitz matrix can yield significant storage overhead (it requires about  $Fy \times Fx$  times more memory for processing a stride-1 convolution with filters of spatial dimensions  $Fy \times Fx$ ).
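The im2col transformation can be sketched in pure Python for a single ifmap, assuming stride 1, no padding, and nested lists indexed as `ifmap[channel][y][x]` and `filters[m][channel][y][x]`. The function names are illustrative, and the sketch handles dense data only; a sparse accelerator would additionally encode the resulting matrix.

```python
def im2col(ifmap, fy, fx):
    """Return the Toeplitz-style matrix as a list of columns, one column
    per output position, each stacking the fy*fx window of every channel."""
    channels, height, width = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    return [[ifmap[c][oy + dy][ox + dx]
             for c in range(channels) for dy in range(fy) for dx in range(fx)]
            for oy in range(height - fy + 1) for ox in range(width - fx + 1)]


def conv_via_im2col(ifmap, filters):
    """Convolution as a matrix-matrix product: flattened filters (rows)
    times im2col columns. Output is indexed as out[filter][position]."""
    fy, fx = len(filters[0][0]), len(filters[0][0][0])
    cols = im2col(ifmap, fy, fx)
    rows = [[w for ch in f for r in ch for w in r] for f in filters]
    return [[sum(w * a for w, a in zip(row, col)) for col in cols] for row in rows]


def conv_direct(ifmap, filters):
    """Reference sliding-window convolution used to check the result."""
    fy, fx = len(filters[0][0]), len(filters[0][0][0])
    height, width = len(ifmap[0]), len(ifmap[0][0])
    return [[sum(f[c][dy][dx] * ifmap[c][oy + dy][ox + dx]
                 for c in range(len(ifmap)) for dy in range(fy) for dx in range(fx))
             for oy in range(height - fy + 1) for ox in range(width - fx + 1)]
            for f in filters]


if __name__ == "__main__":
    ifmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
             [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]          # 2 x 3 x 3
    filters = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]],
               [[[1, 0], [0, -1]], [[0, 2], [2, 0]]]]    # two 2 x 2 x 2 filters
    print(conv_via_im2col(ifmap, filters) == conv_direct(ifmap, filters))
```

For this 2×3×3 ifmap, the transformed matrix holds 4 columns of 8 values (32 entries) instead of the original 18, which illustrates the duplication overhead for even a small input.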

#### D. On-the-fly Encoding

Several accelerator designs including CompAct [132], SqueezeFlow [95], Eyeriss [34], SparTen [151], and CoNNA [147] use an encoding module. Such modules perform on-the-fly encoding of the output tensor, typically before the data block is stored back into the shared on-chip memory or off-chip memory. Processing data in encoded format can considerably reduce accesses to off-chip memory [34], [95]. The encoding module typically incurs low area or power overheads. For example, the RLC encoder and decoder unit in SqueezeFlow [95] accounted for 2.5% of the total accelerator area and about 3.06% of the total power consumption. Similarly, the RLC coding unit in Eyeriss [34] occupied about 0.3% of the total on-chip area. Additionally, on-the-fly encoding allows accelerators to efficiently process tensors which are dynamically sparsified, i.e., sparse activation tensors during the inference of DNNs and tensor computations in the training of the models.

The complexity of the hardware logic of the encoding module depends on the coding format (section VIII). For example, the bitmap or RLC format requires single-step processing, which incurs low overhead for generating a metadata tensor. For instance, SparTen [151] featured a central module for on-the-fly bitmap encoding of the assembled output data. The encoder unit consisted of comparators (XNOR gates) for determining NZs and additional logic that shifts NZs for populating a compressed val vector.
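Bitmap encoding of an output block can be sketched in a few lines of Python. The sketch mirrors the two steps described above (flag NZs, then pack them into a compressed value vector); the function names are illustrative and the list-based data model is an assumption, not SparTen's hardware design.

```python
def bitmap_encode(values):
    """Return (bitmap, nz_values): one flag per element plus the packed NZs."""
    bitmap = [1 if v != 0 else 0 for v in values]   # comparator stage
    nz_values = [v for v in values if v != 0]       # shift/pack stage
    return bitmap, nz_values


def bitmap_decode(bitmap, nz_values):
    """Rebuild the dense block from the bitmap and the packed NZ values."""
    packed = iter(nz_values)
    return [next(packed) if bit else 0 for bit in bitmap]


if __name__ == "__main__":
    block = [0, 3, 0, 0, 7, 1, 0, 0]
    bitmap, nz_values = bitmap_encode(block)
    print(bitmap, nz_values, bitmap_decode(bitmap, nz_values) == block)
```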

Sticker [153] facilitates sparsity-aware encoding of DNN tensors. It uses three modes to encode tensors of high, medium, or low sparsity with COO, bitmap, and dense format, respectively. The three modes are controlled by two threshold values. When the number of zeros in a chunk is less than the first threshold, the data is treated as having low sparsity and only the values of elements are stored. When the number of zeros is between the two threshold values, the data is stored along with a bitmap vector that is used later for clock-gating PEs to skip computations. Finally, when the number of zeros exceeds the second threshold, the data is considered highly sparse and stored in COO-coded format. Since weights can be processed offline for DNN inference, they are pre-encoded in appropriate formats. For online encoding of activations, Sticker uses a sparsity adaptor module consisting of a sparsity detector, a 4 kB buffer, an adaptive encoder, and a controller. The sparsity detector contains counters that count zeros in activations of 16 consecutive channels. Once the raw output activations of a layer (obtained after ReLU) are processed by the sparsity detector, they are stored in the buffer. Then, the controller determines the encoding mode that is used by the encoder to encode the data of the buffer.
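The threshold-based mode selection can be sketched in Python as follows. The threshold values, chunk size, and function names are hypothetical placeholders chosen for illustration; the sketch does not reproduce Sticker's actual thresholds or its hardware encoder.

```python
def choose_mode(chunk, low_thr, high_thr):
    """Pick an encoding mode from the zero count of a chunk."""
    zeros = sum(1 for v in chunk if v == 0)
    if zeros < low_thr:
        return "dense"      # low sparsity: store raw values only
    if zeros <= high_thr:
        return "bitmap"     # medium sparsity: values plus a bitmap
    return "coo"            # high sparsity: (index, value) pairs


def encode(chunk, low_thr, high_thr):
    """Return (mode, payload) for one chunk."""
    mode = choose_mode(chunk, low_thr, high_thr)
    if mode == "dense":
        return mode, list(chunk)
    if mode == "bitmap":
        return mode, ([int(v != 0) for v in chunk], [v for v in chunk if v != 0])
    return mode, [(i, v) for i, v in enumerate(chunk) if v != 0]


if __name__ == "__main__":
    # Hypothetical thresholds for 16-element chunks.
    for chunk in ([1] * 14 + [0] * 2, [0, 5] * 8, [0] * 15 + [9]):
        print(encode(chunk, low_thr=4, high_thr=12))
```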

## XV. COMPILER SUPPORT

This section provides an overview of the compiler support for sparse deep learning accelerators. It focuses on four main topics:

- Intermediate representations. They determine what type of code the compiler can support and what kind of compiler transformations it can perform.
- Support for sparse tensors. This subsection discusses challenges in supporting sparse deep learning in compilers and compilers developed to overcome these challenges.
- Compiler optimizations. To accelerate sparse deep learning, hardware accelerators require the use of advanced code optimization techniques. This subsection provides an overview of state-of-the-art techniques that allow the compiler to apply advanced optimizations and generate the most efficient code from high-level neural network descriptions.
- Accelerator ISAs and code generation. This subsection focuses on accelerator ISAs (e.g., instruction set for high-level tensor operations) and the compiler support for machine code generation for accelerators.

# A. Intermediate Representations

The intermediate representation (IR) used in a compiler determines which types of code can be represented by the compiler, whether it can support sparse tensor computations, the types of code transformations that can be done, and even the scalability of the compiler. Therefore, the success of a sparse deep learning compiler depends on the intermediate representation that it uses.

1) Need for high-level representations: A common example of low-level intermediate representations is the LLVM IR. While this low-level IR is well suited for low-level code optimizations such as register allocation, it is not well suited for many high-level code optimizations needed for optimizing sparse deep learning. This is mainly because low-level IRs do not preserve information about loop structures and data layouts, and reconstructing such information is not trivial [260]. This is why many deep learning compilers such as TVM [73], Tiramisu [260], and Halide [261] apply many code optimizations on a high-level IR (an IR that has loops and represents multi-dimensional tensors). This is also one of the motivations for creating MLIR [262], which serves as a high-level IR for low-level compilers like LLVM. These high-level IRs have the advantage of preserving information about loop structures and data layouts and therefore make it easier for compilers to apply high-level optimizations.

2) Mathematical abstractions of code: While the previous intermediate representations have focused on representing the program statements and the program structure, many compilers use an additional mathematical representation (abstraction) to represent the iteration domains and array accesses of the statements. These mathematical representations are usually used in conjunction with the IR to simplify iteration domain and array access transformations. This subsection presents two major families of mathematical representations and compares their strengths and weaknesses.

## 2.A. Polyhedral representation.

**Background:** The polyhedral representation is a unified mathematical representation for the iteration domains of statements, code transformations, and dependencies. It relies on two main concepts: *integer sets* and *maps*. *Integer sets* represent the iteration domains. *Maps* are used for representing memory accesses and transforming iteration domains and memory accesses.

An integer set is a set of integer tuples described using affine constraints. An example of a set of integer tuples is

$$\{(1,1);(2,1);(1,2);(2,2);(1,3);(2,3)\}$$

Instead of listing all the tuples, we can describe the set by using affine constraints over loop iterators and symbolic constants as follows:

$$\{S(i,j): 1 \leq i \leq 2 \land 1 \leq j \leq 3\}$$

where i and j are the dimensions of the tuples in the set.

A map is a relation between two integer sets. For example,

$$\{S1(i,j) \to S2(i+1,j+1) : 1 \le i \le 2 \land 1 \le j \le 3\}$$

is a map between tuples in the set S1 and tuples in the set S2. More details about the polyhedral model and formal definitions can be found in [263]–[265].
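As a minimal sketch of these two concepts, the set and the map above can be enumerated explicitly in Python. This is only an illustration: polyhedral libraries keep the constraints symbolic instead of listing tuples, and the helper names `integer_set` and `apply_map` are hypothetical.

```python
def integer_set(i_lo, i_hi, j_lo, j_hi):
    """Enumerate {S(i,j) : i_lo <= i <= i_hi and j_lo <= j <= j_hi}."""
    return {(i, j) for i in range(i_lo, i_hi + 1) for j in range(j_lo, j_hi + 1)}


def apply_map(domain, fn):
    """Apply a map to every tuple of its domain; returns (source, target) pairs."""
    return {(point, fn(*point)) for point in domain}


# The set {S(i,j) : 1 <= i <= 2 and 1 <= j <= 3} from the text.
S1 = integer_set(1, 2, 1, 3)

# The map {S1(i,j) -> S2(i+1,j+1)} restricted to the same constraints.
shift = apply_map(S1, lambda i, j: (i + 1, j + 1))
S2_image = {target for _, target in shift}
```

Here `S1` contains exactly the six tuples listed earlier, and `S2_image` is the set of tuples of S2 reached by the map.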

**Polyhedral compilers:** Notable polyhedral compilers for deep learning include Tiramisu [260], Tensor Comprehensions [266], Diesel [267], and TensorFlow XLA [268] (through affine MLIR dialect [262], [269]). Other polyhedral compilers that were not designed specifically for deep learning but are rather general-purpose and support deep learning include PENCIL [265], [270], Pluto [271], Polly [272], PolyMage [273], AlphaZ [274], CHiLL [275], [276] and URUK [277].

# Strengths of the polyhedral representation:

- Unified representation: The polyhedral representation is a unified mathematical representation for code (iteration domains and data accesses), code transformations, and data dependencies. This eliminates friction within the compiler intermediate representations and simplifies greatly the design of code transformations.
- Instance-wise representation: The granularity of the polyhedral representation is the instances of statement executions instead of representing all the statement instances as a single statement. An instance of statement execution is a single execution of a statement during one loop iteration. Instance-wise representation is not limited to the iteration domain but also covers data dependencies, data accesses, and code transformations. This allows the polyhedral compiler to have a precise representation of dependencies, code, and code transformations.
- Support for the whole class of affine transformations: The
  polyhedral representation allows applying any affine transformation on the iteration domain and data accesses. An
  example of a complex affine transformation is iteration space
  skewing which allows the extraction of parallelism from
  multi-layer recurrent neural networks.
- Non-rectangular iteration domains: The polyhedral representation also allows compilers to naturally express non-rectangular iteration domains (i.e., iteration domains with an affine conditional).
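The skewing transformation mentioned in the list above can be sketched as follows. The dependence pattern is an assumption made for illustration: cell (l, t) of a multi-layer recurrent network is taken to read cell (l-1, t) from the layer below and cell (l, t-1) from the previous time step. Grouping iterations by the skewed coordinate w = l + t then yields wavefronts whose cells are mutually independent. The function names are hypothetical.

```python
from collections import defaultdict


def wavefronts(num_layers, num_steps):
    """Group iterations (l, t) by the skewed coordinate w = l + t."""
    fronts = defaultdict(list)
    for l in range(num_layers):
        for t in range(num_steps):
            fronts[l + t].append((l, t))
    return [fronts[w] for w in sorted(fronts)]


def dependences(l, t):
    """Assumed dependence pattern: cell (l, t) reads (l-1, t) and (l, t-1)."""
    return [(a, b) for a, b in ((l - 1, t), (l, t - 1)) if a >= 0 and b >= 0]


fronts = wavefronts(num_layers=3, num_steps=4)
```

Every dependence of a cell in wavefront w lies in wavefront w - 1, so all cells of one wavefront can execute in parallel once the previous wavefront has finished.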

# Weaknesses of the polyhedral representation:

- Limited support for non-affine code: The polyhedral model mainly represents code and transformations using sets and maps described using affine constraints. This means that the polyhedral model does not naturally support code that leads to non-affine constraints. This includes code with non-affine loop bounds, non-affine array accesses, and non-affine conditionals. While the classical polyhedral model does not support non-affine constraints, recent work has extended the polyhedral representation to support non-affine array accesses, non-affine loop bounds, non-affine conditionals [278], and parametric tiling [279]. The efficiency of these techniques has been demonstrated in practice by PENCIL [270] and Tiramisu [260].
- Slower compilation: While polyhedral operations are precise, they are also more computationally expensive. Therefore, polyhedral compilers are slower than non-polyhedral compilers. To avoid this limitation, recent techniques focused on reducing the number of statements by clustering a whole block of statements into macrostatements and scheduling (optimizing) the macrostatements instead of scheduling individual statements [280], showing a notable speedup in compilation time.
## 2.B. Non-polyhedral representation.

A common non-polyhedral representation used in deep learning compilers is the *interval*-based representation. This representation relies mainly on using intervals to represent the iteration domain and on using interval arithmetic to represent code transformations. Using intervals, N-dimensional loops are represented with N-dimensional boxes. For example, we can represent the iteration domain of the following loop nest

<sup>1</sup>The iteration domain of loop iterators in a loop is all the possible values that the loop iterators can take.

with the following 2-dimensional box:  $(i, j) \in ([0, N], [2, M-2])$ . In this subsection, we focus on deep learning compilers that use intervals and interval arithmetic internally, as this is the most common non-polyhedral representation.
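To make the box concrete, the following minimal Python sketch stores each dimension as an interval and tiles one dimension with plain interval arithmetic. It rests on assumptions that are not in the text: each interval is read as half-open, N and M take small sample values, and `split` and `tiled_points` are hypothetical helper names.

```python
def split(interval, tile):
    """Split a half-open interval [lo, hi) into (outer, inner) intervals for tiling.

    The tile size may be a run-time parameter; only integer arithmetic
    on the interval bounds is needed.
    """
    lo, hi = interval
    num_tiles = (hi - lo + tile - 1) // tile  # ceiling division
    return (0, num_tiles), (0, tile)


def tiled_points(interval, tile):
    """Enumerate the original iterator values in tiled order, guarding partial tiles."""
    lo, hi = interval
    (o_lo, o_hi), (i_lo, i_hi) = split(interval, tile)
    return [lo + o * tile + i
            for o in range(o_lo, o_hi)
            for i in range(i_lo, i_hi)
            if lo + o * tile + i < hi]


N, M = 4, 9                   # sample values, chosen only for this sketch
box = ((0, N), (2, M - 2))    # assumed half-open reading of the 2-dimensional box
```

For instance, `tiled_points(box[1], 2)` visits the j values in three tiles, the last of which is partial and handled by the guard.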

**Non-polyhedral DNN compilers:** Examples of non-polyhedral deep learning compilers include TVM [73], Halide [261], DLVM [281] and Latte [282].

#### Strengths of interval-based representations:

- Better support for non-affine code: Non-polyhedral compilers can naturally support non-affine code transformations such as parametric tiling (loop tiling with parametric tile size). This is mainly because the polyhedral representation relies on using affine sets and affine relations to represent code, code transformations and dependencies (the constraints used to describe the sets and relations have to be affine), while the interval-based representation does not have this limitation. Note, however, that non-polyhedral compilers also have only limited support for non-affine code and non-affine code transformations. For example, for the array access A[index[i]], a non-polyhedral compiler cannot statically determine which elements of the array A are accessed because the values of index[i] are not known at compile time.
- Faster compilation: Operations on intervals are faster and less expensive than the equivalent polyhedral operations on sets of integer points, which makes non-polyhedral compilers faster than polyhedral compilers.

## Weaknesses of interval-based representations:

- Limited expressiveness: Non-polyhedral compilers that use intervals cannot naturally represent non-rectangular iteration spaces. For example, it is hard to use intervals to represent the iteration domain for the loop iterators in the following example:

It is also hard to perform certain complex affine transformations such as iteration space skewing. Such a transformation is necessary for optimizing multi-layer RNNs and to increase hardware occupancy, e.g., for GPUs.

- Lack of support for programs with cyclic data-flow graphs: To simplify checking the legality of a schedule, many interval-based compilers assume that the program has an acyclic dataflow graph. This prevents users from expressing many programs with cyclic dataflow. For example, when a value produced by a loop is read by another loop, Halide [261] does not allow the fusion of the two loops by using the compute\_with command. Although this rule avoids illegal fusion, it prevents legal loop fusions in many common cases. Polyhedral compilers avoid these over-conservative constraints by using dependence analysis [283] to check the correctness of code transformations, which enables more possible schedules. While interval-based compilers can also implement non-polyhedral dependence analysis to avoid this limitation (by computing dependence distance vectors [284]), such dependence analysis is not as precise as polyhedral dependence analysis [283].
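The limited expressiveness of intervals for non-rectangular iteration spaces can be illustrated with a small sketch. The triangular loop nest used here is a hypothetical example chosen for illustration, and the helper names are likewise assumptions. The smallest box that contains the triangular domain also contains points that the loop nest never executes.

```python
def triangular_domain(n):
    """Exact iteration domain of: for i in range(n): for j in range(i + 1)."""
    return {(i, j) for i in range(n) for j in range(i + 1)}


def bounding_box(points):
    """Smallest box of per-dimension inclusive intervals containing all points."""
    dims = list(zip(*points))
    return tuple((min(d), max(d)) for d in dims)


def box_size(box):
    """Number of integer points inside a box of inclusive intervals."""
    size = 1
    for lo, hi in box:
        size *= hi - lo + 1
    return size


n = 4
domain = triangular_domain(n)
box = bounding_box(domain)
```

With n = 4, the exact domain has 10 points while the box has 16, so an interval-based compiler either over-approximates the domain or needs an extra guard on j.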

#### B. Support for Sparse Tensors

1) Challenges in supporting sparse tensors: The code for manipulating sparse tensors exhibits non-static<sup>2</sup> loop bounds, non-static array accesses, and conditionals. Such code is challenging for compilers because the loop bounds and array indices depend on tensor contents (e.g., row pointers and column indices) that are not available at compile time. The following pseudo-code shows an example of a direct convolution with sparse tensors (bounds of the j loop and accesses of the array in are non-static).

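```
for each output channel c_o
{
    // non-static bounds: the NZ range of row c_o is read from w.row_ptr
    for j in (w.row_ptr[c_o], w.row_ptr[c_o + 1])
    {
        coeff = w.value[j]
        offset = w.col_idx[j]    // non-static offset into the input
        for y in (0, out_H)
            for x in (0, out_W)
                out[c_o][y][x] += coeff * in[y*out_W + x + offset]
    }
}
```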
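For readers who want to execute the loop nest, a literal Python transcription is sketched below. It assumes that the weights are stored in a CSR-like format (`row_ptr`, `col_idx`, `value`) and that the input is a flat list that is long enough for every offset. The function name and the sample data are hypothetical.

```python
def sparse_direct_conv(row_ptr, col_idx, value, inp, out_c, out_h, out_w):
    """Literal transcription of the pseudo-code: CSR weights, flat input list."""
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(out_c)]
    for c_o in range(out_c):
        # Non-static bounds: the range depends on the contents of row_ptr.
        for j in range(row_ptr[c_o], row_ptr[c_o + 1]):
            coeff = value[j]
            offset = col_idx[j]  # non-static offset into the input
            for y in range(out_h):
                for x in range(out_w):
                    out[c_o][y][x] += coeff * inp[y * out_w + x + offset]
    return out


# Two output channels: channel 0 has NZs at offsets 0 and 2, channel 1 at offset 1.
row_ptr, col_idx, value = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
inp = [float(v) for v in range(8)]
out = sparse_direct_conv(row_ptr, col_idx, value, inp, out_c=2, out_h=2, out_w=2)
```

The trip count of the `j` loop and the addresses read from `inp` are known only after `row_ptr` and `col_idx` have been loaded, which is what makes static analysis of such code difficult.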

2) DNN compilers supporting sparsity: Examples of deep learning compilers that support sparse deep learning include Tiramisu [260], Acorns [120], and Taichi [285].

Tiramisu mainly supports sparsity of the weight tensors (in contrast to supporting sparsity of activations). To do that, it extends the polyhedral model in a way similar to [278]. For example, a non-affine conditional is transformed into a predicate that is attached to computation. The list of accesses of the computation is the union of the accesses of the computation in the two branches of the conditional, which is an overapproximation. During code generation, a pre-processing step inserts the conditional back into generated code. Non-static loop bounds and tensor accesses are represented as parameters in the polyhedral model. Statements that define those parameters are inserted just before the original statements that have non-static code. These techniques introduce approximations in the compiler. The efficiency of such techniques was demonstrated by Benabderrahmane et al. [278] and confirmed by PENCIL [270] and Tiramisu [260].

Acorns [120] is a framework designed mainly to optimize DNNs with sparsity of the activation tensors. Acorns fuses the operators in a computation graph of the deep CNN model, followed by sparse layout conversion (which ensures that the dense/sparse tensors produced by each operator are compatible with the next operation), followed by code optimization and code generation. Acorns introduces a data layout for exploiting the structure of sparsity of the input data in certain domains (face detection, LiDAR, etc.) where only certain regions of input tensors are NZs. For code optimization and generation, the compiler processes a set of template codes for neural network operators (e.g., convolution, pooling) and applies optimizations such as loop tiling, vectorization, and weight packing. However, it does not implement advanced loop-nest optimizations like iteration space skewing.

Other compilers for sparse code include TACO [286]. TACO uses a specific representation (iteration graphs) to generate code for sparse tensor operations and uses a scheduling language to guide how that code should be optimized.

#### C. Compiler Optimizations

To generate efficient code for neural network operators, the compiler has to apply a large set of complex code optimizations. These include operator fusion; multi-level tiling and register blocking, which improve data reuse; loop reordering, array packing [287] and data prefetching, which improve the memory access patterns; loop skewing, which enables the extraction of wavefront parallelism from multi-layer recurrent neural networks; parallelization; loop unrolling; vectorization; full/partial tile separation; and tuning optimization parameters to the target architecture (e.g., choosing tile sizes or loop unrolling factors that are optimal for the target machine using auto-tuning [288]). There are two major families of optimizing compilers: compilers that allow semi-automatic code optimization and compilers that are fully automatic.

<sup>2</sup>Cannot be analyzed at compile time

1) Compilers with semi-automatic code optimization (scheduling languages): The main idea in these compilers is to separate the algorithm from the optimizations. A program in this case has two parts:
- The first part specifies the algorithm without specifying how this algorithm is optimized.
- The second part specifies how the algorithm is optimized (transformed). This is done through a set of high-level scheduling commands (optimization commands) for common optimizations.

Halide [261], Tiramisu [260], and TVM [73] are examples of compilers that allow semi-automatic optimization. The main advantage of this approach is to allow the user to have full control over how the code should be optimized. This is important because fully automatic optimization techniques do not always succeed in providing the best performance.

Semi-automatic deep learning compilers usually provide a library of highly optimized deep learning operators. The compiler then only needs to decide automatically whether to apply certain optimizations such as operator fusion. All other optimizations are encoded manually in the library using scheduling commands. This minimizes the number of automatic decisions that the compiler needs to make and thus helps to obtain the best possible performance. Note that semi-automatic compilers usually also have automatic optimization modules, but such modules can be disabled if necessary.

2) Fully automatic compilers: Tensor Comprehensions [266] and Diesel [267] are examples of fully automatic compilers for deep learning. Other examples of fully automatic compilers include PENCIL [265], [270], Pluto [271], and Polly [272]. All of these compilers use the Pluto [271] algorithm to automatically optimize code (choosing the schedule of statements). The main idea of the Pluto algorithm is to use integer linear programming to model the problem of automatic code optimization, where the constraints are the dependencies of the program and the objective function is the minimization of the distance between producer and consumer statements. Other polyhedral compilers such as PolyMage [273] use a custom algorithm for automatic optimization.

None of the previous compilers has a scheduling language, and therefore they do not allow the user to have fine-grain control over optimizations. Although fully automatic compilers provide productivity, they may not always obtain the best performance. This sub-optimal performance is due to the fact that they do not have a precise cost model to decide which optimizations are profitable. For instance, the Pluto [271] automatic scheduling algorithm attempts to minimize the distance between producer and consumer statements while maximizing the outermost parallelism. However, it does not consider the redundant computations, data layout, or the complexity of the control-flow of the generated code.
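As a hedged sketch of the formulation described above, one level of the schedule search can be written as follows. The notation is assumed here, following the usual presentation of the Pluto algorithm, and is not taken from this survey. For each dependence edge $e$ from statement $S$ to statement $T$ with dependence polyhedron $P_e$, and one-dimensional affine schedules $\phi_S$ and $\phi_T$, legality requires

$$\phi_T(\vec{t}) - \phi_S(\vec{s}) \geq 0 \quad \forall (\vec{s},\vec{t}) \in P_e$$

The distance between producer and consumer is bounded by an affine function of the program parameters $\vec{n}$, with unknown coefficients $\vec{u}$ and $w$:

$$\vec{u} \cdot \vec{n} + w - \left(\phi_T(\vec{t}) - \phi_S(\vec{s})\right) \geq 0 \quad \forall (\vec{s},\vec{t}) \in P_e$$

The integer linear program then searches for the schedule coefficients that give the lexicographically smallest bound:

$$\text{lexmin}\; (u_1, \ldots, u_k, w)$$

Nothing in this objective accounts for redundant computation, data layout, or control-flow complexity, which is consistent with the limitation noted above.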

Cost models for automatic code optimization: The goal of an automatic code optimization pass in a compiler is to find the combination of code optimizations that minimizes the execution time. This problem can be modeled as a search problem where the search space is a set of combinations of code optimizations. To search such a space, a compiler needs a search technique and a cost model to compare different combinations of optimizations. Classical compilers use hand-tuned cost models [289], while others employ machine learning to build cost models [290]. Neither of these models precisely captures the hardware complexity (different memory hierarchies, out-of-order execution, hardware prefetching, communication latency, etc.). Instead, state-of-the-art models are built using deep learning to obtain better accuracy [291], [292]. For example, Ithemal [292] is a cost model that predicts the throughput of a basic block of x86 instructions and gets less than half the error of state-of-the-art hand-tuned models (llvm-mca in LLVM [293] and Intel's IACA).

## D. Accelerator ISAs and Code Generation

Many accelerators for machine learning, e.g., Cambricon-X [62], Scaledeep [294], Thinker [164], and DnnWeaver [233], expose a high-level ISA where some instructions perform high-level tensor operations (e.g., a matrix-matrix multiplication, convolution, dot product, pooling, and sigmoid). Such hardware accelerators simplify the task of the compiler since the compiler can now call those high-level operations instead of generating low-level code and optimizing it. However, the compiler still has to manage the data copies automatically. This subsection describes such high-level tensor ISAs used by accelerators and machine code generation.

1) Instruction sets: For tensor computations on hardware accelerators, ISAs typically feature instructions for arithmetic, logical, and data transfer operations with matrix, vector, and scalar data. Layers of ML models feature loops iterating thousands of times; dynamic instances of repetitive instructions can significantly increase the bandwidth requirements for delivering these instructions to PEs at each cycle and the energy consumption due to fetching these instructions from memory, communicating them, and processing them [201]. To mitigate such overhead of dynamic instructions, accelerators are designed with an array of vector or SIMD PEs. It allows PEs to process a single instruction for performing multiple computations on the block of tensors. Alternatively, accelerators feature PEs with additional control logic such that PEs process an instruction once, but repeatedly perform the sequence of execution for a certain interval.

For example, Liu et al. [295] proposed the Cambricon ISA for neural networks. It contains instructions for matrix and vector processing with arithmetic and logic operations, control (conditional branch and jump), and data transfer. Each operand of an instruction is either an immediate value or one of the 64 32b general-purpose registers, which are used for temporarily storing scalars or for register-indirect addressing of the on-chip scratchpad memory. Cambricon mainly supports computations on vectors or matrices through computational units. The tensor blocks are communicated between computational units from the on-chip scratchpad, which is made transparent to the compiler and programmers. The instructions in the Cambricon ISA can efficiently support commonly used primitives in various ML models. For instance, the Cambricon ISA supports multiplication, addition, subtraction, and division operations on matrices and vectors. It also supports the max-pooling primitive for tensors with a vector-greater-than-merge instruction, which aggregates the tensor elements at the same position across higher tensor dimensions and iteratively performs the operation across the spatial dimension for obtaining the output. It also provides a dedicated instruction for random vector generation with a uniform distribution of the values within [0, 1]. For supporting weight update during the training of DNNs, Cambricon provides additional instructions such as outer product, scalar-matrix multiplication, and matrix-matrix addition. However, because it focuses on processing tensors only from the global scratchpad memory, it lacks support for managing data in the local memory of PEs and for configuring the NOC for on-chip communication. Moreover, it does not provide specific instructions for accelerating sparse tensors, e.g., predicated execution of the sparse data in the compressed format while processing tensors of varying sparsity.

The instruction set for Sticker [201] consists of instructions for high-level operations. For processing each layer, one of the instructions is executed only once. It configures instruction registers and common control signals that correspond to the sparsity levels and tensor dimensions. Then, at a certain time interval, a dynamic 32b instruction executes for computing convolution over the data block on PE-array (e.g., processing an entire 2D feature map with a 2D filter). Meanwhile, the accelerator controller distributes the next instruction, if there is no collision between the current and the next instruction. It allows hiding the execution of other dynamic instructions including the write-back and encoding of the output data and transferring data between on-chip and off-chip memory.

2) Finite state machines (FSMs): Some accelerator designs use FSMs for PE executions, instead of dynamically executing instructions at every cycle. For supporting various ML models and tensors of different sizes on reconfigurable accelerators, these FSMs are typically parameterized. The parameters of FSMs are configured once (e.g., through bit-streams), before beginning the execution of the ML model or each layer. Moreover, accelerator controllers (which usually initiate the data movement between on-chip memory and off-chip memory and configure PEs, on-chip memory, and NOC) can also exhibit FSMs. For example, in the Thinker architecture [164], the finite-state controller is used for configuring the accelerator at three levels, i.e., PE-array level, model layer level, and PE level. The configuration word for the PE-array level handles partitioning of the PE-array, and it points to the memory address of the configuration words for model layers. Each configuration word for a layer provides information about tensor dimensions and their memory addresses. Lastly, configurations for PEs for the layer correspond to PE functionality and the interval (loop iterations) of PE computations and idle time.

Before processing each CNN layer, the Eyeriss architecture [34] serially loads a 1794b scan-chain for reconfiguration. The accelerator reconfiguration includes configuring NOC controllers for on-chip communication and configuring the control logic of PEs for carrying out repeated computations, as per the mapping for the layer.

3) Library support and code generation: The instructions for cycle-level executions or primitives (higher-level tensor operations) are usually obtained off-line. Accelerator system designers often provide users a template library that defines high-level primitives such as model layers or low-level primitives such as vector/matrix operations. Using these primitives, users can construct the model of their interest. Then, the low-level code is obtained automatically by the compiler or by using pre-defined tuned code [294], [296], [297]. For example, Zhang et al. [62] programmed the Cambricon-X accelerator with a set of library functions (written in C/C++) with the primitives for convolution and matrix/vector multiplication and addition. Chen et al. [298] proposed a programming framework consisting of an assembly language, an assembler, and run-time support for executing ML models with their Cambricon ISA [295]. For executing common layers, it also replaced the primitives with the corresponding pre-defined code blocks.

TVM [73] supports defining custom back-ends for accelerators, which was demonstrated using a vanilla accelerator design with a matrix-multiply engine. For executing primitives on accelerators, TVM enables Tensorization [73], i.e., decoupling the target hardware intrinsic from the schedule while mapping ML operators (e.g., 2d convolution) to the fixed intrinsics of the specialized accelerators. To demonstrate code generation for the vanilla accelerator, TVM enabled a driver library and runtime support that constructs the instructions and offloads them to the accelerator. Its code generation module translated the program into the appropriate function calls of the runtime API. Moreau et al. [299] leveraged the TVM stack and proposed a JIT compiler and a runtime system in order to generate the code for the programmable VTA accelerator.

It is important that the accelerator back-end can support multiple front-ends corresponding to different ML frameworks such as TensorFlow [71], PyTorch [70], and MXNet [300]. Such integration of the programming, compilation, and runtime environment with the common ML application development frameworks is necessary for supporting different ML models written in various application frameworks. Leveraging the existing system stack (e.g., TVM) can provide such opportunities to accelerator system developers. Note that although TVM supports defining custom accelerator back-ends and can lower the optimized mappings to accelerator-specific code, it currently does not provide support for sparse tensors.

#### XVI. TRENDS AND FUTURE DIRECTIONS

#### A. Hardware/software/algorithm Co-designs

1) Hardware-aware compression techniques: The framework for exploring efficient model compression (any of quantization, value sharing, pruning, and size reduction) should be aware of hardware features and provide directed search accordingly. For example, different hardware platforms (e.g., CPU/GPU vs. FPGAs and specialized accelerators) typically support only fixed bit-widths. Also, bit-widths of tensors that can be efficiently processed by these platforms vary considerably (e.g., from multiples of 8 bits to arbitrary bit-widths). When the hardware supports only bit-widths that are multiples of 4, any other bit-width requires the software to pad elements with unnecessary zeros, which incurs storage inefficiency. Instead, the compression algorithm can opt for improving the accuracy or trade off the bit-widths among layers of a model such that higher hardware-friendly compression is achieved. Likewise, hardware accelerators typically support only uniform widths of input tensors (activations and weights), and many accelerators do not support value sharing. Therefore, when exploring adaptive bit-widths for different layers, hardware-aware quantization can achieve more effective compression. Similarly, depending on the hardware support for fine-grained or block-sparsity (block size, direction), hardware-aware pruning can better achieve the compression objectives (model exploration time, performance, energy-efficiency, and maintaining accuracy threshold).
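The padding cost mentioned above is simple arithmetic, sketched here in Python. The function names are hypothetical, and the model assumes that each element is padded independently up to the next supported width.

```python
def padded_width(bits, granularity):
    """Smallest multiple of `granularity` that can hold a `bits`-wide element."""
    return -(-bits // granularity) * granularity  # ceiling division


def storage_overhead(bits, granularity):
    """Fraction of the stored bits that are padding."""
    stored = padded_width(bits, granularity)
    return (stored - bits) / stored
```

For example, a 5-bit quantization on hardware that supports only multiples of 4 bits is stored in 8 bits, so 3 of every 8 stored bits are padding. A hardware-aware search could instead choose 4 bits (no padding) or spend the unused bits on accuracy.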

2) Developing compression techniques that leverage the execution models of hardware accelerators: More effective accelerations can be achieved when compression techniques leverage execution models of hardware accelerators. Hardware accelerators can exhibit relatively simple logic blocks and their execution methods are typically pre-determined (sequence of coarse-grain data to be managed in memories and on PEs), which have allowed recent techniques to estimate execution metrics through analytical cost models. Accommodating such execution models can enable the compression algorithm to quickly determine the feasibility of improving the performance and energy efficiency, when opting for a specific pruning ratio or structure, tensor shapes, and tensor precisions. Compression techniques have recently leveraged such estimations (e.g., energy-aware pruning [60]). Integrating the cost models for hardware execution can help to ensure that the storage efficiency or reduced computations through a certain compression strategy delivers the expected accelerations.

3) Joint and automated exploration of sparsity, precision, and value-similarity: Recent compression techniques typically employ structured or fine-grained data pruning during training with a fixed precision of tensors. Techniques that obtain adaptive quantization (precision lowering), during or after training, usually do not explore pruning of the models. Joint explorations of the pruning and quantization may achieve high compression ratios due to the interplay of these compression techniques. For example, quantization can increase the sparsity of tensors considerably [157], because more values can be represented as zero after compressing the range [47].



Fig. 26. Co-designs can enable efficient accelerations of compact models.

Likewise, pruning may lead to a further reduction in the bit-width, since the pruned model has fewer values, which may be expressed with a much lower numeric range and precision. Therefore, the layer-wise joint exploration of sparsity and varying precisions of tensors may lead to higher compression and accelerations.
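The interplay between precision and sparsity can be seen in a small sketch. It assumes a simple symmetric uniform quantizer and made-up weight values; neither is taken from the cited works.

```python
def quantize(values, num_bits, max_abs):
    """Symmetric uniform quantization to signed integers (an assumed, simple scheme)."""
    levels = 2 ** (num_bits - 1) - 1
    scale = max_abs / levels
    return [round(v / scale) for v in values]


def sparsity(values):
    """Fraction of elements that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Illustrative weights: many small magnitudes and a few large ones.
weights = [0.9, -0.04, 0.02, 0.5, -0.01, 0.0, 0.08, -0.7]
q8 = quantize(weights, num_bits=8, max_abs=1.0)
q3 = quantize(weights, num_bits=3, max_abs=1.0)
```

At 8 bits only the value that was already zero stays zero, whereas at 3 bits every weight with a magnitude below half a quantization step also rounds to zero, so the sparsity of `q3` is higher than that of `q8`.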

Moreover, these compression techniques do not leverage temporal and spatial value-similarity in inputs, outputs, or weights. Therefore, joint exploration algorithms may be developed that use multiple compression strategies during training and automatically explore combinations that compress the model further. Recent techniques for automated explorations include CLIP-Q [301], [105], and [302]. Exploring a wide range of combinations during the training may be neither feasible nor necessary. Therefore, the algorithm developers for compact models may reduce the space of compression choices by determining a fixed range of effective options before beginning the resource-intensive training and, if required, further limiting the search space of the compression choices by evaluating them with a pre-trained model and fine-tuning. Then, the choices that appear effective after quick evaluations may be explored during the training.

It is important that compression benefits achieved through the joint explorations for compact models can be translated into efficient hardware accelerations. Therefore, the exploration heuristic should not preclude experts from expressing a directed search, which is required for obtaining hardware-friendly execution of models, e.g., specifying pruning with 1D or k:n block sparsity, enforcing the same bit-widths of the input tensors for a layer, a tolerable range for accuracy loss, etc. The heuristic should also provide automated optimization/exploration of hyperparameters because the compression algorithm needs to adjust the strategy of the pruning or quantization and the corresponding hyperparameters. For example, a pruning algorithm needs to determine the pruning ratio during each iteration (epoch); the total number of epochs in which pruning occurs during learning; the pruning mechanism (which values to prune, e.g., those below a certain threshold); the pruning granularity (fine-grained vs. block-structured); and the bit-widths of the tensors (quantization). All such hyperparameters or strategies need to be adjusted automatically (to the extent possible) such that the memory footprint or the computation requirement (FLOPs) is greatly reduced without dropping the accuracy below the user-specified threshold.
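One of the directed-search constraints named above, k:n block sparsity, can be sketched as a magnitude-based pruning step. The sketch assumes that k:n means at most k NZs in every group of n consecutive weights. The function name and sample values are hypothetical, and a real flow would interleave such a step with retraining.

```python
def prune_k_of_n(weights, k, n):
    """Keep the k largest-magnitude values in every group of n consecutive weights."""
    if len(weights) % n != 0:
        raise ValueError("length must be a multiple of n")
    pruned = []
    for start in range(0, len(weights), n):
        block = weights[start:start + n]
        # Indices of the k largest magnitudes in this block.
        order = sorted(range(n), key=lambda idx: abs(block[idx]), reverse=True)
        keep = set(order[:k])
        pruned.extend(v if idx in keep else 0.0 for idx, v in enumerate(block))
    return pruned


row = [0.1, -0.8, 0.3, 0.05, 0.6, 0.2, -0.7, 0.0]
pruned = prune_k_of_n(row, k=2, n=4)
```

Because every block then holds the same number of NZs, the work per block is uniform, which is the property that makes such patterns hardware-friendly.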

4) Value-aware neural architecture search (NAS) and hardware/model co-designs: Techniques for NAS or AutoML can automatically obtain efficient models that surpass the accuracy of models devised by human developers. Although models obtained from the current NAS techniques attain high accuracy, there remains scope for considerably improving NAS for obtaining highly compact models (e.g., with similar or higher compression and acceleration, as compared to recent compression algorithms). Recent techniques [303]-[307] have explored hardware/model co-designs. These co-design techniques can support quantized models and layers of varying shapes. However, the efficiency of the explored models and hardware accelerators can be further amplified by including the sparsity and adaptive bit-widths of model layers and analytically considering their implications on hardware accelerators (through cost models).

A major challenge faced by the model search techniques and hardware/algorithm co-designs is the vast search space. As Fig. 26 shows, the explorations can be performed for (i) ML models (i.e., NAS) [47], (ii) compression strategies (e.g., automated pruning and quantization) [46], (iii) mappings of a model on the given hardware [38], [87], [88], and (iv) specifications of the hardware accelerator [88], [89]. The explorations of (i) and (ii) directly impact the model size and accuracy, while search optimizations for (iii) and (iv) determine the resultant performance and energy-efficiency of the accelerator for a given model. Among these exploration spaces, NAS techniques can be significantly time-consuming (several GPU days [47]), followed by the automated model compression (e.g., [46]). Therefore, the resultant joint space for value-aware NAS and value-aware hardware/model co-designs is many-fold larger. It may require rethinking the search and co-design process for the joint optimizations that can truly lead to hardware-optimized compact models.

5) Facilitating structured computations of sparse tensors: Designers may opt for accelerators that are effective for structured computations with dense tensors (e.g., dot products). However, for executing some applications of the application mix, leveraging sparsity can be useful for higher accelerations. For instance, accelerators coupled to processor cores or near-data accelerators are often designed with systolic arrays for efficient dense tensor computations, and in-memory processing engines feature resistive crossbars. While sparsity or size reduction of tensors may need to be leveraged for execution on such accelerators, significant modifications in the design may be infeasible due to design requirements (area/power budgets for the application mix) or the increasing complexity of the system stack. Therefore, in order to leverage sparsity and to support irregular shapes of tensors on these accelerators, techniques to facilitate structured computations on the accelerator engines can be developed. These techniques can perform some pre-processing (e.g., arrange intersecting NZs for dot products) to feed the structured dense regions to the underlying power-efficient engines.

To support such pre-processing of the sparse tensors (e.g., for data decoding and extraction of matching NZs) and post-processing (e.g., data encoding), designers can opt for introducing additional hardware modules in the accelerator. Alternatively, the pre-processing may be carried out by the host processor that handles the non-performance-critical tasks. Such a decoupled mechanism can obviate heavy modifications in the PE-array or PE datapath for indexing or decoding sparse tensors. In other words, with decoupled pre-processing, in-memory/near-data processing engines (e.g., in ReCom [308], SNNrram [309]) or systolic-style co-processors can perform structured computations in a conventional manner and leverage the sparsity if needed.
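The decoupled pre-processing described above can be illustrated with a minimal sketch. It assumes bitmap-style detection of matching NZs on the host side and a plain dense dot product as a stand-in for the unmodified engine; the function names `pack_matching_nonzeros` and `dense_dot` are hypothetical and do not correspond to any surveyed design.

```python
def pack_matching_nonzeros(a, b):
    """Host-side pre-processing: keep only positions where both operands are non-zero."""
    bitmap_a = [x != 0 for x in a]                          # presence bit per element of a
    bitmap_b = [y != 0 for y in b]                          # presence bit per element of b
    match = [p and q for p, q in zip(bitmap_a, bitmap_b)]   # positions of effectual MACs
    a_packed = [x for x, m in zip(a, match) if m]
    b_packed = [y for y, m in zip(b, match) if m]
    return a_packed, b_packed


def dense_dot(u, v):
    """Stand-in for the unmodified dense engine (assumed to be a plain dot product)."""
    return sum(x * y for x, y in zip(u, v))


a = [0, 3, 0, 0, 2, 0, 5, 0]
b = [1, 4, 0, 7, 0, 0, 2, 0]
a_packed, b_packed = pack_matching_nonzeros(a, b)
result = dense_dot(a_packed, b_packed)
print(a_packed, b_packed, result)
```

In this sketch the dense engine receives two operands of length two instead of eight, while the computed dot product is unchanged; the cost of building and intersecting the bitmaps is moved out of the PE datapath.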

## B. Design Tools and Frameworks

1) Framework that analyzes performance gains of hardware accelerators for sparse tensors: Given that tensors of several ML models are sparse, it is important to design accelerator systems that can easily capitalize on performance gains for multiple ML models by employing low-cost hardware modules and enhanced software support. As we discussed in sections VIII-XV, each enhancement module presents multiple implementation choices in the hardware and/or software. Although crafting a cycle-level simulation infrastructure for capturing such a wide design space may be infeasible, a data-driven quantitative model can be significantly helpful for design explorations. Such a model can process the actual data (either at a low level or to discover distributions), provide high-level modeling of common implementation choices, and estimate the performance gains for each combination of the implementation choices. For newer models and datasets, hardware designers can run through a set of implementation choices in an early design phase. They can explore the implications of performance-gain opportunities for the desired choice of data coding format, coding logic, scalar/vector PE, load balancing, dataflow mechanism, etc. For instance, the designers can find out the effectiveness of a central or in-PE NZ detection and extraction module, and they may decide not to opt for a load-balancing module for their workloads. Additionally, such a sparse-data-driven analytical framework can inform about the bandwidth requirements for on-chip communication, reuse opportunities of the sparse data, and the approximate storage and communication overhead of the metadata.
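As a minimal sketch of what one component of such a data-driven model might look like, the following code estimates the storage footprint of a tensor under a few encodings by scanning the actual data. The bit-widths, the coordinate-list (COO) format, the saturating run-length field, and the function name `storage_bits` are illustrative assumptions, not parameters of any particular accelerator.

```python
def storage_bits(values, value_bits=8, run_bits=4):
    """Estimate storage (in bits) of a 1-D tensor under several encodings.

    Assumptions: trailing zeros are implied by a known tensor length, and an
    RLC zero-run longer than the run field is split by an explicit zero entry.
    """
    n = len(values)
    nnz = sum(1 for v in values if v != 0)
    index_bits = max(1, (n - 1).bit_length())   # bits needed to address any position

    max_run = (1 << run_bits) - 1
    rlc_entries, run = 0, 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1                            # extend the current zero-run
        elif v == 0:
            rlc_entries += 1                    # run field saturated: emit filler entry
            run = 0
        else:
            rlc_entries += 1                    # (run, value) entry for a non-zero
            run = 0

    return {
        "dense": n * value_bits,
        "bitmap": n + nnz * value_bits,
        "coo": nnz * (index_bits + value_bits),
        "rlc": rlc_entries * (run_bits + value_bits),
    }


tensor = [0] * 20 + [5] + [0] * 3 + [7, 9] + [0] * 6
bits = storage_bits(tensor)
best_format = min(bits, key=bits.get)
print(bits, best_format)
```

A fuller framework would pair such storage estimates with models of extraction logic, dataflow, and load balance; this sketch only shows that the preferred encoding can be derived from the data rather than fixed in advance.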

2) Accelerator design frameworks for compact models: Several frameworks for developing and simulating FPGA or ASIC based accelerator designs have recently been proposed, including DNNWeaver [233], DNNBuilder [310], T2S [37], and HeteroCL [311] for FPGAs and NVDLA [154], VTA [299], MAGNet [312], MAERI [206], and AutoDNNChip [218] for specialized accelerators. Similarly, hardware construction languages or representations such as Chisel [313] and $\mu$IR [314] enable efficiently expressing microarchitectural features through high-level primitives. Such design and simulation infrastructures are key for the community since they can be a good learning resource for training new professionals and can serve as a kick-starter baseline for developing new design features.

However, most of the current frameworks support design explorations and executions of dense tensors of fixed bit-widths. With the increasing use of the compact ML models (sparse and irregular tensor computations) for efficient inference/learning, corresponding support for accelerator design and simulation needs to be developed. In particular, the accelerator design and simulation frameworks can provide some pre-built modules for encoding/decoding sparse data (e.g., with common formats like RLC or bitmap), dynamic load balancing, function units for bit-adaptive computing, etc. Although these modules may be designed in a specific language format or can be interfaced with other accelerator modules in a restricted manner, they can serve as reusable logic that can be leveraged and plugged in with other existing modules that compute dense tensors. Following the current trend, new ML models can be anticipated to be compact with sparse and irregular-shaped tensors. Hence, such design support can expedite development of new design features for emerging ML models and applications, as well as quick prototyping and emulation of the explored designs.
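A software analogue of such pre-built encoding/decoding modules is sketched below, assuming two codecs (bitmap and RLC) that expose the same `encode`/`decode` interface so that either can be plugged in front of a dense compute module. The class names and the 4-bit run field are hypothetical choices made for illustration.

```python
class BitmapCodec:
    """Bitmap format: one presence bit per element plus the packed non-zero values."""

    def encode(self, dense):
        bitmap = [1 if v != 0 else 0 for v in dense]
        values = [v for v in dense if v != 0]
        return bitmap, values

    def decode(self, encoded):
        bitmap, values = encoded
        it = iter(values)
        return [next(it) if bit else 0 for bit in bitmap]


class RunLengthCodec:
    """RLC format: (zero_run, value) pairs; the dense length restores trailing zeros."""

    def __init__(self, run_bits=4):
        self.max_run = (1 << run_bits) - 1

    def encode(self, dense):
        pairs, run = [], 0
        for v in dense:
            if v == 0 and run < self.max_run:
                run += 1
            else:
                pairs.append((run, v))   # v is an explicit zero when the run field saturates
                run = 0
        return len(dense), pairs

    def decode(self, encoded):
        length, pairs = encoded
        dense = []
        for run, v in pairs:
            dense.extend([0] * run)
            dense.append(v)
        dense.extend([0] * (length - len(dense)))
        return dense


tensor = [0] * 20 + [5] + [0] * 3 + [7, 9] + [0] * 6
for codec in (BitmapCodec(), RunLengthCodec()):
    restored = codec.decode(codec.encode(tensor))
    print(type(codec).__name__, restored == tensor)
```

Because both codecs share one interface, a design framework could swap the encoding without touching the module that consumes the decoded dense data, which is the kind of reuse the paragraph above argues for.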

## C. Accelerating Training of ML Models

While there have been significant advances in performing inference on hardware accelerators, training the models on hardware accelerators has received relatively little attention so far. It is true that training is currently done in high-performance computing environments that consist of commodity CPU and GPU platforms and, more recently, FPGA and TPU accelerators. However, just like inference, hardware accelerators can offer significant benefits to model training in both edge and datacenter-scale computing environments, and they can notably improve the performance and energy efficiency. In particular, hardware accelerators are promising for enabling online learning on edge devices through compact models. While developing the system stack for training models on hardware accelerators demands significant development and research efforts, the development of community-driven infrastructures can boost the pace.

Recently, accelerator designs including ScaleDeep [294], HyPar [235], TNPU [320], and [35] have been proposed for efficient training of the models. However, they either do not exploit the sparsity, or may not be efficiently utilized for irregular-shaped tensors, or lack support for varying precision requirements for weight and activation tensors, gradients, and weight updates. These gaps present further opportunities for performance gains and energy efficiency to the designers of accelerator systems. Additionally, designers can leverage cross-layer optimizations (e.g., by reusing the data of gradients during back-propagation) and support mixed-precision of tensors during the training of compact models.

## D. Applying Sparsity Techniques to Other Domains

In this work, we considered a wide variety of techniques that leverage sparsity in the domain of machine learning, which represents an enormous research effort. Many other domains face similar challenges in exploiting sparsity, and accelerators have been proposed for some of the more processing-intensive domains; this includes graph processing [321], [322], database operations [323], genomics [324], [325], and compression [326]. In some cases, the computation primitives even extend across domains. For example, the problem of finding intersecting non-zeros is analogous to joins in a database context [232]. Applying the lessons learned from extensive research on sparsity in an ML context can likely speed innovation in a broader context.
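The analogy between intersecting non-zeros and database joins can be made concrete with a minimal sketch. It assumes both sparse vectors are stored as sorted index lists with parallel value lists, and uses a sort-merge join on the index as the join key; the function name `merge_join_nonzeros` is hypothetical.

```python
def merge_join_nonzeros(idx_a, val_a, idx_b, val_b):
    """Sort-merge join on the index key: emit (index, a_value, b_value) for matches."""
    i = j = 0
    matches = []
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            matches.append((idx_a[i], val_a[i], val_b[j]))   # key present in both inputs
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1                                           # advance the smaller key
        else:
            j += 1
    return matches


idx_a, val_a = [0, 2, 5, 7], [1.0, 2.0, 3.0, 4.0]
idx_b, val_b = [2, 3, 7], [10.0, 20.0, 30.0]
matches = merge_join_nonzeros(idx_a, val_a, idx_b, val_b)
dot = sum(a * b for _, a, b in matches)
print(matches, dot)
```

The joined tuples are exactly the effectual multiplications of a sparse dot product, which is why techniques developed for one domain may transfer to the other.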

# XVII. RELATED WORK

Deep learning models and domain-specific applications: Pouyanfar et al. [65] described different deep learning models along with their classification and discussed different frameworks for processing DNNs with various datasets. Alom et al. [316] provided a historical perspective on evaluations of CNNs over the past decade. Gu et al. [327] discussed recent advances in CNNs and described their applications in computer vision and language processing. Recent surveys have discussed applications of ML and DNNs in image classification [328], mobile multimedia [329], medical image analysis [330], biomedical applications [331], wireless and networking [332], [333], and embedded systems [28]. Elsken et al. [334] surveyed recent techniques for neural architecture search.

Compact models: Reed [335] surveyed different pruning algorithms. Cheng et al. [336] surveyed model compression techniques including parameter pruning and low-rank factorization. Wang et al. [315] surveyed compression techniques including pruning, precision lowering, weight sharing, low-rank factorization, and knowledge distillation. Rezk et al. [28] described different RNNs and algorithmic optimizations including quantization and pruning. Deng et al. [47] described techniques to obtain compact models including model sparsification, data quantization, tensor decomposition, and joint-way compression. Table XIV summarizes related surveys about ML models, hardware accelerators, and software optimizations.

Hardware accelerators for dense tensor computations: Liu et al. [337] surveyed coarse-grained reconfigurable architectures for general-purpose computing applications. Shawahna et al. [338] surveyed FPGA accelerators for processing dense tensor computations of deep learning applications. Venieris et al. [318] discussed different CNN-to-FPGA toolchains and described their hardware architectures, design space exploration techniques, and support for different precisions of tensors. They also compared execution metrics of the designs obtained with various toolchains with those of the previously proposed FPGA accelerators for CNNs. Sze et al. [64] presented a survey about efficiently executing DNNs on hardware accelerators. It described different DNNs, different compression techniques for compact models, and optimized dataflows for hardware accelerators based on spatial architectures. Reuther et al. [339] benchmarked executions of different ML accelerators. Li et al. [340] discussed different ML frameworks and compilers for deep learning models.

Hardware accelerators for compact ML models: Mittal [317] surveyed execution of compact models, including BNNs and models compressed with structured pruning, on FPGAs. It also discussed optimizations for dense tensor computations such as processing convolutions with the Winograd algorithm and executing models on multiple FPGA accelerators. Deng et al. [47] surveyed hardware accelerators that support bit-adaptive computing to process tensors of different precision and the designs of data extraction modules of the accelerators to leverage sparsity of input, weight, or output tensors. Du et al. [319] recently proposed the MinMaxNN system for dynamically switching NN models. They surveyed techniques for designing self-aware NN systems (which can continuously sense information from the environment and dynamically react) including leveraging sparsity and tensor quantization. Wang et al. [315] surveyed hardware implementations for efficiently processing tensors of lower precisions that are obtained with binary, ternary, and logarithmic quantizations. Ignatov et al. [341] benchmarked executions of quantized deep learning models on mobile AI accelerators.

TABLE XIV
RELATED LITERATURE FOR HARDWARE, SOFTWARE, AND ALGORITHM DESIGNS TO ACCELERATE COMPACT MODELS CONSISTING OF SPARSE, IRREGULAR-SHAPED, AND QUANTIZED TENSORS.

|  |  |  | This work | [64] | [47] | [315] | [65] | [316] | [317] | [318] | [319] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Algorithm Design for Application Domain | ML / DL Models |  |  | ✓ | ✓ |  | ✓ | ✓ |  |  |  |
|  | ML Frameworks |  |  | ✓ |  |  | ✓ | ✓ |  |  |  |
|  | Domain Specific Applications |  |  |  |  |  | ✓ | ✓ |  |  |  |
|  | Datasets |  |  | ✓ |  |  | ✓ | ✓ |  |  |  |
|  | Data Pruning |  |  |  | ✓ | ✓ |  |  |  |  | ✓ |
|  | Tensor Quantization |  |  | ✓ | ✓ | ✓ |  |  |  |  | ✓ |
|  | Tensor Size Reduction |  |  |  | ✓ | ✓ |  |  |  |  |  |
|  | Neural Architecture Search |  |  |  | ✓ |  |  |  |  |  |  |
| Accelerator Hardware | Spatial Architectures |  | ✓ | ✓ |  |  |  |  |  |  |  |
|  | Sparse Tensor Computations | Encodings and Data Extraction | ✓ |  | ✓ |  |  |  |  |  |  |
|  |  | Sparsity-aware Dataflows | ✓ |  |  |  |  |  |  |  |  |
|  |  | Memory Management and Communication | ✓ |  |  |  |  |  |  |  |  |
|  |  | Load Balancing | ✓ |  |  |  |  |  |  |  |  |
|  | Quantized Tensors | Bit-adaptive Computing | ✓ |  | ✓ | ✓ |  |  | ✓ | ✓ | $\overline{}$ |
|  |  | Leverage Value-similarity | ✓ | ✓ |  |  |  |  |  |  | ✓ |
|  | Irregular-shaped Tensors |  | ✓ |  |  |  |  |  |  | ✓ |  |
|  | Benchmarking |  |  |  |  |  |  |  |  | ✓ |  |
| Software Stack | Mapping Optimizations |  | ✓ | ✓ |  |  |  |  | ✓ |  |  |
|  | Compiler Support |  | ✓ |  |  |  |  |  |  |  |  |
| Co-design | Hardware/Software/Algorithm |  | ✓ | ✓ | ✓ |  |  |  |  |  |  |

In contrast to the above relevant surveys, this work highlights sources of sparsity and size reduction of tensors in ML models and challenges in efficiently executing such tensors on hardware accelerators. Then, it surveys and discusses the hardware and software support required for sparse and irregular tensor computations on hardware accelerators including encodings and extraction of sparse data, sparsity-aware dataflows, memory management and on-chip communication of sparse tensors while leveraging data reuse, load balancing of computations, and compiler support. The survey also discusses techniques for handling computations of tensors with mixed-precision and value sharing.

# XVIII. SUMMARY

For efficient and hardware-friendly processing, compact deep learning models have been designed. They consume less storage/computation/energy and consist of tensors with considerable sparsity, irregular shapes, and low precision. While these compact models can be efficiently accelerated on hardware accelerators, doing so requires special hardware and software support. We have highlighted challenges in efficiently accelerating sparse and irregular tensor computations. Leveraging sparsity requires significant redesign, as hardware accelerators store and compute with zero values when processing in a conventional manner. Moreover, the many sources of tensor sparsity and their sparsity levels lead to unique challenges and solutions in hardware/software/algorithm co-designs.

In this article, we have discussed how exploiting sparsity effectively depends on tailoring the data encoding and extraction, dataflow, memory bank structure, interconnect design, and write-back mechanisms. We have provided an overview of the accelerator system for processing sparse and irregular tensor computations. Then, we have categorized different techniques that leverage the sparsity of weight and/or activation tensors during learning or inference of ML models for storage, performance, and energy efficiency. We have analyzed how different sparsity encodings impact the storage benefits for processing tensors of common DNNs with varying sparsity levels. We have also introduced a taxonomy to categorize data extraction techniques based on the functionality of the PE architecture. We have also explored data reuse opportunities (effectual MACs per data element) for processing common DNNs and described how sparsity lowers the data reuse due to fewer effectual computations. Shared scratchpad memory allows sparse data to be reused temporally. We have discussed techniques for memory bank management to support unstructured accesses for sparse computations. Different accelerator designs have used various interconnects, which vary in terms of the bandwidth requirement and the exploitation of spatial data reuse. Therefore, configurable interconnects are required that can support different DNNs of varying sparsity and can be easily configured for a mix of unicast/multicast/broadcast communication patterns. Further, mechanisms to accumulate partial outputs from sparse tensor computations can be temporal, spatial, or spatiotemporal. Such mechanisms differ in their trade-off between higher energy consumption and reduced flexibility due to unstructured accumulations. Processing elements of the accelerators can perform scalar, SIMD, or vector operations.

Systematically processing tensor computations on architectural resources requires sparsity-aware dataflows that can achieve higher hardware efficiency while adapting to sparsity levels and irregular shapes of tensors. Sparse computations can be highly imbalanced; depending on synchronized executions on PEs, the gap in the computation work allocated to the leading and straggling PEs can result in a considerable acceleration loss. We have surveyed different techniques that balance the work through hardware modules or software-directed regularization. The survey has also described mechanisms for asynchronous or simultaneous write-backs from PEs and on-the-fly sparsity-aware encodings. Compilation for the accelerators requires the ability to efficiently express sparsity in intermediate representations, flexibly apply different compiler optimizations, and emit efficient accelerator-specific code. The survey has discussed different techniques that enable such support.
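The acceleration loss from imbalanced sparse work can be illustrated with a small sketch. It assumes that PEs synchronize after processing their assigned filters, that the work of a filter is proportional to its NZ count, and that a greedy largest-first assignment stands in for software-directed balancing; the NZ counts and function names are hypothetical.

```python
def utilization(loads):
    """Fraction of PE-cycles doing useful work when all PEs wait for the slowest one."""
    return sum(loads) / (len(loads) * max(loads))


def assign_in_order(nnz_per_filter, num_pes):
    """Baseline: filters are handed to PEs round-robin, ignoring their NZ counts."""
    loads = [0] * num_pes
    for k, nnz in enumerate(nnz_per_filter):
        loads[k % num_pes] += nnz
    return loads


def assign_greedy(nnz_per_filter, num_pes):
    """Software-directed balancing: largest filter first, onto the least-loaded PE."""
    loads = [0] * num_pes
    for nnz in sorted(nnz_per_filter, reverse=True):
        loads[loads.index(min(loads))] += nnz
    return loads


nnz_per_filter = [90, 10, 80, 20, 70, 30, 60, 40]   # hypothetical NZ counts
baseline = assign_in_order(nnz_per_filter, num_pes=2)
balanced = assign_greedy(nnz_per_filter, num_pes=2)
print(baseline, max(baseline), round(utilization(baseline), 3))
print(balanced, max(balanced), round(utilization(balanced), 3))
```

With these assumed counts, the synchronized execution time is set by the straggling PE (300 units of work in the baseline versus 200 after balancing), even though the total work is identical in both cases.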

While required precisions of tensors for high inference accuracy can vary among DNNs and their layers, only some hardware accelerators show such support. In particular, PE designs with bit-decomposable function units or bit-serial computing can facilitate their acceleration needs. Therefore, hardware/algorithm co-designs are required that can provide hardware-friendly compact models, without much accuracy loss. Further, accelerators can systematically leverage the value-similarity of tensors (e.g., spatial and temporal similarity of activations during the processing of consecutive video frames). While accelerators have leveraged similarity of inputs, weights, or computation reuse through memoizing outputs, further opportunities in terms of the joint exploration of value-similarity with precision-lowering and sparsity can be leveraged. We have highlighted further opportunities for different hardware or software aspects of accelerator designs. We have also discussed future directions in terms of hardware/software/algorithm co-designs and system stack development.
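To illustrate why bit-serial computing suits layers with varying precision, the sketch below computes a dot product by processing one bit-plane of the activations per step, so that the number of steps scales with the chosen bit-width. It assumes unsigned integer activations and integer weights; the function name `bit_serial_dot` is hypothetical and the code is a functional model, not a description of any specific PE design.

```python
def bit_serial_dot(weights, activations, act_bits):
    """Dot product that consumes one activation bit-plane per step (LSB first).

    Assumes every activation is an unsigned integer below 2**act_bits.
    Returns the result and the number of bit-serial steps taken.
    """
    acc = 0
    for bit in range(act_bits):
        # Sum the weights whose activation has this bit set, then weight by 2**bit.
        partial = sum(w for w, a in zip(weights, activations) if (a >> bit) & 1)
        acc += partial << bit
    return acc, act_bits


weights = [3, -2, 5]
activations = [5, 3, 6]          # all values fit in 3 bits
result_3b, steps_3b = bit_serial_dot(weights, activations, act_bits=3)
result_8b, steps_8b = bit_serial_dot(weights, activations, act_bits=8)
print(result_3b, steps_3b, result_8b, steps_8b)
```

Both calls return the same dot product, but the 3-bit run needs three steps instead of eight, which is the kind of precision-proportional saving that bit-adaptive PEs aim for.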

In conclusion, while different accelerator systems and compression algorithms have been proposed for efficiently processing compact ML models, it remains an active research frontier. In particular, hardware/software/algorithm co-designs and joint explorations of tensor sparsity, precision, size (rank and dimension length), and value-sharing will likely provide further opportunities for innovations across the system stack. With a boost in energy-efficient accelerations of the learning and inference at the cloud and edge, these accelerator/model co-designs can be anticipated to further improve the intelligence of various systems or applications.

# APPENDIX: HARDWARE ACCELERATORS CAN BETTER EXPLOIT SPARSE AND IRREGULAR TENSOR COMPUTATIONS

Exploiting acceleration opportunities for sparse and irregular tensor computations is relatively hard on CPUs and GPUs [57], [62], [144]. In fact, the performance of ML models can even degrade, as compared to the execution of dense data. For example, for executing AlexNet layers (non-structurally sparsified by L1-norm), Wen et al. [44] analyzed speedup on different GPU platforms. They executed dense (unpruned) tensors with cuBLAS and processed the sparse matrices (of sparsified layers) in compressed sparse row format with cuSPARSE. Their experiments showed that the achieved speedups were either limited (less than $1.4\times$) or were even slowdowns for very high sparsity. This is because unstructured sparsity yields poor data locality for scattered NZ values. Moreover, it is challenging to skip ineffectual computations associated with zeros in tensors and to equally distribute work among multiple threads or compute units of processor cores.

Zhang et al. [62] analyzed the performance benefits of executing sparse models (LeNet [1], AlexNet [2], and VGG-16 [91]) on CPU and GPU platforms. They executed dense (unpruned) models on CPU and GPU platforms with Caffe [258] and sparse (pruned) models with sparse BLAS and cuSPARSE. They reported that even with the cuSPARSE library, the GPU processed a sparse AlexNet (6.99 million weights) $1.78\times$ faster than the unpruned AlexNet (59.48 million weights). Despite the average sparsity of 90.94%, the geomean speedup for these three models was only 23.34% on the GPU, and execution on the CPU took about 110% more time. Furthermore, they analyzed the impact of varying sparsity levels on execution time, for executing CONV and FC layers on CPU, GPU, and Cambricon-X [62] platforms. Their sensitivity analysis showed that for low sparsity (less than 30% zeros), both the CPU and GPU platforms did not benefit from sparse models due to non-trivial costs of sparse data processing. For example, for executing an FC layer with 15% sparsity on the CPU and GPU, they reported slowdowns by a factor of $6.67\times$ and $1.45\times$, respectively, as compared to the execution time of the dense (unpruned) layer. Execution of an FC layer is usually memory-bound and therefore, for both CPU and GPU, the performance improved more drastically with increasing sparsity than it did for the convolution layer. In their evaluations, while the CPU and GPU showed marginal acceleration benefits only at moderate or high sparsity, Cambricon-X achieved performance gains for tensor sparsity of 5% or higher, due to its design tailored for sparse tensor computations. For higher sparsity levels of tensors (e.g., 99%), the Cambricon-X accelerator achieved very high speedups (e.g., $15.5\times$ for convolution and $48.5\times$ for the FC layer), as compared to executing dense tensors. Equipped with special support, hardware accelerators can effectively capitalize on acceleration gains by efficiently processing sparse and irregular tensor computations.
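The gap between the available sparsity and the realized GPU speedup can be quantified with a short calculation on the AlexNet figures quoted above. It assumes, as a rough upper bound only, that work scales linearly with the number of weights, which ignores per-layer differences in MACs per weight.

```python
dense_weights_m = 59.48        # unpruned AlexNet, millions of weights (from the text)
sparse_weights_m = 6.99        # pruned AlexNet, millions of weights (from the text)
reported_gpu_speedup = 1.78    # GPU speedup with cuSPARSE (from the text)

weight_sparsity = 1 - sparse_weights_m / dense_weights_m
# Assumed upper bound: speedup if execution time were proportional to weight count.
ideal_speedup = dense_weights_m / sparse_weights_m
realized_fraction = reported_gpu_speedup / ideal_speedup

print(f"weight sparsity:   {weight_sparsity:.1%}")
print(f"ideal speedup:     {ideal_speedup:.2f}x")
print(f"realized fraction: {realized_fraction:.1%}")
```

Under this simplifying assumption, the GPU realizes roughly one fifth of the reduction in weights as speedup, which is consistent with the observation that conventional platforms struggle to convert unstructured sparsity into performance.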

## REFERENCES

- [1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner *et al.*, "Gradient-based learning applied to document recognition," *Proceedings of the IEEE*, vol. 86, no. 11, pp. 2278–2324, 1998.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in *Advances in neural information processing systems*, 2012, pp. 1097–1105.
- [3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2015, pp. 1–9.

- [4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2016, pp. 770–778.
- [5] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in *International Conference on Medical image computing and computer-assisted intervention*. Springer, 2015, pp. 234–241.
- [6] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," in *International Conference on Machine Learning*, 2019, pp. 6105–6114.
- [7] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in *Advances in neural information processing systems*, 2015, pp. 91–99.
- [8] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," *arXiv preprint arXiv:1804.02767*, 2018.
- [9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "Ssd: Single shot multibox detector," in *European conference on computer vision*. Springer, 2016, pp. 21–37.
- [10] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2015, pp. 815–823.
- [11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," *IEEE transactions on pattern analysis and machine intelligence*, vol. 40, no. 4, pp. 834–848, 2017.
- [12] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," *IEEE transactions on pattern analysis and machine intelligence*, vol. 39, no. 12, pp. 2481–2495, 2017.
- [13] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in *Proceedings of the European conference on computer vision (ECCV)*, 2018, pp. 801–818.
- [14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," *Neural computation*, vol. 9, no. 8, pp. 1735–1780, 1997.
- [15] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," *arXiv preprint arXiv:1409.0473*, 2014.
- [16] I. Sutskever, O. Vinyals, and Q. Le, "Sequence to sequence learning with neural networks," *Advances in NIPS*, 2014.
- [17] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," in *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2014, pp. 1724–1734.
- [18] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates *et al.*, "Deep speech: Scaling up end-to-end speech recognition," *arXiv preprint arXiv:1412.5567*, 2014.
- [19] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey *et al.*, "Google's neural machine translation system: Bridging the gap between human and machine translation," *arXiv preprint arXiv:1609.08144*, 2016.
- [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in *Advances in neural information processing systems*, 2017, pp. 5998–6008.
- [21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)*, 2019, pp. 4171–4186.
- [22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in *Proceedings of the IEEE international conference on computer vision*, 2015, pp. 4489–4497.
- [23] H. Lee, C. Ekanadham, and A. Y. Ng, "Sparse deep belief net model for visual area v2," in *Advances in neural information processing systems*, 2008, pp. 873–880.
- [24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in *Advances in neural information processing systems*, 2014, pp. 2672–2680.

- [25] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur *et al.*, "Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications," *arXiv preprint arXiv:1811.09886*, 2018.
- [26] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in *Proceedings of the 26th international conference on world wide web*, 2017, pp. 173–182.
- [27] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M. Matheson, J. Deslippe, M. Fatica *et al.*, "Exascale deep learning for climate analytics," in *SC18: International Conference for High Performance Computing, Networking, Storage and Analysis*. IEEE, 2018, pp. 649–660.
- [28] N. M. Rezk, M. Purnaprajna, T. Nordström, and Z. Ul-Abdin, "Recurrent neural networks: an embedded computing perspective," *IEEE Access*, vol. 8, pp. 57967–57996, 2020.
- [29] H. Li, K. Ota, and M. Dong, "Learning iot in edge: Deep learning for the internet of things with edge computing," *IEEE network*, vol. 32, no. 1, pp. 96–101, 2018.
- [30] J. Dean, "Machine learning for systems and systems for machine learning," in *Presentation at 2017 Conference on Neural Information Processing Systems*, 2017.
- [31] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean, "Device placement optimization with reinforcement learning," in *Proceedings of the 34th International Conference on Machine Learning-Volume 70*. JMLR.org, 2017, pp. 2430–2439.
- [32] J. Dean, "1.1 the deep learning revolution and its implications for computer architecture and chip design," in 2020 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2020, pp. 8–14.
- [33] K. Olukotun, "Designing computer systems for software 2.0," in Presentation at 2018 Conference on Neural Information Processing Systems, 2018.
- [34] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 1, pp. 127–138, 2016.
- [35] B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan, J. Choi, S. Mueller, A. Agrawal, T. Babinsky et al., "A scalable multi-teraops deep learning processor core for ai training and inference," in 2018 IEEE Symposium on VLSI Circuits. IEEE, 2018, pp. 35–36.
- [36] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.
- [37] H. Rong, "Programmatic control of a compiler for generating high-performance spatial hardware," arXiv preprint arXiv:1711.07606, 2017.
- [38] X. Yang, M. Gao, J. Pu, A. Nayak, Q. Liu, S. E. Bell, J. O. Setter, K. Cao, H. Ha, C. Kozyrakis et al., "Dnn dataflow choice is overrated," arXiv preprint arXiv:1809.04070, 2018.
- [39] D. Amodei, D. Hernandez, G. Sastry, J. Clark, G. Brockman, and I. Sutskever, "Ai and compute," https://openai.com/blog/ai-and-compute/, May 2018.
- [40] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in *Advances in neural information processing systems*, 2015, pp. 1135–1143.
- [41] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in neural information processing systems, 1990, pp. 598–605.
- [42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," *The journal of machine learning research*, vol. 15, no. 1, pp. 1929–1958, 2014.
- [43] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding," arXiv preprint arXiv:1510.00149, 2015.
- [44] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in *Advances in neural information processing systems*, 2016, pp. 2074–2082.
- [45] A. K. Mishra, E. Nurvitadhi, G. Venkatesh, J. Pearce, and D. Marr, "Fine-grained accelerators for sparse machine learning workloads," in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2017, pp. 635–640.
- [46] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "Amc: Automl for model compression and acceleration on mobile devices," in *Proceedings of the European Conference on Computer Vision (ECCV)*, 2018, pp. 784–800.

- [47] B. L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, "Model compression and hardware acceleration for neural networks: A comprehensive survey," *Proceedings of the IEEE*, vol. 108, no. 4, pp. 485–532, 2020.
- [48] R. Krishnamoorthi, "Quantizing deep convolutional networks for efficient inference: A whitepaper," arXiv preprint arXiv:1806.08342, 2018.
- [49] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in *International Conference on Machine Learning*, 2015, pp. 1737–1746.
- [50] D. Lin, S. Talathi, and S. Annapureddy, "Fixed point quantization of deep convolutional networks," in *International Conference on Machine Learning*, 2016, pp. 2849–2858.
- [51] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2017, pp. 1251–1258.
- [52] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
- [53] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size," arXiv preprint arXiv:1602.07360, 2016.
- [54] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan, "Tensor decompositions for signal processing applications: From two-way to multiway component analysis," *IEEE signal processing magazine*, vol. 32, no. 2, pp. 145–163, 2015.
- [55] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas, "Predicting parameters in deep learning," in *Advances in neural information processing systems*, 2013, pp. 2148–2156.
- [56] L. Moroney. (2018) Introducing ragged tensors. [Online]. Available: https://blog.tensorflow.org/2018/12/introducing-ragged-tensors.html
- [57] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li, T. Chen, and Y. Chen, "Cambricon-s: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 15–28.
- [58] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, "Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers," in *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems*, 2019, pp. 925–938.
- [59] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, "Exploring sparsity in recurrent neural networks," arXiv preprint arXiv:1704.05119, 2017.
- [60] T.-J. Yang, Y.-H. Chen, and V. Sze, "Designing energy-efficient convolutional neural networks using energy-aware pruning," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2017, pp. 5687–5695.
- [61] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "Eie: efficient inference engine on compressed deep neural network," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 243–254.
- [62] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, "Cambricon-x: An accelerator for sparse neural networks," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016, p. 20.
- [63] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, vol. 9, no. 2, pp. 292–308, 2019.
- [64] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," *Proceedings of the IEEE*, vol. 105, no. 12, pp. 2295–2329, 2017.
- [65] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. Iyengar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018.
- [66] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari, "A state-of-the-art survey on deep learning theory and architectures," *Electronics*, vol. 8, no. 3, p. 292, 2019.
- [67] I. Goodfellow, Y. Bengio, and A. Courville, *Deep learning*. MIT press, 2016.
- [68] W. L. Hamilton, R. Ying, and J. Leskovec, "Representation learning on graphs: Methods and applications," arXiv preprint arXiv:1709.05584, 2017.

- [69] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, "Graph neural networks: A review of methods and applications," arXiv preprint arXiv:1812.08434, 2018.
- [70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
- [71] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "Tensorflow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
- [72] E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, and Y. Wang, "Intel math kernel library," in *High-Performance Computing on the Intel® Xeon Phi™*. Springer, 2014, pp. 167–188.
- [73] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., "TVM: An automated end-to-end optimizing compiler for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
- [74] S. Dave, Y. Kim, S. Avancha, K. Lee, and A. Shrivastava, "Dmazerunner: Executing perfectly nested loops on dataflow accelerators," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1–27, 2019.
- [75] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., "Dadiannao: A machine-learning supercomputer," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014, pp. 609–622.
- [76] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "Shidiannao: Shifting vision processing closer to the sensor," in ACM SIGARCH Computer Architecture News, vol. 43, no. 3. ACM, 2015, pp. 92–104.
- [77] K. Khan, "Xilinx dnn processor (xdnn), accelerating ai in datacenters," https://www.xilinx.com/publications/events/developer-forum/2018-frankfurt/accelerating-ai-in-datacenters-xilinx-ml-suite.pdf, 2018.
- [78] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi et al., "A configurable cloud-scale dnn processor for real-time ai," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 1–14.
- [79] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded fpga platform for convolutional neural network," in *Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 2016, pp. 26–35.
- [80] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing fpga-based accelerator design for deep convolutional neural networks," in *Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 2015, pp. 161–170.
- [81] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks," in *Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 2016, pp. 16–25.
- [82] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, "Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks," in *Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 2017, pp. 45–54.
- [83] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in ACM SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 367–379.
- [84] J. Cong and J. Wang, "Polysa: polyhedral-based systolic array auto-compilation," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
- [85] K. E. Fleming, K. D. Glossop, and S. C. Steely, "Apparatus, methods, and systems with a configurable spatial accelerator," Oct. 15 2019, US Patent 10,445,250.
- [86] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, "14.2 dnpu: An 8.1 tops/w reconfigurable cnn-rnn processor for general-purpose deep neural networks," in 2017 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2017, pp. 240–241.
- [87] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, "Timeloop: A systematic approach to dnn accelerator evaluation," in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 304–315.
- [88] S. Dave, A. Shrivastava, Y. Kim, S. Avancha, and K. Lee, "dmazerunner: Optimizing convolutions on dataflow accelerators," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1544–1548.
- [89] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, "Understanding reuse, performance, and hardware cost of dnn dataflow: A data-centric approach," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 754–768.
- [90] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
- [91] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
- [92] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2017, pp. 4700–4708.
- [93] P. M. Gysel, "Ristretto: Hardware-oriented approximation of convolutional neural networks," Ph.D. dissertation, University of California, Davis, 2016.
- [94] S. Cao, L. Ma, W. Xiao, C. Zhang, Y. Liu, L. Zhang, L. Nie, and Z. Yang, "Seernet: Predicting convolutional neural network feature-map sparsity through low-bit quantization," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2019, pp. 11216–11225.
- [95] J. Li, S. Jiang, S. Gong, J. Wu, J. Yan, G. Yan, and X. Li, "Squeezeflow: A sparse cnn accelerator exploiting concise convolution rules," *IEEE Transactions on Computers*, vol. 68, no. 11, pp. 1663–1677, 2019.
- [96] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "Scnn: An accelerator for compressed-sparse convolutional neural networks," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 27–40.
- [97] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, "7.7 lnpu: A 25.3 tflops/w sparse deep-neural-network learning processor with fine-grained mixed precision of fp8-fp16," in 2019 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2019, pp. 142–144.
- [98] B. Akin, Z. A. Chishti, and A. R. Alameldeen, "Zcomp: Reducing dnn cross-layer memory footprint using vector extensions," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 126–138.
- [99] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in neural information processing systems, 1993, pp. 164–171.
- [100] J. A. Hertz, Introduction to the theory of neural computation. CRC Press, 2018.
- [101] H.-J. Kang, "Accelerator-aware pruning for convolutional neural networks," *IEEE Transactions on Circuits and Systems for Video Technology*, 2019.
- [102] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang et al., "Ese: Efficient speech recognition engine with sparse lstm on fpga," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 75–84.
- [103] M. Zhu and S. Gupta, "To prune, or not to prune: exploring the efficacy of pruning for model compression," arXiv preprint arXiv:1710.01878, 2017.
- [104] D. T. Vooturi, D. Mudigere, and S. Avancha, "Hierarchical block sparse neural networks," arXiv preprint arXiv:1808.03420, 2018.
- [105] G. Srivastava, D. Kadetotad, S. Yin, V. Berisha, C. Chakrabarti, and J.-s. Seo, "Joint optimization of quantization and structured sparsity for compressed deep neural networks," in *ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. IEEE, 2019, pp. 1393–1397.
- [106] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, "Scalpel: Customizing dnn pruning to the underlying hardware parallelism," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 548–560, 2017.
- [107] B. Asgari, R. Hadidi, H. Kim, and S. Yalamanchili, "Eridanus: Efficiently running inference of dnns using systolic arrays," *IEEE Micro*, vol. 39, no. 5, pp. 46–54, 2019.
- [108] H. Kung, B. McDanel, and S. Q. Zhang, "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization," in *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems*. ACM, 2019, pp. 821–834.
- [109] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-neuron-free deep neural network computing," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 1–13, 2016.
- [110] D. Kim, J. Ahn, and S. Yoo, "Zena: Zero-aware neural network accelerator," *IEEE Design & Test*, vol. 35, no. 1, pp. 39–46, 2017.
- [111] Q. Yang, J. Mao, Z. Wang, and H. Li, "Dasnet: Dynamic activation sparsity for neural network efficiency improvement," arXiv preprint arXiv:1909.06964, 2019.
- [112] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 267–278.
- [113] G. Georgiadis, "Accelerating convolutional neural networks via activation map compression," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2019, pp. 7085–7095.
- [114] S. Shi and X. Chu, "Speeding up convolutional neural networks by exploiting the sparsity of rectifier units," arXiv preprint arXiv:1704.07724, 2017.
- [115] U. Gupta, B. Reagen, L. Pentecost, M. Donato, T. Tambe, A. M. Rush, G.-Y. Wei, and D. Brooks, "Masr: A modular accelerator for sparse rnns," in 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 2019, pp. 1–14.
- [116] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer, "Shift: A zero flop, zero parameter alternative to spatial convolutions," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2018, pp. 9127–9135.
- [117] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
- [118] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
- [119] A. Yazdanbakhsh, K. Samadi, N. S. Kim, and H. Esmaeilzadeh, "Ganax: A unified mimd-simd acceleration for generative adversarial networks," in *Proceedings of the 45th Annual International Symposium on Computer Architecture*. IEEE Press, 2018, pp. 650–661.
- [120] X. F. Xiao Dong, Lei Liu, "Acorns: A framework for accelerating deep neural networks with input sparsity," in *Proceedings of the 2019 International Conference on Parallel Architecture and Compilation (PACT)*, ser. PACT '19. Seattle, WA, USA: IEEE Computer Society, 2019.
- [121] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun, "Sbnet: Sparse blocks network for fast inference," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2018, pp. 8711–8720.
- [122] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, "Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1355–1361.
- [123] M. Yan, L. Deng, X. Hu, L. Liang, Y. Feng, X. Ye, Z. Zhang, D. Fan, and Y. Xie, "Hygcn: A gcn accelerator with hybrid architecture," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 15–29.
- [124] T. Geng, A. Li, T. Wang, C. Wu, Y. Li, A. Tumeo, and M. Herbordt, "Uwb-gcn: Hardware acceleration of graph-convolution-network through runtime workload rebalancing," arXiv preprint arXiv:1908.10834, 2019.
- [125] M. Horowitz, "1.1 computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014, pp. 10–14.
- [126] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2017, pp. 1492–1500.
- [127] X. Zhang, X. Zhou, M. Lin, and J. Sun, "Shufflenet: An extremely efficient convolutional neural network for mobile devices," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2018, pp. 6848–6856.
- [128] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
- [129] I. Sobel and G. Feldman, "A 3x3 isotropic gradient operator for image processing," a talk at the Stanford Artificial Project in, pp. 271–272, 1968.

- [130] J. Jin, A. Dundar, and E. Culurciello, "Flattened convolutional neural networks for feedforward acceleration," arXiv preprint arXiv:1412.5474, 2014.
- [131] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2016, pp. 2818–2826.
- [132] J. J. Zhang, P. Raj, S. Zarar, A. Ambardekar, and S. Garg, "Compact: On-chip compression of activations for low power systolic array based cnn acceleration," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, p. 47, 2019.
- [133] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. Fletcher, "Ucnn: Exploiting computational reuse in deep neural networks via weight repetition," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 674–687.
- [134] M. Riera, J.-M. Arnau, and A. González, "Computation reuse in dnns by exploiting input similarity," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 57–68.
- [135] M. Imani, M. S. Razlighi, Y. Kim, S. Gupta, F. Koushanfar, and T. Rosing, "Deep learning acceleration with neuron-to-memory transformation," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 1–14.
- [136] P. Warden, "Why are eight bits enough for deep neural networks?" https://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/, 2015.
- [137] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen et al., "A study of bfloat16 for deep learning training," arXiv preprint arXiv:1905.12322, 2019.
- [138] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in *European Conference on Computer Vision*. Springer, 2016, pp. 525–542.
- [139] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
- [140] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," *The Journal of Machine Learning Research*, vol. 18, no. 1, pp. 6869–6898, 2017.
- [141] E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, "Lognet: Energy-efficient neural networks using logarithmic computation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5900–5904.
- [142] X. S. Hu, R. Ernst, P. Eles, G. Heiser, K. Keutzer, D. Kim, and T. Tohdo, "Roundtable: Machine learning for embedded systems: Hype or lasting impact?" *IEEE Design & Test*, vol. 35, no. 6, pp. 86–93, 2018.
- [143] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in *ACM Sigplan Notices*, vol. 49, no. 4. ACM, 2014, pp. 269–284.
- [144] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, "Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 58–70.
- [145] S. Zheng, Y. Liu, S. Yin, L. Liu, and S. Wei, "An efficient kernel transformation architecture for binary- and ternary-weight neural network inference," in *Proceedings of the 55th Annual Design Automation Conference*. ACM, 2018, p. 137.
- [146] P. Judd, A. Delmas, S. Sharify, and A. Moshovos, "Cnvlutin2: Ineffectual-activation-and-weight-free deep neural network computing," arXiv preprint arXiv:1705.00125, 2017.
- [147] R. Struharik, B. Vukobratović, A. Erdeljan, and D. Rakanović, "Conna–compressed cnn hardware accelerator," in 2018 21st Euromicro Conference on Digital System Design (DSD). IEEE, 2018, pp. 365–372.
- [148] J.-F. Zhang, C.-E. Lee, C. Liu, Y. S. Shao, S. W. Keckler, and Z. Zhang, "Snap: A 1.67–21.55 tops/w sparse neural acceleration processor for unstructured sparse deep neural network inference in 16nm cmos," in 2019 Symposium on VLSI Circuits. IEEE, 2019, pp. C306–C307.
- [149] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 envision: A 0.26-to-10tops/w subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28nm fdsoi," in 2017 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2017, pp. 246–247.
- [150] W. Liu, J. Lin, and Z. Wang, "Usca: A unified systolic convolution array architecture for accelerating sparse neural network," in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
- [151] A. Gondimalla, N. Chesnut, M. Thottethodi, and T. Vijaykumar, "Sparten: A sparse tensor accelerator for convolutional neural networks," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*. ACM, 2019, pp. 151–165.
- [152] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, "Extensor: An accelerator for sparse tensor algebra," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*. ACM, 2019, pp. 319–333.
- [153] Z. Yuan, J. Yue, H. Yang, Z. Wang, J. Li, Y. Yang, Q. Guo, X. Li, M.-F. Chang, H. Yang et al., "Sticker: A 0.41-62.1 tops/w 8bit neural network processor with multi-sparsity compatible convolution arrays and online tuning acceleration for fully connected layers," in 2018 IEEE Symposium on VLSI Circuits. IEEE, 2018, pp. 33–34.
- [154] N. Corporation, "Nvidia deep learning accelerator (nvdla)," http://nvdla.org, accessed: 2018-11-05.
- [155] S. Chole, R. Tadishetti, and S. Reddy, "Sparsecore: An accelerator for structurally sparse cnns," in SysML Conference, 2018.
- [156] L. Yavits and R. Ginosar, "Accelerator for sparse machine learning," IEEE Computer Architecture Letters, vol. 17, no. 1, pp. 21–24, 2017.
- [157] A. Aimar, H. Mostafa, E. Calabrese, A. Rios-Navarro, R. Tapiador-Morales, I.-A. Lungu, M. B. Milde, F. Corradi, A. Linares-Barranco, S.-C. Liu et al., "Nullhop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," *IEEE transactions on neural networks and learning systems*, vol. 30, no. 3, pp. 644–656, 2018.
- [158] L. Lu, J. Xie, R. Huang, J. Zhang, W. Lin, and Y. Liang, "An efficient hardware accelerator for sparse convolutional neural networks on fpgas," in 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp. 17–25.
- [159] A. Page, A. Jafari, C. Shea, and T. Mohsenin, "Sparcnet: A hardware accelerator for efficient deployment of sparse convolutional networks," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, pp. 1–32, 2017.
- [160] L. Lu and Y. Liang, "Spwa: an efficient sparse winograd convolutional neural networks accelerator on fpgas," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1–6.
- [161] C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, "Deltarnn: A power-efficient recurrent neural network accelerator," in *Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2018, pp. 21–30.
- [162] Y. H. Kim, G. J. An, and M. H. Sunwoo, "Casa: A convolution accelerator using skip algorithm for deep neural network," in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
- [163] H. Kung, B. McDanel, and S. Q. Zhang, "Adaptive tiling: Applying fixed-size systolic arrays to sparse convolutional neural networks," in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 1006–1011.
- [164] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, S. Zheng, T. Lu, J. Gu, L. Liu, and S. Wei, "A high energy efficient reconfigurable hybrid neural network processor for deep learning applications," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 4, pp. 968–982, 2017.
- [165] C.-E. Lee, Y. S. Shao, J.-F. Zhang, A. Parashar, J. Emer, S. W. Keckler, and Z. Zhang, "Stitch-x: An accelerator architecture for exploiting unstructured sparsity in deep neural networks," in *SysML Conference*, 2018.
- [166] H. Jang, J. Kim, J.-E. Jo, J. Lee, and J. Kim, "Mnnfast: a fast and scalable system architecture for memory-augmented neural networks," in *Proceedings of the 46th International Symposium on Computer Architecture*, 2019, pp. 250–263.
- [167] B. McDanel, S. Q. Zhang, H. Kung, and X. Dong, "Full-stack optimization for accelerating cnns using powers-of-two weights with fpga validation," in *Proceedings of the ACM International Conference on Supercomputing*, 2019, pp. 449–460.
- [168] P. N. Whatmough, S. K. Lee, D. Brooks, and G.-Y. Wei, "Dnn engine: A 28-nm timing-error tolerant sparse deep neural network processor for iot applications," *IEEE Journal of Solid-State Circuits*, vol. 53, no. 9, pp. 2722–2731, 2018.

- [169] B. W. Bader and T. G. Kolda, "Efficient matlab computations with sparse and factored tensors," SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205–231, 2008.
- [170] PyTorch, "Sparse tensors in torch," https://pytorch.org/docs/stable/sparse.html, 2019.
- [171] S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis. (2017) Frostt file format. [Online]. Available: http://frostt.io/tensors/file-formats.html
- [172] N. I. of Standards and Technology, "Matrix market exchange formats," https://math.nist.gov/MatrixMarket/formats.html, 2013.
- [173] S. Chou, F. Kjolstad, and S. Amarasinghe, "Format abstraction for sparse tensor algebra compilers," Proceedings of the ACM on Programming Languages, vol. 2, no. OOPSLA, pp. 1–30, 2018.
- [174] E. L. Hauck, "Data compression using run length encoding and statistical encoding," Dec. 2 1986, uS Patent 4,626,829.
- [175] A. Moffat and J. Zobel, "Parameterised compression for sparse bitmaps," in Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, 1992, pp. 274-285.
- [176] E. Jones, T. Oliphant, and P. Peterson, "Scipy: Open source scientific tools for python," 2001.
- [177] F. G. Gustavson, "Some basic techniques for solving sparse systems of linear equations," in Sparse matrices and their applications. Springer, 1972, pp. 41-52.
- [178] I. S. Duff, A. M. Erisman, and J. K. Reid, Direct methods for sparse matrices. Clarendon Press, Oxford, 1986.
- [179] Y. Saad, "Sparskit: A basic tool kit for sparse matrix computations,"
- [180] R. W. Vuduc and J. W. Demmel, Automatic performance tuning of sparse matrix kernels. University of California, Berkeley Berkeley, CA, 2003, vol. 1.
- [181] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar et al., "Tensor2tensor for neural machine translation," in Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), 2018, pp. 193-199.
- [182] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2015, pp. 3128-3137.
- [183] S. Narang and G. Diamos, "Deepbench," 2016.
- [184] A. Buluc and J. R. Gilbert, "On the representation and multiplication of hypersparse matrices," in 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 2008, pp. 1-11.
- [185] E.-J. Im and K. Yelick, "Model-based memory hierarchy optimizations for sparse matrices," in Workshop on Profile and Feedback-Directed Compilation, vol. 139, 1998.
- [186] S. Smith and G. Karypis, "Tensor-matrix products with a compressed sparse tensor," in Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, 2015, pp. 1–7.
- [187] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, "Faster cnns with direct sparse convolutions and guided pruning," arXiv preprint arXiv:1608.01409, 2016.
- [188] M. Buckler, P. Bedoukian, S. Jayasuriya, and A. Sampson, "Eva<sup>2</sup>: Exploiting temporal redundancy in live computer vision," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 533-546.
- [189] K. Guo, S. Han, S. Yao, Y. Wang, Y. Xie, and H. Yang, "Software-hardware codesign for efficient neural network acceleration," IEEE Micro, vol. 37, no. 2, pp. 18-25, 2017.
- [190] A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks," in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, pp. 233-244.
- [191] C.-C. Chang and C.-J. Lin, "Libsvm: A library for support vector machines," ACM transactions on intelligent systems and technology (TIST), vol. 2, no. 3, pp. 1-27, 2011.
- [192] D. R. Kincaid and D. M. Young, "The itpack project: Past, present, and future," in *Elliptic Problem Solvers*. Elsevier, 1984, pp. 53–63.
- [193] J. King, T. Gilray, R. M. Kirby, and M. Might, "Dynamic sparse-matrix allocation on gpus," in International Conference on High Performance Computing. Springer, 2016, pp. 61-80.
- [194] Y. Saad, Iterative methods for sparse linear systems. SIAM, 2003.
- [195] J. Willcock and A. Lumsdaine, "Accelerating sparse matrix computations via data compression," in Proceedings of the 20th annual international conference on Supercomputing, 2006, pp. 307-316.

- [196] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin, "Efficient and scalable computations with sparse tensors," in 2012 IEEE Conference on High Performance Extreme Computing. IEEE, 2012, pp. 1-6.
- [197] P. A. Tew, "An investigation of sparse tensor formats for tensor libraries," Ph.D. dissertation, Massachusetts Institute of Technology, 2016.
- [198] C. Hong, A. Sukumaran-Rajam, I. Nisa, K. Singh, and P. Sadayappan, "Adaptive sparse tiling for sparse matrix multiplication," in Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. ACM, 2019, pp. 300-314.
- [199] B. W. Bader, T. G. Kolda et al., "Matlab tensor toolbox version 2.6," Available online, February 2015. [Online]. Available: http://www.sandia.gov/~tgkolda/TensorToolbox/
- [200] NVIDIA. cusparse, the cuda sparse matrix library. [Online]. Available: http://docs.nvidia.com/cuda/cusparse
- [201] Z. Yuan, Y. Liu, J. Yue, Y. Yang, J. Wang, X. Feng, J. Zhao, X. Li, and H. Yang, "Sticker: An energy-efficient multi-sparsity compatible accelerator for convolutional neural networks in 65-nm cmos," IEEE Journal of Solid-State Circuits, 2019.
- [202] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., "Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 395-408.
- [203] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, "Clstm: Enabling efficient lstm using structured compression techniques on fpgas," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018, pp. 11-20.
- [204] M. Yan, X. Hu, S. Li, A. Basak, H. Li, X. Ma, I. Akgun, Y. Feng, P. Gu, L. Deng et al., "Alleviating irregularity in graph analytics acceleration: a hardware/software co-design approach," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 615-628.
- [205] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 553-564.
- [206] H. Kwon, A. Samajdar, and T. Krishna, "Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects," ACM SIGPLAN Notices, vol. 53, no. 2, pp. 461-475, 2018.
- [207] K. Hegde, R. Agrawal, Y. Yao, and C. W. Fletcher, "Morph: Flexible acceleration for 3d cnn-based video understanding," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 933-946.
- [208] A. Azizimazreah and L. Chen, "Shortcut mining: exploiting cross-layer shortcut reuse in dcnn accelerators," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 94-105.
- [209] Y. Shen, M. Ferdman, and P. Milder, "Maximizing cnn accelerator efficiency through resource partitioning," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 535-547.
- [210] Y. Kim, J. Lee, A. Shrivastava, J. W. Yoon, D. Cho, and Y. Paek, "High throughput data mapping for coarse-grained reconfigurable architectures," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 11, pp. 1599-1609, 2011.
- [211] N. H. Weste and D. Harris, CMOS VLSI design: a circuits and systems perspective. Pearson Education India, 2015.
- [212] Z. Zhao, Y. Liu, W. Sheng, T. Krishna, Q. Wang, and Z. Mao, "Optimizing the data placement and transformation for multi-bank cgra computing system," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 1087-1092.
- [213] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer cnn accelerators," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1-12.
- [214] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "Tetris: Scalable and efficient neural network acceleration with 3d memory," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017, pp. 751-764.
- [215] H. Kwon, A. Samajdar, and T. Krishna, "Rethinking nocs for spatial neural network accelerators," in 2017 Eleventh IEEE/ACM International Symposium on Networks-on-Chip (NOCS). IEEE, 2017, pp. 1-8.
- [216] R. Das and T. Krishna, "Dnn accelerator architecture simd or systolic?" https://www.sigarch.org/dnn-accelerator-architecture-simd-or-systolic, September 2018, Computer Architecture Today.
- [217] D. Vainbrand and R. Ginosar, "Network-on-chip architectures for neural networks," in 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip. IEEE, 2010, pp. 135–144.
- [218] P. Xu, X. Zhang, C. Hao, Y. Zhao, Y. Zhang, Y. Wang, C. Li, Z. Guan, D. Chen, and Y. Lin, "Autodnnchip: An automated dnn chip predictor and builder for both fpgas and asics," arXiv preprint arXiv:2001.03535, 2020.
- [219] S. Arora, T. Leighton, and B. Maggs, "On-line algorithms for path selection in a nonblocking network," in *Proceedings of the twenty-second annual ACM symposium on Theory of computing*, 1990, pp. 149–158.
- [220] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "Yodann: An architecture for ultralow power binary-weight cnn acceleration," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 37, no. 1, pp. 48–60, 2017.
- [221] H. Tann, S. Hashemi, R. I. Bahar, and S. Reda, "Hardware-software codesign of accurate, multiplier-free deep neural networks," in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2017, pp. 1–6.
- [222] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, "A 7.663-tops 8.2-w energy-efficient fpga accelerator for binary convolutional neural networks," in *Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2017, pp. 290–291.
- [223] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," arXiv preprint arXiv:1603.01025, 2016.
- [224] S. Vogel, M. Liang, A. Guntoro, W. Stechele, and G. Ascheid, "Efficient hardware acceleration of cnns using logarithmic data representation with arbitrary log-base," in *Proceedings of the International Conference on Computer-Aided Design*, 2018, pp. 1–8.
- [225] D. A. Gudovskiy and L. Rigazio, "Shiftcnn: Generalized low-precision architecture for inference of convolutional neural networks," arXiv preprint arXiv:1706.02393, 2017.
- [226] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.
- [227] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O'Leary, R. Genov, and A. Moshovos, "Bit-pragmatic deep neural network computing," in *Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture*, 2017, pp. 382–394.
- [228] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos, "Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1–6.
- [229] S. Sharify, A. D. Lascorz, M. Mahmoud, M. Nikolic, K. Siu, D. M. Stuart, Z. Poulos, and A. Moshovos, "Laconic deep learning inference acceleration," in *Proceedings of the 46th International Symposium on Computer Architecture*. ACM, 2019, pp. 304–317.
- [230] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "Unpu: An energy-efficient deep neural network accelerator with fully variable weight bit precision," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 173–185, 2018.
- [231] A. Delmas Lascorz, P. Judd, D. M. Stuart, Z. Poulos, M. Mahmoud, S. Sharify, M. Nikolic, K. Siu, and A. Moshovos, "Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks," in *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems*. ACM, 2019, pp. 749–763.
- [232] V. Dadu, J. Weng, S. Liu, and T. Nowatzki, "Towards general purpose acceleration by exploiting common data-dependence forms," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 924–939.
- [233] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to fpgas," in *The 49th Annual IEEE/ACM International Symposium on Microarchitecture*. IEEE Press, 2016, p. 17.
- [234] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," in *Proceedings of the 45th Annual International Symposium on Computer Architecture*. IEEE Press, 2018, pp. 764–775.
- [235] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "Hypar: Towards hybrid parallelism for deep learning accelerator array," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 56–68.
- [236] W. Jiang, E. H.-M. Sha, X. Zhang, L. Yang, Q. Zhuge, Y. Shi, and J. Hu, "Achieving super-linear speedup across multi-fpga for real-time dnn inference," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1–23, 2019.
- [237] M. Mahmoud, K. Siu, and A. Moshovos, "Diffy: A déjà vu-free differential deep neural network accelerator," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 134–147.
- [238] L. R. Gonçalves, R. F. D. Moura, and L. Carro, "Aggressive energy reduction for video inference with software-only strategies," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1–20, 2019.
- [239] F. Silfa, G. Dot, J.-M. Arnau, and A. Gonzàlez, "Neuron-level fuzzy memoization in rnns," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 782–793.
- [240] Y. Zhu, A. Samajdar, M. Mattina, and P. Whatmough, "Euphrates: Algorithm-soc co-design for low-power mobile continuous vision," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 547–560.
- [241] H. Mahdiani, A. Khadem, A. Ghanbari, M. Modarressi, F. Fattahi, and M. Daneshtalab, "δnn: Power-efficient neural network acceleration using differential weights," *IEEE Micro*, 2019.
- [242] Y. Wang, S. Liang, H. Li, and X. Li, "A none-sparse inference accelerator that distills and reuses the computation redundancy in cnns," in *Proceedings of the 56th Annual Design Automation Conference* 2019. ACM, 2019, p. 202.
- [243] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, "Snapea: Predictive early activation for reducing computation in deep convolutional neural networks," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 662–673.
- [244] M. Song, J. Zhao, Y. Hu, J. Zhang, and T. Li, "Prediction based execution on deep neural networks," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 752–763.
- [245] J. Zhu, J. Jiang, X. Chen, and C.-Y. Tsui, "Sparsenn: An energy-efficient neural network accelerator exploiting input and output sparsity," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 241–244.
- [246] D. Lee, S. Kang, and K. Choi, "Compend: Computation pruning through early negative detection for relu in a deep neural network accelerator," in *Proceedings of the 2018 International Conference on Supercomputing*, 2018, pp. 139–148.
- [247] Y. Miao, M. Gowayyed, and F. Metze, "Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 167–174.
- [248] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
- [249] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in *International conference on machine learning*, 2016, pp. 173–182.
- [250] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke, "Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2017, pp. 5296–5305.
- [251] A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," arXiv preprint arXiv:1404.5997, 2014.
- [252] N. Zmora, G. Jacob, L. Zlotnik, B. Elharar, and G. Novik, "Neural network distiller: A python package for dnn compression research," arXiv preprint arXiv:1910.12232, 2019.
- [253] R. Krashinsky, O. Giroux, S. Jones, N. Stam, and S. Ramaswamy, "Nvidia ampere architecture in-depth," https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/, 2020.
- [254] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," 2006.
- [255] Intel. Understanding memory formats, intel mkl-dnn. Accessed: 2020-03-03. [Online]. Available: https://intel.github.io/mkl-dnn/understanding\_memory\_formats.html
- [256] NVIDIA. (2019) Deep learning performance, user guide. [Online]. Available: https://docs.nvidia.com/deeplearning/sdk/pdf/Deep-Learning-Performance-Guide.pdf

- [257] Tensorflow. (2019) 2d-convolution operator in tensorflow. [Online]. Available: https://www.tensorflow.org/api\_docs/python/tf/nn/conv2d
- [258] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in *Proceedings of the 22nd ACM international conference on Multimedia*. ACM, 2014, pp. 675–678.
- [259] A. Vasudevan, A. Anderson, and D. Gregg, "Parallel multi channel convolution using general matrix multiplication," in 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2017, pp. 19–24.
- [260] R. Baghdadi, J. Ray, M. B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, and S. Amarasinghe, "Tiramisu: A polyhedral compiler for expressing fast and portable code," in Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization. IEEE Press, 2019, pp. 193–205.
- [261] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand, "Decoupling algorithms from schedules for easy optimization of image processing pipelines," ACM Transactions on Graphics (TOG), vol. 31, no. 4, pp. 1–12, 2012.
- [262] C. Lattner and J. Pienaar, "Mlir primer: A compiler infrastructure for the end of moore's law," 2019.
- [263] S. Verdoolaege, "isl: An integer set library for the polyhedral model," in *International Congress on Mathematical Software*. Springer, 2010, pp. 299–302.
- [264] P. Feautrier and C. Lengauer, "The polyhedron model," in *Encyclopedia of Parallel Computing*, D. Padua, Ed. Springer, 2011, pp. 1581–1592.
- [265] R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema *et al.*, "Pencil: A platform-neutral compute intermediate language for accelerator programming," in 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE, 2015, pp. 138–149.
- [266] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, "Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions," arXiv preprint arXiv:1802.04730, 2018.
- [267] V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and V. Grover, "Diesel: Dsl for linear algebra and neural net computations on gpus," in *Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages*, 2018, pp. 42–51.
- [268] C. Leary and T. Wang, "Xla: Tensorflow, compiled," *TensorFlow Dev Summit*, 2017.
- [269] C. Lattner, J. Pienaar, M. Amini, U. Bondhugula, R. Riddle, A. Cohen, T. Shpeisman, A. Davis, N. Vasilache, and O. Zinenko, "Mlir: A compiler infrastructure for the end of moore's law," 2020.
- [270] R. Baghdadi, A. Cohen, T. Grosser, S. Verdoolaege, A. Lokhmotov, J. Absar, S. van Haastregt, A. Kravets, and A. F. Donaldson, "PENCIL language specification," INRIA, Research Rep. RR-8706, 2015. [Online]. Available: https://hal.inria.fr/hal-01154812
- [271] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, "A practical automatic polyhedral parallelizer and locality optimizer," in *PLDI*, 2008, pp. 101–113.
- [272] T. Grosser, A. Groslinger, and C. Lengauer, "Polly performing polyhedral optimizations on a low-level intermediate representation," *Parallel Processing Letters*, vol. 22, no. 4, 2012. [Online]. Available: http://dblp.uni-trier.de/db/journals/ppl/ppl22.html#GrosserGL12
- [273] R. T. Mullapudi, V. Vasista, and U. Bondhugula, "Polymage: Automatic optimization for image processing pipelines," in *Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems*, 2015, pp. 429–443.
- [274] T. Yuki, G. Gupta, D. Kim, T. Pathan, and S. Rajopadhye, "Alphaz: A system for design space exploration in the polyhedral model," in *International Workshop on Languages and Compilers for Parallel Computing*. Springer, 2012, pp. 17–31.

- [275] C. Chen, J. Chame, and M. Hall, "Chill: A framework for composing high-level loop transformations," U. of Southern California, Tech. Rep. 08-897, 2008.
- [276] M. Hall, J. Chame, C. Chen, J. Shin, G. Rudy, and M. M. Khan, Loop Transformation Recipes for Code Generation and Auto-Tuning. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 50–64.
- [277] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam, "Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies," *International Journal of Parallel Programming*, vol. 34, no. 3, pp. 261–317, 2006.
- [278] M.-W. Benabderrahmane, L.-N. Pouchet, A. Cohen, and C. Bastoul, "The polyhedral model is more widely applicable than you think," in Proceedings of the 19th Joint European Conference on Theory and Practice of Software, International Conference on Compiler Construction, ser. CC'10/ETAPS'10. Springer-Verlag, 2010.
- [279] A. Hartono, M. M. Baskaran, C. Bastoul, A. Cohen, S. Krishnamoorthy, B. Norris, J. Ramanujam, and P. Sadayappan, "Parametric multilevel tiling of imperfectly nested loops," in *Proceedings of the 23rd international conference on Supercomputing*. ACM, 2009, pp. 147–157.
- [280] R. Baghdadi and A. Cohen, "Scalable polyhedral compilation, syntax vs. semantics: 1–0 in the first round," in *IMPACT 2020 workshop* (associated with HIPEAC 2020), 2020, informal proceedings.
- [281] R. Wei, L. Schwartz, and V. Adve, "Dlvm: A modern compiler infrastructure for deep learning systems," arXiv preprint arXiv:1711.03016, 2017.
- [282] L. Truong, R. Barik, E. Totoni, H. Liu, C. Markley, A. Fox, and T. Shpeisman, "Latte: a language, compiler, and runtime for elegant and efficient deep neural networks," in *Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation*, 2016, pp. 209–223.
- [283] P. Feautrier, "Dataflow analysis of array and scalar references," *International Journal of Parallel Programming*, vol. 20, no. 1, pp. 23–53, 1991.
- [284] M. E. Wolf, "Improving locality and parallelism in nested loops," Ph.D. dissertation, Department of Computer Science, Stanford University, 1992.
- [285] Y. Hu, T.-M. Li, L. Anderson, J. Ragan-Kelley, and F. Durand, "Taichi: a language for high-performance computation on spatially sparse data structures," ACM Transactions on Graphics (TOG), vol. 38, no. 6, pp. 1–16, 2019.
- [286] F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe, "The tensor algebra compiler," *Proceedings of the ACM on Programming Languages*, vol. 1, no. OOPSLA, pp. 1–29, 2017.
- [287] K. Goto and R. A. v. d. Geijn, "Anatomy of high-performance matrix multiplication," ACM Transactions on Mathematical Software (TOMS), vol. 34, no. 3, pp. 1–25, 2008.
- [288] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. Amarasinghe, "Opentuner: An extensible framework for program autotuning," in *International Conference on Parallel Architectures and Compilation Techniques*, Edmonton, Canada, August 2014.
- [289] K. Trifunovic, D. Nuzman, A. Cohen, A. Zaks, and I. Rosen, "Polyhedral-model guided loop-nest auto-vectorization," in 2009 18th International Conference on Parallel Architectures and Compilation Techniques. IEEE, 2009, pp. 327–337.
- [290] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. O'Boyle, J. Thomson, M. Toussaint, and C. K. Williams, "Using machine learning to focus iterative optimization," in *International Symposium on Code Generation and Optimization (CGO'06)*. IEEE, 2006, 11 pp.
- [291] A. Adams, K. Ma, L. Anderson, R. Baghdadi, T.-M. Li, M. Gharbi, B. Steiner, S. Johnson, K. Fatahalian, F. Durand *et al.*, "Learning to optimize halide with tree search and random programs," *ACM Transactions on Graphics (TOG)*, vol. 38, no. 4, pp. 1–12, 2019.
- [292] C. Mendis, A. Renda, S. Amarasinghe, and M. Carbin, "Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks," in *International Conference on Machine Learning*, 2019, pp. 4505–4515.
- [293] C. Lattner and V. Adve, "Llvm: A compilation framework for lifelong program analysis & transformation," in *International Symposium on Code Generation and Optimization*, 2004. CGO 2004. IEEE, 2004, pp. 75–86.
- [294] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey et al., "Scaledeep: A scalable compute architecture for learning and evaluating deep networks," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 13–26, 2017.
- [295] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An instruction set architecture for neural networks," in ACM SIGARCH Computer Architecture News, vol. 44, no. 3. IEEE Press, 2016, pp. 393–405.
- [296] J. R. Stevens, A. Ranjan, D. Das, B. Kaul, and A. Raghunathan, "Manna: An accelerator for memory-augmented neural networks," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 794–806.
- [297] S. Gopinath, N. Ghanathe, V. Seshadri, and R. Sharma, "Compiling kb-sized machine learning models to tiny iot devices," in *Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation*, 2019, pp. 79–95.

- [298] Y. Chen, H. Lan, Z. Du, S. Liu, J. Tao, D. Han, T. Luo, Q. Guo, L. Li, Y. Xie et al., "An instruction set architecture for machine learning," ACM Transactions on Computer Systems (TOCS), vol. 36, no. 3, pp. 1–35, 2019.
- [299] T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin et al., "A hardware-software blueprint for flexible deep learning specialization," arXiv preprint arXiv:1807.04188, 2018.
- [300] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
- [301] F. Tung and G. Mori, "Clip-q: Deep network compression learning by in-parallel pruning-quantization," in *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 2018, pp. 7873–7882.
- [302] H. Yang, S. Gui, Y. Zhu, and J. Liu, "Automatic neural network compression by sparsity-quantization joint learning: A constrained optimization-based approach," 2019.
- [303] K. Kwon, A. Amid, A. Gholami, B. Wu, K. Asanovic, and K. Keutzer, "Co-design of deep neural nets and neural net accelerators for embedded vision applications," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1–6.
- [304] C. Hao, X. Zhang, Y. Li, S. Huang, J. Xiong, K. Rupnow, W.-m. Hwu, and D. Chen, "Fpga/dnn co-design: An efficient design methodology for iot intelligence on the edge," in 2019 56th ACM/IEEE Design Automation Conference (DAC). IEEE, 2019, pp. 1–6.
- [305] D. Marculescu, D. Stamoulis, and E. Cai, "Hardware-aware machine learning: Modeling and optimization," in *Proceedings of the International Conference on Computer-Aided Design*, 2018, pp. 1–8.
- [306] X. Zhang, W. Jiang, Y. Shi, and J. Hu, "When neural architecture search meets hardware implementation: from hardware awareness to co-design," in 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2019, pp. 25–30.
- [307] M. S. Abdelfattah, Ł. Dudziak, T. Chau, R. Lee, H. Kim, and N. D. Lane, "Best of both worlds: Automl codesign of a cnn and its hardware accelerator," arXiv preprint arXiv:2002.05022, 2020.
- [308] H. Ji, L. Song, L. Jiang, H. H. Li, and Y. Chen, "Recom: An efficient resistive accelerator for compressed deep neural networks," in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 237–240.
- [309] P. Wang, Y. Ji, C. Hong, Y. Lyu, D. Wang, and Y. Xie, "Snrram: an efficient sparse neural network computation architecture based on resistive random-access memory," in *Proceedings of the 55th Annual Design Automation Conference*. ACM, 2018, p. 106.
- [310] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, "Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for fpgas," in *Proceedings of the International Conference on Computer-Aided Design*. ACM, 2018, p. 56.
- [311] Y.-H. Lai, Y. Chi, Y. Hu, J. Wang, C. H. Yu, Y. Zhou, J. Cong, and Z. Zhang, "Heterocl: A multi-paradigm programming infrastructure for software-defined reconfigurable computing," in *Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 2019, pp. 242–251.
- [312] R. Venkatesan, Y. S. Shao, M. Wang, J. Clemons, S. Dai, M. Fojtik, B. Keller, A. Klinefelter, N. Pinckney, P. Raina et al., "Magnet: A modular accelerator generator for neural networks," in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019, pp. 1–8.
- [313] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, "Chisel: constructing hardware in a scala embedded language," in *DAC Design Automation Conference* 2012. IEEE, 2012, pp. 1212–1221.
- [314] A. Sharifian, R. Hojabr, N. Rahimi, S. Liu, A. Guha, T. Nowatzki, and A. Shriraman, "µir-an intermediate representation for transforming and optimizing the microarchitecture of application accelerators," in *Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture*, 2019, pp. 940–953.
- [315] E. Wang, J. J. Davis, R. Zhao, H.-C. Ng, X. Niu, W. Luk, P. Y. Cheung, and G. A. Constantinides, "Deep neural network approximation for custom hardware: Where we've been, where we're going," ACM Computing Surveys (CSUR), vol. 52, no. 2, p. 40, 2019.
- [316] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, B. C. Van Esesn, A. A. S. Awwal, and V. K. Asari, "The history began from alexnet: A comprehensive survey on deep learning approaches," arXiv preprint arXiv:1803.01164, 2018.

- [317] S. Mittal, "A survey of fpga-based accelerators for convolutional neural networks," *Neural computing and applications*, pp. 1–31, 2018.
- [318] S. I. Venieris, A. Kouris, and C.-S. Bouganis, "Toolflows for mapping convolutional neural networks on fpgas: A survey and future directions," ACM Computing Surveys (CSUR), vol. 51, no. 3, pp. 1–39, 2018
- [319] B. Du, Q. Guo, Y. Zhao, T. Zhi, Y. Chen, and Z. Xu, "Self-aware neural network systems: A survey and new perspective," *Proceedings of the IEEE*, 2020.
- [320] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, J. Yan, and X. Li, "Tnpu: An efficient accelerator architecture for training convolutional neural networks," in *Proceedings of the 24th Asia and South Pacific Design Automation Conference*, 2019, pp. 450–455.
- [321] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, "Graphicionado: A high-performance and energy-efficient accelerator for graph analytics," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–13.
- [322] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, "A scalable processing-in-memory accelerator for parallel graph processing," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015, pp. 105–117.
- [323] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, "Q100: the architecture and design of a database processing unit," in Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, 2014, pp. 255–268.
- [324] Y. Turakhia, G. Bejerano, and W. J. Dally, "Darwin: A genomics coprocessor provides up to 15,000 x acceleration on long read assembly," in *Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems*, 2018, pp. 199–213.
- [325] D. Fujiki, A. Subramaniyan, T. Zhang, Y. Zeng, R. Das, D. Blaauw, and S. Narayanasamy, "Genax: A genome sequencing accelerator," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2018, pp. 69–82.
- [326] J. Fowers, J.-Y. Kim, D. Burger, and S. Hauck, "A scalable high-bandwidth architecture for lossless compression on fpgas," in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 2015, pp. 52–59.
- [327] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai et al., "Recent advances in convolutional neural networks," *Pattern Recognition*, vol. 77, pp. 354–377, 2018.
- [328] W. Rawat and Z. Wang, "Deep convolutional neural networks for image classification: A comprehensive review," *Neural computation*, vol. 29, no. 9, pp. 2352–2449, 2017.
- [329] K. Ota, M. S. Dao, V. Mezaris, and F. G. De Natale, "Deep learning for mobile multimedia: A survey," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 13, no. 3s, p. 34, 2017.
- [330] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," *Medical image analysis*, vol. 42, pp. 60–88, 2017.
- [331] Y. Wei, J. Zhou, Y. Wang, Y. Liu, Q. Liu, J. Luo, C. Wang, F. Ren, and L. Huang, "A review of algorithm & hardware design for ai-based biomedical applications," *IEEE Transactions on Biomedical Circuits and Systems*, vol. 14, no. 2, pp. 145–163, 2020.
- [332] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," *IEEE Communications Surveys & Tutorials*, 2019.
- [333] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, and O. M. Caicedo, "A comprehensive survey on machine learning for networking: evolution, applications and research opportunities," *Journal of Internet Services and Applications*, vol. 9, no. 1, p. 16, 2018.
- [334] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," *Journal of Machine Learning Research*, vol. 20, no. 55, pp. 1–21, 2019.
- [335] R. Reed, "Pruning algorithms-a survey," *IEEE transactions on Neural Networks*, vol. 4, no. 5, pp. 740–747, 1993.
- [336] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "Model compression and acceleration for deep neural networks: The principles, progress, and challenges," *IEEE Signal Processing Magazine*, vol. 35, no. 1, pp. 126–136, 2018.
- [337] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei, "A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications," ACM Computing Surveys (CSUR), vol. 52, no. 6, pp. 1–39, 2019.
- [338] A. Shawahna, S. M. Sait, and A. El-Maleh, "Fpga-based accelerators of deep learning networks for learning and classification: A review," *IEEE Access*, vol. 7, pp. 7823–7859, 2018.
- [339] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," in 2019 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2019, pp. 1–9.
- [340] M. Li, Y. Liu, X. Liu, Q. Sun, X. You, H. Yang, Z. Luan, and D. Qian, "The deep learning compiler: A comprehensive survey," arXiv preprint arXiv:2002.03794, 2020.
- [341] A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, M. Wu, L. Xu, and L. Van Gool, "Ai benchmark: All about deep learning on smartphones in 2019," arXiv preprint arXiv:1910.06663, 2019.