# The Deep Learning Compiler: A Comprehensive Survey

MINGZHEN LI\*, YI LIU\*, XIAOYAN LIU\*, QINGXIAO SUN\*, XIN YOU\*, HAILONG YANG\*†, ZHONGZHI LUAN\*, LIN GAN§, GUANGWEN YANG§, and DEPEI QIAN\*, Beihang University\* and Tsinghua University§

The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed from both industry and academia, such as TensorFlow XLA and TVM. These DL compilers share a common pattern: they take the DL models described in different DL frameworks as input, and generate optimized code for diverse DL hardware as output. However, none of the existing surveys has comprehensively analyzed the unique design architecture of DL compilers. In this paper, we perform a comprehensive survey of existing DL compilers by dissecting the commonly adopted design in detail, with emphasis on the DL-oriented multi-level IRs and frontend/backend optimizations. Specifically, we provide a comprehensive comparison among existing DL compilers from various aspects. In addition, we present a detailed analysis of the design of multi-level IRs and illustrate the commonly adopted optimization techniques. Finally, several insights are highlighted as potential research directions for DL compilers. This is the first survey paper focusing on the design architecture of DL compilers, which we hope can pave the road for future research on DL compilers.

Additional Key Words and Phrases: Neural Networks, Deep Learning, Compiler, Intermediate Representation, Optimization

## 1 INTRODUCTION

The development of deep learning (DL) has had a profound impact on various scientific fields. It has not only demonstrated remarkable value in artificial intelligence areas such as natural language processing (NLP) [68] and computer vision (CV) [35], but has also proved highly successful in broader applications such as e-commerce [44], smart city [70] and drug discovery [23]. With the emergence of versatile deep learning models such as the convolutional neural network (CNN) [58], recurrent neural network (RNN) [79], long short-term memory (LSTM) [46] and generative adversarial network (GAN) [39], it is critical to ease the programming of diverse DL models in order to realize their wide adoption.

With continuous efforts from both industry and academia, several popular DL frameworks have been proposed, such as TensorFlow [11], PyTorch [74], MXNet [24] and CNTK [80], in order to simplify the implementation of various DL models. Although each of these DL frameworks has strengths and weaknesses depending on the tradeoffs in its design, interoperability becomes important to reduce redundant engineering effort when supporting emerging DL models across existing DL frameworks. To provide interoperability, ONNX [6] has been proposed, which defines a unified format for representing DL models to facilitate model conversion between different DL frameworks.

Meanwhile, the unique computing characteristics of DL workloads, such as matrix multiplication, have spurred chip architects to design customized DL accelerators for higher efficiency. Internet giants (e.g., Google TPU [51], Hisilicon NPU [60], Apple Bionic [55]), processor vendors (e.g., NVIDIA Turing [5], Intel NNP [4]), service providers (e.g., Amazon Inferentia [2], Alibaba Hanguang [1]), and even startups (e.g., Cambricon [61], Graphcore [50]) are investing tremendous workforce and capital in developing DL chips in order to boost the performance of DL models.

†Corresponding author.

Authors' address: Mingzhen Li\*; Yi Liu\*; Xiaoyan Liu\*; Qingxiao Sun\*; Xin You\*; Hailong Yang\*†; Zhongzhi Luan\*; Lin Gan§; Guangwen Yang§; Depei Qian\*, Beihang University\*, Tsinghua University§, {lmzhhh,yi.liu,liuxiaoyan,sunqingxiao,youxin2015,hailong.yang,zhongzhi.luan,depeiq}@buaa.edu.cn, {lingan,ygw}@tsinghua.edu.cn.

Generally, DL hardware can be divided into the following categories: 1) general-purpose hardware with software-hardware co-design, 2) dedicated hardware fully customized for DL models, and 3) neuromorphic hardware inspired by biological brain science. For example, general-purpose hardware (e.g., CPU, GPU) has added special hardware components such as AVX512 vector units and tensor cores to accelerate DL models. For dedicated hardware such as Google TPU, on the other hand, application-specific integrated circuits (e.g., matrix multiplication engine and high-bandwidth memory) have been designed to push performance and energy efficiency to the extreme. In the foreseeable future, the design of DL hardware will become even more diverse.

To embrace the hardware diversity, it is important to map the computation to DL hardware efficiently. On general-purpose hardware, highly optimized linear algebra libraries such as Basic Linear Algebra Subprograms (BLAS) libraries (e.g., MKL and cuBLAS) serve as the basis for efficient computation of DL models. Taking the convolution operation as an example, the DL frameworks convert the convolution to a matrix multiplication and then invoke the GEMM function in the BLAS libraries. In addition, the hardware vendors have released specially optimized libraries tailored for DL computations (e.g., MKL-DNN and cuDNN), including forward and backward convolution, pooling, normalization, and activation. More advanced tools have also been developed to further speed up DL operations. For example, TensorRT [10] supports graph optimization (e.g., layer fusion) and low-bit quantization with a large collection of highly optimized GPU kernels. On dedicated DL hardware, similar libraries are also provided [50, 61]. However, the drawback of relying on the libraries is that they usually fall behind the rapid development of DL models, and thus fail to utilize the DL chips efficiently.
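The lowering of convolution to matrix multiplication mentioned above can be sketched as follows. This is a minimal illustration in plain Python, assuming a single-channel input, stride 1 and no padding. The helper names (`im2col`, `matmul`, `conv2d_as_gemm`, `conv2d_direct`) are hypothetical, and a real framework would call an optimized GEMM routine from a BLAS library instead of the naive `matmul` shown here.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def matmul(a, b):
    """Naive stand-in for a BLAS GEMM call: (m x k) times (k x n)."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][q] for p in range(k)) for q in range(n)]
            for row in a]


def conv2d_as_gemm(image, kernel):
    """Convolution (cross-correlation, as used in DL) via im2col + GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)               # (out_h*out_w) x (kh*kw)
    weights = [[v] for r in kernel for v in r]    # (kh*kw) x 1
    flat = [r[0] for r in matmul(patches, weights)]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


def conv2d_direct(image, kernel):
    """Reference sliding-window implementation used to check the result."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d_as_gemm(image, kernel))  # [[6, 8], [12, 14]]
```

The unfolding step duplicates input elements that are shared by overlapping patches, which trades extra memory for the ability to reuse a single, heavily tuned GEMM kernel.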

To address the drawback of DL libraries and tools, as well as to alleviate the burden of manually optimizing DL models on each DL hardware, the DL community has turned to domain-specific compilers. Several popular DL compilers have rapidly been proposed by both industry and academia, such as TVM [25], Tensor Comprehensions [89], Glow [78], nGraph [30] and XLA [57]. The DL compilers take the model definitions described in the DL frameworks as inputs, and generate efficient code implementations on various DL hardware as outputs. The transformation between the model definition and the specific code implementation is highly optimized, targeting both the model specification and the hardware architecture. Specifically, the compilers incorporate DL-oriented optimizations such as layer and operator fusion, which enable highly efficient code generation. Moreover, existing DL compilers also leverage mature tool-chains from general-purpose compilers (e.g., LLVM [56]), which provide better portability across diverse hardware architectures. Similar to traditional compilers, DL compilers also adopt a layered design including frontend, intermediate representation (IR) and backend. However, the uniqueness of DL compilers lies in the design of multi-level IRs and DL-specific optimizations.

In this paper, we provide a comprehensive survey of existing DL compilers by dissecting the compiler design into frontend, multi-level IRs and backend, with special emphasis on the IR design and optimization methods. To the best of our knowledge, this is the first paper that provides a comprehensive survey on the design of DL compilers. Specifically, this paper makes the following contributions:

- We provide a comprehensive comparison among existing DL compilers from various aspects such as hardware support, DL framework support, code generation and optimization, which can be used as guidelines for choosing the suitable DL compiler for the end user.
- We dissect the commonly adopted design architecture of existing DL compilers, and provide a detailed analysis of the key design components such as multi-level IRs, frontend optimizations (including node-level, block-level and dataflow-level optimizations) and backend optimizations (including hardware-specific optimization, auto-tuning and optimized kernel libraries).
- We highlight several insights for the future development of DL compilers, including dynamic shape and pre/post processing, advanced auto-tuning, polyhedral model, subgraph partitioning, quantization, unified optimizations, differentiable programming and privacy protection, which we hope will boost research in the DL compiler community.

The rest of this paper is organized as follows. Section 2 presents the background of DL compilers, including the DL frameworks, DL hardware, as well as hardware (FPGA) specific DL code generators. Section 3 presents a detailed comparison among existing DL compilers. Section 4 describes the common design architecture of DL compilers. Section 5 discusses the key components of DL compilers, including multi-level IRs, frontend optimizations and backend optimizations. Section 6 highlights the future directions for DL compiler research.

## 2 BACKGROUND

### 2.1 Deep Learning Frameworks

In this section, we provide an overview of popular DL frameworks. The discussion might not be exhaustive but is meant to provide a guideline for DL practitioners. Figure 1 presents the landscape of DL frameworks including currently popular frameworks, historical frameworks and ONNX supported frameworks.

**TensorFlow** - Among all the DL frameworks, TensorFlow has the most comprehensive support for language interfaces, including C++, Python, Java, Go, R, and Haskell. TensorFlow employs a dataflow graph of primitive operators extended with restricted control edges to represent differentiable programs [77]. TensorFlow Lite is designed for mobile and embedded deep learning and provides an Android neural network API. To reduce the complexity of using TensorFlow, Google adopts Keras as a frontend to the TensorFlow core. Furthermore, the eager mode in TensorFlow applies an approach similar to PyTorch to better support dynamic computation graphs.

**Keras** - Keras [28] is a high-level neural network library for quickly building DL models, written in pure Python. Though not a DL framework on its own, Keras provides a high-level API that integrates with TensorFlow, MXNet, Theano, and CNTK. With Keras, DL developers can build a neural network with just a few lines of code. Besides, Keras can integrate with other common DL packages, such as scikit-learn. However, Keras is not flexible enough due to over-encapsulation, which makes it difficult to add operators or obtain low-level data information.

**PyTorch** - Facebook has rewritten the Lua-based DL framework Torch in Python and refactored all modules at the *Tensor* level, which led to the release of PyTorch. As the most popular dynamic framework, PyTorch embeds primitives for constructing dynamic dataflow graphs in Python, where the control flow is executed in the Python interpreter. PyTorch 1.0 integrated the codebases of PyTorch 0.4 and Caffe2 to create a unified framework. This allows PyTorch to absorb the benefits of Caffe2 to support efficient graph execution and mobile deployment. FastAI [47] is an advanced API layer built as an upper-layer encapsulation of PyTorch. It borrows heavily from Keras to ease the use of PyTorch.

**Caffe/Caffe2** - Caffe [49] was designed for deep learning and image classification by UC Berkeley. Caffe provides command line, Python, and MATLAB APIs. Caffe's simplicity makes the source code easy to extend and suitable for developers to analyze in depth. Therefore, Caffe is mainly positioned in research, which has made it popular from the beginning to the present. Caffe2 is built upon the original Caffe project. Caffe2 is similar to TensorFlow in code structure, albeit with a lighter API and easier access to the intermediate results in the computation graph [89].

**MXNet** - MXNet supports multiple language APIs including Python, C++, R, Scala, Julia, MATLAB, and JavaScript. It was intended to be scalable and was designed with the aim of reducing data loading and I/O complexity [24]. MXNet offers different paradigms: declarative programming like Caffe and TensorFlow as well as imperative programming like PyTorch. In December 2017, Amazon and Microsoft jointly released Gluon [3] based on MXNet, which is an advanced interface similar to Keras and FastAI. Gluon supports both flexible, dynamic graphs and efficient, static graphs.

Fig. 1. DL framework landscape: 1) Currently popular DL frameworks; 2) Historical DL frameworks; 3) ONNX supported frameworks.

**CNTK** - CNTK can be used through Python, C++ and C# APIs, or its own scripting language (i.e., BrainScript). CNTK is designed to be easy to use and production-ready for large-scale data [45]. However, CNTK does not yet support the ARM architecture, which limits its usage on mobile devices. It uses a static computation graph similar to TensorFlow and Caffe, in which a DL model is treated as a series of computational steps through a directed graph.
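The idea of a static computation graph, where the whole model is declared as a directed graph before any data flows through it, can be sketched as follows. This is a toy illustration rather than the internal representation of any particular framework: the graph encoding, the operator table `OPS` and the function `run_graph` are hypothetical, and the sketch assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical static graph for y = relu(x * w + b).
# Each node name maps to (operator name, list of input node names).
GRAPH = {
    "mul": ("mul", ["x", "w"]),
    "add": ("add", ["mul", "b"]),
    "relu": ("relu", ["add"]),
}

# Hypothetical operator table for scalar values.
OPS = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "relu": lambda a: max(a, 0.0),
}


def run_graph(graph, feeds, output):
    """Execute a declared graph as a series of steps in dependency order."""
    deps = {name: inputs for name, (_, inputs) in graph.items()}
    values = dict(feeds)
    for name in TopologicalSorter(deps).static_order():
        if name in values:          # graph inputs are supplied by the caller
            continue
        op, inputs = graph[name]
        values[name] = OPS[op](*(values[i] for i in inputs))
    return values[output]


print(run_graph(GRAPH, {"x": 2.0, "w": 3.0, "b": 1.0}, "relu"))  # 7.0
```

Because the graph exists as data before execution, a compiler can inspect and rewrite it as a whole, which is harder when the graph is built on the fly by an interpreter.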

**PaddlePaddle** - The original design of PaddlePaddle [7] is similar to Caffe, where each model can be represented as a set of layers. However, PaddlePaddle v2 has adopted the concept of operators with reference to TensorFlow, which breaks layers into finer-grained operators, thereby supporting more complex DL models. PaddlePaddle Fluid is similar to PyTorch, but it provides its own interpreter so as to avoid the limited performance of the Python interpreter.

**ONNX** - The Open Neural Network Exchange (ONNX) [6] defines a scalable computation graph model, and thus computation graphs built by different DL frameworks can be easily transformed into ONNX. With ONNX, it becomes easier to convert models between DL frameworks. For example, it allows developers to build an MXNet model and then run the model using PyTorch for inference. As shown in Figure 1, ONNX has been integrated into PyTorch, MXNet, PaddlePaddle, and so on. For several DL frameworks (e.g., TensorFlow and Keras) that are not directly supported yet, ONNX provides converters.

**Historical Frameworks** - Due to the rapid evolution of the DL community, many historical DL frameworks are no longer active. For example, PyTorch has replaced Torch [29]. As one of the oldest DL frameworks, Theano [85] is no longer under maintenance. Deeplearning4J [84], a distributed DL framework based on Java and Scala, has become inactive due to the lack of a large developer community. Chainer [86] was once the preferred framework for dynamic computation graphs, but has been replaced by MXNet, PyTorch and TensorFlow, which offer similar features.

Previous works [17, 34, 43, 72, 81, 99] have compared the performance of DL frameworks on different applications (e.g., computer vision and image classification) and different hardware (e.g., CPU, GPU, and TPU). For detailed information about each DL framework, the readers can refer to [45]. Different from them, this survey focuses on the research efforts on DL compilers, which provide a more general approach to executing various DL models on diverse hardware efficiently.

### 2.2 Deep Learning Hardware

DL hardware can be divided into three categories based on generality: 1) general-purpose hardware that can support DL workloads through hardware and software optimization; 2) dedicated hardware that focuses on accelerating DL workloads with fully customized circuit design; 3) neuromorphic hardware that functions by mimicking the human brain.

**General-purpose Hardware** - The most representative general-purpose hardware for DL models is the Graphics Processing Unit (GPU), which achieves high parallelism with a many-core architecture. For example, NVIDIA GPUs have included tensor cores since the Volta architecture. Tensor cores can accelerate mixed-precision matrix multiply-and-accumulate calculations in parallel, which are widely used in DL models during both training and inference. Co-optimized with the hardware, NVIDIA has also launched highly optimized DL libraries and tools such as cuDNN [27] and TensorRT [10] to further accelerate the computation of DL models.

**Dedicated Hardware** - Dedicated hardware is fully customized for DL computation to push performance and energy efficiency to the extreme. The rapid expansion of DL applications and algorithms has spurred many startups to develop dedicated DL hardware (e.g., Graphcore GC2, Cambricon MLU270). Besides, traditional hardware companies (e.g., Intel NNP, Qualcomm Cloud AI 100) and cloud service providers (e.g., Google TPU, Amazon Inferentia, and Alibaba Hanguang) have also invested in this field. The best-known dedicated DL hardware is Google's TPU series. A TPU includes a Matrix Multiplier Unit (MXU), a Unified Buffer (UB), and an Activation Unit (AU), which are driven with CISC instructions by the host processor. The MXU is mainly composed of a systolic array, which is optimized for power and area efficiency in performing matrix multiplications. Compared to CPUs and GPUs, the TPU is still programmable but uses a matrix as a primitive instead of a vector or scalar. Amazon Inferentia has also attracted attention recently. This chip has four NeuronCores that are designed for tensor-level operations, and it has a large on-chip cache to avoid frequent main memory access.
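To illustrate how a systolic array performs matrix multiplication, the following sketch simulates one cycle by cycle. It assumes an output-stationary design, in which each processing element (PE) keeps one accumulator and operands flow in from the left and from the top with a skew of one cycle per row and column. This dataflow is chosen for simplicity and is not claimed to match the MXU of the TPU; the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    PE(i, j) accumulates c[i][j]. Row i of `a` enters from the left delayed
    by i cycles, and column j of `b` enters from the top delayed by j cycles,
    so at cycle t PE(i, j) sees the operand pair with index t - i - j.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]      # one accumulator per PE
    total_cycles = m + n + k - 2           # until PE(m-1, n-1) gets its last pair
    for t in range(total_cycles):
        for i in range(m):                 # all PEs work in parallel in hardware
            for j in range(n):
                step = t - i - j
                if 0 <= step < k:
                    acc[i][j] += a[i][step] * b[step][j]
    return acc, total_cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # [[19, 22], [43, 50]] 4
```

Each operand is fetched from memory once and then passed between neighboring PEs, which is why such arrays are attractive in terms of power and area.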

**Neuromorphic Hardware** - Neuromorphic chips use electronic technology to simulate the biological brain. Representative products of this kind are IBM's TrueNorth and Intel's Loihi. Neuromorphic chips (e.g., TrueNorth) have very high connectivity between their artificial neurons. Neuromorphic chips also replicate a structure similar to brain tissue: neurons can simultaneously store and process data. Traditional chips place processors and memory in different locations, but neuromorphic chips usually have many microprocessors, each of which has a small amount of local memory. Compared to TrueNorth, Loihi has a learning ability more similar to that of the brain. Loihi introduces the spike-timing-dependent plasticity (STDP) model, a mechanism that regulates synaptic strength according to the relative timing of pre-synaptic and post-synaptic spikes. However, neuromorphic chips are still far from large-scale commercial production. Despite that, in the computer science domain, neuromorphic chips can help to capture the process of rapid, life-long learning, which is ignored by regular DL models, and in the neurology domain, they are helpful for figuring out how the various parts of the brain work together to create thoughts, feelings, and even consciousness.
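The dependence of synaptic strength on relative spike timing can be made concrete with the textbook pair-based STDP rule, sketched below. This is not a description of Loihi's learning engine: the exponential form is the commonly used pair-based model, and the function name `stdp_delta_w` as well as all parameter values (`a_plus`, `a_minus`, `tau_plus`, `tau_minus`) are illustrative assumptions.

```python
import math


def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair (times in milliseconds).

    If the pre-synaptic spike precedes the post-synaptic spike, the synapse
    is strengthened; if it follows, the synapse is weakened. The effect
    decays exponentially with the time difference.
    """
    dt = t_post - t_pre
    if dt > 0:
        return a_plus * math.exp(-dt / tau_plus)     # potentiation
    if dt < 0:
        return -a_minus * math.exp(dt / tau_minus)   # depression
    return 0.0


# Pre before post strengthens the synapse, post before pre weakens it.
print(stdp_delta_w(10.0, 15.0) > 0, stdp_delta_w(15.0, 10.0) < 0)
```

Because the update depends only on the spike times of the two neurons connected by a synapse, it can be computed locally, which fits the distributed memory of neuromorphic chips.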

### 2.3 Hardware-specific DL Code Generator

Field Programmable Gate Arrays (FPGAs) are reprogrammable integrated circuits that contain an array of programmable logic blocks. Programmers can configure them after manufacturing. Besides their reprogrammable nature, the low power consumption and high performance of FPGAs make them widely used in many domains, such as communication, medical, image processing, and ASIC prototyping. In the domain of deep learning, high-performance CPUs and GPUs are highly reprogrammable but power-hungry, while power-efficient ASICs are specialized for fixed applications. FPGAs can bridge the gap between CPUs/GPUs and ASICs, which makes them an attractive platform for deep learning.

The High-Level Synthesis (HLS) programming model enables FPGA programmers to generate effective hardware designs conveniently using high-level languages such as C and C++. It avoids writing lots of Verilog or VHDL descriptions, which lowers the programming threshold and shortens the long design cycle. Xilinx Vivado HLS and Intel FPGA SDK for OpenCL are two popular HLS tools, each targeting its vendor's own FPGAs. However, mapping DL models to FPGAs remains complicated even with HLS, because 1) DL models are usually described in the languages of DL frameworks rather than in bare-metal C/C++ code, and 2) DL-specific information and optimizations are hard to leverage.

The hardware-specific code generators targeting FPGAs take the DL models or their domain-specific languages (DSLs) as input, conduct domain-specific (FPGA and DL) optimizations and mappings, then generate HLS or Verilog/VHDL code, and finally generate the bitstream. They can be classified into two categories according to the generated architectures of FPGA-based accelerators: the processor architecture and the streaming architecture [91].

The processor architecture has similarities with general-purpose processors. An FPGA accelerator of this architecture usually comprises several Processing Units (PUs), each of which consists of on-chip buffers and multiple smaller Processing Engines (PEs). It usually has a virtual instruction set (ISA), and the control of the hardware and the scheduling of the execution are determined by software. Moreover, the static scheduling method avoids the overheads of von Neumann execution (including instruction fetching and decoding). A hardware template is a generic and fundamental implementation with configurable parameters. The DL code generators targeting this architecture adopt hardware templates to generate the accelerator designs automatically. With the configurable parameters of the templates, the code generators achieve scalability and flexibility [104]. Scalability means that the code generator can generate designs for FPGAs ranging from high-performance to power-efficient, and flexibility means that the code generator can generate designs for various DL models with different layer types and parameters. The number of PUs and the number of PEs per PU are important template parameters. Besides, the tiling size and batch size are also essential scheduling parameters for mapping the DL models to PUs and PEs. All these parameters are usually determined by design space exploration using various strategies, such as combining a performance model with auto-tuning. DNN Weaver [82], Angel-Eye [41], ALAMO [67], FP-DNN [40], and SysArrayAccel [100] are typical FPGA DL code generators targeting the processor architecture. In addition, the PUs and PEs are usually responsible for coarse-grained basic operations such as matrix-vector multiplication, matrix-matrix multiplication, pooling, and some element-wise operations. The optimizations of these basic operations are mainly guided by the tradeoff between parallelism and data reuse, which is similar to general optimizations.
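The design space exploration described above can be sketched as an exhaustive search over template and scheduling parameters guided by an analytical performance model. Everything in this sketch is an assumption made for illustration: the cost formulas in `estimate_cycles`, the resource limits `dsp_budget` and `buffer_words`, and the candidate parameter values do not correspond to any of the cited code generators or to a real FPGA.

```python
import math
from itertools import product


def estimate_cycles(m, n, k, pus, pes, tile, bandwidth=16):
    """Toy performance model for a tiled (m x k) by (k x n) matrix multiply."""
    tiles = math.ceil(m / tile) * math.ceil(n / tile)
    # Multiply-accumulates of one output tile are spread over all PEs.
    compute = tiles * math.ceil(tile * tile * k / (pus * pes))
    # Each output tile loads a (tile x k) block and a (k x tile) block.
    memory = math.ceil(tiles * 2 * tile * k / bandwidth)
    return max(compute, memory)   # assume compute and transfers overlap


def explore(m, n, k, dsp_budget=64, buffer_words=4096):
    """Return (cycles, config) of the best feasible design, or None."""
    best = None
    for pus, pes, tile in product([1, 2, 4, 8], [4, 8, 16], [8, 16, 32, 64]):
        if pus * pes > dsp_budget:                       # not enough PEs
            continue
        if 2 * tile * k + tile * tile > buffer_words:    # tiles do not fit on chip
            continue
        cycles = estimate_cycles(m, n, k, pus, pes, tile)
        if best is None or cycles < best[0]:
            best = (cycles, {"pus": pus, "pes": pes, "tile": tile})
    return best


print(explore(128, 128, 64))
```

In practice the analytical model only prunes the space; the surviving candidates would typically be synthesized or measured, which is where auto-tuning comes in.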

The streaming architecture has similarities with pipelines. An FPGA accelerator of this architecture consists of multiple different hardware blocks, with roughly one hardware block for each layer of an input DL model. Given the input data of a DL model, this kind of accelerator processes the data through the different hardware blocks in the same sequence as the layers. Additionally, with streaming input data, all hardware blocks can be fully utilized in a pipelined manner. However, the streaming architecture usually relies on the initial assumption that the on-chip memory and the computation resources of the target FPGA are sufficient to accommodate the DL model, which raises barriers to deploying deep models with complicated layers. The DL code generators targeting this architecture can solve this problem by leveraging the reconfigurability of the FPGA or by adopting dynamic control flow. The further optimization of a single block resembles that of the basic operations of the processor architecture. fpgaConvNet [90], DeepBurning [97], Haddoc2 [12], and AutoCodeGen [63] are typical DL code generators of this kind.

For a detailed survey of specific compilation techniques that map DL models to FPGAs, the readers can refer to [42, 91, 104]. Different from [42, 91, 104], this survey focuses on general DL compilation techniques that can be applied to broader DL hardware rather than being bound to FPGAs.

## 3 COMPARISON OF DL COMPILERS

In this section, we compare several popular DL compilers including TVM [25], TC [89], Glow [78], nGraph [30], PlaidML [8], and XLA [57]. Table 1 shows the detailed comparison of different DL compilers from various aspects, where "✓" means supported, "×" means not supported, and "-" means under development. Note that we use TVM to represent the work of VTA [71], Relay [77] and autoTVM [26]. In addition, PlaidML is tightly coupled with nGraph, therefore we consider them together during the comparison. Besides, for the performance comparison of DL compilers, the readers can refer to [101].

|                                    | TVM        | TC         | Glow       | nGraph+PlaidML            | XLA        |
|------------------------------------|------------|------------|------------|---------------------------|------------|
| **Core/Programming Language**      |            |            |            |                           |            |
| Core                               | C++        | C++        | C++        | C++                       | C++        |
| Programming                        | Python/C++ | Python/C++ | Python/C++ | Python/C++                | Python/C++ |
| **Supported Hardware Targets**     |            |            |            |                           |            |
| CPU                                | ✓          | ✓          | ✓          | ✓                         | ✓          |
| NVIDIA-GPU                         | ✓          | ✓          | ✓          | ✓                         | ✓          |
| AMD-GPU                            | ✓          | ×          | ✓          | ✓                         | ✓          |
| FPGA                               | ✓          | ×          | ×          | ×                         | ×          |
| TPU                                | ×          | ×          | ×          | ×                         | ✓          |
| NNP                                | ×          | ×          | ×          | ✓                         | ×          |
| Customized                         | ✓          | ×          | ✓          | ✓                         | ✓          |
| **Supported DL Frameworks**        |            |            |            |                           |            |
| TensorFlow                         | ✓          | ×          | ×          | ✓                         | ✓          |
| PyTorch                            | ✓          | ✓          | ✓          | ×                         | ✓          |
| MXNet                              | ✓          | ×          | ×          | × (not active)            | ×          |
| Caffe2                             | ✓          | ✓          | ✓          | ×                         | ×          |
| ONNX                               | ✓          | ×          | ✓          | ✓                         | ×          |
| CoreML                             | ✓          | ×          | ×          | ×                         | ×          |
| Keras                              | ✓          | ×          | ×          | ✓                         | ✓          |
| PaddlePaddle                       | ×          | ×          | ×          | ✓                         | ×          |
| DarkNet                            | ✓          | ×          | ×          | ×                         | ×          |
| **Supported Generating Languages** |            |            |            |                           |            |
| CUDA                               | ✓          | ✓          | ×          | ×                         | ✓          |
| OpenCL                             | ✓          | ×          | ✓          | ✓                         | ✓          |
| Metal                              | ✓          | ×          | ×          | ✓                         | ×          |
| LLVM                               | ✓          | ✓          | ✓          | ✓                         | ✓          |
| OpenGL                             | ✓          | ×          | ×          | ✓                         | ×          |
| **Supported Features/Strategies**  |            |            |            |                           |            |
| AOT                                | ✓          | ×          | ✓          | ✓ (official release only) | ✓          |
| JIT                                | ✓          | ✓          | ✓          | ✓                         | ✓          |
| Training                           | -          | ✓          | ✓          | ✓                         | ✓          |
| Quantization                       | -          | ×          | ✓          | ✓                         | ×          |
| Automatic Differentiation          | -          | ✓          | ✓          | ✓                         | ×          |
| Dynamic Shape                      | ✓          | ×          | ×          | ✓                         | ×          |
| Auto-tuning                        | ✓          | ✓          | ×          | ✓ (tiling only)           | ×          |

Table 1. The detailed comparison of popular DL compilers.

**Core/Programming Language** - The core language of all DL compilers is C++, because C++ emphasizes performance, efficiency, and flexibility of use in its design. However, Python is becoming more and more popular with programmers due to its simplicity and usability. For most mature DL compilers (e.g., TVM, TC, nGraph, PlaidML and XLA), the Python interfaces cover almost all core functions.

**Supported Hardware** - DL compilers usually support Intel and AMD CPUs as well as NVIDIA GPUs. The official versions of TC currently do not provide support for AMD GPUs. Note that nGraph is integrated with PlaidML to provide acceleration on more hardware targets. nGraph can support the target hardware by invoking existing kernel libraries (e.g., cuDNN and MKL-DNN). Additionally, PlaidML offers extensive support for various hardware. TVM can map a workload to FPGAs using the VTA architecture and runtime [71]. The dedicated DL chips supported by a DL compiler are generally related to its developer. For example, nGraph can support Intel NNP by invoking the NNP library. XLA can support Google TPUs by directly generating binary files. MLIR can take advantage of XLA's compilation by using XLA HLO IR as its dialect. DL compilers except TC and nGraph can support customized hardware by providing corresponding interfaces for programmers. For example, Glow uses automatic code generation techniques (i.e., ClassGen, which is based on LLVM) for defining instructions and nodes [78], and compiler researchers can invoke the interfaces to support new hardware.

**Supported DL Frameworks** - Currently TensorFlow and PyTorch are the two most popular DL frameworks. There are three approaches to support DL frameworks: 1) the DL compiler is integrated into the DL framework; 2) the DL framework has launched an official package to support the DL compiler; 3) the DL compiler uses a converter to deploy the models for DL frameworks. Here are examples to illustrate these three approaches. For 1), XLA is integrated with TensorFlow, while TC and Glow provide lightweight integration with PyTorch and Caffe2. For 2), PyTorch stands to benefit from directly leveraging the compiler stack. To that end, PyTorch now has official TVM-based and XLA-based packages (i.e., torch\_tvm and torch\_xla). Compared to 1) and 2), 3) is more common. For instance, nGraph supports TensorFlow and PaddlePaddle by using a *bridge* to maintain their programmatic or user interface. As more and more frameworks support ONNX models, it is important for DL compilers to support ONNX for future development. At present, three DL compilers (i.e., TVM, Glow and nGraph) are able to load, compile and execute pre-trained ONNX models.

**Supported Code Targets** - DL compilers usually use LLVM IR for code generation on CPUs (ARM and x86). The advantages of LLVM over GCC are the unified IR, high modularity, and rapid customization. Both CUDA and OpenCL are popular APIs for heterogeneous parallel computing. OpenCL can be used to program both NVIDIA and AMD GPUs, while CUDA is specific to NVIDIA GPUs. Although OpenCL promises a portable language for GPU programming, its generality may entail a performance penalty. Two DL compilers (i.e., TVM and XLA) support generating both CUDA and OpenCL code. TVM and PlaidML also support generating code for OpenGL, which is a cross-platform API for rendering graphics. The Metal API was proposed by Apple for both graphics rendering and general-purpose computing. Currently only TVM and PlaidML support generating Metal code.

**Supported Compilation** - DL compilers usually support just-in-time compilation (JIT) to improve the efficiency of program execution. Four DL compilers (i.e., TVM, Glow, nGraph, and XLA) support ahead-of-time compilation (AOT), of which nGraph only supports AOT in the official release (not supported in Beta release). TVM/Relay can produce a native library given a Relay expression and dynamically load it to Python in AOT mode. Glow can produce ahead-of-time compiled executable bundles, which can be executed in a standalone mode.

**Supported DL Optimizations** - As for low-bit inference, currently four DL compilers (i.e., TVM, Glow, nGraph and MLIR) support quantization. At present, XLA alone cannot solve the problem of quantization: the quantization rewriter has a missing part when the rewritten TensorFlow graph is reduced to a quantized XLA graph. TVM's automatic differentiation, quantization, and training are still under development. Moreover, a number of operators with gradient support are available in the TVM v0.6 release. Supporting dynamic shapes requires changing the runtimes, which is a big challenge for DL compilers. At this time two DL compilers (i.e., TVM and nGraph) support dynamic shapes. TC and XLA only support static dimensions internally to provide automatic shape and bound inference. TVM and TC support auto-tuning to optimize performance by tuning the available configurations. TC can only perform auto-tuning on NVIDIA GPUs, while TVM can apply auto-tuning to CPUs (x86 and ARM), mobile GPUs and NVIDIA GPUs [26]. PlaidML can only apply auto-tuning to tiling (auto-tiling), which explores a space of tile sizes using a hypothetical cost model [103].



Fig. 2. The overview of commonly adopted design architecture of DL compilers.

#### 4 COMMON DESIGN ARCHITECTURE OF DL COMPILERS

The common design architecture of a DL compiler primarily contains two parts: the compiler frontend and the compiler backend, as shown in Figure 2. The intermediate representation (IR) is spread across both the frontend and the backend. Generally, IR is an abstraction of the program, and is used for program optimizations. Specifically, the DL models are translated into multi-level IRs in DL compilers, where the high-level IR resides in the frontend and the low-level IR resides in the backend. Based on the high-level IR, the compiler frontend is responsible for hardware-independent transformations and optimizations. Based on the low-level IR, the compiler backend is responsible for hardware-specific optimizations, code generation, and compilation. Note that this survey focuses on the design principles of DL compilers; for functional and experimental comparisons of DL compilers, readers can refer to [59, 101].

**The high-level IR**, also known as graph IR, represents the computation and the control flow, and is hardware-independent. The design challenge of the high-level IR is to abstract the computation and the control flow in a way that can capture and express diverse DL models. The goal of the high-level IR is to establish the control flow and the dependency between the operators and the data, as well as to provide an interface for graph-level optimizations. It also contains rich semantic information for compilation and offers extensibility for customized operators. The detailed discussion of high-level IR is presented in Section 5.1.

**The low-level IR** is designed for hardware-specific optimization and code generation on diverse hardware targets. Thus, the low-level IR should be fine-grained enough to reflect the hardware characteristics and represent the hardware-specific optimizations. It should also allow the use of mature third-party tool-chains in compiler backends such as Halide [76], polyhedral model [9], and LLVM [56]. The detailed discussion of low-level IR is presented in Section 5.2.

**The frontend** takes a DL model from existing DL frameworks as input, and then transforms the model into the computation graph representation (i.e., graph IR). To support the diverse formats in different frameworks, the frontend needs to implement various format transformations. The computation graph optimizations incorporate the optimization techniques from both general-purpose compilers and the DL specific optimizations, which reduce the redundancy and improve the efficiency upon the graph IR. Such optimizations can be classified into node-level (e.g., nop elimination and zero-dim-tensor elimination), block-level (e.g., algebraic simplification, operator fusion, and operator sinking) and dataflow-level (e.g., CSE, DCE, static memory planning, and layout transformation). After the frontend, the optimized computation graph is generated and passed to the backend. The detailed discussion of frontend is presented in Section 5.3.

**The backend** transforms the high-level IR into low-level IR and performs hardware-specific optimizations. On one hand, it can directly transform the high-level IR to third-party tool-chains such as LLVM IR to utilize the existing infrastructures for general-purpose optimizations and code generation. On the other hand, it can take advantage of the prior knowledge of both DL models and hardware characteristics for more efficient code generation, with customized compilation passes. The commonly applied hardware-specific optimizations include hardware intrinsic mapping, memory allocation and fetching, memory latency hiding, parallelization, and loop-oriented optimization. To determine the optimal parameter setting in the large optimization space, two approaches are widely adopted in existing DL compilers, namely auto-scheduling (e.g., polyhedral model) and auto-tuning (e.g., AutoTVM). The optimized low-level IR is compiled using JIT or AOT to generate code for different hardware targets. The detailed discussion of backend is presented in Section 5.4.

#### 5 KEY COMPONENTS OF DL COMPILERS

#### 5.1 High-level IR

To overcome the limitation of the IR adopted in traditional compilers, which constrains the expression of complex computations used in DL models, existing DL compilers leverage high-level IR (also known as graph IR) with special designs for efficient code optimizations. To better understand the graph IR used in the DL compilers, we describe the representation and implementation of graph IR as follows.

*5.1.1* Representation of Graph IR. The representation of graph IR influences the expressiveness of graph IR and also decides the way the DL compilers analyze the graph IR.

**DAG-based IR** - DAG-based IR is one of the most traditional ways for the compilers to build a computation graph, with nodes and edges organized as a directed acyclic graph (DAG). There are rich optimization algorithms on the DAG computation graph, such as live variable analysis and variable dependency analysis. DAG-based IR is convenient for programming and compiling due to its simplicity, but it has deficiencies such as semantic ambiguity caused by the missing definition of computation scope.

**Let-binding-based IR** - Let-binding is one method to resolve the semantic ambiguity: it offers the *let* expression, which binds a value to a variable within a restricted scope and is used by many programming languages such as F# [75] and Scheme [54]. When the *let* keyword is used to define an expression, a *let* node is generated that points to the operator and variable in the expression, instead of just building the computational relation between variables as in a DAG. In a DAG-based compiler, when a process needs the return value of an expression, it first accesses the corresponding node and then searches the related nodes, which is also known as the recursive descent technique. In contrast, a let-binding-based compiler computes the results of all variables in the *let* expression and builds a variable map. When a particular result is needed, the compiler looks it up in this map.
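
The difference between the two evaluation strategies can be illustrated with a minimal sketch. It is not taken from any of the surveyed compilers: the node classes (`Const`, `Var`, `Add`, `Mul`, `Let`), the `evaluate` function and the `stats` counter are hypothetical names, and the DAG case is deliberately modeled as a naive recursive traversal without memoization.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

@dataclass(frozen=True)
class Mul:
    lhs: object
    rhs: object

@dataclass(frozen=True)
class Let:
    var: str       # variable introduced by this binding
    value: object  # expression evaluated once and stored in the variable map
    body: object   # the only scope in which `var` is visible

def evaluate(expr, env=None, stats=None):
    """Evaluate a toy expression; stats["add"] counts evaluated Add nodes."""
    env = {} if env is None else env
    stats = {"add": 0} if stats is None else stats
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.name]  # lookup in the variable map
    if isinstance(expr, Add):
        stats["add"] += 1
        return evaluate(expr.lhs, env, stats) + evaluate(expr.rhs, env, stats)
    if isinstance(expr, Mul):
        return evaluate(expr.lhs, env, stats) * evaluate(expr.rhs, env, stats)
    if isinstance(expr, Let):
        scoped = dict(env)  # the binding does not leak outside `body`
        scoped[expr.var] = evaluate(expr.value, env, stats)
        return evaluate(expr.body, scoped, stats)
    raise TypeError(f"unknown node: {expr!r}")

shared = Add(Const(2), Const(3))
dag_style = Mul(shared, shared)                        # traversal reaches `shared` twice
let_style = Let("t", shared, Mul(Var("t"), Var("t")))  # `shared` is bound once

dag_stats, let_stats = {"add": 0}, {"add": 0}
dag_result = evaluate(dag_style, stats=dag_stats)
let_result = evaluate(let_style, stats=let_stats)
```

Both forms produce the same value, but the let-bound form evaluates the shared sub-expression once and makes its scope explicit.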

**Representing Tensor Computation** - Different graph IRs have different ways to represent the computation on tensors. The operators of diverse DL frameworks are translated to graph IRs according to such specific representations. And the customized operators also need to be programmed in such representation. The representation of tensor computation can be divided into the following three categories.

- 1) Function-based: The function-based representation just provides encapsulated operators, which is adopted by HLO (the IR of XLA). Taking HLO for example, its IR consists of a set of functions in symbolic programming, and most of them have no side-effect. The instructions are organized into three levels, including HloModule (the whole program), HloComputation (a function), and HloInstruction (the operation). XLA uses HLO IR to represent both graph IR and operation IR so that the operation of HLO ranges from the dataflow level to the node level.
- 2) Lambda expression: The lambda expression, an index formula expression, is adopted by TVM. Lambda expression describes calculation by variable binding and substitution. TVM represents the tensor computation using the tensor expression, which is based on the lambda expression. In TVM, computational operators in tensor expression are defined by the shape of output tensor and the lambda expression of computing rules.
- 3) Einstein notation: The Einstein notation, also known as the summation convention, is a notation to express summation, which is adopted by TC. Taking TC for example, the indexes for temporary variables do not need to be defined, and the IR can figure out the actual expression by the occurrence of undefined variables based on Einstein notation. In Einstein notation, the operators need to be associative and commutative. This restriction guarantees that the reduction operator can be executed in any order, which makes further parallelization possible.
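
The lambda-expression and Einstein-notation styles can be contrasted with a small pure-Python sketch. It only mimics the idea and does not reproduce the TVM or TC APIs: `compute` and `einstein` are hypothetical helpers, and tensors are modeled as nested lists.

```python
from itertools import product

def compute(shape, rule):
    """Build a nested-list tensor of `shape` by calling rule(*indices)."""
    def build(prefix, dims):
        if not dims:
            return rule(*prefix)
        return [build(prefix + (i,), dims[1:]) for i in range(dims[0])]
    return build((), tuple(shape))

A = [[1, 2, 3], [4, 5, 6]]    # shape (2, 3)
B = [[1, 0], [0, 1], [2, 2]]  # shape (3, 2)

# Lambda-expression style: the output shape plus an index formula.
C = compute((2, 2), lambda i, j: sum(A[i][k] * B[k][j] for k in range(3)))

def einstein(out_idx, lhs_idx, rhs_idx, lhs, rhs, sizes):
    """Einstein style: indices absent from the output are reduced implicitly."""
    reduce_idx = [v for v in dict.fromkeys(lhs_idx + rhs_idx) if v not in out_idx]

    def at(tensor, idx, env):
        for v in idx:
            tensor = tensor[env[v]]
        return tensor

    def rule(*out_pos):
        env = dict(zip(out_idx, out_pos))
        total = 0
        for pos in product(*(range(sizes[v]) for v in reduce_idx)):
            env.update(zip(reduce_idx, pos))
            total += at(lhs, lhs_idx, env) * at(rhs, rhs_idx, env)
        return total

    return compute([sizes[v] for v in out_idx], rule)

# C(i, j) += A(i, k) * B(k, j): k never appears on the left, so it is summed over.
D = einstein("ij", "ik", "kj", A, B, {"i": 2, "j": 2, "k": 3})
```

In the second form the reduction index is never declared; it is inferred from its absence in the output, which is the convenience the Einstein notation provides.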

*5.1.2* Implementation of Graph IR. The implementation of graph IR in DL compilers fulfills the management of data and operation.

**Data representation** - The data in DL compilers (e.g., inputs, weights and intermediate data) are usually organized in the form of tensors, which are also known as multi-dimensional arrays. The DL compilers can represent tensor data directly by memory pointers, or in a more flexible way by placeholders. A placeholder contains the size for each dimension of a tensor. Alternatively, the dimension sizes of the tensor can be marked as unknown. For optimizations, the DL compilers require the data layout information. In addition, the bound of iterators should be inferred according to the placeholders.

- 1) Placeholder: Placeholder is widely used in symbolic programming. A placeholder is simply a variable with explicit shape information (e.g., size in each dimension), and it will be populated with values at later stage of the computation. It allows the programmers to describe the operations and build the computation graph without concerning the exact data elements, which helps to separate the computation definition from the exact execution in DL compilers. Besides, it is convenient for the programmers to change the shape of input/output and other corresponding intermediate data by using placeholders without changing the computation definition.
- 2) Unknown shape representation: The unknown dimension size is usually supported when declaring the placeholders. TVM uses Any to represent an unknown dimension (e.g., $Tensor\langle(Any,3),fp32\rangle$), and XLA uses None to achieve the same purpose (e.g., tf.placeholder("float",[None,3])). The unknown shape representation is necessary to support dynamic models. However, to fully support dynamic models, bound inference and dimension checking should be relaxed. In addition, extra mechanisms should be implemented to guarantee memory validity.
- 3) Data layout: The data layout describes how a tensor is organized in memory, and it is usually a mapping from logical indices to memory indices. The data layout usually includes the sequence of dimensions (e.g., NCHW and NHWC), tiling, padding, striding, etc. TVM and Glow represent data layout as operator parameters and require such information for computation and optimization. Combining data layout information with operators rather than tensors enables intuitive implementation for certain operators, and also helps to reduce the compilation overhead. XLA represents data layout as constraints related to its backend hardware. Relay and MLIR are going to add data layout information into their type systems for tensors.
- 4) Bound inference: The bound inference is applied to determine the bound of iterators when compiling DL models in DL compilers. Although the tensor representation in DL compilers is convenient to describe the inputs and outputs, it poses special challenges for inferring the iterator bound. The bound inference is usually performed recursively or iteratively according to the computation graph and the known placeholders. For example, in TVM the iterators form a directed acyclic hyper-graph, where each node of the graph represents an iterator and each hyper-edge represents the relation (e.g., split, fuse or rebase) among two or more iterators. Once the bound of the root iterator is determined based on the shapes of placeholders, other iterators can be inferred according to the relations recursively.
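
As a rough illustration of how placeholders, unknown dimensions and bound inference interact, consider the following sketch. The `Placeholder` class and the `split_extent` and `fuse_extent` helpers are hypothetical and far simpler than the hyper-graph used in TVM; `None` stands for an unknown dimension.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Placeholder:
    name: str
    shape: Tuple[Optional[int], ...]  # None marks an unknown dimension
    dtype: str = "float32"

def split_extent(extent, factor):
    """Infer (outer, inner) extents after splitting an iterator by `factor`."""
    if extent is None:
        return None, factor               # the outer extent stays unknown
    return -(-extent // factor), factor   # ceiling division for the outer loop

def fuse_extent(outer, inner):
    """Infer the extent of an iterator obtained by fusing two iterators."""
    if outer is None or inner is None:
        return None
    return outer * inner

x = Placeholder("x", (None, 224))  # dynamic batch size, 224 known columns
rows, cols = x.shape

cols_outer, cols_inner = split_extent(cols, 32)  # 224 = 7 * 32
rows_outer, rows_inner = split_extent(rows, 8)   # the unknown bound propagates
fused = fuse_extent(cols_outer, cols_inner)
```

The bounds of the derived iterators follow from the root extents given by the placeholder; an unknown root extent propagates to the derived iterators, which is why dynamic shapes require relaxed checking.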

**Operators supported** - The operators supported by DL compilers are responsible for representing the DL workloads, and they are nodes of the computation graph. The operators usually include algebraic operators (e.g., +,  $\times$ , exp and topK), neural network operators (e.g., convolution and pooling), tensor operators (e.g., reshape, resize and copy), broadcast and reduction operators (e.g., min and argmin), as well as control flow operators (e.g., conditional and loop). Here, we choose three representative operators that are frequently used across different DL compilers for illustration. In addition, we discuss the case for customized operators.

- 1) Broadcast: The broadcast operators can replicate the data and generate new data with compatible shape. Without broadcast operators, the input tensor shapes are more constrained. For example, for an add operator, the input tensors are expected to be of the same shape. Some compilers such as XLA and Relay relax such restriction by offering the broadcasting operator. For example, XLA allows the element-wise addition on a matrix and a vector by replicating the vector until its shape matches the matrix.
- 2) Control flow: Control flow is needed when representing complex and flexible models. Models such as RNN and reinforcement learning (RL) depend on recurrent relations and data-dependent conditional execution [102], which require control flow. Without control flow support in the graph IR of DL compilers, these models must rely on the control flow support of the host languages (e.g., if and while in Python) or static unrolling, which deteriorates the computation efficiency. Relay observes that arbitrary control flow can be implemented by recursion and pattern matching, which has been demonstrated by functional programming [77]. Therefore, it provides an if operator and recursive functions for implementing control flow. In contrast, XLA represents control flow by special HLO operators such as while and conditional.
- 3) Derivative: The derivative operator of an operator Op takes the output gradients and the input data of Op as its inputs, and then calculates the gradient of Op. Although some DL compilers (e.g., TVM/Relay and TC) support automatic differentiation [87], they require the derivatives of all operators in high-level IR when the chain rule is applied. TVM is working towards providing the derivative operators of both algebraic operators and neural network operators. The programmers can use these derivative operators for building the derivatives of customized operators. In contrast, PlaidML can generate derivative operators automatically, even for customized operators.
- 4) Customized operators: Customized operators allow programmers to define their own operators for special purposes. Providing support for customized operators improves the extensibility of DL compilers. For example, when defining new operators in Glow, the programmers need to realize the logic implementation and node encapsulation. In addition, extra efforts are needed, such as the lowering step, operation IR generation and instruction generation, if necessary. In contrast, TVM and TC require less programming effort beyond describing the computation implementation. Specifically, the users of TVM only need to describe the computation and the schedule as well as declare the shape of input/output tensors. Moreover, the customized operators integrate Python functions through hooks, which further reduces the burden of the programmers.
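
The matrix-plus-vector case mentioned for the broadcast operator can be sketched in a few lines. This is a plain-Python illustration of the semantics only, not the XLA or Relay implementation; `broadcast_add` is a hypothetical name and tensors are nested lists.

```python
def broadcast_add(matrix, vector):
    """Add `vector` to every row of `matrix`, as if the vector were replicated."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("vector length must match the last matrix dimension")
    return [[m + v for m, v in zip(row, vector)] for row in matrix]

matrix = [[1, 2], [3, 4]]
vector = [10, 20]
result = broadcast_add(matrix, vector)
```

The vector is never materialized as a full matrix here; a compiler is free to implement the replication implicitly in the same way, as long as the result matches the replicated form.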

5.1.3 Discussion. Nearly all DL compilers have their unique high-level IRs. However, they share a similar design philosophy, such as using DAG and let-binding to build the computation graph. In addition, they usually provide convenient ways for programmers to represent the tensor computation. The data and operators designed in high-level IRs are flexible and extensible enough to support diverse DL models. More importantly, the high-level IRs are hardware-independent and thus can be applied with different hardware backends.

#### 5.2 Low-level IR

5.2.1 Implementation of Low-Level IR. Low-level IR describes the computation of a DL model in a more fine-grained representation than that in high-level IR, which enables the target-dependent optimizations by providing interfaces to tune the computation and memory access. In this section, we classify the common implementations of low-level IRs into three categories: Halide-based IR, polyhedral-based IR, and other unique IR.

**Halide-based IR** - Halide was first proposed to parallelize image processing, and it has proven to be extensible and efficient in DL compilers (e.g., TVM). The fundamental philosophy of Halide is the separation of *computation* and *schedule*. Rather than giving a specific scheme directly, the compilers adopting Halide try various possible *schedules* and choose the best one. The original IR of Halide needs to be modified when applied to the backend of DL compilers. For example, the input shape of Halide is infinite, whereas DL compilers need to know the exact shape of data in order to map the operators to hardware instructions. Some compilers, such as TC, require a fixed data size to ensure better temporal locality for tensor data.
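
The separation of *computation* and *schedule* can be sketched as follows. This is an illustrative toy, not Halide or TVM code: `compute`, `schedule_row_major`, `schedule_tiled` and `run` are hypothetical names, and a schedule is modeled simply as the order in which output points are visited.

```python
def compute(a, b, i, j):
    """What to compute: one output element of a matrix product."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))

def schedule_row_major(n, m):
    """How to iterate: a plain row-major loop nest."""
    for i in range(n):
        for j in range(m):
            yield i, j

def schedule_tiled(n, m, tile):
    """How to iterate: the same points, visited tile by tile."""
    for io in range(0, n, tile):
        for jo in range(0, m, tile):
            for i in range(io, min(io + tile, n)):
                for j in range(jo, min(jo + tile, m)):
                    yield i, j

def run(a, b, schedule):
    """Apply the unchanged computation in the order chosen by the schedule."""
    n, m = len(a), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i, j in schedule:
        out[i][j] = compute(a, b, i, j)
    return out

a = [[1, 2], [3, 4], [5, 6]]  # shape (3, 2)
b = [[1, 0, 1], [0, 1, 1]]    # shape (2, 3)

naive = run(a, b, schedule_row_major(3, 3))
tiled = run(a, b, schedule_tiled(3, 3, tile=2))
```

Because the schedule only changes the visiting order, a compiler can try many schedules and keep the fastest one without touching the definition of the computation.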

TVM has improved Halide IR into an independent symbolic IR through the following efforts. It removes the dependency on LLVM and refactors the structure of both the project module and the IR design of Halide, pursuing better organization as well as accessibility for the graph IR and frontend languages such as Python. Reusability is also improved, with a runtime dispatching mechanism implemented to add customized operators conveniently. TVM simplifies the variable definition from string matching to pointer matching, guaranteeing that each variable has a single definition location (static single-assignment, SSA [31]).

**Polyhedral-based IR** - The polyhedral model is an important technique adopted in DL compilers. It uses linear programming, affine transformations and other mathematical methods to optimize loop-based codes with static control flow of bounds and branches. The polyhedral-based IR applies multiple polyhedral transformations (e.g., fusion, tiling, sinking, and mapping), including both device-dependent and device-independent optimizations. There are many toolchains that are borrowed by polyhedral-based compilers, such as isl [94], Omega [53], PIP [32], Polylib [64], and PPL [16]. Due to the ability to deal with deeply nested loops, many DL compilers, such as TC and PlaidML, have adopted the polyhedral model as their low-level IR.

TC has its unique design in low-level IR, which combines the Halide and polyhedral models. It uses Halide-based IR to represent the computation, and adopts the polyhedral-based IR to represent the loop structures. TC presents detailed expressions by abstract instances and introduces specific node types. In brief, TC uses the *domain* node to specify the ranges of index variables and uses the *context* node to describe new iterative variables that are related to hardware. It uses the *band* node to determine the order of iterations. A *filter* node represents an iterator combined with a statement instance. *Set* and *sequence* are keywords to specify the execution types (parallel and serial execution) for *filters*. Besides, TC uses *extension* nodes to describe other necessary instructions for code generation, such as the memory movement.

Stripe/PlaidML uses polyhedral-based IR to represent tensor operations, which creates a hierarchy of parallelizable code by extending the nesting of parallel polyhedral blocks to multiple levels. Besides, it allows nested polyhedrons to be allocated to nested memory units, providing a way to match the computation with the memory hierarchy. In Stripe, the hardware configuration is independent from the kernel code. The *tags* in Stripe (known as *passes* in other compilers) do not change the kernel structure, but provide additional information about the hardware target for the optimization passes. Stripe splits the DL operators into *tiles* that fit into local hardware resources.

**Other unique IR** - There are DL compilers implementing customized low-level IRs without using Halide and the polyhedral model. Upon the customized low-level IRs, they apply hardware-specific optimizations and lower to LLVM IR.

The low-level IR in Glow is an instruction-based expression that operates on tensors referenced by addresses [78]. There are two kinds of instruction-based functions in Glow low-level IR: *declare* and *program*. The first one declares the number of constant memory regions that live throughout the lifetime of the program (e.g., input, weight, bias). The second one is a list of locally allocated regions, including functions (e.g., conv and pool) and temporary variables. Instructions can run on the global memory regions or locally allocated regions. Besides, each operand is annotated with one of the qualifiers: *@in* indicates the operand reads from the buffer; *@out* indicates that the operand writes to the buffer; *@inout* indicates that the operand reads and writes to the buffer. These instructions and operand qualifiers help Glow determine when certain memory optimizations can be performed.

MLIR is highly influenced by LLVM, and it is a purer compiler infrastructure than LLVM. MLIR reuses many ideas and interfaces in LLVM, and sits between the model representation and code generation. MLIR has a flexible type system and allows multiple levels of abstraction, which introduces *dialects* to represent these multiple levels of abstraction. Each *dialect* consists of a set of defined immutable operations. The current *dialects* of MLIR include TensorFlow IR, XLA HLO IR, experimental polyhedral IR, LLVM IR, and TensorFlow Lite. The flexible transformations between *dialects* are also supported. Furthermore, MLIR can create new *dialects* to connect to a new low-level compiler, which paves the way for hardware developers and compiler researchers.

The HLO IR of XLA can be considered as both high-level IR and low-level IR, because HLO is fine-grained enough to represent the hardware-specific information. Besides, HLO supports hardware-specific optimizations and can be used to emit LLVM IR.

5.2.2 Code Generation based on Low-Level IR. The low-level IR adopted by most DL compilers can be eventually lowered to LLVM IR, and benefits from LLVM's mature optimizer and code generator. Furthermore, LLVM can explicitly design custom instruction sets for specialized accelerators from scratch. However, traditional compilers may generate poor code when passed directly to LLVM IR. In order to avoid this situation, two approaches are applied by DL compilers to achieve hardware-dependent optimization: 1) perform target-specific loop transformation in the upper IR of LLVM (e.g., Halide-based IR and polyhedral-based IR), and 2) provide additional information about the hardware target for the optimization passes. Most DL compilers apply both approaches, but the emphasis is different. In general, the DL compilers that prefer frontend users (e.g., TC, TVM, XLA, and nGraph) might focus on 1), whereas the DL compilers that are more inclined to backend developers (e.g., Glow, PlaidML, and MLIR) might focus on 2).

The compilation scheme in DL compilers can be mainly classified into two categories: just-in-time (JIT) and ahead-of-time (AOT). JIT compilers generate executable code on the fly, and can optimize the code with better runtime knowledge. AOT compilers generate all executable binaries first and then execute them; thus they have a larger scope for static analysis than JIT compilers. In addition, AOT approaches can be applied with cross-compilers of embedded platforms (e.g., C-GOOD [52]) as well as enable execution on remote machines (TVM RPC) and customized accelerators.

Fig. 3. Example of computation graph optimizations, taken from the HLO graph of Alexnet on Volta GPU using Tensorflow XLA.

5.2.3 Discussion. In DL compilers, the low-level IR is a fine-grained representation of DL models, and it reflects the detailed implementation of DL models on diverse hardware. The low-level IRs include Halide-based IRs, polyhedral-based IRs and other unique IRs. Although they differ in designs, they leverage the mature compiler tool-chains and infrastructure to provide tailored interfaces for hardware-specific optimizations and code generation. The design of low-level IRs can also impact the design of new DL accelerators (e.g., TVM HalideIR and Inferentia, as well as XLA HLO and TPU).

#### 5.3 Frontend Optimizations

After constructing the computation graph, the frontend applies graph-level optimizations. Many optimizations are easier to identify and perform at the graph level, because the graph provides a global view of the computation. These optimizations are applied only to the computation graph, rather than to the implementations on backends; thus they are hardware-independent and can be applied to various backend targets.

The frontend optimizations are usually defined by *passes*, and can be applied by traversing the nodes of the computation graph and performing the graph transformations. The frontend provides methods to 1) capture the specific features from the computation graph and 2) rewrite the graph for optimization. Besides the pre-defined *passes*, the developers can also define customized *passes* in the frontend. Most DL compilers can determine the shape of both input tensors and output tensors of every operation once a DL model is imported and transformed into a computation graph. This feature allows DL compilers to perform optimizations according to the shape information. Figure 3 shows an example of computation graph optimizations with Tensorflow XLA.

In this section, we classify the frontend optimizations into three categories: 1) node-level optimizations, 2) block-level (peephole, local) optimizations, and 3) dataflow-level (global) optimizations.

5.3.1 Node-level optimizations. The nodes of the computation graph are coarse enough to enable optimizations inside a single node. The node-level optimizations include node elimination, which eliminates unnecessary nodes, and node replacement, which replaces nodes with other lower-cost nodes.

Nop Elimination, in general-purpose compilers, removes the no-op instructions, which occupy a small amount of space but specify no operation. In DL compilers, Nop Elimination is responsible for removing the operations lacking adequate inputs. For example, the *sum* node with only one input tensor can be eliminated, and the *padding* node with zero padding width can be eliminated.

Zero-dim-tensor elimination is responsible for removing the unnecessary operations whose inputs are zero-dimension tensors. Assuming that A is a zero-dimension tensor and B is a constant tensor, the sum operation node of A and B can be replaced with the already existing constant node B without affecting the correctness. Assuming that C is a 3-dimension tensor in which the size of one dimension is zero, such as $\{0,2,3\}$, C has no elements, and the argmin/argmax operation node can be eliminated.
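
A node-elimination pass of this kind can be sketched on a toy graph. The `Node` class, the `width` attribute and the `eliminate_nops` function are hypothetical, the node list is assumed to be in topological order, and graph outputs are assumed not to be no-op nodes themselves.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)

def is_nop(node):
    """A sum over a single input or a padding of width zero does nothing."""
    if node.op == "sum" and len(node.inputs) == 1:
        return True
    if node.op == "pad" and node.attrs.get("width") == 0:
        return True
    return False

def eliminate_nops(nodes):
    """Drop no-op nodes and reconnect their consumers to the no-op's input."""
    forward = {}  # removed node name -> name of the node that replaces it
    kept = []
    for node in nodes:
        inputs = [forward.get(i, i) for i in node.inputs]
        if is_nop(node):
            forward[node.name] = inputs[0]
        else:
            kept.append(Node(node.name, node.op, inputs, dict(node.attrs)))
    return kept

graph = [
    Node("x", "input"),
    Node("s", "sum", ["x"]),                # sum of a single tensor
    Node("p", "pad", ["s"], {"width": 0}),  # zero-width padding
    Node("y", "relu", ["p"]),
]
optimized = eliminate_nops(graph)
```

After the pass, the `relu` node reads directly from the input and the two no-op nodes are gone.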

5.3.2 Block-level optimizations. **Algebraic simplification** - The algebraic simplification optimizations consist of 1) algebraic identification; 2) strength reduction, with which we can replace more expensive operators by cheaper ones; and 3) constant folding, with which we can replace constant expressions by their values. Such optimizations consider a sequence of nodes and take advantage of the commutativity, associativity, and distributivity of different kinds of nodes to simplify the computation.

In addition to the typical operators ($+$, $\times$, etc.), algebraic simplification can also be applied to DL-specific operators (e.g., reshape, transpose, and pooling). The operators can be reordered and sometimes eliminated, which reduces redundancy and improves efficiency. The common cases where algebraic simplification applies are: 1) optimization of computation order, where the optimization finds and removes reshape/transpose operations according to specific characteristics. Taking matrix multiplication (GEMM) as an example, suppose two matrices $A$ and $B$ are both transposed (producing $A^T$ and $B^T$) and then multiplied as $A^T B^T$. A more efficient implementation switches the order of the arguments, multiplies $B$ by $A$, and transposes the output, since $A^T B^T = (BA)^T$; this reduces two transposes to one; 2) optimization of node combination, where the optimization combines multiple consecutive transpose nodes into a single node, eliminates identity transpose nodes, and optimizes transpose nodes into reshape nodes when they actually move no data; 3) optimization of ReduceMean nodes, where the optimization substitutes a ReduceMean node with an AvgPool node (e.g., in Glow) if the input of the reduce operator is 4D with the last two dimensions to be reduced.
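
The GEMM rewrite and the combination of consecutive transposes can be checked numerically with a short sketch. The helpers `transpose`, `matmul` and `compose_perms` are hypothetical, matrices are nested lists, and a permutation `p` is assumed to mean that output axis `i` reads input axis `p[i]`.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

A = [[1, 2], [3, 4], [5, 6]]  # shape (3, 2)
B = [[1, 0, 2], [0, 1, 3]]    # shape (2, 3)

# Original graph: two transposes followed by a GEMM.
two_transposes = matmul(transpose(A), transpose(B))
# Rewritten graph: swap the arguments, then transpose the output once.
one_transpose = transpose(matmul(B, A))

def compose_perms(first, second):
    """Permutation equivalent to transposing by `first` and then by `second`."""
    return tuple(first[axis] for axis in second)

def is_identity(perm):
    return tuple(perm) == tuple(range(len(perm)))

# Two consecutive transposes that cancel out can be removed entirely.
combined = compose_perms((1, 2, 0), (2, 0, 1))
```

When the composed permutation is the identity, both transpose nodes can be dropped; otherwise the pair can still be replaced by a single transpose node with the composed permutation.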

**Operator fusion** - Operator fusion is an indispensable optimization of DL compilers. It enables better sharing of computation, eliminates intermediate allocations, facilitates further optimization by combining loop nests [77], and reduces launch and synchronization overhead [89]. In TVM, the operators are classified into four categories: injective, reduction, complex-out-fusible and opaque. When the operators are defined, their corresponding categories are determined. Targeting the above categories, TVM designs the fusion rules across operators. In TC, fusion is performed in a different way based on the automatic polyhedron transformations. However, how to identify and fuse more complicated graph patterns, such as blocks with multiple broadcast and reduce nodes, remains a problem. Recent works [65, 66] try to tackle this problem and propose a framework to explore and optimize aggressive fusion plans. It supports not only element-wise and reduction nodes, but also other computation/memory intensive nodes with complex dependencies.
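
The effect of fusing a chain of element-wise operators can be illustrated with a minimal sketch. It does not follow the fusion rules of any specific compiler; `unfused` and `fused` are hypothetical functions over flat Python lists.

```python
def unfused(xs, scale, bias):
    """Three separate operators, each materializing an intermediate tensor."""
    scaled = [x * scale for x in xs]      # operator 1: multiply
    shifted = [s + bias for s in scaled]  # operator 2: add
    return [max(s, 0.0) for s in shifted] # operator 3: ReLU

def fused(xs, scale, bias):
    """One fused operator: a single loop and no intermediate tensors."""
    return [max(x * scale + bias, 0.0) for x in xs]

data = [-2.0, -0.5, 0.0, 1.5, 3.0]
reference = unfused(data, 2.0, 1.0)
result = fused(data, 2.0, 1.0)
```

The fused version traverses the data once and allocates only the output, which is the source of the saved memory traffic and launch overhead discussed above.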

**Operator sinking** - This optimization sinks operations such as transposes below operations such as batch normalization, ReLU, sigmoid, and channel shuffle. By this optimization, many similar operations are moved closer to each other, which creates more opportunities for algebraic simplification.
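
How sinking exposes further simplification can be sketched on a linear chain of operators. This is a toy model under stated assumptions: operators are plain strings, every `transpose` is assumed to be a 2-D transpose (so two in a row cancel), and only the element-wise operators listed in `ELEMENTWISE` are considered; the function names are hypothetical.

```python
ELEMENTWISE = {"relu", "sigmoid"}  # operators that commute with a transpose

def sink_transposes(ops):
    """Move each transpose below the element-wise operators that follow it."""
    ops = list(ops)
    changed = True
    while changed:
        changed = False
        for i in range(len(ops) - 1):
            if ops[i] == "transpose" and ops[i + 1] in ELEMENTWISE:
                ops[i], ops[i + 1] = ops[i + 1], ops[i]
                changed = True
    return ops

def cancel_transpose_pairs(ops):
    """Algebraic simplification: two adjacent 2-D transposes are an identity."""
    out = []
    for op in ops:
        if op == "transpose" and out and out[-1] == "transpose":
            out.pop()
        else:
            out.append(op)
    return out

chain = ["transpose", "relu", "transpose", "sigmoid"]
sunk = sink_transposes(chain)
simplified = cancel_transpose_pairs(sunk)
```

Before sinking, the two transposes are separated by `relu` and cannot be cancelled; after sinking they are adjacent and disappear.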

5.3.3 Dataflow-level optimizations. **Common sub-expression elimination (CSE)** - An expression E is a common sub-expression if the value of E has been previously computed and has not been changed since that computation [15]. In this case, the value of E is computed once, and the already computed value can be reused to avoid recomputation in other places. The DL compilers search for common sub-expressions through the whole computation graph and replace the subsequent occurrences with the previously computed results.
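
A simple way to implement CSE on a computation graph is to hash each node by its operator and inputs. The sketch below assumes side-effect-free operators and a topologically ordered node list of `(name, op, inputs)` tuples; the function name and the graph encoding are hypothetical.

```python
def eliminate_common_subexpressions(nodes):
    """Return (kept_nodes, rename), where rename maps duplicates to originals."""
    seen = {}    # (op, inputs) -> name of the first node computing it
    rename = {}  # eliminated node -> surviving node
    kept = []
    for name, op, inputs in nodes:
        inputs = tuple(rename.get(i, i) for i in inputs)
        key = (op, inputs)
        if op != "input" and key in seen:
            rename[name] = seen[key]  # reuse the previously computed value
        else:
            seen[key] = name
            kept.append((name, op, inputs))
    return kept, rename

graph = [
    ("a", "input", ()),
    ("b", "input", ()),
    ("t1", "add", ("a", "b")),
    ("t2", "add", ("a", "b")),   # same operator and inputs as t1
    ("out", "mul", ("t1", "t2")),
]
kept, rename = eliminate_common_subexpressions(graph)
```

The second `add` is removed and its consumer is rewired to the first one. Commutative operators could additionally sort their inputs before hashing, which this sketch does not do.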

**Dead code elimination (DCE)** - A piece of code is dead if its computed results or side-effects are not used, and the DCE optimization removes such dead code. Dead code is usually not written by programmers but introduced by other graph optimizations. Thus, DCE, as well as CSE, is applied after other graph optimizations. Other optimizations, such as dead store elimination (DSE), which removes stores into tensors that are never going to be used, also belong to DCE.
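
At the graph level, DCE can be viewed as a reachability problem from the graph outputs. The sketch below assumes the same hypothetical `(name, op, inputs)` node format and pure operators; it is an illustration rather than the pass of any particular compiler.

```python
# Minimal DCE sketch: keep only nodes reachable from the graph outputs.
# Node format (name, op, inputs) is an illustrative assumption; operators are assumed pure.

def eliminate_dead_code(nodes, outputs):
    """Remove nodes whose results never reach any of the requested outputs."""
    producers = {name: inputs for name, _, inputs in nodes}
    live = set()
    worklist = list(outputs)
    while worklist:
        name = worklist.pop()
        if name in live:
            continue
        live.add(name)
        # Graph inputs have no producer entry, so they contribute no further work.
        worklist.extend(producers.get(name, ()))
    return [node for node in nodes if node[0] in live]

graph = [
    ("t1", "conv", ("x", "w")),
    ("t2", "relu", ("t1",)),
    ("t3", "transpose", ("t1",)),   # result unused after earlier rewrites
    ("t4", "reshape", ("t3",)),     # only consumer of t3, also unused
    ("out", "softmax", ("t2",)),
]

pruned = eliminate_dead_code(graph, outputs=["out"])
```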

**Static memory planning** - Static memory planning optimizations are performed to reuse memory buffers as much as possible. Usually, there are two approaches: in-place memory sharing and standard memory sharing. In-place memory sharing uses the same memory for the input and output of an operation, and allocates just one copy of memory before computing. Standard memory sharing reuses the memory of previous operations without overlapping. Static memory planning is done offline, which allows more complicated planning algorithms to be applied. A recent work [13] is the first to design and perform memory-aware scheduling to minimize the peak activation memory footprint on edge devices, which presents new research directions for memory planning on memory-constrained devices.
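
Standard memory sharing can be sketched as an offline assignment of tensors to buffers based on their live ranges. The greedy first-fit policy, the tuple format `(name, size, first_step, last_step)`, and the tensor names below are assumptions chosen for illustration; production planners use more elaborate algorithms.

```python
# Minimal sketch of static memory planning by buffer reuse.
# Tensors are (name, size_in_bytes, first_step, last_step); the greedy first-fit
# policy is an illustrative assumption, not the planner of any specific compiler.

def plan_memory(tensors):
    """Assign each tensor to a buffer; buffers are reused once their tenant is dead."""
    buffers = []       # each entry: [capacity, step at which the buffer becomes free]
    assignment = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        for idx, (capacity, free_at) in enumerate(buffers):
            if free_at <= first and capacity >= size:
                buffers[idx][1] = last + 1
                assignment[name] = idx
                break
        else:
            # No free buffer is large enough, so a new one is allocated.
            buffers.append([size, last + 1])
            assignment[name] = len(buffers) - 1
    return assignment, sum(capacity for capacity, _ in buffers)

tensors = [
    ("conv1_out", 4096, 0, 1),
    ("relu1_out", 4096, 1, 2),
    ("conv2_out", 2048, 2, 3),
    ("relu2_out", 2048, 3, 4),
]
assignment, total_bytes = plan_memory(tensors)
```

In this example the four tensors fit into two buffers, because each tensor is dead by the time the tensor two steps later is produced.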

**Layout transformation** - Layout transformation tries to find the best data layouts to store tensors in the computation graph and then inserts the layout transformation nodes into the graph. Note that the actual transformation is not performed here; instead, it is performed when the compiler backend evaluates the computation graph.

In fact, the performance of the same operation differs across data layouts, and the best layouts also differ across hardware. For example, operations in the NCHW format usually run faster on GPU, so it is efficient to transform to the NCHW format on GPU (e.g., TensorFlow). Some DL compilers rely on hardware-specific libraries to achieve higher performance, and the libraries may require certain layouts. Besides, some DL accelerators prefer more complicated layouts (e.g., tile). Therefore, the compilers need to provide a way to perform layout transformations across various hardware.

The data layouts of tensors have a nontrivial influence on the final performance, and the transformation operations themselves also incur significant overhead because they consume memory and computation resources.

A recent work [62] based on TVM and targeting CPUs first alters the layout of all convolution operations in the computation graph to NCHW[x]c, in which c means the split sub-dimension of channel C and x indicates the split size of the sub-dimension. Then all x parameters are globally explored by auto-tuning during hardware-specific optimizations, given hardware details such as cache line size, vectorization unit size, and memory access pattern.
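
The index mapping behind NCHW[x]c can be written down directly: element (n, c, h, w) moves to (n, c // x, h, w, c mod x). The sketch below applies this mapping to nested Python lists; the function name and the requirement that C is a multiple of x are assumptions of this illustration.

```python
# Minimal sketch of the NCHW -> NCHW[x]c layout change on nested lists.
# The function name and the requirement that C is divisible by x are assumptions.

def nchw_to_nchwxc(tensor, x):
    """Split channel dimension C into C // x outer blocks and x inner channels."""
    n, c = len(tensor), len(tensor[0])
    h, w = len(tensor[0][0]), len(tensor[0][0][0])
    assert c % x == 0, "this sketch assumes C is a multiple of the split size x"
    return [[[[[tensor[b][co * x + ci][i][j] for ci in range(x)]
               for j in range(w)]
              for i in range(h)]
             for co in range(c // x)]
            for b in range(n)]

# A 1 x 4 x 2 x 2 tensor whose value encodes its own (channel, row, column) index.
N, C, H, W = 1, 4, 2, 2
data = [[[[100 * ch + 10 * i + j for j in range(W)] for i in range(H)]
         for ch in range(C)] for _ in range(N)]

packed = nchw_to_nchwxc(data, x=2)   # shape: N x (C/2) x H x W x 2
```

After packing, the x channels of one block are contiguous in the innermost dimension, which is what makes the layout convenient for vector units of width x.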

5.3.4 Case Study with Tensorflow XLA. To illustrate the computation graph optimizations concretely, we dump the HLO graph before and after each pass in Tensorflow XLA. We choose the Alexnet model as the input of the XLA compiler and a Volta GPU as the target hardware. The optimizations are shown in Figure 3. For simplicity, we remove the data layout information of some nodes. The algebraic simplification includes reducing consecutive transpose and reshape nodes into a single reshape node, as well as replacing the reshape node with a bitcast node. The CSE reuses the broadcast node. The cuDNN transformation transforms the convolution node into a function call (convForward) of cuDNN so that the graph optimizations can leverage the cuDNN library. The constant folding transforms the neighboring convolution (convForward) and add nodes into a convolution with bias (convBiasActivationForward). And the operator fusion fuses several bitcast nodes and an add node. Note that the implementation of frontend optimizations in DL compilers (e.g., XLA) consists of several stages. Therefore, the optimizations are performed several times, which may change the computation graph each time and thus introduce more opportunities for further optimizations.

5.3.5 Discussion. The frontend is one of the most important components in DL compilers. It is responsible for the transformation from DL models to high-level IR (i.e., computation graph) and for hardware-independent optimizations based on the high-level IR. Although the implementation of the frontend may differ in the data representation and operator definition of high-level IR across DL compilers, the hardware-independent optimizations converge at three levels: node-level, block-level and dataflow-level. The optimization methods at each level leverage DL-specific as well as general compilation optimization techniques, which reduce the computation redundancy and improve the performance of DL models at the computation graph level.

# 5.4 Backend Optimizations

The backends of DL compilers commonly include various hardware-specific optimizations, auto-tuning techniques and optimized kernel libraries. Hardware-specific optimizations enable efficient code generation for different hardware targets. Auto-tuning, in turn, has become essential in the compiler backend to alleviate the manual effort of deriving the optimal parameter configurations. Besides, highly-optimized kernel libraries are also widely used on general-purpose processors and other customized DL accelerators.

5.4.1 Hardware-specific Optimization. Hardware-specific optimizations, also known as target-dependent optimizations, are applied to obtain high-performance codes targeting specific hardware. One way to apply the backend optimizations is to transform the low-level IR into LLVM IR in order to utilize the LLVM infrastructure to generate optimized CPU/GPU codes. The other way is to design customized optimizations with DL domain knowledge, which can leverage the target hardware more efficiently. Since hardware-specific optimizations are tailored for particular hardware and cannot be included exhaustively in this paper, we present five widely adopted approaches in existing DL compilers. The overview of these hardware-specific optimizations is shown in Figure 4, and the detailed descriptions are provided as follows.

**Hardware intrinsic mapping** - Hardware intrinsic mapping transforms a certain set of low-level IR instructions into kernels that have already been highly optimized on the hardware. In TVM, hardware intrinsic mapping is realized through *extensible tensorization*, which can declare the behavior of a hardware intrinsic and the lowering rule for intrinsic mapping. This method enables the compiler backend to apply hardware implementations as well as highly optimized handcrafted micro-kernels to a certain pattern of operations, which results in a significant performance gain. Glow, in turn, supports hardware intrinsic mapping such as *quantization*. It can estimate the possible numeric range for each stage of the neural network and supports profile-guided optimization to perform quantization automatically. Besides, Halide/TVM maps specific IR patterns to SIMD opcodes on each architecture to avoid the inefficiency of LLVM IR mapping when encountering vector patterns.

**Memory allocation and fetching** - Memory allocation is another challenge in code generation, especially for GPUs and customized accelerators. For example, GPU contains primarily shared memory space (lower access latency with limited memory size) and local memory space (higher access latency with large capacity). Such memory hierarchy requires efficient memory allocation and fetching techniques for improving data locality. To realize this optimization, TVM introduces the scheduling concept of *memory scope*. Memory scope schedule primitives can tag a compute stage as *shared* or *thread-local*. For compute stages tagged as *shared*, TVM generates code with shared memory allocation as well as cooperative data fetching, which inserts memory barriers at the proper code positions to guarantee correctness. Besides, TC also provides similar features (known as *memory promotion*) by extending the PPCG [95] compiler. However, TC only supports limited predefined rules. Particularly, TVM enables special buffering in accelerators through *memory scope* schedule primitives.

Fig. 4. Overview of hardware-specific optimizations applied in DL compilers.

**Memory latency hiding** - Memory latency hiding is also an important technique used in the backend, which works by reordering the execution pipeline. As most DL compilers support parallelization on CPU and GPU, memory latency hiding can be naturally achieved by hardware (e.g., warp context switching on GPU). But for TPU-like accelerators with a *decoupled access-execute* (DAE) architecture, the backend needs to perform scheduling and fine-grained synchronization to obtain correct and efficient code. To achieve better performance as well as reduce the programming burden, TVM introduces the *virtual threading* schedule primitive, which enables users to specify data parallelism on a virtualized multi-thread architecture. TVM then lowers these virtually parallelized threads by inserting the necessary memory barriers and interleaves the operations from these threads into a single instruction stream, which forms a better execution pipeline for each thread to hide the memory access latency.

**Loop oriented optimizations** - Loop oriented optimizations are also applied in the backend to generate efficient codes for target hardware. Since Halide and LLVM [56] (integrated with polyhedral method) have already incorporated such optimization techniques, some DL compilers leverage Halide and LLVM in their backends. The key techniques applied in loop oriented optimizations include: loop fusion, sliding windows, tiling, loop reordering, and loop unrolling.

- *1) Loop fusion:* Loop fusion is a loop optimization technique that can fuse loops with the same boundaries for better data reuse. For compilers such as PlaidML, TVM, TC, and XLA, such optimization is performed by Halide schedule or polyhedral approach, while Glow applies loop fusion by its *operator stacking*.
- *2) Sliding windows:* Sliding windows is a loop optimization technique adopted by Halide. Its central concept is to compute values when needed and store them on the fly for data reuse until they are no longer required. As sliding windows interleaves the computation of two loops and makes them serial, it is a tradeoff between parallelism and data reuse.
- *3) Tiling:* Tiling splits loops into several tiles, and thus loops are divided into outer loops iterating through tiles and inner loops iterating inside a tile. This transformation enables better data locality inside a tile by fitting a tile into hardware caches. As the size of a tile is hardware-specific, many DL compilers determine the tiling pattern and size by auto-tuning.
- *4) Loop reordering:* Loop reordering (also known as loop permutation) changes the order of iterations in a nested loop, which can optimize the memory access and thus increase the spatial locality. It is specific to data layout and hardware features. However, it is not safe to perform loop reordering when there are dependencies along the iteration order.
- *5) Loop unrolling:* Loop unrolling can unroll a specific loop to a fixed number of copies of loop bodies, which allows the compilers to apply aggressive instruction-level parallelism. Usually, loop unrolling is applied in combination with loop split, which first splits the loop into two nested loops and then unrolls the inner loop completely.
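
As an illustration of tiling, the sketch below rewrites a naive matrix multiplication into outer loops over tiles and inner loops within a tile. The tile size of 2 and the function names are arbitrary assumptions; in a real backend the tile size would be chosen per target, typically by auto-tuning, and plain Python lists do not exhibit the cache behavior that motivates the transformation.

```python
# Minimal sketch of loop tiling on matrix multiplication.
# Function names and the default tile size are illustrative assumptions.

def matmul_reference(a, b):
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for kk in range(k):
                c[i][j] += a[i][kk] * b[kk][j]
    return c

def matmul_tiled(a, b, tile=2):
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # Outer loops iterate over tiles.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Inner loops iterate inside one tile; min() handles partial edge tiles.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

a = [[i + j for j in range(5)] for i in range(3)]   # 3 x 5
b = [[i * j for j in range(4)] for i in range(5)]   # 5 x 4

assert matmul_tiled(a, b, tile=2) == matmul_reference(a, b)
```

The tiled version performs the same multiply-accumulate operations in a different order, so with integer inputs the results match exactly.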

**Parallelization** - As modern processors generally support multi-threading and SIMD parallelism, it is important for the compiler backend to exploit parallelism in order to maximize hardware utilization for high performance. Halide uses a schedule primitive called *parallel* to specify the parallelized dimension of the loop for thread-level parallelization, and supports GPU parallelization by mapping loop dimensions tagged as *parallel* with annotations of *block* and *thread*. It also replaces a loop of size *n* with an *n-wide* vector statement, which can be mapped to hardware-specific SIMD opcodes through hardware intrinsic mapping. Stripe develops a variant of the polyhedral model called the *nested polyhedral model*, which introduces the *parallel polyhedral block* as its basic execution element of iteration. After this extension, a nested polyhedral model can detect hierarchical parallelization among levels of tiling and striding. In addition, some DL compilers, such as Glow, rely on handcrafted libraries or on optimized math libraries provided by hardware vendors (discussed in Section 5.4.3). Meanwhile, Glow offloads vectorization to LLVM because the LLVM auto-vectorizer works well when the information of tensor dimension and loop trip count is provided. However, exploiting the parallelism entirely in the compiler backend allows more domain-specific knowledge of DL models to be applied, and thus leads to higher performance at the expense of more engineering effort.

5.4.2 Auto-tuning. Due to the enormous search space for parameter tuning in hardware-specific optimizations, it is necessary to leverage auto-tuning to determine the optimal parameter configurations. Generally, the implementation of auto-tuning includes four key components: parameterization, cost model, searching technique and acceleration.

**Parameterization** - 1) Data and target: The data parameter describes the specification of the data, such as input shapes. The target parameter describes hardware-specific characteristics and constraints to be considered during optimization scheduling and code generation. For example, for the GPU target, hardware parameters such as shared memory and register size need to be specified. 2) Optimization options: The optimization options include the optimization scheduling and corresponding parameters, such as loop oriented optimizations and tile size. In TVM, both predefined and user-defined scheduling, as well as parameters, are taken into consideration. TC, in contrast, prefers to parameterize the optimizations that have a strong correlation with performance and can be changed later at low cost. For example, the minibatch dimension is one of the parameters, which is usually mapped to grid dimensions in CUDA and can be optimized during auto-tuning.

**Cost model** - The different cost models applied in auto-tuning compare as follows. *1) Black-box model*: This model only considers the final execution time rather than the characteristics of the compilation task. It is easy to build a black-box model, but it easily ends up with higher overhead and a less optimal solution without the guidance of task characteristics. *2) ML-based cost model*: An ML-based cost model is a statistical approach that predicts performance using a machine learning method. It enables the model to update as new configurations are explored, which helps to achieve higher prediction accuracy. *3) Pre-defined cost model*: An approach based on a pre-defined cost model expects a perfect model that is built on the characteristics of the compilation task and is able to evaluate the overall performance of the task. Compared to the ML-based model, the pre-defined model incurs less computation overhead when applied, but requires large engineering effort to rebuild the model for each new DL model and hardware.

**Searching technique** - 1) Initialization and searching space determination: The initial option can either be set randomly or based on known configurations, such as configurations given by users or historical optimal configurations. The searching space should be specified before auto-tuning. TVM allows developers to specify the searching space with their domain-specific knowledge and provides automatic search space extraction for each hardware target based on the computational description, whereas TC relies on the compilation cache and predefined rules. 2) Genetic algorithm (GA) [37]: GA is a metaheuristic inspired by the process of natural selection. It considers each tuning parameter as a gene and each configuration as a candidate. New candidates are iteratively generated by crossover, mutation and selection according to the fitness value, until the optimal candidate is finally derived. The rates of crossover, mutation and selection control the tradeoff between exploration and exploitation. TC adopts the genetic algorithm in its auto-tuning technique. 3) Simulated annealing algorithm (SA) [18]: SA is also a metaheuristic, inspired by annealing. It accepts worse solutions with a decreasing probability, which allows it to find an approximate global optimum and avoid getting stuck in a precise local optimum within a fixed number of iterations. TVM adopts the simulated annealing algorithm in its auto-tuning technique. 4) Reinforcement learning (RL): RL learns to maximize the reward in a given environment through the tradeoff between exploration and exploitation. Chameleon [14] (built upon TVM) adopts a reinforcement learning algorithm in its auto-tuning technique.
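
The acceptance rule of simulated annealing can be sketched in a few lines. The code below searches a one-dimensional list of tile sizes; the cooling schedule, the neighbor definition, and the cost function `synthetic_cost` are assumptions made for this illustration (a real tuner would measure run time on hardware or query a cost model), and the sketch does not reproduce the tuner of TVM.

```python
import math
import random

# Minimal simulated annealing sketch over an ordered list of candidate configurations.
# The schedule, neighbor rule, and synthetic cost are illustrative assumptions.

def simulated_annealing(cost, candidates, start_index=0, steps=200,
                        start_temp=1.0, seed=0):
    """Search an ordered list of candidate configurations for a low-cost entry."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    current = best = start_index
    for step in range(steps):
        # Linear cooling schedule; the small floor avoids division by zero.
        temp = max(start_temp * (1.0 - step / steps), 1e-9)
        # Neighbor: one position to the left or right, clamped to the valid range.
        neighbor = min(max(current + rng.choice((-1, 1)), 0), len(candidates) - 1)
        delta = cost(candidates[neighbor]) - cost(candidates[current])
        # Always accept improvements; accept worse moves with decreasing probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = neighbor
        if cost(candidates[current]) < cost(candidates[best]):
            best = current
    return candidates[best], cost(candidates[best])

def synthetic_cost(tile):
    """Made-up stand-in for a measured run time; minimal at tile == 32."""
    return (math.log2(tile) - 5.0) ** 2

tile_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]
best_tile, best_cost = simulated_annealing(synthetic_cost, tile_sizes)
```

The search keeps track of the best configuration seen so far, so occasionally accepting a worse move never degrades the returned result.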

**Acceleration** - 1) Parallelization: One direction for accelerating auto-tuning is parallelization. TC proposes a multi-thread, multi-GPU strategy under the consideration that the genetic algorithm needs to evaluate all candidates in each generation. First, it enqueues candidate configurations and compiles them on multiple CPU threads. The generated code is evaluated on GPUs in parallel, and each candidate owns its fitness used by the parent choosing step. After the whole evaluation finishes, the new candidate is generated, and the new compilation job is enqueued, waiting to be compiled on CPU. Similarly, TVM supports cross-compilation and RPC, which allows users to compile on the local machine and run the programs with different auto-tuning configurations on multiple targets. 2) Configuration reuse: Another direction for accelerating auto-tuning is to reuse previous auto-tuning configurations. TC stores the fastest known generated code version corresponding to the given configuration in a compilation cache. During compilation, the cache is queried before each kernel optimization, and auto-tuning is triggered on a cache miss. Similarly, TVM produces a log file that stores the optimal configurations for all scheduling operators and queries the log file for the best configurations during compilation. It is worth mentioning that TVM performs auto-tuning for each operator in Halide IR (e.g., conv2d), and thus the optimal configurations are determined for each operator separately.
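
Configuration reuse amounts to a lookup keyed by the tuning task, with tuning triggered only on a miss. The sketch below is a minimal in-memory illustration; the class `TuningCache`, the key `(operator, shape)`, and the stand-in `fake_tuner` are hypothetical and do not mirror the compilation cache of TC or the log format of TVM.

```python
# Minimal sketch of configuration reuse via a cache keyed by operator and input shape.
# TuningCache and fake_tuner are hypothetical names introduced for illustration.

class TuningCache:
    def __init__(self):
        self._best = {}        # (operator, shape) -> best known configuration
        self.tuning_runs = 0   # number of cache misses that triggered tuning

    def lookup_or_tune(self, op, shape, tune):
        """Return the cached configuration, running the tuner only on a miss."""
        key = (op, tuple(shape))
        if key not in self._best:
            self.tuning_runs += 1
            self._best[key] = tune(op, shape)
        return self._best[key]

def fake_tuner(op, shape):
    """Stand-in for an expensive auto-tuning run; returns a made-up configuration."""
    return {"tile": 8 if shape[-1] >= 64 else 4}

cache = TuningCache()
first = cache.lookup_or_tune("conv2d", (1, 64, 56, 56), fake_tuner)    # miss: tunes
second = cache.lookup_or_tune("conv2d", (1, 64, 56, 56), fake_tuner)   # hit: reuses
third = cache.lookup_or_tune("conv2d", (1, 3, 224, 224), fake_tuner)   # new key: tunes
```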

5.4.3 Optimized Kernel Libraries. There are several highly-optimized kernel libraries widely used to accelerate DL training and inference on various hardware. DNNL (previously MKL-DNN) from Intel, cuDNN from NVIDIA and MIOpen from AMD are widely used libraries. Both computation-intensive primitives (e.g., convolution, GEMM, and RNN) and memory bandwidth limited primitives (e.g., batch normalization, pooling, and shuffle) are highly optimized according to the hardware features (e.g., AVX-512 ISA, tensor cores). Customizable data layouts are also supported, which makes the libraries easy to integrate into DL applications and avoids frequent data layout transformations. Besides, low-precision training and inference, including FP32, FP16, and INT8 as well as the non-IEEE floating-point format bfloat16 [96], are also supported. Other customized DL accelerators also maintain their specific kernel libraries [50, 61].

Existing DL compilers, such as TVM, nGraph, and TC, can generate function calls to these libraries during code generation. However, if DL compilers need to leverage the existing optimized kernel libraries, they should first transform the data layouts and fusion styles into the types that are pre-defined in the kernel libraries. Such transformation may break the optimal control flow. Moreover, the DL compilers treat the kernel libraries as black boxes, and are therefore unable to apply optimizations across operators (e.g., operator fusion) when invoking kernel libraries. In sum, using optimized kernel libraries achieves significant performance improvement when the computation can be satisfied by specific highly-optimized primitives; otherwise, it may constrain further optimization and lead to less optimal performance.

5.4.4 Discussion. The backend is responsible for bare-metal optimizations and code generation based on low-level IR. Although the design of backends may differ due to various low-level IRs, their optimizations can be classified into hardware-specific optimizations, auto-tuning techniques and optimized kernel libraries. These optimizations can be performed separately or combined with each other to achieve better data locality and parallelization by exploiting the hardware/software characteristics. Eventually, the high-level IR of DL models is transformed into efficient code implementations on different hardware.

#### 6 CONCLUSION AND FUTURE DIRECTIONS

In this survey, we present a thorough analysis of the existing DL compilers targeting the design principles. First, we take a deep dive into the common architecture adopted in the existing DL compilers including the multi-level IR, the frontend and the backend. We present the design philosophies and reference implementations of each component in detail, with the emphasis on the unique IRs and optimizations specific to DL compilers. We summarize the findings in this survey and highlight the future directions in DL compiler as follows:

**Dynamic shape and pre/post processing** - Dynamic models, whose input shape or even the model itself may change during execution, are becoming more and more popular in the field of DL. Particularly, in the area of NLP, models may accept inputs of various shapes, which is challenging for DL compilers since the shape of data is unknown until runtime. Existing DL compilers require more research effort to support dynamic shapes efficiently for emerging dynamic models.

In addition, as future DL models become more complex, their entire *control flow* may inevitably include complicated pre/post-processing procedures. Currently, most DL compilers use Python as their programming language, so the pre/post-processing could become a performance bottleneck when it is executed by the Python interpreter. Such a potential performance bottleneck has not yet been considered by existing DL compilers. Supporting the entire *control flow* in DL compilers makes it possible to express and optimize the pre/post-processing along with DL models, which opens up new opportunities for performance acceleration in model deployment.

**Advanced auto-tuning** - Existing auto-tuning techniques focus on the optimization of individual operators. However, the combination of local optima does not necessarily lead to the global optimum. For example, two adjacent operators that operate on different data layouts can be tuned together without introducing extra memory transformations in between. Besides, with the rise of edge computing, execution time is not the only optimization objective for DL compilers. New optimization targets, such as memory footprint and energy consumption, should also be considered in auto-tuning.

Particularly, for the ML-based auto-tuning techniques, there are several directions worth further exploring. First, the ML techniques can be applied in other stages of auto-tuning, other than the cost model. For example, in the stage of selecting compiler options and optimization schedules, ML techniques can be used to predict the possibility directly and develop algorithms to determine the final configurations. Second, the ML-based auto-tuning techniques can be improved based on the domain knowledge. For example, incorporating the feature engineering (selecting features to represent program) [98] in auto-tuning techniques could be a potential direction for achieving better tuning results.

**Polyhedral model** - It is a promising research direction to combine polyhedral model and auto-tuning techniques in the design of DL compilers for efficiency. On one hand, the auto-tuning can be applied to minimize the overhead of polyhedral JIT compilation by reusing the previous configurations. On the other hand, the polyhedral model can be used to perform auto-scheduling, which can reduce the search space of auto-tuning.

Another challenge of applying the polyhedral model in DL compilers is supporting sparse tensors. In general, the format of a sparse tensor such as CSF [83] expresses the loop indices with index arrays (e.g., a[b[i]]), so the accesses are no longer linear. Such indirect index addressing leads to non-affine subscript expressions and loop bounds, which prohibits the loop optimization of the polyhedral model [22, 88]. Fortunately, the polyhedral community has made progress in supporting sparse tensors [92, 93], and integrating the latest advancements of the polyhedral model can increase the performance opportunities for DL compilers.
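
The non-affine accesses can be seen in a sparse matrix-vector product. The sketch below uses the simpler two-dimensional CSR format instead of CSF, purely to keep the illustration short; the function name and the example matrix are assumptions of this sketch.

```python
# Minimal sketch of indirect indexing in a sparse kernel, using CSR (not CSF) for brevity.
# The function name and the example data are illustrative assumptions.

def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = M @ x for a matrix M stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Loop bounds are read from row_ptr: they depend on data, hence are non-affine.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # x is addressed through col_idx[p]: an indirect (non-affine) subscript.
            y[i] += values[p] * x[col_idx[p]]
    return y

# CSR encoding of the dense matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Neither the inner loop bounds nor the subscript of `x` is an affine function of the loop variables, which is exactly the property that a classical polyhedral analysis relies on.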

**Subgraph partitioning** - DL compilers supporting subgraph partitioning can divide the computation graph into several subgraphs, and the subgraphs can be processed in different manners. Subgraph partitioning presents more research opportunities for DL compilers. First, it opens up the possibility to integrate graph libraries for optimization. Take nGraph and DNNL for example: DNNL is a DL library with graph optimizations that leverages a vast collection of highly optimized kernels. The integration of DNNL with nGraph enables DNNL to speed up the execution of the subgraphs generated by nGraph. Second, it opens up the possibility of heterogeneous and parallel execution. Once the computation graph is partitioned into subgraphs, the execution of different subgraphs can be assigned to heterogeneous hardware targets at the same time. Take the edge device for example: its computation units may consist of an ARM CPU, a Mali GPU, a DSP, and probably an NPU. Generating subgraphs from the DL compilers that utilize all computation units efficiently can deliver significant speedup of the DL tasks.

**Quantization** - Traditional quantization strategies applied in DL frameworks are based on a set of fixed schemes and datatypes with little customization for codes running on different hardware. Whereas, supporting quantization in DL compilers can leverage optimization opportunities during compilation to derive more efficient quantization strategies. For example, Relay [77] provides a quantization rewriting flow that can automatically generate quantized code for various schemes.

To support quantization, there are several challenges to be solved in DL compilers. The first challenge is how to implement new quantized operators without heavy engineering efforts. The attempt from AWS points out a possible direction that uses the concept of *dialect* to implement new operators upon basic operators, so that the optimizations at graph level and operator level can be reused. The second challenge is the interaction between quantization and other optimizations during compilation. For example, determining the appropriate stage for quantization and collaborating with optimizations such as operator fusion require future research investigations.

**Unified optimizations** - Although existing DL compilers adopt similar designs in both computation graph optimizations and hardware-specific optimizations, each compiler has its own advantages in certain aspects. There is currently no way to share the state-of-the-art optimizations, or the support of emerging hardware targets, across existing compilers. We advocate unifying the optimizations from existing DL compilers so that the best practices adopted in each DL compiler can be reused. In addition, unifying the optimizations across DL compilers can accumulate a strong force to impact the design of general-purpose and dedicated DL accelerators, and provide an environment for efficient co-design of DL compiler and hardware.

Currently, Google MLIR is a promising initiative towards such direction. It provides the infrastructure of multi-level IRs, and contains IR specification and toolkit to perform transformations across IRs at each level. It also provides flexible *dialects*, so that each DL compiler can construct its customized *dialects* for both high-level and low-level IRs. Through transformation across *dialects*, optimizations of one DL compiler can be reused by another compiler. However, the transformation of *dialects* requires further research efforts to reduce the dependency on delicate design.

**Differentiable programming** - Differentiable programming is a programming paradigm in which programs are differentiable throughout. Algorithms written in the differentiable programming paradigm can be automatically differentiated, which is attractive for the DL community. Many compiler projects have adopted differentiable programming, such as Myia [21], Flux [48] and Julia [19]. Unfortunately, there is little support for differentiable programming in existing DL compilers.

Supporting differentiable programming is quite challenging for existing DL compilers. The difficulties come not only from data structures, but also from language semantics. For example, to realize the transformation from Julia to XLA HLO IR, one of the challenges [33] is that the control flow differs between the imperative language used by Julia and the symbolic language used by XLA. In order to use HLO IR efficiently, the compiler also needs to provide operation abstractions for Julia in order to support the particular semantics of XLA, such as *MapReduce* and *broadcast*. Moreover, the semantic difference of differentiation between Julia and XLA also requires significant changes to compiler designs.

**Privacy protection** - In an edge-cloud system, the DL models are usually split into two halves, with the partial models running on the edge device and the cloud service respectively, which can provide better response latency and consume less communication bandwidth. However, one of the drawbacks of the edge-cloud system is that user privacy becomes vulnerable. The reason is that attackers can intercept the intermediate results sent from the edge devices to the cloud, and then use the intermediate results to train another model that can reveal the privacy information deviated from the original user task.

To protect privacy in edge-cloud system, existing approaches [36, 69, 73] propose to add noise with special statistic properties to the intermediate results that can reduce the accuracy of the attacker task without severely deteriorating the accuracy of the user task. However, the difficulty is to determine the layer where the noise should be inserted, which is quite labor intensive to identify the optimal layer. The above difficulty presents a great opportunity for DL compilers to support privacy protection, because the compilers maintain rich information of the DL model, which can guide the noise insertion across layers automatically.

#### **ACKNOWLEDGMENT**

The authors would like to thank Jun Yang from Alibaba and Yu Xing from Xilinx for their valuable comments and suggestions.

#### REFERENCES

- [1] [n.d.]. Announcing Hanguang 800: Alibaba's First AI-Inference Chip. https://www.alibabacloud.com/blog/announcing-hanguang-800-alibabas-first-ai-inference-chip\_595482. Accessed February 4, 2020.
- [2] [n.d.]. AWS Inferentia. https://aws.amazon.com/machine-learning/inferentia. Accessed February 4, 2020.

- [3] [n.d.]. Gluon. https://gluon.mxnet.io. Accessed February 4, 2020.
- [4] [n.d.]. Nervana Neural Network Processor. https://www.intel.ai/nervana-nnp/. Accessed February 4, 2020.
- [5] [n.d.]. Nvidia Turing Architecture. https://www.nvidia.com/en-us/design-visualization/technologies/turing-architecture/. Accessed February 4, 2020.
- [6] [n.d.]. ONNX Github repository. https://github.com/onnx/onnx. Accessed February 4, 2020.
- [7] [n.d.]. PaddlePaddle Github repository. https://github.com/PaddlePaddle/Paddle. Accessed February 4, 2020.
- [8] [n.d.]. PlaidML Github repository. https://github.com/plaidml/plaidml. Accessed February 4, 2020.
- [9] [n.d.]. Polyhedral Compilation. https://polyhedral.info. Accessed February 4, 2020.
- [10] [n.d.]. TensorRT Github repository. https://github.com/NVIDIA/TensorRT. Accessed February 4, 2020.
- [11] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
- [12] Kamel Abdelouahab, Maxime Pelcat, Jocelyn Serot, Cedric Bourrasset, and François Berry. 2017. Tactics to directly map CNN graphs on embedded FPGAs. IEEE Embedded Systems Letters 9, 4 (2017), 113–116.
- [13] Byung Hoon Ahn, Jinwon Lee, Jamie Menjay Lin, Hsin-Pai Cheng, Jilei Hou, and Hadi Esmaeilzadeh. 2020. Ordering Chaos: Memory-Aware Scheduling of Irregularly Wired Neural Networks for Edge Devices. arXiv:cs.DC/2003.02369
- [14] Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, and Hadi Esmaeilzadeh. 2020. Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation. arXiv:cs.LG/2001.08743
- [15] Alfred V Aho, Ravi Sethi, and Jeffrey D Ullman. 1986. Compilers: Principles, Techniques, and Tools. Addison-Wesley.
- [16] Roberto Bagnara, Patricia M Hill, and Enea Zaffanella. 2006. The Parma Polyhedra Library: Toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems. arXiv preprint cs/0612085 (2006).
- [17] Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, and Mohak Shah. 2016. Comparative study of caffe, neon, theano, and torch for deep learning. (2016).
- [18] Dimitris Bertsimas, John Tsitsiklis, et al. 1993. Simulated annealing. Statistical science 8, 1 (1993), 10–15.
- [19] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. 2017. Julia: A fresh approach to numerical computing. SIAM review 59, 1 (2017), 65–98.
- [20] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: composable transformations of Python+NumPy programs. http://github.com/google/jax
- [21] Olivier Breuleux and Bart van Merriënboer. 2017. Automatic Differentiation in Myia. (2017).
- [22] Chun Chen. 2012. Polyhedra scanning revisited. In Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation. 499–508.
- [23] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. 2018. The rise of deep learning in drug discovery. *Drug discovery today* 23, 6 (2018), 1241–1250.
- [24] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
- [25] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
- [26] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems. 3389–3400.
- [27] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
- [28] François Chollet et al. 2015. Keras. https://keras.io.
- [29] R. Collobert, K. Kavukcuoglu, and C. Farabet. 2011. Torch7: A Matlab-like Environment for Machine Learning. In BigLearn, NIPS Workshop.
- [30] Scott Cyphers, Arjun K Bansal, Anahita Bhiwandiwalla, Jayaram Bobba, Matthew Brookhart, Avijit Chakraborty, Will Constable, Christian Convey, Leona Cook, Omar Kanawi, et al. 2018. Intel ngraph: An intermediate representation, compiler, and executor for deep learning. arXiv preprint arXiv:1801.08058 (2018).
- [31] Ron Cytron, Jeanne Ferrante, Barry K Rosen, Mark N Wegman, and F Kenneth Zadeck. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems (TOPLAS) 13, 4 (1991), 451–490.
- [32] P. Feautrier. 1988. Parametric integer programming. RAIRO Recherche Opérationnelle 22, 3 (1988), 243–268.
- [33] Keno Fischer and Elliot Saba. 2018. Automatic full compilation of julia programs and ML models to cloud TPUs. arXiv preprint arXiv:1810.09868 (2018).

- [34] Rubén D Fonnegra, Bryan Blair, and Gloria M Díaz. 2017. Performance comparison of deep learning frameworks in image classification problems using convolutional and recurrent networks. In 2017 IEEE Colombian Conference on Communications and Computing (COLCOM). IEEE, 1–6.

- [35] David A Forsyth and Jean Ponce. 2002. Computer vision: a modern approach. Prentice Hall Professional Technical Reference.
- [36] Ruiyuan Gao, Ming Dun, Hailong Yang, Zhongzhi Luan, and Depei Qian. 2019. Privacy for Rescue: A New Testimony Why Privacy is Vulnerable In Deep Models. arXiv preprint arXiv:2001.00493 (2019).
- [37] David E. Goldberg. 1989. Genetic Algorithms in Search, Optimization and Machine Learning (1st ed.). Addison-Wesley Longman Publishing Co., Inc., USA.
- [38] David E Goldberg and John Henry Holland. 1988. Genetic algorithms and machine learning. (1988).
- [39] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. arXiv:stat.ML/1406.2661
- [40] Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates. In 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 152–159.
- [41] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2017. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 37, 1 (2017), 35–47.
- [42] Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang. 2017. A Survey of FPGA-Based Neural Network Accelerator. arXiv:cs.AR/1712.08934
- [43] Qianyu Guo, Xiaofei Xie, Lei Ma, Qiang Hu, Ruitao Feng, Li Li, Yang Liu, Jianjun Zhao, and Xiaohong Li. 2018. An Orchestrated Empirical Study on Deep Learning Frameworks and Platforms. arXiv preprint arXiv:1811.05187 (2018).
- [44] Jung-Woo Ha, Hyuna Pyo, and Jeonghee Kim. 2016. Large-scale item categorization in e-commerce using multiple recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 107–115.
- [45] William Grant Hatcher and Wei Yu. 2018. A survey of deep learning: platforms, applications and emerging research trends. *IEEE Access* 6 (2018), 24411–24432.
- [46] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
- [47] Jeremy Howard et al. 2018. fastai. https://github.com/fastai/fastai.
- [48] Michael Innes, Elliot Saba, Keno Fischer, Dhairya Gandhi, Marco Concetto Rudilosso, Neethu Mariya Joy, Tejan Karmali, Avik Pal Singh, and Viral Shah. 2018. Fashionable Modelling with Flux. arXiv preprint arXiv:1811.01457 (2018).
- [49] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In *Proceedings of the 22nd ACM international conference on Multimedia*. ACM, 675–678.
- [50] Zhe Jia, Blake Tillman, Marco Maggioni, and Daniele Paolo Scarpazza. 2019. Dissecting the Graphcore IPU Architecture via Microbenchmarking. arXiv preprint arXiv:1912.03413 (2019).
- [51] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In *Proceedings of the 44th Annual International Symposium on Computer Architecture*. 1–12.
- [52] Duseok Kang, Euiseok Kim, Inpyo Bae, Bernhard Egger, and Soonhoi Ha. 2018. C-GOOD: C-code generation framework for optimized on-device deep learning. In Proceedings of the International Conference on Computer-Aided Design. ACM, 105.
- [53] Wayne Kelly, Vadim Maslov, William Pugh, Evan Rosser, Tatiana Shpeisman, and Dave Wonnacott. 1996. The Omega Calculator and Library, Version 1.1.0. (1996). http://www.cs.utah.edu/~mhall/cs6963s09/lectures/omega.ps
- [54] Richard Kelsey, William Clinger, Jonathan Rees, et al. 1998. Revised⁵ report on the algorithmic language Scheme. (1998).
- [55] Adrian Kingsley-Hughes. 2017. Inside Apple's new A11 Bionic processor. ZDNet, September (2017).
- [56] Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. IEEE Computer Society, 75.
- [57] Chris Leary and Todd Wang. 2017. XLA: TensorFlow, compiled. TensorFlow Dev Summit (2017).
- [58] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. *Proc. IEEE* 86, 11 (1998), 2278–2324.
- [59] Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, and Depei Qian. 2020. The Deep Learning Compiler: A Comprehensive Survey. arXiv preprint arXiv:2002.03794 (2020).

- [60] Heng Liao, Jiajin Tu, Jing Xia, and Xiping Zhou. 2019. DaVinci: A Scalable Architecture for Neural Network Computing. In 2019 IEEE Hot Chips 31 Symposium (HCS). IEEE, 1–44.
- [61] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen. 2016. Cambricon: An Instruction Set Architecture for Neural Networks. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 393–405. https://doi.org/10.1109/ISCA.2016.42
- [62] Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. 2019. Optimizing CNN Model Inference on CPUs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 1025–1040.
- [63] Zhiqiang Liu, Yong Dou, Jingfei Jiang, and Jinwei Xu. 2016. Automatic code generation of convolutional neural networks in FPGA implementation. In 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 61–68.
- [64] Vincent Loechner. 1999. PolyLib: A library for manipulating parameterized polyhedra. https://repo.or.cz/polylib.git/blob\_plain/HEAD:/doc/parampoly-doc.ps.gz
- [65] Guoping Long, Jun Yang, and Wei Lin. 2019. FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads. arXiv:cs.DC/1911.11576
- [66] Guoping Long, Jun Yang, Kai Zhu, and Wei Lin. 2018. FusionStitching: Deep Fusion and Code Generation for Tensorflow Computations on GPUs. arXiv:cs.DC/1811.05213
- [67] Yufei Ma, Naveen Suda, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. 2018. ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler. *Integration* 62 (2018), 14–23.
- [68] Christopher D Manning and Hinrich Schütze. 1999. Foundations of statistical natural language processing. MIT press.
- [69] Fatemehsadat Mireshghallah, Mohammadkazem Taram, Prakash Ramrakhyani, Dean Tullsen, and Hadi Esmaeilzadeh. 2019. Shredder: Learning Noise Distributions to Protect Inference Privacy. (2019).
- [70] Mehdi Mohammadi, Ala Al-Fuqaha, Mohsen Guizani, and Jun-Seok Oh. 2017. Semisupervised deep reinforcement learning in support of IoT and smart city services. IEEE Internet of Things Journal 5, 2 (2017), 624–635.
- [71] Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, et al. 2018. A Hardware-Software Blueprint for Flexible Deep Learning Specialization. arXiv preprint arXiv:1807.04188 (2018).
- [72] Madhumitha Nara, BR Mukesh, Preethi Padala, and Bharath Kinnal. 2019. Performance Evaluation of Deep Learning frameworks on Computer Vision problems. In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI). IEEE, 670–674.
- [73] Seyed Ali Osia, Ali Taheri, Ali Shahin Shamsabadi, Kleomenis Katevas, Hamed Haddadi, and Hamid R Rabiee. 2018. Deep private-feature extraction. IEEE Transactions on Knowledge and Data Engineering 32, 1 (2018), 54–66.
- [74] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems. 8024–8035.
- [75] Tomas Petricek and Don Syme. 2012. Syntax Matters: Writing abstract computations in F#. Pre-proceedings of TFP (Trends in Functional Programming), St. Andrews, Scotland (2012).
- [76] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). Association for Computing Machinery, New York, NY, USA, 519–530.
- [77] Jared Roesch, Steven Lyubomirsky, Marisa Kirisame, Logan Weber, Josh Pollock, Luis Vega, Ziheng Jiang, Tianqi Chen, Thierry Moreau, and Zachary Tatlock. 2019. Relay: A High-Level Compiler for Deep Learning. arXiv:cs.LG/1904.08368
- [78] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein, et al. 2018. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907 (2018).
- [79] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. *Nature* 323, 6088 (1986), 533–536.
- [80] Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2135–2135.
- [81] Shayan Shams, Richard Platania, Kisung Lee, and Seung-Jong Park. 2017. Evaluation of deep learning frameworks over different HPC architectures. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 1389–1396.
- [82] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 17.

- [83] Shaden Smith and George Karypis. 2015. Tensor-matrix products with a compressed sparse tensor. In *Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms*. 1–7.

- [84] D Team et al. 2016. Deeplearning4j: Open-source distributed deep learning for the jvm. *Apache Software Foundation License* 2 (2016).
- [85] The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, et al. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688 (2016).
- [86] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), Vol. 5. 1–6.
- [87] Bart van Merrienboer, Olivier Breuleux, Arnaud Bergeron, and Pascal Lamblin. 2018. Automatic differentiation in ML: Where we are and where we should be going. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 8757–8767.
- [88] Nicolas Vasilache, Cédric Bastoul, and Albert Cohen. 2006. Polyhedral code generation in the real world. In *International Conference on Compiler Construction*. Springer, 185–201.
- [89] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730 (2018).
- [90] Stylianos I Venieris and Christos-Savvas Bouganis. 2016. fpgaConvNet: A framework for mapping convolutional neural networks on FPGAs. In 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 40–47.
- [91] Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Toolflows for Mapping Convolutional Neural Networks on FPGAs. Comput. Surveys 51, 3 (Jun 2018), 1–39. https://doi.org/10.1145/3186332
- [92] Anand Venkat, Mary Hall, and Michelle Strout. 2015. Loop and Data Transformations for Sparse Matrix Code. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15). Association for Computing Machinery, New York, NY, USA, 521–532.
- [93] Anand Venkat, Manu Shantharam, Mary Hall, and Michelle Mills Strout. 2014. Non-affine extensions to polyhedral code generation. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization. 185–194.
- [94] Sven Verdoolaege. 2010. isl: An integer set library for the polyhedral model. In *International Congress on Mathematical Software*. Springer, 299–302.
- [95] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim. 9, 4 (Jan. 2013), 54:1–54:23. https://doi.org/10.1145/2400682.2400713
- [96] Shibo Wang and Pankaj Kanwar. 2019. BFloat16: the secret to high performance on cloud TPUs. Google Cloud Blog, August (2019).
- [97] Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. 2016. DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family. In *Proceedings of the 53rd Annual Design Automation Conference*. ACM, 110.
- [98] Zheng Wang and Michael O'Boyle. 2018. Machine learning in compiler optimization. *Proc. IEEE* 106, 11 (2018), 1879–1901.
- [99] Gu-Yeon Wei, David Brooks, et al. 2019. Benchmarking tpu, gpu, and cpu platforms for deep learning. arXiv preprint arXiv:1907.10701 (2019).
- [100] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 29.
- [101] Yu Xing, Jian Weng, Yushun Wang, Lingzhi Sui, Yi Shan, and Yu Wang. 2019. An In-depth Comparison of Compilers for Deep Neural Networks on Hardware. In 2019 IEEE International Conference on Embedded Software and Systems (ICESS). IEEE, 1–8.
- [102] Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Murray, and Xiaoqiang Zheng. 2018. Dynamic Control Flow in Large-Scale Machine Learning. In Proceedings of the Thirteenth EuroSys Conference (EuroSys '18). Association for Computing Machinery, New York, NY, USA, Article 18, 15 pages. https://doi.org/10.1145/3190508.3190551
- [103] Tim Zerrell and Jeremy Bruestle. 2019. Stripe: Tensor Compilation via the Nested Polyhedral Model. arXiv preprint arXiv:1903.06498 (2019).

- [104] R. Zhao, S. Liu, H. Ng, E. Wang, J. J. Davis, X. Niu, X. Wang, H. Shi, G. A. Constantinides, P. Y. K. Cheung, and W. Luk. 2018. Hardware Compilation of Deep Neural Networks: An Overview. In 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 1–8. https://doi.org/10.1109/ASAP.2018.8445088