# A Survey on Neural Network Hardware Accelerators

Tamador Mohaidat and Kasem Khalil, Senior Member, IEEE

Abstract—Artificial intelligence (AI) hardware accelerators are an emerging research area for several applications and domains. The goal of a hardware accelerator is to provide high computational speed while retaining low cost and high learning performance. The main challenge is to implement complex machine learning models on hardware with high performance. This article presents a thorough investigation into machine learning accelerators and their associated challenges. It describes hardware implementations of different structures such as the convolutional neural network (CNN), recurrent neural network (RNN), and artificial neural network (ANN). Challenges such as speed, area, resource consumption, and throughput are discussed. It also presents a comparison between existing hardware designs. Last, the article describes the evaluation parameters for a machine learning accelerator in terms of learning and testing performance and hardware design.

Impact Statement—Neural networks have revolutionized the field of AI, empowering machines to acquire knowledge from data and accomplish tasks that were previously deemed unachievable. This survey covers various types of accelerators, including custom application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), graphics processing units (GPUs), and dedicated AI chips, and compares their performance, power efficiency, and scalability. The survey also discusses the design tradeoffs involved in building neural network accelerators, such as memory hierarchy, dataflow architecture, and precision. It provides insights into the latest trends and advancements in hardware accelerators for neural networks. It helps researchers, engineers, and practitioners in the field choose the right hardware platform for their specific needs and optimize the performance and energy efficiency of their neural network models. Moreover, this survey can also inspire new research directions and advancements in the neural network hardware accelerator design, paving the way for the next generation of intelligent systems.

Index Terms—Artificial intelligence (AI), artificial neural network (ANN), convolutional neural network (CNN), hardware accelerator, machine learning, machine learning design, machine learning on-chip, neural network, recurrent neural network (RNN).

Manuscript received 12 August 2023; revised 22 September 2023; accepted 4 February 2024. Date of publication 14 March 2024; date of current version 13 August 2024. This article was recommended for publication by Associate Editor Mehmet Onder Efe upon evaluation of the reviewers' comments. (Corresponding author: Kasem Khalil.)

Tamador Mohaidat is with Electrical and Computer Engineering Department, University of Mississippi, Oxford, MS 38677 USA.

Kasem Khalil is with Electrical and Computer Engineering Department, University of Mississippi, Oxford, MS 38677 USA, and also with Electrical Engineering Department, Assiut University, Asyut 71515, Egypt (e-mail: kmkhalil@olemiss.edu).

Digital Object Identifier 10.1109/TAI.2024.3377147

## I. INTRODUCTION

MACHINE learning has emerged alongside the growth of the Internet and multimedia technology, bringing a number of new issues with it. Machine learning has gained traction in various disciplines, with the goal of simulating human thought. Researchers from diverse fields collaborate to share their perspectives and techniques, contributing to the progression of the artificial intelligence (AI) field. Machine learning approaches are involved in several applications and domains, such as speech recognition [1], [2], image classification [3], [4], hardware and software fault prediction [5], [6], [7], text detection [8], [9], disease detection and prediction [4], [10], [11], nutrition monitoring [12], medical treatment [13], [14], and defect inspection and metrology [15], [16]. The main challenges of current approaches are providing an optimized learning model and a low-cost hardware accelerator.

Machine learning serves as a crucial step toward achieving AI. Functioning as a specialized branch of AI, machine learning involves utilizing data and algorithms to simulate human learning processes and continually improve its accuracy. Rather than relying on direct instruction, machine learning involves the use of mathematical models to facilitate computer learning. By utilizing historical data as input, machine learning algorithms can predict new output values. As the core component of AI, machine learning enables computers to possess intelligent capabilities. Through the development of various theories and methodologies, research in machine learning aims to establish a specific application-based study system that utilizes insights from human physiology and cognitive science to conduct theoretical analysis and advance algorithmic models. Machine learning tracks complex human actions in multimedia streams. The field of machine learning comprises various types, including reinforcement learning, supervised learning, and unsupervised learning. Supervised learning involves training a model using data that is labeled, while unsupervised learning involves training a model using data that is unlabeled [17], [18], [19]. Reinforcement learning, on the other hand, requires trial and error to train a model [20], [21].

The importance of machine learning comes from its capacity to speed up the creation of new products and grant businesses insights into consumer behavior trends and operational patterns. Many of today's leading companies incorporate machine learning into their operations. Machine learning has become a critical factor in setting businesses apart from their competitors. Machine learning is reshaping every industry, from healthcare and education to transportation and the food and entertainment industries to manufacturing and more. The fields of bioinformatics, physics, chemistry, material analysis, and other related disciplines utilize intelligent methods that enhance their content development. These methods leverage machine learning as a core technology. It will significantly influence nearly every facet of individuals' lives. Both cloud computing and the Internet of Things (IoT) are driving the expanded utilization of machine learning to enable objects and gadgets to become "smart" on their own [22], [23], [24].

2691-4581 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Deep learning typically employs multiple layered structures, such as convolutional neural networks (CNN) [25], [26], recurrent neural networks (RNN) [27], [28], and artificial neural networks (ANN) [29], [30], to process large-scale and unstructured data. Each structure has its own model of learning. ANN is based on multiple layers connected in a chain. Each layer has multiple nodes that perform the computation process. The nodes in each layer are interconnected with the nodes in the following layer until they reach the output layer. The architecture of a CNN consists of convolutional layers and pooling layers, with a pooling layer following each convolutional layer. A convolutional layer is used to run the computation of the input data with the stored weights. To reduce complexity in the subsequent layers, a pooling layer is utilized to diminish the data size. The RNN is based on memory for the learning process which makes it suitable for time series data.
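The convolution and pooling steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a hardware design: the 5x5 input, the 2x2 kernel values, and the function names `conv2d_valid` and `max_pool2d` are assumptions made for this example, and the convolution is written in the cross-correlation form commonly used in CNNs.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 5x5 input and 2x2 stored weights, chosen only for illustration.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 1],
    [0, 3, 1, 0, 2],
]
kernel = [[1, 0], [0, 1]]

features = conv2d_valid(image, kernel)   # 4x4 feature map
pooled = max_pool2d(features)            # 2x2 map after pooling
```

The pooled map has a quarter of the entries of the feature map, which is the data-size reduction that lowers the work in subsequent layers.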

Hardware implementation of machine learning plays a significant role in current low-cost applications [31], [32], [33], [34], [35], [36]. The main challenge is to provide a machine learning accelerator with high speed for problem classification and low hardware cost, without compromising the desired performance in terms of area and power [37], [38], [39], [40], [41], [42]. The objective of this article is to examine current machine learning hardware accelerator approaches and discuss their advantages and limitations. It also defines the general challenges of hardware accelerator design that future methods should consider. Furthermore, it presents the evaluation parameters for a hardware accelerator. This study examines a wide range of hardware accelerators without being limited to specific neural network architectures. It conducts an extensive comparative analysis of recent advancements, highlighting their respective strengths and shortcomings. This is accomplished by presenting numerical data for key performance metrics, including power consumption, area, and accuracy. With this detailed information, the reader gains a precise understanding of the distinctive attributes of each examined work. This article's main contributions are summarized as follows.

- 1) A list of the challenges facing machine learning hardware accelerators.
- 2) A comprehensive study of hardware accelerator systems.
- 3) A comprehensive review of machine learning hardware accelerators.
- 4) A comparison between the existing hardware accelerators.
- 5) An evaluation framework for machine learning hardware accelerators.

The subsequent sections of this article are organized as follows. Section II presents the unique challenges of machine learning hardware accelerators. Section III investigates multiple hardware accelerator systems. The models and datasets covered by the existing methods are presented in Section IV. Section V presents a review of machine learning accelerators with a comparison between the existing methods. Section VI presents an evaluation framework for hardware accelerators, followed by the conclusion and a discussion of future work in Section VII.

## II. ACCELERATOR CHALLENGES

Existing machine learning hardware accelerators face several challenges in delivering the desired performance at an acceptable cost. Machine learning models are complex, so their hardware implementations tend to be complex and slow. The research direction is therefore to propose designs with lower complexity that preserve performance and increase speed. The challenges are power/energy consumption, throughput, area, speed, learning performance, and resource consumption. Each is described as follows.

## A. Power Consumption

In cloud-based deep neural network (DNN) processing, power consumption is a critical factor because data centers operate under strict power limits driven by cooling costs. Additionally, data movement consumes more energy than arithmetic operations such as multiply-accumulate (MAC), because the capacitance that must be driven is much higher. Hence, it is crucial to report not only the energy efficiency and power consumption of the chip but also those associated with off-chip memory, such as dynamic random access memory (DRAM), or at least the frequency of off-chip accesses. By evaluating the energy efficiency and power consumption of the entire system, regardless of the specific memory technology employed, a more holistic assessment can be achieved [43]. Embedded system designers face an increasing challenge in reducing hardware resources and power consumption while still supporting the computational complexity of real-time applications. In some designs, the weights and intermediate results are stored in on-chip buffers to cut down on the time spent retrieving data from off-chip memory and on the power required to keep the system running. The main issue is to design a hardware accelerator with a lightweight structure to reduce power consumption.
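To make the system-level view concrete, the following Python sketch adds up compute, on-chip, and off-chip energy for two hypothetical designs. The per-operation energies in `ENERGY_PJ` and the access counts are placeholder assumptions chosen only to illustrate the accounting; they are not measurements from the surveyed works.

```python
# Hypothetical per-operation energies in picojoules; placeholders for
# illustration only, not values reported by any surveyed design.
ENERGY_PJ = {"mac": 1.0, "on_chip_buffer": 6.0, "dram": 200.0}


def system_energy_pj(num_macs, buffer_accesses, dram_accesses, energy=ENERGY_PJ):
    """Return total energy split into compute, on-chip, and off-chip parts."""
    parts = {
        "compute": num_macs * energy["mac"],
        "on_chip": buffer_accesses * energy["on_chip_buffer"],
        "off_chip": dram_accesses * energy["dram"],
    }
    parts["total"] = sum(parts.values())
    return parts


# The same 1,000,000 MACs in two hypothetical designs: every operand fetched
# from DRAM versus most operands served from on-chip buffers.
no_reuse = system_energy_pj(1_000_000, 0, 2_000_000)
with_buffers = system_energy_pj(1_000_000, 2_000_000, 50_000)
```

Under these assumed numbers the arithmetic itself is a small fraction of the total in both cases, which is why reporting chip-only figures can give an incomplete picture.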

## B. Throughput

Obtaining high throughput and low latency concurrently can be challenging, depending on the approach taken. Increasing the number of processing elements (PEs) can enhance the overall throughput by increasing the number of parallel MAC operations. However, the number of PEs is bounded by the system's area budget and the area of each PE. If the area budget of the system remains constant, increasing the number of PEs results in a decrease in the area per PE or a reduction in on-chip storage, which could affect how well the PEs are utilized. The area per PE can be decreased by sharing a single piece of logic to deliver operands to the MACs. The maximum throughput is determined by the number of PEs and the maximum throughput achievable by a single PE. However, the actual throughput depends on several factors, such as the network architecture, the weight and activation sparsity in the DNN model, and the batch size. Increasing the batch size can enhance data reuse and increase throughput. The overall impact of the DNN model on throughput is determined by the hardware's ability to support these approaches without reducing PE utilization, the number of PEs, or the cycles per second [43].
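The relationship between peak and actual throughput can be written as a short calculation. The sketch below is an assumption-laden illustration: the function names, the one-MAC-per-cycle PE, the 200-MHz clock, and the utilization figures are all hypothetical and only show how a larger PE array can yield a smaller gain than its size suggests.

```python
def peak_throughput(num_pes, macs_per_pe_per_cycle, clock_hz):
    """Upper bound on MAC operations per second for an array of PEs."""
    return num_pes * macs_per_pe_per_cycle * clock_hz


def effective_throughput(num_pes, macs_per_pe_per_cycle, clock_hz, utilization):
    """Scale the peak by the fraction of PE cycles doing useful work (0..1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_throughput(num_pes, macs_per_pe_per_cycle, clock_hz) * utilization


# Hypothetical design points: doubling the PEs under a fixed area budget
# shrinks on-chip storage, which is assumed here to lower utilization.
small_array = effective_throughput(256, 1, 200e6, 0.90)
large_array = effective_throughput(512, 1, 200e6, 0.50)
```

With these assumed utilizations, doubling the number of PEs raises the effective throughput by only about 11%.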

## C. Area

Machine learning accelerators face a multitude of challenges that can lead to area overhead, such as the need to perform both forward and backward passes without sharing any hardware resources between the two processes. Additionally, implementing the hardware accelerator on the chip can come at a high cost. To address these challenges, it is necessary to simplify complex machine learning models in hardware designs while also optimizing hardware components without sacrificing performance, making them more efficient and cost-effective.

## D. Performance

Neural networks face difficulties with throughput due to waiting for the processing unit to finish reading data. To address this issue, improved activation functions are proposed in machine learning accelerator designs to enhance accuracy and performance. Strategies such as pooling, convolutional or kernel processing are used to further improve accuracy. To achieve low latency and high efficiency, the neural network is accelerated using pipeline design and multichannel parallel processing. The main challenge is to maintain high performance in terms of sensitivity, accuracy, and specificity while avoiding the addition of complex hardware components.
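The latency benefit of a pipelined design follows from a standard cycle count, sketched below. The four-stage datapath and the assumption of one cycle per stage with no stalls are hypothetical simplifications for illustration.

```python
def sequential_cycles(num_inputs, num_stages):
    """Each input passes through every stage before the next one starts."""
    return num_inputs * num_stages


def pipelined_cycles(num_inputs, num_stages):
    """Stages overlap: fill the pipeline once, then finish one input per cycle."""
    if num_inputs == 0:
        return 0
    return num_stages + num_inputs - 1


# Hypothetical example: 1000 inputs through a 4-stage datapath
# (fetch, multiply, accumulate, activate), one cycle per stage, no stalls.
seq = sequential_cycles(1000, 4)    # 4000 cycles
pipe = pipelined_cycles(1000, 4)    # 1003 cycles
speedup = seq / pipe
```

For long input streams the speedup approaches the number of stages, provided the memory system can supply one input per cycle.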

## E. Resource Consumption

Reducing hardware resources poses a significant challenge due to the increased computational complexity of real-time applications. To address this challenge, some innovative architectures have been proposed that use a convolutional PE array. This PE array can reuse pixel and weight data effectively, thereby reducing the resources consumed while maintaining learning and testing performance. The convolutional PE array architecture exploits the fact that the convolution operation is friendly to both data and weight reuse. The array can perform multiple convolution operations simultaneously, and the weights for each convolution operation are stored in a weight buffer. The input pixels are stored in a buffer, which can be accessed multiple times during the computation. The PE array can also incorporate multiple output channels and multiple input channels to handle different types of convolution operations. By efficiently reusing the weight data and input data, the convolutional PE array architecture reduces the number of memory accesses, which results in a decrease in power consumption and hardware resources. This technique enables the hardware designer to strike a balance between performance and hardware resources, which is a critical aspect of designing hardware accelerators for machine learning applications.
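The effect of weight and pixel reuse on memory traffic can be illustrated by counting external-memory reads for a single convolution. This sketch assumes a stride-1, unpadded 2-D convolution and on-chip buffers large enough to hold the whole input and kernel; the function name and the 32x32 input with a 3x3 kernel are hypothetical choices for this example.

```python
def conv_memory_reads(height, width, kernel, reuse):
    """Count external-memory reads for one stride-1, unpadded 2-D convolution.

    reuse=False: every MAC fetches its pixel and weight from external memory.
    reuse=True:  weights and pixels are each loaded once into on-chip buffers
                 and then served to the PEs from those buffers.
    """
    out_h, out_w = height - kernel + 1, width - kernel + 1
    macs = out_h * out_w * kernel * kernel
    if reuse:
        return {"macs": macs, "weight_reads": kernel * kernel,
                "pixel_reads": height * width}
    return {"macs": macs, "weight_reads": macs, "pixel_reads": macs}


baseline = conv_memory_reads(32, 32, 3, reuse=False)
buffered = conv_memory_reads(32, 32, 3, reuse=True)
```

The number of MACs is identical in both cases (8100), while the external reads drop from 16200 to 1033 when the buffers are used.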

## F. Speed

The design of neural networks with both high speed and energy efficiency has been a challenging task. This has prompted researchers to explore alternatives to graphics processing units (GPUs) and central processing units (CPUs) for efficient acceleration of the algorithms used in neural network models. Due to the high energy cost per read and write operation and the long access time associated with external memory, a number of systems continue to experience difficulties handling their data loads. Alternate methods first adjust the memory to allow for a bigger data bus or make use of several memories distributed across the system in order to cut down on the overhead of this data movement. Parallel access makes it possible to handle many data streams during a single clock cycle, which both accelerates the system's overall speed and makes better use of its available hardware resources.
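A rough cycle count shows why a wider data bus or several distributed memories shorten data loading. The sketch assumes ideal conditions (no bank conflicts, every bank delivers its full width each cycle); the function name and the word counts are hypothetical.

```python
import math


def transfer_cycles(num_words, words_per_cycle_per_bank, num_banks=1):
    """Cycles needed to stream `num_words` when banks are read in parallel."""
    words_per_cycle = words_per_cycle_per_bank * num_banks
    return math.ceil(num_words / words_per_cycle)


# Hypothetical workload: 4096 words of weights and activations.
single_memory = transfer_cycles(4096, 1)               # one narrow memory
wide_bus = transfer_cycles(4096, 4)                    # 4x wider data bus
distributed = transfer_cycles(4096, 1, num_banks=8)    # 8 banks in parallel
```

In practice, access conflicts and uneven data placement would reduce these ideal gains.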

## III. HARDWARE ACCELERATOR SYSTEMS

Hardware accelerator systems are specialized hardware devices designed to accelerate the performance of specific tasks. These systems use dedicated hardware components such as FPGAs, ASICs, and GPUs to perform complex computations much faster than traditional CPUs, which is why they are widely used in a variety of industries. In the finance industry, hardware accelerators are used for a variety of purposes, including high-frequency trading, risk management, and fraud detection. High-frequency trading relies on the ability to make trades within fractions of a second, and hardware accelerators can process vast amounts of data in real time, making them a valuable tool. In healthcare, they can be used to accelerate medical imaging tasks such as magnetic resonance imaging (MRI) and computerized tomography scans, drug discovery and development, and genomics research. In scientific research, they can be used to accelerate simulations, modeling, and data analysis tasks. They also find applications in the autonomous vehicle, aerospace, and defense industries for tasks such as image processing, sensor data analysis, and control systems. Hardware accelerator systems are also used in high-performance computing applications such as machine learning, data analytics, and virtual reality [44], [45]. They can greatly improve the performance of computing devices, allowing for faster and more efficient processing of tasks. This can lead to improved productivity and reduced wait times for users. Additionally, hardware accelerator systems can help reduce energy consumption and lower costs. By offloading certain tasks from the CPU and GPU, these systems can operate more efficiently and require less power. This can result in significant savings for both individuals and organizations. The possibilities for the use of hardware accelerator systems are limited only by our imagination.

Fig. 1. Hardware accelerator systems.

There are various types of hardware accelerator systems available in the market today, each designed to accelerate specific types of tasks. Some of the common types include FPGA-based systems, CPU-based systems, GPU-based systems, and ASIC-based systems. Each type has its unique advantages and disadvantages. FPGA-based systems are highly flexible and can be programmed to perform different tasks. GPU-based systems are ideal for parallel processing and are commonly used in gaming and graphics applications. ASIC-based systems are highly optimized for specific tasks and offer the highest performance but are expensive to design and manufacture. Fig. 1 shows the hardware accelerators used by all the works examined for this study. As shown, 64% of the works used an FPGA to implement the DNNs, while the CPU accounts for 16%, and the GPU and ASIC account for 12% and 8%, respectively. Some common types of hardware accelerators include the following units:

## A. CPU

Current CPUs can execute single instruction, multiple data (SIMD) instructions by utilizing multiple arithmetic logic units (ALUs) simultaneously. This feature is particularly useful in image processing, as it allows the same instruction to be performed on a continuous stream of data. In computer vision, most operations are applied across the whole image [44], [46].

## B. GPU

Compared to general-purpose CPUs, GPUs have developed into a specialized architecture designed for parallel processing, with SIMD instruction extensions that allow for concurrent execution of multiple tasks, including image processing workloads [47]. Fig. 2 shows the GPU architecture. GPUs have processing cores that are far simpler than those of standard high-performance CPUs [46]. These cores often have simpler control logic, because they do not need to predict branches or prefetch data, and limited memory per core. Because the cores are simpler, a GPU can accommodate a much larger number of them on a single chip than a CPU. GPU architectures work very well in situations with few or no branching conditions. In addition, GPU architectures include a memory architecture designed specifically for high-speed data streaming for image processing.

Fig. 2. GPU architecture block diagram.

Fig. 3. FPGA architecture block diagram.

## C. Field Programmable Gate Array (FPGA)

An FPGA is a programmable integrated circuit that can be reconfigured multiple times to perform different circuit functions, instead of having a fixed design like a processor. An FPGA is composed of digital signal processors (DSPs), configurable logic blocks (CLBs), I/O pads, on-chip block RAMs (BRAMs), and routing channels, as shown in Fig. 3. In an FPGA, the data paths can be set up so that pixels are transferred directly between computing units and external memory without any intermediary steps. Furthermore, distributed BRAMs can be used to exploit data locality in vision kernels by storing pixels on the chip. Programmers can change the structure of the FPGA's hardware to implement whatever circuit they need. The FPGA's fine-grained parallel architecture gives it advantages over the GPU. Once the computation clock cycle time has been calculated, the designer can optimize the output mode to minimize the demand for data storage in main memory, thereby decreasing memory read delays. FPGA programming is powerful: it provides dynamic algorithm reconfiguration with robust reconfigurability. Also, an FPGA uses much less power than a GPU and delivers better performance for the same amount of power, which can help considerably with the problem of heat dissipation [45], [48].

Fig. 4. ASIC block diagram.

## D. Application-Specific Integrated Circuit (ASIC)

An ASIC is another type of integrated circuit, designed to perform specific tasks, as opposed to a CPU, which is a general-purpose processor, as shown in Fig. 4 [44]. ASICs are more specialized than GPUs, since an ASIC is built to perform a relatively small set of computations, whereas a GPU is still a massively parallel processor with thousands of processing units that can run multiple algorithms [49]. In contrast to an FPGA, an ASIC cannot be reprogrammed to do something different once it has been fabricated. Its logic is fixed when it is made, whereas on an FPGA a different design that better fits new needs can be loaded. As a result of this specialization, ASICs are usually substantially more energy efficient.

## IV. MODELS AND DATASETS

Datasets are essential for determining the accuracy of a DNN. Significant research effort has been expended over the decades to increase the performance of DNNs through innovative architectures. However, the constant need for more accuracy has led to new, deeper, and incredibly complex models [37], [47]. Fig. 5 shows the datasets most used across the articles examined in our survey; various datasets were utilized to assess the accuracy of the proposed DNN algorithms, and a single work may use multiple datasets. As shown, MNIST, ResNet, CIFAR, and ImageNet are the most popular. In general, there is a well-balanced distribution of research efforts among CIFAR, ResNet, and ImageNet, while many DNN hardware works focus on the MNIST dataset. A significant portion (27%) of the accuracy assessments are carried out on the basic MNIST network. Nonetheless, considerable attention is devoted to more sophisticated networks such as the CIFAR (18%) and ImageNet (18%) networks.

Fig. 5. DNN models used in the works we reviewed.

## V. ACCELERATOR APPROACHES

Several machine learning approaches, such as ANN, CNN, and RNN, are implemented on hardware. This section discusses the different machine learning accelerator approaches for each category as follows.

## A. ANN

Like the biological neural network in the human body, ANN features a layered architecture in which each network node can process input and forward output to other nodes in the network. The nodes are known as neurons. ANN is comprised of three or more interconnected layers. The first layer contains input neurons that transfer data to the subsequent layers. The output layer produces the final output data. The layers between the input and output layers are hidden and made up of units that adaptively transform the information received from the previous layer through a sequence of transformations. The ANN can understand more complicated objects since each layer works as an input and output layer. The neural layer is the term used to refer to these inner layers together. The units within a neural layer aim to learn from the gathered information by assigning weights based on the internal architecture of the ANN [50], [51], [52]. These principles enable units to provide a changed result, which is delivered as an output to the next layer. Fig. 6 presents the general block diagram of the ANN. An adder accumulator receives the product of the input from each node after it has been multiplied by a weight. The result of the adder accumulator is sent to an activation function, which returns the final result. The final output is expressed by the following equation:

$$y_k = f\left(\sum_{l=0}^{n} W_{lk} X_l + b_k\right). \tag{1}$$

Fig. 6. Block diagram of ANN.

Equation (1) represents the final output, where $n$ denotes the total number of neurons. Here, $X_l$ represents the output of the $l$th neuron in the previous layer, and $W_{lk}$ represents the synaptic weight from the $l$th neuron to the $k$th neuron in the current layer. Additionally, $f$ denotes the activation function and $b_k$ represents the bias. ANNs have extensive practical applications, such as speech and image recognition, business intelligence predictive analysis, natural language processing (NLP), and many more. ANNs have several advantages, including processing speed, scalability, and accuracy. The most significant advantage of ANNs is their high-speed processing, which can be achieved through parallel implementation.
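As a minimal sketch of Equation (1), the following Python code evaluates one layer with a multiply-accumulate loop followed by an activation function. The sigmoid activation, the function names, and all numeric values are assumptions chosen for illustration; hardware designs may use other activation functions and number formats.

```python
import math


def sigmoid(z):
    """One possible choice for the activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Multiply-accumulate the inputs with their weights, add the bias, apply f."""
    acc = bias
    for x_l, w_lk in zip(inputs, weights):
        acc += w_lk * x_l          # one MAC per incoming connection
    return activation(acc)


def layer_output(inputs, weight_columns, biases, activation=sigmoid):
    """Evaluate Equation (1) for every neuron k in the current layer."""
    return [neuron_output(inputs, w_k, b_k, activation)
            for w_k, b_k in zip(weight_columns, biases)]


# Hypothetical values: 3 outputs from the previous layer feeding 2 neurons.
x = [1.0, 0.5, -1.0]
W = [[0.2, -0.4, 0.1],    # weights W_lk into neuron k = 0
     [0.0, 0.0, 0.0]]     # weights W_lk into neuron k = 1
b = [0.1, 0.0]
y = layer_output(x, W, b)
```

Each neuron's sum is independent of the others, which is what allows the parallel implementation mentioned above: every entry of `y` could be computed by a separate MAC unit at the same time.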

Khalil et al. [53] present an area-efficient ANN implementation. The proposed method uses nonconventional hidden layers in a novel way to reduce the number of layers in the ANN by nearly half. Each of these layers serves two distinct functions through the intelligent use of adaptable weights, so that each one performs the tasks of two regular ANN layers. The other kind of layers, which are traditionally used, are fixed layers; each acts as a single layer and is not flexible. The proposed architecture minimizes the number of fixed layers, although specific applications may still require one or more fixed layers in addition to the adaptable layers. The method was implemented using Verilog HDL on an Altera Arria 10 GX FPGA and achieved a 41% reduction in area compared to the state-of-the-art method, with low overhead in terms of power consumption and speed. The benefits in terms of area reduction and accuracy are substantial.

Huynh [54] presents a feedforward DNN accelerator on FPGAs with an efficient hardware implementation architecture. The proposed architecture implements fully connected feedforward DNNs with a customizable number of layers, neurons per layer, and inputs, using only one physical processing layer. The network represents inputs, weights, and outputs in half-precision floating-point format using 16-b operands and uses the MNIST database for handwritten digit recognition. The performance evaluation is conducted on two Xilinx FPGA boards, the Virtex-5 XC5VLX-110T and the Zynq-7000 7Z045. The accuracy achieved for the 784-40-40-10 network on the Virtex-5 XC5VLX-110T chip is 97.20%, while for the 784-126-126-10 network on the Zynq-7000 7Z045 chip, it is 98.16%. The peak performance achieved is 15.81 and 15.90 thousand handwritten image frames per second (kFPS), respectively. Medus et al. [55] present novel hardware for implementing any feedforward neural network (FFNN), including logistic regression, multilayer perceptrons (MLPs), and autoencoders. The architecture allows for any number of layers, units per layer, and inputs and outputs. The hardware utilizes matrix algebra techniques and employs serial-parallel computation. It uses a systolic ring of neural processing elements (NPEs) that requires only as many NPEs as there are neuron units in the largest layer, regardless of the number of layers. Resource utilization increases linearly with the number of NPEs. This adaptable design works as a real-time application accelerator, and its size does not affect the system clock frequency. In contrast to most existing methods, the proposed approach uses a single activation function block (AFB) for the entire FFNN. Performance, accuracy, and resource usage are evaluated for several activation functions and network topologies. On a Virtex-7 FPGA, the architecture operates at a 550-MHz clock speed. The proposed approach achieves classification performance comparable to a floating-point approach using an 18-b fixed-point format. A smaller weight bit size has no effect on accuracy and allows for more weights in the same memory. A 256× acceleration was attained for a real-time application of abnormal cardiac detection, and different FFNNs for the Iris and MNIST datasets were evaluated.
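To illustrate what a fixed-point representation involves, the sketch below quantizes real values to signed integer codes and performs a multiply-accumulate on them. The 18-bit word length follows the figure reported above; the 12-bit fractional part, the saturation behavior, and the function names are assumptions made only for this example and are not taken from the surveyed design.

```python
def to_fixed(value, total_bits=18, frac_bits=12):
    """Quantize a real number to a signed fixed-point integer code.

    total_bits=18 follows the word length reported above; the 12-bit
    fractional part is an assumption made only for this illustration.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = round(value * scale)
    return max(lo, min(hi, code))      # saturate instead of wrapping


def to_float(code, frac_bits=12):
    """Convert a fixed-point integer code back to a real number."""
    return code / (1 << frac_bits)


def fixed_mac(acc, w_code, x_code, frac_bits=12):
    """Integer multiply-accumulate; the product is rescaled to frac_bits.

    The accumulator itself is not saturated in this sketch.
    """
    return acc + ((w_code * x_code) >> frac_bits)


w = to_fixed(0.75)       # 3072
x = to_fixed(-1.5)       # -6144
acc = fixed_mac(0, w, x)
result = to_float(acc)   # -1.125
```

Because the arithmetic is done on integers, the multiplier and adder are simpler than their floating-point counterparts, at the cost of a limited range and resolution set by the chosen bit split.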

Khalil et al. [56] present a network-on-chip (NoC)-based adaptive neural network. The NoC is composed of routers and PEs, where each router is connected to its PE, which consists of m nodes used for constructing neural network layers. A configuration packet is used to specify the number of layers and the number of nodes per layer. The method can allocate multiple routers to represent a layer based on the required configuration. Thus, the proposed solution enables the creation of several layers with flexible node configurations, as required by the application. The method is implemented on an Altera 10 GX FPGA, achieving an accuracy of 98.18% with the MNIST dataset. Multiple datasets are used for testing, and the results demonstrate that the approach uses resources comparable to those of the traditional method. Xiao et al. [57] present intrachip and interchip communication strategies for neural network accelerators, named NeuronLink. For intrachip communication, this study suggests techniques for route computation parallelization, arbitration interception, and scoring crossbar arbitration to improve virtual-channel routing. These techniques lead to a high-throughput NoC for multicast-based traffic while keeping the hardware cost low. Additionally, this work proposes a lightweight and NoC-aware chip-to-chip interconnection architecture to enable efficient interconnection for neural network processors that use NoCs for interchip communication. Zhang et al. [58] introduce a programmable in-memory computing accelerator (PIMCA) for DNN inference with low precision (1-2 b). PIMCA integrates 108 IMC SRAM macros with custom 10T1C bit cells in a 28-nm technology. These macros can store all weights of the targeted 1-b VGG-9 model, eliminating off-chip data movement during DNN inference. The IMC SRAM macros also improve speed and energy efficiency by eliminating row-by-row memory accesses. Additionally, a flexible SIMD processor is integrated to handle non-MAC operations, such as max-pooling, element-wise addition, and residual operations, eliminating the energy consumption and latency of data movement between the accelerator and the host. The test chip prototyped in a 28-nm technology achieves a system-level (macro-level) peak energy efficiency of 437 TOPS/W and a peak throughput of 49 TOPS at a 42-MHz clock frequency.

Song et al. [59] propose HYPAR, a method for determining layer-specific parallelism when training DNNs on an array of DNN accelerators. HYPAR divides the feature map tensors (input and output), kernel tensors, gradient tensors, and error tensors among the DNN accelerators, with each division representing the chosen parallelism for weighted layers. The primary optimization goal is to identify a partition that minimizes overall communication during the complete training of a DNN. Notably, HYPAR is practical because the partition search has linear time complexity. This approach is implemented within the HYPAR architecture, an HMC-based DNN training framework designed to reduce data movement. The results show an average performance increase of 3.39× and an energy efficiency improvement of 1.51× when compared to the default data parallelism. Wei et al. [60] introduce a novel approach called layer-conscious memory hierarchy (LCMH) for DNN accelerators. LCMH dynamically determines the appropriate memory levels for each layer based on its specific requirements for off-chip memory bandwidth and on-chip buffer size. This allows layers with high memory demands to avoid off-chip memory and keep their data on-chip instead. Experimental results demonstrate that designs implementing layer-conscious memory management achieve a speedup of up to 36% compared to designs using a uniform memory hierarchy, and a 5% improvement over current state-of-the-art designs.

## B. CNN

A CNN is a deep learning model used to process data with a grid-like structure, such as images. Its architecture is intended to learn spatial feature hierarchies automatically and adaptively, starting from low-level features and building up to high-level ones. CNNs are made up of a variety of building blocks that help them extract relevant features from input data [61], [62], [63], [64]. As shown in Fig. 7, CNN structures are typically made up of three types of layers: convolution layers (CLs), pooling layers, and fully connected layers. Convolution and pooling layers are used to extract features from input data. These features are then mapped to the final output, which is typically a classification result, using a fully connected layer. A crucial component of a CNN is the CL, which comprises a stack of mathematical operations, including the linear convolution operation. Pixel values are recorded in digital images as an array of numbers in a 2-D grid. CNNs use a small parameter grid called a kernel as an optimizable feature extractor, which is applied at each position in an image. This makes CNNs particularly efficient for image processing. The extracted features can grow hierarchically and become increasingly complicated as the output of one layer is fed into the next. Training uses backpropagation and gradient descent to adjust parameters such as kernels so that the disparity between outputs and ground truth labels is reduced. CNNs have become widely used in various applications, including image classification, image captioning, object detection, facial recognition, and semantic segmentation.

Fig. 7. Block diagram of CNN.

To perform a convolution operation in a CNN, we need to find a local receptive field in the input feature map that has the same size as the convolution kernel. Next, multiply the corresponding points by the convolution kernel. Finally, add the offset coefficient to the final result given by the following equation:

$$f = \text{bias} + \sum_{m}\sum_{n}(\text{pixel}_{mn} \times \text{weight}_{mn}) \tag{2}$$

Equation (2) uses m and n to index the height and width of the feature window; f is a pixel of the feature image generated by the convolutional layer, and bias is a constant value added to each feature window. The resulting values are typically passed through a nonlinear activation function such as the hyperbolic tangent (tanh), rectified linear unit (ReLU), or sigmoid. When convolution is performed without padding, the output dimension is less than the input dimension. The output size is a function of the input size, filter size, padding, and stride, and is given by the following equation:

$$D = \frac{H - F + 2P}{S} + 1 \tag{3}$$

In equation (3), D denotes the output size, H denotes the input size, F denotes the filter size, S denotes the stride size, and P denotes the padding size. The pooling layer is responsible for reducing network complexity, compressing features, and removing redundant information. The size of the feature map after pooling is determined by the pooling window size and its movement step (stride), and is also given by equation (3). The fully connected stage, or MLP, is the last part of a CNN and is composed of layers, each of which has many neurons (nodes). Nodes in one layer are directly connected to nodes in the preceding and subsequent layers. Finally, the fully connected layer is linked to the final output nodes, where classification results are produced. A classification function, such as SoftMax, can be utilized at this point. The SoftMax function is given by the following equation:

$$f(x_j) = \frac{e^{x_j}}{\sum_{i=1}^{k} e^{x_i}} \quad \text{for } j = 1, 2, 3, \dots, k \tag{4}$$

In equation (4), x is the input signal and k is the number of output classes.
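Equations (2)-(4) can be illustrated with a short NumPy sketch. It is a minimal, unoptimized illustration and not the implementation of any accelerator surveyed here. The function names (`output_size`, `conv2d`, `softmax`), the single-channel square-kernel restriction, the floor division for strides that do not divide evenly, and the max-subtraction used for numerical stability are all assumptions made for this example.

```python
import numpy as np


def output_size(h, f, s=1, p=0):
    """Equation (3): D = (H - F + 2P) / S + 1, rounded down (assumption)."""
    return (h - f + 2 * p) // s + 1


def conv2d(image, kernel, bias=0.0, stride=1, padding=0):
    """Naive single-channel convolution following equation (2).

    Each output pixel is the bias plus the sum of elementwise products
    between the kernel and the local receptive field.
    """
    h, w = image.shape
    f = kernel.shape[0]  # square kernel assumed
    out_h = output_size(h, f, stride, padding)
    out_w = output_size(w, f, stride, padding)
    padded = np.pad(image, padding)  # zero padding on every side
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            window = padded[top:top + f, left:left + f]
            out[r, c] = bias + np.sum(window * kernel)
    return out


def softmax(x):
    """Equation (4), with the maximum subtracted for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()
```

For example, a 28 x 28 input with a 5 x 5 filter, unit stride, and no padding gives `output_size(28, 5) == 24`, which is the shrinkage described above for unpadded convolution.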

Tang et al. [65] create an improved CNN image classification model. Maximum pooling is used in the model network structure. The article compares the accuracy obtained with four activation functions (Sigmoid, Tanh, ReLU, and T-ReLU) to improve neural network performance and image classification accuracy. The T-ReLU activation function improves the model, raising image classification accuracy from 62% to 76.52%. Khalil et al. [66] propose a hardware implementation of a new pooling method, absolute average deviation (AAD), for use in a CNN accelerator. AAD makes use of the spatial proximity of pixels by computing vertical and horizontal deviations, resulting in higher accuracy and lower area and power consumption compared to other pooling methods. The AAD pooling method achieves over 98% accuracy without increasing computational complexity. It was tested using various neural network structures and datasets, including EEG, ImageNet, Common Objects in Context (COCO), and United States Postal Service (USPS). VHDL was used to implement AAD on an Altera Arria 10 GX FPGA and in 45-nm technology using Synopsys Design Compiler. Song et al. [67] propose a multidie-based CNN accelerator. On the VU9P chip, each of the three accelerators is placed in an independent super logic region (SLR) and uses that region's on-chip resources, while the host computer manages the three accelerators. This system utilizes an 8-b quantization method to enhance the throughput and computational efficiency of a single DSP for accelerating the YOLOv4-tiny algorithm. The design fully reuses feature maps and weights during the calculation process and stores intermediate results in the on-chip buffer to minimize off-chip access, reduce bandwidth pressure, and decrease power consumption. Moreover, a designed instruction group enables the host computer to control the accelerator. This architecture achieves a frame rate of 148.14 frames per second (FPS) and a peak throughput of 2.76 tera operations per second (TOPS) at a frequency of 200 MHz with an energy efficiency of 93.15 GOPS/W. It delivers promising results in real-time target detection applications.

Ting et al. [68] propose a batch normalization (BN) processor that supports both training and inference. To accelerate CNN training, the proposed work develops an efficient dataflow that incorporates a novel BN processor design as well as processing elements for convolution acceleration. By sharing hardware elements between the two passes, this study takes advantage of the similar calculations required by the BN forward and backward passes, reducing the area overhead. The BN processor is implemented in a CMOS technology process, and the authors complete automatic placement and routing (APR), post-APR simulation of neural network training, and functional verification of the BN processor. The proposed solution accelerates the CNN training process while saving hardware, reducing the total area by 40.13%.

Khabbazan and Mirzakuchaki [69] describe optimized hardware for CNNs for use in embedded vision systems. This design method is intended for low-end hardware with minimal resources. The architecture targets a Zturn evaluation board with a Xilinx Zynq-7000 System-on-Chip (SoC) and optimizes all computations to be 8-b. Moreover, due to its high-speed performance, low power consumption, and compact size, the architecture is a suitable option for embedded CNN applications that require portability.

Xiao et al. [70] present a neural network acceleration architecture that is efficient, scalable, and has low latency and low error rates. The architecture achieves acceleration by utilizing multichannel parallel computing methods between layers and employing a pipeline design that prioritizes high efficiency and low latency. The addition of a line buffer to accommodate varying image widths and the implementation of a selectable convolution kernel size mechanism enhance the network's flexibility and scalability. The proposed neural network performs 32-b floating-point operations. Since CNNs are based on floating-point operations, converting floating-point values to fixed-point values for the FPGA implementation would cause a loss of precision and require time-consuming transformation work. The MNIST dataset is used for handwritten digit recognition to evaluate the solution experimentally. The acceleration strategy is implemented on the Xilinx Zynq-7000 FPGA, which processes a $28 \times 28$ handwritten image in 25.95 µs at a clock frequency of 200 MHz with an accuracy of 98.43%.

Lee et al. [71] present S3NAS, a quick hardware-aware NAS approach. The process is broken down into three stages: supernet design, Single-Path NAS for quick architectural exploration, and scaling and postprocessing. The initial stage involves creating a supernet, which is a set of candidate networks with two main features. First, it allows for varying numbers of blocks in the stages, and second, it permits blocks to have parallel layers with different kernel sizes (MixConv). To minimize the hyperparameter search overhead, a differentiable search is carried out by extending the single-path NAS method to include the MixConv layer and incorporating a loss term that takes latency into account. As the last step, the network is scaled to its maximum size within the latency constraint using compound scaling. In the postprocessing step, SE blocks and h-swish activation functions are incorporated if they are found to be advantageous. The efficiency of the proposed methodology is demonstrated by tests conducted on four different hardware platforms. Using TPUv3, the search process can be completed within 4 h, resulting in the discovery of networks that offer superior tradeoffs between latency and accuracy compared to state-of-the-art networks. Moreover, the resulting model is 0.6% more accurate and 14% faster than EfficientNet-B2.

Liu et al. [72] propose a hardware architecture tailored for streaming applications, with a strong emphasis on increasing computation efficiency by fully accelerating CNNs on FPGAs. To support inference of CNNs with varied topologies, the architecture integrates most computational functions, including convolutional and deconvolutional layers, into a single unified module. It efficiently handles concatenative and residual connections between the functions, resulting in highly customized acceleration. This design is further enhanced by utilizing various levels of parallelism, layer fusion, and full utilization of DSPs. The suggested accelerator has been tested using a variety of benchmark models and implemented on Intel's Arria 10 GX1150 hardware. The results show a high performance of over 1.3 TOP/s throughput and up to 97% computation efficiency.

Wang et al. [73] propose a double-buffer memory access structure, which considerably increases the computing unit's memory access efficiency. The proposed architecture utilizes a "ping-pong" buffer structure so that calculation delay overlaps with memory access delay, resulting in improved acceleration performance. To improve the computation performance of the computing unit, an accelerator structure with a multilevel cache is proposed to read and prepare data in advance. To prevent the processing unit from waiting while data are read, the double-buffer method performs calculation and data reading alternately. Based on the experimental results, the proposed accelerator achieves a detection speed of 15 FPS when processing an input image of size $3 \times 160 \times 320$, while maintaining the same test accuracy as the original design. This is a 1.5-times speedup compared to the original design.
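The benefit of overlapping computation with memory access can be seen in a small timing model. The sketch below is not the design of Wang et al. [73]; it only assumes two buffers, per-tile load and compute times given in arbitrary units, and a processing unit that starts a tile as soon as its data are loaded. The function names are hypothetical.

```python
def sequential_time(load, compute):
    """Single buffer: every tile is loaded first and computed afterwards."""
    return sum(load) + sum(compute)


def ping_pong_time(load, compute):
    """Two buffers: tile i+1 is loaded while tile i is being computed.

    After the first load, each step lasts as long as the slower of the
    current computation and the next load.
    """
    total = load[0]  # the first tile cannot be overlapped with anything
    for i in range(len(compute)):
        next_load = load[i + 1] if i + 1 < len(load) else 0
        total += max(compute[i], next_load)
    return total
```

With four tiles that each take 2 units to load and 3 units to compute, the single-buffer schedule needs 20 units while the ping-pong schedule needs 14, because every load after the first is hidden behind computation.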

Achararit et al. [74] offer an accuracy-and-performance-aware NAS (APNAS) that can efficiently create DNNs. APNAS is based on a weight-sharing and reinforcement learning (RL)-based exploration method. First, they provide a technique for calculating the cycle count in an RNN such that the network search does not require running a time-consuming hardware simulator. Additionally, they use analytical models for cycle count estimates to speed up the DNN creation process even further. The accuracy of these analytical models is demonstrated by the fact that they provide cycle count estimates that are comparable to those generated by a cycle-accurate hardware simulator. Then, in the RL, they establish a reward function that includes a configurable parameter for setting the tradeoff between the performance and accuracy of the generated DNNs. The study showed that APNAS could construct neural network models in 0.55 GPU days on an Nvidia GTX 1080Ti GPU, resulting in an average of 53% fewer cycles when compared to a manually developed neural network model (ResNet) and a state-of-the-art NAS. The CNNs generated by APNAS for two different image classification datasets (CIFAR-10 and CIFAR-100) required 52.78% and 53.57% fewer cycles compared to a manually designed CNN.

Yuan et al. [75] propose hardware-oriented compression and hybrid quantization techniques to reduce the memory requirements of CNNs. They classify all layers as either "no-pruning layers (NP-layers)" or "pruning layers (P-layers)" based on their processing features. The former use parallel computation for high performance with a regular weight distribution, while the latter have a high compression ratio but are asymmetric due to pruning. The approach aims to balance compression ratio and processing efficiency while maintaining reasonable accuracy by using uniform and incremental quantization techniques, as well as a distributed convolutional architecture with multiple parallel finite impulse response (FIR) filters for the regular model in the NP-layers. They introduce a shift-accumulator-based processing element with activation-driven data flow (ADF) for handling the irregular sparse model in the P-layers. They also propose a hardware/algorithm cooptimization (HACO) method based on the compression strategy and hardware architecture to implement an NP-P hybrid compressed CNN model on FPGAs. They implement the compressed VGG-16 model on a Xilinx VCU118 evaluation board for image applications and achieve a compression ratio of 27.5x for a hardware accelerator on a single FPGA chip without the use of off-chip memory, processing 83.0 FPS.

Huang et al. [76] propose an FPGA-based CNN hardware accelerator design, which utilizes a row-level pipelined streaming technique to calculate CLs using a multicomputing engine (CE) architecture. They also present a mapping mechanism to optimize the computational resource utilization ratio of the PE array, achieving up to 98.15%. Additionally, an effective data storage system is implemented to improve the work efficiency of the CE by continuously feeding input data. A weighted data allocation technique is proposed to reduce the need for off-chip bandwidth at the cost of some on-chip storage capacity. The design was tested on an XC7VX980T FPGA, achieving 1 TOPS at 150 MHz, which is approximately 98.15% of the theoretical throughput. Moreover, a ResNet-101 accelerator was implemented, achieving 600 GOPS at 100 MHz with up to 96.12% throughput efficiency. Kim et al. [77] present an ASIC accelerator for deep CNNs that uses a novel conditional computing technique to significantly reduce the number of redundant computations and external memory accesses. This technique, precision cascading (PC), reduces redundant convolution operations by combining them with the subsequent max-pooling operations. In addition, combining precision cascading with zero-skipping greatly reduces energy and external memory access. For the VGG-16 CNN on ImageNet, the accelerator achieved a peak/average energy efficiency of 8.85/1.22 TOPS/W at a voltage of 0.9 V and a low external memory access of 55.31 MB, or 0.0018 access/MAC. Cheng et al. [78] introduce a low-power sparse CNN accelerator featuring a preencoding radix-4 Booth multiplier. Leveraging the properties of the radix-4 Booth algorithm, the accelerator reduces the number and bit width of partial products (PPs) and encoder power consumption.
It incorporates an activation selector module that, after offline encoding of nonzero weights, chooses the activations corresponding to nonzero weights for the subsequent multiply-add operations. Additionally, it consolidates eight encoders from relevant multipliers into a single preencoding module to save area. The proposed work is developed in Verilog HDL and implemented in a 28-nm process. The proposed accelerator achieves an energy efficiency of 7.0325 TOPS/W at 50% sparsity and scales up to 14.3720 TOPS/W at 87.5% sparsity.
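To make the role of radix-4 Booth encoding concrete, the following sketch recodes a signed multiplier into radix-4 digits and multiplies by summing one partial product per digit. It illustrates the textbook recoding only and is not the preencoding circuit of Cheng et al. [78]; the function names and the default 8-b width are assumptions for this example.

```python
def booth_radix4_digits(y, bits=8):
    """Recode a signed `bits`-wide integer into radix-4 Booth digits.

    Each digit is -2*y[2i+1] + y[2i] + y[2i-1] with y[-1] = 0, so it lies
    in {-2, -1, 0, 1, 2}. Digits are returned least significant first.
    """
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    pattern = y & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x, y, bits=8):
    """Multiply x by y using one partial product per radix-4 digit."""
    digits = booth_radix4_digits(y, bits)
    # Zero digits contribute nothing and could be skipped in hardware.
    return sum(d * x * 4 ** i for i, d in enumerate(digits) if d != 0)
```

An 8-b multiplier yields four digits, so at most four partial products are generated instead of eight, and each partial product is a shifted copy of x, 2x, or their negation.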

Yu et al. [79] introduce an FPGA-based acceleration platform utilizing supertile methods tailored for general-purpose CNNs in data center applications. The design of a dispatching-assembling buffering model incorporating broadcast cache sets, tailored for a multi-supertile units (SU) architecture, significantly enhances both reading and writing bandwidth.



Fig. 8. Block diagram of RNN.

Additionally, the article discusses a 2-D filter processing unit capable of handling a class of filter-like and pointwise operations, striking a balance between design complexity and performance. The experiment demonstrates that the suggested architecture implemented on KU115 attains the highest peak performance and throughput on FPGAs. Furthermore, it operates at a level comparable to cutting-edge GPUs, but with a latency more than 50 times lower. In terms of total cost of ownership, the FPGA improves server throughput by 149.2%, albeit with a modest 31.5% increase in costs. Hwang et al. [80] propose GROW, a graph convolutional neural network (GCN) accelerator utilizing a row-stationary SpDeGEMM dataflow. Unlike previous SpDeGEMM accelerators, GROW combines software/hardware architecture codesign to minimize data movements, notably enhancing memory-bound SpDeGEMM performance. It employs a row-stationary dataflow based on Gustavson's algorithm, enabling flexible adaptation to heterogeneous sparsity patterns. Compared to GCNAX, GROW's dataflow significantly reduces memory bandwidth waste, particularly during the aggregation phase. While it introduces a more irregular reuse of dense matrices, GROW employs a graph partitioning algorithm to improve dataflow locality. This is coupled with a multirow stationary run-ahead execution model, enhancing memory-level parallelism and overall throughput.
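The row-wise product order of Gustavson's algorithm, on which the row-stationary dataflow described for GROW is based, can be sketched for a sparse-times-dense multiplication as follows. This is a software illustration only; the CSR layout, the helper `to_csr`, and the function name `spdegemm_row_wise` are assumptions for this example and do not reflect the GROW hardware.

```python
import numpy as np


def to_csr(dense):
    """Convert a dense 2-D array to CSR lists (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in dense:
        nz = np.nonzero(row)[0]
        indices.extend(nz.tolist())
        data.extend(row[nz].tolist())
        indptr.append(len(indices))
    return indptr, indices, data


def spdegemm_row_wise(indptr, indices, data, b):
    """Sparse (CSR) times dense product in row-wise (Gustavson) order.

    Output row i stays in place while each nonzero A[i, k] scales
    row k of the dense matrix B and accumulates into it.
    """
    n_rows = len(indptr) - 1
    out = np.zeros((n_rows, b.shape[1]))
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            out[i] += data[p] * b[indices[p]]
    return out
```

Only rows of B that correspond to nonzeros of A are ever read, which is why the access pattern to the dense matrix follows the sparsity pattern and becomes irregular, as noted above.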

## C. Recurrent Neural Network

RNNs are a particular class of neural network that function well for processing time series and other sequential data. An RNN has loops that allow information to be stored inside the network. Fig. 8 presents the general block diagram of the RNN. RNNs leverage prior knowledge to make predictions about future events, making them valuable for complex tasks that require sequence modeling of vectors. An RNN can be viewed as a series of interconnected networks. RNNs are an extension of feedforward networks to handle variable-length sequences. Some of the most commonly used recurrent architectures include gated recurrent units (GRUs) and long short-term memory (LSTM). RNNs are often designed with a chain-like structure, making them well-suited for tasks in NLP such as language translation, speech recognition, and sentiment analysis. However, traditional RNNs experienced poor network performance due to vanishing and exploding gradients in applications requiring long-term input data.

Fig. 9. Block diagram of LSTM.

LSTM is a variation of RNN that has been proposed to address this problem. However, LSTM introduces gating units and many extra parameters, making direct implementation difficult on resource-limited platforms like FPGAs. The LSTM block diagram comprises three gates and two activation functions, as shown in Fig. 9. The first gate, known as the "forget gate," is responsible for deciding which portions of the cell state should be disregarded. The next step involves determining which new data should be stored in the cell. The "input gate" determines the values that need updating, while the "tanh" function generates new candidate values. The final gate is the output gate, which determines which data will be sent out. Each part is calculated by the following equations [81]:

$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f) \tag{5}$$

$$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i) \tag{6}$$

$$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o) \tag{7}$$

$$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c) \tag{8}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \tag{9}$$

$$h_t = o_t \odot \tanh(c_t) \tag{10}$$

where $f_t$ is the result of the forget gate, $i_t$ is the result of the input gate, and $o_t$ is the result of the output gate. The matrix multiplication expands as $W_f[h_{t-1},x_t]=W_hh_{t-1}+W_xx_t$. $h_t$ is the cell output, whereas $\tilde{c}_t$ and $c_t$ denote the new candidate and final cell states, respectively. The weights of the forget gate, the input gate, and the output gate are, in order, $W_f$, $W_i$, and $W_o$, and $W_c$ is the weight of the candidate state. The biases are $b_f$ for the forget gate, $b_i$ for the input gate, $b_o$ for the output gate, and $b_c$ for the candidate state. The symbol $\odot$ represents elementwise (Hadamard) multiplication, $\sigma$ is the logistic sigmoid function, and tanh is the hyperbolic tangent function.
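A single LSTM time step following equations (5)-(10) can be written directly in NumPy. This is a minimal sketch for illustration; the function name `lstm_step`, the dictionary layout of the weights and biases (keys "f", "i", "o", "c"), and the choice to concatenate $h_{t-1}$ before $x_t$ are assumptions of this example.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W[g] has shape (hidden, hidden + inputs)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, eq. (5)
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, eq. (6)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, eq. (7)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate state, eq. (8)
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state, eq. (9)
    h_t = o_t * np.tanh(c_t)                 # cell output, eq. (10)
    return h_t, c_t
```

The four matrix-vector products dominate the cost of each step, which is why the accelerators discussed in this section concentrate on compressing or sparsifying the gate weights.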

RNNs have several low-cost hardware implementations [82], [83], [84]. Khalil et al. [28] propose economic LSTM (ELSTM), a new LSTM architecture requiring only a single gate. ELSTM requires fewer processing units. Using a single gate, the proposed method can retain memory and learn data sequences. The proposed method offers the advantage of faster learning by utilizing fewer computation units. Its accuracy is comparable to that of the other methods while using fewer units, and, compared to LSTM, it reduces area and power consumption by 34% and 35%, respectively. This design is attractive for low-cost hardware applications. The method is evaluated using three datasets: ImageNet, IMDB, and MNIST. Testing and implementation are performed on an Altera Arria 10 GX FPGA. Wu et al. [85] introduce an energy-efficient scalable processor that leverages the data locality of compressed RNNs. By eliminating redundant connections and sharing quantized values among several weights, the RNN models are significantly compressed. Adopting the quantized sparse matrix encoding significantly reduces repeated calculations and memory operations. Both approaches give the suggested design a high level of energy efficiency. A scalable architecture and a network cross-division approach enable hardware parallelism and flexibility. Compared to traditional processors, using compressed RNNs removes more than 80% of the weight fetches and matrix-vector multiplications for applications such as natural language processing and keyword spotting. The peak energy efficiency reaches 3.89 GOPS/mW. The processor achieves a peak performance of 24 GOPS and dissipates 6.16 mW with a 1.1-V supply at 200 MHz.
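As a quick consistency check on the figures quoted for Wu et al. [85], and assuming that the peak performance and the power dissipation refer to the same operating point (an assumption, not stated in the text), the energy efficiency follows from dividing throughput by power:

$$\frac{24\ \text{GOPS}}{6.16\ \text{mW}} \approx 3.9\ \text{GOPS/mW} = 3.9\ \text{TOPS/W}$$

This agrees with the reported peak of 3.89 GOPS/mW up to rounding, and the unit conversion makes the figure directly comparable with the TOPS/W values quoted for other designs in this section.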

Kadetotad et al. [86] propose an LSTM RNN accelerator based on an algorithm-hardware cooptimized memory compression method known as hierarchical coarse-grain sparsity (HCGS). HCGS offers considerable compression (16x) of LSTM weights with gentle error rate degradation while minimizing index memory cost. The suggested LSTM accelerator utilizes a combination of hierarchical blockwise sparsity and low-precision quantization to store the compressed weights of LSTMs consisting of three layers and 512 cells in only 288 kB of on-chip SRAM. This method reduces the necessary computation by up to 16 times. The prototype chip, fabricated using 65-nm LP CMOS technology, achieves an energy efficiency of up to 8.93 TOPS/W for real-time speech recognition. Experimental evaluations conducted on the TIMIT, TED-LIUM, and LibriSpeech datasets demonstrate the effectiveness and suitability of HCGS across multiple LSTM RNNs. Nan et al. [87] present a hybrid-iterative compression (HIC) technique for LSTM/GRU, which separates gating units into error-sensitive and error-insensitive groups and compresses them using different techniques, leveraging the error sensitivity of RNNs. Additionally, an energy-efficient accelerator for bidirectional RNNs is proposed. In this accelerator, weights are rearranged to optimize data flow in the matrix operation unit based on the block structure matrix (MOU-S). A fine-grained parallelism configuration of matrix-vector multiplications (MVMs) is used to improve BRAM utilization. The challenge of load imbalance between MOU-S and the matrix operation unit based on top-k pruning (MOU-P) is addressed through a timing matching technique. The proposed compressed LSTM/GRU architecture has been thoroughly assessed on the Xilinx ADM-PCIE-7V3 platform. Gao et al.
[88] propose EdgeDRNN, a GRU-based RNN accelerator optimized for low-latency edge RNN inference with a batch size of 1 while maintaining a lightweight design. To utilize temporal sparsity in RNNs, EdgeDRNN employs a delta network technique inspired by spiking algorithms. The weight storage of EdgeDRNN is implemented using low-cost off-chip DRAM, and it employs temporal sparsity to decrease memory bandwidth requirements during RNN updates. By employing sparse updates, the memory access to DRAM weights can be reduced by a factor of up to 10. Furthermore, the delta value can be dynamically adjusted to strike a balance between latency and accuracy requirements. This helps optimize EdgeDRNN for efficient edge RNN inference with low latency.
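The idea behind a delta network can be sketched as a matrix-vector product that only processes inputs whose change since they were last transmitted exceeds a threshold. The class below is an illustrative simplification and not the EdgeDRNN design; the class name `DeltaMatVec`, its attributes, and the strict comparison against the threshold are assumptions of this example.

```python
import numpy as np


class DeltaMatVec:
    """Matrix-vector product that skips inputs with small changes."""

    def __init__(self, weight, threshold):
        self.weight = weight
        self.threshold = threshold
        self.x_ref = np.zeros(weight.shape[1])  # last transmitted inputs
        self.acc = np.zeros(weight.shape[0])    # running W @ x_ref

    def step(self, x_t):
        delta = x_t - self.x_ref
        active = np.abs(delta) > self.threshold
        # Only the weight columns of active inputs need to be fetched.
        self.acc += self.weight[:, active] @ delta[active]
        self.x_ref[active] = x_t[active]
        return self.acc.copy(), int(active.sum())
```

The second return value counts how many weight columns were fetched in the step; when consecutive inputs are similar this count is small, which is the source of the memory bandwidth savings, while a larger threshold trades accuracy for fewer fetches.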

Shan et al. [89] introduce dynamic recurrent routing neural networks (DRRNets) as a solution to typical RNN problems such as complicated dependencies and gradient vanishing. The suggested DRRNets use the routing pointer matrix's low-rank attribute to construct adaptive routes for diverse dependencies and drastically decrease redundant parameters by discovering low-rank approximations for fully connected layers based on the inner structure of the cell state. The article contains an optimization algorithm for training the network and assesses the model's performance in a variety of tasks, including image classification, language modeling, and speaker recognition. Chen et al. [90] introduce a specialized hardware accelerator called "Eciton" designed for implementing LSTM neural networks. Eciton can conduct real-time inference for LSTM neural network models of practical size while operating within a power constraint of 17 mW. In comparison to FPGA implementations that demand higher power consumption, Eciton delivers competitive performance. This is achieved through the utilization of 8-b fixed-point weight quantization, hard sigmoid activation functions, and a carefully optimized microarchitecture, which minimize chip resource and memory demands. Although these quantization techniques lead to a slight accuracy reduction of approximately 5% when assessed on real-world predictive maintenance LSTM models consisting of 3 to 4 layers, the low resource requirements allow Eciton to fit within a cost-effective, low-power Lattice iCE40 UP5K FPGA.
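Two of the techniques named above, 8-b fixed-point quantization and the hard sigmoid, can be sketched as follows. The number format (6 fractional bits) and the hard sigmoid slope of 0.25 are assumptions chosen for this illustration; several variants exist, and the text does not specify which one Eciton uses. The function names are hypothetical.

```python
import numpy as np


def quantize_q(x, frac_bits=6):
    """Round to signed 8-b fixed point with `frac_bits` fractional bits."""
    scaled = np.round(np.asarray(x) * (1 << frac_bits))
    return np.clip(scaled, -128, 127).astype(np.int8)  # saturate


def dequantize_q(q, frac_bits=6):
    """Map 8-b fixed-point codes back to real values."""
    return q.astype(np.float64) / (1 << frac_bits)


def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation (slope 0.25 is an assumption)."""
    return np.clip(0.25 * np.asarray(x) + 0.5, 0.0, 1.0)
```

With 6 fractional bits, in-range values are reproduced with an error of at most 1/128, and the hard sigmoid needs only a shift, an add, and a clamp instead of an exponential, which is what makes both techniques attractive on small FPGAs.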

## D. Transformer-Based and Diffusion-Based Models

Transformer-based models have gained a significant amount of attention in recent years because of their outstanding results on NLP problems. The transformer architecture was first described by Vaswani et al. [91]. It uses self-attention mechanisms to capture dependencies between different input data elements, enabling parallel processing of sequences and reducing the sequential nature of conventional RNNs. Transformer-based models can also be applied to tasks such as accelerator optimization, automatic machine learning, and compiler optimization [92]. Diffusion-based models are a type of probabilistic model that propagates information across data points through repetitive processes. These models have found use in a variety of domains, such as image denoising, data imputation, generation tasks, data-driven accelerator design, and neural architecture search [93], [94]. Zhao et al. [95] introduce a transformer accelerator utilizing an output block stationary (OBS) dataflow to optimize memory access and improve DSP utilization, resulting in higher energy efficiency. By minimizing repeated memory access and employing block-level and vector-level broadcasting, the accelerator reduces the memory access bandwidth required for input and output. The FPGA-based verification of the proposed accelerator demonstrates a throughput of 728.3 GOPS and an energy efficiency of 58.31 GOPS/W when evaluating a transformer-in-transformer (TNT) model. Cheng et al. [96] present a novel transformer-based model for signal detection in a multiuser molecular communication (MMC) system. The model is trained using received data generated with varying initial distances between transmitters and receivers. The numerical results demonstrate that the trained transformer-based model exhibits excellent convergence and outperforms the traditional DNN in terms of signal detection, achieving a lower bit error rate.
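The self-attention mechanism mentioned at the start of this subsection can be sketched as single-head scaled dot-product attention. This is a minimal illustration without masking, multiple heads, or batching; the function names and the projection matrices `w_q`, `w_k`, and `w_v` are assumptions of this example.

```python
import numpy as np


def softmax(z, axis=-1):
    """Row-wise softmax with the maximum subtracted for stability."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has shape (sequence_length, model_dim); every position attends
    to every other position in one batch of matrix products.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # pairwise similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights
```

Because all positions are handled by the same matrix multiplications, there is no step-by-step recurrence as in an RNN, which is the property that makes the workload map well onto parallel hardware such as DSP arrays.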

## E. Large Language Models (LLMs)

An LLM accelerator is a specialized hardware or software component designed to enhance the performance of LLMs in NLP tasks. LLMs, such as BERT and OpenAI's GPT-3, have demonstrated remarkable capabilities in understanding and generating human-like text, but they come with substantial computational requirements, making them resource-intensive and time-consuming to run on standard hardware [97]. These accelerators leverage techniques like parallel processing, optimized memory access, and specialized circuit designs to improve the overall efficiency of language model computations. LLM accelerators have become crucial in a wide range of applications, including chatbots, language translation, text summarization, and sentiment analysis [98]. Maddigan and Susnjak [99] introduce a system called Chat2VIS, which harnesses the capabilities of LLMs. Through effective prompt engineering, Chat2VIS demonstrates a more efficient solution for language understanding, resulting in simpler and more accurate end-to-end outcomes compared to previous methods. The research reveals that Chat2VIS, utilizing LLMs and the proposed prompts, offers a reliable approach to generating visualizations from natural language queries, even when queries are imprecise or insufficiently specified. Moreover, this solution significantly reduces development costs for natural language interface systems while achieving superior visualization inference abilities compared to traditional NLP approaches that rely on handcrafted grammar rules and tailored models.

## F. Performance Comparison of Different Methods

Table I provides a comprehensive summary of various machine learning hardware accelerators, highlighting their key features, performance metrics, and targeted applications. It aims to offer an overview of the latest advancements in ML hardware acceleration, assisting researchers, developers, and technology enthusiasts in understanding the landscape of available solutions and their respective strengths. By analyzing the characteristics and capabilities of different accelerators, readers can make informed decisions regarding the most suitable hardware for their specific ML requirements.
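When comparing entries in Table I, energy efficiency can be approximated as throughput divided by power. The sketch below applies this to the figures reported for [69] and [85]; the function name is hypothetical, and pairing a reported peak throughput with a reported power figure is an assumption, since the original works may have measured the two at different operating points.

```python
def energy_efficiency_gops_per_watt(throughput_gops, power_watts):
    """Return throughput per unit power in GOPS/W."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return throughput_gops / power_watts


# [69]: 40.96 GOP/s at 1.77 W (values as listed in Table I).
eff_69 = energy_efficiency_gops_per_watt(40.96, 1.77)

# [85]: 24 GOPS peak at 6.16 mW (values as listed in Table I).
eff_85 = energy_efficiency_gops_per_watt(24.0, 6.16e-3)
eff_85_gops_per_mw = eff_85 / 1000.0  # convert GOPS/W to GOPS/mW

print(f"[69]: {eff_69:.2f} GOPS/W")
print(f"[85]: {eff_85_gops_per_mw:.2f} GOPS/mW")
```

For [85], the quotient is about 3.9 GOPS/mW, which is in line with the peak energy efficiency of 3.89 GOPS/mW listed in the table.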

## VI. EVALUATION

Evaluating a machine learning accelerator is essential for validating any design. The evaluation is divided into two parts: training and testing evaluation, and hardware evaluation. Each evaluation parameter is described as follows.

## A. Training and Testing Evaluation

In machine learning classification models, performance measures are used to evaluate how well the models perform in a specific context, and this evaluation helps to improve the models. Commonly used performance metrics include accuracy, specificity, precision, recall, F1 score, and the loss function. Measuring model performance is critical because it reveals the strengths and limitations of a model when making predictions in new situations. The counts of true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) are the basic quantities for evaluating classification models. TP is the number of correctly predicted positive cases, while TN is the number of correctly predicted negative cases. FP is the number of negative cases that were incorrectly predicted as positive, and FN is the number of positive cases that were incorrectly predicted as negative. These counts are used to calculate other performance measures, such as accuracy, precision, recall, and F1 score [66], [100]. The evaluation parameters are given by (11)–(20).
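The four counts defined above can be obtained directly from paired ground-truth and predicted labels. The following is a minimal sketch for the binary case; the function name `confusion_counts` and the example labels are hypothetical and serve only to illustrate the definitions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # positive case predicted as positive
            else:
                fp += 1  # negative case predicted as positive
        else:
            if truth == positive:
                fn += 1  # positive case predicted as negative
            else:
                tn += 1  # negative case predicted as negative
    return tp, tn, fp, fn


# Hypothetical labels for eight test samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # prints "3 3 1 1"
```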

1) Accuracy: Accuracy measures how well a test differentiates the classes, expressed as the proportion of correctly classified cases among all cases [55], [66], [100]. It indicates the quality of the result for a given task. Accuracy can be calculated using the following equation:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}.$$
 (11)

More complex DNN models typically require more computations and more memory resources to process the input data, which can lead to slower processing times and higher resource utilization. This is especially true for hardware implementations of DNNs, where the processing capabilities and resources are more limited compared to software implementations. As a result, there is often a tradeoff between model complexity, accuracy, hardware performance, and efficiency [43].

2) Sensitivity: Also known as the TP rate, it measures the ratio of correctly identified positive cases to the total number of actual positive cases, that is, the sum of TPs and FNs [55], [66], [100]. Sensitivity is given by the following equation:

$$Sensitivity = \frac{TP}{TP + FN}.$$
 (12)

3) Precision: It is the positive predictive value. It measures the proportion of TP predictions among all positive predictions produced by the model. A high precision indicates that the model is good at avoiding FPs [55], [66], [100]. Precision is given by the following equation:

$$Precision = \frac{TP}{TP + FP}.$$
 (13)
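Equations (11)–(13) translate directly into code. The sketch below is illustrative only: the function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention assumed here rather than one stated in the text.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (11): correctly classified cases over all cases."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def sensitivity(tp, fn):
    """Equation (12): TP rate, TP over all actual positive cases."""
    positives = tp + fn
    return tp / positives if positives else 0.0


def precision(tp, fp):
    """Equation (13): TP over all cases predicted as positive."""
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0


# Hypothetical counts from a 200-sample test set.
tp, tn, fp, fn = 90, 85, 15, 10
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.3f}")  # 0.875
print(f"sensitivity = {sensitivity(tp, fn):.3f}")       # 0.900
print(f"precision   = {precision(tp, fp):.3f}")         # 0.857
```

In this example, the model misses 10 of the 100 positive cases (sensitivity 0.9) and raises 15 false alarms among its 105 positive predictions (precision of about 0.857), which shows why the two measures are reported separately.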

TABLE I SUMMARY OF MACHINE LEARNING AND DEEP LEARNING HARDWARE ACCELERATORS

| Ref. | Method                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | Power                                                   | Area                      | Performance                                                           | Dataset                                                                     | Accuracy                                                | Hardware<br>Device |
|------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|---------------------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------|---------------------------------------------------------|--------------------|
| [54] | A fully connected feedforward DNN with a customizable number of layers, neurons per layer, and inputs is implemented using just one physical processing layer. 1) Advantages: Adequate recognition performance can be achieved with relatively modest network sizes, resulting in increased performance while consuming fewer hardware resources and less power. 2) Limitations: The floating-point DNN in this architecture exhibits comparatively lower performance on the MNIST dataset than the fixed-point and binary-based neural networks in related works. | N/A | N/A | 15.90 kFPS (thousand handwritten images per second) | MNIST | 98.16% | FPGA |
| [69] | An optimized hardware approach for CNNs for embedded vision systems. 1) Advantages: The presented design exhibits superior performance with significantly fewer DSP48 and BRAM resources and only half of the LUTs in comparison to previous CNN implementations. This accelerator operates at a frequency of 160 MHz and consumes 1.77 W of power, achieving a performance of 40.96 GOP/s while utilizing only 134 processing units and 601 KB of internal memory. 2) Limitations: This architecture has lower performance and energy efficiency than other related works. | 1.77 W | N/A | 40.96 GOP/s | AlexNet | N/A | FPGA |
| [56] | Adaptive hardware architecture for neural-network-on-chip.  Advantages: With the MNIST dataset, the proposed method achieves an accuracy of 98.18% and can construct networks of various sizes to suit different applications.  Limitations: The proposed method incurs an area overhead of 13% when compared to the state-of-the-art method.                                                                                                                                                                                                                                                                                                                                           | 0.97 W                                                  | 13% reduction             | N/A                                                                   | MNIST and<br>CIFAR                                                          | 98.18%                                                  | FPGA               |
| [66] | A novel hardware implementation of AAD pooling for a CNN accelerator. 1) Advantages: The AAD pooling technique takes into account pixel variations to obtain a highly accurate representation. The hardware implementation achieves excellent separability and high precision. AAD pooling achieved 98.51% accuracy without increasing computational complexity. 2) Limitations: Compared to the max and average methods, the proposed method incurs a minor increase in power consumption and execution time. | 0.31 mW | 244.46 nm<sup>2</sup> | N/A | EEG, ImageNet, Common Objects in Context (COCO), and USPS | 98.51% | FPGA |
| [53] | Reconfigurable hardware design approach for economic neural network.  1) Advantages: The suggested approach's hardware structure consists of a neural network with fewer layers than the state-of-the-art method, resulting in a 41% reduction in area while preserving performance efficiency. Furthermore, the suggested method allows for the configuration and updating of the number of layers in the on-chip design. It can be easily adapted for complex speech recognition and image classification problems.  2) Limitations: The power consumption of the proposed method is 44 mW, which is slightly higher (9%) than the power consumption of the state-of-the-art methods. | 44 mW                                                   | 41% reduction             | N/A                                                                   | MNIST,<br>CIFAR-10,<br>and USPS                                             | 96.8%                                                   | FPGA               |
| [71] | Fast hardware-aware neural architecture search methodology. 1) Advantages: The performance of this model exceeds that of the other models, achieving 0.6% higher accuracy than EfficientNet-B2 while running 14% faster. 2) Limitations: This design has a higher number of parameters and a larger number of FLOPs compared to EfficientNet-B1. | N/A | N/A | 14% faster than EfficientNet-B2 | ImageNet | 0.6% higher accuracy than EfficientNet-B2 | CPU and GPU |
| [72] | Full-stack acceleration of deep CNNs. 1) Advantages: The proposed method achieves a high level of performance with over 1.3 TOP/s throughput and up to 97% computation efficiency. 2) Limitations: The proposed method exhibits high resource utilization and compute density, which may lead to a reduced working clock frequency. | VGG16: 17.2 W<br>ResNet-50: 19.1 W<br>U-Net: 21.5 W | N/A | 1.3-1.59 TOP/s throughput and up to 97% computation efficiency | VGG16, ResNet-50, and U-Net | N/A | FPGA |
| [55] | Systolic parallel hardware architecture for the FPGA acceleration of FFNNs. 1) Advantages: The proposed architecture enables the implementation of a multilayer FFNN with up to 3600 neurons per layer on a single chip, without the need for external memory, achieving a maximum performance of 1980 GOPS. The architecture is designed in a way that it can be easily scaled to larger capacity devices or multichip configurations using a simple NPE ring extension. The proposed architecture can adopt any type of FFNN. 2) Limitations: Compared to some related works, the proposed architecture employs a higher number of DSP blocks and registers. | N/A | N/A | 1980 GOPS | Iris, MNIST, and MIT-BIH&AHA | 98.53% | FPGA |
| [67] | Design and implementation of a CNN accelerator based on multidie. 1) Advantages: At a clock frequency of 200 MHz, the accelerator architecture can achieve a peak throughput of 2.76 TOPS and a frame rate of 148.14 FPS. The proposed method has achieved significant improvements in performance, with a threefold increase, and energy efficiency, with a ratio of 93.15 GOPS/W, resulting in excellent real-time target detection results. 2) Limitations: The proposed method exhibits comparable power consumption to existing accelerators designed for the YOLOv4-tiny algorithm. | 12.689 W | N/A | 1) 148.14 FPS<br>2) A peak throughput of 2.76 TOPS | PASCAL VOC | N/A | FPGA |
| [65] | CNN image classification model.  1) Advantages: This model improves the image classification accuracy by 14.52% using the T-ReLU activation function.  2) Limitations: The model training requires much time. It also requires additional learning and improvement. To improve accuracy, additional fully connected layers can be added to the network or the number of nodes in the existing fully connected layer can be increased. However, the current paper only utilizes a single fully connected layer with 128 nodes and evaluates the network's performance on the small CIFAR-10 dataset.                                                                                                                                                    | N/A      | N/A              | N/A                                                                                                                                                       | CIFAR-10                            | 76.52% | N/A  |
| [68] | Batch normalization processor design for CNN training and inference. 1) Advantages: This method eliminates the need for a divider and saves the area that the original divider required. It can also finish normalizing each data set in a single cycle. The proposed architecture offers a significant reduction in the time required for the original division operation while also achieving a 40.13% reduction in the total area. 2) Limitations: It is not tested using datasets for accuracy verification. | N/A | 40.13% reduction | N/A | MNIST | N/A | FPGA |
| [73] | Accelerator structure with a double buffer memory access structure that significantly improves memory access efficiency. 1) Advantages: The design proposed in this article accelerates the processing by approximately 1.4 times compared to the original design, as measured by the number of cycles required to complete the task. 2) Limitations: The proposed design shows an increase in resource utilization for BRAM, LUT, and FF compared to the original design. | N/A | N/A | 1) Detection speed of 15 FPS when the input image is 3 × 160 × 320<br>2) An acceleration of approximately 1.4 times over the original design | SkyNet | N/A | FPGA |
| [70] | Multi-channel parallel processing across layers and a pipeline design make the neural network acceleration architecture efficient, scalable, low-latency, and low-error. 1) Advantages: The proposed neural network acceleration architecture results in low latency, high accuracy, and scalability. With a clock frequency of 200 MHz, it can process 28 × 28 handwritten images in only 25.9 μs, and the DSP consumption is reduced through a reasonable multiplexing method. The network achieves a 98.43% accuracy rate with minimal errors, making it suitable for a wide range of applications. 2) Limitations: Compared to previous CNN implementation methods, the utilization of FFs in this method is higher, so it requires more resources. | N/A | N/A | It takes 25.9 μs to process a 28 × 28 handwritten image at a clock frequency of 200 MHz | MNIST | 98.43% | FPGA |
| [85] | A novel RNN processor has been proposed that prioritizes energy efficiency by leveraging data locality and network compression techniques through a new quantified sparse matrix encoding format. 1) Advantages: The proposed processor design utilizes the network cross-division method, allowing for a high degree of flexibility and parallelism in handling various sizes of embedded RNN applications while maintaining scalability. 2) Limitations: The number of MACs and SRAM units in this design exceeds that of some related works. | 6.16 mW | 0.65 mm<sup>2</sup> | 1) The peak performance is 24 GOPS<br>2) The peak energy efficiency reaches 3.89 GOPS/mW | THCHS-30 Chinese speech corpus and a command word library | N/A | ASIC |
| [86] | An energy-efficient LSTM RNN accelerator with hierarchical coarse-grain sparsity (HCGS) memory compression, an algorithm-hardware cooptimized memory compression method. 1) Advantages: Compared with earlier research, the hierarchical blockwise sparsity technique shows advantageous error rate and memory compression tradeoffs. It has a high MAC efficiency, reaching 99.66%. 2) Limitations: It has higher power consumption than some existing methods. | 67.3/1.85 mW | 7.74 mm<sup>2</sup> | 1) 8.93 TOPS/W for two-layer LSTM for the TIMIT data set<br>2) 7.22/7.24 TOPS/W for three-layer LSTM for the TED-LIUM/LibriSpeech data sets | TIMIT, TED-LIUM, and LibriSpeech data sets | 20.6% PER for TIMIT, 21.3% WER for TED-LIUM, and 11.4% WER for LibriSpeech data sets | N/A |
| [74] | The new DNN design framework APNAS emphasizes accuracy and efficiency during neural architecture search.  1) Advantages: APNAS is capable of generating DNNs with fewer parameters (i.e., cycle count) while maintaining relatively high accuracy compared to state-of-the-art NAS techniques. This is achieved by adjusting the weight of the RNN to account for cycle count, allowing APNAS to successfully trade off accuracy and cycle count.  2) Limitations: This model has less accuracy than other state-of-the-art NAS techniques.                                                                                           | N/A                                           | N/A                  | It offers an average of 53% fewer cycles than state-of-the-art techniques                                                                                                                                               | CIFAR-10<br>and CIFAR-<br>100                                             | 93.75%                                                                                   | FPGA |
| [75] | Hardware-oriented compression and hybrid quantization techniques require less memory for CNN accelerator.  1) Advantages: The compressed VGG-16 architecture proposed in the article achieves high performance without the need for off-chip memory. It can process images at a rate of 83.0 FPS while maintaining the same level of accuracy. The hardware design can be used in various real-time image processing applications that have limited resources.  2) Limitations: The proposed method exhibits a slightly lower accuracy compared to other state-of-the-art FPGA designs, also it comes with high resource utilization. | N/A                                           | N/A                  | 83.0 FPS, and a compression ratio of 27.5x is achieved for a hardware accelerator on a single FPGA chip without using off-chip memory                                                                                   | VGG-16,<br>ResNet-50,<br>and ResNet-<br>152                               | N/A                                                                                      | GPU  |
| [76] | A novel multi-CE-based row-level pipelined streaming method for calculating CLs. 1) Advantages: The PE array exhibits exceptional work efficiency, reaching up to 99.83%, with a remarkable resource utilization ratio of 98.15%. The VGG16 and ResNet-101 accelerators, which utilize this PE array, achieve throughput efficiencies of 98.15% and 96.12%, respectively, outperforming other existing works. 2) Limitations: It requires a higher number of multiplications than the state-of-the-art methods. | 14.36 mW | N/A | 1) The work efficiency of the PE array is up to 99.83%<br>2) The VGG16 accelerator on an XC7VX980T FPGA achieves 1 TOPS at 150 MHz<br>3) A ResNet-101 accelerator is also implemented, achieving 600 GOPS at 100 MHz | VGG-16 and ResNet-101 | N/A | FPGA |
| [77] | ASIC accelerator for deep CNNs. 1) Advantages: This work demonstrates significant improvements in DCNN inference throughput and overall system-level energy efficiency, which includes both the accelerator chip and off-chip memory. 2) Limitations: It requires more on-chip SRAM than the state-of-the-art methods. | 203 mW | N/A | 1) Peak energy efficiency of 8.85 TOPS/W at a 0.9 V supply<br>2) Low external memory access of 55.31 MB (or 0.0018 access/MAC) for ImageNet classification with the VGG-16 CNN | ImageNet | N/A | ASIC |
| [87] | A HIC technique for LSTM/GRU. 1) Advantages: This method achieves a significant compression ratio of 37.1/32.3 for LSTM/GRU with negligible accuracy loss. It also shows improved energy efficiency for LSTM networks (5%–237%) and a 58% improvement for GRU networks. 2) Limitations: It has lower FPS computation than the state-of-the-art methods. | GRU: 14.906 W<br>Vanilla LSTM: 17.106 W | N/A | 1) The improvement in the energy efficiency of the LSTM networks is 5%–237%<br>2) A 58% improvement in GRU networks<br>3) It achieves 379 507 FPS<br>4) It achieves a latency of 2.635 μs | DeepSpeech2 (AN4) | N/A | FPGA |
| [28] | ELSTM architecture, which only requires a single gate, is more suitable for low-cost hardware design due to its simplicity in terms of components. The gate is responsible for data deletion and updating, and its output is fed to three components: the memory layer, the update layer, and the output layer. ELSTM requires fewer processing units compared to traditional LSTM architectures. Advantages: ELSTM has a low error and faster training, allowing it to achieve high accuracy faster. Compared to LSTM, the proposed method saves 34% of area and 35% of power consumption. | 1.192 W | 34% reduction | The proposed method has a latency of 23 ms and a throughput of 258.4 MOPS | MNIST, IMDB, and ImageNet | 90.89% | FPGA |
| [57] | NeuronLink is a neural network accelerator communication mechanism that combines intrachip and interchip communication techniques.  1) Advantages: The neuron link-based DNN accelerator proposed in this work outperforms previous NoC-based DNN accelerators in terms of power efficiency, with a 1.27×-1.34× improvement, and area efficiency, with a 2.01×-9.12× improvement.  2) Limitations: The design requires a significant number of FIFOs to store various packets with different priorities and routers, which increases the demand for BRAMs.                                  | 15.4 W    | 41.2 mm <sup>2</sup>      | Neuron Link-based DNN accelerator outperforms the previous NoC-based DNN accelerators 1.27×-1.34× in terms of power efficiency and 2.01×-9.12× in terms of area efficiency | ResNet-18<br>and AlexNet        | N/A                                                                                                    | FPGA        |
| [88] | EdgeDRNN is an RNN accelerator optimized for low-latency edge inference and based on a lightweight implementation of GRU networks. 1) Advantages: EdgeDRNN achieves an average effective throughput of 20.2 GOp/s and a wall plug power efficiency over 4× higher than commercial edge AI platforms. 2) Limitations: It uses off-chip weight storage. | 2.3 W | N/A | An effective throughput of 20.2 GOp/s and a wall plug power efficiency over 4× higher than commercial edge AI platforms | TIDIGITS and SensorsGas | 99% | FPGA |
| [78] | A low-power sparse CNN accelerator featuring a pre-encoding radix-4 Booth multiplier. 1) Advantages: This design surpasses others in terms of both area and energy efficiency. 2) Limitations: The accelerator struggles to achieve high MAC utilization in small-size convolutional layers and fully connected layers. | 160.17 mW | 0.4839 mm<sup>2</sup> | 1) 7.0325 TOPS/W at 50% sparsity<br>2) 14.3720 TOPS/W at 87.5% sparsity | VGG16 and AlexNet | N/A | FPGA |
| [89] | DRRNets are proposed as a remedy for traditional RNN challenges like intricate dependencies and gradient vanishing.  Advantages: DRRNets utilize fewer parameters than other models in the same domain. This model is the first to use the low-rank property in terms of input structure, and both the decoupling and low-rank features are imposed on fully connected layers.                                                                                                                                                                                                              | N/A       | N/A                       | N/A                                                                                                                                                                        | MNIST and CIFAR10               | 99%                                                                                                    | CPU and GPU |
| [58] | A PIMCA for DNN inference with low precision (1–2 b). Advantages: By employing this method, a significant reduction of up to 73% in the total program size is achieved, resulting in fewer cycle counts and ultimately leading to improved energy efficiency. | 124 mW | 20.9 mm <sup>2</sup> | Peak energy efficiency of 437 TOPS/W and peak throughput of 49 TOPS at 42-MHz clock frequency | CIFAR-10 | 1-/1-b VGG-9: 83.20%<br>1-/1-b ResNet-18: 83.48%<br>2-/2-b ResNet-18: 86.48% | FPGA |
| [95] | A transformer accelerator utilizing an OBS dataflow resulting in higher energy efficiency. Advantages: The proposed OBS dataflow reduces the power consumption of the BRAM, which leads to an overall power reduction of 33%. OBS also lowers the input reading and output writing bandwidth.                                                                                                                                                                                                                                                                                               | 80 mW     | N/A                       | Throughput of 728.3 GOPs and an energy efficiency of 58.31 GOPs/W                                                                                                          | ImageNet                        | 79.5%                                                                                                  | FPGA        |

4) Specificity: It calculates the percentage of actual negative cases that the classifier correctly classifies as negative. It is often referred to as the TN rate [30], [34]. Specificity is calculated by the following equation:

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}.$$
 (14)

5) F1 Score or Tension: It measures the balanced relationship between sensitivity and precision [34]. F1 Score is calculated by the following equation:

$$F1 \text{ Score (Tension)} = \frac{2 \times \text{Sensitivity} \times \text{Precision}}{\text{Sensitivity} + \text{Precision}}.$$
 (15)
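As a worked illustration of (14) and (15), the following minimal sketch computes both metrics from confusion-matrix counts. The function names and the example counts are hypothetical, and sensitivity and precision are assumed to follow their usual definitions, TP/(TP + FN) and TP/(TP + FP).

```python
def specificity(tn: int, fp: int) -> float:
    """True-negative rate, Eq. (14): TN / (TN + FP)."""
    return tn / (tn + fp)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score, Eq. (15), built from sensitivity and precision."""
    sensitivity = tp / (tp + fn)  # assumed standard definition
    precision = tp / (tp + fp)    # assumed standard definition
    return 2 * sensitivity * precision / (sensitivity + precision)


# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, tn, fn = 40, 10, 45, 5

spec = specificity(tn, fp)  # 45 / 55
f1 = f1_score(tp, fp, fn)   # equals 2*TP / (2*TP + FP + FN) = 80 / 95

print(f"specificity = {spec:.4f}")
print(f"F1 score    = {f1:.4f}")
```

Because the F1 score is a harmonic mean, it is pulled toward the smaller of sensitivity and precision, so a classifier cannot score well by maximizing only one of them.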

- 6) Loss Function: A loss function is a mathematical function of the machine learning algorithm's parameters that measures how well the algorithm models a dataset. It plays a crucial role in the training process and in the results obtained from any deep learning methodology. Loss functions are typically categorized as either regression losses or classification losses. Regression losses, such as mean squared error (MSE) and mean absolute error (MAE), are used in regression neural networks, which predict a continuous output value from an input rather than one of a set of preselected labels. Classification losses, such as binary cross-entropy and categorical cross-entropy, are used in classification neural networks, which select the category to which the input most probably belongs. Each one is described as follows.
  - 1) *MSE*: It is also known as *L*2 Loss. MSE calculates the average of the squared differences between the predicted and actual values across the entire dataset. MSE is calculated as follows:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
 (16)

MSE is sensitive to outliers because each difference is squared: predictions above and below the target are treated alike, but large errors are penalized far more heavily than small ones. Given multiple examples with the same input feature values, the prediction that minimizes MSE is the mean of their target values. MSE is also convex in the prediction error and has a single global minimum, which makes gradient descent optimization of the weight values easier.

2) *MAE*: MAE is also known as *L*1 loss. MAE is the average of the absolute differences between the target and predicted values over the dataset. MAE is calculated as follows:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.$$
 (17)

MAE is a robust metric that is not significantly affected by outliers. In cases where multiple samples have the same input feature values, the prediction that minimizes MAE is the median target value, whereas for MSE it is the mean. MAE's limitation is that its gradient depends only on the sign of the difference between the predicted and actual values, not on the size of the error. The gradient magnitude therefore stays the same even for small errors, which can lead to convergence problems. For this reason, the Huber loss was developed. It combines the benefits of MSE and MAE and is defined as follows:

$$\text{Huber Loss} = \begin{cases} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} (y_i - \hat{y}_i)^2 & \text{if } |y_i - \hat{y}_i| \le \delta \\ \frac{1}{n} \sum_{i=1}^{n} \delta \left( |y_i - \hat{y}_i| - \frac{1}{2} \delta \right) & \text{if } |y_i - \hat{y}_i| > \delta. \end{cases}$$
 (18)
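The behavior of the three regression losses can be compared on a small example. The sketch below is illustrative only: the function names and data are hypothetical, and the Huber loss is written in its standard per-sample form, with a factor of 1/2 on the quadratic branch (an assumption made here so that the two branches meet at an error of exactly delta).

```python
def mse(y_true, y_pred):
    """Mean squared error, Eq. (16)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error, Eq. (17)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2              # MSE-like branch
        else:
            total += delta * (err - 0.5 * delta)  # MAE-like branch
    return total / len(y_true)


# Hypothetical targets and two sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.5, 2.5, 2.5, 4.5]     # every error has magnitude 0.5
y_outlier = [1.5, 2.5, 2.5, 14.0]  # last error has magnitude 10

for name, y_pred in (("clean", y_clean), ("outlier", y_outlier)):
    print(name, mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```

With the single outlier, MSE grows from 0.25 to about 25.2, while MAE and the Huber loss stay below 3, which reflects the outlier sensitivity discussed above.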

In (18), the hyperparameter $\delta$ sets the error magnitude at which the loss switches from the quadratic, MSE-like branch to the linear, MAE-like branch.

3) *Binary cross-entropy (log loss)*: Cross-entropy loss is also called logarithmic loss, log loss, or logistic loss. It is the loss function used in binary classification models, which take an input and classify it into one of two predefined categories. Classification neural networks output a vector of probabilities, one for each preset category, and pick the category with the highest probability as the final output. The loss is calculated as follows:

$$\text{CE Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right].$$
 (19)

In binary classification, the actual value $y_i$ can only be 0 or 1. To determine the loss between actual and predicted values, the actual value is compared with the predicted probability of the corresponding category: $p_i$ is the probability that the category is 1, and $1 - p_i$ is the probability that the category is 0.

4) Categorical cross-entropy: In multiclass classification tasks, where an example can belong to only one of several possible categories, categorical cross-entropy is commonly used. This function measures the difference between two probability distributions and is used when the number of classes is more than two. Binary cross-entropy is the special case of categorical cross-entropy with $M = 2$, where $M$ is the number of categories, $y_{ij}$ is 1 if sample $i$ belongs to category $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that sample $i$ belongs to category $j$:

$$CE = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{M} y_{ij} \log(p_{ij}).$$
 (20)
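The relationship between (19) and (20) can be checked numerically. The following sketch is a minimal illustration with hypothetical function names and data. The clamping constant `eps` is an added safeguard against `log(0)` and is not part of the equations.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy, Eq. (19). p_pred[i] is P(category = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def categorical_cross_entropy(y_true, p_pred, eps=1e-12):
    """Categorical cross-entropy, Eq. (20), with one-hot rows in y_true."""
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        for y, p in zip(y_row, p_row):
            total += y * math.log(max(p, eps))
    return -total / len(y_true)


# Hypothetical labels and predicted probabilities of category 1.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
bce = binary_cross_entropy(labels, probs)

# The same problem written as a two-category (M = 2) problem.
one_hot = [[1 - y, y] for y in labels]
two_class = [[1 - p, p] for p in probs]
cce = categorical_cross_entropy(one_hot, two_class)

print(f"binary CE      = {bce:.4f}")
print(f"categorical CE = {cce:.4f}")
```

Both calls return approximately 0.2798, consistent with binary cross-entropy being the M = 2 case of categorical cross-entropy.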

## B. Hardware Evaluation

Various metrics can be used to evaluate hardware systems, such as power consumption, area, throughput, and latency. These metrics are helpful in comparing and assessing the advantages and limitations of different designs.

- 1) Energy Efficiency and Power Consumption: Energy efficiency measures the amount of data that can be processed or the number of tasks that can be performed per unit of energy. This is particularly important when processing DNNs on embedded devices at the edge. Power consumption is the amount of energy consumed per unit of time. The thermal design power (TDP) is a design criterion that bounds the maximum power consumption: it is the amount of power that the cooling system is designed to dissipate.
- 2) Area: The size of each PE and the total area cost of the system together determine the optimal number of PEs. If the area cost of the system remains the same, increasing the number of PEs will necessitate either decreasing the amount of space required for each PE or exchanging some of the on-chip storage area for additional PEs. However, decreasing the amount of on-chip storage can affect how well the PEs are utilized. The area per PE can also be reduced by reducing the logic needed to deliver operands to a MAC [43].



Fig. 10. Performance analysis of the existing methods.

- 3) Throughput: Throughput refers to the amount of data that can be transmitted or processed within a specific time frame. It is a key performance metric used to evaluate the efficiency and performance of network connections or data processing systems, as it indicates how many packets or messages can successfully reach their destination. For network connections, throughput is commonly measured in bits per second (bps) and is often expressed in megabits per second (Mbps) or gigabits per second (Gbps). For the accelerators compared in this survey, it is reported in operations per second, such as GOPS or TOPS. A higher throughput indicates a more efficient network or system, while a lower throughput can indicate performance issues or bottlenecks [32], [44].
- 4) Latency: It indicates how long it takes for packets to reach their destination. Throughput and latency are directly related. Applications that require real-time interaction, such as augmented reality, autonomous navigation, and robotics, require low latency in order to work correctly. The two metrics frequently pull against each other, because the maximum throughput of a conversation (a data exchange from one point to another) is determined by its latency. Thus, depending on the approach, high throughput and low latency can be incompatible goals, and both metrics should be reported [43]. Latency is measured in milliseconds (ms).
- 5) Analysis: Our AI accelerator survey begins with power usage and throughput comparisons. Fig. 10 plots power consumption, in watts, against throughput, in giga operations per second (GOPS). For specific articles, we derived the throughput by multiplying power by power efficiency. The observed trend shows that contemporary accelerators predominantly align with the 1-TOPS throughput trendline. Accelerators with a low-power design exhibit a discernible pattern: their power consumption typically exceeds 0.1 W, while their throughput surpasses 1 GOPS. Only a limited number of accelerators fall below these values.

Fig. 11 presents each accelerator's power efficiency together with its year of publication. For articles that did not report power efficiency, we calculated it by dividing throughput by power. In the previous two years, the power efficiency of AI accelerators has ranged from a minimum of more than 50 GOPS/W to a maximum of more than 70 TOPS/W. Among the surveyed accelerators, FPGA implementations show higher power efficiency than other implementations. According to the data, no major new developments have significantly affected power, power efficiency, or throughput when compared to previous years.
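The two derivations used in this analysis (throughput from power and power efficiency, and power efficiency from throughput and power) amount to simple unit arithmetic. The sketch below applies them to two rows of the comparison table. The function names are hypothetical, and whether these particular entries were derived this way in Figs. 10 and 11 is not stated, so the results are illustrative rather than a reproduction of the plotted values.

```python
def derive_throughput_gops(power_w: float, efficiency_gops_per_w: float) -> float:
    """Throughput (GOPS) = power (W) x power efficiency (GOPS/W)."""
    return power_w * efficiency_gops_per_w


def derive_efficiency_gops_per_w(throughput_gops: float, power_w: float) -> float:
    """Power efficiency (GOPS/W) = throughput (GOPS) / power (W)."""
    return throughput_gops / power_w


# [78]: 160.17 mW and 7.0325 TOPS/W at 50% sparsity (table values).
throughput_78 = derive_throughput_gops(power_w=0.16017,
                                       efficiency_gops_per_w=7032.5)

# [88]: 20.2 GOp/s effective throughput at 2.3 W (table values).
efficiency_88 = derive_efficiency_gops_per_w(throughput_gops=20.2,
                                             power_w=2.3)

print(f"[78] derived throughput: {throughput_78:.1f} GOPS")
print(f"[88] derived efficiency: {efficiency_88:.2f} GOPS/W")
```

Under these assumptions, [78] lands at roughly 1.13 TOPS and [88] at roughly 8.8 GOPS/W. Such derived figures are only as reliable as the assumption that power and efficiency were reported at the same operating point.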

## C. Future Machine Learning Accelerator Designs

Future machine learning accelerator designs face several challenges as AI applications continue to grow in complexity and scale. Here are some insights and suggestions to address these challenges.

- 1) Leveraging reconfigurable designs: By combining reconfigurable designs with optimization strategies such as parallel processing, dynamic resource allocation, and area optimization, it becomes possible to increase the speed of machine learning accelerators while minimizing costs and maintaining the flexibility to adapt to varying workloads and applications. The proposed reconfigurable designs comprise nodes that can switch between different layers, thereby increasing network speed and meeting specific performance objectives. This adaptability allows the number of on-chip layers to be configured and updated, offering a versatile approach to resource allocation. Furthermore, reducing the number of adders and multipliers decreases the number of computational operations. Together, these design elements yield a solution that is efficient in both speed and resource utilization.
- 2) *Power efficiency:* Energy consumption is a major concern for AI systems, especially in mobile and edge computing scenarios. Improving power efficiency through techniques like quantization, sparsity, and specialized memory architectures will be vital. Data reuse is an effective approach for reducing the energy consumption of data transfer. This requires moving data once from a remote, large memory (such as an off-chip DRAM) and then using it for multiple operations from a nearby, smaller memory (such as an on-chip buffer or a PE's scratchpad). The optimization of data movement holds substantial importance in the overall design of DNN processors. Furthermore, by reducing the number of adders and multipliers, the system executes fewer computational operations, leading to decreased energy consumption.

Fig. 11. Comparison between the recent methods.

- 3) Model size and complexity: State-of-the-art AI models are becoming larger and more complex, demanding significant computational power and memory resources. Future accelerators need to be scalable to handle these large models efficiently. The optimization involves condensing layers, such as combining two layers to function as effectively as four, thereby enhancing performance. Additionally, simplifying units and reducing the number of pooling layers results in a reduced overall area footprint.
- 4) Diverse workloads: With the increasing diversity of AI workloads, designing accelerators that can efficiently handle various tasks is essential. Addressing the diversity of AI workloads requires a multifaceted approach that encompasses various strategies and methodologies like quantization (reducing the precision of weights and activations) and pruning (removing less significant connections) to reduce the computational requirements of AI models. This can make them more versatile and adaptable to different workloads.
- 5) Real-time inference: Many AI applications require real-time or low-latency responses. Future accelerators must offer rapid inference while maintaining high accuracy, especially in time-sensitive fields such as autonomous vehicles and robotics. Implementing parallel processing techniques to execute multiple tasks simultaneously can lead to substantial reductions in response times for AI applications. Also, utilizing edge devices with processing capabilities to perform computations locally reduces the need for data to be sent back and forth to a centralized server.
- 6) Bottlenecks in data transfer: Addressing memory access and data transfer bottlenecks in the accelerator can be achieved through various strategies. Reusing data in calculations helps minimize the need for frequent data transfers. Processing data in batches reduces the frequency of memory access and transfers, thereby enhancing overall efficiency. Additionally, employing cache memory to store frequently accessed data mitigates the impact of slow memory access. Data compression algorithms can also be employed to reduce the volume of data transferred, leading to improved performance. Utilizing direct memory access (DMA) controllers allows for the offloading of data transfer tasks from the CPU, enabling it to focus on computation. Furthermore, structuring algorithms and data layouts to enhance spatial locality can further reduce the frequency of memory accesses. These combined approaches can effectively alleviate memory access and data transfer bottlenecks, ultimately enhancing the performance of the accelerator.
- 7) Hardware–software co-design: Tight collaboration between hardware and software teams is required to extract maximum performance from accelerators. Co-design efforts can result in improved hardware–software integration and targeted optimizations. Future AI accelerators may increasingly adopt neuromorphic computing principles, mimicking the brain's architecture. Also, as quantum computing advances, algorithms and software frameworks will need to be tailored for quantum hardware.

- 8) Heterogeneous computing: To strike a balance between performance and energy efficiency, heterogeneous computing designs incorporating several types of processors (e.g., CPUs, GPUs, and TPUs) may become more common. Each type of processor can be optimized for specific types of computations.
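Two of the techniques named in the list above, quantization (reducing the precision of weights and activations) and pruning (removing less significant connections), can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: it uses symmetric per-tensor 8-bit quantization and magnitude-based pruning, which are common choices but are not claimed to be the schemes used by any surveyed accelerator. All names and weight values are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the 8-bit integers back to approximate real values."""
    return [v * scale for v in q]


def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical layer weights, for illustration only.
weights = [0.82, -0.05, 0.31, -1.27, 0.02, 0.64, -0.11, 0.47]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

pruned = prune_by_magnitude(weights, sparsity=0.5)

print("int8 codes:", q)
print("max quantization error:", max_err)
print("pruned weights:", pruned)
```

Each weight is stored in 8 bits instead of 32, and the rounding error is bounded by half the scale. At 50% sparsity, half of the multiply-accumulate operations involve a zero weight and can be skipped by hardware that supports sparsity.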

Addressing these challenges would necessitate ongoing research and development in both the hardware and software areas. Collaboration between academia, industry, and the open-source community will be critical to advancing machine learning accelerator designs that match the needs of tomorrow's AI landscape.

## VII. CONCLUSION

Machine learning is used in most current application domains, such as IoT environments and biomedical systems. The main challenge is to design a machine learning hardware accelerator with high speed and performance at a low cost. This article investigated different hardware accelerator structures: ANN, CNN, and RNN. It described the existing approaches with a comparison that shows the features and limitations of each method. This article also presented the current challenges for designing machine learning accelerators. We highlighted the evaluation parameters of both the learning and hardware sides, such as accuracy, sensitivity, area, speed, throughput, and energy consumption. Thus, this article presented a complete survey on machine learning hardware accelerators to help new researchers and designers in the field. For future research, hardware accelerators can include reconfiguration features to suit multiple applications. The reconfiguration process can be performed online based on application criteria. Also, a hardware accelerator might be implemented using mixed-signal circuits to obtain the benefits of both analog and digital designs. Furthermore, some hardware components can be shared to support multiple operations to save area on a chip.

## REFERENCES

- [1] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan, "Speech recognition using deep neural networks: A systematic review," *IEEE Access*, vol. 7, pp. 19143–19165, 2019.
- [2] S. Dua et al., "Developing a speech recognition system for recognizing tonal speech signals using a convolutional neural network," *Appl. Sci.*, vol. 12, no. 12, 2022, Art. no. 6223.
- [3] M. Chun, H. Jeong, H. Lee, T. Yoo, and H. Jung, "Development of Korean food image classification model using public food image dataset and deep learning methods," *IEEE Access*, vol. 10, pp. 128732–128741, 2022.
- [4] C. T. Sari and C. Gunduz-Demir, "Unsupervised feature extraction via deep learning for histopathological classification of colon tissue images," *IEEE Trans. Med. Imag.*, vol. 38, no. 5, pp. 1139–1149, May 2019.
- [5] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Machine learning-based approach for hardware faults prediction," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 67, no. 11, pp. 3880–3892, Nov. 2020.
- [6] R. Malhotra, "A systematic review of machine learning techniques for software fault prediction," *Appl. Soft Comput.*, vol. 27, pp. 504–518, 2015.
- [7] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Intelligent fault-prediction assisted self-healing for embryonic hardware," *IEEE Trans. Biomed. Circuits Syst.*, vol. 14, no. 4, pp. 852–866, Aug. 2020.
- [8] L.-Q. Zuo, H.-M. Sun, Q.-C. Mao, R. Qi, and R.-S. Jia, "Natural scene text recognition based on encoder-decoder framework," *IEEE Access*, vol. 7, pp. 62616–62623, 2019.

- [9] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai, "TextField: Learning a deep direction field for irregular scene text detection," *IEEE Trans. Image Process.*, vol. 28, no. 11, pp. 5566–5579, Nov. 2019.
- [10] U. P. Singh, S. S. Chouhan, S. Jain, and S. Jain, "Multilayer convolution neural network for the classification of mango leaves infected by anthracnose disease," *IEEE Access*, vol. 7, pp. 43721–43729, 2019.
- [11] K. Li, J. Daniels, C. Liu, P. Herrero, and P. Georgiou, "Convolutional recurrent neural networks for glucose prediction," *IEEE J. Biomed. Health Inform.*, vol. 24, no. 2, pp. 603–613, Feb. 2020.
- [12] C. N. Freitas, F. R. Cordeiro, and V. Macario, "MyFood: A food segmentation and classification system to aid nutritional monitoring," in *Proc. 33rd SIBGRAPI Conf. Graph. Patterns Images (SIBGRAPI)*, Piscataway, NJ, USA: IEEE Press, 2020, pp. 234–239.
- [13] H. Jelodar, Y. Wang, R. Orji, and S. Huang, "Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach," *IEEE J. Biomed. Health Inform.*, vol. 24, no. 10, pp. 2733–2742, Oct. 2020.
- [14] M. Li, W. Hsu, X. Xie, J. Cong, and W. Gao, "SACNN: Self-attention convolutional neural network for low-dose CT denoising with self-supervised perceptual loss network," *IEEE Trans. Med. Imag.*, vol. 39, no. 7, pp. 2289–2301, Jul. 2020.
- [15] B. Dey et al., "SEM image denoising with unsupervised machine learning for better defect inspection and metrology," in *Proc. Metrol. Inspection Process Control Semicond. Manuf. XXXV*, vol. 11611, Bellingham, WA, USA: SPIE, 2021, pp. 245–254.
- [16] B. Dey et al., "Unsupervised machine learning based SEM image denoising for robust contour detection," in *Proc. Int. Conf. Extreme Ultraviolet Lithography*, vol. 11854, Bellingham, WA, USA: SPIE, 2021, pp. 88–102.
- [17] Y. Liu et al., "Graph self-supervised learning: A survey," *IEEE Trans. Knowl. Data Eng.*, vol. 35, no. 6, pp. 5879–5900, Jun. 2023.
- [18] X. Wang, D. Kihara, J. Luo, and G.-J. Qi, "EnAET: A self-trained framework for semi-supervised and supervised learning with ensemble transformations," *IEEE Trans. Image Process.*, vol. 30, pp. 1639–1647, 2020.
- [19] S. Ahmed, Y. Lee, S.-H. Hyun, and I. Koo, "Unsupervised machine learning-based detection of covert data integrity assault in smart grid networks utilizing isolation forest," *IEEE Trans. Inf. Forensics Secur.*, vol. 14, no. 10, pp. 2765–2777, Oct. 2019.
- [20] A. Uprety and D. B. Rawat, "Reinforcement learning for IoT security: A comprehensive survey," *IEEE Internet Things J.*, vol. 8, no. 11, pp. 8693–8706, Jun. 2020.
- [21] H. Xu, A. D. Domínguez-García, and P. W. Sauer, "Optimal tap setting of voltage regulation transformers using batch reinforcement learning," *IEEE Trans. Power Syst.*, vol. 35, no. 3, pp. 1990–2001, May 2020.
- [22] M. Saharkhizan, A. Azmoodeh, A. Dehghantanha, K.-K. R. Choo, and R. M. Parizi, "An ensemble of deep recurrent neural networks for detecting IoT cyber attacks using network traffic," *IEEE Internet Things J.*, vol. 7, no. 9, pp. 8852–8859, Sep. 2020.
- [23] P. Goswami, A. Mukherjee, M. Maiti, S. K. S. Tyagi, and L. Yang, "A neural-network-based optimal resource allocation method for secure IIoT network," *IEEE Internet Things J.*, vol. 9, no. 4, pp. 2538–2544, Feb. 2022.
- [24] M. Woźniak, J. Siłka, M. Wieczorek, and M. Alrashoud, "Recurrent neural network model for IoT and networking malware threat detection," *IEEE Trans. Ind. Informat.*, vol. 17, no. 8, pp. 5583–5594, Aug. 2021.
- [25] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, "A survey of convolutional neural networks: Analysis, applications, and prospects," *IEEE Trans. Neural Netw. Learn. Syst.*, vol. 33, no. 12, pp. 6999–7019, Dec. 2022.
- [26] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, "1D convolutional neural networks and applications: A survey," *Mech. Syst. Signal Process.*, vol. 151, 2021, Art. no. 107398.
- [27] V. Veerasamy et al., "LSTM recurrent neural network classifier for high impedance fault detection in solar PV integrated power system," *IEEE Access*, vol. 9, pp. 32672–32687, 2021.
- [28] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Economic LSTM approach for recurrent neural networks," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 66, no. 11, pp. 1885–1889, Nov. 2019.
- [29] O. I. Abiodun et al., "Comprehensive review of artificial neural network applications to pattern recognition," *IEEE Access*, vol. 7, pp. 158820–158846, 2019.
- [30] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "An efficient approach for neural network architecture," in *Proc. 25th IEEE Int. Conf. Electron. Circuits Syst. (ICECS)*, Piscataway, NJ, USA: IEEE Press, 2018, pp. 745–748.

- [31] K. Khalil, O. Eldash, B. Dey, A. Kumar, and M. Bayoumi, "Architecture of a novel low-cost hardware neural network," in *Proc. IEEE 63rd Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Piscataway, NJ, USA: IEEE Press, 2020, pp. 1060–1063.
- [32] E. Wang et al., "Deep neural network approximation for custom hardware: Where we've been, where we're going," *ACM Comput. Surveys*, vol. 52, no. 2, pp. 1–39, 2019.
- [33] K. Khalil, B. Dey, M. Abdelrehim, A. Kumar, and M. Bayoumi, "An efficient reconfigurable neural network on chip," in *Proc. 28th IEEE Int. Conf. Electron. Circuits Syst. (ICECS)*, Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–4.
- [34] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "N2OC: Neural-network-on-chip architecture," in *Proc. 32nd IEEE Int. System-on-Chip Conf. (SOCC)*, Piscataway, NJ, USA: IEEE Press, 2019, pp. 272–277.
- [35] K. Khalil, O. Eldash, B. Dey, A. Kumar, and M. Bayoumi, "A novel reconfigurable hardware architecture of neural network," in *Proc. IEEE 62nd Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Piscataway, NJ, USA: IEEE Press, 2019, pp. 618–621.
- [36] M. A. Rajput, S. Alyami, Q. A. Ahmed, H. Alshahrani, Y. Asiri, and A. Shaikh, "Improved learning-based design space exploration for approximate instance generation," *IEEE Access*, vol. 11, pp. 18291–18299, 2023.
- [37] G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel, "Hardware approximate techniques for deep neural network accelerators: A survey," *ACM Comput. Surveys*, vol. 55, no. 4, pp. 1–36, 2022.
- [38] K. Khalil, A. Kumar, and M. Bayoumi, "Low-power convolutional neural network accelerator on FPGA," in *Proc. IEEE 5th Int. Conf. Artif. Intell. Circuits Syst. (AICAS)*, Piscataway, NJ, USA: IEEE Press, 2023, pp. 1–5.
- [39] C. Åleskog, H. Grahn, and A. Borg, "Recent developments in low-power AI accelerators: A survey," *Algorithms*, vol. 15, no. 11, 2022, Art. no. 419.
- [40] M. Giordano, L. Piccinelli, and M. Magno, "Survey and comparison of milliwatts micro controllers for tiny machine learning at the edge," in *Proc. IEEE 4th Int. Conf. Artif. Intell. Circuits Syst. (AICAS)*, Piscataway, NJ, USA: IEEE Press, 2022, pp. 94–97.
- [41] S. S. Saha, S. S. Sandha, and M. Srivastava, "Machine learning for microcontroller-class hardware-a review," *IEEE Sens. J.*, vol. 22, no. 22, pp. 21362–21390, Nov. 2022.
- [42] K. Khalil, T. Mohaidat, and M. Bayoumi, "Low-cost hardware design approach for long short-term memory (LSTM)," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, Piscataway, NJ, USA: IEEE Press, 2023, pp. 1–5.
- [43] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "How to evaluate deep neural network processors: TOPS/W (alone) considered harmful," *IEEE Solid-State Circuits Mag.*, vol. 12, no. 3, pp. 28–41, Summer 2020.
- [44] N. Gupta, "Introduction to hardware accelerator systems for artificial intelligence and machine learning," in *Hardware Accelerator Systems for Artificial Intelligence and Machine Learning*, S. Kim and G. C. Deka, Eds., Elsevier, 2021, ch. 1, vol. 122, pp. 1–21. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0065245820300541
- [45] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey of machine learning accelerators," in *Proc. IEEE High Perform. Extreme Comput. Conf. (HPEC)*, Piscataway, NJ, USA: IEEE Press, 2020, pp. 1–12.
- [46] M. F. Hashmi, R. Pal, R. Saxena, and A. G. Keskar, "A new approach for real time object detection and tracking on high resolution and multicamera surveillance videos using GPU," *J. Central South Univ.*, vol. 23, pp. 130–144, 2016.
- [47] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, "Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead," *IEEE Access*, vol. 8, pp. 225134–225180, 2020.
- [48] Z. Qi, W. Chen, R. A. Naqvi, and K. Siddique, "Designing deep learning hardware accelerator and efficiency evaluation," *Comput. Intell. Neurosci.*, vol. 2022, 2022, Art. no. 1291103.
- [49] S. Bavikadi et al., "A survey on machine learning accelerators and evolutionary hardware platforms," *IEEE Des. Test*, vol. 39, no. 3, pp. 91–116, Jun. 2022.
- [50] Z. Zhang, K. Zhang, and A. Khelifi, *Multivariate Time Series Analysis in Climate and Environmental Research*. Springer, 2018.
- [51] B. Dey, K. Khalil, A. Kumar, and M. Bayoumi, "A reversible-logic based architecture for artificial neural network," in *Proc. IEEE 63rd Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Piscataway, NJ, USA: IEEE Press, 2020, pp. 505–508.

- [52] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Self-healing approach for hardware neural network architecture," in *Proc. IEEE 62nd Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Piscataway, NJ, USA: IEEE Press, 2019, pp. 622–625.
- [53] K. Khalil, A. Kumar, and M. Bayoumi, "Reconfigurable hardware design approach for economic neural network," *IEEE Trans. Circuits Syst.*, II, Exp. Briefs, vol. 69, no. 12, pp. 5094–5098, Dec. 2022.
- [54] T. V. Huynh, "Deep neural network accelerator based on FPGA," in Proc. 4th NAFOSTED Conf. Inf. Comput. Sci., Piscataway, NJ, USA: IEEE Press, 2017, pp. 254–257.
- [55] L. D. Medus, T. Iakymchuk, J. V. Frances-Villora, M. Bataller-Mompeán, and A. Rosado-Muñoz, "A novel systolic parallel hardware architecture for the FPGA acceleration of feedforward neural networks," *IEEE Access*, vol. 7, pp. 76084–76103, 2019.
- [56] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, "Adaptive hardware architecture for neural-network-on-chip," in *Proc. IEEE 65th Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Piscataway, NJ, USA: IEEE Press, 2022, pp. 1–4.
- [57] S. Xiao et al., "Neuronlink: An efficient chip-to-chip interconnect for large-scale neural network accelerators," *IEEE Trans. Very Large Scale Integr. VLSI Syst.*, vol. 28, no. 9, pp. 1966–1978, Sep. 2020.
- [58] B. Zhang et al., "PIMCA: A programmable in-memory computing accelerator for energy-efficient DNN inference," *IEEE J. Solid-State Circuits*, vol. 58, no. 5, pp. 1436–1449, May 2023.
- [59] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "HyPar: Towards hybrid parallelism for deep learning accelerator array," in *Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA)*, 2019, pp. 56–68.
- [60] X. Wei, Y. Liang, P. Zhang, C. H. Yu, and J. Cong, "Overcoming data transfer bottlenecks in DNN accelerators via layer-conscious memory management," in *Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays (FPGA)*, New York, NY, USA: ACM, 2019, p. 120, doi: 10.1145/3289602.3293947.
- [61] M. Capra, B. Bussolino, A. Marchisio, M. Shafique, G. Masera, and M. Martina, "An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks," *Future Internet*, vol. 12, no. 7, 2020, Art. no. 113.
- [62] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, "A reversible-logic based architecture for convolutional neural network (CNN)," in *Proc. IEEE Int. Midwest Symp. Circuits Syst. (MWSCAS)*, Piscataway, NJ, USA: IEEE Press, 2021, pp. 1070–1073.
- [63] H. Li, X. Yue, Z. Wang, W. Wang, H. Tomiyama, and L. Meng, "A survey of convolutional neural networks—From software to hardware and the applications in measurement," *Meas. Sens.*, vol. 18, 2021, Art. no. 100080.
- [64] B. Dey, K. Khalil, A. Kumar, and M. Bayoumi, "A reversible-logic based architecture for VGGNet," in *Proc. 28th IEEE Int. Conf. Electron. Circuits Syst. (ICECS)*, Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–4.
- [65] Y. Tang, L. Tian, Y. Liu, Y. Wen, K. Kang, and X. Zhao, "Design and implementation of improved CNN activation function," in *Proc. 3rd Int. Conf. Comput. Vis. Image Deep Learn. Int. Conf. Comput. Eng. Appl. (CVIDL & ICCEA)*, Piscataway, NJ, USA: IEEE Press, 2022, pp. 1166–1170.
- [66] K. Khalil, O. Eldash, A. Kumar, and M. Bayoumi, "Designing novel AAD pooling in hardware for a convolutional neural network accelerator," *IEEE Trans. Very Large Scale Integr. VLSI Syst.*, vol. 30, no. 3, pp. 303–314, Mar. 2022.
- [67] Q. Song, J. Zhang, L. Sun, and G. Jin, "Design and implementation of convolutional neural networks accelerator based on multidie," *IEEE Access*, vol. 10, pp. 91497–91508, 2022.
- [68] Y.-S. Ting, Y.-F. Teng, and T.-D. Chiueh, "Batch normalization processor design for convolution neural network training and inference," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–4.
- [69] B. Khabbazan and S. Mirzakuchaki, "Design and implementation of a low-power, embedded CNN accelerator on a low-end FPGA," in *Proc. 22nd Euromicro Conf. Digit. Syst. Des. (DSD)*, Piscataway, NJ, USA: IEEE Press, 2019, pp. 647–650.
- [70] H. Xiao, K. Li, and M. Zhu, "FPGA-based scalable and highly concurrent convolutional neural network acceleration," in *Proc. IEEE Int. Conf. Power Electron. Comput. Appl. (ICPECA)*, Piscataway, NJ, USA: IEEE Press, 2021, pp. 367–370.
- [71] J. Lee, J. Rhim, D. Kang, and S. Ha, "SNAS: Fast hardware-aware neural architecture search methodology," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 41, no. 11, pp. 4826–4836, Nov. 2022.

- [72] S. Liu, H. Fan, M. Ferianc, X. Niu, H. Shi, and W. Luk, "Toward full-stack acceleration of deep convolutional neural networks on FPGAs," *IEEE Trans. Neural Netw. Learn. Syst.*, vol. 33, no. 8, pp. 3974–3987, Aug. 2022.
- [73] H. Wang, Y. Zhao, and F. Gao, "A convolutional neural network accelerator based on FPGA for buffer optimization," in *Proc. IEEE 5th Adv. Inf. Technol., Electron. Automat. Control Conf. (IAEAC)*, vol. 5, Piscataway, NJ, USA: IEEE Press, 2021, pp. 2362–2367.
- [74] P. Achararit, M. A. Hanif, R. V. W. Putra, M. Shafique, and Y. Hara-Azumi, "APNAS: Accuracy-and-performance-aware neural architecture search for neural hardware accelerators," *IEEE Access*, vol. 8, pp. 165319–165334, 2020.
- [75] T. Yuan, W. Liu, J. Han, and F. Lombardi, "High performance CNN accelerators based on hardware and algorithm co-optimization," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 68, no. 1, pp. 250–263, Jan. 2021.
- [76] W. Huang et al., "FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio," *IEEE Trans. Neural Netw. Learn. Syst.*, vol. 33, no. 8, pp. 4069–4083, Aug. 2022.
- [77] M. Kim and J.-S. Seo, "Deep convolutional neural network accelerator featuring conditional computing and low external memory access," in *Proc. IEEE Custom Integr. Circuits Conf. (CICC)*, Piscataway, NJ, USA: IEEE Press, 2020, pp. 1–4.
- [78] Q. Cheng et al., "A low-power sparse convolutional neural network accelerator with pre-encoding Radix-4 booth multiplier," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 70, no. 6, pp. 2246–2250, Jun. 2023.
- [79] X. Yu et al., "A data-center FPGA acceleration platform for convolutional neural networks," in *Proc. 29th Int. Conf. Field Programmable Log. Appl. (FPL)*, 2019, pp. 151–158.
- [80] R. Hwang, M. Kang, J. Lee, D. Kam, Y. Lee, and M. Rhu, "GROW: A row-stationary sparse-dense GEMM accelerator for memory-efficient graph convolutional neural networks," in *Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA)*, 2023, pp. 42–55.
- [81] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM networks," in *Proc. IEEE Int. Joint Conf. Neural Netw.*, vol. 4, Piscataway, NJ, USA: IEEE Press, 2005, pp. 2047–2052.
- [82] K. Smagulova and A. P. James, "A survey on LSTM memristive neural network architectures and applications," *Eur. Phys. J. Special Top.*, vol. 228, no. 10, pp. 2313–2324, 2019.
- [83] K. Khalil, B. Dey, A. Kumar, and M. Bayoumi, "A reversible-logic based architecture for long short-term memory (LSTM) network," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, Piscataway, NJ, USA: IEEE Press, 2021, pp. 1–5.
- [84] Y. Wei et al., "A review of algorithm & hardware design for AI-based biomedical applications," *IEEE Trans. Biomed. Circuits Syst.*, vol. 14, no. 2, pp. 145–163, Apr. 2020.
- [85] J. Wu, F. Li, Z. Chen, and X. Xiang, "A 3.89-GOPS/mW scalable recurrent neural network processor with improved efficiency on memory and computation," *IEEE Trans. Very Large Scale Integr. VLSI Syst.*, vol. 27, no. 12, pp. 2939–2943, Dec. 2019.
- [86] D. Kadetotad, S. Yin, V. Berisha, C. Chakrabarti, and J.-S. Seo, "An 8.93 TOPS/W LSTM recurrent neural network accelerator featuring hierarchical coarse-grain sparsity for on-device speech recognition," *IEEE J. Solid-State Circuits*, vol. 55, no. 7, pp. 1877–1887, Jul. 2020.
- [87] G. Nan et al., "An energy efficient accelerator for bidirectional recurrent neural networks (BiRNNs) using hybrid-iterative compression with error sensitivity," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 68, no. 9, pp. 3707–3718, Sep. 2021.
- [88] C. Gao, A. Rios-Navarro, X. Chen, S.-C. Liu, and T. Delbruck, "Edge-DRNN: Recurrent neural network accelerator for edge inference," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 10, no. 4, pp. 419–432, Dec. 2020.
- [89] D. Shan, Y. Luo, X. Zhang, and C. Zhang, "DRRNets: Dynamic recurrent routing via low-rank regularization in recurrent neural networks," *IEEE Trans. Neural Netw. Learn. Syst.*, vol. 34, no. 4, pp. 2057–2067, Apr. 2023.
- [90] J. Chen, S. Hong, W. He, J. Moon, and S.-W. Jun, "Eciton: Very low-power LSTM neural network accelerator for predictive maintenance at the edge," in *Proc. 31st Int. Conf. Field-Programmable Log. Appl. (FPL)*, 2021, pp. 1–8.

- [91] A. Vaswani et al., "Attention is all you need," in *Proc. Adv. Neural Inf. Process. Syst.*, vol. 30, 2017.
- [92] W. Li, S. Wang, and G. Liu, "Transformer-based model for fMRI data: ABIDE results," in *Proc. 7th Int. Conf. Comput. Commun. Syst. (ICCCS)*, 2022, pp. 162–167.
- [93] S. Ansari and K. A. Alnajjar, "Multi-hop genetic-algorithm-optimized routing technique in diffusion-based molecular communication," *IEEE Access*, vol. 11, pp. 22689–22704, 2023.
- [94] M. S. Rao, K. Venkata Rao, and M. H. M. Krishna Prasad, "Hybrid security approach for database security using diffusion based cryptography and diffie-hellman key exchange algorithm," in *Proc. 5th Int. Conf. I-SMAC (IoT Soc. Mob. Analytics Cloud) (I-SMAC)*, 2021, pp. 1608–1612.
- [95] Z. Zhao, R. Cao, K.-F. Un, W.-H. Yu, P.-I. Mak, and R. P. Martins, "An FPGA-based transformer accelerator using output block stationary dataflow for object recognition applications," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 70, no. 1, pp. 281–285, Jan. 2023.
- [96] Z. Cheng, Z. Zhang, J. Jiang, and J. Sun, "Signal detection of mobile multi-user molecular communication system using transformer-based model," in *Proc. 8th Int. Conf. Comput. Commun. Syst. (ICCCS)*, 2023, pp. 85–90.
- [97] Y. Yan, W. Du, D. Yang, and D. Yin, "CIPTA: Contrastive-based iterative prompt-tuning using text annotation from large language models," in *Proc. 4th Int. Conf. Electron. Commun. Artif. Intell. (ICECAI)*, 2023, pp. 174–178.
- [98] Y. Ye, H. You, and J. Du, "Improved trust in human-robot collaboration with ChatGPT," *IEEE Access*, vol. 11, pp. 55748–55754, 2023.
- [99] P. Maddigan and T. Susnjak, "Chat2VIS: Generating data visualizations via natural language using ChatGPT, Codex and GPT-3 large language models," *IEEE Access*, vol. 11, pp. 45181–45193, 2023.
- [100] W. Zhu et al., "Sensitivity, specificity, accuracy, associated confidence interval and ROC analysis with practical SAS implementations," in *Proc. Health Care Life Sci. (NESUG)*, Baltimore, MD, USA, vol. 19, 2010, p. 67.



Tamador Mohaidat received the B.Sc. degree in computer engineering from Yarmouk University, Irbid, Jordan, in 2010. She is currently working toward the M.Sc. degree in computer engineering with the Department of Electrical and Computer Engineering, University of Mississippi, Oxford, MS, USA.

She was a Lecturer with the Deanship of the Preparatory Year, Prince Sattam Bin Abdulaziz University, Al-Kharj, Saudi Arabia, for two years. She is currently a Research Assistant with the Department of Electrical and Computer Engineering, University of Mississippi. Her research interests include very large-scale integration (VLSI), artificial intelligence, machine learning, and hardware accelerators.



Kasem Khalil (Senior Member, IEEE) received the B.Sc. and M.Sc. degrees in electrical engineering from Assiut University, Asyut, Egypt, in 2009 and 2014, respectively, and the Ph.D. degree in computer engineering from the Center of Advanced Computer Studies (CACS), University of Louisiana at Lafayette, Lafayette, LA, USA, in 2021.

Since 2022, he has been serving as an Associate Editor at Elsevier *Microelectronics Journal*. His research interests include electronics, very large-scale integration (VLSI), microelectronics, reconfigurable hardware, self-healing hardware systems, machine learning, hardware accelerators, network-on-chip, artificial intelligence, intelligent hardware systems, and the Internet of Things.

Dr. Khalil was the recipient of the *IEEE Transactions on Very Large Scale Integration Systems* Prize Paper Award (IEEE Circuits and Systems Society VLSI Paper Award) in 2023.