# Vision Transformer Accelerator ASIC for In-Ear Sleep Staging

by Tristan Robitaille

Supervisor: Professor Xilin Liu

April 2024

# B.A.Sc. Thesis





# ESC499 Engineering Science Thesis

Vision Transformer Accelerator ASIC for In-Ear Sleep Staging

## Tristan Robitaille

Student number: 1006343397

Email: tristan.robitaille@mail.utoronto.ca

Supervisor: Professor Xilin Liu

Email: xilinliu@ece.utoronto.ca

April 12th, 2024

## Abstract

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.

Keywords: Sleep staging, ASIC accelerator, vision transformer, computer architecture

## Acknowledgements

I would like to express my gratitude to my supervisor, Prof. Xilin Liu, for his guidance and support throughout the project. He has given me the freedom to explore new ideas and has provided me with the support and tools I needed.

I would also like to thank my father, Claude Robitaille, for letting me remotely use his workstation to train the model and run the accuracy study. He has also helped review the code for the functional simulation.

In addition, I owe much to the professors who have taught me the fundamentals of computer architecture at the University of Toronto - Profs. Jason Anderson, Natalie Enright-Jerger, Andreas Moshovos and Mark C. Jeffrey.

Throughout this project, I have made extensive use of the Compute Canada cluster, which has provided me with the computational resources I needed to run the simulations and train the model. I would like to thank the staff at Compute Canada for their initiative. I am also appreciative of the tools provided by the Canadian Microelectronics Corporation, which have been instrumental in the hardware implementation of the accelerator.

I would also like to acknowledge the work of Professors Lisa Romkey and Alan Chong, who organized this thesis project for us and ensured a structured and productive environment.

Finally, I would like to thank my family and friends for their support and encouragement throughout this project. I am grateful for their patience and understanding during this time.

# Contents

- 1 Introduction
- 2 Background
- 3 How to Design an AI Accelerator
  - 3.1 Model Prototyping
  - 3.2 Accelerator Functional Simulation
  - 3.3 Accelerator Hardware Implementation
- 4 Vision Transformer Model Design
- 5 ASIC Accelerator Architecture
  - 5.1 Centralized vs. Distributed Architecture
  - 5.2 Master Architecture
  - 5.3 Data and Control Bus
  - 5.4 Compute-in-Memory: Fixed-Point Accuracy
  - 5.5 Compute-in-Memory: Memory
  - 5.6 Compute-in-Memory: Compute Modules
    - 5.6.1 Adder
    - 5.6.2 Multiplier
    - 5.6.3 Divider
    - 5.6.4 Exponential
    - 5.6.5 Square Root
    - 5.6.6 Multiply-Accumulate
    - 5.6.7 Softmax
    - 5.6.8 LayerNorm
  - 5.7 A Note About Software-Hardware Co-Design
- 6 Evaluation of Performance Metrics
  - 6.1 Vision Transformer
  - 6.2 Accelerator
- 7 Future Work
- 8 Conclusion
- A Codebase Statistics
- B Reflection on Learnings and Experience Gained

# List of Figures

- 1 Error of exponential approximation as a function of Taylor series expansion order

# List of Tables

- I Performance metrics of the compute modules
- II Line and file count per file type in the codebase

# List of Abbreviations

**ASIC** Application-Specific Integrated Circuit

**CMOS** Complementary Metal Oxide Semiconductor

**CiM** Compute-in-Memory

**HDL** Hardware Description Language

**MAC** Multiply-Accumulate

**IP** Intellectual Property

**TSMC** Taiwan Semiconductor Manufacturing Company

**VCD** Value Change Dump

See [1]. I am making an Application-Specific Integrated Circuit (ASIC). It's small, low-power and fast. It's better than Google's.

## 1 Introduction

## 2 Background

## 3 How to Design an AI Accelerator

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.

### 3.1 Model Prototyping

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.

### 3.2 Accelerator Functional Simulation

### 3.3 Accelerator Hardware Implementation

## 4 Vision Transformer Model Design

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.

## 5 ASIC Accelerator Architecture

### 5.1 Centralized vs. Distributed Architecture

### 5.2 Master Architecture

### 5.3 Data and Control Bus

### 5.4 Compute-in-Memory: Fixed-Point Accuracy

### 5.5 Compute-in-Memory: Memory

### 5.6 Compute-in-Memory: Compute Modules

This section describes the design and performance metrics of the various compute Intellectual Property (IP) modules used by the Compute-in-Memory (CiM) modules. Each is custom-designed for this project and works with a signed (2's complement) fixed-point representation. To avoid overflow, the modules use internal temporary variables in the fixed-point format Q22.10. Table I shows the performance metrics of the compute modules. The working principle of each module is described briefly in the subsequent sections.
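To make the Q22.10 format concrete, the following minimal sketch converts between real numbers and raw fixed-point words. It assumes a 32-bit word in which the 22 integer bits include the sign bit, and it saturates out-of-range values; both choices, as well as the names `to_fixed` and `to_float`, are assumptions made for illustration and are not taken from the hardware.

```python
FRAC_BITS = 10    # Q: number of fractional bits in Q22.10
TOTAL_BITS = 32   # assumption: 22 integer bits (sign included) + 10 fractional bits
MIN_RAW = -(1 << (TOTAL_BITS - 1))
MAX_RAW = (1 << (TOTAL_BITS - 1)) - 1


def to_fixed(value: float) -> int:
    """Quantize a real number to the nearest raw Q22.10 integer.

    Saturation at the range limits is an assumption of this sketch.
    """
    raw = round(value * (1 << FRAC_BITS))
    return max(MIN_RAW, min(MAX_RAW, raw))


def to_float(raw: int) -> float:
    """Convert a raw Q22.10 integer back to a real number."""
    return raw / (1 << FRAC_BITS)


if __name__ == "__main__":
    raw = to_fixed(3.14159)
    print(raw, to_float(raw))
```

With 10 fractional bits the resolution is 2^-10, so any in-range value is represented with an absolute error of at most 2^-11.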

Table I: Performance metrics of the compute modules

| Module          | Area         | Cycle/op | Energy/op | Leakage power | $F_{max}$ |
|-----------------|--------------|----------|-----------|---------------|-----------|
| Adder           | 450.4 µm²    | 1        | 0.99 pJ   | 11.87 µW      | 6.67 GHz  |
| Multiplier      | 3535.2 µm²   | 1        | 7.05 pJ   | 90.50 µW      | 1.59 GHz  |
| Divider         | 1719.9 µm²   | 35       | 23.44 pJ  | 34.56 µW      | 1.11 GHz  |
| Exponential     | 2442.2 µm²   | 24       | 62.73 pJ  | 47.10 µW      | 7.14 GHz  |
| Square Root     | 1325.2 µm²   | 17       | 18.32 pJ  | 26.30 µW      | 0.758 GHz |
| MAC<sup>1</sup> | -            | 386      | 820.20 pJ | -             | -         |
| MAC<sup>2</sup> | 3129.8 µm²   | 391      | 839.32 pJ | 69.40 µW      | 2.17 GHz  |
| MAC<sup>3</sup> | -            | 456      | 941.68 pJ | -             | -         |
| Softmax         | 2341.1 µm²   | 2024     | 1972.5 pJ | 51.47 µW      | 1.20 GHz  |
| LayerNorm       | 3836.89 µm²  | 1469+494 | 1705.7 pJ | 78.39 µW      | 0.877 GHz |
| Total           | 18780.69 µm² | N/A      | N/A       | 409.59 µW     | 0.758 GHz |

<sup>1</sup> No activation function applied

<sup>2</sup> Linear activation function applied

<sup>3</sup> Swish activation function applied

Note that all measurements in Table I are given for a standard 65nm Taiwan Semiconductor Manufacturing Company (TSMC) process with a 100MHz clock. To determine these metrics, the following methodology was used with Synopsys Design Compiler 2017.09 running on UofT's EECG cluster:

- Area: The design was synthesized with the area optimization effort set to high, and the area was extracted from the report\_area command report.
- Cycle/op: The latency was observed when running a single operation in a pre-synthesis simulation.
- Energy/op: A single-instance testbench running 10000 operations was designed, and a .saif file was generated from the testbench's Value Change Dump (VCD) file using Synopsys' vcd2saif utility. This provides an average activity factor for each node, yielding an accuracy that is adequate for this discussion. The energy per operation was calculated by multiplying the total dynamic power by the time to complete the 10000 operations, then dividing by 10000.
- Leakage power: The design was synthesized with the power optimization effort set to high, and the leakage power was extracted from the report\_power command report.
- $F_{max}$: The report\_timing command was used to determine the maximum frequency of the design.
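The energy-per-operation calculation in the methodology above reduces to a one-line formula. The sketch below shows it; the function name and the figures in the usage note are illustrative assumptions and are not values reported by the tools.

```python
def energy_per_op(dynamic_power_w: float, total_time_s: float, num_ops: int) -> float:
    """Average energy per operation in joules.

    dynamic_power_w: average dynamic power over the testbench run (W)
    total_time_s:    time taken to complete all operations (s)
    num_ops:         number of operations executed by the testbench
    """
    return dynamic_power_w * total_time_s / num_ops


if __name__ == "__main__":
    # Hypothetical figures: 10000 single-cycle operations at 100 MHz last 100 us.
    print(energy_per_op(99e-6, 100e-6, 10_000))
```

For instance, 10000 single-cycle operations at 100MHz take 100µs, so a hypothetical average dynamic power of 99µW would correspond to 0.99pJ per operation.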

It must be noted that the measurements for all composite compute units (i.e. units that make use of shared resources) exclude the area, power and other metrics of the shared resources. Including them would result in misleadingly high figures, given that the units are explicitly designed to share resources. The total area of the CiM provides figures more representative of this integration.

#### 5.6.1 Adder

The adder is a single-cycle, combinational module that adds two fixed-point numbers. It uses a ripple-carry adder architecture. The adder has a latency of 1 cycle, which simplifies the logic that uses it. It also provides an overflow flag. To reduce dynamic power consumption, the adder only updates its output when the refresh signal is high.
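The behaviour of the overflow flag can be illustrated with a short software model. This is a sketch only: it assumes a 32-bit two's complement word, the names `wrap` and `fixed_add` are hypothetical, and neither the ripple-carry structure nor the refresh gating is modelled.

```python
WIDTH = 32  # assumption: total width of a Q22.10 word


def wrap(value: int, width: int = WIDTH) -> int:
    """Interpret the low `width` bits of value as a signed two's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def fixed_add(a: int, b: int, width: int = WIDTH) -> tuple[int, bool]:
    """Add two raw fixed-point words; return (wrapped sum, overflow flag)."""
    total = wrap(a + b, width)
    # Signed overflow: both operands share a sign and the result's sign differs.
    overflow = ((a < 0) == (b < 0)) and ((total < 0) != (a < 0))
    return total, overflow


if __name__ == "__main__":
    print(fixed_add(1024, 2048))      # 1.0 + 2.0 in Q22.10
    print(fixed_add(2**31 - 1, 1))    # largest positive word plus one LSB
```

Because both operands share the same number of fractional bits, fixed-point addition is plain integer addition on the raw words; only the overflow check is specific to the signed representation.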

#### 5.6.2 Multiplier

The multiplier is very similar to the adder. One difference is that it uses Gaussian rounding (also known as banker's rounding), which rounds a value lying exactly halfway between two representable numbers to the one whose least-significant bit is even. This reduces the bias in the output that is commonly observed with standard rounding methods, which is particularly important in MAC operations where the error can accumulate. The multiplier also has a latency of 1 cycle and provides an overflow flag. Like the adder, the multiplier only updates its output when the refresh signal is high.
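A software model of the rounding step may help to see how ties are resolved. The sketch below assumes that the product of two Q22.10 words carries 20 fractional bits and is brought back to 10 by a rounded right shift; the helper names are hypothetical, and the overflow flag and refresh gating are left out.

```python
FRAC_BITS = 10  # fractional bits of a Q22.10 word


def round_half_even_shift(value: int, shift: int) -> int:
    """Divide value by 2**shift, rounding ties to the nearest even result."""
    floor = value >> shift                # arithmetic shift rounds toward -infinity
    remainder = value - (floor << shift)  # always in [0, 2**shift)
    half = 1 << (shift - 1)
    if remainder > half or (remainder == half and floor & 1):
        floor += 1
    return floor


def fixed_mul(a: int, b: int) -> int:
    """Multiply two raw Q22.10 words and round the product back to Q22.10."""
    return round_half_even_shift(a * b, FRAC_BITS)


if __name__ == "__main__":
    print(fixed_mul(1536, 2048))  # 1.5 * 2.0 in Q22.10
    print(fixed_mul(1, 512), fixed_mul(3, 512))  # two exact ties
```

In the two tie cases of the usage example, the exact products are 0.5 and 1.5 least-significant bits; they round to 0 and 2 respectively, so one tie rounds down and the other rounds up, which is why the rounding error does not drift in one direction during accumulation.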

#### 5.6.3 Divider

The divider is more complicated than the adder and multiplier. It performs bit-wise long division and has a latency of N+Q+3 cycles, where N is the number of integer bits and Q is the number of fractional bits (35 cycles for Q22.10). The divider also provides flags for overflow and divide-by-zero, as well as done/busy status signals. The module starts division on an active-high pulse of the start signal and provides the result when the done signal is high. The divider module is mostly used in the MAC module during computation of the Swish activation function.
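Bit-wise long division produces one quotient bit per iteration by trial subtraction. The sketch below is a software illustration of that principle under several assumptions: it divides magnitudes and restores the sign afterwards, it truncates toward zero, and it pre-scales the dividend by 2^Q so that the quotient is again in Q22.10. The function names are hypothetical, and the iteration count of this sketch is not claimed to match the latency of the hardware module.

```python
FRAC_BITS = 10  # fractional bits of a Q22.10 word


def long_divide(dividend: int, divisor: int, bits: int) -> tuple[int, int]:
    """Restoring division of non-negative integers; returns (quotient, remainder).

    `bits` must be large enough to hold the dividend.
    """
    quotient, remainder = 0, 0
    for i in range(bits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down next bit
        quotient <<= 1
        if remainder >= divisor:  # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


def fixed_div(a: int, b: int, total_bits: int = 32) -> tuple[int, bool]:
    """Divide two raw Q22.10 words; returns (raw quotient, divide-by-zero flag)."""
    if b == 0:
        return 0, True
    scaled = abs(a) << FRAC_BITS  # keep FRAC_BITS fractional bits in the quotient
    magnitude, _ = long_divide(scaled, abs(b), total_bits + FRAC_BITS)
    negative = (a < 0) != (b < 0)
    return (-magnitude if negative else magnitude), False


if __name__ == "__main__":
    print(fixed_div(3072, 2048))   # 3.0 / 2.0 in Q22.10
    print(fixed_div(1024, 0))      # divide-by-zero flag raised
```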

#### 5.6.4 Exponential

The exponential module computes the exponential $e^x$ of a fixed-point number x. It combines an identity of the exponential function with a Taylor series approximation around zero. Specifically, the module transforms the exponential as follows:

$$e^x = 2^{\frac{x}{\ln(2)}} = 2^z = 2^{\lfloor z \rfloor} 2^{z - \lfloor z \rfloor} \tag{1}$$

where $z = x/\ln(2)$. The module can then compute $2^{\lfloor z \rfloor}$ as an inexpensive bit-shift operation and $2^{z-\lfloor z \rfloor}$ as a Taylor series approximation. To determine a reasonable number of terms to use for the Taylor series expansion, an accuracy study was run. Figure 1 shows the relative error of the exponential module as a function of the order of the Taylor series expansion for both the fixed-point (Q22.10) approximation and the float (64b) approximation. As can be seen, the error decreases as the order of the expansion increases. However, for the fixed-point approximation, it converges to a minimum error of 0.992%. This is because the quantization error of the fixed-point format dominates the Taylor series error. Therefore, using a 3rd order Taylor series expansion to approximate the exponential function is a good balance between accuracy and latency/energy. Note that this error was measured over the input range of [-4, 4]. According to the functional simulation, this corresponds to roughly $\pm 3$ standard deviations from the mean of inputs to the exponential function.

To further speed up the computation, the exponential module uses a lookup table to store the Taylor series coefficients as well as 1/ln(2). To reduce area, the exponential module does not instantiate its own adder and multiplier modules. Rather, it accesses the adder and multiplier modules in the CiM module, which are shared with other compute units. The latency is 24 cycles.
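The range reduction followed by a short Taylor series can be sketched in a few lines. The model below works in 64-bit floating point, so it captures only the truncation error of the series and not the Q22.10 quantization discussed above; the function name `exp_approx` and the way the series is evaluated are assumptions of this sketch, not a description of the hardware datapath.

```python
import math

LN2 = math.log(2.0)
INV_LN2 = 1.0 / LN2  # constant that maps x to z = x / ln(2)


def exp_approx(x: float, order: int = 3) -> float:
    """Approximate e**x as 2**floor(z) * 2**(z - floor(z)) with z = x / ln(2)."""
    z = x * INV_LN2
    z_int = math.floor(z)
    z_frac = z - z_int  # fractional part, in [0, 1)
    # Taylor series of 2**f = e**(f * ln 2) around f = 0, truncated at `order`.
    u = z_frac * LN2
    term, series = 1.0, 1.0
    for k in range(1, order + 1):
        term *= u / k
        series += term
    # Scaling by 2**z_int is a bit shift in fixed-point hardware.
    return math.ldexp(series, z_int)


if __name__ == "__main__":
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        rel_err = abs(exp_approx(x) - math.exp(x)) / math.exp(x)
        print(f"x={x:+.1f} approx={exp_approx(x):.6f} rel_err={rel_err:.2e}")
```

Because the fractional part of z lies in [0, 1), the series argument never exceeds ln(2), which is what keeps a low-order expansion usable across the whole input range.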

#### 5.6.5 Square Root

The square root module computes the square root of a fixed-point number using an iterative algorithm. It has a latency of (N+Q)//2+1 cycles, where // denotes integer division. The module provides flags for overflow and negative radicand and start/busy/done signals. The module starts computation on an active-high pulse of the **start** signal and provides the result when the **done** signal is high.
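The text does not name the iterative algorithm, so the sketch below assumes a common choice, the digit-by-digit (bit-serial) method, which resolves one result bit per iteration. The function names are hypothetical, the radicand is pre-scaled by 2^Q so that the root is again in Q22.10, and the iteration count of this sketch is not claimed to match the latency of the hardware module.

```python
FRAC_BITS = 10  # fractional bits of a Q22.10 word


def isqrt_bitwise(n: int, bits: int = 32) -> int:
    """Floor of sqrt(n) for 0 <= n < 2**bits (bits even), one result bit per iteration."""
    root = 0
    bit = 1 << (bits - 2)  # highest power of four that fits in the width
    while bit:
        if n >= root + bit:
            n -= root + bit
            root = (root >> 1) + bit
        else:
            root >>= 1
        bit >>= 2
    return root


def fixed_sqrt(raw: int) -> tuple[int, bool]:
    """Square root of a raw Q22.10 word; returns (raw root, negative-radicand flag)."""
    if raw < 0:
        return 0, True
    # sqrt(raw / 2**Q) * 2**Q equals sqrt(raw * 2**Q), hence the pre-scaling.
    return isqrt_bitwise(raw << FRAC_BITS, bits=42), False


if __name__ == "__main__":
    print(fixed_sqrt(4096))   # sqrt(4.0) in Q22.10
    print(fixed_sqrt(-1024))  # negative radicand flag raised
```

For reference, the latency formula in the text gives (22+10)//2+1 = 17 cycles for Q22.10, which agrees with Table I.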

Figure 1: Error of exponential approximation as a function of Taylor series expansion order



#### 5.6.6 Multiply-Accumulate

The MAC module performs a vector dot-product for a given pair of base addresses for the data and length of the vector, and applies a selectable activation function to the result. Similar to the exponential module, it uses the shared adder, multiplier, divider and exponential modules in the CiM module. It can implement three activation functions: none, linear and Swish. For a nominal length of 64 (which corresponds to the embedding depth of the model, a very common value for matrix dimensions in the model) and the Q22.10 format, the latencies are 386, 391 and 456 cycles, respectively. Note that, although the Swish activation function comprises a divider operation, the MAC compute latency can still be kept fairly short because the divisor is the same for all elements. The module can thus perform the division once and multiply by the inverse, which is a single-cycle operation. Finally, the MAC module can be directed to choose the second vector from the weights memory or the intermediate results memory.
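The addressing scheme and the optional activation can be sketched with a floating-point model. Everything named here is hypothetical: memories are modelled as Python lists, the second memory stands for either the weights or the intermediate results memory, and only the "none" and Swish options are shown because the text does not define the linear activation precisely.

```python
import math


def swish(x: float) -> float:
    """Swish activation, x * sigmoid(x), written with a single division."""
    return x / (1.0 + math.exp(-x))


def mac(data_mem, data_base, second_mem, second_base, length, activation="none"):
    """Dot product of two vectors read from their base addresses, then an activation.

    second_mem models whichever memory is selected for the second operand.
    """
    acc = 0.0
    for i in range(length):
        acc += data_mem[data_base + i] * second_mem[second_base + i]
    if activation == "none":
        return acc
    if activation == "swish":
        return swish(acc)
    raise ValueError(f"unsupported activation: {activation}")


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    weights = [10.0, 20.0, 30.0, 40.0]
    print(mac(data, 1, weights, 0, 2))            # 2*10 + 3*20
    print(mac(data, 0, weights, 0, 4, "swish"))
```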

#### 5.6.7 Softmax

The softmax module computes the softmax function of a vector of fixed-point numbers. Similarly to the MAC module, it uses the shared adder, multiplier, divider and exponential modules and provides busy and done signals. For a 64-element Q22.10 vector, the latency is 2024 cycles. This is significantly longer than other vector compute modules such as the MAC because, in the softmax operation, each element is exponentiated individually.
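A floating-point sketch shows why the exponentials dominate the work. The function below is illustrative only: computing the reciprocal of the sum once and multiplying each element by it is an assumption of this sketch, and no maximum subtraction or fixed-point quantization is modelled.

```python
import math


def softmax(values):
    """Softmax of a list of numbers, with one exponential per element."""
    exps = [math.exp(v) for v in values]  # the per-element exponentials
    inv_total = 1.0 / sum(exps)           # a single division, reused below
    return [e * inv_total for e in exps]


if __name__ == "__main__":
    print(softmax([0.0, 0.0]))
    print(softmax([1.0, 2.0, 3.0]))
```

As a rough plausibility check, and assuming the exponentials are evaluated one after another, 64 exponentials at 24 cycles each already account for 1536 of the 2024 cycles.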

#### 5.6.8 LayerNorm

The final compute module is the LayerNorm module. It computes the Layer Normalization of a vector of fixed-point numbers. As described in Section 4, the LayerNorm operation consists of a normalization of the vector on the horizontal dimension followed by scaling and shifting using learnable parameters on the vertical dimension. Because each CiM module stores one vector at a time, the LayerNorm operation must be separated into two stages with a matrix transpose broadcast between the two. The latency for the first half is 1469 cycles and the latency for the second half is 494 cycles. The module provides busy and done signals and is controlled by a half-select signal and a start pulse. Because the length of the vector is constrained to be a power of two, the module uses bit-shifting instead of division for the normalization operation to decrease latency and energy per operation.
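The two-stage split and the power-of-two shortcut can be sketched in floating point. The function names, the `eps` stabilizer and the element-wise form of the second stage are assumptions of this sketch; the transpose broadcast between the stages and the fixed-point datapath are not modelled. Dividing by the vector length is written as a scaling by a power of two (`math.ldexp`), which corresponds to a bit shift in fixed-point hardware.

```python
import math


def layernorm_stage1(x, eps=1e-5):
    """First half: normalize a vector to zero mean and unit variance."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("vector length must be a power of two")
    shift = n.bit_length() - 1         # log2(n)
    mean = math.ldexp(sum(x), -shift)  # division by n as a shift
    var = math.ldexp(sum((v - mean) ** 2 for v in x), -shift)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def layernorm_stage2(x_norm, gamma, beta):
    """Second half: scale and shift with the learnable parameters."""
    return [g * v + b for v, g, b in zip(x_norm, gamma, beta)]


if __name__ == "__main__":
    normalized = layernorm_stage1([1.0, 2.0, 3.0, 4.0])
    print(layernorm_stage2(normalized, [2.0] * 4, [1.0] * 4))
```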

### 5.7 A Note About Software-Hardware Co-Design

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.

## 6 Evaluation of Performance Metrics

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.

### 6.1 Vision Transformer

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.

### 6.2 Accelerator

## 7 Future Work

## 8 Conclusion

# References

[1] Xilin Liu and Andrew G Richardson. "Edge deep learning for neural implants: a case study of seizure detection and prediction". In: *Journal of Neural Engineering* 18.4 (2021), p. 046034.

# A Codebase Statistics

It may be interesting to the reader to appreciate the size of the codebase needed to develop a project of similar scale. The code for this project is available in my GitHub repository. The following table provides a breakdown of the number of lines of code in the project.

Table II: Line and file count per file type in the codebase

| File type     | File count | Line count | Percent of total |
|---------------|------------|------------|------------------|
| Python        | 12         | 3000       | 33.7%            |
| SystemVerilog | 12         | 2500       | 30.4%            |
| C++           | 12         | 1250       | 18.9%            |
| TeX           | 12         | 670        | 8.2%             |
| Shell         | 12         | 300        | 4.3%             |
| Other         | 12         | 20         | 4.5%             |
| Total         | 60         | 13,000     | 100%             |

In addition, there have been 200 commits to the repository.

# B Reflection on Learnings and Experience Gained

