# Fast pivot Function for MLIR's Presburger Library Through Vectorization and Integer Arithmetic in FPU

Zhou Qi



4th Year Project Report Computer Science School of Informatics University of Edinburgh

2023

## **Abstract**

This report presents a fast implementation of the core function pivot for a math library in MLIR by performing vectorized integer arithmetics in FPU. The hot loop of the pivot function performs overflow-checked multiplication and addition on each element of an input matrix of low dimension and mostly small-value items. MLIR's upstream uses element-wise transprecision computing, where the data type of each element starts with int64\_t and will be switched to LargeInteger in case of overflow. Compilers cannot automatically vectorize this approach, and int64\_t has a much larger bit width than what is typically needed for most items in the matrix. Additionally, extra arithmetics are required to perform overflow checking for integers, resulting in significant overhead. These issues can be addressed by taking advantage of SIMD and reducing the bit width for every element. This report also introduces the int24\_t data type, a 24-bit integer data type created from the sign bit, plus the 23-bit mantissa of a 32-bit floating point. int24\_t overflow can be captured as floating point imprecision by a status register, making overflow awareness almost free. On a representative 30-row by 19-column input matrix, the runtime is reduced from 550 ns to 26 ns, achieving 20 times speedup.

# **Research Ethics Approval**

This project was planned in accordance with the Informatics Research Ethics policy. It did not involve any aspects that required approval from the Informatics Research Ethics committee.

# **Declaration**

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Zhou Qi)

# **Acknowledgements**

I would like to express my deepest gratitude to my supervisor, Tobias Grosser, for his invaluable guidance and support throughout this project. His innovative ideas and enthusiasm have inspired me, and working under his mentorship was a great opportunity. I have learned a lot about computer architecture throughout this exciting project; this would not have been possible without his involvement.

I am also grateful to Tobias's Ph.D. students, Arjun Pitchanathan and Sasha Lopoukhine. Arjun's assistance has been instrumental in the progress of this project. I want to extend special thanks to Sasha, for generously granting me access to the powerful 7950x workstation.

Finally, I would like to express my appreciation to all my friends and my cat Nagisa Kaworu, who have been a constant source of encouragement and motivation. A special mention goes to emanon42, lyzh, gjz010, and Marisa Kirisame, their camaraderie and support have made this journey enjoyable and memorable.

# **Table of Contents**

| 1 | Intr | oduction                                    | 1  |
|---|------|---------------------------------------------|----|
| 2 | Bacl | kground                                     | 4  |
|   | 2.1  | Linear Programming and Simplex Algorithm    | 4  |
|   | 2.2  | Presburger Library                          | 5  |
|   | 2.3  | Modern CPU Micro-architecture               | 6  |
|   | 2.4  | Floating Points                             | 8  |
|   |      | 2.4.1 IEEE 754                              | 8  |
|   |      | 2.4.2 Fused-multiply-add                    | 8  |
|   |      | 2.4.3 Representing "int24_t" and "int53_t"  |    |
|   |      | Using Floating Points                       | 9  |
|   | 2.5  | Google Benchmark                            | 10 |
|   | 2.6  | llvm-mca                                    | 11 |
|   |      |                                             |    |
| 3 | Exp  | eriments with Toy Example                   | 14 |
|   | 3.1  | Vectorization Method                        | 14 |
|   |      | 3.1.1 Clang's Automatic Vectorization       | 14 |
|   |      | 3.1.2 Clang's Vector Data Type              | 15 |
|   |      | 3.1.3 Evaluation                            | 16 |
|   | 3.2  | Matrix Data Structure                       | 18 |
|   | 3.3  | Matrix Element Data Type                    | 19 |
|   |      | 3.3.1 Width                                 | 19 |
|   |      | 3.3.2 Overflow Checking for Integers        | 20 |
|   |      | 3.3.3 Overflow Checking for Floating Points | 22 |
|   |      | 3.3.4 Comparing int16_t and float           | 24 |
| 4 | Imp  | lementation and Optimization of pivot       | 26 |
|   | 4.1  | Matrix-wise Transprecision                  | 27 |
|   | 4.2  | Double Buffering                            | 28 |
|   | 4.3  | Alignment                                   | 29 |
|   | 4.4  | Column Size Specialization                  | 29 |
|   | 4.5  | Evaluation                                  | 30 |
| 5 | Con  | clusion and Future Work                     | 33 |
| 6 | Rela | ated Work                                   | 35 |

# **Chapter 1**

## Introduction

MLIR, Multi-Level Intermediate Representation, is an infrastructure for building reusable and extensible compilers. It aims to reduce fragmentation in domain-specific languages and heterogeneous hardware [6]. Its Presburger library provides polyhedral compilation techniques for dependence analysis, loop optimization [5], and cache modeling [16]. Presburger arithmetics involves determining whether a linear problem is satisfiable under its constraints [3], and can be solved using the simplex method of linear programming, with its core function pivot being the primary performance bottleneck [13].

The pivot function involves doing two multiplication and one addition operation on every element in a matrix. Notably, the input matrices for this library tend to exhibit characteristics of small values and low dimensionality. For example, 90% of test cases work with 16-bit integers that never overflow, and 74% of the runtime is spent on test cases that we can compute using 16-bit integers and matrices with at most 32 columns [14]. We can leverage these properties to utilize hardware resources from modern micro-architecture, thereby accelerating the runtime.

Currently, the source code in MLIR upstream adopts a nested for-loop to iterate through every matrix element in a transprecision manner. Each element in the matrix can either be int64\_t or LargeInteger. The int64\_t version of the pivot function is dispatched first; then, in case of overflow, it switches to the LargeInteger algorithm. This approach is computationally expensive and inefficient for the following reasons:

- 1. int 64\_t has a much larger bit width than what is typically needed for most of the elements in the matrix.
- 2. The compiler is not capable of generating vectorized instructions from scalar source code.
- 3. Overflow is checked through additional arithmetic operations.

To propose a faster alternative to the pivot function, we could consider constructing a new pivot function that satisfies the following conditions:

1. Utilize SIMD: preliminary benchmark (Section 3.1.3) indicates a 20 times perfor-

mance improvement.

- 2. Use small bit width for every element: reducing the bit width by half doubles the amount of elements packed into a single vector register and essentially reduces the instruction count by half (Table 3.3).
- 3. Fast overflow checking: overflow has to be checked manually for integers, which introduces at least 65% overhead toward total runtime (Section 3.3.2.1). This is because the x86 architecture does not provide status registers to indicate vectorized integer overflow. However, there is one for floating points, making overflow detection for vector floating point operations almost free.

Previously there was an attempt to vectorize pivot that utilizes int16\_t and targets matrices with 32 columns or less [14]. This approach offers the advantage of being able to pack a row of 32 elements into a single AVX-512 ZMM register and addresses issues 1 and 2. However, overflow is still checked manually, causing 4 times more instruction count (Section 3.3.2.1). Additionally, this approach introduces a new drawback. The AVX-512 extension is required for CPUs to support vectorized int16\_t instructions, and it is very rare among CPUs manufactured in the last decade (Section 2.3).

An alternative approach is to do 24-bit or 53-bit integer computation using float (32-bit floating point) or double (64-bit floating point) (Section 2.4.1). Though floating points are notorious for precision issues, they are reliable when representing integers that fit inside their mantissa. "Overflow" can be considered as imprecision rather than a restriction on bit width. When the result of some integer computation exceeds the bit size of the mantissa, floating point imprecision almost always occurs, and a status register will be set automatically. Occasionally, imprecision does not happen due to floating point normalization, which is not considered as overflow (Section 3.3.3).

Compared to int16\_t, due to that the overflow checking overhead can be significantly reduced, using floating points could potentially be faster. However, it comes with the cost of larger bit width and smaller vector size, as the support for 16-bit floating point half is not provided by AVX extensions. The cost of overflow checking for floating points is the time spent on resetting the status register once at the beginning of a sequence of calls to pivot [13], plus reading it in each pivot call. Even though reading the register takes 5 ns and resetting it costs 10.5 ns (Figure 3.4), the effective total overhead can be less than 1 ns through optimization (Figure 4.7). The superscalar and out-of-order execution pipeline hides the latency of reading the status register. Also, the pivot function is often invoked repeatedly in a sequence [13], making the cost of resetting the status register negligible for every single instance of call to pivot.

Another advantage of float over int32\_t is that floating points offer better compatibility with old computers. Vectorized float implementation of pivot runs fine with AVX-2; it does not require new features from the AVX-512 extension. Thus, it can be executed on almost every x86 CPU from the last decade.

This report will first analyze the capability of modern CPU micro-architecture, especially Zen 4, through a matrix element-wise fused-multiply-add toy example under the various configurations regarding vectorization methods, matrix data structures, element data types, and data widths (Section 3). It is discovered that optimal performance can be

achieved by selecting Clang vector type extension as the vectorization method and using a flat list as the matrix data structure. Also, int16\_t and float outperforms other data types in the integer and floating point category, respectively. However, even though int16\_t is faster than float in the toy example, it is not sound to deduce that it will be the same case in the workload of the pivot function due to their distinct trade-offs. int16\_t benefits from bigger vector size, less memory pressure, and less instruction count, while float has to spend only a single-time 1 ns overhead on overflow checking.

Then, two detached versions of pivot function from the Presburger library are built, according to the most optimal configurations from the toy example, one using int16\_t and the other using float. After applying further optimizations (Section 4), a benchmark is set up for a selected 30-row and 19-column matrix. The achievements are:

- 1. The vectorized float implementation of pivot significantly outperforms the upstream. Specifically, it takes only 26 ns, while the runtime of the upstream implementation is 550 ns.
- 2. The float implementation offers substantial compatibility advantage over int16\_t for the vast amount of non-AVX-512 platforms. Even though the int16\_t version is 6 ns faster than the float version, the significant improvement from the upstream, together with AVX-2's compatibility, render the 6 ns gap trivial.

# **Chapter 2**

# **Background**

## 2.1 Linear Programming and Simplex Algorithm

Linear programming is a mathematical optimization technique used to model and find the best possible solution to a problem, using a set of constraints and the objective function to maximize or minimize. Its canonical form consists of an objective function:  $Z = c_1x_1 + c_2x_2 + ... + c_nx_n$ , subjecting to the constraints:

$$x_1 ... x_n \ge 0$$
  
 $a_{11}x_1 + a_{12}x_2 + ... + a_{1n}x_n = b_1$   
 $a_{21}x_1 + a_{22}x_2 + ... + a_{2n}x_n = b_2$   
...

 $a_{m1}x_1 + a_{m2}x_2 + \dots + a_{mn}x_n = b_m,$ 

where  $x_1 ... x_n$  are the variables,  $c_1 ... c_n$  are the coefficients of the objective function, and non-negative  $a_{11}, a_{12} ... a_{21} ... a_{m1} ... a_{mn}$  together with  $b_1 ... b_m$  encodes the constraints of the problem in a matrix [13].

The simplex algorithm is an iterative approach to find  $x_1...x_n$  that maximizes the objective function while satisfying constraints at the same time. The matrix goes through a sequence of transformations, until the solution appears or it is found that the solution is not feasible. The transformations are called "pivot", and it solves the linear equation at the pivot row for the variable at the pivot column:

The pivot transformation involves division and thereby produces rational numbers. Rationals of base 10 cannot be expressed as binary floating points precisely due to inaccuracy caused by potential rounding. In addition, divisions are expensive operations compared to additions or multiplications. This issue can be addressed through the

denormalization of rows, by multiplying each row with their common denominator [13]:

Pivot item 
$$\alpha = \frac{\alpha_n}{d_p}$$
  
Rest of the pivot row  $\beta = \frac{\beta_n}{d_p}$   
Rest of the pivot column  $\gamma = \frac{\gamma_n}{d_i}$   
Other entries  $\delta = \frac{\delta_n}{d_i}$ 

where  $d_p$  is the denominator of the pivot row,  $d_i$  is the denominator of the  $i^{th}$  row. After demoralization  $\alpha_n$ ,  $\beta_n$ ,  $\gamma_n$ , and  $\delta_n$ , becomes  $\alpha$ ,  $\beta$ ,  $\gamma$ , and  $\delta$ .

Substituting into the processing formula, the pivot row becomes:

Pivot item 
$$\frac{\alpha_n}{d_p} \to \frac{d_p}{\alpha_n}$$
Rest of the pivot row  $\frac{\beta_n}{d_p} \to -\frac{\beta_n}{\alpha_n}$ 
This transformation effectively becomes swapping  $\alpha_n$  with  $d_p$  and negating every

non-pivot-column item.

Likewise, the non-pivot rows are transformed as the following:

Rest of the pivot column 
$$\frac{\gamma_n}{d_i} \rightarrow \frac{\gamma_n d_p}{a_n d_i}$$
  
Other entries  $\frac{\delta_n}{d_i} \rightarrow \frac{\delta_n a_n - \beta_n \gamma_n}{a_n d_i}$ 

and can be implemented in these procedures:

- 1. Update row denominator:  $d'_i = d_i a_n$ ,
- 2. Multiply non-pivot-column items by  $a_n$ ,
- 3. Subtract  $\beta_n \gamma_n$  from every non-pivot row.

#### 2.2 **Presburger Library**

The Fast Presburger Library (FPL) paper collected 465,460 representative linear programming problems encountered during cache analytical modeling, polyhedral loop optimization, and accelerator code generation. It is found that most of the constraint matrices are low in dimensionality and small in the value of each element [13]. Specifically, more than 99% of the entries from matrices require less than 10 bits, and 95% of them are less than 20 columns (Figure 2.1). Thus, most rows fit inside a 512-bit ZMM vector register of 32 int16\_t elements, and a row operation can be done in a single instruction.

However, in rare and corner cases, large coefficients can be up to 127 bits. Practically, the upper bound of coefficient size is unknown, making it required to have arbitrary precision arithmetic LargeInteger as a backup. Also, the maximum observed column count is 28, and there is no specific maximum column count.

The FPL paper presents a 3-way transprecision implementation for the Presburger library's simplex solver using the algorithm described in Section 2.1. It starts from rowwise vectorized int16\_t, and will be re-dispatched to element-wise scalar int64\_t or element-wise scalar Large Integer algorithm when overflow, as illustrated in Figure 2.2. Unfortunately, the MLIR upstream only presents a 2-layer transprecision, consisting of element-wise scalar operation using int64\_t and LargeInteger. The int16\_t version is not merged with the upstream for two reasons:





Figure 2.1: Linear programming problems for program analysis exhibits unique characteristics of small value size and low dimensionality [13].

- 1. int16\_t vector instructions require the AVX-512 extension, but hardware support is rare (Section 2.3).
- 2. Despite the int16\_t version being fast, overflow checking overhead is 65% [14]. Using floating points could significantly reduce this overhead and potentially be faster (Section 2.4).

#### 2.3 Modern CPU Micro-architecture

A recent trend in the development of the x86-64 architecture is to include the AVX-512 instruction set architecture (ISA) extension. AVX-512 succeeds AVX-2, increasing vector width from AVX-2's 256 bits to 512 bits. AVX-512 also provides new instructions, for example, int16\_t saturated addition (Section 3.3.2.1).

Even though its specification was released by Intel in 2013, it had been unpopular [27], as it did not bring practical performance improvements. The primary reason was that it consumes much more power than usual, causing severe overheating. A classic example is the micro-architecture Skylake from Intel, and its AVX-512-enabled counterpart Skylake-X. Skylake has 2 256-bit FMA AVX-2 execution units<sup>1</sup>. Intel provides 2 512-bit AVX-512 FMA units for Skylake-X, by fusing the existing AVX-2 units into an AVX-512 unit, then introduces an additional FMA AVX-512 unit [8]. The additional AVX-512 unit increases the heat flux density of the chip, causing server thermal throttling issues.

Intel attempted to mitigate this problem by introducing the "AVX-offset" mode. When a workload involving AVX-512 instructions is encountered, the CPU automatically enters the AVX-offset mode and reduces its clock frequency [23]. This solution only works in

<sup>&</sup>lt;sup>1</sup>Fused-multiply-add (FMA) execution units are a type of floating point execution units, capable of doing addition, multiplication, or both in a single instruction. See Section 2.4.2.



Figure 2.2: The architecture of FPL. This report focuses on the "Simplex" method of the "Core Algorithms" since it is the main performance bottleneck. [14]

theoretical benchmarks where AVX-512 instructions are present in large bulk, but in practice, it is more common to have a mix of control flow, scalar, SSE, and AVX-512 instructions. The clock frequency of executing those non-AVX-512 instructions is decreased together with AVX-512 instructions, causing many workloads could run faster with disabled AVX-512 and higher clock frequency [21].

AMD recently decided to add support for AVX-512 in their latest micro-architecture Zen 4. It has slightly less computing power than Intel but is much more efficient. Zen 4 can be considered as the modernized version of Zen 3 or Zen 2, where Zen 2 and Zen 3 support AVX-2 by providing 2 FADD units<sup>2</sup> and 2 FMA units of 256-bit width [2]. Zen 4 "double-pumps" these existing circuits to create a single 512-bit FADD and a single 512-bit FMA, without introducing any new arithmetic units [21]. Zen 2 and Zen 3 are reputable for their high performance per watt [17], and Zen 4 would be better with its more advanced lithography [21].

Additionally, recompiling existing software to target AVX-512 may improve performance on Zen 4. Though the back-end of Zen 4 can commit 2 AVX-2 FADD and 2 AVX-2 FMA instructions every cycle, the front-end has to dispatch 4 instructions per cycle, which is quite difficult. The equivalency in AVX-512 only takes 2 instructions, which is much more likely to be sustained by the frontend [21].

<sup>&</sup>lt;sup>2</sup>Floating-point add units (FADD) can execute floating point addition instructions only. They are less capable than FMA.



Figure 2.3: IEEE 754 specification for single (32 bits) and double (64 bits) precision floating point [24]. In some literature "mantissa" is referred as "fraction".

## 2.4 Floating Points

#### 2.4.1 IEEE 754

IEEE 754 is the standard for storing and manipulating floating point numbers in modern x86 computers. The standard defines several formats for representing floating point numbers. The most common ones are 32-bit single precision (float) and 64-bit double precision (double). For each format, IEEE 754 defines how many bits are used to represent the sign, the exponent, and the mantissa.

As Figure 2.3 shows, the sign bit is a single bit that indicates whether the floating point number is positive or negative. There are 8 bits and 11 bits for exponent in float and double, respectively, representing the order of magnitude. The remaining 23 bits in float and 52 bits in double are mantissae, the fractional part of the number is stored here. The value of a floating point number can be computed through this formula:  $(-1)^s * 2^{(e-B)} * (1+f)$  where s is sign, e is exponent, f is mantissa and g is a constant bias value: 127 for float, 1023 for double.

Figure 2.3 provides an example of float in binary form, whose value is 0.15625:

```
sign = 0b0 -> 0
exponent = 0b011111100 -> 0b011111100 - 127 = 124 - 127 = -3
mantissa = 0b01 -> 0b1.01 = 1.25
```

Substituting the sign, the exponent, and the mantissa into the formula, we get:

```
-1^0 * 2^(-3) * 1.25 = 0.15625.
```

## 2.4.2 Fused-multiply-add

After doing floating point arithmetic, it is required to normalize the result of floating-point arithmetic before it can be used further. However, by feeding the result of a floating-point multiplication (FMUL) directly into the floating-point addition (FADD) logic without the need for normalization and rounding in between, and a fused-multiply-add (FMA) operation is effectively created: Y = (A \* B) + C, where A, B and C are the

operands, Y is the result [26].

FMA saves cycles and reduces the accumulation of rounding errors, while at the same time not adding significant complexity to the circuit. An FMA execution unit is capable of doing FMUL, FADD, and FSUB as well:

```
Addition: Y = (A * 1.0) + C
Multiplication: Y = (A * B) + 0.0
Subtraction: Y = (A * -1.0) + C
```

This feature is useful in many numerical computations involving simultaneous multiplication and addition operations, such as dot products and matrix multiplications. Since the pivot function performs multiplies and adds the pivot row and some constant value to each row of the matrix, the performance of FMA is critical to the overall efficiency.

# 2.4.3 Representing "int24\_t" and "int53\_t" Using Floating Points

There are common stereotypes that floating points are unreliable and likely to be imprecise. They are often illustrated in popular memes, as shown in Figure 2.4 [28]. However, when storing integer values inside floating points, floating points can be pretty reliable. Historically floating point processing units in GPUs have been utilized for fast integer arithmetic, for example, modular exponentiation and RSA algorithms [12] because often the architecture of GPU prioritizes the performance of floating points rather than integers.

Given that the mantissa part of a float is 23 bits, together with the sign bit, inexactness never occurs when representing integers less than 24 bits. Furthermore, in case of an integer overflow, floating point imprecision almost always happens, and setting a corresponding status register. The same concept applies to double data types with a mantissa of 52 bits.

This mechanism is reliable, as floating-point inexactness always implies integer -inexactness. For an integer value with a bit width bigger than the mantissa size, floating point rounding is triggered to fit the most significant bits of the integer in the mantissa, then adjust the order of magnitude in the exponent accordingly. The lower bits of the mantissa are truncated, therefore causing imprecision.

In some rare cases, integers longer than the mantissa size can be represented precisely in floating points. An example of such numbers is a large power of 2, like  $2^30 = 0 \times 40000000$ . Its binary representation in float is:

```
Sign = 0b0 -> 0

Exponent = 0b10011011 -> 0b10011011 - 127 = 155 - 127 = 28

Mantissa = 0b0 -> 0b1.0 = 1
```

Despite being greater than the size of the mantissa, they are normalized rather than rounded and therefore do not break the mechanism of representing integers in floating points.



Figure 2.4: A floating point meme: 0.1 + 0.2 = 0.3000000000000000 [28].

## 2.5 Google Benchmark

Google benchmark is a library to measure the performance of a code snippet. It provides unit-test-like interfaces to set up benchmarks around a code snippet [9]. The given example from https://github.com/google/benchmark is self-explanatory for its usage:

```
#include <benchmark/benchmark.h>
static void BM_SomeFunction(benchmark::State& state) {
    // Perform setup here
    for (auto _ : state) {
        // This code gets timed
        SomeFunction();
      }
}
// Register the function as a benchmark
BENCHMARK(BM_SomeFunction);
// Run the benchmark
BENCHMARK_MAIN();
```

The library first starts a timer, next repeatedly executing its core loop: for (auto \_: state) ... multiple times, then pauses the timer. This method ensures consistent results and minimizes the overhead required for recording the timing information.

Executing the benchmarks will report both elapsed real-time and CPU time and much other helpful information to reduce variance.

```
Running ./build/example
***WARNING*** CPU scaling is enabled, the benchmark real-time
measurements may be noisy and will incur extra overhead.
Run on (32 X 5800.00 MHz CPU s)
CPU Caches:
```

```
L1 Data 32 KiB (x16)
```

L1 Instruction 32 KiB (x16)

L2 Unified 1024 KiB (x16)

L3 Unified 32768 KiB (x2)

Load Average: 8.10, 5.14, 1.14

| Benchmark       | Time    | CPU     | Iterations |
|-----------------|---------|---------|------------|
|                 |         |         |            |
| BM_SomeFunction | 18.5 ns | 18.5 ns | 37935734   |

The warning: "CPU scaling is enabled, the benchmark real-time measurements may be noisy and will incur extra overhead" is saying that the CPU clock frequency is not consistent. The governor algorithm dynamically determines the clock frequency according to the operating system. For example, with the performance governor, the OS locks the CPU to the highest possible clock frequency, specified at /sys/devices/system/cpu/cpu\*/cpufreq/scaling\_max\_freq. In contrast, the ondemand governor will push the CPU to the highest frequency on demand and gradually reduce the frequency as the idle time increases [18].

However, the clock frequency also depends on the manufacturer and other hardware constraints. By default, both Intel (Turbo Boost) and AMD (Precision Boost Overdrive) support raising clock frequency beyond the control of the governor [10]. On the other hand, CPUs have self-protecting thermal throttling mechanisms that reduce their clock frequency and voltage when it is too hot.

The benchmark mentioned in this report were performed on an AMD 7950x desktop computer. The computer system went through the following these setups for consistent results:

- 1. Set the governor to performance,
- 2. Disable AMD Precision Boost Overdrive (or Intel Turbo Boost),
- 3. Lock clock frequency at 4.5 GHz, or any desired and feasible value,
- 4. Make sure heat dissipation is working properly.

#### 2.6 llvm-mca

11vm-mca, LLVM Machine Code Analyzer, is a tool to analyze the performance of executing some instructions on a specific CPU micro-architecture, according to scheduling information provided by LLVM [11].

By supplying <code>llvm-mca</code> with a piece of assembly code and the target micro-architecture codename, <code>llvm-mca</code> reports various metrics to indicate how fast the given instructions will execute on the specified micro-architecture. It first summarizes the instruction per clock (IPC) and throughput of the entire instruction block, then provides detailed information about each instruction, including the number of uOps, latency, throughput, potential load, store, and side effects. <code>llvm-mca</code> also reports resource pressure regarding arithmetic units and memory load or store units. When the optional <code>-timeline</code> flag is prompted, <code>llvm-mca</code> illustrates a timeline view of the analyzed code, showing how

instructions progress through the pipeline stages of the target processor. The timeline helps understand the capability of complicated out-of-order superscalar architecture.

An example is provided below. The analysis from <code>llvm-mca</code> indicates that a <code>znver2</code> (Zen 2) CPU can repeatedly execute a combination of <code>vmovaps³</code> and <code>vfmadd213ps⁴</code> instructions every 1.3 cycles, or equivalently 2.74 instructions per cycle. The output from <code>llvm-mca</code> is slightly modified and truncated to fit inside the page.

```
$ llvm-mca-15 -timeline -mcpu=znver2 ./x.s
Iterations:
                   100
Instructions:
                   400
Total Cycles:
                   146
Total uOps:
                   400
Dispatch Width:
uOps Per Cycle:
                   2.74
IPC:
                   2.74
Block RThroughput: 1.3
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
 1
     8 0.33 *
                         vmovaps (%rdx, %rsi, 4), %ymm0
     8 0.33 *
                         vmovaps (%rcx, %rax, 4), %ymm1
 1
    12 0.50 *
                         vfmadd213ps (%r8,%rdi,4), %ymm0, %ymm1
 1
     1 0.33
 1
                         vmovaps %ymm1, (%r9,%rax,4)
Resources:
    - Zn2AGU0
[0]
[1]
      - Zn2AGU1
[2]
     - Zn2AGU2
[8]
      - Zn2FPU0
[9]
     - Zn2FPU1
[10] - Zn2FPU2
[11] - Zn2FPU3
[12]
      - Zn2Multiplier
```

<sup>&</sup>lt;sup>3</sup>vmovaps can be either vector float load or store instruction, depending on how operands are structured [4].

<sup>&</sup>lt;sup>4</sup>vfmadd213ps is the instruction for fused-multiply-add [4].

. . .

```
Resource pressure per iteration:
   [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]
1.33 1.33 1.34
                          - - 0.50
                                              0.50
Resource pressure by instruction:
    [1] [2] ... [11] Instructions:
0.01 0.38 0.61 ... -
                    vmovaps
                              (%rdx,%rsi,4), %ymm0
0.23 0.68 0.09 ... -
                      vmovaps (%rcx, %rax, 4), %ymm1
0.27 0.24 0.49 ... 0.50 vfmadd213ps (%r8,%rdi,4), %ymm0, %ymm1
0.82 0.03 0.15 ... -
                      vmovaps %ymm1, (%r9,%rax,4)
Timeline view:
               0123456789
Index 0123456789
[0,0] DeeeeeeeER
                        vmovaps (%rdx, %rsi, 4), %ymm0
[0,1] DeeeeeeeER .
                        vmovaps (%rcx, %rax, 4), %ymm1
[0,2] D=eeeeeeeeeER
                        vfmadd213ps (%r8,%rdi,4), %ymm0, %ymm1
[0,3] D=====eER
                        vmovaps %ymm1, (%r9,%rax,4)
[1,0] .DeeeeeeeE----R
                        vmovaps (%rdx, %rsi, 4), %ymm0
[1,1] .DeeeeeeeE----R
                        vmovaps (%rcx, %rax, 4), %ymm1
[1,2] .D=eeeeeeeeeER
                        vfmadd213ps (%r8,%rdi,4), %ymm0, %ymm1
[1,3] .D======eER vmovaps %ymm1, (%r9,%rax,4)
[2,0] . DeeeeeeeE----R
                        vmovaps (%rdx,%rsi,4), %ymm0
[2,1] . DeeeeeeeE----R
                        vmovaps (%rcx, %rax, 4), %ymm1
```

However, when evaluating the identical assembly code on a more advanced microarchitecture <code>znver3</code> (Zen 3), <code>llvm-mca</code> reveals a reduction in IPC and throughput. This appears to be contradictory to both theoretical expectations and actual benchmarks. After submitting an issue [1], llvm maintainers explained that llvm's scheduling information are hand-crafted using <code>llvm-exegesis</code>, a micro-benchmark tool. The issue was subsequently resolved after rerunning <code>llvm-exegesis</code> and confirming that <code>znver3</code> indeed has higher throughput than expected.

This issue suggests that <code>llvm-mca</code> is not a reliable tool for analyzing machine code but rather an evaluator for Clang's behavior during instruction selection. Therefore, this report chooses Google Benchmark as the performance measuring instrument.

# **Chapter 3**

# **Experiments with Toy Example**

The pivot function does multiply and add for each row in the matrix; therefore, the performance of a simple FMA toy can be an effective indicator. This chapter reports performance analysis on simple toy examples with various setups, including:

- 1. Vectorization method
  - (a) Clang's automatic vectorization from scalar source code
  - (b) Writing source code with Clang's vector type extension
- 2. Matrix data structure
  - (a) Nested list: A list of rows, where each row is a list of numbers
  - (b) Flat list: Concatenating one row after another into a single list
- 3. Element data width
  - (a) 16 bits: int16 t
  - (b) 32 bits: int32\_t, float
  - (c) 64 bits: int64\_t, double
- 4. Element data type
  - (a) Integer
  - (b) Floating point

#### 3.1 Vectorization Method

#### 3.1.1 Clang's Automatic Vectorization

Clang can generate vectorized instructions from scalar source code, using the flags -03 -march=native on a platform with vector ISA enabled. Starting with an example (Listing 3.1), the simple vec\_mul\_add function multiplies every element from two arrays, adds the product with the third, and saves it to the destination array.

After compiling on a AVX-512 enabled Zen 4 computer and disassembling the binary, it is observed that Clang automatically packs 16 int32\_t into a ZMM register as an

#### Source code

#### Assembly snippet, vectorization on

```
vmovdqu64
             (%rsi,%rax,4),%zmm0
vmovdqu64
             0x40(%rsi,%rax,4),%zmm1
vpmulld
             (%rdi, %rax, 4), %zmm0, %zmm0
             0x40(%rdi, %rax, 4), %zmm1, %zmm1
vpmulld
             (%rdx, %rax, 4), %zmm0, %zmm0
vpaddd
             0x40(%rdx, %rax, 4), %zmm1, %zmm1
vpaddd
             %zmm0, (%rcx, %rax, 4)
vmovdqu64
vmovdqu64
             %zmm1,0x40(%rcx,%rax,4)
```

#### Assembly snippet, vectorization off

```
mov (%rsi,%rax,4),%r8d
imul (%rdi,%rax,4),%r8d
add (%rdx,%rax,4),%r8d
mov %r8d,(%rcx,%rax,4)
```

Listing 3.1: The vectorized binary and scalar binary is derived by compiling with flags -03 -march=native and -03 -march=native -mno-avx -mno-sse respectively, targeting the micro-architecture Zen 4 with clang-17.

operand of the vpmulld<sup>1</sup> and vpaddd<sup>2</sup> instructions.

Alternatively, vectorization could be disabled by adding the -mno-avx -mno-sse flags on top of -03 -march=native. These two flags guarantee that the binary will be equally optimized, with the only difference being whether vector instructions are generated or not. In this case, scalar instructions imulvpaddd and add are selected.

### 3.1.2 Clang's Vector Data Type

Another approach is to write source code with vectorization in mind from the begining. Clang provides an extension that allows programmers to declare a new type representing a vector of elements of the same data type. The syntax is

```
typedef ty vec_ty __attribute__((ext_vector_type(vec_width))),
```

<sup>&</sup>lt;sup>1</sup>Vector int32 t multiplication

<sup>&</sup>lt;sup>2</sup>Vector int32\_t addition

where  $vec\_ty$  is the name of vector type being defined,  $vec\_width$  is its size and ty is the type of the elements in the vector. For example,

```
typedef int16_t int16x32 __attribute__((ext_vector_type(32))) defines a 512-bit ZMM vector type named int16x32, consisting of 32 int16_t.
```

After defining a vector data type, a vector variable can be created by casting from a pointer of the target array. Then arithmetic operators can be applied between the vectors to perform element-wise operations. The previous vec\_mul\_add example can be rewritten as the code snippet shown in Listing 3.2:

#### Source code

Listing 3.2: Compared to the source code in Listing 3.1, coding with vector types is slightly more complicated.

#### 3.1.3 Evaluation

When comparing the performance of code written with and without the vector type and examining their assembly, it has been discovered that the automatic vectorization feature in Clang can be unpredictable and may lead to undesired behaviors. It operates as a black box and may take much effort to understand its mechanisms.

Consider the mat\_mul\_add function in Listing 3.3, a slightly more complicated version of the previous vec\_mul\_add example, where the 3 input arrays and the output array are replaced with matrics. The matrix<int> data type provides getters and setters to read and write each element, as well as a function to get the pointer. The disassembly reveals that Clang fails to recognize vectorization patterns through getter and setter functions. It decides to dispatch scalar instructions for addition (add) and multiplication (imul) entirely. The vectorized equivalency instructions vpaddd and vpmulld can be obtained by writing source code using vector types, and it is is 21 times more performant (Figure 3.1).

#### Scalar source code to be automatically vectorized by Clang

#### Assembly snippet of the hot loop

```
mov (%rsi,%rax,4),%r8d
imul (%rdi,%rax,4),%r8d
add (%rdx,%rax,4),%r8d
mov %r8d,(%rcx,%rax,4)
```

#### Source code written with Clang's vector type

#### Assembly snippet of the hot loop

```
vmovdqu64
             (%rsi,%rax,4),%zmm0
vmovdqu64
             0x40(%rsi,%rax,4),%zmm1
vpmulld
             (%rdi, %rax, 4), %zmm0, %zmm0
             0x40(%rdi, %rax, 4), %zmm1, %zmm1
vpmulld
vpaddd
             (%rdx, %rax, 4), %zmm0, %zmm0
vpaddd
             0x40(%rdx, %rax, 4), %zmm1, %zmm1
             %zmm0, (%rcx, %rax, 4)
vmovdqu64
vmovdqu64
             %zmm1,0x40(%rcx,%rax,4)
```

Listing 3.3: The toy example performs vectorized multiply-add operation on every item from 3 input matrices and saves the result to the output matrix.



Figure 3.1: The vectorization pattern for the scalar source code Listing 3.3 is not recognized by Clang, making it 21 times slower than the version written using vector types.

### 3.2 Matrix Data Structure

A matrix's most intuitive data structure is nested lists, where a list of elements represents a row, and a list of rows is a matrix. In C++ this can be represented using std::vector<std::vector<T>>, where T could be float, double, int32\_t, etc. The std::vector class provides interfaces for accessing and modifying elements, making it easy to build a matrix data structure on top.

One potential drawback of nested std::vector is that it requires two indexing operations to access an element. An alternative implementation is to "flatten" a matrix into a single std::vector by simply concatenating one row after another. To access a specific element, an index can be computed manually using the given row and column: column\_count \* row + column. Compared to the nested list approach, this reduces

half of the memory indexing operation at the cost of extra arithmetics. The differences between the two patterns are illustrated by an example provided in Table 3.4.

|                           | Nested                                                                                                                                                 | Flat                                                                                                       |
|---------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| Туре                      | <pre>std::vector&lt;    std::vector<int>&gt;</int></pre>                                                                                               | std::vector <int></int>                                                                                    |
| Structure in<br>Memory    | <pre>vector of 4 = {    vector of 4 = {0, 0, 0, 0},    vector of 4 = {0, 0, 0, 0},    vector of 4 = {0, 0, 0, 1},    vector of 4 = {0, 0, 0, 0}}</pre> | 0, 0, 0, 0,<br>0, 0, 0, 1,                                                                                 |
| Accessing row 2, column 3 | <pre>Index row first: std::vector<int>[2] then index column: std::vector<int>[2][3]</int></int></pre>                                                  | <pre>Compute i = col_count * row + col = 4 * 2 + 3 = 11, then index once: std::vector<int>[11]</int></pre> |

Table 3.4: This table highlights the differences between the nested and flat lists through an example of a 4 by 4 matrix.

Empirically, integer multiplication and addition cost less time than random memory access. In addition, both the toy example and the pivot function perform sequential load-compute-store operations on each row and each column, allowing the index of the next element to be computed by simply adding the step size, further reducing runtime spent on memory access. Benchmark (Figure 3.2) on the toy example confirms that for 16-row matrices with 16 columns, 32 columns, or 64 columns, the nested vector structure is always about 8 ns slower than the flat matrix.

## 3.3 Matrix Element Data Type

#### 3.3.1 Width

Since the numbers stored in the matrix are almost always less than 10 bits [13], using shorter data types can be more advantageous than longer ones because they allow more numbers to be packed into a single vector register (Figure 3.3). The number of instructions can be cut by half when the data width is reduced to half, and less instruction count always leads to less execution time. Given that the Zen 4 micro-architecture provides approximately the same amount of execution units for both integers and floating points, it is reasonable to estimate that the execution time is inversely proportional to the bit width of the data type. As confirmed in Figure 3.5, when overflow is ignored, int32\_t and float cost nearly the same amount of time, while int32\_t and double cost double the amount of time than int16\_t and float respectively.



Figure 3.2: On a 16-row toy example doing float FMA, selecting the flat list as the data structure of the matrix is consistently faster than the nested list implementation.

|                                 | float                                                                                | double | int16_t                                                 | int32_t                                                                             | int64_t                                         |  |
|---------------------------------|--------------------------------------------------------------------------------------|--------|---------------------------------------------------------|-------------------------------------------------------------------------------------|-------------------------------------------------|--|
| 512-bit units                   | 1 512-bit FADD + 1 512-bit FMA                                                       |        | 2 512-bit ALU                                           |                                                                                     |                                                 |  |
| 256-bit units                   | 2 256-bit FADD + 2 256-bit FMA                                                       |        | 4 256-bit ALU                                           |                                                                                     |                                                 |  |
| Fused-multiply-add              | Yes                                                                                  |        | No                                                      |                                                                                     | only on lower 52 bits,<br>no overflow exception |  |
| Saturated add                   | N/A                                                                                  |        | Yes                                                     | No                                                                                  | No                                              |  |
| Multiply higher bits            | N/A                                                                                  |        | Yes                                                     | No                                                                                  | No                                              |  |
| SIMD Floating-Point Exceptions  | Overflow, Underflow, Invalid,<br>Precision, Denormal                                 |        | No                                                      |                                                                                     |                                                 |  |
| 512-bit vector size             | 16                                                                                   | 8      | 32                                                      | 16                                                                                  | 8                                               |  |
| 256-bit vector size             | 8                                                                                    | 4      | 16                                                      | 8                                                                                   | 4                                               |  |
| Overflow checking and time cost | single time overhead:<br>read status register 4.5 ns<br>clear status register 9.4 ns |        | additional arithmetic,<br>about 4x more<br>instructions | Must fallback to scalar code due to lack of saturated add and multiply higher bits. |                                                 |  |

Figure 3.3: A summary of features and resources provided by the Zen 4 micro-architecture for different data types [21] [2].

## 3.3.2 Overflow Checking for Integers

The x86-64 micro-architecture provides the seto instruction to set some byte to 1 if overflow occurred as a result of integer arithmetic. However, seto only works for scalar operations. There is no instruction or status register to indicate whether a previous vector add or multiply instruction produced overflown results. Therefore, overflow has to be checked manually by some additional vector instructions. This would slow down the computation to some extent. Alternatively, arithmetics have to be carried out on

each element individually in a scalar manner, resulting in even worse performance.

One advantage of int16\_t is that it can be used with AVX-512's saturated add and multiply higher bits vector instructions (Figure 3.3), making it possible and convenient to write vectorized and overflow-aware code. However, Zen 4's AVX-512 extension does not provide equivalent instruction for int32\_t or int64\_t and therefore must be processed as scalar values.

#### 3.3.2.1 Implementation of Vectorized int16\_t Overflow Checking

By comparing the result of a conventional addition and saturated addition, it indicates whether the addition has gone overflown or not. In case of overflow, with saturated add, the result always retains at the maximum possible value of int16\_t: 0x7FFF, while the result of a conventional add is always smaller. The reason is that in the two's complement binary form for integer, the overflow sum is "trapped" in the negative number space. They can't go all the way around and become INT16\_MAX again. For example:

```
INT16_MAX + 1 = INT16_MIN = -32768
INT16_MAX + 2 = -32767
...
INT16_MAX + INT16_MAX = -2
```

For multiplication, two 16-bit numbers produce 32-bit products, but only lower 16 bits can be stored. Therefore, overflow can be detected by checking whether any of the upper 16 bits are set.

Inspecting these approaches from an instruction-level perspective (Table 3.5), when overflow is ignored, both add and multiply takes 1 instruction, <a href="vpaddw">vpaddw</a>3 and <a href="vpmullw">vpmullw</a>4. To obtain and process overflow-related information, an additional computation instruction <a href="vpaddsw">vpmullw</a>6 is required, followed with 2 or 3 comparison, shuffling and branch instructions: <a href="vpsaw">vpsaw</a>7, <a href="vpcmpneqw">vpcmpneqw</a>8 and <a href="kord">kord</a>9. By enabling overflow checking, it brings 4 to 5 times more instruction count and 65% more runtime [14].

#### 3.3.2.2 Implementation of Scalar int32\_t and int64\_t Overflow Checking

Clang's language extension provides functions to perform overflow-checked integer arithmetics:

```
bool __builtin_add_overflow (type1 x, type2 y, type3 *sum);
bool __builtin_mul_overflow (type1 x, type2 y, type3 *prod);
```

<sup>&</sup>lt;sup>3</sup>Vector add instruction for int16 t

<sup>&</sup>lt;sup>4</sup>Vector multiply lower half bits for int16\_t

<sup>&</sup>lt;sup>5</sup>Vector saturated add for int16\_t

<sup>&</sup>lt;sup>6</sup>Vector multiple higher half bits for int16\_t

<sup>&</sup>lt;sup>7</sup>Shift packed data right arithmetic

<sup>&</sup>lt;sup>8</sup>Compare packed data for equal

<sup>&</sup>lt;sup>9</sup>Bitwise logical OR masks

|                   | Addition                                                                                                         | Multiplication                                                                                                                               |
|-------------------|------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| Overflow ignored  | vpaddw %zmm4,%zmm2,%zmm3                                                                                         | vpmullw %zmm1,%zmm3,%zmm2                                                                                                                    |
| Overflow<br>aware | <pre>vpaddw %zmm4, %zmm2, %zmm3 vpaddsw %zmm2, %zmm4, %zmm2 vpcmpneqw %zmm3, %zmm2, %k1 kord %k1, %k0, %k0</pre> | <pre>vpmullw %zmm1, %zmm3, %zmm2 vpmulhw %zmm1, %zmm3, %zmm3 vpsraw \$0xf, %zmm2, %zmm5 vpcmpneqw %zmm3, %zmm5, %k1 kord %k0, %k1, %k0</pre> |

Table 3.5: This table highlights the difference in instruction count when overflow checking is enabled or disabled for vectorized int16\_t. Pink instructions are essential components as they compute the anticipated arithmetic results, while overflow information are provided by yellow and cyan instructions.

These functions take three arguments: x and y are the two input operands, and sum or prod is a pointer to the variable that will hold the result of the addition or multiplication. The return value of these functions is a boolean that indicates whether an overflow occurred during the operation. However, they do not accept vectors as input, and a loop around these functions cannot be compiled into vector instructions either.

#### 3.3.3 Overflow Checking for Floating Points

To detect floating point overflow or imprecision, one approach is to enable floating point imprecision as a trap, then upon overflow, the interrupt SIGFPE is raised, and the PC (program counter) will be redirected to its handler. The method can be programmed by using useful functions from the fenv library as follow [7]:

```
void signal_handler(int signal) {
    // handle fpe
}

void function() {
    std::signal(SIGFPE, signal_handler);
    std::feclearexcept (FE_ALL_EXCEPT);
    feenableexcept (FE_INEXACT | FE_INVALID);
    // do something
    fedisableexcept (FE_INEXACT | FE_INVALID);
}
```

In the function() block, first, the std::signal() function is called to register the signal handler function for the SIGFPE interrupt. Next, the feclear except function is called to clear any previously set exception flags in the status register. Then, FE\_INEXACT and FE\_INVALID are passed to the feenable except() function to enable inexactness and invalid exceptions. Once the floating-point exceptions are enabled, computations can be performed. If some operation produces an inexact or invalid floating point number, SIGFPE is raised, and a call to signal\_handler is triggered. After

the computation is completed, the fedisable except () function is called to disable the previously enabled exceptions.

But it is difficult to recover back from SIGFPE. By design, the usage of SIGFPE is to do some cleanup in the handler function, then exit the program gracefully. If the handler does not exit the program after returning from the handler, the PC always points back to the instruction that caused SIGFPE and triggers SIGFPE again! However, the goal is to discard the current progress and continue the program using the LargeInteger algorithm. Although there could be potential workarounds, such as modifying the call stack and changing the return address, implementing these solutions can be challenging and introduce significant complexity to the codebase.

Alternatively, we may read status registers and check if the imprecision bit is set:

```
bool function(matrix & tableau) {
   std::feclearexcept (FE_ALL_EXCEPT);
   if (fetestexcept(FE_INEXACT | FE_INVALID)) {
        // false for overflow, will be handle by its caller
        return false;
   } return true; // true for safe
}
```

Instead of registering floating point imprecision as interrupts, it clears the floating point status register with feclearexcept. It reads the status register afterwards using the fetestexcept function to check whether imprecision has ever occurred in previous computations. It then returns a boolean value to notify its caller whether or not the test for floating-point exceptions is positive, true for safe and false for overflown. In case of returning false, the caller will retry computation using LargeInteger accordingly.

#### 3.3.3.1 Evaluation

In x86\_64 there are two status registers for floating points, the legacy x87 status register for traditional scalar floating point operation and the modern mxcsr register for SSE or AVX instructions. The source code of the library function fetestexcept indeed manipulates both registers [7]:

```
int fetestexcept(int excepts) {
   unsigned short status;
   unsigned int mxcsr;

   excepts &= FE_ALL_EXCEPT;

   /* Store the current x87 status register */
   __asm__ volatile ("fnstsw %0" : "=am" (status));

   /* Store the MXCSR register state */
   __asm__ volatile ("stmxcsr %0" : "=m" (mxcsr));

   return ((status | mxcsr) & excepts);
```

}

In the vectorized floating point scenario, the instruction for the x87 status register is unnecessary, and only the mxcsr register should be concerned. The hot loop is completely vectorized, and Clang dispatches 128-bit SSE instructions for occasional scalar operations. The reason is that the SSE execution units can handle both single and double precision floating point arithmetic natively. In contrast, the legacy x87 floating-point instructions operate on an 80-bit internal format, requiring additional conversions and delay. Using SSE instructions reduce register pressure as well. The SSE units have access to 16 XMM registers, but for x87 units, only 8 floating point registers are available [15].

As illustrated in Figure 3.4, the benchmark evaluates the performance of fenv functions alongside their respective revised versions. The modifications involve restricting operations to either manipulate the x87 status register or the mxcsr register. It indicates that it is significantly faster if we remove x87 status register related operations.



Figure 3.4: A benchmark for the floating point status register reading and resetting functions from cfenv library and their modified versions that only operate on either the x87 or the mxcsr status register. Excluding x87 related operations makes both fetestexcept and feclearexcept faster.

#### 3.3.4 Comparing int16\_t and float

Section 3.3.1, 3.3.2, and 3.3.3 have concluded that int16\_t is superior to any other integer data type and float is better than double. Benchmark (Figure 3.5) on the toy example reveals that int16\_t's spends a considerably higher percentage of runtime on overflow checking compared with float. This is consistent with the reasoning from previous chapters, where overhead for float is a small one-time expense, but for int16\_t it is always a portion of the total runtime.

Also, even though int16\_t is faster than float in this benchmark, it is not convincing to conclude that the pivot function implemented in int16\_t will be faster than float. The toy example is different from the pivot function in many aspects, for example, the number of memory load operations. In the actual pivot function, float may potentially outperform int16\_t.



Figure 3.5: A benchmark for toy example on a 16 by 16 matrix of int16\_t, int32\_t, int64\_t, float, and double, with overflow checking turned on or off. The runtime information for overflow checked int32\_t and int64\_t is not available, because it is difficult to implement vectorized overflow checker. It is discovered that (1) runtime is reduced by 50% as the bit width of the data type is cut by half, (2) overflow checking for integers is much more expensive than floating points.

# Chapter 4

# Implementation and Optimization of pivot

The previous chapter on the toy example has concluded that:

- 1. To achieve vectorization, source code should be written using Clang's vector type extension rather than relying on the compiler's automatic vectorizer.
- 2. Storing a matrix into a single flat list is more advantageous than the nested list approach.
- 3. The optimal integer and floating point data type is int16\_t and float respectively. But it is difficult to tell which one is superior because they have their distinct advantages. int16\_t offers very short bit width, while float provides quick overflow checking.

After adopting Clang's vector type and the flat list as the matrix's data structure, this chapter introduces some further optimizations to pivot as listed below and then evaluates the performance of pivot using either int16\_t or float data type.

- 1. Matrix-wise transprecision computing: minimizes the overhead of the transprecision dispatcher
- 2. Double buffering: avoids the matrix being polluted by overflown data while not introducing additional memory copy operations.
- 3. Aligning the matrix to the size of vector registers: improves memory and cache IO utilization.
- 4. Specialization for different row sizes: for float, using 3 YMM vectors for a row instead of 2 ZMM vectors reduces padding waste for each row and eliminates unnecessary arithmetic instructions.

## 4.1 Matrix-wise Transprecision

Transprecision computing can be implemented at different scales, such as element-wise, row-wise, and matrix-wise (Figure 4.1). The transprecision dispatcher examines overflow after computing on each element, each row, or the entire matrix. If an overflow occurs, the dispatcher interrupts its current progress, copies the matrix to fit a wider data type inside, then restarts dispatching using the wider data type.



Figure 4.1: This figure illustrates how transprecision computing is organized from the dispatcher's perspective. From row-wise to matrix-wise transprecision, the cost of dispatching decreases [13].

The pivot function presented by this report is implemented using the matrix-wise transprecision style to minimize the runtime spent in the dispatcher. The element-wise method is unsuitable in this scenario, as it defeats the purpose of vectorization. The row-wise method is not chosen either, due to that overflow is not likely to occur, and the matrix is small. Reading the mxcsr can be considered an expensive operation compared to the time spent on pivoting through the entire matrix. Avoid wasting runtime after overflown at the cost of more mxcsr reads is not a cost-effective trade-off.

Even though the pivot function using int16\_t was implemented using the row-transprecision approach in previous works [14], it is modified to match with the matrix-wise transprecision style to control differences between the float counterpart. After the overflow checking instructions, instead of a branch instruction pointing to the overflow handler, the boolean operator OR is applied between that overflow checking result register and an overflow flag.



Figure 4.3: The dataflow of the pivot function if it creates a backup of the input matrix rather than double buffering. Compared to Figure 4.2, this method requires additional memory copy operations to make a backup.

## 4.2 Double Buffering

When matrix-wise computing is carried out, a potential overflow can contaminate the input matrix and write meaningless results. It is difficult to recover from overflown results, making it impossible to re-dispatch the same input matrix to an algorithm of higher precision and defeating the purpose of transprecision computing.

Double buffering (Figure 4.2) addresses this problem of data pollution caused by overflown data by allocating two pieces of memory, one for the input matrix and the other for the output matrix. The input matrix is read-only, while the output matrix allows read and write. This separation of data storage ensures that the input matrix remains unpolluted by overflown data and that any potential overflow is encapsulated within the separate output matrix.



Figure 4.2: The dataflow of the double buffered pivot function. Data is loaded from the input matrix, and stored to the output matrix after computing.

While making a copy of the input matrix (Figure 4.3) is a simple and easy solution for protecting the original data from pollution, double buffering is a superior technique because it does not introduce additional memory operations. With double buffering, every operand needs a load operation from the input matrix, and every result takes a store operation to be written into the output matrix. This is the minimal amount of memory operation required for arithmetics. Zen 4 can load one ZMM vector per cycle and store one ZMM vector per two cycles. In comparison, it can do two multiplication or addition of ZMM vectors every cycle [21]. The discrepancy in throughput between computing and IO suggests that memory copy is costly and inefficient.

## 4.3 Alignment

The concept of memory alignment ensures that the starting address of each piece of data is a multiple of its size. When accessing memory, the CPU retrieves data from the main memory or cache in the unit of "cache line", and the cache lines are aligned. Aligning the matrix to the size of vector registers guarantees that every vector is fitted inside a single cache line. Otherwise, a vector register might be spitted on two cache lines, so the CPU must request both cache lines and shift to extract the desired vector [19]. A simple example is provided by Figure 4.4.

In the Listing 3.3, the assembly for load and store is <code>vmovdqu64</code>, for "Move Unaligned Packed Integer Values" [4]. This indicates that the load operation are unaligned and may negatively impact performances [19]. To address this issue, an aligned allocator is given to the constructor of <code>std::vector</code> when creating a matrix as follows, where the <code>AlignedAllocator</code> is provided by the <code>int16\_t</code> implementation from the FPL paper [14]:

```
template <typename T>
class matrix {
public:
   std::vector<T, AlignedAllocator<T, 64>> m;
   // ...
}
```

Loading from an aligned address can be noticed by the compiler, then aligned load instructions vmovdqa for integers or vmovaps for floating points will be selected accordingly.

| 0×100            | 0x140       |      | 0x180         |      | 0x1c0       |    |
|------------------|-------------|------|---------------|------|-------------|----|
| <- cache line -> | <- 512 bits | ->   | <- cache line | ->   | <- 512 bits | -> |
| Vector_A0        | Vector_A1   |      |               |      |             |    |
|                  |             | Vect | or_B0         | Vect | or_B1       |    |

Figure 4.4: This example highlights the difference between aligned and unaligned loads in the memory. The vectors  $Vector_A^*$  are aligned with the cache line size; loading them involves reading two of their corresponding cache lines. However,  $Vector_B^*$  are unaligned; they are spitted on 3 cache lines. All 3 cache lines must be read when loading the two  $Vector_B^*$  vectors.

## 4.4 Column Size Specialization

A single ZMM vector can accommodate up to 16 float values, while two of them can store 32 float values. However, loading a row into two ZMM vectors can result in a minimum waste of 12 float values for 95% of matrices containing 20 columns or less, as indicated in Figure 2.1. Alternatively, rows can be into loaded into multiple YMM vectors, which can accommodate 8 float values each. For the majority of cases, 3 YMM

vectors can sufficiently cover the aforementioned 95% of matrices with substantially less waste. Since YMM registers have double the amount of execution units than ZMM, and the instruction count is anticipated to be only 50% greater, the use of a 3 YMM setup for a row is expected to outperform the 2 ZMM approach. Benchmark plotted in the Figure 4.5 demonstrates that the 3 YMM setup for a row is 15% faster then the 2 ZMM configuration, which matches the expectation.



Figure 4.5: This benchmark of the pivot function measures the difference in performance when rows are packed into 3 YMM vectors, 4 YMM vectors, or 2 ZMM vectors. The 3 YMM setup is the fastest. It is also worth noting that the performance of the 4 YMM implementation is consistently 1.4 ns slower than its ZMM counterpart. This confirms with Zen 4's performance analysis in Section 2.3 that an equal amount of workload compiled to AVX-512 instructions is faster than AVX-2 due to less front-end pressure.

#### 4.5 Evaluation

The Figure 2.1 indicates that the most commonly observed matrix size is 19 columns since the 25th and 95th percentiles fall at 18 and 19 columns, respectively. To evaluate the performance of pivot, a representative 32-row by 19-column matrix is selected from FPL's benchmark [14]. Benchmarks were conducted with the input matrix constructed from different data types, including scalar int64\_t from MLIR's upstream [13], vectorized int16\_t from FPL [14], and vectorized float presented by this report. Figure 4.6 outlines the critical difference between the three approaches, and the benchmark results are plotted in Figure 4.7.

It is discovered the two vectorized implementations are significantly faster than the MLIR's scalar upstream. However, the FPL's implementation with int16\_t [14] costs 20 ns; this is 6 ns faster than the float approach proposed by this report. Potential reasons for float being slower than int16\_t is explained as follow:

- 1. Each row operation involves loading 3 YMM (768 bits) for float or 2YMM (512 bits) for int16\_t. There is 1.5 times more memory pressure on float than int16\_t.
- 2. Zen 4 provides 2 512-bit ALU for int16\_t, but only 1 512-bit FMA unit or 2 256-bit FMA units for float. Since each row operation consists of two multiplication and one addition, the FADD units are left idle and essentially reducing the number of execution units available for floating points to half of what integers have
- 3. Despite int16\_t overflow checking requiring 4 to 5 times more instruction, there is only 1 additional arithmetic instruction. The others are for shuffling, branching, and comparing, and could potentially be processed by other chip resources. It is possible that float would have to spend the same amount of cycles, compared to int16\_t with manual overflow checking.

|                            | MLIR Upstream int64_t                         | FPL's int16_t                 | float                                                          |
|----------------------------|-----------------------------------------------|-------------------------------|----------------------------------------------------------------|
| Runtime                    | 550 ns                                        | 20 ns                         | 26 ns                                                          |
| Overflow checking overhead | N/A                                           | 28 %                          | 1 %                                                            |
| Element bit width          | 64                                            | 16                            | 24                                                             |
| Max column size            | No constraint                                 | 32                            | 24                                                             |
| Compatibility              | Do not require any vector instruction support | Require AVX-512,<br>very rare | Require AVX-2,<br>common among CPUs<br>made in the last decade |

Figure 4.6: This table lists features of 3 data type options of the pivot function: (1) scalar int64\_t, (2) vectorized int16\_t, and (3) vectorized float.



Figure 4.7: The benchmark results of the pivot functions are plotted using logarithmic scale. Note that despite float is very efficient in terms of overflow checking compared to  $int16\_t$ 's huge overhead, it is still slower than  $int16\_t$ .

# **Chapter 5**

# **Conclusion and Future Work**

In conclusion, this report presents a fast implementation of the pivot function using float to address performance bottlenecks when MLIR analysis programs using linear programming of the simplex method. The procedures are:

- 1. Write source code using Clang's vector type extension to guarantee vectorization.
- 2. Reduce the bit width of each element in the input matrix. High-precision data types are unnecessary for pivot.
- 3. Perform integer arithmetics with FPU to leverage floating point's automatic overflow detection.

This report achieved 20 times speedup over the upstream implementation, but unfortunately, it is about 20% slower than the FPL's approach. Nevertheless, my method offers better compatibility. The FPL's approach requires AVX-512, but mine works on old and vastly more common AVX-2 CPUs.

The techniques of optimizing the pivot function could make further progress with the 16-bit floating point data type half. It consists of a 1-bit sign, a 5-bit exponent, and a 10-bit mantissa (Figure 5.1). For some AI applications where high precision is not needed, half provides performance benefits over float and double [22]. It has been supported natively on GPUs since 2016 [25], and if this data type is integrated into future AVX extensions, it would potentially further improve the performance of pivot. The int11\_t data type can be defined using the sign bit and the 10-bit mantissa, which covers 99% of the elements in the constraint matrices [13]. By using int11\_t, it is expected that the runtime of the pivot function can be reduced by 50% compared to the int24\_t version.



Figure 5.1: IEEE 754 half precision floating point (16 bits) [20].

# **Chapter 6**

## **Related Work**

- 1. Fast Analytical Model of Caches [16]: This paper proposes a fast approach to model fully associative and LRU cache. It is orders-of-magnitude faster than the analytical model haystack or the simulator dinero IV. They count cache misses using the symbolic counting technique rather than enumerating all memory accesses. It was discovered that symbolic counting is sufficiently linear in practice and can be solved using Presburger arithmetics.
- 2. Fast Modular Exponentiation using Floating Point Arithmetic in GPU [12]: Modular exponentiation is critical to RSA cryptographic operations. It is the calculation of the remainder r when an integer b is raised to the power of e, then divided by the modulus m:  $r = b^e \mod m$ . The paper presents an approach to computing modular exponentiation using double precision floating points and achieved 20% to 34% speed up over the best prior implementation.
- 3. **Mixed-precision Training Using half and float** [22]: Performance and memory efficiency of deep learning models can be improved by starting with low precision half arithmetics, and transit to float if high precision is required. NVIDIA advertises 8 times more half arithmetic throughput when compared to float, and this translates into 2 to 4.5 times speedup for transprecision models.

- [1] AOIDUO, LebedevRI, RKSimon, adibiagio, and llvmbot. Unexpected rthroughput for vfmadd\* instructions in znver3. https://github.com/llvm/llvm-project/issues/59325, 2022.
- [2] At32Hz. Zen 2 microarchitectures amd. https://en.wikichip.org/wiki/amd/microarchitectures/zen\_2, 2022. accessed 12 April 2023.
- [3] Mikolaj Bojanczyk and Joël Ouaknine. A simple and practical linear-time algorithm for presburger arithmetic. 2004.
- [4] Félix Cloutier. x86 and amd64 instruction reference. https://www.felixcloutier.com/x86/, 2022. accessed 12 April 2023.
- [5] LLVM Contributors. 'affine' dialect. https://mlir.llvm.org/docs/Dialects/Affine/, 2023. accessed 12 April 2023.
- [6] LLVM Contributors. Mlir llvm. https://mlir.llvm.org/, 2023. accessed 12 April 2023.
- [7] cppreference.com. Standard library header cfenv (c++11). https://en. cppreference.com/w/cpp/header/cfenv, 2022. accessed 12 April 2023.
- [8] Ian Cutress. The intel skylake-x review: Core i9 7900x, i7 7820x and i7 7800x tested. https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested, 2017. accessed 12 April 2023.
- [9] Google Benchmark Developers. google/benchmark. https://github.com/google/benchmark, 2023. accessed 12 April 2023.
- [10] Google Benchmark Developers. Reducing variance. https://github.com/google/benchmark/blob/main/docs/reducing\_variance.md, 2023. accessed 12 April 2023.
- [11] LLVM Developers. llvm-mca llvm machine code analyzer. https://llvm.org/docs/CommandGuide/llvm-mca.html, 2023. accessed 12 April 2023.
- [12] Niall Emmart, Fangyu Zheng, and Charles Weems. Faster modular exponentiation using double precision floating point arithmetic on the gpu. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 130–137, 2018.

[13] Grosser et al. Fast linear programming through transprecision computing on small and sparse data. In *Proceedings of the ACM on Programming, Languages, Volume* 4, 2020.

- [14] Pitchanathan et al. Fpl: fast presburger arithmetic through transprecision. In *Proceedings of the ACM on Programming Languages, Volume 5*, 2021.
- [15] Agner Fog. Optimizing software in c++. https://www.agner.org/optimize/optimizing\_cpp.pdf, 2022. accessed 12 April 2023.
- [16] Tobias Gysi, Tobias Grosser, Laurin Brandner, and Torsten Hoefler. A fast analytical model of fully associative caches. In *Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation*. ACM, June 2019.
- [17] kingfish. How amd's zen 2 architecture boosts performance-per-watt. https://community.amd.com/t5/general-discussions/how-amd-s-zen-2-architecture-boosts-performance-per-watt/td-p/143054, 2019. accessed 12 April 2023.
- [18] Arch Linux. Cpu frequency scaling. https://wiki.archlinux.org/title/ CPU\_frequency\_scaling, 2023. accessed 12 April 2023.
- [19] Mauricio Alvarez Mesa, Esther Salamí, Alex Ramírez, and Mateo Valero. Performance impact of unaligned memory operations in simd extensions for video codec applications. DBLP, April 2007.
- [20] Islam Mohamed. Mixed precision training. https://islammohamedmosaad. github.io/journal/Mixed-Precision-Training.html, 2020. accessed 12 April 2023.
- [21] Mysticial. Zen4's avx512 teardown. https://www.mersenneforum.org/showthread.php?p=614191, 2022. accessed 12 April 2023.
- [22] NVIDIA. Train with mixed precision user's guide. https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html, 2023. accessed 12 April 2023.
- [23] robocat. Linus torvalds on avx512. https://news.ycombinator.com/item?id=23809335, 2020. accessed 12 April 2023.
- [24] Mariana Silva, Wanjun Jiang, Matthew West, Erin Carrier, Adam Stewart, and Luke Olson. Floating point representation. https://courses.physics.illinois.edu/cs357/sp2020/notes/ref-4-fp.html, 2020. accessed 12 April 2023.
- [25] Ryan Smith. The nvidia geforce gtx 1080 & gtx 1070 founders editions review: Kicking off the finfet generation. https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5, 2016. accessed 12 April 2023.
- [26] Nigel Topham. Advanced arithmetic functions. Computer Architecture and Design (INFR10076) Lecture 21, 2021.

[27] Linus Torvalds. Alder lake and avx-512. https://www.realworldtech.com/forum/?threadid=193189&curpostid=193190, 2020. accessed 12 April 2023.

[28] Xeon06. Floating point arithmetic. https://www.reddit.com/r/ ProgrammerHumor/comments/1fq4jy/floating\_point\_arithmetic/, 2013. accessed 12 April 2023.