





# Scientific and Technical Computing

**Hardware and Code Optimization** 

#### Lars Koesterke

UT Austin, 9/29/22 & ...



# Tentative(!) Schedule

#### Total of 9 class periods

### Remember this? Data streams

a(i) = b(i) + c(i)



What do the designations 'a3', 'b5', 'c7' now stand for?

1 result every 3 cycles Speedup is 200×

### Remember this? Data streams

$$a(i) = b(i) + c(i)$$



What do the designations 'a3', 'b5', 'c7' now stand for?

These are now 'names/designations' of whole cache lines (512 bit)
And we will use this to our advantage to increase computational speed

### **Scalar Hardware**





Scalar Unit

Input: 2 words (single or double precision)

Output: 1 word

Operations: 1 operation → 1 result

### **Vector Hardware**

$$a(i) = b(i) + c(i)$$

Example for vector length = 4 words



#### Scalar Unit

Input: 2 words (single or double precision)

Output: 1 word

Operations: 1 operation → 1 result

#### Vector Unit

Input: 2 cache lines
Output: 1 cache line

Operations: 1 operation  $\rightarrow$  8/16 results (dp/sp)

#### The whole process is called vectorization

- Vector registers
- Vector instructions
- Loading a cache line into vector register with one instruction
- Operating on vector registers with one instruction

#### Using vector instructions

- Compiler creates vector instructions when possible (and
- Compiler creates scalar instructions when necessary
- Programmers help by
  - Writing vectorizable code
  - Adding hints to the source code (OpenMP)

#### **Status**

- Cache line: 512 bits
- Vector width/length: 512 bits (width of the vector register)
- In the past: vector length shorter than cache line

#### Reminder

#### All 'concurrency' topics come with these questions:

- What is it?
- Why do we need it?
- How is it implemented?
- Is there a specific trick?
- How does this increase concurrency?

#### Overarching idea

Single transaction is limited in speed

Increasing concurrency is our goal

Memory transactions and (floating point) operations



#### **Basics**

- Cores have registers
- Data must be moved from cache/memory into registers before operating on
- To add 2 floating point (fp) numbers:
  - 1. Move first fp 'A' from cache to a register
  - 2. Move second fp 'E' from cache to a different register
  - 3. Add fp's and place result in yet a different register
  - 4. Move result to cache/memory
- Frequencies are limited so a single fp operation can only go so fast
- Why not move and add several fp's simultaneously?
  - SIMD: Single Instruction Multiple Data



#### Make registers wider → vector registers

- Same motivations as multicore: more compute at comparable power
- Increase # computations by increasing number of ops per cycle



### **Evolution of SIMD Hardware**

2001 – 2017: Vector length has increased by a factor of 4

2017: Vector length = length of a cache line

| Year | Width (bits) | ISA Name                  | Data (register names) |
|------|--------------|---------------------------|-----------------------|
| 1997 | 80           | x87 + MMX                 | float+Integer         |
| 1999 | 128          | SSE1                      | SP FP SIMD (xMM0-8)   |
| 2001 | 128          | SSE2                      | DP FP SIMD (xMM0-8)   |
| 2004 | 128          | SSE3                      | DP FP SIMD (xMM0-8)   |
| 2006 | 128          | SSE4                      | DP FP SIMD (xMM0-8)   |
| 2010 | 256          | AVX                       | DP FP SIMD (yMM0-16)  |
| 2013 | 256          | AVX2                      | DP FP SIMD (yMM0-16)  |
| 2016 | 512          | MIC-AVX512, COMMON-AVX512 | DP FP SIMD (zMM0-32)  |
| 2017 | 512          | CORE-AVX512               | DP FP SIMD (zMM0-32)  |

Stampede2 and Frontera



## **Vector Registers**

- Different types of processors have different width vector registers
- SP/DP are 32b/64b (4B/8B)
- Vector instructions are required to use vector registers



### **Vector Instructions**

#### Vectorized code uses vector instructions

- Vector instructions act on multiple data elements
- Vector instructions exist for moving data and operating on data



Example: vector length=4

do i=1, n
 a(i) = b(i) + c(i)
enddo

Compiler

Compiler unrolls the loop

Start, End, Increment

```
do i=1, n, 4
  a(i+0) = b(i+0) + c(i+0)
  a(i+1) = b(i+1) + c(i+1)
  a(i+2) = b(i+2) + c(i+2)
  a(i+3) = b(i+3) + c(i+3)
enddo
```

Example: vector length=4

```
do i=1, n
  a(i) = b(i) + c(i)
  d(i) = e(i) + f(i)
enddo
```

Compiler

Compiler unrolls the loop

The compiler can change the order of statements if it's correct to do so

```
do i=1, n, 4
  a(i+0) = b(i+0) + c(i+0)
  d(i+0) = e(i+0) + f(i+0)
  a(i+1) = b(i+1) + c(i+1)
  d(i+1) = e(i+1) + f(i+1)
  a(i+2) = b(i+2) + c(i+2)
  d(i+2) = e(i+2) + f(i+2)
  a(i+3) = b(i+3) + c(i+3)
  d(i+4) = e(i+3) + f(i+3)
enddo
```

```
do i=1, n, 4
  a(i+0) = b(i+0) + c(i+0)
  a(i+1) = b(i+1) + c(i+1)
  a(i+2) = b(i+2) + c(i+2)
  a(i+3) = b(i+3) + c(i+3)
  d(i+0) = e(i+0) + f(i+0)
  d(i+1) = e(i+1) + f(i+1)
  d(i+2) = e(i+2) + f(i+2)
  d(i+3) = e(i+3) + f(i+3)
enddo
```

```
for ( int i=0; i<n; i++ ) {
                               Yes
 a[i] = b[i] + c[i];
```

But why exactly?

```
for ( int i=0; i<n; i++ ) {
   a[i] = b[i] + c[i];
}</pre>
```

Yes Loop iterations are independent

```
for ( int i=0; i<n; i++ ) {
   a[i] = b[i] + c[i];
}</pre>
```

Yes Loop iterations are independent

```
do i=2, n-1
  a(i) = a(i-1) + a(i+1)
end do
```



```
for ( int i=0; i<n; i++ ) {
   a[i] = b[i] + c[i];
}</pre>
```

Yes Loop iterations are independent

```
do i=2, n-1
  a(i) = a(i-1) + a(i+1)
end do
```

No Loop iterations are not independent

```
for ( int i=0; i<n; i++ ) {
   a[i] = b[i] + c[i];
}</pre>
```

Yes
Loop iterations are independent

```
do i=2, n-1
  a(i) = a(i-1) + a(i+1)
end do
```

No Loop iterations are not independent

```
for ( int i=0; i<n; i++ ) {
  temp = a[i] + 2.;
  a[i] = b[i];
  b[i] = temp;
}</pre>
```

How about this one? There is, sort of, a dependency

```
for ( int i=0; i<n; i++ ) {
   a[i] = b[i] + c[i];
}</pre>
```

Yes
Loop iterations are independent

```
do i=2, n-1
  a(i) = a(i-1) + a(i+1)
end do
```

No Loop iterations are not independent

```
for ( int i=0; i<n; i++ ) {
  temp = a[i] + 2.;
  a[i] = b[i];
  b[i] = temp;
}</pre>
```

Yes
Compiler resolves dependencies for scalars
like temp, and also takes care of
unnamed constants

# **Efficiency of Vectorization**

```
do i=1, n
  a(2*i) = b(2*i) + c(2*i)
end do
```

Assume double precision
Assume efficient data streaming and pipelining

How many cycles per operation? (Assuming pipelining)

What is the total bandwidth?
What is the 'effective' total bandwidth?
How many results per cycle?

# **Efficiency of Vectorization**

```
do i=1, n
  a(2*i) = b(2*i) + c(2*i)
end do
```

Assume double precision
Assume efficient data streaming and pipelining

How many cycles per operation? 1

What is the total bandwidth? 3×8 wpc What is the 'effective' total bandwidth? 3×4 wpc

How many results per cycle?



# **Efficiency of Vectorization**

```
do i=1, n
 a(2*i) = b(2*i) + c(2*i)
end do
```

Assume double precision
Assume efficient data streaming and pipelining

How many cycles per operation?

What is the total bandwidth? 3×8 wpc What is the 'effective' total bandwidth? 3×4 wpc

How many results per cycle? 4

So what happens to the 4 elements of the vector unit that we don't need?

### **Vectorization and Masks**

```
do i=1, n
a(2*i) = b(2*i) + c(2*i)
end do
```



### **Vectorization and Masks**



TACC

### **Vectorization and Masks**

Complete cache lines are loaded

Unwanted results are either not calculated or not stored back to register



TACC

Double precision: 8 vector lanes

Single precision: 16 vector lanes

Load 8 consecutive elements in memory to 8 consecutive elements in memory





#### All elements!



Complete cache lines are loaded Unwanted results are not stored back to register



Load b(1:8) Load c(1:8) Compute: add Store a(1:8) – full mask

Total: 4 instructions 8 results

Mask: All results copied back

#### Every other element!

**Complete** cache lines are loaded

Unwanted results are either not calculated or not stored back to register

28



#### Every other element!



**Complete** cache lines are loaded

Unwanted results are either <u>not calculated</u> or not stored back to register



Extreme case: every 8<sup>th</sup> element!

**Complete** cache lines are loaded

Unwanted results are either not calculated or not stored back to register

30



Extreme case: every 8th element!

Why is 'a' also loaded, even if we only write to it?

<u>Complete</u> cache lines are loaded and written back

Unwanted results are either not calculated or not stored back to register



Load b(1:8)

Load c(1:8)

Load a(1:8)

Compute: add – with mask

Store a(1:8)

Total: 5 instructions

1 results

a(1), a(2), ... unchanged a(1) to a(8) is written back

TACC

# **Strided Memory Access**

```
do i=1, m
  a(1*i) = b(1*i) + c(1*i)
end do
```

Stride 1 access

Stride 2 access

Stride 8 access

Stride 'n>8' access

Which access is best?

Which one is worse?

# **Strided Memory Access**

Stride 1 access

Stride 2 access

Stride 8 access

Stride 'n>8' access

Which access is best?
Lower strides are better
Stride-1 is best

Which one is worse?
Higher strides are worse
Stride-8, and stride-n are worst

Lower 'effective' memory bandwidth

Lower number of results per numerical vector operation

# **Strided Memory Access**

Stride 1 access

But things are a bit more complicated

Again, we have to do back to 'data streams'

Stride 8 access

Higher strides are worse Stride-8, and stride-n are worst

Stride 'n>8' access

Lower 'effective' memory bandwidth

Which accord is host?

Lower number of results per numerical operation