# Memory Hierarchy (5): Cache Misses and How to Address Them — A Case Study

Hung-Wei Tseng

### Recap: The "latency" gap between CPU and DRAM



Recap: Memory Hierarchy



## Recap: 3Cs of misses

- Compulsory miss
  - Cold start miss. First-time access to a block
- Capacity miss
  - The working set size of an application is bigger than cache size
  - Working set size means the total size of cache blocks we visit between the last access and the current access of the current block
- Conflict miss
  - Required data replaced by block(s) mapping to the same set
  - Similar collision in hash
  - The working set size is still smaller than the capacity

## Recap: Stream buffer



- A small cache that captures the prefetched blocks
  - Can be built as fully associative since it's small
  - Consult when there is a miss
  - Retrieve the block if found in the stream buffer
  - Reduce compulsory misses and avoid conflict misses triggered by prefetching

## Recap: Victim cache

Processor Core Registers

- A small cache that captures the evicted blocks
  - Can be built as fully associative since it's small
- oxpeader Consult when there is a miss fetch block return block Swap the entry if hit in victim cache 0xpeade 0xDEADBE Athlon/Phenom has an 8-entry victim

cache

- Reduce conflict misses
- Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache

Id/sd 0xAAAABE | 1d/sd 0xDEADBE | 1d/sd

**DRAM** 

## Array of structures or structure of arrays

```
Array of objects
                                                                                                    object of arrays
                     struct grades {
                                                                             struct grades {
                        int id;
                                                                                int *id;
                        double assignment_1, assignment_2,
                                                                                double *assignment_1, *assignment_2,
                     assignment 3, ...;
                                                                             *assignment_3, ...;
                                                                             };
assignment_1 assignment_2 assignment_3
                                                        assignment_1 assignment_2 assignment_3
                                                                             for(i = 0;i < homework_items; i++)</pre>
                     for(i=0;i<homework_items; i++)</pre>
                                                                               gradesheet.homework[i][total number students] = 0.0;
                     gradesheet[total_number_students].homework[i] = 0.0;
                                                                               for(j = 0; j <total_number_students;j++)</pre>
       average of
                        for(j=0;j<total_number_students;j++)</pre>
          each
                     gradesheet[total number students].homework[i]
                                                                                  gradesheet.homework[i][total_number_students] +=
                     +=gradesheet[j].homework[i];
                                                                             gradesheet.homework[i][j];
       homework
                        gradesheet[total_number_students].homework[i] /=
                     (double)total number students;
                                                                                  gradesheet.homework[i][total number students] /=
                                                                             total_number_students;
                                                     assignment_1 assignment_1 assignment_1 assignment_2 assignment_2 assignment_2
                                                                                                                                      assignment_3
```

## **Loop optimizations**

```
for(i = 0; i < ARRAY_SIZE; i++)
{
  for(j = 0; j < ARRAY_SIZE; j++)
  {
    c[i][j] = a[i][j]+b[i][j];
  }
}</pre>
```

## for(j = 0; j < ARRAY\_SIZE; j++,LOOP interchange

```
for(j = 0; j < ARRAY_SIZE; j++)
{
  for(i = 0; i < ARRAY_SIZE; i++)
    {
     c[i][j] = a[i][j]+b[i][j];
  }
}</pre>
```

m

Loop fission



4

Loop fusion

## **Takeaways: Software Optimizations**

- Data layout capacity miss, conflict miss, compulsory miss
- Loop interchange conflict/capacity miss
- Loop fission conflict miss when \$ has limited way associativity
- Loop fusion capacity miss when \$ has enough way associativity

# Tiling/Blocking Algorithm

## What is an M by N "2-D" array in C?

```
a = (double **)malloc(M*sizeof(double *));
for(i = 0; i < N; i++)
{
   a[i] = (double *)malloc(N*sizeof(double));
}</pre>
```

## a[i][j] is essentially a[i\*N+j]

#### abstraction



#### physical implementation



## Case Study: Matrix Multiplications



```
for(i = 0; i < M; i++) {
  for(j = 0; j < K; j++) {
    for(k = 0; k < N; k++) {
      c[i][j] += a[i][k]*b[k][j];
    }
}</pre>
```

Algorithm class tells you it's O(n<sup>3</sup>)

If M=N=K=1024, it takes about 2 sec

How long is it take when M=N=K=2048?



#### What kind(s) of misses are there in Matrix Multiplications

 Considering the case where M=N=K=2048, what do you think the majority type(s) of cache misses are we seeing on an intel processor with intel Core i7 is 48 KB, 12-way, 64-byte blocked L1-\$?

```
for(i = 0; i < M; i++) {
  for(j = 0; j < K; j++) {
    for(k = 0; k < N; k++) {
      c[i][j] += a[i][k]*b[k][j];
    }
  }
}</pre>
```

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss



#### What kind(s) of misses are there in Matrix Multiplications

 Considering the case where M=N=K=2048, what do you think the majority type(s) of cache misses are we seeing on an intel processor with intel Core i7 is 48 KB, 12-way, 64-byte blocked L1-\$?

```
for(i = 0; i < M; i++) {
  for(j = 0; j < K; j++) {
    for(k = 0; k < N; k++) {
      c[i][j] += a[i][k]*b[k][j];
    }
  }
}</pre>
```

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss

## **Matrix Multiplications**



## **Matrix Multiplications**



fetch everything

capacity miss!

 If each dimension of your matrix is 2048 • Each row takes  $2048 \times 8$  Bytes = 16 KB

 $2048 \times 8B$ =2048 blocks . Each column takes -

cache  $48 \times 1024B$ - = 768 blocks 64*B* 

Unlikely to be

kept in the

b

The L1-\$ of intel Core i7 is 48 KB, 12-way, 64-byte blocked, we only have

You can only hold at most 3 rows or 0.25 of a column of each matrix!



#### What kind(s) of misses are there in Matrix Multiplications

 Considering the case where M=N=K=2048, what do you think the majority type(s) of cache misses are we seeing on an intel processor with intel Core i7 is 48 KB, 12-way, 64-byte blocked L1-\$?

```
for(i = 0; i < M; i++) {
  for(j = 0; j < K; j++) {
    for(k = 0; k < N; k++) {
      c[i][j] += a[i][k]*b[k][j];
    }
  }
}</pre>
```

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss

#### Ideas regarding reducing misses in matrix multiplications

 Reducing capacity misses — we need to reduce the length of a row that we visit within a period of time

#### Mathematical view of MM

$$c_{i,j} = \sum_{k=0}^{k=N-1} a_{i,k} \times b_{k,j} = \sum_{k=0}^{k=\frac{N}{2}-1} a_{i,k} \times b_{k,j} + \sum_{k=\frac{N}{2}}^{k=N-1} a_{i,k} \times b_{k,j}$$

$$= \sum_{k=0}^{k=\frac{N}{4}-1} a_{i,k} \times b_{k,j} + \sum_{k=\frac{N}{4}}^{k=\frac{N}{2}-1} a_{i,k} \times b_{k,j} + \sum_{k=\frac{N}{2}}^{k=\frac{3N}{4}-1} a_{i,k} \times b_{k,j} + \sum_{k=3N4-1}^{k=N-1} a_{i,k} \times b_{k,j}$$
:

Let's break up the multiplications and accumulations into something fits in the cache well



#### Only compulsory misses —

$$miss\_rate = \frac{total\ misses}{total\ accesses} = \frac{8+8+8}{3\times8\times8\times8} = 0.015625$$

These are still around when we move to the next row in the "tile"



# Bringing miss rate even further lower now —

$$miss\_rate = \frac{total\ misses}{total\ accesses} = \frac{8 + 2 \times 8 + 8}{2 \times 3 \times 8 \times 8 \times 8} = 0.0104$$



```
for(i = 0; i < M; i+=tile\_size) \\ for(j = 0; j < K; j+=tile\_size) \\ for(k = 0; k < N; k+=tile\_size) \\ for(ii = i; ii < i+tile\_size; ii++) \\ for(jj = j; jj < j+tile\_size; jj++) \\ for(kk = k; kk < k+tile\_size; kk++) \\ c[ii][jj] += a[ii][kk]*b[kk][jj];
```



Only used 10 blocks in our working set for now — remember, we have 768 blocks in Intel Core i7's L1-\$

What if we have larger tiles?



## How large a tile should be?

Considering the case where M=N=K=2048, and a tile\_size=32, what do you
think the majority type(s) of cache misses are we seeing on an intel processor with
intel Core i7 is 48 KB, 12-way, 64-byte blocked L1-\$?

```
for(i = 0; i < M; i+=tile_size)
  for(j = 0; j < K; j+=tile_size)
    for(k = 0; k < N; k+=tile_size)
    for(ii = i; ii < i+tile_size; ii++)
        for(jj = j; jj < j+tile_size; jj++)
        for(kk = k; kk < k+tile_size; kk++)
        c[ii][jj] += a[ii][kk]*b[kk][jj];</pre>
```

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss



## How large a tile should be?

Considering the case where M=N=K=2048, and a tile\_size=32, what do you
think the majority type(s) of cache misses are we seeing on an intel processor with
intel Core i7 is 48 KB, 12-way, 64-byte blocked L1-\$?

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss

## Matrix Multiplication — let's consider "b"

```
for(ii = i; ii < i+tile_size; ii++)
    for(jj = j; jj < j+tile_size; jj++)
        for(kk = k; kk < k+tile_size; kk++)
        c[ii][jj] += a[ii][kk]*b[kk][jj];</pre>
```

|          | Address | Tag  | Index |          |
|----------|---------|------|-------|----------|
| b[0][0]  | 0x20000 | 0x20 | 0x0   |          |
| b[1][0]  | 0x24000 | 0x24 | 0x0   |          |
| b[2][0]  | 0x28000 | 0x28 | 0x0   |          |
| b[3][0]  | 0x2C000 | 0x2C | 0x0   |          |
| b[4][0]  | 0x30000 | 0x30 | 0x0   |          |
| b[5][0]  | 0x34000 | 0x34 | 0x0   |          |
| b[6][0]  | 0x38000 | 0x38 | 0x0   |          |
| b[7][0]  | 0x3C000 | 0x3C | 0x0   |          |
| b[8][0]  | 0×40000 | 0x40 | 0x0   |          |
| b[9][0]  | 0x44000 | 0x44 | 0x0   |          |
| b[10][0] | 0x48000 | 0x48 | 0x0   |          |
| b[11][0] | 0x4C000 | 0x4C | 0x0   | <b>▼</b> |
| b[12][0] | 0x50000 | 0x50 | 0x0   |          |
| b[13][0] | 0x54000 | 0x54 | 0x0   |          |
| b[14][0] | 0x58000 | 0x58 | 0x0   |          |
| b[15][0] | 0x5C000 | 0x5C | 0x0   |          |
| b[16][0] | 0x60000 | 0x60 | 0x0   |          |

 If the row dimension (N) of your matrix is 2048, each row element with the same column index is

$$2048 \times 8 = 16384 = 0x4000$$
  
away from each other

Each set can store only 12 blocks! So we will start to kick out b[0][0-7], b[1][0-7] ...

## Now, when we work on c[0][1]

|          | Address | Tag  | Index |
|----------|---------|------|-------|
| b[0][0]  | 0×20000 | 0x20 | 0x0   |
| b[1][0]  | 0x24000 | 0x24 | 0x0   |
| b[2][0]  | 0x28000 | 0x28 | 0x0   |
| b[3][0]  | 0x2C000 | 0x2C | 0x0   |
| b[4][0]  | 0x30000 | 0x30 | 0x0   |
| b[5][0]  | 0x34000 | 0x34 | 0x0   |
| b[6][0]  | 0x38000 | 0x38 | 0x0   |
| b[7][0]  | 0x3C000 | 0x3C | 0x0   |
| b[8][0]  | 0x40000 | 0x40 | 0x0   |
| b[9][0]  | 0x44000 | 0x44 | 0x0   |
| b[10][0] | 0x48000 | 0x48 | 0x0   |
| b[11][0] | 0x4C000 | 0x4C | 0x0   |
| b[12][0] | 0x50000 | 0x50 | 0x0   |
| b[13][0] | 0x54000 | 0x54 | 0x0   |
| b[14][0] | 0x58000 | 0x58 | 0x0   |
| b[15][0] | 0x5C000 | 0x5C | 0x0   |
| b[16][0] | 0x60000 | 0x60 | 0x0   |

|          | Address | Tag  | Index |          |      |
|----------|---------|------|-------|----------|------|
| b[0][1]  | 0x20008 | 0x20 | 0x0   | Conflict | Miss |
| b[1][1]  | 0x24008 | 0x24 | 0x0   | Conflict | Miss |
| b[2][1]  | 0x28008 | 0x28 | 0x0   | Conflict | Miss |
| b[3][1]  | 0x2C008 | 0x2C | 0x0   | Conflict | Miss |
| b[4][1]  | 0x30008 | 0x30 | 0x0   | Conflict | Miss |
| b[5][1]  | 0x34008 | 0x34 | 0x0   | Conflict | Miss |
| b[6][1]  | 0x38008 | 0x38 | 0x0   | Conflict | Miss |
| b[7][1]  | 0x3C008 | 0x3C | 0x0   | Conflict | Miss |
| b[8][1]  | 0x40008 | 0×40 | 0x0   | Conflict | Miss |
| b[9][1]  | 0x44008 | 0x44 | 0x0   | Conflict | Miss |
| b[10][1] | 0x48008 | 0x48 | 0x0   | Conflict | Miss |
| b[11][1] | 0x4C008 | 0x4C | 0x0   | Conflict | Miss |
| b[12][1] | 0x50008 | 0x50 | 0x0   | Conflict | Miss |
| b[13][1] | 0x54008 | 0x54 | 0x0   | Conflict | Miss |
| b[14][1] | 0x58008 | 0x58 | 0x0   | Conflict | Miss |
| b[15][1] | 0x5C008 | 0x5C | 0x0   | Conflict | Miss |
| b[16][1] | 0x60008 | 0x60 | 0x0   | Conflict | Miss |

Each set can store only 12 blocks! So we will start to kick out b[0][0-7], b[1][0-7] ...

## How large a tile should be?

• Considering the case where M=N=K=2048, and a tile\_size=16, what do you think the majority type(s) of cache misses are we seeing on an intel processor with intel Core i7 is 48 KB, 12-way, 64-byte blocked L1-\$?

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss

## Matrix Multiplication — let's consider "b"

```
for(ii = i; ii < i+tile_size; ii++)</pre>
    for(jj = j; jj < j+tile_size; jj++)</pre>
         for(kk = k; kk < k+tile_size; kk++)</pre>
             c[ii][jj] += a[ii][kk]*b[kk][jj];
```

|         | Address | Tag  | Index |
|---------|---------|------|-------|
| b[0][0] | 0x20000 | 0x20 | 0x0   |
| b[1][0] | 0x24000 | 0x24 | 0×0   |
| b[2][0] | 0x28000 | 0x28 | 0×0   |
| b[3][0] | 0x2C000 | 0x2C | 0x0   |
| b[4][0] | 0x30000 | 0x30 | (0x0) |

0x34000

0x38000

0x3C000

0x40000

0x44000

0x48000

0x4C000

0x50000

0x54000

0x58000

0x5C000

0x60000

0x34

0x38

0x3C

0x40

0x44

0x48

0x4C

0x50

0x54

0x58

0x5C

0x60

0x0

b[5][0]

b[6][0]

b[7][0]

b[8][0]

b[9][0]

b[10][0]

b[11][0]

b[12][0]

b[13][0]

b[14][0]

b[15][0]

b[16][0]

 If the row dimension of your matrix is 2048, each row element with the same column index is

$$2048 \times 8 = 16384 = 0x4000$$

If we stop at somewhere before 12 blocks, we should be fine!

Since each block has 8 elements, let's break up in 8 for now

- -8 elements from a[i]
- -8 columns each covers 8 rows

#### Ideas regarding reducing misses in matrix multiplications

- Reducing capacity misses we need to reduce the length of a row that we visit within a period of time
- Reducing conflict misses we need to ensure an appropriate tile size would not lead to conflict in sets

## Why is "8" not the best performing?

| size | tile_size | IC          | Cycles      | СРІ      | CT_ns    | ET_s     | DL1_miss_rate |
|------|-----------|-------------|-------------|----------|----------|----------|---------------|
| 2048 | 4         | 97766571061 | 23972375695 | 0.245200 | 0.193290 | 4.633619 | 0.015394      |
| 2048 | 8         | 81047436195 | 21583826614 | 0.266311 | 0.193122 | 4.168303 | 0.010260      |
| 2048 | 16        | 74472586117 | 19268082018 | 0.258727 | 0.193325 | 3.724997 | 0.071558      |
| 2048 | 32        | 71543547491 | 27661109860 | 0.386633 | 0.193218 | 5.344616 | 0.217189      |
| 2048 | 64        | 70151970961 | 32605985592 | 0.464791 | 0.193248 | 6.301039 | 0.242202      |
| 2048 | 128       | 69470212062 | 34530336995 | 0.497052 | 0.193235 | 6.672484 | 0.246013      |
| 2048 | 256       | 69131368754 | 35151111975 | 0.508468 | 0.193311 | 6.795085 | 0.246800      |
| 2048 | 512       | 68985162572 | 47048159619 | 0.682004 | 0.193298 | 9.094299 | 0.239775      |

More instructions due to more loop control overhead!

Best performing at 16? 39

"8" indeed has the best miss rate — and matches our predictions!

#### Ideas regarding reducing misses in matrix multiplications

- Reducing capacity misses we need to reduce the length of a row that we visit within a period of time
- Reducing conflict misses we need to ensure an appropriate tile size would not lead to conflict in sets
- Balancing the trade-offs increased instruction count can demolishing the improvement of cache misses
  - Cache miss rates affect the CPI
  - Tiling increases the IC
  - Remember,  $ET = IC \times CPI \times CT$

## **Takeaways: Software Optimizations**

- Data layout capacity miss, conflict miss, compulsory miss
- Loop interchange conflict/capacity miss
- Loop fission conflict miss when \$ has limited way associativity
- Loop fusion capacity miss when \$ has enough way associativity
- Blocking/tiling capacity miss, conflict miss

## **Matrix Transpose**

```
for(i = 0; i < M; i+=tile_size) {
  for(j = 0; j < K; j+=tile_size) {
    for(k = 0; k < N; k+=tile_size) {
     for(ii = i; ii < i+tile_size; ii++)
        for(jj = j; jj < j+tile_size; jj++)
        for(kk = k; kk < k+tile_size; kk++)
        c[ii][jj] += a[ii][kk]*b[kk][jj];
    }
}</pre>
```

```
// Transpose matrix b into b_t
for(i = 0; i < ARRAY_SIZE; i+=(ARRAY_SIZE/n)) {</pre>
  for(j = 0; j < ARRAY_SIZE; j+=(ARRAY_SIZE/n)) {</pre>
      b_t[i][j] += b[j][i];
  for(i = 0; i < M; i+=tile_size) {</pre>
     for(j = 0; j < K; j+=tile_size) {</pre>
       for(k = 0; k < N; k+=tile_size) {</pre>
         for(ii = i; ii < i+tile_size; ii++)</pre>
           for(jj = j; jj < j+tile_size; jj++)</pre>
              for(kk = k; kk < k+tile_size; kk++)</pre>
                // Compute on b_t
                c[ii][jj] += a[ii][kk]*b_t[jj][kk];
```



#### What kind(s) of misses can matrix transpose remove?

• By transposing a matrix, the performance of matrix multiplication can be further improved. What kind(s) of cache misses does matrix transpose help to remove?

```
for(i = 0; i < M; i+=tile_size) {
  for(j = 0; j < K; j+=tile_size) {
    for(k = 0; k < N; k+=tile_size) {
     for(ii = i; ii < i+tile_size; ii++)
        for(jj = j; jj < j+tile_size; jj++)
        for(kk = k; kk < k+tile_size; kk++)
        c[ii][jj] += a[ii][kk]*b[kk][jj];
  }
}
</pre>
```

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss

```
// Transpose matrix b into b_t
for(i = 0; i < ARRAY_SIZE; i++) {</pre>
  for(j = 0; j < ARRAY_SIZE; j++) {</pre>
       b_t[i][j] += b[j][i];
  for(i = 0; i < M; i+=tile_size) {</pre>
    for(j = 0; j < K; j+=tile_size) {</pre>
       for(k = 0; k < N; k+=tile_size) {</pre>
         for(ii = i; ii < i+tile_size; ii++)</pre>
           for(jj = j; jj < j+tile_size; jj++)</pre>
             for(kk = k; kk < k+tile_size; kk++)</pre>
                // Compute on b_t
                c[ii][jj] += a[ii][kk]*b_t[jj][kk];
```

#### What kind(s) of misses can matrix transpose remove?

• By transposing a matrix, the performance of matrix multiplication can be further improved. What kind(s) of cache misses does matrix transpose help to remove?

```
for(i = 0; i < M; i+=tile_size) {
  for(j = 0; j < K; j+=tile_size) {
    for(k = 0; k < N; k+=tile_size) {
     for(ii = i; ii < i+tile_size; ii++)
        for(jj = j; jj < j+tile_size; jj++)
        for(kk = k; kk < k+tile_size; kk++)
        c[ii][jj] += a[ii][kk]*b[kk][jj];
  }
}</pre>
```

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss

#### // Transpose matrix b into b\_t for(i = 0; i < ARRAY\_SIZE; i++) {</pre> for(j = 0; j < ARRAY\_SIZE; j++) {</pre> $b_t[i][j] += b[j][i];$ for(i = 0; i < M; i+=tile\_size) {</pre> for(j = 0; j < K; j+=tile\_size) {</pre> for(k = 0; k < N; k+=tile\_size) {</pre> for(ii = i; ii < i+tile\_size; ii++)</pre> for(jj = j; jj < j+tile\_size; jj++)</pre> Block for(kk = k; kk < k+tile\_size; kk++)</pre> // Compute on b\_t c[ii][jj] += a[ii][kk]\*b\_t[jj][kk];

#### Tiling/Blocking Algorithm for Matrix Multiplications





but it's OK if we

don't need them for

for(i = 0; i < M; i+=tile\_size)
 for(j = 0; j < K; j+=tile\_size)
 for(k = 0; k < N; k+=tile\_size)
 for(ii = i; ii < i+tile\_size; ii++)
 for(jj = j; jj < j+tile\_size; jj++)
 for(kk = k; kk < k+tile\_size; kk++)
 c[ii][jj] += a[ii][kk]\*b[kk][jj];</pre>



**b**8 elements

These can kick out the upper portion of the columns but it's OK if we don't need them for a while

#### Tiling/Blocking Algorithm for Transposed Matrix Multiplications



We can make the "tile\_size" larger without interfacing

#### What kind(s) of misses can matrix transpose remove?

• By transposing a matrix, the performance of matrix multiplication can be further improved. What kind(s) of cache misses does matrix transpose help to remove?

```
for(i = 0; i < M; i+=tile_size) {
  for(j = 0; j < K; j+=tile_size) {
    for(k = 0; k < N; k+=tile_size) {
     for(ii = i; ii < i+tile_size; ii++)
        for(jj = j; jj < j+tile_size; jj++)
        for(kk = k; kk < k+tile_size; kk++)
        c[ii][jj] += a[ii][kk]*b[kk][jj];
  }
}</pre>
```

- A. Compulsory miss
- B. Capacity miss
- C. Conflict miss
- D. Capacity & conflict miss
- E. Compulsory & conflict miss

#### // Transpose matrix b into b\_t for(i = 0; i < ARRAY\_SIZE; i++) {</pre> for(j = 0; j < ARRAY\_SIZE; j++) {</pre> b\_t[i][j] += b[j][i]; for(i = 0; i < M; i+=tile\_size) {</pre> for(j = 0; j < K; j+=tile\_size) {</pre> for(k = 0; k < N; k+=tile\_size) {</pre> for(ii = i; ii < i+tile\_size; ii++)</pre> for(jj = j; jj < j+tile\_size; jj++)</pre> Block for(kk = k; kk < k+tile\_size; kk++)</pre> // Compute on b\_t c[ii][jj] += a[ii][kk]\*b\_t[jj][kk];

#### Tiling/Blocking Algorithm for Transposed Matrix Multiplications



We can make the "tile\_size" larger without interfacing

```
for(i = 0; i < M; i+=tile_size) conflict misses
  for(j = 0; j < K; j+=tile_size)
      for(k = 0; k < N; k+=tile_size)
      for(ii = i; ii < i+tile_size; ii++)
            for(jj = j; jj < j+tile_size; jj++)
            for(kk = k; kk < k+tile_size; kk++)
            c[ii][jj] += a[ii][kk]*b_t[jj][kk];</pre>
```

## The effect of transposition

|          | size | tile_size | IC          | Cycles                     | СРІ       | CT_ns     | ET_s     | DL1_miss_rate |
|----------|------|-----------|-------------|----------------------------|-----------|-----------|----------|---------------|
| <u>0</u> | 2048 | 4         | 97766571061 | 23972375695                | 0.245200  | 0.193290  | 4.633619 | 0.015394      |
| Œ        | 2048 | 8         | 81047436195 | 21583826614                | 0.266311  | 0.193122  | 4.168303 | 0.010260      |
|          | 2048 | 16        | 74472586117 | 19268082018                | 0.258727  | 0.193325  | 3.724997 | 0.071558      |
| O        | 2048 | 32        | 71543547491 | 27661109860<br>32605985592 | oct horf  | ormina    | 5.344616 | 0.217189      |
| 00       | 2048 | 64        | 70151970961 | 32605985592                | C5.464/91 | 011111119 | 6.301039 | 0.242202      |
| m        | 2048 | 128       | 69470212062 | 34530336995                | rom 16 to | o 64?35   | 6.672484 | 0.246013      |
|          | 2048 | 256       | 69131368754 | 35151111975                | 0.508468  | 0.193311  | 6.795085 | 0.246800      |

| 5 | size | tile_size | IC                       | Cycles      | CPI      | CT_ns    | ET_s     | DL1_miss_rate |
|---|------|-----------|--------------------------|-------------|----------|----------|----------|---------------|
| 2 | 048  | 8         | 70351352368              | 16009523067 | 0.227565 | 0.193097 | 3.091384 | 0.001897      |
| 2 | 048  | 16        | 64810353582              | 15145593176 | 0.233691 | 0.193199 | 2.926121 | 0.026059      |
| 2 | 048  | 32        | 62397963236              | 14854143892 | 0.238055 | 0.193161 | 2.869243 | 0.040979      |
| 2 | 048  | 64        | Transpose in             | nproves     | 0.223012 | 0.193260 | 2.640043 | 0.023464      |
| 2 | 048  | 128       | the miss rate            | at larger   | 0.275478 | 0.193123 | 3.229842 | 0.018013      |
| 2 | 048  | 256       | 60438882203<br>Tile SIZE | 17003851823 | 0.281340 | 0.193330 | 3.287351 | 0.012972      |

Block + Transpose

## Use registers wisely

#### This will create a memory access!

The compiler will try to make result in a register

— without writing code in this way, compiler may not optimize

## The effect of transposition + register

|        |      |           |             |             |                      |          | <u></u>  |               |              |
|--------|------|-----------|-------------|-------------|----------------------|----------|----------|---------------|--------------|
| nspose | size | tile_size | IC          | Cycles      | CPI                  | CT_ns    | ET_s     | DL1_miss_rate | DL1_accesses |
|        | 2048 | 8         | 70351352368 | 16009523067 | 0.227565             | 0.193097 | 3.091384 | 0.001897      | 28769906635  |
| ns     | 2048 | 16        | 64810353582 | 15145593176 | 0.233691             | 0.193199 | 2.926121 | 0.026059      | 27045466371  |
| Tra    | 2048 | 32        | 62397963236 | 14854143892 | 0.238055             | 0.193161 | 2.869243 | 0.040979      | 26365975027  |
| +      | 2048 | 64        | 61255133491 | 13660607640 | 0.223012             | 0.193260 | 2.640043 | 0.023464      | 26061062917  |
| Block  | 2048 | 128       | 60710004318 | 16724248174 | 0.275478             | 0.193123 | 3.229842 | 0.018013      | 25921112408  |
| m      | 2048 | 256       | 60438882203 | 17003851823 | 0.281340<br>gnificar | 0.193330 | 3.287351 | 0.012972      | 25852375287  |
|        |      |           |             |             | grillicar            | it reau  | cuon     |               |              |
| + Reg. | size | tile_size | IC          | Cycles Of   | memor                | v acce   | sses     | DL1_miss_rate | DL1_accesses |
| ose    | 2048 | 8         | 60667686406 | 11604601609 | 0.191281             | 0.193985 | 2.251117 | 0.003363      | 15849319789  |
|        | 2048 | 16        | 49205927530 | 10032628049 | 0.203891             | 0.193373 | 1.940040 | 0.066542      | 11986874817  |
| usp    | 2048 | 32        | 43854347828 | 9736190510  | 0.222012             | 0.193008 | 1.879164 | 0.096360      | 10247265803  |

0 2/2220

0.357307

Best performing at 32

0 1021/1

0.193021

1.938936

2.712263

0.069853

0.045847

0.032001

9413044702

9005357520

8804127506

**Block + Tran** 

2048

2048

2048

64

128

256

41246516681

39962465083

39326412762

10022012762

14051616706

## Tiles do not have to be "squares"

```
for(i = 0; i < M; i+=tile_size_y) {</pre>
    for(j = 0; j < K; j+=tile_size_y) {</pre>
      for(k = 0; k < N; k+=tile_size_x) {</pre>
           for(ii = i; ii < i+tile_size_y; ii++)</pre>
                for(jj = j; jj < j+tile_size_y; jj++) {</pre>
                    result = 0;
                    for(kk = k; kk < k+tile_size_x; kk++)</pre>
                         result += a[ii][kk]*b[jj][kk];
                    c[ii][jj] += result;
```

## If we could have a rectangular tile + transposition

| + Reg. | size | tile_size   | IC          | Су             | Cycles     |           | CT_ns      | ET_s     | DL1_miss_rate |
|--------|------|-------------|-------------|----------------|------------|-----------|------------|----------|---------------|
|        | 2048 | 8           | 60667686    | 406 11604      | 601609 0   | .191281   | 0.193985   | 2.251117 | 0.003363      |
| ose    | 2048 | 16          | 49205927    | 530 10032      | 628049 0   | .203891   | 0.193373   | 1.940040 | 0.066542      |
| dsu    | 2048 | 32          | 43854347    | 828 97362      | 190510 0   | .222012   | 0.193008   | 1.879164 | 0.096360      |
| r Tra  | 2048 | 64          | 41246516    | 681 10038      | 942768 0   | .243389   | 0.193141   | 1.938936 | 0.069853      |
| ock +  | 2048 | 128         | 39962465    | 083 11887      | 280358 0   | .297461   | 0.193436   | 2.299423 | 0.045847      |
| Bic    | 2048 | 256         | 39326/12    | 762 14051      | 616706 0   | .357307   | 0.193021   | 3.713263 | 0.032001      |
|        |      |             | vei         | ry low mis     | ss rate c  | ompar     | ed to 3    | ZX3Z     |               |
| Reg    | size | tile_size_x | tile_size_y | IC             | Cycles     | CPI       | CT_ns      | ET_S     | DL1_miss_rate |
| 4      | 2048 | 8           | 8           | 60696024519    | 1162441928 | 0 0.19151 | 9 0.193479 | 2.249077 | 0.003664      |
| Se     | 2048 | 8           | 16          | 59757245622    | 1252376966 | 9 0.20957 | 7 0.193096 | 2.418293 | 0.022092      |
| bose   | 2048 | 16          | 8           | 49689499294    | 1141540195 | 8 0.22973 | 5 0.193200 | 2.205459 | 0.003957      |
| ans    | 2048 | 16          | 16          | <b>Best pe</b> | rformin    | 8 at 27   | 193202     | 1.935619 | 0.067519      |
| Ĕ      | 2048 | 32          | 16          | 43947034522    | 971859764  | 9 9 22114 | 193352     | 1.879111 | 0.066730      |
| X      | 2048 | 32          | 8           | 44182255983    | 9353772552 | 2 0.21170 | 9 0.193233 | 1.807453 | 0.005696      |
| Block  | 2048 | 64          | 8           | 41431394002    | 963642690  | 4 0.23258 | 3 0.193230 | 1.862043 | 0.005849      |
| ct.    | 2048 | 128         | 8           | 40059698391    | 1159972716 | 8 0.28956 | 1 0.193133 | 2.240289 | 0.006533      |
| Rect.  | 2048 | 256         | 8           | 39376445298    | 1378574095 | 4 0.35010 | 1 0.193180 | 2.663130 | 0.004190      |

## **Takeaways: Software Optimizations**

- Data layout capacity miss, conflict miss, compulsory miss
- Loop interchange conflict/capacity miss
- Loop fission conflict miss when \$ has limited way associativity
- Loop fusion capacity miss when \$ has enough way associativity
- Blocking/tiling capacity miss, conflict miss
- Matrix transpose (a technique changes layout) conflict misses
- Using registers whenever possible reduce memory accesses!

#### Software Prefetching — through prefetching instructions

- x86 provide prefetch instructions
- As a programmer, you may insert \_mm\_prefetch in x86 programs to perform software prefetch for your code
- gcc also has a flag "-fprefetch-loop-arrays" to automatically insert software prefetch instructions

### Implementation of SPM — GPU's shared memory

```
Processor
__global__ void staticReverse(int *d, int n)
 __shared__ int s[64];
                                                                      Core
 int t = threadIdx.x;
 int tr = n-t-1;
                                                                     Registers
 s[t] = d[t];
 __syncthreads();
 d[t] = s[tr];
                                  movl (others),
                                                             %eax
                                                                                  (&s[i]), %eax
                                                                         movl
__global__ void dynamicReverse(int *d, int n)
 extern __shared__ int s[];
                                                                                   SPM
                                                               L1 $
 int t = threadIdx.x;
 int tr = n-t-1;
 s[t] = d[t];
 __syncthreads();
 d[t] = s[tr];
```

https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/

#### Takeaways: Optimizing cache performance through hardware

- There is no optimal cache configurations trade-offs are everywhere
  - Increasing C (+): capacity misses; (-): cost, access time, power
  - Increasing A (+): conflict misses; (-): access time, power
  - Increasing B (+): compulsory misses; (-): miss penalty
- Adding a small buffer alongside the L1 cache can
  - Virtually add an associative set to frequently used data structures
  - Prefetched blocks won't cause conflict misses
- Software Optimization
  - Data layout capacity miss, conflict miss, compulsory miss
  - Loop interchange conflict/capacity miss
  - Loop fission conflict miss when \$ has limited way associativity
  - Loop fusion capacity miss when \$ has enough way associativity
  - Blocking/tiling capacity miss, conflict miss
  - Matrix transpose (a technique changes layout) conflict misses
  - Using registers whenever possible reduce memory accesses!
- Software-control, architectural-supported approach
  - Prefetching instructions
  - · Adding a tag-less, programmable small buffer alongside the L1 cache can reduce power consumption

## Why is "8" not the best performing?

| size | tile_size | IC          | Cycles      | СРІ      | CT_ns    | ET_s     | DL1_miss_rat<br>e | DL1_accesses |
|------|-----------|-------------|-------------|----------|----------|----------|-------------------|--------------|
| 2048 | 4         | 97765686275 | 24149510064 | 0.247014 | 0.193189 | 4.665430 | 0.015102          | 43641501149  |
| 2048 | 8         | 80996985555 | 21043742544 | 0.259809 | 0.193444 | 4.070776 | 0.010135          | 38128531445  |
| 2048 | 16        | 74473114435 | 19369857501 | 0.260092 | 0.193204 | 3.742332 | 0.071790          | 36105122733  |
| 2048 | 32        | 71543334296 | 27812871208 | 0.388756 | 0.193112 | 5.371009 | 0.217011          | 35214370198  |

More instructions due to more loop control overhead!

Why 1% and 7% miss rate do not make significant difference?

## How can we handle miss better?

#### The bandwidth between units is limited



#### When we handle a miss



assume the bus between L1/L2 only allows a quarter of the cache block go through it

## **Early Restart and Critical Word First**



assume the bus between L1/L2 only allows a quarter of the cache block go through it

## **Early Restart and Critical Word First**

- Don't wait for full block to be loaded before restarting CPU
  - Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
- Most useful with large blocks
- Spatial locality is a problem; often we want the next sequential word soon, so not always a benefit (early restart).

#### Can we avoid the overhead of writes?



assume the bus between L1/L2 only allows a quarter of the cache block go through it

#### Write buffer!



assume the bus between L1/L2 only allows a quarter of the cache block go through it

## Can we avoid the "double penalty"?



- Every write to lower memory will first write to a small SRAM buffer.
  - store does not incur data hazards, but the pipeline has to stall if the write misses
  - The write buffer will continue writing data to lower-level memory
  - · The processor/higher-level memory can response as soon as the data is written to write buffer.
- Write merge
  - Since application has locality, it's highly possible the evicted data have neighboring addresses.
     Write buffer delays the writes and allows these neighboring data to be grouped together.

#### Announcement

- Assignment #2 due tonight
- Reading quiz #5 due next Tuesday before the lecture
- Assignment #3 due next Thursday
- Programming Assignment #2 due 11/7
- Midterm on 11/5
  - 80 minutes, in-person only
  - Closed book, closed note, no laptop, no mobile phones (including the calculator app)
  - You may use a calculator
  - Will release sample midterm questions together with slides of 10/31

# Computer Science & Engineering

203



