# Virtual Memory: Just an Illusion

Hung-Wei Tseng

#### **Recap: Memory Hierarchy Processor** fastest 1 **Processor** < 1ns Core fastest Registers L1\$ L2\$ SRAM\$ a few ns L3\$ **DRAM** tens of ns larger v TBs Storage us/ms larger

# Recap: NVIDIA Tegra X100% miss rate!

C = ABS

• Size 32KB, 4-way set associativity, 64B block, LRU policy, write-allocate, write-back, and assuming 64-bit address.

|      |               | tag index onset                         |      |       |                      |          |
|------|---------------|-----------------------------------------|------|-------|----------------------|----------|
|      | Address (Hex) | Address in binary                       | Tag  | Index | Hit? Miss?           | Replace? |
| a[0] | 0×10000       | 0 <mark>b0001000</mark> 00000000000000  | 0x8  | 0x0   | Compulsory Miss      |          |
| b[0] | 0x20000       | 0 <mark>b0010000</mark> 000000000000000 | 0x10 | 0x0   | Compulsory Miss      |          |
| c[0] | 0x30000       | 0 <mark>b0011000</mark> 000000000000000 | 0x18 | 0x0   | Compulsory Miss      |          |
| d[0] | 0x40000       | 0 <mark>b0100000</mark> 000000000000000 | 0x20 | 0x0   | Compulsory Miss      |          |
| e[0] | 0x50000       | 0 <mark>b0101000</mark> 000000000000000 | 0x28 | 0x0   | Compulsory Miss      | a[0-7]   |
| a[1] | 0x10008       | 0 <mark>b0001000</mark> 00000000001000  | 0x8  | 0x0   | <b>Conflict Miss</b> | b[0-7]   |
| b[1] | 0x20008       | 0b00100000000000001000                  | 0x10 | 0×0   | <b>Conflict Miss</b> | c[0-7]   |
| c[1] | 0x30008       | 0b00110000000000001000                  | 0x18 | 0x0   | <b>Conflict Miss</b> | d[0-7]   |
| d[1] | 0x40008       | 0b01000000000000001000                  | 0x20 | 0×0   | <b>Conflict Miss</b> | e[0-7]   |
| e[1] | 0x50008       | 0b01010000000000001000                  | 0x28 | 0x0   | <b>Conflict Miss</b> | a[0-7]   |
| :    |               | :<br>:                                  | :    | :     | :<br>:<br>:          | :        |

# **Loop optimizations**

```
for(i = 0; i < ARRAY_SIZE; i++)
{
   for(j = 0; j < ARRAY_SIZE; j++)
   {
     c[i][j] = a[i][j]+b[i][j];
   }
}</pre>
```

# for(j = 0; j < ARRAY\_SIZE; j++,LOOP interchange

```
for(j = 0; j < ARRAY_SIZE; j++)
{
  for(i = 0; i < ARRAY_SIZE; i++)
    {
     c[i][j] = a[i][j]+b[i][j];
  }
}</pre>
```

m

Loop fission



4

Loop fusion

# Recap: Matrix Transpose

```
for(i = 0; i < ARRAY_SIZE; i++) {</pre>
          for(j = 0; j < ARRAY_SIZE; j++) {</pre>
            for(k = 0; k < ARRAY_SIZE; k++) {</pre>
               c[i][j] += a[i][k]*b[k][j];
for(i = 0; i < ARRAY_SIZE; i+=(ARRAY_SIZE/n)) {</pre>
  for(j = 0; j < ARRAY_SIZE; j+=(ARRAY_SIZE/n)) {</pre>
    for(k = 0; k < ARRAY_SIZE; k+=(ARRAY_SIZE/n)) {</pre>
        for(ii = i; ii < i+(ARRAY_SIZE/n); ii++)</pre>
           for(jj = j; jj < j+(ARRAY_SIZE/n); jj++)
             for(kk = k; kk < k+(ARRAY_SIZE/n); kk++)
               c[ii][jj] += a[ii][kk]*b[kk][jj];
```

```
Reduce capacity/conflict misses
```

```
// Transpose matrix b into b_t
for(i = 0; i < ARRAY_SIZE; i+=(ARRAY_SIZE/n)) {</pre>
  for(j = 0; j < ARRAY_SIZE; j+=(ARRAY_SIZE/n)) {</pre>
      b_t[i][j] += b[j][i];
for(i = 0; i < ARRAY_SIZE; i+=(ARRAY_SIZE/n)) {</pre>
  for(j = 0; j < ARRAY_SIZE; j+=(ARRAY_SIZE/n)) {</pre>
    for(k = 0; k < ARRAY_SIZE; k+=(ARRAY_SIZE/n)) {</pre>
         for(ii = i; ii < i+(ARRAY_SIZE/n); ii++)</pre>
           for(jj = j; jj < j+(ARRAY_SIZE/n); jj++)
             for(kk = k; kk < k+(ARRAY_SIZE/n); kk++)
               // Compute on b_t
               c[ii][jj] += a[ii][kk]*b_t[jj][kk];
```

#### Reduce conflict misses

# Recap: Summary of Optimizations

#### Software

- Data layout capacity miss, conflict miss, compulsory miss
- Blocking capacity miss, conflict miss
- Loop fission conflict miss when \$ has limited way associativity
- Loop fusion capacity miss when \$ has enough way associativity
- Loop interchange conflict/capacity miss

#### Hardware

- Prefetch compulsory miss
- Write buffer miss penalty
- Bank/pipeline miss penalty
- Critical word first and early restart miss panelty

# What will happen?

- If we execute the code on the right-hand side code on a machine with only 32 GB of physical memory installed and the dim is "70000" (requires 70000\*70000\*8 bytes ~ 37 GB memory at least), What will happen?
  - A. The program will crash in one of the malloc function call
  - B. The program will crash due to a "segmentation fault" that caused by accessing NULL pointer
  - C. The program will be killed automatically by the OS as it uses more than installed physical main memory
  - D. The program will finish without any issue

```
int main(int argc, char *argv[])
    int i,j;
    double **a;
    double sum=0, average;
    int dim=32768;
   if(argc < 2)
        fprintf(stderr, "Usage: %s dimension\n",argv[0]);
        exit(1);
    dim = atoi(argv[1]);
    a = (double **)malloc(sizeof(double *)*dim);
    for(i = 0 ; i < dim; i++)
        a[i] = (double *)malloc(sizeof(double)*dim);
    for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            a[i][i] = rand();
   for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            sum+=a[i][i];
    average = sum/(dim*dim);
    fprintf(stderr, "average: %lf\n", average);
    for(i = 0 ; i < dim; i++)
        free(a[i]);
    free(a);
    return 0;
```

## **Outline**

- Virtual memory
- Architectural support for virtual memory
- Advanced hardware support for virtual memory

# Let's dig into this code

```
#define GNU SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <time.h>
double a;
int main(int argc, char *argv[])
    int i, number_of_total_processes=4;
    number of total processes = atoi(argv[1]);
    // Create processes
    for(i = 0; i< number_of_total_processes-1 && fork(); i++);</pre>
    // Generate rand seed
    srand((int)time(NULL)+(int)getpid());
    a = rand();
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address of a is %p\n",getpid(), a, &a);
    sleep(10);
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address of a is %p\n",getpid(), a, &a);
    return 0;
```

# Consider the following code ...

- Consider the case when we run multiple instances of the given program at the same time on modern machines, which pair of statements is correct?
  - ① The printed "address of a" is the same for every running instances
  - ② The printed "address of a" is different for each instance
  - All running instances will print the same value of a
  - Some instances will print the same value of a
  - ⑤ Each instance will print a different value of a
  - A. (1) & (3)
  - B. (1) & (4)
  - C. (1) & (5)
  - D. (2) & (3)
  - E. (2) & (4)

```
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <time.h>
double a;
int main(int argc, char *argv[])
    int i, number_of_total_processes=4;
    number_of_total_processes = atoi(argv[1]);
    for(i = 0; i< number_of_total_processes-1 && fork(); i++);</pre>
    srand((int)time(NULL)+(int)getpid());
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address
of a is %p\n",getpid(), a, &a);
    sleep(10);
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address
of a is %p\n",getpid(), a, &a);
    return 0;
```

# Consider the following code ...

- Consider the case when we run multiple instances of the given program at the same time on modern machines, which pair of statements is correct?
  - ① The printed "address of a" is the same for every running instances
  - ② The printed "address of a" is different for each instance
  - ③ All running instances will print the same value of a
  - Some instances will print the same value of a
  - ⑤ Each instance will print a different value of a
  - A. (1) & (3)
  - B. (1) & (4)
  - C. (1) & (5)
  - D. (2) & (3)
  - E. (2) & (4)

```
#define GNU SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <time.h>
double a;
int main(int argc, char *argv[])
    int i, number_of_total_processes=4;
    number of total processes = atoi(argv[1]);
    for(i = 0; i< number_of_total_processes-1 && fork(); i++);</pre>
    srand((int)time(NULL)+(int)getpid());
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address
of a is %p\n",getpid(), a, &a);
    sleep(10);
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address
of a is %p\n",getpid(), a, &a);
    return 0;
```

# Virtual Memory

#### This approach is Virtual memory called demand paging · · page + swapping PC for A 3f00bb27 **Program A** 509cbd23 00005d24 00c2e800 0f00bb27 00c2f800 00c2e800 0000bd24 509cbd23 80000008 80000008 > 0000008 PC for A 00c2f000 00005d24 2ca422a0 00c30000 00c2f000 $\Delta v \cap$ 0000008 130020e4 0000bd24 80000008 0000008 PC for B 3f00bb27 00003d24 00c2f800 2ca422a0 00c2e800 2ca422a0 530chu23 0000008 2ca4e2b3 130020e4 13002024 8000000 00695d24 00c30000 00003d24 90c2e800 00003d24 00c2f000 load 0000 Ld24 AYZEN 8000000 2ca4e2b3 8000000 2ca4e2b3 8000000 2ca42230 PC for B 00c2f000 (intel) 130020e4 2ca422a0 0f00bb2/ 00000008 Virtual Mem 00003d24 509cbd23 130020e4 Core™ i7 00c2f800 2ca4e2b3 00005d24 00003d24 00000008 0000bd24 2ca4e2b3 30c2e800 load 00c30000 80000008 **Physical** 8000000 **Processor** 00c2f000 00000008 Memory 90c2†800 **Program B** load 80000008 00c2e800 0f00bb27 0f00bb27 00c30000 509cbd23 8000000 80000008 509cbd23 264-1 00005d24 00c2f000 00005d24 00000008 00c2f800 0000bd24 0000bd24 2ca422a0 130020e4 8000000 00003d24 00c30000 Swap 2ca4e2b3 8000000 264-1 17

#### **Demo revisited**

```
Process A's
                                                     Process A
                                                                                        Virtual
                                                                                    Memory Space
                      &a = 0x601090
#define _GNU_SOURCE
                                                                                     Process B's
#include <unistd.h>
                                                     Process B
                                                                                        Virtual
#include <stdio.h>
#include <stdlib.h>
                                                                                   Memory Space
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <time.h>
double a;
int main(int argc, char *argv[])
   int i, number_of_total_processes=4;
   number of total processes = atoi(argv[1]);
   for(i = 0; i< number_of_total_processes-1 && fork(); i++);</pre>
   srand((int)time(NULL)+(int)getpid());
   fprintf(stderr, "\nProcess %d. Value of a is %lf and address of a is %p\n",getpid(), a, &a);
   sleep(10);
   fprintf(stderr, "\nProcess %d. Value of a is %lf and address of a is %p\n",getpid(), a, &a);
   return 0;
```

# Virtual memory

- An abstraction of memory space available for programs/ software/programmer
- Programs execute using virtual memory address
- The operating system and hardware work together to handle the mapping between virtual memory addresses and real/ physical memory addresses
- Virtual memory organizes memory locations into "pages"

# Why Virtual memory?

- Allowing multiple applications to share physical main memory
  - Memory protection/isolation among programs/processes is automatically achieved
- Allowing applications to work even the installed physical memory or available physical memory is smaller than the working set of the application
  - Programmer does not need to worry about the physical memory capacity of different machines — make compiled program compatible
  - Multiple programs can work concurrently even through their total memory demand is larger than the installed physical memory

**Processor** Core

Registers

The virtual memory abstraction

load 0x0009

Page table

MaiPage#hory

(DRAM)



# **Demand paging**

- Treating physical main memory as a "cache" of virtual memory
- The block size is the "page size"
- The page table is the "tag array"
- It's a "fully-associate" cache a virtual page can go anywhere in the physical main memory

## **Address translation**

- Processor receives virtual addresses from the running code, main memory uses physical memory addresses
- Virtual address space is organized into "pages"
- The system references the page table to translate addresses
  - Each process has its own page table
  - The page table content is maintained by OS



# Size of page table

- Assume that we have 64-bit virtual address space, each page is 4KB, each page table entry is 8 Bytes, what magnitude in size is the page table for a process?
  - A. MB 2<sup>20</sup> Bytes
  - B. GB 2<sup>30</sup> Bytes
  - C. TB 2<sup>40</sup> Bytes
  - D. PB 2<sup>50</sup> Bytes
  - E. EB 2<sup>60</sup> Bytes

# Size of page table

 Assume that we have 64-bit virtual address space, each page is 4KB, each page table entry is 8 Bytes, what magnitude in size is the page table for a process?

$$\frac{2^{64} \ Bytes}{4 \ KB} \times 8 \ Bytes = 2^{55} \ Bytes = 32 \ PB$$

If you still don't know why — you need to take CS202

# Conventional page table

0xffffffffffffff

#### **Virtual Address Space**

- must be consecutive in the physical memory
- need a big segment! difficult to find a spot
- simply too big to fit in memory if address space is large!

$$\frac{2^{64} B}{2^{12} B}$$
 page table entries/leaf nodes -

# Do we really need a large table?



# "Paged" page table

0xFFFFFFFFFFFFFF

Code Data Heap Virtual Address Space Stack

Break up entries into pages!

0x0

Each of these occupies exactly a page

$$-\frac{2^{12} B}{2^{3} P} = 2^{9}$$
 PTEs per node

**Question:** 

These nodes are spread out, how to locate them in the memory?

Otherwise, you always need to find more than one consecutive pages — difficult!

These are nodes are not presented if they are not referenced at all — save space

Allocate page table entry nodes "on demänd"

## **B-tree**



https://en.wikipedia.org/wiki/B-tree#/media/File:B-tree.svg

# **Hierarchical Page Table**

0x0 0xffffffffffffff Code Data Heap **Virtual Address Space** Stack  $\lceil log_{2^9} \frac{2^{64} B}{2^{12} B} \rceil = \lceil log_{2^9} 2^{52} \rceil = 6 \text{ levels}$ These are nodes are not presented as they are not referenced at all.  $\frac{2}{2^{12}}$  page table entries/leaf nodes (worst case)

#### Address translation in x86-64



# When we have virtual memory...

• If an x86 processor supports virtual memory through the basic format of the page table as shown in the previous slide, how many memory accesses can a **mov** instruction that access data memory once incur?

- A. 2
- B. 4
- C. 6
- D. 8
- E. 10



## Address translation in x86-64



# When we have virtual memory...

 If an x86 processor supports virtual memory through the basic format of the page table as shown in the previous slide, how many memory accesses can a mov instruction that access data memory once incur?

A. 2

B. 4

C. 6

D. 8

E. 10

# Avoiding the address translation overhead

## **TLB: Translation Look-aside Buffer**



- TLB a small SRAM stores frequently used page table entries
- Good A lot faster than having everything going to the DRAM
- Bad Still on the critical path

#### TLB + Virtual cache

- L1 \$ accepts virtual address you don't need to translate
- Good you can access both TLB and L1-\$ at the same time and physical address is only needed if L1-\$ missed d/sd
- Bad it doesn't work in practice
  - Many applications have the same virtual address but should be pointing different physical addresses
  - An application can have "aliasing virtual addresses" pointing to the same physical address



# Virtually indexed, physically tagged cache

- Can we find physical address directly in the virtual address
  — Not everything — but the
  - Not everything but the page offset isn't changing!
- Can we indexing the cache using the "partial physical address"?
  - Yes Just make set index + block set to be exactly the page offset



# Virtually indexed, physically tagged cache



# Virtually indexed, physically tagged cache

If page size is 4KB —

$$lg(B) + lg(S) = lg(4096) = 12$$

$$C = ABS$$

$$C = A \times 2^{12}$$

$$if A = 1$$

$$C = 4KB$$



#### Virtual indexed, physical tagged cache limits the cache size

- If you want to build a virtual indexed, physical tagged cache with 32KB capacity, which of the following configuration is possible? Assume the operating system use 4K pages.
  - A. 32B blocks, 2-way
  - B. 32B blocks, 4-way
  - C. 64B blocks, 4-way
  - D. 64B blocks, 8-way



#### Virtual indexed, physical tagged cache limits the cache size

 If you want to build a virtual indexed, physical tagged cache with 32KB capacity, which of the following configuration is possible? Assume the operating system use 4K pages.

**Exactly how Core i7 configures** its own cache

$$lg(B) + lg(S) = lg(4096) = 12$$

$$C = ABS$$

$$32KB = A \times 2^{12}$$

$$A = 8$$

# **About midterm**

## Format of the midterm

- Multiple choices \* 16 like your poll/reading quizzes multiple choices questions
- Short answer questions \* 3
- Homework style correctness questions \* 3
  - You need to clearly write down the original form of the applied equation/formula
  - You need to replace each term accordingly with numbers
  - You will have some credits for right equations even though the final number isn't correct
  - You will receive 0 credits if we only see the numbers

# Sample Midterm

# Programmer's impact

• By adding the "sort" in the following code snippet, what the programmer changes in the performance equation to achieve **better** performance?

```
std::sort(data, data + arraySize);
       for (unsigned c = 0; c < arraySize*1000; ++c) {
               if (data[c%arraySize] >= INT_MAX/2)
                    sum ++;
B. IC
C. CT
D. IC & CPI
E. CPI & CT
```

## Amdahl's Law on Multicore Architectures

- Regarding Amdahl's Law on multicore architectures, how many of the following statements is/are correct?
  - ① If we have unlimited parallelism, the performance of each parallel piece does not matter as long as the performance slowdown in each piece is bounded
  - ② With unlimited amount of parallel hardware units, single-core performance does not matter anymore
  - ③ With unlimited amount of parallel hardware units, the maximum speedup will be bounded by the fraction of parallel parts
  - With unlimited amount of parallel hardware units, the effect of scheduling and data exchange overhead is minor
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4

## How programmer affects performance?

- Performance equation consists of the following three factors
  - ① IC
  - 2 CPI
  - **3** CT

How many can a **programmer** affect?

- A. 0
- B. 1
- C. 2
- D. 3

# Demo — programmer & performance

```
for(i = 0; i < ARRAY_SIZE; i++)
{
  for(j = 0; j < ARRAY_SIZE; j++)
    {
    c[i][j] = a[i][j]+b[i][j];
  }
}</pre>
```

```
for(j = 0; j < ARRAY_SIZE; j++)
{
  for(i = 0; i < ARRAY_SIZE; i++)
  {
    c[i][j] = a[i][j]+b[i][j];
  }
}</pre>
```

- How many of the following make(s) the performance of A better than B?
  - ① IC
  - ② CPI
  - **3** CT
  - A. 0
  - B. 1
  - C. 2
  - D. 3

# **Data locality**

 Which description about locality of arrays matrix and vector in the following code is the most accurate?

```
for(uint32_t i = 0; i < m; i++) {
    result = 0;
    for(uint32_t j = 0; j < n; j++) {
        result += matrix[i][j]*vector[j];
    }
    output[i] = result;
}</pre>
```

- A. Access of matrix has temporal locality, vector has spatial locality
- B. Both matrix and vector have temporal locality, and vector also has spatial locality
- C. Access of matrix has spatial locality, vector has temporal locality
- D. Both matrix and vector have spatial locality and temporal locality
- E. Both matrix and vector have spatial locality, and vector also has temporal locality

## 3Cs and A, B, C

- Regarding 3Cs: compulsory, conflict and capacity misses and A, B, C: associativity, block size, capacity How many of the following are correct?
  - Increasing associativity can reduce conflict misses
  - ② Increasing associativity can reduce hit time
  - ③ Increasing block size can increase the miss penalty
  - 4 Increasing block size can reduce compulsory misses
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4

## intel Core i7

- L1 data (D-L1) cache configuration of Core i7
  - Size 32KB, 8-way set associativity, 64B block
  - Assume 64-bit memory address
  - Which of the following is NOT correct?
    - A. Tag is 52 bits
    - B. Index is 6 bits
    - C. Offset is 6 bits
    - D. The cache has 128 sets

#### Virtual indexed, physical tagged cache limits the cache size

- If you want to build a virtual indexed, physical tagged cache with 32KB capacity, which of the following configuration is possible? Assume the system use 4K pages.
  - A. 32B blocks, 2-way
  - B. 32B blocks, 4-way
  - C. 64B blocks, 4-way
  - D. 64B blocks, 8-way

# When we have virtual memory...

- In a modern x86-64 processor supports virtual memory through, how many memory accesses can an instruction incur?
  - A. 2
  - B. 4
  - C. 6
  - D. 8
  - E. More than 10

# Practicing Amdahl's Law (2)

 Final Fantasy XV spends lots of time loading a map — within which period that 95% of the time on the accessing the H.D.D., the rest in the operating system, file system and the I/O protocol. If we replace the H.D.D. with a flash drive, which provides 100x faster access time and a better processor to accelerate the software overhead by 2x. By how much can we speed up the map loading process?

- A. ~7x
- B. ~10x
- C. ~17x
- D. ~29x
- E. ~100x



# What data structure is performing better

|                             | Array of objects                                                                                                                                                                                                                                   | object of arrays                                                                                                                                                                                                                                                                                                                           |  |  |  |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
|                             | <pre>struct grades {   int id;   double *homework;   double average; };</pre>                                                                                                                                                                      | <pre>struct grades {   int *id;   double **homework;   double *average; };</pre>                                                                                                                                                                                                                                                           |  |  |  |
| average of each<br>homework | <pre>for(i=0;i<homework_items; (double)total_number_students;="" +="gradesheet[j].homework[i];" =="" for(j="0;j&lt;total_number_students;j++)" gradesheet[total_number_students].homework[i]="" i++)="" pre="" {="" }<=""></homework_items;></pre> | <pre>for(i = 0;i &lt; homework_items; i++) {    gradesheet.homework[i][total_number_students] = 0.0;    for(j = 0; j &lt; total_number_students; j++)    {       gradesheet.homework[i][total_number_students] +=    gradesheet.homework[i][j];    }    gradesheet.homework[i][total_number_students] /=    total_number_students; }</pre> |  |  |  |

- Considering your workload would like to calculate the average score of one of the homework for all students, which data structure would deliver better performance?
  - A. Array of objects
  - B. Object of arrays

## Which of the following schemes can help Tegra?

- How many of the following schemes mentioned in "improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers" would help NVIDIA's Tegra for the code in the previous slide?
  - ① Missing cache
  - ② Victim cache
  - ③ Prefetch
  - 4 Stream buffer
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4

# The result of size of (struct struct\_8)

Consider the following data structure:

```
struct struct_8 {
   uint8_t a;
   uint64_t b;
   uint8_t c;
What's the output of
printf("%lu\n", sizeof(struct struct_8))?
 A. 10
 B. 17
 C. 24
 D. 36
 E. 80
```

## What if the code look like this?

- D-L1 Cache configuration of NVIDIA Tegra X1
  - Size 32KB, 4-way set associativity, 64B block, LRU policy, write-allocate, write-back, and assuming 64-bit address.

```
double a[8192], b[8192], c[8192], d[8192], e[8192]; 

/* a = 0x10000, b = 0x20000, c = 0x30000, d = 0x40000, e = 0x50000 */ 

for(i = 0; i < 512; i++) 

    e[i] = a[i] * b[i] + c[i]; //load a, b, c and then store to e 

for(i = 0; i < 512; i++) 

    e[i] /= d[i]; //load e, load d, and then store to e
```

#### What's the data cache miss rate for this code?

- A. ~10%
- B. ~20%
- C. ~40%
- D. ~80%
- E. 100%

# Pipelined access and multi-banked caches

 Assume each bank in the \$ takes 10 ns to serve a request, and the \$ can take the next request 1 ns after assigning a request to a bank — if we have 4 banks and we want to serve 4 requests, what's the speedup over non-banked, non-pipelined \$? — pick the closest one

```
A. 1x — no speedup
```

- B. 2x
- C. 3x
- D. 4x
- E. 5x

# **Summary of Optimizations**

- Regarding the following cache optimizations, how many of them would help improve miss rate?
  - ① Non-blocking/pipelined/multibanked cache
  - ② Critical word first and early restart
  - ③ Prefetching
  - Write buffer
  - A. 0
  - B. 1
  - C. 2
  - D. 3
  - E. 4

# Sample short answer questions (< 40 words)

- What are the limitations of compiler optimizations? Can you list two?
- Please define Amdahl's Law and explain each term in it
- Please define the CPU performance equation and explain each term.
- Can you list two things affecting each term in the performance equation?
- What's the difference between latency and throughput? When should you use latency or throughput to judge performance? When will throughput mislead performance?
- What's "benchmark" suite? Why is it important?
- Why TFLOPS or inferences per second is not a good metrics?

## Amdahl's Law for multiple optimizations

- Assume that a program is mainly composed of 2 functions baseline\_int() and matrix(). If the program takes 60% in baseline\_int() and remaining 30% in matrix().
  - If there exists an optimization that speedup baseline\_int by 10x.
     What's the total speedup?
  - If there exists an optimization that speedup matrix by 30x. What's the total speedup?
  - What if we can apply both optimizations?
  - What if there is an overhead applying these optimizations?

# Speedup of Y over X

 Consider the same program on the following two machines, X and Y. By how much Y is faster than X?

|           | Clock<br>Rate | Instructions | Percentage of Type-A |    | Percentage of Type-B |   | Percentage of Type-C | CPI of<br>Type-C |
|-----------|---------------|--------------|----------------------|----|----------------------|---|----------------------|------------------|
| Machine 3 | 3 GHz         | 500000000    | 20%                  | 8  | 20%                  | 4 | 60%                  | 1                |
| Machine ' | 5 GHz         | 500000000    | 20%                  | 13 | 20%                  | 4 | 60%                  | 1                |

## How can deeper memory hierarchy help in performance?

- Assume that we have a processor running @ 2 GHz and a program with 30% of load/store instructions. If the computer has "perfect" memory, the CPI is just 1. Now, in addition to DDR4, whose latency 26 ns, we also got a 2-level SRAM caches with
  - it's 1st-level one at latency of 0.5ns and can capture 90% of the desired data/instructions.
  - the 2nd-level at latency of 5ns and can capture 60% of the desired data/instructions
     What's the average CPI?

## **Cache simulation**

 The processor has a 32KB, 64B blocked, 4-way L1 cache. Consider the following code:

```
double a[8192], b[8192], c[8192], d[8192], e[8192];
/* a = 0x10000, b = 0x20000, c = 0x30000, d = 0x40000, e = 0x50000 */
for(i = 0; i < 512; i++) {
    e[i] = (a[i] * b[i] + c[i])/d[i];
    //load a[i], b[i], c[i], d[i] and then store to e[i]
}</pre>
```

- What's the total miss rate? How many of the misses are compulsory misses? How many of the misses are conflict misses?
- How can you improve the cache performance of the above code through changing hardware?
- How can you improve the performance without changing hardware?

## Announcement

- Assignment #2 due this Thursday
  - Please start early some programs take very long time to finish
  - As mentioned in the beginning of each assignment server busy cannot be a reason for late submission or extension
  - If you have questions that cannot be address online, please come to office hours
- Midterm next Tuesday
  - You can only take the exam in-person, closed-book, closed-note
  - Please bring your student ID and we will check
  - You may review/focus on the materials/topics covered in lectures
  - You SHOULD review your assignments

# Computer Science & Engineering

203



