# Virtual Memory: Just an Illusion

Hung-Wei Tseng

#### Recap: the "latency" gap between CPU and DRAM



Recap: Memory Hierarchy **Processor** fastest 1 **Processor** < 1ns Core fastest L1\$ Registers 32 or L2\$ SRAM\$ a few ns L3\$ larger GBs **DRAM** tens of ns TBs Storage us/ms larger

#### Tiling/Blocking Algorithm for Transposed Matrix Multiplications



#### Takeaways: Optimizing cache performance through hardware

- There is no optimal cache configurations trade-offs are everywhere
  - Increasing C (+): capacity misses; (-): cost, access time, power
  - Increasing A (+): conflict misses; (-): access time, power
  - Increasing B (+): compulsory misses; (-): miss penalty
- Adding a small buffer alongside the L1 cache can
  - Virtually add an associative set to frequently used data structures
  - Prefetched blocks won't cause conflict misses
- Software Optimization
  - Data layout capacity miss, conflict miss, compulsory miss
  - Loop interchange conflict/capacity miss
  - Loop fission conflict miss when \$ has limited way associativity
  - Loop fusion capacity miss when \$ has enough way associativity
  - Blocking/tiling capacity miss, conflict miss
  - Matrix transpose (a technique changes layout) conflict misses
  - Using registers whenever possible reduce memory accesses!
- Software-control, architectural-supported approach
  - Prefetching instructions
  - · Adding a tag-less, programmable small buffer alongside the L1 cache can reduce power consumption

## Let's dig into this code

```
int main(int argc, char *argv[])
    int i,j;
    double **a;
    double sum=0, average;
    int dim=32768;
    if(argc < 2)
        fprintf(stderr, "Usage: %s dimension\n", argv[0]);
        exit(1);
    dim = atoi(argv[1]);
    a = (double **)malloc(sizeof(double *)*dim);
    for(i = 0 ; i < dim; i++)
        a[i] = (double *)malloc(sizeof(double)*dim);
    for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            a[i][j] = rand();
    for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            sum+=a[i][i];
    average = sum/(dim*dim);
    fprintf(stderr, "average: %lf\n", average);
    for(i = 0 ; i < dim; i++)
        free(a[i]);
    free(a);
    return 0;
```



#### What will happen?

- If we execute the code on the right-hand side code on a machine with only 8 GB of physical memory installed and the dim is 33000 (requires 33000\*33000\*8 bytes ~ 8.12 GB memory at least), What will happen?
  - A. The program will crash in one of the malloc function call
  - B. The program will crash due to a "segmentation fault" that caused by accessing NULL pointer
  - C. The program will be killed automatically by the OS as it uses more than installed physical main memory
  - D. The program will finish without any issue

```
int main(int argc, char *argv[])
   int i,j;
    double **a;
   double sum=0, average;
   int dim=32768;
   if(argc < 2)
        fprintf(stderr, "Usage: %s dimension\n",argv[0]);
        exit(1);
   dim = atoi(argv[1]);
   a = (double **)malloc(sizeof(double *)*dim);
   for(i = 0 ; i < dim; i++)
        a[i] = (double *)malloc(sizeof(double)*dim);
   for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            a[i][j] = rand();
   for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            sum+=a[i][i];
   average = sum/(dim*dim);
   fprintf(stderr, "average: %lf\n", average);
   for(i = 0 ; i < dim; i++)
       free(a[i]);
   free(a);
   return 0;
```

#### What will happen?

- If we execute the code on the right-hand side code on a machine with only 8 GB of physical memory installed and the dim is 33000 (requires 33000\*33000\*8 bytes ~ 8.12 GB memory at least), What will happen?
  - A. The program will crash in one of the malloc function call
  - B. The program will crash due to a "segmentation fault" that caused by accessing NULL pointer
  - C. The program will be killed automatically by the OS as it uses more than installed physical main memory
  - D. The program will finish without any issue

```
int main(int argc, char *argv[])
    int i,j;
    double **a;
    double sum=0, average;
    int dim=32768;
   if(argc < 2)
        fprintf(stderr, "Usage: %s dimension\n",argv[0]);
        exit(1);
    dim = atoi(argv[1]);
    a = (double **)malloc(sizeof(double *)*dim);
    for(i = 0 ; i < dim; i++)
        a[i] = (double *)malloc(sizeof(double)*dim);
   for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            a[i][j] = rand();
   for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            sum+=a[i][i];
    average = sum/(dim*dim);
    fprintf(stderr, "average: %lf\n", average);
    for(i = 0 ; i < dim; i++)
        free(a[i]);
    free(a);
    return 0;
```

#### What will happen?

- If we execute the code on the right-hand side code on a machine with only 8 GB of physical memory installed and the dim is 33000 (requires 33000\*33000\*8 bytes ~ 8.12 GB memory at least), What will happen?
  - A. The program will crash in one of the malloc function call
  - B. The program will crash due to a "segmentation fault" that caused by accessing NULL pointer
  - C. The program will be killed automatically by the OS as it uses more than installed physical main memory
  - D. The program will finish without any issue

```
int main(int argc, char *argv[])
    int i,j;
    double **a;
    double sum=0, average;
    int dim=32768;
   if(argc < 2)
        fprintf(stderr, "Usage: %s dimension\n",argv[0]);
        exit(1);
    dim = atoi(argv[1]);
    a = (double **)malloc(sizeof(double *)*dim);
    for(i = 0 ; i < dim; i++)
        a[i] = (double *)malloc(sizeof(double)*dim);
    for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            a[i][i] = rand();
   for(i = 0 ; i < dim; i++)
        for(j = 0 ; j < dim; j++)
            sum+=a[i][i];
    average = sum/(dim*dim);
    fprintf(stderr, "average: %lf\n", average);
    for(i = 0 ; i < dim; i++)
        free(a[i]);
    free(a);
    return 0;
```

## Let's dig into this code

```
#define GNU SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <time.h>
double a;
int main(int argc, char *argv[])
    int i, number_of_total_processes=4;
    number of total processes = atoi(argv[1]);
    // Create processes
    for(i = 0; i< number_of_total_processes-1 && fork(); i++);</pre>
    // Generate rand seed
    srand((int)time(NULL)+(int)getpid());
    a = rand();
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address of a is %p\n",getpid(), a, &a);
    sleep(10);
    fprintf(stderr, "\nProcess %d. Value of a is %lf and address of a is %p\n",getpid(), a, &a);
    return 0;
```



## Consider the following code ...

- Consider the case when we run 4 instances of the given program at the same time on modern machines, which pair of statements is correct?
  - ① The printed "address of a" is the same for every #include <stdint.h>
    running instances

    double a = 0;
  - ② The printed "address of a" is different for each instance
  - ③ All running instances will print the same value of a
  - Some instances will print the same value of a
  - ⑤ Each instance will print a different value of a
  - A. (1) & (3)
  - B. (1) & (4)
  - C. (1) & (5)
  - D. (2) & (3)
  - E. (2) & (4)

```
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <time.h>
double a = 0;
int main(int argc, char *argv[])
    uint64_t i, number_of_total_processes=4, p;
    if(argc < 2)
        fprintf(stderr, "Usage: %s number_of_processes\n",argv[0]);
        exit(1);
    number_of_total_processes = atoi(argv[1]);
    for(i = 0; i< number_of_total_processes-1 && fork(); i++);</pre>
    srand((int)time(NULL)+(int)getpid());
    a = rand();
    fprintf(stderr, "\nProcess %d: Value of a is %lf and address of a is
%p\n",(int)getpid(), a, &a);
    sleep(10);
    fprintf(stderr, "\n10 Seconds Later -- Process %d: Value of a is %lf
and address of a is %p\n",(int)getpid(), a, &a);
    return 0;
```

## Consider the following code ...

- Consider the case when we run 4 instances of the given program at the same time on modern machines, which pair of statements is correct?
  - ① The printed "address of a" is the same for every running instances
  - ② The printed "address of a" is different for each instance
  - ③ All running instances will print the same value of a
  - Some instances will print the same value of a
  - ⑤ Each instance will print a different value of a
  - A. (1) & (3)
  - B. (1) & (4)
  - C. (1) & (5)
  - D. (2) & (3)
  - E. (2) & (4)

```
#define GNU SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sched.h>
#include <sys/syscall.h>
#include <time.h>
#include <stdint.h>
double a = 0;
int main(int argc, char *argv[])
    uint64_t i, number_of_total_processes=4, p;
    if(argc < 2)
        fprintf(stderr, "Usage: %s number of processes\n", argv[0]);
        exit(1);
    number of total processes = atoi(argv[1]);
    for(i = 0; i < number of total processes-1 && fork(); i++);</pre>
    srand((int)time(NULL)+(int)getpid());
    a = rand();
    fprintf(stderr, "\nProcess %d: Value of a is %lf and address of a is
%p\n",(int)getpid(), a, &a);
    sleep(10);
    fprintf(stderr, "\n10 Seconds Later -- Process %d: Value of a is %lf
and address of a is %p\n",(int)getpid(), a, &a);
    return 0;
```

#### **Outline**

- Virtual memory
- Architectural support for virtual memory

# Virtual Memory

#### **Demo revisited**



fprintf(stderr, "\nProcess %d: Value of a is %lf and address of a is %p\n",(int)getpid(), a, &a);

fprintf(stderr, "\n10 Seconds Later -- Process %d: Value of a is %lf and address of a is %p\n",(int)getpid(), a, &a);

a = rand();

return 0;

## Why Virtual memory?

- Allowing multiple applications to share physical main memory
  - Memory protection/isolation among programs/processes is automatically achieved
- Allowing applications to work even the installed physical memory or available physical memory is smaller than the working set of the application
  - Programmer does not need to worry about the physical memory capacity of different machines — make compiled program compatible
  - Multiple programs can work concurrently even through their total memory demand is larger than the installed physical memory

#### This approach is Virtual memory called demand paging · · page + swapping PC for A 3f00bb27 **Program A** 509cbd23 00005d24 00c2e800 0f00bb27 00c2f800 00c2e800 0000bd24 509cbd23 80000008 80000008 > 0000008 PC for A 00c2f000 00005d24 2ca422a0 00c30000 00c2f000 $\Delta v \cap$ 0000008 130020e4 0000bd24 80000008 0000008 PC for B 3f00bb27 00003d24 00c2f800 2ca422a0 00c2e800 2ca422a0 530chu23 0000008 2ca4e2b3 130020e4 8000000 00695d24 13002024 00c30000 00003d24 90c2e800 00003d24 00c2f000 load 0000 Ld24 AYZEN 8000000 2ca4e2b3 8000000 2ca4e2b3 8000000 2ca42230 PC for B 00c2f000 (intel) 130020e4 2ca422a0 0f00bb2/ 00000008 Virtual Mem 00003d24 509cbd23 130020e4 Core™ i7 00c2f800 2ca4e2b3 00005d24 00003d24 00000008 0000bd24 2ca4e2b3 30c2e800 load 00c30000 80000008 **Physical** 8000000 **Processor** 00c2f000 00000008 Memory 90c2†800 **Program B** load 80000008 00c2e800 0f00bb27 0f00bb27 00c30000 509cbd23 8000000 80000008 509cbd23 264-1 00005d24 00c2f000 00005d24 00000008 00c2f800 0000bd24 0000bd24 2ca422a0 130020e4 8000000 00003d24 00c30000 Swap 2ca4e2b3 8000000 264-1 26

#### Virtual memory

- An abstraction of memory space available for programs/ software/programmer
- Programs execute using virtual memory address
- The operating system and hardware work together to handle the mapping between virtual memory addresses and real/ physical memory addresses
- Virtual memory organizes memory locations into "pages"

## Demand paging — another angle



#### Partition memory addresses into fix-sized chunks





\$

## Demand paging + Swapping

- Paging: partition virtual/physical memory spaces into fix-sized pages
- Page fault: when the requested page cannot be found in the physical memory — created the demand of allocating pages!
- Demand paging: Allocate a physical memory page for a virtual memory page when the virtual page is needed (page fault occurs)
- Swapping: use secondary storage to store pages not in DRAM

## Demand paging + Swapping v.s. caching

- Treating physical main memory as a "cache" of virtual memory
- The block size is the "page size"
- The page table is the "tag array"
- It's a "fully-associate" cache a virtual page can go anywhere in the physical main memory
- The storage serves as the lower level memory hierarchy for physical main memory

#### **Takeaways: Virtual Memory**

Virtual memory is essential to support the success of software industry

#### **Address translation**

- Processor receives virtual addresses from the running code, main memory uses physical memory addresses
- Virtual address space is organized into "pages"
- The system references the page table to translate addresses
  - Each process has its own page table
  - The page table content is maintained by OS





## Size of page table

- Assume that we have 64-bit virtual address space, each page is 4KB, each page table entry is 8 Bytes, what magnitude in size is the page table for a process?
  - A. MB 2<sup>20</sup> Bytes
  - B. GB 2<sup>30</sup> Bytes
  - C. TB 2<sup>40</sup> Bytes
  - D. PB 2<sup>50</sup> Bytes
  - E. EB 2<sup>60</sup> Bytes



## Size of page table

 Assume that we have 64-bit virtual address space, each page is 4KB, each page table entry is 8 Bytes, what magnitude in size is the page table for a process?

$$\frac{2^{64} \ Bytes}{4 \ KB} \times 8 \ Bytes = 2^{55} \ Bytes = 32 \ PB$$

If you still don't know why — you need to take CS202

#### Conventional page table

0xffffffffffffff

#### **Virtual Address Space**

- must be consecutive in the physical memory
- need a big segment! difficult to find a spot
- simply too big to fit in memory if address space is large!

$$\frac{2^{64} B}{2^{12} B}$$
 page table entries/leaf nodes -

## Do we really need a large table?



0x0

**OXFFFFFFFFFFFFF** 



Break up entries into pages!

Each of these occupies exactly a page

$$-\frac{2^{12} B}{2^{3} P} = 2^{9}$$
 PTEs per node

**Question:** 

These nodes are spread out, how to locate them in the memory?

Otherwise, you always need to find more than one consecutive pages — difficult!

These are nodes are not presented if they are not referenced at all — save space

Allocate page table entry nodes "on demand"

#### **B-tree**



https://en.wikipedia.org/wiki/B-tree#/media/File:B-tree.svg

## **Hierarchical Page Table**

0x0 0xffffffffffffff Code **Data** Heap **Virtual Address Space** Stack  $\lceil log_{2^9} \frac{2^{64} B}{2^{12} B} \rceil = \lceil log_{2^9} 2^{52} \rceil = 6 \text{ levels}$ These are nodes are not presented as they are not referenced at all.  $\frac{2}{2^{12}}$  page table entries/leaf nodes (worst case)

#### Address translation in x86-64



#### **Takeaways: Virtual Memory**

- Virtual memory is essential to support the success of software industry
- To reduce the page table size, we introduced hierarchical page table data structure



#### When we have virtual memory...

- If an x86 processor supports virtual memory through the basic format of the page table as shown in the previous slide, how many memory accesses can a mov instruction that access data memory once incur?
  - A. 2
  - B. 4
  - C. 6
  - D. 8
  - E. 10



#### Address translation in x86-64



## When we have virtual memory...

 If an x86 processor supports virtual memory through the basic format of the page table as shown in the previous slide, how many memory accesses can a mov instruction that access data memory once incur?

A. 2

B. 4

C. 6

D. 8

E. 10

# Avoiding the address translation overhead

#### **TLB: Translation Look-aside Buffer**



- TLB a small SRAM stores frequently used page table entries
- Good A lot faster than having everything going to the DRAM
- Bad Still on the critical path

#### TLB + Virtual cache

- L1 \$ accepts virtual address you don't need to translate
- Good you can access both TLB and L1-\$ at the same time and physical address is only needed if L1-\$ missed d/sd
- Bad it doesn't work in practice
  - Many applications have the same virtual address but should be pointing different physical addresses
  - An application can have "aliasing virtual addresses" pointing to the same physical address



# Virtually indexed, physically tagged cache

- Can we find physical address directly in the virtual address
  — Not everything — but the
- page offset isn't changing!
- Can we indexing the cache using the "partial physical address"?
  - Yes Just make set index + block set to be exactly the page offset



# Virtually indexed, physically tagged cache



# Virtually indexed, physically tagged cache

If page size is 4KB —

$$lg(B) + lg(S) = lg(4096) = 12$$

$$C = ABS$$

$$C = A \times 2^{12}$$

$$if A = 1$$

$$C = 4KB$$



# **Takeaways: Virtual Memory**

- Virtual memory is essential to support the success of software industry
- To reduce the page table size, we introduced hierarchical page table data structure
- Virtually-indexed, physically tagged cache provides the efficiency for accessing cache and TLB together — but limited cache design

# Translation Caching: Skip, Don't Walk (the Page Table)

Thomas W. Barr, Alan L. Cox, Scott Rixner

## Why should we care about this paper?

- TLB miss is expensive
  - You have to walk through multiple nodes in the hierarchical page table
  - Each node is a memory access 100 ns
- Modern processors use memory management units (MMUs)
  - MMUs have caches, but not optimized for the timing critical TLB miss
  - Page table caches
  - Translational caches

## Page table caches



## Page table caches

- PTC caches the addresses of "page table nodes"
- PTC uses the physical address of page table nodes as the index
  - Unified page table cache (UPTC)
  - Split page table cache (SPTC)
    - Each page level get a private cache location

#### **Translation cache**



#### **Translation caches**

- Indexed by the prefix of the requesting virtual address
  - Split translational cache (STC)
  - Unified translational cache (UTC)
  - Translational-path Cache (TPC)
- Pros:
  - Allowing each level lookup to perform independently, in parallel
- Cons:
  - Less space efficient

# **Takeaways: Virtual Memory**

- Virtual memory is essential to support the success of software industry
- To reduce the page table size, we introduced hierarchical page table data structure
- Virtually-indexed, physically tagged cache provides the efficiency for accessing cache and TLB together — but limited cache design
- Page table caches & translation caching can help reducing the TLB miss penalty

# Please consider attending...



# They're recruiting!

# CEN TECH TALK

Join Us to Learn How to Shape Tech Careers: Insights and Guidance from Industry Experts

#### **Ethan Law**

Senior Firmware Engineer at Foresight Sports

#### **Anthony Mucciolo**

Systems Architect at Foresight Sports





1:00 pm - 2:00 pm

Food will be provided!





#### Announcement

- Assignment #3 due this Thursday
- Programming Assignment #2 due 11/7
- Midterm next Tuesday
  - 80 minutes, in-person only
  - Closed book, closed note, no laptop, no mobile phones (including the calculator app)
  - You may use a calculator
  - Will release sample midterm questions on Thursday

# Computer Science & Engineering

203



