

## Review

- Last class:
  - Basic cache optimizations
  - Address translation with caches
- Today's class:
  - Advanced cache optimizations
  - Advanced translation optimizations
- Announcements and reminders:
  - The project will be distributed tonight.
  - In-class midterm exam next Wednesday. Please mark your calendar.




## Advanced optimizations of cache performance (§ 2.2)



## 1. Small and Simple Caches to reduce hit time

- Critical timing path on a hit:
  - address the tag memory, then compare tags, then select the data from the matching way
- Lower associativity shortens this path
- Direct-mapped caches can overlap the tag comparison with the transmission of data
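
The hit path above can be sketched in C. This is a minimal illustration, not a description of any particular processor: the geometry (64-byte blocks, 256 sets, 32-bit addresses) and the names `split_address` and `dm_is_hit` are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry, for illustration only: 64-byte blocks, 256 sets. */
enum { OFFSET_BITS = 6, INDEX_BITS = 8, NUM_SETS = 1 << INDEX_BITS };

typedef struct {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
} cache_addr;

/* Split a 32-bit address into the fields used on the hit path. */
cache_addr split_address(uint32_t addr)
{
    cache_addr f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}

/* Tag store of a direct-mapped cache (hypothetical layout). */
typedef struct {
    bool     valid[NUM_SETS];
    uint32_t tag[NUM_SETS];
} dm_tags;

/* Direct-mapped hit check: one tag read and one compare.
   There is no way-select step because each set holds one block. */
bool dm_is_hit(const dm_tags *t, uint32_t addr)
{
    cache_addr f = split_address(addr);
    return t->valid[f.index] && t->tag[f.index] == f.tag;
}
```

With only one candidate block per set, the data can be read out while the single tag compare is still in progress, which is the overlap the last bullet refers to.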






## 2. Way prediction to reduce hit time

- How can we combine the fast hit time of a direct-mapped cache with the lower conflict-miss rate of a 2-way set-associative (SA) cache?
  - Check one way first (at the speed of a direct-mapped cache)
  - On a miss in that way, check the other way; if it hits, call it a pseudo-hit (slow hit)
  - The way prediction is a bit that indicates which half to check first (it changes dynamically)



- May be extended to caches with more than 2 ways
- Saves power (only the predicted way is read on the first probe)
- Drawback: pipelining the CPU is hard if a hit sometimes takes 1 cycle and sometimes 2
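
The fast-hit / slow-hit behaviour can be sketched as a small C model of a 2-way cache with one prediction bit per set. The cache size, the names, and the fill policy on a miss (fill the non-predicted way and predict it next) are assumptions chosen to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

enum { NUM_SETS = 4 };                       /* assumed tiny cache */
enum access_result { FAST_HIT, SLOW_HIT, MISS };

typedef struct {
    bool     valid[2];
    uint32_t tag[2];
    int      predicted_way;                  /* the per-set prediction bit */
} set2;

/* Zero-initialized: all lines invalid, way 0 predicted in every set. */
set2 cache[NUM_SETS];

enum access_result access_block(uint32_t block_addr)
{
    set2 *s = &cache[block_addr % NUM_SETS];
    uint32_t tag = block_addr / NUM_SETS;
    int first = s->predicted_way;
    int other = 1 - first;

    /* First probe: only the predicted way is checked. */
    if (s->valid[first] && s->tag[first] == tag)
        return FAST_HIT;

    /* Second probe: the other way. A hit here is the pseudo-hit. */
    if (s->valid[other] && s->tag[other] == tag) {
        s->predicted_way = other;            /* retrain the predictor */
        return SLOW_HIT;
    }

    /* Miss: fill the non-predicted way and predict it next time
       (an assumed policy, chosen for simplicity). */
    s->valid[other] = true;
    s->tag[other] = tag;
    s->predicted_way = other;
    return MISS;
}
```

Alternating between two blocks that map to the same set yields a slow hit on every switch, which illustrates why the variable hit latency complicates the pipeline.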



## 3. Pipeline cache access to increase bandwidth

- Examples (cache access latency):
  - Pentium: 1 cycle
  - Pentium Pro through Pentium III: 2 cycles
  - Pentium 4 through Core i7: 4 cycles
- Increases the branch misprediction penalty
- Separate the tag compare and data access (in your HW)


## 4. Non-blocking caches to increase bandwidth



- <u>Hit under miss</u> allows data cache to continue to supply cache hits during a miss -- useful only with out-of-order execution.
- <u>Hit under multiple miss</u> or <u>miss under miss</u> may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller (multiple outstanding memory accesses)
- Pentium Pro allows 4 outstanding memory misses





## 5. Multi-bank caches to increase bandwidth

- Individual memory controller for each bank.
- Each bank may have its own address and data lines.
- Banks can serve independent accesses in parallel or speed up sequential accesses.
  - ARM Cortex-A8 supports 1-4 banks for L2
  - Intel i7 supports 4 banks for L1 and 8 banks for L2
- How blocks are interleaved across banks affects performance.

| Block address (Bank 0) | Block address (Bank 1) | Block address (Bank 2) | Block address (Bank 3) |
|------------------------|------------------------|------------------------|------------------------|
| 0                      | 1                      | 2                      | 3                      |
| 4                      | 5                      | 6                      | 7                      |
| 8                      | 9                      | 10                     | 11                     |
| 12                     | 13                     | 14                     | 15                     |

**Figure 2.6** Four-way interleaved cache banks using block addressing. Assuming 64 bytes per block, each of these addresses would be multiplied by 64 to get the byte address.
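
The mapping in Figure 2.6 is sequential interleaving: the bank is the block address modulo the number of banks. A minimal C sketch follows, using the figure's 4 banks and 64-byte blocks; the function names are made up for the example.

```c
#include <stdint.h>

enum { NUM_BANKS = 4, BLOCK_BYTES = 64 };    /* values from Figure 2.6 */

/* Bank that holds the block containing byte_addr. */
unsigned bank_of(uint64_t byte_addr)
{
    uint64_t block_addr = byte_addr / BLOCK_BYTES;
    return (unsigned)(block_addr % NUM_BANKS);
}

/* Position of that block within its bank (the row in the figure). */
uint64_t row_in_bank(uint64_t byte_addr)
{
    uint64_t block_addr = byte_addr / BLOCK_BYTES;
    return block_addr / NUM_BANKS;
}
```

Consecutive blocks land in different banks, so a sequential stream keeps all four banks busy, while a stride of 4 blocks would hit the same bank every time.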






## 6. Critical word first and early restart to reduce miss penalty

- Don't wait for the full block to be loaded before restarting the CPU
  - <u>Early restart</u>: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - <u>Critical word first</u>: request the missed word first from memory and send it to the CPU as soon as it arrives
- Generally beneficial only with long cache lines (blocks)
- If the CPU next wants the following sequential word, which is still being loaded, early restart may not help much
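
A small C sketch of the order in which the words of a block could be delivered under critical word first. The wrap-around order after the missed word is an assumption (a common choice, but not stated on the slide), and the function name is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fill order[0..words_per_block-1] with word indices in delivery order:
   the missed word first, then the following words, wrapping around to
   the start of the block (assumed wrap-around policy). */
void critical_word_first_order(size_t words_per_block,
                               size_t missed_word,
                               size_t order[])
{
    assert(words_per_block > 0 && missed_word < words_per_block);
    for (size_t k = 0; k < words_per_block; k++)
        order[k] = (missed_word + k) % words_per_block;
}
```

For an 8-word block with a miss on word 5, the delivery order is 5, 6, 7, 0, 1, 2, 3, 4, so the CPU can restart after the first transfer instead of the sixth.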



## 7. Merging write buffer to reduce miss penalty

[Figure: write buffer contents with no merging vs. with merging]



- Most useful in write-through caches
- Combines writes of individual words to the same block into one buffer entry
- Writing a block is faster than writing its words individually
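
A merging write buffer can be sketched in C as follows. The sizes (4 entries of 4 eight-byte words) and all names are assumptions for illustration; a real buffer would also drain entries to memory, which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

enum { ENTRIES = 4, WORDS_PER_ENTRY = 4, WORD_BYTES = 8 };  /* assumed sizes */

typedef struct {
    bool     used;
    uint64_t block_base;                 /* address of word 0 of the entry */
    bool     valid[WORDS_PER_ENTRY];     /* one valid bit per word */
    uint64_t data[WORDS_PER_ENTRY];
} wb_entry;

typedef struct { wb_entry e[ENTRIES]; } write_buffer;

/* Buffer one word write. Returns false if the buffer is full,
   in which case the CPU would have to stall. */
bool wb_write(write_buffer *wb, uint64_t addr, uint64_t value)
{
    uint64_t span = (uint64_t)WORDS_PER_ENTRY * WORD_BYTES;
    uint64_t base = addr - addr % span;
    unsigned word = (unsigned)((addr % span) / WORD_BYTES);

    /* Merge: reuse an entry that already covers this block. */
    for (int i = 0; i < ENTRIES; i++) {
        if (wb->e[i].used && wb->e[i].block_base == base) {
            wb->e[i].valid[word] = true;
            wb->e[i].data[word] = value;
            return true;
        }
    }
    /* No match: allocate a free entry. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!wb->e[i].used) {
            wb_entry fresh = { .used = true, .block_base = base };
            fresh.valid[word] = true;
            fresh.data[word] = value;
            wb->e[i] = fresh;
            return true;
        }
    }
    return false;
}

/* Number of occupied entries. */
int wb_entries_used(const write_buffer *wb)
{
    int n = 0;
    for (int i = 0; i < ENTRIES; i++)
        n += wb->e[i].used ? 1 : 0;
    return n;
}
```

Four writes to consecutive words of one block occupy a single entry here, where a buffer without merging would use four.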


## 8. Compiler optimizations to reduce miss rate



#### Instructions

- Reorder procedures in memory to reduce conflict misses
- Align basic blocks with cache blocks (lines)
- Use profiling to find conflicts

#### Data

- Merging arrays: improve spatial locality by using a single array of compound elements instead of 2 arrays
- Loop fusion: combine 2 independent loops that have the same loop bounds and access some of the same variables
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory
- Blocking (tiling): improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows



## Merging Arrays Example:

```
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
```

Merging reduces conflicts between val and key and improves spatial locality, since val and key for the same index now sit next to each other in memory.


## Loop Fusion Example:



```
/* Before: two separate loop nests */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];

for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After: one fused loop nest */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
```

After fusion: one miss per access to a and c, vs. two misses per access before (one in each loop nest). Fusion improves temporal locality.
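
The other two data optimizations listed earlier, loop interchange and blocking, can be sketched in the same style. The array size `N`, the tile size `B` (which must divide `N`), and the function names are assumptions for this example; the blocked routine is the usual tiled matrix multiply, not code from the lecture.

```c
#include <stddef.h>

enum { N = 64, B = 16 };   /* assumed sizes; B must divide N */

/* Loop interchange, before: the inner loop walks down a column,
   so consecutive accesses are N elements apart in memory. */
void scale_column_order(double x[N][N])
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            x[i][j] = 2 * x[i][j];
}

/* Loop interchange, after: the inner loop walks along a row,
   matching C's row-major layout. */
void scale_row_order(double x[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            x[i][j] = 2 * x[i][j];
}

/* Blocking: c += a * b computed in B x B tiles so that the tile of b
   is reused while it is still in the cache. The caller zeroes c first. */
void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
{
    for (size_t jj = 0; jj < N; jj += B)
        for (size_t kk = 0; kk < N; kk += B)
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + B; j++) {
                    double r = 0;
                    for (size_t k = kk; k < kk + B; k++)
                        r += a[i][k] * b[k][j];
                    c[i][j] += r;
                }
}
```

Both versions of the scaling loop compute the same result; only the order of memory accesses differs. In the blocked multiply, `B` would be chosen so that the tiles in use fit in the cache.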







## 9. Hardware prefetching to reduce miss rate and penalty



- Instruction prefetching:
  - Can fetch 2 (or more) blocks on a miss
  - The extra block is placed in a "stream buffer"
  - On a miss, check the stream buffer; if the block is found, move it to the cache and prefetch the next block
- Data prefetching:
  - May have multiple stream buffers beyond the cache, each prefetching at a different address
- Relies on extra memory bandwidth that can be used without penalty.
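
The stream-buffer behaviour on a miss can be sketched with a one-entry buffer in C. This is a deliberately simplified model: real stream buffers hold several blocks, and the names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* One-entry stream buffer holding the block address of a prefetched block. */
typedef struct {
    bool     valid;
    uint64_t block;
} stream_buffer;

/* Called on a cache miss for `block`. Returns true if the stream buffer
   supplies the block (it would then be moved into the cache). In both
   cases the next sequential block is prefetched into the buffer. */
bool on_cache_miss(stream_buffer *sb, uint64_t block)
{
    bool found = sb->valid && sb->block == block;
    sb->valid = true;
    sb->block = block + 1;       /* prefetch the next sequential block */
    return found;
}
```

On a sequential instruction stream, only the first miss goes all the way to memory in this model; each later miss is caught by the buffer until the stream is broken, for example by a taken branch.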



## 10. Compiler prefetching to reduce miss rate and penalty

- Data prefetch
  - Register prefetch: load data into a register
  - Cache prefetch: load data into the cache
  - Special prefetch instructions should not cause premature page faults.
  - Issuing prefetch instructions takes time
  - Is the cost of issuing prefetches < the savings from reduced misses?
- Works only if prefetching can be overlapped with execution.
- Example: Assume that arrays a[] and b[] are aligned at block boundaries and that the cache block size is 4 words.

```
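for (i = 0; i < 100; i++) {
    /* One prefetch per 4-word block, issued one block ahead.
       prefetch() stands for a non-faulting cache prefetch instruction. */
    if (i % 4 == 0)
        prefetch(b[i+4], a[i+4]);
    b[i] = c * a[i] + d * a[i+1];
}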
```

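If the loop body takes 20 cycles to execute and the cache miss penalty is 80 cycles, a prefetch issued one block ahead (4 iterations × 20 cycles = 80 cycles) completes just in time, so after the first few iterations the data will be in the cache when needed.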
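
The example above can be written as compilable C with the `__builtin_prefetch` builtin, assuming GCC or Clang. The function name, the `double` element type, and the bounds guard are additions for this sketch; the 4-word block size is the slide's assumption, not the line size of a real machine.

```c
#include <stddef.h>

enum { LEN = 100, WORDS_PER_BLOCK = 4 };   /* values from the slide example */

/* b[i] = c * a[i] + d * a[i+1] for i in [0, LEN).
   a must have at least LEN + 1 elements because the loop reads a[i+1]. */
void scaled_sum(double *b, const double *a, double c, double d)
{
    for (size_t i = 0; i < LEN; i++) {
        /* Once per block, prefetch the next block of each array.
           The guard avoids forming addresses past the end of b. */
        if (i % WORDS_PER_BLOCK == 0 && i + WORDS_PER_BLOCK < LEN) {
            __builtin_prefetch(&a[i + WORDS_PER_BLOCK], 0);  /* for reading */
            __builtin_prefetch(&b[i + WORDS_PER_BLOCK], 1);  /* for writing */
        }
        b[i] = c * a[i] + d * a[i + 1];
    }
}
```

The builtin is only a hint: the compiler may emit no instruction on targets without prefetch support, and the result of the computation is the same either way.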


## 11. Using victim caches

- A buffer to place data just evicted from cache
- A small number of fully associative entries
- Accessed in parallel with cache (no increase in hit time)
- On a hit in the VC, swap blocks in VC and cache



- When used with direct-mapped caches, it has the effect of adding associativity to the most recently used cache blocks
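
A direct-mapped cache with a victim cache can be sketched in C as follows. The sizes, the FIFO replacement in the victim cache, and all names are assumptions for illustration; the lookup is written sequentially even though the hardware probes both structures in parallel.

```c
#include <stdbool.h>
#include <stdint.h>

enum { SETS = 4, VICTIMS = 2 };              /* assumed tiny sizes */
enum vc_result { VC_CACHE_HIT, VC_VICTIM_HIT, VC_MISS };

typedef struct { bool valid; uint64_t block; } line;

typedef struct {
    line     main_cache[SETS];   /* direct mapped: index = block % SETS */
    line     victim[VICTIMS];    /* fully associative */
    unsigned next_victim;        /* FIFO replacement pointer (assumed) */
} dm_with_victim;

enum vc_result vc_access(dm_with_victim *c, uint64_t block)
{
    line *slot = &c->main_cache[block % SETS];
    if (slot->valid && slot->block == block)
        return VC_CACHE_HIT;

    /* Search every victim entry (fully associative). */
    for (int i = 0; i < VICTIMS; i++) {
        if (c->victim[i].valid && c->victim[i].block == block) {
            line tmp = *slot;            /* swap the two blocks */
            *slot = c->victim[i];
            c->victim[i] = tmp;
            return VC_VICTIM_HIT;
        }
    }

    /* Miss in both: the evicted block moves to the victim cache. */
    if (slot->valid) {
        c->victim[c->next_victim] = *slot;
        c->next_victim = (c->next_victim + 1) % VICTIMS;
    }
    slot->valid = true;
    slot->block = block;
    return VC_MISS;
}
```

Two blocks that conflict in the direct-mapped cache (blocks 0 and 4 with 4 sets) keep swapping between the cache and the victim cache instead of going back to memory.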



## 12. Computing with Near Data





- a[i] in Sy can be recomputed by Sx.
- Should we perform the recomputation?
  - Decide based on a cost evaluation.
  - Recompute only if the cost is lower.




Xulong Tang, Mahmut Taylan Kandemir, Hui Zhao, Myoungsoo Jung, Mustafa Karakoy. "Computing with Near Data." SIGMETRICS 2019



- Replacement policy
  - Pseudo LRU
  - RL based
  - Prediction
  - Hot-cold cache
- Multi-directional cache
  - Sumitha George, Minli Julie Liao, Huaipan Jiang, Jagadish B. Kotra, Mahmut T. Kandemir, Jack Sampson, Vijaykrishnan Narayanan: MDACache: Caching for Multi-Dimensional-Access Memories. MICRO 2018: 841-854
  - Minli Julie Liao, Jack Sampson: D-SOAP: Dynamic Spatial Orientation Affinity Prediction for Caching in Multi-Orientation Memory Systems. MICRO 2020: 581-595