

Our dream? Improve the performance of the bouquet of prefetchers



Checking the plot for graph traces for the DPC3 Pros (a possibility of a perfect pairing :))











BINGO performs extremely well on graph traces

IPCP currently doesn't use associativity ideas in its tables

Opportunity to introduce this idea of BINGO into IPCP

Introducing associativity in the table (kinda a cache). Does it work?

```
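int findEmptyWays(uint64_t ip, uint64_t offset, uint8_t cpu)
{
    // Map (ip + offset) to a set of the tracker table.
    int set = hash_index(ip + offset, 8);
    int temp = set * NUM_WAY; // index of the first way in that set
    for (int w = 0; w < NUM_WAY; w++)
    {
        // Prefer a way that holds no valid entry.
        if (trackers_l1[cpu][temp + w].ip_valid == 0)
        {
            return (temp + w);
        }
    }

    // Set is full: evict a random way for now (choose eviction policy).
    return (temp + rand() % NUM_WAY);
}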
```
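To make the "kinda a cache" idea concrete, here is a minimal sketch of how a tag-matching lookup could sit next to `findEmptyWays`. It is not the code from our submission: `AssocTable`, `Entry`, `NUM_SETS`, `NUM_WAYS` and the modulo set mapping are hypothetical names and choices, and LRU is swapped in for the random eviction above purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sizes, chosen only for illustration.
constexpr int NUM_SETS = 16;
constexpr int NUM_WAYS = 4;

struct Entry {
    bool valid = false;
    uint64_t tag = 0;
    uint64_t last_used = 0;  // timestamp used for LRU
};

class AssocTable {
public:
    AssocTable() : entries_(NUM_SETS * NUM_WAYS) {}

    // Returns the flat index of the matching entry, or -1 on a miss.
    int find(uint64_t key) {
        const int base = set_of(key) * NUM_WAYS;
        for (int w = 0; w < NUM_WAYS; ++w) {
            Entry &e = entries_[base + w];
            if (e.valid && e.tag == tag_of(key)) {
                e.last_used = ++clock_;  // refresh recency on a hit
                return base + w;
            }
        }
        return -1;
    }

    // Inserts key, preferring an invalid way, else evicting the LRU way.
    int insert(uint64_t key) {
        const int base = set_of(key) * NUM_WAYS;
        int victim = base;
        for (int w = 0; w < NUM_WAYS; ++w) {
            const Entry &e = entries_[base + w];
            if (!e.valid) { victim = base + w; break; }
            if (e.last_used < entries_[victim].last_used) victim = base + w;
        }
        assert(victim >= base && victim < base + NUM_WAYS);
        entries_[victim].valid = true;
        entries_[victim].tag = tag_of(key);
        entries_[victim].last_used = ++clock_;
        return victim;
    }

private:
    static int set_of(uint64_t key) { return static_cast<int>(key % NUM_SETS); }
    static uint64_t tag_of(uint64_t key) { return key / NUM_SETS; }

    std::vector<Entry> entries_;
    uint64_t clock_ = 0;
};
```

Every lookup now scans `NUM_WAYS` entries and every insert has to pick a victim, which is exactly the extra work the next slide worries about.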

Try 1:
Associative table (kinda a cache). Does not work :(









Maybe the access times and the complicated replacement schemes are adding latency?



Simplification while conserving the core ideas of the BINGO prefetcher can help! (best of both worlds)

## Try 2: Separating the tagging and indexing policies:)
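The slide does not spell out the two policies, so the following is only a sketch of what "separate" can mean: the set index and the tag are computed from two independent keys instead of being two slices of the same hash. `fold`, `locate`, `Location`, `INDEX_BITS` and `TAG_BITS` are hypothetical names, and the XOR-fold hash is an assumption rather than the hash we actually used.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned INDEX_BITS = 6;  // hypothetical: 64 sets
constexpr unsigned TAG_BITS = 9;    // hypothetical tag width

// XOR-fold a key down to `bits` bits.
inline uint64_t fold(uint64_t key, unsigned bits) {
    assert(bits > 0 && bits < 64);
    const uint64_t mask = (uint64_t{1} << bits) - 1;
    uint64_t out = 0;
    while (key != 0) {
        out ^= key & mask;
        key >>= bits;
    }
    return out;
}

struct Location {
    uint64_t set;  // chosen by the indexing policy
    uint64_t tag;  // chosen by the tagging policy
};

// The index depends only on index_key and the tag only on tag_key,
// so the two policies can be tuned independently of each other.
inline Location locate(uint64_t index_key, uint64_t tag_key) {
    return Location{fold(index_key, INDEX_BITS), fold(tag_key, TAG_BITS)};
}
```

For example, `locate(ip + offset, ip ^ addr)` would keep all accesses with the same `ip + offset` in one set while still telling them apart by tag; that particular pairing is an illustration in the spirit of BINGO, not a claim about our final configuration.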











Seems to be a good result

Is this a good scheme? Depends. We need to evaluate it on more traces. Architecture is all about tradeoffs :)



Us after these results on graphs

#### But wait! Let's improve the SAT solver traces. Use a next-line prefetcher in the LLC. (Yay!)
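A next-line prefetcher is about as simple as prefetching gets: on an access to a line, ask for the line right after it. The sketch below only computes the candidate address; the 64-byte line size, the 4 KiB page size and the helper names `next_line_addr` and `same_page` are assumptions for illustration, and issuing the request is left to the simulator.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned LOG2_BLOCK_SIZE = 6;  // 64-byte cache lines (assumed)
constexpr unsigned LOG2_PAGE_SIZE = 12;  // 4 KiB pages (assumed)

// Byte address of the cache line that follows the one containing addr.
inline uint64_t next_line_addr(uint64_t addr) {
    const uint64_t block = addr >> LOG2_BLOCK_SIZE;
    assert(block < (UINT64_MAX >> LOG2_BLOCK_SIZE));  // no wrap-around
    return (block + 1) << LOG2_BLOCK_SIZE;
}

// The LLC sees physical addresses, so a candidate that leaves the
// page of the triggering access should be dropped.
inline bool same_page(uint64_t a, uint64_t b) {
    return (a >> LOG2_PAGE_SIZE) == (b >> LOG2_PAGE_SIZE);
}
```

A caller would compute `pf = next_line_addr(addr)` and only issue the prefetch when `same_page(addr, pf)` holds.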







# This feels weird, is a tradeoff coming? SPEC:(





Huh, we need to limit our losses and match IPCP's performance on SPEC

## THE COMEBACK is always stronger than the setback

# Try 1: Integrating complex stride at L2 cache



(Comeback!)
Modifying the tag using cl\_addr instead:)



Complex stride adds a level of configurability to the prefetcher, so it helps the prefetcher learn a bit more of the access patterns in SPEC.
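For readers who have not met complex stride before, here is a rough sketch of the idea: hash the recent deltas into a signature, and let the signature index a table that remembers which delta tends to come next. The class name `ComplexStride`, the 7-bit signature, the 2-bit-style confidence counter and the threshold are all hypothetical; this is not the table layout of IPCP or of our L2 integration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr unsigned SIG_BITS = 7;  // hypothetical signature width
constexpr uint32_t SIG_MASK = (1u << SIG_BITS) - 1;

struct DeltaEntry {
    int delta = 0;  // delta (in cache lines) last learned for this signature
    int conf = 0;   // saturating confidence counter, 0..3
};

class ComplexStride {
public:
    ComplexStride() : dpt_(1u << SIG_BITS) {}

    // Trains on the delta just observed, then returns the delta predicted
    // to come next, or 0 when the table is not confident.
    int train_and_predict(int delta) {
        assert(delta > -64 && delta < 64);
        // Train the entry selected by the signature seen before this delta.
        DeltaEntry &e = dpt_[sig_];
        if (e.delta == delta) {
            if (e.conf < 3) ++e.conf;
        } else {
            if (e.conf > 0) --e.conf;
            if (e.conf == 0) e.delta = delta;
        }
        // Fold the new delta into the signature.
        sig_ = ((sig_ << 1) ^ (static_cast<uint32_t>(delta) & SIG_MASK)) & SIG_MASK;
        // Predict with the entry selected by the updated signature.
        const DeltaEntry &next = dpt_[sig_];
        return next.conf >= 2 ? next.delta : 0;
    }

private:
    std::vector<DeltaEntry> dpt_;  // delta prediction table
    uint32_t sig_ = 0;
};
```

A plain stride detector gives up on a pattern like +1, +2, +1, +2, while the signature-indexed table can learn it after a few repetitions.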

Using cl\_addr (the cache-line address) makes the tag less specific than the full address does. More accesses then match an existing entry, so coverage goes up and it helps :)
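A tiny sketch of why dropping the block-offset bits makes the tag less specific. The two tag functions, the 9-bit tag width and the XOR with the IP are hypothetical; only `cl_addr = addr >> LOG2_BLOCK_SIZE` reflects the usual definition of a cache-line address.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned LOG2_BLOCK_SIZE = 6;  // 64-byte cache lines (assumed)
constexpr unsigned TAG_BITS = 9;         // hypothetical tag width
constexpr uint64_t TAG_MASK = (uint64_t{1} << TAG_BITS) - 1;

// Tag built from the full byte address: two accesses to the same
// cache line can still end up with different tags.
inline uint64_t tag_from_addr(uint64_t ip, uint64_t addr) {
    return (ip ^ addr) & TAG_MASK;
}

// Tag built from the cache-line address: every byte of a line
// maps to the same tag.
inline uint64_t tag_from_cl_addr(uint64_t ip, uint64_t addr) {
    const uint64_t cl_addr = addr >> LOG2_BLOCK_SIZE;
    return (ip ^ cl_addr) & TAG_MASK;
}
```

With `addr = 0x1000` and `addr = 0x1008` (same line), the first function yields two different tags and the second yields one, so the second access can reuse the entry the first one trained.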

## Integrating MLOP

```
if (type != LOAD)
    return; // assumed: only loads train the prefetcher
uint64_t block_number = addr >> LOG2_BLOCK_SIZE;
/* check prefetch hit */
bool prefetch_hit = false;
if (cache_hit == 1) {
    uint32_t set = get_set(block_number);
    uint32_t way = get_way(block_number, set);
    if (block[set][way].prefetch == 1)
        prefetch_hit = true;
}
/* a miss, or a hit on a prefetched block, counts as a trigger access */
bool trigger_access = false;
if (cache_hit == 0 || prefetch_hit)
    trigger_access = true;
if (trigger_access)
    L1D_PREF::prefetchers[cpu].access(block_number);
/* issue prefetches */
L1D_PREF::prefetchers[cpu].mark(block_number, L1D_PREF::State::ACCESS);
num_prefs += L1D_PREF::prefetchers[cpu].prefetch(ip, addr, metadata, this, block_number);
if (L1D_PREF::DEBUG_LEVEL >= 3) {
    L1D_PREF::prefetchers[cpu].log();
    cerr << "-----" << dec << endl;
}
L1D_PREF::prefetchers[cpu].track(block_number);
```

## Integrating MLOP? Minor improvement



MLOP prefetcher tends to do well on cactus traces

So, incorporating this prefetcher in the L1 cache gives us minor IPC improvements on the cactus traces (SPEC)

The IPC of the other traces is not affected by this modification

Does it work decently enough for servers? Well, kinda



Results!

| Traces | Speed-up |
|--------|----------|
| GRAPH  | 1.072    |
| SAT    | 1.007    |
| SPEC   | 0.999    |
| SERVER | 1.001    |

### Further improvements?





Perhaps scope for further optimization?

Well, we'll leave it to another day :p

#### Bouquet of DPC3 Winners

Ayush Agarwal\*, Sankalan Baidya<sup>†</sup> and Soham Joshi<sup>‡</sup> Department of Computer Science, IIT Bombay Mumbai, Maharashtra, India

Email: \* ayushagarwal@cse.iitb.ac.in, † somekoln@cse.iitb.ac.in , ‡ sohamjoshi@cse.iitb.ac.in

Abstract—This work is built upon IPCP 1.0 by Biswa, which uses a bouquet of prefetchers. We propose a series of specialized improvements for different categories of traces (graphs, SAT solvers, SPEC and server workloads), resulting in a prefetcher that performs on par with or better than IPCP on most traces. The ideas used in this work build upon various DPC3 submissions [1] [2] [3].

Index Terms-Prefetchers, Computer Architecture, DPC3

#### I. Introduction

Hardware prefetching is a technique used by computer processors to fetch data from memory (into the cache) before it's actually needed, reducing the time spent waiting for data to arrive. This is achieved by analyzing memory access patterns and predicting which data is likely to be needed next, and fetching it in advance.
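As a minimal illustration of "analyzing memory access patterns and predicting which data is likely to be needed next", the sketch below tracks a per-instruction stride and predicts the next cache line once the same stride has repeated. `StridePredictor`, its confidence threshold and the use of a hash map in place of a fixed-size hardware table are assumptions made for this example; it is not the prefetcher proposed in this work.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

class StridePredictor {
public:
    // Observes one access (instruction pointer, cache-line address) and
    // returns the predicted next cache-line address, or 0 when there is
    // no confident prediction yet.
    uint64_t observe(uint64_t ip, uint64_t cl_addr) {
        State &s = table_[ip];
        uint64_t prediction = 0;
        if (s.seen) {
            const int64_t stride = static_cast<int64_t>(cl_addr - s.last);
            if (stride != 0 && stride == s.stride) {
                if (s.conf < 3) ++s.conf;  // same stride again: gain confidence
            } else {
                s.conf = 0;                // pattern broke: start over
                s.stride = stride;
            }
            assert(s.conf >= 0 && s.conf <= 3);
            if (s.conf >= 2)
                prediction = cl_addr + static_cast<uint64_t>(s.stride);
        }
        s.last = cl_addr;
        s.seen = true;
        return prediction;
    }

private:
    struct State {
        uint64_t last = 0;   // last cache-line address seen for this IP
        int64_t stride = 0;  // last observed stride, in cache lines
        int conf = 0;        // saturating confidence counter, 0..3
        bool seen = false;
    };
    std::unordered_map<uint64_t, State> table_;
};
```

Feeding it lines 100, 102, 104, 106 from the same instruction produces no prediction for the first three accesses and a prediction of line 108 on the fourth.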



