## Computer Architecture Lab 4 Report

## Fabian Wüthrich

## Dezember 2020

The L2 prefetcher for task 3 is based on the Pangloss prefetcher[1] presented at the Third Data Prefetching Championship (DPC3)[2].

Pangloss is a Markov prefetcher[3] that uses address deltas to predict complex access patterns. A delta is the difference between two consecutive addresses. Given the initial address and a series of deltas, a prefetcher is able to reconstruct the address stream. The deltas are limited by the page boundary as page allocation is not sequential e.g. for security reasons. Thus, the Markov model tracks deltas per page instead of globally, which enables an efficient representation of the model.

The prefetcher consists of two set-associative caches:

- **Delta Cache** The delta cache is indexed by the current delta and a set holds the most frequent next deltas. The L2 cache works with cache line addresses (64-bit) so in a 4KB page we have 64 possible offsets i.e. all deltas can be represented by the range (-64, +63] or by 7 bits. Additionally, each cache block has a counter which is incremented on each hit and is used by the replacement policy and to calculate probabilities. The cache uses a Least Frequently Used (LFU) policy.
- Page Cache The page cache keeps track of the last delta per page to prevent wrong predictions cause by interleaved pages. The cache is indexed by the page address and each cache blocks stores a tag, the previous delta and the previous page offset (used to calculate the current delta). The replacement policy is Not-Recently Used (NRU).

On each L2 lookup, the prefetcher performs the following steps:

- 1. Retrieve the previous delta  $(D_{prev})$  and page offset  $(O_{prev})$  with the current page address from the page cache.
- 2. On a page cache hit, update the delta cache and the page cache with the current delta  $(O_{curr} O_{prev})$ .
- 3. On a page cache miss, store the current delta and offset in the page cache (don't update the delta cache to prevent wrong predictions)
- 4. Traverse the Markov chain (delta cache) from the current delta to predict next addresses, until we have as many prefetches as the prefetch degree.

The last step can be expensive, if the prefetcher has to find the best path among all possible paths. Therefore, the paper proposes a simple heuristic: prefetch

the address occurring from the child deltas with a probability > 1/3 and proceed with the highest probable delta for the next iteration, until we count as many prefetches as the prefetch degree. A prefetch is only issued if the address lies in the same page.

Figure 1 compares the prefetchers of each task. The IPC is normalized to a system without a prefetcher. Pangloss (markov in the plot) outperforms all prefetchers, with an average speedup of 60% over no prefetching, 12% speedup over global history buffer prefetching[4] and 31% speedup over feedback directed prefetching[5]. Further performance improvements might be possible by tuning the cache parameters to the workloads.



Figure 1: Prefetcher Performance

## References

- [1] Philippos Papaphilippou, Paul H. J. Kelly, and Wayne Luk. *Pangloss: a novel Markov chain prefetcher*. 2019. arXiv: 1906.00877 [cs.AR].
- [2] Third Data Prefetching Championship (DPC3). URL: https://dpc3.compas.cs.stonybrook.edu/. (accessed: 22.01.2021).
- [3] Doug Joseph and Dirk Grunwald. "Prefetching Using Markov Predictors". In: Proceedings of the 24th Annual International Symposium on Computer Architecture. ISCA '97. Denver, Colorado, USA: Association for Computing Machinery, 1997, pp. 252–263. ISBN: 0897919017. DOI: 10.1145/264107. 264207. URL: https://doi.org/10.1145/264107.264207.
- [4] Kyle J. Nesbit and James E. Smith. "Data Cache Prefetching Using a Global History Buffer". In: Proceedings of the 10th International Symposium on High Performance Computer Architecture. HPCA '04. USA: IEEE Computer Society, 2004, p. 96. ISBN: 0769520537. DOI: 10.1109/HPCA. 2004.10030. URL: https://doi.org/10.1109/HPCA.2004.10030.
- [5] Santhosh Srinath et al. "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers". In: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture. HPCA '07. USA: IEEE Computer Society, 2007, pp. 63-74. ISBN: 1424408040. DOI: 10.1109/HPCA.2007.346185. URL: https://doi.org/10.1109/HPCA.2007.346185.