# Improving Performance in Structured GPGPU Workloads via Specialized Thread Schedules

# Naraenda Prasetya

# 2022

# Abstract

[TODO] Write this

# Contents

- 1 Introduction
  - 1.1 Motivation
  - 1.2 Contributions
  - 1.3 Outline
- 2 Background
  - 2.1 GPU Architecture
    - 2.1.1 Hardware
    - 2.1.2 Software
    - 2.1.3 GPU Caches
    - 2.1.4 Performance of Access Patterns
  - 2.2 CPU vs GPU based multi-threading
  - 2.3 Accelerate
  - 2.4 Commonly Applicable Cache Improvements
    - 2.4.1 Optimizations of Blocked Algorithms
    - 2.4.2 CTA Clustering
    - 2.4.3 PAVER
- 3 Analysis of Existing Approaches
  - 3.1 Spatial Temporal Analysis
  - 3.2 Stencil Operations
    - 3.2.1 Naive
    - 3.2.2 Tiling
  - 3.3 Matrix Multiplication
    - 3.3.1 Naive
    - 3.3.2 Tiling
- 4 Column Based Iteration
  - 4.1 Zigzagging variation
  - 4.2 Implementation in Accelerate
  - 4.3 Stencil Operation
  - 4.4 Matrix Multiplication
- 5 Results
  - 5.1 Execution Times
  - 5.2 Cache Hitrate
  - 5.3 Hardware Utilization
- 6 Conclusion

#### [TODO] Make sure equations are consistent:

- $L$ Cache line size (elements)
- $M$ Cache size (elements)
- $M_l$ Cache size (number of cache lines)
- $F$ Memory fetches
- $I_w$, $I_h$ input size (number of tasks to produce the output)
- $S_w$, $S_h$ stencil size

# 1 Introduction

Graphics Processing Units (GPUs) are specialized hardware that can manipulate large amounts of data in a highly parallel manner. The hardware's main purpose is processing graphical data; however, GPUs are now widely used for general-purpose computing on the GPU (GPGPU), where we exploit the GPU's many cores to run computations in parallel that would normally run on CPUs.

High-performance GPU computing has become very accessible via high-level frameworks, which provide a limited set of operators, such as stencil, permute, fold, and scan, to manipulate data on a GPU [1]. However, in most cases the compiled code accesses memory less efficiently than manually tuned code, with lower cache hit rates and more uncoalesced accesses. Threads can be scheduled in such a way as to minimize these inefficiencies: a smarter scheduler can achieve better cache and memory utilization [2]. We propose a simple schedule that improves performance by leveraging the structure of memory accesses in the high-level GPGPU framework Accelerate [1].

# 1.1 Motivation

#### 1.2 Contributions

This thesis shows that the cache utilization of tiled implementations of stencil and matrix multiplication operations can be improved by rescheduling via index mapping. While the main focus lies on improving performance on the GPU, the techniques presented can also be applied on a CPU.

The main contributions of this thesis are:

- An analysis of the theoretical cache utilization and efficiency of both linear and multithreaded (GPU) execution for naive and tiled implementations of stencil and matrix multiplication operations.
- A simple, implementable schedule for stencil and matrix multiplication operations that utilizes the cache better, resulting in equal or better performance than the naive and tiled implementations.
- An analysis of the theoretical cache utilization and efficiency of a column-based scheduler.

#### 1.3 Outline

[TODO] Chapter x introduces yada, yada

# 2 Background

#### 2.1 GPU Architecture

Since the GPU backend of Accelerate only works with CUDA-capable devices (see section 2.3), we will mostly focus on the architecture of newer Nvidia GPUs. Massively parallel workloads are executed on numerous cores clustered in streaming multiprocessors (SMs). The memory is structured in a multi-level hierarchy containing an L1 cache for each SM, an L2 cache shared by all SMs, and multiple banks of DRAM [3, 4].

When working with CUDA, the programmer defines a kernel. This is normally done in CUDA C++, an extension of the C++ programming language, but in our case Accelerate handles the generation of kernels (section 2.3) [5].

#### 2.1.1 Hardware

On modern Nvidia GPU architectures the compute units are grouped in graphics processing clusters (GPCs). Since GPUs are primarily used in graphics applications, a GPC contains some specialized components: a raster engine, Texture Processing Clusters (TPCs), [TODO] .... The main points of interest are the streaming multiprocessors (SMs) in the TPCs. Each SM has its own instruction scheduler and various execution pipelines. Jia et al. suggest that a thread block should contain at least 128 threads, because SMs on Turing, Volta, and Ampere are split into four processing blocks [3, 4, 6, 7].

[TODO] More specific + figure

#### 2.1.2 Software

Execution on a GPU is directed from both the host and the device (GPU). Kernels define the functions that should be executed on the GPU. The host side controls how these kernels are executed, namely how the threads are launched. Threads are organized in a two-level hierarchy: threads are grouped together in cooperative thread arrays (CTAs), also known as thread blocks, and multiple CTAs can be queued for the execution of a single kernel. Both levels can be controlled when launching a kernel: the number of threads per CTA (thread block size) and the total number of CTAs (grid size). CTAs are assigned to SMs in a round-robin fashion.

When an SM executes a CTA, it splits the work into warps, groups of 32 threads. On architectures before Volta, a warp is executed in a single-instruction, multiple-threads (SIMT) fashion, where a single program counter is shared among the 32 threads. With the Volta architecture, independent thread scheduling allows full concurrency between threads, and the scheduler can group multiple threads into SIMT units.
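The launch parameters described above can be sketched with a few lines of arithmetic. This is a minimal illustration, not how the CUDA runtime is implemented: the function name `launch_config` and the example sizes are hypothetical, and the round-robin assignment is the idealized model from the text (real hardware also takes resource availability into account).

```python
import math

WARP_SIZE = 32  # threads per warp, as stated in the text


def launch_config(num_tasks, threads_per_cta, num_sms):
    """Return (grid size, warps per CTA, SM index of every CTA)."""
    # Enough CTAs to cover every task, rounding up.
    grid_size = math.ceil(num_tasks / threads_per_cta)
    # Each CTA is split into warps of 32 threads.
    warps_per_cta = math.ceil(threads_per_cta / WARP_SIZE)
    # Idealized round-robin: CTA k is assigned to SM (k mod num_sms).
    sm_of_cta = [cta % num_sms for cta in range(grid_size)]
    return grid_size, warps_per_cta, sm_of_cta


# Hypothetical example: 1000 tasks, 128 threads per CTA, 4 SMs.
grid, warps, assignment = launch_config(1000, 128, 4)
```

With 128 threads per CTA, each CTA consists of four warps, which matches the four processing blocks per SM mentioned in section 2.1.1.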

[TODO] SM can context switch between multiple warps to hide latency

[TODO] Figure visualizing the whole thing because text is confusing... probably

#### 2.1.3 GPU Caches

To the programmer, the memory hierarchy is very simplified: there is the compute unit and the memory. Caches are hidden from this model as on most architectures they are managed by hardware.

Memory can become a significant bottleneck due to the large number of threads running concurrently. Caches can alleviate this but are limited in size, and a large enough problem can cause cache thrashing, the premature eviction of data before any significant reuse [8]. To improve their efficiency, caches exploit spatial locality by operating on cache lines. A cache line is the smallest unit of data that a cache can hold. For example, an L1 cache on a Turing GPU uses cache lines that hold 128 bytes of data. Therefore, fetching data from memory also brings nearby data with it.

Data shared between threads through the cache can happen in a read-after-write (RAW) or read-after-read (RAR) manner. RAW has data dependency among tasks, for example in scan operations. RAR has no data dependency and can be executed in any order [9].

The L1 cache in older Nvidia GPU architectures (Maxwell, Pascal) uses the least recently used (LRU) eviction policy. When a cache becomes full, we need to remove data (a cache line) from it to allow newer data to be cached. An LRU eviction policy evicts the data that was used least recently. Jia et al. [6] have shown that on Turing and Volta GPUs, the P-chase benchmark presented by Mei and Chu [10], which is used to detect the LRU eviction policy, fails to complete over the full L1 cache. Jia et al. conclude that newer architectures (Turing, Volta) use a non-LRU eviction policy [6, 10, 11]. When the L1 cache of a Turing or Volta GPU saturates, 4 consecutive cache lines are chosen randomly to be evicted. This is in line with a new eviction policy mechanism introduced with Volta, where cache lines can be assigned a priority [5, 6].

Modern Nvidia GPUs are able to handle various types of cache operations and eviction hints. By default, loads are cached at all levels (L2, L1) with an LRU policy. This brings a problem with it: if data is written to a cached value, we first need to evict this cache line from all other L1 caches, since their copy is no longer up to date after our update. Alternatively, it is possible to cache only in L2, bypassing L1. Another option is to hint cache streaming, where the loaded cache line gets an evict-first policy to prevent pollution of the cache. Similar operations exist for writing data to memory. In both cases it is up to the compiler and programmer to exploit this for extra performance [5].

## 2.1.4 Performance of Access Patterns

Lam et al. and Meyer et al. describe two types of reference reuse [12, 13]:

- **Spatial reuse** occurs when accessing data from the same cache line, increasing spatial locality.
- **Temporal reuse** occurs when the same data is accessed at a later time, increasing temporal locality.

The reuse factor can be kept track of by counting the two types of reuse. Loading data horizontally (sequentially) exploits spatial locality and is therefore cheaper than loading data vertically. Additionally, temporal reuse can only happen when other memory accesses do not displace reusable data from the cache.
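To make the difference between horizontal and vertical loads concrete, the following sketch counts how many distinct cache lines a set of accesses touches. It assumes a row-major layout, 32-bit elements, and the 128-byte line size mentioned in section 2.1.3; the helper `lines_touched` and the row length of 1024 elements are hypothetical choices for illustration.

```python
LINE_BYTES = 128  # L1 cache line size on Turing (section 2.1.3)
ELEM_BYTES = 4    # assumption: 32-bit elements


def lines_touched(element_indices, line_bytes=LINE_BYTES, elem_bytes=ELEM_BYTES):
    """Count the distinct cache lines covered by a list of element indices."""
    elems_per_line = line_bytes // elem_bytes
    return len({index // elems_per_line for index in element_indices})


width = 1024  # hypothetical row length in elements (row-major layout)

horizontal = list(range(64))                # 64 neighbouring elements in one row
vertical = [y * width for y in range(64)]   # 64 neighbouring elements in one column

h_lines = lines_touched(horizontal)  # 2 lines: 32 elements share each line
v_lines = lines_touched(vertical)    # 64 lines: every element is on its own line
```

Under these assumptions, reading 64 elements vertically fetches 32 times as many cache lines as reading them horizontally.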

Lam et al. proposed a method to model cache interference. In the simplest case where all data used is cached in different locations (and therefore no data is evicted), the number of misses per variable v is described by D(v)/R(v), where D(v) is the total number of references (loads) of v and R(v) is the reuse factor. However, with interference misses when data gets displaced from cache, the total number of misses for v is

$$D(v)\left(\frac{1}{R(v)} + \frac{R(v) - 1}{R(v)}M(v)\right)$$

With the miss rate being

$$M(v) = 1 - (1 - S(v)) \prod_{u \in V - \{v\}} (1 - F(u))$$

F(u) is defined as the fraction of cache used by variable u, and self interference S(v) is defined to be the fraction of accesses that map to non-unique location in the cache.
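As a worked example of the interference model, the sketch below evaluates the two expressions numerically. It assumes the miss rate has the form $1 - (1 - S(v)) \prod (1 - F(u))$; the function names and the example values are hypothetical and only serve to show how the terms combine.

```python
from math import prod


def miss_rate(self_interference, other_footprints):
    """M(v) = 1 - (1 - S(v)) * product of (1 - F(u)) over all other variables u."""
    return 1 - (1 - self_interference) * prod(1 - f for f in other_footprints)


def total_misses(references, reuse, m):
    """D(v) * (1/R(v) + (R(v) - 1)/R(v) * M(v))."""
    return references * (1 / reuse + (reuse - 1) / reuse * m)


# Hypothetical variable: 1000 references with a reuse factor of 4.
# No interference at all: only the intrinsic misses D(v)/R(v) remain.
intrinsic = total_misses(1000, 4, miss_rate(0.0, []))

# One other variable occupies half of the cache: half of the reuses are lost.
interfered = total_misses(1000, 4, miss_rate(0.0, [0.5]))
```

In the first case the result is 250 misses; in the second, the miss rate of 0.5 on the 750 potential reuses adds 375 misses for a total of 625.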

[TODO] Illustrations for interference

### 2.2 CPU vs GPU based multi-threading

[TODO] Aka, the limits of GPUs, segue to introduce Accelerate

#### 2.3 Accelerate

[TODO] Generalize to Array DSLs

Accelerate is an embedded, purely functional array language in Haskell [1]. Accelerate consists of a frontend, which contains the embedded language, and a backend, which handles code generation and execution. The frontend handles general optimizations such as sharing recovery and array fusion [14, 15]. Further hardware-specific optimization is handled by the various backends. Two LLVM [16] backends are provided: accelerate-llvm-native, which targets multicore CPUs, and accelerate-llvm-ptx, which targets Nvidia GPUs. In both backends we compile Accelerate code to LLVM IR. When we want to run Accelerate on a GPU, LLVM handles the compilation from LLVM IR to PTX, the instruction set for Nvidia's CUDA programming environment [5, 16, 17].

The GPU backend implements a series of skeletons, which implement primitive operations such as stencils, generate, permute, and scan. These skeletons define how a program should be compiled and are the place where a custom thread scheduler can be implemented. Further customizations to the scheduler can be made on the executing side of the backend, as it controls how kernels are launched.

# 2.4 Commonly Applicable Cache Improvements

#### 2.4.1 Optimizations of Blocked Algorithms

Lam et al. expand on the well-known idea of working on blocks instead of entire rows or columns.

If all data fits in the cache without eviction, the misses that occur are *intrinsic misses*. In the real world, however, data can be evicted by other memory accesses. This interference with reuse is divided into two cases: *cross interference* and *self interference*. Cross interference assumes that the location of data in memory is unrelated to its location in cache, and is instead measured by the probability that the reuse falls within the footprint of the variable. Self interference extends this by taking the cache locations of variables into account, which matters when the data for a single iteration no longer fits in cache [12].

#### 2.4.2 CTA Clustering

[TODO] Write

#### 2.4.3 PAVER

[TODO] Write

# 3 Analysis of Existing Approaches

# 3.1 Spatial Temporal Analysis

The memory accesses of an algorithm can be plotted in a spatial-temporal diagram, with the address space on the spatial axis and the order of access on the temporal axis. Since only the thread order can be manipulated, elements on the temporal axis can be condensed by grouping them by thread. Additionally, a cache simulation can annotate this diagram with the cache level (L1, L2, RAM) of each memory address.

[TODO] Example diagrams, cache simulation
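The data behind such a diagram can be produced by replaying an access trace through a small cache model. The sketch below is a simplification under stated assumptions: it models a single, fully associative cache level with LRU eviction (newer GPUs use a non-LRU policy, see section 2.1.3), and the name `simulate_lru` and the example trace are hypothetical.

```python
from collections import OrderedDict


def simulate_lru(trace, num_lines, elems_per_line):
    """Replay element addresses through a fully associative LRU cache.

    Returns a list of (time, address, hit) tuples, i.e. the points of a
    spatial-temporal diagram annotated with cache hits and misses.
    """
    cache = OrderedDict()  # keys are cache-line indices, ordered by recency
    points = []
    for time, address in enumerate(trace):
        line = address // elems_per_line
        hit = line in cache
        if hit:
            cache.move_to_end(line)        # mark the line as most recently used
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
        points.append((time, address, hit))
    return points


# A cache holding a single line of 4 elements: address 8 evicts the first
# line, so the final access to address 0 misses again.
points = simulate_lru([0, 1, 2, 3, 8, 0], num_lines=1, elems_per_line=4)
hits = [hit for _, _, hit in points]
```

Plotting `address` against `time` and colouring each point by `hit` gives a diagram of the kind described above.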

# 3.2 Stencil Operations

A stencil operation produces an N-dimensional array from an input of the same size. For each element it reads a fixed-size neighborhood and writes a single element. Stencils are used for image operations (edge detection, filters, noise reduction), but also find use in other fields, such as solving partial differential equations [18] and cellular automata.

#### 3.2.1 Naive

The naive implementation iterates over each of the outputs linearly, horizontally first. The temporal linearity translates to parallelism on GPUs where multiple threads work concurrently on each element of the output array.
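The linear iteration order can be sketched as four nested loops. This is a sequential illustration rather than the code Accelerate generates; the function name `stencil_naive`, the clamped boundary handling, and the restriction to odd stencil sizes are assumptions made for the example. The loop variables follow the names $i_y$, $i_x$, $s_y$, $s_x$ used later in this chapter.

```python
def stencil_naive(inp, S_w, S_h, f):
    """Apply f to every S_h x S_w neighbourhood, visiting outputs row by row."""
    I_h, I_w = len(inp), len(inp[0])
    r_y, r_x = S_h // 2, S_w // 2  # assumption: odd stencil sizes
    out = [[0] * I_w for _ in range(I_h)]
    for i_y in range(I_h):          # one output row after the other
        for i_x in range(I_w):      # horizontally first
            window = []
            for s_y in range(-r_y, r_y + 1):
                for s_x in range(-r_x, r_x + 1):
                    # Assumption: clamp reads that fall outside the array.
                    y = min(max(i_y + s_y, 0), I_h - 1)
                    x = min(max(i_x + s_x, 0), I_w - 1)
                    window.append(inp[y][x])
            out[i_y][i_x] = f(window)
    return out


# Summing a 3x3 neighbourhood of ones yields 9 for every output element.
ones = [[1] * 4 for _ in range(4)]
result = stencil_naive(ones, 3, 3, sum)
```

On a GPU, the two outer loops disappear: each thread computes one `(i_y, i_x)` pair, and the thread index determines the order in which the pairs are visited.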

While the naive implementation of stencil operations works well when enough rows of the input fit in the cache, its performance degrades on larger inputs.

[TODO] Split figure 1 up into their respective parts



(a) The spatial temporal diagram of a 7x7 stencil with linear ordering.

(b) The spatial temporal diagram of a 7x7 stencil with a column of size 4 ordering.

Figure 1: The spatial temporal diagrams of a 7x7 stencil. The vertical axis describes the location in a 2D array which is mapped to 1D address space. The horizontal axis describes time.  $\blacksquare$  are addresses of cache lines that are brought into cache.  $\blacksquare$  are addresses being accessed.  $\blacksquare$  are addresses in cache.

The minimum number of cache lines $M_l$ (each holding $L$ elements) that must stay cached to remain in the ideal case is determined by the stencil size and the width of the input array: [TODO] Maths

$$M_l \ge \left\lceil \frac{I_w + S_w}{L} \right\rceil S_h \tag{1}$$

While in most cases ([TODO] examples) this is enough to fully exploit the L2 cache, it can be suboptimal with regard to the L1 cache. [TODO] Example

The number of cache lines $F$ that need to be fetched from memory is bounded by the worst case (equation 1 is not satisfied), where we consistently evict data from the cache before it can be reused:

$$F \le \left\lceil \frac{I_w + S_w}{L} \right\rceil I_h S_h$$

If equation 1 is satisfied, the number of fetches $F$ no longer depends on the stencil height $S_h$:

$$F = \left\lceil \frac{I_w + S_w}{L} \right\rceil I_h$$
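The two cases can be evaluated numerically with a small helper. The sketch below follows equation 1 and the two expressions for $F$ directly; the function names and the example sizes (a $2048 \times 2048$ input, a $7 \times 7$ stencil, and 32 elements per cache line) are assumptions for illustration.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def min_cache_lines(I_w, S_w, S_h, L):
    """Right-hand side of equation 1: lines needed to stay in the ideal case."""
    return ceil_div(I_w + S_w, L) * S_h


def fetches_naive(I_w, I_h, S_w, S_h, L, M_l):
    """Cache line fetches of the naive schedule for a cache of M_l lines."""
    lines_per_row = ceil_div(I_w + S_w, L)
    if M_l >= min_cache_lines(I_w, S_w, S_h, L):
        return lines_per_row * I_h        # ideal case: every row is fetched once
    return lines_per_row * I_h * S_h      # worst-case upper bound


needed = min_cache_lines(2048, 7, 7, 32)              # 65 lines per row * 7 rows
ideal = fetches_naive(2048, 2048, 7, 7, 32, M_l=512)  # equation 1 holds
worst = fetches_naive(2048, 2048, 7, 7, 32, M_l=256)  # equation 1 is violated
```

For these sizes, the worst case fetches $S_h = 7$ times as many cache lines as the ideal case.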

[TODO] GPU threading is different from CPU stuff yada yada

With multiple threads active, even more data must be kept in cache for optimal usage. In the best case, all threads are coherent with overlapping accesses; in the worst case, threads are spread out with fewer overlapping accesses. Threads in GPUs are grouped into warps; the threads within a warp are always coherent, which guarantees overlapping accesses. Therefore, divergence in accesses can only occur when multiple warps are executed on the same SM. A single warp of 32 threads uses $\left\lceil \frac{32+S_w}{L} \right\rceil S_h$ cache lines when the threads cover a single row. When the warp is split between 2 rows, the cache needs to be slightly bigger: $\left\lceil \frac{32+2S_w}{L} \right\rceil S_h$.

Ideally, the whole input array would fit in the cache, but a sufficiently large input (e.g. a $2048 \times 2048$ 32-bit floating point array uses 16 MiB) will not fit in the L2 cache of modern GPUs ($\approx$ 6 MiB of L2 data cache, Volta V100), so cache misses are unavoidable. Even if the data did fit in the L2 cache, there would still be potential cache misses in the L1 cache (128 KiB, Volta V100).
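The sizes quoted above can be checked with a few lines of arithmetic. The cache capacities are the ones given in the text for the Volta V100; the helper `array_bytes` and the derived number of rows that fit in L1 are illustrative additions.

```python
KiB = 1024
MiB = 1024 * KiB

L2_BYTES = 6 * MiB    # L2 data cache, Volta V100 (from the text)
L1_BYTES = 128 * KiB  # L1 cache per SM, Volta V100 (from the text)


def array_bytes(width, height, elem_bytes=4):
    """Size of a 2D array of 32-bit elements in bytes."""
    return width * height * elem_bytes


size = array_bytes(2048, 2048)      # 16 MiB
fits_in_l2 = size <= L2_BYTES       # False: the array exceeds the L2 cache

# Number of complete input rows that fit in one L1 cache.
row_bytes = array_bytes(2048, 1)
rows_in_l1 = L1_BYTES // row_bytes  # 16 rows of 2048 floats
```

Sixteen rows leave room for the rows of a $7 \times 7$ stencil on a single thread, but this margin shrinks quickly once the cache is shared between many concurrent warps.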

[TODO] More figures like in fig. 2, but for GPU/multithreading

The model described in section 2.1.4 can be used to estimate the cache misses of the naive implementation; the model parameters are summarized in table 1. Calculating the reuse for an iteration of $s_y$, $i_x$, and $i_y$ is fairly trivial.

[TODO] These are the calculations adapted from Lam et al. 1991. I'm not really sure if I should do this. They feel not very elegant, and more like a black box system...

$s_x$ is omitted since it only loads a single value during one loop and therefore has no reuse. During a single step of $s_y$, an entire row of $S_w$ elements of the stencil is loaded. A single step of $i_x$ produces a single output element, which means an entire stencil is read.



(a) Ideal cache size, only minimal amount of loading is required.



(b) Cache too small, data gets evicted before any potential reuse.

Figure 2: [TODO] write this

| Array | Reuse $i_x$  | Reuse $i_y$        | Reuse $s_x$ | Reuse $s_y$ | Self-interference $i_x$ | Self-interference $i_y$ | Self-interference $s_x$ | Self-interference $s_y$ |
|-------|--------------|--------------------|-------------|-------------|-------------------------|-------------------------|-------------------------|-------------------------|
| X     | $(S_w-1)S_h$ | $(S_h - 1)S_w I_w$ | -           | $S_w$       |                         |                         | 0                       | 0                       |

| Array | Footprint $i_x$ | Footprint $i_y$ | Footprint $s_x$ | Footprint $s_y$ | References        |
|-------|-----------------|-----------------|-----------------|-----------------|-------------------|
| X     | $S_w S_h / C$   |                 | $1/C$           | $S_w/C$         | $I_w I_h S_w S_h$ |

Table 1: The reuse, self-interference, and footprint of the naive implementation of stencils during a single step of iterating on  $i_x$ ,  $i_y$ ,  $s_x$ , and  $s_y$ . [TODO] Calculate self interference

#### 3.2.2 Tiling

A commonly used optimization is to divide the work into tiles (blocks), as described in section 2.4.1. [TODO] blablablabla

The minimum cache size required for optimal tiling depends on [TODO] $\dots$ tiling size $t$

$$M_l \ge \left\lceil \frac{t + S_w}{L} \right\rceil S_h \tag{2}$$

The largest possible tiling size t is derived by inverting equation 2

$$t \le \frac{LM_l}{S_h} - S_w \tag{3}$$

Since equation 2 can always be satisfied by adjusting $t$, we obtain a tighter upper bound on the number of cache line fetches $F$:

$$F \le \left\lceil \frac{I_w}{t} \right\rceil \left\lceil \frac{t + S_w}{L} \right\rceil \left\lceil \frac{I_h}{t} \right\rceil (t + S_h) \tag{4}$$
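Equations 2 to 4 can be combined into a small calculation: pick the largest tile that fits in the cache and evaluate the resulting bound. In the sketch below, `max_tile_size` rounds down to whole cache lines per row so that equation 2 still holds with its ceiling; this rounding is an assumption that goes slightly beyond equation 3. The cache of 64 lines and the other sizes are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def max_tile_size(L, M_l, S_w, S_h):
    """Largest t satisfying equation 2, using whole cache lines per row."""
    lines_per_row = M_l // S_h  # cache lines available for each of the S_h rows
    return L * lines_per_row - S_w


def fetches_tiled(I_w, I_h, S_w, S_h, L, t):
    """Upper bound on cache line fetches for t x t tiles (equation 4)."""
    return (ceil_div(I_w, t) * ceil_div(t + S_w, L)
            * ceil_div(I_h, t) * (t + S_h))


# Hypothetical cache of 64 lines with 32 elements each, 7x7 stencil.
t = max_tile_size(L=32, M_l=64, S_w=7, S_h=7)
cache_lines_used = ceil_div(t + 7, 32) * 7  # right-hand side of equation 2
bound = fetches_tiled(2048, 2048, 7, 7, 32, t)
```

Here `t` evaluates to 281, which uses 63 of the 64 available cache lines, and the bound on the number of fetches for a $2048 \times 2048$ input is 165888.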

[TODO] Maybe remove this. Probably just keep it, so we can plot our expected number of cache line fetches

We can define the number of cache lines fetched in terms of the available cache by substituting equation 3 into equation 4:

$$F \le \left\lceil \frac{I_w}{\frac{LM_l}{S_h} - S_w} \right\rceil \left\lceil \frac{\frac{LM_l}{S_h} - S_w + S_w}{L} \right\rceil \left\lceil \frac{I_h}{\frac{LM_l}{S_h} - S_w} \right\rceil \left( \frac{LM_l}{S_h} - S_w + S_h \right) \tag{5}$$

$$F \le \frac{I_w I_h M_l}{\left(\frac{LM_l}{S_h} - S_w\right)^2 S_h} \left(\frac{LM_l}{S_h} - S_w + S_h\right) \tag{6}$$

[TODO] simplify this further by relaxation perhaps?

**[TODO]** GPU/multithreading notes

### 3.3 Matrix Multiplication

#### 3.3.1 Naive

[TODO] Split figure 3 up into their parts for better flow

#### 3.3.2 Tiling

## 4 Column Based Iteration

[TODO] Consider splitting in two chapters if they get too big

While tiling itself is a well-known optimization technique, we must consider why it works. **[TODO]** ...

In a sense, grouping columns is similar to striding over clusters of data, except when a row of work cannot be strided perfectly. When the width of a multidimensional array is not a multiple of the stride, the loads will not align column-wise ([TODO] figure). To work around this, the column approach allows the last column to be narrower.

[TODO] implementation details





- (a) The spatial temporal diagram of matrix multiplication with linear ordering.
- (b) The spatial temporal diagram of a matrix multiplication with a column of size 4 ordering.

Figure 3: The spatial temporal diagrams of a matrix multiplication. The vertical axis describes the location in a 2D array which is mapped to 1D address space. The horizontal axis describes time.  $\blacksquare$  are addresses of cache lines that are brought into cache.  $\blacksquare$  are addresses being accessed.  $\blacksquare$  are addresses in cache.

We can construct the index mapping  $i \mapsto j$ . First, we calculate the offset for the starting index of the column we need to map to

$$o = \left\lfloor \frac{i}{I_h w} \right\rfloor w \tag{7}$$

Then, modify the width value such that the last column does not exceed the input width

$$w' = \begin{cases} ((I_w - 1) \bmod w) + 1 & \text{if last column} \\ w & \text{otherwise} \end{cases} \tag{8}$$

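And take $i'$ as the index within a column. Every column before the last holds exactly $I_h w$ tasks, so the modulus uses the unmodified width $w$:

$$i' = i \bmod (I_h w) \tag{9}$$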

So that we can calculate j

$$j = (i' \bmod w') + \left\lfloor \frac{i'}{w'} \right\rfloor I_w + o \tag{10}$$
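The mapping can be checked with a short sequential sketch. The function name `column_index` is hypothetical, and the within-column index is computed as `i - o * I_h`, i.e. relative to the first task of the column; this is an assumption chosen so that a narrower last column is handled correctly.

```python
def column_index(i, I_w, I_h, w):
    """Map a linear task index i to the array index j of the column schedule."""
    num_columns = -(-I_w // w)          # number of columns, rounding up
    k = i // (I_h * w)                  # which column task i falls in
    o = k * w                           # horizontal offset of that column (eq. 7)
    if k == num_columns - 1:
        w_prime = ((I_w - 1) % w) + 1   # last column may be narrower (eq. 8)
    else:
        w_prime = w
    i_prime = i - o * I_h               # index within the column (assumption, see above)
    return (i_prime % w_prime) + (i_prime // w_prime) * I_w + o  # eq. 10


# A 7 x 2 array with columns of width 4: the last column has width 3.
order = [column_index(i, I_w=7, I_h=2, w=4) for i in range(7 * 2)]
```

For this example the tasks visit the array indices 0, 1, 2, 3, 7, 8, 9, 10 (the first column, row by row) followed by 4, 5, 6, 11, 12, 13 (the narrower last column). Every index is visited exactly once.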



Figure 4: [TODO] write this

[TODO] should i add a figure showing the relation between the equation and the column pattern?

[TODO] What are the effects of GPU warps (32 threads) with this schedule?

#### 4.1 Zigzagging variation

[TODO] If we make columns fit for L2 cache, zigzagging may improve locality for lower level cache (L1) when columns sizes aren't a nice power of 2 (or)

## 4.2 Implementation in Accelerate

[TODO] Write

## 4.3 Stencil Operation

The naive stencil operation suffers from cache thrashing on sufficiently large inputs, which tiling resolves (section 3.2.2). Consider what tiling implies: we split the work along all axes to improve locality. However, all tiling except horizontal is unnecessary ([TODO] bold claim, needs support) and may cause even more discontinuity in an otherwise favourable access pattern.

Let us first consider the single-threaded case of column-based iteration: by controlling the column width we can force the ideal scenario of the naive stencil implementation (figure 2a) to occur, similarly to tiling. The column-based approach is similar to tiling, but also allows the ideal memory access pattern to continue across tiles vertically. On GPUs this translates to less cohesive threads, as thread blocks are assigned to SMs in a round-robin fashion, eliminating possibilities for L1 cache reuse.



(a) Grouping by column only incurs a heavy load every time a new column is started.



(b) Grouping by column and row (tiling) also incurs a heavy load when starting a new row, and due to having more rows also has more column starts.

Figure 5: Visualisation of cache bottlenecks for both column based and tiling approaches. ■ Memory required by the next task. ■ Tasks that still have their memory in cache. ■ Memory in cache. ■ Previous and upcoming tasks.

[TODO] Write

The size of the cache needed  $M_{size}$  can be approximated given the unit byte size u, column width c, stencil width  $s_w$  and height  $s_h$ , and the number of active threads t. The required memory is independent of the size of the input data.

$$M_{size} = u \left( c + s_w \right) \left( s_h + \left\lceil \frac{t}{c} \right\rceil \right)$$

Solving for $c$ exactly gives a needlessly complex solution, so instead we approximate $M_{size}$ by its asymptote.

$$M_{size} = u \left( t + s_h \left( c + s_w \right) \right)$$

Solving for column width c gives

$$c = \frac{M_{size} - u(s_h s_w + t)}{s_h u}$$
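The following sketch evaluates the column width for a given cache size and compares the approximation with the exact expression. The function names and the example values (a 128 KiB cache, 4-byte elements, a $7 \times 7$ stencil, and 8192 active threads) are assumptions for illustration.

```python
import math


def cache_bytes_exact(u, c, s_w, s_h, t):
    """Exact expression: u * (c + s_w) * (s_h + ceil(t / c))."""
    return u * (c + s_w) * (s_h + math.ceil(t / c))


def cache_bytes_approx(u, c, s_w, s_h, t):
    """Asymptotic approximation: u * (t + s_h * (c + s_w))."""
    return u * (t + s_h * (c + s_w))


def column_width(M_size, u, s_w, s_h, t):
    """Column width c obtained from the approximation, rounded down."""
    return (M_size - u * (s_h * s_w + t)) // (s_h * u)


M_size = 128 * 1024  # hypothetical cache size in bytes
c = column_width(M_size, u=4, s_w=7, s_h=7, t=8192)
approx = cache_bytes_approx(4, c, 7, 7, 8192)
exact = cache_bytes_exact(4, c, 7, 7, 8192)
```

For these values `c` is 3503 elements. The approximation stays within the cache size, while the exact expression is somewhat larger, because the approximation drops a positive term. The resulting `c` is therefore best read as an upper estimate of the usable column width.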

**[TODO]** amount cache line fetches text here

The number of fetches is similar to equation 4; however, since we no longer divide the work into rows of tiles, the number of cache line fetches is slightly lower:

$$F \le \left\lceil \frac{I_w}{c} \right\rceil \left\lceil \frac{c + S_w}{L} \right\rceil (I_h + S_h) \tag{11}$$
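To get a feeling for the difference between the two bounds, the sketch below evaluates equation 4 and equation 11 for the same tile and column width. The function names and the sizes ($2048 \times 2048$ input, $7 \times 7$ stencil, 32 elements per cache line, width 281) are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def fetches_tiled(I_w, I_h, S_w, S_h, L, t):
    """Upper bound for t x t tiles (equation 4)."""
    return (ceil_div(I_w, t) * ceil_div(t + S_w, L)
            * ceil_div(I_h, t) * (t + S_h))


def fetches_column(I_w, I_h, S_w, S_h, L, c):
    """Upper bound for columns of width c (equation 11)."""
    return ceil_div(I_w, c) * ceil_div(c + S_w, L) * (I_h + S_h)


tiled = fetches_tiled(2048, 2048, 7, 7, 32, t=281)
column = fetches_column(2048, 2048, 7, 7, 32, c=281)
saved = tiled - column
```

With these parameters the column bound is 147960 fetches against 165888 for tiling. The saving comes from the stencil rows that tiling has to reload at the top of every row of tiles, plus the rounding of $I_h$ up to whole tiles.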

[TODO] Is a difference useful?

$$\Delta F \approx$$
 (12)

[TODO] Calculate column parameters, spatial temporal diagrams

# 4.4 Matrix Multiplication

A matrix multiplication $A \cdot B = C$ consists of three parts: reading horizontally from matrix $A$, reading vertically from matrix $B$, and writing to matrix $C$. Since writing the output does not block any further instructions, it can be ignored. Reading horizontally is significantly cheaper due to cache locality.

$$M_{size} = u \left( hc + w \left\lceil \frac{t}{c} \right\rceil \right)$$

Similarly to stencil operations, approximate  $M_{size}$  using the asymptote

$$M_{size} = uhc$$

Then solving for c

$$c = \frac{M_{size}}{uh}$$

[TODO] Calculate column parameters, spatial temporal diagrams

#### 5 Results

[TODO] Benchmark results, Nvidia profiler statistics

#### 5.1 Execution Times

#### 5.2 Cache Hitrate

#### 5.3 Hardware Utilization

## 6 Conclusion

## References

- [1] Manuel M T Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. Accelerating Haskell array codes with multicore GPUs. In *DAMP '11: The 6th workshop on Declarative Aspects of Multicore Programming*. ACM, 2011.
- [2] Cedric Nugteren, Gert-Jan van den Braak, and Henk Corporaal. A study of the potential of locality-aware thread scheduling for GPUs. In *European Conference on Parallel Processing*, pages 146–157. Springer, 2014.
- [3] NVIDIA. NVIDIA Volta V100 GPU architecture. 2017.
- [4] NVIDIA. NVIDIA A100 tensor core GPU architecture. 2020.
- [5] NVIDIA. CUDA toolkit documentation v11.5.0. URL https://docs.nvidia.com/cuda/index.html.
- [6] Zhe Jia, Marco Maggioni, Jeffrey Smith, and Daniele Paolo Scarpazza. Dissecting the NVIDIA Turing T4 GPU via microbenchmarking. arXiv preprint arXiv:1903.07486, 2019.
- [7] NVIDIA. NVIDIA Turing GPU architecture. 2018.
- [8] Hongwen Dai, Chao Li, Huiyang Zhou, Saurabh Gupta, Christos Kartsaklis, and Mike Mantor. A model-driven approach to warp/thread-block level GPU cache bypassing. In *2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC)*, pages 1–6. IEEE, 2016.
- [9] Devashree Tripathy, Amirali Abdolrashidi, Laxmi Narayan Bhuyan, Liang Zhou, and Daniel Wong. PAVER: Locality graph-based thread block scheduling for GPUs. *ACM Transactions on Architecture and Code Optimization (TACO)*, 18(3):1–26, 2021.
- [10] Xinxin Mei and Xiaowen Chu. Dissecting GPU memory hierarchy through microbenchmarking. *IEEE Transactions on Parallel and Distributed Systems*, 28(1):72–86, 2016.
- [11] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
- [12] Monica S Lam, Edward E Rothberg, and Michael E Wolf. The cache performance and optimizations of blocked algorithms. *ACM SIGOPS Operating Systems Review*, 25(Special Issue):63–74, 1991.
- [13] Ulrich Meyer, Peter Sanders, et al. *Algorithms for memory hierarchies: advanced lectures*, volume 2625. Springer Science & Business Media, 2003.
- [14] Trevor L McDonell, Manuel MT Chakravarty, Gabriele Keller, and Ben Lippmeier. Optimising purely functional GPU programs. *ACM SIGPLAN Notices*, 48(9):49–60, 2013.
- [15] DP van Balen. Optimal fusion in data-parallel languages: From diagonal fusion to code generation. Master's thesis, 2020.
- [16] The LLVM compiler infrastructure. URL https://llvm.org/.
- [17] Trevor L McDonell, Manuel MT Chakravarty, Vinod Grover, and Ryan R Newton. Type-safe runtime code generation: Accelerate to LLVM. *ACM SIGPLAN Notices*, 50(12):201–212, 2015.
- [18] Gerald Roth, John Mellor-Crummey, Ken Kennedy, and R. Gregg Brickner. Compiling stencils in High Performance Fortran. In *Supercomputing '97: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing (CDROM)*, pages 1–20. ACM Press, 1997.