

# INSTITUTO SUPERIOR TÉCNICO

### DEPARTAMENTO DE ENGENHARIA INFORMÁTICA

# Organização de Computadores

#### **LEIC**

# Third Lab Assignment: System Modeling and Profiling

Version 1.0

#### IDENTIFICAÇÃO:

| Número: | Nome: |
|---------|-------|
|         |       |
|         |       |
|         |       |

2015/2016

#### 1 Introduction

The goal of this assignment is twofold: (i) to determine the characteristics of a computer's caches, and (ii) to leverage the obtained knowledge about the caches in order to optimize the performance of a given program. For this task, the students will make use of a performance analysis tool that have direct access to hardware performance counters available on most modern microprocessors. The tool that will be used is the standard Application Programming Interface (API): PAPI [1].

In the rest of this section, we make a brief introduction to PAPI, and describe the targeted computer platform and the development environment. In Section 2, we describe the procedure for modeling the L1 and L2 caches of the targeted platform (subection 2.1), and provide a guide for analyzing the performance of a matrix-multiply code segment and optimizing it based on the characteristics of the L2 cache of the target architecture (Subsection 2.2).

#### 1.1 PAPI - Performance Application Programming Interface

The PAPI project [1] specifies a standard Application Programming Interface (API) for accessing hardware performance counters available on most modern microprocessors. These counters exist as a small set of registers that count *Events*, defined as occurrences of specific signals related to the processor's function (such as cache misses and floating point operations), while the program executes on the processor. Monitoring these events may have a variety of uses in application performance analysis and tuning, since it facilitates the correlation between the source/object code structure and the efficiency of the actual mapping of such code in the underlying architecture. Besides performance analysis, including hand tuning, this set of information may also be used in compiler optimization, debugging, benchmarking, monitoring and performance modeling.

PAPI has been implemented on a number of different platforms, including: Alpha; MIPS R10K and R12K; AMD Athlon and Opteron; Intel Pentium II, Pentium III, Pentium M, Pentium IV, Itanium 1 and Itanium 2; IBM Power 3, 4 and 5; Cell; Sun UltraSparc I, II and II, etc.

Although each processor has a number of events that are native to that specific architecture, PAPI provides a software abstraction of these architecture-dependent *Native Events* into a collection of *Preset Events*, also known as *predefined events*, that define a common set of events deemed relevant and useful for application performance tuning. These events are typically found in many CPUs that provide performance counters. They give access to the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit, and pipeline status. Hence, preset events may be regarded as mappings from symbolic names (PAPI preset name) to machine specific definitions (native countable events) for a particular hardware resource. For example, Total Cycles (in user mode) is mapped into PAPI\_TOT\_CYC. Some presets are derived from the underlying hardware metrics. For example, Total L1 Cache Misses (PAPI\_L1\_TCM) is the sum of L1 Data Misses and L1 Instruction Misses on a given platform. The list of preset and native events that are available on a specific platform can be obtained by running the commands papi\_avail and papi\_native\_avail, both provided by the papi source distribution (see Appendix A).

Besides the standard set of events for application performance tuning, PAPI specification also includes both a high-level and a low-level sets of routines for accessing the counters. The high level interface consists of eight functions that make it easy to get started with PAPI, by simply providing the ability to start, stop, and read sets of events. This interface is intended for the acquisition of simple but accurate measurement by application engineers [2, 3]:

- PAPI\_num\_counters get the number of hardware counters available on the system;
- PAPI\_flips simplified call to get Mflips/s (floating point instruction rate), real and processor time:
- PAPI\_flops simplified call to get Mflops/s (floating point operation rate), real and processor time:
- PAPI\_ipc gets instructions per cycle, real and processor time;
- PAPI\_accum\_counters add current counts to array and reset counters;

- PAPI\_read\_counters copy current counts to array and reset counters;
- PAPI\_start\_counters start counting hardware events;
- PAPI\_stop\_counters stop counters and return current counts.

The following is a simple code example of using the high-level API [2, 3]:

```
#include <papi.h>
#define NUM FLOPS 10000
#define NUM_EVENTS 1
 int Events[NUM_EVENTS] = {PAPI_TOT_INS};
  long_long values[NUM_EVENTS];
  /* Start counting events */
 if (PAPI_start_counters(Events, NUM_EVENTS) != PAPI_OK)
   handle_error(1);
  do_some_work();
  /* Read the counters */
  if (PAPI_read_counters(values, NUM_EVENTS) != PAPI_OK)
   handle_error(1);
  printf("After reading the counters: {lld}n", values[0]);
  do_some_work();
  /* Add the counters */
  if (PAPI_accum_counters(values, NUM_EVENTS) != PAPI_OK)
   handle_error(1);
  printf("After adding the counters: %lld\n", values[0]);
  do_some_work();
  /* Stop counting events */
  if (PAPI_stop_counters(values, NUM_EVENTS) != PAPI_OK)
   handle_error(1);
 printf("After stopping the counters: {lld}n", values[0]);
```

#### Possible output:

```
After reading the counters: 441027
After adding the counters: 891959
After stopping the counters: 443994
```

The fully programmable low-level interface provides more sophisticated options for controlling the counters, such as setting thresholds for interrupt on overflow, as well as access to all native counting modes and events. Such interface is intended for third-party tool writers or users with more sophisticated needs.

PAPI specification also provides the access to the most accurate timers available on the platform in use. These timers can be used to obtain both real and virtual time on each supported platform: the real time clock runs all the time (e.g. a wall clock), while the virtual time clock runs only when the processor is running in user mode.

In the following code example, PAPI\_get\_real\_cyc() and PAPI\_get\_real\_usec() are used to obtain the real time it takes to create an event set in clock cycles and in microseconds, respectively [2, 3]:

```
#include <papi.h>
main(){
  long_long start_cycles, end_cycles, start_usec, end_usec;
  int EventSet = PAPI_NULL;
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
  /\star Gets the starting time in clock cycles \star/
  start_cycles = PAPI_get_real_cyc();
  /\star Gets the starting time in microseconds \star/
  start_usec = PAPI_get_real_usec();
  /*Create an EventSet */
  if (PAPI_create_eventset(&EventSet) != PAPI_OK)
    exit(1):
  do some work();
  /\star Gets the ending time in clock cycles \star/
  end_cycles = PAPI_get_real_cyc();
  /\star Gets the ending time in microseconds \star/
  end_usec = PAPI_get_real_usec();
  printf("Wall clock cycles: %lld\n", end_cycles - start_cycles);
  prinf("Wall clock time in microseconds: %1ld\n", end_usec - start_usec);
```

#### Possible output:

```
Wall clock cycles: 100173
Wall clock time in microseconds: 136
```

#### 1.2 Targeted Platform and Development Environment

This assignment must be performed on the computers of the lab classes. These computers have similar hardware characteristics, and any of them can be used as a target platform. Note that, since this work is hardware-dependent, conducting it on an computer with different hardware characteristics could produce unexpected results.

To properly setup the development environment, it is necessary to obtain the PAPI library and a set of auxiliary program files. This material can be found in the package <code>lab3\_kit.zip</code>, which can be downloaded from the course website. After downloading and uncompressing this package on any of the lab classes' computers, PAPI must be built. To this end, change directories to the location of the PAPI source code: folder <code>papi/</code>. Compile the code by issuing the commands: <code>./configure</code>, and <code>make</code>. This operation will produce a set of helper tools located in directory <code>src/utils/</code> and create the PAPI library <code>papilib.a</code>. The tool <code>papi\_avail</code>, in particular, is useful to determine the PAPI events supported on the target platform. The library will be linked to the auxiliary programs presented in the following sections.

#### 2 Procedure

#### 2.1 Modeling Computer Caches

In the first part of this assignment, the goal is to model the characteristics of the L1 data cache and L2 cache of the targeted computer platform. Next, we provide instructions for performing this analysis.

#### 2.1.1 Modeling the L1 Data Cache

The methodology to experimentally model the L1 data cache consists of considering the total amount of data cache misses during the execution of the following code sequence of program cm1.c. This program

can be found in the package lab3\_kit.zip.

The meaning of each variable is the following:

- an arbitrary large array (1 Mega-entries) that will be repeatedly accessed to measure the cache miss pattern;
- csize value of the cache size under test; all cache sizes given by integer powers of 2, between CACHE MIN = 8kB and CACHE MAX = 64kB should be considered;
- stride states how many entries are being skipped at each access; for example, if the stride is 4, entries 0, 4, 8, 12, ... in the array are being accessed, while entries 1, 2, 3, 5, 6, 7, 9, 10, 11, ... are skipped;
- limit the greatest address that will be accessed for the cache size and access pattern under test;
- repeat denotes the number of times that each access pattern will be repeated in array x[].
  - a) Change to directory cm1/ and analyze de code of the program cm1.c. Identify its source code with the program described above.

What are the processor events that will be analyzed during its execution? Explain their meaning.



**b)** Compile the program cml.c using the provided Makefile and execute cml. Plot the variation of the average number of misses (*Avg Misses*) with the stride size, for each considered dimension of the L1 data cache (8kB, 16kB, 32kB and 64kB).

NOTE: A fast sketch of these plots can be drawn in your computer by running the following commands:

```
./cm1 > cm1.out
./cm1_proc.sh
```

| Cache Size | Stride | Avg Misses | Avg Cycl Time |
|------------|--------|------------|---------------|
|            | 1      |            |               |
|            | 2      |            |               |
|            | 4      |            |               |
|            | 8      |            |               |
|            | 16     |            |               |
|            | 32     |            |               |
| 8kBytes    | 64     |            |               |
|            | 128    |            |               |
|            | 256    |            |               |
|            | 512    |            |               |
|            | 1024   |            |               |
|            | 2048   |            |               |
|            | 4096   |            |               |
|            | 1      |            |               |
|            | 2      |            |               |
|            | 4      |            |               |
|            | 8      |            |               |
|            | 16     |            |               |
|            | 32     |            |               |
| 16kBytes   | 64     |            |               |
|            | 128    |            |               |
|            | 256    |            |               |
|            | 512    |            |               |
|            | 1024   |            |               |
|            | 2048   |            |               |
|            | 4096   |            |               |
|            | 8102   |            |               |

| Cache Size | Stride | Avg Misses | Avg Cycl Time |
|------------|--------|------------|---------------|
|            | 1      |            |               |
|            | 2      |            |               |
|            | 4      |            |               |
|            | 8      |            |               |
|            | 16     |            |               |
|            | 32     |            |               |
| 32kBytes   | 64     |            |               |
|            | 128    |            |               |
|            | 256    |            |               |
|            | 512    |            |               |
|            | 1024   |            |               |
|            | 2048   |            |               |
|            | 4096   |            |               |
|            | 8102   |            |               |
|            | 16384  |            |               |
|            | 1      |            |               |
|            | 2      |            |               |
|            | 4      |            |               |
|            | 8      |            |               |
|            | 16     |            |               |
|            | 32     |            |               |
| 64kBytes   | 64     |            |               |
|            | 128    |            |               |
|            | 256    |            |               |
|            | 512    |            |               |
|            | 1024   |            |               |
|            | 2048   |            |               |
|            | 4096   |            |               |
|            | 8102   |            |               |
|            | 16384  |            |               |
|            | 32768  |            |               |



|                                                                                                                                                           | Determine the <b>block size</b> adopted in this cache. Justify your answer.                        | nalyzing th  | e obtained results:                                    |                                  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|--------------|--------------------------------------------------------|----------------------------------|
|                                                                                                                                                           |                                                                                                    | Determine    | the <b>size</b> of the L1 data cache. <u>Justify y</u> | our answer.                      |
| Determine the block size adopted in this cache. Justify your answer.  Characterize the associativity set size adopted in this cache. Justify your answer. |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the <b>associativity set size</b> adopted in this cache. <u>Justify your answer.</u>  | Determine    | the <b>block size</b> adopted in this cache. Ju        | ustify your answer.              |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the <b>associativity set size</b> adopted in this cache. <u>Justify your answer</u> . |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the <b>associativity set size</b> adopted in this cache. Justify your answer.         |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                |              |                                                        |                                  |
| Characterize the associativity set size adopted in this cache. Justify your answer.                                                                       | Characterize the associativity set size adopted in this cache. Justify your answer.                | Clara a tani |                                                        | diamental Tandi Communication    |
|                                                                                                                                                           |                                                                                                    | Cnaracteri   | ze the associativity set size adopted in               | this cache. Justify your answer. |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |
|                                                                                                                                                           |                                                                                                    |              |                                                        |                                  |

#### 2.1.2 Modeling the L2 Cache

In this part of the assignment, the goal is to experimentally model the characteristics of the L2 cache of the targeted computer platform. To analyze the computer's L2 cache, we will use the same methodology that was introduced in the previous section to model the L1 data cache.

a) Modify the program cml.c in order to analyze the characteristics of the L2 cache. (Hint: use the event PAPI\_L2\_DCM.) Describe and justify the changes introduced in this program.

**b)** Compile the program cml.c, execute cml, and plot the variation of the average number of misses (*Avg Misses*) with the stride size, for each considered dimension of the L2 cache.



| Determine th | he <b>size</b> of the L2 cache. <u>Justify your answer</u> .            |     |
|--------------|-------------------------------------------------------------------------|-----|
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
| D            |                                                                         |     |
| Determine th | he <b>block size</b> adopted in this cache. <u>Justify your answer.</u> |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
| Characterize | e the associativity set size adopted in this cache. Justify your answ   | er. |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |
|              |                                                                         |     |

#### 2.2 Profiling and Optimizing Data Cache Accesses

Often, programmers wishing to improve their programs' performance focus their attention on how the programs affect the computer's caches. In the following, it will be analyzed how simple code changes can help to improve that performance for a matrix multiplication application.

Consider a simple matrix multiplication application, operating on two square matrices of  $N \times N$  16-bit integer elements, with N = 1024. From a mathematical point of view, given two matrices **A** and **B**, with elements  $a_{ij}$  and  $b_{ij}$  such that  $0 \le i, j < N$ , the product is defined as:





Figure 1: Straightforward matrix multiplication.

#### 2.2.1 Straightforward implementation

A straight-forward C implementation of eq. 1 can look like this:

```
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
   for (k = 0; k < N; ++k)
    res[i][j] += mull[i][k] * mul2[k][j];</pre>
```

The two input matrices are mull and mull. The result matrix res is assumed to be initialized to all zeroes.

The provided program mm1.c includes this code sequence and all the necessary initialization steps, as well as the set of statements that are required in order to profile its execution using the PAPI toolbox.

a) Change to directory mm1/ and analyze de code of the program mm1.c. Identify its source code with the program described above.

What is the total amount of memory that is required to accommodate each of these matrices?



b) Compile the source file mml.c using the provided Makefile. Execute it, by running the provided mutual exclusion script mml.sh.

Fill the following table with the obtained data.

| Total number of L1 data cache misses                | $\times 10^6$ |
|-----------------------------------------------------|---------------|
| Total number of load / store instructions completed | $\times 10^6$ |
| Total number of clock cycles                        | $\times 10^6$ |
| Elapsed time                                        | seconds       |

c) Evaluate the resulting L1 data cache *Hit-Rate*:



#### 2.2.2 First Optimization: Matrix transpose before multiplication [4]

By analyzing the obtained results, it can be observed that such a straightforward implementation suffers from a severe penalty in what concerns the amount of L2 cache misses resulting from its access pattern. In fact, while mull matrix is accessed sequentially, the inner loop advances the row number of mull (see fig. 1).



Figure 2: Transposed matrix multiplication.

One possible remedy to attenuate such problem is based on matrix transposition. In fact, since each matrix element is accessed multiple times, it might be worthwhile to rearrange ("transpose," in mathematical terms) the second matrix mul2 before using it (see fig. 2):

$$(\mathbf{AB})_{ij} = \sum_{k=0}^{N-1} a_{ik} b_{jk}^T = a_{i1} b_{j1}^T + a_{i2} b_{j2}^T + \dots + a_{i(N-1)} b_{j(N-1)}^T$$
(2)

After the preliminary transposition step, both matrices may be iterated sequentially. As far as the C code is concerned, it now looks like this:

```
short tmp[N][N];
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    tmp[i][j] = mul2[j][i];
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
  for (k = 0; k < N; ++k)
   res[i][j] += mul1[i][k] * tmp[j][k];</pre>
```

Variable tmp is a temporary variable to store the transposed matrix.

One direct consequence of this optimization is that it now requires additional accesses to the data memory. Hopefully, this extra cost can be easily recovered, since the 1024 non-sequential accesses per column are usually much more expensive.

a) Change to directory mm2/ and analyze de code of the program mm2.c. Identify its source code with the program described above. Compile this program using the provided Makefile and execute it, by running the provided mutual exclusion script mm2.sh.

Fill the following table with the obtained data.

| Total number of L1 data cache misses                | $\times 10^6$ |
|-----------------------------------------------------|---------------|
| Total number of load / store instructions completed | $\times 10^6$ |
| Total number of clock cycles                        | $\times 10^6$ |
| Elapsed time                                        | seconds       |

| b) | Evaluate the resulting L1 data cache <i>Hit-Rate</i> : |
|----|--------------------------------------------------------|
|    |                                                        |
|    |                                                        |
|    |                                                        |
|    |                                                        |

c) Compare the obtained results with those that were obtained for the straightforward implementation, by calculating the difference of the resulting hit-rates ( $\Delta HitRate$ ) and the obtained speedups.

#### 2.2.3 Second Optimization: Blocked (tiled) matrix multiply [4]

Despite the good results that may be obtained with the matrix transposition method, in many applications this approach can not be applied, either because the matrix is too large or the available memory is too small. Hence, other alternatives, which do not require the extra copy procedure, should be studied.

The search for an alternative processing scheme should start with a close examination of the involved math and the operations performed by the original implementation. Trivial math knowledge shows that the order of the several additions to obtain each element of the result matrix is irrelevant, as long as each addend appears exactly once. This understanding will lead to solutions which reorder the additions performed in the inner loop of the original code.

According to the original algorithm, the adopted order to access the elements of matrix mul2 is:  $(0,0), (1,0), \ldots, (N-1,0), (0,1), (1,1), \ldots$ . Although the elements (0,0) and (0,1) are in the same cache line, by the time the inner loop completes one round, this cache line has long been evicted. For this

example, each round of the inner loop requires, for each of the three matrices, 1024 cache lines, which is much more than what is available in most processors' caches.

One possible solution is to simultaneously handle more than one iteration of the middle loop, while executing the inner loop. In this case, several values which are guaranteed to be in cache will be used, thus contributing to a reduction of the L2 cache miss-rate. Hence, to maximize the speedup provided by this technique, it is necessary to adapt the dimension of the sub-matrix under processing to the cache block size, by taking into account the size of each matrix element. As a hypothetical example, considering that a short operand occupies 2-Bytes, this means that a 64-Byte cache block will accommodate 32 matrix elements, thus defining the optimal size for the sub-matrix line to be 32 (see fig. 3).



Figure 3: Blocked matrix multiplication.

As far as the C code is concerned, it now looks like this [4]:

```
#define SM (CLS / sizeof (short))

for (i = 0; i < N; i += SM)
    for (j = 0; j < N; j += SM)
    for (k = 0; k < N; k += SM)
    for (i2 = 0, rres = &res[i][j], rmul1 = &mul1[i][k]; i2 < SM; ++i2, rres += N, rmul1 += N)
        for (k2 = 0, rmul2 = &mul2[k][j]; k2 < SM; ++k2, rmul2 += N)
        for (j2 = 0; j2 < SM; ++j2)
            rres[j2] += rmul1[k2] * rmul2[j2];</pre>
```

The most visible change is that the code has six nested loops now. The outer loops iterate with intervals of SM (the cache line size (CLS) divided by sizeof(short)). This divides the matrix multiplication in several smaller problems which can be handled with more cache locality. The inner loops iterate over the missing indexes of the outer loops. There are, once again, three loops. The k2 and j2 loops are in a different order. This is done because, in the actual computation, only one expression depends on k2 but two depend on j2.

**a)** Change to directory mm3/ and analyze the code of the program mm3.c. Identify its source code with the program described above.

Change the program source code in order to comply the algorithm parameterization (sub-matrix line size) with the block size (CLS) that was determined in Section 2.1.

How many matrix elements can be accommodated in each cache line?



b) Compile this program using the provided Makefile and execute it, by running the provided mutual exclusion script mm3.sh.

Fill the following table with the obtained data.

| Total number of L1 data cache misses                | $\times 10^6$ |
|-----------------------------------------------------|---------------|
| Total number of load / store instructions completed | $\times 10^6$ |
| Total number of clock cycles                        | $\times 10^6$ |
| Elapsed time                                        | seconds       |

| the the resulting L1 data cache $Hit$ - $Rate$ :  The the obtained results with those that were obtained for the straightforward implementation ulating the difference of the resulting hit-rates ( $\Delta$ HitRate) and the obtained speedup.                                                                                                                                                                                                                                                               |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| $ERate = HitRate_{mm3} - HitRate_{mm1}$ :                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| $dup(\#Clocks) = \#Clocks_{mm1}/\#Clocks_{mm3}$ :                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| ment:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| re the obtained results with those that were obtained for the matrix transpose implementa calculating the difference of the resulting hit-rates ( $\Delta HitRate$ ) and the obtained speedup obtained speedup is positive, but the difference of the resulting hit-rates is negative, how explain the performance improvement? (Hint: study the hit-rates of the L2 cache for both tentations; the program mm1_12.c shows how to compute the L2 hit-rate for the straight difference matrix implementation.) |
| $Rate = HitRate_{mm3} - HitRate_{mm2}$ :                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| $dup(\#Clocks) = \#Clocks_{mm2}/\#Clocks_{mm3}$ :                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
| ment:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| a y continue                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |

## References

- [1] Performance Application Programming Interface (PAPI). Webpage. "http://icl.cs.utk.edu/papi", December 2008.
- [2] PAPI User's Guide.
- [3] PAPI Programmer's Reference.
- [4] Ulrich Drepper. What every programmer should know about memory. Technical report, Red Hat, Inc., November 2007.

### A A brief introduction to PAPI Events [1, 2, 3]

Events are occurrences of specific signals related to a processor's function. Hardware performance counters exist as a small set of registers that count events, such as cache misses and floating point operations while the program executes on the processor. Monitoring these events facilitates correlation between the structure of source/object code and the efficiency of the mapping of that code to the underlying architecture. Each processor has a number of events that are native to that architecture. PAPI provides a software abstraction of these architecture-dependent <u>native events</u> into a collection of <u>preset events</u> that are accessible through the PAPI interface.

#### A.1 Preset Events

The PAPI library names a number of predefined, or preset events. This set is a collection of events typically found in many CPUs that provide performance counters. A PAPI *preset* event name is mapped onto one or more of the countable *native* events on each hardware platform. On any particular platform, the preset can either be directly available as a single counter, derived using a combination of counters or unavailable.

The list of preset events that is available in a particular platform may be obtained by invoking the utility program papi\_avail.

Example:

```
Name Code Avail Deriv Description (Note)
PAPI L1 DCM 0x80000000 Yes No Level 1 data cache misses
```

#### **A.2** Native Events

*Native events* comprise the set of all events that are countable by the CPU. There are generally far more native events available than can be mapped onto PAPI *preset* events. Even if no preset event is available that exposes a given native event, native events can still be accessed directly. To use native events effectively, the user should be somewhat more familiar with the particular platform in use. Native events use the same interface as used when setting up a preset event. However, since a PAPI preset event definition is not available for native events, the native event name must often be translated into an event code before it can be used.

Native event codes and names are platform dependent, so native codes for one platform are not likely to work for any other platform. Although every attempt is made to keep native event names used by PAPI as similar as possible to those used in the vendor documentation, this is not always possible. The utility program papi\_native\_avail provides insight into the names and event codes of the native events for a specific platform. Such events are often denoted by a hexadecimal constant, which should be used as argument of the PAPI\_add\_event() function, in a manner entirely similar to adding PAPI preset events.

#### Example:

```
Event Code Symbol Long Description

0x4000002b DIV Number of divides.This count includes integer as well as FP divides and is speculative.
```