TutorialStart

Introduction

LIKWID is a tool suite developed by the HPC group of the computing center at the University of Erlangen-Nuremberg. The intention behind the LIKWID project was to provide a lightweight set of tools for application developers and users. One of the main tools is likwid-perfctr, which offers access to hardware performance counters and derives metrics from the raw counts. Although the command line interface is quite handy, the complexity of hardware performance monitoring (HPM) requires some introduction.

What is hardware performance monitoring?

Starting with the Intel Pentium platform in the early 90's, Intel has offered hardware registers that can be configured to count specific events; AMD followed in the late 90's. The events correspond to one or more operations performed by the hardware, like fetching data, storing data or calculations. Since then, the number of events as well as the number of available counter registers per CPU core have increased, which also reflects the evolution of CPU micro-architectures. In contrast to software-based instrumentation running alongside the real application, the hardware counters do not affect the runtime of the application because they are incremented transparently to the other instructions executed by the CPU. This makes HPM a good starting point for performance analysis. As a downside, the events and their configuration change between micro-architectures, so it is tedious for application developers to handle them themselves.

Note: All outputs shown in this tutorial are for the Intel SandyBridge EP architecture. If you run likwid-perfctr on a system with a different micro-architecture, the outputs will be different.

What can be measured on my system (with LIKWID)?

The likwid-perfctr tool is used for everything regarding hardware performance counters. It also provides lists of the available events, the available counter registers and the available performance groups.

Native HPM events and counters

List available counters and events (only Intel SandyBridge EP):

$ likwid-perfctr -e
This architecture has 97 counters.
Counter tags(name, type<, options>):
FIXC0, Fixed counters, KERNEL|ANYTHREAD
PMC0, Core-local general purpose counters, EDGEDETECT|THRESHOLD|INVERT|KERNEL|ANYTHREAD
[...]
This architecture has 722 events.
Event tags (tag, id, umask, counters<, options>):
INSTR_RETIRED_ANY, 0x0, 0x0, FIXC0
UOPS_ISSUED_ANY, 0xE, 0x1, PMC
[...]

In contrast to many other HPM tools, LIKWID offers almost all native events. Since LIKWID works completely in user space, HPM events that require kernel-space handling are not supported. In order to select an event, you combine the event with a suitable counter, like UOPS_ISSUED_ANY:PMC0. This configures the first general-purpose counter register (PMC = Performance Monitoring Counter) to count all issued micro-ops. The same works for the first fixed-purpose counter register (FIXC0): INSTR_RETIRED_ANY:FIXC0. You can combine multiple event-counter combinations into an event set: UOPS_ISSUED_ANY:PMC0,INSTR_RETIRED_ANY:FIXC0. In order to measure the event set, you have to select the CPU that should be measured. In the following example we use the CPU with the physical ID 3:

$ likwid-perfctr -c 3 -g UOPS_ISSUED_ANY:PMC0,INSTR_RETIRED_ANY:FIXC0 ./a.out
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
CPU type:	Intel Xeon SandyBridge EN/EP processor
CPU clock:	3.09 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: Custom
+-----------------------+---------+--------------+
|         Event         | Counter |    Core 3    |
+-----------------------+---------+--------------+
|  Runtime (RDTSC) [s]  |   TSC   | 2.000641e+00 |
|    UOPS_ISSUED_ANY    |   PMC0  |      451     |
|   INSTR_RETIRED_ANY   |  FIXC0  |      190     |
+-----------------------+---------+--------------+

At first, LIKWID prints some basic information about the hardware in use. The CPU name is the actual CPU model name read from the system. Since each micro-architecture has its own set of events and counter registers, LIKWID has to determine the current micro-architecture type; it outputs this information because it is important for performance analysis. The CPU clock is a measurement of the current clock speed. The line Group 1: Custom separates the measurements of multiple event sets (multiple -g options on the command line). If you select a performance group (next chapter), the string Custom is replaced by the performance group name. The following table lists the events, the used counters and the measurement results of the current run. Each selected CPU core gets its own column with results. The runtime is measured by reading the Time Stamp Counter (TSC). The following lines refer to the selected event set.
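As a sketch of measuring multiple event sets in one run (assuming a recent LIKWID version, where the groups are multiplexed by switching between them periodically and -T sets the switch interval), you simply pass -g more than once:

$ likwid-perfctr -c 3 -g UOPS_ISSUED_ANY:PMC0 -g INSTR_RETIRED_ANY:FIXC0 ./a.out

Each event set then gets its own Group N section in the output.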

If you select more than one CPU core for measurement, an additional table with statistics is printed:

$ likwid-perfctr -c 3,5 -g UOPS_ISSUED_ANY:PMC0,INSTR_RETIRED_ANY:FIXC0 ./a.out
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
CPU type:	Intel Xeon SandyBridge EN/EP processor
CPU clock:	3.09 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: Custom
+-----------------------+---------+--------------+--------------+
|         Event         | Counter |    Core 3    |    Core 5    |
+-----------------------+---------+--------------+--------------+
|  Runtime (RDTSC) [s]  |   TSC   | 2.000524e+00 | 2.000524e+00 |
|    UOPS_ISSUED_ANY    |   PMC0  |      458     |     4619     |
|   INSTR_RETIRED_ANY   |  FIXC0  |      190     |     2252     |
+-----------------------+---------+--------------+--------------+

+----------------------------+---------+----------+----------+----------+----------+
|            Event           | Counter |    Sum   |    Min   |    Max   |    Avg   |
+----------------------------+---------+----------+----------+----------+----------+
|  Runtime (RDTSC) [s] STAT  |   TSC   | 4.001048 | 2.000524 | 2.000524 | 2.000524 |
|    UOPS_ISSUED_ANY STAT    |   PMC0  |   5077   |    458   |   4619   |  2538.5  |
|   INSTR_RETIRED_ANY STAT   |  FIXC0  |   2442   |    190   |   2252   |   1221   |
+----------------------------+---------+----------+----------+----------+----------+

The statistics table contains the sum, the minimum, the maximum and the mean of the measured values.

In many cases, the raw counts themselves are less interesting than a derived metric, which gives more insight. Based on the used event set, you could derive something like "Average UOPs per instruction" from the raw counts of the two-core run above (458/190 and 4619/2252):

CPU 3: UOPS_ISSUED_ANY/INSTR_RETIRED_ANY = 2.41
CPU 5: UOPS_ISSUED_ANY/INSTR_RETIRED_ANY = 2.05

Since deriving all values by hand is tedious and multiple metrics can be derived from the measured event set, LIKWID offers performance groups.

Performance groups

Performance groups combine a pre-defined event set, the corresponding derived metrics and a description. For usability, instead of writing down the whole event-counter combinations on the command line, you just write down the name of the performance group. Performance groups are architecture-dependent; some architectures provide more groups than others.
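Internally, a performance group is a plain text file containing an event set, metric formulas and a description. As a minimal sketch (the section keywords follow the shipped group files; the lookup paths depend on your installation, with user-defined groups typically picked up from $HOME/.likwid/groups/<ARCH>/), a group deriving the UOPs-per-instruction metric from above could look roughly like this:

SHORT Average UOPs per instruction

EVENTSET
FIXC0 INSTR_RETIRED_ANY
PMC0  UOPS_ISSUED_ANY

METRICS
Runtime (RDTSC) [s] time
UOPs per instruction PMC0/FIXC0

LONG
Ratio of issued micro-ops to retired instructions.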

List of all available performance groups (output is only an excerpt, check on your system):

$ likwid-perfctr -a
    BRANCH	Branch prediction miss rate/ratio
     CLOCK	Clock frequencies
      DATA	Load to store ratio
[...]
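For each group, a help text with the event set and the metric formulas can be printed with the -H switch:

$ likwid-perfctr -H -g BRANCH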

In order to measure the branch prediction miss rate on CPU core 3, this time in stethoscope mode (-S 2s measures for two seconds without wrapping an executable):

$ likwid-perfctr -c 3 -g BRANCH -S 2s
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz
CPU type:	Intel Xeon SandyBridge EN/EP processor
CPU clock:	3.09 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Group 1: BRANCH
+------------------------------+---------+---------+
|             Event            | Counter |  Core 3 |
+------------------------------+---------+---------+
|       INSTR_RETIRED_ANY      |  FIXC0  |  103987 |
|     CPU_CLK_UNHALTED_CORE    |  FIXC1  | 1312984 |
|     CPU_CLK_UNHALTED_REF     |  FIXC2  | 1135189 |
| BR_INST_RETIRED_ALL_BRANCHES |   PMC0  |  18968  |
| BR_MISP_RETIRED_ALL_BRANCHES |   PMC1  |   1001  |
+------------------------------+---------+---------+

+----------------------------+--------------+
|           Metric           |    Core 3    |
+----------------------------+--------------+
|     Runtime (RDTSC) [s]    | 2.000606e+00 |
|    Runtime unhalted [s]    | 4.245653e-04 |
|         Clock [MHz]        | 3.576895e+03 |
|             CPI            | 1.262642e+01 |
|         Branch rate        | 1.824074e-01 |
|  Branch misprediction rate | 9.626203e-03 |
| Branch misprediction ratio | 5.277309e-02 |
|   Instructions per branch  | 5.482233e+00 |
+----------------------------+--------------+

The header and the first table are similar to the execution with a custom event set, except that Group 1: BRANCH now contains the name of the performance group. In addition to the raw counts table, a metric table is printed that lists the derived metrics defined by the performance group BRANCH together with the derived results.

If you select multiple CPU cores for measurement (again 3 and 5), there is also a table with statistics for the metrics:

+---------------------------------+-----------------+--------------+--------------+------------------+
|              Metric             |       Sum       |      Min     |      Max     |        Avg       |
+---------------------------------+-----------------+--------------+--------------+------------------+
|     Runtime (RDTSC) [s] STAT    |     4.001176    |   2.000588   |   2.000588   |     2.000588     |
|    Runtime unhalted [s] STAT    | 0.0004178867783 | 7.084783e-07 | 0.0004171783 | 0.00020894338915 |
|         Clock [MHz] STAT        |     4965.228    |   1476.845   |   3488.383   |     2482.614     |
|             CPI STAT            |     23.93906    |   11.53158   |   12.40748   |     11.96953     |
|         Branch rate STAT        |     0.571882    |   0.1824083  |   0.3894737  |     0.285941     |
|  Branch misprediction rate STAT |   0.078172832   |  0.009751782 |  0.06842105  |    0.039086416   |
| Branch misprediction ratio STAT |    0.22913697   |  0.05346127  |   0.1756757  |    0.114568485   |
|   Instructions per branch STAT  |     8.049774    |   2.567568   |   5.482206   |     4.024887     |
+---------------------------------+-----------------+--------------+--------------+------------------+

Automatically added events

Intel systems starting with the Core 2 micro-architecture provide three fixed-purpose counters (FIXC0-2). In the examples on this page, we used the event-counter combination INSTR_RETIRED_ANY:FIXC0. In practice, this event is not required in a user-given event set, because the three fixed counters are automatically added to the end of the event set on these Intel platforms. No metrics are added automatically, although some basic metrics like Clock [MHz] or CPI could be derived from these events/counters.
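Consequently, the earlier example could have omitted the fixed event:

$ likwid-perfctr -c 3 -g UOPS_ISSUED_ANY:PMC0 ./a.out

On this SandyBridge EP system, the result table would then still contain rows for INSTR_RETIRED_ANY (FIXC0), CPU_CLK_UNHALTED_CORE (FIXC1) and CPU_CLK_UNHALTED_REF (FIXC2), appended automatically at the end of the event set.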

How to interpret the measured results?

This is probably the most difficult question in the field of performance engineering with hardware performance counters. There is commonly no easy answer because it depends on which events are measured and which metric is currently the most interesting for the analysis. Basically, you have to see the measurements in the context of the software-hardware interaction, which requires a good understanding of both the current architecture and the analyzed software. In the scope of this document, I give a short insight into the CPI metric:

CPI, or cycles per instruction, measures the quality of the instruction stream, where lower results represent higher quality. Most recent x86 architectures are able to finalize (retire) four instructions per cycle, thus on these systems the minimal CPI is 0.25. CPI is often used as a starting metric to get a first insight, but it can also be misleading. Under the assumption that we analyze a multi-threaded application with a resulting CPI of 0.4, one might think that the instruction mix can be processed with high throughput. But what is a common operation of multi-threaded applications that consists of small, fast-to-process instructions? The answer is barriers. They are often implemented as busy-waiting loops that later switch to a sleeping mechanism. Busy-waiting loops are commonly implemented like this: while not (barrier-reached-by-all) do noop done (see the sketch below). Since the noop instruction does not require any inputs or outputs, it can be issued directly to the execution ports and retires in the next cycle without further delay. This lowers the CPI because many instructions get retired relative to the cycles they require. Consequently, the CPI gets lower while a thread is stuck in a busy-waiting loop. So trusting the CPI metric without any knowledge of what the software is doing can be misleading.
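As a minimal C sketch of such a busy-waiting loop (the names arrived and spin_wait are illustrative, not taken from any real barrier implementation):

#include <stdatomic.h>

/* Illustrative spin-wait phase of a barrier. Every thread reaching
   the barrier would increment 'arrived' elsewhere. */
static atomic_int arrived;

static void spin_wait(int nthreads)
{
    /* Each iteration retires only a load, a compare, a branch and a
       nop -- all cheap, stall-free instructions. A thread spinning
       here drives the measured CPI down although it does no useful
       work. */
    while (atomic_load(&arrived) < nthreads) {
        __asm__ __volatile__("nop");
    }
}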
