TutorialMarkerC

Thomas Röhl edited this page Oct 4, 2017 · 6 revisions

Marker API with C/C++

Introduction

The Marker API consists of a set of function calls and defines that enable measuring code regions to gain better insight into the system's activities while executing them. In scientific programming, the interesting code regions are often calculation loops.

The instrumentation is done inside your application but the setup of the performance counters is performed by likwid-perfctr. After the application run, the results are evaluated by likwid-perfctr.

Available functions

The functions are declared in the header file likwid.h. Each function can be called either directly or through the corresponding define. Using the defines is recommended because it allows activating or deactivating the Marker API at build time: the defines resolve to the function calls only if -DLIKWID_PERFMON is set as a compiler flag.

  • LIKWID_MARKER_INIT or likwid_markerInit()
    Initialize the Marker API and read the configured eventsets from the environment
  • LIKWID_MARKER_THREADINIT or likwid_markerThreadInit()
    Add thread to the Marker API and initialize access to the performance counters (start daemon or open device files)
  • LIKWID_MARKER_REGISTER(char* tag) or likwid_markerRegisterRegion(char* tag)
    Register a region name tag with the Marker API. This creates an entry in the internally used hash table to minimize overhead at the start of the named region. This call is optional; the same operations are performed by the start call if the region was not registered previously.
  • LIKWID_MARKER_START(char* tag) or likwid_markerStartRegion(char* tag)
    Start a named region identified by tag. This reads the counters and stores the values in the thread's hash table. If the hash table entry for tag does not exist, it is created.
  • LIKWID_MARKER_STOP(char* tag) or likwid_markerStopRegion(char* tag)
    Stop a named region identified by tag. This reads the counters and stores the values in the thread's hash table. It is assumed that a STOP always follows a matching START, hence no existence check of the hash table entry is performed.
  • LIKWID_MARKER_GET(char* tag, int* nevents, double* events, double* time, int* count) or likwid_markerGetRegion(char* tag, int* nevents, double* events, double* time, int* count)
    If you want to process a code region's measurement results in the instrumented application itself, you can call this function to get the intermediate results. The region is identified by tag. On input, the nevents parameter specifies the length of the events array; on return, nevents holds the number of events filled into the events array. The aggregated measurement time is returned in time and the number of measurements in count.
  • LIKWID_MARKER_SWITCH or likwid_markerNextGroup()
    Switch to the next eventset. If only a single eventset is given, the function performs no operation. If multiple eventsets are configured, this function switches through the eventsets in a round-robin fashion.
    Notice: This function creates the biggest overhead of all Marker API functions, as it has to program the counter registers for the next eventset.
  • LIKWID_MARKER_CLOSE or likwid_markerClose()
    Finalize the Marker API and write the aggregated results of all regions to a file that is picked up by likwid-perfctr for evaluation.

Instrumenting the code

Let's assume we have C code with an interesting code region, like:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

#define N 10000

int main(int argc, char* argv[])
{
    int i;
    double data[N];
#pragma omp parallel for
    for(i = 0; i < N; i++)
    {
        data[i] = omp_get_thread_num();
    }
    return 0;
}

We want to measure the parallel for loop and need to perform some transformations to the code to place the Marker API functions at the right spots. First we substitute the #pragma omp parallel for with the two separate pragmas #pragma omp parallel and #pragma omp for, so that we can add the start and stop calls. Moreover, we have to add the initialization and finalization API calls. The resulting code looks like:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
// This block enables compilation of the code with and without LIKWID in place
#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_SWITCH
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif

#define N 10000

int main(int argc, char* argv[])
{
    int i;
    double data[N];
    LIKWID_MARKER_INIT;
#pragma omp parallel
{
    LIKWID_MARKER_THREADINIT;
}
#pragma omp parallel
{
    LIKWID_MARKER_START("foo");
    #pragma omp for
    for(i = 0; i < N; i++)
    {
        data[i] = omp_get_thread_num();
    }
    LIKWID_MARKER_STOP("foo");
}
    LIKWID_MARKER_CLOSE;
    return 0;
}

The call to LIKWID_MARKER_INIT initializes the Marker API and adds all configured eventsets to the LIKWID library. Each thread has to call LIKWID_MARKER_THREADINIT, hence we do that in a parallel region. We could also fuse the two parallel regions but there must be a barrier between LIKWID_MARKER_THREADINIT and LIKWID_MARKER_START(tag) which is done implicitly here because each parallel region ends with a barrier unless nowait is specified. In the second parallel region we perform LIKWID_MARKER_START("foo") to start the measurement for each thread and store the start value of the counters using the name foo. After the loop we are interested in, each thread stops the measurement phase, stores the stop value of the counters and aggregates the result in the hash table entry foo. After the parallel region, we finalize the Marker API and write the results to a file for later evaluation.

This is only a basic instrumentation; it does not take multiple eventsets or internal processing of measurement results into account. A slightly enhanced example can be found here: https://github.com/RRZE-HPC/likwid/blob/master/examples/C-markerAPI.c

Build and Run application

First we need to know the include and library paths for LIKWID. There is no universal way to determine them, but if you used the default build process you can use this:

$ /bin/bash -c "echo LIKWID_LIB=$(dirname $(which likwid-perfctr))/../lib/"
$ /bin/bash -c "echo LIKWID_INCLUDE=$(dirname $(which likwid-perfctr))/../include/"

This prints the paths to the LIKWID library and the header files. Assuming you have exported the paths in your environment, you can build your application using gcc:

$ gcc -fopenmp -DLIKWID_PERFMON -L$LIKWID_LIB -I$LIKWID_INCLUDE <SRC> -o <EXEC> -llikwid

You can activate or deactivate the Marker API integration with the define -DLIKWID_PERFMON. If you omit it, the calls resolve to nothing and cause no overhead.

So far we have not defined which performance counters should be measured and which metrics should be derived for our application. This is done with likwid-perfctr, which also performs the pinning of threads and the validity check of the given eventsets.

Run the application on a single CPU:

$ likwid-perfctr -C S0:0 -g L3 -m <EXEC>

Run the application in parallel on multiple CPUs:

$ likwid-perfctr -C S0:0-3 -g L3 -m <EXEC>

A possible output of a parallel run looks like this:

--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU type:	Intel Xeon Haswell EN/EP/EX processor
CPU clock:	2.30 GHz
--------------------------------------------------------------------------------
YOUR PROGRAM OUTPUT
--------------------------------------------------------------------------------
================================================================================
Group 1 L3: Region foo
================================================================================
+-------------------+----------+----------+----------+----------+
|    Region Info    |  Core 0  |  Core 1  |  Core 2  |  Core 3  | 
+-------------------+----------+----------+----------+----------+
| RDTSC Runtime [s] | 0.001782 | 0.001733 | 0.001734 | 0.001740 | 
|     call count    |    1     |    1     |    1     |    1     | 
+-------------------+----------+----------+----------+----------+

+---------------------------+---------+--------------+--------------+--------------+--------------+
|           Event           | Counter |    Core 0    |    Core 1    |    Core 2    |    Core 3    | 
+---------------------------+---------+--------------+--------------+--------------+--------------+
|     INSTR_RETIRED_ANY     |  FIXC0  | 8.915056e+06 | 6.420429e+06 | 3.921364e+06 | 1.421908e+06 | 
|   CPU_CLK_UNHALTED_CORE   |  FIXC1  | 4.956292e+06 | 3.699170e+06 | 2.406131e+06 | 1.092918e+06 | 
|    CPU_CLK_UNHALTED_REF   |  FIXC2  | 4.015639e+06 | 2.998418e+06 | 1.951642e+06 | 8.863970e+05 | 
|      L2_LINES_IN_ALL      |   PMC0  | 3.069070e+05 | 2.392480e+05 | 1.616770e+05 | 6.799700e+04 | 
| L2_LINES_OUT_DEMAND_DIRTY |   PMC1  | 6.030000e+02 | 6.210000e+02 | 7.360000e+02 | 1.063000e+03 | 
+---------------------------+---------+--------------+--------------+--------------+--------------+

+--------------------------------+---------+----------+---------+---------+------------+
|              Event             | Counter |    Sum   |   Min   |   Max   |     Avg    | 
+--------------------------------+---------+----------+---------+---------+------------+
|     INSTR_RETIRED_ANY STAT     |  FIXC0  | 20678757 | 1421908 | 8915056 | 5169689.25 | 
|   CPU_CLK_UNHALTED_CORE STAT   |  FIXC1  | 12154511 | 1092918 | 4956292 | 3038627.75 | 
|    CPU_CLK_UNHALTED_REF STAT   |  FIXC2  |  9852096 |  886397 | 4015639 |   2463024  | 
|      L2_LINES_IN_ALL STAT      |   PMC0  |  775829  |  67997  |  306907 |  193957.25 | 
| L2_LINES_OUT_DEMAND_DIRTY STAT |   PMC1  |   3023   |   603   |   1063  |   755.75   | 
+--------------------------------+---------+----------+---------+---------+------------+

+-------------------------------+--------------+--------------+--------------+--------------+
|             Metric            |    Core 0    |    Core 1    |    Core 2    |    Core 3    | 
+-------------------------------+--------------+--------------+--------------+--------------+
|      Runtime (RDTSC) [s]      |  0.00178246  |  0.001733079 |  0.001733966 |  0.001739621 | 
|      Runtime unhalted [s]     | 2.154906e-03 | 1.608332e-03 | 1.046142e-03 | 4.751810e-04 | 
|          Clock [MHz]          | 2.838773e+03 | 2.837531e+03 | 2.835617e+03 | 2.835880e+03 | 
|              CPI              | 5.559463e-01 | 5.761562e-01 | 6.135954e-01 | 7.686278e-01 | 
|  L3 load bandwidth [MBytes/s] | 1.101963e+04 | 8.835069e+03 | 5.967434e+03 | 2.501584e+03 | 
|  L3 load data volume [GBytes] |  0.019642048 |  0.015311872 |  0.010347328 |  0.004351808 | 
| L3 evict bandwidth [MBytes/s] | 2.165098e+01 | 2.293260e+01 | 2.716547e+01 | 3.910737e+01 | 
| L3 evict data volume [GBytes] |  3.8592e-05  |  3.9744e-05  |  4.7104e-05  |  6.8032e-05  | 
|    L3 bandwidth [MBytes/s]    | 1.104128e+04 | 8.858001e+03 | 5.994600e+03 | 2.540691e+03 | 
|    L3 data volume [GBytes]    |  0.01968064  |  0.015351616 |  0.010394432 |  0.00441984  | 
+-------------------------------+--------------+--------------+--------------+--------------+

+------------------------------------+-------------+-------------+-------------+---------------+
|               Metric               |     Sum     |     Min     |     Max     |      Avg      | 
+------------------------------------+-------------+-------------+-------------+---------------+
|      Runtime (RDTSC) [s] STAT      | 0.006989126 | 0.001733079 |  0.00178246 |  0.0017472815 | 
|      Runtime unhalted [s] STAT     | 0.005284561 | 0.000475181 | 0.002154906 | 0.00132114025 | 
|          Clock [MHz] STAT          |  11347.801  |   2835.617  |   2838.773  |   2836.95025  | 
|              CPI STAT              |  2.5143257  |  0.5559463  |  0.7686278  |  0.628581425  | 
|  L3 load bandwidth [MBytes/s] STAT |  28323.717  |   2501.584  |   11019.63  |   7080.92925  | 
|  L3 load data volume [GBytes] STAT | 0.049653056 | 0.004351808 | 0.019642048 |  0.012413264  | 
| L3 evict bandwidth [MBytes/s] STAT |  110.85642  |   21.65098  |   39.10737  |   27.714105   | 
| L3 evict data volume [GBytes] STAT | 0.000193472 |  3.8592e-05 |  6.8032e-05 |   4.8368e-05  | 
|    L3 bandwidth [MBytes/s] STAT    |  28434.572  |   2540.691  |   11041.28  |    7108.643   | 
|    L3 data volume [GBytes] STAT    | 0.049846528 |  0.00441984 |  0.01968064 |  0.012461632  | 
+------------------------------------+-------------+-------------+-------------+---------------+

First the group (eventset) and region name are printed, followed by an information table listing the number of region calls and the measurement time for each CPU. Since our instrumented code executes the named region only once, the call count is 1. The next table contains the raw aggregated counter values of the region calls. If more than one CPU is used, a table with a few statistics (Sum, Min, Max, Avg) is printed. After that comes a table of the derived metrics defined by the performance group L3, one column per thread/CPU. Similar to the raw values, the last table contains statistics of the derived metrics; it is only printed if more than one CPU is used.

Problems

At the moment it is not possible to place multiple threads on a single CPU. You could use likwid-pin on top of likwid-perfctr, but the access to the hash table entries (one per CPU and named region) is not thread-safe.
