Likwid Bench

Georg Hager edited this page Oct 9, 2018 · 7 revisions

likwid-bench: assembly microkernel benchmark suite

Introduction

likwid-bench is a benchmarking application together with a framework to enable rapid prototyping of multi-threaded assembly kernels. Adding a new benchmark amounts to creating a simple text file and recompiling. The framework takes care of threaded execution and pinning, data allocation and placement, time measurement and result presentation.

Build

likwid-bench uses x86-64 instructions in its benchmark kernels. To build it on a 32-bit machine, you have to set the COMPILER option in config.mk to GCCX86. To build it, call:

make likwid-bench

Limitations

likwid-bench supports up to 38 streams. Also note that at the moment only plain (one-dimensional) streams are supported. This makes it impossible to emulate the behavior of multi-dimensional data structures.

Usage

likwid-bench comes with a bunch of kernels included. You can use it as a basic bandwidth benchmarking tool.

You can get a help message with

$ likwid-bench -h

You can list all available benchmark kernels with:

$ likwid-bench -a

You have to specify the benchmark kernel you want to use. This kernel operates on a number of streams. Streams are one-dimensional arrays (or vectors). If you use only one workgroup (thread group), all threads of the workgroup divide the stream into portions, and every thread updates its part of the total vector.

Each assembly kernel has a number of properties. These are:

  1. Number of streams
  2. Data type (DOUBLE, SINGLE, INT)
  3. Number of flops it performs in one update
  4. Number of bytes it transfers in one update
  5. Stride of one loop iteration

To output the properties of a test kernel call likwid-bench with the -l option:

$ likwid-bench -l copy
Name: copy
Number of streams: 2
Loop stride: 8
Flops: 0
Bytes: 16
Data Type: Double precision float
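
As a rough illustration of how these properties relate to each other, the following Python sketch takes the values printed for the copy kernel and derives the traffic of one loop iteration. The dictionary and function names are made up for this example and are not part of LIKWID.

```python
# Properties of the copy kernel as printed by `likwid-bench -l copy`.
# The key names are hypothetical; "type_size" is 8 because the type is double.
copy_kernel = {"streams": 2, "stride": 8, "flops": 0, "bytes": 16, "type_size": 8}

def bytes_per_loop_iteration(kernel):
    # "Bytes" is given per scalar update; one loop iteration performs `stride` updates.
    return kernel["bytes"] * kernel["stride"]

def flops_per_loop_iteration(kernel):
    # "Flops" is given per scalar update as well.
    return kernel["flops"] * kernel["stride"]

# For copy, every update touches one element of each stream.
assert copy_kernel["bytes"] == copy_kernel["streams"] * copy_kernel["type_size"]

print(bytes_per_loop_iteration(copy_kernel))  # 128
print(flops_per_loop_iteration(copy_kernel))  # 0
```

With 16 bytes per update and a stride of 8, one loop iteration moves 128 bytes: 64 bytes loaded and 64 bytes stored.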

When running a benchmark, you have to specify how many threads you want to use, where these threads should be placed and how large the total data set should be. By default the memory is allocated in the same domain the threads run in; optionally you can place the memory in another domain. All vectors are page-aligned by default.

Let's try some examples to illustrate this. Get the default list of benchmark kernels (output shortened):

$ likwid-bench -a
clcopy
clload
clstore
copy
copy_mem
load
store
store_mem
stream
stream_mem
triad
triad_mem

In order to specify the number of threads and where these threads should be placed we already used the term "thread domain." A thread domain is a number of threads sharing a topological entity. This can be a socket or a shared cache or a NUMA domain. Note that if hyper-threading (i.e., simultaneous multi-threading) is enabled, the numbering within a domain is compact. This means that SMT threads are numbered consecutively within a core (in contrast to the default behavior of other tools, such as likwid-pin).

To get a list of thread domains call:

$ likwid-bench  -p
Domain 0:
        Tag S0: 0 1 2 3 4 5
Domain 1:
        Tag S1: 6 7 8 9 10 11
Domain 2:
        Tag C0: 0 1 2 3 4 5
Domain 3:
        Tag C1: 6 7 8 9 10 11
[...]

This is a machine with two sockets and six CPUs per socket. There is a shared L3 cache which is equivalent to the socket domains. There are two socket groups S0 and S1 and two cache groups C0 and C1. Depending on whether it is a UMA or NUMA system, you have either one memory domain M0 covering all CPUs or two memory domains M0 and M1 similar to S0 and S1. Some processors have more memory domains than socket domains (for example, Intel Xeons with Cluster-on-Die [or sub-NUMA clustering] enabled or AMD Epyc). This splits up a socket into two or more memory domains.
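
If you want to process the domain list in a script, a small parser is enough. The sketch below assumes the output format shown in the listing (a "Tag <name>: <ids>" line per domain); other LIKWID versions may print it differently, and parse_domains is a hypothetical helper, not a LIKWID function.

```python
import re

def parse_domains(text):
    """Map each domain tag (e.g. 'S0') to its list of hardware thread IDs."""
    domains = {}
    for line in text.splitlines():
        match = re.match(r"\s*Tag\s+(\w+):\s*(.*)", line)
        if match:
            tag, ids = match.groups()
            domains[tag] = [int(x) for x in ids.split()]
    return domains

# Shortened copy of the `likwid-bench -p` output from this page.
sample = """\
Domain 0:
        Tag S0: 0 1 2 3 4 5
Domain 1:
        Tag S1: 6 7 8 9 10 11
"""

domains = parse_domains(sample)
print(domains["S1"])  # [6, 7, 8, 9, 10, 11]
```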

The simplest form to run a benchmark is:

$ likwid-bench  -t copy -w S1:100kB
Allocate: Process running on core 6 - Vector length 6400 Offset 0
Allocate: Process running on core 6 - Vector length 6400 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy
--------------------------------------------------------------------------------
Using 1 work groups
Using 6 threads
--------------------------------------------------------------------------------
Group: 0 Thread 5 Global Thread 5 running on core 11 - Vector length 1064 Offset 5320
Group: 0 Thread 2 Global Thread 2 running on core 8 - Vector length 1064 Offset 2128
Group: 0 Thread 4 Global Thread 4 running on core 10 - Vector length 1064 Offset 4256
Group: 0 Thread 0 Global Thread 0 running on core 6 - Vector length 1064 Offset 0
Group: 0 Thread 3 Global Thread 3 running on core 9 - Vector length 1064 Offset 3192
Group: 0 Thread 1 Global Thread 1 running on core 7 - Vector length 1064 Offset 1064
--------------------------------------------------------------------------------
Cycles:			3211476582
CPU Clock:		2999807232
Time:			1.070561e+00 sec
Iterations:		15852792
Iterations per thread:	2642132
Size:			99840
Size per thread:	16640
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	263790458880
MByte/s:		246403.95
Cycles per update:	0.194790
Cycles per cacheline:	1.558316
--------------------------------------------------------------------------------

This example uses the copy kernel and runs it on socket 1 with all threads available there. The working set size is set to 100 kB (the unit can be either kB, KB, MB or GB). You get diagnostic output about where the streams are placed, how the threads are pinned and on what part of the vector each thread operates. The number of iterations is determined automatically before running the actual benchmark so that the runtime is at least one second.
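
The per-thread vector lengths and offsets in the listing can be reproduced with a few lines of Python. This is only a sketch under the assumption that each thread's chunk is rounded down to a multiple of the loop stride; the actual partitioning code in likwid-bench may differ, and partition is a hypothetical name. In this listing, the printed vector lengths match 1 kB = 1024 bytes, while the Size lines match 1 kB = 1000 bytes.

```python
def partition(size_bytes, streams, type_size, stride, threads):
    """Return (elements per thread, list of per-thread offsets)."""
    elements = size_bytes // (streams * type_size)          # elements per stream
    per_thread = (elements // threads) // stride * stride   # round down to the loop stride
    offsets = [t * per_thread for t in range(threads)]
    return per_thread, offsets

# copy kernel: 2 streams of doubles, loop stride 8, six threads on S1.
length, offsets = partition(100 * 1024, streams=2, type_size=8, stride=8, threads=6)
print(length, offsets)  # 1064 [0, 1064, 2128, 3192, 4256, 5320]

# With 1 kB = 1000 bytes the same scheme yields the reported sizes.
length_dec, _ = partition(100 * 1000, streams=2, type_size=8, stride=8, threads=6)
size_per_thread = length_dec * 2 * 8
print(size_per_thread, size_per_thread * 6)  # 16640 99840
```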

The result section contains the following output:

  • Cycles: Cycles measured with the RDTSC instruction. On modern processors with Turbo mode the RDTSC clock is invariant, which means it can differ from the actual clock frequency. To make the cycle metrics meaningful, you have to fix the CPU frequency to the nominal frequency.
  • CPU Clock: The CPU frequency at startup
  • Cycle clock: The frequency used to count the invariant RDTSC cycles. By comparing this value with CPU Clock you can see if the CPU frequency is different from the invariant (base) frequency.
  • Time: Runtime of the benchmark determined using the Cycles and Cycle clock values.
  • Iterations: Sum of iterations performed by all threads
  • Iterations per thread: Executions of the inner loop performed by each thread
  • Inner loop executions: How many iterations are done in the internal loop. This varies with the working set size and the amount of data processed in each inner loop iteration.
  • Size (Byte): Total working set size
  • Size per thread: Working set per thread. The working set is split up equally so that the chunk of data for each thread is the same
  • Number of Flops: Number of floating-point operations
  • MFlops/s: Millions of floating-point operations per second (Number of Flops / Time / 10^6)
  • Data volume (Byte): Processed data volume (Size (Byte) * Iterations per thread)
  • MByte/s: Bandwidth during the benchmark. This value is from the application's point of view, so it does not include hidden traffic (write-allocates/RFOs, snooping, ...)
  • Cycles per update: Amount of CPU cycles required to update one item in the result cache line. If you e.g. need to load 2 cache lines to write one cache line, the reading of the two values and the writing of the single value is one update.
  • Cycles per cacheline: Amount of CPU cycles required to update the whole result cache line.
  • Loads per update: How many data items must be loaded to update one item in the output vector
  • Stores per update: How many stores are performed for one update
  • Load bytes per element: Amount of data loaded for one update (does not include hidden traffic)
  • Store bytes per elem.: Amount of data stored for one update (does not include hidden traffic)
  • Load/store ratio: Ratio of loaded and stored data items (Loads per update / Stores per update = Load bytes per element / Store bytes per elem.)
  • Instructions: Amount of instructions executed during the benchmark. This contains only the instructions of the assembly kernel and is a calculated value.
  • UOPs: Amount of micro-ops executed during the benchmark. This contains only the instructions of the assembly kernel and is a calculated value.
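
To see how the derived metrics follow from the raw values, the sketch below recomputes them for the six-thread copy run shown earlier. It is not LIKWID code. It assumes 64-byte cache lines and uses the printed CPU Clock as the cycle clock, because that output does not list a separate Cycle clock value.

```python
# Values copied from the six-thread copy run on S1.
cycles = 3211476582
cpu_clock_hz = 2999807232
size_bytes = 99840
iterations_per_thread = 2642132
bytes_per_update = 16    # from `likwid-bench -l copy`
type_size = 8            # double precision
cacheline_bytes = 64     # assumption: 64-byte cache lines

time_s = cycles / cpu_clock_hz                       # Time
data_volume = size_bytes * iterations_per_thread     # Data volume (Byte)
mbyte_per_s = data_volume / time_s / 1.0e6           # MByte/s
updates = data_volume // bytes_per_update            # scalar updates in total
cycles_per_update = cycles / updates
cycles_per_cacheline = cycles_per_update * (cacheline_bytes // type_size)

print(f"Time:                 {time_s:.6e} sec")
print(f"Data volume (Byte):   {data_volume}")
print(f"MByte/s:              {mbyte_per_s:.2f}")
print(f"Cycles per update:    {cycles_per_update:.6f}")
print(f"Cycles per cacheline: {cycles_per_cacheline:.6f}")
```

The results agree with the listing up to rounding, which confirms the relation Data volume = Size * Iterations per thread given in the list above.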

Let's try a single threaded example with the stream benchmark in L1 cache:

$ likwid-bench  -t stream -w S1:20kB:1
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 1 work groups
Using 1 threads
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on core 6 - Vector length 848 Offset 0
--------------------------------------------------------------------------------
Cycles:			1849715976
CPU Clock:		2999786262
Time:			6.166159e-01 sec
Iterations:		3408209
Iterations per thread:	3408209
Size:			19968
Size per thread:	19968
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	68055117312
MByte/s:		110368.73
Cycles per update:	0.434875
Cycles per cacheline:	3.478998
--------------------------------------------------------------------------------

A workgroup is specified with <domain>:<size>:<nrThreads>. The number of threads is optional. An efficient barrier based on a spin-waiting loop is employed to keep the overhead low.
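
The basic workgroup string is easy to take apart. The following sketch is a hypothetical helper for scripts that generate or check -w arguments; it only covers the three fields named above and does not reflect how likwid-bench parses its command line internally.

```python
from typing import NamedTuple, Optional

class Workgroup(NamedTuple):
    domain: str
    size: str
    threads: Optional[int]   # None means "use all threads of the domain"

def parse_workgroup(spec):
    """Parse '<domain>:<size>[:<nrThreads>]' into a Workgroup tuple."""
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected <domain>:<size>[:<nrThreads>], got {spec!r}")
    threads = int(parts[2]) if len(parts) == 3 else None
    return Workgroup(parts[0], parts[1], threads)

print(parse_workgroup("S1:20kB:1"))
print(parse_workgroup("S1:100kB"))
```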

The following example does the same as above, but this time with two workgroups on two different sockets. Note that the memory is also placed on each of the sockets according to the workgroups.

$ likwid-bench  -t stream -w S1:20kB:1 -w S0:20kB:1
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 6 - Vector length 853 Offset 0
Allocate: Process running on core 0 - Vector length 853 Offset 0
Allocate: Process running on core 0 - Vector length 853 Offset 0
Allocate: Process running on core 0 - Vector length 853 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: stream
--------------------------------------------------------------------------------
Using 2 work groups
Using 2 threads
--------------------------------------------------------------------------------
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 848 Offset 0
Group: 1 Thread 0 Global Thread 1 running on core 6 - Vector length 848 Offset 0
--------------------------------------------------------------------------------
Cycles:			2348851791
CPU Clock:		2999805430
Time:			7.830014e-01 sec
Iterations:		5283062
Iterations per thread:	2641531
Size:			39936
Size per thread:	19968
Number of Flops:	8791015168
MFlops/s:		11227.33
Data volume (Byte):	105492182016
MByte/s:		134727.96
Cycles per update:	0.534376
Cycles per cacheline:	4.275004
--------------------------------------------------------------------------------

You can also specify in more detail how the memory should be allocated and placed. By default every stream is allocated page-aligned in the same domain the threads run in. You can change this with the following optional arguments:

$ likwid-bench  -t copy -w S1:1GB:2-0:S0,1:S0  -w S0:1GB:2
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy
--------------------------------------------------------------------------------
Using 2 work groups
Using 4 threads
--------------------------------------------------------------------------------
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 33554432 Offset 33554432
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 33554432 Offset 0
Group: 1 Thread 1 Global Thread 3 running on core 7 - Vector length 33554432 Offset 33554432
Group: 1 Thread 0 Global Thread 2 running on core 6 - Vector length 33554432 Offset 0
--------------------------------------------------------------------------------
Cycles:			6853695501
CPU Clock:		2999786982
Time:			2.284727e+00 sec
Iterations:		76
Iterations per thread:	19
Size:			2000000000
Size per thread:	500000000
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	38000000000
MByte/s:		16632.18
Cycles per update:	2.885767
Cycles per cacheline:	23.086132
--------------------------------------------------------------------------------

This example runs the copy kernel on two threads per socket but overrides the default setting by placing all vectors in the socket 0 domain. Note that you either specify no stream arguments or all stream arguments. This means that if your kernel operates on two streams you have to specify two streams in the optional memory arguments. The syntax is <domain>:<size>[:<threads>]-<streamId>:<domain>[:<offset>],<streamId>:<domain>[:<offset>].... You can offset the array by a multiple of the data type size, i.e., if the kernel operates on doubles you can offset the array by a multiple of sizeof(double). Notice that this offset is also checked against the stride of the loop. Offsetting can be an advantage if you suspect cache associativity problems (thrashing).
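
The stream placement part of a workgroup string can be split as follows. This is a sketch based on the command line used in the example above (S1:1GB:2-0:S0,1:S0); the optional third field per stream is assumed to be the offset, and both function names are hypothetical.

```python
def parse_stream_placement(spec):
    """Split '<workgroup>-<id>:<domain>[:<offset>],...' into workgroup and placement."""
    workgroup, _, streams = spec.partition("-")
    placement = {}
    for entry in filter(None, streams.split(",")):
        fields = entry.split(":")
        stream_id, domain = int(fields[0]), fields[1]
        offset = int(fields[2]) if len(fields) > 2 else 0   # assumed offset field
        placement[stream_id] = (domain, offset)
    return workgroup, placement

def placement_is_valid(placement, num_streams):
    # Either no stream is given or every stream of the kernel is given.
    return not placement or sorted(placement) == list(range(num_streams))

workgroup, placement = parse_stream_placement("S1:1GB:2-0:S0,1:S0")
print(workgroup, placement)
# S1:1GB:2 {0: ('S0', 0), 1: ('S0', 0)}
print(placement_is_valid(placement, num_streams=2))  # True
```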

Of course this is not too smart on a NUMA machine. If you place the data correctly and use all threads with a kernel that employs non-temporal stores, you get the peak memory bandwidth of this system:

$ likwid-bench  -t copy_mem -w S1:1GB  -w S0:1GB  
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy_mem
--------------------------------------------------------------------------------
Using 2 work groups
Using 12 threads
--------------------------------------------------------------------------------
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 11184808 Offset 11184808
Group: 1 Thread 2 Global Thread 8 running on core 8 - Vector length 11184808 Offset 22369616
Group: 1 Thread 0 Global Thread 6 running on core 6 - Vector length 11184808 Offset 0
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 11184808 Offset 0
Group: 1 Thread 4 Global Thread 10 running on core 10 - Vector length 11184808 Offset 44739232
Group: 1 Thread 1 Global Thread 7 running on core 7 - Vector length 11184808 Offset 11184808
Group: 1 Thread 5 Global Thread 11 running on core 11 - Vector length 11184808 Offset 55924040
Group: 0 Thread 5 Global Thread 5 running on core 5 - Vector length 11184808 Offset 55924040
Group: 1 Thread 3 Global Thread 9 running on core 9 - Vector length 11184808 Offset 33554424
Group: 0 Thread 2 Global Thread 2 running on core 2 - Vector length 11184808 Offset 22369616
Group: 0 Thread 4 Global Thread 4 running on core 4 - Vector length 11184808 Offset 44739232
Group: 0 Thread 3 Global Thread 3 running on core 3 - Vector length 11184808 Offset 33554424
Cycles: 15602213015 
Iterations: 100 
Size: 67108864 
Vectorlength: 11184808 
Time: 5.318840e+00 sec
MFlops/s:       0.00
MByte/s:        40375.03
Cycles per update:      13.949469
Cycles per cacheline:   111.595750
--------------------------------------------------------------------------------

Default benchmarks

likwid-bench already contains a number of basic benchmark kernels you can use out of the box.

These are:

  • copy: Standard memcpy benchmark. A[i] = B[i]
  • copy_mem: The same as above but with non-temporal stores.
  • load: One load stream. This one does some software prefetching you can experiment with.
  • store: One store stream.
  • store_mem: The same as above but with non-temporal stores.
  • stream: Classical STREAM triad. A[i] = B[i] + a * C[i]
  • stream_mem: The same as above but with non-temporal stores.
  • triad: Full vector triad. A[i] = B[i] + C[i] * D[i]
  • triad_mem: The same as above but with non-temporal stores.
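
For reference, the element-wise operations of the three compute patterns in the list above can be written down in plain Python. This only illustrates the semantics; the real kernels are hand-written assembly, and the function names and the scalar argument s are chosen for this sketch.

```python
def copy(a, b):
    # A[i] = B[i]
    for i in range(len(a)):
        a[i] = b[i]

def stream(a, b, c, s):
    # STREAM triad: A[i] = B[i] + s * C[i]
    for i in range(len(a)):
        a[i] = b[i] + s * c[i]

def triad(a, b, c, d):
    # Full vector triad: A[i] = B[i] + C[i] * D[i]
    for i in range(len(a)):
        a[i] = b[i] + c[i] * d[i]

a = [0.0] * 4
triad(a, [1.0] * 4, [2.0] * 4, [3.0] * 4)
print(a)  # [7.0, 7.0, 7.0, 7.0]
```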

Apart from these standard benchmarks there are special cache line versions for the basic data operations load, store and copy. These versions execute only one operation per cache line. This reduces the runtime as far as possible to the time needed for the data transfers inside the memory hierarchy. Use these benchmarks to measure the raw bandwidth of different memory levels.

  • clcopy
  • clload
  • clstore

Using likwid-bench together with likwid-perfctr

To measure hardware performance counter events, likwid-bench can be built with LIKWID Marker API instrumentation, which allows you to measure additional events.

To build likwid-bench for use with likwid-perfctr set the following switch in config.mk to true:

INSTRUMENT_BENCH = true

Call make distclean and rebuild. Now you can use both tools together. Of course, in likwid-perfctr you still have to specify the cores you want to measure explicitly with -c. To indicate that likwid-bench was built with instrumentation, it outputs a message like "Have you set -m for likwid-perfctr" when running stand-alone. If wrapped by likwid-perfctr, there is a message like "Using LIKWID".

The following example shows a multi-socket Uncore measurement for the L3 cache with two threads running on each socket:

$ likwid-perfctr -c 0,1,6,7 -g L3CACHE likwid-bench -t copy_mem -w S1:1GB:2 -w S0:1GB:2
--------------------------------------------------------------------------------
CPU type:       Intel Core Westmere processor 
CPU clock:      2.93 GHz 
--------------------------------------------------------------------------------
Measuring group L3CACHE
--------------------------------------------------------------------------------
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 6 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Allocate: Process running on core 0 - Vector length 67108864 Offset 0
Using likwid
--------------------------------------------------------------------------------
LIKWID MICRO BENCHMARK
Test: copy_mem
--------------------------------------------------------------------------------
Using 2 work groups
Using 4 threads
--------------------------------------------------------------------------------
Group: 1 Thread 1 Global Thread 3 running on core 7 - Vector length 33554432 Offset 33554432
Group: 1 Thread 0 Global Thread 2 running on core 6 - Vector length 33554432 Offset 0
Group: 0 Thread 1 Global Thread 1 running on core 1 - Vector length 33554432 Offset 33554432
Group: 0 Thread 0 Global Thread 0 running on core 0 - Vector length 33554432 Offset 0
--------------------------------------------------------------------------------
Cycles:			5056904547
CPU Clock:		2999809184
Time:			1.685742e+00 sec
Iterations:		68
Iterations per thread:	17
Size:			2000000000
Size per thread:	500000000
Number of Flops:	0
MFlops/s:		0.00
Data volume (Byte):	34000000000
MByte/s:		20169.16
Cycles per update:	2.379720
Cycles per cacheline:	19.037758
--------------------------------------------------------------------------------
+-----------------------+-------------+-------------+-------------+-------------+
|         Event         |   core 0    |   core 1    |   core 6    |   core 7    |
+-----------------------+-------------+-------------+-------------+-------------+
|   INSTR_RETIRED_ANY   | 5.40253e+09 | 5.45509e+09 | 4.1945e+09  | 4.19677e+09 |
| CPU_CLK_UNHALTED_CORE | 3.33546e+10 | 3.35159e+10 | 3.35228e+10 | 3.35342e+10 |
|    UNC_L3_HITS_ANY    | 3.82275e+07 |      0      | 3.86728e+07 |      0      |
|    UNC_L3_MISS_ANY    |  1.716e+09  |      0      | 1.71604e+09 |      0      |
|  UNC_L3_LINES_IN_ANY  | 8.47763e+08 |      0      | 8.46451e+08 |      0      |
| UNC_L3_LINES_OUT_ANY  | 8.47762e+08 |      0      | 8.46451e+08 |      0      |
+-----------------------+-------------+-------------+-------------+-------------+
+-----------------+------------+--------+------------+--------+
|     Metric      |   core 0   | core 1 |   core 6   | core 7 |
+-----------------+------------+--------+------------+--------+
| L3 request rate | 0.00707585 |   0    | 0.00921989 |   0    |
|  L3 miss rate   |  0.31763   |   0    |  0.409117  |   0    |
|  L3 miss ratio  |  44.8893   |   0    |  44.3733   |   0    |
+-----------------+------------+--------+------------+--------+

For Uncore events likwid-perfctr will only measure on one core per socket.
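
The derived metrics in the second table can be checked against the raw counts. The formulas in this sketch are inferred from the numbers in the output above, not taken from the definition of the L3CACHE group, so treat them as an assumption.

```python
# Raw counts for core 0, copied from the event table.
instr_retired = 5.40253e9    # INSTR_RETIRED_ANY
l3_hits = 3.82275e7          # UNC_L3_HITS_ANY
l3_miss = 1.716e9            # UNC_L3_MISS_ANY

# Inferred formulas (assumption, see the note above the code).
l3_request_rate = l3_hits / instr_retired
l3_miss_rate = l3_miss / instr_retired
l3_miss_ratio = l3_miss / l3_hits

print(f"L3 request rate: {l3_request_rate:.6g}")
print(f"L3 miss rate:    {l3_miss_rate:.6g}")
print(f"L3 miss ratio:   {l3_miss_ratio:.6g}")
```

The values agree with the core 0 column up to the rounding of the printed event counts.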

Adding benchmarks

To add new benchmarks you have to create test files in the directory <LIKWID_SRC>/bench/<ARCH>. The file name must end in .ptt. Let's look at the copy benchmark bench/x86-64/copy.ptt. The benchmark is later named after the file.

STREAMS 2
TYPE DOUBLE
FLOPS 0
BYTES 16
LOOP 8
movaps    FPR1, [STR0 + GPR1 * 8]
movaps    FPR2, [STR0 + GPR1 * 8 + 16]
movaps    FPR3, [STR0 + GPR1 * 8 + 32]
movaps    FPR4, [STR0 + GPR1 * 8 + 48]
movaps    [STR1 + GPR1 * 8], FPR1
movaps    [STR1 + GPR1 * 8 + 16], FPR2
movaps    [STR1 + GPR1 * 8 + 32], FPR3
movaps    [STR1 + GPR1 * 8 + 48], FPR4

The file consists of a header section and the actual loop kernel. The following header tags must be present (the order is arbitrary):

  • STREAMS: The number of streams the benchmark needs.
  • TYPE: Can be one of DOUBLE, SINGLE or INT.
  • FLOPS: How many flops the kernel executes for one scalar update.
  • BYTES: How many bytes need to be transferred per scalar update.

Everything else before the LOOP tag is taken as instruction code and placed before the actual loop code. Everything after LOOP is placed inside the loop kernel. The argument after LOOP indicates the stride of the loop, that is, how many updates are performed in one loop iteration.
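
The file layout described above can be illustrated with a small reader. This is not the generator LIKWID uses during the build; parse_ptt is a hypothetical helper that only follows the rules stated here (header tags, code before LOOP, code after LOOP).

```python
def parse_ptt(text):
    """Split a .ptt file into header tags, pre-loop code, loop stride and loop body."""
    header_tags = {"STREAMS", "TYPE", "FLOPS", "BYTES"}
    header, prologue, body = {}, [], []
    stride = None
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        if stride is not None:
            body.append(line.strip())        # everything after LOOP
        elif words[0] == "LOOP":
            stride = int(words[1])           # updates per loop iteration
        elif words[0] in header_tags:
            header[words[0]] = words[1]
        else:
            prologue.append(line.strip())    # code placed before the loop
    return header, prologue, stride, body

# Contents of copy.ptt as listed on this page.
copy_ptt = """\
STREAMS 2
TYPE DOUBLE
FLOPS 0
BYTES 16
LOOP 8
movaps    FPR1, [STR0 + GPR1 * 8]
movaps    FPR2, [STR0 + GPR1 * 8 + 16]
movaps    FPR3, [STR0 + GPR1 * 8 + 32]
movaps    FPR4, [STR0 + GPR1 * 8 + 48]
movaps    [STR1 + GPR1 * 8], FPR1
movaps    [STR1 + GPR1 * 8 + 16], FPR2
movaps    [STR1 + GPR1 * 8 + 32], FPR3
movaps    [STR1 + GPR1 * 8 + 48], FPR4
"""

header, prologue, stride, body = parse_ptt(copy_ptt)
print(header, stride, len(body))
```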

The BYTES parameter defines the number of bytes needed to perform a single scalar update operation. If you look at the C code of a copy benchmark it would look like this:

for (int GPR1 = 0; GPR1 < size; ++GPR1) {
    STR1[GPR1] = STR0[GPR1];
}

or in a more low level approach using a floating point register and accesses through dereferencing the pointer:

register double FPR1;
for (int GPR1_8 = 0; GPR1_8 < size; ++GPR1_8) {
    FPR1 = *(STR0 + GPR1_8);
    *(STR1 + GPR1_8) = FPR1;
}

In each iteration two double-precision values are handled, one is loaded and the other one stored. With a size of 8 Byte per double-precision value, this results in 16 Bytes per scalar loop iteration. In the high-level assembly language in the ptt files, one scalar update operation is:

movaps    FPR1, [STR0 + GPR1 * 8]
movaps    [STR1 + GPR1 * 8], FPR1

Don't get confused by unrolled loops in the ptt files: the BYTES and FLOPS entries specify the number of bytes and flops for the loop without unrolling.
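
A quick consistency check for the unrolled copy kernel, written as a Python sketch: each movaps moves one 128-bit SSE register (16 bytes), and the loop body of copy.ptt contains eight of them for a stride of 8 updates. The variable names are chosen for this example.

```python
MOVAPS_BYTES = 16   # movaps transfers one 128-bit register
stride = 8          # LOOP 8 in copy.ptt
bytes_tag = 16      # BYTES 16 in copy.ptt

# Unrolled loop body of copy.ptt: four loads followed by four stores.
loads = [f"movaps    FPR{i + 1}, [STR0 + GPR1 * 8 + {16 * i}]" for i in range(4)]
stores = [f"movaps    [STR1 + GPR1 * 8 + {16 * i}], FPR{i + 1}" for i in range(4)]
body = loads + stores

bytes_per_iteration = MOVAPS_BYTES * len(body)
bytes_per_update = bytes_per_iteration // stride

print(bytes_per_iteration, bytes_per_update)  # 128 16
```

The unrolled loop moves 128 bytes per iteration, but divided by the stride this is again the 16 bytes per scalar update given in the BYTES tag.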

The instruction code must be in Intel syntax, hence the source is the right argument and the destination the left one.

You can write plain x86-64 instruction code, but LIKWID provides some predefined labels to ease your job. The following list introduces all labels:

  • GPR1 - GPR16 : General-purpose registers
  • FPR1 - FPR16 : Floating-point registers
  • STR0 - STR10 : Registers with stream addresses
  • SCALAR : Double-precision constant
  • SSCALAR : Single-precision constant
  • ISCALAR : Integer constant

The loop counter is always placed in register GPR1!

Technically, the text files in the ptt format are converted into an intermediate high-level assembly format (PAS) and finally to assembly. Both intermediate files, the .pas file and the .s file, are present in the build directory (e.g. ./GCC). The intermediate assembly format makes it possible to provide different assembler backends, e.g. for masm. At the moment, however, there is only a backend for gas.

After recompiling, the benchmark code is generated and automatically included in likwid-bench.

Community Aspect

One idea behind likwid-bench, beyond rapid prototyping and benchmarking of loop kernels, is to provide a platform that helps to build up knowledge about which instruction code works well on different platforms for certain algorithms. For example, we want to provide learning packages with collections of micro benchmarks that show the influence of different instructions and implementation types on performance. Users can then easily share their implementations, and results can be compared across different processors. Typical targets for such packages are, for example:

  • Stencil kernels: Jacobi, Gauss Seidel
  • Stream and Full triad (4 vectors)
  • Add operations
  • basic data operations (load, store, copy)

Future Plans

  • Support more data access patterns apart from plain streams, e.g. multidimensional arrays for stencil kernels or CRS data formats for sparse matrix computations.
  • Provide more assembler backends.
  • Provide more example packages.
  • Provide a Perl script which generates a bandwidth map based on likwid-bench measurements.