Likwid Pin

Thomas Roehl edited this page Sep 6, 2022 · 8 revisions

likwid-pin: Tool to pin threaded applications without touching the source code

NOTICE

You can only use likwid-pin with dynamically linked applications whose threading implementation uses the pthread_create API call. Moreover, using it only makes sense with a static placement of threads, meaning that every thread runs on a dedicated processor. Since version 3.1 it is possible to oversubscribe processors by creating more threads than there are processors. likwid-pin then distributes the threads round robin over the processors you specify in your thread list.

Introduction

For threaded applications on modern multi-core platforms it is crucial to pin threads to dedicated cores. While the Linux kernel offers an API to pin threads, implementing a flexible affinity solution with it is tedious and requires some coding. Intel includes a sophisticated pinning mechanism in its OpenMP implementation. It already works quite well out of the box and can be further controlled with environment variables.

Still, there are occasions where a simple platform- and compiler-independent solution is required. Because all common OpenMP implementations rely on the pthread API, likwid-pin can preload a wrapper library for the pthread_create call. In this wrapper, the threads are pinned using the Linux OS API. likwid-pin can also be used to pin serial applications as a replacement for taskset. The idea is inspired by a tool available at http://www.mulder.franken.de/workstuff/ . likwid-pin explicitly supports pthread and the OpenMP implementations of Intel and GNU gcc. Other OpenMP implementations are supported by specifying a skip mask. The mask specifies which threads are skipped during pinning because they are used as shepherd threads and do no actual work.
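
To make the phrase "pinned using the Linux OS API" concrete, here is a minimal Python sketch of a thread restricting itself to one processor through sched_setaffinity. It is only an illustration of the kernel interface, not the way the likwid-pin wrapper library is implemented; the helper names pin_current_thread and worker are hypothetical.

```python
import os
import threading

def pin_current_thread(cpu):
    """Restrict the calling thread to a single CPU (Linux only)."""
    # With pid 0 the underlying sched_setaffinity(2) call applies to the
    # calling thread, so other threads of the process are not affected.
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

def worker(cpu, results, index):
    # Each worker pins itself first; real work would follow here.
    results[index] = pin_current_thread(cpu)

if __name__ == "__main__":
    # Only use CPUs this process is allowed to run on (at most four here).
    allowed = sorted(os.sched_getaffinity(0))[:4]
    results = [None] * len(allowed)
    threads = [threading.Thread(target=worker, args=(cpu, results, i))
               for i, cpu in enumerate(allowed)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for cpu, mask in zip(allowed, results):
        print(f"thread for CPU {cpu} now has affinity {sorted(mask)}")
```

Doing this by hand for every thread of every application is the coding effort that likwid-pin avoids.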

likwid-pin offers three different syntax flavors to specify how to pin threads to processors:

  1. Using a thread list
  2. Using an expression-based thread list
  3. Using a scatter policy

Processors are usually numbered by the Linux kernel; we refer to this ordering as physical numbering. LIKWID introduces thread groups throughout all tools to enable logical pinning. A thread group consists of the processors that share a topological entity on a node or chip. This may be a socket, a ccNUMA domain or a shared cache. likwid-pin supports six different ways of numbering the cores when using the thread group syntax:

  1. physical numbering: processors are numbered according to the numbering in the OS
  2. logical numbering in node: processors are logically numbered over the whole node (N prefix)
  3. logical numbering in socket: processors are logically numbered in every socket (S# prefix, e.g., S0)
  4. logical numbering in cache group: processors are logically numbered in the last level cache group (C# prefix, e.g., C1)
  5. logical numbering in memory domain: processors are logically numbered in the NUMA domain (M# prefix, e.g., M2)
  6. logical numbering within cpuset: processors are logically numbered inside the Linux cpuset (L prefix)

For all numberings apart from the first and the sixth, physical cores come first. For example, on a node with two sockets of 4 cores each and 2 SMT threads per core, -c N:0-7 selects all physical cores. To also use the SMT threads, use N:0-15.

Since version 3.1 LIKWID also supports an alternative expression-based syntax. An expression-based thread list uses compact ordering, so the processors are ordered consecutively with regard to SMT threads.

likwid-pin can also set the NUMA memory policy to interleave. Because likwid-pin can figure out all memory domains involved in your run, it automatically configures interleaving for all NUMA nodes used.

likwid-pin sets the environment variable OMP_NUM_THREADS for you if it is not already present in your environment. It is set to the number of threads in your pin expression. Moreover, the environment variable CILK_NWORKERS is set to the number of threads in the pin expression.

likwid-pin always sets KMP_AFFINITY to disabled to avoid interference with other pinning mechanisms. In LIKWID 4.2.1, OMP_PLACES, GOMP_CPU_AFFINITY and OMP_PROC_BIND are also unset if they were set before.
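
The OpenMP-related environment handling described above can be summarized in a short Python sketch. It operates on a plain dictionary and only models the rules stated in the text; the function name prepare_environment is hypothetical and the Cilk variable is left out for brevity.

```python
def prepare_environment(env, num_threads):
    """Return a copy of env adjusted as described in the text (a model only)."""
    new_env = dict(env)
    # OMP_NUM_THREADS is only set when the user has not chosen a value.
    new_env.setdefault("OMP_NUM_THREADS", str(num_threads))
    # Intel's own pinning is always switched off to avoid interference.
    new_env["KMP_AFFINITY"] = "disabled"
    # Other affinity settings are removed if they were set before.
    for name in ("OMP_PLACES", "GOMP_CPU_AFFINITY", "OMP_PROC_BIND"):
        new_env.pop(name, None)
    return new_env

if __name__ == "__main__":
    user_env = {"OMP_PROC_BIND": "close", "KMP_AFFINITY": "compact"}
    # Four threads, as in a pin expression such as -c 0,2,4,6.
    print(prepare_environment(user_env, 4))
```

Note that a user-provided OMP_NUM_THREADS is kept, so it can differ from the number of processors in the pin expression.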

Options

-h, --help		 Help message
-v, --version		 Version information
-V, --verbose <level>	 Verbose output, 0 (only errors), 1 (info), 2 (details), 3 (developer)
-i			 Set NUMA interleave policy with all involved numa nodes
-m			 Set NUMA membind policy with all involved numa nodes
-S, --sweep		 Sweep memory and LLC of involved NUMA nodes
-c <list>		 Comma separated processor IDs or expression
-s, --skip <hex>	 Bitmask with threads to skip
-p			 Print available domains with mapping on physical IDs
			 If used together with the -c option, outputs the physical processor IDs.
-d <string>		 Delimiter used when -p outputs a physical processor list, default is comma.
-q, --quiet		 Silent without output

Usage

As usual you can get a short help message with

$ likwid-pin -h

For a pthread application, type (in this example with 5 threads):

$ likwid-pin -c 0,2,4-6  ./myApp parameters

With pthreads it is important to also include the process itself in your processor list, because the process can also be used as a worker. You can now also omit the -c option; likwid-pin then automatically uses -c N:0-maxProcessors.

For a gcc OpenMP application this is the same. If you do not set OMP_NUM_THREADS, likwid-pin sets it to the number of threads specified in your pinning expression.

$ likwid-pin -c 0,2,4,6  ./myApp parameters

With logical numbering this may translate to:

$ likwid-pin -c N:0-3  ./myApp parameters

or:

$ likwid-pin  -c S0:0-3  ./myApp parameters

If you want the ccNUMA domains your threads run in to be cleaned up before your code starts, as likwid-memsweeper does, use the -S flag:

$ likwid-pin -S -c S0:0-3  ./myApp parameters

You can use multiple thread domains in a logical processor list, separated by @:

$ likwid-pin -c S0:0-3@S3:4-7  ./myApp parameters

To print the available thread domains, use the -p option (the output below is for a four-socket Nehalem EX machine). In this example, socket, last level cache group and memory domain are equivalent:

$ likwid-pin -p
Domain 0:
        Tag N: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
Domain 1:
        Tag S0: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
Domain 2:
        Tag S1: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
Domain 3:
        Tag S2: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
Domain 4:
        Tag S3: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
Domain 5:
        Tag C0: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
Domain 6:
        Tag C1: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
Domain 7:
        Tag C2: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
Domain 8:
        Tag C3: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
Domain 9:
        Tag M0: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
Domain 10:
        Tag M1: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
Domain 11:
        Tag M2: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
Domain 12:
        Tag M3: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
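
The listing above can be read as a lookup table: a logical ID is a position in the list of a domain. The following Python sketch resolves a logical expression such as S0:0-3@S3:4-7 against the socket domains of this listing. It is a simplified model under that assumption, not LIKWID's parser, and the names socket, parse_indices and resolve are hypothetical.

```python
def socket(first_core, first_smt):
    """Build one socket list as printed above: 8 cores, then 8 SMT siblings."""
    return (list(range(first_core, first_core + 8))
            + list(range(first_smt, first_smt + 8)))

# Socket domains S0..S3 as shown in the likwid-pin -p output above.
DOMAINS = {f"S{i}": socket(8 * i, 32 + 8 * i) for i in range(4)}

def parse_indices(spec):
    """Expand '0-3' or '0,2,4-6' into a list of integers."""
    result = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            result.extend(range(lo, hi + 1))
        else:
            result.append(int(part))
    return result

def resolve(expression, domains=DOMAINS):
    """Translate e.g. 'S0:0-3@S3:4-7' into physical processor IDs."""
    physical = []
    for term in expression.split("@"):
        tag, spec = term.split(":")
        members = domains[tag]
        # A logical ID is an index into the domain's processor list.
        physical.extend(members[i] for i in parse_indices(spec))
    return physical

if __name__ == "__main__":
    print(resolve("S0:0-3@S3:4-7"))  # [0, 1, 2, 3, 28, 29, 30, 31]
```

Because every list starts with the physical cores, low logical IDs never land on SMT siblings.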

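With LIKWID 5.2.0, there are additional domains for each CPU die using the notation D0, D1, ...

The following run uses the -S flag to sweep the memory of the two sockets involved before the benchmark starts: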

$ likwid-pin  -c S0:0@S3:0 -S ./stream-icc

Sweeping memory
Sweeping domain 0: Using 104849 MB of 131062 MB
Cleaning LLC with 50 MB
Sweeping domain 3: Using 104858 MB of 131072 MB
Cleaning LLC with 50 MB

Starting from version 3.1 likwid-pin also supports thread expressions.

Expression-based thread lists are generated with compact processor numbering. For example, likwid-pin -c E:N:8 ./myApp generates a compact thread-to-processor mapping for the node domain with eight threads. The following syntax variants are available:

  1. -c E:<thread domain>:<number of threads>
  2. -c E:<thread domain>:<number of threads>:<chunk size>:<stride>

For two SMT threads per core on an SMT 4 machine use, e.g., -c E:N:122:2:4

The simplest way to use the expression based syntax is:

$ likwid-pin -c E:S0:4  ./myApp parameters

This will use 4 processors within the socket 0 thread domain. Remember that the ordering is compact: if the processor has 2-way SMT, the 4 threads are placed on the first two physical cores.

Optionally you may specify a block size and stride:

$ likwid-pin -c E:S0:8:1:2  ./myApp parameters

On a 2-way SMT system this is equivalent to -c S0:0-7: eight threads with a block size of one and a stride (from start of block to start of block) of two, so only the first SMT thread of each core is used. This is especially handy on systems with 4-way SMT. Consider an Intel Xeon Phi on which you want to use 2 SMT threads per physical core on only 30 cores, resulting in 60 threads. This can easily be achieved with:

$ likwid-pin -c E:N:60:2:4  ./myApp parameters

Or consider an AMD Bulldozer system on which you want to use only one core per FPU:

$ likwid-pin -c E:S0:4:1:2  ./myApp parameters
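
The effect of the chunk size and stride can be checked with a small Python model. It assumes a compact list in which all SMT threads of a core are adjacent and labels hardware threads as (core, smt) pairs instead of OS processor IDs. The function names compact_list and expand are hypothetical, and wrap-around for oversubscription is not modeled.

```python
def compact_list(num_cores, smt):
    """Hardware threads in compact order: all SMT threads of core 0 first."""
    return [(core, t) for core in range(num_cores) for t in range(smt)]

def expand(hw_threads, count, chunk=1, stride=1):
    """Select count entries: chunk consecutive ones, then jump by stride."""
    selected = []
    start = 0
    while len(selected) < count and start < len(hw_threads):
        for entry in hw_threads[start:start + chunk]:
            if len(selected) < count:
                selected.append(entry)
        start += stride  # measured from start of block to start of block
    return selected

if __name__ == "__main__":
    two_way = compact_list(num_cores=8, smt=2)
    # Like E:S0:4, four threads land on the first two cores.
    print(expand(two_way, 4))
    # Like E:S0:8:1:2, one thread on each of the eight cores.
    print(expand(two_way, 8, chunk=1, stride=2))
    # Like E:N:60:2:4 on a 4-way SMT machine, two threads on each of 30 cores.
    four_way = compact_list(num_cores=60, smt=4)
    print(len({core for core, _ in expand(four_way, 60, chunk=2, stride=4)}))
```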

You may also chain expressions using the following syntax:

$ likwid-pin -c E:S0:20:2:4@S1:4:1:2  ./myApp parameters

Another option is to use pinning policies across a thread domain type. The general syntax is <domainType>:<policy>(:<numThreads>). If <numThreads> is not given, all available hardware threads are used. For example, likwid-pin -c M:scatter ./myApp generates a thread-to-processor mapping scattered among all memory domains with physical cores first. Other policies are balanced and cbalanced.
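
One way to picture the scatter policy is as an interleaving of the domain lists: one processor from each domain in turn. The Python sketch below applies this to the memory domains M0 to M3 of the four-socket listing shown earlier. It is a model of the described behavior under that assumption, not LIKWID's implementation, and the name scatter is hypothetical.

```python
from itertools import zip_longest

def scatter(domains, num_threads=None):
    """Interleave the domain lists: one entry from each domain in turn."""
    order = [cpu
             for group in zip_longest(*domains)
             for cpu in group
             if cpu is not None]  # domains may differ in length
    return order if num_threads is None else order[:num_threads]

# Memory domains M0..M3 of the four-socket example: 8 cores, then 8 SMT siblings.
MEMORY_DOMAINS = [
    list(range(8 * i, 8 * i + 8)) + list(range(32 + 8 * i, 40 + 8 * i))
    for i in range(4)
]

if __name__ == "__main__":
    print(scatter(MEMORY_DOMAINS, 8))  # [0, 8, 16, 24, 1, 9, 17, 25]
```

Since each domain list starts with its physical cores, the interleaved result uses the physical cores of all domains before any SMT sibling.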

You can also use likwid-pin to convert logical thread expressions into physical processor lists. This may be handy for other tools which do not support logical processor IDs. Optionally you can specify a custom delimiter for this list with the -d option.

Since version 3.1 oversubscription is allowed by reusing the thread list you provided. If an overflow occurs, this is indicated in the output.

Important notice

With version 4.2.1 the shepherd threads are detected automatically. Many OpenMP runtime versions were tested, and the only one for which detection failed was the runtime of the Intel C/C++ compiler 11.0/11.1. If you want to use likwid-pin with older OpenMP runtimes, you might have to skip the shepherd threads manually by setting a skip mask with the -s command line option.

# Example with Intel C/C++ compiler 11.1
$ likwid-pin -c 3,4,5,6 -s 0x0 a.out
[pthread wrapper]
[pthread wrapper] MAIN -> 3
[pthread wrapper] PIN_MASK: 0->4  1->5  2->6
[pthread wrapper] SKIP MASK: 0x0
    threadid 140177980745472 -> core 4 - OK
    threadid 140177980479232 -> core 5 - OK
    threadid 140177976280832 -> core 6 - OK
Roundrobin placement triggered
    threadid 140177972082432 -> core 3 - OK
likwid-pin(35974)---a.out(35978)-+-pstree(35995)
                                         |-{a.out}(35982)
                                         |-{a.out}(35986)
                                         |-{a.out}(35990)
                                         `-{a.out}(35994)
Thread 0 of 4 threads says: Hello from CPU 3 on host host1! - 35978 - 35978
Thread 1 of 4 threads says: Hello from CPU 5 on host host1! - 35978 - 35986
Thread 2 of 4 threads says: Hello from CPU 6 on host host1! - 35978 - 35990
Thread 3 of 4 threads says: Hello from CPU 3 on host host1! - 35978 - 35994

Every time you see the message Roundrobin placement triggered, more threads were started than CPUs were specified. These additional threads are commonly shepherd threads. In the output you can see that threads 0 and 3 are both scheduled to CPU 3. Comparing the output of pstree with the thread IDs (TID, the last number in the hello lines) shows that the thread with TID 35982 does not say hello because it is a shepherd thread. It is the first thread started, so a skip mask of 0x1 skips it:

$ likwid-pin -c 3,4,5,6 -s 0x1 a.out
[pthread wrapper]
[pthread wrapper] MAIN -> 3
[pthread wrapper] PIN_MASK: 0->4  1->5  2->6
[pthread wrapper] SKIP MASK: 0x1
    threadid 140457091475200 -> SKIP
    threadid 140457091208960 -> core 4 - OK
    threadid 140457087010560 -> core 5 - OK
    threadid 140457082812160 -> core 6 - OK
likwid-pin(32439)---a.out(32443)-+-pstree(32460)
                                         |-{a.out}(32447)
                                         |-{a.out}(32451)
                                         |-{a.out}(32455)
                                         `-{a.out}(32459)
Thread 0 of 4 threads says: Hello from CPU 3 on host host1! - 32443 - 32443
Thread 2 of 4 threads says: Hello from CPU 5 on host host1! - 32443 - 32455
Thread 1 of 4 threads says: Hello from CPU 4 on host host1! - 32443 - 32451
Thread 3 of 4 threads says: Hello from CPU 6 on host host1! - 32443 - 32459
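
The placement seen in the two runs above can be reproduced with a small Python model. The rules are inferred from the printed output only: the process takes the first entry of the list, each created thread takes the next entry, the list wraps around when it is exhausted, and a thread whose bit is set in the skip mask is left unpinned without consuming an entry. The function name place_threads is hypothetical.

```python
def place_threads(cpus, num_created, skip_mask=0x0):
    """Return (CPU of the main process, placements of the created threads)."""
    main_cpu = cpus[0]   # the process itself takes the first entry
    placements = []
    next_slot = 1        # created threads continue with the second entry
    for i in range(num_created):
        if skip_mask & (1 << i):
            # Bit i set: thread i is skipped and uses no entry of the list.
            placements.append(None)
            continue
        # The modulo wraps around the list, which is the round robin case.
        placements.append(cpus[next_slot % len(cpus)])
        next_slot += 1
    return main_cpu, placements

if __name__ == "__main__":
    # Skip mask 0x0: the fourth created thread wraps around to CPU 3.
    print(place_threads([3, 4, 5, 6], 4, 0x0))  # (3, [4, 5, 6, 3])
    # Skip mask 0x1: the first created thread is skipped.
    print(place_threads([3, 4, 5, 6], 4, 0x1))  # (3, [None, 4, 5, 6])
```

The bit position in the mask corresponds to the creation order of the threads, so 0x2 would skip the second created thread.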

Example

Example output for an OpenMP threaded STREAM benchmark:

$ likwid-pin  -c 0-3  ./STREAM_OMP-WOODY
[likwid-pin] Main PID -> core 0 - OK
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 6000000, Offset = 0
Total memory required = 137.3 MB.
Each test is run 10 times, but only
the **best** time for each is used.
-------------------------------------------------------------
[wrapper](pthread) [wrapper](pthread) PIN_MASK: 0->1  1->2  2->3
[wrapper](pthread) SKIP MASK: 0x2
[wrapper 0](pthread) Notice: Using libpthread.so.0
        threadid 47223170505040 -> core 1 - OK
[wrapper 1](pthread) Notice: Using libpthread.so.0
        threadid 47223174703440 -> SKIP
[wrapper 2](pthread) Notice: Using libpthread.so.0
        threadid 47223178901840 -> core 2 - OK
[wrapper 3](pthread) Notice: Using libpthread.so.0
        threadid 47223183100240 -> core 3 - OK
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 70298 microseconds.
   (= 35149 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        7034.1035       0.0137       0.0136       0.0137
Scale:       7087.4672       0.0138       0.0135       0.0154
Add:         7147.0976       0.0207       0.0201       0.0219
Triad:       7186.9842       0.0207       0.0200       0.0227
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------